Message ID | 20241216080206.2850773-2-ming.lei@redhat.com
---|---
State | New, archived
Series | block: fix deadlock caused by atomic limits update
On Mon, Dec 16, 2024 at 04:02:03PM +0800, Ming Lei wrote:
> More importantly, queue_limits_start_update() returns one local copy of
> q->limits, then the API user overwrites the local copy, and commit the
> copy by queue_limits_commit_update() finally.
>
> So holding q->limits_lock protects nothing for the local copy, and not see
> any real help by preventing new update & commit from happening, cause
> what matters is that we do validation & commit atomically.

It protects against someone else changing the copy in the queue while the
current thread is updating the local copy, and thus incoherent updates.
So no, we can't just remove it.
On Mon, Dec 16, 2024 at 04:49:01PM +0100, Christoph Hellwig wrote:
> On Mon, Dec 16, 2024 at 04:02:03PM +0800, Ming Lei wrote:
> [...]
>
> It protects against someone else changing the copy in the queue while
> the current thread is updating the local copy, and thus incoherent
> updates. So no, we can't just remove it.

The local copy can be updated in any way with any data, so does another
concurrent update on q->limits really matter?

thanks,
Ming
On Tue, Dec 17, 2024 at 09:52:51AM +0800, Ming Lei wrote:
> The local copy can be updated in any way with any data, so does another
> concurrent update on q->limits really matter?

Yes, because that means one of the updates gets lost even if it is for
entirely separate fields.
On Tue, Dec 17, 2024 at 05:40:56AM +0100, Christoph Hellwig wrote:
> On Tue, Dec 17, 2024 at 09:52:51AM +0800, Ming Lei wrote:
> > The local copy can be updated in any way with any data, so does another
> > concurrent update on q->limits really matter?
>
> Yes, because that means one of the updates get lost even if it is
> for entirely separate fields.

Right, but the limits are still valid anytime.

Any suggestion for fixing this deadlock?

One idea is to add queue_limits_start_try_update() and apply it in sysfs
->store(), in which update failure could be tolerable.

Thanks,
Ming
On Tue, Dec 17, 2024 at 03:05:48PM +0800, Ming Lei wrote:
> [...]
>
> Right, but the limits are still valid anytime.
>
> Any suggestion for fixing this deadlock?

What is "this deadlock"?
On Tue, Dec 17, 2024 at 08:19:28AM +0100, Christoph Hellwig wrote:
> On Tue, Dec 17, 2024 at 03:05:48PM +0800, Ming Lei wrote:
> > [...]
> >
> > Any suggestion for fixing this deadlock?
>
> What is "this deadlock"?

The commit log provides two reports:

- lockdep warning

  https://lore.kernel.org/linux-block/Z1A8fai9_fQFhs1s@hovoldconsulting.com/

- real deadlock report

  https://lore.kernel.org/linux-scsi/ZxG38G9BuFdBpBHZ@fedora/

It is actually one simple ABBA lock:

1) queue_attr_store()

	blk_mq_freeze_queue(q);			//queue freeze lock
	res = entry->store(disk, page, length);
		queue_limits_start_update	//->limits_lock
		...
		queue_limits_commit_update
	blk_mq_unfreeze_queue(q);

2) sd_revalidate_disk()

	queue_limits_start_update		//->limits_lock
	sd_read_capacity()
		scsi_execute_cmd
			scsi_alloc_request
				blk_queue_enter	//queue freeze lock

Thanks,
Ming
On 2024/12/16 23:30, Ming Lei wrote:
> [...]
>
> It is actually one simple ABBA lock:
>
> 1) queue_attr_store()
>
> 	blk_mq_freeze_queue(q);			//queue freeze lock
> 	res = entry->store(disk, page, length);
> 		queue_limits_start_update	//->limits_lock
> 		...
> 		queue_limits_commit_update
> 	blk_mq_unfreeze_queue(q);

The locking + freeze pattern should be:

	lim = queue_limits_start_update(q);
	...
	blk_mq_freeze_queue(q);
	ret = queue_limits_commit_update(q, &lim);
	blk_mq_unfreeze_queue(q);

This pattern is used in most places and anything that does not use it is
likely susceptible to a similar ABBA deadlock. We should probably look into
trying to integrate the freeze/unfreeze calls directly into
queue_limits_commit_update().

Fixing queue_attr_store() to use this pattern seems simpler than trying to
fix sd_revalidate_disk().

> 2) sd_revalidate_disk()
>
> 	queue_limits_start_update		//->limits_lock
> 	sd_read_capacity()
> 		scsi_execute_cmd
> 			scsi_alloc_request
> 				blk_queue_enter	//queue freeze lock
On Tue, Dec 17, 2024 at 08:07:06AM -0800, Damien Le Moal wrote:
> [...]
>
> The locking + freeze pattern should be:
>
> 	lim = queue_limits_start_update(q);
> 	...
> 	blk_mq_freeze_queue(q);
> 	ret = queue_limits_commit_update(q, &lim);
> 	blk_mq_unfreeze_queue(q);
>
> This pattern is used in most places and anything that does not use it is
> likely susceptible to a similar ABBA deadlock. We should probably look into
> trying to integrate the freeze/unfreeze calls directly into
> queue_limits_commit_update().
>
> Fixing queue_attr_store() to use this pattern seems simpler than trying to
> fix sd_revalidate_disk().

This way looks good, just commit af2814149883 ("block: freeze the queue in
queue_attr_store") needs to be reverted, and freeze/unfreeze has to be
added to each queue attribute .store() handler.

Thanks,
Ming
On 12/18/24 07:39, Ming Lei wrote:
> [...]
>
> This way looks good, just commit af2814149883 ("block: freeze the queue in
> queue_attr_store") needs to be reverted, and freeze/unfreeze has to be
> added to each queue attribute .store() handler.

Wouldn't it be feasible to add blk-mq freeze in queue_limits_start_update()
and blk-mq unfreeze in queue_limits_commit_update()? If we do this then the
pattern would be:

	queue_limits_start_update():  limit-lock + freeze
	queue_limits_commit_update(): unfreeze + limit-unlock

Then in queue_attr_store() we shall just remove freeze/unfreeze.

We also need to fix a few call sites where we have this code block:

	{
		blk_mq_freeze_queue()
		...
		queue_limits_start_update()
		...
		queue_limits_commit_update()
		...
		blk_mq_unfreeze_queue()
	}

In the above code block, we may then replace blk_mq_freeze_queue() with
queue_limits_start_update() and similarly replace blk_mq_unfreeze_queue()
with queue_limits_commit_update():

	{
		queue_limits_start_update()
		...
		queue_limits_commit_update()
	}

Thanks,
--Nilay
On Wed, Dec 18, 2024 at 05:03:00PM +0530, Nilay Shroff wrote:
> [...]
>
> Wouldn't it be feasible to add blk-mq freeze in queue_limits_start_update()
> and blk-mq unfreeze in queue_limits_commit_update()? If we do this then the
> pattern would be,
>
> 	queue_limits_start_update():  limit-lock + freeze
> 	queue_limits_commit_update(): unfreeze + limit-unlock
>
> Then in queue_attr_store() we shall just remove freeze/unfreeze.
>
> [...]

In sd_revalidate_disk(), a blk-mq request is allocated under
queue_limits_start_update(), so the ABBA deadlock is still triggered:
blk_queue_enter() takes the same lock (the freeze lock) as
blk_mq_freeze_queue().

Thanks,
Ming
On 12/18/24 19:10, Ming Lei wrote:
> [...]
>
> In sd_revalidate_disk(), blk-mq request is allocated under
> queue_limits_start_update(), then ABBA deadlock is triggered since
> blk_queue_enter() implies same lock(freeze lock) from blk_mq_freeze_queue().

Yeah agreed, but I see sd_revalidate_disk() is probably the only exception
which allocates a blk-mq request. Can't we fix it?

One simple way I could think of would be updating the queue_limit_xxx() APIs
and then using them:

	struct queue_limits queue_limits_start_update(struct request_queue *q,
			bool freeze)
	{
		...
		mutex_lock(&q->limits_lock);
		if (freeze)
			blk_mq_freeze_queue(q);
		...
	}

	int queue_limits_commit_update(struct request_queue *q,
			struct queue_limits *lim, bool unfreeze)
	{
		...
		if (unfreeze)
			blk_mq_unfreeze_queue(q);
		mutex_unlock(&q->limits_lock);
	}

	sd_revalidate_disk()
	{
		...
		queue_limits_start_update(q, false);
		...
		blk_mq_freeze_queue(q);
		queue_limits_commit_update(q, lim, true);
	}

Thanks,
--Nilay
On 2024/12/18 6:05, Nilay Shroff wrote:
> [...]
>
> Yeah agreed but I see sd_revalidate_disk() is probably the only exception
> which allocates the blk-mq request. Can't we fix it?

If we change where limits_lock is taken now, we will again introduce races
between user config and discovery/revalidation, which is what
queue_limits_start_update() and queue_limits_commit_update() were intended
to fix in the first place.

So changing sd_revalidate_disk() is not the right approach.

> One simple way I could think of would be updating queue_limit_xxx() APIs
> and then use it,
>
> [...]

This is overly complicated... As I suggested, I think that a simpler
approach is to call blk_mq_freeze_queue() and blk_mq_unfreeze_queue() inside
queue_limits_commit_update(). Doing so, no driver should need to directly
call freeze/unfreeze. But that would be a cleanup. Let's first fix the few
instances that have the update/freeze order wrong. As mentioned, the pattern
simply needs to be:

	queue_limits_start_update();
	...
	blk_mq_freeze_queue();
	queue_limits_commit_update();
	blk_mq_unfreeze_queue();

I fixed the blk-zoned code like this last week. But I had not noticed that
other places had a similar issue as well.
On Wed, Dec 18, 2024 at 10:09:24AM +0800, Ming Lei wrote:
> This way looks good, just commit af2814149883 ("block: freeze the queue in
> queue_attr_store") needs to be reverted, and freeze/unfreeze has to be
> added to each queue attribute .store() handler.

Actually the sysfs handlers need a lot more work, including sorting out the
sysfs_lock mess. Gimme a little time to work on it.
On Wed, Dec 18, 2024 at 07:35:36PM +0530, Nilay Shroff wrote:
> One simple way I could think of would be updating queue_limit_xxx() APIs
> and then use it,
>
> 	queue_limits_start_update(struct request_queue *q, bool freeze)
> 	{
> 		...
> 		mutex_lock(&q->limits_lock);
> 		if (freeze)
> 			blk_mq_freeze_queue(q);
> 		...

Conditional locking (and the freeze sort of is a lock) is always a bad idea.
But we can do a lower-level no-freeze API for this use case.
On Wed, Dec 18, 2024 at 06:57:45AM -0800, Damien Le Moal wrote:
> > Yeah agreed but I see sd_revalidate_disk() is probably the only exception
> > which allocates the blk-mq request. Can't we fix it?
>
> If we change where limits_lock is taken now, we will again introduce races
> between user config and discovery/revalidation, which is what
> queue_limits_start_update() and queue_limits_commit_update() intended to
> fix in the first place.
>
> So changing sd_revalidate_disk() is not the right approach.

Well, sd_revalidate_disk is a bit special in that it needs a command on the
same queue to query the information. So it needs to be able to issue
commands without the queue frozen. Freezing the queue inside the limits
lock supports that; sd just can't use the convenience helpers that lock and
freeze.

> This is overly complicated ... As I suggested, I think that a simpler
> approach is to call blk_mq_freeze_queue() and blk_mq_unfreeze_queue()
> inside queue_limits_commit_update(). Doing so, no driver should need to
> directly call freeze/unfreeze. But that would be a cleanup. Let's first
> fix the few instances that have the update/freeze order wrong.

Yes, the queue only needs to be frozen for the actual update, which would
remove the need for the locking. The big question for both variants is
whether we can get rid of all the callers that have the queue already
frozen and then start an update.
On 12/19/24 11:50, Christoph Hellwig wrote:
> [...]
>
> Yes, the queue only needs to be frozen for the actual update, which would
> remove the need for the locking. The big question for both variants is if
> we can get rid of all the callers that have the queue already frozen and
> then start an update.

Yes, agreed that in most cases we only need the queue to be frozen while
committing the update. However, we do have a few call sites (in the nvme
driver) where I see we freeze the queue before actually starting the update,
and looking at those call sites it seems that we probably do require
freezing the queue. One example from the NVMe driver:

	nvme_update_ns_info_block()
	{
		...
		blk_mq_freeze_queue(ns->disk->queue);
		ns->head->lba_shift = id->lbaf[lbaf].ds;
		ns->head->nuse = le64_to_cpu(id->nuse);
		capacity = nvme_lba_to_sect(ns->head, le64_to_cpu(id->nsze));
		lim = queue_limits_start_update(ns->disk->queue);
		...
		queue_limits_commit_update();
		...
		set_capacity_and_notify(ns->disk, capacity);
		...
		set_disk_ro(ns->disk, nvme_ns_is_readonly(ns, info));
		set_bit(NVME_NS_READY, &ns->flags);
		blk_mq_unfreeze_queue(ns->disk->queue);
		...
	}

So looking at the above example, I earlier proposed freezing the queue in
queue_limits_start_update() and then unfreezing it in
queue_limits_commit_update(). In the above code we could then replace
blk_mq_freeze_queue() with queue_limits_start_update() and
blk_mq_unfreeze_queue() with queue_limits_commit_update(), and get rid of
the original call sites of the start/commit update APIs.

Having said that, I am open to any other, better suggestions. One
suggestion, from Damien, is calling blk_mq_freeze_queue() and
blk_mq_unfreeze_queue() inside queue_limits_commit_update(), but then I
wonder how we would fix call sites like the one shown above with that
approach.

Thanks,
--Nilay
On 12/19/24 11:50, Christoph Hellwig wrote:
> [...]
>
> Yes, the queue only needs to be frozen for the actual update, which would
> remove the need for the locking. The big question for both variants is if
> we can get rid of all the callers that have the queue already frozen and
> then start an update.

After thinking for a while I found that broadly we have four categories of
users which need this pattern of limits-lock and/or queue-freeze:

1. Callers which need to acquire the limits-lock while starting the update,
   and freeze the queue only when committing the update:
   - sd_revalidate_disk
   - nvme_init_identify
   - loop_clear_limits
   - few more...

2. Callers which need both freezing the queue and acquiring the limits-lock
   while starting the update:
   - nvme_update_ns_info_block
   - nvme_update_ns_info_generic
   - few more...

3. Callers which neither need to acquire the limits-lock nor require
   freezing the queue, because for these callers the limits-lock is already
   acquired and the queue is already frozen higher in the call stack:
   - __blk_mq_update_nr_hw_queues
   - queue_xxx_store and helpers

4. Callers which only need to acquire the limits-lock; freezing the queue
   may not be needed for such callers even while committing the update:
   - scsi_add_lun
   - nvme_init_identify
   - few more...

IMO, we may convert category #4 users into category #1, as it may not harm
to momentarily freeze the queue while committing the update. Then, for each
of the above categories we may define the helpers shown below:

	// For category 3:
	static inline struct queue_limits get_queue_limits(struct request_queue *q)
	{
		return q->limits;
	}

	int set_queue_limits(struct request_queue *q, struct queue_limits *lim)
	{
		int error;

		error = blk_validate_limits(lim);
		...
		q->limits = *lim;
		if (q->disk)
			blk_apply_bdi_limits(q->disk->bdi, lim);
		return error;
	}

	// For category 1:
	static inline struct queue_limits
	__queue_limits_start_update(struct request_queue *q)
	{
		mutex_lock(&q->limits_lock);
		return q->limits;
	}

	int __queue_limits_commit_update(struct request_queue *q,
			struct queue_limits *lim)
	{
		int error;

		blk_mq_freeze_queue(q);
		error = set_queue_limits(q, lim);
		blk_mq_unfreeze_queue(q);
		mutex_unlock(&q->limits_lock);
		return error;
	}

	// For category 2:
	static inline struct queue_limits
	queue_limits_start_update(struct request_queue *q)
	{
		mutex_lock(&q->limits_lock);
		blk_mq_freeze_queue(q);
		return q->limits;
	}

	int queue_limits_commit_update(struct request_queue *q,
			struct queue_limits *lim)
	{
		int error;

		error = set_queue_limits(q, lim);
		blk_mq_unfreeze_queue(q);
		mutex_unlock(&q->limits_lock);
		return error;
	}

With the above helpers, I updated each caller based on which category it
fits in. For reference, I attached the full diff. With this change, I ran
blktests to ensure that we don't see any lockdep splat or failures.

Thanks,
--Nilay
On Sat, Dec 21, 2024 at 06:33:13PM +0530, Nilay Shroff wrote: > > > On 12/19/24 11:50, Christoph Hellwig wrote: > > On Wed, Dec 18, 2024 at 06:57:45AM -0800, Damien Le Moal wrote: > >>> Yeah agreed but I see sd_revalidate_disk() is probably the only exception > >>> which allocates the blk-mq request. Can't we fix it? > >> > >> If we change where limits_lock is taken now, we will again introduce races > >> between user config and discovery/revalidation, which is what > >> queue_limits_start_update() and queue_limits_commit_update() intended to fix in > >> the first place. > >> > >> So changing sd_revalidate_disk() is not the right approach. > > > > Well, sd_revalidate_disk is a bit special in that it needs a command > > on the same queue to query the information. So it needs to be able > > to issue commands without the queue frozen. Freezing the queue inside > > the limits lock support that, sd just can't use the convenience helpers > > that lock and freeze. > > > >> This is overly complicated ... As I suggested, I think that a simpler approach > >> is to call blk_mq_freeze_queue() and blk_mq_unfreeze_queue() inside > >> queue_limits_commit_update(). Doing so, no driver should need to directly call > >> freeze/unfreeze. But that would be a cleanup. Let's first fix the few instances > >> that have the update/freeze order wrong. As mentioned, the pattern simply needs > > > > Yes, the queue only needs to be frozen for the actual update, > > which would remove the need for the locking. The big question for both > > variants is if we can get rid of all the callers that have the queue > > already frozen and then start an update. > > > After thinking for a while I found that broadly we've four categories of users > which need this pattern of limits-lock and/or queue-freeze: > > 1. 
> Callers which need acquiring limits-lock while starting the update; and freezing
> queue only when committing the update:
> - sd_revalidate_disk

sd_revalidate_disk() should be the strangest one: passthrough IO commands
are required in it, so a dependency on the queue freeze can't be added to a
lock such as q->limits_lock.

Actually the current queue limits structure isn't well organized, otherwise
the limits lock wouldn't be needed for reading queue limits from hardware,
since sd_revalidate_disk() just overwrites partial limits. Or it could be
done by refactoring sd_revalidate_disk(). However, the change might be a
little big, and I guess that is the reason why Damien doesn't like it.

> - nvme_init_identify
> - loop_clear_limits
> - few more...
> 
> 2. Callers which need both freezing the queue and acquiring limits-lock while starting
> the update:
> - nvme_update_ns_info_block
> - nvme_update_ns_info_generic
> - few more...
> 
> 3. Callers which neither need acquiring limits-lock nor require freezing queue as for
> these set of callers in the call stack limits-lock is already acquired and queue is
> already frozen:
> - __blk_mq_update_nr_hw_queues
> - queue_xxx_store and helpers

I don't think that is correct.

The queue limits are applied in the fast IO path, so in theory anywhere
that updates q->limits needs to drain IOs in the submission path, at least
after the gendisk is added.

Also, Christoph added the limits lock to avoid losing other concurrent
updates, which makes the problem harder to solve.

Thanks,
Ming
On 2024/12/30 18:02, Ming Lei wrote:
>> 1. Callers which need acquiring limits-lock while starting the update; and freezing
>> queue only when committing the update:
>> - sd_revalidate_disk
> 
> sd_revalidate_disk() should be the most strange one, in which
> passthrough io command is required, so dependency on queue freeze lock
> can't be added, such as, q->limits_lock
> 
> Actually the current queue limits structure aren't well-organized, otherwise
> limit lock isn't needed for reading queue limits from hardware, since
> sd_revalidate_disk() just overwrites partial limits. Or it can be
> done by refactoring sd_revalidate_disk(). However, the change might
> be a little big, and I guess that is the reason why Damien don't like
> it.

That was not the reason, but yes, modifying sd_revalidate_disk() is not
without risks of introducing regressions. The reason I proposed to simply
move the queue freeze around or inside queue_limits_commit_update() is that:

1) It is the right thing to do, as that is the only place where it is
   actually needed to avoid losing concurrent limits changes.

2) It clarifies the locking order between queue freeze and the limits lock.

3) The current issues should mostly all be solved with some refactoring of
   the ->store() calls in blk-sysfs.c, resolving the current ABBA deadlocks
   between queue freeze and limits lock.

With that, we should be able to fix the issue for all block drivers with
changes to the block layer sysfs code only. But... I have not looked into
the details of all limits commit calls in all block drivers, so there may
be some bad apples in there that will also need some tweaking.

>> - nvme_init_identify
>> - loop_clear_limits
>> - few more...
>> 
>> 2. Callers which need both freezing the queue and acquiring limits-lock while starting
>> the update:
>> - nvme_update_ns_info_block
>> - nvme_update_ns_info_generic
>> - few more...

The queue freeze should not be necessary anywhere when starting the update.
The queue freeze is only needed when applying the limits, so that IOs that
are in flight are not affected by the limits change while still being
processed.

>> 
>> 3. Callers which neither need acquiring limits-lock nor require freezing queue as for
>> these set of callers in the call stack limits-lock is already acquired and queue is
>> already frozen:
>> - __blk_mq_update_nr_hw_queues
>> - queue_xxx_store and helpers
> 
> I think it isn't correct.
> 
> The queue limits are applied on fast IO path, in theory anywhere
> updating q->limits need to drain IOs in submission path at least
> after gendisk is added.

Yes !
On 12/31/24 04:59, Damien Le Moal wrote:
> On 2024/12/30 18:02, Ming Lei wrote:
>>> 1. Callers which need acquiring limits-lock while starting the update; and freezing
>>> queue only when committing the update:
>>> - sd_revalidate_disk
>>
>> sd_revalidate_disk() should be the most strange one, in which
>> passthrough io command is required, so dependency on queue freeze lock
>> can't be added, such as, q->limits_lock
>>
>> Actually the current queue limits structure aren't well-organized, otherwise
>> limit lock isn't needed for reading queue limits from hardware, since
>> sd_revalidate_disk() just overwrites partial limits. Or it can be
>> done by refactoring sd_revalidate_disk(). However, the change might
>> be a little big, and I guess that is the reason why Damien don't like
>> it.
> 
> That was not the reason, but yes, modifying sd_revalidate_disk() is not without
> risks of introducing regressions. The reason I proposed to simply move the queue
> freeze around or inside queue_limits_commit_update() is that:
> 
> 1) It is the right thing to do as that is the only place where it is actually
> needed to avoid losing concurrent limits changes.
> 
> 2) It clarifies the locking order between queue freeze and the limits lock.
> 
> 3) The current issues should mostly all be solved with some refactoring of the
> ->store() calls in blk-sysfs.c, resolving the current ABBA deadlocks between
> queue freeze and limits lock.
> 
> With that, we should be able to fix the issue for all block drivers with changes
> to the block layer sysfs code only. But... I have not looked into the details of
> all limits commit calls in all block drivers. So there may be some bad apples in
> there that will also need some tweaking.

Yes, I think we have places other than blk-sysfs, like nvme, where we need
fixing.

> 
>>> - nvme_init_identify
>>> - loop_clear_limits
>>> - few more...
>>>
>>> 2.
>>> Callers which need both freezing the queue and acquiring limits-lock while starting
>>> the update:
>>> - nvme_update_ns_info_block
>>> - nvme_update_ns_info_generic
>>> - few more...
> 
> The queue freeze should not be necessary anywhere when starting the update. The
> queue freeze is only needed when applying the limits so that IOs that are in
> flight are not affected by the limits change while still being processed.

Hmm, but as mentioned, for the nvme limits update this is not always true.
So this needs to be tweaked.

>>>
>>> 3. Callers which neither need acquiring limits-lock nor require freezing queue as for
>>> these set of callers in the call stack limits-lock is already acquired and queue is
>>> already frozen:
>>> - __blk_mq_update_nr_hw_queues
>>> - queue_xxx_store and helpers
>>
>> I think it isn't correct.
>>
>> The queue limits are applied on fast IO path, in theory anywhere
>> updating q->limits need to drain IOs in submission path at least
>> after gendisk is added.
> 
> Yes !

OK, I think it'd be tricky to fix __blk_mq_update_nr_hw_queues and
queue_xxx_store with the current proposal of acquiring the limits lock
followed by queue freeze while committing the limits update.

Thanks,
--Nilay
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 8f09e33f41f6..b737428c6084 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -422,6 +422,7 @@ int queue_limits_commit_update(struct request_queue *q,
 {
 	int error;
 
+	mutex_lock(&q->limits_lock);
 	error = blk_validate_limits(lim);
 	if (error)
 		goto out_unlock;
@@ -456,7 +457,6 @@ EXPORT_SYMBOL_GPL(queue_limits_commit_update);
  */
 int queue_limits_set(struct request_queue *q, struct queue_limits *lim)
 {
-	mutex_lock(&q->limits_lock);
 	return queue_limits_commit_update(q, lim);
 }
 EXPORT_SYMBOL_GPL(queue_limits_set);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 378d3a1a22fc..6cc20ca79adc 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -944,8 +944,13 @@ static inline unsigned int blk_boundary_sectors_left(sector_t offset,
 static inline struct queue_limits
 queue_limits_start_update(struct request_queue *q)
 {
+	struct queue_limits lim;
+
 	mutex_lock(&q->limits_lock);
-	return q->limits;
+	lim = q->limits;
+	mutex_unlock(&q->limits_lock);
+
+	return lim;
 }
 int queue_limits_commit_update(struct request_queue *q,
 		struct queue_limits *lim);
@@ -962,7 +967,6 @@ int blk_validate_limits(struct queue_limits *lim);
  */
 static inline void queue_limits_cancel_update(struct request_queue *q)
 {
-	mutex_unlock(&q->limits_lock);
 }
 
 /*
Commit d690cb8ae14b ("block: add an API to atomically update queue limits")
added APIs for updating queue limits atomically: q->limits_lock is grabbed
in queue_limits_start_update() and released in queue_limits_commit_update().
This approach is fragile and makes it easy to introduce deadlocks[1][2].

More importantly, queue_limits_start_update() returns a local copy of
q->limits, the API user overwrites that local copy, and finally commits the
copy via queue_limits_commit_update(). So holding q->limits_lock protects
nothing for the local copy, and preventing a new update & commit from
happening doesn't provide any real help, because what matters is that the
validation & commit are done atomically.

Change the API to not hold q->limits_lock across the atomic queue limits
update APIs, fixing the deadlock and making the API easier to use.

[1] https://lore.kernel.org/linux-block/Z1A8fai9_fQFhs1s@hovoldconsulting.com/
[2] https://lore.kernel.org/linux-scsi/ZxG38G9BuFdBpBHZ@fedora/

Fixes: d690cb8ae14b ("block: add an API to atomically update queue limits")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-settings.c   | 2 +-
 include/linux/blkdev.h | 8 ++++++--
 2 files changed, 7 insertions(+), 3 deletions(-)