Message ID: 20220121114006.3633-1-penguin-kernel@I-love.SAKURA.ne.jp (mailing list archive)
State: New, archived
Series: [v3,1/5] task_work: export task_work_add()
I'm sometimes a little slow, but I still fail to understand why we need
all this. Your cut down patch that moves the destroy_workqueue call and
the work_struct fixed all the known lockdep issues, right?

And the only other problem we're thinking of is that blk_mq_freeze_queue
could have the same effect, except that lockdep doesn't track it and
we've not seen it in the wild. As far as I can tell we do not need the
freeze at all given that by the time release is called I/O is quiesced.

So with a little prep work we should not need any of that. See below
(this is actually three patches, with the last one being a very slight
rebase of your patch):

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 01cbbfc4e9e24..91b56761392e2 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1006,10 +1006,10 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
 	    !file->f_op->write_iter)
 		lo->lo_flags |= LO_FLAGS_READ_ONLY;
 
-	lo->workqueue = alloc_workqueue("loop%d",
-					WQ_UNBOUND | WQ_FREEZABLE,
-					0,
-					lo->lo_number);
+	if (!lo->workqueue)
+		lo->workqueue = alloc_workqueue("loop%d",
+						WQ_UNBOUND | WQ_FREEZABLE,
+						0, lo->lo_number);
 	if (!lo->workqueue) {
 		error = -ENOMEM;
 		goto out_unlock;
@@ -1082,7 +1082,7 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
 	return error;
 }
 
-static void __loop_clr_fd(struct loop_device *lo)
+static void __loop_clr_fd(struct loop_device *lo, bool release)
 {
 	struct file *filp;
 	gfp_t gfp = lo->old_gfp_mask;
@@ -1112,10 +1112,14 @@ static void __loop_clr_fd(struct loop_device *lo)
 	if (test_bit(QUEUE_FLAG_WC, &lo->lo_queue->queue_flags))
 		blk_queue_write_cache(lo->lo_queue, false, false);
 
-	/* freeze request queue during the transition */
-	blk_mq_freeze_queue(lo->lo_queue);
+	/*
+	 * Freeze the request queue when unbinding on a live file descriptor and
+	 * thus an open device.  When called from ->release we are guaranteed
+	 * that there is no I/O in progress already.
+	 */
+	if (!release)
+		blk_mq_freeze_queue(lo->lo_queue);
 
-	destroy_workqueue(lo->workqueue);
 	spin_lock_irq(&lo->lo_work_lock);
 	list_for_each_entry_safe(worker, pos, &lo->idle_worker_list, idle_list) {
@@ -1144,16 +1148,19 @@ static void __loop_clr_fd(struct loop_device *lo)
 	/* let user-space know about this change */
 	kobject_uevent(&disk_to_dev(lo->lo_disk)->kobj, KOBJ_CHANGE);
 	mapping_set_gfp_mask(filp->f_mapping, gfp);
-	blk_mq_unfreeze_queue(lo->lo_queue);
+	if (!release)
+		blk_mq_unfreeze_queue(lo->lo_queue);
 
 	disk_force_media_change(lo->lo_disk, DISK_EVENT_MEDIA_CHANGE);
 
 	if (lo->lo_flags & LO_FLAGS_PARTSCAN) {
 		int err;
 
-		mutex_lock(&lo->lo_disk->open_mutex);
+		if (!release)
+			mutex_lock(&lo->lo_disk->open_mutex);
 		err = bdev_disk_changed(lo->lo_disk, false);
-		mutex_unlock(&lo->lo_disk->open_mutex);
+		if (!release)
+			mutex_unlock(&lo->lo_disk->open_mutex);
 		if (err)
 			pr_warn("%s: partition scan of loop%d failed (rc=%d)\n",
				__func__, lo->lo_number, err);
@@ -1164,39 +1171,17 @@ static void __loop_clr_fd(struct loop_device *lo)
 	if (!part_shift)
 		lo->lo_disk->flags |= GENHD_FL_NO_PART;
 
-	fput(filp);
-}
-
-static void loop_rundown_completed(struct loop_device *lo)
-{
 	mutex_lock(&lo->lo_mutex);
 	lo->lo_state = Lo_unbound;
 	mutex_unlock(&lo->lo_mutex);
 	module_put(THIS_MODULE);
-}
-
-static void loop_rundown_workfn(struct work_struct *work)
-{
-	struct loop_device *lo = container_of(work, struct loop_device,
-					      rundown_work);
-	struct block_device *bdev = lo->lo_device;
-	struct gendisk *disk = lo->lo_disk;
-
-	__loop_clr_fd(lo);
-	kobject_put(&bdev->bd_device.kobj);
-	module_put(disk->fops->owner);
-	loop_rundown_completed(lo);
-}
-
-static void loop_schedule_rundown(struct loop_device *lo)
-{
-	struct block_device *bdev = lo->lo_device;
-	struct gendisk *disk = lo->lo_disk;
-
-	__module_get(disk->fops->owner);
-	kobject_get(&bdev->bd_device.kobj);
-	INIT_WORK(&lo->rundown_work, loop_rundown_workfn);
-	queue_work(system_long_wq, &lo->rundown_work);
+	/*
+	 * Need not hold lo_mutex to fput backing file. Calling fput holding
+	 * lo_mutex triggers a circular lock dependency possibility warning as
+	 * fput can take open_mutex which is usually taken before lo_mutex.
+	 */
+	fput(filp);
 }
 
 static int loop_clr_fd(struct loop_device *lo)
@@ -1228,8 +1213,7 @@ static int loop_clr_fd(struct loop_device *lo)
 	lo->lo_state = Lo_rundown;
 	mutex_unlock(&lo->lo_mutex);
 
-	__loop_clr_fd(lo);
-	loop_rundown_completed(lo);
+	__loop_clr_fd(lo, false);
 	return 0;
 }
 
@@ -1754,15 +1738,8 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
 	 * In autoclear mode, stop the loop thread
 	 * and remove configuration after last close.
 	 */
-		loop_schedule_rundown(lo);
+		__loop_clr_fd(lo, true);
 		return;
-	} else if (lo->lo_state == Lo_bound) {
-		/*
-		 * Otherwise keep thread (if running) and config,
-		 * but flush possible ongoing bios in thread.
-		 */
-		blk_mq_freeze_queue(lo->lo_queue);
-		blk_mq_unfreeze_queue(lo->lo_queue);
 	}
 
 out_unlock:
@@ -2080,6 +2057,8 @@ static void loop_remove(struct loop_device *lo)
 	mutex_unlock(&loop_ctl_mutex);
 	/* There is no route which can find this loop device. */
 	mutex_destroy(&lo->lo_mutex);
+	if (lo->workqueue)
+		destroy_workqueue(lo->workqueue);
 	kfree(lo);
 }
 
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index 918a7a2dc0259..082d4b6bfc6a6 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -56,7 +56,6 @@ struct loop_device {
 	struct gendisk		*lo_disk;
 	struct mutex		lo_mutex;
 	bool			idr_visible;
-	struct work_struct	rundown_work;
 };
 
 struct loop_cmd {
On Fri, Jan 21, 2022 at 08:40:02PM +0900, Tetsuo Handa wrote:
> Commit 322c4293ecc58110 ("loop: make autoclear operation asynchronous")
> silenced a circular locking dependency warning by moving the autoclear
> operation to WQ context.
>
> Then, it was reported that WQ context is too late to run the autoclear
> operation; some userspace programs (e.g. xfstests) assume that the
> autoclear operation has already completed by the moment close() returns
> to user mode, so that they can immediately call umount() of a partition
> containing a backing file which the autoclear operation should have
> closed.
>
> Then, Jan Kara found that the fundamental problem is that waiting for
> I/O completion (from blk_mq_freeze_queue() or flush_workqueue()) with
> disk->open_mutex held has a possibility of deadlock.
>
> Then, I found that since the disk->open_mutex => lo->lo_mutex dependency
> is recorded by lo_open() and lo_release(), and blk_mq_freeze_queue() by
> e.g. loop_set_status() waits for I/O completion with lo->lo_mutex held,
> from a locking dependency chain perspective we need to avoid holding
> lo->lo_mutex in lo_open() and lo_release(). And we can avoid holding
> lo->lo_mutex in lo_open(), for we can instead use a spinlock dedicated
> to the Lo_deleting check.
>
> But we cannot avoid holding lo->lo_mutex in lo_release(), for WQ context
> was too late to run the autoclear operation. We need to make the whole
> lo_release() operation start without disk->open_mutex and complete
> before returning to user mode. One of the approaches that can meet such
> a requirement is to use the task_work context. Thus, export
> task_work_add() for the loop driver.
>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

5.17-rc1 came out and I saw regressions across the board when xfs/049
and xfs/073 ran.
xfs/049 formats XFS on a block device, mounts it, creates a sparse file
inside the XFS fs, formats the sparse file, mounts that (via automatic
loop device), does some work, and then unmounts both the sparse file
filesystem and the outer XFS filesystem. The first unmount no longer
waits for the loop device to be released (the release is now
asynchronous), so the unmount of the outer fs fails because it is still
in use.

So this series fixes xfs/049, but xfs/073 is still broken. xfs/073
creates a sparse file containing an XFS filesystem and then does this in
rapid succession:

mount -o loop <mount options that guarantee mount failure>
mount -o loop <mount options that should work>

With 5.16 this worked fine:

[U] try mount 1
loop0: detected capacity change from 0 to 83968
XFS (loop0): Filesystem has duplicate UUID 924e8033-a130-4f9c-a11f-52f892c268e9 - can't mount
[U] try mount 2
loop0: detected capacity change from 0 to 83968
XFS (loop0): Mounting V5 Filesystem
XFS (loop0): resetting quota flags
XFS (loop0): Ending clean mount

In 5.17-rc1 it fails like this:

[U] try mount 1
loop0: detected capacity change from 0 to 83968
XFS (loop0): Filesystem has duplicate UUID 0b0afdac-5c9c-4d94-9b8d-fe85a2eb1143 - can't mount
[U] try mount 2
I/O error, dev loop0, sector 0 op 0x0:(READ) flags 0x1000 phys_seg 1 prio class 0
XFS (loop0): SB validate failed with error -5.
[U] fail mount 2

I guess this means that mount can grab the loop device after the backing
file has been released but before a new one has been plumbed in? Or,
seeing the lack of "detected capacity change" for mount 2, maybe the
backing file never gets added?
--D

> ---
>  kernel/task_work.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/kernel/task_work.c b/kernel/task_work.c
> index 1698fbe6f0e1..2a1644189182 100644
> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -60,6 +60,7 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
>
>  	return 0;
>  }
> +EXPORT_SYMBOL_GPL(task_work_add);
>
>  /**
>   * task_work_cancel_match - cancel a pending work added by task_work_add()
> --
> 2.32.0
>
On 2022/01/26 0:47, Christoph Hellwig wrote:
> I'm sometimes a little slow, but I still fail to understand why we need
> all this. Your cut down patch that moves the destroy_workqueue call
> and the work_struct fixed all the known lockdep issues, right?

Right. Moving destroy_workqueue() (together with blk_mq_freeze_queue())
in __loop_clr_fd() to WQ context fixed a known lockdep issue.

> And the only other problem we're thinking of is that blk_mq_freeze_queue
> could have the same effect,

Right. blk_mq_freeze_queue() is still called with disk->open_mutex held,
for there is the

	} else if (lo->lo_state == Lo_bound) {
		/*
		 * Otherwise keep thread (if running) and config,
		 * but flush possible ongoing bios in thread.
		 */
		blk_mq_freeze_queue(lo->lo_queue);
		blk_mq_unfreeze_queue(lo->lo_queue);
	}

path in lo_release().

> except that lockdep doesn't track it and
> we've not seen it in the wild.

It is difficult to test. Fuzzers cannot test fsfreeze paths, for failing
to issue an unfreeze request leads to unresponsive virtual machines.

> As far as I can tell we do not need the freeze at all given that
> by the time release is called I/O is quiesced.

Why? lo_release() is called when close() is called. But (periodically
scheduled or triggered-on-demand) writeback of previously executed
buffered write() calls can start while lo_release() or __loop_clr_fd()
is running. Then, why not wait for I/O requests to complete? Isn't that
the reason the

	} else if (lo->lo_state == Lo_bound) {
		/*
		 * Otherwise keep thread (if running) and config,
		 * but flush possible ongoing bios in thread.
		 */
		blk_mq_freeze_queue(lo->lo_queue);
		blk_mq_unfreeze_queue(lo->lo_queue);
	}

path in lo_release() is there?
On Wed, Jan 26, 2022 at 08:47:17AM +0900, Tetsuo Handa wrote:
> > As far as I can tell we do not need the freeze at all given that
> > by the time release is called I/O is quiesced.
>
> Why? lo_release() is called when close() is called. But (periodically
> scheduled or triggered-on-demand) writeback of previously executed
> buffered write() calls can start while lo_release() or __loop_clr_fd()
> is running. Then, why not wait for I/O requests to complete?

Let's refine my wording: the above should be "... by the time the final
lo_release is called". blkdev_put_whole ensures all writeback has
finished and all buffers are gone by writing all data back, waiting for
it, and then truncating the pages from blkdev_flush_mapping.

> Isn't that the reason of
>
> 	} else if (lo->lo_state == Lo_bound) {
> 		/*
> 		 * Otherwise keep thread (if running) and config,
> 		 * but flush possible ongoing bios in thread.
> 		 */
> 		blk_mq_freeze_queue(lo->lo_queue);
> 		blk_mq_unfreeze_queue(lo->lo_queue);
> 	}
>
> path in lo_release() being there?

This looks completely spurious to me. Adding Ming who added it.
On Wed, Jan 26, 2022 at 06:21:59AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 26, 2022 at 08:47:17AM +0900, Tetsuo Handa wrote:
> > > As far as I can tell we do not need the freeze at all given that
> > > by the time release is called I/O is quiesced.
> >
> > Why? lo_release() is called when close() is called. But (periodically
> > scheduled or triggered-on-demand) writeback of previously executed
> > buffered write() calls can start while lo_release() or __loop_clr_fd()
> > is running. Then, why not wait for I/O requests to complete?
>
> Let's refine my wording: the above should be "... by the time the final
> lo_release is called". blkdev_put_whole ensures all writeback has
> finished and all buffers are gone by writing all data back, waiting for
> it, and then truncating the pages from blkdev_flush_mapping.
>
> > Isn't that the reason of
> >
> > 	} else if (lo->lo_state == Lo_bound) {
> > 		/*
> > 		 * Otherwise keep thread (if running) and config,
> > 		 * but flush possible ongoing bios in thread.
> > 		 */
> > 		blk_mq_freeze_queue(lo->lo_queue);
> > 		blk_mq_unfreeze_queue(lo->lo_queue);
> > 	}
> >
> > path in lo_release() being there?
>
> This looks completely spurious to me. Adding Ming who added it.

It was added when converting to blk-mq. I remember it was to replace the
original loop_flush(), which used wait_for_completion() for draining all
in-flight bios, but it seems the flush isn't needed in lo_release().

Thanks,
Ming
On 2022/01/26 16:18, Ming Lei wrote:
>>> Isn't that the reason of
>>>
>>> 	} else if (lo->lo_state == Lo_bound) {
>>> 		/*
>>> 		 * Otherwise keep thread (if running) and config,
>>> 		 * but flush possible ongoing bios in thread.
>>> 		 */
>>> 		blk_mq_freeze_queue(lo->lo_queue);
>>> 		blk_mq_unfreeze_queue(lo->lo_queue);
>>> 	}
>>>
>>> path in lo_release() being there?
>>
>> This looks completely spurious to me. Adding Ming who added it.
>
> It was added when converting to blk-mq.

That is commit b5dd2f6047ca1080 ("block: loop: improve performance via
blk-mq") in v4.0-rc1. That's where "thread" in the block comment above
became outdated and a global WQ_MEM_RECLAIM workqueue came to be used.
We need to update both.

> I remember it was to replace the original loop_flush(), which used
> wait_for_completion() for draining all in-flight bios, but it seems
> the flush isn't needed in lo_release().

Even if we can remove blk_mq_freeze_queue()/blk_mq_unfreeze_queue() from
lo_release(), we cannot remove
blk_mq_freeze_queue()/blk_mq_unfreeze_queue() from e.g.
loop_set_status(), right?

Then, lo_release(), which is called with disk->open_mutex held, can
still be blocked at mutex_lock(&lo->lo_mutex) waiting for e.g.
loop_set_status() to call mutex_unlock(&lo->lo_mutex). That is,
lo_open() from e.g. /sys/power/resume can still wait for I/O completion
with disk->open_mutex held.

How do we kill this dependency? Don't we need to make sure that
lo_open()/lo_release() are not blocked on lo->lo_mutex (like
[PATCH v3 3/5] and [PATCH v3 4/5] do)?
On Wed 26-01-22 19:27:17, Tetsuo Handa wrote:
> On 2022/01/26 16:18, Ming Lei wrote:
> > I remember it was to replace the original loop_flush(), which used
> > wait_for_completion() for draining all in-flight bios, but it seems
> > the flush isn't needed in lo_release().
>
> Even if we can remove blk_mq_freeze_queue()/blk_mq_unfreeze_queue() from
> lo_release(), we cannot remove
> blk_mq_freeze_queue()/blk_mq_unfreeze_queue() from e.g.
> loop_set_status(), right?

Correct AFAICT.

> Then, lo_release(), which is called with disk->open_mutex held, can
> still be blocked at mutex_lock(&lo->lo_mutex) waiting for e.g.
> loop_set_status() to call mutex_unlock(&lo->lo_mutex). That is,
> lo_open() from e.g. /sys/power/resume can still wait for I/O completion
> with disk->open_mutex held.

I don't think this is a real problem. If someone is calling
loop_set_status() it means the loop device is open and thus lo_release()
cannot be running in parallel, can it?

Honza
On 2022/01/26 22:11, Jan Kara wrote:
> On Wed 26-01-22 19:27:17, Tetsuo Handa wrote:
> > Even if we can remove blk_mq_freeze_queue()/blk_mq_unfreeze_queue() from
> > lo_release(), we cannot remove
> > blk_mq_freeze_queue()/blk_mq_unfreeze_queue() from e.g.
> > loop_set_status(), right?
>
> Correct AFAICT.

OK.

> > Then, lo_release(), which is called with disk->open_mutex held, can
> > still be blocked at mutex_lock(&lo->lo_mutex) waiting for e.g.
> > loop_set_status() to call mutex_unlock(&lo->lo_mutex). That is,
> > lo_open() from e.g. /sys/power/resume can still wait for I/O completion
> > with disk->open_mutex held.
>
> I don't think this is a real problem. If someone is calling
> loop_set_status() it means the loop device is open and thus lo_release()
> cannot be running in parallel, can it?

lo_release() is called when a file descriptor is close()d. That is,
loop_set_status() and lo_release() can run in parallel, can't they?

  Process-A                              Process-B

  int fd1 = open("/dev/loop0", O_RDWR);
                                         int fd2 = open("/dev/loop0", O_RDWR);
  ioctl(fd1, LOOP_SET_STATUS64, ...);
                                         close(fd2);

If lo_release() (which is called with disk->open_mutex held) waits for
ioctl() (which waits for I/O completion with lo->lo_mutex held), there
is a disk->open_mutex => lo->lo_mutex => I/O completion dependency. And
the possibility of a deadlock problem (when mixed with sysfs and
fsfreeze) happens at lo_open(). If lo_open() (which is called with
disk->open_mutex held) waits for ioctl() (which waits for I/O completion
with lo->lo_mutex held), there as well is a disk->open_mutex =>
lo->lo_mutex => I/O completion dependency.
Commit 322c4293ecc58110 ("loop: make autoclear operation asynchronous")
silenced a circular locking dependency warning by moving the autoclear
operation to WQ context.

Then, it was reported that WQ context is too late to run the autoclear
operation; some userspace programs (e.g. xfstests) assume that the
autoclear operation has already completed by the moment close() returns
to user mode, so that they can immediately call umount() of a partition
containing a backing file which the autoclear operation should have
closed.

Then, Jan Kara found that the fundamental problem is that waiting for
I/O completion (from blk_mq_freeze_queue() or flush_workqueue()) with
disk->open_mutex held has a possibility of deadlock.

Then, I found that since the disk->open_mutex => lo->lo_mutex dependency
is recorded by lo_open() and lo_release(), and blk_mq_freeze_queue() by
e.g. loop_set_status() waits for I/O completion with lo->lo_mutex held,
from a locking dependency chain perspective we need to avoid holding
lo->lo_mutex in lo_open() and lo_release(). And we can avoid holding
lo->lo_mutex in lo_open(), for we can instead use a spinlock dedicated
to the Lo_deleting check.

But we cannot avoid holding lo->lo_mutex in lo_release(), for WQ context
was too late to run the autoclear operation. We need to make the whole
lo_release() operation start without disk->open_mutex and complete
before returning to user mode. One of the approaches that can meet such
a requirement is to use the task_work context. Thus, export
task_work_add() for the loop driver.

Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 kernel/task_work.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/task_work.c b/kernel/task_work.c
index 1698fbe6f0e1..2a1644189182 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -60,6 +60,7 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(task_work_add);
 
 /**
  * task_work_cancel_match - cancel a pending work added by task_work_add()