Message ID: bcaf38e6-055e-0d83-fd1d-cb7c0c649372@I-love.SAKURA.ne.jp (mailing list archive)
State: New, archived
Series: [v2,1/2] block: export task_work_add()
On Fri, Jan 7, 2022 at 12:04 PM Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> Jan Stancek is reporting that commit 322c4293ecc58110 ("loop: make autoclear operation asynchronous") broke LTP tests which run /bin/mount and /bin/umount in close succession, like
>
>   while :; do mount -o loop,ro isofs.iso isofs/; umount isofs/; done
>
> This is because /usr/lib/systemd/systemd-udevd asynchronously opens the loop device which /bin/mount and /bin/umount are operating on, so the autoclear from lo_release() is more likely to be triggered by systemd-udevd than by mount or umount. And unfortunately, /bin/mount fails if the read of the superblock (for examining the filesystem type) returns an error because the backing file was already cleared by __loop_clr_fd(). It turned out that disk->open_mutex was by chance serving as a barrier serializing "__loop_clr_fd() from lo_release()" and "vfs_read() after lo_open()", and we need to restore this barrier (without reintroducing the circular locking dependency).
>
> Also, the kernel test robot is reporting that that commit broke an xfstest which does the
>
>   umount ext2 on xfs
>   umount xfs
>
> sequence.
>
> One of the approaches for fixing these problems is to revert that commit and instead remove destroy_workqueue() from __loop_clr_fd(), for it turned out that we did not need to call flush_workqueue() from __loop_clr_fd(), since blk_mq_freeze_queue() blocks until all pending "struct work_struct" are processed by loop_process_work(). But we are not sure whether it is safe to wait in blk_mq_freeze_queue() etc. with disk->open_mutex held; it could simply be that the dependency is not clear enough for fuzzers to cover and for lockdep to detect.
>
> Therefore, before taking the revert approach, let's try whether we can use the task work approach, which runs without locks held while still letting the caller wait for completion of that task work before returning to user mode.
>
> This patch tries to make lo_open()/lo_release() locklessly wait for __loop_clr_fd() by inserting a task work from lo_open()/lo_release() where possible.
>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Reported-by: Jan Stancek <jstancek@redhat.com>

v2 looked OK in my tests, thanks.

Tested-by: Jan Stancek <jstancek@redhat.com>

> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
> Changes in v2:
>   Need to also wait in lo_open(), per Jan's testcase.
>
>  drivers/block/loop.c | 70 ++++++++++++++++++++++++++++++++++++++++----
>  drivers/block/loop.h |  5 +++-
>  2 files changed, 68 insertions(+), 7 deletions(-)
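[Editorial note: for readers unfamiliar with the mechanism under discussion, below is a minimal sketch of the task_work pattern this series builds on (patch 1/2 of the series exports task_work_add() to modules). The struct and function names here are illustrative, not the loop driver's; the point is that a callback queued with TWA_RESUME runs with no locks held, just before the queueing task returns to user mode.]

----------------------------------------
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/task_work.h>

struct my_deferred_work {
        struct callback_head callback;
        /* ... per-instance state ... */
};

/* Runs locklessly when the queueing task returns to user mode. */
static void my_deferred_fn(struct callback_head *cb)
{
        struct my_deferred_work *d =
                container_of(cb, struct my_deferred_work, callback);

        /* ... heavy teardown work, no caller locks held here ... */
        kfree(d);
}

static int my_defer_to_usermode_return(void)
{
        struct my_deferred_work *d = kmalloc(sizeof(*d), GFP_KERNEL);

        if (!d)
                return -ENOMEM;
        init_task_work(&d->callback, my_deferred_fn);
        /*
         * task_work_add() fails if the task is already exiting, so a
         * caller needs a fallback (the patch falls back to queue_work()).
         */
        if (task_work_add(current, &d->callback, TWA_RESUME)) {
                kfree(d);
                return -ESRCH;
        }
        return 0;
}
----------------------------------------

Because the callback runs before the syscall returns, a close() that queues teardown this way cannot race with the same task's next open(), which is exactly the ordering property the patch is trying to restore.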
On Fri 07-01-22 20:04:31, Tetsuo Handa wrote:
> Jan Stancek is reporting that commit 322c4293ecc58110 ("loop: make autoclear operation asynchronous") broke LTP tests which run /bin/mount and /bin/umount in close succession, like
>
>   while :; do mount -o loop,ro isofs.iso isofs/; umount isofs/; done
>
> This is because /usr/lib/systemd/systemd-udevd asynchronously opens the loop device which /bin/mount and /bin/umount are operating on, so the autoclear from lo_release() is more likely to be triggered by systemd-udevd than by mount or umount. And unfortunately, /bin/mount fails if the read of the superblock (for examining the filesystem type) returns an error because the backing file was already cleared by __loop_clr_fd(). It turned out that disk->open_mutex was by chance serving as a barrier serializing "__loop_clr_fd() from lo_release()" and "vfs_read() after lo_open()", and we need to restore this barrier (without reintroducing the circular locking dependency).
>
> Also, the kernel test robot is reporting that that commit broke an xfstest which does the
>
>   umount ext2 on xfs
>   umount xfs
>
> sequence.
>
> One of the approaches for fixing these problems is to revert that commit and instead remove destroy_workqueue() from __loop_clr_fd(), for it turned out that we did not need to call flush_workqueue() from __loop_clr_fd(), since blk_mq_freeze_queue() blocks until all pending "struct work_struct" are processed by loop_process_work(). But we are not sure whether it is safe to wait in blk_mq_freeze_queue() etc. with disk->open_mutex held; it could simply be that the dependency is not clear enough for fuzzers to cover and for lockdep to detect.
>
> Therefore, before taking the revert approach, let's try whether we can use the task work approach, which runs without locks held while still letting the caller wait for completion of that task work before returning to user mode.
>
> This patch tries to make lo_open()/lo_release() locklessly wait for __loop_clr_fd() by inserting a task work from lo_open()/lo_release() where possible.
>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Reported-by: Jan Stancek <jstancek@redhat.com>
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

Eek, I agree that the patch may improve the situation, but is the complexity and ugliness really worth it? I mean, using task work in loop_schedule_rundown() makes a lot of sense, because the loop

  while :; do mount -o loop,ro isofs.iso isofs/; umount isofs/; done

will not fail because of dangling work in the workqueue after umount -> __loop_clr_fd(). But when other processes like systemd-udevd start to mess with the loop device, then you have no control over whether the following mount will see isofs.iso busy or not - it depends on when systemd-udevd decides to close the loop device. What your waiting in lo_open() achieves is only that if __loop_clr_fd() from systemd-udevd happens to run at the same time as lo_open() from mount, then we won't see EBUSY. But IMO that is not worth the complexity in lo_open(), because if systemd-udevd happens to close the loop device a millisecond later, you will get EBUSY anyway (and you would have gotten it even in the past). Or am I missing something?

								Honza

> ---
> Changes in v2:
>   Need to also wait in lo_open(), per Jan's testcase.
>
>  drivers/block/loop.c | 70 ++++++++++++++++++++++++++++++++++++++++----
>  drivers/block/loop.h |  5 +++-
>  2 files changed, 68 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index b1b05c45c07c..8ef6da186c5c 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -89,6 +89,7 @@
>  static DEFINE_IDR(loop_index_idr);
>  static DEFINE_MUTEX(loop_ctl_mutex);
>  static DEFINE_MUTEX(loop_validate_mutex);
> +static DECLARE_WAIT_QUEUE_HEAD(loop_rundown_wait);
>
>  /**
>   * loop_global_lock_killable() - take locks for safe loop_validate_file() test
> @@ -1172,13 +1173,12 @@ static void loop_rundown_completed(struct loop_device *lo)
>  	mutex_lock(&lo->lo_mutex);
>  	lo->lo_state = Lo_unbound;
>  	mutex_unlock(&lo->lo_mutex);
> +	wake_up_all(&loop_rundown_wait);
>  	module_put(THIS_MODULE);
>  }
>
> -static void loop_rundown_workfn(struct work_struct *work)
> +static void loop_rundown_start(struct loop_device *lo)
>  {
> -	struct loop_device *lo = container_of(work, struct loop_device,
> -					      rundown_work);
>  	struct block_device *bdev = lo->lo_device;
>  	struct gendisk *disk = lo->lo_disk;
>
> @@ -1188,6 +1188,18 @@ static void loop_rundown_workfn(struct work_struct *work)
>  	loop_rundown_completed(lo);
>  }
>
> +static void loop_rundown_callbackfn(struct callback_head *callback)
> +{
> +	loop_rundown_start(container_of(callback, struct loop_device,
> +					rundown.callback));
> +}
> +
> +static void loop_rundown_workfn(struct work_struct *work)
> +{
> +	loop_rundown_start(container_of(work, struct loop_device,
> +					rundown.work));
> +}
> +
>  static void loop_schedule_rundown(struct loop_device *lo)
>  {
>  	struct block_device *bdev = lo->lo_device;
> @@ -1195,8 +1207,13 @@ static void loop_schedule_rundown(struct loop_device *lo)
>
>  	__module_get(disk->fops->owner);
>  	kobject_get(&bdev->bd_device.kobj);
> -	INIT_WORK(&lo->rundown_work, loop_rundown_workfn);
> -	queue_work(system_long_wq, &lo->rundown_work);
> +	if (!(current->flags & PF_KTHREAD)) {
> +		init_task_work(&lo->rundown.callback, loop_rundown_callbackfn);
> +		if (!task_work_add(current, &lo->rundown.callback, TWA_RESUME))
> +			return;
> +	}
> +	INIT_WORK(&lo->rundown.work, loop_rundown_workfn);
> +	queue_work(system_long_wq, &lo->rundown.work);
>  }
>
>  static int loop_clr_fd(struct loop_device *lo)
> @@ -1721,19 +1738,60 @@ static int lo_compat_ioctl(struct block_device *bdev, fmode_t mode,
>  }
>  #endif
>
> +struct loop_rundown_waiter {
> +	struct callback_head callback;
> +	struct loop_device *lo;
> +};
> +
> +static void loop_rundown_waiter_callbackfn(struct callback_head *callback)
> +{
> +	struct loop_rundown_waiter *lrw =
> +		container_of(callback, struct loop_rundown_waiter, callback);
> +
> +	/*
> +	 * Locklessly wait for completion of __loop_clr_fd().
> +	 * This should be safe because of the following rules.
> +	 *
> +	 * (a) From locking dependency perspective, this function is called
> +	 *     without any locks held.
> +	 * (b) From execution ordering perspective, this function is called
> +	 *     by the moment lo_open() from open() syscall returns to user
> +	 *     mode.
> +	 * (c) From use-after-free protection perspective, this function is
> +	 *     called before loop_remove() is called, for lo->lo_refcnt taken
> +	 *     by lo_open() prevents loop_control_remove() and loop_exit().
> +	 */
> +	wait_event_killable(loop_rundown_wait, data_race(lrw->lo->lo_state) != Lo_rundown);
> +	kfree(lrw);
> +}
> +
>  static int lo_open(struct block_device *bdev, fmode_t mode)
>  {
>  	struct loop_device *lo = bdev->bd_disk->private_data;
> +	struct loop_rundown_waiter *lrw =
> +		kmalloc(sizeof(*lrw), GFP_KERNEL | __GFP_NOWARN);
>  	int err;
>
> +	if (!lrw)
> +		return -ENOMEM;
>  	err = mutex_lock_killable(&lo->lo_mutex);
> -	if (err)
> +	if (err) {
> +		kfree(lrw);
>  		return err;
> +	}
>  	if (lo->lo_state == Lo_deleting)
>  		err = -ENXIO;
>  	else
>  		atomic_inc(&lo->lo_refcnt);
>  	mutex_unlock(&lo->lo_mutex);
> +	if (!err && !(current->flags & PF_KTHREAD)) {
> +		init_task_work(&lrw->callback, loop_rundown_waiter_callbackfn);
> +		lrw->lo = lo;
> +		if (task_work_add(current, &lrw->callback, TWA_RESUME))
> +			kfree(lrw);
> +	} else {
> +		kfree(lrw);
> +	}
>  	return err;
>  }
>
> diff --git a/drivers/block/loop.h b/drivers/block/loop.h
> index 918a7a2dc025..596472f9cde3 100644
> --- a/drivers/block/loop.h
> +++ b/drivers/block/loop.h
> @@ -56,7 +56,10 @@ struct loop_device {
>  	struct gendisk		*lo_disk;
>  	struct mutex		lo_mutex;
>  	bool			idr_visible;
> -	struct work_struct	rundown_work;
> +	union {
> +		struct work_struct work;
> +		struct callback_head callback;
> +	} rundown;
>  };
>
>  struct loop_cmd {
> --
> 2.32.0
On 2022/01/10 19:30, Jan Kara wrote:
> Eek, I agree that the patch may improve the situation but is the complexity and ugliness really worth it?

If we are clear about

  Now your patch will fix this lockdep complaint but we still would wait for the write to complete through blk_mq_freeze_queue() (just lockdep is not clever enough to detect this). So IMHO if there was a deadlock before, it will be still there with your changes.

in https://lkml.kernel.org/r/20211223134050.GD19129@quack2.suse.cz , we can go with the revert approach.

I want to call blk_mq_freeze_queue() without disk->open_mutex held. But currently lo_release() is calling blk_mq_freeze_queue() with disk->open_mutex held. My patch is going towards doing things locklessly where possible.

> I mean using task work in loop_schedule_rundown() makes a lot of sense because the loop
>
>   while :; do mount -o loop,ro isofs.iso isofs/; umount isofs/; done
>
> will not fail because of dangling work in the workqueue after umount -> __loop_clr_fd().

Using task work from lo_release() is for handling the close() => umount() sequence. Using task work in lo_open() is for handling the close() => open() sequence. The mount in

  while :; do mount -o loop,ro isofs.iso isofs/; umount isofs/; done

fails unless lo_open() waits for __loop_clr_fd() to complete.

> But when other processes like systemd-udevd start to mess with the loop device, then you have no control whether the following mount will see isofs.iso busy or not - it depends on when systemd-udevd decides to close the loop device.

Right. But regarding that part, the main problem is that the script is not checking for errors. What we are expected to do is to restore the barrier which existed before commit 322c4293ecc58110 ("loop: make autoclear operation asynchronous").

> What your waiting in lo_open() achieves is only that if __loop_clr_fd() from systemd-udevd happens to run at the same time as lo_open() from mount, then we won't see EBUSY.

My waiting in lo_open() is to fix a regression that

  while :; do mount -o loop,ro isofs.iso isofs/; umount isofs/; done

started to fail.

> But IMO that is not worth the complexity in lo_open() because if systemd-udevd happens to close the loop device a millisecond later, you will get EBUSY anyway (and you would have gotten it even in the past). Or am I missing something?

Excuse me, but lo_open() does not return EBUSY. What operation are you talking about?

If lo_open() from /bin/mount happens earlier than close() from systemd-udevd, __loop_clr_fd() from systemd-udevd does not happen, because lo_open() incremented lo->lo_refcnt, and autoclear will not happen until /bin/mount calls close().

If you are talking about the close() => umount() sequence, do_umount() can fail with EBUSY if /bin/umount happens earlier than close() from systemd-udevd. But that is out of scope for the use of task work in lo_open().
On Mon 10-01-22 20:28:28, Tetsuo Handa wrote:
> On 2022/01/10 19:30, Jan Kara wrote:
> > Eek, I agree that the patch may improve the situation but is the complexity and ugliness really worth it?
>
> If we are clear about
>
>   Now your patch will fix this lockdep complaint but we still would wait for the write to complete through blk_mq_freeze_queue() (just lockdep is not clever enough to detect this). So IMHO if there was a deadlock before, it will be still there with your changes.
>
> in https://lkml.kernel.org/r/20211223134050.GD19129@quack2.suse.cz , we can go with the revert approach.
>
> I want to call blk_mq_freeze_queue() without disk->open_mutex held. But currently lo_release() is calling blk_mq_freeze_queue() with disk->open_mutex held. My patch is going towards doing things locklessly where possible.

I see. But:

a) We didn't fully establish a real deadlock scenario from the lockdep report, did we? The lockdep report involved suspend locks, some locks on accessing files in /proc etc., and it was not clear whether it all reflects a real deadlock possibility or just the fact that lockdep tracking is rather coarse-grained at times. Now, the lockdep report is unpleasant and loop device locking was ugly anyway, so your async change made sense, but once lockdep is silenced we should really establish whether there is a real deadlock and more work is needed, or not.

b) Unless we have a realistic plan of moving *all* blk_mq_freeze_queue() calls from under open_mutex in the loop driver, moving stuff "where possible" from under open_mutex is IMO just muddying the water.

> > I mean using task work in loop_schedule_rundown() makes a lot of sense because the loop
> >
> >   while :; do mount -o loop,ro isofs.iso isofs/; umount isofs/; done
> >
> > will not fail because of dangling work in the workqueue after umount -> __loop_clr_fd().
>
> Using task work from lo_release() is for handling the close() => umount() sequence. Using task work in lo_open() is for handling the close() => open() sequence. The mount in
>
>   while :; do mount -o loop,ro isofs.iso isofs/; umount isofs/; done
>
> fails unless lo_open() waits for __loop_clr_fd() to complete.

Why? If we use task work, we are guaranteed the loop device is cleaned up before umount returns, unless some other process also has the loop device open.

> > But when other processes like systemd-udevd start to mess with the loop device, then you have no control whether the following mount will see isofs.iso busy or not - it depends on when systemd-udevd decides to close the loop device.
>
> Right. But regarding that part, the main problem is that the script is not checking for errors. What we are expected to do is to restore the barrier which existed before commit 322c4293ecc58110 ("loop: make autoclear operation asynchronous").

OK, can you explain what exactly you mean by the barrier? Because in my understanding your patch only makes one race somewhat less likely.

> > What your waiting in lo_open() achieves is only that if __loop_clr_fd() from systemd-udevd happens to run at the same time as lo_open() from mount, then we won't see EBUSY.
>
> My waiting in lo_open() is to fix a regression that
>
>   while :; do mount -o loop,ro isofs.iso isofs/; umount isofs/; done
>
> started to fail.

OK, so you claim this loop fails even with using task work in __loop_clr_fd(), correct? And does it fail even without systemd-udev?

> > But IMO that is not worth the complexity in lo_open() because if systemd-udevd happens to close the loop device a millisecond later, you will get EBUSY anyway (and you would have gotten it even in the past). Or am I missing something?
>
> Excuse me, but lo_open() does not return EBUSY. What operation are you talking about?

I didn't mean EBUSY from lo_open() but EBUSY from the LOOP_SET_FD ioctl(2). But maybe I misunderstood the failure. Where exactly does mount(1) fail? Because with the options you mention it should do something like:

  ioctl(LOOP_CTL_GET_FREE)  -> get a free loop dev
  ioctl(LOOP_SET_FD)        -> bind isofs.iso to the free loop dev
  mount(loop_dev, isofs/)

and I thought it is the second syscall that fails.

								Honza
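[Editorial note: to make the sequence Jan sketches concrete, here is a hedged userspace sketch of that loop-device setup path. It is illustrative only - mount_via_loop() and the (elided) error handling are not util-linux source - but the three steps, and the point where EBUSY from LOOP_SET_FD would surface, match the description above.]

----------------------------------------
#include <fcntl.h>
#include <linux/loop.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mount.h>

static int mount_via_loop(const char *image, const char *target)
{
        char loopdev[32];
        /* Error checking of open()/ioctl() return values elided. */
        int ctl = open("/dev/loop-control", O_RDWR);
        int num = ioctl(ctl, LOOP_CTL_GET_FREE);   /* get a free loop dev */
        int img = open(image, O_RDONLY);

        snprintf(loopdev, sizeof(loopdev), "/dev/loop%d", num);
        int lfd = open(loopdev, O_RDWR);
        if (ioctl(lfd, LOOP_SET_FD, img) < 0)      /* bind image to dev */
                return -1;                         /* EBUSY would show up here */
        /*
         * The failure reported in this thread is later than LOOP_SET_FD:
         * reading the superblock during mount(2) fails if autoclear has
         * detached the backing file by then.
         */
        return mount(loopdev, target, "iso9660", MS_RDONLY, NULL);
}
----------------------------------------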
On 2022/01/10 22:42, Jan Kara wrote:
> a) We didn't fully establish a real deadlock scenario from the lockdep report, did we? The lockdep report involved suspend locks, some locks on accessing files in /proc etc., and it was not clear whether it all reflects a real deadlock possibility or just the fact that lockdep tracking is rather coarse-grained at times. Now, the lockdep report is unpleasant and loop device locking was ugly anyway, so your async change made sense, but once lockdep is silenced we should really establish whether there is a real deadlock and more work is needed, or not.

Not /proc files but the /sys/power/resume file. Here is a reproducer, but I can't test whether we can trigger a real deadlock.

----------------------------------------
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/loop.h>
#include <sys/sendfile.h>

int main(int argc, char *argv[])
{
        const int file_fd = open("testfile", O_RDWR | O_CREAT, 0600);
        ftruncate(file_fd, 1048576);
        char filename[128] = { };
        const int loop_num = ioctl(open("/dev/loop-control", 3), LOOP_CTL_GET_FREE, 0);
        snprintf(filename, sizeof(filename) - 1, "/dev/loop%d", loop_num);
        const int loop_fd_1 = open(filename, O_RDWR);
        ioctl(loop_fd_1, LOOP_SET_FD, file_fd);
        const int loop_fd_2 = open(filename, O_RDWR);
        ioctl(loop_fd_1, LOOP_CLR_FD, 0);
        const int sysfs_fd = open("/sys/power/resume", O_RDWR);
        sendfile(file_fd, sysfs_fd, 0, 1048576);
        sendfile(loop_fd_2, file_fd, 0, 1048576);
        write(sysfs_fd, "700", 3);
        close(loop_fd_2);
        close(loop_fd_1); // Lockdep complains.
        close(file_fd);
        return 0;
}
----------------------------------------

Since many locks are held under disk->open_mutex, disk->open_mutex is becoming something like a dependency aggregation hub:

----------------------------------------
ffffffff852c9920 OPS: 963 FD: 315 BD: 4 +.+.: &disk->open_mutex
 -> [ffffffffa01c0a48] sd_ref_mutex
 -> [ffffffff831bda60] fs_reclaim
 -> [ffffffff8529b260] &n->list_lock
 -> [ffffffff83330988] depot_lock
 -> [ffffffff8312bb08] tk_core.seq.seqcount
 -> [ffffffff83d85000] &rq->__lock
 -> [ffffffff852ca120] &obj_hash[i].lock
 -> [ffffffff852c8e60] &hctx->lock
 -> [ffffffff852c8e00] &x->wait#20
 -> [ffffffff85283d60] &base->lock
 -> [ffffffff85283ce0] (&timer.timer)
 -> [ffff88811aa44b60] lock#2
 -> [ffffffff85291860] &____s->seqcount
 -> [ffffffff852c8c00] &q->sysfs_dir_lock
 -> [ffffffff852c8160] &bdev->bd_size_lock
 -> [ffffffff831bc378] free_vmap_area_lock
 -> [ffffffff831bc318] vmap_area_lock
 -> [ffffffff852a8c00] &xa->xa_lock#3
 -> [ffff88811aa44390] lock#6
 -> [ffffffff852a8bf0] &mapping->private_lock
 -> [ffffffff852c9f20] &dd->lock
 -> [ffffffff8528fb40] &folio_wait_table[i]
 -> [ffffffff82ef6c18] (console_sem).lock
 -> [ffffffff8330b708] &sb->s_type->i_lock_key#3
 -> [ffffffff852a5720] &s->s_inode_list_lock
 -> [ffffffff831a9508] pcpu_alloc_mutex
 -> [ffffffff8544e5c0] &x->wait#9
 -> [ffffffff8543b160] &k->list_lock
 -> [ffff88811aa459c0] lock#3
 -> [ffffffff852adbc0] &root->kernfs_rwsem
 -> [ffffffff83356b50] bus_type_sem
 -> [ffffffff8321e358] sysfs_symlink_target_lock
 -> [ffffffff8544e020] &dev->power.lock
 -> [ffffffff833d75e8] dpm_list_mtx
 -> [ffffffff833d4b78] req_lock
 -> [ffffffff83d80860] &p->pi_lock
 -> [ffffffff8544e420] &x->wait#10
 -> [ffffffff8543b140] &k->k_lock
 -> [ffffffff852c9960] subsys mutex#47
 -> [ffffffff852c9980] &xa->xa_lock#5
 -> [ffffffff82e14698] inode_hash_lock
 -> [ffffffff831bc538] purge_vmap_area_lock
 -> [ffffffff852900a0] &lruvec->lru_lock
 -> [ffffffff83328c98] pool_lock
 -> [ffffffff834052c8] sr_ref_mutex
 -> [ffffffff852c9a20] &ev->block_mutex
 -> [ffffffff852c9a00] &ev->lock
 -> [ffffffff831e6ed8] sb_lock
 -> [ffffffff8544fae0] &cd->lock
 -> [ffffffff82e14818] bdev_lock
 -> [ffffffffa03d10c0] &lo->lo_mutex
 -> [ffffffff83d86320] &lock->wait_lock
 -> [ffffffff83195368] lock#5
 -> [ffffffff85290d60] &wb->list_lock
----------------------------------------

> b) Unless we have a realistic plan of moving *all* blk_mq_freeze_queue() calls from under open_mutex in the loop driver, moving stuff "where possible" from under open_mutex is IMO just muddying the water.

As far as I know, lo_open() and lo_release() are the only functions in the loop driver which are called with disk->open_mutex held. And only lo_release() calls blk_mq_freeze_queue() with disk->open_mutex held.

blk_mq_freeze_queue() from __loop_clr_fd() from lo_release() (I mean, the autoclear part) can be done without disk->open_mutex held if we call __loop_clr_fd() from task work context. And I think blk_mq_freeze_queue() from lo_release() (I mean, the "if (lo->lo_state == Lo_bound) { }" part) can be done in the same manner.

Therefore, I think this is not a "partial move" but a "complete move".

>>> I mean using task work in loop_schedule_rundown() makes a lot of sense because the loop
>>>
>>>   while :; do mount -o loop,ro isofs.iso isofs/; umount isofs/; done
>>>
>>> will not fail because of dangling work in the workqueue after umount -> __loop_clr_fd().
>>
>> Using task work from lo_release() is for handling the close() => umount() sequence. Using task work in lo_open() is for handling the close() => open() sequence. The mount in
>>
>>   while :; do mount -o loop,ro isofs.iso isofs/; umount isofs/; done
>>
>> fails unless lo_open() waits for __loop_clr_fd() to complete.
>
> Why? If we use task work, we are guaranteed the loop device is cleaned up before umount returns, unless some other process also has the loop device open.
Since systemd-udevd opens this loop device asynchronously, __loop_clr_fd() from lo_release() is called not by mount or umount but by systemd-udevd. This means that we are not guaranteed that the loop device is cleaned up before umount returns.

>>> But when other processes like systemd-udevd start to mess with the loop device, then you have no control whether the following mount will see isofs.iso busy or not - it depends on when systemd-udevd decides to close the loop device.
>>
>> Right. But regarding that part, the main problem is that the script is not checking for errors. What we are expected to do is to restore the barrier which existed before commit 322c4293ecc58110 ("loop: make autoclear operation asynchronous").
>
> OK, can you explain what exactly you mean by the barrier?

/bin/mount opens the loop device and reads from that loop device in order to read the superblock for guessing/verifying the filesystem type of the backing file. disk->open_mutex was blocking open() of the loop device until autoclear via close() from systemd-udevd completed. My patch is intended to restore/mimic this behavior without holding disk->open_mutex.

> Because in my understanding your patch only makes one race somewhat less likely.

Yes, this barrier is for making one race less likely. We need to try to avoid hitting this race, because /bin/mount fails when we hit this race while reading the superblock of the backing file.

>>> What your waiting in lo_open() achieves is only that if __loop_clr_fd() from systemd-udevd happens to run at the same time as lo_open() from mount, then we won't see EBUSY.
>>
>> My waiting in lo_open() is to fix a regression that
>>
>>   while :; do mount -o loop,ro isofs.iso isofs/; umount isofs/; done
>>
>> started to fail.
>
> OK, so you claim this loop fails even with using task work in __loop_clr_fd(), correct?

Correct. If we run this loop with https://lkml.kernel.org/r/9eff2034-2f32-54a3-e476-d0f609ab49c0@i-love.sakura.ne.jp , we get errors like below.

----------------------------------------
root@fuzz:/mnt# while :; do mount -o loop,ro isofs.iso isofs/; umount isofs/; done
mount: /mnt/isofs: can't read superblock on /dev/loop0.
umount: isofs/: not mounted.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
umount: isofs/: not mounted.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
umount: isofs/: not mounted.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
umount: isofs/: not mounted.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
umount: isofs/: not mounted.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
umount: isofs/: not mounted.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
umount: isofs/: not mounted.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
umount: isofs/: not mounted.
----------------------------------------

----------------------------------------
[  172.907151] loop0: detected capacity change from 0 to 1992704
[  173.034450] LoadPin: kernel-module pinning-ignored obj="/usr/lib/modules/5.16.0-rc8-next-20220107+/kernel/fs/isofs/isofs.ko" pid=6124 cmdline="/sbin/modprobe -q -- fs-iso9660"
[  173.130665] ISO 9660 Extensions: Microsoft Joliet Level 3
[  173.132227] ISO 9660 Extensions: RRIP_1991A
[  173.271904] loop0: detected capacity change from 0 to 1992704
[  173.303460] ISO 9660 Extensions: Microsoft Joliet Level 3
[  173.304261] ISO 9660 Extensions: RRIP_1991A
[  173.431772] ISO 9660 Extensions: Microsoft Joliet Level 3
[  173.435583] ISO 9660 Extensions: RRIP_1991A
[  173.559952] ISO 9660 Extensions: Microsoft Joliet Level 3
[  173.561014] ISO 9660 Extensions: RRIP_1991A
[  173.673018] I/O error, dev loop0, sector 1992576 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[  173.683062] I/O error, dev loop0, sector 2 op 0x0:(READ) flags 0x1000 phys_seg 1 prio class 0
[  173.685547] EXT4-fs (loop0): unable to read superblock
[  173.763214] loop0: detected capacity change from 0 to 1992704
[  173.792790] ISO 9660 Extensions: Microsoft Joliet Level 3
[  173.802959] ISO 9660 Extensions: RRIP_1991A
[  173.918140] ISO 9660 Extensions: Microsoft Joliet Level 3
[  173.920018] ISO 9660 Extensions: RRIP_1991A
[  173.991678] loop0: detected capacity change from 0 to 1992704
[  174.019262] ISO 9660 Extensions: Microsoft Joliet Level 3
[  174.020914] ISO 9660 Extensions: RRIP_1991A
[  174.111092] ISO 9660 Extensions: Microsoft Joliet Level 3
[  174.111862] ISO 9660 Extensions: RRIP_1991A
[  174.210594] ISO 9660 Extensions: Microsoft Joliet Level 3
[  174.211624] ISO 9660 Extensions: RRIP_1991A
[  174.310668] ISO 9660 Extensions: Microsoft Joliet Level 3
[  174.312334] ISO 9660 Extensions: RRIP_1991A
[  174.424918] loop0: detected capacity change from 0 to 1992704
[  174.455070] ISO 9660 Extensions: Microsoft Joliet Level 3
[  174.455944] ISO 9660 Extensions: RRIP_1991A
[  174.540379] loop0: detected capacity change from 0 to 1992704
[  174.576673] ISO 9660 Extensions: Microsoft Joliet Level 3
[  174.578195] ISO 9660 Extensions: RRIP_1991A
[  174.655124] I/O error, dev loop0, sector 1992576 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[  174.665039] I/O error, dev loop0, sector 2 op 0x0:(READ) flags 0x1000 phys_seg 1 prio class 0
[  174.667435] EXT4-fs (loop0): unable to read superblock
[  174.745809] loop0: detected capacity change from 0 to 1992704
[  174.783917] ISO 9660 Extensions: Microsoft Joliet Level 3
[  174.789040] ISO 9660 Extensions: RRIP_1991A
[  174.888963] I/O error, dev loop0, sector 1992576 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[  174.896654] I/O error, dev loop0, sector 2 op 0x0:(READ) flags 0x1000 phys_seg 1 prio class 0
[  174.898893] EXT4-fs (loop0): unable to read superblock
----------------------------------------

> And does it fail even without systemd-udev?

I don't have an environment without systemd-udevd. But probably it won't fail if there is nobody who opens the device asynchronously.

>>> But IMO that is not worth the complexity in lo_open() because if systemd-udevd happens to close the loop device a millisecond later, you will get EBUSY anyway (and you would have gotten it even in the past). Or am I missing something?
>>
>> Excuse me, but lo_open() does not return EBUSY. What operation are you talking about?
>
> I didn't mean EBUSY from lo_open() but EBUSY from the LOOP_SET_FD ioctl(2). But maybe I misunderstood the failure. Where exactly does mount(1) fail? Because with the options you mention it should do something like:
>
>   ioctl(LOOP_CTL_GET_FREE)  -> get a free loop dev
>   ioctl(LOOP_SET_FD)        -> bind isofs.iso to the free loop dev
>   mount(loop_dev, isofs/)
>
> and I thought it is the second syscall that fails.

/bin/mount is failing at read() of the superblock after open() of the loop device.
On Mon, Jan 10, 2022 at 02:42:34PM +0100, Jan Kara wrote:
> I see. But:
>
> a) We didn't fully establish a real deadlock scenario from the lockdep report, did we? The lockdep report involved suspend locks, some locks on accessing files in /proc etc., and it was not clear whether it all reflects a real deadlock possibility or just the fact that lockdep tracking is rather coarse-grained at times. Now, the lockdep report is unpleasant and loop device locking was ugly anyway, so your async change made sense, but once lockdep is silenced we should really establish whether there is a real deadlock and more work is needed, or not.
>
> b) Unless we have a realistic plan of moving *all* blk_mq_freeze_queue() calls from under open_mutex in the loop driver, moving stuff "where possible" from under open_mutex is IMO just muddying the water.

Agreed. I also have to say I'm not a fan of proliferating the use of task_work_add.
On Tue 11-01-22 00:08:56, Tetsuo Handa wrote:
> On 2022/01/10 22:42, Jan Kara wrote:
> > a) We didn't fully establish a real deadlock scenario from the lockdep report, did we? The lockdep report involved suspend locks, some locks on accessing files in /proc etc., and it was not clear whether it all reflects a real deadlock possibility or just the fact that lockdep tracking is rather coarse-grained at times. Now, the lockdep report is unpleasant and loop device locking was ugly anyway, so your async change made sense, but once lockdep is silenced we should really establish whether there is a real deadlock and more work is needed, or not.
>
> Not /proc files but the /sys/power/resume file. Here is a reproducer, but I can't test whether we can trigger a real deadlock.
>
> ----------------------------------------
> #include <stdio.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <sys/ioctl.h>
> #include <linux/loop.h>
> #include <sys/sendfile.h>
>
> int main(int argc, char *argv[])
> {
>         const int file_fd = open("testfile", O_RDWR | O_CREAT, 0600);
>         ftruncate(file_fd, 1048576);
>         char filename[128] = { };
>         const int loop_num = ioctl(open("/dev/loop-control", 3), LOOP_CTL_GET_FREE, 0);
>         snprintf(filename, sizeof(filename) - 1, "/dev/loop%d", loop_num);
>         const int loop_fd_1 = open(filename, O_RDWR);
>         ioctl(loop_fd_1, LOOP_SET_FD, file_fd);
>         const int loop_fd_2 = open(filename, O_RDWR);
>         ioctl(loop_fd_1, LOOP_CLR_FD, 0);
>         const int sysfs_fd = open("/sys/power/resume", O_RDWR);
>         sendfile(file_fd, sysfs_fd, 0, 1048576);
>         sendfile(loop_fd_2, file_fd, 0, 1048576);
>         write(sysfs_fd, "700", 3);
>         close(loop_fd_2);
>         close(loop_fd_1); // Lockdep complains.
>         close(file_fd);
>         return 0;
> }
> ----------------------------------------

Thanks for the reproducer. I will try to simplify it even more and figure out whether there is a real deadlock potential in the lockdep complaint or not...

> > b) Unless we have a realistic plan of moving *all* blk_mq_freeze_queue() calls from under open_mutex in the loop driver, moving stuff "where possible" from under open_mutex is IMO just muddying the water.
>
> As far as I know, lo_open() and lo_release() are the only functions in the loop driver which are called with disk->open_mutex held. And only lo_release() calls blk_mq_freeze_queue() with disk->open_mutex held.
>
> blk_mq_freeze_queue() from __loop_clr_fd() from lo_release() (I mean, the autoclear part) can be done without disk->open_mutex held if we call __loop_clr_fd() from task work context. And I think blk_mq_freeze_queue() from lo_release() (I mean, the "if (lo->lo_state == Lo_bound) { }" part) can be done in the same manner.
>
> Therefore, I think this is not a "partial move" but a "complete move".

OK, fair point.

> >>> I mean using task work in loop_schedule_rundown() makes a lot of sense because the loop
> >>>
> >>>   while :; do mount -o loop,ro isofs.iso isofs/; umount isofs/; done
> >>>
> >>> will not fail because of dangling work in the workqueue after umount -> __loop_clr_fd().
> >>
> >> Using task work from lo_release() is for handling the close() => umount() sequence. Using task work in lo_open() is for handling the close() => open() sequence. The mount in
> >>
> >>   while :; do mount -o loop,ro isofs.iso isofs/; umount isofs/; done
> >>
> >> fails unless lo_open() waits for __loop_clr_fd() to complete.
> >
> > Why? If we use task work, we are guaranteed the loop device is cleaned up before umount returns, unless some other process also has the loop device open.
>
> Since systemd-udevd opens this loop device asynchronously, __loop_clr_fd() from lo_release() is called not by mount or umount but by systemd-udevd. This means that we are not guaranteed that the loop device is cleaned up before umount returns.

OK, understood. Thanks for the explanation.

...

> >>> What your waiting in lo_open() achieves is only that if __loop_clr_fd() from systemd-udevd happens to run at the same time as lo_open() from mount, then we won't see EBUSY.
> >>
> >> My waiting in lo_open() is to fix a regression that
> >>
> >>   while :; do mount -o loop,ro isofs.iso isofs/; umount isofs/; done
> >>
> >> started to fail.
> >
> > OK, so you claim this loop fails even with using task work in __loop_clr_fd(), correct?
>
> Correct. If we run this loop with https://lkml.kernel.org/r/9eff2034-2f32-54a3-e476-d0f609ab49c0@i-love.sakura.ne.jp , we get errors like below.
>
> ----------------------------------------
> root@fuzz:/mnt# while :; do mount -o loop,ro isofs.iso isofs/; umount isofs/; done
> mount: /mnt/isofs: can't read superblock on /dev/loop0.
> umount: isofs/: not mounted.

OK, so I was still wondering why this happens, and now, after poking a bit more into util-linux, I think I understand. As you say, someone (systemd-udev in our case) has /dev/loop0 open when umount isofs/ runs. So /dev/loop0 stays alive after umount. Then mount runs, sees /dev/loop0 is still attached to isofs.iso, and wants to reuse it for the mount, but before it gets to opening the loop device in mnt_context_setup_loopdev(), the loop device is already cleaned up. The open still succeeds, but because the backing file is detached, by the time mount(2) tries to do any IO, it fails.

I don't think we can fully solve this race in the kernel and IMO it is futile to try that - depending on when exactly systemd-udev decides to close /dev/loop0 and how exactly mount decides to implement its loop device reuse, different strategies will / will not work. But there is one subtlety we need to solve - when __loop_clr_fd() is run outside of disk->open_mutex, it can happen that the next lo_open() runs before __loop_clr_fd(). Thus we can hand away a file descriptor to a loop device that is half torn-down and will be cleaned up at an uncontrollable point in the future, which can lead to interesting consequences when the device is used at that time as well.

Perhaps we can fix these problems by checking lo_refcnt in __loop_clr_fd() once we grab lo_mutex and thus make sure the device still should be destroyed?

								Honza
On 2022/01/12 22:16, Jan Kara wrote:
> I don't think we can fully solve this race in the kernel and IMO it is futile to try that - depending on when exactly systemd-udev decides to close /dev/loop0 and how exactly mount decides to implement its loop device reuse, different strategies will / will not work.

Right, there is no perfect solution. Only mitigations that make this race window less likely to be hit are possible.

> But there is one subtlety we need to solve - when __loop_clr_fd() is run outside of disk->open_mutex, it can happen that the next lo_open() runs before __loop_clr_fd().

Yes, but the problem (I/O error) becomes visible when read() is called. Also, __loop_clr_fd() from ioctl(LOOP_CLR_FD) runs outside of disk->open_mutex.

> Thus we can hand away a file descriptor to a loop device that is half torn-down and will be cleaned up at an uncontrollable point in the future, which can lead to interesting consequences when the device is used at that time as well.

Right, but I don't think we can solve this.

> Perhaps we can fix these problems by checking lo_refcnt in __loop_clr_fd() once we grab lo_mutex and thus make sure the device still should be destroyed?

That will not help, for (even if __loop_clr_fd() from lo_release() runs under disk->open_mutex) the

  lo_release() {
    mutex_lock(&lo->lo_mutex); // Decrements lo->lo_refcnt and finds it became 0.
    mutex_unlock(&lo->lo_mutex);
    __loop_clr_fd(lo, true) {
      mutex_lock(&lo->lo_mutex); // Finds lo->lo_refcnt is still 0.
      mutex_unlock(&lo->lo_mutex);
      // Release the backing file.
    }
  }
  lo_open() {
    mutex_lock(&lo->lo_mutex); // Increments lo->lo_refcnt.
    mutex_unlock(&lo->lo_mutex);
  }
  vfs_read() {
    // Read attempt fails because the backing file is already gone.
  }

sequence might still cause some program to fail with an I/O error, and (since __loop_clr_fd() from loop_clr_fd() does not run under disk->open_mutex) the

  loop_clr_fd() {
    mutex_lock(&lo->lo_mutex); // Finds lo->lo_refcnt is 1.
    mutex_unlock(&lo->lo_mutex);
    __loop_clr_fd(lo, true) {
      mutex_lock(&lo->lo_mutex); // Finds lo->lo_refcnt is still 1.
      mutex_unlock(&lo->lo_mutex);
      lo_open() {
        mutex_lock(&lo->lo_mutex); // Increments lo->lo_refcnt.
        mutex_unlock(&lo->lo_mutex);
      }
      // Release the backing file.
    }
  }
  vfs_read() {
    // Read attempt fails because the backing file is already gone.
  }

sequence is still possible.

We don't want to hold disk->open_mutex and/or lo->lo_mutex when an I/O request on this loop device is issued, do we? Do we want to hold disk->open_mutex when calling __loop_clr_fd() from loop_clr_fd()? That might be useful for avoiding some race situations, but it runs counter to the locking simplification.
On Wed 12-01-22 14:16:15, Jan Kara wrote:
> On Tue 11-01-22 00:08:56, Tetsuo Handa wrote:
> > On 2022/01/10 22:42, Jan Kara wrote:
> > > a) We didn't fully establish a real deadlock scenario from the lockdep report, did we? The lockdep report involved suspend locks, some locks on accessing files in /proc etc., and it was not clear whether it all reflects a real deadlock possibility or just the fact that lockdep tracking is rather coarse-grained at times. Now, the lockdep report is unpleasant and loop device locking was ugly anyway, so your async change made sense, but once lockdep is silenced we should really establish whether there is a real deadlock and more work is needed, or not.
> >
> > Not /proc files but the /sys/power/resume file. Here is a reproducer, but I can't test whether we can trigger a real deadlock.
> >
> > ----------------------------------------
> > #include <stdio.h>
> > #include <sys/types.h>
> > #include <sys/stat.h>
> > #include <fcntl.h>
> > #include <unistd.h>
> > #include <sys/ioctl.h>
> > #include <linux/loop.h>
> > #include <sys/sendfile.h>
> >
> > int main(int argc, char *argv[])
> > {
> >         const int file_fd = open("testfile", O_RDWR | O_CREAT, 0600);
> >         ftruncate(file_fd, 1048576);
> >         char filename[128] = { };
> >         const int loop_num = ioctl(open("/dev/loop-control", 3), LOOP_CTL_GET_FREE, 0);
> >         snprintf(filename, sizeof(filename) - 1, "/dev/loop%d", loop_num);
> >         const int loop_fd_1 = open(filename, O_RDWR);
> >         ioctl(loop_fd_1, LOOP_SET_FD, file_fd);
> >         const int loop_fd_2 = open(filename, O_RDWR);
> >         ioctl(loop_fd_1, LOOP_CLR_FD, 0);
> >         const int sysfs_fd = open("/sys/power/resume", O_RDWR);
> >         sendfile(file_fd, sysfs_fd, 0, 1048576);
> >         sendfile(loop_fd_2, file_fd, 0, 1048576);
> >         write(sysfs_fd, "700", 3);
> >         close(loop_fd_2);
> >         close(loop_fd_1); // Lockdep complains.
> >         close(file_fd);
> >         return 0;
> > }
> > ----------------------------------------
>
> Thanks for the reproducer. I will try to simplify it even more and figure out whether there is a real deadlock potential in the lockdep complaint or not...

OK, so I think I understand the lockdep complaint better. Lockdep essentially complains about the following scenario:

blkdev_put()
  lock disk->open_mutex
  lo_release
  __loop_clr_fd()
    |
    | wait for IO to complete
    v
loop worker
  write to backing file
  sb_start_write(file)
    |
    | wait for fs with backing file to unfreeze
    v
process holding fs frozen
  freeze_super()
    |
    | wait for remaining writers on the fs so that fs can be frozen
    v
sendfile()
  sb_start_write()
  do_splice_direct()
    |
    | blocked on read from /sys/power/resume, waiting for kernfs file lock
    v
write() "/dev/loop0" (in whatever form) to /sys/power/resume
  calls blkdev_get_by_dev("/dev/loop0")
  lock disk->open_mutex  => deadlock

So there are three other ingredients to this locking loop besides the loop device behavior:

1) sysfs files which serialize reads & writes
2) sendfile, which holds freeze protection while reading the data to write
3) the "resume" file behavior of opening a bdev from a write to a sysfs file

We cannot sensibly do anything about 1) AFAICT. You cannot get a coherent view of a sysfs file while it is changing. We could actually change 2) (to only hold freeze protection while splicing from a pipe) but that would fix the problem only for sendfile(2). E.g. splice(2) may also block waiting for data to become available in a pipe while holding freeze protection. Changing that would require some surgery in our splice infrastructure, and at this point I'm not sure whether we would not introduce other locking problems due to pipe_lock lock ordering. For 3), we could consider stuff like not resuming from a loop device or postponing resume to a workqueue, but it all looks ugly.

Maybe the most disputable thing in this locking chain is splicing from sysfs files. That does not seem terribly useful, and due to the special locking and behavior of sysfs files it allows for creating interesting lock dependencies. OTOH maybe there is someone out there who (possibly inadvertently through some library) ends up using splice on sysfs files, so the chances of userspace breakage, if we disable splice for sysfs, would be non-negligible. Hum, tough.

								Honza
On Wed 12-01-22 23:00:00, Tetsuo Handa wrote:
> On 2022/01/12 22:16, Jan Kara wrote:
> > I don't think we can fully solve this race in the kernel and IMO it is futile to try that - depending on when exactly systemd-udev decides to close /dev/loop0 and how exactly mount decides to implement its loop device reuse, different strategies will / will not work.
>
> Right, there is no perfect solution. Only mitigations that make this race window less likely to be hit are possible.
>
> > But there is one subtlety we need to solve - when __loop_clr_fd() is run outside of disk->open_mutex, it can happen that the next lo_open() runs before __loop_clr_fd().
>
> Yes, but the problem (I/O error) becomes visible when read() is called. Also, __loop_clr_fd() from ioctl(LOOP_CLR_FD) runs outside of disk->open_mutex.

True, that path is similar. But now I have realized that once we decide to tear down the loop device, we set lo_state to Lo_rundown, and so ioctls such as LOOP_GET_STATUS start to fail, and new IO submissions also start to fail with an IO error. So from the userspace POV everything should be consistent.

> > Thus we can hand away a file descriptor to a loop device that is half torn-down and will be cleaned up at an uncontrollable point in the future, which can lead to interesting consequences when the device is used at that time as well.
>
> Right, but I don't think we can solve this.

Well, we cannot guarantee what the state of the loop device will be when you open it, but I think we should guarantee that once you have the loop device open, it will not be torn down under your hands. And now that I have realized there are those lo_state checks, I think everything is fine in that regard. I wanted to make sure that a sequence such as:

  fd = open(loop device)
  ioctl(fd, LOOP_GET_STATUS, &stats)
  verify in stats that the loop dev has the parameters we want
  ... use loop device ...

is race free, and that already seems to be the case - once LOOP_GET_STATUS does not return an error, we know there is no pending loop device shutdown. So the only bug seems to be in mount(8) code where the loop device reuse detection is racy, and the subtle change in timing with your loop changes makes it break. I'll write to the util-linux maintainer about this.

> We don't want to hold disk->open_mutex and/or lo->lo_mutex when an I/O request on this loop device is issued, do we?

Currently, we may hold both. With your async patch we hold only lo_mutex. Now that I better understand the nature of the deadlock, I agree that holding either still creates the deadlock possibility, because both are acquired on loop device open. But now that I have reminded myself of the lo_state handling, I think the following should be safe in __loop_clr_fd:

	/* Just a safety check... */
	if (WARN_ON_ONCE(data_race(lo->lo_state) != Lo_rundown))
		return -ENXIO;

	/*
	 * With Lo_rundown state, no new work can be queued. Flush pending
	 * IO and destroy workqueue.
	 */
	blk_mq_freeze_queue(lo->lo_queue);
	destroy_workqueue(lo->workqueue);
	mutex_lock(&lo->lo_mutex);
	... do the rest of the teardown ...

> Do we want to hold disk->open_mutex when calling __loop_clr_fd() from loop_clr_fd()? That might be useful for avoiding some race situations, but it runs counter to the locking simplification.

No, I'm not suggesting that.

								Honza
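[Editorial note: a hedged userspace sketch of the "open + LOOP_GET_STATUS" handshake Jan argues is race-free may help here; open_loop_checked() is an illustrative name, not an existing API. The assumption, per the discussion above, is that LOOP_GET_STATUS64 fails (e.g. with ENXIO) while the device is in Lo_rundown/Lo_unbound state, so a successful status read implies no teardown was pending at that moment.]

----------------------------------------
#include <fcntl.h>
#include <linux/loop.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Returns an fd whose loop device had no pending shutdown at check time. */
static int open_loop_checked(const char *dev)
{
        struct loop_info64 info;
        int fd = open(dev, O_RDWR);

        if (fd < 0)
                return -1;
        /* Expected to fail while lo_state is Lo_rundown/Lo_unbound. */
        if (ioctl(fd, LOOP_GET_STATUS64, &info) < 0) {
                close(fd);
                return -1;
        }
        /* ... verify info.lo_file_name etc. match what we expect ... */
        return fd;
}
----------------------------------------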
On 2022/01/14 0:23, Jan Kara wrote:
> Well, we cannot guarantee what the state of the loop device will be when you open it, but I think we should guarantee that once you have the loop device open, it will not be torn down under your hands. And now that I have realized there are those lo_state checks, I think everything is fine in that regard. I wanted to make sure that a sequence such as:

Well, we could abort __loop_clr_fd() if somebody called open(), something like below. But

----------------------------------------
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index b1b05c45c07c..960db2c484ab 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1082,7 +1082,7 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
 	return error;
 }
 
-static void __loop_clr_fd(struct loop_device *lo)
+static bool __loop_clr_fd(struct loop_device *lo, int expected_refcnt)
 {
 	struct file *filp;
 	gfp_t gfp = lo->old_gfp_mask;
@@ -1104,9 +1104,19 @@ static void __loop_clr_fd(struct loop_device *lo)
 	 * Since this function is called upon "ioctl(LOOP_CLR_FD)" xor "close()
 	 * after ioctl(LOOP_CLR_FD)", it is a sign of something going wrong if
 	 * lo->lo_state has changed while waiting for lo->lo_mutex.
+	 *
+	 * However, if somebody called "open()" after lo->lo_state became
+	 * Lo_rundown, we should abort rundown in order to avoid unexpected
+	 * I/O error.
 	 */
 	mutex_lock(&lo->lo_mutex);
 	BUG_ON(lo->lo_state != Lo_rundown);
+	if (atomic_read(&lo->lo_refcnt) != expected_refcnt) {
+		lo->lo_state = Lo_bound;
+		lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
+		mutex_unlock(&lo->lo_mutex);
+		return false;
+	}
 	mutex_unlock(&lo->lo_mutex);
 
 	if (test_bit(QUEUE_FLAG_WC, &lo->lo_queue->queue_flags))
@@ -1165,6 +1175,7 @@ static void __loop_clr_fd(struct loop_device *lo)
 	lo->lo_disk->flags |= GENHD_FL_NO_PART;
 
 	fput(filp);
+	return true;
 }
 
 static void loop_rundown_completed(struct loop_device *lo)
@@ -1181,11 +1192,12 @@ static void loop_rundown_workfn(struct work_struct *work)
 					     rundown_work);
 	struct block_device *bdev = lo->lo_device;
 	struct gendisk *disk = lo->lo_disk;
+	const bool cleared = __loop_clr_fd(lo, 0);
 
-	__loop_clr_fd(lo);
 	kobject_put(&bdev->bd_device.kobj);
 	module_put(disk->fops->owner);
-	loop_rundown_completed(lo);
+	if (cleared)
+		loop_rundown_completed(lo);
 }
 
 static void loop_schedule_rundown(struct loop_device *lo)
@@ -1228,8 +1240,8 @@ static int loop_clr_fd(struct loop_device *lo)
 	lo->lo_state = Lo_rundown;
 	mutex_unlock(&lo->lo_mutex);
 
-	__loop_clr_fd(lo);
-	loop_rundown_completed(lo);
+	if (__loop_clr_fd(lo, 1))
+		loop_rundown_completed(lo);
 	return 0;
 }
----------------------------------------

> Currently, we may hold both. With your async patch we hold only lo_mutex. Now that I better understand the nature of the deadlock, I agree that holding either still creates the deadlock possibility, because both are acquired on loop device open. But now that I have reminded myself of the lo_state handling, I think the following should be safe in __loop_clr_fd:
>
>	/* Just a safety check... */
>	if (WARN_ON_ONCE(data_race(lo->lo_state) != Lo_rundown))
>		return -ENXIO;

this is still racy, for somebody can reach lo_open() right after this check.

Anyway, since the problem with umount() immediately after close() (reported by the kernel test robot) remains, we need to make sure that __loop_clr_fd() completes before close() returns to user mode.
On 2022/01/13 19:44, Jan Kara wrote:
> OK, so I think I understand the lockdep complaint better. Lockdep essentially complains about the following scenario:
>
> blkdev_put()
>   lock disk->open_mutex
>   lo_release
>   __loop_clr_fd()
>     |
>     | wait for IO to complete
>     v
> loop worker
>   write to backing file
>   sb_start_write(file)
>     |
>     | wait for fs with backing file to unfreeze
>     v
> process holding fs frozen
>   freeze_super()
>     |
>     | wait for remaining writers on the fs so that fs can be frozen
>     v
> sendfile()
>   sb_start_write()
>   do_splice_direct()
>     |
>     | blocked on read from /sys/power/resume, waiting for kernfs file lock
>     v
> write() "/dev/loop0" (in whatever form) to /sys/power/resume
>   calls blkdev_get_by_dev("/dev/loop0")
>   lock disk->open_mutex  => deadlock

Then, not calling flush_workqueue() from destroy_workqueue() from __loop_clr_fd() will not help, because the reason we did not need to call flush_workqueue() is that blk_mq_freeze_queue() waits until all "struct work_struct" which flush_workqueue() would have waited for complete? If flush_workqueue() cannot complete because an I/O request cannot complete, blk_mq_freeze_queue() as well cannot complete, because it waits for the same I/O requests which flush_workqueue() would have waited for? Then, any attempt to revert commit 322c4293ecc58110 ("loop: make autoclear operation asynchronous") (e.g. https://lkml.kernel.org/r/4e7b711f-744b-3a78-39be-c9432a3cecd2@i-love.sakura.ne.jp ) cannot be used?

Now that commit 322c4293ecc58110 has already arrived at linux.git, we need to quickly send a fixup patch for these reported regressions. This "[PATCH v2 2/2] loop: use task_work for autoclear operation" did not address the "if (lo->lo_state == Lo_bound) { }" part. If we address this part, something like the diff below?

----------------------------------------
 drivers/block/loop.c | 158 ++++++++++++++++++++++++++++++++++++++-------------
 drivers/block/loop.h |   1
 2 files changed, 120 insertions(+), 39 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index b1b05c45c07c..eb750602c94d 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -89,6 +89,8 @@
 static DEFINE_IDR(loop_index_idr);
 static DEFINE_MUTEX(loop_ctl_mutex);
 static DEFINE_MUTEX(loop_validate_mutex);
+static DECLARE_WAIT_QUEUE_HEAD(loop_rundown_wait);
+static DEFINE_SPINLOCK(loop_open_spinlock);
 
 /**
  * loop_global_lock_killable() - take locks for safe loop_validate_file() test
@@ -1172,33 +1174,10 @@ static void loop_rundown_completed(struct loop_device *lo)
 	mutex_lock(&lo->lo_mutex);
 	lo->lo_state = Lo_unbound;
 	mutex_unlock(&lo->lo_mutex);
+	wake_up_all(&loop_rundown_wait);
 	module_put(THIS_MODULE);
 }
 
-static void loop_rundown_workfn(struct work_struct *work)
-{
-	struct loop_device *lo = container_of(work, struct loop_device,
-					      rundown_work);
-	struct block_device *bdev = lo->lo_device;
-	struct gendisk *disk = lo->lo_disk;
-
-	__loop_clr_fd(lo);
-	kobject_put(&bdev->bd_device.kobj);
-	module_put(disk->fops->owner);
-	loop_rundown_completed(lo);
-}
-
-static void loop_schedule_rundown(struct loop_device *lo)
-{
-	struct block_device *bdev = lo->lo_device;
-	struct gendisk *disk = lo->lo_disk;
-
-	__module_get(disk->fops->owner);
-	kobject_get(&bdev->bd_device.kobj);
-	INIT_WORK(&lo->rundown_work, loop_rundown_workfn);
-	queue_work(system_long_wq, &lo->rundown_work);
-}
-
 static int loop_clr_fd(struct loop_device *lo)
 {
 	int err;
@@ -1721,30 +1700,91 @@ static int lo_compat_ioctl(struct block_device *bdev, fmode_t mode,
 }
 #endif
 
+struct loop_rundown_waiter {
+	struct callback_head callback;
+	struct loop_device *lo;
+};
+
+static void loop_rundown_waiter_callbackfn(struct callback_head *callback)
+{
+	struct loop_rundown_waiter *lrw =
+		container_of(callback, struct loop_rundown_waiter, callback);
+
+	/*
+	 * Locklessly wait for completion of __loop_clr_fd().
+	 * This should be safe because of the following rules.
+	 *
+	 * (a) From locking dependency perspective, this function is called
+	 *     without any locks held.
+	 * (b) From execution ordering perspective, this function is called
+	 *     by the moment lo_open() from open() syscall returns to user
+	 *     mode.
+	 * (c) From use-after-free protection perspective, this function is
+	 *     called before loop_remove() is called, for lo->lo_refcnt taken
+	 *     by lo_open() prevents loop_control_remove() and loop_exit().
+	 */
+	wait_event_killable(loop_rundown_wait, data_race(lrw->lo->lo_state) != Lo_rundown);
+	kfree(lrw);
+}
+
 static int lo_open(struct block_device *bdev, fmode_t mode)
 {
 	struct loop_device *lo = bdev->bd_disk->private_data;
-	int err;
+	int err = 0;
 
-	err = mutex_lock_killable(&lo->lo_mutex);
-	if (err)
-		return err;
-	if (lo->lo_state == Lo_deleting)
+	/*
+	 * Avoid creating disk->open_mutex => lo->lo_mutex locking chain, for
+	 * calling blk_mq_freeze_queue() with lo->lo_mutex held will form a
+	 * circular locking dependency chain due to waiting for I/O requests
+	 * to complete.
+	 */
+	spin_lock(&loop_open_spinlock);
+	/*
+	 * Since the purpose of lo_open() is to protect this loop device from
+	 * entering Lo_rundown or Lo_deleting state, only serialization against
+	 * loop_control_remove() is sufficient.
+	 */
+	if (data_race(lo->lo_state) == Lo_deleting)
 		err = -ENXIO;
 	else
 		atomic_inc(&lo->lo_refcnt);
-	mutex_unlock(&lo->lo_mutex);
+	spin_unlock(&loop_open_spinlock);
+	/*
+	 * But loop_clr_fd() or __lo_release() might have already set this loop
+	 * device to Lo_rundown state. Since accessing Lo_rundown loop device
+	 * confuses userspace programs, try to wait for __loop_clr_fd() to
+	 * complete without disk->open_mutex held.
+	 */
+	if (!err && !(current->flags & PF_KTHREAD)) {
+		struct loop_rundown_waiter *lrw =
+			kmalloc(sizeof(*lrw), GFP_KERNEL | __GFP_NOFAIL);
+
+		init_task_work(&lrw->callback, loop_rundown_waiter_callbackfn);
+		lrw->lo = lo;
+		if (task_work_add(current, &lrw->callback, TWA_RESUME))
+			kfree(lrw);
+	}
 	return err;
 }
 
-static void lo_release(struct gendisk *disk, fmode_t mode)
+struct loop_release_trigger {
+	union {
+		struct work_struct work;
+		struct callback_head callback;
+	};
+	struct loop_device *lo;
+};
+
+/* Perform actual cleanup if nobody is using this loop device. */
+static void __lo_release(struct loop_release_trigger *lrt)
 {
-	struct loop_device *lo = disk->private_data;
+	struct loop_device *lo = lrt->lo;
+	struct gendisk *disk = lo->lo_disk;
+	bool cleared = false;
 
 	mutex_lock(&lo->lo_mutex);
-	if (atomic_dec_return(&lo->lo_refcnt))
+	if (atomic_read(&lo->lo_refcnt))
 		goto out_unlock;
-
 	if (lo->lo_flags & LO_FLAGS_AUTOCLEAR) {
 		if (lo->lo_state != Lo_bound)
 			goto out_unlock;
@@ -1754,8 +1794,8 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
 		 * In autoclear mode, stop the loop thread
 		 * and remove configuration after last close.
 		 */
-		loop_schedule_rundown(lo);
-		return;
+		__loop_clr_fd(lo);
+		cleared = true;
 	} else if (lo->lo_state == Lo_bound) {
 		/*
 		 * Otherwise keep thread (if running) and config,
@@ -1764,9 +1804,48 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
 		blk_mq_freeze_queue(lo->lo_queue);
 		blk_mq_unfreeze_queue(lo->lo_queue);
 	}
+ out_unlock:
+	if (!cleared)
+		mutex_unlock(&lo->lo_mutex);
+	module_put(disk->fops->owner);
+	if (cleared)
+		loop_rundown_completed(lo);
+}
 
-out_unlock:
-	mutex_unlock(&lo->lo_mutex);
+static void loop_release_callbackfn(struct callback_head *callback)
+{
+	__lo_release(container_of(callback, struct loop_release_trigger,
+				  callback));
+}
+
+static void loop_release_workfn(struct work_struct *work)
+{
+	__lo_release(container_of(work, struct loop_release_trigger, work));
+}
+
+static void lo_release(struct gendisk *disk, fmode_t mode)
+{
+	struct loop_device *lo = disk->private_data;
+	struct loop_release_trigger *lrt;
+
+	if (atomic_dec_return(&lo->lo_refcnt))
+		return;
+	/*
+	 * In order to avoid waiting for I/O requests to complete under
+	 * disk->open_mutex held, defer actual cleanup to __lo_release().
+	 * Prefer task work context if possible in order to make sure that
+	 * __lo_release() completes before close() returns to user mode.
+	 */
+	lrt = kmalloc(sizeof(*lrt), GFP_KERNEL | __GFP_NOFAIL);
+	lrt->lo = lo;
+	__module_get(disk->fops->owner);
+	if (!(current->flags & PF_KTHREAD)) {
+		init_task_work(&lrt->callback, loop_release_callbackfn);
+		if (!task_work_add(current, &lrt->callback, TWA_RESUME))
+			return;
+	}
+	INIT_WORK(&lrt->work, loop_release_workfn);
+	queue_work(system_long_wq, &lrt->work);
 }
 
 static const struct block_device_operations lo_fops = {
@@ -2119,14 +2198,17 @@ static int loop_control_remove(int idx)
 	ret = mutex_lock_killable(&lo->lo_mutex);
 	if (ret)
 		goto mark_visible;
+	spin_lock(&loop_open_spinlock);
 	if (lo->lo_state != Lo_unbound ||
 	    atomic_read(&lo->lo_refcnt) > 0) {
+		spin_unlock(&loop_open_spinlock);
 		mutex_unlock(&lo->lo_mutex);
 		ret = -EBUSY;
 		goto mark_visible;
 	}
 	/* Mark this loop device no longer open()-able. */
 	lo->lo_state = Lo_deleting;
+	spin_unlock(&loop_open_spinlock);
 	mutex_unlock(&lo->lo_mutex);
 	loop_remove(lo);
 
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index 918a7a2dc025..082d4b6bfc6a 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -56,7 +56,6 @@ struct loop_device {
 	struct gendisk		*lo_disk;
 	struct mutex		lo_mutex;
 	bool			idr_visible;
-	struct work_struct	rundown_work;
 };
 
 struct loop_cmd {
----------------------------------------

This diff avoids the disk->open_mutex => lo->lo_mutex chain.
ffffffff84a06140 OPS: 7944 FD: 305 BD: 4 +.+.: &disk->open_mutex -> [ffffffffa0158a48] sd_ref_mutex -> [ffffffff831be200] fs_reclaim -> [ffffffff849d7ee0] &n->list_lock -> [ffffffff8332ce18] depot_lock -> [ffffffff8312d508] tk_core.seq.seqcount -> [ffffffff84a06940] &obj_hash[i].lock -> [ffffffff84a05680] &hctx->lock -> [ffffffff84a05620] &x->wait#20 -> [ffffffff849c10e0] &base->lock -> [ffffffff83d93000] &rq->__lock -> [ffffffff849c1060] (&timer.timer) -> [ffff88811ac44b48] lock#2 -> [ffffffff849cebe0] &____s->seqcount -> [ffffffff84a05420] &q->sysfs_dir_lock -> [ffffffff84a04980] &bdev->bd_size_lock -> [ffffffff831bcdd8] free_vmap_area_lock -> [ffffffff831bcd78] vmap_area_lock -> [ffffffff849e5900] &xa->xa_lock#3 -> [ffff88811ac44390] lock#6 -> [ffffffff849e58f0] &mapping->private_lock -> [ffffffff84a06740] &dd->lock -> [ffffffff82ef8d78] (console_sem).lock -> [ffffffff83307c28] &sb->s_type->i_lock_key#3 -> [ffffffff849e2420] &s->s_inode_list_lock -> [ffffffff831a9f68] pcpu_alloc_mutex -> [ffffffff84b8dd60] &x->wait#9 -> [ffffffff84b77980] &k->list_lock -> [ffff88811ac45780] lock#3 -> [ffffffff849ea3e0] &root->kernfs_rwsem -> [ffffffff8335b830] bus_type_sem -> [ffffffff8321b1b8] sysfs_symlink_target_lock -> [ffffffff84b8d7c0] &dev->power.lock -> [ffffffff833e2e28] dpm_list_mtx -> [ffffffff833e03b8] req_lock -> [ffffffff83d8e820] &p->pi_lock -> [ffffffff84b8dbc0] &x->wait#10 -> [ffffffff84b77960] &k->k_lock -> [ffffffff84a06180] subsys mutex#48 -> [ffffffff84a061a0] &xa->xa_lock#5 -> [ffffffff82e14698] inode_hash_lock -> [ffffffff831bcf98] purge_vmap_area_lock -> [ffffffff849cd420] &lruvec->lru_lock -> [ffffffff849ccec0] &folio_wait_table[i] -> [ffffffff83410a68] sr_ref_mutex -> [ffffffff84a06240] &ev->block_mutex -> [ffffffff84a06220] &ev->lock -> [ffffffff831e7398] sb_lock -> [ffffffff84b8f280] &cd->lock -> [ffffffff82ea6878] pgd_lock -> [ffffffff82e14758] bdev_lock -> [ffffffff849ce0e0] &wb->list_lock -> [ffffffffa04b9378] loop_open_spinlock -> [ffffffff83d8e800] &____s->seqcount#2 -> [ffffffff849c0b00] rcu_node_0 -> [ffffffff83d94320] &lock->wait_lock ffffffffa04b90c0 OPS: 3257 FD: 280 BD: 1 +.+.: &lo->lo_mutex -> [ffffffff831be200] fs_reclaim -> [ffffffff849d7ee0] &n->list_lock -> [ffffffff8332ce18] depot_lock -> [ffffffff82ec1450] cpu_hotplug_lock -> [ffffffff82ed3848] wq_pool_mutex -> [ffffffff83330448] uevent_sock_mutex -> [ffffffff84a06940] &obj_hash[i].lock -> [ffffffff831e7398] sb_lock -> [ffffffff83d8e800] &____s->seqcount#2 -> [ffff88811ac44b48] lock#2 -> [ffffffff849cebe0] &____s->seqcount -> [ffff88811ac45780] lock#3 -> [ffffffff849ea3e0] &root->kernfs_rwsem -> [ffffffff83d94330] &sem->wait_lock -> [ffffffff83d8e820] &p->pi_lock -> [ffffffff84a04980] &bdev->bd_size_lock -> [ffffffff82ef8d78] (console_sem).lock -> [ffffffff84a05480] &q->mq_freeze_lock -> [ffffffff83322e18] percpu_ref_switch_lock -> [ffffffff84a05460] &q->mq_freeze_wq -> [ffffffff83d93000] &rq->__lock -> [ffffffff831cf1a0] quarantine_lock -> [ffffffff849e57a0] &dentry->d_lock -> [ffffffff83d94320] &lock->wait_lock -> [ffffffff83110f98] console_owner_lock I think lo_open() and lo_release() are doing too much things. I wish "struct block_device_operations" provides open() and release() callbacks without disk->open_mutex held...
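A note for readers following the task-work discussion above: the pattern that all of these proposals build on is small and worth seeing in isolation. The sketch below is illustrative only; the names (deferred_ctx, schedule_deferred) are invented, and only the documented <linux/task_work.h> and workqueue APIs are assumed. The key properties are that a task work runs when the current task returns to user mode, that kernel threads never return to user mode and therefore need a workqueue fallback, and that task_work_add() fails for an exiting task.

----------------------------------------
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/task_work.h>
#include <linux/workqueue.h>

struct deferred_ctx {
	union {
		struct callback_head cb;	/* task-work path */
		struct work_struct ws;		/* workqueue fallback */
	};
	/* ... driver-specific state ... */
};

static void do_deferred(struct deferred_ctx *ctx)
{
	/* Runs without any caller-held locks, e.g. disk->open_mutex. */
	kfree(ctx);
}

static void deferred_taskworkfn(struct callback_head *cb)
{
	do_deferred(container_of(cb, struct deferred_ctx, cb));
}

static void deferred_workfn(struct work_struct *ws)
{
	do_deferred(container_of(ws, struct deferred_ctx, ws));
}

static void schedule_deferred(struct deferred_ctx *ctx)
{
	if (!(current->flags & PF_KTHREAD)) {
		/* Runs before the current syscall returns to user mode. */
		init_task_work(&ctx->cb, deferred_taskworkfn);
		if (!task_work_add(current, &ctx->cb, TWA_RESUME))
			return;
		/* task_work_add() failed: the task is exiting. */
	}
	INIT_WORK(&ctx->ws, deferred_workfn);
	queue_work(system_long_wq, &ctx->ws);
}
----------------------------------------

The PF_KTHREAD test, TWA_RESUME, the callback_head/work_struct union and the system_long_wq fallback all mirror what the patches in this thread do.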
On Fri 14-01-22 20:05:31, Tetsuo Handa wrote: > On 2022/01/14 0:23, Jan Kara wrote: > > Well, we cannot guarantee what will be state of the loop device when you > > open it but I think we should guarantee that once you have loop device > > open, it will not be torn down under your hands. And now that I have > > realized there are those lo_state checks, I think everything is fine in > > that regard. I wanted to make sure that sequence such as: > > Well, we could abort __loop_clr_fd() if somebody called "open()", something > like below. But > > ---------------------------------------- > diff --git a/drivers/block/loop.c b/drivers/block/loop.c > index b1b05c45c07c..960db2c484ab 100644 > --- a/drivers/block/loop.c > +++ b/drivers/block/loop.c > @@ -1082,7 +1082,7 @@ static int loop_configure(struct loop_device *lo, fmode_t mode, > return error; > } > > -static void __loop_clr_fd(struct loop_device *lo) > +static bool __loop_clr_fd(struct loop_device *lo, int expected_refcnt) > { > struct file *filp; > gfp_t gfp = lo->old_gfp_mask; > @@ -1104,9 +1104,19 @@ static void __loop_clr_fd(struct loop_device *lo) > * Since this function is called upon "ioctl(LOOP_CLR_FD)" xor "close() > * after ioctl(LOOP_CLR_FD)", it is a sign of something going wrong if > * lo->lo_state has changed while waiting for lo->lo_mutex. > + * > + * However, if somebody called "open()" after lo->lo_state became > + * Lo_rundown, we should abort rundown in order to avoid unexpected > + * I/O error. > */ > mutex_lock(&lo->lo_mutex); > BUG_ON(lo->lo_state != Lo_rundown); > + if (atomic_read(&lo->lo_refcnt) != expected_refcnt) { > + lo->lo_state = Lo_bound; > + lo->lo_flags |= LO_FLAGS_AUTOCLEAR; > + mutex_unlock(&lo->lo_mutex); > + return false; > + } > mutex_unlock(&lo->lo_mutex); Yeah, but as I wrote in my email, I don't think this is needed anymore (and I even think it would be counterproductive). There can be new opens happening before __loop_clr_fd() but any ioctl querying loop device state will return error due to lo->lo_state == Lo_rundown. So from userspace POV the loop device is already invalidated. > > Currently, we may hold both. With your async patch we hold only lo_mutex. > > Now that I better understand the nature of the deadlock, I agree that > > holding either still creates the deadlock possibility because both are > > acquired on loop device open. But now that I reminded myself the lo_state > > handling, I think the following should be safe in __loop_clr_fd: > > > > /* Just a safety check... */ > > if (WARN_ON_ONCE(data_race(lo->lo_state) != Lo_rundown)) > > return -ENXIO; > > > > this is still racy, for somebody can reach lo_open() right after this check. Yes, somebody can open but he cannot change lo->lo_state. Also this should be just a safety check. We should never reach __loop_clr_fd() with different lo_state. > Anyway, since the problem that umount() immediately after close() (reported by > kernel test robot) remains, we need to make sure that __loop_clr_fd() completes > before close() returns to user mode. I agree with this. But using task_work for __loop_clr_fd() is enough for that. I was just arguing that we don't need extra waiting in lo_open(). Honza
On Sat 15-01-22 00:50:32, Tetsuo Handa wrote: > On 2022/01/13 19:44, Jan Kara wrote: > > OK, so I think I understand the lockdep complaint better. Lockdep > > essentially complains about the following scenario: > > > > blkdev_put() > > lock disk->open_mutex > > lo_release > > __loop_clr_fd() > > | > > | wait for IO to complete > > v > > loop worker > > write to backing file > > sb_start_write(file) > > | > > | wait for fs with backing file to unfreeze > > v > > process holding fs frozen > > freeze_super() > > | > > | wait for remaining writers on the fs so that fs can be frozen > > v > > sendfile() > > sb_start_write() > > do_splice_direct() > > | > > | blocked on read from /sys/power/resume, waiting for kernfs file > > | lock > > v > > write() "/dev/loop0" (in whatever form) to /sys/power/resume > > calls blkdev_get_by_dev("/dev/loop0") > > lock disk->open_mutex => deadlock > > > > Then, not calling flush_workqueue() from destroy_workqueue() from > __loop_clr_fd() will not help because the reason we did not need to call > flush_workqueue() is that blk_mq_freeze_queue() waits until all "struct > work_struct" which flush_workqueue() would have waited completes? Yes. > If flush_workqueue() cannot complete because an I/O request cannot > complete, blk_mq_freeze_queue() as well cannot complete because it waits > for I/O requests which flush_workqueue() would have waited? > > Then, any attempt to revert commit 322c4293ecc58110 ("loop: make > autoclear operation asynchronous") > (e.g. > https://lkml.kernel.org/r/4e7b711f-744b-3a78-39be-c9432a3cecd2@i-love.sakura.ne.jp > ) cannot be used? Well, it could be used but the deadlock would be reintroduced. There are still possibilities to fix it in some other way. But for now I think that async loop cleanup is probably the least painful solution. > Now that commit 322c4293ecc58110 already arrived at linux.git, we need to > quickly send a fixup patch for these reported regressions. This "[PATCH > v2 2/2] loop: use task_work for autoclear operation" did not address "if > (lo->lo_state == Lo_bound) { }" part. If we address this part, something > like below diff? Please no. That is too ugly to live. I'd go just with attached version of your solution. That should fix the xfstests regression. The LTP regression needs to be fixed in mount. Honza
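Jan's point about flush_workqueue() being redundant can be written out explicitly. A hedged sketch of the rundown ordering, assuming the lo->workqueue based design of that era's loop driver; this is an illustration of the argument, not a proposed patch:

----------------------------------------
static void loop_rundown_order_sketch(struct loop_device *lo)
{
	/*
	 * blk_mq_freeze_queue() waits for all in-flight requests, and each
	 * item queued on lo->workqueue exists only to complete such requests
	 * via loop_process_work().  Once the freeze returns, the workqueue is
	 * therefore already drained, and the flush_workqueue() implied by
	 * destroy_workqueue() has nothing left to wait for.  Conversely, if
	 * that flush would deadlock (I/O stuck behind a frozen filesystem),
	 * blk_mq_freeze_queue() deadlocks the same way, which is why merely
	 * removing the flush does not remove the problematic wait.
	 */
	blk_mq_freeze_queue(lo->lo_queue);
	destroy_workqueue(lo->workqueue);
}
----------------------------------------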
On 2022/01/15 4:58, Jan Kara wrote:
>> Now that commit 322c4293ecc58110 already arrived at linux.git, we need to
>> quickly send a fixup patch for these reported regressions. This "[PATCH
>> v2 2/2] loop: use task_work for autoclear operation" did not address "if
>> (lo->lo_state == Lo_bound) { }" part. If we address this part, something
>> like below diff?
>
> Please no. That is too ugly to live. I'd go just with attached version of
> your solution. That should fix the xfstests regression. The LTP regression
> needs to be fixed in mount.

Yes, this patch is ugly. Since the disk->open_mutex => lo->lo_mutex dependency
is recorded by lo_open()/lo_release(), and blk_mq_freeze_queue() called by e.g.
loop_set_status() waits for I/O completion with lo->lo_mutex held, from a
locking dependency chain perspective, waiting for I/O completion with
disk->open_mutex held still remains. Therefore, this ugly patch kills the
disk->open_mutex => lo->lo_mutex dependency by not holding lo->lo_mutex in
lo_open()/lo_release().

Waiting for a Lo_rundown device after lo_open() is a freebie until a fixed
version of /bin/mount is in wide use
( https://lkml.kernel.org/r/20220113154735.hdzi4cqsz5jt6asp@quack3.lan ).

>> I think lo_open() and lo_release() are doing too much things. I wish "struct block_device_operations"
>> provides open() and release() callbacks without disk->open_mutex held...

Christoph is not a fan of proliferating the use of task_work_add(). Can we go
with exporting task_work_add() for this release cycle? Or instead can we go
with providing a release() callback without disk->open_mutex held
( https://lkml.kernel.org/r/08d703d1-8b32-ec9b-2b50-54b8376d3d40@i-love.sakura.ne.jp ) ?
On Sat, Jan 15, 2022 at 09:34:10AM +0900, Tetsuo Handa wrote: > Christoph is not a fan of proliferating the use of task_work_add(). Can we go with exporting task_work_add() Not a fan != NAK. If we can't think of anything better we'll have to do that. Note that I also have a task_work_add API cleanup pending that makes it a lot less ugly. > for this release cycle? Or instead can we go with providing release() callback without disk->open_mutex held > ( https://lkml.kernel.org/r/08d703d1-8b32-ec9b-2b50-54b8376d3d40@i-love.sakura.ne.jp ) ? This one OTOH is a hard NAK as this is an API that will just cause a lot of problems.
On 2022/01/17 17:15, Christoph Hellwig wrote: > On Sat, Jan 15, 2022 at 09:34:10AM +0900, Tetsuo Handa wrote: >> Christoph is not a fan of proliferating the use of task_work_add(). Can we go with exporting task_work_add() > > Not a fan != NAK. If we can't think of anything better we'll have to do > that. Note that I also have a task_work_add API cleanup pending that makes > it a lot less ugly. > >> for this release cycle? Or instead can we go with providing release() callback without disk->open_mutex held >> ( https://lkml.kernel.org/r/08d703d1-8b32-ec9b-2b50-54b8376d3d40@i-love.sakura.ne.jp ) ? > > This one OTOH is a hard NAK as this is an API that will just cause a lot > of problems. What problems can you think of with [PATCH 1/4] below? I found that patches below are robuster than task_work_add() approach because the loop module does not need to know about refcounts which the core block layer manipulates. If we go with task_work_add() approach, the loop module needs to be updated in sync with refcount manipulations in the core block layer. Subject: [PATCH 1/4] block: add post_open/post_release block device callbacks Commit 322c4293ecc58110 ("loop: make autoclear operation asynchronous") silenced a circular locking dependency warning, but caused a breakage for /bin/mount and /bin/umount users. Further analysis by Jan revealed that waiting for I/O completion with disk->open_mutex held has possibility of deadlock. We need to fix this breakage, without waiting for I/O completion with disk->open_mutex held. We can't temporarily drop disk->open_mutex inside open and release block device callbacks. We could export task_work_add() for the loop module, but Christoph is not a fan of proliferating the use of task_work_add(). Therefore, add post_open and post_release block device callbacks which allows performing an extra operation without holding disk->open_mutex after returning from open and release block device callbacks respectively. 
Cc: Jan Kara <jack@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> --- block/bdev.c | 30 +++++++++++++++++++++--------- include/linux/blkdev.h | 6 ++++++ 2 files changed, 27 insertions(+), 9 deletions(-) diff --git a/block/bdev.c b/block/bdev.c index 8bf93a19041b..efde8ffd0df6 100644 --- a/block/bdev.c +++ b/block/bdev.c @@ -683,15 +683,16 @@ static int blkdev_get_whole(struct block_device *bdev, fmode_t mode) if (test_bit(GD_NEED_PART_SCAN, &disk->state)) bdev_disk_changed(disk, false); bdev->bd_openers++; - return 0;; + return 0; } -static void blkdev_put_whole(struct block_device *bdev, fmode_t mode) +static bool blkdev_put_whole(struct block_device *bdev, fmode_t mode) { if (!--bdev->bd_openers) blkdev_flush_mapping(bdev); if (bdev->bd_disk->fops->release) bdev->bd_disk->fops->release(bdev->bd_disk, mode); + return true; } static int blkdev_get_part(struct block_device *part, fmode_t mode) @@ -721,15 +722,15 @@ static int blkdev_get_part(struct block_device *part, fmode_t mode) return ret; } -static void blkdev_put_part(struct block_device *part, fmode_t mode) +static bool blkdev_put_part(struct block_device *part, fmode_t mode) { struct block_device *whole = bdev_whole(part); if (--part->bd_openers) - return; + return false; blkdev_flush_mapping(part); whole->bd_disk->open_partitions--; - blkdev_put_whole(whole, mode); + return blkdev_put_whole(whole, mode); } struct block_device *blkdev_get_no_open(dev_t dev) @@ -784,6 +785,7 @@ struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode, void *holder) bool unblock_events = true; struct block_device *bdev; struct gendisk *disk; + bool release_called = false; int ret; ret = devcgroup_check_permission(DEVCG_DEV_BLOCK, @@ -812,10 +814,13 @@ struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode, void *holder) goto abort_claiming; if (!try_module_get(disk->fops->owner)) goto abort_claiming; - if (bdev_is_partition(bdev)) + if (bdev_is_partition(bdev)) { ret = blkdev_get_part(bdev, mode); - else + if (ret) + release_called = true; + } else { ret = blkdev_get_whole(bdev, mode); + } if (ret) goto put_module; if (mode & FMODE_EXCL) { @@ -835,6 +840,8 @@ struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode, void *holder) } } mutex_unlock(&disk->open_mutex); + if (bdev->bd_disk->fops->post_open) + bdev->bd_disk->fops->post_open(bdev->bd_disk); if (unblock_events) disk_unblock_events(disk); @@ -845,6 +852,8 @@ struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode, void *holder) if (mode & FMODE_EXCL) bd_abort_claiming(bdev, holder); mutex_unlock(&disk->open_mutex); + if (release_called && bdev->bd_disk->fops->post_release) + bdev->bd_disk->fops->post_release(bdev->bd_disk); disk_unblock_events(disk); put_blkdev: blkdev_put_no_open(bdev); @@ -893,6 +902,7 @@ EXPORT_SYMBOL(blkdev_get_by_path); void blkdev_put(struct block_device *bdev, fmode_t mode) { struct gendisk *disk = bdev->bd_disk; + bool release_called; /* * Sync early if it looks like we're the last one. 
If someone else @@ -944,10 +954,12 @@ void blkdev_put(struct block_device *bdev, fmode_t mode) disk_flush_events(disk, DISK_EVENT_MEDIA_CHANGE); if (bdev_is_partition(bdev)) - blkdev_put_part(bdev, mode); + release_called = blkdev_put_part(bdev, mode); else - blkdev_put_whole(bdev, mode); + release_called = blkdev_put_whole(bdev, mode); mutex_unlock(&disk->open_mutex); + if (release_called && bdev->bd_disk->fops->post_release) + bdev->bd_disk->fops->post_release(bdev->bd_disk); module_put(disk->fops->owner); blkdev_put_no_open(bdev); diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 9c95df26fc26..f35e92fd72d0 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -1227,6 +1227,12 @@ struct block_device_operations { * driver. */ int (*alternative_gpt_sector)(struct gendisk *disk, sector_t *sector); + /* + * Special callback for doing extra operations without + * disk->open_mutex held. Used by loop driver. + */ + void (*post_open)(struct gendisk *disk); + void (*post_release)(struct gendisk *disk); }; #ifdef CONFIG_COMPAT
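The follow-up patches of this series ([PATCH 2/4] onward) are not shown here, but the later messages in the thread make it possible to sketch how the loop driver would hook into the proposed callbacks. The sketch below is pieced together from the lo_post_open()/lo_post_release() code that appears later in the thread; loop_rundown_wait and Lo_rundown come from those patches, and the exact shape of the real series may differ:

----------------------------------------
/* Called after open(), without disk->open_mutex held. */
static void lo_post_open(struct gendisk *disk)
{
	struct loop_device *lo = disk->private_data;

	/* Wait for a concurrent lo_post_release() to leave lo->lo_mutex. */
	if (mutex_lock_killable(&lo->lo_mutex) == 0)
		mutex_unlock(&lo->lo_mutex);
	/* Then wait for __loop_clr_fd() if the device was in Lo_rundown. */
	wait_event_killable(loop_rundown_wait,
			    data_race(lo->lo_state) != Lo_rundown);
}

static const struct block_device_operations lo_fops = {
	.owner		= THIS_MODULE,
	.open		= lo_open,
	.post_open	= lo_post_open,		/* proposed hook */
	.post_release	= lo_post_release,	/* proposed hook */
	.ioctl		= lo_ioctl,
};
----------------------------------------

Because both hooks run outside disk->open_mutex, the waits above cannot contribute to the disk->open_mutex locking chain that lockdep complained about.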
On 2022/01/17 18:34, Tetsuo Handa wrote: > On 2022/01/17 17:15, Christoph Hellwig wrote: >> On Sat, Jan 15, 2022 at 09:34:10AM +0900, Tetsuo Handa wrote: >>> Christoph is not a fan of proliferating the use of task_work_add(). Can we go with exporting task_work_add() >> >> Not a fan != NAK. If we can't think of anything better we'll have to do >> that. Note that I also have a task_work_add API cleanup pending that makes >> it a lot less ugly. >> >>> for this release cycle? Or instead can we go with providing release() callback without disk->open_mutex held >>> ( https://lkml.kernel.org/r/08d703d1-8b32-ec9b-2b50-54b8376d3d40@i-love.sakura.ne.jp ) ? >> >> This one OTOH is a hard NAK as this is an API that will just cause a lot >> of problems. > > What problems can you think of with [PATCH 1/4] below? > > I found that patches below are robuster than task_work_add() approach because > the loop module does not need to know about refcounts which the core block layer > manipulates. If we go with task_work_add() approach, the loop module needs to be > updated in sync with refcount manipulations in the core block layer. > For your information, below is how task_work_add() approach would look like. Despite full of refcount management, cannot provide synchronous autoclear operation if closed by kernel threads, cannot provide synchronous waiting if opened by kernel threads, possibility to fail to run autoclear operation when open by user threads failed... What a mess! --- block/bdev.c | 30 +++------- drivers/block/loop.c | 125 ++++++++++++++++++++++++++++++++++++----- include/linux/blkdev.h | 6 -- kernel/task_work.c | 1 + 4 files changed, 121 insertions(+), 41 deletions(-) diff --git a/block/bdev.c b/block/bdev.c index efde8ffd0df6..8bf93a19041b 100644 --- a/block/bdev.c +++ b/block/bdev.c @@ -683,16 +683,15 @@ static int blkdev_get_whole(struct block_device *bdev, fmode_t mode) if (test_bit(GD_NEED_PART_SCAN, &disk->state)) bdev_disk_changed(disk, false); bdev->bd_openers++; - return 0; + return 0;; } -static bool blkdev_put_whole(struct block_device *bdev, fmode_t mode) +static void blkdev_put_whole(struct block_device *bdev, fmode_t mode) { if (!--bdev->bd_openers) blkdev_flush_mapping(bdev); if (bdev->bd_disk->fops->release) bdev->bd_disk->fops->release(bdev->bd_disk, mode); - return true; } static int blkdev_get_part(struct block_device *part, fmode_t mode) @@ -722,15 +721,15 @@ static int blkdev_get_part(struct block_device *part, fmode_t mode) return ret; } -static bool blkdev_put_part(struct block_device *part, fmode_t mode) +static void blkdev_put_part(struct block_device *part, fmode_t mode) { struct block_device *whole = bdev_whole(part); if (--part->bd_openers) - return false; + return; blkdev_flush_mapping(part); whole->bd_disk->open_partitions--; - return blkdev_put_whole(whole, mode); + blkdev_put_whole(whole, mode); } struct block_device *blkdev_get_no_open(dev_t dev) @@ -785,7 +784,6 @@ struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode, void *holder) bool unblock_events = true; struct block_device *bdev; struct gendisk *disk; - bool release_called = false; int ret; ret = devcgroup_check_permission(DEVCG_DEV_BLOCK, @@ -814,13 +812,10 @@ struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode, void *holder) goto abort_claiming; if (!try_module_get(disk->fops->owner)) goto abort_claiming; - if (bdev_is_partition(bdev)) { + if (bdev_is_partition(bdev)) ret = blkdev_get_part(bdev, mode); - if (ret) - release_called = true; - } else { + else ret = blkdev_get_whole(bdev, 
mode); - } if (ret) goto put_module; if (mode & FMODE_EXCL) { @@ -840,8 +835,6 @@ struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode, void *holder) } } mutex_unlock(&disk->open_mutex); - if (bdev->bd_disk->fops->post_open) - bdev->bd_disk->fops->post_open(bdev->bd_disk); if (unblock_events) disk_unblock_events(disk); @@ -852,8 +845,6 @@ struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode, void *holder) if (mode & FMODE_EXCL) bd_abort_claiming(bdev, holder); mutex_unlock(&disk->open_mutex); - if (release_called && bdev->bd_disk->fops->post_release) - bdev->bd_disk->fops->post_release(bdev->bd_disk); disk_unblock_events(disk); put_blkdev: blkdev_put_no_open(bdev); @@ -902,7 +893,6 @@ EXPORT_SYMBOL(blkdev_get_by_path); void blkdev_put(struct block_device *bdev, fmode_t mode) { struct gendisk *disk = bdev->bd_disk; - bool release_called; /* * Sync early if it looks like we're the last one. If someone else @@ -954,12 +944,10 @@ void blkdev_put(struct block_device *bdev, fmode_t mode) disk_flush_events(disk, DISK_EVENT_MEDIA_CHANGE); if (bdev_is_partition(bdev)) - release_called = blkdev_put_part(bdev, mode); + blkdev_put_part(bdev, mode); else - release_called = blkdev_put_whole(bdev, mode); + blkdev_put_whole(bdev, mode); mutex_unlock(&disk->open_mutex); - if (release_called && bdev->bd_disk->fops->post_release) - bdev->bd_disk->fops->post_release(bdev->bd_disk); module_put(disk->fops->owner); blkdev_put_no_open(bdev); diff --git a/drivers/block/loop.c b/drivers/block/loop.c index cd19af969209..cfa6ab7c315a 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -1695,6 +1695,38 @@ static int lo_compat_ioctl(struct block_device *bdev, fmode_t mode, } #endif +static void lo_post_open(struct gendisk *disk) +{ + struct loop_device *lo = disk->private_data; + + /* Wait for lo_post_release() to leave lo->lo_mutex section. */ + if (mutex_lock_killable(&lo->lo_mutex)) + return; + mutex_unlock(&lo->lo_mutex); + /* Also wait for __loop_clr_fd() to complete if Lo_rundown was set. */ + wait_event_killable(loop_rundown_wait, data_race(lo->lo_state) != Lo_rundown); +} + +struct loop_open_task { + struct callback_head cb; + struct loop_device *lo; +}; + +static void loop_open_callbackfn(struct callback_head *callback) +{ + struct loop_open_task *lot = + container_of(callback, struct loop_open_task, cb); + struct gendisk *disk = lot->lo->lo_disk; + + lo_post_open(disk); + /* + * Since I didn't hold references, I can't call lo_post_release(). + * That is, I won't handle __loop_clr_fd() if I was the last user. + */ + atomic_dec(&lot->lo->lo_refcnt); + kfree(lot); +} + static int lo_open(struct block_device *bdev, fmode_t mode) { struct loop_device *lo = bdev->bd_disk->private_data; @@ -1707,21 +1739,29 @@ static int lo_open(struct block_device *bdev, fmode_t mode) else atomic_inc(&lo->lo_refcnt); spin_unlock(&loop_delete_spinlock); + /* Try to avoid accessing Lo_rundown loop device. */ + if (!(current->flags & PF_KTHREAD)) { + struct loop_open_task *lot = + kmalloc(sizeof(*lot), GFP_KERNEL | __GFP_NOWARN); + + if (lot) { + lot->lo = lo; + init_task_work(&lot->cb, loop_open_callbackfn); + if (task_work_add(current, &lot->cb, TWA_RESUME)) + kfree(lot); + /* + * Needs an extra reference for avoiding use-after-free + * when an error occurred before returning to user mode + * because the task work list is LIFO. But that in turn + * bothers me with dropping this extra reference. 
+ */ + else + atomic_inc(&lo->lo_refcnt); + } + } return err; } -static void lo_post_open(struct gendisk *disk) -{ - struct loop_device *lo = disk->private_data; - - /* Wait for lo_post_release() to leave lo->lo_mutex section. */ - if (mutex_lock_killable(&lo->lo_mutex)) - return; - mutex_unlock(&lo->lo_mutex); - /* Also wait for __loop_clr_fd() to complete if Lo_rundown was set. */ - wait_event_killable(loop_rundown_wait, data_race(lo->lo_state) != Lo_rundown); -} - static void lo_post_release(struct gendisk *disk) { struct loop_device *lo = disk->private_data; @@ -1753,11 +1793,68 @@ static void lo_post_release(struct gendisk *disk) mutex_unlock(&lo->lo_mutex); } +struct loop_release_task { + union { + struct callback_head cb; + struct work_struct ws; + }; + struct loop_device *lo; +}; + +static void loop_release_workfn(struct work_struct *work) +{ + struct loop_release_task *lrt = + container_of(work, struct loop_release_task, ws); + struct loop_device *lo = lrt->lo; + struct gendisk *disk = lo->lo_disk; + + lo_post_release(disk); + /* Drop references which will be dropped after lo_release(). */ + kobject_put(&disk_to_dev(disk)->kobj); + module_put(disk->fops->owner); + kfree(lrt); + /* Drop a reference to allow loop_exit(). */ + module_put(THIS_MODULE); +} + +static void loop_release_callbackfn(struct callback_head *callback) +{ + struct loop_release_task *lrt = + container_of(callback, struct loop_release_task, cb); + + loop_release_workfn(&lrt->ws); +} + +static void lo_release(struct gendisk *disk, fmode_t mode) +{ + struct loop_device *lo = disk->private_data; + struct loop_release_task *lrt = kmalloc(sizeof(*lrt), + GFP_KERNEL | __GFP_NOFAIL); + + /* Hold a reference to disallow loop_exit(). */ + __module_get(THIS_MODULE); + /* Hold references which will be dropped after lo_release(). */ + __module_get(disk->fops->owner); + kobject_get(&disk_to_dev(disk)->kobj); + /* Clear this loop device. */ + lrt->lo = lo; + if (!(current->flags & PF_KTHREAD)) { + /* + * Prefer task work so that clear operation completes + * before close() returns to user mode. + */ + init_task_work(&lrt->cb, loop_release_callbackfn); + if (!task_work_add(current, &lrt->cb, TWA_RESUME)) + return; + } + INIT_WORK(&lrt->ws, loop_release_workfn); + queue_work(system_long_wq, &lrt->ws); +} + static const struct block_device_operations lo_fops = { .owner = THIS_MODULE, .open = lo_open, - .post_open = lo_post_open, - .post_release = lo_post_release, + .release = lo_release, .ioctl = lo_ioctl, #ifdef CONFIG_COMPAT .compat_ioctl = lo_compat_ioctl, diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index f35e92fd72d0..9c95df26fc26 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -1227,12 +1227,6 @@ struct block_device_operations { * driver. */ int (*alternative_gpt_sector)(struct gendisk *disk, sector_t *sector); - /* - * Special callback for doing extra operations without - * disk->open_mutex held. Used by loop driver. - */ - void (*post_open)(struct gendisk *disk); - void (*post_release)(struct gendisk *disk); }; #ifdef CONFIG_COMPAT diff --git a/kernel/task_work.c b/kernel/task_work.c index 1698fbe6f0e1..2a1644189182 100644 --- a/kernel/task_work.c +++ b/kernel/task_work.c @@ -60,6 +60,7 @@ int task_work_add(struct task_struct *task, struct callback_head *work, return 0; } +EXPORT_SYMBOL_GPL(task_work_add); /** * task_work_cancel_match - cancel a pending work added by task_work_add()
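A side note on the "task_work list is LIFO" comment in lo_open() above: task_work_add() pushes at the head of the per-task list, and the list is walked from its head on return to user mode, so works run in reverse order of addition. When open() fails after lo_open() succeeded, fput() queues the file's release after lo_open() queued its wait, so the release-side work runs first. A minimal illustration with invented callback names:

----------------------------------------
#include <linux/sched.h>
#include <linux/task_work.h>

static void first_added(struct callback_head *cb)  { /* runs second */ }
static void second_added(struct callback_head *cb) { /* runs first */ }

static void lifo_demo(struct callback_head *a, struct callback_head *b)
{
	init_task_work(a, first_added);
	if (task_work_add(current, a, TWA_RESUME))
		return;
	init_task_work(b, second_added);
	task_work_add(current, b, TWA_RESUME);
	/*
	 * On return to user mode, second_added() runs before first_added().
	 * This is why the patch above takes an extra lo->lo_refcnt
	 * reference: by the time lo_open()'s task work runs, the release
	 * may already have happened.
	 */
}
----------------------------------------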
On 2022/01/17 23:10, Tetsuo Handa wrote: > On 2022/01/17 18:34, Tetsuo Handa wrote: >> On 2022/01/17 17:15, Christoph Hellwig wrote: >>> On Sat, Jan 15, 2022 at 09:34:10AM +0900, Tetsuo Handa wrote: >>>> Christoph is not a fan of proliferating the use of task_work_add(). Can we go with exporting task_work_add() >>> >>> Not a fan != NAK. If we can't think of anything better we'll have to do >>> that. Note that I also have a task_work_add API cleanup pending that makes >>> it a lot less ugly. >>> >>>> for this release cycle? Or instead can we go with providing release() callback without disk->open_mutex held >>>> ( https://lkml.kernel.org/r/08d703d1-8b32-ec9b-2b50-54b8376d3d40@i-love.sakura.ne.jp ) ? >>> >>> This one OTOH is a hard NAK as this is an API that will just cause a lot >>> of problems. >> >> What problems can you think of with [PATCH 1/4] below? >> >> I found that patches below are robuster than task_work_add() approach because >> the loop module does not need to know about refcounts which the core block layer >> manipulates. If we go with task_work_add() approach, the loop module needs to be >> updated in sync with refcount manipulations in the core block layer. >> > > For your information, below is how task_work_add() approach would look like. > Despite full of refcount management, cannot provide synchronous autoclear > operation if closed by kernel threads, cannot provide synchronous waiting if > opened by kernel threads, possibility to fail to run autoclear operation when > open by user threads failed... What a mess! > I found a bit simpler refcount management. This should fix both /bin/mount and /bin/umount breakage. Can we go with this approach? From b0ae6e632f0d4980755364c822223b32e26cfbcd Mon Sep 17 00:00:00 2001 From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Date: Wed, 19 Jan 2022 00:42:52 +0900 Subject: [PATCH] loop: don't hold lo->lo_mutex from lo_open() and lo_release() Commit 322c4293ecc58110 ("loop: make autoclear operation asynchronous") silenced a circular locking dependency warning, but further analysis by Jan Kara revealed that fundamental problem is that waiting for I/O completion (from blk_mq_freeze_queue()) with disk->open_mutex held has possibility of deadlock. We need to fix this breakage, without waiting for I/O completion with disk->open_mutex. Since disk->open_mutex => lo->lo_mutex dependency is recorded by lo_open() and lo_release(), and blk_mq_freeze_queue() by e.g. loop_set_status() waits for I/O completion with lo->lo_mutex held, from locking dependency chain perspective we need to kill disk->open_mutex => lo->lo_mutex dependency as well. This patch does the following things. (1) Revert commit 322c4293ecc58110, for moving only autoclear operation to WQ context caused a breakage for /bin/mount and /bin/umount users. (2) Move whole lo_release() operation to task_work context (if possible) or WQ context (otherwise). disk->open_mutex => lo->lo_mutex dependency by lo_release() will be avoided by this change. /bin/umount breakage will be avoided by running whole lo_release() operation from task_work context. (3) Split lo_open() into "holding a reference" part and "serializing between lo_open() and lo_release()" part, and replace lo->lo_mutex from the former part with a spinlock, and defer the latter part to task_work context (if possible) or just give up (otherwise). disk->open_mutex => lo->lo_mutex dependency by lo_open() will be avoided by the former part of this change. 
/bin/mount breakage will be avoided by running the latter part from task_work context. Reported-by: kernel test robot <oliver.sang@intel.com> Reported-by: Jan Stancek <jstancek@redhat.com> Reported-by: Mike Galbraith <efault@gmx.de> Analyzed-by: Jan Kara <jack@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Fixes: 322c4293ecc58110 ("loop: make autoclear operation asynchronous") --- drivers/block/loop.c | 225 +++++++++++++++++++++++++++++++------------ drivers/block/loop.h | 2 +- 2 files changed, 164 insertions(+), 63 deletions(-) diff --git a/drivers/block/loop.c b/drivers/block/loop.c index b1b05c45c07c..2bbc1195c3fc 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -89,6 +89,8 @@ static DEFINE_IDR(loop_index_idr); static DEFINE_MUTEX(loop_ctl_mutex); static DEFINE_MUTEX(loop_validate_mutex); +static DEFINE_SPINLOCK(loop_delete_spinlock); +static DECLARE_WAIT_QUEUE_HEAD(loop_rundown_wait); /** * loop_global_lock_killable() - take locks for safe loop_validate_file() test @@ -1165,40 +1167,13 @@ static void __loop_clr_fd(struct loop_device *lo) lo->lo_disk->flags |= GENHD_FL_NO_PART; fput(filp); -} - -static void loop_rundown_completed(struct loop_device *lo) -{ mutex_lock(&lo->lo_mutex); lo->lo_state = Lo_unbound; mutex_unlock(&lo->lo_mutex); + wake_up_all(&loop_rundown_wait); module_put(THIS_MODULE); } -static void loop_rundown_workfn(struct work_struct *work) -{ - struct loop_device *lo = container_of(work, struct loop_device, - rundown_work); - struct block_device *bdev = lo->lo_device; - struct gendisk *disk = lo->lo_disk; - - __loop_clr_fd(lo); - kobject_put(&bdev->bd_device.kobj); - module_put(disk->fops->owner); - loop_rundown_completed(lo); -} - -static void loop_schedule_rundown(struct loop_device *lo) -{ - struct block_device *bdev = lo->lo_device; - struct gendisk *disk = lo->lo_disk; - - __module_get(disk->fops->owner); - kobject_get(&bdev->bd_device.kobj); - INIT_WORK(&lo->rundown_work, loop_rundown_workfn); - queue_work(system_long_wq, &lo->rundown_work); -} - static int loop_clr_fd(struct loop_device *lo) { int err; @@ -1229,7 +1204,6 @@ static int loop_clr_fd(struct loop_device *lo) mutex_unlock(&lo->lo_mutex); __loop_clr_fd(lo); - loop_rundown_completed(lo); return 0; } @@ -1721,46 +1695,120 @@ static int lo_compat_ioctl(struct block_device *bdev, fmode_t mode, } #endif +struct loop_open_task { + struct callback_head cb; + struct loop_device *lo; +}; + +struct loop_release_task { + union { + struct list_head head; + struct callback_head cb; + struct work_struct ws; + }; + struct loop_device *lo; +}; + +static LIST_HEAD(release_task_spool); +static DEFINE_SPINLOCK(release_task_spool_spinlock); + +static void lo_post_open(struct gendisk *disk) +{ + struct loop_device *lo = disk->private_data; + + /* Wait for lo_post_release() to leave lo->lo_mutex section. */ + if (mutex_lock_killable(&lo->lo_mutex) == 0) + mutex_unlock(&lo->lo_mutex); + /* Also wait for __loop_clr_fd() to complete if Lo_rundown was set. 
*/ + wait_event_killable(loop_rundown_wait, data_race(lo->lo_state) != Lo_rundown); + atomic_dec(&lo->async_pending); +} + +static void loop_open_callbackfn(struct callback_head *callback) +{ + struct loop_open_task *lot = + container_of(callback, struct loop_open_task, cb); + struct gendisk *disk = lot->lo->lo_disk; + + lo_post_open(disk); + kfree(lot); +} + static int lo_open(struct block_device *bdev, fmode_t mode) { struct loop_device *lo = bdev->bd_disk->private_data; - int err; + int err = 0; + struct loop_open_task *lot; + struct loop_release_task *lrt = + kmalloc(sizeof(*lrt), GFP_KERNEL | __GFP_NOWARN); + + if (!lrt) + return -ENOMEM; + lot = kmalloc(sizeof(*lot), GFP_KERNEL | __GFP_NOWARN); + if (!lot) { + kfree(lrt); + return -ENOMEM; + } - err = mutex_lock_killable(&lo->lo_mutex); - if (err) - return err; - if (lo->lo_state == Lo_deleting) + spin_lock(&loop_delete_spinlock); + /* lo->lo_state may be changed to any Lo_* but Lo_deleting. */ + if (data_race(lo->lo_state) == Lo_deleting) err = -ENXIO; else atomic_inc(&lo->lo_refcnt); - mutex_unlock(&lo->lo_mutex); - return err; + spin_unlock(&loop_delete_spinlock); + if (err) + return err; + /* Add to spool, for -ENOMEM upon release() cannot be handled. */ + spin_lock(&release_task_spool_spinlock); + list_add(&lrt->head, &release_task_spool); + spin_unlock(&release_task_spool_spinlock); + /* Try to avoid accessing Lo_rundown loop device. */ + if (current->flags & PF_KTHREAD) { + kfree(lot); + return 0; + } + lot->lo = lo; + init_task_work(&lot->cb, loop_open_callbackfn); + if (task_work_add(current, &lot->cb, TWA_RESUME)) + kfree(lot); + /* + * Since the task_work list is LIFO, lo_post_release() scheduled by + * lo_release() can run before lo_post_open() scheduled by lo_open() + * runs when an error occurred and fput() scheduled lo_release() before + * returning to user mode . This means that lo->refcnt may be already 0 + * when lo_post_open() runs. Therefore, use lo->async_pending in order + * to prevent loop_remove() from releasing this loop device. + */ + else + atomic_inc(&lo->async_pending); + return 0; } -static void lo_release(struct gendisk *disk, fmode_t mode) +static void lo_post_release(struct gendisk *disk) { struct loop_device *lo = disk->private_data; mutex_lock(&lo->lo_mutex); - if (atomic_dec_return(&lo->lo_refcnt)) - goto out_unlock; + /* Check whether this loop device can be cleared. */ + if (atomic_dec_return(&lo->lo_refcnt) || lo->lo_state != Lo_bound) + goto out_unlock; + /* + * Clear this loop device since nobody is using. Note that since + * lo_open() increments lo->lo_refcnt without holding lo->lo_mutex, + * I might become no longer the last user, but there is a fact that + * there was no user. + * + * In autoclear mode, destroy WQ and remove configuration. + * Otherwise flush possible ongoing bios in WQ and keep configuration. + */ if (lo->lo_flags & LO_FLAGS_AUTOCLEAR) { - if (lo->lo_state != Lo_bound) - goto out_unlock; lo->lo_state = Lo_rundown; mutex_unlock(&lo->lo_mutex); - /* - * In autoclear mode, stop the loop thread - * and remove configuration after last close. - */ - loop_schedule_rundown(lo); + __loop_clr_fd(lo); return; - } else if (lo->lo_state == Lo_bound) { - /* - * Otherwise keep thread (if running) and config, - * but flush possible ongoing bios in thread. 
- */ + } else { blk_mq_freeze_queue(lo->lo_queue); blk_mq_unfreeze_queue(lo->lo_queue); } @@ -1769,6 +1817,57 @@ static void lo_release(struct gendisk *disk, fmode_t mode) mutex_unlock(&lo->lo_mutex); } +static void loop_release_workfn(struct work_struct *work) +{ + struct loop_release_task *lrt = + container_of(work, struct loop_release_task, ws); + struct loop_device *lo = lrt->lo; + struct gendisk *disk = lo->lo_disk; + + lo_post_release(disk); + /* Drop references which will be dropped after lo_release(). */ + kobject_put(&disk_to_dev(disk)->kobj); + module_put(disk->fops->owner); + kfree(lrt); + atomic_dec(&lo->async_pending); +} + +static void loop_release_callbackfn(struct callback_head *callback) +{ + struct loop_release_task *lrt = + container_of(callback, struct loop_release_task, cb); + + loop_release_workfn(&lrt->ws); +} + +static void lo_release(struct gendisk *disk, fmode_t mode) +{ + struct loop_device *lo = disk->private_data; + struct loop_release_task *lrt; + + atomic_inc(&lo->async_pending); + /* Fetch from spool. */ + spin_lock(&release_task_spool_spinlock); + lrt = list_first_entry(&release_task_spool, typeof(*lrt), head); + list_del(&lrt->head); + spin_unlock(&release_task_spool_spinlock); + /* Hold references which will be dropped after lo_release(). */ + __module_get(disk->fops->owner); + kobject_get(&disk_to_dev(disk)->kobj); + /* + * Prefer task work so that clear operation completes + * before close() returns to user mode. + */ + lrt->lo = lo; + if (!(current->flags & PF_KTHREAD)) { + init_task_work(&lrt->cb, loop_release_callbackfn); + if (!task_work_add(current, &lrt->cb, TWA_RESUME)) + return; + } + INIT_WORK(&lrt->ws, loop_release_workfn); + queue_work(system_long_wq, &lrt->ws); +} + static const struct block_device_operations lo_fops = { .owner = THIS_MODULE, .open = lo_open, @@ -2030,6 +2129,7 @@ static int loop_add(int i) if (!part_shift) disk->flags |= GENHD_FL_NO_PART; atomic_set(&lo->lo_refcnt, 0); + atomic_set(&lo->async_pending, 0); mutex_init(&lo->lo_mutex); lo->lo_number = i; spin_lock_init(&lo->lo_lock); @@ -2071,6 +2171,8 @@ static int loop_add(int i) static void loop_remove(struct loop_device *lo) { + while (atomic_read(&lo->async_pending)) + schedule_timeout_uninterruptible(1); /* Make this loop device unreachable from pathname. */ del_gendisk(lo->lo_disk); blk_cleanup_disk(lo->lo_disk); @@ -2119,19 +2221,18 @@ static int loop_control_remove(int idx) ret = mutex_lock_killable(&lo->lo_mutex); if (ret) goto mark_visible; - if (lo->lo_state != Lo_unbound || - atomic_read(&lo->lo_refcnt) > 0) { - mutex_unlock(&lo->lo_mutex); + spin_lock(&loop_delete_spinlock); + /* Mark this loop device no longer open()-able if nobody is using. */ + if (lo->lo_state != Lo_unbound || atomic_read(&lo->lo_refcnt) > 0) ret = -EBUSY; - goto mark_visible; - } - /* Mark this loop device no longer open()-able. */ - lo->lo_state = Lo_deleting; + else + lo->lo_state = Lo_deleting; + spin_unlock(&loop_delete_spinlock); mutex_unlock(&lo->lo_mutex); - - loop_remove(lo); - return 0; - + if (!ret) { + loop_remove(lo); + return 0; + } mark_visible: /* Show this loop device again. */ mutex_lock(&loop_ctl_mutex); diff --git a/drivers/block/loop.h b/drivers/block/loop.h index 918a7a2dc025..20fc5eebe455 100644 --- a/drivers/block/loop.h +++ b/drivers/block/loop.h @@ -56,7 +56,7 @@ struct loop_device { struct gendisk *lo_disk; struct mutex lo_mutex; bool idr_visible; - struct work_struct rundown_work; + atomic_t async_pending; }; struct loop_cmd {
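One design choice in this patch is worth spelling out: lo_release() is a void callback and must not fail, so it cannot cope with a kmalloc() failure. The patch therefore pre-allocates a loop_release_task in lo_open(), where -ENOMEM can still be returned to the caller, and parks it on a global spool that lo_release() later consumes. The general pattern, sketched with invented names (deferred_task, reserve/claim):

----------------------------------------
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct deferred_task {
	struct list_head head;
	/* ... payload filled in at release time ... */
};

static LIST_HEAD(spool);
static DEFINE_SPINLOCK(spool_lock);

/* Open path: may fail, so allocate here. */
static int reserve_deferred_task(void)
{
	struct deferred_task *t = kmalloc(sizeof(*t), GFP_KERNEL);

	if (!t)
		return -ENOMEM;
	spin_lock(&spool_lock);
	list_add(&t->head, &spool);
	spin_unlock(&spool_lock);
	return 0;
}

/* Release path: must not fail, so only consume. */
static struct deferred_task *claim_deferred_task(void)
{
	struct deferred_task *t;

	spin_lock(&spool_lock);
	/* Never empty: each successful open reserved one entry. */
	t = list_first_entry(&spool, struct deferred_task, head);
	list_del(&t->head);
	spin_unlock(&spool_lock);
	return t;
}
----------------------------------------

Compared with the __GFP_NOFAIL allocation in the earlier drafts, this moves the allocation to a context where failure is still reportable, at the cost of a global list and the invariant that every successful open contributes exactly one spare entry.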
I'll try to take a look. I've been busy today coming up with a scheme that avoids taking ->open_mutex outside the open/release path at all, which should also help, but it's not quite ready.
On Thu, Jan 13, 2022 at 11:44:24AM +0100, Jan Kara wrote:
> Maybe the most disputable thing in this locking chain seems to be splicing
> from sysfs files. That does not seem terribly useful and due to special
> locking and behavior of sysfs files it allows for creating interesting lock
> dependencies. OTOH maybe there is someone out there who (possibly
> inadvertently through some library) ends up using splice on sysfs files so
> chances for userspace breakage, if we disable splice for sysfs, would be
> non-negligible. Hum, tough.

People were using sendfile on sysfs files; that is why support for this
got added back after it was removed for a while as part of the set_fs()
removal.

The real question for me is why do we need freezing and writer counts on
sysfs or any other pure in-memory file system to start with?
On Thu 20-01-22 15:20:14, Christoph Hellwig wrote:
> On Thu, Jan 13, 2022 at 11:44:24AM +0100, Jan Kara wrote:
> > Maybe the most disputable thing in this locking chain seems to be splicing
> > from sysfs files. That does not seem terribly useful and due to special
> > locking and behavior of sysfs files it allows for creating interesting lock
> > dependencies. OTOH maybe there is someone out there who (possibly
> > inadvertently through some library) ends up using splice on sysfs files so
> > chances for userspace breakage, if we disable splice for sysfs, would be
> > non-negligible. Hum, tough.
>
> People were using sendfile on sysfs files, that is why support for this
> got added back after it was removed for a while as part of the set_fs()
> removal.
>
> The real question for me is why do we need freezing and writer counts on
> sysfs or any other pure in-memory file system to start with?

We don't. But freezing of sysfs is not part of the locking chain. Only
freezing of the filesystem holding the loopback backing file is (let's call
that fs F). And splice from sysfs to some file in F is what ties freezing
of F to a file lock in sysfs...

								Honza
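Concretely, the tie Jan describes is created by a call such as the following from userspace: sendfile() takes sb_start_write() on the destination filesystem F and then performs a splice read from the sysfs source, so F's freeze state becomes dependent on a kernfs file lock. A hedged userspace sketch; the paths are illustrative:

----------------------------------------
#include <fcntl.h>
#include <sys/sendfile.h>

int main(void)
{
	/* Source: a sysfs file.  Destination: a file on filesystem F. */
	int in = open("/sys/power/resume", O_RDONLY);
	int out = open("/mnt/F/copy", O_WRONLY | O_CREAT, 0644);

	/*
	 * sendfile() calls sb_start_write() on F and then splices from the
	 * sysfs file; while that read blocks on the kernfs file lock, F
	 * cannot be frozen.
	 */
	sendfile(out, in, NULL, 4096);
	return 0;
}
----------------------------------------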
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index b1b05c45c07c..8ef6da186c5c 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -89,6 +89,7 @@ static DEFINE_IDR(loop_index_idr);
 static DEFINE_MUTEX(loop_ctl_mutex);
 static DEFINE_MUTEX(loop_validate_mutex);
+static DECLARE_WAIT_QUEUE_HEAD(loop_rundown_wait);
 
 /**
  * loop_global_lock_killable() - take locks for safe loop_validate_file() test
@@ -1172,13 +1173,12 @@ static void loop_rundown_completed(struct loop_device *lo)
 	mutex_lock(&lo->lo_mutex);
 	lo->lo_state = Lo_unbound;
 	mutex_unlock(&lo->lo_mutex);
+	wake_up_all(&loop_rundown_wait);
 	module_put(THIS_MODULE);
 }
 
-static void loop_rundown_workfn(struct work_struct *work)
+static void loop_rundown_start(struct loop_device *lo)
 {
-	struct loop_device *lo = container_of(work, struct loop_device,
-					      rundown_work);
 	struct block_device *bdev = lo->lo_device;
 	struct gendisk *disk = lo->lo_disk;
 
@@ -1188,6 +1188,18 @@ static void loop_rundown_workfn(struct work_struct *work)
 	loop_rundown_completed(lo);
 }
 
+static void loop_rundown_callbackfn(struct callback_head *callback)
+{
+	loop_rundown_start(container_of(callback, struct loop_device,
+					rundown.callback));
+}
+
+static void loop_rundown_workfn(struct work_struct *work)
+{
+	loop_rundown_start(container_of(work, struct loop_device,
+					rundown.work));
+}
+
 static void loop_schedule_rundown(struct loop_device *lo)
 {
 	struct block_device *bdev = lo->lo_device;
@@ -1195,8 +1207,13 @@ static void loop_schedule_rundown(struct loop_device *lo)
 
 	__module_get(disk->fops->owner);
 	kobject_get(&bdev->bd_device.kobj);
-	INIT_WORK(&lo->rundown_work, loop_rundown_workfn);
-	queue_work(system_long_wq, &lo->rundown_work);
+	if (!(current->flags & PF_KTHREAD)) {
+		init_task_work(&lo->rundown.callback, loop_rundown_callbackfn);
+		if (!task_work_add(current, &lo->rundown.callback, TWA_RESUME))
+			return;
+	}
+	INIT_WORK(&lo->rundown.work, loop_rundown_workfn);
+	queue_work(system_long_wq, &lo->rundown.work);
 }
 
 static int loop_clr_fd(struct loop_device *lo)
@@ -1721,19 +1738,60 @@ static int lo_compat_ioctl(struct block_device *bdev, fmode_t mode,
 }
 #endif
 
+struct loop_rundown_waiter {
+	struct callback_head callback;
+	struct loop_device *lo;
+};
+
+static void loop_rundown_waiter_callbackfn(struct callback_head *callback)
+{
+	struct loop_rundown_waiter *lrw =
+		container_of(callback, struct loop_rundown_waiter, callback);
+
+	/*
+	 * Locklessly wait for completion of __loop_clr_fd().
+	 * This should be safe because of the following rules.
+	 *
+	 * (a) From locking dependency perspective, this function is called
+	 *     without any locks held.
+	 * (b) From execution ordering perspective, this function is called
+	 *     by the moment lo_open() from open() syscall returns to user
+	 *     mode.
+	 * (c) From use-after-free protection perspective, this function is
+	 *     called before loop_remove() is called, for lo->lo_refcnt taken
+	 *     by lo_open() prevents loop_control_remove() and loop_exit().
+	 */
+	wait_event_killable(loop_rundown_wait, data_race(lrw->lo->lo_state) != Lo_rundown);
+	kfree(lrw);
+}
+
 static int lo_open(struct block_device *bdev, fmode_t mode)
 {
 	struct loop_device *lo = bdev->bd_disk->private_data;
+	struct loop_rundown_waiter *lrw =
+		kmalloc(sizeof(*lrw), GFP_KERNEL | __GFP_NOWARN);
 	int err;
 
+	if (!lrw)
+		return -ENOMEM;
 	err = mutex_lock_killable(&lo->lo_mutex);
-	if (err)
+	if (err) {
+		kfree(lrw);
 		return err;
+	}
 	if (lo->lo_state == Lo_deleting)
 		err = -ENXIO;
 	else
 		atomic_inc(&lo->lo_refcnt);
 	mutex_unlock(&lo->lo_mutex);
+	if (!err && !(current->flags & PF_KTHREAD)) {
+		init_task_work(&lrw->callback, loop_rundown_waiter_callbackfn);
+		lrw->lo = lo;
+		if (task_work_add(current, &lrw->callback, TWA_RESUME))
+			kfree(lrw);
+	} else {
+		kfree(lrw);
+	}
 	return err;
 }
 
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index 918a7a2dc025..596472f9cde3 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -56,7 +56,10 @@ struct loop_device {
 	struct gendisk		*lo_disk;
 	struct mutex		lo_mutex;
 	bool			idr_visible;
-	struct work_struct	rundown_work;
+	union {
+		struct work_struct	work;
+		struct callback_head	callback;
+	} rundown;
 };
 
 struct loop_cmd {