Message ID: 20220121114006.3633-1-penguin-kernel@I-love.SAKURA.ne.jp (mailing list archive)
State: New, archived
Series: [v3,1/5] task_work: export task_work_add()
I'm sometimes a little slow, but I still fail to understand why we need
all this. Your cut down patch that moves the destroy_workqueue call and
the work_struct fixed all the known lockdep issues, right?

And the only other problem we're thinking of is that blk_mq_freeze_queue
could have the same effect, except that lockdep doesn't track it and
we've not seen it in the wild. As far as I can tell we do not need the
freeze at all given that by the time release is called I/O is quiesced.

So with a little prep work we should not need any of that. See below
(this is actually three patches, with the last one being a very slight
rebase of your patch):

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 01cbbfc4e9e24..91b56761392e2 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1006,10 +1006,10 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
 	    !file->f_op->write_iter)
 		lo->lo_flags |= LO_FLAGS_READ_ONLY;
 
-	lo->workqueue = alloc_workqueue("loop%d",
-					WQ_UNBOUND | WQ_FREEZABLE,
-					0,
-					lo->lo_number);
+	if (!lo->workqueue)
+		lo->workqueue = alloc_workqueue("loop%d",
+						WQ_UNBOUND | WQ_FREEZABLE,
+						0, lo->lo_number);
 	if (!lo->workqueue) {
 		error = -ENOMEM;
 		goto out_unlock;
@@ -1082,7 +1082,7 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
 	return error;
 }
 
-static void __loop_clr_fd(struct loop_device *lo)
+static void __loop_clr_fd(struct loop_device *lo, bool release)
 {
 	struct file *filp;
 	gfp_t gfp = lo->old_gfp_mask;
@@ -1112,10 +1112,14 @@ static void __loop_clr_fd(struct loop_device *lo)
 	if (test_bit(QUEUE_FLAG_WC, &lo->lo_queue->queue_flags))
 		blk_queue_write_cache(lo->lo_queue, false, false);
 
-	/* freeze request queue during the transition */
-	blk_mq_freeze_queue(lo->lo_queue);
+	/*
+	 * Freeze the request queue when unbinding on a live file descriptor and
+	 * thus an open device.  When called from ->release we are guaranteed
+	 * that there is no I/O in progress already.
+	 */
+	if (!release)
+		blk_mq_freeze_queue(lo->lo_queue);
 
-	destroy_workqueue(lo->workqueue);
 	spin_lock_irq(&lo->lo_work_lock);
 	list_for_each_entry_safe(worker, pos, &lo->idle_worker_list, idle_list) {
@@ -1144,16 +1148,19 @@ static void __loop_clr_fd(struct loop_device *lo)
 	/* let user-space know about this change */
 	kobject_uevent(&disk_to_dev(lo->lo_disk)->kobj, KOBJ_CHANGE);
 	mapping_set_gfp_mask(filp->f_mapping, gfp);
-	blk_mq_unfreeze_queue(lo->lo_queue);
+	if (!release)
+		blk_mq_unfreeze_queue(lo->lo_queue);
 
 	disk_force_media_change(lo->lo_disk, DISK_EVENT_MEDIA_CHANGE);
 
 	if (lo->lo_flags & LO_FLAGS_PARTSCAN) {
 		int err;
 
-		mutex_lock(&lo->lo_disk->open_mutex);
+		if (!release)
+			mutex_lock(&lo->lo_disk->open_mutex);
 		err = bdev_disk_changed(lo->lo_disk, false);
-		mutex_unlock(&lo->lo_disk->open_mutex);
+		if (!release)
+			mutex_unlock(&lo->lo_disk->open_mutex);
 		if (err)
 			pr_warn("%s: partition scan of loop%d failed (rc=%d)\n",
				__func__, lo->lo_number, err);
@@ -1164,39 +1171,17 @@ static void __loop_clr_fd(struct loop_device *lo)
 	if (!part_shift)
 		lo->lo_disk->flags |= GENHD_FL_NO_PART;
 
-	fput(filp);
-}
-
-static void loop_rundown_completed(struct loop_device *lo)
-{
 	mutex_lock(&lo->lo_mutex);
 	lo->lo_state = Lo_unbound;
 	mutex_unlock(&lo->lo_mutex);
 	module_put(THIS_MODULE);
-}
-
-static void loop_rundown_workfn(struct work_struct *work)
-{
-	struct loop_device *lo = container_of(work, struct loop_device,
-					      rundown_work);
-	struct block_device *bdev = lo->lo_device;
-	struct gendisk *disk = lo->lo_disk;
-
-	__loop_clr_fd(lo);
-	kobject_put(&bdev->bd_device.kobj);
-	module_put(disk->fops->owner);
-	loop_rundown_completed(lo);
-}
-
-static void loop_schedule_rundown(struct loop_device *lo)
-{
-	struct block_device *bdev = lo->lo_device;
-	struct gendisk *disk = lo->lo_disk;
-
-	__module_get(disk->fops->owner);
-	kobject_get(&bdev->bd_device.kobj);
-	INIT_WORK(&lo->rundown_work, loop_rundown_workfn);
-	queue_work(system_long_wq, &lo->rundown_work);
+	/*
+	 * Need not hold lo_mutex to fput backing file. Calling fput holding
+	 * lo_mutex triggers a circular lock dependency possibility warning as
+	 * fput can take open_mutex which is usually taken before lo_mutex.
+	 */
+	fput(filp);
 }
 
 static int loop_clr_fd(struct loop_device *lo)
@@ -1228,8 +1213,7 @@ static int loop_clr_fd(struct loop_device *lo)
 	lo->lo_state = Lo_rundown;
 	mutex_unlock(&lo->lo_mutex);
 
-	__loop_clr_fd(lo);
-	loop_rundown_completed(lo);
+	__loop_clr_fd(lo, false);
 	return 0;
 }
 
@@ -1754,15 +1738,8 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
 	 * In autoclear mode, stop the loop thread
 	 * and remove configuration after last close.
 	 */
-		loop_schedule_rundown(lo);
+		__loop_clr_fd(lo, true);
 		return;
-	} else if (lo->lo_state == Lo_bound) {
-		/*
-		 * Otherwise keep thread (if running) and config,
-		 * but flush possible ongoing bios in thread.
-		 */
-		blk_mq_freeze_queue(lo->lo_queue);
-		blk_mq_unfreeze_queue(lo->lo_queue);
 	}
 
 out_unlock:
@@ -2080,6 +2057,8 @@ static void loop_remove(struct loop_device *lo)
 	mutex_unlock(&loop_ctl_mutex);
 	/* There is no route which can find this loop device. */
 	mutex_destroy(&lo->lo_mutex);
+	if (lo->workqueue)
+		destroy_workqueue(lo->workqueue);
 	kfree(lo);
 }
 
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index 918a7a2dc0259..082d4b6bfc6a6 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -56,7 +56,6 @@ struct loop_device {
 	struct gendisk		*lo_disk;
 	struct mutex		lo_mutex;
 	bool			idr_visible;
-	struct work_struct	rundown_work;
 };
 
 struct loop_cmd {
On Fri, Jan 21, 2022 at 08:40:02PM +0900, Tetsuo Handa wrote:
> Commit 322c4293ecc58110 ("loop: make autoclear operation asynchronous")
> silenced a circular locking dependency warning by moving the autoclear
> operation to WQ context.
>
> Then, it was reported that WQ context is too late to run the autoclear
> operation; some userspace programs (e.g. xfstests) assume that the
> autoclear operation has already completed by the moment close() returns
> to user mode, so that they can immediately call umount() of a partition
> containing a backing file which the autoclear operation should have
> closed.
>
> Then, Jan Kara found that the fundamental problem is that waiting for
> I/O completion (from blk_mq_freeze_queue() or flush_workqueue()) with
> disk->open_mutex held has a possibility of deadlock.
>
> Then, I found that since the disk->open_mutex => lo->lo_mutex dependency
> is recorded by lo_open() and lo_release(), and blk_mq_freeze_queue() by
> e.g. loop_set_status() waits for I/O completion with lo->lo_mutex held,
> from a locking dependency chain perspective we need to avoid holding
> lo->lo_mutex in lo_open() and lo_release(). And we can avoid holding
> lo->lo_mutex in lo_open(), for we can instead use a spinlock dedicated
> to the Lo_deleting check.
>
> But we cannot avoid holding lo->lo_mutex in lo_release(), for WQ context
> was too late to run the autoclear operation. We need to make the whole
> lo_release() operation start without disk->open_mutex and complete
> before returning to user mode. One of the approaches that can meet such
> a requirement is to use the task_work context. Thus, export
> task_work_add() for the loop driver.
>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

5.17-rc1 came out and I saw regressions across the board when xfs/049
and xfs/073 ran.
xfs/049 formats XFS on a block device, mounts it, creates a sparse file
inside the XFS fs, formats the sparse file, mounts that (via automatic
loop device), does some work, and then unmounts both the sparse file
filesystem and the outer XFS filesystem. The first unmount no longer
waits for the loop device to be released (the release is now
asynchronous), so the unmount of the outer fs fails because it is still
in use.

So this series fixes xfs/049, but xfs/073 is still broken. xfs/073
creates a sparse file containing an XFS filesystem and then does this in
rapid succession:

mount -o loop <mount options that guarantee mount failure>
mount -o loop <mount options that should work>

With 5.16 this worked fine:

[U] try mount 1
loop0: detected capacity change from 0 to 83968
XFS (loop0): Filesystem has duplicate UUID 924e8033-a130-4f9c-a11f-52f892c268e9 - can't mount
[U] try mount 2
loop0: detected capacity change from 0 to 83968
XFS (loop0): Mounting V5 Filesystem
XFS (loop0): resetting quota flags
XFS (loop0): Ending clean mount

In 5.17-rc1 it fails like this:

[U] try mount 1
loop0: detected capacity change from 0 to 83968
XFS (loop0): Filesystem has duplicate UUID 0b0afdac-5c9c-4d94-9b8d-fe85a2eb1143 - can't mount
[U] try mount 2
I/O error, dev loop0, sector 0 op 0x0:(READ) flags 0x1000 phys_seg 1 prio class 0
XFS (loop0): SB validate failed with error -5.
[U] fail mount 2

I guess this means that mount can grab the loop device after the backing
file has been released but before a new one has been plumbed in? Or,
seeing the lack of "detected capacity change" for mount 2, maybe the
backing file never gets added?
--D

> ---
>  kernel/task_work.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/kernel/task_work.c b/kernel/task_work.c
> index 1698fbe6f0e1..2a1644189182 100644
> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -60,6 +60,7 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
>
>  	return 0;
>  }
> +EXPORT_SYMBOL_GPL(task_work_add);
>
>  /**
>   * task_work_cancel_match - cancel a pending work added by task_work_add()
> --
> 2.32.0
>
On 2022/01/26 0:47, Christoph Hellwig wrote:
> I'm sometimes a little slow, but I still fail to understand why we need
> all this. Your cut down patch that moves the destroy_workqueue call
> and the work_struct fixed all the known lockdep issues, right?

Right. Moving destroy_workqueue() (together with blk_mq_freeze_queue())
in __loop_clr_fd() to WQ context fixed a known lockdep issue.

> And the only other problem we're thinking of is that blk_mq_freeze_queue
> could have the same effect,

Right. blk_mq_freeze_queue() is still called with disk->open_mutex held,
for there is the

	} else if (lo->lo_state == Lo_bound) {
		/*
		 * Otherwise keep thread (if running) and config,
		 * but flush possible ongoing bios in thread.
		 */
		blk_mq_freeze_queue(lo->lo_queue);
		blk_mq_unfreeze_queue(lo->lo_queue);
	}

path in lo_release().

> except that lockdep doesn't track it and
> we've not seen it in the wild.

It is difficult to test. Fuzzers cannot test fsfreeze paths, for failing
to issue an unfreeze request leads to unresponsive virtual machines.

> As far as I can tell we do not need the freeze at all given that
> by the time release is called I/O is quiesced.

Why? lo_release() is called when close() is called. But (periodically
scheduled or triggered-on-demand) writeback of previously executed
buffered write() calls can start while lo_release() or __loop_clr_fd()
is running. Then, why not wait for I/O requests to complete? Isn't that
the reason the

	} else if (lo->lo_state == Lo_bound) {
		/*
		 * Otherwise keep thread (if running) and config,
		 * but flush possible ongoing bios in thread.
		 */
		blk_mq_freeze_queue(lo->lo_queue);
		blk_mq_unfreeze_queue(lo->lo_queue);
	}

path in lo_release() is there?
On Wed, Jan 26, 2022 at 08:47:17AM +0900, Tetsuo Handa wrote:
> > As far as I can tell we do not need the freeze at all given that
> > by the time release is called I/O is quiesced.
>
> Why? lo_release() is called when close() is called. But (periodically
> scheduled or triggered-on-demand) writeback of previously executed
> buffered write() calls can start while lo_release() or __loop_clr_fd()
> is running. Then, why not wait for I/O requests to complete?

Let's refine my wording: the above should be "... by the time the final
lo_release is called". blkdev_put_whole ensures all writeback has
finished and all buffers are gone by writing all data back, waiting for
it, and then truncating the pages from blkdev_flush_mapping.

> Isn't that the reason of
>
> 	} else if (lo->lo_state == Lo_bound) {
> 		/*
> 		 * Otherwise keep thread (if running) and config,
> 		 * but flush possible ongoing bios in thread.
> 		 */
> 		blk_mq_freeze_queue(lo->lo_queue);
> 		blk_mq_unfreeze_queue(lo->lo_queue);
> 	}
>
> path in lo_release() being there?

This looks completely spurious to me. Adding Ming who added it.
On Wed, Jan 26, 2022 at 06:21:59AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 26, 2022 at 08:47:17AM +0900, Tetsuo Handa wrote:
> > > As far as I can tell we do not need the freeze at all given that
> > > by the time release is called I/O is quiesced.
> >
> > Why? lo_release() is called when close() is called. But (periodically
> > scheduled or triggered-on-demand) writeback of previously executed
> > buffered write() calls can start while lo_release() or __loop_clr_fd()
> > is running. Then, why not wait for I/O requests to complete?
>
> Let's refine my wording: the above should be "... by the time the final
> lo_release is called". blkdev_put_whole ensures all writeback has
> finished and all buffers are gone by writing all data back, waiting for
> it, and then truncating the pages from blkdev_flush_mapping.
>
> > Isn't that the reason of
> >
> > 	} else if (lo->lo_state == Lo_bound) {
> > 		/*
> > 		 * Otherwise keep thread (if running) and config,
> > 		 * but flush possible ongoing bios in thread.
> > 		 */
> > 		blk_mq_freeze_queue(lo->lo_queue);
> > 		blk_mq_unfreeze_queue(lo->lo_queue);
> > 	}
> >
> > path in lo_release() being there?
>
> This looks completely spurious to me. Adding Ming who added it.

It was added when converting to blk-mq. I remember it was to replace the
original loop_flush(), which used wait_for_completion() for draining all
in-flight bios, but it seems the flush isn't needed in lo_release().

Thanks,
Ming
On 2022/01/26 16:18, Ming Lei wrote:
>>> Isn't that the reason of
>>>
>>> 	} else if (lo->lo_state == Lo_bound) {
>>> 		/*
>>> 		 * Otherwise keep thread (if running) and config,
>>> 		 * but flush possible ongoing bios in thread.
>>> 		 */
>>> 		blk_mq_freeze_queue(lo->lo_queue);
>>> 		blk_mq_unfreeze_queue(lo->lo_queue);
>>> 	}
>>>
>>> path in lo_release() being there?
>>
>> This looks completely spurious to me. Adding Ming who added it.
>
> It was added when converting to blk-mq.

That is commit b5dd2f6047ca1080 ("block: loop: improve performance via
blk-mq") in v4.0-rc1. That's where "thread" in the block comment above
became outdated and a global WQ_MEM_RECLAIM workqueue came to be used.
We need to update both.

> I remember it was to replace the original loop_flush(), which used
> wait_for_completion() for draining all in-flight bios, but it seems
> the flush isn't needed in lo_release().

Even if we can remove blk_mq_freeze_queue()/blk_mq_unfreeze_queue() from
lo_release(), we cannot remove
blk_mq_freeze_queue()/blk_mq_unfreeze_queue() from e.g.
loop_set_status(), right?

Then, lo_release(), which is called with disk->open_mutex held, can
still be blocked at mutex_lock(&lo->lo_mutex) waiting for e.g.
loop_set_status() to call mutex_unlock(&lo->lo_mutex). That is,
lo_open() from e.g. /sys/power/resume can still wait for I/O completion
with disk->open_mutex held.

How do we kill this dependency? Don't we need to make sure that
lo_open()/lo_release() are not blocked on lo->lo_mutex (like
[PATCH v3 3/5] and [PATCH v3 4/5] do)?
On Wed 26-01-22 19:27:17, Tetsuo Handa wrote:
> On 2022/01/26 16:18, Ming Lei wrote:
> > I remember it was to replace the original loop_flush(), which used
> > wait_for_completion() for draining all in-flight bios, but it seems
> > the flush isn't needed in lo_release().
>
> Even if we can remove blk_mq_freeze_queue()/blk_mq_unfreeze_queue() from
> lo_release(), we cannot remove
> blk_mq_freeze_queue()/blk_mq_unfreeze_queue() from e.g.
> loop_set_status(), right?

Correct AFAICT.

> Then, lo_release(), which is called with disk->open_mutex held, can
> still be blocked at mutex_lock(&lo->lo_mutex) waiting for e.g.
> loop_set_status() to call mutex_unlock(&lo->lo_mutex). That is,
> lo_open() from e.g. /sys/power/resume can still wait for I/O completion
> with disk->open_mutex held.

I don't think this is a real problem. If someone is calling
loop_set_status() it means the loop device is open and thus lo_release()
cannot be running in parallel, can it?

Honza
On 2022/01/26 22:11, Jan Kara wrote:
> On Wed 26-01-22 19:27:17, Tetsuo Handa wrote:
> > Even if we can remove blk_mq_freeze_queue()/blk_mq_unfreeze_queue() from
> > lo_release(), we cannot remove
> > blk_mq_freeze_queue()/blk_mq_unfreeze_queue() from e.g.
> > loop_set_status(), right?
>
> Correct AFAICT.

OK.

> > Then, lo_release(), which is called with disk->open_mutex held, can
> > still be blocked at mutex_lock(&lo->lo_mutex) waiting for e.g.
> > loop_set_status() to call mutex_unlock(&lo->lo_mutex). That is,
> > lo_open() from e.g. /sys/power/resume can still wait for I/O completion
> > with disk->open_mutex held.
>
> I don't think this is a real problem. If someone is calling
> loop_set_status() it means the loop device is open and thus lo_release()
> cannot be running in parallel, can it?

lo_release() is called when a file descriptor is close()d. That is,
loop_set_status() and lo_release() can run in parallel, can't they?

  Process-A                              Process-B

  int fd1 = open("/dev/loop0", O_RDWR);
                                         int fd2 = open("/dev/loop0", O_RDWR);
  ioctl(fd1, LOOP_SET_STATUS64, ...);
                                         close(fd2);

If lo_release() (which is called with disk->open_mutex held) waits for
ioctl() (which waits for I/O completion with lo->lo_mutex held), there
is a disk->open_mutex => lo->lo_mutex => I/O completion dependency. And
the possibility of a deadlock problem (when mixed with sysfs and
fsfreeze) happens at lo_open(). If lo_open() (which is called with
disk->open_mutex held) waits for ioctl() (which waits for I/O completion
with lo->lo_mutex held), there as well is a disk->open_mutex =>
lo->lo_mutex => I/O completion dependency.
Commit 322c4293ecc58110 ("loop: make autoclear operation asynchronous")
silenced a circular locking dependency warning by moving the autoclear
operation to WQ context.

Then, it was reported that WQ context is too late to run the autoclear
operation; some userspace programs (e.g. xfstests) assume that the
autoclear operation has already completed by the moment close() returns
to user mode, so that they can immediately call umount() of a partition
containing a backing file which the autoclear operation should have
closed.

Then, Jan Kara found that the fundamental problem is that waiting for
I/O completion (from blk_mq_freeze_queue() or flush_workqueue()) with
disk->open_mutex held has a possibility of deadlock.

Then, I found that since the disk->open_mutex => lo->lo_mutex dependency
is recorded by lo_open() and lo_release(), and blk_mq_freeze_queue() by
e.g. loop_set_status() waits for I/O completion with lo->lo_mutex held,
from a locking dependency chain perspective we need to avoid holding
lo->lo_mutex in lo_open() and lo_release(). And we can avoid holding
lo->lo_mutex in lo_open(), for we can instead use a spinlock dedicated
to the Lo_deleting check.

But we cannot avoid holding lo->lo_mutex in lo_release(), for WQ context
was too late to run the autoclear operation. We need to make the whole
lo_release() operation start without disk->open_mutex and complete
before returning to user mode. One of the approaches that can meet such
a requirement is to use the task_work context. Thus, export
task_work_add() for the loop driver.

Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 kernel/task_work.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/task_work.c b/kernel/task_work.c
index 1698fbe6f0e1..2a1644189182 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -60,6 +60,7 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(task_work_add);
 
 /**
  * task_work_cancel_match - cancel a pending work added by task_work_add()