diff mbox series

[07/13] block: Do not remove zone write plugs still in use

Message ID 20240430125131.668482-8-dlemoal@kernel.org (mailing list archive)
State Superseded, archived
Delegated to: Mike Snitzer
Headers show
Series Zone write plugging fixes and cleanup | expand

Commit Message

Damien Le Moal April 30, 2024, 12:51 p.m. UTC
Large write BIOs that span a zone boundary are split in
blk_mq_submit_bio() before being passed to blk_zone_plug_bio() for zone
write plugging. Such split BIO will be chained with one fragment
targeting one zone and the remainder of the BIO tergetting the next
zone. The two BIOs can be executed in parallel, without a predetermine
order relative to eachother and their completion may be reversed: the
remainder first completing and the first fragment then completing. In
such case, bio_endio() will not immediately execute
blk_zone_write_plug_bio_endio() for the parent BIO (the remainder of the
split BIO) as the BIOs are chained. blk_zone_write_plug_bio_endio() for
the parent BIO will be executed only once the first fragment completes.

In the case of a device with small zones and very large BIOs, uch
completion pattern can lead to disk_should_remove_zone_wplug() to return
true for the zone of the parent BIO when the parent BIO request
completes and blk_zone_write_plug_complete_request() is executed. This
triggers the removal of the zone write plug from the hash table using
disk_remove_zone_wplug(). With the zone write plug of the parent BIO
missing, the call to disk_get_zone_wplug() in
blk_zone_write_plug_bio_endio() returns NULL and triggers a warning.

This patterns can be recreated fairly easily using a scsi_debug device
with small zone and btrfs. E.g.

modprobe scsi_debug delay=0 dev_size_mb=1024 sector_size=4096 \
	zbc=host-managed zone_cap_mb=3 zone_nr_conv=0 zone_size_mb=4
mkfs.btrfs -f -O zoned /dev/sda
mount -t btrfs /dev/sda /mnt
fio --name=wrtest --rw=randwrite --direct=1 --ioengine=libaio \
	--bs=4k --iodepth=16 --size=1M --directory=/mnt --time_based \
	--runtime=10
umount /dev/sda

Will result in the warning:

[   29.035538] WARNING: CPU: 3 PID: 37 at block/blk-zoned.c:1207 blk_zone_write_plug_bio_endio+0xee/0x1e0
...
[   29.058682] Call Trace:
[   29.059095]  <TASK>
[   29.059473]  ? __warn+0x80/0x120
[   29.059983]  ? blk_zone_write_plug_bio_endio+0xee/0x1e0
[   29.060728]  ? report_bug+0x160/0x190
[   29.061283]  ? handle_bug+0x36/0x70
[   29.061830]  ? exc_invalid_op+0x17/0x60
[   29.062399]  ? asm_exc_invalid_op+0x1a/0x20
[   29.063025]  ? blk_zone_write_plug_bio_endio+0xee/0x1e0
[   29.063760]  bio_endio+0xb7/0x150
[   29.064280]  btrfs_clone_write_end_io+0x2b/0x60 [btrfs]
[   29.065049]  blk_update_request+0x17c/0x500
[   29.065666]  scsi_end_request+0x27/0x1a0 [scsi_mod]
[   29.066356]  scsi_io_completion+0x5b/0x690 [scsi_mod]
[   29.067077]  blk_complete_reqs+0x3a/0x50
[   29.067692]  __do_softirq+0xcf/0x2b3
[   29.068248]  ? sort_range+0x20/0x20
[   29.068791]  run_ksoftirqd+0x1c/0x30
[   29.069339]  smpboot_thread_fn+0xcc/0x1b0
[   29.069936]  kthread+0xcf/0x100
[   29.070438]  ? kthread_complete_and_exit+0x20/0x20
[   29.071314]  ret_from_fork+0x31/0x50
[   29.071873]  ? kthread_complete_and_exit+0x20/0x20
[   29.072563]  ret_from_fork_asm+0x11/0x20
[   29.073146]  </TASK>

either when fio executes or when unmount is executed.

Fix this by modifying disk_should_remove_zone_wplug() to check that the
reference count to a zone write plug is not larger than 2, that is, that
the only references left on the zone are the caller held reference
(blk_zone_write_plug_complete_request()) and the initial extra reference
for the zone write plug taken when it was initialized (and that is
dropped when the zone write plug is removed from the hash table).

To be consistent with this change, make sure to drop the request or BIO
held reference to the zone write plug before calling
disk_zone_wplug_unplug_bio(). All references are also dropped using
disk_put_zone_wplug() instead of atomic_dec() to ensure that the zone
write plug is freed if it needs to be.

Comments are also improved to clarify zone write plugs reference
handling.

Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/blk-zoned.c | 24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)

Comments

Christoph Hellwig April 30, 2024, 3:34 p.m. UTC | #1
On Tue, Apr 30, 2024 at 09:51:25PM +0900, Damien Le Moal wrote:
> Large write BIOs that span a zone boundary are split in
> blk_mq_submit_bio() before being passed to blk_zone_plug_bio() for zone
> write plugging. Such split BIO will be chained with one fragment
> targeting one zone and the remainder of the BIO tergetting the next

s/tergetting/targetting/

> Fix this by modifying disk_should_remove_zone_wplug() to check that the
> reference count to a zone write plug is not larger than 2, that is, that
> the only references left on the zone are the caller held reference
> (blk_zone_write_plug_complete_request()) and the initial extra reference
> for the zone write plug taken when it was initialized (and that is
> dropped when the zone write plug is removed from the hash table).

How is this atomic_read() based check not racy?
Damien Le Moal April 30, 2024, 11:06 p.m. UTC | #2
On 5/1/24 00:34, Christoph Hellwig wrote:
>> Fix this by modifying disk_should_remove_zone_wplug() to check that the
>> reference count to a zone write plug is not larger than 2, that is, that
>> the only references left on the zone are the caller held reference
>> (blk_zone_write_plug_complete_request()) and the initial extra reference
>> for the zone write plug taken when it was initialized (and that is
>> dropped when the zone write plug is removed from the hash table).
> 
> How is this atomic_read() based check not racy?

Because of how references work:
1) A valid and unused zone write plug has a ref count of 1
2) A function using a write plug always has a reference on it, so if the plug is
valid and unused, the ref count is always 2
3) Any plugged BIO and in-flight BIOs and requests hold a reference on the plug.
So if the plug is used for BIOs, the reference count is always at least 2, and
when a function is using the plug the refcount is always at least 3

Based on this, all callers of disk_should_remove_zone_wplug() will always see a
refcount of 2 if the plug is unused, or more than 2 if the plug is being used to
handle BIOs. Most of the time, checking for the BUSY (PLUGGED) flag catches the
later case. But as explained in the commit message, chained BIOs due to splits
can lead to bio_endio() execution order to change and to calls to
blk_zone_write_plug_bio_endio() to be done after
blk_zone_write_plug_finish_request() calls disk_zone_wplug_unplug_bio().
Checking that the plug refcount is not more than 2 tells us reliably that BIOs
are still holding references on the plug and that the plug should not be removed
until all BIOs completions are handled.

Does this answer your question ?
Damien Le Moal April 30, 2024, 11:55 p.m. UTC | #3
On 5/1/24 08:06, Damien Le Moal wrote:
> On 5/1/24 00:34, Christoph Hellwig wrote:
>>> Fix this by modifying disk_should_remove_zone_wplug() to check that the
>>> reference count to a zone write plug is not larger than 2, that is, that
>>> the only references left on the zone are the caller held reference
>>> (blk_zone_write_plug_complete_request()) and the initial extra reference
>>> for the zone write plug taken when it was initialized (and that is
>>> dropped when the zone write plug is removed from the hash table).
>>
>> How is this atomic_read() based check not racy?
> 
> Because of how references work:
> 1) A valid and unused zone write plug has a ref count of 1
> 2) A function using a write plug always has a reference on it, so if the plug is
> valid and unused, the ref count is always 2
> 3) Any plugged BIO and in-flight BIOs and requests hold a reference on the plug.
> So if the plug is used for BIOs, the reference count is always at least 2, and
> when a function is using the plug the refcount is always at least 3
> 
> Based on this, all callers of disk_should_remove_zone_wplug() will always see a
> refcount of 2 if the plug is unused, or more than 2 if the plug is being used to
> handle BIOs. Most of the time, checking for the BUSY (PLUGGED) flag catches the
> later case. But as explained in the commit message, chained BIOs due to splits
> can lead to bio_endio() execution order to change and to calls to
> blk_zone_write_plug_bio_endio() to be done after
> blk_zone_write_plug_finish_request() calls disk_zone_wplug_unplug_bio().
> Checking that the plug refcount is not more than 2 tells us reliably that BIOs
> are still holding references on the plug and that the plug should not be removed
> until all BIOs completions are handled.
> 
> Does this answer your question ?

I modified the patch to add a comment in disk_should_remove_zone_wplug()
explaining the above.
diff mbox series

Patch

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 82e540dad900..5792e3b160c9 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -520,8 +520,9 @@  static inline void disk_put_zone_wplug(struct blk_zone_wplug *zwplug)
 static inline bool disk_should_remove_zone_wplug(struct gendisk *disk,
 						 struct blk_zone_wplug *zwplug)
 {
-	/* If the zone is still busy, the plug cannot be removed. */
-	if (zwplug->flags & BLK_ZONE_WPLUG_BUSY)
+	/* If the zone write plug is still busy, it cannot be removed. */
+	if ((zwplug->flags & BLK_ZONE_WPLUG_BUSY) ||
+	    atomic_read(&zwplug->ref) > 2)
 		return false;
 
 	/* We can remove zone write plugs for zones that are empty or full. */
@@ -891,8 +892,9 @@  void blk_zone_write_plug_attempt_merge(struct request *req)
 	struct bio *bio;
 
 	/*
-	 * Completion of this request needs to be handled with
-	 * blk_zone_write_plug_complete_request().
+	 * Indicate that completion of this request needs to be handled with
+	 * blk_zone_write_plug_complete_request(), which will drop the reference
+	 * on the zone write plug we took above on entry to this function.
 	 */
 	req->rq_flags |= RQF_ZONE_WRITE_PLUGGING;
 
@@ -1221,6 +1223,9 @@  void blk_zone_write_plug_bio_endio(struct bio *bio)
 		spin_unlock_irqrestore(&zwplug->lock, flags);
 	}
 
+	/* Drop the reference we took when the BIO was issued. */
+	disk_put_zone_wplug(zwplug);
+
 	/*
 	 * For BIO-based devices, blk_zone_write_plug_complete_request()
 	 * is not called. So we need to schedule execution of the next
@@ -1229,8 +1234,7 @@  void blk_zone_write_plug_bio_endio(struct bio *bio)
 	if (bio->bi_bdev->bd_has_submit_bio)
 		disk_zone_wplug_unplug_bio(disk, zwplug);
 
-	/* Drop the reference we took when the BIO was issued. */
-	atomic_dec(&zwplug->ref);
+	/* Drop the reference we took when entering this function. */
 	disk_put_zone_wplug(zwplug);
 }
 
@@ -1244,13 +1248,15 @@  void blk_zone_write_plug_complete_request(struct request *req)
 
 	req->rq_flags &= ~RQF_ZONE_WRITE_PLUGGING;
 
-	disk_zone_wplug_unplug_bio(disk, zwplug);
-
 	/*
 	 * Drop the reference we took when the request was initialized in
 	 * blk_zone_write_plug_attempt_merge().
 	 */
-	atomic_dec(&zwplug->ref);
+	disk_put_zone_wplug(zwplug);
+
+	disk_zone_wplug_unplug_bio(disk, zwplug);
+
+	/* Drop the reference we took when entering this function. */
 	disk_put_zone_wplug(zwplug);
 }