
[v17,00/14] Improve write performance for zoned UFS devices

Message ID 20250115224649.3973718-1-bvanassche@acm.org

Bart Van Assche Jan. 15, 2025, 10:46 p.m. UTC
Hi Damien and Christoph,

This patch series improves small write IOPS by a factor of four (+300%) for
zoned UFS devices on my test setup. Please help with reviewing this patch
series.

Thanks,

Bart.

Changes compared to v16:
 - Rebased the entire patch series on top of Jens' for-next branch. Compared
   to when v16 of this series was posted, the BLK_ZONE_WPLUG_NEED_WP_UPDATE
   flag has been introduced and support for REQ_NOWAIT has been fixed.
 - The behavior for SMR disks is preserved: if .driver_preserves_write_order
   has not been set, BLK_ZONE_WPLUG_NEED_WP_UPDATE is still set if a write
   error has been encountered. If .driver_preserves_write_order has been set,
   the write pointer is restored and the failed zoned writes are retried.
 - The superfluous "disk->zone_wplugs_hash_bits != 0" tests have been removed.

Changes compared to v15:
 - Reworked this patch series on top of the zone write plugging approach.
 - Moved support for requeuing requests from the SCSI core into the block
   layer core.
 - In the UFS driver, instead of disabling write pipelining if
   auto-hibernation is enabled, rely on the requeuing mechanism to handle
   reordering caused by resuming from auto-hibernation.

Changes compared to v14:
 - Removed the drivers/scsi/Kconfig.kunit and drivers/scsi/Makefile.kunit
   files. Instead, modified drivers/scsi/Kconfig and added #include "*_test.c"
   directives in the appropriate .c files. Removed the EXPORT_SYMBOL()
   directives that were added to make the unit tests link.
 - Fixed a double free in a unit test.

Changes compared to v13:
 - Reworked patch "block: Preserve the order of requeued zoned writes".
 - Addressed a performance concern by removing the eh_needs_prepare_resubmit
   SCSI driver callback and by introducing the SCSI host template flag
   .needs_prepare_resubmit instead.
 - Added a patch that adds a 'host' argument to scsi_eh_flush_done_q().
 - Made the code in unit tests less repetitive.

Changes compared to v12:
 - Added two new patches: "block: Preserve the order of requeued zoned writes"
   and "scsi: sd: Add a unit test for sd_cmp_sector()"
 - Restricted the number of zoned write retries. To my surprise I had to add
   "&& scmd->retries <= scmd->allowed" in the SCSI error handler to limit the
   number of retries.
 - In patch "scsi: ufs: Inform the block layer about write ordering", only set
   ELEVATOR_F_ZBD_SEQ_WRITE for zoned block devices.

Changes compared to v11:
 - Fixed a NULL pointer dereference that happened when booting from an ATA
   device by adding an scmd->device != NULL check in scsi_needs_preparation().
 - Updated Reviewed-by tags.

Changes compared to v10:
 - Dropped the UFS MediaTek and HiSilicon patches because these patches are not
   correct and because it is safe to drop them.
 - Updated Acked-by / Reviewed-by tags.

Changes compared to v9:
 - Introduced an additional scsi_driver callback: .eh_needs_prepare_resubmit().
 - Renamed the scsi_debug kernel module parameter 'no_zone_write_lock' into
   'preserves_write_order'.
 - Fixed an out-of-bounds access in the scsi_call_prepare_resubmit() unit
   test.
 - Wrapped ufshcd_auto_hibern8_update() calls in UFS host drivers with
   WARN_ON_ONCE() such that a kernel stack appears in case an error code is
   returned.
 - Elaborated a comment in the UFSHCI driver.

Changes compared to v8:
 - Fixed handling of 'driver_preserves_write_order' and 'use_zone_write_lock'
   in blk_stack_limits().
 - Added a comment in disk_set_zoned().
 - Modified blk_req_needs_zone_write_lock() such that it returns false if
   q->limits.use_zone_write_lock is false.
 - Modified disk_clear_zone_settings() such that it clears
   q->limits.use_zone_write_lock.
 - Left out one change from the mq-deadline patch that became superfluous due to
   the blk_req_needs_zone_write_lock() change.
 - Modified scsi_call_prepare_resubmit() such that it only calls list_sort() if
   zoned writes have to be resubmitted for which zone write locking is disabled.
 - Added an additional unit test for scsi_call_prepare_resubmit().
 - Modified the sorting code in the sd driver such that only those SCSI commands
   are sorted for which write locking is disabled.
 - Modified sd_zbc.c such that ELEVATOR_F_ZBD_SEQ_WRITE is only set if the
   write order is not preserved.
 - Included three patches for UFS host drivers that rework code that wrote
   directly to the auto-hibernation controller register.
 - Modified the UFS driver such that enabling auto-hibernation is not allowed
   if a zoned logical unit is present and if the controller operates in legacy
   mode.
 - Also in the UFS driver, simplified ufshcd_auto_hibern8_update().

Changes compared to v7:
 - Split the queue_limits member variable `use_zone_write_lock' into two member
   variables: `use_zone_write_lock' (set by disk_set_zoned()) and
   `driver_preserves_write_order' (set by the block driver or SCSI LLD). This
   should clear up the confusion about the purpose of this variable.
 - Moved the code for sorting SCSI commands by LBA from the SCSI error handler
   into the SCSI disk (sd) driver as requested by Christoph.
   
Changes compared to v6:
 - Removed QUEUE_FLAG_NO_ZONE_WRITE_LOCK and instead introduced a flag in
   the request queue limits data structure.

Changes compared to v5:
 - Renamed scsi_cmp_lba() into scsi_cmp_sector().
 - Improved several source code comments.

Changes compared to v4:
 - Dropped the patch that introduces the REQ_NO_ZONE_WRITE_LOCK flag.
 - Dropped the null_blk patch and added two scsi_debug patches instead.
 - Dropped the f2fs patch.
 - Split the patch for the UFS driver into two patches.
 - Modified several patch descriptions and source code comments.
 - Renamed dd_use_write_locking() into dd_use_zone_write_locking().
 - Moved the list_sort() call from scsi_unjam_host() into scsi_eh_flush_done_q()
   such that sorting happens just before reinserting.
 - Removed the scsi_cmd_retry_allowed() call from scsi_check_sense() to make
   sure that the retry counter is adjusted once per retry instead of twice.

Changes compared to v3:
 - Restored the patch that introduces QUEUE_FLAG_NO_ZONE_WRITE_LOCK. That patch
   had accidentally been left out from v2.
 - In patch "block: Introduce the flag REQ_NO_ZONE_WRITE_LOCK", improved the
   patch description and added the function blk_no_zone_write_lock().
 - In patch "block/mq-deadline: Only use zone locking if necessary", moved the
   blk_queue_is_zoned() call into dd_use_write_locking().
 - In patch "fs/f2fs: Disable zone write locking", set REQ_NO_ZONE_WRITE_LOCK
   from inside __bio_alloc() instead of in f2fs_submit_write_bio().

Changes compared to v2:
 - Renamed the request queue flag for disabling zone write locking.
 - Introduced a new request flag for disabling zone write locking.
 - Modified the mq-deadline scheduler such that zone write locking is only
   disabled if both flags are set.
 - Added an F2FS patch that sets the request flag for disabling zone write
   locking.
 - Only disable zone write locking in the UFS driver if auto-hibernation is
   disabled.

Changes compared to v1:
 - Left out the patches that are already upstream.
 - Switched the approach in patch "scsi: Retry unaligned zoned writes" from
   retrying immediately to sending unaligned write commands to the SCSI error
   handler.

Bart Van Assche (14):
  block: Support block drivers that preserve the order of write requests
  dm-linear: Report to the block layer that the write order is preserved
  block: Rework request allocation in blk_mq_submit_bio()
  block: Support allocating from a specific software queue
  blk-mq: Restore the zoned write order when requeuing
  blk-zoned: Track the write pointer per zone
  blk-zoned: Defer error handling
  blk-zoned: Add an argument to blk_zone_plug_bio()
  blk-zoned: Support pipelining of zoned writes
  scsi: core: Retry unaligned zoned writes
  scsi: sd: Increase retry count for zoned writes
  scsi: scsi_debug: Add the preserves_write_order module parameter
  scsi: scsi_debug: Support injecting unaligned write errors
  scsi: ufs: Inform the block layer about write ordering

 block/bfq-iosched.c       |   2 +
 block/blk-mq.c            |  82 +++++----
 block/blk-mq.h            |   3 +
 block/blk-settings.c      |   2 +
 block/blk-zoned.c         | 342 ++++++++++++++++++++++++++++++++++----
 block/blk.h               |  31 +++-
 block/kyber-iosched.c     |   2 +
 block/mq-deadline.c       |   7 +-
 drivers/md/dm-linear.c    |   6 +
 drivers/md/dm.c           |   2 +-
 drivers/scsi/scsi_debug.c |  21 ++-
 drivers/scsi/scsi_error.c |  16 ++
 drivers/scsi/sd.c         |   7 +
 drivers/ufs/core/ufshcd.c |   7 +
 include/linux/blk-mq.h    |  13 +-
 include/linux/blkdev.h    |  14 +-
 16 files changed, 492 insertions(+), 65 deletions(-)

Comments

Damien Le Moal Jan. 17, 2025, 10:47 p.m. UTC | #1
On 1/16/25 7:46 AM, Bart Van Assche wrote:
> Hi Damien and Christoph,
> 
> This patch series improves small write IOPS by a factor of four (+300%) for
> zoned UFS devices on my test setup. Please help with reviewing this patch
> series.

Bart,

Given that this is a set of 14 patches, giving a more detailed summary here of
what the entire patch set intends to do and how it does it would really help
with the review. I do not want to have to reverse engineer your line of thought
to see if the patch set is correctly organized.

Also, patch 14 puts back a lot of the code that was recently removed. Not nice,
and its commit message is also far too short given the size of the patch. Please
spend more time explaining what the patches do and how they do it to facilitate
review.

I also fail to see why you treat request requeues as errors if you actually
expect them to happen... I do not think you need error handling and tracking of
the completion wp position: when you get a requeue event, requeue all requests
in the plug and set the plug wp position to the lowest sector of all the
requeued requests. That is necessarily the current location of the write
pointer. No need for re-introducing all the error handling for that, no ?

I am going to wait for you to resend this with a better "big picture"
explanation of what you are trying to do so that I do not randomly comment on
things that I am not sure I completely understand. Thanks.
Bart Van Assche Jan. 21, 2025, 9:57 p.m. UTC | #2
On 1/17/25 2:47 PM, Damien Le Moal wrote:
> Also, patch 14 puts back a lot of the code that was recently removed. Not nice,
> and its commit message is also far too short given the size of the patch. Please
> spend more time explaining what the patches do and how they do it to facilitate
> review.

Regarding the following part of patch 07/14:

-		disk_zone_wplug_abort(zwplug);
-		zwplug->flags |= BLK_ZONE_WPLUG_NEED_WP_UPDATE;
+		if (!disk->queue->limits.driver_preserves_write_order)
+			zwplug->flags |= BLK_ZONE_WPLUG_NEED_WP_UPDATE;
+		zwplug->flags |= BLK_ZONE_WPLUG_ERROR;

How about setting BLK_ZONE_WPLUG_ERROR only if
disk->queue->limits.driver_preserves_write_order has been set such that
the restored error handling code is only used if the storage controller
preserves the write order? This should be sufficient to preserve the
existing behavior of the blk-zoned.c code for SMR disks.
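For illustration, the condition proposed here can be sketched in userspace C. The flag values, struct, and helper name below are stand-ins for the kernel code, not the actual patch:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the zone write plug flags discussed above. */
#define BLK_ZONE_WPLUG_NEED_WP_UPDATE (1u << 0)
#define BLK_ZONE_WPLUG_ERROR          (1u << 1)

struct zone_wplug {
	unsigned int flags;
};

/*
 * Sketch of the proposed error path: drivers that do not preserve the
 * write order (the SMR case) keep the existing behavior and mark the
 * write pointer as stale, while drivers that do preserve the order get
 * the restored error-handling path instead.
 */
static void zone_wplug_handle_write_error(struct zone_wplug *zwplug,
					  bool driver_preserves_write_order)
{
	if (!driver_preserves_write_order)
		zwplug->flags |= BLK_ZONE_WPLUG_NEED_WP_UPDATE;
	else
		zwplug->flags |= BLK_ZONE_WPLUG_ERROR;
}
```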

The reason I restored the error handling code is that the code in 
blk-zoned.c does not support error handling inside block drivers if QD > 1
and because supporting the SCSI error handler is essential for UFS
devices.

> I also fail to see why you treat request requeue as errors if you actually
> expect them to happen...

I only expect requeues to happen in exceptional cases, e.g. because the
SCSI error handler has been activated or because a SCSI device reports a 
unit attention. Requeues may happen in any order so I think that it is
essential to pause request submission after a requeue until all pending
zoned writes for the same zone have completed or failed.

> I do not think you need error handling and tracking of
> the completion wp position: when you get a requeue event, requeue all requests
> in the plug and set the plug wp position to the lowest sector of all the
> requeued requests. That is necessarily the current location of the write
> pointer.

That's an interesting proposal. I will look into setting the WP to the
lowest sector of all requeued requests.

> No need for re-introducing all the error handling for that, no ?

The error handling code has been restored because of another reason,
namely to support the SCSI error handler in case the queue depth is
larger than one.

> I am going to wait for you to resend this with a better "big picture"
> explanation of what you are trying to do so that I do not randomly comment on
> things that I am not sure I completely understand. Thanks.

I will include a more detailed cover letter when I repost this series
(after the merge window has closed). While I work on this, feedback on
this version of this patch series is still welcome.

Thanks,

Bart.
Damien Le Moal Jan. 23, 2025, 4:16 a.m. UTC | #3
On 1/22/25 6:57 AM, Bart Van Assche wrote:
> On 1/17/25 2:47 PM, Damien Le Moal wrote:
>> Also, patch 14 puts back a lot of the code that was recently removed. Not nice,
>> and its commit message is also far too short given the size of the patch. Please
>> spend more time explaining what the patches do and how they do it to facilitate
>> review.
> 
> Regarding the following part of patch 07/14:
> 
> -		disk_zone_wplug_abort(zwplug);
> -		zwplug->flags |= BLK_ZONE_WPLUG_NEED_WP_UPDATE;
> +		if (!disk->queue->limits.driver_preserves_write_order)
> +			zwplug->flags |= BLK_ZONE_WPLUG_NEED_WP_UPDATE;
> +		zwplug->flags |= BLK_ZONE_WPLUG_ERROR;
> 
> How about setting BLK_ZONE_WPLUG_ERROR only if
> disk->queue->limits.driver_preserves_write_order has been set such that
> the restored error handling code is only used if the storage controller
> preserves the write order? This should be sufficient to preserve the
> existing behavior of the blk-zoned.c code for SMR disks.

Your patch changes BLK_ZONE_WPLUG_NEED_WP_UPDATE into BLK_ZONE_WPLUG_ERROR for
regular (non-write-order-preserving) zoned devices. I do not see why this is
needed. The main difference between this default behavior and what you are
trying to do is that BLK_ZONE_WPLUG_NEED_WP_UPDATE is to be expected with your
changes while it is NOT expected to ever happen for a regular zoned device.

> The reason I restored the error handling code is that the code in 
> blk-zoned.c does not support error handling inside block drivers if QD > 1
> and because supporting the SCSI error handler is essential for UFS
> devices.

What error handling ? The only error handling we need is to tell the user
that there was an error and track that the user actually does something about
it. That tracking is done with the BLK_ZONE_WPLUG_NEED_WP_UPDATE flag. That's
it, nothing more.

I think that conflating/handling the expected (even if infrequent) write requeue
events with zoned UFS devices with actual hard device write errors (e.g. the
write failed because the device is toast or you hit a bad sector, or you have a
bad cable or whatever other hardware reason for the write to fail) is making a
mess of everything. Treating the requeues like hard errors makes things
complicated, but also wrong because even for zoned UFS devices, we may still get
hard write errors, and treating these in the same way as requeue event is wrong
in my opinion. For hard errors, the write must not be retried and the error
propagated immediately to the issuer (e.g. the FS).

For a requeue event, even though that is signaled initially at the bottom of the
stack by a device unit attention error, there is no reason for us to treat these
events as failed requests in the zone write plugging code. We need to have a
special treatment for these, e.g. a "ZONE_WRITE_NEED_RETRY" request flag (and
that flag can be set in the scsi mid-layer before calling into the block layer
requeue code).

When the zone write plugging sees this flag, it can:
1) Stop issuing write BIOs since we know they will fail
2) Wait for all requests already issued and in-flight to come back
3) Restore the zone write plug write pointer position to the lowest sector of
all requests that were requeued
4) Re-issue the requeued writes (after somehow sorting them)
5) Re-start issuing BIOs waiting in the write plug

And any write that fails due to a hard error will still set the zone write
plug with BLK_ZONE_WPLUG_NEED_WP_UPDATE after aborting all BIOs that are plugged
in the zone write plug.
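Steps 3 and 4 above can be sketched in userspace C, assuming requeued zoned writes are modeled only by their start sector (the struct and helper names are hypothetical):

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model of a requeued zoned write: only its start sector. */
struct requeued_write {
	unsigned long long sector;
};

static int cmp_sector(const void *a, const void *b)
{
	const struct requeued_write *wa = a, *wb = b;

	if (wa->sector < wb->sector)
		return -1;
	return wa->sector > wb->sector;
}

/*
 * Sort the requeued writes by start sector; after sorting, the first
 * entry necessarily starts at the current write pointer position, which
 * is returned so the zone write plug can be restored before the writes
 * are re-issued in order.
 */
static unsigned long long restore_wp_and_sort(struct requeued_write *w,
					      size_t nr)
{
	qsort(w, nr, sizeof(*w), cmp_sector);
	return w[0].sector;
}
```

The lowest sector after sorting is necessarily the current write pointer location, which is the observation made in step 3 above.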

Now, I think that the biggest difficulty of this work is to make sure that a
write that fails with an unaligned write error because of a preceding requeue
event can be clearly differentiated from the same failure without a preceding
requeue event. For the former, we can recover. For the latter, it is a user bug
and we must NOT hide it.

Can this be done cleanly ? I am not entirely sure about it because I am still
not 100% clear about the conditions that result in a zoned UFS device failing a
write with unit attention. As far as I can tell, the first write issued when the
device is sleeping will be rejected with that error and must be requeued. But
other writes behind this failed write that are already in flight will end up
failing with an unaligned write error, no ? Or will they be failed with a unit
attention too ? This is all unclear to me.
Bart Van Assche Jan. 27, 2025, 11:01 p.m. UTC | #4
On 1/22/25 8:16 PM, Damien Le Moal wrote:
> On 1/22/25 6:57 AM, Bart Van Assche wrote:
>> The reason I restored the error handling code is that the code in
>> blk-zoned.c does not support error handling inside block drivers if QD > 1
>> and because supporting the SCSI error handler is essential for UFS
>> devices.
> 
> What error handling ?

The SCSI error handler. If the SCSI error handler is invoked while only
one request is outstanding, there is no chance of reordering and no
additional measures are necessary. If multiple requests are outstanding
and the SCSI error handler is invoked, care must be taken that the
pending requests are resubmitted in LBA order per zone. This is not
possible without blocking zoned write processing until all pending
zoned writes have completed. Hence the need to set BLK_ZONE_WPLUG_ERROR
if write pipelining is enabled and a request is requeued.

> I think that conflating/handling the expected (even if infrequent) write requeue
> events with zoned UFS devices with actual hard device write errors (e.g. the
> write failed because the device is toast or you hit a bad sector, or you have a
> bad cable or whatever other hardware reason for the write to fail) is making a
> mess of everything. Treating the requeues like hard errors makes things
> complicated, but also wrong because even for zoned UFS devices, we may still get
> hard write errors, and treating these in the same way as requeue event is wrong
> in my opinion. For hard errors, the write must not be retried and the error
> propagated immediately to the issuer (e.g. the FS).
> 
> For a requeue event, even though that is signaled initially at the bottom of the
> stack by a device unit attention error, there is no reason for us to treat these
> events as failed requests in the zone write plugging code. We need to have a
> special treatment for these, e.g. a "ZONE_WRITE_NEED_RETRY" request flag (and
> that flag can be set in the scsi mid-layer before calling into the block layer
> requeue code).
> 
> When the zone write plugging sees this flag, it can:
> 1) Stop issuing write BIOs since we know they will fail
> 2) Wait for all requests already issued and in-flight to come back
> 3) Restore the zone write plug write pointer position to the lowest sector of
> all requests that were requeued
> 4) Re-issue the requeued writes (after somehow sorting them)
> 5) Re-start issuing BIOs waiting in the write plug
> 
> And any write that fails due to a hard error will still set the zone write
> plug with BLK_ZONE_WPLUG_NEED_WP_UPDATE after aborting all BIOs that are plugged
> in the zone write plug.
> 
> Now, I think that the biggest difficulty of this work is to make sure that a
> write that fails with an unaligned write error because of a preceding requeue
> event can be clearly differentiated from the same failure without a preceding
> requeue event. For the former, we can recover. For the latter, it is a user bug
> and we must NOT hide it.
> 
> Can this be done cleanly ? I am not entirely sure about it because I am still
> not 100% clear about the conditions that result in a zoned UFS device failing a
> write with unit attention. As far as I can tell, the first write issued when the
> device is sleeping will be rejected with that error and must be requeued. But
> other writes behind this failed write that are already in flight will end up
> failing with an unaligned write error, no ? Or will they be failed with a unit
> attention too ? This is all unclear to me.

I propose to keep the approach of the code in blk-zoned.c independent of
the details of how UFS devices work. That will make it more likely that
the blk-zoned.c code remains generic and that it can be reused for other
storage protocols.

Although I have not yet observed reordering due to a unit attention, I
think that unit attentions should be supported because unit attentions
are such an important SCSI mechanism.

UFSHCI 3 controllers use a 32-bit register to indicate which requests
are in flight, so reordering of requests is likely to happen when such a
controller comes out of the suspended state: upon leaving that state,
the controller scans the 32-bit register from the lowest to the highest
bit and hence the ordering information that was provided by the host is
lost. It
seems to me that the easiest way to address this reordering is by
resubmitting requests once in case the controller reordered the requests
due to leaving the suspended state. Please note that UFSHCI 3
controllers are not my primary focus and that I would be fine with not
supporting write pipelining for these controllers if there would be
disagreement about this aspect. UFSHCI 4 controllers have proper
submission queues in host memory, just like NVMe controllers, and hence
request ordering information for submission queues is not lost if the
controller is suspended and resumed.
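The bit-scan behavior described above can be illustrated with a small userspace sketch; the helper name is hypothetical and real UFSHCI 3 hardware may differ in details:

```c
#include <assert.h>

/*
 * Pending requests are tracked as bits in a 32-bit doorbell register.
 * After resuming, the controller processes them from the lowest to the
 * highest set bit, regardless of the order in which the host set those
 * bits, so host-side submission order is lost.
 */
static int first_slot_processed_after_resume(unsigned int doorbell)
{
	for (int bit = 0; bit < 32; bit++)
		if (doorbell & (1u << bit))
			return bit;
	return -1;	/* no request pending */
}
```

If the host rang the doorbell for slot 5 before slot 2, the controller nevertheless processes slot 2 first after resuming, which is exactly the reordering hazard described above.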

The SCSI error handler can also cause reordering. If e.g. the UFS
controller reports a CRC error then the UFS driver will reset the UFS
host controller and activate the SCSI error handler. Pending writes must
be resubmitted in LBA order per zone after the SCSI error handler has
finished to prevent write errors from propagating to the file system and
to user space.

At the time a UFS device reports an unaligned write error, I don't think
that it is possible to determine whether or not this is a retryable
error. So instead of introducing a ZONE_WRITE_NEED_RETRY flag, I propose
the following:
* Use BLK_ZONE_WPLUG_ERROR for both hard and soft errors. Hard errors
   are errors that shouldn't be retried; soft errors are retryable.
* After all pending writes have completed and before resubmitting any
   zoned writes, check whether or not these should be retried. For UFS
   devices this can be done by comparing the wp_offset_compl member with
   the lowest LBA of the pending zoned writes.
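A userspace sketch of the proposed check follows. The name wp_offset_compl follows the discussion above and is an assumption rather than merged code; it stands for the write pointer position implied by the writes that completed successfully:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * If the lowest LBA among the pending zoned writes matches the write
 * pointer position implied by the completed writes, the unaligned write
 * error was caused by reordering and the writes are retryable (a soft
 * error); otherwise the error reflects a real bug in the submitter and
 * must be propagated (a hard error).
 */
static bool zoned_writes_retryable(unsigned long long wp_offset_compl,
				   const unsigned long long *pending_lba,
				   size_t nr)
{
	unsigned long long lowest = pending_lba[0];

	for (size_t i = 1; i < nr; i++)
		if (pending_lba[i] < lowest)
			lowest = pending_lba[i];
	return lowest == wp_offset_compl;
}
```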

Do I remember correctly that residual information is not always reliable
for zoned writes to SMR disks and hence that another approach is needed
for SMR disks? How about setting BLK_ZONE_WPLUG_NEED_WP_UPDATE for SMR
disks instead of resubmitting writes after an unaligned write has been
reported by the storage device?

Thanks,

Bart.