mbox series

[v16,00/26] Improve write performance for zoned UFS devices

Message ID 20241119002815.600608-1-bvanassche@acm.org (mailing list archive)
Headers show
Series Improve write performance for zoned UFS devices | expand

Message

Bart Van Assche Nov. 19, 2024, 12:27 a.m. UTC
Hi Damien and Christoph,

This patch series improves small write IOPS by a factor of four (+300%) for
zoned UFS devices on my test setup with an UFSHCI 3.0 controller. Although
you are probably busy because the merge window is open, please take a look
at this patch series when you have the time. This patch series is organized
as follows:
 - Bug fixes for existing code at the start of the series.
 - The write pipelining support implementation comes after the bug fixes.

Thanks,

Bart.

Changes compared to v15:
 - Reworked this patch series on top of the zone write plugging approach.
 - Moved support for requeuing requests from the SCSI core into the block
   layer core.
 - In the UFS driver, instead of disabling write pipelining if
   auto-hibernation is enabled, rely on the requeuing mechanism to handle
   reordering caused by resuming from auto-hibernation.

Changes compared to v14:
 - Removed the drivers/scsi/Kconfig.kunit and drivers/scsi/Makefile.kunit
   files. Instead, modified drivers/scsi/Kconfig and added #include "*_test.c"
   directives in the appropriate .c files. Removed the EXPORT_SYMBOL()
   directives that were added to make the unit tests link.
 - Fixed a double free in a unit test.

Changes compared to v13:
 - Reworked patch "block: Preserve the order of requeued zoned writes".
 - Addressed a performance concern by removing the eh_needs_prepare_resubmit
   SCSI driver callback and by introducing the SCSI host template flag
   .needs_prepare_resubmit instead.
 - Added a patch that adds a 'host' argument to scsi_eh_flush_done_q().
 - Made the code in unit tests less repetitive.

Changes compared to v12:
 - Added two new patches: "block: Preserve the order of requeued zoned writes"
   and "scsi: sd: Add a unit test for sd_cmp_sector()"
 - Restricted the number of zoned write retries. To my surprise I had to add
   "&& scmd->retries <= scmd->allowed" in the SCSI error handler to limit the
   number of retries.
 - In patch "scsi: ufs: Inform the block layer about write ordering", only set
   ELEVATOR_F_ZBD_SEQ_WRITE for zoned block devices.

Changes compared to v11:
 - Fixed a NULL pointer dereference that happened when booting from an ATA
   device by adding an scmd->device != NULL check in scsi_needs_preparation().
 - Updated Reviewed-by tags.

Changes compared to v10:
 - Dropped the UFS MediaTek and HiSilicon patches because these are not correct
   and because it is safe to drop these patches.
 - Updated Acked-by / Reviewed-by tags.

Changes compared to v9:
 - Introduced an additional scsi_driver callback: .eh_needs_prepare_resubmit().
 - Renamed the scsi_debug kernel module parameter 'no_zone_write_lock' into
   'preserves_write_order'.
 - Fixed an out-of-bounds access in the unit scsi_call_prepare_resubmit() unit
   test.
 - Wrapped ufshcd_auto_hibern8_update() calls in UFS host drivers with
   WARN_ON_ONCE() such that a kernel stack appears in case an error code is
   returned.
 - Elaborated a comment in the UFSHCI driver.

Changes compared to v8:
 - Fixed handling of 'driver_preserves_write_order' and 'use_zone_write_lock'
   in blk_stack_limits().
 - Added a comment in disk_set_zoned().
 - Modified blk_req_needs_zone_write_lock() such that it returns false if
   q->limits.use_zone_write_lock is false.
 - Modified disk_clear_zone_settings() such that it clears
   q->limits.use_zone_write_lock.
 - Left out one change from the mq-deadline patch that became superfluous due to
   the blk_req_needs_zone_write_lock() change.
 - Modified scsi_call_prepare_resubmit() such that it only calls list_sort() if
   zoned writes have to be resubmitted for which zone write locking is disabled.
 - Added an additional unit test for scsi_call_prepare_resubmit().
 - Modified the sorting code in the sd driver such that only those SCSI commands
   are sorted for which write locking is disabled.
 - Modified sd_zbc.c such that ELEVATOR_F_ZBD_SEQ_WRITE is only set if the
   write order is not preserved.
 - Included three patches for UFS host drivers that rework code that wrote
   directly to the auto-hibernation controller register.
 - Modified the UFS driver such that enabling auto-hibernation is not allowed
   if a zoned logical unit is present and if the controller operates in legacy
   mode.
 - Also in the UFS driver, simplified ufshcd_auto_hibern8_update().

Changes compared to v7:
 - Split the queue_limits member variable `use_zone_write_lock' into two member
   variables: `use_zone_write_lock' (set by disk_set_zoned()) and
   `driver_preserves_write_order' (set by the block driver or SCSI LLD). This
   should clear up the confusion about the purpose of this variable.
 - Moved the code for sorting SCSI commands by LBA from the SCSI error handler
   into the SCSI disk (sd) driver as requested by Christoph.
   
Changes compared to v6:
 - Removed QUEUE_FLAG_NO_ZONE_WRITE_LOCK and instead introduced a flag in
   the request queue limits data structure.

Changes compared to v5:
 - Renamed scsi_cmp_lba() into scsi_cmp_sector().
 - Improved several source code comments.

Changes compared to v4:
 - Dropped the patch that introduces the REQ_NO_ZONE_WRITE_LOCK flag.
 - Dropped the null_blk patch and added two scsi_debug patches instead.
 - Dropped the f2fs patch.
 - Split the patch for the UFS driver into two patches.
 - Modified several patch descriptions and source code comments.
 - Renamed dd_use_write_locking() into dd_use_zone_write_locking().
 - Moved the list_sort() call from scsi_unjam_host() into scsi_eh_flush_done_q()
   such that sorting happens just before reinserting.
 - Removed the scsi_cmd_retry_allowed() call from scsi_check_sense() to make
   sure that the retry counter is adjusted once per retry instead of twice.

Changes compared to v3:
 - Restored the patch that introduces QUEUE_FLAG_NO_ZONE_WRITE_LOCK. That patch
   had accidentally been left out from v2.
 - In patch "block: Introduce the flag REQ_NO_ZONE_WRITE_LOCK", improved the
   patch description and added the function blk_no_zone_write_lock().
 - In patch "block/mq-deadline: Only use zone locking if necessary", moved the
   blk_queue_is_zoned() call into dd_use_write_locking().
 - In patch "fs/f2fs: Disable zone write locking", set REQ_NO_ZONE_WRITE_LOCK
   from inside __bio_alloc() instead of in f2fs_submit_write_bio().

Changes compared to v2:
 - Renamed the request queue flag for disabling zone write locking.
 - Introduced a new request flag for disabling zone write locking.
 - Modified the mq-deadline scheduler such that zone write locking is only
   disabled if both flags are set.
 - Added an F2FS patch that sets the request flag for disabling zone write
   locking.
 - Only disable zone write locking in the UFS driver if auto-hibernation is
   disabled.

Changes compared to v1:
 - Left out the patches that are already upstream.
 - Switched the approach in patch "scsi: Retry unaligned zoned writes" from
   retrying immediately to sending unaligned write commands to the SCSI error
   handler.

Bart Van Assche (26):
  blk-zoned: Fix a reference count leak
  blk-zoned: Split disk_zone_wplugs_work()
  blk-zoned: Split queue_zone_wplugs_show()
  blk-zoned: Only handle errors after pending zoned writes have
    completed
  blk-zoned: Fix a deadlock triggered by unaligned writes
  blk-zoned: Fix requeuing of zoned writes
  block: Support block drivers that preserve the order of write requests
  dm-linear: Report to the block layer that the write order is preserved
  mq-deadline: Remove a local variable
  blk-mq: Clean up blk_mq_requeue_work()
  block: Optimize blk_mq_submit_bio() for the cache hit scenario
  block: Rework request allocation in blk_mq_submit_bio()
  block: Support allocating from a specific software queue
  blk-mq: Restore the zoned write order when requeuing
  blk-zoned: Document the locking order
  blk-zoned: Document locking assumptions
  blk-zoned: Uninline functions that are not in the hot path
  blk-zoned: Make disk_should_remove_zone_wplug() more robust
  blk-zoned: Add an argument to blk_zone_plug_bio()
  blk-zoned: Support pipelining of zoned writes
  scsi: core: Retry unaligned zoned writes
  scsi: sd: Increase retry count for zoned writes
  scsi: scsi_debug: Add the preserves_write_order module parameter
  scsi: scsi_debug: Support injecting unaligned write errors
  scsi: scsi_debug: Skip host/bus reset settle delay
  scsi: ufs: Inform the block layer about write ordering

 block/bfq-iosched.c       |   2 +
 block/blk-mq.c            |  97 ++++++----
 block/blk-mq.h            |   3 +
 block/blk-settings.c      |   2 +
 block/blk-zoned.c         | 376 +++++++++++++++++++++++++-------------
 block/blk.h               |  33 +++-
 block/kyber-iosched.c     |   2 +
 block/mq-deadline.c       |  10 +-
 drivers/md/dm-linear.c    |   6 +
 drivers/md/dm.c           |   2 +-
 drivers/scsi/scsi_debug.c |  22 ++-
 drivers/scsi/scsi_error.c |  16 ++
 drivers/scsi/sd.c         |   7 +
 drivers/ufs/core/ufshcd.c |   7 +
 include/linux/blk-mq.h    |  20 +-
 include/linux/blkdev.h    |  11 +-
 16 files changed, 444 insertions(+), 172 deletions(-)

Comments

Damien Le Moal Nov. 19, 2024, 8:01 a.m. UTC | #1
On 11/19/24 09:27, Bart Van Assche wrote:
> Hi Damien and Christoph,
> 
> This patch series improves small write IOPS by a factor of four (+300%) for
> zoned UFS devices on my test setup with an UFSHCI 3.0 controller. Although
> you are probably busy because the merge window is open, please take a look
> at this patch series when you have the time. This patch series is organized
> as follows:
>  - Bug fixes for existing code at the start of the series.
>  - The write pipelining support implementation comes after the bug fixes.

Impressive improvements but the changes are rather invasive. Have you tried
simpler solution like forcing unplugging a zone write plug from the driver once
a command is passed to the driver and the driver did not reject it ? It seems
like this would make everything simpler on the block layer side. But I am not
sure if the performance gains would be the same.
Christoph Hellwig Nov. 19, 2024, 12:25 p.m. UTC | #2
On Mon, Nov 18, 2024 at 04:27:49PM -0800, Bart Van Assche wrote:
> Hi Damien and Christoph,
> 
> This patch series improves small write IOPS by a factor of four (+300%) for
> zoned UFS devices on my test setup with an UFSHCI 3.0 controller.

What's your exact test setup?  Which upstream kernel support zoned UFS
device are using?
Bart Van Assche Nov. 19, 2024, 6:52 p.m. UTC | #3
On 11/19/24 4:25 AM, Christoph Hellwig wrote:
> What's your exact test setup?

Pixel 2023 and 2025 development boards with android-mainline kernel
(https://android.googlesource.com/kernel/common/+/refs/heads/android-mainline). 
The android mainline kernel tracks the upstream kernel closely. While 
the performance of Pixel development boards is
identical to that of the corresponding smartphone generation, these
development boards have the following advantages:
- A UFS socket that allows swapping UFS devices without any soldering.
- A USB port that makes it easy to monitor and capture kernel log
   messages.

> Which upstream kernel support zoned UFS device are using?

A Micron ZUFS device. Micron zoned UFS devices have a zone size that is
a power of two.

Thanks,

Bart.
Bart Van Assche Nov. 19, 2024, 7:08 p.m. UTC | #4
On 11/19/24 12:01 AM, Damien Le Moal wrote:
> Impressive improvements but the changes are rather invasive. Have you tried
> simpler solution like forcing unplugging a zone write plug from the driver once
> a command is passed to the driver and the driver did not reject it ? It seems
> like this would make everything simpler on the block layer side. But I am not
> sure if the performance gains would be the same.

Hi Damien,

I'm not sure that the approach of submitting a new zoned write if the
driver did not reject the previous write would result in a simpler
solution. SCSI devices are allowed to respond to any command with a unit
attention instead of processing the command. If a unit attention is
reported, the SCSI core requeues the command. In other words, even with
this approach, proper support for requeued zoned writes in the block
layer is required.

Additionally, that approach is not compatible with using .queue_rqs().
While the SCSI core does not yet support a .queue_rqs() callback, I
think it would be good to have this support in the SCSI core.

If we need requeuing support anyway, why to select an approach that
probably will result in lower performance than what has been implemented
in this patch series?

Thanks,

Bart.
Damien Le Moal Nov. 21, 2024, 3:20 a.m. UTC | #5
On 11/20/24 04:08, Bart Van Assche wrote:
> On 11/19/24 12:01 AM, Damien Le Moal wrote:
>> Impressive improvements but the changes are rather invasive. Have you tried
>> simpler solution like forcing unplugging a zone write plug from the driver once
>> a command is passed to the driver and the driver did not reject it ? It seems
>> like this would make everything simpler on the block layer side. But I am not
>> sure if the performance gains would be the same.
> 
> Hi Damien,
> 
> I'm not sure that the approach of submitting a new zoned write if the
> driver did not reject the previous write would result in a simpler
> solution. SCSI devices are allowed to respond to any command with a unit
> attention instead of processing the command. If a unit attention is
> reported, the SCSI core requeues the command. In other words, even with
> this approach, proper support for requeued zoned writes in the block
> layer is required.

Yes, but it would be vastly simpler because you would be guaranteed to having
only a single write request per zone being in-flight between the write plug and
the device at any time. So the requeue would not need reordering, and likely not
need any special code at all. nless I am missing something, this would be
simpler, no ?

The main question though with such approach is: does it give you the same
performance improvements as your current (more invasive) approach ?

> Additionally, that approach is not compatible with using .queue_rqs().
> While the SCSI core does not yet support a .queue_rqs() callback, I
> think it would be good to have this support in the SCSI core.

I do not understand why it is not compatible. What is the problem ?

> If we need requeuing support anyway, why to select an approach that
> probably will result in lower performance than what has been implemented
> in this patch series?

I am only trying to see if there is not a simpler approach than what you did.
The less changes, the better, right ?
Bart Van Assche Nov. 21, 2024, 6 p.m. UTC | #6
On 11/20/24 7:20 PM, Damien Le Moal wrote:
> I am only trying to see if there is not a simpler approach than what you did.
> The less changes, the better, right ?

Hi Damien,

I agree with you that we should select the simplest approach that yields
the desired performance.

Regarding the proposed approach, forcing unplugging a zone write plug
from the driver once a command is passed to the driver and the driver
did not reject it, is this approach compatible with SCSI devices that
may report a unit attention? If two zoned writes for the same zone are
submitted in order to a SCSI device and the SCSI device responds with a
unit attention condition to the first write then the second write will
fail with an "unaligned write" error. This will have to be handled by
pausing zoned write submission and by resubmitting zoned writes after
all pending zoned writes for the given zone have completed. In other
words, if higher queue depths are supported per zone, we cannot avoid
increasing the complexity of the code in block/blk-zoned.c. If we cannot
avoid increasing the complexity of that code, I think we can as well
select the approach that yields the highest performance and the fewest
changes in the block layer code for regular reads and writes.

Thanks,

Bart.
Damien Le Moal Nov. 25, 2024, 3:59 a.m. UTC | #7
On 11/22/24 3:00 AM, Bart Van Assche wrote:
> On 11/20/24 7:20 PM, Damien Le Moal wrote:
>> I am only trying to see if there is not a simpler approach than what you did.
>> The less changes, the better, right ?
> 
> Hi Damien,
> 
> I agree with you that we should select the simplest approach that yields
> the desired performance.
> 
> Regarding the proposed approach, forcing unplugging a zone write plug
> from the driver once a command is passed to the driver and the driver
> did not reject it, is this approach compatible with SCSI devices that
> may report a unit attention? If two zoned writes for the same zone are
> submitted in order to a SCSI device and the SCSI device responds with a
> unit attention condition to the first write then the second write will
> fail with an "unaligned write" error. This will have to be handled by

I have never seen this happen (i mean UNIT ATTENTION being returned) in
practice with SAS HDDs. Not sure how zoned UFS devices behave though.

> pausing zoned write submission and by resubmitting zoned writes after
> all pending zoned writes for the given zone have completed. In other
> words, if higher queue depths are supported per zone, we cannot avoid
> increasing the complexity of the code in block/blk-zoned.c. If we cannot
> avoid increasing the complexity of that code, I think we can as well
> select the approach that yields the highest performance and the fewest
> changes in the block layer code for regular reads and writes.

But it seems that all we need to do is a better request handling in the
requeue+dispatch queue to handle requeued writes. I am not yet convinced that
zone write plugging needs a lot of changes beside allowing more than one write
per zone and some form of throttling based on feedback from the requeue path.
I.e. the current code throttles write command issuing every time one command is
submitted: the submission "plugs" the write plug. In your case, it seems that
the plugging should be driven by the requeue path, or the driver. Unplugging of
a plug happens currently when a BIO completes, but in your case, it would need
to be also driven by the requeue path or driver.

I am thinking that things could be cleaner/easier to maintain with a zoned
write requeue special path. May be... I am thinking aloud here as I have not
tried to code anything.

> 
> Thanks,
> 
> Bart.
> 
>