mbox series

[00/16] support non power of 2 zoned devices

Message ID 20220427160255.300418-1-p.raghav@samsung.com (mailing list archive)
Headers show
Series support non power of 2 zoned devices | expand

Message

Pankaj Raghav April 27, 2022, 4:02 p.m. UTC
- Background and Motivation:

The zone storage implementation in Linux, introduced since v4.10, first
targetted SMR drives which have a power of 2 (po2) zone size alignment
requirement. The po2 zone size was further imposed implicitly by the
block layer's blk_queue_chunk_sectors(), used to prevent IO merging
across chunks beyond the specified size, since v3.16 through commit
762380ad9322 ("block: add notion of a chunk size for request merging").
But this same general block layer po2 requirement for blk_queue_chunk_sectors()
was removed on v5.10 through commit 07d098e6bbad ("block: allow 'chunk_sectors'
to be non-power-of-2"). NAND, which is the media used in newer zoned storage
devices, does not naturally align to po2, and so the po2 requirement
does not make sense for those type of zone storage devices.

Removing the po2 requirement from zone storage should therefore be possible
now provided that no userspace regression and no performance regressions are
introduced. Stop-gap patches have been already merged into f2fs-tools to
proactively not allow npo2 zone sizes until proper support is added [0].
Additional kernel stop-gap patches are provided in this series for dm-zoned.
Support for npo2 zonefs and btrfs support is addressed in this series.

There was an effort previously [1] to add support to non po2 devices via
device level emulation but that was rejected with a final conclusion
to add support for non po2 zoned device in the complete stack[2].

- Patchset description:
This patchset aims at adding support to non power of 2 zoned devices in
the block layer, nvme layer, null blk and adds support to btrfs and
zonefs.

This round of patches **will not** support f2fs and DM layer for non
power of 2 zoned devices. More about this in the future work section.

Patches 1-4 deals with removing the po2 constraint from the
block layer. Note that the patches have been split for clarity and it can
be squashed together if the community feels that is better.

Patches 5-6 deals with removing the constraint from nvme zns.

Patches 7-11 adds support to btrfs for non po2 zoned devices.

Patch 12 removes the po2 constraint in ZoneFS

Patch 13 removes the po2 contraint in null blk

Patches 14-16 adds conditions to not allow non power of 2 devices in
f2fs and DM.

- Performance:
PO2 zone sizes utilizes log and shifts instead of division when
determing alignment, zone number, etc. The same math cannot be used when
using a zoned device with non po2 zone size. Hence, to avoid any performance
regression on zoned devices with po2 zone sizes, the optimized math in the
hot paths has been retained with branching.

The performance was measured using null blk for regression
and the results have been posted in the appropriate commit log. No
performance regression was noticed.

- Testing
With respect to testing we need to tackle two things: one for regression
on po2 zoned device and progression on non po2 zoned devices.

kdevops (https://github.com/mcgrof/kdevops) was extensively used to
automate the testing for blktests and (x)fstests for btrfs changes. The
known failures were excluded during the test based on the baseline
v5.17.0-rc7

-- regression
Emulated zoned device with zone size =128M , nr_zones = 10000

Block and nvme zns:
blktests were run with no new failures

Btrfs:
Changes were tested with the following profile in QEMU:
[btrfs_simple_zns]
TEST_DIR=<dir>
SCRATCH_MNT=<mnt>
FSTYP=btrfs
MKFS_OPTIONS="-f -d single -m single"
TEST_DEV=<dev>
SCRATCH_DEV_POOL=<dev-pool>

No new failures were observed in btrfs, generic and shared test suite

ZoneFS:
zonefs-tests-nullblk.sh and zonefs-tests.sh from zonefs-tools were run
with no failures.

nullblk:
t/zbd/run-tests-against-nullb from fio was run with no failures.

F2FS and DM:
It was verified if f2fs and dm-zoned successfully mounts without any
error.

-- progression
Emulated zoned device with zone size = 96M , nr_zones = 10000

Block and nvme zns:
blktests were run with no new failures

Btrfs:
Same profile as po2 zone size was used.

Many tests in xfstests for btrfs included dm-flakey and some tests
required dm-linear. As they are not supported at the moment for non
po2 devices, those **tests were excluded for non po2 devices**.

No new failures were observed in btrfs, generic and shared test suite

ZoneFS:
zonefs-tests.sh from zonefs-tools were run with no failures.

nullblk:
A new section was added to cover non po2 devices:

section14()
{
       conv_pcnt=10
       zone_size=3
       zone_capacity=3
       max_open=${set_max_open}
       zbd_test_opts+=("-o ${max_open}")
}
t/zbd/run-tests-against-nullb from fio was run with no failures.

F2FS and DM:
It was verified that f2fs and dm-zoned does not mount.

- Open issue:
* btrfs superblock location for zoned devices is expected to be in 0,
  512GB(mirror) and 4TB(mirror) in the device. Zoned devices with po2
  zone size will naturally align with these superblock location but non
  po2 devices will not align with 512GB and 4TB offset.

  The current approach for npo2 devices is to place the superblock mirror
  zones near   512GB and 4TB that is **aligned to the zone size**. This
  is of no issue for normal operation as we keep track where the superblock
  mirror are placed but this can cause an issue with recovery tools for
  zoned devices as they expect mirror superblock to be in 512GB and 4TB.

  Note that ATM, recovery tools such as `btrfs check` does not work for
  image dumps for zoned devices even for po2 zone sizes.

  I hope this issue could be discussed as a part of the BoF on ZNS
  during the upcoming LSFMM.

- Tools:
Some tools had to be updated to support non po2 devices. Once these
patches are accepted in the kernel, these tool updates will also be
upstreamed.
* btrfs-prog: https://github.com/Panky-codes/btrfs-progs/tree/remove-po2-btrfs
* blkzone: https://github.com/Panky-codes/util-linux/tree/remove-po2

- Future work
To reduce the amount of changes and testing, support for F2FS and DM was
excluded in this round of patches. The plan is to add support to them in
the forthcoming future.

[0] https://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git/commit/?h=dev-test&id=6afcf6493578e77528abe65ab8b12f3e1c16749f
[1] https://lore.kernel.org/all/20220310094725.GA28499@lst.de/T/
[2] https://lore.kernel.org/all/20220315135245.eqf4tqngxxb7ymqa@unifi/

Luis Chamberlain (4):
  nvmet: use blk_queue_zone_no()
  f2fs: call bdev_zone_sectors() only once on init_blkz_info()
  f2fs: ensure only power of 2 zone sizes are allowed
  dm-zoned: ensure only power of 2 zone sizes are allowed

Pankaj Raghav (12):
  block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2
    zsze
  block: add blk_queue_zone_aligned and bdev_zone_aligned helper
  block: add bdev_zone_no helper
  block: allow blk-zoned devices to have non-power-of-2 zone size
  nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
  btrfs: zoned: Cache superblock location in btrfs_zoned_device_info
  btrfs: zoned: add generic btrfs helpers for zoned devices
  btrfs: zoned: Make sb_zone_number function non power of 2 compatible
  btrfs: zoned: use btrfs zone helpers to support non po2 zoned devices
  btrfs: zoned: relax the alignment constraint for zoned devices
  zonefs: allow non power of 2 zoned devices
  null_blk: allow non power of 2 zoned devices

 block/blk-core.c               |   3 +-
 block/blk-zoned.c              |  20 ++++--
 drivers/block/null_blk/main.c  |   4 +-
 drivers/block/null_blk/zoned.c |  14 ++--
 drivers/md/dm-zone.c           |  12 ++++
 drivers/nvme/host/zns.c        |  11 +---
 drivers/nvme/target/zns.c      |   2 +-
 fs/btrfs/volumes.c             |  24 +++----
 fs/btrfs/zoned.c               | 115 +++++++++++++++++----------------
 fs/btrfs/zoned.h               |  41 ++++++++++--
 fs/f2fs/super.c                |  15 +++--
 fs/zonefs/super.c              |   6 +-
 fs/zonefs/zonefs.h             |   1 -
 include/linux/blkdev.h         |  48 +++++++++++++-
 14 files changed, 208 insertions(+), 108 deletions(-)

Comments

Johannes Thumshirn May 2, 2022, 10:07 p.m. UTC | #1
On 27/04/2022 09:03, Pankaj Raghav wrote:
> - Background and Motivation:
> 
> The zone storage implementation in Linux, introduced since v4.10, first
> targetted SMR drives which have a power of 2 (po2) zone size alignment
> requirement. The po2 zone size was further imposed implicitly by the
> block layer's blk_queue_chunk_sectors(), used to prevent IO merging
> across chunks beyond the specified size, since v3.16 through commit
> 762380ad9322 ("block: add notion of a chunk size for request merging").
> But this same general block layer po2 requirement for blk_queue_chunk_sectors()
> was removed on v5.10 through commit 07d098e6bbad ("block: allow 'chunk_sectors'
> to be non-power-of-2"). NAND, which is the media used in newer zoned storage
> devices, does not naturally align to po2, and so the po2 requirement
> does not make sense for those type of zone storage devices.
> 
> Removing the po2 requirement from zone storage should therefore be possible
> now provided that no userspace regression and no performance regressions are
> introduced. Stop-gap patches have been already merged into f2fs-tools to
> proactively not allow npo2 zone sizes until proper support is added [0].
> Additional kernel stop-gap patches are provided in this series for dm-zoned.
> Support for npo2 zonefs and btrfs support is addressed in this series.
> 
> There was an effort previously [1] to add support to non po2 devices via
> device level emulation but that was rejected with a final conclusion
> to add support for non po2 zoned device in the complete stack[2].

Hey Pankaj,

One thing I'm concerned with this patches is, once we have npo2 zones (or to be precise 
not fs_info->sectorsize aligned zones) we have to check on every allocation if we still 
have at least have fs_info->sectorsize bytes left in a zone. If not we need to 
explicitly finish the zone, otherwise we'll run out of max active zones. 

This is a problem for zoned btrfs at the moment already but it'll be even worse
with npo2, because we're never implicitly finishing zones.

See also 
https://lore.kernel.org/linux-btrfs/42758829d8696a471a27f7aaeab5468f60b1565d.1651157034.git.naohiro.aota@wdc.com

Thanks,
	Johannes
Pankaj Raghav May 3, 2022, 9:12 a.m. UTC | #2
Hi Johannes,
On 2022-05-03 00:07, Johannes Thumshirn wrote:
>> There was an effort previously [1] to add support to non po2 devices via
>> device level emulation but that was rejected with a final conclusion
>> to add support for non po2 zoned device in the complete stack[2].
> 
> Hey Pankaj,
> 
> One thing I'm concerned with this patches is, once we have npo2 zones (or to be precise 
> not fs_info->sectorsize aligned zones) we have to check on every allocation if we still 
> have at least have fs_info->sectorsize bytes left in a zone. If not we need to 
> explicitly finish the zone, otherwise we'll run out of max active zones. 
> 
This commit: `btrfs: zoned: relax the alignment constraint for zoned
devices` makes sure the zone size is BTRFS_STRIPE_LEN aligned (64K). So
even the npo2 zoned device should be aligned to `fs_info->sectorsize`,
which is typically 4k.

This was one of the comment that came from David Sterba:
https://lore.kernel.org/all/20220315142740.GU12643@twin.jikos.cz/
where he suggested to have some sane alignment for the zone sizes.

> This is a problem for zoned btrfs at the moment already but it'll be even worse
> with npo2, because we're never implicitly finishing zones.
> 
> See also 
> https://lore.kernel.org/linux-btrfs/42758829d8696a471a27f7aaeab5468f60b1565d.1651157034.git.naohiro.aota@wdc.com
> 
I did take a look at this few days back and the patch should work fine
also for npo2 zoned device as we allow only zone sizes that are
BTRFS_STRIPE_LEN aligned. So even the max nodesize for METADATA BGs is
only 64k and it should be aligned correctly to implicitly finish the zone.

Let me know your thoughts and if I am missing something.

Regards,
Pankaj
David Sterba May 4, 2022, 9:14 p.m. UTC | #3
On Tue, May 03, 2022 at 11:12:04AM +0200, Pankaj Raghav wrote:
> Hi Johannes,
> On 2022-05-03 00:07, Johannes Thumshirn wrote:
> >> There was an effort previously [1] to add support to non po2 devices via
> >> device level emulation but that was rejected with a final conclusion
> >> to add support for non po2 zoned device in the complete stack[2].
> > 
> > Hey Pankaj,
> > 
> > One thing I'm concerned with this patches is, once we have npo2 zones (or to be precise 
> > not fs_info->sectorsize aligned zones) we have to check on every allocation if we still 
> > have at least have fs_info->sectorsize bytes left in a zone. If not we need to 
> > explicitly finish the zone, otherwise we'll run out of max active zones. 
> > 
> This commit: `btrfs: zoned: relax the alignment constraint for zoned
> devices` makes sure the zone size is BTRFS_STRIPE_LEN aligned (64K). So
> even the npo2 zoned device should be aligned to `fs_info->sectorsize`,
> which is typically 4k.
> 
> This was one of the comment that came from David Sterba:
> https://lore.kernel.org/all/20220315142740.GU12643@twin.jikos.cz/
> where he suggested to have some sane alignment for the zone sizes.

My idea of 'sane' value would be 1M, that we have 4K for sectors is
because of the 1:1 mapping to pages, but RAM sizes are on a different
scale than storage devices. The 4K is absolute minimum but if the page
size is taken as a basic constraint, ARM has 64K and there are some 256K
arches.
Pankaj Raghav May 5, 2022, 7:28 a.m. UTC | #4
On 2022-05-04 23:14, David Sterba wrote:
>> This commit: `btrfs: zoned: relax the alignment constraint for zoned
>> devices` makes sure the zone size is BTRFS_STRIPE_LEN aligned (64K). So
>> even the npo2 zoned device should be aligned to `fs_info->sectorsize`,
>> which is typically 4k.
>>
>> This was one of the comment that came from David Sterba:
>> https://lore.kernel.org/all/20220315142740.GU12643@twin.jikos.cz/
>> where he suggested to have some sane alignment for the zone sizes.
> 
> My idea of 'sane' value would be 1M, that we have 4K for sectors is
> because of the 1:1 mapping to pages, but RAM sizes are on a different
> scale than storage devices. The 4K is absolute minimum but if the page
> size is taken as a basic constraint, ARM has 64K and there are some 256K
> arches.

That is a good point. I think it is safe to have 1MB as the minimum
alignment so that it covers all architecture's page sizes. Thanks. I
will queue this up.