Message ID | 20220823121859.163903-14-p.raghav@samsung.com (mailing list archive) |
---|---|
State | Changes Requested, archived |
Delegated to: | Mike Snitzer |
Series | support zoned block devices with non-power-of-2 zone sizes |
On Aug 23, 2022 / 14:18, Pankaj Raghav wrote:
> Only zoned devices with power-of-2(po2) number of sectors per zone(zone
> size) were supported in linux but now non power-of-2(npo2) zone sizes
> support has been added to the block layer.
>
> Filesystems such as F2FS and btrfs have support for zoned devices with
> po2 zone size assumption. Before adding native support for npo2 zone
> sizes, it was suggested to create a dm target for npo2 zone size device to
> appear as a po2 zone size target so that file systems can initially
> work without any explicit changes by using this target.

FYI, with this patch series, I created the new dm target and ran blktests zbd
group for it. And I observed zbd/007 test case failure (other test cases
passed). The test checks sector mapping of zoned dm-linear, dm-flakey and
dm-crypt. Some changes in the test case look required to handle the new target.
On 2022-08-30 04:52, Shinichiro Kawasaki wrote:
> On Aug 23, 2022 / 14:18, Pankaj Raghav wrote:
>> Only zoned devices with power-of-2(po2) number of sectors per zone(zone
>> size) were supported in linux but now non power-of-2(npo2) zone sizes
>> support has been added to the block layer.
>>
>> Filesystems such as F2FS and btrfs have support for zoned devices with
>> po2 zone size assumption. Before adding native support for npo2 zone
>> sizes, it was suggested to create a dm target for npo2 zone size device to
>> appear as a po2 zone size target so that file systems can initially
>> work without any explicit changes by using this target.
>
> FYI, with this patch series, I created the new dm target and ran blktests zbd
> group for it. And I observed zbd/007 test case failure (other test cases
> passed). The test checks sector mapping of zoned dm-linear, dm-flakey and
> dm-crypt. Some changes in the test case look required to handle the new target.
>

Thanks for testing it. I am aware of this test case, and I skipped it while
I was testing my target. The test needs to be adapted as the container's
start, and the logical device's start will be different for this target.

I initially thought this test case might not apply to the dm-po2zone
target, but at a closer look, it is helpful once the zone offset is adapted
while doing a reset and writing data as the test only verifies the relative
WP position.

I also noticed that this test relies on getting the underlying device id
using `dmsetup table` command. The target currently lacks the `.status`
callback which appends the device id details. I will add them as a part of
the next revision for this target.

Thanks.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
On 2022-08-23 14:18, Pankaj Raghav wrote:
> +static int dm_po2z_iterate_devices(struct dm_target *ti,
> +				   iterate_devices_callout_fn fn, void *data)
> +{
> +	struct dm_po2z_target *dmh = ti->private;
> +	sector_t len = dmh->nr_zones * dmh->zone_size;
> +
> +	return fn(ti, dmh->dev, 0, len, data);
> +}
> +
> +static struct target_type dm_po2z_target = {
> +	.name = "po2zone",
> +	.version = { 1, 0, 0 },
> +	.features = DM_TARGET_ZONED_HM | DM_TARGET_EMULATED_ZONES,

This target also supports DM_TARGET_NOWAIT feature flag. I will add it in
the next version.

> +	.map = dm_po2z_map,
> +	.end_io = dm_po2z_end_io,
> +	.report_zones = dm_po2z_report_zones,
> +	.iterate_devices = dm_po2z_iterate_devices,
> +	.module = THIS_MODULE,
> +	.io_hints = dm_po2z_io_hints,
> +	.ctr = dm_po2z_ctr,
> +};
> +
> +static int __init dm_po2z_init(void)
> +{
> +	return dm_register_target(&dm_po2z_target);
> +}
> +
> +static void __exit dm_po2z_exit(void)
> +{
> +	dm_unregister_target(&dm_po2z_target);
> +}
> +
> +/* Module hooks */
> +module_init(dm_po2z_init);
> +module_exit(dm_po2z_exit);
> +
> +MODULE_DESCRIPTION(DM_NAME "power-of-2 zoned target");
> +MODULE_AUTHOR("Pankaj Raghav <p.raghav@samsung.com>");
> +MODULE_LICENSE("GPL");
> +
On Tue, Aug 23 2022 at 8:18P -0400, Pankaj Raghav <p.raghav@samsung.com> wrote:

> Only zoned devices with power-of-2(po2) number of sectors per zone(zone
> size) were supported in linux but now non power-of-2(npo2) zone sizes
> support has been added to the block layer.
>
> Filesystems such as F2FS and btrfs have support for zoned devices with
> po2 zone size assumption. Before adding native support for npo2 zone
> sizes, it was suggested to create a dm target for npo2 zone size device to
> appear as a po2 zone size target so that file systems can initially
> work without any explicit changes by using this target.
>
> The design of this target is very simple: remap the device zone size to
> the zone capacity and change the zone size to be the nearest power of 2
> value.
>
> For e.g., a device with a zone size/capacity of 3M will have an equivalent
> target layout as follows:
>
> Device layout :-
> zone capacity = 3M
> zone size = 3M
>
> |--------------|-------------|
> 0             3M            6M
>
> Target layout :-
> zone capacity=3M
> zone size = 4M
>
> |--------------|---|--------------|---|
> 0             3M  4M             7M  8M
>
> The area between target's zone capacity and zone size will be emulated
> in the target.
> The read IOs that fall in the emulated gap area will return 0 filled
> bio and all the other IOs in that area will result in an error.
> If a read IO span across the emulated area boundary, then the IOs are
> split across them. All other IO operations that span across the emulated
> area boundary will result in an error.
>
> The target can be easily created as follows:
> dmsetup create <label> --table '0 <size_sects> po2zone /dev/nvme<id>'
>
> Note that the target does not support partial mapping of the underlying
> device.
>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Suggested-by: Damien Le Moal <damien.lemoal@wdc.com>
> Suggested-by: Hannes Reinecke <hare@suse.de>

This target needs more review from those who Suggested-by it.

And the header and docs needs to address:

1) why is a partial mapping of the underlying device disallowed?
2) why is it assumed all IO is read-only? (talk to me and others like
   we don't know the inherent limitations of this class of zoned hw)

On a code level:
1) are you certain you're properly failing all writes?
   - are writes allowed to the "zone capacity area" but _not_
     allowed to the "emulated zone area"? (if yes, _please document_).
2) yes, you absolutely need to implement the .status target_type hook
   (for both STATUS and TABLE).
3) really not loving the nested return (of DM_MAPIO_SUBMITTED or
   DM_MAPIO_REMAPPED) from methods called from dm_po2z_map(). Would
   prefer to not have to do a depth-first search to see where and when
   dm_po2z_map() returns a DM_MAPIO_XXX unless there is a solid
   justification for it. To me it just obfuscates the DM interface a
   bit too much.

Otherwise, pretty clean code and nothing weird going on.

I look forward to seeing your next (final?) revision of this patchset.

Thanks,
Mike
On Fri, Sep 02 2022 at 4:55P -0400, Mike Snitzer <snitzer@redhat.com> wrote:

> On Tue, Aug 23 2022 at 8:18P -0400,
> Pankaj Raghav <p.raghav@samsung.com> wrote:
>
> > Only zoned devices with power-of-2(po2) number of sectors per zone(zone
> > size) were supported in linux but now non power-of-2(npo2) zone sizes
> > support has been added to the block layer.
> >
> > Filesystems such as F2FS and btrfs have support for zoned devices with
> > po2 zone size assumption. Before adding native support for npo2 zone
> > sizes, it was suggested to create a dm target for npo2 zone size device to
> > appear as a po2 zone size target so that file systems can initially
> > work without any explicit changes by using this target.
> >
> > The design of this target is very simple: remap the device zone size to
> > the zone capacity and change the zone size to be the nearest power of 2
> > value.
> >
> > For e.g., a device with a zone size/capacity of 3M will have an equivalent
> > target layout as follows:
> >
> > Device layout :-
> > zone capacity = 3M
> > zone size = 3M
> >
> > |--------------|-------------|
> > 0             3M            6M
> >
> > Target layout :-
> > zone capacity=3M
> > zone size = 4M
> >
> > |--------------|---|--------------|---|
> > 0             3M  4M             7M  8M
> >
> > The area between target's zone capacity and zone size will be emulated
> > in the target.
> > The read IOs that fall in the emulated gap area will return 0 filled
> > bio and all the other IOs in that area will result in an error.
> > If a read IO span across the emulated area boundary, then the IOs are
> > split across them. All other IO operations that span across the emulated
> > area boundary will result in an error.
> >
> > The target can be easily created as follows:
> > dmsetup create <label> --table '0 <size_sects> po2zone /dev/nvme<id>'
> >
> > Note that the target does not support partial mapping of the underlying
> > device.
> >
> > Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> > Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> > Suggested-by: Damien Le Moal <damien.lemoal@wdc.com>
> > Suggested-by: Hannes Reinecke <hare@suse.de>
>
> This target needs more review from those who Suggested-by it.
>
> And the header and docs needs to address:
>
> 1) why is a partial mapping of the underlying device disallowed?
> 2) why is it assumed all IO is read-only? (talk to me and others like
>    we don't know the inherent limitations of this class of zoned hw)
>
> On a code level:
> 1) are you certain you're properly failing all writes?
>    - are writes allowed to the "zone capacity area" but _not_
>      allowed to the "emulated zone area"? (if yes, _please document_).
> 2) yes, you absolutely need to implement the .status target_type hook
>    (for both STATUS and TABLE).
> 3) really not loving the nested return (of DM_MAPIO_SUBMITTED or
>    DM_MAPIO_REMAPPED) from methods called from dm_po2z_map(). Would
>    prefer to not have to do a depth-first search to see where and when
>    dm_po2z_map() returns a DM_MAPIO_XXX unless there is a solid
>    justification for it. To me it just obfuscates the DM interface a
>    bit too much.
>
> Otherwise, pretty clean code and nothing weird going on.
>
> I look forward to seeing your next (final?) revision of this patchset.

Thinking further.. I'm left confused about just what the heck this
target is assuming.

E.g.: feels like its exposing a readonly end of the zone is very
bi-polar... yet no hint to upper layer it shouldn't write to that
read-only end (the "emulated zone").. but there has to be some zoned
magic assumed? And I'm just naive?

Mike
Hi Mike,

>> Note that the target does not support partial mapping of the underlying
>> device.
>>
>> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
>> Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>> Suggested-by: Damien Le Moal <damien.lemoal@wdc.com>
>> Suggested-by: Hannes Reinecke <hare@suse.de>
>
> This target needs more review from those who Suggested-by it.
>
> And the header and docs needs to address:
>
> 1) why is a partial mapping of the underlying device disallowed?

While it is technically possible, I don't see any use-case to do so for
this target. I can mention it in the documentation as well.

> 2) why is it assumed all IO is read-only? (talk to me and others like
>    we don't know the inherent limitations of this class of zoned hw)
>

TL;DR: no, we don't assume all IO to be read-only. All operations are
allowed until the zone capacity, and only reads are permitted in the
emulated gap area.

A bit of context about Zoned HW (especially ZNS SSD):

Zone: A contiguous range of logical block addresses managed as a single unit.
Zoned Block Device: A block device that consists of multiple zones.
Zone size: Size of a zone
Zone capacity: Usable logical blocks in a zone

According to ZNS spec, the LBAs from zone capacity to zone size behave like
deallocated blocks when read and are not allowed to be written.

Until now, zone capacity can be any value, but zone size needed to be a
power-of-2 to work in Linux (More information about this is also in my
cover letter).

This patch series aims to allow non-po2 zone size devices with zone
capacity == zone size to work in Linux.

A non-po2 zone size device might not work correctly in filesystems that
support zoned devices such as btrfs and f2fs as they assume po2 zone sizes.
Therefore, this target is created to enable these filesystems to work with
non-po2 zone sizes until native support is added.

This target's zone capacity will be the same as the underlying device, but
the target's zone size will be the nearest po2 value of its zone capacity.
Furthermore, the area between the zone capacity and zone size of the target
(emulated gap area) will resemble the spec behavior: behave like the
deallocated blocks when read (we fill zeroes in the bio) and are not
allowed to write.

Does that clarify your question?

> On a code level:
> 1) are you certain you're properly failing all writes?
>    - are writes allowed to the "zone capacity area" but _not_
>      allowed to the "emulated zone area"? (if yes, _please document_).

I have already documented in Documentation:

A simple remap is performed for all the BIOs that do not cross the
emulation gap area, i.e., the area between the zone capacity and size.

If a BIO lies in the emulation gap area, the following operations are performed:

    Read:
        - If the BIO lies entirely in the emulation gap area, then zero out the BIO and complete it.
        - If the BIO spans the emulation gap area, split the BIO across the zone capacity boundary
          and remap only the BIO within the zone capacity boundary. The other part of the split BIO
          will be zeroed out.

    Other operations:
        - Return an error

Maybe it is not clear enough?? Let me know.

> 2) yes, you absolutely need to implement the .status target_type hook
>    (for both STATUS and TABLE).

I already queued this change locally. I will send it as a part of the next
rev.

> 3) really not loving the nested return (of DM_MAPIO_SUBMITTED or
>    DM_MAPIO_REMAPPED) from methods called from dm_po2z_map(). Would
>    prefer to not have to do a depth-first search to see where and when
>    dm_po2z_map() returns a DM_MAPIO_XXX unless there is a solid
>    justification for it. To me it just obfuscates the DM interface a
>    bit too much.
>

Got it. Do you prefer having the return statements in the dm_po2z_map
itself instead of returning a helper function, which in return returns the
status code?

What about something like this:

static inline void dm_po2z_read_zeroes(struct bio *bio)
{
	zero_fill_bio(bio);
	bio_endio(bio);
}

static int dm_po2z_map(struct dm_target *ti, struct bio *bio)
{
	struct dm_po2z_target *dmh = ti->private;
	int split_io_pos;

	bio_set_dev(bio, dmh->dev->bdev);

	if (op_is_zone_mgmt(bio_op(bio)))
		goto remap_sector;

	if (!bio_sectors(bio))
		return DM_MAPIO_REMAPPED;

	if (!dm_po2z_bio_in_emulated_zone_area(dmh, bio, &split_io_pos))
		goto remap_sector;

	/*
	 * Read operation on the emulated zone area (between zone capacity
	 * and zone size) will fill the bio with zeroes. Any other operation
	 * in the emulated area should return an error.
	 */
	if (bio_op(bio) == REQ_OP_READ) {
		/*
		 * If the bio is across emulated zone boundary, split
		 * the bio at the boundary.
		 */
		if (split_io_pos > 0) {
			dm_accept_partial_bio(bio, split_io_pos);
			goto remap_sector;
		}

		dm_po2z_read_zeroes(bio);
		return DM_MAPIO_SUBMITTED;
	}

	return DM_MAPIO_KILL;

remap_sector:
	bio->bi_iter.bi_sector =
		target_to_device_sect(dmh, bio->bi_iter.bi_sector);

	return DM_MAPIO_REMAPPED;
}

> Otherwise, pretty clean code and nothing weird going on.
>
> I look forward to seeing your next (final?) revision of this patchset.
>
> Thanks,
> Mike
>
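The decision logic discussed in this reply can be modelled outside the kernel. The following is a minimal userspace sketch, not the target's code: `po2z_classify()` and `enum po2z_action` are hypothetical names, sizes are in 512-byte sectors, and zone-management operations and zero-length bios (which the proposed `dm_po2z_map()` remaps directly) are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical outcomes, mirroring the four cases in the documentation. */
enum po2z_action {
	PO2Z_REMAP,       /* bio lies entirely within the zone capacity */
	PO2Z_SPLIT_REMAP, /* read crossing the boundary: forward the first part */
	PO2Z_ZERO_FILL,   /* read entirely inside the emulated gap */
	PO2Z_ERROR,       /* any non-read that touches the emulated gap */
};

/*
 * Classify an I/O of nr_sectors starting at target sector `sector`.
 * zone_size is the device zone size (== zone capacity); the emulated
 * zone size is 1 << po2_shift. *accepted receives the number of sectors
 * that would be forwarded to the underlying device.
 */
static enum po2z_action po2z_classify(uint64_t zone_size, unsigned int po2_shift,
				      uint64_t sector, uint64_t nr_sectors,
				      bool is_read, uint64_t *accepted)
{
	uint64_t zone_start = (sector >> po2_shift) << po2_shift;
	uint64_t offset = sector - zone_start; /* offset within the po2 zone */

	assert(accepted != NULL);
	assert(zone_size <= (UINT64_C(1) << po2_shift));

	if (offset + nr_sectors <= zone_size) {
		*accepted = nr_sectors;
		return PO2Z_REMAP;
	}

	*accepted = 0;
	if (!is_read)
		return PO2Z_ERROR;

	if (offset < zone_size) {
		/* Only the part below the zone capacity reaches the device. */
		*accepted = zone_size - offset;
		return PO2Z_SPLIT_REMAP;
	}

	return PO2Z_ZERO_FILL;
}
```

With a 3M zone (6144 sectors) emulated as 4M (shift 13), an 8-sector read at sector 6140 is split after 4 sectors, while the same range as a write is rejected.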
>>
>> 1) why is a partial mapping of the underlying device disallowed?
>> 2) why is it assumed all IO is read-only? (talk to me and others like
>>    we don't know the inherent limitations of this class of zoned hw)
>>
>> On a code level:
>> 1) are you certain you're properly failing all writes?
>>    - are writes allowed to the "zone capacity area" but _not_
>>      allowed to the "emulated zone area"? (if yes, _please document_).
>> 2) yes, you absolutely need to implement the .status target_type hook
>>    (for both STATUS and TABLE).
>> 3) really not loving the nested return (of DM_MAPIO_SUBMITTED or
>>    DM_MAPIO_REMAPPED) from methods called from dm_po2z_map(). Would
>>    prefer to not have to do a depth-first search to see where and when
>>    dm_po2z_map() returns a DM_MAPIO_XXX unless there is a solid
>>    justification for it. To me it just obfuscates the DM interface a
>>    bit too much.
>>
>> Otherwise, pretty clean code and nothing weird going on.
>>
>> I look forward to seeing your next (final?) revision of this patchset.
>
> Thinking further.. I'm left confused about just what the heck this
> target is assuming.
>
> E.g.: feels like its exposing a readonly end of the zone is very
> bi-polar... yet no hint to upper layer it shouldn't write to that
> read-only end (the "emulated zone").. but there has to be some zoned
> magic assumed? And I'm just naive?
>

You are absolutely right about "zoned magic". Applications that use a zoned
block device are aware of the zone capacity and zone size. BLKREPORTZONE
ioctl is typically used to get the zone information from a zoned block
device.

This target adjusts the zone report so that zone size and zone capacity
are modified correctly (see dm_po2z_report_zones() and
dm_po2z_report_zones_cb() functions).
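To illustrate the zone report adjustment mentioned above, here is a small userspace sketch of what the patch's `dm_po2z_report_zones_cb()` does to a single zone descriptor. `struct zone_desc` and `po2z_adjust_zone()` are hypothetical stand-ins (the kernel works on `struct blk_zone`), and the sketch assumes, as the patch does, that the capacity field is passed through unchanged.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the fields of a zone descriptor used here (512-byte sectors). */
struct zone_desc {
	uint64_t start;    /* first sector of the zone */
	uint64_t len;      /* zone size */
	uint64_t wp;       /* write pointer position */
	uint64_t capacity; /* usable sectors in the zone */
};

/* Shift a device sector by one emulated gap per preceding device zone. */
static uint64_t dev_to_target(uint64_t sect, uint64_t zone_size,
			      uint64_t zone_size_po2)
{
	return sect + (sect / zone_size) * (zone_size_po2 - zone_size);
}

/*
 * Rewrite a descriptor reported by the device into the target's address
 * space: start and write pointer move, the length becomes the po2 zone
 * size, and the capacity stays at the device's zone size.
 */
static void po2z_adjust_zone(struct zone_desc *z, uint64_t zone_size,
			     uint64_t zone_size_po2)
{
	assert(zone_size != 0 && zone_size_po2 >= zone_size);

	z->start = dev_to_target(z->start, zone_size, zone_size_po2);
	z->wp = dev_to_target(z->wp, zone_size, zone_size_po2);
	z->len = zone_size_po2;
}
```

An application reading the adjusted report sees a 4M zone with a 3M capacity, which is how it learns not to write past the capacity.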
diff --git a/Documentation/admin-guide/device-mapper/dm-po2zone.rst b/Documentation/admin-guide/device-mapper/dm-po2zone.rst
new file mode 100644
index 000000000000..19dc215fbcca
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-po2zone.rst
@@ -0,0 +1,71 @@
+==========
+dm-po2zone
+==========
+The dm-po2zone device mapper target exposes a zoned block device with a
+non-power-of-2(npo2) number of sectors per zone as a power-of-2(po2)
+number of sectors per zone(zone size).
+The filesystems that support zoned block devices such as F2FS and BTRFS
+assume po2 zone size as the kernel has traditionally only supported
+those devices. However, as the kernel now supports zoned block devices with
+npo2 zone sizes, the filesystems can run on top of the dm-po2zone target before
+adding native support.
+
+Partial mapping of the underlying device is not supported by this target.
+
+Algorithm
+=========
+The device mapper target maps the underlying device's zone size to the
+zone capacity and changes the zone size to the nearest po2 zone size.
+The gap between the zone capacity and the zone size is emulated in the target.
+E.g., a zoned block device with a zone size (and capacity) of 3M will have an
+equivalent target layout with mapping as follows:
+
+::
+
+  0M            3M 4M         6M 8M
+  |             |  |          |  |
+  +x------------+--+x---------+--+x-------  Target
+  |x            |  |x         |  |x
+   x               x             x
+   x              x             x
+   x              x             x
+   x             x             x
+  |x            |x            |x
+  +x------------+x------------+x----------  Device
+  |             |             |
+  0M            3M            6M
+
+A simple remap is performed for all the BIOs that do not cross the
+emulation gap area, i.e., the area between the zone capacity and size.
+
+If a BIO crosses the emulation gap area, the following operations are performed:
+
+    Read:
+        - If the BIO lies entirely in the emulation gap area, then zero out the BIO and complete it.
+        - If the BIO spans the emulation gap area, split the BIO across the zone capacity boundary
+          and remap only the BIO within the zone capacity boundary. The other part of the split BIO
+          will be zeroed out.
+
+    Other operations:
+        - Return an error
+
+Table parameters
+================
+
+::
+
+  <dev path>
+
+Mandatory parameters:
+
+    <dev path>:
+        Full pathname to the underlying block-device, or a
+        "major:minor" device-number.
+
+Examples
+========
+
+::
+
+  #!/bin/sh
+  echo "0 `blockdev --getsz $1` po2zone $1" | dmsetup create po2z
diff --git a/Documentation/admin-guide/device-mapper/index.rst b/Documentation/admin-guide/device-mapper/index.rst
index cde52cc09645..1fd04b5b0565 100644
--- a/Documentation/admin-guide/device-mapper/index.rst
+++ b/Documentation/admin-guide/device-mapper/index.rst
@@ -23,6 +23,7 @@ Device Mapper
     dm-service-time
     dm-uevent
     dm-zoned
+    dm-po2zone
     era
     kcopyd
     linear
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 998a5cfdbc4e..638801b2449a 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -518,6 +518,16 @@ config DM_FLAKEY
 	help
 	 A target that intermittently fails I/O for debugging purposes.
 
+config DM_PO2ZONE
+	tristate "Zoned block devices target emulating a power-of-2 number of sectors per zone"
+	depends on BLK_DEV_DM
+	depends on BLK_DEV_ZONED
+	help
+	  A target that converts a zoned block device with non-power-of-2(npo2)
+	  number of sectors per zone to be power-of-2(po2). Use this target for
+	  zoned block devices with npo2 number of sectors per zone until native
+	  support is added to the filesystems and applications.
+
 config DM_VERITY
 	tristate "Verity target support"
 	depends on BLK_DEV_DM
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 84291e38dca8..c23f81cc8789 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -26,6 +26,7 @@ dm-era-y	+= dm-era-target.o
 dm-clone-y	+= dm-clone-target.o dm-clone-metadata.o
 dm-verity-y	+= dm-verity-target.o
 dm-zoned-y	+= dm-zoned-target.o dm-zoned-metadata.o dm-zoned-reclaim.o
+dm-po2zone-y	+= dm-po2zone-target.o
 
 md-mod-y	+= md.o md-bitmap.o
 raid456-y	+= raid5.o raid5-cache.o raid5-ppl.o
@@ -60,6 +61,7 @@ obj-$(CONFIG_DM_CRYPT)		+= dm-crypt.o
 obj-$(CONFIG_DM_DELAY)		+= dm-delay.o
 obj-$(CONFIG_DM_DUST)		+= dm-dust.o
 obj-$(CONFIG_DM_FLAKEY)		+= dm-flakey.o
+obj-$(CONFIG_DM_PO2ZONE)	+= dm-po2zone.o
 obj-$(CONFIG_DM_MULTIPATH)	+= dm-multipath.o dm-round-robin.o
 obj-$(CONFIG_DM_MULTIPATH_QL)	+= dm-queue-length.o
 obj-$(CONFIG_DM_MULTIPATH_ST)	+= dm-service-time.o
diff --git a/drivers/md/dm-po2zone-target.c b/drivers/md/dm-po2zone-target.c
new file mode 100644
index 000000000000..34ccbeec9a59
--- /dev/null
+++ b/drivers/md/dm-po2zone-target.c
@@ -0,0 +1,260 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2022 Samsung Electronics Co., Ltd.
+ */
+
+#include <linux/device-mapper.h>
+
+#define DM_MSG_PREFIX "po2zone"
+
+struct dm_po2z_target {
+	struct dm_dev *dev;
+	sector_t zone_size; /* Actual zone size of the underlying dev*/
+	sector_t zone_size_po2; /* zone_size rounded to the nearest po2 value */
+	unsigned int zone_size_po2_shift;
+	sector_t zone_size_diff; /* diff between zone_size_po2 and zone_size */
+	unsigned int nr_zones;
+};
+
+static inline unsigned int npo2_zone_no(struct dm_po2z_target *dmh,
+					sector_t sect)
+{
+	return div64_u64(sect, dmh->zone_size);
+}
+
+static inline unsigned int po2_zone_no(struct dm_po2z_target *dmh,
+				       sector_t sect)
+{
+	return sect >> dmh->zone_size_po2_shift;
+}
+
+static inline sector_t target_to_device_sect(struct dm_po2z_target *dmh,
+					     sector_t sect)
+{
+	return sect - (po2_zone_no(dmh, sect) * dmh->zone_size_diff);
+}
+
+static inline sector_t device_to_target_sect(struct dm_po2z_target *dmh,
+					     sector_t sect)
+{
+	return sect + (npo2_zone_no(dmh, sect) * dmh->zone_size_diff);
+}
+
+/*
+ * This target works on the complete zoned device. Partial mapping is not
+ * supported.
+ * Construct a zoned po2 logical device: <dev-path>
+ */
+static int dm_po2z_ctr(struct dm_target *ti, unsigned int argc, char **argv)
+{
+	struct dm_po2z_target *dmh = NULL;
+	int ret;
+	sector_t zone_size;
+	sector_t dev_capacity;
+
+	if (argc != 1)
+		return -EINVAL;
+
+	dmh = kmalloc(sizeof(*dmh), GFP_KERNEL);
+	if (!dmh)
+		return -ENOMEM;
+
+	ret = dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
+			    &dmh->dev);
+	if (ret) {
+		ti->error = "Device lookup failed";
+		kfree(dmh);
+		return ret;
+	}
+
+	if (!bdev_is_zoned(dmh->dev->bdev)) {
+		DMERR("%pg is not a zoned device", dmh->dev->bdev);
+		kfree(dmh);
+		return -EINVAL;
+	}
+
+	zone_size = bdev_zone_sectors(dmh->dev->bdev);
+	dev_capacity = get_capacity(dmh->dev->bdev->bd_disk);
+	if (ti->len != dev_capacity || ti->begin) {
+		DMERR("%pg Partial mapping of the target not supported",
+		      dmh->dev->bdev);
+		kfree(dmh);
+		return -EINVAL;
+	}
+
+	if (is_power_of_2(zone_size))
+		DMWARN("%pg: underlying device has a power-of-2 number of sectors per zone",
+		       dmh->dev->bdev);
+
+	dmh->zone_size = zone_size;
+	dmh->zone_size_po2 = 1 << get_count_order_long(zone_size);
+	dmh->zone_size_po2_shift = ilog2(dmh->zone_size_po2);
+	dmh->zone_size_diff = dmh->zone_size_po2 - dmh->zone_size;
+	ti->private = dmh;
+	ti->max_io_len = dmh->zone_size_po2;
+	dmh->nr_zones = npo2_zone_no(dmh, ti->len);
+	ti->len = dmh->zone_size_po2 * dmh->nr_zones;
+
+	return 0;
+}
+
+static int dm_po2z_report_zones_cb(struct blk_zone *zone, unsigned int idx,
+				   void *data)
+{
+	struct dm_report_zones_args *args = data;
+	struct dm_po2z_target *dmh = args->tgt->private;
+
+	zone->start = device_to_target_sect(dmh, zone->start);
+	zone->wp = device_to_target_sect(dmh, zone->wp);
+	zone->len = dmh->zone_size_po2;
+	args->next_sector = zone->start + zone->len;
+
+	return args->orig_cb(zone, args->zone_idx++, args->orig_data);
+}
+
+static int dm_po2z_report_zones(struct dm_target *ti,
+				struct dm_report_zones_args *args,
+				unsigned int nr_zones)
+{
+	struct dm_po2z_target *dmh = ti->private;
+	sector_t sect = po2_zone_no(dmh, args->next_sector) * dmh->zone_size;
+
+	return blkdev_report_zones(dmh->dev->bdev, sect, nr_zones,
+				   dm_po2z_report_zones_cb, args);
+}
+
+static int dm_po2z_end_io(struct dm_target *ti, struct bio *bio,
+			  blk_status_t *error)
+{
+	struct dm_po2z_target *dmh = ti->private;
+
+	if (bio->bi_status == BLK_STS_OK && bio_op(bio) == REQ_OP_ZONE_APPEND)
+		bio->bi_iter.bi_sector =
+			device_to_target_sect(dmh, bio->bi_iter.bi_sector);
+
+	return DM_ENDIO_DONE;
+}
+
+static void dm_po2z_io_hints(struct dm_target *ti, struct queue_limits *limits)
+{
+	struct dm_po2z_target *dmh = ti->private;
+
+	limits->chunk_sectors = dmh->zone_size_po2;
+}
+
+/**
+ * dm_po2z_bio_in_emulated_zone_area - check if bio is in the emulated zone area
+ * @dmh:	target data
+ * @bio:	bio
+ * @offset:	bio offset to emulated zone boundary
+ *
+ * Check if a @bio is partly or completely in the emulated zone area. If the
+ * @bio is partly in the emulated zone area, @offset can be used to split
+ * the @bio across the emulated zone boundary. @offset
+ * will be negative if the @bio completely lies in the emulated area.
+ *
+ */
+static bool dm_po2z_bio_in_emulated_zone_area(struct dm_po2z_target *dmh,
+					      struct bio *bio, int *offset)
+{
+	unsigned int zone_idx = po2_zone_no(dmh, bio->bi_iter.bi_sector);
+	sector_t nr_sectors = bio->bi_iter.bi_size >> SECTOR_SHIFT;
+	sector_t sector_offset =
+		bio->bi_iter.bi_sector - (zone_idx << dmh->zone_size_po2_shift);
+
+	*offset = dmh->zone_size - sector_offset;
+
+	return sector_offset + nr_sectors > dmh->zone_size;
+}
+
+static inline int dm_po2z_read_zeroes(struct bio *bio)
+{
+	zero_fill_bio(bio);
+	bio_endio(bio);
+	return DM_MAPIO_SUBMITTED;
+}
+
+static inline int dm_po2z_remap_sector(struct dm_po2z_target *dmh,
+				       struct bio *bio)
+{
+	bio->bi_iter.bi_sector =
+		target_to_device_sect(dmh, bio->bi_iter.bi_sector);
+	return DM_MAPIO_REMAPPED;
+}
+
+static int dm_po2z_map(struct dm_target *ti, struct bio *bio)
+{
+	struct dm_po2z_target *dmh = ti->private;
+	int split_io_pos;
+
+	bio_set_dev(bio, dmh->dev->bdev);
+
+	if (op_is_zone_mgmt(bio_op(bio)))
+		return dm_po2z_remap_sector(dmh, bio);
+
+	if (!bio_sectors(bio))
+		return DM_MAPIO_REMAPPED;
+
+	/*
+	 * Read operation on the emulated zone area (between zone capacity
+	 * and zone size) will fill the bio with zeroes. Any other operation
+	 * in the emulated area should return an error.
+	 */
+	if (!dm_po2z_bio_in_emulated_zone_area(dmh, bio, &split_io_pos))
+		return dm_po2z_remap_sector(dmh, bio);
+
+	if (bio_op(bio) == REQ_OP_READ) {
+		/*
+		 * If the bio is across emulated zone boundary, split the bio at
+		 * the boundary.
+		 */
+		if (split_io_pos > 0) {
+			dm_accept_partial_bio(bio, split_io_pos);
+			return dm_po2z_remap_sector(dmh, bio);
+		}
+		return dm_po2z_read_zeroes(bio);
+	}
+
+	return DM_MAPIO_KILL;
+}
+
+static int dm_po2z_iterate_devices(struct dm_target *ti,
+				   iterate_devices_callout_fn fn, void *data)
+{
+	struct dm_po2z_target *dmh = ti->private;
+	sector_t len = dmh->nr_zones * dmh->zone_size;
+
+	return fn(ti, dmh->dev, 0, len, data);
+}
+
+static struct target_type dm_po2z_target = {
+	.name = "po2zone",
+	.version = { 1, 0, 0 },
+	.features = DM_TARGET_ZONED_HM | DM_TARGET_EMULATED_ZONES,
+	.map = dm_po2z_map,
+	.end_io = dm_po2z_end_io,
+	.report_zones = dm_po2z_report_zones,
+	.iterate_devices = dm_po2z_iterate_devices,
+	.module = THIS_MODULE,
+	.io_hints = dm_po2z_io_hints,
+	.ctr = dm_po2z_ctr,
+};
+
+static int __init dm_po2z_init(void)
+{
+	return dm_register_target(&dm_po2z_target);
+}
+
+static void __exit dm_po2z_exit(void)
+{
+	dm_unregister_target(&dm_po2z_target);
+}
+
+/* Module hooks */
+module_init(dm_po2z_init);
+module_exit(dm_po2z_exit);
+
+MODULE_DESCRIPTION(DM_NAME "power-of-2 zoned target");
+MODULE_AUTHOR("Pankaj Raghav <p.raghav@samsung.com>");
+MODULE_LICENSE("GPL");
+
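The sector translation helpers in the patch can be exercised in userspace. The sketch below re-implements the arithmetic of `target_to_device_sect()` and `device_to_target_sect()` with plain integers. `struct po2z_geom` and the function names are hypothetical, sizes are in 512-byte sectors, and the target-to-device direction is only meaningful for sectors below the zone capacity, since sectors in the emulated gap have no device counterpart.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace copy of the geometry kept in struct dm_po2z_target. */
struct po2z_geom {
	uint64_t zone_size;      /* device zone size, may be non-po2 */
	uint64_t zone_size_po2;  /* emulated zone size, a power of 2 */
	unsigned int po2_shift;  /* log2(zone_size_po2) */
	uint64_t zone_size_diff; /* zone_size_po2 - zone_size */
};

/* Target -> device: drop one emulated gap per preceding target zone. */
static uint64_t target_to_device(const struct po2z_geom *g, uint64_t sect)
{
	return sect - (sect >> g->po2_shift) * g->zone_size_diff;
}

/* Device -> target: add one emulated gap per preceding device zone. */
static uint64_t device_to_target(const struct po2z_geom *g, uint64_t sect)
{
	assert(g->zone_size != 0);
	return sect + (sect / g->zone_size) * g->zone_size_diff;
}
```

For the 3M/4M example, target sector 8192 (start of the second emulated zone) maps to device sector 6144, and the reverse translation returns 8192 again. The zone index comes from a shift on the target side and a division on the device side, because only the emulated size is a power of 2.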
Only zoned devices with power-of-2(po2) number of sectors per zone(zone
size) were supported in linux but now non power-of-2(npo2) zone sizes
support has been added to the block layer.

Filesystems such as F2FS and btrfs have support for zoned devices with
po2 zone size assumption. Before adding native support for npo2 zone
sizes, it was suggested to create a dm target for npo2 zone size device to
appear as a po2 zone size target so that file systems can initially
work without any explicit changes by using this target.

The design of this target is very simple: remap the device zone size to
the zone capacity and change the zone size to be the nearest power of 2
value.

For e.g., a device with a zone size/capacity of 3M will have an equivalent
target layout as follows:

Device layout :-
zone capacity = 3M
zone size = 3M

|--------------|-------------|
0             3M            6M

Target layout :-
zone capacity=3M
zone size = 4M

|--------------|---|--------------|---|
0             3M  4M             7M  8M

The area between target's zone capacity and zone size will be emulated
in the target.
The read IOs that fall in the emulated gap area will return 0 filled
bio and all the other IOs in that area will result in an error.
If a read IO span across the emulated area boundary, then the IOs are
split across them. All other IO operations that span across the emulated
area boundary will result in an error.

The target can be easily created as follows:
dmsetup create <label> --table '0 <size_sects> po2zone /dev/nvme<id>'

Note that the target does not support partial mapping of the underlying
device.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Suggested-by: Damien Le Moal <damien.lemoal@wdc.com>
Suggested-by: Hannes Reinecke <hare@suse.de>
---
 .../admin-guide/device-mapper/dm-po2zone.rst  |  71 +++++
 .../admin-guide/device-mapper/index.rst       |   1 +
 drivers/md/Kconfig                            |  10 +
 drivers/md/Makefile                           |   2 +
 drivers/md/dm-po2zone-target.c                | 260 ++++++++++++++++++
 5 files changed, 344 insertions(+)
 create mode 100644 Documentation/admin-guide/device-mapper/dm-po2zone.rst
 create mode 100644 drivers/md/dm-po2zone-target.c
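The sizing rule in the commit message can be checked with a short userspace sketch. It assumes that "nearest power of 2" means rounding up, which is what the patch's use of `get_count_order_long()` implies. `round_up_po2()`, `struct po2z_layout` and `po2z_compute_layout()` are hypothetical names, and all sizes are in 512-byte sectors.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest power of two that is >= v; v must be in [1, 2^63]. */
static uint64_t round_up_po2(uint64_t v)
{
	uint64_t p = 1;

	assert(v != 0 && v <= (UINT64_C(1) << 63));
	while (p < v)
		p <<= 1;
	return p;
}

/* Hypothetical summary of the emulated layout. */
struct po2z_layout {
	uint64_t nr_zones;      /* number of zones on the device */
	uint64_t zone_size_po2; /* emulated zone size */
	uint64_t gap;           /* emulated sectors at the end of each zone */
	uint64_t target_len;    /* length of the target as seen by upper layers */
};

/* Derive the target layout from the device capacity and its zone size. */
static struct po2z_layout po2z_compute_layout(uint64_t dev_capacity,
					      uint64_t zone_size)
{
	struct po2z_layout l;

	assert(zone_size != 0);
	l.nr_zones = dev_capacity / zone_size;
	l.zone_size_po2 = round_up_po2(zone_size);
	l.gap = l.zone_size_po2 - zone_size;
	l.target_len = l.nr_zones * l.zone_size_po2;
	return l;
}
```

For the example above, a 6M device (12288 sectors) with 3M zones (6144 sectors) gives two 4M zones (8192 sectors each), a 1M gap per zone and an 8M target. Upper layers see a target larger than the underlying device, and the constructor in the patch adjusts `ti->len` accordingly.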