Message ID | 20220920091119.115879-1-p.raghav@samsung.com
Series | support zoned block devices with non-power-of-2 zone sizes
On Tue, Sep 20 2022 at 5:11P -0400, Pankaj Raghav <p.raghav@samsung.com> wrote:

> - Background and Motivation:
>
> The zoned storage implementation in Linux, introduced in v4.10, first
> targeted SMR drives, which have a power of 2 (po2) zone size alignment
> requirement. The po2 zone size was further imposed implicitly by the
> block layer's blk_queue_chunk_sectors(), used to prevent IO merging
> across chunks beyond the specified size, since v3.16 through commit
> 762380ad9322 ("block: add notion of a chunk size for request merging").
> This general block layer po2 requirement for blk_queue_chunk_sectors()
> was removed in v5.10 through commit 07d098e6bbad ("block: allow
> 'chunk_sectors' to be non-power-of-2").
>
> NAND, the media used in newer zoned storage devices, does not naturally
> align to po2. In these devices, the zone capacity (cap) is not the same
> as the po2 zone size. When zone cap != zone size, unmapped LBAs are
> introduced to cover the space between the zone cap and the zone size.
> The po2 requirement does not make sense for this type of zoned storage
> device. This patch series aims to remove these unmapped LBAs for zoned
> devices when the zone cap is npo2. This is done by relaxing the po2
> zone size constraint in the kernel and allowing zoned devices with npo2
> zone sizes if zone cap == zone size.
>
> Removing the po2 requirement from zoned storage should be possible now,
> provided that no userspace regressions and no performance regressions
> are introduced. Stop-gap patches have already been merged into
> f2fs-tools to proactively disallow npo2 zone sizes until proper support
> is added [1].
>
> There were two previous efforts to add support for npo2 devices:
> 1) via device-level emulation [2], which was rejected with a final
> conclusion to add support for non-po2 zoned devices in the complete
> stack [3]; 2) adding support to the complete stack by removing the
> constraint in the block layer and NVMe layer with support for btrfs,
> zonefs, etc., which was rejected with a conclusion to add a dm target
> for FS support [0] to reduce the regression impact.
>
> This series adds support for npo2 zoned devices in the block and NVMe
> layers, and a new **dm target** is added: dm-po2zoned-target. This new
> target will initially be used for filesystems such as btrfs and f2fs
> until native npo2 zone support is added.

As this patchset nears the point of being "ready for merge" and DM's
"zoned" oriented targets are multiplying, I need to understand: where
are we collectively going? How long are we expecting to support the
"stop-gap zoned storage" layers we've constructed?

I know https://zonedstorage.io/docs/introduction exists... but it
_seems_ stale given the emergence of ZNS and new permutations of zoned
hardware. Maybe that isn't quite fair (it does cover A LOT!) but I'm
still left wanting (e.g. "bring it all home for me!")...

Damien, as the most "zoned storage" oriented engineer I know, can you
please kick things off by shedding light on where Linux is now, and
where it's going, for "zoned storage"?

To give some additional context to help me when you answer: I'm left
wondering what, if any, role dm-zoned has to play moving forward given
ZNS is "the future" (and yeah "the future" is now but...)? E.g.: does
it make sense to stack dm-zoned on top of dm-po2zoned!?

Yet more context: when I'm asked to add full-blown support for dm-zoned
to RHEL my gut is "please no, why!?". And if we really should add
dm-zoned, is dm-po2zoned now also a requirement (to support
non-power-of-2 ZNS devices in our never-ending engineering of "zoned
storage" compatibility stop-gaps)?

In addition, it was my understanding that WDC had yet another zoned DM
target called "dm-zap" that is for ZNS based devices... It's all a bit
messy in my head (that's on me for not keeping up, but I think we need
a recap!)

So please help me, and others, become more informed as quickly as
possible! ;)

Thanks,
Mike

ps. I'm asking all this in the open on various Linux mailing lists
because it doesn't seem right to request a concall to inform only me...
I think others may need similar "zoned storage" help.
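The zone size vs. zone capacity distinction in the quoted cover letter is
the crux of the series, so a small illustrative sketch may help. It is
plain userspace C with made-up example numbers (not taken from any real
drive or from the patches themselves); it only shows how rounding the
zone size up to a power of 2 leaves unmapped LBAs, and how zone size ==
zone capacity avoids them.

```c
/*
 * Illustrative sketch only -- hypothetical geometry, not from the series.
 * Shows the unmapped LBAs created when a po2 zone size exceeds the zone
 * capacity, which is what the series removes by allowing npo2 zone sizes.
 */
#include <stdio.h>
#include <inttypes.h>

int main(void)
{
	/* Hypothetical device: 512-byte sectors, 1 MiB == 2048 sectors. */
	uint64_t zone_cap  = 1077 * 2048ULL;  /* 1077 MiB usable per zone  */
	uint64_t zone_size = 2048 * 2048ULL;  /* rounded up to 2 GiB (po2) */
	uint64_t nr_zones  = 100;

	uint64_t unmapped = zone_size - zone_cap;

	printf("po2 layout : zone size %" PRIu64 ", zone cap %" PRIu64 " sectors\n",
	       zone_size, zone_cap);
	printf("unmapped LBAs per zone : %" PRIu64 " (%.1f%% of the zone)\n",
	       unmapped, 100.0 * unmapped / zone_size);
	printf("unmapped LBAs on device: %" PRIu64 "\n", unmapped * nr_zones);

	/* With the npo2 series applied, zone size may equal zone capacity. */
	zone_size = zone_cap;
	printf("npo2 layout: zone size %" PRIu64 " == zone cap -> no unmapped LBAs\n",
	       zone_size);
	return 0;
}
```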
On 9/22/22 02:27, Mike Snitzer wrote:

> [...]
>
> As this patchset nears the point of being "ready for merge" and DM's
> "zoned" oriented targets are multiplying, I need to understand: where
> are we collectively going? How long are we expecting to support the
> "stop-gap zoned storage" layers we've constructed?
>
> I know https://zonedstorage.io/docs/introduction exists... but it
> _seems_ stale given the emergence of ZNS and new permutations of zoned
> hardware. Maybe that isn't quite fair (it does cover A LOT!) but I'm
> still left wanting (e.g. "bring it all home for me!")...
>
> Damien, as the most "zoned storage" oriented engineer I know, can you
> please kick things off by shedding light on where Linux is now, and
> where it's going, for "zoned storage"?

Let me first start with what we have seen so far with deployments in
the field.

The largest user base for zoned storage is, for now, hyperscalers
(cloud services) deploying SMR disks. E.g. Dropbox has publicized its
use of SMR HDDs many times.
ZNS is fairly new, and while it is being actively evaluated by many,
there are not yet any large scale deployments that I am aware of.

Most of the large scale SMR users today use the zoned storage drives
directly, without a file system, similarly to their use of regular
block devices. Some erasure coded object store sits on top of the zoned
drives and manages them. The interface used for that has now switched
to the kernel API, from libzbc pass-through in the early days of SMR
support. With the inclusion of zonefs in kernel 5.6, many are now
switching to using that instead of directly accessing the block device
file. zonefs makes application development somewhat easier (there is no
need for issuing zone management ioctls) and can also result in
applications that can run almost as-is on top of regular block devices
with a file system. That is a very interesting property, especially in
the development phase for the user.

Beside these large scale SMR deployments, there are also many smaller
users. For these cases, dm-zoned seems to be used a lot. In particular,
the Chia cryptocurrency boom (now fading?) generated a fair amount of
new SMR users relying on dm-zoned. With btrfs zoned storage support
maturing, dm-zoned is not as needed as it used to be, though. SMR
drives can be used directly under btrfs, and I always recommend this
approach over dm-zoned+ext4 or dm-zoned+xfs as performance is much
better for write intensive workloads.

For the Linux kernel overall, zoned storage is in very good shape for
raw block device use and zonefs use. The production deployments we are
seeing are proof of that. Currently, my team's effort is mostly focused
on btrfs and zonefs and on increasing zoned storage use cases.

1) For btrfs, Johannes and Naohiro are working on stabilizing support
for ZNS (we still have some issues with the management of active zones)
and implementing de-clustered parity RAID support so that zoned drives
can be used in RAID 0, 1, 10, 5, 6 and erasure coded volumes. This will
address use cases such as home NAS boxes, backup servers, small file
servers, video applications (e.g. video surveillance), etc.
Essentially, any application with large storage capacity needs that is
not a distributed setup. There are many.

2) For zonefs, I have some to-do items lined up to improve performance
(better read IO tail latency) and further improve ease of use (e.g.
remove the O_DIRECT write constraint).

3) At the block device level, we are also working on adding zoned block
device specifications to virtio and implementing that support in qemu
and the kernel. Patches are floating around now but not yet merged.
This addresses the use of zoned storage in VM environments through the
virtio interface instead of directly attaching devices to guests.

> To give some additional context to help me when you answer: I'm left
> wondering what, if any, role dm-zoned has to play moving forward given
> ZNS is "the future" (and yeah "the future" is now but...)? E.g.: does
> it make sense to stack dm-zoned on top of dm-po2zoned!?

That is a lot to unfold in a small paragraph :)

First of all, I would rephrase "ZNS is the future" into "ZNS is a very
interesting alternative to generic NVMe SSDs". The reason being that
HDDs are not dead, far from it. They are still way cheaper than SSDs in
$/TB :) So ZNS is not really in competition with SMR HDDs here. The two
are complementary, exactly like regular SSDs are complementary to
regular HDDs.
dm-zoned serves some use cases for SMR HDDs (see above) but does not
address ZNS (more on this below). And given that all SMR HDDs on the
market today have a zone size that is a power-of-2 number of LBAs
(a 256MB zone size is by far the most common), dm-po2zoned is not
required at all for SMR.

Pankaj's patch series is all about supporting ZNS devices that have a
zone size that is not a power-of-2 number of LBAs, as some vendors
want to produce such drives. There is no such move happening in the
SMR world, as all users are happy with the current zone sizes, which
match the kernel support (which currently requires a power-of-2 number
of LBAs for the zone size).

I do not think we have yet reached a consensus on whether we really
want to accept any zone size for zoned storage. I personally am not a
big fan of removing the existing constraint, as that makes the code
somewhat heavier (multiplications & divisions instead of bit shifts)
without introducing any benefit to the user that I can see (or agree
with). And there is also a risk of forcing users to redesign/change
their code to support different devices in the same system. It is never
nice to fragment support like this for the same device class. This is
why several people, including me, requested something like dm-po2zoned,
to avoid breaking user applications if support for non-power-of-2 zone
size drives is merged. Better than nothing for sure, but not ideal
either. That is only my opinion. There are different opinions out
there.

> Yet more context: when I'm asked to add full-blown support for
> dm-zoned to RHEL my gut is "please no, why!?". And if we really
> should add dm-zoned, is dm-po2zoned now also a requirement (to support
> non-power-of-2 ZNS devices in our never-ending engineering of "zoned
> storage" compatibility stop-gaps)?

Support for dm-zoned in RHEL really depends on whether your customers
need it. Having SMR and ZNS block device (CONFIG_BLK_DEV_ZONED) and
zonefs support enabled would already cover a lot of use cases on their
own, at least the ones we see in the field today. Going forward, we
expect more use cases to rely on btrfs rather than on dm-zoned or any
equivalent DM target for ZNS. And that can also include non-power-of-2
zone size drives, as btrfs should normally be able to handle such
devices if support for them is merged. But we are not there yet with
btrfs support, hence dm-po2zoned. Then again, all of that depends on
whether Pankaj's patch series is accepted, that is, on everybody
accepting that we lift the power-of-2 zone size constraint.

> In addition, it was my understanding that WDC had yet another zoned DM
> target called "dm-zap" that is for ZNS based devices... It's all a bit
> messy in my head (that's on me for not keeping up, but I think we need
> a recap!)

Since the ZNS specification does not define conventional zones,
dm-zoned cannot be used as a standalone DM target (read: single block
device) with NVMe zoned block devices. Furthermore, due to its block
mapping scheme, dm-zoned does not support devices with zones that have
a capacity lower than the zone size. So ZNS is really a big *no* for
dm-zoned. dm-zap is a prototype and, in a nutshell, is the equivalent
of dm-zoned for ZNS. dm-zap can deal with the smaller zone capacity and
does not require conventional zones. We are not trying to push for
dm-zap to be merged for now, as we are still evaluating its potential
use cases. We also have a different but functionally equivalent
approach, implemented as a block device driver, that we are evaluating
internally.
Given the usage patterns we have seen so far for zoned storage, it is
not yet clear whether something like dm-zap for ZNS is needed beyond
some niche use cases.

> So please help me, and others, become more informed as quickly as
> possible! ;)

I hope the above helps. If you want me to develop further any of the
points above, feel free to let me know.

> ps. I'm asking all this in the open on various Linux mailing lists
> because it doesn't seem right to request a concall to inform only
> me... I think others may need similar "zoned storage" help.

All good with me :)
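Damien's remark above about "multiplications & divisions instead of bit
shifts" is easy to see in a few lines. The sketch below is plain
userspace C with example zone sizes (it is not kernel code and not from
the patch series; in the kernel, helpers such as div64_u64() would be
used for the 64-bit division, and __builtin_ctzll() is a gcc/clang
builtin). It simply contrasts how a sector is mapped to a zone index and
in-zone offset with a power-of-2 versus a non-power-of-2 zone size.

```c
/*
 * Sketch of the arithmetic cost difference Damien refers to. Plain
 * userspace C with example zone sizes; not taken from the kernel or
 * from the patch series.
 */
#include <stdio.h>
#include <inttypes.h>

/* po2 zone size: zone index and offset reduce to a shift and a mask. */
static void locate_po2(uint64_t sector, uint64_t zone_size_sectors,
		       uint64_t *zone, uint64_t *offset)
{
	unsigned int shift = __builtin_ctzll(zone_size_sectors); /* log2 */

	*zone   = sector >> shift;
	*offset = sector & (zone_size_sectors - 1);
}

/* npo2 zone size: a full division and modulo are needed instead. */
static void locate_npo2(uint64_t sector, uint64_t zone_size_sectors,
			uint64_t *zone, uint64_t *offset)
{
	*zone   = sector / zone_size_sectors;
	*offset = sector % zone_size_sectors;
}

int main(void)
{
	uint64_t zone, offset;

	/* 256 MiB zones (524288 sectors of 512 B): the common SMR case. */
	locate_po2(5000000, 524288, &zone, &offset);
	printf("po2 : zone %" PRIu64 ", offset %" PRIu64 "\n", zone, offset);

	/* A hypothetical npo2 zone size of 216 MiB (442368 sectors). */
	locate_npo2(5000000, 442368, &zone, &offset);
	printf("npo2: zone %" PRIu64 ", offset %" PRIu64 "\n", zone, offset);
	return 0;
}
```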
Thanks a lot, Damien, for the summary. Your feedback has made this
series much better.

> Pankaj's patch series is all about supporting ZNS devices that have a
> zone size that is not a power-of-2 number of LBAs, as some vendors
> want to produce such drives.
>
> [...]
>
> This is why several people, including me, requested something like
> dm-po2zoned, to avoid breaking user applications if support for
> non-power-of-2 zone size drives is merged. Better than nothing for
> sure, but not ideal either. That is only my opinion. There are
> different opinions out there.

I appreciate that you have explained the different perspectives. We
have covered this both in writing and orally, and it seems to me that
the arguments are well covered on the list.

At this point, I would like to ask the opinion of Jens, Christoph and
Keith. Do you think we are missing anything in the series? Can this be
queued up for 6.1 (after I send the next version with a minor fix
suggested by Mike)?

--
Regards,
Pankaj
On Wed, Sep 21 2022 at 7:55P -0400,
Damien Le Moal <damien.lemoal@opensource.wdc.com> wrote:

> Let me first start with what we have seen so far with deployments in
> the field.
<snip>

Thanks for all your insights on zoned storage, very appreciated!

> > In addition, it was my understanding that WDC had yet another zoned
> > DM target called "dm-zap" that is for ZNS based devices... It's all
> > a bit messy in my head (that's on me for not keeping up, but I think
> > we need a recap!)
>
> Since the ZNS specification does not define conventional zones,
> dm-zoned cannot be used as a standalone DM target (read: single block
> device) with NVMe zoned block devices. Furthermore, due to its block
> mapping scheme, dm-zoned does not support devices with zones that have
> a capacity lower than the zone size. So ZNS is really a big *no* for
> dm-zoned. dm-zap is a prototype and, in a nutshell, is the equivalent
> of dm-zoned for ZNS. dm-zap can deal with the smaller zone capacity
> and does not require conventional zones. We are not trying to push for
> dm-zap to be merged for now, as we are still evaluating its potential
> use cases. We also have a different but functionally equivalent
> approach, implemented as a block device driver, that we are evaluating
> internally.
>
> Given the usage patterns we have seen so far for zoned storage, it is
> not yet clear whether something like dm-zap for ZNS is needed beyond
> some niche use cases.

OK, good to know. I do think dm-zoned should be trained to _not_ allow
use with ZNS NVMe devices (maybe that is in place and I just missed
it?). Because there is some confusion with at least one customer that
is asserting dm-zoned is somehow enabling them to use ZNS NVMe devices!

Maybe they somehow don't _need_ conventional zones (writes are handled
by some other layer? and dm-zoned access is confined to read only)!?
And might they also be using ZNS NVMe devices that do _not_ have a zone
capacity lower than the zone size? Or maybe they are mistaken and we
should ask more specific questions of them?

> > So please help me, and others, become more informed as quickly as
> > possible! ;)
>
> I hope the above helps. If you want me to develop further any of the
> points above, feel free to let me know.

You've been extremely helpful, thanks!
On 9/23/22 04:37, Mike Snitzer wrote:

> On Wed, Sep 21 2022 at 7:55P -0400,
> Damien Le Moal <damien.lemoal@opensource.wdc.com> wrote:
>
>> Let me first start with what we have seen so far with deployments in
>> the field.
>
> <snip>
>
> Thanks for all your insights on zoned storage, very appreciated!
>
> [...]
>
> OK, good to know. I do think dm-zoned should be trained to _not_
> allow use with ZNS NVMe devices (maybe that is in place and I just
> missed it?). Because there is some confusion with at least one
> customer that is asserting dm-zoned is somehow enabling them to use
> ZNS NVMe devices!

dm-zoned checks for conventional zones and also that all zones have a
zone capacity equal to the zone size. The first point rules ZNS out,
but a second regular drive can be used to emulate conventional zones.
However, the second point (zone cap < zone size) is pretty much a given
with ZNS and so rules it out.

If anything, we should also add a check on the max number of active
zones, which is a limitation that ZNS drives have, unlike SMR drives.
Since dm-zoned does not handle active zones at all, any drive with such
a limit should be excluded. I will send patches for that.

> Maybe they somehow don't _need_ conventional zones (writes are handled
> by some other layer? and dm-zoned access is confined to read only)!?
> And might they also be using ZNS NVMe devices that do _not_ have a
> zone capacity lower than the zone size?

It is a possibility. Indeed, if the ZNS drive has:

1) zone capacity equal to zone size,
2) a second regular drive used to emulate conventional zones, and
3) no limit on the max number of active zones,

then dm-zoned will work just fine. But again, I seriously doubt that
point (3) holds. And we should check that upfront in the dm-zoned
constructor (ctr).

> Or maybe they are mistaken and we should ask more specific questions
> of them?

Getting the exact drive characteristics (zone size, capacity and zone
resource limits) will tell you whether dm-zoned can work or not.
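To make Damien's list of conditions concrete, here is a rough userspace
model of the kind of constructor-time compatibility check he describes.
It is not the actual dm-zoned code; the struct and function names
(zoned_dev, dmz_compatible) are made up for illustration, and the ZNS
active-zone limit used in the example is a hypothetical value.

```c
/*
 * Rough userspace model of the constructor-time checks described above.
 * NOT the real dm-zoned code; names and values are illustrative only.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdint.h>

struct zoned_dev {
	const char *name;
	bool has_conventional_zones;
	bool zone_cap_equals_size;
	uint32_t max_active_zones;	/* 0 == no limit */
};

static bool dmz_compatible(const struct zoned_dev *dev, bool have_cache_dev)
{
	/* ZNS has no conventional zones; a regular cache device can stand in. */
	if (!dev->has_conventional_zones && !have_cache_dev)
		return false;
	/* The dm-zoned block mapping assumes zone capacity == zone size. */
	if (!dev->zone_cap_equals_size)
		return false;
	/* dm-zoned does not manage active zones, so any limit rules it out. */
	if (dev->max_active_zones)
		return false;
	return true;
}

int main(void)
{
	struct zoned_dev smr = { "SMR HDD", true, true, 0 };
	struct zoned_dev zns = { "ZNS SSD", false, false, 14 };

	printf("%s usable by dm-zoned: %s\n", smr.name,
	       dmz_compatible(&smr, false) ? "yes" : "no");
	printf("%s usable by dm-zoned: %s\n", zns.name,
	       dmz_compatible(&zns, true) ? "yes" : "no");
	return 0;
}
```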
On 9/21/22 16:55, Damien Le Moal wrote:

> Then again, all of that depends on whether Pankaj's patch series is
> accepted, that is, on everybody accepting that we lift the power-of-2
> zone size constraint.

The companies that are busy implementing zoned storage for UFS devices
are asking for kernel support for non-power-of-2 zone sizes.

Thanks,

Bart.
On Friday, 23 September 2022 01.56, Bart Van Assche <bvanassche@acm.org> wrote:

> On 9/21/22 16:55, Damien Le Moal wrote:
>
>> Then again, all of that depends on whether Pankaj's patch series is
>> accepted, that is, on everybody accepting that we lift the power-of-2
>> zone size constraint.
>
> The companies that are busy implementing zoned storage for UFS devices
> are asking for kernel support for non-power-of-2 zone sizes.
>
> Thanks,
>
> Bart.

Hi Bart,

With UFS, in the proposed copy I have (it may have been changed), there
is the concept of gap zones, which are zones that cannot be accessed by
the host. The gap zones are essentially "LBA fillers", enabling the
next writeable zone to start at an X * pow2 offset. My understanding is
that this specific approach was chosen to simplify standardization in
UFS and avoid updating T10's ZBC with zone capacity support.

While UFS would technically expose non-power-of-2 zone sizes, the zones
could, due to the gap zones, also be considered power-of-2 zones if one
treats the sequential write zone plus the following gap zone as a
single unit.

When I think about having UFS support in the kernel, the SWR zone and
the gap zone could be represented as a single unit. For example:

UFS - Zone Report
  Zone 0: SWR, LBA 0-11
  Zone 1: Gap, LBA 12-15
  Zone 2: SWR, LBA 16-27
  Zone 3: Gap, LBA 28-31
  ...

Kernel representation - Zone Report (as supported today)
  Zone 0: SWR, LBA 0-15, Zone Capacity 12
  Zone 1: SWR, LBA 16-31, Zone Capacity 12
  ...

Doing it this way removes the need for filesystems, device-mappers and
user-space applications to understand gap zones, and allows UFS to work
out of the box with no changes to the rest of the zoned storage
eco-system.

Has the above representation been considered?

Best, Matias
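A tiny sketch may help show the collapse Matias proposes. It is plain
userspace C over the example layout in his mail; the struct names
(ufs_zone, kernel_zone) are invented for illustration and do not
correspond to actual kernel or UFS data structures.

```c
/*
 * Sketch: collapse each (SWR zone, gap zone) pair from a UFS-style
 * report into a single kernel-style zone whose capacity is the SWR
 * part. Purely illustrative; types are made up.
 */
#include <stdio.h>
#include <stdint.h>

struct ufs_zone { uint64_t start, len; int is_gap; };
struct kernel_zone { uint64_t start, len, capacity; };

int main(void)
{
	/* The example layout from the mail: 12-LBA SWR zones + 4-LBA gaps. */
	struct ufs_zone report[] = {
		{  0, 12, 0 }, { 12, 4, 1 },
		{ 16, 12, 0 }, { 28, 4, 1 },
	};
	struct kernel_zone out[2];
	int i, n = 0;

	for (i = 0; i < 4; i += 2) {
		out[n].start    = report[i].start;
		out[n].len      = report[i].len + report[i + 1].len;
		out[n].capacity = report[i].len;
		n++;
	}

	for (i = 0; i < n; i++)
		printf("Zone %d: SWR, LBA %llu-%llu, Zone Capacity %llu\n", i,
		       (unsigned long long)out[i].start,
		       (unsigned long long)(out[i].start + out[i].len - 1),
		       (unsigned long long)out[i].capacity);
	return 0;
}
```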
On 9/22/22 23:29, Matias Bjørling wrote:

> With UFS, in the proposed copy I have (it may have been changed),
> there is the concept of gap zones, which are zones that cannot be
> accessed by the host. The gap zones are essentially "LBA fillers",
> enabling the next writeable zone to start at an X * pow2 offset. My
> understanding is that this specific approach was chosen to simplify
> standardization in UFS and avoid updating T10's ZBC with zone
> capacity support.
>
> While UFS would technically expose non-power-of-2 zone sizes, the
> zones could, due to the gap zones, also be considered power-of-2 zones
> if one treats the sequential write zone plus the following gap zone as
> a single unit.
>
> When I think about having UFS support in the kernel, the SWR zone and
> the gap zone could be represented as a single unit.
>
> [...]
>
> Doing it this way removes the need for filesystems, device-mappers and
> user-space applications to understand gap zones, and allows UFS to
> work out of the box with no changes to the rest of the zoned storage
> eco-system.
>
> Has the above representation been considered?

Hi Matias,

What has been described above is the approach from the first version of
the zoned storage for UFS (ZUFS) draft standard. Support for this
approach is available in the upstream kernel. See also "[PATCH v2 0/9]
Support zoned devices with gap zones", 2022-04-21
(https://lore.kernel.org/linux-scsi/20220421183023.3462291-1-bvanassche@acm.org/).

Since F2FS extents must be split at gap zones, gap zones negatively
affect sequential read and write performance. So we abandoned the gap
zone approach. The current approach is as follows:

* The power-of-two restriction for the offset between zone starts has
  been removed. Gap zones are no longer required. Hence, we will need
  the patches that add support for zone sizes that are not a power of
  two.

* The Sequential Write Required (SWR) and Sequential Write Preferred
  (SWP) zone types are supported. The feedback we received from UFS
  vendors is that which zone type works best depends on their firmware
  and ASIC design.

* We need a queue depth larger than one (QD > 1) for writes to achieve
  the full sequential write bandwidth. We plan to support QD > 1 as
  follows:
  - If writes have to be serialized, submit them to the same hardware
    queue. According to the UFS host controller interface (UFSHCI)
    standard, UFS host controllers are not allowed to reorder SCSI
    commands that are submitted to the same hardware queue. A remaining
    source of command reordering is the SCSI retry mechanism. Retries
    happen e.g. after a command timeout.
  - For SWP zones, require the UFS device firmware to use its garbage
    collection mechanism to reorder data in the unlikely case that
    out-of-order writes happened.
  - For SWR zones, retry writes that failed because they were received
    out-of-order by a UFS device. ZBC-1 requires compliant devices to
    respond with ILLEGAL REQUEST / UNALIGNED WRITE COMMAND to
    out-of-order writes.

We have considered the zone append approach but decided not to use it
because, if zone append commands get reordered, the data ends up
permanently out-of-order on the storage medium. This affects sequential
read performance negatively.

Bart.