Message ID | 20220308165349.231320-1-p.raghav@samsung.com (mailing list archive) |
---|---|
Series | power_of_2 emulation support for NVMe ZNS devices |
This is completely bonkers. IFF we have a good reason to support non power of two zone sizes (and I'd like to see evidence for that) we'll need to go through all the layers to support it. But doing this emulation is just idiotic and will add tons of code just to completely confuse users.

On Tue, Mar 08, 2022 at 05:53:43PM +0100, Pankaj Raghav wrote:
>
> #Motivation:
> There are currently ZNS drives that are produced and deployed that do
> not have power_of_2(PO2) zone size. The NVMe spec for ZNS does not
> specify the PO2 requirement but the linux block layer currently checks
> for zoned devices to have power_of_2 zone sizes.

Well, apparently whoever produces these drives never cared about supporting Linux, as the power of two requirement goes back to SMR HDDs, which also don't have that requirement in the spec (and even allow non-uniform zone sizes), but Linux decided that we want this for sanity.

Do these drives even support Zone Append?
On 2022-03-10 10:47, Christoph Hellwig wrote:
> This is completely bonkers. IFF we have a good reason to support non
> power of two zone sizes (and I'd like to see evidence for that) we'll

Non power of 2 support is important to the users, which is why we started this effort. I have also CCed Bo from Bytedance based on their request.

> need to go through all the layers to support it. But doing this emulation
> is just idiotic and will add tons of code just to completely confuse users.

I agree with your point about creating non power of 2 support through all the layers, but this is the first step. One of the early pieces of feedback that we got from Damien was to not break the existing kernel and userspace applications that are written with the po2 assumption.

The following are the steps we have in the pipeline:
- Remove the constraint in the block layer
- Start migrating kernel applications such as btrfs so that they also work on non power of 2 devices.

Of course, we want to post RFCs for the steps mentioned above so that there can be a public discussion about the issues.

> Well, apparently whoever produces these drives never cared about supporting
> Linux, as the power of two requirement goes back to SMR HDDs, which also
> don't have that requirement in the spec (and even allow non-uniform zone
> sizes), but Linux decided that we want this for sanity.
>
> Do these drives even support Zone Append?

Yes, these drives are intended for Linux users that would use the zoned block device. Append is supported, but holes in the LBA space (due to the difference between zone capacity and zone size) are still a problem for these users.
> Yes, these drives are intended for Linux users that would use the zoned
> block device. Append is supported, but holes in the LBA space (due to the
> difference between zone capacity and zone size) are still a problem for these users.

With respect to the specific users, what does it break specifically? What key features are they missing when there are holes?
> On 10 Mar 2022, at 14.07, Matias Bjørling <matias.bjorling@wdc.com> wrote:
>
>> Yes, these drives are intended for Linux users that would use the zoned
>> block device. Append is supported, but holes in the LBA space (due to the
>> difference between zone capacity and zone size) are still a problem for these users.
>
> With respect to the specific users, what does it break specifically? What key features are they missing when there are holes?

What we hear is that it breaks existing mapping in applications, where the address space is seen as contiguous; with holes, they need to account for the unmapped space. This affects performance and CPU due to unnecessary splits. This is for both reads and writes.

For more details, I guess they will have to jump in and share the parts that they consider proper to share on the mailing list.

I guess we will have more conversations around this as we push the block layer changes after this series.
On Thu, Mar 10, 2022 at 01:57:58PM +0100, Pankaj Raghav wrote:
> Yes, these drives are intended for Linux users that would use the zoned
> block device. Append is supported, but holes in the LBA space (due to the
> difference between zone capacity and zone size) are still a problem for these users.

I'd really like to hear from the users. Because really, either they should use a proper file system abstraction (including zonefs if that is all they need), or raw nvme passthrough, which will already work for this case. But adding a whole bunch of crap because people want to use the block device special file for something it is not designed for just does not make any sense.
>>> Yes, these drives are intended for Linux users that would use the
>>> zoned block device. Append is supported, but holes in the LBA space
>>> (due to the difference between zone capacity and zone size) are still a problem for these users.
>>
>> With respect to the specific users, what does it break specifically? What key features are they missing when there are holes?
>
> What we hear is that it breaks existing mapping in applications, where the
> address space is seen as contiguous; with holes, they need to account for the
> unmapped space. This affects performance and CPU due to unnecessary
> splits. This is for both reads and writes.
>
> For more details, I guess they will have to jump in and share the parts that
> they consider proper to share on the mailing list.
>
> I guess we will have more conversations around this as we push the block
> layer changes after this series.

Ok, so I hear that one issue is I/O splits. If I assume that reads are sequential, with zone cap/size between 100MiB and 1GiB, then my gut feeling would tell me it's less CPU intensive to split every 100MiB to 1GiB of reads than it would be to not have power of 2 zones, due to the extra per-I/O calculations.

Do I have a faulty assumption about the above, or is there more to it?
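The "extra per-I/O calculations" at issue come down to how a sector is turned into a zone index and an in-zone offset. A minimal sketch of the two cases (hypothetical helpers, not code from this series or the kernel):

```c
#include <stdint.h>

/*
 * Illustrative helpers for the per-I/O arithmetic being discussed.
 * With a power-of-2 zone size, the zone index and in-zone offset reduce
 * to a shift and a mask; with an arbitrary zone size they become a
 * 64-bit division and modulo on every I/O.
 */
static inline uint64_t zone_index_po2(uint64_t sector, unsigned int zone_bits)
{
	return sector >> zone_bits;              /* zone_size == 1 << zone_bits */
}

static inline uint64_t zone_offset_po2(uint64_t sector, unsigned int zone_bits)
{
	return sector & ((1ULL << zone_bits) - 1);
}

static inline uint64_t zone_index_npo2(uint64_t sector, uint64_t zone_sectors)
{
	return sector / zone_sectors;            /* division on the hot path */
}

static inline uint64_t zone_offset_npo2(uint64_t sector, uint64_t zone_sectors)
{
	return sector % zone_sectors;
}
```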
On Thu, Mar 10, 2022 at 02:58:07PM +0000, Matias Bjørling wrote: > >> Yes, these drives are intended for Linux users that would use the > > >> zoned block device. Append is supported but holes in the LBA space > > >> (due to diff in zone cap and zone size) is still a problem for these users. > > > > > > With respect to the specific users, what does it break specifically? What are > > key features are they missing when there's holes? > > > > What we hear is that it breaks existing mapping in applications, where the > > address space is seen as contiguous; with holes it needs to account for the > > unmapped space. This affects performance and and CPU due to unnecessary > > splits. This is for both reads and writes. > > > > For more details, I guess they will have to jump in and share the parts that > > they consider is proper to share in the mailing list. > > > > I guess we will have more conversations around this as we push the block > > layer changes after this series. > > Ok, so I hear that one issue is I/O splits - If I assume that reads > are sequential, zone cap/size between 100MiB and 1GiB, then my gut > feeling would tell me its less CPU intensive to split every 100MiB to > 1GiB of reads, than it would be to not have power of 2 zones due to > the extra per io calculations. Don't you need to split anyway when spanning two zones to avoid the zone boundary error? Maybe this is a silly idea, but it would be a trivial device-mapper to remap the gaps out of the lba range.
On 10.03.2022 14:58, Matias Bjørling wrote: > >> Yes, these drives are intended for Linux users that would use the >> >> zoned block device. Append is supported but holes in the LBA space >> >> (due to diff in zone cap and zone size) is still a problem for these users. >> > >> > With respect to the specific users, what does it break specifically? What are >> key features are they missing when there's holes? >> >> What we hear is that it breaks existing mapping in applications, where the >> address space is seen as contiguous; with holes it needs to account for the >> unmapped space. This affects performance and and CPU due to unnecessary >> splits. This is for both reads and writes. >> >> For more details, I guess they will have to jump in and share the parts that >> they consider is proper to share in the mailing list. >> >> I guess we will have more conversations around this as we push the block >> layer changes after this series. > >Ok, so I hear that one issue is I/O splits - If I assume that reads are sequential, zone cap/size between 100MiB and 1GiB, then my gut feeling would tell me its less CPU intensive to split every 100MiB to 1GiB of reads, than it would be to not have power of 2 zones due to the extra per io calculations. > >Do I have a faulty assumption about the above, or is there more to it? I do not have numbers on the number of splits. I can only say that it is an issue. Then the whole management is apparently also costing some DRAM for extra mapping, instead of simply doing +1. The goal for these customers is not having the emulation, so the cost of the !PO2 path would be 0. For the existing applications that require a PO2, we have the emulation. In this case, the cost will only be paid on the devices that implement !PO2 zones. Hope this answer the question.
On 10.03.2022 07:07, Keith Busch wrote: >On Thu, Mar 10, 2022 at 02:58:07PM +0000, Matias Bjørling wrote: >> >> Yes, these drives are intended for Linux users that would use the >> > >> zoned block device. Append is supported but holes in the LBA space >> > >> (due to diff in zone cap and zone size) is still a problem for these users. >> > > >> > > With respect to the specific users, what does it break specifically? What are >> > key features are they missing when there's holes? >> > >> > What we hear is that it breaks existing mapping in applications, where the >> > address space is seen as contiguous; with holes it needs to account for the >> > unmapped space. This affects performance and and CPU due to unnecessary >> > splits. This is for both reads and writes. >> > >> > For more details, I guess they will have to jump in and share the parts that >> > they consider is proper to share in the mailing list. >> > >> > I guess we will have more conversations around this as we push the block >> > layer changes after this series. >> >> Ok, so I hear that one issue is I/O splits - If I assume that reads >> are sequential, zone cap/size between 100MiB and 1GiB, then my gut >> feeling would tell me its less CPU intensive to split every 100MiB to >> 1GiB of reads, than it would be to not have power of 2 zones due to >> the extra per io calculations. > >Don't you need to split anyway when spanning two zones to avoid the zone >boundary error? If you have size = capacity then you can do a cross-zone read. This is only a problem when we have gaps. >Maybe this is a silly idea, but it would be a trivial device-mapper >to remap the gaps out of the lba range. One thing we have considered is that as we remove the PO2 constraint from the block layer is that devices exposing PO2 zone sizes are able to do the emulation the other way around to support things like this. A device mapper is also a fine place to put this, but it seems like a very simple task. Is it worth all the boilerplate code for the device mapper only for this?
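The distinction Javier draws between size == capacity and gapped zones can be illustrated with a rough sketch of the check a host ends up doing for reads; the struct and helper below are hypothetical, not from the series:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only. A read that crosses the end of the usable capacity of
 * the zone it starts in runs into the unmapped gap when zone_cap < zone_size,
 * so it has to be split or clamped there. When zone_cap == zone_size there is
 * no gap: crossing that boundary is just an ordinary cross-zone read.
 */
struct zone_geom {
	uint64_t zone_size;	/* distance between zone starts, in sectors */
	uint64_t zone_cap;	/* usable sectors at the start of each zone */
};

static bool read_crosses_cap(const struct zone_geom *g,
			     uint64_t sector, uint64_t nr_sectors)
{
	uint64_t in_zone = sector % g->zone_size;

	return in_zone + nr_sectors > g->zone_cap;
}
```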
On Thu, Mar 10, 2022 at 10:47:25AM +0100, Christoph Hellwig wrote: > This is complete bonkers. IFF we have a good reason to support non > power of two zones size (and I'd like to see evidence for that) we'll > need to go through all the layers to support it. But doing this emulation > is just idiotic and will at tons of code just to completely confuse users. > > On Tue, Mar 08, 2022 at 05:53:43PM +0100, Pankaj Raghav wrote: > > > > #Motivation: > > There are currently ZNS drives that are produced and deployed that do > > not have power_of_2(PO2) zone size. The NVMe spec for ZNS does not > > specify the PO2 requirement but the linux block layer currently checks > > for zoned devices to have power_of_2 zone sizes. > > Well, apparently whoever produces these drives never cared about supporting > Linux as the power of two requirement goes back to SMR HDDs, which also > don't have that requirement in the spec (and even allow non-uniform zone > size), but Linux decided that we want this for sanity. Non uniform zone size definitely seems like a mess. Fixed zone sizes that are non po2 doesn't seem insane to me given that chunk sectors is no longer assumed to be po2. We have looked at removing po2 and the only hot path optimization for po2 is for appends. > > Do these drives even support Zone Append? Should it matter if the drives support append? SMR drives do not support append and they are considered zone block devices. Append seems to be an optimization for users that want higher concurrency per zone. One can also build concurrency by leveraging multiple zones simultaneously as well.
On 3/11/22 00:16, Javier González wrote: > On 10.03.2022 07:07, Keith Busch wrote: >> On Thu, Mar 10, 2022 at 02:58:07PM +0000, Matias Bjørling wrote: >>> >> Yes, these drives are intended for Linux users that would use the >>>>>> zoned block device. Append is supported but holes in the LBA space >>>>>> (due to diff in zone cap and zone size) is still a problem for these users. >>>>> >>>>> With respect to the specific users, what does it break specifically? What are >>>> key features are they missing when there's holes? >>>> >>>> What we hear is that it breaks existing mapping in applications, where the >>>> address space is seen as contiguous; with holes it needs to account for the >>>> unmapped space. This affects performance and and CPU due to unnecessary >>>> splits. This is for both reads and writes. >>>> >>>> For more details, I guess they will have to jump in and share the parts that >>>> they consider is proper to share in the mailing list. >>>> >>>> I guess we will have more conversations around this as we push the block >>>> layer changes after this series. >>> >>> Ok, so I hear that one issue is I/O splits - If I assume that reads >>> are sequential, zone cap/size between 100MiB and 1GiB, then my gut >>> feeling would tell me its less CPU intensive to split every 100MiB to >>> 1GiB of reads, than it would be to not have power of 2 zones due to >>> the extra per io calculations. >> >> Don't you need to split anyway when spanning two zones to avoid the zone >> boundary error? > > If you have size = capacity then you can do a cross-zone read. This is > only a problem when we have gaps. > >> Maybe this is a silly idea, but it would be a trivial device-mapper >> to remap the gaps out of the lba range. > > One thing we have considered is that as we remove the PO2 constraint > from the block layer is that devices exposing PO2 zone sizes are able to > do the emulation the other way around to support things like this. > > A device mapper is also a fine place to put this, but it seems like a > very simple task. Is it worth all the boilerplate code for the device > mapper only for this? Boiler plate ? DM already support zoned devices. Writing a "dm-unhole" target would be extremely simple as it would essentially be a variation of dm-linear. There should be no DM core changes needed.
On Thu, Mar 10, 2022 at 03:44:49PM +0100, Christoph Hellwig wrote:
> On Thu, Mar 10, 2022 at 01:57:58PM +0100, Pankaj Raghav wrote:
> > Yes, these drives are intended for Linux users that would use the zoned
> > block device. Append is supported, but holes in the LBA space (due to the
> > difference between zone capacity and zone size) are still a problem for these users.
>
> I'd really like to hear from the users. Because really, either they
> should use a proper file system abstraction (including zonefs if that is
> all they need),

That requires access to at least the block device, and without PO2 emulation that is not possible. Using zonefs is not possible today for !PO2 devices.

> or raw nvme passthrough, which will already work for this
> case.

This effort is not upstream yet; however, once and if it does land upstream, it means something other than zonefs must be used, since !PO2 devices are not supported by zonefs. So although the goal of zonefs was to provide a unified interface for raw access for applications, the PO2 requirement will essentially create fragmentation.

> But adding a whole bunch of crap because people want to use the
> block device special file for something it is not designed for just
> does not make any sense.

Using Linux requires PO2. And so, at Damien's request, the logical thing to do was to keep that requirement and avoid any performance regressions. That "crap" was done to slowly pave the way toward later removing the PO2 requirement.

I think we'll all acknowledge that doing emulation just means adding more software for something that is not a NAND requirement, but a requirement imposed by the inheritance of zoned software designed for SMR HDDs. I think we may also all acknowledge now that keeping this emulation code *forever* seems like complete insanity.

Since the PO2 requirement imposed on Linux today seems to be sending us down a dubious effort we'd need to support, let me then try to get the folks who have been saying that we must keep this requirement to answer the following question:

Are you 100% sure your ZNS hardware and firmware teams will always be happy that you have baked a PO2 requirement for ZNS drives into Linux, and are you ready to deal with those consequences on Linux forever? Really?

NAND has no PO2 requirement. The emulation effort was only done to help add support for !PO2 devices because there is no alternative. If we are instead ready to go down the avenue of removing those restrictions, well, let's go there instead. If that's not even something we are willing to consider, I'd really like the folks who stand behind the PO2 requirement to stick their necks out and clearly say that their hw/fw teams are happy to deal with this requirement forever on ZNS.

From what I am seeing this is a legacy requirement which we should be able to remove. Keeping the requirement will only do harm to ZNS adoption on Linux and it will also create *more* fragmentation.

Luis
On Fri, Mar 11, 2022 at 12:19:38PM -0800, Luis Chamberlain wrote: > NAND has no PO2 requirement. The emulation effort was only done to help > add support for !PO2 devices because there is no alternative. If we > however are ready instead to go down the avenue of removing those > restrictions well let's go there then instead. If that's not even > something we are willing to consider I'd really like folks who stand > behind the PO2 requirement to stick their necks out and clearly say that > their hw/fw teams are happy to deal with this requirement forever on ZNS. Regardless of the merits of the current OS requirement, it's a trivial matter for firmware to round up their reported zone size to the next power of 2. This does not create a significant burden on their part, as far as I know. And po2 does not even seem to be the real problem here. The holes seem to be what's causing a concern, which you have even without po2 zones. I'm starting to like the previous idea of creating an unholey device-mapper for such users...
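The "unholey" device-mapper idea boils down to exposing only the capacity portion of each zone as a contiguous LBA space and remapping each target sector around the per-zone holes. A sketch of that remapping math, assuming a fixed zone size and capacity (this is not an existing DM target, just the arithmetic such a target would do):

```c
#include <stdint.h>

/*
 * Hypothetical remapping for an "unholey" target: the target exposes a
 * contiguous LBA space made of zone capacities only, and maps each target
 * sector back to the underlying device, skipping the gap at the end of
 * every zone.
 */
struct unhole_map {
	uint64_t dev_zone_size;	/* underlying device: sectors between zone starts */
	uint64_t dev_zone_cap;	/* underlying device: usable sectors per zone     */
};

static uint64_t unhole_to_dev_sector(const struct unhole_map *m,
				     uint64_t target_sector)
{
	uint64_t zone   = target_sector / m->dev_zone_cap;
	uint64_t offset = target_sector % m->dev_zone_cap;

	return zone * m->dev_zone_size + offset;
}
```

Note that the resulting target would itself have a non-power-of-2 zone size equal to the zone capacity, which is why later in the thread Damien points out that such a target still needs block layer support for non-power-of-2 zone sizes.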
On Fri, Mar 11, 2022 at 12:51:35PM -0800, Keith Busch wrote: > On Fri, Mar 11, 2022 at 12:19:38PM -0800, Luis Chamberlain wrote: > > NAND has no PO2 requirement. The emulation effort was only done to help > > add support for !PO2 devices because there is no alternative. If we > > however are ready instead to go down the avenue of removing those > > restrictions well let's go there then instead. If that's not even > > something we are willing to consider I'd really like folks who stand > > behind the PO2 requirement to stick their necks out and clearly say that > > their hw/fw teams are happy to deal with this requirement forever on ZNS. > > Regardless of the merits of the current OS requirement, it's a trivial > matter for firmware to round up their reported zone size to the next > power of 2. This does not create a significant burden on their part, as > far as I know. Sure sure.. fw can do crap like that too... > And po2 does not even seem to be the real problem here. The holes seem > to be what's causing a concern, which you have even without po2 zones. Exactly. > I'm starting to like the previous idea of creating an unholey > device-mapper for such users... Won't that restrict nvme with chunk size crap. For instance later if we want much larger block sizes. Luis
On Fri, Mar 11, 2022 at 01:04:35PM -0800, Luis Chamberlain wrote: > On Fri, Mar 11, 2022 at 12:51:35PM -0800, Keith Busch wrote: > > > I'm starting to like the previous idea of creating an unholey > > device-mapper for such users... > > Won't that restrict nvme with chunk size crap. For instance later if we > want much larger block sizes. I'm not sure I understand. The chunk_size has nothing to do with the block size. And while nvme is a user of this in some circumstances, it can't be used concurrently with ZNS because the block layer appropriates the field for the zone size.
On Fri, Mar 11, 2022 at 12:51:35PM -0800, Keith Busch wrote: > On Fri, Mar 11, 2022 at 12:19:38PM -0800, Luis Chamberlain wrote: > > NAND has no PO2 requirement. The emulation effort was only done to help > > add support for !PO2 devices because there is no alternative. If we > > however are ready instead to go down the avenue of removing those > > restrictions well let's go there then instead. If that's not even > > something we are willing to consider I'd really like folks who stand > > behind the PO2 requirement to stick their necks out and clearly say that > > their hw/fw teams are happy to deal with this requirement forever on ZNS. > > Regardless of the merits of the current OS requirement, it's a trivial > matter for firmware to round up their reported zone size to the next > power of 2. This does not create a significant burden on their part, as > far as I know. I can't comment on FW burdens but adding po2 zone size creates holes for the FW to deal with as well. > > And po2 does not even seem to be the real problem here. The holes seem > to be what's causing a concern, which you have even without po2 zones. > I'm starting to like the previous idea of creating an unholey > device-mapper for such users... I see holes as being caused by having to make zone size po2 when capacity is not po2. po2 should be tied to the holes, unless I am missing something. BTW if we go down the dm route can we start calling it dm-unholy.
On Fri, Mar 11, 2022 at 01:31:02PM -0800, Keith Busch wrote: > On Fri, Mar 11, 2022 at 01:04:35PM -0800, Luis Chamberlain wrote: > > On Fri, Mar 11, 2022 at 12:51:35PM -0800, Keith Busch wrote: > > > > > I'm starting to like the previous idea of creating an unholey > > > device-mapper for such users... > > > > Won't that restrict nvme with chunk size crap. For instance later if we > > want much larger block sizes. > > I'm not sure I understand. The chunk_size has nothing to do with the > block size. And while nvme is a user of this in some circumstances, it > can't be used concurrently with ZNS because the block layer appropriates > the field for the zone size. Many device mapper targets split I/O into chunks, see max_io_len(), wouldn't this create an overhead? Using a device mapper target also creates a divergence in strategy for ZNS. Some will use the block device, others the dm target. The goal should be to create a unified path. And all this, just because SMR. Is that worth it? Are we sure? Luis
On Fri, Mar 11, 2022 at 10:23:33PM +0000, Adam Manzanares wrote: > On Fri, Mar 11, 2022 at 12:51:35PM -0800, Keith Busch wrote: > > And po2 does not even seem to be the real problem here. The holes seem > > to be what's causing a concern, which you have even without po2 zones. > > I'm starting to like the previous idea of creating an unholey > > device-mapper for such users... > > I see holes as being caused by having to make zone size po2 when capacity is > not po2. po2 should be tied to the holes, unless I am missing something. Practically speaking, you're probably not missing anything. The spec, however, doesn't constrain the existence of holes to any particular zone size. > BTW if we go down the dm route can we start calling it dm-unholy. I was thinking "dm-evil" but unholy works too. :)
On 3/12/22 07:24, Luis Chamberlain wrote: > On Fri, Mar 11, 2022 at 01:31:02PM -0800, Keith Busch wrote: >> On Fri, Mar 11, 2022 at 01:04:35PM -0800, Luis Chamberlain wrote: >>> On Fri, Mar 11, 2022 at 12:51:35PM -0800, Keith Busch wrote: >>> >>>> I'm starting to like the previous idea of creating an unholey >>>> device-mapper for such users... >>> >>> Won't that restrict nvme with chunk size crap. For instance later if we >>> want much larger block sizes. >> >> I'm not sure I understand. The chunk_size has nothing to do with the >> block size. And while nvme is a user of this in some circumstances, it >> can't be used concurrently with ZNS because the block layer appropriates >> the field for the zone size. > > Many device mapper targets split I/O into chunks, see max_io_len(), > wouldn't this create an overhead? Apart from the bio clone, the overhead should not be higher than what the block layer already has. IOs that are too large or that are straddling zones are split by the block layer, and DM splitting leads generally to no split in the block layer for the underlying device IO. DM essentially follows the same pattern: max_io_len() depends on the target design limits, which in turn depend on the underlying device. For a dm-unhole target, the IO size limit would typically be the same as that of the underlying device. > Using a device mapper target also creates a divergence in strategy > for ZNS. Some will use the block device, others the dm target. The > goal should be to create a unified path. If we allow non power of 2 zone sized devices, the path will *never* be unified because we will get fragmentation on what can run on these devices as opposed to power of 2 sized ones. E.g. f2fs will not work for the former but will for the latter. That is really not an ideal situation. > > And all this, just because SMR. Is that worth it? Are we sure? No. This is *not* because of SMR. Never has been. The first prototype SMR drives I received in my lab 10 years ago did not have a power of 2 sized zone size because zones where naturally aligned to tracks, which like NAND erase blocks, are not necessarily power of 2 sized. And all zones were not even the same size. That was not usable. The reason for the power of 2 requirement is 2 fold: 1) At the time we added zone support for SMR, chunk_sectors had to be a power of 2 number of sectors. 2) SMR users did request power of 2 zone sizes and that all zones have the same size as that simplified software design. There was even a de-facto agreement that 256MB zone size is a good compromise between usability and overhead of zone reclaim/GC. But that particular number is for HDD due to their performance characteristics. Hence the current Linux requirements which have been serving us well so far. DM needed that chunk_sectors be changed to allow non power of 2 values. So the chunk_sectors requirement was lifted recently (can't remember which version added this). Allowing non power of 2 zone size would thus be more easily feasible now. Allowing devices with a non power of 2 zone size is not technically difficult. But... The problem being raised is all about the fact that the power of 2 zone size requirement creates a hole of unusable sectors in every zone when the device implementation has a zone capacity lower than the zone size. 
I have been arguing all along that I think this problem is a non-problem, simply because a well designed application should *always* use zones as storage containers without ever hoping that the next zone in sequence can be used as well. The application should *never* consider the entire LBA space of the device capacity without this zone split. The zone based management of capacity is necessary for any good design to deal correctly with write error recovery and active/open zone resources management. And as Keith said. there is always a "hole" anyway for any non-full zone, between the zone write pointer and the last usable sector in the zone. Reads there are nonsensical and writes can only go to one place. Now, in the spirit of trying to facilitate software development for zoned devices, we can try finding solutions to remove that hole. zonefs is a obvious solution. But back to the previous point: with one zone == one file, there is no continuity in the storage address space that the application can use. The application has to be designed to use individual files representing a zone. And with such design, an equivalent design directly using the block device file would have no difficulties due to the the sector hole between zone capacity and zone size. I have a prototype LevelDB implementation that can use both zonefs and block device file on ZNS with only a few different lines of code to prove this point. The other solution would be adding a dm-unhole target to remap sectors to remove the holes from the device address space. Such target would be easy to write, but in my opinion, this would still not change the fact that applications still have to deal with error recovery and active/open zone resources. So they still have to be zone aware and operate per zone. Furthermore, adding such DM target would create a non power of 2 zone size zoned device which will need support from the block layer. So some block layer functions will need to change. In the end, this may not be different than enabling non power of 2 zone sized devices for ZNS. And for this decision, I maintain some of my requirements: 1) The added overhead from multiplication & divisions should be acceptable and not degrade performance. Otherwise, this would be a disservice to the zone ecosystem. 2) Nothing that works today on available devices should break 3) Zone size requirements will still exist. E.g. btrfs 64K alignment requirement But even with all these properly addressed, f2fs will not work anymore, some in-kernel users will still need some zone size requirements (btrfs) and *all* applications using a zoned block device file will now have to be designed based on non power of 2 zone size so that they can work on all devices. Meaning that this is also potentially forcing changes on existing applications to use newer zoned devices that may not have a power of 2 zone size. This entire discussion is about the problem that power of 2 zone size creates (which again I think is a non-problem). However, based on the arguments above, allowing non power of 2 zone sized devices is not exactly problem free either. My answer to your last question ("Are we sure?") is thus: No. I am not sure this is a good idea. But as always, I would be happy to be proven wrong. So far, I have not seen any argument doing that.
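Damien's point that every non-full zone already has an unusable region between the write pointer and the last usable sector can be read directly off the fields a zone report returns. An illustrative user-space sketch over struct blk_zone from <linux/blkzoned.h>; the helpers are hypothetical and assume a sequential-write-required zone that is not full, read-only, or offline (the write pointer is not meaningful otherwise):

```c
#include <linux/blkzoned.h>
#include <stdint.h>

/*
 * Illustrative only: given one entry from a zone report, compute the ranges
 * an application can actually use. Reads make sense up to the write pointer,
 * writes can only go to [wp, start + capacity), and everything from
 * start + capacity to start + len is never usable (the cap/size hole).
 */
static uint64_t zone_readable_sectors(const struct blk_zone *z)
{
	return z->wp - z->start;                 /* data written so far      */
}

static uint64_t zone_writable_sectors(const struct blk_zone *z)
{
	return z->start + z->capacity - z->wp;   /* remaining writable space */
}

static uint64_t zone_hole_sectors(const struct blk_zone *z)
{
	return z->len - z->capacity;             /* 0 when size == capacity  */
}
```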
On Sat, Mar 12, 2022 at 04:58:08PM +0900, Damien Le Moal wrote:
> The reason for the power of 2 requirement is 2 fold:
> 1) At the time we added zone support for SMR, chunk_sectors had to be a
> power of 2 number of sectors.
> 2) SMR users did request power of 2 zone sizes and that all zones have
> the same size as that simplified software design. There was even a
> de-facto agreement that 256MB zone size is a good compromise between
> usability and overhead of zone reclaim/GC. But that particular number is
> for HDD due to their performance characteristics.

Also for NVMe we initially went down the road of trying to support non power of two sizes. But there was another major early host that really wanted power of two zone sizes to support hardware based hosts that can cheaply do shifts but not divisions. The variable zone capacity feature (something that Linux does not currently support) is a feature requested by NVMe members on the host and device side, and it can only be supported with the zone size / zone capacity split.

> The other solution would be adding a dm-unhole target to remap sectors
> to remove the holes from the device address space. Such target would be
> easy to write, but in my opinion, this would still not change the fact
> that applications still have to deal with error recovery and active/open
> zone resources. So they still have to be zone aware and operate per zone.

I don't think we even need a new target for it. I think you can do this with a table using multiple dm-linear sections already if you want.

> My answer to your last question ("Are we sure?") is thus: No. I am not
> sure this is a good idea. But as always, I would be happy to be proven
> wrong. So far, I have not seen any argument doing that.

Agreed. Supporting non-power of two sizes in the block layer is fairly easy, as shown by some of the patches seen in this series. Supporting them properly in the whole ecosystem is not trivial and will create a long-term burden. We could do that, but we'd rather have a really good reason for it, and right now I don't see that.
On Thu, Mar 10, 2022 at 05:38:35PM +0000, Adam Manzanares wrote: > > Do these drives even support Zone Append? > > Should it matter if the drives support append? SMR drives do not support append > and they are considered zone block devices. Append seems to be an optimization > for users that want higher concurrency per zone. One can also build concurrency > by leveraging multiple zones simultaneously as well. Not supporting it natively for SMR is a major pain. Due to hard drives being relatively slow the emulation is somewhat workable, but on SSDs the serialization would completely kill performance.
On 3/14/22 16:35, Christoph Hellwig wrote: > On Sat, Mar 12, 2022 at 04:58:08PM +0900, Damien Le Moal wrote: >> The reason for the power of 2 requirement is 2 fold: >> 1) At the time we added zone support for SMR, chunk_sectors had to be a >> power of 2 number of sectors. >> 2) SMR users did request power of 2 zone sizes and that all zones have >> the same size as that simplified software design. There was even a >> de-facto agreement that 256MB zone size is a good compromise between >> usability and overhead of zone reclaim/GC. But that particular number is >> for HDD due to their performance characteristics. > > Also for NVMe we initially went down the road to try to support > non power of two sizes. But there was another major early host that > really wanted the power of two zone sizes to support hardware based > hosts that can cheaply do shifts but not divisions. The variable > zone capacity feature (something that Linux does not currently support) > is a feature requested by NVMe members on the host and device side > also can only be supported with the the zone size / zone capacity split. > >> The other solution would be adding a dm-unhole target to remap sectors >> to remove the holes from the device address space. Such target would be >> easy to write, but in my opinion, this would still not change the fact >> that applications still have to deal with error recovery and active/open >> zone resources. So they still have to be zone aware and operate per zone. > > I don't think we even need a new target for it. I think you can do > this with a table using multiple dm-linear sections already if you > want. Nope, this is currently not possible: DM requires the target zone size to be the same as the underlying device zone size. So that would not work. > >> My answer to your last question ("Are we sure?") is thus: No. I am not >> sure this is a good idea. But as always, I would be happy to be proven >> wrong. So far, I have not seen any argument doing that. > > Agreed. Supporting non-power of two sizes in the block layer is fairly > easy as shown by some of the patches seens in this series. Supporting > them properly in the whole ecosystem is not trivial and will create a > long-term burden. We could do that, but we'd rather have a really good > reason for it, and right now I don't see that.
On Mon, Mar 14, 2022 at 04:45:12PM +0900, Damien Le Moal wrote: > Nope, this is currently not possible: DM requires the target zone size > to be the same as the underlying device zone size. So that would not work. Indeed.
> > Furthermore, adding such DM target would create a non power of 2 zone size > zoned device which will need support from the block layer. So some block layer > functions will need to change. In the end, this may not be different than > enabling non power of 2 zone sized devices for ZNS. > > And for this decision, I maintain some of my requirements: > 1) The added overhead from multiplication & divisions should be acceptable > and not degrade performance. Otherwise, this would be a disservice to the > zone ecosystem. > 2) Nothing that works today on available devices should break > 3) Zone size requirements will still exist. E.g. btrfs 64K alignment requirement > Adding to the existing points that has been made. I believe it hasn't been mentioned that for non-power of 2 zone sizes, holes are still allowed due to zones being/becoming offline. The offline zone state supports neither writes nor reads, and applications must be aware and work around such holes in the address space. Furthermore, the specification doesn't allow writes to cross zones - so while reads may cross a zone, the writes must always be broken up across zone boundaries. As a result, applications must work with zones independently and can't assume that it can write to the adjacent zone nor write across two zones. Best, Matias
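The two constraints Matias lists (zones may become read-only or offline, and writes must never cross a zone boundary) translate into per-zone checks that an application performs regardless of whether the zone size is a power of 2. An illustrative sketch using the zone condition codes from <linux/blkzoned.h>; the helper names are made up:

```c
#include <linux/blkzoned.h>
#include <stdbool.h>
#include <stdint.h>

/* Read-only and offline zones are holes the application must skip. */
static bool zone_is_usable(const struct blk_zone *z)
{
	return z->cond != BLK_ZONE_COND_READONLY &&
	       z->cond != BLK_ZONE_COND_OFFLINE;
}

/*
 * Writes must never span two zones, so a write is capped at the writable
 * capacity of the zone it starts in; the remainder goes to another zone.
 */
static uint64_t cap_write_to_zone(const struct blk_zone *z,
				  uint64_t sector, uint64_t nr_sectors)
{
	uint64_t end = z->start + z->capacity;   /* first unwritable sector */

	return sector + nr_sectors > end ? end - sector : nr_sectors;
}
```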
On 14.03.2022 16:45, Damien Le Moal wrote: >On 3/14/22 16:35, Christoph Hellwig wrote: >> On Sat, Mar 12, 2022 at 04:58:08PM +0900, Damien Le Moal wrote: >>> The reason for the power of 2 requirement is 2 fold: >>> 1) At the time we added zone support for SMR, chunk_sectors had to be a >>> power of 2 number of sectors. >>> 2) SMR users did request power of 2 zone sizes and that all zones have >>> the same size as that simplified software design. There was even a >>> de-facto agreement that 256MB zone size is a good compromise between >>> usability and overhead of zone reclaim/GC. But that particular number is >>> for HDD due to their performance characteristics. >> >> Also for NVMe we initially went down the road to try to support >> non power of two sizes. But there was another major early host that >> really wanted the power of two zone sizes to support hardware based >> hosts that can cheaply do shifts but not divisions. The variable >> zone capacity feature (something that Linux does not currently support) >> is a feature requested by NVMe members on the host and device side >> also can only be supported with the the zone size / zone capacity split. >> >>> The other solution would be adding a dm-unhole target to remap sectors >>> to remove the holes from the device address space. Such target would be >>> easy to write, but in my opinion, this would still not change the fact >>> that applications still have to deal with error recovery and active/open >>> zone resources. So they still have to be zone aware and operate per zone. >> >> I don't think we even need a new target for it. I think you can do >> this with a table using multiple dm-linear sections already if you >> want. > >Nope, this is currently not possible: DM requires the target zone size >to be the same as the underlying device zone size. So that would not work. > >> >>> My answer to your last question ("Are we sure?") is thus: No. I am not >>> sure this is a good idea. But as always, I would be happy to be proven >>> wrong. So far, I have not seen any argument doing that. >> >> Agreed. Supporting non-power of two sizes in the block layer is fairly >> easy as shown by some of the patches seens in this series. Supporting >> them properly in the whole ecosystem is not trivial and will create a >> long-term burden. We could do that, but we'd rather have a really good >> reason for it, and right now I don't see that. I think that Bo's use-case is an example of a major upstream Linux host that is struggling with unmmapped LBAs. Can we focus on this use-case and the parts that we are missing to support Bytedance? If you agree to this, I believe we can add support for ZoneFS pretty easily. We also have a POC in btrfs that we will follow on. For the time being, F2FS would fail at mkfs time if zone size is not a PO2. What do you think?
> >> Agreed. Supporting non-power of two sizes in the block layer is > >> fairly easy as shown by some of the patches seens in this series. > >> Supporting them properly in the whole ecosystem is not trivial and > >> will create a long-term burden. We could do that, but we'd rather > >> have a really good reason for it, and right now I don't see that. > > I think that Bo's use-case is an example of a major upstream Linux host that is > struggling with unmmapped LBAs. Can we focus on this use-case and the parts > that we are missing to support Bytedance? Any application that uses zoned storage devices would have to manage unmapped LBAs due to the potential of zones being/becoming offline (no reads/writes allowed). Eliminating the difference between zone cap and zone size will not remove this requirement, and holes will continue to exist. Furthermore, writing to LBAs across zones is not allowed by the specification and must also be managed. Given the above, applications have to be conscious of zones in general and work within their boundaries. I don't understand how applications can work without having per-zone knowledge. An application would have to know about zones and their writeable capacity. To decide where and how data is written, an application must manage writing across zones, specific offline zones, and (currently) its writeable capacity. I.e., knowledge about zones and holes is required for writing to zoned devices and isn't eliminated by removing the PO2 zone size requirement. For years, the PO2 requirement has been known in the Linux community and by the ZNS SSD vendors. Some SSD implementors have chosen not to support PO2 zone sizes, which is a perfectly valid decision. But its implementors knowingly did that while knowing that the Linux kernel didn't support it. I want to turn the argument around to see it from the kernel developer's point of view. They have communicated the PO2 requirement clearly, there's good precedence working with PO2 zone sizes, and at last, holes can't be avoided and are part of the overall design of zoned storage devices. So why should the kernel developer's take on the long-term maintenance burden of NPO2 zone sizes?
On Mon, Mar 14, 2022 at 02:16:36PM +0000, Matias Bjørling wrote:
> I want to turn the argument around to see it from the kernel
> developer's point of view. They have communicated the PO2 requirement
> clearly,

Such a requirement is based on history and effort put in place to assume a PO2 requirement for zoned storage, and clearly it is not. And clearly even vendors who have embraced PO2 don't know for sure they'll always be able to stick to PO2...

> there's good precedence working with PO2 zone sizes, and at
> last, holes can't be avoided and are part of the overall design of
> zoned storage devices. So why should the kernel developer's take on
> the long-term maintenance burden of NPO2 zone sizes?

I think the better question to address here is:

Do we *not* want to support NPO2 zone sizes in Linux out of principle?

If we *are* open to supporting NPO2 zone sizes, what path should we take to incur the least pain and fragmentation?

Emulation was a path being considered, and I think at this point the answer to evaluating that path is: this is cumbersome, probably not.

The next question then is: are we open to evaluating what it looks like to slowly shave off the PO2 requirement in different layers, with a goal of avoiding further fragmentation? There is effort on evaluating that path and it doesn't seem to be that bad.

So I'd advise evaluating that; there is nothing to lose other than awareness of what that path might look like. Unless of course we already have a clear path forward for NPO2 we can all agree on.

Luis
> -----Original Message----- > From: Luis Chamberlain <mcgrof@infradead.org> On Behalf Of Luis > Chamberlain > Sent: Monday, 14 March 2022 17.24 > To: Matias Bjørling <Matias.Bjorling@wdc.com> > Cc: Javier González <javier@javigon.com>; Damien Le Moal > <damien.lemoal@opensource.wdc.com>; Christoph Hellwig <hch@lst.de>; > Keith Busch <kbusch@kernel.org>; Pankaj Raghav <p.raghav@samsung.com>; > Adam Manzanares <a.manzanares@samsung.com>; > jiangbo.365@bytedance.com; kanchan Joshi <joshi.k@samsung.com>; Jens > Axboe <axboe@kernel.dk>; Sagi Grimberg <sagi@grimberg.me>; Pankaj > Raghav <pankydev8@gmail.com>; Kanchan Joshi <joshiiitr@gmail.com>; linux- > block@vger.kernel.org; linux-nvme@lists.infradead.org > Subject: Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices > > On Mon, Mar 14, 2022 at 02:16:36PM +0000, Matias Bjørling wrote: > > I want to turn the argument around to see it from the kernel > > developer's point of view. They have communicated the PO2 requirement > > clearly, > > Such requirement is based on history and effort put in place to assume a PO2 > requirement for zone storage, and clearly it is not. And clearly even vendors > who have embraced PO2 don't know for sure they'll always be able to stick to > PO2... Sure - It'll be naïve to give a carte blanche promise. However, you're skipping the next two elements, which state that there are both good precedence working with PO2 zone sizes and that holes/unmapped LBAs can't be avoided. Making an argument for why NPO2 zone sizes may not bring what one is looking for. It's a lot of work for little practical change, if any. > > > there's good precedence working with PO2 zone sizes, and at last, > > holes can't be avoided and are part of the overall design of zoned > > storage devices. So why should the kernel developer's take on the > > long-term maintenance burden of NPO2 zone sizes? > > I think the better question to address here is: > > Do we *not* want to support NPO2 zone sizes in Linux out of principal? > > If we *are* open to support NPO2 zone sizes, what path should we take to > incur the least pain and fragmentation? > > Emulation was a path being considered, and I think at this point the answer to > eveluating that path is: this is cumbersome, probably not. > > The next question then is: are we open to evaluate what it looks like to slowly > shave off the PO2 requirement in different layers, with an goal to avoid further > fragmentation? There is effort on evaluating that path and it doesn't seem to > be that bad. > > So I'd advise to evaluate that, there is nothing to loose other than awareness of > what that path might look like. > > Uness of course we already have a clear path forward for NPO2 we can all > agree on. It looks like there isn't currently one that can be agreed upon. If evaluating different approaches, it would be helpful to the reviewers if interfaces and all of its kernel users are converted in a single patchset. This would also help to avoid users getting hit by what is supported, and what isn't supported by a particular device implementation and allow better to review the full set of changes required to add the support.
On Mon, Mar 14, 2022 at 07:30:25PM +0000, Matias Bjørling wrote: > > -----Original Message----- > > From: Luis Chamberlain <mcgrof@infradead.org> On Behalf Of Luis > > Chamberlain > > Sent: Monday, 14 March 2022 17.24 > > To: Matias Bjørling <Matias.Bjorling@wdc.com> > > Cc: Javier González <javier@javigon.com>; Damien Le Moal > > <damien.lemoal@opensource.wdc.com>; Christoph Hellwig <hch@lst.de>; > > Keith Busch <kbusch@kernel.org>; Pankaj Raghav <p.raghav@samsung.com>; > > Adam Manzanares <a.manzanares@samsung.com>; > > jiangbo.365@bytedance.com; kanchan Joshi <joshi.k@samsung.com>; Jens > > Axboe <axboe@kernel.dk>; Sagi Grimberg <sagi@grimberg.me>; Pankaj > > Raghav <pankydev8@gmail.com>; Kanchan Joshi <joshiiitr@gmail.com>; linux- > > block@vger.kernel.org; linux-nvme@lists.infradead.org > > Subject: Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices > > > > On Mon, Mar 14, 2022 at 02:16:36PM +0000, Matias Bjørling wrote: > > > I want to turn the argument around to see it from the kernel > > > developer's point of view. They have communicated the PO2 requirement > > > clearly, > > > > Such requirement is based on history and effort put in place to assume a PO2 > > requirement for zone storage, and clearly it is not. And clearly even vendors > > who have embraced PO2 don't know for sure they'll always be able to stick to > > PO2... > > Sure - It'll be naïve to give a carte blanche promise. Exactly. So taking a position to not support NPO2 I think seems counter productive to the future of ZNS, the question whould be, *how* to best do this in light of what we need to support / avoid performance regressions / strive towards avoiding fragmentation. > However, you're skipping the next two elements, which state that there > are both good precedence working with PO2 zone sizes and that > holes/unmapped LBAs can't be avoided. I'm not, but I admit that it's a good point of having the possibility of zones being taken offline also implicates holes. I also think it was a good excercise to discuss and evaluate emulation given I don't think this point you made would have been made clear otherwise. This is why I treat ZNS as evolving effort, and I can't seriously take any position stating all answers are known. > Making an argument for why NPO2 > zone sizes may not bring what one is looking for. It's a lot of work > for little practical change, if any. NAND does not incur a PO2 requirement, that should be enough to implicate that PO2 zones *can* be expected. If no vendor wants to take a position that they know for a fact they'll never adopt PO2 zones should be enough to keep an open mind to consider *how* to support them. > > > there's good precedence working with PO2 zone sizes, and at last, > > > holes can't be avoided and are part of the overall design of zoned > > > storage devices. So why should the kernel developer's take on the > > > long-term maintenance burden of NPO2 zone sizes? > > > > I think the better question to address here is: > > > > Do we *not* want to support NPO2 zone sizes in Linux out of principal? > > > > If we *are* open to support NPO2 zone sizes, what path should we take to > > incur the least pain and fragmentation? > > > > Emulation was a path being considered, and I think at this point the answer to > > eveluating that path is: this is cumbersome, probably not. > > > > The next question then is: are we open to evaluate what it looks like to slowly > > shave off the PO2 requirement in different layers, with an goal to avoid further > > fragmentation? 
There is effort on evaluating that path and it doesn't seem to > > be that bad. > > > > So I'd advise to evaluate that, there is nothing to loose other than awareness of > > what that path might look like. > > > > Uness of course we already have a clear path forward for NPO2 we can all > > agree on. > > It looks like there isn't currently one that can be agreed upon. I'm not quite sure that is the case. To reach consensus one has to take a position of accepting the right answer may not be known and we evaluate all prospects. It is not clear to me that we've done that yet and it is why I think a venue such as LSFMM may be good to review these things. > If evaluating different approaches, it would be helpful to the > reviewers if interfaces and all of its kernel users are converted in a > single patchset. This would also help to avoid users getting hit by > what is supported, and what isn't supported by a particular device > implementation and allow better to review the full set of changes > required to add the support. Sorry I didn't understand the suggestion here, can you clarify what it is you are suggesting? Thanks! Luis
On 14.03.2022 14:16, Matias Bjørling wrote: >> >> Agreed. Supporting non-power of two sizes in the block layer is >> >> fairly easy as shown by some of the patches seens in this series. >> >> Supporting them properly in the whole ecosystem is not trivial and >> >> will create a long-term burden. We could do that, but we'd rather >> >> have a really good reason for it, and right now I don't see that. >> >> I think that Bo's use-case is an example of a major upstream Linux host that is >> struggling with unmmapped LBAs. Can we focus on this use-case and the parts >> that we are missing to support Bytedance? > >Any application that uses zoned storage devices would have to manage >unmapped LBAs due to the potential of zones being/becoming offline (no >reads/writes allowed). Eliminating the difference between zone cap and >zone size will not remove this requirement, and holes will continue to >exist. Furthermore, writing to LBAs across zones is not allowed by the >specification and must also be managed. > >Given the above, applications have to be conscious of zones in general and work within their boundaries. I don't understand how applications can work without having per-zone knowledge. An application would have to know about zones and their writeable capacity. To decide where and how data is written, an application must manage writing across zones, specific offline zones, and (currently) its writeable capacity. I.e., knowledge about zones and holes is required for writing to zoned devices and isn't eliminated by removing the PO2 zone size requirement. Supporting offlines zones is optional in the ZNS spec? We are not considering supporting this in the host. This will be handled by the device for exactly maintaining the SW stack simpler. > >For years, the PO2 requirement has been known in the Linux community and by the ZNS SSD vendors. Some SSD implementors have chosen not to support PO2 zone sizes, which is a perfectly valid decision. But its implementors knowingly did that while knowing that the Linux kernel didn't support it. > >I want to turn the argument around to see it from the kernel developer's point of view. They have communicated the PO2 requirement clearly, there's good precedence working with PO2 zone sizes, and at last, holes can't be avoided and are part of the overall design of zoned storage devices. So why should the kernel developer's take on the long-term maintenance burden of NPO2 zone sizes? You have a good point, and that is the question we need to help answer. As I see it, requirements evolve and the kernel changes with it as long as there are active upstream users for it. The main constraint for PO2 is removed in the block layer, we have Linux hosts stating that unmapped LBAs are a problem, and we have HW supporting size=capacity. I would be happy to hear what else you would like to see for this to be of use to the kernel community.
> > > On Mon, Mar 14, 2022 at 02:16:36PM +0000, Matias Bjørling wrote: > > > > I want to turn the argument around to see it from the kernel > > > > developer's point of view. They have communicated the PO2 > > > > requirement clearly, > > > > > > Such requirement is based on history and effort put in place to > > > assume a PO2 requirement for zone storage, and clearly it is not. > > > And clearly even vendors who have embraced PO2 don't know for sure > > > they'll always be able to stick to PO2... > > > > Sure - It'll be naïve to give a carte blanche promise. > > Exactly. So taking a position to not support NPO2 I think seems counter > productive to the future of ZNS, the question whould be, *how* to best do this > in light of what we need to support / avoid performance regressions / strive > towards avoiding fragmentation. Having non-power of two zone sizes is a derivation from existing devices being used in full production today. That there is a wish to introduce support for such drives is interesting, but given the background and development of zoned devices. Damien mentioned that SMR HDDs didn't start off with PO2 zone sizes - that was what became the norm due to its overall benefits. I.e., drives with NPO2 zone sizes is the odd one, and in some views, is the one creating fragmentation. That there is a wish to revisit that design decision is fair, and it sounds like there is willingness to explorer such options. But please be advised that the Linux community have had communicated the specific requirement for a long time to avoid this particular issue. Thus, the community have been trying to help the vendors make the appropriate design decisions, such that they could take advantage of the Linux kernel stack from day one. > > However, you're skipping the next two elements, which state that there > > are both good precedence working with PO2 zone sizes and that > > holes/unmapped LBAs can't be avoided. > > I'm not, but I admit that it's a good point of having the possibility of zones being > taken offline also implicates holes. I also think it was a good excercise to > discuss and evaluate emulation given I don't think this point you made would > have been made clear otherwise. This is why I treat ZNS as evolving effort, and > I can't seriously take any position stating all answers are known. That's good to hear. I would note that some members in this thread have been doing zoned storage for close to a decade, and have a very thorough understanding of the zoned storage model - so it might be a stretch for them to hear that you're considering everything up in the air and early. This stack is already being used by a large percentage of the bits being shipped in the world. Thus, there is an interest in maintaining these things, and making sure that things don't regress and so on. > > > Making an argument for why NPO2 > > zone sizes may not bring what one is looking for. It's a lot of work > > for little practical change, if any. > > NAND does not incur a PO2 requirement, that should be enough to implicate > that PO2 zones *can* be expected. If no vendor wants to take a position that > they know for a fact they'll never adopt > PO2 zones should be enough to keep an open mind to consider *how* to > support them. As long as it doesn't also imply that support *has* to be added to the kernel, then that's okay. <snip> > > > If evaluating different approaches, it would be helpful to the > > reviewers if interfaces and all of its kernel users are converted in a > > single patchset. 
This would also help to avoid users getting hit by > > what is supported, and what isn't supported by a particular device > > implementation and allow a better review of the full set of changes > > required to add the support. > > Sorry I didn't understand the suggestion here, can you clarify what it is you are > suggesting? It would help reviewers if a potential patchset converted all users (e.g., f2fs, btrfs, device mappers, io schedulers, etc.), such that the full effect can be evaluated, with the added benefit that end-users would not have to think about what is and what isn't supported.
> >Given the above, applications have to be conscious of zones in general and > work within their boundaries. I don't understand how applications can work > without having per-zone knowledge. An application would have to know about > zones and their writeable capacity. To decide where and how data is written, > an application must manage writing across zones, specific offline zones, and > (currently) its writeable capacity. I.e., knowledge about zones and holes is > required for writing to zoned devices and isn't eliminated by removing the PO2 > zone size requirement. > > Supporting offlines zones is optional in the ZNS spec? We are not considering > supporting this in the host. This will be handled by the device for exactly > maintaining the SW stack simpler. It isn't optional. The spec allows any zone to go to the Read Only or Offline state at any point in time. A specific implementation might give some guarantees as to when such transitions happen, but it must nevertheless be managed by the host software. Given that, and the need to not issue writes that span zones, an application would have to be aware of such behaviors. The information to make those decisions is in a zone's attributes, and since applications would pull those anyway, they would also know the writeable capacity of a zone. So, all in all, creating support for NPO2 is something that takes a lot of work, but might have little to no impact on the overall software design. > > > >For years, the PO2 requirement has been known in the Linux community and > by the ZNS SSD vendors. Some SSD implementors have chosen not to support > PO2 zone sizes, which is a perfectly valid decision. But its implementors > knowingly did that while knowing that the Linux kernel didn't support it. > > > >I want to turn the argument around to see it from the kernel developer's point > of view. They have communicated the PO2 requirement clearly, there's good > precedence working with PO2 zone sizes, and at last, holes can't be avoided > and are part of the overall design of zoned storage devices. So why should the > kernel developer's take on the long-term maintenance burden of NPO2 zone > sizes? > > You have a good point, and that is the question we need to help answer. > As I see it, requirements evolve and the kernel changes with it as long as there > are active upstream users for it. True. There are also active users for SSDs which are custom (e.g., requiring larger than 4KiB writes) - but they aren't supported by the Linux kernel and aren't actively being worked on to my knowledge. Which is fine, as those customers use the devices in their own way and don't need Linux kernel support. > > The main constraint for (1) PO2 is removed in the block layer, we have (2) Linux hosts > stating that unmapped LBAs are a problem, and we have (3) HW supporting > size=capacity. > > I would be happy to hear what else you would like to see for this to be of use to > the kernel community. (Added numbers to your paragraph above) 1. The sysfs chunksize attribute was "misused" to also represent zone size. What has changed is that RAID controllers now can use a NPO2 chunk size. This wasn't meant to naturally extend to zones, which, as shown in the currently posted patchset, is a lot more work. 2. Bo mentioned that the software already manages holes. It took a bit of time to get right, but now it works. The software in question is thus already capable of working with holes, and fixing this would present itself as a minor optimization overall.
I'm not convinced the work to do this in the kernel is proportional to the change it'll make to the applications. 3. I'm happy to hear that. However, I'd like to reiterate the point that the PO2 requirement has been known for years. That there's a drive doing NPO2 zones is great, but a decision was made by the SSD implementors to not support the Linux kernel given its current implementation. All that said - if there are people willing to do the work and it doesn't have a negative impact on performance, code quality, maintenance complexity, etc. then there isn't anything saying support can't be added - but it does seem like it’s a lot of work, for little overall benefit to applications and the host users.
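For readers skimming the thread: the per-zone attributes referred to above (size, capacity, condition) are already visible to userspace through the BLKREPORTZONE ioctl, so an application can discover both the cap/size hole and any read-only or offline zones it must skip. The sketch below is an illustration only, not code from the posted series; the device path is a placeholder and the capacity field assumes kernel headers of 5.9 or later (BLK_ZONE_REP_CAPACITY).

```c
/* Illustration only (not from the posted series): dump zone size, capacity
 * and condition so an application can see both the cap/size "holes" and any
 * read-only/offline zones it must skip.  /dev/nvme0n1 is a placeholder. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/blkzoned.h>

int main(void)
{
	int fd = open("/dev/nvme0n1", O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	unsigned int nr = 16;
	struct blk_zone_report *rep =
		calloc(1, sizeof(*rep) + nr * sizeof(struct blk_zone));

	rep->sector = 0;	/* report from the first zone onwards */
	rep->nr_zones = nr;

	if (ioctl(fd, BLKREPORTZONE, rep) < 0) {
		perror("BLKREPORTZONE");
		return 1;
	}

	for (unsigned int i = 0; i < rep->nr_zones; i++) {
		struct blk_zone *z = &rep->zones[i];
		/* capacity is only valid when the kernel sets this flag */
		__u64 cap = (rep->flags & BLK_ZONE_REP_CAPACITY) ?
			    z->capacity : z->len;

		printf("zone %u: start %llu len %llu cap %llu hole %llu cond %u\n",
		       i, (unsigned long long)z->start,
		       (unsigned long long)z->len,
		       (unsigned long long)cap,
		       (unsigned long long)(z->len - cap),
		       (unsigned int)z->cond);

		if (z->cond == BLK_ZONE_COND_READONLY ||
		    z->cond == BLK_ZONE_COND_OFFLINE)
			printf("  -> not writable, application must skip it\n");
	}

	free(rep);
	close(fd);
	return 0;
}
```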
On 15.03.2022 12:32, Matias Bjørling wrote: >> >Given the above, applications have to be conscious of zones in general and >> work within their boundaries. I don't understand how applications can work >> without having per-zone knowledge. An application would have to know about >> zones and their writeable capacity. To decide where and how data is written, >> an application must manage writing across zones, specific offline zones, and >> (currently) its writeable capacity. I.e., knowledge about zones and holes is >> required for writing to zoned devices and isn't eliminated by removing the PO2 >> zone size requirement. >> >> Supporting offlines zones is optional in the ZNS spec? We are not considering >> supporting this in the host. This will be handled by the device for exactly >> maintaining the SW stack simpler. > >It isn't optional. The spec allows any zones to go to Read Only or Offline state at any point in time. A specific implementation might give some guarantees to when such transitions happens, but it must nevertheless must be managed by the host software. > >Given that, and the need to not issue writes that spans zones, an application would have to aware of such behaviors. The information to make those decisions are in a zone's attributes, and thus applications would pull those, it would also know the writeable capability of a zone. So, all in all, creating support for NPO2 is something that takes a lot of work, but might have little to no impact on the overall software design. Thanks for the clarification. I can attest that we are giving the guarantee to simplify the host stack. I believe we are making many assumptions in Linux too to simplify ZNS support. This said, I understand your point. I am not developing application support. I will refer again to Bo's response on the use case on where holes are problematic. > >> > >> >For years, the PO2 requirement has been known in the Linux community and >> by the ZNS SSD vendors. Some SSD implementors have chosen not to support >> PO2 zone sizes, which is a perfectly valid decision. But its implementors >> knowingly did that while knowing that the Linux kernel didn't support it. >> > >> >I want to turn the argument around to see it from the kernel developer's point >> of view. They have communicated the PO2 requirement clearly, there's good >> precedence working with PO2 zone sizes, and at last, holes can't be avoided >> and are part of the overall design of zoned storage devices. So why should the >> kernel developer's take on the long-term maintenance burden of NPO2 zone >> sizes? >> >> You have a good point, and that is the question we need to help answer. >> As I see it, requirements evolve and the kernel changes with it as long as there >> are active upstream users for it. > >True. There's also active users for SSDs which are custom (e.g., larger than 4KiB writes required) - but they aren't supported by the Linux kernel and isn't actively being worked on to my knowledge. Which is fine, as the customers anyway uses this in their own way, and don't need the Linux kernel support. Ask things become stable some might choose to push support for certain features in the Kernel. In this case, the changes are not big in the block layer. I believe it is a process and the features should be chosen to maximize benefit and minimize maintenance cost. > >> >> The main constraint for (1) PO2 is removed in the block layer, we have (2) Linux hosts >> stating that unmapped LBAs are a problem, and we have (3) HW supporting >> size=capacity. 
>> >> I would be happy to hear what else you would like to see for this to be of use to >> the kernel community. > >(Added numbers to your paragraph above) > >1. The sysfs chunksize attribute was "misused" to also represent zone size. What has changed is that RAID controllers now can use a NPO2 chunk size. This wasn't meant to naturally extend to zones, which as shown in the current posted patchset, is a lot more work. True. But this was the main constraint for PO2. >2. Bo mentioned that the software already manages holes. It took a bit of time to get right, but now it works. Thus, the software in question is already capable of working with holes. Thus, fixing this, would present itself as a minor optimization overall. I'm not convinced the work to do this in the kernel is proportional to the change it'll make to the applications. I will let Bo respond to this himself. >3. I'm happy to hear that. However, I'll like to reiterate the point that the PO2 requirement have been known for years. That there's a drive doing NPO2 zones is great, but a decision was made by the SSD implementors to not support the Linux kernel given its current implementation. Zoned devices have been supported for years with SMR, and this is a strong argument. However, ZNS is still very new and customers have several requirements. I do not believe that a HDD stack should have such an impact on NVMe. Also, we will see new interfaces adding support for zoned devices in the future. We should think about the future and not the past. > >All that said - if there are people willing to do the work and it doesn't have a negative impact on performance, code quality, maintenance complexity, etc. then there isn't anything saying support can't be added - but it does seem like it’s a lot of work, for little overall benefits to applications and the host users. Exactly. Patches in the block layer are trivial. This is running in production loads without issues. I have tried to highlight the benefits in previous emails and I believe you understand them. Support for ZoneFS seems easy too. We have an early POC for btrfs and it seems it can be done. We sign up for these two. As for F2FS and dm-zoned, I do not think these are targets at the moment. If this is the path we follow, these will bail out at mkfs time. If we can agree on the above, I believe we can start with the code that enables the existing customers and build support for btrfs and ZoneFS in the next few months. What do you think?
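As an aside for readers unfamiliar with the phrase, "bail out at mkfs time" simply means that a userspace tool which has not been converted would refuse an NPO2 device up front rather than misbehave later. A hypothetical sketch of such a check (the function names and message are invented for illustration, not taken from any actual mkfs tool):

```c
/* Hypothetical "bail out at mkfs time" check: a tool that has not been
 * converted simply refuses NPO2 zone sizes up front. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static bool is_power_of_2(uint64_t v)
{
	return v && !(v & (v - 1));
}

static void check_zone_size_or_die(uint64_t zone_sectors)
{
	if (!is_power_of_2(zone_sectors)) {
		fprintf(stderr,
			"zone size of %llu sectors is not a power of 2, not supported\n",
			(unsigned long long)zone_sectors);
		exit(EXIT_FAILURE);
	}
}

int main(void)
{
	check_zone_size_or_die(1 << 19);	/* 256 MiB zone in 512B sectors: accepted */
	check_zone_size_or_die(1077 * 2048);	/* 1077 MiB zone: rejected */
	return 0;
}
```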
> > > >All that said - if there are people willing to do the work and it doesn't have a > negative impact on performance, code quality, maintenance complexity, etc. > then there isn't anything saying support can't be added - but it does seem like > it’s a lot of work, for little overall benefits to applications and the host users. > > Exactly. > > Patches in the block layer are trivial. This is running in production loads without > issues. I have tried to highlight the benefits in previous benefits and I believe > you understand them. > > Support for ZoneFS seems easy too. We have an early POC for btrfs and it > seems it can be done. We sign up for these 2. > > As for F2FS and dm-zoned, I do not think these are targets at the moment. If > this is the path we follow, these will bail out at mkfs time. > > If we can agree on the above, I believe we can start with the code that enables > the existing customers and build support for butrfs and ZoneFS in the next few > months. > > What do you think? I would suggest to do it in a single shot, i.e., a single patchset, which enables all the internal users in the kernel (including f2fs and others). That way end-users do not have to worry about the difference of PO2/NPO2 zones and it'll help reduce the burden on long-term maintenance.
On 15.03.2022 13:14, Matias Bjørling wrote: >> > >> >All that said - if there are people willing to do the work and it doesn't have a >> negative impact on performance, code quality, maintenance complexity, etc. >> then there isn't anything saying support can't be added - but it does seem like >> it’s a lot of work, for little overall benefits to applications and the host users. >> >> Exactly. >> >> Patches in the block layer are trivial. This is running in production loads without >> issues. I have tried to highlight the benefits in previous benefits and I believe >> you understand them. >> >> Support for ZoneFS seems easy too. We have an early POC for btrfs and it >> seems it can be done. We sign up for these 2. >> >> As for F2FS and dm-zoned, I do not think these are targets at the moment. If >> this is the path we follow, these will bail out at mkfs time. >> >> If we can agree on the above, I believe we can start with the code that enables >> the existing customers and build support for butrfs and ZoneFS in the next few >> months. >> >> What do you think? > >I would suggest to do it in a single shot, i.e., a single patchset, which enables all the internal users in the kernel (including f2fs and others). That way end-users do not have to worry about the difference of PO2/NPO2 zones and it'll help reduce the burden on long-term maintenance. Thanks for the suggestion Matias. Happy to see that you are open to supporting this. I understand why a patch series fixing everything at once is attractive, but we do not see a use case for ZNS in F2FS, as it is a mobile file system. As other interfaces arrive, this work will become natural. ZoneFS and btrfs are good targets for ZNS and these we can do. I would still do the work in phases to make sure we have enough early feedback from the community. Since this thread has been very active, I will wait some time for Christoph and others to catch up before we start sending code.
On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote: > but we do not see a usage for ZNS in F2FS, as it is a mobile > file-system. As other interfaces arrive, this work will become natural. > > ZoneFS and butrfs are good targets for ZNS and these we can do. I would > still do the work in phases to make sure we have enough early feedback > from the community. > > Since this thread has been very active, I will wait some time for > Christoph and others to catch up before we start sending code. Can someone summarize where we stand? Between the lack of quoting from hell and overly long lines from corporate mail clients I've mostly stopped reading this thread because it takes too much effort to actually extract the information.
> -----Original Message----- > From: Javier González <javier@javigon.com> > Sent: Tuesday, 15 March 2022 14.26 > To: Matias Bjørling <Matias.Bjorling@wdc.com> > Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>; Christoph > Hellwig <hch@lst.de>; Luis Chamberlain <mcgrof@kernel.org>; Keith Busch > <kbusch@kernel.org>; Pankaj Raghav <p.raghav@samsung.com>; Adam > Manzanares <a.manzanares@samsung.com>; jiangbo.365@bytedance.com; > kanchan Joshi <joshi.k@samsung.com>; Jens Axboe <axboe@kernel.dk>; Sagi > Grimberg <sagi@grimberg.me>; Pankaj Raghav <pankydev8@gmail.com>; > Kanchan Joshi <joshiiitr@gmail.com>; linux-block@vger.kernel.org; linux- > nvme@lists.infradead.org > Subject: Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices > > On 15.03.2022 13:14, Matias Bjørling wrote: > >> > > >> >All that said - if there are people willing to do the work and it > >> >doesn't have a > >> negative impact on performance, code quality, maintenance complexity, > etc. > >> then there isn't anything saying support can't be added - but it does > >> seem like it’s a lot of work, for little overall benefits to applications and the > host users. > >> > >> Exactly. > >> > >> Patches in the block layer are trivial. This is running in production > >> loads without issues. I have tried to highlight the benefits in > >> previous benefits and I believe you understand them. > >> > >> Support for ZoneFS seems easy too. We have an early POC for btrfs and > >> it seems it can be done. We sign up for these 2. > >> > >> As for F2FS and dm-zoned, I do not think these are targets at the > >> moment. If this is the path we follow, these will bail out at mkfs time. > >> > >> If we can agree on the above, I believe we can start with the code > >> that enables the existing customers and build support for butrfs and > >> ZoneFS in the next few months. > >> > >> What do you think? > > > >I would suggest to do it in a single shot, i.e., a single patchset, which enables > all the internal users in the kernel (including f2fs and others). That way end- > users do not have to worry about the difference of PO2/NPO2 zones and it'll > help reduce the burden on long-term maintenance. > > Thanks for the suggestion Matias. Happy to see that you are open to support > this. I understand why a patchseries fixing all is attracgive, but we do not see a > usage for ZNS in F2FS, as it is a mobile file-system. As other interfaces arrive, > this work will become natural. We've seen uptake on ZNS on f2fs, so I would argue that its important to have support in as well. > > ZoneFS and butrfs are good targets for ZNS and these we can do. I would still do > the work in phases to make sure we have enough early feedback from the > community. Sure, continuous review is good. But not having support for all the kernel users creates fragmentation. Doing a full switch is greatly preferred, as it avoids this fragmentation, but will also lower the overall maintenance burden, which also was raised as a concern.
On 15.03.2022 14:30, Christoph Hellwig wrote: >On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote: >> but we do not see a usage for ZNS in F2FS, as it is a mobile >> file-system. As other interfaces arrive, this work will become natural. >> >> ZoneFS and butrfs are good targets for ZNS and these we can do. I would >> still do the work in phases to make sure we have enough early feedback >> from the community. >> >> Since this thread has been very active, I will wait some time for >> Christoph and others to catch up before we start sending code. > >Can someone summarize where we stand? Between the lack of quoting >from hell and overly long lines from corporate mail clients I've >mostly stopped reading this thread because it takes too much effort >actually extract the information. Let me give it a try: - PO2 emulation in NVMe is a no-go. Drop this. - The arguments against supporting NPO2 are: - It makes ZNS depart from the SMR assumption of PO2 zone sizes. This can create confusion for users of both SMR and ZNS - Existing applications assume PO2 zone sizes, and probably do optimizations for these. These applications, if they want to use ZNS, will have to change their calculations - There is a fear of performance regressions. - It adds more work for you and other maintainers - The arguments in favour of NPO2 are: - Unmapped LBAs create holes that applications need to deal with. This affects mapping and performance due to splits. Bo explained this in a thread from Bytedance's perspective. I explained in an answer to Matias how we are not letting zones transition to offline in order to simplify the host stack. Not sure if this is something we want to bring to NVMe. - As ZNS adds more features and other protocols add support for zoned devices we will have more use-cases for the zoned block device. We will have to deal with this fragmentation at some point. - This is used in production workloads in Linux hosts. I would advocate for this not being off-tree as it will be a headache for all in the future. - If you agree that removing the PO2 constraint is an option, we can do the following: - Remove the constraint in the block layer and add ZoneFS support in a first patch. - Add btrfs support in a later patch - Make changes to tools once merged Hope I have collected all points of view in such a short format.
> -----Original Message----- > From: Javier González <javier@javigon.com> > Sent: Tuesday, 15 March 2022 14.53 > To: Christoph Hellwig <hch@lst.de> > Cc: Matias Bjørling <Matias.Bjorling@wdc.com>; Damien Le Moal > <damien.lemoal@opensource.wdc.com>; Luis Chamberlain > <mcgrof@kernel.org>; Keith Busch <kbusch@kernel.org>; Pankaj Raghav > <p.raghav@samsung.com>; Adam Manzanares > <a.manzanares@samsung.com>; jiangbo.365@bytedance.com; kanchan Joshi > <joshi.k@samsung.com>; Jens Axboe <axboe@kernel.dk>; Sagi Grimberg > <sagi@grimberg.me>; Pankaj Raghav <pankydev8@gmail.com>; Kanchan Joshi > <joshiiitr@gmail.com>; linux-block@vger.kernel.org; linux- > nvme@lists.infradead.org > Subject: Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices > > On 15.03.2022 14:30, Christoph Hellwig wrote: > >On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote: > >> but we do not see a usage for ZNS in F2FS, as it is a mobile > >> file-system. As other interfaces arrive, this work will become natural. > >> > >> ZoneFS and butrfs are good targets for ZNS and these we can do. I > >> would still do the work in phases to make sure we have enough early > >> feedback from the community. > >> > >> Since this thread has been very active, I will wait some time for > >> Christoph and others to catch up before we start sending code. > > > >Can someone summarize where we stand? Between the lack of quoting from > >hell and overly long lines from corporate mail clients I've mostly > >stopped reading this thread because it takes too much effort actually > >extract the information. > > Let me give it a try: > > - PO2 emulation in NVMe is a no-go. Drop this. > > - The arguments against supporting PO2 are: > - It makes ZNS depart from a SMR assumption of PO2 zone sizes. This > can create confusion for users of both SMR and ZNS > > - Existing applications assume PO2 zone sizes, and probably do > optimizations for these. These applications, if wanting to use > ZNS will have to change the calculations > > - There is a fear for performance regressions. > > - It adds more work to you and other maintainers > > - The arguments in favour of PO2 are: > - Unmapped LBAs create holes that applications need to deal with. > This affects mapping and performance due to splits. Bo explained > this in a thread from Bytedance's perspective. I explained in an > answer to Matias how we are not letting zones transition to > offline in order to simplify the host stack. Not sure if this is > something we want to bring to NVMe. > > - As ZNS adds more features and other protocols add support for > zoned devices we will have more use-cases for the zoned block > device. We will have to deal with these fragmentation at some > point. > > - This is used in production workloads in Linux hosts. I would > advocate for this not being off-tree as it will be a headache for > all in the future. > > - If you agree that removing PO2 is an option, we can do the following: > - Remove the constraint in the block layer and add ZoneFS support > in a first patch. > > - Add btrfs support in a later patch > > - Make changes to tools once merged > > Hope I have collected all points of view in such a short format. + Suggestion to enable all users in the kernel to limit fragmentation and maintainer burden. + Possible not a big issue as users already have added the necessary support and users already must manage offline zones and avoid writing across zones. 
+ Re: Bo's email, it sounds like this only affects a single vendor which knowingly made the decision to do NPO2 zone sizes. From Bo: "(What we discussed here has a precondition that is, we cannot determine if the SSD provider could change the FW to make it PO2 or not)".
On 15/03/2022 14:52, Javier González wrote: > On 15.03.2022 14:30, Christoph Hellwig wrote: >> On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote: >>> but we do not see a usage for ZNS in F2FS, as it is a mobile >>> file-system. As other interfaces arrive, this work will become natural. >>> >>> ZoneFS and butrfs are good targets for ZNS and these we can do. I would >>> still do the work in phases to make sure we have enough early feedback >>> from the community. >>> >>> Since this thread has been very active, I will wait some time for >>> Christoph and others to catch up before we start sending code. >> >> Can someone summarize where we stand? Between the lack of quoting >>from hell and overly long lines from corporate mail clients I've >> mostly stopped reading this thread because it takes too much effort >> actually extract the information. > > Let me give it a try: > > - PO2 emulation in NVMe is a no-go. Drop this. > > - The arguments against supporting PO2 are: > - It makes ZNS depart from a SMR assumption of PO2 zone sizes. This > can create confusion for users of both SMR and ZNS > > - Existing applications assume PO2 zone sizes, and probably do > optimizations for these. These applications, if wanting to use > ZNS will have to change the calculations > > - There is a fear for performance regressions. > > - It adds more work to you and other maintainers > > - The arguments in favour of PO2 are: > - Unmapped LBAs create holes that applications need to deal with. > This affects mapping and performance due to splits. Bo explained > this in a thread from Bytedance's perspective. I explained in an > answer to Matias how we are not letting zones transition to > offline in order to simplify the host stack. Not sure if this is > something we want to bring to NVMe. > > - As ZNS adds more features and other protocols add support for > zoned devices we will have more use-cases for the zoned block > device. We will have to deal with these fragmentation at some > point. > > - This is used in production workloads in Linux hosts. I would > advocate for this not being off-tree as it will be a headache for > all in the future. > > - If you agree that removing PO2 is an option, we can do the following: > - Remove the constraint in the block layer and add ZoneFS support > in a first patch. > > - Add btrfs support in a later patch (+ linux-btrfs ) Please also make sure to support btrfs and not only throw some patches over the fence. Zoned device support in btrfs is complex enough and has quite some special casing vs regular btrfs, which we're working on getting rid of. So having non-power-of-2 zone size, would also mean having NPO2 block-groups (and thus block-groups not aligned to the stripe size). Just thinking of this and knowing I need to support it gives me a headache. Also please consult the rest of the btrfs developers for thoughts on this. After all btrfs has full zoned support (including ZNS, not saying it's perfect) and is also the default FS for at least two Linux distributions. Thanks a lot, Johannes
On Tue, Mar 15, 2022 at 02:14:23PM +0000, Johannes Thumshirn wrote: > On 15/03/2022 14:52, Javier González wrote: > > On 15.03.2022 14:30, Christoph Hellwig wrote: > >> On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote: > >>> but we do not see a usage for ZNS in F2FS, as it is a mobile > >>> file-system. As other interfaces arrive, this work will become natural. > >>> > >>> ZoneFS and butrfs are good targets for ZNS and these we can do. I would > >>> still do the work in phases to make sure we have enough early feedback > >>> from the community. > >>> > >>> Since this thread has been very active, I will wait some time for > >>> Christoph and others to catch up before we start sending code. > >> > >> Can someone summarize where we stand? Between the lack of quoting > >> from hell and overly long lines from corporate mail clients I've > >> mostly stopped reading this thread because it takes too much effort > >> actually extract the information. > > > > Let me give it a try: > > > > - PO2 emulation in NVMe is a no-go. Drop this. > > > > - The arguments against supporting PO2 are: > > - It makes ZNS depart from a SMR assumption of PO2 zone sizes. This > > can create confusion for users of both SMR and ZNS > > > > - Existing applications assume PO2 zone sizes, and probably do > > optimizations for these. These applications, if wanting to use > > ZNS will have to change the calculations > > > > - There is a fear for performance regressions. > > > > - It adds more work to you and other maintainers > > > > - The arguments in favour of PO2 are: > > - Unmapped LBAs create holes that applications need to deal with. > > This affects mapping and performance due to splits. Bo explained > > this in a thread from Bytedance's perspective. I explained in an > > answer to Matias how we are not letting zones transition to > > offline in order to simplify the host stack. Not sure if this is > > something we want to bring to NVMe. > > > > - As ZNS adds more features and other protocols add support for > > zoned devices we will have more use-cases for the zoned block > > device. We will have to deal with these fragmentation at some > > point. > > > > - This is used in production workloads in Linux hosts. I would > > advocate for this not being off-tree as it will be a headache for > > all in the future. > > > > - If you agree that removing PO2 is an option, we can do the following: > > - Remove the constraint in the block layer and add ZoneFS support > > in a first patch. > > > > - Add btrfs support in a later patch > > (+ linux-btrfs ) > > Please also make sure to support btrfs and not only throw some patches > over the fence. Zoned device support in btrfs is complex enough and has > quite some special casing vs regular btrfs, which we're working on getting > rid of. So having non-power-of-2 zone size, would also mean having NPO2 > block-groups (and thus block-groups not aligned to the stripe size). > > Just thinking of this and knowing I need to support it gives me a > headache. PO2 is really easy to work with and I guess allocation on the physical device could also benefit from that, I'm still puzzled why the NPO2 is even proposed. We can possibly hide the calculations behind some API so I hope in the end it should be bearable. The size of block groups is flexible we only want some reasonable alignment. > Also please consult the rest of the btrfs developers for thoughts on this. 
> After all btrfs has full zoned support (including ZNS, not saying it's > perfect) and is also the default FS for at least two Linux distributions. I haven't read the whole thread yet, but my impression is that some hardware is deliberately breaking existing assumptions about zoned devices and in turn breaking btrfs support. I hope I'm wrong on that or at least that it's possible to work around it.
On 15.03.2022 14:14, Johannes Thumshirn wrote: >On 15/03/2022 14:52, Javier González wrote: >> On 15.03.2022 14:30, Christoph Hellwig wrote: >>> On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote: >>>> but we do not see a usage for ZNS in F2FS, as it is a mobile >>>> file-system. As other interfaces arrive, this work will become natural. >>>> >>>> ZoneFS and butrfs are good targets for ZNS and these we can do. I would >>>> still do the work in phases to make sure we have enough early feedback >>>> from the community. >>>> >>>> Since this thread has been very active, I will wait some time for >>>> Christoph and others to catch up before we start sending code. >>> >>> Can someone summarize where we stand? Between the lack of quoting >>>from hell and overly long lines from corporate mail clients I've >>> mostly stopped reading this thread because it takes too much effort >>> actually extract the information. >> >> Let me give it a try: >> >> - PO2 emulation in NVMe is a no-go. Drop this. >> >> - The arguments against supporting PO2 are: >> - It makes ZNS depart from a SMR assumption of PO2 zone sizes. This >> can create confusion for users of both SMR and ZNS >> >> - Existing applications assume PO2 zone sizes, and probably do >> optimizations for these. These applications, if wanting to use >> ZNS will have to change the calculations >> >> - There is a fear for performance regressions. >> >> - It adds more work to you and other maintainers >> >> - The arguments in favour of PO2 are: >> - Unmapped LBAs create holes that applications need to deal with. >> This affects mapping and performance due to splits. Bo explained >> this in a thread from Bytedance's perspective. I explained in an >> answer to Matias how we are not letting zones transition to >> offline in order to simplify the host stack. Not sure if this is >> something we want to bring to NVMe. >> >> - As ZNS adds more features and other protocols add support for >> zoned devices we will have more use-cases for the zoned block >> device. We will have to deal with these fragmentation at some >> point. >> >> - This is used in production workloads in Linux hosts. I would >> advocate for this not being off-tree as it will be a headache for >> all in the future. >> >> - If you agree that removing PO2 is an option, we can do the following: >> - Remove the constraint in the block layer and add ZoneFS support >> in a first patch. >> >> - Add btrfs support in a later patch > >(+ linux-btrfs ) > >Please also make sure to support btrfs and not only throw some patches >over the fence. Zoned device support in btrfs is complex enough and has >quite some special casing vs regular btrfs, which we're working on getting >rid of. So having non-power-of-2 zone size, would also mean having NPO2 >block-groups (and thus block-groups not aligned to the stripe size). Thanks for mentioning this Johannes. If we say we will work with you in supporting btrfs properly, we will. I believe you have seen already a couple of patches fixing things for zone support in btrfs in the last weeks. > >Just thinking of this and knowing I need to support it gives me a >headache. I hope we have help you with that. butrfs has no alignment to PO2 natively, so I am confident we can find a good solution. > >Also please consult the rest of the btrfs developers for thoughts on this. >After all btrfs has full zoned support (including ZNS, not saying it's >perfect) and is also the default FS for at least two Linux distributions. Of course. 
We will work with you and other btrfs developers. Luis is helping make sure that we have good tests for linux-next. This is in part how we have found the problems with Append, which should be fixed now. > >Thanks a lot, > Johannes
On Tue, Mar 15, 2022 at 02:30:52PM +0100, Christoph Hellwig wrote: > On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote: > > but we do not see a usage for ZNS in F2FS, as it is a mobile > > file-system. As other interfaces arrive, this work will become natural. > > > > ZoneFS and butrfs are good targets for ZNS and these we can do. I would > > still do the work in phases to make sure we have enough early feedback > > from the community. > > > > Since this thread has been very active, I will wait some time for > > Christoph and others to catch up before we start sending code. > > Can someone summarize where we stand? RFCs should be posted to help review and evaluate direct NPO2 support (not emulation) given we have no vendor willing to take a position that NPO2 will *never* be supported on ZNS, and it's not clear yet how many vendors other than Samsung actually require NPO2 support. The other reason is that existing NPO2 customers currently bake hacks into Linux to support NPO2, and so a fragmentation already exists. To help address this it's best to evaluate what the world of NPO2 support would look like, put in the effort to do the work for that, and review it. Luis
Hi Johannes, On 2022-03-15 15:14, Johannes Thumshirn wrote: > Please also make sure to support btrfs and not only throw some patches > over the fence. Zoned device support in btrfs is complex enough and has > quite some special casing vs regular btrfs, which we're working on getting > rid of. So having non-power-of-2 zone size, would also mean having NPO2 I already made a simple btrfs NPO2 PoC, and it mostly involved changing the PO2 calculations to generic ones. I understand that changing the calculations from using logs & shifts to division will incur some performance penalty, but I think we can wrap them with helpers to minimize that impact. > So having non-power-of-2 zone size, would also mean having NPO2 > block-groups (and thus block-groups not aligned to the stripe size). > I agree with your point that we risk not aligning to the stripe size, which I believe has a minimum of 64K (please correct me if I am wrong), when we move to an NPO2 zone size. As David Sterba mentioned in his email, we could agree on some reasonable alignment, which I believe would be the minimum stripe size of 64k, to avoid added complexity to the existing btrfs zoned support. And it is a much milder constraint, which most devices can naturally adhere to, compared to the PO2 zone size requirement. > Just thinking of this and knowing I need to support it gives me a > headache. > This is definitely not some one-off patch that we want to get upstream and then disappear. As Javier already pointed out, we would be more than happy to help you out here. > Also please consult the rest of the btrfs developers for thoughts on this. > After all btrfs has full zoned support (including ZNS, not saying it's > perfect) and is also the default FS for at least two Linux distributions. > > Thanks a lot, > Johannes
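A rough sketch of the kind of helpers described above, assuming nothing from the actual PoC (the structure and function names here are invented): the PO2 fast path keeps the shift/mask arithmetic used today, and NPO2 devices take a div/mod fallback.

```c
/* Sketch of zone-geometry helpers: keep the shift/mask fast path for PO2
 * zone sizes, fall back to div/mod only for NPO2 devices. */
#include <stdbool.h>
#include <stdint.h>

struct zone_geometry {
	uint64_t zone_sectors;		/* zone size in 512B sectors */
	unsigned int zone_shift;	/* log2(zone_sectors), valid only if pow2 */
	bool pow2;
};

/* Index of the zone containing a given sector. */
static inline uint64_t zone_index(const struct zone_geometry *zg, uint64_t sector)
{
	if (zg->pow2)
		return sector >> zg->zone_shift;	/* what is done today */
	return sector / zg->zone_sectors;		/* NPO2 fallback */
}

/* Offset of a sector within its zone. */
static inline uint64_t zone_offset(const struct zone_geometry *zg, uint64_t sector)
{
	if (zg->pow2)
		return sector & (zg->zone_sectors - 1);
	return sector % zg->zone_sectors;
}

/* Start sector of the zone that contains the given sector. */
static inline uint64_t zone_start(const struct zone_geometry *zg, uint64_t sector)
{
	return sector - zone_offset(zg, sector);
}
```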
Hi David, On 2022-03-15 15:27, David Sterba wrote: > > PO2 is really easy to work with and I guess allocation on the physical > device could also benefit from that, I'm still puzzled why the NPO2 is > even proposed. > Quick recap: NAND hardware cannot naturally align to PO2 zone sizes, which led to having both a zone capacity and a zone size, where zone capacity is the actual storage available in a zone. The main proposal is to remove the PO2 constraint to get rid of these LBA holes (generally speaking). That is why this whole effort was started. > We can possibly hide the calculations behind some API so I hope in the > end it should be bearable. The size of block groups is flexible we only > want some reasonable alignment. > I agree. I already replied to Johannes on what it might look like. To reiterate, the reasonable alignment I had in mind while doing a PoC for btrfs with NPO2 zone sizes is the minimum stripe size required by btrfs (64K), to reduce the impact of this change on the zoned support in btrfs. > I haven't read the whole thread yet, my impression is that some hardware > is deliberately breaking existing assumptions about zoned devices and in > turn breaking btrfs support. I hope I'm wrong on that or at least that > it's possible to work around it. Based on the PoC we did internally, it is definitely possible to support it in btrfs. And making this change will not break the existing btrfs support for zoned devices. A naive approach to making this change will have some performance impact, as we will be changing the PO2 calculations from logs & shifts to divisions and multiplications. I definitely think we can optimize it to minimize the impact on existing deployments.
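To make the two quantities in this exchange concrete, the short example below (an illustration with made-up numbers, not data from any specific drive) computes the unmapped hole a PO2 device exposes today and applies the looser 64K-alignment check that would replace the power-of-2 test; the 64K constant comes from the discussion above rather than from a btrfs header.

```c
/* Illustration only: the per-zone "hole" that exists today
 * (zone size - zone capacity) and the proposed 64K stripe-alignment
 * constraint that would replace the power-of-2 check. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SECTOR_SIZE	512ULL
#define MIN_STRIPE_LEN	(64ULL * 1024)	/* 64K, per the discussion above */

int main(void)
{
	/* Example geometry: a PO2 zone size with a smaller zone capacity. */
	uint64_t zone_size_sectors = 1ULL << 19;	/* 256 MiB */
	uint64_t zone_cap_sectors  = 437 * 1024;	/* ~218 MiB usable */

	uint64_t hole = zone_size_sectors - zone_cap_sectors;
	printf("unmapped hole per zone: %llu sectors (%llu MiB)\n",
	       (unsigned long long)hole,
	       (unsigned long long)(hole * SECTOR_SIZE >> 20));

	/* With zone size == zone capacity, the remaining check would be
	 * stripe alignment rather than a power-of-2 test. */
	bool ok = (zone_cap_sectors * SECTOR_SIZE) % MIN_STRIPE_LEN == 0;
	printf("64K-aligned NPO2 zone size acceptable: %s\n", ok ? "yes" : "no");
	return 0;
}
```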
On 3/15/22 22:05, Javier González wrote: >>> The main constraint for (1) PO2 is removed in the block layer, we >>> have (2) Linux hosts stating that unmapped LBAs are a problem, >>> and we have (3) HW supporting size=capacity. >>> >>> I would be happy to hear what else you would like to see for this >>> to be of use to the kernel community. >> >> (Added numbers to your paragraph above) >> >> 1. The sysfs chunksize attribute was "misused" to also represent >> zone size. What has changed is that RAID controllers now can use a >> NPO2 chunk size. This wasn't meant to naturally extend to zones, >> which as shown in the current posted patchset, is a lot more work. > > True. But this was the main constraint for PO2. And as I said, users asked for it. >> 2. Bo mentioned that the software already manages holes. It took a >> bit of time to get right, but now it works. Thus, the software in >> question is already capable of working with holes. Thus, fixing >> this, would present itself as a minor optimization overall. I'm not >> convinced the work to do this in the kernel is proportional to the >> change it'll make to the applications. > > I will let Bo response himself to this. > >> 3. I'm happy to hear that. However, I'll like to reiterate the >> point that the PO2 requirement have been known for years. That >> there's a drive doing NPO2 zones is great, but a decision was made >> by the SSD implementors to not support the Linux kernel given its >> current implementation. > > Zone devices has been supported for years in SMR, and I this is a > strong argument. However, ZNS is still very new and customers have > several requirements. I do not believe that a HDD stack should have > such an impact in NVMe. > > Also, we will see new interfaces adding support for zoned devices in > the future. > > We should think about the future and not the past. Backward compatibility ? We must not break userspace... >> >> All that said - if there are people willing to do the work and it >> doesn't have a negative impact on performance, code quality, >> maintenance complexity, etc. then there isn't anything saying >> support can't be added - but it does seem like it’s a lot of work, >> for little overall benefits to applications and the host users. > > Exactly. > > Patches in the block layer are trivial. This is running in > production loads without issues. I have tried to highlight the > benefits in previous benefits and I believe you understand them. The block layer is not the issue here. We all understand that one is easy. > Support for ZoneFS seems easy too. We have an early POC for btrfs and > it seems it can be done. We sign up for these 2. zonefs can trivially support non power of 2 zone sizes, but as zonefs creates a discrete view of the device capacity with its one file per zone interface, an application's accesses to a zone are forcibly limited to that zone, as they should be. With zonefs, pow2 and nonpow2 devices will show the *same* interface to the application. A non power of 2 zone size then has absolutely no benefit at all. > As for F2FS and dm-zoned, I do not think these are targets at the > moment. If this is the path we follow, these will bail out at mkfs > time. And what makes you think that this is acceptable ? What guarantees do you have that this will not be a problem for users out there ?
On 3/16/22 02:00, Luis Chamberlain wrote: > On Tue, Mar 15, 2022 at 02:30:52PM +0100, Christoph Hellwig wrote: >> On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote: >>> but we do not see a usage for ZNS in F2FS, as it is a mobile >>> file-system. As other interfaces arrive, this work will become natural. >>> >>> ZoneFS and butrfs are good targets for ZNS and these we can do. I would >>> still do the work in phases to make sure we have enough early feedback >>> from the community. >>> >>> Since this thread has been very active, I will wait some time for >>> Christoph and others to catch up before we start sending code. >> >> Can someone summarize where we stand? > > RFCs should be posted to help review and evaluate direct NPO2 support > (not emulation) given we have no vendor willing to take a position that > NPO2 will *never* be supported on ZNS, and its not clear yet how many > vendors other than Samsung actually require NPO2 support. The other > reason is existing NPO2 customers currently cake in hacks to Linux to > supoport NPO2 support, and so a fragmentation already exists. To help > address this it's best to evaluate what the world of NPO2 support would > look like and put the effort to do the work for that and review that. And again no mentions of all the applications supporting zones assuming a power of 2 zone size that will break. Seriously. Please stop considering the kernel only. If this were only about the kernel, we would all be working on patches already. Allowing non power of 2 zone size may prevent applications running today to run properly on these non power of 2 zone size devices. *not* nice. I have yet to see any convincing argument proving that this is not an issue.
On Wed, Mar 16, 2022 at 09:07:18AM +0900, Damien Le Moal wrote: > On 3/16/22 02:00, Luis Chamberlain wrote: > > On Tue, Mar 15, 2022 at 02:30:52PM +0100, Christoph Hellwig wrote: > >> On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote: > >>> but we do not see a usage for ZNS in F2FS, as it is a mobile > >>> file-system. As other interfaces arrive, this work will become natural. > >>> > >>> ZoneFS and butrfs are good targets for ZNS and these we can do. I would > >>> still do the work in phases to make sure we have enough early feedback > >>> from the community. > >>> > >>> Since this thread has been very active, I will wait some time for > >>> Christoph and others to catch up before we start sending code. > >> > >> Can someone summarize where we stand? > > > > RFCs should be posted to help review and evaluate direct NPO2 support > > (not emulation) given we have no vendor willing to take a position that > > NPO2 will *never* be supported on ZNS, and its not clear yet how many > > vendors other than Samsung actually require NPO2 support. The other > > reason is existing NPO2 customers currently cake in hacks to Linux to > > supoport NPO2 support, and so a fragmentation already exists. To help > > address this it's best to evaluate what the world of NPO2 support would > > look like and put the effort to do the work for that and review that. > > And again no mentions of all the applications supporting zones assuming > a power of 2 zone size that will break. What applications? ZNS does not incur a PO2 requirement. So I really want to know what applications make this assumption and would break if, all of a sudden, NPO2 is supported. Why would that break those ZNS applications? > Allowing non power of 2 zone size may prevent applications running today > to run properly on these non power of 2 zone size devices. *not* nice. Applications which want to support ZNS have to take into consideration that NPO2 is possible and there are existing users of that world today. You cannot negate their existence. > I have yet to see any convincing argument proving that this is not an issue. You are just saying things can break but not clarifying exactly what. And you have not taken a position to say WD will not ever support NPO2 on ZNS. And so, you can't negate the prospect of that implied path for support as a possibility, even if it means work towards the ecosystem today. Luis
On 3/16/22 09:23, Luis Chamberlain wrote: > On Wed, Mar 16, 2022 at 09:07:18AM +0900, Damien Le Moal wrote: >> On 3/16/22 02:00, Luis Chamberlain wrote: >>> On Tue, Mar 15, 2022 at 02:30:52PM +0100, Christoph Hellwig wrote: >>>> On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote: >>>>> but we do not see a usage for ZNS in F2FS, as it is a mobile >>>>> file-system. As other interfaces arrive, this work will become natural. >>>>> >>>>> ZoneFS and butrfs are good targets for ZNS and these we can do. I would >>>>> still do the work in phases to make sure we have enough early feedback >>>>> from the community. >>>>> >>>>> Since this thread has been very active, I will wait some time for >>>>> Christoph and others to catch up before we start sending code. >>>> >>>> Can someone summarize where we stand? >>> >>> RFCs should be posted to help review and evaluate direct NPO2 support >>> (not emulation) given we have no vendor willing to take a position that >>> NPO2 will *never* be supported on ZNS, and its not clear yet how many >>> vendors other than Samsung actually require NPO2 support. The other >>> reason is existing NPO2 customers currently cake in hacks to Linux to >>> supoport NPO2 support, and so a fragmentation already exists. To help >>> address this it's best to evaluate what the world of NPO2 support would >>> look like and put the effort to do the work for that and review that. >> >> And again no mentions of all the applications supporting zones assuming >> a power of 2 zone size that will break. > > What applications? ZNS does not incur a PO2 requirement. So I really > want to know what applications make this assumption and would break > because all of a sudden say NPO2 is supported. Exactly. What applications ? For ZNS, I cannot say as devices have not been available for long. But neither can you. > Why would that break those ZNS applications? Please keep in mind that there are power of 2 zone sized ZNS devices out there. Applications designed for these devices and optimized to do bit shift arithmetic using the power of 2 size property will break. What the plan for that case ? How will you address these users complaints ? >> Allowing non power of 2 zone size may prevent applications running today >> to run properly on these non power of 2 zone size devices. *not* nice. > > Applications which want to support ZNS have to take into consideration > that NPO2 is posisble and there existing users of that world today. Which is really an ugly approach. The kernel zone user interface is common to all zoned devices: SMR, ZNS, null_blk, DM (dm-crypt, dm-linear). They all have one point in common: zone size is a power of 2. Zone capacity may differ, but hey, we also unified that by reporting a zone capacity for *ALL* of them. Applications correctly designed for SMR can thus also run on ZNS too. With this in mind, the spectrum of applications that would break on non power of 2 ZNS devices is suddenly much larger. This has always been my concern from the start: allowing non power of 2 zone size fragments userspace support and has the potential to complicate things for application developers. > > You cannot negate their existance. > >> I have yet to see any convincing argument proving that this is not an issue. > > You are just saying things can break but not clarifying exactly what. > And you have not taken a position to say WD will not ever support NPO2 > on ZNS. 
And so, you can't negate the prospect of that implied path for > support as a possibility, even if it means work towards the ecosystem > today. Please do not bring corporate strategy aspects into this discussion. This is a technical discussion and I am not talking as a representative of my employer, nor should we ever discuss business plans on a public mailing list. I am a kernel developer and maintainer. Keep it technical please.
On Wed, Mar 16, 2022 at 09:46:44AM +0900, Damien Le Moal wrote: > On 3/16/22 09:23, Luis Chamberlain wrote: > > What applications? ZNS does not incur a PO2 requirement. So I really > > want to know what applications make this assumption and would break > > because all of a sudden say NPO2 is supported. > > Exactly. What applications ? For ZNS, I cannot say as devices have not > been available for long. But neither can you. I can tell you there is an existing NPO2 ZNS customer who chimed in on the discussion and described having to carry a delta to support NPO2 ZNS. So if you cannot tell me of a ZNS application which is going to break if NPO2 support is added, then your original point suggesting that there would be a break is not valid. > > Why would that break those ZNS applications? > > Please keep in mind that there are power of 2 zone sized ZNS devices out > there. No one is saying otherwise. > Applications designed for these devices and optimized to do bit > shift arithmetic using the power of 2 size property will break. They must not be ZNS. So they can continue to chug on. > What the > plan for that case ? How will you address these users complaints ? They are not ZNS so they don't have to worry about ZNS. ZNS applications must be aware of the fact that NPO2 can exist. ZNS applications must be aware of the fact that any vendor may one day sell NPO2 devices. > >> Allowing non power of 2 zone size may prevent applications running today > >> to run properly on these non power of 2 zone size devices. *not* nice. > >> > >> Applications which want to support ZNS have to take into consideration > >> that NPO2 is posisble and there existing users of that world today. > > > > Which is really an ugly approach. Ugly is relative and subjective. NAND does not force PO2. > > The kernel <etc> And back you go to kernel talk. I thought you wanted to focus on applications. > > Applications correctly designed for SMR can thus also run on ZNS too. That seems to be an incorrect assumption given ZNS drives exist with NPO2. So you can probably say that some SMR applications can work with PO2 ZNS drives. That is a more correct statement. > > With this in mind, the spectrum of applications that would break on non > > power of 2 ZNS devices is suddenly much larger. We already determined you cannot identify any ZNS specific application which would break. SMR != ZNS. If you really want to use SMR applications for ZNS that seems to be a bit beyond the scope of this discussion, but it seems to me that those SMR applications should simply learn that if a device is ZNS, NPO2 can be expected. As technologies evolve so do specifications. > This has always been my concern from the start: allowing non power of 2 > zone size fragments userspace support and has the potential to > complicate things for application developers. It's a reality though. Devices exist, and so do users. And they're carrying their own delta to support NPO2 ZNS today on Linux. > > You cannot negate their existance. > > > >> I have yet to see any convincing argument proving that this is not an issue. > > > > You are just saying things can break but not clarifying exactly what. > > And you have not taken a position to say WD will not ever support NPO2 > > on ZNS.
> This is a technical discussion and I am not talking as a representative > of my employer nor should we ever discuss business plans on a public > mailing list. I am a kernel developer and maintainer. Keep it technical > please. This conversation is about the reality that NPO2 ZNS devices exist and how best to support them. You seem to want to negate that reality and its support on Linux without even considering what the changes needed to support NPO2 ZNS would look like. As a maintainer I think we need to *evaluate* supporting users as best as possible. Not deny their existence. Even if it pains us. Luis
On 3/16/22 10:24, Luis Chamberlain wrote: > On Wed, Mar 16, 2022 at 09:46:44AM +0900, Damien Le Moal wrote: >> On 3/16/22 09:23, Luis Chamberlain wrote: >>> What applications? ZNS does not incur a PO2 requirement. So I really >>> want to know what applications make this assumption and would break >>> because all of a sudden say NPO2 is supported. >> >> Exactly. What applications ? For ZNS, I cannot say as devices have not >> been available for long. But neither can you. > > I can tell you we there is an existing NPO2 ZNS customer which chimed on > the discussion and they described having to carry a delta to support > NPO2 ZNS. So if you cannot tell me of a ZNS application which is going to > break to add NPO2 support then your original point is not valid of > suggesting that there would be a break. > >>> Why would that break those ZNS applications? >> >> Please keep in mind that there are power of 2 zone sized ZNS devices out >> there. > > No one is saying otherwise. > >> Applications designed for these devices and optimized to do bit >> shift arithmetic using the power of 2 size property will break. > > They must not be ZNS. So they can continue to chug on. > >> What the >> plan for that case ? How will you address these users complaints ? > > They are not ZNS so they don't have to worry about ZNS. > > ZNS applications must be aware of that fact that NPO2 can exist. > ZNS applications must be aware of that fact that any vendor may one day > sell NPO2 devices. > >>>> Allowing non power of 2 zone size may prevent applications running today >>>> to run properly on these non power of 2 zone size devices. *not* nice. >>> >>> Applications which want to support ZNS have to take into consideration >>> that NPO2 is posisble and there existing users of that world today. >> >> Which is really an ugly approach. > > Ugly is relative and subjective. NAND does not force PO2. > >> The kernel > > <etc> And back you go to kernel talk. I thought you wanted to > focus on applications. > >> Applications correctly designed for SMR can thus also run on ZNS too. > > That seems to be an incorrect assumption given ZNS drives exist > with NPO2. So you can probably say that some SMR applications can work > with PO2 ZNS drives. That is a more correct statement. > >> With this in mind, the spectrum of applications that would break on non >> power of 2 ZNS devices is suddenly much larger. > > We already determined you cannot identify any ZNS specific application > which would break. > > SMR != ZNS Not for the block layer nor for any in-kernel users above it today. We should not drive toward differentiating device types but unify them under a common interface that works for everything, including applications. That is why we have zone append emulation in the scsi disk driver. Considering the zone size requirement problem in the context of ZNS only is thus far from ideal in my opinion, to say the least.
On Wed, Mar 16, 2022 at 10:44:56AM +0900, Damien Le Moal wrote: > On 3/16/22 10:24, Luis Chamberlain wrote: > > SMR != ZNS > > Not for the block layer nor for any in-kernel <etc> Back to kernel talk; I thought you wanted to focus on applications. > Considering the zone size requirement problem in the context of ZNS only > is thus far from ideal in my opinion, to say the least. It's the reality for ZNS though. Luis
Luis, > Applications which want to support ZNS have to take into consideration > that NPO2 is possible and there are existing users of that world today. Every time a new technology comes along, vendors inevitably introduce first gen devices that are implemented with little consideration for the OS stacks they need to work with. This has happened for pretty much every technology I have been involved with over the years. So the fact that NPO2 devices exist is no argument. There are tons of devices out there that Linux does not support and never will. In early engagements SSD drive vendors proposed all sorts of weird NPO2 block sizes and alignments that, it was argued, were *incontestable* requirements for building NAND devices. And yet a generation or two later every SSD transparently handled 512-byte or 4096-byte logical blocks just fine. Imagine if we had re-engineered the entire I/O stack to accommodate these awful designs? Similarly, many proponents suggested oddball NPO2 sizes for SMR zones. And yet the market very quickly settled on PO2 once things started shipping in volume. Simplicity and long term maintainability of the kernel should always take precedence as far as I'm concerned.
On Tue, Mar 15, 2022 at 10:27:32PM -0400, Martin K. Petersen wrote: > Simplicity and long term maintainability of the kernel should always > take precedence as far as I'm concerned. No one is arguing against that. It is not even clear what all the changes are. So to argue that the sky will fall seems a bit too early without seeing patches, don't you think? Luis
On 15/03/2022 19:51, Pankaj Raghav wrote: >> ck-groups (and thus block-groups not aligned to the stripe size). >> > I agree with your point that we risk not aligning to the stripe size when we > move to npo2 zone sizes, where I believe the minimum stripe size is 64K (please > correct me if I am wrong). As David Sterba mentioned in his email, we > could agree on some reasonable alignment, which I believe would be the > minimum stripe size of 64k, to avoid added complexity to the existing > btrfs zoned support. And it is a much milder constraint, one that most > devices can naturally adhere to, compared to the po2 zone size requirement. > What could be done is rounding a zone down to the nearest po2 (which is then 64k aligned), but then we need to explicitly finish the zones.
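A rough sketch of what "rounding a zone down" could look like, illustrative only and not btrfs code: the usable portion of each zone becomes the largest power of 2 not exceeding the device-reported zone size (any such value of 64K or more is automatically 64K-aligned), and the remainder is given up by explicitly finishing the zone once the usable part is full.

```c
/* Illustrative sketch, not btrfs code: pick the usable zone size as the
 * largest power of 2 that fits in the device-reported zone size. */
#include <stdint.h>

static uint64_t rounddown_pow_of_2(uint64_t v)
{
	uint64_t p = 1;

	while (p <= v / 2)
		p <<= 1;
	return p;	/* largest power of 2 <= v, assuming v > 0 */
}

/*
 * Example: a device zone size of 1077 MiB would be used as 1024 MiB;
 * once the usable part is full, the zone is explicitly finished and the
 * remaining 53 MiB are never written.
 */
static uint64_t usable_zone_size(uint64_t dev_zone_size)
{
	return rounddown_pow_of_2(dev_zone_size);
}
```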
On 15.03.2022 22:27, Martin K. Petersen wrote: [...] >Simplicity and long term maintainability of the kernel should always >take precedence as far as I'm concerned. Martin, you are absolutely right. The argument is not that there is available HW. The argument is that as we tried to retrofit ZNS into the zoned block device, the gap between zone size and capacity has brought adoption issues for some customers. I would still like to wait and give it some time to get feedback on the plan I proposed yesterday before we post patches. At this point, I would very much like to hear your opinion on how the changes would incur a maintainability problem. Nobody wants that.
On 16.03.2022 09:00, Damien Le Moal wrote: >On 3/15/22 22:05, Javier González wrote: >>>> The main constraint for (1) PO2 is removed in the block layer, we >>>> have (2) Linux hosts stating that unmapped LBAs are a problem, >>>> and we have (3) HW supporting size=capacity. >>>> >>>> I would be happy to hear what else you would like to see for this >>>> to be of use to the kernel community. >>> >>> (Added numbers to your paragraph above) >>> >>> 1. The sysfs chunksize attribute was "misused" to also represent >>> zone size. What has changed is that RAID controllers can now use a >>> NPO2 chunk size. This wasn't meant to naturally extend to zones, >>> which, as shown in the currently posted patchset, is a lot more work. >> >> True. But this was the main constraint for PO2. >And as I said, users asked for it. Now users are asking for arbitrary zone sizes. [...] >>> 3. I'm happy to hear that. However, I'd like to reiterate the >>> point that the PO2 requirement has been known for years. That >>> there's a drive doing NPO2 zones is great, but a decision was made >>> by the SSD implementors to not support the Linux kernel given its >>> current implementation. >> >> Zoned devices have been supported for years in SMR, and this is a >> strong argument. However, ZNS is still very new and customers have >> several requirements. I do not believe that an HDD stack should have >> such an impact on NVMe. >> >> Also, we will see new interfaces adding support for zoned devices in >> the future. >> >> We should think about the future and not the past. >Backward compatibility ? We must not break userspace... This is not a user API change. If making changes to applications to adopt new features and technologies is breaking user-space, then the zoned block device already broke that when we introduced zone capacity. Any existing zoned application _will have to_ make changes to work on ZNS anyway.
Hi Damien, On 2022-03-16 01:00, Damien Le Moal wrote: >> As for F2FS and dm-zoned, I do not think these are targets at the >> moment. If this is the path we follow, these will bail out at mkfs >> time. > And what makes you think that this is acceptable ? What guarantees do > you have that this will not be a problem for users out there ? As you know, the architecture of F2FS at the moment requires PO2 segments; therefore, it might not be possible to support non-PO2 ZNS drives. So we could continue supporting PO2 ZNS drives for F2FS and bail out at mkfs time if it is a non-PO2 ZNS drive (this is the current behavior as well). This way we do not break anything for the ZNS drives that have already been deployed for F2FS users.
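As a rough illustration of the "bail out at mkfs time" behavior described above, the check amounts to refusing the format rather than producing a broken filesystem. This is only a sketch with hypothetical helper names, not actual f2fs-tools code:

```c
/* Sketch only; helper names and message are hypothetical, not f2fs-tools code. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool is_power_of_2(uint64_t v)
{
	return v && (v & (v - 1)) == 0;
}

static int check_zone_size(uint64_t zone_size_sectors)
{
	if (!is_power_of_2(zone_size_sectors)) {
		fprintf(stderr,
			"zone size %llu sectors is not a power of 2: unsupported\n",
			(unsigned long long)zone_size_sectors);
		return -1;	/* refuse to format the device */
	}
	return 0;
}
```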
On 3/11/2022 1:51 PM, Keith Busch wrote: > On Fri, Mar 11, 2022 at 12:19:38PM -0800, Luis Chamberlain wrote: >> NAND has no PO2 requirement. The emulation effort was only done to help >> add support for !PO2 devices because there is no alternative. If we >> however are ready instead to go down the avenue of removing those >> restrictions well let's go there then instead. If that's not even >> something we are willing to consider I'd really like folks who stand >> behind the PO2 requirement to stick their necks out and clearly say that >> their hw/fw teams are happy to deal with this requirement forever on ZNS. > > Regardless of the merits of the current OS requirement, it's a trivial > matter for firmware to round up their reported zone size to the next > power of 2. This does not create a significant burden on their part, as > far as I know. Sure wonder why !PO2 keeps coming up if it's so trivial to fix in firmware as you claim. I actually find the hubris of the Linux community wrt the whole PO2 requirement pretty exhausting. Consider that some SSD manufacturers are having to contend with a NAND shortage and existing ASIC architecture limitations that may define the sizes of their erase blocks and write units. A !PO2 implementation in the Linux kernel would enable consumers to choose from more options in the marketplace for their Linux ZNS application.
On Mon, Mar 21, 2022 at 10:21:36AM -0600, Jonathan Derrick wrote: [...] > Sure wonder why !PO2 keeps coming up if it's so trivial to fix in firmware as you claim. The triviality of adjusting alignment in firmware has nothing to do with some users' desire to not see gaps in LBA space. > I actually find the hubris of the Linux community wrt the whole PO2 requirement > pretty exhausting. > > Consider that some SSD manufacturers are having to contend with a NAND shortage and > existing ASIC architecture limitations that may define the sizes of their erase blocks > and write units. A !PO2 implementation in the Linux kernel would enable consumers > to choose from more options in the marketplace for their Linux ZNS application. All zoned block devices in the Linux kernel use a common abstraction interface. Users expect that you can swap out one zoned device for another and all their previously used features will continue to work. That does not necessarily hold if the long existing zone alignment requirement is relaxed. Fragmenting use cases harms adoption, so this discussion seems appropriate.
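To make the trade-off in this sub-thread concrete: if firmware rounds its reported zone size up to the next power of 2 while the writable zone capacity stays at its native value, every zone carries an unmapped gap, which is exactly the "holes in LBA space" objection raised earlier in the thread. A small illustrative calculation follows; the 1077 MiB capacity is an example number only, not taken from any real device.

```c
/* Illustrative only; the zone capacity below is a made-up example. */
#include <stdint.h>
#include <stdio.h>

static uint64_t roundup_pow_of_2(uint64_t v)
{
	uint64_t p = 1;

	while (p < v)
		p <<= 1;
	return p;	/* smallest power of 2 >= v, assuming v > 0 */
}

int main(void)
{
	uint64_t zone_cap  = 1077ULL * 2048;             /* writable capacity in 512B sectors */
	uint64_t zone_size = roundup_pow_of_2(zone_cap); /* what a PO2-rounding firmware reports */

	printf("cap=%llu size=%llu hole=%llu sectors per zone\n",
	       (unsigned long long)zone_cap,
	       (unsigned long long)zone_size,
	       (unsigned long long)(zone_size - zone_cap));
	return 0;
}
```

With size=capacity, the NPO2 proposal, the gap disappears; with the round-up approach, the host must instead learn to skip the gap between zone capacity and zone size.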