Message ID | 20220308165349.231320-1-p.raghav@samsung.com (mailing list archive) |
---|---|
Series | power_of_2 emulation support for NVMe ZNS devices |
This is completely bonkers. IFF we have a good reason to support non power of two zone sizes (and I'd like to see evidence for that) we'll need to go through all the layers to support it. But doing this emulation is just idiotic and will add tons of code just to completely confuse users.

On Tue, Mar 08, 2022 at 05:53:43PM +0100, Pankaj Raghav wrote:
>
> #Motivation:
> There are currently ZNS drives that are produced and deployed that do
> not have power_of_2(PO2) zone size. The NVMe spec for ZNS does not
> specify the PO2 requirement but the linux block layer currently checks
> for zoned devices to have power_of_2 zone sizes.

Well, apparently whoever produces these drives never cared about supporting Linux, as the power of two requirement goes back to SMR HDDs, which also don't have that requirement in the spec (and even allow non-uniform zone sizes), but Linux decided that we want this for sanity.

Do these drives even support Zone Append?
On 2022-03-10 10:47, Christoph Hellwig wrote:
> This is completely bonkers. IFF we have a good reason to support non
> power of two zone sizes (and I'd like to see evidence for that) we'll

Non power of 2 support is important to the users, which is why we started this effort. I have also CCed Bo from Bytedance based on their request.

> need to go through all the layers to support it. But doing this emulation
> is just idiotic and will add tons of code just to completely confuse users.

I agree with your point about creating non power of 2 support through all the layers, but this is the first step. One of the early pieces of feedback that we got from Damien was to not break the existing kernel and userspace applications that are written with the po2 assumption.

The following are the steps we have in the pipeline:
- Remove the constraint in the block layer
- Start migrating kernel applications such as btrfs so that they also work on non power of 2 devices.

Of course, we want to post RFCs for the steps mentioned above so that there can be a public discussion about the issues.

> Well, apparently whoever produces these drives never cared about supporting
> Linux, as the power of two requirement goes back to SMR HDDs, which also
> don't have that requirement in the spec (and even allow non-uniform zone
> sizes), but Linux decided that we want this for sanity.
>
> Do these drives even support Zone Append?

Yes, these drives are intended for Linux users that would use the zoned block device. Append is supported, but holes in the LBA space (due to the difference between zone capacity and zone size) are still a problem for these users.
> Yes, these drives are intended for Linux users that would use the zoned
> block device. Append is supported, but holes in the LBA space (due to the
> difference between zone capacity and zone size) are still a problem for these users.

With respect to the specific users, what does it break specifically? What key features are they missing when there are holes?
> On 10 Mar 2022, at 14.07, Matias Bjørling <matias.bjorling@wdc.com> wrote:
>
>> Yes, these drives are intended for Linux users that would use the zoned
>> block device. Append is supported, but holes in the LBA space (due to the
>> difference between zone capacity and zone size) are still a problem for these users.
>
> With respect to the specific users, what does it break specifically? What key features are they missing when there are holes?

What we hear is that it breaks existing mapping in applications, where the address space is seen as contiguous; with holes, they need to account for the unmapped space. This affects performance and CPU due to unnecessary splits. This is for both reads and writes.

For more details, I guess they will have to jump in and share the parts that they consider proper to share on the mailing list.

I guess we will have more conversations around this as we push the block layer changes after this series.
On Thu, Mar 10, 2022 at 01:57:58PM +0100, Pankaj Raghav wrote:
> Yes, these drives are intended for Linux users that would use the zoned
> block device. Append is supported, but holes in the LBA space (due to the
> difference between zone capacity and zone size) are still a problem for these users.

I'd really like to hear from the users. Because really, either they should use a proper file system abstraction (including zonefs if that is all they need), or raw nvme passthrough, which will already work for this case. But adding a whole bunch of crap because people want to use the block device special file for something it is not designed for just does not make any sense.
>>> Yes, these drives are intended for Linux users that would use the
>>> zoned block device. Append is supported, but holes in the LBA space
>>> (due to the difference between zone capacity and zone size) are still a problem for these users.
>>
>> With respect to the specific users, what does it break specifically? What key features are they missing when there are holes?
>
> What we hear is that it breaks existing mapping in applications, where the
> address space is seen as contiguous; with holes, they need to account for the
> unmapped space. This affects performance and CPU due to unnecessary
> splits. This is for both reads and writes.
>
> For more details, I guess they will have to jump in and share the parts that
> they consider proper to share on the mailing list.
>
> I guess we will have more conversations around this as we push the block
> layer changes after this series.

Ok, so I hear that one issue is I/O splits. If I assume that reads are sequential, with zone cap/size between 100MiB and 1GiB, then my gut feeling would tell me it's less CPU intensive to split every 100MiB to 1GiB of reads than it would be to not have power of 2 zones, due to the extra per-I/O calculations.

Do I have a faulty assumption about the above, or is there more to it?
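The "extra per-I/O calculations" at issue come down to how a sector is turned into a zone index and an in-zone offset. A minimal sketch of the two cases (hypothetical helpers, not code from this series or the kernel):

```c
#include <stdint.h>

/*
 * Illustrative helpers for the per-I/O arithmetic being discussed.
 * With a power-of-2 zone size, the zone index and in-zone offset reduce
 * to a shift and a mask; with an arbitrary zone size they become a
 * 64-bit division and modulo on every I/O.
 */
static inline uint64_t zone_index_po2(uint64_t sector, unsigned int zone_bits)
{
	return sector >> zone_bits;              /* zone_size == 1 << zone_bits */
}

static inline uint64_t zone_offset_po2(uint64_t sector, unsigned int zone_bits)
{
	return sector & ((1ULL << zone_bits) - 1);
}

static inline uint64_t zone_index_npo2(uint64_t sector, uint64_t zone_sectors)
{
	return sector / zone_sectors;            /* division on the hot path */
}

static inline uint64_t zone_offset_npo2(uint64_t sector, uint64_t zone_sectors)
{
	return sector % zone_sectors;
}
```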
On Thu, Mar 10, 2022 at 02:58:07PM +0000, Matias Bjørling wrote: > >> Yes, these drives are intended for Linux users that would use the > > >> zoned block device. Append is supported but holes in the LBA space > > >> (due to diff in zone cap and zone size) is still a problem for these users. > > > > > > With respect to the specific users, what does it break specifically? What are > > key features are they missing when there's holes? > > > > What we hear is that it breaks existing mapping in applications, where the > > address space is seen as contiguous; with holes it needs to account for the > > unmapped space. This affects performance and and CPU due to unnecessary > > splits. This is for both reads and writes. > > > > For more details, I guess they will have to jump in and share the parts that > > they consider is proper to share in the mailing list. > > > > I guess we will have more conversations around this as we push the block > > layer changes after this series. > > Ok, so I hear that one issue is I/O splits - If I assume that reads > are sequential, zone cap/size between 100MiB and 1GiB, then my gut > feeling would tell me its less CPU intensive to split every 100MiB to > 1GiB of reads, than it would be to not have power of 2 zones due to > the extra per io calculations. Don't you need to split anyway when spanning two zones to avoid the zone boundary error? Maybe this is a silly idea, but it would be a trivial device-mapper to remap the gaps out of the lba range.
On 10.03.2022 14:58, Matias Bjørling wrote: > >> Yes, these drives are intended for Linux users that would use the >> >> zoned block device. Append is supported but holes in the LBA space >> >> (due to diff in zone cap and zone size) is still a problem for these users. >> > >> > With respect to the specific users, what does it break specifically? What are >> key features are they missing when there's holes? >> >> What we hear is that it breaks existing mapping in applications, where the >> address space is seen as contiguous; with holes it needs to account for the >> unmapped space. This affects performance and and CPU due to unnecessary >> splits. This is for both reads and writes. >> >> For more details, I guess they will have to jump in and share the parts that >> they consider is proper to share in the mailing list. >> >> I guess we will have more conversations around this as we push the block >> layer changes after this series. > >Ok, so I hear that one issue is I/O splits - If I assume that reads are sequential, zone cap/size between 100MiB and 1GiB, then my gut feeling would tell me its less CPU intensive to split every 100MiB to 1GiB of reads, than it would be to not have power of 2 zones due to the extra per io calculations. > >Do I have a faulty assumption about the above, or is there more to it? I do not have numbers on the number of splits. I can only say that it is an issue. Then the whole management is apparently also costing some DRAM for extra mapping, instead of simply doing +1. The goal for these customers is not having the emulation, so the cost of the !PO2 path would be 0. For the existing applications that require a PO2, we have the emulation. In this case, the cost will only be paid on the devices that implement !PO2 zones. Hope this answer the question.
On 10.03.2022 07:07, Keith Busch wrote: >On Thu, Mar 10, 2022 at 02:58:07PM +0000, Matias Bjørling wrote: >> >> Yes, these drives are intended for Linux users that would use the >> > >> zoned block device. Append is supported but holes in the LBA space >> > >> (due to diff in zone cap and zone size) is still a problem for these users. >> > > >> > > With respect to the specific users, what does it break specifically? What are >> > key features are they missing when there's holes? >> > >> > What we hear is that it breaks existing mapping in applications, where the >> > address space is seen as contiguous; with holes it needs to account for the >> > unmapped space. This affects performance and and CPU due to unnecessary >> > splits. This is for both reads and writes. >> > >> > For more details, I guess they will have to jump in and share the parts that >> > they consider is proper to share in the mailing list. >> > >> > I guess we will have more conversations around this as we push the block >> > layer changes after this series. >> >> Ok, so I hear that one issue is I/O splits - If I assume that reads >> are sequential, zone cap/size between 100MiB and 1GiB, then my gut >> feeling would tell me its less CPU intensive to split every 100MiB to >> 1GiB of reads, than it would be to not have power of 2 zones due to >> the extra per io calculations. > >Don't you need to split anyway when spanning two zones to avoid the zone >boundary error? If you have size = capacity then you can do a cross-zone read. This is only a problem when we have gaps. >Maybe this is a silly idea, but it would be a trivial device-mapper >to remap the gaps out of the lba range. One thing we have considered is that as we remove the PO2 constraint from the block layer is that devices exposing PO2 zone sizes are able to do the emulation the other way around to support things like this. A device mapper is also a fine place to put this, but it seems like a very simple task. Is it worth all the boilerplate code for the device mapper only for this?
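The distinction Javier draws between size == capacity and gapped zones can be illustrated with a rough sketch of the check a host ends up doing for reads; the struct and helper below are hypothetical, not from the series:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only. A read that crosses the end of the usable capacity of
 * the zone it starts in runs into the unmapped gap when zone_cap < zone_size,
 * so it has to be split or clamped there. When zone_cap == zone_size there is
 * no gap: crossing that boundary is just an ordinary cross-zone read.
 */
struct zone_geom {
	uint64_t zone_size;	/* distance between zone starts, in sectors */
	uint64_t zone_cap;	/* usable sectors at the start of each zone */
};

static bool read_crosses_cap(const struct zone_geom *g,
			     uint64_t sector, uint64_t nr_sectors)
{
	uint64_t in_zone = sector % g->zone_size;

	return in_zone + nr_sectors > g->zone_cap;
}
```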
On Thu, Mar 10, 2022 at 10:47:25AM +0100, Christoph Hellwig wrote: > This is complete bonkers. IFF we have a good reason to support non > power of two zones size (and I'd like to see evidence for that) we'll > need to go through all the layers to support it. But doing this emulation > is just idiotic and will at tons of code just to completely confuse users. > > On Tue, Mar 08, 2022 at 05:53:43PM +0100, Pankaj Raghav wrote: > > > > #Motivation: > > There are currently ZNS drives that are produced and deployed that do > > not have power_of_2(PO2) zone size. The NVMe spec for ZNS does not > > specify the PO2 requirement but the linux block layer currently checks > > for zoned devices to have power_of_2 zone sizes. > > Well, apparently whoever produces these drives never cared about supporting > Linux as the power of two requirement goes back to SMR HDDs, which also > don't have that requirement in the spec (and even allow non-uniform zone > size), but Linux decided that we want this for sanity. Non uniform zone size definitely seems like a mess. Fixed zone sizes that are non po2 doesn't seem insane to me given that chunk sectors is no longer assumed to be po2. We have looked at removing po2 and the only hot path optimization for po2 is for appends. > > Do these drives even support Zone Append? Should it matter if the drives support append? SMR drives do not support append and they are considered zone block devices. Append seems to be an optimization for users that want higher concurrency per zone. One can also build concurrency by leveraging multiple zones simultaneously as well.
On 3/11/22 00:16, Javier González wrote: > On 10.03.2022 07:07, Keith Busch wrote: >> On Thu, Mar 10, 2022 at 02:58:07PM +0000, Matias Bjørling wrote: >>> >> Yes, these drives are intended for Linux users that would use the >>>>>> zoned block device. Append is supported but holes in the LBA space >>>>>> (due to diff in zone cap and zone size) is still a problem for these users. >>>>> >>>>> With respect to the specific users, what does it break specifically? What are >>>> key features are they missing when there's holes? >>>> >>>> What we hear is that it breaks existing mapping in applications, where the >>>> address space is seen as contiguous; with holes it needs to account for the >>>> unmapped space. This affects performance and and CPU due to unnecessary >>>> splits. This is for both reads and writes. >>>> >>>> For more details, I guess they will have to jump in and share the parts that >>>> they consider is proper to share in the mailing list. >>>> >>>> I guess we will have more conversations around this as we push the block >>>> layer changes after this series. >>> >>> Ok, so I hear that one issue is I/O splits - If I assume that reads >>> are sequential, zone cap/size between 100MiB and 1GiB, then my gut >>> feeling would tell me its less CPU intensive to split every 100MiB to >>> 1GiB of reads, than it would be to not have power of 2 zones due to >>> the extra per io calculations. >> >> Don't you need to split anyway when spanning two zones to avoid the zone >> boundary error? > > If you have size = capacity then you can do a cross-zone read. This is > only a problem when we have gaps. > >> Maybe this is a silly idea, but it would be a trivial device-mapper >> to remap the gaps out of the lba range. > > One thing we have considered is that as we remove the PO2 constraint > from the block layer is that devices exposing PO2 zone sizes are able to > do the emulation the other way around to support things like this. > > A device mapper is also a fine place to put this, but it seems like a > very simple task. Is it worth all the boilerplate code for the device > mapper only for this? Boiler plate ? DM already support zoned devices. Writing a "dm-unhole" target would be extremely simple as it would essentially be a variation of dm-linear. There should be no DM core changes needed.
On Thu, Mar 10, 2022 at 03:44:49PM +0100, Christoph Hellwig wrote:
> On Thu, Mar 10, 2022 at 01:57:58PM +0100, Pankaj Raghav wrote:
> > Yes, these drives are intended for Linux users that would use the zoned
> > block device. Append is supported, but holes in the LBA space (due to the
> > difference between zone capacity and zone size) are still a problem for these users.
>
> I'd really like to hear from the users. Because really, either they
> should use a proper file system abstraction (including zonefs if that is
> all they need),

That requires access to at least the block device, and without PO2 emulation that is not possible. Using zonefs is not possible today for !PO2 devices.

> or raw nvme passthrough, which will already work for this
> case.

This effort is not upstream yet; however, once and if it does land upstream, it means something other than zonefs must be used, since !PO2 devices are not supported by zonefs. So although the goal of zonefs was to provide a unified interface for raw access for applications, the PO2 requirement will essentially create fragmentation.

> But adding a whole bunch of crap because people want to use the
> block device special file for something it is not designed for just
> does not make any sense.

Using Linux requires PO2. And so, at Damien's request, the logical thing to do was to keep that requirement and avoid any performance regressions. That "crap" was done to slowly pave the way toward later removing the PO2 requirement.

I think we'll all acknowledge that doing emulation just means adding more software for something that is not a NAND requirement, but a requirement imposed by the inheritance of zoned software designed for SMR HDDs. I think we may also all acknowledge now that keeping this emulation code *forever* seems like complete insanity.

Since the PO2 requirement imposed on Linux today seems to be sending us down a dubious effort we'd need to support, let me then try to get the folks who have been saying that we must keep this requirement to answer the following question:

Are you 100% sure your ZNS hardware and firmware teams will always be happy that you have baked a PO2 requirement for ZNS drives into Linux, and are you ready to deal with those consequences on Linux forever? Really?

NAND has no PO2 requirement. The emulation effort was only done to help add support for !PO2 devices because there is no alternative. If we are instead ready to go down the avenue of removing those restrictions, well, let's go there instead. If that's not even something we are willing to consider, I'd really like the folks who stand behind the PO2 requirement to stick their necks out and clearly say that their hw/fw teams are happy to deal with this requirement forever on ZNS.

From what I am seeing this is a legacy requirement which we should be able to remove. Keeping the requirement will only do harm to ZNS adoption on Linux and it will also create *more* fragmentation.

Luis
On Fri, Mar 11, 2022 at 12:19:38PM -0800, Luis Chamberlain wrote: > NAND has no PO2 requirement. The emulation effort was only done to help > add support for !PO2 devices because there is no alternative. If we > however are ready instead to go down the avenue of removing those > restrictions well let's go there then instead. If that's not even > something we are willing to consider I'd really like folks who stand > behind the PO2 requirement to stick their necks out and clearly say that > their hw/fw teams are happy to deal with this requirement forever on ZNS. Regardless of the merits of the current OS requirement, it's a trivial matter for firmware to round up their reported zone size to the next power of 2. This does not create a significant burden on their part, as far as I know. And po2 does not even seem to be the real problem here. The holes seem to be what's causing a concern, which you have even without po2 zones. I'm starting to like the previous idea of creating an unholey device-mapper for such users...
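The "unholey" device-mapper idea boils down to exposing only the capacity portion of each zone as a contiguous LBA space and remapping each target sector around the per-zone holes. A sketch of that remapping math, assuming a fixed zone size and capacity (this is not an existing DM target, just the arithmetic such a target would do):

```c
#include <stdint.h>

/*
 * Hypothetical remapping for an "unholey" target: the target exposes a
 * contiguous LBA space made of zone capacities only, and maps each target
 * sector back to the underlying device, skipping the gap at the end of
 * every zone.
 */
struct unhole_map {
	uint64_t dev_zone_size;	/* underlying device: sectors between zone starts */
	uint64_t dev_zone_cap;	/* underlying device: usable sectors per zone     */
};

static uint64_t unhole_to_dev_sector(const struct unhole_map *m,
				     uint64_t target_sector)
{
	uint64_t zone   = target_sector / m->dev_zone_cap;
	uint64_t offset = target_sector % m->dev_zone_cap;

	return zone * m->dev_zone_size + offset;
}
```

Note that the resulting target would itself have a non-power-of-2 zone size equal to the zone capacity, which is why later in the thread Damien points out that such a target still needs block layer support for non-power-of-2 zone sizes.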
On Fri, Mar 11, 2022 at 12:51:35PM -0800, Keith Busch wrote: > On Fri, Mar 11, 2022 at 12:19:38PM -0800, Luis Chamberlain wrote: > > NAND has no PO2 requirement. The emulation effort was only done to help > > add support for !PO2 devices because there is no alternative. If we > > however are ready instead to go down the avenue of removing those > > restrictions well let's go there then instead. If that's not even > > something we are willing to consider I'd really like folks who stand > > behind the PO2 requirement to stick their necks out and clearly say that > > their hw/fw teams are happy to deal with this requirement forever on ZNS. > > Regardless of the merits of the current OS requirement, it's a trivial > matter for firmware to round up their reported zone size to the next > power of 2. This does not create a significant burden on their part, as > far as I know. Sure sure.. fw can do crap like that too... > And po2 does not even seem to be the real problem here. The holes seem > to be what's causing a concern, which you have even without po2 zones. Exactly. > I'm starting to like the previous idea of creating an unholey > device-mapper for such users... Won't that restrict nvme with chunk size crap. For instance later if we want much larger block sizes. Luis
On Fri, Mar 11, 2022 at 01:04:35PM -0800, Luis Chamberlain wrote: > On Fri, Mar 11, 2022 at 12:51:35PM -0800, Keith Busch wrote: > > > I'm starting to like the previous idea of creating an unholey > > device-mapper for such users... > > Won't that restrict nvme with chunk size crap. For instance later if we > want much larger block sizes. I'm not sure I understand. The chunk_size has nothing to do with the block size. And while nvme is a user of this in some circumstances, it can't be used concurrently with ZNS because the block layer appropriates the field for the zone size.
On Fri, Mar 11, 2022 at 12:51:35PM -0800, Keith Busch wrote: > On Fri, Mar 11, 2022 at 12:19:38PM -0800, Luis Chamberlain wrote: > > NAND has no PO2 requirement. The emulation effort was only done to help > > add support for !PO2 devices because there is no alternative. If we > > however are ready instead to go down the avenue of removing those > > restrictions well let's go there then instead. If that's not even > > something we are willing to consider I'd really like folks who stand > > behind the PO2 requirement to stick their necks out and clearly say that > > their hw/fw teams are happy to deal with this requirement forever on ZNS. > > Regardless of the merits of the current OS requirement, it's a trivial > matter for firmware to round up their reported zone size to the next > power of 2. This does not create a significant burden on their part, as > far as I know. I can't comment on FW burdens but adding po2 zone size creates holes for the FW to deal with as well. > > And po2 does not even seem to be the real problem here. The holes seem > to be what's causing a concern, which you have even without po2 zones. > I'm starting to like the previous idea of creating an unholey > device-mapper for such users... I see holes as being caused by having to make zone size po2 when capacity is not po2. po2 should be tied to the holes, unless I am missing something. BTW if we go down the dm route can we start calling it dm-unholy.
On Fri, Mar 11, 2022 at 01:31:02PM -0800, Keith Busch wrote: > On Fri, Mar 11, 2022 at 01:04:35PM -0800, Luis Chamberlain wrote: > > On Fri, Mar 11, 2022 at 12:51:35PM -0800, Keith Busch wrote: > > > > > I'm starting to like the previous idea of creating an unholey > > > device-mapper for such users... > > > > Won't that restrict nvme with chunk size crap. For instance later if we > > want much larger block sizes. > > I'm not sure I understand. The chunk_size has nothing to do with the > block size. And while nvme is a user of this in some circumstances, it > can't be used concurrently with ZNS because the block layer appropriates > the field for the zone size. Many device mapper targets split I/O into chunks, see max_io_len(), wouldn't this create an overhead? Using a device mapper target also creates a divergence in strategy for ZNS. Some will use the block device, others the dm target. The goal should be to create a unified path. And all this, just because SMR. Is that worth it? Are we sure? Luis
On Fri, Mar 11, 2022 at 10:23:33PM +0000, Adam Manzanares wrote: > On Fri, Mar 11, 2022 at 12:51:35PM -0800, Keith Busch wrote: > > And po2 does not even seem to be the real problem here. The holes seem > > to be what's causing a concern, which you have even without po2 zones. > > I'm starting to like the previous idea of creating an unholey > > device-mapper for such users... > > I see holes as being caused by having to make zone size po2 when capacity is > not po2. po2 should be tied to the holes, unless I am missing something. Practically speaking, you're probably not missing anything. The spec, however, doesn't constrain the existence of holes to any particular zone size. > BTW if we go down the dm route can we start calling it dm-unholy. I was thinking "dm-evil" but unholy works too. :)
On 3/12/22 07:24, Luis Chamberlain wrote: > On Fri, Mar 11, 2022 at 01:31:02PM -0800, Keith Busch wrote: >> On Fri, Mar 11, 2022 at 01:04:35PM -0800, Luis Chamberlain wrote: >>> On Fri, Mar 11, 2022 at 12:51:35PM -0800, Keith Busch wrote: >>> >>>> I'm starting to like the previous idea of creating an unholey >>>> device-mapper for such users... >>> >>> Won't that restrict nvme with chunk size crap. For instance later if we >>> want much larger block sizes. >> >> I'm not sure I understand. The chunk_size has nothing to do with the >> block size. And while nvme is a user of this in some circumstances, it >> can't be used concurrently with ZNS because the block layer appropriates >> the field for the zone size. > > Many device mapper targets split I/O into chunks, see max_io_len(), > wouldn't this create an overhead? Apart from the bio clone, the overhead should not be higher than what the block layer already has. IOs that are too large or that are straddling zones are split by the block layer, and DM splitting leads generally to no split in the block layer for the underlying device IO. DM essentially follows the same pattern: max_io_len() depends on the target design limits, which in turn depend on the underlying device. For a dm-unhole target, the IO size limit would typically be the same as that of the underlying device. > Using a device mapper target also creates a divergence in strategy > for ZNS. Some will use the block device, others the dm target. The > goal should be to create a unified path. If we allow non power of 2 zone sized devices, the path will *never* be unified because we will get fragmentation on what can run on these devices as opposed to power of 2 sized ones. E.g. f2fs will not work for the former but will for the latter. That is really not an ideal situation. > > And all this, just because SMR. Is that worth it? Are we sure? No. This is *not* because of SMR. Never has been. The first prototype SMR drives I received in my lab 10 years ago did not have a power of 2 sized zone size because zones where naturally aligned to tracks, which like NAND erase blocks, are not necessarily power of 2 sized. And all zones were not even the same size. That was not usable. The reason for the power of 2 requirement is 2 fold: 1) At the time we added zone support for SMR, chunk_sectors had to be a power of 2 number of sectors. 2) SMR users did request power of 2 zone sizes and that all zones have the same size as that simplified software design. There was even a de-facto agreement that 256MB zone size is a good compromise between usability and overhead of zone reclaim/GC. But that particular number is for HDD due to their performance characteristics. Hence the current Linux requirements which have been serving us well so far. DM needed that chunk_sectors be changed to allow non power of 2 values. So the chunk_sectors requirement was lifted recently (can't remember which version added this). Allowing non power of 2 zone size would thus be more easily feasible now. Allowing devices with a non power of 2 zone size is not technically difficult. But... The problem being raised is all about the fact that the power of 2 zone size requirement creates a hole of unusable sectors in every zone when the device implementation has a zone capacity lower than the zone size. 
I have been arguing all along that I think this problem is a non-problem, simply because a well designed application should *always* use zones as storage containers without ever hoping that the next zone in sequence can be used as well. The application should *never* consider the entire LBA space of the device capacity without this zone split. The zone based management of capacity is necessary for any good design to deal correctly with write error recovery and active/open zone resources management. And as Keith said. there is always a "hole" anyway for any non-full zone, between the zone write pointer and the last usable sector in the zone. Reads there are nonsensical and writes can only go to one place. Now, in the spirit of trying to facilitate software development for zoned devices, we can try finding solutions to remove that hole. zonefs is a obvious solution. But back to the previous point: with one zone == one file, there is no continuity in the storage address space that the application can use. The application has to be designed to use individual files representing a zone. And with such design, an equivalent design directly using the block device file would have no difficulties due to the the sector hole between zone capacity and zone size. I have a prototype LevelDB implementation that can use both zonefs and block device file on ZNS with only a few different lines of code to prove this point. The other solution would be adding a dm-unhole target to remap sectors to remove the holes from the device address space. Such target would be easy to write, but in my opinion, this would still not change the fact that applications still have to deal with error recovery and active/open zone resources. So they still have to be zone aware and operate per zone. Furthermore, adding such DM target would create a non power of 2 zone size zoned device which will need support from the block layer. So some block layer functions will need to change. In the end, this may not be different than enabling non power of 2 zone sized devices for ZNS. And for this decision, I maintain some of my requirements: 1) The added overhead from multiplication & divisions should be acceptable and not degrade performance. Otherwise, this would be a disservice to the zone ecosystem. 2) Nothing that works today on available devices should break 3) Zone size requirements will still exist. E.g. btrfs 64K alignment requirement But even with all these properly addressed, f2fs will not work anymore, some in-kernel users will still need some zone size requirements (btrfs) and *all* applications using a zoned block device file will now have to be designed based on non power of 2 zone size so that they can work on all devices. Meaning that this is also potentially forcing changes on existing applications to use newer zoned devices that may not have a power of 2 zone size. This entire discussion is about the problem that power of 2 zone size creates (which again I think is a non-problem). However, based on the arguments above, allowing non power of 2 zone sized devices is not exactly problem free either. My answer to your last question ("Are we sure?") is thus: No. I am not sure this is a good idea. But as always, I would be happy to be proven wrong. So far, I have not seen any argument doing that.
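Damien's point that every non-full zone already has an unusable region between the write pointer and the last usable sector can be read directly off the fields a zone report returns. An illustrative user-space sketch over struct blk_zone from <linux/blkzoned.h>; the helpers are hypothetical and assume a sequential-write-required zone that is not full, read-only, or offline (the write pointer is not meaningful otherwise):

```c
#include <linux/blkzoned.h>
#include <stdint.h>

/*
 * Illustrative only: given one entry from a zone report, compute the ranges
 * an application can actually use. Reads make sense up to the write pointer,
 * writes can only go to [wp, start + capacity), and everything from
 * start + capacity to start + len is never usable (the cap/size hole).
 */
static uint64_t zone_readable_sectors(const struct blk_zone *z)
{
	return z->wp - z->start;                 /* data written so far      */
}

static uint64_t zone_writable_sectors(const struct blk_zone *z)
{
	return z->start + z->capacity - z->wp;   /* remaining writable space */
}

static uint64_t zone_hole_sectors(const struct blk_zone *z)
{
	return z->len - z->capacity;             /* 0 when size == capacity  */
}
```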
On Sat, Mar 12, 2022 at 04:58:08PM +0900, Damien Le Moal wrote:
> The reason for the power of 2 requirement is 2 fold:
> 1) At the time we added zone support for SMR, chunk_sectors had to be a
> power of 2 number of sectors.
> 2) SMR users did request power of 2 zone sizes and that all zones have
> the same size as that simplified software design. There was even a
> de-facto agreement that 256MB zone size is a good compromise between
> usability and overhead of zone reclaim/GC. But that particular number is
> for HDD due to their performance characteristics.

Also for NVMe we initially went down the road of trying to support non power of two sizes. But there was another major early host that really wanted power of two zone sizes to support hardware based hosts that can cheaply do shifts but not divisions. The variable zone capacity feature (something that Linux does not currently support) is a feature requested by NVMe members on the host and device side, and it can only be supported with the zone size / zone capacity split.

> The other solution would be adding a dm-unhole target to remap sectors
> to remove the holes from the device address space. Such target would be
> easy to write, but in my opinion, this would still not change the fact
> that applications still have to deal with error recovery and active/open
> zone resources. So they still have to be zone aware and operate per zone.

I don't think we even need a new target for it. I think you can do this with a table using multiple dm-linear sections already if you want.

> My answer to your last question ("Are we sure?") is thus: No. I am not
> sure this is a good idea. But as always, I would be happy to be proven
> wrong. So far, I have not seen any argument doing that.

Agreed. Supporting non-power of two sizes in the block layer is fairly easy, as shown by some of the patches seen in this series. Supporting them properly in the whole ecosystem is not trivial and will create a long-term burden. We could do that, but we'd rather have a really good reason for it, and right now I don't see that.
On Thu, Mar 10, 2022 at 05:38:35PM +0000, Adam Manzanares wrote: > > Do these drives even support Zone Append? > > Should it matter if the drives support append? SMR drives do not support append > and they are considered zone block devices. Append seems to be an optimization > for users that want higher concurrency per zone. One can also build concurrency > by leveraging multiple zones simultaneously as well. Not supporting it natively for SMR is a major pain. Due to hard drives being relatively slow the emulation is somewhat workable, but on SSDs the serialization would completely kill performance.
On 3/14/22 16:35, Christoph Hellwig wrote: > On Sat, Mar 12, 2022 at 04:58:08PM +0900, Damien Le Moal wrote: >> The reason for the power of 2 requirement is 2 fold: >> 1) At the time we added zone support for SMR, chunk_sectors had to be a >> power of 2 number of sectors. >> 2) SMR users did request power of 2 zone sizes and that all zones have >> the same size as that simplified software design. There was even a >> de-facto agreement that 256MB zone size is a good compromise between >> usability and overhead of zone reclaim/GC. But that particular number is >> for HDD due to their performance characteristics. > > Also for NVMe we initially went down the road to try to support > non power of two sizes. But there was another major early host that > really wanted the power of two zone sizes to support hardware based > hosts that can cheaply do shifts but not divisions. The variable > zone capacity feature (something that Linux does not currently support) > is a feature requested by NVMe members on the host and device side > also can only be supported with the the zone size / zone capacity split. > >> The other solution would be adding a dm-unhole target to remap sectors >> to remove the holes from the device address space. Such target would be >> easy to write, but in my opinion, this would still not change the fact >> that applications still have to deal with error recovery and active/open >> zone resources. So they still have to be zone aware and operate per zone. > > I don't think we even need a new target for it. I think you can do > this with a table using multiple dm-linear sections already if you > want. Nope, this is currently not possible: DM requires the target zone size to be the same as the underlying device zone size. So that would not work. > >> My answer to your last question ("Are we sure?") is thus: No. I am not >> sure this is a good idea. But as always, I would be happy to be proven >> wrong. So far, I have not seen any argument doing that. > > Agreed. Supporting non-power of two sizes in the block layer is fairly > easy as shown by some of the patches seens in this series. Supporting > them properly in the whole ecosystem is not trivial and will create a > long-term burden. We could do that, but we'd rather have a really good > reason for it, and right now I don't see that.
On Mon, Mar 14, 2022 at 04:45:12PM +0900, Damien Le Moal wrote: > Nope, this is currently not possible: DM requires the target zone size > to be the same as the underlying device zone size. So that would not work. Indeed.
> > Furthermore, adding such DM target would create a non power of 2 zone size > zoned device which will need support from the block layer. So some block layer > functions will need to change. In the end, this may not be different than > enabling non power of 2 zone sized devices for ZNS. > > And for this decision, I maintain some of my requirements: > 1) The added overhead from multiplication & divisions should be acceptable > and not degrade performance. Otherwise, this would be a disservice to the > zone ecosystem. > 2) Nothing that works today on available devices should break > 3) Zone size requirements will still exist. E.g. btrfs 64K alignment requirement > Adding to the existing points that has been made. I believe it hasn't been mentioned that for non-power of 2 zone sizes, holes are still allowed due to zones being/becoming offline. The offline zone state supports neither writes nor reads, and applications must be aware and work around such holes in the address space. Furthermore, the specification doesn't allow writes to cross zones - so while reads may cross a zone, the writes must always be broken up across zone boundaries. As a result, applications must work with zones independently and can't assume that it can write to the adjacent zone nor write across two zones. Best, Matias
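The two constraints Matias lists (zones may become read-only or offline, and writes must never cross a zone boundary) translate into per-zone checks that an application performs regardless of whether the zone size is a power of 2. An illustrative sketch using the zone condition codes from <linux/blkzoned.h>; the helper names are made up:

```c
#include <linux/blkzoned.h>
#include <stdbool.h>
#include <stdint.h>

/* Read-only and offline zones are holes the application must skip. */
static bool zone_is_usable(const struct blk_zone *z)
{
	return z->cond != BLK_ZONE_COND_READONLY &&
	       z->cond != BLK_ZONE_COND_OFFLINE;
}

/*
 * Writes must never span two zones, so a write is capped at the writable
 * capacity of the zone it starts in; the remainder goes to another zone.
 */
static uint64_t cap_write_to_zone(const struct blk_zone *z,
				  uint64_t sector, uint64_t nr_sectors)
{
	uint64_t end = z->start + z->capacity;   /* first unwritable sector */

	return sector + nr_sectors > end ? end - sector : nr_sectors;
}
```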
On 14.03.2022 16:45, Damien Le Moal wrote: >On 3/14/22 16:35, Christoph Hellwig wrote: >> On Sat, Mar 12, 2022 at 04:58:08PM +0900, Damien Le Moal wrote: >>> The reason for the power of 2 requirement is 2 fold: >>> 1) At the time we added zone support for SMR, chunk_sectors had to be a >>> power of 2 number of sectors. >>> 2) SMR users did request power of 2 zone sizes and that all zones have >>> the same size as that simplified software design. There was even a >>> de-facto agreement that 256MB zone size is a good compromise between >>> usability and overhead of zone reclaim/GC. But that particular number is >>> for HDD due to their performance characteristics. >> >> Also for NVMe we initially went down the road to try to support >> non power of two sizes. But there was another major early host that >> really wanted the power of two zone sizes to support hardware based >> hosts that can cheaply do shifts but not divisions. The variable >> zone capacity feature (something that Linux does not currently support) >> is a feature requested by NVMe members on the host and device side >> also can only be supported with the the zone size / zone capacity split. >> >>> The other solution would be adding a dm-unhole target to remap sectors >>> to remove the holes from the device address space. Such target would be >>> easy to write, but in my opinion, this would still not change the fact >>> that applications still have to deal with error recovery and active/open >>> zone resources. So they still have to be zone aware and operate per zone. >> >> I don't think we even need a new target for it. I think you can do >> this with a table using multiple dm-linear sections already if you >> want. > >Nope, this is currently not possible: DM requires the target zone size >to be the same as the underlying device zone size. So that would not work. > >> >>> My answer to your last question ("Are we sure?") is thus: No. I am not >>> sure this is a good idea. But as always, I would be happy to be proven >>> wrong. So far, I have not seen any argument doing that. >> >> Agreed. Supporting non-power of two sizes in the block layer is fairly >> easy as shown by some of the patches seens in this series. Supporting >> them properly in the whole ecosystem is not trivial and will create a >> long-term burden. We could do that, but we'd rather have a really good >> reason for it, and right now I don't see that. I think that Bo's use-case is an example of a major upstream Linux host that is struggling with unmmapped LBAs. Can we focus on this use-case and the parts that we are missing to support Bytedance? If you agree to this, I believe we can add support for ZoneFS pretty easily. We also have a POC in btrfs that we will follow on. For the time being, F2FS would fail at mkfs time if zone size is not a PO2. What do you think?
> >> Agreed. Supporting non-power of two sizes in the block layer is > >> fairly easy as shown by some of the patches seens in this series. > >> Supporting them properly in the whole ecosystem is not trivial and > >> will create a long-term burden. We could do that, but we'd rather > >> have a really good reason for it, and right now I don't see that. > > I think that Bo's use-case is an example of a major upstream Linux host that is > struggling with unmmapped LBAs. Can we focus on this use-case and the parts > that we are missing to support Bytedance? Any application that uses zoned storage devices would have to manage unmapped LBAs due to the potential of zones being/becoming offline (no reads/writes allowed). Eliminating the difference between zone cap and zone size will not remove this requirement, and holes will continue to exist. Furthermore, writing to LBAs across zones is not allowed by the specification and must also be managed. Given the above, applications have to be conscious of zones in general and work within their boundaries. I don't understand how applications can work without having per-zone knowledge. An application would have to know about zones and their writeable capacity. To decide where and how data is written, an application must manage writing across zones, specific offline zones, and (currently) its writeable capacity. I.e., knowledge about zones and holes is required for writing to zoned devices and isn't eliminated by removing the PO2 zone size requirement. For years, the PO2 requirement has been known in the Linux community and by the ZNS SSD vendors. Some SSD implementors have chosen not to support PO2 zone sizes, which is a perfectly valid decision. But its implementors knowingly did that while knowing that the Linux kernel didn't support it. I want to turn the argument around to see it from the kernel developer's point of view. They have communicated the PO2 requirement clearly, there's good precedence working with PO2 zone sizes, and at last, holes can't be avoided and are part of the overall design of zoned storage devices. So why should the kernel developer's take on the long-term maintenance burden of NPO2 zone sizes?
On Mon, Mar 14, 2022 at 02:16:36PM +0000, Matias Bjørling wrote:
> I want to turn the argument around to see it from the kernel
> developer's point of view. They have communicated the PO2 requirement
> clearly,

Such a requirement is based on history and effort put in place to assume a PO2 requirement for zoned storage, and clearly it is not. And clearly even vendors who have embraced PO2 don't know for sure they'll always be able to stick to PO2...

> there's good precedence working with PO2 zone sizes, and at
> last, holes can't be avoided and are part of the overall design of
> zoned storage devices. So why should the kernel developer's take on
> the long-term maintenance burden of NPO2 zone sizes?

I think the better question to address here is:

Do we *not* want to support NPO2 zone sizes in Linux out of principle?

If we *are* open to supporting NPO2 zone sizes, what path should we take to incur the least pain and fragmentation?

Emulation was a path being considered, and I think at this point the answer to evaluating that path is: this is cumbersome, probably not.

The next question then is: are we open to evaluating what it looks like to slowly shave off the PO2 requirement in different layers, with a goal of avoiding further fragmentation? There is effort on evaluating that path and it doesn't seem to be that bad.

So I'd advise evaluating that; there is nothing to lose other than awareness of what that path might look like. Unless of course we already have a clear path forward for NPO2 we can all agree on.

Luis
> -----Original Message----- > From: Luis Chamberlain <mcgrof@infradead.org> On Behalf Of Luis > Chamberlain > Sent: Monday, 14 March 2022 17.24 > To: Matias Bjørling <Matias.Bjorling@wdc.com> > Cc: Javier González <javier@javigon.com>; Damien Le Moal > <damien.lemoal@opensource.wdc.com>; Christoph Hellwig <hch@lst.de>; > Keith Busch <kbusch@kernel.org>; Pankaj Raghav <p.raghav@samsung.com>; > Adam Manzanares <a.manzanares@samsung.com>; > jiangbo.365@bytedance.com; kanchan Joshi <joshi.k@samsung.com>; Jens > Axboe <axboe@kernel.dk>; Sagi Grimberg <sagi@grimberg.me>; Pankaj > Raghav <pankydev8@gmail.com>; Kanchan Joshi <joshiiitr@gmail.com>; linux- > block@vger.kernel.org; linux-nvme@lists.infradead.org > Subject: Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices > > On Mon, Mar 14, 2022 at 02:16:36PM +0000, Matias Bjørling wrote: > > I want to turn the argument around to see it from the kernel > > developer's point of view. They have communicated the PO2 requirement > > clearly, > > Such requirement is based on history and effort put in place to assume a PO2 > requirement for zone storage, and clearly it is not. And clearly even vendors > who have embraced PO2 don't know for sure they'll always be able to stick to > PO2... Sure - It'll be naïve to give a carte blanche promise. However, you're skipping the next two elements, which state that there are both good precedence working with PO2 zone sizes and that holes/unmapped LBAs can't be avoided. Making an argument for why NPO2 zone sizes may not bring what one is looking for. It's a lot of work for little practical change, if any. > > > there's good precedence working with PO2 zone sizes, and at last, > > holes can't be avoided and are part of the overall design of zoned > > storage devices. So why should the kernel developer's take on the > > long-term maintenance burden of NPO2 zone sizes? > > I think the better question to address here is: > > Do we *not* want to support NPO2 zone sizes in Linux out of principal? > > If we *are* open to support NPO2 zone sizes, what path should we take to > incur the least pain and fragmentation? > > Emulation was a path being considered, and I think at this point the answer to > eveluating that path is: this is cumbersome, probably not. > > The next question then is: are we open to evaluate what it looks like to slowly > shave off the PO2 requirement in different layers, with an goal to avoid further > fragmentation? There is effort on evaluating that path and it doesn't seem to > be that bad. > > So I'd advise to evaluate that, there is nothing to loose other than awareness of > what that path might look like. > > Uness of course we already have a clear path forward for NPO2 we can all > agree on. It looks like there isn't currently one that can be agreed upon. If evaluating different approaches, it would be helpful to the reviewers if interfaces and all of its kernel users are converted in a single patchset. This would also help to avoid users getting hit by what is supported, and what isn't supported by a particular device implementation and allow better to review the full set of changes required to add the support.
On Mon, Mar 14, 2022 at 07:30:25PM +0000, Matias Bjørling wrote: > > -----Original Message----- > > From: Luis Chamberlain <mcgrof@infradead.org> On Behalf Of Luis > > Chamberlain > > Sent: Monday, 14 March 2022 17.24 > > To: Matias Bjørling <Matias.Bjorling@wdc.com> > > Cc: Javier González <javier@javigon.com>; Damien Le Moal > > <damien.lemoal@opensource.wdc.com>; Christoph Hellwig <hch@lst.de>; > > Keith Busch <kbusch@kernel.org>; Pankaj Raghav <p.raghav@samsung.com>; > > Adam Manzanares <a.manzanares@samsung.com>; > > jiangbo.365@bytedance.com; kanchan Joshi <joshi.k@samsung.com>; Jens > > Axboe <axboe@kernel.dk>; Sagi Grimberg <sagi@grimberg.me>; Pankaj > > Raghav <pankydev8@gmail.com>; Kanchan Joshi <joshiiitr@gmail.com>; linux- > > block@vger.kernel.org; linux-nvme@lists.infradead.org > > Subject: Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices > > > > On Mon, Mar 14, 2022 at 02:16:36PM +0000, Matias Bjørling wrote: > > > I want to turn the argument around to see it from the kernel > > > developer's point of view. They have communicated the PO2 requirement > > > clearly, > > > > Such requirement is based on history and effort put in place to assume a PO2 > > requirement for zone storage, and clearly it is not. And clearly even vendors > > who have embraced PO2 don't know for sure they'll always be able to stick to > > PO2... > > Sure - It'll be naïve to give a carte blanche promise. Exactly. So taking a position to not support NPO2 I think seems counter productive to the future of ZNS, the question whould be, *how* to best do this in light of what we need to support / avoid performance regressions / strive towards avoiding fragmentation. > However, you're skipping the next two elements, which state that there > are both good precedence working with PO2 zone sizes and that > holes/unmapped LBAs can't be avoided. I'm not, but I admit that it's a good point of having the possibility of zones being taken offline also implicates holes. I also think it was a good excercise to discuss and evaluate emulation given I don't think this point you made would have been made clear otherwise. This is why I treat ZNS as evolving effort, and I can't seriously take any position stating all answers are known. > Making an argument for why NPO2 > zone sizes may not bring what one is looking for. It's a lot of work > for little practical change, if any. NAND does not incur a PO2 requirement, that should be enough to implicate that PO2 zones *can* be expected. If no vendor wants to take a position that they know for a fact they'll never adopt PO2 zones should be enough to keep an open mind to consider *how* to support them. > > > there's good precedence working with PO2 zone sizes, and at last, > > > holes can't be avoided and are part of the overall design of zoned > > > storage devices. So why should the kernel developer's take on the > > > long-term maintenance burden of NPO2 zone sizes? > > > > I think the better question to address here is: > > > > Do we *not* want to support NPO2 zone sizes in Linux out of principal? > > > > If we *are* open to support NPO2 zone sizes, what path should we take to > > incur the least pain and fragmentation? > > > > Emulation was a path being considered, and I think at this point the answer to > > eveluating that path is: this is cumbersome, probably not. > > > > The next question then is: are we open to evaluate what it looks like to slowly > > shave off the PO2 requirement in different layers, with an goal to avoid further > > fragmentation? 
There is effort on evaluating that path and it doesn't seem to > > be that bad. > > > > So I'd advise to evaluate that, there is nothing to loose other than awareness of > > what that path might look like. > > > > Uness of course we already have a clear path forward for NPO2 we can all > > agree on. > > It looks like there isn't currently one that can be agreed upon. I'm not quite sure that is the case. To reach consensus one has to take a position of accepting the right answer may not be known and we evaluate all prospects. It is not clear to me that we've done that yet and it is why I think a venue such as LSFMM may be good to review these things. > If evaluating different approaches, it would be helpful to the > reviewers if interfaces and all of its kernel users are converted in a > single patchset. This would also help to avoid users getting hit by > what is supported, and what isn't supported by a particular device > implementation and allow better to review the full set of changes > required to add the support. Sorry I didn't understand the suggestion here, can you clarify what it is you are suggesting? Thanks! Luis
On 14.03.2022 14:16, Matias Bjørling wrote: >> >> Agreed. Supporting non-power of two sizes in the block layer is >> >> fairly easy as shown by some of the patches seens in this series. >> >> Supporting them properly in the whole ecosystem is not trivial and >> >> will create a long-term burden. We could do that, but we'd rather >> >> have a really good reason for it, and right now I don't see that. >> >> I think that Bo's use-case is an example of a major upstream Linux host that is >> struggling with unmmapped LBAs. Can we focus on this use-case and the parts >> that we are missing to support Bytedance? > >Any application that uses zoned storage devices would have to manage >unmapped LBAs due to the potential of zones being/becoming offline (no >reads/writes allowed). Eliminating the difference between zone cap and >zone size will not remove this requirement, and holes will continue to >exist. Furthermore, writing to LBAs across zones is not allowed by the >specification and must also be managed. > >Given the above, applications have to be conscious of zones in general and work within their boundaries. I don't understand how applications can work without having per-zone knowledge. An application would have to know about zones and their writeable capacity. To decide where and how data is written, an application must manage writing across zones, specific offline zones, and (currently) its writeable capacity. I.e., knowledge about zones and holes is required for writing to zoned devices and isn't eliminated by removing the PO2 zone size requirement. Supporting offlines zones is optional in the ZNS spec? We are not considering supporting this in the host. This will be handled by the device for exactly maintaining the SW stack simpler. > >For years, the PO2 requirement has been known in the Linux community and by the ZNS SSD vendors. Some SSD implementors have chosen not to support PO2 zone sizes, which is a perfectly valid decision. But its implementors knowingly did that while knowing that the Linux kernel didn't support it. > >I want to turn the argument around to see it from the kernel developer's point of view. They have communicated the PO2 requirement clearly, there's good precedence working with PO2 zone sizes, and at last, holes can't be avoided and are part of the overall design of zoned storage devices. So why should the kernel developer's take on the long-term maintenance burden of NPO2 zone sizes? You have a good point, and that is the question we need to help answer. As I see it, requirements evolve and the kernel changes with it as long as there are active upstream users for it. The main constraint for PO2 is removed in the block layer, we have Linux hosts stating that unmapped LBAs are a problem, and we have HW supporting size=capacity. I would be happy to hear what else you would like to see for this to be of use to the kernel community.
> > > On Mon, Mar 14, 2022 at 02:16:36PM +0000, Matias Bjørling wrote: > > > > I want to turn the argument around to see it from the kernel > > > > developer's point of view. They have communicated the PO2 > > > > requirement clearly, > > > > > > Such requirement is based on history and effort put in place to > > > assume a PO2 requirement for zone storage, and clearly it is not. > > > And clearly even vendors who have embraced PO2 don't know for sure > > > they'll always be able to stick to PO2... > > > > Sure - It'll be naïve to give a carte blanche promise. > > Exactly. So taking a position to not support NPO2 I think seems counter > productive to the future of ZNS, the question whould be, *how* to best do this > in light of what we need to support / avoid performance regressions / strive > towards avoiding fragmentation. Having non-power of two zone sizes is a derivation from existing devices being used in full production today. That there is a wish to introduce support for such drives is interesting, but given the background and development of zoned devices. Damien mentioned that SMR HDDs didn't start off with PO2 zone sizes - that was what became the norm due to its overall benefits. I.e., drives with NPO2 zone sizes is the odd one, and in some views, is the one creating fragmentation. That there is a wish to revisit that design decision is fair, and it sounds like there is willingness to explorer such options. But please be advised that the Linux community have had communicated the specific requirement for a long time to avoid this particular issue. Thus, the community have been trying to help the vendors make the appropriate design decisions, such that they could take advantage of the Linux kernel stack from day one. > > However, you're skipping the next two elements, which state that there > > are both good precedence working with PO2 zone sizes and that > > holes/unmapped LBAs can't be avoided. > > I'm not, but I admit that it's a good point of having the possibility of zones being > taken offline also implicates holes. I also think it was a good excercise to > discuss and evaluate emulation given I don't think this point you made would > have been made clear otherwise. This is why I treat ZNS as evolving effort, and > I can't seriously take any position stating all answers are known. That's good to hear. I would note that some members in this thread have been doing zoned storage for close to a decade, and have a very thorough understanding of the zoned storage model - so it might be a stretch for them to hear that you're considering everything up in the air and early. This stack is already being used by a large percentage of the bits being shipped in the world. Thus, there is an interest in maintaining these things, and making sure that things don't regress and so on. > > > Making an argument for why NPO2 > > zone sizes may not bring what one is looking for. It's a lot of work > > for little practical change, if any. > > NAND does not incur a PO2 requirement, that should be enough to implicate > that PO2 zones *can* be expected. If no vendor wants to take a position that > they know for a fact they'll never adopt > PO2 zones should be enough to keep an open mind to consider *how* to > support them. As long as it doesn't also imply that support *has* to be added to the kernel, then that's okay. <snip> > > > If evaluating different approaches, it would be helpful to the > > reviewers if interfaces and all of its kernel users are converted in a > > single patchset. 
This would also help to avoid users getting hit by > > what is supported, and what isn't supported by a particular device > > implementation and allow a better review of the full set of changes > > required to add the support. > > Sorry I didn't understand the suggestion here, can you clarify what it is you are > suggesting? It would help reviewers if a potential patchset converted all users (e.g., f2fs, btrfs, device mappers, io schedulers, etc.), such that the full effect can be evaluated, with the added benefit that end-users would not have to think about what is and what isn't supported.
> >Given the above, applications have to be conscious of zones in general and > work within their boundaries. I don't understand how applications can work > without having per-zone knowledge. An application would have to know about > zones and their writeable capacity. To decide where and how data is written, > an application must manage writing across zones, specific offline zones, and > (currently) its writeable capacity. I.e., knowledge about zones and holes is > required for writing to zoned devices and isn't eliminated by removing the PO2 > zone size requirement. > > Supporting offlines zones is optional in the ZNS spec? We are not considering > supporting this in the host. This will be handled by the device for exactly > maintaining the SW stack simpler. It isn't optional. The spec allows any zone to go to the Read Only or Offline state at any point in time. A specific implementation might give some guarantees as to when such transitions happen, but it must nevertheless be managed by the host software. Given that, and the need to not issue writes that span zones, an application would have to be aware of such behaviors. The information to make those decisions is in a zone's attributes, and since applications would pull those anyway, they would also know the writeable capacity of a zone. So, all in all, creating support for NPO2 is something that takes a lot of work, but might have little to no impact on the overall software design. > > > >For years, the PO2 requirement has been known in the Linux community and > by the ZNS SSD vendors. Some SSD implementors have chosen not to support > PO2 zone sizes, which is a perfectly valid decision. But its implementors > knowingly did that while knowing that the Linux kernel didn't support it. > > > >I want to turn the argument around to see it from the kernel developer's point > of view. They have communicated the PO2 requirement clearly, there's good > precedence working with PO2 zone sizes, and at last, holes can't be avoided > and are part of the overall design of zoned storage devices. So why should the > kernel developer's take on the long-term maintenance burden of NPO2 zone > sizes? > > You have a good point, and that is the question we need to help answer. > As I see it, requirements evolve and the kernel changes with it as long as there > are active upstream users for it. True. There are also active users for SSDs which are custom (e.g., requiring larger than 4KiB writes) - but they aren't supported by the Linux kernel and aren't actively being worked on to my knowledge. Which is fine, as those customers use the devices in their own way and don't need Linux kernel support. > > The main constraint for (1) PO2 is removed in the block layer, we have (2) Linux hosts > stating that unmapped LBAs are a problem, and we have (3) HW supporting > size=capacity. > > I would be happy to hear what else you would like to see for this to be of use to > the kernel community. (Added numbers to your paragraph above) 1. The sysfs chunksize attribute was "misused" to also represent zone size. What has changed is that RAID controllers now can use a NPO2 chunk size. This wasn't meant to naturally extend to zones, which, as shown in the currently posted patchset, is a lot more work. 2. Bo mentioned that the software already manages holes. It took a bit of time to get right, but now it works. The software in question is thus already capable of working with holes, and fixing this would present itself as a minor optimization overall.
I'm not convinced the work to do this in the kernel is proportional to the change it'll make to the applications. 3. I'm happy to hear that. However, I'd like to reiterate the point that the PO2 requirement has been known for years. That there's a drive doing NPO2 zones is great, but a decision was made by the SSD implementors to not support the Linux kernel given its current implementation. All that said - if there are people willing to do the work and it doesn't have a negative impact on performance, code quality, maintenance complexity, etc. then there isn't anything saying support can't be added - but it does seem like it’s a lot of work, for little overall benefit to applications and the host users.
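For readers skimming the thread: the per-zone attributes referred to above (size, capacity, condition) are already visible to userspace through the BLKREPORTZONE ioctl, so an application can discover both the cap/size hole and any read-only or offline zones it must skip. The sketch below is an illustration only, not code from the posted series; the device path is a placeholder and the capacity field assumes kernel headers of 5.9 or later (BLK_ZONE_REP_CAPACITY).

```c
/* Illustration only (not from the posted series): dump zone size, capacity
 * and condition so an application can see both the cap/size "holes" and any
 * read-only/offline zones it must skip.  /dev/nvme0n1 is a placeholder. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/blkzoned.h>

int main(void)
{
	int fd = open("/dev/nvme0n1", O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	unsigned int nr = 16;
	struct blk_zone_report *rep =
		calloc(1, sizeof(*rep) + nr * sizeof(struct blk_zone));

	rep->sector = 0;	/* report from the first zone onwards */
	rep->nr_zones = nr;

	if (ioctl(fd, BLKREPORTZONE, rep) < 0) {
		perror("BLKREPORTZONE");
		return 1;
	}

	for (unsigned int i = 0; i < rep->nr_zones; i++) {
		struct blk_zone *z = &rep->zones[i];
		/* capacity is only valid when the kernel sets this flag */
		__u64 cap = (rep->flags & BLK_ZONE_REP_CAPACITY) ?
			    z->capacity : z->len;

		printf("zone %u: start %llu len %llu cap %llu hole %llu cond %u\n",
		       i, (unsigned long long)z->start,
		       (unsigned long long)z->len,
		       (unsigned long long)cap,
		       (unsigned long long)(z->len - cap),
		       (unsigned int)z->cond);

		if (z->cond == BLK_ZONE_COND_READONLY ||
		    z->cond == BLK_ZONE_COND_OFFLINE)
			printf("  -> not writable, application must skip it\n");
	}

	free(rep);
	close(fd);
	return 0;
}
```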
On 15.03.2022 12:32, Matias Bjørling wrote: >> >Given the above, applications have to be conscious of zones in general and >> work within their boundaries. I don't understand how applications can work >> without having per-zone knowledge. An application would have to know about >> zones and their writeable capacity. To decide where and how data is written, >> an application must manage writing across zones, specific offline zones, and >> (currently) its writeable capacity. I.e., knowledge about zones and holes is >> required for writing to zoned devices and isn't eliminated by removing the PO2 >> zone size requirement. >> >> Supporting offlines zones is optional in the ZNS spec? We are not considering >> supporting this in the host. This will be handled by the device for exactly >> maintaining the SW stack simpler. > >It isn't optional. The spec allows any zones to go to Read Only or Offline state at any point in time. A specific implementation might give some guarantees to when such transitions happens, but it must nevertheless must be managed by the host software. > >Given that, and the need to not issue writes that spans zones, an application would have to aware of such behaviors. The information to make those decisions are in a zone's attributes, and thus applications would pull those, it would also know the writeable capability of a zone. So, all in all, creating support for NPO2 is something that takes a lot of work, but might have little to no impact on the overall software design. Thanks for the clarification. I can attest that we are giving the guarantee to simplify the host stack. I believe we are making many assumptions in Linux too to simplify ZNS support. This said, I understand your point. I am not developing application support. I will refer again to Bo's response on the use case on where holes are problematic. > >> > >> >For years, the PO2 requirement has been known in the Linux community and >> by the ZNS SSD vendors. Some SSD implementors have chosen not to support >> PO2 zone sizes, which is a perfectly valid decision. But its implementors >> knowingly did that while knowing that the Linux kernel didn't support it. >> > >> >I want to turn the argument around to see it from the kernel developer's point >> of view. They have communicated the PO2 requirement clearly, there's good >> precedence working with PO2 zone sizes, and at last, holes can't be avoided >> and are part of the overall design of zoned storage devices. So why should the >> kernel developer's take on the long-term maintenance burden of NPO2 zone >> sizes? >> >> You have a good point, and that is the question we need to help answer. >> As I see it, requirements evolve and the kernel changes with it as long as there >> are active upstream users for it. > >True. There's also active users for SSDs which are custom (e.g., larger than 4KiB writes required) - but they aren't supported by the Linux kernel and isn't actively being worked on to my knowledge. Which is fine, as the customers anyway uses this in their own way, and don't need the Linux kernel support. Ask things become stable some might choose to push support for certain features in the Kernel. In this case, the changes are not big in the block layer. I believe it is a process and the features should be chosen to maximize benefit and minimize maintenance cost. > >> >> The main constraint for (1) PO2 is removed in the block layer, we have (2) Linux hosts >> stating that unmapped LBAs are a problem, and we have (3) HW supporting >> size=capacity. 
>> >> I would be happy to hear what else you would like to see for this to be of use to >> the kernel community. > >(Added numbers to your paragraph above) > >1. The sysfs chunksize attribute was "misused" to also represent zone size. What has changed is that RAID controllers now can use a NPO2 chunk size. This wasn't meant to naturally extend to zones, which as shown in the current posted patchset, is a lot more work. True. But this was the main constraint for PO2. >2. Bo mentioned that the software already manages holes. It took a bit of time to get right, but now it works. Thus, the software in question is already capable of working with holes. Thus, fixing this, would present itself as a minor optimization overall. I'm not convinced the work to do this in the kernel is proportional to the change it'll make to the applications. I will let Bo respond to this himself. >3. I'm happy to hear that. However, I'll like to reiterate the point that the PO2 requirement have been known for years. That there's a drive doing NPO2 zones is great, but a decision was made by the SSD implementors to not support the Linux kernel given its current implementation. Zoned devices have been supported for years with SMR, and this is a strong argument. However, ZNS is still very new and customers have several requirements. I do not believe that a HDD stack should have such an impact on NVMe. Also, we will see new interfaces adding support for zoned devices in the future. We should think about the future and not the past. > >All that said - if there are people willing to do the work and it doesn't have a negative impact on performance, code quality, maintenance complexity, etc. then there isn't anything saying support can't be added - but it does seem like it’s a lot of work, for little overall benefits to applications and the host users. Exactly. Patches in the block layer are trivial. This is running in production loads without issues. I have tried to highlight the benefits in previous emails and I believe you understand them. Support for ZoneFS seems easy too. We have an early POC for btrfs and it seems it can be done. We sign up for these two. As for F2FS and dm-zoned, I do not think these are targets at the moment. If this is the path we follow, these will bail out at mkfs time. If we can agree on the above, I believe we can start with the code that enables the existing customers and build support for btrfs and ZoneFS in the next few months. What do you think?
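As an aside for readers unfamiliar with the phrase, "bail out at mkfs time" simply means that a userspace tool which has not been converted would refuse an NPO2 device up front rather than misbehave later. A hypothetical sketch of such a check (the function names and message are invented for illustration, not taken from any actual mkfs tool):

```c
/* Hypothetical "bail out at mkfs time" check: a tool that has not been
 * converted simply refuses NPO2 zone sizes up front. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static bool is_power_of_2(uint64_t v)
{
	return v && !(v & (v - 1));
}

static void check_zone_size_or_die(uint64_t zone_sectors)
{
	if (!is_power_of_2(zone_sectors)) {
		fprintf(stderr,
			"zone size of %llu sectors is not a power of 2, not supported\n",
			(unsigned long long)zone_sectors);
		exit(EXIT_FAILURE);
	}
}

int main(void)
{
	check_zone_size_or_die(1 << 19);	/* 256 MiB zone in 512B sectors: accepted */
	check_zone_size_or_die(1077 * 2048);	/* 1077 MiB zone: rejected */
	return 0;
}
```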
> > > >All that said - if there are people willing to do the work and it doesn't have a > negative impact on performance, code quality, maintenance complexity, etc. > then there isn't anything saying support can't be added - but it does seem like > it’s a lot of work, for little overall benefits to applications and the host users. > > Exactly. > > Patches in the block layer are trivial. This is running in production loads without > issues. I have tried to highlight the benefits in previous benefits and I believe > you understand them. > > Support for ZoneFS seems easy too. We have an early POC for btrfs and it > seems it can be done. We sign up for these 2. > > As for F2FS and dm-zoned, I do not think these are targets at the moment. If > this is the path we follow, these will bail out at mkfs time. > > If we can agree on the above, I believe we can start with the code that enables > the existing customers and build support for butrfs and ZoneFS in the next few > months. > > What do you think? I would suggest to do it in a single shot, i.e., a single patchset, which enables all the internal users in the kernel (including f2fs and others). That way end-users do not have to worry about the difference of PO2/NPO2 zones and it'll help reduce the burden on long-term maintenance.
On 15.03.2022 13:14, Matias Bjørling wrote: >> > >> >All that said - if there are people willing to do the work and it doesn't have a >> negative impact on performance, code quality, maintenance complexity, etc. >> then there isn't anything saying support can't be added - but it does seem like >> it’s a lot of work, for little overall benefits to applications and the host users. >> >> Exactly. >> >> Patches in the block layer are trivial. This is running in production loads without >> issues. I have tried to highlight the benefits in previous benefits and I believe >> you understand them. >> >> Support for ZoneFS seems easy too. We have an early POC for btrfs and it >> seems it can be done. We sign up for these 2. >> >> As for F2FS and dm-zoned, I do not think these are targets at the moment. If >> this is the path we follow, these will bail out at mkfs time. >> >> If we can agree on the above, I believe we can start with the code that enables >> the existing customers and build support for butrfs and ZoneFS in the next few >> months. >> >> What do you think? > >I would suggest to do it in a single shot, i.e., a single patchset, which enables all the internal users in the kernel (including f2fs and others). That way end-users do not have to worry about the difference of PO2/NPO2 zones and it'll help reduce the burden on long-term maintenance. Thanks for the suggestion Matias. Happy to see that you are open to supporting this. I understand why a patch series fixing everything at once is attractive, but we do not see a use case for ZNS in F2FS, as it is a mobile file system. As other interfaces arrive, this work will become natural. ZoneFS and btrfs are good targets for ZNS and these we can do. I would still do the work in phases to make sure we have enough early feedback from the community. Since this thread has been very active, I will wait some time for Christoph and others to catch up before we start sending code.
On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote: > but we do not see a usage for ZNS in F2FS, as it is a mobile > file-system. As other interfaces arrive, this work will become natural. > > ZoneFS and butrfs are good targets for ZNS and these we can do. I would > still do the work in phases to make sure we have enough early feedback > from the community. > > Since this thread has been very active, I will wait some time for > Christoph and others to catch up before we start sending code. Can someone summarize where we stand? Between the lack of quoting from hell and overly long lines from corporate mail clients I've mostly stopped reading this thread because it takes too much effort to actually extract the information.
> -----Original Message----- > From: Javier González <javier@javigon.com> > Sent: Tuesday, 15 March 2022 14.26 > To: Matias Bjørling <Matias.Bjorling@wdc.com> > Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>; Christoph > Hellwig <hch@lst.de>; Luis Chamberlain <mcgrof@kernel.org>; Keith Busch > <kbusch@kernel.org>; Pankaj Raghav <p.raghav@samsung.com>; Adam > Manzanares <a.manzanares@samsung.com>; jiangbo.365@bytedance.com; > kanchan Joshi <joshi.k@samsung.com>; Jens Axboe <axboe@kernel.dk>; Sagi > Grimberg <sagi@grimberg.me>; Pankaj Raghav <pankydev8@gmail.com>; > Kanchan Joshi <joshiiitr@gmail.com>; linux-block@vger.kernel.org; linux- > nvme@lists.infradead.org > Subject: Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices > > On 15.03.2022 13:14, Matias Bjørling wrote: > >> > > >> >All that said - if there are people willing to do the work and it > >> >doesn't have a > >> negative impact on performance, code quality, maintenance complexity, > etc. > >> then there isn't anything saying support can't be added - but it does > >> seem like it’s a lot of work, for little overall benefits to applications and the > host users. > >> > >> Exactly. > >> > >> Patches in the block layer are trivial. This is running in production > >> loads without issues. I have tried to highlight the benefits in > >> previous benefits and I believe you understand them. > >> > >> Support for ZoneFS seems easy too. We have an early POC for btrfs and > >> it seems it can be done. We sign up for these 2. > >> > >> As for F2FS and dm-zoned, I do not think these are targets at the > >> moment. If this is the path we follow, these will bail out at mkfs time. > >> > >> If we can agree on the above, I believe we can start with the code > >> that enables the existing customers and build support for butrfs and > >> ZoneFS in the next few months. > >> > >> What do you think? > > > >I would suggest to do it in a single shot, i.e., a single patchset, which enables > all the internal users in the kernel (including f2fs and others). That way end- > users do not have to worry about the difference of PO2/NPO2 zones and it'll > help reduce the burden on long-term maintenance. > > Thanks for the suggestion Matias. Happy to see that you are open to support > this. I understand why a patchseries fixing all is attracgive, but we do not see a > usage for ZNS in F2FS, as it is a mobile file-system. As other interfaces arrive, > this work will become natural. We've seen uptake on ZNS on f2fs, so I would argue that its important to have support in as well. > > ZoneFS and butrfs are good targets for ZNS and these we can do. I would still do > the work in phases to make sure we have enough early feedback from the > community. Sure, continuous review is good. But not having support for all the kernel users creates fragmentation. Doing a full switch is greatly preferred, as it avoids this fragmentation, but will also lower the overall maintenance burden, which also was raised as a concern.
On 15.03.2022 14:30, Christoph Hellwig wrote: >On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote: >> but we do not see a usage for ZNS in F2FS, as it is a mobile >> file-system. As other interfaces arrive, this work will become natural. >> >> ZoneFS and butrfs are good targets for ZNS and these we can do. I would >> still do the work in phases to make sure we have enough early feedback >> from the community. >> >> Since this thread has been very active, I will wait some time for >> Christoph and others to catch up before we start sending code. > >Can someone summarize where we stand? Between the lack of quoting >from hell and overly long lines from corporate mail clients I've >mostly stopped reading this thread because it takes too much effort >actually extract the information. Let me give it a try: - PO2 emulation in NVMe is a no-go. Drop this. - The arguments against supporting NPO2 are: - It makes ZNS depart from the SMR assumption of PO2 zone sizes. This can create confusion for users of both SMR and ZNS - Existing applications assume PO2 zone sizes, and probably do optimizations for these. These applications, if they want to use ZNS, will have to change their calculations - There is a fear of performance regressions. - It adds more work for you and other maintainers - The arguments in favour of NPO2 are: - Unmapped LBAs create holes that applications need to deal with. This affects mapping and performance due to splits. Bo explained this in a thread from Bytedance's perspective. I explained in an answer to Matias how we are not letting zones transition to offline in order to simplify the host stack. Not sure if this is something we want to bring to NVMe. - As ZNS adds more features and other protocols add support for zoned devices we will have more use-cases for the zoned block device. We will have to deal with this fragmentation at some point. - This is used in production workloads in Linux hosts. I would advocate for this not being off-tree as it will be a headache for all in the future. - If you agree that removing the PO2 constraint is an option, we can do the following: - Remove the constraint in the block layer and add ZoneFS support in a first patch. - Add btrfs support in a later patch - Make changes to tools once merged Hope I have collected all points of view in such a short format.
> -----Original Message----- > From: Javier González <javier@javigon.com> > Sent: Tuesday, 15 March 2022 14.53 > To: Christoph Hellwig <hch@lst.de> > Cc: Matias Bjørling <Matias.Bjorling@wdc.com>; Damien Le Moal > <damien.lemoal@opensource.wdc.com>; Luis Chamberlain > <mcgrof@kernel.org>; Keith Busch <kbusch@kernel.org>; Pankaj Raghav > <p.raghav@samsung.com>; Adam Manzanares > <a.manzanares@samsung.com>; jiangbo.365@bytedance.com; kanchan Joshi > <joshi.k@samsung.com>; Jens Axboe <axboe@kernel.dk>; Sagi Grimberg > <sagi@grimberg.me>; Pankaj Raghav <pankydev8@gmail.com>; Kanchan Joshi > <joshiiitr@gmail.com>; linux-block@vger.kernel.org; linux- > nvme@lists.infradead.org > Subject: Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices > > On 15.03.2022 14:30, Christoph Hellwig wrote: > >On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote: > >> but we do not see a usage for ZNS in F2FS, as it is a mobile > >> file-system. As other interfaces arrive, this work will become natural. > >> > >> ZoneFS and butrfs are good targets for ZNS and these we can do. I > >> would still do the work in phases to make sure we have enough early > >> feedback from the community. > >> > >> Since this thread has been very active, I will wait some time for > >> Christoph and others to catch up before we start sending code. > > > >Can someone summarize where we stand? Between the lack of quoting from > >hell and overly long lines from corporate mail clients I've mostly > >stopped reading this thread because it takes too much effort actually > >extract the information. > > Let me give it a try: > > - PO2 emulation in NVMe is a no-go. Drop this. > > - The arguments against supporting PO2 are: > - It makes ZNS depart from a SMR assumption of PO2 zone sizes. This > can create confusion for users of both SMR and ZNS > > - Existing applications assume PO2 zone sizes, and probably do > optimizations for these. These applications, if wanting to use > ZNS will have to change the calculations > > - There is a fear for performance regressions. > > - It adds more work to you and other maintainers > > - The arguments in favour of PO2 are: > - Unmapped LBAs create holes that applications need to deal with. > This affects mapping and performance due to splits. Bo explained > this in a thread from Bytedance's perspective. I explained in an > answer to Matias how we are not letting zones transition to > offline in order to simplify the host stack. Not sure if this is > something we want to bring to NVMe. > > - As ZNS adds more features and other protocols add support for > zoned devices we will have more use-cases for the zoned block > device. We will have to deal with these fragmentation at some > point. > > - This is used in production workloads in Linux hosts. I would > advocate for this not being off-tree as it will be a headache for > all in the future. > > - If you agree that removing PO2 is an option, we can do the following: > - Remove the constraint in the block layer and add ZoneFS support > in a first patch. > > - Add btrfs support in a later patch > > - Make changes to tools once merged > > Hope I have collected all points of view in such a short format. + Suggestion to enable all users in the kernel to limit fragmentation and maintainer burden. + Possible not a big issue as users already have added the necessary support and users already must manage offline zones and avoid writing across zones. 
+ Re: Bo's email, it sounds like this only affects a single vendor which knowingly made the decision to do NPO2 zone sizes. From Bo: "(What we discussed here has a precondition that is, we cannot determine if the SSD provider could change the FW to make it PO2 or not)".
On 15/03/2022 14:52, Javier González wrote: > On 15.03.2022 14:30, Christoph Hellwig wrote: >> On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote: >>> but we do not see a usage for ZNS in F2FS, as it is a mobile >>> file-system. As other interfaces arrive, this work will become natural. >>> >>> ZoneFS and butrfs are good targets for ZNS and these we can do. I would >>> still do the work in phases to make sure we have enough early feedback >>> from the community. >>> >>> Since this thread has been very active, I will wait some time for >>> Christoph and others to catch up before we start sending code. >> >> Can someone summarize where we stand? Between the lack of quoting >>from hell and overly long lines from corporate mail clients I've >> mostly stopped reading this thread because it takes too much effort >> actually extract the information. > > Let me give it a try: > > - PO2 emulation in NVMe is a no-go. Drop this. > > - The arguments against supporting PO2 are: > - It makes ZNS depart from a SMR assumption of PO2 zone sizes. This > can create confusion for users of both SMR and ZNS > > - Existing applications assume PO2 zone sizes, and probably do > optimizations for these. These applications, if wanting to use > ZNS will have to change the calculations > > - There is a fear for performance regressions. > > - It adds more work to you and other maintainers > > - The arguments in favour of PO2 are: > - Unmapped LBAs create holes that applications need to deal with. > This affects mapping and performance due to splits. Bo explained > this in a thread from Bytedance's perspective. I explained in an > answer to Matias how we are not letting zones transition to > offline in order to simplify the host stack. Not sure if this is > something we want to bring to NVMe. > > - As ZNS adds more features and other protocols add support for > zoned devices we will have more use-cases for the zoned block > device. We will have to deal with these fragmentation at some > point. > > - This is used in production workloads in Linux hosts. I would > advocate for this not being off-tree as it will be a headache for > all in the future. > > - If you agree that removing PO2 is an option, we can do the following: > - Remove the constraint in the block layer and add ZoneFS support > in a first patch. > > - Add btrfs support in a later patch (+ linux-btrfs ) Please also make sure to support btrfs and not only throw some patches over the fence. Zoned device support in btrfs is complex enough and has quite some special casing vs regular btrfs, which we're working on getting rid of. So having non-power-of-2 zone size, would also mean having NPO2 block-groups (and thus block-groups not aligned to the stripe size). Just thinking of this and knowing I need to support it gives me a headache. Also please consult the rest of the btrfs developers for thoughts on this. After all btrfs has full zoned support (including ZNS, not saying it's perfect) and is also the default FS for at least two Linux distributions. Thanks a lot, Johannes
On Tue, Mar 15, 2022 at 02:14:23PM +0000, Johannes Thumshirn wrote: > On 15/03/2022 14:52, Javier González wrote: > > On 15.03.2022 14:30, Christoph Hellwig wrote: > >> On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote: > >>> but we do not see a usage for ZNS in F2FS, as it is a mobile > >>> file-system. As other interfaces arrive, this work will become natural. > >>> > >>> ZoneFS and butrfs are good targets for ZNS and these we can do. I would > >>> still do the work in phases to make sure we have enough early feedback > >>> from the community. > >>> > >>> Since this thread has been very active, I will wait some time for > >>> Christoph and others to catch up before we start sending code. > >> > >> Can someone summarize where we stand? Between the lack of quoting > >> from hell and overly long lines from corporate mail clients I've > >> mostly stopped reading this thread because it takes too much effort > >> actually extract the information. > > > > Let me give it a try: > > > > - PO2 emulation in NVMe is a no-go. Drop this. > > > > - The arguments against supporting PO2 are: > > - It makes ZNS depart from a SMR assumption of PO2 zone sizes. This > > can create confusion for users of both SMR and ZNS > > > > - Existing applications assume PO2 zone sizes, and probably do > > optimizations for these. These applications, if wanting to use > > ZNS will have to change the calculations > > > > - There is a fear for performance regressions. > > > > - It adds more work to you and other maintainers > > > > - The arguments in favour of PO2 are: > > - Unmapped LBAs create holes that applications need to deal with. > > This affects mapping and performance due to splits. Bo explained > > this in a thread from Bytedance's perspective. I explained in an > > answer to Matias how we are not letting zones transition to > > offline in order to simplify the host stack. Not sure if this is > > something we want to bring to NVMe. > > > > - As ZNS adds more features and other protocols add support for > > zoned devices we will have more use-cases for the zoned block > > device. We will have to deal with these fragmentation at some > > point. > > > > - This is used in production workloads in Linux hosts. I would > > advocate for this not being off-tree as it will be a headache for > > all in the future. > > > > - If you agree that removing PO2 is an option, we can do the following: > > - Remove the constraint in the block layer and add ZoneFS support > > in a first patch. > > > > - Add btrfs support in a later patch > > (+ linux-btrfs ) > > Please also make sure to support btrfs and not only throw some patches > over the fence. Zoned device support in btrfs is complex enough and has > quite some special casing vs regular btrfs, which we're working on getting > rid of. So having non-power-of-2 zone size, would also mean having NPO2 > block-groups (and thus block-groups not aligned to the stripe size). > > Just thinking of this and knowing I need to support it gives me a > headache. PO2 is really easy to work with and I guess allocation on the physical device could also benefit from that, I'm still puzzled why the NPO2 is even proposed. We can possibly hide the calculations behind some API so I hope in the end it should be bearable. The size of block groups is flexible we only want some reasonable alignment. > Also please consult the rest of the btrfs developers for thoughts on this. 
> After all btrfs has full zoned support (including ZNS, not saying it's > perfect) and is also the default FS for at least two Linux distributions. I haven't read the whole thread yet, but my impression is that some hardware is deliberately breaking existing assumptions about zoned devices and in turn breaking btrfs support. I hope I'm wrong on that or at least that it's possible to work around it.
On 15.03.2022 14:14, Johannes Thumshirn wrote: >On 15/03/2022 14:52, Javier González wrote: >> On 15.03.2022 14:30, Christoph Hellwig wrote: >>> On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote: >>>> but we do not see a usage for ZNS in F2FS, as it is a mobile >>>> file-system. As other interfaces arrive, this work will become natural. >>>> >>>> ZoneFS and butrfs are good targets for ZNS and these we can do. I would >>>> still do the work in phases to make sure we have enough early feedback >>>> from the community. >>>> >>>> Since this thread has been very active, I will wait some time for >>>> Christoph and others to catch up before we start sending code. >>> >>> Can someone summarize where we stand? Between the lack of quoting >>>from hell and overly long lines from corporate mail clients I've >>> mostly stopped reading this thread because it takes too much effort >>> actually extract the information. >> >> Let me give it a try: >> >> - PO2 emulation in NVMe is a no-go. Drop this. >> >> - The arguments against supporting PO2 are: >> - It makes ZNS depart from a SMR assumption of PO2 zone sizes. This >> can create confusion for users of both SMR and ZNS >> >> - Existing applications assume PO2 zone sizes, and probably do >> optimizations for these. These applications, if wanting to use >> ZNS will have to change the calculations >> >> - There is a fear for performance regressions. >> >> - It adds more work to you and other maintainers >> >> - The arguments in favour of PO2 are: >> - Unmapped LBAs create holes that applications need to deal with. >> This affects mapping and performance due to splits. Bo explained >> this in a thread from Bytedance's perspective. I explained in an >> answer to Matias how we are not letting zones transition to >> offline in order to simplify the host stack. Not sure if this is >> something we want to bring to NVMe. >> >> - As ZNS adds more features and other protocols add support for >> zoned devices we will have more use-cases for the zoned block >> device. We will have to deal with these fragmentation at some >> point. >> >> - This is used in production workloads in Linux hosts. I would >> advocate for this not being off-tree as it will be a headache for >> all in the future. >> >> - If you agree that removing PO2 is an option, we can do the following: >> - Remove the constraint in the block layer and add ZoneFS support >> in a first patch. >> >> - Add btrfs support in a later patch > >(+ linux-btrfs ) > >Please also make sure to support btrfs and not only throw some patches >over the fence. Zoned device support in btrfs is complex enough and has >quite some special casing vs regular btrfs, which we're working on getting >rid of. So having non-power-of-2 zone size, would also mean having NPO2 >block-groups (and thus block-groups not aligned to the stripe size). Thanks for mentioning this Johannes. If we say we will work with you in supporting btrfs properly, we will. I believe you have seen already a couple of patches fixing things for zone support in btrfs in the last weeks. > >Just thinking of this and knowing I need to support it gives me a >headache. I hope we have help you with that. butrfs has no alignment to PO2 natively, so I am confident we can find a good solution. > >Also please consult the rest of the btrfs developers for thoughts on this. >After all btrfs has full zoned support (including ZNS, not saying it's >perfect) and is also the default FS for at least two Linux distributions. Of course. 
We will work with you and other btrfs developers. Luis is helping make sure that we have good tests for linux-next. This is in part how we have found the problems with Append, which should be fixed now. > >Thanks a lot, > Johannes
On Tue, Mar 15, 2022 at 02:30:52PM +0100, Christoph Hellwig wrote: > On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote: > > but we do not see a usage for ZNS in F2FS, as it is a mobile > > file-system. As other interfaces arrive, this work will become natural. > > > > ZoneFS and butrfs are good targets for ZNS and these we can do. I would > > still do the work in phases to make sure we have enough early feedback > > from the community. > > > > Since this thread has been very active, I will wait some time for > > Christoph and others to catch up before we start sending code. > > Can someone summarize where we stand? RFCs should be posted to help review and evaluate direct NPO2 support (not emulation) given we have no vendor willing to take a position that NPO2 will *never* be supported on ZNS, and it's not clear yet how many vendors other than Samsung actually require NPO2 support. The other reason is that existing NPO2 customers currently bake hacks into Linux to support NPO2, and so a fragmentation already exists. To help address this it's best to evaluate what the world of NPO2 support would look like, put in the effort to do the work for that, and review it. Luis
Hi Johannes, On 2022-03-15 15:14, Johannes Thumshirn wrote: > Please also make sure to support btrfs and not only throw some patches > over the fence. Zoned device support in btrfs is complex enough and has > quite some special casing vs regular btrfs, which we're working on getting > rid of. So having non-power-of-2 zone size, would also mean having NPO2 I already made a simple btrfs NPO2 PoC, and it mostly involved changing the PO2 calculations to generic ones. I understand that changing the calculations from using logs & shifts to division will incur some performance penalty, but I think we can wrap them with helpers to minimize that impact. > So having non-power-of-2 zone size, would also mean having NPO2 > block-groups (and thus block-groups not aligned to the stripe size). > I agree with your point that we risk not aligning to the stripe size, which I believe has a minimum of 64K (please correct me if I am wrong), when we move to an NPO2 zone size. As David Sterba mentioned in his email, we could agree on some reasonable alignment, which I believe would be the minimum stripe size of 64k, to avoid added complexity to the existing btrfs zoned support. And it is a much milder constraint, which most devices can naturally adhere to, compared to the PO2 zone size requirement. > Just thinking of this and knowing I need to support it gives me a > headache. > This is definitely not some one-off patch that we want to get upstream and then disappear. As Javier already pointed out, we would be more than happy to help you out here. > Also please consult the rest of the btrfs developers for thoughts on this. > After all btrfs has full zoned support (including ZNS, not saying it's > perfect) and is also the default FS for at least two Linux distributions. > > Thanks a lot, > Johannes
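A rough sketch of the kind of helpers described above, assuming nothing from the actual PoC (the structure and function names here are invented): the PO2 fast path keeps the shift/mask arithmetic used today, and NPO2 devices take a div/mod fallback.

```c
/* Sketch of zone-geometry helpers: keep the shift/mask fast path for PO2
 * zone sizes, fall back to div/mod only for NPO2 devices. */
#include <stdbool.h>
#include <stdint.h>

struct zone_geometry {
	uint64_t zone_sectors;		/* zone size in 512B sectors */
	unsigned int zone_shift;	/* log2(zone_sectors), valid only if pow2 */
	bool pow2;
};

/* Index of the zone containing a given sector. */
static inline uint64_t zone_index(const struct zone_geometry *zg, uint64_t sector)
{
	if (zg->pow2)
		return sector >> zg->zone_shift;	/* what is done today */
	return sector / zg->zone_sectors;		/* NPO2 fallback */
}

/* Offset of a sector within its zone. */
static inline uint64_t zone_offset(const struct zone_geometry *zg, uint64_t sector)
{
	if (zg->pow2)
		return sector & (zg->zone_sectors - 1);
	return sector % zg->zone_sectors;
}

/* Start sector of the zone that contains the given sector. */
static inline uint64_t zone_start(const struct zone_geometry *zg, uint64_t sector)
{
	return sector - zone_offset(zg, sector);
}
```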
Hi David, On 2022-03-15 15:27, David Sterba wrote: > > PO2 is really easy to work with and I guess allocation on the physical > device could also benefit from that, I'm still puzzled why the NPO2 is > even proposed. > Quick recap: NAND hardware cannot naturally align to PO2 zone sizes, which led to having both a zone capacity and a zone size, where zone capacity is the actual storage available in a zone. The main proposal is to remove the PO2 constraint to get rid of these LBA holes (generally speaking). That is why this whole effort was started. > We can possibly hide the calculations behind some API so I hope in the > end it should be bearable. The size of block groups is flexible we only > want some reasonable alignment. > I agree. I already replied to Johannes on what it might look like. To reiterate, the reasonable alignment I had in mind while doing a PoC for btrfs with NPO2 zone sizes is the minimum stripe size required by btrfs (64K), to reduce the impact of this change on the zoned support in btrfs. > I haven't read the whole thread yet, my impression is that some hardware > is deliberately breaking existing assumptions about zoned devices and in > turn breaking btrfs support. I hope I'm wrong on that or at least that > it's possible to work around it. Based on the PoC we did internally, it is definitely possible to support it in btrfs. And making this change will not break the existing btrfs support for zoned devices. A naive approach to making this change will have some performance impact, as we will be changing the PO2 calculations from logs & shifts to divisions and multiplications. I definitely think we can optimize it to minimize the impact on existing deployments.
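To make the two quantities in this exchange concrete, the short example below (an illustration with made-up numbers, not data from any specific drive) computes the unmapped hole a PO2 device exposes today and applies the looser 64K-alignment check that would replace the power-of-2 test; the 64K constant comes from the discussion above rather than from a btrfs header.

```c
/* Illustration only: the per-zone "hole" that exists today
 * (zone size - zone capacity) and the proposed 64K stripe-alignment
 * constraint that would replace the power-of-2 check. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SECTOR_SIZE	512ULL
#define MIN_STRIPE_LEN	(64ULL * 1024)	/* 64K, per the discussion above */

int main(void)
{
	/* Example geometry: a PO2 zone size with a smaller zone capacity. */
	uint64_t zone_size_sectors = 1ULL << 19;	/* 256 MiB */
	uint64_t zone_cap_sectors  = 437 * 1024;	/* ~218 MiB usable */

	uint64_t hole = zone_size_sectors - zone_cap_sectors;
	printf("unmapped hole per zone: %llu sectors (%llu MiB)\n",
	       (unsigned long long)hole,
	       (unsigned long long)(hole * SECTOR_SIZE >> 20));

	/* With zone size == zone capacity, the remaining check would be
	 * stripe alignment rather than a power-of-2 test. */
	bool ok = (zone_cap_sectors * SECTOR_SIZE) % MIN_STRIPE_LEN == 0;
	printf("64K-aligned NPO2 zone size acceptable: %s\n", ok ? "yes" : "no");
	return 0;
}
```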
On 3/15/22 22:05, Javier González wrote: >>> The main constraint for (1) PO2 is removed in the block layer, we >>> have (2) Linux hosts stating that unmapped LBAs are a problem, >>> and we have (3) HW supporting size=capacity. >>> >>> I would be happy to hear what else you would like to see for this >>> to be of use to the kernel community. >> >> (Added numbers to your paragraph above) >> >> 1. The sysfs chunksize attribute was "misused" to also represent >> zone size. What has changed is that RAID controllers now can use a >> NPO2 chunk size. This wasn't meant to naturally extend to zones, >> which as shown in the current posted patchset, is a lot more work. > > True. But this was the main constraint for PO2. And as I said, users asked for it. >> 2. Bo mentioned that the software already manages holes. It took a >> bit of time to get right, but now it works. Thus, the software in >> question is already capable of working with holes. Thus, fixing >> this, would present itself as a minor optimization overall. I'm not >> convinced the work to do this in the kernel is proportional to the >> change it'll make to the applications. > > I will let Bo response himself to this. > >> 3. I'm happy to hear that. However, I'll like to reiterate the >> point that the PO2 requirement have been known for years. That >> there's a drive doing NPO2 zones is great, but a decision was made >> by the SSD implementors to not support the Linux kernel given its >> current implementation. > > Zone devices has been supported for years in SMR, and I this is a > strong argument. However, ZNS is still very new and customers have > several requirements. I do not believe that a HDD stack should have > such an impact in NVMe. > > Also, we will see new interfaces adding support for zoned devices in > the future. > > We should think about the future and not the past. Backward compatibility ? We must not break userspace... >> >> All that said - if there are people willing to do the work and it >> doesn't have a negative impact on performance, code quality, >> maintenance complexity, etc. then there isn't anything saying >> support can't be added - but it does seem like it’s a lot of work, >> for little overall benefits to applications and the host users. > > Exactly. > > Patches in the block layer are trivial. This is running in > production loads without issues. I have tried to highlight the > benefits in previous benefits and I believe you understand them. The block layer is not the issue here. We all understand that one is easy. > Support for ZoneFS seems easy too. We have an early POC for btrfs and > it seems it can be done. We sign up for these 2. zonefs can trivially support non power of 2 zone sizes, but as zonefs creates a discrete view of the device capacity with its one file per zone interface, an application's accesses to a zone are forcibly limited to that zone, as they should be. With zonefs, pow2 and nonpow2 devices will show the *same* interface to the application. A non power of 2 zone size then has absolutely no benefit at all. > As for F2FS and dm-zoned, I do not think these are targets at the > moment. If this is the path we follow, these will bail out at mkfs > time. And what makes you think that this is acceptable ? What guarantees do you have that this will not be a problem for users out there ?
On 3/16/22 02:00, Luis Chamberlain wrote: > On Tue, Mar 15, 2022 at 02:30:52PM +0100, Christoph Hellwig wrote: >> On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote: >>> but we do not see a usage for ZNS in F2FS, as it is a mobile >>> file-system. As other interfaces arrive, this work will become natural. >>> >>> ZoneFS and butrfs are good targets for ZNS and these we can do. I would >>> still do the work in phases to make sure we have enough early feedback >>> from the community. >>> >>> Since this thread has been very active, I will wait some time for >>> Christoph and others to catch up before we start sending code. >> >> Can someone summarize where we stand? > > RFCs should be posted to help review and evaluate direct NPO2 support > (not emulation) given we have no vendor willing to take a position that > NPO2 will *never* be supported on ZNS, and its not clear yet how many > vendors other than Samsung actually require NPO2 support. The other > reason is existing NPO2 customers currently cake in hacks to Linux to > supoport NPO2 support, and so a fragmentation already exists. To help > address this it's best to evaluate what the world of NPO2 support would > look like and put the effort to do the work for that and review that. And again no mentions of all the applications supporting zones assuming a power of 2 zone size that will break. Seriously. Please stop considering the kernel only. If this were only about the kernel, we would all be working on patches already. Allowing non power of 2 zone size may prevent applications running today to run properly on these non power of 2 zone size devices. *not* nice. I have yet to see any convincing argument proving that this is not an issue.
On Wed, Mar 16, 2022 at 09:07:18AM +0900, Damien Le Moal wrote: > On 3/16/22 02:00, Luis Chamberlain wrote: > > On Tue, Mar 15, 2022 at 02:30:52PM +0100, Christoph Hellwig wrote: > >> On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote: > >>> but we do not see a usage for ZNS in F2FS, as it is a mobile > >>> file-system. As other interfaces arrive, this work will become natural. > >>> > >>> ZoneFS and butrfs are good targets for ZNS and these we can do. I would > >>> still do the work in phases to make sure we have enough early feedback > >>> from the community. > >>> > >>> Since this thread has been very active, I will wait some time for > >>> Christoph and others to catch up before we start sending code. > >> > >> Can someone summarize where we stand? > > > > RFCs should be posted to help review and evaluate direct NPO2 support > > (not emulation) given we have no vendor willing to take a position that > > NPO2 will *never* be supported on ZNS, and its not clear yet how many > > vendors other than Samsung actually require NPO2 support. The other > > reason is existing NPO2 customers currently cake in hacks to Linux to > > supoport NPO2 support, and so a fragmentation already exists. To help > > address this it's best to evaluate what the world of NPO2 support would > > look like and put the effort to do the work for that and review that. > > And again no mentions of all the applications supporting zones assuming > a power of 2 zone size that will break. What applications? ZNS does not incur a PO2 requirement. So I really want to know what applications make this assumption and would break if, all of a sudden, NPO2 is supported. Why would that break those ZNS applications? > Allowing non power of 2 zone size may prevent applications running today > to run properly on these non power of 2 zone size devices. *not* nice. Applications which want to support ZNS have to take into consideration that NPO2 is possible and there are existing users of that world today. You cannot negate their existence. > I have yet to see any convincing argument proving that this is not an issue. You are just saying things can break but not clarifying exactly what. And you have not taken a position to say WD will not ever support NPO2 on ZNS. And so, you can't negate the prospect of that implied path for support as a possibility, even if it means work towards the ecosystem today. Luis
On 3/16/22 09:23, Luis Chamberlain wrote: > On Wed, Mar 16, 2022 at 09:07:18AM +0900, Damien Le Moal wrote: >> On 3/16/22 02:00, Luis Chamberlain wrote: >>> On Tue, Mar 15, 2022 at 02:30:52PM +0100, Christoph Hellwig wrote: >>>> On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote: >>>>> but we do not see a usage for ZNS in F2FS, as it is a mobile >>>>> file-system. As other interfaces arrive, this work will become natural. >>>>> >>>>> ZoneFS and butrfs are good targets for ZNS and these we can do. I would >>>>> still do the work in phases to make sure we have enough early feedback >>>>> from the community. >>>>> >>>>> Since this thread has been very active, I will wait some time for >>>>> Christoph and others to catch up before we start sending code. >>>> >>>> Can someone summarize where we stand? >>> >>> RFCs should be posted to help review and evaluate direct NPO2 support >>> (not emulation) given we have no vendor willing to take a position that >>> NPO2 will *never* be supported on ZNS, and its not clear yet how many >>> vendors other than Samsung actually require NPO2 support. The other >>> reason is existing NPO2 customers currently cake in hacks to Linux to >>> supoport NPO2 support, and so a fragmentation already exists. To help >>> address this it's best to evaluate what the world of NPO2 support would >>> look like and put the effort to do the work for that and review that. >> >> And again no mentions of all the applications supporting zones assuming >> a power of 2 zone size that will break. > > What applications? ZNS does not incur a PO2 requirement. So I really > want to know what applications make this assumption and would break > because all of a sudden say NPO2 is supported. Exactly. What applications ? For ZNS, I cannot say as devices have not been available for long. But neither can you. > Why would that break those ZNS applications? Please keep in mind that there are power of 2 zone sized ZNS devices out there. Applications designed for these devices and optimized to do bit shift arithmetic using the power of 2 size property will break. What the plan for that case ? How will you address these users complaints ? >> Allowing non power of 2 zone size may prevent applications running today >> to run properly on these non power of 2 zone size devices. *not* nice. > > Applications which want to support ZNS have to take into consideration > that NPO2 is posisble and there existing users of that world today. Which is really an ugly approach. The kernel zone user interface is common to all zoned devices: SMR, ZNS, null_blk, DM (dm-crypt, dm-linear). They all have one point in common: zone size is a power of 2. Zone capacity may differ, but hey, we also unified that by reporting a zone capacity for *ALL* of them. Applications correctly designed for SMR can thus also run on ZNS too. With this in mind, the spectrum of applications that would break on non power of 2 ZNS devices is suddenly much larger. This has always been my concern from the start: allowing non power of 2 zone size fragments userspace support and has the potential to complicate things for application developers. > > You cannot negate their existance. > >> I have yet to see any convincing argument proving that this is not an issue. > > You are just saying things can break but not clarifying exactly what. > And you have not taken a position to say WD will not ever support NPO2 > on ZNS. 
And so, you can't negate the prospect of that implied path for > support as a possibility, even if it means work towards the ecosystem > today. Please do not bring corporate strategy aspects into this discussion. This is a technical discussion and I am not talking as a representative of my employer, nor should we ever discuss business plans on a public mailing list. I am a kernel developer and maintainer. Keep it technical please.
On Wed, Mar 16, 2022 at 09:46:44AM +0900, Damien Le Moal wrote: > On 3/16/22 09:23, Luis Chamberlain wrote: > > What applications? ZNS does not incur a PO2 requirement. So I really > > want to know what applications make this assumption and would break > > because all of a sudden say NPO2 is supported. > > Exactly. What applications ? For ZNS, I cannot say as devices have not > been available for long. But neither can you. I can tell you there is an existing NPO2 ZNS customer who chimed in on the discussion and described having to carry a delta to support NPO2 ZNS. So if you cannot tell me of a ZNS application which is going to break if NPO2 support is added, then your original point suggesting that there would be a break is not valid. > > Why would that break those ZNS applications? > > Please keep in mind that there are power of 2 zone sized ZNS devices out > there. No one is saying otherwise. > Applications designed for these devices and optimized to do bit > shift arithmetic using the power of 2 size property will break. They must not be ZNS. So they can continue to chug on. > What the > plan for that case ? How will you address these users complaints ? They are not ZNS so they don't have to worry about ZNS. ZNS applications must be aware of the fact that NPO2 can exist. ZNS applications must be aware of the fact that any vendor may one day sell NPO2 devices. > >> Allowing non power of 2 zone size may prevent applications running today > >> to run properly on these non power of 2 zone size devices. *not* nice. > >> > >> Applications which want to support ZNS have to take into consideration > >> that NPO2 is posisble and there existing users of that world today. > > > > Which is really an ugly approach. Ugly is relative and subjective. NAND does not force PO2. > > The kernel <etc> And back you go to kernel talk. I thought you wanted to focus on applications. > > Applications correctly designed for SMR can thus also run on ZNS too. That seems to be an incorrect assumption given ZNS drives exist with NPO2. So you can probably say that some SMR applications can work with PO2 ZNS drives. That is a more correct statement. > > With this in mind, the spectrum of applications that would break on non > > power of 2 ZNS devices is suddenly much larger. We already determined you cannot identify any ZNS specific application which would break. SMR != ZNS. If you really want to use SMR applications for ZNS that seems to be a bit beyond the scope of this discussion, but it seems to me that those SMR applications should simply learn that if a device is ZNS, NPO2 can be expected. As technologies evolve so do specifications. > This has always been my concern from the start: allowing non power of 2 > zone size fragments userspace support and has the potential to > complicate things for application developers. It's a reality though. Devices exist, and so do users. And they're carrying their own delta to support NPO2 ZNS today on Linux. > > You cannot negate their existance. > > > >> I have yet to see any convincing argument proving that this is not an issue. > > > > You are just saying things can break but not clarifying exactly what. > > And you have not taken a position to say WD will not ever support NPO2 > > on ZNS.
> This is a technical discussion and I am not talking as a representative > of my employer nor should we ever discuss business plans on a public > mailing list. I am a kernel developer and maintainer. Keep it technical > please. This conversation is about the reality that NPO2 ZNS devices exist and how best to support them. You seem to want to negate that reality and its support on Linux without even considering what the changes needed to support NPO2 ZNS would look like. As a maintainer I think we need to *evaluate* supporting users as best as possible. Not deny their existence. Even if it pains us. Luis
On 3/16/22 10:24, Luis Chamberlain wrote: > On Wed, Mar 16, 2022 at 09:46:44AM +0900, Damien Le Moal wrote: >> On 3/16/22 09:23, Luis Chamberlain wrote: >>> What applications? ZNS does not incur a PO2 requirement. So I really >>> want to know what applications make this assumption and would break >>> because all of a sudden say NPO2 is supported. >> >> Exactly. What applications ? For ZNS, I cannot say as devices have not >> been available for long. But neither can you. > > I can tell you we there is an existing NPO2 ZNS customer which chimed on > the discussion and they described having to carry a delta to support > NPO2 ZNS. So if you cannot tell me of a ZNS application which is going to > break to add NPO2 support then your original point is not valid of > suggesting that there would be a break. > >>> Why would that break those ZNS applications? >> >> Please keep in mind that there are power of 2 zone sized ZNS devices out >> there. > > No one is saying otherwise. > >> Applications designed for these devices and optimized to do bit >> shift arithmetic using the power of 2 size property will break. > > They must not be ZNS. So they can continue to chug on. > >> What the >> plan for that case ? How will you address these users complaints ? > > They are not ZNS so they don't have to worry about ZNS. > > ZNS applications must be aware of that fact that NPO2 can exist. > ZNS applications must be aware of that fact that any vendor may one day > sell NPO2 devices. > >>>> Allowing non power of 2 zone size may prevent applications running today >>>> to run properly on these non power of 2 zone size devices. *not* nice. >>> >>> Applications which want to support ZNS have to take into consideration >>> that NPO2 is posisble and there existing users of that world today. >> >> Which is really an ugly approach. > > Ugly is relative and subjective. NAND does not force PO2. > >> The kernel > > <etc> And back you go to kernel talk. I thought you wanted to > focus on applications. > >> Applications correctly designed for SMR can thus also run on ZNS too. > > That seems to be an incorrect assumption given ZNS drives exist > with NPO2. So you can probably say that some SMR applications can work > with PO2 ZNS drives. That is a more correct statement. > >> With this in mind, the spectrum of applications that would break on non >> power of 2 ZNS devices is suddenly much larger. > > We already determined you cannot identify any ZNS specific application > which would break. > > SMR != ZNS Not for the block layer nor for any in-kernel users above it today. We should not drive toward differentiating device types but unify them under a common interface that works for everything, including applications. That is why we have zone append emulation in the scsi disk driver. Considering the zone size requirement problem in the context of ZNS only is thus far from ideal in my opinion, to say the least.
On Wed, Mar 16, 2022 at 10:44:56AM +0900, Damien Le Moal wrote: > On 3/16/22 10:24, Luis Chamberlain wrote: > > SMR != ZNS > > Not for the block layer nor for any in-kernel <etc> Back to kernel talk; I thought you wanted to focus on applications. > Considering the zone size requirement problem in the context of ZNS only > is thus far from ideal in my opinion, to say the least. It's the reality for ZNS though. Luis
Luis, > Applications which want to support ZNS have to take into consideration > that NPO2 is possible and there are existing users of that world today. Every time a new technology comes along, vendors inevitably introduce first gen devices that are implemented with little consideration for the OS stacks they need to work with. This has happened for pretty much every technology I have been involved with over the years. So the fact that NPO2 devices exist is no argument. There are tons of devices out there that Linux does not support and never will. In early engagements SSD drive vendors proposed all sorts of weird NPO2 block sizes and alignments that, it was argued, were *incontestable* requirements for building NAND devices. And yet a generation or two later every SSD transparently handled 512-byte or 4096-byte logical blocks just fine. Imagine if we had re-engineered the entire I/O stack to accommodate these awful designs? Similarly, many proponents suggested oddball NPO2 sizes for SMR zones. And yet the market very quickly settled on PO2 once things started shipping in volume. Simplicity and long term maintainability of the kernel should always take precedence as far as I'm concerned.
On Tue, Mar 15, 2022 at 10:27:32PM -0400, Martin K. Petersen wrote: > Simplicity and long term maintainability of the kernel should always > take precedence as far as I'm concerned. No one is arguing against that. It is not even clear what all the changes are. So to argue that the sky will fall seems a bit too early without seeing patches, don't you think? Luis
On 15/03/2022 19:51, Pankaj Raghav wrote: >> ck-groups (and thus block-groups not aligned to the stripe size). >> > I agree with your point that we risk not aligning to the stripe size when we > move to npo2 zone sizes, where I believe the minimum stripe size is 64K (please > correct me if I am wrong). As David Sterba mentioned in his email, we > could agree on some reasonable alignment, which I believe would be the > minimum stripe size of 64k, to avoid added complexity to the existing > btrfs zoned support. And it is a much milder constraint, one that most > devices can naturally adhere to, compared to the po2 zone size requirement. > What could be done is rounding a zone down to the nearest po2 (which is then 64k aligned), but then we need to explicitly finish the zones.
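A rough sketch of what "rounding a zone down" could look like, illustrative only and not btrfs code: the usable portion of each zone becomes the largest power of 2 not exceeding the device-reported zone size (any such value of 64K or more is automatically 64K-aligned), and the remainder is given up by explicitly finishing the zone once the usable part is full.

```c
/* Illustrative sketch, not btrfs code: pick the usable zone size as the
 * largest power of 2 that fits in the device-reported zone size. */
#include <stdint.h>

static uint64_t rounddown_pow_of_2(uint64_t v)
{
	uint64_t p = 1;

	while (p <= v / 2)
		p <<= 1;
	return p;	/* largest power of 2 <= v, assuming v > 0 */
}

/*
 * Example: a device zone size of 1077 MiB would be used as 1024 MiB;
 * once the usable part is full, the zone is explicitly finished and the
 * remaining 53 MiB are never written.
 */
static uint64_t usable_zone_size(uint64_t dev_zone_size)
{
	return rounddown_pow_of_2(dev_zone_size);
}
```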
On 15.03.2022 22:27, Martin K. Petersen wrote: [...] >Simplicity and long term maintainability of the kernel should always >take precedence as far as I'm concerned. Martin, you are absolutely right. The argument is not that there is available HW. The argument is that as we tried to retrofit ZNS into the zoned block device, the gap between zone size and capacity has brought adoption issues for some customers. I would still like to wait and give it some time to get feedback on the plan I proposed yesterday before we post patches. At this point, I would very much like to hear your opinion on how the changes would incur a maintainability problem. Nobody wants that.
On 16.03.2022 09:00, Damien Le Moal wrote: >On 3/15/22 22:05, Javier González wrote: >>>> The main constraint for (1) PO2 is removed in the block layer, we >>>> have (2) Linux hosts stating that unmapped LBAs are a problem, >>>> and we have (3) HW supporting size=capacity. >>>> >>>> I would be happy to hear what else you would like to see for this >>>> to be of use to the kernel community. >>> >>> (Added numbers to your paragraph above) >>> >>> 1. The sysfs chunksize attribute was "misused" to also represent >>> zone size. What has changed is that RAID controllers can now use a >>> NPO2 chunk size. This wasn't meant to naturally extend to zones, >>> which, as shown in the currently posted patchset, is a lot more work. >> >> True. But this was the main constraint for PO2. >And as I said, users asked for it. Now users are asking for arbitrary zone sizes. [...] >>> 3. I'm happy to hear that. However, I'd like to reiterate the >>> point that the PO2 requirement has been known for years. That >>> there's a drive doing NPO2 zones is great, but a decision was made >>> by the SSD implementors to not support the Linux kernel given its >>> current implementation. >> >> Zoned devices have been supported for years in SMR, and this is a >> strong argument. However, ZNS is still very new and customers have >> several requirements. I do not believe that an HDD stack should have >> such an impact on NVMe. >> >> Also, we will see new interfaces adding support for zoned devices in >> the future. >> >> We should think about the future and not the past. >Backward compatibility ? We must not break userspace... This is not a user API change. If making changes to applications to adopt new features and technologies is breaking user-space, then the zoned block device already broke that when we introduced zone capacity. Any existing zoned application _will have to_ make changes to work on ZNS anyway.
Hi Damien, On 2022-03-16 01:00, Damien Le Moal wrote: >> As for F2FS and dm-zoned, I do not think these are targets at the >> moment. If this is the path we follow, these will bail out at mkfs >> time. > And what makes you think that this is acceptable ? What guarantees do > you have that this will not be a problem for users out there ? As you know, the architecture of F2FS at the moment requires PO2 segments; therefore, it might not be possible to support non-PO2 ZNS drives. So we could continue supporting PO2 ZNS drives for F2FS and bail out at mkfs time if it is a non-PO2 ZNS drive (this is the current behavior as well). This way we do not break anything for the ZNS drives that have already been deployed for F2FS users.
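As a rough illustration of the "bail out at mkfs time" behavior described above, the check amounts to refusing the format rather than producing a broken filesystem. This is only a sketch with hypothetical helper names, not actual f2fs-tools code:

```c
/* Sketch only; helper names and message are hypothetical, not f2fs-tools code. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool is_power_of_2(uint64_t v)
{
	return v && (v & (v - 1)) == 0;
}

static int check_zone_size(uint64_t zone_size_sectors)
{
	if (!is_power_of_2(zone_size_sectors)) {
		fprintf(stderr,
			"zone size %llu sectors is not a power of 2: unsupported\n",
			(unsigned long long)zone_size_sectors);
		return -1;	/* refuse to format the device */
	}
	return 0;
}
```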
On 3/11/2022 1:51 PM, Keith Busch wrote: > On Fri, Mar 11, 2022 at 12:19:38PM -0800, Luis Chamberlain wrote: >> NAND has no PO2 requirement. The emulation effort was only done to help >> add support for !PO2 devices because there is no alternative. If we >> however are ready instead to go down the avenue of removing those >> restrictions well let's go there then instead. If that's not even >> something we are willing to consider I'd really like folks who stand >> behind the PO2 requirement to stick their necks out and clearly say that >> their hw/fw teams are happy to deal with this requirement forever on ZNS. > > Regardless of the merits of the current OS requirement, it's a trivial > matter for firmware to round up their reported zone size to the next > power of 2. This does not create a significant burden on their part, as > far as I know. Sure wonder why !PO2 keeps coming up if it's so trivial to fix in firmware as you claim. I actually find the hubris of the Linux community wrt the whole PO2 requirement pretty exhausting. Consider that some SSD manufacturers are having to contend with a NAND shortage and existing ASIC architecture limitations that may define the sizes of their erase blocks and write units. A !PO2 implementation in the Linux kernel would enable consumers to choose from more options in the marketplace for their Linux ZNS application.
On Mon, Mar 21, 2022 at 10:21:36AM -0600, Jonathan Derrick wrote: [...] > Sure wonder why !PO2 keeps coming up if it's so trivial to fix in firmware as you claim. The triviality of adjusting alignment in firmware has nothing to do with some users' desire to not see gaps in LBA space. > I actually find the hubris of the Linux community wrt the whole PO2 requirement > pretty exhausting. > > Consider that some SSD manufacturers are having to contend with a NAND shortage and > existing ASIC architecture limitations that may define the sizes of their erase blocks > and write units. A !PO2 implementation in the Linux kernel would enable consumers > to choose from more options in the marketplace for their Linux ZNS application. All zoned block devices in the Linux kernel use a common abstraction interface. Users expect that you can swap out one zoned device for another and all their previously used features will continue to work. That does not necessarily hold if the long existing zone alignment requirement is relaxed. Fragmenting use cases harms adoption, so this discussion seems appropriate.
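To make the trade-off in this sub-thread concrete: if firmware rounds its reported zone size up to the next power of 2 while the writable zone capacity stays at its native value, every zone carries an unmapped gap, which is exactly the "holes in LBA space" objection raised earlier in the thread. A small illustrative calculation follows; the 1077 MiB capacity is an example number only, not taken from any real device.

```c
/* Illustrative only; the zone capacity below is a made-up example. */
#include <stdint.h>
#include <stdio.h>

static uint64_t roundup_pow_of_2(uint64_t v)
{
	uint64_t p = 1;

	while (p < v)
		p <<= 1;
	return p;	/* smallest power of 2 >= v, assuming v > 0 */
}

int main(void)
{
	uint64_t zone_cap  = 1077ULL * 2048;             /* writable capacity in 512B sectors */
	uint64_t zone_size = roundup_pow_of_2(zone_cap); /* what a PO2-rounding firmware reports */

	printf("cap=%llu size=%llu hole=%llu sectors per zone\n",
	       (unsigned long long)zone_cap,
	       (unsigned long long)zone_size,
	       (unsigned long long)(zone_size - zone_cap));
	return 0;
}
```

With size=capacity, the NPO2 proposal, the gap disappears; with the round-up approach, the host must instead learn to skip the gap between zone capacity and zone size.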