diff mbox series

[3/3] nvme: don't set io_opt if NOWS is zero

Message ID 20240701051800.1245240-4-hch@lst.de (mailing list archive)
State New, archived
Headers show
Series [1/3] block: remove a duplicate io_min check in blk_validate_limits | expand

Commit Message

Christoph Hellwig July 1, 2024, 5:17 a.m. UTC
NOWS is one of the annoying "0's based values" in NVMe, where 0 means one
and we thus can't detect if it isn't set.  Thus a NOWS value of 0 means
that the Namespace Optimal Write Size is a single LBA, which is clearly
bogus.  Ignore the value in that case and don't propagate an io_opt
value to the block layer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Chaitanya Kulkarni July 1, 2024, 7:02 a.m. UTC | #1
On 6/30/24 22:17, Christoph Hellwig wrote:
> NOWS is one of the annoying "0's based values" in NVMe, where 0 means one
> and we thus can't detect if it isn't set.  Thus a NOWS value of 0 means
> that the Namespace Optimal Write Size is a single LBA, which is clearly
> bogus.  Ignore the value in that case and don't propagate an io_opt
> value to the block layer.
>
> Signed-off-by: Christoph Hellwig<hch@lst.de>

indeed (Figure 97: Identify -> Namespace Optimal Write Size (NOWS)).

wonder why can't spec use 0xFFFF to indicate not supported for all
zero based values to avoid this confusion ?

Looks good.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>

-ck
Sagi Grimberg July 1, 2024, 7:08 a.m. UTC | #2
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Keith Busch July 1, 2024, 1:55 p.m. UTC | #3
On Mon, Jul 01, 2024 at 07:17:52AM +0200, Christoph Hellwig wrote:
> NOWS is one of the annoying "0's based values" in NVMe, where 0 means one
> and we thus can't detect if it isn't set.  

We can detect if it is set based on the namespace features flags,
though.

> Thus a NOWS value of 0 means
> that the Namespace Optimal Write Size is a single LBA, which is clearly
> bogus.  Ignore the value in that case and don't propagate an io_opt
> value to the block layer.

Hm, why is that clearly bogus? Optane SSDs were optimized for
single-sector writes.
Christoph Hellwig July 1, 2024, 3:11 p.m. UTC | #4
On Mon, Jul 01, 2024 at 07:55:37AM -0600, Keith Busch wrote:
> On Mon, Jul 01, 2024 at 07:17:52AM +0200, Christoph Hellwig wrote:
> > NOWS is one of the annoying "0's based values" in NVMe, where 0 means one
> > and we thus can't detect if it isn't set.  
> 
> We can detect if it is set based on the namespace features flags,
> though.

Except that the flag covers 5 different flags, for 2 different operations.

> > Thus a NOWS value of 0 means
> > that the Namespace Optimal Write Size is a single LBA, which is clearly
> > bogus.  Ignore the value in that case and don't propagate an io_opt
> > value to the block layer.
> 
> Hm, why is that clearly bogus? Optane SSDs were optimized for
> single-sector writes.

Because even if they optimize for it, the pure overhead of both
the PCIe physical layer, nvme command overhead and software overhead
will never make it more optimal to split a larger I/O down to this
size, which the optimal write size is about.
diff mbox series

Patch

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ab0429644fe378..6784fc6e95625e 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2025,7 +2025,8 @@  static bool nvme_update_disk_info(struct nvme_ns *ns, struct nvme_id_ns *id,
 		/* NPWG = Namespace Preferred Write Granularity */
 		phys_bs = bs * (1 + le16_to_cpu(id->npwg));
 		/* NOWS = Namespace Optimal Write Size */
-		io_opt = bs * (1 + le16_to_cpu(id->nows));
+		if (id->nows)
+			io_opt = bs * (1 + le16_to_cpu(id->nows));
 	}
 
 	/*