[v3,14/15] nvme: Support atomic writes

Message ID 20240124113841.31824-15-john.g.garry@oracle.com (mailing list archive)
State Superseded
Series: block atomic writes

Commit Message

John Garry Jan. 24, 2024, 11:38 a.m. UTC
From: Alan Adamson <alan.adamson@oracle.com>

Support reading atomic write registers to fill in request_queue
properties.

Use the following method to calculate the limits:
atomic_write_max_bytes = flp2(NAWUPF ?: AWUPF)
atomic_write_unit_min = logical_block_size
atomic_write_unit_max = flp2(NAWUPF ?: AWUPF)
atomic_write_boundary = NABSPF
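
For illustration, here is a minimal userspace sketch of that derivation,
using hypothetical identify values (not taken from any real device);
flp2() stands in for the kernel's rounddown_pow_of_two():

	#include <stdint.h>
	#include <stdio.h>

	/* Round down to the nearest power of two */
	static uint32_t flp2(uint32_t x)
	{
		while (x & (x - 1))
			x &= x - 1;	/* clear the lowest set bit */
		return x;
	}

	int main(void)
	{
		uint32_t bs = 512;	/* logical block size */
		uint16_t nawupf = 15;	/* 0's based: 16 logical blocks */
		uint16_t nabspf = 15;	/* 0's based boundary size */
		uint32_t atomic_bs = (1 + nawupf) * bs;

		printf("atomic_write_max_bytes = %u\n", flp2(atomic_bs));	/* 8192 */
		printf("atomic_write_unit_min  = %u\n", bs);			/* 512 */
		printf("atomic_write_unit_max  = %u\n", flp2(atomic_bs));	/* 8192 */
		printf("atomic_write_boundary  = %u\n", (1 + nabspf) * bs);	/* 8192 */
		return 0;
	}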

Signed-off-by: Alan Adamson <alan.adamson@oracle.com>
#jpg: some rewrite
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 drivers/nvme/host/core.c | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

Comments

Christoph Hellwig Feb. 13, 2024, 6:42 a.m. UTC | #1
On Wed, Jan 24, 2024 at 11:38:40AM +0000, John Garry wrote:
> From: Alan Adamson <alan.adamson@oracle.com>
> 
> Support reading atomic write registers to fill in request_queue
> properties.
> 
> Use the following method to calculate the limits:
> atomic_write_max_bytes = flp2(NAWUPF ?: AWUPF)
> atomic_write_unit_min = logical_block_size
> atomic_write_unit_max = flp2(NAWUPF ?: AWUPF)
> atomic_write_boundary = NABSPF

Can you expand this to actually be a real commit log with full
sentences, expanding the NVMe field name acronyms and referencing
the relevant sections and figures in a specific version of the
NVMe specification?

Also some implementation comments:

NVMe has a particularly nasty NABO field in Identify Namespace, which
offsets the boundary. We probably need to reject atomic writes or
severely limit them if this field is set.

Please also read through TP4098(a) and look at the MAM field.  As far
as I can tell, the patch as-is assumes it is always set to 1.

> +static void nvme_update_atomic_write_disk_info(struct gendisk *disk,
> +		struct nvme_ctrl *ctrl, struct nvme_id_ns *id, u32 bs, u32 atomic_bs)

Please avoid the overly long line here.

> +		nvme_update_atomic_write_disk_info(disk, ctrl, id, bs, atomic_bs);

.. and here.
John Garry Feb. 13, 2024, 2:21 p.m. UTC | #2
On 13/02/2024 06:42, Christoph Hellwig wrote:
> On Wed, Jan 24, 2024 at 11:38:40AM +0000, John Garry wrote:
>> From: Alan Adamson <alan.adamson@oracle.com>
>>
>> Support reading atomic write registers to fill in request_queue
>> properties.
>>
>> Use the following method to calculate the limits:
>> atomic_write_max_bytes = flp2(NAWUPF ?: AWUPF)
>> atomic_write_unit_min = logical_block_size
>> atomic_write_unit_max = flp2(NAWUPF ?: AWUPF)
>> atomic_write_boundary = NABSPF
> 
> Can you expand this to actually be a real commit log with full
> sentences, expanding the NVME field name acronyms and reference
> the relevant Sections and Figures in a specific version of the
> NVMe specification?

ok

> 
> Also some implementation comments:
> 
> NVMe has a particularly nasty NABO field in Identify Namespace, which
> offsets the boundary. We probably need to reject atomic writes or
> severely limit them if this field is set.

ok, and initially we'll just not support atomic writes for NABO != 0
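
A guard along those lines might look like the sketch below. This is not
part of the posted patch, just an illustration; nabo is the Namespace
Atomic Boundary Offset field in struct nvme_id_ns:

	/* Sketch only: NABO offsets every atomic boundary, which the
	 * limits above do not model, so leave them all at zero. */
	if (le16_to_cpu(id->nabo)) {
		dev_notice(ctrl->device,
			   "Unsupported atomic boundary offset (NABO=%u)\n",
			   le16_to_cpu(id->nabo));
		return;
	}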

> 
> Please also read through TP4098(a) and look at the MAM field.

It's not public, AFAIK.

And I don't think a feature which allows us to straddle boundaries is 
too interesting really.

>  As far
> as I can tell, the patch as-is assumes it is always set to 1.
> 
>> +static void nvme_update_atomic_write_disk_info(struct gendisk *disk,
>> +		struct nvme_ctrl *ctrl, struct nvme_id_ns *id, u32 bs, u32 atomic_bs)
> 
> Please avoid the overly long line here.
> 
>> +		nvme_update_atomic_write_disk_info(disk, ctrl, id, bs, atomic_bs);
> 
> .. and here.

ok, but I think that this will get shorter anyway.

> 

Thanks,
John
Christoph Hellwig Feb. 14, 2024, 8 a.m. UTC | #3
On Tue, Feb 13, 2024 at 02:21:25PM +0000, John Garry wrote:
>> Please also read through TP4098(a) and look at the MAM field.
>
> It's not public, AFAIK.

Oracle is a member, so you can take a look at it easily.  If we need
it for Linux I can also work with the NVMe Board to release it.

> And I don't think a feature which allows us to straddle boundaries is too 
> interesting really.

Without MAM=1 NVMe can't support atomic writes larger than
AWUPF/NAWUPF, which is typically set to the indirection table
size.  You're leaving a lot of potential unused with that.
John Garry Feb. 14, 2024, 9:21 a.m. UTC | #4
On 14/02/2024 08:00, Christoph Hellwig wrote:
> On Tue, Feb 13, 2024 at 02:21:25PM +0000, John Garry wrote:
>>> Please also read through TP4098(a) and look at the MAM field.
>>
>> It's not public, AFAIK.
> 
> Oracle is a member, so you can take a look at it easily.  If we need
> it for Linux I can also work with the NVMe Board to release it.

What I really meant was that I prefer not to quote private TPs in the
public domain. I have the doc.

> 
>> And I don't think a feature which allows us to straddle boundaries is too
>> interesting really.
> 
> Without MAM=1 NVMe can't support atomic writes larger than
> AWUPF/NAWUPF, which is typically set to the indirection table
> size.  You're leaving a lot of potential unused with that.
> 

atomic_write_unit_max would always be dictated by the min of the boundary
and AWUPF/NAWUPF. With MAM=1, we could increase atomic_write_max_bytes -
but does it really help us? I mean, atomic_write_max_bytes only comes into
play for merging - do we get much merging for NVMe transports? I am not
sure.
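
(As a hypothetical illustration: with a 64 KiB NABSPF boundary and a
128 KiB NAWUPF-derived limit, atomic_write_unit_max would be
min(64 KiB, 128 KiB) = 64 KiB; MAM=1 would only let
atomic_write_max_bytes, and hence merged requests, go beyond that.)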

Thanks,
John
Nilay Shroff Feb. 14, 2024, 12:27 p.m. UTC | #5
>Support reading atomic write registers to fill in request_queue
>properties.

>Use the following method to calculate the limits:
>atomic_write_max_bytes = flp2(NAWUPF ?: AWUPF)
>atomic_write_unit_min = logical_block_size
>atomic_write_unit_max = flp2(NAWUPF ?: AWUPF)
>atomic_write_boundary = NABSPF

In case the device doesn't support a namespace atomic boundary size (i.e. NABSPF
is zero), then while merging atomic block-IO we should allow the merge.

For example, while front/back merging atomic block IO, we check whether
the boundary is defined or not. In case the boundary is not defined (i.e. it's
zero), then we simply reject the merge attempt (as implemented in
rq_straddles_atomic_write_boundary()).

I am quoting this from the NVMe spec (Command Set Specification, revision 1.0a,
Section 2.1.4.3): "To ensure backwards compatibility, the values reported for
AWUN, AWUPF, and ACWU shall be set such that they are supported even if a
write crosses an atomic boundary. If a controller does not guarantee
atomicity across atomic boundaries, the controller shall set AWUN, AWUPF, and
ACWU to 0h (1 LBA)."

Thanks,
--Nilay
John Garry Feb. 14, 2024, 1:02 p.m. UTC | #6
On 14/02/2024 12:27, Nilay Shroff wrote:
>> Support reading atomic write registers to fill in request_queue
> 
>> properties.
> 
> 
> 
>> Use the following method to calculate the limits:
> 
>> atomic_write_max_bytes = flp2(NAWUPF ?: AWUPF)
> 

You still need to fix that mail client to not add extra blank lines.

>> atomic_write_unit_min = logical_block_size
> 
>> atomic_write_unit_max = flp2(NAWUPF ?: AWUPF)
> 
>> atomic_write_boundary = NABSPF
> 
> 
> 
> In case the device doesn't support a namespace atomic boundary size (i.e. NABSPF
> 
> is zero), then while merging atomic block-IO we should allow the merge.
> 
>   
> 
> For example, while front/back merging atomic block IO, we check whether
> 
> the boundary is defined or not. In case the boundary is not defined (i.e. it's
> 
> zero), then we simply reject the merge attempt (as implemented in
> 
> rq_straddles_atomic_write_boundary()).

Are you sure about that? In rq_straddles_atomic_write_boundary(), if 
boundary == 0, then we return false, i.e. there is no boundary, so we 
can never be crossing it.

static bool rq_straddles_atomic_write_boundary(struct request *rq,
					       unsigned int front,
					       unsigned int back)
{
	unsigned int boundary = queue_atomic_write_boundary_bytes(rq->q);
	unsigned int mask, imask;
	loff_t start, end;

	/* No boundary configured, so nothing can straddle it */
	if (!boundary)
		return false;

	...
}

And then will not reject a merge for that reason, like:

int ll_back_merge_fn(struct request *req, struct bio *bio,
		     unsigned int nr_segs)
{
	...

	if (req->cmd_flags & REQ_ATOMIC) {
		/* Only reject the merge if it would cross a boundary */
		if (rq_straddles_atomic_write_boundary(req,
			0, bio->bi_iter.bi_size)) {
			return 0;
		}
	}

	return ll_new_hw_segment(req, bio, nr_segs);
}
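
(Worked example with made-up numbers: for a 16 KiB boundary, an atomic
request covering bytes 0-16383 lies within one boundary interval, so
rq_straddles_atomic_write_boundary() returns false and the merge may
proceed; back-merging a further 4 KiB bio would extend it to bytes
0-20479, crossing the 16384 mark, so the function returns true and
ll_back_merge_fn() rejects the merge.)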


> 
> 
> 
> I am quoting this from NVMe spec (Command Set Specification, revision 1.0a,
> 
> Section 2.1.4.3) : "To ensure backwards compatibility, the values reported for
> 
> AWUN, AWUPF, and ACWU shall be set such that they  are  supported  even  if  a
> 
> write  crosses  an  atomic  boundary.  If  a  controller  does  not  guarantee
> 
> atomicity across atomic boundaries, the controller shall set AWUN, AWUPF, and
> 
> ACWU to 0h (1 LBA)."
> 
> 
> 

Thanks,
John
Nilay Shroff Feb. 14, 2024, 4:45 p.m. UTC | #7
On 2/14/24 18:32, John Garry wrote:
> On 14/02/2024 12:27, Nilay Shroff wrote:
>>
>>
>>
>>> Use the following method to calculate the limits:
>>
>>> atomic_write_max_bytes = flp2(NAWUPF ?: AWUPF)
>>
> 
> You still need to fix that mail client to not add extra blank lines.
Yes, I am working on it. I hope it's solved now. 
> 
>>> atomic_write_unit_min = logical_block_size
>>
>>> atomic_write_unit_max = flp2(NAWUPF ?: AWUPF)
>>
>>> atomic_write_boundary = NABSPF
>>
>>
>>
>> In case the device doesn't support a namespace atomic boundary size (i.e. NABSPF
>>
>> is zero), then while merging atomic block-IO we should allow the merge.
>>
>>  
>> For example, while front/back merging atomic block IO, we check whether
>>
>> the boundary is defined or not. In case the boundary is not defined (i.e. it's
>>
>> zero), then we simply reject the merge attempt (as implemented in
>>
>> rq_straddles_atomic_write_boundary()).
> 
> Are you sure about that? In rq_straddles_atomic_write_boundary(), if boundary == 0, then we return false, i.e. there is no boundary, so we can never be crossing it.
> 
> static bool rq_straddles_atomic_write_boundary(struct request *rq,
>                                                unsigned int front,
>                                                unsigned int back)
> {
>     unsigned int boundary = queue_atomic_write_boundary_bytes(rq->q);
>     unsigned int mask, imask;
>     loff_t start, end;
> 
>     /* No boundary configured, so nothing can straddle it */
>     if (!boundary)
>         return false;
> 
>     ...
> }
> 
> And then will not reject a merge for that reason, like:
> 
> int ll_back_merge_fn(struct request *req, struct bio *bio,
>                      unsigned int nr_segs)
> {
>     ...
> 
>     if (req->cmd_flags & REQ_ATOMIC) {
>         /* Only reject the merge if it would cross a boundary */
>         if (rq_straddles_atomic_write_boundary(req,
>             0, bio->bi_iter.bi_size)) {
>             return 0;
>         }
>     }
> 
>     return ll_new_hw_segment(req, bio, nr_segs);
> }
> 
> 
Aargh, you are right. I see that if rq_straddles_atomic_write_boundary() returns true then we avoid the merge, otherwise the merge is attempted. My bad...

Thanks,
--Nilay

Patch

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 85ab0fcf9e88..5045c84f2516 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1911,6 +1911,44 @@  static void nvme_set_queue_limits(struct nvme_ctrl *ctrl,
 	blk_queue_write_cache(q, vwc, vwc);
 }
 
+static void nvme_update_atomic_write_disk_info(struct gendisk *disk,
+		struct nvme_ctrl *ctrl, struct nvme_id_ns *id, u32 bs, u32 atomic_bs)
+{
+	unsigned int unit_min = 0, unit_max = 0, boundary = 0, max_bytes = 0;
+	struct request_queue *q = disk->queue;
+
+	if (id->nsfeat & NVME_NS_FEAT_ATOMICS && id->nawupf) {
+		if (le16_to_cpu(id->nabspf))
+			boundary = (le16_to_cpu(id->nabspf) + 1) * bs;
+
+		/*
+		 * The boundary size just needs to be a multiple of unit_max
+		 * (and not necessarily a power-of-2), so this could be relaxed
+		 * in the block layer in future.
+		 * Furthermore, if needed, unit_max could be reduced so that the
+		 * boundary size was compliant - but that is not supported yet.
+		 */
+		if (!boundary || is_power_of_2(boundary)) {
+			max_bytes = atomic_bs;
+			unit_min = bs;
+			unit_max = rounddown_pow_of_two(atomic_bs);
+		} else {
+			dev_notice(ctrl->device, "Unsupported atomic write boundary (%d)\n",
+				boundary);
+			boundary = 0;
+		}
+	} else if (ctrl->subsys->awupf) {
+		max_bytes = atomic_bs;
+		unit_min = bs;
+		unit_max = rounddown_pow_of_two(atomic_bs);
+	}
+
+	blk_queue_atomic_write_max_bytes(q, max_bytes);
+	blk_queue_atomic_write_unit_min_sectors(q, unit_min >> SECTOR_SHIFT);
+	blk_queue_atomic_write_unit_max_sectors(q, unit_max >> SECTOR_SHIFT);
+	blk_queue_atomic_write_boundary_bytes(q, boundary);
+}
+
 static void nvme_update_disk_info(struct nvme_ctrl *ctrl, struct gendisk *disk,
 		struct nvme_ns_head *head, struct nvme_id_ns *id)
 {
@@ -1941,6 +1979,8 @@  static void nvme_update_disk_info(struct nvme_ctrl *ctrl, struct gendisk *disk,
 			atomic_bs = (1 + le16_to_cpu(id->nawupf)) * bs;
 		else
 			atomic_bs = (1 + ctrl->subsys->awupf) * bs;
+
+		nvme_update_atomic_write_disk_info(disk, ctrl, id, bs, atomic_bs);
 	}
 
 	if (id->nsfeat & NVME_NS_FEAT_IO_OPT) {