
[v3,09/15] block: Add checks to merging of atomic writes

Message ID 20240124113841.31824-10-john.g.garry@oracle.com (mailing list archive)
State New, archived
Series: block atomic writes

Commit Message

John Garry Jan. 24, 2024, 11:38 a.m. UTC
For atomic writes we allow merging, but we must adhere to some additional
rules:
- Only allow merging of atomic writes with other atomic writes. This avoids
  the need for checks to ensure that the resultant merged IO still adheres
  to the underlying device's atomic write properties. It also avoids possible
  unnecessary overhead in the device from submitting the whole resultant
  merged IO with an atomic write command, as would be done for SCSI.
- Ensure that the merged IO would not cross an atomic write boundary, if
  any.

We already ensure that we don't exceed the atomic write size limit in
get_max_io_size().

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 block/blk-merge.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 75 insertions(+)

Comments

Nilay Shroff Feb. 12, 2024, 10:54 a.m. UTC | #1
>+static bool rq_straddles_atomic_write_boundary(struct request *rq,
>+					unsigned int front,
>+					unsigned int back)
>+{
>+	unsigned int boundary = queue_atomic_write_boundary_bytes(rq->q);
>+	unsigned int mask, imask;
>+	loff_t start, end;
>+
>+	if (!boundary)
>+		return false;
>+
>+	start = rq->__sector << SECTOR_SHIFT;
>+	end = start + rq->__data_len;
>+
>+	start -= front;
>+	end += back;
>+
>+	/* We're longer than the boundary, so must be crossing it */
>+	if (end - start > boundary)
>+		return true;
>+
>+	mask = boundary - 1;
>+
>+	/* start/end are boundary-aligned, so cannot be crossing */
>+	if (!(start & mask) || !(end & mask))
>+		return false;
>+
>+	imask = ~mask;
>+
>+	/* Top bits are different, so crossed a boundary */
>+	if ((start & imask) != (end & imask))
>+		return true;
>+
>+	return false;
>+}
>+

Shall we ensure here that we don't cross the maximum atomic write limit supported by the
device? It seems that if the boundary size is not advertised by the device
(in fact, I have one NVMe drive which has a boundary size of zero, i.e. nabo/nabspf/
nawupf are all zero but awupf is non-zero) then we (unconditionally) allow
merging. However, it may be possible that, post merge, the total size of the
request exceeds the atomic-write-unit-max-size supported by the device, and
if that happens then most probably we would only be able to catch it very late in
the driver code (if the device is NVMe).

So is it a good idea to validate here whether we could potentially exceed
the atomic-write-unit-max-size supported by the device before we allow merging?
And if we would exceed the atomic-write-unit-max-size post merge, then disallow
the merge?
 
Thanks,
--Nilay
John Garry Feb. 12, 2024, 11:20 a.m. UTC | #2
>> +
>> +	imask = ~mask;
>> +
>> +	/* Top bits are different, so crossed a boundary */
>> +	if ((start & imask) != (end & imask))
>> +		return true;
>> +
>> +	return false;
>> +}
>> +

I'm not sure what is going on with your mail client here.

> Shall we ensure here that we don't cross the maximum atomic write limit supported by the
> device? It seems that if the boundary size is not advertised by the device
> (in fact, I have one NVMe drive which has a boundary size of zero, i.e. nabo/nabspf/
> nawupf are all zero but awupf is non-zero) then we (unconditionally) allow
> merging. However, it may be possible that, post merge, the total size of the
> request exceeds the atomic-write-unit-max-size supported by the device, and
> if that happens then most probably we would only be able to catch it very late in
> the driver code (if the device is NVMe).
>
> So is it a good idea to validate here whether we could potentially exceed
> the atomic-write-unit-max-size supported by the device before we allow merging?

Note that we have atomic_write_max_bytes and atomic_write_max_unit_size, 
and they are not always the same thing.

> And if we would exceed the atomic-write-unit-max-size post merge, then disallow
> the merge?

We check this elsewhere. I just expanded the normal check for max 
request size to cover atomic writes.

Normally we check that a merged request would not exceed max_sectors 
value, and this max_sectors value can be got from 
blk_queue_get_max_sectors().

So if you check a function like ll_back_merge_fn(), we have a merging 
size check:

	if (blk_rq_sectors(req) + bio_sectors(bio) >
	    blk_rq_get_max_sectors(req, blk_rq_pos(req))) {
		req_set_nomerge(req->q, req);
		return 0;
	}

And here the blk_rq_get_max_sectors() -> blk_queue_get_max_sectors() 
call now also supports atomic writes (see patch #7):

@@ -167,7 +167,16 @@ static inline unsigned get_max_io_size(struct bio *bio,
  {
...

+	if (bio->bi_opf & REQ_ATOMIC)
+		max_sectors = lim->atomic_write_max_sectors;
+	else
+		max_sectors = lim->max_sectors;

Note that we do not allow merging of atomic and non-atomic writes.

Thanks,
John
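
As a standalone illustration of that size check (a sketch only, not kernel code; the
limits and sector counts below are hypothetical):

/*
 * Sketch: once the max-sectors limit for a REQ_ATOMIC request comes from
 * atomic_write_max_sectors, the existing "req + bio > limit" merge check
 * also rejects oversize atomic merges. All numbers are made up.
 */
#include <stdbool.h>
#include <stdio.h>

/* Stand-in for the limit selection: atomic requests get the atomic limit. */
static unsigned int max_sectors_for(bool is_atomic)
{
	unsigned int atomic_write_max_sectors = 128;	/* hypothetical atomic limit */
	unsigned int max_sectors = 2560;		/* hypothetical queue limit */

	return is_atomic ? atomic_write_max_sectors : max_sectors;
}

int main(void)
{
	unsigned int req_sectors = 96, bio_sectors = 64;	/* candidate back-merge */

	if (req_sectors + bio_sectors > max_sectors_for(true))
		printf("atomic merge refused: %u > %u sectors\n",
		       req_sectors + bio_sectors, max_sectors_for(true));
	return 0;
}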
Nilay Shroff Feb. 12, 2024, 12:01 p.m. UTC | #3
On 2/12/24 16:50, John Garry wrote:
> I'm not sure what is going on with your mail client here.

Sorry for the inconvenience, I will check the settings.

>> So is it a good idea to validate here whether we could potentially exceed
>> the atomic-write-unit-max-size supported by the device before we allow merging?
> 
> Note that we have atomic_write_max_bytes and atomic_write_max_unit_size, and they are not always the same thing.
> 
>> And if we would exceed the atomic-write-unit-max-size post merge, then disallow
>> the merge?
> 
> We check this elsewhere. I just expanded the normal check for max request size to cover atomic writes.
> 
> Normally we check that a merged request would not exceed max_sectors value, and this max_sectors value can be got from blk_queue_get_max_sectors().
> 
> So if you check a function like ll_back_merge_fn(), we have a merging size check:
> 
>     if (blk_rq_sectors(req) + bio_sectors(bio) >
>         blk_rq_get_max_sectors(req, blk_rq_pos(req))) {
>         req_set_nomerge(req->q, req);
>         return 0;
>     }
> 
> And here the blk_rq_get_max_sectors() -> blk_queue_get_max_sectors() call now also supports atomic writes (see patch #7):
OK, got it. I think I missed this part.

> 
> @@ -167,7 +167,16 @@ static inline unsigned get_max_io_size(struct bio *bio,
>  {
> ...
> 
> +    if (bio->bi_opf & REQ_ATOMIC)
> +        max_sectors = lim->atomic_write_max_sectors;
> +    else
> +        max_sectors = lim->max_sectors;
> 
> Note that we do not allow merging of atomic and non-atomic writes.
> 
Yeah 

Thanks,
--Nilay
John Garry Feb. 12, 2024, 12:09 p.m. UTC | #4
On 12/02/2024 10:54, Nilay Shroff wrote:
> Shall we ensure here that we don't cross the maximum atomic write limit supported by the
> device? It seems that if the boundary size is not advertised by the device
> (in fact, I have one NVMe drive which has a boundary size of zero, i.e. nabo/nabspf/
> nawupf are all zero but awupf is non-zero) then we (unconditionally) allow
> merging.

BTW, if you don't mind, can you share the awupf value and device model? I 
have been searching for NVMe devices which support atomic writes 
(with a non-zero PF reported value). All I have is an M.2 card which has a 
4KB PF atomic write value.

But if this is private info, then ok.

Thanks,
John
Nilay Shroff Feb. 13, 2024, 6:52 a.m. UTC | #5
On 2/12/24 17:39, John Garry wrote:
> On 12/02/2024 10:54, Nilay Shroff wrote:
>> Shall we ensure here that we don't cross the maximum atomic write limit supported by the
>> device? It seems that if the boundary size is not advertised by the device
>> (in fact, I have one NVMe drive which has a boundary size of zero, i.e. nabo/nabspf/
>> nawupf are all zero but awupf is non-zero) then we (unconditionally) allow
>> merging.
> 
> BTW, if you don't mind, can you share the awupf value and device model? I have been searching for NVMe devices which support atomic writes (with a non-zero PF reported value). All I have is an M.2 card which has a 4KB PF atomic write value.
> 
> But if this is private info, then ok.
> 
> Thanks,
> John

Yeah sure. Below are the details about NVMe:

# lspci 
0040:01:00.0 Non-Volatile memory controller: KIOXIA Corporation Device 0025 (rev 01)

# nvme id-ctrl /dev/nvme0 -H 
NVME Identify Controller:
vid       : 0x1e0f
ssvid     : 0x1014
sn        : Z130A00LTGZ8        
mn        : 800GB NVMe Gen4 U.2 SSD                 
fr        : REV.C9S2
[...]
awun      : 65535
awupf     : 63
[...]

# nvme id-ns /dev/nvme0n1 -H 
NVME Identify Namespace 1:
nsze    : 0x18ffff3
ncap    : 0x18ffff3
nuse    : 0
nsfeat  : 0x14
  [4:4] : 0x1	NPWG, NPWA, NPDG, NPDA, and NOWS are Supported
  [3:3] : 0	NGUID and EUI64 fields if non-zero, Reused
  [2:2] : 0x1	Deallocated or Unwritten Logical Block error Supported
  [1:1] : 0	Namespace uses AWUN, AWUPF, and ACWU
  [0:0] : 0	Thin Provisioning Not Supported

[...]

nawun   : 0
nawupf  : 0
nacwu   : 0
nabsn   : 0
nabo    : 0
nabspf  : 0

[...]

LBA Format  0 : Metadata Size: 0   bytes - Data Size: 4096 bytes - Relative Performance: 0 Best (in use)
LBA Format  1 : Metadata Size: 8   bytes - Data Size: 4096 bytes - Relative Performance: 0 Best 
LBA Format  2 : Metadata Size: 0   bytes - Data Size: 512 bytes - Relative Performance: 0 Best 
LBA Format  3 : Metadata Size: 8   bytes - Data Size: 512 bytes - Relative Performance: 0 Best 
LBA Format  4 : Metadata Size: 64  bytes - Data Size: 4096 bytes - Relative Performance: 0 Best 
LBA Format  5 : Metadata Size: 8   bytes - Data Size: 512 bytes - Relative Performance: 0 Best 


As shown above, I am using a KIOXIA NVMe drive. This NVMe has one namespace created (nvme0n1).
The nsfeat field reports that this namespace uses AWUN, AWUPF, and ACWU. The awupf for this NVMe is 63.
As awupf is a 0's based value, it's actually 64. The configured LBA size for the namespace (in use) is 4096
bytes, which means that this NVMe supports writing 262144 (64 * 4096) bytes of data atomically
across a power failure. Further, please note that on this NVMe we have nawupf/nabspf/nabo all set
to zero.

Let me know if you need any other details. And BTW, if you want, I could help you test anything
you'd like on this NVMe.

Thanks,
--Nilay
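
The awupf arithmetic above, spelled out as a trivial standalone check (illustration
only, using the values from the nvme-cli output):

/* awupf is 0's based: 63 means 64 logical blocks of the in-use 4096-byte LBA format. */
#include <stdio.h>

int main(void)
{
	unsigned int awupf = 63;	/* from id-ctrl */
	unsigned int lba_bytes = 4096;	/* LBA Format 0, in use */

	printf("%u bytes written atomically across power failure\n",
	       (awupf + 1) * lba_bytes);	/* prints 262144 */
	return 0;
}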

Patch

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 9d714e8f76b3..12a75a252ca2 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -18,6 +18,42 @@ 
 #include "blk-rq-qos.h"
 #include "blk-throttle.h"
 
+static bool rq_straddles_atomic_write_boundary(struct request *rq,
+					unsigned int front,
+					unsigned int back)
+{
+	unsigned int boundary = queue_atomic_write_boundary_bytes(rq->q);
+	unsigned int mask, imask;
+	loff_t start, end;
+
+	if (!boundary)
+		return false;
+
+	start = rq->__sector << SECTOR_SHIFT;
+	end = start + rq->__data_len;
+
+	start -= front;
+	end += back;
+
+	/* We're longer than the boundary, so must be crossing it */
+	if (end - start > boundary)
+		return true;
+
+	mask = boundary - 1;
+
+	/* start/end are boundary-aligned, so cannot be crossing */
+	if (!(start & mask) || !(end & mask))
+		return false;
+
+	imask = ~mask;
+
+	/* Top bits are different, so crossed a boundary */
+	if ((start & imask) != (end & imask))
+		return true;
+
+	return false;
+}
+
 static inline void bio_get_first_bvec(struct bio *bio, struct bio_vec *bv)
 {
 	*bv = mp_bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
@@ -659,6 +695,13 @@  int ll_back_merge_fn(struct request *req, struct bio *bio, unsigned int nr_segs)
 		return 0;
 	}
 
+	if (req->cmd_flags & REQ_ATOMIC) {
+		if (rq_straddles_atomic_write_boundary(req,
+				0, bio->bi_iter.bi_size)) {
+			return 0;
+		}
+	}
+
 	return ll_new_hw_segment(req, bio, nr_segs);
 }
 
@@ -678,6 +721,13 @@  static int ll_front_merge_fn(struct request *req, struct bio *bio,
 		return 0;
 	}
 
+	if (req->cmd_flags & REQ_ATOMIC) {
+		if (rq_straddles_atomic_write_boundary(req,
+				bio->bi_iter.bi_size, 0)) {
+			return 0;
+		}
+	}
+
 	return ll_new_hw_segment(req, bio, nr_segs);
 }
 
@@ -714,6 +764,13 @@  static int ll_merge_requests_fn(struct request_queue *q, struct request *req,
 	    blk_rq_get_max_sectors(req, blk_rq_pos(req)))
 		return 0;
 
+	if (req->cmd_flags & REQ_ATOMIC) {
+		if (rq_straddles_atomic_write_boundary(req,
+				0, blk_rq_bytes(next))) {
+			return 0;
+		}
+	}
+
 	total_phys_segments = req->nr_phys_segments + next->nr_phys_segments;
 	if (total_phys_segments > blk_rq_get_max_segments(req))
 		return 0;
@@ -809,6 +866,18 @@  static enum elv_merge blk_try_req_merge(struct request *req,
 	return ELEVATOR_NO_MERGE;
 }
 
+static bool blk_atomic_write_mergeable_rq_bio(struct request *rq,
+					      struct bio *bio)
+{
+	return (rq->cmd_flags & REQ_ATOMIC) == (bio->bi_opf & REQ_ATOMIC);
+}
+
+static bool blk_atomic_write_mergeable_rqs(struct request *rq,
+					   struct request *next)
+{
+	return (rq->cmd_flags & REQ_ATOMIC) == (next->cmd_flags & REQ_ATOMIC);
+}
+
 /*
  * For non-mq, this has to be called with the request spinlock acquired.
  * For mq with scheduling, the appropriate queue wide lock should be held.
@@ -828,6 +897,9 @@  static struct request *attempt_merge(struct request_queue *q,
 	if (req->ioprio != next->ioprio)
 		return NULL;
 
+	if (!blk_atomic_write_mergeable_rqs(req, next))
+		return NULL;
+
 	/*
 	 * If we are allowed to merge, then append bio list
 	 * from next to rq and release next. merge_requests_fn
@@ -955,6 +1027,9 @@  bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
 	if (rq->ioprio != bio_prio(bio))
 		return false;
 
+	if (blk_atomic_write_mergeable_rq_bio(rq, bio) == false)
+		return false;
+
 	return true;
 }
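
As an aside, the boundary arithmetic in rq_straddles_atomic_write_boundary() can be
exercised with a standalone userspace sketch (illustration only; the 16 KiB boundary
and the offsets below are made up):

/*
 * Standalone sketch of the boundary arithmetic above; not kernel code.
 * The boundary value and request offsets are hypothetical.
 */
#include <stdbool.h>
#include <stdio.h>

/* Does [start, end) cross a multiple of 'boundary' (a power of two)? */
static bool straddles(unsigned long long start, unsigned long long end,
		      unsigned long long boundary)
{
	unsigned long long mask = boundary - 1;

	if (end - start > boundary)
		return true;			/* longer than the boundary period */
	if (!(start & mask) || !(end & mask))
		return false;			/* an edge sits exactly on a boundary */
	return (start & ~mask) != (end & ~mask);	/* edges in different windows */
}

int main(void)
{
	unsigned long long boundary = 16 * 1024;	/* hypothetical 16 KiB boundary */

	/* 8 KiB at offset 4 KiB stays inside [0, 16K): may merge. */
	printf("%d\n", straddles(4096, 4096 + 8192, boundary));		/* prints 0 */

	/* 8 KiB at offset 12 KiB crosses the 16 KiB line: must not merge. */
	printf("%d\n", straddles(12288, 12288 + 8192, boundary));	/* prints 1 */
	return 0;
}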