Message ID | 20240322185023.131697-1-joshi.k@samsung.com (mailing list archive)
---|---
Series | Read/Write with meta buffer
On 3/22/24 12:50 PM, Kanchan Joshi wrote:
> This patchset is aimed at getting feedback on a new io_uring
> interface that userspace can use to exchange a meta buffer along with
> a read/write.
>
> Two new opcodes for that: IORING_OP_READ_META and IORING_OP_WRITE_META.
> The leftover space in the SQE is used to send the meta buffer pointer
> and its length. Patch #2 for this.
>
> The interface is supported for block direct IO. Patch #4 for this.
> The other two are prep patches.
>
> Care has been taken not to touch the hot read/write path as much as
> possible. Performance for non-meta IO is the same after the patches [2].
> There is some code in the cold path (worker-based async) though.

This patchset should look cleaner if you rebase it on top of the current
for-6.10/io_uring branch, as it gets rid of the async nastiness. Since
that'll need doing anyway, could you repost a v2 rebased on top of that?

Also, in terms of the cover letter, it would be good to have a bit more of a
description of what this enables. It's a bit scant on detail about what
exactly this gives you.

> taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
> submitter=1, tid=2453, file=/dev/nvme1n1, node=-1
> submitter=0, tid=2452, file=/dev/nvme0n1, node=-1
> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
> Engine=io_uring, sq_ring=128, cq_ring=128
> IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
> IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
>
> With this:
> taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
> submitter=1, tid=2453, file=/dev/nvme1n1, node=-1
> submitter=0, tid=2452, file=/dev/nvme0n1, node=-1
> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
> Engine=io_uring, sq_ring=128, cq_ring=128
> IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
> IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31

Not that I don't believe you, but it looks like you pasted the same
stuff in there twice? It's the exact same perf numbers and pids.
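For reference, below is a minimal sketch of what a submission through the proposed interface might look like from userspace. This is a hypothetical illustration, not code from the patchset: the numeric opcode value, the helper name queue_read_with_meta, and the use of sqe->addr3 and sqe->__pad2[0] for the meta pointer and length are assumptions; the cover letter only states that the "leftover space in the SQE" carries them (patch #2).

#include <errno.h>
#include <liburing.h>
#include <stdint.h>

/* Placeholder opcode value; the real one would come from the patched io_uring.h. */
#define IORING_OP_READ_META_HYPOTHETICAL 48

static int queue_read_with_meta(struct io_uring *ring, int fd,
                                void *data, unsigned data_len,
                                void *meta, unsigned meta_len, __u64 off)
{
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        if (!sqe)
                return -EAGAIN;

        /* Usual read setup: data buffer, length, file offset. */
        io_uring_prep_read(sqe, fd, data, data_len, off);
        sqe->opcode = IORING_OP_READ_META_HYPOTHETICAL;

        /* Assumed placement of the meta buffer in the spare SQE space. */
        sqe->addr3 = (__u64)(uintptr_t)meta;   /* meta buffer pointer */
        sqe->__pad2[0] = meta_len;             /* meta buffer length */

        return 0;
}

Submission and completion handling would otherwise be the same as for a plain IORING_OP_READ.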
On 3/28/2024 5:08 AM, Jens Axboe wrote:
> This patchset should look cleaner if you rebase it on top of the current
> for-6.10/io_uring branch, as it gets rid of the async nastiness. Since
> that'll need doing anyway, could you repost a v2 rebased on top of that?

Yes, the next iteration will use that as the base.

> Also, in terms of the cover letter, it would be good to have a bit more of a
> description of what this enables. It's a bit scant on detail about what
> exactly this gives you.

Will fix that. But currently the only thing it gives is the ability to pass
a meta buffer to/from the block device. That keeps things simple, and it is
fine for PI type 0 (normal unprotected IO). For other PI types, exposing a
few knobs may help; "sqe->rw_flags" could be used if there is no other way.

>> taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
>> submitter=1, tid=2453, file=/dev/nvme1n1, node=-1
>> submitter=0, tid=2452, file=/dev/nvme0n1, node=-1
>> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
>> Engine=io_uring, sq_ring=128, cq_ring=128
>> IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
>> IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
>>
>> With this:
>> taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
>> submitter=1, tid=2453, file=/dev/nvme1n1, node=-1
>> submitter=0, tid=2452, file=/dev/nvme0n1, node=-1
>> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
>> Engine=io_uring, sq_ring=128, cq_ring=128
>> IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
>> IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
>
> Not that I don't believe you, but it looks like you pasted the same
> stuff in there twice? It's the exact same perf numbers and pids.

Indeed :-( I made a goof-up while pasting stuff [1] into the cover letter.

[1]
Before the patch:
# taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
submitter=1, tid=2453, file=/dev/nvme1n1, node=-1
submitter=0, tid=2452, file=/dev/nvme0n1, node=-1
polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
Exiting on timeout
Maximum IOPS=10.04M

After the patch:
# taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
submitter=1, tid=2412, file=/dev/nvme1n1, node=-1
submitter=0, tid=2411, file=/dev/nvme0n1, node=-1
polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
IOPS=10.03M, BW=4.90GiB/s, IOS/call=31/31
IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
Exiting on timeout
Maximum IOPS=10.04M
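As background for the PI discussion above, here is an illustrative sketch (not from the patchset) of what the meta buffer typically carries for a T10 PI protected device: one 8-byte tuple per logical block, mirroring the layout of struct t10_pi_tuple in include/linux/t10-pi.h. The helper name alloc_pi_meta and the 512-byte logical block size are assumptions for the example; on the device the tags are stored big-endian.

#include <stdint.h>
#include <stdlib.h>

/* Same 8-byte layout as struct t10_pi_tuple in include/linux/t10-pi.h. */
struct pi_tuple {
        uint16_t guard_tag; /* CRC16 of the data block */
        uint16_t app_tag;   /* application-defined tag */
        uint32_t ref_tag;   /* typically the low 32 bits of the LBA (Type 1) */
};

/* Allocate a meta buffer covering nr_blocks 512-byte logical blocks. */
static struct pi_tuple *alloc_pi_meta(unsigned nr_blocks, unsigned *meta_len)
{
        *meta_len = nr_blocks * sizeof(struct pi_tuple);
        return calloc(nr_blocks, sizeof(struct pi_tuple));
}

For PI type 0 (no protection), the buffer is just opaque per-block metadata that the application fills or reads back as-is.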
On 3/22/24 18:50, Kanchan Joshi wrote:
> This patchset is aimed at getting feedback on a new io_uring
> interface that userspace can use to exchange a meta buffer along with
> a read/write.
>
> Two new opcodes for that: IORING_OP_READ_META and IORING_OP_WRITE_META.
> The leftover space in the SQE is used to send the meta buffer pointer
> and its length. Patch #2 for this.

I do remember there were back-and-forth design discussions about that
back when some other guy attempted to implement it, but have you tried
to do it not as a separate opcode? It reads like all read/write opcodes
might benefit from it, and it'd be unfortunate to then be adding
IORING_OP_READ_META_FIXED and, multiplicatively, all the other variants.

> The interface is supported for block direct IO. Patch #4 for this.
> The other two are prep patches.
>
> Care has been taken not to touch the hot read/write path as much as
> possible. Performance for non-meta IO is the same after the patches [2].
> There is some code in the cold path (worker-based async) though.
>
> Moderately tested by modifying fio [1] to use this interface
> (only for NVMe block devices).
>
> [1] https://github.com/OpenMPDK/fio/tree/feat/test-meta
>
> [2]
> Without this:
>
> taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
> submitter=1, tid=2453, file=/dev/nvme1n1, node=-1
> submitter=0, tid=2452, file=/dev/nvme0n1, node=-1
> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
> Engine=io_uring, sq_ring=128, cq_ring=128
> IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
> IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
>
> With this:
> taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
> submitter=1, tid=2453, file=/dev/nvme1n1, node=-1
> submitter=0, tid=2452, file=/dev/nvme0n1, node=-1
> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
> Engine=io_uring, sq_ring=128, cq_ring=128
> IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
> IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
>
> Anuj Gupta (3):
>   io_uring/rw: Get rid of flags field in struct io_rw
>   io_uring/rw: support read/write with metadata
>   block: modify bio_integrity_map_user to accept iov_iter as argument
>
> Kanchan Joshi (1):
>   block: add support to pass the meta buffer
>
>  block/bio-integrity.c         |  27 ++++++---
>  block/fops.c                  |   9 +++
>  block/t10-pi.c                |   6 ++
>  drivers/nvme/host/ioctl.c     |  11 +++-
>  include/linux/bio.h           |  13 +++-
>  include/linux/fs.h            |   1 +
>  include/uapi/linux/io_uring.h |   6 ++
>  io_uring/io_uring.c           |   2 +
>  io_uring/opdef.c              |  29 +++++++++
>  io_uring/rw.c                 | 108 +++++++++++++++++++++++++++++-----
>  io_uring/rw.h                 |   8 +++
>  11 files changed, 193 insertions(+), 27 deletions(-)
>
>
> base-commit: 6f0974eccbf78baead1735722c4f1ee3eb9422cd
On 4/7/2024 3:00 AM, Pavel Begunkov wrote:
> On 3/22/24 18:50, Kanchan Joshi wrote:
>> This patchset is aimed at getting feedback on a new io_uring
>> interface that userspace can use to exchange a meta buffer along with
>> a read/write.
>>
>> Two new opcodes for that: IORING_OP_READ_META and IORING_OP_WRITE_META.
>> The leftover space in the SQE is used to send the meta buffer pointer
>> and its length. Patch #2 for this.
>
> I do remember there were back-and-forth design discussions about that
> back when some other guy attempted to implement it, but have you tried
> to do it not as a separate opcode?

Did not try that in the first cut, thinking it would help in not touching
the hot (non-meta) IO path. But I am open to this.

> It reads like all read/write opcodes might benefit from it, and it'd
> be unfortunate to then be adding IORING_OP_READ_META_FIXED and,
> multiplicatively, all the other variants.

Right, that's a good point.
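A hypothetical sketch of the flag-based alternative being discussed: keep the existing read/write opcodes and signal the presence of a meta buffer with a per-IO flag, so the fixed-buffer and vectored variants need no new opcodes. The flag name and value and the SQE field placement below are assumptions, not anything proposed in the thread.

#include <liburing.h>
#include <stdint.h>

/* Placeholder flag bit, not a real RWF_* value. */
#define RWF_META_HYPOTHETICAL (1U << 7)

static void prep_read_maybe_meta(struct io_uring_sqe *sqe, int fd,
                                 void *data, unsigned len, uint64_t off,
                                 void *meta, unsigned meta_len)
{
        /* Plain IORING_OP_READ; READ_FIXED or READV would work the same way. */
        io_uring_prep_read(sqe, fd, data, len, off);

        if (meta) {
                sqe->rw_flags |= RWF_META_HYPOTHETICAL;  /* assumed per-IO flag */
                sqe->addr3 = (uint64_t)(uintptr_t)meta;  /* assumed meta pointer slot */
                sqe->__pad2[0] = meta_len;               /* assumed meta length slot */
        }
}

Keeping the opcode space flat this way is what avoids the IORING_OP_READ_META_FIXED-style multiplication of variants.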