Message ID | 20240322185023.131697-1-joshi.k@samsung.com (mailing list archive)
---|---
Series | Read/Write with meta buffer
On 3/22/24 12:50 PM, Kanchan Joshi wrote:
> This patchset is aimed at getting feedback on a new io_uring
> interface that userspace can use to exchange a meta buffer along with
> a read/write.
>
> Two new opcodes for that: IORING_OP_READ_META and IORING_OP_WRITE_META.
> The leftover space in the SQE is used to send the meta buffer pointer
> and its length. Patch #2 for this.
>
> The interface is supported for block direct IO. Patch #4 for this.
> The other two are prep patches.
>
> Care has been taken not to touch the hot read/write path as much as
> possible. Performance for non-meta IO is the same after the patches [2].
> There is some code in the cold path (worker-based async) though.

This patchset should look cleaner if you rebase it on top of the current
for-6.10/io_uring branch, as it gets rid of the async nastiness. Since
that'll need doing anyway, could you repost a v2 rebased on top of that?

Also, in terms of the cover letter, it would be good to have a bit more of a
description of what this enables. It's a bit scant on detail about what
exactly this gives you.

> taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
> submitter=1, tid=2453, file=/dev/nvme1n1, node=-1
> submitter=0, tid=2452, file=/dev/nvme0n1, node=-1
> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
> Engine=io_uring, sq_ring=128, cq_ring=128
> IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
> IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
>
> With this:
> taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
> submitter=1, tid=2453, file=/dev/nvme1n1, node=-1
> submitter=0, tid=2452, file=/dev/nvme0n1, node=-1
> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
> Engine=io_uring, sq_ring=128, cq_ring=128
> IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
> IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31

Not that I don't believe you, but it looks like you pasted the same
stuff in there twice? It's the exact same perf numbers and pids.
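For reference, below is a minimal sketch of what a submission through the proposed interface might look like from userspace. This is a hypothetical illustration, not code from the patchset: the numeric opcode value, the helper name queue_read_with_meta, and the use of sqe->addr3 and sqe->__pad2[0] for the meta pointer and length are assumptions; the cover letter only states that the "leftover space in the SQE" carries them (patch #2).

#include <errno.h>
#include <liburing.h>
#include <stdint.h>

/* Placeholder opcode value; the real one would come from the patched io_uring.h. */
#define IORING_OP_READ_META_HYPOTHETICAL 48

static int queue_read_with_meta(struct io_uring *ring, int fd,
                                void *data, unsigned data_len,
                                void *meta, unsigned meta_len, __u64 off)
{
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        if (!sqe)
                return -EAGAIN;

        /* Usual read setup: data buffer, length, file offset. */
        io_uring_prep_read(sqe, fd, data, data_len, off);
        sqe->opcode = IORING_OP_READ_META_HYPOTHETICAL;

        /* Assumed placement of the meta buffer in the spare SQE space. */
        sqe->addr3 = (__u64)(uintptr_t)meta;   /* meta buffer pointer */
        sqe->__pad2[0] = meta_len;             /* meta buffer length */

        return 0;
}

Submission and completion handling would otherwise be the same as for a plain IORING_OP_READ.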
On 3/28/2024 5:08 AM, Jens Axboe wrote:
> This patchset should look cleaner if you rebase it on top of the current
> for-6.10/io_uring branch, as it gets rid of the async nastiness. Since
> that'll need doing anyway, could you repost a v2 rebased on top of that?

Yes, the next iteration will use that as the base.

> Also, in terms of the cover letter, it would be good to have a bit more of a
> description of what this enables. It's a bit scant on detail about what
> exactly this gives you.

Will fix that. But currently the only thing it gives is the ability to pass
a meta buffer to/from the block device. That keeps things simple, and it is
fine for PI type 0 (normal unprotected IO). For other PI types, exposing a
few knobs may help; "sqe->rw_flags" could be used if there is no other way.

>> taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
>> submitter=1, tid=2453, file=/dev/nvme1n1, node=-1
>> submitter=0, tid=2452, file=/dev/nvme0n1, node=-1
>> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
>> Engine=io_uring, sq_ring=128, cq_ring=128
>> IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
>> IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
>>
>> With this:
>> taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
>> submitter=1, tid=2453, file=/dev/nvme1n1, node=-1
>> submitter=0, tid=2452, file=/dev/nvme0n1, node=-1
>> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
>> Engine=io_uring, sq_ring=128, cq_ring=128
>> IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
>> IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
>
> Not that I don't believe you, but it looks like you pasted the same
> stuff in there twice? It's the exact same perf numbers and pids.

Indeed :-( I made a goof-up while pasting stuff [1] into the cover letter.

[1]
Before the patch:
# taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
submitter=1, tid=2453, file=/dev/nvme1n1, node=-1
submitter=0, tid=2452, file=/dev/nvme0n1, node=-1
polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
Exiting on timeout
Maximum IOPS=10.04M

After the patch:
# taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
submitter=1, tid=2412, file=/dev/nvme1n1, node=-1
submitter=0, tid=2411, file=/dev/nvme0n1, node=-1
polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
IOPS=10.03M, BW=4.90GiB/s, IOS/call=31/31
IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
Exiting on timeout
Maximum IOPS=10.04M
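As background for the PI discussion above, here is an illustrative sketch (not from the patchset) of what the meta buffer typically carries for a T10 PI protected device: one 8-byte tuple per logical block, mirroring the layout of struct t10_pi_tuple in include/linux/t10-pi.h. The helper name alloc_pi_meta and the 512-byte logical block size are assumptions for the example; on the device the tags are stored big-endian.

#include <stdint.h>
#include <stdlib.h>

/* Same 8-byte layout as struct t10_pi_tuple in include/linux/t10-pi.h. */
struct pi_tuple {
        uint16_t guard_tag; /* CRC16 of the data block */
        uint16_t app_tag;   /* application-defined tag */
        uint32_t ref_tag;   /* typically the low 32 bits of the LBA (Type 1) */
};

/* Allocate a meta buffer covering nr_blocks 512-byte logical blocks. */
static struct pi_tuple *alloc_pi_meta(unsigned nr_blocks, unsigned *meta_len)
{
        *meta_len = nr_blocks * sizeof(struct pi_tuple);
        return calloc(nr_blocks, sizeof(struct pi_tuple));
}

For PI type 0 (no protection), the buffer is just opaque per-block metadata that the application fills or reads back as-is.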
On 3/22/24 18:50, Kanchan Joshi wrote:
> This patchset is aimed at getting feedback on a new io_uring
> interface that userspace can use to exchange a meta buffer along with
> a read/write.
>
> Two new opcodes for that: IORING_OP_READ_META and IORING_OP_WRITE_META.
> The leftover space in the SQE is used to send the meta buffer pointer
> and its length. Patch #2 for this.

I do remember there were back-and-forth design discussions about that
back when some other guy attempted to implement it, but have you tried
to do it not as a separate opcode? It reads like all read/write opcodes
might benefit from it, and it'd be unfortunate to then be adding
IORING_OP_READ_META_FIXED and, multiplicatively, all the other variants.

> The interface is supported for block direct IO. Patch #4 for this.
> The other two are prep patches.
>
> Care has been taken not to touch the hot read/write path as much as
> possible. Performance for non-meta IO is the same after the patches [2].
> There is some code in the cold path (worker-based async) though.
>
> Moderately tested by modifying fio [1] to use this interface
> (only for NVMe block devices).
>
> [1] https://github.com/OpenMPDK/fio/tree/feat/test-meta
>
> [2]
> Without this:
>
> taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
> submitter=1, tid=2453, file=/dev/nvme1n1, node=-1
> submitter=0, tid=2452, file=/dev/nvme0n1, node=-1
> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
> Engine=io_uring, sq_ring=128, cq_ring=128
> IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
> IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
>
> With this:
> taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
> submitter=1, tid=2453, file=/dev/nvme1n1, node=-1
> submitter=0, tid=2452, file=/dev/nvme0n1, node=-1
> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
> Engine=io_uring, sq_ring=128, cq_ring=128
> IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
> IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
>
> Anuj Gupta (3):
>   io_uring/rw: Get rid of flags field in struct io_rw
>   io_uring/rw: support read/write with metadata
>   block: modify bio_integrity_map_user to accept iov_iter as argument
>
> Kanchan Joshi (1):
>   block: add support to pass the meta buffer
>
>  block/bio-integrity.c         |  27 ++++++---
>  block/fops.c                  |   9 +++
>  block/t10-pi.c                |   6 ++
>  drivers/nvme/host/ioctl.c     |  11 +++-
>  include/linux/bio.h           |  13 +++-
>  include/linux/fs.h            |   1 +
>  include/uapi/linux/io_uring.h |   6 ++
>  io_uring/io_uring.c           |   2 +
>  io_uring/opdef.c              |  29 +++++++++
>  io_uring/rw.c                 | 108 +++++++++++++++++++++++++++++-----
>  io_uring/rw.h                 |   8 +++
>  11 files changed, 193 insertions(+), 27 deletions(-)
>
>
> base-commit: 6f0974eccbf78baead1735722c4f1ee3eb9422cd
On 4/7/2024 3:00 AM, Pavel Begunkov wrote:
> On 3/22/24 18:50, Kanchan Joshi wrote:
>> This patchset is aimed at getting feedback on a new io_uring
>> interface that userspace can use to exchange a meta buffer along with
>> a read/write.
>>
>> Two new opcodes for that: IORING_OP_READ_META and IORING_OP_WRITE_META.
>> The leftover space in the SQE is used to send the meta buffer pointer
>> and its length. Patch #2 for this.
>
> I do remember there were back-and-forth design discussions about that
> back when some other guy attempted to implement it, but have you tried
> to do it not as a separate opcode?

Did not try that in the first cut, thinking it would help in not touching
the hot (non-meta) IO path. But I am open to this.

> It reads like all read/write opcodes might benefit from it, and it'd
> be unfortunate to then be adding IORING_OP_READ_META_FIXED and,
> multiplicatively, all the other variants.

Right, that's a good point.
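A hypothetical sketch of the flag-based alternative being discussed: keep the existing read/write opcodes and signal the presence of a meta buffer with a per-IO flag, so the fixed-buffer and vectored variants need no new opcodes. The flag name and value and the SQE field placement below are assumptions, not anything proposed in the thread.

#include <liburing.h>
#include <stdint.h>

/* Placeholder flag bit, not a real RWF_* value. */
#define RWF_META_HYPOTHETICAL (1U << 7)

static void prep_read_maybe_meta(struct io_uring_sqe *sqe, int fd,
                                 void *data, unsigned len, uint64_t off,
                                 void *meta, unsigned meta_len)
{
        /* Plain IORING_OP_READ; READ_FIXED or READV would work the same way. */
        io_uring_prep_read(sqe, fd, data, len, off);

        if (meta) {
                sqe->rw_flags |= RWF_META_HYPOTHETICAL;  /* assumed per-IO flag */
                sqe->addr3 = (uint64_t)(uintptr_t)meta;  /* assumed meta pointer slot */
                sqe->__pad2[0] = meta_len;               /* assumed meta length slot */
        }
}

Keeping the opcode space flat this way is what avoids the IORING_OP_READ_META_FIXED-style multiplication of variants.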