Message ID | 20230307141520.793891-1-ming.lei@redhat.com (mailing list archive) |
---|---|
Headers | show |
Series | io_uring/ublk: add IORING_OP_FUSED_CMD | expand |
On 3/7/23 14:15, Ming Lei wrote: > Hello, > > Add IORING_OP_FUSED_CMD, it is one special URING_CMD, which has to > be SQE128. The 1st SQE(master) is one 64byte URING_CMD, and the 2nd > 64byte SQE(slave) is another normal 64byte OP. For any OP which needs > to support slave OP, io_issue_defs[op].fused_slave needs to be set as 1, > and its ->issue() can retrieve/import buffer from master request's > fused_cmd_kbuf. The slave OP is actually submitted from kernel, part of > this idea is from Xiaoguang's ublk ebpf patchset, but this patchset > submits slave OP just like normal OP issued from userspace, that said, > SQE order is kept, and batching handling is done too. From a quick look through patches it all looks a bit complicated and intrusive, all over generic hot paths. I think instead we should be able to use registered buffer table as intermediary and reuse splicing. Let me try it out > Please see detailed design in commit log of the 3th patch, and one big > point is how to handle buffer ownership. > > With this way, it is easy to support zero copy for ublk/fuse device. > > Basically userspace can specify any sub-buffer of the ublk block request > buffer from the fused command just by setting 'offset/len' > in the slave SQE for running slave OP. This way is flexible to implement > io mapping: mirror, stripped, ... > > The 4th & 5th patches enable fused slave support for the following OPs: > > OP_READ/OP_WRITE > OP_SEND/OP_RECV/OP_SEND_ZC > > The other ublk patches cleans ublk driver and implement fused command > for supporting zero copy. > > Follows userspace code: > > https://github.com/ming1/ubdsrv/tree/fused-cmd-zc-v2 > > All three(loop, nbd and qcow2) ublk targets have supported zero copy by passing: > > ublk add -t [loop|nbd|qcow2] -z .... > > Basic fs mount/kernel building and builtin test are done. > > Also add liburing test case for covering fused command based on miniublk > of blktest: > > https://github.com/ming1/liburing/commits/fused_cmd_miniublk > > Performance improvement is obvious on memory bandwidth > related workloads, such as, 1~2X improvement on 64K/512K BS > IO test on loop with ramfs backing file. > > Any comments are welcome! > > V2: > - don't resue io_mapped_ubuf (io_uring) > - remove REQ_F_FUSED_MASTER_BIT (io_uring) > - fix compile warning (io_uring) > - rebase on v6.3-rc1 (io_uring) > - grabbing io request reference when handling fused command > - simplify ublk_copy_user_pages() by iov iterator > - add read()/write() for userspace to read/write ublk io buffer, so > that some corner cases(read zero, passthrough request(report zones)) can > be handled easily in case of zero copy; this way also helps to switch to > zero copy completely > - misc cleanup > > Ming Lei (17): > io_uring: add IO_URING_F_FUSED and prepare for supporting OP_FUSED_CMD > io_uring: increase io_kiocb->flags into 64bit > io_uring: add IORING_OP_FUSED_CMD > io_uring: support OP_READ/OP_WRITE for fused slave request > io_uring: support OP_SEND_ZC/OP_RECV for fused slave request > block: ublk_drv: mark device as LIVE before adding disk > block: ublk_drv: add common exit handling > block: ublk_drv: don't consider flush request in map/unmap io > block: ublk_drv: add two helpers to clean up map/unmap request > block: ublk_drv: clean up several helpers > block: ublk_drv: cleanup 'struct ublk_map_data' > block: ublk_drv: cleanup ublk_copy_user_pages > block: ublk_drv: grab request reference when the request is handled by > userspace > block: ublk_drv: support to copy any part of request pages > block: ublk_drv: add read()/write() support for ublk char device > block: ublk_drv: don't check buffer in case of zero copy > block: ublk_drv: apply io_uring FUSED_CMD for supporting zero copy > > drivers/block/ublk_drv.c | 605 ++++++++++++++++++++++++++------- > drivers/char/mem.c | 4 + > drivers/nvme/host/ioctl.c | 9 + > include/linux/io_uring.h | 49 ++- > include/linux/io_uring_types.h | 18 +- > include/uapi/linux/io_uring.h | 1 + > include/uapi/linux/ublk_cmd.h | 37 +- > io_uring/Makefile | 2 +- > io_uring/fused_cmd.c | 232 +++++++++++++ > io_uring/fused_cmd.h | 11 + > io_uring/io_uring.c | 22 +- > io_uring/io_uring.h | 3 + > io_uring/net.c | 23 +- > io_uring/opdef.c | 17 + > io_uring/opdef.h | 2 + > io_uring/rw.c | 20 ++ > 16 files changed, 926 insertions(+), 129 deletions(-) > create mode 100644 io_uring/fused_cmd.c > create mode 100644 io_uring/fused_cmd.h >
On 3/7/23 15:37, Pavel Begunkov wrote: > On 3/7/23 14:15, Ming Lei wrote: >> Hello, >> >> Add IORING_OP_FUSED_CMD, it is one special URING_CMD, which has to >> be SQE128. The 1st SQE(master) is one 64byte URING_CMD, and the 2nd >> 64byte SQE(slave) is another normal 64byte OP. For any OP which needs >> to support slave OP, io_issue_defs[op].fused_slave needs to be set as 1, >> and its ->issue() can retrieve/import buffer from master request's >> fused_cmd_kbuf. The slave OP is actually submitted from kernel, part of >> this idea is from Xiaoguang's ublk ebpf patchset, but this patchset >> submits slave OP just like normal OP issued from userspace, that said, >> SQE order is kept, and batching handling is done too. > > From a quick look through patches it all looks a bit complicated > and intrusive, all over generic hot paths. I think instead we > should be able to use registered buffer table as intermediary and > reuse splicing. Let me try it out Here we go, isolated in a new opcode, and in the end should work with any file supporting splice. It's a quick prototype, it's lacking and there are many obvious fatal bugs. It also needs some optimisations, improvements on how executed by io_uring and extra stuff like memcpy ops and fixed buf recv/send. I'll clean it up. I used a test below, it essentially does zc recv. https://github.com/isilence/liburing/commit/81fe705739af7d9b77266f9aa901c1ada870739d From 87ad9e8e3aed683aa040fb4b9ae499f8726ba393 Mon Sep 17 00:00:00 2001 Message-Id: <87ad9e8e3aed683aa040fb4b9ae499f8726ba393.1678208911.git.asml.silence@gmail.com> From: Pavel Begunkov <asml.silence@gmail.com> Date: Tue, 7 Mar 2023 17:01:44 +0000 Subject: [POC 1/1] io_uring: splicing into reg buf table EXTREMELY BUGGY! Not for inclusion. Add a new operation called IORING_OP_SPLICE_FROM, which splices from a file into the registered buffer table. This is done in a zerocopy fashion with a caveat that the user won't have direct access to the data, however it can use it with any io_uring request supporting registered buffers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> --- include/uapi/linux/io_uring.h | 1 + io_uring/io_uring.c | 4 +- io_uring/opdef.c | 10 ++++ io_uring/splice.c | 98 +++++++++++++++++++++++++++++++++++ io_uring/splice.h | 3 ++ 5 files changed, 114 insertions(+), 2 deletions(-) diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 709de6d4feb2..a91ce1d2ebd7 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -223,6 +223,7 @@ enum io_uring_op { IORING_OP_URING_CMD, IORING_OP_SEND_ZC, IORING_OP_SENDMSG_ZC, + IORING_OP_SPLICE_FROM, /* this goes last, obviously */ IORING_OP_LAST, diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 7625597b5227..b7389a6ea190 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -2781,8 +2781,8 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx) io_wait_rsrc_data(ctx->file_data); mutex_lock(&ctx->uring_lock); - if (ctx->buf_data) - __io_sqe_buffers_unregister(ctx); + // if (ctx->buf_data) + // __io_sqe_buffers_unregister(ctx); if (ctx->file_data) __io_sqe_files_unregister(ctx); io_cqring_overflow_kill(ctx); diff --git a/io_uring/opdef.c b/io_uring/opdef.c index cca7c5b55208..28d4fa42676b 100644 --- a/io_uring/opdef.c +++ b/io_uring/opdef.c @@ -428,6 +428,13 @@ const struct io_issue_def io_issue_defs[] = { .prep = io_eopnotsupp_prep, #endif }, + [IORING_OP_SPLICE_FROM] = { + .needs_file = 1, + .unbound_nonreg_file = 1, + // .pollin = 1, + .prep = io_splice_from_prep, + .issue = io_splice_from, + } }; @@ -648,6 +655,9 @@ const struct io_cold_def io_cold_defs[] = { .fail = io_sendrecv_fail, #endif }, + [IORING_OP_SPLICE_FROM] = { + .name = "SPLICE_FROM", + } }; const char *io_uring_get_opcode(u8 opcode) diff --git a/io_uring/splice.c b/io_uring/splice.c index 2a4bbb719531..0467e9f46e99 100644 --- a/io_uring/splice.c +++ b/io_uring/splice.c @@ -8,11 +8,13 @@ #include <linux/namei.h> #include <linux/io_uring.h> #include <linux/splice.h> +#include <linux/nospec.h> #include <uapi/linux/io_uring.h> #include "io_uring.h" #include "splice.h" +#include "rsrc.h" struct io_splice { struct file *file_out; @@ -119,3 +121,99 @@ int io_splice(struct io_kiocb *req, unsigned int issue_flags) io_req_set_res(req, ret, 0); return IOU_OK; } + +struct io_splice_from { + struct file *file; + loff_t off; + u64 len; +}; + + +int io_splice_from_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) +{ + struct io_splice_from *sp = io_kiocb_to_cmd(req, struct io_splice_from); + + if (unlikely(sqe->splice_flags || sqe->splice_fd_in || sqe->ioprio || + sqe->addr || sqe->addr3)) + return -EINVAL; + + req->buf_index = READ_ONCE(sqe->buf_index); + + sp->len = READ_ONCE(sqe->len); + if (unlikely(!sp->len)) + return -EINVAL; + + sp->off = READ_ONCE(sqe->off); + return 0; +} + +int io_splice_from(struct io_kiocb *req, unsigned int issue_flags) +{ + struct io_splice_from *sp = io_kiocb_to_cmd(req, struct io_splice_from); + loff_t *ppos = (sp->off == -1) ? NULL : &sp->off; + struct io_mapped_ubuf *imu; + struct pipe_inode_info *pi; + struct io_ring_ctx *ctx; + unsigned int pipe_tail; + int ret, i, nr_pages; + u16 index; + + if (!sp->file->f_op->splice_read) + return -ENOTSUPP; + + pi = alloc_pipe_info(); + if (!pi) + return -ENOMEM; + pi->readers = 1; + + ret = sp->file->f_op->splice_read(sp->file, ppos, pi, sp->len, 0); + if (ret < 0) + goto done; + + nr_pages = pipe_occupancy(pi->head, pi->tail); + imu = kvmalloc(struct_size(imu, bvec, nr_pages), GFP_KERNEL); + if (!imu) + goto done; + + ret = 0; + pipe_tail = pi->tail; + for (i = 0; !pipe_empty(pi->head, pipe_tail); i++) { + unsigned int mask = pi->ring_size - 1; // kill mask + struct pipe_buffer *buf = &pi->bufs[pipe_tail & mask]; + + bvec_set_page(&imu->bvec[i], buf->page, buf->len, buf->offset); + ret += buf->len; + pipe_tail++; + } + if (WARN_ON_ONCE(i != nr_pages)) + return -EFAULT; + + ctx = req->ctx; + io_ring_submit_lock(ctx, issue_flags); + if (unlikely(req->buf_index >= ctx->nr_user_bufs)) { + /* TODO: cleanup pages */ + ret = -EFAULT; + kvfree(imu); + goto done_unlock; + } + index = array_index_nospec(req->buf_index, ctx->nr_user_bufs); + if (ctx->user_bufs[index] != ctx->dummy_ubuf) { + /* TODO: cleanup pages */ + kvfree(imu); + ret = -EFAULT; + goto done_unlock; + } + + imu->ubuf = 0; + imu->ubuf_end = ret; + imu->nr_bvecs = nr_pages; + ctx->user_bufs[index] = imu; +done_unlock: + io_ring_submit_unlock(ctx, issue_flags); +done: + free_pipe_info(pi); + if (ret != sp->len) + req_set_fail(req); + io_req_set_res(req, ret, 0); + return IOU_OK; +} diff --git a/io_uring/splice.h b/io_uring/splice.h index 542f94168ad3..abdf5ad8e8d2 100644 --- a/io_uring/splice.h +++ b/io_uring/splice.h @@ -5,3 +5,6 @@ int io_tee(struct io_kiocb *req, unsigned int issue_flags); int io_splice_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe); int io_splice(struct io_kiocb *req, unsigned int issue_flags); + +int io_splice_from_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe); +int io_splice_from(struct io_kiocb *req, unsigned int issue_flags);
On Tue, Mar 07, 2023 at 03:37:21PM +0000, Pavel Begunkov wrote: > On 3/7/23 14:15, Ming Lei wrote: > > Hello, > > > > Add IORING_OP_FUSED_CMD, it is one special URING_CMD, which has to > > be SQE128. The 1st SQE(master) is one 64byte URING_CMD, and the 2nd > > 64byte SQE(slave) is another normal 64byte OP. For any OP which needs > > to support slave OP, io_issue_defs[op].fused_slave needs to be set as 1, > > and its ->issue() can retrieve/import buffer from master request's > > fused_cmd_kbuf. The slave OP is actually submitted from kernel, part of > > this idea is from Xiaoguang's ublk ebpf patchset, but this patchset > > submits slave OP just like normal OP issued from userspace, that said, > > SQE order is kept, and batching handling is done too. > > From a quick look through patches it all looks a bit complicated > and intrusive, all over generic hot paths. I think instead we Really? The main change to generic hot paths are adding one 'true/false' parameter to io_init_req(). For others, the change is just check on req->flags or issue_flags, which is basically zero cost. > should be able to use registered buffer table as intermediary and > reuse splicing. Let me try it out I will take a look at you patch, but last time, Linus has pointed out that splice isn't one good way, in which buffer ownership transferring is one big issue for writing data to page retrieved from pipe. https://lore.kernel.org/linux-block/CAJfpeguQ3xn2-6svkkVXJ88tiVfcDd-eKi1evzzfvu305fMoyw@mail.gmail.com/ Thanks, Ming
On Tue, Mar 07, 2023 at 05:17:04PM +0000, Pavel Begunkov wrote: > On 3/7/23 15:37, Pavel Begunkov wrote: > > On 3/7/23 14:15, Ming Lei wrote: > > > Hello, > > > > > > Add IORING_OP_FUSED_CMD, it is one special URING_CMD, which has to > > > be SQE128. The 1st SQE(master) is one 64byte URING_CMD, and the 2nd > > > 64byte SQE(slave) is another normal 64byte OP. For any OP which needs > > > to support slave OP, io_issue_defs[op].fused_slave needs to be set as 1, > > > and its ->issue() can retrieve/import buffer from master request's > > > fused_cmd_kbuf. The slave OP is actually submitted from kernel, part of > > > this idea is from Xiaoguang's ublk ebpf patchset, but this patchset > > > submits slave OP just like normal OP issued from userspace, that said, > > > SQE order is kept, and batching handling is done too. > > > > From a quick look through patches it all looks a bit complicated > > and intrusive, all over generic hot paths. I think instead we > > should be able to use registered buffer table as intermediary and > > reuse splicing. Let me try it out > > Here we go, isolated in a new opcode, and in the end should work > with any file supporting splice. It's a quick prototype, it's lacking > and there are many obvious fatal bugs. It also needs some optimisations, > improvements on how executed by io_uring and extra stuff like > memcpy ops and fixed buf recv/send. I'll clean it up. > > I used a test below, it essentially does zc recv. > > https://github.com/isilence/liburing/commit/81fe705739af7d9b77266f9aa901c1ada870739d > > > From 87ad9e8e3aed683aa040fb4b9ae499f8726ba393 Mon Sep 17 00:00:00 2001 > Message-Id: <87ad9e8e3aed683aa040fb4b9ae499f8726ba393.1678208911.git.asml.silence@gmail.com> > From: Pavel Begunkov <asml.silence@gmail.com> > Date: Tue, 7 Mar 2023 17:01:44 +0000 > Subject: [POC 1/1] io_uring: splicing into reg buf table > > EXTREMELY BUGGY! Not for inclusion. > > Add a new operation called IORING_OP_SPLICE_FROM, > which splices from a file into the registered buffer table. This is > done in a zerocopy fashion with a caveat that the user won't have > direct access to the data, however it can use it with any io_uring > request supporting registered buffers. > > Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> > --- > include/uapi/linux/io_uring.h | 1 + > io_uring/io_uring.c | 4 +- > io_uring/opdef.c | 10 ++++ > io_uring/splice.c | 98 +++++++++++++++++++++++++++++++++++ > io_uring/splice.h | 3 ++ > 5 files changed, 114 insertions(+), 2 deletions(-) > > diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h > index 709de6d4feb2..a91ce1d2ebd7 100644 > --- a/include/uapi/linux/io_uring.h > +++ b/include/uapi/linux/io_uring.h > @@ -223,6 +223,7 @@ enum io_uring_op { > IORING_OP_URING_CMD, > IORING_OP_SEND_ZC, > IORING_OP_SENDMSG_ZC, > + IORING_OP_SPLICE_FROM, > /* this goes last, obviously */ > IORING_OP_LAST, > diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c > index 7625597b5227..b7389a6ea190 100644 > --- a/io_uring/io_uring.c > +++ b/io_uring/io_uring.c > @@ -2781,8 +2781,8 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx) > io_wait_rsrc_data(ctx->file_data); > mutex_lock(&ctx->uring_lock); > - if (ctx->buf_data) > - __io_sqe_buffers_unregister(ctx); > + // if (ctx->buf_data) > + // __io_sqe_buffers_unregister(ctx); > if (ctx->file_data) > __io_sqe_files_unregister(ctx); > io_cqring_overflow_kill(ctx); > diff --git a/io_uring/opdef.c b/io_uring/opdef.c > index cca7c5b55208..28d4fa42676b 100644 > --- a/io_uring/opdef.c > +++ b/io_uring/opdef.c > @@ -428,6 +428,13 @@ const struct io_issue_def io_issue_defs[] = { > .prep = io_eopnotsupp_prep, > #endif > }, > + [IORING_OP_SPLICE_FROM] = { > + .needs_file = 1, > + .unbound_nonreg_file = 1, > + // .pollin = 1, > + .prep = io_splice_from_prep, > + .issue = io_splice_from, > + } > }; > @@ -648,6 +655,9 @@ const struct io_cold_def io_cold_defs[] = { > .fail = io_sendrecv_fail, > #endif > }, > + [IORING_OP_SPLICE_FROM] = { > + .name = "SPLICE_FROM", > + } > }; > const char *io_uring_get_opcode(u8 opcode) > diff --git a/io_uring/splice.c b/io_uring/splice.c > index 2a4bbb719531..0467e9f46e99 100644 > --- a/io_uring/splice.c > +++ b/io_uring/splice.c > @@ -8,11 +8,13 @@ > #include <linux/namei.h> > #include <linux/io_uring.h> > #include <linux/splice.h> > +#include <linux/nospec.h> > #include <uapi/linux/io_uring.h> > #include "io_uring.h" > #include "splice.h" > +#include "rsrc.h" > struct io_splice { > struct file *file_out; > @@ -119,3 +121,99 @@ int io_splice(struct io_kiocb *req, unsigned int issue_flags) > io_req_set_res(req, ret, 0); > return IOU_OK; > } > + > +struct io_splice_from { > + struct file *file; > + loff_t off; > + u64 len; > +}; > + > + > +int io_splice_from_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) > +{ > + struct io_splice_from *sp = io_kiocb_to_cmd(req, struct io_splice_from); > + > + if (unlikely(sqe->splice_flags || sqe->splice_fd_in || sqe->ioprio || > + sqe->addr || sqe->addr3)) > + return -EINVAL; > + > + req->buf_index = READ_ONCE(sqe->buf_index); > + > + sp->len = READ_ONCE(sqe->len); > + if (unlikely(!sp->len)) > + return -EINVAL; > + > + sp->off = READ_ONCE(sqe->off); > + return 0; > +} > + > +int io_splice_from(struct io_kiocb *req, unsigned int issue_flags) > +{ > + struct io_splice_from *sp = io_kiocb_to_cmd(req, struct io_splice_from); > + loff_t *ppos = (sp->off == -1) ? NULL : &sp->off; > + struct io_mapped_ubuf *imu; > + struct pipe_inode_info *pi; > + struct io_ring_ctx *ctx; > + unsigned int pipe_tail; > + int ret, i, nr_pages; > + u16 index; > + > + if (!sp->file->f_op->splice_read) > + return -ENOTSUPP; > + > + pi = alloc_pipe_info(); The above should be replaced with direct pipe, otherwise every time allocating one pipe inode really hurts performance. > + if (!pi) > + return -ENOMEM; > + pi->readers = 1; > + > + ret = sp->file->f_op->splice_read(sp->file, ppos, pi, sp->len, 0); > + if (ret < 0) > + goto done; > + > + nr_pages = pipe_occupancy(pi->head, pi->tail); > + imu = kvmalloc(struct_size(imu, bvec, nr_pages), GFP_KERNEL); > + if (!imu) > + goto done; > + > + ret = 0; > + pipe_tail = pi->tail; > + for (i = 0; !pipe_empty(pi->head, pipe_tail); i++) { > + unsigned int mask = pi->ring_size - 1; // kill mask > + struct pipe_buffer *buf = &pi->bufs[pipe_tail & mask]; > + > + bvec_set_page(&imu->bvec[i], buf->page, buf->len, buf->offset); > + ret += buf->len; > + pipe_tail++; > + } > + if (WARN_ON_ONCE(i != nr_pages)) > + return -EFAULT; > + > + ctx = req->ctx; > + io_ring_submit_lock(ctx, issue_flags); > + if (unlikely(req->buf_index >= ctx->nr_user_bufs)) { > + /* TODO: cleanup pages */ > + ret = -EFAULT; > + kvfree(imu); > + goto done_unlock; > + } > + index = array_index_nospec(req->buf_index, ctx->nr_user_bufs); > + if (ctx->user_bufs[index] != ctx->dummy_ubuf) { > + /* TODO: cleanup pages */ > + kvfree(imu); > + ret = -EFAULT; > + goto done_unlock; > + } > + > + imu->ubuf = 0; > + imu->ubuf_end = ret; > + imu->nr_bvecs = nr_pages; > + ctx->user_bufs[index] = imu; Your patch looks like transferring pages ownership to io_uring fixed buffer, but unfortunately it can't be done in this way. splice is supposed for moving data, not transfer buffer ownership. https://lore.kernel.org/linux-block/CAJfpeguQ3xn2-6svkkVXJ88tiVfcDd-eKi1evzzfvu305fMoyw@mail.gmail.com/ 1) pages are actually owned by device side(ublk, here: sp->file), but we want to loan them to io_uring normal OPs. 2) after these pages are used by io_uring normal OPs, these pages have been returned back to sp->file, and the notification has to be done explicitly, because page is owned by sp->file of splice_read(). 3) pages RW direction has to limited strictly, and in case of ublk/fuse, device pages can only be read or write which depends on user io request direction. Also IMO it isn't good to add one buffer to ctx->user_bufs[] oneshot and retrieve it oneshot, and it can be set via req->imu simply in one fused command. Thanks, Ming
On 3/8/23 02:10, Ming Lei wrote: > On Tue, Mar 07, 2023 at 05:17:04PM +0000, Pavel Begunkov wrote: >> On 3/7/23 15:37, Pavel Begunkov wrote: >>> On 3/7/23 14:15, Ming Lei wrote: >>>> Hello, >>>> >>>> Add IORING_OP_FUSED_CMD, it is one special URING_CMD, which has to >>>> be SQE128. The 1st SQE(master) is one 64byte URING_CMD, and the 2nd >>>> 64byte SQE(slave) is another normal 64byte OP. For any OP which needs >>>> to support slave OP, io_issue_defs[op].fused_slave needs to be set as 1, >>>> and its ->issue() can retrieve/import buffer from master request's >>>> fused_cmd_kbuf. The slave OP is actually submitted from kernel, part of >>>> this idea is from Xiaoguang's ublk ebpf patchset, but this patchset >>>> submits slave OP just like normal OP issued from userspace, that said, >>>> SQE order is kept, and batching handling is done too. >>> >>> From a quick look through patches it all looks a bit complicated >>> and intrusive, all over generic hot paths. I think instead we >>> should be able to use registered buffer table as intermediary and >>> reuse splicing. Let me try it out >> >> Here we go, isolated in a new opcode, and in the end should work >> with any file supporting splice. It's a quick prototype, it's lacking >> and there are many obvious fatal bugs. It also needs some optimisations, >> improvements on how executed by io_uring and extra stuff like >> memcpy ops and fixed buf recv/send. I'll clean it up. >> >> I used a test below, it essentially does zc recv. >> >> https://github.com/isilence/liburing/commit/81fe705739af7d9b77266f9aa901c1ada870739d >> [...] >> +int io_splice_from(struct io_kiocb *req, unsigned int issue_flags) >> +{ >> + struct io_splice_from *sp = io_kiocb_to_cmd(req, struct io_splice_from); >> + loff_t *ppos = (sp->off == -1) ? NULL : &sp->off; >> + struct io_mapped_ubuf *imu; >> + struct pipe_inode_info *pi; >> + struct io_ring_ctx *ctx; >> + unsigned int pipe_tail; >> + int ret, i, nr_pages; >> + u16 index; >> + >> + if (!sp->file->f_op->splice_read) >> + return -ENOTSUPP; >> + >> + pi = alloc_pipe_info(); > > The above should be replaced with direct pipe, otherwise every time > allocating one pipe inode really hurts performance. We don't even need to alloc it dynanically, could be just on stack. There is a long list of TODOs I can add, e.g. polling support, retries, nowait, caching imu and so on. [...] > Your patch looks like transferring pages ownership to io_uring fixed > buffer, but unfortunately it can't be done in this way. splice is > supposed for moving data, not transfer buffer ownership. Borrowing rather than transferring. It's not obvious since it's not implemented in the patch, but the buffer should be eventually returned using the splice's ->release callback. > https://lore.kernel.org/linux-block/CAJfpeguQ3xn2-6svkkVXJ88tiVfcDd-eKi1evzzfvu305fMoyw@mail.gmail.com/ > > 1) pages are actually owned by device side(ublk, here: sp->file), but we want to > loan them to io_uring normal OPs. > > 2) after these pages are used by io_uring normal OPs, these pages have > been returned back to sp->file, and the notification has to be done > explicitly, because page is owned by sp->file of splice_read(). Right, see above, they're going to be returned back via ->release. > 3) pages RW direction has to limited strictly, and in case of ublk/fuse, > device pages can only be read or write which depends on user io request > direction. Yes, I know, and directions will be needed anyway for DMA mappings and different p2p cases in the future, but again a bunch of things is omitted here. > > Also IMO it isn't good to add one buffer to ctx->user_bufs[] oneshot and > retrieve it oneshot, and it can be set via req->imu simply in one fused > command. That's one of the points though. It's nice if not necessary (for a generic feature) to be able to do multiple ops on the data. For instance, if we have a memcpy request, we can link it to this splice / zc recv, memcpy necessary headers to the userspace and let it decide how to proceed with data.
On Wed, Mar 08, 2023 at 02:46:48PM +0000, Pavel Begunkov wrote: > On 3/8/23 02:10, Ming Lei wrote: > > On Tue, Mar 07, 2023 at 05:17:04PM +0000, Pavel Begunkov wrote: > > > On 3/7/23 15:37, Pavel Begunkov wrote: > > > > On 3/7/23 14:15, Ming Lei wrote: > > > > > Hello, > > > > > > > > > > Add IORING_OP_FUSED_CMD, it is one special URING_CMD, which has to > > > > > be SQE128. The 1st SQE(master) is one 64byte URING_CMD, and the 2nd > > > > > 64byte SQE(slave) is another normal 64byte OP. For any OP which needs > > > > > to support slave OP, io_issue_defs[op].fused_slave needs to be set as 1, > > > > > and its ->issue() can retrieve/import buffer from master request's > > > > > fused_cmd_kbuf. The slave OP is actually submitted from kernel, part of > > > > > this idea is from Xiaoguang's ublk ebpf patchset, but this patchset > > > > > submits slave OP just like normal OP issued from userspace, that said, > > > > > SQE order is kept, and batching handling is done too. > > > > > > > > From a quick look through patches it all looks a bit complicated > > > > and intrusive, all over generic hot paths. I think instead we > > > > should be able to use registered buffer table as intermediary and > > > > reuse splicing. Let me try it out > > > > > > Here we go, isolated in a new opcode, and in the end should work > > > with any file supporting splice. It's a quick prototype, it's lacking > > > and there are many obvious fatal bugs. It also needs some optimisations, > > > improvements on how executed by io_uring and extra stuff like > > > memcpy ops and fixed buf recv/send. I'll clean it up. > > > > > > I used a test below, it essentially does zc recv. > > > > > > https://github.com/isilence/liburing/commit/81fe705739af7d9b77266f9aa901c1ada870739d > > > > [...] > > > +int io_splice_from(struct io_kiocb *req, unsigned int issue_flags) > > > +{ > > > + struct io_splice_from *sp = io_kiocb_to_cmd(req, struct io_splice_from); > > > + loff_t *ppos = (sp->off == -1) ? NULL : &sp->off; > > > + struct io_mapped_ubuf *imu; > > > + struct pipe_inode_info *pi; > > > + struct io_ring_ctx *ctx; > > > + unsigned int pipe_tail; > > > + int ret, i, nr_pages; > > > + u16 index; > > > + > > > + if (!sp->file->f_op->splice_read) > > > + return -ENOTSUPP; > > > + > > > + pi = alloc_pipe_info(); > > > > The above should be replaced with direct pipe, otherwise every time > > allocating one pipe inode really hurts performance. > > We don't even need to alloc it dynanically, could be just > on stack. There is a long list of TODOs I can add, e.g. > polling support, retries, nowait, caching imu and so on. > > [...] > > Your patch looks like transferring pages ownership to io_uring fixed > > buffer, but unfortunately it can't be done in this way. splice is > > supposed for moving data, not transfer buffer ownership. > > Borrowing rather than transferring. It's not obvious since it's > not implemented in the patch, but the buffer should be eventually > returned using the splice's ->release callback. What is the splice's ->release() callback? Is pipe buffer's release()? If yes, there is at least the following two problems: 1) it requires the buffer to be saved(for calling its callback and use its private data to return back the whole buffer) in the pipe until it is consumed, which becomes one sync interface like splice syscall, and can't cross multiple io_uring OPs or per-buffer pipe inode is needed 2) pipe buffer's get()/release() works on per-buffer/page level, but we need to borrow the whole buffer, and the whole buffer could be used by arbitrary number of OPs, such as one IO buffer needs to be used for handling mirror or stripped targets, so when we know the buffer can be released? And basically it can't be known by kernel, and only application knows when to release it. Anyway, please post the whole patch, otherwise it is hard to see the whole picture, and devil is always in details, especially Linus mentioned splice can't be used in this way. > > > https://lore.kernel.org/linux-block/CAJfpeguQ3xn2-6svkkVXJ88tiVfcDd-eKi1evzzfvu305fMoyw@mail.gmail.com/ > > > > 1) pages are actually owned by device side(ublk, here: sp->file), but we want to > > loan them to io_uring normal OPs. > > > > 2) after these pages are used by io_uring normal OPs, these pages have > > been returned back to sp->file, and the notification has to be done > > explicitly, because page is owned by sp->file of splice_read(). > > Right, see above, they're going to be returned back via ->release. How? > > > 3) pages RW direction has to limited strictly, and in case of ublk/fuse, > > device pages can only be read or write which depends on user io request > > direction. > > Yes, I know, and directions will be needed anyway for DMA mappings and > different p2p cases in the future, but again a bunch of things is > omitted here. Please don't omitted it and it is one fundamental security problem. > > > > > Also IMO it isn't good to add one buffer to ctx->user_bufs[] oneshot and > > retrieve it oneshot, and it can be set via req->imu simply in one fused > > command. > > That's one of the points though. It's nice if not necessary (for a generic > feature) to be able to do multiple ops on the data. For instance, if we > have a memcpy request, we can link it to this splice / zc recv, memcpy > necessary headers to the userspace and let it decide how to proceed with > data. I feel it could be one big problem for buffer borrowing to cross more than one OPs, and when can the buffer be returned back? memory copy can be done simply by device's read/write interface, please see patch 15. Thanks, Ming
On 3/8/23 01:08, Ming Lei wrote: > On Tue, Mar 07, 2023 at 03:37:21PM +0000, Pavel Begunkov wrote: >> On 3/7/23 14:15, Ming Lei wrote: >>> Hello, >>> >>> Add IORING_OP_FUSED_CMD, it is one special URING_CMD, which has to >>> be SQE128. The 1st SQE(master) is one 64byte URING_CMD, and the 2nd >>> 64byte SQE(slave) is another normal 64byte OP. For any OP which needs >>> to support slave OP, io_issue_defs[op].fused_slave needs to be set as 1, >>> and its ->issue() can retrieve/import buffer from master request's >>> fused_cmd_kbuf. The slave OP is actually submitted from kernel, part of >>> this idea is from Xiaoguang's ublk ebpf patchset, but this patchset >>> submits slave OP just like normal OP issued from userspace, that said, >>> SQE order is kept, and batching handling is done too. >> >> From a quick look through patches it all looks a bit complicated >> and intrusive, all over generic hot paths. I think instead we > > Really? The main change to generic hot paths are adding one 'true/false' > parameter to io_init_req(). For others, the change is just check on > req->flags or issue_flags, which is basically zero cost. Extra flag in io_init_req() but also exporting it, which is an internal function, to non-core code. Additionally it un-inlines it and even looks recurse calls it (max depth 2). From a quick look, there is some hand coded ->cached_refs manipulations, it takes extra space in generic sections of io_kiocb. It makes all cmd users to check for IO_URING_F_FUSED. There is also a two-way dependency b/w requests, which never plays out well, e.g. I still hate how linked timeouts stick out in generic paths. Depending on SQE128 also doesn't seem right, though it can be dealt with, e.g. sth like how it's done with links requests. >> should be able to use registered buffer table as intermediary and >> reuse splicing. Let me try it out > > I will take a look at you patch, but last time, Linus has pointed out that > splice isn't one good way, in which buffer ownership transferring is one big > issue for writing data to page retrieved from pipe. There are no real pipes, better to say io_uring replaces a pipe, and splice bits are used to get pages from a file. Though, there will be some common problems. Thanks for the link, I'll need to get through it first, thanks for the link > https://lore.kernel.org/linux-block/CAJfpeguQ3xn2-6svkkVXJ88tiVfcDd-eKi1evzzfvu305fMoyw@mail.gmail.com/
On 3/8/23 16:17, Ming Lei wrote: > On Wed, Mar 08, 2023 at 02:46:48PM +0000, Pavel Begunkov wrote: >> On 3/8/23 02:10, Ming Lei wrote: >>> On Tue, Mar 07, 2023 at 05:17:04PM +0000, Pavel Begunkov wrote: >>>> On 3/7/23 15:37, Pavel Begunkov wrote: >>>>> On 3/7/23 14:15, Ming Lei wrote: >>>>>> Hello, >>>>>> >>>>>> Add IORING_OP_FUSED_CMD, it is one special URING_CMD, which has to >>>>>> be SQE128. The 1st SQE(master) is one 64byte URING_CMD, and the 2nd >>>>>> 64byte SQE(slave) is another normal 64byte OP. For any OP which needs >>>>>> to support slave OP, io_issue_defs[op].fused_slave needs to be set as 1, >>>>>> and its ->issue() can retrieve/import buffer from master request's >>>>>> fused_cmd_kbuf. The slave OP is actually submitted from kernel, part of >>>>>> this idea is from Xiaoguang's ublk ebpf patchset, but this patchset >>>>>> submits slave OP just like normal OP issued from userspace, that said, >>>>>> SQE order is kept, and batching handling is done too. >>>>> >>>>> From a quick look through patches it all looks a bit complicated >>>>> and intrusive, all over generic hot paths. I think instead we >>>>> should be able to use registered buffer table as intermediary and >>>>> reuse splicing. Let me try it out >>>> >>>> Here we go, isolated in a new opcode, and in the end should work >>>> with any file supporting splice. It's a quick prototype, it's lacking >>>> and there are many obvious fatal bugs. It also needs some optimisations, >>>> improvements on how executed by io_uring and extra stuff like >>>> memcpy ops and fixed buf recv/send. I'll clean it up. >>>> >>>> I used a test below, it essentially does zc recv. >>>> >>>> https://github.com/isilence/liburing/commit/81fe705739af7d9b77266f9aa901c1ada870739d >>>> >> [...] >>>> +int io_splice_from(struct io_kiocb *req, unsigned int issue_flags) >>>> +{ >>>> + struct io_splice_from *sp = io_kiocb_to_cmd(req, struct io_splice_from); >>>> + loff_t *ppos = (sp->off == -1) ? NULL : &sp->off; >>>> + struct io_mapped_ubuf *imu; >>>> + struct pipe_inode_info *pi; >>>> + struct io_ring_ctx *ctx; >>>> + unsigned int pipe_tail; >>>> + int ret, i, nr_pages; >>>> + u16 index; >>>> + >>>> + if (!sp->file->f_op->splice_read) >>>> + return -ENOTSUPP; >>>> + >>>> + pi = alloc_pipe_info(); >>> >>> The above should be replaced with direct pipe, otherwise every time >>> allocating one pipe inode really hurts performance. >> >> We don't even need to alloc it dynanically, could be just >> on stack. There is a long list of TODOs I can add, e.g. >> polling support, retries, nowait, caching imu and so on. >> >> [...] >>> Your patch looks like transferring pages ownership to io_uring fixed >>> buffer, but unfortunately it can't be done in this way. splice is >>> supposed for moving data, not transfer buffer ownership. >> >> Borrowing rather than transferring. It's not obvious since it's >> not implemented in the patch, but the buffer should be eventually >> returned using the splice's ->release callback. > > What is the splice's ->release() callback? Is pipe buffer's > release()? If yes, there is at least the following two problems: Right > 1) it requires the buffer to be saved(for calling its callback and use its private > data to return back the whole buffer) in the pipe until it is consumed, which becomes > one sync interface like splice syscall, and can't cross multiple io_uring OPs or > per-buffer pipe inode is needed We don't mix data from different sources, it's reasonable to expect that all buffers will have the same callback, then it'll be saved in struct io_mapped_ubuf. That's sth should definitely be checked and rejected if happens. > 2) pipe buffer's get()/release() works on per-buffer/page level, but > we need to borrow the whole buffer, and the whole buffer could be used Surely that can be improved. > by arbitrary number of OPs, such as one IO buffer needs to be used for > handling mirror or stripped targets, so when we know the buffer can be released? There is a separate efficient lifetime semantic for io_uring's registered buffers, which don't involve any get/put. It'll be freed according to it, i.e. when the userspace asks it to be removed and there are no more inflight requests. > And basically it can't be known by kernel, and only application knows > when to release it. > > Anyway, please post the whole patch, otherwise it is hard to see > the whole picture, and devil is always in details, especially Linus > mentioned splice can't be used in this way. Sure > >> >>> https://lore.kernel.org/linux-block/CAJfpeguQ3xn2-6svkkVXJ88tiVfcDd-eKi1evzzfvu305fMoyw@mail.gmail.com/ >>> >>> 1) pages are actually owned by device side(ublk, here: sp->file), but we want to >>> loan them to io_uring normal OPs. >>> >>> 2) after these pages are used by io_uring normal OPs, these pages have >>> been returned back to sp->file, and the notification has to be done >>> explicitly, because page is owned by sp->file of splice_read(). >> >> Right, see above, they're going to be returned back via ->release. > > How? I admit, I shouldn't have skipped it even for a quick POC. It'll save ->release() in struct io_mapped_ubuf and call it when the buffer is freed from io_uring perspective, that is there are no more requests using it and the user requested it to be removed. >>> 3) pages RW direction has to limited strictly, and in case of ublk/fuse, >>> device pages can only be read or write which depends on user io request >>> direction. >> >> Yes, I know, and directions will be needed anyway for DMA mappings and >> different p2p cases in the future, but again a bunch of things is >> omitted here. > > Please don't omitted it and it is one fundamental security problem. That's not interesting for a design concept with a huge warning. io_import_fixed() already takes a dir argument, we just need to check it against the buffer's one. >>> Also IMO it isn't good to add one buffer to ctx->user_bufs[] oneshot and >>> retrieve it oneshot, and it can be set via req->imu simply in one fused >>> command. >> >> That's one of the points though. It's nice if not necessary (for a generic >> feature) to be able to do multiple ops on the data. For instance, if we >> have a memcpy request, we can link it to this splice / zc recv, memcpy >> necessary headers to the userspace and let it decide how to proceed with >> data. > > I feel it could be one big problem for buffer borrowing to cross more than one > OPs, and when can the buffer be returned back? Described above > memory copy can be done simply by device's read/write interface, please see > patch 15. I don't think I understand how it looks in the userspace, maybe it's only applicable to ublk? but it seems that the concept of having one op producing a buffer and another consuming it don't go well with multi use in general case, especially stretched in time. E.g. you recv data, some of which is an application protocol header that should be looked at by the user and the rest is data that might be sent out somewhere else. Am I wrong?
On Wed, Mar 08, 2023 at 04:54:45PM +0000, Pavel Begunkov wrote: > On 3/8/23 16:17, Ming Lei wrote: > > On Wed, Mar 08, 2023 at 02:46:48PM +0000, Pavel Begunkov wrote: > > > On 3/8/23 02:10, Ming Lei wrote: > > > > On Tue, Mar 07, 2023 at 05:17:04PM +0000, Pavel Begunkov wrote: > > > > > On 3/7/23 15:37, Pavel Begunkov wrote: > > > > > > On 3/7/23 14:15, Ming Lei wrote: > > > > > > > Hello, > > > > > > > > > > > > > > Add IORING_OP_FUSED_CMD, it is one special URING_CMD, which has to > > > > > > > be SQE128. The 1st SQE(master) is one 64byte URING_CMD, and the 2nd > > > > > > > 64byte SQE(slave) is another normal 64byte OP. For any OP which needs > > > > > > > to support slave OP, io_issue_defs[op].fused_slave needs to be set as 1, > > > > > > > and its ->issue() can retrieve/import buffer from master request's > > > > > > > fused_cmd_kbuf. The slave OP is actually submitted from kernel, part of > > > > > > > this idea is from Xiaoguang's ublk ebpf patchset, but this patchset > > > > > > > submits slave OP just like normal OP issued from userspace, that said, > > > > > > > SQE order is kept, and batching handling is done too. > > > > > > > > > > > > From a quick look through patches it all looks a bit complicated > > > > > > and intrusive, all over generic hot paths. I think instead we > > > > > > should be able to use registered buffer table as intermediary and > > > > > > reuse splicing. Let me try it out > > > > > > > > > > Here we go, isolated in a new opcode, and in the end should work > > > > > with any file supporting splice. It's a quick prototype, it's lacking > > > > > and there are many obvious fatal bugs. It also needs some optimisations, > > > > > improvements on how executed by io_uring and extra stuff like > > > > > memcpy ops and fixed buf recv/send. I'll clean it up. > > > > > > > > > > I used a test below, it essentially does zc recv. > > > > > > > > > > https://github.com/isilence/liburing/commit/81fe705739af7d9b77266f9aa901c1ada870739d > > > > > > > > [...] > > > > > +int io_splice_from(struct io_kiocb *req, unsigned int issue_flags) > > > > > +{ > > > > > + struct io_splice_from *sp = io_kiocb_to_cmd(req, struct io_splice_from); > > > > > + loff_t *ppos = (sp->off == -1) ? NULL : &sp->off; > > > > > + struct io_mapped_ubuf *imu; > > > > > + struct pipe_inode_info *pi; > > > > > + struct io_ring_ctx *ctx; > > > > > + unsigned int pipe_tail; > > > > > + int ret, i, nr_pages; > > > > > + u16 index; > > > > > + > > > > > + if (!sp->file->f_op->splice_read) > > > > > + return -ENOTSUPP; > > > > > + > > > > > + pi = alloc_pipe_info(); > > > > > > > > The above should be replaced with direct pipe, otherwise every time > > > > allocating one pipe inode really hurts performance. > > > > > > We don't even need to alloc it dynanically, could be just > > > on stack. There is a long list of TODOs I can add, e.g. > > > polling support, retries, nowait, caching imu and so on. > > > > > > [...] > > > > Your patch looks like transferring pages ownership to io_uring fixed > > > > buffer, but unfortunately it can't be done in this way. splice is > > > > supposed for moving data, not transfer buffer ownership. > > > > > > Borrowing rather than transferring. It's not obvious since it's > > > not implemented in the patch, but the buffer should be eventually > > > returned using the splice's ->release callback. > > > > What is the splice's ->release() callback? Is pipe buffer's > > release()? If yes, there is at least the following two problems: > > Right > > > 1) it requires the buffer to be saved(for calling its callback and use its private > > data to return back the whole buffer) in the pipe until it is consumed, which becomes > > one sync interface like splice syscall, and can't cross multiple io_uring OPs or > > per-buffer pipe inode is needed > > We don't mix data from different sources, it's reasonable to expect > that all buffers will have the same callback, then it'll be saved > in struct io_mapped_ubuf. That's sth should definitely be checked and > rejected if happens. It seems reasonable to expect ->release() &->confirm() to be same, but not sure if pipe_buffer->flags & pipe_buffer->private are same, and I think pipe_buffer->private has to be saved too, since ->release() could use it. ->read_splice() is one generic interface, if we support it in the new io_uring OP, all existed ->read_splice() have to be supported. Trust me, I did think about this approach, and the main difference is to add the buffer into io_uring ctx fixed buffer. > > > 2) pipe buffer's get()/release() works on per-buffer/page level, but > > we need to borrow the whole buffer, and the whole buffer could be used > > Surely that can be improved. > > > by arbitrary number of OPs, such as one IO buffer needs to be used for > > handling mirror or stripped targets, so when we know the buffer can be released? > > There is a separate efficient lifetime semantic for io_uring's registered > buffers, which don't involve any get/put. It'll be freed according to it, > i.e. when the userspace asks it to be removed and there are no more > inflight requests. Then one new OP need to be added for removing the buffer explicitly. Then at least three OPs(SQEs)(one for adding the buffer, one or more for consuming it, one for removing it) are involved for handling single IO, which can't be efficient. > > > And basically it can't be known by kernel, and only application knows > > when to release it. > > > > Anyway, please post the whole patch, otherwise it is hard to see > > the whole picture, and devil is always in details, especially Linus > > mentioned splice can't be used in this way. > > Sure > > > > > > > > > > https://lore.kernel.org/linux-block/CAJfpeguQ3xn2-6svkkVXJ88tiVfcDd-eKi1evzzfvu305fMoyw@mail.gmail.com/ > > > > > > > > 1) pages are actually owned by device side(ublk, here: sp->file), but we want to > > > > loan them to io_uring normal OPs. > > > > > > > > 2) after these pages are used by io_uring normal OPs, these pages have > > > > been returned back to sp->file, and the notification has to be done > > > > explicitly, because page is owned by sp->file of splice_read(). > > > > > > Right, see above, they're going to be returned back via ->release. > > > > How? > > I admit, I shouldn't have skipped it even for a quick POC. It'll save > ->release() in struct io_mapped_ubuf and call it when the buffer is > freed from io_uring perspective, that is there are no more requests > using it and the user requested it to be removed. It isn't enough to save ->release() only(->confirm, ->flags, ->private), and all existed ->read_splice() has to be considered. Basically the io_mapped_ubuf becomes one per-buffer (thin)pipe, and the extra allocation could be big. > > > > > 3) pages RW direction has to limited strictly, and in case of ublk/fuse, > > > > device pages can only be read or write which depends on user io request > > > > direction. > > > > > > Yes, I know, and directions will be needed anyway for DMA mappings and > > > different p2p cases in the future, but again a bunch of things is > > > omitted here. > > > > Please don't omitted it and it is one fundamental security problem. > > That's not interesting for a design concept with a huge warning. > io_import_fixed() already takes a dir argument, we just need to check > it against the buffer's one. The dir in io_import_fixed() can't be validated, and only the buffer owner knows if the buffer can be read or write. The main issue is that for one ublk/fuse read io request, the buffer can only be filled with data, and can't be read to somewhere, otherwise kernel data could be leaked to userspace. And the check has to be done in ->read_splcie() or new pipe_buffer->flags has to be added, that is the exact topic we discussed before, and NAKed by Linus. https://lore.kernel.org/linux-block/CAJfpeguQ3xn2-6svkkVXJ88tiVfcDd-eKi1evzzfvu305fMoyw@mail.gmail.com/ > > > > > > Also IMO it isn't good to add one buffer to ctx->user_bufs[] oneshot and > > > > retrieve it oneshot, and it can be set via req->imu simply in one fused > > > > command. > > > > > > That's one of the points though. It's nice if not necessary (for a generic > > > feature) to be able to do multiple ops on the data. For instance, if we > > > have a memcpy request, we can link it to this splice / zc recv, memcpy > > > necessary headers to the userspace and let it decide how to proceed with > > > data. > > > > I feel it could be one big problem for buffer borrowing to cross more than one > > OPs, and when can the buffer be returned back? > > Described above > > > memory copy can be done simply by device's read/write interface, please see > > patch 15. > > I don't think I understand how it looks in the userspace, maybe it's > only applicable to ublk? but it seems that the concept of having one op > producing a buffer and another consuming it don't go well with multi > use in general case, especially stretched in time. So far, the interface is only for ublk/fuse, and the use case for ublk is generic, such as: we have one logic volume manager ublk driver, and now we need to mirror the io request to multiple underlying devices, then the buffer needs to be consumed for each device. Same with stripped, or distribute network storage(one same request need to be sent to more than one remote machine). > > E.g. you recv data, some of which is an application protocol header > that should be looked at by the user and the rest is data that might > be sent out somewhere else. Yeah, see the example[1] in ublk/nbd, application protocol is recv/send by its own SQE, and io data is handled in standalone SQE. Here we only deal with ublk block io data, which isn't changed most of times. [1] https://github.com/ming1/ubdsrv/blob/master/nbd/tgt_nbd.cpp Thanks, Ming
On Wed, Mar 08, 2023 at 04:22:15PM +0000, Pavel Begunkov wrote: > On 3/8/23 01:08, Ming Lei wrote: > > On Tue, Mar 07, 2023 at 03:37:21PM +0000, Pavel Begunkov wrote: > > > On 3/7/23 14:15, Ming Lei wrote: > > > > Hello, > > > > > > > > Add IORING_OP_FUSED_CMD, it is one special URING_CMD, which has to > > > > be SQE128. The 1st SQE(master) is one 64byte URING_CMD, and the 2nd > > > > 64byte SQE(slave) is another normal 64byte OP. For any OP which needs > > > > to support slave OP, io_issue_defs[op].fused_slave needs to be set as 1, > > > > and its ->issue() can retrieve/import buffer from master request's > > > > fused_cmd_kbuf. The slave OP is actually submitted from kernel, part of > > > > this idea is from Xiaoguang's ublk ebpf patchset, but this patchset > > > > submits slave OP just like normal OP issued from userspace, that said, > > > > SQE order is kept, and batching handling is done too. > > > > > > From a quick look through patches it all looks a bit complicated > > > and intrusive, all over generic hot paths. I think instead we > > > > Really? The main change to generic hot paths are adding one 'true/false' > > parameter to io_init_req(). For others, the change is just check on > > req->flags or issue_flags, which is basically zero cost. > > Extra flag in io_init_req() but also exporting it, which is an > internal function, to non-core code. Additionally it un-inlines it We can make it inline for core code only. > and even looks recurse calls it (max depth 2). From a quick look, The reurse call is only done for fused command, and won't be one issue for normal OPs. > there is some hand coded ->cached_refs manipulations, it takes extra > space in generic sections of io_kiocb. Yeah, but it is still done on fused command only. I think people is happy to pay the cost for the benefit, and we do not cause trouble for others. > It makes all cmd users to > check for IO_URING_F_FUSED. There is also a two-way dependency b/w The check is zero cost, and just for avoiding to add ->fused_cmd() callback, otherwise the check can be killed. > requests, which never plays out well, e.g. I still hate how linked > timeouts stick out in generic paths. I appreciate you may explain it in details. Yeah, part of fused command's job is to submit one new io and wait its completion. slave request is actually invisible in the linked list, and only fused command can be in the linked list. > > Depending on SQE128 also doesn't seem right, though it can be dealt > with, e.g. sth like how it's done with links requests. I thought about handling it by linked request, but we need fused command to be completed after the slave request is done, and that becomes one deadlock if the two are linked together. SQE128 is per-context feature, when we need to submit uring SQE128 command, the same ring is required to handle IO, then IMO it is perfect for this case, at least for ublk. > > > > should be able to use registered buffer table as intermediary and > > > reuse splicing. Let me try it out > > > > I will take a look at you patch, but last time, Linus has pointed out that > > splice isn't one good way, in which buffer ownership transferring is one big > > issue for writing data to page retrieved from pipe. > > There are no real pipes, better to say io_uring replaces a pipe, > and splice bits are used to get pages from a file. Though, there > will be some common problems. Thanks for the link, I'll need to > get through it first, thanks for the link Yeah, here the only value of pipe is to reuse ->splice_read() interface, that is why I figure out fused command for this job. I am open for other approaches, if the problem can be solved(reliably and efficiently). Thanks, Ming
On 2023/3/7 22:15, Ming Lei wrote: > Hello, > > Add IORING_OP_FUSED_CMD, it is one special URING_CMD, which has to > be SQE128. The 1st SQE(master) is one 64byte URING_CMD, and the 2nd > 64byte SQE(slave) is another normal 64byte OP. For any OP which needs > to support slave OP, io_issue_defs[op].fused_slave needs to be set as 1, > and its ->issue() can retrieve/import buffer from master request's > fused_cmd_kbuf. The slave OP is actually submitted from kernel, part of > this idea is from Xiaoguang's ublk ebpf patchset, but this patchset > submits slave OP just like normal OP issued from userspace, that said, > SQE order is kept, and batching handling is done too. > > Please see detailed design in commit log of the 3th patch, and one big > point is how to handle buffer ownership. > > With this way, it is easy to support zero copy for ublk/fuse device. > > Basically userspace can specify any sub-buffer of the ublk block request > buffer from the fused command just by setting 'offset/len' > in the slave SQE for running slave OP. This way is flexible to implement > io mapping: mirror, stripped, ... > > The 4th & 5th patches enable fused slave support for the following OPs: > > OP_READ/OP_WRITE > OP_SEND/OP_RECV/OP_SEND_ZC > > The other ublk patches cleans ublk driver and implement fused command > for supporting zero copy. > > Follows userspace code: > > https://github.com/ming1/ubdsrv/tree/fused-cmd-zc-v2 > > All three(loop, nbd and qcow2) ublk targets have supported zero copy by passing: > > ublk add -t [loop|nbd|qcow2] -z .... > > Basic fs mount/kernel building and builtin test are done. > > Also add liburing test case for covering fused command based on miniublk > of blktest: > > https://github.com/ming1/liburing/commits/fused_cmd_miniublk > > Performance improvement is obvious on memory bandwidth > related workloads, such as, 1~2X improvement on 64K/512K BS > IO test on loop with ramfs backing file. > > Any comments are welcome! > Hi Ming, Maybe we can split patch 06-12 to a separate cleanup pacthset. I think these patches can be merged first because they are not related to zero copy. Regards, Zhang
On Wed, Mar 15, 2023 at 03:08:21PM +0800, Ziyang Zhang wrote: > On 2023/3/7 22:15, Ming Lei wrote: > > Hello, > > > > Add IORING_OP_FUSED_CMD, it is one special URING_CMD, which has to > > be SQE128. The 1st SQE(master) is one 64byte URING_CMD, and the 2nd > > 64byte SQE(slave) is another normal 64byte OP. For any OP which needs > > to support slave OP, io_issue_defs[op].fused_slave needs to be set as 1, > > and its ->issue() can retrieve/import buffer from master request's > > fused_cmd_kbuf. The slave OP is actually submitted from kernel, part of > > this idea is from Xiaoguang's ublk ebpf patchset, but this patchset > > submits slave OP just like normal OP issued from userspace, that said, > > SQE order is kept, and batching handling is done too. > > > > Please see detailed design in commit log of the 3th patch, and one big > > point is how to handle buffer ownership. > > > > With this way, it is easy to support zero copy for ublk/fuse device. > > > > Basically userspace can specify any sub-buffer of the ublk block request > > buffer from the fused command just by setting 'offset/len' > > in the slave SQE for running slave OP. This way is flexible to implement > > io mapping: mirror, stripped, ... > > > > The 4th & 5th patches enable fused slave support for the following OPs: > > > > OP_READ/OP_WRITE > > OP_SEND/OP_RECV/OP_SEND_ZC > > > > The other ublk patches cleans ublk driver and implement fused command > > for supporting zero copy. > > > > Follows userspace code: > > > > https://github.com/ming1/ubdsrv/tree/fused-cmd-zc-v2 > > > > All three(loop, nbd and qcow2) ublk targets have supported zero copy by passing: > > > > ublk add -t [loop|nbd|qcow2] -z .... > > > > Basic fs mount/kernel building and builtin test are done. > > > > Also add liburing test case for covering fused command based on miniublk > > of blktest: > > > > https://github.com/ming1/liburing/commits/fused_cmd_miniublk > > > > Performance improvement is obvious on memory bandwidth > > related workloads, such as, 1~2X improvement on 64K/512K BS > > IO test on loop with ramfs backing file. > > > > Any comments are welcome! > > > > > Hi Ming, > > Maybe we can split patch 06-12 to a separate cleanup pacthset. I think > these patches can be merged first because they are not related to zero copy. Hi Ziyang, I think the fused command approach is doable, and the patchset is basically ready to go now: - the fused command approach won't affect normal uring OP - most of the change focuses on io_uring/fused_cmd.* - this approach works reliably and efficiently, pass xfstests, and verified on all three ublk targets(loop, nbd, qcow2), improvement is pretty obvious so I'd rather to put them in one series and aim at v6.14, then we can avoid conflict. However if things doesn't work toward the expected direction, we can move on with another way at that time, so it is too early to separate the cleanup patchset now. thanks, Ming