diff mbox series

[v5,06/10] io_uring/rw: add support to send metadata along with read/write

Message ID 20241029162402.21400-7-anuj20.g@samsung.com (mailing list archive)
State New
Headers show
Series Read/Write with meta/integrity | expand

Commit Message

Anuj Gupta Oct. 29, 2024, 4:23 p.m. UTC
This patch adds the capability of sending metadata along with read/write.
A new meta_type field is introduced in SQE which indicates the type of
metadata being passed. This meta is represented by a newly introduced
'struct io_uring_meta_pi' which specifies information such as flags,buffer
length,seed and apptag. Application sets up a SQE128 ring, prepares
io_uring_meta_pi within the second SQE.
The patch processes the user-passed information to prepare uio_meta
descriptor and passes it down using kiocb->private.

Meta exchange is supported only for direct IO.
Also vectored read/write operations with meta are not supported
currently.

Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
---
 include/uapi/linux/io_uring.h | 29 +++++++++++++
 io_uring/io_uring.c           |  9 ++++
 io_uring/rw.c                 | 79 ++++++++++++++++++++++++++++++++++-
 io_uring/rw.h                 | 14 ++++++-
 4 files changed, 128 insertions(+), 3 deletions(-)

Comments

Keith Busch Oct. 29, 2024, 11:24 p.m. UTC | #1
On Tue, Oct 29, 2024 at 09:53:58PM +0530, Anuj Gupta wrote:
> This patch adds the capability of sending metadata along with read/write.
> A new meta_type field is introduced in SQE which indicates the type of
> metadata being passed. This meta is represented by a newly introduced
> 'struct io_uring_meta_pi' which specifies information such as flags,buffer
> length,seed and apptag. Application sets up a SQE128 ring, prepares
> io_uring_meta_pi within the second SQE.
> The patch processes the user-passed information to prepare uio_meta
> descriptor and passes it down using kiocb->private.
> 
> Meta exchange is supported only for direct IO.
> Also vectored read/write operations with meta are not supported
> currently.

It looks like it is reasonable to add support for fixed buffers too.
There would be implications for subsequent patches, mostly patch 10, but
it looks like we can do that.

Anyway, this patch mostly looks okay to me. I don't know about the whole
"meta_type" thing. My understanding from Pavel was wanting a way to
chain command specific extra options. For example, userspace metadata
and write hints, and this doesn't look like it can be extended to do
that.
Kanchan Joshi Oct. 30, 2024, 5:05 a.m. UTC | #2
On 10/30/2024 4:54 AM, Keith Busch wrote:
> On Tue, Oct 29, 2024 at 09:53:58PM +0530, Anuj Gupta wrote:
>> This patch adds the capability of sending metadata along with read/write.
>> A new meta_type field is introduced in SQE which indicates the type of
>> metadata being passed. This meta is represented by a newly introduced
>> 'struct io_uring_meta_pi' which specifies information such as flags,buffer
>> length,seed and apptag. Application sets up a SQE128 ring, prepares
>> io_uring_meta_pi within the second SQE.
>> The patch processes the user-passed information to prepare uio_meta
>> descriptor and passes it down using kiocb->private.
>>
>> Meta exchange is supported only for direct IO.
>> Also vectored read/write operations with meta are not supported
>> currently.
> 
> It looks like it is reasonable to add support for fixed buffers too.
> There would be implications for subsequent patches, mostly patch 10, but
> it looks like we can do that.

Fixed buffers for data continues to be supported with this.
Do you mean fixed buffers for metadata?
We can take that as an incremental addition outside of this series which 
is already touching various subsystems (io_uring, block, nvme, scsi, fs).

> Anyway, this patch mostly looks okay to me. I don't know about the whole
> "meta_type" thing. My understanding from Pavel was wanting a way to
> chain command specific extra options.

Right. During LSFMM, he mentioned Btrfs needed to send extra stuff with 
read/write.
But in general, this is about seeing metadata as a generic term to 
encode extra information into io_uring SQE.
It may not be very uncommon that people will have the need to send extra 
stuff with read/write and add specific processing for that. And 
SQE->meta_type helps to isolate all such processing from the common case 
when no extra stuff is sent.

if (sqe->meta_type)
{
	if (type1(sqe->meta_type))
		process(type1);
	if (type2(sqe>meta_type))
		process(type1);
}

  For example, userspace metadata
> and write hints, and this doesn't look like it can be extended to do
> that.

It can be. And in past I used that to represent different types of write 
hints.
Just that in the current version, write hints are being sent without any 
type.
Christoph Hellwig Oct. 30, 2024, 5:08 a.m. UTC | #3
On Wed, Oct 30, 2024 at 10:35:19AM +0530, Kanchan Joshi wrote:
> if (sqe->meta_type)
> {
> 	if (type1(sqe->meta_type))
> 		process(type1);
> 	if (type2(sqe>meta_type))
> 		process(type1);
> }

Ensuring that all these are incompatible, which doesn't exactly scale.

So as is this weird meta_type thing (especially overloading the
meta name which is unfortuntely) feels actively harmful.
Pavel Begunkov Nov. 7, 2024, 5:30 p.m. UTC | #4
On 10/29/24 23:24, Keith Busch wrote:
> On Tue, Oct 29, 2024 at 09:53:58PM +0530, Anuj Gupta wrote:
>> This patch adds the capability of sending metadata along with read/write.
>> A new meta_type field is introduced in SQE which indicates the type of
>> metadata being passed. This meta is represented by a newly introduced
>> 'struct io_uring_meta_pi' which specifies information such as flags,buffer
>> length,seed and apptag. Application sets up a SQE128 ring, prepares
>> io_uring_meta_pi within the second SQE.
>> The patch processes the user-passed information to prepare uio_meta
>> descriptor and passes it down using kiocb->private.
>>
>> Meta exchange is supported only for direct IO.
>> Also vectored read/write operations with meta are not supported
>> currently.
> 
> It looks like it is reasonable to add support for fixed buffers too.
> There would be implications for subsequent patches, mostly patch 10, but
> it looks like we can do that.
> 
> Anyway, this patch mostly looks okay to me. I don't know about the whole
> "meta_type" thing. My understanding from Pavel was wanting a way to
> chain command specific extra options. For example, userspace metadata
> and write hints, and this doesn't look like it can be extended to do
> that.

It makes sense to implement write hints as a meta/attribute type,
but depends on whether it's supposed to be widely supported by
different file types vs it being a block specific feature, and if
SQEs have space for it.
Christoph Hellwig Nov. 8, 2024, 7:12 a.m. UTC | #5
On Thu, Nov 07, 2024 at 05:30:36PM +0000, Pavel Begunkov wrote:
> It makes sense to implement write hints as a meta/attribute type,
> but depends on whether it's supposed to be widely supported by
> different file types vs it being a block specific feature, and if
> SQEs have space for it.

It make sense everywhere.  Implementing it for direct I/O on regular
files is mostly trivial and I'll do it once this series lands.
diff mbox series

Patch

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 024745283783..4dab2b904394 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -92,6 +92,10 @@  struct io_uring_sqe {
 			__u16	addr_len;
 			__u16	__pad3[1];
 		};
+		struct {
+			__u16	meta_type;
+			__u16	__pad4[1];
+		};
 	};
 	union {
 		struct {
@@ -105,6 +109,31 @@  struct io_uring_sqe {
 		 */
 		__u8	cmd[0];
 	};
+	/*
+	 * If the ring is initialized with IORING_SETUP_SQE128, then
+	 * this field is starting offset for 64 bytes of data. For meta io
+	 * this contains 'struct io_uring_meta_pi'
+	 */
+	__u8	big_sqe[0];
+};
+
+enum io_uring_sqe_meta_type_bits {
+	META_TYPE_PI_BIT,
+	/* not a real meta type; just to make sure that we don't overflow */
+	META_TYPE_LAST_BIT,
+};
+
+/* meta type flags */
+#define META_TYPE_PI	(1U << META_TYPE_PI_BIT)
+
+/* this goes to SQE128 */
+struct io_uring_meta_pi {
+	__u16		pi_flags;
+	__u16		app_tag;
+	__u32		len;
+	__u64		addr;
+	__u64		seed;
+	__u64		rsvd;
 };
 
 /*
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 4514644fdf52..b3aeddeaba2f 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3875,10 +3875,13 @@  static int __init io_uring_init(void)
 	BUILD_BUG_SQE_ELEM(44, __s32,  splice_fd_in);
 	BUILD_BUG_SQE_ELEM(44, __u32,  file_index);
 	BUILD_BUG_SQE_ELEM(44, __u16,  addr_len);
+	BUILD_BUG_SQE_ELEM(44, __u16,  meta_type);
 	BUILD_BUG_SQE_ELEM(46, __u16,  __pad3[0]);
+	BUILD_BUG_SQE_ELEM(46, __u16,  __pad4[0]);
 	BUILD_BUG_SQE_ELEM(48, __u64,  addr3);
 	BUILD_BUG_SQE_ELEM_SIZE(48, 0, cmd);
 	BUILD_BUG_SQE_ELEM(56, __u64,  __pad2);
+	BUILD_BUG_SQE_ELEM_SIZE(64, 0, big_sqe);
 
 	BUILD_BUG_ON(sizeof(struct io_uring_files_update) !=
 		     sizeof(struct io_uring_rsrc_update));
@@ -3902,6 +3905,12 @@  static int __init io_uring_init(void)
 	/* top 8bits are for internal use */
 	BUILD_BUG_ON((IORING_URING_CMD_MASK & 0xff000000) != 0);
 
+	BUILD_BUG_ON(sizeof(struct io_uring_meta_pi) >
+		     sizeof(struct io_uring_sqe));
+
+	BUILD_BUG_ON(META_TYPE_LAST_BIT >
+		     8 * sizeof_field(struct io_uring_sqe, meta_type));
+
 	io_uring_optable_init();
 
 	/*
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 7ce1cbc048fa..bcff3ae76268 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -257,11 +257,58 @@  static int io_prep_rw_setup(struct io_kiocb *req, int ddir, bool do_import)
 	return 0;
 }
 
+static inline void io_meta_save_state(struct io_async_rw *io)
+{
+	io->meta_state.seed = io->meta.seed;
+	iov_iter_save_state(&io->meta.iter, &io->meta_state.iter_meta);
+}
+
+static inline void io_meta_restore(struct io_async_rw *io)
+{
+	io->meta.seed = io->meta_state.seed;
+	iov_iter_restore(&io->meta.iter, &io->meta_state.iter_meta);
+}
+
+static int io_prep_rw_meta(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+			   struct io_rw *rw, int ddir, u16 meta_type)
+{
+	const struct io_uring_meta_pi *md = (struct io_uring_meta_pi *)sqe->big_sqe;
+	const struct io_issue_def *def;
+	struct io_async_rw *io;
+	int ret;
+
+	if (READ_ONCE(sqe->__pad4[0]))
+		return -EINVAL;
+	if (!(meta_type & META_TYPE_PI))
+		return -EINVAL;
+	if (!(req->ctx->flags & IORING_SETUP_SQE128))
+		return -EINVAL;
+	if (READ_ONCE(md->rsvd))
+		return -EINVAL;
+
+	def = &io_issue_defs[req->opcode];
+	if (def->vectored)
+		return -EOPNOTSUPP;
+
+	io = req->async_data;
+	io->meta.flags = READ_ONCE(md->pi_flags);
+	io->meta.app_tag = READ_ONCE(md->app_tag);
+	io->meta.seed = READ_ONCE(md->seed);
+	ret = import_ubuf(ddir, u64_to_user_ptr(READ_ONCE(md->addr)),
+			  READ_ONCE(md->len), &io->meta.iter);
+	if (unlikely(ret < 0))
+		return ret;
+	rw->kiocb.ki_flags |= IOCB_HAS_METADATA;
+	io_meta_save_state(io);
+	return ret;
+}
+
 static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		      int ddir, bool do_import)
 {
 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
 	unsigned ioprio;
+	u16 meta_type;
 	int ret;
 
 	rw->kiocb.ki_pos = READ_ONCE(sqe->off);
@@ -279,11 +326,20 @@  static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		rw->kiocb.ki_ioprio = get_current_ioprio();
 	}
 	rw->kiocb.dio_complete = NULL;
+	rw->kiocb.ki_flags = 0;
 
 	rw->addr = READ_ONCE(sqe->addr);
 	rw->len = READ_ONCE(sqe->len);
 	rw->flags = READ_ONCE(sqe->rw_flags);
-	return io_prep_rw_setup(req, ddir, do_import);
+	ret = io_prep_rw_setup(req, ddir, do_import);
+
+	if (unlikely(ret))
+		return ret;
+
+	meta_type = READ_ONCE(sqe->meta_type);
+	if (meta_type)
+		ret = io_prep_rw_meta(req, sqe, rw, ddir, meta_type);
+	return ret;
 }
 
 int io_prep_read(struct io_kiocb *req, const struct io_uring_sqe *sqe)
@@ -410,7 +466,10 @@  static inline loff_t *io_kiocb_update_pos(struct io_kiocb *req)
 static void io_resubmit_prep(struct io_kiocb *req)
 {
 	struct io_async_rw *io = req->async_data;
+	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
 
+	if (rw->kiocb.ki_flags & IOCB_HAS_METADATA)
+		io_meta_restore(io);
 	iov_iter_restore(&io->iter, &io->iter_state);
 }
 
@@ -795,7 +854,7 @@  static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type)
 	if (!(req->flags & REQ_F_FIXED_FILE))
 		req->flags |= io_file_get_flags(file);
 
-	kiocb->ki_flags = file->f_iocb_flags;
+	kiocb->ki_flags |= file->f_iocb_flags;
 	ret = kiocb_set_rw_flags(kiocb, rw->flags, rw_type);
 	if (unlikely(ret))
 		return ret;
@@ -824,6 +883,18 @@  static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type)
 		kiocb->ki_complete = io_complete_rw;
 	}
 
+	if (kiocb->ki_flags & IOCB_HAS_METADATA) {
+		struct io_async_rw *io = req->async_data;
+
+		/*
+		 * We have a union of meta fields with wpq used for buffered-io
+		 * in io_async_rw, so fail it here.
+		 */
+		if (!(req->file->f_flags & O_DIRECT))
+			return -EOPNOTSUPP;
+		kiocb->private = &io->meta;
+	}
+
 	return 0;
 }
 
@@ -898,6 +969,8 @@  static int __io_read(struct io_kiocb *req, unsigned int issue_flags)
 	 * manually if we need to.
 	 */
 	iov_iter_restore(&io->iter, &io->iter_state);
+	if (kiocb->ki_flags & IOCB_HAS_METADATA)
+		io_meta_restore(io);
 
 	do {
 		/*
@@ -1102,6 +1175,8 @@  int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	} else {
 ret_eagain:
 		iov_iter_restore(&io->iter, &io->iter_state);
+		if (kiocb->ki_flags & IOCB_HAS_METADATA)
+			io_meta_restore(io);
 		if (kiocb->ki_flags & IOCB_WRITE)
 			io_req_end_write(req);
 		return -EAGAIN;
diff --git a/io_uring/rw.h b/io_uring/rw.h
index 3f432dc75441..2d7656bd268d 100644
--- a/io_uring/rw.h
+++ b/io_uring/rw.h
@@ -2,6 +2,11 @@ 
 
 #include <linux/pagemap.h>
 
+struct io_meta_state {
+	u32			seed;
+	struct iov_iter_state	iter_meta;
+};
+
 struct io_async_rw {
 	size_t				bytes_done;
 	struct iov_iter			iter;
@@ -9,7 +14,14 @@  struct io_async_rw {
 	struct iovec			fast_iov;
 	struct iovec			*free_iovec;
 	int				free_iov_nr;
-	struct wait_page_queue		wpq;
+	/* wpq is for buffered io, while meta fields are used with direct io */
+	union {
+		struct wait_page_queue		wpq;
+		struct {
+			struct uio_meta			meta;
+			struct io_meta_state		meta_state;
+		};
+	};
 };
 
 int io_prep_read_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe);