[RFC,7/9] io_uring: Introduce IORING_OP_CLONE

Message ID 20241209234316.4132786-8-krisman@suse.de (mailing list archive)
State New
Series Launching processes with io_uring

Commit Message

Gabriel Krisman Bertazi Dec. 9, 2024, 11:43 p.m. UTC
From: Josh Triplett <josh@joshtriplett.org>

This command spawns a short-lived asynchronous context to execute the
following linked operations.  Once the link completes, the task
terminates.  This is especially useful for creating new processes, by
linking an IORING_OP_EXEC at the end of the chain. In that case, the
task doesn't terminate, but returns to userspace, starting the new
process.

This is different from the existing io workqueues in a few ways: First,
it is completely separate from the io-wq code, and the task cannot be
reused by a future link; Second, the task doesn't share the FDT, and
other process structures, with the rest of io_uring (except for the
memory map); Finally, because of the limited context, it doesn't support
executing requests asynchronously and requeueing them. Every request must
complete at ->issue time, or fail.  It also doesn't support task_work
execution, for a similar reason.  The goal of this design is to allow the
user to close file descriptors, release locks and do other cleanups
right before switching to a new process.
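
As a hedged illustration of the intended submission flow (IORING_OP_CLONE
and IORING_OP_EXEC exist only in this RFC series; liburing has no helpers
for them, so io_uring_prep_clone()/io_uring_prep_exec() below are invented
names), a userspace chain might look like:

```c
/* Hypothetical sketch -- io_uring_prep_clone()/io_uring_prep_exec() do
 * not exist; io_uring_prep_close() and IOSQE_IO_LINK are real liburing
 * API.  Assumes an initialized `ring`, plus `unwanted_fd`, `argv`,
 * `envp` from the caller. */
struct io_uring_sqe *sqe;

sqe = io_uring_get_sqe(&ring);
io_uring_prep_clone(sqe);               /* spawn the short-lived task */
sqe->flags |= IOSQE_IO_LINK;

sqe = io_uring_get_sqe(&ring);
io_uring_prep_close(sqe, unwanted_fd);  /* cleanup runs in the new task */
sqe->flags |= IOSQE_IO_LINK;

sqe = io_uring_get_sqe(&ring);
io_uring_prep_exec(sqe, "/bin/true", argv, envp); /* hypothetical */

io_uring_submit(&ring);    /* the linked chain executes in the spawned task */
```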

A big pitfall here, in my (Gabriel's) opinion, is how this duplicates the
logic of io_uring linked request dispatching.  I'd suggest merging this
into the io-wq code, as a special case of workqueue. But I'd like to get
feedback on this idea from the maintainers before moving further with the
implementation.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Co-developed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
---
 include/uapi/linux/io_uring.h |   1 +
 io_uring/Makefile             |   2 +-
 io_uring/io_uring.c           |   3 +-
 io_uring/io_uring.h           |   2 +
 io_uring/opdef.c              |   9 +++
 io_uring/spawn.c              | 140 ++++++++++++++++++++++++++++++++++
 io_uring/spawn.h              |  10 +++
 7 files changed, 165 insertions(+), 2 deletions(-)
 create mode 100644 io_uring/spawn.c
 create mode 100644 io_uring/spawn.h

Comments

Pavel Begunkov Dec. 11, 2024, 1:37 p.m. UTC | #1
On 12/9/24 23:43, Gabriel Krisman Bertazi wrote:
> From: Josh Triplett <josh@joshtriplett.org>
> 
> This command spawns a short lived asynchronous context to execute
> following linked operations.  Once the link is completed, the task
> terminates.  This is specially useful to create new processes, by
> linking an IORING_OP_EXEC at the end of the chain. In this case, the
> task doesn't terminate, but returns to userspace, starting the new
> process.
> 
> This is different from the existing io workqueues in a few ways: First,
> it is completely separated from the io-wq code, and the task cannot be
> reused by a future link; Second, the task doesn't share the FDT, and
> other process structures with the rest of io_uring (except for the
> memory map); Finally, because of the limited context, it doesn't support
> executing requests asynchronously and requeing them. Every request must
> complete at ->issue time, or fail.  It also doesn't support task_work
> execution, for a similar reason.  The goal of this design allowing the
> user to close file descriptors, release locks and do other cleanups
> right before switching to a new process.
> 
> A big pitfall here, in my (Gabriel) opinion, is how this duplicates the
> logic of io_uring linked request dispatching.  I'd suggest I merge this
> into the io-wq code, as a special case of workqueue. But I'd like to get
> feedback on this idea from the maintainers before moving further with the
> implementation.
> 
> Signed-off-by: Josh Triplett <josh@joshtriplett.org>
> Co-developed-by: Gabriel Krisman Bertazi <krisman@suse.de>
> Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
> ---
...
> +static int io_uring_spawn_task(void *data)
> +{
> +	struct io_kiocb *head = data;
> +	struct io_clone *c = io_kiocb_to_cmd(head, struct io_clone);
> +	struct io_ring_ctx *ctx = head->ctx;
> +	struct io_kiocb *req, *next;
> +	int err;
> +
> +	set_task_comm(current, "iou-spawn");
> +
> +	mutex_lock(&ctx->uring_lock);
> +
> +	for (req = c->link; req; req = next) {
> +		int hardlink = req->flags & REQ_F_HARDLINK;
> +
> +		next = req->link;
> +		req->link = NULL;
> +		req->flags &= ~(REQ_F_HARDLINK | REQ_F_LINK);

Do you allow linked timeouts? If so, it'd need to take the lock.

Also, the current link impl assumes that the list is only modified
once all refs to the head request have been put; you can't just do it
without dropping the refs first or adjusting the rest of the core link
handling.

> +
> +		if (!(req->flags & REQ_F_FAIL)) {
> +			err = io_issue_sqe(req, IO_URING_F_COMPLETE_DEFER);

There should never be non-IO_URING_F_NONBLOCK calls while ->uring_lock
is held.

I'd even say that opcode handling shouldn't have any business digging
so deep into internal infra: submitting requests, flushing caches,
processing links and so on. It complicates things.

Take DEFER_TASKRUN: io_submit_flush_completions() will not wake
waiters, and we probably have, or will have, a bunch of optimisations
that this can break.

Also, do you block all other opcodes somewhere? If it's indeed
an under-initialised task, then it's not safe to run most of them,
and you'd never know in what way, unfortunately. An fs write
might need a net namespace, a send/recv might decide to touch
fs_struct, and so on.

> +			/*
> +			 * We can't requeue a request from the spawn
> +			 * context.  Fail the whole chain.
> +			 */
> +			if (err) {
> +				req_fail_link_node(req, -ECANCELED);
> +				io_req_complete_defer(req);
> +			}
> +		}
> +		if (req->flags & REQ_F_FAIL) {
> +			if (!hardlink) {
> +				fail_link(next);
> +				break;
> +			}
> +		}
> +	}
> +
> +	io_submit_flush_completions(ctx);

Ordering with

> +	percpu_ref_put(&ctx->refs);
> +
> +	mutex_unlock(&ctx->uring_lock);
> +
> +	force_exit_sig(SIGKILL);
> +	return 0;
> +}
> +
> +int io_clone_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> +{
> +	if (unlikely(sqe->fd || sqe->ioprio || sqe->addr2 || sqe->addr
> +		     || sqe->len || sqe->rw_flags || sqe->buf_index
> +		     || sqe->optlen || sqe->addr3))
> +		return -EINVAL;
> +
> +	if (unlikely(!(req->flags & (REQ_F_HARDLINK|REQ_F_LINK))))
> +		return -EINVAL;
> +
> +	if (unlikely(req->ctx->submit_state.link.head))
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +int io_clone(struct io_kiocb *req, unsigned int issue_flags)
> +{
> +	struct io_clone *c = io_kiocb_to_cmd(req, struct io_clone);
> +	struct task_struct *tsk;
> +
> +	/* It is possible that we don't have any linked requests, depite
> +	 * checking during ->prep().  It would be harmless to continue,
> +	 * but we don't need even to create the worker thread in this
> +	 * case.
> +	 */
> +	if (!req->link)
> +		return IOU_OK;
> +
> +	/*
> +	 * Prevent the context from going away before the spawned task
> +	 * has had a chance to execute.  Dropped by io_uring_spawn_task.
> +	 */
> +	percpu_ref_get(&req->ctx->refs);
> +
> +	tsk = create_io_uring_spawn_task(io_uring_spawn_task, req);
> +	if (IS_ERR(tsk)) {
> +		percpu_ref_put(&req->ctx->refs);
> +
> +		req_set_fail(req);
> +		io_req_set_res(req, PTR_ERR(tsk), 0);
> +		return PTR_ERR(tsk);
> +	}
> +
> +	/*
> +	 * Steal the link from the io_uring dispatcher to have them
> +	 * submitted through the new thread.  Note we can no longer fail
> +	 * the clone, so the spawned task is responsible for completing
> +	 * these requests.
> +	 */
> +	c->link = req->link;
> +	req->flags &= ~(REQ_F_HARDLINK | REQ_F_LINK);
> +	req->link = NULL;
> +
> +	wake_up_new_task(tsk);

I assume from here onwards io_uring_spawn_task() might be
running in parallel. You're passing the req inside but free
it below by returning OK. Is there anything that prevents
io_uring_spawn_task() from accessing an already deallocated
request?

> +
> +	return IOU_OK;
> +
> +}
> diff --git a/io_uring/spawn.h b/io_uring/spawn.h
> new file mode 100644
> index 000000000000..9b7ddb776d1e
> --- /dev/null
> +++ b/io_uring/spawn.h
> @@ -0,0 +1,10 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +
> +/*
> + * Spawning a linked series of operations onto a dedicated task.
> + *
> + * Copyright © 2022 Josh Triplett
> + */
> +
> +int io_clone_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
> +int io_clone(struct io_kiocb *req, unsigned int issue_flags);
Josh Triplett Dec. 11, 2024, 5:26 p.m. UTC | #2
On Wed, Dec 11, 2024 at 01:37:40PM +0000, Pavel Begunkov wrote:
> Also, do you block somewhere all other opcodes? If it's indeed
> an under initialised task then it's not safe to run most of them,
> and you'd never know in what way, unfortunately. An fs write
> might need a net namespace, a send/recv might decide to touch
> fs_struct and so on.

I would not expect the new task to be under-initialised, beyond the fact
that it doesn't have a userspace yet (e.g. it can't return to userspace
without exec-ing first); if it is, that'd be a bug. It *should* be
possible to run almost any reasonable opcode. For instance, reasonable
possibilities include "write a byte to a pipe, open a file,
install/rearrange some file descriptors, then exec".
Pavel Begunkov Dec. 17, 2024, 11:03 a.m. UTC | #3
On 12/11/24 17:26, Josh Triplett wrote:
> On Wed, Dec 11, 2024 at 01:37:40PM +0000, Pavel Begunkov wrote:
>> Also, do you block somewhere all other opcodes? If it's indeed
>> an under initialised task then it's not safe to run most of them,
>> and you'd never know in what way, unfortunately. An fs write
>> might need a net namespace, a send/recv might decide to touch
>> fs_struct and so on.
> 
> I would not expect the new task to be under-initialised, beyond the fact
> that it doesn't have a userspace yet (e.g. it can't return to userspace

I see, that's good. What does it take to set up a userspace, and is
it expensive? I remember there were good numbers at the time, and
I'd like to see where the performance improvement comes from. Is it
because the page table is shared? In other words, what's the
difference compared to spinning up a new (user space) thread and
executing the rest with a new io_uring instance from it?


> without exec-ing first); if it is, that'd be a bug. It *should* be
> possible to do almost any reasonable opcode. For instance, reasonable
> possibilities include "write a byte to a pipe, open a file,
> install/rearrange some file descriptors, then exec".
Josh Triplett Dec. 17, 2024, 7:14 p.m. UTC | #4
On Tue, Dec 17, 2024 at 11:03:27AM +0000, Pavel Begunkov wrote:
> On 12/11/24 17:26, Josh Triplett wrote:
> > On Wed, Dec 11, 2024 at 01:37:40PM +0000, Pavel Begunkov wrote:
> > > Also, do you block somewhere all other opcodes? If it's indeed
> > > an under initialised task then it's not safe to run most of them,
> > > and you'd never know in what way, unfortunately. An fs write
> > > might need a net namespace, a send/recv might decide to touch
> > > fs_struct and so on.
> > 
> > I would not expect the new task to be under-initialised, beyond the fact
> > that it doesn't have a userspace yet (e.g. it can't return to userspace
> 
> I see, that's good. What it takes to setup a userspace? and is
> it expensive? I remember there were good numbers at the time and
> I'm to see where the performance improvement comes from. Is it
> because the page table is shared? In other word what's the
> difference comparing to spinning a new (user space) thread and
> executing the rest with a new io_uring instance from it?

The goal is to provide all the advantages of `vfork` (and then some),
but without the incredibly unsafe vfork limitations.

Or, to look at it a different way, posix_spawn but with all the power of
io_uring available rather than a handful of "spawn attributes".

> > without exec-ing first); if it is, that'd be a bug. It *should* be
> > possible to do almost any reasonable opcode. For instance, reasonable
> > possibilities include "write a byte to a pipe, open a file,
> > install/rearrange some file descriptors, then exec".

Patch

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 38f0d6b10eaf..82d8dae49645 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -278,6 +278,7 @@  enum io_uring_op {
 	IORING_OP_FTRUNCATE,
 	IORING_OP_BIND,
 	IORING_OP_LISTEN,
+	IORING_OP_CLONE,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
diff --git a/io_uring/Makefile b/io_uring/Makefile
index 53167bef37d7..06ad15c07c57 100644
--- a/io_uring/Makefile
+++ b/io_uring/Makefile
@@ -12,7 +12,7 @@  obj-$(CONFIG_IO_URING)		+= io_uring.o opdef.o kbuf.o rsrc.o notif.o \
 					sqpoll.o xattr.o nop.o fs.o splice.o \
 					sync.o msg_ring.o advise.o openclose.o \
 					epoll.o statx.o timeout.o fdinfo.o \
-					cancel.o waitid.o register.o \
+					cancel.o waitid.o register.o spawn.o \
 					truncate.o memmap.o
 obj-$(CONFIG_IO_WQ)		+= io-wq.o
 obj-$(CONFIG_FUTEX)		+= futex.o
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 0fd8709401fc..b82ea1cc393f 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -97,6 +97,7 @@ 
 #include "uring_cmd.h"
 #include "msg_ring.h"
 #include "memmap.h"
+#include "spawn.h"
 
 #include "timeout.h"
 #include "poll.h"
@@ -1706,7 +1707,7 @@  static bool io_assign_file(struct io_kiocb *req, const struct io_issue_def *def,
 	return !!req->file;
 }
 
-static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
+int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
 {
 	const struct io_issue_def *def = &io_issue_defs[req->opcode];
 	const struct cred *creds = NULL;
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 4dd051d29cb0..302c8f92b812 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -497,4 +497,6 @@  static inline bool io_has_work(struct io_ring_ctx *ctx)
 	return test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq) ||
 	       io_local_work_pending(ctx);
 }
+
+int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags);
 #endif
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 3de75eca1c92..1bab2e517e55 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -36,6 +36,7 @@ 
 #include "waitid.h"
 #include "futex.h"
 #include "truncate.h"
+#include "spawn.h"
 
 static int io_no_issue(struct io_kiocb *req, unsigned int issue_flags)
 {
@@ -515,6 +516,11 @@  const struct io_issue_def io_issue_defs[] = {
 		.prep			= io_eopnotsupp_prep,
 #endif
 	},
+	[IORING_OP_CLONE] = {
+		.audit_skip		= 1,
+		.prep			= io_clone_prep,
+		.issue			= io_clone,
+	},
 };
 
 const struct io_cold_def io_cold_defs[] = {
@@ -744,6 +750,9 @@  const struct io_cold_def io_cold_defs[] = {
 	[IORING_OP_LISTEN] = {
 		.name			= "LISTEN",
 	},
+	[IORING_OP_CLONE] = {
+		.name			= "CLONE",
+	},
 };
 
 const char *io_uring_get_opcode(u8 opcode)
diff --git a/io_uring/spawn.c b/io_uring/spawn.c
new file mode 100644
index 000000000000..1cd069bb6f59
--- /dev/null
+++ b/io_uring/spawn.c
@@ -0,0 +1,140 @@ 
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Spawning a linked series of operations onto a dedicated task.
+ *
+ * Copyright (C) 2022 Josh Triplett
+ */
+
+#include <linux/binfmts.h>
+#include <linux/nospec.h>
+#include <linux/syscalls.h>
+
+#include "io_uring.h"
+#include "rsrc.h"
+#include "spawn.h"
+
+struct io_clone {
+	struct file *file_unused;
+	struct io_kiocb *link;
+};
+
+static void fail_link(struct io_kiocb *req)
+{
+	struct io_kiocb *nxt;
+
+	while (req) {
+		req_fail_link_node(req, -ECANCELED);
+		io_req_complete_defer(req);
+
+		nxt = req->link;
+		req->link = NULL;
+		req = nxt;
+	}
+}
+
+static int io_uring_spawn_task(void *data)
+{
+	struct io_kiocb *head = data;
+	struct io_clone *c = io_kiocb_to_cmd(head, struct io_clone);
+	struct io_ring_ctx *ctx = head->ctx;
+	struct io_kiocb *req, *next;
+	int err;
+
+	set_task_comm(current, "iou-spawn");
+
+	mutex_lock(&ctx->uring_lock);
+
+	for (req = c->link; req; req = next) {
+		int hardlink = req->flags & REQ_F_HARDLINK;
+
+		next = req->link;
+		req->link = NULL;
+		req->flags &= ~(REQ_F_HARDLINK | REQ_F_LINK);
+
+		if (!(req->flags & REQ_F_FAIL)) {
+			err = io_issue_sqe(req, IO_URING_F_COMPLETE_DEFER);
+			/*
+			 * We can't requeue a request from the spawn
+			 * context.  Fail the whole chain.
+			 */
+			if (err) {
+				req_fail_link_node(req, -ECANCELED);
+				io_req_complete_defer(req);
+			}
+		}
+		if (req->flags & REQ_F_FAIL) {
+			if (!hardlink) {
+				fail_link(next);
+				break;
+			}
+		}
+	}
+
+	io_submit_flush_completions(ctx);
+	percpu_ref_put(&ctx->refs);
+
+	mutex_unlock(&ctx->uring_lock);
+
+	force_exit_sig(SIGKILL);
+	return 0;
+}
+
+int io_clone_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	if (unlikely(sqe->fd || sqe->ioprio || sqe->addr2 || sqe->addr
+		     || sqe->len || sqe->rw_flags || sqe->buf_index
+		     || sqe->optlen || sqe->addr3))
+		return -EINVAL;
+
+	if (unlikely(!(req->flags & (REQ_F_HARDLINK|REQ_F_LINK))))
+		return -EINVAL;
+
+	if (unlikely(req->ctx->submit_state.link.head))
+		return -EINVAL;
+
+	return 0;
+}
+
+int io_clone(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_clone *c = io_kiocb_to_cmd(req, struct io_clone);
+	struct task_struct *tsk;
+
+	/* It is possible that we don't have any linked requests, despite
+	 * checking during ->prep().  It would be harmless to continue,
+	 * but we don't even need to create the worker thread in this
+	 * case.
+	 */
+	if (!req->link)
+		return IOU_OK;
+
+	/*
+	 * Prevent the context from going away before the spawned task
+	 * has had a chance to execute.  Dropped by io_uring_spawn_task.
+	 */
+	percpu_ref_get(&req->ctx->refs);
+
+	tsk = create_io_uring_spawn_task(io_uring_spawn_task, req);
+	if (IS_ERR(tsk)) {
+		percpu_ref_put(&req->ctx->refs);
+
+		req_set_fail(req);
+		io_req_set_res(req, PTR_ERR(tsk), 0);
+		return PTR_ERR(tsk);
+	}
+
+	/*
+	 * Steal the linked requests from the io_uring dispatcher to have
+	 * them submitted through the new thread.  Note we can no longer
+	 * fail the clone, so the spawned task is responsible for
+	 * completing these requests.
+	 */
+	c->link = req->link;
+	req->flags &= ~(REQ_F_HARDLINK | REQ_F_LINK);
+	req->link = NULL;
+
+	wake_up_new_task(tsk);
+
+	return IOU_OK;
+
+}
diff --git a/io_uring/spawn.h b/io_uring/spawn.h
new file mode 100644
index 000000000000..9b7ddb776d1e
--- /dev/null
+++ b/io_uring/spawn.h
@@ -0,0 +1,10 @@ 
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+
+/*
+ * Spawning a linked series of operations onto a dedicated task.
+ *
+ * Copyright © 2022 Josh Triplett
+ */
+
+int io_clone_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_clone(struct io_kiocb *req, unsigned int issue_flags);