diff mbox series

btrfs: add io_uring interface for encoded writes

Message ID 20241112163021.1948119-1-maharmstone@fb.com (mailing list archive)
State New
Headers show
Series btrfs: add io_uring interface for encoded writes | expand

Commit Message

Mark Harmstone Nov. 12, 2024, 4:29 p.m. UTC
Add an io_uring interface for encoded writes, with the same parameters
as the BTRFS_IOC_ENCODED_WRITE ioctl.

As with the encoded reads code, there's a test program for this at
https://github.com/maharmstone/io_uring-encoded, and I'll get this
worked into an fstest.

How io_uring works is that it initially calls btrfs_uring_cmd with the
IO_URING_F_NONBLOCK flag set, and if we return -EAGAIN it tries again in
a kthread with the flag cleared.

Ideally we'd honour this and call try_lock etc., but there's still a lot
of work to be done to create non-blocking versions of all the functions
in our write path. Instead, just validate the input in
btrfs_uring_encoded_write() on the first pass and return -EAGAIN, with a
view to properly optimizing the happy path later on.

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
---
 fs/btrfs/ioctl.c | 116 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 116 insertions(+)

Comments

Jens Axboe Nov. 12, 2024, 9:01 p.m. UTC | #1
On 11/12/24 9:29 AM, Mark Harmstone wrote:
> Add an io_uring interface for encoded writes, with the same parameters
> as the BTRFS_IOC_ENCODED_WRITE ioctl.
> 
> As with the encoded reads code, there's a test program for this at
> https://github.com/maharmstone/io_uring-encoded, and I'll get this
> worked into an fstest.
> 
> How io_uring works is that it initially calls btrfs_uring_cmd with the
> IO_URING_F_NONBLOCK flag set, and if we return -EAGAIN it tries again in
> a kthread with the flag cleared.
    ^^^^^^^^

Not a kernel thread, it's an io worker. The distinction may seem
irrelevant, but it's really not - io workers inherit all the properties
of the original task.

> Ideally we'd honour this and call try_lock etc., but there's still a lot
> of work to be done to create non-blocking versions of all the functions
> in our write path. Instead, just validate the input in
> btrfs_uring_encoded_write() on the first pass and return -EAGAIN, with a
> view to properly optimizing the happy path later on.

But you need to ensure stable state after the first issue, regardless of
how you handle it. I don't have the other patches handy, but whatever
you copy from userspace before you return -EAGAIN, you should not be
copying again. By the time you get the 2nd invocation from io-wq, no
copying should be taking place, you should be using the state you
already ensured was stable for the non-blocking issue.

Maybe this is all handled by the caller of btrfs_uring_encoded_write()
already? As far as looking at the code below, it just looks like it
copies everything, then returns -EAGAIN, then copies it again later? Yes
uring_cmd will make the sqe itself stable, but:

	sqe_addr = u64_to_user_ptr(READ_ONCE(cmd->sqe->addr));

the userspace btrfs_ioctl_encoded_io_args that sqe->addr points too
should remain stable as well. If not, consider userspace doing:

some_func()
{
	struct btrfs_ioctl_encoded_io_args args;

	fill_in_args(&args);
	sqe = io_uring_get_sqe(ring);
	sqe->addr = &args;
	io_uring_submit();		<- initial invocation here
}

main_func()
{
	some_func();
				- io-wq invocation perhaps here
	wait_on_cqes();
}

where io-wq will be reading garbage as args went out of scope, unless
some_func() used a stable/heap struct that isn't freed until completion.
some_func() can obviously wait on the cqe, but at that point you'd be
using it as a sync interface, and there's little point.

This is why io_kiocb->async_data exists. uring_cmd is already using that
for the sqe, I think you'd want to add a 2nd "void *op_data" or
something in there, and have the uring_cmd alloc cache get clear that to
NULL and have uring_cmd alloc cache put kfree() it if it's non-NULL.

We'd also need to move the uring_cache struct into
include/linux/io_uring_types.h so that btrfs can get to it (and probably
rename it to something saner, uring_cmd_async_data for example).

static int btrfs_uring_encoded_write(struct io_uring_cmd *cmd, unsigned int issue_flags)
{
	struct io_kiocb *req = cmd_to_io_kiocb(cmd);
	struct uring_cmd_async_data *data = req->async_data;
	struct btrfs_ioctl_encoded_io_args *args;

	if (!data->op_data) {
		data->op_data = kmalloc(sizeof(*args), GFP_NOIO);
		if (!data->op_data)
			return -ENOMEM;
		if (copy_from_user(data->op_data, sqe_addr, sizeof(*args))
			return -EFAULT;
	}
	...
}

and have it be stable, then moving your copying into a helper rather
than inline in btrfs_uring_encoded_write() (it probably should be
regardless). Ignored the compat above, it's just pseudo code.

Anyway, hope that helps. I'll be happy to do the uring_cmd bit for you,
but it really should be pretty straight forward.

I'm also pondering if the encoded read side suffers from the same issue?
Jens Axboe Nov. 12, 2024, 9:11 p.m. UTC | #2
On 11/12/24 2:01 PM, Jens Axboe wrote:
> This is why io_kiocb->async_data exists. uring_cmd is already using that
> for the sqe, I think you'd want to add a 2nd "void *op_data" or
> something in there, and have the uring_cmd alloc cache get clear that to
> NULL and have uring_cmd alloc cache put kfree() it if it's non-NULL.
> 
> We'd also need to move the uring_cache struct into
> include/linux/io_uring_types.h so that btrfs can get to it (and probably
> rename it to something saner, uring_cmd_async_data for example).

Here are two patches that implement that basic thing on the io_uring
uring_cmd side. With that, you can then do:

> static int btrfs_uring_encoded_write(struct io_uring_cmd *cmd, unsigned int issue_flags)
> {
> 	struct io_kiocb *req = cmd_to_io_kiocb(cmd);
> 	struct uring_cmd_async_data *data = req->async_data;
> 	struct btrfs_ioctl_encoded_io_args *args;
> 
> 	if (!data->op_data) {
> 		data->op_data = kmalloc(sizeof(*args), GFP_NOIO);
> 		if (!data->op_data)
> 			return -ENOMEM;
> 		if (copy_from_user(data->op_data, sqe_addr, sizeof(*args))
> 			return -EFAULT;
> 	}
> 	...
> }

and have it be both stable, and not need to worry about freeing it
either. Hope that helps. Totally untested...
Jens Axboe Nov. 12, 2024, 9:19 p.m. UTC | #3
On 11/12/24 2:01 PM, Jens Axboe wrote:
> I'm also pondering if the encoded read side suffers from the same issue?

It does, it needs the same fixup.
Mark Harmstone Nov. 15, 2024, 5:29 p.m. UTC | #4
On 12/11/24 21:01, Jens Axboe wrote:
> > 
> On 11/12/24 9:29 AM, Mark Harmstone wrote:
>> Add an io_uring interface for encoded writes, with the same parameters
>> as the BTRFS_IOC_ENCODED_WRITE ioctl.
>>
>> As with the encoded reads code, there's a test program for this at
>> https://github.com/maharmstone/io_uring-encoded, and I'll get this
>> worked into an fstest.
>>
>> How io_uring works is that it initially calls btrfs_uring_cmd with the
>> IO_URING_F_NONBLOCK flag set, and if we return -EAGAIN it tries again in
>> a kthread with the flag cleared.
>      ^^^^^^^^
> 
> Not a kernel thread, it's an io worker. The distinction may seem
> irrelevant, but it's really not - io workers inherit all the properties
> of the original task.
> 
>> Ideally we'd honour this and call try_lock etc., but there's still a lot
>> of work to be done to create non-blocking versions of all the functions
>> in our write path. Instead, just validate the input in
>> btrfs_uring_encoded_write() on the first pass and return -EAGAIN, with a
>> view to properly optimizing the happy path later on.
> 
> But you need to ensure stable state after the first issue, regardless of
> how you handle it. I don't have the other patches handy, but whatever
> you copy from userspace before you return -EAGAIN, you should not be
> copying again. By the time you get the 2nd invocation from io-wq, no
> copying should be taking place, you should be using the state you
> already ensured was stable for the non-blocking issue.
> 
> Maybe this is all handled by the caller of btrfs_uring_encoded_write()
> already? As far as looking at the code below, it just looks like it
> copies everything, then returns -EAGAIN, then copies it again later? Yes
> uring_cmd will make the sqe itself stable, but:
> 
> 	sqe_addr = u64_to_user_ptr(READ_ONCE(cmd->sqe->addr));
> 
> the userspace btrfs_ioctl_encoded_io_args that sqe->addr points too
> should remain stable as well. If not, consider userspace doing:
> 
> some_func()
> {
> 	struct btrfs_ioctl_encoded_io_args args;
> 
> 	fill_in_args(&args);
> 	sqe = io_uring_get_sqe(ring);
> 	sqe->addr = &args;
> 	io_uring_submit();		<- initial invocation here
> }
> 
> main_func()
> {
> 	some_func();
> 				- io-wq invocation perhaps here
> 	wait_on_cqes();
> }
> 
> where io-wq will be reading garbage as args went out of scope, unless
> some_func() used a stable/heap struct that isn't freed until completion.
> some_func() can obviously wait on the cqe, but at that point you'd be
> using it as a sync interface, and there's little point.
> 
> This is why io_kiocb->async_data exists. uring_cmd is already using that
> for the sqe, I think you'd want to add a 2nd "void *op_data" or
> something in there, and have the uring_cmd alloc cache get clear that to
> NULL and have uring_cmd alloc cache put kfree() it if it's non-NULL.
> 
> We'd also need to move the uring_cache struct into
> include/linux/io_uring_types.h so that btrfs can get to it (and probably
> rename it to something saner, uring_cmd_async_data for example).
> 
> static int btrfs_uring_encoded_write(struct io_uring_cmd *cmd, unsigned int issue_flags)
> {
> 	struct io_kiocb *req = cmd_to_io_kiocb(cmd);
> 	struct uring_cmd_async_data *data = req->async_data;
> 	struct btrfs_ioctl_encoded_io_args *args;
> 
> 	if (!data->op_data) {
> 		data->op_data = kmalloc(sizeof(*args), GFP_NOIO);
> 		if (!data->op_data)
> 			return -ENOMEM;
> 		if (copy_from_user(data->op_data, sqe_addr, sizeof(*args))
> 			return -EFAULT;
> 	}
> 	...
> }
> 
> and have it be stable, then moving your copying into a helper rather
> than inline in btrfs_uring_encoded_write() (it probably should be
> regardless). Ignored the compat above, it's just pseudo code.
> 
> Anyway, hope that helps. I'll be happy to do the uring_cmd bit for you,
> but it really should be pretty straight forward.
> 
> I'm also pondering if the encoded read side suffers from the same issue?
> 

Thanks Jens, that makes sense to me.

Mark
Mark Harmstone Nov. 20, 2024, 3:50 p.m. UTC | #5
On 12/11/24 21:11, Jens Axboe wrote:
> > 
> On 11/12/24 2:01 PM, Jens Axboe wrote:
>> This is why io_kiocb->async_data exists. uring_cmd is already using that
>> for the sqe, I think you'd want to add a 2nd "void *op_data" or
>> something in there, and have the uring_cmd alloc cache get clear that to
>> NULL and have uring_cmd alloc cache put kfree() it if it's non-NULL.
>>
>> We'd also need to move the uring_cache struct into
>> include/linux/io_uring_types.h so that btrfs can get to it (and probably
>> rename it to something saner, uring_cmd_async_data for example).
> 
> Here are two patches that implement that basic thing on the io_uring
> uring_cmd side. With that, you can then do:
> 
>> static int btrfs_uring_encoded_write(struct io_uring_cmd *cmd, unsigned int issue_flags)
>> {
>> 	struct io_kiocb *req = cmd_to_io_kiocb(cmd);
>> 	struct uring_cmd_async_data *data = req->async_data;
>> 	struct btrfs_ioctl_encoded_io_args *args;
>>
>> 	if (!data->op_data) {
>> 		data->op_data = kmalloc(sizeof(*args), GFP_NOIO);
>> 		if (!data->op_data)
>> 			return -ENOMEM;
>> 		if (copy_from_user(data->op_data, sqe_addr, sizeof(*args))
>> 			return -EFAULT;
>> 	}
>> 	...
>> }
> 
> and have it be both stable, and not need to worry about freeing it
> either. Hope that helps. Totally untested...
> 

This works, but I get a KASAN crash because 
io_issue_defs[IORING_OP_URING_CMD].async_size is now the wrong size. 
I'll send a patch for this.

Mark
diff mbox series

Patch

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 6478216305c7..37578270686a 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -5030,6 +5030,116 @@  static int btrfs_uring_encoded_read(struct io_uring_cmd *cmd, unsigned int issue
 	return ret;
 }
 
+static int btrfs_uring_encoded_write(struct io_uring_cmd *cmd, unsigned int issue_flags)
+{
+	struct btrfs_ioctl_encoded_io_args args;
+	struct iovec iovstack[UIO_FASTIOV];
+	struct iovec *iov = iovstack;
+	struct iov_iter iter;
+	loff_t pos;
+	struct kiocb kiocb;
+	struct file *file;
+	ssize_t ret;
+	void __user *sqe_addr;
+
+	if (!capable(CAP_SYS_ADMIN)) {
+		ret = -EPERM;
+		goto out_acct;
+	}
+
+	file = cmd->file;
+	sqe_addr = u64_to_user_ptr(READ_ONCE(cmd->sqe->addr));
+
+	if (!(file->f_mode & FMODE_WRITE)) {
+		ret = -EBADF;
+		goto out_acct;
+	}
+
+	if (issue_flags & IO_URING_F_COMPAT) {
+#if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
+		struct btrfs_ioctl_encoded_io_args_32 args32;
+
+		if (copy_from_user(&args32, sqe_addr, sizeof(args32))) {
+			ret = -EFAULT;
+			goto out_acct;
+		}
+		args.iov = compat_ptr(args32.iov);
+		args.iovcnt = args32.iovcnt;
+		args.offset = args32.offset;
+		args.flags = args32.flags;
+		args.len = args32.len;
+		args.unencoded_len = args32.unencoded_len;
+		args.unencoded_offset = args32.unencoded_offset;
+		args.compression = args32.compression;
+		args.encryption = args32.encryption;
+		memcpy(args.reserved, args32.reserved, sizeof(args.reserved));
+#else
+		return -ENOTTY;
+#endif
+	} else {
+		if (copy_from_user(&args, sqe_addr, sizeof(args))) {
+			ret = -EFAULT;
+			goto out_acct;
+		}
+	}
+
+	ret = -EINVAL;
+	if (args.flags != 0)
+		goto out_acct;
+	if (memchr_inv(args.reserved, 0, sizeof(args.reserved)))
+		goto out_acct;
+	if (args.compression == BTRFS_ENCODED_IO_COMPRESSION_NONE &&
+	    args.encryption == BTRFS_ENCODED_IO_ENCRYPTION_NONE)
+		goto out_acct;
+	if (args.compression >= BTRFS_ENCODED_IO_COMPRESSION_TYPES ||
+	    args.encryption >= BTRFS_ENCODED_IO_ENCRYPTION_TYPES)
+		goto out_acct;
+	if (args.unencoded_offset > args.unencoded_len)
+		goto out_acct;
+	if (args.len > args.unencoded_len - args.unencoded_offset)
+		goto out_acct;
+
+	if (issue_flags & IO_URING_F_NONBLOCK) {
+		ret = -EAGAIN;
+		goto out_acct;
+	}
+
+	ret = import_iovec(ITER_SOURCE, args.iov, args.iovcnt, ARRAY_SIZE(iovstack),
+			   &iov, &iter);
+	if (ret < 0)
+		goto out_acct;
+
+	if (iov_iter_count(&iter) == 0) {
+		ret = 0;
+		goto out_iov;
+	}
+	pos = args.offset;
+	ret = rw_verify_area(WRITE, file, &pos, args.len);
+	if (ret < 0)
+		goto out_iov;
+
+	init_sync_kiocb(&kiocb, file);
+	ret = kiocb_set_rw_flags(&kiocb, 0, WRITE);
+	if (ret)
+		goto out_iov;
+	kiocb.ki_pos = pos;
+
+	file_start_write(file);
+
+	ret = btrfs_do_write_iter(&kiocb, &iter, &args);
+	if (ret > 0)
+		fsnotify_modify(file);
+
+	file_end_write(file);
+out_iov:
+	kfree(iov);
+out_acct:
+	if (ret > 0)
+		add_wchar(current, ret);
+	inc_syscw(current);
+	return ret;
+}
+
 int btrfs_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 {
 	switch (cmd->cmd_op) {
@@ -5038,6 +5148,12 @@  int btrfs_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 	case BTRFS_IOC_ENCODED_READ_32:
 #endif
 		return btrfs_uring_encoded_read(cmd, issue_flags);
+
+	case BTRFS_IOC_ENCODED_WRITE:
+#if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
+	case BTRFS_IOC_ENCODED_WRITE_32:
+#endif
+		return btrfs_uring_encoded_write(cmd, issue_flags);
 	}
 
 	return -EINVAL;