diff mbox series

[2/3] io_uring: use iov_iter state save/restore helpers

Message ID 20210914141750.261568-3-axboe@kernel.dk (mailing list archive)
State New, archived
Headers show
Series Add ability to save/restore iov_iter state | expand

Commit Message

Jens Axboe Sept. 14, 2021, 2:17 p.m. UTC
Get rid of the need to do re-expand and revert on an iterator when we
encounter a short IO, or failure that warrants a retry. Use the new
state save/restore helpers instead.

We keep the iov_iter_state persistent across retries, if we need to
restart the read or write operation. If there's a pending retry, the
operation will always exit with the state correctly saved.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 62 ++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 41 insertions(+), 21 deletions(-)

Comments

Linus Torvalds Sept. 14, 2021, 6:45 p.m. UTC | #1
On Tue, Sep 14, 2021 at 7:18 AM Jens Axboe <axboe@kernel.dk> wrote:
>
>
> +       iov_iter_restore(iter, state);
> +
...
>                 rw->bytes_done += ret;
> +               iov_iter_advance(iter, ret);
> +               if (!iov_iter_count(iter))
> +                       break;
> +               iov_iter_save_state(iter, state);

Ok, so now you keep iovb_iter and the state always in sync by just
always resetting the iter back and then walking it forward explicitly
- and re-saving the state.

That seems safe, if potentially unnecessarily expensive.

I guess re-walking lots of iovec entries is actually very unlikely in
practice, so maybe this "stupid brute-force" model is the right one.

I do find the odd "use __state vs rw->state" to be very confusing,
though. Particularly in io_read(), where you do this:

+       iov_iter_restore(iter, state);
+
        ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
        if (ret2)
                return ret2;

        iovec = NULL;
        rw = req->async_data;
-       /* now use our persistent iterator, if we aren't already */
-       iter = &rw->iter;
+       /* now use our persistent iterator and state, if we aren't already */
+       if (iter != &rw->iter) {
+               iter = &rw->iter;
+               state = &rw->iter_state;
+       }

        do {
-               io_size -= ret;
                rw->bytes_done += ret;
+               iov_iter_advance(iter, ret);
+               if (!iov_iter_count(iter))
+                       break;
+               iov_iter_save_state(iter, state);


Note how it first does that iov_iter_restore() on iter/state, buit
then it *replaces&* the iter/state pointers, and then it does
iov_iter_advance() on the replacement ones.

I don't see how that could be right. You're doing iov_iter_advance()
on something else than the one you restored to the original values.

And if it is right, it's sure confusing as hell.

           Linus
Jens Axboe Sept. 14, 2021, 7:37 p.m. UTC | #2
On 9/14/21 12:45 PM, Linus Torvalds wrote:
> On Tue, Sep 14, 2021 at 7:18 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>>
>> +       iov_iter_restore(iter, state);
>> +
> ...
>>                 rw->bytes_done += ret;
>> +               iov_iter_advance(iter, ret);
>> +               if (!iov_iter_count(iter))
>> +                       break;
>> +               iov_iter_save_state(iter, state);
> 
> Ok, so now you keep iovb_iter and the state always in sync by just
> always resetting the iter back and then walking it forward explicitly
> - and re-saving the state.
> 
> That seems safe, if potentially unnecessarily expensive.

Right, it's not ideal if it's a big range of IO, then it'll definitely
be noticeable. But not too worried about it, at least not for now...

> I guess re-walking lots of iovec entries is actually very unlikely in
> practice, so maybe this "stupid brute-force" model is the right one.

Not sure what the alternative is here. We could do something similar to
__io_import_fixed() as we're only dealing with iter types where we can
do that, but probably best left as a later optimization if it's deemed
necessary.

> I do find the odd "use __state vs rw->state" to be very confusing,
> though. Particularly in io_read(), where you do this:
> 
> +       iov_iter_restore(iter, state);
> +
>         ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
>         if (ret2)
>                 return ret2;
> 
>         iovec = NULL;
>         rw = req->async_data;
> -       /* now use our persistent iterator, if we aren't already */
> -       iter = &rw->iter;
> +       /* now use our persistent iterator and state, if we aren't already */
> +       if (iter != &rw->iter) {
> +               iter = &rw->iter;
> +               state = &rw->iter_state;
> +       }
> 
>         do {
> -               io_size -= ret;
>                 rw->bytes_done += ret;
> +               iov_iter_advance(iter, ret);
> +               if (!iov_iter_count(iter))
> +                       break;
> +               iov_iter_save_state(iter, state);
> 
> 
> Note how it first does that iov_iter_restore() on iter/state, buit
> then it *replaces&* the iter/state pointers, and then it does
> iov_iter_advance() on the replacement ones.

We restore the iter so it's the same as before we did the read_iter
call, and then setup a consistent copy of the iov/iter in case we need
to punt this request for retry. rw->iter should have the same state as
iter at this point, and since rw->iter is the copy we'll use going
forward, we're advancing that one in case ret > 0.

The other case is that no persistent state is needed, and then iter
remains the same.

I'll take a second look at this part and see if I can make it a bit more
straight forward, or at least comment it properly.
Jens Axboe Sept. 14, 2021, 11:02 p.m. UTC | #3
On 9/14/21 1:37 PM, Jens Axboe wrote:
> On 9/14/21 12:45 PM, Linus Torvalds wrote:
>> On Tue, Sep 14, 2021 at 7:18 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>>
>>> +       iov_iter_restore(iter, state);
>>> +
>> ...
>>>                 rw->bytes_done += ret;
>>> +               iov_iter_advance(iter, ret);
>>> +               if (!iov_iter_count(iter))
>>> +                       break;
>>> +               iov_iter_save_state(iter, state);
>>
>> Ok, so now you keep iovb_iter and the state always in sync by just
>> always resetting the iter back and then walking it forward explicitly
>> - and re-saving the state.
>>
>> That seems safe, if potentially unnecessarily expensive.
> 
> Right, it's not ideal if it's a big range of IO, then it'll definitely
> be noticeable. But not too worried about it, at least not for now...
> 
>> I guess re-walking lots of iovec entries is actually very unlikely in
>> practice, so maybe this "stupid brute-force" model is the right one.
> 
> Not sure what the alternative is here. We could do something similar to
> __io_import_fixed() as we're only dealing with iter types where we can
> do that, but probably best left as a later optimization if it's deemed
> necessary.
> 
>> I do find the odd "use __state vs rw->state" to be very confusing,
>> though. Particularly in io_read(), where you do this:
>>
>> +       iov_iter_restore(iter, state);
>> +
>>         ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
>>         if (ret2)
>>                 return ret2;
>>
>>         iovec = NULL;
>>         rw = req->async_data;
>> -       /* now use our persistent iterator, if we aren't already */
>> -       iter = &rw->iter;
>> +       /* now use our persistent iterator and state, if we aren't already */
>> +       if (iter != &rw->iter) {
>> +               iter = &rw->iter;
>> +               state = &rw->iter_state;
>> +       }
>>
>>         do {
>> -               io_size -= ret;
>>                 rw->bytes_done += ret;
>> +               iov_iter_advance(iter, ret);
>> +               if (!iov_iter_count(iter))
>> +                       break;
>> +               iov_iter_save_state(iter, state);
>>
>>
>> Note how it first does that iov_iter_restore() on iter/state, buit
>> then it *replaces&* the iter/state pointers, and then it does
>> iov_iter_advance() on the replacement ones.
> 
> We restore the iter so it's the same as before we did the read_iter
> call, and then setup a consistent copy of the iov/iter in case we need
> to punt this request for retry. rw->iter should have the same state as
> iter at this point, and since rw->iter is the copy we'll use going
> forward, we're advancing that one in case ret > 0.
> 
> The other case is that no persistent state is needed, and then iter
> remains the same.
> 
> I'll take a second look at this part and see if I can make it a bit more
> straight forward, or at least comment it properly.

I hacked up something that shortens the iter for the initial IO, so we
could more easily test the retry path and the state. It really is a
hack, but the idea was to issue 64K io from fio, and then the initial
attempt would be anywhere from 4K-60K truncated. That forces retry.
I ran this with both 16 segments and 8 segments, verifying that it
hits both the UIO_FASTIOV and alloc path.

I did find one issue with that, see the last hunk in the hack. We
need to increment rw->bytes_done if we don't break, or set ret to
0 if we do. Otherwise that last ret ends up being accounted twice.
But apart from that, it passes data verification runs.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index dc1ff47e3221..484c86252f9d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -744,6 +744,7 @@ enum {
 	REQ_F_NOWAIT_READ_BIT,
 	REQ_F_NOWAIT_WRITE_BIT,
 	REQ_F_ISREG_BIT,
+	REQ_F_TRUNCATED_BIT,
 
 	/* not a real bit, just to check we're not overflowing the space */
 	__REQ_F_LAST_BIT,
@@ -797,6 +798,7 @@ enum {
 	REQ_F_REFCOUNT		= BIT(REQ_F_REFCOUNT_BIT),
 	/* there is a linked timeout that has to be armed */
 	REQ_F_ARM_LTIMEOUT	= BIT(REQ_F_ARM_LTIMEOUT_BIT),
+	REQ_F_TRUNCATED		= BIT(REQ_F_TRUNCATED_BIT),
 };
 
 struct async_poll {
@@ -3454,11 +3456,12 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 {
 	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
 	struct kiocb *kiocb = &req->rw.kiocb;
-	struct iov_iter __iter, *iter = &__iter;
+	struct iov_iter __i, __iter, *iter = &__iter;
 	struct io_async_rw *rw = req->async_data;
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
 	struct iov_iter_state __state, *state;
 	ssize_t ret, ret2;
+	bool do_restore = false;
 
 	if (rw) {
 		iter = &rw->iter;
@@ -3492,8 +3495,25 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 		return ret;
 	}
 
+	if (!(req->flags & REQ_F_TRUNCATED) && !(iov_iter_count(iter) & 4095)) {
+		int nr_vecs;
+
+		__i = *iter;
+		nr_vecs = 1 + (prandom_u32() % iter->nr_segs);
+		iter->nr_segs = nr_vecs;
+		iter->count = nr_vecs * 8192;
+		req->flags |= REQ_F_TRUNCATED;
+		do_restore = true;
+	}
+
 	ret = io_iter_do_read(req, iter);
 
+	if (ret == -EAGAIN) {
+		req->flags &= ~REQ_F_TRUNCATED;
+		*iter = __i;
+		do_restore = false;
+	}
+
 	if (ret == -EAGAIN || (req->flags & REQ_F_REISSUE)) {
 		req->flags &= ~REQ_F_REISSUE;
 		/* IOPOLL retry should happen for io-wq threads */
@@ -3513,6 +3533,9 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 
 	iov_iter_restore(iter, state);
 
+	if (do_restore)
+		*iter = __i;
+
 	ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
 	if (ret2)
 		return ret2;
@@ -3526,10 +3549,10 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 	}
 
 	do {
-		rw->bytes_done += ret;
 		iov_iter_advance(iter, ret);
 		if (!iov_iter_count(iter))
 			break;
+		rw->bytes_done += ret;
 		iov_iter_save_state(iter, state);
 
 		/* if we can retry, do so with the callbacks armed */
diff mbox series

Patch

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 855ea544807f..dbc97d440801 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -712,6 +712,7 @@  struct io_async_rw {
 	struct iovec			fast_iov[UIO_FASTIOV];
 	const struct iovec		*free_iovec;
 	struct iov_iter			iter;
+	struct iov_iter_state		iter_state;
 	size_t				bytes_done;
 	struct wait_page_queue		wpq;
 };
@@ -2608,8 +2609,7 @@  static bool io_resubmit_prep(struct io_kiocb *req)
 
 	if (!rw)
 		return !io_req_prep_async(req);
-	/* may have left rw->iter inconsistent on -EIOCBQUEUED */
-	iov_iter_revert(&rw->iter, req->result - iov_iter_count(&rw->iter));
+	iov_iter_restore(&rw->iter, &rw->iter_state);
 	return true;
 }
 
@@ -3310,12 +3310,16 @@  static int io_setup_async_rw(struct io_kiocb *req, const struct iovec *iovec,
 	if (!force && !io_op_defs[req->opcode].needs_async_setup)
 		return 0;
 	if (!req->async_data) {
+		struct io_async_rw *iorw;
+
 		if (io_alloc_async_data(req)) {
 			kfree(iovec);
 			return -ENOMEM;
 		}
 
 		io_req_map_rw(req, iovec, fast_iov, iter);
+		iorw = req->async_data;
+		iov_iter_save_state(&iorw->iter, &iorw->iter_state);
 	}
 	return 0;
 }
@@ -3334,6 +3338,7 @@  static inline int io_rw_prep_async(struct io_kiocb *req, int rw)
 	iorw->free_iovec = iov;
 	if (iov)
 		req->flags |= REQ_F_NEED_CLEANUP;
+	iov_iter_save_state(&iorw->iter, &iorw->iter_state);
 	return 0;
 }
 
@@ -3437,19 +3442,23 @@  static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 	struct kiocb *kiocb = &req->rw.kiocb;
 	struct iov_iter __iter, *iter = &__iter;
 	struct io_async_rw *rw = req->async_data;
-	ssize_t io_size, ret, ret2;
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
+	struct iov_iter_state __state, *state;
+	ssize_t ret, ret2;
 
 	if (rw) {
 		iter = &rw->iter;
+		state = &rw->iter_state;
+		iov_iter_restore(iter, state);
 		iovec = NULL;
 	} else {
 		ret = io_import_iovec(READ, req, &iovec, iter, !force_nonblock);
 		if (ret < 0)
 			return ret;
+		state = &__state;
+		iov_iter_save_state(iter, state);
 	}
-	io_size = iov_iter_count(iter);
-	req->result = io_size;
+	req->result = iov_iter_count(iter);
 
 	/* Ensure we clear previously set non-block flag */
 	if (!force_nonblock)
@@ -3463,7 +3472,7 @@  static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 		return ret ?: -EAGAIN;
 	}
 
-	ret = rw_verify_area(READ, req->file, io_kiocb_ppos(kiocb), io_size);
+	ret = rw_verify_area(READ, req->file, io_kiocb_ppos(kiocb), req->result);
 	if (unlikely(ret)) {
 		kfree(iovec);
 		return ret;
@@ -3479,30 +3488,36 @@  static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 		/* no retry on NONBLOCK nor RWF_NOWAIT */
 		if (req->flags & REQ_F_NOWAIT)
 			goto done;
-		/* some cases will consume bytes even on error returns */
-		iov_iter_reexpand(iter, iter->count + iter->truncated);
-		iov_iter_revert(iter, io_size - iov_iter_count(iter));
 		ret = 0;
 	} else if (ret == -EIOCBQUEUED) {
 		goto out_free;
-	} else if (ret <= 0 || ret == io_size || !force_nonblock ||
+	} else if (ret <= 0 || ret == req->result || !force_nonblock ||
 		   (req->flags & REQ_F_NOWAIT) || !need_read_all(req)) {
 		/* read all, failed, already did sync or don't want to retry */
 		goto done;
 	}
 
+	iov_iter_restore(iter, state);
+
 	ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
 	if (ret2)
 		return ret2;
 
 	iovec = NULL;
 	rw = req->async_data;
-	/* now use our persistent iterator, if we aren't already */
-	iter = &rw->iter;
+	/* now use our persistent iterator and state, if we aren't already */
+	if (iter != &rw->iter) {
+		iter = &rw->iter;
+		state = &rw->iter_state;
+	}
 
 	do {
-		io_size -= ret;
 		rw->bytes_done += ret;
+		iov_iter_advance(iter, ret);
+		if (!iov_iter_count(iter))
+			break;
+		iov_iter_save_state(iter, state);
+
 		/* if we can retry, do so with the callbacks armed */
 		if (!io_rw_should_retry(req)) {
 			kiocb->ki_flags &= ~IOCB_WAITQ;
@@ -3520,7 +3535,7 @@  static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 			return 0;
 		/* we got some bytes, but not all. retry. */
 		kiocb->ki_flags &= ~IOCB_WAITQ;
-	} while (ret > 0 && ret < io_size);
+	} while (ret > 0);
 done:
 	kiocb_done(kiocb, ret, issue_flags);
 out_free:
@@ -3543,19 +3558,24 @@  static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	struct kiocb *kiocb = &req->rw.kiocb;
 	struct iov_iter __iter, *iter = &__iter;
 	struct io_async_rw *rw = req->async_data;
-	ssize_t ret, ret2, io_size;
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
+	struct iov_iter_state __state, *state;
+	ssize_t ret, ret2;
 
 	if (rw) {
 		iter = &rw->iter;
+		state = &rw->iter_state;
+		iov_iter_restore(iter, state);
 		iovec = NULL;
 	} else {
 		ret = io_import_iovec(WRITE, req, &iovec, iter, !force_nonblock);
 		if (ret < 0)
 			return ret;
+		state = &__state;
+		iov_iter_save_state(iter, state);
 	}
-	io_size = iov_iter_count(iter);
-	req->result = io_size;
+	req->result = iov_iter_count(iter);
+	ret2 = 0;
 
 	/* Ensure we clear previously set non-block flag */
 	if (!force_nonblock)
@@ -3572,7 +3592,7 @@  static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	    (req->flags & REQ_F_ISREG))
 		goto copy_iov;
 
-	ret = rw_verify_area(WRITE, req->file, io_kiocb_ppos(kiocb), io_size);
+	ret = rw_verify_area(WRITE, req->file, io_kiocb_ppos(kiocb), req->result);
 	if (unlikely(ret))
 		goto out_free;
 
@@ -3619,9 +3639,9 @@  static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 		kiocb_done(kiocb, ret2, issue_flags);
 	} else {
 copy_iov:
-		/* some cases will consume bytes even on error returns */
-		iov_iter_reexpand(iter, iter->count + iter->truncated);
-		iov_iter_revert(iter, io_size - iov_iter_count(iter));
+		iov_iter_restore(iter, state);
+		if (ret2 > 0)
+			iov_iter_advance(iter, ret2);
 		ret = io_setup_async_rw(req, iovec, inline_vecs, iter, false);
 		return ret ?: -EAGAIN;
 	}