Message ID | 20210914141750.261568-3-axboe@kernel.dk (mailing list archive) |
---|---|
State | New, archived |
Series | Add ability to save/restore iov_iter state |
On Tue, Sep 14, 2021 at 7:18 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> +       iov_iter_restore(iter, state);
> +
...
>                 rw->bytes_done += ret;
> +               iov_iter_advance(iter, ret);
> +               if (!iov_iter_count(iter))
> +                       break;
> +               iov_iter_save_state(iter, state);

Ok, so now you keep the iov_iter and the state always in sync by just
always resetting the iter back and then walking it forward explicitly
- and re-saving the state.

That seems safe, if potentially unnecessarily expensive.

I guess re-walking lots of iovec entries is actually very unlikely in
practice, so maybe this "stupid brute-force" model is the right one.

I do find the odd "use __state vs rw->state" to be very confusing,
though. Particularly in io_read(), where you do this:

+       iov_iter_restore(iter, state);
+
        ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
        if (ret2)
                return ret2;

        iovec = NULL;
        rw = req->async_data;
-       /* now use our persistent iterator, if we aren't already */
-       iter = &rw->iter;
+       /* now use our persistent iterator and state, if we aren't already */
+       if (iter != &rw->iter) {
+               iter = &rw->iter;
+               state = &rw->iter_state;
+       }

        do {
-               io_size -= ret;
                rw->bytes_done += ret;
+               iov_iter_advance(iter, ret);
+               if (!iov_iter_count(iter))
+                       break;
+               iov_iter_save_state(iter, state);

Note how it first does that iov_iter_restore() on iter/state, but
then it *replaces* the iter/state pointers, and then it does
iov_iter_advance() on the replacement ones.

I don't see how that could be right. You're doing iov_iter_advance()
on something other than the one you restored to the original values.

And if it is right, it's sure confusing as hell.

             Linus
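For anyone skimming the thread, the cycle Linus describes boils down to
three helper calls. A minimal sketch, not patch code (the wrapper
function name is made up for illustration; only the series' new
save/restore helpers plus the existing advance/count primitives are
assumed):

static void sync_iter_after_partial_io(struct iov_iter *iter,
                                       struct iov_iter_state *state,
                                       size_t ret)
{
        /* rewind to the snapshot taken before calling ->read_iter() */
        iov_iter_restore(iter, state);
        /* walk forward by only the bytes the partial IO transferred */
        iov_iter_advance(iter, ret);
        /* if anything remains, snapshot the new position for the retry */
        if (iov_iter_count(iter))
                iov_iter_save_state(iter, state);
}

Restoring and then re-advancing is what makes the re-walk cost
proportional to the number of segments in the worst case, which is the
expense noted above.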
On 9/14/21 12:45 PM, Linus Torvalds wrote:
> On Tue, Sep 14, 2021 at 7:18 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> +       iov_iter_restore(iter, state);
>> +
> ...
>>                 rw->bytes_done += ret;
>> +               iov_iter_advance(iter, ret);
>> +               if (!iov_iter_count(iter))
>> +                       break;
>> +               iov_iter_save_state(iter, state);
>
> Ok, so now you keep the iov_iter and the state always in sync by just
> always resetting the iter back and then walking it forward explicitly
> - and re-saving the state.
>
> That seems safe, if potentially unnecessarily expensive.

Right, it's not ideal if it's a big range of IO, then it'll definitely
be noticeable. But not too worried about it, at least not for now...

> I guess re-walking lots of iovec entries is actually very unlikely in
> practice, so maybe this "stupid brute-force" model is the right one.

Not sure what the alternative is here. We could do something similar to
__io_import_fixed() as we're only dealing with iter types where we can
do that, but probably best left as a later optimization if it's deemed
necessary.

> I do find the odd "use __state vs rw->state" to be very confusing,
> though. Particularly in io_read(), where you do this:
>
> +       iov_iter_restore(iter, state);
> +
>         ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
>         if (ret2)
>                 return ret2;
>
>         iovec = NULL;
>         rw = req->async_data;
> -       /* now use our persistent iterator, if we aren't already */
> -       iter = &rw->iter;
> +       /* now use our persistent iterator and state, if we aren't already */
> +       if (iter != &rw->iter) {
> +               iter = &rw->iter;
> +               state = &rw->iter_state;
> +       }
>
>         do {
> -               io_size -= ret;
>                 rw->bytes_done += ret;
> +               iov_iter_advance(iter, ret);
> +               if (!iov_iter_count(iter))
> +                       break;
> +               iov_iter_save_state(iter, state);
>
> Note how it first does that iov_iter_restore() on iter/state, but
> then it *replaces* the iter/state pointers, and then it does
> iov_iter_advance() on the replacement ones.

We restore the iter so it's the same as before we did the read_iter
call, and then setup a consistent copy of the iov/iter in case we need
to punt this request for retry. rw->iter should have the same state as
iter at this point, and since rw->iter is the copy we'll use going
forward, we're advancing that one in case ret > 0.

The other case is that no persistent state is needed, and then iter
remains the same.

I'll take a second look at this part and see if I can make it a bit
more straightforward, or at least comment it properly.
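A sketch of the invariant Jens is relying on may help here (hypothetical
helper name, not patch code): by the time the pointers are swapped,
io_setup_async_rw() has copied the restored on-stack iter into
req->async_data, so the persistent copy describes exactly the same
position and advancing it is equivalent to advancing the original.

static void use_persistent_iter(struct io_async_rw *rw,
                                struct iov_iter **iter,
                                struct iov_iter_state **state)
{
        /*
         * rw->iter was populated from **iter (already restored) by
         * io_setup_async_rw(), so swapping pointers here does not
         * change the logical position that the subsequent
         * iov_iter_advance() acts on.
         */
        if (*iter != &rw->iter) {
                *iter = &rw->iter;
                *state = &rw->iter_state;
        }
}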
On 9/14/21 1:37 PM, Jens Axboe wrote:
> On 9/14/21 12:45 PM, Linus Torvalds wrote:
>> On Tue, Sep 14, 2021 at 7:18 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> +       iov_iter_restore(iter, state);
>>> +
>> ...
>>>                 rw->bytes_done += ret;
>>> +               iov_iter_advance(iter, ret);
>>> +               if (!iov_iter_count(iter))
>>> +                       break;
>>> +               iov_iter_save_state(iter, state);
>>
>> Ok, so now you keep the iov_iter and the state always in sync by just
>> always resetting the iter back and then walking it forward explicitly
>> - and re-saving the state.
>>
>> That seems safe, if potentially unnecessarily expensive.
>
> Right, it's not ideal if it's a big range of IO, then it'll definitely
> be noticeable. But not too worried about it, at least not for now...
>
>> I guess re-walking lots of iovec entries is actually very unlikely in
>> practice, so maybe this "stupid brute-force" model is the right one.
>
> Not sure what the alternative is here. We could do something similar to
> __io_import_fixed() as we're only dealing with iter types where we can
> do that, but probably best left as a later optimization if it's deemed
> necessary.
>
>> I do find the odd "use __state vs rw->state" to be very confusing,
>> though. Particularly in io_read(), where you do this:
>>
>> +       iov_iter_restore(iter, state);
>> +
>>         ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
>>         if (ret2)
>>                 return ret2;
>>
>>         iovec = NULL;
>>         rw = req->async_data;
>> -       /* now use our persistent iterator, if we aren't already */
>> -       iter = &rw->iter;
>> +       /* now use our persistent iterator and state, if we aren't already */
>> +       if (iter != &rw->iter) {
>> +               iter = &rw->iter;
>> +               state = &rw->iter_state;
>> +       }
>>
>>         do {
>> -               io_size -= ret;
>>                 rw->bytes_done += ret;
>> +               iov_iter_advance(iter, ret);
>> +               if (!iov_iter_count(iter))
>> +                       break;
>> +               iov_iter_save_state(iter, state);
>>
>> Note how it first does that iov_iter_restore() on iter/state, but
>> then it *replaces* the iter/state pointers, and then it does
>> iov_iter_advance() on the replacement ones.
>
> We restore the iter so it's the same as before we did the read_iter
> call, and then setup a consistent copy of the iov/iter in case we need
> to punt this request for retry. rw->iter should have the same state as
> iter at this point, and since rw->iter is the copy we'll use going
> forward, we're advancing that one in case ret > 0.
>
> The other case is that no persistent state is needed, and then iter
> remains the same.
>
> I'll take a second look at this part and see if I can make it a bit
> more straightforward, or at least comment it properly.

I hacked up something that shortens the iter for the initial IO, so we
could more easily test the retry path and the state. It really is a
hack, but the idea was to issue 64K IO from fio, and then the initial
attempt would be anywhere from 4K-60K truncated. That forces retry.

I ran this with both 16 segments and 8 segments, verifying that it hits
both the UIO_FASTIOV and alloc path.

I did find one issue with that, see the last hunk in the hack. We need
to increment rw->bytes_done if we don't break, or set ret to 0 if we
do. Otherwise that last ret ends up being accounted twice. But apart
from that, it passes data verification runs.
diff --git a/fs/io_uring.c b/fs/io_uring.c
index dc1ff47e3221..484c86252f9d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -744,6 +744,7 @@ enum {
 	REQ_F_NOWAIT_READ_BIT,
 	REQ_F_NOWAIT_WRITE_BIT,
 	REQ_F_ISREG_BIT,
+	REQ_F_TRUNCATED_BIT,
 
 	/* not a real bit, just to check we're not overflowing the space */
 	__REQ_F_LAST_BIT,
@@ -797,6 +798,7 @@ enum {
 	REQ_F_REFCOUNT		= BIT(REQ_F_REFCOUNT_BIT),
 	/* there is a linked timeout that has to be armed */
 	REQ_F_ARM_LTIMEOUT	= BIT(REQ_F_ARM_LTIMEOUT_BIT),
+	REQ_F_TRUNCATED		= BIT(REQ_F_TRUNCATED_BIT),
 };
 
 struct async_poll {
@@ -3454,11 +3456,12 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 {
 	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
 	struct kiocb *kiocb = &req->rw.kiocb;
-	struct iov_iter __iter, *iter = &__iter;
+	struct iov_iter __i, __iter, *iter = &__iter;
 	struct io_async_rw *rw = req->async_data;
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
 	struct iov_iter_state __state, *state;
 	ssize_t ret, ret2;
+	bool do_restore = false;
 
 	if (rw) {
 		iter = &rw->iter;
@@ -3492,8 +3495,25 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 		return ret;
 	}
 
+	if (!(req->flags & REQ_F_TRUNCATED) && !(iov_iter_count(iter) & 4095)) {
+		int nr_vecs;
+
+		__i = *iter;
+		nr_vecs = 1 + (prandom_u32() % iter->nr_segs);
+		iter->nr_segs = nr_vecs;
+		iter->count = nr_vecs * 8192;
+		req->flags |= REQ_F_TRUNCATED;
+		do_restore = true;
+	}
+
 	ret = io_iter_do_read(req, iter);
 
+	if (ret == -EAGAIN) {
+		req->flags &= ~REQ_F_TRUNCATED;
+		*iter = __i;
+		do_restore = false;
+	}
+
 	if (ret == -EAGAIN || (req->flags & REQ_F_REISSUE)) {
 		req->flags &= ~REQ_F_REISSUE;
 		/* IOPOLL retry should happen for io-wq threads */
@@ -3513,6 +3533,9 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 
 	iov_iter_restore(iter, state);
 
+	if (do_restore)
+		*iter = __i;
+
 	ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
 	if (ret2)
 		return ret2;
@@ -3526,10 +3549,10 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 	}
 
 	do {
-		rw->bytes_done += ret;
 		iov_iter_advance(iter, ret);
 		if (!iov_iter_count(iter))
 			break;
+		rw->bytes_done += ret;
 		iov_iter_save_state(iter, state);
 
 		/* if we can retry, do so with the callbacks armed */
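The double accounting called out above is easiest to see in isolation.
Here is a sketch of the corrected loop ordering from the hack's last
hunk, assuming (as the completion path appears to) that kiocb_done()
adds rw->bytes_done on top of the final ret:

	do {
		iov_iter_advance(iter, ret);
		if (!iov_iter_count(iter))
			break;			/* final ret reported as-is */
		rw->bytes_done += ret;		/* bank only retried progress */
		iov_iter_save_state(iter, state);
		/* ... arm retry callbacks and reissue ... */
	} while (ret > 0);

Banking ret before the break would count the last chunk once in
bytes_done and again in the final return value; setting ret to 0 before
breaking would be the equivalent alternative fix.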
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 855ea544807f..dbc97d440801 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -712,6 +712,7 @@ struct io_async_rw {
 	struct iovec			fast_iov[UIO_FASTIOV];
 	const struct iovec		*free_iovec;
 	struct iov_iter			iter;
+	struct iov_iter_state		iter_state;
 	size_t				bytes_done;
 	struct wait_page_queue		wpq;
 };
@@ -2608,8 +2609,7 @@ static bool io_resubmit_prep(struct io_kiocb *req)
 	if (!rw)
 		return !io_req_prep_async(req);
 
-	/* may have left rw->iter inconsistent on -EIOCBQUEUED */
-	iov_iter_revert(&rw->iter, req->result - iov_iter_count(&rw->iter));
+	iov_iter_restore(&rw->iter, &rw->iter_state);
 	return true;
 }
 
@@ -3310,12 +3310,16 @@ static int io_setup_async_rw(struct io_kiocb *req, const struct iovec *iovec,
 	if (!force && !io_op_defs[req->opcode].needs_async_setup)
 		return 0;
 	if (!req->async_data) {
+		struct io_async_rw *iorw;
+
 		if (io_alloc_async_data(req)) {
 			kfree(iovec);
 			return -ENOMEM;
 		}
 
 		io_req_map_rw(req, iovec, fast_iov, iter);
+		iorw = req->async_data;
+		iov_iter_save_state(&iorw->iter, &iorw->iter_state);
 	}
 	return 0;
 }
@@ -3334,6 +3338,7 @@ static inline int io_rw_prep_async(struct io_kiocb *req, int rw)
 	iorw->free_iovec = iov;
 	if (iov)
 		req->flags |= REQ_F_NEED_CLEANUP;
+	iov_iter_save_state(&iorw->iter, &iorw->iter_state);
 	return 0;
 }
 
@@ -3437,19 +3442,23 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 	struct kiocb *kiocb = &req->rw.kiocb;
 	struct iov_iter __iter, *iter = &__iter;
 	struct io_async_rw *rw = req->async_data;
-	ssize_t io_size, ret, ret2;
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
+	struct iov_iter_state __state, *state;
+	ssize_t ret, ret2;
 
 	if (rw) {
 		iter = &rw->iter;
+		state = &rw->iter_state;
+		iov_iter_restore(iter, state);
 		iovec = NULL;
 	} else {
 		ret = io_import_iovec(READ, req, &iovec, iter, !force_nonblock);
 		if (ret < 0)
 			return ret;
+		state = &__state;
+		iov_iter_save_state(iter, state);
 	}
-	io_size = iov_iter_count(iter);
-	req->result = io_size;
+	req->result = iov_iter_count(iter);
 
 	/* Ensure we clear previously set non-block flag */
 	if (!force_nonblock)
@@ -3463,7 +3472,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 		return ret ?: -EAGAIN;
 	}
 
-	ret = rw_verify_area(READ, req->file, io_kiocb_ppos(kiocb), io_size);
+	ret = rw_verify_area(READ, req->file, io_kiocb_ppos(kiocb), req->result);
 	if (unlikely(ret)) {
 		kfree(iovec);
 		return ret;
@@ -3479,30 +3488,36 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 		/* no retry on NONBLOCK nor RWF_NOWAIT */
 		if (req->flags & REQ_F_NOWAIT)
 			goto done;
-		/* some cases will consume bytes even on error returns */
-		iov_iter_reexpand(iter, iter->count + iter->truncated);
-		iov_iter_revert(iter, io_size - iov_iter_count(iter));
 		ret = 0;
 	} else if (ret == -EIOCBQUEUED) {
 		goto out_free;
-	} else if (ret <= 0 || ret == io_size || !force_nonblock ||
+	} else if (ret <= 0 || ret == req->result || !force_nonblock ||
 		   (req->flags & REQ_F_NOWAIT) || !need_read_all(req)) {
 		/* read all, failed, already did sync or don't want to retry */
 		goto done;
 	}
 
+	iov_iter_restore(iter, state);
+
 	ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
 	if (ret2)
 		return ret2;
 
 	iovec = NULL;
 	rw = req->async_data;
-	/* now use our persistent iterator, if we aren't already */
-	iter = &rw->iter;
+	/* now use our persistent iterator and state, if we aren't already */
+	if (iter != &rw->iter) {
+		iter = &rw->iter;
+		state = &rw->iter_state;
+	}
 
 	do {
-		io_size -= ret;
 		rw->bytes_done += ret;
+		iov_iter_advance(iter, ret);
+		if (!iov_iter_count(iter))
+			break;
+		iov_iter_save_state(iter, state);
+
 		/* if we can retry, do so with the callbacks armed */
 		if (!io_rw_should_retry(req)) {
 			kiocb->ki_flags &= ~IOCB_WAITQ;
@@ -3520,7 +3535,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 			return 0;
 		/* we got some bytes, but not all. retry. */
 		kiocb->ki_flags &= ~IOCB_WAITQ;
-	} while (ret > 0 && ret < io_size);
+	} while (ret > 0);
 done:
 	kiocb_done(kiocb, ret, issue_flags);
 out_free:
@@ -3543,19 +3558,24 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	struct kiocb *kiocb = &req->rw.kiocb;
 	struct iov_iter __iter, *iter = &__iter;
 	struct io_async_rw *rw = req->async_data;
-	ssize_t ret, ret2, io_size;
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
+	struct iov_iter_state __state, *state;
+	ssize_t ret, ret2;
 
 	if (rw) {
 		iter = &rw->iter;
+		state = &rw->iter_state;
+		iov_iter_restore(iter, state);
 		iovec = NULL;
 	} else {
 		ret = io_import_iovec(WRITE, req, &iovec, iter, !force_nonblock);
 		if (ret < 0)
 			return ret;
+		state = &__state;
+		iov_iter_save_state(iter, state);
 	}
-	io_size = iov_iter_count(iter);
-	req->result = io_size;
+	req->result = iov_iter_count(iter);
+	ret2 = 0;
 
 	/* Ensure we clear previously set non-block flag */
 	if (!force_nonblock)
@@ -3572,7 +3592,7 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	    (req->flags & REQ_F_ISREG))
 		goto copy_iov;
 
-	ret = rw_verify_area(WRITE, req->file, io_kiocb_ppos(kiocb), io_size);
+	ret = rw_verify_area(WRITE, req->file, io_kiocb_ppos(kiocb), req->result);
 	if (unlikely(ret))
 		goto out_free;
 
@@ -3619,9 +3639,9 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 		kiocb_done(kiocb, ret2, issue_flags);
 	} else {
 copy_iov:
-		/* some cases will consume bytes even on error returns */
-		iov_iter_reexpand(iter, iter->count + iter->truncated);
-		iov_iter_revert(iter, io_size - iov_iter_count(iter));
+		iov_iter_restore(iter, state);
+		if (ret2 > 0)
+			iov_iter_advance(iter, ret2);
 		ret = io_setup_async_rw(req, iovec, inline_vecs, iter, false);
 		return ret ?: -EAGAIN;
 	}
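One detail worth calling out from the io_resubmit_prep() hunk above: the
old rewind had to infer how many bytes the iter had consumed, which the
deleted comment notes may be inconsistent after -EIOCBQUEUED, while
restoring a snapshot needs no inference. Side by side, excerpted from
the hunk:

	/* before: derive the consumed byte count, then walk backwards */
	iov_iter_revert(&rw->iter, req->result - iov_iter_count(&rw->iter));

	/* after: reset directly to the position saved at setup time */
	iov_iter_restore(&rw->iter, &rw->iter_state);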
Get rid of the need to do re-expand and revert on an iterator when we
encounter a short IO, or failure that warrants a retry. Use the new
state save/restore helpers instead.

We keep the iov_iter_state persistent across retries, if we need to
restart the read or write operation. If there's a pending retry, the
operation will always exit with the state correctly saved.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 62 ++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 41 insertions(+), 21 deletions(-)
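For readers without the rest of the series at hand: the save/restore
helpers this patch relies on are introduced earlier in the series, and
the saved state is deliberately small, capturing the iterator's position
rather than the iovec array itself. Roughly, paraphrased from the
earlier patch, so treat this as a sketch rather than the authoritative
definition:

/* Snapshot of an iov_iter's position; enough to rewind without
 * copying the segment array itself.
 */
struct iov_iter_state {
	size_t		iov_offset;	/* offset into the current segment */
	size_t		count;		/* bytes remaining */
	unsigned long	nr_segs;	/* segments remaining */
};

static inline void iov_iter_save_state(struct iov_iter *iter,
				       struct iov_iter_state *state)
{
	state->iov_offset = iter->iov_offset;
	state->count = iter->count;
	state->nr_segs = iter->nr_segs;
}

Restoring rewinds the segment pointer by the difference in nr_segs,
which is why saving is cheap but a retry may still need the forward
re-walk discussed in the thread above.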