mbox series

[PATCHSET,0/3] Add ability to save/restore iov_iter state

Message ID 20210910182536.685100-1-axboe@kernel.dk (mailing list archive)
Headers show
Series Add ability to save/restore iov_iter state | expand

Message

Jens Axboe Sept. 10, 2021, 6:25 p.m. UTC
Hi,

Linus didn't particularly love the iov_iter->truncated addition and how
it was used, and it was hard to disagree with that. Instead of relying
on tracking ->truncated, add a few pieces of state so we can safely
handle partial or errored read/write attempts (that we want to retry).

Then we can get rid of the iov_iter addition, and at the same time
handle cases that weren't handled correctly before.

I've run this through vectored read/write with io_uring on the commonly
problematic cases (dm and low depth SCSI device) which trigger these
conditions often, and it seems to pass muster.

For a discussion on this topic, see the thread here:

https://lore.kernel.org/linux-fsdevel/CAHk-=wiacKV4Gh-MYjteU0LwNBSGpWrK-Ov25HdqB1ewinrFPg@mail.gmail.com/

You can find these patches here:

https://git.kernel.dk/cgit/linux-block/log/?h=iov_iter

Comments

Jens Axboe Sept. 13, 2021, 10:43 p.m. UTC | #1
On 9/10/21 12:25 PM, Jens Axboe wrote:
> Hi,
> 
> Linus didn't particularly love the iov_iter->truncated addition and how
> it was used, and it was hard to disagree with that. Instead of relying
> on tracking ->truncated, add a few pieces of state so we can safely
> handle partial or errored read/write attempts (that we want to retry).
> 
> Then we can get rid of the iov_iter addition, and at the same time
> handle cases that weren't handled correctly before.
> 
> I've run this through vectored read/write with io_uring on the commonly
> problematic cases (dm and low depth SCSI device) which trigger these
> conditions often, and it seems to pass muster.
> 
> For a discussion on this topic, see the thread here:
> 
> https://lore.kernel.org/linux-fsdevel/CAHk-=wiacKV4Gh-MYjteU0LwNBSGpWrK-Ov25HdqB1ewinrFPg@mail.gmail.com/
> 
> You can find these patches here:
> 
> https://git.kernel.dk/cgit/linux-block/log/?h=iov_iter

Al, Linus, are you OK with this? I think we should get this in for 5.15.
I didn't resend the whole series, just a v2 of patch 1/3 to fix that bvec
vs iovec issue. Let me know if you want the while thing resent.
Linus Torvalds Sept. 13, 2021, 11:23 p.m. UTC | #2
On Mon, Sep 13, 2021 at 3:43 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> Al, Linus, are you OK with this? I think we should get this in for 5.15.
> I didn't resend the whole series, just a v2 of patch 1/3 to fix that bvec
> vs iovec issue. Let me know if you want the while thing resent.

So I'm ok with the iov_iter side, but the io_uring side seems still
positively buggy, and very confused.

It also messes with the state in bad ways and has internal knowledge.
And some of it looks completely bogus.

For example, I see

        state->count -= ret;
        rw->bytes_done += ret;

and I go "that's BS". There's no way it's ok to start messing with the
byte count inside the state like that. That just means that the state
is now no longer the saved state, and it's some random garbage.

I also think that the "bytes_done += ret" is a big hint there: any
time you restore the iovec state, you should then forward it by
"bytes_done". But that's not what the code does.

Instead, it will now restore the iovec styate with the *wrong* number
of bytes remaining, but will start from the beginning of the iovec.

So I think the fs/io_uring.c use of this state buffer is completely wrong.

What *may* be the right thing to do is to

 (a) not mess with state->count

 (b) when you restore the state you always use

        iov_iter_restore(iter, state, bytes_done);

to actually restore the *correct* state.

Because modifying the iovec save state like that cannot be right, and
if it's right it's still too ugly and fragile for words. That save
state should be treated as a snapshot, not as a random buffer that you
can make arbitrary changes to.

See what I'm saying?

I'd like Al to take a look at the io_uring.c usage too, since this was
just my reaction from looking at that diff a bit more.

           Linus
Jens Axboe Sept. 14, 2021, 1:54 a.m. UTC | #3
On 9/13/21 5:23 PM, Linus Torvalds wrote:
> On Mon, Sep 13, 2021 at 3:43 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> Al, Linus, are you OK with this? I think we should get this in for 5.15.
>> I didn't resend the whole series, just a v2 of patch 1/3 to fix that bvec
>> vs iovec issue. Let me know if you want the while thing resent.
> 
> So I'm ok with the iov_iter side, but the io_uring side seems still
> positively buggy, and very confused.
> 
> It also messes with the state in bad ways and has internal knowledge.
> And some of it looks completely bogus.
> 
> For example, I see
> 
>         state->count -= ret;
>         rw->bytes_done += ret;
> 
> and I go "that's BS". There's no way it's ok to start messing with the
> byte count inside the state like that. That just means that the state
> is now no longer the saved state, and it's some random garbage.
>
> I also think that the "bytes_done += ret" is a big hint there: any
> time you restore the iovec state, you should then forward it by
> "bytes_done". But that's not what the code does.
> 
> Instead, it will now restore the iovec styate with the *wrong* number
> of bytes remaining, but will start from the beginning of the iovec.
> 
> So I think the fs/io_uring.c use of this state buffer is completely wrong.
> 
> What *may* be the right thing to do is to
> 
>  (a) not mess with state->count
> 
>  (b) when you restore the state you always use
> 
>         iov_iter_restore(iter, state, bytes_done);
> 
> to actually restore the *correct* state.
> 
> Because modifying the iovec save state like that cannot be right, and
> if it's right it's still too ugly and fragile for words. That save
> state should be treated as a snapshot, not as a random buffer that you
> can make arbitrary changes to.
> 
> See what I'm saying?

OK, for the do while loop itself, I do think we should be more
consistent and that would also get rid of the state->count manipulation.
I do agree that messing with that state is not something that should be
done, and we can do this more cleanly and consistently instead. Once we
hit the do {} while loop, state should be &rw->state and we can
consistently handle it that way.

Let me rework that bit and run the tests, and I'll post a v2 tomorrow.
Thanks for taking a closer look.