[PATCHSET v3 0/3] Add ability to save/restore iov_iter state

Message ID: 20210915162937.777002-1-axboe@kernel.dk

Message

Jens Axboe Sept. 15, 2021, 4:29 p.m. UTC
Hi,

Linus didn't particularly love the iov_iter->truncated addition and how
it was used, and it was hard to disagree with that. Instead of relying
on tracking ->truncated, add a few pieces of state so we can safely
handle partial or errored read/write attempts (that we want to retry).

Then we can get rid of the iov_iter addition, and at the same time
handle cases that weren't handled correctly before.

I've run this through vectored read/write with io_uring on the commonly
problematic cases (dm and low queue depth SCSI devices), which trigger
these conditions often, and it seems to pass muster. I've also hacked in
fake random short reads, and that helped find an issue with double
accounting. But otherwise it did validate the state handling.

For a discussion on this topic, see the thread here:

https://lore.kernel.org/linux-fsdevel/CAHk-=wiacKV4Gh-MYjteU0LwNBSGpWrK-Ov25HdqB1ewinrFPg@mail.gmail.com/

You can find these patches here:

https://git.kernel.dk/cgit/linux-block/log/?h=iov_iter.3

Changes since v2:
- Add comments in io_read() describing the flow
- Fix issue with rw->bytes_done being incremented too early and hence
  double accounting if we enter that bottom do {} while loop in io_read()
- Restore iter at the bottom of do {} while loop in io_read()

Changes since v1:
- Drop 'did_bytes' from iov_iter_restore(). Only two cases in io_uring
  used it, and one of them had to be changed with v2. Better to just
  make the subsequent iov_iter_advance() explicit at that point.
- Cleanup and sanitize the io_uring side, and ensure it's sane around
  worker retries. No more digging into iov_iter_state from io_uring, we
  use it just for save/restore purposes.
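
For reference, the pattern the new helpers enable looks roughly like
this (a sketch, not the actual io_uring code; do_the_read and
bytes_done are illustrative placeholders):

	struct iov_iter_state state;
	ssize_t ret;

	/* snapshot iov_offset/count/nr_segs before issuing the IO */
	iov_iter_save_state(iter, &state);
	ret = do_the_read(kiocb, iter);
	if (ret == -EAGAIN) {
		/* roll the iter back to the snapshot, advance past any
		 * bytes that were already accounted, then retry */
		iov_iter_restore(iter, &state);
		iov_iter_advance(iter, bytes_done);
	}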

Comments

Linus Torvalds Sept. 15, 2021, 6:32 p.m. UTC | #1
On Wed, Sep 15, 2021 at 9:29 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> I've run this through vectored read/write with io_uring on the commonly
> problematic cases (dm and low queue depth SCSI devices), which trigger
> these conditions often, and it seems to pass muster. I've also hacked in
> fake random short reads, and that helped find an issue with double
> accounting. But otherwise it did validate the state handling.

Ok, so I can't see anything obviously wrong with this, or anything I
can object to. It's still fairly complicated, and I don't love how
hard it is to follow some of it, but I do believe it's better.

IOW, I don't have any objections. Al was saying he was looking at the
io_uring code, so maybe he'll find something.

Do you have these test-cases as some kind of test-suite so that this
all stays correct?

            Linus
Jens Axboe Sept. 15, 2021, 6:46 p.m. UTC | #2
On 9/15/21 12:32 PM, Linus Torvalds wrote:
> On Wed, Sep 15, 2021 at 9:29 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> I've run this through vectored read/write with io_uring on the commonly
>> problematic cases (dm and low queue depth SCSI devices), which trigger
>> these conditions often, and it seems to pass muster. I've also hacked in
>> fake random short reads, and that helped find an issue with double
>> accounting. But otherwise it did validate the state handling.
> 
> Ok, so I can't see anything obviously wrong with this, or anything I
> can object to. It's still fairly complicated, and I don't love how
> hard it is to follow some of it, but I do believe it's better.

OK, good.

> IOW, I don't have any objections. Al was saying he was looking at the
> io_uring code, so maybe he'll find something.
> 
> Do you have these test-cases as some kind of test-suite so that this
> all stays correct?

Yep, liburing has a whole bunch of regression tests that we always run
for any change, and new cases are added as problems are found. It also
has test cases for new features, etc. This one is particularly difficult
to test and have confidence in, which is why I ended up hacking up that
fake short return so I knew I had exercised all of it. The usual tests
do end up hitting the -EAGAIN path quite easily for certain device
types, but not the short read/write.
Linus Torvalds Sept. 15, 2021, 7:26 p.m. UTC | #3
On Wed, Sep 15, 2021 at 11:46 AM Jens Axboe <axboe@kernel.dk> wrote:
>
>    The usual tests
> do end up hitting the -EAGAIN path quite easily for certain device
> types, but not the short read/write.

No way to do something like "read in file to make sure it's cached,
then invalidate caches from position X with POSIX_FADV_DONTNEED, then
do a read that crosses that cached/uncached boundary"?

To at least verify that "partly synchronous, but partly punted to async" case?

Or were you talking about some other situation?

            Linus
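
[For illustration, a minimal sketch of the suggested test using
liburing; the function name, offsets, and sizes are illustrative, and
error handling is omitted:]

#include <fcntl.h>
#include <unistd.h>
#include "liburing.h"

static void boundary_read(struct io_uring *ring, int fd, char *buf)
{
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;

	/* populate the page cache for the first 256K */
	pread(fd, buf, 256 * 1024, 0);
	/* drop the second half, so 128K..256K is uncached */
	posix_fadvise(fd, 128 * 1024, 128 * 1024, POSIX_FADV_DONTNEED);

	/* one read crossing the cached/uncached boundary at 128K: the
	 * cached part can complete inline, the rest gets punted async */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_read(sqe, fd, buf, 256 * 1024, 0);
	io_uring_submit(ring);
	io_uring_wait_cqe(ring, &cqe);
	io_uring_cqe_seen(ring, cqe);
}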
Jens Axboe Sept. 15, 2021, 7:40 p.m. UTC | #4
On 9/15/21 1:26 PM, Linus Torvalds wrote:
> On Wed, Sep 15, 2021 at 11:46 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>>    The usual tests
>> do end up hitting the -EAGAIN path quite easily for certain device
>> types, but not the short read/write.
> 
> No way to do something like "read in file to make sure it's cached,
> then invalidate caches from position X with POSIX_FADV_DONTNEED, then
> do a read that crosses that cached/uncached boundary"?
> 
> To at least verify that "partly synchronous, but partly punted to
> async" case?
> 
> Or were you talking about some other situation?

No, that covers some of it, and that happens naturally with buffered IO.
The typical case is -EAGAIN on the first try, then you get a partial
read or all of it on the next loop, and then you're done or you continue.
I tend to run fio verification workloads for that, as you get all the
flexibility of fio with the data verification. And there are tests in
there that run DONTNEED in parallel with buffered IO, exactly to catch
some of these cases. But they don't verify the data, generally.

In that sense buffered IO is a lot easier than O_DIRECT, as it's easier
to provoke these cases. And that does hit all the save/restore parts and
looping, and if you do it with registered buffers then you get to work
with the bvec iter as well. O_DIRECT may get you -EAGAIN on low queue
depth devices, but it'll never do a short read/write after that.

But that's not in the regression tests. I'll write a test case
that can go with the liburing regressions for it.
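
[A sketch of what the registered-buffer variant looks like from the
application side, which is what gets the kernel working with a bvec
iter for the read; buf, fd, off, and CHUNK_SIZE are assumed from
context, and error handling is omitted:]

	struct iovec iov = { .iov_base = buf, .iov_len = CHUNK_SIZE };
	struct io_uring_sqe *sqe;

	/* register the buffer once up front */
	io_uring_register_buffers(ring, &iov, 1);

	/* read into the fixed buffer; index 0 refers to the iovec above */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_read_fixed(sqe, fd, buf, CHUNK_SIZE, off, 0);
	io_uring_submit(ring);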
Jens Axboe Sept. 15, 2021, 10:42 p.m. UTC | #5
On 9/15/21 1:40 PM, Jens Axboe wrote:
> On 9/15/21 1:26 PM, Linus Torvalds wrote:
>> On Wed, Sep 15, 2021 at 11:46 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>>    The usual tests
>>> do end up hitting the -EAGAIN path quite easily for certain device
>>> types, but not the short read/write.
>>
>> No way to do something like "read in file to make sure it's cached,
>> then invalidate caches from position X with POSIX_FADV_DONTNEED, then
>> do a read that crosses that cached/uncached boundary"?
>>
>> To at least verify that "partly synchronous, but partly punted to
>> async" case?
>>
>> Or were you talking about some other situation?
> 
> No, that covers some of it, and that happens naturally with buffered IO.
> The typical case is -EAGAIN on the first try, then you get a partial
> read or all of it on the next loop, and then you're done or you continue.
> I tend to run fio verification workloads for that, as you get all the
> flexibility of fio with the data verification. And there are tests in
> there that run DONTNEED in parallel with buffered IO, exactly to catch
> some of these cases. But they don't verify the data, generally.
> 
> In that sense buffered IO is a lot easier than O_DIRECT, as it's easier
> to provoke these cases. And that does hit all the save/restore parts and
> looping, and if you do it with registered buffers then you get to work
> with the bvec iter as well. O_DIRECT may get you -EAGAIN on low queue
> depth devices, but it'll never do a short read/write after that.
> 
> But that's not in the regression tests. I'll write a test case
> that can go with the liburing regressions for it.

OK, I wrote one, quick'n dirty. It's written as a liburing test, which
means it can take no arguments (in which case it creates a 128MB file),
or it can take an argument and it'll use that argument as the file. We
fill the first 128MB of the file with known data, basically the offset
within the file. Then we read it back in each of the following ways:

1) Using non-vectored reads
2) Using vectored reads, with a segment count that fits in UIO_FASTIOV
3) Using vectored reads, with more segments than UIO_FASTIOV

This catches all the different cases for a read.

We do that with both buffered and O_DIRECT, and before each pass, we
randomly DONTNEED the first, middle, or end part of each chunk-sized
segment we read.

I ran this on my laptop, and I found this:
axboe@p1 ~/gi/liburing (master)> test/file-verify                                0.100s
bad read 229376, read 3
Buffered novec test failed
axboe@p1 ~/gi/liburing (master)> test/file-verify                                0.213s
bad read 294912, read 0
Buffered novec test failed

which is because I'm running the iov_iter.2 stuff, and we're hitting
that double accounting issue that I mentioned in the cover letter for
this series. That's why the read return is larger than we ask for
(128K). Running it on the current branch passes:

[root@archlinux liburing]# for i in $(seq 10); do test/file-verify; done
[root@archlinux liburing]# 

(this is in my test vm that I run on the laptop for kernel testing,
hence the root and different hostname).

I will add this as a liburing regression test case. It probably needs a
bit of cleaning up first; it was just a quick prototype, as I thought
your suggestion was a good one. I'll probably change it to run at a
higher queue depth than just the 1 it uses now.


/* SPDX-License-Identifier: MIT */
/*
 * Description: run various read verify tests
 *
 */
#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <assert.h>

#include "helpers.h"
#include "liburing.h"

#define FSIZE		(128*1024*1024)
#define CHUNK_SIZE	131072
#define PUNCH_SIZE	32768

static int verify_buf(void *buf, size_t size, off_t off)
{
	int i, u_in_buf = size / sizeof(unsigned int);
	unsigned int *ptr;

	off /= sizeof(unsigned int);
	ptr = buf;
	for (i = 0; i < u_in_buf; i++) {
		if (off != *ptr) {
			fprintf(stderr, "Found %u, wanted %lu\n", *ptr, off);
			return 1;
		}
		ptr++;
		off++;
	}

	return 0;
}

enum {
	PUNCH_NONE,
	PUNCH_FRONT,
	PUNCH_MIDDLE,
	PUNCH_END,
};

/*
 * For each chunk in file, DONTNEED a start, end, or middle segment of it.
 * We enter here with the file fully cached every time, either freshly
 * written or after other reads.
 */
static int do_punch(int fd)
{
	off_t offset = 0;
	int punch_type;

	while (offset + CHUNK_SIZE <= FSIZE) {
		off_t punch_off;

		punch_type = rand() % (PUNCH_END + 1);
		switch (punch_type) {
		default:
		case PUNCH_NONE:
			punch_off = -1; /* gcc... */
			break;
		case PUNCH_FRONT:
			punch_off = offset;
			break;
		case PUNCH_MIDDLE:
			punch_off = offset + PUNCH_SIZE;
			break;
		case PUNCH_END:
			punch_off = offset + CHUNK_SIZE - PUNCH_SIZE;
			break;
		}

		offset += CHUNK_SIZE;
		if (punch_type == PUNCH_NONE)
			continue;
		if (posix_fadvise(fd, punch_off, PUNCH_SIZE, POSIX_FADV_DONTNEED) < 0) {
			perror("posix_fadivse");
			return 1;
		}
	}

	return 0;
}

static int test(struct io_uring *ring, const char *fname, int buffered,
		int vectored, int small_vecs)
{
	struct io_uring_cqe *cqe;
	struct io_uring_sqe *sqe;
	struct iovec *vecs;
	int ret, fd, flags;
	void *buf;
	size_t left;
	off_t off, voff;
	int i, j, nr_vecs;

	flags = O_RDONLY;
	if (!buffered)
		flags |= O_DIRECT;
	fd = open(fname, flags);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	if (do_punch(fd)) {
		close(fd);
		return 1;
	}

	if (vectored) {
		int vec_size;

		if (small_vecs)
			nr_vecs = 8;
		else
			nr_vecs = 16;
		vecs = t_malloc(nr_vecs * sizeof(struct iovec));
		vec_size = CHUNK_SIZE / nr_vecs;
		for (i = 0; i < nr_vecs; i++) {
			t_posix_memalign(&buf, 4096, vec_size);
			vecs[i].iov_base = buf;
			vecs[i].iov_len = vec_size;
		}
	} else {
		t_posix_memalign(&buf, 4096, CHUNK_SIZE);
		nr_vecs = 0;
		vecs = NULL;
	}

	i = 0;
	left = FSIZE;
	off = 0;
	while (left) {
		size_t this = left;

		if (this > CHUNK_SIZE)
			this = CHUNK_SIZE;

		sqe = io_uring_get_sqe(ring);
		if (!sqe) {
			fprintf(stderr, "get sqe failed\n");
			goto err;
		}

		if (vectored)
			io_uring_prep_readv(sqe, fd, vecs, nr_vecs, off);
		else
			io_uring_prep_read(sqe, fd, buf, this, off);
		sqe->user_data = off;
		off += this;

		ret = io_uring_submit(ring);
		if (ret <= 0) {
			fprintf(stderr, "sqe submit failed: %d\n", ret);
			goto err;
		}

		ret = io_uring_wait_cqe(ring, &cqe);
		if (ret < 0) {
			fprintf(stderr, "wait completion %d\n", ret);
			goto err;
		}
		if (cqe->res != this) {
			fprintf(stderr, "bad read %d, read %d\n", cqe->res, i);
			goto err;
		}
		if (vectored) {
			voff = cqe->user_data;
			/* separate index: 'i' is the chunk counter used above */
			for (j = 0; j < nr_vecs; j++) {
				if (verify_buf(vecs[j].iov_base, vecs[j].iov_len,
						voff)) {
					fprintf(stderr, "failed at off %lu\n", (unsigned long) voff);
					goto err;
				}
				voff += vecs[j].iov_len;
			}
		} else {
			if (verify_buf(buf, CHUNK_SIZE, cqe->user_data)) {
				fprintf(stderr, "failed at off %lu\n", (unsigned long) cqe->user_data);
				goto err;
			}
		}
		io_uring_cqe_seen(ring, cqe);
		i++;
		left -= CHUNK_SIZE;
	}

	ret = 0;
done:
	if (vectored) {
		for (i = 0; i < nr_vecs; i++)
			free(vecs[i].iov_base);
		free(vecs);
	} else {
		free(buf);
	}
	close(fd);
	return ret;
err:
	ret = 1;
	goto done;
}

static int fill_pattern(const char *fname)
{
	size_t left = FSIZE;
	unsigned int val, *ptr;
	void *buf;
	int fd, i;

	fd = open(fname, O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	val = 0;
	buf = t_malloc(4096);
	while (left) {
		int u_in_buf = 4096 / sizeof(val);
		size_t this = left;

		if (this > 4096)
			this = 4096;
		ptr = buf;
		for (i = 0; i < u_in_buf; i++) {
			*ptr = val;
			val++;
			ptr++;
		}
		if (write(fd, buf, 4096) != 4096) {
			close(fd);
			free(buf);
			return 1;
		}
		left -= 4096;
	}

	fsync(fd);
	close(fd);
	free(buf);
	return 0;
}

int main(int argc, char *argv[])
{
	struct io_uring ring;
	const char *fname;
	char buf[32];
	int ret;

	srand(getpid());

	if (argc > 1) {
		fname = argv[1];
	} else {
		sprintf(buf, ".%d", getpid());
		fname = buf;
		t_create_file(fname, FSIZE);
	}

	ret = io_uring_queue_init(64, &ring, 0);
	if (ret) {
		fprintf(stderr, "ring setup failed: %d\n", ret);
		goto err;
	}

	if (fill_pattern(fname))
		goto err;

	ret = test(&ring, fname, 1, 0, 0);
	if (ret) {
		fprintf(stderr, "Buffered novec test failed\n");
		goto err;
	}
	ret = test(&ring, fname, 1, 1, 0);
	if (ret) {
		fprintf(stderr, "Buffered vec test failed\n");
		goto err;
	}
	ret = test(&ring, fname, 1, 1, 1);
	if (ret) {
		fprintf(stderr, "Buffered small vec test failed\n");
		goto err;
	}

	ret = test(&ring, fname, 0, 0, 0);
	if (ret) {
		fprintf(stderr, "O_DIRECt novec test failed\n");
		goto err;
	}
	ret = test(&ring, fname, 0, 1, 0);
	if (ret) {
		fprintf(stderr, "O_DIRECt vec test failed\n");
		goto err;
	}
	ret = test(&ring, fname, 0, 1, 1);
	if (ret) {
		fprintf(stderr, "O_DIRECt small vec test failed\n");
		goto err;
	}

	if (buf == fname)
		unlink(fname);
	return 0;
err:
	if (buf == fname)
		unlink(fname);
	return 1;
}
Jens Axboe Sept. 16, 2021, 1:15 a.m. UTC | #6
On 9/15/21 4:42 PM, Jens Axboe wrote:
> On 9/15/21 1:40 PM, Jens Axboe wrote:
>> On 9/15/21 1:26 PM, Linus Torvalds wrote:
>>> On Wed, Sep 15, 2021 at 11:46 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>>    The usual tests
>>>> do end up hitting the -EAGAIN path quite easily for certain device
>>>> types, but not the short read/write.
>>>
>>> No way to do something like "read in file to make sure it's cached,
>>> then invalidate caches from position X with POSIX_FADV_DONTNEED, then
>>> do a read that crosses that cached/uncached boundary"?
>>>
>>> To at least verify that "partly synchronous, but partly punted to
>>> async" case?
>>>
>>> Or were you talking about some other situation?
>>
>> No, that covers some of it, and that happens naturally with buffered IO.
>> The typical case is -EAGAIN on the first try, then you get a partial
>> read or all of it on the next loop, and then you're done or you continue.
>> I tend to run fio verification workloads for that, as you get all the
>> flexibility of fio with the data verification. And there are tests in
>> there that run DONTNEED in parallel with buffered IO, exactly to catch
>> some of these cases. But they don't verify the data, generally.
>>
>> In that sense buffered IO is a lot easier than O_DIRECT, as it's easier
>> to provoke these cases. And that does hit all the save/restore parts and
>> looping, and if you do it with registered buffers then you get to work
>> with the bvec iter as well. O_DIRECT may get you -EAGAIN on low queue
>> depth devices, but it'll never do a short read/write after that.
>>
>> But that's not in the regression tests. I'll write a test case
>> that can go with the liburing regressions for it.
> 
> OK, I wrote one, quick'n dirty. It's written as a liburing test, which
> means it can take no arguments (in which case it creates a 128MB file),
> or it can take an argument and it'll use that argument as the file. We
> fill the first 128MB of the file with known data, basically the offset
> within the file. Then we read it back in each of the following ways:
> 
> 1) Using non-vectored reads
> 2) Using vectored reads, with a segment count that fits in UIO_FASTIOV
> 3) Using vectored reads, with more segments than UIO_FASTIOV
> 
> This catches all the different cases for a read.
> 
> We do that with both buffered and O_DIRECT, and before each pass, we
> randomly DONTNEED the first, middle, or end part of each chunk-sized
> segment we read.
> 
> I ran this on my laptop, and I found this:
> axboe@p1 ~/gi/liburing (master)> test/file-verify                                0.100s
> bad read 229376, read 3
> Buffered novec test failed
> axboe@p1 ~/gi/liburing (master)> test/file-verify                                0.213s
> bad read 294912, read 0
> Buffered novec test failed
> 
> which is because I'm running the iov_iter.2 stuff, and we're hitting
> that double accounting issue that I mentioned in the cover letter for
> this series. That's why the read return is larger than we ask for
> (128K). Running it on the current branch passes:
> 
> [root@archlinux liburing]# for i in $(seq 10); do test/file-verify; done
> [root@archlinux liburing]# 
> 
> (this is in my test vm that I run on the laptop for kernel testing,
> hence the root and different hostname).
> 
> I will add this as a liburing regression test case. It probably needs a
> bit of cleaning up first; it was just a quick prototype, as I thought
> your suggestion was a good one. I'll probably change it to run at a
> higher queue depth than just the 1 it uses now.

Cleaned it up a bit, added registered buffer support (which is another
variant of non-vectored reads), and added queued IO support as well:

https://git.kernel.dk/cgit/liburing/commit/?id=6ab387dab745aff2af760d9fed56a4154669edec

and it's now part of the regular testing. Here's my usual run:

Running test file-verify                                            3 sec
Running test file-verify /dev/nvme0n1p2                             3 sec
Running test file-verify /dev/nvme1n1p1                             3 sec
Running test file-verify /dev/sdc2                                  Test file-verify timed out (may not be a failure)
Running test file-verify /dev/dm-0                                  3 sec
Running test file-verify /data/file                                 3 sec

Note that the sdc2 timeout isn't a failure; it's just that emulation on
qemu is slow enough that the test takes 1min20s to run, and I time out
tests after 60s in the harness to prevent anything stalling forever.
Al Viro Sept. 16, 2021, 4:47 a.m. UTC | #7
Jens, may I politely inquire why struct io_rw is playing these games
with overloading ->rw.addr, instead of simply having struct io_buffer
*kbuf in it?

	Another question: what the hell are the rules for
REQ_F_BUFFER_SELECT?  The first time around, io_iov_buffer_select() will
	* read the iovec from ->rw.addr
	* replace iovec.iov_base with a value derived from ->buf_index
	* cap iovec.iov_len with a value derived from ->buf_index
Next time around it will use the same base *AND* replace the length
with the value used to cap the original.
	Is that deliberate?
Jens Axboe Sept. 16, 2021, 4:10 p.m. UTC | #8
On 9/15/21 10:47 PM, Al Viro wrote:
> 	Jens, may I politely inquire why struct io_rw is playing these
> games with overloading ->rw.addr, instead of simply having struct
> io_buffer *kbuf in it?

Very simply, to avoid growing the union command part of io_kiocb beyond
a cacheline. We're pretty sensitive to io_kiocb size in general, and
io_rw is already the biggest member in there.
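
For context, io_rw looks roughly like this (a trimmed sketch of that
era's fs/io_uring.c, from memory): ->addr doubles as the user address
for the IO and, once a buffer has been selected for
REQ_F_BUFFER_SELECT, the address of the selected buffer, rather than
growing the struct with a dedicated kbuf pointer.

	struct io_rw {
		/* NOTE: kiocb has the file as the first member, so don't do it here */
		struct kiocb		kiocb;
		u64			addr;
		u64			len;
	};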
 
> 	Another question: what the hell are the rules for
> REQ_F_BUFFER_SELECT?  The first time around, io_iov_buffer_select() will
> 	* read the iovec from ->rw.addr
> 	* replace iovec.iov_base with a value derived from ->buf_index
> 	* cap iovec.iov_len with a value derived from ->buf_index
> Next time around it will use the same base *AND* replace the length
> with the value used to cap the original.
> 	Is that deliberate?

Probably not strictly needed, but it doesn't harm anything. The buffer
is being consumed (and hence removed) at completion anyway; it's not a
persistent change. Selected buffers must be re-provided by the
application, as the kernel has no way of knowing when the application
is ready for them to be reused, and that's done by issuing a new
provide-buffers request for the buffers that can be recycled.
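
For reference, the re-provide from the application side is just another
SQE (a sketch; buf, buf_len, bgid, and bid are whatever buffer, length,
group ID, and buffer ID the application used originally, and error
handling is omitted):

	struct io_uring_sqe *sqe;

	sqe = io_uring_get_sqe(ring);
	/* hand one buffer of buf_len bytes back to group bgid, as id bid */
	io_uring_prep_provide_buffers(sqe, buf, buf_len, 1, bgid, bid);
	io_uring_submit(ring);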