[2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio

Message ID 20180719093918.28876-3-mwilck@suse.com (mailing list archive)
State New, archived
Headers show

Commit Message

Martin Wilck July 19, 2018, 9:39 a.m. UTC
bio_iov_iter_get_pages() returns pages for only a single non-empty
segment of the input iov_iter's iovec. This may be far fewer than the
number of pages __blkdev_direct_IO_simple() is supposed to process. Call
bio_iov_iter_get_pages() repeatedly until either the requested number
of bytes is reached or bio.bi_io_vec is exhausted. If this is not done,
short writes or reads may occur for direct synchronous IOs with multiple
iovec slots (such as those generated by writev()). In that case,
__generic_file_write_iter() falls back to buffered writes, which has
been observed to cause data corruption in certain workloads.

Note: if segments aren't page-aligned in the input iovec, this patch may
result in multiple adjacent slots of the bi_io_vec array referencing the
same page (the byte ranges are guaranteed to be disjoint if the preceding
patch is applied). We haven't seen problems with that in our tests or the
customer's. It would be possible to detect this situation and merge
bi_io_vec slots that refer to the same page, but I prefer to keep it
simple for now.

Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io")
Signed-off-by: Martin Wilck <mwilck@suse.com>
---
 fs/block_dev.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)
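
For illustration, the failing case is a synchronous, multi-segment direct
write such as the following sketch (a minimal, hypothetical reproducer;
the device path, block size and segment count are assumptions made for
illustration and are not taken from the report):

	/* Sketch: multi-segment O_DIRECT writev() of the kind that
	 * exercises the short-write path described above. */
	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/uio.h>
	#include <unistd.h>

	int main(void)
	{
		enum { NSEG = 4, SEG = 4096 };
		struct iovec iov[NSEG];
		ssize_t n;
		int i;
		/* placeholder device node, opened for direct IO */
		int fd = open("/dev/sdX", O_WRONLY | O_DIRECT);

		if (fd < 0)
			return 1;

		for (i = 0; i < NSEG; i++) {
			/* O_DIRECT wants block-aligned buffers and lengths */
			if (posix_memalign(&iov[i].iov_base, SEG, SEG))
				return 1;
			memset(iov[i].iov_base, 'A' + i, SEG);
			iov[i].iov_len = SEG;
		}

		/* One synchronous direct write with several iovec slots:
		 * without the fix, only the first segment's pages end up in
		 * the bio, the write comes back short, and
		 * __generic_file_write_iter() falls back to buffered IO for
		 * the rest. */
		n = writev(fd, iov, NSEG);

		return n == (ssize_t)NSEG * SEG ? 0 : 1;
	}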

Comments

Hannes Reinecke July 19, 2018, 10:06 a.m. UTC | #1
On 07/19/2018 11:39 AM, Martin Wilck wrote:
> bio_iov_iter_get_pages() returns only pages for a single non-empty
> segment of the input iov_iter's iovec. This may be much less than the number
> of pages __blkdev_direct_IO_simple() is supposed to process. Call
> bio_iov_iter_get_pages() repeatedly until either the requested number
> of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done,
> short writes or reads may occur for direct synchronous IOs with multiple
> iovec slots (such as generated by writev()). In that case,
> __generic_file_write_iter() falls back to buffered writes, which
> has been observed to cause data corruption in certain workloads.
> 
> Note: if segments aren't page-aligned in the input iovec, this patch may
> result in multiple adjacent slots of the bi_io_vec array to reference the same
> page (the byte ranges are guaranteed to be disjunct if the preceding patch is
> applied). We haven't seen problems with that in our and the customer's
> tests. It'd be possible to detect this situation and merge bi_io_vec slots
> that refer to the same page, but I prefer to keep it simple for now.
> 
> Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io")
> Signed-off-by: Martin Wilck <mwilck@suse.com>
> ---
>  fs/block_dev.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
Ming Lei July 19, 2018, 10:21 a.m. UTC | #2
On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> bio_iov_iter_get_pages() returns only pages for a single non-empty
> segment of the input iov_iter's iovec. This may be much less than the number
> of pages __blkdev_direct_IO_simple() is supposed to process. Call

In bio_iov_iter_get_pages(), iov_iter_get_pages() is supposed to retrieve
as many pages as possible, since both 'maxsize' and 'maxpages' are provided
to cover them all.

So the question is: why doesn't iov_iter_get_pages() work as expected?

Thanks,
Ming
Jan Kara July 19, 2018, 10:37 a.m. UTC | #3
On Thu 19-07-18 18:21:23, Ming Lei wrote:
> On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> > bio_iov_iter_get_pages() returns only pages for a single non-empty
> > segment of the input iov_iter's iovec. This may be much less than the number
> > of pages __blkdev_direct_IO_simple() is supposed to process. Call
> 
> In bio_iov_iter_get_pages(), iov_iter_get_pages() supposes to retrieve
> as many as possible pages since both 'maxsize' and 'maxpages' are provided
> to cover all.
> 
> So the question is why iov_iter_get_pages() doesn't work as expected?

Well, there has never been a promise that it will grab *all* pages in the
iter AFAIK. Practically, I think that it was just too hairy to implement in
the macro magic that iter processing is... Al might know more (added to
CC).

								Honza
Jan Kara July 19, 2018, 10:45 a.m. UTC | #4
On Thu 19-07-18 11:39:18, Martin Wilck wrote:
> bio_iov_iter_get_pages() returns only pages for a single non-empty
> segment of the input iov_iter's iovec. This may be much less than the number
> of pages __blkdev_direct_IO_simple() is supposed to process. Call
> bio_iov_iter_get_pages() repeatedly until either the requested number
> of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done,
> short writes or reads may occur for direct synchronous IOs with multiple
> iovec slots (such as generated by writev()). In that case,
> __generic_file_write_iter() falls back to buffered writes, which
> has been observed to cause data corruption in certain workloads.
> 
> Note: if segments aren't page-aligned in the input iovec, this patch may
> result in multiple adjacent slots of the bi_io_vec array to reference the same
> page (the byte ranges are guaranteed to be disjunct if the preceding patch is
> applied). We haven't seen problems with that in our and the customer's
> tests. It'd be possible to detect this situation and merge bi_io_vec slots
> that refer to the same page, but I prefer to keep it simple for now.
> 
> Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io")
> Signed-off-by: Martin Wilck <mwilck@suse.com>
> ---
>  fs/block_dev.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 0dd87aa..41643c4 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
>  
>  	ret = bio_iov_iter_get_pages(&bio, iter);
>  	if (unlikely(ret))
> -		return ret;
> +		goto out;
> +
> +	while (ret == 0 &&
> +	       bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0)
> +		ret = bio_iov_iter_get_pages(&bio, iter);
> +

I have two suggestions here (posting them now in public):

The condition bio.bi_vcnt < bio.bi_max_vecs should always be true - we made
sure we have enough vecs for the pages in the iter. So I'd WARN if it isn't
true.

Secondly, I don't think it is good to discard the error from
bio_iov_iter_get_pages() here and just submit partial IO. It will again
lead to part of the IO being done as direct IO and part being attempted as
buffered IO. Also, the "slow" direct IO path in __blkdev_direct_IO() behaves
differently - it aborts and returns an error if bio_iov_iter_get_pages()
ever returned an error. IMO we should do the same here.

								Honza
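
To make the two suggestions above concrete, the loop could take roughly the
following shape (a sketch only, written against the on-stack bio used by
__blkdev_direct_IO_simple(); this is not the code that was eventually
merged, and page release on the error path is omitted):

	ret = bio_iov_iter_get_pages(&bio, iter);
	while (ret == 0 && iov_iter_count(iter) > 0) {
		/* The bio was sized to hold all pages of the iter, so
		 * running out of bvecs here would indicate a bug. */
		if (WARN_ON_ONCE(bio.bi_vcnt >= bio.bi_max_vecs))
			break;
		ret = bio_iov_iter_get_pages(&bio, iter);
	}
	if (unlikely(ret))
		goto out;	/* abort instead of submitting partial IO */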
Ming Lei July 19, 2018, 10:46 a.m. UTC | #5
On Thu, Jul 19, 2018 at 12:37:13PM +0200, Jan Kara wrote:
> On Thu 19-07-18 18:21:23, Ming Lei wrote:
> > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> > > bio_iov_iter_get_pages() returns only pages for a single non-empty
> > > segment of the input iov_iter's iovec. This may be much less than the number
> > > of pages __blkdev_direct_IO_simple() is supposed to process. Call
> > 
> > In bio_iov_iter_get_pages(), iov_iter_get_pages() supposes to retrieve
> > as many as possible pages since both 'maxsize' and 'maxpages' are provided
> > to cover all.
> > 
> > So the question is why iov_iter_get_pages() doesn't work as expected?
> 
> Well, there has never been a promise that it will grab *all* pages in the
> iter AFAIK. Practically, I think that it was just too hairy to implement in
> the macro magic that iter processing is... Al might know more (added to
> CC).

OK, I got it: given that get_user_pages_fast() may fail, it is reasonable
to see fewer pages retrieved.

Thanks,
Ming
Al Viro July 19, 2018, 11:08 a.m. UTC | #6
On Thu, Jul 19, 2018 at 12:37:13PM +0200, Jan Kara wrote:
> On Thu 19-07-18 18:21:23, Ming Lei wrote:
> > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> > > bio_iov_iter_get_pages() returns only pages for a single non-empty
> > > segment of the input iov_iter's iovec. This may be much less than the number
> > > of pages __blkdev_direct_IO_simple() is supposed to process. Call
> > 
> > In bio_iov_iter_get_pages(), iov_iter_get_pages() supposes to retrieve
> > as many as possible pages since both 'maxsize' and 'maxpages' are provided
> > to cover all.

Huh?  Why does having a way to say "I (the caller) don't want more than <amount>"
imply anything of that sort?  It has never been promised, explicitly or not...

> > So the question is why iov_iter_get_pages() doesn't work as expected?
>
> Well, there has never been a promise that it will grab *all* pages in the
> iter AFAIK. Practically, I think that it was just too hairy to implement in
> the macro magic that iter processing is... Al might know more (added to
> CC).

Not really - it's more that VM has every right to refuse letting you pin
an arbitrary amount of pages anyway.
Martin Wilck July 19, 2018, 12:23 p.m. UTC | #7
On Thu, 2018-07-19 at 12:45 +0200, Jan Kara wrote:
> On Thu 19-07-18 11:39:18, Martin Wilck wrote:
> > bio_iov_iter_get_pages() returns only pages for a single non-empty
> > segment of the input iov_iter's iovec. This may be much less than
> > the number
> > of pages __blkdev_direct_IO_simple() is supposed to process. Call
> > bio_iov_iter_get_pages() repeatedly until either the requested
> > number
> > of bytes is reached, or bio.bi_io_vec is exhausted. If this is not
> > done,
> > short writes or reads may occur for direct synchronous IOs with
> > multiple
> > iovec slots (such as generated by writev()). In that case,
> > __generic_file_write_iter() falls back to buffered writes, which
> > has been observed to cause data corruption in certain workloads.
> > 
> > Note: if segments aren't page-aligned in the input iovec, this
> > patch may
> > result in multiple adjacent slots of the bi_io_vec array to
> > reference the same
> > page (the byte ranges are guaranteed to be disjunct if the
> > preceding patch is
> > applied). We haven't seen problems with that in our and the
> > customer's
> > tests. It'd be possible to detect this situation and merge
> > bi_io_vec slots
> > that refer to the same page, but I prefer to keep it simple for
> > now.
> > 
> > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for
> > simplified bdev direct-io")
> > Signed-off-by: Martin Wilck <mwilck@suse.com>
> > ---
> >  fs/block_dev.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > index 0dd87aa..41643c4 100644
> > --- a/fs/block_dev.c
> > +++ b/fs/block_dev.c
> > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb,
> > struct iov_iter *iter,
> >  
> >  	ret = bio_iov_iter_get_pages(&bio, iter);
> >  	if (unlikely(ret))
> > -		return ret;
> > +		goto out;
> > +
> > +	while (ret == 0 &&
> > +	       bio.bi_vcnt < bio.bi_max_vecs &&
> > iov_iter_count(iter) > 0)
> > +		ret = bio_iov_iter_get_pages(&bio, iter);
> > +
> 
> I have two suggestions here (posting them now in public):
> 
> Condition bio.bi_vcnt < bio.bi_max_vecs should always be true - we
> made
> sure we have enough vecs for pages in iter. So I'd WARN if this isn't
> true.

Yeah. I wanted to add that to the patch. Slipped through, somehow.
Sorry about that.

> Secondly, I don't think it is good to discard error from
> bio_iov_iter_get_pages() here and just submit partial IO. It will
> again
> lead to part of IO being done as direct and part attempted to be done
> as
> buffered. Also the "slow" direct IO path in __blkdev_direct_IO()
> behaves
> differently - it aborts and returns error if bio_iov_iter_get_pages()
> ever
> returned error. IMO we should do the same here.

Well, it aborts the loop, but then (in the sync case) it still waits
for the already submitted IOs to finish. Here, too, I'd find it more
logical to return the number of successfully transmitted bytes rather
than an error code. In the async case, the submitted bios are left in
place, and will probably sooner or later finish, changing iocb->ki_pos.

I'm actually not quite certain whether that's correct. In the sync case, it
causes the already-performed IO to be done again, buffered. In the
async case, it may even cause two IOs for the same range to be in
flight at the same time ... ?

Martin
Christoph Hellwig July 19, 2018, 2:53 p.m. UTC | #8
On Thu, Jul 19, 2018 at 12:08:41PM +0100, Al Viro wrote:
> > Well, there has never been a promise that it will grab *all* pages in the
> > iter AFAIK. Practically, I think that it was just too hairy to implement in
> > the macro magic that iter processing is... Al might know more (added to
> > CC).
> 
> Not really - it's more that VM has every right to refuse letting you pin
> an arbitrary amount of pages anyway.

In which case the code after this patch isn't going to help either, because
it still tries to pin it all, just in multiple calls to get_user_pages().
Jan Kara July 19, 2018, 3:06 p.m. UTC | #9
On Thu 19-07-18 16:53:51, Christoph Hellwig wrote:
> On Thu, Jul 19, 2018 at 12:08:41PM +0100, Al Viro wrote:
> > > Well, there has never been a promise that it will grab *all* pages in the
> > > iter AFAIK. Practically, I think that it was just too hairy to implement in
> > > the macro magic that iter processing is... Al might know more (added to
> > > CC).
> > 
> > Not really - it's more that VM has every right to refuse letting you pin
> > an arbitrary amount of pages anyway.
> 
> In which case the code after this patch isn't going to help either, because
> it still tries to pin it all, just in multiple calls to get_user_pages().

Yeah. Actually, a previous version of the fix (not posted publicly) submitted
a partial bio and then reused the bio to submit more. This is also the way
__blkdev_direct_IO operates. Martin optimized this to fill the bio
completely (as we know we have enough bvecs) before submitting, which has a
better chance of performing well. I'm fine with either approach; we just
have to decide which way to go.

								Honza
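
For comparison, the submit-and-reuse shape described above would look
roughly like this (an assumed sketch, not the actual unposted patch;
synchronous write case only, with page release and error handling trimmed):

	do {
		unsigned int nbytes;

		ret = bio_iov_iter_get_pages(&bio, iter);
		if (ret)
			break;
		nbytes = bio.bi_iter.bi_size;

		bio.bi_iter.bi_sector = pos >> 9;
		ret = submit_bio_wait(&bio);	/* wait for this chunk */
		if (ret)
			break;

		/* ... put_page() the pages of this chunk here ... */
		pos += nbytes;
		bio_reset(&bio);		/* reuse the on-stack bio */
		bio_set_dev(&bio, bdev);
		bio.bi_opf = REQ_OP_WRITE | REQ_SYNC;
	} while (iov_iter_count(iter) > 0);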
Christoph Hellwig July 19, 2018, 3:11 p.m. UTC | #10
On Thu, Jul 19, 2018 at 05:06:42PM +0200, Jan Kara wrote:
> Yeah. Actually previous version of the fix (not posted publicly) submitted
> partial bio and then reused the bio to submit more. This is also the way
> __blkdev_direct_IO operates. Martin optimized this to fill the bio
> completely (as we know we have enough bvecs) before submitting which has
> chances to perform better. I'm fine with either approach, just we have to
> decide which way to go.

I think this first version is going to be less fragile, so I think we should
aim for that.
Jan Kara July 19, 2018, 3:15 p.m. UTC | #11
On Thu 19-07-18 14:23:53, Martin Wilck wrote:
> On Thu, 2018-07-19 at 12:45 +0200, Jan Kara wrote:
> > Secondly, I don't think it is good to discard error from
> > bio_iov_iter_get_pages() here and just submit partial IO. It will
> > again
> > lead to part of IO being done as direct and part attempted to be done
> > as
> > buffered. Also the "slow" direct IO path in __blkdev_direct_IO()
> > behaves
> > differently - it aborts and returns error if bio_iov_iter_get_pages()
> > ever
> > returned error. IMO we should do the same here.
> 
> Well, it aborts the loop, but then (in the sync case) it still waits
> for the already submitted IOs to finish. Here, too, I'd find it more
> logical to return the number of successfully transmitted bytes rather
> than an error code. In the async case, the submitted bios are left in
> place, and will probably sooner or later finish, changing iocb->ki_pos.

Well, both of these behaviors make sense; it's just that traditionally (as
defined by our implementation) DIO returns an error even if part of the IO
has actually been successfully submitted. Making a userspace-visible change
like you suggest thus has to be analyzed very carefully, and frankly I don't
think it's worth the bother.

> I'm actually not quite certain if that's correct. In the sync case, it
> causes the already-performed IO to be done again, buffered. In the
> async case, it it may even cause two IOs for the same range to be in
> flight at the same time ... ?

It doesn't cause IO to be done again. Look at __generic_file_write_iter():
if generic_file_direct_write() returns an error, we immediately return that
error as well, without retrying buffered IO. We only retry buffered IO for a
partial (or zero) return value.

								Honza
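
The logic referred to here looks roughly like this (paraphrased from
__generic_file_write_iter(), not a verbatim quote of the kernel source):

	written = generic_file_direct_write(iocb, from);
	/* An error, or a fully completed direct write, is returned as-is. */
	if (written < 0 || !iov_iter_count(from))
		goto out;

	/* Only a partial (or zero-byte) direct write falls back to buffered
	 * writes for the remaining bytes. */
	status = generic_perform_write(file, from, iocb->ki_pos);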
Martin Wilck July 19, 2018, 7:21 p.m. UTC | #12
On Thu, 2018-07-19 at 17:11 +0200, Christoph Hellwig wrote:
> On Thu, Jul 19, 2018 at 05:06:42PM +0200, Jan Kara wrote:
> > Yeah. Actually previous version of the fix (not posted publicly)
> > submitted
> > partial bio and then reused the bio to submit more. This is also
> > the way
> > __blkdev_direct_IO operates. Martin optimized this to fill the
> > bio
> > completely (as we know we have enough bvecs) before submitting
> > which has
> > chances to perform better. I'm fine with either approach, just we
> > have to
> > decide which way to go.
> 
> I think this first version is going to be less fragile, so I we
> should
> aim for that.

Wait a second. We changed the approach for a reason. By submitting
partial bios synchronously and sequentially, we lose a lot of
performance, so much that this "fast path" quickly falls behind the
regular __blkdev_direct_IO() path as the number of input IO segments
increases. The patch I submitted avoids that. The whole point of this
fast path is to submit a single bio and return as quickly as possible.

It's not an error condition if bio_iov_iter_get_pages() returns less
data than it could. The function just takes one iteration step at a
time, so iterating over it until the desired number of pages is
obtained, and collecting the resulting pages in a single bio, isn't
"fragile"; it's just the natural thing to do.

Martin
Martin Wilck July 19, 2018, 7:34 p.m. UTC | #13
On Thu, 2018-07-19 at 16:53 +0200, Christoph Hellwig wrote:
> On Thu, Jul 19, 2018 at 12:08:41PM +0100, Al Viro wrote:
> > > Well, there has never been a promise that it will grab *all*
> > > pages in the
> > > iter AFAIK. Practically, I think that it was just too hairy to
> > > implement in
> > > the macro magic that iter processing is... Al might know more
> > > (added to
> > > CC).
> > 
> > Not really - it's more that VM has every right to refuse letting
> > you pin
> > an arbitrary amount of pages anyway.
> 
> In which case the code after this patch isn't going to help either,
> because
> it still tries to pin it all, just in multiple calls to
> get_user_pages().

That was not the point of the patch. It's not about a situation in
which MM refuses to pin more pages.

The patch is about the "iterator::next()" nature of
bio_iov_iter_get_pages().

If it can't pin the pages, bio_iov_iter_get_pages() returns an error
code (elsewhere in this thread we discussed how to handle that
correctly). Otherwise, it always adds _some_ data to the bio, but the
amount added depends on the segment structure of the input iov_iter. If
the input iovec has just a single segment, it fills the bio in a single
call. With multiple segments, it only returns the page(s) of the first
segment. The point of my patch is to remove that difference between
single-segment and multi-segment IOs.

Regards
Martin
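
For reference, the per-call behaviour described above boils down to roughly
the following (a simplified paraphrase of the pre-fix helper, reconstructed
from memory rather than quoted verbatim):

	/* bio_iov_iter_get_pages(), simplified: one call to
	 * iov_iter_get_pages(), which stops at the end of the current iovec
	 * segment, then advance the iter by what was pinned.  One call thus
	 * adds at most one segment's worth of pages to the bio. */
	struct page **pages = (struct page **)(bio->bi_io_vec + bio->bi_vcnt);
	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
	size_t offset;
	ssize_t size;

	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
	if (unlikely(size <= 0))
		return size ? size : -EFAULT;

	/* ... convert pages[], size and offset into bi_io_vec entries ... */

	iov_iter_advance(iter, size);
	return 0;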
Martin Wilck July 19, 2018, 8:01 p.m. UTC | #14
On Thu, 2018-07-19 at 17:15 +0200, Jan Kara wrote:
> On Thu 19-07-18 14:23:53, Martin Wilck wrote:
> > On Thu, 2018-07-19 at 12:45 +0200, Jan Kara wrote:
> > > Secondly, I don't think it is good to discard error from
> > > bio_iov_iter_get_pages() here and just submit partial IO. It will
> > > again
> > > lead to part of IO being done as direct and part attempted to be
> > > done
> > > as
> > > buffered. Also the "slow" direct IO path in __blkdev_direct_IO()
> > > behaves
> > > differently - it aborts and returns error if
> > > bio_iov_iter_get_pages()
> > > ever
> > > returned error. IMO we should do the same here.
> > 
> > Well, it aborts the loop, but then (in the sync case) it still
> > waits
> > for the already submitted IOs to finish. Here, too, I'd find it
> > more
> > logical to return the number of successfully transmitted bytes
> > rather
> > than an error code. In the async case, the submitted bios are left
> > in
> > place, and will probably sooner or later finish, changing iocb-
> > >ki_pos.
> 
> Well, both these behaviors make sense, just traditionally (defined by
> our
> implementation) DIO returns error even if part of IO has actually
> been
> successfully submitted. Making a userspace visible change like you
> suggest
> thus has to be very carefully analyzed and frankly I don't think it's
> worth
> the bother.

OK, maybe not.
I'll rework the treatment of the error case.

Martin
Patch

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 0dd87aa..41643c4 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -221,7 +221,12 @@  __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 
 	ret = bio_iov_iter_get_pages(&bio, iter);
 	if (unlikely(ret))
-		return ret;
+		goto out;
+
+	while (ret == 0 &&
+	       bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0)
+		ret = bio_iov_iter_get_pages(&bio, iter);
+
 	ret = bio.bi_iter.bi_size;
 
 	if (iov_iter_rw(iter) == READ) {
@@ -250,6 +255,7 @@  __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 		put_page(bvec->bv_page);
 	}
 
+out:
 	if (vecs != inline_vecs)
 		kfree(vecs);