| Field | Value |
|---|---|
| Message ID | 20180719093918.28876-3-mwilck@suse.com (mailing list archive) |
| State | New, archived |
On 07/19/2018 11:39 AM, Martin Wilck wrote:
> bio_iov_iter_get_pages() returns only pages for a single non-empty
> segment of the input iov_iter's iovec. This may be much less than the number
> of pages __blkdev_direct_IO_simple() is supposed to process. Call
> bio_iov_iter_get_pages() repeatedly until either the requested number
> of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done,
> short writes or reads may occur for direct synchronous IOs with multiple
> iovec slots (such as generated by writev()). In that case,
> __generic_file_write_iter() falls back to buffered writes, which
> has been observed to cause data corruption in certain workloads.
>
> Note: if segments aren't page-aligned in the input iovec, this patch may
> result in multiple adjacent slots of the bi_io_vec array referencing the same
> page (the byte ranges are guaranteed to be disjunct if the preceding patch is
> applied). We haven't seen problems with that in our and the customer's
> tests. It'd be possible to detect this situation and merge bi_io_vec slots
> that refer to the same page, but I prefer to keep it simple for now.
>
> Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io")
> Signed-off-by: Martin Wilck <mwilck@suse.com>
> ---
>  fs/block_dev.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,
Hannes

On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> bio_iov_iter_get_pages() returns only pages for a single non-empty
> segment of the input iov_iter's iovec. This may be much less than the number
> of pages __blkdev_direct_IO_simple() is supposed to process. Call

In bio_iov_iter_get_pages(), iov_iter_get_pages() is supposed to retrieve
as many pages as possible, since both 'maxsize' and 'maxpages' are provided
to cover them all.

So the question is why iov_iter_get_pages() doesn't work as expected?

Thanks,
Ming

On Thu 19-07-18 18:21:23, Ming Lei wrote:
> On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> > bio_iov_iter_get_pages() returns only pages for a single non-empty
> > segment of the input iov_iter's iovec. This may be much less than the number
> > of pages __blkdev_direct_IO_simple() is supposed to process. Call
>
> In bio_iov_iter_get_pages(), iov_iter_get_pages() is supposed to retrieve
> as many pages as possible, since both 'maxsize' and 'maxpages' are provided
> to cover them all.
>
> So the question is why iov_iter_get_pages() doesn't work as expected?

Well, there has never been a promise that it will grab *all* pages in the
iter AFAIK. Practically, I think that it was just too hairy to implement in
the macro magic that iter processing is... Al might know more (added to CC).

								Honza

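For reference, the interface being discussed — a hedged paraphrase of its contract around this time, not a quote from the kernel source:

```c
/*
 * Hedged paraphrase (not the kernel source itself): iov_iter_get_pages()
 * pins *up to* maxsize bytes / maxpages pages and returns the number of
 * bytes it actually covered.  For an iovec-backed iter a single call does
 * not cross a segment boundary, so the caller can legitimately get back
 * far less than it asked for.
 */
ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
			   size_t maxsize, unsigned maxpages, size_t *start);
```
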
On Thu 19-07-18 11:39:18, Martin Wilck wrote:
> bio_iov_iter_get_pages() returns only pages for a single non-empty
> segment of the input iov_iter's iovec. This may be much less than the number
> of pages __blkdev_direct_IO_simple() is supposed to process. Call
> bio_iov_iter_get_pages() repeatedly until either the requested number
> of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done,
> short writes or reads may occur for direct synchronous IOs with multiple
> iovec slots (such as generated by writev()). In that case,
> __generic_file_write_iter() falls back to buffered writes, which
> has been observed to cause data corruption in certain workloads.
>
> Note: if segments aren't page-aligned in the input iovec, this patch may
> result in multiple adjacent slots of the bi_io_vec array referencing the same
> page (the byte ranges are guaranteed to be disjunct if the preceding patch is
> applied). We haven't seen problems with that in our and the customer's
> tests. It'd be possible to detect this situation and merge bi_io_vec slots
> that refer to the same page, but I prefer to keep it simple for now.
>
> Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io")
> Signed-off-by: Martin Wilck <mwilck@suse.com>
> ---
>  fs/block_dev.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 0dd87aa..41643c4 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
>
>  	ret = bio_iov_iter_get_pages(&bio, iter);
>  	if (unlikely(ret))
> -		return ret;
> +		goto out;
> +
> +	while (ret == 0 &&
> +	       bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0)
> +		ret = bio_iov_iter_get_pages(&bio, iter);
> +

I have two suggestions here (posting them now in public):

Condition bio.bi_vcnt < bio.bi_max_vecs should always be true - we made
sure we have enough vecs for pages in iter. So I'd WARN if this isn't
true.

Secondly, I don't think it is good to discard error from
bio_iov_iter_get_pages() here and just submit partial IO. It will again
lead to part of IO being done as direct and part attempted to be done as
buffered. Also the "slow" direct IO path in __blkdev_direct_IO() behaves
differently - it aborts and returns error if bio_iov_iter_get_pages() ever
returned error. IMO we should do the same here.

								Honza

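A hypothetical sketch of the loop with both of Jan's suggestions folded in — illustration only, not code posted in the thread; cleanup of pages already pinned on the error path is omitted:

```c
	/*
	 * Hypothetical variant of the posted loop: warn if we unexpectedly
	 * run out of bvecs (suggestion 1), and propagate an error instead
	 * of submitting a partial bio (suggestion 2).
	 */
	do {
		ret = bio_iov_iter_get_pages(&bio, iter);
		if (unlikely(ret))
			goto out;	/* release of pinned pages omitted */
	} while (iov_iter_count(iter) &&
		 !WARN_ON_ONCE(bio.bi_vcnt >= bio.bi_max_vecs));
```
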
On Thu, Jul 19, 2018 at 12:37:13PM +0200, Jan Kara wrote:
> On Thu 19-07-18 18:21:23, Ming Lei wrote:
> > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> > > bio_iov_iter_get_pages() returns only pages for a single non-empty
> > > segment of the input iov_iter's iovec. This may be much less than the number
> > > of pages __blkdev_direct_IO_simple() is supposed to process. Call
> >
> > In bio_iov_iter_get_pages(), iov_iter_get_pages() is supposed to retrieve
> > as many pages as possible, since both 'maxsize' and 'maxpages' are provided
> > to cover them all.
> >
> > So the question is why iov_iter_get_pages() doesn't work as expected?
>
> Well, there has never been a promise that it will grab *all* pages in the
> iter AFAIK. Practically, I think that it was just too hairy to implement in
> the macro magic that iter processing is... Al might know more (added to CC).

OK, I got it: given that get_user_pages_fast() may fail, it is reasonable
to see fewer pages retrieved.

Thanks,
Ming

On Thu, Jul 19, 2018 at 12:37:13PM +0200, Jan Kara wrote:
> On Thu 19-07-18 18:21:23, Ming Lei wrote:
> > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> > > bio_iov_iter_get_pages() returns only pages for a single non-empty
> > > segment of the input iov_iter's iovec. This may be much less than the number
> > > of pages __blkdev_direct_IO_simple() is supposed to process. Call
> >
> > In bio_iov_iter_get_pages(), iov_iter_get_pages() is supposed to retrieve
> > as many pages as possible, since both 'maxsize' and 'maxpages' are provided
> > to cover them all.

Huh? Why does having a way to say "I (the caller) don't want more than
<amount>" imply anything of that sort? It never had been promised,
explicitly or not...

> > So the question is why iov_iter_get_pages() doesn't work as expected?
>
> Well, there has never been a promise that it will grab *all* pages in the
> iter AFAIK. Practically, I think that it was just too hairy to implement in
> the macro magic that iter processing is... Al might know more (added to CC).

Not really - it's more that VM has every right to refuse letting you pin
an arbitrary amount of pages anyway.

On Thu, 2018-07-19 at 12:45 +0200, Jan Kara wrote:
> On Thu 19-07-18 11:39:18, Martin Wilck wrote:
> > bio_iov_iter_get_pages() returns only pages for a single non-empty
> > segment of the input iov_iter's iovec. This may be much less than the number
> > of pages __blkdev_direct_IO_simple() is supposed to process. Call
> > bio_iov_iter_get_pages() repeatedly until either the requested number
> > of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done,
> > short writes or reads may occur for direct synchronous IOs with multiple
> > iovec slots (such as generated by writev()). In that case,
> > __generic_file_write_iter() falls back to buffered writes, which
> > has been observed to cause data corruption in certain workloads.
> >
> > Note: if segments aren't page-aligned in the input iovec, this patch may
> > result in multiple adjacent slots of the bi_io_vec array referencing the same
> > page (the byte ranges are guaranteed to be disjunct if the preceding patch is
> > applied). We haven't seen problems with that in our and the customer's
> > tests. It'd be possible to detect this situation and merge bi_io_vec slots
> > that refer to the same page, but I prefer to keep it simple for now.
> >
> > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io")
> > Signed-off-by: Martin Wilck <mwilck@suse.com>
> > ---
> >  fs/block_dev.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > index 0dd87aa..41643c4 100644
> > --- a/fs/block_dev.c
> > +++ b/fs/block_dev.c
> > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
> >
> >  	ret = bio_iov_iter_get_pages(&bio, iter);
> >  	if (unlikely(ret))
> > -		return ret;
> > +		goto out;
> > +
> > +	while (ret == 0 &&
> > +	       bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0)
> > +		ret = bio_iov_iter_get_pages(&bio, iter);
> > +
>
> I have two suggestions here (posting them now in public):
>
> Condition bio.bi_vcnt < bio.bi_max_vecs should always be true - we made
> sure we have enough vecs for pages in iter. So I'd WARN if this isn't
> true.

Yeah. I wanted to add that to the patch. Slipped through, somehow. Sorry
about that.

> Secondly, I don't think it is good to discard error from
> bio_iov_iter_get_pages() here and just submit partial IO. It will again
> lead to part of IO being done as direct and part attempted to be done as
> buffered. Also the "slow" direct IO path in __blkdev_direct_IO() behaves
> differently - it aborts and returns error if bio_iov_iter_get_pages() ever
> returned error. IMO we should do the same here.

Well, it aborts the loop, but then (in the sync case) it still waits
for the already submitted IOs to finish. Here, too, I'd find it more
logical to return the number of successfully transmitted bytes rather
than an error code. In the async case, the submitted bios are left in
place, and will probably sooner or later finish, changing iocb->ki_pos.

I'm actually not quite certain if that's correct. In the sync case, it
causes the already-performed IO to be done again, buffered. In the
async case, it may even cause two IOs for the same range to be in
flight at the same time ... ?

Martin

On Thu, Jul 19, 2018 at 12:08:41PM +0100, Al Viro wrote:
> > Well, there has never been a promise that it will grab *all* pages in the
> > iter AFAIK. Practically, I think that it was just too hairy to implement in
> > the macro magic that iter processing is... Al might know more (added to CC).
>
> Not really - it's more that VM has every right to refuse letting you pin
> an arbitrary amount of pages anyway.

In which case the code after this patch isn't going to help either, because
it still tries to pin it all, just in multiple calls to get_user_pages().

On Thu 19-07-18 16:53:51, Christoph Hellwig wrote:
> On Thu, Jul 19, 2018 at 12:08:41PM +0100, Al Viro wrote:
> > > Well, there has never been a promise that it will grab *all* pages in the
> > > iter AFAIK. Practically, I think that it was just too hairy to implement in
> > > the macro magic that iter processing is... Al might know more (added to CC).
> >
> > Not really - it's more that VM has every right to refuse letting you pin
> > an arbitrary amount of pages anyway.
>
> In which case the code after this patch isn't going to help either, because
> it still tries to pin it all, just in multiple calls to get_user_pages().

Yeah. Actually, a previous version of the fix (not posted publicly) submitted
a partial bio and then reused the bio to submit more. This is also the way
__blkdev_direct_IO operates. Martin optimized this to fill the bio completely
(as we know we have enough bvecs) before submitting, which has a chance to
perform better. I'm fine with either approach, we just have to decide which
way to go.

								Honza

On Thu, Jul 19, 2018 at 05:06:42PM +0200, Jan Kara wrote:
> Yeah. Actually, a previous version of the fix (not posted publicly) submitted
> a partial bio and then reused the bio to submit more. This is also the way
> __blkdev_direct_IO operates. Martin optimized this to fill the bio completely
> (as we know we have enough bvecs) before submitting, which has a chance to
> perform better. I'm fine with either approach, we just have to decide which
> way to go.

I think this first version is going to be less fragile, so I think we should
aim for that.

On Thu 19-07-18 14:23:53, Martin Wilck wrote:
> On Thu, 2018-07-19 at 12:45 +0200, Jan Kara wrote:
> > Secondly, I don't think it is good to discard error from
> > bio_iov_iter_get_pages() here and just submit partial IO. It will again
> > lead to part of IO being done as direct and part attempted to be done as
> > buffered. Also the "slow" direct IO path in __blkdev_direct_IO() behaves
> > differently - it aborts and returns error if bio_iov_iter_get_pages() ever
> > returned error. IMO we should do the same here.
>
> Well, it aborts the loop, but then (in the sync case) it still waits
> for the already submitted IOs to finish. Here, too, I'd find it more
> logical to return the number of successfully transmitted bytes rather
> than an error code. In the async case, the submitted bios are left in
> place, and will probably sooner or later finish, changing iocb->ki_pos.

Well, both these behaviors make sense; it's just that traditionally (as
defined by our implementation) DIO returns an error even if part of the IO
has actually been successfully submitted. Making a userspace-visible change
like you suggest thus has to be very carefully analyzed, and frankly I don't
think it's worth the bother.

> I'm actually not quite certain if that's correct. In the sync case, it
> causes the already-performed IO to be done again, buffered. In the
> async case, it may even cause two IOs for the same range to be in
> flight at the same time ... ?

It doesn't cause IO to be done again. Look at __generic_file_write_iter().
If generic_file_direct_write() returned an error, we immediately return that
error as well, without retrying buffered IO. We only retry buffered IO for a
partial (or 0) return value.

								Honza

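For context, a simplified sketch of the fallback behaviour Jan describes — paraphrased, not the verbatim kernel code, and the helper name is made up:

```c
#include <linux/fs.h>
#include <linux/uio.h>

/*
 * Simplified sketch of the O_DIRECT fallback in __generic_file_write_iter():
 * an error from the direct path is returned as-is; only a short (or zero)
 * direct write is retried through the page cache.
 */
static ssize_t direct_write_then_buffered(struct kiocb *iocb, struct iov_iter *from)
{
	ssize_t written = generic_file_direct_write(iocb, from);
	ssize_t status;

	if (written < 0)
		return written;		/* hard error: no buffered retry */
	if (!iov_iter_count(from))
		return written;		/* everything went out directly */

	/* partial direct write: finish the remainder buffered */
	status = generic_perform_write(iocb->ki_filp, from, iocb->ki_pos);

	/* if the buffered part fails, report what the direct part achieved */
	return status < 0 ? written : written + status;
}
```
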
On Thu, 2018-07-19 at 17:11 +0200, Christoph Hellwig wrote:
> On Thu, Jul 19, 2018 at 05:06:42PM +0200, Jan Kara wrote:
> > Yeah. Actually, a previous version of the fix (not posted publicly)
> > submitted a partial bio and then reused the bio to submit more. This is
> > also the way __blkdev_direct_IO operates. Martin optimized this to fill
> > the bio completely (as we know we have enough bvecs) before submitting,
> > which has a chance to perform better. I'm fine with either approach, we
> > just have to decide which way to go.
>
> I think this first version is going to be less fragile, so I think we
> should aim for that.

Wait a second. We changed the approach for a reason. By submitting partial
bios synchronously and sequentially, we lose a lot of performance, so much
that this "fast path" quickly falls behind the regular __blkdev_direct_IO()
path as the number of input IO segments increases. The patch I submitted
avoids that.

The whole point of this fast path is to submit a single bio and return as
quickly as possible. It's not an error condition if bio_iov_iter_get_pages()
returns less data than possible. The function just takes one iteration step
at a time, and thus iterating over it until the desired number of pages is
obtained, and collecting the resulting pages in a single bio, isn't
"fragile"; it's just the natural thing to do.

Martin

On Thu, 2018-07-19 at 16:53 +0200, Christoph Hellwig wrote:
> On Thu, Jul 19, 2018 at 12:08:41PM +0100, Al Viro wrote:
> > > Well, there has never been a promise that it will grab *all* pages in
> > > the iter AFAIK. Practically, I think that it was just too hairy to
> > > implement in the macro magic that iter processing is... Al might know
> > > more (added to CC).
> >
> > Not really - it's more that VM has every right to refuse letting you pin
> > an arbitrary amount of pages anyway.
>
> In which case the code after this patch isn't going to help either, because
> it still tries to pin it all, just in multiple calls to get_user_pages().

That was not the point of the patch. It's not about a situation in which MM
refuses to pin more pages. The patch is about the "iterator::next()" nature
of bio_iov_iter_get_pages(). If it can't pin the pages,
bio_iov_iter_get_pages() returns an error code (elsewhere in this thread we
discussed how to treat that right). Otherwise, it always adds _some_ data to
the bio, but the amount added depends on the segment structure of the input
iov_iter. If the input iovec has just a single segment, it fills the bio in
a single call. With multiple segments, it just returns the page(s) of the
first segment. The point of my patch is to make no difference between
single-segment and multi-segment IOs.

Regards
Martin

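A hedged illustration of that per-segment behaviour; the buffers, sizes and surrounding bio are made up, and page-aligned user memory is assumed:

```c
	struct iovec iov[2] = {
		{ .iov_base = buf0, .iov_len = 4096 },	/* hypothetical buffers */
		{ .iov_base = buf1, .iov_len = 4096 },
	};
	struct iov_iter iter;

	iov_iter_init(&iter, WRITE, iov, 2, 8192);

	ret = bio_iov_iter_get_pages(&bio, &iter);
	/*
	 * With a single-segment iovec, one call would fill the bio.  Here,
	 * with two segments, one successful call typically leaves
	 * bio.bi_iter.bi_size == 4096 and iov_iter_count(&iter) == 4096:
	 * only the first segment was picked up -- hence the loop in the patch.
	 */
```
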
On Thu, 2018-07-19 at 17:15 +0200, Jan Kara wrote:
> On Thu 19-07-18 14:23:53, Martin Wilck wrote:
> > On Thu, 2018-07-19 at 12:45 +0200, Jan Kara wrote:
> > > Secondly, I don't think it is good to discard error from
> > > bio_iov_iter_get_pages() here and just submit partial IO. It will again
> > > lead to part of IO being done as direct and part attempted to be done as
> > > buffered. Also the "slow" direct IO path in __blkdev_direct_IO() behaves
> > > differently - it aborts and returns error if bio_iov_iter_get_pages()
> > > ever returned error. IMO we should do the same here.
> >
> > Well, it aborts the loop, but then (in the sync case) it still waits
> > for the already submitted IOs to finish. Here, too, I'd find it more
> > logical to return the number of successfully transmitted bytes rather
> > than an error code. In the async case, the submitted bios are left in
> > place, and will probably sooner or later finish, changing iocb->ki_pos.
>
> Well, both these behaviors make sense; it's just that traditionally (as
> defined by our implementation) DIO returns an error even if part of the IO
> has actually been successfully submitted. Making a userspace-visible change
> like you suggest thus has to be very carefully analyzed, and frankly I
> don't think it's worth the bother.

OK, maybe not. I'll rework the treatment of the error case.

Martin

```diff
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 0dd87aa..41643c4 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 
 	ret = bio_iov_iter_get_pages(&bio, iter);
 	if (unlikely(ret))
-		return ret;
+		goto out;
+
+	while (ret == 0 &&
+	       bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0)
+		ret = bio_iov_iter_get_pages(&bio, iter);
+
 	ret = bio.bi_iter.bi_size;
 
 	if (iov_iter_rw(iter) == READ) {
@@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 		put_page(bvec->bv_page);
 	}
 
+out:
 	if (vecs != inline_vecs)
 		kfree(vecs);
 
```

bio_iov_iter_get_pages() returns only pages for a single non-empty segment
of the input iov_iter's iovec. This may be much less than the number of
pages __blkdev_direct_IO_simple() is supposed to process. Call
bio_iov_iter_get_pages() repeatedly until either the requested number of
bytes is reached, or bio.bi_io_vec is exhausted. If this is not done, short
writes or reads may occur for direct synchronous IOs with multiple iovec
slots (such as generated by writev()). In that case,
__generic_file_write_iter() falls back to buffered writes, which has been
observed to cause data corruption in certain workloads.

Note: if segments aren't page-aligned in the input iovec, this patch may
result in multiple adjacent slots of the bi_io_vec array referencing the
same page (the byte ranges are guaranteed to be disjunct if the preceding
patch is applied). We haven't seen problems with that in our and the
customer's tests. It'd be possible to detect this situation and merge
bi_io_vec slots that refer to the same page, but I prefer to keep it simple
for now.

Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io")
Signed-off-by: Martin Wilck <mwilck@suse.com>
---
 fs/block_dev.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)
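To make the failure mode concrete, here is a hedged userspace sketch of the kind of request described above: a synchronous, two-segment writev() to an O_DIRECT block device. The device path, sizes and alignment are illustrative only, and running it would overwrite data on that device:

```c
#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	void *a, *b;
	struct iovec iov[2];
	ssize_t n;
	int fd;

	/* hypothetical device node; alignment must match its logical block size */
	fd = open("/dev/sdX", O_WRONLY | O_DIRECT);
	if (fd < 0)
		return 1;
	if (posix_memalign(&a, 4096, 4096) || posix_memalign(&b, 4096, 4096))
		return 1;

	iov[0].iov_base = a; iov[0].iov_len = 4096;
	iov[1].iov_base = b; iov[1].iov_len = 4096;

	/*
	 * Two iovec slots: without the fix, the "simple" direct-I/O path could
	 * pick up only the first segment and fall back to buffered I/O for the
	 * rest, which is the behaviour the patch addresses.
	 */
	n = writev(fd, iov, 2);

	close(fd);
	return n == 8192 ? 0 : 1;
}
```
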