Message ID | 20180719210158.25923-4-mwilck@suse.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Thu, Jul 19, 2018 at 11:01:58PM +0200, Martin Wilck wrote: > When bio_iov_iter_get_pages() is called from __blkdev_direct_IO_simple(), > we already know that the content of the input iov_iter fits into a single > bio, so we expect iov_iter_count(iter) to drop to 0. But in a single invocation, > bio_iov_iter_get_pages() may add less bytes then we expect. For iov_iters with > multiple segments (generated e.g. by writev()), it behaves like an iterator's > next() function, taking only one step (segment) at a time. Furthermore, MM may > fail or refuse to pin all requested pages. The latter may signify an error condition > (in which case eventually an error code will be returned), the former does not. > > Call bio_iov_iter_get_pages() repeatedly to avoid short reads or writes. > Otherwise, __generic_file_write_iter() falls back to buffered writes, which > has been observed to cause data corruption in certain workloads. > > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io") > Signed-off-by: Martin Wilck <mwilck@suse.com> > --- > fs/block_dev.c | 18 ++++++++++++++++++ > 1 file changed, 18 insertions(+) > > diff --git a/fs/block_dev.c b/fs/block_dev.c > index aba2541..561c34e 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -222,6 +222,24 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter, > ret = bio_iov_iter_get_pages(&bio, iter); > if (unlikely(ret)) > goto out; > + > + /* > + * bio_iov_iter_get_pages() may add less bytes than we expect: > + * - for multi-segment iov_iters, as it only adds one segment at a time > + * - if MM refuses or fails to pin all requested pages. In this case, > + * an error is returned eventually if no progress can be made. > + */ > + while (iov_iter_count(iter) > 0 && bio.bi_vcnt < bio.bi_max_vecs) { Please use the helper bio_full(). > + ret = bio_iov_iter_get_pages(&bio, iter); > + if (unlikely(ret)) > + goto out; > + } When ret returns, pages pinned already need to be put. Also I think the way suggested in the following link may be more generic: https://marc.info/?l=linux-block&m=153199830130209&w=2 in which it is retried until no progress is made, and there should have performance benefit for other users too, not only fix this partial dio issue. Thanks, Ming
On Fri, 2018-07-20 at 10:41 +0800, Ming Lei wrote: > On Thu, Jul 19, 2018 at 11:01:58PM +0200, Martin Wilck wrote: > > When bio_iov_iter_get_pages() is called > > from __blkdev_direct_IO_simple(), > > we already know that the content of the input iov_iter fits into a > > single > > bio, so we expect iov_iter_count(iter) to drop to 0. But in a > > single invocation, > > bio_iov_iter_get_pages() may add less bytes then we expect. For > > iov_iters with > > multiple segments (generated e.g. by writev()), it behaves like an > > iterator's > > next() function, taking only one step (segment) at a time. > > Furthermore, MM may > > fail or refuse to pin all requested pages. The latter may signify > > an error condition > > (in which case eventually an error code will be returned), the > > former does not. > > > > Call bio_iov_iter_get_pages() repeatedly to avoid short reads or > > writes. > > Otherwise, __generic_file_write_iter() falls back to buffered > > writes, which > > has been observed to cause data corruption in certain workloads. > > > > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for > > simplified bdev direct-io") > > Signed-off-by: Martin Wilck <mwilck@suse.com> > > --- > > fs/block_dev.c | 18 ++++++++++++++++++ > > 1 file changed, 18 insertions(+) > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c > > index aba2541..561c34e 100644 > > --- a/fs/block_dev.c > > +++ b/fs/block_dev.c > > @@ -222,6 +222,24 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, > > struct iov_iter *iter, > > ret = bio_iov_iter_get_pages(&bio, iter); > > if (unlikely(ret)) > > goto out; > > + > > + /* > > + * bio_iov_iter_get_pages() may add less bytes than we > > expect: > > + * - for multi-segment iov_iters, as it only adds one > > segment at a time > > + * - if MM refuses or fails to pin all requested pages. In > > this case, > > + * an error is returned eventually if no progress can be > > made. > > + */ > > + while (iov_iter_count(iter) > 0 && bio.bi_vcnt < > > bio.bi_max_vecs) { > > Please use the helper bio_full(). > > > + ret = bio_iov_iter_get_pages(&bio, iter); > > + if (unlikely(ret)) > > + goto out; > > + } > > When ret returns, pages pinned already need to be put. Oops, sorry for missing that. > Also I think the way suggested in the following link may be more > generic: > > https://marc.info/?l=linux-block&m=153199830130209&w=2 > > in which it is retried until no progress is made, and there should > have > performance benefit for other users too, not only fix this partial > dio > issue. As there was no unequivocal support for that proposal, I propose to add a new helper without renaming the current one. Callers could then be migrated one-by-one. I will do so in v4. Thanks, Martin
diff --git a/fs/block_dev.c b/fs/block_dev.c index aba2541..561c34e 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -222,6 +222,24 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter, ret = bio_iov_iter_get_pages(&bio, iter); if (unlikely(ret)) goto out; + + /* + * bio_iov_iter_get_pages() may add less bytes than we expect: + * - for multi-segment iov_iters, as it only adds one segment at a time + * - if MM refuses or fails to pin all requested pages. In this case, + * an error is returned eventually if no progress can be made. + */ + while (iov_iter_count(iter) > 0 && bio.bi_vcnt < bio.bi_max_vecs) { + ret = bio_iov_iter_get_pages(&bio, iter); + if (unlikely(ret)) + goto out; + } + /* + * Our bi_io_vec should be big enough to hold all data from the + * iov_iter, as this has been checked before calling this function. + */ + WARN_ON_ONCE(iov_iter_count(iter) > 0); + ret = bio.bi_iter.bi_size; if (iov_iter_rw(iter) == READ) {
When bio_iov_iter_get_pages() is called from __blkdev_direct_IO_simple(), we already know that the content of the input iov_iter fits into a single bio, so we expect iov_iter_count(iter) to drop to 0. But in a single invocation, bio_iov_iter_get_pages() may add less bytes then we expect. For iov_iters with multiple segments (generated e.g. by writev()), it behaves like an iterator's next() function, taking only one step (segment) at a time. Furthermore, MM may fail or refuse to pin all requested pages. The latter may signify an error condition (in which case eventually an error code will be returned), the former does not. Call bio_iov_iter_get_pages() repeatedly to avoid short reads or writes. Otherwise, __generic_file_write_iter() falls back to buffered writes, which has been observed to cause data corruption in certain workloads. Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io") Signed-off-by: Martin Wilck <mwilck@suse.com> --- fs/block_dev.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+)