Message ID | ab04202d0f8c1424da47251085657c436d762785.1605827965.git.asml.silence@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Series | optimise iov_iter |
On Thu, Nov 19, 2020 at 11:24:38PM +0000, Pavel Begunkov wrote:
> The block layer spends quite a while in iov_iter_npages(), but for the
> bvec case the number of pages is already known and stored in
> iter->nr_segs, so it can be returned immediately as an optimisation

Er ... no, it doesn't. nr_segs is the number of bvecs. Each bvec can
store up to 4GB of contiguous physical memory.
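
The distinction between segments and pages can be sketched in userspace C. This is a minimal model, not kernel code: `struct sketch_bvec`, `SKETCH_PAGE_SIZE` and both helper functions are hypothetical names introduced here, and the 4096-byte page size is an assumption.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size, for illustration only */

/* Hypothetical userspace stand-in for the kernel's struct bio_vec. */
struct sketch_bvec {
        unsigned int bv_len;    /* segment length in bytes */
        unsigned int bv_offset; /* byte offset of the data from the start of the first page */
};

/* Number of pages touched by a single segment. */
static unsigned long long sketch_bvec_pages(const struct sketch_bvec *bv)
{
        unsigned long long first, last;

        if (bv->bv_len == 0)
                return 0;
        first = bv->bv_offset / SKETCH_PAGE_SIZE;
        last = ((unsigned long long)bv->bv_offset + bv->bv_len - 1) / SKETCH_PAGE_SIZE;
        return last - first + 1;
}

/* Total pages across all segments; in general this is not nr_segs. */
static unsigned long long sketch_total_pages(const struct sketch_bvec *v, size_t nr_segs)
{
        unsigned long long total = 0;
        size_t i;

        for (i = 0; i < nr_segs; i++)
                total += sketch_bvec_pages(&v[i]);
        return total;
}
```

With one 2 MiB segment, the segment count is 1 while the page count under the assumed page size is 512, which is the gap the reply points at.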
On 20/11/2020 01:20, Matthew Wilcox wrote:
> On Thu, Nov 19, 2020 at 11:24:38PM +0000, Pavel Begunkov wrote:
>> The block layer spends quite a while in iov_iter_npages(), but for the
>> bvec case the number of pages is already known and stored in
>> iter->nr_segs, so it can be returned immediately as an optimisation
>
> Er ... no, it doesn't. nr_segs is the number of bvecs. Each bvec can
> store up to 4GB of contiguous physical memory.

Ah, really, missed min() with PAGE_SIZE in bvec_iter_len(), then it's a
stupid statement. Thanks!

Are there many users of that? All these iterators are a huge burden,
just to count one 4KB page in bvec it takes 2% of CPU time for me.
On Fri, Nov 20, 2020 at 01:39:05AM +0000, Pavel Begunkov wrote:
> On 20/11/2020 01:20, Matthew Wilcox wrote:
> > On Thu, Nov 19, 2020 at 11:24:38PM +0000, Pavel Begunkov wrote:
> >> The block layer spends quite a while in iov_iter_npages(), but for the
> >> bvec case the number of pages is already known and stored in
> >> iter->nr_segs, so it can be returned immediately as an optimisation
> >
> > Er ... no, it doesn't. nr_segs is the number of bvecs. Each bvec can
> > store up to 4GB of contiguous physical memory.
>
> Ah, really, missed min() with PAGE_SIZE in bvec_iter_len(), then it's a
> stupid statement. Thanks!
>
> Are there many users of that? All these iterators are a huge burden,
> just to count one 4KB page in bvec it takes 2% of CPU time for me.

__bio_try_merge_page() will create multipage BIOs, and that's
called from a number of places including
bio_try_merge_hw_seg(), bio_add_page(), and __bio_iov_iter_get_pages()

so ... yeah, it's used a lot.
On 20/11/2020 01:49, Matthew Wilcox wrote:
> On Fri, Nov 20, 2020 at 01:39:05AM +0000, Pavel Begunkov wrote:
>> On 20/11/2020 01:20, Matthew Wilcox wrote:
>>> On Thu, Nov 19, 2020 at 11:24:38PM +0000, Pavel Begunkov wrote:
>>>> The block layer spends quite a while in iov_iter_npages(), but for the
>>>> bvec case the number of pages is already known and stored in
>>>> iter->nr_segs, so it can be returned immediately as an optimisation
>>>
>>> Er ... no, it doesn't. nr_segs is the number of bvecs. Each bvec can
>>> store up to 4GB of contiguous physical memory.
>>
>> Ah, really, missed min() with PAGE_SIZE in bvec_iter_len(), then it's a
>> stupid statement. Thanks!
>>
>> Are there many users of that? All these iterators are a huge burden,
>> just to count one 4KB page in bvec it takes 2% of CPU time for me.
>
> __bio_try_merge_page() will create multipage BIOs, and that's
> called from a number of places including
> bio_try_merge_hw_seg(), bio_add_page(), and __bio_iov_iter_get_pages()

I get it that there are a lot of places, more interesting how often
it's actually triggered and if that's performance critical for anybody.
Not like I'm going to change it, just out of curiosity, but bvec.h
can be nicely optimised without it.

>
> so ... yeah, it's used a lot.
>
On Fri, Nov 20, 2020 at 01:56:22AM +0000, Pavel Begunkov wrote:
> On 20/11/2020 01:49, Matthew Wilcox wrote:
> > On Fri, Nov 20, 2020 at 01:39:05AM +0000, Pavel Begunkov wrote:
> >> On 20/11/2020 01:20, Matthew Wilcox wrote:
> >>> On Thu, Nov 19, 2020 at 11:24:38PM +0000, Pavel Begunkov wrote:
> >>>> The block layer spends quite a while in iov_iter_npages(), but for the
> >>>> bvec case the number of pages is already known and stored in
> >>>> iter->nr_segs, so it can be returned immediately as an optimisation
> >>>
> >>> Er ... no, it doesn't. nr_segs is the number of bvecs. Each bvec can
> >>> store up to 4GB of contiguous physical memory.
> >>
> >> Ah, really, missed min() with PAGE_SIZE in bvec_iter_len(), then it's a
> >> stupid statement. Thanks!
> >>
> >> Are there many users of that? All these iterators are a huge burden,
> >> just to count one 4KB page in bvec it takes 2% of CPU time for me.
> >
> > __bio_try_merge_page() will create multipage BIOs, and that's
> > called from a number of places including
> > bio_try_merge_hw_seg(), bio_add_page(), and __bio_iov_iter_get_pages()
>
> I get it that there are a lot of places, more interesting how often
> it's actually triggered and if that's performance critical for anybody.
> Not like I'm going to change it, just out of curiosity, but bvec.h
> can be nicely optimised without it.

Typically when you're allocating pages for the page cache, they'll get
allocated in order and then you'll read or write them in order, so yes,
it ends up triggering quite a lot. There was once a bug in the page
allocator which caused them to get allocated in reverse order and it
was a noticeable performance hit (this was 15-20 years ago).
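
Why allocation order matters for merging can be illustrated with a toy model. The sketch below assumes a page is merged into the previous segment only when its page frame number directly follows the previous page's; the real __bio_try_merge_page() checks more than that, and `sketch_count_segments` is a hypothetical name introduced here.

```c
#include <stddef.h>

/*
 * Count the segments needed for a sequence of pages, given their page
 * frame numbers in submission order. A page joins the previous segment
 * only if its frame number is exactly one more than the previous page's.
 * Simplified model for illustration; not the kernel's merge logic.
 */
static size_t sketch_count_segments(const unsigned long *pfn, size_t n)
{
        size_t segs = 0;
        size_t i;

        for (i = 0; i < n; i++) {
                if (i == 0 || pfn[i] != pfn[i - 1] + 1)
                        segs++;
        }
        return segs;
}
```

In this model, pages allocated in ascending order collapse into one segment, while the same pages in reverse order need one segment each.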
On 20/11/2020 02:06, Matthew Wilcox wrote:
> On Fri, Nov 20, 2020 at 01:56:22AM +0000, Pavel Begunkov wrote:
>> On 20/11/2020 01:49, Matthew Wilcox wrote:
>>> On Fri, Nov 20, 2020 at 01:39:05AM +0000, Pavel Begunkov wrote:
>>>> On 20/11/2020 01:20, Matthew Wilcox wrote:
>>>>> On Thu, Nov 19, 2020 at 11:24:38PM +0000, Pavel Begunkov wrote:
>>>>>> The block layer spends quite a while in iov_iter_npages(), but for the
>>>>>> bvec case the number of pages is already known and stored in
>>>>>> iter->nr_segs, so it can be returned immediately as an optimisation
>>>>>
>>>>> Er ... no, it doesn't. nr_segs is the number of bvecs. Each bvec can
>>>>> store up to 4GB of contiguous physical memory.
>>>>
>>>> Ah, really, missed min() with PAGE_SIZE in bvec_iter_len(), then it's a
>>>> stupid statement. Thanks!
>>>>
>>>> Are there many users of that? All these iterators are a huge burden,
>>>> just to count one 4KB page in bvec it takes 2% of CPU time for me.
>>>
>>> __bio_try_merge_page() will create multipage BIOs, and that's
>>> called from a number of places including
>>> bio_try_merge_hw_seg(), bio_add_page(), and __bio_iov_iter_get_pages()
>>
>> I get it that there are a lot of places, more interesting how often
>> it's actually triggered and if that's performance critical for anybody.
>> Not like I'm going to change it, just out of curiosity, but bvec.h
>> can be nicely optimised without it.
>
> Typically when you're allocating pages for the page cache, they'll get
> allocated in order and then you'll read or write them in order, so yes,
> it ends up triggering quite a lot. There was once a bug in the page
> allocator which caused them to get allocated in reverse order and it
> was a noticeable performance hit (this was 15-20 years ago).

I see, thanks for a bit of insight
On Fri, Nov 20, 2020 at 01:39:05AM +0000, Pavel Begunkov wrote:
> On 20/11/2020 01:20, Matthew Wilcox wrote:
> > On Thu, Nov 19, 2020 at 11:24:38PM +0000, Pavel Begunkov wrote:
> >> The block layer spends quite a while in iov_iter_npages(), but for the
> >> bvec case the number of pages is already known and stored in
> >> iter->nr_segs, so it can be returned immediately as an optimisation
> >
> > Er ... no, it doesn't. nr_segs is the number of bvecs. Each bvec can
> > store up to 4GB of contiguous physical memory.
>
> Ah, really, missed min() with PAGE_SIZE in bvec_iter_len(), then it's a
> stupid statement. Thanks!
>

iov_iter_npages(bvec) still can be improved a bit by the following way:

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 1635111c5bd2..d85ed7acce05 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1608,17 +1608,23 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages)
                 npages = pipe_space_for_user(iter_head, pipe->tail, pipe);
                 if (npages >= maxpages)
                         return maxpages;
+        } else if (iov_iter_is_bvec(i)) {
+                unsigned idx, offset = i->iov_offset;
+
+                for (idx = 0; idx < i->nr_segs; idx++) {
+                        npages += DIV_ROUND_UP(i->bvec[idx].bv_len - offset,
+                                        PAGE_SIZE);
+                        offset = 0;
+                }
+                if (npages >= maxpages)
+                        return maxpages;
         } else iterate_all_kinds(i, size, v, ({
                 unsigned long p = (unsigned long)v.iov_base;
                 npages += DIV_ROUND_UP(p + v.iov_len, PAGE_SIZE)
                         - p / PAGE_SIZE;
                 if (npages >= maxpages)
                         return maxpages;
-        0;}),({
-                npages++;
-                if (npages >= maxpages)
-                        return maxpages;
-        }),({
+        0;}),0,({
                 unsigned long p = (unsigned long)v.iov_base;
                 npages += DIV_ROUND_UP(p + v.iov_len, PAGE_SIZE)
                         - p / PAGE_SIZE;
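
The arithmetic of the bvec branch in the diff above can be tried out in userspace. The following is a sketch under assumptions, not the kernel function: `struct sketch_seg`, `sketch_bvec_npages` and the `SKETCH_*` macros are hypothetical names, the page size is assumed to be 4096, `iov_offset` is assumed not to exceed the first segment's length, and `maxpages` is assumed positive. Like the posted loop, it only looks at `bv_len` and applies the offset to the first segment.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size, for illustration only */
#define SKETCH_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Hypothetical stand-in carrying only the field the loop reads. */
struct sketch_seg {
        unsigned int bv_len; /* segment length in bytes */
};

/*
 * Sum the rounded-up page count of every segment, skipping iov_offset
 * bytes of the first one, then clamp the result to maxpages.
 */
static int sketch_bvec_npages(const struct sketch_seg *bvec, size_t nr_segs,
                              unsigned int iov_offset, int maxpages)
{
        unsigned long long npages = 0;
        unsigned int offset = iov_offset;
        size_t idx;

        for (idx = 0; idx < nr_segs; idx++) {
                npages += SKETCH_DIV_ROUND_UP(
                        (unsigned long long)bvec[idx].bv_len - offset,
                        SKETCH_PAGE_SIZE);
                offset = 0; /* only the first segment is partially consumed */
        }
        if (npages >= (unsigned long long)maxpages)
                return maxpages;
        return (int)npages;
}
```

The cost is one division per segment instead of one iterator step per page, so the saving grows with the number of pages each segment covers.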
On Fri, Nov 20, 2020 at 02:06:10AM +0000, Matthew Wilcox wrote:
> On Fri, Nov 20, 2020 at 01:56:22AM +0000, Pavel Begunkov wrote:
> > On 20/11/2020 01:49, Matthew Wilcox wrote:
> > > On Fri, Nov 20, 2020 at 01:39:05AM +0000, Pavel Begunkov wrote:
> > >> On 20/11/2020 01:20, Matthew Wilcox wrote:
> > >>> On Thu, Nov 19, 2020 at 11:24:38PM +0000, Pavel Begunkov wrote:
> > >>>> The block layer spends quite a while in iov_iter_npages(), but for the
> > >>>> bvec case the number of pages is already known and stored in
> > >>>> iter->nr_segs, so it can be returned immediately as an optimisation
> > >>>
> > >>> Er ... no, it doesn't. nr_segs is the number of bvecs. Each bvec can
> > >>> store up to 4GB of contiguous physical memory.
> > >>
> > >> Ah, really, missed min() with PAGE_SIZE in bvec_iter_len(), then it's a
> > >> stupid statement. Thanks!
> > >>
> > >> Are there many users of that? All these iterators are a huge burden,
> > >> just to count one 4KB page in bvec it takes 2% of CPU time for me.
> > >
> > > __bio_try_merge_page() will create multipage BIOs, and that's
> > > called from a number of places including
> > > bio_try_merge_hw_seg(), bio_add_page(), and __bio_iov_iter_get_pages()
> >
> > I get it that there are a lot of places, more interesting how often
> > it's actually triggered and if that's performance critical for anybody.
> > Not like I'm going to change it, just out of curiosity, but bvec.h
> > can be nicely optimised without it.
>
> Typically when you're allocating pages for the page cache, they'll get
> allocated in order and then you'll read or write them in order, so yes,
> it ends up triggering quite a lot. There was once a bug in the page
> allocator which caused them to get allocated in reverse order and it
> was a noticeable performance hit (this was 15-20 years ago).

hugepage use cases can benefit much from this way too.

Thanks,
Ming
On 20/11/2020 02:22, Ming Lei wrote:
> On Fri, Nov 20, 2020 at 01:39:05AM +0000, Pavel Begunkov wrote:
>> On 20/11/2020 01:20, Matthew Wilcox wrote:
>>> On Thu, Nov 19, 2020 at 11:24:38PM +0000, Pavel Begunkov wrote:
>>>> The block layer spends quite a while in iov_iter_npages(), but for the
>>>> bvec case the number of pages is already known and stored in
>>>> iter->nr_segs, so it can be returned immediately as an optimisation
>>>
>>> Er ... no, it doesn't. nr_segs is the number of bvecs. Each bvec can
>>> store up to 4GB of contiguous physical memory.
>>
>> Ah, really, missed min() with PAGE_SIZE in bvec_iter_len(), then it's a
>> stupid statement. Thanks!
>>
>
> iov_iter_npages(bvec) still can be improved a bit by the following way:

Yep, was doing exactly that, +a couple of other places that are in my way.

>
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 1635111c5bd2..d85ed7acce05 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1608,17 +1608,23 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages)
>                  npages = pipe_space_for_user(iter_head, pipe->tail, pipe);
>                  if (npages >= maxpages)
>                          return maxpages;
> +        } else if (iov_iter_is_bvec(i)) {
> +                unsigned idx, offset = i->iov_offset;
> +
> +                for (idx = 0; idx < i->nr_segs; idx++) {
> +                        npages += DIV_ROUND_UP(i->bvec[idx].bv_len - offset,
> +                                        PAGE_SIZE);
> +                        offset = 0;
> +                }
> +                if (npages >= maxpages)
> +                        return maxpages;
>          } else iterate_all_kinds(i, size, v, ({
>                  unsigned long p = (unsigned long)v.iov_base;
>                  npages += DIV_ROUND_UP(p + v.iov_len, PAGE_SIZE)
>                          - p / PAGE_SIZE;
>                  if (npages >= maxpages)
>                          return maxpages;
> -        0;}),({
> -                npages++;
> -                if (npages >= maxpages)
> -                        return maxpages;
> -        }),({
> +        0;}),0,({
>                  unsigned long p = (unsigned long)v.iov_base;
>                  npages += DIV_ROUND_UP(p + v.iov_len, PAGE_SIZE)
>                          - p / PAGE_SIZE;
>
On Fri, Nov 20, 2020 at 02:25:08AM +0000, Pavel Begunkov wrote:
> On 20/11/2020 02:22, Ming Lei wrote:
> > iov_iter_npages(bvec) still can be improved a bit by the following way:
>
> Yep, was doing exactly that, +a couple of other places that are in my way.

Are you optimising the right thing here? Assuming you're looking at
the one in do_blockdev_direct_IO(), wouldn't we be better off figuring
out how to copy the bvecs directly from the iov_iter into the bio
rather than calling dio_bio_add_page() for each page?
On Fri, Nov 20, 2020 at 02:54:57AM +0000, Matthew Wilcox wrote:
> On Fri, Nov 20, 2020 at 02:25:08AM +0000, Pavel Begunkov wrote:
> > On 20/11/2020 02:22, Ming Lei wrote:
> > > iov_iter_npages(bvec) still can be improved a bit by the following way:
> >
> > Yep, was doing exactly that, +a couple of other places that are in my way.
>
> Are you optimising the right thing here? Assuming you're looking at
> the one in do_blockdev_direct_IO(), wouldn't we be better off figuring
> out how to copy the bvecs directly from the iov_iter into the bio
> rather than calling dio_bio_add_page() for each page?

Which is most effectively done by stopping the use of *blockdev_direct_IO
and switching to iomap instead :)
On 20/11/2020 02:54, Matthew Wilcox wrote:
> On Fri, Nov 20, 2020 at 02:25:08AM +0000, Pavel Begunkov wrote:
>> On 20/11/2020 02:22, Ming Lei wrote:
>>> iov_iter_npages(bvec) still can be improved a bit by the following way:
>>
>> Yep, was doing exactly that, +a couple of other places that are in my way.
>
> Are you optimising the right thing here? Assuming you're looking at
> the one in do_blockdev_direct_IO(), wouldn't we be better off figuring
> out how to copy the bvecs directly from the iov_iter into the bio
> rather than calling dio_bio_add_page() for each page?

Ha, you got me, *add_page() was that "couple of others". It shows up
much more, but iov_iter_npages() just looked simple enough to do first.
On Fri, Nov 20, 2020 at 08:14:29AM +0000, Christoph Hellwig wrote:
> On Fri, Nov 20, 2020 at 02:54:57AM +0000, Matthew Wilcox wrote:
> > On Fri, Nov 20, 2020 at 02:25:08AM +0000, Pavel Begunkov wrote:
> > > On 20/11/2020 02:22, Ming Lei wrote:
> > > > iov_iter_npages(bvec) still can be improved a bit by the following way:
> > >
> > > Yep, was doing exactly that, +a couple of other places that are in my way.
> >
> > Are you optimising the right thing here? Assuming you're looking at
> > the one in do_blockdev_direct_IO(), wouldn't we be better off figuring
> > out how to copy the bvecs directly from the iov_iter into the bio
> > rather than calling dio_bio_add_page() for each page?
>
> Which is most effectively done by stopping the use of *blockdev_direct_IO
> and switching to iomap instead :)

But iomap still calls iov_iter_npages(). So maybe we need something like
this ...

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 933f234d5bec..1c5a802a45d9 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -250,7 +250,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
         orig_count = iov_iter_count(dio->submit.iter);
         iov_iter_truncate(dio->submit.iter, length);
 
-        nr_pages = iov_iter_npages(dio->submit.iter, BIO_MAX_PAGES);
+        nr_pages = bio_iov_iter_npages(dio->submit.iter);
         if (nr_pages <= 0) {
                 ret = nr_pages;
                 goto out;
@@ -308,7 +308,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
                 dio->size += n;
                 copied += n;
 
-                nr_pages = iov_iter_npages(dio->submit.iter, BIO_MAX_PAGES);
+                nr_pages = bio_iov_iter_npages(dio->submit.iter);
                 iomap_dio_submit_bio(dio, iomap, bio, pos);
                 pos += n;
         } while (nr_pages);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index c6d765382926..86cc74f84b30 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -10,6 +10,7 @@
 #include <linux/ioprio.h>
 /* struct bio, bio_vec and BIO_* flags are defined in blk_types.h */
 #include <linux/blk_types.h>
+#include <linux/uio.h>
 
 #define BIO_DEBUG
 
@@ -447,6 +448,16 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
 void __bio_add_page(struct bio *bio, struct page *page,
                 unsigned int len, unsigned int off);
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter);
+
+static inline int bio_iov_iter_npages(const struct iov_iter *i)
+{
+        if (!iov_iter_count(i))
+                return 0;
+        if (iov_iter_is_bvec(i))
+                return 1;
+        return iov_iter_npages(i, BIO_MAX_PAGES);
+}
+
 void bio_release_pages(struct bio *bio, bool mark_dirty);
 extern void bio_set_pages_dirty(struct bio *bio);
 extern void bio_check_pages_dirty(struct bio *bio);
On 20/11/2020 12:39, Matthew Wilcox wrote:
> On Fri, Nov 20, 2020 at 08:14:29AM +0000, Christoph Hellwig wrote:
>> On Fri, Nov 20, 2020 at 02:54:57AM +0000, Matthew Wilcox wrote:
>>> On Fri, Nov 20, 2020 at 02:25:08AM +0000, Pavel Begunkov wrote:
>>>> On 20/11/2020 02:22, Ming Lei wrote:
>>>>> iov_iter_npages(bvec) still can be improved a bit by the following way:
>>>>
>>>> Yep, was doing exactly that, +a couple of other places that are in my way.
>>>
>>> Are you optimising the right thing here? Assuming you're looking at
>>> the one in do_blockdev_direct_IO(), wouldn't we be better off figuring
>>> out how to copy the bvecs directly from the iov_iter into the bio
>>> rather than calling dio_bio_add_page() for each page?
>>
>> Which is most effectively done by stopping the use of *blockdev_direct_IO
>> and switching to iomap instead :)
>
> But iomap still calls iov_iter_npages(). So maybe we need something like
> this ...

Yep, all that are not mutually exclusive optimisations.
Why `return 1`? It seems to be used later in bio_alloc(nr_pages)

>
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 933f234d5bec..1c5a802a45d9 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -250,7 +250,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>          orig_count = iov_iter_count(dio->submit.iter);
>          iov_iter_truncate(dio->submit.iter, length);
>
> -        nr_pages = iov_iter_npages(dio->submit.iter, BIO_MAX_PAGES);
> +        nr_pages = bio_iov_iter_npages(dio->submit.iter);
>          if (nr_pages <= 0) {
>                  ret = nr_pages;
>                  goto out;
> @@ -308,7 +308,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>                  dio->size += n;
>                  copied += n;
>
> -                nr_pages = iov_iter_npages(dio->submit.iter, BIO_MAX_PAGES);
> +                nr_pages = bio_iov_iter_npages(dio->submit.iter);
>                  iomap_dio_submit_bio(dio, iomap, bio, pos);
>                  pos += n;
>          } while (nr_pages);
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index c6d765382926..86cc74f84b30 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -10,6 +10,7 @@
>  #include <linux/ioprio.h>
>  /* struct bio, bio_vec and BIO_* flags are defined in blk_types.h */
>  #include <linux/blk_types.h>
> +#include <linux/uio.h>
>
>  #define BIO_DEBUG
>
> @@ -447,6 +448,16 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
>  void __bio_add_page(struct bio *bio, struct page *page,
>                  unsigned int len, unsigned int off);
>  int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter);
> +
> +static inline int bio_iov_iter_npages(const struct iov_iter *i)
> +{
> +        if (!iov_iter_count(i))
> +                return 0;
> +        if (iov_iter_is_bvec(i))
> +                return 1;
> +        return iov_iter_npages(i, BIO_MAX_PAGES);
> +}
> +
>  void bio_release_pages(struct bio *bio, bool mark_dirty);
>  extern void bio_set_pages_dirty(struct bio *bio);
>  extern void bio_check_pages_dirty(struct bio *bio);
>
On Fri, Nov 20, 2020 at 01:00:37PM +0000, Pavel Begunkov wrote:
> On 20/11/2020 12:39, Matthew Wilcox wrote:
> > On Fri, Nov 20, 2020 at 08:14:29AM +0000, Christoph Hellwig wrote:
> >> On Fri, Nov 20, 2020 at 02:54:57AM +0000, Matthew Wilcox wrote:
> >>> On Fri, Nov 20, 2020 at 02:25:08AM +0000, Pavel Begunkov wrote:
> >>>> On 20/11/2020 02:22, Ming Lei wrote:
> >>>>> iov_iter_npages(bvec) still can be improved a bit by the following way:
> >>>>
> >>>> Yep, was doing exactly that, +a couple of other places that are in my way.
> >>>
> >>> Are you optimising the right thing here? Assuming you're looking at
> >>> the one in do_blockdev_direct_IO(), wouldn't we be better off figuring
> >>> out how to copy the bvecs directly from the iov_iter into the bio
> >>> rather than calling dio_bio_add_page() for each page?
> >>
> >> Which is most effectively done by stopping the use of *blockdev_direct_IO
> >> and switching to iomap instead :)
> >
> > But iomap still calls iov_iter_npages(). So maybe we need something like
> > this ...
>
> Yep, all that are not mutually exclusive optimisations.
> Why `return 1`? It seems to be used later in bio_alloc(nr_pages)

because 0 means "no pages". It does no harm to allocate one biovec
that we then don't use.

> > -       nr_pages = iov_iter_npages(dio->submit.iter, BIO_MAX_PAGES);
> > +       nr_pages = bio_iov_iter_npages(dio->submit.iter);
> >         if (nr_pages <= 0) {
                ^^^^^^^^^^^^^

> > -               nr_pages = iov_iter_npages(dio->submit.iter, BIO_MAX_PAGES);
> > +               nr_pages = bio_iov_iter_npages(dio->submit.iter);
> >                 iomap_dio_submit_bio(dio, iomap, bio, pos);
> >                 pos += n;
> >         } while (nr_pages);
                     ^^^^^^^^
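
The role of the return value as a "more data left" flag can be seen in a small userspace model. This is a sketch under assumptions, not the iomap code: `struct sketch_iter`, both functions and `SKETCH_BIO_MAX_PAGES` are hypothetical names, the non-bvec branch assumes page-aligned data and a 4096-byte page, and `bytes_per_bio` must be non-zero.

```c
#include <stddef.h>

#define SKETCH_BIO_MAX_PAGES 256 /* assumed cap, for illustration only */

enum sketch_iter_type { SKETCH_ITER_IOVEC, SKETCH_ITER_BVEC };

/* Hypothetical, heavily reduced stand-in for struct iov_iter. */
struct sketch_iter {
        enum sketch_iter_type type;
        size_t count; /* bytes remaining */
};

/* Same shape as the proposed helper: 0 means nothing left to submit. */
static int sketch_bio_iov_iter_npages(const struct sketch_iter *i)
{
        size_t pages;

        if (i->count == 0)
                return 0;
        if (i->type == SKETCH_ITER_BVEC)
                return 1;
        /* Stand-in for iov_iter_npages(); assumes page-aligned data. */
        pages = (i->count + 4095) / 4096;
        return pages > SKETCH_BIO_MAX_PAGES ? SKETCH_BIO_MAX_PAGES : (int)pages;
}

/* Drain the iterator bytes_per_bio at a time, counting submissions. */
static int sketch_submit_all(struct sketch_iter *i, size_t bytes_per_bio)
{
        int bios = 0;
        int nr_pages = sketch_bio_iov_iter_npages(i);

        if (nr_pages <= 0)
                return 0;
        do {
                size_t n = i->count < bytes_per_bio ? i->count : bytes_per_bio;

                i->count -= n;
                bios++;
                nr_pages = sketch_bio_iov_iter_npages(i);
        } while (nr_pages);
        return bios;
}
```

In this model the bvec case never needs an exact page count: any non-zero value keeps the loop going, and 0 ends it.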
From: Pavel Begunkov
> Sent: 19 November 2020 23:25
>
> The block layer spends quite a while in iov_iter_npages(), but for the
> bvec case the number of pages is already known and stored in
> iter->nr_segs, so it can be returned immediately as an optimisation
>
> Perf for an io_uring benchmark with registered buffers (i.e. bvec) shows
> ~1.5-2.0% total cycle count spent in iov_iter_npages(), that's dropped
> by this patch to ~0.2%.
>
> Reviewed-by: Jens Axboe <axboe@kernel.dk>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  lib/iov_iter.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 1635111c5bd2..0fa7ac330acf 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1594,6 +1594,8 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages)
>                  return 0;
>          if (unlikely(iov_iter_is_discard(i)))
>                  return 0;
> +        if (unlikely(iov_iter_is_bvec(i)))
> +                return min_t(int, i->nr_segs, maxpages);
>
>          if (unlikely(iov_iter_is_pipe(i))) {

Is it worth putting an extra condition around these three 'unlikely' cases?
ie:
        if (unlikely(iov_iter_type(i) & (ITER_DISCARD | ITER_BVEC | ITER_PIPE))) {
                if (iov_iter_is_discard(i))
                        return 0;
                if (iov_iter_is_bvec(i))
                        return min_t(int, i->nr_segs, maxpages);
                /* Must be ITER_PIPE */

        David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
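
The grouping idea can be sketched in userspace as a single mask test guarding the rare iterator types. Everything named `sketch_*` or `SKETCH_*` below is hypothetical, and the flag values are illustrative only; the kernel's actual values may differ. The function only reports which branch would be taken and does not compute a page count.

```c
/* Illustrative one-bit-per-type flags; the kernel's values may differ. */
enum sketch_iter_type {
        SKETCH_ITER_IOVEC   = 1 << 0,
        SKETCH_ITER_KVEC    = 1 << 1,
        SKETCH_ITER_BVEC    = 1 << 2,
        SKETCH_ITER_PIPE    = 1 << 3,
        SKETCH_ITER_DISCARD = 1 << 4
};

enum sketch_path {
        SKETCH_PATH_COMMON,
        SKETCH_PATH_DISCARD,
        SKETCH_PATH_BVEC,
        SKETCH_PATH_PIPE
};

#define SKETCH_RARE_MASK \
        (SKETCH_ITER_DISCARD | SKETCH_ITER_BVEC | SKETCH_ITER_PIPE)

/* Branch hint where the compiler supports it, plain test otherwise. */
#if defined(__GNUC__)
#define sketch_unlikely(x) __builtin_expect(!!(x), 0)
#else
#define sketch_unlikely(x) (x)
#endif

/* The common types pay for one test instead of three. */
static enum sketch_path sketch_classify(unsigned int type)
{
        if (sketch_unlikely(type & SKETCH_RARE_MASK)) {
                if (type & SKETCH_ITER_DISCARD)
                        return SKETCH_PATH_DISCARD;
                if (type & SKETCH_ITER_BVEC)
                        return SKETCH_PATH_BVEC;
                return SKETCH_PATH_PIPE; /* only remaining bit in the mask */
        }
        return SKETCH_PATH_COMMON;
}
```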
On 20/11/2020 02:24, Ming Lei wrote:
> On Fri, Nov 20, 2020 at 02:06:10AM +0000, Matthew Wilcox wrote:
>> On Fri, Nov 20, 2020 at 01:56:22AM +0000, Pavel Begunkov wrote:
>>> On 20/11/2020 01:49, Matthew Wilcox wrote:
>>>> On Fri, Nov 20, 2020 at 01:39:05AM +0000, Pavel Begunkov wrote:
>>>>> On 20/11/2020 01:20, Matthew Wilcox wrote:
>>>>>> On Thu, Nov 19, 2020 at 11:24:38PM +0000, Pavel Begunkov wrote:
>>>>>>> The block layer spends quite a while in iov_iter_npages(), but for the
>>>>>>> bvec case the number of pages is already known and stored in
>>>>>>> iter->nr_segs, so it can be returned immediately as an optimisation
>>>>>>
>>>>>> Er ... no, it doesn't. nr_segs is the number of bvecs. Each bvec can
>>>>>> store up to 4GB of contiguous physical memory.
>>>>>
>>>>> Ah, really, missed min() with PAGE_SIZE in bvec_iter_len(), then it's a
>>>>> stupid statement. Thanks!
>>>>>
>>>>> Are there many users of that? All these iterators are a huge burden,
>>>>> just to count one 4KB page in bvec it takes 2% of CPU time for me.
>>>>
>>>> __bio_try_merge_page() will create multipage BIOs, and that's
>>>> called from a number of places including
>>>> bio_try_merge_hw_seg(), bio_add_page(), and __bio_iov_iter_get_pages()
>>>
>>> I get it that there are a lot of places, more interesting how often
>>> it's actually triggered and if that's performance critical for anybody.
>>> Not like I'm going to change it, just out of curiosity, but bvec.h
>>> can be nicely optimised without it.
>>
>> Typically when you're allocating pages for the page cache, they'll get
>> allocated in order and then you'll read or write them in order, so yes,
>> it ends up triggering quite a lot. There was once a bug in the page
>> allocator which caused them to get allocated in reverse order and it
>> was a noticeable performance hit (this was 15-20 years ago).
>
> hugepage use cases can benefit much from this way too.

This didn't yield any considerable boost for me though. 1.5% -> 1.3%
for 1 page reads. I'll send it anyway though because there are cases
that can benefit, e.g. as Ming mentioned.

Ming would you want to send the patch yourself? After all you did post
it first.
On 20/11/2020 17:22, Pavel Begunkov wrote:
> On 20/11/2020 02:24, Ming Lei wrote:
>> On Fri, Nov 20, 2020 at 02:06:10AM +0000, Matthew Wilcox wrote:
>>> On Fri, Nov 20, 2020 at 01:56:22AM +0000, Pavel Begunkov wrote:
>>>> On 20/11/2020 01:49, Matthew Wilcox wrote:
>>>>> On Fri, Nov 20, 2020 at 01:39:05AM +0000, Pavel Begunkov wrote:
>>>>>> On 20/11/2020 01:20, Matthew Wilcox wrote:
>>>>>>> On Thu, Nov 19, 2020 at 11:24:38PM +0000, Pavel Begunkov wrote:
>>>>>>>> The block layer spends quite a while in iov_iter_npages(), but for the
>>>>>>>> bvec case the number of pages is already known and stored in
>>>>>>>> iter->nr_segs, so it can be returned immediately as an optimisation
>>>>>>>
>>>>>>> Er ... no, it doesn't. nr_segs is the number of bvecs. Each bvec can
>>>>>>> store up to 4GB of contiguous physical memory.
>>>>>>
>>>>>> Ah, really, missed min() with PAGE_SIZE in bvec_iter_len(), then it's a
>>>>>> stupid statement. Thanks!
>>>>>>
>>>>>> Are there many users of that? All these iterators are a huge burden,
>>>>>> just to count one 4KB page in bvec it takes 2% of CPU time for me.
>>>>>
>>>>> __bio_try_merge_page() will create multipage BIOs, and that's
>>>>> called from a number of places including
>>>>> bio_try_merge_hw_seg(), bio_add_page(), and __bio_iov_iter_get_pages()
>>>>
>>>> I get it that there are a lot of places, more interesting how often
>>>> it's actually triggered and if that's performance critical for anybody.
>>>> Not like I'm going to change it, just out of curiosity, but bvec.h
>>>> can be nicely optimised without it.
>>>
>>> Typically when you're allocating pages for the page cache, they'll get
>>> allocated in order and then you'll read or write them in order, so yes,
>>> it ends up triggering quite a lot. There was once a bug in the page
>>> allocator which caused them to get allocated in reverse order and it
>>> was a noticeable performance hit (this was 15-20 years ago).
>>
>> hugepage use cases can benefit much from this way too.
>
> This didn't yield any considerable boost for me though. 1.5% -> 1.3%
> for 1 page reads. I'll send it anyway though because there are cases
> that can benefit, e.g. as Ming mentioned.

And yeah, it just shifts my attention for optimisation to its callers,
e.g. blkdev_direct_IO.

> Ming would you want to send the patch yourself? After all you did post
> it first.
>
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 1635111c5bd2..0fa7ac330acf 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1594,6 +1594,8 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages)
                 return 0;
         if (unlikely(iov_iter_is_discard(i)))
                 return 0;
+        if (unlikely(iov_iter_is_bvec(i)))
+                return min_t(int, i->nr_segs, maxpages);
 
         if (unlikely(iov_iter_is_pipe(i))) {
                 struct pipe_inode_info *pipe = i->pipe;
@@ -1614,11 +1616,9 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages)
                         - p / PAGE_SIZE;
                 if (npages >= maxpages)
                         return maxpages;
-        0;}),({
-                npages++;
-                if (npages >= maxpages)
-                        return maxpages;
-        }),({
+        0;}),
+                0 /* bvecs are handled above */
+                ,({
                 unsigned long p = (unsigned long)v.iov_base;
                 npages += DIV_ROUND_UP(p + v.iov_len, PAGE_SIZE)
                         - p / PAGE_SIZE;