Message ID | 20210105132647.3818503-1-ming.lei@redhat.com (mailing list archive) |
---|---|
State | Deferred, archived |
Headers | show |
Series | [RFC] fs: block_dev: compute nr_vecs hint for improving writeback bvecs allocation | expand |
At least for iomap I think this is the wrong approach. Between the iomap and writeback_control we know the maximum size of the writeback request and can just use that. Take a look at what iomap_readpage_actor does on the read size, something similar taking the wbc into account can be done on the writeback side.
On Tue, Jan 05, 2021 at 07:39:38PM +0100, Christoph Hellwig wrote: > At least for iomap I think this is the wrong approach. Between the > iomap and writeback_control we know the maximum size of the writeback > request and can just use that. I think writeback_control can tell us nothing about max pages in single bio: - wbc->nr_to_write controls how many pages to writeback, this pages usually don't belong to same bio. Also this number is often much bigger than BIO_MAX_PAGES. - wbc->range_start/range_end is similar too, which is often much more bigger than BIO_MAX_PAGES. Also page/blocks_in_page can be mapped to different extent too, which is only available when wpc->ops->map_blocks() is returned, which looks not different with mpage_writepages(), in which bio is allocated with BIO_MAX_PAGES vecs too. Or you mean we can use iomap->length for this purpose? But iomap->length still is still too big in case of xfs.
On Wed, Jan 06, 2021 at 04:45:48PM +0800, Ming Lei wrote: > On Tue, Jan 05, 2021 at 07:39:38PM +0100, Christoph Hellwig wrote: > > At least for iomap I think this is the wrong approach. Between the > > iomap and writeback_control we know the maximum size of the writeback > > request and can just use that. > > I think writeback_control can tell us nothing about max pages in single > bio: By definition, the iomap tells us exactly how big the IO is going to be. i.e. an iomap spans a single contiguous range that we are going to issue IO on. Hence we can use that to size the bio exactly right for direct IO. > - wbc->nr_to_write controls how many pages to writeback, this pages > usually don't belong to same bio. Also this number is often much > bigger than BIO_MAX_PAGES. > > - wbc->range_start/range_end is similar too, which is often much more > bigger than BIO_MAX_PAGES. > > Also page/blocks_in_page can be mapped to different extent too, which is > only available when wpc->ops->map_blocks() is returned, We only allocate the bio -after- calling ->map_blocks() to obtain the iomap for the given writeback range request. Hence we already know how large the BIO could be before we allocate it. > which looks not > different with mpage_writepages(), in which bio is allocated with > BIO_MAX_PAGES vecs too. __mpage_writepage() only maps a page at a time, so it can't tell ahead of time how big the bio is going to need to be as it doesn't return/cache a contiguous extent range. So it's actually very different to the iomap writeback code, and effectively does require a BIO_MAX_PAGES vecs allocation all the time... > Or you mean we can use iomap->length for this purpose? But iomap->length > still is still too big in case of xfs. if we are doing small random writeback into large extents (i.e. iomap->length is large), then it is trivial to detect that we are doing random writes rather than sequential writes by checking if the current page is sequential to the last sector in the current bio. We already do this non-sequential IO checking to determine if a new bio needs to be allocated in iomap_can_add_to_ioend(), and we also know how large the current contiguous range mapped into the current bio chain is (ioend->io_size). Hence we've got everything we need to determine whether we should do a large or small bio vec allocation in the iomap writeback path... Cheers, Dave.
On Thu, Jan 07, 2021 at 09:21:11AM +1100, Dave Chinner wrote: > On Wed, Jan 06, 2021 at 04:45:48PM +0800, Ming Lei wrote: > > On Tue, Jan 05, 2021 at 07:39:38PM +0100, Christoph Hellwig wrote: > > > At least for iomap I think this is the wrong approach. Between the > > > iomap and writeback_control we know the maximum size of the writeback > > > request and can just use that. > > > > I think writeback_control can tell us nothing about max pages in single > > bio: > > By definition, the iomap tells us exactly how big the IO is going to > be. i.e. an iomap spans a single contiguous range that we are going > to issue IO on. Hence we can use that to size the bio exactly > right for direct IO. When I trace wpc->iomap.length in iomap_add_to_ioend() on the following fio randwrite/write, the length is 1GB most of times, maybe because it is one fresh XFS. fio --size=1G --bsrange=4k-4k --runtime=30 --numjobs=2 --ioengine=psync --iodepth=32 \ --directory=$DIR --group_reporting=1 --unlink=0 --direct=0 --fsync=0 --name=f1 \ --stonewall --rw=$RW sync Another reason is that pages in the range may be contiguous physically, so lots of pages may share one single bvec. > > > - wbc->nr_to_write controls how many pages to writeback, this pages > > usually don't belong to same bio. Also this number is often much > > bigger than BIO_MAX_PAGES. > > > > - wbc->range_start/range_end is similar too, which is often much more > > bigger than BIO_MAX_PAGES. > > > > Also page/blocks_in_page can be mapped to different extent too, which is > > only available when wpc->ops->map_blocks() is returned, > > We only allocate the bio -after- calling ->map_blocks() to obtain > the iomap for the given writeback range request. Hence we > already know how large the BIO could be before we allocate it. > > > which looks not > > different with mpage_writepages(), in which bio is allocated with > > BIO_MAX_PAGES vecs too. > > __mpage_writepage() only maps a page at a time, so it can't tell > ahead of time how big the bio is going to need to be as it doesn't > return/cache a contiguous extent range. So it's actually very > different to the iomap writeback code, and effectively does require > a BIO_MAX_PAGES vecs allocation all the time... > > > Or you mean we can use iomap->length for this purpose? But iomap->length > > still is still too big in case of xfs. > > if we are doing small random writeback into large extents (i.e. > iomap->length is large), then it is trivial to detect that we are > doing random writes rather than sequential writes by checking if the > current page is sequential to the last sector in the current bio. > We already do this non-sequential IO checking to determine if a new > bio needs to be allocated in iomap_can_add_to_ioend(), and we also > know how large the current contiguous range mapped into the current > bio chain is (ioend->io_size). Hence we've got everything we need to > determine whether we should do a large or small bio vec allocation > in the iomap writeback path... page->index should tell us if the workload is random or sequential, however still not easy to decide how many pages there will be in the next bio when iomap->length is large. Thanks, Ming
On Fri, Jan 08, 2021 at 03:59:22PM +0800, Ming Lei wrote: > On Thu, Jan 07, 2021 at 09:21:11AM +1100, Dave Chinner wrote: > > On Wed, Jan 06, 2021 at 04:45:48PM +0800, Ming Lei wrote: > > > On Tue, Jan 05, 2021 at 07:39:38PM +0100, Christoph Hellwig wrote: > > > > At least for iomap I think this is the wrong approach. Between the > > > > iomap and writeback_control we know the maximum size of the writeback > > > > request and can just use that. > > > > > > I think writeback_control can tell us nothing about max pages in single > > > bio: > > > > By definition, the iomap tells us exactly how big the IO is going to > > be. i.e. an iomap spans a single contiguous range that we are going > > to issue IO on. Hence we can use that to size the bio exactly > > right for direct IO. > > When I trace wpc->iomap.length in iomap_add_to_ioend() on the following fio > randwrite/write, the length is 1GB most of times, maybe because it is > one fresh XFS. Yes, that's exactly what I said it would do. > Another reason is that pages in the range may be contiguous physically, > so lots of pages may share one single bvec. The iomap layer does not care about this, and there's no way this can be detected ahead of time, anyway, because we are only passed a single page at a time. When we get large pages from the page cache, we'll still only get one page at a time, but we'll get physically contiguous pages and so it will still be a 1 page : 1 bvec relationship at the iomap layer. > > > - wbc->nr_to_write controls how many pages to writeback, this pages > > > usually don't belong to same bio. Also this number is often much > > > bigger than BIO_MAX_PAGES. > > > > > > - wbc->range_start/range_end is similar too, which is often much more > > > bigger than BIO_MAX_PAGES. > > > > > > Also page/blocks_in_page can be mapped to different extent too, which is > > > only available when wpc->ops->map_blocks() is returned, > > > > We only allocate the bio -after- calling ->map_blocks() to obtain > > the iomap for the given writeback range request. Hence we > > already know how large the BIO could be before we allocate it. > > > > > which looks not > > > different with mpage_writepages(), in which bio is allocated with > > > BIO_MAX_PAGES vecs too. > > > > __mpage_writepage() only maps a page at a time, so it can't tell > > ahead of time how big the bio is going to need to be as it doesn't > > return/cache a contiguous extent range. So it's actually very > > different to the iomap writeback code, and effectively does require > > a BIO_MAX_PAGES vecs allocation all the time... > > > > > Or you mean we can use iomap->length for this purpose? But iomap->length > > > still is still too big in case of xfs. > > > > if we are doing small random writeback into large extents (i.e. > > iomap->length is large), then it is trivial to detect that we are > > doing random writes rather than sequential writes by checking if the > > current page is sequential to the last sector in the current bio. > > We already do this non-sequential IO checking to determine if a new > > bio needs to be allocated in iomap_can_add_to_ioend(), and we also > > know how large the current contiguous range mapped into the current > > bio chain is (ioend->io_size). Hence we've got everything we need to > > determine whether we should do a large or small bio vec allocation > > in the iomap writeback path... > > page->index should tell us if the workload is random or sequential, however > still not easy to decide how many pages there will be in the next bio > when iomap->length is large. page->index doesn't tell us anything about what type of IO is being done - it just tells us where in the file we need to map to find the physical block we need to write it to. OTOH, the iomap writeback context contains all the information about current IO being build - offset, size, current bio, etc - and the page->index gets compared against the state in the iomap writepage context. So, if the wpc->iomap.length is large, current page->index does not map sequentially to the end of wpc->ioend->io_bio (or wpc->io_end->io_offset + wpc->ioend->io_size) and wpc->io_end->io_size == page_size(page) for the currently held bio, then we are clearly doing random single page writeback into a large allocated extent. Hence in that case we can do small bvec allocations for the new bio. Sure, the first bio in a ->writepages invocation doesn't have this information, so we've going to have to assume BIO_MAX_PAGES for the first bio. But for every bio after that in the ->writepages invocation we have the state of the previous contiguous writeback range held in the wpc structure and can use that info to optimise the thousands of random single pages that are written after then first one... Cheers, Dave.
diff --git a/fs/block_dev.c b/fs/block_dev.c index 3e5b02f6606c..0490f16a96db 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -880,6 +880,7 @@ struct block_device *bdev_alloc(struct gendisk *disk, u8 partno) #ifdef CONFIG_SYSFS INIT_LIST_HEAD(&bdev->bd_holder_disks); #endif + bdev->bd_wb_nr_vecs_hint = BIO_MAX_PAGES / 2; bdev->bd_stats = alloc_percpu(struct disk_stats); if (!bdev->bd_stats) { iput(inode); diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c index 16a1e82e3aeb..1e031af182db 100644 --- a/fs/iomap/buffered-io.c +++ b/fs/iomap/buffered-io.c @@ -1210,6 +1210,7 @@ iomap_submit_ioend(struct iomap_writepage_ctx *wpc, struct iomap_ioend *ioend, return error; } + bdev_update_wb_nr_vecs_hint(wpc->iomap.bdev, ioend->io_bio); submit_bio(ioend->io_bio); return 0; } @@ -1221,7 +1222,8 @@ iomap_alloc_ioend(struct inode *inode, struct iomap_writepage_ctx *wpc, struct iomap_ioend *ioend; struct bio *bio; - bio = bio_alloc_bioset(GFP_NOFS, BIO_MAX_PAGES, &iomap_ioend_bioset); + bio = bio_alloc_bioset(GFP_NOFS, bdev_wb_nr_vecs_hint(wpc->iomap.bdev), + &iomap_ioend_bioset); bio_set_dev(bio, wpc->iomap.bdev); bio->bi_iter.bi_sector = sector; bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc); @@ -1248,11 +1250,13 @@ iomap_alloc_ioend(struct inode *inode, struct iomap_writepage_ctx *wpc, * traversal in iomap_finish_ioend(). */ static struct bio * -iomap_chain_bio(struct bio *prev) +iomap_chain_bio(struct bio *prev, struct block_device *bdev) { struct bio *new; - new = bio_alloc(GFP_NOFS, BIO_MAX_PAGES); + bdev_update_wb_nr_vecs_hint(bdev, prev); + + new = bio_alloc(GFP_NOFS, bdev_wb_nr_vecs_hint(bdev)); bio_copy_dev(new, prev);/* also copies over blkcg information */ new->bi_iter.bi_sector = bio_end_sector(prev); new->bi_opf = prev->bi_opf; @@ -1308,7 +1312,8 @@ iomap_add_to_ioend(struct inode *inode, loff_t offset, struct page *page, if (!merged) { if (bio_full(wpc->ioend->io_bio, len)) { wpc->ioend->io_bio = - iomap_chain_bio(wpc->ioend->io_bio); + iomap_chain_bio(wpc->ioend->io_bio, + wpc->iomap.bdev); } bio_add_page(wpc->ioend->io_bio, page, len, poff); } diff --git a/include/linux/bio.h b/include/linux/bio.h index 70914dd6a70d..ad0cd1a2abbe 100644 --- a/include/linux/bio.h +++ b/include/linux/bio.h @@ -19,8 +19,6 @@ #define BIO_BUG_ON #endif -#define BIO_MAX_PAGES 256 - #define bio_prio(bio) (bio)->bi_ioprio #define bio_set_prio(bio, prio) ((bio)->bi_ioprio = prio) diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index 866f74261b3b..078e212f5e1f 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -24,6 +24,7 @@ struct block_device { sector_t bd_start_sect; struct disk_stats __percpu *bd_stats; unsigned long bd_stamp; + unsigned short bd_wb_nr_vecs_hint; bool bd_read_only; /* read-only policy */ dev_t bd_dev; int bd_openers; @@ -307,6 +308,8 @@ enum { BIO_FLAG_LAST }; +#define BIO_MAX_PAGES 256 + /* See BVEC_POOL_OFFSET below before adding new flags */ /* @@ -565,4 +568,32 @@ struct blk_rq_stat { u64 batch; }; +/* called before submitting writeback bios */ +static inline void bdev_update_wb_nr_vecs_hint(struct block_device *bdev, + struct bio *bio) +{ + unsigned short nr_vecs; + + if (!bdev) + return; + + nr_vecs = bdev->bd_wb_nr_vecs_hint; + /* + * If this bio is full, double current nr_vecs_hint for fast convergence. + * Otherwise use Exponential Weighted Moving Average to figure out the + * hint + */ + if (bio->bi_vcnt >= bio->bi_max_vecs) + nr_vecs *= 2; + else + nr_vecs = (nr_vecs * 3 + bio->bi_vcnt + 3) / 4; + + bdev->bd_wb_nr_vecs_hint = clamp_val(nr_vecs, 1, BIO_MAX_PAGES); +} + +static inline unsigned short bdev_wb_nr_vecs_hint(struct block_device *bdev) +{ + return bdev->bd_wb_nr_vecs_hint; +} + #endif /* __LINUX_BLK_TYPES_H */
Writeback code always allocates bio with max allowed bvecs(BIO_MAX_PAGES) since it is hard to know in advance how many pages will be written in single bio. And BIO_MAX_PAGES bvecs takes one 4K page. On the other hand, for workloads of random IO, most of writeback bios just uses <= 4 bvecs; for workloads of sequential IO, attributes to multipage bvec, quite a lot of pages are contiguous because of space/time locality for pages in sequential IO workloads, then either nr_bvecs is small enough or size of bio is very big. So in reality, it is quite often to see most of writeback bios just uses <=4 bvecs, which can be allocated inline. This can be observed in normal tests(kernel build, git clone kernel, dbench, ...) So improve bvec allocation by using Exponential Weighted Moving Average( EWMA) to compute nr_bvecs hint for writeback bio. With this simple approach, it is observed[1] that most of writeback bios just need <=4 nr_bvecs in normal workloads, then bvecs can be allocated inline with bio together, meantime one extra 4k allocation is avoided. [1] bpftrace script for observing writeback .bi_max_vcnt & .bi_vcnt histogram http://people.redhat.com/minlei/tests/tools/wb_vcnt.bt Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: linux-xfs@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Ming Lei <ming.lei@redhat.com> --- fs/block_dev.c | 1 + fs/iomap/buffered-io.c | 13 +++++++++---- include/linux/bio.h | 2 -- include/linux/blk_types.h | 31 +++++++++++++++++++++++++++++++ 4 files changed, 41 insertions(+), 6 deletions(-)