Message ID: 20210628150609.2833435-1-nborisov@suse.com (mailing list archive)
State: New, archived
Series: btrfs: Stop kmalloc'ing btrfs_path in btrfs_lookup_bio_sums
On 28/06/2021 17:09, Nikolay Borisov wrote:
> While doing some performance characterization of a workoad reading ~80g
                                                     workload ~^
> All in all there's a 20 seconds (~7%) reductino in real run time as well
                                        reduction ~^

Code looks good to me,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
On 2021/6/28 11:06 PM, Nikolay Borisov wrote:
> While doing some performance characterization of a workoad reading ~80g
> split among 4mb files via DIO I observed that btrfs_lookup_bio_sums was
> using rather excessive cpu cycles:
>
> 95.97%--ksys_pread64
> 95.96%--new_sync_read
> 95.95%--iomap_dio_rw
> 95.89%--iomap_apply
> 72.25%--iomap_dio_bio_actor
> 55.19%--iomap_dio_submit_bio
> 55.13%--btrfs_submit_direct
> 20.72%--btrfs_lookup_bio_sums
> 7.44%-- kmem_cache_alloc
>
> Timing the workload yielded:
> real 4m41.825s
> user 0m14.321s
> sys 10m38.419s
>
> Turns out this kmem_cache_alloc corresponds to the btrfs_path allocation
> that happens in btrfs_lookup_bio_sums. Nothing in btrfs_lookup_bio_sums
> warrants having the path be heap allocated. Just turn this allocation
> into a stack based one. After the change performance changes
> accordingly:
>
> real 4m21.742s
> user 0m13.679s
> sys 9m8.889s
>
> All in all there's a 20 seconds (~7%) reductino in real run time as well
> as the time spent in the kernel is reduced by 1:30 minutes (~14%). And
> btrfs_lookup_bio_sums ends up accounting for only 7.91% of cycles rather
> than 20% before.
>
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>

I'm surprised that a btrfs_path in the kernel no longer needs any
initialization.

I guess we can do a lot more such conversions to use a stack btrfs_path,
just like what we did in btrfs-progs.

But I'm not sure whether the 112 byte increase in stack memory usage is OK
for other locations.

But since this one gets a pretty good performance increase, and this
function itself doesn't have much stack memory usage anyway, it looks OK
to me.

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu

> ---
>  fs/btrfs/file-item.c | 19 +++++++------------
>  1 file changed, 7 insertions(+), 12 deletions(-)
>
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index df6631eefc65..0358ca0a69c8 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -367,7 +367,7 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
>  {
>  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>  	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> -	struct btrfs_path *path;
> +	struct btrfs_path path = {};
>  	const u32 sectorsize = fs_info->sectorsize;
>  	const u32 csum_size = fs_info->csum_size;
>  	u32 orig_len = bio->bi_iter.bi_size;
> @@ -393,9 +393,6 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
>  	 * directly.
>  	 */
>  	ASSERT(bio_op(bio) == REQ_OP_READ);
> -	path = btrfs_alloc_path();
> -	if (!path)
> -		return BLK_STS_RESOURCE;
>
>  	if (!dst) {
>  		struct btrfs_io_bio *btrfs_bio = btrfs_io_bio(bio);
> @@ -403,10 +400,8 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
>  		if (nblocks * csum_size > BTRFS_BIO_INLINE_CSUM_SIZE) {
>  			btrfs_bio->csum = kmalloc_array(nblocks, csum_size,
>  							GFP_NOFS);
> -			if (!btrfs_bio->csum) {
> -				btrfs_free_path(path);
> +			if (!btrfs_bio->csum)
>  				return BLK_STS_RESOURCE;
> -			}
>  		} else {
>  			btrfs_bio->csum = btrfs_bio->csum_inline;
>  		}
> @@ -420,7 +415,7 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
>  	 * kick the readahead for csum tree.
>  	 */
>  	if (nblocks > fs_info->csums_per_leaf)
> -		path->reada = READA_FORWARD;
> +		path.reada = READA_FORWARD;
>
>  	/*
>  	 * the free space stuff is only read when it hasn't been
> @@ -429,8 +424,8 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
>  	 * between reading the free space cache and updating the csum tree.
>  	 */
>  	if (btrfs_is_free_space_inode(BTRFS_I(inode))) {
> -		path->search_commit_root = 1;
> -		path->skip_locking = 1;
> +		path.search_commit_root = 1;
> +		path.skip_locking = 1;
>  	}
>
>  	for (cur_disk_bytenr = orig_disk_bytenr;
> @@ -453,7 +448,7 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
>  			fs_info->sectorsize_bits;
>  		csum_dst = csum + sector_offset * csum_size;
>
> -		count = search_csum_tree(fs_info, path, cur_disk_bytenr,
> +		count = search_csum_tree(fs_info, &path, cur_disk_bytenr,
>  					 search_len, csum_dst);
>  		if (count <= 0) {
>  			/*
> @@ -489,7 +484,7 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
>  		}
>  	}
>
> -	btrfs_free_path(path);
> +	btrfs_release_path(&path);
>  	return BLK_STS_OK;
>  }
>
> --
> 2.25.1
>
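A side note on the initialization point raised above: an empty struct initializer zeroes every member, which matches the all-zero state a zalloc-style allocation such as btrfs_alloc_path() hands back, so no extra setup is needed for the on-stack path. The stand-alone C sketch below illustrates this with a hypothetical demo_path stand-in; it is not the real struct btrfs_path and not kernel code.

/*
 * Stand-alone illustration (not kernel code): an empty initializer
 * zeroes every member, the same starting state a zalloc-style
 * allocator provides.  struct demo_path is a made-up stand-in for
 * struct btrfs_path.
 */
#include <assert.h>
#include <stddef.h>

struct demo_path {
	void *nodes[8];			/* stand-ins for a few path members */
	int slots[8];
	unsigned int reada;
	unsigned int search_commit_root:1;
	unsigned int skip_locking:1;
};

int main(void)
{
	/* Same form as the patch; "= {}" is a GNU/C23 empty initializer. */
	struct demo_path path = {};

	/* Every member starts out zero, so no further init is required. */
	assert(path.reada == 0);
	assert(path.search_commit_root == 0);
	assert(path.skip_locking == 0);
	for (size_t i = 0; i < 8; i++) {
		assert(path.nodes[i] == NULL);
		assert(path.slots[i] == 0);
	}
	return 0;
}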
On 29.06.21 14:56, Qu Wenruo wrote:
>
>
> On 2021/6/28 11:06 PM, Nikolay Borisov wrote:
>> While doing some performance characterization of a workoad reading ~80g
>> split among 4mb files via DIO I observed that btrfs_lookup_bio_sums was
>> using rather excessive cpu cycles:
>>
>> 95.97%--ksys_pread64
>> 95.96%--new_sync_read
>> 95.95%--iomap_dio_rw
>> 95.89%--iomap_apply
>> 72.25%--iomap_dio_bio_actor
>> 55.19%--iomap_dio_submit_bio
>> 55.13%--btrfs_submit_direct
>> 20.72%--btrfs_lookup_bio_sums
>> 7.44%-- kmem_cache_alloc
>>
>> Timing the workload yielded:
>> real 4m41.825s
>> user 0m14.321s
>> sys 10m38.419s
>>
>> Turns out this kmem_cache_alloc corresponds to the btrfs_path allocation
>> that happens in btrfs_lookup_bio_sums. Nothing in btrfs_lookup_bio_sums
>> warrants having the path be heap allocated. Just turn this allocation
>> into a stack based one. After the change performance changes
>> accordingly:
>>
>> real 4m21.742s
>> user 0m13.679s
>> sys 9m8.889s
>>
>> All in all there's a 20 seconds (~7%) reductino in real run time as well
>> as the time spent in the kernel is reduced by 1:30 minutes (~14%). And
>> btrfs_lookup_bio_sums ends up accounting for only 7.91% of cycles rather
>> than 20% before.
>>
>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>
> I'm surprised that a btrfs_path in the kernel no longer needs any
> initialization.
>
> I guess we can do a lot more such conversions to use a stack btrfs_path,
> just like what we did in btrfs-progs.
>
> But I'm not sure whether the 112 byte increase in stack memory usage is
> OK for other locations.
>
> But since this one gets a pretty good performance increase, and this
> function itself doesn't have much stack memory usage anyway, it looks OK
> to me.
>
> Reviewed-by: Qu Wenruo <wqu@suse.com>
>

Actually this patch makes the most sense when SLUB debugging is enabled,
as it seems btrfs is hit pretty hard in that case. In the real world, with
SLUB debugging disabled, it didn't make any noticeable difference. Given
that, the fact that I'm adding 112 bytes on the stack might bring more
harm than good, so I guess this patch can be dropped.

<snip>
On Tue, Jun 29, 2021 at 07:56:34PM +0800, Qu Wenruo wrote:
> On 2021/6/28 11:06 PM, Nikolay Borisov wrote:
> > While doing some performance characterization of a workoad reading ~80g
> > split among 4mb files via DIO I observed that btrfs_lookup_bio_sums was
> > using rather excessive cpu cycles:
> >
> > 95.97%--ksys_pread64
> > 95.96%--new_sync_read
> > 95.95%--iomap_dio_rw
> > 95.89%--iomap_apply
> > 72.25%--iomap_dio_bio_actor
> > 55.19%--iomap_dio_submit_bio
> > 55.13%--btrfs_submit_direct
> > 20.72%--btrfs_lookup_bio_sums
> > 7.44%-- kmem_cache_alloc
> >
> > Timing the workload yielded:
> > real 4m41.825s
> > user 0m14.321s
> > sys 10m38.419s
> >
> > Turns out this kmem_cache_alloc corresponds to the btrfs_path allocation
> > that happens in btrfs_lookup_bio_sums. Nothing in btrfs_lookup_bio_sums
> > warrants having the path be heap allocated. Just turn this allocation
> > into a stack based one. After the change performance changes
> > accordingly:
> >
> > real 4m21.742s
> > user 0m13.679s
> > sys 9m8.889s
> >
> > All in all there's a 20 seconds (~7%) reductino in real run time as well
> > as the time spent in the kernel is reduced by 1:30 minutes (~14%). And
> > btrfs_lookup_bio_sums ends up accounting for only 7.91% of cycles rather
> > than 20% before.
> >
> > Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>
> I'm surprised that a btrfs_path in the kernel no longer needs any
> initialization.
>
> I guess we can do a lot more such conversions to use a stack btrfs_path,
> just like what we did in btrfs-progs.
>
> But I'm not sure whether the 112 byte increase in stack memory usage is
> OK for other locations.

In justified cases 112 bytes on-stack instead of an allocation could be
considered, but not in general. The reason progs can do that is that it's
userspace, where handling memory is way easier and stack sizes are
megabytes, not a few kilobytes like in the kernel. The btrfs_path is
allocated from slab and the performance hit is negligible on a release
build, as Nikolay found out.

> But since this one gets a pretty good performance increase, and this
> function itself doesn't have much stack memory usage anyway, it looks
> OK to me.

The stack size needs to take into account the whole call chain that could
end up in this function, and in the IO paths we need to be conservative in
case there are other layers on top of or under btrfs that could need the
stack space too.
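To make the stack-budget point concrete: the kernel already polices per-function frame size through CONFIG_FRAME_WARN, which (to my understanding) becomes the compiler's -Wframe-larger-than= option. The hypothetical stand-alone sketch below (demo_path and demo_on_stack are made-up names; the 112-byte size is taken from this thread) shows the class of warning that fires when one frame crosses a chosen limit. The deeper concern is that such frames accumulate across the whole IO call chain, which no single-function check can see.

/*
 * frame_demo.c -- hypothetical stand-alone sketch, not kernel code.
 * Compile with a deliberately small limit to provoke the warning the
 * kernel enables via CONFIG_FRAME_WARN (-Wframe-larger-than=):
 *
 *     cc -Wframe-larger-than=100 -c frame_demo.c
 *
 * The 112-byte figure mirrors the struct btrfs_path size quoted in the
 * thread; every function keeping one on its stack charges that much to
 * its own frame, and frames add up along the call chain.
 */
#include <stddef.h>
#include <string.h>

struct demo_path {			/* made-up 112-byte stand-in */
	char payload[112];
};

int demo_on_stack(const char *src, size_t len)
{
	struct demo_path path = {};	/* 112 bytes charged to this frame */

	if (len > sizeof(path.payload))
		len = sizeof(path.payload);
	memcpy(path.payload, src, len);
	return path.payload[0];
}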
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index df6631eefc65..0358ca0a69c8 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -367,7 +367,7 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
-	struct btrfs_path *path;
+	struct btrfs_path path = {};
 	const u32 sectorsize = fs_info->sectorsize;
 	const u32 csum_size = fs_info->csum_size;
 	u32 orig_len = bio->bi_iter.bi_size;
@@ -393,9 +393,6 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
 	 * directly.
 	 */
 	ASSERT(bio_op(bio) == REQ_OP_READ);
-	path = btrfs_alloc_path();
-	if (!path)
-		return BLK_STS_RESOURCE;

 	if (!dst) {
 		struct btrfs_io_bio *btrfs_bio = btrfs_io_bio(bio);
@@ -403,10 +400,8 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
 		if (nblocks * csum_size > BTRFS_BIO_INLINE_CSUM_SIZE) {
 			btrfs_bio->csum = kmalloc_array(nblocks, csum_size,
 							GFP_NOFS);
-			if (!btrfs_bio->csum) {
-				btrfs_free_path(path);
+			if (!btrfs_bio->csum)
 				return BLK_STS_RESOURCE;
-			}
 		} else {
 			btrfs_bio->csum = btrfs_bio->csum_inline;
 		}
@@ -420,7 +415,7 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
 	 * kick the readahead for csum tree.
 	 */
 	if (nblocks > fs_info->csums_per_leaf)
-		path->reada = READA_FORWARD;
+		path.reada = READA_FORWARD;

 	/*
 	 * the free space stuff is only read when it hasn't been
@@ -429,8 +424,8 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
 	 * between reading the free space cache and updating the csum tree.
 	 */
 	if (btrfs_is_free_space_inode(BTRFS_I(inode))) {
-		path->search_commit_root = 1;
-		path->skip_locking = 1;
+		path.search_commit_root = 1;
+		path.skip_locking = 1;
 	}

 	for (cur_disk_bytenr = orig_disk_bytenr;
@@ -453,7 +448,7 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
 			fs_info->sectorsize_bits;
 		csum_dst = csum + sector_offset * csum_size;

-		count = search_csum_tree(fs_info, path, cur_disk_bytenr,
+		count = search_csum_tree(fs_info, &path, cur_disk_bytenr,
 					 search_len, csum_dst);
 		if (count <= 0) {
 			/*
@@ -489,7 +484,7 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
 		}
 	}

-	btrfs_free_path(path);
+	btrfs_release_path(&path);
 	return BLK_STS_OK;
 }
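One detail of the last hunk worth spelling out: the cleanup call changes from btrfs_free_path() to btrfs_release_path(). As far as I understand the btrfs helpers, release drops the references and locks the path is currently holding but leaves the object itself intact, while free releases and then returns the heap object, so an on-stack path must only ever be released. The hypothetical stand-alone sketch below (the demo_* names are made up, not the real API) shows that split; calling a free-style helper on the stack object would be the classic mistake this split avoids.

/*
 * Hypothetical sketch (not the real btrfs API) of the cleanup split the
 * final hunk relies on: "release" drops whatever the object holds but
 * keeps the object valid, "free" releases and then gives the heap
 * memory back.  An on-stack object only needs the release step.
 */
#include <stdlib.h>
#include <string.h>

struct demo_path {
	void *nodes[8];		/* stand-in for held references/locks */
};

static void demo_release_path(struct demo_path *p)
{
	/* Drop held references; the struct stays valid and reusable. */
	memset(p->nodes, 0, sizeof(p->nodes));
}

static void demo_free_path(struct demo_path *p)
{
	if (!p)
		return;
	demo_release_path(p);	/* release first ... */
	free(p);		/* ... then return the heap memory */
}

int main(void)
{
	struct demo_path on_stack = {};
	struct demo_path *on_heap = calloc(1, sizeof(*on_heap));

	demo_release_path(&on_stack);	/* stack path: release only */
	demo_free_path(on_heap);	/* heap path: release + free */
	return 0;
}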
While doing some performance characterization of a workoad reading ~80g
split among 4mb files via DIO I observed that btrfs_lookup_bio_sums was
using rather excessive cpu cycles:

95.97%--ksys_pread64
95.96%--new_sync_read
95.95%--iomap_dio_rw
95.89%--iomap_apply
72.25%--iomap_dio_bio_actor
55.19%--iomap_dio_submit_bio
55.13%--btrfs_submit_direct
20.72%--btrfs_lookup_bio_sums
7.44%-- kmem_cache_alloc

Timing the workload yielded:
real 4m41.825s
user 0m14.321s
sys 10m38.419s

Turns out this kmem_cache_alloc corresponds to the btrfs_path allocation
that happens in btrfs_lookup_bio_sums. Nothing in btrfs_lookup_bio_sums
warrants having the path be heap allocated. Just turn this allocation
into a stack based one. After the change performance changes
accordingly:

real 4m21.742s
user 0m13.679s
sys 9m8.889s

All in all there's a 20 seconds (~7%) reductino in real run time as well
as the time spent in the kernel is reduced by 1:30 minutes (~14%). And
btrfs_lookup_bio_sums ends up accounting for only 7.91% of cycles rather
than 20% before.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/file-item.c | 19 +++++++------------
 1 file changed, 7 insertions(+), 12 deletions(-)

--
2.25.1