
btrfs: Stop kmalloc'ing btrfs_path in btrfs_lookup_bio_sums

Message ID 20210628150609.2833435-1-nborisov@suse.com (mailing list archive)
State New, archived
Series btrfs: Stop kmalloc'ing btrfs_path in btrfs_lookup_bio_sums

Commit Message

Nikolay Borisov June 28, 2021, 3:06 p.m. UTC
While doing some performance characterization of a workload reading ~80G
split among 4MB files via DIO I observed that btrfs_lookup_bio_sums was
using rather excessive CPU cycles:

95.97%--ksys_pread64
95.96%--new_sync_read
95.95%--iomap_dio_rw
95.89%--iomap_apply
72.25%--iomap_dio_bio_actor
55.19%--iomap_dio_submit_bio
55.13%--btrfs_submit_direct
20.72%--btrfs_lookup_bio_sums
7.44%-- kmem_cache_alloc

Timing the workload yielded:
real 4m41.825s
user 0m14.321s
sys 10m38.419s

Turns out this kmem_cache_alloc corresponds to the btrfs_path allocation
that happens in btrfs_lookup_bio_sums. Nothing in btrfs_lookup_bio_sums
warrants having the path heap allocated, so turn this allocation into a
stack-based one. After the change, performance improves accordingly:

real 4m21.742s
user 0m13.679s
sys 9m8.889s

All in all there's a 20 second (~7%) reduction in real run time, and the
time spent in the kernel is reduced by 1:30 minutes (~14%).
btrfs_lookup_bio_sums ends up accounting for only 7.91% of cycles rather
than the 20% it did before.
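
The change itself is the usual heap-to-stack conversion of the path,
roughly (a simplified sketch; the real hunks are in the diff below):

  /* Before: the path comes from the btrfs_path slab cache. */
  struct btrfs_path *path;

  path = btrfs_alloc_path();
  if (!path)
          return BLK_STS_RESOURCE;
  /* ... count = search_csum_tree(fs_info, path, ...); ... */
  btrfs_free_path(path);

  /* After: the path lives on the caller's stack, zero-initialized;
   * only the extent buffers it references need to be dropped.
   */
  struct btrfs_path path = {};

  /* ... count = search_csum_tree(fs_info, &path, ...); ... */
  btrfs_release_path(&path);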

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/file-item.c | 19 +++++++------------
 1 file changed, 7 insertions(+), 12 deletions(-)

--
2.25.1

Comments

Johannes Thumshirn June 28, 2021, 7:09 p.m. UTC | #1
On 28/06/2021 17:09, Nikolay Borisov wrote:
> While doing some performance characterization of a workoad reading ~80g
                                           workload ~^

> All in all there's a 20 seconds (~7%) reductino in real run time as well
                             reduction ~^

Code looks good to me,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Qu Wenruo June 29, 2021, 11:56 a.m. UTC | #2
On 2021/6/28 11:06 PM, Nikolay Borisov wrote:
> While doing some performance characterization of a workoad reading ~80g
> split among 4mb files via DIO I observed that btrfs_lookup_bio_sums was
> using rather excessive cpu cycles:
>
> 95.97%--ksys_pread64
> 95.96%--new_sync_read
> 95.95%--iomap_dio_rw
> 95.89%--iomap_apply
> 72.25%--iomap_dio_bio_actor
> 55.19%--iomap_dio_submit_bio
> 55.13%--btrfs_submit_direct
> 20.72%--btrfs_lookup_bio_sums
> 7.44%-- kmem_cache_alloc
>
> Timing the workload yielded:
> real 4m41.825s
> user 0m14.321s
> sys 10m38.419s
>
> Turns out this kmem_cache_alloc corresponds to the btrfs_path allocation
> that happens in btrfs_lookup_bio_sums. Nothing in btrfs_lookup_bio_sums
> warrants having the path be heap allocated. Just turn this allocation
> into a stack based one. After the change performance changes
> accordingly:
>
> real 4m21.742s
> user 0m13.679s
> sys 9m8.889s
>
> All in all there's a 20 seconds (~7%) reductino in real run time as well
> as the time spent in the kernel is reduced by 1:30 minutes (~14%). And
> btrfs_lookup_bio_sums ends up accounting for only 7.91% of cycles rather
> than 20% before.
>
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>

I'm surprised that btrfs_path in the kernel no longer needs any initialization.

I guess we can do a lot more such conversions to use a stack btrfs_path,
just like what we did in btrfs-progs.

But I'm not so sure whether a 112-byte increase in stack memory usage is
OK for other locations.
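
Just to illustrate keeping an eye on that number, a hypothetical
build-time guard could sit next to such conversions; the 128-byte budget
below is made up for the example, not an existing btrfs limit:

  /* Hypothetical guard (not part of this patch): fail the build if
   * struct btrfs_path grows past an assumed on-stack budget.
   * Needs ctree.h for struct btrfs_path and linux/build_bug.h for
   * static_assert().
   */
  static_assert(sizeof(struct btrfs_path) <= 128,
                "btrfs_path too big for on-stack use");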

But since this one gets a pretty good performance increase, and this
function itself doesn't have much stack memory usage anyway, it looks OK
to me.

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu
> ---
>   fs/btrfs/file-item.c | 19 +++++++------------
>   1 file changed, 7 insertions(+), 12 deletions(-)
>
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index df6631eefc65..0358ca0a69c8 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -367,7 +367,7 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
>   {
>   	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>   	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> -	struct btrfs_path *path;
> +	struct btrfs_path path = {};
>   	const u32 sectorsize = fs_info->sectorsize;
>   	const u32 csum_size = fs_info->csum_size;
>   	u32 orig_len = bio->bi_iter.bi_size;
> @@ -393,9 +393,6 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
>   	 *   directly.
>   	 */
>   	ASSERT(bio_op(bio) == REQ_OP_READ);
> -	path = btrfs_alloc_path();
> -	if (!path)
> -		return BLK_STS_RESOURCE;
>
>   	if (!dst) {
>   		struct btrfs_io_bio *btrfs_bio = btrfs_io_bio(bio);
> @@ -403,10 +400,8 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
>   		if (nblocks * csum_size > BTRFS_BIO_INLINE_CSUM_SIZE) {
>   			btrfs_bio->csum = kmalloc_array(nblocks, csum_size,
>   							GFP_NOFS);
> -			if (!btrfs_bio->csum) {
> -				btrfs_free_path(path);
> +			if (!btrfs_bio->csum)
>   				return BLK_STS_RESOURCE;
> -			}
>   		} else {
>   			btrfs_bio->csum = btrfs_bio->csum_inline;
>   		}
> @@ -420,7 +415,7 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
>   	 * kick the readahead for csum tree.
>   	 */
>   	if (nblocks > fs_info->csums_per_leaf)
> -		path->reada = READA_FORWARD;
> +		path.reada = READA_FORWARD;
>
>   	/*
>   	 * the free space stuff is only read when it hasn't been
> @@ -429,8 +424,8 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
>   	 * between reading the free space cache and updating the csum tree.
>   	 */
>   	if (btrfs_is_free_space_inode(BTRFS_I(inode))) {
> -		path->search_commit_root = 1;
> -		path->skip_locking = 1;
> +		path.search_commit_root = 1;
> +		path.skip_locking = 1;
>   	}
>
>   	for (cur_disk_bytenr = orig_disk_bytenr;
> @@ -453,7 +448,7 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
>   				fs_info->sectorsize_bits;
>   		csum_dst = csum + sector_offset * csum_size;
>
> -		count = search_csum_tree(fs_info, path, cur_disk_bytenr,
> +		count = search_csum_tree(fs_info, &path, cur_disk_bytenr,
>   					 search_len, csum_dst);
>   		if (count <= 0) {
>   			/*
> @@ -489,7 +484,7 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
>   		}
>   	}
>
> -	btrfs_free_path(path);
> +	btrfs_release_path(&path);
>   	return BLK_STS_OK;
>   }
>
> --
> 2.25.1
>
Nikolay Borisov June 30, 2021, 7:39 a.m. UTC | #3
On 29.06.21 14:56, Qu Wenruo wrote:
> 
> 
> On 2021/6/28 11:06 PM, Nikolay Borisov wrote:
>> While doing some performance characterization of a workoad reading ~80g
>> split among 4mb files via DIO I observed that btrfs_lookup_bio_sums was
>> using rather excessive cpu cycles:
>>
>> 95.97%--ksys_pread64
>> 95.96%--new_sync_read
>> 95.95%--iomap_dio_rw
>> 95.89%--iomap_apply
>> 72.25%--iomap_dio_bio_actor
>> 55.19%--iomap_dio_submit_bio
>> 55.13%--btrfs_submit_direct
>> 20.72%--btrfs_lookup_bio_sums
>> 7.44%-- kmem_cache_alloc
>>
>> Timing the workload yielded:
>> real 4m41.825s
>> user 0m14.321s
>> sys 10m38.419s
>>
>> Turns out this kmem_cache_alloc corresponds to the btrfs_path allocation
>> that happens in btrfs_lookup_bio_sums. Nothing in btrfs_lookup_bio_sums
>> warrants having the path be heap allocated. Just turn this allocation
>> into a stack based one. After the change performance changes
>> accordingly:
>>
>> real 4m21.742s
>> user 0m13.679s
>> sys 9m8.889s
>>
>> All in all there's a 20 seconds (~7%) reductino in real run time as well
>> as the time spent in the kernel is reduced by 1:30 minutes (~14%). And
>> btrfs_lookup_bio_sums ends up accounting for only 7.91% of cycles rather
>> than 20% before.
>>
>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> 
> I'm surprised that btrfs_path in kernel no longer needs any initialization.
> 
> I guess we can do a lot of more such conversion to use stack btrfs_path,
> just like what we did in btrfs-progs.
> 
> But I'm not that sure if 112 bytes stack memory usage increase is OK or
> not for other locations.
> 
> But since this one get a pretty good performance increase, and this
> function itself doesn't have much stack memory usage anyway, it looks ok
> to me.
> 
> Reviewed-by: Qu Wenruo <wqu@suse.com>
> 

Actually this patch makes most sense when SLUB debugging is enabled, as it
would seem btrfs is hit pretty hard in that case. In the real world, with
SLUB debugging disabled, it didn't make any noticeable difference. Given
that, the fact that I'm adding 112 bytes on the stack might bring more
harm than good, so I guess this patch can be dropped.

<snip>
David Sterba June 30, 2021, 10:14 a.m. UTC | #4
On Tue, Jun 29, 2021 at 07:56:34PM +0800, Qu Wenruo wrote:
> On 2021/6/28 11:06 PM, Nikolay Borisov wrote:
> > While doing some performance characterization of a workoad reading ~80g
> > split among 4mb files via DIO I observed that btrfs_lookup_bio_sums was
> > using rather excessive cpu cycles:
> >
> > 95.97%--ksys_pread64
> > 95.96%--new_sync_read
> > 95.95%--iomap_dio_rw
> > 95.89%--iomap_apply
> > 72.25%--iomap_dio_bio_actor
> > 55.19%--iomap_dio_submit_bio
> > 55.13%--btrfs_submit_direct
> > 20.72%--btrfs_lookup_bio_sums
> > 7.44%-- kmem_cache_alloc
> >
> > Timing the workload yielded:
> > real 4m41.825s
> > user 0m14.321s
> > sys 10m38.419s
> >
> > Turns out this kmem_cache_alloc corresponds to the btrfs_path allocation
> > that happens in btrfs_lookup_bio_sums. Nothing in btrfs_lookup_bio_sums
> > warrants having the path be heap allocated. Just turn this allocation
> > into a stack based one. After the change performance changes
> > accordingly:
> >
> > real 4m21.742s
> > user 0m13.679s
> > sys 9m8.889s
> >
> > All in all there's a 20 seconds (~7%) reductino in real run time as well
> > as the time spent in the kernel is reduced by 1:30 minutes (~14%). And
> > btrfs_lookup_bio_sums ends up accounting for only 7.91% of cycles rather
> > than 20% before.
> >
> > Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> 
> I'm surprised that btrfs_path in kernel no longer needs any initialization.
> 
> I guess we can do a lot of more such conversion to use stack btrfs_path,
> just like what we did in btrfs-progs.
> 
> But I'm not that sure if 112 bytes stack memory usage increase is OK or
> not for other locations.

In justified cases 112 bytes on-stack instead of an allocation could be
considered, but not in general. That progs do that is because it's
userspace, where handling memory is way easier and stack sizes are
megabytes, not a few kilobytes like in the kernel.

The btrfs_path is allocated from slab and the performance hit is
negligible on the release build, as Nikolay found out.

> But since this one get a pretty good performance increase, and this
> function itself doesn't have much stack memory usage anyway, it looks ok
> to me.

The stack size needs to take into account the whole call chain that
could end up in this function, and in the IO paths we need to be
conservative in case there are other layers on top or under btrfs that
could need the stack space too.

Patch

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index df6631eefc65..0358ca0a69c8 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -367,7 +367,7 @@  blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
-	struct btrfs_path *path;
+	struct btrfs_path path = {};
 	const u32 sectorsize = fs_info->sectorsize;
 	const u32 csum_size = fs_info->csum_size;
 	u32 orig_len = bio->bi_iter.bi_size;
@@ -393,9 +393,6 @@  blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
 	 *   directly.
 	 */
 	ASSERT(bio_op(bio) == REQ_OP_READ);
-	path = btrfs_alloc_path();
-	if (!path)
-		return BLK_STS_RESOURCE;

 	if (!dst) {
 		struct btrfs_io_bio *btrfs_bio = btrfs_io_bio(bio);
@@ -403,10 +400,8 @@  blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
 		if (nblocks * csum_size > BTRFS_BIO_INLINE_CSUM_SIZE) {
 			btrfs_bio->csum = kmalloc_array(nblocks, csum_size,
 							GFP_NOFS);
-			if (!btrfs_bio->csum) {
-				btrfs_free_path(path);
+			if (!btrfs_bio->csum)
 				return BLK_STS_RESOURCE;
-			}
 		} else {
 			btrfs_bio->csum = btrfs_bio->csum_inline;
 		}
@@ -420,7 +415,7 @@  blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
 	 * kick the readahead for csum tree.
 	 */
 	if (nblocks > fs_info->csums_per_leaf)
-		path->reada = READA_FORWARD;
+		path.reada = READA_FORWARD;

 	/*
 	 * the free space stuff is only read when it hasn't been
@@ -429,8 +424,8 @@  blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
 	 * between reading the free space cache and updating the csum tree.
 	 */
 	if (btrfs_is_free_space_inode(BTRFS_I(inode))) {
-		path->search_commit_root = 1;
-		path->skip_locking = 1;
+		path.search_commit_root = 1;
+		path.skip_locking = 1;
 	}

 	for (cur_disk_bytenr = orig_disk_bytenr;
@@ -453,7 +448,7 @@  blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
 				fs_info->sectorsize_bits;
 		csum_dst = csum + sector_offset * csum_size;

-		count = search_csum_tree(fs_info, path, cur_disk_bytenr,
+		count = search_csum_tree(fs_info, &path, cur_disk_bytenr,
 					 search_len, csum_dst);
 		if (count <= 0) {
 			/*
@@ -489,7 +484,7 @@  blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
 		}
 	}

-	btrfs_free_path(path);
+	btrfs_release_path(&path);
 	return BLK_STS_OK;
 }