
btrfs: zlib: do not do unnecessary page copying for compression

Message ID 0a24cc8a48821e8cf3bd01263b453c4cbc22d832.1716801849.git.wqu@suse.com (mailing list archive)
State New, archived
Series btrfs: zlib: do not do unnecessary page copying for compression

Commit Message

Qu Wenruo May 27, 2024, 9:24 a.m. UTC
[BUG]
In function zlib_compress_folios(), we handle the input by:

- If there are multiple pages left
  We copy the page content into workspace->buf, and use workspace->buf
  as input for compression.

  But on x86_64 (which doesn't support dfltcc), that buffer size is just
  one page, so we're wasting our CPU time copying the page for no
  benefit.

- If there is only one page left
  We use the mapped page address as input for compression.

The problem is that this means we copy the whole input range except the
last page (up to 124K of data copied), without much obvious benefit.

Meanwhile the cost is pretty obvious.
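
For reference, the input refill currently looks roughly like this (a
simplified sketch based on the removed lines in the diff below, with the
folio get/map/unmap details trimmed):

if (workspace->strm.avail_in == 0) {
	bytes_left = len - workspace->strm.total_in;
	in_buf_folios = min(DIV_ROUND_UP(bytes_left, PAGE_SIZE),
			    workspace->buf_size / PAGE_SIZE);
	if (in_buf_folios > 1) {
		/* Multi-page buffer (dfltcc): copy every input page. */
		for (i = 0; i < in_buf_folios; i++) {
			/* ... get and map the folio at @start ... */
			copy_page(workspace->buf + i * PAGE_SIZE, data_in);
			start += PAGE_SIZE;
		}
		workspace->strm.next_in = workspace->buf;
	} else {
		/* Single-page buffer: feed the mapped page directly. */
		/* ... get and map the folio at @start ... */
		start += PAGE_SIZE;
		workspace->strm.next_in = data_in;
	}
	workspace->strm.avail_in = min(bytes_left,
				       (unsigned long)workspace->buf_size);
}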

[POSSIBLE REASON]
The possible reason may be related to the support of s390 hardware zlib
acceleration (dfltcc), as we allocate 4 pages (4 * 4K) as the workspace
input buffer just for s390.

[FIX]
I checked the dfltcc code, and there seems to be no requirement on the
input buffer size.
The function dfltcc_can_deflate() only checks:

- Whether the compression settings are supported
  Only level/w_bits/strategy/level_mask are checked.

- Whether the hardware supports it

There is no mention of the input buffer size at all, thus I believe there
is no need to waste time doing the page copying.

Maybe the hardware acceleration on s390 is so good that it can offset
the page copying cost, but it's definitely a penalty for non-s390
systems.

So fix the problem by:

- Use the same buffer size
  No matter whether dfltcc support is enabled or not

- Always use the page address as input (see the sketch below)
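
A sketch of the resulting refill, matching the added lines in the diff
below (mapping and error handling kept, everything else unchanged):

if (workspace->strm.avail_in == 0) {
	bytes_left = len - workspace->strm.total_in;
	if (data_in) {
		kunmap_local(data_in);
		folio_put(in_folio);
		data_in = NULL;
	}
	ret = btrfs_compress_filemap_get_folio(mapping, start, &in_folio);
	if (ret < 0)
		goto out;
	data_in = kmap_local_folio(in_folio, 0);
	start += PAGE_SIZE;
	workspace->strm.next_in = data_in;
	workspace->strm.avail_in = min(bytes_left, PAGE_SIZE);
}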

Cc: linux-s390@vger.kernel.org
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/zlib.c | 67 +++++++++++--------------------------------------
 1 file changed, 14 insertions(+), 53 deletions(-)

Comments

Zaslonko Mikhail May 27, 2024, 4:25 p.m. UTC | #1
Hello Qu,

I remember implementing btrfs zlib changes related to s390 dfltcc compression support a while ago:
https://lwn.net/Articles/808809/

The workspace buffer size was indeed enlarged for performance reasons.

Please see my comments below.

On 27.05.2024 11:24, Qu Wenruo wrote:
> [BUG]
> In function zlib_compress_folios(), we handle the input by:
> 
> - If there are multiple pages left
>   We copy the page content into workspace->buf, and use workspace->buf
>   as input for compression.
> 
>   But on x86_64 (which doesn't support dfltcc), that buffer size is just
>   one page, so we're wasting our CPU time copying the page for no
>   benefit.
> 
> - If there is only one page left
>   We use the mapped page address as input for compression.
> 
> The problem is, this means we will copy the whole input range except the
> last page (can be as large as 124K), without much obvious benefit.
> 
> Meanwhile the cost is pretty obvious.

Actually, the behavior for kernels without dfltcc support (dfltcc is
currently available on s390 only) should not be affected.
We copy input pages into workspace->buf only if the buffer size is larger than one page.
At least it worked this way after my original btrfs zlib patch:
https://lwn.net/ml/linux-kernel/20200108105103.29028-1-zaslonko@linux.ibm.com/
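
To spell the point out, with a single-page workspace buffer the min() in
the code quoted below always evaluates to 1 for any bytes_left >= 1:

	in_buf_folios = min(DIV_ROUND_UP(bytes_left, PAGE_SIZE),
			    workspace->buf_size / PAGE_SIZE)
	              = min(>= 1, PAGE_SIZE / PAGE_SIZE)
	              = 1

so the "in_buf_folios > 1" copy branch is never taken and the mapped page
address is used directly.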

Has this behavior somehow changed after your page->folio conversion performed for btrfs? 
https://lore.kernel.org/all/cover.1706521511.git.wqu@suse.com/

Am I missing something?

> 
> [POSSIBLE REASON]
> The possible reason may be related to the support of S390 hardware zlib
> decompression acceleration.
> 
> As we allocate 4 pages (4 * 4K) as workspace input buffer just for s390.
> 
> [FIX]
> I checked the dfltcc code, there seems to be no requirement on the
> input buffer size.
> The function dfltcc_can_deflate() only checks:
> 
> - If the compression settings are supported
>   Only level/w_bits/strategy/level_mask is checked.
> 
> - If the hardware supports
> 
> No mention at all on the input buffer size, thus I believe there is no
> need to waste time doing the page copying.
> 
> Maybe the hardware acceleration is so good for s390 that they can offset
> the page copying cost, but it's definitely a penalty for non-s390
> systems.
> 
> So fix the problem by:
> 
> - Use the same buffer size
>   No matter if dfltcc support is enabled or not
> 
> - Always use page address as input
> 
> Cc: linux-s390@vger.kernel.org
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/zlib.c | 67 +++++++++++--------------------------------------
>  1 file changed, 14 insertions(+), 53 deletions(-)
> 
> diff --git a/fs/btrfs/zlib.c b/fs/btrfs/zlib.c
> index d9e5c88a0f85..9c88a841a060 100644
> --- a/fs/btrfs/zlib.c
> +++ b/fs/btrfs/zlib.c
> @@ -65,21 +65,8 @@ struct list_head *zlib_alloc_workspace(unsigned int level)
>  			zlib_inflate_workspacesize());
>  	workspace->strm.workspace = kvzalloc(workspacesize, GFP_KERNEL | __GFP_NOWARN);
>  	workspace->level = level;
> -	workspace->buf = NULL;
> -	/*
> -	 * In case of s390 zlib hardware support, allocate lager workspace
> -	 * buffer. If allocator fails, fall back to a single page buffer.
> -	 */
> -	if (zlib_deflate_dfltcc_enabled()) {
> -		workspace->buf = kmalloc(ZLIB_DFLTCC_BUF_SIZE,
> -					 __GFP_NOMEMALLOC | __GFP_NORETRY |
> -					 __GFP_NOWARN | GFP_NOIO);
> -		workspace->buf_size = ZLIB_DFLTCC_BUF_SIZE;
> -	}
> -	if (!workspace->buf) {
> -		workspace->buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
> -		workspace->buf_size = PAGE_SIZE;
> -	}
> +	workspace->buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
> +	workspace->buf_size = PAGE_SIZE;
>  	if (!workspace->strm.workspace || !workspace->buf)
>  		goto fail;
>  
> @@ -103,7 +90,6 @@ int zlib_compress_folios(struct list_head *ws, struct address_space *mapping,
>  	struct folio *in_folio = NULL;
>  	struct folio *out_folio = NULL;
>  	unsigned long bytes_left;
> -	unsigned int in_buf_folios;
>  	unsigned long len = *total_out;
>  	unsigned long nr_dest_folios = *out_folios;
>  	const unsigned long max_out = nr_dest_folios * PAGE_SIZE;
> @@ -130,7 +116,6 @@ int zlib_compress_folios(struct list_head *ws, struct address_space *mapping,
>  	folios[0] = out_folio;
>  	nr_folios = 1;
>  
> -	workspace->strm.next_in = workspace->buf;
>  	workspace->strm.avail_in = 0;
>  	workspace->strm.next_out = cfolio_out;
>  	workspace->strm.avail_out = PAGE_SIZE;
> @@ -142,43 +127,19 @@ int zlib_compress_folios(struct list_head *ws, struct address_space *mapping,
>  		 */
>  		if (workspace->strm.avail_in == 0) {
>  			bytes_left = len - workspace->strm.total_in;
> -			in_buf_folios = min(DIV_ROUND_UP(bytes_left, PAGE_SIZE),
> -					    workspace->buf_size / PAGE_SIZE);

	doesn't this always set *in_buf_folios* to 1 in case of no dfltcc support (single-page workspace buffer)?

> -			if (in_buf_folios > 1) {
> -				int i;
> -
> -				for (i = 0; i < in_buf_folios; i++) {
> -					if (data_in) {
> -						kunmap_local(data_in);
> -						folio_put(in_folio);
> -						data_in = NULL;
> -					}
> -					ret = btrfs_compress_filemap_get_folio(mapping,
> -							start, &in_folio);
> -					if (ret < 0)
> -						goto out;
> -					data_in = kmap_local_folio(in_folio, 0);
> -					copy_page(workspace->buf + i * PAGE_SIZE,
> -						  data_in);
> -					start += PAGE_SIZE;
> -				}
> -				workspace->strm.next_in = workspace->buf;
> -			} else {
> -				if (data_in) {
> -					kunmap_local(data_in);
> -					folio_put(in_folio);
> -					data_in = NULL;
> -				}
> -				ret = btrfs_compress_filemap_get_folio(mapping,
> -						start, &in_folio);
> -				if (ret < 0)
> -					goto out;
> -				data_in = kmap_local_folio(in_folio, 0);
> -				start += PAGE_SIZE;
> -				workspace->strm.next_in = data_in;
> +			if (data_in) {
> +				kunmap_local(data_in);
> +				folio_put(in_folio);
> +				data_in = NULL;
>  			}
> -			workspace->strm.avail_in = min(bytes_left,
> -						       (unsigned long) workspace->buf_size);
> +			ret = btrfs_compress_filemap_get_folio(mapping,
> +					start, &in_folio);
> +			if (ret < 0)
> +				goto out;
> +			data_in = kmap_local_folio(in_folio, 0);
> +			start += PAGE_SIZE;
> +			workspace->strm.next_in = data_in;
> +			workspace->strm.avail_in = min(bytes_left, PAGE_SIZE);
>  		}
>  
>  		ret = zlib_deflate(&workspace->strm, Z_SYNC_FLUSH);

Thanks,
Mikhail
Qu Wenruo May 27, 2024, 10:09 p.m. UTC | #2
On 2024/5/28 01:55, Zaslonko Mikhail wrote:
> Hello Qu,
>
> I remember implementing btrfs zlib changes related to s390 dfltcc compression support a while ago:
> https://lwn.net/Articles/808809/
>
> The workspace buffer size was indeed enlarged for performance reasons.
>
> Please see my comments below.
>
> On 27.05.2024 11:24, Qu Wenruo wrote:
>> [BUG]
>> In function zlib_compress_folios(), we handle the input by:
>>
>> - If there are multiple pages left
>>    We copy the page content into workspace->buf, and use workspace->buf
>>    as input for compression.
>>
>>    But on x86_64 (which doesn't support dfltcc), that buffer size is just
>>    one page, so we're wasting our CPU time copying the page for no
>>    benefit.
>>
>> - If there is only one page left
>>    We use the mapped page address as input for compression.
>>
>> The problem is, this means we will copy the whole input range except the
>> last page (can be as large as 124K), without much obvious benefit.
>>
>> Meanwhile the cost is pretty obvious.
>
> Actually, the behavior for kernels w/o dfltcc support (currently available on s390
> only) should not be affected.
> We copy input pages to the workspace->buf only if the buffer size is larger than 1 page.
> At least it worked this way after my original btrfs zlib patch:
> https://lwn.net/ml/linux-kernel/20200108105103.29028-1-zaslonko@linux.ibm.com/
>
> Has this behavior somehow changed after your page->folio conversion performed for btrfs?
> https://lore.kernel.org/all/cover.1706521511.git.wqu@suse.com/

My bad, I forgot that buf_size for non-s390 systems is fixed to one
page, thus the page copy is not utilized on x86_64.

But I'm still wondering: if we do not go with a 4-page buffer, how much
of a performance penalty would there be?

One of the objectives is to prepare for the incoming sector-perfect
subpage compression support, thus I'm re-checking the existing
compression code and preparing to change it to be subpage compatible.

If we can simplify the behavior without too large a performance penalty,
can we consider just using a single page as the buffer?

Thanks,
Qu
Zaslonko Mikhail May 28, 2024, 10:44 a.m. UTC | #3
Hello Qu,

On 28.05.2024 00:09, Qu Wenruo wrote:
> 
> 
> On 2024/5/28 01:55, Zaslonko Mikhail wrote:
>> Hello Qu,
>>
>> I remember implementing btrfs zlib changes related to s390 dfltcc compression support a while ago:
>> https://lwn.net/Articles/808809/
>>
>> The workspace buffer size was indeed enlarged for performance reasons.
>>
>> Please see my comments below.
>>
>> On 27.05.2024 11:24, Qu Wenruo wrote:
>>> [BUG]
>>> In function zlib_compress_folios(), we handle the input by:
>>>
>>> - If there are multiple pages left
>>>    We copy the page content into workspace->buf, and use workspace->buf
>>>    as input for compression.
>>>
>>>    But on x86_64 (which doesn't support dfltcc), that buffer size is just
>>>    one page, so we're wasting our CPU time copying the page for no
>>>    benefit.
>>>
>>> - If there is only one page left
>>>    We use the mapped page address as input for compression.
>>>
>>> The problem is, this means we will copy the whole input range except the
>>> last page (can be as large as 124K), without much obvious benefit.
>>>
>>> Meanwhile the cost is pretty obvious.
>>
>> Actually, the behavior for kernels w/o dfltcc support (currently available on s390
>> only) should not be affected.
>> We copy input pages to the workspace->buf only if the buffer size is larger than 1 page.
>> At least it worked this way after my original btrfs zlib patch:
>> https://lwn.net/ml/linux-kernel/20200108105103.29028-1-zaslonko@linux.ibm.com/
>>
>> Has this behavior somehow changed after your page->folio conversion performed for btrfs?
>> https://lore.kernel.org/all/cover.1706521511.git.wqu@suse.com/
> 
> My bad, I forgot that the buf_size for non-S390 systems is fixed to one
> page thus the page copy is not utilized for x86_64.
> 
> But I'm still wondering if we do not go 4 pages as buffer, how much
> performance penalty would there be?
> 
> One of the objective is to prepare for the incoming sector perfect
> subpage compression support, thus I'm re-checking the existing
> compression code, preparing to change them to be subpage compatible.
> 
> If we can simplify the behavior without too large performance penalty,
> can we consider just using one single page as buffer?

Based on my earlier estimates, the bigger buffer provided up to a 60%
performance gain for inflate and up to 30% for deflate on s390 with dfltcc support.
I don't think giving that away for simplification would be a good idea.

> 
> Thanks,
> Qu
David Sterba May 28, 2024, 9:43 p.m. UTC | #4
On Tue, May 28, 2024 at 12:44:19PM +0200, Zaslonko Mikhail wrote:
> > But I'm still wondering if we do not go 4 pages as buffer, how much
> > performance penalty would there be?
> > 
> > One of the objective is to prepare for the incoming sector perfect
> > subpage compression support, thus I'm re-checking the existing
> > compression code, preparing to change them to be subpage compatible.
> > 
> > If we can simplify the behavior without too large performance penalty,
> > can we consider just using one single page as buffer?
> 
> Based on my earlier estimates, bigger buffer provided up to 60% performance for inflate and up to 30% for
> deflate on s390 with dfltcc support.
> I don't think giving it away for simplification would be a good idea.

60% and 30% sound like significant gains, I agree this takes precedence
over code simplification. Eventually the s390-optimized case can be
moved to a separate function, taken when the conditions are satisfied, so
it's not mixed with the page-by-page code.
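
A rough illustration of such a split (editor's sketch: the helper name and
signature are hypothetical, only the loop body is taken from the existing
dfltcc branch):

/*
 * Hypothetical helper: batch up to @in_buf_folios input pages into
 * workspace->buf for the s390 dfltcc case, so the main loop keeps a
 * clean page-by-page path.
 */
static int zlib_copy_dfltcc_input(struct workspace *workspace,
				  struct address_space *mapping,
				  u64 *start_ret, unsigned int in_buf_folios)
{
	u64 start = *start_ret;
	struct folio *in_folio;
	void *data_in;
	unsigned int i;
	int ret;

	for (i = 0; i < in_buf_folios; i++) {
		ret = btrfs_compress_filemap_get_folio(mapping, start,
						       &in_folio);
		if (ret < 0)
			return ret;
		data_in = kmap_local_folio(in_folio, 0);
		copy_page(workspace->buf + i * PAGE_SIZE, data_in);
		kunmap_local(data_in);
		folio_put(in_folio);
		start += PAGE_SIZE;
	}
	workspace->strm.next_in = workspace->buf;
	*start_ret = start;
	return 0;
}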
Qu Wenruo May 28, 2024, 10:02 p.m. UTC | #5
On 2024/5/29 07:13, David Sterba wrote:
> On Tue, May 28, 2024 at 12:44:19PM +0200, Zaslonko Mikhail wrote:
>>> But I'm still wondering if we do not go 4 pages as buffer, how much
>>> performance penalty would there be?
>>>
>>> One of the objective is to prepare for the incoming sector perfect
>>> subpage compression support, thus I'm re-checking the existing
>>> compression code, preparing to change them to be subpage compatible.
>>>
>>> If we can simplify the behavior without too large performance penalty,
>>> can we consider just using one single page as buffer?
>>
>> Based on my earlier estimates, bigger buffer provided up to 60% performance for inflate and up to 30% for
>> deflate on s390 with dfltcc support.
>> I don't think giving it away for simplification would be a good idea.
> 
> 60% and 30% sound like significant gain, I agree this takes precedence
> over code simplification. Eventually the s390 optimized case can be
> moved to a separate function if the conditions are satisfied so it's not
> mixed with the page-by-page code.

Thanks a lot for the numbers, they are indeed impressive.

I'll definitely leave the larger buffer support untouched.
And thankfully, since s390 only supports a 4K page size, we won't need to
bother too much with that routine.

Thanks,
Qu

Patch

diff --git a/fs/btrfs/zlib.c b/fs/btrfs/zlib.c
index d9e5c88a0f85..9c88a841a060 100644
--- a/fs/btrfs/zlib.c
+++ b/fs/btrfs/zlib.c
@@ -65,21 +65,8 @@  struct list_head *zlib_alloc_workspace(unsigned int level)
 			zlib_inflate_workspacesize());
 	workspace->strm.workspace = kvzalloc(workspacesize, GFP_KERNEL | __GFP_NOWARN);
 	workspace->level = level;
-	workspace->buf = NULL;
-	/*
-	 * In case of s390 zlib hardware support, allocate lager workspace
-	 * buffer. If allocator fails, fall back to a single page buffer.
-	 */
-	if (zlib_deflate_dfltcc_enabled()) {
-		workspace->buf = kmalloc(ZLIB_DFLTCC_BUF_SIZE,
-					 __GFP_NOMEMALLOC | __GFP_NORETRY |
-					 __GFP_NOWARN | GFP_NOIO);
-		workspace->buf_size = ZLIB_DFLTCC_BUF_SIZE;
-	}
-	if (!workspace->buf) {
-		workspace->buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
-		workspace->buf_size = PAGE_SIZE;
-	}
+	workspace->buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	workspace->buf_size = PAGE_SIZE;
 	if (!workspace->strm.workspace || !workspace->buf)
 		goto fail;
 
@@ -103,7 +90,6 @@  int zlib_compress_folios(struct list_head *ws, struct address_space *mapping,
 	struct folio *in_folio = NULL;
 	struct folio *out_folio = NULL;
 	unsigned long bytes_left;
-	unsigned int in_buf_folios;
 	unsigned long len = *total_out;
 	unsigned long nr_dest_folios = *out_folios;
 	const unsigned long max_out = nr_dest_folios * PAGE_SIZE;
@@ -130,7 +116,6 @@  int zlib_compress_folios(struct list_head *ws, struct address_space *mapping,
 	folios[0] = out_folio;
 	nr_folios = 1;
 
-	workspace->strm.next_in = workspace->buf;
 	workspace->strm.avail_in = 0;
 	workspace->strm.next_out = cfolio_out;
 	workspace->strm.avail_out = PAGE_SIZE;
@@ -142,43 +127,19 @@  int zlib_compress_folios(struct list_head *ws, struct address_space *mapping,
 		 */
 		if (workspace->strm.avail_in == 0) {
 			bytes_left = len - workspace->strm.total_in;
-			in_buf_folios = min(DIV_ROUND_UP(bytes_left, PAGE_SIZE),
-					    workspace->buf_size / PAGE_SIZE);
-			if (in_buf_folios > 1) {
-				int i;
-
-				for (i = 0; i < in_buf_folios; i++) {
-					if (data_in) {
-						kunmap_local(data_in);
-						folio_put(in_folio);
-						data_in = NULL;
-					}
-					ret = btrfs_compress_filemap_get_folio(mapping,
-							start, &in_folio);
-					if (ret < 0)
-						goto out;
-					data_in = kmap_local_folio(in_folio, 0);
-					copy_page(workspace->buf + i * PAGE_SIZE,
-						  data_in);
-					start += PAGE_SIZE;
-				}
-				workspace->strm.next_in = workspace->buf;
-			} else {
-				if (data_in) {
-					kunmap_local(data_in);
-					folio_put(in_folio);
-					data_in = NULL;
-				}
-				ret = btrfs_compress_filemap_get_folio(mapping,
-						start, &in_folio);
-				if (ret < 0)
-					goto out;
-				data_in = kmap_local_folio(in_folio, 0);
-				start += PAGE_SIZE;
-				workspace->strm.next_in = data_in;
+			if (data_in) {
+				kunmap_local(data_in);
+				folio_put(in_folio);
+				data_in = NULL;
 			}
-			workspace->strm.avail_in = min(bytes_left,
-						       (unsigned long) workspace->buf_size);
+			ret = btrfs_compress_filemap_get_folio(mapping,
+					start, &in_folio);
+			if (ret < 0)
+				goto out;
+			data_in = kmap_local_folio(in_folio, 0);
+			start += PAGE_SIZE;
+			workspace->strm.next_in = data_in;
+			workspace->strm.avail_in = min(bytes_left, PAGE_SIZE);
 		}
 
 		ret = zlib_deflate(&workspace->strm, Z_SYNC_FLUSH);