[0/2] btrfs: simplify extent buffer writeback

Message ID: cover.1744822090.git.josef@toxicpanda.com

Message

Josef Bacik April 16, 2025, 4:51 p.m. UTC
Hello,

We currently have two different paths for writing out extent buffers: a subpage
path and a normal path.  This has resulted in subtle bugs in the subpage code
that took us a while to figure out.  Additionally, we have a complex
interaction of: get the folio, find the eb, check whether we've already started
writing that eb out, and only then write out the eb.

We already have a radix tree for our extent buffers, so we can use it much the
way the pagecache uses its radix tree: tag the buffers DIRTY when they're
dirty, and WRITEBACK when we start writing them out.
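
As a rough illustration of the idea (the tag values and helper names here are
made up for the example, and locking is omitted; this is not the actual patch
code):

#include <linux/radix-tree.h>

/* Hypothetical tag numbers, mirroring PAGECACHE_TAG_DIRTY/_WRITEBACK. */
#define EB_TAG_DIRTY		0
#define EB_TAG_WRITEBACK	1

/*
 * Mark an eb dirty in the buffer radix, the same way the pagecache tags
 * dirty pages.  Callers would hold the lock protecting the tree.
 */
static void eb_tag_dirty(struct radix_tree_root *buffer_radix,
			 unsigned long index)
{
	radix_tree_tag_set(buffer_radix, index, EB_TAG_DIRTY);
}

/* Transition an eb from dirty to under-writeback. */
static void eb_tag_writeback(struct radix_tree_root *buffer_radix,
			     unsigned long index)
{
	radix_tree_tag_clear(buffer_radix, index, EB_TAG_DIRTY);
	radix_tree_tag_set(buffer_radix, index, EB_TAG_WRITEBACK);
}

Completion of writeback would then clear the WRITEBACK tag again, just as the
pagecache does.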

The unfortunate part is that we have to re-implement folio_batch for extent
buffers, which is where most of the new code comes from.  The good part is
that we are now down to a single path for writing out extent buffers; it's far
simpler, and in fact quite a bit faster now that we don't have all of these
folio->eb transitions to deal with.
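
For reference, a minimal sketch of what such a batch might look like, modeled
on struct folio_batch from include/linux/pagevec.h (the eb_batch name and the
batch size are assumptions for illustration):

#define EB_BATCH_SIZE 16	/* assumed; folio_batch uses PAGEVEC_SIZE (31) */

struct extent_buffer;		/* defined in fs/btrfs/extent_io.h */

struct eb_batch {
	unsigned int nr;
	struct extent_buffer *ebs[EB_BATCH_SIZE];
};

static inline void eb_batch_init(struct eb_batch *batch)
{
	batch->nr = 0;
}

static inline unsigned int eb_batch_count(const struct eb_batch *batch)
{
	return batch->nr;
}

/* Returns the number of slots still free; zero means flush the batch. */
static inline unsigned int eb_batch_add(struct eb_batch *batch,
					struct extent_buffer *eb)
{
	batch->ebs[batch->nr++] = eb;
	return EB_BATCH_SIZE - batch->nr;
}

folio_batch_add() has the same contract: it returns the number of slots still
available, so callers know when to process and reinit the batch.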

I ran this through fsperf on a VM with 8 CPUs and 16GiB of RAM.  I used
smallfiles100k, but reduced the number of files to 1k to make it run faster.
The results are as follows, with statistically significant improvements
marked with *; there were no regressions.  fsperf was run with -n 10 for both
runs, so the baseline is the average of 10 runs and the test is the average
of 10 runs.

smallfiles100k results
      metric           baseline       current        stdev            diff
================================================================================
avg_commit_ms               68.58         58.44          3.35   -14.79% *
commits                    270.60        254.70         16.24    -5.88%
dev_read_iops                  48            48             0     0.00%
dev_read_kbytes              1044          1044             0     0.00%
dev_write_iops          866117.90     850028.10      14292.20    -1.86%
dev_write_kbytes      10939976.40   10605701.20     351330.32    -3.06%
elapsed                     49.30            33          1.64   -33.06% *
end_state_mount_ns    41251498.80   35773220.70    2531205.32   -13.28% *
end_state_umount_ns      1.90e+09      1.50e+09   14186226.85   -21.38% *
max_commit_ms                 139        111.60          9.72   -19.71% *
sys_cpu                      4.90          3.86          0.88   -21.29%
write_bw_bytes        42935768.20   64318451.10    1609415.05    49.80% *
write_clat_ns_mean      366431.69     243202.60      14161.98   -33.63% *
write_clat_ns_p50        49203.20         20992        264.40   -57.34% *
write_clat_ns_p99          827392     653721.60      65904.74   -20.99% *
write_io_kbytes           2035940       2035940             0     0.00%
write_iops               10482.37      15702.75        392.92    49.80% *
write_lat_ns_max         1.01e+08      90516129    3910102.06   -10.29% *
write_lat_ns_mean       366556.19     243308.48      14154.51   -33.62% *

As you can see, we get about a 33% decrease in runtime and a 50% increase in
throughput, which is pretty significant.  Thanks,

Josef

Josef Bacik (2):
  btrfs: set DIRTY and WRITEBACK tags on the buffer_radix
  btrfs: use buffer radix for extent buffer writeback operations

 fs/btrfs/disk-io.c     |  10 ++
 fs/btrfs/extent_io.c   | 385 ++++++++++++++++++++++-------------------
 fs/btrfs/extent_io.h   |   1 +
 fs/btrfs/transaction.c |   5 +-
 4 files changed, 224 insertions(+), 177 deletions(-)

Comments

Josef Bacik April 16, 2025, 5:54 p.m. UTC | #1
On Wed, Apr 16, 2025 at 12:51:05PM -0400, Josef Bacik wrote:
> Hello,
> 
> [...]
> 
> As you can see, we get about a 33% decrease in runtime and a 50% increase in
> throughput, which is pretty significant.  Thanks,

Ignore this for now; the xarray and radix tree APIs aren't quite one to one,
so I've got to convert the buffer radix to a proper xarray first.  Thanks,

Josef
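
For illustration, once that conversion is done, the xarray-native version of
the tagging sketched above might look roughly like this (the mark assignments
and helper names are assumptions, not the eventual conversion):

#include <linux/xarray.h>

struct extent_buffer;		/* defined in fs/btrfs/extent_io.h */

/* Illustrative marks; xarray provides XA_MARK_0 through XA_MARK_2. */
#define EB_MARK_DIRTY		XA_MARK_0
#define EB_MARK_WRITEBACK	XA_MARK_1

static void eb_xa_tag_dirty(struct xarray *buffer_tree, unsigned long index)
{
	xa_set_mark(buffer_tree, index, EB_MARK_DIRTY);
}

/*
 * Walk only the dirty ebs: the xarray-native replacement for a tagged
 * radix tree gang lookup.
 */
static void writeback_dirty_ebs(struct xarray *buffer_tree)
{
	struct extent_buffer *eb;
	unsigned long index;

	xa_for_each_marked(buffer_tree, index, eb, EB_MARK_DIRTY) {
		/* submit @eb for writeback here */
	}
}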