Message ID: cover.1728608421.git.anand.jain@oracle.com (mailing list archive)
Series: raid1 balancing methods
On 11/10/24 8:19 am, Anand Jain wrote:
> v2:
> 1. Move new features to CONFIG_BTRFS_EXPERIMENTAL instead of
>    CONFIG_BTRFS_DEBUG.
> 2. Correct the typo from %est_wait to %best_wait.
> 3. Initialize %best_wait to U64_MAX and remove the check for 0.
> 4. Implement rotation with a minimum contiguous read threshold before
>    switching to the next stripe. Configure this using:
>
>      echo rotation:[min_contiguous_read] > /sys/fs/btrfs/<uuid>/read_policy
>
>    The default value is the sector size, and the min_contiguous_read
>    value must be a multiple of the sector size.
>
> 5. Tested FIO random read/write and defrag compression workloads with
>    min_contiguous_read set to sector size, 192k, and 256k.
>
> The rotation RAID1 balancing method is better for multi-process
> workloads such as fio and also single-process workloads such as
> defragmentation.
>
> $ fio --filename=/btrfs/foo --size=5Gi --direct=1 --rw=randrw --bs=4k \
>       --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 \
>       --time_based --group_reporting --name=iops-test-job --eta-newline=1
>
> |         |           |           | Read I/O count  |
> |         | Read      | Write     | devid1 | devid2 |
> |---------|-----------|-----------|--------|--------|
> | pid     | 20.3MiB/s | 20.5MiB/s | 313895 | 313895 |
> | rotation|           |           |        |        |
> |     4096| 20.4MiB/s | 20.5MiB/s | 313895 | 313895 |
> |   196608| 20.2MiB/s | 20.2MiB/s | 310152 | 310175 |
> |   262144| 20.3MiB/s | 20.4MiB/s | 312180 | 312191 |
> | latency | 18.4MiB/s | 18.4MiB/s | 272980 | 291683 |
> | devid:1 | 14.8MiB/s | 14.9MiB/s | 456376 | 0      |
>
> The rotation RAID1 balancing technique performs more than 2x better for
> single-process defrag.
>
> $ time -p btrfs filesystem defrag -r -f -c /btrfs
>
> |         | Time  | Read I/O Count  |
> |         | Real  | devid1 | devid2 |
> |---------|-------|--------|--------|
> | pid     | 18.00s| 3800   | 0      |
> | rotation|       |        |        |
> |     4096|  8.95s| 1900   | 1901   |
> |   196608|  8.50s| 1881   | 1919   |
> |   262144|  8.80s| 1881   | 1919   |
> | latency | 17.18s| 3800   | 0      |
> | devid:2 | 17.48s| 0      | 3800   |

Copy and paste error. Please ignore the paragraph below. Thx.

---vvv--- ignore ---vvv---

> Rotation keeps all devices active, and for now, the Rotation RAID1
> balancing method is preferable as default. More workload testing is
> needed while the code is EXPERIMENTAL.
> While Latency is better during a failing/unstable block layer transport.
> As of now these two techniques need to be independently tested with
> different workloads, and in the long term we should merge them into a
> unified heuristic.

---^^^--- ignore ---^^^---

> Rotation keeps all devices active, and for now, the Rotation RAID1
> balancing method should be the default. More workload testing is needed
> while the code is EXPERIMENTAL.
>
> Latency is smarter with an unstable block layer transport.
>
> Both techniques need independent testing across workloads, with the
> goal of eventually merging them into a unified approach for the long
> term.
>
> Devid is a hands-on approach that provides manual or user-space script
> control.
>
> These RAID1 balancing methods are tunable via the sysfs knob.
> The mount -o option and btrfs properties are under consideration.
>
> Thx.
>
> --------- original v1 ------------
>
> The RAID1 balancing methods help distribute read I/O across devices,
> and this patch set introduces three balancing methods: rotation,
> latency, and devid. These methods are enabled under the
> `CONFIG_BTRFS_DEBUG` config option and sit on top of the previously
> added `/sys/fs/btrfs/<UUID>/read_policy` interface to configure the
> desired RAID1 read balancing method.
>
> I've tested these patches using fio and filesystem defragmentation
> workloads on a two-device RAID1 setup (with both data and metadata
> mirrored across identical devices). I tracked device read counts by
> extracting stats from `/sys/devices/<..>/stat` for each device. Below
> is a summary of the results, with each result the average of three
> iterations.
>
> A typical generic random rw workload:
>
> $ fio --filename=/btrfs/foo --size=10Gi --direct=1 --rw=randrw --bs=4k \
>       --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 \
>       --time_based --group_reporting --name=iops-test-job --eta-newline=1
>
> |         |           |           | Read I/O count  |
> |         | Read      | Write     | devid1 | devid2 |
> |---------|-----------|-----------|--------|--------|
> | pid     | 29.4MiB/s | 29.5MiB/s | 456548 | 447975 |
> | rotation| 29.3MiB/s | 29.3MiB/s | 450105 | 450055 |
> | latency | 21.9MiB/s | 21.9MiB/s | 672387 | 0      |
> | devid:1 | 22.0MiB/s | 22.0MiB/s | 674788 | 0      |
>
> Defragmentation with compression workload:
>
> $ xfs_io -f -d -c 'pwrite -S 0xab 0 1G' /btrfs/foo
> $ sync
> $ echo 3 > /proc/sys/vm/drop_caches
> $ btrfs filesystem defrag -f -c /btrfs/foo
>
> |         | Time  | Read I/O Count  |
> |         | Real  | devid1 | devid2 |
> |---------|-------|--------|--------|
> | pid     | 21.61s| 3810   | 0      |
> | rotation| 11.55s| 1905   | 1905   |
> | latency | 20.99s| 0      | 3810   |
> | devid:2 | 21.41s| 0      | 3810   |
>
> . The PID-based balancing method works well for the generic random rw
>   fio workload.
> . The rotation method is ideal when you want to keep both devices
>   active, and it boosts performance in sequential defragmentation
>   scenarios.
> . The latency-based method works well when we have mixed device types,
>   or when one device experiences intermittent I/O failures: the latency
>   increases and it automatically picks the other device for further
>   read IOs.
> . The devid method is a more hands-on approach, useful for diagnosing
>   and testing RAID1 mirror synchronization.
>
> Anand Jain (3):
>   btrfs: introduce RAID1 round-robin read balancing
>   btrfs: use the path with the lowest latency for RAID1 reads
>   btrfs: add RAID1 preferred read device
>
>  fs/btrfs/disk-io.c |   4 ++
>  fs/btrfs/sysfs.c   | 116 +++++++++++++++++++++++++++++++++++++++------
>  fs/btrfs/volumes.c | 109 ++++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/volumes.h |  16 +++++++
>  4 files changed, 230 insertions(+), 15 deletions(-)
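For illustration, here is a minimal userspace model of the rotation policy
described above: reads stay on one mirror until a minimum number of
contiguous bytes has been issued to it, then rotate to the next mirror.
The structure and function names are hypothetical — a sketch of the idea,
not the patch's actual code.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical model of rotation with a min_contiguous_read threshold. */
    struct rr_state {
            int cur_mirror;         /* mirror currently serving reads */
            uint64_t issued;        /* bytes issued since the last switch */
            int num_mirrors;
            uint64_t min_contig;    /* multiple of the sector size */
    };

    static int rr_pick_mirror(struct rr_state *s, uint64_t read_len)
    {
            /* Stay on the current mirror until it has served at least
             * min_contig bytes, then rotate to the next one. */
            if (s->issued >= s->min_contig) {
                    s->cur_mirror = (s->cur_mirror + 1) % s->num_mirrors;
                    s->issued = 0;
            }
            s->issued += read_len;
            return s->cur_mirror;
    }

    int main(void)
    {
            struct rr_state s = { .num_mirrors = 2, .min_contig = 262144 };

            /* Eight 64k reads with a 256k threshold: the first four go to
             * mirror 0, the next four to mirror 1. */
            for (int i = 0; i < 8; i++)
                    printf("read %d -> mirror %d\n", i,
                           rr_pick_mirror(&s, 65536));
            return 0;
    }

Setting min_contig to the sector size reduces this to strict per-request
alternation, which is the default the thread goes on to debate.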
On 2024/10/11 13:19, Anand Jain wrote:
> v2:
> 1. Move new features to CONFIG_BTRFS_EXPERIMENTAL instead of
>    CONFIG_BTRFS_DEBUG.
> 2. Correct the typo from %est_wait to %best_wait.
> 3. Initialize %best_wait to U64_MAX and remove the check for 0.
> 4. Implement rotation with a minimum contiguous read threshold before
>    switching to the next stripe. Configure this using:
>
>      echo rotation:[min_contiguous_read] > /sys/fs/btrfs/<uuid>/read_policy
>
>    The default value is the sector size, and the min_contiguous_read
>    value must be a multiple of the sector size.

Overall, I'm fine with the latency and preferred device policies.

Meanwhile, I'd prefer the previous version of round-robin, without the
min_contiguous_read. That looks a little like overkill, and I think we
should keep the policy as simple as possible for now.

Mind sharing why the min_contiguous_read is introduced in this update?

In the future, we should go the same route as sched_ext, by pushing the
complex policies to eBPF programs.

Another future improvement is the interface. I'm fine with the sysfs knob
for an experimental feature, but from my drop_subtree_threshold
experience, sysfs is not going to be a user-friendly interface; it really
relies on some user-space daemon to set it.

I'd prefer something more persistent, like some XATTR inside the root
tree, exposed through the prop interface. But that can all be done in the
future.

Thanks,
Qu
On 11/10/24 10:29 am, Qu Wenruo wrote:
> On 2024/10/11 13:19, Anand Jain wrote:
>> v2:
>> [...]
>> 4. Implement rotation with a minimum contiguous read threshold before
>>    switching to the next stripe. Configure this using:
>>
>>      echo rotation:[min_contiguous_read] > /sys/fs/btrfs/<uuid>/read_policy
>>
>>    The default value is the sector size, and the min_contiguous_read
>>    value must be a multiple of the sector size.
>
> Overall, I'm fine with the latency and preferred device policies.
>
> Meanwhile, I'd prefer the previous version of round-robin, without the
> min_contiguous_read. That looks a little like overkill, and I think we
> should keep the policy as simple as possible for now.
>
> Mind sharing why the min_contiguous_read is introduced in this update?

The reason for adding min_contiguous_read: the block layer optimizes with
bio merging to improve HDD performance. David mentioned on Slack that
192k to 256k contiguous reads can perform better; I haven't seen this in
my setup, but it may work in others.

> In the future, we should go the same route as sched_ext, by pushing the
> complex policies to eBPF programs.

External scripts for RAID1 balancing are achievable with BPF, though that
requires writable BPF, which is disabled in some cases. That said, we
should still prioritize adding support and give the use case the choice.

> Another future improvement is the interface. I'm fine with the sysfs
> knob for an experimental feature.

Yes, we need to review the tunables - mount options, sysfs, and btrfs
properties - to arrive at clear guidelines and consolidation.

> But from my drop_subtree_threshold experience, sysfs is not going to be
> a user-friendly interface; it really relies on some user-space daemon
> to set it.

Agreed. However, for Btrfs, sysfs has been the most comprehensive
interface so far.

> I'd prefer something more persistent, like some XATTR inside the root
> tree, exposed through the prop interface. But that can all be done in
> the future.

Absolutely. I included that in earlier experiments, but it was removed
due to review comments. Now isn't the right time to reintroduce it; we
can update the on-disk format and xattrs once the in-memory
implementation graduates and addresses specific use cases.

Thanks, Anand
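The latency policy referenced in the v2 notes (items 2 and 3: %best_wait
initialized to U64_MAX) can be modelled roughly as picking the mirror with
the lowest estimated wait. In this userspace sketch the per-device
statistics and their names are assumptions, not the actual btrfs counters.

    #include <stdint.h>
    #include <stdio.h>

    /* Assumed per-mirror statistics; the real patch derives its estimate
     * from in-flight IO and completion times, and details may differ. */
    struct mirror_stat {
            uint64_t inflight;      /* reads currently queued */
            uint64_t avg_read_ns;   /* running average completion time */
    };

    static int pick_lowest_latency(const struct mirror_stat *m, int num)
    {
            uint64_t best_wait = UINT64_MAX; /* cf. %best_wait = U64_MAX */
            int best = 0;

            for (int i = 0; i < num; i++) {
                    /* Estimated wait: queued IOs times average completion
                     * time. A slow or failing path accumulates a large
                     * average and stops being picked. */
                    uint64_t est_wait = m[i].inflight * m[i].avg_read_ns;

                    if (est_wait < best_wait) {
                            best_wait = est_wait;
                            best = i;
                    }
            }
            return best;
    }

    int main(void)
    {
            struct mirror_stat mirrors[2] = {
                    { .inflight = 4, .avg_read_ns = 120000 }, /* healthy */
                    { .inflight = 1, .avg_read_ns = 900000 }, /* unstable */
            };

            printf("picked mirror %d\n", pick_lowest_latency(mirrors, 2));
            return 0;
    }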
On Fri, Oct 11, 2024 at 10:49:15AM +0800, Anand Jain wrote:
> v2:
> 1. Move new features to CONFIG_BTRFS_EXPERIMENTAL instead of
>    CONFIG_BTRFS_DEBUG.
> 2. Correct the typo from %est_wait to %best_wait.
> 3. Initialize %best_wait to U64_MAX and remove the check for 0.
> 4. Implement rotation with a minimum contiguous read threshold before
>    switching to the next stripe. Configure this using:
>
>      echo rotation:[min_contiguous_read] > /sys/fs/btrfs/<uuid>/read_policy
>
>    The default value is the sector size, and the min_contiguous_read
>    value must be a multiple of the sector size.

I think it's safe to start with the round-robin policy, but the syntax is
strange - why are the [ ] mandatory? Also, please call it round-robin, or
'rr' for short.

The default of sector size is IMHO a wrong value; switching devices that
often will drop the performance just because of the io request overhead.
From what I remember, values around 200k were reasonable, so either 192k
or 256k should be the default. We may also drop the configurable value
altogether and provide a few hard-coded sizes like rr-256k, rr-512k,
rr-1m, if only to drop the parsing of user strings.

> 5. Tested FIO random read/write and defrag compression workloads with
>    min_contiguous_read set to sector size, 192k, and 256k.
>
> [...]
>
> Rotation keeps all devices active, and for now, the Rotation RAID1
> balancing method is preferable as default. More workload testing is
> needed while the code is EXPERIMENTAL.

Yeah, round-robin will be a good default; we only need to verify the
chunk size and then do the switch in the next release.

> While Latency is better during a failing/unstable block layer transport.
> As of now these two techniques need to be independently tested with
> different workloads, and in the long term we should merge them into a
> unified heuristic.

This sounds like the latency policy is good for a specific case, and
maybe a fallback if the device becomes faulty, but once the layer below
becomes unstable we may need to stop reading from the device altogether.
This is also a different mode of operation than balancing reads.

> Rotation keeps all devices active, and for now, the Rotation RAID1
> balancing method should be the default. More workload testing is needed
> while the code is EXPERIMENTAL.
>
> Latency is smarter with an unstable block layer transport.
>
> Both techniques need independent testing across workloads, with the
> goal of eventually merging them into a unified approach for the long
> term.
>
> Devid is a hands-on approach that provides manual or user-space script
> control.
>
> These RAID1 balancing methods are tunable via the sysfs knob.
> The mount -o option and btrfs properties are under consideration.

To move forward with the feature, I think the round-robin and preferred
device id policies can be merged. I'm not sure about the latency one, but
if it's under experimental we can take it as is and tune it later.

I'll check my notes from the last time Michal attempted to implement the
policies, to make sure we haven't missed something.
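Whichever naming wins, the string handling is small either way. A
hypothetical parser for a "round-robin[:min_contiguous_read]" syntax,
applying the validation rules from the cover letter (value optional,
non-zero multiple of the sector size) with the 256k default suggested
above, could look like the sketch below; none of it is the actual sysfs
store code.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define SECTOR_SIZE 4096
    #define RR_DEFAULT  (256 * 1024)

    /* Hypothetical policy-string parser; returns 0 on success. */
    static int parse_read_policy(const char *buf, unsigned long *min_contig)
    {
            const char *sep = strchr(buf, ':');
            char *end;
            unsigned long val;

            *min_contig = 0;
            if (strncmp(buf, "round-robin", 11) != 0 ||
                (buf[11] != '\0' && buf[11] != ':'))
                    return -1;

            *min_contig = RR_DEFAULT;
            if (!sep)
                    return 0;       /* no value given, use the default */

            val = strtoul(sep + 1, &end, 10);
            /* Must be a non-zero multiple of the sector size. */
            if (end == sep + 1 || *end != '\0' || val == 0 ||
                val % SECTOR_SIZE)
                    return -1;
            *min_contig = val;
            return 0;
    }

    int main(void)
    {
            const char *tests[] = { "round-robin", "round-robin:262144",
                                    "round-robin:1000", "latency" };
            unsigned long v;

            for (int i = 0; i < 4; i++) {
                    if (parse_read_policy(tests[i], &v))
                            printf("%-20s -> invalid\n", tests[i]);
                    else
                            printf("%-20s -> min_contiguous_read=%lu\n",
                                   tests[i], v);
            }
            return 0;
    }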
Anand Jain wrote:
> v2:
> [...]
> 5. Tested FIO random read/write and defrag compression workloads with
>    min_contiguous_read set to sector size, 192k, and 256k.
>
> The rotation RAID1 balancing method is better for multi-process
> workloads such as fio and also single-process workloads such as
> defragmentation.

With this functionality added, would it not also make sense to add a
RAID0/10 profile that limits the stripe width, so a stripe does not span
more than n disks (for example n=4)?

On systems with, for example, 24 disks in RAID10, a read may activate 12
disks at the same time, which could easily saturate the bus.

Therefore, if a storage profile existed that limits the number of devices
a stripe occupies, it seems like there might be possibilities for
RAID0/10 as well.

Note that, as of writing this, I believe RAID0/10/5/6 make the stripe as
wide as the number of storage devices available to the filesystem. If I
am wrong about this, please ignore my jabbering and move on.
Thanks for commenting.

On 21/10/24 22:05, David Sterba wrote:
> On Fri, Oct 11, 2024 at 10:49:15AM +0800, Anand Jain wrote:
>> [...]
>
> I think it's safe to start with the round-robin policy, but the syntax
> is strange - why are the [ ] mandatory? Also, please call it
> round-robin, or 'rr' for short.

I'm fine with round-robin. The [ ] part is not mandatory; if the
min_contiguous_read value is not specified, it will default to a
predefined value.

> The default of sector size is IMHO a wrong value; switching devices
> that often will drop the performance just because of the io request
> overhead. From what I remember, values around 200k were reasonable, so
> either 192k or 256k should be the default. We may also drop the
> configurable value altogether and provide a few hard-coded sizes like
> rr-256k, rr-512k, rr-1m, if only to drop the parsing of user strings.

I'm okay with a default value of 256k. For the experimental feature, we
can keep it configurable, allowing the opportunity to experiment with
other values as well.

>> [...]
>>
>> Rotation keeps all devices active, and for now, the Rotation RAID1
>> balancing method is preferable as default. More workload testing is
>> needed while the code is EXPERIMENTAL.
>
> Yeah, round-robin will be a good default; we only need to verify the
> chunk size and then do the switch in the next release.

Yes..

>> While Latency is better during a failing/unstable block layer
>> transport. As of now these two techniques need to be independently
>> tested with different workloads, and in the long term we should merge
>> them into a unified heuristic.
>
> This sounds like the latency policy is good for a specific case, and
> maybe a fallback if the device becomes faulty, but once the layer below
> becomes unstable we may need to stop reading from the device
> altogether. This is also a different mode of operation than balancing
> reads.

If the latency on the faulty path is high enough, that path won't be
picked at all, so it works. However, round-robin balancing is unaware of
dynamic faults on the device path. IMO, a round-robin method that is
latency aware (with ~20% variance) would be better.

>> [...]
>>
>> These RAID1 balancing methods are tunable via the sysfs knob.
>> The mount -o option and btrfs properties are under consideration.
>
> To move forward with the feature, I think the round-robin and preferred
> device id policies can be merged. I'm not sure about the latency one,
> but if it's under experimental we can take it as is and tune it later.

I hope the experimental feature also means we can change the name of the
balancing method at any time. Once we have tested a fair combination of
block device types, we'll definitely need a method that can automatically
tune based on the device type, which will require adding or dropping
balancing methods accordingly.

> I'll check my notes from the last time Michal attempted to implement
> the policies, to make sure we haven't missed something.

Thanks, Anand
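Anand's latency-aware round-robin idea could be sketched as a rotation
that vetoes any mirror whose estimated wait is more than ~20% above the
current best. This is purely illustrative - the names and numbers are
invented, and the reply below argues the faulty-path case should be
handled separately anyway.

    #include <stdint.h>
    #include <stdio.h>

    struct mirror {
            uint64_t est_wait_ns;   /* assumed per-mirror wait estimate */
    };

    /* Rotate between mirrors, skipping any that is >20% slower than the
     * best current estimate. The best mirror always passes the check, so
     * the walk always terminates with a pick. */
    static int pick_rr_latency_aware(const struct mirror *m, int num,
                                     int *next)
    {
            uint64_t best = UINT64_MAX;
            int cand;

            for (int i = 0; i < num; i++)
                    if (m[i].est_wait_ns < best)
                            best = m[i].est_wait_ns;

            for (int tried = 0; ; tried++) {
                    cand = (*next + tried) % num;
                    if (m[cand].est_wait_ns <= best + best / 5)
                            break;
            }
            *next = (cand + 1) % num;
            return cand;
    }

    int main(void)
    {
            struct mirror mirrors[2] = { { 100000 }, { 500000 } };
            int next = 0;

            /* Mirror 1 is 5x slower, so the rotation collapses onto
             * mirror 0 until mirror 1 recovers. */
            for (int i = 0; i < 4; i++)
                    printf("read %d -> mirror %d\n", i,
                           pick_rr_latency_aware(mirrors, 2, &next));
            return 0;
    }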
On 21/10/24 22:32, waxhead wrote:
> Anand Jain wrote:
>> [...]
>
> With this functionality added, would it not also make sense to add a
> RAID0/10 profile that limits the stripe width, so a stripe does not
> span more than n disks (for example n=4)?
>
> On systems with, for example, 24 disks in RAID10, a read may activate
> 12 disks at the same time, which could easily saturate the bus.
>
> Therefore, if a storage profile existed that limits the number of
> devices a stripe occupies, it seems like there might be possibilities
> for RAID0/10 as well.
>
> Note that, as of writing this, I believe RAID0/10/5/6 make the stripe
> as wide as the number of storage devices available to the filesystem.
> If I am wrong about this, please ignore my jabbering and move on.

That's correct. I previously attempted to come up with a fix using the
device grouping method. If there's a convincing and more generic way to
specify how the devices should be grouped, we could consider that.

Thanks, Anand
On Mon, Oct 21, 2024 at 11:36:10PM +0800, Anand Jain wrote:
>> I think it's safe to start with the round-robin policy, but the syntax
>> is strange - why are the [ ] mandatory? Also, please call it
>> round-robin, or 'rr' for short.
>
> I'm fine with round-robin. The [ ] part is not mandatory; if the
> min_contiguous_read value is not specified, it will default to a
> predefined value.
>
>> The default of sector size is IMHO a wrong value; switching devices
>> that often will drop the performance just because of the io request
>> overhead. From what I remember, values around 200k were reasonable, so
>> either 192k or 256k should be the default. We may also drop the
>> configurable value altogether and provide a few hard-coded sizes like
>> rr-256k, rr-512k, rr-1m, if only to drop the parsing of user strings.
>
> I'm okay with a default value of 256k. For the experimental feature, we
> can keep it configurable, allowing the opportunity to experiment with
> other values as well.

Yeah, for experimenting it makes sense to make it flexible - no need to
patch and reboot the kernel. For the final version we should settle on
some reasonable values.

> [...]
>
> If the latency on the faulty path is high enough, that path won't be
> picked at all, so it works. However, round-robin balancing is unaware
> of dynamic faults on the device path. IMO, a round-robin method that is
> latency aware (with ~20% variance) would be better.

We should not mix the faulty device handling mode into the read
balancing, at least for now. A back-off algorithm that checks the number
of failed io requests should precede the balancing.

>> To move forward with the feature, I think the round-robin and
>> preferred device id policies can be merged. I'm not sure about the
>> latency one, but if it's under experimental we can take it as is and
>> tune it later.
>
> I hope the experimental feature also means we can change the name of
> the balancing method at any time. Once we have tested a fair
> combination of block device types, we'll definitely need a method that
> can automatically tune based on the device type, which will require
> adding or dropping balancing methods accordingly.

Yes, we can change the names. The automatic tuning would need some
feedback that measures the load and tries to improve the throughput;
this is where we got stuck last time. So for now let's do some
straightforward policy that on average works better than the current pid
policy. I hope that round-robin-256k can be a good default, but of course
we need more data for that.
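The back-off gate David describes could sit in front of any balancing
policy: a mirror whose recent read-error count crosses a threshold is
excluded until a cooldown expires, after which it gets one probationary
chance. The thresholds and names below are made up for illustration.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define ERR_THRESHOLD  8        /* invented values */
    #define COOLDOWN_TICKS 1000

    struct mirror_health {
            unsigned int recent_errors;
            uint64_t retry_after;   /* tick when the mirror may be retried */
    };

    static void mirror_record_error(struct mirror_health *m, uint64_t now)
    {
            if (++m->recent_errors >= ERR_THRESHOLD)
                    m->retry_after = now + COOLDOWN_TICKS;
    }

    /* Called before the balancing decision; a mirror that failed too
     * often is skipped until its cooldown expires. */
    static bool mirror_usable(struct mirror_health *m, uint64_t now)
    {
            if (m->recent_errors >= ERR_THRESHOLD) {
                    if (now < m->retry_after)
                            return false;   /* still backing off */
                    m->recent_errors = 0;   /* probationary chance */
            }
            return true;
    }

    int main(void)
    {
            struct mirror_health m = { 0 };

            for (int i = 0; i < ERR_THRESHOLD; i++)
                    mirror_record_error(&m, 0);

            printf("usable at t=0:    %d\n", mirror_usable(&m, 0));    /* 0 */
            printf("usable at t=1000: %d\n", mirror_usable(&m, 1000)); /* 1 */
            return 0;
    }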
On 22/10/24 02:42, David Sterba wrote:
> On Mon, Oct 21, 2024 at 11:36:10PM +0800, Anand Jain wrote:
>> [...]
>
> Yes, we can change the names. The automatic tuning would need some
> feedback that measures the load and tries to improve the throughput;
> this is where we got stuck last time. So for now let's do some
> straightforward policy that on average works better than the current
> pid policy. I hope that round-robin-256k can be a good default, but of
> course we need more data for that.

Sending v3 with rotation renamed to round-robin. Code review appreciated;
I'll wait a day.

Thanks, Anand
On 21.10.24 16:32, waxhead wrote:
> Note that, as of writing this, I believe RAID0/10/5/6 make the stripe
> as wide as the number of storage devices available to the filesystem.
> If I am wrong about this, please ignore my jabbering and move on.

Nope, you're correct, and this is a huge problem for bigger (in number of
drives) arrays. But it's also on my list of things I want to change in
how we handle RAID with the RAID stripe-tree. This way we can do
declustered RAID and ease rebuild times.

Also, we can drastically enhance write parallelism to an array by
directing different write streams to different sets of stripes. Which,
btw, at the moment isn't even done for RAID1, as we're picking one
block-group at a time until it's full, which then gets written, instead
of creating new block groups on new drive sets for different write
streams (i.e. different files, etc.).
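As a toy model of the width-limited, declustered layout sketched here:
each stripe uses only a fixed subset of the array's devices, and the
starting device rotates per stripe so that load and rebuild work spread
across all drives. This illustrates only the placement idea, not RAID
stripe-tree internals or any on-disk format; the constants are invented.

    #include <stdio.h>

    #define NUM_DEVS     8  /* devices in the array */
    #define STRIPE_WIDTH 4  /* devices any one stripe may span */

    /* Rotate the starting device per stripe (declustering), so
     * consecutive stripes land on different device subsets. */
    static void stripe_devices(int stripe_nr, int out[STRIPE_WIDTH])
    {
            int start = (stripe_nr * STRIPE_WIDTH) % NUM_DEVS;

            for (int i = 0; i < STRIPE_WIDTH; i++)
                    out[i] = (start + i) % NUM_DEVS;
    }

    int main(void)
    {
            int devs[STRIPE_WIDTH];

            for (int s = 0; s < 4; s++) {
                    stripe_devices(s, devs);
                    printf("stripe %d -> devices", s);
                    for (int i = 0; i < STRIPE_WIDTH; i++)
                            printf(" %d", devs[i]);
                    printf("\n");
            }
            return 0;
    }

A read of any single stripe then touches at most STRIPE_WIDTH devices,
which is the property waxhead asks for above.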
On 2024/10/11 13:19, Anand Jain wrote:
> v2:
> 1. Move new features to CONFIG_BTRFS_EXPERIMENTAL instead of
>    CONFIG_BTRFS_DEBUG.
> 2. Correct the typo from %est_wait to %best_wait.
> 3. Initialize %best_wait to U64_MAX and remove the check for 0.
> 4. Implement rotation with a minimum contiguous read threshold before
>    switching to the next stripe. Configure this using:
>
>      echo rotation:[min_contiguous_read] > /sys/fs/btrfs/<uuid>/read_policy
>
>    The default value is the sector size, and the min_contiguous_read
>    value must be a multiple of the sector size.
>
> 5. Tested FIO random read/write and defrag compression workloads with
>    min_contiguous_read set to sector size, 192k, and 256k.
>
> [...]

Reviewed-by: Qu Wenruo <wqu@suse.com>

Although I'm not 100% happy with the min_contiguous_read setting, since
it's an optional one and still experimental, I'm fine with the series so
far.

Just want to express my concern about going with a mount option. I know
sysfs is not a good way to set up a lot of features, but a mount option
is way too committed for me, even under experimental features.

But I also understand that without a mount option it can be pretty hard
to set up the read policy for fstests runs.

So I'd prefer to have some on-disk solution (XATTR or temporary items) to
save the read policy. It's less committed compared to a mount option
(aka, much easier to revert the change without breaking any
compatibility), and can help future features.

Thanks,
Qu