
[RFC,v1,0/6] Introducing `wq_cpu_set` mount option for btrfs

Message ID 20230226160259.18354-1-ammarfaizi2@gnuweeb.org (mailing list archive)


Ammar Faizi Feb. 26, 2023, 4:02 p.m. UTC
Hi,

This is an RFC patchset that introduces the `wq_cpu_set` mount option.
This option lets the user specify a CPU set that the Btrfs workqueues
will use.

Btrfs workqueues can slow sensitive user tasks down because they can use
any online CPU to perform heavy workloads on an SMP system. Add a mount
option to isolate the Btrfs workqueues to a set of CPUs. It is helpful
to avoid sensitive user tasks being preempted by Btrfs heavy workqueues.

This option is similar to the taskset bitmask except that the comma
separator is replaced with a dot. The reason for this is that the mount
option parser uses commas to separate mount options.
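As an illustration, a set written as `0.4.1.5` corresponds to the taskset list `0,4,1,5`; a small shell loop (a sketch only, independent of the in-kernel parser) can convert it to the usual hex bitmask:

```shell
# Convert a dot-separated CPU set, as accepted by wq_cpu_set, into a
# taskset-style hex bitmask. Illustrative only; the kernel parser differs.
cpuset="0.4.1.5"
mask=0
# Split on dots and set one bit per CPU number.
for cpu in $(printf '%s\n' "$cpuset" | tr '.' ' '); do
	mask=$((mask | (1 << cpu)))
done
printf '0x%x\n' "$mask"   # 0x33: bits 0, 1, 4 and 5
```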

Figure (the CPU usage when `wq_cpu_set` is used VS when it is not):
https://gist.githubusercontent.com/ammarfaizi2/a10f8073e58d1712c1ed49af83ae4ad1/raw/a4f7cbc4eb163db792a669d570ff542495e8c704/wq_cpu_set.png

A simple stress test:

1. Open htop.
2. Open a new terminal.
3. Mount and perform a heavy workload on the mounted Btrfs filesystem.

## Test without wq_cpu_set
sudo mount -t btrfs -o rw,compress-force=zstd:15,commit=1500 /dev/sda2 hdd/a;
cp -rf /path/folder_with_many_large_files/ hdd/a/test;
sync; # See the CPU usage in htop.
sudo umount hdd/a;

## Test wq_cpu_set
sudo mount -t btrfs -o rw,compress-force=zstd:15,commit=1500,wq_cpu_set=0.4.1.5 /dev/sda2 hdd/a;
cp -rf /path/folder_with_many_large_files/ hdd/a/test;
sync; # See the CPU usage in htop.
sudo umount hdd/a;

Signed-off-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
---

Ammar Faizi (6):
  workqueue: Add set_workqueue_cpumask() helper function
  btrfs: Change `mount_opt` type in `struct btrfs_fs_info` to `u64`
  btrfs: Create btrfs CPU set struct and helpers
  btrfs: Add wq_cpu_set=%s mount option
  btrfs: Adjust the default thread pool size when `wq_cpu_set` option is used
  btrfs: Add `BTRFS_DEFAULT_MAX_THREAD_POOL_SIZE` macro

 fs/btrfs/async-thread.c   | 51 ++++++++++++++++++++
 fs/btrfs/async-thread.h   |  3 ++
 fs/btrfs/disk-io.c        |  6 ++-
 fs/btrfs/fs.c             | 97 +++++++++++++++++++++++++++++++++++++++
 fs/btrfs/fs.h             | 12 ++++-
 fs/btrfs/super.c          | 83 +++++++++++++++++++++++++++++++++
 include/linux/workqueue.h |  3 ++
 kernel/workqueue.c        | 19 ++++++++
 8 files changed, 271 insertions(+), 3 deletions(-)


base-commit: 2fcd07b7ccd5fd10b2120d298363e4e6c53ccf9c

Comments

Tejun Heo Feb. 26, 2023, 5:01 p.m. UTC | #1
On Sun, Feb 26, 2023 at 11:02:53PM +0700, Ammar Faizi wrote:
> Hi,
> 
> This is an RFC patchset that introduces the `wq_cpu_set` mount option.
> This option lets the user specify a CPU set that the Btrfs workqueues
> will use.
> 
> Btrfs workqueues can slow sensitive user tasks down because they can use
> any online CPU to perform heavy workloads on an SMP system. Add a mount
> option to isolate the Btrfs workqueues to a set of CPUs. It is helpful
> to avoid sensitive user tasks being preempted by Btrfs heavy workqueues.
> 
> This option is similar to the taskset bitmask except that the comma
> separator is replaced with a dot. The reason for this is that the mount
> option parser uses commas to separate mount options.

Hmm... the allowed cpumasks for unbound workqueues can already be set
through /sys/devices/virtual/workqueue/cpumask, and each individual
workqueue can also be exposed in a matching subdirectory by setting WQ_SYSFS.
Wouldn't the proposed btrfs option be a bit redundant?
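That sysfs route can be sketched as follows (assumes root; the global
knob affects every unbound workqueue on the system, and per-workqueue
directories only exist for workqueues created with WQ_SYSFS):

```shell
# Build a hex mask for CPUs 0-3 (-> 0xf) and apply it to all unbound
# workqueues via the global sysfs knob. Needs root; system-wide effect.
mask=0
for cpu in 0 1 2 3; do
	mask=$((mask | (1 << cpu)))
done
printf '%x\n' "$mask"    # f
# printf '%x' "$mask" | sudo tee /sys/devices/virtual/workqueue/cpumask

# Per-workqueue knobs appear only for workqueues allocated with WQ_SYSFS:
# ls /sys/devices/virtual/workqueue/
```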

Thanks.
Ammar Faizi Feb. 26, 2023, 6:26 p.m. UTC | #2
Hello,

On Sun, 26 Feb 2023 07:01:41 -1000, Tejun Heo wrote:
> Hmm... the allowed cpumasks for unbound workqueues can already be set
> through /sys/devices/virtual/workqueue/cpumask, and each individual
> workqueue can also be exposed in a matching subdirectory by setting WQ_SYSFS.
> Wouldn't the proposed btrfs option be a bit redundant?

Thank you for the comment. I just learned about the sysfs facility for
this, so I will look into it more deeply. For now, I have several
reasons to use the `wq_cpu_set` option:

1. Currently, there are 15 btrfs workqueues. It would not be convenient
   to let the user manage each of them via sysfs. Using the `wq_cpu_set`
   option at mount time allows the user to set all of them in one
   shot.

   (for btrfs maintainers):
   I am also not sure whether the set of btrfs workqueues is stable
   enough for users to rely on the WQ_SYSFS facility.

2. I looked at /sys/devices/virtual/workqueue/ and saw its
   subdirectories. The directory name is taken from wq->name. But how
   does one distinguish multiple workqueues with the same name?

Each btrfs mount will at least do this:

	alloc_workqueue("btrfs-compressed-write", flags, max_active);

When we do:

	mount -t -o rw btrfs /dev/sda1 a;
	mount -t -o rw btrfs /dev/sda2 b;
	mount -t -o rw btrfs /dev/sda3 c;
	mount -t -o rw btrfs /dev/sda4 d;

Is there a way to identify which sysfs devices correspond to the
workqueues of a specific mounted btrfs filesystem? Let's say I want
each mount to have a different CPU mask.
Ammar Faizi Feb. 26, 2023, 6:29 p.m. UTC | #3
On Mon, Feb 27, 2023 at 01:26:24AM +0700, Ammar Faizi wrote:
> 	mount -t -o rw btrfs /dev/sda1 a;
> 	mount -t -o rw btrfs /dev/sda2 b;
> 	mount -t -o rw btrfs /dev/sda3 c;
> 	mount -t -o rw btrfs /dev/sda4 d;

Excuse the wrong mount commands; they should be something like this:

	mount -t btrfs -o rw /dev/sda1 a;
	mount -t btrfs -o rw /dev/sda2 b;
	mount -t btrfs -o rw /dev/sda3 c;
	mount -t btrfs -o rw /dev/sda4 d;
Qu Wenruo Feb. 27, 2023, 10:18 a.m. UTC | #4
On 2023/2/27 00:02, Ammar Faizi wrote:
> Hi,
> 
> This is an RFC patchset that introduces the `wq_cpu_set` mount option.
> This option lets the user specify a CPU set that the Btrfs workqueues
> will use.
> 
> Btrfs workqueues can slow sensitive user tasks down because they can use
> any online CPU to perform heavy workloads on an SMP system. Add a mount
> option to isolate the Btrfs workqueues to a set of CPUs. It is helpful
> to avoid sensitive user tasks being preempted by Btrfs heavy workqueues.

I'm not sure if pinning the wq is really the best solution to your problem.

Yes, I understand you want to limit the CPU usage of btrfs workqueues, 
but have you tried "thread_pool=" mount option?

That mount option should limit the max amount of in-flight work items, 
thus at least limit the CPU usage.
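As a sketch of that suggestion (device, mount point and the value 2 are
illustrative):

```shell
# Cap each btrfs thread pool at 2 in-flight workers instead of the
# default derived from the CPU count; values are illustrative.
sudo mount -t btrfs -o rw,compress-force=zstd:15,thread_pool=2 /dev/sda2 hdd/a
```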

For the wq CPU pinning part, I'm not sure if it's really needed, 
although it's known CPU pinning can affect some performance characteristics.

Thanks,
Qu

Filipe Manana Feb. 27, 2023, 11:02 a.m. UTC | #5
On Sun, Feb 26, 2023 at 4:31 PM Ammar Faizi <ammarfaizi2@gnuweeb.org> wrote:
>
> Hi,
>
> This is an RFC patchset that introduces the `wq_cpu_set` mount option.
> This option lets the user specify a CPU set that the Btrfs workqueues
> will use.
>
> Btrfs workqueues can slow sensitive user tasks down because they can use
> any online CPU to perform heavy workloads on an SMP system. Add a mount
> option to isolate the Btrfs workqueues to a set of CPUs. It is helpful
> to avoid sensitive user tasks being preempted by Btrfs heavy workqueues.
>
> This option is similar to the taskset bitmask except that the comma
> separator is replaced with a dot. The reason for this is that the mount
> option parser uses commas to separate mount options.
>
> Figure (the CPU usage when `wq_cpu_set` is used VS when it is not):
> https://gist.githubusercontent.com/ammarfaizi2/a10f8073e58d1712c1ed49af83ae4ad1/raw/a4f7cbc4eb163db792a669d570ff542495e8c704/wq_cpu_set.png

I haven't read the patchset.

It's great that it reduces CPU usage.
But does it also provide other performance benefits, like lower
latency or higher throughput for some workloads? Or does using less
CPU also negatively affect those other aspects?

Thanks.

Qu Wenruo Feb. 27, 2023, 11:46 a.m. UTC | #6
On 2023/2/27 19:02, Filipe Manana wrote:
> On Sun, Feb 26, 2023 at 4:31 PM Ammar Faizi <ammarfaizi2@gnuweeb.org> wrote:
>>
>> Hi,
>>
>> This is an RFC patchset that introduces the `wq_cpu_set` mount option.
>> This option lets the user specify a CPU set that the Btrfs workqueues
>> will use.
>>
>> Btrfs workqueues can slow sensitive user tasks down because they can use
>> any online CPU to perform heavy workloads on an SMP system. Add a mount
>> option to isolate the Btrfs workqueues to a set of CPUs. It is helpful
>> to avoid sensitive user tasks being preempted by Btrfs heavy workqueues.
>>
>> This option is similar to the taskset bitmask except that the comma
>> separator is replaced with a dot. The reason for this is that the mount
>> option parser uses commas to separate mount options.
>>
>> Figure (the CPU usage when `wq_cpu_set` is used VS when it is not):
>> https://gist.githubusercontent.com/ammarfaizi2/a10f8073e58d1712c1ed49af83ae4ad1/raw/a4f7cbc4eb163db792a669d570ff542495e8c704/wq_cpu_set.png
> 
> I haven't read the patchset.
> 
> It's great that it reduces CPU usage.
> But does it also provide other performance benefits, like lower
> latency or higher throughput for some workloads? Or does using less
> CPU also negatively affect those other aspects?

So far it looks like it just sets CPU masks for each workqueue.

Thus if it's reducing CPU usage, it also takes a longer time to finish
the workload (compression, csum calculation, etc.).

Thanks,
Qu
Ammar Faizi Feb. 27, 2023, 1:42 p.m. UTC | #7
On Mon, Feb 27, 2023 at 06:18:43PM +0800, Qu Wenruo wrote: 
> I'm not sure if pinning the wq is really the best solution to your problem.
> 
> Yes, I understand you want to limit the CPU usage of btrfs workqueues, but
> have you tried "thread_pool=" mount option?
> 
> That mount option should limit the max amount of in-flight work items, thus
> at least limit the CPU usage.

I have tried the thread_pool=%u mount option previously, but I didn't
observe its effect closely. I'll play with this option more and see if
it can yield the desired behavior.

> For the wq CPU pinning part, I'm not sure if it's really needed, although
> it's known CPU pinning can affect some performance characteristics.

What I like about CPU pinning is that we can dedicate CPUs to specific
workloads so they won't cause scheduling noise for the app we've
dedicated the other CPUs to.
Ammar Faizi Feb. 27, 2023, 1:45 p.m. UTC | #8
On 2/27/23 6:46 PM, Qu Wenruo wrote:
> On 2023/2/27 19:02, Filipe Manana wrote:
>> On Sun, Feb 26, 2023 at 4:31 PM Ammar Faizi <ammarfaizi2@gnuweeb.org> wrote:
>>> Figure (the CPU usage when `wq_cpu_set` is used VS when it is not):
>>> https://gist.githubusercontent.com/ammarfaizi2/a10f8073e58d1712c1ed49af83ae4ad1/raw/a4f7cbc4eb163db792a669d570ff542495e8c704/wq_cpu_set.png
>>
>> I haven't read the patchset.
>>
>> It's great that it reduces CPU usage. But does it also provide
>> other performance benefits, like lower latency or higher throughput
>> for some workloads? Or does using less CPU also negatively affect
>> those other aspects?

Based on my testing, it gives lower latency for a browser app playing
a YouTube video.

Without this proposed option, high-level compression on btrfs storage
is a real source of noise for user-space apps. It periodically freezes
the UI for 2 to 3 seconds and causes audio lag; it mostly happens when
btrfs starts writing the dirty data back to the disk.

It's reasonably easy to reproduce by making a large dirty write and
invoking a "sync" command.

Side note: Pin user apps to CPUs a,b,c,d and btrfs workqueues to CPUs
w,x,y,z.
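Concretely, on an 8-CPU box that split could look like this (CPU
numbers, device and app are illustrative; `wq_cpu_set` is the proposed
option):

```shell
# Keep the latency-sensitive app on CPUs 0-3 and, with the proposed
# option, push the btrfs workqueues onto CPUs 4-7. Illustrative only.
# taskset -c 0-3 mpv video.mkv &
# sudo mount -t btrfs -o compress-force=zstd:15,wq_cpu_set=4.5.6.7 /dev/sda2 hdd/a
```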

> So far it looks like it just sets CPU masks for each workqueue.
> 
> Thus if it's reducing CPU usage, it also takes a longer time to finish
> the workload (compression, csum calculation, etc.).

Yes, that's correct.

I see this as a good mount option for btrfs because the btrfs workload
in question is CPU bound, specifically the writing operation. While it
may slow the btrfs workload down because we limit the number of usable
CPUs, there are cases where users don't prioritize writing to disk.

Let's say:
I want to run a smooth app with video. I also want to have high-level
compression for my btrfs storage. But I don't want the compression and
checksum work to bother my video; here, I give you CPU x,y,z for the
btrfs work. And here I give you CPU a,b,c,d,e,f for the video work.

I have a similar case on a torrent seeder server where high-level
compression is expected. And I believe there are more cases where this
option is advantageous.

Thank you all for the comments,
Roman Mamedov Feb. 27, 2023, 4:24 p.m. UTC | #9
On Mon, 27 Feb 2023 20:45:26 +0700
Ammar Faizi <ammarfaizi2@gmail.com> wrote:

> Based on my testing, it gives lower latency for a browser app playing
> a YouTube video.
> 
> Without this proposed option, high-level compression on a btrfs
> storage is a real noise to user space apps. It periodically freezes
> the UI for 2 to 3 seconds and causes audio lag; it mostly happens when
> it starts writing the dirty write to the disk.
> 
> It's reasonably easy to reproduce by making a large dirty write and
> invoking a "sync" command.
> 
> Side note: Pin user apps to CPUs a,b,c,d and btrfs workquques to CPUs
> w,x,y,z.

The end user should not be expected to do that.
(At least that is my opinion as an end user :)

The worst part: when Btrfs has no current work to do, the user apps
are still hard-capped to 4 cores for no good reason while the other 4
are idling. Even with a split like 6/2, this still looks like giving
up on the achievements of multi-tasking operating systems and falling
back to some antique state of fixed separation.

> I want to run a smooth app with video. I also want to have high-level
> compression for my btrfs storage. But I don't want the compression and
> checksum work to bother my video; here, I give you CPU x,y,z for the
> btrfs work. And here I give you CPU a,b,c,d,e,f for the video work.
> 
> I have a similar case on a torrent seeder server where high-level
> compression is expected. And I believe there are more cases where this
> option is advantageous.

I really hope there can be some other approach, such as some
adjustment to kworker processing so that it's not as aggressive in
starving userspace. There is no way to run kernel tasks at a lower
priority, right?

But if there is no other way, maybe an option like this is good to
have for the moment.
Dave Chinner Feb. 27, 2023, 10:17 p.m. UTC | #10
On Sun, Feb 26, 2023 at 11:02:53PM +0700, Ammar Faizi wrote:
> ## Test wq_cpu_set
> sudo mount -t btrfs -o rw,compress-force=zstd:15,commit=1500,wq_cpu_set=0.4.1.5 /dev/sda2 hdd/a;
> cp -rf /path/folder_with_many_large_files/ hdd/a/test;
> sync; # See the CPU usage in htop.
> sudo umount hdd/a;

This seems like the wrong model for setting cpu locality for
internal filesystem threads.

Users are used to controlling cpu sets and other locality behaviour
of a task with wrapper tools like numactl. Wrap the mount command
with a numactl command to limit the CPU set, then have the btrfs
fill_super() callback set the cpu mask for the work queues it
creates based on the cpu mask that has been set for the mount task.

That is, I think the model should be "inherit cpu mask from parent
task" rather than adding mount options. This model allows anything
that numactl can control (e.g. memory locality) to also influence
the filesystem default behaviour without having to add yet more
mount options in the future....
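The inheritance model is easy to demonstrate with existing tools: a
child process (which is what the mount task would be) sees the CPU mask
its wrapper set, so fill_super() could simply read its own allowed mask
(commands are illustrative):

```shell
# Under the proposed model, any affinity wrapper would shape the
# workqueue mask with no new mount options:
# sudo taskset -c 0,1,4,5 mount -t btrfs /dev/sda2 hdd/a
# sudo numactl --physcpubind=0,1,4,5 mount -t btrfs /dev/sda2 hdd/a

# The inheritance itself: a child sees the mask its parent set.
taskset -c 0 sh -c 'grep Cpus_allowed_list /proc/self/status'
# prints e.g.: Cpus_allowed_list:  0
```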

-Dave.
Qu Wenruo Feb. 27, 2023, 11:49 p.m. UTC | #11
On 2023/2/27 21:42, Ammar Faizi wrote:
> On Mon, Feb 27, 2023 at 06:18:43PM +0800, Qu Wenruo wrote:
>> I'm not sure if pinning the wq is really the best solution to your problem.
>>
>> Yes, I understand you want to limit the CPU usage of btrfs workqueues, but
>> have you tried "thread_pool=" mount option?
>>
>> That mount option should limit the max amount of in-flight work items, thus
>> at least limit the CPU usage.
> 
> I have tried to use the thread_poll=%u mount option previously. But I
> didn't observe the effect intensively. I'll try to play with this option
> more and see if it can yield the desired behavior.

The behavior change from the thread_pool mount option is much harder
to observe.

A wq pinned to one or two CPUs is easy to observe using htop, while an
unbound wq, even with thread_pool, is much harder to observe.

Thus it needs more systematic testing to find the difference.

> 
>> For the wq CPU pinning part, I'm not sure if it's really needed, although
>> it's known CPU pinning can affect some performance characteristics.
> 
> What I like about CPU pinning is that we can dedicate CPUs for specific
> workloads so it won't cause scheduling noise to the app we've dedicated
> other CPUs for.
> 

I'm not 100% sure we're really any better than the scheduler
developers, as there are a lot more things to consider.

E.g. recent Intel CPUs have BIG and little cores, and BIG cores even
have SMT support.
For current kernels, the scheduler avoids putting workloads onto
threads sharing the same physical core.

This can result in a seemingly weird priority like BIG core thread1 >
little core > BIG core thread2.
But that results in overall better performance.

So overall, unless necessary I'd avoid manual CPU pinning.

Thanks,
Qu
Ammar Faizi Feb. 28, 2023, 8:01 a.m. UTC | #12
On Tue, Feb 28, 2023 at 09:17:45AM +1100, Dave Chinner wrote:
> This seems like the wrong model for setting cpu locality for
> internal filesystem threads.
> 
> Users are used to controlling cpu sets and other locality behaviour
> of a task with wrapper tools like numactl. Wrap th emount command
> with a numactl command to limit the CPU set, then have the btrfs
> fill_super() callback set the cpu mask for the work queues it
> creates based on the cpu mask that has been set for the mount task.
> 
> That is, I think the model should be "inherit cpu mask from parent
> task" rather than adding mount options. This model allows anything
> that numactl can control (e.g. memory locality) to also influence
> the filesystem default behaviour without having to add yet more
> mount options in the future....

Good idea on the tooling part.

I like the idea of using numactl to determine a proper CPU set. But
users may also use /etc/fstab to mount their btrfs storage. In that
case, using a mount option is still handy.

Also, if we always inherit the CPU mask from the parent task that
calls mount, it will break CPU affinity for old users who
inadvertently run their mount command with an arbitrary CPU mask.

We should keep the old behavior and allow users to opt in if they
want to. Maybe something like:

   numactl -N 0 mount -t btrfs -o rw,wq_cpu_set=inherit /dev/bla bla
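For the /etc/fstab case mentioned above, an explicit set would remain
the natural form, since at boot there is no meaningful parent mask to
inherit (hypothetical syntax; `wq_cpu_set` is still only a proposal):

```shell
# /etc/fstab entry with an explicit CPU set (device and paths are
# illustrative):
# /dev/sda2  /mnt/a  btrfs  rw,compress-force=zstd:15,wq_cpu_set=0.4.1.5  0  0
```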