Message ID | e0a0bc94-e6de-b0e5-ee46-a76cd1570ea6@I-love.SAKURA.ne.jp (mailing list archive)
---|---
State | New, archived
Series | loop: add WQ_MEM_RECLAIM flag to per device workqueue
On Thu, Mar 17, 2022 at 11:08:04PM +0900, Tetsuo Handa wrote:
> Commit 87579e9b7d8dc36e ("loop: use worker per cgroup instead of kworker")
> again started using per device workqueue
>
> -	lo->worker_task = kthread_run(loop_kthread_worker_fn,
> -			&lo->worker, "loop%d", lo->lo_number);
> +	lo->workqueue = alloc_workqueue("loop%d",
> +					WQ_UNBOUND | WQ_FREEZABLE,
> +					0,
> +					lo->lo_number);
>
> but forgot to restore WQ_MEM_RECLAIM flag.

Early versions of the patch did have WQ_MEM_RECLAIM but it was
removed. I looked up my notes and found this:

Changes since V11:

* Removed WQ_MEM_RECLAIM flag from loop workqueue. Technically, this
  can be driven by writeback, but this was causing a warning in xfs
  and likely other filesystems aren't equipped to be driven by reclaim
  at the VFS layer.

I don't know if this is still the case. Can you test it?

> I don't know why WQ_FREEZABLE flag was added

My memory here is hazy and I don't have it in my notes but the same is
done in block/blk-cgroup.c which is conceptually quite similar.
On 2022/03/17 23:38, Dan Schatzberg wrote:
> On Thu, Mar 17, 2022 at 11:08:04PM +0900, Tetsuo Handa wrote:
>> Commit 87579e9b7d8dc36e ("loop: use worker per cgroup instead of kworker")
>> again started using per device workqueue
>>
>> -	lo->worker_task = kthread_run(loop_kthread_worker_fn,
>> -			&lo->worker, "loop%d", lo->lo_number);
>> +	lo->workqueue = alloc_workqueue("loop%d",
>> +					WQ_UNBOUND | WQ_FREEZABLE,
>> +					0,
>> +					lo->lo_number);
>>
>> but forgot to restore WQ_MEM_RECLAIM flag.
>
> Early versions of the patch did have WQ_MEM_RECLAIM but it was
> removed. I looked up my notes and found this:
>
> Changes since V11:
>
> * Removed WQ_MEM_RECLAIM flag from loop workqueue. Technically, this
>   can be driven by writeback, but this was causing a warning in xfs
>   and likely other filesystems aren't equipped to be driven by reclaim
>   at the VFS layer.
>
> I don't know if this is still the case. Can you test it?

Well, indeed there was WQ_MEM_RECLAIM flag in
https://lore.kernel.org/all/20210316153655.500806-2-schatzberg.dan@gmail.com/
and the warning you are referring to seems to be
https://lore.kernel.org/all/20210322060334.GD32426@xsang-OptiPlex-9020/ .

----------------------------------------
[ 49.143055] run fstests generic/361 at 2021-03-19 23:56:49
[ 49.418726] loop: module loaded
[ 49.642111] XFS (sda4): Mounting V5 Filesystem
[ 49.761139] XFS (sda4): Ending clean mount
[ 49.768167] xfs filesystem being mounted at /fs/scratch supports timestamps until 2038 (0x7fffffff)
[ 49.799002] loop0: detected capacity change from 0 to 2097152
[ 50.027339] XFS (loop0): Mounting V5 Filesystem
[ 50.033830] XFS (loop0): Ending clean mount
[ 50.038232] xfs filesystem being mounted at /fs/scratch/mnt supports timestamps until 2038 (0x7fffffff)
[ 50.239610] XFS: attr2 mount option is deprecated.
[ 50.423584] ------------[ cut here ]------------
[ 50.428278] workqueue: WQ_MEM_RECLAIM loop0:loop_rootcg_workfn [loop] is flushing !WQ_MEM_RECLAIM xfs-sync/sda4:xfs_flush_inodes_worker [xfs]
[ 50.428387] WARNING: CPU: 0 PID: 35 at kernel/workqueue.c:2613 check_flush_dependency+0x116/0x140
[ 50.450013] Modules linked in: loop xfs dm_mod btrfs blake2b_generic xor zstd_compress raid6_pq libcrc32c sd_mod t10_pi sg ipmi_devintf ipmi_msghandler intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal i915 intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul hp_wmi sparse_keymap intel_gtt crc32c_intel ghash_clmulni_intel mei_wdt rfkill wmi_bmof rapl drm_kms_helper ahci intel_cstate syscopyarea mei_me libahci sysfillrect sysimgblt fb_sys_fops intel_uncore serio_raw mei drm libata intel_pch_thermal ie31200_edac wmi video tpm_infineon intel_pmc_core acpi_pad ip_tables
[ 50.500731] CPU: 0 PID: 35 Comm: kworker/u8:3 Not tainted 5.12.0-rc2-00093-geaba74271070 #1
[ 50.509081] Hardware name: HP HP Z238 Microtower Workstation/8183, BIOS N51 Ver. 01.63 10/05/2017
[ 50.517963] Workqueue: loop0 loop_rootcg_workfn [loop]
[ 50.523109] RIP: 0010:check_flush_dependency+0x116/0x140
[ 50.528427] Code: 8d 8b b0 00 00 00 48 8b 50 18 49 89 e8 48 8d b1 b0 00 00 00 48 c7 c7 80 ed 34 82 4c 89 c9 c6 05 8a 11 ab 01 01 e8 29 9a a9 00 <0f> 0b e9 07 ff ff ff 80 3d 78 11 ab 01 00 75 92 e9 3b ff ff ff 66
[ 50.547207] RSP: 0018:ffffc9000018fba0 EFLAGS: 00010086
[ 50.552434] RAX: 0000000000000000 RBX: ffff88843a39b600 RCX: 0000000000000027
[ 50.559567] RDX: 0000000000000027 RSI: 00000000ffff7fff RDI: ffff88843f4177f8
[ 50.566713] RBP: ffffffffc091c760 R08: ffff88843f4177f0 R09: ffffc9000018f9b8
[ 50.573875] R10: 0000000000000001 R11: 0000000000000001 R12: ffff88810d0fa800
[ 50.581006] R13: 0000000000000001 R14: ffff88843f42ab80 R15: ffff88810d0fa800
[ 50.588139] FS: 0000000000000000(0000) GS:ffff88843f400000(0000) knlGS:0000000000000000
[ 50.596228] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 50.601975] CR2: 00007fbda63ea7f8 CR3: 000000043da0a002 CR4: 00000000003706f0
[ 50.609105] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 50.616240] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 50.623379] Call Trace:
[ 50.625837]  __flush_work+0xaf/0x220
[ 50.629418]  ? __queue_work+0x142/0x400
[ 50.633261]  xfs_file_buffered_write+0x1d0/0x2c0 [xfs]
[ 50.638468]  do_iter_readv_writev+0x157/0x1c0
[ 50.642833]  do_iter_write+0x7d/0x1c0
[ 50.646513]  lo_write_bvec+0x62/0x1a0 [loop]
[ 50.650804]  loop_process_work+0x20d/0xb80 [loop]
[ 50.655543]  ? newidle_balance+0x1f0/0x400
[ 50.659647]  process_one_work+0x1ed/0x3c0
[ 50.663696]  worker_thread+0x50/0x3c0
[ 50.667365]  ? process_one_work+0x3c0/0x3c0
[ 50.671568]  kthread+0x116/0x160
[ 50.674813]  ? kthread_park+0xa0/0xa0
[ 50.678476]  ret_from_fork+0x22/0x30
[ 50.682070] ---[ end trace cdbf6e08ae434f6c ]---
[ 56.951496] loop: Write error at byte offset 1032134656, length 4096.
[ 56.957973] blk_update_request: I/O error, dev loop0, sector 2015816 op 0x1:(WRITE) flags 0x4000 phys_seg 21 prio class 0
----------------------------------------

Since xfs_init_mount_workqueues() as of 5.17-rc8 is doing

	mp->m_sync_workqueue = alloc_workqueue("xfs-sync/%s", XFS_WQFLAGS(WQ_FREEZABLE),
					       0, mp->m_super->s_id);

, this will still be the case.

Since scsi_add_host_with_dma() in drivers/scsi/hosts.c is doing

	shost->work_q = alloc_workqueue("%s",
			WQ_SYSFS | __WQ_LEGACY | WQ_MEM_RECLAIM | WQ_UNBOUND,
			1, shost->work_q_name);

iscsi_host_alloc() in drivers/scsi/libiscsi.c is doing

	ihost->workq = alloc_workqueue("%s",
			WQ_SYSFS | __WQ_LEGACY | WQ_MEM_RECLAIM | WQ_UNBOUND,
			1, ihost->workq_name);

iscsi_transport_init() in drivers/scsi/scsi_transport_iscsi.c is doing

	iscsi_eh_timer_workq = alloc_workqueue("%s",
			WQ_SYSFS | __WQ_LEGACY | WQ_MEM_RECLAIM | WQ_UNBOUND,
			1, "iscsi_eh");

and include/linux/workqueue.h is doing

	#define create_workqueue(name)					\
		alloc_workqueue("%s", __WQ_LEGACY | WQ_MEM_RECLAIM, 1, (name))

, I guess that also adding __WQ_LEGACY flag would avoid hitting the

	WARN_ONCE(worker && ((worker->current_pwq->wq->flags &
			      (WQ_MEM_RECLAIM | __WQ_LEGACY)) == WQ_MEM_RECLAIM),
		  "workqueue: WQ_MEM_RECLAIM %s:%ps is flushing !WQ_MEM_RECLAIM %s:%ps",
		  worker->current_pwq->wq->name, worker->current_func,
		  target_wq->name, target_func);

warning. But since include/linux/workqueue.h only says

	__WQ_LEGACY		= 1 << 18, /* internal: create*_workqueue() */

, I can't tell when not to specify __WQ_LEGACY and WQ_MEM_RECLAIM together...

Tejun, what is the intent of this warning? Can the description of __WQ_LEGACY flag
be updated? I think that the loop module had better reserve one "struct task_struct"
for each loop device.

I guess that, in general, waiting for a work in !WQ_MEM_RECLAIM WQ from a
WQ_MEM_RECLAIM WQ is dangerous because that work may not be able to find
"struct task_struct" for processing that work. Then, what we should do is to
create mp->m_sync_workqueue with WQ_MEM_RECLAIM flag added instead of creating
lo->workqueue with __WQ_LEGACY + WQ_MEM_RECLAIM flags added...

Is __WQ_LEGACY + WQ_MEM_RECLAIM combination a hack for silencing this warning
without fixing various WQs used by xfs and other filesystems?
Hello,

On Fri, Mar 18, 2022 at 09:05:42PM +0900, Tetsuo Handa wrote:
> But since include/linux/workqueue.h only says
>
> 	__WQ_LEGACY		= 1 << 18, /* internal: create*_workqueue() */
>
> , I can't tell when not to specify __WQ_LEGACY and WQ_MEM_RECLAIM together...
>
> Tejun, what is the intent of this warning? Can the description of __WQ_LEGACY flag
> be updated? I think that the loop module had better reserve one "struct task_struct"
> for each loop device.
>
> I guess that, in general, waiting for a work in !WQ_MEM_RECLAIM WQ from a
> WQ_MEM_RECLAIM WQ is dangerous because that work may not be able to find
> "struct task_struct" for processing that work. Then, what we should do is to
> create mp->m_sync_workqueue with WQ_MEM_RECLAIM flag added instead of creating
> lo->workqueue with __WQ_LEGACY + WQ_MEM_RECLAIM flags added...
>
> Is __WQ_LEGACY + WQ_MEM_RECLAIM combination a hack for silencing this warning
> without fixing various WQs used by xfs and other filesystems?

So, create_workqueue() is the deprecated interface and always implies
MEM_RECLAIM because back when the interface was added each wq had a
dedicated worker and there's no way to tell one way or the other. The
warning is telling you to convert the workqueue to the alloc_workqueue()
interface and explicitly use WQ_MEM_RECLAIM flag if the workqueue is gonna
participate in MEM_RECLAIM chain.

Thanks.
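To make the conversion Tejun describes concrete, a minimal sketch might look like
the following (the driver name and function here are hypothetical, not code from
this thread):

	#include <linux/workqueue.h>

	static struct workqueue_struct *example_wq;	/* hypothetical */

	static int __init example_init(void)
	{
		/*
		 * Deprecated: create_workqueue() always implies WQ_MEM_RECLAIM
		 * and internally tags the workqueue with __WQ_LEGACY.
		 *
		 * example_wq = create_workqueue("example");
		 */

		/*
		 * Preferred: state the reclaim dependency explicitly. Pass
		 * WQ_MEM_RECLAIM only if this workqueue can sit in the path
		 * that cleans dirty pages under memory pressure.
		 */
		example_wq = alloc_workqueue("example",
					     WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
		return example_wq ? 0 : -ENOMEM;
	}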
On 2022/03/19 2:15, Tejun Heo wrote:
> Hello,
>
> On Fri, Mar 18, 2022 at 09:05:42PM +0900, Tetsuo Handa wrote:
>> But since include/linux/workqueue.h only says
>>
>> 	__WQ_LEGACY		= 1 << 18, /* internal: create*_workqueue() */
>>
>> , I can't tell when not to specify __WQ_LEGACY and WQ_MEM_RECLAIM together...
>>
>> Tejun, what is the intent of this warning? Can the description of __WQ_LEGACY flag
>> be updated? I think that the loop module had better reserve one "struct task_struct"
>> for each loop device.
>>
>> I guess that, in general, waiting for a work in !WQ_MEM_RECLAIM WQ from a
>> WQ_MEM_RECLAIM WQ is dangerous because that work may not be able to find
>> "struct task_struct" for processing that work. Then, what we should do is to
>> create mp->m_sync_workqueue with WQ_MEM_RECLAIM flag added instead of creating
>> lo->workqueue with __WQ_LEGACY + WQ_MEM_RECLAIM flags added...
>>
>> Is __WQ_LEGACY + WQ_MEM_RECLAIM combination a hack for silencing this warning
>> without fixing various WQs used by xfs and other filesystems?
>
> So, create_workqueue() is the deprecated interface and always implies
> MEM_RECLAIM because back when the interface was added each wq had a
> dedicated worker and there's no way to tell one way or the other. The
> warning is telling you to convert the workqueue to the alloc_workqueue()
> interface and explicitly use WQ_MEM_RECLAIM flag if the workqueue is gonna
> participate in MEM_RECLAIM chain.

Is the intent of __WQ_LEGACY flag to indicate that "this WQ was created
using the deprecated interface"? But such intention no longer holds true.

Although the __WQ_LEGACY flag is described as "internal: create*_workqueue()",
tegra194_cpufreq_probe()/scsi_add_host_with_dma()/iscsi_host_alloc()/
iscsi_transport_init() are passing __WQ_LEGACY flag using the alloc_workqueue()
interface. Therefore, __WQ_LEGACY flag is no longer a meaningful indicator of
"internal: create*_workqueue()". The description of the __WQ_LEGACY flag needs
an update.

Here is an example program which reproduces the

	WARN_ONCE(worker && ((worker->current_pwq->wq->flags &
			      (WQ_MEM_RECLAIM | __WQ_LEGACY)) == WQ_MEM_RECLAIM),
		  "workqueue: WQ_MEM_RECLAIM %s:%ps is flushing !WQ_MEM_RECLAIM %s:%ps",
		  worker->current_pwq->wq->name, worker->current_func,
		  target_wq->name, target_func);

warning reported in
https://lore.kernel.org/all/20210322060334.GD32426@xsang-OptiPlex-9020/ .

---------- test.c ----------
#include <linux/module.h>
#include <linux/sched.h>
#include <linux/workqueue.h>

static struct workqueue_struct *wq1;
static struct workqueue_struct *wq2;
static struct work_struct w1;
static struct work_struct w2;

static void wq2_workfn(struct work_struct *work)
{
}

static void wq1_workfn(struct work_struct *work)
{
	INIT_WORK(&w2, wq2_workfn);
	queue_work(wq2, &w2);
	flush_work(&w2);
}

static int __init wq_test_init(void)
{
	wq1 = alloc_workqueue("wq1", WQ_MEM_RECLAIM, 0);
	wq2 = alloc_workqueue("wq2", 0, 0);
	INIT_WORK(&w1, wq1_workfn);
	queue_work(wq1, &w1);
	flush_work(&w1);
	destroy_workqueue(wq2);
	destroy_workqueue(wq1);
	return -EINVAL;
}

module_init(wq_test_init);
MODULE_LICENSE("GPL");
---------- test.c ----------

----------
[ 152.666153] test: loading out-of-tree module taints kernel.
[ 152.673510] ------------[ cut here ]------------
[ 152.675765] workqueue: WQ_MEM_RECLAIM wq1:wq1_workfn [test] is flushing !WQ_MEM_RECLAIM wq2:wq2_workfn [test]
[ 152.675790] WARNING: CPU: 0 PID: 259 at kernel/workqueue.c:2650 check_flush_dependency+0x169/0x170
[ 152.682636] Modules linked in: test(O+) loop dm_mod dax serio_raw sg vmw_vmci fuse drm sd_mod t10_pi ata_generic pata_acpi ahci libahci ata_piix mptspi mptscsih i2c_piix4 mptbase i2c_core libata scsi_transport_spi e1000
[ 152.690020] CPU: 0 PID: 259 Comm: kworker/0:2 Kdump: loaded Tainted: G O 5.17.0-rc8+ #72
[ 152.693869] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020
[ 152.697764] Workqueue: wq1 wq1_workfn [test]
[ 152.699762] RIP: 0010:check_flush_dependency+0x169/0x170
[ 152.701817] Code: 8d 7f 18 e8 89 84 3a 00 49 8b 57 18 49 81 c5 60 01 00 00 48 c7 c7 60 f4 86 82 4c 89 e6 4c 89 e9 4d 89 f0 31 c0 e8 c7 80 fc ff <0f> 0b e9 61 ff ff ff 55 41 57 41 56 41 55 41 54 53 48 83 ec 18 48
[ 152.709031] RSP: 0018:ffff8881110a7b00 EFLAGS: 00010046
[ 152.711434] RAX: 19db87cad24ebb00 RBX: ffffe8ffffa4cc00 RCX: 0000000000000002
[ 152.714781] RDX: 0000000000000004 RSI: dffffc0000000000 RDI: ffffffff84a84f00
[ 152.717334] RBP: 0000000000000000 R08: dffffc0000000000 R09: ffffed1023546ef8
[ 152.723543] R10: ffffed1023546ef8 R11: 1ffff11023546ef7 R12: ffff888113e75160
[ 152.729068] R13: ffff888113e70f60 R14: ffffffffa0848090 R15: ffff8881014c3528
[ 152.731834] FS: 0000000000000000(0000) GS:ffff88811aa00000(0000) knlGS:0000000000000000
[ 152.734735] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 152.736840] CR2: 0000557cfd128768 CR3: 000000007216f006 CR4: 00000000001706f0
[ 152.739790] Call Trace:
[ 152.740715]  <TASK>
[ 152.742072]  start_flush_work+0xf9/0x440
[ 152.746516]  __flush_work+0xed/0x170
[ 152.747845]  ? flush_work+0x10/0x10
[ 152.749240]  ? __queue_work+0x558/0x5b0
[ 152.750648]  ? queue_work_on+0xe0/0x160
[ 152.752036]  ? lockdep_hardirqs_on+0xe6/0x170
[ 152.753757]  ? queue_work_on+0xed/0x160
[ 152.755546]  ? wq_worker_last_func+0x20/0x20
[ 152.757177]  ? rcu_read_lock_sched_held+0x87/0x100
[ 152.758960]  ? perf_trace_rcu_stall_warning+0x210/0x210
[ 152.760929]  process_one_work+0x45e/0x6b0
[ 152.762587]  ? rescuer_thread+0x9f0/0x9f0
[ 152.764332]  ? _raw_spin_lock_irqsave+0xf0/0xf0
[ 152.766110]  worker_thread+0x4d7/0x960
[ 152.767523]  ? _raw_spin_unlock+0x40/0x40
[ 152.769146]  ? preempt_count_sub+0xf/0xc0
[ 152.770595]  ? _raw_spin_unlock_irqrestore+0xb2/0x110
[ 152.773091]  ? rcu_lock_release+0x20/0x20
[ 152.774637]  kthread+0x178/0x1a0
[ 152.775819]  ? kthread_blkcg+0x50/0x50
[ 152.777228]  ret_from_fork+0x1f/0x30
[ 152.778603]  </TASK>
[ 152.779427] irq event stamp: 12002
[ 152.782443] hardirqs last enabled at (12001): [<ffffffff81102620>] queue_work_on+0xe0/0x160
[ 152.786186] hardirqs last disabled at (12002): [<ffffffff8237473b>] _raw_spin_lock_irq+0x7b/0xe0
[ 152.789707] softirqs last enabled at (11996): [<ffffffff810d9105>] irq_exit_rcu+0xb5/0x100
[ 152.792767] softirqs last disabled at (11971): [<ffffffff810d9105>] irq_exit_rcu+0xb5/0x100
[ 152.795802] ---[ end trace 0000000000000000 ]---
----------

But if I do

-	wq1 = alloc_workqueue("wq1", WQ_MEM_RECLAIM, 0);
+	wq1 = alloc_workqueue("wq1", __WQ_LEGACY | WQ_MEM_RECLAIM, 0);

, this warning goes away.

Therefore, it seems to me that the __WQ_LEGACY flag is used in combination with
the WQ_MEM_RECLAIM flag in order to ask check_flush_dependency() not to emit this
warning when we cannot afford doing

-	wq2 = alloc_workqueue("wq2", 0, 0);
+	wq2 = alloc_workqueue("wq2", WQ_MEM_RECLAIM, 0);

because the owners of wq1 and wq2 differ.

Given that the legacy create_workqueue() interface always implied WQ_MEM_RECLAIM flag,
maybe it is better to make the alloc_workqueue() interface WQ_MEM_RECLAIM by default.
That is, obsolete the WQ_MEM_RECLAIM flag and the __WQ_LEGACY flag, and introduce a new
flag (e.g. WQ_MAY_SHARE_WORKER) which is passed to the alloc_workqueue() interface only
when it is absolutely certain that this WQ never participates in the memory reclaim path
and never participates in a flush_workqueue()/flush_work() operation.
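The opposite resolution — giving the flushed workqueue its own rescuer rather than
tagging the flusher with __WQ_LEGACY — can be sketched against the same test module
(a hypothetical fix, assuming the owner of wq2 agrees that it participates in
reclaim):

	/*
	 * check_flush_dependency() returns early when the flushed (target)
	 * workqueue has WQ_MEM_RECLAIM, so a WQ_MEM_RECLAIM workqueue may
	 * safely flush another WQ_MEM_RECLAIM workqueue without warning.
	 */
	wq1 = alloc_workqueue("wq1", WQ_MEM_RECLAIM, 0);
	wq2 = alloc_workqueue("wq2", WQ_MEM_RECLAIM, 0);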
Hello,

On Sat, Mar 19, 2022 at 11:02:51AM +0900, Tetsuo Handa wrote:
> Is the intent of __WQ_LEGACY flag to indicate that "this WQ was created
> using the deprecated interface"? But such intention no longer holds true.
>
> Although the __WQ_LEGACY flag is described as "internal: create*_workqueue()",
> tegra194_cpufreq_probe()/scsi_add_host_with_dma()/iscsi_host_alloc()/
> iscsi_transport_init() are passing __WQ_LEGACY flag using the alloc_workqueue()
> interface. Therefore, __WQ_LEGACY flag is no longer a meaningful indicator of
> "internal: create*_workqueue()". The description of the __WQ_LEGACY flag needs
> an update.
...
> Given that the legacy create_workqueue() interface always implied WQ_MEM_RECLAIM flag,
> maybe it is better to make the alloc_workqueue() interface WQ_MEM_RECLAIM by default.

That actually is pretty expensive when added up, which is why we went for
the shared worker pool model in the first place.

> That is, obsolete the WQ_MEM_RECLAIM flag and the __WQ_LEGACY flag, and introduce a new
> flag (e.g. WQ_MAY_SHARE_WORKER) which is passed to the alloc_workqueue() interface only
> when it is absolutely certain that this WQ never participates in the memory reclaim path
> and never participates in a flush_workqueue()/flush_work() operation.

No, just fix the abusers. There are four abusers in the kernel and they
aren't difficult to fix.

Thanks.
On 2022/03/22 1:55, Tejun Heo wrote:
> No, just fix the abusers. There are four abusers in the kernel and they
> aren't difficult to fix.

So, are you expecting that a change shown below happens, by adding WQ_MEM_RECLAIM
flag to all WQs which may hit the "workqueue: WQ_MEM_RECLAIM %s:%ps is flushing
!WQ_MEM_RECLAIM %s:%ps" warning? Otherwise, __WQ_LEGACY flag will continue
serving as a hack for suppressing this warning.

---
 drivers/cpufreq/tegra194-cpufreq.c  | 2 +-
 drivers/scsi/hosts.c                | 2 +-
 drivers/scsi/libiscsi.c             | 2 +-
 drivers/scsi/scsi_transport_iscsi.c | 2 +-
 include/linux/workqueue.h           | 7 +++----
 kernel/workqueue.c                  | 3 +--
 6 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/drivers/cpufreq/tegra194-cpufreq.c b/drivers/cpufreq/tegra194-cpufreq.c
index ac381db25dbe..5d72ef07f9ed 100644
--- a/drivers/cpufreq/tegra194-cpufreq.c
+++ b/drivers/cpufreq/tegra194-cpufreq.c
@@ -379,7 +379,7 @@ static int tegra194_cpufreq_probe(struct platform_device *pdev)
 	if (IS_ERR(bpmp))
 		return PTR_ERR(bpmp);
 
-	read_counters_wq = alloc_workqueue("read_counters_wq", __WQ_LEGACY, 1);
+	read_counters_wq = alloc_workqueue("read_counters_wq", 0, 1);
 	if (!read_counters_wq) {
 		dev_err(&pdev->dev, "fail to create_workqueue\n");
 		err = -EINVAL;
diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index f69b77cbf538..4485f65d5e92 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -277,7 +277,7 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
 		snprintf(shost->work_q_name, sizeof(shost->work_q_name),
 			 "scsi_wq_%d", shost->host_no);
 		shost->work_q = alloc_workqueue("%s",
-			WQ_SYSFS | __WQ_LEGACY | WQ_MEM_RECLAIM | WQ_UNBOUND,
+			WQ_SYSFS | WQ_MEM_RECLAIM | WQ_UNBOUND,
 			1, shost->work_q_name);
 
 		if (!shost->work_q) {
diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
index 059dae8909ee..c923d2a1086e 100644
--- a/drivers/scsi/libiscsi.c
+++ b/drivers/scsi/libiscsi.c
@@ -2801,7 +2801,7 @@ struct Scsi_Host *iscsi_host_alloc(struct scsi_host_template *sht,
 		snprintf(ihost->workq_name, sizeof(ihost->workq_name),
 			 "iscsi_q_%d", shost->host_no);
 		ihost->workq = alloc_workqueue("%s",
-			WQ_SYSFS | __WQ_LEGACY | WQ_MEM_RECLAIM | WQ_UNBOUND,
+			WQ_SYSFS | WQ_MEM_RECLAIM | WQ_UNBOUND,
 			1, ihost->workq_name);
 		if (!ihost->workq)
 			goto free_host;
diff --git a/drivers/scsi/scsi_transport_iscsi.c b/drivers/scsi/scsi_transport_iscsi.c
index 554b6f784223..b4f0b7584112 100644
--- a/drivers/scsi/scsi_transport_iscsi.c
+++ b/drivers/scsi/scsi_transport_iscsi.c
@@ -4876,7 +4876,7 @@ static __init int iscsi_transport_init(void)
 	}
 
 	iscsi_eh_timer_workq = alloc_workqueue("%s",
-			WQ_SYSFS | __WQ_LEGACY | WQ_MEM_RECLAIM | WQ_UNBOUND,
+			WQ_SYSFS | WQ_MEM_RECLAIM | WQ_UNBOUND,
 			1, "iscsi_eh");
 	if (!iscsi_eh_timer_workq) {
 		err = -ENOMEM;
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 7fee9b6cfede..666c4cc73f6b 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -337,7 +337,6 @@ enum {
 
 	__WQ_DRAINING		= 1 << 16, /* internal: workqueue is draining */
 	__WQ_ORDERED		= 1 << 17, /* internal: workqueue is ordered */
-	__WQ_LEGACY		= 1 << 18, /* internal: create*_workqueue() */
 	__WQ_ORDERED_EXPLICIT	= 1 << 19, /* internal: alloc_ordered_workqueue() */
 
 	WQ_MAX_ACTIVE		= 512,	  /* I like 512, better ideas? */
@@ -420,12 +419,12 @@ alloc_workqueue(const char *fmt, unsigned int flags, int max_active, ...);
 			__WQ_ORDERED_EXPLICIT | (flags), 1, ##args)
 
 #define create_workqueue(name)						\
-	alloc_workqueue("%s", __WQ_LEGACY | WQ_MEM_RECLAIM, 1, (name))
+	alloc_workqueue("%s", WQ_MEM_RECLAIM, 1, (name))
 #define create_freezable_workqueue(name)				\
-	alloc_workqueue("%s", __WQ_LEGACY | WQ_FREEZABLE | WQ_UNBOUND |	\
+	alloc_workqueue("%s", WQ_FREEZABLE | WQ_UNBOUND |		\
 			WQ_MEM_RECLAIM, 1, (name))
 #define create_singlethread_workqueue(name)				\
-	alloc_ordered_workqueue("%s", __WQ_LEGACY | WQ_MEM_RECLAIM, name)
+	alloc_ordered_workqueue("%s", WQ_MEM_RECLAIM, name)
 
 extern void destroy_workqueue(struct workqueue_struct *wq);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 33f1106b4f99..aba3b1505292 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2643,8 +2643,7 @@ static void check_flush_dependency(struct workqueue_struct *target_wq,
 	WARN_ONCE(current->flags & PF_MEMALLOC,
 		  "workqueue: PF_MEMALLOC task %d(%s) is flushing !WQ_MEM_RECLAIM %s:%ps",
 		  current->pid, current->comm, target_wq->name, target_func);
-	WARN_ONCE(worker && ((worker->current_pwq->wq->flags &
-			      (WQ_MEM_RECLAIM | __WQ_LEGACY)) == WQ_MEM_RECLAIM),
+	WARN_ONCE(worker && (worker->current_pwq->wq->flags & WQ_MEM_RECLAIM),
 		  "workqueue: WQ_MEM_RECLAIM %s:%ps is flushing !WQ_MEM_RECLAIM %s:%ps",
 		  worker->current_pwq->wq->name, worker->current_func,
 		  target_wq->name, target_func);
Hello,

On Tue, Mar 22, 2022 at 07:53:49AM +0900, Tetsuo Handa wrote:
> On 2022/03/22 1:55, Tejun Heo wrote:
> > No, just fix the abusers. There are four abusers in the kernel and they
> > aren't difficult to fix.
>
> So, are you expecting that a change shown below happens, by adding WQ_MEM_RECLAIM
> flag to all WQs which may hit the "workqueue: WQ_MEM_RECLAIM %s:%ps is flushing
> !WQ_MEM_RECLAIM %s:%ps" warning? Otherwise, __WQ_LEGACY flag will continue
> serving as a hack for suppressing this warning.

If you convert all of them in the flush chain to use alloc_workqueue()
w/ MEM_RECLAIM, the warning will go away.

> #define create_workqueue(name)						\
> -	alloc_workqueue("%s", __WQ_LEGACY | WQ_MEM_RECLAIM, 1, (name))
> +	alloc_workqueue("%s", WQ_MEM_RECLAIM, 1, (name))
> #define create_freezable_workqueue(name)				\
> -	alloc_workqueue("%s", __WQ_LEGACY | WQ_FREEZABLE | WQ_UNBOUND |	\
> +	alloc_workqueue("%s", WQ_FREEZABLE | WQ_UNBOUND |		\
> 			WQ_MEM_RECLAIM, 1, (name))
> #define create_singlethread_workqueue(name)				\
> -	alloc_ordered_workqueue("%s", __WQ_LEGACY | WQ_MEM_RECLAIM, name)
> +	alloc_ordered_workqueue("%s", WQ_MEM_RECLAIM, name)

But why are you dropping the flag from their intended users?

Thanks.
On 2022/03/22 8:04, Tejun Heo wrote:
> But why are you dropping the flag from their intended users?
As far as I can see, the only difference __WQ_LEGACY makes is whether
"workqueue: WQ_MEM_RECLAIM %s:%ps is flushing !WQ_MEM_RECLAIM %s:%ps"
warning is printed or not. What are the intended users?
On Tue, Mar 22, 2022 at 08:17:43AM +0900, Tetsuo Handa wrote:
> On 2022/03/22 8:04, Tejun Heo wrote:
> > But why are you dropping the flag from their intended users?
>
> As far as I can see, the only difference __WQ_LEGACY makes is whether
> "workqueue: WQ_MEM_RECLAIM %s:%ps is flushing !WQ_MEM_RECLAIM %s:%ps"
> warning is printed or not. What are the intended users?

The old create_workqueue() and friends always imply WQ_MEM_RECLAIM because
they all used to be served by dedicated kthreads. The growing number of
kthreads used this way became a headache. There were too many of these
kthreads sitting around doing nothing. In some niche configurations, they
ate up enough PIDs to cause boot failures.

To address the issue, the new implementation made the workqueues share pools
of workers. However, this means that forward progress can't be guaranteed
under memory pressure, so workqueues which are depended upon during memory
reclaim now need to set WQ_MEM_RECLAIM explicitly to request a dedicated
rescuer thread.

The legacy flushing warning is telling us that those workqueues can be
converted to alloc_workqueue + WQ_MEM_RECLAIM as we know them to be
participating in memory reclaim (as they're flushing one of the explicitly
marked workqueues). So, if you spot them, the right thing to do is
converting all the involved workqueues to use alloc_workqueue() +
WQ_MEM_RECLAIM.

Thanks.
On 2022/03/22 8:27, Tejun Heo wrote:
> On Tue, Mar 22, 2022 at 08:17:43AM +0900, Tetsuo Handa wrote:
>> On 2022/03/22 8:04, Tejun Heo wrote:
>>> But why are you dropping the flag from their intended users?
>>
>> As far as I can see, the only difference __WQ_LEGACY makes is whether
>> "workqueue: WQ_MEM_RECLAIM %s:%ps is flushing !WQ_MEM_RECLAIM %s:%ps"
>> warning is printed or not. What are the intended users?
>
> The old create_workqueue() and friends always imply WQ_MEM_RECLAIM because
> they all used to be served by dedicated kthreads. The growing number of
> kthreads used this way became a headache. There were too many of these
> kthreads sitting around doing nothing. In some niche configurations, they
> ate up enough PIDs to cause boot failures.

OK.

> To address the issue, the new implementation made the workqueues share pools
> of workers. However, this means that forward progress can't be guaranteed
> under memory pressure, so workqueues which are depended upon during memory
> reclaim now need to set WQ_MEM_RECLAIM explicitly to request a dedicated
> rescuer thread.

OK.

> The legacy flushing warning is telling us that those workqueues can be

s/can be/must be/ ?

> converted to alloc_workqueue + WQ_MEM_RECLAIM as we know them to be
> participating in memory reclaim (as they're flushing one of the explicitly
> marked workqueues). So, if you spot them, the right thing to do is
> converting all the involved workqueues to use alloc_workqueue() +
> WQ_MEM_RECLAIM.

Then, can the description of

	__WQ_LEGACY		= 1 << 18, /* internal: create*_workqueue() */

be improved to something like

	/*
	 * This flag disables deadlock detection which can happen when flushing
	 * a work item in !WQ_MEM_RECLAIM workqueue from WQ_MEM_RECLAIM workqueue.
	 * But try to avoid using this flag, by adding WQ_MEM_RECLAIM to all WQs which
	 * can be involved where a guarantee of forward progress under memory pressure
	 * is required.
	 */

? Current /* internal: create*_workqueue() */ tells me nothing.

My question is: I want to add WQ_MEM_RECLAIM flag to the WQ used by loop module
because this WQ is involved upon writeback operation. But unless I add both
__WQ_LEGACY | WQ_MEM_RECLAIM flags to the WQ used by loop module, we will hit the

	WARN_ONCE(worker && ((worker->current_pwq->wq->flags &
			      (WQ_MEM_RECLAIM | __WQ_LEGACY)) == WQ_MEM_RECLAIM),

warning because e.g. xfs-sync WQ used by xfs module is not using WQ_MEM_RECLAIM flag.

	mp->m_sync_workqueue = alloc_workqueue("xfs-sync/%s", XFS_WQFLAGS(WQ_FREEZABLE),
					       0, mp->m_super->s_id);

You are suggesting that the correct approach is to add WQ_MEM_RECLAIM flag to WQs
used by filesystems when adding WQ_MEM_RECLAIM flag to the WQ used by loop module
introduces possibility of hitting the

	WARN_ONCE(worker && ((worker->current_pwq->wq->flags &
			      (WQ_MEM_RECLAIM | __WQ_LEGACY)) == WQ_MEM_RECLAIM),

warning (instead of either adding __WQ_LEGACY | WQ_MEM_RECLAIM flags to the WQ used
by loop module or giving up WQ_MEM_RECLAIM flag for the WQ used by loop module),
correct?
Hello,

On Tue, Mar 22, 2022 at 09:09:53AM +0900, Tetsuo Handa wrote:
> > The legacy flushing warning is telling us that those workqueues can be
>
> s/can be/must be/ ?

Well, one thing that we can but don't want to do is converting all
create_workqueue() users to alloc_workqueue() with MEM_RECLAIM, which is
wasteful but won't break anything. We know for sure that the workqueues
which trigger the legacy warning are participating in memory reclaim and
thus need MEM_RECLAIM. So, yeah, the warning tells us that they need
MEM_RECLAIM and should be converted.

> ? Current /* internal: create*_workqueue() */ tells me nothing.

It's trying to say that it shouldn't be used outside workqueue proper and
the warning message is supposed to trigger the conversion. But, yeah, a
stronger comment can help.

> My question is: I want to add WQ_MEM_RECLAIM flag to the WQ used by loop module
> because this WQ is involved upon writeback operation. But unless I add both
> __WQ_LEGACY | WQ_MEM_RECLAIM flags to the WQ used by loop module, we will hit the
>
> 	WARN_ONCE(worker && ((worker->current_pwq->wq->flags &
> 			      (WQ_MEM_RECLAIM | __WQ_LEGACY)) == WQ_MEM_RECLAIM),
>
> warning because e.g. xfs-sync WQ used by xfs module is not using WQ_MEM_RECLAIM flag.
>
> 	mp->m_sync_workqueue = alloc_workqueue("xfs-sync/%s", XFS_WQFLAGS(WQ_FREEZABLE),
> 					       0, mp->m_super->s_id);
>
> You are suggesting that the correct approach is to add WQ_MEM_RECLAIM flag to WQs
> used by filesystems when adding WQ_MEM_RECLAIM flag to the WQ used by loop module
> introduces possibility of hitting the
>
> 	WARN_ONCE(worker && ((worker->current_pwq->wq->flags &
> 			      (WQ_MEM_RECLAIM | __WQ_LEGACY)) == WQ_MEM_RECLAIM),
>
> warning (instead of either adding __WQ_LEGACY | WQ_MEM_RECLAIM flags to the WQ used
> by loop module or giving up WQ_MEM_RECLAIM flag for the WQ used by loop module),
> correct?

Yeah, you detected multiple issues at the same time. xfs sync is
participating in memory reclaim but doesn't have MEM_RECLAIM and loop is
marked with LEGACY but flushing other workqueues which are MEM_RECLAIM. So,
both xfs and loop workqueues need to be explicitly marked with MEM_RECLAIM.

Thanks.
On Tue, Mar 22, 2022 at 06:52:14AM -1000, Tejun Heo wrote:
> Hello,
>
> On Tue, Mar 22, 2022 at 09:09:53AM +0900, Tetsuo Handa wrote:
> > > The legacy flushing warning is telling us that those workqueues can be
> >
> > s/can be/must be/ ?
>
> Well, one thing that we can but don't want to do is converting all
> create_workqueue() users to alloc_workqueue() with MEM_RECLAIM, which is
> wasteful but won't break anything. We know for sure that the workqueues
> which trigger the legacy warning are participating in memory reclaim and
> thus need MEM_RECLAIM. So, yeah, the warning tells us that they need
> MEM_RECLAIM and should be converted.
>
> > ? Current /* internal: create*_workqueue() */ tells me nothing.
>
> It's trying to say that it shouldn't be used outside workqueue proper and
> the warning message is supposed to trigger the conversion. But, yeah, a
> stronger comment can help.
>
> > My question is: I want to add WQ_MEM_RECLAIM flag to the WQ used by loop module
> > because this WQ is involved upon writeback operation. But unless I add both
> > __WQ_LEGACY | WQ_MEM_RECLAIM flags to the WQ used by loop module, we will hit the
> >
> > 	WARN_ONCE(worker && ((worker->current_pwq->wq->flags &
> > 			      (WQ_MEM_RECLAIM | __WQ_LEGACY)) == WQ_MEM_RECLAIM),
> >
> > warning because e.g. xfs-sync WQ used by xfs module is not using WQ_MEM_RECLAIM flag.
> >
> > 	mp->m_sync_workqueue = alloc_workqueue("xfs-sync/%s", XFS_WQFLAGS(WQ_FREEZABLE),
> > 					       0, mp->m_super->s_id);
> >
> > You are suggesting that the correct approach is to add WQ_MEM_RECLAIM flag to WQs
> > used by filesystems when adding WQ_MEM_RECLAIM flag to the WQ used by loop module
> > introduces possibility of hitting the
> >
> > 	WARN_ONCE(worker && ((worker->current_pwq->wq->flags &
> > 			      (WQ_MEM_RECLAIM | __WQ_LEGACY)) == WQ_MEM_RECLAIM),
> >
> > warning (instead of either adding __WQ_LEGACY | WQ_MEM_RECLAIM flags to the WQ used
> > by loop module or giving up WQ_MEM_RECLAIM flag for the WQ used by loop module),
> > correct?
>
> Yeah, you detected multiple issues at the same time. xfs sync is
> participating in memory reclaim

No it isn't. What makes you think it is part of memory reclaim?

The xfs-sync workqueue exists solely to perform async flushes of
dirty data at ENOSPC via sync_inodes_sb() because we can't call
sync_inodes_sb directly in the context that hit ENOSPC due to locks
and transaction contexts held. The paths that need this are
buffered writes and file create (on disk inode allocation), neither
of which are in the memory reclaim path, either.

So this work has nothing to do with memory reclaim, and as such it's
not tagged with WQ_MEM_RECLAIM.

Cheers,

Dave.
Hello,

On Wed, Mar 23, 2022 at 09:00:07AM +1100, Dave Chinner wrote:
> > Yeah, you detected multiple issues at the same time. xfs sync is
> > participating in memory reclaim
>
> No it isn't. What makes you think it is part of memory reclaim?
>
> The xfs-sync workqueue exists solely to perform async flushes of
> dirty data at ENOSPC via sync_inodes_sb() because we can't call
> sync_inodes_sb directly in the context that hit ENOSPC due to locks
> and transaction contexts held. The paths that need this are
> buffered writes and file create (on disk inode allocation), neither
> of which are in the memory reclaim path, either.
>
> So this work has nothing to do with memory reclaim, and as such it's
> not tagged with WQ_MEM_RECLAIM.

Hmmm... yeah, I actually don't know the exact dependency here and the
dependency may not be real - e.g. the conclusion might be that loop is
conflating different uses and needs to split its use of workqueues into two
separate ones. Tetsuo, can you post more details on the warning that you're
seeing?

Thanks.
On 2022/03/23 7:02, Tejun Heo wrote:
> Hello,
>
> On Wed, Mar 23, 2022 at 09:00:07AM +1100, Dave Chinner wrote:
>>> Yeah, you detected multiple issues at the same time. xfs sync is
>>> participating in memory reclaim
>>
>> No it isn't. What makes you think it is part of memory reclaim?
>>
>> The xfs-sync workqueue exists solely to perform async flushes of
>> dirty data at ENOSPC via sync_inodes_sb() because we can't call
>> sync_inodes_sb directly in the context that hit ENOSPC due to locks
>> and transaction contexts held. The paths that need this are
>> buffered writes and file create (on disk inode allocation), neither
>> of which are in the memory reclaim path, either.
>>
>> So this work has nothing to do with memory reclaim, and as such it's
>> not tagged with WQ_MEM_RECLAIM.
>
> Hmmm... yeah, I actually don't know the exact dependency here and the
> dependency may not be real - e.g. the conclusion might be that loop is
> conflating different uses and needs to split its use of workqueues into two
> separate ones. Tetsuo, can you post more details on the warning that you're
> seeing?

It was reported at
https://lore.kernel.org/all/20210322060334.GD32426@xsang-OptiPlex-9020/ .
On Wed, Mar 23, 2022 at 07:05:56AM +0900, Tetsuo Handa wrote:
> > Hmmm... yeah, I actually don't know the exact dependency here and the
> > dependency may not be real - e.g. the conclusion might be that loop is
> > conflating different uses and needs to split its use of workqueues into two
> > separate ones. Tetsuo, can you post more details on the warning that you're
> > seeing?
>
> It was reported at
> https://lore.kernel.org/all/20210322060334.GD32426@xsang-OptiPlex-9020/ .

Looks like a correct dependency to me. The work item is being flushed from
good old write path. Dave?

Thanks.
On Tue, Mar 22, 2022 at 12:19:40PM -1000, Tejun Heo wrote:
> On Wed, Mar 23, 2022 at 07:05:56AM +0900, Tetsuo Handa wrote:
> > > Hmmm... yeah, I actually don't know the exact dependency here and the
> > > dependency may not be real - e.g. the conclusion might be that loop is
> > > conflating different uses and needs to split its use of workqueues into two
> > > separate ones. Tetsuo, can you post more details on the warning that you're
> > > seeing?
> >
> > It was reported at
> > https://lore.kernel.org/all/20210322060334.GD32426@xsang-OptiPlex-9020/ .
>
> Looks like a correct dependency to me. The work item is being flushed from
> good old write path. Dave?

The filesystem buffered write IO path isn't part of memory reclaim -
it's a user IO path and I think most filesystems will treat it that
way.

We've had similar layering problems with the loop IO path implying
GFP_NOFS must be used by filesystems allocating memory in the IO
path - we solved that by requiring the loop IO submission context
(loop_process_work()) to set PF_MEMALLOC_NOIO so that it didn't
deadlock anywhere in the underlying filesystems that have no idea
that the loop device has added memory reclaim constraints to the IO
path.

This seems like it's the same layering problem - syscall facing IO
paths are designed for incoming IO from user context, not outgoing
writeback IO from memory reclaim contexts. Memory reclaim contexts
are supposed to use back end filesystem operations like
->writepages() to flush dirty data when necessary.

If the loop device IO mechanism means that every ->write_iter path
needs to be considered as directly in the memory reclaim path, then
that means a huge amount of the kernel needs to be considered as "in
memory reclaim". i.e. it's not just this one XFS workqueue that is
going to have this problem - it's any workqueue that can be waited on
by the incoming IO path.

For example, a network filesystem might put the network stack directly
in the IO path. Which means if we then put loop on top of such a
filesystem, various workqueues in the network stack may now need to
be considered as running under the memory reclaim path because of
the loop block device.

I don't know what the solution is, but if the fix is "xfs needs to
mark a workqueue that has nothing to do with memory reclaim as
WQ_MEM_RECLAIM because of the loop device" then we're talking about
playing workqueue whack-a-mole across the entire kernel forever
more....

Cheers,

Dave.
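The PF_MEMALLOC_NOIO constraint Dave mentions is normally applied with the scoped
helpers from include/linux/sched/mm.h; a minimal sketch of an I/O submission work
function wrapped this way (illustrative only, not the actual loop driver code)
might look like:

	#include <linux/sched/mm.h>
	#include <linux/workqueue.h>

	static void example_submit_workfn(struct work_struct *work)
	{
		/*
		 * Everything below allocates with an implicit GFP_NOIO, so
		 * the underlying filesystem cannot recurse back into this
		 * I/O path through direct reclaim.
		 */
		unsigned int noio_flags = memalloc_noio_save();

		/* ... issue reads/writes against the backing file here ... */

		memalloc_noio_restore(noio_flags);
	}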
On 2022/03/23 7:59, Dave Chinner wrote:
> I don't know what the solution is, but if the fix is "xfs needs to
> mark a workqueue that has nothing to do with memory reclaim as
> WQ_MEM_RECLAIM because of the loop device" then we're talking about
> playing workqueue whack-a-mole across the entire kernel forever
> more....

During an attempt to fix a lockdep warning caused by calling destroy_workqueue()
when loop device's autoclear operation is triggered with disk->open_mutex held,
Christoph has proposed a patch which avoids calling destroy_workqueue() by using
a global WQ. But as demonstrated at
https://lkml.kernel.org/r/22b51922-30c6-e4ed-ace9-5620f877682c@i-love.sakura.ne.jp ,
I confirmed that the loop module needs to reserve one "struct task_struct" on
each loop device's WQ. And creating a per loop device WQ with WQ_MEM_RECLAIM
flag is the only available method for reserving "struct task_struct" on each
loop device's WQ. That is, WQ_MEM_RECLAIM flag is required not only for
surviving memory pressure situations but also for surviving the max_active
limitation.

But as demonstrated at
https://lkml.kernel.org/r/61f41e56-3650-f0fc-9ef5-7e19fe84e6b7@I-love.SAKURA.ne.jp ,
creating a per loop device WQ with WQ_MEM_RECLAIM flag without __WQ_LEGACY flag
will hit this "workqueue: WQ_MEM_RECLAIM %s:%ps is flushing !WQ_MEM_RECLAIM %s:%ps"
warning. And if WQs used by filesystem side do not want to use WQ_MEM_RECLAIM flag, the loop
module would have to abuse __WQ_LEGACY flag in order to suppress this warning.
On Wed, Mar 23, 2022 at 08:32:49AM +0900, Tetsuo Handa wrote:
> On 2022/03/23 7:59, Dave Chinner wrote:
> > I don't know what the solution is, but if the fix is "xfs needs to
> > mark a workqueue that has nothing to do with memory reclaim as
> > WQ_MEM_RECLAIM because of the loop device" then we're talking about
> > playing workqueue whack-a-mole across the entire kernel forever
> > more....
....
> And if WQs used by filesystem side do not want to use WQ_MEM_RECLAIM flag, the loop
> module would have to abuse __WQ_LEGACY flag in order to suppress this warning.

I'm not talking about whether filesystems want to use WQ_MEM_RECLAIM
or not, I'm commenting on the implicit dependency that the loop
device creates that forces the use of WQ_MEM_RECLAIM in all
downstream workqueues.

That's what I'm asking about here - how far does this implicit,
undocumented dependency actually reach and how do we communicate to
all developers so that they know about this in the future when
creating new workqueues that might end up under the loop device?

That's the problem here - unless the developer explicitly considers
and/or remembers this loopback dependency when adding a new
workqueue to a filesystem (or even as far down as network stacks)
then this problem is going to keep happening and we'll just have to
keep driving WQ_MEM_RECLAIM deeper into the stack.

Tejun stated that just marking all workqueues as WQ_MEM_RECLAIM is
problematic in some situations, and I'm concerned that resolving
implicit dependencies is going to end up in that situation anyway.

Cheers,

Dave.
Hello,

On Wed, Mar 23, 2022 at 09:59:14AM +1100, Dave Chinner wrote:
> The filesystem buffered write IO path isn't part of memory reclaim -
> it's a user IO path and I think most filesystems will treat it that
> way.

We can argue the semantics but anything in the fs / io path which sits on
the write path should be marked MEM_RECLAIM because it can be depended upon
while cleaning dirty pages. This isn't a layering problem or anything. It's
just what that flag is for.

> If the loop device IO mechanism means that every ->write_iter path
> needs to be considered as directly in the memory reclaim path, then
> that means a huge amount of the kernel needs to be considered as "in
> memory reclaim". i.e. it's not just this one XFS workqueue that is
> going to have this problem - it's any workqueue that can be waited on
> by the incoming IO path.
>
> For example, a network filesystem might put the network stack directly
> in the IO path. Which means if we then put loop on top of such a
> filesystem, various workqueues in the network stack may now need to
> be considered as running under the memory reclaim path because of
> the loop block device.
>
> I don't know what the solution is, but if the fix is "xfs needs to
> mark a workqueue that has nothing to do with memory reclaim as
> WQ_MEM_RECLAIM because of the loop device" then we're talking about
> playing workqueue whack-a-mole across the entire kernel forever
> more....

Yeah, all those workqueues must be and most of them are already tagged with
MEM_RECLAIM. The network drivers are kinda painful and we *can* make them
conditional (on it sitting in the io path) if that ever becomes necessary
but the number hasn't been problematic till now.

Thanks.
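For concreteness, the xfs-side conversion Tejun is pointing at would presumably
be a one-flag change along the following lines in xfs_init_mount_workqueues().
This is only a sketch of the idea under discussion - no such patch was posted
in this thread, and Dave's objection above is precisely about whether it is
the right fix:

-	mp->m_sync_workqueue = alloc_workqueue("xfs-sync/%s", XFS_WQFLAGS(WQ_FREEZABLE),
-			0, mp->m_super->s_id);
+	mp->m_sync_workqueue = alloc_workqueue("xfs-sync/%s",
+			XFS_WQFLAGS(WQ_MEM_RECLAIM | WQ_FREEZABLE),
+			0, mp->m_super->s_id);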
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 19fe19eaa50e..60dfebcedd95 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1013,7 +1013,7 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
 		lo->lo_flags |= LO_FLAGS_READ_ONLY;
 
 	lo->workqueue = alloc_workqueue("loop%d",
-					WQ_UNBOUND | WQ_FREEZABLE,
+					WQ_MEM_RECLAIM | WQ_UNBOUND | WQ_FREEZABLE,
 					0,
 					lo->lo_number);
 	if (!lo->workqueue) {
Commit b5dd2f6047ca1080 ("block: loop: improve performance via blk-mq") started
using a global WQ_MEM_RECLAIM workqueue.

+	loop_wq = alloc_workqueue("kloopd",
+			WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND, 0);

Commit f4aa4c7bbac6c4af ("block: loop: convert to per-device workqueue") started
using a per device WQ_MEM_RECLAIM workqueue.

-	loop_wq = alloc_workqueue("kloopd",
-			WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND, 0);
+	lo->wq = alloc_workqueue("kloopd%d",
+			WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND, 0,
+			lo->lo_number);

Commit 4d4e41aef9429872 ("block: loop: avoiding too many pending per work I/O")
adjusted the max_active parameter.

-			WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND, 0,
+			WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND, 16,

Commit e03a3d7a94e2485b ("block: loop: use kthread_work") started using a per
device kthread.

-	lo->wq = alloc_workqueue("kloopd%d",
-			WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND, 16,
-			lo->lo_number);
+	lo->worker_task = kthread_run(kthread_worker_fn,
+			&lo->worker, "loop%d", lo->lo_number);

Commit 87579e9b7d8dc36e ("loop: use worker per cgroup instead of kworker") again
started using a per device workqueue

-	lo->worker_task = kthread_run(loop_kthread_worker_fn,
-			&lo->worker, "loop%d", lo->lo_number);
+	lo->workqueue = alloc_workqueue("loop%d",
+					WQ_UNBOUND | WQ_FREEZABLE,
+					0,
+					lo->lo_number);

but forgot to restore WQ_MEM_RECLAIM flag.

We should reserve one "struct task_struct" for each loop device in order to
guarantee that I/O requests for writeback operations can make forward progress
even under memory pressure.

I don't know why WQ_FREEZABLE flag was added, but let's add WQ_MEM_RECLAIM flag
like commit f4aa4c7bbac6c4af ("block: loop: convert to per-device workqueue")
describes.

Fixes: 87579e9b7d8dc36e ("loop: use worker per cgroup instead of kworker")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 drivers/block/loop.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)