
[v2] xfs: enable WQ_MEM_RECLAIM on m_sync_workqueue

Message ID ZoGJRSe98wZFDK36@kernel.org (mailing list archive)
State New

Commit Message

Mike Snitzer June 30, 2024, 4:35 p.m. UTC
The need for this fix was exposed while developing a new NFS feature
called "localio", which bypasses the network if both the client and
server are on the same host; see:
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/log/?h=nfs-localio-for-6.11

Because NFS's nfsiod_workqueue enables WQ_MEM_RECLAIM, writeback will
call into NFS, and if localio is enabled the NFS client will call
directly into xfs_file_write_iter. This causes the following
backtrace when running xfstest generic/476 against NFS with localio:

  workqueue: WQ_MEM_RECLAIM writeback:wb_workfn is flushing !WQ_MEM_RECLAIM xfs-sync/vdc:xfs_flush_inodes_worker
  WARNING: CPU: 6 PID: 8525 at kernel/workqueue.c:3706 check_flush_dependency+0x2a4/0x328
  Modules linked in:
  CPU: 6 PID: 8525 Comm: kworker/u71:5 Not tainted 6.10.0-rc3-ktest-00032-g2b0a133403ab #18502
  Hardware name: linux,dummy-virt (DT)
  Workqueue: writeback wb_workfn (flush-0:33)
  pstate: 400010c5 (nZcv daIF -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
  pc : check_flush_dependency+0x2a4/0x328
  lr : check_flush_dependency+0x2a4/0x328
  sp : ffff0000c5f06bb0
  x29: ffff0000c5f06bb0 x28: ffff0000c998a908 x27: 1fffe00019331521
  x26: ffff0000d0620900 x25: ffff0000c5f06ca0 x24: ffff8000828848c0
  x23: 1fffe00018be0d8e x22: ffff0000c1210000 x21: ffff0000c75fde00
  x20: ffff800080bfd258 x19: ffff0000cad63400 x18: ffff0000cd3a4810
  x17: 0000000000000000 x16: 0000000000000000 x15: ffff800080508d98
  x14: 0000000000000000 x13: 204d49414c434552 x12: 1fffe0001b6eeab2
  x11: ffff60001b6eeab2 x10: dfff800000000000 x9 : ffff60001b6eeab3
  x8 : 0000000000000001 x7 : 00009fffe491154e x6 : ffff0000db775593
  x5 : ffff0000db775590 x4 : ffff0000db775590 x3 : 0000000000000000
  x2 : 0000000000000027 x1 : ffff600018be0d62 x0 : dfff800000000000
  Call trace:
   check_flush_dependency+0x2a4/0x328
   __flush_work+0x184/0x5c8
   flush_work+0x18/0x28
   xfs_flush_inodes+0x68/0x88
   xfs_file_buffered_write+0x128/0x6f0
   xfs_file_write_iter+0x358/0x448
   nfs_local_doio+0x854/0x1568
   nfs_initiate_pgio+0x214/0x418
   nfs_generic_pg_pgios+0x304/0x480
   nfs_pageio_doio+0xe8/0x240
   nfs_pageio_complete+0x160/0x480
   nfs_writepages+0x300/0x4f0
   do_writepages+0x12c/0x4a0
   __writeback_single_inode+0xd4/0xa68
   writeback_sb_inodes+0x470/0xcb0
   __writeback_inodes_wb+0xb0/0x1d0
   wb_writeback+0x594/0x808
   wb_workfn+0x5e8/0x9e0
   process_scheduled_works+0x53c/0xd90
   worker_thread+0x370/0x8c8
   kthread+0x258/0x2e8
   ret_from_fork+0x10/0x20

Fix this by enabling WQ_MEM_RECLAIM on XFS's m_sync_workqueue.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 fs/xfs/xfs_super.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

[v2: dropped RFC, this fixes xfstests generic/476, resubmitting with more feeling]
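The diff body is not reproduced in this archive view; per the diffstat,
the change amounts to adding WQ_MEM_RECLAIM to the flags passed to
alloc_workqueue() for m_sync_workqueue, roughly along these lines (a
sketch of fs/xfs/xfs_super.c, not the verbatim patch; the XFS_WQFLAGS()
wrapper and surrounding names are assumed from the kernel tree of that
era):

```c
	/* xfs_init_mount_workqueues() -- sketch, not the verbatim patch */
	mp->m_sync_workqueue = alloc_workqueue("xfs-sync/%s",
			XFS_WQFLAGS(WQ_MEM_RECLAIM | WQ_FREEZABLE),
			0, mp->m_super->s_id);
```

WQ_MEM_RECLAIM makes the workqueue allocate a rescuer thread at
creation time, which is also what satisfies check_flush_dependency()
when a WQ_MEM_RECLAIM worker flushes work queued on this workqueue.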

Comments

Dave Chinner June 30, 2024, 11:46 p.m. UTC | #1
On Sun, Jun 30, 2024 at 12:35:17PM -0400, Mike Snitzer wrote:
> The need for this fix was exposed while developing a new NFS feature
> called "localio" which bypasses the network, if both the client and
> server are on the same host, see:
> https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/log/?h=nfs-localio-for-6.11
> 
> Because NFS's nfsiod_workqueue enables WQ_MEM_RECLAIM, writeback will
> call into NFS and if localio is enabled the NFS client will call
> directly into xfs_file_write_iter, this causes the following
> backtrace when running xfstest generic/476 against NFS with localio:

Oh, that's nasty.

We now have to change every path in every filesystem that NFS can
call that might defer work to a workqueue.

IOWs, this makes WQ_MEM_RECLAIM pretty much mandatory for front end
workqueues in the filesystem and block layers regardless of whether
the filesystem or block layer path runs under memory reclaim context
or not.

All WQ_MEM_RECLAIM does is create a rescuer thread at workqueue
creation that is used as a "worker of last resort" when forking new
worker threads fails due to ENOMEM. This prevents deadlocks when
doing GFP_KERNEL allocations in workqueue context and potentially
deadlocking because a GFP_KERNEL allocation is blocking waiting for
this workqueue to allocate workers to make progress.

>   workqueue: WQ_MEM_RECLAIM writeback:wb_workfn is flushing !WQ_MEM_RECLAIM xfs-sync/vdc:xfs_flush_inodes_worker
>   WARNING: CPU: 6 PID: 8525 at kernel/workqueue.c:3706 check_flush_dependency+0x2a4/0x328
>   Modules linked in:
>   CPU: 6 PID: 8525 Comm: kworker/u71:5 Not tainted 6.10.0-rc3-ktest-00032-g2b0a133403ab #18502
>   Hardware name: linux,dummy-virt (DT)
>   Workqueue: writeback wb_workfn (flush-0:33)
>   pstate: 400010c5 (nZcv daIF -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
>   pc : check_flush_dependency+0x2a4/0x328
>   lr : check_flush_dependency+0x2a4/0x328
>   sp : ffff0000c5f06bb0
>   x29: ffff0000c5f06bb0 x28: ffff0000c998a908 x27: 1fffe00019331521
>   x26: ffff0000d0620900 x25: ffff0000c5f06ca0 x24: ffff8000828848c0
>   x23: 1fffe00018be0d8e x22: ffff0000c1210000 x21: ffff0000c75fde00
>   x20: ffff800080bfd258 x19: ffff0000cad63400 x18: ffff0000cd3a4810
>   x17: 0000000000000000 x16: 0000000000000000 x15: ffff800080508d98
>   x14: 0000000000000000 x13: 204d49414c434552 x12: 1fffe0001b6eeab2
>   x11: ffff60001b6eeab2 x10: dfff800000000000 x9 : ffff60001b6eeab3
>   x8 : 0000000000000001 x7 : 00009fffe491154e x6 : ffff0000db775593
>   x5 : ffff0000db775590 x4 : ffff0000db775590 x3 : 0000000000000000
>   x2 : 0000000000000027 x1 : ffff600018be0d62 x0 : dfff800000000000
>   Call trace:
>    check_flush_dependency+0x2a4/0x328
>    __flush_work+0x184/0x5c8
>    flush_work+0x18/0x28
>    xfs_flush_inodes+0x68/0x88
>    xfs_file_buffered_write+0x128/0x6f0
>    xfs_file_write_iter+0x358/0x448
>    nfs_local_doio+0x854/0x1568
>    nfs_initiate_pgio+0x214/0x418
>    nfs_generic_pg_pgios+0x304/0x480
>    nfs_pageio_doio+0xe8/0x240
>    nfs_pageio_complete+0x160/0x480
>    nfs_writepages+0x300/0x4f0
>    do_writepages+0x12c/0x4a0
>    __writeback_single_inode+0xd4/0xa68
>    writeback_sb_inodes+0x470/0xcb0
>    __writeback_inodes_wb+0xb0/0x1d0
>    wb_writeback+0x594/0x808
>    wb_workfn+0x5e8/0x9e0
>    process_scheduled_works+0x53c/0xd90

Ah, this is just the standard backing device flusher thread that is
running. This is the back end of filesystem writeback, not the front
end. It was never intended to be able to directly do loopback IO
submission to the front end filesystem IO paths like this - they are
very different contexts and have very different constraints.

This particular situation occurs when XFS is near ENOSPC.  There's a
very high probability it is going to fail these writes, and so it's
doing slow path work that involves blocking, and front end filesystem
processing is allowed to block on just about anything in the
filesystem as long as it can guarantee it won't deadlock.

Fundamentally, doing IO submission in WQ_MEM_RECLAIM context changes
the submission context for -all- filesystems, not just XFS.
If we have to make this change to XFS, then -every-
workqueue in XFS (not just this one) must be converted to
WQ_MEM_RECLAIM, and then many workqueues in all the other
filesystems will need to have the same changes made, too.

That doesn't smell right to me.

----

So let's look at how back end filesystem IO currently submits new
front end filesystem IO: the loop block device does this, and it
uses workqueues to defer submitted IO so that the lower IO
submission context can be directly controlled and made to match the
front end filesystem IO submission path behaviours.

The loop device does not use rescuer threads - that's not needed
when you have a queue based submission and just use the workqueues
to run the queues until they are empty. So the loop device uses
a standard unbound workqueue for its IO submission path, and
then when the work is running it sets the task flags to say "this is
a nested IO worker thread" before it starts processing the
submission queue and submitting new front end filesystem IO:

static void loop_process_work(struct loop_worker *worker,
                        struct list_head *cmd_list, struct loop_device *lo)
{
        int orig_flags = current->flags;
        struct loop_cmd *cmd;

        current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO;

PF_LOCAL_THROTTLE prevents deadlocks in balance_dirty_pages() by
lifting the dirty ratio for this thread a little, hence giving it
priority over the upper filesystem. i.e. the upper filesystem will
throttle incoming writes first, then the back end IO submission
thread can still submit new front end IOs to the lower filesystem
and they won't block in balance_dirty_pages() because the lower
filesystem has a higher limit. Hence the lower filesystem can always
drain the dirty pages on the upper filesystem, and the system won't
deadlock in balance_dirty_pages().

Using WQ_MEM_RECLAIM context for IO submission does not address this
deadlock.

The PF_MEMALLOC_NOIO flag prevents the lower filesystem IO from
causing memory reclaim to re-enter filesystems or IO devices and so
prevents deadlocks from occurring where IO that cleans pages is
waiting on IO to complete.

Using WQ_MEM_RECLAIM context for IO submission does not address this
deadlock either.

IOWs, doing front end IO submission like this from the BDI flusher
thread is guaranteed to deadlock sooner or later, regardless of
whether WQ_MEM_RECLAIM is set or not on workqueues that are flushed
during IO submission. The WQ_MEM_RECLAIM warning is effectively your
canary in the coal mine. And the canary just carked it.

IMO, the only sane way to ensure this sort of nested "back-end page
cleaning submits front-end filesystem IO" mechanism works is to
do something similar to the loop device. You most definitely don't
want to be doing buffered IO (double caching is almost always bad)
and you want to be doing async direct IO so that the submission
thread is not waiting on completion before the next IO is
submitted.
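
The loop-device pattern described above, applied to a localio
submission worker, might look roughly like this (hypothetical function
name, a sketch of the pattern rather than the actual localio patches):

```c
/*
 * Sketch only: a nested loopback IO submission worker following the
 * loop device pattern. The function name is illustrative.
 */
static void nfs_local_io_work(struct work_struct *work)
{
	unsigned int orig_flags = current->flags;

	/*
	 * Behave like a nested IO submitter: get a slightly higher
	 * dirty limit in balance_dirty_pages(), and stop memory
	 * reclaim from recursing back into filesystems or IO devices.
	 */
	current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO;

	/* ... drain the queued requests and submit them to the
	 * underlying filesystem here ... */

	current_restore_flags(orig_flags,
			PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO);
}
```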

-Dave.
Christoph Hellwig July 1, 2024, 4:45 a.m. UTC | #2
On Mon, Jul 01, 2024 at 09:46:36AM +1000, Dave Chinner wrote:
> Oh, that's nasty.

Yes.

> We now have to change every path in every filesystem that NFS can
> call that might defer work to a workqueue.

Yes.  That's why the kernel for a long time had the stance that using
network file systems / storage locally is entirely unsupported.

If we want to change that we'll have a lot of work to do.
Mike Snitzer July 1, 2024, 2:13 p.m. UTC | #3
On Mon, Jul 01, 2024 at 09:46:36AM +1000, Dave Chinner wrote:
> On Sun, Jun 30, 2024 at 12:35:17PM -0400, Mike Snitzer wrote:
> > The need for this fix was exposed while developing a new NFS feature
> > called "localio" which bypasses the network, if both the client and
> > server are on the same host, see:
> > https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/log/?h=nfs-localio-for-6.11
> > 
> > Because NFS's nfsiod_workqueue enables WQ_MEM_RECLAIM, writeback will
> > call into NFS and if localio is enabled the NFS client will call
> > directly into xfs_file_write_iter, this causes the following
> > backtrace when running xfstest generic/476 against NFS with localio:
> 
> Oh, that's nasty.
> 
> We now have to change every path in every filesystem that NFS can
> call that might defer work to a workqueue.
> 
> IOWs, this makes WQ_MEM_RECLAIM pretty much mandatory for front end
> workqueues in the filesystem and block layers regardless of whether
> the filesystem or block layer path runs under memory reclaim context
> or not.

As you noticed later (yet didn't circle back to edit here), this is
just triggering when space is low.

But yeah, I submitted the patch knowing it was likely not viable, I
just wasn't clear on all the whys.  I appreciate your feedback, and
yes it did feel like a slippery slope (understatement) that'd force
other filesystems to follow.
 
> All WQ_MEM_RECLAIM does is create a rescuer thread at workqueue
> creation that is used as a "worker of last resort" when forking new
> worker threads fail due to ENOMEM. This prevents deadlocks when
> doing GFP_KERNEL allocations in workqueue context and potentially
> deadlocking because a GFP_KERNEL allocation is blocking waiting for
> this workqueue to allocate workers to make progress.

Right.

> >   workqueue: WQ_MEM_RECLAIM writeback:wb_workfn is flushing !WQ_MEM_RECLAIM xfs-sync/vdc:xfs_flush_inodes_worker
> >   WARNING: CPU: 6 PID: 8525 at kernel/workqueue.c:3706 check_flush_dependency+0x2a4/0x328
> >   Modules linked in:
> >   CPU: 6 PID: 8525 Comm: kworker/u71:5 Not tainted 6.10.0-rc3-ktest-00032-g2b0a133403ab #18502
> >   Hardware name: linux,dummy-virt (DT)
> >   Workqueue: writeback wb_workfn (flush-0:33)
> >   pstate: 400010c5 (nZcv daIF -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
> >   pc : check_flush_dependency+0x2a4/0x328
> >   lr : check_flush_dependency+0x2a4/0x328
> >   sp : ffff0000c5f06bb0
> >   x29: ffff0000c5f06bb0 x28: ffff0000c998a908 x27: 1fffe00019331521
> >   x26: ffff0000d0620900 x25: ffff0000c5f06ca0 x24: ffff8000828848c0
> >   x23: 1fffe00018be0d8e x22: ffff0000c1210000 x21: ffff0000c75fde00
> >   x20: ffff800080bfd258 x19: ffff0000cad63400 x18: ffff0000cd3a4810
> >   x17: 0000000000000000 x16: 0000000000000000 x15: ffff800080508d98
> >   x14: 0000000000000000 x13: 204d49414c434552 x12: 1fffe0001b6eeab2
> >   x11: ffff60001b6eeab2 x10: dfff800000000000 x9 : ffff60001b6eeab3
> >   x8 : 0000000000000001 x7 : 00009fffe491154e x6 : ffff0000db775593
> >   x5 : ffff0000db775590 x4 : ffff0000db775590 x3 : 0000000000000000
> >   x2 : 0000000000000027 x1 : ffff600018be0d62 x0 : dfff800000000000
> >   Call trace:
> >    check_flush_dependency+0x2a4/0x328
> >    __flush_work+0x184/0x5c8
> >    flush_work+0x18/0x28
> >    xfs_flush_inodes+0x68/0x88
> >    xfs_file_buffered_write+0x128/0x6f0
> >    xfs_file_write_iter+0x358/0x448
> >    nfs_local_doio+0x854/0x1568
> >    nfs_initiate_pgio+0x214/0x418
> >    nfs_generic_pg_pgios+0x304/0x480
> >    nfs_pageio_doio+0xe8/0x240
> >    nfs_pageio_complete+0x160/0x480
> >    nfs_writepages+0x300/0x4f0
> >    do_writepages+0x12c/0x4a0
> >    __writeback_single_inode+0xd4/0xa68
> >    writeback_sb_inodes+0x470/0xcb0
> >    __writeback_inodes_wb+0xb0/0x1d0
> >    wb_writeback+0x594/0x808
> >    wb_workfn+0x5e8/0x9e0
> >    process_scheduled_works+0x53c/0xd90
> 
> Ah, this is just the standard backing device flusher thread that is
> running. This is the back end of filesystem writeback, not the front
> end. It was never intended to be able to directly do loop back IO
> submission to the front end filesystem IO paths like this - they are
> very different contexts and have very different constraints.
> 
> This particular situation occurs when XFS is near ENOSPC.  There's a
> very high probability it is going to fail these writes, and so it's
> doing slow path work that involves blocking and front end filesystem
> processing is allowed to block on just about anything in the
> filesystem as long as it can guarantee it won't deadlock.
> 
> Fundamentally, doing IO submission in WQ_MEM_RECLAIM context changes
> the submission context for -all- filesystems, not just XFS.

Yes, I see that.

> If we have to make this change to XFS, then -every-
> workqueue in XFS (not just this one) must be converted to
> WQ_MEM_RECLAIM, and then many workqueues in all the other
> filesystems will need to have the same changes made, too.

AFAICT they are all WQ_MEM_RECLAIM aside from m_sync_workqueue, but
that's beside the point.

> That doesn't smell right to me.
> 
> ----
> 
> So let's look at how back end filesystem IO currently submits new
> front end filesystem IO: the loop block device does this, and it
> uses workqueues to defer submitted IO so that the lower IO
> submission context can be directly controlled and made with the
> front end filesystem IO submission path behaviours.
> 
> The loop device does not use rescuer threads - that's not needed
> when you have a queue based submission and just use the workqueues
> to run the queues until they are empty. So the loop device uses
> a standard unbound workqueue for it's IO submission path, and
> then when the work is running it sets the task flags to say "this is
> a nested IO worker thread" before it starts processing the
> submission queue and submitting new front end filesystem IO:
> 
> static void loop_process_work(struct loop_worker *worker,
>                         struct list_head *cmd_list, struct loop_device *lo)
> {
>         int orig_flags = current->flags;
>         struct loop_cmd *cmd;
> 
>         current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO;
> 
> PF_LOCAL_THROTTLE prevents deadlocks in balance_dirty_pages() by
> lifting the dirty ratio for this thread a little, hence giving it
> priority over the upper filesystem. i.e. the upper filesystem will
> throttle incoming writes first, then the back end IO submission
> thread can still submit new front end IOs to the lower filesystem
> and they won't block in balance_dirty_pages() because the lower
> filesystem has a higher limit. hence the lower filesystem can always
> drain the dirty pages on the upper filesystem, and the system won't
> deadlock in balance_dirty_pages().

Perfect, thanks for the guidance.

> Using WQ_MEM_RECLAIM context for IO submission does not address this
> deadlock.
> 
> The PF_MEMALLOC_NOIO flag prevents the lower filesystem IO from
> causing memory reclaim to re-enter filesystems or IO devices and so
> prevents deadlocks from occuring where IO that cleans pages is
> waiting on IO to complete.
> 
> Using WQ_MEM_RECLAIM context for IO submission does not address this
> deadlock either.
> 
> IOWs, doing front IO submission like this from the BDI flusher
> thread is guaranteed to deadlock sooner or later, regardless of
> whether WQ_MEM_RECLAIM is set or not on workqueues that are flushed
> during IO submission. The WQ_MEM_RECLAIM warning is effectively your
> canary in the coal mine. And the canary just carked it.

Yes, I knew it as such, but I wanted to pin down why it died. Thanks
for your help!

FYI, this is the long-standing way Trond dealt with this
WQ_MEM_RECLAIM situation. I was just hoping to avoid having to
introduce a dedicated workqueue for localio's needs (NeilBrown really
disliked that the patch mentioned how it avoids blowing the stack, but
we get that as a side-effect of needing it to avoid WQ_MEM_RECLAIM):
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/commit/?h=nfs-localio-for-6.11-testing&id=0cd25f7610df291827ad95023e03fdd4f93bbea7
(I revised the patch to add PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO and
adjusted the header to reflect our discussion in this thread).

> IMO, the only sane way to ensure this sort of nested "back-end page
> cleaning submits front-end IO filesystem IO" mechanism works is to
> do something similar to the loop device. You most definitely don't
> want to be doing buffered IO (double caching is almost always bad)
> and you want to be doing async direct IO so that the submission
> thread is not waiting on completion before the next IO is
> submitted.

Yes, follow-on work is for me to revive the directio path for localio
that ultimately wasn't pursued (or properly wired up) because it
creates DIO alignment requirements on NFS client IO:
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/commit/?h=nfs-localio-for-6.11-testing&id=f6c9f51fca819a8af595a4eb94811c1f90051eab

But underlying filesystems (like XFS) have the appropriate checks; we
just need to fail gracefully and disable NFS localio if the IO is
misaligned.

Mike
Trond Myklebust July 2, 2024, 12:33 p.m. UTC | #4
On Mon, 2024-07-01 at 10:13 -0400, Mike Snitzer wrote:
> On Mon, Jul 01, 2024 at 09:46:36AM +1000, Dave Chinner wrote:
> > IMO, the only sane way to ensure this sort of nested "back-end page
> > cleaning submits front-end IO filesystem IO" mechanism works is to
> > do something similar to the loop device. You most definitely don't
> > want to be doing buffered IO (double caching is almost always bad)
> > and you want to be doing async direct IO so that the submission
> > thread is not waiting on completion before the next IO is
> > submitted.
> 
> Yes, follow-on work is for me to revive the directio path for localio
> that ultimately wasn't pursued (or properly wired up) because it
> creates DIO alignment requirements on NFS client IO:
> https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/commit/?h=nfs-localio-for-6.11-testing&id=f6c9f51fca819a8af595a4eb94811c1f90051eab
> 
> But underlying filesystems (like XFS) have the appropriate checks, we
> just need to fail gracefully and disable NFS localio if the IO is
> misaligned.
> 

Just a reminder to everyone that this is replacing a configuration
which would in any case result in double caching, because without the
localio change, it would end up being a loopback mount through the NFS
server.

Use of O_DIRECT to xfs would impose alignment requirements by the lower
filesystem that are not being followed by the upper filesystem. A
"remedy" where we fall back to disabling localio if there is no
alignment won't fix anything. You will now have added the extra nfsd
layer back in, and so have the extra networking overhead in addition to
the memory management problems you were trying to solve with O_DIRECT.
Dave Chinner July 2, 2024, 1:04 p.m. UTC | #5
On Tue, Jul 02, 2024 at 12:33:53PM +0000, Trond Myklebust wrote:
> On Mon, 2024-07-01 at 10:13 -0400, Mike Snitzer wrote:
> > On Mon, Jul 01, 2024 at 09:46:36AM +1000, Dave Chinner wrote:
> > > IMO, the only sane way to ensure this sort of nested "back-end page
> > > cleaning submits front-end IO filesystem IO" mechanism works is to
> > > do something similar to the loop device. You most definitely don't
> > > want to be doing buffered IO (double caching is almost always bad)
> > > and you want to be doing async direct IO so that the submission
> > > thread is not waiting on completion before the next IO is
> > > submitted.
> > 
> > Yes, follow-on work is for me to revive the directio path for localio
> > that ultimately wasn't pursued (or properly wired up) because it
> > creates DIO alignment requirements on NFS client IO:
> > https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/commit/?h=nfs-localio-for-6.11-testing&id=f6c9f51fca819a8af595a4eb94811c1f90051eab

I don't follow - this is page cache writeback. All the write IO from
the bdi flusher thread should be page aligned, right? So why does DIO
alignment matter here?

> > But underlying filesystems (like XFS) have the appropriate checks, we
> > just need to fail gracefully and disable NFS localio if the IO is
> > misaligned.
> > 
> 
> Just a reminder to everyone that this is replacing a configuration
> which would in any case result in double caching, because without the
> localio change, it would end up being a loopback mount through the NFS
> server.

Sure. That doesn't mean double caching is desirable, and it's
something we should try to avoid if we are trying to design a fast
server bypass mechanism.

-Dave.
Trond Myklebust July 2, 2024, 2 p.m. UTC | #6
On Tue, 2024-07-02 at 23:04 +1000, Dave Chinner wrote:
> On Tue, Jul 02, 2024 at 12:33:53PM +0000, Trond Myklebust wrote:
> > On Mon, 2024-07-01 at 10:13 -0400, Mike Snitzer wrote:
> > > On Mon, Jul 01, 2024 at 09:46:36AM +1000, Dave Chinner wrote:
> > > > IMO, the only sane way to ensure this sort of nested "back-end
> > > > page
> > > > cleaning submits front-end IO filesystem IO" mechanism works is
> > > > to
> > > > do something similar to the loop device. You most definitely
> > > > don't
> > > > want to be doing buffered IO (double caching is almost always
> > > > bad)
> > > > and you want to be doing async direct IO so that the submission
> > > > thread is not waiting on completion before the next IO is
> > > > submitted.
> > > 
> > > Yes, follow-on work is for me to revive the directio path for
> > > localio
> > > that ultimately wasn't pursued (or properly wired up) because it
> > > creates DIO alignment requirements on NFS client IO:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/commit/?h=nfs-localio-for-6.11-testing&id=f6c9f51fca819a8af595a4eb94811c1f90051eab
> 
> I don't follow - this is page cache writeback. All the write IO from
> the bdi flusher thread should be page aligned, right? So why does DIO
> alignment matter here?
> 

There is no guarantee in NFS that writes from the flusher thread are
page aligned. If a page/folio is known to be up to date, we will
usually align writes to the boundaries, but we won't guarantee to do a
read-modify-write if that's not the case. Specifically, we will not do
so if the file is open for write-only.
Dave Chinner July 2, 2024, 11:15 p.m. UTC | #7
On Tue, Jul 02, 2024 at 02:00:46PM +0000, Trond Myklebust wrote:
> On Tue, 2024-07-02 at 23:04 +1000, Dave Chinner wrote:
> > On Tue, Jul 02, 2024 at 12:33:53PM +0000, Trond Myklebust wrote:
> > > On Mon, 2024-07-01 at 10:13 -0400, Mike Snitzer wrote:
> > > > On Mon, Jul 01, 2024 at 09:46:36AM +1000, Dave Chinner wrote:
> > > > > IMO, the only sane way to ensure this sort of nested "back-end
> > > > > page
> > > > > cleaning submits front-end IO filesystem IO" mechanism works is
> > > > > to
> > > > > do something similar to the loop device. You most definitely
> > > > > don't
> > > > > want to be doing buffered IO (double caching is almost always
> > > > > bad)
> > > > > and you want to be doing async direct IO so that the submission
> > > > > thread is not waiting on completion before the next IO is
> > > > > submitted.
> > > > 
> > > > Yes, follow-on work is for me to revive the directio path for
> > > > localio
> > > > that ultimately wasn't pursued (or properly wired up) because it
> > > > creates DIO alignment requirements on NFS client IO:
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/commit/?h=nfs-localio-for-6.11-testing&id=f6c9f51fca819a8af595a4eb94811c1f90051eab
> > 
> > I don't follow - this is page cache writeback. All the write IO from
> > the bdi flusher thread should be page aligned, right? So why does DIO
> > alignment matter here?
> > 
> 
> There is no guarantee in NFS that writes from the flusher thread are
> page aligned. If a page/folio is known to be up to date, we will
> usually align writes to the boundaries, but we won't guarantee to do a
> read-modify-write if that's not the case. Specifically, we will not do
> so if the file is open for write-only.

So perhaps if the localio mechanism is enabled, it should behave
like a local filesystem and do the page cache RMW cycle (because it
doesn't involve a network round trip) to make sure all the buffered
IO is page aligned. That means both buffered reads and writes are page
aligned, and both can be done using async direct IO. If the client
is doing aligned direct IO, then we can still do async direct IO to
the underlying file. If it's not aligned, then the localio flusher
thread can just do async buffered IO for those IOs instead.

Let's not reinvent the wheel: we know how to do loopback filesystem
IO very efficiently, and the whole point of localio is to do
loopback filesystem IO very efficiently.

-Dave.
Dave Chinner July 2, 2024, 11:51 p.m. UTC | #8
On Sun, Jun 30, 2024 at 09:45:40PM -0700, Christoph Hellwig wrote:
> On Mon, Jul 01, 2024 at 09:46:36AM +1000, Dave Chinner wrote:
> > Oh, that's nasty.
> 
> Yes.
> 
> > We now have to change every path in every filesystem that NFS can
> > call that might defer work to a workqueue.
> 
> Yes.  That's why the kernel for a long time had the stance that using
> network file systems / storage locally is entirely unsupported.
> 
> If we want to change that we'll have a lot of work to do.

Yep. These sorts of changes really need to be cc'd to linux-fsdevel,
not kept private to the NFS lists. I wouldn't have known that NFS
was going to do local IO to filesystems if it wasn't for this patch,
and it's clear the approach being taken needs architectural review
before we even get down into the nitty gritty details of the
implementation.

Mike, can you make sure that linux-fsdevel@vger.kernel.org is cc'd
on all the localio work being posted so we can all keep track of it
easily?

-Dave.
NeilBrown July 3, 2024, 11:29 a.m. UTC | #9
On Mon, 01 Jul 2024, Christoph Hellwig wrote:
> On Mon, Jul 01, 2024 at 09:46:36AM +1000, Dave Chinner wrote:
> > Oh, that's nasty.
> 
> Yes.
> 
> > We now have to change every path in every filesystem that NFS can
> > call that might defer work to a workqueue.
> 
> Yes.  That's why the kernel for a long time had the stance that using
> network file systems / storage locally is entirely unsupported.

I know nothing of this stance.  Do you have a reference?

I have put a modest amount of work into ensuring NFS to a server on the
same machine works, and last I checked it did - though I'm more
confident of NFSv3 than NFSv4 because of the state manager thread.

Also /dev/loop can be backed by a file, and have a filesystem mounted on
it, and if that didn't work I'm sure we would have complaints.

> 
> If we want to change that we'll have a lot of work to do.

What sort of work are you thinking of?

Thanks,
NeilBrown


> 
> 
>
Christoph Hellwig July 3, 2024, 2:15 p.m. UTC | #10
On Wed, Jul 03, 2024 at 09:29:00PM +1000, NeilBrown wrote:
> I know nothing of this stance.  Do you have a reference?

No particular one.

> I have put a modest amount of work into ensure NFS to a server on the
> same machine works and last I checked it did - though I'm more
> confident of NFSv3 than NFSv4 because of the state manager thread.

How do you propagate the NOFS flag (and NOIO for a loop device) to
the server and the workqueues run by the server and the file system
called by it?  How do you ensure WQ_MEM_RECLAIM gets propagated to
all workqueues that could be called by the file system on the
server (the problem kicking off this discussion)?
Dave Chinner July 3, 2024, 11:02 p.m. UTC | #11
On Wed, Jul 03, 2024 at 07:15:28AM -0700, Christoph Hellwig wrote:
> On Wed, Jul 03, 2024 at 09:29:00PM +1000, NeilBrown wrote:
> > I know nothing of this stance.  Do you have a reference?
> 
> No particular one.
> 
> > I have put a modest amount of work into ensure NFS to a server on the
> > same machine works and last I checked it did - though I'm more
> > confident of NFSv3 than NFSv4 because of the state manager thread.
> 
> How do you propagate the NOFS flag (and NOIO for a loop device) to
> the server an the workqueues run by the server and the file system
> call by it?  How do you ensure WQ_MEM_RECLAIM gets propagate to
> all workqueues that could be called by the file system on the
> server (the problem kicking off this discussion)?

Don't forget PF_LOCAL_THROTTLE, too.  I note that nfsd_vfs_write()
knows when it is doing local loopback write IO and in that case sets
PF_LOCAL_THROTTLE:

	if (test_bit(RQ_LOCAL, &rqstp->rq_flags) &&
            !(exp_op_flags & EXPORT_OP_REMOTE_FS)) {
                /*
                 * We want throttling in balance_dirty_pages()
                 * and shrink_inactive_list() to only consider
                 * the backingdev we are writing to, so that nfs to
                 * localhost doesn't cause nfsd to lock up due to all
                 * the client's dirty pages or its congested queue.
                 */
                current->flags |= PF_LOCAL_THROTTLE;
                restore_flags = true;
        }

This also has an impact on memory reclaim congestion throttling (i.e.
it turns it off), which is also needed for loopback IO to prevent it
being throttled by reclaim when reclaim gets congested trying to
reclaim all the dirty pages on the upper filesystem that the IO
thread is trying to clean...

However, I don't see it calling memalloc_nofs_save() there to
prevent memory reclaim recursion back into the upper NFS client
filesystem.

I suspect this issue is hidden because filesystems like XFS hard-code
GFP_NOFS context for page cache allocation to prevent NFSD loopback
IO from deadlocking. We've had to do that because, historically
speaking, there hasn't been a way for high-level IO submitters to
indicate they need GFP_NOFS allocation context.

However, we have had the memalloc_nofs_save/restore() scoped API for
several years now, so it seems to me that the nfsd should really be
using this rather than requiring the filesystem to always use
GFP_NOFS allocations to avoid loopback IO memory allocation
deadlocks...
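A minimal sketch of what Dave is suggesting here: scope NOFS around nfsd's write into the lower filesystem using the modern API, so reclaim cannot recurse back into the upper NFS client. The function name and structure are hypothetical, for illustration only, not the actual nfsd code:

```c
/* Illustrative sketch only: scope NOFS around the write into the
 * lower filesystem instead of requiring that filesystem to hard-code
 * GFP_NOFS for its page cache allocations. */
static ssize_t nfsd_local_write_sketch(struct file *file,
				       struct iov_iter *iter, loff_t pos)
{
	unsigned int nofs_flags;
	ssize_t ret;

	/* Any allocation the lower filesystem does on our behalf now
	 * implicitly drops __GFP_FS, preventing reclaim recursion. */
	nofs_flags = memalloc_nofs_save();
	ret = vfs_iter_write(file, iter, &pos, 0);
	memalloc_nofs_restore(nofs_flags);
	return ret;
}
```

The scoped API (from linux/sched/mm.h) nests safely, which is why it is preferred over sprinkling GFP_NOFS through individual callees.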

-Dave.
NeilBrown July 6, 2024, 12:32 a.m. UTC | #12
On Wed, 03 Jul 2024, Trond Myklebust wrote:
> On Tue, 2024-07-02 at 23:04 +1000, Dave Chinner wrote:
> > On Tue, Jul 02, 2024 at 12:33:53PM +0000, Trond Myklebust wrote:
> > > On Mon, 2024-07-01 at 10:13 -0400, Mike Snitzer wrote:
> > > > On Mon, Jul 01, 2024 at 09:46:36AM +1000, Dave Chinner wrote:
> > > > > IMO, the only sane way to ensure this sort of nested "back-end
> > > > > page
> > > > > cleaning submits front-end IO filesystem IO" mechanism works is
> > > > > to
> > > > > do something similar to the loop device. You most definitely
> > > > > don't
> > > > > want to be doing buffered IO (double caching is almost always
> > > > > bad)
> > > > > and you want to be doing async direct IO so that the submission
> > > > > thread is not waiting on completion before the next IO is
> > > > > submitted.
> > > > 
> > > > Yes, follow-on work is for me to revive the directio path for
> > > > localio
> > > > that ultimately wasn't pursued (or properly wired up) because it
> > > > creates DIO alignment requirements on NFS client IO:
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/commit/?h=nfs-localio-for-6.11-testing&id=f6c9f51fca819a8af595a4eb94811c1f90051eab
> > 
> > I don't follow - this is page cache writeback. All the write IO from
> > the bdi flusher thread should be page aligned, right? So why does DIO
> > alignment matter here?
> > 
> 
> There is no guarantee in NFS that writes from the flusher thread are
> page aligned. If a page/folio is known to be up to date, we will
> usually align writes to the boundaries, but we won't guarantee to do a
> read-modify-write if that's not the case. Specifically, we will not do
> so if the file is open for write-only.

Would it be reasonable for the partial pages to be written over RPC and
for only full pages to be sent directly to the server-side file using
O_DIRECT writes?  Presumably the benefits of localio are most pronounced
with large writes which will mostly be full-page, or even full-folio.

O_DIRECT writes on the NFS side would be more awkward.  open(2)
documents that NFS places no alignment restrictions on O_DIRECT I/O.  If
applications depend on that then some copying will have to be done
before the data is written to a block filesystem - possibly into the
page cache, possibly into some other buffer.  This wouldn't be more
copying than we already do.

NeilBrown
Christoph Hellwig July 6, 2024, 6:13 a.m. UTC | #13
On Sat, Jul 06, 2024 at 10:32:52AM +1000, NeilBrown wrote:
> Would it be reasonable for the partial pages to be written over RPC and
> for only full pages to be sent directly to the server-side file using
> O_DIRECT writes?  Presumably the benefits of localio are most pronounced
> with large writes which will mostly be full-page, or even full-folio.

Yes.  Note that we already have infrastructure to always read the
entire page for pnfs block (PNFS_READ_WHOLE_PAGE /
pnfs_ld_read_whole_page) to make this a lot more common at least for
reads.

> O_DIRECT writes on the NFS side would be more awkward.  open(2)
> documents that NFS places no alignment restrictions on O_DIRECT I/O.  If
> applications depend on that then some copying will have to be done
> before the data is written to a block filesystem - possibly into the
> page cache, possibly into some other buffer.  This wouldn't be more
> copying than we already do.

Yes.  There is precedent for that in pNFS as well:
fs/nfs/blocklayout/blocklayout.c:is_aligned_req().
Christoph Hellwig July 6, 2024, 6:37 a.m. UTC | #14
Btw, one issue with using direct I/O is that we need to synchronize with
page cache access from the server itself.  For pNFS we can do that as
we track outstanding layouts.  Without layouts it will be more work,
as we'll need a different data structure tracking grants for bypassing
the server.  Or just piggy-back on layouts anyway, as that's what they
are doing.
NeilBrown July 9, 2024, 11:12 p.m. UTC | #15
On Thu, 04 Jul 2024, Christoph Hellwig wrote:
> On Wed, Jul 03, 2024 at 09:29:00PM +1000, NeilBrown wrote:
> > I know nothing of this stance.  Do you have a reference?
> 
> No particular one.
> 
> > I have put a modest amount of work into ensuring NFS to a server on the
> > same machine works and last I checked it did - though I'm more
> > confident of NFSv3 than NFSv4 because of the state manager thread.
> 
> How do you propagate the NOFS flag (and NOIO for a loop device) to
> the server and the workqueues run by the server and the file system
> called by it?  How do you ensure WQ_MEM_RECLAIM gets propagated to
> all workqueues that could be called by the file system on the
> server (the problem kicking off this discussion)?
> 

Do we need to propagate these?

NOFS is for deadlock avoidance.  A filesystem "backend" (Dave's term - I
think for the parts of the fs that handle write-back) might allocate
memory, that might block waiting for memory reclaim, memory reclaim
might re-enter the filesystem backend and might block on a lock (or
similar) held while allocating memory.  NOFS breaks that deadlock.

The important thing here isn't the NOFS flag, it is breaking any
possible deadlock.

Layered filesystems introduce a new complexity.  The backend for one
filesystem can call into the front end of another filesystem.  That
front-end is not required to use NOFS and even if we impose
PF_MEMALLOC_NOFS, the front-end might wait for some work-queue action
which doesn't inherit the NOFS flag.

But this doesn't necessarily matter.  Calling into the filesystem is not
the problem - blocking waiting for a reply is the problem.  It is
blocking that creates deadlocks.  So if the backend of one filesystem
queues the work for the front end of the other filesystem to a
separate thread and doesn't wait for the work to complete, then a
deadlock cannot be introduced.

/dev/loop uses the loop%d workqueue for this.  loop-back NFS hands the
front-end work over to nfsd.  The proposed localio implementation uses a
nfslocaliod workqueue for exactly the same task.  These remove the
possibility of deadlock and mean that there is no need to pass NOFS
through to the front-end of the backing filesystem.
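The handoff pattern described above might look like the following sketch. The names (localio_io, localio_submit, the workqueue handle) are hypothetical, not the actual localio implementation; the point is simply that submission returns without flushing or waiting:

```c
/* Hypothetical sketch of the fire-and-forget handoff: the upper
 * filesystem's backend packages the lower-filesystem I/O and queues
 * it, never blocking on the lower front-end. */
struct localio_io {
	struct work_struct work;
	struct file *file;	/* file on the lower (backing) filesystem */
	struct iov_iter iter;
	loff_t pos;
};

static void localio_io_worker(struct work_struct *work)
{
	struct localio_io *io = container_of(work, struct localio_io, work);

	vfs_iter_write(io->file, &io->iter, &io->pos, 0);
	kfree(io);
}

static void localio_submit(struct workqueue_struct *nfslocaliod_wq,
			   struct localio_io *io)
{
	INIT_WORK(&io->work, localio_io_worker);
	/* queue_work() and return: no flush_work() or completion wait
	 * here, so no wait-for-reply edge that could close a deadlock
	 * cycle back through memory reclaim. */
	queue_work(nfslocaliod_wq, &io->work);
}
```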

Note that there is a separate question concerning pageout to a swap
file.  pageout needs more than just deadlock avoidance.  It needs
guaranteed progress in low memory conditions.   It needs PF_MEMALLOC (or
mempools) and that cannot be finessed using work queues.  I don't think
that Linux is able to support pageout through layered filesystems.

So while I support loop-back NFS and swap-over-NFS, I don't support them
in combination.  We don't support swap on /dev/loop when it is backed by
a file - for that we have swap-to-file.

Thank you for challenging me on this - it helped me clarify my thoughts
and understanding for myself.

NeilBrown
NeilBrown July 9, 2024, 11:39 p.m. UTC | #16
On Sat, 06 Jul 2024, Christoph Hellwig wrote:
> Btw, one issue with using direct I/O is that we need to synchronize with
> page cache access from the server itself.  For pNFS we can do that as
> we track outstanding layouts.  Without layouts it will be more work,
> as we'll need a different data structure tracking grants for bypassing
> the server.  Or just piggy-back on layouts anyway, as that's what they
> are doing.
> 

I'm missing something here.

Certainly if localio or nfsd were to choose to use direct I/O we would
need to ensure proper synchronisation with page cache access.

Does VFS/MM already provide enough synchronisation?  A quick look at the
code suggests:
- before an O_DIRECT read any dirty pages that overlap are flushed to
  the device.
- after a write, any pages that overlap are invalidated.

So as long as IO requests don't overlap we should have adequate
synchronisation.
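The two behaviours listed above correspond roughly to what the generic direct-I/O write path has historically done. A heavily simplified sketch (error handling and the read side omitted, helper name hypothetical; modern filesystems often go through iomap_dio_rw() rather than ->direct_IO):

```c
/* Roughly the synchronisation generic_file_direct_write() performs:
 * flush overlapping dirty page-cache pages before the DIO write,
 * invalidate them afterwards, so non-overlapping buffered and direct
 * I/O stay coherent. */
static ssize_t dio_write_sync_sketch(struct kiocb *iocb,
				     struct iov_iter *from)
{
	struct address_space *mapping = iocb->ki_filp->f_mapping;
	loff_t pos = iocb->ki_pos;
	size_t count = iov_iter_count(from);
	ssize_t written;

	/* push any dirty overlapping pages to the device first */
	filemap_write_and_wait_range(mapping, pos, pos + count - 1);

	written = mapping->a_ops->direct_IO(iocb, from);
	if (written <= 0)
		return written;

	/* drop overlapping cached pages so later buffered reads see
	 * the data written past the cache */
	invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT,
				      (pos + written - 1) >> PAGE_SHIFT);
	return written;
}
```

None of this helps for concurrent overlapping access, which is exactly the gap discussed next.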

If they do overlap we should expect inconsistent results.  Maybe we
would expect reads to only "tear" on a page boundary, and writes to only
interleave in whole pages, and probably using O_DIRECT would not give
any whole-page guarantees.  So maybe that is a problem.

If it is a problem, I think it can only be fixed by keeping track of which
pages are under direct IO, and preventing access to the page-cache for
those regions.  This could be done in the page-cache itself, or in a
separate extent-tree.  I don't think the VFS/MM supports this - does any
filesystem?

(or we could prevent adding any new pages to the page-cache for an inode
 with i_dio_count > 0 - but that would likely hurt performance.)

I can see that pNFS extents could encode the information to enforce
this, but I don't see how that is mapped to filesystems in Linux at
present.

What am I missing?

Thanks,
NeilBrown
Dave Chinner July 11, 2024, 11:55 a.m. UTC | #17
On Wed, Jul 10, 2024 at 09:12:58AM +1000, NeilBrown wrote:
> On Thu, 04 Jul 2024, Christoph Hellwig wrote:
> > On Wed, Jul 03, 2024 at 09:29:00PM +1000, NeilBrown wrote:
> > > I know nothing of this stance.  Do you have a reference?
> > 
> > No particular one.
> > 
> > > I have put a modest amount of work into ensuring NFS to a server on the
> > > same machine works and last I checked it did - though I'm more
> > > confident of NFSv3 than NFSv4 because of the state manager thread.
> > 
> > How do you propagate the NOFS flag (and NOIO for a loop device) to
> > the server and the workqueues run by the server and the file system
> > called by it?  How do you ensure WQ_MEM_RECLAIM gets propagated to
> > all workqueues that could be called by the file system on the
> > server (the problem kicking off this discussion)?
> > 
> 
> Do we need to propagate these?
> 
> NOFS is for deadlock avoidance.  A filesystem "backend" (Dave's term - I
> think for the parts of the fs that handle write-back) might allocate
> memory, that might block waiting for memory reclaim, memory reclaim
> might re-enter the filesystem backend and might block on a lock (or
> similar) held while allocating memory.  NOFS breaks that deadlock.
>
> The important thing here isn't the NOFS flag, it is breaking any
> possible deadlock.

NOFS doesn't "break" any deadlocks. It simply prevents recursion
from one filesystem context to another. We don't have to use
NOFS if recursion is safe and won't deadlock.

That is, it may be safe for a filesystem to use GFP_KERNEL
allocations in its writeback path. If the filesystem doesn't
implement ->writepage (like most of the major filesystems these
days) there is no way for memory reclaim to recurse back into the fs
writeback path. Hence GFP_NOFS is not needed in writeback context to
prevent reclaim recursion back into the filesystem writeback
path....

And the superblock shrinkers can't deadlock - they are non-blocking
and only act on unreferenced inodes. Hence any code that has a
locked inode is either evicting an unreferenced inode or holds a
reference to the inode. If we are doing an allocation with either of
those sorts of inodes locked, there is no way that memory reclaim
recursion can trip over the locked inode and deadlock.

So between the removal of ->writepage, non-blocking shrinkers, and
scoped NOIO context for loop devices, I'm not sure there are any
generic reclaim recursion paths that can actually deadlock. i.e.
GFP_NOFS really only needs to be used if the filesystem itself
cannot safely recurse back into itself.

> Layered filesystems introduce a new complexity.

Nothing new about layered filesystems - we've been doing this for
decades...

> The backend for one
> filesystem can call into the front end of another filesystem.  That
> front-end is not required to use NOFS and even if we impose
> PF_MEMALLOC_NOFS, the front-end might wait for some work-queue action
> which doesn't inherit the NOFS flag.
> 
> But this doesn't necessarily matter.  Calling into the filesystem is not
> the problem - blocking waiting for a reply is the problem.  It is
> blocking that creates deadlocks.  So if the backend of one filesystem
> queues the work for the front end of the other filesystem to a
> separate thread and doesn't wait for the work to complete, then a
> deadlock cannot be introduced.
>
> /dev/loop uses the loop%d workqueue for this.  loop-back NFS hands the
> front-end work over to nfsd.  The proposed localio implementation uses a
> nfslocaliod workqueue for exactly the same task.  These remove the
> possibility of deadlock and mean that there is no need to pass NOFS
> through to the front-end of the backing filesystem.

I think this logic is fundamentally flawed.

Pushing IO submission to a separate thread
context that runs it in GFP_KERNEL context does not help
if the deadlock occurs during IO submission. With loop devices,
there's a "global" lock in the lower filesystem on the loop device -
the image file inode lock.

The IO issued by the loop device will -always- hit the same inode
and the same inode locks. Hence if we do memory allocation with an
inode lock held exclusive in the lower filesystem (e.g. a page cache
folio for a buffered write), we cannot allow memory reclaim during
any allocation with the image file inode locked to recurse into the
upper filesystem. If the upper filesystem then performs an operation
that requires IO to be submitted and completed to make progress
then we have a deadlock condition due to recursion from
the lower to upper filesystem regardless of the fact that the
lower IO submission is run from a different task.

Hence the loop device sets up the backing file mapping as:

	lo->lo_backing_file = file;
        lo->old_gfp_mask = mapping_gfp_mask(mapping);
        mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS));

GFP_NOIO context. It also sets up worker task context as:

	current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO;

GFP_NOIO context. IOWs, all allocation in the IO submission path is
explicitly GFP_NOIO to prevent any sort of reclaim recursion into
filesystems or the block layer. That's the only sane thing to do,
because multi-filesystem deadlocks are an utter PITA to triage and
solve...

Keep in mind that PF_LOCAL_THROTTLE also prevents IO submission
deadlocks in the lower filesystem.  If the lower filesystem IO
submission dirties pages (i.e. buffered writes) it can get throttled
on the dirty page threshold. If it gets throttled like this trying
to clean dirty pages from the upper filesystem we have a deadlock.
The localio submission task will need to prevent that deadlock,
too.

IOWs, just moving IO submission to another thread does not avoid
the possibility of lower-to-upper filesystem recursion or lower
filesystem dirty page throttling deadlocks.
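If a localio submission worker followed the loop device's precedent, it would need something like the sketch below in its worker function. This is an illustration of the point above, not actual localio code; the function name is hypothetical:

```c
/* Sketch: protect a lower-filesystem submission thread the same way
 * loop's worker does - NOIO scope so reclaim cannot recurse into any
 * filesystem or the block layer, plus PF_LOCAL_THROTTLE so dirty-page
 * throttling only considers the backing device, not the upper
 * filesystem's dirty pages this thread is trying to clean. */
static void localio_worker_fn(struct work_struct *work)
{
	unsigned int noio_flags = memalloc_noio_save();

	current->flags |= PF_LOCAL_THROTTLE;

	/* ... submit the lower-filesystem I/O here ... */

	current->flags &= ~PF_LOCAL_THROTTLE;
	memalloc_noio_restore(noio_flags);
}
```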

> Note that there is a separate question concerning pageout to a swap
> file.  pageout needs more than just deadlock avoidance.  It needs
> guaranteed progress in low memory conditions.   It needs PF_MEMALLOC (or
> mempools) and that cannot be finessed using work queues.  I don't think
> that Linux is able to support pageout through layered filesystems.

I don't think we ever want to go there.

-Dave.
diff mbox series

Patch

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 27e9f749c4c7..dbe6af00708b 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -574,7 +574,8 @@  xfs_init_mount_workqueues(
 		goto out_destroy_blockgc;
 
 	mp->m_sync_workqueue = alloc_workqueue("xfs-sync/%s",
-			XFS_WQFLAGS(WQ_FREEZABLE), 0, mp->m_super->s_id);
+			XFS_WQFLAGS(WQ_FREEZABLE | WQ_MEM_RECLAIM),
+			0, mp->m_super->s_id);
 	if (!mp->m_sync_workqueue)
 		goto out_destroy_inodegc;