mbox series

[00/18,V2] Repair SWAP-over-NFS

Message ID 163969801519.20885.3977673503103544412.stgit@noble.brown (mailing list archive)
Headers show
Series Repair SWAP-over-NFS | expand

Message

NeilBrown Dec. 16, 2021, 11:48 p.m. UTC
swap-over-NFS currently has a variety of problems.

swap writes call generic_write_checks(), which always fails on a swap
file, so it completely fails.
Even without this, various deadlocks are possible - largely due to
improvements in NFS memory allocation (using NOFS instead of ATOMIC)
which weren't tested against swap-out.

NFS is the only filesystem that has supported fs-based swap IO, and it
hasn't worked for several releases, so now is a convenient time to clean
up the swap-via-filesystem interfaces - we cannot break anything !

So the first few patches here clean up and improve various parts of the
swap-via-filesystem code.  ->activate_swap() is given a cleaner
interface, a new ->swap_rw is introduced instead of burdening
->direct_IO, etc.

Current swap-to-filesystem code only ever submits single-page reads and
writes.  These patches change that to allow multi-page IO when adjacent
requests are submitted.  Writes are also changed to be async rather than
sync.  This substantially speeds up write throughput for swap-over-NFS.

Some of the NFS patches can land independently of the MM patches.  A few
require the MM patches to land first.

Thanks,
NeilBrown


---

NeilBrown (18):
      Structural cleanup for filesystem-based swap
      MM: create new mm/swap.h header file.
      MM: use ->swap_rw for reads from SWP_FS_OPS swap-space
      MM: perform async writes to SWP_FS_OPS swap-space
      MM: reclaim mustn't enter FS for SWP_FS_OPS swap-space
      MM: submit multipage reads for SWP_FS_OPS swap-space
      MM: submit multipage write for SWP_FS_OPS swap-space
      MM: Add AS_CAN_DIO mapping flag
      NFS: rename nfs_direct_IO and use as ->swap_rw
      NFS: swap IO handling is slightly different for O_DIRECT IO
      SUNRPC/call_alloc: async tasks mustn't block waiting for memory
      SUNRPC/auth: async tasks mustn't block waiting for memory
      SUNRPC/xprt: async tasks mustn't block waiting for memory
      SUNRPC: remove scheduling boost for "SWAPPER" tasks.
      NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS
      SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOC
      NFSv4: keep state manager thread active if swap is enabled
      NFS: swap-out must always use STABLE writes.


 drivers/block/loop.c            |   4 +-
 fs/fcntl.c                      |   5 +-
 fs/inode.c                      |   3 +
 fs/nfs/direct.c                 |  56 ++++++----
 fs/nfs/file.c                   |  25 +++--
 fs/nfs/inode.c                  |   1 +
 fs/nfs/nfs4_fs.h                |   1 +
 fs/nfs/nfs4proc.c               |  20 ++++
 fs/nfs/nfs4state.c              |  39 ++++++-
 fs/nfs/read.c                   |   4 -
 fs/nfs/write.c                  |   2 +
 fs/open.c                       |   2 +-
 fs/overlayfs/file.c             |  10 +-
 include/linux/fs.h              |   2 +-
 include/linux/nfs_fs.h          |  11 +-
 include/linux/nfs_xdr.h         |   2 +
 include/linux/pagemap.h         |   3 +-
 include/linux/sunrpc/auth.h     |   1 +
 include/linux/sunrpc/sched.h    |   1 -
 include/linux/swap.h            | 121 --------------------
 include/linux/writeback.h       |   7 ++
 include/trace/events/sunrpc.h   |   1 -
 mm/madvise.c                    |   9 +-
 mm/memory.c                     |   3 +-
 mm/mincore.c                    |   1 +
 mm/page_alloc.c                 |   1 +
 mm/page_io.c                    | 189 ++++++++++++++++++++++++++------
 mm/shmem.c                      |   1 +
 mm/swap.h                       | 140 +++++++++++++++++++++++
 mm/swap_state.c                 |  32 ++++--
 mm/swapfile.c                   |   6 +
 mm/util.c                       |   1 +
 mm/vmscan.c                     |  31 +++++-
 net/sunrpc/auth.c               |   8 +-
 net/sunrpc/auth_gss/auth_gss.c  |   6 +-
 net/sunrpc/auth_unix.c          |  10 +-
 net/sunrpc/clnt.c               |   7 +-
 net/sunrpc/sched.c              |  29 +++--
 net/sunrpc/xprt.c               |  19 ++--
 net/sunrpc/xprtrdma/transport.c |  10 +-
 net/sunrpc/xprtsock.c           |   8 ++
 41 files changed, 558 insertions(+), 274 deletions(-)
 create mode 100644 mm/swap.h

--
Signature

Comments

Schumaker, Anna Dec. 17, 2021, 9:29 p.m. UTC | #1
Hi Neil,

On Thu, Dec 16, 2021 at 7:07 PM NeilBrown <neilb@suse.de> wrote:
>
> swap-over-NFS currently has a variety of problems.
>
> swap writes call generic_write_checks(), which always fails on a swap
> file, so it completely fails.
> Even without this, various deadlocks are possible - largely due to
> improvements in NFS memory allocation (using NOFS instead of ATOMIC)
> which weren't tested against swap-out.
>
> NFS is the only filesystem that has supported fs-based swap IO, and it
> hasn't worked for several releases, so now is a convenient time to clean
> up the swap-via-filesystem interfaces - we cannot break anything !
>
> So the first few patches here clean up and improve various parts of the
> swap-via-filesystem code.  ->activate_swap() is given a cleaner
> interface, a new ->swap_rw is introduced instead of burdening
> ->direct_IO, etc.
>
> Current swap-to-filesystem code only ever submits single-page reads and
> writes.  These patches change that to allow multi-page IO when adjacent
> requests are submitted.  Writes are also changed to be async rather than
> sync.  This substantially speeds up write throughput for swap-over-NFS.
>
> Some of the NFS patches can land independently of the MM patches.  A few
> require the MM patches to land first.

Thanks for fixing swap-over-NFS! Looks like it passes all the
swap-related xfstests except for generic/357 on NFS v4.2. This test
checks that we get -EINVAL on a reflinked swapfile, but I'm not sure
if there is a way to check for that on the client side but if you have
any ideas it would be nice to get that test passing while you're at
it!

Anna

>
> Thanks,
> NeilBrown
>
>
> ---
>
> NeilBrown (18):
>       Structural cleanup for filesystem-based swap
>       MM: create new mm/swap.h header file.
>       MM: use ->swap_rw for reads from SWP_FS_OPS swap-space
>       MM: perform async writes to SWP_FS_OPS swap-space
>       MM: reclaim mustn't enter FS for SWP_FS_OPS swap-space
>       MM: submit multipage reads for SWP_FS_OPS swap-space
>       MM: submit multipage write for SWP_FS_OPS swap-space
>       MM: Add AS_CAN_DIO mapping flag
>       NFS: rename nfs_direct_IO and use as ->swap_rw
>       NFS: swap IO handling is slightly different for O_DIRECT IO
>       SUNRPC/call_alloc: async tasks mustn't block waiting for memory
>       SUNRPC/auth: async tasks mustn't block waiting for memory
>       SUNRPC/xprt: async tasks mustn't block waiting for memory
>       SUNRPC: remove scheduling boost for "SWAPPER" tasks.
>       NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS
>       SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOC
>       NFSv4: keep state manager thread active if swap is enabled
>       NFS: swap-out must always use STABLE writes.
>
>
>  drivers/block/loop.c            |   4 +-
>  fs/fcntl.c                      |   5 +-
>  fs/inode.c                      |   3 +
>  fs/nfs/direct.c                 |  56 ++++++----
>  fs/nfs/file.c                   |  25 +++--
>  fs/nfs/inode.c                  |   1 +
>  fs/nfs/nfs4_fs.h                |   1 +
>  fs/nfs/nfs4proc.c               |  20 ++++
>  fs/nfs/nfs4state.c              |  39 ++++++-
>  fs/nfs/read.c                   |   4 -
>  fs/nfs/write.c                  |   2 +
>  fs/open.c                       |   2 +-
>  fs/overlayfs/file.c             |  10 +-
>  include/linux/fs.h              |   2 +-
>  include/linux/nfs_fs.h          |  11 +-
>  include/linux/nfs_xdr.h         |   2 +
>  include/linux/pagemap.h         |   3 +-
>  include/linux/sunrpc/auth.h     |   1 +
>  include/linux/sunrpc/sched.h    |   1 -
>  include/linux/swap.h            | 121 --------------------
>  include/linux/writeback.h       |   7 ++
>  include/trace/events/sunrpc.h   |   1 -
>  mm/madvise.c                    |   9 +-
>  mm/memory.c                     |   3 +-
>  mm/mincore.c                    |   1 +
>  mm/page_alloc.c                 |   1 +
>  mm/page_io.c                    | 189 ++++++++++++++++++++++++++------
>  mm/shmem.c                      |   1 +
>  mm/swap.h                       | 140 +++++++++++++++++++++++
>  mm/swap_state.c                 |  32 ++++--
>  mm/swapfile.c                   |   6 +
>  mm/util.c                       |   1 +
>  mm/vmscan.c                     |  31 +++++-
>  net/sunrpc/auth.c               |   8 +-
>  net/sunrpc/auth_gss/auth_gss.c  |   6 +-
>  net/sunrpc/auth_unix.c          |  10 +-
>  net/sunrpc/clnt.c               |   7 +-
>  net/sunrpc/sched.c              |  29 +++--
>  net/sunrpc/xprt.c               |  19 ++--
>  net/sunrpc/xprtrdma/transport.c |  10 +-
>  net/sunrpc/xprtsock.c           |   8 ++
>  41 files changed, 558 insertions(+), 274 deletions(-)
>  create mode 100644 mm/swap.h
>
> --
> Signature
>
NeilBrown Dec. 19, 2021, 9:07 p.m. UTC | #2
On Sat, 18 Dec 2021, Anna Schumaker wrote:
> Hi Neil,
> 
> On Thu, Dec 16, 2021 at 7:07 PM NeilBrown <neilb@suse.de> wrote:
> >
> > swap-over-NFS currently has a variety of problems.
> >
> > swap writes call generic_write_checks(), which always fails on a swap
> > file, so it completely fails.
> > Even without this, various deadlocks are possible - largely due to
> > improvements in NFS memory allocation (using NOFS instead of ATOMIC)
> > which weren't tested against swap-out.
> >
> > NFS is the only filesystem that has supported fs-based swap IO, and it
> > hasn't worked for several releases, so now is a convenient time to clean
> > up the swap-via-filesystem interfaces - we cannot break anything !
> >
> > So the first few patches here clean up and improve various parts of the
> > swap-via-filesystem code.  ->activate_swap() is given a cleaner
> > interface, a new ->swap_rw is introduced instead of burdening
> > ->direct_IO, etc.
> >
> > Current swap-to-filesystem code only ever submits single-page reads and
> > writes.  These patches change that to allow multi-page IO when adjacent
> > requests are submitted.  Writes are also changed to be async rather than
> > sync.  This substantially speeds up write throughput for swap-over-NFS.
> >
> > Some of the NFS patches can land independently of the MM patches.  A few
> > require the MM patches to land first.
> 
> Thanks for fixing swap-over-NFS! Looks like it passes all the
> swap-related xfstests except for generic/357 on NFS v4.2. This test
> checks that we get -EINVAL on a reflinked swapfile, but I'm not sure
> if there is a way to check for that on the client side but if you have
> any ideas it would be nice to get that test passing while you're at
> it!

Thanks for testing!.
I think that testing that swap fails on a reflinked file is bogus.  This
isn't an important part of the API, it is just an internal
implementation detail.
I certainly understand that it could be problematic implementing swap on
a reflinked file within XFS and it is perfectly acceptable to fail such
a request.  But if one day someone decided to implement it - should that
be seen as a regression?

Certainly over NFS there is no reason at all not to swap to a file that
happens to be reflinked on the server.
I don't think it even makes sense to test if the file has holes as the
current nfs_swap_activate() does.  I don't exactly object to the test,
but I think it is misguided and pointless.

Thanks,
NeilBrown
Christoph Hellwig Dec. 21, 2021, 8:48 a.m. UTC | #3
On Mon, Dec 20, 2021 at 08:07:15AM +1100, NeilBrown wrote:
> > Thanks for fixing swap-over-NFS! Looks like it passes all the
> > swap-related xfstests except for generic/357 on NFS v4.2. This test
> > checks that we get -EINVAL on a reflinked swapfile, but I'm not sure
> > if there is a way to check for that on the client side but if you have
> > any ideas it would be nice to get that test passing while you're at
> > it!
> 
> Thanks for testing!.
> I think that testing that swap fails on a reflinked file is bogus.  This
> isn't an important part of the API, it is just an internal
> implementation detail.
> I certainly understand that it could be problematic implementing swap on
> a reflinked file within XFS and it is perfectly acceptable to fail such
> a request.  But if one day someone decided to implement it - should that
> be seen as a regression?

Yes, there is really no fundamental reason not to support swap to
reflinked files, especially for NFS.  We'll need some kind of opt-in/out
for this test.