Message ID: 163456861570.2614702.14754548462706508617.stgit@warthog.procyon.org.uk (mailing list archive)
Series: fscache: Rewrite index API and management system
On Mon, Oct 18, 2021 at 11:50 AM David Howells <dhowells@redhat.com> wrote:
>
>
> Here's a set of patches that rewrites and simplifies the fscache index API
> to remove the complex operation scheduling and object state machine in
> favour of something much smaller and simpler. It is built on top of the
> set of patches that removes the old API[1].

Testing this series in our afs test framework, I saw the oops pasted below.

cachefiles_begin_operation+0x2d maps to cachefiles/io.c:565, where object is
probably NULL (object->file is at offset 0x28).

Marc

===
BUG: kernel NULL pointer dereference, address: 0000000000000028
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] SMP NOPTI
CPU: 5 PID: 16607 Comm: ar Tainted: G E 5.15.0-rc5.kafs_testing+ #37
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2.fc34 04/01/2014
RIP: 0010:cachefiles_begin_operation+0x2d/0x80 [cachefiles]
Code: 00 00 55 53 48 83 ec 08 48 8b 47 08 48 83 7f 10 00 48 8b 68 20 74 0c b8 01 00 00 00 48 83 c4 08 5b 5d c3 48 c7 07 a0 12 1b a0 <48> 8b 45 28 48 89 fb 48 85 c0 74 20 48 8d 7d 04 89 74 24 04 e8 3a
RSP: 0018:ffffc90000d33b48 EFLAGS: 00010246
RAX: ffff888014991420 RBX: ffff888100ae9cf0 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff888100ae9cf0
RBP: 0000000000000000 R08: 00000000000006b8 R09: ffff88810e98e000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff888014991434
R13: 0000000000000002 R14: ffff888014991420 R15: 0000000000000002
FS:  00007f72d0486b80(0000) GS:ffff888139940000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000028 CR3: 000000007bac8004 CR4: 0000000000770ee0
PKRU: 55555554
Call Trace:
 fscache_begin_operation.part.0+0x1e3/0x210 [fscache]
 netfs_write_begin+0x3fb/0x800 [netfs]
 ? __fscache_use_cookie+0x120/0x200 [fscache]
 afs_write_begin+0x58/0x2c0 [kafs]
 ? __vfs_getxattr+0x2a/0x70
 generic_perform_write+0xb1/0x1b0
 ? file_update_time+0xcf/0x120
 __generic_file_write_iter+0x14c/0x1d0
 generic_file_write_iter+0x5d/0xb0
 afs_file_write+0x73/0xa0 [kafs]
 new_sync_write+0x105/0x180
 vfs_write+0x1cb/0x260
 ksys_write+0x4f/0xc0
 do_syscall_64+0x35/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f72d059a7a7
Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
RSP: 002b:00007fffc31942b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f72d059a7a7
RDX: 0000000000000008 RSI: 000055fe42367730 RDI: 0000000000000003
RBP: 000055fe42367730 R08: 0000000000000000 R09: 00007f72d066ca00
R10: 000000000000007c R11: 0000000000000246 R12: 0000000000000008
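A minimal sketch of the failure mode Marc describes, assuming cachefiles_begin_operation() is reached before a backing object has been attached to the cookie. The structure and guard below are illustrative only: the field layout is chosen so that the file pointer lands at offset 0x28, matching the faulting address, and none of this is the actual cachefiles code.

/* Illustrative layout: five pointer-sized fields ahead of ->file put it at
 * offset 0x28 on a 64-bit build, matching CR2 in the oops above. */
struct cachefiles_object_sketch {
	void *placeholder[5];
	struct file *file;
};

static bool cachefiles_begin_operation_sketch(struct cachefiles_object_sketch *object)
{
	/* Hypothetical guard: without it, reading object->file faults at
	 * address 0x28 when object is NULL, as seen in the trace above. */
	if (!object)
		return false;

	return object->file != NULL;
}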
On Mon, 2021-10-18 at 15:50 +0100, David Howells wrote:
> Here's a set of patches that rewrites and simplifies the fscache index API
> to remove the complex operation scheduling and object state machine in
> favour of something much smaller and simpler. It is built on top of the
> set of patches that removes the old API[1].
>
> The operation scheduling API was intended to handle sequencing of cache
> operations, which were all required (where possible) to run asynchronously
> in parallel with the operations being done by the network filesystem, while
> allowing the cache to be brought online and offline and interrupt service
> with invalidation.
>
> However, with the advent of the tmpfile capacity in the VFS, an opportunity
> arises to do invalidation much more easily, without having to wait for I/O
> that's actually in progress: Cachefiles can simply cut over its file
> pointer for the backing object attached to a cookie and abandon the
> in-progress I/O, dismissing it upon completion.
>
> Future work there would involve using Omar Sandoval's vfs_link() with
> AT_LINK_REPLACE[2] to allow an extant file to be displaced by a new hard
> link from a tmpfile, as currently I have to unlink the old file first.
>
> These patches can also simplify the object state handling as I/O operations
> to the cache don't all have to be brought to a stop in order to invalidate
> a file. To that end, and with an eye to writing a new backing cache
> model in the future, I've taken the opportunity to simplify the indexing
> structure.
>
> I've separated the index cookie concept from the file cookie concept by
> type now. The former is now called a "volume cookie" (struct
> fscache_volume) and is a container of file cookies. There are then
> just the two levels. All the index cookieage is collapsed into a single
> volume cookie, and this has a single printable string as a key. For
> instance, an AFS volume would have a key of something like
> "afs,example.com,1000555", combining the filesystem name, cell name and
> volume ID. This is freeform, but must not have '/' chars in it.
>

Given the indexing changes, what sort of behavior should we expect when
upgrading from old-style to new-style indexes? Do they just not match, and
we end up downloading new copies of all the data and the old stale stuff
eventually gets culled?

Ditto for downgrades -- can we expect sane behavior if someone runs an old
kernel on top of an existing fscache that was populated by a new kernel?

> I've also eliminated all pointers back from fscache into the network
> filesystem. This required the duplication of a little bit of data in the
> cookie (cookie key, coherency data and file size), but it's not actually
> that much. This gets rid of problems with making sure we keep netfs data
> structures around so that the cache can access them.
>
> I have changed afs throughout the patch series, but I also have patches for
> 9p, nfs and cifs. Jeff Layton is handling ceph support.
>
>
> BITS THAT MAY BE CONTROVERSIAL
> ==============================
>
> There are some bits I've added that may be controversial:
>
> (1) I've provided a flag, S_KERNEL_FILE, that cachefiles uses to check if
>     a file is already being used by some other kernel service (e.g. a
>     duplicate cachefiles cache in the same directory) and reject it if it
>     is. This isn't entirely necessary, but it helps prevent accidental
>     data corruption.
>
>     I don't want to use S_SWAPFILE as that has other effects, but quite
>     possibly swapon() should set S_KERNEL_FILE too.
>
>     Note that it doesn't prevent userspace from interfering, though
>     perhaps it should.
>
> (2) Cachefiles wants to keep the backing file for a cookie open whilst we
>     might need to write to it from network filesystem writeback. The
>     problem is that the network filesystem unuses its cookie when its file
>     is closed, and so we have nothing pinning the cachefiles file open and
>     it will get closed automatically after a short time to avoid
>     EMFILE/ENFILE problems.
>
>     Reopening the cache file, however, is a problem if this is being done
>     due to writeback triggered by exit(). Some filesystems will oops if
>     we try to open a file in that context because they want to access
>     current->fs or suchlike.
>
>     To get around this, I added the following:
>
>     (A) An inode flag, I_PINNING_FSCACHE_WB, to be set on a network
>         filesystem inode to indicate that we have a usage count on the
>         cookie caching that inode.
>
>     (B) A flag in struct writeback_control, unpinned_fscache_wb, that is
>         set when __writeback_single_inode() clears the last dirty page
>         from i_pages - at which point it clears I_PINNING_FSCACHE_WB and
>         sets this flag.
>
>         This has to be done here so that clearing I_PINNING_FSCACHE_WB can
>         be done atomically with the check of PAGECACHE_TAG_DIRTY that
>         clears I_DIRTY_PAGES.
>
>     (C) A function, fscache_set_page_dirty(), which, if I_PINNING_FSCACHE_WB
>         is not already set, sets it and calls fscache_use_cookie() to pin
>         the cache resources.
>
>     (D) A function, fscache_unpin_writeback(), to be called by
>         ->write_inode() to unuse the cookie.
>
>     (E) A function, fscache_clear_inode_writeback(), to be called when the
>         inode is evicted, before clear_inode() is called. This cleans up
>         any lingering I_PINNING_FSCACHE_WB.
>
>     The network filesystem can then use these tools to make sure that
>     fscache_write_to_cache() can write locally modified data to the cache
>     as well as to the server.
>
>     For the future, I'm working on write helpers for the netfs lib that
>     should allow this facility to be removed by keeping track of the dirty
>     regions separately - but that's incomplete at the moment and is also
>     going to be affected by folios, one way or another, since it deals
>     with pages.
>
>
> These patches can be found also on:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-rewrite-indexing
>
> David
>
> Link: https://lore.kernel.org/r/163363935000.1980952.15279841414072653108.stgit@warthog.procyon.org.uk [1]
> Link: https://lore.kernel.org/r/cover.1580251857.git.osandov@fb.com/ [2]
>
> ---
> Dave Wysochanski (3):
>       NFS: Convert fscache_acquire_cookie and fscache_relinquish_cookie
>       NFS: Convert fscache_enable_cookie and fscache_disable_cookie
>       NFS: Convert fscache invalidation and update aux_data and i_size
>
> David Howells (63):
>       mm: Stop filemap_read() from grabbing a superfluous page
>       vfs: Provide S_KERNEL_FILE inode flag
>       vfs, fscache: Force ->write_inode() to occur if cookie pinned for writeback
>       afs: Handle len being extending over page end in write_begin/write_end
>       afs: Fix afs_write_end() to handle len > page size
>       nfs, cifs, ceph, 9p: Disable use of fscache prior to its rewrite
>       fscache: Remove the netfs data from the cookie
>       fscache: Remove struct fscache_cookie_def
>       fscache: Remove store_limit* from struct fscache_object
>       fscache: Remove fscache_check_consistency()
>       fscache: Remove fscache_attr_changed()
>       fscache: Remove obsolete stats
>       fscache: Remove old I/O tracepoints
>       fscache: Temporarily disable fscache_invalidate()
>       fscache: Disable fscache_begin_operation()
>       fscache: Remove the I/O operation manager
>       fscache: Rename fscache_cookie_{get,put,see}()
>       cachefiles: Remove tree of active files and use S_CACHE_FILE inode flag
>       cachefiles: Don't set an xattr on the root of the cache
>       cachefiles: Remove some redundant checks on unsigned values
>       cachefiles: Prevent inode from going away when burying a dentry
>       cachefiles: Simplify the pathwalk and save the filename for an object
>       cachefiles: trace: Improve the lookup tracepoint
>       cachefiles: Remove separate backer dentry from cachefiles_object
>       cachefiles: Fold fscache_object into cachefiles_object
>       cachefiles: Change to storing file* rather than dentry*
>       cachefiles: trace: Log coherency checks
>       cachefiles: Trace truncations
>       cachefiles: Trace read and write operations
>       cachefiles: Round the cachefile size up to DIO block size
>       cachefiles: Don't use XATTR_ flags with vfs_setxattr()
>       fscache: Replace the object management state machine
>       cachefiles: Trace decisions in cachefiles_prepare_read()
>       cachefiles: Make cachefiles_write_prepare() check for space
>       fscache: Automatically close a file that's been unused for a while
>       fscache: Add stats for the cookie commit LRU
>       fscache: Move fscache_update_cookie() complete inline
>       fscache: Remove more obsolete stats
>       fscache: Note the object size during invalidation
>       vfs, fscache: Force ->write_inode() to occur if cookie pinned for writeback
>       afs: Render cache cookie key as big endian
>       cachefiles: Use tmpfile/link
>       fscache: Rewrite invalidation
>       cachefiles: Simplify the file lookup/creation/check code
>       fscache: Provide resize operation
>       cachefiles: Put more information in the xattr attached to the cache file
>       fscache: Implement "will_modify" parameter on fscache_use_cookie()
>       fscache: Add support for writing to the cache
>       fscache: Make fscache_clear_page_bits() conditional on cookie
>       fscache: Make fscache_write_to_cache() conditional on cookie
>       afs: Copy local writes to the cache when writing to the server
>       afs: Invoke fscache_resize_cookie() when handling ATTR_SIZE for setattr
>       afs: Add O_DIRECT read support
>       afs: Skip truncation on the server of data we haven't written yet
>       afs: Make afs_write_begin() return the THP subpage
>       cachefiles, afs: Drive FSCACHE_COOKIE_NO_DATA_TO_READ
>       nfs: Convert to new fscache volume/cookie API
>       9p: Use fscache indexing rewrite and reenable caching
>       9p: Copy local writes to the cache when writing to the server
>       netfs: Display the netfs inode number in the netfs_read tracepoint
>       cachefiles: Add tracepoints to log errors from ops on the backing fs
>       cachefiles: Add error injection support
>       cifs: Support fscache indexing rewrite (untested)
>
> Jeff Layton (1):
>       fscache: disable cookie when doing an invalidation for DIO write
>
>
>  fs/9p/cache.c | 184 +----
>  fs/9p/cache.h | 23 +-
>  fs/9p/v9fs.c | 14 +-
>  fs/9p/v9fs.h | 13 +-
>  fs/9p/vfs_addr.c | 55 +-
>  fs/9p/vfs_dir.c | 13 +-
>  fs/9p/vfs_file.c | 7 +-
>  fs/9p/vfs_inode.c | 24 +-
>  fs/9p/vfs_inode_dotl.c | 3 +-
>  fs/9p/vfs_super.c | 3 +
>  fs/afs/Makefile | 3 -
>  fs/afs/cache.c | 68 --
>  fs/afs/cell.c | 12 -
>  fs/afs/file.c | 83 +-
>  fs/afs/fsclient.c | 18 +-
>  fs/afs/inode.c | 101 ++-
>  fs/afs/internal.h | 36 +-
>  fs/afs/main.c | 14 -
>  fs/afs/super.c | 1 +
>  fs/afs/volume.c | 15 +-
>  fs/afs/write.c | 170 +++-
>  fs/afs/yfsclient.c | 12 +-
>  fs/cachefiles/Kconfig | 8 +
>  fs/cachefiles/Makefile | 3 +
>  fs/cachefiles/bind.c | 186 +++--
>  fs/cachefiles/daemon.c | 20 +-
>  fs/cachefiles/error_inject.c | 46 ++
>  fs/cachefiles/interface.c | 660 +++++++--------
>  fs/cachefiles/internal.h | 191 +++--
>  fs/cachefiles/io.c | 310 +++++--
>  fs/cachefiles/key.c | 203 +++--
>  fs/cachefiles/main.c | 20 +-
>  fs/cachefiles/namei.c | 978 ++++++++++------------
>  fs/cachefiles/volume.c | 128 +++
>  fs/cachefiles/xattr.c | 367 +++------
>  fs/ceph/Kconfig | 2 +-
>  fs/cifs/Makefile | 2 +-
>  fs/cifs/cache.c | 105 ---
>  fs/cifs/cifsfs.c | 11 +-
>  fs/cifs/cifsglob.h | 5 +-
>  fs/cifs/connect.c | 3 -
>  fs/cifs/file.c | 37 +-
>  fs/cifs/fscache.c | 201 ++---
>  fs/cifs/fscache.h | 51 +-
>  fs/cifs/inode.c | 18 +-
>  fs/fs-writeback.c | 8 +
>  fs/fscache/Kconfig | 4 +
>  fs/fscache/Makefile | 6 +-
>  fs/fscache/cache.c | 541 ++++++-------
>  fs/fscache/cookie.c | 1262 ++++++++++++++---------------
>  fs/fscache/fsdef.c | 98 ---
>  fs/fscache/internal.h | 213 +----
>  fs/fscache/io.c | 405 ++++++---
>  fs/fscache/main.c | 134 +--
>  fs/fscache/netfs.c | 74 --
>  fs/fscache/object.c | 1123 -------------------------
>  fs/fscache/operation.c | 633 ---------------
>  fs/fscache/page.c | 84 --
>  fs/fscache/proc.c | 43 +-
>  fs/fscache/stats.c | 202 ++---
>  fs/fscache/volume.c | 449 ++++++++++
>  fs/netfs/read_helper.c | 2 +-
>  fs/nfs/Makefile | 2 +-
>  fs/nfs/client.c | 4 -
>  fs/nfs/direct.c | 2 +
>  fs/nfs/file.c | 7 +-
>  fs/nfs/fscache-index.c | 114 ---
>  fs/nfs/fscache.c | 264 ++----
>  fs/nfs/fscache.h | 89 +-
>  fs/nfs/inode.c | 11 +-
>  fs/nfs/super.c | 7 +-
>  fs/nfs/write.c | 1 +
>  include/linux/fs.h | 4 +
>  include/linux/fscache-cache.h | 463 +++--------
>  include/linux/fscache.h | 626 +++++++-------
>  include/linux/netfs.h | 4 +-
>  include/linux/nfs_fs_sb.h | 9 +-
>  include/linux/writeback.h | 1 +
>  include/trace/events/cachefiles.h | 483 ++++++++---
>  include/trace/events/fscache.h | 631 +++++++--------
>  include/trace/events/netfs.h | 5 +-
>  81 files changed, 5140 insertions(+), 7295 deletions(-)
>  delete mode 100644 fs/afs/cache.c
>  create mode 100644 fs/cachefiles/error_inject.c
>  create mode 100644 fs/cachefiles/volume.c
>  delete mode 100644 fs/cifs/cache.c
>  delete mode 100644 fs/fscache/fsdef.c
>  delete mode 100644 fs/fscache/netfs.c
>  delete mode 100644 fs/fscache/object.c
>  delete mode 100644 fs/fscache/operation.c
>  create mode 100644 fs/fscache/volume.c
>  delete mode 100644 fs/nfs/fscache-index.c
>
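The writeback-pinning scheme described in point (2) of the cover letter quoted above might be wired into a network filesystem roughly as sketched below. This is only a sketch: the examplefs_* names (including the examplefs_i_cookie() accessor) are made up, and the helper signatures are inferred from the descriptions above rather than taken from the posted patches.

/* Hypothetical accessor returning the fscache cookie for a netfs inode. */
static struct fscache_cookie *examplefs_i_cookie(struct inode *inode);

static int examplefs_set_page_dirty(struct page *page)
{
	struct inode *inode = page->mapping->host;

	/* First dirty page on a clean inode: sets I_PINNING_FSCACHE_WB and
	 * takes a use on the cookie so the backing file stays open (C). */
	return fscache_set_page_dirty(page, examplefs_i_cookie(inode));
}

static int examplefs_write_inode(struct inode *inode, struct writeback_control *wbc)
{
	/* Drop the pin once writeback has cleaned the inode (D). */
	fscache_unpin_writeback(wbc, examplefs_i_cookie(inode));
	return 0;
}

static void examplefs_evict_inode(struct inode *inode)
{
	truncate_inode_pages_final(&inode->i_data);
	/* Clean up any lingering I_PINNING_FSCACHE_WB before clear_inode() (E). */
	fscache_clear_inode_writeback(examplefs_i_cookie(inode), inode, NULL);
	clear_inode(inode);
}

The idea behind (C), (D) and (E) is that the first page dirtied against a clean inode takes a use on the cookie, so the cache's backing file cannot be closed out from under writeback, and that use is dropped again from ->write_inode() or when the inode is evicted.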
Jeff Layton <jlayton@kernel.org> wrote:
> Given the indexing changes, what sort of behavior should we expect when
> upgrading from old-style to new-style indexes? Do they just not match,
> and we end up downloading new copies of all the data and the old stale
> stuff eventually gets culled?

Correct: they don't match. The names of the directories and files will be
quite different - and so will the attached xattrs. However, no filesystems
currently store locally-modified data in the cache, so you shouldn't lose
any data after upgrading.

> Ditto for downgrades -- can we expect sane behavior if someone runs an
> old kernel on top of an existing fscache that was populated by a new
> kernel?

Correct. With this branch, filesystems now store locally-modified data into
the cache - but they also upload it to the server at the same time. If
there's a disagreement between what's in the cache and what's on the server
with this branch, the cache is discarded, so simply discarding the cache on a
downgrade shouldn't be a problem.

It's currently operating as a write-through cache, not a write-back cache.
That will change if I get round to implementing disconnected operation, but
it's not there yet.

David
On Mon, Oct 18, 2021 at 03:50:15PM +0100, David Howells wrote:
>
> Here's a set of patches that rewrites and simplifies the fscache index API
> to remove the complex operation scheduling and object state machine in
> favour of something much smaller and simpler. It is built on top of the
> set of patches that removes the old API[1].
>
> The operation scheduling API was intended to handle sequencing of cache
> operations, which were all required (where possible) to run asynchronously
> in parallel with the operations being done by the network filesystem, while
> allowing the cache to be brought online and offline and interrupt service
> with invalidation.
>
> However, with the advent of the tmpfile capacity in the VFS, an opportunity
> arises to do invalidation much more easily, without having to wait for I/O
> that's actually in progress: Cachefiles can simply cut over its file
> pointer for the backing object attached to a cookie and abandon the
> in-progress I/O, dismissing it upon completion.
>
> Future work there would involve using Omar Sandoval's vfs_link() with
> AT_LINK_REPLACE[2] to allow an extant file to be displaced by a new hard
> link from a tmpfile as currently I have to unlink the old file first.

I had forgotten about that. It'd be great to finish that someday, but given
the dead-end of the last discussion [1], we might need to hash it out the
next time we can convene in person.

1: https://lore.kernel.org/linux-fsdevel/364531.1579265357@warthog.procyon.org.uk/
On Thu, Oct 21, 2021 at 5:21 PM Omar Sandoval <osandov@osandov.com> wrote:
>
> On Mon, Oct 18, 2021 at 03:50:15PM +0100, David Howells wrote:
> However, with the advent of the tmpfile capacity in the VFS, an opportunity
> arises to do invalidation much more easily, without having to wait for I/O
> that's actually in progress: Cachefiles can simply cut over its file
> pointer for the backing object attached to a cookie and abandon the
> in-progress I/O, dismissing it upon completion.

Have changes been made to O_TMPFILE? It is problematic for network
filesystems because it is not an atomic operation, and it would be great if
it were possible to create a tmpfile and open it atomically (at the file
system level).

Currently it results in creating a tmpfile (which results in open/create then
close) immediately followed by reopening the tmpfile, which is somewhat
counter to the whole idea of a tmpfile (i.e. that it is deleted when closed),
since the syscall results in two opens, i.e. open(create)/close/open/close.
Steve French wrote on Thu, Oct 21, 2021 at 06:15:49PM -0500:
> Have changes been made to O_TMPFILE? It is problematic for network filesystems
> because it is not an atomic operation, and would be great if it were possible
> to create a tmpfile and open it atomically (at the file system level).
>
> Currently it results in creating a tmpfile (which results in
> opencreate then close)
> immediately followed by reopening the tmpfile which is somewhat counter to
> the whole idea of a tmpfile (ie that it is deleted when closed) since
> the syscall results
> in two opens ie open(create)/close/open/close

That depends on the filesystem, e.g. 9p open returns the opened fid, so our
semantic could be closer to that of a local filesystem ("could" because I
didn't test recently and don't recall how it was handled; I think it was
fixed, as I remember it being a problem at some point...)

The main problem with network filesystems and "open closed files" is: what
should the server do if the client disconnects? Will the client come back and
try to access that file again? Did the client crash completely and should the
file be deleted? The server has no way of knowing.

It's the same logic as unlinking an open file, leading to all sorts of "silly
renames" that are most famous for nfs (.nfsxxxx files that the servers have
been taught to just delete if they haven't been accessed for long enough...)

I'm not sure we can have a generic solution for that, unfortunately...

(9P is "easy" enough in that the linux client does not attempt to reconnect
ever if the connection has been lost, so we can just really unlink the file
and the server will delete it when the client disconnects... But if we were
to implement a reconnect eventually (I know of an existing port that does)
then this solution is no longer acceptable.)
On Thu, 2021-10-21 at 18:15 -0500, Steve French wrote:
> On Thu, Oct 21, 2021 at 5:21 PM Omar Sandoval <osandov@osandov.com> wrote:
> >
> > On Mon, Oct 18, 2021 at 03:50:15PM +0100, David Howells wrote:
> > However, with the advent of the tmpfile capacity in the VFS, an opportunity
> > arises to do invalidation much more easily, without having to wait for I/O
> > that's actually in progress: Cachefiles can simply cut over its file
> > pointer for the backing object attached to a cookie and abandon the
> > in-progress I/O, dismissing it upon completion.
>
> Have changes been made to O_TMPFILE? It is problematic for network filesystems
> because it is not an atomic operation, and would be great if it were possible
> to create a tmpfile and open it atomically (at the file system level).
>
> Currently it results in creating a tmpfile (which results in
> opencreate then close)
> immediately followed by reopening the tmpfile which is somewhat counter to
> the whole idea of a tmpfile (ie that it is deleted when closed) since
> the syscall results
> in two opens ie open(create)/close/open/close
>

In this case, O_TMPFILE is being used on the cachefiles backing store, and
that usually isn't deployed on a netfs.

That said, Steve does have a good point... What happens if you do end up
without O_TMPFILE support on the backing store? Probably just opting to not
cache in that case is fine. Does cachefiles just shut down in that situation?
Steve French <smfrench@gmail.com> wrote:
> On Thu, Oct 21, 2021 at 5:21 PM Omar Sandoval <osandov@osandov.com> wrote:
> >
> > On Mon, Oct 18, 2021 at 03:50:15PM +0100, David Howells wrote:
> > However, with the advent of the tmpfile capacity in the VFS, an opportunity
> > arises to do invalidation much more easily, without having to wait for I/O
> > that's actually in progress: Cachefiles can simply cut over its file
> > pointer for the backing object attached to a cookie and abandon the
> > in-progress I/O, dismissing it upon completion.
>
> Have changes been made to O_TMPFILE? It is problematic for network
> filesystems because it is not an atomic operation, and would be great if it
> were possible to create a tmpfile and open it atomically (at the file system
> level).

In this case, it's nothing to do with the network filesystem that's using the
cache per se. Cachefiles is using tmpfiles on the backing filesystem, so as
long as that's, say, ext4, xfs or btrfs, it should work fine. The backing
filesystem also needs to support SEEK_HOLE and SEEK_DATA.

I'm not sure I'd recommend putting your cache on a network filesystem.

David
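For reference, the tmpfile-then-link pattern being discussed looks roughly like this at the syscall level. This is a userspace illustration of the idea only (cachefiles itself works with the in-kernel equivalents rather than these syscalls), and publish_tmpfile() is a made-up helper name.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Create an anonymous file on the cache filesystem, fill it, then give it a
 * name.  Until something like AT_LINK_REPLACE exists, any old file at the
 * target name has to be unlinked first, as the cover letter notes. */
int publish_tmpfile(int dirfd, const char *name, const char *data, size_t len)
{
	char path[64];
	int fd = openat(dirfd, ".", O_TMPFILE | O_WRONLY, 0600);
	if (fd < 0)
		return -1;

	if (write(fd, data, len) != (ssize_t)len)
		goto err;

	unlinkat(dirfd, name, 0);	/* ignore ENOENT: nothing to displace */

	/* Link the unnamed file into place via its /proc/self/fd entry
	 * (the unprivileged equivalent of linkat() with AT_EMPTY_PATH). */
	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	if (linkat(AT_FDCWD, path, dirfd, name, AT_SYMLINK_FOLLOW) < 0)
		goto err;

	close(fd);
	return 0;
err:
	close(fd);
	return -1;
}

If vfs_link()/linkat() gained an AT_LINK_REPLACE-style flag, the unlinkat() step would become unnecessary and the cutover would be atomic, which is the future work David mentions in the cover letter.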