Message ID: 163456861570.2614702.14754548462706508617.stgit@warthog.procyon.org.uk (mailing list archive)
Series: fscache: Rewrite index API and management system
On Mon, Oct 18, 2021 at 11:50 AM David Howells <dhowells@redhat.com> wrote:
>
>
> Here's a set of patches that rewrites and simplifies the fscache index API
> to remove the complex operation scheduling and object state machine in
> favour of something much smaller and simpler. It is built on top of the
> set of patches that removes the old API[1].

Testing this series in our afs test framework, I saw the oops pasted below.

cachefiles_begin_operation+0x2d maps to cachefiles/io.c:565, where object is
probably NULL (object->file is at offset 0x28).

Marc

===
BUG: kernel NULL pointer dereference, address: 0000000000000028
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] SMP NOPTI
CPU: 5 PID: 16607 Comm: ar Tainted: G E 5.15.0-rc5.kafs_testing+ #37
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2.fc34 04/01/2014
RIP: 0010:cachefiles_begin_operation+0x2d/0x80 [cachefiles]
Code: 00 00 55 53 48 83 ec 08 48 8b 47 08 48 83 7f 10 00 48 8b 68 20 74 0c b8 01 00 00 00 48 83 c4 08 5b 5d c3 48 c7 07 a0 12 1b a0 <48> 8b 45 28 48 89 fb 48 85 c0 74 20 48 8d 7d 04 89 74 24 04 e8 3a
RSP: 0018:ffffc90000d33b48 EFLAGS: 00010246
RAX: ffff888014991420 RBX: ffff888100ae9cf0 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff888100ae9cf0
RBP: 0000000000000000 R08: 00000000000006b8 R09: ffff88810e98e000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff888014991434
R13: 0000000000000002 R14: ffff888014991420 R15: 0000000000000002
FS:  00007f72d0486b80(0000) GS:ffff888139940000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000028 CR3: 000000007bac8004 CR4: 0000000000770ee0
PKRU: 55555554
Call Trace:
 fscache_begin_operation.part.0+0x1e3/0x210 [fscache]
 netfs_write_begin+0x3fb/0x800 [netfs]
 ? __fscache_use_cookie+0x120/0x200 [fscache]
 afs_write_begin+0x58/0x2c0 [kafs]
 ? __vfs_getxattr+0x2a/0x70
 generic_perform_write+0xb1/0x1b0
 ? file_update_time+0xcf/0x120
 __generic_file_write_iter+0x14c/0x1d0
 generic_file_write_iter+0x5d/0xb0
 afs_file_write+0x73/0xa0 [kafs]
 new_sync_write+0x105/0x180
 vfs_write+0x1cb/0x260
 ksys_write+0x4f/0xc0
 do_syscall_64+0x35/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f72d059a7a7
Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
RSP: 002b:00007fffc31942b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f72d059a7a7
RDX: 0000000000000008 RSI: 000055fe42367730 RDI: 0000000000000003
RBP: 000055fe42367730 R08: 0000000000000000 R09: 00007f72d066ca00
R10: 000000000000007c R11: 0000000000000246 R12: 0000000000000008
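A minimal sketch of the failure mode Marc describes, assuming cachefiles_begin_operation() is reached before a backing object has been attached to the cookie. The structure and guard below are illustrative only: the field layout is chosen so that the file pointer lands at offset 0x28, matching the faulting address, and none of this is the actual cachefiles code.

/* Illustrative layout: five pointer-sized fields ahead of ->file put it at
 * offset 0x28 on a 64-bit build, matching CR2 in the oops above. */
struct cachefiles_object_sketch {
	void *placeholder[5];
	struct file *file;
};

static bool cachefiles_begin_operation_sketch(struct cachefiles_object_sketch *object)
{
	/* Hypothetical guard: without it, reading object->file faults at
	 * address 0x28 when object is NULL, as seen in the trace above. */
	if (!object)
		return false;

	return object->file != NULL;
}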
On Mon, 2021-10-18 at 15:50 +0100, David Howells wrote:
> Here's a set of patches that rewrites and simplifies the fscache index API
> to remove the complex operation scheduling and object state machine in
> favour of something much smaller and simpler. It is built on top of the
> set of patches that removes the old API[1].
>
> The operation scheduling API was intended to handle sequencing of cache
> operations, which were all required (where possible) to run asynchronously
> in parallel with the operations being done by the network filesystem, while
> allowing the cache to be brought online and offline and interrupt service
> with invalidation.
>
> However, with the advent of the tmpfile capacity in the VFS, an opportunity
> arises to do invalidation much more easily, without having to wait for I/O
> that's actually in progress: Cachefiles can simply cut over its file
> pointer for the backing object attached to a cookie and abandon the
> in-progress I/O, dismissing it upon completion.
>
> Future work there would involve using Omar Sandoval's vfs_link() with
> AT_LINK_REPLACE[2] to allow an extant file to be displaced by a new hard
> link from a tmpfile, as currently I have to unlink the old file first.
>
> These patches can also simplify the object state handling as I/O operations
> to the cache don't all have to be brought to a stop in order to invalidate
> a file. To that end, and with an eye to writing a new backing cache
> model in the future, I've taken the opportunity to simplify the indexing
> structure.
>
> I've separated the index cookie concept from the file cookie concept by
> type now. The former is now called a "volume cookie" (struct
> fscache_volume) and is a container of file cookies. There are then
> just the two levels. All the index cookieage is collapsed into a single
> volume cookie, and this has a single printable string as a key. For
> instance, an AFS volume would have a key of something like
> "afs,example.com,1000555", combining the filesystem name, cell name and
> volume ID. This is freeform, but must not have '/' chars in it.
>

Given the indexing changes, what sort of behavior should we expect when
upgrading from old-style to new-style indexes? Do they just not match, and
we end up downloading new copies of all the data and the old stale stuff
eventually gets culled?

Ditto for downgrades -- can we expect sane behavior if someone runs an old
kernel on top of an existing fscache that was populated by a new kernel?

> I've also eliminated all pointers back from fscache into the network
> filesystem. This required the duplication of a little bit of data in the
> cookie (cookie key, coherency data and file size), but it's not actually
> that much. This gets rid of problems with making sure we keep netfs data
> structures around so that the cache can access them.
>
> I have changed afs throughout the patch series, but I also have patches for
> 9p, nfs and cifs. Jeff Layton is handling ceph support.
>
>
> BITS THAT MAY BE CONTROVERSIAL
> ==============================
>
> There are some bits I've added that may be controversial:
>
> (1) I've provided a flag, S_KERNEL_FILE, that cachefiles uses to check if
>     a file is already being used by some other kernel service (e.g. a
>     duplicate cachefiles cache in the same directory) and reject it if it
>     is. This isn't entirely necessary, but it helps prevent accidental
>     data corruption.
>
>     I don't want to use S_SWAPFILE as that has other effects, but quite
>     possibly swapon() should set S_KERNEL_FILE too.
>
>     Note that it doesn't prevent userspace from interfering, though
>     perhaps it should.
>
> (2) Cachefiles wants to keep the backing file for a cookie open whilst we
>     might need to write to it from network filesystem writeback. The
>     problem is that the network filesystem unuses its cookie when its file
>     is closed, and so we have nothing pinning the cachefiles file open and
>     it will get closed automatically after a short time to avoid
>     EMFILE/ENFILE problems.
>
>     Reopening the cache file, however, is a problem if this is being done
>     due to writeback triggered by exit(). Some filesystems will oops if
>     we try to open a file in that context because they want to access
>     current->fs or suchlike.
>
>     To get around this, I added the following:
>
>     (A) An inode flag, I_PINNING_FSCACHE_WB, to be set on a network
>         filesystem inode to indicate that we have a usage count on the
>         cookie caching that inode.
>
>     (B) A flag in struct writeback_control, unpinned_fscache_wb, that is
>         set when __writeback_single_inode() clears the last dirty page
>         from i_pages - at which point it clears I_PINNING_FSCACHE_WB and
>         sets this flag.
>
>         This has to be done here so that clearing I_PINNING_FSCACHE_WB can
>         be done atomically with the check of PAGECACHE_TAG_DIRTY that
>         clears I_DIRTY_PAGES.
>
>     (C) A function, fscache_set_page_dirty(), which, if I_PINNING_FSCACHE_WB
>         is not already set, sets it and calls fscache_use_cookie() to pin
>         the cache resources.
>
>     (D) A function, fscache_unpin_writeback(), to be called by
>         ->write_inode() to unuse the cookie.
>
>     (E) A function, fscache_clear_inode_writeback(), to be called when the
>         inode is evicted, before clear_inode() is called. This cleans up
>         any lingering I_PINNING_FSCACHE_WB.
>
>     The network filesystem can then use these tools to make sure that
>     fscache_write_to_cache() can write locally modified data to the cache
>     as well as to the server.
>
>     For the future, I'm working on write helpers for the netfs lib that
>     should allow this facility to be removed by keeping track of the dirty
>     regions separately - but that's incomplete at the moment and is also
>     going to be affected by folios, one way or another, since it deals
>     with pages.
>
>
> These patches can be found also on:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-rewrite-indexing
>
> David
>
> Link: https://lore.kernel.org/r/163363935000.1980952.15279841414072653108.stgit@warthog.procyon.org.uk [1]
> Link: https://lore.kernel.org/r/cover.1580251857.git.osandov@fb.com/ [2]
>
> ---
> Dave Wysochanski (3):
>       NFS: Convert fscache_acquire_cookie and fscache_relinquish_cookie
>       NFS: Convert fscache_enable_cookie and fscache_disable_cookie
>       NFS: Convert fscache invalidation and update aux_data and i_size
>
> David Howells (63):
>       mm: Stop filemap_read() from grabbing a superfluous page
>       vfs: Provide S_KERNEL_FILE inode flag
>       vfs, fscache: Force ->write_inode() to occur if cookie pinned for writeback
>       afs: Handle len being extending over page end in write_begin/write_end
>       afs: Fix afs_write_end() to handle len > page size
>       nfs, cifs, ceph, 9p: Disable use of fscache prior to its rewrite
>       fscache: Remove the netfs data from the cookie
>       fscache: Remove struct fscache_cookie_def
>       fscache: Remove store_limit* from struct fscache_object
>       fscache: Remove fscache_check_consistency()
>       fscache: Remove fscache_attr_changed()
>       fscache: Remove obsolete stats
>       fscache: Remove old I/O tracepoints
>       fscache: Temporarily disable fscache_invalidate()
>       fscache: Disable fscache_begin_operation()
>       fscache: Remove the I/O operation manager
>       fscache: Rename fscache_cookie_{get,put,see}()
>       cachefiles: Remove tree of active files and use S_CACHE_FILE inode flag
>       cachefiles: Don't set an xattr on the root of the cache
>       cachefiles: Remove some redundant checks on unsigned values
>       cachefiles: Prevent inode from going away when burying a dentry
>       cachefiles: Simplify the pathwalk and save the filename for an object
>       cachefiles: trace: Improve the lookup tracepoint
>       cachefiles: Remove separate backer dentry from cachefiles_object
>       cachefiles: Fold fscache_object into cachefiles_object
>       cachefiles: Change to storing file* rather than dentry*
>       cachefiles: trace: Log coherency checks
>       cachefiles: Trace truncations
>       cachefiles: Trace read and write operations
>       cachefiles: Round the cachefile size up to DIO block size
>       cachefiles: Don't use XATTR_ flags with vfs_setxattr()
>       fscache: Replace the object management state machine
>       cachefiles: Trace decisions in cachefiles_prepare_read()
>       cachefiles: Make cachefiles_write_prepare() check for space
>       fscache: Automatically close a file that's been unused for a while
>       fscache: Add stats for the cookie commit LRU
>       fscache: Move fscache_update_cookie() complete inline
>       fscache: Remove more obsolete stats
>       fscache: Note the object size during invalidation
>       vfs, fscache: Force ->write_inode() to occur if cookie pinned for writeback
>       afs: Render cache cookie key as big endian
>       cachefiles: Use tmpfile/link
>       fscache: Rewrite invalidation
>       cachefiles: Simplify the file lookup/creation/check code
>       fscache: Provide resize operation
>       cachefiles: Put more information in the xattr attached to the cache file
>       fscache: Implement "will_modify" parameter on fscache_use_cookie()
>       fscache: Add support for writing to the cache
>       fscache: Make fscache_clear_page_bits() conditional on cookie
>       fscache: Make fscache_write_to_cache() conditional on cookie
>       afs: Copy local writes to the cache when writing to the server
>       afs: Invoke fscache_resize_cookie() when handling ATTR_SIZE for setattr
>       afs: Add O_DIRECT read support
>       afs: Skip truncation on the server of data we haven't written yet
>       afs: Make afs_write_begin() return the THP subpage
>       cachefiles, afs: Drive FSCACHE_COOKIE_NO_DATA_TO_READ
>       nfs: Convert to new fscache volume/cookie API
>       9p: Use fscache indexing rewrite and reenable caching
>       9p: Copy local writes to the cache when writing to the server
>       netfs: Display the netfs inode number in the netfs_read tracepoint
>       cachefiles: Add tracepoints to log errors from ops on the backing fs
>       cachefiles: Add error injection support
>       cifs: Support fscache indexing rewrite (untested)
>
> Jeff Layton (1):
>       fscache: disable cookie when doing an invalidation for DIO write
>
>
>  fs/9p/cache.c | 184 +----
>  fs/9p/cache.h | 23 +-
>  fs/9p/v9fs.c | 14 +-
>  fs/9p/v9fs.h | 13 +-
>  fs/9p/vfs_addr.c | 55 +-
>  fs/9p/vfs_dir.c | 13 +-
>  fs/9p/vfs_file.c | 7 +-
>  fs/9p/vfs_inode.c | 24 +-
>  fs/9p/vfs_inode_dotl.c | 3 +-
>  fs/9p/vfs_super.c | 3 +
>  fs/afs/Makefile | 3 -
>  fs/afs/cache.c | 68 --
>  fs/afs/cell.c | 12 -
>  fs/afs/file.c | 83 +-
>  fs/afs/fsclient.c | 18 +-
>  fs/afs/inode.c | 101 ++-
>  fs/afs/internal.h | 36 +-
>  fs/afs/main.c | 14 -
>  fs/afs/super.c | 1 +
>  fs/afs/volume.c | 15 +-
>  fs/afs/write.c | 170 +++-
>  fs/afs/yfsclient.c | 12 +-
>  fs/cachefiles/Kconfig | 8 +
>  fs/cachefiles/Makefile | 3 +
>  fs/cachefiles/bind.c | 186 +++--
>  fs/cachefiles/daemon.c | 20 +-
>  fs/cachefiles/error_inject.c | 46 ++
>  fs/cachefiles/interface.c | 660 +++++++--------
>  fs/cachefiles/internal.h | 191 +++--
>  fs/cachefiles/io.c | 310 +++++--
>  fs/cachefiles/key.c | 203 +++--
>  fs/cachefiles/main.c | 20 +-
>  fs/cachefiles/namei.c | 978 ++++++++++------------
>  fs/cachefiles/volume.c | 128 +++
>  fs/cachefiles/xattr.c | 367 +++------
>  fs/ceph/Kconfig | 2 +-
>  fs/cifs/Makefile | 2 +-
>  fs/cifs/cache.c | 105 ---
>  fs/cifs/cifsfs.c | 11 +-
>  fs/cifs/cifsglob.h | 5 +-
>  fs/cifs/connect.c | 3 -
>  fs/cifs/file.c | 37 +-
>  fs/cifs/fscache.c | 201 ++---
>  fs/cifs/fscache.h | 51 +-
>  fs/cifs/inode.c | 18 +-
>  fs/fs-writeback.c | 8 +
>  fs/fscache/Kconfig | 4 +
>  fs/fscache/Makefile | 6 +-
>  fs/fscache/cache.c | 541 ++++++-------
>  fs/fscache/cookie.c | 1262 ++++++++++++++---------------
>  fs/fscache/fsdef.c | 98 ---
>  fs/fscache/internal.h | 213 +----
>  fs/fscache/io.c | 405 ++++++---
>  fs/fscache/main.c | 134 +--
>  fs/fscache/netfs.c | 74 --
>  fs/fscache/object.c | 1123 -------------------------
>  fs/fscache/operation.c | 633 ---------------
>  fs/fscache/page.c | 84 --
>  fs/fscache/proc.c | 43 +-
>  fs/fscache/stats.c | 202 ++---
>  fs/fscache/volume.c | 449 ++++++++++
>  fs/netfs/read_helper.c | 2 +-
>  fs/nfs/Makefile | 2 +-
>  fs/nfs/client.c | 4 -
>  fs/nfs/direct.c | 2 +
>  fs/nfs/file.c | 7 +-
>  fs/nfs/fscache-index.c | 114 ---
>  fs/nfs/fscache.c | 264 ++----
>  fs/nfs/fscache.h | 89 +-
>  fs/nfs/inode.c | 11 +-
>  fs/nfs/super.c | 7 +-
>  fs/nfs/write.c | 1 +
>  include/linux/fs.h | 4 +
>  include/linux/fscache-cache.h | 463 +++--------
>  include/linux/fscache.h | 626 +++++++-------
>  include/linux/netfs.h | 4 +-
>  include/linux/nfs_fs_sb.h | 9 +-
>  include/linux/writeback.h | 1 +
>  include/trace/events/cachefiles.h | 483 ++++++++---
>  include/trace/events/fscache.h | 631 +++++++--------
>  include/trace/events/netfs.h | 5 +-
>  81 files changed, 5140 insertions(+), 7295 deletions(-)
>  delete mode 100644 fs/afs/cache.c
>  create mode 100644 fs/cachefiles/error_inject.c
>  create mode 100644 fs/cachefiles/volume.c
>  delete mode 100644 fs/cifs/cache.c
>  delete mode 100644 fs/fscache/fsdef.c
>  delete mode 100644 fs/fscache/netfs.c
>  delete mode 100644 fs/fscache/object.c
>  delete mode 100644 fs/fscache/operation.c
>  create mode 100644 fs/fscache/volume.c
>  delete mode 100644 fs/nfs/fscache-index.c
>
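The writeback-pinning scheme described in point (2) of the cover letter quoted above might be wired into a network filesystem roughly as sketched below. This is only a sketch: the examplefs_* names (including the examplefs_i_cookie() accessor) are made up, and the helper signatures are inferred from the descriptions above rather than taken from the posted patches.

/* Hypothetical accessor returning the fscache cookie for a netfs inode. */
static struct fscache_cookie *examplefs_i_cookie(struct inode *inode);

static int examplefs_set_page_dirty(struct page *page)
{
	struct inode *inode = page->mapping->host;

	/* First dirty page on a clean inode: sets I_PINNING_FSCACHE_WB and
	 * takes a use on the cookie so the backing file stays open (C). */
	return fscache_set_page_dirty(page, examplefs_i_cookie(inode));
}

static int examplefs_write_inode(struct inode *inode, struct writeback_control *wbc)
{
	/* Drop the pin once writeback has cleaned the inode (D). */
	fscache_unpin_writeback(wbc, examplefs_i_cookie(inode));
	return 0;
}

static void examplefs_evict_inode(struct inode *inode)
{
	truncate_inode_pages_final(&inode->i_data);
	/* Clean up any lingering I_PINNING_FSCACHE_WB before clear_inode() (E). */
	fscache_clear_inode_writeback(examplefs_i_cookie(inode), inode, NULL);
	clear_inode(inode);
}

The idea behind (C), (D) and (E) is that the first page dirtied against a clean inode takes a use on the cookie, so the cache's backing file cannot be closed out from under writeback, and that use is dropped again from ->write_inode() or when the inode is evicted.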
Jeff Layton <jlayton@kernel.org> wrote:
> Given the indexing changes, what sort of behavior should we expect when
> upgrading from old-style to new-style indexes? Do they just not match,
> and we end up downloading new copies of all the data and the old stale
> stuff eventually gets culled?

Correct: they don't match. The names of the directories and files will be
quite different - and so will the attached xattrs. However, no filesystems
currently store locally-modified data in the cache, so you shouldn't lose
any data after upgrading.

> Ditto for downgrades -- can we expect sane behavior if someone runs an
> old kernel on top of an existing fscache that was populated by a new
> kernel?

Correct. With this branch, filesystems now store locally-modified data into
the cache - but they also upload it to the server at the same time. If
there's a disagreement between what's in the cache and what's on the server
with this branch, the cache is discarded, so simply discarding the cache on a
downgrade shouldn't be a problem.

It's currently operating as a write-through cache, not a write-back cache.
That will change if I get round to implementing disconnected operation, but
it's not there yet.

David
On Mon, Oct 18, 2021 at 03:50:15PM +0100, David Howells wrote:
>
> Here's a set of patches that rewrites and simplifies the fscache index API
> to remove the complex operation scheduling and object state machine in
> favour of something much smaller and simpler. It is built on top of the
> set of patches that removes the old API[1].
>
> The operation scheduling API was intended to handle sequencing of cache
> operations, which were all required (where possible) to run asynchronously
> in parallel with the operations being done by the network filesystem, while
> allowing the cache to be brought online and offline and interrupt service
> with invalidation.
>
> However, with the advent of the tmpfile capacity in the VFS, an opportunity
> arises to do invalidation much more easily, without having to wait for I/O
> that's actually in progress: Cachefiles can simply cut over its file
> pointer for the backing object attached to a cookie and abandon the
> in-progress I/O, dismissing it upon completion.
>
> Future work there would involve using Omar Sandoval's vfs_link() with
> AT_LINK_REPLACE[2] to allow an extant file to be displaced by a new hard
> link from a tmpfile as currently I have to unlink the old file first.

I had forgotten about that. It'd be great to finish that someday, but given
the dead-end of the last discussion [1], we might need to hash it out the
next time we can convene in person.

1: https://lore.kernel.org/linux-fsdevel/364531.1579265357@warthog.procyon.org.uk/
On Thu, Oct 21, 2021 at 5:21 PM Omar Sandoval <osandov@osandov.com> wrote:
>
> On Mon, Oct 18, 2021 at 03:50:15PM +0100, David Howells wrote:
> However, with the advent of the tmpfile capacity in the VFS, an opportunity
> arises to do invalidation much more easily, without having to wait for I/O
> that's actually in progress: Cachefiles can simply cut over its file
> pointer for the backing object attached to a cookie and abandon the
> in-progress I/O, dismissing it upon completion.

Have changes been made to O_TMPFILE? It is problematic for network
filesystems because it is not an atomic operation, and it would be great if
it were possible to create a tmpfile and open it atomically (at the file
system level).

Currently it results in creating a tmpfile (which results in open/create then
close) immediately followed by reopening the tmpfile, which is somewhat
counter to the whole idea of a tmpfile (i.e. that it is deleted when closed),
since the syscall results in two opens, i.e. open(create)/close/open/close.
Steve French wrote on Thu, Oct 21, 2021 at 06:15:49PM -0500:
> Have changes been made to O_TMPFILE? It is problematic for network filesystems
> because it is not an atomic operation, and would be great if it were possible
> to create a tmpfile and open it atomically (at the file system level).
>
> Currently it results in creating a tmpfile (which results in
> opencreate then close)
> immediately followed by reopening the tmpfile which is somewhat counter to
> the whole idea of a tmpfile (ie that it is deleted when closed) since
> the syscall results
> in two opens ie open(create)/close/open/close

That depends on the filesystem, e.g. 9p open returns the opened fid, so our
semantic could be closer to that of a local filesystem ("could" because I
didn't test recently and don't recall how it was handled; I think it was
fixed, as I remember it being a problem at some point...)

The main problem with network filesystems and "open closed files" is: what
should the server do if the client disconnects? Will the client come back and
try to access that file again? Did the client crash completely and should the
file be deleted? The server has no way of knowing.

It's the same logic as unlinking an open file, leading to all sorts of "silly
renames" that are most famous for nfs (.nfsxxxx files that the servers have
been taught to just delete if they haven't been accessed for long enough...)

I'm not sure we can have a generic solution for that, unfortunately...

(9P is "easy" enough in that the linux client does not attempt to reconnect
ever if the connection has been lost, so we can just really unlink the file
and the server will delete it when the client disconnects... But if we were
to implement a reconnect eventually (I know of an existing port that does)
then this solution is no longer acceptable.)
On Thu, 2021-10-21 at 18:15 -0500, Steve French wrote:
> On Thu, Oct 21, 2021 at 5:21 PM Omar Sandoval <osandov@osandov.com> wrote:
> >
> > On Mon, Oct 18, 2021 at 03:50:15PM +0100, David Howells wrote:
> > However, with the advent of the tmpfile capacity in the VFS, an opportunity
> > arises to do invalidation much more easily, without having to wait for I/O
> > that's actually in progress: Cachefiles can simply cut over its file
> > pointer for the backing object attached to a cookie and abandon the
> > in-progress I/O, dismissing it upon completion.
>
> Have changes been made to O_TMPFILE? It is problematic for network filesystems
> because it is not an atomic operation, and would be great if it were possible
> to create a tmpfile and open it atomically (at the file system level).
>
> Currently it results in creating a tmpfile (which results in
> opencreate then close)
> immediately followed by reopening the tmpfile which is somewhat counter to
> the whole idea of a tmpfile (ie that it is deleted when closed) since
> the syscall results
> in two opens ie open(create)/close/open/close
>

In this case, O_TMPFILE is being used on the cachefiles backing store, and
that usually isn't deployed on a netfs.

That said, Steve does have a good point... What happens if you do end up
without O_TMPFILE support on the backing store? Probably just opting to not
cache in that case is fine. Does cachefiles just shut down in that situation?
Steve French <smfrench@gmail.com> wrote:
> On Thu, Oct 21, 2021 at 5:21 PM Omar Sandoval <osandov@osandov.com> wrote:
> >
> > On Mon, Oct 18, 2021 at 03:50:15PM +0100, David Howells wrote:
> > However, with the advent of the tmpfile capacity in the VFS, an opportunity
> > arises to do invalidation much more easily, without having to wait for I/O
> > that's actually in progress: Cachefiles can simply cut over its file
> > pointer for the backing object attached to a cookie and abandon the
> > in-progress I/O, dismissing it upon completion.
>
> Have changes been made to O_TMPFILE? It is problematic for network
> filesystems because it is not an atomic operation, and would be great if it
> were possible to create a tmpfile and open it atomically (at the file system
> level).

In this case, it's nothing to do with the network filesystem that's using the
cache per se. Cachefiles is using tmpfiles on the backing filesystem, so as
long as that's, say, ext4, xfs or btrfs, it should work fine. The backing
filesystem also needs to support SEEK_HOLE and SEEK_DATA.

I'm not sure I'd recommend putting your cache on a network filesystem.

David
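For reference, the tmpfile-then-link pattern being discussed looks roughly like this at the syscall level. This is a userspace illustration of the idea only (cachefiles itself works with the in-kernel equivalents rather than these syscalls), and publish_tmpfile() is a made-up helper name.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Create an anonymous file on the cache filesystem, fill it, then give it a
 * name.  Until something like AT_LINK_REPLACE exists, any old file at the
 * target name has to be unlinked first, as the cover letter notes. */
int publish_tmpfile(int dirfd, const char *name, const char *data, size_t len)
{
	char path[64];
	int fd = openat(dirfd, ".", O_TMPFILE | O_WRONLY, 0600);
	if (fd < 0)
		return -1;

	if (write(fd, data, len) != (ssize_t)len)
		goto err;

	unlinkat(dirfd, name, 0);	/* ignore ENOENT: nothing to displace */

	/* Link the unnamed file into place via its /proc/self/fd entry
	 * (the unprivileged equivalent of linkat() with AT_EMPTY_PATH). */
	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	if (linkat(AT_FDCWD, path, dirfd, name, AT_SYMLINK_FOLLOW) < 0)
		goto err;

	close(fd);
	return 0;
err:
	close(fd);
	return -1;
}

If vfs_link()/linkat() gained an AT_LINK_REPLACE-style flag, the unlinkat() step would become unnecessary and the cutover would be atomic, which is the future work David mentions in the cover letter.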