Message ID | 1668172.1709764777@warthog.procyon.org.uk (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [RFC] mm: Replace ->launder_folio() with flush and wait | expand |
On Wed, Mar 06, 2024 at 10:39:37PM +0000, David Howells wrote: > Here's a patch to have a go at getting rid of ->launder_folio(). Since it's > failable and cannot guarantee that pages in the range are removed, I've tried > to replace laundering with just flush-and-wait, dropping the folio lock around > the I/O. My sense is that ->launder_folio doesn't actually need to be replaced. commit e3db7691e9f3dff3289f64e3d98583e28afe03db Author: Trond Myklebust <Trond.Myklebust@netapp.com> Date: Wed Jan 10 23:15:39 2007 -0800 [PATCH] NFS: Fix race in nfs_release_page() NFS: Fix race in nfs_release_page() invalidate_inode_pages2() may find the dirty bit has been set on a page owing to the fact that the page may still be mapped after it was locked. Only after the call to unmap_mapping_range() are we sure that the page can no longer be dirtied. In order to fix this, NFS has hooked the releasepage() method and tries to write the page out between the call to unmap_mapping_range() and the call to remove_mapping(). This, however leads to deadlocks in the page reclaim code, where the page may be locked without holding a reference to the inode or dentry. Fix is to add a new address_space_operation, launder_page(), which will attempt to write out a dirty page without releasing the page lock. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> I don't understand why this couldn't've been solved by page_mkwrite. NFS did later add nfs_vm_page_mkwrite in July 2007, and maybe it's just not needed any more? I haven't looked into it enough to make sure, but my belief is that we should be able to get rid of it.
Matthew Wilcox <willy@infradead.org> wrote: > commit e3db7691e9f3dff3289f64e3d98583e28afe03db > Author: Trond Myklebust <Trond.Myklebust@netapp.com> > Date: Wed Jan 10 23:15:39 2007 -0800 > > [PATCH] NFS: Fix race in nfs_release_page() >... > invalidate_inode_pages2() may find the dirty bit has been set on a page > owing to the fact that the page may still be mapped after it was locked. > Only after the call to unmap_mapping_range() are we sure that the page > can no longer be dirtied. Is that last sentence even true? It evicts folios from the TLB and/or pagetables, but it doesn't actually trim any mmap made - in which case, userspace is perfectly at liberty to regenerate and dirty the folio the moment the folio is removed from the page tables. Otherwise DIO-to/from-mmap will deadlock. > but my belief is that we should be able to get rid of it. I think you're probably correct. The best we can do, I think, is to preface any call to invalidate_inode_pages2() with a flush-and-wait over the same range. David
Matthew Wilcox <willy@infradead.org> wrote: > On Wed, Mar 06, 2024 at 10:39:37PM +0000, David Howells wrote: > > Here's a patch to have a go at getting rid of ->launder_folio(). Since it's > > failable and cannot guarantee that pages in the range are removed, I've tried > > to replace laundering with just flush-and-wait, dropping the folio lock around > > the I/O. > > My sense is that ->launder_folio doesn't actually need to be replaced. > > commit e3db7691e9f3dff3289f64e3d98583e28afe03db > Author: Trond Myklebust <Trond.Myklebust@netapp.com> > Date: Wed Jan 10 23:15:39 2007 -0800 > > [PATCH] NFS: Fix race in nfs_release_page() > > NFS: Fix race in nfs_release_page() > > invalidate_inode_pages2() may find the dirty bit has been set on a page > owing to the fact that the page may still be mapped after it was locked. > Only after the call to unmap_mapping_range() are we sure that the page > can no longer be dirtied. > In order to fix this, NFS has hooked the releasepage() method and tries > to write the page out between the call to unmap_mapping_range() and the > call to remove_mapping(). This, however leads to deadlocks in the page > reclaim code, where the page may be locked without holding a reference > to the inode or dentry. > > Fix is to add a new address_space_operation, launder_page(), which will > attempt to write out a dirty page without releasing the page lock. > > Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> > > I don't understand why this couldn't've been solved by page_mkwrite. > NFS did later add nfs_vm_page_mkwrite in July 2007, and maybe it's just > not needed any more? I haven't looked into it enough to make sure, > but my belief is that we should be able to get rid of it. Okay, it's slightly more complex than I thought - and I'm not sure all callers are actually using it correctly. There are some additional interesting cases I've found, beyond the pre-/post-DIO case: (1) NFS relies on it to retry the write before stripping the pages in the case where a writeback error occurs. I think this can probably be dealt with by sticking a filemap_fdatawrite() call before the invalidation. I'm not sure if this would incur the deadlock with the page reclaim code of which Trond speaks. (2) invalidate_inode_pages2() is used in some places to effect invalidation of the pagecache in the case where the server tells us that a third party modified the server copy of a file. What the right behaviour should be here, I'm not sure, but at the moment, any dirty data will get laundered back to the server. Possibly it should be simply invalidated locally or the user asked how they want to handle the divergence. Some filesystems use invalidate_remote_inode() instead which seems to leave the dirty pages in place locally. If it is desirous to save the dirty data, then filemap_fdatawrite() could be deployed before invalidating the pages. (3) Fuse uses invalidate_inode_pages2() in fuse_do_setattr() to strip all the pages from an inode that has had its size changed, laundering any page that's still dirty. This would seem to be excessive, but maybe Miklos had a reason for doing it that way. There are some places that should perhaps be using kiocb_invalidate_pages() and kiocb_invalidate_post_direct_write() instead - assuming Christoph has no objection to the latter function being exported. David
On Thu, Mar 07, 2024 at 10:36:28AM +0000, David Howells wrote: > There are some places that should perhaps be using kiocb_invalidate_pages() > and kiocb_invalidate_post_direct_write() instead - assuming Christoph has no > objection to the latter function being exported. These are intended as helpers for direct I/O implementation, so I very much expected them to eventually get exported.
On Thu, 7 Mar 2024 at 11:36, David Howells <dhowells@redhat.com> wrote: > (2) invalidate_inode_pages2() is used in some places to effect invalidation > of the pagecache in the case where the server tells us that a third party > modified the server copy of a file. What the right behaviour should be > here, I'm not sure, but at the moment, any dirty data will get laundered > back to the server. Possibly it should be simply invalidated locally or > the user asked how they want to handle the divergence. Skipping ->launder_page will mean there's a window where the data *will* be lost, AFAICS. Of course concurrent cached writes on different hosts against the same region (the size of which depends on how the caching is done) will conflict. But if concurrent writes are to different regions, then they shouldn't be lost, no? Without the current ->launder_page thing I don't see how that could be guaranteed. Thanks, Miklos
Miklos Szeredi <miklos@szeredi.hu> wrote: > > (2) invalidate_inode_pages2() is used in some places to effect > > invalidation of the pagecache in the case where the server tells us > > that a third party modified the server copy of a file. What the > > right behaviour should be here, I'm not sure, but at the moment, any > > dirty data will get laundered back to the server. Possibly it should > > be simply invalidated locally or the user asked how they want to > > handle the divergence. > > Skipping ->launder_page will mean there's a window where the data > *will* be lost, AFAICS. > > Of course concurrent cached writes on different hosts against the same > region (the size of which depends on how the caching is done) will > conflict. Indeed. Depending on when you're using invalidate_inode_pages2() and co. and what circumstances you're using it for, you *are* going to suffer data loss. For instance, if you have dirty data on the local host and get an invalidation notification from the server: if you write just your dirty data back, you may corrupt the file on the server, losing the third party changes; if you write back your entire copy of the file, you might avoid corrupting the file, but completely obliterate the third party changes; if you discard your changes, you lose those instead, but save the third party changes. I'm working towards supporting disconnected operation where I'll need to add some sort of user interaction mechanism that will allow the user to say how they want to handle this. > But if concurrent writes are to different regions, then they shouldn't > be lost, no? Without the current ->launder_page thing I don't see how > that could be guaranteed. Define "different regions". If they're not on the same folios, then why would they be lost by simply flushing the data before doing the invalidation? If they are on different parts of the same folio, all the above still apply when you flush the whole folio. Now, you can mitigate the latter case by keeping track of which bytes changed, but that still allows you to corrupt the file by writing back just your particular changes. And then there's the joker in the deck: mmap. The main advantage of invalidate_inode_pages2() is that it forcibly unmaps the page before laundering it. However, this doesn't prevent you then corrupting the upstream copy by writing the changes back. What particular usage case of invalidate_inode_pages2() are you thinking of? DIO read/write can only be best effort: flush, invalidate then do the DIO which may bring the buffers back in because they're mmapped. In which case doing a flush and a non-laundering invalidate that leaves dirty pages in place ought to be fine. David
On Tue, 19 Mar 2024 at 15:15, David Howells <dhowells@redhat.com> wrote:
> What particular usage case of invalidate_inode_pages2() are you thinking of?
FUSE_NOTIFY_INVAL_INODE will trigger invalidate_inode_pages2_range()
to clean up the cache.
The server is free to discard writes resulting from this invalidation
and delay reads in the region until the invalidation finishes. This
would no longer work with your change, since the mapping could
silently be reinstated between the writeback and the removal from the
cache due to the page being unlocked/relocked.
I'm not saying such a filesystem actually exists, but it's a
theoretical possibility. And maybe there are cases which I haven't
thought of.
Thanks,
Miklos
On Tue, 19 Mar 2024 at 17:13, Miklos Szeredi <miklos@szeredi.hu> wrote: > > On Tue, 19 Mar 2024 at 15:15, David Howells <dhowells@redhat.com> wrote: > > > What particular usage case of invalidate_inode_pages2() are you thinking of? > > FUSE_NOTIFY_INVAL_INODE will trigger invalidate_inode_pages2_range() > to clean up the cache. > > The server is free to discard writes resulting from this invalidation > and delay reads in the region until the invalidation finishes. This > would no longer work with your change, since the mapping could > silently be reinstated between the writeback and the removal from the > cache due to the page being unlocked/relocked. This would also matter if a distributed filesystem wanted to implement coherence even if there are mmaps. I.e. a client could get exclusive access to a region by issuing FUSE_NOTIFY_INVAL_INODE on all other clients and blocking reads. With your change this would fail. Again, this is purely theoretical, and without a way to differentiate between the read-only and write cases it has limited usefulness. Adding leases to fuse (which I plan to do) would make this much more useful. Thanks, Miklos
On 3/19/24 17:40, Miklos Szeredi wrote: > On Tue, 19 Mar 2024 at 17:13, Miklos Szeredi <miklos@szeredi.hu> wrote: >> >> On Tue, 19 Mar 2024 at 15:15, David Howells <dhowells@redhat.com> wrote: >> >>> What particular usage case of invalidate_inode_pages2() are you thinking of? >> >> FUSE_NOTIFY_INVAL_INODE will trigger invalidate_inode_pages2_range() >> to clean up the cache. >> >> The server is free to discard writes resulting from this invalidation >> and delay reads in the region until the invalidation finishes. This >> would no longer work with your change, since the mapping could >> silently be reinstated between the writeback and the removal from the >> cache due to the page being unlocked/relocked. > > This would also matter if a distributed filesystem wanted to implement > coherence even if there are mmaps. I.e. a client could get exclusive > access to a region by issuing FUSE_NOTIFY_INVAL_INODE on all other > clients and blocking reads. With your change this would fail. > > Again, this is purely theoretical, and without a way to differentiate > between the read-only and write cases it has limited usefulness. > Adding leases to fuse (which I plan to do) would make this much more > useful. Thanks Miklos! Fyi, we are actually planning to extend fuse notifications from inode to page ranges. Thanks, Bernd
diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst index d5bf4b6b7509..139554d1ab51 100644 --- a/Documentation/filesystems/locking.rst +++ b/Documentation/filesystems/locking.rst @@ -262,7 +262,6 @@ prototypes:: int (*direct_IO)(struct kiocb *, struct iov_iter *iter); int (*migrate_folio)(struct address_space *, struct folio *dst, struct folio *src, enum migrate_mode); - int (*launder_folio)(struct folio *); bool (*is_partially_uptodate)(struct folio *, size_t from, size_t count); int (*error_remove_folio)(struct address_space *, struct folio *); int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span) @@ -288,7 +287,6 @@ release_folio: yes free_folio: yes direct_IO: migrate_folio: yes (both) -launder_folio: yes is_partially_uptodate: yes error_remove_folio: yes swap_activate: no @@ -394,12 +392,6 @@ try_to_free_buffers(). ->free_folio() is called when the kernel has dropped the folio from the page cache. -->launder_folio() may be called prior to releasing a folio if -it is still found to be dirty. It returns zero if the folio was successfully -cleaned, or an error value if not. Note that in order to prevent the folio -getting mapped back in and redirtied, it needs to be kept locked -across the entire operation. - ->swap_activate() will be called to prepare the given file for swap. It should perform any validation and preparation necessary to ensure that writes can be performed with minimal memory allocation. It should call diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst index eebcc0f9e2bc..b2af9ee6515a 100644 --- a/Documentation/filesystems/vfs.rst +++ b/Documentation/filesystems/vfs.rst @@ -818,7 +818,6 @@ cache in your filesystem. The following members are defined: ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter); int (*migrate_folio)(struct mapping *, struct folio *dst, struct folio *src, enum migrate_mode); - int (*launder_folio) (struct folio *); bool (*is_partially_uptodate) (struct folio *, size_t from, size_t count); @@ -1012,11 +1011,6 @@ cache in your filesystem. The following members are defined: folio to this function. migrate_folio should transfer any private data across and update any references that it has to the folio. -``launder_folio`` - Called before freeing a folio - it writes back the dirty folio. - To prevent redirtying the folio, it is kept locked during the - whole operation. - ``is_partially_uptodate`` Called by the VM when reading a file through the pagecache when the underlying blocksize is smaller than the size of the folio. diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c index 047855033d32..967c81b7aa73 100644 --- a/fs/9p/vfs_addr.c +++ b/fs/9p/vfs_addr.c @@ -141,7 +141,6 @@ const struct address_space_operations v9fs_addr_operations = { .dirty_folio = netfs_dirty_folio, .release_folio = netfs_release_folio, .invalidate_folio = netfs_invalidate_folio, - .launder_folio = netfs_launder_folio, .direct_IO = noop_direct_IO, .writepages = netfs_writepages, }; diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c index 32572982f72e..e085e520cb12 100644 --- a/fs/9p/vfs_inode.c +++ b/fs/9p/vfs_inode.c @@ -263,6 +263,7 @@ int v9fs_init_inode(struct v9fs_session_info *v9ses, simple_inode_init_ts(inode); inode->i_mapping->a_ops = &v9fs_addr_operations; inode->i_private = NULL; + mapping_set_launder_folios(inode->i_mapping); switch (mode & S_IFMT) { case S_IFIFO: diff --git a/fs/afs/file.c b/fs/afs/file.c index ef2cc8f565d2..dfd8f60f5e1f 100644 --- a/fs/afs/file.c +++ b/fs/afs/file.c @@ -54,7 +54,6 @@ const struct address_space_operations afs_file_aops = { .read_folio = netfs_read_folio, .readahead = netfs_readahead, .dirty_folio = netfs_dirty_folio, - .launder_folio = netfs_launder_folio, .release_folio = netfs_release_folio, .invalidate_folio = netfs_invalidate_folio, .migrate_folio = filemap_migrate_folio, diff --git a/fs/afs/inode.c b/fs/afs/inode.c index 94fc049aff58..4041d9a4dae8 100644 --- a/fs/afs/inode.c +++ b/fs/afs/inode.c @@ -104,6 +104,7 @@ static int afs_inode_init_from_status(struct afs_operation *op, inode->i_fop = &afs_file_operations; inode->i_mapping->a_ops = &afs_file_aops; mapping_set_large_folios(inode->i_mapping); + mapping_set_launder_folios(inode->i_mapping); break; case AFS_FTYPE_DIR: inode->i_mode = S_IFDIR | (status->mode & S_IALLUGO); diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index d19cbf34c634..d87ee76bc578 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -1978,7 +1978,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr, /* * Only call invalidate_inode_pages2() after removing - * FUSE_NOWRITE, otherwise fuse_launder_folio() would deadlock. + * FUSE_NOWRITE, otherwise writeback would deadlock. */ if ((is_truncate || !is_wb) && S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) { diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 148a71b8b4d0..4933fe0b3269 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -2393,21 +2393,6 @@ static int fuse_write_end(struct file *file, struct address_space *mapping, return copied; } -static int fuse_launder_folio(struct folio *folio) -{ - int err = 0; - if (folio_clear_dirty_for_io(folio)) { - struct inode *inode = folio->mapping->host; - - /* Serialize with pending writeback for the same page */ - fuse_wait_on_page_writeback(inode, folio->index); - err = fuse_writepage_locked(&folio->page); - if (!err) - fuse_wait_on_page_writeback(inode, folio->index); - } - return err; -} - /* * Write back dirty data/metadata now (there may not be any suitable * open files later for data) @@ -3227,7 +3212,6 @@ static const struct address_space_operations fuse_file_aops = { .readahead = fuse_readahead, .writepage = fuse_writepage, .writepages = fuse_writepages, - .launder_folio = fuse_launder_folio, .dirty_folio = filemap_dirty_folio, .bmap = fuse_bmap, .direct_IO = fuse_direct_IO, @@ -3241,6 +3225,7 @@ void fuse_init_file_inode(struct inode *inode, unsigned int flags) inode->i_fop = &fuse_file_operations; inode->i_data.a_ops = &fuse_file_aops; + mapping_set_launder_folios(inode->i_mapping); INIT_LIST_HEAD(&fi->write_files); INIT_LIST_HEAD(&fi->queued_writes); diff --git a/fs/netfs/buffered_write.c b/fs/netfs/buffered_write.c index 9a0d32e4b422..56aeb7e70bc8 100644 --- a/fs/netfs/buffered_write.c +++ b/fs/netfs/buffered_write.c @@ -908,7 +908,7 @@ static ssize_t netfs_write_back_from_locked_folio(struct address_space *mapping, _enter(",%lx,%llx-%llx,%u", folio->index, start, end, caching); wreq = netfs_alloc_request(mapping, NULL, start, folio_size(folio), - NETFS_WRITEBACK); + wbc->for_launder ? NETFS_LAUNDER_WRITE : NETFS_WRITEBACK); if (IS_ERR(wreq)) { folio_unlock(folio); return PTR_ERR(wreq); @@ -1181,77 +1181,3 @@ int netfs_writepages(struct address_space *mapping, return ret; } EXPORT_SYMBOL(netfs_writepages); - -/* - * Deal with the disposition of a laundered folio. - */ -static void netfs_cleanup_launder_folio(struct netfs_io_request *wreq) -{ - if (wreq->error) { - pr_notice("R=%08x Laundering error %d\n", wreq->debug_id, wreq->error); - mapping_set_error(wreq->mapping, wreq->error); - } -} - -/** - * netfs_launder_folio - Clean up a dirty folio that's being invalidated - * @folio: The folio to clean - * - * This is called to write back a folio that's being invalidated when an inode - * is getting torn down. Ideally, writepages would be used instead. - */ -int netfs_launder_folio(struct folio *folio) -{ - struct netfs_io_request *wreq; - struct address_space *mapping = folio->mapping; - struct netfs_folio *finfo = netfs_folio_info(folio); - struct netfs_group *group = netfs_folio_group(folio); - struct bio_vec bvec; - unsigned long long i_size = i_size_read(mapping->host); - unsigned long long start = folio_pos(folio); - size_t offset = 0, len; - int ret = 0; - - if (finfo) { - offset = finfo->dirty_offset; - start += offset; - len = finfo->dirty_len; - } else { - len = folio_size(folio); - } - len = min_t(unsigned long long, len, i_size - start); - - wreq = netfs_alloc_request(mapping, NULL, start, len, NETFS_LAUNDER_WRITE); - if (IS_ERR(wreq)) { - ret = PTR_ERR(wreq); - goto out; - } - - if (!folio_clear_dirty_for_io(folio)) - goto out_put; - - trace_netfs_folio(folio, netfs_folio_trace_launder); - - _debug("launder %llx-%llx", start, start + len - 1); - - /* Speculatively write to the cache. We have to fix this up later if - * the store fails. - */ - wreq->cleanup = netfs_cleanup_launder_folio; - - bvec_set_folio(&bvec, folio, len, offset); - iov_iter_bvec(&wreq->iter, ITER_SOURCE, &bvec, 1, len); - __set_bit(NETFS_RREQ_UPLOAD_TO_SERVER, &wreq->flags); - ret = netfs_begin_write(wreq, true, netfs_write_trace_launder); - -out_put: - folio_detach_private(folio); - netfs_put_group(group); - kfree(finfo); - netfs_put_request(wreq, false, netfs_rreq_trace_put_return); -out: - folio_wait_fscache(folio); - _leave(" = %d", ret); - return ret; -} -EXPORT_SYMBOL(netfs_launder_folio); diff --git a/fs/nfs/file.c b/fs/nfs/file.c index 8577ccf621f5..6efe0af3ba80 100644 --- a/fs/nfs/file.c +++ b/fs/nfs/file.c @@ -484,28 +484,6 @@ static void nfs_check_dirty_writeback(struct folio *folio, *dirty = true; } -/* - * Attempt to clear the private state associated with a page when an error - * occurs that requires the cached contents of an inode to be written back or - * destroyed - * - Called if either PG_private or fscache is set on the page - * - Caller holds page lock - * - Return 0 if successful, -error otherwise - */ -static int nfs_launder_folio(struct folio *folio) -{ - struct inode *inode = folio->mapping->host; - int ret; - - dfprintk(PAGECACHE, "NFS: launder_folio(%ld, %llu)\n", - inode->i_ino, folio_pos(folio)); - - folio_wait_fscache(folio); - ret = nfs_wb_folio(inode, folio); - trace_nfs_launder_folio_done(inode, folio, ret); - return ret; -} - static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file, sector_t *span) { @@ -564,7 +542,6 @@ const struct address_space_operations nfs_file_aops = { .invalidate_folio = nfs_invalidate_folio, .release_folio = nfs_release_folio, .migrate_folio = nfs_migrate_folio, - .launder_folio = nfs_launder_folio, .is_dirty_writeback = nfs_check_dirty_writeback, .error_remove_folio = generic_error_remove_folio, .swap_activate = nfs_swap_activate, diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c index ebb8d60e1152..f2f8cf00a442 100644 --- a/fs/nfs/inode.c +++ b/fs/nfs/inode.c @@ -491,6 +491,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr) inode->i_fop = NFS_SB(sb)->nfs_client->rpc_ops->file_ops; inode->i_data.a_ops = &nfs_file_aops; nfs_inode_init_regular(nfsi); + mapping_set_launder_folios(inode->i_mapping); } else if (S_ISDIR(inode->i_mode)) { inode->i_op = NFS_SB(sb)->nfs_client->rpc_ops->dir_inode_ops; inode->i_fop = &nfs_dir_operations; diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h index afedb449b54f..f0e8c0fb9447 100644 --- a/fs/nfs/nfstrace.h +++ b/fs/nfs/nfstrace.h @@ -1039,7 +1039,6 @@ DEFINE_NFS_FOLIO_EVENT(nfs_writeback_folio); DEFINE_NFS_FOLIO_EVENT_DONE(nfs_writeback_folio_done); DEFINE_NFS_FOLIO_EVENT(nfs_invalidate_folio); -DEFINE_NFS_FOLIO_EVENT_DONE(nfs_launder_folio_done); TRACE_EVENT(nfs_aop_readahead, TP_PROTO( diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c index 085912268442..a9ea40261832 100644 --- a/fs/orangefs/inode.c +++ b/fs/orangefs/inode.c @@ -626,7 +626,6 @@ static const struct address_space_operations orangefs_address_operations = { .invalidate_folio = orangefs_invalidate_folio, .release_folio = orangefs_release_folio, .free_folio = orangefs_free_folio, - .launder_folio = orangefs_launder_folio, .direct_IO = orangefs_direct_IO, }; @@ -980,6 +979,7 @@ static const struct inode_operations orangefs_file_inode_operations = { static int orangefs_init_iops(struct inode *inode) { inode->i_mapping->a_ops = &orangefs_address_operations; + mapping_set_launder_folios(inode->i_mapping); switch (inode->i_mode & S_IFMT) { case S_IFREG: diff --git a/fs/smb/client/file.c b/fs/smb/client/file.c index f391c9b803d8..4fd871ba00f9 100644 --- a/fs/smb/client/file.c +++ b/fs/smb/client/file.c @@ -4858,27 +4858,6 @@ static void cifs_invalidate_folio(struct folio *folio, size_t offset, folio_wait_fscache(folio); } -static int cifs_launder_folio(struct folio *folio) -{ - int rc = 0; - loff_t range_start = folio_pos(folio); - loff_t range_end = range_start + folio_size(folio); - struct writeback_control wbc = { - .sync_mode = WB_SYNC_ALL, - .nr_to_write = 0, - .range_start = range_start, - .range_end = range_end, - }; - - cifs_dbg(FYI, "Launder page: %lu\n", folio->index); - - if (folio_clear_dirty_for_io(folio)) - rc = cifs_writepage_locked(&folio->page, &wbc); - - folio_wait_fscache(folio); - return rc; -} - void cifs_oplock_break(struct work_struct *work) { struct cifsFileInfo *cfile = container_of(work, struct cifsFileInfo, @@ -5057,7 +5036,6 @@ const struct address_space_operations cifs_addr_ops = { .release_folio = cifs_release_folio, .direct_IO = cifs_direct_io, .invalidate_folio = cifs_invalidate_folio, - .launder_folio = cifs_launder_folio, .migrate_folio = filemap_migrate_folio, /* * TODO: investigate and if useful we could add an is_dirty_writeback @@ -5080,6 +5058,5 @@ const struct address_space_operations cifs_addr_ops_smallbuf = { .dirty_folio = netfs_dirty_folio, .release_folio = cifs_release_folio, .invalidate_folio = cifs_invalidate_folio, - .launder_folio = cifs_launder_folio, .migrate_folio = filemap_migrate_folio, }; diff --git a/fs/smb/client/inode.c b/fs/smb/client/inode.c index d02f8ba29cb5..24c1c2610e04 100644 --- a/fs/smb/client/inode.c +++ b/fs/smb/client/inode.c @@ -56,6 +56,7 @@ static void cifs_set_ops(struct inode *inode) inode->i_data.a_ops = &cifs_addr_ops_smallbuf; else inode->i_data.a_ops = &cifs_addr_ops; + mapping_set_launder_folios(inode->i_mapping); break; case S_IFDIR: if (IS_AUTOMOUNT(inode)) { diff --git a/include/linux/fs.h b/include/linux/fs.h index 1fbc72c5f112..ded54555ab30 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -432,7 +432,6 @@ struct address_space_operations { */ int (*migrate_folio)(struct address_space *, struct folio *dst, struct folio *src, enum migrate_mode); - int (*launder_folio)(struct folio *); bool (*is_partially_uptodate) (struct folio *, size_t from, size_t count); void (*is_dirty_writeback) (struct folio *, bool *dirty, bool *wb); diff --git a/include/linux/netfs.h b/include/linux/netfs.h index 100cbb261269..e566d31afb04 100644 --- a/include/linux/netfs.h +++ b/include/linux/netfs.h @@ -410,7 +410,6 @@ int netfs_unpin_writeback(struct inode *inode, struct writeback_control *wbc); void netfs_clear_inode_writeback(struct inode *inode, const void *aux); void netfs_invalidate_folio(struct folio *folio, size_t offset, size_t length); bool netfs_release_folio(struct folio *folio, gfp_t gfp); -int netfs_launder_folio(struct folio *folio); /* VMA operations API. */ vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_group); diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 2df35e65557d..27ea8b9139a3 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -207,6 +207,7 @@ enum mapping_flags { AS_STABLE_WRITES, /* must wait for writeback before modifying folio contents */ AS_UNMOVABLE, /* The mapping cannot be moved, ever */ + AS_LAUNDER_FOLIOS, /* Use laundering in invalidate_inode_pages2*() */ }; /** @@ -323,6 +324,11 @@ static inline bool mapping_unmovable(struct address_space *mapping) return test_bit(AS_UNMOVABLE, &mapping->flags); } +static inline void mapping_set_launder_folios(struct address_space *mapping) +{ + set_bit(AS_LAUNDER_FOLIOS, &mapping->flags); +} + static inline gfp_t mapping_gfp_mask(struct address_space * mapping) { return mapping->gfp_mask; diff --git a/include/linux/writeback.h b/include/linux/writeback.h index 9845cb62e40b..7a1b79b344fa 100644 --- a/include/linux/writeback.h +++ b/include/linux/writeback.h @@ -62,6 +62,7 @@ struct writeback_control { unsigned for_reclaim:1; /* Invoked from the page allocator */ unsigned range_cyclic:1; /* range_start is cyclic */ unsigned for_sync:1; /* sync(2) WB_SYNC_ALL writeback */ + unsigned for_launder:1; /* Called from invalidate_inode_pages2_range() */ unsigned unpinned_netfs_wb:1; /* Cleared I_PINNING_NETFS_WB */ /* diff --git a/mm/truncate.c b/mm/truncate.c index 725b150e47ac..cb92759c4d8d 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -586,15 +586,6 @@ static int invalidate_complete_folio2(struct address_space *mapping, return 0; } -static int folio_launder(struct address_space *mapping, struct folio *folio) -{ - if (!folio_test_dirty(folio)) - return 0; - if (folio->mapping != mapping || mapping->a_ops->launder_folio == NULL) - return 0; - return mapping->a_ops->launder_folio(folio); -} - /** * invalidate_inode_pages2_range - remove range of pages from an address_space * @mapping: the address_space @@ -657,13 +648,29 @@ int invalidate_inode_pages2_range(struct address_space *mapping, unmap_mapping_folio(folio); BUG_ON(folio_mapped(folio)); - ret2 = folio_launder(mapping, folio); - if (ret2 == 0) { - if (!invalidate_complete_folio2(mapping, folio)) + if (folio_test_dirty(folio) && + test_bit(AS_LAUNDER_FOLIOS, &mapping->flags)) { + struct writeback_control wbc = { + .sync_mode = WB_SYNC_ALL, + .nr_to_write = LLONG_MAX, + .range_start = folio_pos(folio), + .range_end = (end + 1ULL) * PAGE_SIZE - 1, + .for_launder = true, + }; + + folio_unlock(folio); + ret2 = filemap_fdatawrite_wbc(mapping, &wbc); + folio_lock(folio); + if (ret2 < 0) + ret = ret2; + else if (!invalidate_complete_folio2(mapping, folio) && + ret == 0) + ret2 = -EBUSY; + } else { + if (!invalidate_complete_folio2(mapping, folio) && + ret == 0) ret2 = -EBUSY; } - if (ret2 < 0) - ret = ret2; folio_unlock(folio); } folio_batch_remove_exceptionals(&fbatch);
Here's a patch to have a go at getting rid of ->launder_folio(). Since it's failable and cannot guarantee that pages in the range are removed, I've tried to replace laundering with just flush-and-wait, dropping the folio lock around the I/O. Are there any tests for launder_folio as triggering it is tricky? David --- mm: Replace ->launder_folio() with flush and wait invalidate_inode_pages2_range() and its wrappers are frequently used to invalidate overlapping folios prior to and after doing direct I/O. This calls ->launder_folio() to flush dirty folios out to the backing store, keeping the folio lock across the I/O - presumably to prevent the folio from being redirtied and thereby prevent it from being removed. However... If we're doing this prior to doing DIO on a file, there may be nothing preventing an mmapped write from recreating and redirtying the folio the moment it is removed from the mapping lest the kernel deadlock on doing DIO to/from a buffer mmapped from the target file. Further, invalidate_inode_pages2_range() is permitted to fail - and half the callers don't even check to see if it *did* fail, probably not unreasonably. In which case, there's no need to keep the folio lock and no need for a special op. Instead, drop the lock and flush the range covering the page to the end of the range to be invalidated (with a flag set in the wbc to say what we're doing) and then wait for it to complete. That said, there are other places that use invalidate_inode_pages2_range() and most of those don't implement launder_page, so add a flag to only do this in those filesystems that implemented it: 9p, afs, cifs, fuse, nfs and orangefs. Signed-off-by: David Howells <dhowells@redhat.com> cc: Matthew Wilcox <willy@infradead.org> cc: Christoph Hellwig <hch@lst.de> cc: Andrew Morton <akpm@linux-foundation.org> cc: Alexander Viro <viro@zeniv.linux.org.uk> cc: Christian Brauner <brauner@kernel.org> cc: Jeff Layton <jlayton@kernel.org> cc: linux-mm@kvack.org cc: linux-fsdevel@vger.kernel.org cc: netfs@lists.linux.dev cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: ceph-devel@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: linux-nfs@vger.kernel.org cc: devel@lists.orangefs.org --- Documentation/filesystems/locking.rst | 8 --- Documentation/filesystems/vfs.rst | 6 -- fs/9p/vfs_addr.c | 1 fs/9p/vfs_inode.c | 1 fs/afs/file.c | 1 fs/afs/inode.c | 1 fs/fuse/dir.c | 2 fs/fuse/file.c | 17 ------- fs/netfs/buffered_write.c | 76 ---------------------------------- fs/nfs/file.c | 23 ---------- fs/nfs/inode.c | 1 fs/nfs/nfstrace.h | 1 fs/orangefs/inode.c | 2 fs/smb/client/file.c | 23 ---------- fs/smb/client/inode.c | 1 include/linux/fs.h | 1 include/linux/netfs.h | 1 include/linux/pagemap.h | 6 ++ include/linux/writeback.h | 1 mm/truncate.c | 35 +++++++++------ 20 files changed, 36 insertions(+), 172 deletions(-)