mbox series

[v7,00/23] Change readahead API

Message ID 20200219210103.32400-1-willy@infradead.org (mailing list archive)
Headers show
Series Change readahead API | expand

Message

Matthew Wilcox Feb. 19, 2020, 9 p.m. UTC
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

This series adds a readahead address_space operation to eventually
replace the readpages operation.  The key difference is that
pages are added to the page cache as they are allocated (and
then looked up by the filesystem) instead of passing them on a
list to the readpages operation and having the filesystem add
them to the page cache.  It's a net reduction in code for each
implementation, more efficient than walking a list, and solves
the direct-write vs buffered-read problem reported by yu kuai at
https://lore.kernel.org/linux-fsdevel/20200116063601.39201-1-yukuai3@huawei.com/

The only unconverted filesystems are those which use fscache.
Their conversion is pending Dave Howells' rewrite which will make the
conversion substantially easier.

I want to thank the reviewers; Dave Chinner, John Hubbard and Christoph
Hellwig have done a marvellous job of providing constructive criticism.
Eric Biggers pointed out how I'd broken ext4 (which led to a substantial
change).  I've tried to take it all on board, but I may have missed
something simply because you've done such a thorough job.

This series can also be found at
http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/tags/readahead_v7
(I also pushed the readahead_v6 tag there in case anyone wants to diff, and
they're both based on 5.6-rc2 so they're easy to diff)

v7:
 - Now passes an xfstests run on ext4!
 - Documentation improvements
 - Move the readahead prototypes out of mm.h (new patch)
 - readahead_for_each* iterators are gone; replaced with readahead_page()
   and readahead_page_batch()
 - page_cache_readahead_limit() renamed to page_cache_readahead_unbounded()
   and arguments changed
 - iomap_readahead_actor() restructured differently
 - The readahead code no longer uses the word 'offset' to reduce ambiguity
 - read_pages() now maintains the rac so we can just call it and continue
   instead of mucking around with branches
 - More assertions
 - More readahead functions return void

v6:
 - Name the private members of readahead_control with a leading underscore
   (suggested by Christoph Hellwig)
 - Fix whitespace in rst file
 - Remove misleading comment in btrfs patch
 - Add readahead_next() API and use it in iomap
 - Add iomap_readahead kerneldoc.
 - Fix the mpage_readahead kerneldoc
 - Make various readahead functions return void
 - Keep readahead_index() and readahead_offset() pointing to the start of
   this batch through the body.  No current user requires this, but it's
   less surprising.
 - Add kerneldoc for page_cache_readahead_limit
 - Make page_idx an unsigned long, and rename it to just 'i'
 - Get rid of page_offset local variable
 - Add patch to call memalloc_nofs_save() before allocating pages (suggested
   by Michal Hocko)
 - Resplit a lot of patches for more logical progression and easier review
   (suggested by John Hubbard)
 - Added sign-offs where received, and I deemed still relevant

v5 switched to passing a readahead_control struct (mirroring the
writepages_control struct passed to writepages).  This has a number of
advantages:
 - It fixes a number of bugs in various implementations, eg forgetting to
   increment 'start', an off-by-one error in 'nr_pages' or treating 'start'
   as a byte offset instead of a page offset.
 - It allows us to change the arguments without changing all the
   implementations of ->readahead which just call mpage_readahead() or
   iomap_readahead()
 - Figuring out which pages haven't been attempted by the implementation
   is more natural this way.
 - There's less code in each implementation.

Matthew Wilcox (Oracle) (24):
  mm: Move readahead prototypes from mm.h
  mm: Return void from various readahead functions
  mm: Ignore return value of ->readpages
  mm: Move readahead nr_pages check into read_pages
  mm: Use readahead_control to pass arguments
  mm: Rename various 'offset' parameters to 'index'
  mm: rename readahead loop variable to 'i'
  mm: Remove 'page_offset' from readahead loop
  mm: Put readahead pages in cache earlier
  mm: Add readahead address space operation
  mm: Move end_index check out of readahead loop
  mm: Add page_cache_readahead_unbounded
  fs: Convert mpage_readpages to mpage_readahead
  btrfs: Convert from readpages to readahead
  erofs: Convert uncompressed files from readpages to readahead
  erofs: Convert compressed files from readpages to readahead
  ext4: Convert from readpages to readahead
  ext4: Pass the inode to ext4_mpage_readpages
  f2fs: Convert from readpages to readahead
  fuse: Convert from readpages to readahead
  iomap: Restructure iomap_readpages_actor
  iomap: Convert from readpages to readahead
  mm: Document why we don't set PageReadahead
  mm: Use memalloc_nofs_save in readahead path

 Documentation/filesystems/locking.rst |   6 +-
 Documentation/filesystems/vfs.rst     |  15 ++
 block/blk-core.c                      |   1 +
 drivers/staging/exfat/exfat_super.c   |   7 +-
 fs/block_dev.c                        |   7 +-
 fs/btrfs/extent_io.c                  |  46 ++---
 fs/btrfs/extent_io.h                  |   3 +-
 fs/btrfs/inode.c                      |  16 +-
 fs/erofs/data.c                       |  39 ++--
 fs/erofs/zdata.c                      |  29 +--
 fs/ext2/inode.c                       |  10 +-
 fs/ext4/ext4.h                        |   5 +-
 fs/ext4/inode.c                       |  21 +-
 fs/ext4/readpage.c                    |  25 +--
 fs/ext4/verity.c                      |  35 +---
 fs/f2fs/data.c                        |  50 ++---
 fs/f2fs/f2fs.h                        |   5 +-
 fs/f2fs/verity.c                      |  35 +---
 fs/fat/inode.c                        |   7 +-
 fs/fuse/file.c                        |  46 ++---
 fs/gfs2/aops.c                        |  23 +--
 fs/hpfs/file.c                        |   7 +-
 fs/iomap/buffered-io.c                | 124 +++++-------
 fs/iomap/trace.h                      |   2 +-
 fs/isofs/inode.c                      |   7 +-
 fs/jfs/inode.c                        |   7 +-
 fs/mpage.c                            |  38 +---
 fs/nilfs2/inode.c                     |  15 +-
 fs/ocfs2/aops.c                       |  34 ++--
 fs/omfs/file.c                        |   7 +-
 fs/qnx6/inode.c                       |   7 +-
 fs/reiserfs/inode.c                   |   8 +-
 fs/udf/inode.c                        |   7 +-
 fs/xfs/xfs_aops.c                     |  13 +-
 fs/zonefs/super.c                     |   7 +-
 include/linux/fs.h                    |   2 +
 include/linux/iomap.h                 |   3 +-
 include/linux/mm.h                    |  19 --
 include/linux/mpage.h                 |   4 +-
 include/linux/pagemap.h               | 103 ++++++++++
 include/trace/events/erofs.h          |   6 +-
 include/trace/events/f2fs.h           |   6 +-
 mm/fadvise.c                          |   6 +-
 mm/internal.h                         |  12 +-
 mm/migrate.c                          |   2 +-
 mm/readahead.c                        | 277 ++++++++++++++++----------
 46 files changed, 551 insertions(+), 603 deletions(-)


base-commit: 11a48a5a18c63fd7621bb050228cebf13566e4d8

Comments

David Sterba Feb. 20, 2020, 5:54 p.m. UTC | #1
On Wed, Feb 19, 2020 at 01:00:39PM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> This series adds a readahead address_space operation to eventually
> replace the readpages operation.  The key difference is that
> pages are added to the page cache as they are allocated (and
> then looked up by the filesystem) instead of passing them on a
> list to the readpages operation and having the filesystem add
> them to the page cache.  It's a net reduction in code for each
> implementation, more efficient than walking a list, and solves
> the direct-write vs buffered-read problem reported by yu kuai at
> https://lore.kernel.org/linux-fsdevel/20200116063601.39201-1-yukuai3@huawei.com/
> 
> The only unconverted filesystems are those which use fscache.
> Their conversion is pending Dave Howells' rewrite which will make the
> conversion substantially easier.
> 
> I want to thank the reviewers; Dave Chinner, John Hubbard and Christoph
> Hellwig have done a marvellous job of providing constructive criticism.
> Eric Biggers pointed out how I'd broken ext4 (which led to a substantial
> change).  I've tried to take it all on board, but I may have missed
> something simply because you've done such a thorough job.
> 
> This series can also be found at
> http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/tags/readahead_v7
> (I also pushed the readahead_v6 tag there in case anyone wants to diff, and
> they're both based on 5.6-rc2 so they're easy to diff)
> 
> v7:
>  - Now passes an xfstests run on ext4!

On btrfs it still chokes on the first test btrfs/001, with the following
warning, the test is stuck there.

[   21.100922] WARNING: suspicious RCU usage
[   21.103107] 5.6.0-rc2-default+ #996 Not tainted
[   21.105133] -----------------------------
[   21.106864] include/linux/xarray.h:1164 suspicious rcu_dereference_check() usage!
[   21.109948]
[   21.109948] other info that might help us debug this:
[   21.109948]
[   21.113373]
[   21.113373] rcu_scheduler_active = 2, debug_locks = 1
[   21.115801] 4 locks held by umount/793:
[   21.117135]  #0: ffff964a736890e8 (&type->s_umount_key#26){+.+.}, at: deactivate_super+0x2f/0x40
[   21.120188]  #1: ffff964a7347ba68 (&delayed_node->mutex){+.+.}, at: __btrfs_commit_inode_delayed_items+0x44c/0x4e0 [btrfs]
[   21.123042]  #2: ffff964a612fe5c8 (&space_info->groups_sem){++++}, at: find_free_extent+0x27d/0xf00 [btrfs]
[   21.126068]  #3: ffff964a60b93280 (&caching_ctl->mutex){+.+.}, at: btrfs_cache_block_group+0x1f0/0x500 [btrfs]
[   21.129655]
[   21.129655] stack backtrace:
[   21.131943] CPU: 1 PID: 793 Comm: umount Not tainted 5.6.0-rc2-default+ #996
[   21.134164] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
[   21.138076] Call Trace:
[   21.139441]  dump_stack+0x71/0xa0
[   21.140954]  xas_start+0x1a4/0x240
[   21.142473]  xas_load+0xa/0x50
[   21.143874]  xas_find+0x226/0x280
[   21.145298]  extent_readahead+0xcb/0x4f0 [btrfs]
[   21.146934]  ? mem_cgroup_commit_charge+0x56/0x400
[   21.148654]  ? rcu_read_lock_sched_held+0x5d/0x90
[   21.150382]  ? __add_to_page_cache_locked+0x327/0x380
[   21.152155]  read_pages+0x80/0x1f0
[   21.153531]  page_cache_readahead_unbounded+0x1b7/0x210
[   21.155196]  __load_free_space_cache+0x1c1/0x730 [btrfs]
[   21.157014]  load_free_space_cache+0xb9/0x190 [btrfs]
[   21.158222]  btrfs_cache_block_group+0x1f8/0x500 [btrfs]
[   21.159717]  ? finish_wait+0x90/0x90
[   21.160723]  find_free_extent+0xa17/0xf00 [btrfs]
[   21.161798]  ? kvm_sched_clock_read+0x14/0x30
[   21.163022]  ? sched_clock_cpu+0x10/0x120
[   21.164361]  btrfs_reserve_extent+0x9b/0x180 [btrfs]
[   21.165952]  btrfs_alloc_tree_block+0xc1/0x350 [btrfs]
[   21.167680]  ? __lock_acquire+0x272/0x1320
[   21.169353]  alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
[   21.171313]  __btrfs_cow_block+0x143/0x7a0 [btrfs]
[   21.173080]  btrfs_cow_block+0x15f/0x310 [btrfs]
[   21.174487]  btrfs_search_slot+0x93b/0xf70 [btrfs]
[   21.175940]  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
[   21.177419]  ? __btrfs_commit_inode_delayed_items+0x417/0x4e0 [btrfs]
[   21.179032]  ? __btrfs_commit_inode_delayed_items+0x44c/0x4e0 [btrfs]
[   21.180787]  __btrfs_update_delayed_inode+0x73/0x260 [btrfs]
[   21.182174]  __btrfs_commit_inode_delayed_items+0x46c/0x4e0 [btrfs]
[   21.183907]  ? btrfs_first_delayed_node+0x4c/0x90 [btrfs]
[   21.185204]  __btrfs_run_delayed_items+0x8e/0x140 [btrfs]
[   21.186521]  btrfs_commit_transaction+0x312/0xae0 [btrfs]
[   21.188142]  ? btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
[   21.189684]  sync_filesystem+0x6e/0x90
[   21.190878]  generic_shutdown_super+0x22/0x100
[   21.192693]  kill_anon_super+0x14/0x30
[   21.194389]  btrfs_kill_super+0x12/0x20 [btrfs]
[   21.196078]  deactivate_locked_super+0x2c/0x70
[   21.197732]  cleanup_mnt+0x100/0x160
[   21.199033]  task_work_run+0x90/0xc0
[   21.200331]  exit_to_usermode_loop+0x96/0xa0
[   21.201744]  do_syscall_64+0x1df/0x210
[   21.203187]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
Matthew Wilcox Feb. 20, 2020, 10:39 p.m. UTC | #2
On Thu, Feb 20, 2020 at 06:54:00PM +0100, David Sterba wrote:
> On Wed, Feb 19, 2020 at 01:00:39PM -0800, Matthew Wilcox wrote:
> > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > 
> > This series adds a readahead address_space operation to eventually
> > replace the readpages operation.  The key difference is that
> > pages are added to the page cache as they are allocated (and
> > then looked up by the filesystem) instead of passing them on a
> > list to the readpages operation and having the filesystem add
> > them to the page cache.  It's a net reduction in code for each
> > implementation, more efficient than walking a list, and solves
> > the direct-write vs buffered-read problem reported by yu kuai at
> > https://lore.kernel.org/linux-fsdevel/20200116063601.39201-1-yukuai3@huawei.com/
> > 
> > The only unconverted filesystems are those which use fscache.
> > Their conversion is pending Dave Howells' rewrite which will make the
> > conversion substantially easier.
> > 
> > I want to thank the reviewers; Dave Chinner, John Hubbard and Christoph
> > Hellwig have done a marvellous job of providing constructive criticism.
> > Eric Biggers pointed out how I'd broken ext4 (which led to a substantial
> > change).  I've tried to take it all on board, but I may have missed
> > something simply because you've done such a thorough job.
> > 
> > This series can also be found at
> > http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/tags/readahead_v7
> > (I also pushed the readahead_v6 tag there in case anyone wants to diff, and
> > they're both based on 5.6-rc2 so they're easy to diff)
> > 
> > v7:
> >  - Now passes an xfstests run on ext4!
> 
> On btrfs it still chokes on the first test btrfs/001, with the following
> warning, the test is stuck there.

Thanks.  The warning actually wasn't the problem, but it did need to
be addressed.  I got a test system up & running with btrfs, and it's
currently on generic/027 with the following patch:

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 9c782c15f7f7..d23a224d2ad2 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -676,7 +676,7 @@ static inline unsigned int __readahead_batch(struct readahead_control *rac,
 		struct page **array, unsigned int array_sz)
 {
 	unsigned int i = 0;
-	XA_STATE(xas, &rac->mapping->i_pages, rac->_index);
+	XA_STATE(xas, &rac->mapping->i_pages, 0);
 	struct page *page;
 
 	BUG_ON(rac->_batch_count > rac->_nr_pages);
@@ -684,6 +684,8 @@ static inline unsigned int __readahead_batch(struct readahead_control *rac,
 	rac->_index += rac->_batch_count;
 	rac->_batch_count = 0;
 
+	xas_set(&xas, rac->_index);
+	rcu_read_lock();
 	xas_for_each(&xas, page, rac->_index + rac->_nr_pages - 1) {
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
 		VM_BUG_ON_PAGE(PageTail(page), page);
@@ -702,6 +704,7 @@ static inline unsigned int __readahead_batch(struct readahead_control *rac,
 		if (i == array_sz)
 			break;
 	}
+	rcu_read_unlock();
 
 	return i;
 }
David Sterba Feb. 21, 2020, 11:59 a.m. UTC | #3
On Thu, Feb 20, 2020 at 02:39:09PM -0800, Matthew Wilcox wrote:
> > >  - Now passes an xfstests run on ext4!
> > 
> > On btrfs it still chokes on the first test btrfs/001, with the following
> > warning, the test is stuck there.
> 
> Thanks.  The warning actually wasn't the problem, but it did need to
> be addressed.  I got a test system up & running with btrfs, and it's
> currently on generic/027 with the following patch:

Thanks, with the fix applied the first 10 tests passed, I'll let the
testsuite finish and let you know if ther are more warnings and such.