mbox series

[v6,00/19] Change readahead API

Message ID 20200217184613.19668-1-willy@infradead.org (mailing list archive)
Headers show
Series Change readahead API | expand

Message

Matthew Wilcox (Oracle) Feb. 17, 2020, 6:45 p.m. UTC
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

This series adds a readahead address_space operation to eventually
replace the readpages operation.  The key difference is that
pages are added to the page cache as they are allocated (and
then looked up by the filesystem) instead of passing them on a
list to the readpages operation and having the filesystem add
them to the page cache.  It's a net reduction in code for each
implementation, more efficient than walking a list, and solves
the direct-write vs buffered-read problem reported by yu kuai at
https://lore.kernel.org/linux-fsdevel/20200116063601.39201-1-yukuai3@huawei.com/

The only unconverted filesystems are those which use fscache.
Their conversion is pending Dave Howells' rewrite which will make the
conversion substantially easier.

v6:
 - Name the private members of readahead_control with a leading underscore
   (suggested by Christoph Hellwig)
 - Fix whitespace in rst file
 - Remove misleading comment in btrfs patch
 - Add readahead_next() API and use it in iomap
 - Add iomap_readahead kerneldoc.
 - Fix the mpage_readahead kerneldoc
 - Make various readahead functions return void
 - Keep readahead_index() and readahead_offset() pointing to the start of
   this batch through the body.  No current user requires this, but it's
   less surprising.
 - Add kerneldoc for page_cache_readahead_limit
 - Make page_idx an unsigned long, and rename it to just 'i'
 - Get rid of page_offset local variable
 - Add patch to call memalloc_nofs_save() before allocating pages (suggested
   by Michal Hocko)
 - Resplit a lot of patches for more logical progression and easier review
   (suggested by John Hubbard)
 - Added sign-offs where received, and I deemed still relevant

v5 switched to passing a readahead_control struct (mirroring the
writepages_control struct passed to writepages).  This has a number of
advantages:
 - It fixes a number of bugs in various implementations, eg forgetting to
   increment 'start', an off-by-one error in 'nr_pages' or treating 'start'
   as a byte offset instead of a page offset.
 - It allows us to change the arguments without changing all the
   implementations of ->readahead which just call mpage_readahead() or
   iomap_readahead()
 - Figuring out which pages haven't been attempted by the implementation
   is more natural this way.
 - There's less code in each implementation.

Matthew Wilcox (Oracle) (19):
  mm: Return void from various readahead functions
  mm: Ignore return value of ->readpages
  mm: Use readahead_control to pass arguments
  mm: Rearrange readahead loop
  mm: Remove 'page_offset' from readahead loop
  mm: rename readahead loop variable to 'i'
  mm: Put readahead pages in cache earlier
  mm: Add readahead address space operation
  mm: Add page_cache_readahead_limit
  fs: Convert mpage_readpages to mpage_readahead
  btrfs: Convert from readpages to readahead
  erofs: Convert uncompressed files from readpages to readahead
  erofs: Convert compressed files from readpages to readahead
  ext4: Convert from readpages to readahead
  f2fs: Convert from readpages to readahead
  fuse: Convert from readpages to readahead
  iomap: Restructure iomap_readpages_actor
  iomap: Convert from readpages to readahead
  mm: Use memalloc_nofs_save in readahead path

 Documentation/filesystems/locking.rst |   6 +-
 Documentation/filesystems/vfs.rst     |  13 ++
 drivers/staging/exfat/exfat_super.c   |   7 +-
 fs/block_dev.c                        |   7 +-
 fs/btrfs/extent_io.c                  |  46 ++-----
 fs/btrfs/extent_io.h                  |   3 +-
 fs/btrfs/inode.c                      |  16 +--
 fs/erofs/data.c                       |  39 ++----
 fs/erofs/zdata.c                      |  29 ++--
 fs/ext2/inode.c                       |  10 +-
 fs/ext4/ext4.h                        |   3 +-
 fs/ext4/inode.c                       |  23 ++--
 fs/ext4/readpage.c                    |  22 ++-
 fs/ext4/verity.c                      |  35 +----
 fs/f2fs/data.c                        |  50 +++----
 fs/f2fs/f2fs.h                        |   5 +-
 fs/f2fs/verity.c                      |  35 +----
 fs/fat/inode.c                        |   7 +-
 fs/fuse/file.c                        |  46 +++----
 fs/gfs2/aops.c                        |  23 ++--
 fs/hpfs/file.c                        |   7 +-
 fs/iomap/buffered-io.c                | 118 +++++++----------
 fs/iomap/trace.h                      |   2 +-
 fs/isofs/inode.c                      |   7 +-
 fs/jfs/inode.c                        |   7 +-
 fs/mpage.c                            |  38 ++----
 fs/nilfs2/inode.c                     |  15 +--
 fs/ocfs2/aops.c                       |  34 ++---
 fs/omfs/file.c                        |   7 +-
 fs/qnx6/inode.c                       |   7 +-
 fs/reiserfs/inode.c                   |   8 +-
 fs/udf/inode.c                        |   7 +-
 fs/xfs/xfs_aops.c                     |  13 +-
 fs/zonefs/super.c                     |   7 +-
 include/linux/fs.h                    |   2 +
 include/linux/iomap.h                 |   3 +-
 include/linux/mpage.h                 |   4 +-
 include/linux/pagemap.h               |  90 +++++++++++++
 include/trace/events/erofs.h          |   6 +-
 include/trace/events/f2fs.h           |   6 +-
 mm/internal.h                         |   8 +-
 mm/migrate.c                          |   2 +-
 mm/readahead.c                        | 184 +++++++++++++++++---------
 43 files changed, 474 insertions(+), 533 deletions(-)


base-commit: 11a48a5a18c63fd7621bb050228cebf13566e4d8

Comments

Matthew Wilcox (Oracle) Feb. 17, 2020, 6:48 p.m. UTC | #1
On Mon, Feb 17, 2020 at 10:45:41AM -0800, Matthew Wilcox wrote:
> This series adds a readahead address_space operation to eventually

*sigh*.  Clearly I forgot to rm -rf an earlier version.  Please disregard
any patches labelled n/16.  I can send a v7 if this is too much hassle.
Dave Chinner Feb. 18, 2020, 4:56 a.m. UTC | #2
On Mon, Feb 17, 2020 at 10:45:41AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> This series adds a readahead address_space operation to eventually
> replace the readpages operation.  The key difference is that
> pages are added to the page cache as they are allocated (and
> then looked up by the filesystem) instead of passing them on a
> list to the readpages operation and having the filesystem add
> them to the page cache.  It's a net reduction in code for each
> implementation, more efficient than walking a list, and solves
> the direct-write vs buffered-read problem reported by yu kuai at
> https://lore.kernel.org/linux-fsdevel/20200116063601.39201-1-yukuai3@huawei.com/
> 
> The only unconverted filesystems are those which use fscache.
> Their conversion is pending Dave Howells' rewrite which will make the
> conversion substantially easier.

Latest version in your git tree:

$ ▶ glo -n 5 willy/readahead
4be497096c04 mm: Use memalloc_nofs_save in readahead path
ff63497fcb98 iomap: Convert from readpages to readahead
26aee60e89b5 iomap: Restructure iomap_readpages_actor
8115bcca7312 fuse: Convert from readpages to readahead
3db3d10d9ea1 f2fs: Convert from readpages to readahead
$

merged into a 5.6-rc2 tree fails at boot on my test vm:

[    2.423116] ------------[ cut here ]------------
[    2.424957] list_add double add: new=ffffea000efff4c8, prev=ffff8883bfffee60, next=ffffea000efff4c8.
[    2.428259] WARNING: CPU: 4 PID: 1 at lib/list_debug.c:29 __list_add_valid+0x67/0x70
[    2.430617] CPU: 4 PID: 1 Comm: sh Not tainted 5.6.0-rc2-dgc+ #1800
[    2.432405] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[    2.434744] RIP: 0010:__list_add_valid+0x67/0x70
[    2.436107] Code: c6 4c 89 ca 48 c7 c7 10 41 58 82 e8 55 29 89 ff 0f 0b 31 c0 c3 48 89 f2 4c 89 c1 48 89 fe 48 c7 c7 60 41 58 82 e8 3b 29 89 ff <0f> 0b 31 c7
[    2.441161] RSP: 0000:ffffc900018a3bb0 EFLAGS: 00010082
[    2.442548] RAX: 0000000000000000 RBX: ffffea000efff4c0 RCX: 0000000000000256
[    2.444432] RDX: 0000000000000001 RSI: 0000000000000086 RDI: ffffffff8288a8b0
[    2.446315] RBP: ffffea000efff4c8 R08: ffffc900018a3a65 R09: 0000000000000256
[    2.448199] R10: 0000000000000008 R11: ffffc900018a3a65 R12: ffffea000efff4c8
[    2.450072] R13: ffff8883bfffee60 R14: 0000000000000010 R15: 0000000000000001
[    2.451959] FS:  0000000000000000(0000) GS:ffff8883b9c00000(0000) knlGS:0000000000000000
[    2.454083] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.455604] CR2: 00000000ffffffff CR3: 00000003b9a37002 CR4: 0000000000060ee0
[    2.457484] Call Trace:
[    2.458171]  __pagevec_lru_add_fn+0x15f/0x2c0
[    2.459376]  pagevec_lru_move_fn+0x87/0xd0
[    2.460500]  ? pagevec_move_tail_fn+0x2d0/0x2d0
[    2.461712]  lru_add_drain_cpu+0x8d/0x160
[    2.462787]  lru_add_drain+0x18/0x20
[    2.463757]  shift_arg_pages+0xb8/0x180
[    2.464789]  ? vprintk_emit+0x101/0x1c0
[    2.465813]  ? printk+0x58/0x6f
[    2.466659]  setup_arg_pages+0x205/0x240
[    2.467716]  load_elf_binary+0x34a/0x1560
[    2.468789]  ? get_user_pages_remote+0x159/0x280
[    2.470024]  ? selinux_inode_permission+0x10d/0x1e0
[    2.471323]  ? _raw_read_unlock+0xa/0x20
[    2.472375]  ? load_misc_binary+0x2b2/0x410
[    2.473492]  search_binary_handler+0x60/0xe0
[    2.474634]  __do_execve_file.isra.0+0x512/0x850
[    2.475888]  ? rest_init+0xc6/0xc6
[    2.476801]  do_execve+0x21/0x30
[    2.477671]  try_to_run_init_process+0x10/0x34
[    2.478855]  kernel_init+0xe2/0xfa
[    2.479776]  ret_from_fork+0x1f/0x30
[    2.480737] ---[ end trace e77079de9b22dc6a ]---

I just dropped the ext4 conversion from my local tree so I can boot
the machine and test XFS. Might have some more info when that
crashes and burns...

Cheers,

Dave.
Matthew Wilcox (Oracle) Feb. 18, 2020, 1:42 p.m. UTC | #3
On Tue, Feb 18, 2020 at 03:56:33PM +1100, Dave Chinner wrote:
> Latest version in your git tree:
> 
> $ ▶ glo -n 5 willy/readahead
> 4be497096c04 mm: Use memalloc_nofs_save in readahead path
> ff63497fcb98 iomap: Convert from readpages to readahead
> 26aee60e89b5 iomap: Restructure iomap_readpages_actor
> 8115bcca7312 fuse: Convert from readpages to readahead
> 3db3d10d9ea1 f2fs: Convert from readpages to readahead
> $
> 
> merged into a 5.6-rc2 tree fails at boot on my test vm:
> 
> [    2.423116] ------------[ cut here ]------------
> [    2.424957] list_add double add: new=ffffea000efff4c8, prev=ffff8883bfffee60, next=ffffea000efff4c8.
> [    2.428259] WARNING: CPU: 4 PID: 1 at lib/list_debug.c:29 __list_add_valid+0x67/0x70
> [    2.457484] Call Trace:
> [    2.458171]  __pagevec_lru_add_fn+0x15f/0x2c0
> [    2.459376]  pagevec_lru_move_fn+0x87/0xd0
> [    2.460500]  ? pagevec_move_tail_fn+0x2d0/0x2d0
> [    2.461712]  lru_add_drain_cpu+0x8d/0x160
> [    2.462787]  lru_add_drain+0x18/0x20

Are you sure that was 4be497096c04 ?  I ask because there was a
version pushed to that git tree that did contain a list double-add
(due to a mismerge when shuffling patches).  I noticed it and fixed
it, and 4be497096c04 doesn't have that problem.  I also test with
CONFIG_DEBUG_LIST turned on, but this problem you hit is going to be
probabilistic because it'll depend on the timing between whatever other
list is being used and the page actually being added to the LRU.
John Hubbard Feb. 18, 2020, 8:49 p.m. UTC | #4
On 2/17/20 10:45 AM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> This series adds a readahead address_space operation to eventually
> replace the readpages operation.  The key difference is that
> pages are added to the page cache as they are allocated (and
> then looked up by the filesystem) instead of passing them on a
> list to the readpages operation and having the filesystem add
> them to the page cache.  It's a net reduction in code for each
> implementation, more efficient than walking a list, and solves
> the direct-write vs buffered-read problem reported by yu kuai at
> https://lore.kernel.org/linux-fsdevel/20200116063601.39201-1-yukuai3@huawei.com/
> 
> The only unconverted filesystems are those which use fscache.
> Their conversion is pending Dave Howells' rewrite which will make the
> conversion substantially easier.

Hi Matthew,

I see that Dave Chinner is reviewing this series, but I'm trying out his recent
advice about code reviews [1], and so I'm not going to read his comments first.
So you may see some duplication or contradictions this time around.


[1] Somewhere in this thread, "[LSF/MM/BPF TOPIC] FS Maintainers Don't Scale": 
https://lore.kernel.org/r/20200131052520.GC6869@magnolia


thanks,
Dave Chinner Feb. 18, 2020, 9:26 p.m. UTC | #5
On Tue, Feb 18, 2020 at 05:42:30AM -0800, Matthew Wilcox wrote:
> On Tue, Feb 18, 2020 at 03:56:33PM +1100, Dave Chinner wrote:
> > Latest version in your git tree:
> > 
> > $ ▶ glo -n 5 willy/readahead
> > 4be497096c04 mm: Use memalloc_nofs_save in readahead path
> > ff63497fcb98 iomap: Convert from readpages to readahead
> > 26aee60e89b5 iomap: Restructure iomap_readpages_actor
> > 8115bcca7312 fuse: Convert from readpages to readahead
> > 3db3d10d9ea1 f2fs: Convert from readpages to readahead
> > $
> > 
> > merged into a 5.6-rc2 tree fails at boot on my test vm:
> > 
> > [    2.423116] ------------[ cut here ]------------
> > [    2.424957] list_add double add: new=ffffea000efff4c8, prev=ffff8883bfffee60, next=ffffea000efff4c8.
> > [    2.428259] WARNING: CPU: 4 PID: 1 at lib/list_debug.c:29 __list_add_valid+0x67/0x70
> > [    2.457484] Call Trace:
> > [    2.458171]  __pagevec_lru_add_fn+0x15f/0x2c0
> > [    2.459376]  pagevec_lru_move_fn+0x87/0xd0
> > [    2.460500]  ? pagevec_move_tail_fn+0x2d0/0x2d0
> > [    2.461712]  lru_add_drain_cpu+0x8d/0x160
> > [    2.462787]  lru_add_drain+0x18/0x20
> 
> Are you sure that was 4be497096c04 ?  I ask because there was a

Yes, because it's the only version I've actually merged into my
working tree, compiled and tried to run. :P

> version pushed to that git tree that did contain a list double-add
> (due to a mismerge when shuffling patches).  I noticed it and fixed
> it, and 4be497096c04 doesn't have that problem.  I also test with
> CONFIG_DEBUG_LIST turned on, but this problem you hit is going to be
> probabilistic because it'll depend on the timing between whatever other
> list is being used and the page actually being added to the LRU.

I'll see if I can reproduce it.

Cheers,

Dave.
Dave Chinner Feb. 19, 2020, 3:45 a.m. UTC | #6
On Wed, Feb 19, 2020 at 08:26:52AM +1100, Dave Chinner wrote:
> On Tue, Feb 18, 2020 at 05:42:30AM -0800, Matthew Wilcox wrote:
> > On Tue, Feb 18, 2020 at 03:56:33PM +1100, Dave Chinner wrote:
> > > Latest version in your git tree:
> > > 
> > > $ ▶ glo -n 5 willy/readahead
> > > 4be497096c04 mm: Use memalloc_nofs_save in readahead path
> > > ff63497fcb98 iomap: Convert from readpages to readahead
> > > 26aee60e89b5 iomap: Restructure iomap_readpages_actor
> > > 8115bcca7312 fuse: Convert from readpages to readahead
> > > 3db3d10d9ea1 f2fs: Convert from readpages to readahead
> > > $
> > > 
> > > merged into a 5.6-rc2 tree fails at boot on my test vm:
> > > 
> > > [    2.423116] ------------[ cut here ]------------
> > > [    2.424957] list_add double add: new=ffffea000efff4c8, prev=ffff8883bfffee60, next=ffffea000efff4c8.
> > > [    2.428259] WARNING: CPU: 4 PID: 1 at lib/list_debug.c:29 __list_add_valid+0x67/0x70
> > > [    2.457484] Call Trace:
> > > [    2.458171]  __pagevec_lru_add_fn+0x15f/0x2c0
> > > [    2.459376]  pagevec_lru_move_fn+0x87/0xd0
> > > [    2.460500]  ? pagevec_move_tail_fn+0x2d0/0x2d0
> > > [    2.461712]  lru_add_drain_cpu+0x8d/0x160
> > > [    2.462787]  lru_add_drain+0x18/0x20
> > 
> > Are you sure that was 4be497096c04 ?  I ask because there was a
> 
> Yes, because it's the only version I've actually merged into my
> working tree, compiled and tried to run. :P
> 
> > version pushed to that git tree that did contain a list double-add
> > (due to a mismerge when shuffling patches).  I noticed it and fixed
> > it, and 4be497096c04 doesn't have that problem.  I also test with
> > CONFIG_DEBUG_LIST turned on, but this problem you hit is going to be
> > probabilistic because it'll depend on the timing between whatever other
> > list is being used and the page actually being added to the LRU.
> 
> I'll see if I can reproduce it.

Just updated to a current TOT Linus kernel and your latest branch,
and so far this is 100% reproducable.

Not sure how I'm going to debug it yet, because it's init that is
triggering it....

Cheers,

Dave.
Matthew Wilcox (Oracle) Feb. 19, 2020, 3:48 a.m. UTC | #7
On Wed, Feb 19, 2020 at 02:45:25PM +1100, Dave Chinner wrote:
> On Wed, Feb 19, 2020 at 08:26:52AM +1100, Dave Chinner wrote:
> > On Tue, Feb 18, 2020 at 05:42:30AM -0800, Matthew Wilcox wrote:
> > > On Tue, Feb 18, 2020 at 03:56:33PM +1100, Dave Chinner wrote:
> > > > Latest version in your git tree:
> > > > 
> > > > $ ▶ glo -n 5 willy/readahead
> > > > 4be497096c04 mm: Use memalloc_nofs_save in readahead path
> > > > ff63497fcb98 iomap: Convert from readpages to readahead
> > > > 26aee60e89b5 iomap: Restructure iomap_readpages_actor
> > > > 8115bcca7312 fuse: Convert from readpages to readahead
> > > > 3db3d10d9ea1 f2fs: Convert from readpages to readahead
> > > > $
> > > > 
> > > > merged into a 5.6-rc2 tree fails at boot on my test vm:
> > > > 
> > > > [    2.423116] ------------[ cut here ]------------
> > > > [    2.424957] list_add double add: new=ffffea000efff4c8, prev=ffff8883bfffee60, next=ffffea000efff4c8.
> > > > [    2.428259] WARNING: CPU: 4 PID: 1 at lib/list_debug.c:29 __list_add_valid+0x67/0x70
> > > > [    2.457484] Call Trace:
> > > > [    2.458171]  __pagevec_lru_add_fn+0x15f/0x2c0
> > > > [    2.459376]  pagevec_lru_move_fn+0x87/0xd0
> > > > [    2.460500]  ? pagevec_move_tail_fn+0x2d0/0x2d0
> > > > [    2.461712]  lru_add_drain_cpu+0x8d/0x160
> > > > [    2.462787]  lru_add_drain+0x18/0x20
> > > 
> > > Are you sure that was 4be497096c04 ?  I ask because there was a
> > 
> > Yes, because it's the only version I've actually merged into my
> > working tree, compiled and tried to run. :P
> > 
> > > version pushed to that git tree that did contain a list double-add
> > > (due to a mismerge when shuffling patches).  I noticed it and fixed
> > > it, and 4be497096c04 doesn't have that problem.  I also test with
> > > CONFIG_DEBUG_LIST turned on, but this problem you hit is going to be
> > > probabilistic because it'll depend on the timing between whatever other
> > > list is being used and the page actually being added to the LRU.
> > 
> > I'll see if I can reproduce it.
> 
> Just updated to a current TOT Linus kernel and your latest branch,
> and so far this is 100% reproducable.
> 
> Not sure how I'm going to debug it yet, because it's init that is
> triggering it....

Eric found it ... still not sure why I don't see it.
Dave Chinner Feb. 19, 2020, 3:57 a.m. UTC | #8
On Tue, Feb 18, 2020 at 07:48:32PM -0800, Matthew Wilcox wrote:
> On Wed, Feb 19, 2020 at 02:45:25PM +1100, Dave Chinner wrote:
> > On Wed, Feb 19, 2020 at 08:26:52AM +1100, Dave Chinner wrote:
> > > On Tue, Feb 18, 2020 at 05:42:30AM -0800, Matthew Wilcox wrote:
> > > > On Tue, Feb 18, 2020 at 03:56:33PM +1100, Dave Chinner wrote:
> > > > > Latest version in your git tree:
> > > > > 
> > > > > $ ▶ glo -n 5 willy/readahead
> > > > > 4be497096c04 mm: Use memalloc_nofs_save in readahead path
> > > > > ff63497fcb98 iomap: Convert from readpages to readahead
> > > > > 26aee60e89b5 iomap: Restructure iomap_readpages_actor
> > > > > 8115bcca7312 fuse: Convert from readpages to readahead
> > > > > 3db3d10d9ea1 f2fs: Convert from readpages to readahead
> > > > > $
> > > > > 
> > > > > merged into a 5.6-rc2 tree fails at boot on my test vm:
> > > > > 
> > > > > [    2.423116] ------------[ cut here ]------------
> > > > > [    2.424957] list_add double add: new=ffffea000efff4c8, prev=ffff8883bfffee60, next=ffffea000efff4c8.
> > > > > [    2.428259] WARNING: CPU: 4 PID: 1 at lib/list_debug.c:29 __list_add_valid+0x67/0x70
> > > > > [    2.457484] Call Trace:
> > > > > [    2.458171]  __pagevec_lru_add_fn+0x15f/0x2c0
> > > > > [    2.459376]  pagevec_lru_move_fn+0x87/0xd0
> > > > > [    2.460500]  ? pagevec_move_tail_fn+0x2d0/0x2d0
> > > > > [    2.461712]  lru_add_drain_cpu+0x8d/0x160
> > > > > [    2.462787]  lru_add_drain+0x18/0x20
> > > > 
> > > > Are you sure that was 4be497096c04 ?  I ask because there was a
> > > 
> > > Yes, because it's the only version I've actually merged into my
> > > working tree, compiled and tried to run. :P
> > > 
> > > > version pushed to that git tree that did contain a list double-add
> > > > (due to a mismerge when shuffling patches).  I noticed it and fixed
> > > > it, and 4be497096c04 doesn't have that problem.  I also test with
> > > > CONFIG_DEBUG_LIST turned on, but this problem you hit is going to be
> > > > probabilistic because it'll depend on the timing between whatever other
> > > > list is being used and the page actually being added to the LRU.
> > > 
> > > I'll see if I can reproduce it.
> > 
> > Just updated to a current TOT Linus kernel and your latest branch,
> > and so far this is 100% reproducable.
> > 
> > Not sure how I'm going to debug it yet, because it's init that is
> > triggering it....
> 
> Eric found it ...

Yeah, just saw that and am applying his patch to test it...

> still not sure why I don't see it.

No readahead configured on your device?


Cheers,

Dave.