
[RFC] mm/readahead: readahead aggressively if read drops in willneed range

Message ID 20240128142522.1524741-1-ming.lei@redhat.com (mailing list archive)
State New, archived
Series [RFC] mm/readahead: readahead aggressively if read drops in willneed range

Commit Message

Ming Lei Jan. 28, 2024, 2:25 p.m. UTC
Since commit 6d2be915e589 ("mm/readahead.c: fix readahead failure for
memoryless NUMA nodes and limit readahead max_pages"), ADV_WILLNEED
only tries to read ahead 512 pages, and the remaining part of the advised
range falls back to normal readahead.

If bdi->ra_pages is set small, readahead will not perform efficiently
enough. Increasing readahead may not be an option since the workload may
have mixed random and sequential I/O.

Improve this situation by maintaining a maple tree of willneed ranges:
if a read falls in any willneed range, read ahead aggressively, just as
we did before commit 6d2be915e589.

Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 fs/file_table.c    | 10 ++++++++++
 include/linux/fs.h | 15 +++++++++++++++
 mm/filemap.c       |  5 ++++-
 mm/internal.h      |  7 ++++++-
 mm/readahead.c     | 32 +++++++++++++++++++++++++++++++-
 5 files changed, 66 insertions(+), 3 deletions(-)

Comments

Matthew Wilcox Jan. 28, 2024, 10:02 p.m. UTC | #1
On Sun, Jan 28, 2024 at 10:25:22PM +0800, Ming Lei wrote:
> Since commit 6d2be915e589 ("mm/readahead.c: fix readahead failure for
> memoryless NUMA nodes and limit readahead max_pages"), ADV_WILLNEED
> only tries to read ahead 512 pages, and the remaining part of the advised
> range falls back to normal readahead.

Does the MAINTAINERS file mean nothing any more?

> If bdi->ra_pages is set small, readahead will not perform efficiently
> enough. Increasing readahead may not be an option since the workload may
> have mixed random and sequential I/O.

I think there needs to be a lot more explanation than this about what's
going on before we jump to "And therefore this patch is the right
answer".

> @@ -972,6 +974,7 @@ struct file_ra_state {
>  	unsigned int ra_pages;
>  	unsigned int mmap_miss;
>  	loff_t prev_pos;
> +	struct maple_tree *need_mt;

No.  Embed the struct maple tree.  Don't allocate it.  What made you
think this was the right approach?
Mike Snitzer Jan. 28, 2024, 11:12 p.m. UTC | #2
On Sun, Jan 28 2024 at  5:02P -0500,
Matthew Wilcox <willy@infradead.org> wrote:

> On Sun, Jan 28, 2024 at 10:25:22PM +0800, Ming Lei wrote:
> > Since commit 6d2be915e589 ("mm/readahead.c: fix readahead failure for
> > memoryless NUMA nodes and limit readahead max_pages"), ADV_WILLNEED
> > only tries to read ahead 512 pages, and the remaining part of the advised
> > range falls back to normal readahead.
> 
> Does the MAINTAINERS file mean nothing any more?

"Ming, please use scripts/get_maintainer.pl when submitting patches."

(I've cc'd accordingly with this email).

> > If bdi->ra_pages is set small, readahead will not perform efficiently
> > enough. Increasing readahead may not be an option since the workload may
> > have mixed random and sequential I/O.
> 
> I think there needs to be a lot more explanation than this about what's
> going on before we jump to "And therefore this patch is the right
> answer".

The patch is "RFC". Ming didn't declare his RFC to be "the right answer".
All ideas for how best to fix this issue are welcome.

I agree this patch's header could've worked harder to establish the
problem that it fixes.  But I'll now take a crack at backfilling the
regression report that motivated this patch's development:

Linux 3.14 was the last kernel in which madvise(MADV_WILLNEED)
allowed mmap'ing a file more optimally when read_ahead_kb < max_sectors_kb.

This regressed with commit 6d2be915e589 (so Linux 3.15) such that
mounting XFS on a device with read_ahead_kb=64 and max_sectors_kb=1024
and running this reproducer against a 2G file will take ~5x longer
(depending on the system's capabilities), mmap_load_test.java follows:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;

public class mmap_load_test {

        public static void main(String[] args) throws FileNotFoundException, IOException, InterruptedException {
		if (args.length == 0) {
			System.out.println("Please provide a file");
			System.exit(0);
		}
		FileChannel fc = new RandomAccessFile(new File(args[0]), "rw").getChannel();
		MappedByteBuffer mem = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());

		System.out.println("Loading the file");

		long startTime = System.currentTimeMillis();
		mem.load();
		long endTime = System.currentTimeMillis();
		System.out.println("Done! Loading took " + (endTime-startTime) + " ms");
		
	}
}

reproduce with:

javac mmap_load_test.java
echo 64 > /sys/block/sda/queue/read_ahead_kb
echo 1024 > /sys/block/sda/queue/max_sectors_kb
mkfs.xfs /dev/sda
mount /dev/sda /mnt/test
dd if=/dev/zero of=/mnt/test/2G_file bs=1024k count=2000

echo 3 > /proc/sys/vm/drop_caches
java mmap_load_test /mnt/test/2G_file

Without a fix, like the patch Ming provided, iostat will show rareq-sz
is 64 rather than ~1024.
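
For anyone without a JVM handy, a minimal C sketch of what mem.load()
is believed to do under the hood (HotSpot issues madvise(MADV_WILLNEED)
on the mapping and then touches each page; treat this as an
approximation of the JVM's behavior, not a statement of it):

/* mmap_load_test.c: hedged C approximation of the Java reproducer */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct stat st;
	volatile char sum = 0;
	char *p;
	off_t i;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "Please provide a file\n");
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || fstat(fd, &st) < 0) {
		perror(argv[1]);
		return 1;
	}
	p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	madvise(p, st.st_size, MADV_WILLNEED);	/* the advice in question */
	for (i = 0; i < st.st_size; i += 4096)	/* fault in every page */
		sum += p[i];
	return 0;
}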

> > @@ -972,6 +974,7 @@ struct file_ra_state {
> >  	unsigned int ra_pages;
> >  	unsigned int mmap_miss;
> >  	loff_t prev_pos;
> > +	struct maple_tree *need_mt;
> 
> No.  Embed the struct maple tree.  Don't allocate it.

Constructive feedback, thanks.

> What made you think this was the right approach?

But then you closed with an attack, rather than inform Ming and/or
others why you feel so strongly, e.g.: Best to keep memory used for
file_ra_state contiguous.
Matthew Wilcox Jan. 29, 2024, 12:21 a.m. UTC | #3
On Sun, Jan 28, 2024 at 06:12:29PM -0500, Mike Snitzer wrote:
> On Sun, Jan 28 2024 at  5:02P -0500,
> Matthew Wilcox <willy@infradead.org> wrote:
> 
> > On Sun, Jan 28, 2024 at 10:25:22PM +0800, Ming Lei wrote:
> > > Since commit 6d2be915e589 ("mm/readahead.c: fix readahead failure for
> > > memoryless NUMA nodes and limit readahead max_pages"), ADV_WILLNEED
> > > only tries to read ahead 512 pages, and the remaining part of the advised
> > > range falls back to normal readahead.
> > 
> > Does the MAINTAINERS file mean nothing any more?
> 
> "Ming, please use scripts/get_maintainer.pl when submitting patches."

That's an appropriate response to a new contributor, sure.  Ming has
been submitting patches since, what, 2008?  Surely they know how to
submit patches by now.

> I agree this patch's header could've worked harder to establish the
> problem that it fixes.  But I'll now take a crack at backfilling the
> regression report that motivated this patch's development:

Thank you.

> Linux 3.14 was the last kernel in which madvise(MADV_WILLNEED)
> allowed mmap'ing a file more optimally when read_ahead_kb < max_sectors_kb.
> 
> This regressed with commit 6d2be915e589 (so Linux 3.15) such that
> mounting XFS on a device with read_ahead_kb=64 and max_sectors_kb=1024
> and running this reproducer against a 2G file will take ~5x longer
> (depending on the system's capabilities), mmap_load_test.java follows:
> 
> import java.nio.ByteBuffer;
> import java.nio.ByteOrder;
> import java.io.RandomAccessFile;
> import java.nio.MappedByteBuffer;
> import java.nio.channels.FileChannel;
> import java.io.File;
> import java.io.FileNotFoundException;
> import java.io.IOException;
> 
> public class mmap_load_test {
> 
>         public static void main(String[] args) throws FileNotFoundException, IOException, InterruptedException {
> 		if (args.length == 0) {
> 			System.out.println("Please provide a file");
> 			System.exit(0);
> 		}
> 		FileChannel fc = new RandomAccessFile(new File(args[0]), "rw").getChannel();
> 		MappedByteBuffer mem = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
> 
> 		System.out.println("Loading the file");
> 
> 		long startTime = System.currentTimeMillis();
> 		mem.load();
> 		long endTime = System.currentTimeMillis();
> 		System.out.println("Done! Loading took " + (endTime-startTime) + " ms");
> 		
> 	}
> }

It's good to have the original reproducer.  The unfortunate part is
that being at such a high level, it doesn't really show what syscalls
the library makes on behalf of the application.  I'll take your word
for it that it calls madvise(MADV_WILLNEED).  An strace might not go
amiss.

> reproduce with:
> 
> javac mmap_load_test.java
> echo 64 > /sys/block/sda/queue/read_ahead_kb
> echo 1024 > /sys/block/sda/queue/max_sectors_kb
> mkfs.xfs /dev/sda
> mount /dev/sda /mnt/test
> dd if=/dev/zero of=/mnt/test/2G_file bs=1024k count=2000
> 
> echo 3 > /proc/sys/vm/drop_caches

(I prefer to unmount/mount /mnt/test; it drops the cache for
/mnt/test/2G_file without affecting the rest of the system)

> java mmap_load_test /mnt/test/2G_file
> 
> Without a fix, like the patch Ming provided, iostat will show rareq-sz
> is 64 rather than ~1024.

Understood.  But ... the application is asking for as much readahead as
possible, and the sysadmin has said "Don't readahead more than 64kB at
a time".  So why will we not get a bug report in 1-15 years time saying
"I put a limit on readahead and the kernel is ignoring it"?  I think
typically we allow the sysadmin to override application requests,
don't we?

> > > @@ -972,6 +974,7 @@ struct file_ra_state {
> > >  	unsigned int ra_pages;
> > >  	unsigned int mmap_miss;
> > >  	loff_t prev_pos;
> > > +	struct maple_tree *need_mt;
> > 
> > No.  Embed the struct maple tree.  Don't allocate it.
> 
> Constructive feedback, thanks.
> 
> > What made you think this was the right approach?
> 
> But then you closed with an attack, rather than inform Ming and/or
> others why you feel so strongly, e.g.: Best to keep memory used for
> file_ra_state contiguous.

That's not an attack, it's a genuine question.  Is there somewhere else
doing it wrong that Ming copied from?  Does the documentation need to
be clearer?  I can't fix what I don't know.
Mike Snitzer Jan. 29, 2024, 12:39 a.m. UTC | #4
On Sun, Jan 28, 2024 at 7:22 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sun, Jan 28, 2024 at 06:12:29PM -0500, Mike Snitzer wrote:
> > On Sun, Jan 28 2024 at  5:02P -0500,
> > Matthew Wilcox <willy@infradead.org> wrote:
> >
> > > On Sun, Jan 28, 2024 at 10:25:22PM +0800, Ming Lei wrote:
> > > > Since commit 6d2be915e589 ("mm/readahead.c: fix readahead failure for
> > > > memoryless NUMA nodes and limit readahead max_pages"), ADV_WILLNEED
> > > > only tries to read ahead 512 pages, and the remaining part of the advised
> > > > range falls back to normal readahead.
> > >
> > > Does the MAINTAINERS file mean nothing any more?
> >
> > "Ming, please use scripts/get_maintainer.pl when submitting patches."
>
> That's an appropriate response to a new contributor, sure.  Ming has
> been submitting patches since, what, 2008?  Surely they know how to
> submit patches by now.
>
> > I agree this patch's header could've worked harder to establish the
> > problem that it fixes.  But I'll now take a crack at backfilling the
> > regression report that motivated this patch's development:
>
> Thank you.
>
> > Linux 3.14 was the last kernel in which madvise(MADV_WILLNEED)
> > allowed mmap'ing a file more optimally when read_ahead_kb < max_sectors_kb.
> >
> > This regressed with commit 6d2be915e589 (so Linux 3.15) such that
> > mounting XFS on a device with read_ahead_kb=64 and max_sectors_kb=1024
> > and running this reproducer against a 2G file will take ~5x longer
> > (depending on the system's capabilities), mmap_load_test.java follows:
> >
> > import java.nio.ByteBuffer;
> > import java.nio.ByteOrder;
> > import java.io.RandomAccessFile;
> > import java.nio.MappedByteBuffer;
> > import java.nio.channels.FileChannel;
> > import java.io.File;
> > import java.io.FileNotFoundException;
> > import java.io.IOException;
> >
> > public class mmap_load_test {
> >
> >         public static void main(String[] args) throws FileNotFoundException, IOException, InterruptedException {
> >               if (args.length == 0) {
> >                       System.out.println("Please provide a file");
> >                       System.exit(0);
> >               }
> >               FileChannel fc = new RandomAccessFile(new File(args[0]), "rw").getChannel();
> >               MappedByteBuffer mem = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
> >
> >               System.out.println("Loading the file");
> >
> >               long startTime = System.currentTimeMillis();
> >               mem.load();
> >               long endTime = System.currentTimeMillis();
> >               System.out.println("Done! Loading took " + (endTime-startTime) + " ms");
> >
> >       }
> > }
>
> It's good to have the original reproducer.  The unfortunate part is
> that being at such a high level, it doesn't really show what syscalls
> the library makes on behalf of the application.  I'll take your word
> for it that it calls madvise(MADV_WILLNEED).  An strace might not go
> amiss.
>
> > reproduce with:
> >
> > javac mmap_load_test.java
> > echo 64 > /sys/block/sda/queue/read_ahead_kb
> > echo 1024 > /sys/block/sda/queue/max_sectors_kb
> > mkfs.xfs /dev/sda
> > mount /dev/sda /mnt/test
> > dd if=/dev/zero of=/mnt/test/2G_file bs=1024k count=2000
> >
> > echo 3 > /proc/sys/vm/drop_caches
>
> (I prefer to unmount/mount /mnt/test; it drops the cache for
> /mnt/test/2G_file without affecting the rest of the system)
>
> > java mmap_load_test /mnt/test/2G_file
> >
> > Without a fix, like the patch Ming provided, iostat will show rareq-sz
> > is 64 rather than ~1024.
>
> Understood.  But ... the application is asking for as much readahead as
> possible, and the sysadmin has said "Don't readahead more than 64kB at
> a time".  So why will we not get a bug report in 1-15 years time saying
> "I put a limit on readahead and the kernel is ignoring it"?  I think
> typically we allow the sysadmin to override application requests,
> don't we?

The application isn't knowingly asking for readahead.  It is asking to
mmap the file (and reporter wants it done as quickly as possible..
like occurred before).

This fix is comparable to Jens' commit 9491ae4aade6 ("mm: don't cap
request size based on read-ahead setting") -- same logic, just applied
to callchain that ends up using madvise(MADV_WILLNEED).

> > > > @@ -972,6 +974,7 @@ struct file_ra_state {
> > > >   unsigned int ra_pages;
> > > >   unsigned int mmap_miss;
> > > >   loff_t prev_pos;
> > > > + struct maple_tree *need_mt;
> > >
> > > No.  Embed the struct maple tree.  Don't allocate it.
> >
> > Constructive feedback, thanks.
> >
> > > What made you think this was the right approach?
> >
> > But then you closed with an attack, rather than inform Ming and/or
> > others why you feel so strongly, e.g.: Best to keep memory used for
> > file_ra_state contiguous.
>
> That's not an attack, it's a genuine question.  Is there somewhere else
> doing it wrong that Ming copied from?  Does the documentation need to
> be clearer?  I can't fix what I don't know.

OK
Dave Chinner Jan. 29, 2024, 1:47 a.m. UTC | #5
On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote:
> On Sun, Jan 28, 2024 at 7:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Sun, Jan 28, 2024 at 06:12:29PM -0500, Mike Snitzer wrote:
> > > On Sun, Jan 28 2024 at  5:02P -0500,
> > > Matthew Wilcox <willy@infradead.org> wrote:
> > Understood.  But ... the application is asking for as much readahead as
> > possible, and the sysadmin has said "Don't readahead more than 64kB at
> > a time".  So why will we not get a bug report in 1-15 years time saying
> > "I put a limit on readahead and the kernel is ignoring it"?  I think
> > typically we allow the sysadmin to override application requests,
> > don't we?
> 
> The application isn't knowingly asking for readahead.  It is asking to
> mmap the file (and reporter wants it done as quickly as possible..
> like occurred before).

... which we do within the constraints of the given configuration.

> This fix is comparable to Jens' commit 9491ae4aade6 ("mm: don't cap
> request size based on read-ahead setting") -- same logic, just applied
> to callchain that ends up using madvise(MADV_WILLNEED).

Not really. There is a difference between performing a synchronous
read IO here that we must complete, compared to optimistic
asynchronous read-ahead which we can fail or toss away without the
user ever seeing the data the IO returned.

We want required IO to be done in as few, larger IOs as possible,
and not be limited by constraints placed on background optimistic
IOs.

madvise(WILLNEED) is optimistic IO - there is no requirement that it
complete the data reads successfully. If the data is actually
required, we'll guarantee completion when the user accesses it, not
when madvise() is called.  IOWs, madvise is async readahead, and so
really should be constrained by readahead bounds and not user IO
bounds.

We could change this behaviour for madvise of large ranges that we
force into the page cache by ignoring device readahead bounds, but
I'm not sure we want to do this in general.

Perhaps fadvise/madvise(willneed) can fiddle the file f_ra.ra_pages
value in this situation to override the device limit for large
ranges (for some definition of large - say 10x bdi->ra_pages) and
restore it once the readahead operation is done. This would make it
behave less like readahead and more like a user read from an IO
perspective...
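
A rough sketch of that idea, with hypothetical naming and no claim it
matches any real patch:

/*
 * Hypothetical helper: temporarily widen the per-file readahead
 * window for a "large" WILLNEED range, then restore it.
 */
static void willneed_force_ra(struct file *file, pgoff_t index,
			      unsigned long nr_pages)
{
	struct backing_dev_info *bdi = inode_to_bdi(file_inode(file));
	unsigned int saved = file->f_ra.ra_pages;

	if (nr_pages > 10 * bdi->ra_pages)	/* "large" per the above */
		file->f_ra.ra_pages = bdi->io_pages;
	force_page_cache_readahead(file->f_mapping, file, index, nr_pages);
	file->f_ra.ra_pages = saved;
}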

-Dave.
Mike Snitzer Jan. 29, 2024, 2:12 a.m. UTC | #6
On Sun, Jan 28, 2024 at 8:48 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote:
> > On Sun, Jan 28, 2024 at 7:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Sun, Jan 28, 2024 at 06:12:29PM -0500, Mike Snitzer wrote:
> > > > On Sun, Jan 28 2024 at  5:02P -0500,
> > > > Matthew Wilcox <willy@infradead.org> wrote:
> > > Understood.  But ... the application is asking for as much readahead as
> > > possible, and the sysadmin has said "Don't readahead more than 64kB at
> > > a time".  So why will we not get a bug report in 1-15 years time saying
> > > "I put a limit on readahead and the kernel is ignoring it"?  I think
> > > typically we allow the sysadmin to override application requests,
> > > don't we?
> >
> > The application isn't knowingly asking for readahead.  It is asking to
> > mmap the file (and reporter wants it done as quickly as possible..
> > like occurred before).
>
> ... which we do within the constraints of the given configuration.
>
> > This fix is comparable to Jens' commit 9491ae4aade6 ("mm: don't cap
> > request size based on read-ahead setting") -- same logic, just applied
> > to callchain that ends up using madvise(MADV_WILLNEED).
>
> Not really. There is a difference between performing a synchronous
> read IO here that we must complete, compared to optimistic
> asynchronous read-ahead which we can fail or toss away without the
> user ever seeing the data the IO returned.
>
> We want required IO to be done in as few, larger IOs as possible,
> and not be limited by constraints placed on background optimistic
> IOs.
>
> madvise(WILLNEED) is optimistic IO - there is no requirement that it
> complete the data reads successfully. If the data is actually
> required, we'll guarantee completion when the user accesses it, not
> when madvise() is called.  IOWs, madvise is async readahead, and so
> really should be constrained by readahead bounds and not user IO
> bounds.
>
> We could change this behaviour for madvise of large ranges that we
> force into the page cache by ignoring device readahead bounds, but
> I'm not sure we want to do this in general.
>
> Perhaps fadvise/madvise(willneed) can fiddle the file f_ra.ra_pages
> value in this situation to override the device limit for large
> ranges (for some definition of large - say 10x bdi->ra_pages) and
> restore it once the readahead operation is done. This would make it
> behave less like readahead and more like a user read from an IO
> perspective...

I'm not going to pretend like I'm an expert in this code or all the
distinctions that are in play.  BUT, if you look at the high-level
java reproducer: it is requesting mmap of a finite size, starting from
the beginning of the file:
FileChannel fc = new RandomAccessFile(new File(args[0]), "rw").getChannel();
MappedByteBuffer mem = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());

Yet you're talking about the application like it is stabbingly
triggering unbounded async reads that can get dropped, etc, etc.  I
just want to make sure the subtlety of (ab)using madvise(WILLNEED)
like this app does isn't incorrectly attributed to something it isn't.
The app really is effectively requesting a user read of a particular
extent in terms of mmap, right?

BTW, your suggestion to have this app fiddle with ra_pages and then
reset it is pretty awful (that is a global setting, being tweaked for
a single use, and exposing random IO to excessive readahead should
there be a heavy mix of IO to the backing block device).  Seems the
app is providing plenty of context that it shouldn't be bounded in
terms of readahead limits; so much so that Ming's patch conveys the
range that madvise(WILLNEED) was given by the app, so as to _know_
whether the requested page(s) fall within the requested size; Linux just
happens to be fulfilling the syscall in terms of readahead.

FYI, here is the evolution of this use-case from when it issued
larger IO back in 3.14; Ming documented it, I'm just sharing in case it
helps:

3.14:
madvise_willneed() -> force_page_cache_readahead()
force_page_cache_readahead() will read all pages in the specified range

3.15:
madvise_willneed() -> vfs_fadvise() -> generic_fadvise() ->
  force_page_cache_readahead()

force_page_cache_readahead() only reads at most device max_sectors bytes,
and the remainder is read in filemap_fault()

Things started to change with:

1) 6d2be915e589 ("mm/readahead.c: fix readahead failure for memoryless
NUMA nodes and limit readahead max_pages"), which limits reads in
madvise_willneed() to at most 2MB; the remaining bytes are read in
filemap_fault() via normal readahead

2) 600e19afc5f8 ("mm: use only per-device readahead limit"), which
limits reads in madvise_willneed() to at most .ra_pages bytes

3) 9491ae4aade6 ("mm: don't cap request size based on read-ahead setting"),
which relaxes the limit by reading at most max sectors bytes in
madvise_willneed()
Ming Lei Jan. 29, 2024, 3 a.m. UTC | #7
On Sun, Jan 28, 2024 at 10:02:49PM +0000, Matthew Wilcox wrote:
> On Sun, Jan 28, 2024 at 10:25:22PM +0800, Ming Lei wrote:
> > Since commit 6d2be915e589 ("mm/readahead.c: fix readahead failure for
> > memoryless NUMA nodes and limit readahead max_pages"), ADV_WILLNEED
> > only tries to read ahead 512 pages, and the remaining part of the advised
> > range falls back to normal readahead.
> 
> Does the MAINTAINERS file mean nothing any more?

I just missed Cc'ing you, sorry.

> 
> > If bdi->ra_pages is set small, readahead will not perform efficiently
> > enough. Increasing readahead may not be an option since the workload may
> > have mixed random and sequential I/O.
> 
> I think there needs to be a lot more explanation than this about what's
> going on before we jump to "And therefore this patch is the right
> answer".

Both 6d2be915e589 and the commit log provide background about this
issue; let me explain it further:

1) before commit 6d2be915e589, the madvise/fadvise(WILLNEED)/readahead
syscalls try to read ahead over the specified range if memory allows,
and for each readahead in this range the ra size is set to the max
sectors of the block device, see force_page_cache_ra().

2) since commit 6d2be915e589, only 2MB is loaded by these syscalls,
and the remaining bytes fall back to future normal readahead when
reading from the page cache or an mmap buffer

3) this patch wires the advise(WILLNEED) range info into normal readahead
for both the mmap fault and buffered read code paths, so each readahead
can use the max sectors of the block device for the ra size, basically
taking the same approach as before commit 6d2be915e589

> 
> > @@ -972,6 +974,7 @@ struct file_ra_state {
> >  	unsigned int ra_pages;
> >  	unsigned int mmap_miss;
> >  	loff_t prev_pos;
> > +	struct maple_tree *need_mt;
> 
> No.  Embed the struct maple tree.  Don't allocate it.  What made you
> think this was the right approach?

Can you explain why it has to be embedded? core-api/maple_tree.rst
mentions it is fine to call "mt_init() for dynamically allocated ones".

The maple tree provides an easy way to record the advised willneed
ranges, so the readahead code path can apply this info to speed up
readahead.
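
A minimal sketch of the record/lookup this enables (hypothetical helper
names; the actual patch wiring differs):

#include <linux/maple_tree.h>
#include <linux/xarray.h>

/* record an advised range, keyed by page index */
static int willneed_record(struct maple_tree *mt, pgoff_t start, pgoff_t end)
{
	/* store a non-NULL marker so lookups can test for presence */
	return mtree_store_range(mt, start, end, xa_mk_value(1), GFP_KERNEL);
}

/* readahead side: does this read fall in a willneed range? */
static bool read_in_willneed_range(struct maple_tree *mt, pgoff_t index)
{
	return mtree_load(mt, index) != NULL;
}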


Thanks,
Ming
Ming Lei Jan. 29, 2024, 3:20 a.m. UTC | #8
On Mon, Jan 29, 2024 at 12:21:16AM +0000, Matthew Wilcox wrote:
> On Sun, Jan 28, 2024 at 06:12:29PM -0500, Mike Snitzer wrote:
> > On Sun, Jan 28 2024 at  5:02P -0500,
> > Matthew Wilcox <willy@infradead.org> wrote:
> > 
> > > On Sun, Jan 28, 2024 at 10:25:22PM +0800, Ming Lei wrote:
> > > > Since commit 6d2be915e589 ("mm/readahead.c: fix readahead failure for
> > > > memoryless NUMA nodes and limit readahead max_pages"), ADV_WILLNEED
> > > > only tries to read ahead 512 pages, and the remaining part of the advised
> > > > range falls back to normal readahead.
> > > 
> > > Does the MAINTAINERS file mean nothing any more?
> > 
> > "Ming, please use scripts/get_maintainer.pl when submitting patches."
> 
> That's an appropriate response to a new contributor, sure.  Ming has
> been submitting patches since, what, 2008?  Surely they know how to
> submit patches by now.
> 
> > I agree this patch's header could've worked harder to establish the
> > problem that it fixes.  But I'll now take a crack at backfilling the
> > regression report that motivated this patch's development:
> 
> Thank you.
> 
> > Linux 3.14 was the last kernel in which madvise(MADV_WILLNEED)
> > allowed mmap'ing a file more optimally when read_ahead_kb < max_sectors_kb.
> > 
> > This regressed with commit 6d2be915e589 (so Linux 3.15) such that
> > mounting XFS on a device with read_ahead_kb=64 and max_sectors_kb=1024
> > and running this reproducer against a 2G file will take ~5x longer
> > (depending on the system's capabilities), mmap_load_test.java follows:
> > 
> > import java.nio.ByteBuffer;
> > import java.nio.ByteOrder;
> > import java.io.RandomAccessFile;
> > import java.nio.MappedByteBuffer;
> > import java.nio.channels.FileChannel;
> > import java.io.File;
> > import java.io.FileNotFoundException;
> > import java.io.IOException;
> > 
> > public class mmap_load_test {
> > 
> >         public static void main(String[] args) throws FileNotFoundException, IOException, InterruptedException {
> > 		if (args.length == 0) {
> > 			System.out.println("Please provide a file");
> > 			System.exit(0);
> > 		}
> > 		FileChannel fc = new RandomAccessFile(new File(args[0]), "rw").getChannel();
> > 		MappedByteBuffer mem = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
> > 
> > 		System.out.println("Loading the file");
> > 
> > 		long startTime = System.currentTimeMillis();
> > 		mem.load();
> > 		long endTime = System.currentTimeMillis();
> > 		System.out.println("Done! Loading took " + (endTime-startTime) + " ms");
> > 		
> > 	}
> > }
> 
> It's good to have the original reproducer.  The unfortunate part is
> that being at such a high level, it doesn't really show what syscalls
> the library makes on behalf of the application.  I'll take your word
> for it that it calls madvise(MADV_WILLNEED).  An strace might not go
> amiss.

Yeah, it can be fadvise(WILLNEED)/readahead syscall too.

> 
> > reproduce with:
> > 
> > javac mmap_load_test.java
> > echo 64 > /sys/block/sda/queue/read_ahead_kb
> > echo 1024 > /sys/block/sda/queue/max_sectors_kb
> > mkfs.xfs /dev/sda
> > mount /dev/sda /mnt/test
> > dd if=/dev/zero of=/mnt/test/2G_file bs=1024k count=2000
> > 
> > echo 3 > /proc/sys/vm/drop_caches
> 
> (I prefer to unmount/mount /mnt/test; it drops the cache for
> /mnt/test/2G_file without affecting the rest of the system)
> 
> > java mmap_load_test /mnt/test/2G_file
> > 
> > Without a fix, like the patch Ming provided, iostat will show rareq-sz
> > is 64 rather than ~1024.
> 
> Understood.  But ... the application is asking for as much readahead as
> possible, and the sysadmin has said "Don't readahead more than 64kB at
> a time".  So why will we not get a bug report in 1-15 years time saying
> "I put a limit on readahead and the kernel is ignoring it"?  I think
> typically we allow the sysadmin to override application requests,
> don't we?

ra_pages is just one hint for readahead; in reality the sysadmin
can't know how many bytes are ideal for readahead.

But the application often knows how many bytes it will need, so here
I think the application's requirement should have higher priority,
especially when the application doesn't want the kernel to read ahead
blindly.

And the madvise/fadvise(WILLNEED) syscalls already read bdi->io_pages
first, which is bigger than ra_pages.


Thanks, 
Ming
Ming Lei Jan. 29, 2024, 3:57 a.m. UTC | #9
On Mon, Jan 29, 2024 at 12:47:41PM +1100, Dave Chinner wrote:
> On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote:
> > On Sun, Jan 28, 2024 at 7:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Sun, Jan 28, 2024 at 06:12:29PM -0500, Mike Snitzer wrote:
> > > > On Sun, Jan 28 2024 at  5:02P -0500,
> > > > Matthew Wilcox <willy@infradead.org> wrote:
> > > Understood.  But ... the application is asking for as much readahead as
> > > possible, and the sysadmin has said "Don't readahead more than 64kB at
> > > a time".  So why will we not get a bug report in 1-15 years time saying
> > > "I put a limit on readahead and the kernel is ignoring it"?  I think
> > > typically we allow the sysadmin to override application requests,
> > > don't we?
> > 
> > The application isn't knowingly asking for readahead.  It is asking to
> > mmap the file (and reporter wants it done as quickly as possible..
> > like occurred before).
> 
> ... which we do within the constraints of the given configuration.
> 
> > This fix is comparable to Jens' commit 9491ae4aade6 ("mm: don't cap
> > request size based on read-ahead setting") -- same logic, just applied
> > to callchain that ends up using madvise(MADV_WILLNEED).
> 
> Not really. There is a difference between performing a synchronous
> read IO here that we must complete, compared to optimistic
> asynchronous read-ahead which we can fail or toss away without the
> user ever seeing the data the IO returned.

Yeah, the big readahead in this patch happens when the user starts to
read over the mmapped buffer, not during madvise().

> 
> We want required IO to be done in as few, larger IOs as possible,
> and not be limited by constraints placed on background optimistic
> IOs.
> 
> madvise(WILLNEED) is optimistic IO - there is no requirement that it
> complete the data reads successfully. If the data is actually
> required, we'll guarantee completion when the user accesses it, not
> when madvise() is called.  IOWs, madvise is async readahead, and so
> really should be constrained by readahead bounds and not user IO
> bounds.
> 
> We could change this behaviour for madvise of large ranges that we
> force into the page cache by ignoring device readahead bounds, but
> I'm not sure we want to do this in general.
> 
> Perhaps fadvise/madvise(willneed) can fiddle the file f_ra.ra_pages
> value in this situation to override the device limit for large
> ranges (for some definition of large - say 10x bdi->ra_pages) and
> restore it once the readahead operation is done. This would make it
> behave less like readahead and more like a user read from an IO
> perspective...

->ra_pages is just one hint, 128KB by default, and either the
device or userspace can override it.

fadvise/madvise(willneed) already reads ahead bdi->io_pages bytes, which
is the device's max sector size (often 10x ->ra_pages), please see
force_page_cache_ra().
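
A simplified sketch of that clamping in force_page_cache_ra(),
paraphrased from mm/readahead.c (details vary by kernel version):

void force_page_cache_ra(struct readahead_control *ractl,
			 unsigned long nr_to_read)
{
	struct backing_dev_info *bdi =
		inode_to_bdi(ractl->mapping->host);
	struct file_ra_state *ra = ractl->ra;
	unsigned long max_pages;

	/* advised reads may exceed ra_pages, but only up to io_pages */
	max_pages = max_t(unsigned long, bdi->io_pages, ra->ra_pages);
	nr_to_read = min_t(unsigned long, nr_to_read, max_pages);
	while (nr_to_read) {
		/* issue in 2MB chunks to bound page allocation bursts */
		unsigned long this_chunk = (2 * 1024 * 1024) / PAGE_SIZE;

		if (this_chunk > nr_to_read)
			this_chunk = nr_to_read;
		do_page_cache_ra(ractl, this_chunk, 0);
		nr_to_read -= this_chunk;
	}
}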

The current report goes as follows:

1) userspace calls madvise(willneed, 1G)

2) only the first part (sized from bdi->io_pages, suppose it is 2MB) is
read ahead in madvise(willneed, 1G) since commit 6d2be915e589

3) the other parts (2M ~ 1G) are read ahead in units of bdi->ra_pages,
which was set to 64KB by userspace, when userspace reads the mmapped
buffer, so the whole application becomes slower.

This patch changes 3) to use bdi->io_pages as the readahead unit.


Thanks,
Ming
Dave Chinner Jan. 29, 2024, 4:56 a.m. UTC | #10
On Sun, Jan 28, 2024 at 09:12:12PM -0500, Mike Snitzer wrote:
> On Sun, Jan 28, 2024 at 8:48 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote:
> > > On Sun, Jan 28, 2024 at 7:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Sun, Jan 28, 2024 at 06:12:29PM -0500, Mike Snitzer wrote:
> > > > > On Sun, Jan 28 2024 at  5:02P -0500,
> > > > > Matthew Wilcox <willy@infradead.org> wrote:
> > > > Understood.  But ... the application is asking for as much readahead as
> > > > possible, and the sysadmin has said "Don't readahead more than 64kB at
> > > > a time".  So why will we not get a bug report in 1-15 years time saying
> > > > "I put a limit on readahead and the kernel is ignoring it"?  I think
> > > > typically we allow the sysadmin to override application requests,
> > > > don't we?
> > >
> > > The application isn't knowingly asking for readahead.  It is asking to
> > > mmap the file (and reporter wants it done as quickly as possible..
> > > like occurred before).
> >
> > ... which we do within the constraints of the given configuration.
> >
> > > This fix is comparable to Jens' commit 9491ae4aade6 ("mm: don't cap
> > > request size based on read-ahead setting") -- same logic, just applied
> > > to callchain that ends up using madvise(MADV_WILLNEED).
> >
> > Not really. There is a difference between performing a synchronous
> > read IO here that we must complete, compared to optimistic
> > asynchronous read-ahead which we can fail or toss away without the
> > user ever seeing the data the IO returned.
> >
> > We want required IO to be done in as few, larger IOs as possible,
> > and not be limited by constraints placed on background optimistic
> > IOs.
> >
> > madvise(WILLNEED) is optimistic IO - there is no requirement that it
> > complete the data reads successfully. If the data is actually
> > required, we'll guarantee completion when the user accesses it, not
> > when madvise() is called.  IOWs, madvise is async readahead, and so
> > really should be constrained by readahead bounds and not user IO
> > bounds.
> >
> > We could change this behaviour for madvise of large ranges that we
> > force into the page cache by ignoring device readahead bounds, but
> > I'm not sure we want to do this in general.
> >
> > Perhaps fadvise/madvise(willneed) can fiddle the file f_ra.ra_pages
> > value in this situation to override the device limit for large
> > ranges (for some definition of large - say 10x bdi->ra_pages) and
> > restore it once the readahead operation is done. This would make it
> > behave less like readahead and more like a user read from an IO
> > perspective...
> 
> I'm not going to pretend like I'm an expert in this code or all the
> distinctions that are in play.  BUT, if you look at the high-level
> java reproducer: it is requesting mmap of a finite size, starting from
> the beginning of the file:
> FileChannel fc = new RandomAccessFile(new File(args[0]), "rw").getChannel();
> MappedByteBuffer mem = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());

Mapping an entire file does not mean "we are going to access the
entire file". Lots of code will do this, especially code that does
random accesses within the file.

> Yet you're talking about the application like it is stabbingly
> triggering unbounded async reads that can get dropped, etc, etc.  I

I don't know what the application actually does. All I see is a
microbenchmark that mmaps() a file and walks it sequentially. On a
system where readahead has been tuned to de-prioritise sequential IO
performance.

> just want to make sure the subtlety of (ab)using madvise(WILLNEED)
> like this app does isn't incorrectly attributed to something it isn't.
> The app really is effectively requesting a user read of a particular
> extent in terms of mmap, right?

madvise() is an -advisory- interface that does not guarantee any
specific behaviour. the man page says:

MADV_WILLNEED
       Expect access in the near future.  (Hence, it might be a good
       idea to read some pages ahead.)

It says nothing about guaranteeing that all the data is brought into
memory, or that if it does get brought into memory that it will
remain there until the application accesses it. It doesn't even
imply that IO will even be done immediately. Any application
relying on madvise() to fully populate the page cache range before
returning is expecting behaviour that is not documented nor
guaranteed.

Similarly, the fadvise64() man page does not say that WILLNEED will
bring the entire file into cache:

POSIX_FADV_WILLNEED
        The specified data will be accessed in the near future.

	POSIX_FADV_WILLNEED initiates a nonblocking read of the
	specified region into the page cache.  The amount of data
	read may be decreased by the kernel depending on virtual
	memory load.  (A few megabytes will usually be fully
	satisfied, and more is rarely useful.)

> BTW, your suggestion to have this app fiddle with ra_pages and then

No, I did not suggest that the app fiddle with anything. I was
talking about the in-kernel FADV_WILLNEED implementation changing
file->f_ra.ra_pages similar to how FADV_RANDOM and FADV_SEQUENTIAL
do to change readahead IO behaviour. That then allows subsequent
readahead on that vma->file to use a larger value than the default
value pulled in off the device.

Largely, I think the problem is that the application has set a
readahead limit too low for optimal sequential performance.
Complaining that readahead is slow when it has been explicitly tuned
to be slow doesn't really seem like a problem we can fix with code.

-Dave.
Dave Chinner Jan. 29, 2024, 5:15 a.m. UTC | #11
On Mon, Jan 29, 2024 at 11:57:45AM +0800, Ming Lei wrote:
> On Mon, Jan 29, 2024 at 12:47:41PM +1100, Dave Chinner wrote:
> > On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote:
> > > On Sun, Jan 28, 2024 at 7:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Sun, Jan 28, 2024 at 06:12:29PM -0500, Mike Snitzer wrote:
> > > > > On Sun, Jan 28 2024 at  5:02P -0500,
> > > > > Matthew Wilcox <willy@infradead.org> wrote:
> > > > Understood.  But ... the application is asking for as much readahead as
> > > > possible, and the sysadmin has said "Don't readahead more than 64kB at
> > > > a time".  So why will we not get a bug report in 1-15 years time saying
> > > > "I put a limit on readahead and the kernel is ignoring it"?  I think
> > > > typically we allow the sysadmin to override application requests,
> > > > don't we?
> > > 
> > > The application isn't knowingly asking for readahead.  It is asking to
> > > mmap the file (and reporter wants it done as quickly as possible..
> > > like occurred before).
> > 
> > ... which we do within the constraints of the given configuration.
> > 
> > > This fix is comparable to Jens' commit 9491ae4aade6 ("mm: don't cap
> > > request size based on read-ahead setting") -- same logic, just applied
> > > to callchain that ends up using madvise(MADV_WILLNEED).
> > 
> > Not really. There is a difference between performing a synchronous
> > read IO here that we must complete, compared to optimistic
> > asynchronous read-ahead which we can fail or toss away without the
> > user ever seeing the data the IO returned.
> 
> Yeah, the big readahead in this patch happens when the user starts to
> read over the mmapped buffer, not during madvise().

Yes, that's how it is intended to work :/

> > We want required IO to be done in as few, larger IOs as possible,
> > and not be limited by constraints placed on background optimistic
> > IOs.
> > 
> > madvise(WILLNEED) is optimistic IO - there is no requirement that it
> > complete the data reads successfully. If the data is actually
> > required, we'll guarantee completion when the user accesses it, not
> > when madvise() is called.  IOWs, madvise is async readahead, and so
> > really should be constrained by readahead bounds and not user IO
> > bounds.
> > 
> > We could change this behaviour for madvise of large ranges that we
> > force into the page cache by ignoring device readahead bounds, but
> > I'm not sure we want to do this in general.
> > 
> > Perhaps fadvise/madvise(willneed) can fiddle the file f_ra.ra_pages
> > value in this situation to override the device limit for large
> > ranges (for some definition of large - say 10x bdi->ra_pages) and
> > restore it once the readahead operation is done. This would make it
> > behave less like readahead and more like a user read from an IO
> > perspective...
> 
> ->ra_pages is just one hint, 128KB by default, and either the
> device or userspace can override it.
> 
> fadvise/madvise(willneed) already reads ahead bdi->io_pages bytes, which
> is the device's max sector size (often 10x ->ra_pages), please see
> force_page_cache_ra().

Yes, but if we also change vma->file->f_ra->ra_pages during the
WILLNEED operation (as we do for FADV_SEQUENTIAL) then we get a
larger readahead window for the demand-paged access portion of the
WILLNEED access...

> 
> The current report goes as follows:
> 
> 1) userspace calls madvise(willneed, 1G)
> 
> 2) only the first part (sized from bdi->io_pages, suppose it is 2MB) is
> read ahead in madvise(willneed, 1G) since commit 6d2be915e589
> 
> 3) the other parts (2M ~ 1G) are read ahead in units of bdi->ra_pages,
> which was set to 64KB by userspace, when userspace reads the mmapped
> buffer, so the whole application becomes slower.

It gets limited by file->f_ra->ra_pages being initialised to
bdi->ra_pages and then never changed as the advice for access
methods to the file are changed.

But the problem here is *not the readahead code*. The problem is
that the user has configured the device readahead window to be far
smaller than is optimal for the storage. Hence readahead is slow.
The fix for that is to either increase the device readahead windows,
or to change the specific readahead window for the file that has
sequential access patterns.

Indeed, we already have that - FADV_SEQUENTIAL will set
file->f_ra.ra_pages to 2 * bdi->ra_pages so that readahead uses
larger IOs for that access.

That's what should happen here - MADV_WILLNEED does not imply a
specific access pattern so the application should be running
MADV_SEQUENTIAL (triggers aggressive readahead) then MADV_WILLNEED
to start the readahead, and then the rest of the on-demand readahead
will get the higher readahead limits.

> This patch changes 3) to use bdi->io_pages as the readahead unit.

I think it really should be changing MADV/FADV_SEQUENTIAL to set
file->f_ra.ra_pages to bdi->io_pages, not bdi->ra_pages * 2, and the
mem.load() implementation in the application converted to use
MADV_SEQUENTIAL to properly indicate its access pattern to the
readahead algorithm.
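
Sketched against the POSIX_FADV_SEQUENTIAL case in generic_fadvise()
(mm/fadvise.c); hedged and untested:

	case POSIX_FADV_SEQUENTIAL:
		/* today: double the per-file readahead window */
		file->f_ra.ra_pages = bdi->ra_pages * 2;
		/* proposed: allow the full device IO size instead:
		 * file->f_ra.ra_pages = bdi->io_pages;
		 */
		spin_lock(&file->f_lock);
		file->f_mode &= ~FMODE_RANDOM;
		spin_unlock(&file->f_lock);
		break;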

Cheers,

Dave.
Ming Lei Jan. 29, 2024, 8:25 a.m. UTC | #12
On Mon, Jan 29, 2024 at 04:15:16PM +1100, Dave Chinner wrote:
> On Mon, Jan 29, 2024 at 11:57:45AM +0800, Ming Lei wrote:
> > On Mon, Jan 29, 2024 at 12:47:41PM +1100, Dave Chinner wrote:
> > > On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote:
> > > > On Sun, Jan 28, 2024 at 7:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Sun, Jan 28, 2024 at 06:12:29PM -0500, Mike Snitzer wrote:
> > > > > > On Sun, Jan 28 2024 at  5:02P -0500,
> > > > > > Matthew Wilcox <willy@infradead.org> wrote:
> > > > > Understood.  But ... the application is asking for as much readahead as
> > > > > possible, and the sysadmin has said "Don't readahead more than 64kB at
> > > > > a time".  So why will we not get a bug report in 1-15 years time saying
> > > > > "I put a limit on readahead and the kernel is ignoring it"?  I think
> > > > > typically we allow the sysadmin to override application requests,
> > > > > don't we?
> > > > 
> > > > The application isn't knowingly asking for readahead.  It is asking to
> > > > mmap the file (and reporter wants it done as quickly as possible..
> > > > like occurred before).
> > > 
> > > ... which we do within the constraints of the given configuration.
> > > 
> > > > This fix is comparable to Jens' commit 9491ae4aade6 ("mm: don't cap
> > > > request size based on read-ahead setting") -- same logic, just applied
> > > > to callchain that ends up using madvise(MADV_WILLNEED).
> > > 
> > > Not really. There is a difference between performing a synchronous
> > > read IO here that we must complete, compared to optimistic
> > > asynchronous read-ahead which we can fail or toss away without the
> > > user ever seeing the data the IO returned.
> > 
> > Yeah, the big readahead in this patch happens when the user starts to
> > read over the mmapped buffer, not during madvise().
> 
> Yes, that's how it is intended to work :/
> 
> > > We want required IO to be done in as few, larger IOs as possible,
> > > and not be limited by constraints placed on background optimistic
> > > IOs.
> > > 
> > > madvise(WILLNEED) is optimistic IO - there is no requirement that it
> > > complete the data reads successfully. If the data is actually
> > > required, we'll guarantee completion when the user accesses it, not
> > > when madvise() is called.  IOWs, madvise is async readahead, and so
> > > really should be constrained by readahead bounds and not user IO
> > > bounds.
> > > 
> > > We could change this behaviour for madvise of large ranges that we
> > > force into the page cache by ignoring device readahead bounds, but
> > > I'm not sure we want to do this in general.
> > > 
> > > Perhaps fadvise/madvise(willneed) can fiddle the file f_ra.ra_pages
> > > value in this situation to override the device limit for large
> > > ranges (for some definition of large - say 10x bdi->ra_pages) and
> > > restore it once the readahead operation is done. This would make it
> > > behave less like readahead and more like a user read from an IO
> > > perspective...
> > 
> > ->ra_pages is just one hint, 128KB by default, and either the
> > device or userspace can override it.
> > 
> > fadvise/madvise(willneed) already reads ahead bdi->io_pages bytes, which
> > is the device's max sector size (often 10x ->ra_pages), please see
> > force_page_cache_ra().
> 
> Yes, but if we also change vma->file->f_ra->ra_pages during the
> WILLNEED operation (as we do for FADV_SEQUENTIAL) then we get a
> larger readahead window for the demand-paged access portion of the
> WILLNEED access...
> 
> > 
> > The current report goes as follows:
> > 
> > 1) userspace calls madvise(willneed, 1G)
> > 
> > 2) only the first part (sized from bdi->io_pages, suppose it is 2MB) is
> > read ahead in madvise(willneed, 1G) since commit 6d2be915e589
> > 
> > 3) the other parts (2M ~ 1G) are read ahead in units of bdi->ra_pages,
> > which was set to 64KB by userspace, when userspace reads the mmapped
> > buffer, so the whole application becomes slower.
> 
> It gets limited by file->f_ra->ra_pages being initialised to
> bdi->ra_pages and then never changed as the advice for access
> methods to the file are changed.
> 
> But the problem here is *not the readahead code*. The problem is
> that the user has configured the device readahead window to be far
> smaller than is optimal for the storage. Hence readahead is slow.
> The fix for that is to either increase the device readahead windows,
> or to change the specific readahead window for the file that has
> sequential access patterns.
> 
> Indeed, we already have that - FADV_SEQUENTIAL will set
> file->f_ra.ra_pages to 2 * bdi->ra_pages so that readahead uses
> larger IOs for that access.
> 
> That's what should happen here - MADV_WILLNEED does not imply a
> specific access pattern so the application should be running
> MADV_SEQUENTIAL (triggers aggressive readahead) then MADV_WILLNEED
> to start the readahead, and then the rest of the on-demand readahead
> will get the higher readahead limits.
> 
> > This patch changes 3) to use bdi->io_pages as the readahead unit.
> 
> I think it really should be changing MADV/FADV_SEQUENTIAL to set
> file->f_ra.ra_pages to bdi->io_pages, not bdi->ra_pages * 2, and the
> mem.load() implementation in the application converted to use
> MADV_SEQUENTIAL to properly indicate its access pattern to the
> readahead algorithm.

Here a single .ra_pages may not work; that is why this patch stores
the willneed ranges in a maple tree. Please see the following words from
the original RH report:

"
Increasing read ahead is not an option as it has a mixed I/O workload of
random I/O and sequential I/O, so that a large read ahead is very counterproductive
to the random I/O and is unacceptable.
"

Also, almost all of these advice values (SEQUENTIAL, WILLNEED, NORMAL,
RANDOM) ignore the passed range, and the behavior becomes all or nothing
instead of applying only to the specified range, which may not match the
man pages; please see 'man posix_fadvise':

	```
       POSIX_FADV_NORMAL
              Indicates that the application has no advice to give about its
              access pattern for the specified data.  If no advice is given
              for an open file, this is the default assumption.

       POSIX_FADV_SEQUENTIAL
              The application expects to access the specified data
              sequentially (with lower offsets read before higher ones).

       POSIX_FADV_RANDOM
              The specified data will be accessed in random order.

       POSIX_FADV_NOREUSE
              The specified data will be accessed only once.

              In kernels before 2.6.18, POSIX_FADV_NOREUSE had the same
              semantics as POSIX_FADV_WILLNEED.  This was probably a bug;
              since kernel 2.6.18, this flag is a no-op.

       POSIX_FADV_WILLNEED
              The specified data will be accessed in the near future.
	```

It is even worse for the readahead() syscall:

	```
	DESCRIPTION
	       readahead() initiates readahead on a file so that subsequent
	       reads from that file will be satisfied from the cache, and not
	       block on disk I/O (assuming the readahead was initiated early
	       enough and that other activity on the system did not in the
	       meantime flush pages from the cache).
	```


Thanks,
Ming
Matthew Wilcox Jan. 29, 2024, 1:26 p.m. UTC | #13
On Mon, Jan 29, 2024 at 04:25:41PM +0800, Ming Lei wrote:
> Here a single .ra_pages may not work; that is why this patch stores
> the willneed ranges in a maple tree. Please see the following words from
> the original RH report:
> 
> "
> Increasing read ahead is not an option as it has a mixed I/O workload of
> random I/O and sequential I/O, so that a large read ahead is very counterproductive
> to the random I/O and is unacceptable.
> "

It is really frustrating having to drag this important information out of
you.  Please send the full bug report (stripping identifying information
if that's what the customer needs).  We seem to be covering the same
ground again in public that apparently you've already done in private.
This is no way to work with upstream.
Mike Snitzer Jan. 29, 2024, 5:42 p.m. UTC | #14
On Mon, Jan 29 2024 at 12:19P -0500,
Mike Snitzer <snitzer@kernel.org> wrote:
 
> While I'm sure this legacy application would love to not have to
> change its code at all, I think we can all agree that we need to just
> focus on how best to advise applications with mixed workloads so they
> can accomplish efficient mmap+read of both sequential and random I/O.
> 
> To that end, I heard Dave clearly suggest 2 things:
> 
> 1) update MADV/FADV_SEQUENTIAL to set file->f_ra.ra_pages to
>    bdi->io_pages, not bdi->ra_pages * 2
> 
> 2) Have the application first issue MADV_SEQUENTIAL to convey that
>    the following MADV_WILLNEED is for sequential file load (so it is
>    desirable to use larger ra_pages)
> 
> This overrides the default of bdi->io_pages and _should_ provide the
> required per-file duality of control for readahead, correct?

I meant "This overrides the default of bdi->ra_pages ..." ;)
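
In application terms, suggestion 2 amounts to something like the
following (p/len here stand for the reproducer's mapping):

	madvise(p, len, MADV_SEQUENTIAL);	/* widen the per-file ra window */
	madvise(p, len, MADV_WILLNEED);		/* kick off the readahead */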
Dave Chinner Jan. 29, 2024, 10:07 p.m. UTC | #15
On Mon, Jan 29, 2024 at 04:25:41PM +0800, Ming Lei wrote:
> On Mon, Jan 29, 2024 at 04:15:16PM +1100, Dave Chinner wrote:
> > On Mon, Jan 29, 2024 at 11:57:45AM +0800, Ming Lei wrote:
> > > On Mon, Jan 29, 2024 at 12:47:41PM +1100, Dave Chinner wrote:
> > > > On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote:
> > > The current report goes as follows:
> > > 
> > > 1) userspace calls madvise(willneed, 1G)
> > > 
> > > 2) only the first part (sized from bdi->io_pages, suppose it is 2MB) is
> > > read ahead in madvise(willneed, 1G) since commit 6d2be915e589
> > > 
> > > 3) the other parts (2M ~ 1G) are read ahead in units of bdi->ra_pages,
> > > which was set to 64KB by userspace, when userspace reads the mmapped
> > > buffer, so the whole application becomes slower.
> > 
> > It gets limited by file->f_ra->ra_pages being initialised to
> > bdi->ra_pages and then never changed as the advice for access
> > methods to the file are changed.
> > 
> > But the problem here is *not the readahead code*. The problem is
> > that the user has configured the device readahead window to be far
> > smaller than is optimal for the storage. Hence readahead is slow.
> > The fix for that is to either increase the device readahead windows,
> > or to change the specific readahead window for the file that has
> > sequential access patterns.
> > 
> > Indeed, we already have that - FADV_SEQUENTIAL will set
> > file->f_ra.ra_pages to 2 * bdi->ra_pages so that readahead uses
> > larger IOs for that access.
> > 
> > That's what should happen here - MADV_WILLNEED does not imply a
> > specific access pattern so the application should be running
> > MADV_SEQUENTIAL (triggers aggressive readahead) then MADV_WILLNEED
> > to start the readahead, and then the rest of the on-demand readahead
> > will get the higher readahead limits.
> > 
> > > This patch changes 3) to use bdi->io_pages as the readahead unit.
> > 
> > I think it really should be changing MADV/FADV_SEQUENTIAL to set
> > file->f_ra.ra_pages to bdi->io_pages, not bdi->ra_pages * 2, and the
> > mem.load() implementation in the application converted to use
> > MADV_SEQUENTIAL to properly indicate its access pattern to the
> > readahead algorithm.
> 
> Here a single .ra_pages may not work; that is why this patch stores
> the willneed ranges in a maple tree. Please see the following words from
> the original RH report:

> "
> Increasing read ahead is not an option as it has a mixed I/O workload of
> random I/O and sequential I/O, so that a large read ahead is very counterproductive
> to the random I/O and is unacceptable.
> "

Yes, I've read the bug. There's no triage that tells us what the
root cause of the application performance issue might be. Just an
assertion that "this is how we did it 10 years ago, it's been
unchanged for all this time, the new kernel we are upgrading
to needs to behave exactly like pre-3.10 era kernels did."

And to be totally honest, my instincts tell me this is more likely a
problem with a root cause in poor IO scheduling decisions than a
problem with the page cache readahead implementation. Readahead has
been turned down to stop the bandwidth it uses via background async
read IO from starving latency-dependent foreground random IO
operations, and then we're being asked to turn readahead back up
in specific situations because it's actually needed for performance
in certain access patterns. This is the sort of thing bfq is
intended to solve.


> Also, almost all of these advice values (SEQUENTIAL, WILLNEED, NORMAL,
> RANDOM) ignore the passed range, and the behavior becomes all or nothing
> instead of applying only to the specified range, which may not match the
> man pages; please see 'man posix_fadvise':

The man page says:

	The advice is not binding; it merely constitutes an
	expectation on behalf of the application.

> It is even worse for the readahead() syscall:
> 
> 	```
> 	DESCRIPTION
> 	       readahead() initiates readahead on a file so that subsequent
> 	       reads from that file will be satisfied from the cache, and not
> 	       block on disk I/O (assuming the readahead was initiated early
> 	       enough and that other activity on the system did not in the
> 	       meantime flush pages from the cache).
> 	```

Yes, that's been "broken" for a long time (since the changes to cap
force_page_cache_readahead() to ra_pages way back when), but the
assumption documented about when readahead(2) will work goes to the
heart of why we don't let user controlled readahead actually do much
in the way of direct readahead. i.e. too much readahead is
typically harmful to IO and system performance and very, very few
applications actually need files preloaded entirely into memory.

----

All said, I'm starting to think that there isn't an immediate
upstream kernel change needed right now.  I just did a quick check
through the madvise() man page to see if I'd missed anything, and I
most definitely did miss what is a relatively new addition to it:

MADV_POPULATE_READ (since Linux 5.14)
     "Populate (prefault) page tables readable, faulting in all
     pages in the range just as if manually reading from each page;
     however, avoid the actual memory access that would have been
     performed after handling the fault.

     In contrast to MAP_POPULATE, MADV_POPULATE_READ does not hide
     errors, can be applied to (parts of) existing mappings and will
     always populate (prefault) page tables readable.  One
     example use case is prefaulting a file mapping, reading all
     file content from disk; however, pages won't be dirtied and
     consequently won't have to be written back to disk when
     evicting the pages from memory.

That's exactly what the application is apparently wanting
MADV_WILLNEED to do.
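
For reference, a minimal userspace sketch of that usage (my
illustration only - the path is made up, and MADV_POPULATE_READ needs
Linux 5.14+, so older libc headers may want the raw value from
<linux/mman.h>):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#ifndef MADV_POPULATE_READ
#define MADV_POPULATE_READ 22	/* uapi value, Linux >= 5.14 */
#endif

int main(void)
{
	struct stat st;
	int fd = open("/path/to/datafile", O_RDONLY);

	if (fd < 0 || fstat(fd, &st) < 0)
		return 1;

	char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);

	if (p == MAP_FAILED)
		return 1;

	/* Not a hint: fault the whole range in, reading file content
	 * from disk without dirtying any pages, or fail with a real
	 * error. */
	if (madvise(p, st.st_size, MADV_POPULATE_READ) < 0)
		perror("madvise(MADV_POPULATE_READ)");

	/* ... use the mapping ... */
	munmap(p, st.st_size);
	close(fd);
	return 0;
}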

Please read the commit message for commit 4ca9b3859dac ("mm/madvise:
introduce MADV_POPULATE_(READ|WRITE) to prefault page tables"). It
has some relevant commentary on why MADV_WILLNEED could not be
modified to meet the pre-population requirements of the applications
that required this pre-population behaviour from the kernel.

With this, I suspect that the application needs to be updated to
use MADV_POPULATE_READ rather than MADV_WILLNEED, and then we can go
back and do some analysis of the readahead behaviour of the
application and the MADV_POPULATE_READ operation. We may need to
tweak MADV_POPULATE_READ for large readahead IO, but that's OK
because it's no longer "optimistic speculation" about whether the
data is needed in cache - the operation being performed guarantees
that or it fails with an error. IOWs, MADV_POPULATE_READ is
effectively user data IO at this point, not advice about future
access patterns...

Cheers,

Dave.
Dave Chinner Jan. 29, 2024, 10:12 p.m. UTC | #16
On Mon, Jan 29, 2024 at 12:19:02PM -0500, Mike Snitzer wrote:
> While I'm sure this legacy application would love to not have to
> change its code at all, I think we can all agree that we need to just
> focus on how best to advise applications that have mixed workloads
> accomplish efficient mmap+read of both sequential and random.
> 
> To that end, I heard Dave clearly suggest 2 things:
> 
> 1) update MADV/FADV_SEQUENTIAL to set file->f_ra.ra_pages to
>    bdi->io_pages, not bdi->ra_pages * 2
> 
> 2) Have the application first issue MADV_SEQUENTIAL to convey that for
>    the following MADV_WILLNEED is for sequential file load (so it is
>    desirable to use larger ra_pages)
> 
> This overrides the default of bdi->ra_pages and _should_ provide the
> required per-file duality of control for readahead, correct?

I just discovered MADV_POPULATE_READ - see my reply to Ming
up-thread about that. The application should use that instead of
MADV_WILLNEED because it gives cache population guarantees that
WILLNEED doesn't. Then we can look at optimising the performance of
MADV_POPULATE_READ (if needed) as there is constrained scope we can
optimise within in ways that we cannot do with WILLNEED.

-Dave.
Mike Snitzer Jan. 29, 2024, 10:46 p.m. UTC | #17
On Mon, Jan 29 2024 at  5:12P -0500,
Dave Chinner <david@fromorbit.com> wrote:

> On Mon, Jan 29, 2024 at 12:19:02PM -0500, Mike Snitzer wrote:
> > While I'm sure this legacy application would love to not have to
> > change its code at all, I think we can all agree that we need to just
> > focus on how best to advise applications that have mixed workloads
> > accomplish efficient mmap+read of both sequential and random.
> > 
> > To that end, I heard Dave clearly suggest 2 things:
> > 
> > 1) update MADV/FADV_SEQUENTIAL to set file->f_ra.ra_pages to
> >    bdi->io_pages, not bdi->ra_pages * 2
> > 
> > 2) Have the application first issue MADV_SEQUENTIAL to convey that for
> >    the following MADV_WILLNEED is for sequential file load (so it is
> >    desirable to use larger ra_pages)
> > 
> > This overrides the default of bdi->ra_pages and _should_ provide the
> > required per-file duality of control for readahead, correct?
> 
> I just discovered MADV_POPULATE_READ - see my reply to Ming
> up-thread about that. The application should use that instead of
> MADV_WILLNEED because it gives cache population guarantees that
> WILLNEED doesn't. Then we can look at optimising the performance of
> MADV_POPULATE_READ (if needed) as there is constrained scope we can
> optimise within in ways that we cannot do with WILLNEED.

Nice find! Given commit 4ca9b3859dac ("mm/madvise: introduce
MADV_POPULATE_(READ|WRITE) to prefault page tables"), I've cc'd David
Hildenbrand just so he's in the loop.

FYI, I proactively raised feedback and questions to the reporter of
this issue:
 
CONTEXT: madvise(WILLNEED) doesn't convey the nature of the access,
sequential vs random, just the range that may be accessed.
 
Q1: Is your application's sequential vs random (or smaller sequential)
access split on a per-file basis?  Or is the same file accessed both
sequentially and randomly?
 
  A1: The same files can be accessed either randomly or sequentially,
  depending on certain access patterns and optimizing logic.
 
Q2: Can the application be changed to use madvise() MADV_SEQUENTIAL
and MADV_RANDOM to indicate its access pattern?
 
  A2: No, the application is a Java application. Java does not expose
  MADVISE API directly. Our application uses Java NIO API via
  MappedByteBuffer.load()
  (https://docs.oracle.com/javase/8/docs/api/java/nio/MappedByteBuffer.html#load--)
  that calls madvise(MADV_WILLNEED) at the low level. There is no way for us
  to switch this behavior, but we take advantage of this behavior to
  optimize large file sequential I/O with great success.
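
FWIW, my reading of the OpenJDK sources is that
MappedByteBuffer.load() boils down to roughly this native sequence,
with the advice value hard-wired (a simplified sketch, not verbatim
JDK code):

#include <sys/mman.h>
#include <unistd.h>

static void jdk_style_load(volatile char *addr, size_t len)
{
	size_t ps = (size_t)sysconf(_SC_PAGESIZE);

	/* the only advice Java NIO ever issues here */
	madvise((void *)addr, len, MADV_WILLNEED);

	/* ...then touch one byte per page to fault it all in */
	for (size_t off = 0; off < len; off += ps)
		(void)addr[off];
}

In other words the application can trigger WILLNEED, but has no way
to pair it with MADV_SEQUENTIAL or to swap it for MADV_POPULATE_READ
from pure Java.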
 
So it's looking like it'll be hard to help this reporter avoid
changes... but that's not upstream's problem!

Mike
Ming Lei Jan. 30, 2024, 3:13 a.m. UTC | #18
On Tue, Jan 30, 2024 at 09:07:24AM +1100, Dave Chinner wrote:
> On Mon, Jan 29, 2024 at 04:25:41PM +0800, Ming Lei wrote:
> > On Mon, Jan 29, 2024 at 04:15:16PM +1100, Dave Chinner wrote:
> > > On Mon, Jan 29, 2024 at 11:57:45AM +0800, Ming Lei wrote:
> > > > On Mon, Jan 29, 2024 at 12:47:41PM +1100, Dave Chinner wrote:
> > > > > On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote:
> > > > The current report is as follows:
> > > >
> > > > 1) userspace calls madvise(willneed, 1G)
> > > >
> > > > 2) only the 1st part (size is from bdi->io_pages, suppose it is 2MB) is
> > > > read ahead in madvise(willneed, 1G) since commit 6d2be915e589
> > > >
> > > > 3) the other parts (2M ~ 1G) are read ahead in units of bdi->ra_pages,
> > > > which is set as 64KB by userspace, when userspace reads the mmapped
> > > > buffer, so the whole application becomes slower.
> > > 
> > > It gets limited by file->f_ra->ra_pages being initialised to
> > > bdi->ra_pages and then never changed as the advice for access
> > > methods to the file is changed.
> > > 
> > > But the problem here is *not the readahead code*. The problem is
> > > that the user has configured the device readahead window to be far
> > > smaller than is optimal for the storage. Hence readahead is slow.
> > > The fix for that is to either increase the device readahead windows,
> > > or to change the specific readahead window for the file that has
> > > sequential access patterns.
> > > 
> > > Indeed, we already have that - FADV_SEQUENTIAL will set
> > > file->f_ra.ra_pages to 2 * bdi->ra_pages so that readahead uses
> > > larger IOs for that access.
> > > 
> > > That's what should happen here - MADV_WILLNEED does not imply a
> > > specific access pattern so the application should be running
> > > MADV_SEQUENTIAL (triggers aggressive readahead) then MADV_WILLNEED
> > > to start the readahead, and then the rest of the on-demand readahead
> > > will get the higher readahead limits.
> > > 
> > > > This patch changes 3) to use bdi->io_pages as readahead unit.
> > > 
> > > I think it really should be changing MADV/FADV_SEQUENTIAL to set
> > > file->f_ra.ra_pages to bdi->io_pages, not bdi->ra_pages * 2, and the
> > > mem.load() implementation in the application converted to use
> > > MADV_SEQUENTIAL to properly indicate its access pattern to the
> > > readahead algorithm.
> > 
> > Here the single .ra_pages may not work, that is why this patch stores
> > the willneed range in maple tree, please see the following words from
> > the original RH report:
> 
> > "
> > Increasing read ahead is not an option as it has a mixed I/O workload of
> > random I/O and sequential I/O, so that a large read ahead is very counterproductive
> > to the random I/O and is unacceptable.
> > "
> 
> Yes, I've read the bug. There's no triage that tells us what the
> root cause of the application performance issue might be. Just an
> assertion that "this is how we did it 10 years ago, it's been
> unchanged for all this time, the new kernel we are upgrading
> to needs to behave exactly like pre-3.10 era kernels did."
> 
> And to be totally honest, my instincts tell me this is more likely a
> problem with a root cause in poor IO scheduling decisions than a
> problem with the page cache readahead implementation. Readahead has
> been turned down to stop the bandwidth it uses via background async
> read IO from starving latency dependent foreground random IO
> operation, and then we're being asked to turn readahead back up
> in specific situations because it's actually needed for performance
> in certain access patterns. This is the sort of thing bfq is
> intended to solve.

Reading an mmapped buffer in userspace is sync IO, and a page fault
only reads ahead 64KB. I don't understand how the block IO scheduler
makes a difference to this single 64KB readahead in the cache-miss
case.
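
For reference, the sizing in question (simplified from
do_sync_mmap_readahead() in mm/filemap.c, matching the hunk in the
patch below): a page cache miss triggers one synchronous readahead of
ra_pages pages, so with ra_pages = 16 and 4KB pages that is a single
64KB IO:

	ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
	ra->size = ra->ra_pages;
	ra->async_size = ra->ra_pages / 4;
	ractl._index = ra->start;
	page_cache_ra_order(&ractl, ra, 0);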

> 
> 
> > Also almost all these advice hints (SEQUENTIAL, WILLNEED, NORMAL, RANDOM)
> > ignore the passed range, and the behavior becomes all or nothing,
> > instead of something only for the specified range, which may not
> > match the man page; please see 'man posix_fadvise':
> 
> The man page says:
> 
> 	The advice is not binding; it merely constitutes an
> 	expectation on behalf of the application.
> 
> > It is even worse for readahead() syscall:
> > 
> > 	``` DESCRIPTION readahead()  initiates readahead on a file
> > 	so that subsequent reads from that file will be satisfied
> > 	from the cache, and not block on disk I/O (assuming the
> > 	readahead was initiated early enough and that other activity
> > 	on the system did not in the meantime flush pages from the
> > 	cache).  ```
> 
> Yes, that's been "broken" for a long time (since the changes to cap
> force_page_cache_readahead() to ra_pages way back when), but the
> assumption documented about when readahead(2) will work goes to the
> heart of why we don't let user controlled readahead actually do much
> in the way of direct readahead. i.e. too much readahead is
> typically harmful to IO and system performance and very, very few
> applications actually need files preloaded entirely into memory.

That is true for normal readahead, but I am not sure it holds for
madvise(WILLNEED) or readahead().

> 
> ----
> 
> All said, I'm starting to think that there isn't an immediate
> upstream kernel change needed right now.  I just did a quick check
> through the madvise() man page to see if I'd missed anything, and I
> most definitely did miss what is a relatively new addition to it:
> 
> MADV_POPULATE_READ (since Linux 5.14)
>      "Populate (prefault) page tables readable, faulting in all
>      pages in the range just as if manually reading from each page;
>      however, avoid the actual memory access that would have been
>      performed after handling the fault.
> 
>      In contrast to MAP_POPULATE, MADV_POPULATE_READ does not hide
>      errors, can be applied to (parts of) existing mappings and will
>      always populate (prefault) page tables readable.  One
>      example use case is prefaulting a file mapping, reading all
>      file content from disk; however, pages won't be dirtied and
>      consequently won't have to be written back to disk when
>      evicting the pages from memory.
> 
> That's exactly what the application is apparently wanting
> MADV_WILLNEED to do.

Indeed, it works as expected (all mmapped pages are loaded by
madvise(MADV_POPULATE_READ)) in my test code, except for the 16-page
ra_pages limit, but that is less important now.

Thanks for this idea!
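
A reconstruction of that kind of test (a sketch, not the exact test
code; the path is made up) - mmap the file, populate it, then let
mincore() confirm residency:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#ifndef MADV_POPULATE_READ
#define MADV_POPULATE_READ 22	/* Linux >= 5.14 */
#endif

int main(void)
{
	struct stat st;
	int fd = open("/path/to/datafile", O_RDONLY);

	if (fd < 0 || fstat(fd, &st) < 0)
		return 1;

	size_t ps = (size_t)sysconf(_SC_PAGESIZE);
	size_t pages = (st.st_size + ps - 1) / ps;
	char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	unsigned char *vec = malloc(pages);

	if (p == MAP_FAILED || !vec)
		return 1;

	if (madvise(p, st.st_size, MADV_POPULATE_READ) < 0)
		perror("madvise(MADV_POPULATE_READ)");

	/* one byte per page; bit 0 set means the page is resident */
	if (mincore(p, st.st_size, vec) < 0)
		return 1;

	size_t resident = 0;
	for (size_t i = 0; i < pages; i++)
		resident += vec[i] & 1;
	printf("%zu/%zu pages resident\n", resident, pages);
	return 0;
}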

> 
> Please read the commit message for commit 4ca9b3859dac ("mm/madvise:
> introduce MADV_POPULATE_(READ|WRITE) to prefault page tables"). It
> has some relevant commentary on why MADV_WILLNEED could not be
> modified to meet the pre-population requirements of the applications
> that required this pre-population behaviour from the kernel.
> 
> With this, I suspect that the application needs to be updated to
> use MADV_POPULATE_READ rather than MADV_WILLNEED, and then we can go
> back and do some analysis of the readahead behaviour of the
> application and the MADV_POPULATE_READ operation. We may need to
> tweak MADV_POPULATE_READ for large readahead IO, but that's OK
> because it's no longer "optimistic speculation" about whether the
> data is needed in cache - the operation being performed guarantees
> that or it fails with an error. IOWs, MADV_POPULATE_READ is
> effectively user data IO at this point, not advice about future
> access patterns...

BTW, in this report MADV_WILLNEED is issued by the Java library[1], and I
guess it could be difficult to switch it to MADV_POPULATE_READ.

[1] https://docs.oracle.com/javase/8/docs/api/java/nio/MappedByteBuffer.html



Thanks,
Ming
Dave Chinner Jan. 30, 2024, 5:29 a.m. UTC | #19
On Tue, Jan 30, 2024 at 11:13:50AM +0800, Ming Lei wrote:
> On Tue, Jan 30, 2024 at 09:07:24AM +1100, Dave Chinner wrote:
> > On Mon, Jan 29, 2024 at 04:25:41PM +0800, Ming Lei wrote:
> > > On Mon, Jan 29, 2024 at 04:15:16PM +1100, Dave Chinner wrote:
> > > > On Mon, Jan 29, 2024 at 11:57:45AM +0800, Ming Lei wrote:
> > > > > On Mon, Jan 29, 2024 at 12:47:41PM +1100, Dave Chinner wrote:
> > > > > > On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote:
> > > > > The current report is as follows:
> > > > >
> > > > > 1) userspace calls madvise(willneed, 1G)
> > > > >
> > > > > 2) only the 1st part (size is from bdi->io_pages, suppose it is 2MB) is
> > > > > read ahead in madvise(willneed, 1G) since commit 6d2be915e589
> > > > >
> > > > > 3) the other parts (2M ~ 1G) are read ahead in units of bdi->ra_pages,
> > > > > which is set as 64KB by userspace, when userspace reads the mmapped
> > > > > buffer, so the whole application becomes slower.
> > > > 
> > > > It gets limited by file->f_ra->ra_pages being initialised to
> > > > bdi->ra_pages and then never changed as the advice for access
> > > > methods to the file is changed.
> > > > 
> > > > But the problem here is *not the readahead code*. The problem is
> > > > that the user has configured the device readahead window to be far
> > > > smaller than is optimal for the storage. Hence readahead is slow.
> > > > The fix for that is to either increase the device readahead windows,
> > > > or to change the specific readahead window for the file that has
> > > > sequential access patterns.
> > > > 
> > > > Indeed, we already have that - FADV_SEQUENTIAL will set
> > > > file->f_ra.ra_pages to 2 * bdi->ra_pages so that readahead uses
> > > > larger IOs for that access.
> > > > 
> > > > That's what should happen here - MADV_WILLNEED does not imply a
> > > > specific access pattern so the application should be running
> > > > MADV_SEQUENTIAL (triggers aggressive readahead) then MADV_WILLNEED
> > > > to start the readahead, and then the rest of the on-demand readahead
> > > > will get the higher readahead limits.
> > > > 
> > > > > This patch changes 3) to use bdi->io_pages as readahead unit.
> > > > 
> > > > I think it really should be changing MADV/FADV_SEQUENTIAL to set
> > > > file->f_ra.ra_pages to bdi->io_pages, not bdi->ra_pages * 2, and the
> > > > mem.load() implementation in the application converted to use
> > > > MADV_SEQUENTIAL to properly indicate its access pattern to the
> > > > readahead algorithm.
> > > 
> > > Here the single .ra_pages may not work, that is why this patch stores
> > > the willneed range in maple tree, please see the following words from
> > > the original RH report:
> > 
> > > "
> > > Increasing read ahead is not an option as it has a mixed I/O workload of
> > > random I/O and sequential I/O, so that a large read ahead is very counterproductive
> > > to the random I/O and is unacceptable.
> > > "
> > 
> > Yes, I've read the bug. There's no triage that tells us what the
> > root cause of the application performance issue might be. Just an
> > assertion that "this is how we did it 10 years ago, it's been
> > unchanged for all this time, the new kernel we are upgrading
> > to needs to behave exactly like pre-3.10 era kernels did."
> > 
> > And to be totally honest, my instincts tell me this is more likely a
> > problem with a root cause in poor IO scheduling decisions than a
> > problem with the page cache readahead implementation. Readahead has
> > been turned down to stop the bandwidth it uses via background async
> > read IO from starving latency dependent foreground random IO
> > operation, and then we're being asked to turn readahead back up
> > in specific situations because it's actually needed for performance
> > in certain access patterns. This is the sort of thing bfq is
> > intended to solve.
> 
> Reading an mmapped buffer in userspace is sync IO, and a page fault
> only reads ahead 64KB. I don't understand how the block IO scheduler
> makes a difference to this single 64KB readahead in the cache-miss
> case.

I think you've misunderstood what I said. I was referring to the
original customer problem of "too much readahead IO causes problems
for latency sensitive IO" that led to the customer setting
64kB readahead device limits in the first place.

That is, if reducing readahead for sequential IO suddenly makes
synchronous random IO perform a whole lot better and the application
goes faster, then it indicates the problem is IO dispatch
prioritisation, not that there is too much readahead. Deprioritising
readahead will reduce its impact on other IO, without having to
reduce the readahead windows that provide decent sequential IO
performance...

I really think the customer needs to retune their application from
first principles. Start with the defaults, measure where things are
slow, address the worst issue by twiddling knobs. Repeat until
performance is either good enough or they hit on actual problems
that need code changes.

> > > It is even worse for readahead() syscall:
> > > 
> > > 	``` DESCRIPTION readahead()  initiates readahead on a file
> > > 	so that subsequent reads from that file will be satisfied
> > > 	from the cache, and not block on disk I/O (assuming the
> > > 	readahead was initiated early enough and that other activity
> > > 	on the system did not in the meantime flush pages from the
> > > 	cache).  ```
> > 
> > Yes, that's been "broken" for a long time (since the changes to cap
> > force_page_cache_readahead() to ra_pages way back when), but the
> > assumption documented about when readahead(2) will work goes to the
> > heart of why we don't let user controlled readahead actually do much
> > in the way of direct readahead. i.e. too much readahead is
> > typically harmful to IO and system performance and very, very few
> > applications actually need files preloaded entirely into memory.
> 
> That is true for normal readahead, but I am not sure it holds for
> madvise(WILLNEED) or readahead().

If we allowed unbound readahead via WILLNEED or readahead(2), then
a user can DOS the storage and/or the memory allocation subsystem
very easily.

In a previous attempt to revert the current WILLNEED readahead
bounding behaviour changes, Linus said this:

"It's just that historically we've had
some issues with people over-doing readahead (because it often helps
some made-up microbenchmark), and then we end up with latency issues
when somebody does a multi-gigabyte readahead... Iirc, we had exactly
that problem with the readahead() system call at some point (long
ago)."

https://lore.kernel.org/linux-mm/CA+55aFy8kOomnL-C5GwSpHTn+g5R7dY78C9=h-J_Rb_u=iASpg@mail.gmail.com/

Elsewhere in a different thread for a different patchset to try to
revert this readahead behaviour, Linus ranted about how it allowed
unbound, unkillable user controlled readahead for 64-bit data
lengths.

Fundamentally, readahead is not functionality we want to expose
directly to user control. MADV_POPULATE_* is different in that it
isn't actually readahead - it works more like normal sequential user
page fault access. It is interruptible, it can fail due to ENOMEM or
OOM-kill, it can fail on IO errors, etc. IOWs, the MADV_POPULATE
functions are what the application should be using, not trying to
hack WILLNEED to do stuff that MADV_POPULATE* already does in a
better way...

> > Please read the commit message for commit 4ca9b3859dac ("mm/madvise:
> > introduce MADV_POPULATE_(READ|WRITE) to prefault page tables"). It
> > has some relevant commentary on why MADV_WILLNEED could not be
> > modified to meet the pre-population requirements of the applications
> > that required this pre-population behaviour from the kernel.
> > 
> > With this, I suspect that the application needs to be updated to
> > use MADV_POPULATE_READ rather than MADV_WILLNEED, and then we can go
> > back and do some analysis of the readahead behaviour of the
> > application and the MADV_POPULATE_READ operation. We may need to
> > tweak MADV_POPULATE_READ for large readahead IO, but that's OK
> > because it's no longer "optimistic speculation" about whether the
> > data is needed in cache - the operation being performed guarantees
> > that or it fails with an error. IOWs, MADV_POPULATE_READ is
> > effectively user data IO at this point, not advice about future
> > access patterns...
> 
> BTW, in this report MADV_WILLNEED is issued by the Java library[1], and I
> guess it could be difficult to switch it to MADV_POPULATE_READ.

Yes, but that's not an upstream kernel code development problem.
That's a problem for the people paying $$$$$ to their software
vendor to sort out.

-Dave.
David Hildenbrand Jan. 30, 2024, 10:43 a.m. UTC | #20
On 29.01.24 23:46, Mike Snitzer wrote:
> On Mon, Jan 29 2024 at  5:12P -0500,
> Dave Chinner <david@fromorbit.com> wrote:
> 
>> On Mon, Jan 29, 2024 at 12:19:02PM -0500, Mike Snitzer wrote:
>>> While I'm sure this legacy application would love to not have to
>>> change its code at all, I think we can all agree that we need to just
>>> focus on how best to advise applications that have mixed workloads
>>> accomplish efficient mmap+read of both sequential and random.
>>>
>>> To that end, I heard Dave clearly suggest 2 things:
>>>
>>> 1) update MADV/FADV_SEQUENTIAL to set file->f_ra.ra_pages to
>>>     bdi->io_pages, not bdi->ra_pages * 2
>>>
>>> 2) Have the application first issue MADV_SEQUENTIAL to convey that for
>>>     the following MADV_WILLNEED is for sequential file load (so it is
>>>     desirable to use larger ra_pages)
>>>
>>> This overrides the default of bdi->ra_pages and _should_ provide the
>>> required per-file duality of control for readahead, correct?
>>
>> I just discovered MADV_POPULATE_READ - see my reply to Ming
>> up-thread about that. The application should use that instead of
>> MADV_WILLNEED because it gives cache population guarantees that
>> WILLNEED doesn't. Then we can look at optimising the performance of
>> MADV_POPULATE_READ (if needed) as there is constrained scope we can
>> optimise within in ways that we cannot do with WILLNEED.
> 
> Nice find! Given commit 4ca9b3859dac ("mm/madvise: introduce
> MADV_POPULATE_(READ|WRITE) to prefault page tables"), I've cc'd David
> Hildenbrand just so he's in the loop.

Thanks for CCing me.

MADV_POPULATE_READ is indeed different; it doesn't give hints (not 
"might be a good idea to read some pages" like MADV_WILLNEED documents), 
it forces the swapin/read/etc. to actually happen.

In a sense, MADV_POPULATE_READ is similar to simply reading one byte
from each PTE, triggering page faults, but without actually reading
from the target pages.

MADV_POPULATE_READ has a conceptual benefit: we know exactly how much 
memory user space wants to have populated (which range). In contrast, 
page faults contain no such hints and we have to guess based on 
historical behavior. One could use that range information to *not* do 
any faultaround/readahead when we come via MADV_POPULATE_READ, and 
really only populate the range of interest.

Further, one can use that range information to allocate larger folios, 
without having to guess where placement of a large folio is reasonable, 
and which size we should use.

> 
> FYI, I proactively raised feedback and questions to the reporter of
> this issue:
>   
> CONTEXT: madvise(WILLNEED) doesn't convey the nature of the access,
> sequential vs random, just the range that may be accessed.

Indeed. The "problem" with MADV_SEQUENTIAL/MADV_RANDOM is that it will 
fragment/split VMAs. So applying it to smaller chunks (like one would do 
with MADV_WILLNEED) is likely not a good option.
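
To make the fragmentation concrete, a small demo (a sketch, sizes
arbitrary): after the madvise() call the single anonymous mapping
shows up as three VMAs in /proc/<pid>/maps, because the middle chunk
now carries different vm_flags:

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 16UL << 20;	/* one 16MB mapping */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	/* mark only the middle 4MB as random access; the VMA is
	 * split into [0,6M) [6M,10M) [10M,16M) */
	madvise(p + (6UL << 20), 4UL << 20, MADV_RANDOM);

	printf("inspect /proc/%d/maps\n", (int)getpid());
	pause();
	return 0;
}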
Ming Lei Jan. 30, 2024, 11:34 a.m. UTC | #21
On Tue, Jan 30, 2024 at 04:29:35PM +1100, Dave Chinner wrote:
> On Tue, Jan 30, 2024 at 11:13:50AM +0800, Ming Lei wrote:
> > On Tue, Jan 30, 2024 at 09:07:24AM +1100, Dave Chinner wrote:
> > > On Mon, Jan 29, 2024 at 04:25:41PM +0800, Ming Lei wrote:
> > > > On Mon, Jan 29, 2024 at 04:15:16PM +1100, Dave Chinner wrote:
> > > > > On Mon, Jan 29, 2024 at 11:57:45AM +0800, Ming Lei wrote:
> > > > > > On Mon, Jan 29, 2024 at 12:47:41PM +1100, Dave Chinner wrote:
> > > > > > > On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote:
> > > > > > The current report is as follows:
> > > > > >
> > > > > > 1) userspace calls madvise(willneed, 1G)
> > > > > >
> > > > > > 2) only the 1st part (size is from bdi->io_pages, suppose it is 2MB) is
> > > > > > read ahead in madvise(willneed, 1G) since commit 6d2be915e589
> > > > > >
> > > > > > 3) the other parts (2M ~ 1G) are read ahead in units of bdi->ra_pages,
> > > > > > which is set as 64KB by userspace, when userspace reads the mmapped
> > > > > > buffer, so the whole application becomes slower.
> > > > > 
> > > > > It gets limited by file->f_ra->ra_pages being initialised to
> > > > > bdi->ra_pages and then never changed as the advice for access
> > > > > methods to the file is changed.
> > > > > 
> > > > > But the problem here is *not the readahead code*. The problem is
> > > > > that the user has configured the device readahead window to be far
> > > > > smaller than is optimal for the storage. Hence readahead is slow.
> > > > > The fix for that is to either increase the device readahead windows,
> > > > > or to change the specific readahead window for the file that has
> > > > > sequential access patterns.
> > > > > 
> > > > > Indeed, we already have that - FADV_SEQUENTIAL will set
> > > > > file->f_ra.ra_pages to 2 * bdi->ra_pages so that readahead uses
> > > > > larger IOs for that access.
> > > > > 
> > > > > That's what should happen here - MADV_WILLNEED does not imply a
> > > > > specific access pattern so the application should be running
> > > > > MADV_SEQUENTIAL (triggers aggressive readahead) then MADV_WILLNEED
> > > > > to start the readahead, and then the rest of the on-demand readahead
> > > > > will get the higher readahead limits.
> > > > > 
> > > > > > This patch changes 3) to use bdi->io_pages as readahead unit.
> > > > > 
> > > > > I think it really should be changing MADV/FADV_SEQUENTIAL to set
> > > > > file->f_ra.ra_pages to bdi->io_pages, not bdi->ra_pages * 2, and the
> > > > > mem.load() implementation in the application converted to use
> > > > > MADV_SEQUENTIAL to properly indicate its access pattern to the
> > > > > readahead algorithm.
> > > > 
> > > > Here the single .ra_pages may not work, that is why this patch stores
> > > > the willneed range in maple tree, please see the following words from
> > > > the original RH report:
> > > 
> > > > "
> > > > Increasing read ahead is not an option as it has a mixed I/O workload of
> > > > random I/O and sequential I/O, so that a large read ahead is very counterproductive
> > > > to the random I/O and is unacceptable.
> > > > "
> > > 
> > > Yes, I've read the bug. There's no triage that tells us what the
> > > root cause of the application performance issue might be. Just an
> > > assertion that "this is how we did it 10 years ago, it's been
> > > unchanged for all this time, the new kernel we are upgrading
> > > to needs to behave exactly like pre-3.10 era kernels did."
> > > 
> > > And to be totally honest, my instincts tell me this is more likely a
> > > problem with a root cause in poor IO scheduling decisions than a
> > > problem with the page cache readahead implementation. Readahead has
> > > been turned down to stop the bandwidth it uses via background async
> > > read IO from starving latency dependent foreground random IO
> > > operation, and then we're being asked to turn readahead back up
> > > in specific situations because it's actually needed for performance
> > > in certain access patterns. This is the sort of thing bfq is
> > > intended to solve.
> > 
> > Reading an mmapped buffer in userspace is sync IO, and a page fault
> > only reads ahead 64KB. I don't understand how the block IO scheduler
> > makes a difference to this single 64KB readahead in the cache-miss
> > case.
> 
> I think you've misunderstood what I said. I was referring to the
> original customer problem of "too much readahead IO causes problems
> for latency sensitive IO" that led to the customer setting
> 64kB readahead device limits in the first place.

It looks like we are not on the same page; I never saw the words "latency
sensitive IO" in this report (RHEL-22476).

> 
> That is, if reducing readahead for sequential IO suddenly makes
> synchronous random IO perform a whole lot better and the application
> goes faster, then it indicates the problem is IO dispatch
> prioritisation, not that there is too much readahead. Deprioritising
> readahead will reduce its impact on other IO, without having to
> reduce the readahead windows that provide decent sequential IO
> performance...
> 
> I really think the customer needs to retune their application from
> first principles. Start with the defaults, measure where things are
> slow, address the worst issue by twiddling knobs. Repeat until
> performance is either good enough or they hit on actual problems
> that need code changes.

IO priority is set at the blkcg/process level, and we don't even know whether
the random IO and sequential IO are submitted from different processes.

Also, IO priority only applies when IOs with different priorities are
submitted concurrently.

The main input from the report is that iostat shows the read IO request size
reduced from 1MB to 64K, which isn't something IO priority can deal with.

From my understanding, the problem here is that the advice hints (MADV_RANDOM,
MADV_SEQUENTIAL, MADV_WILLNEED) are basically applied at the file level instead
of the range level, even though a range is passed in to these syscalls.

So sequential and random advice are effectively exclusive for the whole file.
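
For the record, the range arguments really are ignored for those
cases - a simplified sketch of the switch in mm/fadvise.c (trimmed,
not verbatim; locking dropped):

	switch (advice) {
	case POSIX_FADV_NORMAL:
		file->f_ra.ra_pages = bdi->ra_pages;	/* whole file */
		file->f_mode &= ~FMODE_RANDOM;
		break;
	case POSIX_FADV_SEQUENTIAL:
		file->f_ra.ra_pages = bdi->ra_pages * 2;/* whole file */
		file->f_mode &= ~FMODE_RANDOM;
		break;
	case POSIX_FADV_RANDOM:
		file->f_mode |= FMODE_RANDOM;		/* whole file */
		break;
	case POSIX_FADV_WILLNEED:
		/* the one range-aware case: a single bounded
		 * readahead over [offset, offset + len) */
		break;
	}

madvise() does record VM_RAND_READ/VM_SEQ_READ per VMA, but
file->f_ra is still one piece of per-open-file state.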

That is why the customer doesn't want to set a bigger ra_pages: they are
afraid (or have tested) that a bigger ra_pages hurts the performance of the
random IO workload because unnecessary data may be read ahead. The readahead
algorithm is supposed to be capable of dealing with that, but maybe it still
doesn't do so well enough.

But yes, more details are needed to understand the issue further.

thanks, 
Ming
diff mbox series

Patch

diff --git a/fs/file_table.c b/fs/file_table.c
index b991f90571b4..bb0303683305 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -61,8 +61,18 @@  struct path *backing_file_user_path(struct file *f)
 }
 EXPORT_SYMBOL_GPL(backing_file_user_path);
 
+static inline void file_ra_free(struct file_ra_state *ra)
+{
+	if (ra->need_mt) {
+		mtree_destroy(ra->need_mt);
+		kfree(ra->need_mt);
+		ra->need_mt = NULL;
+	}
+}
+
 static inline void file_free(struct file *f)
 {
+	file_ra_free(&f->f_ra);
 	security_file_free(f);
 	if (likely(!(f->f_mode & FMODE_NOACCOUNT)))
 		percpu_counter_dec(&nr_files);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ed5966a70495..bdbd16990072 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -43,6 +43,7 @@ 
 #include <linux/cred.h>
 #include <linux/mnt_idmapping.h>
 #include <linux/slab.h>
+#include <linux/maple_tree.h>
 
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
@@ -961,6 +962,7 @@  struct fown_struct {
  * @ra_pages: Maximum size of a readahead request, copied from the bdi.
  * @mmap_miss: How many mmap accesses missed in the page cache.
  * @prev_pos: The last byte in the most recent read request.
+ * @need_mt: maple tree for tracking WILL_NEED ranges
  *
  * When this structure is passed to ->readahead(), the "most recent"
  * readahead means the current readahead.
@@ -972,6 +974,7 @@  struct file_ra_state {
 	unsigned int ra_pages;
 	unsigned int mmap_miss;
 	loff_t prev_pos;
+	struct maple_tree *need_mt;
 };
 
 /*
@@ -983,6 +986,18 @@  static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
 		index <  ra->start + ra->size);
 }
 
+/*
+ * Check if @index falls in the madvise/fadvise WILLNEED window.
+ */
+static inline bool ra_index_in_need_range(struct file_ra_state *ra,
+		pgoff_t index)
+{
+	if (ra->need_mt)
+		return mtree_load(ra->need_mt, index) != NULL;
+
+	return false;
+}
+
 /*
  * f_{lock,count,pos_lock} members can be highly contended and share
  * the same cacheline. f_{lock,mode} are very frequently used together
diff --git a/mm/filemap.c b/mm/filemap.c
index 750e779c23db..0ffe63d58421 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3147,7 +3147,10 @@  static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	 */
 	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
 	ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
-	ra->size = ra->ra_pages;
+	if (ra_index_in_need_range(ra, vmf->pgoff))
+		ra->size = inode_to_bdi(mapping->host)->io_pages;
+	else
+		ra->size = ra->ra_pages;
 	ra->async_size = ra->ra_pages / 4;
 	ractl._index = ra->start;
 	page_cache_ra_order(&ractl, ra, 0);
diff --git a/mm/internal.h b/mm/internal.h
index f309a010d50f..17bd970ff23c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -120,13 +120,18 @@  void unmap_page_range(struct mmu_gather *tlb,
 			     unsigned long addr, unsigned long end,
 			     struct zap_details *details);
 
+void file_ra_add_need_range(struct file_ra_state *ra, pgoff_t start,
+		pgoff_t end);
 void page_cache_ra_order(struct readahead_control *, struct file_ra_state *,
 		unsigned int order);
 void force_page_cache_ra(struct readahead_control *, unsigned long nr);
 static inline void force_page_cache_readahead(struct address_space *mapping,
 		struct file *file, pgoff_t index, unsigned long nr_to_read)
 {
-	DEFINE_READAHEAD(ractl, file, &file->f_ra, mapping, index);
+	struct file_ra_state *ra = &file->f_ra;
+	DEFINE_READAHEAD(ractl, file, ra, mapping, index);
+
+	file_ra_add_need_range(ra, index, index + nr_to_read);
 	force_page_cache_ra(&ractl, nr_to_read);
 }
 
diff --git a/mm/readahead.c b/mm/readahead.c
index 23620c57c122..0882ceecf9ff 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -140,9 +140,38 @@  file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping)
 {
 	ra->ra_pages = inode_to_bdi(mapping->host)->ra_pages;
 	ra->prev_pos = -1;
+	ra->need_mt = NULL;
 }
 EXPORT_SYMBOL_GPL(file_ra_state_init);
 
+static void file_ra_setup_need_mt(struct file_ra_state *ra)
+{
+	struct maple_tree *mt = kzalloc(sizeof(*mt), GFP_KERNEL);
+
+	if (!mt)
+		return;
+
+	mt_init(mt);
+	if (cmpxchg(&ra->need_mt, NULL, mt) != NULL)
+		kfree(mt);
+}
+
+/* Maintain one willneed range hint for speedup readahead */
+void file_ra_add_need_range(struct file_ra_state *ra, pgoff_t start,
+		pgoff_t end)
+{
+	/* ignore small willneed range */
+	if (end - start < 4 * ra->ra_pages)
+		return;
+
+	if (!ra->need_mt)
+		file_ra_setup_need_mt(ra);
+
+	if (ra->need_mt)
+		mtree_insert_range(ra->need_mt, start, end, (void *)1,
+				GFP_KERNEL);
+}
+
 static void read_pages(struct readahead_control *rac)
 {
 	const struct address_space_operations *aops = rac->mapping->a_ops;
@@ -552,9 +581,10 @@  static void ondemand_readahead(struct readahead_control *ractl,
 {
 	struct backing_dev_info *bdi = inode_to_bdi(ractl->mapping->host);
 	struct file_ra_state *ra = ractl->ra;
-	unsigned long max_pages = ra->ra_pages;
 	unsigned long add_pages;
 	pgoff_t index = readahead_index(ractl);
+	unsigned long max_pages = ra_index_in_need_range(ra, index) ?
+		bdi->io_pages : ra->ra_pages;
 	pgoff_t expected, prev_index;
 	unsigned int order = folio ? folio_order(folio) : 0;