
mm/readahead: Fix large folio support in async readahead

Message ID 20241106092114.8408-1-laoar.shao@gmail.com (mailing list archive)
State New
Series mm/readahead: Fix large folio support in async readahead

Commit Message

Yafang Shao Nov. 6, 2024, 9:21 a.m. UTC
When testing large folio support with XFS on our servers, we observed that
only a few large folios are mapped when reading large files via mmap.
After a thorough analysis, I identified that it was caused by the
`/sys/block/*/queue/read_ahead_kb` setting. On our test servers, this
parameter is set to 128KB. After tuning it to 2MB, large folios work as
expected. However, I believe the large folio behavior should not depend
on the value of read_ahead_kb; it would be more robust if the kernel
could automatically adapt to it.

With `/sys/block/*/queue/read_ahead_kb` set to a non-2MB aligned size,
this issue can be verified with a simple test case, as shown below:

      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/mman.h>
      #include <unistd.h>

      #define LEN (1024 * 1024 * 1024) // 1GB file
      int main(int argc, char *argv[])
      {
          char *addr;
          int fd, i;

          fd = open("data", O_RDWR);
          if (fd < 0) {
              perror("open");
              exit(-1);
          }

          addr = mmap(NULL, LEN, PROT_READ|PROT_WRITE,
                      MAP_SHARED, fd, 0);
          if (addr == MAP_FAILED) {
              perror("mmap");
              exit(-1);
          }

          if (madvise(addr, LEN, MADV_HUGEPAGE)) {
              perror("madvise");
              exit(-1);
          }

          for (i = 0; i < LEN / 4096; i++)
                memset(addr + i * 4096, 1, 1);

          while (1) {} // Verifiable with /proc/meminfo

          munmap(addr, LEN);
          close(fd);
          exit(0);
      }
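
Note that the program expects a preexisting 1GB file named `data` in the
current directory, since open() is called without O_CREAT. Once the write
loop finishes, large folio usage can be checked via fields such as
FileHugePages and FilePmdMapped in /proc/meminfo.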

When large folio support is enabled and read_ahead_kb is set to a smaller
value, ra->size (4MB) may exceed the maximum allowed size (e.g., 128KB). To
address this, we need to add a conditional check for such cases. However,
this alone is insufficient, as users might set read_ahead_kb to a larger,
non-hugepage-aligned value (e.g., 4MB + 128KB). In these instances, it is
essential to explicitly align ra->size with the hugepage size.
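
To make the arithmetic concrete, below is a minimal user-space sketch
mirroring get_next_ra_size() as patched in the diff at the end of this
page (the numbers assume read_ahead_kb=128 and the 4MB window primed by
the MADV_HUGEPAGE fault path; next_ra_size is a stand-in name used only
for illustration):

      #include <stdio.h>

      /* Mirror of the patched get_next_ra_size() logic; sizes in 4KB pages. */
      static unsigned long next_ra_size(unsigned long cur, unsigned long max)
      {
          if (cur < max / 16)
              return 4 * cur;
          if (cur <= max / 2)
              return 2 * cur;
          if (cur > max)      /* the branch this patch adds */
              return cur;
          return max;
      }

      int main(void)
      {
          /* read_ahead_kb=128 -> max = 32 pages; the MADV_HUGEPAGE fault
           * path primes a 4MB window -> cur = 1024 pages. Without the new
           * branch this collapses to 32 pages (128KB), so no further
           * PMD-sized folios are used; with it, the 1024-page window is kept. */
          printf("next window: %lu pages\n", next_ra_size(1024, 32));
          return 0;
      }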

Fixes: 4687fdbb805a ("mm/filemap: Support VM_HUGEPAGE for file mappings")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
---
 mm/readahead.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Changes: 
RFC->v1:
- Simplify the code as suggested by Matthew

RFC: https://lore.kernel.org/linux-mm/20241104143015.34684-1-laoar.shao@gmail.com/

Comments

Andrew Morton Nov. 6, 2024, 9:03 p.m. UTC | #1
On Wed,  6 Nov 2024 17:21:14 +0800 Yafang Shao <laoar.shao@gmail.com> wrote:

> When large folio support is enabled and read_ahead_kb is set to a smaller
> value, ra->size (4MB) may exceed the maximum allowed size (e.g., 128KB). To
> address this, we need to add a conditional check for such cases. However,
> this alone is insufficient, as users might set read_ahead_kb to a larger,
> non-hugepage-aligned value (e.g., 4MB + 128KB). In these instances, it is
> essential to explicitly align ra->size with the hugepage size.

How much performance improvement is this likely to offer our users? 
IOW, should we consider backporting it?

(I bet anyone who comes across this will say "oh goody" and backport it
anyway, so why not do this for them?)
Yafang Shao Nov. 7, 2024, 3:39 a.m. UTC | #2
On Thu, Nov 7, 2024 at 5:03 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed,  6 Nov 2024 17:21:14 +0800 Yafang Shao <laoar.shao@gmail.com> wrote:
>
> > When large folio support is enabled and read_ahead_kb is set to a smaller
> > value, ra->size (4MB) may exceed the maximum allowed size (e.g., 128KB). To
> > address this, we need to add a conditional check for such cases. However,
> > this alone is insufficient, as users might set read_ahead_kb to a larger,
> > non-hugepage-aligned value (e.g., 4MB + 128KB). In these instances, it is
> > essential to explicitly align ra->size with the hugepage size.
>
> How much performance improvement is this likely to offer our users?

The performance boost comes from enabling the use of hugepages
directly. Previously, users were unable to leverage large folios as
expected. With this change, however, large folios are now usable as
intended.

This improvement addresses a critical need in services like AI
inference, which benefit substantially from hugetlbfs. However, using
hugetlbfs effectively within containerized environments can be
challenging. To overcome this limitation, we explored large folios as
a more flexible and production-friendly alternative.

> IOW, should we consider backporting it?

We should consider backporting this change. We've already backported
it to our local 6.1.y kernel, where it's performing well.
The Fixes tag should ensure it will be included in the stable kernel, right?

>
> (I bet anyone who comes across this will say "oh goody" and backport it
> anyway, so why not do this for them?)
>

--
Regards

Yafang
Andrew Morton Nov. 7, 2024, 4:06 a.m. UTC | #3
On Thu, 7 Nov 2024 11:39:36 +0800 Yafang Shao <laoar.shao@gmail.com> wrote:

> On Thu, Nov 7, 2024 at 5:03 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Wed,  6 Nov 2024 17:21:14 +0800 Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > > When large folio support is enabled and read_ahead_kb is set to a smaller
> > > value, ra->size (4MB) may exceed the maximum allowed size (e.g., 128KB). To
> > > address this, we need to add a conditional check for such cases. However,
> > > this alone is insufficient, as users might set read_ahead_kb to a larger,
> > > non-hugepage-aligned value (e.g., 4MB + 128KB). In these instances, it is
> > > essential to explicitly align ra->size with the hugepage size.
> >
> > How much performance improvement is this likely to offer our users?
> 
> The performance boost comes from enabling the use of hugepages
> directly. Previously, users were unable to leverage large folios as
> expected. With this change, however, large folios are now usable as
> intended.

Thanks, but I was hoping for something quantitative.  Some nice before-
and-after testing?  How important/useful/impactful is this change?

> This improvement addresses a critical need in services like AI
> inference, which benefit substantially from hugetlbfs. However, using
> hugetlbfs effectively within containerized environments can be
> challenging. To overcome this limitation, we explored large folios as
> a more flexible and production-friendly alternative.
> 
> > IOW, should we consider backporting it?
> 
> We should consider backporting this change. We've already backported
> it to our local 6.1.y kernel, where it's performing well.
> The Fixes tag should ensure it will be included in the stable kernel, right?

For most subsystems, yes.  In MM an explicit cc:stable is needed. 
Along with a changelog which permits readers to understand why a
backport is proposed.
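
For reference, a cc:stable is just an extra line in the patch's tag
block, e.g. (format per Documentation/process/stable-kernel-rules.rst):

      Fixes: 4687fdbb805a ("mm/filemap: Support VM_HUGEPAGE for file mappings")
      Cc: stable@vger.kernel.org
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>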
Matthew Wilcox Nov. 7, 2024, 4:52 a.m. UTC | #4
On Wed, Nov 06, 2024 at 05:21:14PM +0800, Yafang Shao wrote:
> When testing large folio support with XFS on our servers, we observed that
> only a few large folios are mapped when reading large files via mmap.
> After a thorough analysis, I identified that it was caused by the
> `/sys/block/*/queue/read_ahead_kb` setting. On our test servers, this
> parameter is set to 128KB. After tuning it to 2MB, large folios work as
> expected. However, I believe the large folio behavior should not depend
> on the value of read_ahead_kb; it would be more robust if the kernel
> could automatically adapt to it.
> 
> With `/sys/block/*/queue/read_ahead_kb` set to a non-2MB aligned size,
> this issue can be verified with a simple test case, as shown below:

I don't like to see these programs in commit messages.  If it's a
valuable program, it should go into tools/testing.  If not, it shouldn't
be in the commit message.

> When large folio support is enabled and read_ahead_kb is set to a smaller
> value, ra->size (4MB) may exceed the maximum allowed size (e.g., 128KB). To
> address this, we need to add a conditional check for such cases. However,
> this alone is insufficient, as users might set read_ahead_kb to a larger,
> non-hugepage-aligned value (e.g., 4MB + 128KB). In these instances, it is
> essential to explicitly align ra->size with the hugepage size.

I wish you'd discussed this in the earlier thread instead of just
smashing it into this patch.  Because your solution is wrong.

> @@ -642,7 +644,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
>  			1UL << order);
>  	if (index == expected) {
>  		ra->start += ra->size;
> -		ra->size = get_next_ra_size(ra, max_pages);
> +		ra->size = ALIGN(get_next_ra_size(ra, max_pages), 1 << order);

Let's suppose that someone sets read_ahead_kb to 192kB.  If the previous
readahead did 128kB, we now try to align that to 128kB, so we'll readahead
256kB which is larger than max.  We were only intending to breach the
'max' for the MADV_HUGE case, not for all cases.
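
(Concretely, with 4KB pages: 192kB gives max = 48 pages, and a 128kB
readahead uses order-5, i.e. 32-page, folios. get_next_ra_size() returns
the max, 48 pages, but ALIGN(48, 32) rounds up to 64 pages = 256kB,
past the configured maximum.)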

Honestly, I don't know if we should try to defend a stupid sysadmin
against the consequences of their misconfiguration like this.  I'd be in
favour of getting rid of the configuration knob entirely (or just ignoring
what the sysadmin set it to), but if we do that, we need to replace it
with something that can automatically figure out what the correct setting
for the readahead_max_kb should be (which is probably a function of the
bandwidth, latency and seek time of the underlying device).

But that's obviously not part of this patch.  I'd be in favour of just
dropping this ALIGN and leaving just the first hunk of the patch.
Yafang Shao Nov. 7, 2024, 5:55 a.m. UTC | #5
On Thu, Nov 7, 2024 at 12:52 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Nov 06, 2024 at 05:21:14PM +0800, Yafang Shao wrote:
> > When testing large folio support with XFS on our servers, we observed that
> > only a few large folios are mapped when reading large files via mmap.
> > After a thorough analysis, I identified that it was caused by the
> > `/sys/block/*/queue/read_ahead_kb` setting. On our test servers, this
> > parameter is set to 128KB. After tuning it to 2MB, large folios work as
> > expected. However, I believe the large folio behavior should not depend
> > on the value of read_ahead_kb; it would be more robust if the kernel
> > could automatically adapt to it.
> >
> > With `/sys/block/*/queue/read_ahead_kb` set to a non-2MB aligned size,
> > this issue can be verified with a simple test case, as shown below:
>
> I don't like to see these programs in commit messages.  If it's a
> valuable program, it should go into tools/testing.  If not, it shouldn't
> be in the commit message.

I just want to make it easy for others to verify this change. I'm okay
with dropping this program.

>
> > When large folio support is enabled and read_ahead_kb is set to a smaller
> > value, ra->size (4MB) may exceed the maximum allowed size (e.g., 128KB). To
> > address this, we need to add a conditional check for such cases. However,
> > this alone is insufficient, as users might set read_ahead_kb to a larger,
> > non-hugepage-aligned value (e.g., 4MB + 128KB). In these instances, it is
> > essential to explicitly align ra->size with the hugepage size.
>
> I wish you'd discussed this in the earlier thread instead of just
> smashing it into this patch.  Because your solution is wrong.
>
> > @@ -642,7 +644,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
> >                       1UL << order);
> >       if (index == expected) {
> >               ra->start += ra->size;
> > -             ra->size = get_next_ra_size(ra, max_pages);
> > +             ra->size = ALIGN(get_next_ra_size(ra, max_pages), 1 << order);
>
> Let's suppose that someone sets read_ahead_kb to 192kB.  If the previous
> readahead did 128kB, we now try to align that to 128kB, so we'll readahead
> 256kB which is larger than max.  We were only intending to breach the
> 'max' for the MADV_HUGE case, not for all cases.

In the non-MADV_HUGE case, the order can only be 0, correct?

>
> Honestly, I don't know if we should try to defend a stupid sysadmin
> against the consequences of their misconfiguration like this.  I'd be in
> favour of getting rid of the configuration knob entirely (or just ignoring
> what the sysadmin set it to), but if we do that, we need to replace it
> with something that can automatically figure out what the correct setting
> for the readahead_max_kb should be (which is probably a function of the
> bandwidth, latency and seek time of the underlying device).
>
> But that's obviously not part of this patch.  I'd be in favour of just
> dropping this ALIGN and leaving just the first hunk of the patch.

I’m okay with removing this ALIGN, since we won’t be setting a large
read_ahead_kb value in our production environment. ;)
Yafang Shao Nov. 7, 2024, 6:01 a.m. UTC | #6
On Thu, Nov 7, 2024 at 12:06 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 7 Nov 2024 11:39:36 +0800 Yafang Shao <laoar.shao@gmail.com> wrote:
>
> > On Thu, Nov 7, 2024 at 5:03 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > On Wed,  6 Nov 2024 17:21:14 +0800 Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > > When large folio support is enabled and read_ahead_kb is set to a smaller
> > > > value, ra->size (4MB) may exceed the maximum allowed size (e.g., 128KB). To
> > > > address this, we need to add a conditional check for such cases. However,
> > > > this alone is insufficient, as users might set read_ahead_kb to a larger,
> > > > non-hugepage-aligned value (e.g., 4MB + 128KB). In these instances, it is
> > > > essential to explicitly align ra->size with the hugepage size.
> > >
> > > How much performance improvement is this likely to offer our users?
> >
> > The performance boost comes from enabling the use of hugepages
> > directly. Previously, users were unable to leverage large folios as
> > expected. With this change, however, large folios are now usable as
> > intended.
>
> Thanks, but I was hoping for something quantitative.  Some nice before-
> and-after testing?  How important/useful/impactful is this change?

will improve the commit log.

>
> > This improvement addresses a critical need in services like AI
> > inference, which benefit substantially from hugetlbfs. However, using
> > hugetlbfs effectively within containerized environments can be
> > challenging. To overcome this limitation, we explored large folios as
> > a more flexible and production-friendly alternative.
> >
> > > IOW, should we consider backporting it?
> >
> > We should consider backporting this change. We've already backported
> > it to our local 6.1.y kernel, where it's performing well.
> > The Fixes tag should ensure it will be included in the stable kernel, right?
>
> For most subsystems, yes.  In MM an explicit cc:stable is needed.
> Along with a changelog which permits readers to understand why a
> backport is proposed.

will add a cc:stable in the next version.
Matthew Wilcox Nov. 7, 2024, 3 p.m. UTC | #7
On Thu, Nov 07, 2024 at 01:55:52PM +0800, Yafang Shao wrote:
> > Let's suppose that someone sets read_ahead_kb to 192kB.  If the previous
> > readahead did 128kB, we now try to align that to 128kB, so we'll readahead
> > 256kB which is larger than max.  We were only intending to breach the
> > 'max' for the MADV_HUGE case, not for all cases.
> 
> In the non-MADV_HUGE case, the order can only be 0, correct?

No.  The readahead code will use larger orders, depending on the size of
reads and how successful readahead is performing.
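
For context on that ramping, a paraphrased sketch of the order selection
in page_cache_ra_order() (not a verbatim quote of mm/readahead.c; exact
caps and helper names vary across kernel versions):

      /* Each successful readahead round may bump the folio order by two,
       * capped by the maximum page cache order and by the window size. */
      if (new_order < MAX_PAGECACHE_ORDER) {
              new_order += 2;
              new_order = min_t(unsigned int, new_order, MAX_PAGECACHE_ORDER);
              new_order = min_t(unsigned int, new_order, ilog2(ra->size));
      }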

Patch

diff --git a/mm/readahead.c b/mm/readahead.c
index 3dc6c7a128dd..9e2c6168ebfa 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -385,6 +385,8 @@ static unsigned long get_next_ra_size(struct file_ra_state *ra,
 		return 4 * cur;
 	if (cur <= max / 2)
 		return 2 * cur;
+	if (cur > max)
+		return cur;
 	return max;
 }
 
@@ -642,7 +644,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
 			1UL << order);
 	if (index == expected) {
 		ra->start += ra->size;
-		ra->size = get_next_ra_size(ra, max_pages);
+		ra->size = ALIGN(get_next_ra_size(ra, max_pages), 1 << order);
 		ra->async_size = ra->size;
 		goto readit;
 	}