
[v2] mm/readahead: Fix large folio support in async readahead

Message ID 20241108141710.9721-1-laoar.shao@gmail.com (mailing list archive)
State New
Series: [v2] mm/readahead: Fix large folio support in async readahead

Commit Message

Yafang Shao Nov. 8, 2024, 2:17 p.m. UTC
When testing large folio support with XFS on our servers, we observed that
only a few large folios are mapped when reading large files via mmap.
After a thorough analysis, I identified that it was caused by the
`/sys/block/*/queue/read_ahead_kb` setting. On our test servers, this
parameter is set to 128KB. After tuning it to 2MB, large folios work as
expected. However, I believe large folio behavior should not depend on
the value of read_ahead_kb; it would be more robust if the kernel
adapted to it automatically.

With /sys/block/*/queue/read_ahead_kb set to 128KB and performing a
sequential read on a 1GB file using MADV_HUGEPAGE, the differences in
/proc/meminfo are as follows:

- before this patch
  FileHugePages:     18432 kB
  FilePmdMapped:      4096 kB

- after this patch
  FileHugePages:   1067008 kB
  FilePmdMapped:   1048576 kB

This shows that after applying the patch, the entire 1GB file is mapped to
huge pages. The stable list is CCed, as without this patch, large folios
don’t function optimally in the readahead path. 

It's worth noting that if read_ahead_kb is set to a larger value that isn't
aligned with huge page sizes (e.g., 4MB + 128KB), it may still fail to map
to hugepages.
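
A minimal reproducer for the scenario above might look like the sketch
below (the file path is a placeholder and this is not the exact program
used in our tests; it assumes a kernel with CONFIG_TRANSPARENT_HUGEPAGE):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/mnt/xfs/testfile";	/* placeholder: 1GB file on XFS */
	struct stat st;
	int fd = open(path, O_RDONLY);

	if (fd < 0 || fstat(fd, &st) < 0) {
		perror(path);
		return 1;
	}

	char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	if (madvise(p, st.st_size, MADV_HUGEPAGE))
		perror("madvise(MADV_HUGEPAGE)");

	volatile char sum = 0;
	for (off_t off = 0; off < st.st_size; off += 4096)
		sum += p[off];		/* sequential read drives readahead */

	pause();	/* keep the mapping alive to inspect /proc/meminfo */
	return 0;
}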

Fixes: 4687fdbb805a ("mm/filemap: Support VM_HUGEPAGE for file mappings")
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: stable@vger.kernel.org

---
 mm/readahead.c | 2 ++
 1 file changed, 2 insertions(+)
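
For reference, the resulting helper reads roughly as follows; the two added
lines are those shown in the diff quoted in the replies below, while the
first branch is reproduced from mm/readahead.c and may differ in minor
detail:

static unsigned long get_next_ra_size(struct file_ra_state *ra,
				      unsigned long max)
{
	unsigned long cur = ra->size;

	if (cur < max / 16)
		return 4 * cur;
	if (cur <= max / 2)
		return 2 * cur;
	/* Added by this patch: keep an already-larger window instead of clamping it. */
	if (cur > max)
		return cur;
	return max;
}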

Changes:
v1->v2:
- Drop the align (Matthew)
- Improve commit log (Andrew)

RFC->v1: https://lore.kernel.org/linux-mm/20241106092114.8408-1-laoar.shao@gmail.com/
- Simplify the code as suggested by Matthew

RFC: https://lore.kernel.org/linux-mm/20241104143015.34684-1-laoar.shao@gmail.com/

Comments

David Hildenbrand Nov. 11, 2024, 10:33 a.m. UTC | #1
On 08.11.24 15:17, Yafang Shao wrote:
> When testing large folio support with XFS on our servers, we observed that
> only a few large folios are mapped when reading large files via mmap.
> After a thorough analysis, I identified it was caused by the
> `/sys/block/*/queue/read_ahead_kb` setting. On our test servers, this
> parameter is set to 128KB. After I tune it to 2MB, the large folio can
> work as expected. However, I believe the large folio behavior should not be
> dependent on the value of read_ahead_kb. It would be more robust if the
> kernel can automatically adopt to it.

Now I am extremely confused.

Documentation/ABI/stable/sysfs-block:

"[RW] Maximum number of kilobytes to read-ahead for filesystems on this 
block device."


So, with your patch, will we also be changing the readahead size to 
exceed that, or simply allocate larger folios and not exceeding the 
readahead size (e.g., leaving them partially non-filled)?

If you're also changing the readahead behavior to exceed the 
configuration parameter, it would sound to me like "I am pushing the 
brake pedal and my car brakes; fix the brakes to decide automatically 
whether to brake" :)

Likely I am missing something here, and how the read_ahead_kb parameter 
is used after your patch.


> 
> With /sys/block/*/queue/read_ahead_kb set to 128KB and performing a
> sequential read on a 1GB file using MADV_HUGEPAGE, the differences in
> /proc/meminfo are as follows:
> 
> - before this patch
>    FileHugePages:     18432 kB
>    FilePmdMapped:      4096 kB
> 
> - after this patch
>    FileHugePages:   1067008 kB
>    FilePmdMapped:   1048576 kB
> 
> This shows that after applying the patch, the entire 1GB file is mapped to
> huge pages. The stable list is CCed, as without this patch, large folios
> don’t function optimally in the readahead path.
> 
> It's worth noting that if read_ahead_kb is set to a larger value that isn't
> aligned with huge page sizes (e.g., 4MB + 128KB), it may still fail to map
> to hugepages.
> 
> Fixes: 4687fdbb805a ("mm/filemap: Support VM_HUGEPAGE for file mappings")
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> Cc: stable@vger.kernel.org
> 
> ---
>   mm/readahead.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> Changes:
> v1->v2:
> - Drop the align (Matthew)
> - Improve commit log (Andrew)
> 
> RFC->v1: https://lore.kernel.org/linux-mm/20241106092114.8408-1-laoar.shao@gmail.com/
> - Simplify the code as suggested by Matthew
> 
> RFC: https://lore.kernel.org/linux-mm/20241104143015.34684-1-laoar.shao@gmail.com/
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 3dc6c7a128dd..9b8a48e736c6 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -385,6 +385,8 @@ static unsigned long get_next_ra_size(struct file_ra_state *ra,
>   		return 4 * cur;
>   	if (cur <= max / 2)
>   		return 2 * cur;
> +	if (cur > max)
> +		return cur;
>   	return max;

Maybe something like

return max_t(unsigned long, cur, max);

might be more readable (likely "max()" cannot be used because of the 
local variable name "max" ...).


... but it's rather weird having a "max" and then returning something 
larger than the "max" ... especially with code like

"ra->size = get_next_ra_size(ra, max_pages);"


Maybe we can improve that by renaming "max_pages" / "max" to what it 
actually is supposed to be (which I haven't quite understood yet).
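
For illustration, that suggestion applied to the v2 hunk would read roughly
like this (a sketch, not a posted patch):

	if (cur < max / 16)
		return 4 * cur;
	if (cur <= max / 2)
		return 2 * cur;
	/* Keep an already-larger window instead of clamping it to "max". */
	return max_t(unsigned long, cur, max);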
Yafang Shao Nov. 11, 2024, 2:28 p.m. UTC | #2
On Mon, Nov 11, 2024 at 6:33 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 08.11.24 15:17, Yafang Shao wrote:
> > When testing large folio support with XFS on our servers, we observed that
> > only a few large folios are mapped when reading large files via mmap.
> > After a thorough analysis, I identified it was caused by the
> > `/sys/block/*/queue/read_ahead_kb` setting. On our test servers, this
> > parameter is set to 128KB. After I tune it to 2MB, the large folio can
> > work as expected. However, I believe the large folio behavior should not be
> > dependent on the value of read_ahead_kb. It would be more robust if the
> > kernel can automatically adopt to it.
>
> Now I am extremely confused.
>
> Documentation/ABI/stable/sysfs-block:
>
> "[RW] Maximum number of kilobytes to read-ahead for filesystems on this
> block device."
>
>
> So, with your patch, will we also be changing the readahead size to
> exceed that, or simply allocate larger folios and not exceeding the
> readahead size (e.g., leaving them partially non-filled)?

Exceeding the readahead size for the MADV_HUGEPAGE case is
straightforward; this is what the current patch accomplishes.

>
> If you're also changing the readahead behavior to exceed the
> configuration parameter it would sound to me like "I am pushing the
> brake pedal and my care brakes; fix the brakes to adopt whether to brake
> automatically" :)
>
> Likely I am missing something here, and how the read_ahead_kb parameter
> is used after your patch.

The read_ahead_kb parameter continues to function for
non-MADV_HUGEPAGE scenarios, whereas special handling is required for
the MADV_HUGEPAGE case. It appears that we ought to update the
Documentation/ABI/stable/sysfs-block to reflect the changes related to
large folios, correct?

>
>
> >
> > With /sys/block/*/queue/read_ahead_kb set to 128KB and performing a
> > sequential read on a 1GB file using MADV_HUGEPAGE, the differences in
> > /proc/meminfo are as follows:
> >
> > - before this patch
> >    FileHugePages:     18432 kB
> >    FilePmdMapped:      4096 kB
> >
> > - after this patch
> >    FileHugePages:   1067008 kB
> >    FilePmdMapped:   1048576 kB
> >
> > This shows that after applying the patch, the entire 1GB file is mapped to
> > huge pages. The stable list is CCed, as without this patch, large folios
> > don’t function optimally in the readahead path.
> > It's worth noting that if read_ahead_kb is set to a larger value that isn't
> > aligned with huge page sizes (e.g., 4MB + 128KB), it may still fail to map
> > to hugepages.
> >
> > Fixes: 4687fdbb805a ("mm/filemap: Support VM_HUGEPAGE for file mappings")
> > Suggested-by: Matthew Wilcox <willy@infradead.org>
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > Cc: stable@vger.kernel.org
> >
> > ---
> >   mm/readahead.c | 2 ++
> >   1 file changed, 2 insertions(+)
> >
> > Changes:
> > v1->v2:
> > - Drop the align (Matthew)
> > - Improve commit log (Andrew)
> >
> > RFC->v1: https://lore.kernel.org/linux-mm/20241106092114.8408-1-laoar.shao@gmail.com/
> > - Simplify the code as suggested by Matthew
> >
> > RFC: https://lore.kernel.org/linux-mm/20241104143015.34684-1-laoar.shao@gmail.com/
> >
> > diff --git a/mm/readahead.c b/mm/readahead.c
> > index 3dc6c7a128dd..9b8a48e736c6 100644
> > --- a/mm/readahead.c
> > +++ b/mm/readahead.c
> > @@ -385,6 +385,8 @@ static unsigned long get_next_ra_size(struct file_ra_state *ra,
> >               return 4 * cur;
> >       if (cur <= max / 2)
> >               return 2 * cur;
> > +     if (cur > max)
> > +             return cur;
> >       return max;
>
> Maybe something like
>
> return max_t(unsigned long, cur, max);
>
> might be more readable (likely "max()" cannot be used because of the
> local variable name "max" ...).
>
>
> ... but it's rather weird having a "max" and then returning something
> larger than the "max" ... especially with code like

Indeed, that could lead to confusion ;)

>
> "ra->size = get_next_ra_size(ra, max_pages);"
>
>
> Maybe we can improve that by renaming "max_pages" / "max" to what it
> actually is supposed to be (which I haven't quite understood yet).

Perhaps a more straightforward solution would be to implement it
directly at the callsite, as demonstrated below?

diff --git a/mm/readahead.c b/mm/readahead.c
index 3dc6c7a128dd..187efae95b02 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -642,7 +642,11 @@ void page_cache_async_ra(struct readahead_control *ractl,
                        1UL << order);
        if (index == expected) {
                ra->start += ra->size;
-               ra->size = get_next_ra_size(ra, max_pages);
+               /*
+                * Allow the actual size to exceed the readahead window for a
+                * large folio.
+                */
+               ra->size = get_next_ra_size(ra, max(max_pages, ra->size));
                ra->async_size = ra->size;
                goto readit;
        }


--
Regards
Yafang
David Hildenbrand Nov. 11, 2024, 3:05 p.m. UTC | #3
On 11.11.24 15:28, Yafang Shao wrote:
> On Mon, Nov 11, 2024 at 6:33 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 08.11.24 15:17, Yafang Shao wrote:
>>> When testing large folio support with XFS on our servers, we observed that
>>> only a few large folios are mapped when reading large files via mmap.
>>> After a thorough analysis, I identified it was caused by the
>>> `/sys/block/*/queue/read_ahead_kb` setting. On our test servers, this
>>> parameter is set to 128KB. After I tune it to 2MB, the large folio can
>>> work as expected. However, I believe the large folio behavior should not be
>>> dependent on the value of read_ahead_kb. It would be more robust if the
>>> kernel can automatically adopt to it.
>>
>> Now I am extremely confused.
>>
>> Documentation/ABI/stable/sysfs-block:
>>
>> "[RW] Maximum number of kilobytes to read-ahead for filesystems on this
>> block device."
>>
>>
>> So, with your patch, will we also be changing the readahead size to
>> exceed that, or simply allocate larger folios and not exceeding the
>> readahead size (e.g., leaving them partially non-filled)?
> 
> Exceeding the readahead size for the MADV_HUGEPAGE case is
> straightforward; this is what the current patch accomplishes.
> 

Okay, so this only applies with MADV_HUGEPAGE I assume. Likely we should 
also make that clearer in the subject.

mm/readahead: allow exceeding configured read_ahead_kb with MADV_HUGEPAGE


Whether this is really a fix, especially one that deserves CC-stable, I 
cannot tell. Willy is the obvious expert :)

>>
>> If you're also changing the readahead behavior to exceed the
>> configuration parameter it would sound to me like "I am pushing the
>> brake pedal and my care brakes; fix the brakes to adopt whether to brake
>> automatically" :)
>>
>> Likely I am missing something here, and how the read_ahead_kb parameter
>> is used after your patch.
> 
> The read_ahead_kb parameter continues to function for
> non-MADV_HUGEPAGE scenarios, whereas special handling is required for
> the MADV_HUGEPAGE case. It appears that we ought to update the
> Documentation/ABI/stable/sysfs-block to reflect the changes related to
> large folios, correct?

Yes, how it relates to MADV_HUGEPAGE. I would assume that it would get 
ignored, but ...

... staring at get_next_ra_size(), it's not quite ignored, because we 
still use it as a baseline to decide how much we want to bump up the 
limit when the requested size is small? (*2 vs *4 etc) :/

So the semantics are really starting to get weird, unless I am missing 
something important.

[...]

> Perhaps a more straightforward solution would be to implement it
> directly at the callsite, as demonstrated below?

Likely something in this direction might be better, but Willy is the 
expert on that code.

> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 3dc6c7a128dd..187efae95b02 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -642,7 +642,11 @@ void page_cache_async_ra(struct readahead_control *ractl,
>                          1UL << order);
>          if (index == expected) {
>                  ra->start += ra->size;
> -               ra->size = get_next_ra_size(ra, max_pages);
> +               /*
> +                * Allow the actual size to exceed the readahead window for a
> +                * large folio.

"a large folio" -> "with MADV_HUGEPAGE" ? Or can this be hit on 
different paths that are not covered in the patch description?
David Hildenbrand Nov. 11, 2024, 3:26 p.m. UTC | #4
On 11.11.24 16:05, David Hildenbrand wrote:
> On 11.11.24 15:28, Yafang Shao wrote:
>> On Mon, Nov 11, 2024 at 6:33 PM David Hildenbrand <david@redhat.com> wrote:
>>>
>>> On 08.11.24 15:17, Yafang Shao wrote:
>>>> When testing large folio support with XFS on our servers, we observed that
>>>> only a few large folios are mapped when reading large files via mmap.
>>>> After a thorough analysis, I identified it was caused by the
>>>> `/sys/block/*/queue/read_ahead_kb` setting. On our test servers, this
>>>> parameter is set to 128KB. After I tune it to 2MB, the large folio can
>>>> work as expected. However, I believe the large folio behavior should not be
>>>> dependent on the value of read_ahead_kb. It would be more robust if the
>>>> kernel can automatically adopt to it.
>>>
>>> Now I am extremely confused.
>>>
>>> Documentation/ABI/stable/sysfs-block:
>>>
>>> "[RW] Maximum number of kilobytes to read-ahead for filesystems on this
>>> block device."
>>>
>>>
>>> So, with your patch, will we also be changing the readahead size to
>>> exceed that, or simply allocate larger folios and not exceeding the
>>> readahead size (e.g., leaving them partially non-filled)?
>>
>> Exceeding the readahead size for the MADV_HUGEPAGE case is
>> straightforward; this is what the current patch accomplishes.
>>
> 
> Okay, so this only applies with MADV_HUGEPAGE I assume. Likely we should
> also make that clearer in the subject.
> 
> mm/readahead: allow exceeding configured read_ahead_kb with MADV_HUGEPAGE
> 
> 
> If this is really a fix, especially one that deserves CC-stable, I
> cannot tell. Willy is the obvious expert :)
> 
>>>
>>> If you're also changing the readahead behavior to exceed the
>>> configuration parameter it would sound to me like "I am pushing the
>>> brake pedal and my care brakes; fix the brakes to adopt whether to brake
>>> automatically" :)
>>>
>>> Likely I am missing something here, and how the read_ahead_kb parameter
>>> is used after your patch.
>>
>> The read_ahead_kb parameter continues to function for
>> non-MADV_HUGEPAGE scenarios, whereas special handling is required for
>> the MADV_HUGEPAGE case. It appears that we ought to update the
>> Documentation/ABI/stable/sysfs-block to reflect the changes related to
>> large folios, correct?
> 
> Yes, how it related to MADV_HUGEPAGE. I would assume that it would get
> ignored, but ...
> 
> ... staring at get_next_ra_size(), it's not quite ignored, because we
> still us it as a baseline to detect how much we want to bump up the
> limit when the requested size is small? (*2 vs *4 etc) :/
> 
> So the semantics are really starting to get weird, unless I am missing
> something important.
Likely what I am missing is that the value of get_next_ra_size() will never be relevant
in that case. I assume the following would end up doing the same:

diff --git a/mm/readahead.c b/mm/readahead.c
index 475d2940a1edb..cc7f883f83d86 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -668,7 +668,12 @@ void page_cache_async_ra(struct readahead_control *ractl,
         ra->start = start;
         ra->size = start - index;       /* old async_size */
         ra->size += req_count;
-       ra->size = get_next_ra_size(ra, max_pages);
+       /*
+        * Allow the actual size to exceed the readahead window for
+        * MADV_HUGEPAGE.
+        */
+       if (ra->size < max_pages)
+               ra->size = get_next_ra_size(ra, max_pages);
         ra->async_size = ra->size;
  readit:
         ractl->_index = ra->start;


So maybe it should just be in get_next_ra_size() where we clarify what "max_pages"
means and why we simply decide to ignore the value ...
Yafang Shao Nov. 11, 2024, 4:08 p.m. UTC | #5
On Mon, Nov 11, 2024 at 11:05 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 11.11.24 15:28, Yafang Shao wrote:
> > On Mon, Nov 11, 2024 at 6:33 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 08.11.24 15:17, Yafang Shao wrote:
> >>> When testing large folio support with XFS on our servers, we observed that
> >>> only a few large folios are mapped when reading large files via mmap.
> >>> After a thorough analysis, I identified it was caused by the
> >>> `/sys/block/*/queue/read_ahead_kb` setting. On our test servers, this
> >>> parameter is set to 128KB. After I tune it to 2MB, the large folio can
> >>> work as expected. However, I believe the large folio behavior should not be
> >>> dependent on the value of read_ahead_kb. It would be more robust if the
> >>> kernel can automatically adopt to it.
> >>
> >> Now I am extremely confused.
> >>
> >> Documentation/ABI/stable/sysfs-block:
> >>
> >> "[RW] Maximum number of kilobytes to read-ahead for filesystems on this
> >> block device."
> >>
> >>
> >> So, with your patch, will we also be changing the readahead size to
> >> exceed that, or simply allocate larger folios and not exceeding the
> >> readahead size (e.g., leaving them partially non-filled)?
> >
> > Exceeding the readahead size for the MADV_HUGEPAGE case is
> > straightforward; this is what the current patch accomplishes.
> >
>
> Okay, so this only applies with MADV_HUGEPAGE I assume. Likely we should
> also make that clearer in the subject.
>
> mm/readahead: allow exceeding configured read_ahead_kb with MADV_HUGEPAGE
>
>
> If this is really a fix, especially one that deserves CC-stable, I
> cannot tell. Willy is the obvious expert :)
>
> >>
> >> If you're also changing the readahead behavior to exceed the
> >> configuration parameter it would sound to me like "I am pushing the
> >> brake pedal and my care brakes; fix the brakes to adopt whether to brake
> >> automatically" :)
> >>
> >> Likely I am missing something here, and how the read_ahead_kb parameter
> >> is used after your patch.
> >
> > The read_ahead_kb parameter continues to function for
> > non-MADV_HUGEPAGE scenarios, whereas special handling is required for
> > the MADV_HUGEPAGE case. It appears that we ought to update the
> > Documentation/ABI/stable/sysfs-block to reflect the changes related to
> > large folios, correct?
>
> Yes, how it related to MADV_HUGEPAGE. I would assume that it would get
> ignored, but ...
>
> ... staring at get_next_ra_size(), it's not quite ignored, because we
> still us it as a baseline to detect how much we want to bump up the
> limit when the requested size is small? (*2 vs *4 etc) :/
>
> So the semantics are really starting to get weird, unless I am missing
> something important.
>
> [...]
>
> > Perhaps a more straightforward solution would be to implement it
> > directly at the callsite, as demonstrated below?
>
> Likely something into this direction might be better, but Willy is the
> expert that code.
>
> >
> > diff --git a/mm/readahead.c b/mm/readahead.c
> > index 3dc6c7a128dd..187efae95b02 100644
> > --- a/mm/readahead.c
> > +++ b/mm/readahead.c
> > @@ -642,7 +642,11 @@ void page_cache_async_ra(struct readahead_control *ractl,
> >                          1UL << order);
> >          if (index == expected) {
> >                  ra->start += ra->size;
> > -               ra->size = get_next_ra_size(ra, max_pages);
> > +               /*
> > +                * Allow the actual size to exceed the readahead window for a
> > +                * large folio.
>
> "a large folio" -> "with MADV_HUGEPAGE" ? Or can this be hit on
> different paths that are not covered in the patch description?

This branch may also be triggered by other large folios that are not
necessarily order-9. Therefore, I’ve referred to it as a 'large folio'
rather than associating it specifically with MADV_HUGEPAGE. If we were
to handle only the MADV_HUGEPAGE case, we would proceed as outlined in
the initial RFC patch[0]. However, following Willy's recommendation, I
implemented it this way, as he likely has a deeper understanding of
the intended behavior.

[0]. https://lore.kernel.org/linux-mm/20241104143015.34684-1-laoar.shao@gmail.com/

--
Regards
Yafang
Yafang Shao Nov. 11, 2024, 4:13 p.m. UTC | #6
On Mon, Nov 11, 2024 at 11:26 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 11.11.24 16:05, David Hildenbrand wrote:
> > On 11.11.24 15:28, Yafang Shao wrote:
> >> On Mon, Nov 11, 2024 at 6:33 PM David Hildenbrand <david@redhat.com> wrote:
> >>>
> >>> On 08.11.24 15:17, Yafang Shao wrote:
> >>>> When testing large folio support with XFS on our servers, we observed that
> >>>> only a few large folios are mapped when reading large files via mmap.
> >>>> After a thorough analysis, I identified it was caused by the
> >>>> `/sys/block/*/queue/read_ahead_kb` setting. On our test servers, this
> >>>> parameter is set to 128KB. After I tune it to 2MB, the large folio can
> >>>> work as expected. However, I believe the large folio behavior should not be
> >>>> dependent on the value of read_ahead_kb. It would be more robust if the
> >>>> kernel can automatically adopt to it.
> >>>
> >>> Now I am extremely confused.
> >>>
> >>> Documentation/ABI/stable/sysfs-block:
> >>>
> >>> "[RW] Maximum number of kilobytes to read-ahead for filesystems on this
> >>> block device."
> >>>
> >>>
> >>> So, with your patch, will we also be changing the readahead size to
> >>> exceed that, or simply allocate larger folios and not exceeding the
> >>> readahead size (e.g., leaving them partially non-filled)?
> >>
> >> Exceeding the readahead size for the MADV_HUGEPAGE case is
> >> straightforward; this is what the current patch accomplishes.
> >>
> >
> > Okay, so this only applies with MADV_HUGEPAGE I assume. Likely we should
> > also make that clearer in the subject.
> >
> > mm/readahead: allow exceeding configured read_ahead_kb with MADV_HUGEPAGE
> >
> >
> > If this is really a fix, especially one that deserves CC-stable, I
> > cannot tell. Willy is the obvious expert :)
> >
> >>>
> >>> If you're also changing the readahead behavior to exceed the
> >>> configuration parameter it would sound to me like "I am pushing the
> >>> brake pedal and my care brakes; fix the brakes to adopt whether to brake
> >>> automatically" :)
> >>>
> >>> Likely I am missing something here, and how the read_ahead_kb parameter
> >>> is used after your patch.
> >>
> >> The read_ahead_kb parameter continues to function for
> >> non-MADV_HUGEPAGE scenarios, whereas special handling is required for
> >> the MADV_HUGEPAGE case. It appears that we ought to update the
> >> Documentation/ABI/stable/sysfs-block to reflect the changes related to
> >> large folios, correct?
> >
> > Yes, how it related to MADV_HUGEPAGE. I would assume that it would get
> > ignored, but ...
> >
> > ... staring at get_next_ra_size(), it's not quite ignored, because we
> > still us it as a baseline to detect how much we want to bump up the
> > limit when the requested size is small? (*2 vs *4 etc) :/
> >
> > So the semantics are really starting to get weird, unless I am missing
> > something important.
> Likely what I am missing is that the value of get_next_ra_size() will never be relevant
> in that case. I assume the following would end up doing the same:
>
> iff --git a/mm/readahead.c b/mm/readahead.c
> index 475d2940a1edb..cc7f883f83d86 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -668,7 +668,12 @@ void page_cache_async_ra(struct readahead_control *ractl,
>          ra->start = start;
>          ra->size = start - index;       /* old async_size */
>          ra->size += req_count;
> -       ra->size = get_next_ra_size(ra, max_pages);
> +       /*
> +        * Allow the actual size to exceed the readahead window for
> +        * MADV_HUGEPAGE.
> +        */
> +       if (ra->size < max_pages)
> +               ra->size = get_next_ra_size(ra, max_pages);

This change doesn’t apply to MADV_HUGEPAGE because, in cases where
`ra->size > max_pages`, ra->size has already been set to max_pages.
This can be easily verified with the example provided in the previous
version[1].

[1]. https://lore.kernel.org/linux-mm/20241106092114.8408-1-laoar.shao@gmail.com/

>          ra->async_size = ra->size;
>   readit:
>          ractl->_index = ra->start;
>
>
> So maybe it should just be in get_next_ra_size() where we clarify what "max_pages"
> means and why we simply decide to ignore the value ...
David Hildenbrand Nov. 11, 2024, 6:31 p.m. UTC | #7
On 11.11.24 17:08, Yafang Shao wrote:
> On Mon, Nov 11, 2024 at 11:05 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 11.11.24 15:28, Yafang Shao wrote:
>>> On Mon, Nov 11, 2024 at 6:33 PM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 08.11.24 15:17, Yafang Shao wrote:
>>>>> When testing large folio support with XFS on our servers, we observed that
>>>>> only a few large folios are mapped when reading large files via mmap.
>>>>> After a thorough analysis, I identified it was caused by the
>>>>> `/sys/block/*/queue/read_ahead_kb` setting. On our test servers, this
>>>>> parameter is set to 128KB. After I tune it to 2MB, the large folio can
>>>>> work as expected. However, I believe the large folio behavior should not be
>>>>> dependent on the value of read_ahead_kb. It would be more robust if the
>>>>> kernel can automatically adopt to it.
>>>>
>>>> Now I am extremely confused.
>>>>
>>>> Documentation/ABI/stable/sysfs-block:
>>>>
>>>> "[RW] Maximum number of kilobytes to read-ahead for filesystems on this
>>>> block device."
>>>>
>>>>
>>>> So, with your patch, will we also be changing the readahead size to
>>>> exceed that, or simply allocate larger folios and not exceeding the
>>>> readahead size (e.g., leaving them partially non-filled)?
>>>
>>> Exceeding the readahead size for the MADV_HUGEPAGE case is
>>> straightforward; this is what the current patch accomplishes.
>>>
>>
>> Okay, so this only applies with MADV_HUGEPAGE I assume. Likely we should
>> also make that clearer in the subject.
>>
>> mm/readahead: allow exceeding configured read_ahead_kb with MADV_HUGEPAGE
>>
>>
>> If this is really a fix, especially one that deserves CC-stable, I
>> cannot tell. Willy is the obvious expert :)
>>
>>>>
>>>> If you're also changing the readahead behavior to exceed the
>>>> configuration parameter it would sound to me like "I am pushing the
>>>> brake pedal and my care brakes; fix the brakes to adopt whether to brake
>>>> automatically" :)
>>>>
>>>> Likely I am missing something here, and how the read_ahead_kb parameter
>>>> is used after your patch.
>>>
>>> The read_ahead_kb parameter continues to function for
>>> non-MADV_HUGEPAGE scenarios, whereas special handling is required for
>>> the MADV_HUGEPAGE case. It appears that we ought to update the
>>> Documentation/ABI/stable/sysfs-block to reflect the changes related to
>>> large folios, correct?
>>
>> Yes, how it related to MADV_HUGEPAGE. I would assume that it would get
>> ignored, but ...
>>
>> ... staring at get_next_ra_size(), it's not quite ignored, because we
>> still us it as a baseline to detect how much we want to bump up the
>> limit when the requested size is small? (*2 vs *4 etc) :/
>>
>> So the semantics are really starting to get weird, unless I am missing
>> something important.
>>
>> [...]
>>
>>> Perhaps a more straightforward solution would be to implement it
>>> directly at the callsite, as demonstrated below?
>>
>> Likely something into this direction might be better, but Willy is the
>> expert that code.
>>
>>>
>>> diff --git a/mm/readahead.c b/mm/readahead.c
>>> index 3dc6c7a128dd..187efae95b02 100644
>>> --- a/mm/readahead.c
>>> +++ b/mm/readahead.c
>>> @@ -642,7 +642,11 @@ void page_cache_async_ra(struct readahead_control *ractl,
>>>                           1UL << order);
>>>           if (index == expected) {
>>>                   ra->start += ra->size;
>>> -               ra->size = get_next_ra_size(ra, max_pages);
>>> +               /*
>>> +                * Allow the actual size to exceed the readahead window for a
>>> +                * large folio.
>>
>> "a large folio" -> "with MADV_HUGEPAGE" ? Or can this be hit on
>> different paths that are not covered in the patch description?
> 
> This branch may also be triggered by other large folios that are not
> necessarily order-9. Therefore, I’ve referred to it as a 'large folio'
> rather than associating it specifically with MADV_HUGEPAGE. If we were
> to handle only the MADV_HUGEPAGE case, we would proceed as outlined in
> the initial RFC patch[0]. However, following Willy's recommendation, I
> implemented it this way, as he likely has a deeper understanding of
> the intended behavior.

Sorry, but this code is getting quite confusing, especially with such 
misleading "large folio" comments.

Even without MADV_HUGEPAGE we will be allocating large folios, as 
emphasized by Willy [1]. So the only thing MADV_HUGEPAGE controls is 
*which* large folios we allocate. .. as Willy says [2]: "We were only 
intending to breach the 'max' for the MADV_HUGE case, not for all cases."

I have no idea how *anybody* should derive from the code here that we 
treat MADV_HUGEPAGE in a special way.

Simply completely confusing.

My interpretation of "I don't know if we should try to defend a stupid 
sysadmin against the consequences of their misconfiguration like this" 
means" would be "drop this patch and don't change anything".

No changes to API, no confusing code.

Maybe pr_info_once() when someone uses MADV_HUGEPAGE with such backends 
to tell the sysadmin that something stupid is happening ...


[1] https://lore.kernel.org/linux-mm/ZyzV-RV0fpWABdWD@casper.infradead.org/
[2] https://lore.kernel.org/linux-mm/ZyxHc5Uukh47CO2R@casper.infradead.org/
Yafang Shao Nov. 11, 2024, 7:10 p.m. UTC | #8
On Tue, Nov 12, 2024 at 2:31 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 11.11.24 17:08, Yafang Shao wrote:
> > On Mon, Nov 11, 2024 at 11:05 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 11.11.24 15:28, Yafang Shao wrote:
> >>> On Mon, Nov 11, 2024 at 6:33 PM David Hildenbrand <david@redhat.com> wrote:
> >>>>
> >>>> On 08.11.24 15:17, Yafang Shao wrote:
> >>>>> When testing large folio support with XFS on our servers, we observed that
> >>>>> only a few large folios are mapped when reading large files via mmap.
> >>>>> After a thorough analysis, I identified it was caused by the
> >>>>> `/sys/block/*/queue/read_ahead_kb` setting. On our test servers, this
> >>>>> parameter is set to 128KB. After I tune it to 2MB, the large folio can
> >>>>> work as expected. However, I believe the large folio behavior should not be
> >>>>> dependent on the value of read_ahead_kb. It would be more robust if the
> >>>>> kernel can automatically adopt to it.
> >>>>
> >>>> Now I am extremely confused.
> >>>>
> >>>> Documentation/ABI/stable/sysfs-block:
> >>>>
> >>>> "[RW] Maximum number of kilobytes to read-ahead for filesystems on this
> >>>> block device."
> >>>>
> >>>>
> >>>> So, with your patch, will we also be changing the readahead size to
> >>>> exceed that, or simply allocate larger folios and not exceeding the
> >>>> readahead size (e.g., leaving them partially non-filled)?
> >>>
> >>> Exceeding the readahead size for the MADV_HUGEPAGE case is
> >>> straightforward; this is what the current patch accomplishes.
> >>>
> >>
> >> Okay, so this only applies with MADV_HUGEPAGE I assume. Likely we should
> >> also make that clearer in the subject.
> >>
> >> mm/readahead: allow exceeding configured read_ahead_kb with MADV_HUGEPAGE
> >>
> >>
> >> If this is really a fix, especially one that deserves CC-stable, I
> >> cannot tell. Willy is the obvious expert :)
> >>
> >>>>
> >>>> If you're also changing the readahead behavior to exceed the
> >>>> configuration parameter it would sound to me like "I am pushing the
> >>>> brake pedal and my care brakes; fix the brakes to adopt whether to brake
> >>>> automatically" :)
> >>>>
> >>>> Likely I am missing something here, and how the read_ahead_kb parameter
> >>>> is used after your patch.
> >>>
> >>> The read_ahead_kb parameter continues to function for
> >>> non-MADV_HUGEPAGE scenarios, whereas special handling is required for
> >>> the MADV_HUGEPAGE case. It appears that we ought to update the
> >>> Documentation/ABI/stable/sysfs-block to reflect the changes related to
> >>> large folios, correct?
> >>
> >> Yes, how it related to MADV_HUGEPAGE. I would assume that it would get
> >> ignored, but ...
> >>
> >> ... staring at get_next_ra_size(), it's not quite ignored, because we
> >> still us it as a baseline to detect how much we want to bump up the
> >> limit when the requested size is small? (*2 vs *4 etc) :/
> >>
> >> So the semantics are really starting to get weird, unless I am missing
> >> something important.
> >>
> >> [...]
> >>
> >>> Perhaps a more straightforward solution would be to implement it
> >>> directly at the callsite, as demonstrated below?
> >>
> >> Likely something into this direction might be better, but Willy is the
> >> expert that code.
> >>
> >>>
> >>> diff --git a/mm/readahead.c b/mm/readahead.c
> >>> index 3dc6c7a128dd..187efae95b02 100644
> >>> --- a/mm/readahead.c
> >>> +++ b/mm/readahead.c
> >>> @@ -642,7 +642,11 @@ void page_cache_async_ra(struct readahead_control *ractl,
> >>>                           1UL << order);
> >>>           if (index == expected) {
> >>>                   ra->start += ra->size;
> >>> -               ra->size = get_next_ra_size(ra, max_pages);
> >>> +               /*
> >>> +                * Allow the actual size to exceed the readahead window for a
> >>> +                * large folio.
> >>
> >> "a large folio" -> "with MADV_HUGEPAGE" ? Or can this be hit on
> >> different paths that are not covered in the patch description?
> >
> > This branch may also be triggered by other large folios that are not
> > necessarily order-9. Therefore, I’ve referred to it as a 'large folio'
> > rather than associating it specifically with MADV_HUGEPAGE. If we were
> > to handle only the MADV_HUGEPAGE case, we would proceed as outlined in
> > the initial RFC patch[0]. However, following Willy's recommendation, I
> > implemented it this way, as he likely has a deeper understanding of
> > the intended behavior.
>
> Sorry, but this code is getting quite confusing, especially with such
> misleading "large folio" comments.
>
> Even without MADV_HUGEPAGE we will be allocating large folios, as
> emphasized by Willy [1]. So the only thing MADV_HUGEPAGE controls is
> *which* large folios we allocate. .. as Willy says [2]: "We were only
> intending to breach the 'max' for the MADV_HUGE case, not for all cases."
>
> I have no idea how *anybody* should derive from the code here that we
> treat MADV_HUGEPAGE in a special way.
>
> Simply completely confusing.
>
> My interpretation of "I don't know if we should try to defend a stupid
> sysadmin against the consequences of their misconfiguration like this"
> means" would be "drop this patch and don't change anything".

Without this change, large folios won’t function as expected.
Currently, to support MADV_HUGEPAGE, you’d need to set readahead_kb to
2MB, 4MB, or more. However, many applications run without
MADV_HUGEPAGE, and a larger readahead_kb might not be optimal for
them.

>
> No changes to API, no confusing code.

New features like large folios can often create confusion with
existing rules or APIs, correct?

>
> Maybe pr_info_once() when someone uses MADV_HUGEPAGE with such backends
> to tell the sysadmin that something stupid is happening ...

It's not a flawed setup; it's just that this new feature doesn’t work
well with the existing settings, and updating those settings to
accommodate it isn't always feasible.
David Hildenbrand Nov. 12, 2024, 3:19 p.m. UTC | #9
>> Sorry, but this code is getting quite confusing, especially with such
>> misleading "large folio" comments.
>>
>> Even without MADV_HUGEPAGE we will be allocating large folios, as
>> emphasized by Willy [1]. So the only thing MADV_HUGEPAGE controls is
>> *which* large folios we allocate. .. as Willy says [2]: "We were only
>> intending to breach the 'max' for the MADV_HUGE case, not for all cases."
>>
>> I have no idea how *anybody* should derive from the code here that we
>> treat MADV_HUGEPAGE in a special way.
>>
>> Simply completely confusing.
>>
>> My interpretation of "I don't know if we should try to defend a stupid
>> sysadmin against the consequences of their misconfiguration like this"
>> means" would be "drop this patch and don't change anything".
> 
> Without this change, large folios won’t function as expected.
> Currently, to support MADV_HUGEPAGE, you’d need to set readahead_kb to
> 2MB, 4MB, or more. However, many applications run without
> MADV_HUGEPAGE, and a larger readahead_kb might not be optimal for them.

Someone configured: "Don't readahead more than 128KiB"

And then we complain why we "don't readahead more than 128KiB".

:)

"mm/filemap: Support VM_HUGEPAGE for file mappings" talks about "even if 
we have no history of readahead being successful".

So not about exceeding the configured limit, but exceeding the 
"readahead history".

So I consider VM_HUGEPAGE the sign here to "ignore readahead history" 
and not to "violate the config".

But that's just my opinion.

> 
>>
>> No changes to API, no confusing code.
> 
> New features like large folios can often create confusion with
> existing rules or APIs, correct?

We should not try making it even more confusing, if possible.

> 
>>
>> Maybe pr_info_once() when someone uses MADV_HUGEPAGE with such backends
>> to tell the sysadmin that something stupid is happening ...
> 
> It's not a flawed setup; it's just that this new feature doesn’t work
> well with the existing settings, and updating those settings to
> accommodate it isn't always feasible.
I don't agree. But it really is Willy's take.

The code, as it stands is confusing and nobody will be able to figure 
out how MADV_HUGEPAGE comes into play here and why we suddenly exceed 
"max/config" simply because "cur" is larger than max.

For example, in the code

ra->size = start - index;	/* old async_size */
ra->size += req_count;
ra->size = get_next_ra_size(ra, max_pages);

What happens if ra->size was at max, then we add "req_count" and 
suddenly we exceed "max" and say "well, sure that's fine now". Even if 
MADV_HUGEPAGE was never involved? Maybe it cannot happen, but it sure is 
confusing.


Not to mention that "It's worth noting that if read_ahead_kb is set to a 
larger value that isn't aligned with huge page sizes (e.g., 4MB + 
128KB), it may still fail to map to hugepages." sounds very odd :(
Yafang Shao Nov. 13, 2024, 2:16 a.m. UTC | #10
On Tue, Nov 12, 2024 at 11:19 PM David Hildenbrand <david@redhat.com> wrote:
>
> >> Sorry, but this code is getting quite confusing, especially with such
> >> misleading "large folio" comments.
> >>
> >> Even without MADV_HUGEPAGE we will be allocating large folios, as
> >> emphasized by Willy [1]. So the only thing MADV_HUGEPAGE controls is
> >> *which* large folios we allocate. .. as Willy says [2]: "We were only
> >> intending to breach the 'max' for the MADV_HUGE case, not for all cases."
> >>
> >> I have no idea how *anybody* should derive from the code here that we
> >> treat MADV_HUGEPAGE in a special way.
> >>
> >> Simply completely confusing.
> >>
> >> My interpretation of "I don't know if we should try to defend a stupid
> >> sysadmin against the consequences of their misconfiguration like this"
> >> means" would be "drop this patch and don't change anything".
> >
> > Without this change, large folios won’t function as expected.
> > Currently, to support MADV_HUGEPAGE, you’d need to set readahead_kb to
> > 2MB, 4MB, or more. However, many applications run without
> > MADV_HUGEPAGE, and a larger readahead_kb might not be optimal for them.
>
> Someone configured: "Don't readahead more than 128KiB"
>
> And then we complain why we "don't readahead more than 128KiB".

That is just bikeshedding.

So, what’s your suggestion? Simply setting readahead_kb to 2MB? That
would almost certainly cause issues elsewhere.

>
> :)
>
> "mm/filemap: Support VM_HUGEPAGE for file mappings" talks about "even if
> we have no history of readahead being successful".
>
> So not about exceeding the configured limit, but exceeding the
> "readahead history".
>
> So I consider VM_HUGEPAGE the sign here to "ignore readahead history"
> and not to "violate the config".

MADV_HUGEPAGE is definitely a new addition to readahead, and its
behavior isn’t yet defined in the documentation. All we need to do is
clarify its behavior there. The documentation isn’t set in stone—we
can update it as long as it doesn’t disrupt existing applications.

>
> But that's just my opinion.
>
> >
> >>
> >> No changes to API, no confusing code.
> >
> > New features like large folios can often create confusion with
> > existing rules or APIs, correct?
>
> We should not try making it even more confusing, if possible.

A quick tip for you: the readahead size already exceeds readahead_kb
even without MADV_HUGEPAGE. You might want to spend some time tracing
that behavior.

In summary, it’s really the readahead code itself that’s causing the
confusion—not MADV_HUGEPAGE.

>
> >
> >>
> >> Maybe pr_info_once() when someone uses MADV_HUGEPAGE with such backends
> >> to tell the sysadmin that something stupid is happening ...
> >
> > It's not a flawed setup; it's just that this new feature doesn’t work
> > well with the existing settings, and updating those settings to
> > accommodate it isn't always feasible.
> I don't agree. But it really is Willy's take.
>
> The code, as it stands is confusing and nobody will be able to figure
> out how MADV_HUGEPAGE comes into play here and why we suddenly exceed
> "max/config" simply because "cur" is larger than max.
>
> For example, in the code
>
> ra->size = start - index;       /* old async_size */
> ra->size += req_count;
> ra->size = get_next_ra_size(ra, max_pages);
>
> What happens if ra->size was at max, then we add "req_count" and
> suddenly we exceed "max" and say "well, sure that's fine now". Even if
> MADV_HUGEPAGE was never involved? Maybe it cannot happen, but it sure is
> confusing.
>
>
> Not to mention that "It's worth noting that if read_ahead_kb is set to a
> larger value that isn't aligned with huge page sizes (e.g., 4MB +
> 128KB), it may still fail to map to hugepages." sounds very odd :(

This will be beneficial if you're open to using MADV_HUGEPAGE in a
production environment.
Matthew Wilcox Nov. 13, 2024, 4:19 a.m. UTC | #11
On Tue, Nov 12, 2024 at 04:19:07PM +0100, David Hildenbrand wrote:
> Someone configured: "Don't readahead more than 128KiB"

Did they, though?  I have nothing but contempt for the thousands of
parameters that we expect sysadmins to configure.  It's ridiculous and
it needs to stop.  So, we listen to the program that has told us "We
want 2MB pages" and not to the sysadmin who hasn't changed the value of
readahead from one that was originally intended for floppy discs.

> "mm/filemap: Support VM_HUGEPAGE for file mappings" talks about "even if we
> have no history of readahead being successful".
> 
> So not about exceeding the configured limit, but exceeding the "readahead
> history".
> 
> So I consider VM_HUGEPAGE the sign here to "ignore readahead history" and
> not to "violate the config".
> 
> But that's just my opinion.

We're using the readahead code to accomplish what filemap wants.
That's an implementation detail and we shouldn't be constrained by the
limits which are in effect for normal readahead.

Normal readahead will never create a folio larger than the readahead
window.  It'll get to 256kB (eventually) and then it'll stop.  Indeed,
that was the problem this patch is addressing -- we started at 2MB then
readahead turned it down to 256kB.  Normal readahead will start at 16kB
sized folios.  This patch won't affect normal readahead at all.
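
To put rough numbers on that, here is a userspace mirror of the
get_next_ra_size() growth steps, for illustration only; it assumes 4KB base
pages, read_ahead_kb=128, and the 4MB initial window set up for VM_HUGEPAGE
faults:

#include <stdio.h>

/* Mirror of get_next_ra_size()'s growth logic, illustration only. */
static unsigned long next_ra_size(unsigned long cur, unsigned long max)
{
	if (cur < max / 16)
		return 4 * cur;
	if (cur <= max / 2)
		return 2 * cur;
	return max;			/* pre-patch behaviour */
}

int main(void)
{
	unsigned long max = 32;		/* read_ahead_kb = 128 -> 32 pages */
	unsigned long cur = 1024;	/* VM_HUGEPAGE fault window: 4MB = 1024 pages */

	/* Pre-patch, the helper clamps the window back to max (32 pages = 128KB). */
	printf("next window: %lu pages\n", next_ra_size(cur, max));	/* prints 32 */
	/* With "if (cur > max) return cur;" it would stay at 1024 pages. */
	return 0;
}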
David Hildenbrand Nov. 13, 2024, 8:12 a.m. UTC | #12
On 13.11.24 05:19, Matthew Wilcox wrote:
> On Tue, Nov 12, 2024 at 04:19:07PM +0100, David Hildenbrand wrote:
>> Someone configured: "Don't readahead more than 128KiB"
> 
> Did they, though?  I have nothing but contempt for the thousands of
> parameters that we expect sysadmins to configure.  It's ridiculous and
> it needs to stop.  So, we listen to the program that has told us "We
> want 2MB pages" and not to the sysadmin who hasn't changed the value of
> readahead from one that was originally intended for floppy discs.

If something can be achieved using MADV_HUGEPAGE but not using 
auto-tuning (ordinary readahead / no MADV_HUGEPAGE) it's a warning sign 
at least to me ...

FWIW, I agree that the parameters we have are confusing. But selectively 
ignoring them ... I don't know.

If the parameter is effectively useless on some devices (except floppy 
discs?) maybe it should be set to "0=auto" (and modification attempts 
failing/being ignored) or simply ignored for all of readahead.

Anyhow, your call. I'll see if I can make sense of the code and come up 
with a comment that explains the situation.
David Hildenbrand Nov. 13, 2024, 8:28 a.m. UTC | #13
On 13.11.24 03:16, Yafang Shao wrote:
> On Tue, Nov 12, 2024 at 11:19 PM David Hildenbrand <david@redhat.com> wrote:
>>
>>>> Sorry, but this code is getting quite confusing, especially with such
>>>> misleading "large folio" comments.
>>>>
>>>> Even without MADV_HUGEPAGE we will be allocating large folios, as
>>>> emphasized by Willy [1]. So the only thing MADV_HUGEPAGE controls is
>>>> *which* large folios we allocate. .. as Willy says [2]: "We were only
>>>> intending to breach the 'max' for the MADV_HUGE case, not for all cases."
>>>>
>>>> I have no idea how *anybody* should derive from the code here that we
>>>> treat MADV_HUGEPAGE in a special way.
>>>>
>>>> Simply completely confusing.
>>>>
>>>> My interpretation of "I don't know if we should try to defend a stupid
>>>> sysadmin against the consequences of their misconfiguration like this"
>>>> means" would be "drop this patch and don't change anything".
>>>
>>> Without this change, large folios won’t function as expected.
>>> Currently, to support MADV_HUGEPAGE, you’d need to set readahead_kb to
>>> 2MB, 4MB, or more. However, many applications run without
>>> MADV_HUGEPAGE, and a larger readahead_kb might not be optimal for them.
>>
>> Someone configured: "Don't readahead more than 128KiB"
>>
>> And then we complain why we "don't readahead more than 128KiB".
> 
> That is just bikeshielding.

It's called "reading the documentation and trying to make sense of a 
patch". ;)

> 
> So, what’s your suggestion? Simply setting readahead_kb to 2MB? That
> would almost certainly cause issues elsewhere.

I'm not 100% sure. I'm trying to make sense of it all.

And I assume there is a relevant difference now between "readahead 2M 
using all 4k pages" and "readahead 2M using a single large folio".

I agree that likely readahead using many 4k pages is a worse idea than 
just using a single large folio ... if we manage to allocate one. And 
it's all not that clear in the code ...

FWIW, I looked at "read_ahead_kb" values on my Fedora40 notebook and 
they are all set to 128KiB. I'm not so sure if they really should be 
that small ... or if large folio readahead code should just be able to 
exceed it.

>> "mm/filemap: Support VM_HUGEPAGE for file mappings" talks about "even if
>> we have no history of readahead being successful".
>>
>> So not about exceeding the configured limit, but exceeding the
>> "readahead history".
>>
>> So I consider VM_HUGEPAGE the sign here to "ignore readahead history"
>> and not to "violate the config".
> 
> MADV_HUGEPAGE is definitely a new addition to readahead, and its
> behavior isn’t yet defined in the documentation. All we need to do is
> clarify its behavior there. The documentation isn’t set in stone—we
> can update it as long as it doesn’t disrupt existing applications.

If Willy thinks this is the way to go, then we should document that 
MADV_HUGEPAGE may ignore the parameter, agreed.

I still don't understand your one comment:

"It's worth noting that if read_ahead_kb is set to a larger value that 
isn't aligned with huge page sizes (e.g., 4MB + 128KB), it may still 
fail to map to hugepages."

Do you mean that MADV_HUGEPAGE+read_ahead_kb<=4M will give you 2M pages, 
but MADV_HUGEPAGE+read_ahead_kb>4M won't? Or is this the case without 
MADV_HUGEPAGE?

If MADV_HUGEPAGE ignores read_ahead_kb completely, it's easy to document.

> 
>>
>> But that's just my opinion.
>>
>>>
>>>>
>>>> No changes to API, no confusing code.
>>>
>>> New features like large folios can often create confusion with
>>> existing rules or APIs, correct?
>>
>> We should not try making it even more confusing, if possible.
> 
> A quick tip for you: the readahead size already exceeds readahead_kb
> even without MADV_HUGEPAGE. You might want to spend some time tracing
> that behavior.

Care to save me some time and point me at what you mean?

> 
> In summary, it’s really the readahead code itself that’s causing the
> confusion—not MADV_HUGEPAGE.


Let me dig into the code in more detail if can make sense of it all.
David Hildenbrand Nov. 13, 2024, 9:46 a.m. UTC | #14
On 13.11.24 09:28, David Hildenbrand wrote:
> On 13.11.24 03:16, Yafang Shao wrote:
>> On Tue, Nov 12, 2024 at 11:19 PM David Hildenbrand <david@redhat.com> wrote:
>>>
>>>>> Sorry, but this code is getting quite confusing, especially with such
>>>>> misleading "large folio" comments.
>>>>>
>>>>> Even without MADV_HUGEPAGE we will be allocating large folios, as
>>>>> emphasized by Willy [1]. So the only thing MADV_HUGEPAGE controls is
>>>>> *which* large folios we allocate. .. as Willy says [2]: "We were only
>>>>> intending to breach the 'max' for the MADV_HUGE case, not for all cases."
>>>>>
>>>>> I have no idea how *anybody* should derive from the code here that we
>>>>> treat MADV_HUGEPAGE in a special way.
>>>>>
>>>>> Simply completely confusing.
>>>>>
>>>>> My interpretation of "I don't know if we should try to defend a stupid
>>>>> sysadmin against the consequences of their misconfiguration like this"
>>>>> means" would be "drop this patch and don't change anything".
>>>>
>>>> Without this change, large folios won’t function as expected.
>>>> Currently, to support MADV_HUGEPAGE, you’d need to set readahead_kb to
>>>> 2MB, 4MB, or more. However, many applications run without
>>>> MADV_HUGEPAGE, and a larger readahead_kb might not be optimal for them.
>>>
>>> Someone configured: "Don't readahead more than 128KiB"
>>>
>>> And then we complain why we "don't readahead more than 128KiB".
>>
>> That is just bikeshielding.
> 
> It's called "reading the documentation and trying to make sense of a
> patch". ;)
> 
>>
>> So, what’s your suggestion? Simply setting readahead_kb to 2MB? That
>> would almost certainly cause issues elsewhere.
> 
> I'm not 100% sure. I'm trying to make sense of it all.
> 
> And I assume there is a relevant difference now between "readahead 2M
> using all 4k pages" and "readahead 2M using a single large folio".
> 
> I agree that likely readahead using many 4k pages is a worse idea than
> just using a single large folio ... if we manage to allocate one. And
> it's all not that clear in the code ...
> 
> FWIW, I looked at "read_ahead_kb" values on my Fedora40 notebook and
> they are all set to 128KiB. I'm not so sure if they really should be
> that small ... or if large folio readahead code should just be able to
> exceed it.
> 
>>> "mm/filemap: Support VM_HUGEPAGE for file mappings" talks about "even if
>>> we have no history of readahead being successful".
>>>
>>> So not about exceeding the configured limit, but exceeding the
>>> "readahead history".
>>>
>>> So I consider VM_HUGEPAGE the sign here to "ignore readahead history"
>>> and not to "violate the config".
>>
>> MADV_HUGEPAGE is definitely a new addition to readahead, and its
>> behavior isn’t yet defined in the documentation. All we need to do is
>> clarify its behavior there. The documentation isn’t set in stone—we
>> can update it as long as it doesn’t disrupt existing applications.
> 
> If Willy thinks this is the way to go, then we should document that
> MADV_HUGEPAGE may ignore the parameter, agreed.
> 
> I still don't understand your one comment:
> 
> "It's worth noting that if read_ahead_kb is set to a larger value that
> isn't aligned with huge page sizes (e.g., 4MB + 128KB), it may still
> fail to map to hugepages."

I just played with your patch and your reproducer.

Setting read_ahead_kb to 0/128/4096 will give us PMDs with 
MADV_HUGEPAGE. So we're ignoring the parameter.

But with 4224 we are not ignoring the parameter with MADV_HUGEPAGE. :(

root@localhost:~#  cat /proc/`pgrep readhahead`/smaps_rollup
00400000-7ffe3a206000 ---p 00000000 00:00 0 
[rollup]
Rss:             1049728 kB
Pss:             1048700 kB
Pss_Dirty:        968576 kB
Pss_Anon:             84 kB
Pss_File:        1048616 kB
Pss_Shmem:             0 kB
Shared_Clean:       1056 kB
Shared_Dirty:          0 kB
Private_Clean:     80096 kB
Private_Dirty:    968576 kB
Referenced:      1049728 kB
Anonymous:            84 kB
KSM:                   0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:    135168 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
Yafang Shao Nov. 13, 2024, 9:54 a.m. UTC | #15
On Wed, Nov 13, 2024 at 4:28 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 13.11.24 03:16, Yafang Shao wrote:
> > On Tue, Nov 12, 2024 at 11:19 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >>>> Sorry, but this code is getting quite confusing, especially with such
> >>>> misleading "large folio" comments.
> >>>>
> >>>> Even without MADV_HUGEPAGE we will be allocating large folios, as
> >>>> emphasized by Willy [1]. So the only thing MADV_HUGEPAGE controls is
> >>>> *which* large folios we allocate. .. as Willy says [2]: "We were only
> >>>> intending to breach the 'max' for the MADV_HUGE case, not for all cases."
> >>>>
> >>>> I have no idea how *anybody* should derive from the code here that we
> >>>> treat MADV_HUGEPAGE in a special way.
> >>>>
> >>>> Simply completely confusing.
> >>>>
> >>>> My interpretation of "I don't know if we should try to defend a stupid
> >>>> sysadmin against the consequences of their misconfiguration like this"
> >>>> means" would be "drop this patch and don't change anything".
> >>>
> >>> Without this change, large folios won’t function as expected.
> >>> Currently, to support MADV_HUGEPAGE, you’d need to set readahead_kb to
> >>> 2MB, 4MB, or more. However, many applications run without
> >>> MADV_HUGEPAGE, and a larger readahead_kb might not be optimal for them.
> >>
> >> Someone configured: "Don't readahead more than 128KiB"
> >>
> >> And then we complain why we "don't readahead more than 128KiB".
> >
> > That is just bikeshedding.
>
> It's called "reading the documentation and trying to make sense of a
> patch". ;)
>
> >
> > So, what’s your suggestion? Simply setting readahead_kb to 2MB? That
> > would almost certainly cause issues elsewhere.
>
> I'm not 100% sure. I'm trying to make sense of it all.
>
> And I assume there is a relevant difference now between "readahead 2M
> using all 4k pages" and "readahead 2M using a single large folio".
>
> I agree that likely readahead using many 4k pages is a worse idea than
> just using a single large folio ... if we manage to allocate one. And
> it's all not that clear in the code ...
>
> FWIW, I looked at "read_ahead_kb" values on my Fedora40 notebook and
> they are all set to 128KiB. I'm not so sure if they really should be
> that small ...

It depends on the use case. For our Hadoop servers, we set it to 4MB,
as they prioritize throughput over latency. However, for our
Kubernetes servers, we keep it at 128KB since those services are more
latency-sensitive, and increasing it could lead to more frequent
latency spikes.

> or if large folio readahead code should just be able to
> exceed it.
>
> >> "mm/filemap: Support VM_HUGEPAGE for file mappings" talks about "even if
> >> we have no history of readahead being successful".
> >>
> >> So not about exceeding the configured limit, but exceeding the
> >> "readahead history".
> >>
> >> So I consider VM_HUGEPAGE the sign here to "ignore readahead history"
> >> and not to "violate the config".
> >
> > MADV_HUGEPAGE is definitely a new addition to readahead, and its
> > behavior isn’t yet defined in the documentation. All we need to do is
> > clarify its behavior there. The documentation isn’t set in stone—we
> > can update it as long as it doesn’t disrupt existing applications.
>
> If Willy thinks this is the way to go, then we should document that
> MADV_HUGEPAGE may ignore the parameter, agreed.

I'll submit an additional patch to update the documentation for MADV_HUGEPAGE.

>
> I still don't understand your one comment:
>
> "It's worth noting that if read_ahead_kb is set to a larger value that
> isn't aligned with huge page sizes (e.g., 4MB + 128KB), it may still
> fail to map to hugepages."
>
> Do you mean that MADV_HUGEPAGE+read_ahead_kb<=4M will give you 2M pages,
> but MADV_HUGEPAGE+read_ahead_kb>4M won't? Or is this the case without
> MADV_HUGEPAGE?

Typically, users set read_ahead_kb to aligned sizes, such as 128KB,
256KB, 512KB, 1MB, 2MB, 4MB, or 8MB. With this patch, MADV_HUGEPAGE
functions well for all these settings. However, if read_ahead_kb is
set to a non-hugepage-aligned size (e.g., 4MB + 128KB), MADV_HUGEPAGE
won’t work. This is because the initial readahead size for
MADV_HUGEPAGE is set to 4MB, as established in commit 4687fdbb805a:

   ra->size = HPAGE_PMD_NR;
   if (!(vmf->vma->vm_flags & VM_RAND_READ))
       ra->size *= 2;

However, as Willy noted, non-aligned settings are quite stupid, so we
should disregard them.
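
To make the arithmetic concrete, here is a small user-space model of
get_next_ra_size() with the two lines this patch adds (the hunk is at the
end of this page). Only the branch structure is taken from mm/readahead.c;
the standalone framing and the printed examples are illustrative.

#include <stdio.h>

/* Model of get_next_ra_size(); cur and max are in 4KB pages. */
static unsigned long next_ra_size(unsigned long cur, unsigned long max)
{
	if (cur < max / 16)
		return 4 * cur;
	if (cur <= max / 2)
		return 2 * cur;
	if (cur > max)		/* added by this patch: keep the VM_HUGEPAGE size */
		return cur;
	return max;
}

int main(void)
{
	/* Initial ra->size from the VM_HUGEPAGE fault path:
	 * HPAGE_PMD_NR * 2 = 1024 pages = 4MB. */
	unsigned long init_size = 1024;

	/* read_ahead_kb = 128KB, i.e. max = 32 pages: 1024 > 32, so the
	 * 4MB window is preserved and PMD mapping works. */
	printf("max=32   -> %lu pages\n", next_ra_size(init_size, 32));

	/* read_ahead_kb = 4MB + 128KB, i.e. max = 1056 pages: 1024 is not
	 * greater than 1056, so the window is clamped to 1056 pages, which
	 * is not hugepage-aligned, matching the failure described above. */
	printf("max=1056 -> %lu pages\n", next_ra_size(init_size, 1056));
	return 0;
}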

>
> If MADV_HUGEPAGE ignores read_ahead_kb completely, it's easy to document.

Perhaps, but documenting the behavior of every unusual setting doesn’t
seem practical.

>
> >
> >>
> >> But that's just my opinion.
> >>
> >>>
> >>>>
> >>>> No changes to API, no confusing code.
> >>>
> >>> New features like large folios can often create confusion with
> >>> existing rules or APIs, correct?
> >>
> >> We should not try making it even more confusing, if possible.
> >
> > A quick tip for you: the readahead size already exceeds readahead_kb
> > even without MADV_HUGEPAGE. You might want to spend some time tracing
> > that behavior.
>
> Care to save me some time and point me at what you mean?

I reached this conclusion by tracing ra->size in each
page_cache_ra_order() call, but I’m not fully equipped to provide all
the details ;)

--
Regards

Yafang
David Hildenbrand Nov. 13, 2024, 10:24 a.m. UTC | #16
>>
>> FWIW, I looked at "read_ahead_kb" values on my Fedora40 notebook and
>> they are all set to 128KiB. I'm not so sure if they really should be
>> that small ...
> 
> It depends on the use case. For our Hadoop servers, we set it to 4MB,
> as they prioritize throughput over latency. However, for our
> Kubernetes servers, we keep it at 128KB since those services are more
> latency-sensitive, and increasing it could lead to more frequent
> latency spikes.

Thanks for sharing.

> 
>> or if large folio readahead code should just be able to
>> exceed it.
>>
>>>> "mm/filemap: Support VM_HUGEPAGE for file mappings" talks about "even if
>>>> we have no history of readahead being successful".
>>>>
>>>> So not about exceeding the configured limit, but exceeding the
>>>> "readahead history".
>>>>
>>>> So I consider VM_HUGEPAGE the sign here to "ignore readahead history"
>>>> and not to "violate the config".
>>>
>>> MADV_HUGEPAGE is definitely a new addition to readahead, and its
>>> behavior isn’t yet defined in the documentation. All we need to do is
>>> clarify its behavior there. The documentation isn’t set in stone—we
>>> can update it as long as it doesn’t disrupt existing applications.
>>
>> If Willy thinks this is the way to go, then we should document that
>> MADV_HUGEPAGE may ignore the parameter, agreed.
> 
> I'll submit an additional patch to update the documentation for MADV_HUGEPAGE.
> 
>>
>> I still don't understand your one comment:
>>
>> "It's worth noting that if read_ahead_kb is set to a larger value that
>> isn't aligned with huge page sizes (e.g., 4MB + 128KB), it may still
>> fail to map to hugepages."
>>
>> Do you mean that MADV_HUGEPAGE+read_ahead_kb<=4M will give you 2M pages,
>> but MADV_HUGEPAGE+read_ahead_kb>4M won't? Or is this the case without
>> MADV_HUGEPAGE?
> 
> Typically, users set read_ahead_kb to aligned sizes, such as 128KB,
> 256KB, 512KB, 1MB, 2MB, 4MB, or 8MB. With this patch, MADV_HUGEPAGE
> functions well for all these settings. However, if read_ahead_kb is
> set to a non-hugepage-aligned size (e.g., 4MB + 128KB), MADV_HUGEPAGE
> won’t work. This is because the initial readahead size for
> MADV_HUGEPAGE is set to 4MB, as established in commit 4687fdbb805a:
> 
>     ra->size = HPAGE_PMD_NR;
>     if (!(vmf->vma->vm_flags & VM_RAND_READ))
>         ra->size *= 2;
> 
> However, as Willy noted, non-aligned settings are quite stupid, so we
> should disregard them.

Right. What I've been wondering, to make this code easier to understand, 
is whether there should be some kind of ra->size_fixed=true parameter that 
tells the readahead code to simply not mess with ra->size until something 
changes. (below)

[...]

>>> A quick tip for you: the readahead size already exceeds readahead_kb
>>> even without MADV_HUGEPAGE. You might want to spend some time tracing
>>> that behavior.
>>
>> Care to save me some time and point me at what you mean?
> 
> I reached this conclusion by tracing ra->size in each
> page_cache_ra_order() call, but I’m not fully equipped to provide all
> the details ;)

I've been staring at the readahead code for 30 minutes and I am still 
lost. Either I'm too stupid for the code or the code is too complicated.


If we'd have something like

ra->start += ra->size;
/*
 * If someone like VM_HUGEPAGE fixed the size, just don't mess with it.
 */
if (!ra->size_fixed)
	ra->size = get_next_ra_size(ra, max_pages);
ra->async_size = ra->size;

It would be a lot clearer at least to me -- and we'd likely be able to 
get rid of the "unaligned readhahead" oddity.

If we'd grep for who sets "size_fixed", we could even figure out how all 
of this belongs together.
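
To make that idea a bit more concrete, below is a minimal sketch of where
such a flag could be set, modelled on the ra->size assignment quoted from
commit 4687fdbb805a. The field, helper, and structure names are assumptions
for illustration; none of them exist in the kernel today.

#include <stdbool.h>

#define HPAGE_PMD_NR_SKETCH	512	/* 2MB PMD in 4KB pages */

struct ra_state_sketch {
	unsigned long start;
	unsigned int size;
	unsigned int async_size;
	bool size_fixed;	/* hypothetical: keep ra->size as set here */
};

static void hugepage_fault_ra(struct ra_state_sketch *ra, bool rand_read)
{
	ra->size = HPAGE_PMD_NR_SKETCH;
	if (!rand_read)
		ra->size *= 2;
	ra->async_size = HPAGE_PMD_NR_SKETCH;
	/*
	 * New: while this is set, later readahead rounds would skip
	 * get_next_ra_size() instead of clamping to the read_ahead_kb max.
	 */
	ra->size_fixed = true;
}

When and how the flag gets cleared again (for example once the access
pattern is no longer sequential) would still be part of the open question
here.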
diff mbox series

Patch

diff --git a/mm/readahead.c b/mm/readahead.c
index 3dc6c7a128dd..9b8a48e736c6 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -385,6 +385,8 @@  static unsigned long get_next_ra_size(struct file_ra_state *ra,
 		return 4 * cur;
 	if (cur <= max / 2)
 		return 2 * cur;
+	if (cur > max)
+		return cur;
 	return max;
 }