[v4,0/4] mm: introduce THP deferred setting

Message ID 20250417001846.81480-1-npache@redhat.com (mailing list archive)

Nico Pache April 17, 2025, 12:18 a.m. UTC
This series is a follow-up to [1], which adds mTHP support to khugepaged.
mTHP khugepaged support is a "loose" dependency for the sysfs/sysctl
configs to make sense. Without it, the global="defer" and mTHP="inherit"
case is "undefined" behavior.

We've seen cases where customers switching from RHEL7 to RHEL8 see a
significant increase in the memory footprint for the same workloads.

Through our investigations we found that a large contributing factor to
the increase in RSS was an increase in THP usage.

For workloads like MySQL, or when using allocators like jemalloc, it is
often recommended to set /transparent_hugepages/enabled=never. This is
in part due to performance degradations and increased memory waste.
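
For reference, a minimal Python sketch of how the active THP mode could be
read back before and after changing it. It assumes the usual sysfs
location /sys/kernel/mm/transparent_hugepage/enabled and the bracketed
"always [madvise] never" format; the helper names are illustrative only,
and "defer" would only show up on a kernel carrying this series.

```python
from pathlib import Path

# Usual sysfs location of the global THP setting (assumed here; the path
# in the text above is abbreviated).
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_enabled(text):
    """Return (active_mode, all_modes) from e.g. 'always [madvise] never'."""
    active = None
    modes = []
    for token in text.split():
        # The active mode is the one wrapped in square brackets.
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        modes.append(token)
    return active, modes


def current_thp_mode(path=THP_ENABLED):
    """Read the active mode from sysfs; None if the file cannot be read."""
    try:
        return parse_thp_enabled(path.read_text())[0]
    except OSError:
        return None
```

Calling current_thp_mode() on a system without THP support returns None
instead of raising, which keeps the helper usable in test scripts.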

This series introduces enabled=defer. This setting acts as a middle
ground between always and madvise. If the mapping is MADV_HUGEPAGE, the
page fault handler will act normally, making a hugepage if possible. If
the mapping is not MADV_HUGEPAGE, then the page fault handler will
default to the base size allocation. The caveat is that khugepaged can
still operate on pages that are not MADV_HUGEPAGE.
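
The behavior described above can be summarized as a small decision table.
The following Python sketch is a simplified model, not the kernel code:
ThpPolicy and thp_policy() are hypothetical names, and the model ignores
MADV_NOHUGEPAGE, prctl overrides, alignment, and the per-size mTHP knobs.

```python
from dataclasses import dataclass

MODES = ("always", "defer", "madvise", "never")


@dataclass(frozen=True)
class ThpPolicy:
    fault_huge: bool   # page fault handler may allocate a THP directly
    khugepaged: bool   # khugepaged may later collapse the range


def thp_policy(mode, madv_hugepage):
    """Simplified model of the global 'enabled' setting for one mapping."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    if mode == "never":
        return ThpPolicy(False, False)
    if mode == "always":
        return ThpPolicy(True, True)
    if mode == "madvise":
        return ThpPolicy(madv_hugepage, madv_hugepage)
    # defer: the fault path behaves like madvise, khugepaged like always
    return ThpPolicy(madv_hugepage, True)
```

In this model, thp_policy("defer", False) yields a mapping that faults in
base pages but remains a candidate for khugepaged, which is the middle
ground the setting is meant to provide.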

This allows for three things. First, applications specifically designed
to use hugepages will get them. Second, applications that don't use
hugepages can still benefit from them without THPs being aggressively
inserted at every possible chance. This curbs the memory waste and defers
the use of hugepages to khugepaged, which can then scan memory for
ranges eligible for collapse. Lastly, there is the added benefit for
those who want THPs but experience higher-latency PFs: you now get base
page performance at the PF handler and hugepage performance for those
mappings after they collapse.

Admins may want to lower max_ptes_none; otherwise, khugepaged may
aggressively collapse single allocations into hugepages.
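
To illustrate why max_ptes_none matters, here is a rough Python model of
the PMD-level check. It assumes 4 KiB base pages with 512 PTEs per PMD and
the upstream default of 511; the function names are made up for this
sketch, and the real khugepaged scan also considers swap and shared PTEs,
page age, and (with [1]) smaller mTHP orders.

```python
PTES_PER_PMD = 512           # 2 MiB PMD / 4 KiB base pages (assumption)
DEFAULT_MAX_PTES_NONE = 511  # upstream default: PTES_PER_PMD - 1


def may_collapse(present_ptes, max_ptes_none=DEFAULT_MAX_PTES_NONE):
    """True if a PMD-sized range passes the max_ptes_none check."""
    if not 0 <= present_ptes <= PTES_PER_PMD:
        raise ValueError("present_ptes out of range")
    none_ptes = PTES_PER_PMD - present_ptes
    # A fully empty range is never collapsed in this model.
    return present_ptes > 0 and none_ptes <= max_ptes_none


def min_present_to_collapse(max_ptes_none):
    """Fewest populated PTEs that still allow a collapse."""
    return max(1, PTES_PER_PMD - max_ptes_none)
```

With the default, a single populated page is enough for the range to be
collapsed. With max_ptes_none=64, as used in the test configuration below,
at least 448 of the 512 PTEs must be populated first.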

TESTING:
- Built for x86_64, aarch64, ppc64le, and s390x
- selftests mm
- In [1] I provided a script [2] that has multiple access patterns
- lots of general use.
- redis testing. This test was my original case for the defer mode. What I
   was able to prove was that THP=always leads to increased max_latency
   cases, which is why it is recommended to disable THPs for redis servers.
   However, with 'defer' we don't have the max_latency spikes and can still
   get the system to utilize THPs. I further tested this with the mTHP
   defer setting and found that redis (and probably other jemalloc users)
   can utilize THPs via defer (+mTHP defer) without a large latency
   penalty and with some potential gains. I uploaded some mmtest results
   here [3], which compare:
       stock+thp=never
       stock+(m)thp=always
       khugepaged-mthp + defer (max_ptes_none=64)

  The results show that (m)THPs can cause some throughput regression in
  some cases, but also have gains in other cases. The mTHP+defer results
  have more gains and fewer losses than the (m)THP=always case.

V4 Changes:
- Minor Documentation fixes
- rebased the dependent series [1] onto mm-unstable
    commit 0e68b850b1d3 ("vmalloc: use atomic_long_add_return_relaxed()")

V3 Changes:
- moved some Documentation to the other series and merged the remaining
   Documentation updates into one

V2 Changes:
- rebase changes on top of the mTHP khugepaged support series
- Fix selftests parsing issue
- add mTHP defer option
- add mTHP defer Documentation

[1] - https://lore.kernel.org/lkml/20250417000238.74567-1-npache@redhat.com/
[2] - https://gitlab.com/npache/khugepaged_mthp_test
[3] - https://people.redhat.com/npache/mthp_khugepaged_defer/testoutput2/output.html

Nico Pache (4):
  mm: defer THP insertion to khugepaged
  mm: document (m)THP defer usage
  khugepaged: add defer option to mTHP options
  selftests: mm: add defer to thp setting parser

 Documentation/admin-guide/mm/transhuge.rst | 31 +++++++---
 include/linux/huge_mm.h                    | 18 +++++-
 mm/huge_memory.c                           | 69 +++++++++++++++++++---
 mm/khugepaged.c                            | 10 ++--
 tools/testing/selftests/mm/thp_settings.c  |  1 +
 tools/testing/selftests/mm/thp_settings.h  |  1 +
 6 files changed, 107 insertions(+), 23 deletions(-)

Comments

Andrew Morton April 17, 2025, 11:11 p.m. UTC | #1
On Wed, 16 Apr 2025 18:18:42 -0600 Nico Pache <npache@redhat.com> wrote:

> This series is a follow-up to [1], which adds mTHP support to khugepaged.
> mTHP khugepaged support is a "loose" dependency for the sysfs/sysctl
> configs to make sense. Without it global="defer" and  mTHP="inherit" case
> is "undefined" behavior.
> 
> We've seen cases were customers switching from RHEL7 to RHEL8 see a
> significant increase in the memory footprint for the same workloads.
> 
> Through our investigations we found that a large contributing factor to
> the increase in RSS was an increase in THP usage.
> 
> For workloads like MySQL, or when using allocators like jemalloc, it is
> often recommended to set /transparent_hugepages/enabled=never. This is
> in part due to performance degradations and increased memory waste.
> 
> This series introduces enabled=defer, this setting acts as a middle
> ground between always and madvise. If the mapping is MADV_HUGEPAGE, the
> page fault handler will act normally, making a hugepage if possible. If
> the allocation is not MADV_HUGEPAGE, then the page fault handler will
> default to the base size allocation. The caveat is that khugepaged can
> still operate on pages thats not MADV_HUGEPAGE.
> 
> This allows for three things... one, applications specifically designed to
> use hugepages will get them, and two, applications that don't use
> hugepages can still benefit from them without aggressively inserting
> THPs at every possible chance. This curbs the memory waste, and defers
> the use of hugepages to khugepaged. Khugepaged can then scan the memory
> for eligible collapsing. Lastly there is the added benefit for those who
> want THPs but experience higher latency PFs. Now you can get base page
> performance at the PF handler and Hugepage performance for those mappings
> after they collapse.
> 
> Admins may want to lower max_ptes_none, if not, khugepaged may
> aggressively collapse single allocations into hugepages.
> 
> TESTING:
> - Built for x86_64, aarch64, ppc64le, and s390x
> - selftests mm
> - In [1] I provided a script [2] that has multiple access patterns

Namely https://gitlab.com/npache/khugepaged_mthp_test?

Looks useful and could perhaps be directly linked to from this
patchset's [0/N] changelog?