[v6,0/9] variable-order, large folios for anonymous memory

Message ID	20230929114421.3761121-1-ryan.roberts@arm.com (mailing list archive)
Headers	show Return-Path: <linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org> From: Ryan Roberts <ryan.roberts@arm.com> To: Andrew Morton <akpm@linux-foundation.org>, Matthew Wilcox <willy@infradead.org>, Yin Fengwei <fengwei.yin@intel.com>, David Hildenbrand <david@redhat.com>, Yu Zhao <yuzhao@google.com>, Catalin Marinas <catalin.marinas@arm.com>, Anshuman Khandual <anshuman.khandual@arm.com>, Yang Shi <shy828301@gmail.com>, "Huang, Ying" <ying.huang@intel.com>, Zi Yan <ziy@nvidia.com>, Luis Chamberlain <mcgrof@kernel.org>, Itaru Kitayama <itaru.kitayama@gmail.com>, "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>, John Hubbard <jhubbard@nvidia.com>, David Rientjes <rientjes@google.com>, Vlastimil Babka <vbabka@suse.cz>, Hugh Dickins <hughd@google.com> Cc: Ryan Roberts <ryan.roberts@arm.com>, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org Subject: [PATCH v6 0/9] variable-order, large folios for anonymous memory Date: Fri, 29 Sep 2023 12:44:11 +0100 Message-Id: <20230929114421.3761121-1-ryan.roberts@arm.com> MIME-Version: 1.0 Precedence: list Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org
Series	variable-order, large folios for anonymous memory \| expand [v6,0/9] variable-order, large folios for anonymous memory [v6,1/9] mm: Allow deferred splitting of arbitrary anon large folios [v6,2/9] mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap() [v6,3/9] mm: thp: Account pte-mapped anonymous THP usage [v6,4/9] mm: thp: Introduce anon_orders and anon_always_mask sysfs files [v6,5/9] mm: thp: Extend THP to allocate anonymous large folios [v6,6/9] mm: thp: Add "recommend" option for anon_orders [v6,7/9] arm64/mm: Override arch_wants_pte_order() [v6,8/9] selftests/mm/cow: Generalize do_run_with_thp() helper [v6,9/9] selftests/mm/cow: Add tests for small-order anon THP

Ryan Roberts Sept. 29, 2023, 11:44 a.m. UTC

Hi All,

This is v6 of a series to implement variable order, large folios for anonymous
memory. (previously called "ANON_LARGE_FOLIO", "LARGE_ANON_FOLIO",
"FLEXIBLE_THP", but now exposed as an extension to THP; "small-order THP"). The
objective of this is to improve performance by allocating larger chunks of
memory during anonymous page faults:

1) Since SW (the kernel) is dealing with larger chunks of memory than base
   pages, there are efficiency savings to be had; fewer page faults, batched PTE
   and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel
   overhead. This should benefit all architectures.
2) Since we are now mapping physically contiguous chunks of memory, we can take
   advantage of HW TLB compression techniques. A reduction in TLB pressure
   speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
   TLB entries; "the contiguous bit" (architectural) and HPA (uarch).

The major change in this revision is the addition of sysfs controls to allow
this "small-order THP" to be enabled/disabled/configured independently of
PMD-order THP. The approach I've taken differs a bit from previous discussions;
instead of creating a whole new interface ("large_folio"), I'm extending THP. I
personally think this makes things clearer and more extensible. See [6] for
detailed rationale.

Because we now have runtime enable/disable control, I've removed the compile
time Kconfig switch. It still defaults to runtime-disabled.

NOTE: These changes should not be merged until the prerequisites are complete.
These are in progress and tracked at [7].

This series is based on mm-hotfixes-unstable (f9911db48293).


Testing
=======

This version adds patches to mm selftests so that the cow tests explicitly test
small-order THP, in the same way that PMD-order THP is tested. The new tests all
pass, and no regressions are observed in the mm selftest suite.


Performance
===========

The below tables show performance and memory data for selected workloads, with
different small-order THPs enabled. All configs are compared to a 4k page kernel
with small-order THP disabled. 16k and 64k (with small-order THP disabled)
kernels are included to aid the comparison. All kernels built from the same
source; mm-hotfixes-unstable (f9911db48293) + this series.

4k-page-16k-folio: 16k (order-2) THP enabled
4k-page-32k-folio: 32k+16k (order-3, order-2) THP enabled
4k-page-64k-folio: 64k+32k+16k (order-4, order-3, order-2) THP enabled

Running on Ampere Altra with 1 NUMA node enabled, Ubuntu 22.04, XFS filesystem
20 repeats across 5 reboots (with 1 warmup run after each reboot)
Run in its own cgroup and read memory.peak after completion


Kernel Compilation with 8 jobs: (make defconfig && make -s -j8 Image)
(smaller is better):

| kernel            |   real-time |   kern-time |   user-time | memory |
|:------------------|------------:|------------:|------------:|-------:|
| baseline-4k-page  |        0.0% |        0.0% |        0.0% |   0.0% |
| 16k-page          |       -9.0% |      -49.7% |       -4.0% |   6.2% |
| 64k-page          |      -11.9% |      -66.5% |       -5.0% |  28.3% |
| 4k-page-16k-folio |       -2.8% |      -23.0% |       -0.3% |   0.0% |
| 4k-page-32k-folio |       -4.0% |      -32.0% |       -0.6% |   0.1% |
| 4k-page-64k-folio |       -4.6% |      -37.9% |       -0.5% |   0.1% |


Kernel Compilation with 80 jobs: (make defconfig && make -s -j80 Image)
(smaller is better):

| kernel            |   real-time |   kern-time |   user-time | memory |
|:------------------|------------:|------------:|------------:|:-------|
| baseline-4k-page  |        0.0% |        0.0% |        0.0% |   0.0% |
| 16k-page          |       -9.2% |      -52.1% |       -3.6% |   4.6% |
| 64k-page          |      -11.4% |      -66.4% |       -3.0% |  12.6% |
| 4k-page-16k-folio |       -3.2% |      -22.8% |       -0.3% |   2.7% |
| 4k-page-32k-folio |       -4.8% |      -37.1% |       -0.5% |   2.9% |
| 4k-page-64k-folio |       -5.0% |      -42.1% |       -0.3% |   3.4% |


Speedometer 2.0: Running on Chromium automated with Selenium
(bigger is better for runs_per_min, smaller is better for memory):

| kernel            |   runs_per_min | memory |
|:------------------|---------------:|-------:|
| baseline-4k-page  |           0.0% |   0.0% |
| 16k-page          |           5.9% |  10.6% |
| 4k-page-16k-folio |           1.0% |  -0.6% |
| 4k-page-32k-folio |           1.3% |   3.5% |
| 4k-page-64k-folio |           1.3% |   6.4% |


Changes since v5 [5]
====================

  - Added accounting for PTE-mapped THPs (patch 3)
  - Added runtime control mechanism via sysfs as extension to THP (patch 4)
  - Minor refactoring of alloc_anon_folio() to integrate with runtime controls
  - Stripped out hardcoded policy for allocation order; its now all user space
    controlled (although user space can request "recommend" which will configure
    the HW-preferred order)


Changes since v4 [4]
====================

  - Removed "arm64: mm: Override arch_wants_pte_order()" patch; arm64
    now uses the default order-3 size. I have moved this patch over to
    the contpte series.
  - Added "mm: Allow deferred splitting of arbitrary large anon folios" back
    into series. I originally removed this at v2 to add to a separate series,
    but that series has transformed significantly and it no longer fits, so
    bringing it back here.
  - Reintroduced dependency on set_ptes(); Originally dropped this at v2, but
    set_ptes() is in mm-unstable now.
  - Updated policy for when to allocate LAF; only fallback to order-0 if
    MADV_NOHUGEPAGE is present or if THP disabled via prctl; no longer rely on
    sysfs's never/madvise/always knob.
  - Fallback to order-0 whenever uffd is armed for the vma, not just when
    uffd-wp is set on the pte.
  - alloc_anon_folio() now returns `struct folio *`, where errors are encoded
    with ERR_PTR().

  The last 3 changes were proposed by Yu Zhao - thanks!


Changes since v3 [3]
====================

  - Renamed feature from FLEXIBLE_THP to LARGE_ANON_FOLIO.
  - Removed `flexthp_unhinted_max` boot parameter. Discussion concluded that a
    sysctl is preferable but we will wait until real workload needs it.
  - Fixed uninitialized `addr` on read fault path in do_anonymous_page().
  - Added mm selftests for large anon folios in cow test suite.


Changes since v2 [2]
====================

  - Dropped commit "Allow deferred splitting of arbitrary large anon folios"
      - Huang, Ying suggested the "batch zap" work (which I dropped from this
        series after v1) is a prerequisite for merging FLXEIBLE_THP, so I've
        moved the deferred split patch to a separate series along with the batch
        zap changes. I plan to submit this series early next week.
  - Changed folio order fallback policy
      - We no longer iterate from preferred to 0 looking for acceptable policy
      - Instead we iterate through preferred, PAGE_ALLOC_COSTLY_ORDER and 0 only
  - Removed vma parameter from arch_wants_pte_order()
  - Added command line parameter `flexthp_unhinted_max`
      - clamps preferred order when vma hasn't explicitly opted-in to THP
  - Never allocate large folio for MADV_NOHUGEPAGE vma (or when THP is disabled
    for process or system).
  - Simplified implementation and integration with do_anonymous_page()
  - Removed dependency on set_ptes()


Changes since v1 [1]
====================

  - removed changes to arch-dependent vma_alloc_zeroed_movable_folio()
  - replaced with arch-independent alloc_anon_folio()
      - follows THP allocation approach
  - no longer retry with intermediate orders if allocation fails
      - fallback directly to order-0
  - remove folio_add_new_anon_rmap_range() patch
      - instead add its new functionality to folio_add_new_anon_rmap()
  - remove batch-zap pte mappings optimization patch
      - remove enabler folio_remove_rmap_range() patch too
      - These offer real perf improvement so will submit separately
  - simplify Kconfig
      - single FLEXIBLE_THP option, which is independent of arch
      - depends on TRANSPARENT_HUGEPAGE
      - when enabled default to max anon folio size of 64K unless arch
        explicitly overrides
  - simplify changes to do_anonymous_page():
      - no more retry loop


[1] https://lore.kernel.org/linux-mm/20230626171430.3167004-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-mm/20230703135330.1865927-1-ryan.roberts@arm.com/
[3] https://lore.kernel.org/linux-mm/20230714160407.4142030-1-ryan.roberts@arm.com/
[4] https://lore.kernel.org/linux-mm/20230726095146.2826796-1-ryan.roberts@arm.com/
[5] https://lore.kernel.org/linux-mm/20230810142942.3169679-1-ryan.roberts@arm.com/
[6] https://lore.kernel.org/linux-mm/1b03f4d6-634d-4786-81a0-5a104799b125@arm.com/
[7] https://lore.kernel.org/linux-mm/f8d47176-03a8-99bf-a813-b5942830fd73@arm.com/


Thanks,
Ryan


Ryan Roberts (9):
  mm: Allow deferred splitting of arbitrary anon large folios
  mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()
  mm: thp: Account pte-mapped anonymous THP usage
  mm: thp: Introduce anon_orders and anon_always_mask sysfs files
  mm: thp: Extend THP to allocate anonymous large folios
  mm: thp: Add "recommend" option for anon_orders
  arm64/mm: Override arch_wants_pte_order()
  selftests/mm/cow: Generalize do_run_with_thp() helper
  selftests/mm/cow: Add tests for small-order anon THP

 Documentation/ABI/testing/procfs-smaps_rollup |   1 +
 .../admin-guide/cgroup-v1/memory.rst          |   5 +-
 Documentation/admin-guide/cgroup-v2.rst       |   6 +-
 Documentation/admin-guide/mm/transhuge.rst    |  96 ++++++-
 Documentation/filesystems/proc.rst            |  20 +-
 arch/arm64/include/asm/pgtable.h              |  10 +
 drivers/base/node.c                           |   2 +
 fs/proc/meminfo.c                             |   2 +
 fs/proc/task_mmu.c                            |   7 +-
 include/linux/huge_mm.h                       |  95 +++++--
 include/linux/mmzone.h                        |   1 +
 include/linux/pgtable.h                       |  13 +
 mm/huge_memory.c                              | 172 ++++++++++--
 mm/khugepaged.c                               |  18 +-
 mm/memcontrol.c                               |   8 +
 mm/memory.c                                   | 114 +++++++-
 mm/page_vma_mapped.c                          |   3 +-
 mm/rmap.c                                     |  42 ++-
 mm/show_mem.c                                 |   2 +
 mm/vmstat.c                                   |   1 +
 tools/testing/selftests/mm/cow.c              | 244 +++++++++++++-----
 21 files changed, 696 insertions(+), 166 deletions(-)

--
2.25.1

David Hildenbrand Oct. 6, 2023, 8:06 p.m. UTC | #1

On 29.09.23 13:44, Ryan Roberts wrote:
> Hi All,

Let me highlight some core decisions on the things discussed in the 
previous alignment meetings, and comment on them.

> 
> This is v6 of a series to implement variable order, large folios for anonymous
> memory. (previously called "ANON_LARGE_FOLIO", "LARGE_ANON_FOLIO",
> "FLEXIBLE_THP", but now exposed as an extension to THP; "small-order THP"). The
> objective of this is to improve performance by allocating larger chunks of
> memory during anonymous page faults:

Change number 1: Let's call these things THP.

Fine with me; I previously rooted for that but was told that end users 
could be confused. I think the important bit is that we don't mess up 
the stats, and when we talk about THP we default to "PMD-sized THP", 
unless we explicitly include the other ones.

I dislike exposing "orders" to the users, I'm happy to be convinced why 
I am wrong and it is a good idea.

So maybe "Small THP"/"Small-sized THP" is better. Or "Medium-sized THP" 
-- as said, I think FreeBSD tends to call it "Medium-sized superpages". 
But what's small/medium/large is debatable. "Small" implies at least 
that it's smaller than what we used to know, which is a fact.

Can we also now use the terminology consistently? (e.g., 
"variable-order, large folios for anonymous memory" -> "Small-sized 
anonymous THP", you can just point at the previous patch set name in the 
cover letter)

> 
> 1) Since SW (the kernel) is dealing with larger chunks of memory than base
>     pages, there are efficiency savings to be had; fewer page faults, batched PTE
>     and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel
>     overhead. This should benefit all architectures.
> 2) Since we are now mapping physically contiguous chunks of memory, we can take
>     advantage of HW TLB compression techniques. A reduction in TLB pressure
>     speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>     TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
> 
> The major change in this revision is the addition of sysfs controls to allow
> this "small-order THP" to be enabled/disabled/configured independently of
> PMD-order THP. The approach I've taken differs a bit from previous discussions;
> instead of creating a whole new interface ("large_folio"), I'm extending THP. I
> personally think this makes things clearer and more extensible. See [6] for
> detailed rationale.

Change 2: sysfs interface.

If we call it THP, it shall go under 
"/sys/kernel/mm/transparent_hugepage/", I agree.

What we expose there and how, is TBD. Again, not a friend of "orders" 
and bitmaps at all. We can do better if we want to go down that path.

Maybe we should take a look at hugetlb, and how they added support for 
multiple sizes. What *might* make sense could be (depending on which 
values we actually support!)

/sys/kernel/mm/transparent_hugepage/hugepages-64kB/
/sys/kernel/mm/transparent_hugepage/hugepages-128kB/
/sys/kernel/mm/transparent_hugepage/hugepages-256kB/
/sys/kernel/mm/transparent_hugepage/hugepages-512kB/
/sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/

Each one would contain an "enabled" and "defrag" file. We want something 
minimal first? Start with the "enabled" option.

enabled: always [global] madvise never

Initially, we would set it for PMD-sized THP to "global" and for 
everything else to "never".

That sounds reasonable at least to me, and we would be using what we 
learned from THP (as John suggested).  That still gives reasonable 
flexibility without going too wild, and a better IMHO interface.

I understand Yu's point about ABI discussions and "0 knobs". I'm happy 
as long as we can have something that won't hurt us later and still be 
able to use this in distributions within a reasonable timeframe. 
Enabling/disabling individual sizes does not sound too restrictive to 
me. And we could always add an "auto" setting later and default to that 
with a new kconfig knob.

If someone wants to configure it, why not. Let's just prepare a way to 
to handle this "better" automatically in the future (if ever ...).

Change 3: Stats

 > /proc/meminfo:
 >   Introduce new "AnonHugePteMap" field, which reports the amount of
 >   memory (in KiB) mapped from large folios globally (similar to
 >   AnonHugePages field).

AnonHugePages is and remains "PMD-sized THP that is mapped using a PMD", 
I think we all agree on that. It should have been named "AnonPmdMapped" 
or "AnonHugePmdMapped", too bad, we can't change that.

"AnonHugePteMap" better be "AnonHugePteMapped".

But, I wonder if we want to expose this "PteMapped" to user space *at 
all*. Why should they care if it's PTE mapped? For PMD-sized THP it 
makes a bit of sense, because !PMD implied !performance, and one might 
have been able to troubleshoot that somehow. For PTE-mapped, it doesn't 
make much sense really, they are always PTE-mapped.

That also raises the question how you would account a PTE-mapped THP. 
The hole thing? Only the parts that are mapped? Let's better not go down 
that path.

That leaves the question why we would want to include them here at all 
in a special PTE-mapped way?

Again, let's look at hugetlb: I prepared 1 GiB and one 2 MiB page.

HugePages_Total:       1
HugePages_Free:        1
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:         1050624 kB

-> Only the last one gives the sum, the other stats don't even mention 
the other ones. [how do we get their stats, if at all?]

So maybe, we only want a summary of how many anon huge pages of any size 
are allocated (independent of the PTE vs. PMD mapping), and some other 
source to eventually inspect how the different sizes behave.

But note that for non-PMD-sized file THP we don't even have special 
counters! ... so maybe we should also defer any such stats and come up 
with something uniform for all types of non-PMD-sized THP.

Sane discussion applies to all other stats.

> 
> Because we now have runtime enable/disable control, I've removed the compile
> time Kconfig switch. It still defaults to runtime-disabled.
> 
> NOTE: These changes should not be merged until the prerequisites are complete.
> These are in progress and tracked at [7].

We should probably list them here, and classify which one we see as 
strict a requirement, which ones might be an optimization.

Now, these are just my thoughts, and I'm happy about other thoughts.

Ryan Roberts Oct. 9, 2023, 11:28 a.m. UTC | #2

On 06/10/2023 21:06, David Hildenbrand wrote:
> On 29.09.23 13:44, Ryan Roberts wrote:
>> Hi All,
> 
> Let me highlight some core decisions on the things discussed in the previous
> alignment meetings, and comment on them.
> 
>>
>> This is v6 of a series to implement variable order, large folios for anonymous
>> memory. (previously called "ANON_LARGE_FOLIO", "LARGE_ANON_FOLIO",
>> "FLEXIBLE_THP", but now exposed as an extension to THP; "small-order THP"). The
>> objective of this is to improve performance by allocating larger chunks of
>> memory during anonymous page faults:
> 
> Change number 1: Let's call these things THP.
> 
> Fine with me; I previously rooted for that but was told that end users could be
> confused. I think the important bit is that we don't mess up the stats, and when
> we talk about THP we default to "PMD-sized THP", unless we explicitly include
> the other ones.
> 
> 
> I dislike exposing "orders" to the users, I'm happy to be convinced why I am
> wrong and it is a good idea.
> 
> So maybe "Small THP"/"Small-sized THP" is better. Or "Medium-sized THP" -- as
> said, I think FreeBSD tends to call it "Medium-sized superpages". But what's
> small/medium/large is debatable. "Small" implies at least that it's smaller than
> what we used to know, which is a fact.
> 
> Can we also now use the terminology consistently? (e.g., "variable-order, large
> folios for anonymous memory" -> "Small-sized anonymous THP", you can just point
> at the previous patch set name in the cover letter)

Yes absolutely. FWIW, I was deliberately not changing the title of the patchset
so people could easily see it was an evolution of something posted before. But
if it's the norm to change the title as the patchset evolves, I'm very happy to
do that. And there are other places too, in commit logs that I can tidy up. I
will assume "PMD-sized THP", "small-sized THP" and "anonymous small-sized THP"
(that last one slightly different from what David suggested above - it means
"small-sized THP" can still be grepped) unless others object.

> 
>>
>> 1) Since SW (the kernel) is dealing with larger chunks of memory than base
>>     pages, there are efficiency savings to be had; fewer page faults, batched PTE
>>     and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel
>>     overhead. This should benefit all architectures.
>> 2) Since we are now mapping physically contiguous chunks of memory, we can take
>>     advantage of HW TLB compression techniques. A reduction in TLB pressure
>>     speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>>     TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
>>
>> The major change in this revision is the addition of sysfs controls to allow
>> this "small-order THP" to be enabled/disabled/configured independently of
>> PMD-order THP. The approach I've taken differs a bit from previous discussions;
>> instead of creating a whole new interface ("large_folio"), I'm extending THP. I
>> personally think this makes things clearer and more extensible. See [6] for
>> detailed rationale.
> 
> Change 2: sysfs interface.
> 
> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
> agree.
> 
> What we expose there and how, is TBD. Again, not a friend of "orders" and
> bitmaps at all. We can do better if we want to go down that path.
> 
> Maybe we should take a look at hugetlb, and how they added support for multiple
> sizes. What *might* make sense could be (depending on which values we actually
> support!)
> 
> 
> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
> 
> Each one would contain an "enabled" and "defrag" file. We want something minimal
> first? Start with the "enabled" option.
> 
> 
> enabled: always [global] madvise never
> 
> Initially, we would set it for PMD-sized THP to "global" and for everything else
> to "never".

My only reservation about this approach is the potential for a future need for a
"one setting applied across all sizes" class of control (e.g. "auto"). I think
we agreed in the previous meetings that chasing a solution for "auto" was a good
aspiration to have, so it would be good to have a place we we can insert that in
future. The main reason why I chose to expose the "anon_orders" control is
because it is possible to both enable/disable the various sizes as well as
specificy (e.g.) "auto", without creating redundancy. But I agree that ideally
we wouldn't expose orders to the user; I was attempting a compromise to simplify
the "auto" case.

A potential (though feels quite complex) solution to make auto work with your
proposal: Add "auto" as an option to the existing global enabled file, and to
all of your proposed new enabled files. But its only possible to *set* auto
through the global file. And when it is set, all of the size-specific enabled
files read-back "auto" too. Any any writes to the size-specific enabled files
are ignored (or remembered but not enacted) until the global enabled file is
changed away from auto.

But I'm not sure if adding a new option to the global enabled file might break
compat?

> 
> 
> 
> That sounds reasonable at least to me, and we would be using what we learned
> from THP (as John suggested).  That still gives reasonable flexibility without
> going too wild, and a better IMHO interface.
> 
> I understand Yu's point about ABI discussions and "0 knobs". I'm happy as long
> as we can have something that won't hurt us later and still be able to use this
> in distributions within a reasonable timeframe. Enabling/disabling individual
> sizes does not sound too restrictive to me. And we could always add an "auto"
> setting later and default to that with a new kconfig knob.
> 
> If someone wants to configure it, why not. Let's just prepare a way to to handle
> this "better" automatically in the future (if ever ...).
> 
> 
> Change 3: Stats
> 
>> /proc/meminfo:
>>   Introduce new "AnonHugePteMap" field, which reports the amount of
>>   memory (in KiB) mapped from large folios globally (similar to
>>   AnonHugePages field).
> 
> AnonHugePages is and remains "PMD-sized THP that is mapped using a PMD", I think
> we all agree on that. It should have been named "AnonPmdMapped" or
> "AnonHugePmdMapped", too bad, we can't change that.

Yes agreed. I did consider redefining "AnonHugePages" to cover PMD- and
PTE-mapped memory, then introduce both an "AnonHugePmdMapped" and
"AnonHugePteMapped", but I think that would likely break things. Its further
complicated because vmstats prints it in PMD-size units, so can't represent
PTE-mapped memory in that counter.

> 
> "AnonHugePteMap" better be "AnonHugePteMapped".

I agree, but I went with the shorter one because any longer and it would unalign
the value e.g:

    AnonHugePages:         0 kB
    AnonHugePteMapped:        0 kB
    ShmemPmdMapped:        0 kB
    Shared_Hugetlb:        0 kB

So would need to decide which is preferable, or come up with a shorter name.

> 
> But, I wonder if we want to expose this "PteMapped" to user space *at all*. Why
> should they care if it's PTE mapped? For PMD-sized THP it makes a bit of sense,
> because !PMD implied !performance, and one might have been able to troubleshoot
> that somehow. For PTE-mapped, it doesn't make much sense really, they are always
> PTE-mapped.

I disagree; I've been using it a lot to debug performance issues. It tells you
how much of your anon memory is allocated with large folios. And making that
percentage bigger improves performance; fewer page faults, and with a separate
contpte series on arm64, better use of the TLB. Reasons might include; poorly
aligned/too small VMAs, memory fragmentation preventing allocation, CoW, etc.

I would actually argue for adding similar counters for file-backed memory too
for the same reasons. (I actually posted an independent patch a while back that
did this for file- and anon- memory, bucketted by size. But I think the idea of
the bucketting was NAKed.

> 
> That also raises the question how you would account a PTE-mapped THP. The hole
> thing? Only the parts that are mapped? Let's better not go down that path.

The approach I've taken in this series is the simple one - account every page
that belongs to a large folio from when it is first mapped to last unmapped.
Yes, in this case, you might not actually be mapping the full thing
contigiously. But it gives a good indication.

I also considered accounting the whole folio only when all of its pages become
mapped (although not worrying about them all being contiguous). That's still
simple to implement for all counters except smaps. So went with the simplest
approach with the view that its "good enough".

> 
> That leaves the question why we would want to include them here at all in a
> special PTE-mapped way?
> 
> 
> Again, let's look at hugetlb: I prepared 1 GiB and one 2 MiB page.
> 
> HugePages_Total:       1
> HugePages_Free:        1
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:       2048 kB
> Hugetlb:         1050624 kB
> 
> -> Only the last one gives the sum, the other stats don't even mention the other
> ones. [how do we get their stats, if at all?]

There are some files in /sys/kernel/mm/hugepages/hugepages-XXkB and
/sys/devices/system/node/node*/hugepages/; nr_hugepages, free_hugepages,
surplus_hugepages. But this interface also constitutes the allocator, not just
stats, I think.

> 
> So maybe, we only want a summary of how many anon huge pages of any size are
> allocated (independent of the PTE vs. PMD mapping), 

Are you proposing (AnonHugePages + AnonHugePteMapped) here or something else? If
the former, then I don't really see the difference. We have to continue to
expose PMD-size (AnonHugePages). So either add PTE-only counter, and derive the
total, or add a total counter and derive PTE-only. I suspect I've misunderstood
your point.

> and some other source to
> eventually inspect how the different sizes behave.
> 
> But note that for non-PMD-sized file THP we don't even have special counters!
> ... so maybe we should also defer any such stats and come up with something
> uniform for all types of non-PMD-sized THP.

Indeed, I can see benefit in adding these for file THP - in fact I have a patch
that does exactly that to help my development work. I had envisaged that we
could add something like FileHugePteMapped, ShmemHugePteMapped that would follow
the same semantics as AnonHugePteMapped.

> 
> 
> Sane discussion applies to all other stats.
> 
> 
>>
>> Because we now have runtime enable/disable control, I've removed the compile
>> time Kconfig switch. It still defaults to runtime-disabled.
>>
>> NOTE: These changes should not be merged until the prerequisites are complete.
>> These are in progress and tracked at [7].
> 
> We should probably list them here, and classify which one we see as strict a
> requirement, which ones might be an optimization.

I'll need some help with clasifying them, so showing my working. Final list that
I would propose as strict requirements at bottom.

This is my list with status, as per response to Yu in other thread:

  - David is working on "shared vs exclusive mappings"
  - Zi Yan has posted an RFC for compaction
  - Yin Fengwei's mlock series is now in mm-stable
  - Yin Fengwei's madvise series is in 6.6
  - I've reworked and posted a series for deferred_split_folio; although I've
    deprioritied it because Yu said it wasn't really a pre-requisite.
  - numa balancing depends on David's "shared vs exclusive mappings" work
  - I've started looking at "large folios in swap cache" in the background,
    because I'm seeing some slow down with large folios, but we also agreed that
    wasn't a prerequisite

Although, since sending that, I've determined that when running kernel
compilation across high number of cores on arm64, the cost of splitting the
folios gets large due to needing to broadcast the extra TLBIs. So I think the
last point on that list may be a prerequisite after all. (I've been able to fix
this by adding support for allocating large folios in the swap file, and
avoiding the split - planning to send RFC this week).

There is also this set of things that you mentioned against "shared vs exclusive
mappings", which I'm not sure if you are planning to cover as part of your work
or if they are follow on things that will need to be done:

(1) Detecting shared folios, to not mess with them while they are shared.
    MADV_PAGEOUT, user-triggered page migration, NUMA hinting, khugepaged ...
    replace cases where folio_estimated_sharers() == 1 would currently be the
    best we can do (and in some cases, page_mapcount() == 1).

And I recently discovered that khugepaged doesn't collapse file-backed pages to
a PMD-size THP if they belong to a large folio, so I'm guessing it may also
suffer the same behaviour for anon memory. I'm not sure if that's what your
"khugepaged ..." comment refers to?

So taking all that and trying to put together a complete outstanding list for
strict requirements:

  - Shared vs Exclusive Mappings (DavidH)
      - user-triggered page migration
      - NUMA hinting/balancing
      - Enhance khugepaged to collapse to PMD-size from PTE-mapped large folios
  - Compaction of Large Folios (Zi Yan)
  - Swap out small-size THP without Split (Ryan Roberts)

> 
> 
> Now, these are just my thoughts, and I'm happy about other thoughts.

As always, thanks for taking the time - I really appreciate it.

Thanks,
Ryan

David Hildenbrand Oct. 9, 2023, 4:22 p.m. UTC | #3

[...]

>>
>> I dislike exposing "orders" to the users, I'm happy to be convinced why I am
>> wrong and it is a good idea.
>>
>> So maybe "Small THP"/"Small-sized THP" is better. Or "Medium-sized THP" -- as
>> said, I think FreeBSD tends to call it "Medium-sized superpages". But what's
>> small/medium/large is debatable. "Small" implies at least that it's smaller than
>> what we used to know, which is a fact.
>>
>> Can we also now use the terminology consistently? (e.g., "variable-order, large
>> folios for anonymous memory" -> "Small-sized anonymous THP", you can just point
>> at the previous patch set name in the cover letter)
> 
> Yes absolutely. FWIW, I was deliberately not changing the title of the patchset
> so people could easily see it was an evolution of something posted before. But
> if it's the norm to change the title as the patchset evolves, I'm very happy to
> do that. And there are other places too, in commit logs that I can tidy up. I
> will assume "PMD-sized THP", "small-sized THP" and "anonymous small-sized THP"
> (that last one slightly different from what David suggested above - it means
> "small-sized THP" can still be grepped) unless others object.

Absolutely fine with me. Hoping other people will object when I talk 
nonsense or my suggestions don't make any sense.

Or even better, propose something better :)

> 
>>
>>>
>>> 1) Since SW (the kernel) is dealing with larger chunks of memory than base
>>>      pages, there are efficiency savings to be had; fewer page faults, batched PTE
>>>      and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel
>>>      overhead. This should benefit all architectures.
>>> 2) Since we are now mapping physically contiguous chunks of memory, we can take
>>>      advantage of HW TLB compression techniques. A reduction in TLB pressure
>>>      speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>>>      TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
>>>
>>> The major change in this revision is the addition of sysfs controls to allow
>>> this "small-order THP" to be enabled/disabled/configured independently of
>>> PMD-order THP. The approach I've taken differs a bit from previous discussions;
>>> instead of creating a whole new interface ("large_folio"), I'm extending THP. I
>>> personally think this makes things clearer and more extensible. See [6] for
>>> detailed rationale.
>>
>> Change 2: sysfs interface.
>>
>> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
>> agree.
>>
>> What we expose there and how, is TBD. Again, not a friend of "orders" and
>> bitmaps at all. We can do better if we want to go down that path.
>>
>> Maybe we should take a look at hugetlb, and how they added support for multiple
>> sizes. What *might* make sense could be (depending on which values we actually
>> support!)
>>
>>
>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
>>
>> Each one would contain an "enabled" and "defrag" file. We want something minimal
>> first? Start with the "enabled" option.
>>
>>
>> enabled: always [global] madvise never
>>
>> Initially, we would set it for PMD-sized THP to "global" and for everything else
>> to "never".
> 
> My only reservation about this approach is the potential for a future need for a
> "one setting applied across all sizes" class of control (e.g. "auto"). I think
> we agreed in the previous meetings that chasing a solution for "auto" was a good
> aspiration to have, so it would be good to have a place we we can insert that in
> future. The main reason why I chose to expose the "anon_orders" control is
> because it is possible to both enable/disable the various sizes as well as
> specificy (e.g.) "auto", without creating redundancy. But I agree that ideally
> we wouldn't expose orders to the user; I was attempting a compromise to simplify
> the "auto" case.
> 
> A potential (though feels quite complex) solution to make auto work with your
> proposal: Add "auto" as an option to the existing global enabled file, and to
> all of your proposed new enabled files. But its only possible to *set* auto
> through the global file. And when it is set, all of the size-specific enabled
> files read-back "auto" too. Any any writes to the size-specific enabled files
> are ignored (or remembered but not enacted) until the global enabled file is
> changed away from auto.

Yes, I think there are various ways forward regarding that. Or to enable 
"auto" mode only once all are "auto", and as soon as one is not "auto", 
just disable it. A simple

echo "auto" > /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled

Would do to enable it. Or, have them all be "global" and have a global 
"auto" mode as you raised.

echo "global" > /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled
echo "auto" > /sys/kernel/mm/transparent_hugepage/enabled

> 
> But I'm not sure if adding a new option to the global enabled file might break
> compat?

I think we used to extend the "defrag" option, see

commit 21440d7eb9044001b7fdb71d0163689f60a0f2a1
Author: David Rientjes <rientjes@google.com>
Date:   Wed Feb 22 15:45:49 2017 -0800

     mm, thp: add new defer+madvise defrag option


So I suspect we could extend that one in a similar way.

But again, this is just the thing that came to mind when thinking about 
how to:
a) avoid orders
b) make it configurable and future-proof
c) make it look a bit consistent with other interfaces (hugetlb and
    existing thp)
d) still prepare for an auto mode that we want in the future

I'm happy to hear other ideas.

> 
>>
>>
>>
>> That sounds reasonable at least to me, and we would be using what we learned
>> from THP (as John suggested).  That still gives reasonable flexibility without
>> going too wild, and a better IMHO interface.
>>
>> I understand Yu's point about ABI discussions and "0 knobs". I'm happy as long
>> as we can have something that won't hurt us later and still be able to use this
>> in distributions within a reasonable timeframe. Enabling/disabling individual
>> sizes does not sound too restrictive to me. And we could always add an "auto"
>> setting later and default to that with a new kconfig knob.
>>
>> If someone wants to configure it, why not. Let's just prepare a way to to handle
>> this "better" automatically in the future (if ever ...).
>>
>>
>> Change 3: Stats
>>
>>> /proc/meminfo:
>>>     Introduce new "AnonHugePteMap" field, which reports the amount of
>>>     memory (in KiB) mapped from large folios globally (similar to
>>>     AnonHugePages field).
>>
>> AnonHugePages is and remains "PMD-sized THP that is mapped using a PMD", I think
>> we all agree on that. It should have been named "AnonPmdMapped" or
>> "AnonHugePmdMapped", too bad, we can't change that.
> 
> Yes agreed. I did consider redefining "AnonHugePages" to cover PMD- and
> PTE-mapped memory, then introduce both an "AnonHugePmdMapped" and
> "AnonHugePteMapped", but I think that would likely break things. Its further
> complicated because vmstats prints it in PMD-size units, so can't represent
> PTE-mapped memory in that counter.

:/

> 
>>
>> "AnonHugePteMap" better be "AnonHugePteMapped".
> 
> I agree, but I went with the shorter one because any longer and it would unalign
> the value e.g:
> 
>      AnonHugePages:         0 kB
>      AnonHugePteMapped:        0 kB
>      ShmemPmdMapped:        0 kB
>      Shared_Hugetlb:        0 kB
> 

Can't that be handled? We surely have long stuff in there:

HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:  1081344 kB
ShmemPmdMapped:        0 kB

HardwareCorrupted has same length as AnonHugePteMapped

But I'm not convinced about "AnonHugePteMapped" yet :)

> So would need to decide which is preferable, or come up with a shorter name.
> 
>>
>> But, I wonder if we want to expose this "PteMapped" to user space *at all*. Why
>> should they care if it's PTE mapped? For PMD-sized THP it makes a bit of sense,
>> because !PMD implied !performance, and one might have been able to troubleshoot
>> that somehow. For PTE-mapped, it doesn't make much sense really, they are always
>> PTE-mapped.
> 
> I disagree; I've been using it a lot to debug performance issues. It tells you
> how much of your anon memory is allocated with large folios. And making that
> percentage bigger improves performance; fewer page faults, and with a separate
> contpte series on arm64, better use of the TLB. Reasons might include; poorly
> aligned/too small VMAs, memory fragmentation preventing allocation, CoW, etc.

Just because a small-sized THP is PTE-mapped doesn't tell you anything, 
really. What you want to know is if it is "completely" and 
"consecutively" mapped such that the HW can actually benefit from it -- 
if HW even supports it. So "PTE-mapped THP" is just part of the story. 
And that's where it gets tricky I think.

I agree that it's good for debugging, but then maybe it should a) live 
somewhere else (debugfs, bucketing below) and b) be consistent with 
other THPs, meaning we also want similar stats somewhere.

One idea would be to expose such stats in a R/O fashion like 
"nr_allocated" or "nr_hugepages" in 
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/ and friends. Of 
course, maybe tagging them with "anon" prefix.

> 
> I would actually argue for adding similar counters for file-backed memory too
> for the same reasons. (I actually posted an independent patch a while back that
> did this for file- and anon- memory, bucketted by size. But I think the idea of
> the bucketting was NAKed.
For debugging, I *think* it might be valuable to see how many THP of 
each size are allocated. Tracking exactly "how is it mapped" is not easy 
to achieve as we learned. PMD-mapped was easy, but also requires us to 
keep doing that tracking for all eternity ...

Do you have a pointer to the patch set? Did it try to squeeze it into 
/proc/meminfo?

> 
>>
>> That also raises the question how you would account a PTE-mapped THP. The hole
>> thing? Only the parts that are mapped? Let's better not go down that path.
> 
> The approach I've taken in this series is the simple one - account every page
> that belongs to a large folio from when it is first mapped to last unmapped.
> Yes, in this case, you might not actually be mapping the full thing
> contigiously. But it gives a good indication.
> 
> I also considered accounting the whole folio only when all of its pages become
> mapped (although not worrying about them all being contiguous). That's still
> simple to implement for all counters except smaps. So went with the simplest
> approach with the view that its "good enough".

If you take a look at "ShmemHugePages" and "FileHugePages", there we 
actually track them when they get allocated+freed, which is much easier 
than tracking when/how they are (un)mapped. But it's only done for 
PMD-sized THP for now.

> 
>>
>> That leaves the question why we would want to include them here at all in a
>> special PTE-mapped way?
>>
>>
>> Again, let's look at hugetlb: I prepared 1 GiB and one 2 MiB page.
>>
>> HugePages_Total:       1
>> HugePages_Free:        1
>> HugePages_Rsvd:        0
>> HugePages_Surp:        0
>> Hugepagesize:       2048 kB
>> Hugetlb:         1050624 kB
>>
>> -> Only the last one gives the sum, the other stats don't even mention the other
>> ones. [how do we get their stats, if at all?]
> 
> There are some files in /sys/kernel/mm/hugepages/hugepages-XXkB and
> /sys/devices/system/node/node*/hugepages/; nr_hugepages, free_hugepages,
> surplus_hugepages. But this interface also constitutes the allocator, not just
> stats, I think.

Ah, I missed that we expose free vs. reserved vs. surpluse ... there as 
well; I thought we would only have "nr_hugepages".

> 
>>
>> So maybe, we only want a summary of how many anon huge pages of any size are
>> allocated (independent of the PTE vs. PMD mapping),
> 
> Are you proposing (AnonHugePages + AnonHugePteMapped) here or something else? If
> the former, then I don't really see the difference. We have to continue to
> expose PMD-size (AnonHugePages). So either add PTE-only counter, and derive the
> total, or add a total counter and derive PTE-only. I suspect I've misunderstood
> your point.

I don't think we should go down the "PteMapped" path. Probably we want 
"bucketing" stats as you said, and maybe a global one that just combines 
everything (any THP). But naming will be difficult.

> 
>> and some other source to
>> eventually inspect how the different sizes behave.
>>
>> But note that for non-PMD-sized file THP we don't even have special counters!
>> ... so maybe we should also defer any such stats and come up with something
>> uniform for all types of non-PMD-sized THP.
> 
> Indeed, I can see benefit in adding these for file THP - in fact I have a patch
> that does exactly that to help my development work. I had envisaged that we
> could add something like FileHugePteMapped, ShmemHugePteMapped that would follow
> the same semantics as AnonHugePteMapped.

Again, maybe we can find something that does not involve the "PteMapped" 
terminology and just gives us a big total of "allocated" THP. For 
detailed stats for debugging, maybe we can just use a different 
interface then.

> 
>>
>>
>> Sane discussion applies to all other stats.
>>
>>
>>>
>>> Because we now have runtime enable/disable control, I've removed the compile
>>> time Kconfig switch. It still defaults to runtime-disabled.
>>>
>>> NOTE: These changes should not be merged until the prerequisites are complete.
>>> These are in progress and tracked at [7].
>>
>> We should probably list them here, and classify which one we see as strict a
>> requirement, which ones might be an optimization.
> 
> 
> I'll need some help with clasifying them, so showing my working. Final list that
> I would propose as strict requirements at bottom.
> 
> This is my list with status, as per response to Yu in other thread:
> 
>    - David is working on "shared vs exclusive mappings"

Probably "COW reuse support" is a separate item, although my approach 
would cover that.

The question is, if the estimate we're using in most code for now would 
at least be sufficient to merge it. The estimate is easily wrong, but we 
do have that issue with PTE-mapped THP already.

But that argument probably applies to most things here: the difference 
is that PTE-mapped THP are not the default, that's why nobody really cared.

[I'm playing with an almost-lockless scheme right now and hope I have 
something running soonish -- as you know, I got distracted]

>    - Zi Yan has posted an RFC for compaction
>    - Yin Fengwei's mlock series is now in mm-stable
>    - Yin Fengwei's madvise series is in 6.6
>    - I've reworked and posted a series for deferred_split_folio; although I've
>      deprioritied it because Yu said it wasn't really a pre-requisite.
>    - numa balancing depends on David's "shared vs exclusive mappings" work
>    - I've started looking at "large folios in swap cache" in the background,
>      because I'm seeing some slow down with large folios, but we also agreed that
>      wasn't a prerequisite
> 

Probably it would be good to talk about the items and how we would 
classify them in a meeting.


> Although, since sending that, I've determined that when running kernel
> compilation across high number of cores on arm64, the cost of splitting the
> folios gets large due to needing to broadcast the extra TLBIs. So I think the
> last point on that list may be a prerequisite after all. (I've been able to fix
> this by adding support for allocating large folios in the swap file, and
> avoiding the split - planning to send RFC this week).
> 
> There is also this set of things that you mentioned against "shared vs exclusive
> mappings", which I'm not sure if you are planning to cover as part of your work
> or if they are follow on things that will need to be done:
> 
> (1) Detecting shared folios, to not mess with them while they are shared.
>      MADV_PAGEOUT, user-triggered page migration, NUMA hinting, khugepaged ...
>      replace cases where folio_estimated_sharers() == 1 would currently be the
>      best we can do (and in some cases, page_mapcount() == 1).
> 
> And I recently discovered that khugepaged doesn't collapse file-backed pages to
> a PMD-size THP if they belong to a large folio, so I'm guessing it may also
> suffer the same behaviour for anon memory. I'm not sure if that's what your
> "khugepaged ..." comment refers to?

Yes. But I did not look into all the details yet.

"kuhepaged" collapse support to small-sized THP is probably also a very 
imporant item, although it might be less relevant than for PMD -- and I 
consider it future work. See below.

> 
> So taking all that and trying to put together a complete outstanding list for
> strict requirements:
> 
>    - Shared vs Exclusive Mappings (DavidH)
>        - user-triggered page migration
>        - NUMA hinting/balancing
>        - Enhance khugepaged to collapse to PMD-size from PTE-mapped large folios
>    - Compaction of Large Folios (Zi Yan)
>    - Swap out small-size THP without Split (Ryan Roberts)

^ that's going to be tough, I can promise. And the only way to live 
without that would be khugepaged support. (because that's how it's all 
working for PMD-sized THP after all!)

Once a PMD-sized THP was swapped out and evicted, it will always come 
back in order-0 folios. khugeged will re-collapse into PMD-sized chunks. 
If we could do that for PTE-sized THP as well ...

> 
> 
>>
>>
>> Now, these are just my thoughts, and I'm happy about other thoughts.
> 
> As always, thanks for taking the time - I really appreciate it.

Sure. Hoping others can comment.

My gut feeling is that it's best to focus on getting the sysfs interface 
right+future proof and handling the stats independently. While being a 
good debug mechanism, I wouldn't consider these stats a requirement: we 
don't have them for file/shmem small-sized thp so far as well.

So maybe really better to handle the stats in meminfo and friends 
separately.

Ryan Roberts Oct. 10, 2023, 10:47 a.m. UTC | #4

On 09/10/2023 17:22, David Hildenbrand wrote:
> [...]
> 
>>>
>>> I dislike exposing "orders" to the users, I'm happy to be convinced why I am
>>> wrong and it is a good idea.
>>>
>>> So maybe "Small THP"/"Small-sized THP" is better. Or "Medium-sized THP" -- as
>>> said, I think FreeBSD tends to call it "Medium-sized superpages". But what's
>>> small/medium/large is debatable. "Small" implies at least that it's smaller than
>>> what we used to know, which is a fact.
>>>
>>> Can we also now use the terminology consistently? (e.g., "variable-order, large
>>> folios for anonymous memory" -> "Small-sized anonymous THP", you can just point
>>> at the previous patch set name in the cover letter)
>>
>> Yes absolutely. FWIW, I was deliberately not changing the title of the patchset
>> so people could easily see it was an evolution of something posted before. But
>> if it's the norm to change the title as the patchset evolves, I'm very happy to
>> do that. And there are other places too, in commit logs that I can tidy up. I
>> will assume "PMD-sized THP", "small-sized THP" and "anonymous small-sized THP"
>> (that last one slightly different from what David suggested above - it means
>> "small-sized THP" can still be grepped) unless others object.
> 
> Absolutely fine with me. Hoping other people will object when I talk nonsense or
> my suggestions don't make any sense.
> 
> Or even better, propose something better :)
> 
>>
>>>
>>>>
>>>> 1) Since SW (the kernel) is dealing with larger chunks of memory than base
>>>>      pages, there are efficiency savings to be had; fewer page faults,
>>>> batched PTE
>>>>      and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel
>>>>      overhead. This should benefit all architectures.
>>>> 2) Since we are now mapping physically contiguous chunks of memory, we can take
>>>>      advantage of HW TLB compression techniques. A reduction in TLB pressure
>>>>      speeds up kernel and user space. arm64 systems have 2 mechanisms to
>>>> coalesce
>>>>      TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
>>>>
>>>> The major change in this revision is the addition of sysfs controls to allow
>>>> this "small-order THP" to be enabled/disabled/configured independently of
>>>> PMD-order THP. The approach I've taken differs a bit from previous discussions;
>>>> instead of creating a whole new interface ("large_folio"), I'm extending THP. I
>>>> personally think this makes things clearer and more extensible. See [6] for
>>>> detailed rationale.
>>>
>>> Change 2: sysfs interface.
>>>
>>> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
>>> agree.
>>>
>>> What we expose there and how, is TBD. Again, not a friend of "orders" and
>>> bitmaps at all. We can do better if we want to go down that path.
>>>
>>> Maybe we should take a look at hugetlb, and how they added support for multiple
>>> sizes. What *might* make sense could be (depending on which values we actually
>>> support!)
>>>
>>>
>>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
>>>
>>> Each one would contain an "enabled" and "defrag" file. We want something minimal
>>> first? Start with the "enabled" option.
>>>
>>>
>>> enabled: always [global] madvise never
>>>
>>> Initially, we would set it for PMD-sized THP to "global" and for everything else
>>> to "never".
>>
>> My only reservation about this approach is the potential for a future need for a
>> "one setting applied across all sizes" class of control (e.g. "auto"). I think
>> we agreed in the previous meetings that chasing a solution for "auto" was a good
>> aspiration to have, so it would be good to have a place we we can insert that in
>> future. The main reason why I chose to expose the "anon_orders" control is
>> because it is possible to both enable/disable the various sizes as well as
>> specificy (e.g.) "auto", without creating redundancy. But I agree that ideally
>> we wouldn't expose orders to the user; I was attempting a compromise to simplify
>> the "auto" case.
>>
>> A potential (though feels quite complex) solution to make auto work with your
>> proposal: Add "auto" as an option to the existing global enabled file, and to
>> all of your proposed new enabled files. But its only possible to *set* auto
>> through the global file. And when it is set, all of the size-specific enabled
>> files read-back "auto" too. Any any writes to the size-specific enabled files
>> are ignored (or remembered but not enacted) until the global enabled file is
>> changed away from auto.
> 
> Yes, I think there are various ways forward regarding that. Or to enable "auto"
> mode only once all are "auto", and as soon as one is not "auto", just disable
> it. A simple
> 
> echo "auto" > /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled

I'm not really a fan, because this implies that you have a period where "auto"
is reported for a size, but its not really in "auto" mode yet.

> 
> Would do to enable it. Or, have them all be "global" and have a global "auto"
> mode as you raised.
> 
> echo "global" > /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled
> echo "auto" > /sys/kernel/mm/transparent_hugepage/enabled
> 

Again, this isn't atomic either. I tend to prefer my proposal because it
switches atomically - there are no weird intermediate states. Anyway, I guess
the important point is we have demonstrated that your proposed interface could
be extended to support "auto" in future, should we need it.

>>
>> But I'm not sure if adding a new option to the global enabled file might break
>> compat?
> 
> I think we used to extend the "defrag" option, see
> 
> commit 21440d7eb9044001b7fdb71d0163689f60a0f2a1
> Author: David Rientjes <rientjes@google.com>
> Date:   Wed Feb 22 15:45:49 2017 -0800
> 
>     mm, thp: add new defer+madvise defrag option
> 
> 
> So I suspect we could extend that one in a similar way.
> 
> But again, this is just the thing that came to mind when thinking about how to:
> a) avoid orders
> b) make it configurable and future-proof
> c) make it look a bit consistent with other interfaces (hugetlb and
>    existing thp)
> d) still prepare for an auto mode that we want in the future
> 
> I'm happy to hear other ideas.
> 
>>
>>>
>>>
>>>
>>> That sounds reasonable at least to me, and we would be using what we learned
>>> from THP (as John suggested).  That still gives reasonable flexibility without
>>> going too wild, and a better IMHO interface.
>>>
>>> I understand Yu's point about ABI discussions and "0 knobs". I'm happy as long
>>> as we can have something that won't hurt us later and still be able to use this
>>> in distributions within a reasonable timeframe. Enabling/disabling individual
>>> sizes does not sound too restrictive to me. And we could always add an "auto"
>>> setting later and default to that with a new kconfig knob.
>>>
>>> If someone wants to configure it, why not. Let's just prepare a way to to handle
>>> this "better" automatically in the future (if ever ...).
>>>
>>>
>>> Change 3: Stats
>>>
>>>> /proc/meminfo:
>>>>     Introduce new "AnonHugePteMap" field, which reports the amount of
>>>>     memory (in KiB) mapped from large folios globally (similar to
>>>>     AnonHugePages field).
>>>
>>> AnonHugePages is and remains "PMD-sized THP that is mapped using a PMD", I think
>>> we all agree on that. It should have been named "AnonPmdMapped" or
>>> "AnonHugePmdMapped", too bad, we can't change that.
>>
>> Yes agreed. I did consider redefining "AnonHugePages" to cover PMD- and
>> PTE-mapped memory, then introduce both an "AnonHugePmdMapped" and
>> "AnonHugePteMapped", but I think that would likely break things. Its further
>> complicated because vmstats prints it in PMD-size units, so can't represent
>> PTE-mapped memory in that counter.
> 
> :/
> 
>>
>>>
>>> "AnonHugePteMap" better be "AnonHugePteMapped".
>>
>> I agree, but I went with the shorter one because any longer and it would unalign
>> the value e.g:
>>
>>      AnonHugePages:         0 kB
>>      AnonHugePteMapped:        0 kB
>>      ShmemPmdMapped:        0 kB
>>      Shared_Hugetlb:        0 kB
>>
> 
> Can't that be handled? We surely have long stuff in there:
> 
> HardwareCorrupted:     0 kB
> AnonHugePages:         0 kB
> ShmemHugePages:  1081344 kB
> ShmemPmdMapped:        0 kB
> 
> HardwareCorrupted has same length as AnonHugePteMapped

HardwareCorrupted is special cased and has a field length of 5, so the largest
value you can represent before it gets pushed out is ~97MB. I imagine that's
plenty for HardwareCorrupted, but unlikely enough for AnonHugePteMapped. The
standard field size is 8, which provides for ~95GB before becoming unaligned.

> 
> But I'm not convinced about "AnonHugePteMapped" yet :)

OK

> 
>> So would need to decide which is preferable, or come up with a shorter name.
>>
>>>
>>> But, I wonder if we want to expose this "PteMapped" to user space *at all*. Why
>>> should they care if it's PTE mapped? For PMD-sized THP it makes a bit of sense,
>>> because !PMD implied !performance, and one might have been able to troubleshoot
>>> that somehow. For PTE-mapped, it doesn't make much sense really, they are always
>>> PTE-mapped.
>>
>> I disagree; I've been using it a lot to debug performance issues. It tells you
>> how much of your anon memory is allocated with large folios. And making that
>> percentage bigger improves performance; fewer page faults, and with a separate
>> contpte series on arm64, better use of the TLB. Reasons might include; poorly
>> aligned/too small VMAs, memory fragmentation preventing allocation, CoW, etc.
> 
> Just because a small-sized THP is PTE-mapped doesn't tell you anything, really.
> What you want to know is if it is "completely" and "consecutively" mapped such
> that the HW can actually benefit from it -- if HW even supports it. So
> "PTE-mapped THP" is just part of the story. And that's where it gets tricky I
> think.
> 
> I agree that it's good for debugging, but then maybe it should a) live somewhere
> else (debugfs, bucketing below) and b) be consistent with other THPs, meaning we
> also want similar stats somewhere.
> 
> One idea would be to expose such stats in a R/O fashion like "nr_allocated" or
> "nr_hugepages" in /sys/kernel/mm/transparent_hugepage/hugepages-64kB/ and
> friends. Of course, maybe tagging them with "anon" prefix.

I see your point, but I don't completely agree with it all. That said, given
your conclusion at the bottom, perhaps we should park the discussion about the
accounting for a separate series in future? Then we can focus on the ABI?

> 
>>
>> I would actually argue for adding similar counters for file-backed memory too
>> for the same reasons. (I actually posted an independent patch a while back that
>> did this for file- and anon- memory, bucketted by size. But I think the idea of
>> the bucketting was NAKed.
> For debugging, I *think* it might be valuable to see how many THP of each size
> are allocated. Tracking exactly "how is it mapped" is not easy to achieve as we
> learned. PMD-mapped was easy, but also requires us to keep doing that tracking
> for all eternity ...
> 
> Do you have a pointer to the patch set? Did it try to squeeze it into
> /proc/meminfo?

I was actually only working on smaps/smaps_rollup, which has been my main tool
for debugging. patches at [1].

[1] https://lore.kernel.org/linux-mm/20230613160950.3554675-1-ryan.roberts@arm.com/

> 
>>
>>>
>>> That also raises the question how you would account a PTE-mapped THP. The hole
>>> thing? Only the parts that are mapped? Let's better not go down that path.
>>
>> The approach I've taken in this series is the simple one - account every page
>> that belongs to a large folio from when it is first mapped to last unmapped.
>> Yes, in this case, you might not actually be mapping the full thing
>> contigiously. But it gives a good indication.
>>
>> I also considered accounting the whole folio only when all of its pages become
>> mapped (although not worrying about them all being contiguous). That's still
>> simple to implement for all counters except smaps. So went with the simplest
>> approach with the view that its "good enough".
> 
> If you take a look at "ShmemHugePages" and "FileHugePages", there we actually
> track them when they get allocated+freed, which is much easier than tracking
> when/how they are (un)mapped. But it's only done for PMD-sized THP for now.
> 
>>
>>>
>>> That leaves the question why we would want to include them here at all in a
>>> special PTE-mapped way?
>>>
>>>
>>> Again, let's look at hugetlb: I prepared 1 GiB and one 2 MiB page.
>>>
>>> HugePages_Total:       1
>>> HugePages_Free:        1
>>> HugePages_Rsvd:        0
>>> HugePages_Surp:        0
>>> Hugepagesize:       2048 kB
>>> Hugetlb:         1050624 kB
>>>
>>> -> Only the last one gives the sum, the other stats don't even mention the other
>>> ones. [how do we get their stats, if at all?]
>>
>> There are some files in /sys/kernel/mm/hugepages/hugepages-XXkB and
>> /sys/devices/system/node/node*/hugepages/; nr_hugepages, free_hugepages,
>> surplus_hugepages. But this interface also constitutes the allocator, not just
>> stats, I think.
> 
> Ah, I missed that we expose free vs. reserved vs. surpluse ... there as well; I
> thought we would only have "nr_hugepages".
> 
>>
>>>
>>> So maybe, we only want a summary of how many anon huge pages of any size are
>>> allocated (independent of the PTE vs. PMD mapping),
>>
>> Are you proposing (AnonHugePages + AnonHugePteMapped) here or something else? If
>> the former, then I don't really see the difference. We have to continue to
>> expose PMD-size (AnonHugePages). So either add PTE-only counter, and derive the
>> total, or add a total counter and derive PTE-only. I suspect I've misunderstood
>> your point.
> 
> I don't think we should go down the "PteMapped" path. Probably we want
> "bucketing" stats as you said, and maybe a global one that just combines
> everything (any THP). But naming will be difficult.
> 
>>
>>> and some other source to
>>> eventually inspect how the different sizes behave.
>>>
>>> But note that for non-PMD-sized file THP we don't even have special counters!
>>> ... so maybe we should also defer any such stats and come up with something
>>> uniform for all types of non-PMD-sized THP.
>>
>> Indeed, I can see benefit in adding these for file THP - in fact I have a patch
>> that does exactly that to help my development work. I had envisaged that we
>> could add something like FileHugePteMapped, ShmemHugePteMapped that would follow
>> the same semantics as AnonHugePteMapped.
> 
> Again, maybe we can find something that does not involve the "PteMapped"
> terminology and just gives us a big total of "allocated" THP. For detailed stats
> for debugging, maybe we can just use a different interface then.
> 
>>
>>>
>>>
>>> Sane discussion applies to all other stats.
>>>
>>>
>>>>
>>>> Because we now have runtime enable/disable control, I've removed the compile
>>>> time Kconfig switch. It still defaults to runtime-disabled.
>>>>
>>>> NOTE: These changes should not be merged until the prerequisites are complete.
>>>> These are in progress and tracked at [7].
>>>
>>> We should probably list them here, and classify which one we see as strict a
>>> requirement, which ones might be an optimization.
>>
>>
>> I'll need some help with clasifying them, so showing my working. Final list that
>> I would propose as strict requirements at bottom.
>>
>> This is my list with status, as per response to Yu in other thread:
>>
>>    - David is working on "shared vs exclusive mappings"
> 
> Probably "COW reuse support" is a separate item, although my approach would
> cover that.

Yeah that's the in the original thread as (2), but I thought we were all agreed
that is not a prerequisite so didn't bring it over here.

> 
> The question is, if the estimate we're using in most code for now would at least
> be sufficient to merge it. The estimate is easily wrong, but we do have that
> issue with PTE-mapped THP already.

Well as I understand it, at least the NUMA balancing code and khugepaged are
ignoring all folios > order-0. That's probably ok for the occasional PTE-mapped
THP, but I assume it becomes untenable if large folios are the norm. Perhaps we
can modify those paths to work with the current estimators in order to remove
your work from the critical path - is that what you are getting at?

> 
> But that argument probably applies to most things here: the difference is that
> PTE-mapped THP are not the default, that's why nobody really cared.
> 
> [I'm playing with an almost-lockless scheme right now and hope I have something
> running soonish -- as you know, I got distracted]
> 
>>    - Zi Yan has posted an RFC for compaction
>>    - Yin Fengwei's mlock series is now in mm-stable
>>    - Yin Fengwei's madvise series is in 6.6
>>    - I've reworked and posted a series for deferred_split_folio; although I've
>>      deprioritied it because Yu said it wasn't really a pre-requisite.
>>    - numa balancing depends on David's "shared vs exclusive mappings" work
>>    - I've started looking at "large folios in swap cache" in the background,
>>      because I'm seeing some slow down with large folios, but we also agreed that
>>      wasn't a prerequisite
>>
> 
> Probably it would be good to talk about the items and how we would classify them
> in a meeting.

Perhaps we can get a slot in Matthew's meeting tomorrow?

> 
> 
>> Although, since sending that, I've determined that when running kernel
>> compilation across high number of cores on arm64, the cost of splitting the
>> folios gets large due to needing to broadcast the extra TLBIs. So I think the
>> last point on that list may be a prerequisite after all. (I've been able to fix
>> this by adding support for allocating large folios in the swap file, and
>> avoiding the split - planning to send RFC this week).
>>
>> There is also this set of things that you mentioned against "shared vs exclusive
>> mappings", which I'm not sure if you are planning to cover as part of your work
>> or if they are follow on things that will need to be done:
>>
>> (1) Detecting shared folios, to not mess with them while they are shared.
>>      MADV_PAGEOUT, user-triggered page migration, NUMA hinting, khugepaged ...
>>      replace cases where folio_estimated_sharers() == 1 would currently be the
>>      best we can do (and in some cases, page_mapcount() == 1).
>>
>> And I recently discovered that khugepaged doesn't collapse file-backed pages to
>> a PMD-size THP if they belong to a large folio, so I'm guessing it may also
>> suffer the same behaviour for anon memory. I'm not sure if that's what your
>> "khugepaged ..." comment refers to?
> 
> Yes. But I did not look into all the details yet.
> 
> "kuhepaged" collapse support to small-sized THP is probably also a very imporant
> item, although it might be less relevant than for PMD -- and I consider it
> future work. See below.

Yes I agree that it's definitely future work. Nothing regresses from today's
performance if you don't have that.

> 
>>
>> So taking all that and trying to put together a complete outstanding list for
>> strict requirements:
>>
>>    - Shared vs Exclusive Mappings (DavidH)
>>        - user-triggered page migration
>>        - NUMA hinting/balancing
>>        - Enhance khugepaged to collapse to PMD-size from PTE-mapped large folios
>>    - Compaction of Large Folios (Zi Yan)
>>    - Swap out small-size THP without Split (Ryan Roberts)
> 
> ^ that's going to be tough, I can promise. And the only way to live without that
> would be khugepaged support. (because that's how it's all working for PMD-sized
> THP after all!)

Are you referring specifically to the "swap out" line there? If so, it wasn't my
plan that we would *swap in* large folios - only swap them *out* as large folios
to avoid the cost of splitting. Then when they come back in, the come in as
single pages, just like PMD-sized THP, if I've understood things correctly. I
have a patch working and showing the perf improvement as a result. I'm planning
to post an RFC today, hopefully.

I don't see the swap-in side as a problem for the initial patch set. OK, they
come in as single pages, so you lost the potential TLB benefits. But that's no
worse than today's performance so not a regression. And the ratio of SW savings
on THP allocation to HW savings from the TLB is very different for small-sized
THP; much more of the benefit comes from the SW and that's still there.

> 
> Once a PMD-sized THP was swapped out and evicted, it will always come back in
> order-0 folios. khugeged will re-collapse into PMD-sized chunks. If we could do
> that for PTE-sized THP as well ...

Yes, sure, but that's a future improvement, not a requirement to prevent
regression vs today, right?

> 
>>
>>
>>>
>>>
>>> Now, these are just my thoughts, and I'm happy about other thoughts.
>>
>> As always, thanks for taking the time - I really appreciate it.
> 
> Sure. Hoping others can comment.
> 
> My gut feeling is that it's best to focus on getting the sysfs interface
> right+future proof and handling the stats independently. While being a good
> debug mechanism, I wouldn't consider these stats a requirement: we don't have
> them for file/shmem small-sized thp so far as well.
> 
> So maybe really better to handle the stats in meminfo and friends separately.
> 

I'd be very happy with that approach if others are bought in.

Thanks,
Ryan

David Hildenbrand Oct. 13, 2023, 8:14 p.m. UTC | #5

>>
>> Yes, I think there are various ways forward regarding that. Or to enable "auto"
>> mode only once all are "auto", and as soon as one is not "auto", just disable
>> it. A simple
>>
>> echo "auto" > /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled
> 
> I'm not really a fan, because this implies that you have a period where "auto"
> is reported for a size, but its not really in "auto" mode yet.

I think there are various alternatives that are feasible.

For most systems later, you'd want to just "auto" via compile-time 
CONFIG option as default or via some cmdline option like 
"transparent_hugepage=auto".

> 
>>
>> Would do to enable it. Or, have them all be "global" and have a global "auto"
>> mode as you raised.
>>
>> echo "global" > /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled
>> echo "auto" > /sys/kernel/mm/transparent_hugepage/enabled
>>
> 
> Again, this isn't atomic either. I tend to prefer my proposal because it
> switches atomically - there are no weird intermediate states. Anyway, I guess
> the important point is we have demonstrated that your proposed interface could
> be extended to support "auto" in future, should we need it.

I don't think the atomic switch is really relevant. But that's probably 
a separate discussion.

[...]

>>
>> Just because a small-sized THP is PTE-mapped doesn't tell you anything, really.
>> What you want to know is if it is "completely" and "consecutively" mapped such
>> that the HW can actually benefit from it -- if HW even supports it. So
>> "PTE-mapped THP" is just part of the story. And that's where it gets tricky I
>> think.
>>
>> I agree that it's good for debugging, but then maybe it should a) live somewhere
>> else (debugfs, bucketing below) and b) be consistent with other THPs, meaning we
>> also want similar stats somewhere.
>>
>> One idea would be to expose such stats in a R/O fashion like "nr_allocated" or
>> "nr_hugepages" in /sys/kernel/mm/transparent_hugepage/hugepages-64kB/ and
>> friends. Of course, maybe tagging them with "anon" prefix.
> 
> I see your point, but I don't completely agree with it all. That said, given
> your conclusion at the bottom, perhaps we should park the discussion about the
> accounting for a separate series in future? Then we can focus on the ABI?

Yes!

> 
>>
>>>
>>> I would actually argue for adding similar counters for file-backed memory too
>>> for the same reasons. (I actually posted an independent patch a while back that
>>> did this for file- and anon- memory, bucketted by size. But I think the idea of
>>> the bucketting was NAKed.
>> For debugging, I *think* it might be valuable to see how many THP of each size
>> are allocated. Tracking exactly "how is it mapped" is not easy to achieve as we
>> learned. PMD-mapped was easy, but also requires us to keep doing that tracking
>> for all eternity ...
>>
>> Do you have a pointer to the patch set? Did it try to squeeze it into
>> /proc/meminfo?
> 
> I was actually only working on smaps/smaps_rollup, which has been my main tool
> for debugging. patches at [1].
> 
> [1] https://lore.kernel.org/linux-mm/20230613160950.3554675-1-ryan.roberts@arm.com/
> 

Thanks for the pointer!

[...]

>>>
>>> I'll need some help with clasifying them, so showing my working. Final list that
>>> I would propose as strict requirements at bottom.
>>>
>>> This is my list with status, as per response to Yu in other thread:
>>>
>>>     - David is working on "shared vs exclusive mappings"
>>
>> Probably "COW reuse support" is a separate item, although my approach would
>> cover that.
> 
> Yeah that's the in the original thread as (2), but I thought we were all agreed
> that is not a prerequisite so didn't bring it over here.

Agreed. Having a full list of todo items might be reasonable.

> 
>>
>> The question is, if the estimate we're using in most code for now would at least
>> be sufficient to merge it. The estimate is easily wrong, but we do have that
>> issue with PTE-mapped THP already.
> 
> Well as I understand it, at least the NUMA balancing code and khugepaged are
> ignoring all folios > order-0. That's probably ok for the occasional PTE-mapped
> THP, but I assume it becomes untenable if large folios are the norm. Perhaps we
> can modify those paths to work with the current estimators in order to remove
> your work from the critical path - is that what you are getting at?

IMHO most of the code that now uses the estimate-logic is really 
suboptimal, but it's all we have. It's probably interesting to see where 
the false negative/positives are tolerable for now ... I hate to be at 
the critical path ;) But I'm getting somewhere slowly but steadily 
(slowly, because I'm constantly distracted -- and apparently sick).

[...]

>>
>>
>>> Although, since sending that, I've determined that when running kernel
>>> compilation across high number of cores on arm64, the cost of splitting the
>>> folios gets large due to needing to broadcast the extra TLBIs. So I think the
>>> last point on that list may be a prerequisite after all. (I've been able to fix
>>> this by adding support for allocating large folios in the swap file, and
>>> avoiding the split - planning to send RFC this week).
>>>
>>> There is also this set of things that you mentioned against "shared vs exclusive
>>> mappings", which I'm not sure if you are planning to cover as part of your work
>>> or if they are follow on things that will need to be done:
>>>
>>> (1) Detecting shared folios, to not mess with them while they are shared.
>>>       MADV_PAGEOUT, user-triggered page migration, NUMA hinting, khugepaged ...
>>>       replace cases where folio_estimated_sharers() == 1 would currently be the
>>>       best we can do (and in some cases, page_mapcount() == 1).
>>>
>>> And I recently discovered that khugepaged doesn't collapse file-backed pages to
>>> a PMD-size THP if they belong to a large folio, so I'm guessing it may also
>>> suffer the same behaviour for anon memory. I'm not sure if that's what your
>>> "khugepaged ..." comment refers to?
>>
>> Yes. But I did not look into all the details yet.
>>
>> "kuhepaged" collapse support to small-sized THP is probably also a very imporant
>> item, although it might be less relevant than for PMD -- and I consider it
>> future work. See below.
> 
> Yes I agree that it's definitely future work. Nothing regresses from today's
> performance if you don't have that.
> 
>>
>>>
>>> So taking all that and trying to put together a complete outstanding list for
>>> strict requirements:
>>>
>>>     - Shared vs Exclusive Mappings (DavidH)
>>>         - user-triggered page migration
>>>         - NUMA hinting/balancing
>>>         - Enhance khugepaged to collapse to PMD-size from PTE-mapped large folios
>>>     - Compaction of Large Folios (Zi Yan)
>>>     - Swap out small-size THP without Split (Ryan Roberts)
>>
>> ^ that's going to be tough, I can promise. And the only way to live without that
>> would be khugepaged support. (because that's how it's all working for PMD-sized
>> THP after all!)
> 
> Are you referring specifically to the "swap out" line there? If so, it wasn't my
> plan that we would *swap in* large folios - only swap them *out* as large folios

Ah!

> to avoid the cost of splitting. Then when they come back in, the come in as
> single pages, just like PMD-sized THP, if I've understood things correctly. I
> have a patch working and showing the perf improvement as a result. I'm planning
> to post an RFC today, hopefully.
> 
> I don't see the swap-in side as a problem for the initial patch set. OK, they
> come in as single pages, so you lost the potential TLB benefits. But that's no
> worse than today's performance so not a regression. And the ratio of SW savings
> on THP allocation to HW savings from the TLB is very different for small-sized
> THP; much more of the benefit comes from the SW and that's still there.
> 
>>
>> Once a PMD-sized THP was swapped out and evicted, it will always come back in
>> order-0 folios. khugeged will re-collapse into PMD-sized chunks. If we could do
>> that for PTE-sized THP as well ...
> 
> Yes, sure, but that's a future improvement, not a requirement to prevent
> regression vs today, right?

Yes. You can't just currently assume: as soon as you swap, the whole 
benefit is gone because you end up will all order-0 pages.

These are certainly not limiting "merge" factors IMHO, but it's 
certainly one of the things users of distributions will heavily complain 
about ;)

PMD-sized THP are mostly self-healing in that sense.

> 
>>
>>>
>>>
>>>>
>>>>
>>>> Now, these are just my thoughts, and I'm happy about other thoughts.
>>>
>>> As always, thanks for taking the time - I really appreciate it.
>>
>> Sure. Hoping others can comment.
>>
>> My gut feeling is that it's best to focus on getting the sysfs interface
>> right+future proof and handling the stats independently. While being a good
>> debug mechanism, I wouldn't consider these stats a requirement: we don't have
>> them for file/shmem small-sized thp so far as well.
>>
>> So maybe really better to handle the stats in meminfo and friends separately.
>>
> 
> I'd be very happy with that approach if others are bought in.

Yeah. I'm expecting there still to be discussions, but then we shall 
also here other proposals. memcg controls are IMHO certainly future work.

Ryan Roberts Oct. 20, 2023, 12:33 p.m. UTC | #6

On 06/10/2023 21:06, David Hildenbrand wrote:
> On 29.09.23 13:44, Ryan Roberts wrote:
>> Hi All,
> 

[...]

>> NOTE: These changes should not be merged until the prerequisites are complete.
>> These are in progress and tracked at [7].
> 
> We should probably list them here, and classify which one we see as strict a
> requirement, which ones might be an optimization.
> 

Bringing back the discussion of prerequistes to this thread following the
discussion at the mm-alignment meeting on Wednesday.

Slides, updated following discussion to reflect all the agreed items that are
prerequisites and enhancements, are at [1].

I've taken a closer look at the situation with khugepaged, and can confirm that
it does correctly collapse anon small-sized THP into PMD-sized THP. I did notice
though, that one of the khugepaged selftests (collapse_max_ptes_none) fails when
small-sized THP is enabled+always. So I've fixed that test up and will add the
patch to the next version of my series.

So I believe the khugepaged prerequisite can be marked as done.

[1]
https://drive.google.com/file/d/1GnfYFpr7_c1kA41liRUW5YtCb8Cj18Ud/view?usp=sharing&resourcekey=0-U1Mj3-RhLD1JV6EThpyPyA

Thanks,
Ryan

Ryan Roberts Oct. 25, 2023, 4:24 p.m. UTC | #7

On 20/10/2023 13:33, Ryan Roberts wrote:
> On 06/10/2023 21:06, David Hildenbrand wrote:
>> On 29.09.23 13:44, Ryan Roberts wrote:
>>> Hi All,
>>
> 
> [...]
> 
>>> NOTE: These changes should not be merged until the prerequisites are complete.
>>> These are in progress and tracked at [7].
>>
>> We should probably list them here, and classify which one we see as strict a
>> requirement, which ones might be an optimization.
>>
> 
> Bringing back the discussion of prerequistes to this thread following the
> discussion at the mm-alignment meeting on Wednesday.
> 
> Slides, updated following discussion to reflect all the agreed items that are
> prerequisites and enhancements, are at [1].
> 
> I've taken a closer look at the situation with khugepaged, and can confirm that
> it does correctly collapse anon small-sized THP into PMD-sized THP. I did notice
> though, that one of the khugepaged selftests (collapse_max_ptes_none) fails when
> small-sized THP is enabled+always. So I've fixed that test up and will add the
> patch to the next version of my series.
> 
> So I believe the khugepaged prerequisite can be marked as done.
> 
> [1]
> https://drive.google.com/file/d/1GnfYFpr7_c1kA41liRUW5YtCb8Cj18Ud/view?usp=sharing&resourcekey=0-U1Mj3-RhLD1JV6EThpyPyA

Hi All,

It's been a week since the mm alignment meeting discussion we had around
prerequisites and the ABI. I haven't heard any further feedback on the ABI
proposal, so I'm going to be optimistic and assume that nobody has found any
fatal flaws in it :).

Certainly, I think it held up to the potential future policies that Yu Zhou
cited on the call - the possibility of preferring a smaller size over a bigger
one, if the smaller size can be allocated without splitting a contiguous block.
I think the suggestion of adding a per-size priority file would solve it. And in
general because we have a per-size directory, that gives us lots of flexibility
for growth.

Anyway, given the lack of feedback, I'm proposing to spin a new version. I'm
planning to do the following:

  - Drop the accounting patch (#3); we will continue to only account PMD-sized
    THP for now. We can add more counters in future if needed. page cache large
    folios haven't needed any new counters yet.

  - Pivot to the ABI proposed by DavidH; per-size directories in a similar shape
    to that used by hugetlb

  - Drop the "recommend" keyword patch (#6); For now, users will need to
    understand implicitly which sizes are beneficial to their HW perf

  - Drop patch (#7); arch_wants_pte_order() is no longer needed due to dropping
    patch #6

  - Add patch for khugepaged selftest improvement (described in previous email
    above).

  - Ensure that PMD_ORDER is not assumed to be compile-time constant (current
    code is broken on powerpc)

Please shout if you think this is the wrong approach.

On the prerequisites front, we have 2 items still to land:

  - compaction; Zi Yan is working on a v2

  - numa balancing; A developer has signed up and is working on it (I'll leave
    them to reveal themself as preferred).

Thanks,
Ryan

David Hildenbrand Oct. 25, 2023, 6:47 p.m. UTC | #8

On 25.10.23 18:24, Ryan Roberts wrote:
> On 20/10/2023 13:33, Ryan Roberts wrote:
>> On 06/10/2023 21:06, David Hildenbrand wrote:
>>> On 29.09.23 13:44, Ryan Roberts wrote:
>>>> Hi All,
>>>
>>
>> [...]
>>
>>>> NOTE: These changes should not be merged until the prerequisites are complete.
>>>> These are in progress and tracked at [7].
>>>
>>> We should probably list them here, and classify which one we see as strict a
>>> requirement, which ones might be an optimization.
>>>
>>
>> Bringing back the discussion of prerequistes to this thread following the
>> discussion at the mm-alignment meeting on Wednesday.
>>
>> Slides, updated following discussion to reflect all the agreed items that are
>> prerequisites and enhancements, are at [1].
>>
>> I've taken a closer look at the situation with khugepaged, and can confirm that
>> it does correctly collapse anon small-sized THP into PMD-sized THP. I did notice
>> though, that one of the khugepaged selftests (collapse_max_ptes_none) fails when
>> small-sized THP is enabled+always. So I've fixed that test up and will add the
>> patch to the next version of my series.
>>
>> So I believe the khugepaged prerequisite can be marked as done.
>>
>> [1]
>> https://drive.google.com/file/d/1GnfYFpr7_c1kA41liRUW5YtCb8Cj18Ud/view?usp=sharing&resourcekey=0-U1Mj3-RhLD1JV6EThpyPyA
> 
> Hi All,

Hi,

I wanted to remind people in the THP cabal meeting, but that either 
didn't happen or zoomed decided to not let me join :)

> 
> It's been a week since the mm alignment meeting discussion we had around
> prerequisites and the ABI. I haven't heard any further feedback on the ABI
> proposal, so I'm going to be optimistic and assume that nobody has found any
> fatal flaws in it :).

After saying in the call probably 10 times that people should comment 
here if there are reasonable alternatives worth discussing, call me 
"optimistic" as well; but, it's only been a week and people might still 
be thinking about this/

There were two things discussed in the call:

* Yu brought up "lists" so we can have priorities. As briefly discussed
   in the  call, this (a) might not be needed right now in an initial
   version;  (b) the kernel might be able to handle that (or many cases)
   automatically, TBD. Adding lists now would kind-of set the semantics
   of that interface in stone. As you describe below, the approach
   discussed here could easily be extended to cover priorities, if need
   be.

* Hugh raised the point that "bitmap of orders" could be replaced by
   "added THP sizes", which really is "bitmap of orders" shifted to the
   left. To configure 2 MiB +  64Kib, one would get "2097152 + 65536" =
   "2162688" or in KiB "2112". Hm.

Both approaches would require single-option files like "enable_always", 
"enable_madvise" ... which I don't particularly like, but who am I to judge.


> 
> Certainly, I think it held up to the potential future policies that Yu Zhou
> cited on the call - the possibility of preferring a smaller size over a bigger
> one, if the smaller size can be allocated without splitting a contiguous block.
> I think the suggestion of adding a per-size priority file would solve it. And in
> general because we have a per-size directory, that gives us lots of flexibility
> for growth.

Jup, same opinion here. But again, I'm very happy to hear other 
alternatives and why they are better.

> 
> Anyway, given the lack of feedback, I'm proposing to spin a new version. I'm
> planning to do the following:
> 
>    - Drop the accounting patch (#3); we will continue to only account PMD-sized
>      THP for now. We can add more counters in future if needed. page cache large
>      folios haven't needed any new counters yet.
> 
>    - Pivot to the ABI proposed by DavidH; per-size directories in a similar shape
>      to that used by hugetlb
> 
>    - Drop the "recommend" keyword patch (#6); For now, users will need to
>      understand implicitly which sizes are beneficial to their HW perf
> 
>    - Drop patch (#7); arch_wants_pte_order() is no longer needed due to dropping
>      patch #6
> 
>    - Add patch for khugepaged selftest improvement (described in previous email
>      above).
> 
>    - Ensure that PMD_ORDER is not assumed to be compile-time constant (current
>      code is broken on powerpc)
> 
> Please shout if you think this is the wrong approach.

I'll shout that this sounds good to me; rather wait a bit more for more 
opinions. It probably makes sense to post something after the (upcoming) 
merge window, if there are no further discussions here.

John Hubbard Oct. 25, 2023, 7:10 p.m. UTC | #9

On 10/25/23 09:24, Ryan Roberts wrote:
> On the prerequisites front, we have 2 items still to land:
> 
>    - compaction; Zi Yan is working on a v2
> 
>    - numa balancing; A developer has signed up and is working on it (I'll leave
>      them to reveal themself as preferred).
> 

Oh yes, that's me, thanks for providing an easy place to reply-to, for that.


thanks,

Yu Zhao Oct. 25, 2023, 7:11 p.m. UTC | #10

On Wed, Oct 25, 2023 at 12:47 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 25.10.23 18:24, Ryan Roberts wrote:
> > On 20/10/2023 13:33, Ryan Roberts wrote:
> >> On 06/10/2023 21:06, David Hildenbrand wrote:
> >>> On 29.09.23 13:44, Ryan Roberts wrote:
> >>>> Hi All,
> >>>
> >>
> >> [...]
> >>
> >>>> NOTE: These changes should not be merged until the prerequisites are complete.
> >>>> These are in progress and tracked at [7].
> >>>
> >>> We should probably list them here, and classify which one we see as strict a
> >>> requirement, which ones might be an optimization.
> >>>
> >>
> >> Bringing back the discussion of prerequistes to this thread following the
> >> discussion at the mm-alignment meeting on Wednesday.
> >>
> >> Slides, updated following discussion to reflect all the agreed items that are
> >> prerequisites and enhancements, are at [1].
> >>
> >> I've taken a closer look at the situation with khugepaged, and can confirm that
> >> it does correctly collapse anon small-sized THP into PMD-sized THP. I did notice
> >> though, that one of the khugepaged selftests (collapse_max_ptes_none) fails when
> >> small-sized THP is enabled+always. So I've fixed that test up and will add the
> >> patch to the next version of my series.
> >>
> >> So I believe the khugepaged prerequisite can be marked as done.
> >>
> >> [1]
> >> https://drive.google.com/file/d/1GnfYFpr7_c1kA41liRUW5YtCb8Cj18Ud/view?usp=sharing&resourcekey=0-U1Mj3-RhLD1JV6EThpyPyA
> >
> > Hi All,
>
> Hi,
>
> I wanted to remind people in the THP cabal meeting, but that either
> didn't happen or zoomed decided to not let me join :)
>
> >
> > It's been a week since the mm alignment meeting discussion we had around
> > prerequisites and the ABI. I haven't heard any further feedback on the ABI
> > proposal, so I'm going to be optimistic and assume that nobody has found any
> > fatal flaws in it :).
>
> After saying in the call probably 10 times that people should comment
> here if there are reasonable alternatives worth discussing, call me
> "optimistic" as well; but, it's only been a week and people might still
> be thinking about this/
>
> There were two things discussed in the call:
>
> * Yu brought up "lists" so we can have priorities. As briefly discussed
>    in the  call, this (a) might not be needed right now in an initial
>    version;  (b) the kernel might be able to handle that (or many cases)
>    automatically, TBD. Adding lists now would kind-of set the semantics
>    of that interface in stone. As you describe below, the approach
>    discussed here could easily be extended to cover priorities, if need
>    be.

I want to expand on this: the argument that "if you could allocate a
higher order you should use it" is too simplistic. There are many
reasons in addition to the one above that we want to "fall back" to
higher orders, e.g., those higher orders are not on PCP or from the
local node. When we consider the sequence of orders to try, user
preference is just one of the parameters to the cost function. The
bottom line is that I think we should all agree that there needs to be
a cost function down the road, whatever it looks like. Otherwise I
don't know how we can make "auto" happen.

> * Hugh raised the point that "bitmap of orders" could be replaced by
>    "added THP sizes", which really is "bitmap of orders" shifted to the
>    left. To configure 2 MiB +  64Kib, one would get "2097152 + 65536" =
>    "2162688" or in KiB "2112". Hm.

I'm not a big fan of the "bitmap of orders" approach, because it
doesn't address my concern above.

> Both approaches would require single-option files like "enable_always",
> "enable_madvise" ... which I don't particularly like, but who am I to judge.
>
>
> >
> > Certainly, I think it held up to the potential future policies that Yu Zhou
> > cited on the call - the possibility of preferring a smaller size over a bigger
> > one, if the smaller size can be allocated without splitting a contiguous block.
> > I think the suggestion of adding a per-size priority file would solve it. And in
> > general because we have a per-size directory, that gives us lots of flexibility
> > for growth.
>
> Jup, same opinion here. But again, I'm very happy to hear other
> alternatives and why they are better.

I'm not against David's proposal but I want to hear a lot more about
"lots of flexibility for growth" before I'm fully convinced. Why do I
need more convincing? When I brought up that we need to consider the
priority of each order and the potential need to fall back to higher
orders during the meeting, I got the impression people were surprised
why we want to fall back to higher orders. TBH, I was surprised too
that this possibility was never considered. I missed today's THP
meeting too but I'll join next time and if anyone has more ideas on
this, we can spend some time discussing it, especially on how LAF
should cooperate with the page allocator to make better decisions.

Ryan Roberts Oct. 26, 2023, 9:53 a.m. UTC | #11

On 25/10/2023 20:11, Yu Zhao wrote:
> On Wed, Oct 25, 2023 at 12:47 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 25.10.23 18:24, Ryan Roberts wrote:
>>> On 20/10/2023 13:33, Ryan Roberts wrote:
>>>> On 06/10/2023 21:06, David Hildenbrand wrote:
>>>>> On 29.09.23 13:44, Ryan Roberts wrote:
>>>>>> Hi All,
>>>>>
>>>>
>>>> [...]
>>>>
>>>>>> NOTE: These changes should not be merged until the prerequisites are complete.
>>>>>> These are in progress and tracked at [7].
>>>>>
>>>>> We should probably list them here, and classify which one we see as strict a
>>>>> requirement, which ones might be an optimization.
>>>>>
>>>>
>>>> Bringing back the discussion of prerequistes to this thread following the
>>>> discussion at the mm-alignment meeting on Wednesday.
>>>>
>>>> Slides, updated following discussion to reflect all the agreed items that are
>>>> prerequisites and enhancements, are at [1].
>>>>
>>>> I've taken a closer look at the situation with khugepaged, and can confirm that
>>>> it does correctly collapse anon small-sized THP into PMD-sized THP. I did notice
>>>> though, that one of the khugepaged selftests (collapse_max_ptes_none) fails when
>>>> small-sized THP is enabled+always. So I've fixed that test up and will add the
>>>> patch to the next version of my series.
>>>>
>>>> So I believe the khugepaged prerequisite can be marked as done.
>>>>
>>>> [1]
>>>> https://drive.google.com/file/d/1GnfYFpr7_c1kA41liRUW5YtCb8Cj18Ud/view?usp=sharing&resourcekey=0-U1Mj3-RhLD1JV6EThpyPyA
>>>
>>> Hi All,
>>
>> Hi,
>>
>> I wanted to remind people in the THP cabal meeting, but that either
>> didn't happen or zoomed decided to not let me join :)

I didn't make it yesterday either - was having to juggle child care.

>>
>>>
>>> It's been a week since the mm alignment meeting discussion we had around
>>> prerequisites and the ABI. I haven't heard any further feedback on the ABI
>>> proposal, so I'm going to be optimistic and assume that nobody has found any
>>> fatal flaws in it :).
>>
>> After saying in the call probably 10 times that people should comment
>> here if there are reasonable alternatives worth discussing, call me
>> "optimistic" as well; but, it's only been a week and people might still
>> be thinking about this/
>>
>> There were two things discussed in the call:
>>
>> * Yu brought up "lists" so we can have priorities. As briefly discussed
>>    in the  call, this (a) might not be needed right now in an initial
>>    version;  (b) the kernel might be able to handle that (or many cases)
>>    automatically, TBD. Adding lists now would kind-of set the semantics
>>    of that interface in stone. As you describe below, the approach
>>    discussed here could easily be extended to cover priorities, if need
>>    be.
> 
> I want to expand on this: the argument that "if you could allocate a
> higher order you should use it" is too simplistic. There are many
> reasons in addition to the one above that we want to "fall back" to
> higher orders, e.g., those higher orders are not on PCP or from the
> local node. When we consider the sequence of orders to try, user
> preference is just one of the parameters to the cost function. The
> bottom line is that I think we should all agree that there needs to be
> a cost function down the road, whatever it looks like. Otherwise I
> don't know how we can make "auto" happen.

I don't dispute that this sounds like it could be beneficial, but I see it as
research to happen further down the road (as you say), and we don't know what
that research might conclude. Also, I think the scope of this is bigger than
anonymous memory - you would also likely want to look at the policy for page
cache folio order too, since today that's based solely on readahead. So I see it
as an optimization that is somewhat orthogonal to small-sized THP.

The proposed interface does not imply any preference order - it only states
which sizes the user wants the kernel to select from, so I think there is lots
of freedom to change this down the track if the kernel wants to start using the
buddy allocator's state as a signal to make its decisions.

> 
>> * Hugh raised the point that "bitmap of orders" could be replaced by
>>    "added THP sizes", which really is "bitmap of orders" shifted to the
>>    left. To configure 2 MiB +  64Kib, one would get "2097152 + 65536" =
>>    "2162688" or in KiB "2112". Hm.
> 
> I'm not a big fan of the "bitmap of orders" approach, because it
> doesn't address my concern above.
> 
>> Both approaches would require single-option files like "enable_always",
>> "enable_madvise" ... which I don't particularly like, but who am I to judge.
>>
>>
>>>
>>> Certainly, I think it held up to the potential future policies that Yu Zhou
>>> cited on the call - the possibility of preferring a smaller size over a bigger
>>> one, if the smaller size can be allocated without splitting a contiguous block.
>>> I think the suggestion of adding a per-size priority file would solve it. And in
>>> general because we have a per-size directory, that gives us lots of flexibility
>>> for growth.
>>
>> Jup, same opinion here. But again, I'm very happy to hear other
>> alternatives and why they are better.
> 
> I'm not against David's proposal but I want to hear a lot more about
> "lots of flexibility for growth" before I'm fully convinced. 

My point was that in an abstract sense, there are properties a user may wish to
apply individually to a size, which is catered for by having a per-size
directory into which we can add more files if/when requirements for new per-size
properties arise. There are also properties that may be applied globally, for
which we have the top-level transparent_hugepage directory where properties can
be extended or added.

For your case around tighter integration with the buddy allocator, I could
imagine a per-size file allowing the user to specify if the kernel should allow
splitting a higher order to make a THP of that size (I'm not suggesting that's a
good idea, I'm just pointing out that this sort of thing is possible with the
interface). And we have discussed how the global enabled prpoerty could be
extended to support "auto" [1].

But perhaps what we really need are lots more ideas for future directions for
small-sized THP to allow us to evaluate this interface more widely.

> Why do I
> need more convincing? When I brought up that we need to consider the
> priority of each order and the potential need to fall back to higher
> orders during the meeting, I got the impression people were surprised
> why we want to fall back to higher orders. TBH, I was surprised too
> that this possibility was never considered. I missed today's THP
> meeting too but I'll join next time and if anyone has more ideas on
> this, we can spend some time discussing it, especially on how LAF
> should cooperate with the page allocator to make better decisions.

[1]
https://lore.kernel.org/linux-mm/99f8294b-b4e5-424f-d761-24a70a82cc1a@redhat.com/

Thanks,
Ryan

David Hildenbrand Oct. 26, 2023, 3:19 p.m. UTC | #12

[...]

>>> Hi,
>>>
>>> I wanted to remind people in the THP cabal meeting, but that either
>>> didn't happen or zoomed decided to not let me join :)
> 
> I didn't make it yesterday either - was having to juggle child care.

I think it didn't happen, or started quite late (>20 min).

> 
>>>
>>>>
>>>> It's been a week since the mm alignment meeting discussion we had around
>>>> prerequisites and the ABI. I haven't heard any further feedback on the ABI
>>>> proposal, so I'm going to be optimistic and assume that nobody has found any
>>>> fatal flaws in it :).
>>>
>>> After saying in the call probably 10 times that people should comment
>>> here if there are reasonable alternatives worth discussing, call me
>>> "optimistic" as well; but, it's only been a week and people might still
>>> be thinking about this/
>>>
>>> There were two things discussed in the call:
>>>
>>> * Yu brought up "lists" so we can have priorities. As briefly discussed
>>>     in the  call, this (a) might not be needed right now in an initial
>>>     version;  (b) the kernel might be able to handle that (or many cases)
>>>     automatically, TBD. Adding lists now would kind-of set the semantics
>>>     of that interface in stone. As you describe below, the approach
>>>     discussed here could easily be extended to cover priorities, if need
>>>     be.
>>
>> I want to expand on this: the argument that "if you could allocate a
>> higher order you should use it" is too simplistic. There are many
>> reasons in addition to the one above that we want to "fall back" to
>> higher orders, e.g., those higher orders are not on PCP or from the
>> local node. When we consider the sequence of orders to try, user
>> preference is just one of the parameters to the cost function. The
>> bottom line is that I think we should all agree that there needs to be
>> a cost function down the road, whatever it looks like. Otherwise I
>> don't know how we can make "auto" happen.

I agree that there needs to be a cost function, and as pagecache showed 
that's independent of initial enablement.

> 
> I don't dispute that this sounds like it could be beneficial, but I see it as
> research to happen further down the road (as you say), and we don't know what
> that research might conclude. Also, I think the scope of this is bigger than
> anonymous memory - you would also likely want to look at the policy for page
> cache folio order too, since today that's based solely on readahead. So I see it
> as an optimization that is somewhat orthogonal to small-sized THP.

Exactly my thoughts.

The important thing is that we should plan ahead that we still have the 
option to let the admin configure if we cannot make this work 
automatically in the kernel.

What we'll need, nobody knows. Maybe it's a per-size priority, maybe 
it's a single global toggle.

> 
> The proposed interface does not imply any preference order - it only states
> which sizes the user wants the kernel to select from, so I think there is lots
> of freedom to change this down the track if the kernel wants to start using the
> buddy allocator's state as a signal to make its decisions.

Yes.

[..]

>>> Jup, same opinion here. But again, I'm very happy to hear other
>>> alternatives and why they are better.
>>
>> I'm not against David's proposal but I want to hear a lot more about
>> "lots of flexibility for growth" before I'm fully convinced.
> 
> My point was that in an abstract sense, there are properties a user may wish to
> apply individually to a size, which is catered for by having a per-size
> directory into which we can add more files if/when requirements for new per-size
> properties arise. There are also properties that may be applied globally, for
> which we have the top-level transparent_hugepage directory where properties can
> be extended or added.

Exactly, well said.

> 
> For your case around tighter integration with the buddy allocator, I could
> imagine a per-size file allowing the user to specify if the kernel should allow
> splitting a higher order to make a THP of that size (I'm not suggesting that's a
> good idea, I'm just pointing out that this sort of thing is possible with the
> interface). And we have discussed how the global enabled prpoerty could be
> extended to support "auto" [1].
> 
> But perhaps what we really need are lots more ideas for future directions for
> small-sized THP to allow us to evaluate this interface more widely.

David R. motivated a future size-aware setting of the defrag option. As 
discussed we might want something similar to shmem_enable. What will 
happen with khugepaged, nobody knows yet :)

I could imagine exposing per-size boolean read-only properties like 
"native-hw-size" (PMD, cont-pte). But these things require much more 
thought.

Ryan Roberts Oct. 31, 2023, 11:50 a.m. UTC | #13

On 06/10/2023 21:06, David Hildenbrand wrote:
[...]
> 
> Change 2: sysfs interface.
> 
> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
> agree.
> 
> What we expose there and how, is TBD. Again, not a friend of "orders" and
> bitmaps at all. We can do better if we want to go down that path.
> 
> Maybe we should take a look at hugetlb, and how they added support for multiple
> sizes. What *might* make sense could be (depending on which values we actually
> support!)
> 
> 
> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
> 
> Each one would contain an "enabled" and "defrag" file. We want something minimal
> first? Start with the "enabled" option.
> 
> 
> enabled: always [global] madvise never
> 
> Initially, we would set it for PMD-sized THP to "global" and for everything else
> to "never".

Hi David,

I've just started coding this, and it occurs to me that I might need a small
clarification here; the existing global "enabled" control is used to drive
decisions for both anonymous memory and (non-shmem) file-backed memory. But the
proposed new per-size "enabled" is implicitly only controlling anon memory (for
now).

1) Is this potentially confusing for the user? Should we rename the per-size
controls to "anon_enabled"? Or is it preferable to jsut keep it vague for now so
we can reuse the same control for file-backed memory in future?

2) The global control will continue to drive the file-backed memory decision
(for now), even when hugepages-2048kB/enabled != "global"; agreed?

Thanks,
Ryan

Ryan Roberts Oct. 31, 2023, 11:55 a.m. UTC | #14

On 31/10/2023 11:50, Ryan Roberts wrote:
> On 06/10/2023 21:06, David Hildenbrand wrote:
> [...]
>>
>> Change 2: sysfs interface.
>>
>> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
>> agree.
>>
>> What we expose there and how, is TBD. Again, not a friend of "orders" and
>> bitmaps at all. We can do better if we want to go down that path.
>>
>> Maybe we should take a look at hugetlb, and how they added support for multiple
>> sizes. What *might* make sense could be (depending on which values we actually
>> support!)
>>
>>
>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
>>
>> Each one would contain an "enabled" and "defrag" file. We want something minimal
>> first? Start with the "enabled" option.
>>
>>
>> enabled: always [global] madvise never
>>
>> Initially, we would set it for PMD-sized THP to "global" and for everything else
>> to "never".
> 
> Hi David,
> 
> I've just started coding this, and it occurs to me that I might need a small
> clarification here; the existing global "enabled" control is used to drive
> decisions for both anonymous memory and (non-shmem) file-backed memory. But the
> proposed new per-size "enabled" is implicitly only controlling anon memory (for
> now).
> 
> 1) Is this potentially confusing for the user? Should we rename the per-size
> controls to "anon_enabled"? Or is it preferable to jsut keep it vague for now so
> we can reuse the same control for file-backed memory in future?
> 
> 2) The global control will continue to drive the file-backed memory decision
> (for now), even when hugepages-2048kB/enabled != "global"; agreed?
> 
> Thanks,
> Ryan
> 

Also, an implementation question:

hugepage_vma_check() doesn't currently care whether enabled="never" for DAX VMAs
(although it does honour MADV_NOHUGEPAGE and the prctl); It will return true
regardless. Is that by design? It couldn't fathom any reasoning from the commit log:

bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
			bool smaps, bool in_pf, bool enforce_sysfs)
{
	if (!vma->vm_mm)		/* vdso */
		return false;

	/*
	 * Explicitly disabled through madvise or prctl, or some
	 * architectures may disable THP for some mappings, for
	 * example, s390 kvm.
	 * */
	if ((vm_flags & VM_NOHUGEPAGE) ||
	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
		return false;
	/*
	 * If the hardware/firmware marked hugepage support disabled.
	 */
	if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
		return false;

	/* khugepaged doesn't collapse DAX vma, but page fault is fine. */
	if (vma_is_dax(vma))
		return in_pf;  <<<<<<<<

	...
}

David Hildenbrand Oct. 31, 2023, 11:58 a.m. UTC | #15

On 31.10.23 12:50, Ryan Roberts wrote:
> On 06/10/2023 21:06, David Hildenbrand wrote:
> [...]
>>
>> Change 2: sysfs interface.
>>
>> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
>> agree.
>>
>> What we expose there and how, is TBD. Again, not a friend of "orders" and
>> bitmaps at all. We can do better if we want to go down that path.
>>
>> Maybe we should take a look at hugetlb, and how they added support for multiple
>> sizes. What *might* make sense could be (depending on which values we actually
>> support!)
>>
>>
>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
>>
>> Each one would contain an "enabled" and "defrag" file. We want something minimal
>> first? Start with the "enabled" option.
>>
>>
>> enabled: always [global] madvise never
>>
>> Initially, we would set it for PMD-sized THP to "global" and for everything else
>> to "never".
> 
> Hi David,

Hi!

> 
> I've just started coding this, and it occurs to me that I might need a small
> clarification here; the existing global "enabled" control is used to drive
> decisions for both anonymous memory and (non-shmem) file-backed memory. But the
> proposed new per-size "enabled" is implicitly only controlling anon memory (for
> now).

Anon was (way) first, and pagecache later decided to reuse that one as 
an indication whether larger folios are desired.

For the pagecache, it's just a way to enable/disable it globally. As 
there is no memory waste, nobody currently really cares about the exact 
sized the pagecache is allocating (maybe that will change at some point, 
maybe not, who knows).

> 
> 1) Is this potentially confusing for the user? Should we rename the per-size
> controls to "anon_enabled"? Or is it preferable to jsut keep it vague for now so
> we can reuse the same control for file-backed memory in future?

The latter would be my take. Just like we did with the global toggle.

> 
> 2) The global control will continue to drive the file-backed memory decision
> (for now), even when hugepages-2048kB/enabled != "global"; agreed?

That would be my take; it will allocate other sizes already, so just 
glue it to the global toggle and document for the other toggles that 
they only control anonymous THP for now.

David Hildenbrand Oct. 31, 2023, 12:03 p.m. UTC | #16

On 31.10.23 12:55, Ryan Roberts wrote:
> On 31/10/2023 11:50, Ryan Roberts wrote:
>> On 06/10/2023 21:06, David Hildenbrand wrote:
>> [...]
>>>
>>> Change 2: sysfs interface.
>>>
>>> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
>>> agree.
>>>
>>> What we expose there and how, is TBD. Again, not a friend of "orders" and
>>> bitmaps at all. We can do better if we want to go down that path.
>>>
>>> Maybe we should take a look at hugetlb, and how they added support for multiple
>>> sizes. What *might* make sense could be (depending on which values we actually
>>> support!)
>>>
>>>
>>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
>>>
>>> Each one would contain an "enabled" and "defrag" file. We want something minimal
>>> first? Start with the "enabled" option.
>>>
>>>
>>> enabled: always [global] madvise never
>>>
>>> Initially, we would set it for PMD-sized THP to "global" and for everything else
>>> to "never".
>>
>> Hi David,
>>
>> I've just started coding this, and it occurs to me that I might need a small
>> clarification here; the existing global "enabled" control is used to drive
>> decisions for both anonymous memory and (non-shmem) file-backed memory. But the
>> proposed new per-size "enabled" is implicitly only controlling anon memory (for
>> now).
>>
>> 1) Is this potentially confusing for the user? Should we rename the per-size
>> controls to "anon_enabled"? Or is it preferable to jsut keep it vague for now so
>> we can reuse the same control for file-backed memory in future?
>>
>> 2) The global control will continue to drive the file-backed memory decision
>> (for now), even when hugepages-2048kB/enabled != "global"; agreed?
>>
>> Thanks,
>> Ryan
>>
> 
> Also, an implementation question:
> 
> hugepage_vma_check() doesn't currently care whether enabled="never" for DAX VMAs
> (although it does honour MADV_NOHUGEPAGE and the prctl); It will return true
> regardless. Is that by design? It couldn't fathom any reasoning from the commit log:

The whole DAX "hugepage" and THP mixup is just plain confusing. We're 
simply using PUD/PMD mappings of DAX memory, and PMD/PTE- remap when 
required (VMA split I assume, COW).

It doesn't result in any memory waste, so who really cares how it's 
mapped? Apparently we want individual processes to just disable PMD/PUD 
mappings of DAX using the prctl and madvise. Maybe there are good reasons.

Looks like a design decision, probably some legacy leftovers.

Ryan Roberts Oct. 31, 2023, 1:12 p.m. UTC | #17

On 31/10/2023 11:58, David Hildenbrand wrote:
> On 31.10.23 12:50, Ryan Roberts wrote:
>> On 06/10/2023 21:06, David Hildenbrand wrote:
>> [...]
>>>
>>> Change 2: sysfs interface.
>>>
>>> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
>>> agree.
>>>
>>> What we expose there and how, is TBD. Again, not a friend of "orders" and
>>> bitmaps at all. We can do better if we want to go down that path.
>>>
>>> Maybe we should take a look at hugetlb, and how they added support for multiple
>>> sizes. What *might* make sense could be (depending on which values we actually
>>> support!)
>>>
>>>
>>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
>>>
>>> Each one would contain an "enabled" and "defrag" file. We want something minimal
>>> first? Start with the "enabled" option.
>>>
>>>
>>> enabled: always [global] madvise never
>>>
>>> Initially, we would set it for PMD-sized THP to "global" and for everything else
>>> to "never".
>>
>> Hi David,
> 
> Hi!
> 
>>
>> I've just started coding this, and it occurs to me that I might need a small
>> clarification here; the existing global "enabled" control is used to drive
>> decisions for both anonymous memory and (non-shmem) file-backed memory. But the
>> proposed new per-size "enabled" is implicitly only controlling anon memory (for
>> now).
> 
> Anon was (way) first, and pagecache later decided to reuse that one as an
> indication whether larger folios are desired.
> 
> For the pagecache, it's just a way to enable/disable it globally. As there is no
> memory waste, nobody currently really cares about the exact sized the pagecache
> is allocating (maybe that will change at some point, maybe not, who knows).

Yup. Its not _just_ about allocation though; its also about collapse
(MADV_COLLAPSE, khugepaged) which is supported for pagecache pages. I can
imagine value in collapsing to various sizes that are beneficial for HW...
anyway that's for another day.

> 
>>
>> 1) Is this potentially confusing for the user? Should we rename the per-size
>> controls to "anon_enabled"? Or is it preferable to jsut keep it vague for now so
>> we can reuse the same control for file-backed memory in future?
> 
> The latter would be my take. Just like we did with the global toggle.

ACK

> 
>>
>> 2) The global control will continue to drive the file-backed memory decision
>> (for now), even when hugepages-2048kB/enabled != "global"; agreed?
> 
> That would be my take; it will allocate other sizes already, so just glue it to
> the global toggle and document for the other toggles that they only control
> anonymous THP for now.

ACK

>

Ryan Roberts Oct. 31, 2023, 1:13 p.m. UTC | #18

On 31/10/2023 12:03, David Hildenbrand wrote:
> On 31.10.23 12:55, Ryan Roberts wrote:
>> On 31/10/2023 11:50, Ryan Roberts wrote:
>>> On 06/10/2023 21:06, David Hildenbrand wrote:
>>> [...]
>>>>
>>>> Change 2: sysfs interface.
>>>>
>>>> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
>>>> agree.
>>>>
>>>> What we expose there and how, is TBD. Again, not a friend of "orders" and
>>>> bitmaps at all. We can do better if we want to go down that path.
>>>>
>>>> Maybe we should take a look at hugetlb, and how they added support for multiple
>>>> sizes. What *might* make sense could be (depending on which values we actually
>>>> support!)
>>>>
>>>>
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
>>>>
>>>> Each one would contain an "enabled" and "defrag" file. We want something
>>>> minimal
>>>> first? Start with the "enabled" option.
>>>>
>>>>
>>>> enabled: always [global] madvise never
>>>>
>>>> Initially, we would set it for PMD-sized THP to "global" and for everything
>>>> else
>>>> to "never".
>>>
>>> Hi David,
>>>
>>> I've just started coding this, and it occurs to me that I might need a small
>>> clarification here; the existing global "enabled" control is used to drive
>>> decisions for both anonymous memory and (non-shmem) file-backed memory. But the
>>> proposed new per-size "enabled" is implicitly only controlling anon memory (for
>>> now).
>>>
>>> 1) Is this potentially confusing for the user? Should we rename the per-size
>>> controls to "anon_enabled"? Or is it preferable to jsut keep it vague for now so
>>> we can reuse the same control for file-backed memory in future?
>>>
>>> 2) The global control will continue to drive the file-backed memory decision
>>> (for now), even when hugepages-2048kB/enabled != "global"; agreed?
>>>
>>> Thanks,
>>> Ryan
>>>
>>
>> Also, an implementation question:
>>
>> hugepage_vma_check() doesn't currently care whether enabled="never" for DAX VMAs
>> (although it does honour MADV_NOHUGEPAGE and the prctl); It will return true
>> regardless. Is that by design? It couldn't fathom any reasoning from the
>> commit log:
> 
> The whole DAX "hugepage" and THP mixup is just plain confusing. We're simply
> using PUD/PMD mappings of DAX memory, and PMD/PTE- remap when required (VMA
> split I assume, COW).
> 
> It doesn't result in any memory waste, so who really cares how it's mapped?
> Apparently we want individual processes to just disable PMD/PUD mappings of DAX
> using the prctl and madvise. Maybe there are good reasons.
> 
> Looks like a design decision, probably some legacy leftovers.

OK, I'll ensure I keep this behaviour.

Thanks!

>

Yang Shi Oct. 31, 2023, 6:29 p.m. UTC | #19

On Tue, Oct 31, 2023 at 4:55 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 31/10/2023 11:50, Ryan Roberts wrote:
> > On 06/10/2023 21:06, David Hildenbrand wrote:
> > [...]
> >>
> >> Change 2: sysfs interface.
> >>
> >> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
> >> agree.
> >>
> >> What we expose there and how, is TBD. Again, not a friend of "orders" and
> >> bitmaps at all. We can do better if we want to go down that path.
> >>
> >> Maybe we should take a look at hugetlb, and how they added support for multiple
> >> sizes. What *might* make sense could be (depending on which values we actually
> >> support!)
> >>
> >>
> >> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
> >> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
> >> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
> >> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
> >> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
> >> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
> >>
> >> Each one would contain an "enabled" and "defrag" file. We want something minimal
> >> first? Start with the "enabled" option.
> >>
> >>
> >> enabled: always [global] madvise never
> >>
> >> Initially, we would set it for PMD-sized THP to "global" and for everything else
> >> to "never".
> >
> > Hi David,
> >
> > I've just started coding this, and it occurs to me that I might need a small
> > clarification here; the existing global "enabled" control is used to drive
> > decisions for both anonymous memory and (non-shmem) file-backed memory. But the
> > proposed new per-size "enabled" is implicitly only controlling anon memory (for
> > now).
> >
> > 1) Is this potentially confusing for the user? Should we rename the per-size
> > controls to "anon_enabled"? Or is it preferable to jsut keep it vague for now so
> > we can reuse the same control for file-backed memory in future?
> >
> > 2) The global control will continue to drive the file-backed memory decision
> > (for now), even when hugepages-2048kB/enabled != "global"; agreed?
> >
> > Thanks,
> > Ryan
> >
>
> Also, an implementation question:
>
> hugepage_vma_check() doesn't currently care whether enabled="never" for DAX VMAs
> (although it does honour MADV_NOHUGEPAGE and the prctl); It will return true
> regardless. Is that by design? It couldn't fathom any reasoning from the commit log:

The enabled="never" is for anonymous VMAs, DAX VMAs are typically file VMAs.

>
> bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
>                         bool smaps, bool in_pf, bool enforce_sysfs)
> {
>         if (!vma->vm_mm)                /* vdso */
>                 return false;
>
>         /*
>          * Explicitly disabled through madvise or prctl, or some
>          * architectures may disable THP for some mappings, for
>          * example, s390 kvm.
>          * */
>         if ((vm_flags & VM_NOHUGEPAGE) ||
>             test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>                 return false;
>         /*
>          * If the hardware/firmware marked hugepage support disabled.
>          */
>         if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
>                 return false;
>
>         /* khugepaged doesn't collapse DAX vma, but page fault is fine. */
>         if (vma_is_dax(vma))
>                 return in_pf;  <<<<<<<<
>
>         ...
> }
>
>

Ryan Roberts Nov. 1, 2023, 2:02 p.m. UTC | #20

On 31/10/2023 18:29, Yang Shi wrote:
> On Tue, Oct 31, 2023 at 4:55 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 31/10/2023 11:50, Ryan Roberts wrote:
>>> On 06/10/2023 21:06, David Hildenbrand wrote:
>>> [...]
>>>>
>>>> Change 2: sysfs interface.
>>>>
>>>> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
>>>> agree.
>>>>
>>>> What we expose there and how, is TBD. Again, not a friend of "orders" and
>>>> bitmaps at all. We can do better if we want to go down that path.
>>>>
>>>> Maybe we should take a look at hugetlb, and how they added support for multiple
>>>> sizes. What *might* make sense could be (depending on which values we actually
>>>> support!)
>>>>
>>>>
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
>>>>
>>>> Each one would contain an "enabled" and "defrag" file. We want something minimal
>>>> first? Start with the "enabled" option.
>>>>
>>>>
>>>> enabled: always [global] madvise never
>>>>
>>>> Initially, we would set it for PMD-sized THP to "global" and for everything else
>>>> to "never".
>>>
>>> Hi David,
>>>
>>> I've just started coding this, and it occurs to me that I might need a small
>>> clarification here; the existing global "enabled" control is used to drive
>>> decisions for both anonymous memory and (non-shmem) file-backed memory. But the
>>> proposed new per-size "enabled" is implicitly only controlling anon memory (for
>>> now).
>>>
>>> 1) Is this potentially confusing for the user? Should we rename the per-size
>>> controls to "anon_enabled"? Or is it preferable to jsut keep it vague for now so
>>> we can reuse the same control for file-backed memory in future?
>>>
>>> 2) The global control will continue to drive the file-backed memory decision
>>> (for now), even when hugepages-2048kB/enabled != "global"; agreed?
>>>
>>> Thanks,
>>> Ryan
>>>
>>
>> Also, an implementation question:
>>
>> hugepage_vma_check() doesn't currently care whether enabled="never" for DAX VMAs
>> (although it does honour MADV_NOHUGEPAGE and the prctl); It will return true
>> regardless. Is that by design? It couldn't fathom any reasoning from the commit log:
> 
> The enabled="never" is for anonymous VMAs, DAX VMAs are typically file VMAs.

That's not quite true; enabled="never" is honoured for non-DAX/non-shmem file
VMAs (for collapse via CONFIG_READ_ONLY_THP_FOR_FS and more recently for
anything that implements huge_fault() - see
7a81751fcdeb833acc858e59082688e3020bfe12).

Yang Shi Nov. 1, 2023, 6:11 p.m. UTC | #21

On Wed, Nov 1, 2023 at 7:02 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 31/10/2023 18:29, Yang Shi wrote:
> > On Tue, Oct 31, 2023 at 4:55 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 31/10/2023 11:50, Ryan Roberts wrote:
> >>> On 06/10/2023 21:06, David Hildenbrand wrote:
> >>> [...]
> >>>>
> >>>> Change 2: sysfs interface.
> >>>>
> >>>> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
> >>>> agree.
> >>>>
> >>>> What we expose there and how, is TBD. Again, not a friend of "orders" and
> >>>> bitmaps at all. We can do better if we want to go down that path.
> >>>>
> >>>> Maybe we should take a look at hugetlb, and how they added support for multiple
> >>>> sizes. What *might* make sense could be (depending on which values we actually
> >>>> support!)
> >>>>
> >>>>
> >>>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
> >>>> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
> >>>> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
> >>>> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
> >>>> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
> >>>> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
> >>>>
> >>>> Each one would contain an "enabled" and "defrag" file. We want something minimal
> >>>> first? Start with the "enabled" option.
> >>>>
> >>>>
> >>>> enabled: always [global] madvise never
> >>>>
> >>>> Initially, we would set it for PMD-sized THP to "global" and for everything else
> >>>> to "never".
> >>>
> >>> Hi David,
> >>>
> >>> I've just started coding this, and it occurs to me that I might need a small
> >>> clarification here; the existing global "enabled" control is used to drive
> >>> decisions for both anonymous memory and (non-shmem) file-backed memory. But the
> >>> proposed new per-size "enabled" is implicitly only controlling anon memory (for
> >>> now).
> >>>
> >>> 1) Is this potentially confusing for the user? Should we rename the per-size
> >>> controls to "anon_enabled"? Or is it preferable to jsut keep it vague for now so
> >>> we can reuse the same control for file-backed memory in future?
> >>>
> >>> 2) The global control will continue to drive the file-backed memory decision
> >>> (for now), even when hugepages-2048kB/enabled != "global"; agreed?
> >>>
> >>> Thanks,
> >>> Ryan
> >>>
> >>
> >> Also, an implementation question:
> >>
> >> hugepage_vma_check() doesn't currently care whether enabled="never" for DAX VMAs
> >> (although it does honour MADV_NOHUGEPAGE and the prctl); It will return true
> >> regardless. Is that by design? It couldn't fathom any reasoning from the commit log:
> >
> > The enabled="never" is for anonymous VMAs, DAX VMAs are typically file VMAs.
>
> That's not quite true; enabled="never" is honoured for non-DAX/non-shmem file
> VMAs (for collapse via CONFIG_READ_ONLY_THP_FOR_FS and more recently for

When implementing READ_ONLY_THP_FOR_FS the file THP just can be
collapsed by khugepaged, but khugepaged is started iff enabled !=
"never". So READ_ONLY_THP_FOR_FS has to honor it. Unfortunately there
are some confusing exceptions... But anyway DAX is not the same class.

> anything that implements huge_fault() - see
> 7a81751fcdeb833acc858e59082688e3020bfe12).

IIUC this commit just gives the vmas which implement huge_fault() a
chance to handle the fault. Currently just DAX vmas implement
huge_fault() in vanilla kernel AFAICT.

>

John Hubbard Nov. 13, 2023, 3:57 a.m. UTC | #22

On 9/29/23 4:44 AM, Ryan Roberts wrote:
> Hi All,
> 
> This is v6 of a series to implement variable order, large folios for anonymous
> memory. (previously called "ANON_LARGE_FOLIO", "LARGE_ANON_FOLIO",
> "FLEXIBLE_THP", but now exposed as an extension to THP; "small-order THP"). The
> objective of this is to improve performance by allocating larger chunks of
> memory during anonymous page faults:
...
> 
> The major change in this revision is the addition of sysfs controls to allow
> this "small-order THP" to be enabled/disabled/configured independently of
> PMD-order THP. The approach I've taken differs a bit from previous discussions;
> instead of creating a whole new interface ("large_folio"), I'm extending THP. I
> personally think this makes things clearer and more extensible. See [6] for
> detailed rationale.
> 

Hi Ryan and all,

I've done some initial performance testing of this patchset on an arm64
SBSA server. When these patches are combined with the arm64 arch contpte
patches in Ryan's git tree (he has conveniently combined everything
here: [1]), we are seeing a remarkable, consistent speedup of 10.5x on
some memory-intensive workloads. Many test runs, conducted independently
by different engineers and on different machines, have convinced me and
my colleagues that this is an accurate result.

In order to achieve that result, we used the git tree in [1] with
following settings:

     echo always >/sys/kernel/mm/transparent_hugepage/enabled
     echo recommend >/sys/kernel/mm/transparent_hugepage/anon_orders

This was on a aarch64 machine configure to use a 64KB base page size.
That configuration means that the PMD size is 512MB, which is of course
too large for practical use as a pure PMD-THP. However, with with these
small-size (less than PMD-sized) THPs, we get the improvements in TLB
coverage, while still getting pages that are small enough to be
effectively usable.

These results are admittedly limited to aarch64 CPUs so far (because the
contpte TLB coalescing behavior plays a big role), but it's nice to see
real performance numbers from real computers.

Up until now, there has been some healthy discussion and debate about
various aspects of this patchset. This data point shows that at least
for some types of memory-intensive workloads (and I apologize for being
vague, at this point, about exactly *which* workloads), the performance
gains are really worth it: ~10x !

[1] https://gitlab.arm.com/linux-arm/linux-rr.git
         (branch: features/granule_perf/anonfolio-v6-contpte-v2)

thanks,

Matthew Wilcox (Oracle) Nov. 13, 2023, 5:18 a.m. UTC | #23

On Sun, Nov 12, 2023 at 10:57:47PM -0500, John Hubbard wrote:
> I've done some initial performance testing of this patchset on an arm64
> SBSA server. When these patches are combined with the arm64 arch contpte
> patches in Ryan's git tree (he has conveniently combined everything
> here: [1]), we are seeing a remarkable, consistent speedup of 10.5x on
> some memory-intensive workloads. Many test runs, conducted independently
> by different engineers and on different machines, have convinced me and
> my colleagues that this is an accurate result.
> 
> In order to achieve that result, we used the git tree in [1] with
> following settings:
> 
>     echo always >/sys/kernel/mm/transparent_hugepage/enabled
>     echo recommend >/sys/kernel/mm/transparent_hugepage/anon_orders
> 
> This was on a aarch64 machine configure to use a 64KB base page size.
> That configuration means that the PMD size is 512MB, which is of course
> too large for practical use as a pure PMD-THP. However, with with these
> small-size (less than PMD-sized) THPs, we get the improvements in TLB
> coverage, while still getting pages that are small enough to be
> effectively usable.

That is quite remarkable!

My hope is to abolish the 64kB page size configuration.  ie instead of
using the mixture of page sizes that you currently are -- 64k and
1M (right?  Order-0, and order-4), that 4k, 64k and 2MB (order-0,
order-4 and order-9) will provide better performance.

Have you run any experiements with a 4kB page size?

Ryan Roberts Nov. 13, 2023, 10:19 a.m. UTC | #24

On 13/11/2023 05:18, Matthew Wilcox wrote:
> On Sun, Nov 12, 2023 at 10:57:47PM -0500, John Hubbard wrote:
>> I've done some initial performance testing of this patchset on an arm64
>> SBSA server. When these patches are combined with the arm64 arch contpte
>> patches in Ryan's git tree (he has conveniently combined everything
>> here: [1]), we are seeing a remarkable, consistent speedup of 10.5x on
>> some memory-intensive workloads. Many test runs, conducted independently
>> by different engineers and on different machines, have convinced me and
>> my colleagues that this is an accurate result.
>>
>> In order to achieve that result, we used the git tree in [1] with
>> following settings:
>>
>>     echo always >/sys/kernel/mm/transparent_hugepage/enabled
>>     echo recommend >/sys/kernel/mm/transparent_hugepage/anon_orders
>>
>> This was on a aarch64 machine configure to use a 64KB base page size.
>> That configuration means that the PMD size is 512MB, which is of course
>> too large for practical use as a pure PMD-THP. However, with with these
>> small-size (less than PMD-sized) THPs, we get the improvements in TLB
>> coverage, while still getting pages that are small enough to be
>> effectively usable.
> 
> That is quite remarkable!

Yes, agreed - thanks for sharing these results! A very nice Monday morning boost!

> 
> My hope is to abolish the 64kB page size configuration.  ie instead of
> using the mixture of page sizes that you currently are -- 64k and
> 1M (right?  Order-0, and order-4)

Not quite; the contpte-size for a 64K page size is 2M/order-5. (and yes, it is
64K/order-4 for a 4K page size, and 2M/order-7 for a 16K page size. I agree that
intuitively you would expect the order to remain constant, but it doesn't).

The "recommend" setting above will actually enable order-3 as well even though
there is no HW benefit to this. So the full set of available memory sizes here is:

64K/order-0, 512K/order-3, 2M/order-5, 512M/order-13

> , that 4k, 64k and 2MB (order-0,
> order-4 and order-9) will provide better performance.
> 
> Have you run any experiements with a 4kB page size?

Agree that would be interesting with 64K small-sized THP enabled. And I'd love
to get to a world were we universally deal in variable sized chunks of memory,
aligned on 4K boundaries.

In my experience though, there are still some performance benefits to 64K base
page vs 4K+contpte; the page tables are more cache efficient for the former case
- 64K of memory is described by 8 bytes in the former vs 8x16=128 bytes in the
latter. In practice the HW will still only read 8 bytes in the latter but that's
taking up a full cache line vs the former where a single cache line stores 8x
64K entries.

Thanks,
Ryan

Kefeng Wang Nov. 13, 2023, 11:52 a.m. UTC | #25

On 2023/11/13 18:19, Ryan Roberts wrote:
> On 13/11/2023 05:18, Matthew Wilcox wrote:
>> On Sun, Nov 12, 2023 at 10:57:47PM -0500, John Hubbard wrote:
>>> I've done some initial performance testing of this patchset on an arm64
>>> SBSA server. When these patches are combined with the arm64 arch contpte
>>> patches in Ryan's git tree (he has conveniently combined everything
>>> here: [1]), we are seeing a remarkable, consistent speedup of 10.5x on
>>> some memory-intensive workloads. Many test runs, conducted independently
>>> by different engineers and on different machines, have convinced me and
>>> my colleagues that this is an accurate result.
>>>
>>> In order to achieve that result, we used the git tree in [1] with
>>> following settings:
>>>
>>>      echo always >/sys/kernel/mm/transparent_hugepage/enabled
>>>      echo recommend >/sys/kernel/mm/transparent_hugepage/anon_orders
>>>
>>> This was on a aarch64 machine configure to use a 64KB base page size.
>>> That configuration means that the PMD size is 512MB, which is of course
>>> too large for practical use as a pure PMD-THP. However, with with these
>>> small-size (less than PMD-sized) THPs, we get the improvements in TLB
>>> coverage, while still getting pages that are small enough to be
>>> effectively usable.
>>
>> That is quite remarkable!
> 
> Yes, agreed - thanks for sharing these results! A very nice Monday morning boost!
> 
>>
>> My hope is to abolish the 64kB page size configuration.  ie instead of
>> using the mixture of page sizes that you currently are -- 64k and
>> 1M (right?  Order-0, and order-4)
> 
> Not quite; the contpte-size for a 64K page size is 2M/order-5. (and yes, it is
> 64K/order-4 for a 4K page size, and 2M/order-7 for a 16K page size. I agree that
> intuitively you would expect the order to remain constant, but it doesn't).
> 
> The "recommend" setting above will actually enable order-3 as well even though
> there is no HW benefit to this. So the full set of available memory sizes here is:
> 
> 64K/order-0, 512K/order-3, 2M/order-5, 512M/order-13
> 
>> , that 4k, 64k and 2MB (order-0,
>> order-4 and order-9) will provide better performance.
>>
>> Have you run any experiements with a 4kB page size?
> 
> Agree that would be interesting with 64K small-sized THP enabled. And I'd love
> to get to a world were we universally deal in variable sized chunks of memory,
> aligned on 4K boundaries.
> 
> In my experience though, there are still some performance benefits to 64K base
> page vs 4K+contpte; the page tables are more cache efficient for the former case
> - 64K of memory is described by 8 bytes in the former vs 8x16=128 bytes in the
> latter. In practice the HW will still only read 8 bytes in the latter but that's
> taking up a full cache line vs the former where a single cache line stores 8x
> 64K entries.

We test some benchmark, eg, unixbench, lmbench, sysbench, with v5 on
arm64 board(for better evaluation of anon large folio, using ext4,
which don't support large folio for now), will test again and send
the results once v7 out.

1) base page 4k  + without anon large folio
2) base page 64k + without anon large folio
3) base page 4k  + with anon large folio + cont-pte(order = 4,0)

Most of the test results from v5 show the 3) have a good improvement
vs 1), but still low than 2) , also for some latency-sensitive
benchmark, 2) and 3) maybe have poor performance vs 1).

Note, for pcp_allowed_order, order <= PAGE_ALLOC_COSTLY_ORDER=3, for
3), we maybe enlarge it for better scalability when page allocation
on arm64, not test on v5, will try to enlarge it on v7.

> 
> Thanks,
> Ryan
> 
> 
>

Ryan Roberts Nov. 13, 2023, 12:12 p.m. UTC | #26

On 13/11/2023 11:52, Kefeng Wang wrote:
> 
> 
> On 2023/11/13 18:19, Ryan Roberts wrote:
>> On 13/11/2023 05:18, Matthew Wilcox wrote:
>>> On Sun, Nov 12, 2023 at 10:57:47PM -0500, John Hubbard wrote:
>>>> I've done some initial performance testing of this patchset on an arm64
>>>> SBSA server. When these patches are combined with the arm64 arch contpte
>>>> patches in Ryan's git tree (he has conveniently combined everything
>>>> here: [1]), we are seeing a remarkable, consistent speedup of 10.5x on
>>>> some memory-intensive workloads. Many test runs, conducted independently
>>>> by different engineers and on different machines, have convinced me and
>>>> my colleagues that this is an accurate result.
>>>>
>>>> In order to achieve that result, we used the git tree in [1] with
>>>> following settings:
>>>>
>>>>      echo always >/sys/kernel/mm/transparent_hugepage/enabled
>>>>      echo recommend >/sys/kernel/mm/transparent_hugepage/anon_orders
>>>>
>>>> This was on a aarch64 machine configure to use a 64KB base page size.
>>>> That configuration means that the PMD size is 512MB, which is of course
>>>> too large for practical use as a pure PMD-THP. However, with with these
>>>> small-size (less than PMD-sized) THPs, we get the improvements in TLB
>>>> coverage, while still getting pages that are small enough to be
>>>> effectively usable.
>>>
>>> That is quite remarkable!
>>
>> Yes, agreed - thanks for sharing these results! A very nice Monday morning boost!
>>
>>>
>>> My hope is to abolish the 64kB page size configuration.  ie instead of
>>> using the mixture of page sizes that you currently are -- 64k and
>>> 1M (right?  Order-0, and order-4)
>>
>> Not quite; the contpte-size for a 64K page size is 2M/order-5. (and yes, it is
>> 64K/order-4 for a 4K page size, and 2M/order-7 for a 16K page size. I agree that
>> intuitively you would expect the order to remain constant, but it doesn't).
>>
>> The "recommend" setting above will actually enable order-3 as well even though
>> there is no HW benefit to this. So the full set of available memory sizes here
>> is:
>>
>> 64K/order-0, 512K/order-3, 2M/order-5, 512M/order-13
>>
>>> , that 4k, 64k and 2MB (order-0,
>>> order-4 and order-9) will provide better performance.
>>>
>>> Have you run any experiements with a 4kB page size?
>>
>> Agree that would be interesting with 64K small-sized THP enabled. And I'd love
>> to get to a world were we universally deal in variable sized chunks of memory,
>> aligned on 4K boundaries.
>>
>> In my experience though, there are still some performance benefits to 64K base
>> page vs 4K+contpte; the page tables are more cache efficient for the former case
>> - 64K of memory is described by 8 bytes in the former vs 8x16=128 bytes in the
>> latter. In practice the HW will still only read 8 bytes in the latter but that's
>> taking up a full cache line vs the former where a single cache line stores 8x
>> 64K entries.
> 
> We test some benchmark, eg, unixbench, lmbench, sysbench, with v5 on
> arm64 board(for better evaluation of anon large folio, using ext4,
> which don't support large folio for now), will test again and send
> the results once v7 out.

Thanks for the testing and for posting the insights!

> 
> 1) base page 4k  + without anon large folio
> 2) base page 64k + without anon large folio
> 3) base page 4k  + with anon large folio + cont-pte(order = 4,0)
> 
> Most of the test results from v5 show the 3) have a good improvement
> vs 1), but still low than 2) 

Do you have any understanding what the shortfall is for these particular
workloads? Certainly the cache spatial locality benefit of the 64K page tables
could be a factor. But certainly for the workloads I've been looking at, a
bigger factor is often the fact that executable file-backed memory (elf
segments) are not in 64K folios and therefore not contpte-mapped. If the iTLB is
under pressure this can help a lot. I have a change (hack) to force all
executable mappings to be read-ahead into 64K folios and this gives an
improvement. But obviously that only works when the file system supports large
folios (so not ext4 right now). It would certainly be interesting to see just
how close to native 64K we can get when employing these extra ideas.

>, also for some latency-sensitive
> benchmark, 2) and 3) maybe have poor performance vs 1).
> 
> Note, for pcp_allowed_order, order <= PAGE_ALLOC_COSTLY_ORDER=3, for
> 3), we maybe enlarge it for better scalability when page allocation
> on arm64, not test on v5, will try to enlarge it on v7.

Yes interesting! I'm hoping to post v7 this week - just waiting for mm-unstable
to be rebased on v6.7-rc1. I'd be interested to see your results.

> 
>>
>> Thanks,
>> Ryan
>>
>>
>>

John Hubbard Nov. 13, 2023, 2:52 p.m. UTC | #27

On 11/13/23 2:19 AM, Ryan Roberts wrote:
> On 13/11/2023 05:18, Matthew Wilcox wrote:
>> On Sun, Nov 12, 2023 at 10:57:47PM -0500, John Hubbard wrote:
>>> I've done some initial performance testing of this patchset on an arm64
>>> SBSA server. When these patches are combined with the arm64 arch contpte
>>> patches in Ryan's git tree (he has conveniently combined everything
>>> here: [1]), we are seeing a remarkable, consistent speedup of 10.5x on
>>> some memory-intensive workloads. Many test runs, conducted independently
>>> by different engineers and on different machines, have convinced me and
>>> my colleagues that this is an accurate result.
>>>
>>> In order to achieve that result, we used the git tree in [1] with
>>> following settings:
>>>
>>>      echo always >/sys/kernel/mm/transparent_hugepage/enabled
>>>      echo recommend >/sys/kernel/mm/transparent_hugepage/anon_orders
>>>
>>> This was on a aarch64 machine configure to use a 64KB base page size.
>>> That configuration means that the PMD size is 512MB, which is of course
>>> too large for practical use as a pure PMD-THP. However, with with these
>>> small-size (less than PMD-sized) THPs, we get the improvements in TLB
>>> coverage, while still getting pages that are small enough to be
>>> effectively usable.
>>
>> That is quite remarkable!
> 
> Yes, agreed - thanks for sharing these results! A very nice Monday morning boost!
> 
>>
>> My hope is to abolish the 64kB page size configuration.  ie instead of

We've found that a 64KB base page size provides better performance for
HPC and AI workloads, than a 4KB base size, at least for these kinds of
servers. In fact, the 4KB config is considered odd and I'd have to
look around to get one. It's mostly a TLB coverage issue because,
again, the problem typically has a very large memory footprint.

So even though it would be nice from a software point of view, there's
a real need for this.

>> using the mixture of page sizes that you currently are -- 64k and
>> 1M (right?  Order-0, and order-4)
> 
> Not quite; the contpte-size for a 64K page size is 2M/order-5. (and yes, it is
> 64K/order-4 for a 4K page size, and 2M/order-7 for a 16K page size. I agree that
> intuitively you would expect the order to remain constant, but it doesn't).
> 
> The "recommend" setting above will actually enable order-3 as well even though
> there is no HW benefit to this. So the full set of available memory sizes here is:
> 
> 64K/order-0, 512K/order-3, 2M/order-5, 512M/order-13

Yes, and to provide some further details about the test runs, I went
so far as to test individual anon_orders (for example, 
anon_orders=0x20), in order to isolate behavior and see what's really
going on.

On this hardware, anything with 2MB page sizes which corresponds to
anon_orders=0x20, as I recall) or larger, gets the 10x boost. It's
an interesting on/off behavior. This particular server design and
workload combination really prefers 2MB pages, even if they are
held together with contpte instead of a real PMD entry.

> 
>> , that 4k, 64k and 2MB (order-0,
>> order-4 and order-9) will provide better performance.
>>
>> Have you run any experiements with a 4kB page size?
> 
> Agree that would be interesting with 64K small-sized THP enabled. And I'd love
> to get to a world were we universally deal in variable sized chunks of memory,
> aligned on 4K boundaries.
> 
> In my experience though, there are still some performance benefits to 64K base
> page vs 4K+contpte; the page tables are more cache efficient for the former case
> - 64K of memory is described by 8 bytes in the former vs 8x16=128 bytes in the
> latter. In practice the HW will still only read 8 bytes in the latter but that's
> taking up a full cache line vs the former where a single cache line stores 8x
> 64K entries. >
> Thanks,
> Ryan
> 

thanks,

Kefeng Wang Nov. 13, 2023, 2:52 p.m. UTC | #28

On 2023/11/13 20:12, Ryan Roberts wrote:
> On 13/11/2023 11:52, Kefeng Wang wrote:
>>
>>
>> On 2023/11/13 18:19, Ryan Roberts wrote:
>>> On 13/11/2023 05:18, Matthew Wilcox wrote:
>>>> On Sun, Nov 12, 2023 at 10:57:47PM -0500, John Hubbard wrote:
>>>>> I've done some initial performance testing of this patchset on an arm64
>>>>> SBSA server. When these patches are combined with the arm64 arch contpte
>>>>> patches in Ryan's git tree (he has conveniently combined everything
>>>>> here: [1]), we are seeing a remarkable, consistent speedup of 10.5x on
>>>>> some memory-intensive workloads. Many test runs, conducted independently
>>>>> by different engineers and on different machines, have convinced me and
>>>>> my colleagues that this is an accurate result.
>>>>>
>>>>> In order to achieve that result, we used the git tree in [1] with
>>>>> following settings:
>>>>>
>>>>>       echo always >/sys/kernel/mm/transparent_hugepage/enabled
>>>>>       echo recommend >/sys/kernel/mm/transparent_hugepage/anon_orders
>>>>>
>>>>> This was on a aarch64 machine configure to use a 64KB base page size.
>>>>> That configuration means that the PMD size is 512MB, which is of course
>>>>> too large for practical use as a pure PMD-THP. However, with with these
>>>>> small-size (less than PMD-sized) THPs, we get the improvements in TLB
>>>>> coverage, while still getting pages that are small enough to be
>>>>> effectively usable.
>>>>
>>>> That is quite remarkable!
>>>
>>> Yes, agreed - thanks for sharing these results! A very nice Monday morning boost!
>>>
>>>>
>>>> My hope is to abolish the 64kB page size configuration.  ie instead of
>>>> using the mixture of page sizes that you currently are -- 64k and
>>>> 1M (right?  Order-0, and order-4)
>>>
>>> Not quite; the contpte-size for a 64K page size is 2M/order-5. (and yes, it is
>>> 64K/order-4 for a 4K page size, and 2M/order-7 for a 16K page size. I agree that
>>> intuitively you would expect the order to remain constant, but it doesn't).
>>>
>>> The "recommend" setting above will actually enable order-3 as well even though
>>> there is no HW benefit to this. So the full set of available memory sizes here
>>> is:
>>>
>>> 64K/order-0, 512K/order-3, 2M/order-5, 512M/order-13
>>>
>>>> , that 4k, 64k and 2MB (order-0,
>>>> order-4 and order-9) will provide better performance.
>>>>
>>>> Have you run any experiements with a 4kB page size?
>>>
>>> Agree that would be interesting with 64K small-sized THP enabled. And I'd love
>>> to get to a world were we universally deal in variable sized chunks of memory,
>>> aligned on 4K boundaries.
>>>
>>> In my experience though, there are still some performance benefits to 64K base
>>> page vs 4K+contpte; the page tables are more cache efficient for the former case
>>> - 64K of memory is described by 8 bytes in the former vs 8x16=128 bytes in the
>>> latter. In practice the HW will still only read 8 bytes in the latter but that's
>>> taking up a full cache line vs the former where a single cache line stores 8x
>>> 64K entries.
>>
>> We test some benchmark, eg, unixbench, lmbench, sysbench, with v5 on
>> arm64 board(for better evaluation of anon large folio, using ext4,
>> which don't support large folio for now), will test again and send
>> the results once v7 out.
> 
> Thanks for the testing and for posting the insights!
> 
>>
>> 1) base page 4k  + without anon large folio
>> 2) base page 64k + without anon large folio
>> 3) base page 4k  + with anon large folio + cont-pte(order = 4,0)
>>
>> Most of the test results from v5 show the 3) have a good improvement
>> vs 1), but still low than 2)
> 
> Do you have any understanding what the shortfall is for these particular
> workloads? Certainly the cache spatial locality benefit of the 64K page tables
> could be a factor. But certainly for the workloads I've been looking at, a
> bigger factor is often the fact that executable file-backed memory (elf
> segments) are not in 64K folios and therefore not contpte-mapped. If the iTLB is
> under pressure this can help a lot. I have a change (hack) to force all
> executable mappings to be read-ahead into 64K folios and this gives an
> improvement. But obviously that only works when the file system supports large
> folios (so not ext4 right now). It would certainly be interesting to see just
> how close to native 64K we can get when employing these extra ideas.

No detailed analysis, but with base page 64k,
  less page fault
  less TLB operation
  less zone-lock congestion(pcp)
  less buddy split/merge
  no reclaim/compact when allocate 64k page, and no fallback logical
  execfolio
  faster page table opreation?
  ...

> 
>> , also for some latency-sensitive
>> benchmark, 2) and 3) maybe have poor performance vs 1).
>>
>> Note, for pcp_allowed_order, order <= PAGE_ALLOC_COSTLY_ORDER=3, for
>> 3), we maybe enlarge it for better scalability when page allocation
>> on arm64, not test on v5, will try to enlarge it on v7.
> 
> Yes interesting! I'm hoping to post v7 this week - just waiting for mm-unstable
> to be rebased on v6.7-rc1. I'd be interested to see your results.
> 
Glad to see it.>>
>>>
>>> Thanks,
>>> Ryan
>>>
>>>
>>>
> 
>

Matthew Wilcox (Oracle) Nov. 13, 2023, 3:04 p.m. UTC | #29

On Mon, Nov 13, 2023 at 10:19:48AM +0000, Ryan Roberts wrote:
> On 13/11/2023 05:18, Matthew Wilcox wrote:
> > My hope is to abolish the 64kB page size configuration.  ie instead of
> > using the mixture of page sizes that you currently are -- 64k and
> > 1M (right?  Order-0, and order-4)
> 
> Not quite; the contpte-size for a 64K page size is 2M/order-5. (and yes, it is
> 64K/order-4 for a 4K page size, and 2M/order-7 for a 16K page size. I agree that
> intuitively you would expect the order to remain constant, but it doesn't).
> 
> The "recommend" setting above will actually enable order-3 as well even though
> there is no HW benefit to this. So the full set of available memory sizes here is:
> 
> 64K/order-0, 512K/order-3, 2M/order-5, 512M/order-13
> 
> > , that 4k, 64k and 2MB (order-0,
> > order-4 and order-9) will provide better performance.
> > 
> > Have you run any experiements with a 4kB page size?
> 
> Agree that would be interesting with 64K small-sized THP enabled. And I'd love
> to get to a world were we universally deal in variable sized chunks of memory,
> aligned on 4K boundaries.
> 
> In my experience though, there are still some performance benefits to 64K base
> page vs 4K+contpte; the page tables are more cache efficient for the former case
> - 64K of memory is described by 8 bytes in the former vs 8x16=128 bytes in the
> latter. In practice the HW will still only read 8 bytes in the latter but that's
> taking up a full cache line vs the former where a single cache line stores 8x
> 64K entries.

This is going to depend on your workload though -- if you're using more
2MB than 64kB, you get to elide a layer of page table with 4k base,
rather than taking up 4 cache lines with a 64k base.

Ryan Roberts Nov. 14, 2023, 10:57 a.m. UTC | #30

On 13/11/2023 15:04, Matthew Wilcox wrote:
> On Mon, Nov 13, 2023 at 10:19:48AM +0000, Ryan Roberts wrote:
>> On 13/11/2023 05:18, Matthew Wilcox wrote:
>>> My hope is to abolish the 64kB page size configuration.  ie instead of
>>> using the mixture of page sizes that you currently are -- 64k and
>>> 1M (right?  Order-0, and order-4)
>>
>> Not quite; the contpte-size for a 64K page size is 2M/order-5. (and yes, it is
>> 64K/order-4 for a 4K page size, and 2M/order-7 for a 16K page size. I agree that
>> intuitively you would expect the order to remain constant, but it doesn't).
>>
>> The "recommend" setting above will actually enable order-3 as well even though
>> there is no HW benefit to this. So the full set of available memory sizes here is:
>>
>> 64K/order-0, 512K/order-3, 2M/order-5, 512M/order-13
>>
>>> , that 4k, 64k and 2MB (order-0,
>>> order-4 and order-9) will provide better performance.
>>>
>>> Have you run any experiements with a 4kB page size?
>>
>> Agree that would be interesting with 64K small-sized THP enabled. And I'd love
>> to get to a world were we universally deal in variable sized chunks of memory,
>> aligned on 4K boundaries.
>>
>> In my experience though, there are still some performance benefits to 64K base
>> page vs 4K+contpte; the page tables are more cache efficient for the former case
>> - 64K of memory is described by 8 bytes in the former vs 8x16=128 bytes in the
>> latter. In practice the HW will still only read 8 bytes in the latter but that's
>> taking up a full cache line vs the former where a single cache line stores 8x
>> 64K entries.
> 
> This is going to depend on your workload though -- if you're using more
> 2MB than 64kB, you get to elide a layer of page table with 4k base,
> rather than taking up 4 cache lines with a 64k base.

True, but again depending on workload/config, you may have few levels of lookup
for the 64K native case in the first place because you consume more VA bits at
each level.

Matthew Wilcox (Oracle) Dec. 5, 2023, 4:05 p.m. UTC | #31

On Tue, Nov 14, 2023 at 10:57:07AM +0000, Ryan Roberts wrote:
> On 13/11/2023 15:04, Matthew Wilcox wrote:
> > On Mon, Nov 13, 2023 at 10:19:48AM +0000, Ryan Roberts wrote:
> >> On 13/11/2023 05:18, Matthew Wilcox wrote:
> >>> My hope is to abolish the 64kB page size configuration.  ie instead of
> >>> using the mixture of page sizes that you currently are -- 64k and
> >>> 1M (right?  Order-0, and order-4)
> >>
> >> Not quite; the contpte-size for a 64K page size is 2M/order-5. (and yes, it is
> >> 64K/order-4 for a 4K page size, and 2M/order-7 for a 16K page size. I agree that
> >> intuitively you would expect the order to remain constant, but it doesn't).
> >>
> >> The "recommend" setting above will actually enable order-3 as well even though
> >> there is no HW benefit to this. So the full set of available memory sizes here is:
> >>
> >> 64K/order-0, 512K/order-3, 2M/order-5, 512M/order-13
> >>
> >>> , that 4k, 64k and 2MB (order-0,
> >>> order-4 and order-9) will provide better performance.
> >>>
> >>> Have you run any experiements with a 4kB page size?
> >>
> >> Agree that would be interesting with 64K small-sized THP enabled. And I'd love
> >> to get to a world were we universally deal in variable sized chunks of memory,
> >> aligned on 4K boundaries.
> >>
> >> In my experience though, there are still some performance benefits to 64K base
> >> page vs 4K+contpte; the page tables are more cache efficient for the former case
> >> - 64K of memory is described by 8 bytes in the former vs 8x16=128 bytes in the
> >> latter. In practice the HW will still only read 8 bytes in the latter but that's
> >> taking up a full cache line vs the former where a single cache line stores 8x
> >> 64K entries.
> > 
> > This is going to depend on your workload though -- if you're using more
> > 2MB than 64kB, you get to elide a layer of page table with 4k base,
> > rather than taking up 4 cache lines with a 64k base.
> 
> True, but again depending on workload/config, you may have few levels of lookup
> for the 64K native case in the first place because you consume more VA bits at
> each level.

Sorry, missed this email ... let's work it through.

With 4k, and a 48-bit VA space, we get 12 bits at the lowest level, then
9 bits each layer, so 4 * 9 + 12 = 48.  With a 2MB allocation, we
eliminate the bottom layer and examine three cachelines to find it (PGD
entry, PUD entry, PMD entry)

With 64k, we get 16 bits at the lowest level, then 13 bits each layer,
so 3 * 13 + 16 = 55.  With a 2MB allocation, we can't eliminate the
bottom layer, so we still have to examine three cachelines to find it
(PGD, PMD, PTE).  If you can fit into a 42-bit address space, you can
reduce it by one cache miss, but my impression is that applications
which use 21 bits of address space for a single allocation want more
address space than your average application.

[v6,0/9] variable-order, large folios for anonymous memory

Message

Comments