[RFC,v1,0/2] mm: multi-gen LRU scanning for page promotion

Message ID 20250324220301.1273038-1-kinseyho@google.com (mailing list archive)

Kinsey Ho March 24, 2025, 10:02 p.m. UTC
This patch series introduces a software-based approach to identify
hot pages for promotion in tiered memory systems, particularly those
leveraging CXL-attached memory, by utilizing the Multi-Generational
LRU (MGLRU) framework. This method is designed to complement
hardware-based hotness detection mechanisms like Intel PMU sampling, AMD
IBS, or dedicated CXL memory monitoring units, providing a more
comprehensive view of page access patterns, similar to kmmscand [1].

We propose to utilize MGLRU's existing infrastructure to provide hot
page information. A key benefit here is the reuse of the MGLRU page
table walk code, thus avoiding the overhead and duplication of effort
involved in implementing a separate page table scanning mechanism. The
working set reporting proposal [2] also reuses MGLRU's infrastructure,
but focuses on cold page detection. It provides its own aging daemon,
which could additionally provide hot page information by integrating
this proof-of-concept.

This series relies on kpromoted [3] as the migration engine to implement
the promotion policies. This is just an early proof-of-concept RFC
posted now in the context of LSFMM.

Kinsey Ho (2):
  mm: mglru: generalize page table walk
  mm: klruscand: use mglru scanning for page promotion

 include/linux/mmzone.h |   5 ++
 mm/Kconfig             |   8 ++
 mm/Makefile            |   1 +
 mm/internal.h          |   4 +
 mm/klruscand.c         | 118 +++++++++++++++++++++++++++
 mm/vmscan.c            | 177 ++++++++++++++++++++++++++++++-----------
 6 files changed, 267 insertions(+), 46 deletions(-)
 create mode 100644 mm/klruscand.c

Comments

Bharata B Rao March 25, 2025, 11:56 a.m. UTC | #1
On 25-Mar-25 3:32 AM, Kinsey Ho wrote:
> This patch series introduces a software-based approach to identify
> hot pages for promotion in tiered memory systems, particularly those
> leveraging CXL-attached memory, by utilizing the Multi-Generational
> LRU (MGLRU) framework. This method is designed to complement
> hardware-based hotness detection mechanisms like Intel PMU sampling, AMD
> IBS, or dedicated CXL memory monitoring units, providing a more
> comprehensive view of page access patterns, similar to kmmscand [1].
> 
> We propose to utilize MGLRU's existing infrastructure to provide hot
> page information. A key benefit here is the reuse of the MGLRU page
> table walk code, thus avoiding the overhead and duplication of effort
> involved in implementing a separate page table scanning mechanism. The
> working set reporting proposal [2] also reuses MGLRU's infrastructure,
> but focuses on cold page detection. It provides its own aging daemon,
> which could additionally provide hot page information by integrating
> this proof-of-concept.
> 
> This series relies on kpromoted [3] as the migration engine to implement
> the promotion policies. This is just an early proof-of-concept RFC
> posted now in the context of LSFMM.

Thanks for your patchset. I haven't looked at the patches in detail yet, 
but gave it a quick try with the micro-benchmark that I have been using.

The below numbers can be compared with the base numbers that I have 
posted here 
(https://lore.kernel.org/linux-mm/20250325081832.209140-1-bharata@amd.com/). 
Test 2 in the above link is the one I tried with this patchset.

kernel.numa_balancing = 0
demotion=true
cpufreq governor=performance

Benchmark run configuration:
Compute-node            = 1
Memory-node             = 2
Memory-size             = 206158430208
Hot-region-size         = 1073741824
Nr-hot-regions          = 192
Access pattern          = random
Access granularity      = 4096
Delay b/n accesses      = 0
Load/store ratio        = 50l50s
THP used                = no
Nr accesses             = 25769803776
Nr repetitions          = 512

Benchmark completed in 605983205.0 us
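
For readers parsing the raw byte counts, the configuration above converts to rounder units. The Python sketch below only performs that conversion. The variable names are illustrative and not part of the benchmark, and nothing here assumes how the benchmark itself interprets these parameters.

```python
# Unit conversion for the benchmark configuration reported above.
# Variable names are illustrative; the values are copied from the report.
GIB = 1 << 30

memory_size = 206158430208      # bytes (Memory-size)
hot_region_size = 1073741824    # bytes (Hot-region-size)
nr_hot_regions = 192            # Nr-hot-regions
nr_accesses = 25769803776       # Nr accesses
runtime_us = 605983205.0        # "Benchmark completed in ... us"

memory_size_gib = memory_size / GIB
hot_region_gib = hot_region_size / GIB
hot_total_gib = hot_region_gib * nr_hot_regions
nr_accesses_gi = nr_accesses / GIB
runtime_s = runtime_us / 1e6

print(f"Memory size:       {memory_size_gib:.0f} GiB")
print(f"Hot region size:   {hot_region_gib:.0f} GiB")
print(f"Hot regions total: {hot_total_gib:.0f} GiB")
print(f"Accesses:          {nr_accesses_gi:.0f} * 2^30")
print(f"Runtime:           {runtime_s:.1f} s")
```

By this arithmetic, the 192 hot regions of 1 GiB each add up to the full 192 GiB memory size.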

numa_hit 63621437
numa_miss 2721737
numa_foreign 2721737
numa_interleave 0
numa_local 48243292
numa_other 18099882
pgpromote_success 0
pgpromote_candidate 0
pgdemote_kswapd 15409682
pgdemote_direct 0
pgdemote_khugepaged 0
numa_pte_updates 0
numa_huge_pte_updates 0
numa_hint_faults 0
numa_hint_faults_local 0
numa_pages_migrated 19596
pgmigrate_success 15429278
pgmigrate_fail 256
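
As a sanity check on the vmstat numbers above, two sums happen to line up in this run. The sketch below is only an arithmetic observation about these reported values; it is not a statement about how the kernel defines or maintains the counters, and the dictionary is just an illustrative container.

```python
# Arithmetic cross-check of the vmstat values reported above.
# This only observes that the sums match for this run; it does not
# claim the kernel guarantees these relationships.
vm = {
    "numa_hit": 63621437,
    "numa_miss": 2721737,
    "numa_local": 48243292,
    "numa_other": 18099882,
    "pgdemote_kswapd": 15409682,
    "numa_pages_migrated": 19596,
    "pgmigrate_success": 15429278,
}

hit_plus_miss = vm["numa_hit"] + vm["numa_miss"]
local_plus_other = vm["numa_local"] + vm["numa_other"]
demoted_plus_numa_migrated = vm["pgdemote_kswapd"] + vm["numa_pages_migrated"]

print("numa_hit + numa_miss             =", hit_plus_miss)
print("numa_local + numa_other          =", local_plus_other)
print("pgdemote_kswapd + numa_pages_migrated =", demoted_plus_numa_migrated)
print("pgmigrate_success                =", vm["pgmigrate_success"])
```

In this run, demotions by kswapd account for all but 19596 of the successful migrations.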

kpromoted_recorded_accesses 27647687
kpromoted_recorded_hwhints 0
kpromoted_recorded_pgtscans 27647687
kpromoted_record_toptier 0
kpromoted_record_added 17184209
kpromoted_record_exists 10463478
kpromoted_mig_right_node 0
kpromoted_mig_non_lru 404308
kpromoted_mig_cold_old 6417567
kpromoted_mig_cold_not_accessed 10342825
kpromoted_mig_promoted 19509
kpromoted_mig_dropped 17164700

When I try to get the same benchmark numbers for kpromoted driven by
kmmscand, kpromoted gets overwhelmed by the amount of data that
kmmscand provides, whereas there is no such issue with the number of
accesses reported by this patchset.

As I mentioned earlier, the hot page categorization heuristics in
kpromoted are simplistic, so kpromoted may not have been able to promote
more pages than it did for this benchmark.

Regards,
Bharata.

Yuanchu Xie March 25, 2025, 9:55 p.m. UTC | #2
On Tue, Mar 25, 2025 at 4:56 AM Bharata B Rao <bharata@amd.com> wrote:
>
> Thanks for your patchset. I haven't looked at the patches in detail yet,
> but gave it a quick try with the micro-benchmark that I have been using.
Thanks for running the numbers. Unfortunately neither of us can attend
LSF/MM in person, but we're excited about this opportunity for
collaboration.

>
> The below numbers can be compared with the base numbers that I have
> posted here
> (https://lore.kernel.org/linux-mm/20250325081832.209140-1-bharata@amd.com/).
> Test 2 in the above link is the one I tried with this patchset.
>
> kernel.numa_balancing = 0
> demotion=true
> cpufreq governor=performance
>
> Benchmark run configuration:
> Compute-node            = 1
> Memory-node             = 2
> Memory-size             = 206158430208
> Hot-region-size         = 1073741824
> Nr-hot-regions          = 192
> Access pattern          = random
> Access granularity      = 4096
> Delay b/n accesses      = 0
> Load/store ratio        = 50l50s
> THP used                = no
> Nr accesses             = 25769803776
> Nr repetitions          = 512
>
> Benchmark completed in 605983205.0 us
The benchmark does seem to complete in less time, but I'm not sure why,
especially given the small number of pages promoted. I think it would
also be useful to see the usage breakdown of DRAM/CXL over time.

>
> numa_hit 63621437
> numa_miss 2721737
> numa_foreign 2721737
> numa_interleave 0
> numa_local 48243292
> numa_other 18099882
> pgpromote_success 0
> pgpromote_candidate 0
> pgdemote_kswapd 15409682
> pgdemote_direct 0
> pgdemote_khugepaged 0
> numa_pte_updates 0
> numa_huge_pte_updates 0
> numa_hint_faults 0
> numa_hint_faults_local 0
> numa_pages_migrated 19596
> pgmigrate_success 15429278
> pgmigrate_fail 256
>
> kpromoted_recorded_accesses 27647687
> kpromoted_recorded_hwhints 0
> kpromoted_recorded_pgtscans 27647687
> kpromoted_record_toptier 0
Makes sense, we skip toptier scanning

> kpromoted_record_added 17184209
> kpromoted_record_exists 10463478
> kpromoted_mig_right_node 0
> kpromoted_mig_non_lru 404308
> kpromoted_mig_cold_old 6417567
> kpromoted_mig_cold_not_accessed 10342825
> kpromoted_mig_promoted 19509
Compared to 611077 (IBS number) this is a lot lower.

> kpromoted_mig_dropped 17164700
>
> When I try to get the same benchmark numbers for kpromoted driven by
> kmmscand, kpromoted gets overwhelmed with the amount of data that
> kmmscand provides while no such issues with the amount of accesses
> reported by this patchset.
The scan interval in this series is 4 seconds, while kmmscand's pause
between scans is 16 ms. So there are definitely some gaps here.
The MGLRU page table walk also has a bunch of optimizations, and some
of them are more focused on reclaim, so we might need to tweak some
things there too.
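
To put rough numbers on that gap, the sketch below computes two ratios from figures quoted in this thread: the 4 second scan interval versus the 16 ms pause, and the 611077 promotions quoted as the IBS number versus the 19509 reported for this series. The variable names are illustrative, and the comparison is back-of-the-envelope only, since a scan interval and a pause between scans are not necessarily the same kind of quantity.

```python
# Back-of-the-envelope ratios from numbers quoted in this thread.
# Variable names are illustrative; a "scan interval" and a "pause
# between scans" may not be directly comparable quantities.
klruscand_scan_interval_ms = 4000   # 4 seconds, this series
kmmscand_scan_pause_ms = 16         # 16 ms, kmmscand

promoted_ibs = 611077               # quoted above as the IBS number
promoted_this_series = 19509        # kpromoted_mig_promoted in this run

interval_ratio = klruscand_scan_interval_ms / kmmscand_scan_pause_ms
promotion_ratio = promoted_ibs / promoted_this_series

print(f"scan interval ratio: {interval_ratio:.0f}x")
print(f"promotion ratio:     {promotion_ratio:.1f}x")
```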


Yuanchu