Message ID: 20250324220301.1273038-1-kinseyho@google.com (mailing list archive)
Series: mm: multi-gen LRU scanning for page promotion
On 25-Mar-25 3:32 AM, Kinsey Ho wrote:
> This patch series introduces a software-based approach to identify
> hot pages for promotion in tiered memory systems, particularly those
> leveraging CXL-attached memory, by utilizing the Multi-Generational
> LRU (MGLRU) framework. This method is designed to complement
> hardware-based hotness detection mechanisms like Intel PMU sampling, AMD
> IBS, or dedicated CXL memory monitoring units, providing a more
> comprehensive view of page access patterns, similar to kmmscand [1].
>
> We propose to utilize MGLRU's existing infrastructure to provide hot
> page information. A key benefit here is the reuse of the MGLRU page
> table walk code, thus avoiding the overhead and duplication of effort
> involved in implementing a separate page table scanning mechanism. The
> working set reporting proposal [2] also reuses MGLRU's infrastructure,
> but focuses on cold page detection. It provides its own aging daemon,
> which could additionally provide hot page information by integrating
> this proof-of-concept.
>
> This series relies on kpromoted [3] as the migration engine to implement
> the promotion policies. This is just an early proof-of-concept RFC
> posted now in the context of LSFMM.

Thanks for your patchset. I haven't looked at the patches in detail yet,
but gave it a quick try with the micro-benchmark that I have been using.

The below numbers can be compared with the base numbers that I have
posted here
(https://lore.kernel.org/linux-mm/20250325081832.209140-1-bharata@amd.com/).
Test 2 in the above link is the one I tried with this patchset.
kernel.numa_balancing = 0
demotion=true
cpufreq governor=performance

Benchmark run configuration:
Compute-node = 1
Memory-node = 2
Memory-size = 206158430208
Hot-region-size = 1073741824
Nr-hot-regions = 192
Access pattern = random
Access granularity = 4096
Delay b/n accesses = 0
Load/store ratio = 50l50s
THP used = no
Nr accesses = 25769803776
Nr repetitions = 512

Benchmark completed in 605983205.0 us

numa_hit 63621437
numa_miss 2721737
numa_foreign 2721737
numa_interleave 0
numa_local 48243292
numa_other 18099882
pgpromote_success 0
pgpromote_candidate 0
pgdemote_kswapd 15409682
pgdemote_direct 0
pgdemote_khugepaged 0
numa_pte_updates 0
numa_huge_pte_updates 0
numa_hint_faults 0
numa_hint_faults_local 0
numa_pages_migrated 19596
pgmigrate_success 15429278
pgmigrate_fail 256

kpromoted_recorded_accesses 27647687
kpromoted_recorded_hwhints 0
kpromoted_recorded_pgtscans 27647687
kpromoted_record_toptier 0
kpromoted_record_added 17184209
kpromoted_record_exists 10463478
kpromoted_mig_right_node 0
kpromoted_mig_non_lru 404308
kpromoted_mig_cold_old 6417567
kpromoted_mig_cold_not_accessed 10342825
kpromoted_mig_promoted 19509
kpromoted_mig_dropped 17164700

When I try to get the same benchmark numbers for kpromoted driven by
kmmscand, kpromoted gets overwhelmed with the amount of data that
kmmscand provides, while there are no such issues with the amount of
accesses reported by this patchset. As I have mentioned earlier, the
hot page categorization heuristic in kpromoted is simplistic and may
not have been able to promote more pages than it did for this
benchmark.

Regards,
Bharata.
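[Editor's note: as a reader-side sanity check, the kpromoted counters reported above are internally consistent, and the promotion funnel can be made explicit with a few lines of arithmetic. The numbers below are copied verbatim from the run; the dictionary keys simply mirror the reported counter names and are not an interface of the patchset.]

```python
# Sanity-check the kpromoted counter relationships from the run above.
stats = {
    "recorded_accesses": 27647687,
    "record_added": 17184209,
    "record_exists": 10463478,
    "mig_non_lru": 404308,
    "mig_cold_old": 6417567,
    "mig_cold_not_accessed": 10342825,
    "mig_promoted": 19509,
    "mig_dropped": 17164700,
}

# Every recorded access is either a newly added record or an existing one.
assert stats["record_added"] + stats["record_exists"] == stats["recorded_accesses"]

# Every dropped record falls into one of the three drop reasons.
drops = (stats["mig_non_lru"] + stats["mig_cold_old"]
         + stats["mig_cold_not_accessed"])
assert drops == stats["mig_dropped"]

# Every added record was either promoted or dropped.
assert stats["mig_promoted"] + stats["mig_dropped"] == stats["record_added"]

promo_rate = stats["mig_promoted"] / stats["record_added"]
print(f"promotion rate: {promo_rate:.4%}")  # roughly 0.11% of added records
```

In other words, of ~17.2M candidate records, only ~0.11% were actually promoted; the rest were dropped as cold, stale, or non-LRU, which is consistent with the simplistic-heuristic caveat above.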
On Tue, Mar 25, 2025 at 4:56 AM Bharata B Rao <bharata@amd.com> wrote:
>
> Thanks for your patchset. I haven't looked at the patches in detail yet,
> but gave it a quick try with the micro-benchmark that I have been using.

Thanks for running the numbers. Unfortunately neither of us can attend
LSF/MM in person, but we're excited about this opportunity for
collaboration.

> The below numbers can be compared with the base numbers that I have
> posted here
> (https://lore.kernel.org/linux-mm/20250325081832.209140-1-bharata@amd.com/).
> Test 2 in the above link is the one I tried with this patchset.
>
> kernel.numa_balancing = 0
> demotion=true
> cpufreq governor=performance
>
> Benchmark run configuration:
> Compute-node = 1
> Memory-node = 2
> Memory-size = 206158430208
> Hot-region-size = 1073741824
> Nr-hot-regions = 192
> Access pattern = random
> Access granularity = 4096
> Delay b/n accesses = 0
> Load/store ratio = 50l50s
> THP used = no
> Nr accesses = 25769803776
> Nr repetitions = 512
>
> Benchmark completed in 605983205.0 us

The benchmark does seem to complete in less time, but I'm not sure why,
especially given the small number of pages promoted. I think it would
also be useful to see the usage breakdown of DRAM/CXL over time.
> numa_hit 63621437
> numa_miss 2721737
> numa_foreign 2721737
> numa_interleave 0
> numa_local 48243292
> numa_other 18099882
> pgpromote_success 0
> pgpromote_candidate 0
> pgdemote_kswapd 15409682
> pgdemote_direct 0
> pgdemote_khugepaged 0
> numa_pte_updates 0
> numa_huge_pte_updates 0
> numa_hint_faults 0
> numa_hint_faults_local 0
> numa_pages_migrated 19596
> pgmigrate_success 15429278
> pgmigrate_fail 256
>
> kpromoted_recorded_accesses 27647687
> kpromoted_recorded_hwhints 0
> kpromoted_recorded_pgtscans 27647687
> kpromoted_record_toptier 0

Makes sense; we skip toptier scanning.

> kpromoted_record_added 17184209
> kpromoted_record_exists 10463478
> kpromoted_mig_right_node 0
> kpromoted_mig_non_lru 404308
> kpromoted_mig_cold_old 6417567
> kpromoted_mig_cold_not_accessed 10342825
> kpromoted_mig_promoted 19509

Compared to 611077 (the IBS number), this is a lot lower.

> kpromoted_mig_dropped 17164700
>
> When I try to get the same benchmark numbers for kpromoted driven by
> kmmscand, kpromoted gets overwhelmed with the amount of data that
> kmmscand provides, while there are no such issues with the amount of
> accesses reported by this patchset.

The scan interval in this series is 4 seconds, while kmmscand's pause
between scans is 16ms. So there are definitely some gaps here. The
MGLRU page table walk also has a bunch of optimizations, and some of
them are more focused on reclaim, so we might need to tweak some
things there too.

Yuanchu
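[Editor's note: the scan-cadence gap described above can be quantified with back-of-the-envelope arithmetic. The 4 s and 16 ms intervals and the benchmark duration are taken from the emails; the pass counts below are simple derived estimates, not measured data, and assume each scan pass completes within its interval.]

```python
# Rough comparison of scan-pass counts over the benchmark duration,
# using the 4 s MGLRU scan interval and the 16 ms kmmscand pause quoted above.
benchmark_us = 605983205.0           # "Benchmark completed in 605983205.0 us"
duration_s = benchmark_us / 1e6      # ~606 seconds

mglru_interval_s = 4.0               # scan interval in this series
kmmscand_interval_s = 0.016          # kmmscand's pause between scans

mglru_passes = int(duration_s / mglru_interval_s)
kmmscand_passes = int(duration_s / kmmscand_interval_s)

print(f"MGLRU-driven passes:    ~{mglru_passes}")     # ~151
print(f"kmmscand-driven passes: ~{kmmscand_passes}")  # ~37873
print(f"cadence ratio: ~{round(mglru_interval_s / kmmscand_interval_s)}x")
```

A roughly 250x difference in scan cadence is consistent with kpromoted being overwhelmed by kmmscand's access stream while easily keeping up with the volume reported by this series.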