Message ID: 20250319193028.29514-1-raghavendra.kt@amd.com (mailing list archive)
Series: mm: slowtier page promotion based on PTE A bit
On Wed, 19 Mar 2025, Raghavendra K T wrote:

>Introduction:
>=============
>In the current hot page promotion, all the activities, including the
>process address space scanning, NUMA hint fault handling and page
>migration, are performed in the process context, i.e., the scanning
>overhead is borne by applications.
>
>This is the RFC V1 patch series for (slow tier) CXL page promotion.
>The approach in this patchset assists/addresses the issue by adding PTE
>Accessed bit scanning.
>
>Scanning is done by a global kernel thread which routinely scans all
>the processes' address spaces and checks for accesses by reading the
>PTE A bit.
>
>A separate migration thread migrates/promotes the pages to the toptier
>node based on a simple heuristic that uses toptier scan/access information
>of the mm.
>
>Additionally, based on the feedback for RFC V0 [4], a prctl knob with
>a scalar value is provided to control per-task scanning.
>
>Initial results show promising numbers on a microbenchmark. Numbers with
>real benchmarks and findings (tunings) will follow soon.
>
>Experiment:
>============
>Abench microbenchmark:
>- Allocates 8GB/16GB/32GB/64GB of memory on the CXL node.
>- 64 threads are created, and each thread randomly accesses pages at 4K
>  granularity.
>- 512 iterations with a delay of 1 us between two successive iterations.
>
>SUT: 512 CPU, 2 node 256GB, AMD EPYC.
>
>3 runs, command: abench -m 2 -d 1 -i 512 -s <size>
>
>Measures how much time is taken to complete the task; lower is better.
>The expectation is that CXL node memory is migrated as fast as possible.
>
>Base case:    6.14-rc6 w/ numab mode = 2 (hot page promotion is enabled).
>Patched case: 6.14-rc6 w/ numab mode = 1 (numa balancing is enabled);
>we expect the daemon to do page promotion.
>
>Result:
>========
>         base NUMAB2             patched NUMAB1
>         time in sec (%stdev)    time in sec (%stdev)    %gain
> 8GB      134.33 ( 0.19 )         120.52 ( 0.21 )        10.28
>16GB      292.24 ( 0.60 )         275.97 ( 0.18 )         5.56
>32GB      585.06 ( 0.24 )         546.49 ( 0.35 )         6.59
>64GB     1278.98 ( 0.27 )        1205.20 ( 2.29 )         5.76
>
>Base case:    6.14-rc6 w/ numab mode = 1 (numa balancing is enabled).
>Patched case: 6.14-rc6 w/ numab mode = 1 (numa balancing is enabled).
>         base NUMAB1             patched NUMAB1
>         time in sec (%stdev)    time in sec (%stdev)    %gain
> 8GB      186.71 ( 0.99 )         120.52 ( 0.21 )        35.45
>16GB      376.09 ( 0.46 )         275.97 ( 0.18 )        26.62
>32GB      744.37 ( 0.71 )         546.49 ( 0.35 )        26.58
>64GB     1534.49 ( 0.09 )        1205.20 ( 2.29 )        21.45

Very promising, but a few things. A fairer comparison would be vs
kpromoted using the PROT_NONE of NUMAB2: essentially disregarding the
asynchronous migration, and effectively measuring synchronous vs
asynchronous scanning overhead and the implied semantics. That would
save the extra kthread and only have a per-NUMA-node migrator, which
is the common denominator for all these sources of hotness.

Similarly, while I don't see any users disabling NUMAB1 _and_ enabling
this sort of thing, it would be useful to have data on no numa balancing
at all. If nothing else, that would measure the effects of the dest
node heuristics.

Also, data/workloads involving demotion would be good to have for a more
complete picture.

>
>Major Changes since V0:
>======================
>- A separate migration thread is used for migration, thus alleviating the
>  need for multi-threaded scanning (at least as per tracing).
>
>- A simple heuristic for target node calculation is added.
>
>- A prctl interface (David R) with a scalar value is added to control
>  per-task scanning.
>
>- Steve's comment on tracing incorporated.
>
>- Davidlohr's reported bugfix.
>
>- Initial scan delay similar to NUMAB1 mode added.
>
>- Got rid of migration lock during mm_walk.
>
>PS: Occasionally, when scanning is too fast compared to migration, I do
>see scanning stall waiting for the lock. This should be fixed in the next
>version by using memslot for migration.
>
>Disclaimer, Takeaways, discussion points and future TODOs
>==============================================================
>1) Source code and patch segregation are still to be improved; the current
>patchset only provides a skeleton.
>
>2) Unification of the sources of hotness is not easy (as mentioned perhaps
>by Jonathan), but perhaps all the consumers/producers can work cooperatively.
>
>Scanning:
>3) Major positive: the current patchset is able to cover all of the process
>address space scanning effectively, with simple algorithms to tune scan_size
>and scan_period.
>
>4) Effective tracking of folios or the address space using DAMON, or ideas
>used in DAMON, is yet to be explored fully.
>
>5) Use timestamp-information-based migration (similar to numab mode=2)
>instead of migrating immediately when the PTE A bit is set.
>(cons:
> - It will not be accurate since it is done outside of process context.
> - The performance benefit may be lost.)
>
>Migration:
>
>6) Currently the fast scanner can bombard the migration list; the migration
>list needs to be maintained in a more organized way (e.g. using memslot, so
>that it is also helpful in maintaining recency/frequency information,
>similar to kpromoted posted by Bharata).
>
>7) NUMAB2 throttling is very effective; we would need a common interface to
>control migration and also exploit batch migration.

Does NUMAB2 continue to exist? Are there any benefits in having two
sources?

Thanks,
Davidlohr

>
>Thanks to Bharata, Joannes, Gregory, SJ, Chris, David Rientjes, Jonathan,
>John Hubbard, Davidlohr, Ying, Willy, Hyeonggon Yoo and many of you for
>your valuable comments and support.
>
>Links:
>[1] https://lore.kernel.org/lkml/20241127082201.1276-1-gourry@gourry.net/
>[2] kstaled: https://lore.kernel.org/lkml/1317170947-17074-3-git-send-email-walken@google.com/#r
>[3] https://lore.kernel.org/lkml/Y+Pj+9bbBbHpf6xM@hirez.programming.kicks-ass.net/
>[4] RFC V0: https://lore.kernel.org/all/20241201153818.2633616-1-raghavendra.kt@amd.com/
>[5] Recap: https://lore.kernel.org/linux-mm/20241226012833.rmmbkws4wdhzdht6@ed.ac.uk/T/
>[6] LSFMM: https://lore.kernel.org/linux-mm/20250123105721.424117-1-raghavendra.kt@amd.com/#r
>[7] LSFMM: https://lore.kernel.org/linux-mm/20250131130901.00000dd1@huawei.com/
>
>I might have CCed more or fewer people than needed, unintentionally.
>
>Patch organization:
>patch 1-4:   initial skeleton for scanning and migration
>patch 5:     migration
>patch 6-8:   scanning optimizations
>patch 9:     target_node heuristic
>patch 10-12: sysfs, vmstat and tracing
>patch 13:    a basic prctl implementation
> >Raghavendra K T (13): > mm: Add kmmscand kernel daemon > mm: Maintain mm_struct list in the system > mm: Scan the mm and create a migration list > mm: Create a separate kernel thread for migration > mm/migration: Migrate accessed folios to toptier node > mm: Add throttling of mm scanning using scan_period > mm: Add throttling of mm scanning using scan_size > mm: Add initial scan delay > mm: Add heuristic to calculate target node > sysfs: Add sysfs support to tune scanning > vmstat: Add vmstat counters > trace/kmmscand: Add tracing of scanning and migration > prctl: Introduce new prctl to control scanning > > Documentation/filesystems/proc.rst | 2 + > fs/exec.c | 4 + > fs/proc/task_mmu.c | 4 + > include/linux/kmmscand.h | 31 + > include/linux/migrate.h | 2 + > include/linux/mm.h | 11 + > include/linux/mm_types.h | 7 + > include/linux/vm_event_item.h | 10 + > include/trace/events/kmem.h | 90 ++ > include/uapi/linux/prctl.h | 7 + > kernel/fork.c | 8 + > kernel/sys.c | 25 + > mm/Kconfig | 8 + > mm/Makefile | 1 + > mm/kmmscand.c | 1515 ++++++++++++++++++++++++++++ > mm/migrate.c | 2 +- > mm/vmstat.c | 10 + > 17 files changed, 1736 insertions(+), 1 deletion(-) > create mode 100644 include/linux/kmmscand.h > create mode 100644 mm/kmmscand.c > > >base-commit: b7f94fcf55469ad3ef8a74c35b488dbfa314d1bb >-- >2.34.1 >
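For illustration, here is a minimal user-space sketch of the two-thread design the cover letter describes: one thread stands in for the scanner (kmmscand), appending candidate pages to a shared migration list, and a second thread stands in for the migration daemon, draining that list. The single list lock is the contention point mentioned in the PS above. All names are hypothetical; this is not the patchset's code.

    #include <pthread.h>
    #include <stdlib.h>

    /* One entry per page the scanner saw with the PTE A bit set. */
    struct candidate {
            unsigned long pfn;
            struct candidate *next;
    };

    static struct candidate *migration_list;   /* shared scan -> migrate list */
    static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
    static int scanning_done;                  /* written/read under list_lock */

    /* Stand-in for the scanning daemon: produce candidates. */
    static void *scan_worker(void *arg)
    {
            for (unsigned long pfn = 0; pfn < 4096; pfn++) {
                    struct candidate *c = malloc(sizeof(*c));

                    c->pfn = pfn;
                    pthread_mutex_lock(&list_lock);  /* single lock: contention point */
                    c->next = migration_list;
                    migration_list = c;
                    pthread_mutex_unlock(&list_lock);
            }
            pthread_mutex_lock(&list_lock);
            scanning_done = 1;
            pthread_mutex_unlock(&list_lock);
            return NULL;
    }

    /* Stand-in for the migration daemon: consume and "promote" candidates. */
    static void *migrate_worker(void *arg)
    {
            for (;;) {
                    pthread_mutex_lock(&list_lock);
                    struct candidate *batch = migration_list;
                    int done = scanning_done;

                    migration_list = NULL;           /* grab the whole batch */
                    pthread_mutex_unlock(&list_lock);

                    for (struct candidate *c = batch; c; ) {
                            struct candidate *next = c->next;
                            /* kernel analogue: migrate the folio to the toptier node */
                            free(c);
                            c = next;
                    }
                    if (done && !batch)
                            break;
            }
            return NULL;
    }

    int main(void)
    {
            pthread_t scanner, migrator;

            pthread_create(&scanner, NULL, scan_worker, NULL);
            pthread_create(&migrator, NULL, migrate_worker, NULL);
            pthread_join(scanner, NULL);
            pthread_join(migrator, NULL);
            return 0;
    }

Unlike this sketch, the actual series also needs duplicate filtering and migration throttling, which is where much of the discussion below goes.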
On 3/20/2025 4:30 AM, Davidlohr Bueso wrote: > On Wed, 19 Mar 2025, Raghavendra K T wrote: > >> Introduction: >> ============= >> In the current hot page promotion, all the activities including the >> process address space scanning, NUMA hint fault handling and page >> migration is performed in the process context. i.e., scanning overhead is >> borne by applications. >> >> This is RFC V1 patch series to do (slow tier) CXL page promotion. >> The approach in this patchset assists/addresses the issue by adding PTE >> Accessed bit scanning. >> >> Scanning is done by a global kernel thread which routinely scans all >> the processes' address spaces and checks for accesses by reading the >> PTE A bit. >> >> A separate migration thread migrates/promotes the pages to the toptier >> node based on a simple heuristic that uses toptier scan/access >> information >> of the mm. >> >> Additionally based on the feedback for RFC V0 [4], a prctl knob with >> a scalar value is provided to control per task scanning. >> >> Initial results show promising number on a microbenchmark. Soon >> will get numbers with real benchmarks and findings (tunings). >> >> Experiment: >> ============ >> Abench microbenchmark, >> - Allocates 8GB/16GB/32GB/64GB of memory on CXL node >> - 64 threads created, and each thread randomly accesses pages in 4K >> granularity. >> - 512 iterations with a delay of 1 us between two successive iterations. >> >> SUT: 512 CPU, 2 node 256GB, AMD EPYC. >> >> 3 runs, command: abench -m 2 -d 1 -i 512 -s <size> >> >> Calculates how much time is taken to complete the task, lower is better. >> Expectation is CXL node memory is expected to be migrated as fast as >> possible. >> >> Base case: 6.14-rc6 w/ numab mode = 2 (hot page promotion is enabled). >> patched case: 6.14-rc6 w/ numab mode = 1 (numa balancing is enabled). >> we expect daemon to do page promotion. >> >> Result: >> ======== >> base NUMAB2 patched NUMAB1 >> time in sec (%stdev) time in sec (%stdev) %gain >> 8GB 134.33 ( 0.19 ) 120.52 ( 0.21 ) 10.28 >> 16GB 292.24 ( 0.60 ) 275.97 ( 0.18 ) 5.56 >> 32GB 585.06 ( 0.24 ) 546.49 ( 0.35 ) 6.59 >> 64GB 1278.98 ( 0.27 ) 1205.20 ( 2.29 ) 5.76 >> >> Base case: 6.14-rc6 w/ numab mode = 1 (numa balancing is enabled). >> patched case: 6.14-rc6 w/ numab mode = 1 (numa balancing is enabled). >> base NUMAB1 patched NUMAB1 >> time in sec (%stdev) time in sec (%stdev) %gain >> 8GB 186.71 ( 0.99 ) 120.52 ( 0.21 ) 35.45 >> 16GB 376.09 ( 0.46 ) 275.97 ( 0.18 ) 26.62 >> 32GB 744.37 ( 0.71 ) 546.49 ( 0.35 ) 26.58 >> 64GB 1534.49 ( 0.09 ) 1205.20 ( 2.29 ) 21.45 > > Very promising, but a few things. A more fair comparison would be > vs kpromoted using the PROT_NONE of NUMAB2. Essentially disregarding > the asynchronous migration, and effectively measuring synchronous > vs asynchronous scanning overhead and implied semantics. Essentially > save the extra kthread and only have a per-NUMA node migrator, which > is the common denominator for all these sources of hotness. Yes, I agree that fair comparison would be 1) kmmscand generating data on pages to be promoted working with kpromoted asynchronously migrating VS 2) NUMAB2 generating data on pages to be migrated integrated with kpromoted. As Bharata already mentioned, we tried integrating kpromoted with kmmscand generated migration list, But kmmscand generates huge amount of scanned page data, and need to be organized better so that kpromted can handle the migration effectively. 
(2) We have not tried it yet; will get back on the possibility (and also
numbers when both are ready).

>
> Similarly, while I don't see any users disabling NUMAB1 _and_ enabling
> this sort of thing, it would be useful to have data on no numa balancing
> at all. If nothing else, that would measure the effects of the dest
> node heuristics.

Last time I checked, with the patch, the numbers with NUMAB=0 and NUMAB=1
did not differ much in the 8GB case, because most of the migration was
handled by kmmscand. Before NUMAB=1 learns and tries to migrate, kmmscand
would have already migrated.

But a longer running / more memory-intensive workload may make more of a
difference. I will come back with that number.

>
> Also, data/workload involving demotion would also be good to have for
> a more complete picture.
>

Agree. Additionally we need to handle various cases like:
- Should we choose the second best target node when the first node is full?

>> Major Changes since V0:
>> ======================
>> - A separate migration thread is used for migration, thus alleviating
>> need for multi-threaded scanning (atleast as per tracing).
>>
>> - A simple heuristic for target node calculation is added.
>>
>> - prctl (David R) interface with scalar value is added to control per
>> task scanning.
>>
>> - Steve's comment on tracing incorporated.
>>
>> - Davidlohr's reported bugfix.
>>
>> - Initial scan delay similar to NUMAB1 mode added.
>>
>> - Got rid of migration lock during mm_walk.
>>
>> PS: Occassionally I do see if scanning is too fast compared to migration,
>> scanning can stall waiting for lock. Should be fixed in next version by
>> using memslot for migration..
>>
>> Disclaimer, Takeaways and discussion points and future TODOs
>> ==============================================================
>> 1) Source code, patch seggregation still to be improved, current
>> patchset only provides a skeleton.
>>
>> 2) Unification of source of hotness is not easy (as mentioned perhaps
>> by Jonathan) but perhaps all the consumers/producers can work coopertaively.
>>
>> Scanning:
>> 3) Major positive: Current patchset is able to cover all the process
>> address space scanning effectively with simple algorithms to tune scan_size
>> and scan_period.
>>
>> 4) Effective tracking of folio's or address space using / or ideas
>> used in DAMON is yet to be explored fully.
>>
>> 5) Use timestamp information-based migration (Similar to numab mode=2).
>> instead of migrating immediately when PTE A bit set.
>> (cons:
>> - It will not be accurate since it is done outside of process context.
>> - Performance benefit may be lost.)
>>
>> Migration:
>>
>> 6) Currently fast scanner can bombard migration list, need to maintain
>> migration list in a more organized way (for e.g. using memslot, so that
>> it is also helpful in maintaining recency, frequency information
>> (similar to kpromoted posted by Bharata)
>>
>> 7) NUMAB2 throttling is very effective, we would need a common
>> interface to control migration and also exploit batch migration.
>
> Does NUMAB2 continue to exist? Are there any benefits in having two
> sources?
>

I think there is surely a benefit in having two sources.

NUMAB2 is more accurate but slow to learn.
IBS: no scan overhead, but we need more sample data.
PTE A bit: more scanning overhead (though not significant enough to impact
performance when compared with NUMAB1/NUMAB2; rather, it performed better
because of proactive migration), but less accurate data on hotness and
target_node(?).
When the system is more stable, IBS was more effective. The PTE A bit and
NUMAB were more effective when we needed more aggressive migration (in
that order).

- Raghu
On 3/20/2025 2:21 PM, Raghavendra K T wrote: > On 3/20/2025 4:30 AM, Davidlohr Bueso wrote: >> On Wed, 19 Mar 2025, Raghavendra K T wrote: >> >>> Introduction: >>> ============= >>> In the current hot page promotion, all the activities including the >>> process address space scanning, NUMA hint fault handling and page >>> migration is performed in the process context. i.e., scanning >>> overhead is >>> borne by applications. >>> >>> This is RFC V1 patch series to do (slow tier) CXL page promotion. >>> The approach in this patchset assists/addresses the issue by adding PTE >>> Accessed bit scanning. >>> >>> Scanning is done by a global kernel thread which routinely scans all >>> the processes' address spaces and checks for accesses by reading the >>> PTE A bit. >>> >>> A separate migration thread migrates/promotes the pages to the toptier >>> node based on a simple heuristic that uses toptier scan/access >>> information >>> of the mm. >>> >>> Additionally based on the feedback for RFC V0 [4], a prctl knob with >>> a scalar value is provided to control per task scanning. >>> >>> Initial results show promising number on a microbenchmark. Soon >>> will get numbers with real benchmarks and findings (tunings). >>> >>> Experiment: >>> ============ >>> Abench microbenchmark, >>> - Allocates 8GB/16GB/32GB/64GB of memory on CXL node >>> - 64 threads created, and each thread randomly accesses pages in 4K >>> granularity. >>> - 512 iterations with a delay of 1 us between two successive iterations. >>> >>> SUT: 512 CPU, 2 node 256GB, AMD EPYC. >>> >>> 3 runs, command: abench -m 2 -d 1 -i 512 -s <size> >>> >>> Calculates how much time is taken to complete the task, lower is better. >>> Expectation is CXL node memory is expected to be migrated as fast as >>> possible. >>> >>> Base case: 6.14-rc6 w/ numab mode = 2 (hot page promotion is >>> enabled). >>> patched case: 6.14-rc6 w/ numab mode = 1 (numa balancing is enabled). >>> we expect daemon to do page promotion. >>> >>> Result: >>> ======== >>> base NUMAB2 patched NUMAB1 >>> time in sec (%stdev) time in sec (%stdev) %gain >>> 8GB 134.33 ( 0.19 ) 120.52 ( 0.21 ) 10.28 >>> 16GB 292.24 ( 0.60 ) 275.97 ( 0.18 ) 5.56 >>> 32GB 585.06 ( 0.24 ) 546.49 ( 0.35 ) 6.59 >>> 64GB 1278.98 ( 0.27 ) 1205.20 ( 2.29 ) 5.76 >>> >>> Base case: 6.14-rc6 w/ numab mode = 1 (numa balancing is enabled). >>> patched case: 6.14-rc6 w/ numab mode = 1 (numa balancing is enabled). >>> base NUMAB1 patched NUMAB1 >>> time in sec (%stdev) time in sec (%stdev) %gain >>> 8GB 186.71 ( 0.99 ) 120.52 ( 0.21 ) 35.45 >>> 16GB 376.09 ( 0.46 ) 275.97 ( 0.18 ) 26.62 >>> 32GB 744.37 ( 0.71 ) 546.49 ( 0.35 ) 26.58 >>> 64GB 1534.49 ( 0.09 ) 1205.20 ( 2.29 ) 21.45 >> >> Very promising, but a few things. A more fair comparison would be >> vs kpromoted using the PROT_NONE of NUMAB2. Essentially disregarding >> the asynchronous migration, and effectively measuring synchronous >> vs asynchronous scanning overhead and implied semantics. Essentially >> save the extra kthread and only have a per-NUMA node migrator, which >> is the common denominator for all these sources of hotness. > > > Yes, I agree that fair comparison would be > 1) kmmscand generating data on pages to be promoted working with > kpromoted asynchronously migrating > VS > 2) NUMAB2 generating data on pages to be migrated integrated with > kpromoted. 
>
> As Bharata already mentioned, we tried integrating kpromoted with
> kmmscand generated migration list, But kmmscand generates huge amount of
> scanned page data, and need to be organized better so that kpromted can
> handle the migration effectively.
>
> (2) We have not tried it yet, will get back on the possibility (and also
> numbers when both are ready).
>
>> Similarly, while I don't see any users disabling NUMAB1 _and_ enabling
>> this sort of thing, it would be useful to have data on no numa balancing
>> at all. If nothing else, that would measure the effects of the dest
>> node heuristics.
>
> Last time when I checked, with patch, numbers with NUMAB=0 and NUMAB=1
> was not making much difference in 8GB case because most of the migration
> was handled by kmmscand. It is because before NUMAB=1 learns and tries
> to migrate, kmmscand would have already migrated.
>
> But a longer running/ more memory workload may make more difference.
> I will comeback with that number.

         base NUMAB=2            Patched NUMAB=0
         time in sec             time in sec
===================================================
 8G:      134.33 (0.19)           119.88 ( 0.25)
16G:      292.24 (0.60)           325.06 (11.11)
32G:      585.06 (0.24)           546.15 ( 0.50)
64G:     1278.98 (0.27)          1221.41 ( 1.54)

We can see that the numbers have not changed much between NUMAB=1 and
NUMAB=0 in the patched case.

PS: for 16G there was a bad case where a rare contention happened on the
lock for the same mm, as we can see from the stdev; this should be taken
care of in the next version.

[...]
On Thu, 20 Mar 2025, Raghavendra K T wrote: >>Does NUMAB2 continue to exist? Are there any benefits in having two >>sources? >> > >I think there is surely a benefit in having two sources. I think I was a bit vague. What I'm really asking is if the scanning is done async (kmmscand), should NUMAB2 also exist as a source and also feed into the migrator? Looking at it differently, I guess doing so would allow additional flexibility in choosing what to use. >NUMAB2 is more accurate but slow learning. Yes. Which is also why it is important to have demotion in the picture to measure the ping pong effect. LRU based heuristics work best here. >IBS: No scan overhead but we need more sampledata. >PTE A bit: more scanning overhead (but was not much significant to >impact performance when compared with NUMAB1/NUMAB2, rather it was more >performing because of proactive migration) but has less accurate data on >hotness, target_node(?). > >When system is more stable, IBS was more effective. IBS will never be as effective as it should be simply because of the lack of time decay/frequency (hence all that related phi hackery in the kpromoted series). It has a global view of memory, it should beat any sw scanning heuristics by far but the numbers have lacked. As you know, PeterZ, Dave Hansen, Ying and I have expressed concerns about this in the past. But that is not to say it does not serve as a source, as you point out. Thanks, Davidlohr
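For reference, below is a generic sketch of the kind of time-decay/frequency tracking being referred to: an access counter that loses half its weight per elapsed half-life. This is illustrative only (the structure, names and half-life constant are made up), not kpromoted's or the kernel's actual logic.

    #include <stdint.h>

    /* Hypothetical per-page record: an access count decayed by elapsed time. */
    struct hot_rec {
            uint64_t last_ms;   /* when the score was last updated */
            uint32_t score;     /* decayed access frequency */
    };

    #define DECAY_HALF_LIFE_MS 1000 /* assumed half-life of one second */

    static void record_access(struct hot_rec *r, uint64_t now_ms)
    {
            uint64_t halvings = (now_ms - r->last_ms) / DECAY_HALF_LIFE_MS;

            /* halve the old score once per elapsed half-life, then count this access */
            r->score = halvings >= 32 ? 0 : r->score >> halvings;
            r->score++;
            r->last_ms = now_ms;
    }

    /* A page is considered hot once its decayed score crosses a threshold. */
    static int is_hot(const struct hot_rec *r, uint32_t threshold)
    {
            return r->score >= threshold;
    }

With such decay, a page that was sampled heavily long ago no longer outranks a page with fewer but recent accesses, which is the property noted above as missing from raw IBS sample counts.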
+Yu Zhao (realized we had not CCed him earlier)

On 3/21/2025 3:20 AM, Davidlohr Bueso wrote:
> On Thu, 20 Mar 2025, Raghavendra K T wrote:
>
>>> Does NUMAB2 continue to exist? Are there any benefits in having two
>>> sources?
>>>
>>
>> I think there is surely a benefit in having two sources.
>
> I think I was a bit vague. What I'm really asking is if the scanning is
> done async (kmmscand), should NUMAB2 also exist as a source and also feed
> into the migrator? Looking at it differently, I guess doing so would allow
> additional flexibility in choosing what to use.
>

Not exactly. Since NUMAB2 brings accurate timestamp information and
additional migration throttling logic on top of NUMAB1, we could keep just
NUMAB1, borrow the migration throttling from NUMAB2, and make sure that
migration is asynchronous. This is with the assumption that kmmscand will
be able to detect the exact target node in most cases, and the additional
flexibility of toptier balancing comes from NUMAB1.

>> NUMAB2 is more accurate but slow learning.
>
> Yes. Which is also why it is important to have demotion in the picture to
> measure the ping pong effect. LRU based heuristics work best here.
>

+1

>> IBS: No scan overhead but we need more sampledata.
>
>> PTE A bit: more scanning overhead (but was not much significant to
>> impact performance when compared with NUMAB1/NUMAB2, rather it was more
>> performing because of proactive migration) but has less accurate data on
>> hotness, target_node(?).
>>
>> When system is more stable, IBS was more effective.
>
> IBS will never be as effective as it should be simply because of the lack
> of time decay/frequency (hence all that related phi hackery in the
> kpromoted series). It has a global view of memory, it should beat any sw
> scanning heuristics by far but the numbers have lacked.
>
> As you know, PeterZ, Dave Hansen, Ying and I have expressed concerns about
> this in the past. But that is not to say it does not serve as a source,
> as you point out.
>
> Thanks,
> Davidlohr
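As an illustration of the migration throttling being discussed (a per-window byte budget, conceptually similar to an MB/s rate limit), here is a small sketch. It is not the kernel's NUMAB2 throttling code; the names and window size are assumptions.

    #include <stdint.h>

    /* Hypothetical promotion throttle: a byte budget per fixed time window. */
    struct promo_throttle {
            uint64_t window_start_ms;
            uint64_t promoted_bytes;    /* bytes promoted in the current window */
            uint64_t budget_bytes;      /* e.g. a configured MB/s limit * window */
    };

    #define WINDOW_MS 1000

    /* Returns 1 if @bytes may be promoted now, 0 if the caller should defer. */
    static int promotion_allowed(struct promo_throttle *t, uint64_t now_ms,
                                 uint64_t bytes)
    {
            if (now_ms - t->window_start_ms >= WINDOW_MS) {
                    t->window_start_ms = now_ms;   /* open a new accounting window */
                    t->promoted_bytes = 0;
            }
            if (t->promoted_bytes + bytes > t->budget_bytes)
                    return 0;                      /* budget exhausted: throttle */
            t->promoted_bytes += bytes;
            return 1;
    }

A shared helper along these lines could sit in front of the migrator regardless of whether candidates come from kmmscand, NUMAB hint faults or IBS, which is the "common interface to control migration" asked for in the cover letter.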
On Wed, 19 Mar 2025 19:30:15 +0000 Raghavendra K T <raghavendra.kt@amd.com> wrote: > Introduction: > ============= > In the current hot page promotion, all the activities including the > process address space scanning, NUMA hint fault handling and page > migration is performed in the process context. i.e., scanning overhead is > borne by applications. > > This is RFC V1 patch series to do (slow tier) CXL page promotion. > The approach in this patchset assists/addresses the issue by adding PTE > Accessed bit scanning. > > Scanning is done by a global kernel thread which routinely scans all > the processes' address spaces and checks for accesses by reading the > PTE A bit. > > A separate migration thread migrates/promotes the pages to the toptier > node based on a simple heuristic that uses toptier scan/access information > of the mm. > > Additionally based on the feedback for RFC V0 [4], a prctl knob with > a scalar value is provided to control per task scanning. > > Initial results show promising number on a microbenchmark. Soon > will get numbers with real benchmarks and findings (tunings). > > Experiment: > ============ > Abench microbenchmark, > - Allocates 8GB/16GB/32GB/64GB of memory on CXL node > - 64 threads created, and each thread randomly accesses pages in 4K > granularity. So if I'm reading this right, this is a flat distribution and any estimate of what is hot is noise? That will put a positive spin on costs of migration as we will be moving something that isn't really all that hot and so is moderately unlikely to be accessed whilst migration is going on. Or is the point that the rest of the memory is also mapped but not being accessed? I'm not entirely sure I follow what this is bound by. Is it bandwidth bound? > - 512 iterations with a delay of 1 us between two successive iterations. > > SUT: 512 CPU, 2 node 256GB, AMD EPYC. > > 3 runs, command: abench -m 2 -d 1 -i 512 -s <size> > > Calculates how much time is taken to complete the task, lower is better. > Expectation is CXL node memory is expected to be migrated as fast as > possible. > > Base case: 6.14-rc6 w/ numab mode = 2 (hot page promotion is enabled). > patched case: 6.14-rc6 w/ numab mode = 1 (numa balancing is enabled). > we expect daemon to do page promotion. > > Result: > ======== > base NUMAB2 patched NUMAB1 > time in sec (%stdev) time in sec (%stdev) %gain > 8GB 134.33 ( 0.19 ) 120.52 ( 0.21 ) 10.28 > 16GB 292.24 ( 0.60 ) 275.97 ( 0.18 ) 5.56 > 32GB 585.06 ( 0.24 ) 546.49 ( 0.35 ) 6.59 > 64GB 1278.98 ( 0.27 ) 1205.20 ( 2.29 ) 5.76 > > Base case: 6.14-rc6 w/ numab mode = 1 (numa balancing is enabled). > patched case: 6.14-rc6 w/ numab mode = 1 (numa balancing is enabled). > base NUMAB1 patched NUMAB1 > time in sec (%stdev) time in sec (%stdev) %gain > 8GB 186.71 ( 0.99 ) 120.52 ( 0.21 ) 35.45 > 16GB 376.09 ( 0.46 ) 275.97 ( 0.18 ) 26.62 > 32GB 744.37 ( 0.71 ) 546.49 ( 0.35 ) 26.58 > 64GB 1534.49 ( 0.09 ) 1205.20 ( 2.29 ) 21.45 Nice numbers, but maybe some more details on what they are showing? At what point in the workload has all the memory migrated to the fast node or does that never happen? I'm confused :( Jonathan
On Fri, 21 Mar 2025, Raghavendra K T wrote: >>But a longer running/ more memory workload may make more difference. >>I will comeback with that number. > > base NUMAB=2 Patched NUMAB=0 > time in sec time in sec >=================================================== >8G: 134.33 (0.19) 119.88 ( 0.25) >16G: 292.24 (0.60) 325.06 (11.11) >32G: 585.06 (0.24) 546.15 ( 0.50) >64G: 1278.98 (0.27) 1221.41 ( 1.54) > >We can see that numbers have not changed much between NUMAB=1 NUMAB=0 in >patched case. Thanks. Since this might vary across workloads, another important metric here is numa hit/misses statistics. fyi I have also been trying this series to get some numbers as well, but noticed overnight things went south (so no chance before LSFMM): [ 464.026917] watchdog: BUG: soft lockup - CPU#108 stuck for 52s! [kmmscand:934] [ 464.026924] Modules linked in: ... [ 464.027098] CPU: 108 UID: 0 PID: 934 Comm: kmmscand Tainted: G L 6.14.0-rc6-kmmscand+ #4 [ 464.027105] Tainted: [L]=SOFTLOCKUP [ 464.027107] Hardware name: Supermicro SSG-121E-NE3X12R/X13DSF-A, BIOS 2.1 01/29/2024 [ 464.027109] RIP: 0010:pmd_off+ 0x58/0xd0 [ 464.027124] Code: 83 e9 01 48 21 f1 48 c1 e1 03 48 89 f8 0f 1f 00 48 23 05 fb c7 fd 00 48 03 0d 0c b9 fb 00 48 25 00 f0 ff ff 48 01 c8 48 8b 38 <48> 89 f8 0f 1f 00 48 8b 0d db c7 fd 00 48 21 c1 48 89 d0 48 c1 e8 [ 464.027128] RSP: 0018:ff71a0dc1b05bbc8 EFLAGS: 00000286 [ 464.027133] RAX: ff3b028e421c17f0 RBX: ffc90cb8322e5e00 RCX: ff3b020d400007f0 [ 464.027136] RDX: 00007f1393978000 RSI: 00000000000000fe RDI: 000000b9726b0067 [ 464.027139] RBP: ff3b02f5d05babc0 R08: 00007f9c5653f000 R09: ffc90cb8322e0001 [ 464.027141] R10: 0000000000000000 R11: ff3b028dd339420c R12: 00007f1393978000 [ 464.027144] R13: ff3b028dded9cbb0 R14: ffc90cb8322e0000 R15: ffffffffb9a0a4c0 [ 464.027146] FS: 0000000000000000(0000) GS:ff3b030bbf400000(0000) knlGS:0000000000000000 [ 464.027150] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 464.027153] CR2: 0000564713088f19 CR3: 000000fb40822006 CR4: 0000000000773ef0 [ 464.027157] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 464.027159] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 [ 464.027162] PKRU: 55555554 [ 464.027163] Call Trace: [ 464.027166] <IRQ> [ 464.027170] ? watchdog_timer_fn+0x21b/0x2a0 [ 464.027180] ? __pfx_watchdog_timer_fn+0x10/0x10 [ 464.027186] ? __hrtimer_run_queues+0x10f/0x2a0 [ 464.027193] ? hrtimer_interrupt+0xfb/0x240 [ 464.027199] ? __sysvec_apic_timer_interrupt+0x4e/0x110 [ 464.027208] ? sysvec_apic_timer_interrupt+0x68/0x90 [ 464.027219] </IRQ> [ 464.027221] <TASK> [ 464.027222] ? asm_sysvec_apic_timer_interrupt+0x16/0x20 [ 464.027236] ? pmd_off+0x58/0xd0 [ 464.027243] hot_vma_idle_pte_entry+0x151/0x500 [ 464.027253] walk_pte_range_inner+0xbe/0x100 [ 464.027260] ? __pte_offset_map_lock+0x9a/0x110 [ 464.027267] walk_pgd_range+0x8f0/0xbb0 [ 464.027271] ? __pfx_hot_vma_idle_pte_entry+0x10/0x10 [ 464.027282] __walk_page_range+0x71/0x1d0 [ 464.027287] ? prepare_to_wait_event+0x53/0x180 [ 464.027294] walk_page_vma+0x98/0xf0 [ 464.027300] kmmscand+0x2aa/0x8d0 [ 464.027310] ? __pfx_kmmscand+0x10/0x10 [ 464.027318] kthread+0xea/0x230 [ 464.027326] ? finish_task_switch.isra.0+0x88/0x2d0 [ 464.027335] ? __pfx_kthread+0x10/0x10 [ 464.027341] ret_from_fork+0x2d/0x50 [ 464.027350] ? __pfx_kthread+0x10/0x10 [ 464.027355] ret_from_fork_asm+0x1a/0x30 [ 464.027365] </TASK>
On 3/21/2025 4:23 PM, Hillf Danton wrote:
> On Wed, 19 Mar 2025 19:30:24 +0000 Raghavendra K T wrote
>> One of the key challenges in PTE A bit based scanning is to find right
>> target node to promote to.
>>
>> Here is a simple heuristic based approach:
>> While scanning pages of any mm we also scan toptier pages that belong
>> to that mm. We get an insight on the distribution of pages that potentially
>> belonging to particular toptier node and also its recent access.
>>
>> Current logic walks all the toptier node, and picks the one with highest
>> accesses.
>>
> My $.02 for selecting promotion target node given a simple multi tier system.
>
> Tk /* top Tierk (k > 0) has K (K > 0) nodes */
> ...
> Tj /* Tierj (j > 0) has J (J > 0) nodes */
> ...
> T0 /* bottom Tier0 has O (O > 0) nodes */
>
> Unless config comes from user space (sysfs window for example should be opened),
>
> 1, adopt the data flow pattern of L3 cache <--> DRAM <--> SSD, to only
> select Tj+1 when promoting pages in Tj.
>

Hello Hillf,
Thanks for giving this a thought. This looks to be a good idea in general.
It should mostly be implementable as the reverse of the preferred demotion
target?

Thinking out loud: can there be exception cases, similar to non-temporal
copy operations where we don't want to pollute the cache? I mean cases
where we don't want to hop via a middle tier node..?

> 2, select the node in Tj+1 that has the most free pages for promotion
> by default.

Not sure this is always productive. For e.g.:
node 0-1: toptier (100GB)
node 2:   slowtier

Suppose a workload (that occupies 80GB in total) is running on the CPUs of
node 1, where 40GB is already on node 1 and the remaining 40GB is on node 2.

Now, is it preferable to consolidate the workload on node 1 when the
slowtier data becomes hot? (This assumes that node 1's channel has enough
bandwidth to cater to the requirements of the workload.)

> 3, nothing more.
On 3/24/2025 4:35 PM, Hillf Danton wrote: > On Sun, 23 Mar 2025 23:44:02 +0530 Raghavendra K T wrote >> On 3/21/2025 4:23 PM, Hillf Danton wrote: >>> On Wed, 19 Mar 2025 19:30:24 +0000 Raghavendra K T wrote >>>> One of the key challenges in PTE A bit based scanning is to find right >>>> target node to promote to. >>>> >>>> Here is a simple heuristic based approach: >>>> While scanning pages of any mm we also scan toptier pages that belong >>>> to that mm. We get an insight on the distribution of pages that potentially >>>> belonging to particular toptier node and also its recent access. >>>> >>>> Current logic walks all the toptier node, and picks the one with highest >>>> accesses. >>>> >>> My $.02 for selecting promotion target node given a simple multi tier system. >>> >>> Tk /* top Tierk (k > 0) has K (K > 0) nodes */ >>> ... >>> Tj /* Tierj (j > 0) has J (J > 0) nodes */ >>> ... >>> T0 /* bottom Tier0 has O (O > 0) nodes */ >>> >>> Unless config comes from user space (sysfs window for example should be opened), >>> >>> 1, adopt the data flow pattern of L3 cache <--> DRAM <--> SSD, to only >>> select Tj+1 when promoting pages in Tj. >>> >> >> Hello Hillf , >> Thanks for giving a thought on this. This looks to be good idea in >> general. Mostly be able to implement with reverse of preferred demotion >> target? >> >> Thinking loud, Can there be exception cases similar to non-temporal copy >> operations, where we don't want to pollute cache? >> I mean cases we don't want to hop via middle tier node..? >> > Given page cache, direct IO and coherent DMA have their roles to play. > Agree. >>> 2, select the node in Tj+1 that has the most free pages for promotion >>> by default. >> >> Not sure if this is productive always. >> > Trying to cure all pains with ONE pill wastes minutes I think. > Very much true. > To achive reliable high order pages, page allocator can not work well in > combination with kswapd and kcompactd without clear boundaries drawn in > between the tree parties for example. > >> for e.g. >> node 0-1 toptier (100GB) >> node2 slowtier >> >> suppose a workload (that occupies 80GB in total) running on CPU of node1 >> where 40GB is already in node1 rest of 40GB is in node2. >> >> Now it is preferred to consolidate workload on node1 when slowtier >> data becomes hot? >> > Yes and no (say, a couple seconds later mm pressure rises in node0). > > In case of yes, I would like to turn on autonuma in the toptier instead > without bothering to select the target node. You see a line is drawn > between autonma and slowtier promotion now. Yes, the goal has been slow tier promotion without much overhead to the system + co-cooperatively work with NUMAB1 for top-tier balancing. (for e.g., providing hints of hot VMAs).
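To make the heuristic under discussion concrete, here is a self-contained sketch of "walk the toptier nodes, pick the one with the highest access count, fall back to the second best if the first has no free capacity". The struct and helper are hypothetical, not the patchset's code.

    #include <stdio.h>

    #define MAX_NODES 8

    /* Hypothetical per-node view built up while scanning one mm. */
    struct node_info {
            int is_toptier;
            int has_free_capacity;
            unsigned long accesses; /* toptier accesses observed for this mm */
    };

    /* Return the preferred promotion target node id, or -1 if none. */
    static int pick_promotion_target(const struct node_info *nodes, int nr)
    {
            int best = -1, second = -1;

            for (int nid = 0; nid < nr; nid++) {
                    if (!nodes[nid].is_toptier)
                            continue;
                    if (best < 0 || nodes[nid].accesses > nodes[best].accesses) {
                            second = best;
                            best = nid;
                    } else if (second < 0 ||
                               nodes[nid].accesses > nodes[second].accesses) {
                            second = nid;
                    }
            }
            /* "Second best target node when the first node is full" */
            if (best >= 0 && !nodes[best].has_free_capacity)
                    return second;
            return best;
    }

    int main(void)
    {
            struct node_info nodes[MAX_NODES] = {
                    [0] = { .is_toptier = 1, .has_free_capacity = 0, .accesses = 900 },
                    [1] = { .is_toptier = 1, .has_free_capacity = 1, .accesses = 400 },
                    [2] = { .is_toptier = 0 }, /* slowtier source node */
            };

            printf("promotion target: node %d\n", pick_promotion_target(nodes, 3));
            return 0;
    }

Hillf's tier constraint would slot in as an extra filter here: only consider nodes in the tier immediately above the page's current tier.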
+kinseyho and yuanchu

On 3/22/2025 2:05 AM, Davidlohr Bueso wrote:
> On Fri, 21 Mar 2025, Raghavendra K T wrote:
>
>>> But a longer running/ more memory workload may make more difference.
>>> I will comeback with that number.
>>
>>          base NUMAB=2            Patched NUMAB=0
>>          time in sec             time in sec
>> ===================================================
>> 8G:       134.33 (0.19)           119.88 ( 0.25)
>> 16G:      292.24 (0.60)           325.06 (11.11)
>> 32G:      585.06 (0.24)           546.15 ( 0.50)
>> 64G:     1278.98 (0.27)          1221.41 ( 1.54)
>>
>> We can see that numbers have not changed much between NUMAB=1 NUMAB=0 in
>> patched case.
>
> Thanks. Since this might vary across workloads, another important metric
> here is numa hit/misses statistics.

Hello David, sorry for coming back late. Yes, I did collect some of the
other stats along with this (posting for 8GB only). I did not see much
difference in total numa_hit, but there are differences in numa_local
etc. (not pasted here).

#grep -A2 completed abench_cxl_6.14.0-rc6-kmmscand+_8G.log abench_cxl_6.14.0-rc6-cxlfix+_numab2_8G.log
abench_cxl_6.14.0-rc6-kmmscand+_8G.log:Benchmark completed in 120292376.0 us, Total thread execution time 7490922681.0 us
abench_cxl_6.14.0-rc6-kmmscand+_8G.log-numa_hit 6376927
abench_cxl_6.14.0-rc6-kmmscand+_8G.log-numa_miss 0
--
abench_cxl_6.14.0-rc6-kmmscand+_8G.log:Benchmark completed in 119583939.0 us, Total thread execution time 7461705291.0 us
abench_cxl_6.14.0-rc6-kmmscand+_8G.log-numa_hit 6373409
abench_cxl_6.14.0-rc6-kmmscand+_8G.log-numa_miss 0
--
abench_cxl_6.14.0-rc6-kmmscand+_8G.log:Benchmark completed in 119784117.0 us, Total thread execution time 7482710944.0 us
abench_cxl_6.14.0-rc6-kmmscand+_8G.log-numa_hit 6378384
abench_cxl_6.14.0-rc6-kmmscand+_8G.log-numa_miss 0
--
abench_cxl_6.14.0-rc6-cxlfix+_numab2_8G.log:Benchmark completed in 134481344.0 us, Total thread execution time 8409840511.0 us
abench_cxl_6.14.0-rc6-cxlfix+_numab2_8G.log-numa_hit 6303300
abench_cxl_6.14.0-rc6-cxlfix+_numab2_8G.log-numa_miss 0
--
abench_cxl_6.14.0-rc6-cxlfix+_numab2_8G.log:Benchmark completed in 133967260.0 us, Total thread execution time 8352886349.0 us
abench_cxl_6.14.0-rc6-cxlfix+_numab2_8G.log-numa_hit 6304063
abench_cxl_6.14.0-rc6-cxlfix+_numab2_8G.log-numa_miss 0
--
abench_cxl_6.14.0-rc6-cxlfix+_numab2_8G.log:Benchmark completed in 134554911.0 us, Total thread execution time 8444951713.0 us
abench_cxl_6.14.0-rc6-cxlfix+_numab2_8G.log-numa_hit 6302506
abench_cxl_6.14.0-rc6-cxlfix+_numab2_8G.log-numa_miss 0

>
> fyi I have also been trying this series to get some numbers as well, but
> noticed overnight things went south (so no chance before LSFMM):
>

This issue looks to be different. Could you please let me know any way to
reproduce it? I had tested perf bench numa mem and did not find anything.

The issue I know of currently is:

kmmscand:
  for_each_mm
    for_each_vma
      scan_vma and get accessed_folio_list
      add to migration_list()   // does not check for duplicates

kmmmigrated:
  for_each_folio in migration_list
    migrate_misplaced_folio()

There is also cleanup_migration_list() in mm teardown.

The migration_list is protected by a single lock, and kmmscand is too
aggressive and can potentially bombard the migration_list (a practical
workload may generate fewer pages though). That results in a non-fatal
softlockup, which will be fixed with mmslot as I noted elsewhere.
But now the main challenge to solve in kmmscand is that it generates:

t1 -> migration_list1 (of recently accessed folios)
t2 -> migration_list2

How do I get the union of migration_list1 and migration_list2, so that
instead of migrating on first access, we can pick a hotter page to promote?

I had a few solutions in mind (that I wanted to get opinions/suggestions
on from experts during LSFMM):

1. Reusing DAMON VA scanning, with scanning params controlled in KMMSCAND
   (current heuristics).
2. Can we use LRU information to filter the access list (LRU active /
   folio is in the (n-1) generation)? (I do see Kinseyho just posted an
   LRU based approach.)
3. Can we split the address range into 2MB units to monitor, i.e. PMD
   level access monitoring?
4. Any possible ways of using bloom filters for list1, list2?

- Raghu

[snip...]
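One possible shape for such a filter (a hypothetical sketch, not something from the series): remember the pfns seen in the previous scan round in a small hash table and only hand a page to the migrator when it shows up in two consecutive rounds.

    #include <stdint.h>

    #define TBL_SIZE 4096   /* power of two; sized arbitrarily for the sketch */

    /* One slot per remembered pfn; pfn 0 is treated as an empty slot. */
    struct seen_entry {
            uint64_t pfn;
            uint32_t round; /* scan round that last saw this pfn */
    };

    static struct seen_entry seen_tbl[TBL_SIZE];

    /*
     * Record that @pfn was seen with the A bit set in scan round @round.
     * Returns 1 if it was also seen in the previous round, i.e. it is a
     * hotter candidate worth promoting; 0 otherwise.
     */
    static int seen_in_two_rounds(uint64_t pfn, uint32_t round)
    {
            uint32_t idx = (uint32_t)((pfn * 0x9E3779B97F4A7C15ULL) >> 52);
            int hot = 0;

            for (uint32_t i = 0; i < TBL_SIZE; i++) {
                    struct seen_entry *e = &seen_tbl[(idx + i) & (TBL_SIZE - 1)];

                    if (e->pfn == pfn || e->pfn == 0) {
                            hot = (e->pfn == pfn && e->round == round - 1);
                            e->pfn = pfn;
                            e->round = round;
                            break;
                    }
            }
            /* If the table is full, the pfn is simply dropped for this round. */
            return hot;
    }

This is in the spirit of option 4, using an exact table rather than a bloom filter; options 2 and 3 instead reuse state that the LRU or PMD-level monitoring already maintains.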