[RFC,v3,0/7] DAMON based tiered memory management for CXL memory

Message ID 20240405060858.2818-1-honggyu.kim@sk.com (mailing list archive)

Message

Honggyu Kim April 5, 2024, 6:08 a.m. UTC
There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
posted at [1].

It noted that no implementation of the demote/promote DAMOS actions
had been made yet.  This RFC provides that implementation for the
physical address space.


Changes from RFC v2:
  1. Rename DAMOS_{PROMOTE,DEMOTE} actions to DAMOS_MIGRATE_{HOT,COLD}.
  2. Create 'target_nid' to set the migration target node instead of
     depending on node distance based information.
  3. Instead of having page level access check in this patch series,
     delegate the job to a new DAMOS filter type YOUNG[2].
  4. Introduce vmstat counters "damon_migrate_{hot,cold}".
  5. Rebase from v6.7 to v6.8.

Changes from RFC:
  1. Move most of the implementation from mm/vmscan.c to mm/damon/paddr.c.
  2. Simplify some functions of vmscan.c for use in paddr.c; these still
     need more in-depth review.
  3. Refactor most functions for common usage for both promote and
     demote actions and introduce an enum migration_mode for its control.
  4. Add "target_nid" sysfs knob for migration destination node for both
     promote and demote actions.
  5. Move DAMOS_PROMOTE before DAMOS_DEMOTE, and move both above
     DAMOS_STAT.


Introduction
============

With the advent of CXL/PCIe-attached DRAM, referred to simply as CXL
memory in this cover letter, systems are becoming more heterogeneous,
with memory components that have different latency and bandwidth
characteristics.  These components are usually handled as separate NUMA
nodes in different memory tiers, and CXL memory is used as a slow tier
because of its protocol overhead compared to local DRAM.

On such systems, memory pages should be placed carefully on the proper
NUMA nodes based on their access frequency.  Otherwise, frequently
accessed pages may end up on slow tiers and cause unexpected
performance degradation.  Moreover, memory access patterns can change
at runtime.

To handle this problem, we need a way to monitor memory access
patterns and migrate pages based on their access temperature.  The
DAMON (Data Access MONitor) framework and its DAMOS (DAMON-based
Operation Schemes) are useful for monitoring and migrating pages.
DAMOS provides multiple actions driven by DAMON monitoring results;
for example, it can be used for proactive reclaim, swapping cold pages
out with the DAMOS_PAGEOUT action.  However, it does not support
migration actions such as demotion and promotion between tiered memory
nodes.

This series adds two new DAMOS actions: DAMOS_MIGRATE_HOT for
promotion from slow tiers and DAMOS_MIGRATE_COLD for demotion from
fast tiers.  These prevent hot pages from being stuck on slow tiers,
which degrades performance, and let cold pages be proactively demoted
to slow tiers so that the system has a better chance of allocating hot
pages on fast tiers.
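
For reference, the following sketch shows roughly how the new actions
extend enum damos_action, with the migrate actions placed above
DAMOS_STAT as noted in the changelog.  This is an illustrative sketch
based on this series, not the verbatim kernel diff:

  /* sketch: the two new actions migrate pages of a DAMOS target region
   * to the node given by the new per-scheme target_nid knob */
  enum damos_action {
          DAMOS_WILLNEED,
          DAMOS_COLD,
          DAMOS_PAGEOUT,
          DAMOS_HUGEPAGE,
          DAMOS_NOHUGEPAGE,
          DAMOS_LRU_PRIO,
          DAMOS_LRU_DEPRIO,
          DAMOS_MIGRATE_HOT,      /* promote hot pages to target_nid */
          DAMOS_MIGRATE_COLD,     /* demote cold pages to target_nid */
          DAMOS_STAT,
  };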

DAMON provides various tuning knobs, and we found that proactive
demotion of cold pages is especially useful when the system is running
out of memory on its fast tier nodes.

Our evaluation shows that this reduces the performance slowdown
compared to the default memory policy from 17~18% to 4~5% when the
system runs under high memory pressure on its fast tier DRAM nodes.


DAMON configuration
===================

The specific DAMON configuration is out of the scope of this patch
series, but a rough idea of it is worth sharing to explain the
evaluation results.

DAMON provides many knobs for fine tuning, and our configuration file
is generated by HMSDK[3].  Its gen_config.py script generates a json
file with the full set of DAMON knobs, and when DAMON is enabled it
creates multiple kdamonds, one per NUMA node, so that hot/cold-based
migration can run across the memory tiers.
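
As a minimal illustration of what such a per-node configuration boils
down to, the following sketch wires up one kdamond with a migrate_cold
scheme through the DAMON sysfs interface.  The "migrate_cold" action
string and the per-scheme "target_nid" file are assumptions based on
this series; the other paths follow the existing
/sys/kernel/mm/damon/admin layout:

  #include <stdio.h>
  #include <stdlib.h>

  #define KD "/sys/kernel/mm/damon/admin/kdamonds"

  /* write a single value to a sysfs knob, failing loudly */
  static void put(const char *path, const char *val)
  {
          FILE *f = fopen(path, "w");

          if (!f || fputs(val, f) == EOF) {
                  perror(path);
                  exit(1);
          }
          fclose(f);
  }

  int main(void)
  {
          put(KD "/nr_kdamonds", "1");
          put(KD "/0/contexts/nr_contexts", "1");
          /* monitor the physical address space */
          put(KD "/0/contexts/0/operations", "paddr");
          /* one pidless target; for paddr, the monitoring target
           * regions should also be set under targets/0/regions
           * (elided here) */
          put(KD "/0/contexts/0/targets/nr_targets", "1");
          put(KD "/0/contexts/0/schemes/nr_schemes", "1");
          /* demote cold pages found on this node ... */
          put(KD "/0/contexts/0/schemes/0/action", "migrate_cold");
          /* ... to the slow tier CXL node, node2 in this evaluation */
          put(KD "/0/contexts/0/schemes/0/target_nid", "2");
          put(KD "/0/state", "on");
          return 0;
  }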


Evaluation Workload
===================

The performance evaluation is done with redis[4], a widely used
in-memory database, and the memory access patterns are generated via
YCSB[5].  We measured two different workloads, with zipfian and latest
distributions, whose configs are slightly modified to increase memory
usage and execution time for better evaluation.

The evaluation with these migrate_{hot,cold} actions targets
system-wide memory management rather than partitioning the hot/cold
pages of a single workload.  The default memory allocation policy
places pages on the fast tier DRAM node first, then falls back to the
slow tier CXL node when the DRAM node has insufficient free space.
Once the page allocation is done, those pages never move between NUMA
nodes.  That is not true when NUMA balancing is used, but that is
outside the scope of this DAMON based tiered memory management
support.

If the working set of redis fits fully into the DRAM node, redis
accesses only the fast DRAM.  Since DRAM-only performance is better
than partially accessing slow tier CXL memory, such an environment is
not useful for evaluating this patch series.

To distribute redis pages across the fast DRAM node and the slow CXL
node so that our migrate_{hot,cold} actions can be evaluated, we
pre-allocate some cold memory externally using mmap and memset before
launching redis-server.  We assume that datacenters hold a sufficient
amount of cold memory, as the TMO[6] and TPP[7] papers note.
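
The pre-allocation itself is trivial; the following is a minimal
sketch, assuming it runs on a CPU of the DRAM node so that the default
local allocation policy instantiates the pages there (the size is
swept from 440GB to 500GB in the evaluation):

  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
          size_t size = 440UL << 30;      /* e.g. 440GB of cold memory */
          char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          if (p == MAP_FAILED)
                  return 1;
          /* fault every page in once on the DRAM node */
          memset(p, 0, size);
          /* keep the pages resident but never touch them again */
          pause();
          return 0;
  }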

The evaluation sequence is as follows.

1. Turn on DAMON with the DAMOS_MIGRATE_COLD action for the DRAM node
   and the DAMOS_MIGRATE_HOT action for the CXL node.  This demotes
   cold pages on the DRAM node and promotes hot pages on the CXL node
   at regular intervals.
2. Allocate a huge block of cold memory by calling mmap and memset at
   the fast tier DRAM node, then make the process sleep so that the
   fast tier has insufficient space for redis-server.
3. Launch redis-server and load the prebaked snapshot image, dump.rdb.
   The redis-server consumes 52GB of anon pages and 33GB of file pages,
   but due to the cold memory allocated at step 2, the entire memory of
   redis-server cannot be allocated on the fast tier DRAM node, so the
   remainder is allocated on the slow tier CXL node.  The DRAM:CXL
   ratio depends on the size of the pre-allocated cold memory.
4. Run YCSB to generate a zipfian or latest distribution of memory
   accesses to redis-server, then measure the execution time once it
   completes.
5. Repeat step 4 over 50 times to measure the average execution time
   of each run.
6. Increase the cold memory size, then go back to step 2.

Each run at step 4 takes about a minute, so repeating it 50 times
takes about an hour for each cold memory size, which is swept from
440GB to 500GB in 10GB increments.  The entire evaluation therefore
takes more than 10 hours to cover both the zipfian and latest
workloads.  Repeating the same test set multiple times doesn't show
much variation, so I think the results are reliable enough.


Evaluation Results
==================

All the result values are normalized to the DRAM-only execution time
because the workload cannot be faster than DRAM-only unless it hits
the peak bandwidth, and our redis test doesn't reach the bandwidth
limit.  The DRAM-only execution time is therefore the ideal result,
unaffected by the performance gap between DRAM and CXL.  The NUMA node
environment is as follows.

  node0 - local DRAM, 512GB with a CPU socket (fast tier)
  node1 - disabled
  node2 - CXL DRAM, 96GB, no CPU attached (slow tier)

The following is the result of generating a zipfian distribution of
accesses to redis-server; the numbers are averaged over 50 executions.

  1. YCSB zipfian distribution read only workload
  memory pressure with cold memory on node0 with 512GB of local DRAM.
  =============+================================================+=========
               |       cold memory occupied by mmap and memset  |
               |   0G  440G  450G  460G  470G  480G  490G  500G |
  =============+================================================+=========
  Execution time normalized to DRAM-only values                 | GEOMEAN
  -------------+------------------------------------------------+---------
  DRAM-only    | 1.00     -     -     -     -     -     -     - | 1.00
  CXL-only     | 1.22     -     -     -     -     -     -     - | 1.22
  default      |    -  1.12  1.13  1.14  1.16  1.19  1.21  1.21 | 1.17 
  DAMON tiered |    -  1.04  1.03  1.04  1.06  1.05  1.05  1.05 | 1.05 
  =============+================================================+=========
  CXL usage of redis-server in GB                               | AVERAGE
  -------------+------------------------------------------------+---------
  DRAM-only    |  0.0     -     -     -     -     -     -     - |  0.0
  CXL-only     | 52.6     -     -     -     -     -     -     - | 52.6
  default      |    -  20.4  27.0  33.1  39.5  45.6  50.5  50.3 | 38.1
  DAMON tiered |    -   0.1   0.3   0.8   0.6   0.7   1.3   0.9 |  0.7
  =============+================================================+=========

Each test result is based on the following execution environment.

  DRAM-only   : redis-server uses only local DRAM memory.
  CXL-only    : redis-server uses only CXL memory.
  default     : default memory policy(MPOL_DEFAULT).
                numa balancing disabled.
  DAMON tiered: DAMON enabled with DAMOS_MIGRATE_COLD for DRAM nodes and
                DAMOS_MIGRATE_HOT for CXL nodes.

The above result shows that the "default" execution time goes up as
the size of cold memory increases from 440G to 500G: the more cold
memory is used, the more CXL memory the target redis workload ends up
using, which increases the execution time.

However, "DAMON tiered" result shows less slowdown because the
DAMOS_MIGRATE_COLD action at DRAM node proactively demotes pre-allocated
cold memory to CXL node and this free space at DRAM increases more
chance to allocate hot or warm pages of redis-server to fast DRAM node.
Moreover, DAMOS_MIGRATE_HOT action at CXL node also promotes hot pages
of redis-server to DRAM node actively.

As a result, more redis-server memory stays on the DRAM node than
under the "default" memory policy, which yields the performance
improvement.

The following result for the latest distribution workload shows
similar data.

  2. YCSB latest distribution read only workload
  memory pressure with cold memory on node0 with 512GB of local DRAM.
  =============+================================================+=========
               |       cold memory occupied by mmap and memset  |
               |   0G  440G  450G  460G  470G  480G  490G  500G |
  =============+================================================+=========
  Execution time normalized to DRAM-only values                 | GEOMEAN
  -------------+------------------------------------------------+---------
  DRAM-only    | 1.00     -     -     -     -     -     -     - | 1.00
  CXL-only     | 1.18     -     -     -     -     -     -     - | 1.18
  default      |    -  1.18  1.19  1.18  1.18  1.17  1.19  1.18 | 1.18 
  DAMON tiered |    -  1.04  1.04  1.04  1.05  1.04  1.05  1.05 | 1.04 
  =============+================================================+=========
  CXL usage of redis-server in GB                               | AVERAGE
  -------------+------------------------------------------------+---------
  DRAM-only    |  0.0     -     -     -     -     -     -     - |  0.0
  CXL-only     | 52.6     -     -     -     -     -     -     - | 52.6
  default      |    -  20.5  27.1  33.2  39.5  45.5  50.4  50.5 | 38.1
  DAMON tiered |    -   0.2   0.4   0.7   1.6   1.2   1.1   3.4 |  1.2
  =============+================================================+=========

In summary, both results show that "DAMON tiered" memory management
reduces the performance slowdown compared to the "default" memory
policy from 17~18% to 4~5% when the system runs under high memory
pressure on its fast tier DRAM nodes.

Having these DAMOS_MIGRATE_HOT and DAMOS_MIGRATE_COLD actions can make
tiered memory systems run more efficiently under high memory pressure.

Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Signed-off-by: Rakie Kim <rakie.kim@sk.com>

[1] https://lore.kernel.org/damon/20231112195602.61525-1-sj@kernel.org
[2] https://lore.kernel.org/damon/20240311204545.47097-1-sj@kernel.org
[3] https://github.com/skhynix/hmsdk
[4] https://github.com/redis/redis/tree/7.0.0
[5] https://github.com/brianfrankcooper/YCSB/tree/0.17.0
[6] https://dl.acm.org/doi/10.1145/3503222.3507731
[7] https://dl.acm.org/doi/10.1145/3582016.3582063


Honggyu Kim (5):
  mm/damon/paddr: refactor DAMOS_PAGEOUT with migration_mode
  mm: make alloc_demote_folio externally invokable for migration
  mm/migrate: add MR_DAMON to migrate_reason
  mm/damon/paddr: introduce DAMOS_MIGRATE_COLD action for demotion
  mm/damon: Add "damon_migrate_{hot,cold}" vmstat

Hyeongtak Ji (2):
  mm/damon/sysfs-schemes: add target_nid on sysfs-schemes
  mm/damon/paddr: introduce DAMOS_MIGRATE_HOT action for promotion

 include/linux/damon.h          |  15 ++-
 include/linux/migrate_mode.h   |   1 +
 include/linux/mmzone.h         |   4 +
 include/trace/events/migrate.h |   3 +-
 mm/damon/core.c                |   5 +-
 mm/damon/dbgfs.c               |   2 +-
 mm/damon/lru_sort.c            |   3 +-
 mm/damon/paddr.c               | 191 +++++++++++++++++++++++++++++++--
 mm/damon/reclaim.c             |   3 +-
 mm/damon/sysfs-schemes.c       |  39 ++++++-
 mm/internal.h                  |   1 +
 mm/vmscan.c                    |  10 +-
 mm/vmstat.c                    |   4 +
 13 files changed, 265 insertions(+), 16 deletions(-)


base-commit: e8f897f4afef0031fe618a8e94127a0934896aba

Comments

Gregory Price April 5, 2024, 4:56 p.m. UTC | #1
On Fri, Apr 05, 2024 at 03:08:49PM +0900, Honggyu Kim wrote:
> There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> posted at [1].
> 
>   1. YCSB zipfian distribution read only workload
>   memory pressure with cold memory on node0 with 512GB of local DRAM.
>   =============+================================================+=========
>                |       cold memory occupied by mmap and memset  |
>                |   0G  440G  450G  460G  470G  480G  490G  500G |
>   =============+================================================+=========
>   Execution time normalized to DRAM-only values                 | GEOMEAN
>   -------------+------------------------------------------------+---------
>   DRAM-only    | 1.00     -     -     -     -     -     -     - | 1.00
>   CXL-only     | 1.22     -     -     -     -     -     -     - | 1.22
>   default      |    -  1.12  1.13  1.14  1.16  1.19  1.21  1.21 | 1.17 
>   DAMON tiered |    -  1.04  1.03  1.04  1.06  1.05  1.05  1.05 | 1.05 
>   =============+================================================+=========
>   CXL usage of redis-server in GB                               | AVERAGE
>   -------------+------------------------------------------------+---------
>   DRAM-only    |  0.0     -     -     -     -     -     -     - |  0.0
>   CXL-only     | 52.6     -     -     -     -     -     -     - | 52.6
>   default      |    -  20.4  27.0  33.1  39.5  45.6  50.5  50.3 | 38.1
>   DAMON tiered |    -   0.1   0.3   0.8   0.6   0.7   1.3   0.9 |  0.7
>   =============+================================================+=========
> 
> Each test result is based on the following execution environment.
> 
>   DRAM-only   : redis-server uses only local DRAM memory.
>   CXL-only    : redis-server uses only CXL memory.
>   default     : default memory policy(MPOL_DEFAULT).
>                 numa balancing disabled.
>   DAMON tiered: DAMON enabled with DAMOS_MIGRATE_COLD for DRAM nodes and
>                 DAMOS_MIGRATE_HOT for CXL nodes.
> 
> The above result shows that the "default" execution time goes up as
> the size of cold memory increases from 440G to 500G: the more cold
> memory is used, the more CXL memory the target redis workload ends up
> using, which increases the execution time.
> 
> However, "DAMON tiered" result shows less slowdown because the
> DAMOS_MIGRATE_COLD action at DRAM node proactively demotes pre-allocated
> cold memory to CXL node and this free space at DRAM increases more
> chance to allocate hot or warm pages of redis-server to fast DRAM node.
> Moreover, DAMOS_MIGRATE_HOT action at CXL node also promotes hot pages
> of redis-server to DRAM node actively.
> 
> As a result, more redis-server memory stays on the DRAM node than
> under the "default" memory policy, which yields the performance
> improvement.
> 
> The following result for the latest distribution workload shows
> similar data.
> 
>   2. YCSB latest distribution read only workload
>   memory pressure with cold memory on node0 with 512GB of local DRAM.
>   =============+================================================+=========
>                |       cold memory occupied by mmap and memset  |
>                |   0G  440G  450G  460G  470G  480G  490G  500G |
>   =============+================================================+=========
>   Execution time normalized to DRAM-only values                 | GEOMEAN
>   -------------+------------------------------------------------+---------
>   DRAM-only    | 1.00     -     -     -     -     -     -     - | 1.00
>   CXL-only     | 1.18     -     -     -     -     -     -     - | 1.18
>   default      |    -  1.18  1.19  1.18  1.18  1.17  1.19  1.18 | 1.18 
>   DAMON tiered |    -  1.04  1.04  1.04  1.05  1.04  1.05  1.05 | 1.04 
>   =============+================================================+=========
>   CXL usage of redis-server in GB                               | AVERAGE
>   -------------+------------------------------------------------+---------
>   DRAM-only    |  0.0     -     -     -     -     -     -     - |  0.0
>   CXL-only     | 52.6     -     -     -     -     -     -     - | 52.6
>   default      |    -  20.5  27.1  33.2  39.5  45.5  50.4  50.5 | 38.1
>   DAMON tiered |    -   0.2   0.4   0.7   1.6   1.2   1.1   3.4 |  1.2
>   =============+================================================+=========
> 
> In summary, both results show that "DAMON tiered" memory management
> reduces the performance slowdown compared to the "default" memory
> policy from 17~18% to 4~5% when the system runs under high memory
> pressure on its fast tier DRAM nodes.
> 
> Having these DAMOS_MIGRATE_HOT and DAMOS_MIGRATE_COLD actions can make
> tiered memory systems run more efficiently under high memory pressure.
> 

Hi,

It's hard to determine from your results whether the performance
mitigation is being caused primarily by MIGRATE_COLD freeing up space
for new allocations, or from some combination of HOT/COLD actions
occurring during execution but after the database has already been
warmed up.

Do you have test results which enable only DAMOS_MIGRATE_COLD actions
but not DAMOS_MIGRATE_HOT actions? (and vice versa)

The question I have is exactly how often MIGRATE_HOT is actually being
utilized, and how much data is being moved.  Testing MIGRATE_COLD only
would at least give a rough approximation of that.


Additionally, do you have any data on workloads that exceed the capacity
of the DRAM tier?  Here you say you have 512GB of local DRAM, but only
test a workload that caps out at 500G.  Have you run a test of, say,
550GB to see the effect of DAMON HOT/COLD migration actions when DRAM
capacity is exceeded?

Can you also provide the DRAM-only results for each test?  Presumably,
as workload size increases from 440G to 500G, the system probably starts
using some amount of swap/zswap/whatever.  It would be good to know how
this system compares to swapping small amounts of overflow.

~Gregory
SeongJae Park April 5, 2024, 7:28 p.m. UTC | #2
Hello Honggyu,

On Fri,  5 Apr 2024 15:08:49 +0900 Honggyu Kim <honggyu.kim@sk.com> wrote:

> There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> posted at [1].
> 
> It noted that no implementation of the demote/promote DAMOS actions
> had been made yet.  This RFC provides that implementation for the
> physical address space.
> 
> 
> Changes from RFC v2:
>   1. Rename DAMOS_{PROMOTE,DEMOTE} actions to DAMOS_MIGRATE_{HOT,COLD}.
>   2. Create 'target_nid' to set the migration target node instead of
>      depending on node distance based information.
>   3. Instead of having page level access check in this patch series,
>      delegate the job to a new DAMOS filter type YOUNG[2].
>   4. Introduce vmstat counters "damon_migrate_{hot,cold}".
>   5. Rebase from v6.7 to v6.8.

Thank you for patiently keeping the discussion going and making this
great version!  I left comments on each patch, but found no special
concerns.  The per-page access recheck for MIGRATE_HOT and the vmstat
change caught my eye, though; I doubt whether those are really needed.
It would be nice if you could answer the comments.

Once my comments on this version are addressed, I would have no reason
to object to dropping the RFC tag from this patchset.

Nonetheless, I see some warnings and errors from checkpatch.pl.  I
don't really care about those for RFC patches, so no problem at all.
But if you agree with my opinion about dropping the RFC tag, and will
therefore send the next version without the RFC tag, please make sure
you also run checkpatch.pl before posting.


Thanks,
SJ

[...]
Honggyu Kim April 8, 2024, 10:56 a.m. UTC | #3
Hi SeongJae,

On Fri,  5 Apr 2024 12:28:00 -0700 SeongJae Park <sj@kernel.org> wrote:
> Hello Honggyu,
> 
> On Fri,  5 Apr 2024 15:08:49 +0900 Honggyu Kim <honggyu.kim@sk.com> wrote:
> 
> > There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> > posted at [1].
> > 
> > It noted that no implementation of the demote/promote DAMOS actions
> > had been made yet.  This RFC provides that implementation for the
> > physical address space.
> > 
> > 
> > Changes from RFC v2:
> >   1. Rename DAMOS_{PROMOTE,DEMOTE} actions to DAMOS_MIGRATE_{HOT,COLD}.
> >   2. Create 'target_nid' to set the migration target node instead of
> >      depending on node distance based information.
> >   3. Instead of having page level access check in this patch series,
> >      delegate the job to a new DAMOS filter type YOUNG[2].
> >   4. Introduce vmstat counters "damon_migrate_{hot,cold}".
> >   5. Rebase from v6.7 to v6.8.
> 
> Thank you for patiently keeping the discussion going and making this
> great version!  I left comments on each patch, but found no special
> concerns.  The per-page access recheck for MIGRATE_HOT and the vmstat
> change caught my eye, though; I doubt whether those are really needed.
> It would be nice if you could answer the comments.

I will answer them where you made the comments.

> Once my comments on this version are addressed, I would have no
> reason to object to dropping the RFC tag from this patchset.

Thanks.  I will drop the RFC after addressing your comments.

> Nonetheless, I see some warnings and errors from checkpatch.pl.  I
> don't really care about those for RFC patches, so no problem at all.
> But if you agree with my opinion about dropping the RFC tag, and will
> therefore send the next version without the RFC tag, please make sure
> you also run checkpatch.pl before posting.

Sure.  I will run checkpatch.pl from the next revision.

Thanks,
Honggyu

> 
> Thanks,
> SJ
> 
> [...]
Honggyu Kim April 8, 2024, 1:41 p.m. UTC | #4
Hi Gregory,

On Fri, 5 Apr 2024 12:56:14 -0400 Gregory Price <gregory.price@memverge.com> wrote:
> On Fri, Apr 05, 2024 at 03:08:49PM +0900, Honggyu Kim wrote:
> > There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> > posted at [1].
> > 
> >   1. YCSB zipfian distribution read only workload
> >   memory pressure with cold memory on node0 with 512GB of local DRAM.
> >   =============+================================================+=========
> >                |       cold memory occupied by mmap and memset  |
> >                |   0G  440G  450G  460G  470G  480G  490G  500G |
> >   =============+================================================+=========
> >   Execution time normalized to DRAM-only values                 | GEOMEAN
> >   -------------+------------------------------------------------+---------
> >   DRAM-only    | 1.00     -     -     -     -     -     -     - | 1.00
> >   CXL-only     | 1.22     -     -     -     -     -     -     - | 1.22
> >   default      |    -  1.12  1.13  1.14  1.16  1.19  1.21  1.21 | 1.17 
> >   DAMON tiered |    -  1.04  1.03  1.04  1.06  1.05  1.05  1.05 | 1.05 
> >   =============+================================================+=========
> >   CXL usage of redis-server in GB                               | AVERAGE
> >   -------------+------------------------------------------------+---------
> >   DRAM-only    |  0.0     -     -     -     -     -     -     - |  0.0
> >   CXL-only     | 52.6     -     -     -     -     -     -     - | 52.6
> >   default      |    -  20.4  27.0  33.1  39.5  45.6  50.5  50.3 | 38.1
> >   DAMON tiered |    -   0.1   0.3   0.8   0.6   0.7   1.3   0.9 |  0.7
> >   =============+================================================+=========
> > 
> > Each test result is based on the following execution environment.
> > 
> >   DRAM-only   : redis-server uses only local DRAM memory.
> >   CXL-only    : redis-server uses only CXL memory.
> >   default     : default memory policy(MPOL_DEFAULT).
> >                 numa balancing disabled.
> >   DAMON tiered: DAMON enabled with DAMOS_MIGRATE_COLD for DRAM nodes and
> >                 DAMOS_MIGRATE_HOT for CXL nodes.
> > 
> > The above result shows that the "default" execution time goes up as
> > the size of cold memory increases from 440G to 500G: the more cold
> > memory is used, the more CXL memory the target redis workload ends
> > up using, which increases the execution time.
> > 
> > However, "DAMON tiered" result shows less slowdown because the
> > DAMOS_MIGRATE_COLD action at DRAM node proactively demotes pre-allocated
> > cold memory to CXL node and this free space at DRAM increases more
> > chance to allocate hot or warm pages of redis-server to fast DRAM node.
> > Moreover, DAMOS_MIGRATE_HOT action at CXL node also promotes hot pages
> > of redis-server to DRAM node actively.
> > 
> > As a result, more redis-server memory stays on the DRAM node than
> > under the "default" memory policy, which yields the performance
> > improvement.
> > 
> > The following result for the latest distribution workload shows
> > similar data.
> > 
> >   2. YCSB latest distribution read only workload
> >   memory pressure with cold memory on node0 with 512GB of local DRAM.
> >   =============+================================================+=========
> >                |       cold memory occupied by mmap and memset  |
> >                |   0G  440G  450G  460G  470G  480G  490G  500G |
> >   =============+================================================+=========
> >   Execution time normalized to DRAM-only values                 | GEOMEAN
> >   -------------+------------------------------------------------+---------
> >   DRAM-only    | 1.00     -     -     -     -     -     -     - | 1.00
> >   CXL-only     | 1.18     -     -     -     -     -     -     - | 1.18
> >   default      |    -  1.18  1.19  1.18  1.18  1.17  1.19  1.18 | 1.18 
> >   DAMON tiered |    -  1.04  1.04  1.04  1.05  1.04  1.05  1.05 | 1.04 
> >   =============+================================================+=========
> >   CXL usage of redis-server in GB                               | AVERAGE
> >   -------------+------------------------------------------------+---------
> >   DRAM-only    |  0.0     -     -     -     -     -     -     - |  0.0
> >   CXL-only     | 52.6     -     -     -     -     -     -     - | 52.6
> >   default      |    -  20.5  27.1  33.2  39.5  45.5  50.4  50.5 | 38.1
> >   DAMON tiered |    -   0.2   0.4   0.7   1.6   1.2   1.1   3.4 |  1.2
> >   =============+================================================+=========
> > 
> > In summary, both results show that "DAMON tiered" memory management
> > reduces the performance slowdown compared to the "default" memory
> > policy from 17~18% to 4~5% when the system runs under high memory
> > pressure on its fast tier DRAM nodes.
> > 
> > Having these DAMOS_MIGRATE_HOT and DAMOS_MIGRATE_COLD actions can
> > make tiered memory systems run more efficiently under high memory
> > pressure.
> > 
> 
> Hi,
> 
> It's hard to determine from your results whether the performance
> mitigation is being caused primarily by MIGRATE_COLD freeing up space
> for new allocations, or from some combination of HOT/COLD actions
> occurring during execution but after the database has already been
> warmed up.

Thanks for the question.  I didn't include all the details of the
evaluation in the cover letter, but this is a chance to share more.

I would say the mitigation comes from both.  DAMOS_MIGRATE_COLD
demotes some cold data to CXL so redis can allocate more data on the
fast DRAM during launch time, as the mmap+memset and the redis launch
take several minutes.  But it also promotes some redis data while
running.

> Do you have test results which enable only DAMOS_MIGRATE_COLD actions
> but not DAMOS_MIGRATE_HOT actions? (and vice versa)
> 
> > The question I have is exactly how often MIGRATE_HOT is actually
> > being utilized, and how much data is being moved.  Testing
> > MIGRATE_COLD only would at least give a rough approximation of that.

To explain this, I'd better share more test results.  From the
"Evaluation Workload" section, the test sequence can be summarized as
follows.

  *. "Turn on DAMON."
  1. Allocate cold memory(mmap+memset) at DRAM node, then make the
     process sleep.
  2. Launch redis-server and load prebaked snapshot image, dump.rdb.
     (85GB consumed: 52GB for anon and 33GB for file cache)
  3. Run YCSB to make zipfian distribution of memory accesses to
     redis-server, then measure execution time.
  4. Repeat 4 over 50 times to measure the average execution time for
     each run.
  5. Increase the cold memory size then repeat goes to 2.

I didn't want to make the cover letter too long, but I have also
evaluated another scenario, which lazily enables DAMON just before the
YCSB run at step 4.  I will call this test "DAMON lazy".  This part is
missing from the cover letter.

  1. Allocate cold memory(mmap+memset) at DRAM node, then make the
     process sleep.
  2. Launch redis-server and load prebaked snapshot image, dump.rdb.
     (85GB consumed: 52GB for anon and 33GB for file cache)
  *. "Turn on DAMON."
  4. Run YCSB to make zipfian distribution of memory accesses to
     redis-server, then measure execution time.
  5. Repeat 4 over 50 times to measure the average execution time for
     each run.
  6. Increase the cold memory size then repeat goes to 2.

In the "DAMON lazy" senario, DAMON started monitoring late so the
initial redis-server placement is same as "default", but started to
demote cold data and promote redis data just before YCSB run.

The full test result is as follows.

  1. YCSB zipfian distribution read only workload
  memory pressure with cold memory on node0 with 512GB of local DRAM.
  =============+================================================+=========
               |       cold memory occupied by mmap and memset  |
               |   0G  440G  450G  460G  470G  480G  490G  500G |
  =============+================================================+=========
  Execution time normalized to DRAM-only values                 | GEOMEAN
  -------------+------------------------------------------------+---------
  DRAM-only    | 1.00     -     -     -     -     -     -     - | 1.00
  CXL-only     | 1.22     -     -     -     -     -     -     - | 1.22
  default      |    -  1.12  1.13  1.14  1.16  1.19  1.21  1.21 | 1.17
  DAMON tiered |    -  1.04  1.03  1.04  1.06  1.05  1.05  1.05 | 1.05
  DAMON lazy   |    -  1.04  1.05  1.05  1.06  1.06  1.07  1.07 | 1.06
  =============+================================================+=========
  CXL usage of redis-server in GB                               | AVERAGE
  -------------+------------------------------------------------+---------
  DRAM-only    |  0.0     -     -     -     -     -     -     - |  0.0
  CXL-only     | 52.6     -     -     -     -     -     -     - | 52.6
  default      |    -  20.4  27.0  33.1  39.5  45.6  50.5  50.3 | 38.1
  DAMON tiered |    -   0.1   0.3   0.8   0.6   0.7   1.3   0.9 |  0.7
  DAMON lazy   |    -   2.9   3.1   3.7   4.7   6.6   8.2   9.7 |  5.6
  =============+================================================+=========
  Migration size in GB by DAMOS_MIGRATE_COLD(demotion) and      |
  DAMOS_MIGRATE_HOT(promotion)                                  | AVERAGE
  -------------+------------------------------------------------+---------
  DAMON tiered |                                                |
  - demotion   |    -   522   510   523   520   513   558   558 |  529
  - promotion  |    -   0.1   1.3   6.2   8.1   7.2    22    17 |  8.8
  DAMON lazy   |                                                |
  - demotion   |    -   288   277   322   343   315   312   320 |  311
  - promotion  |    -    33    44    41    55    73    89   101 |   62
  =============+================================================+=========

I have included the "DAMON lazy" result and also the migration size
from the new DAMOS migrate actions.  Please note that the demotion
size is way higher than the promotion size because the promotion
target is only the redis data, while the demotion target includes the
huge cold memory allocated by mmap + memset (though there could be
some ping-pong issue).

As you mentioned, the "DAMON tiered" case benefits more because new
redis allocations go to DRAM more than in "default", but it also
benefits from promotion when it is under higher memory pressure, as
shown in the 490G and 500G cases.  There it promotes 22GB and 17GB of
redis data from CXL to DRAM.

In the case of "DAMON lazy", the promotion size is larger as expected,
and it increases as the memory pressure grows from left to right.

I will share the "latest" workload result as well; it shows a similar
tendency.

  2. YCSB latest distribution read only workload
  memory pressure with cold memory on node0 with 512GB of local DRAM.
  =============+================================================+=========
               |       cold memory occupied by mmap and memset  |
               |   0G  440G  450G  460G  470G  480G  490G  500G |
  =============+================================================+=========
  Execution time normalized to DRAM-only values                 | GEOMEAN
  -------------+------------------------------------------------+---------
  DRAM-only    | 1.00     -     -     -     -     -     -     - | 1.00
  CXL-only     | 1.18     -     -     -     -     -     -     - | 1.18
  default      |    -  1.18  1.19  1.18  1.18  1.17  1.19  1.18 | 1.18 
  DAMON tiered |    -  1.04  1.04  1.04  1.05  1.04  1.05  1.05 | 1.04 
  DAMON lazy   |    -  1.05  1.05  1.06  1.06  1.07  1.06  1.07 | 1.06
  =============+================================================+=========
  CXL usage of redis-server in GB                               | AVERAGE
  -------------+------------------------------------------------+---------
  DRAM-only    |  0.0     -     -     -     -     -     -     - |  0.0
  CXL-only     | 52.6     -     -     -     -     -     -     - | 52.6
  default      |    -  20.5  27.1  33.2  39.5  45.5  50.4  50.5 | 38.1
  DAMON tiered |    -   0.2   0.4   0.7   1.6   1.2   1.1   3.4 |  1.2
  DAMON lazy   |    -   5.3   4.1   3.9   6.4   8.8  10.1  11.3 |  7.1
  =============+================================================+=========
  Migration size in GB by DAMOS_MIGRATE_COLD(demotion) and      |
  DAMOS_MIGRATE_HOT(promotion)                                  | AVERAGE
  -------------+------------------------------------------------+---------
  DAMON tiered |                                                |
  - demotion   |    -   493   478   487   516   510   540   512 |  505
  - promotion  |    -   0.1   0.2   8.2   5.6   4.0   5.9    29 |  7.5
  DAMON lazy   |                                                |
  - demotion   |    -   315   318   293   290   308   322   286 |  305
  - promotion  |    -    36    45    38    56    74    91    99 |   63
  =============+================================================+=========

> Additionally, do you have any data on workloads that exceed the capacity
> of the DRAM tier?  Here you say you have 512GB of local DRAM, but only
> test a workload that caps out at 500G.  Have you run a test of, say,
> 550GB to see the effect of DAMON HOT/COLD migration actions when DRAM
> capacity is exceeded?

I didn't want to remove DRAM from my server, so I kept using 512GB of
DRAM, but I couldn't make a single workload that consumes more than
the DRAM size.

I wanted to use a more realistic workload rather than micro
benchmarks, and the core concept of this test is to cover realistic
scenarios with a system-wide view.  If the system has 512GB of local
DRAM, it wouldn't be possible to make the entire 512GB of DRAM hot, so
there would be some amount of cold memory, which can be the target of
demotion.  Then we can find some workload that is actively used and
promote it as much as possible.  That's why I made the promotion
policy aggressive.

> Can you also provide the DRAM-only results for each test?  Presumably,
> as workload size increases from 440G to 500G, the system probably starts
> using some amount of swap/zswap/whatever.  It would be good to know how
> > this system compares to swapping small amounts of overflow.

It looks like my explanation wasn't clear enough.  The sizes from
440GB to 500GB are for the pre-allocated cold data that puts memory
pressure on the system so that redis-server cannot be fully allocated
on the fast DRAM and is partially allocated on CXL memory as well.

Also, my evaluation environment doesn't have swap space, to focus on
migration rather than swap.

> 
> ~Gregory

I hope my explanation is helpful.  Please let me know if you have more
questions.

Thanks,
Honggyu
Honggyu Kim April 9, 2024, 9:59 a.m. UTC | #5
On Mon,  8 Apr 2024 22:41:04 +0900 Honggyu Kim <honggyu.kim@sk.com> wrote:
[...]
> To explain this, I'd better share more test results.  From the
> "Evaluation Workload" section, the test sequence can be summarized as
> follows.
> 
>   *. "Turn on DAMON."
>   1. Allocate cold memory(mmap+memset) at DRAM node, then make the
>      process sleep.
>   2. Launch redis-server and load prebaked snapshot image, dump.rdb.
>      (85GB consumed: 52GB for anon and 33GB for file cache)
>   3. Run YCSB to make zipfian distribution of memory accesses to
>      redis-server, then measure execution time.
>   4. Repeat 4 over 50 times to measure the average execution time for
>      each run.

Sorry, "Repeat 4 over 50 times" is incorrect.  This should be "Repeat 3
over 50 times".

>   5. Increase the cold memory size then repeat goes to 2.
> 
> I didn't want to make the cover letter too long, but I have also
> evaluated another scenario, which lazily enables DAMON just before the
> YCSB run at step 4.  I will call this test "DAMON lazy".  This part is
> missing from the cover letter.
> 
>   1. Allocate cold memory(mmap+memset) at DRAM node, then make the
>      process sleep.
>   2. Launch redis-server and load prebaked snapshot image, dump.rdb.
>      (85GB consumed: 52GB for anon and 33GB for file cache)
>   *. "Turn on DAMON."
>   4. Run YCSB to make zipfian distribution of memory accesses to
>      redis-server, then measure execution time.
>   5. Repeat 4 over 50 times to measure the average execution time for
>      each run.
>   6. Increase the cold memory size then repeat goes to 2.
> 
> In the "DAMON lazy" senario, DAMON started monitoring late so the
> initial redis-server placement is same as "default", but started to
> demote cold data and promote redis data just before YCSB run.
[...]

Thanks,
Honggyu
Gregory Price April 10, 2024, midnight UTC | #6
On Mon, Apr 08, 2024 at 10:41:04PM +0900, Honggyu Kim wrote:
> Hi Gregory,
> 
> On Fri, 5 Apr 2024 12:56:14 -0400 Gregory Price <gregory.price@memverge.com> wrote:
> > Do you have test results which enable only DAMOS_MIGRATE_COLD actions
> > but not DAMOS_MIGRATE_HOT actions? (and vice versa)
> > 
> > The question I have is exactly how often MIGRATE_HOT is actually
> > being utilized, and how much data is being moved.  Testing
> > MIGRATE_COLD only would at least give a rough approximation of that.
> 
> To explain this, I'd better share more test results.  From the
> "Evaluation Workload" section, the test sequence can be summarized as
> follows.
> 
>   *. "Turn on DAMON."
>   1. Allocate cold memory(mmap+memset) at DRAM node, then make the
>      process sleep.
>   2. Launch redis-server and load prebaked snapshot image, dump.rdb.
>      (85GB consumed: 52GB for anon and 33GB for file cache)

Aha! I see now: you are allocating memory to ensure the real workload
(redis-server) pressures the DRAM tier and causes "spillage" to the
CXL tier, and then measuring the overhead in different scenarios.

I would still love to know what a demote-only system would produce,
mostly because it would very clearly demonstrate the value of the
demote+promote system when the system is memory-pressured.

Given the additional results below, it appears a demote-only system
would likely trend toward CXL-only, and so this shows affirmative
support for the promotion logic.

Just another datum that is useful and paints a more complete picture.

> I didn't want to make the cover letter too long, but I have also
> evaluated another scenario, which lazily enables DAMON just before the
> YCSB run at step 4.  I will call this test "DAMON lazy".  This part is
> missing from the cover letter.
> 
>   1. Allocate cold memory(mmap+memset) at DRAM node, then make the
>      process sleep.
>   2. Launch redis-server and load prebaked snapshot image, dump.rdb.
>      (85GB consumed: 52GB for anon and 33GB for file cache)
>   *. "Turn on DAMON."
> 
> In the "DAMON lazy" senario, DAMON started monitoring late so the
> initial redis-server placement is same as "default", but started to
> demote cold data and promote redis data just before YCSB run.
>

This is excellent and definitely demonstrates part of the picture I was
alluding to, thank you for the additional data!

> 
> I have included the "DAMON lazy" result and also the migration size
> from the new DAMOS migrate actions.  Please note that the demotion
> size is way higher than the promotion size because the promotion
> target is only the redis data, while the demotion target includes the
> huge cold memory allocated by mmap + memset (though there could be
> some ping-pong issue).
> 
> As you mentioned, the "DAMON tiered" case benefits more because new
> redis allocations go to DRAM more than in "default", but it also
> benefits from promotion when it is under higher memory pressure, as
> shown in the 490G and 500G cases.  There it promotes 22GB and 17GB of
> redis data from CXL to DRAM.

I think a better way of saying this is that "DAMON tiered" more
effectively mitigates the effect of memory pressure on the faster tier
before spillage occurs, while "DAMON lazy" demonstrates the expected
performance of the system after memory pressure outruns the demotion
logic, so you wind up with hot data stuck in the slow tier.

There are some out there who would simply say "just demote more
aggressively", so this is useful information for the discussion.

+/- ~2% despite greater memory migration is an excellent result.

> > Can you also provide the DRAM-only results for each test?  Presumably,
> > as workload size increases from 440G to 500G, the system probably starts
> > using some amount of swap/zswap/whatever.  It would be good to know how
> > this system compares to swapping small amounts of overflow.
> 
> It looks like my explanation wasn't clear enough.  The sizes from
> 440GB to 500GB are for the pre-allocated cold data that puts memory
> pressure on the system so that redis-server cannot be fully allocated
> on the fast DRAM and is partially allocated on CXL memory as well.
> 

Yes, sorry for the misunderstanding.  This makes it much clearer.

> 
> I hope my explanation is helpful.  Please let me know if you have
> more questions.
>

Excellent work, exciting results! Thank you for the additional answers
:]

~Gregory