[v7,11/12] mm/demotion: Add documentation for memory tiering

Message ID: 20220622082513.467538-12-aneesh.kumar@linux.ibm.com
State: New
Series: mm/demotion: Memory tiers and demotion

Commit Message

Aneesh Kumar K.V June 22, 2022, 8:25 a.m. UTC
From: Jagdish Gediya <jvgediya@linux.ibm.com>

All N_MEMORY nodes are divided into 3 memoty tiers with tier ID value
MEMORY_TIER_HBM_GPU, MEMORY_TIER_DRAM and MEMORY_TIER_PMEM. By default,
all nodes are assigned to default memory tier.

Demotion path for all N_MEMORY nodes is prepared based on the tier ID value
of memory tiers.

This patch adds documention for memory tiering introduction, its sysfs
interfaces and how demotion is performed based on memory tiers.

Suggested-by: Wei Xu <weixugc@google.com>
Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 Documentation/admin-guide/mm/index.rst        |   1 +
 .../admin-guide/mm/memory-tiering.rst         | 182 ++++++++++++++++++
 2 files changed, 183 insertions(+)
 create mode 100644 Documentation/admin-guide/mm/memory-tiering.rst

Comments

kernel test robot June 22, 2022, 9:21 p.m. UTC | #1
Hi Aneesh,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Aneesh-Kumar-K-V/mm-demotion-Memory-tiers-and-demotion/20220622-163031
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
reproduce: make htmldocs

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> Documentation/admin-guide/mm/memory-tiering.rst:5: (SEVERE/4) Title overline & underline mismatch.

vim +5 Documentation/admin-guide/mm/memory-tiering.rst

     4	
   > 5	===========
     6	Memory tiers
     7	============
     8
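
[Editorial sketch, not part of the robot's report.] The error above comes from docutils requiring a section title's overline and underline to be identical. A rough pre-check for this class of mistake might look like the following. The function name `find_title_mismatches` and the reduced set of adornment characters are assumptions made for illustration; `make htmldocs` remains the authoritative check.

```python
import re

# A line made of one repeated adornment character, at least 3 long.
# Only a subset of the characters reST accepts is handled here (assumption).
ADORNMENT = re.compile(r'^([=\-~^#*+])\1{2,}\s*$')


def find_title_mismatches(text):
    """Return 1-based line numbers of overlines that differ from their underline."""
    lines = text.splitlines()
    mismatches = []
    for i in range(len(lines) - 2):
        over, title, under = lines[i], lines[i + 1], lines[i + 2]
        # Treat a line as an overline only at the start of the file or
        # after a blank line, so a plain underline is not misread.
        starts_block = i == 0 or not lines[i - 1].strip()
        if (starts_block
                and ADORNMENT.match(over)
                and ADORNMENT.match(under)
                and title.strip()
                and not ADORNMENT.match(title)
                and over.rstrip() != under.rstrip()):
            mismatches.append(i + 1)
    return mismatches


broken = ".. _admin_guide_memory_tiering:\n\n===========\nMemory tiers\n============\n"
fixed = ".. _admin_guide_memory_tiering:\n\n============\nMemory tiers\n============\n"
print(find_title_mismatches(broken))
print(find_title_mismatches(fixed))
```

The line numbers returned are relative to the string passed in, so they will not match the `:5:` in the report unless the whole file is supplied.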
Bagas Sanjaya June 25, 2022, 2:56 a.m. UTC | #2
On Thu, Jun 23, 2022 at 05:21:17AM +0800, kernel test robot wrote:
> If you fix the issue, kindly add following tag where applicable
> Reported-by: kernel test robot <lkp@intel.com>
> 
> All errors (new ones prefixed by >>):
> 
> >> Documentation/admin-guide/mm/memory-tiering.rst:5: (SEVERE/4) Title overline & underline mismatch.
> 
> vim +5 Documentation/admin-guide/mm/memory-tiering.rst
> 
>      4	
>    > 5	===========
>      6	Memory tiers
>      7	============
>      8	
> 

Here is the fixup. Thanks.

---- >8 ----

From ee8b97451b6ad1869f4d426e2d3825ac20a6e15d Mon Sep 17 00:00:00 2001
From: Bagas Sanjaya <bagasdotme@gmail.com>
Date: Sat, 25 Jun 2022 09:48:28 +0700
Subject: [PATCH] fixup for "mm/demotion: Add documentation for memory tiering"

Extend the title heading overline by one (=) to match the underline.

Fixes: 64fc925cf27dac ("mm/demotion: Add documentation for memory tiering")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
---
 Documentation/admin-guide/mm/memory-tiering.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/mm/memory-tiering.rst b/Documentation/admin-guide/mm/memory-tiering.rst
index 142c36651f5dd2..0a75e0dab1fd8e 100644
--- a/Documentation/admin-guide/mm/memory-tiering.rst
+++ b/Documentation/admin-guide/mm/memory-tiering.rst
@@ -2,7 +2,7 @@
 
 .. _admin_guide_memory_tiering:
 
-===========
+============
 Memory tiers
 ============
Bagas Sanjaya June 25, 2022, 4:13 a.m. UTC | #3
On Wed, Jun 22, 2022 at 01:55:12PM +0530, Aneesh Kumar K.V wrote:
> From: Jagdish Gediya <jvgediya@linux.ibm.com>
> 

Hi Aneesh and Jagdish,

The documentation can be improved, see below.

> All N_MEMORY nodes are divided into 3 memoty tiers with tier ID value
> MEMORY_TIER_HBM_GPU, MEMORY_TIER_DRAM and MEMORY_TIER_PMEM. By default,
> all nodes are assigned to default memory tier.
> 
> Demotion path for all N_MEMORY nodes is prepared based on the tier ID value
> of memory tiers.
> 
> This patch adds documention for memory tiering introduction, its sysfs
> interfaces and how demotion is performed based on memory tiers.
> 

I think the patch message should just be:
"Add documentation for memory tiering. It also covers its sysfs
interfaces and how demotion is performed based on memory tiers."

> +===========
> +Memory tiers
> +============
> +
> +This document describes explicit memory tiering support along with
> +demotion based on memory tiers.
> +

This causes htmldocs error, for which I have applied the fixup at [1].

> +Memory nodes are divided into 3 types of memory tiers with tier ID
> +value as shown based on their hardware characteristics.
> +
> +
> +MEMORY_TIER_HBM_GPU
> +MEMORY_TIER_DRAM
> +MEMORY_TIER_PMEM
> +

Use bullet list.

> +Sysfs interfaces
> +================
> +
> +Nodes belonging to specific tier can be read from,
> +/sys/devices/system/memtier/memtierN/nodelist (Read-Only)
> +
> +Where N is 0 - 2.

The "where" sentence can be compounded into the previous sentence above.

> +
> +Example 1:
> +For a system where Node 0 is CPU + DRAM nodes, Node 1 is HBM node,
> +node 2 is a PMEM node an ideal tier layout will be
> +
> +$ cat /sys/devices/system/memtier/memtier0/nodelist
> +1
> +$ cat /sys/devices/system/memtier/memtier1/nodelist
> +0
> +$ cat /sys/devices/system/memtier/memtier2/nodelist
> +2
> +

The code snippets should have been inside literal code blocks.

> +Example 2:
> +For a system where Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM
> +nodes.
> +
> +$ cat /sys/devices/system/memtier/memtier0/nodelist
> +cat: /sys/devices/system/memtier/memtier0/nodelist: No such file or
> +directory
> +$ cat /sys/devices/system/memtier/memtier1/nodelist
> +0-1
> +$ cat /sys/devices/system/memtier/memtier2/nodelist
> +2-3
> +

Use literal code block.

> +Default memory tier can be read from,
> +/sys/devices/system/memtier/default_tier (Read-Only)
> +
> +e.g.
> +$ cat /sys/devices/system/memtier/default_tier
> +memtier200
> +
> +Max memory tier ID supported can be read from,
> +/sys/devices/system/memtier/max_tier (Read-Only)
> +
> +e.g.
> +$ cat /sys/devices/system/memtier/max_tier
> +400
> +
> +Individual node's memory tier can be read of set using,
> +/sys/devices/system/node/nodeN/memtier	(Read-Write)
> +
> +where N = node id
> +
> +When this interface is written, Node is moved from the old memory tier
> +to new memory tier and demotion targets for all N_MEMORY nodes are
> +built again.
> +
> +For example 1 mentioned above,
> +$ cat /sys/devices/system/node/node0/memtier
> +1
> +$ cat /sys/devices/system/node/node1/memtier
> +0
> +$ cat /sys/devices/system/node/node2/memtier
> +2
> +

The same suggestions above apply here, too.

> +Enable/Disable demotion
> +-----------------------
> +
> +By default demotion is disabled, it can be enabled/disabled using
> +below sysfs interface,
> +
> +$ echo 0/1 or false/true > /sys/kernel/mm/numa/demotion_enabled
> +

Use literal code block.

> +preferred and allowed demotion nodes
> +------------------------------------
> +
> +Preferred nodes for a specific N_MEMORY node are the best nodes
> +from the next possible lower memory tier. Allowed nodes for any
> +node are all the nodes available in all possible lower memory
> +tiers.
> +
> +Example:
> +
> +For a system where Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM
> +nodes,
> +
> +node distances:
> +node   0    1    2    3
> +   0  10   20   30   40
> +   1  20   10   40   30
> +   2  30   40   10   40
> +   3  40   30   40   10
> +

Use reST table.

> +memory_tiers[0] = <empty>
> +memory_tiers[1] = 0-1
> +memory_tiers[2] = 2-3
> +
> +node_demotion[0].preferred = 2
> +node_demotion[0].allowed   = 2, 3
> +node_demotion[1].preferred = 3
> +node_demotion[1].allowed   = 3, 2
> +node_demotion[2].preferred = <empty>
> +node_demotion[2].allowed   = <empty>
> +node_demotion[3].preferred = <empty>
> +node_demotion[3].allowed   = <empty>
> +

What are these above? Node properties? BTW, use literal code block.

If you don't understand these suggestions above, here is the diff:

---- >8 ----

diff --git a/Documentation/admin-guide/mm/memory-tiering.rst b/Documentation/admin-guide/mm/memory-tiering.rst
index 0a75e0dab1fd8e..10ec5aab6ddd53 100644
--- a/Documentation/admin-guide/mm/memory-tiering.rst
+++ b/Documentation/admin-guide/mm/memory-tiering.rst
@@ -14,13 +14,13 @@ Introduction
 
 Many systems have multiple types of memory devices e.g. GPU, DRAM and
 PMEM. The memory subsystem of these systems can be called a memory
-tiering system because the performance of the different types of
+tiering system because the performance of each type of
 memory is different. Memory tiers are defined based on the hardware
 capabilities of memory nodes. Each memory tier is assigned a tier ID
 value that determines the memory tier position in demotion order.
 
 The memory tier assignment of each node is independent of each
-other. Moving a node from one tier to another tier doesn't affect
+other. Moving a node from one tier to another doesn't affect
 the tier assignment of any other node.
 
 Memory tiers are used to build the demotion targets for nodes. A node
@@ -32,10 +32,9 @@ Memory tier rank
 Memory nodes are divided into 3 types of memory tiers with tier ID
 value as shown based on their hardware characteristics.
 
-
-MEMORY_TIER_HBM_GPU
-MEMORY_TIER_DRAM
-MEMORY_TIER_PMEM
+  * MEMORY_TIER_HBM_GPU
+  * MEMORY_TIER_DRAM
+  * MEMORY_TIER_PMEM
 
 Memory tiers initialization and (re)assignments
 ===============================================
@@ -49,68 +48,73 @@ hotplug, the memory tier with default tier ID is assigned to the memory node.
 Sysfs interfaces
 ================
 
-Nodes belonging to specific tier can be read from,
-/sys/devices/system/memtier/memtierN/nodelist (Read-Only)
+Nodes belonging to specific tier can be read from
+/sys/devices/system/memtier/memtierN/nodelist, where N is 0 - 2 (read-only)
 
-Where N is 0 - 2.
+Examples:
 
-Example 1:
-For a system where Node 0 is CPU + DRAM nodes, Node 1 is HBM node,
-node 2 is a PMEM node an ideal tier layout will be
+1. On a system where Node 0 is CPU + DRAM nodes, Node 1 is HBM node,
+   node 2 is a PMEM node an ideal tier layout will be:
 
-$ cat /sys/devices/system/memtier/memtier0/nodelist
-1
-$ cat /sys/devices/system/memtier/memtier1/nodelist
-0
-$ cat /sys/devices/system/memtier/memtier2/nodelist
-2
+   .. code-block::
 
-Example 2:
-For a system where Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM
-nodes.
+      $ cat /sys/devices/system/memtier/memtier0/nodelist
+      1
+      $ cat /sys/devices/system/memtier/memtier1/nodelist
+      0
+      $ cat /sys/devices/system/memtier/memtier2/nodelist
+      2
 
-$ cat /sys/devices/system/memtier/memtier0/nodelist
-cat: /sys/devices/system/memtier/memtier0/nodelist: No such file or
-directory
-$ cat /sys/devices/system/memtier/memtier1/nodelist
-0-1
-$ cat /sys/devices/system/memtier/memtier2/nodelist
-2-3
+2. On a system where Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM
+   nodes:
 
-Default memory tier can be read from,
-/sys/devices/system/memtier/default_tier (Read-Only)
+   .. code-block::
 
-e.g.
-$ cat /sys/devices/system/memtier/default_tier
-memtier200
+      $ cat /sys/devices/system/memtier/memtier0/nodelist
+      cat: /sys/devices/system/memtier/memtier0/nodelist: No such file or
+      directory
+      $ cat /sys/devices/system/memtier/memtier1/nodelist
+      0-1
+      $ cat /sys/devices/system/memtier/memtier2/nodelist
+      2-3
 
-Max memory tier ID supported can be read from,
-/sys/devices/system/memtier/max_tier (Read-Only)
+Default memory tier can be read from
+/sys/devices/system/memtier/default_tier (read-only), e.g.:
 
-e.g.
-$ cat /sys/devices/system/memtier/max_tier
-400
+.. code-block::
 
-Individual node's memory tier can be read of set using,
-/sys/devices/system/node/nodeN/memtier	(Read-Write)
+   $ cat /sys/devices/system/memtier/default_tier
+   memtier200
 
-where N = node id
+Max memory tier ID supported can be read from
+/sys/devices/system/memtier/max_tier (read-only), e.g.:
 
-When this interface is written, Node is moved from the old memory tier
+.. code-block::
+
+   $ cat /sys/devices/system/memtier/max_tier
+   400
+
+Individual node's memory tier can be read or set using
+/sys/devices/system/node/nodeN/memtier (read-write), where N = node id.
+
+When this interface is written, node is moved from the old memory tier
 to new memory tier and demotion targets for all N_MEMORY nodes are
 built again.
 
-For example 1 mentioned above,
-$ cat /sys/devices/system/node/node0/memtier
-1
-$ cat /sys/devices/system/node/node1/memtier
-0
-$ cat /sys/devices/system/node/node2/memtier
-2
+For example 1 mentioned above:
+
+.. code-block::
+
+   $ cat /sys/devices/system/node/node0/memtier
+   1
+   $ cat /sys/devices/system/node/node1/memtier
+   0
+   $ cat /sys/devices/system/node/node2/memtier
+   2
 
 Additional memory tiers can be created by writing a tier ID value to this file.
-This results in a new memory tier creation and moving the specific NUMA node to
-that memory tier.
+This results into creating a new tier and moving the specific NUMA node to
+that tier.
 
 Demotion
 ========
@@ -128,19 +132,20 @@ be used.
 
 Instead of a page being discarded during reclaim, it can be moved to
 persistent memory. Allowing page migration during reclaim enables
-these systems to migrate pages from fast(higher) tiers to slow(lower)
-tiers when the fast(higher) tier is under pressure.
+these systems to migrate pages from fast (higher) tiers to slow (lower)
+tiers when the fast (higher) tier is under pressure.
 
 
 Enable/Disable demotion
 -----------------------
 
-By default demotion is disabled, it can be enabled/disabled using
-below sysfs interface,
+By default demotion is disabled. It can be toggled by:
 
-$ echo 0/1 or false/true > /sys/kernel/mm/numa/demotion_enabled
+.. code-block::
 
-preferred and allowed demotion nodes
+   $ echo 0/1 or false/true > /sys/kernel/mm/numa/demotion_enabled
+
+Preferred and allowed demotion nodes
 ------------------------------------
 
 Preferred nodes for a specific N_MEMORY node are the best nodes
@@ -148,35 +153,40 @@ from the next possible lower memory tier. Allowed nodes for any
 node are all the nodes available in all possible lower memory
 tiers.
 
-Example:
+For example, on a system where Node 0 & 1 are CPU + DRAM nodes,
+node 2 & 3 are PMEM nodes:
 
-For a system where Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM
-nodes,
+  * node distances
 
-node distances:
-node   0    1    2    3
-   0  10   20   30   40
-   1  20   10   40   30
-   2  30   40   10   40
-   3  40   30   40   10
+    ====  ==   ==   ==   ==
+    node   0    1    2    3
+    ====  ==   ==   ==   ==
+       0  10   20   30   40
+       1  20   10   40   30
+       2  30   40   10   40
+       3  40   30   40   10
+    ====  ==   ==   ==   ==
 
-memory_tiers[0] = <empty>
-memory_tiers[1] = 0-1
-memory_tiers[2] = 2-3
+  * node properties
 
-node_demotion[0].preferred = 2
-node_demotion[0].allowed   = 2, 3
-node_demotion[1].preferred = 3
-node_demotion[1].allowed   = 3, 2
-node_demotion[2].preferred = <empty>
-node_demotion[2].allowed   = <empty>
-node_demotion[3].preferred = <empty>
-node_demotion[3].allowed   = <empty>
+    .. code-block::
+
+       memory_tiers[0] = <empty>
+       memory_tiers[1] = 0-1
+       memory_tiers[2] = 2-3
+
+       node_demotion[0].preferred = 2
+       node_demotion[0].allowed   = 2, 3
+       node_demotion[1].preferred = 3
+       node_demotion[1].allowed   = 3, 2
+       node_demotion[2].preferred = <empty>
+       node_demotion[2].allowed   = <empty>
+       node_demotion[3].preferred = <empty>
+       node_demotion[3].allowed   = <empty>
 
 Memory allocation for demotion
 ------------------------------
 
-If a page needs to be demoted from any node, the kernel 1st tries
-to allocate a new page from the node's preferred node and fallbacks to
-node's allowed targets in allocation fallback order.
-
+If a page needs to be demoted from any node, the kernel first tries
+to allocate a new page from the node's preferred target node and fallbacks
+to node's allowed targets in allocation fallback order.


Thanks.

[1]: https://lore.kernel.org/linux-doc/YrZ5cTFOSuWxlF2t@debian.me/
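
[Editorial sketch, not part of the original message.] The "preferred and allowed demotion nodes" example quoted above can be reproduced with a few lines of Python. This is only a model of the rule as the documentation words it, not the kernel implementation. It assumes that "best" means smallest NUMA distance and that a smaller tier index is a faster tier, as in the `memory_tiers[0..2]` listing. The name `build_demotion_targets` is hypothetical.

```python
def build_demotion_targets(tiers, distance):
    """Model the documented rule for demotion targets.

    tiers:    dict mapping tier index -> list of node ids
              (smaller index = faster tier, an assumption taken from the example)
    distance: square matrix, distance[a][b] = NUMA distance from node a to node b
    """
    # Empty tiers are skipped, so "next lower tier" means next populated one.
    order = sorted(t for t, nodes in tiers.items() if nodes)
    targets = {}
    for idx, tier in enumerate(order):
        lower = order[idx + 1:]
        # Allowed: every node in every lower tier.
        allowed = sorted(n for t in lower for n in tiers[t])
        for node in tiers[tier]:
            preferred = []
            if lower:
                # Preferred: closest node(s) in the next lower tier.
                candidates = tiers[lower[0]]
                best = min(distance[node][c] for c in candidates)
                preferred = sorted(c for c in candidates
                                   if distance[node][c] == best)
            targets[node] = {"preferred": preferred, "allowed": allowed}
    return targets


# Values copied from the example in the patch.
tiers = {0: [], 1: [0, 1], 2: [2, 3]}
distance = [
    [10, 20, 30, 40],
    [20, 10, 40, 30],
    [30, 40, 10, 40],
    [40, 30, 40, 10],
]
targets = build_demotion_targets(tiers, distance)
for node in sorted(targets):
    print(node, targets[node])
```

Under these assumptions node 0 prefers node 2 and node 1 prefers node 3, both may fall back to either PMEM node, and the PMEM nodes have no targets, which agrees with the `node_demotion[...]` listing in the patch.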
Aneesh Kumar K.V June 27, 2022, 4:40 a.m. UTC | #4
Bagas Sanjaya <bagasdotme@gmail.com> writes:

> On Wed, Jun 22, 2022 at 01:55:12PM +0530, Aneesh Kumar K.V wrote:
>> From: Jagdish Gediya <jvgediya@linux.ibm.com>
>> 
>
> Hi Aneesh and Jagdish,
>
> The documentation can be improved, see below.
>
>> All N_MEMORY nodes are divided into 3 memoty tiers with tier ID value
>> MEMORY_TIER_HBM_GPU, MEMORY_TIER_DRAM and MEMORY_TIER_PMEM. By default,
>> all nodes are assigned to default memory tier.
>> 
>> Demotion path for all N_MEMORY nodes is prepared based on the tier ID value
>> of memory tiers.
>> 
>> This patch adds documention for memory tiering introduction, its sysfs
>> interfaces and how demotion is performed based on memory tiers.
>> 
>
> I think the patch message should just be:
> "Add documentation for memory tiering. It also covers its sysfs
> interfaces and how demotion is performed based on memory tiers."
>
>> +===========
>> +Memory tiers
>> +============
>> +
>> +This document describes explicit memory tiering support along with
>> +demotion based on memory tiers.
>> +
>
> This causes htmldocs error, for which I have applied the fixup at [1].
>
>> +Memory nodes are divided into 3 types of memory tiers with tier ID
>> +value as shown based on their hardware characteristics.
>> +
>> +
>> +MEMORY_TIER_HBM_GPU
>> +MEMORY_TIER_DRAM
>> +MEMORY_TIER_PMEM
>> +
>
> Use bullet list.
>
>> +Sysfs interfaces
>> +================
>> +
>> +Nodes belonging to specific tier can be read from,
>> +/sys/devices/system/memtier/memtierN/nodelist (Read-Only)
>> +
>> +Where N is 0 - 2.
>
> The "where" sentence can be compounded into the previous sentence above.
>
>> +
>> +Example 1:
>> +For a system where Node 0 is CPU + DRAM nodes, Node 1 is HBM node,
>> +node 2 is a PMEM node an ideal tier layout will be
>> +
>> +$ cat /sys/devices/system/memtier/memtier0/nodelist
>> +1
>> +$ cat /sys/devices/system/memtier/memtier1/nodelist
>> +0
>> +$ cat /sys/devices/system/memtier/memtier2/nodelist
>> +2
>> +
>
> The code snippets should have been inside literal code blocks.
>
>> +Example 2:
>> +For a system where Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM
>> +nodes.
>> +
>> +$ cat /sys/devices/system/memtier/memtier0/nodelist
>> +cat: /sys/devices/system/memtier/memtier0/nodelist: No such file or
>> +directory
>> +$ cat /sys/devices/system/memtier/memtier1/nodelist
>> +0-1
>> +$ cat /sys/devices/system/memtier/memtier2/nodelist
>> +2-3
>> +
>
> Use literal code block.
>
>> +Default memory tier can be read from,
>> +/sys/devices/system/memtier/default_tier (Read-Only)
>> +
>> +e.g.
>> +$ cat /sys/devices/system/memtier/default_tier
>> +memtier200
>> +
>> +Max memory tier ID supported can be read from,
>> +/sys/devices/system/memtier/max_tier (Read-Only)
>> +
>> +e.g.
>> +$ cat /sys/devices/system/memtier/max_tier
>> +400
>> +
>> +Individual node's memory tier can be read of set using,
>> +/sys/devices/system/node/nodeN/memtier	(Read-Write)
>> +
>> +where N = node id
>> +
>> +When this interface is written, Node is moved from the old memory tier
>> +to new memory tier and demotion targets for all N_MEMORY nodes are
>> +built again.
>> +
>> +For example 1 mentioned above,
>> +$ cat /sys/devices/system/node/node0/memtier
>> +1
>> +$ cat /sys/devices/system/node/node1/memtier
>> +0
>> +$ cat /sys/devices/system/node/node2/memtier
>> +2
>> +
>
> The same suggestions above apply here, too.
>
>> +Enable/Disable demotion
>> +-----------------------
>> +
>> +By default demotion is disabled, it can be enabled/disabled using
>> +below sysfs interface,
>> +
>> +$ echo 0/1 or false/true > /sys/kernel/mm/numa/demotion_enabled
>> +
>
> Use literal code block.
>
>> +preferred and allowed demotion nodes
>> +------------------------------------
>> +
>> +Preferred nodes for a specific N_MEMORY node are the best nodes
>> +from the next possible lower memory tier. Allowed nodes for any
>> +node are all the nodes available in all possible lower memory
>> +tiers.
>> +
>> +Example:
>> +
>> +For a system where Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM
>> +nodes,
>> +
>> +node distances:
>> +node   0    1    2    3
>> +   0  10   20   30   40
>> +   1  20   10   40   30
>> +   2  30   40   10   40
>> +   3  40   30   40   10
>> +
>
> Use reST table.
>
>> +memory_tiers[0] = <empty>
>> +memory_tiers[1] = 0-1
>> +memory_tiers[2] = 2-3
>> +
>> +node_demotion[0].preferred = 2
>> +node_demotion[0].allowed   = 2, 3
>> +node_demotion[1].preferred = 3
>> +node_demotion[1].allowed   = 3, 2
>> +node_demotion[2].preferred = <empty>
>> +node_demotion[2].allowed   = <empty>
>> +node_demotion[3].preferred = <empty>
>> +node_demotion[3].allowed   = <empty>
>> +
>
> What are these above? Node properties? BTW, use literal code block.
>
> If you don't understand these suggestions above, here is the diff:

Applying the diff below failed for me with:
patch: **** malformed patch at line 180: @@ -148,35 +153,40 @@ from the next possible lower memory tier. Allowed nodes for any

But I did modify the documentation based on your feedback and it is much
better than what I had. Thanks for the review. I will send v8 with the
changes folded. I did add the below to commit message. Hope that is ok. 

[update doc format by Bagas Sanjaya <bagasdotme@gmail.com>]

>
> ---- >8 ----
>
> diff --git a/Documentation/admin-guide/mm/memory-tiering.rst b/Documentation/admin-guide/mm/memory-tiering.rst
> index 0a75e0dab1fd8e..10ec5aab6ddd53 100644
> --- a/Documentation/admin-guide/mm/memory-tiering.rst
> +++ b/Documentation/admin-guide/mm/memory-tiering.rst
> @@ -14,13 +14,13 @@ Introduction
>  
>  Many systems have multiple types of memory devices e.g. GPU, DRAM and
>  PMEM. The memory subsystem of these systems can be called a memory
> -tiering system because the performance of the different types of
> +tiering system because the performance of each type of
>  memory is different. Memory tiers are defined based on the hardware
>  capabilities of memory nodes. Each memory tier is assigned a tier ID
>  value that determines the memory tier position in demotion order.
>  
>  The memory tier assignment of each node is independent of each
> -other. Moving a node from one tier to another tier doesn't affect
> +other. Moving a node from one tier to another doesn't affect
>  the tier assignment of any other node.
>  
>  Memory tiers are used to build the demotion targets for nodes. A node
> @@ -32,10 +32,9 @@ Memory tier rank
>  Memory nodes are divided into 3 types of memory tiers with tier ID
>  value as shown based on their hardware characteristics.
>  
> -
> -MEMORY_TIER_HBM_GPU
> -MEMORY_TIER_DRAM
> -MEMORY_TIER_PMEM
> +  * MEMORY_TIER_HBM_GPU
> +  * MEMORY_TIER_DRAM
> +  * MEMORY_TIER_PMEM
>  
>  Memory tiers initialization and (re)assignments
>  ===============================================
> @@ -49,68 +48,73 @@ hotplug, the memory tier with default tier ID is assigned to the memory node.
>  Sysfs interfaces
>  ================
>  
> -Nodes belonging to specific tier can be read from,
> -/sys/devices/system/memtier/memtierN/nodelist (Read-Only)
> +Nodes belonging to specific tier can be read from
> +/sys/devices/system/memtier/memtierN/nodelist, where N is 0 - 2 (read-only)
>  
> -Where N is 0 - 2.
> +Examples:
>  
> -Example 1:
> -For a system where Node 0 is CPU + DRAM nodes, Node 1 is HBM node,
> -node 2 is a PMEM node an ideal tier layout will be
> +1. On a system where Node 0 is CPU + DRAM nodes, Node 1 is HBM node,
> +   node 2 is a PMEM node an ideal tier layout will be:
>  
> -$ cat /sys/devices/system/memtier/memtier0/nodelist
> -1
> -$ cat /sys/devices/system/memtier/memtier1/nodelist
> -0
> -$ cat /sys/devices/system/memtier/memtier2/nodelist
> -2
> +   .. code-block::
>  
> -Example 2:
> -For a system where Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM
> -nodes.
> +      $ cat /sys/devices/system/memtier/memtier0/nodelist
> +      1
> +      $ cat /sys/devices/system/memtier/memtier1/nodelist
> +      0
> +      $ cat /sys/devices/system/memtier/memtier2/nodelist
> +      2
>  
> -$ cat /sys/devices/system/memtier/memtier0/nodelist
> -cat: /sys/devices/system/memtier/memtier0/nodelist: No such file or
> -directory
> -$ cat /sys/devices/system/memtier/memtier1/nodelist
> -0-1
> -$ cat /sys/devices/system/memtier/memtier2/nodelist
> -2-3
> +2. On a system where Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM
> +   nodes:
>  
> -Default memory tier can be read from,
> -/sys/devices/system/memtier/default_tier (Read-Only)
> +   .. code-block::
>  
> -e.g.
> -$ cat /sys/devices/system/memtier/default_tier
> -memtier200
> +      $ cat /sys/devices/system/memtier/memtier0/nodelist
> +      cat: /sys/devices/system/memtier/memtier0/nodelist: No such file or
> +      directory
> +      $ cat /sys/devices/system/memtier/memtier1/nodelist
> +      0-1
> +      $ cat /sys/devices/system/memtier/memtier2/nodelist
> +      2-3
>  
> -Max memory tier ID supported can be read from,
> -/sys/devices/system/memtier/max_tier (Read-Only)
> +Default memory tier can be read from
> +/sys/devices/system/memtier/default_tier (read-only), e.g.:
>  
> -e.g.
> -$ cat /sys/devices/system/memtier/max_tier
> -400
> +.. code-block::
>  
> -Individual node's memory tier can be read of set using,
> -/sys/devices/system/node/nodeN/memtier	(Read-Write)
> +   $ cat /sys/devices/system/memtier/default_tier
> +   memtier200
>  
> -where N = node id
> +Max memory tier ID supported can be read from
> +/sys/devices/system/memtier/max_tier (read-only), e.g.:
>  
> -When this interface is written, Node is moved from the old memory tier
> +.. code-block::
> +
> +   $ cat /sys/devices/system/memtier/max_tier
> +   400
> +
> +Individual node's memory tier can be read or set using
> +/sys/devices/system/node/nodeN/memtier (read-write), where N = node id.
> +
> +When this interface is written, node is moved from the old memory tier
>  to new memory tier and demotion targets for all N_MEMORY nodes are
>  built again.
>  
> -For example 1 mentioned above,
> -$ cat /sys/devices/system/node/node0/memtier
> -1
> -$ cat /sys/devices/system/node/node1/memtier
> -0
> -$ cat /sys/devices/system/node/node2/memtier
> -2
> +For example 1 mentioned above:
> +
> +.. code-block::
> +
> +   $ cat /sys/devices/system/node/node0/memtier
> +   1
> +   $ cat /sys/devices/system/node/node1/memtier
> +   0
> +   $ cat /sys/devices/system/node/node2/memtier
> +   2
>  
>  Additional memory tiers can be created by writing a tier ID value to this file.
> -This results in a new memory tier creation and moving the specific NUMA node to
> -that memory tier.
> +This results into creating a new tier and moving the specific NUMA node to
> +that tier.
>  
>  Demotion
>  ========
> @@ -128,19 +132,20 @@ be used.
>  
>  Instead of a page being discarded during reclaim, it can be moved to
>  persistent memory. Allowing page migration during reclaim enables
> -these systems to migrate pages from fast(higher) tiers to slow(lower)
> -tiers when the fast(higher) tier is under pressure.
> +these systems to migrate pages from fast (higher) tiers to slow (lower)
> +tiers when the fast (higher) tier is under pressure.
>  
>  
>  Enable/Disable demotion
>  -----------------------
>  
> -By default demotion is disabled, it can be enabled/disabled using
> -below sysfs interface,
> +By default demotion is disabled. It can be toggled by:
>  
> -$ echo 0/1 or false/true > /sys/kernel/mm/numa/demotion_enabled
> +.. code-block::
>  
> -preferred and allowed demotion nodes
> +   $ echo 0/1 or false/true > /sys/kernel/mm/numa/demotion_enabled
> +
> +Preferred and allowed demotion nodes
>  ------------------------------------
>  
>  Preferred nodes for a specific N_MEMORY node are the best nodes
> @@ -148,35 +153,40 @@ from the next possible lower memory tier. Allowed nodes for any
>  node are all the nodes available in all possible lower memory
>  tiers.
>  
> -Example:
> +For example, on a system where Node 0 & 1 are CPU + DRAM nodes,
> +node 2 & 3 are PMEM nodes:
>  
> -For a system where Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM
> -nodes,
> +  * node distances
>  
> -node distances:
> -node   0    1    2    3
> -   0  10   20   30   40
> -   1  20   10   40   30
> -   2  30   40   10   40
> -   3  40   30   40   10
> +    ====  ==   ==   ==   ==
> +    node   0    1    2    3
> +    ====  ==   ==   ==   ==
> +       0  10   20   30   40
> +       1  20   10   40   30
> +       2  30   40   10   40
> +       3  40   30   40   10
> +    ====  ==   ==   ==   ==
>  
> -memory_tiers[0] = <empty>
> -memory_tiers[1] = 0-1
> -memory_tiers[2] = 2-3
> +  * node properties
>  
> -node_demotion[0].preferred = 2
> -node_demotion[0].allowed   = 2, 3
> -node_demotion[1].preferred = 3
> -node_demotion[1].allowed   = 3, 2
> -node_demotion[2].preferred = <empty>
> -node_demotion[2].allowed   = <empty>
> -node_demotion[3].preferred = <empty>
> -node_demotion[3].allowed   = <empty>
> +    .. code-block::
> +
> +       memory_tiers[0] = <empty>
> +       memory_tiers[1] = 0-1
> +       memory_tiers[2] = 2-3
> +
> +       node_demotion[0].preferred = 2
> +       node_demotion[0].allowed   = 2, 3
> +       node_demotion[1].preferred = 3
> +       node_demotion[1].allowed   = 3, 2
> +       node_demotion[2].preferred = <empty>
> +       node_demotion[2].allowed   = <empty>
> +       node_demotion[3].preferred = <empty>
> +       node_demotion[3].allowed   = <empty>
>  
>  Memory allocation for demotion
>  ------------------------------
>  
> -If a page needs to be demoted from any node, the kernel 1st tries
> -to allocate a new page from the node's preferred node and fallbacks to
> -node's allowed targets in allocation fallback order.
> -
> +If a page needs to be demoted from any node, the kernel first tries
> +to allocate a new page from the node's preferred target node and fallbacks
> +to node's allowed targets in allocation fallback order.
>
>
> Thanks.
>
> [1]: https://lore.kernel.org/linux-doc/YrZ5cTFOSuWxlF2t@debian.me/
>
> -- 
> An old man doll... just what I always wanted! - Clara
Souptick Joarder June 30, 2022, 12:57 a.m. UTC | #5
On Wed, Jun 22, 2022 at 2:04 PM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> From: Jagdish Gediya <jvgediya@linux.ibm.com>
>
> All N_MEMORY nodes are divided into 3 memoty tiers with tier ID value

s/memoty/memory/

> MEMORY_TIER_HBM_GPU, MEMORY_TIER_DRAM and MEMORY_TIER_PMEM. By default,
> all nodes are assigned to default memory tier.

I think adding the default memory tier name will be helpful.

>
> Demotion path for all N_MEMORY nodes is prepared based on the tier ID value
> of memory tiers.
>
> This patch adds documention for memory tiering introduction, its sysfs
> interfaces and how demotion is performed based on memory tiers.
>
> Suggested-by: Wei Xu <weixugc@google.com>
> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  Documentation/admin-guide/mm/index.rst        |   1 +
>  .../admin-guide/mm/memory-tiering.rst         | 182 ++++++++++++++++++
>  2 files changed, 183 insertions(+)
>  create mode 100644 Documentation/admin-guide/mm/memory-tiering.rst
>
> diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
> index c21b5823f126..3f211cbca8c3 100644
> --- a/Documentation/admin-guide/mm/index.rst
> +++ b/Documentation/admin-guide/mm/index.rst
> @@ -32,6 +32,7 @@ the Linux memory management.
>     idle_page_tracking
>     ksm
>     memory-hotplug
> +   memory-tiering
>     nommu-mmap
>     numa_memory_policy
>     numaperf
> diff --git a/Documentation/admin-guide/mm/memory-tiering.rst b/Documentation/admin-guide/mm/memory-tiering.rst
> new file mode 100644
> index 000000000000..142c36651f5d
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/memory-tiering.rst
> @@ -0,0 +1,182 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +.. _admin_guide_memory_tiering:
> +
> +===========
> +Memory tiers
> +============
> +
> +This document describes explicit memory tiering support along with
> +demotion based on memory tiers.
> +
> +Introduction
> +============
> +
> +Many systems have multiple types of memory devices e.g. GPU, DRAM and
> +PMEM. The memory subsystem of these systems can be called a memory
> +tiering system because the performance of the different types of
> +memory is different. Memory tiers are defined based on the hardware
> +capabilities of memory nodes. Each memory tier is assigned a tier ID
> +value that determines the memory tier position in demotion order.
> +
> +The memory tier assignment of each node is independent of each
> +other. Moving a node from one tier to another tier doesn't affect
> +the tier assignment of any other node.
> +
> +Memory tiers are used to build the demotion targets for nodes. A node
> +can demote its pages to any node of any lower tiers.
> +
> +Memory tier rank
> +=================
> +
> +Memory nodes are divided into 3 types of memory tiers with tier ID
> +value as shown based on their hardware characteristics.
> +
> +
> +MEMORY_TIER_HBM_GPU
> +MEMORY_TIER_DRAM
> +MEMORY_TIER_PMEM
> +
> +Memory tiers initialization and (re)assignments
> +===============================================
> +
> +By default, all nodes are assigned to the memory tier with the default tier ID
> +DEFAULT_MEMORY_TIER which is 200 (MEMORY_TIER_DRAM). The memory tier of
> +the memory node can be either modified through sysfs or from the driver. On
> +hotplug, the memory tier with default tier ID is assigned to the memory node.
> +
> +
> +Sysfs interfaces
> +================
> +
> +Nodes belonging to specific tier can be read from,
> +/sys/devices/system/memtier/memtierN/nodelist (Read-Only)
> +
> +Where N is 0 - 2.
> +
> +Example 1:
> +For a system where Node 0 is CPU + DRAM nodes, Node 1 is HBM node,
> +node 2 is a PMEM node an ideal tier layout will be
> +
> +$ cat /sys/devices/system/memtier/memtier0/nodelist
> +1
> +$ cat /sys/devices/system/memtier/memtier1/nodelist
> +0
> +$ cat /sys/devices/system/memtier/memtier2/nodelist
> +2
> +
> +Example 2:
> +For a system where Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM
> +nodes.
> +
> +$ cat /sys/devices/system/memtier/memtier0/nodelist
> +cat: /sys/devices/system/memtier/memtier0/nodelist: No such file or
> +directory
> +$ cat /sys/devices/system/memtier/memtier1/nodelist
> +0-1
> +$ cat /sys/devices/system/memtier/memtier2/nodelist
> +2-3
> +
> +Default memory tier can be read from,
> +/sys/devices/system/memtier/default_tier (Read-Only)
> +
> +e.g.
> +$ cat /sys/devices/system/memtier/default_tier
> +memtier200
> +
> +Max memory tier ID supported can be read from,
> +/sys/devices/system/memtier/max_tier (Read-Only)
> +
> +e.g.
> +$ cat /sys/devices/system/memtier/max_tier
> +400
> +
> +Individual node's memory tier can be read of set using,
> +/sys/devices/system/node/nodeN/memtier (Read-Write)
> +
> +where N = node id
> +
> +When this interface is written, Node is moved from the old memory tier
> +to new memory tier and demotion targets for all N_MEMORY nodes are
> +built again.
> +
> +For example 1 mentioned above,
> +$ cat /sys/devices/system/node/node0/memtier
> +1
> +$ cat /sys/devices/system/node/node1/memtier
> +0
> +$ cat /sys/devices/system/node/node2/memtier
> +2
> +
> +Additional memory tiers can be created by writing a tier ID value to this file.
> +This results in a new memory tier creation and moving the specific NUMA node to
> +that memory tier.
> +
> +Demotion
> +========
> +
> +In a system with DRAM and persistent memory, once DRAM
> +fills up, reclaim will start and some of the DRAM contents will be
> +thrown out even if there is a space in persistent memory.
> +Consequently, allocations will, at some point, start falling over to the slower
> +persistent memory.
> +
> +That has two nasty properties. First, the newer allocations can end up in
> +the slower persistent memory. Second, reclaimed data in DRAM are just
> +discarded even if there are gobs of space in persistent memory that could
> +be used.
> +
> +Instead of a page being discarded during reclaim, it can be moved to
> +persistent memory. Allowing page migration during reclaim enables
> +these systems to migrate pages from fast(higher) tiers to slow(lower)
> +tiers when the fast(higher) tier is under pressure.
> +
> +
> +Enable/Disable demotion
> +-----------------------
> +
> +By default demotion is disabled, it can be enabled/disabled using
> +below sysfs interface,
> +
> +$ echo 0/1 or false/true > /sys/kernel/mm/numa/demotion_enabled
> +
> +preferred and allowed demotion nodes
> +------------------------------------
> +
> +Preferred nodes for a specific N_MEMORY node are the best nodes
> +from the next possible lower memory tier. Allowed nodes for any
> +node are all the nodes available in all possible lower memory
> +tiers.
> +
> +Example:
> +
> +For a system where Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM
> +nodes,
> +
> +node distances:
> +node   0    1    2    3
> +   0  10   20   30   40
> +   1  20   10   40   30
> +   2  30   40   10   40
> +   3  40   30   40   10
> +
> +memory_tiers[0] = <empty>
> +memory_tiers[1] = 0-1
> +memory_tiers[2] = 2-3
> +
> +node_demotion[0].preferred = 2
> +node_demotion[0].allowed   = 2, 3
> +node_demotion[1].preferred = 3
> +node_demotion[1].allowed   = 3, 2
> +node_demotion[2].preferred = <empty>
> +node_demotion[2].allowed   = <empty>
> +node_demotion[3].preferred = <empty>
> +node_demotion[3].allowed   = <empty>
> +
> +Memory allocation for demotion
> +------------------------------
> +
> +If a page needs to be demoted from any node, the kernel 1st tries
> +to allocate a new page from the node's preferred node and fallbacks to
> +node's allowed targets in allocation fallback order.
> +
> --
> 2.36.1
>
>
Patch

diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index c21b5823f126..3f211cbca8c3 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -32,6 +32,7 @@  the Linux memory management.
    idle_page_tracking
    ksm
    memory-hotplug
+   memory-tiering
    nommu-mmap
    numa_memory_policy
    numaperf
diff --git a/Documentation/admin-guide/mm/memory-tiering.rst b/Documentation/admin-guide/mm/memory-tiering.rst
new file mode 100644
index 000000000000..142c36651f5d
--- /dev/null
+++ b/Documentation/admin-guide/mm/memory-tiering.rst
@@ -0,0 +1,182 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _admin_guide_memory_tiering:
+
+============
+Memory tiers
+============
+
+This document describes explicit memory tiering support along with
+demotion based on memory tiers.
+
+Introduction
+============
+
+Many systems have multiple types of memory devices, e.g. GPU, DRAM and
+PMEM. The memory subsystem of such a system can be called a memory
+tiering system because the different types of memory differ in
+performance. Memory tiers are defined based on the hardware
+capabilities of memory nodes. Each memory tier is assigned a tier ID
+value that determines the memory tier's position in the demotion order.
+
+The memory tier assignment of each node is independent of the
+others. Moving a node from one tier to another tier doesn't affect
+the tier assignment of any other node.
+
+Memory tiers are used to build the demotion targets for nodes. A node
+can demote its pages to any node in any lower tier.
+
+Memory tier rank
+=================
+
+Memory nodes are divided into 3 memory tiers based on their hardware
+characteristics. The tiers are identified by the following tier IDs:
+
+
+* MEMORY_TIER_HBM_GPU
+* MEMORY_TIER_DRAM
+* MEMORY_TIER_PMEM
+
+Memory tiers initialization and (re)assignments
+===============================================
+
+By default, all nodes are assigned to the memory tier with the default tier ID
+DEFAULT_MEMORY_TIER, which is 200 (MEMORY_TIER_DRAM). The memory tier of a
+memory node can be modified either through sysfs or from a driver. On
+hotplug, the memory node is assigned to the memory tier with the default tier ID.
+
+
+Sysfs interfaces
+================
+
+The nodes belonging to a specific tier can be read from:
+/sys/devices/system/memtier/memtierN/nodelist (Read-Only)
+
+Where N is 0 - 2.
+
+Example 1:
+For a system where node 0 is a CPU + DRAM node, node 1 is an HBM node and
+node 2 is a PMEM node, an ideal tier layout will be::
+
+  $ cat /sys/devices/system/memtier/memtier0/nodelist
+  1
+  $ cat /sys/devices/system/memtier/memtier1/nodelist
+  0
+  $ cat /sys/devices/system/memtier/memtier2/nodelist
+  2
+
+Example 2:
+For a system where nodes 0 & 1 are CPU + DRAM nodes and nodes 2 & 3 are PMEM
+nodes::
+
+  $ cat /sys/devices/system/memtier/memtier0/nodelist
+  cat: /sys/devices/system/memtier/memtier0/nodelist: No such file or
+  directory
+  $ cat /sys/devices/system/memtier/memtier1/nodelist
+  0-1
+  $ cat /sys/devices/system/memtier/memtier2/nodelist
+  2-3
+
+The default memory tier can be read from
+/sys/devices/system/memtier/default_tier (Read-Only).
+For example::
+
+  $ cat /sys/devices/system/memtier/default_tier
+  memtier200
+
+The maximum supported memory tier ID can be read from
+/sys/devices/system/memtier/max_tier (Read-Only).
+For example::
+
+  $ cat /sys/devices/system/memtier/max_tier
+  400
+
+An individual node's memory tier can be read or set using:
+/sys/devices/system/node/nodeN/memtier (Read-Write)
+
+where N is the node ID.
+
+When this interface is written, the node is moved from its old memory tier to
+the new one, and the demotion targets for all N_MEMORY nodes are rebuilt.
+
+For example 1 mentioned above::
+
+  $ cat /sys/devices/system/node/node0/memtier
+  1
+  $ cat /sys/devices/system/node/node1/memtier
+  0
+  $ cat /sys/devices/system/node/node2/memtier
+  2
+
+Additional memory tiers can be created by writing a new tier ID value to this
+file. This creates the new memory tier and moves the specific NUMA node to
+that memory tier.
+
+Demotion
+========
+
+In a system with DRAM and persistent memory, once DRAM
+fills up, reclaim will start and some of the DRAM contents will be
+thrown out even if there is space in persistent memory.
+Consequently, allocations will, at some point, start falling over to the slower
+persistent memory.
+
+That has two nasty properties. First, the newer allocations can end up in
+the slower persistent memory. Second, reclaimed data in DRAM are just
+discarded even if there are gobs of space in persistent memory that could
+be used.
+
+Instead of a page being discarded during reclaim, it can be moved to
+persistent memory. Allowing page migration during reclaim enables
+these systems to migrate pages from fast (higher) tiers to slow (lower)
+tiers when the fast (higher) tier is under pressure.
+
+
+Enable/Disable demotion
+-----------------------
+
+By default, demotion is disabled. It can be enabled or disabled by writing
+1/true or 0/false to the sysfs interface below, e.g. to enable it::
+
+  $ echo 1 > /sys/kernel/mm/numa/demotion_enabled
+
+Preferred and allowed demotion nodes
+------------------------------------
+
+Preferred nodes for a specific N_MEMORY node are the best nodes
+from the next possible lower memory tier. Allowed nodes for any
+node are all the nodes available in all possible lower memory
+tiers.
+
+Example:
+
+For a system where nodes 0 & 1 are CPU + DRAM nodes and nodes 2 & 3 are PMEM
+nodes::
+
+  node distances:
+  node   0    1    2    3
+     0  10   20   30   40
+     1  20   10   40   30
+     2  30   40   10   40
+     3  40   30   40   10
+
+  memory_tiers[0] = <empty>
+  memory_tiers[1] = 0-1
+  memory_tiers[2] = 2-3
+
+  node_demotion[0].preferred = 2
+  node_demotion[0].allowed   = 2, 3
+  node_demotion[1].preferred = 3
+  node_demotion[1].allowed   = 3, 2
+  node_demotion[2].preferred = <empty>
+  node_demotion[2].allowed   = <empty>
+  node_demotion[3].preferred = <empty>
+  node_demotion[3].allowed   = <empty>
+
+Memory allocation for demotion
+------------------------------
+
+If a page needs to be demoted from any node, the kernel first tries
+to allocate a new page from the node's preferred target node and falls back
+to the node's allowed targets in allocation fallback order.
+
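For readers who want to check the preferred/allowed example in the patch by hand, the selection rule it describes can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not the kernel implementation: the function name `build_demotion_targets` and its data layout are hypothetical, the tiers are simply passed as a list ordered from fastest to slowest, and ordering the allowed nodes by NUMA distance is an assumption chosen because it reproduces the example's output.

```python
def build_demotion_targets(tiers, distance):
    """Compute illustrative demotion targets.

    tiers:    list of node lists, ordered from the fastest tier to the
              slowest one (hypothetical layout, not a kernel structure).
    distance: distance[a][b] is the NUMA distance from node a to node b.
    Returns {node: {"preferred": [...], "allowed": [...]}}.
    """
    targets = {}
    for level, nodes in enumerate(tiers):
        # Only non-empty tiers below the current one can take demoted pages.
        lower_tiers = [t for t in tiers[level + 1:] if t]
        for node in nodes:
            # Allowed: every node in every lower tier.
            # Assumption: listed nearest first, by NUMA distance.
            allowed = sorted(
                (n for t in lower_tiers for n in t),
                key=lambda n: distance[node][n],
            )
            # Preferred: the nearest node(s) in the next lower tier.
            preferred = []
            if lower_tiers:
                best = min(distance[node][n] for n in lower_tiers[0])
                preferred = [
                    n for n in lower_tiers[0] if distance[node][n] == best
                ]
            targets[node] = {"preferred": preferred, "allowed": allowed}
    return targets


# Node distances and tier layout taken from the example in the patch.
distance = [
    [10, 20, 30, 40],
    [20, 10, 40, 30],
    [30, 40, 10, 40],
    [40, 30, 40, 10],
]
tiers = [[], [0, 1], [2, 3]]

targets = build_demotion_targets(tiers, distance)
for node in sorted(targets):
    print(node, targets[node])
```

With the distances above, node 0 prefers node 2 and node 1 prefers node 3, while the PMEM nodes 2 and 3 have no targets because no lower tier exists, which matches the node_demotion values listed in the document.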