@@ -32,6 +32,7 @@ the Linux memory management.
idle_page_tracking
ksm
memory-hotplug
+ memory-tiering
nommu-mmap
numa_memory_policy
numaperf
new file mode 100644
@@ -0,0 +1,192 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _admin_guide_memory_tiering:
+
+============
+Memory tiers
+============
+
+This document describes explicit memory tiering support along with
+demotion based on memory tiers.
+
+Introduction
+============
+
+Many systems have multiple types of memory devices, e.g. GPU memory,
+DRAM and PMEM. The memory subsystem of such a system can be called a
+memory tiering system because each type of memory has different
+performance characteristics. Memory tiers are defined based on the
+hardware capabilities of memory nodes. Each memory tier is assigned a
+tier ID value that determines the memory tier's position in the
+demotion order.
+
+The memory tier assignment of each node is independent of that of any
+other node. Moving a node from one tier to another doesn't affect
+the tier assignment of any other node.
+
+Memory tiers are used to build the demotion targets for nodes. A node
+can demote its pages to any node in any lower tier.
+
+Memory tier ID
+==============
+
+Based on their hardware characteristics, memory nodes are divided into
+three memory tiers, each identified by a tier ID value:
+
+ * MEMORY_TIER_HBM_GPU
+ * MEMORY_TIER_DRAM
+ * MEMORY_TIER_PMEM
+
+Memory tiers initialization and (re)assignments
+===============================================
+
+By default, all nodes are assigned to the memory tier with the default
+tier ID DEFAULT_MEMORY_TIER, which is 200 (MEMORY_TIER_DRAM). The
+memory tier of a memory node can be modified either through sysfs or
+by the driver. On hotplug, a memory node is assigned to the memory
+tier with the default tier ID.
+
+
+Sysfs interfaces
+================
+
+The nodes belonging to a specific tier can be read from
+/sys/devices/system/memtier/memtierN/nodelist (read-only).
+
+Examples:
+
+1. On a system where node 0 is a CPU + DRAM node, node 1 is an HBM
+   node and node 2 is a PMEM node, an ideal tier layout will be
+
+ .. code-block:: sh
+
+ $ cat /sys/devices/system/memtier/memtier0/nodelist
+ 1
+ $ cat /sys/devices/system/memtier/memtier1/nodelist
+ 0
+ $ cat /sys/devices/system/memtier/memtier2/nodelist
+ 2
+
+2. On a system where nodes 0 & 1 are CPU + DRAM nodes and nodes 2 & 3
+   are PMEM nodes:
+
+ .. code-block:: sh
+
+ $ cat /sys/devices/system/memtier/memtier0/nodelist
+ cat: /sys/devices/system/memtier/memtier0/nodelist: No such file or directory
+ $ cat /sys/devices/system/memtier/memtier1/nodelist
+ 0-1
+ $ cat /sys/devices/system/memtier/memtier2/nodelist
+ 2-3
+
+The default memory tier can be read from
+/sys/devices/system/memtier/default_tier (read-only).
+
+ .. code-block:: sh
+
+ $ cat /sys/devices/system/memtier/default_tier
+ memtier200
+
+The maximum supported memory tier ID can be read from
+/sys/devices/system/memtier/max_tier (read-only).
+
+ .. code-block:: sh
+
+ $ cat /sys/devices/system/memtier/max_tier
+ 400
+
+An individual node's memory tier can be read or set using
+/sys/devices/system/node/nodeN/memtier (read-write), where N is the
+node id.
+
+When this interface is written to, the node is moved from its old
+memory tier to the new memory tier and the demotion targets for all
+N_MEMORY nodes are rebuilt.
+
+For the tier layout of example 1 above,
+
+ .. code-block:: sh
+
+ $ cat /sys/devices/system/node/node0/memtier
+ 1
+ $ cat /sys/devices/system/node/node1/memtier
+ 0
+ $ cat /sys/devices/system/node/node2/memtier
+ 2
+
+Additional memory tiers can be created by writing a new tier ID value
+to this file. This creates the new memory tier and moves the specified
+NUMA node into it.
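+
+For instance, continuing example 1 above, node 2 could hypothetically
+be moved out of the PMEM tier by writing a different tier ID to its
+memtier file (the value below is illustrative and assumes the same ID
+scheme shown by the read examples above):
+
+ .. code-block:: sh
+
+     $ echo 1 > /sys/devices/system/node/node2/memtier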
+
+Demotion
+========
+
+In a system with DRAM and persistent memory, once DRAM fills up,
+reclaim will start and some of the DRAM contents will be thrown out
+even if there is space in persistent memory. Consequently, allocations
+will, at some point, start falling back to the slower persistent
+memory.
+
+That has two nasty properties. First, the newer allocations can end up in
+the slower persistent memory. Second, reclaimed data in DRAM are just
+discarded even if there are gobs of space in persistent memory that could
+be used.
+
+Instead of a page being discarded during reclaim, it can be moved to
+persistent memory. Allowing page migration during reclaim enables
+these systems to migrate pages from fast (higher) tiers to slow (lower)
+tiers when the fast (higher) tier is under pressure.
+
+
+Enable/Disable demotion
+-----------------------
+
+By default demotion is disabled. It can be enabled or disabled using
+the sysfs interface below, which accepts 0/1 as well as false/true,
+
+ .. code-block:: sh
+
+     $ echo true > /sys/kernel/mm/numa/demotion_enabled
+
+Preferred and allowed demotion nodes
+------------------------------------
+
+Preferred nodes for a specific N_MEMORY node are the best nodes
+from the next possible lower memory tier. Allowed nodes for any
+node are all the nodes available in all possible lower memory
+tiers.
+
+For example, on a system where nodes 0 & 1 are CPU + DRAM nodes and
+nodes 2 & 3 are PMEM nodes,
+
+ * node distances:
+
+ ==== == == == ==
+ node 0 1 2 3
+ ==== == == == ==
+ 0 10 20 30 40
+ 1 20 10 40 30
+ 2 30 40 10 40
+ 3 40 30 40 10
+ ==== == == == ==
+
+
+ .. code-block:: none
+
+ memory_tiers[0] = <empty>
+ memory_tiers[1] = 0-1
+ memory_tiers[2] = 2-3
+
+ node_demotion[0].preferred = 2
+ node_demotion[0].allowed = 2, 3
+ node_demotion[1].preferred = 3
+ node_demotion[1].allowed = 3, 2
+ node_demotion[2].preferred = <empty>
+ node_demotion[2].allowed = <empty>
+ node_demotion[3].preferred = <empty>
+ node_demotion[3].allowed = <empty>
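+
+The node distance table above is what the kernel already exposes via
+sysfs; on such a system it could be inspected per node with, e.g.,
+
+ .. code-block:: sh
+
+     $ cat /sys/devices/system/node/node0/distance
+     10 20 30 40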
+
+Memory allocation for demotion
+------------------------------
+
+If a page needs to be demoted from a node, the kernel first tries to
+allocate a new page from the node's preferred demotion node and falls
+back to the node's allowed targets in allocation fallback order.
+
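+For the example system above, the resulting demotion allocation order
+can be sketched as,
+
+ .. code-block:: none
+
+    demoting from node 0: try node 2 (preferred), then node 3
+    demoting from node 1: try node 3 (preferred), then node 2
+    demoting from node 2 or 3: no lower tier, so pages are reclaimed
+    normally instead of being demoted
+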