Message ID: 20220704070612.299585-1-aneesh.kumar@linux.ibm.com
Series: mm/demotion: Memory tiers and demotion
On Mon, Jul 04, 2022 at 12:36:00PM +0530, Aneesh Kumar K.V wrote: > * The current tier initialization code always initializes > each memory-only NUMA node into a lower tier. But a memory-only > NUMA node may have a high performance memory device (e.g. a DRAM > device attached via CXL.mem or a DRAM-backed memory-only node on > a virtual machine) and should be put into a higher tier. > > * The current tier hierarchy always puts CPU nodes into the top > tier. But on a system with HBM (e.g. GPU memory) devices, these > memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes > with CPUs are better to be placed into the next lower tier. These things that you identify as problems seem perfectly sensible to me. Memory which is attached to this CPU has the lowest latency and should be preferred over more remote memory, no matter its bandwidth.
Matthew Wilcox <willy@infradead.org> writes: > On Mon, Jul 04, 2022 at 12:36:00PM +0530, Aneesh Kumar K.V wrote: >> * The current tier initialization code always initializes >> each memory-only NUMA node into a lower tier. But a memory-only >> NUMA node may have a high performance memory device (e.g. a DRAM >> device attached via CXL.mem or a DRAM-backed memory-only node on >> a virtual machine) and should be put into a higher tier. >> >> * The current tier hierarchy always puts CPU nodes into the top >> tier. But on a system with HBM (e.g. GPU memory) devices, these >> memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes >> with CPUs are better to be placed into the next lower tier. > > These things that you identify as problems seem perfectly sensible to me. > Memory which is attached to this CPU has the lowest latency and should > be preferred over more remote memory, no matter its bandwidth. It is a problem because HBM NUMA node memory is generally also used by some kind of device/accelerator (eg. GPU). Typically users would prefer to keep HBM memory for use by the accelerator rather than random pages demoted from the CPU as accelerators have orders of magnitude better performance when accessing local HBM vs. remote memory.
On 7/4/22 8:30 PM, Matthew Wilcox wrote: > On Mon, Jul 04, 2022 at 12:36:00PM +0530, Aneesh Kumar K.V wrote: >> * The current tier initialization code always initializes >> each memory-only NUMA node into a lower tier. But a memory-only >> NUMA node may have a high performance memory device (e.g. a DRAM >> device attached via CXL.mem or a DRAM-backed memory-only node on >> a virtual machine) and should be put into a higher tier. >> >> * The current tier hierarchy always puts CPU nodes into the top >> tier. But on a system with HBM (e.g. GPU memory) devices, these >> memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes >> with CPUs are better to be placed into the next lower tier. > > These things that you identify as problems seem perfectly sensible to me. > Memory which is attached to this CPU has the lowest latency and should > be preferred over more remote memory, no matter its bandwidth. Allocation will prefer local memory over remote memory. Memory tiers are used during demotion and currently, the kernel demotes cold pages from DRAM memory to these special device memories because they appear as memory-only NUMA nodes. In many cases (ex: GPU) what is desired is the demotion of cold pages from GPU memory to DRAM or even slow memory. This patchset builds a framework to enable such demotion criteria. -aneesh
Hi, Aneesh, "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: > The current kernel has the basic memory tiering support: Inactive > pages on a higher tier NUMA node can be migrated (demoted) to a lower > tier NUMA node to make room for new allocations on the higher tier > NUMA node. Frequently accessed pages on a lower tier NUMA node can be > migrated (promoted) to a higher tier NUMA node to improve the > performance. > > In the current kernel, memory tiers are defined implicitly via a > demotion path relationship between NUMA nodes, which is created during > the kernel initialization and updated when a NUMA node is hot-added or > hot-removed. The current implementation puts all nodes with CPU into > the top tier, and builds the tier hierarchy tier-by-tier by establishing > the per-node demotion targets based on the distances between nodes. > > This current memory tier kernel interface needs to be improved for > several important use cases: > > * The current tier initialization code always initializes > each memory-only NUMA node into a lower tier. But a memory-only > NUMA node may have a high performance memory device (e.g. a DRAM > device attached via CXL.mem or a DRAM-backed memory-only node on > a virtual machine) and should be put into a higher tier. > > * The current tier hierarchy always puts CPU nodes into the top > tier. But on a system with HBM (e.g. GPU memory) devices, these > memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes > with CPUs are better to be placed into the next lower tier. > > * Also because the current tier hierarchy always puts CPU nodes > into the top tier, when a CPU is hot-added (or hot-removed) and > triggers a memory node from CPU-less into a CPU node (or vice > versa), the memory tier hierarchy gets changed, even though no > memory node is added or removed. This can make the tier > hierarchy unstable and make it difficult to support tier-based > memory accounting. > > * A higher tier node can only be demoted to selected nodes on the > next lower tier as defined by the demotion path, not any other > node from any lower tier. This strict, hard-coded demotion order > does not work in all use cases (e.g. some use cases may want to > allow cross-socket demotion to another node in the same demotion > tier as a fallback when the preferred demotion node is out of > space), and has resulted in the feature request for an interface to > override the system-wide, per-node demotion order from the > userspace. This demotion order is also inconsistent with the page > allocation fallback order when all the nodes in a higher tier are > out of space: The page allocation can fall back to any node from > any lower tier, whereas the demotion order doesn't allow that. > > * There are no interfaces for the userspace to learn about the memory > tier hierarchy in order to optimize its memory allocations. > > This patch series make the creation of memory tiers explicit under > the control of userspace or device driver. > > Memory Tier Initialization > ========================== > > By default, all memory nodes are assigned to the default tier with > tier ID value 200. > > A device driver can move up or down its memory nodes from the default > tier. For example, PMEM can move down its memory nodes below the > default tier, whereas GPU can move up its memory nodes above the > default tier. 
> > The kernel initialization code makes the decision on which exact tier > a memory node should be assigned to based on the requests from the > device drivers as well as the memory device hardware information > provided by the firmware. > > Hot-adding/removing CPUs doesn't affect memory tier hierarchy. > > Memory Allocation for Demotion > ============================== > This patch series keep the demotion target page allocation logic same. > The demotion page allocation pick the closest NUMA node in the > next lower tier to the current NUMA node allocating pages from. > > This will be later improved to use the same page allocation strategy > using fallback list. > > Sysfs Interface: > ------------- > Listing current list of memory tiers details: > > :/sys/devices/system/memtier$ ls > default_tier max_tier memtier1 power uevent > :/sys/devices/system/memtier$ cat default_tier > memtier200 > :/sys/devices/system/memtier$ cat max_tier > 400 > :/sys/devices/system/memtier$ > > Per node memory tier details: > > For a cpu only NUMA node: > > :/sys/devices/system/node# cat node0/memtier > :/sys/devices/system/node# echo 1 > node0/memtier > :/sys/devices/system/node# cat node0/memtier > :/sys/devices/system/node# > > For a NUMA node with memory: > :/sys/devices/system/node# cat node1/memtier > 1 > :/sys/devices/system/node# ls ../memtier/ > default_tier max_tier memtier1 power uevent > :/sys/devices/system/node# echo 2 > node1/memtier > :/sys/devices/system/node# > :/sys/devices/system/node# ls ../memtier/ > default_tier max_tier memtier1 memtier2 power uevent > :/sys/devices/system/node# cat node1/memtier > 2 > :/sys/devices/system/node# > > Removing a memory tier > :/sys/devices/system/node# cat node1/memtier > 2 > :/sys/devices/system/node# echo 1 > node1/memtier Thanks a lot for your patchset. Per my understanding, we haven't reach consensus on - how to create the default memory tiers in kernel (via abstract distance provided by drivers? Or use SLIT as the first step?) - how to override the default memory tiers from user space As in the following thread and email, https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/ I think that we need to finalized on that firstly? Best Regards, Huang, Ying > :/sys/devices/system/node# > :/sys/devices/system/node# cat node1/memtier > 1 > :/sys/devices/system/node# > :/sys/devices/system/node# ls ../memtier/ > default_tier max_tier memtier1 power uevent > :/sys/devices/system/node# > > The above resulted in removal of memtier2 which was created in the earlier step. > > Changes from v7: > * Fix kernel crash with demotion. > * Improve documentation. > > Changes from v6: > * Drop the usage of rank. > * Address other review feedback. > > Changes from v5: > * Remove patch supporting N_MEMORY node removal from memory tiers. memory tiers > are going to be used for features other than demotion. Hence keep all N_MEMORY > nodes in memory tiers irrespective of whether they want to participate in promotion or demotion. > * Add NODE_DATA->memtier > * Rearrage patches to add sysfs files later. > * Add support to create memory tiers from userspace. > * Address other review feedback. > > > Changes from v4: > * Address review feedback. > * Reverse the meaning of "rank": higher rank value means higher tier. > * Add "/sys/devices/system/memtier/default_tier". > * Add node_is_toptier > > v4: > Add support for explicit memory tiers and ranks. 
> > v3: > - Modify patch 1 subject to make it more specific > - Remove /sys/kernel/mm/numa/demotion_targets interface, use > /sys/devices/system/node/demotion_targets instead and make > it writable to override node_states[N_DEMOTION_TARGETS]. > - Add support to view per node demotion targets via sysfs > > v2: > In v1, only 1st patch of this patch series was sent, which was > implemented to avoid some of the limitations on the demotion > target sharing, however for certain numa topology, the demotion > targets found by that patch was not most optimal, so 1st patch > in this series is modified according to suggestions from Huang > and Baolin. Different examples of demotion list comparasion > between existing implementation and changed implementation can > be found in the commit message of 1st patch. > > > Aneesh Kumar K.V (10): > mm/demotion: Add support for explicit memory tiers > mm/demotion: Move memory demotion related code > mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM > mm/demotion: Add hotplug callbacks to handle new numa node onlined > mm/demotion: Build demotion targets based on explicit memory tiers > mm/demotion: Expose memory tier details via sysfs > mm/demotion: Add per node memory tier attribute to sysfs > mm/demotion: Add pg_data_t member to track node memory tier details > mm/demotion: Update node_is_toptier to work with memory tiers > mm/demotion: Add sysfs ABI documentation > > Jagdish Gediya (2): > mm/demotion: Demote pages according to allocation fallback order > mm/demotion: Add documentation for memory tiering > > .../ABI/testing/sysfs-kernel-mm-memory-tiers | 61 ++ > Documentation/admin-guide/mm/index.rst | 1 + > .../admin-guide/mm/memory-tiering.rst | 192 +++++ > drivers/base/node.c | 42 + > drivers/dax/kmem.c | 6 +- > include/linux/memory-tiers.h | 72 ++ > include/linux/migrate.h | 15 - > include/linux/mmzone.h | 3 + > include/linux/node.h | 5 - > mm/Makefile | 1 + > mm/huge_memory.c | 1 + > mm/memory-tiers.c | 791 ++++++++++++++++++ > mm/migrate.c | 453 +--------- > mm/mprotect.c | 1 + > mm/vmscan.c | 59 +- > mm/vmstat.c | 4 - > 16 files changed, 1215 insertions(+), 492 deletions(-) > create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers > create mode 100644 Documentation/admin-guide/mm/memory-tiering.rst > create mode 100644 include/linux/memory-tiers.h > create mode 100644 mm/memory-tiers.c
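For reference, below is a minimal user-space sketch of how the sysfs layout proposed in the cover letter could be probed. It assumes the interface described above (/sys/devices/system/memtier plus a per-node "memtier" file) is present; the file names are taken from the examples in this thread and may change as the series evolves, so treat it as an illustration rather than a description of the final ABI.

/*
 * Hedged sketch (not part of the series): probe the sysfs layout
 * proposed in this patchset. On kernels without the interface the
 * reads simply fail and nothing is printed for those entries.
 */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

static int read_str(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, len, f)) {
		fclose(f);
		return -1;
	}
	buf[strcspn(buf, "\n")] = '\0';
	fclose(f);
	return 0;
}

int main(void)
{
	char buf[64], path[256];
	struct dirent *d;
	DIR *dir;

	if (!read_str("/sys/devices/system/memtier/default_tier", buf, sizeof(buf)))
		printf("default tier: %s\n", buf);
	if (!read_str("/sys/devices/system/memtier/max_tier", buf, sizeof(buf)))
		printf("max tier:     %s\n", buf);

	dir = opendir("/sys/devices/system/node");
	if (!dir)
		return 1;

	while ((d = readdir(dir))) {
		unsigned int nid;

		if (sscanf(d->d_name, "node%u", &nid) != 1)
			continue;
		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%u/memtier", nid);
		/* CPU-only nodes read back empty in the examples above. */
		if (read_str(path, buf, sizeof(buf)) || !buf[0])
			printf("node%u: no memory tier (cpu-only?)\n", nid);
		else
			printf("node%u: memtier %s\n", nid, buf);
	}
	closedir(dir);
	return 0;
}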
On 7/5/22 9:59 AM, Huang, Ying wrote: > Hi, Aneesh, > > "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: > >> The current kernel has the basic memory tiering support: Inactive >> pages on a higher tier NUMA node can be migrated (demoted) to a lower >> tier NUMA node to make room for new allocations on the higher tier >> NUMA node. Frequently accessed pages on a lower tier NUMA node can be >> migrated (promoted) to a higher tier NUMA node to improve the >> performance. >> >> In the current kernel, memory tiers are defined implicitly via a >> demotion path relationship between NUMA nodes, which is created during >> the kernel initialization and updated when a NUMA node is hot-added or >> hot-removed. The current implementation puts all nodes with CPU into >> the top tier, and builds the tier hierarchy tier-by-tier by establishing >> the per-node demotion targets based on the distances between nodes. >> >> This current memory tier kernel interface needs to be improved for >> several important use cases: >> >> * The current tier initialization code always initializes >> each memory-only NUMA node into a lower tier. But a memory-only >> NUMA node may have a high performance memory device (e.g. a DRAM >> device attached via CXL.mem or a DRAM-backed memory-only node on >> a virtual machine) and should be put into a higher tier. >> >> * The current tier hierarchy always puts CPU nodes into the top >> tier. But on a system with HBM (e.g. GPU memory) devices, these >> memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes >> with CPUs are better to be placed into the next lower tier. >> >> * Also because the current tier hierarchy always puts CPU nodes >> into the top tier, when a CPU is hot-added (or hot-removed) and >> triggers a memory node from CPU-less into a CPU node (or vice >> versa), the memory tier hierarchy gets changed, even though no >> memory node is added or removed. This can make the tier >> hierarchy unstable and make it difficult to support tier-based >> memory accounting. >> >> * A higher tier node can only be demoted to selected nodes on the >> next lower tier as defined by the demotion path, not any other >> node from any lower tier. This strict, hard-coded demotion order >> does not work in all use cases (e.g. some use cases may want to >> allow cross-socket demotion to another node in the same demotion >> tier as a fallback when the preferred demotion node is out of >> space), and has resulted in the feature request for an interface to >> override the system-wide, per-node demotion order from the >> userspace. This demotion order is also inconsistent with the page >> allocation fallback order when all the nodes in a higher tier are >> out of space: The page allocation can fall back to any node from >> any lower tier, whereas the demotion order doesn't allow that. >> >> * There are no interfaces for the userspace to learn about the memory >> tier hierarchy in order to optimize its memory allocations. >> >> This patch series make the creation of memory tiers explicit under >> the control of userspace or device driver. >> >> Memory Tier Initialization >> ========================== >> >> By default, all memory nodes are assigned to the default tier with >> tier ID value 200. >> >> A device driver can move up or down its memory nodes from the default >> tier. For example, PMEM can move down its memory nodes below the >> default tier, whereas GPU can move up its memory nodes above the >> default tier. 
>> >> The kernel initialization code makes the decision on which exact tier >> a memory node should be assigned to based on the requests from the >> device drivers as well as the memory device hardware information >> provided by the firmware. >> >> Hot-adding/removing CPUs doesn't affect memory tier hierarchy. >> >> Memory Allocation for Demotion >> ============================== >> This patch series keep the demotion target page allocation logic same. >> The demotion page allocation pick the closest NUMA node in the >> next lower tier to the current NUMA node allocating pages from. >> >> This will be later improved to use the same page allocation strategy >> using fallback list. >> >> Sysfs Interface: >> ------------- >> Listing current list of memory tiers details: >> >> :/sys/devices/system/memtier$ ls >> default_tier max_tier memtier1 power uevent >> :/sys/devices/system/memtier$ cat default_tier >> memtier200 >> :/sys/devices/system/memtier$ cat max_tier >> 400 >> :/sys/devices/system/memtier$ >> >> Per node memory tier details: >> >> For a cpu only NUMA node: >> >> :/sys/devices/system/node# cat node0/memtier >> :/sys/devices/system/node# echo 1 > node0/memtier >> :/sys/devices/system/node# cat node0/memtier >> :/sys/devices/system/node# >> >> For a NUMA node with memory: >> :/sys/devices/system/node# cat node1/memtier >> 1 >> :/sys/devices/system/node# ls ../memtier/ >> default_tier max_tier memtier1 power uevent >> :/sys/devices/system/node# echo 2 > node1/memtier >> :/sys/devices/system/node# >> :/sys/devices/system/node# ls ../memtier/ >> default_tier max_tier memtier1 memtier2 power uevent >> :/sys/devices/system/node# cat node1/memtier >> 2 >> :/sys/devices/system/node# >> >> Removing a memory tier >> :/sys/devices/system/node# cat node1/memtier >> 2 >> :/sys/devices/system/node# echo 1 > node1/memtier > > Thanks a lot for your patchset. > > Per my understanding, we haven't reach consensus on > > - how to create the default memory tiers in kernel (via abstract > distance provided by drivers? Or use SLIT as the first step?) > > - how to override the default memory tiers from user space > > As in the following thread and email, > > https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/ > > I think that we need to finalized on that firstly? I did list the proposal here https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated if the user wants a different tier topology. All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200 For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100. Later as we learn more about the device attributes (HMAT or something similar) that we might want to use to control the tier assignment this can be a range of memory tiers. Based on the above, I guess we can merge what is posted in this series and later fine-tune/update the memory tier assignment based on device attributes. -aneesh
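The tier-ID assignment sketched above (200 for the default tier, 100 for dax/kmem memory) combined with the demotion rule quoted from the cover letter ("pick the closest NUMA node in the next lower tier") can be illustrated with a small stand-alone example. The node distances and node-to-tier layout below are made up for the illustration; this is not the kernel implementation from this series.

/*
 * Hedged sketch only: pick the closest node in the next lower tier,
 * using invented SLIT-style distances. Nodes 0-1 stand in for DRAM
 * (default tier 200), nodes 2-3 for dax/kmem memory (tier 100).
 */
#include <stdio.h>

#define NR_NODES 4

static const int distance[NR_NODES][NR_NODES] = {
	{ 10, 20, 17, 28 },
	{ 20, 10, 28, 17 },
	{ 17, 28, 10, 20 },
	{ 28, 17, 20, 10 },
};

/* Tier ID per node: a higher ID means a higher (faster) tier. */
static const int node_tier[NR_NODES] = { 200, 200, 100, 100 };

/* Return the closest node in the next lower populated tier, or -1. */
static int demotion_target(int node)
{
	int target = -1, best_tier = -1;
	int i;

	/* Find the highest tier that is still below the source tier. */
	for (i = 0; i < NR_NODES; i++)
		if (node_tier[i] < node_tier[node] && node_tier[i] > best_tier)
			best_tier = node_tier[i];
	if (best_tier < 0)
		return -1;

	/* Among nodes in that tier, pick the one closest to @node. */
	for (i = 0; i < NR_NODES; i++)
		if (node_tier[i] == best_tier &&
		    (target < 0 || distance[node][i] < distance[node][target]))
			target = i;
	return target;
}

int main(void)
{
	int n;

	for (n = 0; n < NR_NODES; n++) {
		int t = demotion_target(n);

		if (t < 0)
			printf("node%d (tier %d) -> no demotion target\n",
			       n, node_tier[n]);
		else
			printf("node%d (tier %d) -> demote to node%d\n",
			       n, node_tier[n], t);
	}
	return 0;
}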
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: > The current kernel has the basic memory tiering support: Inactive > pages on a higher tier NUMA node can be migrated (demoted) to a lower > tier NUMA node to make room for new allocations on the higher tier > NUMA node. Frequently accessed pages on a lower tier NUMA node can be > migrated (promoted) to a higher tier NUMA node to improve the > performance. > > In the current kernel, memory tiers are defined implicitly via a > demotion path relationship between NUMA nodes, which is created during > the kernel initialization and updated when a NUMA node is hot-added or > hot-removed. The current implementation puts all nodes with CPU into > the top tier, and builds the tier hierarchy tier-by-tier by establishing > the per-node demotion targets based on the distances between nodes. > > This current memory tier kernel interface needs to be improved for > several important use cases: > > * The current tier initialization code always initializes > each memory-only NUMA node into a lower tier. But a memory-only > NUMA node may have a high performance memory device (e.g. a DRAM > device attached via CXL.mem or a DRAM-backed memory-only node on > a virtual machine) and should be put into a higher tier. > > * The current tier hierarchy always puts CPU nodes into the top > tier. But on a system with HBM (e.g. GPU memory) devices, these > memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes > with CPUs are better to be placed into the next lower tier. > > * Also because the current tier hierarchy always puts CPU nodes > into the top tier, when a CPU is hot-added (or hot-removed) and > triggers a memory node from CPU-less into a CPU node (or vice > versa), the memory tier hierarchy gets changed, even though no > memory node is added or removed. This can make the tier > hierarchy unstable and make it difficult to support tier-based > memory accounting. > > * A higher tier node can only be demoted to selected nodes on the > next lower tier as defined by the demotion path, not any other > node from any lower tier. This strict, hard-coded demotion order > does not work in all use cases (e.g. some use cases may want to > allow cross-socket demotion to another node in the same demotion > tier as a fallback when the preferred demotion node is out of > space), and has resulted in the feature request for an interface to > override the system-wide, per-node demotion order from the > userspace. This demotion order is also inconsistent with the page > allocation fallback order when all the nodes in a higher tier are > out of space: The page allocation can fall back to any node from > any lower tier, whereas the demotion order doesn't allow that. > > * There are no interfaces for the userspace to learn about the memory > tier hierarchy in order to optimize its memory allocations. > > This patch series make the creation of memory tiers explicit under > the control of userspace or device driver. > > Memory Tier Initialization > ========================== > > By default, all memory nodes are assigned to the default tier with > tier ID value 200. > > A device driver can move up or down its memory nodes from the default > tier. For example, PMEM can move down its memory nodes below the > default tier, whereas GPU can move up its memory nodes above the > default tier. 
> > The kernel initialization code makes the decision on which exact tier > a memory node should be assigned to based on the requests from the > device drivers as well as the memory device hardware information > provided by the firmware. > > Hot-adding/removing CPUs doesn't affect memory tier hierarchy. > > Memory Allocation for Demotion > ============================== > This patch series keep the demotion target page allocation logic same. > The demotion page allocation pick the closest NUMA node in the > next lower tier to the current NUMA node allocating pages from. > > This will be later improved to use the same page allocation strategy > using fallback list. > > Sysfs Interface: > ------------- > Listing current list of memory tiers details: > > :/sys/devices/system/memtier$ ls > default_tier max_tier memtier1 power uevent > :/sys/devices/system/memtier$ cat default_tier > memtier200 > :/sys/devices/system/memtier$ cat max_tier > 400 > :/sys/devices/system/memtier$ > > Per node memory tier details: > > For a cpu only NUMA node: > > :/sys/devices/system/node# cat node0/memtier > :/sys/devices/system/node# echo 1 > node0/memtier > :/sys/devices/system/node# cat node0/memtier > :/sys/devices/system/node# > > For a NUMA node with memory: > :/sys/devices/system/node# cat node1/memtier > 1 > :/sys/devices/system/node# ls ../memtier/ > default_tier max_tier memtier1 power uevent > :/sys/devices/system/node# echo 2 > node1/memtier > :/sys/devices/system/node# > :/sys/devices/system/node# ls ../memtier/ > default_tier max_tier memtier1 memtier2 power uevent > :/sys/devices/system/node# cat node1/memtier > 2 > :/sys/devices/system/node# > > Removing a memory tier > :/sys/devices/system/node# cat node1/memtier > 2 > :/sys/devices/system/node# echo 1 > node1/memtier > :/sys/devices/system/node# > :/sys/devices/system/node# cat node1/memtier > 1 > :/sys/devices/system/node# > :/sys/devices/system/node# ls ../memtier/ > default_tier max_tier memtier1 power uevent > :/sys/devices/system/node# > > The above resulted in removal of memtier2 which was created in the earlier step. > > Changes from v7: > * Fix kernel crash with demotion. > * Improve documentation. > > Changes from v6: > * Drop the usage of rank. > * Address other review feedback. > > Changes from v5: > * Remove patch supporting N_MEMORY node removal from memory tiers. memory tiers > are going to be used for features other than demotion. Hence keep all N_MEMORY > nodes in memory tiers irrespective of whether they want to participate in promotion or demotion. > * Add NODE_DATA->memtier > * Rearrage patches to add sysfs files later. > * Add support to create memory tiers from userspace. > * Address other review feedback. > > > Changes from v4: > * Address review feedback. > * Reverse the meaning of "rank": higher rank value means higher tier. > * Add "/sys/devices/system/memtier/default_tier". > * Add node_is_toptier > > v4: > Add support for explicit memory tiers and ranks. > > v3: > - Modify patch 1 subject to make it more specific > - Remove /sys/kernel/mm/numa/demotion_targets interface, use > /sys/devices/system/node/demotion_targets instead and make > it writable to override node_states[N_DEMOTION_TARGETS]. 
> - Add support to view per node demotion targets via sysfs > > v2: > In v1, only 1st patch of this patch series was sent, which was > implemented to avoid some of the limitations on the demotion > target sharing, however for certain numa topology, the demotion > targets found by that patch was not most optimal, so 1st patch > in this series is modified according to suggestions from Huang > and Baolin. Different examples of demotion list comparasion > between existing implementation and changed implementation can > be found in the commit message of 1st patch. > > > Aneesh Kumar K.V (10): > mm/demotion: Add support for explicit memory tiers > mm/demotion: Move memory demotion related code > mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM > mm/demotion: Add hotplug callbacks to handle new numa node onlined > mm/demotion: Build demotion targets based on explicit memory tiers > mm/demotion: Expose memory tier details via sysfs > mm/demotion: Add per node memory tier attribute to sysfs > mm/demotion: Add pg_data_t member to track node memory tier details > mm/demotion: Update node_is_toptier to work with memory tiers > mm/demotion: Add sysfs ABI documentation > > Jagdish Gediya (2): > mm/demotion: Demote pages according to allocation fallback order > mm/demotion: Add documentation for memory tiering > > .../ABI/testing/sysfs-kernel-mm-memory-tiers | 61 ++ > Documentation/admin-guide/mm/index.rst | 1 + > .../admin-guide/mm/memory-tiering.rst | 192 +++++ > drivers/base/node.c | 42 + > drivers/dax/kmem.c | 6 +- > include/linux/memory-tiers.h | 72 ++ > include/linux/migrate.h | 15 - > include/linux/mmzone.h | 3 + > include/linux/node.h | 5 - > mm/Makefile | 1 + > mm/huge_memory.c | 1 + > mm/memory-tiers.c | 791 ++++++++++++++++++ > mm/migrate.c | 453 +--------- > mm/mprotect.c | 1 + > mm/vmscan.c | 59 +- > mm/vmstat.c | 4 - > 16 files changed, 1215 insertions(+), 492 deletions(-) > create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers > create mode 100644 Documentation/admin-guide/mm/memory-tiering.rst > create mode 100644 include/linux/memory-tiers.h > create mode 100644 mm/memory-tiers.c > Gentle ping. Any objections for this series? -aneesh
Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: > On 7/5/22 9:59 AM, Huang, Ying wrote: >> Hi, Aneesh, >> >> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: >> >>> The current kernel has the basic memory tiering support: Inactive >>> pages on a higher tier NUMA node can be migrated (demoted) to a lower >>> tier NUMA node to make room for new allocations on the higher tier >>> NUMA node. Frequently accessed pages on a lower tier NUMA node can be >>> migrated (promoted) to a higher tier NUMA node to improve the >>> performance. >>> >>> In the current kernel, memory tiers are defined implicitly via a >>> demotion path relationship between NUMA nodes, which is created during >>> the kernel initialization and updated when a NUMA node is hot-added or >>> hot-removed. The current implementation puts all nodes with CPU into >>> the top tier, and builds the tier hierarchy tier-by-tier by establishing >>> the per-node demotion targets based on the distances between nodes. >>> >>> This current memory tier kernel interface needs to be improved for >>> several important use cases: >>> >>> * The current tier initialization code always initializes >>> each memory-only NUMA node into a lower tier. But a memory-only >>> NUMA node may have a high performance memory device (e.g. a DRAM >>> device attached via CXL.mem or a DRAM-backed memory-only node on >>> a virtual machine) and should be put into a higher tier. >>> >>> * The current tier hierarchy always puts CPU nodes into the top >>> tier. But on a system with HBM (e.g. GPU memory) devices, these >>> memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes >>> with CPUs are better to be placed into the next lower tier. >>> >>> * Also because the current tier hierarchy always puts CPU nodes >>> into the top tier, when a CPU is hot-added (or hot-removed) and >>> triggers a memory node from CPU-less into a CPU node (or vice >>> versa), the memory tier hierarchy gets changed, even though no >>> memory node is added or removed. This can make the tier >>> hierarchy unstable and make it difficult to support tier-based >>> memory accounting. >>> >>> * A higher tier node can only be demoted to selected nodes on the >>> next lower tier as defined by the demotion path, not any other >>> node from any lower tier. This strict, hard-coded demotion order >>> does not work in all use cases (e.g. some use cases may want to >>> allow cross-socket demotion to another node in the same demotion >>> tier as a fallback when the preferred demotion node is out of >>> space), and has resulted in the feature request for an interface to >>> override the system-wide, per-node demotion order from the >>> userspace. This demotion order is also inconsistent with the page >>> allocation fallback order when all the nodes in a higher tier are >>> out of space: The page allocation can fall back to any node from >>> any lower tier, whereas the demotion order doesn't allow that. >>> >>> * There are no interfaces for the userspace to learn about the memory >>> tier hierarchy in order to optimize its memory allocations. >>> >>> This patch series make the creation of memory tiers explicit under >>> the control of userspace or device driver. >>> >>> Memory Tier Initialization >>> ========================== >>> >>> By default, all memory nodes are assigned to the default tier with >>> tier ID value 200. >>> >>> A device driver can move up or down its memory nodes from the default >>> tier. 
For example, PMEM can move down its memory nodes below the >>> default tier, whereas GPU can move up its memory nodes above the >>> default tier. >>> >>> The kernel initialization code makes the decision on which exact tier >>> a memory node should be assigned to based on the requests from the >>> device drivers as well as the memory device hardware information >>> provided by the firmware. >>> >>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy. >>> >>> Memory Allocation for Demotion >>> ============================== >>> This patch series keep the demotion target page allocation logic same. >>> The demotion page allocation pick the closest NUMA node in the >>> next lower tier to the current NUMA node allocating pages from. >>> >>> This will be later improved to use the same page allocation strategy >>> using fallback list. >>> >>> Sysfs Interface: >>> ------------- >>> Listing current list of memory tiers details: >>> >>> :/sys/devices/system/memtier$ ls >>> default_tier max_tier memtier1 power uevent >>> :/sys/devices/system/memtier$ cat default_tier >>> memtier200 >>> :/sys/devices/system/memtier$ cat max_tier >>> 400 >>> :/sys/devices/system/memtier$ >>> >>> Per node memory tier details: >>> >>> For a cpu only NUMA node: >>> >>> :/sys/devices/system/node# cat node0/memtier >>> :/sys/devices/system/node# echo 1 > node0/memtier >>> :/sys/devices/system/node# cat node0/memtier >>> :/sys/devices/system/node# >>> >>> For a NUMA node with memory: >>> :/sys/devices/system/node# cat node1/memtier >>> 1 >>> :/sys/devices/system/node# ls ../memtier/ >>> default_tier max_tier memtier1 power uevent >>> :/sys/devices/system/node# echo 2 > node1/memtier >>> :/sys/devices/system/node# >>> :/sys/devices/system/node# ls ../memtier/ >>> default_tier max_tier memtier1 memtier2 power uevent >>> :/sys/devices/system/node# cat node1/memtier >>> 2 >>> :/sys/devices/system/node# >>> >>> Removing a memory tier >>> :/sys/devices/system/node# cat node1/memtier >>> 2 >>> :/sys/devices/system/node# echo 1 > node1/memtier >> >> Thanks a lot for your patchset. >> >> Per my understanding, we haven't reach consensus on >> >> - how to create the default memory tiers in kernel (via abstract >> distance provided by drivers? Or use SLIT as the first step?) >> >> - how to override the default memory tiers from user space >> >> As in the following thread and email, >> >> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/ >> >> I think that we need to finalized on that firstly? > > I did list the proposal here > > https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com > > So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated > if the user wants a different tier topology. > > All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200 > > For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100. > Later as we learn more about the device attributes (HMAT or something similar) that we might want to use > to control the tier assignment this can be a range of memory tiers. > > Based on the above, I guess we can merge what is posted in this series and later fine-tune/update > the memory tier assignment based on device attributes. Sorry for late reply. As the first step, it may be better to skip the parts that we haven't reached consensus yet, for example, the user space interface to override the default memory tiers. 
And we can use 0, 1, 2 as the default memory tier IDs. We can refine/revise the in-kernel implementation, but we cannot change the user space ABI. Best Regards, Huang, Ying
On 7/12/22 6:46 AM, Huang, Ying wrote: > Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: > >> On 7/5/22 9:59 AM, Huang, Ying wrote: >>> Hi, Aneesh, >>> >>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: >>> >>>> The current kernel has the basic memory tiering support: Inactive >>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower >>>> tier NUMA node to make room for new allocations on the higher tier >>>> NUMA node. Frequently accessed pages on a lower tier NUMA node can be >>>> migrated (promoted) to a higher tier NUMA node to improve the >>>> performance. >>>> >>>> In the current kernel, memory tiers are defined implicitly via a >>>> demotion path relationship between NUMA nodes, which is created during >>>> the kernel initialization and updated when a NUMA node is hot-added or >>>> hot-removed. The current implementation puts all nodes with CPU into >>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing >>>> the per-node demotion targets based on the distances between nodes. >>>> >>>> This current memory tier kernel interface needs to be improved for >>>> several important use cases: >>>> >>>> * The current tier initialization code always initializes >>>> each memory-only NUMA node into a lower tier. But a memory-only >>>> NUMA node may have a high performance memory device (e.g. a DRAM >>>> device attached via CXL.mem or a DRAM-backed memory-only node on >>>> a virtual machine) and should be put into a higher tier. >>>> >>>> * The current tier hierarchy always puts CPU nodes into the top >>>> tier. But on a system with HBM (e.g. GPU memory) devices, these >>>> memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes >>>> with CPUs are better to be placed into the next lower tier. >>>> >>>> * Also because the current tier hierarchy always puts CPU nodes >>>> into the top tier, when a CPU is hot-added (or hot-removed) and >>>> triggers a memory node from CPU-less into a CPU node (or vice >>>> versa), the memory tier hierarchy gets changed, even though no >>>> memory node is added or removed. This can make the tier >>>> hierarchy unstable and make it difficult to support tier-based >>>> memory accounting. >>>> >>>> * A higher tier node can only be demoted to selected nodes on the >>>> next lower tier as defined by the demotion path, not any other >>>> node from any lower tier. This strict, hard-coded demotion order >>>> does not work in all use cases (e.g. some use cases may want to >>>> allow cross-socket demotion to another node in the same demotion >>>> tier as a fallback when the preferred demotion node is out of >>>> space), and has resulted in the feature request for an interface to >>>> override the system-wide, per-node demotion order from the >>>> userspace. This demotion order is also inconsistent with the page >>>> allocation fallback order when all the nodes in a higher tier are >>>> out of space: The page allocation can fall back to any node from >>>> any lower tier, whereas the demotion order doesn't allow that. >>>> >>>> * There are no interfaces for the userspace to learn about the memory >>>> tier hierarchy in order to optimize its memory allocations. >>>> >>>> This patch series make the creation of memory tiers explicit under >>>> the control of userspace or device driver. >>>> >>>> Memory Tier Initialization >>>> ========================== >>>> >>>> By default, all memory nodes are assigned to the default tier with >>>> tier ID value 200. 
>>>> >>>> A device driver can move up or down its memory nodes from the default >>>> tier. For example, PMEM can move down its memory nodes below the >>>> default tier, whereas GPU can move up its memory nodes above the >>>> default tier. >>>> >>>> The kernel initialization code makes the decision on which exact tier >>>> a memory node should be assigned to based on the requests from the >>>> device drivers as well as the memory device hardware information >>>> provided by the firmware. >>>> >>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy. >>>> >>>> Memory Allocation for Demotion >>>> ============================== >>>> This patch series keep the demotion target page allocation logic same. >>>> The demotion page allocation pick the closest NUMA node in the >>>> next lower tier to the current NUMA node allocating pages from. >>>> >>>> This will be later improved to use the same page allocation strategy >>>> using fallback list. >>>> >>>> Sysfs Interface: >>>> ------------- >>>> Listing current list of memory tiers details: >>>> >>>> :/sys/devices/system/memtier$ ls >>>> default_tier max_tier memtier1 power uevent >>>> :/sys/devices/system/memtier$ cat default_tier >>>> memtier200 >>>> :/sys/devices/system/memtier$ cat max_tier >>>> 400 >>>> :/sys/devices/system/memtier$ >>>> >>>> Per node memory tier details: >>>> >>>> For a cpu only NUMA node: >>>> >>>> :/sys/devices/system/node# cat node0/memtier >>>> :/sys/devices/system/node# echo 1 > node0/memtier >>>> :/sys/devices/system/node# cat node0/memtier >>>> :/sys/devices/system/node# >>>> >>>> For a NUMA node with memory: >>>> :/sys/devices/system/node# cat node1/memtier >>>> 1 >>>> :/sys/devices/system/node# ls ../memtier/ >>>> default_tier max_tier memtier1 power uevent >>>> :/sys/devices/system/node# echo 2 > node1/memtier >>>> :/sys/devices/system/node# >>>> :/sys/devices/system/node# ls ../memtier/ >>>> default_tier max_tier memtier1 memtier2 power uevent >>>> :/sys/devices/system/node# cat node1/memtier >>>> 2 >>>> :/sys/devices/system/node# >>>> >>>> Removing a memory tier >>>> :/sys/devices/system/node# cat node1/memtier >>>> 2 >>>> :/sys/devices/system/node# echo 1 > node1/memtier >>> >>> Thanks a lot for your patchset. >>> >>> Per my understanding, we haven't reach consensus on >>> >>> - how to create the default memory tiers in kernel (via abstract >>> distance provided by drivers? Or use SLIT as the first step?) >>> >>> - how to override the default memory tiers from user space >>> >>> As in the following thread and email, >>> >>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/ >>> >>> I think that we need to finalized on that firstly? >> >> I did list the proposal here >> >> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com >> >> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated >> if the user wants a different tier topology. >> >> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200 >> >> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100. >> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use >> to control the tier assignment this can be a range of memory tiers. >> >> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update >> the memory tier assignment based on device attributes. 
> > Sorry for late reply. > > As the first step, it may be better to skip the parts that we haven't > reached consensus yet, for example, the user space interface to override > the default memory tiers. And we can use 0, 1, 2 as the default memory > tier IDs. We can refine/revise the in-kernel implementation, but we > cannot change the user space ABI. > Can you help list the use case that will be broken by using tierID as outlined in this series? One of the details that were mentioned earlier was the need to track top-tier memory usage in a memcg and IIUC the patchset posted https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com can work with tier IDs too. Let me know if you think otherwise. So at this point I am not sure which area we are still debating w.r.t the userspace interface. I will still keep the default tier IDs with a large range between them. That will allow us to go back to tierID based demotion order if we can. That is much simpler than using tierID and rank together. If we still want to go back to rank based approach the tierID value won't have much meaning anyway. Any feedback on patches 1 - 5, so that I can request Andrew to merge them? -aneesh
On 7/12/22 10:12 AM, Aneesh Kumar K V wrote: > On 7/12/22 6:46 AM, Huang, Ying wrote: >> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: >> >>> On 7/5/22 9:59 AM, Huang, Ying wrote: >>>> Hi, Aneesh, >>>> >>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: >>>> >>>>> The current kernel has the basic memory tiering support: Inactive >>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower >>>>> tier NUMA node to make room for new allocations on the higher tier >>>>> NUMA node. Frequently accessed pages on a lower tier NUMA node can be >>>>> migrated (promoted) to a higher tier NUMA node to improve the >>>>> performance. >>>>> >>>>> In the current kernel, memory tiers are defined implicitly via a >>>>> demotion path relationship between NUMA nodes, which is created during >>>>> the kernel initialization and updated when a NUMA node is hot-added or >>>>> hot-removed. The current implementation puts all nodes with CPU into >>>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing >>>>> the per-node demotion targets based on the distances between nodes. >>>>> >>>>> This current memory tier kernel interface needs to be improved for >>>>> several important use cases: >>>>> >>>>> * The current tier initialization code always initializes >>>>> each memory-only NUMA node into a lower tier. But a memory-only >>>>> NUMA node may have a high performance memory device (e.g. a DRAM >>>>> device attached via CXL.mem or a DRAM-backed memory-only node on >>>>> a virtual machine) and should be put into a higher tier. >>>>> >>>>> * The current tier hierarchy always puts CPU nodes into the top >>>>> tier. But on a system with HBM (e.g. GPU memory) devices, these >>>>> memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes >>>>> with CPUs are better to be placed into the next lower tier. >>>>> >>>>> * Also because the current tier hierarchy always puts CPU nodes >>>>> into the top tier, when a CPU is hot-added (or hot-removed) and >>>>> triggers a memory node from CPU-less into a CPU node (or vice >>>>> versa), the memory tier hierarchy gets changed, even though no >>>>> memory node is added or removed. This can make the tier >>>>> hierarchy unstable and make it difficult to support tier-based >>>>> memory accounting. >>>>> >>>>> * A higher tier node can only be demoted to selected nodes on the >>>>> next lower tier as defined by the demotion path, not any other >>>>> node from any lower tier. This strict, hard-coded demotion order >>>>> does not work in all use cases (e.g. some use cases may want to >>>>> allow cross-socket demotion to another node in the same demotion >>>>> tier as a fallback when the preferred demotion node is out of >>>>> space), and has resulted in the feature request for an interface to >>>>> override the system-wide, per-node demotion order from the >>>>> userspace. This demotion order is also inconsistent with the page >>>>> allocation fallback order when all the nodes in a higher tier are >>>>> out of space: The page allocation can fall back to any node from >>>>> any lower tier, whereas the demotion order doesn't allow that. >>>>> >>>>> * There are no interfaces for the userspace to learn about the memory >>>>> tier hierarchy in order to optimize its memory allocations. >>>>> >>>>> This patch series make the creation of memory tiers explicit under >>>>> the control of userspace or device driver. 
>>>>> >>>>> Memory Tier Initialization >>>>> ========================== >>>>> >>>>> By default, all memory nodes are assigned to the default tier with >>>>> tier ID value 200. >>>>> >>>>> A device driver can move up or down its memory nodes from the default >>>>> tier. For example, PMEM can move down its memory nodes below the >>>>> default tier, whereas GPU can move up its memory nodes above the >>>>> default tier. >>>>> >>>>> The kernel initialization code makes the decision on which exact tier >>>>> a memory node should be assigned to based on the requests from the >>>>> device drivers as well as the memory device hardware information >>>>> provided by the firmware. >>>>> >>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy. >>>>> >>>>> Memory Allocation for Demotion >>>>> ============================== >>>>> This patch series keep the demotion target page allocation logic same. >>>>> The demotion page allocation pick the closest NUMA node in the >>>>> next lower tier to the current NUMA node allocating pages from. >>>>> >>>>> This will be later improved to use the same page allocation strategy >>>>> using fallback list. >>>>> >>>>> Sysfs Interface: >>>>> ------------- >>>>> Listing current list of memory tiers details: >>>>> >>>>> :/sys/devices/system/memtier$ ls >>>>> default_tier max_tier memtier1 power uevent >>>>> :/sys/devices/system/memtier$ cat default_tier >>>>> memtier200 >>>>> :/sys/devices/system/memtier$ cat max_tier >>>>> 400 >>>>> :/sys/devices/system/memtier$ >>>>> >>>>> Per node memory tier details: >>>>> >>>>> For a cpu only NUMA node: >>>>> >>>>> :/sys/devices/system/node# cat node0/memtier >>>>> :/sys/devices/system/node# echo 1 > node0/memtier >>>>> :/sys/devices/system/node# cat node0/memtier >>>>> :/sys/devices/system/node# >>>>> >>>>> For a NUMA node with memory: >>>>> :/sys/devices/system/node# cat node1/memtier >>>>> 1 >>>>> :/sys/devices/system/node# ls ../memtier/ >>>>> default_tier max_tier memtier1 power uevent >>>>> :/sys/devices/system/node# echo 2 > node1/memtier >>>>> :/sys/devices/system/node# >>>>> :/sys/devices/system/node# ls ../memtier/ >>>>> default_tier max_tier memtier1 memtier2 power uevent >>>>> :/sys/devices/system/node# cat node1/memtier >>>>> 2 >>>>> :/sys/devices/system/node# >>>>> >>>>> Removing a memory tier >>>>> :/sys/devices/system/node# cat node1/memtier >>>>> 2 >>>>> :/sys/devices/system/node# echo 1 > node1/memtier >>>> >>>> Thanks a lot for your patchset. >>>> >>>> Per my understanding, we haven't reach consensus on >>>> >>>> - how to create the default memory tiers in kernel (via abstract >>>> distance provided by drivers? Or use SLIT as the first step?) >>>> >>>> - how to override the default memory tiers from user space >>>> >>>> As in the following thread and email, >>>> >>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/ >>>> >>>> I think that we need to finalized on that firstly? >>> >>> I did list the proposal here >>> >>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com >>> >>> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated >>> if the user wants a different tier topology. >>> >>> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200 >>> >>> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100. 
>>> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use >>> to control the tier assignment this can be a range of memory tiers. >>> >>> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update >>> the memory tier assignment based on device attributes. >> >> Sorry for late reply. >> >> As the first step, it may be better to skip the parts that we haven't >> reached consensus yet, for example, the user space interface to override >> the default memory tiers. And we can use 0, 1, 2 as the default memory >> tier IDs. We can refine/revise the in-kernel implementation, but we >> cannot change the user space ABI. >> > > Can you help list the use case that will be broken by using tierID as outlined in this series? > One of the details that were mentioned earlier was the need to track top-tier memory usage in a > memcg and IIUC the patchset posted https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com > can work with tier IDs too. Let me know if you think otherwise. So at this point > I am not sure which area we are still debating w.r.t the userspace interface. > > I will still keep the default tier IDs with a large range between them. That will allow > us to go back to tierID based demotion order if we can. That is much simpler than using tierID and rank > together. If we still want to go back to rank based approach the tierID value won't have much > meaning anyway. > > Any feedback on patches 1 - 5, so that I can request Andrew to merge them? > Looking at this again, I guess we just need to drop patch 7 ("mm/demotion: Add per node memory tier attribute to sysfs")? We do agree to use the device model to expose memory tiers to userspace, so patch 6 can still be included. It also exposes max_tier, default_tier, and the node list of each memory tier. All these are useful and agreed upon. Hence patch 6 can be merged? Patches 8-10 were done based on requests from others and are independent of how memory tiers are exposed/created from userspace. Hence those can be merged? If you agree I can rebase the series, moving patches 7, 11, and 12 to the end of the series so that we can skip merging them based on what we conclude w.r.t. the usage of rank. -aneesh
Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: > On 7/12/22 6:46 AM, Huang, Ying wrote: >> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: >> >>> On 7/5/22 9:59 AM, Huang, Ying wrote: >>>> Hi, Aneesh, >>>> >>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: >>>> >>>>> The current kernel has the basic memory tiering support: Inactive >>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower >>>>> tier NUMA node to make room for new allocations on the higher tier >>>>> NUMA node. Frequently accessed pages on a lower tier NUMA node can be >>>>> migrated (promoted) to a higher tier NUMA node to improve the >>>>> performance. >>>>> >>>>> In the current kernel, memory tiers are defined implicitly via a >>>>> demotion path relationship between NUMA nodes, which is created during >>>>> the kernel initialization and updated when a NUMA node is hot-added or >>>>> hot-removed. The current implementation puts all nodes with CPU into >>>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing >>>>> the per-node demotion targets based on the distances between nodes. >>>>> >>>>> This current memory tier kernel interface needs to be improved for >>>>> several important use cases: >>>>> >>>>> * The current tier initialization code always initializes >>>>> each memory-only NUMA node into a lower tier. But a memory-only >>>>> NUMA node may have a high performance memory device (e.g. a DRAM >>>>> device attached via CXL.mem or a DRAM-backed memory-only node on >>>>> a virtual machine) and should be put into a higher tier. >>>>> >>>>> * The current tier hierarchy always puts CPU nodes into the top >>>>> tier. But on a system with HBM (e.g. GPU memory) devices, these >>>>> memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes >>>>> with CPUs are better to be placed into the next lower tier. >>>>> >>>>> * Also because the current tier hierarchy always puts CPU nodes >>>>> into the top tier, when a CPU is hot-added (or hot-removed) and >>>>> triggers a memory node from CPU-less into a CPU node (or vice >>>>> versa), the memory tier hierarchy gets changed, even though no >>>>> memory node is added or removed. This can make the tier >>>>> hierarchy unstable and make it difficult to support tier-based >>>>> memory accounting. >>>>> >>>>> * A higher tier node can only be demoted to selected nodes on the >>>>> next lower tier as defined by the demotion path, not any other >>>>> node from any lower tier. This strict, hard-coded demotion order >>>>> does not work in all use cases (e.g. some use cases may want to >>>>> allow cross-socket demotion to another node in the same demotion >>>>> tier as a fallback when the preferred demotion node is out of >>>>> space), and has resulted in the feature request for an interface to >>>>> override the system-wide, per-node demotion order from the >>>>> userspace. This demotion order is also inconsistent with the page >>>>> allocation fallback order when all the nodes in a higher tier are >>>>> out of space: The page allocation can fall back to any node from >>>>> any lower tier, whereas the demotion order doesn't allow that. >>>>> >>>>> * There are no interfaces for the userspace to learn about the memory >>>>> tier hierarchy in order to optimize its memory allocations. >>>>> >>>>> This patch series make the creation of memory tiers explicit under >>>>> the control of userspace or device driver. 
>>>>> >>>>> Memory Tier Initialization >>>>> ========================== >>>>> >>>>> By default, all memory nodes are assigned to the default tier with >>>>> tier ID value 200. >>>>> >>>>> A device driver can move up or down its memory nodes from the default >>>>> tier. For example, PMEM can move down its memory nodes below the >>>>> default tier, whereas GPU can move up its memory nodes above the >>>>> default tier. >>>>> >>>>> The kernel initialization code makes the decision on which exact tier >>>>> a memory node should be assigned to based on the requests from the >>>>> device drivers as well as the memory device hardware information >>>>> provided by the firmware. >>>>> >>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy. >>>>> >>>>> Memory Allocation for Demotion >>>>> ============================== >>>>> This patch series keep the demotion target page allocation logic same. >>>>> The demotion page allocation pick the closest NUMA node in the >>>>> next lower tier to the current NUMA node allocating pages from. >>>>> >>>>> This will be later improved to use the same page allocation strategy >>>>> using fallback list. >>>>> >>>>> Sysfs Interface: >>>>> ------------- >>>>> Listing current list of memory tiers details: >>>>> >>>>> :/sys/devices/system/memtier$ ls >>>>> default_tier max_tier memtier1 power uevent >>>>> :/sys/devices/system/memtier$ cat default_tier >>>>> memtier200 >>>>> :/sys/devices/system/memtier$ cat max_tier >>>>> 400 >>>>> :/sys/devices/system/memtier$ >>>>> >>>>> Per node memory tier details: >>>>> >>>>> For a cpu only NUMA node: >>>>> >>>>> :/sys/devices/system/node# cat node0/memtier >>>>> :/sys/devices/system/node# echo 1 > node0/memtier >>>>> :/sys/devices/system/node# cat node0/memtier >>>>> :/sys/devices/system/node# >>>>> >>>>> For a NUMA node with memory: >>>>> :/sys/devices/system/node# cat node1/memtier >>>>> 1 >>>>> :/sys/devices/system/node# ls ../memtier/ >>>>> default_tier max_tier memtier1 power uevent >>>>> :/sys/devices/system/node# echo 2 > node1/memtier >>>>> :/sys/devices/system/node# >>>>> :/sys/devices/system/node# ls ../memtier/ >>>>> default_tier max_tier memtier1 memtier2 power uevent >>>>> :/sys/devices/system/node# cat node1/memtier >>>>> 2 >>>>> :/sys/devices/system/node# >>>>> >>>>> Removing a memory tier >>>>> :/sys/devices/system/node# cat node1/memtier >>>>> 2 >>>>> :/sys/devices/system/node# echo 1 > node1/memtier >>>> >>>> Thanks a lot for your patchset. >>>> >>>> Per my understanding, we haven't reach consensus on >>>> >>>> - how to create the default memory tiers in kernel (via abstract >>>> distance provided by drivers? Or use SLIT as the first step?) >>>> >>>> - how to override the default memory tiers from user space >>>> >>>> As in the following thread and email, >>>> >>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/ >>>> >>>> I think that we need to finalized on that firstly? >>> >>> I did list the proposal here >>> >>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com >>> >>> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated >>> if the user wants a different tier topology. >>> >>> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200 >>> >>> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100. 
>>> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use >>> to control the tier assignment this can be a range of memory tiers. >>> >>> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update >>> the memory tier assignment based on device attributes. >> >> Sorry for late reply. >> >> As the first step, it may be better to skip the parts that we haven't >> reached consensus yet, for example, the user space interface to override >> the default memory tiers. And we can use 0, 1, 2 as the default memory >> tier IDs. We can refine/revise the in-kernel implementation, but we >> cannot change the user space ABI. >> > > Can you help list the use case that will be broken by using tierID as outlined in this series? > One of the details that were mentioned earlier was the need to track top-tier memory usage in a > memcg and IIUC the patchset posted https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com > can work with tier IDs too. Let me know if you think otherwise. So at this point > I am not sure which area we are still debating w.r.t the userspace interface. In https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/ per my understanding, Johannes suggested to override the kernel default memory tiers with "abstract distance" via drivers implementing memory devices. As you said in another email, that is related to [7/12] of the series. And we can table it for future. And per my understanding, he also suggested to make memory tier IDs dynamic. For example, after the "abstract distance" of a driver is overridden by users, the total number of memory tiers may be changed, and the memory tier ID of some nodes may be changed too. This will make memory tier ID easier to be understood, but more unstable. For example, this will make it harder to specify the per-memory-tier memory partition for a cgroup. > I will still keep the default tier IDs with a large range between them. That will allow > us to go back to tierID based demotion order if we can. That is much simpler than using tierID and rank > together. If we still want to go back to rank based approach the tierID value won't have much > meaning anyway. I agree to get rid of "rank". > Any feedback on patches 1 - 5, so that I can request Andrew to merge > them? I hope that we can discuss with Johannes firstly. But it appears that he is busy recently. Best Regards, Huang, Ying
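To make the tier-ID versus rank trade-off discussed above concrete, here is a minimal userspace C sketch. It is illustrative only and not code from this series: the struct layout and the values 100/200/300 are assumptions. The point it tries to show is that with a separate rank the demotion ordering can change while the IDs userspace may have recorded stay fixed, whereas with bare IDs a reordering means renumbering.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical flattened view of a memory tier; not the series' struct. */
struct tier {
	int id;		/* stable identifier exposed as memtier<id> */
	int rank;	/* separate ordering key, used only in the rank scheme */
};

/* Demotion walks from the highest tier downwards. */
static int cmp_by_id(const void *a, const void *b)
{
	return ((const struct tier *)b)->id - ((const struct tier *)a)->id;
}

static int cmp_by_rank(const void *a, const void *b)
{
	return ((const struct tier *)b)->rank - ((const struct tier *)a)->rank;
}

int main(void)
{
	struct tier tiers[] = {
		{ .id = 100, .rank = 100 },	/* e.g. dax kmem / PMEM */
		{ .id = 200, .rank = 200 },	/* default DRAM tier */
		{ .id = 300, .rank = 300 },	/* e.g. HBM / GPU memory */
	};
	int i, n = sizeof(tiers) / sizeof(tiers[0]);

	/*
	 * Moving HBM above DRAM in the rank scheme only changes HBM's rank;
	 * in the ID scheme it changes an ID that userspace (say, a per-tier
	 * cgroup limit) may already have recorded.
	 */
	qsort(tiers, n, sizeof(tiers[0]), cmp_by_rank);
	for (i = 0; i < n; i++)
		printf("demotion level %d: memtier%d\n", i, tiers[i].id);

	(void)cmp_by_id;	/* same order here; the two schemes diverge only
				   once ranks change independently of IDs */
	return 0;
}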
On 7/12/22 12:29 PM, Huang, Ying wrote: > Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: > >> On 7/12/22 6:46 AM, Huang, Ying wrote: >>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: >>> >>>> On 7/5/22 9:59 AM, Huang, Ying wrote: >>>>> Hi, Aneesh, >>>>> >>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: >>>>> >>>>>> The current kernel has the basic memory tiering support: Inactive >>>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower >>>>>> tier NUMA node to make room for new allocations on the higher tier >>>>>> NUMA node. Frequently accessed pages on a lower tier NUMA node can be >>>>>> migrated (promoted) to a higher tier NUMA node to improve the >>>>>> performance. >>>>>> >>>>>> In the current kernel, memory tiers are defined implicitly via a >>>>>> demotion path relationship between NUMA nodes, which is created during >>>>>> the kernel initialization and updated when a NUMA node is hot-added or >>>>>> hot-removed. The current implementation puts all nodes with CPU into >>>>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing >>>>>> the per-node demotion targets based on the distances between nodes. >>>>>> >>>>>> This current memory tier kernel interface needs to be improved for >>>>>> several important use cases: >>>>>> >>>>>> * The current tier initialization code always initializes >>>>>> each memory-only NUMA node into a lower tier. But a memory-only >>>>>> NUMA node may have a high performance memory device (e.g. a DRAM >>>>>> device attached via CXL.mem or a DRAM-backed memory-only node on >>>>>> a virtual machine) and should be put into a higher tier. >>>>>> >>>>>> * The current tier hierarchy always puts CPU nodes into the top >>>>>> tier. But on a system with HBM (e.g. GPU memory) devices, these >>>>>> memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes >>>>>> with CPUs are better to be placed into the next lower tier. >>>>>> >>>>>> * Also because the current tier hierarchy always puts CPU nodes >>>>>> into the top tier, when a CPU is hot-added (or hot-removed) and >>>>>> triggers a memory node from CPU-less into a CPU node (or vice >>>>>> versa), the memory tier hierarchy gets changed, even though no >>>>>> memory node is added or removed. This can make the tier >>>>>> hierarchy unstable and make it difficult to support tier-based >>>>>> memory accounting. >>>>>> >>>>>> * A higher tier node can only be demoted to selected nodes on the >>>>>> next lower tier as defined by the demotion path, not any other >>>>>> node from any lower tier. This strict, hard-coded demotion order >>>>>> does not work in all use cases (e.g. some use cases may want to >>>>>> allow cross-socket demotion to another node in the same demotion >>>>>> tier as a fallback when the preferred demotion node is out of >>>>>> space), and has resulted in the feature request for an interface to >>>>>> override the system-wide, per-node demotion order from the >>>>>> userspace. This demotion order is also inconsistent with the page >>>>>> allocation fallback order when all the nodes in a higher tier are >>>>>> out of space: The page allocation can fall back to any node from >>>>>> any lower tier, whereas the demotion order doesn't allow that. >>>>>> >>>>>> * There are no interfaces for the userspace to learn about the memory >>>>>> tier hierarchy in order to optimize its memory allocations. >>>>>> >>>>>> This patch series make the creation of memory tiers explicit under >>>>>> the control of userspace or device driver. 
>>>>>> >>>>>> Memory Tier Initialization >>>>>> ========================== >>>>>> >>>>>> By default, all memory nodes are assigned to the default tier with >>>>>> tier ID value 200. >>>>>> >>>>>> A device driver can move up or down its memory nodes from the default >>>>>> tier. For example, PMEM can move down its memory nodes below the >>>>>> default tier, whereas GPU can move up its memory nodes above the >>>>>> default tier. >>>>>> >>>>>> The kernel initialization code makes the decision on which exact tier >>>>>> a memory node should be assigned to based on the requests from the >>>>>> device drivers as well as the memory device hardware information >>>>>> provided by the firmware. >>>>>> >>>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy. >>>>>> >>>>>> Memory Allocation for Demotion >>>>>> ============================== >>>>>> This patch series keep the demotion target page allocation logic same. >>>>>> The demotion page allocation pick the closest NUMA node in the >>>>>> next lower tier to the current NUMA node allocating pages from. >>>>>> >>>>>> This will be later improved to use the same page allocation strategy >>>>>> using fallback list. >>>>>> >>>>>> Sysfs Interface: >>>>>> ------------- >>>>>> Listing current list of memory tiers details: >>>>>> >>>>>> :/sys/devices/system/memtier$ ls >>>>>> default_tier max_tier memtier1 power uevent >>>>>> :/sys/devices/system/memtier$ cat default_tier >>>>>> memtier200 >>>>>> :/sys/devices/system/memtier$ cat max_tier >>>>>> 400 >>>>>> :/sys/devices/system/memtier$ >>>>>> >>>>>> Per node memory tier details: >>>>>> >>>>>> For a cpu only NUMA node: >>>>>> >>>>>> :/sys/devices/system/node# cat node0/memtier >>>>>> :/sys/devices/system/node# echo 1 > node0/memtier >>>>>> :/sys/devices/system/node# cat node0/memtier >>>>>> :/sys/devices/system/node# >>>>>> >>>>>> For a NUMA node with memory: >>>>>> :/sys/devices/system/node# cat node1/memtier >>>>>> 1 >>>>>> :/sys/devices/system/node# ls ../memtier/ >>>>>> default_tier max_tier memtier1 power uevent >>>>>> :/sys/devices/system/node# echo 2 > node1/memtier >>>>>> :/sys/devices/system/node# >>>>>> :/sys/devices/system/node# ls ../memtier/ >>>>>> default_tier max_tier memtier1 memtier2 power uevent >>>>>> :/sys/devices/system/node# cat node1/memtier >>>>>> 2 >>>>>> :/sys/devices/system/node# >>>>>> >>>>>> Removing a memory tier >>>>>> :/sys/devices/system/node# cat node1/memtier >>>>>> 2 >>>>>> :/sys/devices/system/node# echo 1 > node1/memtier >>>>> >>>>> Thanks a lot for your patchset. >>>>> >>>>> Per my understanding, we haven't reach consensus on >>>>> >>>>> - how to create the default memory tiers in kernel (via abstract >>>>> distance provided by drivers? Or use SLIT as the first step?) >>>>> >>>>> - how to override the default memory tiers from user space >>>>> >>>>> As in the following thread and email, >>>>> >>>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/ >>>>> >>>>> I think that we need to finalized on that firstly? >>>> >>>> I did list the proposal here >>>> >>>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com >>>> >>>> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated >>>> if the user wants a different tier topology. >>>> >>>> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200 >>>> >>>> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100. 
>>>> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use >>>> to control the tier assignment this can be a range of memory tiers. >>>> >>>> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update >>>> the memory tier assignment based on device attributes. >>> >>> Sorry for late reply. >>> >>> As the first step, it may be better to skip the parts that we haven't >>> reached consensus yet, for example, the user space interface to override >>> the default memory tiers. And we can use 0, 1, 2 as the default memory >>> tier IDs. We can refine/revise the in-kernel implementation, but we >>> cannot change the user space ABI. >>> >> >> Can you help list the use case that will be broken by using tierID as outlined in this series? >> One of the details that were mentioned earlier was the need to track top-tier memory usage in a >> memcg and IIUC the patchset posted https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com >> can work with tier IDs too. Let me know if you think otherwise. So at this point >> I am not sure which area we are still debating w.r.t the userspace interface. > > In > > https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/ > > per my understanding, Johannes suggested to override the kernel default > memory tiers with "abstract distance" via drivers implementing memory > devices. As you said in another email, that is related to [7/12] of the > series. And we can table it for future. > > And per my understanding, he also suggested to make memory tier IDs > dynamic. For example, after the "abstract distance" of a driver is > overridden by users, the total number of memory tiers may be changed, > and the memory tier ID of some nodes may be changed too. This will make > memory tier ID easier to be understood, but more unstable. For example, > this will make it harder to specify the per-memory-tier memory partition > for a cgroup. > With all the approaches we discussed so far, a memory tier of a numa node can be changed. ie, pgdat->memtier can change anytime. The per memcg top tier mem usage tracking patches posted here https://lore.kernel.org/linux-mm/cefeb63173fa0fac7543315a2abbd4b5a1b25af8.1655242024.git.tim.c.chen@linux.intel.com/ doesn't consider the node movement from one memory tier to another. If we need a stable pgdat->memtier we will have to prevent a node memory tier reassignment while we have pages from the memory tier charged to a cgroup. This patchset should not prevent such a restriction. There are 3 knobs provided in this patchset. 1. kernel parameter to change default memory tier. Changing this applies only to new memory that is hotplugged. The existing node to memtier mapping remains the same. 2. module parameter to change dax kmem memory tier. Same as above. 3. Ability to change node to memory tier mapping via /sys/devices/system/node/nodeN/memtier . We should be able to add any restrictions w.r.t cgroup there. Hence my observation is that the requirement for a stable node to memory tier mapping should not prevent the merging of this patch series. >> I will still keep the default tier IDs with a large range between them. That will allow >> us to go back to tierID based demotion order if we can. That is much simpler than using tierID and rank >> together. If we still want to go back to rank based approach the tierID value won't have much >> meaning anyway. > > I agree to get rid of "rank". 
> >> Any feedback on patches 1 - 5, so that I can request Andrew to merge >> them? > > I hope that we can discuss with Johannes firstly. But it appears that > he is busy recently. > -aneesh
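For reference, a minimal userspace sketch of knob 3 mentioned above, using the /sys/devices/system/node/nodeN/memtier file described in the cover letter. It assumes a kernel with this (not yet merged) series applied; on anything else the file simply will not exist, which the error handling below treats as "no tier".

#include <stdio.h>

/* Returns the node's memory tier, or -1 for CPU-only nodes / missing file. */
static int read_node_memtier(int node)
{
	char path[96];
	FILE *f;
	int tier = -1;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/node/node%d/memtier", node);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%d", &tier) != 1)
		tier = -1;
	fclose(f);
	return tier;
}

static int write_node_memtier(int node, int tier)
{
	char path[96];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/node/node%d/memtier", node);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", tier);
	return fclose(f);	/* 0 on success, EOF on write error */
}

int main(void)
{
	printf("node1 is in memtier %d\n", read_node_memtier(1));
	/* Mirrors the cover letter example: move node1 into tier 2. */
	if (write_node_memtier(1, 2) != 0)
		perror("moving node1 to memtier 2");
	return 0;
}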
Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: > On 7/12/22 12:29 PM, Huang, Ying wrote: >> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: >> >>> On 7/12/22 6:46 AM, Huang, Ying wrote: >>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: >>>> >>>>> On 7/5/22 9:59 AM, Huang, Ying wrote: >>>>>> Hi, Aneesh, >>>>>> >>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: >>>>>> >>>>>>> The current kernel has the basic memory tiering support: Inactive >>>>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower >>>>>>> tier NUMA node to make room for new allocations on the higher tier >>>>>>> NUMA node. Frequently accessed pages on a lower tier NUMA node can be >>>>>>> migrated (promoted) to a higher tier NUMA node to improve the >>>>>>> performance. >>>>>>> >>>>>>> In the current kernel, memory tiers are defined implicitly via a >>>>>>> demotion path relationship between NUMA nodes, which is created during >>>>>>> the kernel initialization and updated when a NUMA node is hot-added or >>>>>>> hot-removed. The current implementation puts all nodes with CPU into >>>>>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing >>>>>>> the per-node demotion targets based on the distances between nodes. >>>>>>> >>>>>>> This current memory tier kernel interface needs to be improved for >>>>>>> several important use cases: >>>>>>> >>>>>>> * The current tier initialization code always initializes >>>>>>> each memory-only NUMA node into a lower tier. But a memory-only >>>>>>> NUMA node may have a high performance memory device (e.g. a DRAM >>>>>>> device attached via CXL.mem or a DRAM-backed memory-only node on >>>>>>> a virtual machine) and should be put into a higher tier. >>>>>>> >>>>>>> * The current tier hierarchy always puts CPU nodes into the top >>>>>>> tier. But on a system with HBM (e.g. GPU memory) devices, these >>>>>>> memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes >>>>>>> with CPUs are better to be placed into the next lower tier. >>>>>>> >>>>>>> * Also because the current tier hierarchy always puts CPU nodes >>>>>>> into the top tier, when a CPU is hot-added (or hot-removed) and >>>>>>> triggers a memory node from CPU-less into a CPU node (or vice >>>>>>> versa), the memory tier hierarchy gets changed, even though no >>>>>>> memory node is added or removed. This can make the tier >>>>>>> hierarchy unstable and make it difficult to support tier-based >>>>>>> memory accounting. >>>>>>> >>>>>>> * A higher tier node can only be demoted to selected nodes on the >>>>>>> next lower tier as defined by the demotion path, not any other >>>>>>> node from any lower tier. This strict, hard-coded demotion order >>>>>>> does not work in all use cases (e.g. some use cases may want to >>>>>>> allow cross-socket demotion to another node in the same demotion >>>>>>> tier as a fallback when the preferred demotion node is out of >>>>>>> space), and has resulted in the feature request for an interface to >>>>>>> override the system-wide, per-node demotion order from the >>>>>>> userspace. This demotion order is also inconsistent with the page >>>>>>> allocation fallback order when all the nodes in a higher tier are >>>>>>> out of space: The page allocation can fall back to any node from >>>>>>> any lower tier, whereas the demotion order doesn't allow that. >>>>>>> >>>>>>> * There are no interfaces for the userspace to learn about the memory >>>>>>> tier hierarchy in order to optimize its memory allocations. 
>>>>>>> >>>>>>> This patch series make the creation of memory tiers explicit under >>>>>>> the control of userspace or device driver. >>>>>>> >>>>>>> Memory Tier Initialization >>>>>>> ========================== >>>>>>> >>>>>>> By default, all memory nodes are assigned to the default tier with >>>>>>> tier ID value 200. >>>>>>> >>>>>>> A device driver can move up or down its memory nodes from the default >>>>>>> tier. For example, PMEM can move down its memory nodes below the >>>>>>> default tier, whereas GPU can move up its memory nodes above the >>>>>>> default tier. >>>>>>> >>>>>>> The kernel initialization code makes the decision on which exact tier >>>>>>> a memory node should be assigned to based on the requests from the >>>>>>> device drivers as well as the memory device hardware information >>>>>>> provided by the firmware. >>>>>>> >>>>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy. >>>>>>> >>>>>>> Memory Allocation for Demotion >>>>>>> ============================== >>>>>>> This patch series keep the demotion target page allocation logic same. >>>>>>> The demotion page allocation pick the closest NUMA node in the >>>>>>> next lower tier to the current NUMA node allocating pages from. >>>>>>> >>>>>>> This will be later improved to use the same page allocation strategy >>>>>>> using fallback list. >>>>>>> >>>>>>> Sysfs Interface: >>>>>>> ------------- >>>>>>> Listing current list of memory tiers details: >>>>>>> >>>>>>> :/sys/devices/system/memtier$ ls >>>>>>> default_tier max_tier memtier1 power uevent >>>>>>> :/sys/devices/system/memtier$ cat default_tier >>>>>>> memtier200 >>>>>>> :/sys/devices/system/memtier$ cat max_tier >>>>>>> 400 >>>>>>> :/sys/devices/system/memtier$ >>>>>>> >>>>>>> Per node memory tier details: >>>>>>> >>>>>>> For a cpu only NUMA node: >>>>>>> >>>>>>> :/sys/devices/system/node# cat node0/memtier >>>>>>> :/sys/devices/system/node# echo 1 > node0/memtier >>>>>>> :/sys/devices/system/node# cat node0/memtier >>>>>>> :/sys/devices/system/node# >>>>>>> >>>>>>> For a NUMA node with memory: >>>>>>> :/sys/devices/system/node# cat node1/memtier >>>>>>> 1 >>>>>>> :/sys/devices/system/node# ls ../memtier/ >>>>>>> default_tier max_tier memtier1 power uevent >>>>>>> :/sys/devices/system/node# echo 2 > node1/memtier >>>>>>> :/sys/devices/system/node# >>>>>>> :/sys/devices/system/node# ls ../memtier/ >>>>>>> default_tier max_tier memtier1 memtier2 power uevent >>>>>>> :/sys/devices/system/node# cat node1/memtier >>>>>>> 2 >>>>>>> :/sys/devices/system/node# >>>>>>> >>>>>>> Removing a memory tier >>>>>>> :/sys/devices/system/node# cat node1/memtier >>>>>>> 2 >>>>>>> :/sys/devices/system/node# echo 1 > node1/memtier >>>>>> >>>>>> Thanks a lot for your patchset. >>>>>> >>>>>> Per my understanding, we haven't reach consensus on >>>>>> >>>>>> - how to create the default memory tiers in kernel (via abstract >>>>>> distance provided by drivers? Or use SLIT as the first step?) >>>>>> >>>>>> - how to override the default memory tiers from user space >>>>>> >>>>>> As in the following thread and email, >>>>>> >>>>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/ >>>>>> >>>>>> I think that we need to finalized on that firstly? >>>>> >>>>> I did list the proposal here >>>>> >>>>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com >>>>> >>>>> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated >>>>> if the user wants a different tier topology. 
>>>>> >>>>> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200 >>>>> >>>>> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100. >>>>> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use >>>>> to control the tier assignment this can be a range of memory tiers. >>>>> >>>>> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update >>>>> the memory tier assignment based on device attributes. >>>> >>>> Sorry for late reply. >>>> >>>> As the first step, it may be better to skip the parts that we haven't >>>> reached consensus yet, for example, the user space interface to override >>>> the default memory tiers. And we can use 0, 1, 2 as the default memory >>>> tier IDs. We can refine/revise the in-kernel implementation, but we >>>> cannot change the user space ABI. >>>> >>> >>> Can you help list the use case that will be broken by using tierID as outlined in this series? >>> One of the details that were mentioned earlier was the need to track top-tier memory usage in a >>> memcg and IIUC the patchset posted https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com >>> can work with tier IDs too. Let me know if you think otherwise. So at this point >>> I am not sure which area we are still debating w.r.t the userspace interface. >> >> In >> >> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/ >> >> per my understanding, Johannes suggested to override the kernel default >> memory tiers with "abstract distance" via drivers implementing memory >> devices. As you said in another email, that is related to [7/12] of the >> series. And we can table it for future. >> >> And per my understanding, he also suggested to make memory tier IDs >> dynamic. For example, after the "abstract distance" of a driver is >> overridden by users, the total number of memory tiers may be changed, >> and the memory tier ID of some nodes may be changed too. This will make >> memory tier ID easier to be understood, but more unstable. For example, >> this will make it harder to specify the per-memory-tier memory partition >> for a cgroup. >> > > With all the approaches we discussed so far, a memory tier of a numa node can be changed. > ie, pgdat->memtier can change anytime. The per memcg top tier mem usage tracking patches > posted here > https://lore.kernel.org/linux-mm/cefeb63173fa0fac7543315a2abbd4b5a1b25af8.1655242024.git.tim.c.chen@linux.intel.com/ > doesn't consider the node movement from one memory tier to another. If we need > a stable pgdat->memtier we will have to prevent a node memory tier reassignment > while we have pages from the memory tier charged to a cgroup. This patchset should not > prevent such a restriction. Absolute stableness doesn't exist even in "rank" based solution. But "rank" can improve the stableness at some degree. For example, if we move the tier of HBM nodes (from below DRAM to above DRAM), the DRAM nodes can keep its memory tier ID stable. This may be not a real issue finally. But we need to discuss that. Tim has suggested to use top-tier(s) memory partition among cgroups. But I don't think that has been finalized. We may use per-memory-tier memory partition among cgroups. I don't know whether Wei will use that (may be implemented in the user space). And, if we thought stableness between nodes and memory tier ID isn't important. 
Why should we use sparse memory device IDs (that is, 100, 200, 300)? Why not just 0, 1, 2, ...? That looks more natural. > There are 3 knobs provided in this patchset. > > 1. kernel parameter to change default memory tier. Changing this applies only to new memory that is > hotplugged. The existing node to memtier mapping remains the same. > > 2. module parameter to change dax kmem memory tier. Same as above. Why do we need these 2 knobs? For example, we may use user space overridden mechanism suggested by Johannes. > 3. Ability to change node to memory tier mapping via /sys/devices/system/node/nodeN/memtier . We > should be able to add any restrictions w.r.t cgroup there. I think that we have decided to delay this feature ([7/12])? Best Regards, Huang, Ying > Hence my observation is that the requirement for a stable node to memory tier mapping should not > prevent the merging of this patch series. > > >>> I will still keep the default tier IDs with a large range between them. That will allow >>> us to go back to tierID based demotion order if we can. That is much simpler than using tierID and rank >>> together. If we still want to go back to rank based approach the tierID value won't have much >>> meaning anyway. >> >> I agree to get rid of "rank". >> >>> Any feedback on patches 1 - 5, so that I can request Andrew to merge >>> them? >> >> I hope that we can discuss with Johannes firstly. But it appears that >> he is busy recently. >> > > > -aneesh
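For readers trying to picture knobs 1 and 2 being questioned here, a rough kernel-side sketch of how a boot parameter for the default tier and a dax/kmem module parameter could look. Only the default values 200 and 100 come from the thread; the parameter names and the exact mechanism are assumptions, and in practice the two knobs would live in different source files.

/* Sketch only; not the parameters actually added by this series. */
#include <linux/moduleparam.h>

/* Knob 1: default tier for newly hotplugged memory not claimed by a driver. */
static unsigned int default_memory_tier = 200;
core_param(default_memory_tier, default_memory_tier, uint, 0644);

/* Knob 2: tier the dax kmem driver assigns to the memory it manages. */
static unsigned int dax_kmem_memtier = 100;
module_param(dax_kmem_memtier, uint, 0644);
MODULE_PARM_DESC(dax_kmem_memtier,
		 "Memory tier for dax/kmem managed NUMA nodes");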
On 7/12/22 2:18 PM, Huang, Ying wrote: > Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: > >> On 7/12/22 12:29 PM, Huang, Ying wrote: >>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: >>> >>>> On 7/12/22 6:46 AM, Huang, Ying wrote: >>>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: >>>>> >>>>>> On 7/5/22 9:59 AM, Huang, Ying wrote: >>>>>>> Hi, Aneesh, >>>>>>> >>>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: >>>>>>> >>>>>>>> The current kernel has the basic memory tiering support: Inactive >>>>>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower >>>>>>>> tier NUMA node to make room for new allocations on the higher tier >>>>>>>> NUMA node. Frequently accessed pages on a lower tier NUMA node can be >>>>>>>> migrated (promoted) to a higher tier NUMA node to improve the >>>>>>>> performance. >>>>>>>> >>>>>>>> In the current kernel, memory tiers are defined implicitly via a >>>>>>>> demotion path relationship between NUMA nodes, which is created during >>>>>>>> the kernel initialization and updated when a NUMA node is hot-added or >>>>>>>> hot-removed. The current implementation puts all nodes with CPU into >>>>>>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing >>>>>>>> the per-node demotion targets based on the distances between nodes. >>>>>>>> >>>>>>>> This current memory tier kernel interface needs to be improved for >>>>>>>> several important use cases: >>>>>>>> >>>>>>>> * The current tier initialization code always initializes >>>>>>>> each memory-only NUMA node into a lower tier. But a memory-only >>>>>>>> NUMA node may have a high performance memory device (e.g. a DRAM >>>>>>>> device attached via CXL.mem or a DRAM-backed memory-only node on >>>>>>>> a virtual machine) and should be put into a higher tier. >>>>>>>> >>>>>>>> * The current tier hierarchy always puts CPU nodes into the top >>>>>>>> tier. But on a system with HBM (e.g. GPU memory) devices, these >>>>>>>> memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes >>>>>>>> with CPUs are better to be placed into the next lower tier. >>>>>>>> >>>>>>>> * Also because the current tier hierarchy always puts CPU nodes >>>>>>>> into the top tier, when a CPU is hot-added (or hot-removed) and >>>>>>>> triggers a memory node from CPU-less into a CPU node (or vice >>>>>>>> versa), the memory tier hierarchy gets changed, even though no >>>>>>>> memory node is added or removed. This can make the tier >>>>>>>> hierarchy unstable and make it difficult to support tier-based >>>>>>>> memory accounting. >>>>>>>> >>>>>>>> * A higher tier node can only be demoted to selected nodes on the >>>>>>>> next lower tier as defined by the demotion path, not any other >>>>>>>> node from any lower tier. This strict, hard-coded demotion order >>>>>>>> does not work in all use cases (e.g. some use cases may want to >>>>>>>> allow cross-socket demotion to another node in the same demotion >>>>>>>> tier as a fallback when the preferred demotion node is out of >>>>>>>> space), and has resulted in the feature request for an interface to >>>>>>>> override the system-wide, per-node demotion order from the >>>>>>>> userspace. This demotion order is also inconsistent with the page >>>>>>>> allocation fallback order when all the nodes in a higher tier are >>>>>>>> out of space: The page allocation can fall back to any node from >>>>>>>> any lower tier, whereas the demotion order doesn't allow that. 
>>>>>>>> >>>>>>>> * There are no interfaces for the userspace to learn about the memory >>>>>>>> tier hierarchy in order to optimize its memory allocations. >>>>>>>> >>>>>>>> This patch series make the creation of memory tiers explicit under >>>>>>>> the control of userspace or device driver. >>>>>>>> >>>>>>>> Memory Tier Initialization >>>>>>>> ========================== >>>>>>>> >>>>>>>> By default, all memory nodes are assigned to the default tier with >>>>>>>> tier ID value 200. >>>>>>>> >>>>>>>> A device driver can move up or down its memory nodes from the default >>>>>>>> tier. For example, PMEM can move down its memory nodes below the >>>>>>>> default tier, whereas GPU can move up its memory nodes above the >>>>>>>> default tier. >>>>>>>> >>>>>>>> The kernel initialization code makes the decision on which exact tier >>>>>>>> a memory node should be assigned to based on the requests from the >>>>>>>> device drivers as well as the memory device hardware information >>>>>>>> provided by the firmware. >>>>>>>> >>>>>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy. >>>>>>>> >>>>>>>> Memory Allocation for Demotion >>>>>>>> ============================== >>>>>>>> This patch series keep the demotion target page allocation logic same. >>>>>>>> The demotion page allocation pick the closest NUMA node in the >>>>>>>> next lower tier to the current NUMA node allocating pages from. >>>>>>>> >>>>>>>> This will be later improved to use the same page allocation strategy >>>>>>>> using fallback list. >>>>>>>> >>>>>>>> Sysfs Interface: >>>>>>>> ------------- >>>>>>>> Listing current list of memory tiers details: >>>>>>>> >>>>>>>> :/sys/devices/system/memtier$ ls >>>>>>>> default_tier max_tier memtier1 power uevent >>>>>>>> :/sys/devices/system/memtier$ cat default_tier >>>>>>>> memtier200 >>>>>>>> :/sys/devices/system/memtier$ cat max_tier >>>>>>>> 400 >>>>>>>> :/sys/devices/system/memtier$ >>>>>>>> >>>>>>>> Per node memory tier details: >>>>>>>> >>>>>>>> For a cpu only NUMA node: >>>>>>>> >>>>>>>> :/sys/devices/system/node# cat node0/memtier >>>>>>>> :/sys/devices/system/node# echo 1 > node0/memtier >>>>>>>> :/sys/devices/system/node# cat node0/memtier >>>>>>>> :/sys/devices/system/node# >>>>>>>> >>>>>>>> For a NUMA node with memory: >>>>>>>> :/sys/devices/system/node# cat node1/memtier >>>>>>>> 1 >>>>>>>> :/sys/devices/system/node# ls ../memtier/ >>>>>>>> default_tier max_tier memtier1 power uevent >>>>>>>> :/sys/devices/system/node# echo 2 > node1/memtier >>>>>>>> :/sys/devices/system/node# >>>>>>>> :/sys/devices/system/node# ls ../memtier/ >>>>>>>> default_tier max_tier memtier1 memtier2 power uevent >>>>>>>> :/sys/devices/system/node# cat node1/memtier >>>>>>>> 2 >>>>>>>> :/sys/devices/system/node# >>>>>>>> >>>>>>>> Removing a memory tier >>>>>>>> :/sys/devices/system/node# cat node1/memtier >>>>>>>> 2 >>>>>>>> :/sys/devices/system/node# echo 1 > node1/memtier >>>>>>> >>>>>>> Thanks a lot for your patchset. >>>>>>> >>>>>>> Per my understanding, we haven't reach consensus on >>>>>>> >>>>>>> - how to create the default memory tiers in kernel (via abstract >>>>>>> distance provided by drivers? Or use SLIT as the first step?) >>>>>>> >>>>>>> - how to override the default memory tiers from user space >>>>>>> >>>>>>> As in the following thread and email, >>>>>>> >>>>>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/ >>>>>>> >>>>>>> I think that we need to finalized on that firstly? 
>>>>>> >>>>>> I did list the proposal here >>>>>> >>>>>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com >>>>>> >>>>>> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated >>>>>> if the user wants a different tier topology. >>>>>> >>>>>> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200 >>>>>> >>>>>> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100. >>>>>> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use >>>>>> to control the tier assignment this can be a range of memory tiers. >>>>>> >>>>>> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update >>>>>> the memory tier assignment based on device attributes. >>>>> >>>>> Sorry for late reply. >>>>> >>>>> As the first step, it may be better to skip the parts that we haven't >>>>> reached consensus yet, for example, the user space interface to override >>>>> the default memory tiers. And we can use 0, 1, 2 as the default memory >>>>> tier IDs. We can refine/revise the in-kernel implementation, but we >>>>> cannot change the user space ABI. >>>>> >>>> >>>> Can you help list the use case that will be broken by using tierID as outlined in this series? >>>> One of the details that were mentioned earlier was the need to track top-tier memory usage in a >>>> memcg and IIUC the patchset posted https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com >>>> can work with tier IDs too. Let me know if you think otherwise. So at this point >>>> I am not sure which area we are still debating w.r.t the userspace interface. >>> >>> In >>> >>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/ >>> >>> per my understanding, Johannes suggested to override the kernel default >>> memory tiers with "abstract distance" via drivers implementing memory >>> devices. As you said in another email, that is related to [7/12] of the >>> series. And we can table it for future. >>> >>> And per my understanding, he also suggested to make memory tier IDs >>> dynamic. For example, after the "abstract distance" of a driver is >>> overridden by users, the total number of memory tiers may be changed, >>> and the memory tier ID of some nodes may be changed too. This will make >>> memory tier ID easier to be understood, but more unstable. For example, >>> this will make it harder to specify the per-memory-tier memory partition >>> for a cgroup. >>> >> >> With all the approaches we discussed so far, a memory tier of a numa node can be changed. >> ie, pgdat->memtier can change anytime. The per memcg top tier mem usage tracking patches >> posted here >> https://lore.kernel.org/linux-mm/cefeb63173fa0fac7543315a2abbd4b5a1b25af8.1655242024.git.tim.c.chen@linux.intel.com/ >> doesn't consider the node movement from one memory tier to another. If we need >> a stable pgdat->memtier we will have to prevent a node memory tier reassignment >> while we have pages from the memory tier charged to a cgroup. This patchset should not >> prevent such a restriction. > > Absolute stableness doesn't exist even in "rank" based solution. But > "rank" can improve the stableness at some degree. For example, if we > move the tier of HBM nodes (from below DRAM to above DRAM), the DRAM > nodes can keep its memory tier ID stable. This may be not a real issue > finally. 
But we need to discuss that. > I agree that using ranks gives us the flexibility to change demotion order without being blocked by cgroup usage. But how frequently do we expect the tier assignment to change? My expectation was these reassignments are going to be rare and won't happen frequently after a system is up and running? Hence using tierID for demotion order won't prevent a node reassignment much because we don't expect to change the node tierID during runtime. In the rare case we do, we will have to make sure there is no cgroup usage from the specific memory tier. Even if we use ranks, we will have to avoid a rank update, if such an update can change the meaning of top tier? ie, if a rank update can result in a node being moved from top tier to non top tier. > Tim has suggested to use top-tier(s) memory partition among cgroups. > But I don't think that has been finalized. We may use per-memory-tier > memory partition among cgroups. I don't know whether Wei will use that > (may be implemented in the user space). > > And, if we thought stableness between nodes and memory tier ID isn't > important. Why should we use sparse memory device IDs (that is, 100, > 200, 300)? Why not just 0, 1, 2, ...? That looks more natural. > The range allows us to use memtier ID for demotion order. ie, as we start initializing devices with different attributes via dax kmem, there will be a desire to assign them to different tierIDs. Having default memtier ID (DRAM) at 200 enables us to put these devices in the range [0 - 200) without updating the node to memtier mapping of existing NUMA nodes (ie, without updating default memtier). -aneesh
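A small illustration of the sparse-ID argument above: with the default DRAM tier fixed at 200 and slower dax/kmem-managed devices placed somewhere in [0, 200), a new device class can be slotted between existing tiers without renumbering nodes that are already assigned. The constants other than 100, 200 and 400 are hypothetical.

#include <stdio.h>

#define MAX_MEMTIER_ID		400	/* the "max_tier" value in the sysfs example */
#define DEFAULT_MEMTIER_ID	200	/* default / DRAM tier */
#define DAX_KMEM_MEMTIER_ID	100	/* tier dax kmem currently uses */

int main(void)
{
	/*
	 * A hypothetical device class slower than DRAM but faster than
	 * today's dax kmem memory can take an ID in between, leaving both
	 * existing IDs (and any userspace that recorded them) untouched.
	 */
	int new_device_tier = (DAX_KMEM_MEMTIER_ID + DEFAULT_MEMTIER_ID) / 2;

	printf("new device class -> memtier%d (of at most %d)\n",
	       new_device_tier, MAX_MEMTIER_ID);
	return 0;
}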
On Mon, Jul 11, 2022 at 10:10 PM Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> wrote: > > On 7/12/22 10:12 AM, Aneesh Kumar K V wrote: > > On 7/12/22 6:46 AM, Huang, Ying wrote: > >> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: > >> > >>> On 7/5/22 9:59 AM, Huang, Ying wrote: > >>>> Hi, Aneesh, > >>>> > >>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: > >>>> > >>>>> The current kernel has the basic memory tiering support: Inactive > >>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower > >>>>> tier NUMA node to make room for new allocations on the higher tier > >>>>> NUMA node. Frequently accessed pages on a lower tier NUMA node can be > >>>>> migrated (promoted) to a higher tier NUMA node to improve the > >>>>> performance. > >>>>> > >>>>> In the current kernel, memory tiers are defined implicitly via a > >>>>> demotion path relationship between NUMA nodes, which is created during > >>>>> the kernel initialization and updated when a NUMA node is hot-added or > >>>>> hot-removed. The current implementation puts all nodes with CPU into > >>>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing > >>>>> the per-node demotion targets based on the distances between nodes. > >>>>> > >>>>> This current memory tier kernel interface needs to be improved for > >>>>> several important use cases: > >>>>> > >>>>> * The current tier initialization code always initializes > >>>>> each memory-only NUMA node into a lower tier. But a memory-only > >>>>> NUMA node may have a high performance memory device (e.g. a DRAM > >>>>> device attached via CXL.mem or a DRAM-backed memory-only node on > >>>>> a virtual machine) and should be put into a higher tier. > >>>>> > >>>>> * The current tier hierarchy always puts CPU nodes into the top > >>>>> tier. But on a system with HBM (e.g. GPU memory) devices, these > >>>>> memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes > >>>>> with CPUs are better to be placed into the next lower tier. > >>>>> > >>>>> * Also because the current tier hierarchy always puts CPU nodes > >>>>> into the top tier, when a CPU is hot-added (or hot-removed) and > >>>>> triggers a memory node from CPU-less into a CPU node (or vice > >>>>> versa), the memory tier hierarchy gets changed, even though no > >>>>> memory node is added or removed. This can make the tier > >>>>> hierarchy unstable and make it difficult to support tier-based > >>>>> memory accounting. > >>>>> > >>>>> * A higher tier node can only be demoted to selected nodes on the > >>>>> next lower tier as defined by the demotion path, not any other > >>>>> node from any lower tier. This strict, hard-coded demotion order > >>>>> does not work in all use cases (e.g. some use cases may want to > >>>>> allow cross-socket demotion to another node in the same demotion > >>>>> tier as a fallback when the preferred demotion node is out of > >>>>> space), and has resulted in the feature request for an interface to > >>>>> override the system-wide, per-node demotion order from the > >>>>> userspace. This demotion order is also inconsistent with the page > >>>>> allocation fallback order when all the nodes in a higher tier are > >>>>> out of space: The page allocation can fall back to any node from > >>>>> any lower tier, whereas the demotion order doesn't allow that. > >>>>> > >>>>> * There are no interfaces for the userspace to learn about the memory > >>>>> tier hierarchy in order to optimize its memory allocations. 
> >>>>> > >>>>> This patch series make the creation of memory tiers explicit under > >>>>> the control of userspace or device driver. > >>>>> > >>>>> Memory Tier Initialization > >>>>> ========================== > >>>>> > >>>>> By default, all memory nodes are assigned to the default tier with > >>>>> tier ID value 200. > >>>>> > >>>>> A device driver can move up or down its memory nodes from the default > >>>>> tier. For example, PMEM can move down its memory nodes below the > >>>>> default tier, whereas GPU can move up its memory nodes above the > >>>>> default tier. > >>>>> > >>>>> The kernel initialization code makes the decision on which exact tier > >>>>> a memory node should be assigned to based on the requests from the > >>>>> device drivers as well as the memory device hardware information > >>>>> provided by the firmware. > >>>>> > >>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy. > >>>>> > >>>>> Memory Allocation for Demotion > >>>>> ============================== > >>>>> This patch series keep the demotion target page allocation logic same. > >>>>> The demotion page allocation pick the closest NUMA node in the > >>>>> next lower tier to the current NUMA node allocating pages from. > >>>>> > >>>>> This will be later improved to use the same page allocation strategy > >>>>> using fallback list. > >>>>> > >>>>> Sysfs Interface: > >>>>> ------------- > >>>>> Listing current list of memory tiers details: > >>>>> > >>>>> :/sys/devices/system/memtier$ ls > >>>>> default_tier max_tier memtier1 power uevent > >>>>> :/sys/devices/system/memtier$ cat default_tier > >>>>> memtier200 > >>>>> :/sys/devices/system/memtier$ cat max_tier > >>>>> 400 > >>>>> :/sys/devices/system/memtier$ > >>>>> > >>>>> Per node memory tier details: > >>>>> > >>>>> For a cpu only NUMA node: > >>>>> > >>>>> :/sys/devices/system/node# cat node0/memtier > >>>>> :/sys/devices/system/node# echo 1 > node0/memtier > >>>>> :/sys/devices/system/node# cat node0/memtier > >>>>> :/sys/devices/system/node# > >>>>> > >>>>> For a NUMA node with memory: > >>>>> :/sys/devices/system/node# cat node1/memtier > >>>>> 1 > >>>>> :/sys/devices/system/node# ls ../memtier/ > >>>>> default_tier max_tier memtier1 power uevent > >>>>> :/sys/devices/system/node# echo 2 > node1/memtier > >>>>> :/sys/devices/system/node# > >>>>> :/sys/devices/system/node# ls ../memtier/ > >>>>> default_tier max_tier memtier1 memtier2 power uevent > >>>>> :/sys/devices/system/node# cat node1/memtier > >>>>> 2 > >>>>> :/sys/devices/system/node# > >>>>> > >>>>> Removing a memory tier > >>>>> :/sys/devices/system/node# cat node1/memtier > >>>>> 2 > >>>>> :/sys/devices/system/node# echo 1 > node1/memtier > >>>> > >>>> Thanks a lot for your patchset. > >>>> > >>>> Per my understanding, we haven't reach consensus on > >>>> > >>>> - how to create the default memory tiers in kernel (via abstract > >>>> distance provided by drivers? Or use SLIT as the first step?) > >>>> > >>>> - how to override the default memory tiers from user space > >>>> > >>>> As in the following thread and email, > >>>> > >>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/ > >>>> > >>>> I think that we need to finalized on that firstly? > >>> > >>> I did list the proposal here > >>> > >>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com > >>> > >>> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated > >>> if the user wants a different tier topology. 
> >>> > >>> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200 > >>> > >>> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100. > >>> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use > >>> to control the tier assignment this can be a range of memory tiers. > >>> > >>> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update > >>> the memory tier assignment based on device attributes. > >> > >> Sorry for late reply. > >> > >> As the first step, it may be better to skip the parts that we haven't > >> reached consensus yet, for example, the user space interface to override > >> the default memory tiers. And we can use 0, 1, 2 as the default memory > >> tier IDs. We can refine/revise the in-kernel implementation, but we > >> cannot change the user space ABI. > >> > > > > Can you help list the use case that will be broken by using tierID as outlined in this series? > > One of the details that were mentioned earlier was the need to track top-tier memory usage in a > > memcg and IIUC the patchset posted https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com > > can work with tier IDs too. Let me know if you think otherwise. So at this point > > I am not sure which area we are still debating w.r.t the userspace interface. > > > > I will still keep the default tier IDs with a large range between them. That will allow > > us to go back to tierID based demotion order if we can. That is much simpler than using tierID and rank > > together. If we still want to go back to rank based approach the tierID value won't have much > > meaning anyway. > > > > Any feedback on patches 1 - 5, so that I can request Andrew to merge them? > > > > Looking at this again, I guess we just need to drop patch 7 > mm/demotion: Add per node memory tier attribute to sysfs ? > > We do agree to use the device model to expose memory tiers to userspace so patch 6 can still be included. > It also exposes max_tier, default_tier, and node list of a memory tier. All these are useful > and agreed upon. Hence patch 6 can be merged? > > patch 8 - 10 -> are done based on the request from others and is independent of how memory tiers > are exposed/created from userspace. Hence that can be merged? > > If you agree I can rebase the series moving patch 7,11,12 as the last patches in the series so > that we can skip merging them based on what we conclude w.r.t usage of rank. I think the most controversial part is the user visible interfaces so far. And IIUC the series could be split roughly into two parts, patch 1 - 5 and others. The patch 1 -5 added the explicit memory tier support and fixed the issue reported by Jagdish. I think we are on the same page for this part. But I haven't seen any thorough review on those patches yet since we got distracted by spending most time discussing about the user visible interfaces. So would it help to move things forward to submit patch 1 - 5 as a standalone series to get thorough review then get merged? > > -aneesh >
Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: > On 7/12/22 2:18 PM, Huang, Ying wrote: >> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: >> >>> On 7/12/22 12:29 PM, Huang, Ying wrote: >>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: >>>> >>>>> On 7/12/22 6:46 AM, Huang, Ying wrote: >>>>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: >>>>>> >>>>>>> On 7/5/22 9:59 AM, Huang, Ying wrote: >>>>>>>> Hi, Aneesh, >>>>>>>> >>>>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: >>>>>>>> >>>>>>>>> The current kernel has the basic memory tiering support: Inactive >>>>>>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower >>>>>>>>> tier NUMA node to make room for new allocations on the higher tier >>>>>>>>> NUMA node. Frequently accessed pages on a lower tier NUMA node can be >>>>>>>>> migrated (promoted) to a higher tier NUMA node to improve the >>>>>>>>> performance. >>>>>>>>> >>>>>>>>> In the current kernel, memory tiers are defined implicitly via a >>>>>>>>> demotion path relationship between NUMA nodes, which is created during >>>>>>>>> the kernel initialization and updated when a NUMA node is hot-added or >>>>>>>>> hot-removed. The current implementation puts all nodes with CPU into >>>>>>>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing >>>>>>>>> the per-node demotion targets based on the distances between nodes. >>>>>>>>> >>>>>>>>> This current memory tier kernel interface needs to be improved for >>>>>>>>> several important use cases: >>>>>>>>> >>>>>>>>> * The current tier initialization code always initializes >>>>>>>>> each memory-only NUMA node into a lower tier. But a memory-only >>>>>>>>> NUMA node may have a high performance memory device (e.g. a DRAM >>>>>>>>> device attached via CXL.mem or a DRAM-backed memory-only node on >>>>>>>>> a virtual machine) and should be put into a higher tier. >>>>>>>>> >>>>>>>>> * The current tier hierarchy always puts CPU nodes into the top >>>>>>>>> tier. But on a system with HBM (e.g. GPU memory) devices, these >>>>>>>>> memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes >>>>>>>>> with CPUs are better to be placed into the next lower tier. >>>>>>>>> >>>>>>>>> * Also because the current tier hierarchy always puts CPU nodes >>>>>>>>> into the top tier, when a CPU is hot-added (or hot-removed) and >>>>>>>>> triggers a memory node from CPU-less into a CPU node (or vice >>>>>>>>> versa), the memory tier hierarchy gets changed, even though no >>>>>>>>> memory node is added or removed. This can make the tier >>>>>>>>> hierarchy unstable and make it difficult to support tier-based >>>>>>>>> memory accounting. >>>>>>>>> >>>>>>>>> * A higher tier node can only be demoted to selected nodes on the >>>>>>>>> next lower tier as defined by the demotion path, not any other >>>>>>>>> node from any lower tier. This strict, hard-coded demotion order >>>>>>>>> does not work in all use cases (e.g. some use cases may want to >>>>>>>>> allow cross-socket demotion to another node in the same demotion >>>>>>>>> tier as a fallback when the preferred demotion node is out of >>>>>>>>> space), and has resulted in the feature request for an interface to >>>>>>>>> override the system-wide, per-node demotion order from the >>>>>>>>> userspace. 
This demotion order is also inconsistent with the page >>>>>>>>> allocation fallback order when all the nodes in a higher tier are >>>>>>>>> out of space: The page allocation can fall back to any node from >>>>>>>>> any lower tier, whereas the demotion order doesn't allow that. >>>>>>>>> >>>>>>>>> * There are no interfaces for the userspace to learn about the memory >>>>>>>>> tier hierarchy in order to optimize its memory allocations. >>>>>>>>> >>>>>>>>> This patch series make the creation of memory tiers explicit under >>>>>>>>> the control of userspace or device driver. >>>>>>>>> >>>>>>>>> Memory Tier Initialization >>>>>>>>> ========================== >>>>>>>>> >>>>>>>>> By default, all memory nodes are assigned to the default tier with >>>>>>>>> tier ID value 200. >>>>>>>>> >>>>>>>>> A device driver can move up or down its memory nodes from the default >>>>>>>>> tier. For example, PMEM can move down its memory nodes below the >>>>>>>>> default tier, whereas GPU can move up its memory nodes above the >>>>>>>>> default tier. >>>>>>>>> >>>>>>>>> The kernel initialization code makes the decision on which exact tier >>>>>>>>> a memory node should be assigned to based on the requests from the >>>>>>>>> device drivers as well as the memory device hardware information >>>>>>>>> provided by the firmware. >>>>>>>>> >>>>>>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy. >>>>>>>>> >>>>>>>>> Memory Allocation for Demotion >>>>>>>>> ============================== >>>>>>>>> This patch series keep the demotion target page allocation logic same. >>>>>>>>> The demotion page allocation pick the closest NUMA node in the >>>>>>>>> next lower tier to the current NUMA node allocating pages from. >>>>>>>>> >>>>>>>>> This will be later improved to use the same page allocation strategy >>>>>>>>> using fallback list. >>>>>>>>> >>>>>>>>> Sysfs Interface: >>>>>>>>> ------------- >>>>>>>>> Listing current list of memory tiers details: >>>>>>>>> >>>>>>>>> :/sys/devices/system/memtier$ ls >>>>>>>>> default_tier max_tier memtier1 power uevent >>>>>>>>> :/sys/devices/system/memtier$ cat default_tier >>>>>>>>> memtier200 >>>>>>>>> :/sys/devices/system/memtier$ cat max_tier >>>>>>>>> 400 >>>>>>>>> :/sys/devices/system/memtier$ >>>>>>>>> >>>>>>>>> Per node memory tier details: >>>>>>>>> >>>>>>>>> For a cpu only NUMA node: >>>>>>>>> >>>>>>>>> :/sys/devices/system/node# cat node0/memtier >>>>>>>>> :/sys/devices/system/node# echo 1 > node0/memtier >>>>>>>>> :/sys/devices/system/node# cat node0/memtier >>>>>>>>> :/sys/devices/system/node# >>>>>>>>> >>>>>>>>> For a NUMA node with memory: >>>>>>>>> :/sys/devices/system/node# cat node1/memtier >>>>>>>>> 1 >>>>>>>>> :/sys/devices/system/node# ls ../memtier/ >>>>>>>>> default_tier max_tier memtier1 power uevent >>>>>>>>> :/sys/devices/system/node# echo 2 > node1/memtier >>>>>>>>> :/sys/devices/system/node# >>>>>>>>> :/sys/devices/system/node# ls ../memtier/ >>>>>>>>> default_tier max_tier memtier1 memtier2 power uevent >>>>>>>>> :/sys/devices/system/node# cat node1/memtier >>>>>>>>> 2 >>>>>>>>> :/sys/devices/system/node# >>>>>>>>> >>>>>>>>> Removing a memory tier >>>>>>>>> :/sys/devices/system/node# cat node1/memtier >>>>>>>>> 2 >>>>>>>>> :/sys/devices/system/node# echo 1 > node1/memtier >>>>>>>> >>>>>>>> Thanks a lot for your patchset. >>>>>>>> >>>>>>>> Per my understanding, we haven't reach consensus on >>>>>>>> >>>>>>>> - how to create the default memory tiers in kernel (via abstract >>>>>>>> distance provided by drivers? Or use SLIT as the first step?) 
>>>>>>>> >>>>>>>> - how to override the default memory tiers from user space >>>>>>>> >>>>>>>> As in the following thread and email, >>>>>>>> >>>>>>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/ >>>>>>>> >>>>>>>> I think that we need to finalized on that firstly? >>>>>>> >>>>>>> I did list the proposal here >>>>>>> >>>>>>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com >>>>>>> >>>>>>> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated >>>>>>> if the user wants a different tier topology. >>>>>>> >>>>>>> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200 >>>>>>> >>>>>>> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100. >>>>>>> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use >>>>>>> to control the tier assignment this can be a range of memory tiers. >>>>>>> >>>>>>> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update >>>>>>> the memory tier assignment based on device attributes. >>>>>> >>>>>> Sorry for late reply. >>>>>> >>>>>> As the first step, it may be better to skip the parts that we haven't >>>>>> reached consensus yet, for example, the user space interface to override >>>>>> the default memory tiers. And we can use 0, 1, 2 as the default memory >>>>>> tier IDs. We can refine/revise the in-kernel implementation, but we >>>>>> cannot change the user space ABI. >>>>>> >>>>> >>>>> Can you help list the use case that will be broken by using tierID as outlined in this series? >>>>> One of the details that were mentioned earlier was the need to track top-tier memory usage in a >>>>> memcg and IIUC the patchset posted https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com >>>>> can work with tier IDs too. Let me know if you think otherwise. So at this point >>>>> I am not sure which area we are still debating w.r.t the userspace interface. >>>> >>>> In >>>> >>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/ >>>> >>>> per my understanding, Johannes suggested to override the kernel default >>>> memory tiers with "abstract distance" via drivers implementing memory >>>> devices. As you said in another email, that is related to [7/12] of the >>>> series. And we can table it for future. >>>> >>>> And per my understanding, he also suggested to make memory tier IDs >>>> dynamic. For example, after the "abstract distance" of a driver is >>>> overridden by users, the total number of memory tiers may be changed, >>>> and the memory tier ID of some nodes may be changed too. This will make >>>> memory tier ID easier to be understood, but more unstable. For example, >>>> this will make it harder to specify the per-memory-tier memory partition >>>> for a cgroup. >>>> >>> >>> With all the approaches we discussed so far, a memory tier of a numa node can be changed. >>> ie, pgdat->memtier can change anytime. The per memcg top tier mem usage tracking patches >>> posted here >>> https://lore.kernel.org/linux-mm/cefeb63173fa0fac7543315a2abbd4b5a1b25af8.1655242024.git.tim.c.chen@linux.intel.com/ >>> doesn't consider the node movement from one memory tier to another. If we need >>> a stable pgdat->memtier we will have to prevent a node memory tier reassignment >>> while we have pages from the memory tier charged to a cgroup. 
This patchset should not >>> prevent such a restriction. >> >> Absolute stableness doesn't exist even in "rank" based solution. But >> "rank" can improve the stableness at some degree. For example, if we >> move the tier of HBM nodes (from below DRAM to above DRAM), the DRAM >> nodes can keep its memory tier ID stable. This may be not a real issue >> finally. But we need to discuss that. >> > > I agree that using ranks gives us the flexibility to change demotion order > without being blocked by cgroup usage. But how frequently do we expect the > tier assignment to change? My expectation was these reassignments are going > to be rare and won't happen frequently after a system is up and running? > Hence using tierID for demotion order won't prevent a node reassignment > much because we don't expect to change the node tierID during runtime. In > the rare case we do, we will have to make sure there is no cgroup usage from > the specific memory tier. > > Even if we use ranks, we will have to avoid a rank update, if such > an update can change the meaning of top tier? ie, if a rank update > can result in a node being moved from top tier to non top tier. > >> Tim has suggested to use top-tier(s) memory partition among cgroups. >> But I don't think that has been finalized. We may use per-memory-tier >> memory partition among cgroups. I don't know whether Wei will use that >> (may be implemented in the user space). >> >> And, if we thought stableness between nodes and memory tier ID isn't >> important. Why should we use sparse memory device IDs (that is, 100, >> 200, 300)? Why not just 0, 1, 2, ...? That looks more natural. >> > > > The range allows us to use memtier ID for demotion order. ie, as we start initializing > devices with different attributes via dax kmem, there will be a desire to > assign them to different tierIDs. Having default memtier ID (DRAM) at 200 enables > us to put these devices in the range [0 - 200) without updating the node to memtier > mapping of existing NUMA nodes (ie, without updating default memtier).

I believe that sparse memory tier IDs can make memory tiers more stable in some cases. But this is different from the system suggested by Johannes. Per my understanding, with Johannes' system, we would have:

- one driver may online different memory types (such as kmem_dax, which may online HBM, PMEM, etc.)

- one memory type manages several memory nodes (NUMA nodes)

- one "abstract distance" for each memory type

- the "abstract distance" can be offset by a user space override knob

- memory tiers are generated dynamically from the different memory types according to their "abstract distance" and the overridden "offset"

- the granularity used to group several memory types into one memory tier can be overridden via a user space knob

In this way, the memory tiers may change completely after a user space override. It may be hard to link the memory tiers before/after the override. So we may need to reset all per-memory-tier configuration, such as cgroup partition limit or interleave weight, etc.

Personally, I think the system above makes sense. But I think we need to make sure that it satisfies the requirements.

Best Regards, Huang, Ying
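To make the sparse-ID scheme defended in the quoted discussion above concrete, here is a small userspace toy model (the helper names, the node layout and the simplified demotion policy are illustrative assumptions, not code from the posted series): a default DRAM tier at 200 leaves the whole [0, 200) range free, so a driver such as dax/kmem can place its nodes at tier 100 without renumbering any existing node's tier, and a tier-ID-based demotion walk simply looks for the next lower populated tier.

/*
 * Toy model of sparse memory tier IDs: DRAM defaults to tier 200 and a
 * driver-managed (e.g. dax/kmem) node registers at tier 100, so adding
 * the new tier never renumbers the tiers of existing nodes.
 */
#include <stdio.h>

#define MAX_NODES         4
#define DEFAULT_MEMTIER   200   /* DRAM default tier, per the cover letter */
#define DAX_KMEM_MEMTIER  100   /* slower driver-managed memory */

static int node_memtier[MAX_NODES] = {
        [0] = DEFAULT_MEMTIER,   /* CPU + DRAM node */
        [1] = DEFAULT_MEMTIER,   /* CPU + DRAM node */
        [2] = DAX_KMEM_MEMTIER,  /* dax/kmem (e.g. PMEM) node */
        [3] = DAX_KMEM_MEMTIER,
};

/* Find some node in the next lower populated tier. */
static int demotion_target(int node)
{
        int tier = node_memtier[node];
        int best_tier = -1, target = -1;

        for (int n = 0; n < MAX_NODES; n++) {
                if (node_memtier[n] < tier && node_memtier[n] > best_tier) {
                        best_tier = node_memtier[n];
                        target = n;
                }
        }
        return target;   /* -1: already in the lowest tier */
}

int main(void)
{
        for (int n = 0; n < MAX_NODES; n++) {
                int t = demotion_target(n);

                if (t < 0)
                        printf("node%d (tier %d): no lower tier to demote to\n",
                               n, node_memtier[n]);
                else
                        printf("node%d (tier %d): demote to node%d (tier %d)\n",
                               n, node_memtier[n], t, node_memtier[t]);
        }
        return 0;
}

The walk above ignores NUMA distance; the posted series picks the closest node in the next lower tier, and the cover letter notes this is expected to move to a fallback-list style allocation later.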
Yang Shi <shy828301@gmail.com> writes: > On Mon, Jul 11, 2022 at 10:10 PM Aneesh Kumar K V > <aneesh.kumar@linux.ibm.com> wrote: >> >> On 7/12/22 10:12 AM, Aneesh Kumar K V wrote: >> > On 7/12/22 6:46 AM, Huang, Ying wrote: >> >> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: >> >> >> >>> On 7/5/22 9:59 AM, Huang, Ying wrote: >> >>>> Hi, Aneesh, >> >>>> >> >>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: >> >>>> >> >>>>> The current kernel has the basic memory tiering support: Inactive >> >>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower >> >>>>> tier NUMA node to make room for new allocations on the higher tier >> >>>>> NUMA node. Frequently accessed pages on a lower tier NUMA node can be >> >>>>> migrated (promoted) to a higher tier NUMA node to improve the >> >>>>> performance. >> >>>>> >> >>>>> In the current kernel, memory tiers are defined implicitly via a >> >>>>> demotion path relationship between NUMA nodes, which is created during >> >>>>> the kernel initialization and updated when a NUMA node is hot-added or >> >>>>> hot-removed. The current implementation puts all nodes with CPU into >> >>>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing >> >>>>> the per-node demotion targets based on the distances between nodes. >> >>>>> >> >>>>> This current memory tier kernel interface needs to be improved for >> >>>>> several important use cases: >> >>>>> >> >>>>> * The current tier initialization code always initializes >> >>>>> each memory-only NUMA node into a lower tier. But a memory-only >> >>>>> NUMA node may have a high performance memory device (e.g. a DRAM >> >>>>> device attached via CXL.mem or a DRAM-backed memory-only node on >> >>>>> a virtual machine) and should be put into a higher tier. >> >>>>> >> >>>>> * The current tier hierarchy always puts CPU nodes into the top >> >>>>> tier. But on a system with HBM (e.g. GPU memory) devices, these >> >>>>> memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes >> >>>>> with CPUs are better to be placed into the next lower tier. >> >>>>> >> >>>>> * Also because the current tier hierarchy always puts CPU nodes >> >>>>> into the top tier, when a CPU is hot-added (or hot-removed) and >> >>>>> triggers a memory node from CPU-less into a CPU node (or vice >> >>>>> versa), the memory tier hierarchy gets changed, even though no >> >>>>> memory node is added or removed. This can make the tier >> >>>>> hierarchy unstable and make it difficult to support tier-based >> >>>>> memory accounting. >> >>>>> >> >>>>> * A higher tier node can only be demoted to selected nodes on the >> >>>>> next lower tier as defined by the demotion path, not any other >> >>>>> node from any lower tier. This strict, hard-coded demotion order >> >>>>> does not work in all use cases (e.g. some use cases may want to >> >>>>> allow cross-socket demotion to another node in the same demotion >> >>>>> tier as a fallback when the preferred demotion node is out of >> >>>>> space), and has resulted in the feature request for an interface to >> >>>>> override the system-wide, per-node demotion order from the >> >>>>> userspace. This demotion order is also inconsistent with the page >> >>>>> allocation fallback order when all the nodes in a higher tier are >> >>>>> out of space: The page allocation can fall back to any node from >> >>>>> any lower tier, whereas the demotion order doesn't allow that. 
>> >>>>> >> >>>>> * There are no interfaces for the userspace to learn about the memory >> >>>>> tier hierarchy in order to optimize its memory allocations. >> >>>>> >> >>>>> This patch series make the creation of memory tiers explicit under >> >>>>> the control of userspace or device driver. >> >>>>> >> >>>>> Memory Tier Initialization >> >>>>> ========================== >> >>>>> >> >>>>> By default, all memory nodes are assigned to the default tier with >> >>>>> tier ID value 200. >> >>>>> >> >>>>> A device driver can move up or down its memory nodes from the default >> >>>>> tier. For example, PMEM can move down its memory nodes below the >> >>>>> default tier, whereas GPU can move up its memory nodes above the >> >>>>> default tier. >> >>>>> >> >>>>> The kernel initialization code makes the decision on which exact tier >> >>>>> a memory node should be assigned to based on the requests from the >> >>>>> device drivers as well as the memory device hardware information >> >>>>> provided by the firmware. >> >>>>> >> >>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy. >> >>>>> >> >>>>> Memory Allocation for Demotion >> >>>>> ============================== >> >>>>> This patch series keep the demotion target page allocation logic same. >> >>>>> The demotion page allocation pick the closest NUMA node in the >> >>>>> next lower tier to the current NUMA node allocating pages from. >> >>>>> >> >>>>> This will be later improved to use the same page allocation strategy >> >>>>> using fallback list. >> >>>>> >> >>>>> Sysfs Interface: >> >>>>> ------------- >> >>>>> Listing current list of memory tiers details: >> >>>>> >> >>>>> :/sys/devices/system/memtier$ ls >> >>>>> default_tier max_tier memtier1 power uevent >> >>>>> :/sys/devices/system/memtier$ cat default_tier >> >>>>> memtier200 >> >>>>> :/sys/devices/system/memtier$ cat max_tier >> >>>>> 400 >> >>>>> :/sys/devices/system/memtier$ >> >>>>> >> >>>>> Per node memory tier details: >> >>>>> >> >>>>> For a cpu only NUMA node: >> >>>>> >> >>>>> :/sys/devices/system/node# cat node0/memtier >> >>>>> :/sys/devices/system/node# echo 1 > node0/memtier >> >>>>> :/sys/devices/system/node# cat node0/memtier >> >>>>> :/sys/devices/system/node# >> >>>>> >> >>>>> For a NUMA node with memory: >> >>>>> :/sys/devices/system/node# cat node1/memtier >> >>>>> 1 >> >>>>> :/sys/devices/system/node# ls ../memtier/ >> >>>>> default_tier max_tier memtier1 power uevent >> >>>>> :/sys/devices/system/node# echo 2 > node1/memtier >> >>>>> :/sys/devices/system/node# >> >>>>> :/sys/devices/system/node# ls ../memtier/ >> >>>>> default_tier max_tier memtier1 memtier2 power uevent >> >>>>> :/sys/devices/system/node# cat node1/memtier >> >>>>> 2 >> >>>>> :/sys/devices/system/node# >> >>>>> >> >>>>> Removing a memory tier >> >>>>> :/sys/devices/system/node# cat node1/memtier >> >>>>> 2 >> >>>>> :/sys/devices/system/node# echo 1 > node1/memtier >> >>>> >> >>>> Thanks a lot for your patchset. >> >>>> >> >>>> Per my understanding, we haven't reach consensus on >> >>>> >> >>>> - how to create the default memory tiers in kernel (via abstract >> >>>> distance provided by drivers? Or use SLIT as the first step?) >> >>>> >> >>>> - how to override the default memory tiers from user space >> >>>> >> >>>> As in the following thread and email, >> >>>> >> >>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/ >> >>>> >> >>>> I think that we need to finalized on that firstly? 
>> >>> >> >>> I did list the proposal here >> >>> >> >>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com >> >>> >> >>> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated >> >>> if the user wants a different tier topology. >> >>> >> >>> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200 >> >>> >> >>> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100. >> >>> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use >> >>> to control the tier assignment this can be a range of memory tiers. >> >>> >> >>> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update >> >>> the memory tier assignment based on device attributes. >> >> >> >> Sorry for late reply. >> >> >> >> As the first step, it may be better to skip the parts that we haven't >> >> reached consensus yet, for example, the user space interface to override >> >> the default memory tiers. And we can use 0, 1, 2 as the default memory >> >> tier IDs. We can refine/revise the in-kernel implementation, but we >> >> cannot change the user space ABI. >> >> >> > >> > Can you help list the use case that will be broken by using tierID as outlined in this series? >> > One of the details that were mentioned earlier was the need to track top-tier memory usage in a >> > memcg and IIUC the patchset posted https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com >> > can work with tier IDs too. Let me know if you think otherwise. So at this point >> > I am not sure which area we are still debating w.r.t the userspace interface. >> > >> > I will still keep the default tier IDs with a large range between them. That will allow >> > us to go back to tierID based demotion order if we can. That is much simpler than using tierID and rank >> > together. If we still want to go back to rank based approach the tierID value won't have much >> > meaning anyway. >> > >> > Any feedback on patches 1 - 5, so that I can request Andrew to merge them? >> > >> >> Looking at this again, I guess we just need to drop patch 7 >> mm/demotion: Add per node memory tier attribute to sysfs ? >> >> We do agree to use the device model to expose memory tiers to userspace so patch 6 can still be included. >> It also exposes max_tier, default_tier, and node list of a memory tier. All these are useful >> and agreed upon. Hence patch 6 can be merged? >> >> patch 8 - 10 -> are done based on the request from others and is independent of how memory tiers >> are exposed/created from userspace. Hence that can be merged? >> >> If you agree I can rebase the series moving patch 7,11,12 as the last patches in the series so >> that we can skip merging them based on what we conclude w.r.t usage of rank. > > I think the most controversial part is the user visible interfaces so > far. And IIUC the series could be split roughly into two parts, patch > 1 - 5 and others. The patch 1 -5 added the explicit memory tier > support and fixed the issue reported by Jagdish. I think we are on the > same page for this part. But I haven't seen any thorough review on > those patches yet since we got distracted by spending most time > discussing about the user visible interfaces. 
> > So would it help to move things forward to submit patch 1 - 5 as a > standalone series to get thorough review then get merged? Yes. I think this is a good idea. We can discuss the in-kernel implementation (without the user space interface) in detail and try to get it merged. And we can continue our discussion of the user space interface in a separate thread. Best Regards, Huang, Ying
On Tue, Jul 12, 2022 at 8:42 PM Huang, Ying <ying.huang@intel.com> wrote: > > Yang Shi <shy828301@gmail.com> writes: > > > On Mon, Jul 11, 2022 at 10:10 PM Aneesh Kumar K V > > <aneesh.kumar@linux.ibm.com> wrote: > >> > >> On 7/12/22 10:12 AM, Aneesh Kumar K V wrote: > >> > On 7/12/22 6:46 AM, Huang, Ying wrote: > >> >> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: > >> >> > >> >>> On 7/5/22 9:59 AM, Huang, Ying wrote: > >> >>>> Hi, Aneesh, > >> >>>> > >> >>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: > >> >>>> > >> >>>>> The current kernel has the basic memory tiering support: Inactive > >> >>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower > >> >>>>> tier NUMA node to make room for new allocations on the higher tier > >> >>>>> NUMA node. Frequently accessed pages on a lower tier NUMA node can be > >> >>>>> migrated (promoted) to a higher tier NUMA node to improve the > >> >>>>> performance. > >> >>>>> > >> >>>>> In the current kernel, memory tiers are defined implicitly via a > >> >>>>> demotion path relationship between NUMA nodes, which is created during > >> >>>>> the kernel initialization and updated when a NUMA node is hot-added or > >> >>>>> hot-removed. The current implementation puts all nodes with CPU into > >> >>>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing > >> >>>>> the per-node demotion targets based on the distances between nodes. > >> >>>>> > >> >>>>> This current memory tier kernel interface needs to be improved for > >> >>>>> several important use cases: > >> >>>>> > >> >>>>> * The current tier initialization code always initializes > >> >>>>> each memory-only NUMA node into a lower tier. But a memory-only > >> >>>>> NUMA node may have a high performance memory device (e.g. a DRAM > >> >>>>> device attached via CXL.mem or a DRAM-backed memory-only node on > >> >>>>> a virtual machine) and should be put into a higher tier. > >> >>>>> > >> >>>>> * The current tier hierarchy always puts CPU nodes into the top > >> >>>>> tier. But on a system with HBM (e.g. GPU memory) devices, these > >> >>>>> memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes > >> >>>>> with CPUs are better to be placed into the next lower tier. > >> >>>>> > >> >>>>> * Also because the current tier hierarchy always puts CPU nodes > >> >>>>> into the top tier, when a CPU is hot-added (or hot-removed) and > >> >>>>> triggers a memory node from CPU-less into a CPU node (or vice > >> >>>>> versa), the memory tier hierarchy gets changed, even though no > >> >>>>> memory node is added or removed. This can make the tier > >> >>>>> hierarchy unstable and make it difficult to support tier-based > >> >>>>> memory accounting. > >> >>>>> > >> >>>>> * A higher tier node can only be demoted to selected nodes on the > >> >>>>> next lower tier as defined by the demotion path, not any other > >> >>>>> node from any lower tier. This strict, hard-coded demotion order > >> >>>>> does not work in all use cases (e.g. some use cases may want to > >> >>>>> allow cross-socket demotion to another node in the same demotion > >> >>>>> tier as a fallback when the preferred demotion node is out of > >> >>>>> space), and has resulted in the feature request for an interface to > >> >>>>> override the system-wide, per-node demotion order from the > >> >>>>> userspace. 
This demotion order is also inconsistent with the page > >> >>>>> allocation fallback order when all the nodes in a higher tier are > >> >>>>> out of space: The page allocation can fall back to any node from > >> >>>>> any lower tier, whereas the demotion order doesn't allow that. > >> >>>>> > >> >>>>> * There are no interfaces for the userspace to learn about the memory > >> >>>>> tier hierarchy in order to optimize its memory allocations. > >> >>>>> > >> >>>>> This patch series make the creation of memory tiers explicit under > >> >>>>> the control of userspace or device driver. > >> >>>>> > >> >>>>> Memory Tier Initialization > >> >>>>> ========================== > >> >>>>> > >> >>>>> By default, all memory nodes are assigned to the default tier with > >> >>>>> tier ID value 200. > >> >>>>> > >> >>>>> A device driver can move up or down its memory nodes from the default > >> >>>>> tier. For example, PMEM can move down its memory nodes below the > >> >>>>> default tier, whereas GPU can move up its memory nodes above the > >> >>>>> default tier. > >> >>>>> > >> >>>>> The kernel initialization code makes the decision on which exact tier > >> >>>>> a memory node should be assigned to based on the requests from the > >> >>>>> device drivers as well as the memory device hardware information > >> >>>>> provided by the firmware. > >> >>>>> > >> >>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy. > >> >>>>> > >> >>>>> Memory Allocation for Demotion > >> >>>>> ============================== > >> >>>>> This patch series keep the demotion target page allocation logic same. > >> >>>>> The demotion page allocation pick the closest NUMA node in the > >> >>>>> next lower tier to the current NUMA node allocating pages from. > >> >>>>> > >> >>>>> This will be later improved to use the same page allocation strategy > >> >>>>> using fallback list. > >> >>>>> > >> >>>>> Sysfs Interface: > >> >>>>> ------------- > >> >>>>> Listing current list of memory tiers details: > >> >>>>> > >> >>>>> :/sys/devices/system/memtier$ ls > >> >>>>> default_tier max_tier memtier1 power uevent > >> >>>>> :/sys/devices/system/memtier$ cat default_tier > >> >>>>> memtier200 > >> >>>>> :/sys/devices/system/memtier$ cat max_tier > >> >>>>> 400 > >> >>>>> :/sys/devices/system/memtier$ > >> >>>>> > >> >>>>> Per node memory tier details: > >> >>>>> > >> >>>>> For a cpu only NUMA node: > >> >>>>> > >> >>>>> :/sys/devices/system/node# cat node0/memtier > >> >>>>> :/sys/devices/system/node# echo 1 > node0/memtier > >> >>>>> :/sys/devices/system/node# cat node0/memtier > >> >>>>> :/sys/devices/system/node# > >> >>>>> > >> >>>>> For a NUMA node with memory: > >> >>>>> :/sys/devices/system/node# cat node1/memtier > >> >>>>> 1 > >> >>>>> :/sys/devices/system/node# ls ../memtier/ > >> >>>>> default_tier max_tier memtier1 power uevent > >> >>>>> :/sys/devices/system/node# echo 2 > node1/memtier > >> >>>>> :/sys/devices/system/node# > >> >>>>> :/sys/devices/system/node# ls ../memtier/ > >> >>>>> default_tier max_tier memtier1 memtier2 power uevent > >> >>>>> :/sys/devices/system/node# cat node1/memtier > >> >>>>> 2 > >> >>>>> :/sys/devices/system/node# > >> >>>>> > >> >>>>> Removing a memory tier > >> >>>>> :/sys/devices/system/node# cat node1/memtier > >> >>>>> 2 > >> >>>>> :/sys/devices/system/node# echo 1 > node1/memtier > >> >>>> > >> >>>> Thanks a lot for your patchset. 
> >> >>>> > >> >>>> Per my understanding, we haven't reach consensus on > >> >>>> > >> >>>> - how to create the default memory tiers in kernel (via abstract > >> >>>> distance provided by drivers? Or use SLIT as the first step?) > >> >>>> > >> >>>> - how to override the default memory tiers from user space > >> >>>> > >> >>>> As in the following thread and email, > >> >>>> > >> >>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/ > >> >>>> > >> >>>> I think that we need to finalized on that firstly? > >> >>> > >> >>> I did list the proposal here > >> >>> > >> >>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com > >> >>> > >> >>> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated > >> >>> if the user wants a different tier topology. > >> >>> > >> >>> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200 > >> >>> > >> >>> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100. > >> >>> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use > >> >>> to control the tier assignment this can be a range of memory tiers. > >> >>> > >> >>> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update > >> >>> the memory tier assignment based on device attributes. > >> >> > >> >> Sorry for late reply. > >> >> > >> >> As the first step, it may be better to skip the parts that we haven't > >> >> reached consensus yet, for example, the user space interface to override > >> >> the default memory tiers. And we can use 0, 1, 2 as the default memory > >> >> tier IDs. We can refine/revise the in-kernel implementation, but we > >> >> cannot change the user space ABI. > >> >> > >> > > >> > Can you help list the use case that will be broken by using tierID as outlined in this series? > >> > One of the details that were mentioned earlier was the need to track top-tier memory usage in a > >> > memcg and IIUC the patchset posted https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com > >> > can work with tier IDs too. Let me know if you think otherwise. So at this point > >> > I am not sure which area we are still debating w.r.t the userspace interface. > >> > > >> > I will still keep the default tier IDs with a large range between them. That will allow > >> > us to go back to tierID based demotion order if we can. That is much simpler than using tierID and rank > >> > together. If we still want to go back to rank based approach the tierID value won't have much > >> > meaning anyway. > >> > > >> > Any feedback on patches 1 - 5, so that I can request Andrew to merge them? > >> > > >> > >> Looking at this again, I guess we just need to drop patch 7 > >> mm/demotion: Add per node memory tier attribute to sysfs ? > >> > >> We do agree to use the device model to expose memory tiers to userspace so patch 6 can still be included. > >> It also exposes max_tier, default_tier, and node list of a memory tier. All these are useful > >> and agreed upon. Hence patch 6 can be merged? > >> > >> patch 8 - 10 -> are done based on the request from others and is independent of how memory tiers > >> are exposed/created from userspace. Hence that can be merged? 
> >> > >> If you agree I can rebase the series moving patch 7,11,12 as the last patches in the series so > >> that we can skip merging them based on what we conclude w.r.t usage of rank. > > > > I think the most controversial part is the user visible interfaces so > > far. And IIUC the series could be split roughly into two parts, patch > > 1 - 5 and others. The patch 1 -5 added the explicit memory tier > > support and fixed the issue reported by Jagdish. I think we are on the > > same page for this part. But I haven't seen any thorough review on > > those patches yet since we got distracted by spending most time > > discussing about the user visible interfaces. > > > > So would it help to move things forward to submit patch 1 - 5 as a > > standalone series to get thorough review then get merged? > > Yes. I think this is a good idea. We can discuss the in kernel > implementation (without user space interface) in details and try to make > it merged. > > And we can continue our discussion of user space interface in a separate > thread. > > Best Regards, > Huang, Ying >

I also agree that it is a good idea to split this patch series into the kernel and userspace parts. The current sysfs interface provides more dynamic memtiers than I had expected. Let's discuss that further after the kernel space changes are finalized.

Wei
On Tue, Jul 12, 2022 at 8:03 PM Huang, Ying <ying.huang@intel.com> wrote: > > Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: > > > On 7/12/22 2:18 PM, Huang, Ying wrote: > >> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: > >> > >>> On 7/12/22 12:29 PM, Huang, Ying wrote: > >>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: > >>>> > >>>>> On 7/12/22 6:46 AM, Huang, Ying wrote: > >>>>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: > >>>>>> > >>>>>>> On 7/5/22 9:59 AM, Huang, Ying wrote: > >>>>>>>> Hi, Aneesh, > >>>>>>>> > >>>>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: > >>>>>>>> > >>>>>>>>> The current kernel has the basic memory tiering support: Inactive > >>>>>>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower > >>>>>>>>> tier NUMA node to make room for new allocations on the higher tier > >>>>>>>>> NUMA node. Frequently accessed pages on a lower tier NUMA node can be > >>>>>>>>> migrated (promoted) to a higher tier NUMA node to improve the > >>>>>>>>> performance. > >>>>>>>>> > >>>>>>>>> In the current kernel, memory tiers are defined implicitly via a > >>>>>>>>> demotion path relationship between NUMA nodes, which is created during > >>>>>>>>> the kernel initialization and updated when a NUMA node is hot-added or > >>>>>>>>> hot-removed. The current implementation puts all nodes with CPU into > >>>>>>>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing > >>>>>>>>> the per-node demotion targets based on the distances between nodes. > >>>>>>>>> > >>>>>>>>> This current memory tier kernel interface needs to be improved for > >>>>>>>>> several important use cases: > >>>>>>>>> > >>>>>>>>> * The current tier initialization code always initializes > >>>>>>>>> each memory-only NUMA node into a lower tier. But a memory-only > >>>>>>>>> NUMA node may have a high performance memory device (e.g. a DRAM > >>>>>>>>> device attached via CXL.mem or a DRAM-backed memory-only node on > >>>>>>>>> a virtual machine) and should be put into a higher tier. > >>>>>>>>> > >>>>>>>>> * The current tier hierarchy always puts CPU nodes into the top > >>>>>>>>> tier. But on a system with HBM (e.g. GPU memory) devices, these > >>>>>>>>> memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes > >>>>>>>>> with CPUs are better to be placed into the next lower tier. > >>>>>>>>> > >>>>>>>>> * Also because the current tier hierarchy always puts CPU nodes > >>>>>>>>> into the top tier, when a CPU is hot-added (or hot-removed) and > >>>>>>>>> triggers a memory node from CPU-less into a CPU node (or vice > >>>>>>>>> versa), the memory tier hierarchy gets changed, even though no > >>>>>>>>> memory node is added or removed. This can make the tier > >>>>>>>>> hierarchy unstable and make it difficult to support tier-based > >>>>>>>>> memory accounting. > >>>>>>>>> > >>>>>>>>> * A higher tier node can only be demoted to selected nodes on the > >>>>>>>>> next lower tier as defined by the demotion path, not any other > >>>>>>>>> node from any lower tier. This strict, hard-coded demotion order > >>>>>>>>> does not work in all use cases (e.g. some use cases may want to > >>>>>>>>> allow cross-socket demotion to another node in the same demotion > >>>>>>>>> tier as a fallback when the preferred demotion node is out of > >>>>>>>>> space), and has resulted in the feature request for an interface to > >>>>>>>>> override the system-wide, per-node demotion order from the > >>>>>>>>> userspace. 
This demotion order is also inconsistent with the page > >>>>>>>>> allocation fallback order when all the nodes in a higher tier are > >>>>>>>>> out of space: The page allocation can fall back to any node from > >>>>>>>>> any lower tier, whereas the demotion order doesn't allow that. > >>>>>>>>> > >>>>>>>>> * There are no interfaces for the userspace to learn about the memory > >>>>>>>>> tier hierarchy in order to optimize its memory allocations. > >>>>>>>>> > >>>>>>>>> This patch series make the creation of memory tiers explicit under > >>>>>>>>> the control of userspace or device driver. > >>>>>>>>> > >>>>>>>>> Memory Tier Initialization > >>>>>>>>> ========================== > >>>>>>>>> > >>>>>>>>> By default, all memory nodes are assigned to the default tier with > >>>>>>>>> tier ID value 200. > >>>>>>>>> > >>>>>>>>> A device driver can move up or down its memory nodes from the default > >>>>>>>>> tier. For example, PMEM can move down its memory nodes below the > >>>>>>>>> default tier, whereas GPU can move up its memory nodes above the > >>>>>>>>> default tier. > >>>>>>>>> > >>>>>>>>> The kernel initialization code makes the decision on which exact tier > >>>>>>>>> a memory node should be assigned to based on the requests from the > >>>>>>>>> device drivers as well as the memory device hardware information > >>>>>>>>> provided by the firmware. > >>>>>>>>> > >>>>>>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy. > >>>>>>>>> > >>>>>>>>> Memory Allocation for Demotion > >>>>>>>>> ============================== > >>>>>>>>> This patch series keep the demotion target page allocation logic same. > >>>>>>>>> The demotion page allocation pick the closest NUMA node in the > >>>>>>>>> next lower tier to the current NUMA node allocating pages from. > >>>>>>>>> > >>>>>>>>> This will be later improved to use the same page allocation strategy > >>>>>>>>> using fallback list. > >>>>>>>>> > >>>>>>>>> Sysfs Interface: > >>>>>>>>> ------------- > >>>>>>>>> Listing current list of memory tiers details: > >>>>>>>>> > >>>>>>>>> :/sys/devices/system/memtier$ ls > >>>>>>>>> default_tier max_tier memtier1 power uevent > >>>>>>>>> :/sys/devices/system/memtier$ cat default_tier > >>>>>>>>> memtier200 > >>>>>>>>> :/sys/devices/system/memtier$ cat max_tier > >>>>>>>>> 400 > >>>>>>>>> :/sys/devices/system/memtier$ > >>>>>>>>> > >>>>>>>>> Per node memory tier details: > >>>>>>>>> > >>>>>>>>> For a cpu only NUMA node: > >>>>>>>>> > >>>>>>>>> :/sys/devices/system/node# cat node0/memtier > >>>>>>>>> :/sys/devices/system/node# echo 1 > node0/memtier > >>>>>>>>> :/sys/devices/system/node# cat node0/memtier > >>>>>>>>> :/sys/devices/system/node# > >>>>>>>>> > >>>>>>>>> For a NUMA node with memory: > >>>>>>>>> :/sys/devices/system/node# cat node1/memtier > >>>>>>>>> 1 > >>>>>>>>> :/sys/devices/system/node# ls ../memtier/ > >>>>>>>>> default_tier max_tier memtier1 power uevent > >>>>>>>>> :/sys/devices/system/node# echo 2 > node1/memtier > >>>>>>>>> :/sys/devices/system/node# > >>>>>>>>> :/sys/devices/system/node# ls ../memtier/ > >>>>>>>>> default_tier max_tier memtier1 memtier2 power uevent > >>>>>>>>> :/sys/devices/system/node# cat node1/memtier > >>>>>>>>> 2 > >>>>>>>>> :/sys/devices/system/node# > >>>>>>>>> > >>>>>>>>> Removing a memory tier > >>>>>>>>> :/sys/devices/system/node# cat node1/memtier > >>>>>>>>> 2 > >>>>>>>>> :/sys/devices/system/node# echo 1 > node1/memtier > >>>>>>>> > >>>>>>>> Thanks a lot for your patchset. 
> >>>>>>>> > >>>>>>>> Per my understanding, we haven't reach consensus on > >>>>>>>> > >>>>>>>> - how to create the default memory tiers in kernel (via abstract > >>>>>>>> distance provided by drivers? Or use SLIT as the first step?) > >>>>>>>> > >>>>>>>> - how to override the default memory tiers from user space > >>>>>>>> > >>>>>>>> As in the following thread and email, > >>>>>>>> > >>>>>>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/ > >>>>>>>> > >>>>>>>> I think that we need to finalized on that firstly? > >>>>>>> > >>>>>>> I did list the proposal here > >>>>>>> > >>>>>>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com > >>>>>>> > >>>>>>> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated > >>>>>>> if the user wants a different tier topology. > >>>>>>> > >>>>>>> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200 > >>>>>>> > >>>>>>> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100. > >>>>>>> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use > >>>>>>> to control the tier assignment this can be a range of memory tiers. > >>>>>>> > >>>>>>> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update > >>>>>>> the memory tier assignment based on device attributes. > >>>>>> > >>>>>> Sorry for late reply. > >>>>>> > >>>>>> As the first step, it may be better to skip the parts that we haven't > >>>>>> reached consensus yet, for example, the user space interface to override > >>>>>> the default memory tiers. And we can use 0, 1, 2 as the default memory > >>>>>> tier IDs. We can refine/revise the in-kernel implementation, but we > >>>>>> cannot change the user space ABI. > >>>>>> > >>>>> > >>>>> Can you help list the use case that will be broken by using tierID as outlined in this series? > >>>>> One of the details that were mentioned earlier was the need to track top-tier memory usage in a > >>>>> memcg and IIUC the patchset posted https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com > >>>>> can work with tier IDs too. Let me know if you think otherwise. So at this point > >>>>> I am not sure which area we are still debating w.r.t the userspace interface. > >>>> > >>>> In > >>>> > >>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/ > >>>> > >>>> per my understanding, Johannes suggested to override the kernel default > >>>> memory tiers with "abstract distance" via drivers implementing memory > >>>> devices. As you said in another email, that is related to [7/12] of the > >>>> series. And we can table it for future. > >>>> > >>>> And per my understanding, he also suggested to make memory tier IDs > >>>> dynamic. For example, after the "abstract distance" of a driver is > >>>> overridden by users, the total number of memory tiers may be changed, > >>>> and the memory tier ID of some nodes may be changed too. This will make > >>>> memory tier ID easier to be understood, but more unstable. For example, > >>>> this will make it harder to specify the per-memory-tier memory partition > >>>> for a cgroup. > >>>> > >>> > >>> With all the approaches we discussed so far, a memory tier of a numa node can be changed. > >>> ie, pgdat->memtier can change anytime. 
The per memcg top tier mem usage tracking patches > >>> posted here > >>> https://lore.kernel.org/linux-mm/cefeb63173fa0fac7543315a2abbd4b5a1b25af8.1655242024.git.tim.c.chen@linux.intel.com/ > >>> doesn't consider the node movement from one memory tier to another. If we need > >>> a stable pgdat->memtier we will have to prevent a node memory tier reassignment > >>> while we have pages from the memory tier charged to a cgroup. This patchset should not > >>> prevent such a restriction. > >> > >> Absolute stableness doesn't exist even in "rank" based solution. But > >> "rank" can improve the stableness at some degree. For example, if we > >> move the tier of HBM nodes (from below DRAM to above DRAM), the DRAM > >> nodes can keep its memory tier ID stable. This may be not a real issue > >> finally. But we need to discuss that. > >> > > > > I agree that using ranks gives us the flexibility to change demotion order > > without being blocked by cgroup usage. But how frequently do we expect the > > tier assignment to change? My expectation was these reassignments are going > > to be rare and won't happen frequently after a system is up and running? > > Hence using tierID for demotion order won't prevent a node reassignment > > much because we don't expect to change the node tierID during runtime. In > > the rare case we do, we will have to make sure there is no cgroup usage from > > the specific memory tier. > > > > Even if we use ranks, we will have to avoid a rank update, if such > > an update can change the meaning of top tier? ie, if a rank update > > can result in a node being moved from top tier to non top tier. > > > >> Tim has suggested to use top-tier(s) memory partition among cgroups. > >> But I don't think that has been finalized. We may use per-memory-tier > >> memory partition among cgroups. I don't know whether Wei will use that > >> (may be implemented in the user space). > >> > >> And, if we thought stableness between nodes and memory tier ID isn't > >> important. Why should we use sparse memory device IDs (that is, 100, > >> 200, 300)? Why not just 0, 1, 2, ...? That looks more natural. > >> > > > > > > The range allows us to use memtier ID for demotion order. ie, as we start initializing > > devices with different attributes via dax kmem, there will be a desire to > > assign them to different tierIDs. Having default memtier ID (DRAM) at 200 enables > > us to put these devices in the range [0 - 200) without updating the node to memtier > > mapping of existing NUMA nodes (ie, without updating default memtier). > > I believe that sparse memory tier IDs can make memory tier more stable > in some cases. But this is different from the system suggested by > Johannes. Per my understanding, with Johannes' system, we will > > - one driver may online different memory types (such as kmem_dax may > online HBM, PMEM, etc.) > > - one memory type manages several memory nodes (NUMA nodes) > > - one "abstract distance" for each memory type > > - the "abstract distance" can be offset by user space override knob > > - memory tiers generated dynamic from different memory types according > "abstract distance" and overridden "offset" > > - the granularity to group several memory types into one memory tier can > be overridden via user space knob > > In this way, the memory tiers may be changed totally after user space > overridden. It may be hard to link memory tiers before/after the > overridden. 
So we may need to reset all per-memory-tier configuration, > such as cgroup partition limit or interleave weight, etc. > > Personally, I think the system above makes sense. But I think we need > to make sure whether it satisfies the requirements. > > Best Regards, > Huang, Ying >

The "memory type" and "abstract distance" concepts sound to me similar to the memory tier "rank" idea. We can have some well-defined type/distance/rank values, e.g. HBM, DRAM, CXL_DRAM, PMEM, CXL_PMEM, which a device can register with. The memory tiers would then be built from these values. Whether and how to collapse several values into a single tier can be made configurable.

Wei
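A rough sketch of how such well-known type/distance values and a configurable grouping could look (the distance numbers, the chunk size and all identifiers below are illustrative assumptions, not taken from the patchset or from the linked proposal):

/*
 * Sketch of the "memory type + abstract distance" idea: each memory type
 * carries a driver-provided distance plus a user-space offset, and tiers
 * are derived by grouping the effective distances into chunks.
 */
#include <stdio.h>

struct memory_type {
        const char *name;
        int abstract_distance;   /* smaller == faster, driver-provided */
        int user_offset;         /* user space override knob, default 0 */
};

static struct memory_type types[] = {
        { "HBM",      100, 0 },
        { "DRAM",     200, 0 },
        { "CXL_DRAM", 300, 0 },
        { "PMEM",     400, 0 },
        { "CXL_PMEM", 500, 0 },
};

/* Every MEMTIER_CHUNK of effective distance becomes one tier. */
#define MEMTIER_CHUNK   128

static int tier_of(const struct memory_type *t)
{
        return (t->abstract_distance + t->user_offset) / MEMTIER_CHUNK;
}

int main(void)
{
        /* Example override: treat CXL-attached DRAM like ordinary DRAM. */
        types[2].user_offset = -100;

        for (unsigned i = 0; i < sizeof(types) / sizeof(types[0]); i++)
                printf("%-9s -> memory tier %d\n",
                       types[i].name, tier_of(&types[i]));
        return 0;
}

Collapsing by a fixed distance chunk is only one possible policy; the point is that tier membership (and therefore any tier ID) falls out of the distances and the override rather than being assigned directly, which is also why the resulting IDs are less stable than the sparse fixed IDs used in the posted series.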
On 7/13/22 9:12 AM, Huang, Ying wrote: > Yang Shi <shy828301@gmail.com> writes: > >> On Mon, Jul 11, 2022 at 10:10 PM Aneesh Kumar K V >> <aneesh.kumar@linux.ibm.com> wrote: >>> >>> On 7/12/22 10:12 AM, Aneesh Kumar K V wrote: >>>> On 7/12/22 6:46 AM, Huang, Ying wrote: >>>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: >>>>> >>>>>> On 7/5/22 9:59 AM, Huang, Ying wrote: >>>>>>> Hi, Aneesh, >>>>>>> >>>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: >>>>>>> >>>>>>>> The current kernel has the basic memory tiering support: Inactive >>>>>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower >>>>>>>> tier NUMA node to make room for new allocations on the higher tier >>>>>>>> NUMA node. Frequently accessed pages on a lower tier NUMA node can be >>>>>>>> migrated (promoted) to a higher tier NUMA node to improve the >>>>>>>> performance. >>>>>>>> >>>>>>>> In the current kernel, memory tiers are defined implicitly via a >>>>>>>> demotion path relationship between NUMA nodes, which is created during >>>>>>>> the kernel initialization and updated when a NUMA node is hot-added or >>>>>>>> hot-removed. The current implementation puts all nodes with CPU into >>>>>>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing >>>>>>>> the per-node demotion targets based on the distances between nodes. >>>>>>>> >>>>>>>> This current memory tier kernel interface needs to be improved for >>>>>>>> several important use cases: >>>>>>>> >>>>>>>> * The current tier initialization code always initializes >>>>>>>> each memory-only NUMA node into a lower tier. But a memory-only >>>>>>>> NUMA node may have a high performance memory device (e.g. a DRAM >>>>>>>> device attached via CXL.mem or a DRAM-backed memory-only node on >>>>>>>> a virtual machine) and should be put into a higher tier. >>>>>>>> >>>>>>>> * The current tier hierarchy always puts CPU nodes into the top >>>>>>>> tier. But on a system with HBM (e.g. GPU memory) devices, these >>>>>>>> memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes >>>>>>>> with CPUs are better to be placed into the next lower tier. >>>>>>>> >>>>>>>> * Also because the current tier hierarchy always puts CPU nodes >>>>>>>> into the top tier, when a CPU is hot-added (or hot-removed) and >>>>>>>> triggers a memory node from CPU-less into a CPU node (or vice >>>>>>>> versa), the memory tier hierarchy gets changed, even though no >>>>>>>> memory node is added or removed. This can make the tier >>>>>>>> hierarchy unstable and make it difficult to support tier-based >>>>>>>> memory accounting. >>>>>>>> >>>>>>>> * A higher tier node can only be demoted to selected nodes on the >>>>>>>> next lower tier as defined by the demotion path, not any other >>>>>>>> node from any lower tier. This strict, hard-coded demotion order >>>>>>>> does not work in all use cases (e.g. some use cases may want to >>>>>>>> allow cross-socket demotion to another node in the same demotion >>>>>>>> tier as a fallback when the preferred demotion node is out of >>>>>>>> space), and has resulted in the feature request for an interface to >>>>>>>> override the system-wide, per-node demotion order from the >>>>>>>> userspace. This demotion order is also inconsistent with the page >>>>>>>> allocation fallback order when all the nodes in a higher tier are >>>>>>>> out of space: The page allocation can fall back to any node from >>>>>>>> any lower tier, whereas the demotion order doesn't allow that. 
>>>>>>>> >>>>>>>> * There are no interfaces for the userspace to learn about the memory >>>>>>>> tier hierarchy in order to optimize its memory allocations. >>>>>>>> >>>>>>>> This patch series make the creation of memory tiers explicit under >>>>>>>> the control of userspace or device driver. >>>>>>>> >>>>>>>> Memory Tier Initialization >>>>>>>> ========================== >>>>>>>> >>>>>>>> By default, all memory nodes are assigned to the default tier with >>>>>>>> tier ID value 200. >>>>>>>> >>>>>>>> A device driver can move up or down its memory nodes from the default >>>>>>>> tier. For example, PMEM can move down its memory nodes below the >>>>>>>> default tier, whereas GPU can move up its memory nodes above the >>>>>>>> default tier. >>>>>>>> >>>>>>>> The kernel initialization code makes the decision on which exact tier >>>>>>>> a memory node should be assigned to based on the requests from the >>>>>>>> device drivers as well as the memory device hardware information >>>>>>>> provided by the firmware. >>>>>>>> >>>>>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy. >>>>>>>> >>>>>>>> Memory Allocation for Demotion >>>>>>>> ============================== >>>>>>>> This patch series keep the demotion target page allocation logic same. >>>>>>>> The demotion page allocation pick the closest NUMA node in the >>>>>>>> next lower tier to the current NUMA node allocating pages from. >>>>>>>> >>>>>>>> This will be later improved to use the same page allocation strategy >>>>>>>> using fallback list. >>>>>>>> >>>>>>>> Sysfs Interface: >>>>>>>> ------------- >>>>>>>> Listing current list of memory tiers details: >>>>>>>> >>>>>>>> :/sys/devices/system/memtier$ ls >>>>>>>> default_tier max_tier memtier1 power uevent >>>>>>>> :/sys/devices/system/memtier$ cat default_tier >>>>>>>> memtier200 >>>>>>>> :/sys/devices/system/memtier$ cat max_tier >>>>>>>> 400 >>>>>>>> :/sys/devices/system/memtier$ >>>>>>>> >>>>>>>> Per node memory tier details: >>>>>>>> >>>>>>>> For a cpu only NUMA node: >>>>>>>> >>>>>>>> :/sys/devices/system/node# cat node0/memtier >>>>>>>> :/sys/devices/system/node# echo 1 > node0/memtier >>>>>>>> :/sys/devices/system/node# cat node0/memtier >>>>>>>> :/sys/devices/system/node# >>>>>>>> >>>>>>>> For a NUMA node with memory: >>>>>>>> :/sys/devices/system/node# cat node1/memtier >>>>>>>> 1 >>>>>>>> :/sys/devices/system/node# ls ../memtier/ >>>>>>>> default_tier max_tier memtier1 power uevent >>>>>>>> :/sys/devices/system/node# echo 2 > node1/memtier >>>>>>>> :/sys/devices/system/node# >>>>>>>> :/sys/devices/system/node# ls ../memtier/ >>>>>>>> default_tier max_tier memtier1 memtier2 power uevent >>>>>>>> :/sys/devices/system/node# cat node1/memtier >>>>>>>> 2 >>>>>>>> :/sys/devices/system/node# >>>>>>>> >>>>>>>> Removing a memory tier >>>>>>>> :/sys/devices/system/node# cat node1/memtier >>>>>>>> 2 >>>>>>>> :/sys/devices/system/node# echo 1 > node1/memtier >>>>>>> >>>>>>> Thanks a lot for your patchset. >>>>>>> >>>>>>> Per my understanding, we haven't reach consensus on >>>>>>> >>>>>>> - how to create the default memory tiers in kernel (via abstract >>>>>>> distance provided by drivers? Or use SLIT as the first step?) >>>>>>> >>>>>>> - how to override the default memory tiers from user space >>>>>>> >>>>>>> As in the following thread and email, >>>>>>> >>>>>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/ >>>>>>> >>>>>>> I think that we need to finalized on that firstly? 
>>>>>> >>>>>> I did list the proposal here >>>>>> >>>>>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com >>>>>> >>>>>> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated >>>>>> if the user wants a different tier topology. >>>>>> >>>>>> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200 >>>>>> >>>>>> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100. >>>>>> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use >>>>>> to control the tier assignment this can be a range of memory tiers. >>>>>> >>>>>> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update >>>>>> the memory tier assignment based on device attributes. >>>>> >>>>> Sorry for late reply. >>>>> >>>>> As the first step, it may be better to skip the parts that we haven't >>>>> reached consensus yet, for example, the user space interface to override >>>>> the default memory tiers. And we can use 0, 1, 2 as the default memory >>>>> tier IDs. We can refine/revise the in-kernel implementation, but we >>>>> cannot change the user space ABI. >>>>> >>>> >>>> Can you help list the use case that will be broken by using tierID as outlined in this series? >>>> One of the details that were mentioned earlier was the need to track top-tier memory usage in a >>>> memcg and IIUC the patchset posted https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com >>>> can work with tier IDs too. Let me know if you think otherwise. So at this point >>>> I am not sure which area we are still debating w.r.t the userspace interface. >>>> >>>> I will still keep the default tier IDs with a large range between them. That will allow >>>> us to go back to tierID based demotion order if we can. That is much simpler than using tierID and rank >>>> together. If we still want to go back to rank based approach the tierID value won't have much >>>> meaning anyway. >>>> >>>> Any feedback on patches 1 - 5, so that I can request Andrew to merge them? >>>> >>> >>> Looking at this again, I guess we just need to drop patch 7 >>> mm/demotion: Add per node memory tier attribute to sysfs ? >>> >>> We do agree to use the device model to expose memory tiers to userspace so patch 6 can still be included. >>> It also exposes max_tier, default_tier, and node list of a memory tier. All these are useful >>> and agreed upon. Hence patch 6 can be merged? >>> >>> patch 8 - 10 -> are done based on the request from others and is independent of how memory tiers >>> are exposed/created from userspace. Hence that can be merged? >>> >>> If you agree I can rebase the series moving patch 7,11,12 as the last patches in the series so >>> that we can skip merging them based on what we conclude w.r.t usage of rank. >> >> I think the most controversial part is the user visible interfaces so >> far. And IIUC the series could be split roughly into two parts, patch >> 1 - 5 and others. The patch 1 -5 added the explicit memory tier >> support and fixed the issue reported by Jagdish. I think we are on the >> same page for this part. But I haven't seen any thorough review on >> those patches yet since we got distracted by spending most time >> discussing about the user visible interfaces. 
>> >> So would it help to move things forward to submit patch 1 - 5 as a >> standalone series to get a thorough review and then get merged? > Yes. I think this is a good idea. We can discuss the in-kernel > implementation (without the user space interface) in detail and try to get > it merged. > > And we can continue our discussion of the user space interface in a separate > thread. Thanks. I will post patch 1 - 5 as a series for review. -aneesh
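The demotion rule quoted from the cover letter above ("the demotion page allocation picks the closest NUMA node in the next lower tier") can be illustrated with a minimal userspace sketch. The node count, the tier assignments (DRAM at the default tier 200, PMEM at 100 as dax kmem would place it), and the SLIT-style distances are all invented; this is not the kernel implementation, only the selection rule it describes.

/*
 * Minimal model of the demotion rule: demote from a node to the *closest*
 * node in the next lower memory tier. Topology and distances are made up.
 */
#include <limits.h>
#include <stdio.h>

#define NR_NODES 4

/* Nodes 0/1: DRAM (default tier 200); nodes 2/3: PMEM (tier 100). */
static const int node_memtier[NR_NODES] = { 200, 200, 100, 100 };

/* Distance matrix in SLIT style: smaller means closer. */
static const int distance[NR_NODES][NR_NODES] = {
	{ 10, 20, 17, 28 },
	{ 20, 10, 28, 17 },
	{ 17, 28, 10, 20 },
	{ 28, 17, 20, 10 },
};

/* The next lower tier is the highest tier ID strictly below @node's tier. */
static int next_lower_tier(int node)
{
	int best = -1;

	for (int n = 0; n < NR_NODES; n++)
		if (node_memtier[n] < node_memtier[node] && node_memtier[n] > best)
			best = node_memtier[n];
	return best;
}

/* Pick the closest node that sits in the next lower tier. */
static int pick_demotion_target(int node)
{
	int tier = next_lower_tier(node), best = -1, best_dist = INT_MAX;

	for (int n = 0; n < NR_NODES; n++) {
		if (node_memtier[n] != tier)
			continue;
		if (distance[node][n] < best_dist) {
			best_dist = distance[node][n];
			best = n;
		}
	}
	return best;	/* -1: no lower tier, nothing to demote to */
}

int main(void)
{
	for (int n = 0; n < NR_NODES; n++)
		printf("node %d -> demotion target %d\n", n, pick_demotion_target(n));
	return 0;
}

With this made-up topology, each DRAM node demotes to the PMEM node closest to it, and the PMEM nodes themselves have no demotion target.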
Wei Xu <weixugc@google.com> writes: > On Tue, Jul 12, 2022 at 8:03 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: >> >> > On 7/12/22 2:18 PM, Huang, Ying wrote: >> >> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: >> >> >> >>> On 7/12/22 12:29 PM, Huang, Ying wrote: >> >>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: >> >>>> >> >>>>> On 7/12/22 6:46 AM, Huang, Ying wrote: >> >>>>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: >> >>>>>> >> >>>>>>> On 7/5/22 9:59 AM, Huang, Ying wrote: >> >>>>>>>> Hi, Aneesh, >> >>>>>>>> >> >>>>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: >> >>>>>>>> >> >>>>>>>>> The current kernel has the basic memory tiering support: Inactive >> >>>>>>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower >> >>>>>>>>> tier NUMA node to make room for new allocations on the higher tier >> >>>>>>>>> NUMA node. Frequently accessed pages on a lower tier NUMA node can be >> >>>>>>>>> migrated (promoted) to a higher tier NUMA node to improve the >> >>>>>>>>> performance. >> >>>>>>>>> >> >>>>>>>>> In the current kernel, memory tiers are defined implicitly via a >> >>>>>>>>> demotion path relationship between NUMA nodes, which is created during >> >>>>>>>>> the kernel initialization and updated when a NUMA node is hot-added or >> >>>>>>>>> hot-removed. The current implementation puts all nodes with CPU into >> >>>>>>>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing >> >>>>>>>>> the per-node demotion targets based on the distances between nodes. >> >>>>>>>>> >> >>>>>>>>> This current memory tier kernel interface needs to be improved for >> >>>>>>>>> several important use cases: >> >>>>>>>>> >> >>>>>>>>> * The current tier initialization code always initializes >> >>>>>>>>> each memory-only NUMA node into a lower tier. But a memory-only >> >>>>>>>>> NUMA node may have a high performance memory device (e.g. a DRAM >> >>>>>>>>> device attached via CXL.mem or a DRAM-backed memory-only node on >> >>>>>>>>> a virtual machine) and should be put into a higher tier. >> >>>>>>>>> >> >>>>>>>>> * The current tier hierarchy always puts CPU nodes into the top >> >>>>>>>>> tier. But on a system with HBM (e.g. GPU memory) devices, these >> >>>>>>>>> memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes >> >>>>>>>>> with CPUs are better to be placed into the next lower tier. >> >>>>>>>>> >> >>>>>>>>> * Also because the current tier hierarchy always puts CPU nodes >> >>>>>>>>> into the top tier, when a CPU is hot-added (or hot-removed) and >> >>>>>>>>> triggers a memory node from CPU-less into a CPU node (or vice >> >>>>>>>>> versa), the memory tier hierarchy gets changed, even though no >> >>>>>>>>> memory node is added or removed. This can make the tier >> >>>>>>>>> hierarchy unstable and make it difficult to support tier-based >> >>>>>>>>> memory accounting. >> >>>>>>>>> >> >>>>>>>>> * A higher tier node can only be demoted to selected nodes on the >> >>>>>>>>> next lower tier as defined by the demotion path, not any other >> >>>>>>>>> node from any lower tier. This strict, hard-coded demotion order >> >>>>>>>>> does not work in all use cases (e.g. 
some use cases may want to >> >>>>>>>>> allow cross-socket demotion to another node in the same demotion >> >>>>>>>>> tier as a fallback when the preferred demotion node is out of >> >>>>>>>>> space), and has resulted in the feature request for an interface to >> >>>>>>>>> override the system-wide, per-node demotion order from the >> >>>>>>>>> userspace. This demotion order is also inconsistent with the page >> >>>>>>>>> allocation fallback order when all the nodes in a higher tier are >> >>>>>>>>> out of space: The page allocation can fall back to any node from >> >>>>>>>>> any lower tier, whereas the demotion order doesn't allow that. >> >>>>>>>>> >> >>>>>>>>> * There are no interfaces for the userspace to learn about the memory >> >>>>>>>>> tier hierarchy in order to optimize its memory allocations. >> >>>>>>>>> >> >>>>>>>>> This patch series make the creation of memory tiers explicit under >> >>>>>>>>> the control of userspace or device driver. >> >>>>>>>>> >> >>>>>>>>> Memory Tier Initialization >> >>>>>>>>> ========================== >> >>>>>>>>> >> >>>>>>>>> By default, all memory nodes are assigned to the default tier with >> >>>>>>>>> tier ID value 200. >> >>>>>>>>> >> >>>>>>>>> A device driver can move up or down its memory nodes from the default >> >>>>>>>>> tier. For example, PMEM can move down its memory nodes below the >> >>>>>>>>> default tier, whereas GPU can move up its memory nodes above the >> >>>>>>>>> default tier. >> >>>>>>>>> >> >>>>>>>>> The kernel initialization code makes the decision on which exact tier >> >>>>>>>>> a memory node should be assigned to based on the requests from the >> >>>>>>>>> device drivers as well as the memory device hardware information >> >>>>>>>>> provided by the firmware. >> >>>>>>>>> >> >>>>>>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy. >> >>>>>>>>> >> >>>>>>>>> Memory Allocation for Demotion >> >>>>>>>>> ============================== >> >>>>>>>>> This patch series keep the demotion target page allocation logic same. >> >>>>>>>>> The demotion page allocation pick the closest NUMA node in the >> >>>>>>>>> next lower tier to the current NUMA node allocating pages from. >> >>>>>>>>> >> >>>>>>>>> This will be later improved to use the same page allocation strategy >> >>>>>>>>> using fallback list. 
>> >>>>>>>>> >> >>>>>>>>> Sysfs Interface: >> >>>>>>>>> ------------- >> >>>>>>>>> Listing current list of memory tiers details: >> >>>>>>>>> >> >>>>>>>>> :/sys/devices/system/memtier$ ls >> >>>>>>>>> default_tier max_tier memtier1 power uevent >> >>>>>>>>> :/sys/devices/system/memtier$ cat default_tier >> >>>>>>>>> memtier200 >> >>>>>>>>> :/sys/devices/system/memtier$ cat max_tier >> >>>>>>>>> 400 >> >>>>>>>>> :/sys/devices/system/memtier$ >> >>>>>>>>> >> >>>>>>>>> Per node memory tier details: >> >>>>>>>>> >> >>>>>>>>> For a cpu only NUMA node: >> >>>>>>>>> >> >>>>>>>>> :/sys/devices/system/node# cat node0/memtier >> >>>>>>>>> :/sys/devices/system/node# echo 1 > node0/memtier >> >>>>>>>>> :/sys/devices/system/node# cat node0/memtier >> >>>>>>>>> :/sys/devices/system/node# >> >>>>>>>>> >> >>>>>>>>> For a NUMA node with memory: >> >>>>>>>>> :/sys/devices/system/node# cat node1/memtier >> >>>>>>>>> 1 >> >>>>>>>>> :/sys/devices/system/node# ls ../memtier/ >> >>>>>>>>> default_tier max_tier memtier1 power uevent >> >>>>>>>>> :/sys/devices/system/node# echo 2 > node1/memtier >> >>>>>>>>> :/sys/devices/system/node# >> >>>>>>>>> :/sys/devices/system/node# ls ../memtier/ >> >>>>>>>>> default_tier max_tier memtier1 memtier2 power uevent >> >>>>>>>>> :/sys/devices/system/node# cat node1/memtier >> >>>>>>>>> 2 >> >>>>>>>>> :/sys/devices/system/node# >> >>>>>>>>> >> >>>>>>>>> Removing a memory tier >> >>>>>>>>> :/sys/devices/system/node# cat node1/memtier >> >>>>>>>>> 2 >> >>>>>>>>> :/sys/devices/system/node# echo 1 > node1/memtier >> >>>>>>>> >> >>>>>>>> Thanks a lot for your patchset. >> >>>>>>>> >> >>>>>>>> Per my understanding, we haven't reach consensus on >> >>>>>>>> >> >>>>>>>> - how to create the default memory tiers in kernel (via abstract >> >>>>>>>> distance provided by drivers? Or use SLIT as the first step?) >> >>>>>>>> >> >>>>>>>> - how to override the default memory tiers from user space >> >>>>>>>> >> >>>>>>>> As in the following thread and email, >> >>>>>>>> >> >>>>>>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/ >> >>>>>>>> >> >>>>>>>> I think that we need to finalized on that firstly? >> >>>>>>> >> >>>>>>> I did list the proposal here >> >>>>>>> >> >>>>>>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com >> >>>>>>> >> >>>>>>> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated >> >>>>>>> if the user wants a different tier topology. >> >>>>>>> >> >>>>>>> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200 >> >>>>>>> >> >>>>>>> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100. >> >>>>>>> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use >> >>>>>>> to control the tier assignment this can be a range of memory tiers. >> >>>>>>> >> >>>>>>> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update >> >>>>>>> the memory tier assignment based on device attributes. >> >>>>>> >> >>>>>> Sorry for late reply. >> >>>>>> >> >>>>>> As the first step, it may be better to skip the parts that we haven't >> >>>>>> reached consensus yet, for example, the user space interface to override >> >>>>>> the default memory tiers. And we can use 0, 1, 2 as the default memory >> >>>>>> tier IDs. We can refine/revise the in-kernel implementation, but we >> >>>>>> cannot change the user space ABI. 
>> >>>>>> >> >>>>> >> >>>>> Can you help list the use case that will be broken by using tierID as outlined in this series? >> >>>>> One of the details that were mentioned earlier was the need to track top-tier memory usage in a >> >>>>> memcg and IIUC the patchset posted https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com >> >>>>> can work with tier IDs too. Let me know if you think otherwise. So at this point >> >>>>> I am not sure which area we are still debating w.r.t the userspace interface. >> >>>> >> >>>> In >> >>>> >> >>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/ >> >>>> >> >>>> per my understanding, Johannes suggested to override the kernel default >> >>>> memory tiers with "abstract distance" via drivers implementing memory >> >>>> devices. As you said in another email, that is related to [7/12] of the >> >>>> series. And we can table it for future. >> >>>> >> >>>> And per my understanding, he also suggested to make memory tier IDs >> >>>> dynamic. For example, after the "abstract distance" of a driver is >> >>>> overridden by users, the total number of memory tiers may be changed, >> >>>> and the memory tier ID of some nodes may be changed too. This will make >> >>>> memory tier ID easier to be understood, but more unstable. For example, >> >>>> this will make it harder to specify the per-memory-tier memory partition >> >>>> for a cgroup. >> >>>> >> >>> >> >>> With all the approaches we discussed so far, a memory tier of a numa node can be changed. >> >>> ie, pgdat->memtier can change anytime. The per memcg top tier mem usage tracking patches >> >>> posted here >> >>> https://lore.kernel.org/linux-mm/cefeb63173fa0fac7543315a2abbd4b5a1b25af8.1655242024.git.tim.c.chen@linux.intel.com/ >> >>> doesn't consider the node movement from one memory tier to another. If we need >> >>> a stable pgdat->memtier we will have to prevent a node memory tier reassignment >> >>> while we have pages from the memory tier charged to a cgroup. This patchset should not >> >>> prevent such a restriction. >> >> >> >> Absolute stableness doesn't exist even in "rank" based solution. But >> >> "rank" can improve the stableness at some degree. For example, if we >> >> move the tier of HBM nodes (from below DRAM to above DRAM), the DRAM >> >> nodes can keep its memory tier ID stable. This may be not a real issue >> >> finally. But we need to discuss that. >> >> >> > >> > I agree that using ranks gives us the flexibility to change demotion order >> > without being blocked by cgroup usage. But how frequently do we expect the >> > tier assignment to change? My expectation was these reassignments are going >> > to be rare and won't happen frequently after a system is up and running? >> > Hence using tierID for demotion order won't prevent a node reassignment >> > much because we don't expect to change the node tierID during runtime. In >> > the rare case we do, we will have to make sure there is no cgroup usage from >> > the specific memory tier. >> > >> > Even if we use ranks, we will have to avoid a rank update, if such >> > an update can change the meaning of top tier? ie, if a rank update >> > can result in a node being moved from top tier to non top tier. >> > >> >> Tim has suggested to use top-tier(s) memory partition among cgroups. >> >> But I don't think that has been finalized. We may use per-memory-tier >> >> memory partition among cgroups. I don't know whether Wei will use that >> >> (may be implemented in the user space). 
>> >> >> >> And, if we thought stableness between nodes and memory tier ID isn't >> >> important. Why should we use sparse memory device IDs (that is, 100, >> >> 200, 300)? Why not just 0, 1, 2, ...? That looks more natural. >> >> >> > >> > >> > The range allows us to use memtier ID for demotion order. ie, as we start initializing >> > devices with different attributes via dax kmem, there will be a desire to >> > assign them to different tierIDs. Having default memtier ID (DRAM) at 200 enables >> > us to put these devices in the range [0 - 200) without updating the node to memtier >> > mapping of existing NUMA nodes (ie, without updating default memtier). >> >> I believe that sparse memory tier IDs can make memory tier more stable >> in some cases. But this is different from the system suggested by >> Johannes. Per my understanding, with Johannes' system, we will >> >> - one driver may online different memory types (such as kmem_dax may >> online HBM, PMEM, etc.) >> >> - one memory type manages several memory nodes (NUMA nodes) >> >> - one "abstract distance" for each memory type >> >> - the "abstract distance" can be offset by user space override knob >> >> - memory tiers generated dynamic from different memory types according >> "abstract distance" and overridden "offset" >> >> - the granularity to group several memory types into one memory tier can >> be overridden via user space knob >> >> In this way, the memory tiers may be changed totally after user space >> overridden. It may be hard to link memory tiers before/after the >> overridden. So we may need to reset all per-memory-tier configuration, >> such as cgroup partition limit or interleave weight, etc. >> >> Personally, I think the system above makes sense. But I think we need >> to make sure whether it satisfies the requirements. >> >> Best Regards, >> Huang, Ying >> > > The "memory type" and "abstract distance" concepts sound to me similar > to the memory tier "rank" idea. Yes. "abstract distance" is similar to "rank". > We can have some well-defined type/distance/rank values, e.g. HBM, > DRAM, CXL_DRAM, PMEM, CXL_PMEM, which a device can register with. The > memory tiers will build from these values. It can be configurable to > whether/how to collapse several values into a single tier. The memory types are registered by drivers (such as kmem_dax). And the distances can come from SLIT, HMAT, and other firmware- or driver-specific information sources. Per my understanding, this solution may make memory tier IDs more unstable. For example, the memory tier ID of a node may change after the user overrides the distance of a memory type. Although I think the overriding should be a rare operation, will it be a real issue for your use cases? Best Regards, Huang, Ying
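A rough sketch of the scheme summarized above may help: each memory type registered by a driver carries an abstract distance, user space can offset it, and tiers are formed by grouping effective distances with a granularity. The type names, the distance values, and the granularity knob below are all invented for illustration and are not an existing kernel interface; note that in this scheme a smaller tier index means a faster tier, unlike the sparse IDs used in the posted series.

/*
 * Toy model of "memory type + abstract distance + user offset + granularity".
 * All names and numbers are illustrative only.
 */
#include <stdio.h>

struct memory_type {
	const char *name;
	int abstract_distance;	/* supplied by the onlining driver */
	int user_offset;	/* user space override knob */
};

static struct memory_type types[] = {
	{ "HBM",      300, 0 },
	{ "DRAM",     500, 0 },
	{ "CXL-DRAM", 600, 0 },
	{ "PMEM",     900, 0 },
};

/* Group memory types into tiers: tier = effective distance / granularity. */
static int memtier_of(const struct memory_type *t, int granularity)
{
	return (t->abstract_distance + t->user_offset) / granularity;
}

int main(void)
{
	int granularity = 100;	/* policy parameter, also user-overridable */

	for (size_t i = 0; i < sizeof(types) / sizeof(types[0]); i++)
		printf("%-8s -> memtier %d\n", types[i].name,
		       memtier_of(&types[i], granularity));
	return 0;
}

Changing the granularity from 100 to 300, or offsetting one type's distance, regroups the types and renumbers the tiers below the change, which is exactly the instability with respect to per-memory-tier configuration (cgroup partition limits, interleave weights) being discussed.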
Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: > On 7/13/22 9:12 AM, Huang, Ying wrote: >> Yang Shi <shy828301@gmail.com> writes: >> >>> On Mon, Jul 11, 2022 at 10:10 PM Aneesh Kumar K V >>> <aneesh.kumar@linux.ibm.com> wrote: >>>> >>>> On 7/12/22 10:12 AM, Aneesh Kumar K V wrote: >>>>> On 7/12/22 6:46 AM, Huang, Ying wrote: >>>>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: >>>>>> >>>>>>> On 7/5/22 9:59 AM, Huang, Ying wrote: >>>>>>>> Hi, Aneesh, >>>>>>>> >>>>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: >>>>>>>> >>>>>>>>> The current kernel has the basic memory tiering support: Inactive >>>>>>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower >>>>>>>>> tier NUMA node to make room for new allocations on the higher tier >>>>>>>>> NUMA node. Frequently accessed pages on a lower tier NUMA node can be >>>>>>>>> migrated (promoted) to a higher tier NUMA node to improve the >>>>>>>>> performance. >>>>>>>>> >>>>>>>>> In the current kernel, memory tiers are defined implicitly via a >>>>>>>>> demotion path relationship between NUMA nodes, which is created during >>>>>>>>> the kernel initialization and updated when a NUMA node is hot-added or >>>>>>>>> hot-removed. The current implementation puts all nodes with CPU into >>>>>>>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing >>>>>>>>> the per-node demotion targets based on the distances between nodes. >>>>>>>>> >>>>>>>>> This current memory tier kernel interface needs to be improved for >>>>>>>>> several important use cases: >>>>>>>>> >>>>>>>>> * The current tier initialization code always initializes >>>>>>>>> each memory-only NUMA node into a lower tier. But a memory-only >>>>>>>>> NUMA node may have a high performance memory device (e.g. a DRAM >>>>>>>>> device attached via CXL.mem or a DRAM-backed memory-only node on >>>>>>>>> a virtual machine) and should be put into a higher tier. >>>>>>>>> >>>>>>>>> * The current tier hierarchy always puts CPU nodes into the top >>>>>>>>> tier. But on a system with HBM (e.g. GPU memory) devices, these >>>>>>>>> memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes >>>>>>>>> with CPUs are better to be placed into the next lower tier. >>>>>>>>> >>>>>>>>> * Also because the current tier hierarchy always puts CPU nodes >>>>>>>>> into the top tier, when a CPU is hot-added (or hot-removed) and >>>>>>>>> triggers a memory node from CPU-less into a CPU node (or vice >>>>>>>>> versa), the memory tier hierarchy gets changed, even though no >>>>>>>>> memory node is added or removed. This can make the tier >>>>>>>>> hierarchy unstable and make it difficult to support tier-based >>>>>>>>> memory accounting. >>>>>>>>> >>>>>>>>> * A higher tier node can only be demoted to selected nodes on the >>>>>>>>> next lower tier as defined by the demotion path, not any other >>>>>>>>> node from any lower tier. This strict, hard-coded demotion order >>>>>>>>> does not work in all use cases (e.g. some use cases may want to >>>>>>>>> allow cross-socket demotion to another node in the same demotion >>>>>>>>> tier as a fallback when the preferred demotion node is out of >>>>>>>>> space), and has resulted in the feature request for an interface to >>>>>>>>> override the system-wide, per-node demotion order from the >>>>>>>>> userspace. 
This demotion order is also inconsistent with the page >>>>>>>>> allocation fallback order when all the nodes in a higher tier are >>>>>>>>> out of space: The page allocation can fall back to any node from >>>>>>>>> any lower tier, whereas the demotion order doesn't allow that. >>>>>>>>> >>>>>>>>> * There are no interfaces for the userspace to learn about the memory >>>>>>>>> tier hierarchy in order to optimize its memory allocations. >>>>>>>>> >>>>>>>>> This patch series make the creation of memory tiers explicit under >>>>>>>>> the control of userspace or device driver. >>>>>>>>> >>>>>>>>> Memory Tier Initialization >>>>>>>>> ========================== >>>>>>>>> >>>>>>>>> By default, all memory nodes are assigned to the default tier with >>>>>>>>> tier ID value 200. >>>>>>>>> >>>>>>>>> A device driver can move up or down its memory nodes from the default >>>>>>>>> tier. For example, PMEM can move down its memory nodes below the >>>>>>>>> default tier, whereas GPU can move up its memory nodes above the >>>>>>>>> default tier. >>>>>>>>> >>>>>>>>> The kernel initialization code makes the decision on which exact tier >>>>>>>>> a memory node should be assigned to based on the requests from the >>>>>>>>> device drivers as well as the memory device hardware information >>>>>>>>> provided by the firmware. >>>>>>>>> >>>>>>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy. >>>>>>>>> >>>>>>>>> Memory Allocation for Demotion >>>>>>>>> ============================== >>>>>>>>> This patch series keep the demotion target page allocation logic same. >>>>>>>>> The demotion page allocation pick the closest NUMA node in the >>>>>>>>> next lower tier to the current NUMA node allocating pages from. >>>>>>>>> >>>>>>>>> This will be later improved to use the same page allocation strategy >>>>>>>>> using fallback list. >>>>>>>>> >>>>>>>>> Sysfs Interface: >>>>>>>>> ------------- >>>>>>>>> Listing current list of memory tiers details: >>>>>>>>> >>>>>>>>> :/sys/devices/system/memtier$ ls >>>>>>>>> default_tier max_tier memtier1 power uevent >>>>>>>>> :/sys/devices/system/memtier$ cat default_tier >>>>>>>>> memtier200 >>>>>>>>> :/sys/devices/system/memtier$ cat max_tier >>>>>>>>> 400 >>>>>>>>> :/sys/devices/system/memtier$ >>>>>>>>> >>>>>>>>> Per node memory tier details: >>>>>>>>> >>>>>>>>> For a cpu only NUMA node: >>>>>>>>> >>>>>>>>> :/sys/devices/system/node# cat node0/memtier >>>>>>>>> :/sys/devices/system/node# echo 1 > node0/memtier >>>>>>>>> :/sys/devices/system/node# cat node0/memtier >>>>>>>>> :/sys/devices/system/node# >>>>>>>>> >>>>>>>>> For a NUMA node with memory: >>>>>>>>> :/sys/devices/system/node# cat node1/memtier >>>>>>>>> 1 >>>>>>>>> :/sys/devices/system/node# ls ../memtier/ >>>>>>>>> default_tier max_tier memtier1 power uevent >>>>>>>>> :/sys/devices/system/node# echo 2 > node1/memtier >>>>>>>>> :/sys/devices/system/node# >>>>>>>>> :/sys/devices/system/node# ls ../memtier/ >>>>>>>>> default_tier max_tier memtier1 memtier2 power uevent >>>>>>>>> :/sys/devices/system/node# cat node1/memtier >>>>>>>>> 2 >>>>>>>>> :/sys/devices/system/node# >>>>>>>>> >>>>>>>>> Removing a memory tier >>>>>>>>> :/sys/devices/system/node# cat node1/memtier >>>>>>>>> 2 >>>>>>>>> :/sys/devices/system/node# echo 1 > node1/memtier >>>>>>>> >>>>>>>> Thanks a lot for your patchset. >>>>>>>> >>>>>>>> Per my understanding, we haven't reach consensus on >>>>>>>> >>>>>>>> - how to create the default memory tiers in kernel (via abstract >>>>>>>> distance provided by drivers? Or use SLIT as the first step?) 
>>>>>>>> >>>>>>>> - how to override the default memory tiers from user space >>>>>>>> >>>>>>>> As in the following thread and email, >>>>>>>> >>>>>>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/ >>>>>>>> >>>>>>>> I think that we need to finalized on that firstly? >>>>>>> >>>>>>> I did list the proposal here >>>>>>> >>>>>>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com >>>>>>> >>>>>>> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated >>>>>>> if the user wants a different tier topology. >>>>>>> >>>>>>> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200 >>>>>>> >>>>>>> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100. >>>>>>> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use >>>>>>> to control the tier assignment this can be a range of memory tiers. >>>>>>> >>>>>>> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update >>>>>>> the memory tier assignment based on device attributes. >>>>>> >>>>>> Sorry for late reply. >>>>>> >>>>>> As the first step, it may be better to skip the parts that we haven't >>>>>> reached consensus yet, for example, the user space interface to override >>>>>> the default memory tiers. And we can use 0, 1, 2 as the default memory >>>>>> tier IDs. We can refine/revise the in-kernel implementation, but we >>>>>> cannot change the user space ABI. >>>>>> >>>>> >>>>> Can you help list the use case that will be broken by using tierID as outlined in this series? >>>>> One of the details that were mentioned earlier was the need to track top-tier memory usage in a >>>>> memcg and IIUC the patchset posted https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com >>>>> can work with tier IDs too. Let me know if you think otherwise. So at this point >>>>> I am not sure which area we are still debating w.r.t the userspace interface. >>>>> >>>>> I will still keep the default tier IDs with a large range between them. That will allow >>>>> us to go back to tierID based demotion order if we can. That is much simpler than using tierID and rank >>>>> together. If we still want to go back to rank based approach the tierID value won't have much >>>>> meaning anyway. >>>>> >>>>> Any feedback on patches 1 - 5, so that I can request Andrew to merge them? >>>>> >>>> >>>> Looking at this again, I guess we just need to drop patch 7 >>>> mm/demotion: Add per node memory tier attribute to sysfs ? >>>> >>>> We do agree to use the device model to expose memory tiers to userspace so patch 6 can still be included. >>>> It also exposes max_tier, default_tier, and node list of a memory tier. All these are useful >>>> and agreed upon. Hence patch 6 can be merged? >>>> >>>> patch 8 - 10 -> are done based on the request from others and is independent of how memory tiers >>>> are exposed/created from userspace. Hence that can be merged? >>>> >>>> If you agree I can rebase the series moving patch 7,11,12 as the last patches in the series so >>>> that we can skip merging them based on what we conclude w.r.t usage of rank. >>> >>> I think the most controversial part is the user visible interfaces so >>> far. And IIUC the series could be split roughly into two parts, patch >>> 1 - 5 and others. 
The patch 1 -5 added the explicit memory tier >>> support and fixed the issue reported by Jagdish. I think we are on the >>> same page for this part. But I haven't seen any thorough review on >>> those patches yet since we got distracted by spending most time >>> discussing about the user visible interfaces. >>> >>> So would it help to move things forward to submit patch 1 - 5 as a >>> standalone series to get thorough review then get merged? >> >> Yes. I think this is a good idea. We can discuss the in kernel >> implementation (without user space interface) in details and try to make >> it merged. >> >> And we can continue our discussion of user space interface in a separate >> thread. > > Thanks. I will post patch 1 - 5 as a series for review. I think that you should add 8-10 too, that is, all in-kernel implementation except the user space interface part. Although I think we should squash 8/12 personally. We can discuss that further during review. Best Regards, Huang, Ying
"Huang, Ying" <ying.huang@intel.com> writes: > Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: > >> On 7/12/22 2:18 PM, Huang, Ying wrote: >>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: >>> >>>> On 7/12/22 12:29 PM, Huang, Ying wrote: >>>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: >>>>> >>>>>> On 7/12/22 6:46 AM, Huang, Ying wrote: >>>>>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: >>>>>>> >>>>>>>> On 7/5/22 9:59 AM, Huang, Ying wrote: >>>>>>>>> Hi, Aneesh, >>>>>>>>> >>>>>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: >>>>>>>>> >>>>>>>>>> The current kernel has the basic memory tiering support: Inactive >>>>>>>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower >>>>>>>>>> tier NUMA node to make room for new allocations on the higher tier >>>>>>>>>> NUMA node. Frequently accessed pages on a lower tier NUMA node can be >>>>>>>>>> migrated (promoted) to a higher tier NUMA node to improve the >>>>>>>>>> performance. >>>>>>>>>> >>>>>>>>>> In the current kernel, memory tiers are defined implicitly via a >>>>>>>>>> demotion path relationship between NUMA nodes, which is created during >>>>>>>>>> the kernel initialization and updated when a NUMA node is hot-added or >>>>>>>>>> hot-removed. The current implementation puts all nodes with CPU into >>>>>>>>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing >>>>>>>>>> the per-node demotion targets based on the distances between nodes. >>>>>>>>>> >>>>>>>>>> This current memory tier kernel interface needs to be improved for >>>>>>>>>> several important use cases: >>>>>>>>>> >>>>>>>>>> * The current tier initialization code always initializes >>>>>>>>>> each memory-only NUMA node into a lower tier. But a memory-only >>>>>>>>>> NUMA node may have a high performance memory device (e.g. a DRAM >>>>>>>>>> device attached via CXL.mem or a DRAM-backed memory-only node on >>>>>>>>>> a virtual machine) and should be put into a higher tier. >>>>>>>>>> >>>>>>>>>> * The current tier hierarchy always puts CPU nodes into the top >>>>>>>>>> tier. But on a system with HBM (e.g. GPU memory) devices, these >>>>>>>>>> memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes >>>>>>>>>> with CPUs are better to be placed into the next lower tier. >>>>>>>>>> >>>>>>>>>> * Also because the current tier hierarchy always puts CPU nodes >>>>>>>>>> into the top tier, when a CPU is hot-added (or hot-removed) and >>>>>>>>>> triggers a memory node from CPU-less into a CPU node (or vice >>>>>>>>>> versa), the memory tier hierarchy gets changed, even though no >>>>>>>>>> memory node is added or removed. This can make the tier >>>>>>>>>> hierarchy unstable and make it difficult to support tier-based >>>>>>>>>> memory accounting. >>>>>>>>>> >>>>>>>>>> * A higher tier node can only be demoted to selected nodes on the >>>>>>>>>> next lower tier as defined by the demotion path, not any other >>>>>>>>>> node from any lower tier. This strict, hard-coded demotion order >>>>>>>>>> does not work in all use cases (e.g. some use cases may want to >>>>>>>>>> allow cross-socket demotion to another node in the same demotion >>>>>>>>>> tier as a fallback when the preferred demotion node is out of >>>>>>>>>> space), and has resulted in the feature request for an interface to >>>>>>>>>> override the system-wide, per-node demotion order from the >>>>>>>>>> userspace. 
This demotion order is also inconsistent with the page >>>>>>>>>> allocation fallback order when all the nodes in a higher tier are >>>>>>>>>> out of space: The page allocation can fall back to any node from >>>>>>>>>> any lower tier, whereas the demotion order doesn't allow that. >>>>>>>>>> >>>>>>>>>> * There are no interfaces for the userspace to learn about the memory >>>>>>>>>> tier hierarchy in order to optimize its memory allocations. >>>>>>>>>> >>>>>>>>>> This patch series make the creation of memory tiers explicit under >>>>>>>>>> the control of userspace or device driver. >>>>>>>>>> >>>>>>>>>> Memory Tier Initialization >>>>>>>>>> ========================== >>>>>>>>>> >>>>>>>>>> By default, all memory nodes are assigned to the default tier with >>>>>>>>>> tier ID value 200. >>>>>>>>>> >>>>>>>>>> A device driver can move up or down its memory nodes from the default >>>>>>>>>> tier. For example, PMEM can move down its memory nodes below the >>>>>>>>>> default tier, whereas GPU can move up its memory nodes above the >>>>>>>>>> default tier. >>>>>>>>>> >>>>>>>>>> The kernel initialization code makes the decision on which exact tier >>>>>>>>>> a memory node should be assigned to based on the requests from the >>>>>>>>>> device drivers as well as the memory device hardware information >>>>>>>>>> provided by the firmware. >>>>>>>>>> >>>>>>>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy. >>>>>>>>>> >>>>>>>>>> Memory Allocation for Demotion >>>>>>>>>> ============================== >>>>>>>>>> This patch series keep the demotion target page allocation logic same. >>>>>>>>>> The demotion page allocation pick the closest NUMA node in the >>>>>>>>>> next lower tier to the current NUMA node allocating pages from. >>>>>>>>>> >>>>>>>>>> This will be later improved to use the same page allocation strategy >>>>>>>>>> using fallback list. >>>>>>>>>> >>>>>>>>>> Sysfs Interface: >>>>>>>>>> ------------- >>>>>>>>>> Listing current list of memory tiers details: >>>>>>>>>> >>>>>>>>>> :/sys/devices/system/memtier$ ls >>>>>>>>>> default_tier max_tier memtier1 power uevent >>>>>>>>>> :/sys/devices/system/memtier$ cat default_tier >>>>>>>>>> memtier200 >>>>>>>>>> :/sys/devices/system/memtier$ cat max_tier >>>>>>>>>> 400 >>>>>>>>>> :/sys/devices/system/memtier$ >>>>>>>>>> >>>>>>>>>> Per node memory tier details: >>>>>>>>>> >>>>>>>>>> For a cpu only NUMA node: >>>>>>>>>> >>>>>>>>>> :/sys/devices/system/node# cat node0/memtier >>>>>>>>>> :/sys/devices/system/node# echo 1 > node0/memtier >>>>>>>>>> :/sys/devices/system/node# cat node0/memtier >>>>>>>>>> :/sys/devices/system/node# >>>>>>>>>> >>>>>>>>>> For a NUMA node with memory: >>>>>>>>>> :/sys/devices/system/node# cat node1/memtier >>>>>>>>>> 1 >>>>>>>>>> :/sys/devices/system/node# ls ../memtier/ >>>>>>>>>> default_tier max_tier memtier1 power uevent >>>>>>>>>> :/sys/devices/system/node# echo 2 > node1/memtier >>>>>>>>>> :/sys/devices/system/node# >>>>>>>>>> :/sys/devices/system/node# ls ../memtier/ >>>>>>>>>> default_tier max_tier memtier1 memtier2 power uevent >>>>>>>>>> :/sys/devices/system/node# cat node1/memtier >>>>>>>>>> 2 >>>>>>>>>> :/sys/devices/system/node# >>>>>>>>>> >>>>>>>>>> Removing a memory tier >>>>>>>>>> :/sys/devices/system/node# cat node1/memtier >>>>>>>>>> 2 >>>>>>>>>> :/sys/devices/system/node# echo 1 > node1/memtier >>>>>>>>> >>>>>>>>> Thanks a lot for your patchset. 
>>>>>>>>> >>>>>>>>> Per my understanding, we haven't reach consensus on >>>>>>>>> >>>>>>>>> - how to create the default memory tiers in kernel (via abstract >>>>>>>>> distance provided by drivers? Or use SLIT as the first step?) >>>>>>>>> >>>>>>>>> - how to override the default memory tiers from user space >>>>>>>>> >>>>>>>>> As in the following thread and email, >>>>>>>>> >>>>>>>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/ >>>>>>>>> >>>>>>>>> I think that we need to finalized on that firstly? >>>>>>>> >>>>>>>> I did list the proposal here >>>>>>>> >>>>>>>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com >>>>>>>> >>>>>>>> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated >>>>>>>> if the user wants a different tier topology. >>>>>>>> >>>>>>>> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200 >>>>>>>> >>>>>>>> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100. >>>>>>>> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use >>>>>>>> to control the tier assignment this can be a range of memory tiers. >>>>>>>> >>>>>>>> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update >>>>>>>> the memory tier assignment based on device attributes. >>>>>>> >>>>>>> Sorry for late reply. >>>>>>> >>>>>>> As the first step, it may be better to skip the parts that we haven't >>>>>>> reached consensus yet, for example, the user space interface to override >>>>>>> the default memory tiers. And we can use 0, 1, 2 as the default memory >>>>>>> tier IDs. We can refine/revise the in-kernel implementation, but we >>>>>>> cannot change the user space ABI. >>>>>>> >>>>>> >>>>>> Can you help list the use case that will be broken by using tierID as outlined in this series? >>>>>> One of the details that were mentioned earlier was the need to track top-tier memory usage in a >>>>>> memcg and IIUC the patchset posted https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com >>>>>> can work with tier IDs too. Let me know if you think otherwise. So at this point >>>>>> I am not sure which area we are still debating w.r.t the userspace interface. >>>>> >>>>> In >>>>> >>>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/ >>>>> >>>>> per my understanding, Johannes suggested to override the kernel default >>>>> memory tiers with "abstract distance" via drivers implementing memory >>>>> devices. As you said in another email, that is related to [7/12] of the >>>>> series. And we can table it for future. >>>>> >>>>> And per my understanding, he also suggested to make memory tier IDs >>>>> dynamic. For example, after the "abstract distance" of a driver is >>>>> overridden by users, the total number of memory tiers may be changed, >>>>> and the memory tier ID of some nodes may be changed too. This will make >>>>> memory tier ID easier to be understood, but more unstable. For example, >>>>> this will make it harder to specify the per-memory-tier memory partition >>>>> for a cgroup. >>>>> >>>> >>>> With all the approaches we discussed so far, a memory tier of a numa node can be changed. >>>> ie, pgdat->memtier can change anytime. 
The per memcg top tier mem usage tracking patches >>>> posted here >>>> https://lore.kernel.org/linux-mm/cefeb63173fa0fac7543315a2abbd4b5a1b25af8.1655242024.git.tim.c.chen@linux.intel.com/ >>>> doesn't consider the node movement from one memory tier to another. If we need >>>> a stable pgdat->memtier we will have to prevent a node memory tier reassignment >>>> while we have pages from the memory tier charged to a cgroup. This patchset should not >>>> prevent such a restriction. >>> >>> Absolute stableness doesn't exist even in "rank" based solution. But >>> "rank" can improve the stableness at some degree. For example, if we >>> move the tier of HBM nodes (from below DRAM to above DRAM), the DRAM >>> nodes can keep its memory tier ID stable. This may be not a real issue >>> finally. But we need to discuss that. >>> >> >> I agree that using ranks gives us the flexibility to change demotion order >> without being blocked by cgroup usage. But how frequently do we expect the >> tier assignment to change? My expectation was these reassignments are going >> to be rare and won't happen frequently after a system is up and running? >> Hence using tierID for demotion order won't prevent a node reassignment >> much because we don't expect to change the node tierID during runtime. In >> the rare case we do, we will have to make sure there is no cgroup usage from >> the specific memory tier. >> >> Even if we use ranks, we will have to avoid a rank update, if such >> an update can change the meaning of top tier? ie, if a rank update >> can result in a node being moved from top tier to non top tier. >> >>> Tim has suggested to use top-tier(s) memory partition among cgroups. >>> But I don't think that has been finalized. We may use per-memory-tier >>> memory partition among cgroups. I don't know whether Wei will use that >>> (may be implemented in the user space). >>> >>> And, if we thought stableness between nodes and memory tier ID isn't >>> important. Why should we use sparse memory device IDs (that is, 100, >>> 200, 300)? Why not just 0, 1, 2, ...? That looks more natural. >>> >> >> >> The range allows us to use memtier ID for demotion order. ie, as we start initializing >> devices with different attributes via dax kmem, there will be a desire to >> assign them to different tierIDs. Having default memtier ID (DRAM) at 200 enables >> us to put these devices in the range [0 - 200) without updating the node to memtier >> mapping of existing NUMA nodes (ie, without updating default memtier). > > I believe that sparse memory tier IDs can make memory tier more stable > in some cases. But this is different from the system suggested by > Johannes. Per my understanding, with Johannes' system, we will > > - one driver may online different memory types (such as kmem_dax may > online HBM, PMEM, etc.) > > - one memory type manages several memory nodes (NUMA nodes) > > - one "abstract distance" for each memory type > > - the "abstract distance" can be offset by user space override knob > > - memory tiers generated dynamic from different memory types according > "abstract distance" and overridden "offset" > > - the granularity to group several memory types into one memory tier can > be overridden via user space knob > > In this way, the memory tiers may be changed totally after user space > overridden. It may be hard to link memory tiers before/after the > overridden. So we may need to reset all per-memory-tier configuration, > such as cgroup paritation limit or interleave weight, etc. 
Making sure we all agree on the details. In the proposal https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com instead of calling it "abstract distance" I was referring to it as device attributes. Johannes also suggested that these device attributes/"abstract distance" be used to derive the memory tier to which the memory type/memory device will be assigned. So dax kmem would manage different types of memory and, based on the device attributes, we would assign them to different memory tiers (memory tiers in the range [0-200)). Now the additional detail here is that we might add knobs that will be used by dax kmem to fine-tune the memory type to memory tier assignment. On updating these knob values, the kernel should rebuild the entire memory tier hierarchy. (Earlier I was considering that only newly added memory devices would be impacted by such a change, but I agree it makes sense to rebuild the entire hierarchy again.) That rebuilding will, however, be restricted to the dax kmem driver. > > Personally, I think the system above makes sense. But I think we need > to make sure whether it satisfies the requirements. -aneesh
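A small sketch of the flow described above, under stated assumptions: memory that no driver claims stays in the default tier (200), dax kmem places the memory types it manages somewhere in [0, 200) based on device attributes, and updating a tuning knob rebuilds the whole assignment. The single latency attribute, the cutoff knob, and all helper names are hypothetical stand-ins for whatever HMAT-like data and fine-tuning knobs end up being used.

#include <stdio.h>

#define DEFAULT_MEMTIER		200	/* DRAM, not driver managed */
#define DAX_KMEM_TIER_FAST	150	/* below the default tier ... */
#define DAX_KMEM_TIER_SLOW	100	/* ... i.e. inside [0, 200) */

struct kmem_device {
	const char *name;
	int read_latency_ns;	/* stand-in for HMAT-style attributes */
};

/* Hypothetical policy: devices that look closer to DRAM get a higher tier. */
static int dax_kmem_pick_tier(const struct kmem_device *dev, int latency_cutoff)
{
	return dev->read_latency_ns < latency_cutoff ?
	       DAX_KMEM_TIER_FAST : DAX_KMEM_TIER_SLOW;
}

/* A knob update rebuilds the whole memory type -> memory tier assignment. */
static void rebuild_tiers(const struct kmem_device *devs, int n, int cutoff)
{
	printf("cutoff=%d: DRAM -> memtier %d", cutoff, DEFAULT_MEMTIER);
	for (int i = 0; i < n; i++)
		printf(", %s -> memtier %d", devs[i].name,
		       dax_kmem_pick_tier(&devs[i], cutoff));
	printf("\n");
}

int main(void)
{
	struct kmem_device devs[] = {
		{ "cxl-dram0", 250 },
		{ "pmem0",     900 },
	};

	rebuild_tiers(devs, 2, 400);	/* initial knob value */
	rebuild_tiers(devs, 2, 200);	/* after a knob update */
	return 0;
}

With the first cutoff the hypothetical cxl-dram0 lands in tier 150 and pmem0 in tier 100; tightening the knob moves cxl-dram0 down to 100, while DRAM stays at the default 200 throughout.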
Wei Xu <weixugc@google.com> writes: > On Tue, Jul 12, 2022 at 8:03 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: >> >> > On 7/12/22 2:18 PM, Huang, Ying wrote: >> >> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: >> >> >> >>> On 7/12/22 12:29 PM, Huang, Ying wrote: >> >>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: >> >>>> >> >>>>> On 7/12/22 6:46 AM, Huang, Ying wrote: >> >>>>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: >> >>>>>> >> >>>>>>> On 7/5/22 9:59 AM, Huang, Ying wrote: >> >>>>>>>> Hi, Aneesh, >> >>>>>>>> >> >>>>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: >> >>>>>>>> >> >>>>>>>>> The current kernel has the basic memory tiering support: Inactive >> >>>>>>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower >> >>>>>>>>> tier NUMA node to make room for new allocations on the higher tier >> >>>>>>>>> NUMA node. Frequently accessed pages on a lower tier NUMA node can be >> >>>>>>>>> migrated (promoted) to a higher tier NUMA node to improve the >> >>>>>>>>> performance. >> >>>>>>>>> >> >>>>>>>>> In the current kernel, memory tiers are defined implicitly via a >> >>>>>>>>> demotion path relationship between NUMA nodes, which is created during >> >>>>>>>>> the kernel initialization and updated when a NUMA node is hot-added or >> >>>>>>>>> hot-removed. The current implementation puts all nodes with CPU into >> >>>>>>>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing >> >>>>>>>>> the per-node demotion targets based on the distances between nodes. >> >>>>>>>>> >> >>>>>>>>> This current memory tier kernel interface needs to be improved for >> >>>>>>>>> several important use cases: >> >>>>>>>>> >> >>>>>>>>> * The current tier initialization code always initializes >> >>>>>>>>> each memory-only NUMA node into a lower tier. But a memory-only >> >>>>>>>>> NUMA node may have a high performance memory device (e.g. a DRAM >> >>>>>>>>> device attached via CXL.mem or a DRAM-backed memory-only node on >> >>>>>>>>> a virtual machine) and should be put into a higher tier. >> >>>>>>>>> >> >>>>>>>>> * The current tier hierarchy always puts CPU nodes into the top >> >>>>>>>>> tier. But on a system with HBM (e.g. GPU memory) devices, these >> >>>>>>>>> memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes >> >>>>>>>>> with CPUs are better to be placed into the next lower tier. >> >>>>>>>>> >> >>>>>>>>> * Also because the current tier hierarchy always puts CPU nodes >> >>>>>>>>> into the top tier, when a CPU is hot-added (or hot-removed) and >> >>>>>>>>> triggers a memory node from CPU-less into a CPU node (or vice >> >>>>>>>>> versa), the memory tier hierarchy gets changed, even though no >> >>>>>>>>> memory node is added or removed. This can make the tier >> >>>>>>>>> hierarchy unstable and make it difficult to support tier-based >> >>>>>>>>> memory accounting. >> >>>>>>>>> >> >>>>>>>>> * A higher tier node can only be demoted to selected nodes on the >> >>>>>>>>> next lower tier as defined by the demotion path, not any other >> >>>>>>>>> node from any lower tier. This strict, hard-coded demotion order >> >>>>>>>>> does not work in all use cases (e.g. 
some use cases may want to >> >>>>>>>>> allow cross-socket demotion to another node in the same demotion >> >>>>>>>>> tier as a fallback when the preferred demotion node is out of >> >>>>>>>>> space), and has resulted in the feature request for an interface to >> >>>>>>>>> override the system-wide, per-node demotion order from the >> >>>>>>>>> userspace. This demotion order is also inconsistent with the page >> >>>>>>>>> allocation fallback order when all the nodes in a higher tier are >> >>>>>>>>> out of space: The page allocation can fall back to any node from >> >>>>>>>>> any lower tier, whereas the demotion order doesn't allow that. >> >>>>>>>>> >> >>>>>>>>> * There are no interfaces for the userspace to learn about the memory >> >>>>>>>>> tier hierarchy in order to optimize its memory allocations. >> >>>>>>>>> >> >>>>>>>>> This patch series make the creation of memory tiers explicit under >> >>>>>>>>> the control of userspace or device driver. >> >>>>>>>>> >> >>>>>>>>> Memory Tier Initialization >> >>>>>>>>> ========================== >> >>>>>>>>> >> >>>>>>>>> By default, all memory nodes are assigned to the default tier with >> >>>>>>>>> tier ID value 200. >> >>>>>>>>> >> >>>>>>>>> A device driver can move up or down its memory nodes from the default >> >>>>>>>>> tier. For example, PMEM can move down its memory nodes below the >> >>>>>>>>> default tier, whereas GPU can move up its memory nodes above the >> >>>>>>>>> default tier. >> >>>>>>>>> >> >>>>>>>>> The kernel initialization code makes the decision on which exact tier >> >>>>>>>>> a memory node should be assigned to based on the requests from the >> >>>>>>>>> device drivers as well as the memory device hardware information >> >>>>>>>>> provided by the firmware. >> >>>>>>>>> >> >>>>>>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy. >> >>>>>>>>> >> >>>>>>>>> Memory Allocation for Demotion >> >>>>>>>>> ============================== >> >>>>>>>>> This patch series keep the demotion target page allocation logic same. >> >>>>>>>>> The demotion page allocation pick the closest NUMA node in the >> >>>>>>>>> next lower tier to the current NUMA node allocating pages from. >> >>>>>>>>> >> >>>>>>>>> This will be later improved to use the same page allocation strategy >> >>>>>>>>> using fallback list. 
>> >>>>>>>>> >> >>>>>>>>> Sysfs Interface: >> >>>>>>>>> ------------- >> >>>>>>>>> Listing current list of memory tiers details: >> >>>>>>>>> >> >>>>>>>>> :/sys/devices/system/memtier$ ls >> >>>>>>>>> default_tier max_tier memtier1 power uevent >> >>>>>>>>> :/sys/devices/system/memtier$ cat default_tier >> >>>>>>>>> memtier200 >> >>>>>>>>> :/sys/devices/system/memtier$ cat max_tier >> >>>>>>>>> 400 >> >>>>>>>>> :/sys/devices/system/memtier$ >> >>>>>>>>> >> >>>>>>>>> Per node memory tier details: >> >>>>>>>>> >> >>>>>>>>> For a cpu only NUMA node: >> >>>>>>>>> >> >>>>>>>>> :/sys/devices/system/node# cat node0/memtier >> >>>>>>>>> :/sys/devices/system/node# echo 1 > node0/memtier >> >>>>>>>>> :/sys/devices/system/node# cat node0/memtier >> >>>>>>>>> :/sys/devices/system/node# >> >>>>>>>>> >> >>>>>>>>> For a NUMA node with memory: >> >>>>>>>>> :/sys/devices/system/node# cat node1/memtier >> >>>>>>>>> 1 >> >>>>>>>>> :/sys/devices/system/node# ls ../memtier/ >> >>>>>>>>> default_tier max_tier memtier1 power uevent >> >>>>>>>>> :/sys/devices/system/node# echo 2 > node1/memtier >> >>>>>>>>> :/sys/devices/system/node# >> >>>>>>>>> :/sys/devices/system/node# ls ../memtier/ >> >>>>>>>>> default_tier max_tier memtier1 memtier2 power uevent >> >>>>>>>>> :/sys/devices/system/node# cat node1/memtier >> >>>>>>>>> 2 >> >>>>>>>>> :/sys/devices/system/node# >> >>>>>>>>> >> >>>>>>>>> Removing a memory tier >> >>>>>>>>> :/sys/devices/system/node# cat node1/memtier >> >>>>>>>>> 2 >> >>>>>>>>> :/sys/devices/system/node# echo 1 > node1/memtier >> >>>>>>>> >> >>>>>>>> Thanks a lot for your patchset. >> >>>>>>>> >> >>>>>>>> Per my understanding, we haven't reach consensus on >> >>>>>>>> >> >>>>>>>> - how to create the default memory tiers in kernel (via abstract >> >>>>>>>> distance provided by drivers? Or use SLIT as the first step?) >> >>>>>>>> >> >>>>>>>> - how to override the default memory tiers from user space >> >>>>>>>> >> >>>>>>>> As in the following thread and email, >> >>>>>>>> >> >>>>>>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/ >> >>>>>>>> >> >>>>>>>> I think that we need to finalized on that firstly? >> >>>>>>> >> >>>>>>> I did list the proposal here >> >>>>>>> >> >>>>>>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com >> >>>>>>> >> >>>>>>> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated >> >>>>>>> if the user wants a different tier topology. >> >>>>>>> >> >>>>>>> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200 >> >>>>>>> >> >>>>>>> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100. >> >>>>>>> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use >> >>>>>>> to control the tier assignment this can be a range of memory tiers. >> >>>>>>> >> >>>>>>> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update >> >>>>>>> the memory tier assignment based on device attributes. >> >>>>>> >> >>>>>> Sorry for late reply. >> >>>>>> >> >>>>>> As the first step, it may be better to skip the parts that we haven't >> >>>>>> reached consensus yet, for example, the user space interface to override >> >>>>>> the default memory tiers. And we can use 0, 1, 2 as the default memory >> >>>>>> tier IDs. We can refine/revise the in-kernel implementation, but we >> >>>>>> cannot change the user space ABI. 
>> >>>>>> >> >>>>> >> >>>>> Can you help list the use case that will be broken by using tierID as outlined in this series? >> >>>>> One of the details that were mentioned earlier was the need to track top-tier memory usage in a >> >>>>> memcg and IIUC the patchset posted https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com >> >>>>> can work with tier IDs too. Let me know if you think otherwise. So at this point >> >>>>> I am not sure which area we are still debating w.r.t the userspace interface. >> >>>> >> >>>> In >> >>>> >> >>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/ >> >>>> >> >>>> per my understanding, Johannes suggested to override the kernel default >> >>>> memory tiers with "abstract distance" via drivers implementing memory >> >>>> devices. As you said in another email, that is related to [7/12] of the >> >>>> series. And we can table it for future. >> >>>> >> >>>> And per my understanding, he also suggested to make memory tier IDs >> >>>> dynamic. For example, after the "abstract distance" of a driver is >> >>>> overridden by users, the total number of memory tiers may be changed, >> >>>> and the memory tier ID of some nodes may be changed too. This will make >> >>>> memory tier ID easier to be understood, but more unstable. For example, >> >>>> this will make it harder to specify the per-memory-tier memory partition >> >>>> for a cgroup. >> >>>> >> >>> >> >>> With all the approaches we discussed so far, a memory tier of a numa node can be changed. >> >>> ie, pgdat->memtier can change anytime. The per memcg top tier mem usage tracking patches >> >>> posted here >> >>> https://lore.kernel.org/linux-mm/cefeb63173fa0fac7543315a2abbd4b5a1b25af8.1655242024.git.tim.c.chen@linux.intel.com/ >> >>> doesn't consider the node movement from one memory tier to another. If we need >> >>> a stable pgdat->memtier we will have to prevent a node memory tier reassignment >> >>> while we have pages from the memory tier charged to a cgroup. This patchset should not >> >>> prevent such a restriction. >> >> >> >> Absolute stableness doesn't exist even in "rank" based solution. But >> >> "rank" can improve the stableness at some degree. For example, if we >> >> move the tier of HBM nodes (from below DRAM to above DRAM), the DRAM >> >> nodes can keep its memory tier ID stable. This may be not a real issue >> >> finally. But we need to discuss that. >> >> >> > >> > I agree that using ranks gives us the flexibility to change demotion order >> > without being blocked by cgroup usage. But how frequently do we expect the >> > tier assignment to change? My expectation was these reassignments are going >> > to be rare and won't happen frequently after a system is up and running? >> > Hence using tierID for demotion order won't prevent a node reassignment >> > much because we don't expect to change the node tierID during runtime. In >> > the rare case we do, we will have to make sure there is no cgroup usage from >> > the specific memory tier. >> > >> > Even if we use ranks, we will have to avoid a rank update, if such >> > an update can change the meaning of top tier? ie, if a rank update >> > can result in a node being moved from top tier to non top tier. >> > >> >> Tim has suggested to use top-tier(s) memory partition among cgroups. >> >> But I don't think that has been finalized. We may use per-memory-tier >> >> memory partition among cgroups. I don't know whether Wei will use that >> >> (may be implemented in the user space). 
>> >> >> >> And, if we thought stableness between nodes and memory tier ID isn't >> >> important. Why should we use sparse memory device IDs (that is, 100, >> >> 200, 300)? Why not just 0, 1, 2, ...? That looks more natural. >> >> >> > >> > >> > The range allows us to use memtier ID for demotion order. ie, as we start initializing >> > devices with different attributes via dax kmem, there will be a desire to >> > assign them to different tierIDs. Having default memtier ID (DRAM) at 200 enables >> > us to put these devices in the range [0 - 200) without updating the node to memtier >> > mapping of existing NUMA nodes (ie, without updating default memtier). >> >> I believe that sparse memory tier IDs can make memory tier more stable >> in some cases. But this is different from the system suggested by >> Johannes. Per my understanding, with Johannes' system, we will >> >> - one driver may online different memory types (such as kmem_dax may >> online HBM, PMEM, etc.) >> >> - one memory type manages several memory nodes (NUMA nodes) >> >> - one "abstract distance" for each memory type >> >> - the "abstract distance" can be offset by user space override knob >> >> - memory tiers generated dynamic from different memory types according >> "abstract distance" and overridden "offset" >> >> - the granularity to group several memory types into one memory tier can >> be overridden via user space knob >> >> In this way, the memory tiers may be changed totally after user space >> overridden. It may be hard to link memory tiers before/after the >> overridden. So we may need to reset all per-memory-tier configuration, >> such as cgroup paritation limit or interleave weight, etc. >> >> Personally, I think the system above makes sense. But I think we need >> to make sure whether it satisfies the requirements. >> >> Best Regards, >> Huang, Ying >> > > Th "memory type" and "abstract distance" concepts sound to me similar > to the memory tier "rank" idea. > > We can have some well-defined type/distance/rank values, e.g. HBM, > DRAM, CXL_DRAM, PMEM, CXL_PMEM, which a device can register with. The > memory tiers will build from these values. It can be configurable to > whether/how to collapse several values into a single tier. But then we also don't want to not use it directly for demotion order. Instead, we can use tierID. The memory type to memory tier assignment can be fine-tuned using device attribute/"abstract distance"/rank/userspace override etc. -aneesh
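To make the sparse-ID idea above concrete, here is a small illustrative sketch (not code from this series): the memory types are the ones Wei listed, the DRAM tier of 200 and the PMEM/dax-kmem tier of 100 follow the posted series, while the HBM and CXL DRAM values and all identifiers are made-up assumptions.

/*
 * Illustrative only -- not from the posted patches.  Shows how well-known
 * memory types could default to sparse tier IDs around the DRAM tier (200),
 * leaving room to fine-tune the type-to-tier assignment later.
 */
enum example_memory_type {
	EXAMPLE_MEMTYPE_HBM,
	EXAMPLE_MEMTYPE_DRAM,
	EXAMPLE_MEMTYPE_CXL_DRAM,
	EXAMPLE_MEMTYPE_PMEM,
	EXAMPLE_MEMTYPE_CXL_PMEM,
};

static int example_default_tier(enum example_memory_type type)
{
	switch (type) {
	case EXAMPLE_MEMTYPE_HBM:
		return 300;	/* assumed: above the default DRAM tier */
	case EXAMPLE_MEMTYPE_CXL_DRAM:
		return 150;	/* assumed: below DRAM, above PMEM */
	case EXAMPLE_MEMTYPE_PMEM:
	case EXAMPLE_MEMTYPE_CXL_PMEM:
		return 100;	/* what dax kmem uses in this series */
	case EXAMPLE_MEMTYPE_DRAM:
	default:
		return 200;	/* the default tier in this series */
	}
}

A higher tier ID here means a higher (faster) tier, so demotion walks from larger IDs toward smaller ones; keeping DRAM at 200 is what leaves the whole [0, 200) range free for slower devices without touching existing node-to-memtier mappings.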
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: > "Huang, Ying" <ying.huang@intel.com> writes: [snip] >> >> I believe that sparse memory tier IDs can make memory tier more stable >> in some cases. But this is different from the system suggested by >> Johannes. Per my understanding, with Johannes' system, we will >> >> - one driver may online different memory types (such as kmem_dax may >> online HBM, PMEM, etc.) >> >> - one memory type manages several memory nodes (NUMA nodes) >> >> - one "abstract distance" for each memory type >> >> - the "abstract distance" can be offset by user space override knob >> >> - memory tiers generated dynamic from different memory types according >> "abstract distance" and overridden "offset" >> >> - the granularity to group several memory types into one memory tier can >> be overridden via user space knob >> >> In this way, the memory tiers may be changed totally after user space >> overridden. It may be hard to link memory tiers before/after the >> overridden. So we may need to reset all per-memory-tier configuration, >> such as cgroup paritation limit or interleave weight, etc. > > Making sure we all agree on the details. > > In the proposal https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com > instead of calling it "abstract distance" I was referring it as device > attributes. > > Johannes also suggested these device attributes/"abstract distance" > to be used to derive the memory tier to which the memory type/memory > device will be assigned. > > So dax kmem would manage different types of memory and based on the device > attributes, we would assign them to different memory tiers (memory tiers > in the range [0-200)). > > Now the additional detail here is that we might add knobs that will be > used by dax kmem to fine-tune memory types to memory tiers assignment. > On updating these knob values, the kernel should rebuild the entire > memory tier hierarchy. (earlier I was considering only newly added > memory devices will get impacted by such a change. But I agree it > makes sense to rebuild the entire hierarchy again) But that rebuilding > will be restricted to dax kmem driver. > Thanks for explanation and pointer. Per my understanding, memory types and memory devices including abstract distances are used to describe the *physical* memory devices, not *policy*. We may add more physical attributes to these memory devices, such as, latency, throughput, etc. I think we can reach consensus on this point? In contrast, memory tiers are more about policy, such as demotion/promotion, interleaving and possible partition among cgroups. How to derive memory tiers from memory types (or devices)? We have multiple choices. Per my understanding, Johannes suggested to use some policy parameters such as distance granularity (e.g., if granularity is 100, then memory devices with abstract distance 0-100, 100-200, 200-300, ... will be put to memory tier 0, 1, 2, ...) to build the memory tiers. Distance granularity may be not flexible enough, we may need something like a set of cutoffs or range, e.g., 50, 100, 200, 500, or 0-50, 50-100, 100-200, 200-500, >500. These policy parameters should be overridable from user space. And per my understanding, you suggested to place memory devices to memory tiers directly via a knob of memory types (or memory devices). e.g., memory_type/memtier can be written to place the memory devices of the memory_type to the specified memtier. Or via memorty_type/distance_offset to do that. 
Best Regards, Huang, Ying [snip]
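One way to read the cutoff-based grouping mentioned above is sketched below. This is not kernel code: the cutoff values are just the hypothetical 50/100/200/500 from the mail, a smaller tier index here simply means a smaller abstract distance, and the cutoff array stands in for the user-overridable policy parameters.

#include <stddef.h>

/* Hypothetical policy parameter: the abstract distance cutoffs between tiers. */
static const int example_cutoffs[] = { 50, 100, 200, 500 };

/*
 * Map an abstract distance to a tier index: 0 for 0-50, 1 for 51-100,
 * 2 for 101-200, 3 for 201-500, and 4 for anything above 500.
 */
static int example_tier_from_cutoffs(int abstract_distance)
{
	size_t i;

	for (i = 0; i < sizeof(example_cutoffs) / sizeof(example_cutoffs[0]); i++) {
		if (abstract_distance <= example_cutoffs[i])
			return (int)i;
	}
	return (int)i;	/* everything beyond the last cutoff */
}

A single distance granularity would be the special case where the cutoffs are evenly spaced; making the array itself overridable is what gives the extra flexibility discussed above.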
On 7/14/22 10:26 AM, Huang, Ying wrote: > "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: > >> "Huang, Ying" <ying.huang@intel.com> writes: > > [snip] > >>> >>> I believe that sparse memory tier IDs can make memory tier more stable >>> in some cases. But this is different from the system suggested by >>> Johannes. Per my understanding, with Johannes' system, we will >>> >>> - one driver may online different memory types (such as kmem_dax may >>> online HBM, PMEM, etc.) >>> >>> - one memory type manages several memory nodes (NUMA nodes) >>> >>> - one "abstract distance" for each memory type >>> >>> - the "abstract distance" can be offset by user space override knob >>> >>> - memory tiers generated dynamic from different memory types according >>> "abstract distance" and overridden "offset" >>> >>> - the granularity to group several memory types into one memory tier can >>> be overridden via user space knob >>> >>> In this way, the memory tiers may be changed totally after user space >>> overridden. It may be hard to link memory tiers before/after the >>> overridden. So we may need to reset all per-memory-tier configuration, >>> such as cgroup paritation limit or interleave weight, etc. >> >> Making sure we all agree on the details. >> >> In the proposal https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com >> instead of calling it "abstract distance" I was referring it as device >> attributes. >> >> Johannes also suggested these device attributes/"abstract distance" >> to be used to derive the memory tier to which the memory type/memory >> device will be assigned. >> >> So dax kmem would manage different types of memory and based on the device >> attributes, we would assign them to different memory tiers (memory tiers >> in the range [0-200)). >> >> Now the additional detail here is that we might add knobs that will be >> used by dax kmem to fine-tune memory types to memory tiers assignment. >> On updating these knob values, the kernel should rebuild the entire >> memory tier hierarchy. (earlier I was considering only newly added >> memory devices will get impacted by such a change. But I agree it >> makes sense to rebuild the entire hierarchy again) But that rebuilding >> will be restricted to dax kmem driver. >> > > Thanks for explanation and pointer. Per my understanding, memory > types and memory devices including abstract distances are used to > describe the *physical* memory devices, not *policy*. We may add more > physical attributes to these memory devices, such as, latency, > throughput, etc. I think we can reach consensus on this point? > > In contrast, memory tiers are more about policy, such as > demotion/promotion, interleaving and possible partition among cgroups. > How to derive memory tiers from memory types (or devices)? We have > multiple choices. > agreed to the above. > Per my understanding, Johannes suggested to use some policy parameters > such as distance granularity (e.g., if granularity is 100, then memory > devices with abstract distance 0-100, 100-200, 200-300, ... will be put > to memory tier 0, 1, 2, ...) to build the memory tiers. Distance > granularity may be not flexible enough, we may need something like a set > of cutoffs or range, e.g., 50, 100, 200, 500, or 0-50, 50-100, 100-200, > 200-500, >500. These policy parameters should be overridable from user > space. > The term distance was always confusing to me. Instead, I was generalizing it as an attribute. 
The challenge with the term distance for me was in clarifying the
distance of this memory device from where? Instead, it is much simpler
to group devices based on device attributes such as write latency.

So everything you explained above is correct, except we describe it in terms of a
single device attribute or a combination of multiple device attributes. We could convert
a combination of multiple device attributes into an "abstract distance".
Such an "abstract distance" is derived based on different device
attribute values with policy parameters overridable from userspace.

> And per my understanding, you suggested to place memory devices to
> memory tiers directly via a knob of memory types (or memory devices).
> e.g., memory_type/memtier can be written to place the memory devices of
> the memory_type to the specified memtier. Or via
> memorty_type/distance_offset to do that.
>

What I explained above is what I would expect the kernel to do by default. Before we can
reach there we need to get a better understanding of which device attribute describes
the grouping of memory devices to a memory tier. Do we need latency-based grouping
or bandwidth-based grouping? Till then userspace can place these devices to different
memory tiers. Hence the addition of /sys/devices/system/node/nodeN/memtier write feature
which moves a memory node to a specific memory tier.

I am not suggesting we override the memory types from userspace.

-aneesh
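A rough sketch of converting a combination of device attributes into an "abstract distance" follows. It is not from the posted series: the attribute names, the DRAM reference values and the weights are assumptions, with the weights standing in for the policy parameters that would be overridable from userspace.

/* Hypothetical device attributes, e.g. as they might come from HMAT. */
struct example_mem_attrs {
	unsigned int read_latency_ns;
	unsigned int read_bw_mbps;
};

#define EXAMPLE_DRAM_LATENCY_NS	80	/* assumed DRAM reference values */
#define EXAMPLE_DRAM_BW_MBPS	100000
#define EXAMPLE_DRAM_DISTANCE	100	/* assumed distance of plain DRAM */

/* Larger result means "further"/slower than plain DRAM. */
static unsigned int example_abstract_distance(const struct example_mem_attrs *a,
					      unsigned int latency_weight,
					      unsigned int bw_weight)
{
	unsigned int lat_part = a->read_latency_ns * EXAMPLE_DRAM_DISTANCE /
				EXAMPLE_DRAM_LATENCY_NS;
	unsigned int bw_part = EXAMPLE_DRAM_BW_MBPS * EXAMPLE_DRAM_DISTANCE /
			       (a->read_bw_mbps ? a->read_bw_mbps : 1);
	unsigned int wsum = latency_weight + bw_weight;

	return (lat_part * latency_weight + bw_part * bw_weight) /
	       (wsum ? wsum : 1);
}

Whether the grouping should lean on the latency term or the bandwidth term is exactly the open question raised above; in this sketch that choice is just the ratio of the two weights.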
Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: > On 7/14/22 10:26 AM, Huang, Ying wrote: >> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: >> >>> "Huang, Ying" <ying.huang@intel.com> writes: >> >> [snip] >> >>>> >>>> I believe that sparse memory tier IDs can make memory tier more stable >>>> in some cases. But this is different from the system suggested by >>>> Johannes. Per my understanding, with Johannes' system, we will >>>> >>>> - one driver may online different memory types (such as kmem_dax may >>>> online HBM, PMEM, etc.) >>>> >>>> - one memory type manages several memory nodes (NUMA nodes) >>>> >>>> - one "abstract distance" for each memory type >>>> >>>> - the "abstract distance" can be offset by user space override knob >>>> >>>> - memory tiers generated dynamic from different memory types according >>>> "abstract distance" and overridden "offset" >>>> >>>> - the granularity to group several memory types into one memory tier can >>>> be overridden via user space knob >>>> >>>> In this way, the memory tiers may be changed totally after user space >>>> overridden. It may be hard to link memory tiers before/after the >>>> overridden. So we may need to reset all per-memory-tier configuration, >>>> such as cgroup paritation limit or interleave weight, etc. >>> >>> Making sure we all agree on the details. >>> >>> In the proposal https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com >>> instead of calling it "abstract distance" I was referring it as device >>> attributes. >>> >>> Johannes also suggested these device attributes/"abstract distance" >>> to be used to derive the memory tier to which the memory type/memory >>> device will be assigned. >>> >>> So dax kmem would manage different types of memory and based on the device >>> attributes, we would assign them to different memory tiers (memory tiers >>> in the range [0-200)). >>> >>> Now the additional detail here is that we might add knobs that will be >>> used by dax kmem to fine-tune memory types to memory tiers assignment. >>> On updating these knob values, the kernel should rebuild the entire >>> memory tier hierarchy. (earlier I was considering only newly added >>> memory devices will get impacted by such a change. But I agree it >>> makes sense to rebuild the entire hierarchy again) But that rebuilding >>> will be restricted to dax kmem driver. >>> >> >> Thanks for explanation and pointer. Per my understanding, memory >> types and memory devices including abstract distances are used to >> describe the *physical* memory devices, not *policy*. We may add more >> physical attributes to these memory devices, such as, latency, >> throughput, etc. I think we can reach consensus on this point? >> >> In contrast, memory tiers are more about policy, such as >> demotion/promotion, interleaving and possible partition among cgroups. >> How to derive memory tiers from memory types (or devices)? We have >> multiple choices. >> > > agreed to the above. > >> Per my understanding, Johannes suggested to use some policy parameters >> such as distance granularity (e.g., if granularity is 100, then memory >> devices with abstract distance 0-100, 100-200, 200-300, ... will be put >> to memory tier 0, 1, 2, ...) to build the memory tiers. Distance >> granularity may be not flexible enough, we may need something like a set >> of cutoffs or range, e.g., 50, 100, 200, 500, or 0-50, 50-100, 100-200, >> 200-500, >500. These policy parameters should be overridable from user >> space. 
>>
>
> The term distance was always confusing to me. Instead, I was
> generalizing it as an attribute.

"Attributes" sounds too general to me :-)

> The challenge with the term distance for me was in clarifying the
> distance of this memory device from where? Instead, it is much simpler
> to group devices based on device attributes such as write latency.

Per my understanding, the "distance" here is the distance from the local
CPUs, that is, with the influence of the NUMA topology removed as much as
possible.  There may be other memory accessing initiators in the system,
such as GPUs, etc.  But we don't want to have a different set of memory
tiers for each initiator, so we mainly consider CPUs.  The device drivers
of the other initiators may consider other types of memory tiers.

The "distance" characterizes the latency of the memory device under the
typical memory throughput in the system.  So it characterizes both latency
and throughput, because the latency will increase with the throughput.
This is one of the reasons we need to be able to override the default
distance: the typical memory throughput may differ among workloads.

The "abstract distance" can come from SLIT and HMAT at first.  Then we can
try to explore other possible sources of information.

> So everything you explained above is correct, except we describe it in terms of a
> single device attribute or a combination of multiple device attributes. We could convert
> a combination of multiple device attributes into an "abstract distance".

Sounds good to me.

> Such an "abstract distance" is derived based on different device
> attribute values with policy parameters overridable from userspace.

I think "abstract distance" is different from policy parameters.

>> And per my understanding, you suggested to place memory devices to
>> memory tiers directly via a knob of memory types (or memory devices).
>> e.g., memory_type/memtier can be written to place the memory devices of
>> the memory_type to the specified memtier. Or via
>> memorty_type/distance_offset to do that.
>>
>
> What I explained above is what I would expect the kernel to do by default. Before we can
> reach there we need to get a better understanding of which device attribute describes
> the grouping of memory devices to a memory tier. Do we need latency-based grouping
> or bandwidth-based grouping? Till then userspace can place these devices to different
> memory tiers. Hence the addition of /sys/devices/system/node/nodeN/memtier write feature
> which moves a memory node to a specific memory tier.
>
> I am not suggesting we override the memory types from userspace.

OK.  I don't think we need this.  We can examine the target solution
above and try to find any issue with it.

Best Regards,
Huang, Ying
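One possible shape of the override knob being discussed, sketched with hypothetical structure and field names (not kernel API): a memory type keeps a firmware-derived base abstract distance (e.g. from SLIT/HMAT) plus a user-adjustable offset, and the tiers are rebuilt from the effective distance.

/* Hypothetical representation of a memory type, not kernel API. */
struct example_memory_type {
	const char *name;
	int base_distance;	/* derived from SLIT/HMAT or the driver */
	int distance_offset;	/* user-space override, default 0 */
};

static int example_effective_distance(const struct example_memory_type *t)
{
	int d = t->base_distance + t->distance_offset;

	return d > 0 ? d : 1;	/* keep the distance positive */
}

When such an offset changes, the memory tiers would be regenerated from the new effective distances, which is why, as noted earlier in the thread, per-tier configuration such as cgroup partitions or interleave weights may need to be reset.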
On Wed, 13 Jul 2022 16:17:21 +0800 "Huang, Ying" <ying.huang@intel.com> wrote: > Wei Xu <weixugc@google.com> writes: > > > On Tue, Jul 12, 2022 at 8:03 PM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: > >> > >> > On 7/12/22 2:18 PM, Huang, Ying wrote: > >> >> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: > >> >> > >> >>> On 7/12/22 12:29 PM, Huang, Ying wrote: > >> >>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: > >> >>>> > >> >>>>> On 7/12/22 6:46 AM, Huang, Ying wrote: > >> >>>>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: > >> >>>>>> > >> >>>>>>> On 7/5/22 9:59 AM, Huang, Ying wrote: > >> >>>>>>>> Hi, Aneesh, > >> >>>>>>>> > >> >>>>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: > >> >>>>>>>> > >> >>>>>>>>> The current kernel has the basic memory tiering support: Inactive > >> >>>>>>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower > >> >>>>>>>>> tier NUMA node to make room for new allocations on the higher tier > >> >>>>>>>>> NUMA node. Frequently accessed pages on a lower tier NUMA node can be > >> >>>>>>>>> migrated (promoted) to a higher tier NUMA node to improve the > >> >>>>>>>>> performance. > >> >>>>>>>>> > >> >>>>>>>>> In the current kernel, memory tiers are defined implicitly via a > >> >>>>>>>>> demotion path relationship between NUMA nodes, which is created during > >> >>>>>>>>> the kernel initialization and updated when a NUMA node is hot-added or > >> >>>>>>>>> hot-removed. The current implementation puts all nodes with CPU into > >> >>>>>>>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing > >> >>>>>>>>> the per-node demotion targets based on the distances between nodes. > >> >>>>>>>>> > >> >>>>>>>>> This current memory tier kernel interface needs to be improved for > >> >>>>>>>>> several important use cases: > >> >>>>>>>>> > >> >>>>>>>>> * The current tier initialization code always initializes > >> >>>>>>>>> each memory-only NUMA node into a lower tier. But a memory-only > >> >>>>>>>>> NUMA node may have a high performance memory device (e.g. a DRAM > >> >>>>>>>>> device attached via CXL.mem or a DRAM-backed memory-only node on > >> >>>>>>>>> a virtual machine) and should be put into a higher tier. > >> >>>>>>>>> > >> >>>>>>>>> * The current tier hierarchy always puts CPU nodes into the top > >> >>>>>>>>> tier. But on a system with HBM (e.g. GPU memory) devices, these > >> >>>>>>>>> memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes > >> >>>>>>>>> with CPUs are better to be placed into the next lower tier. > >> >>>>>>>>> > >> >>>>>>>>> * Also because the current tier hierarchy always puts CPU nodes > >> >>>>>>>>> into the top tier, when a CPU is hot-added (or hot-removed) and > >> >>>>>>>>> triggers a memory node from CPU-less into a CPU node (or vice > >> >>>>>>>>> versa), the memory tier hierarchy gets changed, even though no > >> >>>>>>>>> memory node is added or removed. This can make the tier > >> >>>>>>>>> hierarchy unstable and make it difficult to support tier-based > >> >>>>>>>>> memory accounting. > >> >>>>>>>>> > >> >>>>>>>>> * A higher tier node can only be demoted to selected nodes on the > >> >>>>>>>>> next lower tier as defined by the demotion path, not any other > >> >>>>>>>>> node from any lower tier. This strict, hard-coded demotion order > >> >>>>>>>>> does not work in all use cases (e.g. 
some use cases may want to > >> >>>>>>>>> allow cross-socket demotion to another node in the same demotion > >> >>>>>>>>> tier as a fallback when the preferred demotion node is out of > >> >>>>>>>>> space), and has resulted in the feature request for an interface to > >> >>>>>>>>> override the system-wide, per-node demotion order from the > >> >>>>>>>>> userspace. This demotion order is also inconsistent with the page > >> >>>>>>>>> allocation fallback order when all the nodes in a higher tier are > >> >>>>>>>>> out of space: The page allocation can fall back to any node from > >> >>>>>>>>> any lower tier, whereas the demotion order doesn't allow that. > >> >>>>>>>>> > >> >>>>>>>>> * There are no interfaces for the userspace to learn about the memory > >> >>>>>>>>> tier hierarchy in order to optimize its memory allocations. > >> >>>>>>>>> > >> >>>>>>>>> This patch series make the creation of memory tiers explicit under > >> >>>>>>>>> the control of userspace or device driver. > >> >>>>>>>>> > >> >>>>>>>>> Memory Tier Initialization > >> >>>>>>>>> ========================== > >> >>>>>>>>> > >> >>>>>>>>> By default, all memory nodes are assigned to the default tier with > >> >>>>>>>>> tier ID value 200. > >> >>>>>>>>> > >> >>>>>>>>> A device driver can move up or down its memory nodes from the default > >> >>>>>>>>> tier. For example, PMEM can move down its memory nodes below the > >> >>>>>>>>> default tier, whereas GPU can move up its memory nodes above the > >> >>>>>>>>> default tier. > >> >>>>>>>>> > >> >>>>>>>>> The kernel initialization code makes the decision on which exact tier > >> >>>>>>>>> a memory node should be assigned to based on the requests from the > >> >>>>>>>>> device drivers as well as the memory device hardware information > >> >>>>>>>>> provided by the firmware. > >> >>>>>>>>> > >> >>>>>>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy. > >> >>>>>>>>> > >> >>>>>>>>> Memory Allocation for Demotion > >> >>>>>>>>> ============================== > >> >>>>>>>>> This patch series keep the demotion target page allocation logic same. > >> >>>>>>>>> The demotion page allocation pick the closest NUMA node in the > >> >>>>>>>>> next lower tier to the current NUMA node allocating pages from. > >> >>>>>>>>> > >> >>>>>>>>> This will be later improved to use the same page allocation strategy > >> >>>>>>>>> using fallback list. 
[snip]
> >> > > >> >> Tim has suggested to use top-tier(s) memory partition among cgroups. > >> >> But I don't think that has been finalized. We may use per-memory-tier > >> >> memory partition among cgroups. I don't know whether Wei will use that > >> >> (may be implemented in the user space). > >> >> > >> >> And, if we thought stableness between nodes and memory tier ID isn't > >> >> important. Why should we use sparse memory device IDs (that is, 100, > >> >> 200, 300)? Why not just 0, 1, 2, ...? That looks more natural. > >> >> > >> > > >> > > >> > The range allows us to use memtier ID for demotion order. ie, as we start initializing > >> > devices with different attributes via dax kmem, there will be a desire to > >> > assign them to different tierIDs. Having default memtier ID (DRAM) at 200 enables > >> > us to put these devices in the range [0 - 200) without updating the node to memtier > >> > mapping of existing NUMA nodes (ie, without updating default memtier). > >> > >> I believe that sparse memory tier IDs can make memory tier more stable > >> in some cases. But this is different from the system suggested by > >> Johannes. Per my understanding, with Johannes' system, we will > >> > >> - one driver may online different memory types (such as kmem_dax may > >> online HBM, PMEM, etc.) > >> > >> - one memory type manages several memory nodes (NUMA nodes) > >> > >> - one "abstract distance" for each memory type > >> > >> - the "abstract distance" can be offset by user space override knob > >> > >> - memory tiers generated dynamic from different memory types according > >> "abstract distance" and overridden "offset" > >> > >> - the granularity to group several memory types into one memory tier can > >> be overridden via user space knob > >> > >> In this way, the memory tiers may be changed totally after user space > >> overridden. It may be hard to link memory tiers before/after the > >> overridden. So we may need to reset all per-memory-tier configuration, > >> such as cgroup paritation limit or interleave weight, etc. > >> > >> Personally, I think the system above makes sense. But I think we need > >> to make sure whether it satisfies the requirements. > >> > >> Best Regards, > >> Huang, Ying > >> > > > > Th "memory type" and "abstract distance" concepts sound to me similar > > to the memory tier "rank" idea. > > Yes. "abstract distance" is similar as "rank". > > > We can have some well-defined type/distance/rank values, e.g. HBM, > > DRAM, CXL_DRAM, PMEM, CXL_PMEM, which a device can register with. The > > memory tiers will build from these values. It can be configurable to > > whether/how to collapse several values into a single tier. > > The memory types are registered by drivers (such as kmem_dax). And the > distances can come from SLIT, HMAT, and other firmware or driver > specific information sources. > > Per my understanding, this solution may make memory tier IDs more > unstable. For example, the memory ID of a node may be changed after the > user override the distance of a memory type. Although I think the > overriding should be a rare operations, will it be a real issue for your > use cases? Not sure how common it is, but I'm aware of systems that have dynamic access characteristics. i.e. the bandwidth and latency of a access to a given memory device will change dynamically at runtime (typically due to something like hardware degradation / power saving etc). Potentially leading to memory in use needing to move in 'demotion order'. 
We could handle that with a per device tier and rank that changes...

Just thought I'd throw that out there to add to the complexity ;)
I don't consider it important to support initially but just wanted to
point out this will only get more complex over time.

Jonathan

>
> Best Regards,
> Huang, Ying
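Purely for illustration, one way the dynamic case Jonathan describes could be handled without destabilizing every tier ID is sketched below; all names are hypothetical stand-ins, not kernel API.

#include <stdio.h>

/* Hypothetical per-node tier table kept by some memory-tier core
 * (bounds checks omitted for brevity). */
static int example_node_tier[64];

static int example_tier_from_distance(int distance)
{
	/* Arbitrary example mapping: one tier per 100 units of distance. */
	int tier = (distance - 1) / 100;

	return tier < 0 ? 0 : tier;
}

/* Called when a device's measured characteristics -- and therefore its
 * abstract distance -- change at runtime. */
static void example_on_distance_changed(int nid, int new_distance)
{
	int new_tier = example_tier_from_distance(new_distance);

	/* Only reassign the node when its tier actually changes, so nodes
	 * whose characteristics stay stable keep a stable tier ID. */
	if (new_tier != example_node_tier[nid]) {
		printf("node %d: memtier %d -> %d\n",
		       nid, example_node_tier[nid], new_tier);
		example_node_tier[nid] = new_tier;
	}
}

Any in-use pages on such a node would still need to be handled by whatever policy sits on top (migrated in demotion order or left in place), which is where the extra complexity Jonathan mentions comes in.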
Hi, Jonathan,

Jonathan Cameron <Jonathan.Cameron@Huawei.com> writes:

> On Wed, 13 Jul 2022 16:17:21 +0800
> "Huang, Ying" <ying.huang@intel.com> wrote:
>
>> Wei Xu <weixugc@google.com> writes:

[snip]

>> >
>> > Th "memory type" and "abstract distance" concepts sound to me similar
>> > to the memory tier "rank" idea.
>>
>> Yes.  "abstract distance" is similar as "rank".
>>
>> > We can have some well-defined type/distance/rank values, e.g. HBM,
>> > DRAM, CXL_DRAM, PMEM, CXL_PMEM, which a device can register with. The
>> > memory tiers will build from these values. It can be configurable to
>> > whether/how to collapse several values into a single tier.
>>
>> The memory types are registered by drivers (such as kmem_dax).  And the
>> distances can come from SLIT, HMAT, and other firmware or driver
>> specific information sources.
>>
>> Per my understanding, this solution may make memory tier IDs more
>> unstable.  For example, the memory ID of a node may be changed after the
>> user override the distance of a memory type.  Although I think the
>> overriding should be a rare operations, will it be a real issue for your
>> use cases?
>
> Not sure how common it is, but I'm aware of systems that have dynamic
> access characteristics.  i.e. the bandwidth and latency of a access
> to a given memory device will change dynamically at runtime (typically
> due to something like hardware degradation / power saving etc).  Potentially
> leading to memory in use needing to move in 'demotion order'.  We could
> handle that with a per device tier and rank that changes...
>
> Just thought I'd throw that out there to add to the complexity ;)
> I don't consider it important to support initially but just wanted to
> point out this will only get more complex over time.
>

Thanks for your information!  If we make the mapping from abstract
distance ranges to memory tier IDs stable to some degree, the memory
tier IDs can also be stable to some degree, e.g.,

  abstract distance range    memory tier ID
            1-100                  0
          101-200                  1
          201-300                  2
          301-400                  3
          401-500                  4
          500-                     5

Then, if the abstract distance of a memory device changes at run time,
its memory tier ID will change, but the memory tier IDs of the other
memory devices can stay unchanged.  If so, the memory tier IDs become
unstable mainly when we change the mapping from abstract distance
ranges to memory tier IDs.

Best Regards,
Huang, Ying
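The fixed mapping in the table above can be written as a small helper; this is only a sketch of that example mapping, not code from the series.

/* Map an abstract distance to a memory tier ID per the example table:
 * 1-100 -> 0, 101-200 -> 1, ..., 401-500 -> 4, anything above 500 -> 5. */
static int example_tier_id(int abstract_distance)
{
	int tier;

	if (abstract_distance < 1)
		abstract_distance = 1;

	tier = (abstract_distance - 1) / 100;
	return tier > 5 ? 5 : tier;
}

Because the range boundaries are fixed, a runtime change in one device's abstract distance only moves that device's tier ID; devices whose distances stay inside their range keep the same ID, which is the stability property described above.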