Message ID: 20220603134237.131362-2-aneesh.kumar@linux.ibm.com (mailing list archive)
State: New
Series: mm/demotion: Memory tiers and demotion
On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote: > > > The nodes which are part of a specific memory tier can be listed > via > /sys/devices/system/memtier/memtierN/nodelist > > "Rank" is an opaque value. Its absolute value doesn't have any > special meaning. But the rank values of different memtiers can be > compared with each other to determine the memory tier order. > > For example, if we have 3 memtiers: memtier0, memtier1, memiter2, and > their rank values are 300, 200, 100, then the memory tier order is: > memtier0 -> memtier2 -> memtier1, Why is memtier2 (rank 100) higher than memtier1 (rank 200)? Seems like the order should be memtier0 -> memtier1 -> memtier2? (rank 300) (rank 200) (rank 100) > where memtier0 is the highest tier > and memtier1 is the lowest tier. I think memtier2 is the lowest as it has the lowest rank value. > > The rank value of each memtier should be unique. > > > + > +static void memory_tier_device_release(struct device *dev) > +{ > + struct memory_tier *tier = to_memory_tier(dev); > + Do we need some ref counts on memory_tier? If there is another device still using the same memtier, free below could cause problem. > + kfree(tier); > +} > + > ... > +static struct memory_tier *register_memory_tier(unsigned int tier) > +{ > + int error; > + struct memory_tier *memtier; > + > + if (tier >= MAX_MEMORY_TIERS) > + return NULL; > + > + memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL); > + if (!memtier) > + return NULL; > + > + memtier->dev.id = tier; > + memtier->rank = get_rank_from_tier(tier); > + memtier->dev.bus = &memory_tier_subsys; > + memtier->dev.release = memory_tier_device_release; > + memtier->dev.groups = memory_tier_dev_groups; > + Should you take the mem_tier_lock before you insert to memtier-list? > + insert_memory_tier(memtier); > + > + error = device_register(&memtier->dev); > + if (error) { > + list_del(&memtier->list); > + put_device(&memtier->dev); > + return NULL; > + } > + return memtier; > +} > + > +__maybe_unused // temporay to prevent warnings during bisects > +static void unregister_memory_tier(struct memory_tier *memtier) > +{ I think we should take mem_tier_lock before modifying memtier->list. > + list_del(&memtier->list); > + device_unregister(&memtier->dev); > +} > + > Thanks. Tim
On Tue, Jun 7, 2022 at 11:43 AM Tim Chen <tim.c.chen@linux.intel.com> wrote: > > On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote: > > > > > > The nodes which are part of a specific memory tier can be listed > > via > > /sys/devices/system/memtier/memtierN/nodelist > > > > "Rank" is an opaque value. Its absolute value doesn't have any > > special meaning. But the rank values of different memtiers can be > > compared with each other to determine the memory tier order. > > > > For example, if we have 3 memtiers: memtier0, memtier1, memiter2, and > > their rank values are 300, 200, 100, then the memory tier order is: > > memtier0 -> memtier2 -> memtier1, > > Why is memtier2 (rank 100) higher than memtier1 (rank 200)? Seems like > the order should be memtier0 -> memtier1 -> memtier2? > (rank 300) (rank 200) (rank 100) I think this is a copy-and-modify typo from my original memory tiering kernel interface RFC (v4, https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com/T/): where the rank values are 100, 10, 50 (i.e the rank of memtier2 is higher than memtier1). > > where memtier0 is the highest tier > > and memtier1 is the lowest tier. > > I think memtier2 is the lowest as it has the lowest rank value. > > > > The rank value of each memtier should be unique. > > > > > > + > > +static void memory_tier_device_release(struct device *dev) > > +{ > > + struct memory_tier *tier = to_memory_tier(dev); > > + > > Do we need some ref counts on memory_tier? > If there is another device still using the same memtier, > free below could cause problem. > > > + kfree(tier); > > +} > > + > > > ... > > +static struct memory_tier *register_memory_tier(unsigned int tier) > > +{ > > + int error; > > + struct memory_tier *memtier; > > + > > + if (tier >= MAX_MEMORY_TIERS) > > + return NULL; > > + > > + memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL); > > + if (!memtier) > > + return NULL; > > + > > + memtier->dev.id = tier; > > + memtier->rank = get_rank_from_tier(tier); > > + memtier->dev.bus = &memory_tier_subsys; > > + memtier->dev.release = memory_tier_device_release; > > + memtier->dev.groups = memory_tier_dev_groups; > > + > > Should you take the mem_tier_lock before you insert to > memtier-list? > > > + insert_memory_tier(memtier); > > + > > + error = device_register(&memtier->dev); > > + if (error) { > > + list_del(&memtier->list); > > + put_device(&memtier->dev); > > + return NULL; > > + } > > + return memtier; > > +} > > + > > +__maybe_unused // temporay to prevent warnings during bisects > > +static void unregister_memory_tier(struct memory_tier *memtier) > > +{ > > I think we should take mem_tier_lock before modifying memtier->list. > > > + list_del(&memtier->list); > > + device_unregister(&memtier->dev); > > +} > > + > > > > Thanks. > > Tim > >
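To make the corrected ordering concrete: the demotion order is simply the tiers sorted by descending rank. A minimal userspace sketch, illustrative only (the names and rank values are taken from the cover-letter example, with the typo fixed; this is not kernel code):

#include <stdio.h>
#include <stdlib.h>

struct tier { const char *name; int rank; };

static int by_rank_desc(const void *a, const void *b)
{
	/* higher rank sorts first */
	return ((const struct tier *)b)->rank - ((const struct tier *)a)->rank;
}

int main(void)
{
	struct tier tiers[] = {
		{ "memtier0", 300 }, { "memtier1", 200 }, { "memtier2", 100 },
	};
	int i;

	qsort(tiers, sizeof(tiers) / sizeof(tiers[0]), sizeof(tiers[0]),
	      by_rank_desc);

	/* prints: memtier0 -> memtier1 -> memtier2 */
	for (i = 0; i < 3; i++)
		printf("%s%s", tiers[i].name, i < 2 ? " -> " : "\n");
	return 0;
}

Compiled and run, this prints memtier0 -> memtier1 -> memtier2, i.e. highest rank first, matching what Tim expected.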
On Fri, Jun 3, 2022 at 6:43 AM Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> wrote: > > In the current kernel, memory tiers are defined implicitly via a > demotion path relationship between NUMA nodes, which is created > during the kernel initialization and updated when a NUMA node is > hot-added or hot-removed. The current implementation puts all > nodes with CPU into the top tier, and builds the tier hierarchy > tier-by-tier by establishing the per-node demotion targets based > on the distances between nodes. > > This current memory tier kernel interface needs to be improved for > several important use cases, > > The current tier initialization code always initializes > each memory-only NUMA node into a lower tier. But a memory-only > NUMA node may have a high performance memory device (e.g. a DRAM > device attached via CXL.mem or a DRAM-backed memory-only node on > a virtual machine) and should be put into a higher tier. > > The current tier hierarchy always puts CPU nodes into the top > tier. But on a system with HBM or GPU devices, the > memory-only NUMA nodes mapping these devices should be in the > top tier, and DRAM nodes with CPUs are better to be placed into the > next lower tier. > > With current kernel higher tier node can only be demoted to selected nodes on the > next lower tier as defined by the demotion path, not any other > node from any lower tier. This strict, hard-coded demotion order > does not work in all use cases (e.g. some use cases may want to > allow cross-socket demotion to another node in the same demotion > tier as a fallback when the preferred demotion node is out of > space), This demotion order is also inconsistent with the page > allocation fallback order when all the nodes in a higher tier are > out of space: The page allocation can fall back to any node from > any lower tier, whereas the demotion order doesn't allow that. > > The current kernel also don't provide any interfaces for the > userspace to learn about the memory tier hierarchy in order to > optimize its memory allocations. > > This patch series address the above by defining memory tiers explicitly. > > This patch introduce explicity memory tiers with ranks. The rank > value of a memory tier is used to derive the demotion order between > NUMA nodes. The memory tiers present in a system can be found at > > /sys/devices/system/memtier/memtierN/ > > The nodes which are part of a specific memory tier can be listed > via > /sys/devices/system/memtier/memtierN/nodelist > > "Rank" is an opaque value. Its absolute value doesn't have any > special meaning. But the rank values of different memtiers can be > compared with each other to determine the memory tier order. > > For example, if we have 3 memtiers: memtier0, memtier1, memiter2, and > their rank values are 300, 200, 100, then the memory tier order is: > memtier0 -> memtier2 -> memtier1, where memtier0 is the highest tier > and memtier1 is the lowest tier. > > The rank value of each memtier should be unique. > > A higher rank memory tier will appear first in the demotion order > than a lower rank memory tier. ie. while reclaim we choose a node > in higher rank memory tier to demote pages to as compared to a node > in a lower rank memory tier. > > For now we are not adding the dynamic number of memory tiers. > But a future series supporting that is possible. Currently > number of tiers supported is limitted to MAX_MEMORY_TIERS(3). > When doing memory hotplug, if not added to a memory tier, the NUMA > node gets added to DEFAULT_MEMORY_TIER(1). 
> > This patch is based on the proposal sent by Wei Xu <weixugc@google.com> at [1]. > > [1] https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com > > Suggested-by: Wei Xu <weixugc@google.com> > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> > --- > include/linux/memory-tiers.h | 20 ++++ > mm/Kconfig | 11 ++ > mm/Makefile | 1 + > mm/memory-tiers.c | 188 +++++++++++++++++++++++++++++++++++ > 4 files changed, 220 insertions(+) > create mode 100644 include/linux/memory-tiers.h > create mode 100644 mm/memory-tiers.c > > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h > new file mode 100644 > index 000000000000..e17f6b4ee177 > --- /dev/null > +++ b/include/linux/memory-tiers.h > @@ -0,0 +1,20 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +#ifndef _LINUX_MEMORY_TIERS_H > +#define _LINUX_MEMORY_TIERS_H > + > +#ifdef CONFIG_TIERED_MEMORY > + > +#define MEMORY_TIER_HBM_GPU 0 > +#define MEMORY_TIER_DRAM 1 > +#define MEMORY_TIER_PMEM 2 > + > +#define MEMORY_RANK_HBM_GPU 300 > +#define MEMORY_RANK_DRAM 200 > +#define MEMORY_RANK_PMEM 100 > + > +#define DEFAULT_MEMORY_TIER MEMORY_TIER_DRAM > +#define MAX_MEMORY_TIERS 3 > + > +#endif /* CONFIG_TIERED_MEMORY */ > + > +#endif > diff --git a/mm/Kconfig b/mm/Kconfig > index 169e64192e48..08a3d330740b 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -614,6 +614,17 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION > config ARCH_ENABLE_THP_MIGRATION > bool > > +config TIERED_MEMORY > + bool "Support for explicit memory tiers" > + def_bool n > + depends on MIGRATION && NUMA > + help > + Support to split nodes into memory tiers explicitly and > + to demote pages on reclaim to lower tiers. This option > + also exposes sysfs interface to read nodes available in > + specific tier and to move specific node among different > + possible tiers. IMHO we should not need a new kernel config. If tiering is not present then there is just one tier on the system. And tiering is a kind of hardware configuration, the information could be shown regardless of whether demotion/promotion is supported/enabled or not. 
> + > config HUGETLB_PAGE_SIZE_VARIABLE > def_bool n > help > diff --git a/mm/Makefile b/mm/Makefile > index 6f9ffa968a1a..482557fbc9d1 100644 > --- a/mm/Makefile > +++ b/mm/Makefile > @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/ > obj-$(CONFIG_FAILSLAB) += failslab.o > obj-$(CONFIG_MEMTEST) += memtest.o > obj-$(CONFIG_MIGRATION) += migrate.o > +obj-$(CONFIG_TIERED_MEMORY) += memory-tiers.o > obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o > obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o > obj-$(CONFIG_PAGE_COUNTER) += page_counter.o > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c > new file mode 100644 > index 000000000000..7de18d94a08d > --- /dev/null > +++ b/mm/memory-tiers.c > @@ -0,0 +1,188 @@ > +// SPDX-License-Identifier: GPL-2.0 > +#include <linux/types.h> > +#include <linux/device.h> > +#include <linux/nodemask.h> > +#include <linux/slab.h> > +#include <linux/memory-tiers.h> > + > +struct memory_tier { > + struct list_head list; > + struct device dev; > + nodemask_t nodelist; > + int rank; > +}; > + > +#define to_memory_tier(device) container_of(device, struct memory_tier, dev) > + > +static struct bus_type memory_tier_subsys = { > + .name = "memtier", > + .dev_name = "memtier", > +}; > + > +static DEFINE_MUTEX(memory_tier_lock); > +static LIST_HEAD(memory_tiers); > + > + > +static ssize_t nodelist_show(struct device *dev, > + struct device_attribute *attr, char *buf) > +{ > + struct memory_tier *memtier = to_memory_tier(dev); > + > + return sysfs_emit(buf, "%*pbl\n", > + nodemask_pr_args(&memtier->nodelist)); > +} > +static DEVICE_ATTR_RO(nodelist); > + > +static ssize_t rank_show(struct device *dev, > + struct device_attribute *attr, char *buf) > +{ > + struct memory_tier *memtier = to_memory_tier(dev); > + > + return sysfs_emit(buf, "%d\n", memtier->rank); > +} > +static DEVICE_ATTR_RO(rank); > + > +static struct attribute *memory_tier_dev_attrs[] = { > + &dev_attr_nodelist.attr, > + &dev_attr_rank.attr, > + NULL > +}; > + > +static const struct attribute_group memory_tier_dev_group = { > + .attrs = memory_tier_dev_attrs, > +}; > + > +static const struct attribute_group *memory_tier_dev_groups[] = { > + &memory_tier_dev_group, > + NULL > +}; > + > +static void memory_tier_device_release(struct device *dev) > +{ > + struct memory_tier *tier = to_memory_tier(dev); > + > + kfree(tier); > +} > + > +/* > + * Keep it simple by having direct mapping between > + * tier index and rank value. 
> + */ > +static inline int get_rank_from_tier(unsigned int tier) > +{ > + switch (tier) { > + case MEMORY_TIER_HBM_GPU: > + return MEMORY_RANK_HBM_GPU; > + case MEMORY_TIER_DRAM: > + return MEMORY_RANK_DRAM; > + case MEMORY_TIER_PMEM: > + return MEMORY_RANK_PMEM; > + } > + > + return 0; > +} > + > +static void insert_memory_tier(struct memory_tier *memtier) > +{ > + struct list_head *ent; > + struct memory_tier *tmp_memtier; > + > + list_for_each(ent, &memory_tiers) { > + tmp_memtier = list_entry(ent, struct memory_tier, list); > + if (tmp_memtier->rank < memtier->rank) { > + list_add_tail(&memtier->list, ent); > + return; > + } > + } > + list_add_tail(&memtier->list, &memory_tiers); > +} > + > +static struct memory_tier *register_memory_tier(unsigned int tier) > +{ > + int error; > + struct memory_tier *memtier; > + > + if (tier >= MAX_MEMORY_TIERS) > + return NULL; > + > + memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL); > + if (!memtier) > + return NULL; > + > + memtier->dev.id = tier; > + memtier->rank = get_rank_from_tier(tier); > + memtier->dev.bus = &memory_tier_subsys; > + memtier->dev.release = memory_tier_device_release; > + memtier->dev.groups = memory_tier_dev_groups; > + > + insert_memory_tier(memtier); > + > + error = device_register(&memtier->dev); > + if (error) { > + list_del(&memtier->list); > + put_device(&memtier->dev); > + return NULL; > + } > + return memtier; > +} > + > +__maybe_unused // temporay to prevent warnings during bisects > +static void unregister_memory_tier(struct memory_tier *memtier) > +{ > + list_del(&memtier->list); > + device_unregister(&memtier->dev); > +} > + > +static ssize_t > +max_tier_show(struct device *dev, struct device_attribute *attr, char *buf) > +{ > + return sysfs_emit(buf, "%d\n", MAX_MEMORY_TIERS); > +} > +static DEVICE_ATTR_RO(max_tier); > + > +static ssize_t > +default_tier_show(struct device *dev, struct device_attribute *attr, char *buf) > +{ > + return sysfs_emit(buf, "memtier%d\n", DEFAULT_MEMORY_TIER); > +} > +static DEVICE_ATTR_RO(default_tier); > + > +static struct attribute *memory_tier_attrs[] = { > + &dev_attr_max_tier.attr, > + &dev_attr_default_tier.attr, > + NULL > +}; > + > +static const struct attribute_group memory_tier_attr_group = { > + .attrs = memory_tier_attrs, > +}; > + > +static const struct attribute_group *memory_tier_attr_groups[] = { > + &memory_tier_attr_group, > + NULL, > +}; > + > +static int __init memory_tier_init(void) > +{ > + int ret; > + struct memory_tier *memtier; > + > + ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups); > + if (ret) > + panic("%s() failed to register subsystem: %d\n", __func__, ret); > + > + /* > + * Register only default memory tier to hide all empty > + * memory tier from sysfs. > + */ > + memtier = register_memory_tier(DEFAULT_MEMORY_TIER); > + if (!memtier) > + panic("%s() failed to register memory tier: %d\n", __func__, ret); > + > + /* CPU only nodes are not part of memory tiers. */ > + memtier->nodelist = node_states[N_MEMORY]; > + > + return 0; > +} > +subsys_initcall(memory_tier_init); > + > -- > 2.36.1 >
On Tue, 2022-06-07 at 14:32 -0700, Yang Shi wrote: > On Fri, Jun 3, 2022 at 6:43 AM Aneesh Kumar K.V > <aneesh.kumar@linux.ibm.com> wrote: > > > > In the current kernel, memory tiers are defined implicitly via a > > demotion path relationship between NUMA nodes, which is created > > during the kernel initialization and updated when a NUMA node is > > hot-added or hot-removed. The current implementation puts all > > nodes with CPU into the top tier, and builds the tier hierarchy > > tier-by-tier by establishing the per-node demotion targets based > > on the distances between nodes. > > > > This current memory tier kernel interface needs to be improved for > > several important use cases, > > > > The current tier initialization code always initializes > > each memory-only NUMA node into a lower tier. But a memory-only > > NUMA node may have a high performance memory device (e.g. a DRAM > > device attached via CXL.mem or a DRAM-backed memory-only node on > > a virtual machine) and should be put into a higher tier. > > > > The current tier hierarchy always puts CPU nodes into the top > > tier. But on a system with HBM or GPU devices, the > > memory-only NUMA nodes mapping these devices should be in the > > top tier, and DRAM nodes with CPUs are better to be placed into the > > next lower tier. > > > > With current kernel higher tier node can only be demoted to selected nodes on the > > next lower tier as defined by the demotion path, not any other > > node from any lower tier. This strict, hard-coded demotion order > > does not work in all use cases (e.g. some use cases may want to > > allow cross-socket demotion to another node in the same demotion > > tier as a fallback when the preferred demotion node is out of > > space), This demotion order is also inconsistent with the page > > allocation fallback order when all the nodes in a higher tier are > > out of space: The page allocation can fall back to any node from > > any lower tier, whereas the demotion order doesn't allow that. > > > > The current kernel also don't provide any interfaces for the > > userspace to learn about the memory tier hierarchy in order to > > optimize its memory allocations. > > > > This patch series address the above by defining memory tiers explicitly. > > > > This patch introduce explicity memory tiers with ranks. The rank > > value of a memory tier is used to derive the demotion order between > > NUMA nodes. The memory tiers present in a system can be found at > > > > /sys/devices/system/memtier/memtierN/ > > > > The nodes which are part of a specific memory tier can be listed > > via > > /sys/devices/system/memtier/memtierN/nodelist > > > > "Rank" is an opaque value. Its absolute value doesn't have any > > special meaning. But the rank values of different memtiers can be > > compared with each other to determine the memory tier order. > > > > For example, if we have 3 memtiers: memtier0, memtier1, memiter2, and > > their rank values are 300, 200, 100, then the memory tier order is: > > memtier0 -> memtier2 -> memtier1, where memtier0 is the highest tier > > and memtier1 is the lowest tier. > > > > The rank value of each memtier should be unique. > > > > A higher rank memory tier will appear first in the demotion order > > than a lower rank memory tier. ie. while reclaim we choose a node > > in higher rank memory tier to demote pages to as compared to a node > > in a lower rank memory tier. > > > > For now we are not adding the dynamic number of memory tiers. > > But a future series supporting that is possible. 
Currently > > number of tiers supported is limitted to MAX_MEMORY_TIERS(3). > > When doing memory hotplug, if not added to a memory tier, the NUMA > > node gets added to DEFAULT_MEMORY_TIER(1). > > > > This patch is based on the proposal sent by Wei Xu <weixugc@google.com> at [1]. > > > > [1] https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com > > > > Suggested-by: Wei Xu <weixugc@google.com> > > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com> > > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> > > --- > > include/linux/memory-tiers.h | 20 ++++ > > mm/Kconfig | 11 ++ > > mm/Makefile | 1 + > > mm/memory-tiers.c | 188 +++++++++++++++++++++++++++++++++++ > > 4 files changed, 220 insertions(+) > > create mode 100644 include/linux/memory-tiers.h > > create mode 100644 mm/memory-tiers.c > > > > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h > > new file mode 100644 > > index 000000000000..e17f6b4ee177 > > --- /dev/null > > +++ b/include/linux/memory-tiers.h > > @@ -0,0 +1,20 @@ > > +/* SPDX-License-Identifier: GPL-2.0 */ > > +#ifndef _LINUX_MEMORY_TIERS_H > > +#define _LINUX_MEMORY_TIERS_H > > + > > +#ifdef CONFIG_TIERED_MEMORY > > + > > +#define MEMORY_TIER_HBM_GPU 0 > > +#define MEMORY_TIER_DRAM 1 > > +#define MEMORY_TIER_PMEM 2 > > + > > +#define MEMORY_RANK_HBM_GPU 300 > > +#define MEMORY_RANK_DRAM 200 > > +#define MEMORY_RANK_PMEM 100 > > + > > +#define DEFAULT_MEMORY_TIER MEMORY_TIER_DRAM > > +#define MAX_MEMORY_TIERS 3 > > + > > +#endif /* CONFIG_TIERED_MEMORY */ > > + > > +#endif > > diff --git a/mm/Kconfig b/mm/Kconfig > > index 169e64192e48..08a3d330740b 100644 > > --- a/mm/Kconfig > > +++ b/mm/Kconfig > > @@ -614,6 +614,17 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION > > config ARCH_ENABLE_THP_MIGRATION > > bool > > > > +config TIERED_MEMORY > > + bool "Support for explicit memory tiers" > > + def_bool n > > + depends on MIGRATION && NUMA > > + help > > + Support to split nodes into memory tiers explicitly and > > + to demote pages on reclaim to lower tiers. This option > > + also exposes sysfs interface to read nodes available in > > + specific tier and to move specific node among different > > + possible tiers. > > IMHO we should not need a new kernel config. If tiering is not present > then there is just one tier on the system. And tiering is a kind of > hardware configuration, the information could be shown regardless of > whether demotion/promotion is supported/enabled or not. I think so too. At least it appears unnecessary to let the user turn on/off it at configuration time. All the code should be enclosed by #if defined(CONFIG_NUMA) && defined(CONIFIG_MIGRATION). So we will not waste memory in small systems. 
Best Regards, Huang, Ying > > + > > config HUGETLB_PAGE_SIZE_VARIABLE > > def_bool n > > help > > diff --git a/mm/Makefile b/mm/Makefile > > index 6f9ffa968a1a..482557fbc9d1 100644 > > --- a/mm/Makefile > > +++ b/mm/Makefile > > @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/ > > obj-$(CONFIG_FAILSLAB) += failslab.o > > obj-$(CONFIG_MEMTEST) += memtest.o > > obj-$(CONFIG_MIGRATION) += migrate.o > > +obj-$(CONFIG_TIERED_MEMORY) += memory-tiers.o > > obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o > > obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o > > obj-$(CONFIG_PAGE_COUNTER) += page_counter.o > > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c > > new file mode 100644 > > index 000000000000..7de18d94a08d > > --- /dev/null > > +++ b/mm/memory-tiers.c > > @@ -0,0 +1,188 @@ > > +// SPDX-License-Identifier: GPL-2.0 > > +#include <linux/types.h> > > +#include <linux/device.h> > > +#include <linux/nodemask.h> > > +#include <linux/slab.h> > > +#include <linux/memory-tiers.h> > > + > > +struct memory_tier { > > + struct list_head list; > > + struct device dev; > > + nodemask_t nodelist; > > + int rank; > > +}; > > + > > +#define to_memory_tier(device) container_of(device, struct memory_tier, dev) > > + > > +static struct bus_type memory_tier_subsys = { > > + .name = "memtier", > > + .dev_name = "memtier", > > +}; > > + > > +static DEFINE_MUTEX(memory_tier_lock); > > +static LIST_HEAD(memory_tiers); > > + > > + > > +static ssize_t nodelist_show(struct device *dev, > > + struct device_attribute *attr, char *buf) > > +{ > > + struct memory_tier *memtier = to_memory_tier(dev); > > + > > + return sysfs_emit(buf, "%*pbl\n", > > + nodemask_pr_args(&memtier->nodelist)); > > +} > > +static DEVICE_ATTR_RO(nodelist); > > + > > +static ssize_t rank_show(struct device *dev, > > + struct device_attribute *attr, char *buf) > > +{ > > + struct memory_tier *memtier = to_memory_tier(dev); > > + > > + return sysfs_emit(buf, "%d\n", memtier->rank); > > +} > > +static DEVICE_ATTR_RO(rank); > > + > > +static struct attribute *memory_tier_dev_attrs[] = { > > + &dev_attr_nodelist.attr, > > + &dev_attr_rank.attr, > > + NULL > > +}; > > + > > +static const struct attribute_group memory_tier_dev_group = { > > + .attrs = memory_tier_dev_attrs, > > +}; > > + > > +static const struct attribute_group *memory_tier_dev_groups[] = { > > + &memory_tier_dev_group, > > + NULL > > +}; > > + > > +static void memory_tier_device_release(struct device *dev) > > +{ > > + struct memory_tier *tier = to_memory_tier(dev); > > + > > + kfree(tier); > > +} > > + > > +/* > > + * Keep it simple by having direct mapping between > > + * tier index and rank value. 
> > + */ > > +static inline int get_rank_from_tier(unsigned int tier) > > +{ > > + switch (tier) { > > + case MEMORY_TIER_HBM_GPU: > > + return MEMORY_RANK_HBM_GPU; > > + case MEMORY_TIER_DRAM: > > + return MEMORY_RANK_DRAM; > > + case MEMORY_TIER_PMEM: > > + return MEMORY_RANK_PMEM; > > + } > > + > > + return 0; > > +} > > + > > +static void insert_memory_tier(struct memory_tier *memtier) > > +{ > > + struct list_head *ent; > > + struct memory_tier *tmp_memtier; > > + > > + list_for_each(ent, &memory_tiers) { > > + tmp_memtier = list_entry(ent, struct memory_tier, list); > > + if (tmp_memtier->rank < memtier->rank) { > > + list_add_tail(&memtier->list, ent); > > + return; > > + } > > + } > > + list_add_tail(&memtier->list, &memory_tiers); > > +} > > + > > +static struct memory_tier *register_memory_tier(unsigned int tier) > > +{ > > + int error; > > + struct memory_tier *memtier; > > + > > + if (tier >= MAX_MEMORY_TIERS) > > + return NULL; > > + > > + memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL); > > + if (!memtier) > > + return NULL; > > + > > + memtier->dev.id = tier; > > + memtier->rank = get_rank_from_tier(tier); > > + memtier->dev.bus = &memory_tier_subsys; > > + memtier->dev.release = memory_tier_device_release; > > + memtier->dev.groups = memory_tier_dev_groups; > > + > > + insert_memory_tier(memtier); > > + > > + error = device_register(&memtier->dev); > > + if (error) { > > + list_del(&memtier->list); > > + put_device(&memtier->dev); > > + return NULL; > > + } > > + return memtier; > > +} > > + > > +__maybe_unused // temporay to prevent warnings during bisects > > +static void unregister_memory_tier(struct memory_tier *memtier) > > +{ > > + list_del(&memtier->list); > > + device_unregister(&memtier->dev); > > +} > > + > > +static ssize_t > > +max_tier_show(struct device *dev, struct device_attribute *attr, char *buf) > > +{ > > + return sysfs_emit(buf, "%d\n", MAX_MEMORY_TIERS); > > +} > > +static DEVICE_ATTR_RO(max_tier); > > + > > +static ssize_t > > +default_tier_show(struct device *dev, struct device_attribute *attr, char *buf) > > +{ > > + return sysfs_emit(buf, "memtier%d\n", DEFAULT_MEMORY_TIER); > > +} > > +static DEVICE_ATTR_RO(default_tier); > > + > > +static struct attribute *memory_tier_attrs[] = { > > + &dev_attr_max_tier.attr, > > + &dev_attr_default_tier.attr, > > + NULL > > +}; > > + > > +static const struct attribute_group memory_tier_attr_group = { > > + .attrs = memory_tier_attrs, > > +}; > > + > > +static const struct attribute_group *memory_tier_attr_groups[] = { > > + &memory_tier_attr_group, > > + NULL, > > +}; > > + > > +static int __init memory_tier_init(void) > > +{ > > + int ret; > > + struct memory_tier *memtier; > > + > > + ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups); > > + if (ret) > > + panic("%s() failed to register subsystem: %d\n", __func__, ret); > > + > > + /* > > + * Register only default memory tier to hide all empty > > + * memory tier from sysfs. > > + */ > > + memtier = register_memory_tier(DEFAULT_MEMORY_TIER); > > + if (!memtier) > > + panic("%s() failed to register memory tier: %d\n", __func__, ret); > > + > > + /* CPU only nodes are not part of memory tiers. */ > > + memtier->nodelist = node_states[N_MEMORY]; > > + > > + return 0; > > +} > > +subsys_initcall(memory_tier_init); > > + > > -- > > 2.36.1 > >
On 6/8/22 12:13 AM, Tim Chen wrote: > On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote: >> >> >> The nodes which are part of a specific memory tier can be listed >> via >> /sys/devices/system/memtier/memtierN/nodelist >> >> "Rank" is an opaque value. Its absolute value doesn't have any >> special meaning. But the rank values of different memtiers can be >> compared with each other to determine the memory tier order. >> >> For example, if we have 3 memtiers: memtier0, memtier1, memiter2, and >> their rank values are 300, 200, 100, then the memory tier order is: >> memtier0 -> memtier2 -> memtier1, > > Why is memtier2 (rank 100) higher than memtier1 (rank 200)? Seems like > the order should be memtier0 -> memtier1 -> memtier2? > (rank 300) (rank 200) (rank 100) > >> where memtier0 is the highest tier >> and memtier1 is the lowest tier. > > I think memtier2 is the lowest as it has the lowest rank value. typo error. Will fix that in the next update >> >> The rank value of each memtier should be unique. >> >> >> + >> +static void memory_tier_device_release(struct device *dev) >> +{ >> + struct memory_tier *tier = to_memory_tier(dev); >> + > > Do we need some ref counts on memory_tier? > If there is another device still using the same memtier, > free below could cause problem. > >> + kfree(tier); >> +} >> + >> > ... >> +static struct memory_tier *register_memory_tier(unsigned int tier) >> +{ >> + int error; >> + struct memory_tier *memtier; >> + >> + if (tier >= MAX_MEMORY_TIERS) >> + return NULL; >> + >> + memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL); >> + if (!memtier) >> + return NULL; >> + >> + memtier->dev.id = tier; >> + memtier->rank = get_rank_from_tier(tier); >> + memtier->dev.bus = &memory_tier_subsys; >> + memtier->dev.release = memory_tier_device_release; >> + memtier->dev.groups = memory_tier_dev_groups; >> + > > Should you take the mem_tier_lock before you insert to > memtier-list? Both register_memory_tier and unregister_memory_tier get called with memory_tier_lock held. > >> + insert_memory_tier(memtier); >> + >> + error = device_register(&memtier->dev); >> + if (error) { >> + list_del(&memtier->list); >> + put_device(&memtier->dev); >> + return NULL; >> + } >> + return memtier; >> +} >> + >> +__maybe_unused // temporay to prevent warnings during bisects >> +static void unregister_memory_tier(struct memory_tier *memtier) >> +{ > > I think we should take mem_tier_lock before modifying memtier->list. > unregister_memory_tier get called with memory_tier_lock held. >> + list_del(&memtier->list); >> + device_unregister(&memtier->dev); >> +} >> + >> -aneesh
On 6/8/22 12:13 AM, Tim Chen wrote: ... >> >> + >> +static void memory_tier_device_release(struct device *dev) >> +{ >> + struct memory_tier *tier = to_memory_tier(dev); >> + > > Do we need some ref counts on memory_tier? > If there is another device still using the same memtier, > free below could cause problem. > >> + kfree(tier); >> +} >> + >> > ... The lifecycle of the memory_tier struct is tied to the sysfs device life time. ie, memory_tier_device_relese get called only after the last reference on that sysfs dev object is released. Hence we can be sure there is no userspace that is keeping one of the memtier related sysfs file open. W.r.t other memory device sharing the same memtier, we unregister the sysfs device only when the memory tier nodelist is empty. That is no memory device is present in this memory tier. -aneesh
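If some other kernel code does end up caching a struct memory_tier pointer, the refcount of the embedded device is the natural way to pin it, since kfree() only runs from memory_tier_device_release() after the last reference is dropped. A minimal sketch (not part of the posted patch; the helper names are made up here):

static struct memory_tier *memtier_get(struct memory_tier *memtier)
{
	/* pin the embedded device so ->release() cannot run underneath us */
	if (memtier)
		get_device(&memtier->dev);
	return memtier;
}

static void memtier_put(struct memory_tier *memtier)
{
	/* drop the reference; kfree() happens in memory_tier_device_release() */
	if (memtier)
		put_device(&memtier->dev);
}

With helpers like these, a future in-kernel user that holds a memtier pointer outside memory_tier_lock would not race with the device being freed.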
On 6/8/22 3:02 AM, Yang Shi wrote: > On Fri, Jun 3, 2022 at 6:43 AM Aneesh Kumar K.V > <aneesh.kumar@linux.ibm.com> wrote: >> >> In the current kernel, memory tiers are defined implicitly via a >> demotion path relationship between NUMA nodes, which is created >> during the kernel initialization and updated when a NUMA node is >> hot-added or hot-removed. The current implementation puts all >> nodes with CPU into the top tier, and builds the tier hierarchy >> tier-by-tier by establishing the per-node demotion targets based >> on the distances between nodes. >> >> This current memory tier kernel interface needs to be improved for >> several important use cases, >> >> The current tier initialization code always initializes >> each memory-only NUMA node into a lower tier. But a memory-only >> NUMA node may have a high performance memory device (e.g. a DRAM >> device attached via CXL.mem or a DRAM-backed memory-only node on >> a virtual machine) and should be put into a higher tier. >> >> The current tier hierarchy always puts CPU nodes into the top >> tier. But on a system with HBM or GPU devices, the >> memory-only NUMA nodes mapping these devices should be in the >> top tier, and DRAM nodes with CPUs are better to be placed into the >> next lower tier. >> >> With current kernel higher tier node can only be demoted to selected nodes on the >> next lower tier as defined by the demotion path, not any other >> node from any lower tier. This strict, hard-coded demotion order >> does not work in all use cases (e.g. some use cases may want to >> allow cross-socket demotion to another node in the same demotion >> tier as a fallback when the preferred demotion node is out of >> space), This demotion order is also inconsistent with the page >> allocation fallback order when all the nodes in a higher tier are >> out of space: The page allocation can fall back to any node from >> any lower tier, whereas the demotion order doesn't allow that. >> >> The current kernel also don't provide any interfaces for the >> userspace to learn about the memory tier hierarchy in order to >> optimize its memory allocations. >> >> This patch series address the above by defining memory tiers explicitly. >> >> This patch introduce explicity memory tiers with ranks. The rank >> value of a memory tier is used to derive the demotion order between >> NUMA nodes. The memory tiers present in a system can be found at >> >> /sys/devices/system/memtier/memtierN/ >> >> The nodes which are part of a specific memory tier can be listed >> via >> /sys/devices/system/memtier/memtierN/nodelist >> >> "Rank" is an opaque value. Its absolute value doesn't have any >> special meaning. But the rank values of different memtiers can be >> compared with each other to determine the memory tier order. >> >> For example, if we have 3 memtiers: memtier0, memtier1, memiter2, and >> their rank values are 300, 200, 100, then the memory tier order is: >> memtier0 -> memtier2 -> memtier1, where memtier0 is the highest tier >> and memtier1 is the lowest tier. >> >> The rank value of each memtier should be unique. >> >> A higher rank memory tier will appear first in the demotion order >> than a lower rank memory tier. ie. while reclaim we choose a node >> in higher rank memory tier to demote pages to as compared to a node >> in a lower rank memory tier. >> >> For now we are not adding the dynamic number of memory tiers. >> But a future series supporting that is possible. Currently >> number of tiers supported is limitted to MAX_MEMORY_TIERS(3). 
>> When doing memory hotplug, if not added to a memory tier, the NUMA >> node gets added to DEFAULT_MEMORY_TIER(1). >> >> This patch is based on the proposal sent by Wei Xu <weixugc@google.com> at [1]. >> >> [1] https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com >> >> Suggested-by: Wei Xu <weixugc@google.com> >> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com> >> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> >> --- >> include/linux/memory-tiers.h | 20 ++++ >> mm/Kconfig | 11 ++ >> mm/Makefile | 1 + >> mm/memory-tiers.c | 188 +++++++++++++++++++++++++++++++++++ >> 4 files changed, 220 insertions(+) >> create mode 100644 include/linux/memory-tiers.h >> create mode 100644 mm/memory-tiers.c >> >> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h >> new file mode 100644 >> index 000000000000..e17f6b4ee177 >> --- /dev/null >> +++ b/include/linux/memory-tiers.h >> @@ -0,0 +1,20 @@ >> +/* SPDX-License-Identifier: GPL-2.0 */ >> +#ifndef _LINUX_MEMORY_TIERS_H >> +#define _LINUX_MEMORY_TIERS_H >> + >> +#ifdef CONFIG_TIERED_MEMORY >> + >> +#define MEMORY_TIER_HBM_GPU 0 >> +#define MEMORY_TIER_DRAM 1 >> +#define MEMORY_TIER_PMEM 2 >> + >> +#define MEMORY_RANK_HBM_GPU 300 >> +#define MEMORY_RANK_DRAM 200 >> +#define MEMORY_RANK_PMEM 100 >> + >> +#define DEFAULT_MEMORY_TIER MEMORY_TIER_DRAM >> +#define MAX_MEMORY_TIERS 3 >> + >> +#endif /* CONFIG_TIERED_MEMORY */ >> + >> +#endif >> diff --git a/mm/Kconfig b/mm/Kconfig >> index 169e64192e48..08a3d330740b 100644 >> --- a/mm/Kconfig >> +++ b/mm/Kconfig >> @@ -614,6 +614,17 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION >> config ARCH_ENABLE_THP_MIGRATION >> bool >> >> +config TIERED_MEMORY >> + bool "Support for explicit memory tiers" >> + def_bool n >> + depends on MIGRATION && NUMA >> + help >> + Support to split nodes into memory tiers explicitly and >> + to demote pages on reclaim to lower tiers. This option >> + also exposes sysfs interface to read nodes available in >> + specific tier and to move specific node among different >> + possible tiers. > > IMHO we should not need a new kernel config. If tiering is not present > then there is just one tier on the system. And tiering is a kind of > hardware configuration, the information could be shown regardless of > whether demotion/promotion is supported/enabled or not. > This was added so that we could avoid doing multiple #if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA) Initially I had that as def_bool y and depends on MIGRATION && NUMA. But it was later suggested that def_bool is not recommended for newer config. How about config TIERED_MEMORY bool "Support for explicit memory tiers" - def_bool n - depends on MIGRATION && NUMA - help - Support to split nodes into memory tiers explicitly and - to demote pages on reclaim to lower tiers. This option - also exposes sysfs interface to read nodes available in - specific tier and to move specific node among different - possible tiers. + def_bool MIGRATION && NUMA config HUGETLB_PAGE_SIZE_VARIABLE def_bool n ie, we just make it a Kconfig variable without exposing it to the user? -aneesh
On Wed, 2022-06-08 at 10:00 +0530, Aneesh Kumar K V wrote: > On 6/8/22 12:13 AM, Tim Chen wrote: > > On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote: > > > > > > > > > The nodes which are part of a specific memory tier can be listed > > > via > > > /sys/devices/system/memtier/memtierN/nodelist > > > > > > "Rank" is an opaque value. Its absolute value doesn't have any > > > special meaning. But the rank values of different memtiers can be > > > compared with each other to determine the memory tier order. > > > > > > For example, if we have 3 memtiers: memtier0, memtier1, memiter2, and > > > their rank values are 300, 200, 100, then the memory tier order is: > > > memtier0 -> memtier2 -> memtier1, > > > > Why is memtier2 (rank 100) higher than memtier1 (rank 200)? Seems like > > the order should be memtier0 -> memtier1 -> memtier2? > > (rank 300) (rank 200) (rank 100) > > > > > where memtier0 is the highest tier > > > and memtier1 is the lowest tier. > > > > I think memtier2 is the lowest as it has the lowest rank value. > > > typo error. Will fix that in the next update > > > > > > > The rank value of each memtier should be unique. > > > > > > > > > + > > > +static void memory_tier_device_release(struct device *dev) > > > +{ > > > + struct memory_tier *tier = to_memory_tier(dev); > > > + > > > > Do we need some ref counts on memory_tier? > > If there is another device still using the same memtier, > > free below could cause problem. > > > > > + kfree(tier); > > > +} > > > + > > > > > ... > > > +static struct memory_tier *register_memory_tier(unsigned int tier) > > > +{ > > > + int error; > > > + struct memory_tier *memtier; > > > + > > > + if (tier >= MAX_MEMORY_TIERS) > > > + return NULL; > > > + > > > + memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL); > > > + if (!memtier) > > > + return NULL; > > > + > > > + memtier->dev.id = tier; > > > + memtier->rank = get_rank_from_tier(tier); > > > + memtier->dev.bus = &memory_tier_subsys; > > > + memtier->dev.release = memory_tier_device_release; > > > + memtier->dev.groups = memory_tier_dev_groups; > > > + > > > > Should you take the mem_tier_lock before you insert to > > memtier-list? > > > Both register_memory_tier and unregister_memory_tier get called with > memory_tier_lock held. Then please add locking requirements to the comments above these functions. Best Regards, Huang, Ying > > > > > + insert_memory_tier(memtier); > > > + > > > + error = device_register(&memtier->dev); > > > + if (error) { > > > + list_del(&memtier->list); > > > + put_device(&memtier->dev); > > > + return NULL; > > > + } > > > + return memtier; > > > +} > > > + > > > +__maybe_unused // temporay to prevent warnings during bisects > > > +static void unregister_memory_tier(struct memory_tier *memtier) > > > +{ > > > > I think we should take mem_tier_lock before modifying memtier->list. > > > > unregister_memory_tier get called with memory_tier_lock held. > > > > + list_del(&memtier->list); > > > + device_unregister(&memtier->dev); > > > +} > > > + > > > > > -aneesh
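One way to both document and enforce the "caller holds memory_tier_lock" rule Ying asks for is a comment plus lockdep_assert_held(). A sketch of the insert helper from the patch with that added (assuming memory_tier_lock remains the lock protecting the list):

/* Must be called with memory_tier_lock held. */
static void insert_memory_tier(struct memory_tier *memtier)
{
	struct list_head *ent;
	struct memory_tier *tmp_memtier;

	lockdep_assert_held(&memory_tier_lock);

	/* keep memory_tiers sorted by descending rank */
	list_for_each(ent, &memory_tiers) {
		tmp_memtier = list_entry(ent, struct memory_tier, list);
		if (tmp_memtier->rank < memtier->rank) {
			list_add_tail(&memtier->list, ent);
			return;
		}
	}
	list_add_tail(&memtier->list, &memory_tiers);
}

The same comment and assertion would apply to register_memory_tier() and unregister_memory_tier(), which also touch the list.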
On Wed, 2022-06-08 at 10:07 +0530, Aneesh Kumar K V wrote: > On 6/8/22 12:13 AM, Tim Chen wrote: > ... > > > > > > > + > > > +static void memory_tier_device_release(struct device *dev) > > > +{ > > > + struct memory_tier *tier = to_memory_tier(dev); > > > + > > > > Do we need some ref counts on memory_tier? > > If there is another device still using the same memtier, > > free below could cause problem. > > > > > + kfree(tier); > > > +} > > > + > > > > > ... > > The lifecycle of the memory_tier struct is tied to the sysfs device life > time. ie, memory_tier_device_relese get called only after the last > reference on that sysfs dev object is released. Hence we can be sure > there is no userspace that is keeping one of the memtier related sysfs > file open. > > W.r.t other memory device sharing the same memtier, we unregister the > sysfs device only when the memory tier nodelist is empty. That is no > memory device is present in this memory tier. memory_tier isn't only used by user space. It is used inside kernel too. If some kernel code get a pointer to struct memory_tier, we need to guarantee the pointer will not be freed under us. And as Tim pointed out, we need to use it in hot path (for statistics), so some kind of rcu lock may be good. Best Regards, Huang, Ying
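For the hot-path case, an RCU-protected reader could look roughly like the sketch below. This assumes the writers switch to list_add_tail_rcu()/list_del_rcu() and defer the kfree() past a grace period, none of which is in the posted patch, and the helper name is made up:

static int node_memtier_rank(int node)
{
	struct memory_tier *memtier;
	int rank = -1;

	rcu_read_lock();
	list_for_each_entry_rcu(memtier, &memory_tiers, list) {
		if (node_isset(node, memtier->nodelist)) {
			rank = memtier->rank;
			break;
		}
	}
	rcu_read_unlock();

	return rank;
}

Readers then never block writers, which matters if such a lookup ends up in the reclaim/demotion path or in per-node statistics.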
On Wed, 2022-06-08 at 10:28 +0530, Aneesh Kumar K V wrote: > On 6/8/22 3:02 AM, Yang Shi wrote: > > On Fri, Jun 3, 2022 at 6:43 AM Aneesh Kumar K.V > > <aneesh.kumar@linux.ibm.com> wrote: > > > > > > In the current kernel, memory tiers are defined implicitly via a > > > demotion path relationship between NUMA nodes, which is created > > > during the kernel initialization and updated when a NUMA node is > > > hot-added or hot-removed. The current implementation puts all > > > nodes with CPU into the top tier, and builds the tier hierarchy > > > tier-by-tier by establishing the per-node demotion targets based > > > on the distances between nodes. > > > > > > This current memory tier kernel interface needs to be improved for > > > several important use cases, > > > > > > The current tier initialization code always initializes > > > each memory-only NUMA node into a lower tier. But a memory-only > > > NUMA node may have a high performance memory device (e.g. a DRAM > > > device attached via CXL.mem or a DRAM-backed memory-only node on > > > a virtual machine) and should be put into a higher tier. > > > > > > The current tier hierarchy always puts CPU nodes into the top > > > tier. But on a system with HBM or GPU devices, the > > > memory-only NUMA nodes mapping these devices should be in the > > > top tier, and DRAM nodes with CPUs are better to be placed into the > > > next lower tier. > > > > > > With current kernel higher tier node can only be demoted to selected nodes on the > > > next lower tier as defined by the demotion path, not any other > > > node from any lower tier. This strict, hard-coded demotion order > > > does not work in all use cases (e.g. some use cases may want to > > > allow cross-socket demotion to another node in the same demotion > > > tier as a fallback when the preferred demotion node is out of > > > space), This demotion order is also inconsistent with the page > > > allocation fallback order when all the nodes in a higher tier are > > > out of space: The page allocation can fall back to any node from > > > any lower tier, whereas the demotion order doesn't allow that. > > > > > > The current kernel also don't provide any interfaces for the > > > userspace to learn about the memory tier hierarchy in order to > > > optimize its memory allocations. > > > > > > This patch series address the above by defining memory tiers explicitly. > > > > > > This patch introduce explicity memory tiers with ranks. The rank > > > value of a memory tier is used to derive the demotion order between > > > NUMA nodes. The memory tiers present in a system can be found at > > > > > > /sys/devices/system/memtier/memtierN/ > > > > > > The nodes which are part of a specific memory tier can be listed > > > via > > > /sys/devices/system/memtier/memtierN/nodelist > > > > > > "Rank" is an opaque value. Its absolute value doesn't have any > > > special meaning. But the rank values of different memtiers can be > > > compared with each other to determine the memory tier order. > > > > > > For example, if we have 3 memtiers: memtier0, memtier1, memiter2, and > > > their rank values are 300, 200, 100, then the memory tier order is: > > > memtier0 -> memtier2 -> memtier1, where memtier0 is the highest tier > > > and memtier1 is the lowest tier. > > > > > > The rank value of each memtier should be unique. > > > > > > A higher rank memory tier will appear first in the demotion order > > > than a lower rank memory tier. ie. 
while reclaim we choose a node > > > in higher rank memory tier to demote pages to as compared to a node > > > in a lower rank memory tier. > > > > > > For now we are not adding the dynamic number of memory tiers. > > > But a future series supporting that is possible. Currently > > > number of tiers supported is limitted to MAX_MEMORY_TIERS(3). > > > When doing memory hotplug, if not added to a memory tier, the NUMA > > > node gets added to DEFAULT_MEMORY_TIER(1). > > > > > > This patch is based on the proposal sent by Wei Xu <weixugc@google.com> at [1]. > > > > > > [1] https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com > > > > > > Suggested-by: Wei Xu <weixugc@google.com> > > > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com> > > > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> > > > --- > > > include/linux/memory-tiers.h | 20 ++++ > > > mm/Kconfig | 11 ++ > > > mm/Makefile | 1 + > > > mm/memory-tiers.c | 188 +++++++++++++++++++++++++++++++++++ > > > 4 files changed, 220 insertions(+) > > > create mode 100644 include/linux/memory-tiers.h > > > create mode 100644 mm/memory-tiers.c > > > > > > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h > > > new file mode 100644 > > > index 000000000000..e17f6b4ee177 > > > --- /dev/null > > > +++ b/include/linux/memory-tiers.h > > > @@ -0,0 +1,20 @@ > > > +/* SPDX-License-Identifier: GPL-2.0 */ > > > +#ifndef _LINUX_MEMORY_TIERS_H > > > +#define _LINUX_MEMORY_TIERS_H > > > + > > > +#ifdef CONFIG_TIERED_MEMORY > > > + > > > +#define MEMORY_TIER_HBM_GPU 0 > > > +#define MEMORY_TIER_DRAM 1 > > > +#define MEMORY_TIER_PMEM 2 > > > + > > > +#define MEMORY_RANK_HBM_GPU 300 > > > +#define MEMORY_RANK_DRAM 200 > > > +#define MEMORY_RANK_PMEM 100 > > > + > > > +#define DEFAULT_MEMORY_TIER MEMORY_TIER_DRAM > > > +#define MAX_MEMORY_TIERS 3 > > > + > > > +#endif /* CONFIG_TIERED_MEMORY */ > > > + > > > +#endif > > > diff --git a/mm/Kconfig b/mm/Kconfig > > > index 169e64192e48..08a3d330740b 100644 > > > --- a/mm/Kconfig > > > +++ b/mm/Kconfig > > > @@ -614,6 +614,17 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION > > > config ARCH_ENABLE_THP_MIGRATION > > > bool > > > > > > +config TIERED_MEMORY > > > + bool "Support for explicit memory tiers" > > > + def_bool n > > > + depends on MIGRATION && NUMA > > > + help > > > + Support to split nodes into memory tiers explicitly and > > > + to demote pages on reclaim to lower tiers. This option > > > + also exposes sysfs interface to read nodes available in > > > + specific tier and to move specific node among different > > > + possible tiers. > > > > IMHO we should not need a new kernel config. If tiering is not present > > then there is just one tier on the system. And tiering is a kind of > > hardware configuration, the information could be shown regardless of > > whether demotion/promotion is supported/enabled or not. > > > > This was added so that we could avoid doing multiple > > #if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA) > > Initially I had that as def_bool y and depends on MIGRATION && NUMA. But > it was later suggested that def_bool is not recommended for newer config. > > How about > > config TIERED_MEMORY > bool "Support for explicit memory tiers" Need to remove this line too to make it invisible for users? Best Regards, HUang, Ying > - def_bool n > - depends on MIGRATION && NUMA > - help > - Support to split nodes into memory tiers explicitly and > - to demote pages on reclaim to lower tiers. 
This option > - also exposes sysfs interface to read nodes available in > - specific tier and to move specific node among different > - possible tiers. > + def_bool MIGRATION && NUMA > > config HUGETLB_PAGE_SIZE_VARIABLE > def_bool n > > ie, we just make it a Kconfig variable without exposing it to the user? > > -aneesh
On 6/8/22 11:40 AM, Ying Huang wrote: > On Wed, 2022-06-08 at 10:07 +0530, Aneesh Kumar K V wrote: >> On 6/8/22 12:13 AM, Tim Chen wrote: >> ... >> >>>> >>>> + >>>> +static void memory_tier_device_release(struct device *dev) >>>> +{ >>>> + struct memory_tier *tier = to_memory_tier(dev); >>>> + >>> >>> Do we need some ref counts on memory_tier? >>> If there is another device still using the same memtier, >>> free below could cause problem. >>> >>>> + kfree(tier); >>>> +} >>>> + >>>> >>> ... >> >> The lifecycle of the memory_tier struct is tied to the sysfs device life >> time. ie, memory_tier_device_relese get called only after the last >> reference on that sysfs dev object is released. Hence we can be sure >> there is no userspace that is keeping one of the memtier related sysfs >> file open. >> >> W.r.t other memory device sharing the same memtier, we unregister the >> sysfs device only when the memory tier nodelist is empty. That is no >> memory device is present in this memory tier. > > memory_tier isn't only used by user space. It is used inside kernel > too. If some kernel code get a pointer to struct memory_tier, we need > to guarantee the pointer will not be freed under us. As mentioned above current patchset avoid doing that. > And as Tim pointed > out, we need to use it in hot path (for statistics), so some kind of rcu > lock may be good. > Sure when those statistics code get added, we can add the relevant kref and locking details. -aneesh
Hi Aneesh, On Fri, Jun 03, 2022 at 07:12:29PM +0530, Aneesh Kumar K.V wrote: > @@ -0,0 +1,20 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +#ifndef _LINUX_MEMORY_TIERS_H > +#define _LINUX_MEMORY_TIERS_H > + > +#ifdef CONFIG_TIERED_MEMORY > + > +#define MEMORY_TIER_HBM_GPU 0 > +#define MEMORY_TIER_DRAM 1 > +#define MEMORY_TIER_PMEM 2 > + > +#define MEMORY_RANK_HBM_GPU 300 > +#define MEMORY_RANK_DRAM 200 > +#define MEMORY_RANK_PMEM 100 > + > +#define DEFAULT_MEMORY_TIER MEMORY_TIER_DRAM > +#define MAX_MEMORY_TIERS 3 I understand the names are somewhat arbitrary, and the tier ID space can be expanded down the line by bumping MAX_MEMORY_TIERS. But starting out with a packed ID space can get quite awkward for users when new tiers - especially intermediate tiers - show up in existing configurations. I mentioned in the other email that DRAM != DRAM, so new tiers seem inevitable already. It could make sense to start with a bigger address space and spread out the list of kernel default tiers a bit within it: MEMORY_TIER_GPU 0 MEMORY_TIER_DRAM 10 MEMORY_TIER_PMEM 20 etc.
On 6/8/22 7:41 PM, Johannes Weiner wrote: > Hi Aneesh, > > On Fri, Jun 03, 2022 at 07:12:29PM +0530, Aneesh Kumar K.V wrote: >> @@ -0,0 +1,20 @@ >> +/* SPDX-License-Identifier: GPL-2.0 */ >> +#ifndef _LINUX_MEMORY_TIERS_H >> +#define _LINUX_MEMORY_TIERS_H >> + >> +#ifdef CONFIG_TIERED_MEMORY >> + >> +#define MEMORY_TIER_HBM_GPU 0 >> +#define MEMORY_TIER_DRAM 1 >> +#define MEMORY_TIER_PMEM 2 >> + >> +#define MEMORY_RANK_HBM_GPU 300 >> +#define MEMORY_RANK_DRAM 200 >> +#define MEMORY_RANK_PMEM 100 >> + >> +#define DEFAULT_MEMORY_TIER MEMORY_TIER_DRAM >> +#define MAX_MEMORY_TIERS 3 > > I understand the names are somewhat arbitrary, and the tier ID space > can be expanded down the line by bumping MAX_MEMORY_TIERS. > > But starting out with a packed ID space can get quite awkward for > users when new tiers - especially intermediate tiers - show up in > existing configurations. I mentioned in the other email that DRAM != > DRAM, so new tiers seem inevitable already. > > It could make sense to start with a bigger address space and spread > out the list of kernel default tiers a bit within it: > > MEMORY_TIER_GPU 0 > MEMORY_TIER_DRAM 10 > MEMORY_TIER_PMEM 20 > the tier index or tier id or the tier dev id don't have any special meaning. What is used to find the demotion order is memory tier rank and they are really spread out, (300, 200, 100). -aneesh
Hello, On Wed, Jun 08, 2022 at 10:11:31AM -0400, Johannes Weiner wrote: > On Fri, Jun 03, 2022 at 07:12:29PM +0530, Aneesh Kumar K.V wrote: > > @@ -0,0 +1,20 @@ > > +/* SPDX-License-Identifier: GPL-2.0 */ > > +#ifndef _LINUX_MEMORY_TIERS_H > > +#define _LINUX_MEMORY_TIERS_H > > + > > +#ifdef CONFIG_TIERED_MEMORY > > + > > +#define MEMORY_TIER_HBM_GPU 0 > > +#define MEMORY_TIER_DRAM 1 > > +#define MEMORY_TIER_PMEM 2 > > + > > +#define MEMORY_RANK_HBM_GPU 300 > > +#define MEMORY_RANK_DRAM 200 > > +#define MEMORY_RANK_PMEM 100 > > + > > +#define DEFAULT_MEMORY_TIER MEMORY_TIER_DRAM > > +#define MAX_MEMORY_TIERS 3 > > I understand the names are somewhat arbitrary, and the tier ID space > can be expanded down the line by bumping MAX_MEMORY_TIERS. > > But starting out with a packed ID space can get quite awkward for > users when new tiers - especially intermediate tiers - show up in > existing configurations. I mentioned in the other email that DRAM != > DRAM, so new tiers seem inevitable already. > > It could make sense to start with a bigger address space and spread > out the list of kernel default tiers a bit within it: > > MEMORY_TIER_GPU 0 > MEMORY_TIER_DRAM 10 > MEMORY_TIER_PMEM 20 Forgive me if I'm asking a question that has been answered. I went back to earlier threads and couldn't work it out - maybe there were some off-list discussions? Anyway... Why is there a distinction between tier ID and rank? I undestand that rank was added because tier IDs were too few. But if rank determines ordering, what is the use of a separate tier ID? IOW, why not make the tier ID space wider and have the kernel pick a few spread out defaults based on known hardware, with plenty of headroom to be future proof. $ ls tiers 100 # DEFAULT_TIER $ cat tiers/100/nodelist 0-1 # conventional numa nodes <pmem is onlined> $ grep . tiers/*/nodelist tiers/100/nodelist:0-1 # conventional numa tiers/200/nodelist:2 # pmem $ grep . nodes/*/tier nodes/0/tier:100 nodes/1/tier:100 nodes/2/tier:200 <unknown device is online as node 3, defaults to 100> $ grep . tiers/*/nodelist tiers/100/nodelist:0-1,3 tiers/200/nodelist:2 $ echo 300 >nodes/3/tier $ grep . tiers/*/nodelist tiers/100/nodelist:0-1 tiers/200/nodelist:2 tiers/300/nodelist:3 $ echo 200 >nodes/3/tier $ grep . tiers/*/nodelist tiers/100/nodelist:0-1 tiers/200/nodelist:2-3 etc.
On 6/8/22 9:25 PM, Johannes Weiner wrote: > Hello, > > On Wed, Jun 08, 2022 at 10:11:31AM -0400, Johannes Weiner wrote: >> On Fri, Jun 03, 2022 at 07:12:29PM +0530, Aneesh Kumar K.V wrote: >>> @@ -0,0 +1,20 @@ >>> +/* SPDX-License-Identifier: GPL-2.0 */ >>> +#ifndef _LINUX_MEMORY_TIERS_H >>> +#define _LINUX_MEMORY_TIERS_H >>> + >>> +#ifdef CONFIG_TIERED_MEMORY >>> + >>> +#define MEMORY_TIER_HBM_GPU 0 >>> +#define MEMORY_TIER_DRAM 1 >>> +#define MEMORY_TIER_PMEM 2 >>> + >>> +#define MEMORY_RANK_HBM_GPU 300 >>> +#define MEMORY_RANK_DRAM 200 >>> +#define MEMORY_RANK_PMEM 100 >>> + >>> +#define DEFAULT_MEMORY_TIER MEMORY_TIER_DRAM >>> +#define MAX_MEMORY_TIERS 3 >> >> I understand the names are somewhat arbitrary, and the tier ID space >> can be expanded down the line by bumping MAX_MEMORY_TIERS. >> >> But starting out with a packed ID space can get quite awkward for >> users when new tiers - especially intermediate tiers - show up in >> existing configurations. I mentioned in the other email that DRAM != >> DRAM, so new tiers seem inevitable already. >> >> It could make sense to start with a bigger address space and spread >> out the list of kernel default tiers a bit within it: >> >> MEMORY_TIER_GPU 0 >> MEMORY_TIER_DRAM 10 >> MEMORY_TIER_PMEM 20 > > Forgive me if I'm asking a question that has been answered. I went > back to earlier threads and couldn't work it out - maybe there were > some off-list discussions? Anyway... > > Why is there a distinction between tier ID and rank? I undestand that > rank was added because tier IDs were too few. But if rank determines > ordering, what is the use of a separate tier ID? IOW, why not make the > tier ID space wider and have the kernel pick a few spread out defaults > based on known hardware, with plenty of headroom to be future proof. > > $ ls tiers > 100 # DEFAULT_TIER > $ cat tiers/100/nodelist > 0-1 # conventional numa nodes > > <pmem is onlined> > > $ grep . tiers/*/nodelist > tiers/100/nodelist:0-1 # conventional numa > tiers/200/nodelist:2 # pmem > > $ grep . nodes/*/tier > nodes/0/tier:100 > nodes/1/tier:100 > nodes/2/tier:200 > > <unknown device is online as node 3, defaults to 100> > > $ grep . tiers/*/nodelist > tiers/100/nodelist:0-1,3 > tiers/200/nodelist:2 > > $ echo 300 >nodes/3/tier > $ grep . tiers/*/nodelist > tiers/100/nodelist:0-1 > tiers/200/nodelist:2 > tiers/300/nodelist:3 > > $ echo 200 >nodes/3/tier > $ grep . tiers/*/nodelist > tiers/100/nodelist:0-1 > tiers/200/nodelist:2-3 > > etc. tier ID is also used as device id memtier.dev.id. It was discussed that we would need the ability to change the rank value of a memory tier. If we make rank value same as tier ID or tier device id, we will not be able to support that. -aneesh
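As a sketch of what "changing the rank value of a memory tier" could look like (purely illustrative, reusing the structures and helpers from the patch quoted above; a writable rank attribute is not part of the posted series), note that the device id stays fixed while the rank, and therefore the position in the sorted list, changes:

/*
 * Hypothetical sketch only: a writable "rank" attribute that would
 * replace the read-only one in the patch, placed next to rank_show()
 * in mm/memory-tiers.c.  memtier->dev.id is fixed once the device is
 * registered, so deriving rank from the id would rule this out.
 */
static ssize_t rank_store(struct device *dev, struct device_attribute *attr,
			  const char *buf, size_t count)
{
	struct memory_tier *memtier = to_memory_tier(dev);
	int rank, ret;

	ret = kstrtoint(buf, 10, &rank);
	if (ret)
		return ret;

	mutex_lock(&memory_tier_lock);
	memtier->rank = rank;
	/* re-sort so the demotion order follows the new rank */
	list_del(&memtier->list);
	insert_memory_tier(memtier);
	mutex_unlock(&memory_tier_lock);

	return count;
}
static DEVICE_ATTR_RW(rank);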
On Tue, Jun 7, 2022 at 6:34 PM Ying Huang <ying.huang@intel.com> wrote: > > On Tue, 2022-06-07 at 14:32 -0700, Yang Shi wrote: > > On Fri, Jun 3, 2022 at 6:43 AM Aneesh Kumar K.V > > <aneesh.kumar@linux.ibm.com> wrote: > > > > > > In the current kernel, memory tiers are defined implicitly via a > > > demotion path relationship between NUMA nodes, which is created > > > during the kernel initialization and updated when a NUMA node is > > > hot-added or hot-removed. The current implementation puts all > > > nodes with CPU into the top tier, and builds the tier hierarchy > > > tier-by-tier by establishing the per-node demotion targets based > > > on the distances between nodes. > > > > > > This current memory tier kernel interface needs to be improved for > > > several important use cases, > > > > > > The current tier initialization code always initializes > > > each memory-only NUMA node into a lower tier. But a memory-only > > > NUMA node may have a high performance memory device (e.g. a DRAM > > > device attached via CXL.mem or a DRAM-backed memory-only node on > > > a virtual machine) and should be put into a higher tier. > > > > > > The current tier hierarchy always puts CPU nodes into the top > > > tier. But on a system with HBM or GPU devices, the > > > memory-only NUMA nodes mapping these devices should be in the > > > top tier, and DRAM nodes with CPUs are better to be placed into the > > > next lower tier. > > > > > > With current kernel higher tier node can only be demoted to selected nodes on the > > > next lower tier as defined by the demotion path, not any other > > > node from any lower tier. This strict, hard-coded demotion order > > > does not work in all use cases (e.g. some use cases may want to > > > allow cross-socket demotion to another node in the same demotion > > > tier as a fallback when the preferred demotion node is out of > > > space), This demotion order is also inconsistent with the page > > > allocation fallback order when all the nodes in a higher tier are > > > out of space: The page allocation can fall back to any node from > > > any lower tier, whereas the demotion order doesn't allow that. > > > > > > The current kernel also don't provide any interfaces for the > > > userspace to learn about the memory tier hierarchy in order to > > > optimize its memory allocations. > > > > > > This patch series address the above by defining memory tiers explicitly. > > > > > > This patch introduce explicity memory tiers with ranks. The rank > > > value of a memory tier is used to derive the demotion order between > > > NUMA nodes. The memory tiers present in a system can be found at > > > > > > /sys/devices/system/memtier/memtierN/ > > > > > > The nodes which are part of a specific memory tier can be listed > > > via > > > /sys/devices/system/memtier/memtierN/nodelist > > > > > > "Rank" is an opaque value. Its absolute value doesn't have any > > > special meaning. But the rank values of different memtiers can be > > > compared with each other to determine the memory tier order. > > > > > > For example, if we have 3 memtiers: memtier0, memtier1, memiter2, and > > > their rank values are 300, 200, 100, then the memory tier order is: > > > memtier0 -> memtier2 -> memtier1, where memtier0 is the highest tier > > > and memtier1 is the lowest tier. > > > > > > The rank value of each memtier should be unique. > > > > > > A higher rank memory tier will appear first in the demotion order > > > than a lower rank memory tier. ie. 
while reclaim we choose a node > > > in higher rank memory tier to demote pages to as compared to a node > > > in a lower rank memory tier. > > > > > > For now we are not adding the dynamic number of memory tiers. > > > But a future series supporting that is possible. Currently > > > number of tiers supported is limitted to MAX_MEMORY_TIERS(3). > > > When doing memory hotplug, if not added to a memory tier, the NUMA > > > node gets added to DEFAULT_MEMORY_TIER(1). > > > > > > This patch is based on the proposal sent by Wei Xu <weixugc@google.com> at [1]. > > > > > > [1] https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com > > > > > > Suggested-by: Wei Xu <weixugc@google.com> > > > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com> > > > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> > > > --- > > > include/linux/memory-tiers.h | 20 ++++ > > > mm/Kconfig | 11 ++ > > > mm/Makefile | 1 + > > > mm/memory-tiers.c | 188 +++++++++++++++++++++++++++++++++++ > > > 4 files changed, 220 insertions(+) > > > create mode 100644 include/linux/memory-tiers.h > > > create mode 100644 mm/memory-tiers.c > > > > > > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h > > > new file mode 100644 > > > index 000000000000..e17f6b4ee177 > > > --- /dev/null > > > +++ b/include/linux/memory-tiers.h > > > @@ -0,0 +1,20 @@ > > > +/* SPDX-License-Identifier: GPL-2.0 */ > > > +#ifndef _LINUX_MEMORY_TIERS_H > > > +#define _LINUX_MEMORY_TIERS_H > > > + > > > +#ifdef CONFIG_TIERED_MEMORY > > > + > > > +#define MEMORY_TIER_HBM_GPU 0 > > > +#define MEMORY_TIER_DRAM 1 > > > +#define MEMORY_TIER_PMEM 2 > > > + > > > +#define MEMORY_RANK_HBM_GPU 300 > > > +#define MEMORY_RANK_DRAM 200 > > > +#define MEMORY_RANK_PMEM 100 > > > + > > > +#define DEFAULT_MEMORY_TIER MEMORY_TIER_DRAM > > > +#define MAX_MEMORY_TIERS 3 > > > + > > > +#endif /* CONFIG_TIERED_MEMORY */ > > > + > > > +#endif > > > diff --git a/mm/Kconfig b/mm/Kconfig > > > index 169e64192e48..08a3d330740b 100644 > > > --- a/mm/Kconfig > > > +++ b/mm/Kconfig > > > @@ -614,6 +614,17 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION > > > config ARCH_ENABLE_THP_MIGRATION > > > bool > > > > > > +config TIERED_MEMORY > > > + bool "Support for explicit memory tiers" > > > + def_bool n > > > + depends on MIGRATION && NUMA > > > + help > > > + Support to split nodes into memory tiers explicitly and > > > + to demote pages on reclaim to lower tiers. This option > > > + also exposes sysfs interface to read nodes available in > > > + specific tier and to move specific node among different > > > + possible tiers. > > > > IMHO we should not need a new kernel config. If tiering is not present > > then there is just one tier on the system. And tiering is a kind of > > hardware configuration, the information could be shown regardless of > > whether demotion/promotion is supported/enabled or not. > > I think so too. At least it appears unnecessary to let the user turn > on/off it at configuration time. > > All the code should be enclosed by #if defined(CONFIG_NUMA) && > defined(CONIFIG_MIGRATION). So we will not waste memory in small > systems. CONFIG_NUMA alone should be good enough. CONFIG_MIGRATION is enabled by default if NUMA is enabled. And MIGRATION is just used to support demotion/promotion. Memory tiers exist even though demotion/promotion is not supported, right? 
> > Best Regards, > Huang, Ying > > > > + > > > config HUGETLB_PAGE_SIZE_VARIABLE > > > def_bool n > > > help > > > diff --git a/mm/Makefile b/mm/Makefile > > > index 6f9ffa968a1a..482557fbc9d1 100644 > > > --- a/mm/Makefile > > > +++ b/mm/Makefile > > > @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/ > > > obj-$(CONFIG_FAILSLAB) += failslab.o > > > obj-$(CONFIG_MEMTEST) += memtest.o > > > obj-$(CONFIG_MIGRATION) += migrate.o > > > +obj-$(CONFIG_TIERED_MEMORY) += memory-tiers.o > > > obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o > > > obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o > > > obj-$(CONFIG_PAGE_COUNTER) += page_counter.o > > > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c > > > new file mode 100644 > > > index 000000000000..7de18d94a08d > > > --- /dev/null > > > +++ b/mm/memory-tiers.c > > > @@ -0,0 +1,188 @@ > > > +// SPDX-License-Identifier: GPL-2.0 > > > +#include <linux/types.h> > > > +#include <linux/device.h> > > > +#include <linux/nodemask.h> > > > +#include <linux/slab.h> > > > +#include <linux/memory-tiers.h> > > > + > > > +struct memory_tier { > > > + struct list_head list; > > > + struct device dev; > > > + nodemask_t nodelist; > > > + int rank; > > > +}; > > > + > > > +#define to_memory_tier(device) container_of(device, struct memory_tier, dev) > > > + > > > +static struct bus_type memory_tier_subsys = { > > > + .name = "memtier", > > > + .dev_name = "memtier", > > > +}; > > > + > > > +static DEFINE_MUTEX(memory_tier_lock); > > > +static LIST_HEAD(memory_tiers); > > > + > > > + > > > +static ssize_t nodelist_show(struct device *dev, > > > + struct device_attribute *attr, char *buf) > > > +{ > > > + struct memory_tier *memtier = to_memory_tier(dev); > > > + > > > + return sysfs_emit(buf, "%*pbl\n", > > > + nodemask_pr_args(&memtier->nodelist)); > > > +} > > > +static DEVICE_ATTR_RO(nodelist); > > > + > > > +static ssize_t rank_show(struct device *dev, > > > + struct device_attribute *attr, char *buf) > > > +{ > > > + struct memory_tier *memtier = to_memory_tier(dev); > > > + > > > + return sysfs_emit(buf, "%d\n", memtier->rank); > > > +} > > > +static DEVICE_ATTR_RO(rank); > > > + > > > +static struct attribute *memory_tier_dev_attrs[] = { > > > + &dev_attr_nodelist.attr, > > > + &dev_attr_rank.attr, > > > + NULL > > > +}; > > > + > > > +static const struct attribute_group memory_tier_dev_group = { > > > + .attrs = memory_tier_dev_attrs, > > > +}; > > > + > > > +static const struct attribute_group *memory_tier_dev_groups[] = { > > > + &memory_tier_dev_group, > > > + NULL > > > +}; > > > + > > > +static void memory_tier_device_release(struct device *dev) > > > +{ > > > + struct memory_tier *tier = to_memory_tier(dev); > > > + > > > + kfree(tier); > > > +} > > > + > > > +/* > > > + * Keep it simple by having direct mapping between > > > + * tier index and rank value. 
> > > + */ > > > +static inline int get_rank_from_tier(unsigned int tier) > > > +{ > > > + switch (tier) { > > > + case MEMORY_TIER_HBM_GPU: > > > + return MEMORY_RANK_HBM_GPU; > > > + case MEMORY_TIER_DRAM: > > > + return MEMORY_RANK_DRAM; > > > + case MEMORY_TIER_PMEM: > > > + return MEMORY_RANK_PMEM; > > > + } > > > + > > > + return 0; > > > +} > > > + > > > +static void insert_memory_tier(struct memory_tier *memtier) > > > +{ > > > + struct list_head *ent; > > > + struct memory_tier *tmp_memtier; > > > + > > > + list_for_each(ent, &memory_tiers) { > > > + tmp_memtier = list_entry(ent, struct memory_tier, list); > > > + if (tmp_memtier->rank < memtier->rank) { > > > + list_add_tail(&memtier->list, ent); > > > + return; > > > + } > > > + } > > > + list_add_tail(&memtier->list, &memory_tiers); > > > +} > > > + > > > +static struct memory_tier *register_memory_tier(unsigned int tier) > > > +{ > > > + int error; > > > + struct memory_tier *memtier; > > > + > > > + if (tier >= MAX_MEMORY_TIERS) > > > + return NULL; > > > + > > > + memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL); > > > + if (!memtier) > > > + return NULL; > > > + > > > + memtier->dev.id = tier; > > > + memtier->rank = get_rank_from_tier(tier); > > > + memtier->dev.bus = &memory_tier_subsys; > > > + memtier->dev.release = memory_tier_device_release; > > > + memtier->dev.groups = memory_tier_dev_groups; > > > + > > > + insert_memory_tier(memtier); > > > + > > > + error = device_register(&memtier->dev); > > > + if (error) { > > > + list_del(&memtier->list); > > > + put_device(&memtier->dev); > > > + return NULL; > > > + } > > > + return memtier; > > > +} > > > + > > > +__maybe_unused // temporay to prevent warnings during bisects > > > +static void unregister_memory_tier(struct memory_tier *memtier) > > > +{ > > > + list_del(&memtier->list); > > > + device_unregister(&memtier->dev); > > > +} > > > + > > > +static ssize_t > > > +max_tier_show(struct device *dev, struct device_attribute *attr, char *buf) > > > +{ > > > + return sysfs_emit(buf, "%d\n", MAX_MEMORY_TIERS); > > > +} > > > +static DEVICE_ATTR_RO(max_tier); > > > + > > > +static ssize_t > > > +default_tier_show(struct device *dev, struct device_attribute *attr, char *buf) > > > +{ > > > + return sysfs_emit(buf, "memtier%d\n", DEFAULT_MEMORY_TIER); > > > +} > > > +static DEVICE_ATTR_RO(default_tier); > > > + > > > +static struct attribute *memory_tier_attrs[] = { > > > + &dev_attr_max_tier.attr, > > > + &dev_attr_default_tier.attr, > > > + NULL > > > +}; > > > + > > > +static const struct attribute_group memory_tier_attr_group = { > > > + .attrs = memory_tier_attrs, > > > +}; > > > + > > > +static const struct attribute_group *memory_tier_attr_groups[] = { > > > + &memory_tier_attr_group, > > > + NULL, > > > +}; > > > + > > > +static int __init memory_tier_init(void) > > > +{ > > > + int ret; > > > + struct memory_tier *memtier; > > > + > > > + ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups); > > > + if (ret) > > > + panic("%s() failed to register subsystem: %d\n", __func__, ret); > > > + > > > + /* > > > + * Register only default memory tier to hide all empty > > > + * memory tier from sysfs. > > > + */ > > > + memtier = register_memory_tier(DEFAULT_MEMORY_TIER); > > > + if (!memtier) > > > + panic("%s() failed to register memory tier: %d\n", __func__, ret); > > > + > > > + /* CPU only nodes are not part of memory tiers. 
*/ > > > + memtier->nodelist = node_states[N_MEMORY]; > > > + > > > + return 0; > > > +} > > > +subsys_initcall(memory_tier_init); > > > + > > > -- > > > 2.36.1 > > > > >
On Tue, Jun 7, 2022 at 9:58 PM Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> wrote: > > On 6/8/22 3:02 AM, Yang Shi wrote: > > On Fri, Jun 3, 2022 at 6:43 AM Aneesh Kumar K.V > > <aneesh.kumar@linux.ibm.com> wrote: > >> > >> In the current kernel, memory tiers are defined implicitly via a > >> demotion path relationship between NUMA nodes, which is created > >> during the kernel initialization and updated when a NUMA node is > >> hot-added or hot-removed. The current implementation puts all > >> nodes with CPU into the top tier, and builds the tier hierarchy > >> tier-by-tier by establishing the per-node demotion targets based > >> on the distances between nodes. > >> > >> This current memory tier kernel interface needs to be improved for > >> several important use cases, > >> > >> The current tier initialization code always initializes > >> each memory-only NUMA node into a lower tier. But a memory-only > >> NUMA node may have a high performance memory device (e.g. a DRAM > >> device attached via CXL.mem or a DRAM-backed memory-only node on > >> a virtual machine) and should be put into a higher tier. > >> > >> The current tier hierarchy always puts CPU nodes into the top > >> tier. But on a system with HBM or GPU devices, the > >> memory-only NUMA nodes mapping these devices should be in the > >> top tier, and DRAM nodes with CPUs are better to be placed into the > >> next lower tier. > >> > >> With current kernel higher tier node can only be demoted to selected nodes on the > >> next lower tier as defined by the demotion path, not any other > >> node from any lower tier. This strict, hard-coded demotion order > >> does not work in all use cases (e.g. some use cases may want to > >> allow cross-socket demotion to another node in the same demotion > >> tier as a fallback when the preferred demotion node is out of > >> space), This demotion order is also inconsistent with the page > >> allocation fallback order when all the nodes in a higher tier are > >> out of space: The page allocation can fall back to any node from > >> any lower tier, whereas the demotion order doesn't allow that. > >> > >> The current kernel also don't provide any interfaces for the > >> userspace to learn about the memory tier hierarchy in order to > >> optimize its memory allocations. > >> > >> This patch series address the above by defining memory tiers explicitly. > >> > >> This patch introduce explicity memory tiers with ranks. The rank > >> value of a memory tier is used to derive the demotion order between > >> NUMA nodes. The memory tiers present in a system can be found at > >> > >> /sys/devices/system/memtier/memtierN/ > >> > >> The nodes which are part of a specific memory tier can be listed > >> via > >> /sys/devices/system/memtier/memtierN/nodelist > >> > >> "Rank" is an opaque value. Its absolute value doesn't have any > >> special meaning. But the rank values of different memtiers can be > >> compared with each other to determine the memory tier order. > >> > >> For example, if we have 3 memtiers: memtier0, memtier1, memiter2, and > >> their rank values are 300, 200, 100, then the memory tier order is: > >> memtier0 -> memtier2 -> memtier1, where memtier0 is the highest tier > >> and memtier1 is the lowest tier. > >> > >> The rank value of each memtier should be unique. > >> > >> A higher rank memory tier will appear first in the demotion order > >> than a lower rank memory tier. ie. 
while reclaim we choose a node > >> in higher rank memory tier to demote pages to as compared to a node > >> in a lower rank memory tier. > >> > >> For now we are not adding the dynamic number of memory tiers. > >> But a future series supporting that is possible. Currently > >> number of tiers supported is limitted to MAX_MEMORY_TIERS(3). > >> When doing memory hotplug, if not added to a memory tier, the NUMA > >> node gets added to DEFAULT_MEMORY_TIER(1). > >> > >> This patch is based on the proposal sent by Wei Xu <weixugc@google.com> at [1]. > >> > >> [1] https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com > >> > >> Suggested-by: Wei Xu <weixugc@google.com> > >> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com> > >> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> > >> --- > >> include/linux/memory-tiers.h | 20 ++++ > >> mm/Kconfig | 11 ++ > >> mm/Makefile | 1 + > >> mm/memory-tiers.c | 188 +++++++++++++++++++++++++++++++++++ > >> 4 files changed, 220 insertions(+) > >> create mode 100644 include/linux/memory-tiers.h > >> create mode 100644 mm/memory-tiers.c > >> > >> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h > >> new file mode 100644 > >> index 000000000000..e17f6b4ee177 > >> --- /dev/null > >> +++ b/include/linux/memory-tiers.h > >> @@ -0,0 +1,20 @@ > >> +/* SPDX-License-Identifier: GPL-2.0 */ > >> +#ifndef _LINUX_MEMORY_TIERS_H > >> +#define _LINUX_MEMORY_TIERS_H > >> + > >> +#ifdef CONFIG_TIERED_MEMORY > >> + > >> +#define MEMORY_TIER_HBM_GPU 0 > >> +#define MEMORY_TIER_DRAM 1 > >> +#define MEMORY_TIER_PMEM 2 > >> + > >> +#define MEMORY_RANK_HBM_GPU 300 > >> +#define MEMORY_RANK_DRAM 200 > >> +#define MEMORY_RANK_PMEM 100 > >> + > >> +#define DEFAULT_MEMORY_TIER MEMORY_TIER_DRAM > >> +#define MAX_MEMORY_TIERS 3 > >> + > >> +#endif /* CONFIG_TIERED_MEMORY */ > >> + > >> +#endif > >> diff --git a/mm/Kconfig b/mm/Kconfig > >> index 169e64192e48..08a3d330740b 100644 > >> --- a/mm/Kconfig > >> +++ b/mm/Kconfig > >> @@ -614,6 +614,17 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION > >> config ARCH_ENABLE_THP_MIGRATION > >> bool > >> > >> +config TIERED_MEMORY > >> + bool "Support for explicit memory tiers" > >> + def_bool n > >> + depends on MIGRATION && NUMA > >> + help > >> + Support to split nodes into memory tiers explicitly and > >> + to demote pages on reclaim to lower tiers. This option > >> + also exposes sysfs interface to read nodes available in > >> + specific tier and to move specific node among different > >> + possible tiers. > > > > IMHO we should not need a new kernel config. If tiering is not present > > then there is just one tier on the system. And tiering is a kind of > > hardware configuration, the information could be shown regardless of > > whether demotion/promotion is supported/enabled or not. > > > > This was added so that we could avoid doing multiple > > #if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA) > > Initially I had that as def_bool y and depends on MIGRATION && NUMA. But > it was later suggested that def_bool is not recommended for newer config. > > How about > > config TIERED_MEMORY > bool "Support for explicit memory tiers" > - def_bool n > - depends on MIGRATION && NUMA > - help > - Support to split nodes into memory tiers explicitly and > - to demote pages on reclaim to lower tiers. This option > - also exposes sysfs interface to read nodes available in > - specific tier and to move specific node among different > - possible tiers. 
> + def_bool MIGRATION && NUMA CONFIG_NUMA should be good enough. Memory tiering doesn't have to mean demotion/promotion has to be supported IMHO. > > config HUGETLB_PAGE_SIZE_VARIABLE > def_bool n > > ie, we just make it a Kconfig variable without exposing it to the user? > > -aneesh
On Wed, Jun 08, 2022 at 09:43:52PM +0530, Aneesh Kumar K V wrote: > On 6/8/22 9:25 PM, Johannes Weiner wrote: > > Hello, > > > > On Wed, Jun 08, 2022 at 10:11:31AM -0400, Johannes Weiner wrote: > > > On Fri, Jun 03, 2022 at 07:12:29PM +0530, Aneesh Kumar K.V wrote: > > > > @@ -0,0 +1,20 @@ > > > > +/* SPDX-License-Identifier: GPL-2.0 */ > > > > +#ifndef _LINUX_MEMORY_TIERS_H > > > > +#define _LINUX_MEMORY_TIERS_H > > > > + > > > > +#ifdef CONFIG_TIERED_MEMORY > > > > + > > > > +#define MEMORY_TIER_HBM_GPU 0 > > > > +#define MEMORY_TIER_DRAM 1 > > > > +#define MEMORY_TIER_PMEM 2 > > > > + > > > > +#define MEMORY_RANK_HBM_GPU 300 > > > > +#define MEMORY_RANK_DRAM 200 > > > > +#define MEMORY_RANK_PMEM 100 > > > > + > > > > +#define DEFAULT_MEMORY_TIER MEMORY_TIER_DRAM > > > > +#define MAX_MEMORY_TIERS 3 > > > > > > I understand the names are somewhat arbitrary, and the tier ID space > > > can be expanded down the line by bumping MAX_MEMORY_TIERS. > > > > > > But starting out with a packed ID space can get quite awkward for > > > users when new tiers - especially intermediate tiers - show up in > > > existing configurations. I mentioned in the other email that DRAM != > > > DRAM, so new tiers seem inevitable already. > > > > > > It could make sense to start with a bigger address space and spread > > > out the list of kernel default tiers a bit within it: > > > > > > MEMORY_TIER_GPU 0 > > > MEMORY_TIER_DRAM 10 > > > MEMORY_TIER_PMEM 20 > > > > Forgive me if I'm asking a question that has been answered. I went > > back to earlier threads and couldn't work it out - maybe there were > > some off-list discussions? Anyway... > > > > Why is there a distinction between tier ID and rank? I undestand that > > rank was added because tier IDs were too few. But if rank determines > > ordering, what is the use of a separate tier ID? IOW, why not make the > > tier ID space wider and have the kernel pick a few spread out defaults > > based on known hardware, with plenty of headroom to be future proof. > > > > $ ls tiers > > 100 # DEFAULT_TIER > > $ cat tiers/100/nodelist > > 0-1 # conventional numa nodes > > > > <pmem is onlined> > > > > $ grep . tiers/*/nodelist > > tiers/100/nodelist:0-1 # conventional numa > > tiers/200/nodelist:2 # pmem > > > > $ grep . nodes/*/tier > > nodes/0/tier:100 > > nodes/1/tier:100 > > nodes/2/tier:200 > > > > <unknown device is online as node 3, defaults to 100> > > > > $ grep . tiers/*/nodelist > > tiers/100/nodelist:0-1,3 > > tiers/200/nodelist:2 > > > > $ echo 300 >nodes/3/tier > > $ grep . tiers/*/nodelist > > tiers/100/nodelist:0-1 > > tiers/200/nodelist:2 > > tiers/300/nodelist:3 > > > > $ echo 200 >nodes/3/tier > > $ grep . tiers/*/nodelist > > tiers/100/nodelist:0-1 > > tiers/200/nodelist:2-3 > > > > etc. > > tier ID is also used as device id memtier.dev.id. It was discussed that we > would need the ability to change the rank value of a memory tier. If we make > rank value same as tier ID or tier device id, we will not be able to support > that. Is the idea that you could change the rank of a collection of nodes in one go? Rather than moving the nodes one by one into a new tier? [ Sorry, I wasn't able to find this discussion. AFAICS the first patches in RFC4 already had the struct device { .id = tier } logic. Could you point me to it? In general it would be really helpful to maintain summarized rationales for such decisions in the coverletter to make sure things don't get lost over many, many threads, conferences, and video calls. ]
On 6/8/22 11:46 PM, Johannes Weiner wrote: > On Wed, Jun 08, 2022 at 09:43:52PM +0530, Aneesh Kumar K V wrote: >> On 6/8/22 9:25 PM, Johannes Weiner wrote: >>> Hello, >>> >>> On Wed, Jun 08, 2022 at 10:11:31AM -0400, Johannes Weiner wrote: >>>> On Fri, Jun 03, 2022 at 07:12:29PM +0530, Aneesh Kumar K.V wrote: >>>>> @@ -0,0 +1,20 @@ >>>>> +/* SPDX-License-Identifier: GPL-2.0 */ >>>>> +#ifndef _LINUX_MEMORY_TIERS_H >>>>> +#define _LINUX_MEMORY_TIERS_H >>>>> + >>>>> +#ifdef CONFIG_TIERED_MEMORY >>>>> + >>>>> +#define MEMORY_TIER_HBM_GPU 0 >>>>> +#define MEMORY_TIER_DRAM 1 >>>>> +#define MEMORY_TIER_PMEM 2 >>>>> + >>>>> +#define MEMORY_RANK_HBM_GPU 300 >>>>> +#define MEMORY_RANK_DRAM 200 >>>>> +#define MEMORY_RANK_PMEM 100 >>>>> + >>>>> +#define DEFAULT_MEMORY_TIER MEMORY_TIER_DRAM >>>>> +#define MAX_MEMORY_TIERS 3 >>>> >>>> I understand the names are somewhat arbitrary, and the tier ID space >>>> can be expanded down the line by bumping MAX_MEMORY_TIERS. >>>> >>>> But starting out with a packed ID space can get quite awkward for >>>> users when new tiers - especially intermediate tiers - show up in >>>> existing configurations. I mentioned in the other email that DRAM != >>>> DRAM, so new tiers seem inevitable already. >>>> >>>> It could make sense to start with a bigger address space and spread >>>> out the list of kernel default tiers a bit within it: >>>> >>>> MEMORY_TIER_GPU 0 >>>> MEMORY_TIER_DRAM 10 >>>> MEMORY_TIER_PMEM 20 >>> >>> Forgive me if I'm asking a question that has been answered. I went >>> back to earlier threads and couldn't work it out - maybe there were >>> some off-list discussions? Anyway... >>> >>> Why is there a distinction between tier ID and rank? I undestand that >>> rank was added because tier IDs were too few. But if rank determines >>> ordering, what is the use of a separate tier ID? IOW, why not make the >>> tier ID space wider and have the kernel pick a few spread out defaults >>> based on known hardware, with plenty of headroom to be future proof. >>> >>> $ ls tiers >>> 100 # DEFAULT_TIER >>> $ cat tiers/100/nodelist >>> 0-1 # conventional numa nodes >>> >>> <pmem is onlined> >>> >>> $ grep . tiers/*/nodelist >>> tiers/100/nodelist:0-1 # conventional numa >>> tiers/200/nodelist:2 # pmem >>> >>> $ grep . nodes/*/tier >>> nodes/0/tier:100 >>> nodes/1/tier:100 >>> nodes/2/tier:200 >>> >>> <unknown device is online as node 3, defaults to 100> >>> >>> $ grep . tiers/*/nodelist >>> tiers/100/nodelist:0-1,3 >>> tiers/200/nodelist:2 >>> >>> $ echo 300 >nodes/3/tier >>> $ grep . tiers/*/nodelist >>> tiers/100/nodelist:0-1 >>> tiers/200/nodelist:2 >>> tiers/300/nodelist:3 >>> >>> $ echo 200 >nodes/3/tier >>> $ grep . tiers/*/nodelist >>> tiers/100/nodelist:0-1 >>> tiers/200/nodelist:2-3 >>> >>> etc. >> >> tier ID is also used as device id memtier.dev.id. It was discussed that we >> would need the ability to change the rank value of a memory tier. If we make >> rank value same as tier ID or tier device id, we will not be able to support >> that. > > Is the idea that you could change the rank of a collection of nodes in > one go? Rather than moving the nodes one by one into a new tier? > > [ Sorry, I wasn't able to find this discussion. AFAICS the first > patches in RFC4 already had the struct device { .id = tier } > logic. Could you point me to it? In general it would be really > helpful to maintain summarized rationales for such decisions in the > coverletter to make sure things don't get lost over many, many > threads, conferences, and video calls. 
] Most of the discussion did not happen in the patch review email threads. RFC: Memory Tiering Kernel Interfaces (v2) https://lore.kernel.org/linux-mm/CAAPL-u_diGYEb7+WsgqNBLRix-nRCk2SsDj6p9r8j5JZwOABZQ@mail.gmail.com RFC: Memory Tiering Kernel Interfaces (v4) https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com -aneesh
On Wed, 2022-06-08 at 09:37 -0700, Yang Shi wrote: > On Tue, Jun 7, 2022 at 6:34 PM Ying Huang <ying.huang@intel.com> wrote: > > > > On Tue, 2022-06-07 at 14:32 -0700, Yang Shi wrote: > > > On Fri, Jun 3, 2022 at 6:43 AM Aneesh Kumar K.V > > > <aneesh.kumar@linux.ibm.com> wrote: > > > > > > > > In the current kernel, memory tiers are defined implicitly via a > > > > demotion path relationship between NUMA nodes, which is created > > > > during the kernel initialization and updated when a NUMA node is > > > > hot-added or hot-removed. The current implementation puts all > > > > nodes with CPU into the top tier, and builds the tier hierarchy > > > > tier-by-tier by establishing the per-node demotion targets based > > > > on the distances between nodes. > > > > > > > > This current memory tier kernel interface needs to be improved for > > > > several important use cases, > > > > > > > > The current tier initialization code always initializes > > > > each memory-only NUMA node into a lower tier. But a memory-only > > > > NUMA node may have a high performance memory device (e.g. a DRAM > > > > device attached via CXL.mem or a DRAM-backed memory-only node on > > > > a virtual machine) and should be put into a higher tier. > > > > > > > > The current tier hierarchy always puts CPU nodes into the top > > > > tier. But on a system with HBM or GPU devices, the > > > > memory-only NUMA nodes mapping these devices should be in the > > > > top tier, and DRAM nodes with CPUs are better to be placed into the > > > > next lower tier. > > > > > > > > With current kernel higher tier node can only be demoted to selected nodes on the > > > > next lower tier as defined by the demotion path, not any other > > > > node from any lower tier. This strict, hard-coded demotion order > > > > does not work in all use cases (e.g. some use cases may want to > > > > allow cross-socket demotion to another node in the same demotion > > > > tier as a fallback when the preferred demotion node is out of > > > > space), This demotion order is also inconsistent with the page > > > > allocation fallback order when all the nodes in a higher tier are > > > > out of space: The page allocation can fall back to any node from > > > > any lower tier, whereas the demotion order doesn't allow that. > > > > > > > > The current kernel also don't provide any interfaces for the > > > > userspace to learn about the memory tier hierarchy in order to > > > > optimize its memory allocations. > > > > > > > > This patch series address the above by defining memory tiers explicitly. > > > > > > > > This patch introduce explicity memory tiers with ranks. The rank > > > > value of a memory tier is used to derive the demotion order between > > > > NUMA nodes. The memory tiers present in a system can be found at > > > > > > > > /sys/devices/system/memtier/memtierN/ > > > > > > > > The nodes which are part of a specific memory tier can be listed > > > > via > > > > /sys/devices/system/memtier/memtierN/nodelist > > > > > > > > "Rank" is an opaque value. Its absolute value doesn't have any > > > > special meaning. But the rank values of different memtiers can be > > > > compared with each other to determine the memory tier order. > > > > > > > > For example, if we have 3 memtiers: memtier0, memtier1, memiter2, and > > > > their rank values are 300, 200, 100, then the memory tier order is: > > > > memtier0 -> memtier2 -> memtier1, where memtier0 is the highest tier > > > > and memtier1 is the lowest tier. 
> > > > > > > > The rank value of each memtier should be unique. > > > > > > > > A higher rank memory tier will appear first in the demotion order > > > > than a lower rank memory tier. ie. while reclaim we choose a node > > > > in higher rank memory tier to demote pages to as compared to a node > > > > in a lower rank memory tier. > > > > > > > > For now we are not adding the dynamic number of memory tiers. > > > > But a future series supporting that is possible. Currently > > > > number of tiers supported is limitted to MAX_MEMORY_TIERS(3). > > > > When doing memory hotplug, if not added to a memory tier, the NUMA > > > > node gets added to DEFAULT_MEMORY_TIER(1). > > > > > > > > This patch is based on the proposal sent by Wei Xu <weixugc@google.com> at [1]. > > > > > > > > [1] https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com > > > > > > > > Suggested-by: Wei Xu <weixugc@google.com> > > > > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com> > > > > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> > > > > --- > > > > include/linux/memory-tiers.h | 20 ++++ > > > > mm/Kconfig | 11 ++ > > > > mm/Makefile | 1 + > > > > mm/memory-tiers.c | 188 +++++++++++++++++++++++++++++++++++ > > > > 4 files changed, 220 insertions(+) > > > > create mode 100644 include/linux/memory-tiers.h > > > > create mode 100644 mm/memory-tiers.c > > > > > > > > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h > > > > new file mode 100644 > > > > index 000000000000..e17f6b4ee177 > > > > --- /dev/null > > > > +++ b/include/linux/memory-tiers.h > > > > @@ -0,0 +1,20 @@ > > > > +/* SPDX-License-Identifier: GPL-2.0 */ > > > > +#ifndef _LINUX_MEMORY_TIERS_H > > > > +#define _LINUX_MEMORY_TIERS_H > > > > + > > > > +#ifdef CONFIG_TIERED_MEMORY > > > > + > > > > +#define MEMORY_TIER_HBM_GPU 0 > > > > +#define MEMORY_TIER_DRAM 1 > > > > +#define MEMORY_TIER_PMEM 2 > > > > + > > > > +#define MEMORY_RANK_HBM_GPU 300 > > > > +#define MEMORY_RANK_DRAM 200 > > > > +#define MEMORY_RANK_PMEM 100 > > > > + > > > > +#define DEFAULT_MEMORY_TIER MEMORY_TIER_DRAM > > > > +#define MAX_MEMORY_TIERS 3 > > > > + > > > > +#endif /* CONFIG_TIERED_MEMORY */ > > > > + > > > > +#endif > > > > diff --git a/mm/Kconfig b/mm/Kconfig > > > > index 169e64192e48..08a3d330740b 100644 > > > > --- a/mm/Kconfig > > > > +++ b/mm/Kconfig > > > > @@ -614,6 +614,17 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION > > > > config ARCH_ENABLE_THP_MIGRATION > > > > bool > > > > > > > > +config TIERED_MEMORY > > > > + bool "Support for explicit memory tiers" > > > > + def_bool n > > > > + depends on MIGRATION && NUMA > > > > + help > > > > + Support to split nodes into memory tiers explicitly and > > > > + to demote pages on reclaim to lower tiers. This option > > > > + also exposes sysfs interface to read nodes available in > > > > + specific tier and to move specific node among different > > > > + possible tiers. > > > > > > IMHO we should not need a new kernel config. If tiering is not present > > > then there is just one tier on the system. And tiering is a kind of > > > hardware configuration, the information could be shown regardless of > > > whether demotion/promotion is supported/enabled or not. > > > > I think so too. At least it appears unnecessary to let the user turn > > on/off it at configuration time. > > > > All the code should be enclosed by #if defined(CONFIG_NUMA) && > > defined(CONIFIG_MIGRATION). So we will not waste memory in small > > systems. 
> > CONFIG_NUMA alone should be good enough. CONFIG_MIGRATION is enabled > by default if NUMA is enabled. And MIGRATION is just used to support > demotion/promotion. Memory tiers exist even though demotion/promotion > is not supported, right? Yes. You are right. For example, in the following patch, memory tiers are used for allocation interleaving. https://lore.kernel.org/lkml/20220607171949.85796-1-hannes@cmpxchg.org/ Best Regards, Huang, Ying > > > > > > + > > > > config HUGETLB_PAGE_SIZE_VARIABLE > > > > def_bool n > > > > help > > > > diff --git a/mm/Makefile b/mm/Makefile > > > > index 6f9ffa968a1a..482557fbc9d1 100644 > > > > --- a/mm/Makefile > > > > +++ b/mm/Makefile > > > > @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/ > > > > obj-$(CONFIG_FAILSLAB) += failslab.o > > > > obj-$(CONFIG_MEMTEST) += memtest.o > > > > obj-$(CONFIG_MIGRATION) += migrate.o > > > > +obj-$(CONFIG_TIERED_MEMORY) += memory-tiers.o > > > > obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o > > > > obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o > > > > obj-$(CONFIG_PAGE_COUNTER) += page_counter.o > > > > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c > > > > new file mode 100644 > > > > index 000000000000..7de18d94a08d > > > > --- /dev/null > > > > +++ b/mm/memory-tiers.c > > > > @@ -0,0 +1,188 @@ > > > > +// SPDX-License-Identifier: GPL-2.0 > > > > +#include <linux/types.h> > > > > +#include <linux/device.h> > > > > +#include <linux/nodemask.h> > > > > +#include <linux/slab.h> > > > > +#include <linux/memory-tiers.h> > > > > + > > > > +struct memory_tier { > > > > + struct list_head list; > > > > + struct device dev; > > > > + nodemask_t nodelist; > > > > + int rank; > > > > +}; > > > > + > > > > +#define to_memory_tier(device) container_of(device, struct memory_tier, dev) > > > > + > > > > +static struct bus_type memory_tier_subsys = { > > > > + .name = "memtier", > > > > + .dev_name = "memtier", > > > > +}; > > > > + > > > > +static DEFINE_MUTEX(memory_tier_lock); > > > > +static LIST_HEAD(memory_tiers); > > > > + > > > > + > > > > +static ssize_t nodelist_show(struct device *dev, > > > > + struct device_attribute *attr, char *buf) > > > > +{ > > > > + struct memory_tier *memtier = to_memory_tier(dev); > > > > + > > > > + return sysfs_emit(buf, "%*pbl\n", > > > > + nodemask_pr_args(&memtier->nodelist)); > > > > +} > > > > +static DEVICE_ATTR_RO(nodelist); > > > > + > > > > +static ssize_t rank_show(struct device *dev, > > > > + struct device_attribute *attr, char *buf) > > > > +{ > > > > + struct memory_tier *memtier = to_memory_tier(dev); > > > > + > > > > + return sysfs_emit(buf, "%d\n", memtier->rank); > > > > +} > > > > +static DEVICE_ATTR_RO(rank); > > > > + > > > > +static struct attribute *memory_tier_dev_attrs[] = { > > > > + &dev_attr_nodelist.attr, > > > > + &dev_attr_rank.attr, > > > > + NULL > > > > +}; > > > > + > > > > +static const struct attribute_group memory_tier_dev_group = { > > > > + .attrs = memory_tier_dev_attrs, > > > > +}; > > > > + > > > > +static const struct attribute_group *memory_tier_dev_groups[] = { > > > > + &memory_tier_dev_group, > > > > + NULL > > > > +}; > > > > + > > > > +static void memory_tier_device_release(struct device *dev) > > > > +{ > > > > + struct memory_tier *tier = to_memory_tier(dev); > > > > + > > > > + kfree(tier); > > > > +} > > > > + > > > > +/* > > > > + * Keep it simple by having direct mapping between > > > > + * tier index and rank value. 
> > > > + */ > > > > +static inline int get_rank_from_tier(unsigned int tier) > > > > +{ > > > > + switch (tier) { > > > > + case MEMORY_TIER_HBM_GPU: > > > > + return MEMORY_RANK_HBM_GPU; > > > > + case MEMORY_TIER_DRAM: > > > > + return MEMORY_RANK_DRAM; > > > > + case MEMORY_TIER_PMEM: > > > > + return MEMORY_RANK_PMEM; > > > > + } > > > > + > > > > + return 0; > > > > +} > > > > + > > > > +static void insert_memory_tier(struct memory_tier *memtier) > > > > +{ > > > > + struct list_head *ent; > > > > + struct memory_tier *tmp_memtier; > > > > + > > > > + list_for_each(ent, &memory_tiers) { > > > > + tmp_memtier = list_entry(ent, struct memory_tier, list); > > > > + if (tmp_memtier->rank < memtier->rank) { > > > > + list_add_tail(&memtier->list, ent); > > > > + return; > > > > + } > > > > + } > > > > + list_add_tail(&memtier->list, &memory_tiers); > > > > +} > > > > + > > > > +static struct memory_tier *register_memory_tier(unsigned int tier) > > > > +{ > > > > + int error; > > > > + struct memory_tier *memtier; > > > > + > > > > + if (tier >= MAX_MEMORY_TIERS) > > > > + return NULL; > > > > + > > > > + memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL); > > > > + if (!memtier) > > > > + return NULL; > > > > + > > > > + memtier->dev.id = tier; > > > > + memtier->rank = get_rank_from_tier(tier); > > > > + memtier->dev.bus = &memory_tier_subsys; > > > > + memtier->dev.release = memory_tier_device_release; > > > > + memtier->dev.groups = memory_tier_dev_groups; > > > > + > > > > + insert_memory_tier(memtier); > > > > + > > > > + error = device_register(&memtier->dev); > > > > + if (error) { > > > > + list_del(&memtier->list); > > > > + put_device(&memtier->dev); > > > > + return NULL; > > > > + } > > > > + return memtier; > > > > +} > > > > + > > > > +__maybe_unused // temporay to prevent warnings during bisects > > > > +static void unregister_memory_tier(struct memory_tier *memtier) > > > > +{ > > > > + list_del(&memtier->list); > > > > + device_unregister(&memtier->dev); > > > > +} > > > > + > > > > +static ssize_t > > > > +max_tier_show(struct device *dev, struct device_attribute *attr, char *buf) > > > > +{ > > > > + return sysfs_emit(buf, "%d\n", MAX_MEMORY_TIERS); > > > > +} > > > > +static DEVICE_ATTR_RO(max_tier); > > > > + > > > > +static ssize_t > > > > +default_tier_show(struct device *dev, struct device_attribute *attr, char *buf) > > > > +{ > > > > + return sysfs_emit(buf, "memtier%d\n", DEFAULT_MEMORY_TIER); > > > > +} > > > > +static DEVICE_ATTR_RO(default_tier); > > > > + > > > > +static struct attribute *memory_tier_attrs[] = { > > > > + &dev_attr_max_tier.attr, > > > > + &dev_attr_default_tier.attr, > > > > + NULL > > > > +}; > > > > + > > > > +static const struct attribute_group memory_tier_attr_group = { > > > > + .attrs = memory_tier_attrs, > > > > +}; > > > > + > > > > +static const struct attribute_group *memory_tier_attr_groups[] = { > > > > + &memory_tier_attr_group, > > > > + NULL, > > > > +}; > > > > + > > > > +static int __init memory_tier_init(void) > > > > +{ > > > > + int ret; > > > > + struct memory_tier *memtier; > > > > + > > > > + ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups); > > > > + if (ret) > > > > + panic("%s() failed to register subsystem: %d\n", __func__, ret); > > > > + > > > > + /* > > > > + * Register only default memory tier to hide all empty > > > > + * memory tier from sysfs. 
> > > > + */ > > > > + memtier = register_memory_tier(DEFAULT_MEMORY_TIER); > > > > + if (!memtier) > > > > + panic("%s() failed to register memory tier: %d\n", __func__, ret); > > > > + > > > > + /* CPU only nodes are not part of memory tiers. */ > > > > + memtier->nodelist = node_states[N_MEMORY]; > > > > + > > > > + return 0; > > > > +} > > > > +subsys_initcall(memory_tier_init); > > > > + > > > > -- > > > > 2.36.1 > > > > > > > >
On 6/8/22 10:12 PM, Yang Shi wrote: > On Tue, Jun 7, 2022 at 9:58 PM Aneesh Kumar K V > <aneesh.kumar@linux.ibm.com> wrote: .... >> config TIERED_MEMORY >> bool "Support for explicit memory tiers" >> - def_bool n >> - depends on MIGRATION && NUMA >> - help >> - Support to split nodes into memory tiers explicitly and >> - to demote pages on reclaim to lower tiers. This option >> - also exposes sysfs interface to read nodes available in >> - specific tier and to move specific node among different >> - possible tiers. >> + def_bool MIGRATION && NUMA > > CONFIG_NUMA should be good enough. Memory tiering doesn't have to mean > demotion/promotion has to be supported IMHO. > >> >> config HUGETLB_PAGE_SIZE_VARIABLE >> def_bool n >> >> ie, we just make it a Kconfig variable without exposing it to the user? >> We can do that, but in order to avoid building the demotion targets etc. we would then have to add multiple #ifdef CONFIG_MIGRATION blocks in mm/memory-tiers.c. The file builds without those #ifdefs, so these are not build errors; rather, we would be building all the demotion targets with no real use for them. What use case do you have for exposing memory tiers on a system with CONFIG_MIGRATION disabled? CONFIG_MIGRATION gets enabled in almost all configs these days due to its dependency on COMPACTION and TRANSPARENT_HUGEPAGE. Unless there is a real need, I am wondering if we can avoid sprinkling #ifdef CONFIG_MIGRATION in mm/memory-tiers.c. -aneesh
On Thu, Jun 09, 2022 at 08:03:26AM +0530, Aneesh Kumar K V wrote: > On 6/8/22 11:46 PM, Johannes Weiner wrote: > > On Wed, Jun 08, 2022 at 09:43:52PM +0530, Aneesh Kumar K V wrote: > > > On 6/8/22 9:25 PM, Johannes Weiner wrote: > > > > Hello, > > > > > > > > On Wed, Jun 08, 2022 at 10:11:31AM -0400, Johannes Weiner wrote: > > > > > On Fri, Jun 03, 2022 at 07:12:29PM +0530, Aneesh Kumar K.V wrote: > > > > > > @@ -0,0 +1,20 @@ > > > > > > +/* SPDX-License-Identifier: GPL-2.0 */ > > > > > > +#ifndef _LINUX_MEMORY_TIERS_H > > > > > > +#define _LINUX_MEMORY_TIERS_H > > > > > > + > > > > > > +#ifdef CONFIG_TIERED_MEMORY > > > > > > + > > > > > > +#define MEMORY_TIER_HBM_GPU 0 > > > > > > +#define MEMORY_TIER_DRAM 1 > > > > > > +#define MEMORY_TIER_PMEM 2 > > > > > > + > > > > > > +#define MEMORY_RANK_HBM_GPU 300 > > > > > > +#define MEMORY_RANK_DRAM 200 > > > > > > +#define MEMORY_RANK_PMEM 100 > > > > > > + > > > > > > +#define DEFAULT_MEMORY_TIER MEMORY_TIER_DRAM > > > > > > +#define MAX_MEMORY_TIERS 3 > > > > > > > > > > I understand the names are somewhat arbitrary, and the tier ID space > > > > > can be expanded down the line by bumping MAX_MEMORY_TIERS. > > > > > > > > > > But starting out with a packed ID space can get quite awkward for > > > > > users when new tiers - especially intermediate tiers - show up in > > > > > existing configurations. I mentioned in the other email that DRAM != > > > > > DRAM, so new tiers seem inevitable already. > > > > > > > > > > It could make sense to start with a bigger address space and spread > > > > > out the list of kernel default tiers a bit within it: > > > > > > > > > > MEMORY_TIER_GPU 0 > > > > > MEMORY_TIER_DRAM 10 > > > > > MEMORY_TIER_PMEM 20 > > > > > > > > Forgive me if I'm asking a question that has been answered. I went > > > > back to earlier threads and couldn't work it out - maybe there were > > > > some off-list discussions? Anyway... > > > > > > > > Why is there a distinction between tier ID and rank? I undestand that > > > > rank was added because tier IDs were too few. But if rank determines > > > > ordering, what is the use of a separate tier ID? IOW, why not make the > > > > tier ID space wider and have the kernel pick a few spread out defaults > > > > based on known hardware, with plenty of headroom to be future proof. > > > > > > > > $ ls tiers > > > > 100 # DEFAULT_TIER > > > > $ cat tiers/100/nodelist > > > > 0-1 # conventional numa nodes > > > > > > > > <pmem is onlined> > > > > > > > > $ grep . tiers/*/nodelist > > > > tiers/100/nodelist:0-1 # conventional numa > > > > tiers/200/nodelist:2 # pmem > > > > > > > > $ grep . nodes/*/tier > > > > nodes/0/tier:100 > > > > nodes/1/tier:100 > > > > nodes/2/tier:200 > > > > > > > > <unknown device is online as node 3, defaults to 100> > > > > > > > > $ grep . tiers/*/nodelist > > > > tiers/100/nodelist:0-1,3 > > > > tiers/200/nodelist:2 > > > > > > > > $ echo 300 >nodes/3/tier > > > > $ grep . tiers/*/nodelist > > > > tiers/100/nodelist:0-1 > > > > tiers/200/nodelist:2 > > > > tiers/300/nodelist:3 > > > > > > > > $ echo 200 >nodes/3/tier > > > > $ grep . tiers/*/nodelist > > > > tiers/100/nodelist:0-1 > > > > tiers/200/nodelist:2-3 > > > > > > > > etc. > > > > > > tier ID is also used as device id memtier.dev.id. It was discussed that we > > > would need the ability to change the rank value of a memory tier. If we make > > > rank value same as tier ID or tier device id, we will not be able to support > > > that. 
> > > > Is the idea that you could change the rank of a collection of nodes in > > one go? Rather than moving the nodes one by one into a new tier? > > > > [ Sorry, I wasn't able to find this discussion. AFAICS the first > > patches in RFC4 already had the struct device { .id = tier } > > logic. Could you point me to it? In general it would be really > > helpful to maintain summarized rationales for such decisions in the > > coverletter to make sure things don't get lost over many, many > > threads, conferences, and video calls. ] > > Most of the discussion happened not int he patch review email threads. > > RFC: Memory Tiering Kernel Interfaces (v2) > https://lore.kernel.org/linux-mm/CAAPL-u_diGYEb7+WsgqNBLRix-nRCk2SsDj6p9r8j5JZwOABZQ@mail.gmail.com > > RFC: Memory Tiering Kernel Interfaces (v4) > https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com I read the RFCs, the discussions and your code. It's still not clear why the tier/device ID and the rank need to be two separate, user-visible things. There is only one tier of a given rank, why can't the rank be the unique device id? dev->id = 100. One number. Or use a unique device id allocator if large numbers are causing problems internally. But I don't see an explanation why they need to be two different things, let alone two different things in the user ABI.
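For comparison, a minimal sketch of the alternative being asked about here (illustrative only; this is not what the posted series implements): the rank is the device id, so a single number both identifies and orders the tier.

/*
 * Hypothetical alternative, reusing the definitions from the patch in
 * mm/memory-tiers.c: one number serves as both the sysfs name
 * (memtier<rank>) and the sort key.  Contrast with the posted series,
 * where dev.id is a small packed tier index and rank is separate.
 */
static struct memory_tier *register_memory_tier_by_rank(unsigned int rank)
{
	struct memory_tier *memtier;

	memtier = kzalloc(sizeof(*memtier), GFP_KERNEL);
	if (!memtier)
		return NULL;

	memtier->dev.id = rank;		/* rank is the device id ... */
	memtier->rank = rank;		/* ... and the ordering key  */
	memtier->dev.bus = &memory_tier_subsys;
	memtier->dev.release = memory_tier_device_release;
	memtier->dev.groups = memory_tier_dev_groups;

	insert_memory_tier(memtier);
	if (device_register(&memtier->dev)) {
		list_del(&memtier->list);
		put_device(&memtier->dev);
		return NULL;
	}
	return memtier;
}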
On Thu, 9 Jun 2022 09:55:45 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote: > On Thu, Jun 09, 2022 at 08:03:26AM +0530, Aneesh Kumar K V wrote: > > On 6/8/22 11:46 PM, Johannes Weiner wrote: > > > On Wed, Jun 08, 2022 at 09:43:52PM +0530, Aneesh Kumar K V wrote: > > > > On 6/8/22 9:25 PM, Johannes Weiner wrote: > > > > > Hello, > > > > > > > > > > On Wed, Jun 08, 2022 at 10:11:31AM -0400, Johannes Weiner wrote: > > > > > > On Fri, Jun 03, 2022 at 07:12:29PM +0530, Aneesh Kumar K.V wrote: > > > > > > > @@ -0,0 +1,20 @@ > > > > > > > +/* SPDX-License-Identifier: GPL-2.0 */ > > > > > > > +#ifndef _LINUX_MEMORY_TIERS_H > > > > > > > +#define _LINUX_MEMORY_TIERS_H > > > > > > > + > > > > > > > +#ifdef CONFIG_TIERED_MEMORY > > > > > > > + > > > > > > > +#define MEMORY_TIER_HBM_GPU 0 > > > > > > > +#define MEMORY_TIER_DRAM 1 > > > > > > > +#define MEMORY_TIER_PMEM 2 > > > > > > > + > > > > > > > +#define MEMORY_RANK_HBM_GPU 300 > > > > > > > +#define MEMORY_RANK_DRAM 200 > > > > > > > +#define MEMORY_RANK_PMEM 100 > > > > > > > + > > > > > > > +#define DEFAULT_MEMORY_TIER MEMORY_TIER_DRAM > > > > > > > +#define MAX_MEMORY_TIERS 3 > > > > > > > > > > > > I understand the names are somewhat arbitrary, and the tier ID space > > > > > > can be expanded down the line by bumping MAX_MEMORY_TIERS. > > > > > > > > > > > > But starting out with a packed ID space can get quite awkward for > > > > > > users when new tiers - especially intermediate tiers - show up in > > > > > > existing configurations. I mentioned in the other email that DRAM != > > > > > > DRAM, so new tiers seem inevitable already. > > > > > > > > > > > > It could make sense to start with a bigger address space and spread > > > > > > out the list of kernel default tiers a bit within it: > > > > > > > > > > > > MEMORY_TIER_GPU 0 > > > > > > MEMORY_TIER_DRAM 10 > > > > > > MEMORY_TIER_PMEM 20 > > > > > > > > > > Forgive me if I'm asking a question that has been answered. I went > > > > > back to earlier threads and couldn't work it out - maybe there were > > > > > some off-list discussions? Anyway... > > > > > > > > > > Why is there a distinction between tier ID and rank? I undestand that > > > > > rank was added because tier IDs were too few. But if rank determines > > > > > ordering, what is the use of a separate tier ID? IOW, why not make the > > > > > tier ID space wider and have the kernel pick a few spread out defaults > > > > > based on known hardware, with plenty of headroom to be future proof. > > > > > > > > > > $ ls tiers > > > > > 100 # DEFAULT_TIER > > > > > $ cat tiers/100/nodelist > > > > > 0-1 # conventional numa nodes > > > > > > > > > > <pmem is onlined> > > > > > > > > > > $ grep . tiers/*/nodelist > > > > > tiers/100/nodelist:0-1 # conventional numa > > > > > tiers/200/nodelist:2 # pmem > > > > > > > > > > $ grep . nodes/*/tier > > > > > nodes/0/tier:100 > > > > > nodes/1/tier:100 > > > > > nodes/2/tier:200 > > > > > > > > > > <unknown device is online as node 3, defaults to 100> > > > > > > > > > > $ grep . tiers/*/nodelist > > > > > tiers/100/nodelist:0-1,3 > > > > > tiers/200/nodelist:2 > > > > > > > > > > $ echo 300 >nodes/3/tier > > > > > $ grep . tiers/*/nodelist > > > > > tiers/100/nodelist:0-1 > > > > > tiers/200/nodelist:2 > > > > > tiers/300/nodelist:3 > > > > > > > > > > $ echo 200 >nodes/3/tier > > > > > $ grep . tiers/*/nodelist > > > > > tiers/100/nodelist:0-1 > > > > > tiers/200/nodelist:2-3 > > > > > > > > > > etc. > > > > > > > > tier ID is also used as device id memtier.dev.id. 
It was discussed that we > > > > would need the ability to change the rank value of a memory tier. If we make > > > > rank value same as tier ID or tier device id, we will not be able to support > > > > that. > > > > > > Is the idea that you could change the rank of a collection of nodes in > > > one go? Rather than moving the nodes one by one into a new tier? > > > > > > [ Sorry, I wasn't able to find this discussion. AFAICS the first > > > patches in RFC4 already had the struct device { .id = tier } > > > logic. Could you point me to it? In general it would be really > > > helpful to maintain summarized rationales for such decisions in the > > > coverletter to make sure things don't get lost over many, many > > > threads, conferences, and video calls. ] > > > > Most of the discussion happened not int he patch review email threads. > > > > RFC: Memory Tiering Kernel Interfaces (v2) > > https://lore.kernel.org/linux-mm/CAAPL-u_diGYEb7+WsgqNBLRix-nRCk2SsDj6p9r8j5JZwOABZQ@mail.gmail.com > > > > RFC: Memory Tiering Kernel Interfaces (v4) > > https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com > > I read the RFCs, the discussions and your code. It's still not clear > why the tier/device ID and the rank need to be two separate, > user-visible things. There is only one tier of a given rank, why can't > the rank be the unique device id? dev->id = 100. One number. Or use a > unique device id allocator if large numbers are causing problems > internally. But I don't see an explanation why they need to be two > different things, let alone two different things in the user ABI. I think discussion hinged on it making sense to be able to change rank of a tier rather than create a new tier and move things one by one. Example was wanting to change the rank of a tier that was created either by core code or a subsystem. E.g. If GPU driver creates a tier, assumption is all similar GPUs will default to the same tier (if hot plugged later for example) as the driver subsystem will keep a reference to the created tier. Hence if user wants to change the order of that relative to other tiers, the option of creating a new tier and moving the devices would then require us to have infrastructure to tell the GPU driver to now use the new tier for additional devices. Or we could go with new nodes are not assigned to a tier and userspace is always responsible for that assignment. That may be a problem for anything relying on existing behavior. Means that there must always be a sensible userspace script... Jonathan
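A driver-side sketch of the scenario described here (hypothetical: the gpu_* names are invented, and it assumes the tier registration function from the patch were made available to drivers, which the posted series does not do): the driver creates the tier once and keeps the pointer, so later hotplugged devices join the same tier, and changing that tier's rank reorders all of them at once.

/*
 * Hypothetical driver-side sketch.  The driver caches the tier it
 * created; every hotplugged node of the same device class is added
 * to that tier's nodelist rather than to a new tier.
 */
static struct memory_tier *gpu_memtier;

static int gpu_mem_online_node(int nid)
{
	if (!gpu_memtier) {
		gpu_memtier = register_memory_tier(MEMORY_TIER_HBM_GPU);
		if (!gpu_memtier)
			return -ENOMEM;
	}

	node_set(nid, gpu_memtier->nodelist);
	return 0;
}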
On Thu, Jun 9, 2022 at 1:18 AM Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> wrote: > > On 6/8/22 10:12 PM, Yang Shi wrote: > > On Tue, Jun 7, 2022 at 9:58 PM Aneesh Kumar K V > > <aneesh.kumar@linux.ibm.com> wrote: > > .... > > >> config TIERED_MEMORY > >> bool "Support for explicit memory tiers" > >> - def_bool n > >> - depends on MIGRATION && NUMA > >> - help > >> - Support to split nodes into memory tiers explicitly and > >> - to demote pages on reclaim to lower tiers. This option > >> - also exposes sysfs interface to read nodes available in > >> - specific tier and to move specific node among different > >> - possible tiers. > >> + def_bool MIGRATION && NUMA > > > > CONFIG_NUMA should be good enough. Memory tiering doesn't have to mean > > demotion/promotion has to be supported IMHO. > > > >> > >> config HUGETLB_PAGE_SIZE_VARIABLE > >> def_bool n > >> > >> ie, we just make it a Kconfig variable without exposing it to the user? > >> > > We can do that but that would also mean in order to avoid building the > demotion targets etc we will now have to have multiple #ifdef > CONFIG_MIGRATION in mm/memory-tiers.c . It builds without those #ifdef > So these are not really build errors, but rather we will be building all > the demotion targets for no real use with them. Can we have default demotion targets for !MIGRATION? For example, all demotion targets are -1. > > What usecase do you have to expose memory tiers on a system with > CONFIG_MIGRATION disabled? CONFIG_MIGRATION gets enabled in almost all > configs these days due to its dependency against COMPACTION and > TRANSPARENT_HUGEPAGE. Johannes's interleave series is an example, https://lore.kernel.org/lkml/20220607171949.85796-1-hannes@cmpxchg.org/ It doesn't do any demotion/promotion, just make allocations interleave on different tiers. > > Unless there is a real need, I am wondering if we can avoid sprinkling > #ifdef CONFIG_MIGRATION in mm/memory-tiers.c > > -aneesh
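A small sketch of the "#ifdef confinement" being discussed here, assuming the suggested default of -1 (NUMA_NO_NODE) demotion targets when CONFIG_MIGRATION is off. The function names are illustrative and not taken from the series; the point is only that the #ifdef can live in a header as inline stubs rather than being sprinkled through mm/memory-tiers.c.

/*
 * Sketch only: header stubs so the .c file needs no #ifdef CONFIG_MIGRATION.
 */
#include <linux/numa.h>

#ifdef CONFIG_MIGRATION
int next_demotion_node(int node);
void establish_demotion_targets(void);
#else
static inline int next_demotion_node(int node)
{
        return NUMA_NO_NODE;    /* i.e. -1: no demotion target */
}
static inline void establish_demotion_targets(void)
{
}
#endif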
On Thu, Jun 09, 2022 at 03:22:43PM +0100, Jonathan Cameron wrote: > I think discussion hinged on it making sense to be able to change > rank of a tier rather than create a new tier and move things one by one. > Example was wanting to change the rank of a tier that was created > either by core code or a subsystem. > > E.g. If GPU driver creates a tier, assumption is all similar GPUs will > default to the same tier (if hot plugged later for example) as the > driver subsystem will keep a reference to the created tier. > Hence if user wants to change the order of that relative to > other tiers, the option of creating a new tier and moving the > devices would then require us to have infrastructure to tell the GPU > driver to now use the new tier for additional devices. That's an interesting point, thanks for explaining. But that could still happen when two drivers report the same tier and one of them is wrong, right? You'd still need to separate out by hand to adjust rank, as well as handle hotplug events. Driver colllisions are probable with coarse categories like gpu, dram, pmem. Would it make more sense to have the platform/devicetree/driver provide more fine-grained distance values similar to NUMA distances, and have a driver-scope tunable to override/correct? And then have the distance value function as the unique tier ID and rank in one. That would allow device class reassignments, too, and it would work with driver collisions where simple "tier stickiness" would not. (Although collisions would be less likely to begin with given a broader range of possible distance values.) Going further, it could be useful to separate the business of hardware properties (and configuring quirks) from the business of configuring MM policies that should be applied to the resulting tier hierarchy. They're somewhat orthogonal tuning tasks, and one of them might become obsolete before the other (if the quality of distance values provided by drivers improves before the quality of MM heuristics ;). Separating them might help clarify the interface for both designers and users. E.g. a memdev class scope with a driver-wide distance value, and a memdev scope for per-device values that default to "inherit driver value". The memtier subtree would then have an r/o structure, but allow tuning per-tier interleaving ratio[1], demotion rules etc. [1] https://lore.kernel.org/linux-mm/20220607171949.85796-1-hannes@cmpxchg.org/#t
On Thu, 2022-06-09 at 16:41 -0400, Johannes Weiner wrote: > On Thu, Jun 09, 2022 at 03:22:43PM +0100, Jonathan Cameron wrote: > > I think discussion hinged on it making sense to be able to change > > rank of a tier rather than create a new tier and move things one by one. > > Example was wanting to change the rank of a tier that was created > > either by core code or a subsystem. > > > > E.g. If GPU driver creates a tier, assumption is all similar GPUs will > > default to the same tier (if hot plugged later for example) as the > > driver subsystem will keep a reference to the created tier. > > Hence if user wants to change the order of that relative to > > other tiers, the option of creating a new tier and moving the > > devices would then require us to have infrastructure to tell the GPU > > driver to now use the new tier for additional devices. > > That's an interesting point, thanks for explaining. I have proposed to use sparse memory tier device ID and remove rank. The response from Wei Xu is as follows, " Using the rank value directly as the device ID has some disadvantages: - It is kind of unconventional to number devices in this way. - We cannot assign DRAM nodes with CPUs with a specific memtier device ID (even though this is not mandated by the "rank" proposal, I expect the device will likely always be memtier1 in practice). - It is possible that we may eventually allow the rank value to be modified as a way to adjust the tier ordering. We cannot do that easily for device IDs. " in https://lore.kernel.org/lkml/CAAPL-u9t=9hYfcXyCZwYFmVTUQGrWVq3cndpN+sqPSm5cwE4Yg@mail.gmail.com/ I think that your proposal below has resolved the latter "disadvantage". So if the former one isn't so important, we can go to remove "rank". That will make memory tier much easier to be understand and use. Best Regards, Huang, Ying > But that could still happen when two drivers report the same tier and > one of them is wrong, right? You'd still need to separate out by hand > to adjust rank, as well as handle hotplug events. Driver colllisions > are probable with coarse categories like gpu, dram, pmem. > > Would it make more sense to have the platform/devicetree/driver > provide more fine-grained distance values similar to NUMA distances, > and have a driver-scope tunable to override/correct? And then have the > distance value function as the unique tier ID and rank in one. > > That would allow device class reassignments, too, and it would work > with driver collisions where simple "tier stickiness" would > not. (Although collisions would be less likely to begin with given a > broader range of possible distance values.) > > Going further, it could be useful to separate the business of hardware > properties (and configuring quirks) from the business of configuring > MM policies that should be applied to the resulting tier hierarchy. > They're somewhat orthogonal tuning tasks, and one of them might become > obsolete before the other (if the quality of distance values provided > by drivers improves before the quality of MM heuristics ;). Separating > them might help clarify the interface for both designers and users. > > E.g. a memdev class scope with a driver-wide distance value, and a > memdev scope for per-device values that default to "inherit driver > value". The memtier subtree would then have an r/o structure, but > allow tuning per-tier interleaving ratio[1], demotion rules etc. > > [1] https://lore.kernel.org/linux-mm/20220607171949.85796-1-hannes@cmpxchg.org/#t
On Thu, 9 Jun 2022 16:41:04 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote: > On Thu, Jun 09, 2022 at 03:22:43PM +0100, Jonathan Cameron wrote: > > I think discussion hinged on it making sense to be able to change > > rank of a tier rather than create a new tier and move things one by one. > > Example was wanting to change the rank of a tier that was created > > either by core code or a subsystem. > > > > E.g. If GPU driver creates a tier, assumption is all similar GPUs will > > default to the same tier (if hot plugged later for example) as the > > driver subsystem will keep a reference to the created tier. > > Hence if user wants to change the order of that relative to > > other tiers, the option of creating a new tier and moving the > > devices would then require us to have infrastructure to tell the GPU > > driver to now use the new tier for additional devices. > > That's an interesting point, thanks for explaining. > > But that could still happen when two drivers report the same tier and > one of them is wrong, right? You'd still need to separate out by hand > to adjust rank, as well as handle hotplug events. Driver colllisions > are probable with coarse categories like gpu, dram, pmem. There will always be cases that need hand tweaking. Also I'd envision some driver subsystems being clever enough to manage several tiers and use the information available to them to assign appropriately. This is definitely true for CXL 2.0+ devices where we can have radically different device types under the same driver (volatile, persistent, direct connect, behind switches etc). There will be some interesting choices to make on groupings in big systems as we don't want too many tiers unless we naturally demote multiple levels in one go.. > > Would it make more sense to have the platform/devicetree/driver > provide more fine-grained distance values similar to NUMA distances, > and have a driver-scope tunable to override/correct? And then have the > distance value function as the unique tier ID and rank in one. Absolutely a good thing to provide that information, but it's black magic. There are too many contradicting metrics (latency vs bandwidth etc) even not including a more complex system model like Jerome Glisse proposed a few years back. https://lore.kernel.org/all/20190118174512.GA3060@redhat.com/ CXL 2.0 got this more right than anything else I've seen as provides discoverable topology along with details like latency to cross between particular switch ports. Actually using that data (other than by throwing it to userspace controls for HPC apps etc) is going to take some figuring out. Even the question of what + how we expose this info to userspace is non obvious. The 'right' decision is also usecase specific, so what you'd do for particular memory characteristics for a GPU are not the same as what you'd do for the same characteristics on a memory only device. > > That would allow device class reassignments, too, and it would work > with driver collisions where simple "tier stickiness" would > not. (Although collisions would be less likely to begin with given a > broader range of possible distance values.) I think we definitely need the option to move individual nodes (in this case nodes map to individual devices if characteristics vary between them) around as well, but I think that's somewhat orthogonal to a good first guess. 
> > Going further, it could be useful to separate the business of hardware > properties (and configuring quirks) from the business of configuring > MM policies that should be applied to the resulting tier hierarchy. > They're somewhat orthogonal tuning tasks, and one of them might become > obsolete before the other (if the quality of distance values provided > by drivers improves before the quality of MM heuristics ;). Separating > them might help clarify the interface for both designers and users. > > E.g. a memdev class scope with a driver-wide distance value, and a > memdev scope for per-device values that default to "inherit driver > value". The memtier subtree would then have an r/o structure, but > allow tuning per-tier interleaving ratio[1], demotion rules etc. Ok that makes sense. I'm not sure if that ends up as an implementation detail, or effects the userspace interface of this particular element. I'm not sure completely read only is flexible enough (though mostly RO is fine) as we keep sketching out cases where any attempt to do things automatically does the wrong thing and where we need to add an extra tier to get everything to work. Short of having a lot of tiers I'm not sure how we could have the default work well. Maybe a lot of "tiers" is fine though perhaps we need to rename them if going this way and then they don't really work as current concept of tier. Imagine a system with subtle difference between different memories such as 10% latency increase for same bandwidth. To get an advantage from demoting to such a tier will require really stable usage and long run times. Whilst you could design a demotion scheme that takes that into account, I think we are a long way from that today. Jonathan > > [1] https://lore.kernel.org/linux-mm/20220607171949.85796-1-hannes@cmpxchg.org/#t
On Fri, Jun 10, 2022 at 10:57:08AM +0100, Jonathan Cameron wrote: > On Thu, 9 Jun 2022 16:41:04 -0400 > Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Thu, Jun 09, 2022 at 03:22:43PM +0100, Jonathan Cameron wrote: > > Would it make more sense to have the platform/devicetree/driver > > provide more fine-grained distance values similar to NUMA distances, > > and have a driver-scope tunable to override/correct? And then have the > > distance value function as the unique tier ID and rank in one. > > Absolutely a good thing to provide that information, but it's black > magic. There are too many contradicting metrics (latency vs bandwidth etc) > even not including a more complex system model like Jerome Glisse proposed > a few years back. https://lore.kernel.org/all/20190118174512.GA3060@redhat.com/ > CXL 2.0 got this more right than anything else I've seen as provides > discoverable topology along with details like latency to cross between > particular switch ports. Actually using that data (other than by throwing > it to userspace controls for HPC apps etc) is going to take some figuring out. > Even the question of what + how we expose this info to userspace is non > obvious. Right, I don't think those would be scientifically accurate - but neither is a number between 1 and 3. The way I look at it is more about spreading out the address space a bit, to allow expressing nuanced differences without risking conflicts and overlaps. Hopefully this results in the shipped values stabilizing over time and thus requiring less and less intervention and overriding from userspace. > > Going further, it could be useful to separate the business of hardware > > properties (and configuring quirks) from the business of configuring > > MM policies that should be applied to the resulting tier hierarchy. > > They're somewhat orthogonal tuning tasks, and one of them might become > > obsolete before the other (if the quality of distance values provided > > by drivers improves before the quality of MM heuristics ;). Separating > > them might help clarify the interface for both designers and users. > > > > E.g. a memdev class scope with a driver-wide distance value, and a > > memdev scope for per-device values that default to "inherit driver > > value". The memtier subtree would then have an r/o structure, but > > allow tuning per-tier interleaving ratio[1], demotion rules etc. > > Ok that makes sense. I'm not sure if that ends up as an implementation > detail, or effects the userspace interface of this particular element. > > I'm not sure completely read only is flexible enough (though mostly RO is fine) > as we keep sketching out cases where any attempt to do things automatically > does the wrong thing and where we need to add an extra tier to get > everything to work. Short of having a lot of tiers I'm not sure how > we could have the default work well. Maybe a lot of "tiers" is fine > though perhaps we need to rename them if going this way and then they > don't really work as current concept of tier. > > Imagine a system with subtle difference between different memories such > as 10% latency increase for same bandwidth. To get an advantage from > demoting to such a tier will require really stable usage and long > run times. Whilst you could design a demotion scheme that takes that > into account, I think we are a long way from that today. Good point: there can be a clear hardware difference, but it's a policy choice whether the MM should treat them as one or two tiers. 
What do you think of a per-driver/per-device (overridable) distance number, combined with a configurable distance cutoff for what constitutes separate tiers? E.g. cutoff=20 means two devices with distances of 10 and 20 respectively would be in the same tier, devices with 10 and 100 would be in separate ones. The kernel then generates the tiers based on distances and the grouping cutoff, and populates the memtier directory tree and nodemasks in sysfs. It could be simple tier0, tier1, tier2 numbering again, but the numbers now would mean something to the user. A rank tunable is no longer necessary. I think even the nodemasks in the memtier tree could be read-only then, since corrections should only be necessary when either the device distance or the tier grouping cutoff is wrong. Can you think of scenarios where that scheme would fall apart?
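To make the cutoff idea concrete, here is a standalone sketch under one possible reading of it: sort devices by distance and start a new tier whenever the gap to the previous device exceeds the cutoff. Every name and number in it is invented.

/*
 * Illustration only, not an implementation from the series.
 */
#include <stdio.h>
#include <stdlib.h>

struct demo_memdev {
        const char *name;
        unsigned long distance; /* driver default, possibly overridden */
        int tier;               /* filled in by demo_assign_tiers() */
};

static int cmp_distance(const void *a, const void *b)
{
        const struct demo_memdev *da = a, *db = b;

        return (da->distance > db->distance) - (da->distance < db->distance);
}

static void demo_assign_tiers(struct demo_memdev *devs, int n, unsigned long cutoff)
{
        int tier = 0;

        qsort(devs, n, sizeof(*devs), cmp_distance);
        for (int i = 0; i < n; i++) {
                if (i && devs[i].distance - devs[i - 1].distance > cutoff)
                        tier++; /* gap larger than cutoff: start a new tier */
                devs[i].tier = tier;
        }
}

int main(void)
{
        struct demo_memdev devs[] = {
                { "hbm", 10 }, { "dram", 20 }, { "cxl-dram", 25 }, { "pmem", 100 },
        };

        demo_assign_tiers(devs, 4, 20);
        for (int i = 0; i < 4; i++)
                printf("%-8s distance=%3lu -> tier%d\n",
                       devs[i].name, devs[i].distance, devs[i].tier);
        /* With cutoff=20: hbm, dram and cxl-dram share tier0, pmem is tier1. */
        return 0;
}

Gap-based clustering is only one reading of the cutoff; fixed buckets (distance divided by cutoff) would behave differently when many devices bridge a wide range, so which rule applies would itself need to be nailed down.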
On 6/13/22 7:35 PM, Johannes Weiner wrote: > On Fri, Jun 10, 2022 at 10:57:08AM +0100, Jonathan Cameron wrote: >> .... >> I'm not sure completely read only is flexible enough (though mostly RO is fine) >> as we keep sketching out cases where any attempt to do things automatically >> does the wrong thing and where we need to add an extra tier to get >> everything to work. Short of having a lot of tiers I'm not sure how >> we could have the default work well. Maybe a lot of "tiers" is fine >> though perhaps we need to rename them if going this way and then they >> don't really work as current concept of tier. >> >> Imagine a system with subtle difference between different memories such >> as 10% latency increase for same bandwidth. To get an advantage from >> demoting to such a tier will require really stable usage and long >> run times. Whilst you could design a demotion scheme that takes that >> into account, I think we are a long way from that today. > > Good point: there can be a clear hardware difference, but it's a > policy choice whether the MM should treat them as one or two tiers. > > What do you think of a per-driver/per-device (overridable) distance > number, combined with a configurable distance cutoff for what > constitutes separate tiers. E.g. cutoff=20 means two devices with > distances of 10 and 20 respectively would be in the same tier, devices > with 10 and 100 would be in separate ones. The kernel then generates > and populates the tiers based on distances and grouping cutoff, and > populates the memtier directory tree and nodemasks in sysfs. > Right now core/generic code doesn't get involved in building tiers. It just defines three tiers where drivers could place the respective devices they manage. The above suggestion would imply we are moving quite a lot of policy decision logic into the generic code?. At some point, we will have to depend on more attributes other than distance(may be HMAT?) and each driver should have the flexibility to place the device it is managing in a specific tier? By then we may decide to support more than 3 static tiers which the core kernel currently does. If the kernel still can't make the right decision, userspace could rearrange them in any order using rank values. Without something like rank, if userspace needs to fix things up, it gets hard with device hotplugging. ie, the userspace policy could be that any new PMEM tier device that is hotplugged, park it with a very low-rank value and hence lowest in demotion order by default. (echo 10 > /sys/devices/system/memtier/memtier2/rank) . After that userspace could selectively move the new devices to the correct memory tier? > It could be simple tier0, tier1, tier2 numbering again, but the > numbers now would mean something to the user. A rank tunable is no > longer necessary. > > I think even the nodemasks in the memtier tree could be read-only > then, since corrections should only be necessary when either the > device distance is wrong or the tier grouping cutoff. > > Can you think of scenarios where that scheme would fall apart? -aneesh
On Mon, Jun 13, 2022 at 07:53:03PM +0530, Aneesh Kumar K V wrote: > On 6/13/22 7:35 PM, Johannes Weiner wrote: > > On Fri, Jun 10, 2022 at 10:57:08AM +0100, Jonathan Cameron wrote: > > > I'm not sure completely read only is flexible enough (though mostly RO is fine) > > > as we keep sketching out cases where any attempt to do things automatically > > > does the wrong thing and where we need to add an extra tier to get > > > everything to work. Short of having a lot of tiers I'm not sure how > > > we could have the default work well. Maybe a lot of "tiers" is fine > > > though perhaps we need to rename them if going this way and then they > > > don't really work as current concept of tier. > > > > > > Imagine a system with subtle difference between different memories such > > > as 10% latency increase for same bandwidth. To get an advantage from > > > demoting to such a tier will require really stable usage and long > > > run times. Whilst you could design a demotion scheme that takes that > > > into account, I think we are a long way from that today. > > > > Good point: there can be a clear hardware difference, but it's a > > policy choice whether the MM should treat them as one or two tiers. > > > > What do you think of a per-driver/per-device (overridable) distance > > number, combined with a configurable distance cutoff for what > > constitutes separate tiers. E.g. cutoff=20 means two devices with > > distances of 10 and 20 respectively would be in the same tier, devices > > with 10 and 100 would be in separate ones. The kernel then generates > > and populates the tiers based on distances and grouping cutoff, and > > populates the memtier directory tree and nodemasks in sysfs. > > > > Right now core/generic code doesn't get involved in building tiers. It just > defines three tiers where drivers could place the respective devices they > manage. The above suggestion would imply we are moving quite a lot of policy > decision logic into the generic code?. No. The driver still chooses its own number, just from a wider range. The only policy in generic code is the distance cutoff for which devices are grouped into tiers together. > At some point, we will have to depend on more attributes other than > distance(may be HMAT?) and each driver should have the flexibility to place > the device it is managing in a specific tier? By then we may decide to > support more than 3 static tiers which the core kernel currently does. If we start with a larger possible range of "distance" values right away, we can still let the drivers ballpark into 3 tiers for now (100, 200, 300). But it will be easier to take additional metrics into account later and fine tune accordingly (120, 260, 90 etc.) without having to update all the other drivers as well. > If the kernel still can't make the right decision, userspace could rearrange > them in any order using rank values. Without something like rank, if > userspace needs to fix things up, it gets hard with device > hotplugging. ie, the userspace policy could be that any new PMEM tier device > that is hotplugged, park it with a very low-rank value and hence lowest in > demotion order by default. (echo 10 > > /sys/devices/system/memtier/memtier2/rank) . After that userspace could > selectively move the new devices to the correct memory tier? I had touched on this in the other email. This doesn't work if two drivers that should have separate policies collide into the same tier - which is very likely with just 3 tiers. 
So it seems to me the main usecase for having a rank tunable falls apart rather quickly until tiers are spaced out more widely. And it does so at the cost of an, IMO, tricky to understand interface. In the other email I had suggested the ability to override not just the per-device distance, but also the driver default for new devices to handle the hotplug situation. This should be less policy than before. Driver default and per-device distances (both overridable) combined with one tunable to set the range of distances that get grouped into tiers. With these parameters alone, you can generate an ordered list of tiers and their devices. The tier numbers make sense, and no rank is needed. Do you still need the ability to move nodes by writing nodemasks? I don't think so. Assuming you would never want to have an actually slower device in a higher tier than a faster device, the only time you'd want to move a device is when the device's distance value is wrong. So you override that (until you update to a fixed kernel).
On Mon, 2022-06-13 at 11:50 -0400, Johannes Weiner wrote: > On Mon, Jun 13, 2022 at 07:53:03PM +0530, Aneesh Kumar K V wrote: > > On 6/13/22 7:35 PM, Johannes Weiner wrote: > > > On Fri, Jun 10, 2022 at 10:57:08AM +0100, Jonathan Cameron wrote: > > > > I'm not sure completely read only is flexible enough (though mostly RO is fine) > > > > as we keep sketching out cases where any attempt to do things automatically > > > > does the wrong thing and where we need to add an extra tier to get > > > > everything to work. Short of having a lot of tiers I'm not sure how > > > > we could have the default work well. Maybe a lot of "tiers" is fine > > > > though perhaps we need to rename them if going this way and then they > > > > don't really work as current concept of tier. > > > > > > > > Imagine a system with subtle difference between different memories such > > > > as 10% latency increase for same bandwidth. To get an advantage from > > > > demoting to such a tier will require really stable usage and long > > > > run times. Whilst you could design a demotion scheme that takes that > > > > into account, I think we are a long way from that today. > > > > > > Good point: there can be a clear hardware difference, but it's a > > > policy choice whether the MM should treat them as one or two tiers. > > > > > > What do you think of a per-driver/per-device (overridable) distance > > > number, combined with a configurable distance cutoff for what > > > constitutes separate tiers. E.g. cutoff=20 means two devices with > > > distances of 10 and 20 respectively would be in the same tier, devices > > > with 10 and 100 would be in separate ones. The kernel then generates > > > and populates the tiers based on distances and grouping cutoff, and > > > populates the memtier directory tree and nodemasks in sysfs. > > > > > > > Right now core/generic code doesn't get involved in building tiers. It just > > defines three tiers where drivers could place the respective devices they > > manage. The above suggestion would imply we are moving quite a lot of policy > > decision logic into the generic code?. > > No. The driver still chooses its own number, just from a wider > range. The only policy in generic code is the distance cutoff for > which devices are grouped into tiers together. > > > At some point, we will have to depend on more attributes other than > > distance(may be HMAT?) and each driver should have the flexibility to place > > the device it is managing in a specific tier? By then we may decide to > > support more than 3 static tiers which the core kernel currently does. > > If we start with a larger possible range of "distance" values right > away, we can still let the drivers ballpark into 3 tiers for now (100, > 200, 300). But it will be easier to take additional metrics into > account later and fine tune accordingly (120, 260, 90 etc.) without > having to update all the other drivers as well. > > > If the kernel still can't make the right decision, userspace could rearrange > > them in any order using rank values. Without something like rank, if > > userspace needs to fix things up, it gets hard with device > > hotplugging. ie, the userspace policy could be that any new PMEM tier device > > that is hotplugged, park it with a very low-rank value and hence lowest in > > demotion order by default. (echo 10 > > > /sys/devices/system/memtier/memtier2/rank) . After that userspace could > > selectively move the new devices to the correct memory tier? > > I had touched on this in the other email. 
> > This doesn't work if two drivers that should have separate policies > collide into the same tier - which is very likely with just 3 tiers. > So it seems to me the main usecase for having a rank tunable falls > apart rather quickly until tiers are spaced out more widely. And it > does so at the cost of an, IMO, tricky to understand interface. > > In the other email I had suggested the ability to override not just > the per-device distance, but also the driver default for new devices > to handle the hotplug situation. > > This should be less policy than before. Driver default and per-device > distances (both overridable) combined with one tunable to set the > range of distances that get grouped into tiers. > > With these parameters alone, you can generate an ordered list of tiers > and their devices. The tier numbers make sense, and no rank is needed. > > Do you still need the ability to move nodes by writing nodemasks? I > don't think so. Assuming you would never want to have an actually > slower device in a higher tier than a faster device, the only time > you'd want to move a device is when the device's distance value is > wrong. So you override that (until you update to a fixed kernel). This sounds good to me. In this way, we override driver parameter instead of memory tiers itself. So I guess when we do that, the memory tier of the NUMA nodes controlled by the driver will be changed. Or all memory tiers will be regenerated? I have a suggestion. Instead of abstract distance number, how about using memory latency and bandwidth directly? These can be gotten from HMAT directly when necessary. Even if they are not available directly, they may be tested at runtime by the drivers. Best Regards, Huang, Ying
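As an illustration of ordering tiers directly by the raw performance numbers rather than an abstract distance, here is a sketch with an arbitrary comparison policy (latency first, bandwidth as a tie-breaker) and made-up names; the units assumed are the nanosecond latency and MB/s bandwidth figures HMAT can provide.

/*
 * Sketch only: compare by the HMAT-style numbers instead of a distance.
 */
struct demo_memperf {
        unsigned int read_lat_ns;       /* lower is better */
        unsigned int read_bw_mbps;      /* higher is better */
};

/* <0: a belongs to a higher (faster) tier than b, >0: lower, 0: same. */
static int demo_perf_cmp(const struct demo_memperf *a,
                         const struct demo_memperf *b)
{
        if (a->read_lat_ns != b->read_lat_ns)
                return a->read_lat_ns < b->read_lat_ns ? -1 : 1;
        if (a->read_bw_mbps != b->read_bw_mbps)
                return a->read_bw_mbps > b->read_bw_mbps ? -1 : 1;
        return 0;
}

How far apart two results must be before they count as separate tiers would still need some cutoff, so this mostly changes what the comparable value is, not how the grouping step works.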
On 6/13/22 9:20 PM, Johannes Weiner wrote: > On Mon, Jun 13, 2022 at 07:53:03PM +0530, Aneesh Kumar K V wrote: >> On 6/13/22 7:35 PM, Johannes Weiner wrote: >>> On Fri, Jun 10, 2022 at 10:57:08AM +0100, Jonathan Cameron wrote: >>>> I'm not sure completely read only is flexible enough (though mostly RO is fine) >>>> as we keep sketching out cases where any attempt to do things automatically >>>> does the wrong thing and where we need to add an extra tier to get >>>> everything to work. Short of having a lot of tiers I'm not sure how >>>> we could have the default work well. Maybe a lot of "tiers" is fine >>>> though perhaps we need to rename them if going this way and then they >>>> don't really work as current concept of tier. >>>> >>>> Imagine a system with subtle difference between different memories such >>>> as 10% latency increase for same bandwidth. To get an advantage from >>>> demoting to such a tier will require really stable usage and long >>>> run times. Whilst you could design a demotion scheme that takes that >>>> into account, I think we are a long way from that today. >>> >>> Good point: there can be a clear hardware difference, but it's a >>> policy choice whether the MM should treat them as one or two tiers. >>> >>> What do you think of a per-driver/per-device (overridable) distance >>> number, combined with a configurable distance cutoff for what >>> constitutes separate tiers. E.g. cutoff=20 means two devices with >>> distances of 10 and 20 respectively would be in the same tier, devices >>> with 10 and 100 would be in separate ones. The kernel then generates >>> and populates the tiers based on distances and grouping cutoff, and >>> populates the memtier directory tree and nodemasks in sysfs. >>> >> >> Right now core/generic code doesn't get involved in building tiers. It just >> defines three tiers where drivers could place the respective devices they >> manage. The above suggestion would imply we are moving quite a lot of policy >> decision logic into the generic code?. > > No. The driver still chooses its own number, just from a wider > range. The only policy in generic code is the distance cutoff for > which devices are grouped into tiers together. > >> At some point, we will have to depend on more attributes other than >> distance(may be HMAT?) and each driver should have the flexibility to place >> the device it is managing in a specific tier? By then we may decide to >> support more than 3 static tiers which the core kernel currently does. > > If we start with a larger possible range of "distance" values right > away, we can still let the drivers ballpark into 3 tiers for now (100, > 200, 300). But it will be easier to take additional metrics into > account later and fine tune accordingly (120, 260, 90 etc.) without > having to update all the other drivers as well. > >> If the kernel still can't make the right decision, userspace could rearrange >> them in any order using rank values. Without something like rank, if >> userspace needs to fix things up, it gets hard with device >> hotplugging. ie, the userspace policy could be that any new PMEM tier device >> that is hotplugged, park it with a very low-rank value and hence lowest in >> demotion order by default. (echo 10 > >> /sys/devices/system/memtier/memtier2/rank) . After that userspace could >> selectively move the new devices to the correct memory tier? > > I had touched on this in the other email. 
> > This doesn't work if two drivers that should have separate policies > collide into the same tier - which is very likely with just 3 tiers. > So it seems to me the main usecase for having a rank tunable falls > apart rather quickly until tiers are spaced out more widely. And it > does so at the cost of an, IMO, tricky to understand interface. > Considering the kernel has a static map for these tiers, how can two drivers end up using the same tier? If a new driver is going to manage a memory device that is of different characteristics than the one managed by dax/kmem, we will end up adding #define MEMORY_TIER_NEW_DEVICE 4 The new driver will never use MEMORY_TIER_PMEM What can happen is two devices that are managed by DAX/kmem that should be in two memory tiers get assigned the same memory tier because the dax/kmem driver added both the device to the same memory tier. In the future we would avoid that by using more device properties like HMAT to create additional memory tiers with different rank values. ie, we would do in the dax/kmem create_tier_from_rank() . > In the other email I had suggested the ability to override not just > the per-device distance, but also the driver default for new devices > to handle the hotplug situation. > I understand that the driver override will be done via module parameters. How will we implement device override? For example in case of dax/kmem driver the device override will be per dax device? What interface will we use to set the override? IIUC in the above proposal the dax/kmem will do node_create_and_set_memory_tier(numa_node, get_device_tier_index(dev_dax)); get_device_tier_index(struct dev_dax *dev) { return dax_kmem_tier_index; // module parameter } Are you suggesting to add a dev_dax property to override the tier defaults? > This should be less policy than before. Driver default and per-device > distances (both overridable) combined with one tunable to set the > range of distances that get grouped into tiers. > Can you elaborate more on how distance value will be used? The device/device NUMA node can have different distance value from other NUMA nodes. How do we group them? for ex: earlier discussion did outline three different topologies. Can you ellaborate how we would end up grouping them using distance? For ex: in the topology below node 2 is at distance 30 from Node0 and 40 from Nodes so how will we classify node 2? Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes. 20 Node 0 (DRAM) ---- Node 1 (DRAM) | \ / | | 30 40 X 40 | 30 | / \ | Node 2 (PMEM) ---- Node 3 (PMEM) 40 node distances: node 0 1 2 3 0 10 20 30 40 1 20 10 40 30 2 30 40 10 40 3 40 30 40 10 Node 0 & 1 are DRAM nodes. Node 2 is a PMEM node and closer to node 0. 20 Node 0 (DRAM) ---- Node 1 (DRAM) | / | 30 / 40 | / Node 2 (PMEM) node distances: node 0 1 2 0 10 20 30 1 20 10 40 2 30 40 10 Node 0 is a DRAM node with CPU. Node 1 is a GPU node. Node 2 is a PMEM node. Node 3 is a large, slow DRAM node without CPU. 100 Node 0 (DRAM) ---- Node 1 (GPU) / | / | /40 |30 120 / | 110 | | / | | Node 2 (PMEM) ---- / | \ / \ 80 \ / ------- Node 3 (Slow DRAM) node distances: node 0 1 2 3 0 10 100 30 40 1 100 10 120 110 2 30 120 10 80 3 40 110 80 10 > With these parameters alone, you can generate an ordered list of tiers > and their devices. The tier numbers make sense, and no rank is needed. > > Do you still need the ability to move nodes by writing nodemasks? I > don't think so. 
Assuming you would never want to have an actually > slower device in a higher tier than a faster device, the only time > you'd want to move a device is when the device's distance value is > wrong. So you override that (until you update to a fixed kernel).
On Mon, 13 Jun 2022 10:05:06 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote: > On Fri, Jun 10, 2022 at 10:57:08AM +0100, Jonathan Cameron wrote: > > On Thu, 9 Jun 2022 16:41:04 -0400 > > Johannes Weiner <hannes@cmpxchg.org> wrote: > > > On Thu, Jun 09, 2022 at 03:22:43PM +0100, Jonathan Cameron wrote: > > > Would it make more sense to have the platform/devicetree/driver > > > provide more fine-grained distance values similar to NUMA distances, > > > and have a driver-scope tunable to override/correct? And then have the > > > distance value function as the unique tier ID and rank in one. > > > > Absolutely a good thing to provide that information, but it's black > > magic. There are too many contradicting metrics (latency vs bandwidth etc) > > even not including a more complex system model like Jerome Glisse proposed > > a few years back. https://lore.kernel.org/all/20190118174512.GA3060@redhat.com/ > > CXL 2.0 got this more right than anything else I've seen as provides > > discoverable topology along with details like latency to cross between > > particular switch ports. Actually using that data (other than by throwing > > it to userspace controls for HPC apps etc) is going to take some figuring out. > > Even the question of what + how we expose this info to userspace is non > > obvious. Was offline for a few days. At risk of splitting a complex thread even more.... > > Right, I don't think those would be scientifically accurate - but > neither is a number between 1 and 3. The 3 tiers in this proposal are just a starting point (and one I'd expect we'll move beyond very quickly) - aim is to define a userspace that is flexible enough, but then only use a tiny bit of that flexibility to get an initial version in place. Even relatively trivial CXL systems will include. 1) Direct connected volatile memory, (similar to a memory only NUMA node / socket) 2) Direct connected non volatile (similar to pmem Numa node, but maybe not similar enough to fuse with socket connected pmem) 3) Switch connected volatile memory (typically a disagregated memory device, so huge, high bandwidth, not great latency) 4) Switch connected non volatile (typically huge, high bandwidth, even wors latency). 5) Much more fun if we care about bandwidth as interleaving going on in hardware across either similar, or mixed sets of switch connected and direct connected. Sure we might fuse some of those. But just the CXL driver is likely to have groups separate enough we want to handle them as 4 tiers and migrate between those tiers... Obviously might want a clever strategy for cold / hot migration! > The way I look at it is more > about spreading out the address space a bit, to allow expressing > nuanced differences without risking conflicts and overlaps. Hopefully > this results in the shipped values stabilizing over time and thus > requiring less and less intervention and overriding from userspace. I don't think they ever will stabilize, because the right answer isn't definable in terms of just one number. We'll end up with the old mess of magic values in SLIT in which systems have been tuned against particular use cases. HMAT was meant to solve that, but not yet clear it it will. > > > > Going further, it could be useful to separate the business of hardware > > > properties (and configuring quirks) from the business of configuring > > > MM policies that should be applied to the resulting tier hierarchy. 
> > > They're somewhat orthogonal tuning tasks, and one of them might become > > > obsolete before the other (if the quality of distance values provided > > > by drivers improves before the quality of MM heuristics ;). Separating > > > them might help clarify the interface for both designers and users. > > > > > > E.g. a memdev class scope with a driver-wide distance value, and a > > > memdev scope for per-device values that default to "inherit driver > > > value". The memtier subtree would then have an r/o structure, but > > > allow tuning per-tier interleaving ratio[1], demotion rules etc. > > > > Ok that makes sense. I'm not sure if that ends up as an implementation > > detail, or effects the userspace interface of this particular element. > > > > I'm not sure completely read only is flexible enough (though mostly RO is fine) > > as we keep sketching out cases where any attempt to do things automatically > > does the wrong thing and where we need to add an extra tier to get > > everything to work. Short of having a lot of tiers I'm not sure how > > we could have the default work well. Maybe a lot of "tiers" is fine > > though perhaps we need to rename them if going this way and then they > > don't really work as current concept of tier. > > > > Imagine a system with subtle difference between different memories such > > as 10% latency increase for same bandwidth. To get an advantage from > > demoting to such a tier will require really stable usage and long > > run times. Whilst you could design a demotion scheme that takes that > > into account, I think we are a long way from that today. > > Good point: there can be a clear hardware difference, but it's a > policy choice whether the MM should treat them as one or two tiers. > > What do you think of a per-driver/per-device (overridable) distance > number, combined with a configurable distance cutoff for what > constitutes separate tiers. E.g. cutoff=20 means two devices with > distances of 10 and 20 respectively would be in the same tier, devices > with 10 and 100 would be in separate ones. The kernel then generates > and populates the tiers based on distances and grouping cutoff, and > populates the memtier directory tree and nodemasks in sysfs. I think we'll need something along those lines, though I was envisioning it sitting at the level of what we do with the tiers, rather than how we create them. So particularly usecases would decide to treat sets of tiers as if they were one. Have enough tiers and we'll end up with k-means or similar to figure out the groupings. Of course there is then a soft of 'tier group for use XX' concept so maybe not much difference until we have a bunch of usecases. > > It could be simple tier0, tier1, tier2 numbering again, but the > numbers now would mean something to the user. A rank tunable is no > longer necessary. This feels like it might make tier assignments a bit less stable and hence run into question of how to hook up accounting. Not my area of expertise though, but it was put forward as one of the reasons we didn't want hotplug to potentially end up shuffling other tiers around. The desire was for a 'stable' entity. Can avoid that with 'space' between them but then we sort of still have rank, just in a form that makes updating it messy (need to create a new tier to do it). > > I think even the nodemasks in the memtier tree could be read-only > then, since corrections should only be necessary when either the > device distance is wrong or the tier grouping cutoff. 
> > Can you think of scenarios where that scheme would fall apart? Simplest (I think) is the GPU one. Often those have very nice memory that we CPU software developers would love to use, but some pesky GPGPU folk think it is for GPU related data. Anyhow, folk who care about GPUs have requested that it be in a tier that is lower rank than main memory. If you just categorize it by performance (from CPUs) then it might well end up elsewhere. These folk do want to demote to CPU attached DRAM though. Which raises the question of 'where is your distance between?' Definitely a policy decision, and one we can't get from perf characteristics. It's a blurry line. There are classes of fairly low spec memory attached accelerators on the horizon. For those, preventing migration to the memory they are associated with might generally not make sense. Tweaking policy by messing with anything that claims to be a distance is a bit nasty as it looks like the SLIT table tuning that still happens. Could have a per device rank though and make it clear this isn't cleanly related to any perf characteristics. So ultimately that moves rank to devices and then we have to put them into nodes. Not sure it gained us much other than seeming more complex to me. Jonathan
On Tue, Jun 14, 2022 at 01:31:37PM +0530, Aneesh Kumar K V wrote: > On 6/13/22 9:20 PM, Johannes Weiner wrote: > > On Mon, Jun 13, 2022 at 07:53:03PM +0530, Aneesh Kumar K V wrote: > >> If the kernel still can't make the right decision, userspace could rearrange > >> them in any order using rank values. Without something like rank, if > >> userspace needs to fix things up, it gets hard with device > >> hotplugging. ie, the userspace policy could be that any new PMEM tier device > >> that is hotplugged, park it with a very low-rank value and hence lowest in > >> demotion order by default. (echo 10 > > >> /sys/devices/system/memtier/memtier2/rank) . After that userspace could > >> selectively move the new devices to the correct memory tier? > > > > I had touched on this in the other email. > > > > This doesn't work if two drivers that should have separate policies > > collide into the same tier - which is very likely with just 3 tiers. > > So it seems to me the main usecase for having a rank tunable falls > > apart rather quickly until tiers are spaced out more widely. And it > > does so at the cost of an, IMO, tricky to understand interface. > > > > Considering the kernel has a static map for these tiers, how can two drivers > end up using the same tier? If a new driver is going to manage a memory > device that is of different characteristics than the one managed by dax/kmem, > we will end up adding > > #define MEMORY_TIER_NEW_DEVICE 4 > > The new driver will never use MEMORY_TIER_PMEM > > What can happen is two devices that are managed by DAX/kmem that > should be in two memory tiers get assigned the same memory tier > because the dax/kmem driver added both the device to the same memory tier. > > In the future we would avoid that by using more device properties like HMAT > to create additional memory tiers with different rank values. ie, we would > do in the dax/kmem create_tier_from_rank() . Yes, that's the type of collision I mean. Two GPUs, two CXL-attached DRAMs of different speeds etc. I also like Huang's idea of using latency characteristics instead of abstract distances. Though I'm not quite sure how feasible this is in the short term, and share some concerns that Jonathan raised. But I think a wider possible range to begin with makes sense in any case. > > In the other email I had suggested the ability to override not just > > the per-device distance, but also the driver default for new devices > > to handle the hotplug situation. > > > > I understand that the driver override will be done via module parameters. > How will we implement device override? For example in case of dax/kmem driver > the device override will be per dax device? What interface will we use to set the override? > > IIUC in the above proposal the dax/kmem will do > > node_create_and_set_memory_tier(numa_node, get_device_tier_index(dev_dax)); > > get_device_tier_index(struct dev_dax *dev) > { > return dax_kmem_tier_index; // module parameter > } > > Are you suggesting to add a dev_dax property to override the tier defaults? I was thinking a new struct memdevice and struct memtype(?). Every driver implementing memory devices like this sets those up and registers them with generic code and preset parameters. The generic code creates sysfs directories and allows overriding the parameters. struct memdevice { struct device dev; unsigned long distance; struct list_head siblings; /* nid? ... 
*/ }; struct memtype { struct device_type type; unsigned long default_distance; struct list_head devices; }; That forms the (tweakable) tree describing physical properties. From that, the kernel then generates the ordered list of tiers. > > This should be less policy than before. Driver default and per-device > > distances (both overridable) combined with one tunable to set the > > range of distances that get grouped into tiers. > > > > Can you elaborate more on how distance value will be used? The device/device NUMA node can have > different distance value from other NUMA nodes. How do we group them? > for ex: earlier discussion did outline three different topologies. Can you > ellaborate how we would end up grouping them using distance? > > For ex: in the topology below node 2 is at distance 30 from Node0 and 40 from Nodes > so how will we classify node 2? > > > Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes. > > 20 > Node 0 (DRAM) ---- Node 1 (DRAM) > | \ / | > | 30 40 X 40 | 30 > | / \ | > Node 2 (PMEM) ---- Node 3 (PMEM) > 40 > > node distances: > node 0 1 2 3 > 0 10 20 30 40 > 1 20 10 40 30 > 2 30 40 10 40 > 3 40 30 40 10 I'm fairly confused by this example. Do all nodes have CPUs? Isn't this just classic NUMA, where optimizing for locality makes the most sense, rather than tiering? Forget the interface for a second, I have no idea how tiering on such a system would work. One CPU's lower tier can be another CPU's toptier. There is no lowest rung from which to actually *reclaim* pages. Would the CPUs just demote in circles? And the coldest pages on one socket would get demoted into another socket and displace what that socket considers hot local memory? I feel like I missing something. When we're talking about tiered memory, I'm thinking about CPUs utilizing more than one memory node. If those other nodes have CPUs, you can't reliably establish a singular tier order anymore and it becomes classic NUMA, no?
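Since the structure definitions above are hard to read once flattened into the archive, here they are re-assembled, together with a hypothetical registration helper (demo_register_memdevice is not proposed anywhere in the thread; it only shows how a driver preset could reach generic code, which can later be overridden via sysfs).

/*
 * The memdevice/memtype sketch from the mail above, re-assembled.
 */
#include <linux/device.h>
#include <linux/list.h>

struct memtype {
        struct device_type type;
        unsigned long default_distance; /* driver-wide preset, overridable */
        struct list_head devices;
};

struct memdevice {
        struct device dev;
        unsigned long distance;         /* defaults to the memtype value */
        struct list_head siblings;
        int nid;                        /* the "nid? ..." in the original sketch */
};

static void demo_register_memdevice(struct memtype *type, struct memdevice *md)
{
        md->distance = type->default_distance;
        list_add_tail(&md->siblings, &type->devices);
        /* generic code would now (re)generate the ordered list of tiers */
}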
On 6/15/22 12:26 AM, Johannes Weiner wrote: .... >> What can happen is two devices that are managed by DAX/kmem that >> should be in two memory tiers get assigned the same memory tier >> because the dax/kmem driver added both the device to the same memory tier. >> >> In the future we would avoid that by using more device properties like HMAT >> to create additional memory tiers with different rank values. ie, we would >> do in the dax/kmem create_tier_from_rank() . > > Yes, that's the type of collision I mean. Two GPUs, two CXL-attached > DRAMs of different speeds etc. > > I also like Huang's idea of using latency characteristics instead of > abstract distances. Though I'm not quite sure how feasible this is in > the short term, and share some concerns that Jonathan raised. But I > think a wider possible range to begin with makes sense in any case. > How about the below proposal? In this proposal, we use the tier ID as the value that determines the position of the memory tier in the demotion order. A higher value of tier ID indicates a higher memory tier. Memory demotion happens from a higher memory tier to a lower memory tier. By default memory get hotplugged into 'default_memory_tier' . There is a core kernel parameter "default_memory_tier" which can be updated if the user wants to modify the default tier ID. dax/kmem driver use the "dax_kmem_memtier" module parameter to determine the memory tier to which DAX/kmem memory will be added. dax_kmem_memtier and default_memtier defaults to 100 and 200 respectively. Later as we update dax/kmem to use additional device attributes, the driver will be able to place new devices in different memory tiers. As we do that, it is expected that users will have the ability to override these device attribute and control which memory tiers the devices will be placed. New memory tiers can also be created by using node/memtier attribute. Moving a NUMA node to a non-existing memory tier results in creating new memory tiers. So if the kernel default placement of memory devices in memory tiers is not preferred, userspace could choose to create a completely new memory tier hierarchy using this interface. Memory tiers get deleted when they ends up with empty nodelist. # cat /sys/module/kernel/parameters/default_memory_tier 200 # cat /sys/module/kmem/parameters/dax_kmem_memtier 100 # ls /sys/devices/system/memtier/ default_tier max_tier memtier200 power uevent # ls /sys/devices/system/memtier/memtier200/nodelist /sys/devices/system/memtier/memtier200/nodelist # cat /sys/devices/system/memtier/memtier200/nodelist 1-3 # echo 20 > /sys/devices/system/node/node1/memtier # # ls /sys/devices/system/memtier/ default_tier max_tier memtier20 memtier200 power uevent # cat /sys/devices/system/memtier/memtier20/nodelist 1 # # echo 10 > /sys/module/kmem/parameters/dax_kmem_memtier # echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind # echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id # # ls /sys/devices/system/memtier/ default_tier max_tier memtier10 memtier20 memtier200 power uevent # cat /sys/devices/system/memtier/memtier10/nodelist 4 # # grep . 
/sys/devices/system/memtier/memtier*/nodelist /sys/devices/system/memtier/memtier10/nodelist:4 /sys/devices/system/memtier/memtier200/nodelist:2-3 /sys/devices/system/memtier/memtier20/nodelist:1 demotion order details for the above will be lower tier mask for node 1 is 4 and preferred demotion node is 4 lower tier mask for node 2 is 1,4 and preferred demotion node is 1 lower tier mask for node 3 is 1,4 and preferred demotion node is 1 lower tier mask for node 4 None :/sys/devices/system/memtier# ls default_tier max_tier memtier10 memtier20 memtier200 power uevent :/sys/devices/system/memtier# cat memtier20/nodelist 1 :/sys/devices/system/memtier# echo 200 > ../node/node1/memtier :/sys/devices/system/memtier# ls default_tier max_tier memtier10 memtier200 power uevent :/sys/devices/system/memtier# >>> In the other email I had suggested the ability to override not just >>> the per-device distance, but also the driver default for new devices >>> to handle the hotplug situation. >>> ..... >> >> Can you elaborate more on how distance value will be used? The device/device NUMA node can have >> different distance value from other NUMA nodes. How do we group them? >> for ex: earlier discussion did outline three different topologies. Can you >> ellaborate how we would end up grouping them using distance? >> >> For ex: in the topology below node 2 is at distance 30 from Node0 and 40 from Nodes >> so how will we classify node 2? >> >> >> Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes. >> >> 20 >> Node 0 (DRAM) ---- Node 1 (DRAM) >> | \ / | >> | 30 40 X 40 | 30 >> | / \ | >> Node 2 (PMEM) ---- Node 3 (PMEM) >> 40 >> >> node distances: >> node 0 1 2 3 >> 0 10 20 30 40 >> 1 20 10 40 30 >> 2 30 40 10 40 >> 3 40 30 40 10 > > I'm fairly confused by this example. Do all nodes have CPUs? Isn't > this just classic NUMA, where optimizing for locality makes the most > sense, rather than tiering? > Node 2 and Node3 will be memory only NUMA nodes. > Forget the interface for a second, I have no idea how tiering on such > a system would work. One CPU's lower tier can be another CPU's > toptier. There is no lowest rung from which to actually *reclaim* > pages. Would the CPUs just demote in circles? > > And the coldest pages on one socket would get demoted into another > socket and displace what that socket considers hot local memory? > > I feel like I missing something. > > When we're talking about tiered memory, I'm thinking about CPUs > utilizing more than one memory node. If those other nodes have CPUs, > you can't reliably establish a singular tier order anymore and it > becomes classic NUMA, no?
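A sketch of how demotion targets could fall out of the proposal above, where the tier ID itself orders the tiers (higher ID means higher tier). The names are illustrative, not code from the series.

/*
 * Sketch only: tier ID doubles as the ordering key.
 */
#include <linux/list.h>
#include <linux/nodemask.h>

struct demo_memtier {
        int id;                 /* e.g. 200, 20, 10 in the walkthrough */
        nodemask_t nodelist;
        struct list_head list;  /* kept sorted by descending id */
};

static LIST_HEAD(demo_memtier_list);

/*
 * Return the highest non-empty tier below @tier: in the walkthrough this is
 * where the preferred demotion node comes from, while the full lower tier
 * mask would be the union of every tier below @tier.
 */
static struct demo_memtier *demo_next_demotion_tier(struct demo_memtier *tier)
{
        struct demo_memtier *lower;

        list_for_each_entry(lower, &demo_memtier_list, list)
                if (lower->id < tier->id && !nodes_empty(lower->nodelist))
                        return lower;
        return NULL;            /* lowest populated tier: nothing below */
}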
On Tue, 2022-06-14 at 14:56 -0400, Johannes Weiner wrote: > On Tue, Jun 14, 2022 at 01:31:37PM +0530, Aneesh Kumar K V wrote: > > On 6/13/22 9:20 PM, Johannes Weiner wrote: > > > On Mon, Jun 13, 2022 at 07:53:03PM +0530, Aneesh Kumar K V wrote: > > > > If the kernel still can't make the right decision, userspace could rearrange > > > > them in any order using rank values. Without something like rank, if > > > > userspace needs to fix things up, it gets hard with device > > > > hotplugging. ie, the userspace policy could be that any new PMEM tier device > > > > that is hotplugged, park it with a very low-rank value and hence lowest in > > > > demotion order by default. (echo 10 > > > > > /sys/devices/system/memtier/memtier2/rank) . After that userspace could > > > > selectively move the new devices to the correct memory tier? > > > > > > I had touched on this in the other email. > > > > > > This doesn't work if two drivers that should have separate policies > > > collide into the same tier - which is very likely with just 3 tiers. > > > So it seems to me the main usecase for having a rank tunable falls > > > apart rather quickly until tiers are spaced out more widely. And it > > > does so at the cost of an, IMO, tricky to understand interface. > > > > > > > Considering the kernel has a static map for these tiers, how can two drivers > > end up using the same tier? If a new driver is going to manage a memory > > device that is of different characteristics than the one managed by dax/kmem, > > we will end up adding > > > > #define MEMORY_TIER_NEW_DEVICE 4 > > > > The new driver will never use MEMORY_TIER_PMEM > > > > What can happen is two devices that are managed by DAX/kmem that > > should be in two memory tiers get assigned the same memory tier > > because the dax/kmem driver added both the device to the same memory tier. > > > > In the future we would avoid that by using more device properties like HMAT > > to create additional memory tiers with different rank values. ie, we would > > do in the dax/kmem create_tier_from_rank() . > > Yes, that's the type of collision I mean. Two GPUs, two CXL-attached > DRAMs of different speeds etc. > > I also like Huang's idea of using latency characteristics instead of > abstract distances. Though I'm not quite sure how feasible this is in > the short term, and share some concerns that Jonathan raised. But I > think a wider possible range to begin with makes sense in any case. > > > > In the other email I had suggested the ability to override not just > > > the per-device distance, but also the driver default for new devices > > > to handle the hotplug situation. > > > > > > > I understand that the driver override will be done via module parameters. > > How will we implement device override? For example in case of dax/kmem driver > > the device override will be per dax device? What interface will we use to set the override? > > > > IIUC in the above proposal the dax/kmem will do > > > > node_create_and_set_memory_tier(numa_node, get_device_tier_index(dev_dax)); > > > > get_device_tier_index(struct dev_dax *dev) > > { > > return dax_kmem_tier_index; // module parameter > > } > > > > Are you suggesting to add a dev_dax property to override the tier defaults? > > I was thinking a new struct memdevice and struct memtype(?). Every > driver implementing memory devices like this sets those up and > registers them with generic code and preset parameters. The generic > code creates sysfs directories and allows overriding the parameters. 
> > struct memdevice { > struct device dev; > unsigned long distance; > struct list_head siblings; > /* nid? ... */ > }; > > struct memtype { > struct device_type type; > unsigned long default_distance; > struct list_head devices; > }; > > That forms the (tweakable) tree describing physical properties. In general, I think memtype is a good idea. I have suggested something similar before. It can describe the characters of a specific type of memory (same memory media with different interface (e.g., CXL, or DIMM) will be different memory types). And they can be used to provide overriding information. As for memdevice, I think that we already have "node" to represent them in sysfs. Do we really need another one? Is it sufficient to add some links to node in the appropriate directory? For example, make memtype class device under the physical device (e.g. CXL device), and create links to node inside the memtype class device directory? > From that, the kernel then generates the ordered list of tiers. As Jonathan Cameron pointed, we may need the memory tier ID to be stable if possible. I know this isn't a easy task. At least we can make the default memory tier (CPU local DRAM) ID stable (for example make it always 128)? That provides an anchor for users to understand. Best Regards, Huang, Ying > > > This should be less policy than before. Driver default and per-device > > > distances (both overridable) combined with one tunable to set the > > > range of distances that get grouped into tiers. > > > > > > > Can you elaborate more on how distance value will be used? The device/device NUMA node can have > > different distance value from other NUMA nodes. How do we group them? > > for ex: earlier discussion did outline three different topologies. Can you > > ellaborate how we would end up grouping them using distance? > > > > For ex: in the topology below node 2 is at distance 30 from Node0 and 40 from Nodes > > so how will we classify node 2? > > > > > > Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes. > > > > 20 > > Node 0 (DRAM) ---- Node 1 (DRAM) > > | \ / | > > | 30 40 X 40 | 30 > > | / \ | > > Node 2 (PMEM) ---- Node 3 (PMEM) > > 40 > > > > node distances: > > node 0 1 2 3 > > 0 10 20 30 40 > > 1 20 10 40 30 > > 2 30 40 10 40 > > 3 40 30 40 10 > > I'm fairly confused by this example. Do all nodes have CPUs? Isn't > this just classic NUMA, where optimizing for locality makes the most > sense, rather than tiering? > > Forget the interface for a second, I have no idea how tiering on such > a system would work. One CPU's lower tier can be another CPU's > toptier. There is no lowest rung from which to actually *reclaim* > pages. Would the CPUs just demote in circles? > > And the coldest pages on one socket would get demoted into another > socket and displace what that socket considers hot local memory? > > I feel like I missing something. > > When we're talking about tiered memory, I'm thinking about CPUs > utilizing more than one memory node. If those other nodes have CPUs, > you can't reliably establish a singular tier order anymore and it > becomes classic NUMA, no?
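(A sketch of the direction being discussed, not an implementation: every registered
memdevice carries an overridable distance, and the generic code buckets distances
into tiers, so two drivers only share a tier when their distances genuinely overlap.
DISTANCE_PER_TIER and memdevice_to_tier() are invented names for illustration.)

#define DISTANCE_PER_TIER       100     /* width of one distance bucket */

static int memdevice_to_tier(struct memdevice *memdev)
{
        /*
         * Devices whose (driver-default or overridden) distance falls
         * into the same range end up in the same tier.
         */
        return memdev->distance / DISTANCE_PER_TIER;
}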
On Wed, Jun 15, 2022 at 6:11 PM Ying Huang <ying.huang@intel.com> wrote: > > On Tue, 2022-06-14 at 14:56 -0400, Johannes Weiner wrote: > > On Tue, Jun 14, 2022 at 01:31:37PM +0530, Aneesh Kumar K V wrote: > > > On 6/13/22 9:20 PM, Johannes Weiner wrote: > > > > On Mon, Jun 13, 2022 at 07:53:03PM +0530, Aneesh Kumar K V wrote: > > > > > If the kernel still can't make the right decision, userspace could rearrange > > > > > them in any order using rank values. Without something like rank, if > > > > > userspace needs to fix things up, it gets hard with device > > > > > hotplugging. ie, the userspace policy could be that any new PMEM tier device > > > > > that is hotplugged, park it with a very low-rank value and hence lowest in > > > > > demotion order by default. (echo 10 > > > > > > /sys/devices/system/memtier/memtier2/rank) . After that userspace could > > > > > selectively move the new devices to the correct memory tier? > > > > > > > > I had touched on this in the other email. > > > > > > > > This doesn't work if two drivers that should have separate policies > > > > collide into the same tier - which is very likely with just 3 tiers. > > > > So it seems to me the main usecase for having a rank tunable falls > > > > apart rather quickly until tiers are spaced out more widely. And it > > > > does so at the cost of an, IMO, tricky to understand interface. > > > > > > > > > > Considering the kernel has a static map for these tiers, how can two drivers > > > end up using the same tier? If a new driver is going to manage a memory > > > device that is of different characteristics than the one managed by dax/kmem, > > > we will end up adding > > > > > > #define MEMORY_TIER_NEW_DEVICE 4 > > > > > > The new driver will never use MEMORY_TIER_PMEM > > > > > > What can happen is two devices that are managed by DAX/kmem that > > > should be in two memory tiers get assigned the same memory tier > > > because the dax/kmem driver added both the device to the same memory tier. > > > > > > In the future we would avoid that by using more device properties like HMAT > > > to create additional memory tiers with different rank values. ie, we would > > > do in the dax/kmem create_tier_from_rank() . > > > > Yes, that's the type of collision I mean. Two GPUs, two CXL-attached > > DRAMs of different speeds etc. > > > > I also like Huang's idea of using latency characteristics instead of > > abstract distances. Though I'm not quite sure how feasible this is in > > the short term, and share some concerns that Jonathan raised. But I > > think a wider possible range to begin with makes sense in any case. > > > > > > In the other email I had suggested the ability to override not just > > > > the per-device distance, but also the driver default for new devices > > > > to handle the hotplug situation. > > > > > > > > > > I understand that the driver override will be done via module parameters. > > > How will we implement device override? For example in case of dax/kmem driver > > > the device override will be per dax device? What interface will we use to set the override? > > > > > > IIUC in the above proposal the dax/kmem will do > > > > > > node_create_and_set_memory_tier(numa_node, get_device_tier_index(dev_dax)); > > > > > > get_device_tier_index(struct dev_dax *dev) > > > { > > > return dax_kmem_tier_index; // module parameter > > > } > > > > > > Are you suggesting to add a dev_dax property to override the tier defaults? > > > > I was thinking a new struct memdevice and struct memtype(?). 
Every > > driver implementing memory devices like this sets those up and > > registers them with generic code and preset parameters. The generic > > code creates sysfs directories and allows overriding the parameters. > > > > struct memdevice { > > struct device dev; > > unsigned long distance; > > struct list_head siblings; > > /* nid? ... */ > > }; > > > > struct memtype { > > struct device_type type; > > unsigned long default_distance; > > struct list_head devices; > > }; > > > > That forms the (tweakable) tree describing physical properties. > > In general, I think memtype is a good idea. I have suggested > something similar before. It can describe the characters of a > specific type of memory (same memory media with different interface > (e.g., CXL, or DIMM) will be different memory types). And they can > be used to provide overriding information. > > As for memdevice, I think that we already have "node" to represent > them in sysfs. Do we really need another one? Is it sufficient to > add some links to node in the appropriate directory? For example, > make memtype class device under the physical device (e.g. CXL device), > and create links to node inside the memtype class device directory? > > > From that, the kernel then generates the ordered list of tiers. > > As Jonathan Cameron pointed, we may need the memory tier ID to be > stable if possible. I know this isn't a easy task. At least we can > make the default memory tier (CPU local DRAM) ID stable (for example > make it always 128)? That provides an anchor for users to understand. One of the motivations of introducing "rank" is to allow memory tier ID to be stable, at least for the well-defined tiers such as the default memory tier. The default memory tier can be moved around in the tier hierarchy by adjusting its rank position relative to other tiers, but its device ID can remain the same, e.g. always 1. > Best Regards, > Huang, Ying > > > > > This should be less policy than before. Driver default and per-device > > > > distances (both overridable) combined with one tunable to set the > > > > range of distances that get grouped into tiers. > > > > > > > > > > Can you elaborate more on how distance value will be used? The device/device NUMA node can have > > > different distance value from other NUMA nodes. How do we group them? > > > for ex: earlier discussion did outline three different topologies. Can you > > > ellaborate how we would end up grouping them using distance? > > > > > > For ex: in the topology below node 2 is at distance 30 from Node0 and 40 from Nodes > > > so how will we classify node 2? > > > > > > > > > Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes. > > > > > > 20 > > > Node 0 (DRAM) ---- Node 1 (DRAM) > > > | \ / | > > > | 30 40 X 40 | 30 > > > | / \ | > > > Node 2 (PMEM) ---- Node 3 (PMEM) > > > 40 > > > > > > node distances: > > > node 0 1 2 3 > > > 0 10 20 30 40 > > > 1 20 10 40 30 > > > 2 30 40 10 40 > > > 3 40 30 40 10 > > > > I'm fairly confused by this example. Do all nodes have CPUs? Isn't > > this just classic NUMA, where optimizing for locality makes the most > > sense, rather than tiering? > > > > Forget the interface for a second, I have no idea how tiering on such > > a system would work. One CPU's lower tier can be another CPU's > > toptier. There is no lowest rung from which to actually *reclaim* > > pages. Would the CPUs just demote in circles? > > > > And the coldest pages on one socket would get demoted into another > > socket and displace what that socket considers hot local memory? 
> > > > I feel like I missing something. > > > > When we're talking about tiered memory, I'm thinking about CPUs > > utilizing more than one memory node. If those other nodes have CPUs, > > you can't reliably establish a singular tier order anymore and it > > becomes classic NUMA, no? > > >
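(To make the stability point concrete: with rank, the device ID, and hence the
memtierN directory name, never changes; only the position in the rank-sorted list
does. The sketch below sits on top of the posted code; the series only exposes rank
read-only, so rank_store() here is hypothetical.)

static ssize_t rank_store(struct device *dev, struct device_attribute *attr,
                          const char *buf, size_t count)
{
        struct memory_tier *memtier = to_memory_tier(dev);
        int rank, ret;

        ret = kstrtoint(buf, 10, &rank);
        if (ret)
                return ret;

        mutex_lock(&memory_tier_lock);
        /* Re-sort by the new rank; memtier->dev.id stays the same. */
        list_del(&memtier->list);
        memtier->rank = rank;
        insert_memory_tier(memtier);
        mutex_unlock(&memory_tier_lock);

        return count;
}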
On 6/16/22 9:15 AM, Wei Xu wrote:
> On Wed, Jun 15, 2022 at 6:11 PM Ying Huang <ying.huang@intel.com> wrote:
>>
>> On Tue, 2022-06-14 at 14:56 -0400, Johannes Weiner wrote:
>>> On Tue, Jun 14, 2022 at 01:31:37PM +0530, Aneesh Kumar K V wrote:

....

>> As Jonathan Cameron pointed, we may need the memory tier ID to be
>> stable if possible. I know this isn't a easy task. At least we can
>> make the default memory tier (CPU local DRAM) ID stable (for example
>> make it always 128)? That provides an anchor for users to understand.
>
> One of the motivations of introducing "rank" is to allow memory tier
> ID to be stable, at least for the well-defined tiers such as the
> default memory tier. The default memory tier can be moved around in
> the tier hierarchy by adjusting its rank position relative to other
> tiers, but its device ID can remain the same, e.g. always 1.
>

With /sys/devices/system/memtier/default_tier, userspace will be able to query
the default tier details.

Did you get to look at
https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
Any reason why that will not work with all the requirements we had?

-aneesh
On Thu, 2022-06-16 at 10:17 +0530, Aneesh Kumar K V wrote:
> On 6/16/22 9:15 AM, Wei Xu wrote:
> > On Wed, Jun 15, 2022 at 6:11 PM Ying Huang <ying.huang@intel.com> wrote:
> > >
> > > On Tue, 2022-06-14 at 14:56 -0400, Johannes Weiner wrote:
> > > > On Tue, Jun 14, 2022 at 01:31:37PM +0530, Aneesh Kumar K V wrote:
>
> ....
>
> > > As Jonathan Cameron pointed, we may need the memory tier ID to be
> > > stable if possible. I know this isn't a easy task. At least we can
> > > make the default memory tier (CPU local DRAM) ID stable (for example
> > > make it always 128)? That provides an anchor for users to understand.
> >
> > One of the motivations of introducing "rank" is to allow memory tier
> > ID to be stable, at least for the well-defined tiers such as the
> > default memory tier. The default memory tier can be moved around in
> > the tier hierarchy by adjusting its rank position relative to other
> > tiers, but its device ID can remain the same, e.g. always 1.
> >
>
> With /sys/devices/system/memtier/default_tier userspace will be able query
> the default tier details.
>

Yes. This is a way to address the memory tier ID stability issue too.
Another choice is to make default_tier a symbolic link.

Best Regards,
Huang, Ying
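(A sketch of the symbolic-link variant suggested above, assuming the subsystem root
device is reachable via memory_tier_subsys.dev_root; that detail and the helper name
are assumptions, not part of the series.)

/* /sys/devices/system/memtier/default_tier -> memtier1 */
static int memtier_create_default_link(struct memory_tier *memtier)
{
        return sysfs_create_link(&memory_tier_subsys.dev_root->kobj,
                                 &memtier->dev.kobj, "default_tier");
}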
On Thu, 16 Jun 2022 09:11:24 +0800 Ying Huang <ying.huang@intel.com> wrote: > On Tue, 2022-06-14 at 14:56 -0400, Johannes Weiner wrote: > > On Tue, Jun 14, 2022 at 01:31:37PM +0530, Aneesh Kumar K V wrote: > > > On 6/13/22 9:20 PM, Johannes Weiner wrote: > > > > On Mon, Jun 13, 2022 at 07:53:03PM +0530, Aneesh Kumar K V wrote: > > > > > If the kernel still can't make the right decision, userspace could rearrange > > > > > them in any order using rank values. Without something like rank, if > > > > > userspace needs to fix things up, it gets hard with device > > > > > hotplugging. ie, the userspace policy could be that any new PMEM tier device > > > > > that is hotplugged, park it with a very low-rank value and hence lowest in > > > > > demotion order by default. (echo 10 > > > > > > /sys/devices/system/memtier/memtier2/rank) . After that userspace could > > > > > selectively move the new devices to the correct memory tier? > > > > > > > > I had touched on this in the other email. > > > > > > > > This doesn't work if two drivers that should have separate policies > > > > collide into the same tier - which is very likely with just 3 tiers. > > > > So it seems to me the main usecase for having a rank tunable falls > > > > apart rather quickly until tiers are spaced out more widely. And it > > > > does so at the cost of an, IMO, tricky to understand interface. > > > > > > > > > > Considering the kernel has a static map for these tiers, how can two drivers > > > end up using the same tier? If a new driver is going to manage a memory > > > device that is of different characteristics than the one managed by dax/kmem, > > > we will end up adding > > > > > > #define MEMORY_TIER_NEW_DEVICE 4 > > > > > > The new driver will never use MEMORY_TIER_PMEM > > > > > > What can happen is two devices that are managed by DAX/kmem that > > > should be in two memory tiers get assigned the same memory tier > > > because the dax/kmem driver added both the device to the same memory tier. > > > > > > In the future we would avoid that by using more device properties like HMAT > > > to create additional memory tiers with different rank values. ie, we would > > > do in the dax/kmem create_tier_from_rank() . > > > > Yes, that's the type of collision I mean. Two GPUs, two CXL-attached > > DRAMs of different speeds etc. > > > > I also like Huang's idea of using latency characteristics instead of > > abstract distances. Though I'm not quite sure how feasible this is in > > the short term, and share some concerns that Jonathan raised. But I > > think a wider possible range to begin with makes sense in any case. > > > > > > In the other email I had suggested the ability to override not just > > > > the per-device distance, but also the driver default for new devices > > > > to handle the hotplug situation. > > > > > > > > > > I understand that the driver override will be done via module parameters. > > > How will we implement device override? For example in case of dax/kmem driver > > > the device override will be per dax device? What interface will we use to set the override? > > > > > > IIUC in the above proposal the dax/kmem will do > > > > > > node_create_and_set_memory_tier(numa_node, get_device_tier_index(dev_dax)); > > > > > > get_device_tier_index(struct dev_dax *dev) > > > { > > > return dax_kmem_tier_index; // module parameter > > > } > > > > > > Are you suggesting to add a dev_dax property to override the tier defaults? > > > > I was thinking a new struct memdevice and struct memtype(?). 
Every > > driver implementing memory devices like this sets those up and > > registers them with generic code and preset parameters. The generic > > code creates sysfs directories and allows overriding the parameters. > > > > struct memdevice { > > struct device dev; > > unsigned long distance; > > struct list_head siblings; > > /* nid? ... */ > > }; > > > > struct memtype { > > struct device_type type; > > unsigned long default_distance; > > struct list_head devices; > > }; > > > > That forms the (tweakable) tree describing physical properties. > > In general, I think memtype is a good idea. I have suggested > something similar before. It can describe the characters of a > specific type of memory (same memory media with different interface > (e.g., CXL, or DIMM) will be different memory types). And they can > be used to provide overriding information. I'm not sure you are suggesting interface as one element of distinguishing types, or as the element - just in case it's as 'the element'. Ignore the next bit if not ;) Memory "interface" isn't going to be enough of a distinction. If you want to have a default distance it would need to be different for cases where the same 'type' of RAM has very different characteristics. Applies everywhere but given CXL 'defines' a lot of this - if we just have DRAM attached via CXL: 1. 16-lane direct attached DRAM device. (low latency - high bw) 2. 4x 16-lane direct attached DRAM interleaved (low latency - very high bw) 3. 4-lane direct attached DRAM device (low latency - low bandwidth) 4. 16-lane to single switch, 4x 4-lane devices interleaved (mid latency - high bw) 5. 4-lane to single switch, 4x 4-lane devices interleaved (mid latency, mid bw) 6. 4x 16-lane so 4 switch, each switch to 4 DRAM devices (mid latency, very high bw) (7. 16 land directed attached nvram. (midish latency, high bw - perf wise might be similarish to 4). It could be a lot more complex, but hopefully that conveys that 'type' is next to useless to characterize things unless we have a very large number of potential subtypes. If we were on current tiering proposal we'd just have the CXL subsystem manage multiple tiers to cover what is attached. > > As for memdevice, I think that we already have "node" to represent > them in sysfs. Do we really need another one? Is it sufficient to > add some links to node in the appropriate directory? For example, > make memtype class device under the physical device (e.g. CXL device), > and create links to node inside the memtype class device directory? > > > From that, the kernel then generates the ordered list of tiers. > > As Jonathan Cameron pointed, we may need the memory tier ID to be > stable if possible. I know this isn't a easy task. At least we can > make the default memory tier (CPU local DRAM) ID stable (for example > make it always 128)? That provides an anchor for users to understand. > > Best Regards, > Huang, Ying > > > > > This should be less policy than before. Driver default and per-device > > > > distances (both overridable) combined with one tunable to set the > > > > range of distances that get grouped into tiers. > > > > > > > > > > Can you elaborate more on how distance value will be used? The device/device NUMA node can have > > > different distance value from other NUMA nodes. How do we group them? > > > for ex: earlier discussion did outline three different topologies. Can you > > > ellaborate how we would end up grouping them using distance? 
> > > > > > For ex: in the topology below node 2 is at distance 30 from Node0 and 40 from Nodes > > > so how will we classify node 2? > > > > > > > > > Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes. > > > > > > 20 > > > Node 0 (DRAM) ---- Node 1 (DRAM) > > > | \ / | > > > | 30 40 X 40 | 30 > > > | / \ | > > > Node 2 (PMEM) ---- Node 3 (PMEM) > > > 40 > > > > > > node distances: > > > node 0 1 2 3 > > > 0 10 20 30 40 > > > 1 20 10 40 30 > > > 2 30 40 10 40 > > > 3 40 30 40 10 > > > > I'm fairly confused by this example. Do all nodes have CPUs? Isn't > > this just classic NUMA, where optimizing for locality makes the most > > sense, rather than tiering? > > > > Forget the interface for a second, I have no idea how tiering on such > > a system would work. One CPU's lower tier can be another CPU's > > toptier. There is no lowest rung from which to actually *reclaim* > > pages. Would the CPUs just demote in circles? > > > > And the coldest pages on one socket would get demoted into another > > socket and displace what that socket considers hot local memory? > > > > I feel like I missing something. > > > > When we're talking about tiered memory, I'm thinking about CPUs > > utilizing more than one memory node. If those other nodes have CPUs, > > you can't reliably establish a singular tier order anymore and it > > becomes classic NUMA, no? > >
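(One way to make the point about CXL topologies concrete: derive the abstract
distance from per-target latency and bandwidth, e.g. what HMAT/CDAT report, so the
direct-attached and switch-attached DRAM cases land in different buckets even though
they are all "CXL DRAM". The struct and the weighting below are invented purely for
illustration.)

struct mem_perf {
        unsigned int read_latency_ns;
        unsigned int read_bandwidth_mbps;
};

static unsigned long perf_to_distance(struct mem_perf *perf)
{
        unsigned long dist;

        /* Latency dominates; bandwidth trims the result. */
        dist = perf->read_latency_ns;
        if (perf->read_bandwidth_mbps)
                dist += 1000000UL / perf->read_bandwidth_mbps;

        return dist;
}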
On 6/14/22 10:15 PM, Jonathan Cameron wrote:
> ...
>>
>> It could be simple tier0, tier1, tier2 numbering again, but the
>> numbers now would mean something to the user. A rank tunable is no
>> longer necessary.
>
> This feels like it might make tier assignments a bit less stable
> and hence run into question of how to hook up accounting. Not my
> area of expertise though, but it was put forward as one of the reasons
> we didn't want hotplug to potentially end up shuffling other tiers
> around. The desire was for a 'stable' entity. Can avoid that with
> 'space' between them but then we sort of still have rank, just in a
> form that makes updating it messy (need to create a new tier to do
> it).
>
>> How about we do what is proposed here
>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com

The cgroup accounting patch posted here
https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com
looks at top tier accounting per cgroup and I am not sure what tier ID
stability is expected for top tier accounting.

-aneesh
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
new file mode 100644
index 000000000000..e17f6b4ee177
--- /dev/null
+++ b/include/linux/memory-tiers.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MEMORY_TIERS_H
+#define _LINUX_MEMORY_TIERS_H
+
+#ifdef CONFIG_TIERED_MEMORY
+
+#define MEMORY_TIER_HBM_GPU     0
+#define MEMORY_TIER_DRAM        1
+#define MEMORY_TIER_PMEM        2
+
+#define MEMORY_RANK_HBM_GPU     300
+#define MEMORY_RANK_DRAM        200
+#define MEMORY_RANK_PMEM        100
+
+#define DEFAULT_MEMORY_TIER     MEMORY_TIER_DRAM
+#define MAX_MEMORY_TIERS        3
+
+#endif  /* CONFIG_TIERED_MEMORY */
+
+#endif
diff --git a/mm/Kconfig b/mm/Kconfig
index 169e64192e48..08a3d330740b 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -614,6 +614,17 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
 config ARCH_ENABLE_THP_MIGRATION
         bool
 
+config TIERED_MEMORY
+        bool "Support for explicit memory tiers"
+        def_bool n
+        depends on MIGRATION && NUMA
+        help
+          Support to split nodes into memory tiers explicitly and
+          to demote pages on reclaim to lower tiers. This option
+          also exposes sysfs interface to read nodes available in
+          specific tier and to move specific node among different
+          possible tiers.
+
 config HUGETLB_PAGE_SIZE_VARIABLE
         def_bool n
         help
diff --git a/mm/Makefile b/mm/Makefile
index 6f9ffa968a1a..482557fbc9d1 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
 obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMTEST) += memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_TIERED_MEMORY) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
new file mode 100644
index 000000000000..7de18d94a08d
--- /dev/null
+++ b/mm/memory-tiers.c
@@ -0,0 +1,188 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/types.h>
+#include <linux/device.h>
+#include <linux/nodemask.h>
+#include <linux/slab.h>
+#include <linux/memory-tiers.h>
+
+struct memory_tier {
+        struct list_head list;
+        struct device dev;
+        nodemask_t nodelist;
+        int rank;
+};
+
+#define to_memory_tier(device) container_of(device, struct memory_tier, dev)
+
+static struct bus_type memory_tier_subsys = {
+        .name = "memtier",
+        .dev_name = "memtier",
+};
+
+static DEFINE_MUTEX(memory_tier_lock);
+static LIST_HEAD(memory_tiers);
+
+
+static ssize_t nodelist_show(struct device *dev,
+                             struct device_attribute *attr, char *buf)
+{
+        struct memory_tier *memtier = to_memory_tier(dev);
+
+        return sysfs_emit(buf, "%*pbl\n",
+                          nodemask_pr_args(&memtier->nodelist));
+}
+static DEVICE_ATTR_RO(nodelist);
+
+static ssize_t rank_show(struct device *dev,
+                         struct device_attribute *attr, char *buf)
+{
+        struct memory_tier *memtier = to_memory_tier(dev);
+
+        return sysfs_emit(buf, "%d\n", memtier->rank);
+}
+static DEVICE_ATTR_RO(rank);
+
+static struct attribute *memory_tier_dev_attrs[] = {
+        &dev_attr_nodelist.attr,
+        &dev_attr_rank.attr,
+        NULL
+};
+
+static const struct attribute_group memory_tier_dev_group = {
+        .attrs = memory_tier_dev_attrs,
+};
+
+static const struct attribute_group *memory_tier_dev_groups[] = {
+        &memory_tier_dev_group,
+        NULL
+};
+
+static void memory_tier_device_release(struct device *dev)
+{
+        struct memory_tier *tier = to_memory_tier(dev);
+
+        kfree(tier);
+}
+
+/*
+ * Keep it simple by having direct mapping between
+ * tier index and rank value.
+ */
+static inline int get_rank_from_tier(unsigned int tier)
+{
+        switch (tier) {
+        case MEMORY_TIER_HBM_GPU:
+                return MEMORY_RANK_HBM_GPU;
+        case MEMORY_TIER_DRAM:
+                return MEMORY_RANK_DRAM;
+        case MEMORY_TIER_PMEM:
+                return MEMORY_RANK_PMEM;
+        }
+
+        return 0;
+}
+
+static void insert_memory_tier(struct memory_tier *memtier)
+{
+        struct list_head *ent;
+        struct memory_tier *tmp_memtier;
+
+        list_for_each(ent, &memory_tiers) {
+                tmp_memtier = list_entry(ent, struct memory_tier, list);
+                if (tmp_memtier->rank < memtier->rank) {
+                        list_add_tail(&memtier->list, ent);
+                        return;
+                }
+        }
+        list_add_tail(&memtier->list, &memory_tiers);
+}
+
+static struct memory_tier *register_memory_tier(unsigned int tier)
+{
+        int error;
+        struct memory_tier *memtier;
+
+        if (tier >= MAX_MEMORY_TIERS)
+                return NULL;
+
+        memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
+        if (!memtier)
+                return NULL;
+
+        memtier->dev.id = tier;
+        memtier->rank = get_rank_from_tier(tier);
+        memtier->dev.bus = &memory_tier_subsys;
+        memtier->dev.release = memory_tier_device_release;
+        memtier->dev.groups = memory_tier_dev_groups;
+
+        insert_memory_tier(memtier);
+
+        error = device_register(&memtier->dev);
+        if (error) {
+                list_del(&memtier->list);
+                put_device(&memtier->dev);
+                return NULL;
+        }
+        return memtier;
+}
+
+__maybe_unused // temporary to prevent warnings during bisects
+static void unregister_memory_tier(struct memory_tier *memtier)
+{
+        list_del(&memtier->list);
+        device_unregister(&memtier->dev);
+}
+
+static ssize_t
+max_tier_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+        return sysfs_emit(buf, "%d\n", MAX_MEMORY_TIERS);
+}
+static DEVICE_ATTR_RO(max_tier);
+
+static ssize_t
+default_tier_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+        return sysfs_emit(buf, "memtier%d\n", DEFAULT_MEMORY_TIER);
+}
+static DEVICE_ATTR_RO(default_tier);
+
+static struct attribute *memory_tier_attrs[] = {
+        &dev_attr_max_tier.attr,
+        &dev_attr_default_tier.attr,
+        NULL
+};
+
+static const struct attribute_group memory_tier_attr_group = {
+        .attrs = memory_tier_attrs,
+};
+
+static const struct attribute_group *memory_tier_attr_groups[] = {
+        &memory_tier_attr_group,
+        NULL,
+};
+
+static int __init memory_tier_init(void)
+{
+        int ret;
+        struct memory_tier *memtier;
+
+        ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
+        if (ret)
+                panic("%s() failed to register subsystem: %d\n", __func__, ret);
+
+        /*
+         * Register only default memory tier to hide all empty
+         * memory tier from sysfs.
+         */
+        memtier = register_memory_tier(DEFAULT_MEMORY_TIER);
+        if (!memtier)
+                panic("%s() failed to register memory tier: %d\n", __func__, ret);
+
+        /* CPU only nodes are not part of memory tiers. */
+        memtier->nodelist = node_states[N_MEMORY];
+
+        return 0;
+}
+subsys_initcall(memory_tier_init);
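(For reference, a helper along the lines of the node_create_and_set_memory_tier()
pseudo-code discussed earlier in the thread could sit on top of the code above
roughly as follows. It is not part of the posted patch; __get_memory_tier(), a
lookup by tier index in the rank-sorted list, is assumed.)

static int node_set_memory_tier(int node, unsigned int tier)
{
        struct memory_tier *memtier;
        int ret = 0;

        mutex_lock(&memory_tier_lock);
        memtier = __get_memory_tier(tier);
        if (!memtier)
                memtier = register_memory_tier(tier);
        if (!memtier) {
                ret = -EINVAL;
                goto out;
        }
        /* dax/kmem (or another driver) adds its node to the tier's nodelist. */
        node_set(node, memtier->nodelist);
out:
        mutex_unlock(&memory_tier_lock);
        return ret;
}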