Message ID | 20220728190436.858458-2-aneesh.kumar@linux.ibm.com (mailing list archive) |
---|---|
State | New |
Series | mm/demotion: Memory tiers and demotion |

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> In the current kernel, memory tiers are defined implicitly via a demotion path
> relationship between NUMA nodes, which is created during the kernel
> initialization and updated when a NUMA node is hot-added or hot-removed. The
> current implementation puts all nodes with CPU into the highest tier, and builds
> the tier hierarchy tier-by-tier by establishing the per-node demotion targets
> based on the distances between nodes.
>
> This current memory tier kernel implementation needs to be improved for several
> important use cases:
>
> The current tier initialization code always initializes each memory-only NUMA
> node into a lower tier. But a memory-only NUMA node may have a high performance
> memory device (e.g. a DRAM-backed memory-only node on a virtual machine) that
> should be put into a higher tier.
>
> The current tier hierarchy always puts CPU nodes into the top tier. But on a
> system with HBM or GPU devices, the memory-only NUMA nodes mapping these devices
> should be in the top tier, and DRAM nodes with CPUs are better to be placed into
> the next lower tier.
>
> With current kernel higher tier node can only be demoted to nodes with shortest
> distance on the next lower tier as defined by the demotion path, not any other
> node from any lower tier. This strict demotion order does not work in all use
> cases (e.g. some use cases may want to allow cross-socket demotion to another
> node in the same demotion tier as a fallback when the preferred demotion node is
> out of space). This demotion order is also inconsistent with the page allocation
> fallback order when all the nodes in a higher tier are out of space: The page
> allocation can fall back to any node from any lower tier, whereas the demotion
> order doesn't allow that.
>
> This patch series addresses the above by defining memory tiers explicitly.
>
> Linux kernel presents memory devices as NUMA nodes and each memory device is of
> a specific type. The memory type of a device is represented by its abstract
> distance. A memory tier corresponds to a range of abstract distance. This allows
> for classifying memory devices with a specific performance range into a memory
> tier.
>
> This patch configures the range/chunk size to be 128. The default DRAM
> abstract distance is 512. We can have 4 memory tiers below the default DRAM
> abstract distance which cover the range 0 - 127, 128 - 255, 256 - 383, 384 - 511.
> Slower memory devices like persistent memory will have abstract distance below
> the default DRAM level and hence will be placed in these 4 lower tiers.

For abstract distance, the lower value means higher performance, higher
value means lower performance.  So the abstract distance of PMEM should
be larger than that of DRAM.

> A kernel parameter is provided to override the default memory tier.

Forget to delete?

Best Regards,
Huang, Ying

> Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
> Link: https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>
> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  include/linux/memory-tiers.h |  17 ++++++
>  mm/Makefile                  |   1 +
>  mm/memory-tiers.c            | 102 +++++++++++++++++++++++++++++++++++
>  3 files changed, 120 insertions(+)
>  create mode 100644 include/linux/memory-tiers.h
>  create mode 100644 mm/memory-tiers.c
>
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> new file mode 100644
> index 000000000000..8d7884b7a3f0
> --- /dev/null
> +++ b/include/linux/memory-tiers.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_MEMORY_TIERS_H
> +#define _LINUX_MEMORY_TIERS_H
> +
> +/*
> + * Each tier cover a abstrace distance chunk size of 128
> + */
> +#define MEMTIER_CHUNK_BITS	7
> +#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
> +/*
> + * For now let's have 4 memory tier below default DRAM tier.
> + */
> +#define MEMTIER_ADISTANCE_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))
> +/* leave one tier below this slow pmem */
> +#define MEMTIER_ADISTANCE_PMEM	(1 << MEMTIER_CHUNK_BITS)
> +
> +#endif  /* _LINUX_MEMORY_TIERS_H */
> diff --git a/mm/Makefile b/mm/Makefile
> index 6f9ffa968a1a..d30acebc2164 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
>  obj-$(CONFIG_FAILSLAB) += failslab.o
>  obj-$(CONFIG_MEMTEST) += memtest.o
>  obj-$(CONFIG_MIGRATION) += migrate.o
> +obj-$(CONFIG_NUMA) += memory-tiers.o
>  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> new file mode 100644
> index 000000000000..01cfd514c192
> --- /dev/null
> +++ b/mm/memory-tiers.c
> @@ -0,0 +1,102 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/types.h>
> +#include <linux/nodemask.h>
> +#include <linux/slab.h>
> +#include <linux/lockdep.h>
> +#include <linux/memory-tiers.h>
> +
> +struct memory_tier {
> +	/* hierarchy of memory tiers */
> +	struct list_head list;
> +	/* list of all memory types part of this tier */
> +	struct list_head memory_types;
> +	/*
> +	 * start value of abstract distance. memory tier maps
> +	 * an abstract distance range,
> +	 * adistance_start .. adistance_start + MEMTIER_CHUNK_SIZE
> +	 */
> +	int adistance_start;
> +};
> +
> +struct memory_dev_type {
> +	/* list of memory types that are are part of same tier as this type */
> +	struct list_head tier_sibiling;
> +	/* abstract distance for this specific memory type */
> +	int adistance;
> +	/* Nodes of same abstract distance */
> +	nodemask_t nodes;
> +	struct memory_tier *memtier;
> +};
> +
> +static DEFINE_MUTEX(memory_tier_lock);
> +static LIST_HEAD(memory_tiers);
> +struct memory_dev_type *node_memory_types[MAX_NUMNODES];
> +/*
> + * For now let's have 4 memory tier below default DRAM tier.
> + */
> +static struct memory_dev_type default_dram_type = {
> +	.adistance = MEMTIER_ADISTANCE_DRAM,
> +	.tier_sibiling = LIST_HEAD_INIT(default_dram_type.tier_sibiling),
> +};
> +
> +static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memtype)
> +{
> +	bool found_slot = false;
> +	struct memory_tier *memtier, *new_memtier;
> +	int adistance = memtype->adistance;
> +	unsigned int memtier_adistance_chunk_size = MEMTIER_CHUNK_SIZE;
> +
> +	lockdep_assert_held_once(&memory_tier_lock);
> +
> +	/*
> +	 * If the memtype is already part of a memory tier,
> +	 * just return that.
> +	 */
> +	if (memtype->memtier)
> +		return memtype->memtier;
> +
> +	adistance = round_down(adistance, memtier_adistance_chunk_size);
> +	list_for_each_entry(memtier, &memory_tiers, list) {
> +		if (adistance == memtier->adistance_start) {
> +			memtype->memtier = memtier;
> +			list_add(&memtype->tier_sibiling, &memtier->memory_types);
> +			return memtier;
> +		} else if (adistance < memtier->adistance_start) {
> +			found_slot = true;
> +			break;
> +		}
> +	}
> +
> +	new_memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
> +	if (!new_memtier)
> +		return ERR_PTR(-ENOMEM);
> +
> +	new_memtier->adistance_start = adistance;
> +	INIT_LIST_HEAD(&new_memtier->list);
> +	INIT_LIST_HEAD(&new_memtier->memory_types);
> +	if (found_slot)
> +		list_add_tail(&new_memtier->list, &memtier->list);
> +	else
> +		list_add_tail(&new_memtier->list, &memory_tiers);
> +	memtype->memtier = new_memtier;
> +	list_add(&memtype->tier_sibiling, &new_memtier->memory_types);
> +	return new_memtier;
> +}
> +
> +static int __init memory_tier_init(void)
> +{
> +	struct memory_tier *memtier;
> +
> +	mutex_lock(&memory_tier_lock);
> +	/* CPU only nodes are not part of memory tiers. */
> +	default_dram_type.nodes = node_states[N_MEMORY];
> +
> +	memtier = find_create_memory_tier(&default_dram_type);
> +	if (IS_ERR(memtier))
> +		panic("%s() failed to register memory tier: %ld\n",
> +		      __func__, PTR_ERR(memtier));
> +	mutex_unlock(&memory_tier_lock);
> +
> +	return 0;
> +}
> +subsys_initcall(memory_tier_init);
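
The tier arithmetic in the patch can be checked in isolation. What follows is a minimal userspace sketch, not code from the series: the two macros are copied from the quoted header, while `tier_start()` and `tier_index()` are hypothetical helpers that stand in for the kernel's `round_down()` call in `find_create_memory_tier()`, assuming a power-of-two chunk size.

```c
#include <assert.h>

/* Values copied from the quoted include/linux/memory-tiers.h. */
#define MEMTIER_CHUNK_BITS	7
#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)	/* 128 */
#define MEMTIER_ADISTANCE_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))	/* 512 */

/*
 * Hypothetical helper: start of the chunk that contains @adistance.
 * Masking the low bits is equivalent to rounding down to a multiple
 * of MEMTIER_CHUNK_SIZE because the chunk size is a power of two.
 */
static inline int tier_start(int adistance)
{
	assert(adistance >= 0);
	return adistance & ~(MEMTIER_CHUNK_SIZE - 1);
}

/* Hypothetical helper: zero-based index of that chunk. */
static inline int tier_index(int adistance)
{
	return tier_start(adistance) >> MEMTIER_CHUNK_BITS;
}
```

With these definitions, the four chunks below the default DRAM distance start at 0, 128, 256 and 384, and `tier_start(MEMTIER_ADISTANCE_DRAM)` is 512, so the default DRAM type lands in the chunk with index 4.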
"Huang, Ying" <ying.huang@intel.com> writes:

> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>
>> [...]
>>
>> This patch configures the range/chunk size to be 128. The default DRAM
>> abstract distance is 512. We can have 4 memory tiers below the default DRAM
>> abstract distance which cover the range 0 - 127, 128 - 255, 256 - 383, 384 - 511.
>> Slower memory devices like persistent memory will have abstract distance below
>> the default DRAM level and hence will be placed in these 4 lower tiers.
>
> For abstract distance, the lower value means higher performance, higher
> value means lower performance.  So the abstract distance of PMEM should
> be larger than that of DRAM.

I noticed that after sending v11 and have already sent v12 with the fix, which
can be found at
https://lore.kernel.org/linux-mm/20220729061349.968148-1-aneesh.kumar@linux.ibm.com

>
>> A kernel parameter is provided to override the default memory tier.
>
> Forget to delete?

Yes. Also fixed in v12.

-aneesh
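
The review point above comes down to the ordering convention: a lower abstract distance means faster memory. The following is a rough sketch of that convention, assuming the v11 values quoted in the thread. `is_faster()` and `EXAMPLE_ADISTANCE_SLOW` are hypothetical names made up for illustration, and the slow value is not taken from v12.

```c
#include <assert.h>
#include <stdbool.h>

#define MEMTIER_CHUNK_BITS	7
/* v11 values as quoted in the thread. */
#define V11_ADISTANCE_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))	/* 512 */
#define V11_ADISTANCE_PMEM	(1 << MEMTIER_CHUNK_BITS)	/* 128 */
/* Hypothetical value for illustration only; not taken from v12. */
#define EXAMPLE_ADISTANCE_SLOW	(V11_ADISTANCE_DRAM * 2)	/* 1024 */

/*
 * Hypothetical helper encoding the convention stated in the review:
 * the lower abstract distance is the faster memory.
 */
static inline bool is_faster(int adistance_a, int adistance_b)
{
	assert(adistance_a >= 0 && adistance_b >= 0);
	return adistance_a < adistance_b;
}
```

Under this convention `is_faster(V11_ADISTANCE_PMEM, V11_ADISTANCE_DRAM)` is true, which is the inconsistency raised in the review. A slower device needs a value above the DRAM distance, as `EXAMPLE_ADISTANCE_SLOW` does.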
Aneesh Kumar K.V wrote:
> [...]
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> new file mode 100644
> index 000000000000..8d7884b7a3f0
> --- /dev/null
> +++ b/include/linux/memory-tiers.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_MEMORY_TIERS_H
> +#define _LINUX_MEMORY_TIERS_H
> +
> +/*
> + * Each tier cover a abstrace distance chunk size of 128
> + */
> +#define MEMTIER_CHUNK_BITS	7
> +#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
> +/*
> + * For now let's have 4 memory tier below default DRAM tier.
> + */
> +#define MEMTIER_ADISTANCE_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))
> +/* leave one tier below this slow pmem */
> +#define MEMTIER_ADISTANCE_PMEM	(1 << MEMTIER_CHUNK_BITS)

Why is memory type encoded in these values? There is no reason to
believe that PMEM is of a lower performance tier than DRAM. Consider
high performance energy backed DRAM that makes it "PMEM", consider CXL
attached DRAM over a switch topology and constrained links that makes it
a lower performance tier than locally attached DRAM. The names should be
associated with tiers that indicate their usage. Something like HOT,
GENERAL, and COLD. Where, for example, HOT is low capacity high
performance compared to the general purpose pool, and COLD is high
capacity low performance intended to offload the general purpose tier.

It does not need to be exactly that ontology, but please try to not
encode policy meaning behind memory types. There has been explicit
effort to avoid that to date because types are fraught for declaring
relative performance characteristics, and the relative performance
changes based on what memory types are assembled in a given system.
Dan Williams <dan.j.williams@intel.com> writes:

> Aneesh Kumar K.V wrote:
>> [...]
>> +#define MEMTIER_ADISTANCE_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))
>> +/* leave one tier below this slow pmem */
>> +#define MEMTIER_ADISTANCE_PMEM	(1 << MEMTIER_CHUNK_BITS)
>
> Why is memory type encoded in these values? There is no reason to
> believe that PMEM is of a lower performance tier than DRAM. Consider
> high performance energy backed DRAM that makes it "PMEM", consider CXL
> attached DRAM over a switch topology and constrained links that makes it
> a lower performance tier than locally attached DRAM. The names should be
> associated with tiers that indicate their usage. Something like HOT,
> GENERAL, and COLD. Where, for example, HOT is low capacity high
> performance compared to the general purpose pool, and COLD is high
> capacity low performance intended to offload the general purpose tier.
>
> It does not need to be exactly that ontology, but please try to not
> encode policy meaning behind memory types. There has been explicit
> effort to avoid that to date because types are fraught for declaring
> relative performance characteristics, and the relative performance
> changes based on what memory types are assembled in a given system.

Yes.  MEMTIER_ADISTANCE_PMEM is something over simplified.  That is only
used in this very first version to make it as simple as possible.  I
think we can come up with something better in the later version.  For
example, identify the abstract distance of a PMEM device based on HMAT,
etc.  And even in this first version, we should put
MEMTIER_ADISTANCE_PMEM in dax/kmem.c.  Because it's just for that
specific type of memory used now, not for all PMEM.

In the current design, memory type is used to report the performance of
the hardware, in terms of abstract distance, per Johannes' suggestion.
Which is an abstraction of memory latency and bandwidth.  Policy is
described via memory tiers.  Several memory types may be put in one
memory tier.  The abstract distance chunk size of the memory tier may
be adjusted according to policy.

Best Regards,
Huang, Ying
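
The split described above can be sketched in a few lines: a memory type only reports an abstract distance, and the tier it lands in depends on the chunk size chosen by policy. This is a userspace illustration under stated assumptions, not code from the series. `struct mem_type` and `tier_start_for()` are hypothetical names, and the chunk size is assumed to be a power of two passed as a bit count.

```c
#include <assert.h>

/* Hypothetical stand-in for a memory type: it only reports performance. */
struct mem_type {
	const char *name;
	int adistance;	/* lower value means higher performance */
};

/*
 * Hypothetical helper: start of the tier that @type falls into when each
 * tier covers (1 << chunk_bits) abstract distance units. The tier layout
 * is the policy knob; the type itself does not change.
 */
static inline int tier_start_for(const struct mem_type *type, int chunk_bits)
{
	assert(type->adistance >= 0);
	assert(chunk_bits >= 0 && chunk_bits < 31);
	return type->adistance & ~((1 << chunk_bits) - 1);
}
```

For example, two types with abstract distances 520 and 600 share the tier starting at 512 when the chunk is 128 wide (`chunk_bits` of 7), but fall into separate tiers starting at 512 and 576 when the chunk is narrowed to 64 (`chunk_bits` of 6).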
Huang, Ying wrote:
> Dan Williams <dan.j.williams@intel.com> writes:
>
> > Aneesh Kumar K.V wrote:
> > [...]
>
> Yes.  MEMTIER_ADISTANCE_PMEM is something over simplified.  That is only
> used in this very first version to make it as simple as possible.

I am failing to see the simplicity of using names that convey a
performance contract that are invalid depending on the system.

> I think we can come up with something better in the later version.
> For example, identify the abstract distance of a PMEM device based on
> HMAT, etc.

Memory tiering has nothing to do with persistence why is PMEM in the
name at all?

> And even in this first version, we should put MEMTIER_ADISTANCE_PMEM
> in dax/kmem.c.  Because it's just for that specific type of memory
> used now, not for all PMEM.

dax/kmem.c also handles HBM and "soft reserved" memory in general. There
is also nothing PMEM specific about the device-dax subsystem.

> In the current design, memory type is used to report the performance of
> the hardware, in terms of abstract distance, per Johannes' suggestion.

That sounds fine, just pick an abstract name, not an explicit memory
type.

> Which is an abstraction of memory latency and bandwidth.  Policy is
> described via memory tiers.  Several memory types may be put in one
> memory tier.  The abstract distance chunk size of the memory tier may
> be adjusted according to policy.

That part all sounds good. That said, I do not see the benefit of
waiting to run away from these inadequate names.
On 8/2/22 9:10 AM, Dan Williams wrote: > Huang, Ying wrote: >> Dan Williams <dan.j.williams@intel.com> writes: >> >>> Aneesh Kumar K.V wrote: >>>> In the current kernel, memory tiers are defined implicitly via a demotion path >>>> relationship between NUMA nodes, which is created during the kernel >>>> initialization and updated when a NUMA node is hot-added or hot-removed. The >>>> current implementation puts all nodes with CPU into the highest tier, and builds >>>> the tier hierarchy tier-by-tier by establishing the per-node demotion targets >>>> based on the distances between nodes. >>>> >>>> This current memory tier kernel implementation needs to be improved for several >>>> important use cases, >>>> >>>> The current tier initialization code always initializes each memory-only NUMA >>>> node into a lower tier. But a memory-only NUMA node may have a high performance >>>> memory device (e.g. a DRAM-backed memory-only node on a virtual machine) that >>>> should be put into a higher tier. >>>> >>>> The current tier hierarchy always puts CPU nodes into the top tier. But on a >>>> system with HBM or GPU devices, the memory-only NUMA nodes mapping these devices >>>> should be in the top tier, and DRAM nodes with CPUs are better to be placed into >>>> the next lower tier. >>>> >>>> With current kernel higher tier node can only be demoted to nodes with shortest >>>> distance on the next lower tier as defined by the demotion path, not any other >>>> node from any lower tier. This strict, demotion order does not work in all use >>>> cases (e.g. some use cases may want to allow cross-socket demotion to another >>>> node in the same demotion tier as a fallback when the preferred demotion node is >>>> out of space), This demotion order is also inconsistent with the page allocation >>>> fallback order when all the nodes in a higher tier are out of space: The page >>>> allocation can fall back to any node from any lower tier, whereas the demotion >>>> order doesn't allow that. 
>>>> >>>> This patch series address the above by defining memory tiers explicitly. >>>> >>>> Linux kernel presents memory devices as NUMA nodes and each memory device is of >>>> a specific type. The memory type of a device is represented by its abstract >>>> distance. A memory tier corresponds to a range of abstract distance. This allows >>>> for classifying memory devices with a specific performance range into a memory >>>> tier. >>>> >>>> This patch configures the range/chunk size to be 128. The default DRAM >>>> abstract distance is 512. We can have 4 memory tiers below the default DRAM >>>> abstract distance which cover the range 0 - 127, 127 - 255, 256- 383, 384 - 511. >>>> Slower memory devices like persistent memory will have abstract distance below >>>> the default DRAM level and hence will be placed in these 4 lower tiers. >>>> >>>> A kernel parameter is provided to override the default memory tier. >>>> >>>> Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com >>>> Link: https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com >>>> >>>> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com> >>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> >>>> --- >>>> include/linux/memory-tiers.h | 17 ++++++ >>>> mm/Makefile | 1 + >>>> mm/memory-tiers.c | 102 +++++++++++++++++++++++++++++++++++ >>>> 3 files changed, 120 insertions(+) >>>> create mode 100644 include/linux/memory-tiers.h >>>> create mode 100644 mm/memory-tiers.c >>>> >>>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h >>>> new file mode 100644 >>>> index 000000000000..8d7884b7a3f0 >>>> --- /dev/null >>>> +++ b/include/linux/memory-tiers.h >>>> @@ -0,0 +1,17 @@ >>>> +/* SPDX-License-Identifier: GPL-2.0 */ >>>> +#ifndef _LINUX_MEMORY_TIERS_H >>>> +#define _LINUX_MEMORY_TIERS_H >>>> + >>>> +/* >>>> + * Each tier cover a abstrace distance chunk size of 128 >>>> + */ >>>> +#define 
MEMTIER_CHUNK_BITS 7 >>>> +#define MEMTIER_CHUNK_SIZE (1 << MEMTIER_CHUNK_BITS) >>>> +/* >>>> + * For now let's have 4 memory tier below default DRAM tier. >>>> + */ >>>> +#define MEMTIER_ADISTANCE_DRAM (1 << (MEMTIER_CHUNK_BITS + 2)) >>>> +/* leave one tier below this slow pmem */ >>>> +#define MEMTIER_ADISTANCE_PMEM (1 << MEMTIER_CHUNK_BITS) >>> >>> Why is memory type encoded in these values? There is no reason to >>> believe that PMEM is of a lower performance tier than DRAM. Consider >>> high performance energy backed DRAM that makes it "PMEM", consider CXL >>> attached DRAM over a switch topology and constrained links that makes it >>> a lower performance tier than locally attached DRAM. The names should be >>> associated with tiers that indicate their usage. Something like HOT, >>> GENERAL, and COLD. Where, for example, HOT is low capacity high >>> performance compared to the general purpose pool, and COLD is high >>> capacity low performance intended to offload the general purpose tier. >>> >>> It does not need to be exactly that ontology, but please try to not >>> encode policy meaning behind memory types. There has been explicit >>> effort to avoid that to date because types are fraught for declaring >>> relative performance characteristics, and the relative performance >>> changes based on what memory types are assembled in a given system. >> >> Yes. MEMTIER_ADISTANCE_PMEM is something over simplified. That is only >> used in this very first version to make it as simple as possible. > > I am failing to see the simplicity of using names that convey a > performance contract that are invalid depending on the system. > >> I think we can come up with something better in the later version. >> For example, identify the abstract distance of a PMEM device based on >> HMAT, etc. > > Memory tiering has nothing to do with persistence why is PMEM in the > name at all? 
>

How about MEMTIER_DEFAULT_DAX_ADISTANCE, with a comment there explaining that if low-level drivers don't initialize a memory_dev_type for a device/NUMA node, dax/kmem will consider the node slower than DRAM?

>> And even in this first version, we should put MEMTIER_ADISTANCE_PMEM
>> in dax/kmem.c. Because it's just for that specific type of memory
>> used now, not for all PMEM.
>
> dax/kmem.c also handles HBM and "soft reserved" memory in general. There
> is also nothing PMEM specific about the device-dax subsystem.
>
>> In the current design, memory type is used to report the performance of
>> the hardware, in terms of abstract distance, per Johannes' suggestion.
>
> That sounds fine, just pick an abstract name, not an explicit memory
> type.
>
>> Which is an abstraction of memory latency and bandwidth. Policy is
>> described via memory tiers. Several memory types may be put in one
>> memory tier. The abstract distance chunk size of the memory tier may
>> be adjusted according to policy.
>
> That part all sounds good. That said, I do not see the benefit of
> waiting to run away from these inadequate names.

-aneesh
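To make the chunk arithmetic from the patch description concrete, here is a minimal userspace sketch, not kernel code. The first three macros are copied from the quoted patch. MEMTIER_DEFAULT_DAX_ADISTANCE is only the name proposed in the reply above; giving it the value the patch currently assigns to MEMTIER_ADISTANCE_PMEM is an assumption. adistance_to_tier() is a hypothetical helper that does not appear in the patch.

```c
#include <assert.h>

/* Copied from the quoted patch. */
#define MEMTIER_CHUNK_BITS	7
#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
#define MEMTIER_ADISTANCE_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))

/*
 * Assumption: the name proposed in the reply, with the value the patch
 * currently gives MEMTIER_ADISTANCE_PMEM.
 */
#define MEMTIER_DEFAULT_DAX_ADISTANCE	(1 << MEMTIER_CHUNK_BITS)

/*
 * Hypothetical helper: index of the chunk an abstract distance falls into.
 * Chunk 0 covers 0-127, chunk 1 covers 128-255, chunk 2 covers 256-383,
 * chunk 3 covers 384-511, and the default DRAM distance (512) starts chunk 4.
 */
int adistance_to_tier(int adistance)
{
	assert(adistance >= 0);
	return adistance >> MEMTIER_CHUNK_BITS;
}
```

With a chunk size of 128 the second range starts at 128, so the four ranges below the default DRAM distance are 0-127, 128-255, 256-383 and 384-511. The default DAX value lands in the second chunk, which leaves one chunk free below it, matching the "leave one tier below" comment in the patch.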
Dan Williams <dan.j.williams@intel.com> writes: > Huang, Ying wrote: >> Dan Williams <dan.j.williams@intel.com> writes: >> >> > Aneesh Kumar K.V wrote: >> >> In the current kernel, memory tiers are defined implicitly via a demotion path >> >> relationship between NUMA nodes, which is created during the kernel >> >> initialization and updated when a NUMA node is hot-added or hot-removed. The >> >> current implementation puts all nodes with CPU into the highest tier, and builds >> >> the tier hierarchy tier-by-tier by establishing the per-node demotion targets >> >> based on the distances between nodes. >> >> >> >> This current memory tier kernel implementation needs to be improved for several >> >> important use cases, >> >> >> >> The current tier initialization code always initializes each memory-only NUMA >> >> node into a lower tier. But a memory-only NUMA node may have a high performance >> >> memory device (e.g. a DRAM-backed memory-only node on a virtual machine) that >> >> should be put into a higher tier. >> >> >> >> The current tier hierarchy always puts CPU nodes into the top tier. But on a >> >> system with HBM or GPU devices, the memory-only NUMA nodes mapping these devices >> >> should be in the top tier, and DRAM nodes with CPUs are better to be placed into >> >> the next lower tier. >> >> >> >> With current kernel higher tier node can only be demoted to nodes with shortest >> >> distance on the next lower tier as defined by the demotion path, not any other >> >> node from any lower tier. This strict, demotion order does not work in all use >> >> cases (e.g. 
some use cases may want to allow cross-socket demotion to another >> >> node in the same demotion tier as a fallback when the preferred demotion node is >> >> out of space), This demotion order is also inconsistent with the page allocation >> >> fallback order when all the nodes in a higher tier are out of space: The page >> >> allocation can fall back to any node from any lower tier, whereas the demotion >> >> order doesn't allow that. >> >> >> >> This patch series address the above by defining memory tiers explicitly. >> >> >> >> Linux kernel presents memory devices as NUMA nodes and each memory device is of >> >> a specific type. The memory type of a device is represented by its abstract >> >> distance. A memory tier corresponds to a range of abstract distance. This allows >> >> for classifying memory devices with a specific performance range into a memory >> >> tier. >> >> >> >> This patch configures the range/chunk size to be 128. The default DRAM >> >> abstract distance is 512. We can have 4 memory tiers below the default DRAM >> >> abstract distance which cover the range 0 - 127, 127 - 255, 256- 383, 384 - 511. >> >> Slower memory devices like persistent memory will have abstract distance below >> >> the default DRAM level and hence will be placed in these 4 lower tiers. >> >> >> >> A kernel parameter is provided to override the default memory tier. 
>> >> >> >> Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com >> >> Link: https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com >> >> >> >> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com> >> >> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> >> >> --- >> >> include/linux/memory-tiers.h | 17 ++++++ >> >> mm/Makefile | 1 + >> >> mm/memory-tiers.c | 102 +++++++++++++++++++++++++++++++++++ >> >> 3 files changed, 120 insertions(+) >> >> create mode 100644 include/linux/memory-tiers.h >> >> create mode 100644 mm/memory-tiers.c >> >> >> >> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h >> >> new file mode 100644 >> >> index 000000000000..8d7884b7a3f0 >> >> --- /dev/null >> >> +++ b/include/linux/memory-tiers.h >> >> @@ -0,0 +1,17 @@ >> >> +/* SPDX-License-Identifier: GPL-2.0 */ >> >> +#ifndef _LINUX_MEMORY_TIERS_H >> >> +#define _LINUX_MEMORY_TIERS_H >> >> + >> >> +/* >> >> + * Each tier cover a abstrace distance chunk size of 128 >> >> + */ >> >> +#define MEMTIER_CHUNK_BITS 7 >> >> +#define MEMTIER_CHUNK_SIZE (1 << MEMTIER_CHUNK_BITS) >> >> +/* >> >> + * For now let's have 4 memory tier below default DRAM tier. >> >> + */ >> >> +#define MEMTIER_ADISTANCE_DRAM (1 << (MEMTIER_CHUNK_BITS + 2)) >> >> +/* leave one tier below this slow pmem */ >> >> +#define MEMTIER_ADISTANCE_PMEM (1 << MEMTIER_CHUNK_BITS) >> > >> > Why is memory type encoded in these values? There is no reason to >> > believe that PMEM is of a lower performance tier than DRAM. Consider >> > high performance energy backed DRAM that makes it "PMEM", consider CXL >> > attached DRAM over a switch topology and constrained links that makes it >> > a lower performance tier than locally attached DRAM. The names should be >> > associated with tiers that indicate their usage. Something like HOT, >> > GENERAL, and COLD. 
Where, for example, HOT is low capacity high
>> > performance compared to the general purpose pool, and COLD is high
>> > capacity low performance intended to offload the general purpose tier.
>> >
>> > It does not need to be exactly that ontology, but please try to not
>> > encode policy meaning behind memory types. There has been explicit
>> > effort to avoid that to date because types are fraught for declaring
>> > relative performance characteristics, and the relative performance
>> > changes based on what memory types are assembled in a given system.
>>
>> Yes. MEMTIER_ADISTANCE_PMEM is something over simplified. That is only
>> used in this very first version to make it as simple as possible.
>
> I am failing to see the simplicity of using names that convey a
> performance contract that are invalid depending on the system.
>
>> I think we can come up with something better in the later version.
>> For example, identify the abstract distance of a PMEM device based on
>> HMAT, etc.
>
> Memory tiering has nothing to do with persistence why is PMEM in the
> name at all?
>
>> And even in this first version, we should put MEMTIER_ADISTANCE_PMEM
>> in dax/kmem.c. Because it's just for that specific type of memory
>> used now, not for all PMEM.
>
> dax/kmem.c also handles HBM and "soft reserved" memory in general. There
> is also nothing PMEM specific about the device-dax subsystem.

Ah... I see the issue here. On the systems we have on hand, dax/kmem.c is used to online PMEM only. Even the "soft reserved" memory is used for PMEM or for simulating PMEM. So, to make the code as simple as possible, we treat all memory devices onlined by dax/kmem as PMEM in the first version, and plan to support more memory types in future versions.

But from what you say above, our assumption is wrong here. dax/kmem.c can already online HBM and other memory devices. If so, how do we distinguish between them, and how do we get the performance characteristics of these devices? Can we start with SLIT?
>> In the current design, memory type is used to report the performance of
>> the hardware, in terms of abstract distance, per Johannes' suggestion.
>
> That sounds fine, just pick an abstract name, not an explicit memory
> type.
>
>> Which is an abstraction of memory latency and bandwidth. Policy is
>> described via memory tiers. Several memory types may be put in one
>> memory tier. The abstract distance chunk size of the memory tier may
>> be adjusted according to policy.
>
> That part all sounds good. That said, I do not see the benefit of
> waiting to run away from these inadequate names.

Good!

Best Regards,
Huang, Ying
On 8/2/22 12:27 PM, Huang, Ying wrote: > Dan Williams <dan.j.williams@intel.com> writes: > >> Huang, Ying wrote: >>> Dan Williams <dan.j.williams@intel.com> writes: >>> >>>> Aneesh Kumar K.V wrote: >>>>> In the current kernel, memory tiers are defined implicitly via a demotion path >>>>> relationship between NUMA nodes, which is created during the kernel >>>>> initialization and updated when a NUMA node is hot-added or hot-removed. The >>>>> current implementation puts all nodes with CPU into the highest tier, and builds >>>>> the tier hierarchy tier-by-tier by establishing the per-node demotion targets >>>>> based on the distances between nodes. >>>>> >>>>> This current memory tier kernel implementation needs to be improved for several >>>>> important use cases, >>>>> >>>>> The current tier initialization code always initializes each memory-only NUMA >>>>> node into a lower tier. But a memory-only NUMA node may have a high performance >>>>> memory device (e.g. a DRAM-backed memory-only node on a virtual machine) that >>>>> should be put into a higher tier. >>>>> >>>>> The current tier hierarchy always puts CPU nodes into the top tier. But on a >>>>> system with HBM or GPU devices, the memory-only NUMA nodes mapping these devices >>>>> should be in the top tier, and DRAM nodes with CPUs are better to be placed into >>>>> the next lower tier. >>>>> >>>>> With current kernel higher tier node can only be demoted to nodes with shortest >>>>> distance on the next lower tier as defined by the demotion path, not any other >>>>> node from any lower tier. This strict, demotion order does not work in all use >>>>> cases (e.g. 
some use cases may want to allow cross-socket demotion to another >>>>> node in the same demotion tier as a fallback when the preferred demotion node is >>>>> out of space), This demotion order is also inconsistent with the page allocation >>>>> fallback order when all the nodes in a higher tier are out of space: The page >>>>> allocation can fall back to any node from any lower tier, whereas the demotion >>>>> order doesn't allow that. >>>>> >>>>> This patch series address the above by defining memory tiers explicitly. >>>>> >>>>> Linux kernel presents memory devices as NUMA nodes and each memory device is of >>>>> a specific type. The memory type of a device is represented by its abstract >>>>> distance. A memory tier corresponds to a range of abstract distance. This allows >>>>> for classifying memory devices with a specific performance range into a memory >>>>> tier. >>>>> >>>>> This patch configures the range/chunk size to be 128. The default DRAM >>>>> abstract distance is 512. We can have 4 memory tiers below the default DRAM >>>>> abstract distance which cover the range 0 - 127, 127 - 255, 256- 383, 384 - 511. >>>>> Slower memory devices like persistent memory will have abstract distance below >>>>> the default DRAM level and hence will be placed in these 4 lower tiers. >>>>> >>>>> A kernel parameter is provided to override the default memory tier. 
>>>>> >>>>> Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com >>>>> Link: https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com >>>>> >>>>> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com> >>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> >>>>> --- >>>>> include/linux/memory-tiers.h | 17 ++++++ >>>>> mm/Makefile | 1 + >>>>> mm/memory-tiers.c | 102 +++++++++++++++++++++++++++++++++++ >>>>> 3 files changed, 120 insertions(+) >>>>> create mode 100644 include/linux/memory-tiers.h >>>>> create mode 100644 mm/memory-tiers.c >>>>> >>>>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h >>>>> new file mode 100644 >>>>> index 000000000000..8d7884b7a3f0 >>>>> --- /dev/null >>>>> +++ b/include/linux/memory-tiers.h >>>>> @@ -0,0 +1,17 @@ >>>>> +/* SPDX-License-Identifier: GPL-2.0 */ >>>>> +#ifndef _LINUX_MEMORY_TIERS_H >>>>> +#define _LINUX_MEMORY_TIERS_H >>>>> + >>>>> +/* >>>>> + * Each tier cover a abstrace distance chunk size of 128 >>>>> + */ >>>>> +#define MEMTIER_CHUNK_BITS 7 >>>>> +#define MEMTIER_CHUNK_SIZE (1 << MEMTIER_CHUNK_BITS) >>>>> +/* >>>>> + * For now let's have 4 memory tier below default DRAM tier. >>>>> + */ >>>>> +#define MEMTIER_ADISTANCE_DRAM (1 << (MEMTIER_CHUNK_BITS + 2)) >>>>> +/* leave one tier below this slow pmem */ >>>>> +#define MEMTIER_ADISTANCE_PMEM (1 << MEMTIER_CHUNK_BITS) >>>> >>>> Why is memory type encoded in these values? There is no reason to >>>> believe that PMEM is of a lower performance tier than DRAM. Consider >>>> high performance energy backed DRAM that makes it "PMEM", consider CXL >>>> attached DRAM over a switch topology and constrained links that makes it >>>> a lower performance tier than locally attached DRAM. The names should be >>>> associated with tiers that indicate their usage. Something like HOT, >>>> GENERAL, and COLD. 
Where, for example, HOT is low capacity high >>>> performance compared to the general purpose pool, and COLD is high >>>> capacity low performance intended to offload the general purpose tier. >>>> >>>> It does not need to be exactly that ontology, but please try to not >>>> encode policy meaning behind memory types. There has been explicit >>>> effort to avoid that to date because types are fraught for declaring >>>> relative performance characteristics, and the relative performance >>>> changes based on what memory types are assembled in a given system. >>> >>> Yes. MEMTIER_ADISTANCE_PMEM is something over simplified. That is only >>> used in this very first version to make it as simple as possible. >> >> I am failing to see the simplicity of using names that convey a >> performance contract that are invalid depending on the system. >> >>> I think we can come up with something better in the later version. >>> For example, identify the abstract distance of a PMEM device based on >>> HMAT, etc. >> >> Memory tiering has nothing to do with persistence why is PMEM in the >> name at all? >> >>> And even in this first version, we should put MEMTIER_ADISTANCE_PMEM >>> in dax/kmem.c. Because it's just for that specific type of memory >>> used now, not for all PMEM. >> >> dax/kmem.c also handles HBM and "soft reserved" memory in general. There >> is also nothing PMEM specific about the device-dax subsystem. > > Ah... I see the issue here. For the systems in our hand, dax/kmem.c is > used to online PMEM only. Even the "soft reserved" memory is used for > PMEM or simulating PMEM too. So to make the code as simple as possible, > we treat all memory devices onlined by dax/kmem as PMEM in the first > version. And plan to support more memory types in the future versions. > > But from your above words, our assumption are wrong here. dax/kmem.c > can online HBM and other memory devices already. 
If so, how do we
> distinguish between them and how to get the performance character of
> these devices? We can start with SLIT?
>

We would let the low-level drivers register memory_dev_types for the NUMA nodes that will be mapped to these devices. That is, papr_scm, ACPI NFIT, or CXL can register a different memory_dev_type based on the device tree, HMAT, or CDAT.

-aneesh
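The registration idea in the reply above can be pictured with a small userspace sketch. Everything here is hypothetical: the struct, the table and both functions are illustrative names and are not the interface of the patch series. It only shows the shape of the proposal, in which a low-level driver records a type for the NUMA node it owns and an unregistered node falls back to a caller-supplied default.

```c
#include <assert.h>

#define SKETCH_MAX_NODES 8

/* Hypothetical stand-in for the memory_dev_type discussed in the thread. */
struct sketch_memory_dev_type {
	int adistance;	/* abstract distance reported for this type */
};

/* One slot per NUMA node; an empty slot means no driver registered a type. */
static const struct sketch_memory_dev_type *node_types[SKETCH_MAX_NODES];

/*
 * Called by a (hypothetical) low-level driver for the node it maps.
 * Returns 0 on success, -1 if the node id is out of range.
 */
int sketch_register_node_type(int node, const struct sketch_memory_dev_type *type)
{
	if (node < 0 || node >= SKETCH_MAX_NODES)
		return -1;
	node_types[node] = type;
	return 0;
}

/* Abstract distance of a node, or fallback_adistance if nothing registered. */
int sketch_node_adistance(int node, int fallback_adistance)
{
	assert(node >= 0 && node < SKETCH_MAX_NODES);
	return node_types[node] ? node_types[node]->adistance : fallback_adistance;
}
```

In this picture the fallback argument plays the role of the dax/kmem default: a node whose driver never registered a type is simply treated as the default, slower-than-DRAM kind.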
Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: > On 8/2/22 12:27 PM, Huang, Ying wrote: >> Dan Williams <dan.j.williams@intel.com> writes: >> >>> Huang, Ying wrote: >>>> Dan Williams <dan.j.williams@intel.com> writes: >>>> >>>>> Aneesh Kumar K.V wrote: >>>>>> In the current kernel, memory tiers are defined implicitly via a demotion path >>>>>> relationship between NUMA nodes, which is created during the kernel >>>>>> initialization and updated when a NUMA node is hot-added or hot-removed. The >>>>>> current implementation puts all nodes with CPU into the highest tier, and builds >>>>>> the tier hierarchy tier-by-tier by establishing the per-node demotion targets >>>>>> based on the distances between nodes. >>>>>> >>>>>> This current memory tier kernel implementation needs to be improved for several >>>>>> important use cases, >>>>>> >>>>>> The current tier initialization code always initializes each memory-only NUMA >>>>>> node into a lower tier. But a memory-only NUMA node may have a high performance >>>>>> memory device (e.g. a DRAM-backed memory-only node on a virtual machine) that >>>>>> should be put into a higher tier. >>>>>> >>>>>> The current tier hierarchy always puts CPU nodes into the top tier. But on a >>>>>> system with HBM or GPU devices, the memory-only NUMA nodes mapping these devices >>>>>> should be in the top tier, and DRAM nodes with CPUs are better to be placed into >>>>>> the next lower tier. >>>>>> >>>>>> With current kernel higher tier node can only be demoted to nodes with shortest >>>>>> distance on the next lower tier as defined by the demotion path, not any other >>>>>> node from any lower tier. This strict, demotion order does not work in all use >>>>>> cases (e.g. 
some use cases may want to allow cross-socket demotion to another >>>>>> node in the same demotion tier as a fallback when the preferred demotion node is >>>>>> out of space), This demotion order is also inconsistent with the page allocation >>>>>> fallback order when all the nodes in a higher tier are out of space: The page >>>>>> allocation can fall back to any node from any lower tier, whereas the demotion >>>>>> order doesn't allow that. >>>>>> >>>>>> This patch series address the above by defining memory tiers explicitly. >>>>>> >>>>>> Linux kernel presents memory devices as NUMA nodes and each memory device is of >>>>>> a specific type. The memory type of a device is represented by its abstract >>>>>> distance. A memory tier corresponds to a range of abstract distance. This allows >>>>>> for classifying memory devices with a specific performance range into a memory >>>>>> tier. >>>>>> >>>>>> This patch configures the range/chunk size to be 128. The default DRAM >>>>>> abstract distance is 512. We can have 4 memory tiers below the default DRAM >>>>>> abstract distance which cover the range 0 - 127, 127 - 255, 256- 383, 384 - 511. >>>>>> Slower memory devices like persistent memory will have abstract distance below >>>>>> the default DRAM level and hence will be placed in these 4 lower tiers. >>>>>> >>>>>> A kernel parameter is provided to override the default memory tier. 
>>>>>> >>>>>> Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com >>>>>> Link: https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com >>>>>> >>>>>> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com> >>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> >>>>>> --- >>>>>> include/linux/memory-tiers.h | 17 ++++++ >>>>>> mm/Makefile | 1 + >>>>>> mm/memory-tiers.c | 102 +++++++++++++++++++++++++++++++++++ >>>>>> 3 files changed, 120 insertions(+) >>>>>> create mode 100644 include/linux/memory-tiers.h >>>>>> create mode 100644 mm/memory-tiers.c >>>>>> >>>>>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h >>>>>> new file mode 100644 >>>>>> index 000000000000..8d7884b7a3f0 >>>>>> --- /dev/null >>>>>> +++ b/include/linux/memory-tiers.h >>>>>> @@ -0,0 +1,17 @@ >>>>>> +/* SPDX-License-Identifier: GPL-2.0 */ >>>>>> +#ifndef _LINUX_MEMORY_TIERS_H >>>>>> +#define _LINUX_MEMORY_TIERS_H >>>>>> + >>>>>> +/* >>>>>> + * Each tier cover a abstrace distance chunk size of 128 >>>>>> + */ >>>>>> +#define MEMTIER_CHUNK_BITS 7 >>>>>> +#define MEMTIER_CHUNK_SIZE (1 << MEMTIER_CHUNK_BITS) >>>>>> +/* >>>>>> + * For now let's have 4 memory tier below default DRAM tier. >>>>>> + */ >>>>>> +#define MEMTIER_ADISTANCE_DRAM (1 << (MEMTIER_CHUNK_BITS + 2)) >>>>>> +/* leave one tier below this slow pmem */ >>>>>> +#define MEMTIER_ADISTANCE_PMEM (1 << MEMTIER_CHUNK_BITS) >>>>> >>>>> Why is memory type encoded in these values? There is no reason to >>>>> believe that PMEM is of a lower performance tier than DRAM. Consider >>>>> high performance energy backed DRAM that makes it "PMEM", consider CXL >>>>> attached DRAM over a switch topology and constrained links that makes it >>>>> a lower performance tier than locally attached DRAM. The names should be >>>>> associated with tiers that indicate their usage. Something like HOT, >>>>> GENERAL, and COLD. 
Where, for example, HOT is low capacity high >>>>> performance compared to the general purpose pool, and COLD is high >>>>> capacity low performance intended to offload the general purpose tier. >>>>> >>>>> It does not need to be exactly that ontology, but please try to not >>>>> encode policy meaning behind memory types. There has been explicit >>>>> effort to avoid that to date because types are fraught for declaring >>>>> relative performance characteristics, and the relative performance >>>>> changes based on what memory types are assembled in a given system. >>>> >>>> Yes. MEMTIER_ADISTANCE_PMEM is something over simplified. That is only >>>> used in this very first version to make it as simple as possible. >>> >>> I am failing to see the simplicity of using names that convey a >>> performance contract that are invalid depending on the system. >>> >>>> I think we can come up with something better in the later version. >>>> For example, identify the abstract distance of a PMEM device based on >>>> HMAT, etc. >>> >>> Memory tiering has nothing to do with persistence why is PMEM in the >>> name at all? >>> >>>> And even in this first version, we should put MEMTIER_ADISTANCE_PMEM >>>> in dax/kmem.c. Because it's just for that specific type of memory >>>> used now, not for all PMEM. >>> >>> dax/kmem.c also handles HBM and "soft reserved" memory in general. There >>> is also nothing PMEM specific about the device-dax subsystem. >> >> Ah... I see the issue here. For the systems in our hand, dax/kmem.c is >> used to online PMEM only. Even the "soft reserved" memory is used for >> PMEM or simulating PMEM too. So to make the code as simple as possible, >> we treat all memory devices onlined by dax/kmem as PMEM in the first >> version. And plan to support more memory types in the future versions. >> >> But from your above words, our assumption are wrong here. dax/kmem.c >> can online HBM and other memory devices already. 
If so, how do we
>> distinguish between them and how to get the performance character of
>> these devices? We can start with SLIT?
>>
>
> We would let low level driver register memory_dev_types for the NUMA nodes
> that will be mapped to these devices. ie, a papr_scm, ACPI NFIT or CXL
> can register different memory_dev_type based on device tree, HMAT or CDAT.

I couldn't find any way for ACPI NFIT to provide performance information, only whether the memory is non-volatile. HMAT or CDAT should help here, but they are not always available. For now, SLIT is all we have on quite a few machines.

I would prefer to create the memory_dev_type in a high-level driver like dax/kmem, which can then query lower-level sources such as SLIT, HMAT, CDAT, etc. for more information, depending on availability.

Best Regards,
Huang, Ying
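The alternative suggested above, a high-level driver that asks whichever sources happen to be available, can be sketched the same way. This is a userspace illustration with hypothetical names. The "available" flag and the reported values are placeholders, since the thread does not define how SLIT, HMAT or CDAT data would be converted to an abstract distance.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical performance source; "available" models whether firmware provided it. */
struct sketch_perf_source {
	const char *name;
	bool available;
	int adistance;	/* placeholder value the source would report */
};

/*
 * Walk the sources in priority order and use the first available one.
 * Returns default_adistance when no source is available.
 */
int sketch_pick_adistance(const struct sketch_perf_source *src, size_t n,
			  int default_adistance)
{
	assert(src != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (src[i].available)
			return src[i].adistance;
	}
	return default_adistance;
}
```

The order of the array encodes the preference, so a machine that only has SLIT still gets an answer, while one with HMAT or CDAT uses the richer data first.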
On 8/4/22 6:26 AM, Huang, Ying wrote: > Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: > >> On 8/2/22 12:27 PM, Huang, Ying wrote: >>> Dan Williams <dan.j.williams@intel.com> writes: >>> >>>> Huang, Ying wrote: >>>>> Dan Williams <dan.j.williams@intel.com> writes: >>>>> >>>>>> Aneesh Kumar K.V wrote: >>>>>>> In the current kernel, memory tiers are defined implicitly via a demotion path >>>>>>> relationship between NUMA nodes, which is created during the kernel >>>>>>> initialization and updated when a NUMA node is hot-added or hot-removed. The >>>>>>> current implementation puts all nodes with CPU into the highest tier, and builds >>>>>>> the tier hierarchy tier-by-tier by establishing the per-node demotion targets >>>>>>> based on the distances between nodes. >>>>>>> >>>>>>> This current memory tier kernel implementation needs to be improved for several >>>>>>> important use cases, >>>>>>> >>>>>>> The current tier initialization code always initializes each memory-only NUMA >>>>>>> node into a lower tier. But a memory-only NUMA node may have a high performance >>>>>>> memory device (e.g. a DRAM-backed memory-only node on a virtual machine) that >>>>>>> should be put into a higher tier. >>>>>>> >>>>>>> The current tier hierarchy always puts CPU nodes into the top tier. But on a >>>>>>> system with HBM or GPU devices, the memory-only NUMA nodes mapping these devices >>>>>>> should be in the top tier, and DRAM nodes with CPUs are better to be placed into >>>>>>> the next lower tier. >>>>>>> >>>>>>> With current kernel higher tier node can only be demoted to nodes with shortest >>>>>>> distance on the next lower tier as defined by the demotion path, not any other >>>>>>> node from any lower tier. This strict, demotion order does not work in all use >>>>>>> cases (e.g. 
some use cases may want to allow cross-socket demotion to another >>>>>>> node in the same demotion tier as a fallback when the preferred demotion node is >>>>>>> out of space), This demotion order is also inconsistent with the page allocation >>>>>>> fallback order when all the nodes in a higher tier are out of space: The page >>>>>>> allocation can fall back to any node from any lower tier, whereas the demotion >>>>>>> order doesn't allow that. >>>>>>> >>>>>>> This patch series address the above by defining memory tiers explicitly. >>>>>>> >>>>>>> Linux kernel presents memory devices as NUMA nodes and each memory device is of >>>>>>> a specific type. The memory type of a device is represented by its abstract >>>>>>> distance. A memory tier corresponds to a range of abstract distance. This allows >>>>>>> for classifying memory devices with a specific performance range into a memory >>>>>>> tier. >>>>>>> >>>>>>> This patch configures the range/chunk size to be 128. The default DRAM >>>>>>> abstract distance is 512. We can have 4 memory tiers below the default DRAM >>>>>>> abstract distance which cover the range 0 - 127, 127 - 255, 256- 383, 384 - 511. >>>>>>> Slower memory devices like persistent memory will have abstract distance below >>>>>>> the default DRAM level and hence will be placed in these 4 lower tiers. >>>>>>> >>>>>>> A kernel parameter is provided to override the default memory tier. 
>>>>>>>
>>>>>>> Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
>>>>>>> Link: https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>>>>>>>
>>>>>>> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
>>>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>>>>> ---
>>>>>>>  include/linux/memory-tiers.h |  17 ++++++
>>>>>>>  mm/Makefile                  |   1 +
>>>>>>>  mm/memory-tiers.c            | 102 +++++++++++++++++++++++++++++++++++
>>>>>>>  3 files changed, 120 insertions(+)
>>>>>>>  create mode 100644 include/linux/memory-tiers.h
>>>>>>>  create mode 100644 mm/memory-tiers.c
>>>>>>>
>>>>>>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>>>>>>> new file mode 100644
>>>>>>> index 000000000000..8d7884b7a3f0
>>>>>>> --- /dev/null
>>>>>>> +++ b/include/linux/memory-tiers.h
>>>>>>> @@ -0,0 +1,17 @@
>>>>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>>>>> +#ifndef _LINUX_MEMORY_TIERS_H
>>>>>>> +#define _LINUX_MEMORY_TIERS_H
>>>>>>> +
>>>>>>> +/*
>>>>>>> + * Each tier cover a abstrace distance chunk size of 128
>>>>>>> + */
>>>>>>> +#define MEMTIER_CHUNK_BITS	7
>>>>>>> +#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
>>>>>>> +/*
>>>>>>> + * For now let's have 4 memory tier below default DRAM tier.
>>>>>>> + */
>>>>>>> +#define MEMTIER_ADISTANCE_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))
>>>>>>> +/* leave one tier below this slow pmem */
>>>>>>> +#define MEMTIER_ADISTANCE_PMEM	(1 << MEMTIER_CHUNK_BITS)
>>>>>>
>>>>>> Why is memory type encoded in these values? There is no reason to
>>>>>> believe that PMEM is of a lower performance tier than DRAM. Consider
>>>>>> high performance energy backed DRAM that makes it "PMEM", consider CXL
>>>>>> attached DRAM over a switch topology and constrained links that makes it
>>>>>> a lower performance tier than locally attached DRAM. The names should be
>>>>>> associated with tiers that indicate their usage. Something like HOT,
>>>>>> GENERAL, and COLD. Where, for example, HOT is low capacity high
>>>>>> performance compared to the general purpose pool, and COLD is high
>>>>>> capacity low performance intended to offload the general purpose tier.
>>>>>>
>>>>>> It does not need to be exactly that ontology, but please try to not
>>>>>> encode policy meaning behind memory types. There has been explicit
>>>>>> effort to avoid that to date because types are fraught for declaring
>>>>>> relative performance characteristics, and the relative performance
>>>>>> changes based on what memory types are assembled in a given system.
>>>>>
>>>>> Yes. MEMTIER_ADISTANCE_PMEM is something over simplified. That is only
>>>>> used in this very first version to make it as simple as possible.
>>>>
>>>> I am failing to see the simplicity of using names that convey a
>>>> performance contract that are invalid depending on the system.
>>>>
>>>>> I think we can come up with something better in the later version.
>>>>> For example, identify the abstract distance of a PMEM device based on
>>>>> HMAT, etc.
>>>>
>>>> Memory tiering has nothing to do with persistence why is PMEM in the
>>>> name at all?
>>>>
>>>>> And even in this first version, we should put MEMTIER_ADISTANCE_PMEM
>>>>> in dax/kmem.c. Because it's just for that specific type of memory
>>>>> used now, not for all PMEM.
>>>>
>>>> dax/kmem.c also handles HBM and "soft reserved" memory in general. There
>>>> is also nothing PMEM specific about the device-dax subsystem.
>>>
>>> Ah... I see the issue here. For the systems in our hand, dax/kmem.c is
>>> used to online PMEM only. Even the "soft reserved" memory is used for
>>> PMEM or simulating PMEM too. So to make the code as simple as possible,
>>> we treat all memory devices onlined by dax/kmem as PMEM in the first
>>> version. And plan to support more memory types in the future versions.
>>>
>>> But from your above words, our assumption are wrong here. dax/kmem.c
>>> can online HBM and other memory devices already. If so, how do we
>>> distinguish between them and how to get the performance character of
>>> these devices? We can start with SLIT?
>>>
>>
>> We would let low level driver register memory_dev_types for the NUMA nodes
>> that will be mapped to these devices. ie, a papr_scm, ACPI NFIT or CXL
>> can register different memory_dev_type based on device tree, HMAT or CDAT.
>
> I didn't find ACPI NFIT can provide any performance information, just
> whether it's non-volatile. HMAT or CDAT should help here, but it's not
> available always. For now, what we have is just SLIT at least for quite
> some machines.
>

The lower level driver that is creating the nvdimm regions can assign a
memory type to the numa node which it associates with the region. For now,
drivers like papr_scm do that on ppc64. When it associates a numa node to
nvdimm regions, it can query every detail available (device tree
in case of papr_scm, can be HMAT/SLIT or CDAT) to associate the NUMA node
to a memory type.

> I prefer to create memory_dev_type in high level driver like dax/kmem.
> And it may query low level driver like SLIT, HMAT, CDAT, etc for more
> information based on availability etc.
>
> Best Regards,
> Huang, Ying

Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:

> On 8/4/22 6:26 AM, Huang, Ying wrote:
>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>
>>> On 8/2/22 12:27 PM, Huang, Ying wrote:
>>>> Dan Williams <dan.j.williams@intel.com> writes:
>>>>
>>>>> Huang, Ying wrote:
>>>>>> Dan Williams <dan.j.williams@intel.com> writes:
>>>>>>
>>>>>>> Aneesh Kumar K.V wrote:
>>>>>>>> In the current kernel, memory tiers are defined implicitly via a demotion path
>>>>>>>> relationship between NUMA nodes, which is created during the kernel
>>>>>>>> initialization and updated when a NUMA node is hot-added or hot-removed. The
>>>>>>>> current implementation puts all nodes with CPU into the highest tier, and builds
>>>>>>>> the tier hierarchy tier-by-tier by establishing the per-node demotion targets
>>>>>>>> based on the distances between nodes.
>>>>>>>>
>>>>>>>> This current memory tier kernel implementation needs to be improved for several
>>>>>>>> important use cases,
>>>>>>>>
>>>>>>>> The current tier initialization code always initializes each memory-only NUMA
>>>>>>>> node into a lower tier. But a memory-only NUMA node may have a high performance
>>>>>>>> memory device (e.g. a DRAM-backed memory-only node on a virtual machine) that
>>>>>>>> should be put into a higher tier.
>>>>>>>>
>>>>>>>> The current tier hierarchy always puts CPU nodes into the top tier. But on a
>>>>>>>> system with HBM or GPU devices, the memory-only NUMA nodes mapping these devices
>>>>>>>> should be in the top tier, and DRAM nodes with CPUs are better to be placed into
>>>>>>>> the next lower tier.
>>>>>>>>
>>>>>>>> With current kernel higher tier node can only be demoted to nodes with shortest
>>>>>>>> distance on the next lower tier as defined by the demotion path, not any other
>>>>>>>> node from any lower tier. This strict, demotion order does not work in all use
>>>>>>>> cases (e.g. some use cases may want to allow cross-socket demotion to another
>>>>>>>> node in the same demotion tier as a fallback when the preferred demotion node is
>>>>>>>> out of space), This demotion order is also inconsistent with the page allocation
>>>>>>>> fallback order when all the nodes in a higher tier are out of space: The page
>>>>>>>> allocation can fall back to any node from any lower tier, whereas the demotion
>>>>>>>> order doesn't allow that.
>>>>>>>>
>>>>>>>> This patch series address the above by defining memory tiers explicitly.
>>>>>>>>
>>>>>>>> Linux kernel presents memory devices as NUMA nodes and each memory device is of
>>>>>>>> a specific type. The memory type of a device is represented by its abstract
>>>>>>>> distance. A memory tier corresponds to a range of abstract distance. This allows
>>>>>>>> for classifying memory devices with a specific performance range into a memory
>>>>>>>> tier.
>>>>>>>>
>>>>>>>> This patch configures the range/chunk size to be 128. The default DRAM
>>>>>>>> abstract distance is 512. We can have 4 memory tiers below the default DRAM
>>>>>>>> abstract distance which cover the range 0 - 127, 127 - 255, 256- 383, 384 - 511.
>>>>>>>> Slower memory devices like persistent memory will have abstract distance below
>>>>>>>> the default DRAM level and hence will be placed in these 4 lower tiers.
>>>>>>>>
>>>>>>>> A kernel parameter is provided to override the default memory tier.
>>>>>>>>
>>>>>>>> Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
>>>>>>>> Link: https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>>>>>>>>
>>>>>>>> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
>>>>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>>>>>> ---
>>>>>>>>  include/linux/memory-tiers.h |  17 ++++++
>>>>>>>>  mm/Makefile                  |   1 +
>>>>>>>>  mm/memory-tiers.c            | 102 +++++++++++++++++++++++++++++++++++
>>>>>>>>  3 files changed, 120 insertions(+)
>>>>>>>>  create mode 100644 include/linux/memory-tiers.h
>>>>>>>>  create mode 100644 mm/memory-tiers.c
>>>>>>>>
>>>>>>>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>>>>>>>> new file mode 100644
>>>>>>>> index 000000000000..8d7884b7a3f0
>>>>>>>> --- /dev/null
>>>>>>>> +++ b/include/linux/memory-tiers.h
>>>>>>>> @@ -0,0 +1,17 @@
>>>>>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>>>>>> +#ifndef _LINUX_MEMORY_TIERS_H
>>>>>>>> +#define _LINUX_MEMORY_TIERS_H
>>>>>>>> +
>>>>>>>> +/*
>>>>>>>> + * Each tier cover a abstrace distance chunk size of 128
>>>>>>>> + */
>>>>>>>> +#define MEMTIER_CHUNK_BITS	7
>>>>>>>> +#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
>>>>>>>> +/*
>>>>>>>> + * For now let's have 4 memory tier below default DRAM tier.
>>>>>>>> + */
>>>>>>>> +#define MEMTIER_ADISTANCE_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))
>>>>>>>> +/* leave one tier below this slow pmem */
>>>>>>>> +#define MEMTIER_ADISTANCE_PMEM	(1 << MEMTIER_CHUNK_BITS)
>>>>>>>
>>>>>>> Why is memory type encoded in these values? There is no reason to
>>>>>>> believe that PMEM is of a lower performance tier than DRAM. Consider
>>>>>>> high performance energy backed DRAM that makes it "PMEM", consider CXL
>>>>>>> attached DRAM over a switch topology and constrained links that makes it
>>>>>>> a lower performance tier than locally attached DRAM. The names should be
>>>>>>> associated with tiers that indicate their usage. Something like HOT,
>>>>>>> GENERAL, and COLD. Where, for example, HOT is low capacity high
>>>>>>> performance compared to the general purpose pool, and COLD is high
>>>>>>> capacity low performance intended to offload the general purpose tier.
>>>>>>>
>>>>>>> It does not need to be exactly that ontology, but please try to not
>>>>>>> encode policy meaning behind memory types. There has been explicit
>>>>>>> effort to avoid that to date because types are fraught for declaring
>>>>>>> relative performance characteristics, and the relative performance
>>>>>>> changes based on what memory types are assembled in a given system.
>>>>>>
>>>>>> Yes. MEMTIER_ADISTANCE_PMEM is something over simplified. That is only
>>>>>> used in this very first version to make it as simple as possible.
>>>>>
>>>>> I am failing to see the simplicity of using names that convey a
>>>>> performance contract that are invalid depending on the system.
>>>>>
>>>>>> I think we can come up with something better in the later version.
>>>>>> For example, identify the abstract distance of a PMEM device based on
>>>>>> HMAT, etc.
>>>>>
>>>>> Memory tiering has nothing to do with persistence why is PMEM in the
>>>>> name at all?
>>>>>
>>>>>> And even in this first version, we should put MEMTIER_ADISTANCE_PMEM
>>>>>> in dax/kmem.c. Because it's just for that specific type of memory
>>>>>> used now, not for all PMEM.
>>>>>
>>>>> dax/kmem.c also handles HBM and "soft reserved" memory in general. There
>>>>> is also nothing PMEM specific about the device-dax subsystem.
>>>>
>>>> Ah... I see the issue here. For the systems in our hand, dax/kmem.c is
>>>> used to online PMEM only. Even the "soft reserved" memory is used for
>>>> PMEM or simulating PMEM too. So to make the code as simple as possible,
>>>> we treat all memory devices onlined by dax/kmem as PMEM in the first
>>>> version. And plan to support more memory types in the future versions.
>>>>
>>>> But from your above words, our assumption are wrong here. dax/kmem.c
>>>> can online HBM and other memory devices already. If so, how do we
>>>> distinguish between them and how to get the performance character of
>>>> these devices? We can start with SLIT?
>>>>
>>>
>>> We would let low level driver register memory_dev_types for the NUMA nodes
>>> that will be mapped to these devices. ie, a papr_scm, ACPI NFIT or CXL
>>> can register different memory_dev_type based on device tree, HMAT or CDAT.
>>
>> I didn't find ACPI NFIT can provide any performance information, just
>> whether it's non-volatile. HMAT or CDAT should help here, but it's not
>> available always. For now, what we have is just SLIT at least for quite
>> some machines.
>>
>
>
> The lower level driver that is creating the nvdimm regions can assign a
> memory type to the numa node which it associates with the region. For now,
> drivers like papr_scm do that on ppc64. When it associates a numa node to
> nvdimm regions, it can query every detail available (device tree
> in case of papr_scm, can be HMAT/SLIT or CDAT) to associate the NUMA node
> to a memory type.

If we have only one information source, it's OK to create all memory
type with this source. But if we have multiple sources, we need a
mechanism to coordinate among these sources. It gives us good
flexibility to create memory types in driver. Because drivers can use
any information sources.

Best Regards,
Huang, Ying

>> I prefer to create memory_dev_type in high level driver like dax/kmem.
>> And it may query low level driver like SLIT, HMAT, CDAT, etc for more
>> information based on availability etc.
>>
>> Best Regards,
>> Huang, Ying

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
new file mode 100644
index 000000000000..8d7884b7a3f0
--- /dev/null
+++ b/include/linux/memory-tiers.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MEMORY_TIERS_H
+#define _LINUX_MEMORY_TIERS_H
+
+/*
+ * Each tier cover a abstrace distance chunk size of 128
+ */
+#define MEMTIER_CHUNK_BITS	7
+#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
+/*
+ * For now let's have 4 memory tier below default DRAM tier.
+ */
+#define MEMTIER_ADISTANCE_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))
+/* leave one tier below this slow pmem */
+#define MEMTIER_ADISTANCE_PMEM	(1 << MEMTIER_CHUNK_BITS)
+
+#endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/Makefile b/mm/Makefile
index 6f9ffa968a1a..d30acebc2164 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
 obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMTEST) += memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_NUMA) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
new file mode 100644
index 000000000000..01cfd514c192
--- /dev/null
+++ b/mm/memory-tiers.c
@@ -0,0 +1,102 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/types.h>
+#include <linux/nodemask.h>
+#include <linux/slab.h>
+#include <linux/lockdep.h>
+#include <linux/memory-tiers.h>
+
+struct memory_tier {
+	/* hierarchy of memory tiers */
+	struct list_head list;
+	/* list of all memory types part of this tier */
+	struct list_head memory_types;
+	/*
+	 * start value of abstract distance. memory tier maps
+	 * an abstract distance range,
+	 * adistance_start .. adistance_start + MEMTIER_CHUNK_SIZE
+	 */
+	int adistance_start;
+};
+
+struct memory_dev_type {
+	/* list of memory types that are are part of same tier as this type */
+	struct list_head tier_sibiling;
+	/* abstract distance for this specific memory type */
+	int adistance;
+	/* Nodes of same abstract distance */
+	nodemask_t nodes;
+	struct memory_tier *memtier;
+};
+
+static DEFINE_MUTEX(memory_tier_lock);
+static LIST_HEAD(memory_tiers);
+struct memory_dev_type *node_memory_types[MAX_NUMNODES];
+/*
+ * For now let's have 4 memory tier below default DRAM tier.
+ */
+static struct memory_dev_type default_dram_type  = {
+	.adistance = MEMTIER_ADISTANCE_DRAM,
+	.tier_sibiling = LIST_HEAD_INIT(default_dram_type.tier_sibiling),
+};
+
+static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memtype)
+{
+	bool found_slot = false;
+	struct memory_tier *memtier, *new_memtier;
+	int adistance = memtype->adistance;
+	unsigned int memtier_adistance_chunk_size = MEMTIER_CHUNK_SIZE;
+
+	lockdep_assert_held_once(&memory_tier_lock);
+
+	/*
+	 * If the memtype is already part of a memory tier,
+	 * just return that.
+	 */
+	if (memtype->memtier)
+		return memtype->memtier;
+
+	adistance = round_down(adistance, memtier_adistance_chunk_size);
+	list_for_each_entry(memtier, &memory_tiers, list) {
+		if (adistance == memtier->adistance_start) {
+			memtype->memtier = memtier;
+			list_add(&memtype->tier_sibiling, &memtier->memory_types);
+			return memtier;
+		} else if (adistance < memtier->adistance_start) {
+			found_slot = true;
+			break;
+		}
+	}
+
+	new_memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
+	if (!new_memtier)
+		return ERR_PTR(-ENOMEM);
+
+	new_memtier->adistance_start = adistance;
+	INIT_LIST_HEAD(&new_memtier->list);
+	INIT_LIST_HEAD(&new_memtier->memory_types);
+	if (found_slot)
+		list_add_tail(&new_memtier->list, &memtier->list);
+	else
+		list_add_tail(&new_memtier->list, &memory_tiers);
+	memtype->memtier = new_memtier;
+	list_add(&memtype->tier_sibiling, &new_memtier->memory_types);
+	return new_memtier;
+}
+
+static int __init memory_tier_init(void)
+{
+	struct memory_tier *memtier;
+
+	mutex_lock(&memory_tier_lock);
+	/* CPU only nodes are not part of memory tiers. */
+	default_dram_type.nodes = node_states[N_MEMORY];
+
+	memtier = find_create_memory_tier(&default_dram_type);
+	if (IS_ERR(memtier))
+		panic("%s() failed to register memory tier: %ld\n",
+		      __func__, PTR_ERR(memtier));
+	mutex_unlock(&memory_tier_lock);
+
+	return 0;
+}
+subsys_initcall(memory_tier_init);