
[v12,1/8] mm/demotion: Add support for explicit memory tiers

Message ID 20220729061349.968148-2-aneesh.kumar@linux.ibm.com (mailing list archive)
State New
Series mm/demotion: Memory tiers and demotion

Commit Message

Aneesh Kumar K.V July 29, 2022, 6:13 a.m. UTC
In the current kernel, memory tiers are defined implicitly via a demotion path
relationship between NUMA nodes, which is created during kernel
initialization and updated when a NUMA node is hot-added or hot-removed. The
current implementation puts all nodes with CPUs into the highest tier, and builds
the tier hierarchy tier-by-tier by establishing the per-node demotion targets
based on the distances between nodes.

The current memory tier kernel implementation needs to be improved for several
important use cases:

The current tier initialization code always initializes each memory-only NUMA
node into a lower tier. But a memory-only NUMA node may have a high performance
memory device (e.g. a DRAM-backed memory-only node on a virtual machine) that
should be put into a higher tier.

The current tier hierarchy always puts CPU nodes into the top tier. But on a
system with HBM or GPU devices, the memory-only NUMA nodes mapping these devices
should be in the top tier, and DRAM nodes with CPUs are better placed into
the next lower tier.

With the current kernel, a higher tier node can only be demoted to nodes with the
shortest distance on the next lower tier as defined by the demotion path, not to
any other node from any lower tier. This strict demotion order does not work in
all use cases (e.g. some use cases may want to allow cross-socket demotion to
another node in the same demotion tier as a fallback when the preferred demotion
node is out of space). This demotion order is also inconsistent with the page
allocation fallback order when all the nodes in a higher tier are out of space:
the page allocation can fall back to any node from any lower tier, whereas the
demotion order doesn't allow that.

This patch series addresses the above by defining memory tiers explicitly.

The Linux kernel presents memory devices as NUMA nodes and each memory device is of
a specific type. The memory type of a device is represented by its abstract
distance. A memory tier corresponds to a range of abstract distance. This allows
for classifying memory devices with a specific performance range into a memory
tier.

This patch configures the range/chunk size to be 128. The default DRAM
abstract distance is 512. We can have 4 memory tiers below the default DRAM
abstract distance which cover the ranges 0 - 127, 128 - 255, 256 - 383, 384 - 511.
Slower memory devices like persistent memory will have an abstract distance higher
than the default DRAM level.
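
For illustration, the tier of a device is simply its abstract distance rounded
down to a chunk boundary. A minimal sketch using the constants introduced here
(memtier_start() is a hypothetical helper; the patch itself open-codes this via
round_down() in find_create_memory_tier()):

	#define MEMTIER_CHUNK_BITS	7
	#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)	/* 128 */
	#define MEMTIER_ADISTANCE_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))	/* 512 */

	/* start of the abstract distance chunk (tier) holding a device */
	static inline int memtier_start(int adistance)
	{
		return adistance & ~(MEMTIER_CHUNK_SIZE - 1);	/* round_down */
	}

	/* memtier_start(512) == 512 (default DRAM tier); memtier_start(300) == 256 */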

Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/memory-tiers.h |  16 ++++++
 mm/Makefile                  |   1 +
 mm/memory-tiers.c            | 107 +++++++++++++++++++++++++++++++++++
 3 files changed, 124 insertions(+)
 create mode 100644 include/linux/memory-tiers.h
 create mode 100644 mm/memory-tiers.c

Comments

kernel test robot July 31, 2022, 9:54 a.m. UTC | #1
Hi "Aneesh,

I love your patch! Perhaps something to improve:

[auto build test WARNING on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Aneesh-Kumar-K-V/mm-demotion-Memory-tiers-and-demotion/20220729-141604
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
config: x86_64-allmodconfig
compiler: gcc-11 (Debian 11.3.0-3) 11.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.4-39-gce1a6720-dirty
        # https://github.com/intel-lab-lkp/linux/commit/ffb16b158fbcab09eec35d082fd8696bb3e649da
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Aneesh-Kumar-K-V/mm-demotion-Memory-tiers-and-demotion/20220729-141604
        git checkout ffb16b158fbcab09eec35d082fd8696bb3e649da
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

sparse warnings: (new ones prefixed by >>)
>> mm/memory-tiers.c:33:24: sparse: sparse: symbol 'node_memory_types' was not declared. Should it be static?
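
If node_memory_types really has no user outside mm/memory-tiers.c (true for this
patch taken alone; whether a later patch in the series needs the symbol exported
is an open question), the fix would presumably be a one-line linkage change:

	-struct memory_dev_type *node_memory_types[MAX_NUMNODES];
	+static struct memory_dev_type *node_memory_types[MAX_NUMNODES];
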
Huang, Ying Aug. 1, 2022, 2:37 a.m. UTC | #2
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> In the current kernel, memory tiers are defined implicitly via a demotion path
> relationship between NUMA nodes, which is created during kernel
> initialization and updated when a NUMA node is hot-added or hot-removed. The
> current implementation puts all nodes with CPUs into the highest tier, and builds
> the tier hierarchy tier-by-tier by establishing the per-node demotion targets
> based on the distances between nodes.
>
> The current memory tier kernel implementation needs to be improved for several
> important use cases:
>
> The current tier initialization code always initializes each memory-only NUMA
> node into a lower tier. But a memory-only NUMA node may have a high performance
> memory device (e.g. a DRAM-backed memory-only node on a virtual machine) that
> should be put into a higher tier.
>
> The current tier hierarchy always puts CPU nodes into the top tier. But on a
> system with HBM or GPU devices, the memory-only NUMA nodes mapping these devices
> should be in the top tier, and DRAM nodes with CPUs are better placed into
> the next lower tier.
>
> With the current kernel, a higher tier node can only be demoted to nodes with the
> shortest distance on the next lower tier as defined by the demotion path, not to
> any other node from any lower tier. This strict demotion order does not work in
> all use cases (e.g. some use cases may want to allow cross-socket demotion to
> another node in the same demotion tier as a fallback when the preferred demotion
> node is out of space). This demotion order is also inconsistent with the page
> allocation fallback order when all the nodes in a higher tier are out of space:
> the page allocation can fall back to any node from any lower tier, whereas the
> demotion order doesn't allow that.
>
> This patch series addresses the above by defining memory tiers explicitly.
>
> The Linux kernel presents memory devices as NUMA nodes and each memory device is of
> a specific type. The memory type of a device is represented by its abstract
> distance. A memory tier corresponds to a range of abstract distance. This allows
> for classifying memory devices with a specific performance range into a memory
> tier.
>
> This patch configures the range/chunk size to be 128. The default DRAM
> abstract distance is 512. We can have 4 memory tiers below the default DRAM
                                                       ~~~~~

above?

> abstract distance which cover the ranges 0 - 127, 128 - 255, 256 - 383, 384 - 511.
> Slower memory devices like persistent memory will have an abstract distance higher
> than the default DRAM level.
>
> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  include/linux/memory-tiers.h |  16 ++++++
>  mm/Makefile                  |   1 +
>  mm/memory-tiers.c            | 107 +++++++++++++++++++++++++++++++++++
>  3 files changed, 124 insertions(+)
>  create mode 100644 include/linux/memory-tiers.h
>  create mode 100644 mm/memory-tiers.c
>
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> new file mode 100644
> index 000000000000..9238c3291aaf
> --- /dev/null
> +++ b/include/linux/memory-tiers.h
> @@ -0,0 +1,16 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_MEMORY_TIERS_H
> +#define _LINUX_MEMORY_TIERS_H
> +
> +/*
> + * Each tier covers an abstract distance chunk of size 128
> + */
> +#define MEMTIER_CHUNK_BITS	7
> +#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
> +/*
> + * Smaller abstract distance values imply faster (higher) memory tiers.
> + */
> +#define MEMTIER_ADISTANCE_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))
> +#define MEMTIER_ADISTANCE_PMEM	(1 << (MEMTIER_CHUNK_BITS + 3))

Not a big issue, but I find it easier to understand with the following format,

#define MEMTIER_ADISTANCE_DRAM	\
        (4 * MEMTIER_CHUNK_SIZE + MEMTIER_CHUNK_SIZE / 2)
#define MEMTIER_ADISTANCE_PMEM	\
        (8 * MEMTIER_CHUNK_SIZE + MEMTIER_CHUNK_SIZE / 2)

And it appears better to put the predefined abstract distance at the
middle of the range.
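
(Expanded, those work out to MEMTIER_ADISTANCE_DRAM = 4 * 128 + 64 = 576 and
MEMTIER_ADISTANCE_PMEM = 8 * 128 + 64 = 1088, i.e. the middle of the 512..639
and 1024..1151 chunks rather than their lower boundaries.)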

Best Regards,
Huang, Ying

> +
> +#endif  /* _LINUX_MEMORY_TIERS_H */
> diff --git a/mm/Makefile b/mm/Makefile
> index 6f9ffa968a1a..d30acebc2164 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
>  obj-$(CONFIG_FAILSLAB) += failslab.o
>  obj-$(CONFIG_MEMTEST)		+= memtest.o
>  obj-$(CONFIG_MIGRATION) += migrate.o
> +obj-$(CONFIG_NUMA) += memory-tiers.o
>  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> new file mode 100644
> index 000000000000..60f82667d942
> --- /dev/null
> +++ b/mm/memory-tiers.c
> @@ -0,0 +1,107 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/types.h>
> +#include <linux/nodemask.h>
> +#include <linux/slab.h>
> +#include <linux/lockdep.h>
> +#include <linux/memory-tiers.h>
> +
> +struct memory_tier {
> +	/* hierarchy of memory tiers */
> +	struct list_head list;
> +	/* list of all memory types part of this tier */
> +	struct list_head memory_types;
> +	/*
> +	 * start value of abstract distance. A memory tier maps
> +	 * an abstract distance range,
> +	 * adistance_start .. adistance_start + MEMTIER_CHUNK_SIZE
> +	 */
> +	int adistance_start;
> +};
> +
> +struct memory_dev_type {
> +	/* list of memory types that are part of the same tier as this type */
> +	struct list_head tier_sibiling;
> +	/* abstract distance for this specific memory type */
> +	int adistance;
> +	/* Nodes of same abstract distance */
> +	nodemask_t nodes;
> +	struct memory_tier *memtier;
> +};
> +
> +static DEFINE_MUTEX(memory_tier_lock);
> +static LIST_HEAD(memory_tiers);
> +struct memory_dev_type *node_memory_types[MAX_NUMNODES];
> +/*
> + * For now let's have 4 memory tiers below the default DRAM tier.
> + */
> +static struct memory_dev_type default_dram_type  = {
> +	.adistance = MEMTIER_ADISTANCE_DRAM,
> +	.tier_sibiling = LIST_HEAD_INIT(default_dram_type.tier_sibiling),
> +};
> +
> +static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memtype)
> +{
> +	bool found_slot = false;
> +	struct memory_tier *memtier, *new_memtier;
> +	int adistance = memtype->adistance;
> +	unsigned int memtier_adistance_chunk_size = MEMTIER_CHUNK_SIZE;
> +
> +	lockdep_assert_held_once(&memory_tier_lock);
> +
> +	/*
> +	 * If the memtype is already part of a memory tier,
> +	 * just return that.
> +	 */
> +	if (memtype->memtier)
> +		return memtype->memtier;
> +
> +	adistance = round_down(adistance, memtier_adistance_chunk_size);
> +	list_for_each_entry(memtier, &memory_tiers, list) {
> +		if (adistance == memtier->adistance_start) {
> +			memtype->memtier = memtier;
> +			list_add(&memtype->tier_sibiling, &memtier->memory_types);
> +			return memtier;
> +		} else if (adistance < memtier->adistance_start) {
> +			found_slot = true;
> +			break;
> +		}
> +	}
> +
> +	new_memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
> +	if (!new_memtier)
> +		return ERR_PTR(-ENOMEM);
> +
> +	new_memtier->adistance_start = adistance;
> +	INIT_LIST_HEAD(&new_memtier->list);
> +	INIT_LIST_HEAD(&new_memtier->memory_types);
> +	if (found_slot)
> +		list_add_tail(&new_memtier->list, &memtier->list);
> +	else
> +		list_add_tail(&new_memtier->list, &memory_tiers);
> +	memtype->memtier = new_memtier;
> +	list_add(&memtype->tier_sibiling, &new_memtier->memory_types);
> +	return new_memtier;
> +}
> +
> +static int __init memory_tier_init(void)
> +{
> +	int node;
> +	struct memory_tier *memtier;
> +
> +	mutex_lock(&memory_tier_lock);
> +	/* CPU only nodes are not part of memory tiers. */
> +	default_dram_type.nodes = node_states[N_MEMORY];
> +
> +	memtier = find_create_memory_tier(&default_dram_type);
> +	if (IS_ERR(memtier))
> +		panic("%s() failed to register memory tier: %ld\n",
> +		      __func__, PTR_ERR(memtier));
> +
> +	for_each_node_state(node, N_MEMORY)
> +		node_memory_types[node] = &default_dram_type;
> +
> +	mutex_unlock(&memory_tier_lock);
> +
> +	return 0;
> +}
> +subsys_initcall(memory_tier_init);
Aneesh Kumar K.V Aug. 1, 2022, 4:47 a.m. UTC | #3
On 8/1/22 8:07 AM, Huang, Ying wrote:
> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> 
>> In the current kernel, memory tiers are defined implicitly via a demotion path
>> relationship between NUMA nodes, which is created during kernel
>> initialization and updated when a NUMA node is hot-added or hot-removed. The
>> current implementation puts all nodes with CPUs into the highest tier, and builds
>> the tier hierarchy tier-by-tier by establishing the per-node demotion targets
>> based on the distances between nodes.
>>
>> The current memory tier kernel implementation needs to be improved for several
>> important use cases:
>>
>> The current tier initialization code always initializes each memory-only NUMA
>> node into a lower tier. But a memory-only NUMA node may have a high performance
>> memory device (e.g. a DRAM-backed memory-only node on a virtual machine) that
>> should be put into a higher tier.
>>
>> The current tier hierarchy always puts CPU nodes into the top tier. But on a
>> system with HBM or GPU devices, the memory-only NUMA nodes mapping these devices
>> should be in the top tier, and DRAM nodes with CPUs are better placed into
>> the next lower tier.
>>
>> With the current kernel, a higher tier node can only be demoted to nodes with the
>> shortest distance on the next lower tier as defined by the demotion path, not to
>> any other node from any lower tier. This strict demotion order does not work in
>> all use cases (e.g. some use cases may want to allow cross-socket demotion to
>> another node in the same demotion tier as a fallback when the preferred demotion
>> node is out of space). This demotion order is also inconsistent with the page
>> allocation fallback order when all the nodes in a higher tier are out of space:
>> the page allocation can fall back to any node from any lower tier, whereas the
>> demotion order doesn't allow that.
>>
>> This patch series addresses the above by defining memory tiers explicitly.
>>
>> The Linux kernel presents memory devices as NUMA nodes and each memory device is of
>> a specific type. The memory type of a device is represented by its abstract
>> distance. A memory tier corresponds to a range of abstract distance. This allows
>> for classifying memory devices with a specific performance range into a memory
>> tier.
>>
>> This patch configures the range/chunk size to be 128. The default DRAM
>> abstract distance is 512. We can have 4 memory tiers below the default DRAM
>                                                        ~~~~~
> 
> above?

Updated the above as below.


This patch configures the range/chunk size to be 128. The default DRAM abstract
distance is 512. We can have 4 memory tiers below the default DRAM with abstract
distance range 0 - 127, 128 - 255, 256 - 383, 384 - 511. Faster memory devices
can be placed in these faster (higher) memory tiers. Slower memory devices like
persistent memory will have abstract distance higher than the default DRAM
level.




> 
>> abstract distance which cover the ranges 0 - 127, 128 - 255, 256 - 383, 384 - 511.
>> Slower memory devices like persistent memory will have an abstract distance higher
>> than the default DRAM level.
>>

-aneesh

Patch

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
new file mode 100644
index 000000000000..9238c3291aaf
--- /dev/null
+++ b/include/linux/memory-tiers.h
@@ -0,0 +1,16 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MEMORY_TIERS_H
+#define _LINUX_MEMORY_TIERS_H
+
+/*
+ * Each tier covers an abstract distance chunk of size 128
+ */
+#define MEMTIER_CHUNK_BITS	7
+#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
+/*
+ * Smaller abstract distance values imply faster (higher) memory tiers.
+ */
+#define MEMTIER_ADISTANCE_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))
+#define MEMTIER_ADISTANCE_PMEM	(1 << (MEMTIER_CHUNK_BITS + 3))
+
+#endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/Makefile b/mm/Makefile
index 6f9ffa968a1a..d30acebc2164 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -92,6 +92,7 @@  obj-$(CONFIG_KFENCE) += kfence/
 obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMTEST)		+= memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_NUMA) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
new file mode 100644
index 000000000000..60f82667d942
--- /dev/null
+++ b/mm/memory-tiers.c
@@ -0,0 +1,107 @@ 
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/types.h>
+#include <linux/nodemask.h>
+#include <linux/slab.h>
+#include <linux/lockdep.h>
+#include <linux/memory-tiers.h>
+
+struct memory_tier {
+	/* hierarchy of memory tiers */
+	struct list_head list;
+	/* list of all memory types part of this tier */
+	struct list_head memory_types;
+	/*
+	 * start value of abstract distance. A memory tier maps
+	 * an abstract distance range,
+	 * adistance_start .. adistance_start + MEMTIER_CHUNK_SIZE
+	 */
+	int adistance_start;
+};
+
+struct memory_dev_type {
+	/* list of memory types that are part of the same tier as this type */
+	struct list_head tier_sibiling;
+	/* abstract distance for this specific memory type */
+	int adistance;
+	/* Nodes of same abstract distance */
+	nodemask_t nodes;
+	struct memory_tier *memtier;
+};
+
+static DEFINE_MUTEX(memory_tier_lock);
+static LIST_HEAD(memory_tiers);
+struct memory_dev_type *node_memory_types[MAX_NUMNODES];
+/*
+ * For now let's have 4 memory tiers below the default DRAM tier.
+ */
+static struct memory_dev_type default_dram_type  = {
+	.adistance = MEMTIER_ADISTANCE_DRAM,
+	.tier_sibiling = LIST_HEAD_INIT(default_dram_type.tier_sibiling),
+};
+
+static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memtype)
+{
+	bool found_slot = false;
+	struct memory_tier *memtier, *new_memtier;
+	int adistance = memtype->adistance;
+	unsigned int memtier_adistance_chunk_size = MEMTIER_CHUNK_SIZE;
+
+	lockdep_assert_held_once(&memory_tier_lock);
+
+	/*
+	 * If the memtype is already part of a memory tier,
+	 * just return that.
+	 */
+	if (memtype->memtier)
+		return memtype->memtier;
+
+	adistance = round_down(adistance, memtier_adistance_chunk_size);
+	list_for_each_entry(memtier, &memory_tiers, list) {
+		if (adistance == memtier->adistance_start) {
+			memtype->memtier = memtier;
+			list_add(&memtype->tier_sibiling, &memtier->memory_types);
+			return memtier;
+		} else if (adistance < memtier->adistance_start) {
+			found_slot = true;
+			break;
+		}
+	}
+
+	new_memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
+	if (!new_memtier)
+		return ERR_PTR(-ENOMEM);
+
+	new_memtier->adistance_start = adistance;
+	INIT_LIST_HEAD(&new_memtier->list);
+	INIT_LIST_HEAD(&new_memtier->memory_types);
+	if (found_slot)
+		list_add_tail(&new_memtier->list, &memtier->list);
+	else
+		list_add_tail(&new_memtier->list, &memory_tiers);
+	memtype->memtier = new_memtier;
+	list_add(&memtype->tier_sibiling, &new_memtier->memory_types);
+	return new_memtier;
+}
+
+static int __init memory_tier_init(void)
+{
+	int node;
+	struct memory_tier *memtier;
+
+	mutex_lock(&memory_tier_lock);
+	/* CPU only nodes are not part of memory tiers. */
+	default_dram_type.nodes = node_states[N_MEMORY];
+
+	memtier = find_create_memory_tier(&default_dram_type);
+	if (IS_ERR(memtier))
+		panic("%s() failed to register memory tier: %ld\n",
+		      __func__, PTR_ERR(memtier));
+
+	for_each_node_state(node, N_MEMORY)
+		node_memory_types[node] = &default_dram_type;
+
+	mutex_unlock(&memory_tier_lock);
+
+	return 0;
+}
+subsys_initcall(memory_tier_init);
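
As a closing illustration, the sorted-insert behaviour of find_create_memory_tier()
can be modelled in plain userspace C (an editor's sketch, not code from this
series; it uses a simplified singly-linked list instead of the kernel's
list_head): tiers stay sorted by adistance_start, a memory type joins an
existing tier when its rounded-down abstract distance matches, and a new tier
is spliced in order otherwise.

#include <stdio.h>
#include <stdlib.h>

#define CHUNK 128	/* MEMTIER_CHUNK_SIZE */

struct tier {
	int start;			/* adistance_start */
	struct tier *next;
};

static struct tier *tiers;		/* sorted, lowest (fastest) start first */

static struct tier *find_create_tier(int adistance)
{
	int start = adistance / CHUNK * CHUNK;	/* round_down() */
	struct tier **pos, *t;

	for (pos = &tiers; *pos; pos = &(*pos)->next) {
		if ((*pos)->start == start)
			return *pos;		/* join the existing tier */
		if (start < (*pos)->start)
			break;			/* insert before a slower tier */
	}
	t = calloc(1, sizeof(*t));
	if (!t)
		exit(1);
	t->start = start;
	t->next = *pos;
	*pos = t;
	return t;
}

int main(void)
{
	/* DRAM (512), PMEM (1024), a DRAM-like device (570), HBM-like (256) */
	int d[] = { 512, 1024, 570, 256 };
	int i;

	for (i = 0; i < 4; i++) {
		struct tier *t = find_create_tier(d[i]);

		printf("adistance %4d -> tier %4d..%4d\n",
		       d[i], t->start, t->start + CHUNK - 1);
	}
	return 0;
}

Running it, the device at abstract distance 570 lands in the same 512..639 tier
as default DRAM, while 256 creates a new, faster tier ahead of it.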