diff mbox series

[RFC] mm: migrate: Add new node demotion strategy

Message ID c02bcbc04faa7a2c852534e9cd58a91c44494657.1636016609.git.baolin.wang@linux.alibaba.com (mailing list archive)
State New
Headers show
Series [RFC] mm: migrate: Add new node demotion strategy | expand

Commit Message

Baolin Wang Nov. 4, 2021, 9:13 a.m. UTC
We have some machines with multiple memory types like below, which
have one fast (DRAM) memory node and two slow (persistent memory) memory
nodes. According to current node demotion, if node 0 fills up,
its memory should be migrated to node 1, when node 1 fills up, its
memory will be migrated to node 2: node 0 -> node 1 -> node 2 ->stop.

But this is not efficient and suitbale memory migration route
for our machine with multiple slow memory nodes. Since the distance
between node 0 to node 1 and node 0 to node 2 is equal, and memory
migration between slow memory nodes will increase persistent memory
bandwidth greatly, which will hurt the whole system's performance.

Thus for this case, we can treat the slow memory node 1 and node 2
as a whole slow memory region, and we should migrate memory from
node 0 to node 1 and node 2 if node 0 fills up with a simple round-robin
method to select the target demotion node.

available: 3 nodes (0-2)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 62153 MB
node 0 free: 55135 MB
node 1 cpus:
node 1 size: 127007 MB
node 1 free: 126930 MB
node 2 cpus:
node 2 size: 126968 MB
node 2 free: 126878 MB
node distances:
node   0   1   2
  0:  10  20  20
  1:  20  10  20
  2:  20  20  10

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 Documentation/ABI/testing/sysfs-kernel-mm-numa |  16 ++++
 mm/migrate.c                                   | 114 ++++++++++++++++++++++++-
 2 files changed, 126 insertions(+), 4 deletions(-)

Comments

Dave Hansen Nov. 4, 2021, 3:18 p.m. UTC | #1
On 11/4/21 2:13 AM, Baolin Wang wrote:
> +What:		/sys/kernel/mm/numa/demotion_mode
> +Date:		November 2021
> +Contact:	Linux memory management mailing list <linux-mm@kvack.org>
> +Description:	Set the demotion mode when enabling demoting pages during reclaim

I don't think we need a tunable for this.  The existing behavior is just
stupid for your hardware and can be replaced.  Let's also try to do it
with the existing node_demotion[] data structure before we go adding more.
Huang, Ying Nov. 5, 2021, 2:51 a.m. UTC | #2
Dave Hansen <dave.hansen@intel.com> writes:

> On 11/4/21 2:13 AM, Baolin Wang wrote:
>> +What:		/sys/kernel/mm/numa/demotion_mode
>> +Date:		November 2021
>> +Contact:	Linux memory management mailing list <linux-mm@kvack.org>
>> +Description:	Set the demotion mode when enabling demoting pages during reclaim
>
> I don't think we need a tunable for this.  The existing behavior is just
> stupid for your hardware and can be replaced.

Yes.  I think so too.  I don't think DIRECT_DEMOTION is reasonable for your system.

> Let's also try to do it with the existing node_demotion[] data
> structure before we go adding more.

To avoid cache ping-pong, I guess some kind of per-CPU data structure
may be more suitable for interleaving among multiple nodes.

Best Regards,
Huang, Ying
Dave Hansen Nov. 5, 2021, 3:47 p.m. UTC | #3
On 11/4/21 7:51 PM, Huang, Ying wrote:
>> Let's also try to do it with the existing node_demotion[] data
>> structure before we go adding more.
> To avoid cache ping-pong, I guess some kind of per-CPU data structure
> may be more suitable for interleaving among multiple nodes.

It would probably be better to just find something that's more
read-heavy.  Like, instead of keeping a strict round-robin, just
randomly select one of the notes to which you can round-robin.

That will scale naturally without having to worry about caching or fancy
per-cpu data structures.
Baolin Wang Nov. 7, 2021, 9:33 a.m. UTC | #4
On 2021/11/5 23:47, Dave Hansen wrote:
> On 11/4/21 7:51 PM, Huang, Ying wrote:
>>> Let's also try to do it with the existing node_demotion[] data
>>> structure before we go adding more.
>> To avoid cache ping-pong, I guess some kind of per-CPU data structure
>> may be more suitable for interleaving among multiple nodes.
> 
> It would probably be better to just find something that's more
> read-heavy.  Like, instead of keeping a strict round-robin, just
> randomly select one of the notes to which you can round-robin.
> 
> That will scale naturally without having to worry about caching or fancy
> per-cpu data structures.
> 

Thanks for your suggestion. After some thinking, can we change the 
node_demotion[] structure like below? Which means one source node can be 
demoted to mutiple target node, and we can set up the target node mask 
according to the node distance. How do you think? Thanks.

static nodemask_t node_demotion[MAX_NUMNODES] __read_mostly =
         {[0 ...  MAX_NUMNODES - 1] = NODE_MASK_NONE};
Dave Hansen Nov. 7, 2021, 3:20 p.m. UTC | #5
On 11/7/21 1:33 AM, Baolin Wang wrote:
> Thanks for your suggestion. After some thinking, can we change the
> node_demotion[] structure like below? Which means one source node can be
> demoted to mutiple target node, and we can set up the target node mask
> according to the node distance. How do you think? Thanks.
> 
> static nodemask_t node_demotion[MAX_NUMNODES] __read_mostly =
>         {[0 ...  MAX_NUMNODES - 1] = NODE_MASK_NONE};

How large is that in the worst case?
Huang, Ying Nov. 8, 2021, 2:12 a.m. UTC | #6
Dave Hansen <dave.hansen@intel.com> writes:

> On 11/4/21 7:51 PM, Huang, Ying wrote:
>>> Let's also try to do it with the existing node_demotion[] data
>>> structure before we go adding more.
>> To avoid cache ping-pong, I guess some kind of per-CPU data structure
>> may be more suitable for interleaving among multiple nodes.
>
> It would probably be better to just find something that's more
> read-heavy.  Like, instead of keeping a strict round-robin, just
> randomly select one of the notes to which you can round-robin.
>
> That will scale naturally without having to worry about caching or fancy
> per-cpu data structures.

Yes.  That sounds good.  And per-CPU data structure is used inside
random API too :-)

Best Regards,
Huang, Ying
Baolin Wang Nov. 8, 2021, 6:38 a.m. UTC | #7
On 2021/11/7 23:20, Dave Hansen wrote:
> On 11/7/21 1:33 AM, Baolin Wang wrote:
>> Thanks for your suggestion. After some thinking, can we change the
>> node_demotion[] structure like below? Which means one source node can be
>> demoted to mutiple target node, and we can set up the target node mask
>> according to the node distance. How do you think? Thanks.
>>
>> static nodemask_t node_demotion[MAX_NUMNODES] __read_mostly =
>>          {[0 ...  MAX_NUMNODES - 1] = NODE_MASK_NONE};
> 
> How large is that in the worst case?

For the worst case (MAX_NUMNODES=1024), the size of the node_demotion is 
131072 bytes, while the size of original data structure is 4096 bytes. 
Maybe we can allocate the node_demotion dynamically?
Huang, Ying Nov. 8, 2021, 6:48 a.m. UTC | #8
Baolin Wang <baolin.wang@linux.alibaba.com> writes:

> On 2021/11/7 23:20, Dave Hansen wrote:
>> On 11/7/21 1:33 AM, Baolin Wang wrote:
>>> Thanks for your suggestion. After some thinking, can we change the
>>> node_demotion[] structure like below? Which means one source node can be
>>> demoted to mutiple target node, and we can set up the target node mask
>>> according to the node distance. How do you think? Thanks.
>>>
>>> static nodemask_t node_demotion[MAX_NUMNODES] __read_mostly =
>>>          {[0 ...  MAX_NUMNODES - 1] = NODE_MASK_NONE};
>> How large is that in the worst case?
>
> For the worst case (MAX_NUMNODES=1024), the size of the node_demotion
> is 131072 bytes, while the size of original data structure is 4096
> bytes. Maybe we can allocate the node_demotion dynamically?

Per my understanding, in most cases, the number of demotion target nodes
should be quite small.  So why not restrict the number of demotion
target nodes to make it some kind of simple array?

Best Regards,
Huang, Ying
Baolin Wang Nov. 8, 2021, 7:07 a.m. UTC | #9
On 2021/11/8 14:48, Huang, Ying writes:
> Baolin Wang <baolin.wang@linux.alibaba.com> writes:
> 
>> On 2021/11/7 23:20, Dave Hansen wrote:
>>> On 11/7/21 1:33 AM, Baolin Wang wrote:
>>>> Thanks for your suggestion. After some thinking, can we change the
>>>> node_demotion[] structure like below? Which means one source node can be
>>>> demoted to mutiple target node, and we can set up the target node mask
>>>> according to the node distance. How do you think? Thanks.
>>>>
>>>> static nodemask_t node_demotion[MAX_NUMNODES] __read_mostly =
>>>>           {[0 ...  MAX_NUMNODES - 1] = NODE_MASK_NONE};
>>> How large is that in the worst case?
>>
>> For the worst case (MAX_NUMNODES=1024), the size of the node_demotion
>> is 131072 bytes, while the size of original data structure is 4096
>> bytes. Maybe we can allocate the node_demotion dynamically?
> 
> Per my understanding, in most cases, the number of demotion target nodes
> should be quite small.  So why not restrict the number of demotion
> target nodes to make it some kind of simple array?

Yes, agree. Something like below is reasonable for you?

#define DEMOTION_TARGET_NODES 16
typedef struct { DECLARE_BITMAP(bits, DEMOTION_TARGET_NODES); } 
demotemask_t;

static demotemask_t node_demotion[MAX_NUMNODES];
Baolin Wang Nov. 8, 2021, 8:43 a.m. UTC | #10
On 2021/11/8 16:12, Huang, Ying writes:
> Baolin Wang <baolin.wang@linux.alibaba.com> writes:
> 
>> On 2021/11/8 14:48, Huang, Ying writes:
>>> Baolin Wang <baolin.wang@linux.alibaba.com> writes:
>>>
>>>> On 2021/11/7 23:20, Dave Hansen wrote:
>>>>> On 11/7/21 1:33 AM, Baolin Wang wrote:
>>>>>> Thanks for your suggestion. After some thinking, can we change the
>>>>>> node_demotion[] structure like below? Which means one source node can be
>>>>>> demoted to mutiple target node, and we can set up the target node mask
>>>>>> according to the node distance. How do you think? Thanks.
>>>>>>
>>>>>> static nodemask_t node_demotion[MAX_NUMNODES] __read_mostly =
>>>>>>        {[0 ... MAX_NUMNODES - 1] = NODE_MASK_NONE};
>>>>> How large is that in the worst case?
>>>>
>>>> For the worst case (MAX_NUMNODES=1024), the size of the node_demotion
>>>> is 131072 bytes, while the size of original data structure is 4096
>>>> bytes. Maybe we can allocate the node_demotion dynamically?
>>> Per my understanding, in most cases, the number of demotion target
>>> nodes
>>> should be quite small.  So why not restrict the number of demotion
>>> target nodes to make it some kind of simple array?
>>
>> Yes, agree. Something like below is reasonable for you?
>>
>> #define DEMOTION_TARGET_NODES 16
>> typedef struct { DECLARE_BITMAP(bits, DEMOTION_TARGET_NODES); }
>> demotemask_t;
>>
>> static demotemask_t node_demotion[MAX_NUMNODES];
> 
> I don't think we need a bitmap.  May be something as following,
> 
> #define DEMOTION_TARGET_NODES 15
> struct demotion_nodes {
>    unsigned short nr;
>    unsigned short nodes[DEMOTION_TARGET_NODES];
> };

OK. Let me try it in next version. Thanks.
diff mbox series

Patch

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-numa b/Documentation/ABI/testing/sysfs-kernel-mm-numa
index 77e559d..896e91d 100644
--- a/Documentation/ABI/testing/sysfs-kernel-mm-numa
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-numa
@@ -22,3 +22,19 @@  Description:	Enable/disable demoting pages during reclaim
 		the guarantees of cpusets.  This should not be enabled
 		on systems which need strict cpuset location
 		guarantees.
+
+What:		/sys/kernel/mm/numa/demotion_mode
+Date:		November 2021
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Set the demotion mode when enabling demoting pages during reclaim
+
+		Specify the demotion mode. "direct" is for the DIRECT_DEMOTION
+		mode if the system's multiple memory types are one to one
+		correspondence, the node distance among the multiple memory types
+		is tiered.
+
+		"multiple" is for the MULTIPLE_DEMOTION mode if the system's
+		multiple memory types are not one to one correspondence, the node
+		distance between fast memory node and slow memory nodes is equal,
+		which means one fast memory node can be demoted to multiple slow
+		memory nodes.
diff --git a/mm/migrate.c b/mm/migrate.c
index a11e948..a43726d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1099,6 +1099,33 @@  static int __unmap_and_move(struct page *page, struct page *newpage,
 	return rc;
 }
 
+/*
+ * DIRECT_DEMOTION mode is for the system that multiple
+ * memory types are one to one correspondence, the node
+ * distance among the multiple memory types is tiered.
+ * For example, there are 3 memory types in one system:
+ * fast (node 0), medium (node 1) and slow (node 2). So
+ * the memory migration route should be:
+ * node 0 -> node 1 -> node 2 -> stop.
+ *
+ * MULTIPLE_DEMOTION mode is for the system that multiple
+ * memory types are not one to one correspondence, the node
+ * distance between fast memory node and slow memory nodes
+ * is equal. For example, there are 2 memory types in one
+ * system with 3 memory nodes: fast (node 0), slow (node 1),
+ * slow (node 2), and the distance between node 0 to node 1
+ * and node 0 to node 2 is equal. In this case, we should
+ * not migrate memory among slow memory nodes, instead we
+ * should treat node 1 and node 2 as a whole slow memory,
+ * so the memory migration route should be:
+ * node 0 -> node 1/2 -> stop.
+ */
+enum demotion_mode {
+	DIRECT_DEMOTION,
+	MULTIPLE_DEMOTION,
+};
+
+static enum demotion_mode numa_demotion_mode = DIRECT_DEMOTION;
 
 /*
  * node_demotion[] example:
@@ -1144,6 +1171,13 @@  static int __unmap_and_move(struct page *page, struct page *newpage,
 static int node_demotion[MAX_NUMNODES] __read_mostly =
 	{[0 ...  MAX_NUMNODES - 1] = NUMA_NO_NODE};
 
+/*
+ * Used for one fast memory node to multiple slow memory nodes,
+ * to select the next target demotion node.
+ */
+static atomic_t next_demote_node[MAX_NUMNODES] =
+	{[0 ...  MAX_NUMNODES - 1] = ATOMIC_INIT(NUMA_NO_NODE)};
+
 /**
  * next_demotion_node() - Get the next node in the demotion path
  * @node: The starting node to lookup the next node
@@ -1155,7 +1189,8 @@  static int __unmap_and_move(struct page *page, struct page *newpage,
  */
 int next_demotion_node(int node)
 {
-	int target;
+	nodemask_t target_nodes = NODE_MASK_NONE;
+	int target = NUMA_NO_NODE, src = node;
 
 	/*
 	 * node_demotion[] is updated without excluding this
@@ -1166,9 +1201,52 @@  int next_demotion_node(int node)
 	 * Make sure to use RCU over entire code blocks if
 	 * node_demotion[] reads need to be consistent.
 	 */
-	rcu_read_lock();
-	target = READ_ONCE(node_demotion[node]);
-	rcu_read_unlock();
+	switch (numa_demotion_mode) {
+	case DIRECT_DEMOTION:
+		rcu_read_lock();
+		target = READ_ONCE(node_demotion[node]);
+		rcu_read_unlock();
+		break;
+	case MULTIPLE_DEMOTION:
+		/*
+		 * Only fast memory can be demoted to multiple slow
+		 * memory nodes, and slow memory should not be demoted
+		 * among slow nemory nodes.
+		 */
+		if (!node_state(node, N_CPU))
+			return NUMA_NO_NODE;
+
+		rcu_read_lock();
+		do {
+			target = READ_ONCE(node_demotion[src]);
+			if (target == NUMA_NO_NODE)
+				break;
+
+			node_set(target, target_nodes);
+			src = target;
+		} while (1);
+		rcu_read_unlock();
+
+		if (!nodes_empty(target_nodes)) {
+			/*
+			 * When the fast memory can be demoted to multiple
+			 * slow memory nodes, here we choose round-robin
+			 * method to select one target demotion node.
+			 */
+			target = atomic_read(&next_demote_node[node]);
+			if (target == NUMA_NO_NODE)
+				target = first_node(target_nodes);
+			atomic_set(&next_demote_node[node],
+				   next_node_in(target, target_nodes));
+		} else {
+			target = NUMA_NO_NODE;
+		}
+
+		break;
+	default:
+		pr_warn("Invalid demotion mode\n");
+		break;
+	}
 
 	return target;
 }
@@ -3330,12 +3408,40 @@  static ssize_t numa_demotion_enabled_store(struct kobject *kobj,
 	return count;
 }
 
+
+static ssize_t numa_demotion_mode_show(struct kobject *kobj,
+				       struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%s\n",
+			  numa_demotion_mode == DIRECT_DEMOTION ?
+			  "direct" : "multiple");
+}
+
+static ssize_t numa_demotion_mode_store(struct kobject *kobj,
+					struct kobj_attribute *attr,
+					const char *buf, size_t count)
+{
+	if (!strncmp(buf, "direct", 6))
+		numa_demotion_mode = DIRECT_DEMOTION;
+	else if (!strncmp(buf, "multiple", 8))
+		numa_demotion_mode = MULTIPLE_DEMOTION;
+	else
+		return -EINVAL;
+
+	return count;
+}
+
 static struct kobj_attribute numa_demotion_enabled_attr =
 	__ATTR(demotion_enabled, 0644, numa_demotion_enabled_show,
 	       numa_demotion_enabled_store);
 
+static struct kobj_attribute numa_demotion_mode_attr =
+	__ATTR(demotion_mode, 0644, numa_demotion_mode_show,
+	       numa_demotion_mode_store);
+
 static struct attribute *numa_attrs[] = {
 	&numa_demotion_enabled_attr.attr,
+	&numa_demotion_mode_attr.attr,
 	NULL,
 };