
[RFC,v2,10/21] mm: build separate zonelist for PMEM and DRAM node

Message ID 20181226133351.644607371@intel.com (mailing list archive)
State New, archived
Series PMEM NUMA node and hotness accounting/migration

Commit Message

Fengguang Wu Dec. 26, 2018, 1:14 p.m. UTC
From: Fan Du <fan.du@intel.com>

When allocating pages, DRAM and PMEM nodes should not fall back to
each other. This allows the migration code to explicitly control which
type of node to allocate pages from.

With this patch, a PMEM NUMA node can only be used in two ways:
- migrate in and out
- numactl

That guarantees a PMEM NUMA node will only hold anonymous pages.
We don't detect hotness for other types of pages for now, so we need
to prevent a PMEM page from going hot while we are unable to detect it
and move it to DRAM.
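
For example, userspace can still place pages on PMEM explicitly. Below is a
minimal sketch using libnuma, where node 2 stands in for a PMEM node (the
node number is only an assumption for illustration):

	/* Build with: gcc pmem_alloc.c -o pmem_alloc -lnuma */
	#include <numa.h>
	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		const int pmem_node = 2;	/* assumed PMEM node id */
		const size_t size = 64UL << 20;	/* 64 MB */
		void *buf;

		if (numa_available() < 0) {
			fprintf(stderr, "no NUMA support\n");
			return 1;
		}

		/* Ask for the allocation to come from the PMEM node. */
		buf = numa_alloc_onnode(size, pmem_node);
		if (!buf) {
			fprintf(stderr, "numa_alloc_onnode failed\n");
			return 1;
		}
		memset(buf, 0, size);	/* fault the pages in on that node */
		numa_free(buf, size);
		return 0;
	}

The same effect is available from the shell with e.g.
"numactl --membind=2 <command>".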

Another implication is that new page allocations will by default go to
DRAM nodes, which is normally a good choice: since DRAM writes are
cheaper than PMEM writes, it is often beneficial to watch new pages in
DRAM for some time and only move the likely cold pages to PMEM.

However, there can be exceptions. For example, if the PMEM:DRAM ratio is
very high, some page allocations may be better directed to PMEM nodes
straight away. In the long term, we may create more kinds of fallback
zonelists and make them configurable through NUMA policy.

Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 mm/mempolicy.c  |   14 ++++++++++++++
 mm/page_alloc.c |   42 +++++++++++++++++++++++++++++-------------
 2 files changed, 43 insertions(+), 13 deletions(-)

Comments

Aneesh Kumar K.V Jan. 1, 2019, 9:14 a.m. UTC | #1
Fengguang Wu <fengguang.wu@intel.com> writes:

> From: Fan Du <fan.du@intel.com>
>
> When allocate page, DRAM and PMEM node should better not fall back to
> each other. This allows migration code to explicitly control which type
> of node to allocate pages from.
>
> With this patch, PMEM NUMA node can only be used in 2 ways:
> - migrate in and out
> - numactl

Can we achieve this using a nodemask? That way we don't tag nodes with
different properties such as DRAM/PMEM. We can then give the
flexibility to the device init code to add the new memory nodes to
the right nodemask.

-aneesh
Fengguang Wu Jan. 7, 2019, 9:57 a.m. UTC | #2
On Tue, Jan 01, 2019 at 02:44:41PM +0530, Aneesh Kumar K.V wrote:
>Fengguang Wu <fengguang.wu@intel.com> writes:
>
>> From: Fan Du <fan.du@intel.com>
>>
>> When allocate page, DRAM and PMEM node should better not fall back to
>> each other. This allows migration code to explicitly control which type
>> of node to allocate pages from.
>>
>> With this patch, PMEM NUMA node can only be used in 2 ways:
>> - migrate in and out
>> - numactl
>
>Can we achieve this using nodemask? That way we don't tag nodes with
>different properties such as DRAM/PMEM. We can then give the
>flexibilility to the device init code to add the new memory nodes to
>the right nodemask

Aneesh, in patch 2 we did create nodemask numa_nodes_pmem and
numa_nodes_dram. What's your supposed way of "using nodemask"?

Thanks,
Fengguang
Aneesh Kumar K.V Jan. 7, 2019, 2:09 p.m. UTC | #3
Fengguang Wu <fengguang.wu@intel.com> writes:

> On Tue, Jan 01, 2019 at 02:44:41PM +0530, Aneesh Kumar K.V wrote:
>>Fengguang Wu <fengguang.wu@intel.com> writes:
>>
>>> From: Fan Du <fan.du@intel.com>
>>>
>>> When allocate page, DRAM and PMEM node should better not fall back to
>>> each other. This allows migration code to explicitly control which type
>>> of node to allocate pages from.
>>>
>>> With this patch, PMEM NUMA node can only be used in 2 ways:
>>> - migrate in and out
>>> - numactl
>>
>>Can we achieve this using nodemask? That way we don't tag nodes with
>>different properties such as DRAM/PMEM. We can then give the
>>flexibilility to the device init code to add the new memory nodes to
>>the right nodemask
>
> Aneesh, in patch 2 we did create nodemask numa_nodes_pmem and
> numa_nodes_dram. What's your supposed way of "using nodemask"?
>

IIUC the patch is to avoid allocation from PMEM nodes, and the way you
achieve it is by checking if (is_node_pmem(n)). We already have an
abstraction to avoid allocation from a node using a nodemask. I was
wondering whether we can do the equivalent of the above using that.

i.e., __next_zone_zonelist() could do zref_in_nodemask(z,
default_exclude_nodemask) and decide whether to use the specific zone
or not.
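
For illustration, a rough sketch of that check in the zonelist walker
(default_exclude_nodemask is a hypothetical mask that device init code
would populate; __next_zones_zonelist(), zonelist_zone_idx() and
zref_in_nodemask() are the existing helpers in mm/mmzone.c):

	/* Sketch only, not part of this series. */
	nodemask_t default_exclude_nodemask;	/* filled in by device init code */

	struct zoneref *__next_zones_zonelist(struct zoneref *z,
					      enum zone_type highest_zoneidx,
					      nodemask_t *nodes)
	{
		for (; z->zone; z++) {
			if (zonelist_zone_idx(z) > highest_zoneidx)
				continue;
			if (nodes) {
				/* Caller passed an explicit nodemask: honour it. */
				if (zref_in_nodemask(z, nodes))
					break;
			} else {
				/* No explicit mask: skip default-excluded nodes. */
				if (!zref_in_nodemask(z, &default_exclude_nodemask))
					break;
			}
		}
		return z;
	}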

That way we don't add special code like 

+	PGDAT_DRAM,			/* Volatile DRAM memory node */
+	PGDAT_PMEM,			/* Persistent memory node */

The reason is that there could be other device memory that would want to
be excluded from that default allocation, like you are doing for PMEM.

-aneesh

Patch
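
The hunks below rely on is_node_pmem()/is_node_dram() and the
numa_nodes_pmem/numa_nodes_dram nodemasks introduced earlier in the
series (patch 2, and the PGDAT_DRAM/PGDAT_PMEM discussion above). An
illustrative sketch of what such helpers could look like on top of
those nodemasks (the series' actual definitions may differ):

	/* Illustrative only -- the real definitions live in earlier patches. */
	static inline bool is_node_pmem(int nid)
	{
		return node_isset(nid, numa_nodes_pmem);
	}

	static inline bool is_node_dram(int nid)
	{
		return node_isset(nid, numa_nodes_dram);
	}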

--- linux.orig/mm/mempolicy.c	2018-12-26 20:03:49.821417489 +0800
+++ linux/mm/mempolicy.c	2018-12-26 20:29:24.597884301 +0800
@@ -1745,6 +1745,20 @@  static int policy_node(gfp_t gfp, struct
 		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
 	}
 
+	if (policy->mode == MPOL_BIND) {
+		nodemask_t nodes = policy->v.nodes;
+
+		/*
+		 * The rule: if we run on a DRAM node and mbind to PMEM nodes,
+		 * the preferred node is the peer (PMEM) node, and vice versa.
+		 * If we run on a DRAM node and mbind to DRAM nodes (likewise
+		 * for PMEM), the #PF node is already preferred, so fall through.
+		 */
+		if ((is_node_dram(nd) && nodes_subset(nodes, numa_nodes_pmem)) ||
+			(is_node_pmem(nd) && nodes_subset(nodes, numa_nodes_dram)))
+			nd = NODE_DATA(nd)->peer_node;
+	}
+
 	return nd;
 }
 
--- linux.orig/mm/page_alloc.c	2018-12-26 20:03:49.821417489 +0800
+++ linux/mm/page_alloc.c	2018-12-26 20:03:49.817417321 +0800
@@ -5153,6 +5153,10 @@  static int find_next_best_node(int node,
 		if (node_isset(n, *used_node_mask))
 			continue;
 
+	/* DRAM nodes don't fall back to PMEM nodes */
+		if (is_node_pmem(n))
+			continue;
+
 		/* Use the distance array to find the distance */
 		val = node_distance(node, n);
 
@@ -5242,19 +5246,31 @@  static void build_zonelists(pg_data_t *p
 	nodes_clear(used_mask);
 
 	memset(node_order, 0, sizeof(node_order));
-	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
-		/*
-		 * We don't want to pressure a particular node.
-		 * So adding penalty to the first node in same
-		 * distance group to make it round-robin.
-		 */
-		if (node_distance(local_node, node) !=
-		    node_distance(local_node, prev_node))
-			node_load[node] = load;
-
-		node_order[nr_nodes++] = node;
-		prev_node = node;
-		load--;
+	/* PMEM nodes don't fall back to DRAM nodes */
+	if (is_node_pmem(local_node)) {
+		int n;
+
+		/* PMEM nodes should fall back to each other */
+		node_order[nr_nodes++] = local_node;
+		for_each_node_state(n, N_MEMORY) {
+			if ((n != local_node) && is_node_pmem(n))
+				node_order[nr_nodes++] = n;
+		}
+	} else {
+		while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
+			/*
+			 * We don't want to pressure a particular node.
+			 * So adding penalty to the first node in same
+			 * distance group to make it round-robin.
+			 */
+			if (node_distance(local_node, node) !=
+			    node_distance(local_node, prev_node))
+				node_load[node] = load;
+
+			node_order[nr_nodes++] = node;
+			prev_node = node;
+			load--;
+		}
 	}
 
 	build_zonelists_in_node_order(pgdat, node_order, nr_nodes);
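
To make the effect concrete, consider a hypothetical 4-node box with
DRAM nodes 0-1 and PMEM nodes 2-3 (node numbering is illustrative):

	/*
	 * build_zonelists(node 0): node_order = { 0, 1 }
	 *	DRAM only -- find_next_best_node() now skips PMEM nodes.
	 * build_zonelists(node 2): node_order = { 2, 3 }
	 *	the local PMEM node first, then the other PMEM nodes.
	 *
	 * DRAM and PMEM thus form two disjoint fallback groups, as
	 * described in the changelog.
	 */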