[RFC,v2,08/21] mm: introduce and export pgdat peer_node

Message ID 20181226133351.521151384@intel.com (mailing list archive)
State New, archived
Series PMEM NUMA node and hotness accounting/migration

Commit Message

Fengguang Wu Dec. 26, 2018, 1:14 p.m. UTC
From: Fan Du <fan.du@intel.com>

Each CPU socket can have 1 DRAM and 1 PMEM node, we call them "peer nodes".
Migration between DRAM and PMEM will by default happen between peer nodes.

This is a temporary solution. With multiple memory layers, a node can have
both promotion and demotion targets rather than a single peer node. User
space may also be able to infer promotion/demotion targets from future
HMAT info.

Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 drivers/base/node.c    |   11 +++++++++++
 include/linux/mmzone.h |   12 ++++++++++++
 mm/page_alloc.c        |   29 +++++++++++++++++++++++++++++
 3 files changed, 52 insertions(+)

Comments

Christoph Lameter (Ampere) Dec. 27, 2018, 8:07 p.m. UTC | #1
On Wed, 26 Dec 2018, Fengguang Wu wrote:

> Each CPU socket can have 1 DRAM and 1 PMEM node, we call them "peer nodes".
> Migration between DRAM and PMEM will by default happen between peer nodes.

Which one does numa_node_id() point to? I guess that is the DRAM node and
then we fall back to the PMEM node?

Fengguang Wu Dec. 28, 2018, 2:31 a.m. UTC | #2
On Thu, Dec 27, 2018 at 08:07:26PM +0000, Christopher Lameter wrote:
>On Wed, 26 Dec 2018, Fengguang Wu wrote:
>
>> Each CPU socket can have 1 DRAM and 1 PMEM node, we call them "peer nodes".
>> Migration between DRAM and PMEM will by default happen between peer nodes.
>
>Which one does numa_node_id() point to? I guess that is the DRAM node and

Yes. On our test machine, PMEM nodes show up as memory-only nodes, so
numa_node_id() points to the DRAM node.

Here is the numactl --hardware output from a 2S (two-socket) test machine.

available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77
node 0 size: 257712 MB
node 0 free: 178251 MB
node 1 cpus: 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103
node 1 size: 258038 MB
node 1 free: 174796 MB
node 2 cpus:
node 2 size: 503999 MB
node 2 free: 438349 MB
node 3 cpus:
node 3 size: 503999 MB
node 3 free: 438349 MB
node distances:
node   0   1   2   3
  0:  10  21  20  20
  1:  21  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10

>then we fall back to the PMEM node?

Falling back is possible but outside the scope of this patchset. We
modified the fallback zonelists in patch 10 to simplify PMEM usage. With
that patch, page allocations on DRAM nodes won't fall back to PMEM nodes.
Instead, PMEM nodes will mainly be used through explicit numactl placement
and as migration targets. When there is memory pressure on a DRAM node,
cold LRU pages there will be demoted (migrated) to its peer PMEM node on
the same socket by patch 20.

Thanks,
Fengguang
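
For readers who want to experiment, the nearest-distance rule that find_best_peer_node() in this patch applies can be tried in user space. The following is only a sketch, not kernel code: nearest_peer(), NR_NODES, NO_NODE and the two table names are made up for illustration, and it assumes nodes are visited in ascending order and that a strict "<" comparison is used, as in the patch.

```c
#include <assert.h>
#include <limits.h>

#define NR_NODES 4
#define NO_NODE  (-1)	/* hypothetical stand-in for the kernel's NUMA_NO_NODE */

/* SLIT example taken from the comment in the patch. */
const int slit_example[NR_NODES][NR_NODES] = {
	{ 10, 21, 17, 28 },
	{ 21, 10, 28, 17 },
	{ 17, 28, 10, 28 },
	{ 28, 17, 28, 10 },
};

/* Distances from the numactl --hardware output quoted in the reply above. */
const int test_machine[NR_NODES][NR_NODES] = {
	{ 10, 21, 20, 20 },
	{ 21, 10, 20, 20 },
	{ 20, 20, 10, 20 },
	{ 20, 20, 20, 10 },
};

/*
 * Return the node with the smallest distance from nid, skipping nid itself.
 * Because the comparison is strict, the lowest-numbered node wins a tie.
 */
int nearest_peer(const int dist[NR_NODES][NR_NODES], int nid)
{
	int n;
	int min_val = INT_MAX;
	int peer = NO_NODE;

	assert(nid >= 0 && nid < NR_NODES);

	for (n = 0; n < NR_NODES; n++) {
		if (n == nid)
			continue;
		if (dist[nid][n] < min_val) {
			min_val = dist[nid][n];
			peer = n;
		}
	}
	return peer;
}
```

With slit_example, the rule pairs node 0 with node 2 and node 1 with node 3, matching the comment in the patch. With the test_machine distances, both PMEM nodes are equally far (20) from each DRAM node, so under the assumptions above the tie goes to the lowest node number: nearest_peer(test_machine, 1) returns 2 rather than 3.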

Patch

--- linux.orig/drivers/base/node.c	2018-12-23 19:39:51.647261099 +0800
+++ linux/drivers/base/node.c	2018-12-23 19:39:51.643261112 +0800
@@ -242,6 +242,16 @@  static ssize_t type_show(struct device *
 }
 static DEVICE_ATTR(type, S_IRUGO, type_show, NULL);
 
+static ssize_t peer_node_show(struct device *dev,
+			struct device_attribute *attr, char *buf)
+{
+	int nid = dev->id;
+	struct pglist_data *pgdat = NODE_DATA(nid);
+
+	return sprintf(buf, "%d\n", pgdat->peer_node);
+}
+static DEVICE_ATTR(peer_node, S_IRUGO, peer_node_show, NULL);
+
 static struct attribute *node_dev_attrs[] = {
 	&dev_attr_cpumap.attr,
 	&dev_attr_cpulist.attr,
@@ -250,6 +260,7 @@  static struct attribute *node_dev_attrs[
 	&dev_attr_distance.attr,
 	&dev_attr_vmstat.attr,
 	&dev_attr_type.attr,
+	&dev_attr_peer_node.attr,
 	NULL
 };
 ATTRIBUTE_GROUPS(node_dev);
--- linux.orig/include/linux/mmzone.h	2018-12-23 19:39:51.647261099 +0800
+++ linux/include/linux/mmzone.h	2018-12-23 19:39:51.643261112 +0800
@@ -713,6 +713,18 @@  typedef struct pglist_data {
 	/* Per-node vmstats */
 	struct per_cpu_nodestat __percpu *per_cpu_nodestats;
 	atomic_long_t		vm_stat[NR_VM_NODE_STAT_ITEMS];
+
+	/*
+	 * Points to the nearest node in terms of latency
+	 * E.g. peer of node 0 is node 2 per SLIT
+	 * node distances:
+	 * node   0   1   2   3
+	 *   0:  10  21  17  28
+	 *   1:  21  10  28  17
+	 *   2:  17  28  10  28
+	 *   3:  28  17  28  10
+	 */
+	int	peer_node;
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
--- linux.orig/mm/page_alloc.c	2018-12-23 19:39:51.647261099 +0800
+++ linux/mm/page_alloc.c	2018-12-23 19:39:51.643261112 +0800
@@ -6926,6 +6926,34 @@  static void check_for_memory(pg_data_t *
 	}
 }
 
+/*
+ * Return the nearest peer node in terms of *locality*
+ * E.g. peer of node 0 is node 2 per SLIT
+ * node distances:
+ * node   0   1   2   3
+ *   0:  10  21  17  28
+ *   1:  21  10  28  17
+ *   2:  17  28  10  28
+ *   3:  28  17  28  10
+ */
+static int find_best_peer_node(int nid)
+{
+	int n, val;
+	int min_val = INT_MAX;
+	int peer = NUMA_NO_NODE;
+
+	for_each_online_node(n) {
+		if (n == nid)
+			continue;
+		val = node_distance(nid, n);
+		if (val < min_val) {
+			min_val = val;
+			peer = n;
+		}
+	}
+	return peer;
+}
+
 /**
  * free_area_init_nodes - Initialise all pg_data_t and zone data
  * @max_zone_pfn: an array of max PFNs for each zone
@@ -7012,6 +7040,7 @@  void __init free_area_init_nodes(unsigne
 		if (pgdat->node_present_pages)
 			node_set_state(nid, N_MEMORY);
 		check_for_memory(pgdat, nid);
+		pgdat->peer_node = find_best_peer_node(nid);
 	}
 }
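
The peer_node attribute is added to node_dev_attrs[], so on a kernel carrying this RFC it should appear next to the existing per-node attributes such as distance, presumably as /sys/devices/system/node/node<N>/peer_node. A minimal user-space sketch of reading it follows; read_peer_node() and parse_peer_node() are hypothetical helper names, and the base directory is passed in as a parameter because the exact path is an assumption.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse one decimal integer (the format written by peer_node_show())
 * from an open stream. Returns 0 on success, -1 on failure; *peer is
 * left untouched on failure.
 */
int parse_peer_node(FILE *f, int *peer)
{
	assert(f && peer);
	return fscanf(f, "%d", peer) == 1 ? 0 : -1;
}

/*
 * Read <base>/node<nid>/peer_node into *peer.
 * Returns 0 on success, -1 if the file is missing or unparsable.
 * A separate status is returned because the attribute itself may
 * legitimately hold -1 when no peer was found.
 */
int read_peer_node(const char *base, int nid, int *peer)
{
	char path[256];
	FILE *f;
	int len, ret;

	assert(base && peer);

	len = snprintf(path, sizeof(path), "%s/node%d/peer_node", base, nid);
	if (len < 0 || (size_t)len >= sizeof(path))
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;	/* e.g. kernel without this patch */

	ret = parse_peer_node(f, peer);
	fclose(f);
	return ret;
}
```

A caller would typically pass "/sys/devices/system/node" as base and treat a -1 return as "attribute not available", which is what an unpatched kernel would produce.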