Message ID | 20181226133351.521151384@intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | PMEM NUMA node and hotness accounting/migration |
On Wed, 26 Dec 2018, Fengguang Wu wrote:

> Each CPU socket can have 1 DRAM and 1 PMEM node, we call them "peer nodes".
> Migration between DRAM and PMEM will by default happen between peer nodes.

Which one does numa_node_id() point to? I guess that is the DRAM node and
then we fall back to the PMEM node?
On Thu, Dec 27, 2018 at 08:07:26PM +0000, Christopher Lameter wrote:
> On Wed, 26 Dec 2018, Fengguang Wu wrote:
>
>> Each CPU socket can have 1 DRAM and 1 PMEM node, we call them "peer nodes".
>> Migration between DRAM and PMEM will by default happen between peer nodes.
>
> Which one does numa_node_id() point to? I guess that is the DRAM node and

Yes. On our test machine, PMEM nodes show up as memory-only nodes, so
numa_node_id() points to the DRAM node. Here is the numactl --hardware
output on a 2S test machine:

```
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77
node 0 size: 257712 MB
node 0 free: 178251 MB
node 1 cpus: 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103
node 1 size: 258038 MB
node 1 free: 174796 MB
node 2 cpus:
node 2 size: 503999 MB
node 2 free: 438349 MB
node 3 cpus:
node 3 size: 503999 MB
node 3 free: 438349 MB
node distances:
node   0   1   2   3
  0:  10  21  20  20
  1:  21  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10
```

> then we fall back to the PMEM node?

Fallback is possible, but it is out of the scope of this patchset. We
modified the fallback zonelists in patch 10 to simplify PMEM usage: with
that patch, page allocations on DRAM nodes won't fall back to PMEM nodes.
Instead, PMEM nodes will mainly be used through explicit numactl placement
and as a migration target. When there is memory pressure on a DRAM node,
cold LRU pages there will be demote-migrated to its peer PMEM node on the
same socket by patch 20.

Thanks,
Fengguang
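As a sanity check of the peer mapping discussed above, the nearest-peer
choice can be reproduced entirely from userspace, because the kernel
already exports each SLIT row via /sys/devices/system/node/nodeN/distance.
The sketch below is not part of the patchset; it is a minimal mirror of
the selection loop in the patch, assuming nodes are numbered consecutively
from 0 with no holes (as on the machine above):

```
/*
 * Userspace sketch (illustration only): compute each node's nearest
 * peer from the SLIT rows exported in sysfs, mirroring the kernel's
 * find_best_peer_node() in this patch.
 */
#include <limits.h>
#include <stdio.h>

#define MAX_NODES 64

int main(void)
{
	int dist[MAX_NODES][MAX_NODES];
	int nr;

	/* Read one row of the distance matrix per online node. */
	for (nr = 0; nr < MAX_NODES; nr++) {
		char path[64];
		FILE *f;
		int n;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/distance", nr);
		f = fopen(path, "r");
		if (!f)
			break;
		for (n = 0; n < MAX_NODES &&
			    fscanf(f, "%d", &dist[nr][n]) == 1; n++)
			;
		fclose(f);
	}

	/* Pick the nearest other node, as find_best_peer_node() does. */
	for (int nid = 0; nid < nr; nid++) {
		int peer = -1, min_val = INT_MAX;

		for (int n = 0; n < nr; n++) {
			if (n == nid)
				continue;
			if (dist[nid][n] < min_val) {
				min_val = dist[nid][n];
				peer = n;
			}
		}
		printf("node %d -> peer %d (distance %d)\n",
		       nid, peer, min_val);
	}
	return 0;
}
```

On the distance matrix shown above, node 0's peer comes out as node 2
(distance 20, closer than the 21 to the other DRAM node 1), matching the
intended DRAM-to-PMEM pairing.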
```
--- linux.orig/drivers/base/node.c	2018-12-23 19:39:51.647261099 +0800
+++ linux/drivers/base/node.c	2018-12-23 19:39:51.643261112 +0800
@@ -242,6 +242,16 @@ static ssize_t type_show(struct device *
 }
 static DEVICE_ATTR(type, S_IRUGO, type_show, NULL);
 
+static ssize_t peer_node_show(struct device *dev,
+			      struct device_attribute *attr, char *buf)
+{
+	int nid = dev->id;
+	struct pglist_data *pgdat = NODE_DATA(nid);
+
+	return sprintf(buf, "%d\n", pgdat->peer_node);
+}
+static DEVICE_ATTR(peer_node, S_IRUGO, peer_node_show, NULL);
+
 static struct attribute *node_dev_attrs[] = {
 	&dev_attr_cpumap.attr,
 	&dev_attr_cpulist.attr,
@@ -250,6 +260,7 @@ static struct attribute *node_dev_attrs[
 	&dev_attr_distance.attr,
 	&dev_attr_vmstat.attr,
 	&dev_attr_type.attr,
+	&dev_attr_peer_node.attr,
 	NULL
 };
 ATTRIBUTE_GROUPS(node_dev);
--- linux.orig/include/linux/mmzone.h	2018-12-23 19:39:51.647261099 +0800
+++ linux/include/linux/mmzone.h	2018-12-23 19:39:51.643261112 +0800
@@ -713,6 +713,18 @@ typedef struct pglist_data {
 	/* Per-node vmstats */
 	struct per_cpu_nodestat __percpu *per_cpu_nodestats;
 	atomic_long_t		vm_stat[NR_VM_NODE_STAT_ITEMS];
+
+	/*
+	 * Points to the nearest node in terms of latency
+	 * E.g. peer of node 0 is node 2 per SLIT
+	 * node distances:
+	 * node   0   1   2   3
+	 *   0:  10  21  17  28
+	 *   1:  21  10  28  17
+	 *   2:  17  28  10  28
+	 *   3:  28  17  28  10
+	 */
+	int peer_node;
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
--- linux.orig/mm/page_alloc.c	2018-12-23 19:39:51.647261099 +0800
+++ linux/mm/page_alloc.c	2018-12-23 19:39:51.643261112 +0800
@@ -6926,6 +6926,34 @@ static void check_for_memory(pg_data_t *
 	}
 }
 
+/*
+ * Return the nearest peer node in terms of *locality*
+ * E.g. peer of node 0 is node 2 per SLIT
+ * node distances:
+ * node   0   1   2   3
+ *   0:  10  21  17  28
+ *   1:  21  10  28  17
+ *   2:  17  28  10  28
+ *   3:  28  17  28  10
+ */
+static int find_best_peer_node(int nid)
+{
+	int n, val;
+	int min_val = INT_MAX;
+	int peer = NUMA_NO_NODE;
+
+	for_each_online_node(n) {
+		if (n == nid)
+			continue;
+		val = node_distance(nid, n);
+		if (val < min_val) {
+			min_val = val;
+			peer = n;
+		}
+	}
+	return peer;
+}
+
 /**
  * free_area_init_nodes - Initialise all pg_data_t and zone data
  * @max_zone_pfn: an array of max PFNs for each zone
@@ -7012,6 +7040,7 @@ void __init free_area_init_nodes(unsigne
 		if (pgdat->node_present_pages)
 			node_set_state(nid, N_MEMORY);
 		check_for_memory(pgdat, nid);
+		pgdat->peer_node = find_best_peer_node(nid);
 	}
 }
```
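For completeness, the new attribute can be read back from userspace once
the patch is applied. A minimal sketch, assuming sysfs is mounted at /sys;
on the 2S machine above, node0/peer_node would be expected to read back 2:

```
/*
 * Illustration only: dump the peer_node attribute that this patch
 * adds under /sys/devices/system/node/nodeN/, for every node present.
 */
#include <stdio.h>

int main(void)
{
	for (int nid = 0; ; nid++) {
		char path[64];
		FILE *f;
		int peer;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/peer_node", nid);
		f = fopen(path, "r");
		if (!f)
			break;	/* no more nodes (or patch not applied) */
		if (fscanf(f, "%d", &peer) == 1)
			printf("node %d peer_node %d\n", nid, peer);
		fclose(f);
	}
	return 0;
}
```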