Message ID | 20181226133351.644607371@intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | PMEM NUMA node and hotness accounting/migration |
Fengguang Wu <fengguang.wu@intel.com> writes:

> From: Fan Du <fan.du@intel.com>
>
> When allocating a page, DRAM and PMEM nodes should not fall back to
> each other. This allows migration code to explicitly control which type
> of node to allocate pages from.
>
> With this patch, a PMEM NUMA node can only be used in 2 ways:
> - migrate in and out
> - numactl

Can we achieve this using a nodemask? That way we don't tag nodes with
different properties such as DRAM/PMEM. We can then give the
flexibility to the device init code to add the new memory nodes to
the right nodemask.

-aneesh
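For concreteness, a minimal sketch of what the nodemask-based approach described above could look like, assuming a global mask that driver/device init code fills in; the names default_exclude_nodemask and memory_device_online_node are illustrative only and not from the series:

#include <linux/nodemask.h>

/* Nodes that default page allocation should not fall back to. */
static nodemask_t default_exclude_nodemask;

/*
 * Hypothetical hook called from the device init path (e.g. when a
 * PMEM-backed node comes online) once the node id is known.
 */
static void memory_device_online_node(int nid)
{
	node_set(nid, default_exclude_nodemask);
}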
On Tue, Jan 01, 2019 at 02:44:41PM +0530, Aneesh Kumar K.V wrote:
>Fengguang Wu <fengguang.wu@intel.com> writes:
>
>> From: Fan Du <fan.du@intel.com>
>>
>> When allocating a page, DRAM and PMEM nodes should not fall back to
>> each other. This allows migration code to explicitly control which type
>> of node to allocate pages from.
>>
>> With this patch, a PMEM NUMA node can only be used in 2 ways:
>> - migrate in and out
>> - numactl
>
>Can we achieve this using a nodemask? That way we don't tag nodes with
>different properties such as DRAM/PMEM. We can then give the
>flexibility to the device init code to add the new memory nodes to
>the right nodemask

Aneesh, in patch 2 we did create the nodemasks numa_nodes_pmem and
numa_nodes_dram. What's your suggested way of "using nodemask"?

Thanks,
Fengguang
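For readers without patch 2 of the series at hand, the two masks mentioned above would presumably look roughly like the following; this is a sketch only, and the exact definitions and the init path that populates them (e.g. the x86 NUMA/SRAT setup) are in patch 2:

/* Sketch; see patch 2 for the real definitions. */
nodemask_t numa_nodes_pmem;	/* nodes backed by persistent memory */
nodemask_t numa_nodes_dram;	/* ordinary volatile DRAM nodes */

/* Hypothetical helper called once a node's type is known at boot. */
static void __init record_node_type(int nid, bool is_pmem)
{
	if (is_pmem)
		node_set(nid, numa_nodes_pmem);
	else
		node_set(nid, numa_nodes_dram);
}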
Fengguang Wu <fengguang.wu@intel.com> writes:

> On Tue, Jan 01, 2019 at 02:44:41PM +0530, Aneesh Kumar K.V wrote:
>>Fengguang Wu <fengguang.wu@intel.com> writes:
>>
>>> From: Fan Du <fan.du@intel.com>
>>>
>>> When allocating a page, DRAM and PMEM nodes should not fall back to
>>> each other. This allows migration code to explicitly control which type
>>> of node to allocate pages from.
>>>
>>> With this patch, a PMEM NUMA node can only be used in 2 ways:
>>> - migrate in and out
>>> - numactl
>>
>>Can we achieve this using a nodemask? That way we don't tag nodes with
>>different properties such as DRAM/PMEM. We can then give the
>>flexibility to the device init code to add the new memory nodes to
>>the right nodemask
>
> Aneesh, in patch 2 we did create the nodemasks numa_nodes_pmem and
> numa_nodes_dram. What's your suggested way of "using nodemask"?

IIUC the patch is meant to avoid allocation from PMEM nodes, and the way
you achieve it is by checking is_node_pmem(n). We already have an
abstraction for avoiding allocation from a node: the nodemask. I was
wondering whether we can do the equivalent of the above using that,
i.e. __next_zone_zonelist() can do

	zref_in_nodemask(z, default_exclude_nodemask)

and decide whether to use the specific zone or not. That way we don't
add special code like

+	PGDAT_DRAM,		/* Volatile DRAM memory node */
+	PGDAT_PMEM,		/* Persistent memory node */

The reason is that there could be other device memory that would want
to get excluded from the default allocation, like you are doing for
PMEM.

-aneesh
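A rough illustration of this suggestion, modeled on the existing __next_zones_zonelist() helper in mm/mmzone.c; default_exclude_nodemask is an assumed global mask that device init code would populate, not an existing kernel symbol, and this is a sketch rather than a proposed patch:

/* Sketch: skip excluded nodes during the default zonelist walk. */
struct zoneref *__next_zones_zonelist(struct zoneref *z,
				enum zone_type highest_zoneidx,
				nodemask_t *nodes)
{
	if (unlikely(nodes == NULL))
		/* Default allocations also skip excluded (e.g. PMEM) nodes. */
		while (zonelist_zone_idx(z) > highest_zoneidx ||
		       (z->zone && zref_in_nodemask(z, &default_exclude_nodemask)))
			z++;
	else
		while (zonelist_zone_idx(z) > highest_zoneidx ||
		       (z->zone && !zref_in_nodemask(z, nodes)))
			z++;

	return z;
}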
--- linux.orig/mm/mempolicy.c	2018-12-26 20:03:49.821417489 +0800
+++ linux/mm/mempolicy.c	2018-12-26 20:29:24.597884301 +0800
@@ -1745,6 +1745,20 @@ static int policy_node(gfp_t gfp, struct
 		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
 	}
 
+	if (policy->mode == MPOL_BIND) {
+		nodemask_t nodes = policy->v.nodes;
+
+		/*
+		 * The rule is if we run on DRAM node and mbind to PMEM node,
+		 * perferred node id is the peer node, vice versa.
+		 * if we run on DRAM node and mbind to DRAM node, #PF node is
+		 * the preferred node, vice versa, so just fall back.
+		 */
+		if ((is_node_dram(nd) && nodes_subset(nodes, numa_nodes_pmem)) ||
+		    (is_node_pmem(nd) && nodes_subset(nodes, numa_nodes_dram)))
+			nd = NODE_DATA(nd)->peer_node;
+	}
+
 	return nd;
 }
 
--- linux.orig/mm/page_alloc.c	2018-12-26 20:03:49.821417489 +0800
+++ linux/mm/page_alloc.c	2018-12-26 20:03:49.817417321 +0800
@@ -5153,6 +5153,10 @@ static int find_next_best_node(int node,
 		if (node_isset(n, *used_node_mask))
 			continue;
 
+		/* DRAM node doesn't fallback to pmem node */
+		if (is_node_pmem(n))
+			continue;
+
 		/* Use the distance array to find the distance */
 		val = node_distance(node, n);
 
@@ -5242,19 +5246,31 @@ static void build_zonelists(pg_data_t *p
 	nodes_clear(used_mask);
 	memset(node_order, 0, sizeof(node_order));
 
-	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
-		/*
-		 * We don't want to pressure a particular node.
-		 * So adding penalty to the first node in same
-		 * distance group to make it round-robin.
-		 */
-		if (node_distance(local_node, node) !=
-		    node_distance(local_node, prev_node))
-			node_load[node] = load;
-
-		node_order[nr_nodes++] = node;
-		prev_node = node;
-		load--;
+	/* Pmem node doesn't fallback to DRAM node */
+	if (is_node_pmem(local_node)) {
+		int n;
+
+		/* Pmem nodes should fallback to each other */
+		node_order[nr_nodes++] = local_node;
+		for_each_node_state(n, N_MEMORY) {
+			if ((n != local_node) && is_node_pmem(n))
+				node_order[nr_nodes++] = n;
+		}
+	} else {
+		while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
+			/*
+			 * We don't want to pressure a particular node.
+			 * So adding penalty to the first node in same
+			 * distance group to make it round-robin.
+			 */
+			if (node_distance(local_node, node) !=
+			    node_distance(local_node, prev_node))
+				node_load[node] = load;
+
+			node_order[nr_nodes++] = node;
+			prev_node = node;
+			load--;
+		}
 	}
 
 	build_zonelists_in_node_order(pgdat, node_order, nr_nodes);
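Since the changelog says a PMEM node is then reachable only via migration or explicit policy, a userspace sketch of the explicit-policy path may help; the node id and mapping size below are made up for illustration and are system dependent:

/* Hypothetical userspace example: bind a mapping to an assumed PMEM node. */
#include <numaif.h>		/* mbind(), MPOL_BIND; link with -lnuma */
#include <sys/mman.h>
#include <stdlib.h>

#define PMEM_NODE 2		/* assumed PMEM NUMA node id */

int main(void)
{
	unsigned long nodemask = 1UL << PMEM_NODE;
	size_t len = 2UL << 20;	/* 2MB */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	/*
	 * Restrict the range to the PMEM node; with the patch, default
	 * allocations never reach it, but an explicit MPOL_BIND still does.
	 */
	if (mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0))
		return 1;

	p[0] = 1;		/* fault a page in, allocated from the PMEM node */
	return 0;
}

The equivalent from the shell would be something like "numactl --membind=2 <cmd>", where node 2 is again just an assumed PMEM node id.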