Message ID: 20220603134237.131362-10-aneesh.kumar@linux.ibm.com (mailing list archive)
State: New
Series: mm/demotion: Memory tiers and demotion
On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
> With memory tiers support we can have memory on NUMA nodes
> in the top tier from which we want to avoid promotion tracking NUMA
> faults. Update node_is_toptier to work with memory tiers. To
> avoid taking locks, a nodemask is maintained for all demotion
> targets. All NUMA nodes are by default top tier nodes and as
> we add new lower memory tiers NUMA nodes get added to the
> demotion targets thereby moving them out of the top tier.

Check the usage of node_is_toptier(),

- migrate_misplaced_page()
  node_is_toptier() is used to check whether migration is a promotion.
  We can avoid using it. Just compare the rank of the nodes.

- change_pte_range() and change_huge_pmd()
  node_is_toptier() is used to avoid scanning fast memory (DRAM) pages
  for promotion. So I think we should change the name to node_is_fast(),
  as follows,

static inline bool node_is_fast(int node)
{
	return NODE_DATA(node)->mt_rank >= MEMORY_RANK_DRAM;
}

And, as above, I suggest adding the memory tier ID and rank to struct
pglist_data directly.

Best Regards,
Huang, Ying
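The rank comparison suggested here for migrate_misplaced_page() can be modeled in a few lines of plain C. In the sketch below, node_rank[] and its contents are hypothetical stand-ins for whatever per-node rank the kernel would expose (higher rank meaning a faster tier); the point is only that the promotion check reduces to an ordering test between two nodes:

/* Toy model; node_rank[] and its values are made up for illustration. */
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical ranks for nodes 0-2: two fast nodes and one slow node. */
static const int node_rank[] = { 300, 300, 200 };

/* A migration is a promotion iff the target tier outranks the source. */
static bool migration_is_promotion(int src_nid, int dst_nid)
{
	return node_rank[dst_nid] > node_rank[src_nid];
}

int main(void)
{
	printf("2 -> 0: %s\n", migration_is_promotion(2, 0) ?
	       "promotion" : "not a promotion");
	printf("0 -> 2: %s\n", migration_is_promotion(0, 2) ?
	       "promotion" : "not a promotion");
	return 0;
}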
On 6/6/22 8:41 AM, Ying Huang wrote:
> On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
[snip]
> - change_pte_range() and change_huge_pmd()
>   node_is_toptier() is used to avoid scanning fast memory (DRAM) pages
>   for promotion. So I think we should change the name to node_is_fast(),
>   as follows,
>
> static inline bool node_is_fast(int node)
> {
> 	return NODE_DATA(node)->mt_rank >= MEMORY_RANK_DRAM;
> }

But that gives special meaning to MEMORY_RANK_DRAM. As detailed in other
patches, the absolute value of a rank doesn't carry any meaning; it is
only the relative value w.r.t. other memory tiers that decides whether a
tier is fast or not. Agreed, by default memory tiers get built with
MEMORY_RANK_DRAM, but userspace can change the rank value of 'memtier1'.
Hence, determining whether a node consists of fast memory is essentially
figuring out whether the node is in the topmost tier of the memory
hierarchy, not just whether its memory tier rank value is >=
MEMORY_RANK_DRAM.

-aneesh
On Mon, 2022-06-06 at 09:22 +0530, Aneesh Kumar K V wrote:
> On 6/6/22 8:41 AM, Ying Huang wrote:
[snip]
> But that gives special meaning to MEMORY_RANK_DRAM. As detailed in other
> patches, the absolute value of a rank doesn't carry any meaning; it is
> only the relative value w.r.t. other memory tiers that decides whether a
> tier is fast or not. Agreed, by default memory tiers get built with
> MEMORY_RANK_DRAM, but userspace can change the rank value of 'memtier1'.
> Hence, determining whether a node consists of fast memory is essentially
> figuring out whether the node is in the topmost tier of the memory
> hierarchy, not just whether its memory tier rank value is >=
> MEMORY_RANK_DRAM.

In a system with 3 tiers,

	HBM	0
	DRAM	1
	PMEM	2

in your implementation only HBM will be considered fast. But what we
need is to consider both HBM and DRAM fast, because we use NUMA
balancing to promote PMEM pages to DRAM. It's unnecessary to scan HBM
and DRAM pages for that, and there is no requirement to promote DRAM
pages to HBM with NUMA balancing.

I can understand that the memory tiers are more dynamic now. For the
requirements of NUMA balancing, we need the lowest memory tier (rank)
in which there is at least one node with CPUs. The nodes in it and in
the higher tiers will be considered fast.

Best Regards,
Huang, Ying
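The rule Ying states here, that the lowest-rank tier containing a CPU node sets the cut-off and everything at or above it counts as fast, can be checked with a small self-contained C program. The tier names, rank values, and has_cpu_node flags below are invented example data, not values from the series:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical example topology; higher rank = faster tier. */
struct tier {
	const char *name;
	int rank;
	bool has_cpu_node;
};

int main(void)
{
	struct tier tiers[] = {
		{ "HBM",  400, false },	/* CPU-less accelerator memory */
		{ "DRAM", 300, true  },	/* the CPUs live here */
		{ "PMEM", 200, false },	/* CPU-less slow memory */
	};
	int n = sizeof(tiers) / sizeof(tiers[0]);
	int i, top_tier_rank = -1;

	/*
	 * The lowest-rank tier that contains at least one CPU node sets
	 * the cut-off; it and every higher-rank tier count as fast.
	 */
	for (i = 0; i < n; i++)
		if (tiers[i].has_cpu_node &&
		    (top_tier_rank < 0 || tiers[i].rank < top_tier_rank))
			top_tier_rank = tiers[i].rank;

	for (i = 0; i < n; i++)
		printf("%-4s: %s\n", tiers[i].name,
		       tiers[i].rank >= top_tier_rank ?
		       "fast (skip promotion scanning)" :
		       "slow (promotion candidate)");
	return 0;
}

With CPUs only in the DRAM tier, this prints HBM and DRAM as fast and PMEM as the only promotion candidate, matching the behaviour asked for above.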
On 6/6/22 12:54 PM, Ying Huang wrote:
> On Mon, 2022-06-06 at 09:22 +0530, Aneesh Kumar K V wrote:
[snip]
> I can understand that the memory tiers are more dynamic now. For the
> requirements of NUMA balancing, we need the lowest memory tier (rank)
> in which there is at least one node with CPUs. The nodes in it and in
> the higher tiers will be considered fast.

Is this good (not tested)?

	/*
	 * Build the allowed promotion mask. Promotion is allowed
	 * from a higher memory tier to a lower memory tier only if
	 * the lower memory tier doesn't include compute. We want to
	 * skip promotion from a memory tier if any node which is
	 * part of that memory tier has CPUs. Once we detect such
	 * a memory tier, we consider that tier the top tier, from
	 * which promotion is not allowed.
	 */
	list_for_each_entry_reverse(memtier, &memory_tiers, list) {
		nodes_and(allowed, node_states[N_CPU], memtier->nodelist);
		if (nodes_empty(allowed))
			nodes_or(promotion_mask, promotion_mask, memtier->nodelist);
		else
			break;
	}

and then

static inline bool node_is_toptier(int node)
{
	return !node_isset(node, promotion_mask);
}
On Mon, 2022-06-06 at 14:03 +0530, Aneesh Kumar K V wrote:
> On 6/6/22 12:54 PM, Ying Huang wrote:
[snip]
> and then
>
> static inline bool node_is_toptier(int node)
> {
> 	return !node_isset(node, promotion_mask);
> }

This should work. But it appears unnatural. So, I don't think we should
keep adding more and more node masks to mitigate the design decision
that we cannot access memory tier information directly. All of this
becomes simple and natural if we can access the memory tier information
directly.

Best Regards,
Huang, Ying
On 6/8/22 12:56 PM, Ying Huang wrote:
> On Mon, 2022-06-06 at 14:03 +0530, Aneesh Kumar K V wrote:
[snip]
> This should work. But it appears unnatural. So, I don't think we should
> keep adding more and more node masks to mitigate the design decision
> that we cannot access memory tier information directly. All of this
> becomes simple and natural if we can access the memory tier information
> directly.

How do you derive whether a node is top tier if we have memtier details
in pgdat?

-aneesh
On Wed, 2022-06-08 at 13:58 +0530, Aneesh Kumar K V wrote:
> On 6/8/22 12:56 PM, Ying Huang wrote:
[snip]
> How do you derive whether a node is top tier if we have memtier details
> in pgdat?

pgdat -> memory tier -> rank

Then we can compare this rank with the fast memory rank. The fast
memory rank can be calculated dynamically at the appropriate places.

Best Regards,
Huang, Ying
Ying Huang <ying.huang@intel.com> writes:
[snip]
>> How do you derive whether a node is top tier if we have memtier details
>> in pgdat?
>
> pgdat -> memory tier -> rank
>
> Then we can compare this rank with the fast memory rank. The fast
> memory rank can be calculated dynamically at the appropriate places.

This is what I am testing now. We still need to closely audit the
lock-free access to NODE_DATA()->memtier. For v6 I will keep this as a
separate patch, and once we all agree that it is safe, I will fold it
back.

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index a388a806b61a..3e733de1a8a0 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -17,7 +17,6 @@
 #define MAX_MEMORY_TIERS  (MAX_STATIC_MEMORY_TIERS + 2)
 
 extern bool numa_demotion_enabled;
-extern nodemask_t promotion_mask;
 int node_create_and_set_memory_tier(int node, int tier);
 int next_demotion_node(int node);
 int node_set_memory_tier(int node, int tier);
@@ -25,15 +24,7 @@ int node_get_memory_tier_id(int node);
 int node_reset_memory_tier(int node, int tier);
 void node_remove_from_memory_tier(int node);
 void node_get_allowed_targets(int node, nodemask_t *targets);
-
-/*
- * By default all nodes are top tiper. As we create new memory tiers
- * we below top tiers we add them to NON_TOP_TIER state.
- */
-static inline bool node_is_toptier(int node)
-{
-	return !node_isset(node, promotion_mask);
-}
+bool node_is_toptier(int node);
 
 #else
 #define numa_demotion_enabled	false
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index aab70355d64f..c4fcfd2b9980 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -928,6 +928,9 @@ typedef struct pglist_data {
 	/* Per-node vmstats */
 	struct per_cpu_nodestat __percpu *per_cpu_nodestats;
 	atomic_long_t		vm_stat[NR_VM_NODE_STAT_ITEMS];
+#ifdef CONFIG_TIERED_MEMORY
+	struct memory_tier *memtier;
+#endif
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 29a038bb38b0..31ef0fab5f19 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -7,6 +7,7 @@
 #include <linux/random.h>
 #include <linux/memory.h>
 #include <linux/idr.h>
+#include <linux/rcupdate.h>
 
 #include "internal.h"
 
@@ -26,7 +27,7 @@ struct demotion_nodes {
 static void establish_migration_targets(void);
 static DEFINE_MUTEX(memory_tier_lock);
 static LIST_HEAD(memory_tiers);
-nodemask_t promotion_mask;
+static int top_tier_rank;
 /*
  * node_demotion[] examples:
  *
@@ -135,7 +136,7 @@ static void memory_tier_device_release(struct device *dev)
 	if (tier->dev.id >= MAX_STATIC_MEMORY_TIERS)
 		ida_free(&memtier_dev_id, tier->dev.id);
 
-	kfree(tier);
+	kfree_rcu(tier);
 }
 
 /*
@@ -233,6 +234,70 @@ static struct memory_tier *__get_memory_tier_from_id(int id)
 	return NULL;
 }
 
+/*
+ * Called with memory_tier_lock. Hence the device references cannot
+ * be dropped during this function.
+ */
+static void memtier_node_clear(int node, struct memory_tier *memtier)
+{
+	pg_data_t *pgdat;
+
+	pgdat = NODE_DATA(node);
+	if (!pgdat)
+		return;
+
+	rcu_assign_pointer(pgdat->memtier, NULL);
+	/*
+	 * Make sure the read side sees the NULL value before we clear the
+	 * node from the nodelist.
+	 */
+	synchronize_rcu();
+	node_clear(node, memtier->nodelist);
+}
+
+static void memtier_node_set(int node, struct memory_tier *memtier)
+{
+	pg_data_t *pgdat;
+
+	pgdat = NODE_DATA(node);
+	if (!pgdat)
+		return;
+	/*
+	 * Make sure we mark the memtier NULL before we assign the new memory
+	 * tier to the NUMA node. This makes sure that anybody looking at
+	 * NODE_DATA finds a NULL memtier or the one which is still valid.
+	 */
+	rcu_assign_pointer(pgdat->memtier, NULL);
+	synchronize_rcu();
+	node_set(node, memtier->nodelist);
+	rcu_assign_pointer(pgdat->memtier, memtier);
+}
+
+bool node_is_toptier(int node)
+{
+	bool toptier;
+	pg_data_t *pgdat;
+	struct memory_tier *memtier;
+
+	pgdat = NODE_DATA(node);
+	if (!pgdat)
+		return false;
+
+	rcu_read_lock();
+	memtier = rcu_dereference(pgdat->memtier);
+	if (!memtier) {
+		toptier = true;
+		goto out;
+	}
+	if (memtier->rank >= top_tier_rank)
+		toptier = true;
+	else
+		toptier = false;
+out:
+	rcu_read_unlock();
+	return toptier;
+}
+
 static int __node_create_and_set_memory_tier(int node, int tier)
 {
 	int ret = 0;
@@ -253,7 +318,7 @@ static int __node_create_and_set_memory_tier(int node, int tier)
 			goto out;
 		}
 	}
-	node_set(node, memtier->nodelist);
+	memtier_node_set(node, memtier);
 out:
 	return ret;
 }
@@ -275,12 +340,12 @@ int node_create_and_set_memory_tier(int node, int tier)
 	if (current_tier->dev.id == tier)
 		goto out;
 
-	node_clear(node, current_tier->nodelist);
+	memtier_node_clear(node, current_tier);
 
 	ret = __node_create_and_set_memory_tier(node, tier);
 	if (ret) {
 		/* reset it back to older tier */
-		node_set(node, current_tier->nodelist);
+		memtier_node_set(node, current_tier);
 		goto out;
 	}
 
@@ -305,7 +370,7 @@ static int __node_set_memory_tier(int node, int tier)
 		ret = -EINVAL;
 		goto out;
 	}
-	node_set(node, memtier->nodelist);
+	memtier_node_set(node, memtier);
 out:
 	return ret;
 }
@@ -374,12 +439,12 @@ int node_reset_memory_tier(int node, int tier)
 	if (current_tier->dev.id == tier)
 		goto out;
 
-	node_clear(node, current_tier->nodelist);
+	memtier_node_clear(node, current_tier);
 
 	ret = __node_set_memory_tier(node, tier);
 	if (ret) {
 		/* reset it back to older tier */
-		node_set(node, current_tier->nodelist);
+		memtier_node_set(node, current_tier);
 		goto out;
 	}
 
@@ -407,7 +472,7 @@ void node_remove_from_memory_tier(int node)
 	 * empty then unregister it to make it invisible
 	 * in sysfs.
 	 */
-	node_clear(node, memtier->nodelist);
+	memtier_node_clear(node, memtier);
 	if (nodes_empty(memtier->nodelist))
 		unregister_memory_tier(memtier);
 
@@ -570,15 +635,13 @@ static void establish_migration_targets(void)
 	 * a memory tier, we consider that tier as top tiper from
 	 * which promotion is not allowed.
 	 */
-	promotion_mask = NODE_MASK_NONE;
 	list_for_each_entry_reverse(memtier, &memory_tiers, list) {
 		nodes_and(allowed, node_states[N_CPU], memtier->nodelist);
-		if (nodes_empty(allowed))
-			nodes_or(promotion_mask, promotion_mask, memtier->nodelist);
-		else
+		if (!nodes_empty(allowed)) {
+			top_tier_rank = memtier->rank;
 			break;
+		}
 	}
-
 	pr_emerg("top tier rank is %d\n", top_tier_rank);
 	allowed = NODE_MASK_NONE;
 	/*
@@ -748,7 +811,7 @@ static const struct attribute_group *memory_tier_attr_groups[] = {
 
 static int __init memory_tier_init(void)
 {
-	int ret;
+	int ret, node;
 	struct memory_tier *memtier;
 
 	ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
@@ -766,7 +829,13 @@ static int __init memory_tier_init(void)
 		panic("%s() failed to register memory tier: %d\n", __func__, ret);
 
 	/* CPU only nodes are not part of memory tiers. */
-	memtier->nodelist = node_states[N_MEMORY];
+	for_each_node_state(node, N_MEMORY) {
+		/*
+		 * Should be safe to do this early in the boot.
+		 */
+		NODE_DATA(node)->memtier = memtier;
+		node_set(node, memtier->nodelist);
+	}
 	migrate_on_reclaim_init();
 
 	return 0;
On Wed, 2022-06-08 at 20:07 +0530, Aneesh Kumar K.V wrote:
> This is what I am testing now. We still need to closely audit the
> lock-free access to NODE_DATA()->memtier.

You're referring to this or something else? This is a write, so it seems
okay.

> +	for_each_node_state(node, N_MEMORY) {
> +		/*
> +		 * Should be safe to do this early in the boot.
> +		 */
> +		NODE_DATA(node)->memtier = memtier;
> +		node_set(node, memtier->nodelist);
> +	}
> 	migrate_on_reclaim_init();

> For v6 I will keep this as a separate patch, and once we all agree that
> it is safe, I will fold it back.

Please update the code that uses __node_get_memory_tier(node) to use
NODE_DATA(node)->memtier.

Otherwise the code looks okay at first glance.

Tim

[snip]
On Wed, 2022-06-08 at 20:07 +0530, Aneesh Kumar K.V wrote:
> Ying Huang <ying.huang@intel.com> writes:
[snip]
> This is what I am testing now. We still need to closely audit the
> lock-free access to NODE_DATA()->memtier. For v6 I will keep this as a
> separate patch, and once we all agree that it is safe, I will fold it
> back.

Thanks for doing this. We finally have a way to access memory_tier in
the hot path.

[snip]

> +static void memtier_node_set(int node, struct memory_tier *memtier)
> +{
> +	pg_data_t *pgdat;
> +
> +	pgdat = NODE_DATA(node);
> +	if (!pgdat)
> +		return;
> +	/*
> +	 * Make sure we mark the memtier NULL before we assign the new memory
> +	 * tier to the NUMA node. This makes sure that anybody looking at
> +	 * NODE_DATA finds a NULL memtier or the one which is still valid.
> +	 */
> +	rcu_assign_pointer(pgdat->memtier, NULL);
> +	synchronize_rcu();
> +	node_set(node, memtier->nodelist);
> +	rcu_assign_pointer(pgdat->memtier, memtier);
> +}

Per my understanding, in your code, when we change pgdat->memtier, we
will call synchronize_rcu() twice. IMHO, once should be OK. That is,
something like below,

	rcu_assign_pointer(pgdat->memtier, NULL);
	node_clear(node, memtier->nodelist);
	synchronize_rcu();
	node_set(node, new_memtier->nodelist);
	rcu_assign_pointer(pgdat->memtier, new_memtier);

In this way, there will be 3 states,

1. prev
	pgdat->memtier == old_memtier
	node_isset(node, old_memtier->nodelist)
	!node_isset(node, new_memtier->nodelist)

2. transitioning
	pgdat->memtier == NULL
	!node_isset(node, old_memtier->nodelist)
	!node_isset(node, new_memtier->nodelist)

3. after
	pgdat->memtier == new_memtier
	!node_isset(node, old_memtier->nodelist)
	node_isset(node, new_memtier->nodelist)

The real state may be one of 1, 2, 3, 1+2, or 2+3. But it will not be
1+3. I think that this satisfies our requirements.

[snip]

> 	/* CPU only nodes are not part of memory tiers. */
> -	memtier->nodelist = node_states[N_MEMORY];
> +	for_each_node_state(node, N_MEMORY) {
> +		/*
> +		 * Should be safe to do this early in the boot.
> +		 */
> +		NODE_DATA(node)->memtier = memtier;

Not absolutely required. But IMHO it's more consistent to use
rcu_assign_pointer() here.

> +		node_set(node, memtier->nodelist);
> +	}
> 	migrate_on_reclaim_init();

Best Regards,
Huang, Ying
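The "no 1+3" claim can also be walked through mechanically. The toy C program below steps a single reader-visible snapshot through the five writer actions of the proposed sequence and prints it after each one; the old memtier pointer is never visible together with membership in the new tier's nodelist:

#include <stdbool.h>
#include <stdio.h>

/*
 * One observable snapshot: the memtier pointer a reader would see,
 * plus the node's membership in the old and new tiers' nodelists.
 */
struct snap {
	const char *memtier;
	bool in_old;
	bool in_new;
};

static void show(const char *step, struct snap s)
{
	printf("%-26s memtier=%-11s in_old=%d in_new=%d\n",
	       step, s.memtier, s.in_old, s.in_new);
}

int main(void)
{
	struct snap s = { "old_memtier", true, false };	/* state 1 */

	show("initial", s);
	s.memtier = "NULL";		/* rcu_assign_pointer(..., NULL) */
	show("after NULL publish", s);	/* mix 1+2 */
	s.in_old = false;		/* node_clear(node, old->nodelist) */
	show("after node_clear + sync", s);	/* state 2 */
	s.in_new = true;		/* node_set(node, new->nodelist) */
	show("after node_set", s);	/* mix 2+3 */
	s.memtier = "new_memtier";	/* rcu_assign_pointer(..., new) */
	show("final", s);		/* state 3 */
	return 0;
}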
With memory tiers support we can have memory on NUMA nodes
in the top tier from which we want to avoid promotion tracking NUMA
faults. Update node_is_toptier to work with memory tiers. To
avoid taking locks, a nodemask is maintained for all demotion
targets. All NUMA nodes are by default top tier nodes and as
we add new lower memory tiers NUMA nodes get added to the
demotion targets thereby moving them out of the top tier.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/memory-tiers.h | 16 ++++++++++++++++
 include/linux/node.h         |  5 -----
 mm/huge_memory.c             |  1 +
 mm/memory-tiers.c            | 10 ++++++++++
 mm/migrate.c                 |  1 +
 mm/mprotect.c                |  1 +
 6 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index cd6e71f702ad..32e0e6fabf02 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -16,12 +16,23 @@
 #define MAX_MEMORY_TIERS  3
 
 extern bool numa_demotion_enabled;
+extern nodemask_t demotion_target_mask;
 int next_demotion_node(int node);
 void node_remove_from_memory_tier(int node);
 int node_get_memory_tier_id(int node);
 int node_set_memory_tier(int node, int tier);
 int node_reset_memory_tier(int node, int tier);
 void node_get_allowed_targets(int node, nodemask_t *targets);
+
+/*
+ * By default all nodes are top tiper. As we create new memory tiers
+ * we below top tiers we add them to NON_TOP_TIER state.
+ */
+static inline bool node_is_toptier(int node)
+{
+	return !node_isset(node, demotion_target_mask);
+}
+
 #else
 #define numa_demotion_enabled	false
 static inline int next_demotion_node(int node)
@@ -33,6 +44,11 @@ static inline void node_get_allowed_targets(int node, nodemask_t *targets)
 {
 	*targets = NODE_MASK_NONE;
 }
+
+static inline bool node_is_toptier(int node)
+{
+	return true;
+}
 #endif	/* CONFIG_TIERED_MEMORY */
 
 #endif
diff --git a/include/linux/node.h b/include/linux/node.h
index 40d641a8bfb0..9ec680dd607f 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -185,9 +185,4 @@ static inline void register_hugetlbfs_with_node(node_registration_func_t reg,
 
 #define to_node(device) container_of(device, struct node, dev)
 
-static inline bool node_is_toptier(int node)
-{
-	return node_state(node, N_CPU);
-}
-
 #endif /* _LINUX_NODE_H_ */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a77c78a2b6b5..294873d4be2b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -35,6 +35,7 @@
 #include <linux/numa.h>
 #include <linux/page_owner.h>
 #include <linux/sched/sysctl.h>
+#include <linux/memory-tiers.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 592d939ec28d..df8e9910165a 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -31,6 +31,7 @@ static struct bus_type memory_tier_subsys = {
 static void establish_migration_targets(void);
 static DEFINE_MUTEX(memory_tier_lock);
 static LIST_HEAD(memory_tiers);
+nodemask_t demotion_target_mask;
 
 /*
  * node_demotion[] examples:
@@ -541,6 +542,15 @@ static void establish_migration_targets(void)
 	 */
 	list_for_each_entry(memtier, &memory_tiers, list)
 		nodes_or(allowed, allowed, memtier->nodelist);
+	/*
+	 * Add nodes to demotion target mask so that we can find
+	 * top tier easily.
+	 */
+	memtier = list_first_entry(&memory_tiers, struct memory_tier, list);
+	if (memtier)
+		nodes_andnot(demotion_target_mask, allowed, memtier->nodelist);
+	else
+		demotion_target_mask = NODE_MASK_NONE;
 	/*
 	 * Removes nodes not yet in N_MEMORY.
 	 */
diff --git a/mm/migrate.c b/mm/migrate.c
index 0b554625a219..78615c48fc0f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -50,6 +50,7 @@
 #include <linux/memory.h>
 #include <linux/random.h>
 #include <linux/sched/sysctl.h>
+#include <linux/memory-tiers.h>
 
 #include <asm/tlbflush.h>
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ba5592655ee3..92a2fc0fa88b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -31,6 +31,7 @@
 #include <linux/pgtable.h>
 #include <linux/sched/sysctl.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/memory-tiers.h>
 
 #include <asm/cacheflush.h>
 #include <asm/mmu_context.h>
 #include <asm/tlbflush.h>