Message ID | 20220527122528.129445-8-aneesh.kumar@linux.ibm.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm/demotion: Memory tiers and demotion | expand |
On Fri, 27 May 2022 17:55:28 +0530 "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote: > From: Jagdish Gediya <jvgediya@linux.ibm.com> > > currently, a higher tier node can only be demoted to selected > nodes on the next lower tier as defined by the demotion path, > not any other node from any lower tier. This strict, hard-coded > demotion order does not work in all use cases (e.g. some use cases > may want to allow cross-socket demotion to another node in the same > demotion tier as a fallback when the preferred demotion node is out > of space). This demotion order is also inconsistent with the page > allocation fallback order when all the nodes in a higher tier are > out of space: The page allocation can fall back to any node from any > lower tier, whereas the demotion order doesn't allow that currently. > > This patch adds support to get all the allowed demotion targets mask > for node, also demote_page_list() function is modified to utilize this > allowed node mask by filling it in migration_target_control structure > before passing it to migrate_pages(). > > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Ah ok. So this deals with any tier with higher rank case. Painful though it is I would suggest the series needs recreating as a single set of steps to reach the end goal rather than introducing one approach then modifying it. What you have now might work but as a series it's very hard to review. If their is a good reason to maintain this 'path to the answer' then it can be done but it's going to make this harder to get review + merge. Superficially this looks like it addresses my earlier comments. Jonathan > --- > include/linux/migrate.h | 5 ++++ > mm/migrate.c | 52 +++++++++++++++++++++++++++++++++++++---- > mm/vmscan.c | 38 ++++++++++++++---------------- > 3 files changed, 71 insertions(+), 24 deletions(-) > > diff --git a/include/linux/migrate.h b/include/linux/migrate.h > index 77c581f47953..1f3cbd5185ca 100644 > --- a/include/linux/migrate.h > +++ b/include/linux/migrate.h > @@ -182,6 +182,7 @@ void node_remove_from_memory_tier(int node); > int node_get_memory_tier_id(int node); > int node_set_memory_tier_rank(int node, int tier); > int node_reset_memory_tier(int node, int tier); > +void node_get_allowed_targets(int node, nodemask_t *targets); > #else > #define numa_demotion_enabled false > static inline int next_demotion_node(int node) > @@ -189,6 +190,10 @@ static inline int next_demotion_node(int node) > return NUMA_NO_NODE; > } > > +static inline void node_get_allowed_targets(int node, nodemask_t *targets) > +{ > + *targets = NODE_MASK_NONE; > +} > #endif /* CONFIG_TIERED_MEMORY */ > > #endif /* _LINUX_MIGRATE_H */ > diff --git a/mm/migrate.c b/mm/migrate.c > index 114c7428b9f3..84fac477538c 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -2129,6 +2129,7 @@ struct memory_tier { > > struct demotion_nodes { > nodemask_t preferred; > + nodemask_t allowed; > }; > > #define to_memory_tier(device) container_of(device, struct memory_tier, dev) > @@ -2475,6 +2476,25 @@ int node_set_memory_tier_rank(int node, int rank) > } > EXPORT_SYMBOL_GPL(node_set_memory_tier_rank); > > +void node_get_allowed_targets(int node, nodemask_t *targets) > +{ > + /* > + * node_demotion[] is updated without excluding this > + * function from running. > + * > + * If any node is moving to lower tiers then modifications > + * in node_demotion[] are still valid for this node, if any > + * node is moving to higher tier then moving node may be > + * used once for demotion which should be ok so rcu should > + * be enough here. > + */ > + rcu_read_lock(); > + > + *targets = node_demotion[node].allowed; > + > + rcu_read_unlock(); > +} > + > /** > * next_demotion_node() - Get the next node in the demotion path > * @node: The starting node to lookup the next node > @@ -2534,8 +2554,10 @@ static void __disable_all_migrate_targets(void) > { > int node; > > - for_each_node_mask(node, node_states[N_MEMORY]) > + for_each_node_mask(node, node_states[N_MEMORY]) { > node_demotion[node].preferred = NODE_MASK_NONE; > + node_demotion[node].allowed = NODE_MASK_NONE; > + } > } > > static void disable_all_migrate_targets(void) > @@ -2558,12 +2580,11 @@ static void disable_all_migrate_targets(void) > */ > static void establish_migration_targets(void) > { > - struct list_head *ent; > struct memory_tier *memtier; > struct demotion_nodes *nd; > - int tier, target = NUMA_NO_NODE, node; > + int target = NUMA_NO_NODE, node; > int distance, best_distance; > - nodemask_t used; > + nodemask_t used, allowed = NODE_MASK_NONE; > > if (!node_demotion) > return; > @@ -2603,6 +2624,29 @@ static void establish_migration_targets(void) > } > } while (1); > } > + /* > + * Now build the allowed mask for each node collecting node mask from > + * all memory tier below it. This allows us to fallback demotion page > + * allocation to a set of nodes that is closer the above selected > + * perferred node. > + */ > + list_for_each_entry(memtier, &memory_tiers, list) > + nodes_or(allowed, allowed, memtier->nodelist); > + /* > + * Removes nodes not yet in N_MEMORY. > + */ > + nodes_and(allowed, node_states[N_MEMORY], allowed); > + > + list_for_each_entry(memtier, &memory_tiers, list) { > + /* > + * Keep removing current tier from allowed nodes, > + * This will remove all nodes in current and above > + * memory tier from the allowed mask. > + */ > + nodes_andnot(allowed, allowed, memtier->nodelist); > + for_each_node_mask(node, memtier->nodelist) > + node_demotion[node].allowed = allowed; > + } > } > > /* > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 1678802e03e7..feb994589481 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1454,23 +1454,6 @@ static void folio_check_dirty_writeback(struct folio *folio, > mapping->a_ops->is_dirty_writeback(&folio->page, dirty, writeback); > } > > -static struct page *alloc_demote_page(struct page *page, unsigned long node) > -{ > - struct migration_target_control mtc = { > - /* > - * Allocate from 'node', or fail quickly and quietly. > - * When this happens, 'page' will likely just be discarded > - * instead of migrated. > - */ > - .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | > - __GFP_THISNODE | __GFP_NOWARN | > - __GFP_NOMEMALLOC | GFP_NOWAIT, > - .nid = node > - }; > - > - return alloc_migration_target(page, (unsigned long)&mtc); > -} > - > /* > * Take pages on @demote_list and attempt to demote them to > * another node. Pages which are not demoted are left on > @@ -1481,6 +1464,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages, > { > int target_nid = next_demotion_node(pgdat->node_id); > unsigned int nr_succeeded; > + nodemask_t allowed_mask; > + > + struct migration_target_control mtc = { > + /* > + * Allocate from 'node', or fail quickly and quietly. > + * When this happens, 'page' will likely just be discarded > + * instead of migrated. > + */ > + .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN | > + __GFP_NOMEMALLOC | GFP_NOWAIT, > + .nid = target_nid, > + .nmask = &allowed_mask > + }; > > if (list_empty(demote_pages)) > return 0; > @@ -1488,10 +1484,12 @@ static unsigned int demote_page_list(struct list_head *demote_pages, > if (target_nid == NUMA_NO_NODE) > return 0; > > + node_get_allowed_targets(pgdat->node_id, &allowed_mask); > + > /* Demotion ignores all cpuset and mempolicy settings */ > - migrate_pages(demote_pages, alloc_demote_page, NULL, > - target_nid, MIGRATE_ASYNC, MR_DEMOTION, > - &nr_succeeded); > + migrate_pages(demote_pages, alloc_migration_target, NULL, > + (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION, > + &nr_succeeded); > > if (current_is_kswapd()) > __count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded);
On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote: > From: Jagdish Gediya <jvgediya@linux.ibm.com> > > currently, a higher tier node can only be demoted to selected > nodes on the next lower tier as defined by the demotion path, > not any other node from any lower tier. This strict, hard-coded > demotion order does not work in all use cases (e.g. some use cases > may want to allow cross-socket demotion to another node in the same > demotion tier as a fallback when the preferred demotion node is out > of space). This demotion order is also inconsistent with the page > allocation fallback order when all the nodes in a higher tier are > out of space: The page allocation can fall back to any node from any > lower tier, whereas the demotion order doesn't allow that currently. > > This patch adds support to get all the allowed demotion targets mask > for node, also demote_page_list() function is modified to utilize this > allowed node mask by filling it in migration_target_control structure > before passing it to migrate_pages(). > > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> > --- > include/linux/migrate.h | 5 ++++ > mm/migrate.c | 52 +++++++++++++++++++++++++++++++++++++---- > mm/vmscan.c | 38 ++++++++++++++---------------- > 3 files changed, 71 insertions(+), 24 deletions(-) > > diff --git a/include/linux/migrate.h b/include/linux/migrate.h > index 77c581f47953..1f3cbd5185ca 100644 > --- a/include/linux/migrate.h > +++ b/include/linux/migrate.h > @@ -182,6 +182,7 @@ void node_remove_from_memory_tier(int node); > int node_get_memory_tier_id(int node); > int node_set_memory_tier_rank(int node, int tier); > int node_reset_memory_tier(int node, int tier); > +void node_get_allowed_targets(int node, nodemask_t *targets); > #else > #define numa_demotion_enabled false > static inline int next_demotion_node(int node) > @@ -189,6 +190,10 @@ static inline int next_demotion_node(int node) > return NUMA_NO_NODE; > } > > > > > +static inline void node_get_allowed_targets(int node, nodemask_t *targets) > +{ > + *targets = NODE_MASK_NONE; > +} > #endif /* CONFIG_TIERED_MEMORY */ > > > > > #endif /* _LINUX_MIGRATE_H */ > diff --git a/mm/migrate.c b/mm/migrate.c > index 114c7428b9f3..84fac477538c 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -2129,6 +2129,7 @@ struct memory_tier { > > > > > struct demotion_nodes { > nodemask_t preferred; > + nodemask_t allowed; > }; > > > > > #define to_memory_tier(device) container_of(device, struct memory_tier, dev) > @@ -2475,6 +2476,25 @@ int node_set_memory_tier_rank(int node, int rank) > } > EXPORT_SYMBOL_GPL(node_set_memory_tier_rank); > > > > > +void node_get_allowed_targets(int node, nodemask_t *targets) > +{ > + /* > + * node_demotion[] is updated without excluding this > + * function from running. > + * > + * If any node is moving to lower tiers then modifications > + * in node_demotion[] are still valid for this node, if any > + * node is moving to higher tier then moving node may be > + * used once for demotion which should be ok so rcu should > + * be enough here. > + */ > + rcu_read_lock(); > + > + *targets = node_demotion[node].allowed; > + > + rcu_read_unlock(); > +} > + > /** > * next_demotion_node() - Get the next node in the demotion path > * @node: The starting node to lookup the next node > @@ -2534,8 +2554,10 @@ static void __disable_all_migrate_targets(void) > { > int node; > > > > > - for_each_node_mask(node, node_states[N_MEMORY]) > + for_each_node_mask(node, node_states[N_MEMORY]) { > node_demotion[node].preferred = NODE_MASK_NONE; > + node_demotion[node].allowed = NODE_MASK_NONE; > + } > } > > > > > static void disable_all_migrate_targets(void) > @@ -2558,12 +2580,11 @@ static void disable_all_migrate_targets(void) > */ > static void establish_migration_targets(void) > { > - struct list_head *ent; > struct memory_tier *memtier; > struct demotion_nodes *nd; > - int tier, target = NUMA_NO_NODE, node; > + int target = NUMA_NO_NODE, node; > int distance, best_distance; > - nodemask_t used; > + nodemask_t used, allowed = NODE_MASK_NONE; > > > > > if (!node_demotion) > return; > @@ -2603,6 +2624,29 @@ static void establish_migration_targets(void) > } > } while (1); > } > + /* > + * Now build the allowed mask for each node collecting node mask from > + * all memory tier below it. This allows us to fallback demotion page > + * allocation to a set of nodes that is closer the above selected > + * perferred node. > + */ > + list_for_each_entry(memtier, &memory_tiers, list) > + nodes_or(allowed, allowed, memtier->nodelist); > + /* > + * Removes nodes not yet in N_MEMORY. > + */ > + nodes_and(allowed, node_states[N_MEMORY], allowed); > + > + list_for_each_entry(memtier, &memory_tiers, list) { > + /* > + * Keep removing current tier from allowed nodes, > + * This will remove all nodes in current and above > + * memory tier from the allowed mask. > + */ > + nodes_andnot(allowed, allowed, memtier->nodelist); > + for_each_node_mask(node, memtier->nodelist) > + node_demotion[node].allowed = allowed; > + } > } > > > > > /* > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 1678802e03e7..feb994589481 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1454,23 +1454,6 @@ static void folio_check_dirty_writeback(struct folio *folio, > mapping->a_ops->is_dirty_writeback(&folio->page, dirty, writeback); > } > > > > > -static struct page *alloc_demote_page(struct page *page, unsigned long node) > -{ > - struct migration_target_control mtc = { > - /* > - * Allocate from 'node', or fail quickly and quietly. > - * When this happens, 'page' will likely just be discarded > - * instead of migrated. > - */ > - .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | > - __GFP_THISNODE | __GFP_NOWARN | > - __GFP_NOMEMALLOC | GFP_NOWAIT, > - .nid = node > - }; > - > - return alloc_migration_target(page, (unsigned long)&mtc); > -} > - > /* > * Take pages on @demote_list and attempt to demote them to > * another node. Pages which are not demoted are left on > @@ -1481,6 +1464,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages, > { > int target_nid = next_demotion_node(pgdat->node_id); > unsigned int nr_succeeded; > + nodemask_t allowed_mask; > + > + struct migration_target_control mtc = { > + /* > + * Allocate from 'node', or fail quickly and quietly. > + * When this happens, 'page' will likely just be discarded > + * instead of migrated. > + */ > + .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN | > + __GFP_NOMEMALLOC | GFP_NOWAIT, > + .nid = target_nid, > + .nmask = &allowed_mask > + }; IMHO, we should try to allocate from preferred node firstly (which will kick kswapd of the preferred node if necessary). If failed, we will fallback to all allowed node. As we discussed as follows, https://lore.kernel.org/lkml/69f2d063a15f8c4afb4688af7b7890f32af55391.camel@intel.com/ That is, something like below, static struct page *alloc_demote_page(struct page *page, unsigned long node) { struct page *page; nodemask_t allowed_mask; struct migration_target_control mtc = { /* * Allocate from 'node', or fail quickly and quietly. * When this happens, 'page' will likely just be discarded * instead of migrated. */ .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_THISNODE | __GFP_NOWARN | __GFP_NOMEMALLOC | GFP_NOWAIT, .nid = node }; page = alloc_migration_target(page, (unsigned long)&mtc); if (page) return page; mtc.gfp_mask &= ~__GFP_THISNODE; mtc.nmask = &allowed_mask; return alloc_migration_target(page, (unsigned long)&mtc); } Best Regards, Huang, Ying > if (list_empty(demote_pages)) > return 0; > @@ -1488,10 +1484,12 @@ static unsigned int demote_page_list(struct list_head *demote_pages, > if (target_nid == NUMA_NO_NODE) > return 0; > > > > > + node_get_allowed_targets(pgdat->node_id, &allowed_mask); > + > /* Demotion ignores all cpuset and mempolicy settings */ > - migrate_pages(demote_pages, alloc_demote_page, NULL, > - target_nid, MIGRATE_ASYNC, MR_DEMOTION, > - &nr_succeeded); > + migrate_pages(demote_pages, alloc_migration_target, NULL, > + (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION, > + &nr_succeeded); > > > > > if (current_is_kswapd()) > __count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded);
On 6/2/22 1:05 PM, Ying Huang wrote: > On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote: >> From: Jagdish Gediya <jvgediya@linux.ibm.com> >> >> currently, a higher tier node can only be demoted to selected >> nodes on the next lower tier as defined by the demotion path, >> not any other node from any lower tier. This strict, hard-coded >> demotion order does not work in all use cases (e.g. some use cases >> may want to allow cross-socket demotion to another node in the same >> demotion tier as a fallback when the preferred demotion node is out >> of space). This demotion order is also inconsistent with the page >> allocation fallback order when all the nodes in a higher tier are >> out of space: The page allocation can fall back to any node from any >> lower tier, whereas the demotion order doesn't allow that currently. >> >> This patch adds support to get all the allowed demotion targets mask >> for node, also demote_page_list() function is modified to utilize this >> allowed node mask by filling it in migration_target_control structure >> before passing it to migrate_pages(). > ... >> * Take pages on @demote_list and attempt to demote them to >> * another node. Pages which are not demoted are left on >> @@ -1481,6 +1464,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages, >> { >> int target_nid = next_demotion_node(pgdat->node_id); >> unsigned int nr_succeeded; >> + nodemask_t allowed_mask; >> + >> + struct migration_target_control mtc = { >> + /* >> + * Allocate from 'node', or fail quickly and quietly. >> + * When this happens, 'page' will likely just be discarded >> + * instead of migrated. >> + */ >> + .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN | >> + __GFP_NOMEMALLOC | GFP_NOWAIT, >> + .nid = target_nid, >> + .nmask = &allowed_mask >> + }; > > IMHO, we should try to allocate from preferred node firstly (which will > kick kswapd of the preferred node if necessary). If failed, we will > fallback to all allowed node. > > As we discussed as follows, > > https://lore.kernel.org/lkml/69f2d063a15f8c4afb4688af7b7890f32af55391.camel@intel.com/ > > That is, something like below, > > static struct page *alloc_demote_page(struct page *page, unsigned long node) > { > struct page *page; > nodemask_t allowed_mask; > struct migration_target_control mtc = { > /* > * Allocate from 'node', or fail quickly and quietly. > * When this happens, 'page' will likely just be discarded > * instead of migrated. > */ > .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | > __GFP_THISNODE | __GFP_NOWARN | > __GFP_NOMEMALLOC | GFP_NOWAIT, > .nid = node > }; > > page = alloc_migration_target(page, (unsigned long)&mtc); > if (page) > return page; > > mtc.gfp_mask &= ~__GFP_THISNODE; > mtc.nmask = &allowed_mask; > > return alloc_migration_target(page, (unsigned long)&mtc); > } I skipped doing this in v5 because I was not sure this is really what we want. I guess we can do this as part of the change that is going to introduce the usage of memory policy for the allocation? -aneesh
On Fri, 2022-06-03 at 20:39 +0530, Aneesh Kumar K V wrote: > On 6/2/22 1:05 PM, Ying Huang wrote: > > On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote: > > > From: Jagdish Gediya <jvgediya@linux.ibm.com> > > > > > > currently, a higher tier node can only be demoted to selected > > > nodes on the next lower tier as defined by the demotion path, > > > not any other node from any lower tier. This strict, hard-coded > > > demotion order does not work in all use cases (e.g. some use cases > > > may want to allow cross-socket demotion to another node in the same > > > demotion tier as a fallback when the preferred demotion node is out > > > of space). This demotion order is also inconsistent with the page > > > allocation fallback order when all the nodes in a higher tier are > > > out of space: The page allocation can fall back to any node from any > > > lower tier, whereas the demotion order doesn't allow that currently. > > > > > > This patch adds support to get all the allowed demotion targets mask > > > for node, also demote_page_list() function is modified to utilize this > > > allowed node mask by filling it in migration_target_control structure > > > before passing it to migrate_pages(). > > > > ... > > > > * Take pages on @demote_list and attempt to demote them to > > > * another node. Pages which are not demoted are left on > > > @@ -1481,6 +1464,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages, > > > { > > > int target_nid = next_demotion_node(pgdat->node_id); > > > unsigned int nr_succeeded; > > > + nodemask_t allowed_mask; > > > + > > > + struct migration_target_control mtc = { > > > + /* > > > + * Allocate from 'node', or fail quickly and quietly. > > > + * When this happens, 'page' will likely just be discarded > > > + * instead of migrated. > > > + */ > > > + .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN | > > > + __GFP_NOMEMALLOC | GFP_NOWAIT, > > > + .nid = target_nid, > > > + .nmask = &allowed_mask > > > + }; > > > > IMHO, we should try to allocate from preferred node firstly (which will > > kick kswapd of the preferred node if necessary). If failed, we will > > fallback to all allowed node. > > > > As we discussed as follows, > > > > https://lore.kernel.org/lkml/69f2d063a15f8c4afb4688af7b7890f32af55391.camel@intel.com/ > > > > That is, something like below, > > > > static struct page *alloc_demote_page(struct page *page, unsigned long node) > > { > > struct page *page; > > nodemask_t allowed_mask; > > struct migration_target_control mtc = { > > /* > > * Allocate from 'node', or fail quickly and quietly. > > * When this happens, 'page' will likely just be discarded > > * instead of migrated. > > */ > > .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | > > __GFP_THISNODE | __GFP_NOWARN | > > __GFP_NOMEMALLOC | GFP_NOWAIT, > > .nid = node > > }; > > > > page = alloc_migration_target(page, (unsigned long)&mtc); > > if (page) > > return page; > > > > mtc.gfp_mask &= ~__GFP_THISNODE; > > mtc.nmask = &allowed_mask; > > > > return alloc_migration_target(page, (unsigned long)&mtc); > > } > > I skipped doing this in v5 because I was not sure this is really what we > want. I think so. And this is the original behavior. We should keep the original behavior as much as possible, then make changes if necessary. > I guess we can do this as part of the change that is going to > introduce the usage of memory policy for the allocation? Like the memory allocation policy, the default policy should be local preferred. We shouldn't force users to use explicit memory policy for that. And the added code isn't complex. Best Regards, Huang, Ying
On 6/6/22 6:13 AM, Ying Huang wrote: > On Fri, 2022-06-03 at 20:39 +0530, Aneesh Kumar K V wrote: >> On 6/2/22 1:05 PM, Ying Huang wrote: >>> On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote: >>>> From: Jagdish Gediya <jvgediya@linux.ibm.com> >>>> >>>> currently, a higher tier node can only be demoted to selected >>>> nodes on the next lower tier as defined by the demotion path, >>>> not any other node from any lower tier. This strict, hard-coded >>>> demotion order does not work in all use cases (e.g. some use cases >>>> may want to allow cross-socket demotion to another node in the same >>>> demotion tier as a fallback when the preferred demotion node is out >>>> of space). This demotion order is also inconsistent with the page >>>> allocation fallback order when all the nodes in a higher tier are >>>> out of space: The page allocation can fall back to any node from any >>>> lower tier, whereas the demotion order doesn't allow that currently. >>>> >>>> This patch adds support to get all the allowed demotion targets mask >>>> for node, also demote_page_list() function is modified to utilize this >>>> allowed node mask by filling it in migration_target_control structure >>>> before passing it to migrate_pages(). >>> >> >> ... >> >>>> * Take pages on @demote_list and attempt to demote them to >>>> * another node. Pages which are not demoted are left on >>>> @@ -1481,6 +1464,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages, >>>> { >>>> int target_nid = next_demotion_node(pgdat->node_id); >>>> unsigned int nr_succeeded; >>>> + nodemask_t allowed_mask; >>>> + >>>> + struct migration_target_control mtc = { >>>> + /* >>>> + * Allocate from 'node', or fail quickly and quietly. >>>> + * When this happens, 'page' will likely just be discarded >>>> + * instead of migrated. >>>> + */ >>>> + .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN | >>>> + __GFP_NOMEMALLOC | GFP_NOWAIT, >>>> + .nid = target_nid, >>>> + .nmask = &allowed_mask >>>> + }; >>> >>> IMHO, we should try to allocate from preferred node firstly (which will >>> kick kswapd of the preferred node if necessary). If failed, we will >>> fallback to all allowed node. >>> >>> As we discussed as follows, >>> >>> https://lore.kernel.org/lkml/69f2d063a15f8c4afb4688af7b7890f32af55391.camel@intel.com/ >>> >>> That is, something like below, >>> >>> static struct page *alloc_demote_page(struct page *page, unsigned long node) >>> { >>> struct page *page; >>> nodemask_t allowed_mask; >>> struct migration_target_control mtc = { >>> /* >>> * Allocate from 'node', or fail quickly and quietly. >>> * When this happens, 'page' will likely just be discarded >>> * instead of migrated. >>> */ >>> .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | >>> __GFP_THISNODE | __GFP_NOWARN | >>> __GFP_NOMEMALLOC | GFP_NOWAIT, >>> .nid = node >>> }; >>> >>> page = alloc_migration_target(page, (unsigned long)&mtc); >>> if (page) >>> return page; >>> >>> mtc.gfp_mask &= ~__GFP_THISNODE; >>> mtc.nmask = &allowed_mask; >>> >>> return alloc_migration_target(page, (unsigned long)&mtc); >>> } >> >> I skipped doing this in v5 because I was not sure this is really what we >> want. > > I think so. And this is the original behavior. We should keep the > original behavior as much as possible, then make changes if necessary. > That is the reason I split the new page allocation as a separate patch. Previous discussion on this topic didn't conclude on whether we really need to do the above or not https://lore.kernel.org/lkml/CAAPL-u9endrWf_aOnPENDPdvT-2-YhCAeJ7ONGckGnXErTLOfQ@mail.gmail.com/ Based on the above I looked at avoiding GFP_THISNODE allocation. If you have experiment results that suggest otherwise can you share? I could summarize that in the commit message for better description of why GFP_THISNODE enforcing is needed. >> I guess we can do this as part of the change that is going to >> introduce the usage of memory policy for the allocation? > > Like the memory allocation policy, the default policy should be local > preferred. We shouldn't force users to use explicit memory policy for > that. > > And the added code isn't complex. > -aneesh
On Mon, 2022-06-06 at 09:37 +0530, Aneesh Kumar K V wrote: > On 6/6/22 6:13 AM, Ying Huang wrote: > > On Fri, 2022-06-03 at 20:39 +0530, Aneesh Kumar K V wrote: > > > On 6/2/22 1:05 PM, Ying Huang wrote: > > > > On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote: > > > > > From: Jagdish Gediya <jvgediya@linux.ibm.com> > > > > > > > > > > currently, a higher tier node can only be demoted to selected > > > > > nodes on the next lower tier as defined by the demotion path, > > > > > not any other node from any lower tier. This strict, hard-coded > > > > > demotion order does not work in all use cases (e.g. some use cases > > > > > may want to allow cross-socket demotion to another node in the same > > > > > demotion tier as a fallback when the preferred demotion node is out > > > > > of space). This demotion order is also inconsistent with the page > > > > > allocation fallback order when all the nodes in a higher tier are > > > > > out of space: The page allocation can fall back to any node from any > > > > > lower tier, whereas the demotion order doesn't allow that currently. > > > > > > > > > > This patch adds support to get all the allowed demotion targets mask > > > > > for node, also demote_page_list() function is modified to utilize this > > > > > allowed node mask by filling it in migration_target_control structure > > > > > before passing it to migrate_pages(). > > > > > > > > > > ... > > > > > > > > * Take pages on @demote_list and attempt to demote them to > > > > > * another node. Pages which are not demoted are left on > > > > > @@ -1481,6 +1464,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages, > > > > > { > > > > > int target_nid = next_demotion_node(pgdat->node_id); > > > > > unsigned int nr_succeeded; > > > > > + nodemask_t allowed_mask; > > > > > + > > > > > + struct migration_target_control mtc = { > > > > > + /* > > > > > + * Allocate from 'node', or fail quickly and quietly. > > > > > + * When this happens, 'page' will likely just be discarded > > > > > + * instead of migrated. > > > > > + */ > > > > > + .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN | > > > > > + __GFP_NOMEMALLOC | GFP_NOWAIT, > > > > > + .nid = target_nid, > > > > > + .nmask = &allowed_mask > > > > > + }; > > > > > > > > IMHO, we should try to allocate from preferred node firstly (which will > > > > kick kswapd of the preferred node if necessary). If failed, we will > > > > fallback to all allowed node. > > > > > > > > As we discussed as follows, > > > > > > > > https://lore.kernel.org/lkml/69f2d063a15f8c4afb4688af7b7890f32af55391.camel@intel.com/ > > > > > > > > That is, something like below, > > > > > > > > static struct page *alloc_demote_page(struct page *page, unsigned long node) > > > > { > > > > struct page *page; > > > > nodemask_t allowed_mask; > > > > struct migration_target_control mtc = { > > > > /* > > > > * Allocate from 'node', or fail quickly and quietly. > > > > * When this happens, 'page' will likely just be discarded > > > > * instead of migrated. > > > > */ > > > > .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | > > > > __GFP_THISNODE | __GFP_NOWARN | > > > > __GFP_NOMEMALLOC | GFP_NOWAIT, > > > > .nid = node > > > > }; > > > > > > > > page = alloc_migration_target(page, (unsigned long)&mtc); > > > > if (page) > > > > return page; > > > > > > > > mtc.gfp_mask &= ~__GFP_THISNODE; > > > > mtc.nmask = &allowed_mask; > > > > > > > > return alloc_migration_target(page, (unsigned long)&mtc); > > > > } > > > > > > I skipped doing this in v5 because I was not sure this is really what we > > > want. > > > > I think so. And this is the original behavior. We should keep the > > original behavior as much as possible, then make changes if necessary. > > > > That is the reason I split the new page allocation as a separate patch. > Previous discussion on this topic didn't conclude on whether we really > need to do the above or not > https://lore.kernel.org/lkml/CAAPL-u9endrWf_aOnPENDPdvT-2-YhCAeJ7ONGckGnXErTLOfQ@mail.gmail.com/ Please check the later email in the thread you referenced. Both Wei and me agree that the use case needs to be supported. We just didn't reach concensus about how to implement it. If you think Wei's solution is better (referenced as below), you can try to do that too. Although I think my proposed implementation is much simpler. " This is true with the current allocation code. But I think we can make some changes for demotion allocations. For example, we can add a GFP_DEMOTE flag and update the allocation function to wake up kswapd when this flag is set and we need to fall back to another node. " > Based on the above I looked at avoiding GFP_THISNODE allocation. If you > have experiment results that suggest otherwise can you share? I could > summarize that in the commit message for better description of why > GFP_THISNODE enforcing is needed. Why? GFP_THISNODE is just the first step. We will fallback to allocation without it if necessary. Best Regards, Huang, Ying > > > I guess we can do this as part of the change that is going to > > > introduce the usage of memory policy for the allocation? > > > > Like the memory allocation policy, the default policy should be local > > preferred. We shouldn't force users to use explicit memory policy for > > that. > > > > And the added code isn't complex. > > > > -aneesh
Ying Huang <ying.huang@intel.com> writes: ..... > > > > >> > > > https://lore.kernel.org/lkml/69f2d063a15f8c4afb4688af7b7890f32af55391.camel@intel.com/ >> > > > >> > > > That is, something like below, >> > > > >> > > > static struct page *alloc_demote_page(struct page *page, unsigned long node) >> > > > { >> > > > struct page *page; >> > > > nodemask_t allowed_mask; >> > > > struct migration_target_control mtc = { >> > > > /* >> > > > * Allocate from 'node', or fail quickly and quietly. >> > > > * When this happens, 'page' will likely just be discarded >> > > > * instead of migrated. >> > > > */ >> > > > .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | >> > > > __GFP_THISNODE | __GFP_NOWARN | >> > > > __GFP_NOMEMALLOC | GFP_NOWAIT, >> > > > .nid = node >> > > > }; >> > > > >> > > > page = alloc_migration_target(page, (unsigned long)&mtc); >> > > > if (page) >> > > > return page; >> > > > >> > > > mtc.gfp_mask &= ~__GFP_THISNODE; >> > > > mtc.nmask = &allowed_mask; >> > > > >> > > > return alloc_migration_target(page, (unsigned long)&mtc); >> > > > } >> > > >> > > I skipped doing this in v5 because I was not sure this is really what we >> > > want. >> > >> > I think so. And this is the original behavior. We should keep the >> > original behavior as much as possible, then make changes if necessary. >> > >> >> That is the reason I split the new page allocation as a separate patch. >> Previous discussion on this topic didn't conclude on whether we really >> need to do the above or not >> https://lore.kernel.org/lkml/CAAPL-u9endrWf_aOnPENDPdvT-2-YhCAeJ7ONGckGnXErTLOfQ@mail.gmail.com/ > > Please check the later email in the thread you referenced. Both Wei and > me agree that the use case needs to be supported. We just didn't reach > concensus about how to implement it. If you think Wei's solution is > better (referenced as below), you can try to do that too. Although I > think my proposed implementation is much simpler. How about the below details diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index 79bd8d26feb2..cd6e71f702ad 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -21,6 +21,7 @@ void node_remove_from_memory_tier(int node); int node_get_memory_tier_id(int node); int node_set_memory_tier(int node, int tier); int node_reset_memory_tier(int node, int tier); +void node_get_allowed_targets(int node, nodemask_t *targets); #else #define numa_demotion_enabled false static inline int next_demotion_node(int node) @@ -28,6 +29,10 @@ static inline int next_demotion_node(int node) return NUMA_NO_NODE; } +static inline void node_get_allowed_targets(int node, nodemask_t *targets) +{ + *targets = NODE_MASK_NONE; +} #endif /* CONFIG_TIERED_MEMORY */ #endif diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index b4e72b672d4d..592d939ec28d 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -18,6 +18,7 @@ struct memory_tier { struct demotion_nodes { nodemask_t preferred; + nodemask_t allowed; }; #define to_memory_tier(device) container_of(device, struct memory_tier, dev) @@ -378,6 +379,25 @@ int node_set_memory_tier(int node, int tier) } EXPORT_SYMBOL_GPL(node_set_memory_tier); +void node_get_allowed_targets(int node, nodemask_t *targets) +{ + /* + * node_demotion[] is updated without excluding this + * function from running. + * + * If any node is moving to lower tiers then modifications + * in node_demotion[] are still valid for this node, if any + * node is moving to higher tier then moving node may be + * used once for demotion which should be ok so rcu should + * be enough here. + */ + rcu_read_lock(); + + *targets = node_demotion[node].allowed; + + rcu_read_unlock(); +} + /** * next_demotion_node() - Get the next node in the demotion path * @node: The starting node to lookup the next node @@ -437,8 +457,10 @@ static void __disable_all_migrate_targets(void) { int node; - for_each_node_mask(node, node_states[N_MEMORY]) + for_each_node_mask(node, node_states[N_MEMORY]) { node_demotion[node].preferred = NODE_MASK_NONE; + node_demotion[node].allowed = NODE_MASK_NONE; + } } static void disable_all_migrate_targets(void) @@ -465,7 +487,7 @@ static void establish_migration_targets(void) struct demotion_nodes *nd; int target = NUMA_NO_NODE, node; int distance, best_distance; - nodemask_t used; + nodemask_t used, allowed = NODE_MASK_NONE; if (!node_demotion) return; @@ -511,6 +533,29 @@ static void establish_migration_targets(void) } } while (1); } + /* + * Now build the allowed mask for each node collecting node mask from + * all memory tier below it. This allows us to fallback demotion page + * allocation to a set of nodes that is closer the above selected + * perferred node. + */ + list_for_each_entry(memtier, &memory_tiers, list) + nodes_or(allowed, allowed, memtier->nodelist); + /* + * Removes nodes not yet in N_MEMORY. + */ + nodes_and(allowed, node_states[N_MEMORY], allowed); + + list_for_each_entry(memtier, &memory_tiers, list) { + /* + * Keep removing current tier from allowed nodes, + * This will remove all nodes in current and above + * memory tier from the allowed mask. + */ + nodes_andnot(allowed, allowed, memtier->nodelist); + for_each_node_mask(node, memtier->nodelist) + node_demotion[node].allowed = allowed; + } } /* diff --git a/mm/vmscan.c b/mm/vmscan.c index 3a8f78277f99..b0792d838efb 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1460,19 +1460,32 @@ static void folio_check_dirty_writeback(struct folio *folio, mapping->a_ops->is_dirty_writeback(folio, dirty, writeback); } -static struct page *alloc_demote_page(struct page *page, unsigned long node) +static struct page *alloc_demote_page(struct page *page, unsigned long private) { - struct migration_target_control mtc = { - /* - * Allocate from 'node', or fail quickly and quietly. - * When this happens, 'page' will likely just be discarded - * instead of migrated. - */ - .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | - __GFP_THISNODE | __GFP_NOWARN | - __GFP_NOMEMALLOC | GFP_NOWAIT, - .nid = node - }; + struct page *target_page; + nodemask_t *allowed_mask; + struct migration_target_control *mtc; + + mtc = (struct migration_target_control *)private; + + allowed_mask = mtc->nmask; + /* + * make sure we allocate from the target node first also trying to + * reclaim pages from the target node via kswapd if we are low on + * free memory on target node. If we don't do this and if we have low + * free memory on the target memtier, we would start allocating pages + * from higher memory tiers without even forcing a demotion of cold + * pages from the target memtier. This can result in the kernel placing + * hotpages in higher memory tiers. + */ + mtc->nmask = NULL; + mtc->gfp_mask |= __GFP_THISNODE; + target_page = alloc_migration_target(page, (unsigned long)&mtc); + if (target_page) + return target_page; + + mtc->gfp_mask &= ~__GFP_THISNODE; + mtc->nmask = allowed_mask; return alloc_migration_target(page, (unsigned long)&mtc); } @@ -1487,6 +1500,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages, { int target_nid = next_demotion_node(pgdat->node_id); unsigned int nr_succeeded; + nodemask_t allowed_mask; + + struct migration_target_control mtc = { + /* + * Allocate from 'node', or fail quickly and quietly. + * When this happens, 'page' will likely just be discarded + * instead of migrated. + */ + .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN | + __GFP_NOMEMALLOC | GFP_NOWAIT, + .nid = target_nid, + .nmask = &allowed_mask + }; if (list_empty(demote_pages)) return 0; @@ -1494,10 +1520,12 @@ static unsigned int demote_page_list(struct list_head *demote_pages, if (target_nid == NUMA_NO_NODE) return 0; + node_get_allowed_targets(pgdat->node_id, &allowed_mask); + /* Demotion ignores all cpuset and mempolicy settings */ migrate_pages(demote_pages, alloc_demote_page, NULL, - target_nid, MIGRATE_ASYNC, MR_DEMOTION, - &nr_succeeded); + (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION, + &nr_succeeded); if (current_is_kswapd()) __count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded);
On Mon, 2022-06-06 at 11:51 +0530, Aneesh Kumar K.V wrote: > Ying Huang <ying.huang@intel.com> writes: > > ..... > > > > > > > > > > > > https://lore.kernel.org/lkml/69f2d063a15f8c4afb4688af7b7890f32af55391.camel@intel.com/ > > > > > > > > > > > > That is, something like below, > > > > > > > > > > > > static struct page *alloc_demote_page(struct page *page, unsigned long node) > > > > > > { > > > > > > struct page *page; > > > > > > nodemask_t allowed_mask; > > > > > > struct migration_target_control mtc = { > > > > > > /* > > > > > > * Allocate from 'node', or fail quickly and quietly. > > > > > > * When this happens, 'page' will likely just be discarded > > > > > > * instead of migrated. > > > > > > */ > > > > > > .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | > > > > > > __GFP_THISNODE | __GFP_NOWARN | > > > > > > __GFP_NOMEMALLOC | GFP_NOWAIT, > > > > > > .nid = node > > > > > > }; > > > > > > > > > > > > page = alloc_migration_target(page, (unsigned long)&mtc); > > > > > > if (page) > > > > > > return page; > > > > > > > > > > > > mtc.gfp_mask &= ~__GFP_THISNODE; > > > > > > mtc.nmask = &allowed_mask; > > > > > > > > > > > > return alloc_migration_target(page, (unsigned long)&mtc); > > > > > > } > > > > > > > > > > I skipped doing this in v5 because I was not sure this is really what we > > > > > want. > > > > > > > > I think so. And this is the original behavior. We should keep the > > > > original behavior as much as possible, then make changes if necessary. > > > > > > > > > > That is the reason I split the new page allocation as a separate patch. > > > Previous discussion on this topic didn't conclude on whether we really > > > need to do the above or not > > > https://lore.kernel.org/lkml/CAAPL-u9endrWf_aOnPENDPdvT-2-YhCAeJ7ONGckGnXErTLOfQ@mail.gmail.com/ > > > > Please check the later email in the thread you referenced. Both Wei and > > me agree that the use case needs to be supported. We just didn't reach > > concensus about how to implement it. If you think Wei's solution is > > better (referenced as below), you can try to do that too. Although I > > think my proposed implementation is much simpler. > > How about the below details > > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h > index 79bd8d26feb2..cd6e71f702ad 100644 > --- a/include/linux/memory-tiers.h > +++ b/include/linux/memory-tiers.h > @@ -21,6 +21,7 @@ void node_remove_from_memory_tier(int node); > int node_get_memory_tier_id(int node); > int node_set_memory_tier(int node, int tier); > int node_reset_memory_tier(int node, int tier); > +void node_get_allowed_targets(int node, nodemask_t *targets); > #else > #define numa_demotion_enabled false > static inline int next_demotion_node(int node) > @@ -28,6 +29,10 @@ static inline int next_demotion_node(int node) > return NUMA_NO_NODE; > } > > > > > +static inline void node_get_allowed_targets(int node, nodemask_t *targets) > +{ > + *targets = NODE_MASK_NONE; > +} > #endif /* CONFIG_TIERED_MEMORY */ > > > > > #endif > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c > index b4e72b672d4d..592d939ec28d 100644 > --- a/mm/memory-tiers.c > +++ b/mm/memory-tiers.c > @@ -18,6 +18,7 @@ struct memory_tier { > > > > > struct demotion_nodes { > nodemask_t preferred; > + nodemask_t allowed; > }; > > > > > #define to_memory_tier(device) container_of(device, struct memory_tier, dev) > @@ -378,6 +379,25 @@ int node_set_memory_tier(int node, int tier) > } > EXPORT_SYMBOL_GPL(node_set_memory_tier); > > > > > +void node_get_allowed_targets(int node, nodemask_t *targets) > +{ > + /* > + * node_demotion[] is updated without excluding this > + * function from running. > + * > + * If any node is moving to lower tiers then modifications > + * in node_demotion[] are still valid for this node, if any > + * node is moving to higher tier then moving node may be > + * used once for demotion which should be ok so rcu should > + * be enough here. > + */ > + rcu_read_lock(); > + > + *targets = node_demotion[node].allowed; > + > + rcu_read_unlock(); > +} > + > /** > * next_demotion_node() - Get the next node in the demotion path > * @node: The starting node to lookup the next node > @@ -437,8 +457,10 @@ static void __disable_all_migrate_targets(void) > { > int node; > > > > > - for_each_node_mask(node, node_states[N_MEMORY]) > + for_each_node_mask(node, node_states[N_MEMORY]) { > node_demotion[node].preferred = NODE_MASK_NONE; > + node_demotion[node].allowed = NODE_MASK_NONE; > + } > } > > > > > static void disable_all_migrate_targets(void) > @@ -465,7 +487,7 @@ static void establish_migration_targets(void) > struct demotion_nodes *nd; > int target = NUMA_NO_NODE, node; > int distance, best_distance; > - nodemask_t used; > + nodemask_t used, allowed = NODE_MASK_NONE; > > > > > if (!node_demotion) > return; > @@ -511,6 +533,29 @@ static void establish_migration_targets(void) > } > } while (1); > } > + /* > + * Now build the allowed mask for each node collecting node mask from > + * all memory tier below it. This allows us to fallback demotion page > + * allocation to a set of nodes that is closer the above selected > + * perferred node. > + */ > + list_for_each_entry(memtier, &memory_tiers, list) > + nodes_or(allowed, allowed, memtier->nodelist); > + /* > + * Removes nodes not yet in N_MEMORY. > + */ > + nodes_and(allowed, node_states[N_MEMORY], allowed); > + > + list_for_each_entry(memtier, &memory_tiers, list) { > + /* > + * Keep removing current tier from allowed nodes, > + * This will remove all nodes in current and above > + * memory tier from the allowed mask. > + */ > + nodes_andnot(allowed, allowed, memtier->nodelist); > + for_each_node_mask(node, memtier->nodelist) > + node_demotion[node].allowed = allowed; > + } > } > > > > > /* > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 3a8f78277f99..b0792d838efb 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1460,19 +1460,32 @@ static void folio_check_dirty_writeback(struct folio *folio, > mapping->a_ops->is_dirty_writeback(folio, dirty, writeback); > } > > > > > -static struct page *alloc_demote_page(struct page *page, unsigned long node) > +static struct page *alloc_demote_page(struct page *page, unsigned long private) > { > - struct migration_target_control mtc = { > - /* > - * Allocate from 'node', or fail quickly and quietly. > - * When this happens, 'page' will likely just be discarded > - * instead of migrated. > - */ > - .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | > - __GFP_THISNODE | __GFP_NOWARN | > - __GFP_NOMEMALLOC | GFP_NOWAIT, > - .nid = node > - }; > + struct page *target_page; > + nodemask_t *allowed_mask; > + struct migration_target_control *mtc; > + > + mtc = (struct migration_target_control *)private; > + > + allowed_mask = mtc->nmask; > + /* > + * make sure we allocate from the target node first also trying to > + * reclaim pages from the target node via kswapd if we are low on > + * free memory on target node. If we don't do this and if we have low > + * free memory on the target memtier, we would start allocating pages > + * from higher memory tiers without even forcing a demotion of cold > + * pages from the target memtier. This can result in the kernel placing > + * hotpages in higher memory tiers. > + */ > + mtc->nmask = NULL; > + mtc->gfp_mask |= __GFP_THISNODE; > + target_page = alloc_migration_target(page, (unsigned long)&mtc); > + if (target_page) > + return target_page; > + > + mtc->gfp_mask &= ~__GFP_THISNODE; > + mtc->nmask = allowed_mask; > > > > > return alloc_migration_target(page, (unsigned long)&mtc); > } > @@ -1487,6 +1500,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages, > { > int target_nid = next_demotion_node(pgdat->node_id); > unsigned int nr_succeeded; > + nodemask_t allowed_mask; > + > + struct migration_target_control mtc = { > + /* > + * Allocate from 'node', or fail quickly and quietly. > + * When this happens, 'page' will likely just be discarded > + * instead of migrated. > + */ > + .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN | > + __GFP_NOMEMALLOC | GFP_NOWAIT, > + .nid = target_nid, > + .nmask = &allowed_mask > + }; > > > > > if (list_empty(demote_pages)) > return 0; > @@ -1494,10 +1520,12 @@ static unsigned int demote_page_list(struct list_head *demote_pages, > if (target_nid == NUMA_NO_NODE) > return 0; > > > > > + node_get_allowed_targets(pgdat->node_id, &allowed_mask); > + > /* Demotion ignores all cpuset and mempolicy settings */ > migrate_pages(demote_pages, alloc_demote_page, NULL, > - target_nid, MIGRATE_ASYNC, MR_DEMOTION, > - &nr_succeeded); > + (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION, > + &nr_succeeded); Firstly, it addressed my requirement, Thanks! And, I'd prefer to put mtc definition in alloc_demote_page(). Because that makes all page allocation logic in one function. Thus the readability of code is better. Best Regards, Huang, Ying > if (current_is_kswapd()) > __count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded);
On 6/6/22 1:12 PM, Ying Huang wrote: > On Mon, 2022-06-06 at 11:51 +0530, Aneesh Kumar K.V wrote: >> Ying Huang <ying.huang@intel.com> writes: >> >> ..... >> >>>>>> >>>>>>> https://lore.kernel.org/lkml/69f2d063a15f8c4afb4688af7b7890f32af55391.camel@intel.com/ >>>>>>> >>>>>>> That is, something like below, >>>>>>> >>>>>>> static struct page *alloc_demote_page(struct page *page, unsigned long node) >>>>>>> { >>>>>>> struct page *page; >>>>>>> nodemask_t allowed_mask; >>>>>>> struct migration_target_control mtc = { >>>>>>> /* >>>>>>> * Allocate from 'node', or fail quickly and quietly. >>>>>>> * When this happens, 'page' will likely just be discarded >>>>>>> * instead of migrated. >>>>>>> */ >>>>>>> .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | >>>>>>> __GFP_THISNODE | __GFP_NOWARN | >>>>>>> __GFP_NOMEMALLOC | GFP_NOWAIT, >>>>>>> .nid = node >>>>>>> }; >>>>>>> >>>>>>> page = alloc_migration_target(page, (unsigned long)&mtc); >>>>>>> if (page) >>>>>>> return page; >>>>>>> >>>>>>> mtc.gfp_mask &= ~__GFP_THISNODE; >>>>>>> mtc.nmask = &allowed_mask; >>>>>>> >>>>>>> return alloc_migration_target(page, (unsigned long)&mtc); >>>>>>> } >>>>>> >>>>>> I skipped doing this in v5 because I was not sure this is really what we >>>>>> want. >>>>> >>>>> I think so. And this is the original behavior. We should keep the >>>>> original behavior as much as possible, then make changes if necessary. >>>>> >>>> >>>> That is the reason I split the new page allocation as a separate patch. >>>> Previous discussion on this topic didn't conclude on whether we really >>>> need to do the above or not >>>> https://lore.kernel.org/lkml/CAAPL-u9endrWf_aOnPENDPdvT-2-YhCAeJ7ONGckGnXErTLOfQ@mail.gmail.com/ >>> >>> Please check the later email in the thread you referenced. Both Wei and >>> me agree that the use case needs to be supported. We just didn't reach >>> concensus about how to implement it. If you think Wei's solution is >>> better (referenced as below), you can try to do that too. Although I >>> think my proposed implementation is much simpler. >> >> How about the below details >> >> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h >> index 79bd8d26feb2..cd6e71f702ad 100644 >> --- a/include/linux/memory-tiers.h >> +++ b/include/linux/memory-tiers.h >> @@ -21,6 +21,7 @@ void node_remove_from_memory_tier(int node); >> int node_get_memory_tier_id(int node); >> int node_set_memory_tier(int node, int tier); >> int node_reset_memory_tier(int node, int tier); >> +void node_get_allowed_targets(int node, nodemask_t *targets); >> #else >> #define numa_demotion_enabled false >> static inline int next_demotion_node(int node) >> @@ -28,6 +29,10 @@ static inline int next_demotion_node(int node) >> return NUMA_NO_NODE; >> } >> >> >> >> >> +static inline void node_get_allowed_targets(int node, nodemask_t *targets) >> +{ >> + *targets = NODE_MASK_NONE; >> +} >> #endif /* CONFIG_TIERED_MEMORY */ >> >> >> >> >> #endif >> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c >> index b4e72b672d4d..592d939ec28d 100644 >> --- a/mm/memory-tiers.c >> +++ b/mm/memory-tiers.c >> @@ -18,6 +18,7 @@ struct memory_tier { >> >> >> >> >> struct demotion_nodes { >> nodemask_t preferred; >> + nodemask_t allowed; >> }; >> >> >> >> >> #define to_memory_tier(device) container_of(device, struct memory_tier, dev) >> @@ -378,6 +379,25 @@ int node_set_memory_tier(int node, int tier) >> } >> EXPORT_SYMBOL_GPL(node_set_memory_tier); >> >> >> >> >> +void node_get_allowed_targets(int node, nodemask_t *targets) >> +{ >> + /* >> + * node_demotion[] is updated without excluding this >> + * function from running. >> + * >> + * If any node is moving to lower tiers then modifications >> + * in node_demotion[] are still valid for this node, if any >> + * node is moving to higher tier then moving node may be >> + * used once for demotion which should be ok so rcu should >> + * be enough here. >> + */ >> + rcu_read_lock(); >> + >> + *targets = node_demotion[node].allowed; >> + >> + rcu_read_unlock(); >> +} >> + >> /** >> * next_demotion_node() - Get the next node in the demotion path >> * @node: The starting node to lookup the next node >> @@ -437,8 +457,10 @@ static void __disable_all_migrate_targets(void) >> { >> int node; >> >> >> >> >> - for_each_node_mask(node, node_states[N_MEMORY]) >> + for_each_node_mask(node, node_states[N_MEMORY]) { >> node_demotion[node].preferred = NODE_MASK_NONE; >> + node_demotion[node].allowed = NODE_MASK_NONE; >> + } >> } >> >> >> >> >> static void disable_all_migrate_targets(void) >> @@ -465,7 +487,7 @@ static void establish_migration_targets(void) >> struct demotion_nodes *nd; >> int target = NUMA_NO_NODE, node; >> int distance, best_distance; >> - nodemask_t used; >> + nodemask_t used, allowed = NODE_MASK_NONE; >> >> >> >> >> if (!node_demotion) >> return; >> @@ -511,6 +533,29 @@ static void establish_migration_targets(void) >> } >> } while (1); >> } >> + /* >> + * Now build the allowed mask for each node collecting node mask from >> + * all memory tier below it. This allows us to fallback demotion page >> + * allocation to a set of nodes that is closer the above selected >> + * perferred node. >> + */ >> + list_for_each_entry(memtier, &memory_tiers, list) >> + nodes_or(allowed, allowed, memtier->nodelist); >> + /* >> + * Removes nodes not yet in N_MEMORY. >> + */ >> + nodes_and(allowed, node_states[N_MEMORY], allowed); >> + >> + list_for_each_entry(memtier, &memory_tiers, list) { >> + /* >> + * Keep removing current tier from allowed nodes, >> + * This will remove all nodes in current and above >> + * memory tier from the allowed mask. >> + */ >> + nodes_andnot(allowed, allowed, memtier->nodelist); >> + for_each_node_mask(node, memtier->nodelist) >> + node_demotion[node].allowed = allowed; >> + } >> } >> >> >> >> >> /* >> diff --git a/mm/vmscan.c b/mm/vmscan.c >> index 3a8f78277f99..b0792d838efb 100644 >> --- a/mm/vmscan.c >> +++ b/mm/vmscan.c >> @@ -1460,19 +1460,32 @@ static void folio_check_dirty_writeback(struct folio *folio, >> mapping->a_ops->is_dirty_writeback(folio, dirty, writeback); >> } >> >> >> >> >> -static struct page *alloc_demote_page(struct page *page, unsigned long node) >> +static struct page *alloc_demote_page(struct page *page, unsigned long private) >> { >> - struct migration_target_control mtc = { >> - /* >> - * Allocate from 'node', or fail quickly and quietly. >> - * When this happens, 'page' will likely just be discarded >> - * instead of migrated. >> - */ >> - .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | >> - __GFP_THISNODE | __GFP_NOWARN | >> - __GFP_NOMEMALLOC | GFP_NOWAIT, >> - .nid = node >> - }; >> + struct page *target_page; >> + nodemask_t *allowed_mask; >> + struct migration_target_control *mtc; >> + >> + mtc = (struct migration_target_control *)private; >> + >> + allowed_mask = mtc->nmask; >> + /* >> + * make sure we allocate from the target node first also trying to >> + * reclaim pages from the target node via kswapd if we are low on >> + * free memory on target node. If we don't do this and if we have low >> + * free memory on the target memtier, we would start allocating pages >> + * from higher memory tiers without even forcing a demotion of cold >> + * pages from the target memtier. This can result in the kernel placing >> + * hotpages in higher memory tiers. >> + */ >> + mtc->nmask = NULL; >> + mtc->gfp_mask |= __GFP_THISNODE; >> + target_page = alloc_migration_target(page, (unsigned long)&mtc); >> + if (target_page) >> + return target_page; >> + >> + mtc->gfp_mask &= ~__GFP_THISNODE; >> + mtc->nmask = allowed_mask; >> >> >> >> >> return alloc_migration_target(page, (unsigned long)&mtc); >> } >> @@ -1487,6 +1500,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages, >> { >> int target_nid = next_demotion_node(pgdat->node_id); >> unsigned int nr_succeeded; >> + nodemask_t allowed_mask; >> + >> + struct migration_target_control mtc = { >> + /* >> + * Allocate from 'node', or fail quickly and quietly. >> + * When this happens, 'page' will likely just be discarded >> + * instead of migrated. >> + */ >> + .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN | >> + __GFP_NOMEMALLOC | GFP_NOWAIT, >> + .nid = target_nid, >> + .nmask = &allowed_mask >> + }; >> >> >> >> >> if (list_empty(demote_pages)) >> return 0; >> @@ -1494,10 +1520,12 @@ static unsigned int demote_page_list(struct list_head *demote_pages, >> if (target_nid == NUMA_NO_NODE) >> return 0; >> >> >> >> >> + node_get_allowed_targets(pgdat->node_id, &allowed_mask); >> + >> /* Demotion ignores all cpuset and mempolicy settings */ >> migrate_pages(demote_pages, alloc_demote_page, NULL, >> - target_nid, MIGRATE_ASYNC, MR_DEMOTION, >> - &nr_succeeded); >> + (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION, >> + &nr_succeeded); > > Firstly, it addressed my requirement, Thanks! And, I'd prefer to put > mtc definition in alloc_demote_page(). Because that makes all page > allocation logic in one function. Thus the readability of code is > better. The challenge is in allowed_mask computation. That is based on the src_node and not target_node. -aneesh
On Mon, 2022-06-06 at 13:32 +0530, Aneesh Kumar K V wrote: > On 6/6/22 1:12 PM, Ying Huang wrote: > > On Mon, 2022-06-06 at 11:51 +0530, Aneesh Kumar K.V wrote: > > > Ying Huang <ying.huang@intel.com> writes: > > > > > > ..... > > > > > > > > > > > > > > > > > > https://lore.kernel.org/lkml/69f2d063a15f8c4afb4688af7b7890f32af55391.camel@intel.com/ > > > > > > > > > > > > > > > > That is, something like below, > > > > > > > > > > > > > > > > static struct page *alloc_demote_page(struct page *page, unsigned long node) > > > > > > > > { > > > > > > > > struct page *page; > > > > > > > > nodemask_t allowed_mask; > > > > > > > > struct migration_target_control mtc = { > > > > > > > > /* > > > > > > > > * Allocate from 'node', or fail quickly and quietly. > > > > > > > > * When this happens, 'page' will likely just be discarded > > > > > > > > * instead of migrated. > > > > > > > > */ > > > > > > > > .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | > > > > > > > > __GFP_THISNODE | __GFP_NOWARN | > > > > > > > > __GFP_NOMEMALLOC | GFP_NOWAIT, > > > > > > > > .nid = node > > > > > > > > }; > > > > > > > > > > > > > > > > page = alloc_migration_target(page, (unsigned long)&mtc); > > > > > > > > if (page) > > > > > > > > return page; > > > > > > > > > > > > > > > > mtc.gfp_mask &= ~__GFP_THISNODE; > > > > > > > > mtc.nmask = &allowed_mask; > > > > > > > > > > > > > > > > return alloc_migration_target(page, (unsigned long)&mtc); > > > > > > > > } > > > > > > > > > > > > > > I skipped doing this in v5 because I was not sure this is really what we > > > > > > > want. > > > > > > > > > > > > I think so. And this is the original behavior. We should keep the > > > > > > original behavior as much as possible, then make changes if necessary. > > > > > > > > > > > > > > > > That is the reason I split the new page allocation as a separate patch. > > > > > Previous discussion on this topic didn't conclude on whether we really > > > > > need to do the above or not > > > > > https://lore.kernel.org/lkml/CAAPL-u9endrWf_aOnPENDPdvT-2-YhCAeJ7ONGckGnXErTLOfQ@mail.gmail.com/ > > > > > > > > Please check the later email in the thread you referenced. Both Wei and > > > > me agree that the use case needs to be supported. We just didn't reach > > > > concensus about how to implement it. If you think Wei's solution is > > > > better (referenced as below), you can try to do that too. Although I > > > > think my proposed implementation is much simpler. > > > > > > How about the below details > > > > > > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h > > > index 79bd8d26feb2..cd6e71f702ad 100644 > > > --- a/include/linux/memory-tiers.h > > > +++ b/include/linux/memory-tiers.h > > > @@ -21,6 +21,7 @@ void node_remove_from_memory_tier(int node); > > > int node_get_memory_tier_id(int node); > > > int node_set_memory_tier(int node, int tier); > > > int node_reset_memory_tier(int node, int tier); > > > +void node_get_allowed_targets(int node, nodemask_t *targets); > > > #else > > > #define numa_demotion_enabled false > > > static inline int next_demotion_node(int node) > > > @@ -28,6 +29,10 @@ static inline int next_demotion_node(int node) > > > return NUMA_NO_NODE; > > > } > > > > > > > > > > > > > > > > > > > > > > > > +static inline void node_get_allowed_targets(int node, nodemask_t *targets) > > > +{ > > > + *targets = NODE_MASK_NONE; > > > +} > > > #endif /* CONFIG_TIERED_MEMORY */ > > > > > > > > > > > > > > > > > > > > > > > > #endif > > > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c > > > index b4e72b672d4d..592d939ec28d 100644 > > > --- a/mm/memory-tiers.c > > > +++ b/mm/memory-tiers.c > > > @@ -18,6 +18,7 @@ struct memory_tier { > > > > > > > > > > > > > > > > > > > > > > > > struct demotion_nodes { > > > nodemask_t preferred; > > > + nodemask_t allowed; > > > }; > > > > > > > > > > > > > > > > > > > > > > > > #define to_memory_tier(device) container_of(device, struct memory_tier, dev) > > > @@ -378,6 +379,25 @@ int node_set_memory_tier(int node, int tier) > > > } > > > EXPORT_SYMBOL_GPL(node_set_memory_tier); > > > > > > > > > > > > > > > > > > > > > > > > +void node_get_allowed_targets(int node, nodemask_t *targets) > > > +{ > > > + /* > > > + * node_demotion[] is updated without excluding this > > > + * function from running. > > > + * > > > + * If any node is moving to lower tiers then modifications > > > + * in node_demotion[] are still valid for this node, if any > > > + * node is moving to higher tier then moving node may be > > > + * used once for demotion which should be ok so rcu should > > > + * be enough here. > > > + */ > > > + rcu_read_lock(); > > > + > > > + *targets = node_demotion[node].allowed; > > > + > > > + rcu_read_unlock(); > > > +} > > > + > > > /** > > > * next_demotion_node() - Get the next node in the demotion path > > > * @node: The starting node to lookup the next node > > > @@ -437,8 +457,10 @@ static void __disable_all_migrate_targets(void) > > > { > > > int node; > > > > > > > > > > > > > > > > > > > > > > > > - for_each_node_mask(node, node_states[N_MEMORY]) > > > + for_each_node_mask(node, node_states[N_MEMORY]) { > > > node_demotion[node].preferred = NODE_MASK_NONE; > > > + node_demotion[node].allowed = NODE_MASK_NONE; > > > + } > > > } > > > > > > > > > > > > > > > > > > > > > > > > static void disable_all_migrate_targets(void) > > > @@ -465,7 +487,7 @@ static void establish_migration_targets(void) > > > struct demotion_nodes *nd; > > > int target = NUMA_NO_NODE, node; > > > int distance, best_distance; > > > - nodemask_t used; > > > + nodemask_t used, allowed = NODE_MASK_NONE; > > > > > > > > > > > > > > > > > > > > > > > > if (!node_demotion) > > > return; > > > @@ -511,6 +533,29 @@ static void establish_migration_targets(void) > > > } > > > } while (1); > > > } > > > + /* > > > + * Now build the allowed mask for each node collecting node mask from > > > + * all memory tier below it. This allows us to fallback demotion page > > > + * allocation to a set of nodes that is closer the above selected > > > + * perferred node. > > > + */ > > > + list_for_each_entry(memtier, &memory_tiers, list) > > > + nodes_or(allowed, allowed, memtier->nodelist); > > > + /* > > > + * Removes nodes not yet in N_MEMORY. > > > + */ > > > + nodes_and(allowed, node_states[N_MEMORY], allowed); > > > + > > > + list_for_each_entry(memtier, &memory_tiers, list) { > > > + /* > > > + * Keep removing current tier from allowed nodes, > > > + * This will remove all nodes in current and above > > > + * memory tier from the allowed mask. > > > + */ > > > + nodes_andnot(allowed, allowed, memtier->nodelist); > > > + for_each_node_mask(node, memtier->nodelist) > > > + node_demotion[node].allowed = allowed; > > > + } > > > } > > > > > > > > > > > > > > > > > > > > > > > > /* > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > > index 3a8f78277f99..b0792d838efb 100644 > > > --- a/mm/vmscan.c > > > +++ b/mm/vmscan.c > > > @@ -1460,19 +1460,32 @@ static void folio_check_dirty_writeback(struct folio *folio, > > > mapping->a_ops->is_dirty_writeback(folio, dirty, writeback); > > > } > > > > > > > > > > > > > > > > > > > > > > > > -static struct page *alloc_demote_page(struct page *page, unsigned long node) > > > +static struct page *alloc_demote_page(struct page *page, unsigned long private) > > > { > > > - struct migration_target_control mtc = { > > > - /* > > > - * Allocate from 'node', or fail quickly and quietly. > > > - * When this happens, 'page' will likely just be discarded > > > - * instead of migrated. > > > - */ > > > - .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | > > > - __GFP_THISNODE | __GFP_NOWARN | > > > - __GFP_NOMEMALLOC | GFP_NOWAIT, > > > - .nid = node > > > - }; > > > + struct page *target_page; > > > + nodemask_t *allowed_mask; > > > + struct migration_target_control *mtc; > > > + > > > + mtc = (struct migration_target_control *)private; > > > + > > > + allowed_mask = mtc->nmask; > > > + /* > > > + * make sure we allocate from the target node first also trying to > > > + * reclaim pages from the target node via kswapd if we are low on > > > + * free memory on target node. If we don't do this and if we have low > > > + * free memory on the target memtier, we would start allocating pages > > > + * from higher memory tiers without even forcing a demotion of cold > > > + * pages from the target memtier. This can result in the kernel placing > > > + * hotpages in higher memory tiers. > > > + */ > > > + mtc->nmask = NULL; > > > + mtc->gfp_mask |= __GFP_THISNODE; > > > + target_page = alloc_migration_target(page, (unsigned long)&mtc); > > > + if (target_page) > > > + return target_page; > > > + > > > + mtc->gfp_mask &= ~__GFP_THISNODE; > > > + mtc->nmask = allowed_mask; > > > > > > > > > > > > > > > > > > > > > > > > return alloc_migration_target(page, (unsigned long)&mtc); > > > } > > > @@ -1487,6 +1500,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages, > > > { > > > int target_nid = next_demotion_node(pgdat->node_id); > > > unsigned int nr_succeeded; > > > + nodemask_t allowed_mask; > > > + > > > + struct migration_target_control mtc = { > > > + /* > > > + * Allocate from 'node', or fail quickly and quietly. > > > + * When this happens, 'page' will likely just be discarded > > > + * instead of migrated. > > > + */ > > > + .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN | > > > + __GFP_NOMEMALLOC | GFP_NOWAIT, > > > + .nid = target_nid, > > > + .nmask = &allowed_mask > > > + }; > > > > > > > > > > > > > > > > > > > > > > > > if (list_empty(demote_pages)) > > > return 0; > > > @@ -1494,10 +1520,12 @@ static unsigned int demote_page_list(struct list_head *demote_pages, > > > if (target_nid == NUMA_NO_NODE) > > > return 0; > > > > > > > > > > > > > > > > > > > > > > > > + node_get_allowed_targets(pgdat->node_id, &allowed_mask); > > > + > > > /* Demotion ignores all cpuset and mempolicy settings */ > > > migrate_pages(demote_pages, alloc_demote_page, NULL, > > > - target_nid, MIGRATE_ASYNC, MR_DEMOTION, > > > - &nr_succeeded); > > > + (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION, > > > + &nr_succeeded); > > > > Firstly, it addressed my requirement, Thanks! And, I'd prefer to put > > mtc definition in alloc_demote_page(). Because that makes all page > > allocation logic in one function. Thus the readability of code is > > better. > > The challenge is in allowed_mask computation. That is based on the > src_node and not target_node. > How about passing the src_node to alloc_demote_page()? Best Regards, Huang, Ying
On Sun, Jun 5, 2022 at 10:27 PM Ying Huang <ying.huang@intel.com> wrote: > > On Mon, 2022-06-06 at 09:37 +0530, Aneesh Kumar K V wrote: > > On 6/6/22 6:13 AM, Ying Huang wrote: > > > On Fri, 2022-06-03 at 20:39 +0530, Aneesh Kumar K V wrote: > > > > On 6/2/22 1:05 PM, Ying Huang wrote: > > > > > On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote: > > > > > > From: Jagdish Gediya <jvgediya@linux.ibm.com> > > > > > > > > > > > > currently, a higher tier node can only be demoted to selected > > > > > > nodes on the next lower tier as defined by the demotion path, > > > > > > not any other node from any lower tier. This strict, hard-coded > > > > > > demotion order does not work in all use cases (e.g. some use cases > > > > > > may want to allow cross-socket demotion to another node in the same > > > > > > demotion tier as a fallback when the preferred demotion node is out > > > > > > of space). This demotion order is also inconsistent with the page > > > > > > allocation fallback order when all the nodes in a higher tier are > > > > > > out of space: The page allocation can fall back to any node from any > > > > > > lower tier, whereas the demotion order doesn't allow that currently. > > > > > > > > > > > > This patch adds support to get all the allowed demotion targets mask > > > > > > for node, also demote_page_list() function is modified to utilize this > > > > > > allowed node mask by filling it in migration_target_control structure > > > > > > before passing it to migrate_pages(). > > > > > > > > > > > > > ... > > > > > > > > > > * Take pages on @demote_list and attempt to demote them to > > > > > > * another node. Pages which are not demoted are left on > > > > > > @@ -1481,6 +1464,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages, > > > > > > { > > > > > > int target_nid = next_demotion_node(pgdat->node_id); > > > > > > unsigned int nr_succeeded; > > > > > > + nodemask_t allowed_mask; > > > > > > + > > > > > > + struct migration_target_control mtc = { > > > > > > + /* > > > > > > + * Allocate from 'node', or fail quickly and quietly. > > > > > > + * When this happens, 'page' will likely just be discarded > > > > > > + * instead of migrated. > > > > > > + */ > > > > > > + .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN | > > > > > > + __GFP_NOMEMALLOC | GFP_NOWAIT, > > > > > > + .nid = target_nid, > > > > > > + .nmask = &allowed_mask > > > > > > + }; > > > > > > > > > > IMHO, we should try to allocate from preferred node firstly (which will > > > > > kick kswapd of the preferred node if necessary). If failed, we will > > > > > fallback to all allowed node. > > > > > > > > > > As we discussed as follows, > > > > > > > > > > https://lore.kernel.org/lkml/69f2d063a15f8c4afb4688af7b7890f32af55391.camel@intel.com/ > > > > > > > > > > That is, something like below, > > > > > > > > > > static struct page *alloc_demote_page(struct page *page, unsigned long node) > > > > > { > > > > > struct page *page; > > > > > nodemask_t allowed_mask; > > > > > struct migration_target_control mtc = { > > > > > /* > > > > > * Allocate from 'node', or fail quickly and quietly. > > > > > * When this happens, 'page' will likely just be discarded > > > > > * instead of migrated. > > > > > */ > > > > > .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | > > > > > __GFP_THISNODE | __GFP_NOWARN | > > > > > __GFP_NOMEMALLOC | GFP_NOWAIT, > > > > > .nid = node > > > > > }; > > > > > > > > > > page = alloc_migration_target(page, (unsigned long)&mtc); > > > > > if (page) > > > > > return page; > > > > > > > > > > mtc.gfp_mask &= ~__GFP_THISNODE; > > > > > mtc.nmask = &allowed_mask; > > > > > > > > > > return alloc_migration_target(page, (unsigned long)&mtc); > > > > > } > > > > > > > > I skipped doing this in v5 because I was not sure this is really what we > > > > want. > > > > > > I think so. And this is the original behavior. We should keep the > > > original behavior as much as possible, then make changes if necessary. > > > > > > > That is the reason I split the new page allocation as a separate patch. > > Previous discussion on this topic didn't conclude on whether we really > > need to do the above or not > > https://lore.kernel.org/lkml/CAAPL-u9endrWf_aOnPENDPdvT-2-YhCAeJ7ONGckGnXErTLOfQ@mail.gmail.com/ > > Please check the later email in the thread you referenced. Both Wei and > me agree that the use case needs to be supported. We just didn't reach > concensus about how to implement it. If you think Wei's solution is > better (referenced as below), you can try to do that too. Although I > think my proposed implementation is much simpler. > > " > This is true with the current allocation code. But I think we can make > some changes for demotion allocations. For example, we can add a > GFP_DEMOTE flag and update the allocation function to wake up kswapd > when this flag is set and we need to fall back to another node. > " Sorry for chiming in late. Yes, I also agree doing harder on the preferred node before fallback is a valid usecase. I think the "trying with __GFP_THISNODE then retrying w/o it" should be good enough for now since demotion is the only user and vmscan is the only callsite for now. It won't be late to modify core page allocation function to support this semantic when it gets broader use. > > > Based on the above I looked at avoiding GFP_THISNODE allocation. If you > > have experiment results that suggest otherwise can you share? I could > > summarize that in the commit message for better description of why > > GFP_THISNODE enforcing is needed. > > Why? GFP_THISNODE is just the first step. We will fallback to > allocation without it if necessary. > > Best Regards, > Huang, Ying > > > > > I guess we can do this as part of the change that is going to > > > > introduce the usage of memory policy for the allocation? > > > > > > Like the memory allocation policy, the default policy should be local > > > preferred. We shouldn't force users to use explicit memory policy for > > > that. > > > > > > And the added code isn't complex. > > > > > > > -aneesh > > >
diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 77c581f47953..1f3cbd5185ca 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -182,6 +182,7 @@ void node_remove_from_memory_tier(int node); int node_get_memory_tier_id(int node); int node_set_memory_tier_rank(int node, int tier); int node_reset_memory_tier(int node, int tier); +void node_get_allowed_targets(int node, nodemask_t *targets); #else #define numa_demotion_enabled false static inline int next_demotion_node(int node) @@ -189,6 +190,10 @@ static inline int next_demotion_node(int node) return NUMA_NO_NODE; } +static inline void node_get_allowed_targets(int node, nodemask_t *targets) +{ + *targets = NODE_MASK_NONE; +} #endif /* CONFIG_TIERED_MEMORY */ #endif /* _LINUX_MIGRATE_H */ diff --git a/mm/migrate.c b/mm/migrate.c index 114c7428b9f3..84fac477538c 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2129,6 +2129,7 @@ struct memory_tier { struct demotion_nodes { nodemask_t preferred; + nodemask_t allowed; }; #define to_memory_tier(device) container_of(device, struct memory_tier, dev) @@ -2475,6 +2476,25 @@ int node_set_memory_tier_rank(int node, int rank) } EXPORT_SYMBOL_GPL(node_set_memory_tier_rank); +void node_get_allowed_targets(int node, nodemask_t *targets) +{ + /* + * node_demotion[] is updated without excluding this + * function from running. + * + * If any node is moving to lower tiers then modifications + * in node_demotion[] are still valid for this node, if any + * node is moving to higher tier then moving node may be + * used once for demotion which should be ok so rcu should + * be enough here. + */ + rcu_read_lock(); + + *targets = node_demotion[node].allowed; + + rcu_read_unlock(); +} + /** * next_demotion_node() - Get the next node in the demotion path * @node: The starting node to lookup the next node @@ -2534,8 +2554,10 @@ static void __disable_all_migrate_targets(void) { int node; - for_each_node_mask(node, node_states[N_MEMORY]) + for_each_node_mask(node, node_states[N_MEMORY]) { node_demotion[node].preferred = NODE_MASK_NONE; + node_demotion[node].allowed = NODE_MASK_NONE; + } } static void disable_all_migrate_targets(void) @@ -2558,12 +2580,11 @@ static void disable_all_migrate_targets(void) */ static void establish_migration_targets(void) { - struct list_head *ent; struct memory_tier *memtier; struct demotion_nodes *nd; - int tier, target = NUMA_NO_NODE, node; + int target = NUMA_NO_NODE, node; int distance, best_distance; - nodemask_t used; + nodemask_t used, allowed = NODE_MASK_NONE; if (!node_demotion) return; @@ -2603,6 +2624,29 @@ static void establish_migration_targets(void) } } while (1); } + /* + * Now build the allowed mask for each node collecting node mask from + * all memory tier below it. This allows us to fallback demotion page + * allocation to a set of nodes that is closer the above selected + * perferred node. + */ + list_for_each_entry(memtier, &memory_tiers, list) + nodes_or(allowed, allowed, memtier->nodelist); + /* + * Removes nodes not yet in N_MEMORY. + */ + nodes_and(allowed, node_states[N_MEMORY], allowed); + + list_for_each_entry(memtier, &memory_tiers, list) { + /* + * Keep removing current tier from allowed nodes, + * This will remove all nodes in current and above + * memory tier from the allowed mask. + */ + nodes_andnot(allowed, allowed, memtier->nodelist); + for_each_node_mask(node, memtier->nodelist) + node_demotion[node].allowed = allowed; + } } /* diff --git a/mm/vmscan.c b/mm/vmscan.c index 1678802e03e7..feb994589481 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1454,23 +1454,6 @@ static void folio_check_dirty_writeback(struct folio *folio, mapping->a_ops->is_dirty_writeback(&folio->page, dirty, writeback); } -static struct page *alloc_demote_page(struct page *page, unsigned long node) -{ - struct migration_target_control mtc = { - /* - * Allocate from 'node', or fail quickly and quietly. - * When this happens, 'page' will likely just be discarded - * instead of migrated. - */ - .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | - __GFP_THISNODE | __GFP_NOWARN | - __GFP_NOMEMALLOC | GFP_NOWAIT, - .nid = node - }; - - return alloc_migration_target(page, (unsigned long)&mtc); -} - /* * Take pages on @demote_list and attempt to demote them to * another node. Pages which are not demoted are left on @@ -1481,6 +1464,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages, { int target_nid = next_demotion_node(pgdat->node_id); unsigned int nr_succeeded; + nodemask_t allowed_mask; + + struct migration_target_control mtc = { + /* + * Allocate from 'node', or fail quickly and quietly. + * When this happens, 'page' will likely just be discarded + * instead of migrated. + */ + .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN | + __GFP_NOMEMALLOC | GFP_NOWAIT, + .nid = target_nid, + .nmask = &allowed_mask + }; if (list_empty(demote_pages)) return 0; @@ -1488,10 +1484,12 @@ static unsigned int demote_page_list(struct list_head *demote_pages, if (target_nid == NUMA_NO_NODE) return 0; + node_get_allowed_targets(pgdat->node_id, &allowed_mask); + /* Demotion ignores all cpuset and mempolicy settings */ - migrate_pages(demote_pages, alloc_demote_page, NULL, - target_nid, MIGRATE_ASYNC, MR_DEMOTION, - &nr_succeeded); + migrate_pages(demote_pages, alloc_migration_target, NULL, + (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION, + &nr_succeeded); if (current_is_kswapd()) __count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded);