
mm/mempolicy: Fix decision-making issues for memory migration during NUMA balancing

Message ID tencent_57D6CF437AF88E48DD5C5BD872753C43280A@qq.com (mailing list archive)
State New
Series mm/mempolicy: Fix decision-making issues for memory migration during NUMA balancing

Commit Message

Junjie Fu Nov. 23, 2024, 7:09 p.m. UTC
When handling a page fault caused by NUMA balancing (do_numa_page), it is
necessary to decide whether to migrate the current page to another node or
keep it on its current node. For pages with the MPOL_PREFERRED memory
policy, it is sufficient to check whether the first node set in the
nodemask is the same as the node where the page is currently located. If
this is the case, the page should remain in its current state. Otherwise,
migration to another node should be attempted.

This is because MPOL_PREFERRED is defined as follows: "This mode sets the
preferred node for allocation. The kernel will try to allocate pages from
this node first and fall back to nearby nodes if the preferred node is low
on free memory. If the nodemask specifies more than one node ID, the first
node in the mask will be selected as the preferred node."

Thus, if the node where the current page resides is not the first node in
the nodemask, it is not the PREFERRED node, and memory migration can be
attempted.
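
For illustration only (not part of this patch), here is a minimal
userspace sketch of how more than one node can be passed with
MPOL_PREFERRED via set_mempolicy(2); the node numbers are arbitrary and
the libnuma syscall wrappers are assumed:

#include <numaif.h>	/* set_mempolicy(), MPOL_PREFERRED; link with -lnuma */
#include <stdio.h>

int main(void)
{
	/* Bits for nodes 0 and 1 both set; per the definition quoted
	 * above, only the first (node 0) becomes the preferred node. */
	unsigned long nodemask = (1UL << 0) | (1UL << 1);

	if (set_mempolicy(MPOL_PREFERRED, &nodemask, sizeof(nodemask) * 8)) {
		perror("set_mempolicy");
		return 1;
	}
	return 0;
}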

However, in the original code, the check only verifies whether the current
node exists in the nodemask (which may or may not be the first node in the
mask). This could lead to a scenario where, if the current node is not the
first node in the nodemask, the code incorrectly decides not to attempt
migration to other nodes.

This behavior is clearly incorrect. If the target node for migration and
the page's current NUMA node are both within the nodemask but neither is
the first node, they should be treated with the same priority, and
migration attempts should proceed.

Signed-off-by: Junjie Fu <fujunjie1@qq.com>
---
 mm/mempolicy.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Matthew Wilcox Nov. 23, 2024, 10:15 p.m. UTC | #1
On Sun, Nov 24, 2024 at 03:09:35AM +0800, Junjie Fu wrote:
> This is because MPOL_PREFERRED is defined as follows: "This mode sets the
> preferred node for allocation. The kernel will try to allocate pages from
> this node first and fall back to nearby nodes if the preferred node is low
> on free memory. If the nodemask specifies more than one node ID, the first
> node in the mask will be selected as the preferred node."
> 
> Thus, if the node where the current page resides is not the first node in
> the nodemask, it is not the PREFERRED node, and memory migration can be
> attempted.

I think you've found poor documentation, not a kernel bug.  If multiple
nodes are set in PREFERRED, then _new_ allocations should come from the
first node, but _existing_ allocations do not need to be moved to the
new node.  At least IMO that was the original intent of allowing
multiple nodes to be set.  Otherwise, what is the point?
Michal Hocko Nov. 25, 2024, 11:33 a.m. UTC | #2
On Sun 24-11-24 03:09:35, Junjie Fu wrote:
> [...]
> 
> This behavior is clearly incorrect. If the target node for migration and
> the page's current NUMA node are both within the nodemask but neither is
> the first node, they should be treated with the same priority, and
> migration attempts should proceed.

The code is clearly confusing but is there any actual problem to be
solved? IIRC although we do keep a nodemask for the MPOL_PREFERRED
policy, we do not allow more than a single node to be set there.
Have a look at mpol_new_preferred().
Gregory Price Nov. 25, 2024, 4:06 p.m. UTC | #3
On Mon, Nov 25, 2024 at 12:33:52PM +0100, Michal Hocko wrote:
> On Sun 24-11-24 03:09:35, Junjie Fu wrote:
> > [...]
> 
> The code is clearly confusing but is there any actual problem to be
> solved? IIRC although we do keep a nodemask for the MPOL_PREFERRED
> policy, we do not allow more than a single node to be set there.
> Have a look at mpol_new_preferred().
>

Concur here - the proposed patch doesn't actually change any behavior
(or it shouldn't, at least).

Is there a migration error being observed that this patch fixes, or is
this just an `observational fix`?

~Gregory
Junjie Fu Nov. 25, 2024, 7:45 p.m. UTC | #4
On November 25, 2024 at 19:33, Michal Hocko wrote:
> On Sun 24-11-24 03:09:35, Junjie Fu wrote:
>> [...]
> 
> The code is clearly confusing but is there any actual problem to be
> solved? IIRC although we do keep a nodemask for the MPOL_PREFERRED
> policy, we do not allow more than a single node to be set there.
> Have a look at mpol_new_preferred().
> 

I apologize for the oversight when reviewing the code that sets up the
nodemask for the MPOL_PREFERRED memory policy. After reviewing the
mpol_new_preferred() function, I realized that when the memory policy is
set, only the first node from the user's nodemask is copied into the
policy instance's nodemask, as shown in the following code:

static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
{
	if (nodes_empty(*nodes))
		return -EINVAL;

	nodes_clear(pol->nodes);
	node_set(first_node(*nodes), pol->nodes);	/* only the first node is set */
	return 0;
}

Due to my previous oversight, I mistakenly assumed that multiple nodes
could be set in pol->nodes, which led me to an incorrect conclusion. The
original code is therefore correct. Thank you all for your responses.
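
For completeness, a hedged userspace sketch (not part of the original
discussion) that makes this visible: set a two-node preferred mask and
read the policy back with get_mempolicy(2); on a kernel with the
mpol_new_preferred() shown above, only the first node should survive.
The libnuma syscall wrappers are assumed:

#include <numaif.h>	/* set_mempolicy(), get_mempolicy(); link with -lnuma */
#include <stdio.h>

int main(void)
{
	unsigned long set_mask = (1UL << 0) | (1UL << 1);	/* nodes 0 and 1 */
	unsigned long got_mask = 0;
	int mode = -1;

	if (set_mempolicy(MPOL_PREFERRED, &set_mask, sizeof(set_mask) * 8)) {
		perror("set_mempolicy");
		return 1;
	}
	/* addr == NULL, flags == 0: query the calling thread's policy. */
	if (get_mempolicy(&mode, &got_mask, sizeof(got_mask) * 8, NULL, 0)) {
		perror("get_mempolicy");
		return 1;
	}
	/* Expected on such kernels: mode == MPOL_PREFERRED, nodemask == 0x1. */
	printf("mode=%d nodemask=0x%lx\n", mode, got_mask);
	return 0;
}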
Michal Hocko Nov. 25, 2024, 8:18 p.m. UTC | #5
On Tue 26-11-24 03:45:01, Junjie Fu wrote:
[...]
> I apologize for the oversight when reviewing the code that sets up the
> nodemask for the MPOL_PREFERRED memory policy.

There is no need to apologize! Really, this code is far from
straightforward and it is not easy to get all the loose ends together.
What helps, though, is to give reviewers a clear problem statement. It
often helps to state whether the fix is based on code review or is
fixing a real-life or even an artificial workload.
Junjie Fu Nov. 25, 2024, 8:41 p.m. UTC | #6
On November 26, 2024 at 4:18, Michal Hocko wrote:
> On Tue 26-11-24 03:45:01, Junjie Fu wrote:
> [...]
>> I apologize for the oversight when reviewing the code that sets up the
>> nodemask for the MPOL_PREFERRED memory policy.
> 
> There is no need to apologize! Really, this code is far from
> straightforward and it is not easy to get all the loose ends together.
> What helps, though, is to give reviewers a clear problem statement. It
> often helps to state whether the fix is based on code review or is
> fixing a real-life or even an artificial workload.

Thank you very much for your understanding and kind words. I truly 
appreciate your helpful suggestion, and I will make sure to provide more 
context and clarity for reviewers in future submissions to avoid any 
confusion.

Patch

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index bb37cd1a51d8..3454dfc7da8d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2769,7 +2769,7 @@  int mpol_misplaced(struct folio *folio, struct vm_fault *vmf,
 		break;
 
 	case MPOL_PREFERRED:
-		if (node_isset(curnid, pol->nodes))
+		if (curnid == first_node(pol->nodes))
 			goto out;
 		polnid = first_node(pol->nodes);
 		break;
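
As the discussion above concludes, mpol_new_preferred() leaves exactly
one bit set in pol->nodes, so node_isset(curnid, pol->nodes) and
curnid == first_node(pol->nodes) always agree and the hunk above does
not change behavior. A minimal standalone sketch of that equivalence,
with plain bitmasks standing in for nodemask_t and hypothetical model_*
helper names:

#include <assert.h>
#include <stdbool.h>

/* Plain bitmask standing in for nodemask_t; assumes fewer than 64 nodes. */
static bool model_node_isset(int node, unsigned long mask)
{
	return (mask >> node) & 1UL;
}

static int model_first_node(unsigned long mask)
{
	return __builtin_ctzl(mask);	/* index of the lowest set bit */
}

int main(void)
{
	/* mpol_new_preferred() guarantees a single-bit mask, so the two
	 * checks from mpol_misplaced() must agree for every curnid. */
	for (int pref = 0; pref < 8; pref++) {
		unsigned long mask = 1UL << pref;

		for (int curnid = 0; curnid < 8; curnid++)
			assert(model_node_isset(curnid, mask) ==
			       (curnid == model_first_node(mask)));
	}
	return 0;
}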