Message ID: tencent_57D6CF437AF88E48DD5C5BD872753C43280A@qq.com (mailing list archive)
State:      New
Series:     mm/mempolicy: Fix decision-making issues for memory migration during NUMA balancing
On Sun, Nov 24, 2024 at 03:09:35AM +0800, Junjie Fu wrote:
> Because the definition of MPOL_PREFERRED is as follows: "This mode sets the
> preferred node for allocation. The kernel will try to allocate pages from
> this node first and fall back to nearby nodes if the preferred node is low
> on free memory. If the nodemask specifies more than one node ID, the first
> node in the mask will be selected as the preferred node."
>
> Thus, if the node where the current page resides is not the first node in
> the nodemask, it is not the PREFERRED node, and memory migration can be
> attempted.

I think you've found poor documentation, not a kernel bug.  If multiple
nodes are set in PREFERRED, then _new_ allocations should come from the
first node, but _existing_ allocations do not need to be moved to the
new node.  At least IMO that was the original intent of allowing
multiple nodes to be set.  Otherwise, what is the point?
On Sun 24-11-24 03:09:35, Junjie Fu wrote:
> When handling a page fault caused by NUMA balancing (do_numa_page), it is
> necessary to decide whether to migrate the current page to another node or
> keep it on its current node. For pages with the MPOL_PREFERRED memory
> policy, it is sufficient to check whether the first node set in the
> nodemask is the same as the node where the page is currently located. If
> this is the case, the page should remain in its current state. Otherwise,
> migration to another node should be attempted.
>
> Because the definition of MPOL_PREFERRED is as follows: "This mode sets the
> preferred node for allocation. The kernel will try to allocate pages from
> this node first and fall back to nearby nodes if the preferred node is low
> on free memory. If the nodemask specifies more than one node ID, the first
> node in the mask will be selected as the preferred node."
>
> Thus, if the node where the current page resides is not the first node in
> the nodemask, it is not the PREFERRED node, and memory migration can be
> attempted.
>
> However, in the original code, the check only verifies whether the current
> node exists in the nodemask (which may or may not be the first node in the
> mask). This could lead to a scenario where, if the current node is not the
> first node in the nodemask, the code incorrectly decides not to attempt
> migration to other nodes.
>
> This behavior is clearly incorrect. If the target node for migration and
> the page's current NUMA node are both within the nodemask but neither is
> the first node, they should be treated with the same priority, and
> migration attempts should proceed.

The code is clearly confusing, but is there any actual problem to be
solved? IIRC, although we do keep a nodemask for the MPOL_PREFERRED
policy, we do not allow more than a single node to be set there.
Have a look at mpol_new_preferred.
On Mon, Nov 25, 2024 at 12:33:52PM +0100, Michal Hocko wrote:
> On Sun 24-11-24 03:09:35, Junjie Fu wrote:
> > When handling a page fault caused by NUMA balancing (do_numa_page), it is
> > necessary to decide whether to migrate the current page to another node or
> > keep it on its current node.
[...]
> The code is clearly confusing, but is there any actual problem to be
> solved? IIRC, although we do keep a nodemask for the MPOL_PREFERRED
> policy, we do not allow more than a single node to be set there.
> Have a look at mpol_new_preferred.

Concur here - the proposed patch doesn't actually change any behavior
(or it shouldn't, at least). Is there a migration error being observed
that this patch fixes, or is this just an "observational fix"?

~Gregory
On November 25, 2024 at 19:33, Michal Hocko wrote:
> On Sun 24-11-24 03:09:35, Junjie Fu wrote:
>> When handling a page fault caused by NUMA balancing (do_numa_page), it is
>> necessary to decide whether to migrate the current page to another node or
>> keep it on its current node.
[...]
>
> The code is clearly confusing, but is there any actual problem to be
> solved? IIRC, although we do keep a nodemask for the MPOL_PREFERRED
> policy, we do not allow more than a single node to be set there.
> Have a look at mpol_new_preferred.

I apologize for the oversight when reviewing how the MPOL_PREFERRED
memory policy stores only the first node of the nodemask. After
reviewing the mpol_new_preferred function, I realized that when setting
the memory policy, only the first node from the user's nodemask is
copied into the corresponding memory policy instance's nodemask, as
shown in the following code:

static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
{
	if (nodes_empty(*nodes))
		return -EINVAL;
	nodes_clear(pol->nodes);
	node_set(first_node(*nodes), pol->nodes);	/* only the first node is set */
	return 0;
}

Due to my previous oversight, I mistakenly assumed that multiple nodes
could be set in pol->nodes, which led to my incorrect understanding.
Therefore, the original code is correct. Thank you all for your
responses.
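The reduction shown above - only the first node of the user's mask
surviving into pol->nodes - can be mimicked with a small userspace
sketch. This is not kernel code: a plain unsigned long stands in for
nodemask_t, and the *_sketch helper names are hypothetical.

```c
#include <assert.h>

/* Userspace sketch: an unsigned long models a nodemask_t bitmap. */
static int first_node_sketch(unsigned long mask)
{
	int node;

	if (!mask)
		return -1;
	for (node = 0; !(mask & (1UL << node)); node++)
		;
	return node;	/* index of the lowest set bit */
}

/* Mirrors what mpol_new_preferred() does: keep only the first node. */
static unsigned long preferred_from_mask(unsigned long user_mask)
{
	if (!user_mask)
		return 0;	/* the kernel returns -EINVAL in this case */
	return 1UL << first_node_sketch(user_mask);
}
```

So a caller asking for nodes {2, 5} ends up with a policy that records
only node 2, which is why pol->nodes can never hold more than one node
for MPOL_PREFERRED.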
On Tue 26-11-24 03:45:01, Junjie Fu wrote:
[...]
> I apologize for the oversight when reviewing how the MPOL_PREFERRED
> memory policy stores only the first node of the nodemask.

There is no need to apologize! Really, this code is far from
straightforward and it is not easy to pull all the loose ends together.
What does help, though, is giving reviewers a clear problem statement.
It often helps to state whether the fix is based on code review or
whether it fixes a real-life or even an artificial workload.
On November 26, 2024 at 4:18, Michal Hocko wrote:
> On Tue 26-11-24 03:45:01, Junjie Fu wrote:
> [...]
>
> There is no need to apologize! Really, this code is far from
> straightforward and it is not easy to pull all the loose ends together.
> What does help, though, is giving reviewers a clear problem statement.
> It often helps to state whether the fix is based on code review or
> whether it fixes a real-life or even an artificial workload.

Thank you very much for your understanding and kind words. I truly
appreciate your helpful suggestion, and I will make sure to provide more
context and clarity for reviewers in future submissions to avoid any
confusion.
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index bb37cd1a51d8..3454dfc7da8d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2769,7 +2769,7 @@ int mpol_misplaced(struct folio *folio, struct vm_fault *vmf,
 		break;
 
 	case MPOL_PREFERRED:
-		if (node_isset(curnid, pol->nodes))
+		if (curnid == first_node(pol->nodes))
 			goto out;
 		polnid = first_node(pol->nodes);
 		break;
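Because mpol_new_preferred() guarantees that pol->nodes contains exactly
one node, the old node_isset() check and the proposed first_node()
comparison are equivalent, which is why the patch changes no behavior.
A hedged userspace sketch (plain bitmask model, hypothetical helper
names, not kernel API) makes the equivalence checkable:

```c
#include <assert.h>

/* Userspace model: an unsigned long stands in for nodemask_t. */
static int node_isset_sketch(int node, unsigned long mask)
{
	return !!(mask & (1UL << node));
}

static int first_node_sketch(unsigned long mask)
{
	int node;

	for (node = 0; node < 64; node++)
		if (mask & (1UL << node))
			return node;
	return -1;
}

/*
 * For a single-bit mask - the only shape mpol_new_preferred() can
 * produce - the old check (membership) and the new check (equality
 * with the first node) give the same answer for every candidate node.
 */
static int checks_agree(unsigned long single_bit_mask, int curnid)
{
	int old_check = node_isset_sketch(curnid, single_bit_mask);
	int new_check = (curnid == first_node_sketch(single_bit_mask));

	return old_check == new_check;
}
```

With a multi-bit mask the two checks would diverge, but that shape can
never reach mpol_misplaced() under MPOL_PREFERRED.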
When handling a page fault caused by NUMA balancing (do_numa_page), it is
necessary to decide whether to migrate the current page to another node or
keep it on its current node. For pages with the MPOL_PREFERRED memory
policy, it is sufficient to check whether the first node set in the
nodemask is the same as the node where the page is currently located. If
this is the case, the page should remain in its current state. Otherwise,
migration to another node should be attempted.

Because the definition of MPOL_PREFERRED is as follows: "This mode sets the
preferred node for allocation. The kernel will try to allocate pages from
this node first and fall back to nearby nodes if the preferred node is low
on free memory. If the nodemask specifies more than one node ID, the first
node in the mask will be selected as the preferred node."

Thus, if the node where the current page resides is not the first node in
the nodemask, it is not the PREFERRED node, and memory migration can be
attempted.

However, in the original code, the check only verifies whether the current
node exists in the nodemask (which may or may not be the first node in the
mask). This could lead to a scenario where, if the current node is not the
first node in the nodemask, the code incorrectly decides not to attempt
migration to other nodes.

This behavior is clearly incorrect. If the target node for migration and
the page's current NUMA node are both within the nodemask but neither is
the first node, they should be treated with the same priority, and
migration attempts should proceed.

Signed-off-by: Junjie Fu <fujunjie1@qq.com>
---
 mm/mempolicy.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)