[v3,2/2] mm/numa_balancing: Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy

Message ID b1599085e1d2f3e48dc71c7991283b8aaa0fe00c.1711002865.git.donettom@linux.ibm.com (mailing list archive)
State New
Series [v3,1/2] mm/mempolicy: Use numa_node_id() instead of cpu_to_node()

Commit Message

Donet Tom March 21, 2024, 11:29 a.m. UTC
commit bda420b98505 ("numa balancing: migrate on fault among multiple bound
nodes") added support for migrate on protnone reference with MPOL_BIND
memory policy. This allowed numa fault migration when the executing node
is part of the policy mask for MPOL_BIND. This patch extends migration
support to MPOL_PREFERRED_MANY policy.

Currently, we cannot specify MPOL_PREFERRED_MANY with the mempolicy flag
MPOL_F_NUMA_BALANCING. This causes issues when we want to use
NUMA_BALANCING_MEMORY_TIERING. To effectively use the slow memory tier,
the kernel should not allocate pages from the slower memory tier via
allocation control zonelist fallback. Instead, we should move cold pages
from the faster memory node via memory demotion. For a page allocation,
kswapd is only woken up after we try to allocate pages from all nodes in
the allocation zone list. This implies that, without using memory
policies, we will end up allocating hot pages in the slower memory tier.

MPOL_PREFERRED_MANY was added by commit b27abaccf8e8 ("mm/mempolicy: add
MPOL_PREFERRED_MANY for multiple preferred nodes") to allow better
allocation control when we have memory tiers in the system. With
MPOL_PREFERRED_MANY, the user can use a policy node mask consisting only
of faster memory nodes. When we fail to allocate pages from the faster
memory node, kswapd would be woken up, allowing demotion of cold pages
to slower memory nodes.

With the current kernel, such usage of memory policies implies we can't
do page promotion from a slower memory tier to a faster memory tier
using numa fault. This patch fixes this issue.

For MPOL_PREFERRED_MANY, if the executing node is in the policy node
mask, we allow numa migration to the executing node. If the executing
node is not in the policy node mask, we do not allow numa migration.

Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
 mm/mempolicy.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)
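
For readers who want to try the combination this patch enables, here is a
minimal userspace sketch. It is not part of the patch. It assumes a Linux
system, calls the raw set_mempolicy syscall, and redefines two constants
with the values I understand include/uapi/linux/mempolicy.h to use, so that
libnuma's <numaif.h> is not required. The helper names (build_nodemask,
policy_mode, prefer_nodes_with_balancing) and the fixed two-word mask are
assumptions made for illustration only.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Assumed to match include/uapi/linux/mempolicy.h; redefined locally so the
 * sketch builds without libnuma headers. */
#define SKETCH_MPOL_PREFERRED_MANY   5
#define SKETCH_MPOL_F_NUMA_BALANCING (1 << 13)

/* Illustrative mask size: two unsigned longs (128 node bits on 64-bit). */
#define SKETCH_MASK_LONGS    2
#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical helper: clear the mask, then set one bit per node id.
 * Returns 0 on success or -EINVAL if a node id does not fit in the mask. */
int build_nodemask(unsigned long mask[SKETCH_MASK_LONGS],
		   const int *nodes, size_t count)
{
	size_t i;

	for (i = 0; i < SKETCH_MASK_LONGS; i++)
		mask[i] = 0;

	for (i = 0; i < count; i++) {
		size_t nid;

		if (nodes[i] < 0)
			return -EINVAL;
		nid = (size_t)nodes[i];
		if (nid >= SKETCH_MASK_LONGS * SKETCH_BITS_PER_LONG)
			return -EINVAL;
		mask[nid / SKETCH_BITS_PER_LONG] |= 1UL << (nid % SKETCH_BITS_PER_LONG);
	}
	return 0;
}

/* The mode word: policy mode in the low bits, mode flag OR-ed in. */
int policy_mode(void)
{
	return SKETCH_MPOL_PREFERRED_MANY | SKETCH_MPOL_F_NUMA_BALANCING;
}

/* Hypothetical wrapper: ask the kernel to prefer the given nodes for the
 * calling thread and to allow NUMA-balancing migration among them.
 * Returns 0 on success or a negative errno value. */
int prefer_nodes_with_balancing(const int *nodes, size_t count)
{
	unsigned long mask[SKETCH_MASK_LONGS];
	int err = build_nodemask(mask, nodes, count);

	if (err)
		return err;

	if (syscall(SYS_set_mempolicy, policy_mode(), mask,
		    (unsigned long)(SKETCH_MASK_LONGS * SKETCH_BITS_PER_LONG)) == -1)
		return -errno;
	return 0;
}
```

On a kernel without this patch, prefer_nodes_with_balancing() would be
expected to fail with -EINVAL, because MPOL_F_NUMA_BALANCING is accepted
only together with MPOL_BIND there. Other errors (for example -EPERM in a
restricted container) are possible and are simply passed back to the caller.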

Comments

Huang, Ying March 22, 2024, 8:32 a.m. UTC | #1
Donet Tom <donettom@linux.ibm.com> writes:

> For MPOL_PREFERRED_MANY, if the executing node is in the policy node
> mask, we allow numa migration to the executing nodes. If the executing
> node is not in the policy node mask, we do not allow numa migration.

Can we provide more information about this?  I suggest using an
example; for instance, pages may be distributed among multiple sockets
unexpectedly.

--
Best Regards,
Huang, Ying

Donet Tom March 22, 2024, 10:05 a.m. UTC | #2
On 3/22/24 14:02, Huang, Ying wrote:
> Donet Tom <donettom@linux.ibm.com> writes:
>
>> For MPOL_PREFERRED_MANY, if the executing node is in the policy node
>> mask, we allow numa migration to the executing nodes. If the executing
>> node is not in the policy node mask, we do not allow numa migration.
> Can we provide more information about this?  I suggest to use an
> example, for instance, pages may be distributed among multiple sockets
> unexpectedly.

Thank you for your suggestion. However, this commit message already
explains all the scenarios.

For example, consider a system with 3 NUMA nodes (N0, N1 and N6).
N0 and N1 are tier 1 DRAM nodes and N6 is a tier 2 PMEM node.

Scenario 1: The process is executing on N1 and the executing node is in
            the policy node mask.
            Curr Loc Pages - the NUMA node where the page is present
            (folio node)
=========================================================================
Process   Policy      Curr Loc Pages   Observations
-------------------------------------------------------------------------
N1        N0 N1 N6    N0               Pages migrated from N0 to N1
N1        N0 N1 N6    N6               Pages migrated from N6 to N1
N1        N0 N1       N1               Pages Migrated from N1 to N6
N1        N0 N1       N6               Pages migrated from N6 to N1
-------------------------------------------------------------------------

Scenario 2: The process is executing on N1 and the executing node is NOT
            in the policy node mask.
            Curr Loc Pages - the NUMA node where the page is present
            (folio node)
=========================================================================
Process   Policy      Curr Loc Pages   Observations
-------------------------------------------------------------------------
N1        N0 N6       N0               Pages are not migrated
N1        N0 N6       N6               Pages are not migrated
N1        N0          N0               Pages are not migrated
-------------------------------------------------------------------------

Scenario 3: The process is executing on N1 and neither the executing node
            nor the folio node is in the policy node mask.
            Curr Loc Pages - the NUMA node where the page is present
            (folio node)
=========================================================================
Process   Policy      Curr Loc Pages   Observations
-------------------------------------------------------------------------
N1        N0          N6               Pages are not migrated
N1        N6          N0               Pages are not migrated
-------------------------------------------------------------------------

We can conclude that even if the pages are distributed among multiple sockets,
if the executing node is in the policy node mask, we allow numa migration to the
executing node. If the executing node is not in the policy node mask,
we do not allow numa migration.
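
The rule stated in the conclusion can be written down as a small
standalone model. This is only a sketch, not the kernel code: it assumes
MPOL_F_NUMA_BALANCING is set on the policy, represents the nodemask as a
64-bit word, and ignores the additional NUMA balancing heuristics that the
kernel applies before actually migrating a folio. The names
model_nodemask, MODEL_NODE_BIT, MODEL_NO_NODE and model_migration_target
are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Stands in for the kernel's NUMA_NO_NODE: "do not migrate". */
#define MODEL_NO_NODE (-1)

/* One bit per node id; a simplified stand-in for nodemask_t. */
typedef uint64_t model_nodemask;
#define MODEL_NODE_BIT(n) ((model_nodemask)1 << (n))

/*
 * Return the node a folio would be migrated to on a NUMA hint fault, or
 * MODEL_NO_NODE if it stays where it is.
 *
 *   policy_nodes   - nodes set in the policy nodemask
 *   executing_node - node of the CPU that took the fault
 *   folio_node     - node the folio currently lives on
 */
int model_migration_target(model_nodemask policy_nodes,
			   int executing_node, int folio_node)
{
	assert(executing_node >= 0 && executing_node < 64);

	/* Executing node outside the policy nodemask: never migrate. */
	if (!(policy_nodes & MODEL_NODE_BIT(executing_node)))
		return MODEL_NO_NODE;

	/* Folio is already local to the executing node: nothing to do. */
	if (folio_node == executing_node)
		return MODEL_NO_NODE;

	/* Otherwise pull the folio towards the executing node. */
	return executing_node;
}
```

With a process executing on N1, a policy of {N0, N1} and a folio on N6,
the model returns 1 (migrate to N1). With a policy of {N0, N6} it returns
MODEL_NO_NODE for any folio location, because N1 is not in the mask.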

Thanks
Donet Tom

Huang, Ying March 25, 2024, 2:48 a.m. UTC | #3
Donet Tom <donettom@linux.ibm.com> writes:

> On 3/22/24 14:02, Huang, Ying wrote:
>> Donet Tom <donettom@linux.ibm.com> writes:
>>
>>> For MPOL_PREFERRED_MANY, if the executing node is in the policy node
>>> mask, we allow numa migration to the executing nodes. If the executing
>>> node is not in the policy node mask, we do not allow numa migration.
>> Can we provide more information about this?  I suggest to use an
>> example, for instance, pages may be distributed among multiple sockets
>> unexpectedly.
>
> Thank you for your suggestion. However, this commit message explains all the scenarios.

Yes.  The commit message is correct and covers many cases.  What I
suggested is to describe why we do that.  An example cannot cover all
possibilities, but it is easy to understand.  For example, something like
the below?

For example, on a 2-socket system, there are N0, N1, N2 in socket 0 and N3
in socket 1.  N0, N1, N3 have fast memory and CPUs, while N2 has slow
memory and no CPU.  For a workload, we may use MPOL_PREFERRED_MANY with
a nodemask with N0 and N1 set because the workload runs on CPUs of socket
0 most of the time.  Then, even if the workload runs on CPUs of N3
occasionally, we will not try to migrate the workload pages from N2 to
N3 because users may want to avoid cross-socket access as much as
possible in the long term.

> For example, Consider a system with 3 numa nodes (N0,N1 and N6).
> N0 and N1 are tier1 DRAM nodes  and N6 is tier 2 PMEM node.
>
> Scenario 1: The process is executing on N1,
>             If the executing node is in the policy node mask,
>             Curr Loc Pages - The numa node where page present(folio node)
> ==================================================================================
> Process      Policy          Curr Loc Pages                 Observations
> -----------------------------------------------------------------------------------
> N1           N0 N1 N6              N0                   Pages Migrated from N0 to N1
> N1           N0 N1 N6              N6                   Pages Migrated from N6 to N1
> N1           N0 N1                 N1                   Pages Migrated from N1 to N6

Pages are not Migrating ?

> N1           N0 N1                 N6                   Pages Migrated from N6 to N1
> ------------------------------------------------------------------------------------
> Scenario 2:  The process is executing on N1,
>              If the executing node is NOT in the policy node mask,
>              Curr Loc Pages - The numa node where page present(folio node)
> ===================================================================================
> Process       Policy       Curr Loc Pages       Observations
> -----------------------------------------------------------------------------------
> N1            N0 N6             N0              Pages are not Migrating
> N1            N0 N6             N6              Pages are not migration,
> N1            N0                N0              Pages are not Migrating
> ------------------------------------------------------------------------------------
>
> Scenario 3: The process is executing on N1,
>             If the executing node and folio nodes are  NOT in the policy node mask,
>             Curr Loc Pages - The numa node where page present (folio node)
> ====================================================================================
> Thread    Policy       Curr Loc Pages           Observations
> ------------------------------------------------------------------------------------
> N1          N0               N6                 Pages are not Migrating
> N1          N6               N0                 Pages are not Migrating
> ------------------------------------------------------------------------------------
>
> We can conclude that even if the pages are distributed among multiple sockets,
> if the executing node is in the policy node mask, we allow numa migration to the
> executing nodes. If the executing node is not in the policy node mask,
> we do not allow numa migration.
>

[snip]

--
Best Regards,
Huang, Ying
Donet Tom March 25, 2024, 5 a.m. UTC | #4
On 3/25/24 08:18, Huang, Ying wrote:
> Donet Tom <donettom@linux.ibm.com> writes:
>
>> On 3/22/24 14:02, Huang, Ying wrote:
>>> Donet Tom <donettom@linux.ibm.com> writes:
>>>
>>>> For MPOL_PREFERRED_MANY, if the executing node is in the policy node
>>>> mask, we allow numa migration to the executing nodes. If the executing
>>>> node is not in the policy node mask, we do not allow numa migration.
>>> Can we provide more information about this?  I suggest to use an
>>> example, for instance, pages may be distributed among multiple sockets
>>> unexpectedly.
>> Thank you for your suggestion. However, this commit message explains all the scenarios.
> Yes.  The commit message is correct and covers many cases.  What I
> suggested is to describe why we do that?  An examples can not covers all
> possibility, but it is easy to be understood.  For example, something as
> below?
>
> For example, on a 2-sockets system, there are N0, N1, N2 in socket 0, N3
> in socket 1.  N0, N1, N3 have fast memory and CPU, while N2 has slow
> memory and no CPU.  For a workload, we may use MPOL_PREFERRED_MANY with
> nodemask with N0 and N1 set because the workload runs on CPUs of socket
> 0 at most times.  Then, even if the workload runs on CPUs of N3
> occasionally, we will not try to migrate the workload pages from N2 to
> N3 because users may want to avoid cross-socket access as much as
> possible in the long term.

Thank you. I will change the commit message and post V4.

Thanks
Donet Tom

Donet Tom March 25, 2024, 5:02 a.m. UTC | #5
On 3/25/24 08:18, Huang, Ying wrote:
> Donet Tom <donettom@linux.ibm.com> writes:
>
>> On 3/22/24 14:02, Huang, Ying wrote:
>>> Donet Tom <donettom@linux.ibm.com> writes:
>>>
>>>> For MPOL_PREFERRED_MANY, if the executing node is in the policy node
>>>> mask, we allow numa migration to the executing nodes. If the executing
>>>> node is not in the policy node mask, we do not allow numa migration.
>>> Can we provide more information about this?  I suggest to use an
>>> example, for instance, pages may be distributed among multiple sockets
>>> unexpectedly.
>> Thank you for your suggestion. However, this commit message explains all the scenarios.
> Yes.  The commit message is correct and covers many cases.  What I
> suggested is to describe why we do that?  An examples can not covers all
> possibility, but it is easy to be understood.  For example, something as
> below?
>
> For example, on a 2-sockets system, there are N0, N1, N2 in socket 0, N3
> in socket 1.  N0, N1, N3 have fast memory and CPU, while N2 has slow
> memory and no CPU.  For a workload, we may use MPOL_PREFERRED_MANY with
> nodemask with N0 and N1 set because the workload runs on CPUs of socket
> 0 at most times.  Then, even if the workload runs on CPUs of N3
> occasionally, we will not try to migrate the workload pages from N2 to
> N3 because users may want to avoid cross-socket access as much as
> possible in the long term.
>
>> For example, Consider a system with 3 numa nodes (N0,N1 and N6).
>> N0 and N1 are tier1 DRAM nodes  and N6 is tier 2 PMEM node.
>>
>> Scenario 1: The process is executing on N1,
>>              If the executing node is in the policy node mask,
>>              Curr Loc Pages - The numa node where page present(folio node)
>> ==================================================================================
>> Process      Policy          Curr Loc Pages                 Observations
>> -----------------------------------------------------------------------------------
>> N1           N0 N1 N6              N0                   Pages Migrated from N0 to N1
>> N1           N0 N1 N6              N6                   Pages Migrated from N6 to N1
>> N1           N0 N1                 N1                   Pages Migrated from N1 to N6
> Pages are not Migrating ?

Sorry, this is a mistake. In this case, pages are not migrated.

Thanks
Donet.

Patch

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index aa48376e2d34..13100a290918 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1504,9 +1504,10 @@  static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
 	if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
 		return -EINVAL;
 	if (*flags & MPOL_F_NUMA_BALANCING) {
-		if (*mode != MPOL_BIND)
+		if (*mode == MPOL_BIND || *mode == MPOL_PREFERRED_MANY)
+			*flags |= (MPOL_F_MOF | MPOL_F_MORON);
+		else
 			return -EINVAL;
-		*flags |= (MPOL_F_MOF | MPOL_F_MORON);
 	}
 	return 0;
 }
@@ -2770,15 +2771,26 @@  int mpol_misplaced(struct folio *folio, struct vm_fault *vmf,
 		break;
 
 	case MPOL_BIND:
-		/* Optimize placement among multiple nodes via NUMA balancing */
+	case MPOL_PREFERRED_MANY:
+		/*
+		 * Even though MPOL_PREFERRED_MANY can allocate pages outside
+		 * policy nodemask we don't allow numa migration to nodes
+		 * outside policy nodemask for now. This is done so that if we
+		 * want demotion to slow memory to happen, before allocating
+		 * from some DRAM node say 'x', we will end up using a
+		 * MPOL_PREFERRED_MANY mask excluding node 'x'. In such scenario
+		 * we should not promote to node 'x' from slow memory node.
+		 */
 		if (pol->flags & MPOL_F_MORON) {
+			/*
+			 * Optimize placement among multiple nodes
+			 * via NUMA balancing
+			 */
 			if (node_isset(thisnid, pol->nodes))
 				break;
 			goto out;
 		}
-		fallthrough;
 
-	case MPOL_PREFERRED_MANY:
 		/*
 		 * use current page if in policy nodemask,
 		 * else select nearest allowed node, if any.
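
The effect of the first hunk can be illustrated with a small standalone
model of the flag check. This is a sketch only: it covers just the
MPOL_F_NUMA_BALANCING branch of sanitize_mpol_flags() as changed by the
patch, not the rest of the function, and the MODEL_* names and
model_sanitize_flags() are hypothetical. The numeric values are assumed to
mirror include/uapi/linux/mempolicy.h.

```c
#include <assert.h>
#include <errno.h>

/* Assumed to follow the order of the uapi mempolicy mode enum. */
enum model_mode {
	MODEL_MPOL_DEFAULT,
	MODEL_MPOL_PREFERRED,
	MODEL_MPOL_BIND,
	MODEL_MPOL_INTERLEAVE,
	MODEL_MPOL_LOCAL,
	MODEL_MPOL_PREFERRED_MANY,
};

/* Assumed flag values; illustrative only. */
#define MODEL_F_NUMA_BALANCING (1u << 13) /* requested by userspace */
#define MODEL_F_MOF            (1u << 3)  /* migrate on fault */
#define MODEL_F_MORON          (1u << 4)  /* migrate on protnone reference */

/*
 * Model of the patched check: NUMA balancing may be requested only with
 * MPOL_BIND or MPOL_PREFERRED_MANY. On success the internal migrate-on-fault
 * flags are added; otherwise -EINVAL is returned and flags are untouched.
 */
int model_sanitize_flags(int mode, unsigned int *flags)
{
	if (*flags & MODEL_F_NUMA_BALANCING) {
		if (mode == MODEL_MPOL_BIND || mode == MODEL_MPOL_PREFERRED_MANY)
			*flags |= (MODEL_F_MOF | MODEL_F_MORON);
		else
			return -EINVAL;
	}
	return 0;
}
```

Before the patch, the equivalent check accepted only MPOL_BIND, so the
MPOL_PREFERRED_MANY case in this model would have returned -EINVAL.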