diff mbox series

[v6,1/6] mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes

Message ID 1626077374-81682-2-git-send-email-feng.tang@intel.com (mailing list archive)
State New
Headers show
Series Introduce multi-preference mempolicy | expand

Commit Message

Feng Tang July 12, 2021, 8:09 a.m. UTC
From: Dave Hansen <dave.hansen@linux.intel.com>

The NUMA APIs currently allow passing in a "preferred node" as a
single bit set in a nodemask.  If more than one bit it set, bits
after the first are ignored.

This single node is generally OK for location-based NUMA where
memory being allocated will eventually be operated on by a single
CPU.  However, in systems with multiple memory types, folks want
to target a *type* of memory instead of a location.  For instance,
someone might want some high-bandwidth memory but do not care about
the CPU next to which it is allocated.  Or, they want a cheap,
high capacity allocation and want to target all NUMA nodes which
have persistent memory in volatile mode.  In both of these cases,
the application wants to target a *set* of nodes, but does not
want strict MPOL_BIND behavior as that could lead to OOM killer or
SIGSEGV.

So add MPOL_PREFERRED_MANY policy to support the multiple preferred
nodes requirement. This is not a pie-in-the-sky dream for an API.
This was a response to a specific ask of more than one group at Intel.
Specifically:

1. There are existing libraries that target memory types such as
   https://github.com/memkind/memkind.  These are known to suffer
   from SIGSEGV's when memory is low on targeted memory "kinds" that
   span more than one node.  The MCDRAM on a Xeon Phi in "Cluster on
   Die" mode is an example of this.
2. Volatile-use persistent memory users want to have a memory policy
   which is targeted at either "cheap and slow" (PMEM) or "expensive and
   fast" (DRAM).  However, they do not want to experience allocation
   failures when the targeted type is unavailable.
3. Allocate-then-run.  Generally, we let the process scheduler decide
   on which physical CPU to run a task.  That location provides a
   default allocation policy, and memory availability is not generally
   considered when placing tasks.  For situations where memory is
   valuable and constrained, some users want to allocate memory first,
   *then* allocate close compute resources to the allocation.  This is
   the reverse of the normal (CPU) model.  Accelerators such as GPUs
   that operate on core-mm-managed memory are interested in this model.

A check is added in sanitize_mpol_flags() to not permit 'prefer_many'
policy to be used for now, and will be removed in later patch after all
implementations for 'prefer_many' are ready, as suggested by Michal Hocko.

Link: https://lore.kernel.org/r/20200630212517.308045-4-ben.widawsky@intel.com
Co-developed-by: Ben Widawsky <ben.widawsky@intel.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 include/uapi/linux/mempolicy.h |  1 +
 mm/mempolicy.c                 | 44 +++++++++++++++++++++++++++++++++++++-----
 2 files changed, 40 insertions(+), 5 deletions(-)

Comments

Michal Hocko July 28, 2021, 12:31 p.m. UTC | #1
[Sorry for a late review]

On Mon 12-07-21 16:09:29, Feng Tang wrote:
[...]
> @@ -1887,7 +1909,8 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
>  /* Return the node id preferred by the given mempolicy, or the given id */
>  static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
>  {
> -	if (policy->mode == MPOL_PREFERRED) {
> +	if (policy->mode == MPOL_PREFERRED ||
> +	    policy->mode == MPOL_PREFERRED_MANY) {
>  		nd = first_node(policy->nodes);
>  	} else {
>  		/*

Do we really want to have the preferred node to be always the first node
in the node mask? Shouldn't that strive for a locality as well? Existing
callers already prefer numa_node_id() - aka local node - and I belive we
shouldn't just throw that away here.

> @@ -1931,6 +1954,7 @@ unsigned int mempolicy_slab_node(void)
>  
>  	switch (policy->mode) {
>  	case MPOL_PREFERRED:
> +	case MPOL_PREFERRED_MANY:
>  		return first_node(policy->nodes);

Similarly here but I am not really familiar with the slab numa code
enough to have strong opinions here.

> @@ -2173,10 +2198,12 @@ struct page *alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  		 * node and don't fall back to other nodes, as the cost of
>  		 * remote accesses would likely offset THP benefits.
>  		 *
> -		 * If the policy is interleave, or does not allow the current
> -		 * node in its nodemask, we allocate the standard way.
> +		 * If the policy is interleave or multiple preferred nodes, or
> +		 * does not allow the current node in its nodemask, we allocate
> +		 * the standard way.
>  		 */
> -		if (pol->mode == MPOL_PREFERRED)
> +		if ((pol->mode == MPOL_PREFERRED ||
> +		     pol->mode == MPOL_PREFERRED_MANY))
>  			hpage_node = first_node(pol->nodes);

Same here.

> @@ -2451,6 +2479,9 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
>  		break;
>  
>  	case MPOL_PREFERRED:
> +	case MPOL_PREFERRED_MANY:
> +		if (node_isset(curnid, pol->nodes))
> +			goto out;
>  		polnid = first_node(pol->nodes);
>  		break;

I do not follow what is the point of using first_node here. Either the
node is in the mask or it is misplaced. What are you trying to achieve
here?
Feng Tang July 28, 2021, 2:11 p.m. UTC | #2
On Wed, Jul 28, 2021 at 02:31:03PM +0200, Michal Hocko wrote:
> [Sorry for a late review]

Not at all. Thank you for all your reviews and suggestions from v1
to v6!

> On Mon 12-07-21 16:09:29, Feng Tang wrote:
> [...]
> > @@ -1887,7 +1909,8 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> >  /* Return the node id preferred by the given mempolicy, or the given id */
> >  static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
> >  {
> > -	if (policy->mode == MPOL_PREFERRED) {
> > +	if (policy->mode == MPOL_PREFERRED ||
> > +	    policy->mode == MPOL_PREFERRED_MANY) {
> >  		nd = first_node(policy->nodes);
> >  	} else {
> >  		/*
> 
> Do we really want to have the preferred node to be always the first node
> in the node mask? Shouldn't that strive for a locality as well? Existing
> callers already prefer numa_node_id() - aka local node - and I belive we
> shouldn't just throw that away here.
 
I think it's about the difference of 'local' and 'prefer/perfer-many'
policy. There are different kinds of memory HW: HBM(High Bandwidth
Memory), normal DRAM, PMEM (Persistent Memory), which have different
price, bandwidth, speed etc. A platform may have two, or all three of
these types, and there are real use case which want memory comes
'preferred' node/nodes than the local node.

And good point for 'local node', if the 'prefer-many' policy's
nodemask has local node set, we should pick it han this
'first_node', and the same semantic also applies to the other
several places you pointed out. Or do I misunderstand you point?

Thanks,
Feng

> > @@ -1931,6 +1954,7 @@ unsigned int mempolicy_slab_node(void)
> >  
> >  	switch (policy->mode) {
> >  	case MPOL_PREFERRED:
> > +	case MPOL_PREFERRED_MANY:
> >  		return first_node(policy->nodes);
> 
> Similarly here but I am not really familiar with the slab numa code
> enough to have strong opinions here.
> 
> > @@ -2173,10 +2198,12 @@ struct page *alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
> >  		 * node and don't fall back to other nodes, as the cost of
> >  		 * remote accesses would likely offset THP benefits.
> >  		 *
> > -		 * If the policy is interleave, or does not allow the current
> > -		 * node in its nodemask, we allocate the standard way.
> > +		 * If the policy is interleave or multiple preferred nodes, or
> > +		 * does not allow the current node in its nodemask, we allocate
> > +		 * the standard way.
> >  		 */
> > -		if (pol->mode == MPOL_PREFERRED)
> > +		if ((pol->mode == MPOL_PREFERRED ||
> > +		     pol->mode == MPOL_PREFERRED_MANY))
> >  			hpage_node = first_node(pol->nodes);
> 
> Same here.
> 
> > @@ -2451,6 +2479,9 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
> >  		break;
> >  
> >  	case MPOL_PREFERRED:
> > +	case MPOL_PREFERRED_MANY:
> > +		if (node_isset(curnid, pol->nodes))
> > +			goto out;
> >  		polnid = first_node(pol->nodes);
> >  		break;
> 
> I do not follow what is the point of using first_node here. Either the
> node is in the mask or it is misplaced. What are you trying to achieve
> here?
> -- 
> Michal Hocko
> SUSE Labs
Michal Hocko July 28, 2021, 4:12 p.m. UTC | #3
On Wed 28-07-21 22:11:56, Feng Tang wrote:
> On Wed, Jul 28, 2021 at 02:31:03PM +0200, Michal Hocko wrote:
> > [Sorry for a late review]
> 
> Not at all. Thank you for all your reviews and suggestions from v1
> to v6!
> 
> > On Mon 12-07-21 16:09:29, Feng Tang wrote:
> > [...]
> > > @@ -1887,7 +1909,8 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> > >  /* Return the node id preferred by the given mempolicy, or the given id */
> > >  static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
> > >  {
> > > -	if (policy->mode == MPOL_PREFERRED) {
> > > +	if (policy->mode == MPOL_PREFERRED ||
> > > +	    policy->mode == MPOL_PREFERRED_MANY) {
> > >  		nd = first_node(policy->nodes);
> > >  	} else {
> > >  		/*
> > 
> > Do we really want to have the preferred node to be always the first node
> > in the node mask? Shouldn't that strive for a locality as well? Existing
> > callers already prefer numa_node_id() - aka local node - and I belive we
> > shouldn't just throw that away here.
>  
> I think it's about the difference of 'local' and 'prefer/perfer-many'
> policy. There are different kinds of memory HW: HBM(High Bandwidth
> Memory), normal DRAM, PMEM (Persistent Memory), which have different
> price, bandwidth, speed etc. A platform may have two, or all three of
> these types, and there are real use case which want memory comes
> 'preferred' node/nodes than the local node.
> 
> And good point for 'local node', if the 'prefer-many' policy's
> nodemask has local node set, we should pick it han this
> 'first_node', and the same semantic also applies to the other
> several places you pointed out. Or do I misunderstand you point?

Yeah. Essentially what I am trying to tell is that for
MPOL_PREFERRED_MANY you simply want to return the given node without any
alternation. That node will be used for the fallback zonelist and the
nodemask would make sure we won't get out of the policy.
Feng Tang July 29, 2021, 7:09 a.m. UTC | #4
On Wed, Jul 28, 2021 at 06:12:21PM +0200, Michal Hocko wrote:
> On Wed 28-07-21 22:11:56, Feng Tang wrote:
> > On Wed, Jul 28, 2021 at 02:31:03PM +0200, Michal Hocko wrote:
> > > [Sorry for a late review]
> > 
> > Not at all. Thank you for all your reviews and suggestions from v1
> > to v6!
> > 
> > > On Mon 12-07-21 16:09:29, Feng Tang wrote:
> > > [...]
> > > > @@ -1887,7 +1909,8 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> > > >  /* Return the node id preferred by the given mempolicy, or the given id */
> > > >  static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
> > > >  {
> > > > -	if (policy->mode == MPOL_PREFERRED) {
> > > > +	if (policy->mode == MPOL_PREFERRED ||
> > > > +	    policy->mode == MPOL_PREFERRED_MANY) {
> > > >  		nd = first_node(policy->nodes);
> > > >  	} else {
> > > >  		/*
> > > 
> > > Do we really want to have the preferred node to be always the first node
> > > in the node mask? Shouldn't that strive for a locality as well? Existing
> > > callers already prefer numa_node_id() - aka local node - and I belive we
> > > shouldn't just throw that away here.
> >  
> > I think it's about the difference of 'local' and 'prefer/perfer-many'
> > policy. There are different kinds of memory HW: HBM(High Bandwidth
> > Memory), normal DRAM, PMEM (Persistent Memory), which have different
> > price, bandwidth, speed etc. A platform may have two, or all three of
> > these types, and there are real use case which want memory comes
> > 'preferred' node/nodes than the local node.
> > 
> > And good point for 'local node', if the 'prefer-many' policy's
> > nodemask has local node set, we should pick it han this
> > 'first_node', and the same semantic also applies to the other
> > several places you pointed out. Or do I misunderstand you point?
> 
> Yeah. Essentially what I am trying to tell is that for
> MPOL_PREFERRED_MANY you simply want to return the given node without any
> alternation. That node will be used for the fallback zonelist and the
> nodemask would make sure we won't get out of the policy.

I think I got your point now :)

With current mainline code, the 'prefer' policy will return the preferred
node.

For 'prefer-many', we would like to keep the similar semantic, that the
preference of node is 'preferred' > 'local' > all other nodes. There is
some customer use case, whose platform has both DRAM and cheaper, bigger
and slower PMEM, and they anlayzed the hotness of their huge data, and
they want to put huge cold data into the PMEM, and only fallback to DRAM
as the last step. The HW topology could be simplified like this:

Socket 0:  Node 0 (CPU + 64GB DRAM), Node 2 (512GB PMEM)
Socket 1:  Node 1 (CPU + 64GB DRAM), Node 3 (512GB PMEM)

E.g they want to allocate memory for colde application data with
'prefer-many' policy + 0xC nodemask (N2+N3 PMEM nodes), so no matter the
application is running on Node 0 or Node 1, the 'local' node only has DRAM
which is not their preference, and want a preferred-->local-->others order. 

Thanks,
Feng

> -- 
> Michal Hocko
> SUSE Labs
Michal Hocko July 29, 2021, 1:38 p.m. UTC | #5
On Thu 29-07-21 15:09:18, Feng Tang wrote:
> On Wed, Jul 28, 2021 at 06:12:21PM +0200, Michal Hocko wrote:
> > On Wed 28-07-21 22:11:56, Feng Tang wrote:
> > > On Wed, Jul 28, 2021 at 02:31:03PM +0200, Michal Hocko wrote:
> > > > [Sorry for a late review]
> > > 
> > > Not at all. Thank you for all your reviews and suggestions from v1
> > > to v6!
> > > 
> > > > On Mon 12-07-21 16:09:29, Feng Tang wrote:
> > > > [...]
> > > > > @@ -1887,7 +1909,8 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> > > > >  /* Return the node id preferred by the given mempolicy, or the given id */
> > > > >  static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
> > > > >  {
> > > > > -	if (policy->mode == MPOL_PREFERRED) {
> > > > > +	if (policy->mode == MPOL_PREFERRED ||
> > > > > +	    policy->mode == MPOL_PREFERRED_MANY) {
> > > > >  		nd = first_node(policy->nodes);
> > > > >  	} else {
> > > > >  		/*
> > > > 
> > > > Do we really want to have the preferred node to be always the first node
> > > > in the node mask? Shouldn't that strive for a locality as well? Existing
> > > > callers already prefer numa_node_id() - aka local node - and I belive we
> > > > shouldn't just throw that away here.
> > >  
> > > I think it's about the difference of 'local' and 'prefer/perfer-many'
> > > policy. There are different kinds of memory HW: HBM(High Bandwidth
> > > Memory), normal DRAM, PMEM (Persistent Memory), which have different
> > > price, bandwidth, speed etc. A platform may have two, or all three of
> > > these types, and there are real use case which want memory comes
> > > 'preferred' node/nodes than the local node.
> > > 
> > > And good point for 'local node', if the 'prefer-many' policy's
> > > nodemask has local node set, we should pick it han this
> > > 'first_node', and the same semantic also applies to the other
> > > several places you pointed out. Or do I misunderstand you point?
> > 
> > Yeah. Essentially what I am trying to tell is that for
> > MPOL_PREFERRED_MANY you simply want to return the given node without any
> > alternation. That node will be used for the fallback zonelist and the
> > nodemask would make sure we won't get out of the policy.
> 
> I think I got your point now :)
> 
> With current mainline code, the 'prefer' policy will return the preferred
> node.

Yes this makes sense as there is only one node.

> For 'prefer-many', we would like to keep the similar semantic, that the
> preference of node is 'preferred' > 'local' > all other nodes.

Yes but which of the preferred nodes you want to start with. Say your
nodemask preferring nodes 0 and 2 with the following topology
	0	1	2	3
0	10	30	20	30
1	30	10	20	30
2	20	30	10	30
3	30	30	30	10

And say you are running on cpu 1. I believe you want your allocation
preferably from node 2 rathern than 0, right? With your approach you
would start with node 0 which would be more distant from cpu 1. Also the
semantic to give nodes some ordering based on their numbers sounds
rather weird to me.

The semantic I am proposing is to allocate from prefered nodes in
distance order starting from the local node.
Feng Tang July 29, 2021, 3:12 p.m. UTC | #6
On Thu, Jul 29, 2021 at 03:38:44PM +0200, Michal Hocko wrote:
> On Thu 29-07-21 15:09:18, Feng Tang wrote:
> > On Wed, Jul 28, 2021 at 06:12:21PM +0200, Michal Hocko wrote:
> > > On Wed 28-07-21 22:11:56, Feng Tang wrote:
> > > > On Wed, Jul 28, 2021 at 02:31:03PM +0200, Michal Hocko wrote:
> > > > > [Sorry for a late review]
> > > > 
> > > > Not at all. Thank you for all your reviews and suggestions from v1
> > > > to v6!
> > > > 
> > > > > On Mon 12-07-21 16:09:29, Feng Tang wrote:
> > > > > [...]
> > > > > > @@ -1887,7 +1909,8 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> > > > > >  /* Return the node id preferred by the given mempolicy, or the given id */
> > > > > >  static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
> > > > > >  {
> > > > > > -	if (policy->mode == MPOL_PREFERRED) {
> > > > > > +	if (policy->mode == MPOL_PREFERRED ||
> > > > > > +	    policy->mode == MPOL_PREFERRED_MANY) {
> > > > > >  		nd = first_node(policy->nodes);
> > > > > >  	} else {
> > > > > >  		/*
> > > > > 
> > > > > Do we really want to have the preferred node to be always the first node
> > > > > in the node mask? Shouldn't that strive for a locality as well? Existing
> > > > > callers already prefer numa_node_id() - aka local node - and I belive we
> > > > > shouldn't just throw that away here.
> > > >  
> > > > I think it's about the difference of 'local' and 'prefer/perfer-many'
> > > > policy. There are different kinds of memory HW: HBM(High Bandwidth
> > > > Memory), normal DRAM, PMEM (Persistent Memory), which have different
> > > > price, bandwidth, speed etc. A platform may have two, or all three of
> > > > these types, and there are real use case which want memory comes
> > > > 'preferred' node/nodes than the local node.
> > > > 
> > > > And good point for 'local node', if the 'prefer-many' policy's
> > > > nodemask has local node set, we should pick it han this
> > > > 'first_node', and the same semantic also applies to the other
> > > > several places you pointed out. Or do I misunderstand you point?
> > > 
> > > Yeah. Essentially what I am trying to tell is that for
> > > MPOL_PREFERRED_MANY you simply want to return the given node without any
> > > alternation. That node will be used for the fallback zonelist and the
> > > nodemask would make sure we won't get out of the policy.
> > 
> > I think I got your point now :)
> > 
> > With current mainline code, the 'prefer' policy will return the preferred
> > node.
> 
> Yes this makes sense as there is only one node.
> 
> > For 'prefer-many', we would like to keep the similar semantic, that the
> > preference of node is 'preferred' > 'local' > all other nodes.
> 
> Yes but which of the preferred nodes you want to start with. Say your
> nodemask preferring nodes 0 and 2 with the following topology
> 	0	1	2	3
> 0	10	30	20	30
> 1	30	10	20	30
> 2	20	30	10	30
> 3	30	30	30	10
> 
> And say you are running on cpu 1. I believe you want your allocation
> preferably from node 2 rathern than 0, right?

Yes, and in one earlier reply, I had a similar thought
https://lore.kernel.org/lkml/20210728152507.GE43486@shbuild999.sh.intel.com/

  "
  One further thought is, if local node is not in the nodemask,
  should we compare the distance of all the nodes in nodemask
  to the local node and chose the shortest?
  "
And we may add a new API if there is no existing one:
	int cloest_node(int nid, nodemask_t *nmask);
to pick the best node from 'prefer-many' nodemsk.

> With your approach you
> would start with node 0 which would be more distant from cpu 1. 

> Also the
> semantic to give nodes some ordering based on their numbers sounds
> rather weird to me.

I agree, and as I admitted in the first reply, this need to be fixed.

> The semantic I am proposing is to allocate from prefered nodes in
> distance order starting from the local node.

So the plan is:
* if the local node is set in 'prefer-many's nodemask, then chose
* otherwise chose the node with the shortest distance to local node
?

Thanks,
Feng

> -- 
> Michal Hocko
> SUSE Labs
Michal Hocko July 29, 2021, 4:21 p.m. UTC | #7
On Thu 29-07-21 23:12:42, Feng Tang wrote:
> On Thu, Jul 29, 2021 at 03:38:44PM +0200, Michal Hocko wrote:
[...]
> > Also the
> > semantic to give nodes some ordering based on their numbers sounds
> > rather weird to me.
> 
> I agree, and as I admitted in the first reply, this need to be fixed.

OK. I was not really clear that we are on the same page here.

> > The semantic I am proposing is to allocate from prefered nodes in
> > distance order starting from the local node.
> 
> So the plan is:
> * if the local node is set in 'prefer-many's nodemask, then chose
> * otherwise chose the node with the shortest distance to local node
> ?

Yes and what I am trying to say is that you will achieve that simply by
doing the following in policy_node:
	if (policy->mode == MPOL_PREFERRED_MANY)
		return nd;
Feng Tang July 30, 2021, 3:05 a.m. UTC | #8
On Thu, Jul 29, 2021 at 06:21:19PM +0200, Michal Hocko wrote:
> On Thu 29-07-21 23:12:42, Feng Tang wrote:
> > On Thu, Jul 29, 2021 at 03:38:44PM +0200, Michal Hocko wrote:
> [...]
> > > Also the
> > > semantic to give nodes some ordering based on their numbers sounds
> > > rather weird to me.
> > 
> > I agree, and as I admitted in the first reply, this need to be fixed.
> 
> OK. I was not really clear that we are on the same page here.
> 
> > > The semantic I am proposing is to allocate from prefered nodes in
> > > distance order starting from the local node.
> > 
> > So the plan is:
> > * if the local node is set in 'prefer-many's nodemask, then chose
> > * otherwise chose the node with the shortest distance to local node
> > ?
> 
> Yes and what I am trying to say is that you will achieve that simply by
> doing the following in policy_node:
> 	if (policy->mode == MPOL_PREFERRED_MANY)
> 		return nd;

One thing is, it's possible that 'nd' is not set in the preferred
nodemask. 

For policy_node(), most of its caller use the local node id as 'nd'
parameter. For HBM and PMEM memory nodes, they are cpuless nodes,
so they will not be a 'local node', but some use cases only prefer
these nodes.

Thanks,
Feng

> -- 
> Michal Hocko
> SUSE Labs
Michal Hocko July 30, 2021, 6:36 a.m. UTC | #9
On Fri 30-07-21 11:05:02, Feng Tang wrote:
> On Thu, Jul 29, 2021 at 06:21:19PM +0200, Michal Hocko wrote:
> > On Thu 29-07-21 23:12:42, Feng Tang wrote:
> > > On Thu, Jul 29, 2021 at 03:38:44PM +0200, Michal Hocko wrote:
> > [...]
> > > > Also the
> > > > semantic to give nodes some ordering based on their numbers sounds
> > > > rather weird to me.
> > > 
> > > I agree, and as I admitted in the first reply, this need to be fixed.
> > 
> > OK. I was not really clear that we are on the same page here.
> > 
> > > > The semantic I am proposing is to allocate from prefered nodes in
> > > > distance order starting from the local node.
> > > 
> > > So the plan is:
> > > * if the local node is set in 'prefer-many's nodemask, then chose
> > > * otherwise chose the node with the shortest distance to local node
> > > ?
> > 
> > Yes and what I am trying to say is that you will achieve that simply by
> > doing the following in policy_node:
> > 	if (policy->mode == MPOL_PREFERRED_MANY)
> > 		return nd;
> 
> One thing is, it's possible that 'nd' is not set in the preferred
> nodemask. 

Yes, and there shouldn't be any problem with that.  The given node is
only used to get the respective zonelist (order distance ordered list of
zones to try). get_page_from_freelist will then use the preferred node
mask to filter this zone list. Is that more clear now?
Feng Tang July 30, 2021, 7:18 a.m. UTC | #10
On Fri, Jul 30, 2021 at 08:36:50AM +0200, Michal Hocko wrote:
> On Fri 30-07-21 11:05:02, Feng Tang wrote:
> > On Thu, Jul 29, 2021 at 06:21:19PM +0200, Michal Hocko wrote:
> > > On Thu 29-07-21 23:12:42, Feng Tang wrote:
> > > > On Thu, Jul 29, 2021 at 03:38:44PM +0200, Michal Hocko wrote:
> > > [...]
> > > > > Also the
> > > > > semantic to give nodes some ordering based on their numbers sounds
> > > > > rather weird to me.
> > > > 
> > > > I agree, and as I admitted in the first reply, this need to be fixed.
> > > 
> > > OK. I was not really clear that we are on the same page here.
> > > 
> > > > > The semantic I am proposing is to allocate from prefered nodes in
> > > > > distance order starting from the local node.
> > > > 
> > > > So the plan is:
> > > > * if the local node is set in 'prefer-many's nodemask, then chose
> > > > * otherwise chose the node with the shortest distance to local node
> > > > ?
> > > 
> > > Yes and what I am trying to say is that you will achieve that simply by
> > > doing the following in policy_node:
> > > 	if (policy->mode == MPOL_PREFERRED_MANY)
> > > 		return nd;
> > 
> > One thing is, it's possible that 'nd' is not set in the preferred
> > nodemask. 
> 
> Yes, and there shouldn't be any problem with that.  The given node is
> only used to get the respective zonelist (order distance ordered list of
> zones to try). get_page_from_freelist will then use the preferred node
> mask to filter this zone list. Is that more clear now?

Yes, from the code, the policy_node() is always coupled with
policy_nodemask(), which secures the 'nodemask' limit. Thanks for
the clarification!

And for the mempolicy_slab_node(), it seems to be a little different,
and we may need to reuse its logic for 'bind' policy, which is similar
to what we've discussed, pick a nearest node to the local node. And
similar for mpol_misplaced(). Thoughts?

Thanks,
Feng

> -- 
> Michal Hocko
> SUSE Labs
Michal Hocko July 30, 2021, 7:38 a.m. UTC | #11
On Fri 30-07-21 15:18:40, Feng Tang wrote:
[...]
> And for the mempolicy_slab_node(), it seems to be a little different,
> and we may need to reuse its logic for 'bind' policy, which is similar
> to what we've discussed, pick a nearest node to the local node. And
> similar for mpol_misplaced(). Thoughts?

I have to admit, I haven't looked closer to what slab does. I am not
familiar with internals and would have to study it some more. Maybe slab
maintainers have an idea.
Feng Tang Aug. 2, 2021, 8:11 a.m. UTC | #12
On Fri, Jul 30, 2021 at 03:18:40PM +0800, Tang, Feng wrote:
[snip]
> > > One thing is, it's possible that 'nd' is not set in the preferred
> > > nodemask. 
> > 
> > Yes, and there shouldn't be any problem with that.  The given node is
> > only used to get the respective zonelist (order distance ordered list of
> > zones to try). get_page_from_freelist will then use the preferred node
> > mask to filter this zone list. Is that more clear now?
> 
> Yes, from the code, the policy_node() is always coupled with
> policy_nodemask(), which secures the 'nodemask' limit. Thanks for
> the clarification!

Hi Michal,

To ensure the nodemask limit, the policy_nodemask() also needs some
change to return the nodemask for 'prefer-many' policy, so here is a
updated 1/6 patch, which mainly changes the node/nodemask selection
for 'prefer-many' policy, could you review it? thanks!

- Feng

------8<-------------------------------------------------------
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 19a00bc7fe86..046d0ccba4cd 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -22,6 +22,7 @@ enum {
 	MPOL_BIND,
 	MPOL_INTERLEAVE,
 	MPOL_LOCAL,
+	MPOL_PREFERRED_MANY,
 	MPOL_MAX,	/* always last member of enum */
 };
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index e675bfb856da..ea97530f86db 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -31,6 +31,9 @@
  *                but useful to set in a VMA when you have a non default
  *                process policy.
  *
+ * preferred many Try a set of nodes first before normal fallback. This is
+ *                similar to preferred without the special case.
+ *
  * default        Allocate on the local node first, or when on a VMA
  *                use the process policy. This is what Linux always did
  *		  in a NUMA aware kernel and still does by, ahem, default.
@@ -207,6 +210,14 @@ static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
 	return 0;
 }
 
+static int mpol_new_preferred_many(struct mempolicy *pol, const nodemask_t *nodes)
+{
+	if (nodes_empty(*nodes))
+		return -EINVAL;
+	pol->nodes = *nodes;
+	return 0;
+}
+
 static int mpol_new_bind(struct mempolicy *pol, const nodemask_t *nodes)
 {
 	if (nodes_empty(*nodes))
@@ -408,6 +419,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
 	[MPOL_LOCAL] = {
 		.rebind = mpol_rebind_default,
 	},
+	[MPOL_PREFERRED_MANY] = {
+		.create = mpol_new_preferred_many,
+		.rebind = mpol_rebind_preferred,
+	},
 };
 
 static int migrate_page_add(struct page *page, struct list_head *pagelist,
@@ -900,6 +915,7 @@ static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
 	case MPOL_PREFERRED:
+	case MPOL_PREFERRED_MANY:
 		*nodes = p->nodes;
 		break;
 	case MPOL_LOCAL:
@@ -1446,7 +1462,13 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
 {
 	*flags = *mode & MPOL_MODE_FLAGS;
 	*mode &= ~MPOL_MODE_FLAGS;
-	if ((unsigned int)(*mode) >= MPOL_MAX)
+
+	/*
+	 * The check should be 'mode >= MPOL_MAX', but as 'prefer_many'
+	 * is not fully implemented, don't permit it to be used for now,
+	 * and the logic will be restored in following patch
+	 */
+	if ((unsigned int)(*mode) >=  MPOL_PREFERRED_MANY)
 		return -EINVAL;
 	if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
 		return -EINVAL;
@@ -1875,8 +1897,13 @@ static int apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
  */
 nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
 {
-	/* Lower zones don't get a nodemask applied for MPOL_BIND */
-	if (unlikely(policy->mode == MPOL_BIND) &&
+	int mode = policy->mode;
+
+	/*
+	 * Lower zones don't get a nodemask applied for 'bind' and
+	 * 'prefer-many' policies
+	 */
+	if (unlikely(mode == MPOL_BIND || mode == MPOL_PREFERRED_MANY) &&
 			apply_policy_zone(policy, gfp_zone(gfp)) &&
 			cpuset_nodemask_valid_mems_allowed(&policy->nodes))
 		return &policy->nodes;
@@ -1884,7 +1911,13 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
 	return NULL;
 }
 
-/* Return the node id preferred by the given mempolicy, or the given id */
+/*
+ * Return the  preferred node id for 'prefer' mempolicy, and return
+ * the given id for all other policies.
+ *
+ * policy_node() is always coupbled with policy_nodemask(), which
+ * secures the the nodemask limit for 'bind' and 'prefer-many' policy.
+ */
 static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
 {
 	if (policy->mode == MPOL_PREFERRED) {
@@ -1936,7 +1969,9 @@ unsigned int mempolicy_slab_node(void)
 	case MPOL_INTERLEAVE:
 		return interleave_nodes(policy);
 
-	case MPOL_BIND: {
+	case MPOL_BIND:
+	case MPOL_PREFERRED_MANY:
+	{
 		struct zoneref *z;
 
 		/*
@@ -2008,7 +2043,7 @@ static inline unsigned interleave_nid(struct mempolicy *pol,
  * @addr: address in @vma for shared policy lookup and interleave policy
  * @gfp_flags: for requested zone
  * @mpol: pointer to mempolicy pointer for reference counted mempolicy
- * @nodemask: pointer to nodemask pointer for MPOL_BIND nodemask
+ * @nodemask: pointer to nodemask pointer for 'bind' and 'prefer-many' policy
  *
  * Returns a nid suitable for a huge page allocation and a pointer
  * to the struct mempolicy for conditional unref after allocation.
@@ -2021,16 +2056,18 @@ int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags,
 				struct mempolicy **mpol, nodemask_t **nodemask)
 {
 	int nid;
+	int mode;
 
 	*mpol = get_vma_policy(vma, addr);
-	*nodemask = NULL;	/* assume !MPOL_BIND */
+	*nodemask = NULL;
+	mode = (*mpol)->mode;
 
-	if (unlikely((*mpol)->mode == MPOL_INTERLEAVE)) {
+	if (unlikely(mode == MPOL_INTERLEAVE)) {
 		nid = interleave_nid(*mpol, vma, addr,
 					huge_page_shift(hstate_vma(vma)));
 	} else {
 		nid = policy_node(gfp_flags, *mpol, numa_node_id());
-		if ((*mpol)->mode == MPOL_BIND)
+		if (mode == MPOL_BIND || mode == MPOL_PREFERRED_MANY)
 			*nodemask = &(*mpol)->nodes;
 	}
 	return nid;
@@ -2063,6 +2100,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
 	mempolicy = current->mempolicy;
 	switch (mempolicy->mode) {
 	case MPOL_PREFERRED:
+	case MPOL_PREFERRED_MANY:
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
 		*mask = mempolicy->nodes;
@@ -2173,7 +2211,7 @@ struct page *alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		 * node and don't fall back to other nodes, as the cost of
 		 * remote accesses would likely offset THP benefits.
 		 *
-		 * If the policy is interleave, or does not allow the current
+		 * If the policy is interleave or does not allow the current
 		 * node in its nodemask, we allocate the standard way.
 		 */
 		if (pol->mode == MPOL_PREFERRED)
@@ -2311,6 +2349,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
 	case MPOL_PREFERRED:
+	case MPOL_PREFERRED_MANY:
 		return !!nodes_equal(a->nodes, b->nodes);
 	case MPOL_LOCAL:
 		return true;
@@ -2451,6 +2490,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		break;
 
 	case MPOL_PREFERRED:
+		if (node_isset(curnid, pol->nodes))
+			goto out;
 		polnid = first_node(pol->nodes);
 		break;
 
@@ -2465,9 +2506,10 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 				break;
 			goto out;
 		}
+		fallthrough;
 
+	case MPOL_PREFERRED_MANY:
 		/*
-		 * allows binding to multiple nodes.
 		 * use current page if in policy nodemask,
 		 * else select nearest allowed node, if any.
 		 * If no allowed nodes, use current [!misplaced].
@@ -2829,6 +2871,7 @@ static const char * const policy_modes[] =
 	[MPOL_BIND]       = "bind",
 	[MPOL_INTERLEAVE] = "interleave",
 	[MPOL_LOCAL]      = "local",
+	[MPOL_PREFERRED_MANY]  = "prefer (many)",
 };
 
 
@@ -2907,6 +2950,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
 		if (!nodelist)
 			err = 0;
 		goto out;
+	case MPOL_PREFERRED_MANY:
 	case MPOL_BIND:
 		/*
 		 * Insist on a nodelist
@@ -2993,6 +3037,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
 	case MPOL_LOCAL:
 		break;
 	case MPOL_PREFERRED:
+	case MPOL_PREFERRED_MANY:
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
 		nodes = pol->nodes;
Michal Hocko Aug. 2, 2021, 11:14 a.m. UTC | #13
On Mon 02-08-21 16:11:30, Feng Tang wrote:
> On Fri, Jul 30, 2021 at 03:18:40PM +0800, Tang, Feng wrote:
> [snip]
> > > > One thing is, it's possible that 'nd' is not set in the preferred
> > > > nodemask. 
> > > 
> > > Yes, and there shouldn't be any problem with that.  The given node is
> > > only used to get the respective zonelist (order distance ordered list of
> > > zones to try). get_page_from_freelist will then use the preferred node
> > > mask to filter this zone list. Is that more clear now?
> > 
> > Yes, from the code, the policy_node() is always coupled with
> > policy_nodemask(), which secures the 'nodemask' limit. Thanks for
> > the clarification!
> 
> Hi Michal,
> 
> To ensure the nodemask limit, the policy_nodemask() also needs some
> change to return the nodemask for 'prefer-many' policy, so here is a
> updated 1/6 patch, which mainly changes the node/nodemask selection
> for 'prefer-many' policy, could you review it? thanks!

right, I have mixed it with get_policy_nodemask

> @@ -1875,8 +1897,13 @@ static int apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
>   */
>  nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
>  {
> -	/* Lower zones don't get a nodemask applied for MPOL_BIND */
> -	if (unlikely(policy->mode == MPOL_BIND) &&
> +	int mode = policy->mode;
> +
> +	/*
> +	 * Lower zones don't get a nodemask applied for 'bind' and
> +	 * 'prefer-many' policies
> +	 */
> +	if (unlikely(mode == MPOL_BIND || mode == MPOL_PREFERRED_MANY) &&
>  			apply_policy_zone(policy, gfp_zone(gfp)) &&
>  			cpuset_nodemask_valid_mems_allowed(&policy->nodes))
>  		return &policy->nodes;

Isn't this just too cryptic? Why didn't you simply
	if (mode == MPOL_PREFERRED_MANY)
		return &policy->mode;

in addition to the existing code? I mean why would you even care about
cpusets? Those are handled at the page allocator layer and will further
filter the given nodemask.
Feng Tang Aug. 2, 2021, 11:33 a.m. UTC | #14
On Mon, Aug 02, 2021 at 01:14:29PM +0200, Michal Hocko wrote:
> On Mon 02-08-21 16:11:30, Feng Tang wrote:
> > On Fri, Jul 30, 2021 at 03:18:40PM +0800, Tang, Feng wrote:
> > [snip]
> > > > > One thing is, it's possible that 'nd' is not set in the preferred
> > > > > nodemask. 
> > > > 
> > > > Yes, and there shouldn't be any problem with that.  The given node is
> > > > only used to get the respective zonelist (order distance ordered list of
> > > > zones to try). get_page_from_freelist will then use the preferred node
> > > > mask to filter this zone list. Is that more clear now?
> > > 
> > > Yes, from the code, the policy_node() is always coupled with
> > > policy_nodemask(), which secures the 'nodemask' limit. Thanks for
> > > the clarification!
> > 
> > Hi Michal,
> > 
> > To ensure the nodemask limit, the policy_nodemask() also needs some
> > change to return the nodemask for 'prefer-many' policy, so here is a
> > updated 1/6 patch, which mainly changes the node/nodemask selection
> > for 'prefer-many' policy, could you review it? thanks!
> 
> right, I have mixed it with get_policy_nodemask
> 
> > @@ -1875,8 +1897,13 @@ static int apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
> >   */
> >  nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> >  {
> > -	/* Lower zones don't get a nodemask applied for MPOL_BIND */
> > -	if (unlikely(policy->mode == MPOL_BIND) &&
> > +	int mode = policy->mode;
> > +
> > +	/*
> > +	 * Lower zones don't get a nodemask applied for 'bind' and
> > +	 * 'prefer-many' policies
> > +	 */
> > +	if (unlikely(mode == MPOL_BIND || mode == MPOL_PREFERRED_MANY) &&
> >  			apply_policy_zone(policy, gfp_zone(gfp)) &&
> >  			cpuset_nodemask_valid_mems_allowed(&policy->nodes))
> >  		return &policy->nodes;
> 
> Isn't this just too cryptic? Why didn't you simply
> 	if (mode == MPOL_PREFERRED_MANY)
> 		return &policy->mode;
> 
> in addition to the existing code? I mean why would you even care about
> cpusets? Those are handled at the page allocator layer and will further
> filter the given nodemask. 

Ok, I will follow your suggestion and keep 'bind' handling unchanged.

And to be honest, I don't fully understand the current handling for
'bind' policy, will the returning NULL for 'bind' policy open a
sideway for the strict 'bind' limit. 

Thanks,
Feng


> -- 
> Michal Hocko
> SUSE Labs
Michal Hocko Aug. 2, 2021, 11:47 a.m. UTC | #15
On Mon 02-08-21 19:33:26, Feng Tang wrote:
[...]
> And to be honest, I don't fully understand the current handling for
> 'bind' policy, will the returning NULL for 'bind' policy open a
> sideway for the strict 'bind' limit. 

I do not remember all the details but this is an old behavior that MBIND
policy doesn't apply to kernel allocations in presnce of the movable
zone. Detailed reasoning is not clear to me at the moment, maybe Mel
remembers?
diff mbox series

Patch

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 19a00bc7fe86..046d0ccba4cd 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -22,6 +22,7 @@  enum {
 	MPOL_BIND,
 	MPOL_INTERLEAVE,
 	MPOL_LOCAL,
+	MPOL_PREFERRED_MANY,
 	MPOL_MAX,	/* always last member of enum */
 };
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index e32360e90274..17b5800b7dcc 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -31,6 +31,9 @@ 
  *                but useful to set in a VMA when you have a non default
  *                process policy.
  *
+ * preferred many Try a set of nodes first before normal fallback. This is
+ *                similar to preferred without the special case.
+ *
  * default        Allocate on the local node first, or when on a VMA
  *                use the process policy. This is what Linux always did
  *		  in a NUMA aware kernel and still does by, ahem, default.
@@ -207,6 +210,14 @@  static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
 	return 0;
 }
 
+static int mpol_new_preferred_many(struct mempolicy *pol, const nodemask_t *nodes)
+{
+	if (nodes_empty(*nodes))
+		return -EINVAL;
+	pol->nodes = *nodes;
+	return 0;
+}
+
 static int mpol_new_bind(struct mempolicy *pol, const nodemask_t *nodes)
 {
 	if (nodes_empty(*nodes))
@@ -408,6 +419,10 @@  static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
 	[MPOL_LOCAL] = {
 		.rebind = mpol_rebind_default,
 	},
+	[MPOL_PREFERRED_MANY] = {
+		.create = mpol_new_preferred_many,
+		.rebind = mpol_rebind_preferred,
+	},
 };
 
 static int migrate_page_add(struct page *page, struct list_head *pagelist,
@@ -900,6 +915,7 @@  static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
 	case MPOL_PREFERRED:
+	case MPOL_PREFERRED_MANY:
 		*nodes = p->nodes;
 		break;
 	case MPOL_LOCAL:
@@ -1446,7 +1462,13 @@  static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
 {
 	*flags = *mode & MPOL_MODE_FLAGS;
 	*mode &= ~MPOL_MODE_FLAGS;
-	if ((unsigned int)(*mode) >= MPOL_MAX)
+
+	/*
+	 * The check should be 'mode >= MPOL_MAX', but as 'prefer_many'
+	 * is not fully implemented, don't permit it to be used for now,
+	 * and the logic will be restored in following patch
+	 */
+	if ((unsigned int)(*mode) >=  MPOL_PREFERRED_MANY)
 		return -EINVAL;
 	if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
 		return -EINVAL;
@@ -1887,7 +1909,8 @@  nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
 /* Return the node id preferred by the given mempolicy, or the given id */
 static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
 {
-	if (policy->mode == MPOL_PREFERRED) {
+	if (policy->mode == MPOL_PREFERRED ||
+	    policy->mode == MPOL_PREFERRED_MANY) {
 		nd = first_node(policy->nodes);
 	} else {
 		/*
@@ -1931,6 +1954,7 @@  unsigned int mempolicy_slab_node(void)
 
 	switch (policy->mode) {
 	case MPOL_PREFERRED:
+	case MPOL_PREFERRED_MANY:
 		return first_node(policy->nodes);
 
 	case MPOL_INTERLEAVE:
@@ -2063,6 +2087,7 @@  bool init_nodemask_of_mempolicy(nodemask_t *mask)
 	mempolicy = current->mempolicy;
 	switch (mempolicy->mode) {
 	case MPOL_PREFERRED:
+	case MPOL_PREFERRED_MANY:
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
 		*mask = mempolicy->nodes;
@@ -2173,10 +2198,12 @@  struct page *alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		 * node and don't fall back to other nodes, as the cost of
 		 * remote accesses would likely offset THP benefits.
 		 *
-		 * If the policy is interleave, or does not allow the current
-		 * node in its nodemask, we allocate the standard way.
+		 * If the policy is interleave or multiple preferred nodes, or
+		 * does not allow the current node in its nodemask, we allocate
+		 * the standard way.
 		 */
-		if (pol->mode == MPOL_PREFERRED)
+		if ((pol->mode == MPOL_PREFERRED ||
+		     pol->mode == MPOL_PREFERRED_MANY))
 			hpage_node = first_node(pol->nodes);
 
 		nmask = policy_nodemask(gfp, pol);
@@ -2311,6 +2338,7 @@  bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
 	case MPOL_PREFERRED:
+	case MPOL_PREFERRED_MANY:
 		return !!nodes_equal(a->nodes, b->nodes);
 	case MPOL_LOCAL:
 		return true;
@@ -2451,6 +2479,9 @@  int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		break;
 
 	case MPOL_PREFERRED:
+	case MPOL_PREFERRED_MANY:
+		if (node_isset(curnid, pol->nodes))
+			goto out;
 		polnid = first_node(pol->nodes);
 		break;
 
@@ -2829,6 +2860,7 @@  static const char * const policy_modes[] =
 	[MPOL_BIND]       = "bind",
 	[MPOL_INTERLEAVE] = "interleave",
 	[MPOL_LOCAL]      = "local",
+	[MPOL_PREFERRED_MANY]  = "prefer (many)",
 };
 
 
@@ -2907,6 +2939,7 @@  int mpol_parse_str(char *str, struct mempolicy **mpol)
 		if (!nodelist)
 			err = 0;
 		goto out;
+	case MPOL_PREFERRED_MANY:
 	case MPOL_BIND:
 		/*
 		 * Insist on a nodelist
@@ -2993,6 +3026,7 @@  void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
 	case MPOL_LOCAL:
 		break;
 	case MPOL_PREFERRED:
+	case MPOL_PREFERRED_MANY:
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
 		nodes = pol->nodes;