
[02/18] mm/mempolicy: Use node_mem_id() instead of node_id()

Message ID 20200619162425.1052382-3-ben.widawsky@intel.com (mailing list archive)
State New, archived
Series: multiple preferred nodes

Commit Message

Ben Widawsky June 19, 2020, 4:24 p.m. UTC
Calling out some distinctions first as I understand it, and the
reasoning of the patch:
numa_node_id() - The node id for the currently running CPU.
numa_mem_id() - The node id for the closest memory node.

The case where they are not the same is CONFIG_HAVE_MEMORYLESS_NODES.
Only ia64 and powerpc support this option, so it is perhaps not a very
interesting situation to most.

The question is, when do you want which? numa_node_id() is definitely
what's desired if MPOL_PREFERRED or MPOL_LOCAL were used, since the ABI
states "This mode specifies "local allocation"; the memory is allocated
on the node of the CPU that triggered the allocation (the "local
node")." It would be weird, though not impossible, to set this policy on
a CPU that has memoryless nodes. A more likely way to hit this is with
interleaving. The current interfaces will return some equally weird
thing, but at least it's symmetric. Therefore, in cases where the node
is being queried for the currently running process, it probably makes
sense to use numa_node_id(). For other cases, however, when the CPU is
trying to obtain the "local" memory, numa_mem_id() already contains this
and should be used instead.

This really should only affect configurations where
CONFIG_HAVE_MEMORYLESS_NODES=y, and even on those machines it's quite
possible the ultimate behavior would be identical.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 mm/mempolicy.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Michal Hocko June 24, 2020, 8:25 a.m. UTC | #1
On Fri 19-06-20 09:24:09, Ben Widawsky wrote:
> Calling out some distinctions first as I understand it, and the
> reasoning of the patch:
> numa_node_id() - The node id for the currently running CPU.
> numa_mem_id() - The node id for the closest memory node.

Correct

> The case where they are not the same is CONFIG_HAVE_MEMORYLESS_NODES.
> Only ia64 and powerpc support this option, so it is perhaps not a very
> interesting situation to most.

Other arches can have nodes without any memory as well. Just offline all
the managed memory via hotplug... (please note that such a node might
still have memory present! Just not usable by the page allocator)

> The question is, when do you want which?

Short answer is that you shouldn't really care. The fact that we do, and
that we have a distinct API for it, is IMHO a mistake of the past.

A slightly longer answer would be that the allocator should fall back to
a proper node (or nodes) even if you specify a memoryless node as the
preferred one. That is achieved by a proper zonelist for each existing
NUMA node. There are bugs here and there when some nodes do not get their
zonelists initialized, but fundamentally this should be a no-brainer.
There are also corner cases where an application might have been bound to
a node which went offline completely, which would be "fun".

> numa_node_id() is definitely
> what's desired if MPOL_PREFERRED, or MPOL_LOCAL were used, since the ABI
> states "This mode specifies "local allocation"; the memory is allocated
> on the node of the CPU that triggered the allocation (the "local
> node")."

In fact, from the allocator's point of view there is no real difference,
because there is obviously nothing to allocate from a node without
memory, so it would fall back to the next node/zones from the closest
node...

> It would be weird, though not impossible to set this policy on
> a CPU that has memoryless nodes.

Keep in mind that the memory might have gone away via hotplug.

> A more likely way to hit this is with
> interleaving. The current interfaces will return some equally weird
> thing, but at least it's symmetric. Therefore, in cases where the node
> is being queried for the currently running process, it probably makes
> sense to use numa_node_id(). For other cases however, when CPU is trying
> to obtain the "local" memory, numa_mem_id() already contains this and
> should be used instead.
> 
> This really should only affect configurations where
> CONFIG_HAVE_MEMORYLESS_NODES=y, and even on those machines it's quite
> possible the ultimate behavior would be identical.
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>

Well, mempolicy.c uses numa_node_id in most cases and I do not see these
two being special. So if we want to change that, it should be done
consistently. I even suspect that these changes are mostly nops because
the respective zonelists will do the right thing. But there might be land
mines here and there - e.g. if __GFP_THISNODE was used, then somebody
might expect a failure rather than a misplaced allocation, because there
is no other fallback mechanism on a depleted NUMA node.

All that being said I am not sure this is an actual improvement.

> ---
>  mm/mempolicy.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 36ee3267c25f..99e0f3f9c4a6 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1991,7 +1991,7 @@ static unsigned offset_il_node(struct mempolicy *pol, unsigned long n)
>  	int nid;
>  
>  	if (!nnodes)
> -		return numa_node_id();
> +		return numa_mem_id();
>  	target = (unsigned int)n % nnodes;
>  	nid = first_node(pol->v.nodes);
>  	for (i = 0; i < target; i++)
> @@ -2049,7 +2049,7 @@ int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags,
>  		nid = interleave_nid(*mpol, vma, addr,
>  					huge_page_shift(hstate_vma(vma)));
>  	} else {
> -		nid = policy_node(gfp_flags, *mpol, numa_node_id());
> +		nid = policy_node(gfp_flags, *mpol, numa_mem_id());
>  		if ((*mpol)->mode == MPOL_BIND)
>  			*nodemask = &(*mpol)->v.nodes;
>  	}
> -- 
> 2.27.0
>
Ben Widawsky June 24, 2020, 4:48 p.m. UTC | #2
On 20-06-24 10:25:59, Michal Hocko wrote:
> On Fri 19-06-20 09:24:09, Ben Widawsky wrote:
> > Calling out some distinctions first as I understand it, and the
> > reasoning of the patch:
> > numa_node_id() - The node id for the currently running CPU.
> > numa_mem_id() - The node id for the closest memory node.
> 
> Correct
> 
> > The case where they are not the same is CONFIG_HAVE_MEMORYLESS_NODES.
> > Only ia64 and powerpc support this option, so it is perhaps not a very
> > interesting situation to most.
> 
> Other arches can have nodes without any memory as well. Just offline all
> the managed memory via hotplug... (please note that such node still
> might have memory present! Just not useable by the page allocator)

You must have CONFIG_HAVE_MEMORYLESS_NODES defined. So I believe that this
change is limited to ia64 and powerpc. I don't think there is a way to set it
outside of those arches.

> 
> > The question is, when do you want which?
> 
> Short answer is that you shouldn't really care. The fact that we do, and
> that we have a distinct API for it, is IMHO a mistake of the past.
> 
> A slightly longer answer would be that the allocator should fall back to
> a proper node (or nodes) even if you specify a memoryless node as the
> preferred one. That is achieved by a proper zonelist for each existing
> NUMA node. There are bugs here and there when some nodes do not get their
> zonelists initialized, but fundamentally this should be a no-brainer.
> There are also corner cases where an application might have been bound to
> a node which went offline completely, which would be "fun".

I'd be willing to try to review a patch that removes it. Generally, I think
you're correct and zonelists solve what this attempts to solve. Maybe you need
something like this if a CPU has no memory:
zone_to_nid(first_zones_zonelist(NODE_DATA(nid)->node_zonelists)).
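For reference, the kernel already has a helper along these lines: local_memory_node() in mm/page_alloc.c derives the nearest node with memory from the node's zonelist. A non-runnable sketch of that upstream code (as of this thread's era):

```c
/* mm/page_alloc.c, under CONFIG_HAVE_MEMORYLESS_NODES */
int local_memory_node(int node)
{
	struct zoneref *z;

	z = first_zones_zonelist(node_zonelist(node, GFP_KERNEL),
				 gfp_zone(GFP_KERNEL),
				 NULL);
	return zone_to_nid(z->zone);
}
```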

> 
> > numa_node_id() is definitely
> > what's desired if MPOL_PREFERRED, or MPOL_LOCAL were used, since the ABI
> > states "This mode specifies "local allocation"; the memory is allocated
> > on the node of the CPU that triggered the allocation (the "local
> > node")."
> 
> In fact from the allocator point of view there is no real difference
> because there is nothing to allocate from the node without memory,
> obviously so it would fallback to the next node/zones from the closest
> node...
> 
> > It would be weird, though not impossible to set this policy on
> > a CPU that has memoryless nodes.
> 
> Keep in mind that the memory might have gone away via hotplug.
> 
> > A more likely way to hit this is with
> > interleaving. The current interfaces will return some equally weird
> > thing, but at least it's symmetric. Therefore, in cases where the node
> > is being queried for the currently running process, it probably makes
> > sense to use numa_node_id(). For other cases however, when CPU is trying
> > to obtain the "local" memory, numa_mem_id() already contains this and
> > should be used instead.
> > 
> > This really should only affect configurations where
> > CONFIG_HAVE_MEMORYLESS_NODES=y, and even on those machines it's quite
> > possible the ultimate behavior would be identical.
> > 
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
> > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> 
> Well, mempolicy.c uses numa_node_id in most cases and I do not see these
> two being special. So if we want to change that it should be done
> consistently. I even suspect that these changes are mostly nops because
> the respective zonelists will do the right thing. But there might be land
> mines here and there - e.g. if __GFP_THISNODE was used, then somebody
> might expect a failure rather than a misplaced allocation, because there
> is no other fallback mechanism on a depleted NUMA node.
> 
> All that being said I am not sure this is an actual improvement.

It wasn't entirely arbitrary. I tried to cherry-pick the cases where the local
memory node was wanted as opposed to the local NUMA node. As I went through, it
seemed that in some cases this is correct; for instance, getting the node for
"LOCAL" should probably be the closest node, not the closest memory node, but
getting the next closest interleave node should be a memory node. It was
definitely not perfect.

The only reason I added this patch is because I use numa_mem_id by the end of
the series and I wanted to make sure my usage of it was correct before adding
new usages of numa_mem_id (because, like you said, mempolicy mostly uses
numa_node_id). I'm happy to just keep using numa_node_id if that's the right
thing to do.

> 
> > ---
> >  mm/mempolicy.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 36ee3267c25f..99e0f3f9c4a6 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1991,7 +1991,7 @@ static unsigned offset_il_node(struct mempolicy *pol, unsigned long n)
> >  	int nid;
> >  
> >  	if (!nnodes)
> > -		return numa_node_id();
> > +		return numa_mem_id();
> >  	target = (unsigned int)n % nnodes;
> >  	nid = first_node(pol->v.nodes);
> >  	for (i = 0; i < target; i++)
> > @@ -2049,7 +2049,7 @@ int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags,
> >  		nid = interleave_nid(*mpol, vma, addr,
> >  					huge_page_shift(hstate_vma(vma)));
> >  	} else {
> > -		nid = policy_node(gfp_flags, *mpol, numa_node_id());
> > +		nid = policy_node(gfp_flags, *mpol, numa_mem_id());
> >  		if ((*mpol)->mode == MPOL_BIND)
> >  			*nodemask = &(*mpol)->v.nodes;
> >  	}
> > -- 
> > 2.27.0
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs
Michal Hocko June 26, 2020, 12:30 p.m. UTC | #3
On Wed 24-06-20 09:48:37, Ben Widawsky wrote:
> On 20-06-24 10:25:59, Michal Hocko wrote:
> > On Fri 19-06-20 09:24:09, Ben Widawsky wrote:
> > > Calling out some distinctions first as I understand it, and the
> > > reasoning of the patch:
> > > numa_node_id() - The node id for the currently running CPU.
> > > numa_mem_id() - The node id for the closest memory node.
> > 
> > Correct
> > 
> > > The case where they are not the same is CONFIG_HAVE_MEMORYLESS_NODES.
> > > Only ia64 and powerpc support this option, so it is perhaps not a very
> > > interesting situation to most.
> > 
> > Other arches can have nodes without any memory as well. Just offline all
> > the managed memory via hotplug... (please note that such node still
> > might have memory present! Just not useable by the page allocator)
> 
> You must have CONFIG_HAVE_MEMORYLESS_NODES defined. So I believe that this
> change is limited to ia64 and powerpc. I don't think there is a way to set it
> outside of those arches.

I have tried to say that while other arches (like x86) do not have
CONFIG_HAVE_MEMORYLESS_NODES defined, they can still end up with a NUMA
node without any memory. Just use memory hotplug...

Patch

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 36ee3267c25f..99e0f3f9c4a6 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1991,7 +1991,7 @@  static unsigned offset_il_node(struct mempolicy *pol, unsigned long n)
 	int nid;
 
 	if (!nnodes)
-		return numa_node_id();
+		return numa_mem_id();
 	target = (unsigned int)n % nnodes;
 	nid = first_node(pol->v.nodes);
 	for (i = 0; i < target; i++)
@@ -2049,7 +2049,7 @@  int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags,
 		nid = interleave_nid(*mpol, vma, addr,
 					huge_page_shift(hstate_vma(vma)));
 	} else {
-		nid = policy_node(gfp_flags, *mpol, numa_node_id());
+		nid = policy_node(gfp_flags, *mpol, numa_mem_id());
 		if ((*mpol)->mode == MPOL_BIND)
 			*nodemask = &(*mpol)->v.nodes;
 	}