
[RFC,2/2] mm/slub: prefer NUMA locality over slight memory saving on NUMA machines

Message ID 20230723190906.4082646-3-42.hyeyoo@gmail.com (mailing list archive)
State New
Series: An attempt to improve SLUB on NUMA / under memory pressure

Commit Message

Hyeonggon Yoo July 23, 2023, 7:09 p.m. UTC
By default, SLUB sets remote_node_defrag_ratio to 1000, which makes it
(in most cases) take partial slabs from remote nodes before trying to
allocate new folios on the local node from the buddy allocator.

Documentation/ABI/testing/sysfs-kernel-slab says:
> The file remote_node_defrag_ratio specifies the percentage of
> times SLUB will attempt to refill the cpu slab with a partial
> slab from a remote node as opposed to allocating a new slab on
> the local node.  This reduces the amount of wasted memory over
> the entire system but can be expensive.

Although this made sense when it was introduced, the portion of per-node
partial lists in overall SLUB memory usage has decreased since the
introduction of per-CPU partial lists. Therefore, it's worth reevaluating
its impact on performance and memory usage.
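
For reference, the check this ratio controls sits in get_any_partial() and
can be seen in the diff at the bottom of this page. As a simplified sketch
(the helper name below is made up for illustration; the real check is
open-coded in get_any_partial()):

	/*
	 * Simplified sketch of the existing remote_node_defrag_ratio gate.
	 * The ratio is compared against get_cycles() % 1024, so the default
	 * of 1000 takes remote partial slabs almost every time, while 0
	 * skips the remote search entirely.
	 */
	static bool want_remote_defrag(struct kmem_cache *s)
	{
		if (!s->remote_node_defrag_ratio ||
		    get_cycles() % 1024 > s->remote_node_defrag_ratio)
			return false;	/* allocate a new slab on the local node instead */
		return true;		/* walk remote nodes' partial lists */
	}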

[
	XXX: Add performance data. I tried to measure its impact on
	hackbench with a 2-socket NUMA machine, but it seems hackbench is
	too synthetic to benefit from this, because the skbuff_head_cache's
	size fits into the last level cache.

	Probably more realistic workloads like netperf would benefit
	from this?
]

Set remote_node_defrag_ratio to zero by default, and the new behavior is:
	1) try refilling per CPU partial list from the local node
	2) try allocating new slabs from the local node without reclamation
	3) try refilling per CPU partial list from remote nodes
	4) try allocating new slabs from the local node or remote nodes

If the user has specified remote_node_defrag_ratio, it probabilistically
tries 3) first and then tries 2) and 4) in order, to avoid unexpected
behavioral changes from the user's perspective.
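
The corresponding hunk in ___slab_alloc() is in the diff at the bottom of
this page; as a rough sketch of the new ordering (the slub_put_cpu_ptr()/
slub_get_cpu_ptr() juggling, statistics and error handling are omitted):

	/* 1) the local node's partial list (plus, per the ratio, maybe remote ones) */
	freelist = get_partial(s, node, &pc, false);
	if (freelist)
		goto check_new_slab;

	/* 2) a new slab from the local node only, without entering reclaim */
	local_flags = (gfpflags | __GFP_NOWARN | __GFP_THISNODE);
	local_flags &= ~(__GFP_NOFAIL | __GFP_RECLAIM);
	slab = new_slab(s, local_flags, node);
	if (slab)
		goto alloc_slab;

	/* 3) remote nodes' partial lists, this time ignoring the ratio */
	if (node == NUMA_NO_NODE) {
		freelist = get_any_partial(s, &pc, true);
		if (freelist)
			goto check_new_slab;
	}

	/* 4) a new slab from any node, with the caller's original gfp flags */
	slab = new_slab(s, gfpflags, node);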
---
 mm/slub.c | 45 +++++++++++++++++++++++++++++++++++++--------
 1 file changed, 37 insertions(+), 8 deletions(-)

Comments

Vlastimil Babka Aug. 3, 2023, 2:54 p.m. UTC | #1
On 7/23/23 21:09, Hyeonggon Yoo wrote:
> By default, SLUB sets remote_node_defrag_ratio to 1000, which makes it
> (in most cases) take partial slabs from remote nodes before trying to
> allocate new folios on the local node from the buddy allocator.
> 
> Documentation/ABI/testing/sysfs-kernel-slab says:
>> The file remote_node_defrag_ratio specifies the percentage of
>> times SLUB will attempt to refill the cpu slab with a partial
>> slab from a remote node as opposed to allocating a new slab on
>> the local node.  This reduces the amount of wasted memory over
>> the entire system but can be expensive.
> 
> Although this made sense when it was introduced, the portion of per-node
> partial lists in overall SLUB memory usage has decreased since the
> introduction of per-CPU partial lists. Therefore, it's worth reevaluating
> its impact on performance and memory usage.
> 
> [
> 	XXX: Add performance data. I tried to measure its impact on
> 	hackbench with a 2-socket NUMA machine, but it seems hackbench is
> 	too synthetic to benefit from this, because the skbuff_head_cache's
> 	size fits into the last level cache.
> 
> 	Probably more realistic workloads like netperf would benefit
> 	from this?
> ]
> 
> Set remote_node_defrag_ratio to zero by default, and the new behavior is:
> 	1) try refilling per CPU partial list from the local node
> 	2) try allocating new slabs from the local node without reclamation
> 	3) try refilling per CPU partial list from remote nodes
> 	4) try allocating new slabs from the local node or remote nodes
> 
> If the user has specified remote_node_defrag_ratio, it probabilistically
> tries 3) first and then tries 2) and 4) in order, to avoid unexpected
> behavioral changes from the user's perspective.

It makes sense to me, but as you note it would be great to demonstrate the
benefits, because it adds complexity, especially in the already complex
___slab_alloc(). Networking has indeed historically been a workload very
sensitive to slab performance, so it seems a good candidate.

We could also postpone this until we have tried the percpu arrays
improvements discussed at LSF/MM.
Hyeonggon Yoo Aug. 7, 2023, 8:39 a.m. UTC | #2
On Thu, Aug 3, 2023 at 11:54 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 7/23/23 21:09, Hyeonggon Yoo wrote:
> > By default, SLUB sets remote_node_defrag_ratio to 1000, which makes it
> > (in most cases) take partial slabs from remote nodes before trying to
> > allocate new folios on the local node from the buddy allocator.
> >
> > Documentation/ABI/testing/sysfs-kernel-slab says:
> >> The file remote_node_defrag_ratio specifies the percentage of
> >> times SLUB will attempt to refill the cpu slab with a partial
> >> slab from a remote node as opposed to allocating a new slab on
> >> the local node.  This reduces the amount of wasted memory over
> >> the entire system but can be expensive.
> >
> > Although this made sense when it was introduced, the portion of per-node
> > partial lists in overall SLUB memory usage has decreased since the
> > introduction of per-CPU partial lists. Therefore, it's worth reevaluating
> > its impact on performance and memory usage.
> >
> > [
> >       XXX: Add performance data. I tried to measure its impact on
> >       hackbench with a 2-socket NUMA machine, but it seems hackbench is
> >       too synthetic to benefit from this, because the skbuff_head_cache's
> >       size fits into the last level cache.
> >
> >       Probably more realistic workloads like netperf would benefit
> >       from this?
> > ]
> >
> > Set remote_node_defrag_ratio to zero by default, and the new behavior is:
> >       1) try refilling per CPU partial list from the local node
> >       2) try allocating new slabs from the local node without reclamation
> >       3) try refilling per CPU partial list from remote nodes
> >       4) try allocating new slabs from the local node or remote nodes
> >
> > If the user has specified remote_node_defrag_ratio, it probabilistically
> > tries 3) first and then tries 2) and 4) in order, to avoid unexpected
> > behavioral changes from the user's perspective.
>
> It makes sense to me, but as you note it would be great to demonstrate the
> benefits, because it adds complexity, especially in the already complex
> ___slab_alloc(). Networking has indeed historically been a workload very
> sensitive to slab performance, so it seems a good candidate.

Thank you for looking at it!

Yeah, it was a PoC for what I thought "oh, it might be useful",
and I will definitely try to measure it.

> We could also postpone this until we have tried the percpu arrays
> improvements discussed at LSF/MM.

Possibly, but can you please share your plans/opinions on it?
I think one possible way is simply to allow the cpu freelist to be
mixed with objects from different slabs, if we want to minimize changes.
Or we could introduce a per-CPU array similar to what SLAB does now.
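
(As a purely hypothetical illustration of the per-CPU array idea, loosely
modeled on SLAB's struct array_cache; none of these names exist in SLUB:)

	/*
	 * Hypothetical per-CPU object array for SLUB, in the spirit of
	 * SLAB's struct array_cache. Objects from any slab could sit in
	 * entry[], and the fast path would simply pop entry[--avail].
	 */
	struct slub_percpu_array {
		unsigned int avail;		/* objects currently cached */
		unsigned int limit;		/* capacity before flushing back */
		unsigned int batchcount;	/* refill/flush batch size */
		void *entry[];			/* cached objects from any slab */
	};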

And one thing I'm having difficulty understanding is: what is the mind
behind, or the impact of, managing objects on a slab basis, other than
avoiding array queues in 2007?
Vlastimil Babka Aug. 8, 2023, 9:59 a.m. UTC | #3
On 8/7/23 10:39, Hyeonggon Yoo wrote:
> On Thu, Aug 3, 2023 at 11:54 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
> 
> Thank you for looking at it!
> 
> Yeah, it was a PoC for what I thought "oh, it might be useful",
> and I will definitely try to measure it.
> 
>> We could also postpone this until we have tried the percpu arrays
>> improvements discussed at LSF/MM.
> 
> Possibly, but can you please share your plans/opinions on it?

Here's the very first attempt :)
https://lore.kernel.org/linux-mm/20230808095342.12637-7-vbabka@suse.cz/

> I think one possible way is simply to allow the cpu freelist to be
> mixed with objects from different slabs, if we want to minimize changes.

I didn't try that way, might be much trickier than it looks.

> Or we could introduce a per-CPU array similar to what SLAB does now.

Yes.

> And one thing I'm having difficulty understanding is: what is the mind
> behind, or the impact of, managing objects on a slab basis, other than
> avoiding array queues in 2007?

"The mind" is Christoph's so I'll leave that question to him :)

Patch

diff --git a/mm/slub.c b/mm/slub.c
index 199d3d03d5b9..cfdea3e3e221 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2319,7 +2319,8 @@  static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
 /*
  * Get a slab from somewhere. Search in increasing NUMA distances.
  */
-static void *get_any_partial(struct kmem_cache *s, struct partial_context *pc)
+static void *get_any_partial(struct kmem_cache *s, struct partial_context *pc,
+			     bool force_defrag)
 {
 #ifdef CONFIG_NUMA
 	struct zonelist *zonelist;
@@ -2347,8 +2348,8 @@  static void *get_any_partial(struct kmem_cache *s, struct partial_context *pc)
 	 * may be expensive if we do it every time we are trying to find a slab
 	 * with available objects.
 	 */
-	if (!s->remote_node_defrag_ratio ||
-			get_cycles() % 1024 > s->remote_node_defrag_ratio)
+	if (!force_defrag && (!s->remote_node_defrag_ratio ||
+			get_cycles() % 1024 > s->remote_node_defrag_ratio))
 		return NULL;
 
 	do {
@@ -2382,7 +2383,8 @@  static void *get_any_partial(struct kmem_cache *s, struct partial_context *pc)
 /*
  * Get a partial slab, lock it and return it.
  */
-static void *get_partial(struct kmem_cache *s, int node, struct partial_context *pc)
+static void *get_partial(struct kmem_cache *s, int node, struct partial_context *pc,
+			 bool force_defrag)
 {
 	void *object;
 	int searchnode = node;
@@ -2394,7 +2396,7 @@  static void *get_partial(struct kmem_cache *s, int node, struct partial_context
 	if (object || node != NUMA_NO_NODE)
 		return object;
 
-	return get_any_partial(s, pc);
+	return get_any_partial(s, pc, force_defrag);
 }
 
 #ifndef CONFIG_SLUB_TINY
@@ -3092,6 +3094,7 @@  static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	struct slab *slab;
 	unsigned long flags;
 	struct partial_context pc;
+	gfp_t local_flags;
 
 	stat(s, ALLOC_SLOWPATH);
 
@@ -3208,10 +3211,35 @@  static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	pc.flags = gfpflags;
 	pc.slab = &slab;
 	pc.orig_size = orig_size;
-	freelist = get_partial(s, node, &pc);
+
+	freelist = get_partial(s, node, &pc, false);
 	if (freelist)
 		goto check_new_slab;
 
+	/*
+	 * try allocating slab from the local node first before taking slabs
+	 * from remote nodes. If user specified remote_node_defrag_ratio,
+	 * try taking slabs from remote nodes first.
+	 */
+	slub_put_cpu_ptr(s->cpu_slab);
+	local_flags = (gfpflags | __GFP_NOWARN | __GFP_THISNODE);
+	local_flags &= ~(__GFP_NOFAIL | __GFP_RECLAIM);
+	slab = new_slab(s, local_flags, node);
+	c = slub_get_cpu_ptr(s->cpu_slab);
+
+	if (slab)
+		goto alloc_slab;
+
+	/*
+	 * At this point no memory can be allocated lightly.
+	 * Take slabs from remote nodes.
+	 */
+	if (node == NUMA_NO_NODE) {
+		freelist = get_any_partial(s, &pc, true);
+		if (freelist)
+			goto check_new_slab;
+	}
+
 	slub_put_cpu_ptr(s->cpu_slab);
 	slab = new_slab(s, gfpflags, node);
 	c = slub_get_cpu_ptr(s->cpu_slab);
@@ -3221,6 +3249,7 @@  static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 		return NULL;
 	}
 
+alloc_slab:
 	stat(s, ALLOC_SLAB);
 
 	if (kmem_cache_debug(s)) {
@@ -3404,7 +3433,7 @@  static void *__slab_alloc_node(struct kmem_cache *s,
 	pc.flags = gfpflags;
 	pc.slab = &slab;
 	pc.orig_size = orig_size;
-	object = get_partial(s, node, &pc);
+	object = get_partial(s, node, &pc, false);
 
 	if (object)
 		return object;
@@ -4538,7 +4567,7 @@  static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
 	set_cpu_partial(s);
 
 #ifdef CONFIG_NUMA
-	s->remote_node_defrag_ratio = 1000;
+	s->remote_node_defrag_ratio = 0;
 #endif
 
 	/* Initialize the pre-computed randomized freelist if slab is up */