Higher slub memory consumption on 64K page-size systems?

Message ID	20201028055030.GA362097@in.ibm.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=7+Oq=ED=kvack.org=owner-linux-mm@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org C344B22282 Date: Wed, 28 Oct 2020 11:20:30 +0530 From: Bharata B Rao <bharata@linux.ibm.com> To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, cl@linux.com, rientjes@google.com, iamjoonsoo.kim@lge.com, akpm@linux-foundation.org, guro@fb.com, vbabka@suse.cz, shakeelb@google.com, hannes@cmpxchg.org, aneesh.kumar@linux.ibm.com Subject: Higher slub memory consumption on 64K page-size systems? Message-ID: <20201028055030.GA362097@in.ibm.com> Reply-To: bharata@linux.ibm.com MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	Higher slub memory consumption on 64K page-size systems? \| expand Higher slub memory consumption on 64K page-size systems?

Message ID

20201028055030.GA362097@in.ibm.com (mailing list archive)

State

New, archived

Headers

DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org C344B22282
Date: Wed, 28 Oct 2020 11:20:30 +0530
From: Bharata B Rao <bharata@linux.ibm.com>
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, cl@linux.com, rientjes@google.com,
        iamjoonsoo.kim@lge.com, akpm@linux-foundation.org, guro@fb.com,
        vbabka@suse.cz, shakeelb@google.com, hannes@cmpxchg.org,
        aneesh.kumar@linux.ibm.com
Subject: Higher slub memory consumption on 64K page-size systems?
Message-ID: <20201028055030.GA362097@in.ibm.com>
Reply-To: bharata@linux.ibm.com
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Series

Higher slub memory consumption on 64K page-size systems? | expand

Commit Message

Bharata B Rao Oct. 28, 2020, 5:50 a.m. UTC

Hi,

On POWER systems, where 64K PAGE_SIZE is default, I see that slub
consumes higher amount of memory compared to any 4K page-size system.
While slub is obviously going to consume more memory on 64K page-size
systems compared to 4K as slabs are allocated in page-size granularity,
I want to check if there are any obvious tuning (via existing tunables
or via some code change) that we can do to reduce the amount of memory
consumed by slub.

Here is a comparision of the slab memory consumption between 4K and
64K page-size pseries hash KVM guest with 16 cores and 16G memory
configuration immediately after boot:

64K	209280 kB
4K	67636 kB

64K configuration may never be able to consume as less as a 4K configuration,
but it certainly shows that the slub can be optimized for 64K page-size better.

slub_max_order
--------------
The most promising tunable that shows consistent reduction in slab memory
is slub_max_order. Here is a table that shows the number of slabs that
end up with different orders and the total slab consumption at boot
for different values of slub_max_order:
-------------------------------------------
slub_max_order	Order	NrSlabs	Slab memory
-------------------------------------------
		0	276
	3	1	16	207488 kB
    (default)	2	4
		3	11
-------------------------------------------
		0	276
	2	1	16	166656 kB
		2	4
-------------------------------------------
		0	276	144128 kB
	1	1	31
-------------------------------------------

Though only a few bigger sized caches fall into order-2 or order-3, they
seem to make a considerable difference to the overall slab consumption.
If we take task_struct cache as an example, this is how it ends up when
slub_max_order is varied:

task_struct, objsize=9856
--------------------------------------------
slub_max_order	objperslab	pagesperslab
--------------------------------------------
3		53		8
2		26		4
1		13		2
--------------------------------------------

The slab page-order and hence the number of objects in a slab has a
bearing on the performance, but I wonder if some caches like task_struct
above can be auto-tuned to fall into a conservative order and do good
both wrt both memory and performance?

mm/slub.c:calulate_order() has the logic which determines the the
page-order for the slab. It starts with min_objects and attempts
to arrive at the best configuration for the slab. The min_objects
is starts like this:

min_objects = 4 * (fls(nr_cpu_ids) + 1);

Here nr_cpu_ids depends on the maxcpus and hence this can have a
significant effect on those systems which define maxcpus. Slab numbers
post-boot for a KVM pseries guest that has 16 boottime CPUs and varying
number of maxcpus look like this:
-------------------------------
maxcpus		Slab memory(kB)
-------------------------------
64		209280
256		253824
512		293824
-------------------------------

Page-order is a one time setting and obviously can't be tweaked dynamically
on CPU hotplug, but just wanted to bring out the effect of the same.

And that constant multiplicative factor of 4 was infact added by the commit
9b2cd506e5f2 - "slub: Calculate min_objects based on number of processors."

Reducing that to say 2, does give some reduction in the slab memory
and also same hackbench performance with reduced slab memory, but I am not
sure if that could be assumed to be beneficial for all scenarios.

MIN_PARTIAL
-----------
This determines the number of slabs left on the partial list even if they
are empty. My initial thought was that the default MIN_PARTIAL value of 5
is on the higher side and we are accumulating MIN_PARTIAL number of
empty slabs in all caches without freeing them. However I hardly find
the case where an empty slab is retained during freeing on account of
partial slabs being lesser than MIN_PARTIAL.

However what I find in practice is that we are accumulating a lot of partial
slabs with just one in-use object in the whole slab. High number of such
partial slabs is indeed contributing to the increased slab memory consumption.

For example, after a hackbench run, I find the distribution of objects
like this for kmalloc-2k cache:

total_objects		3168
objects			1611
Nr partial slabs	54
Nr parital slabs with
just 1 inuse object	38

With 64K page-size, so many partial slabs with just 1 inuse object can
result in high memory usage. Is there any workaround possible prevent this
kind of situation?

cpu_partial
-----------
Here is how the slab consumption post-boot varies when all the slab
caches are forced with the fixed cpu_partial value:
---------------------------
cpu_partial	Slab Memory
---------------------------
0		175872 kB
2		187136 kB
4		191616 kB
default		204864 kB
---------------------------

It has been suggested earlier that reducing cpu_partial and/or making
cpu_partial 64K page-size aware will benefit. In set_cpu_partial(),
for bigger sized slabs (size > PAGE_SIZE), cpu_partial is already set
to 2. A bit of tweaking there to introduce cpu_partial=1 for certain
slabs does give some benefit.


With the above change, the slab consumption post-boot reduces to 186048 kB.
Also, here are the hackbench numbers with and w/o the above change:

Average of 10 runs of 'hackbench -s 1024 -l 200 -g 200 -f 25 -P'
Slab consumption captured at the end of each run
--------------------------------------------------------------
		Time		Slab memory
--------------------------------------------------------------
Default		11.124s		645580 kB
Patched		11.032s		584352 kB
--------------------------------------------------------------

I have mostly looked at reducing the slab memory consumption here.
But I do understand that default tunable values have been arrived
at based on some benchmark numbers. Are there ways or possibilities
to reduce the slub memory consumption with the existing level of
performance is what I would like to understand and explore.

Regards,
Bharata.

Comments

Roman Gushchin Oct. 29, 2020, 12:07 a.m. UTC | #1

On Wed, Oct 28, 2020 at 11:20:30AM +0530, Bharata B Rao wrote:
> Hi,
> 
> On POWER systems, where 64K PAGE_SIZE is default, I see that slub
> consumes higher amount of memory compared to any 4K page-size system.
> While slub is obviously going to consume more memory on 64K page-size
> systems compared to 4K as slabs are allocated in page-size granularity,
> I want to check if there are any obvious tuning (via existing tunables
> or via some code change) that we can do to reduce the amount of memory
> consumed by slub.
> 
> Here is a comparision of the slab memory consumption between 4K and
> 64K page-size pseries hash KVM guest with 16 cores and 16G memory
> configuration immediately after boot:
> 
> 64K	209280 kB
> 4K	67636 kB
> 
> 64K configuration may never be able to consume as less as a 4K configuration,
> but it certainly shows that the slub can be optimized for 64K page-size better.
> 
> slub_max_order
> --------------
> The most promising tunable that shows consistent reduction in slab memory
> is slub_max_order. Here is a table that shows the number of slabs that
> end up with different orders and the total slab consumption at boot
> for different values of slub_max_order:
> -------------------------------------------
> slub_max_order	Order	NrSlabs	Slab memory
> -------------------------------------------
> 		0	276
> 	3	1	16	207488 kB
>     (default)	2	4
> 		3	11
> -------------------------------------------
> 		0	276
> 	2	1	16	166656 kB
> 		2	4
> -------------------------------------------
> 		0	276	144128 kB
> 	1	1	31
> -------------------------------------------
> 
> Though only a few bigger sized caches fall into order-2 or order-3, they
> seem to make a considerable difference to the overall slab consumption.
> If we take task_struct cache as an example, this is how it ends up when
> slub_max_order is varied:
> 
> task_struct, objsize=9856
> --------------------------------------------
> slub_max_order	objperslab	pagesperslab
> --------------------------------------------
> 3		53		8
> 2		26		4
> 1		13		2
> --------------------------------------------
> 
> The slab page-order and hence the number of objects in a slab has a
> bearing on the performance, but I wonder if some caches like task_struct
> above can be auto-tuned to fall into a conservative order and do good
> both wrt both memory and performance?
> 
> mm/slub.c:calulate_order() has the logic which determines the the
> page-order for the slab. It starts with min_objects and attempts
> to arrive at the best configuration for the slab. The min_objects
> is starts like this:
> 
> min_objects = 4 * (fls(nr_cpu_ids) + 1);
> 
> Here nr_cpu_ids depends on the maxcpus and hence this can have a
> significant effect on those systems which define maxcpus. Slab numbers
> post-boot for a KVM pseries guest that has 16 boottime CPUs and varying
> number of maxcpus look like this:
> -------------------------------
> maxcpus		Slab memory(kB)
> -------------------------------
> 64		209280
> 256		253824
> 512		293824
> -------------------------------
> 
> Page-order is a one time setting and obviously can't be tweaked dynamically
> on CPU hotplug, but just wanted to bring out the effect of the same.
> 
> And that constant multiplicative factor of 4 was infact added by the commit
> 9b2cd506e5f2 - "slub: Calculate min_objects based on number of processors."
> 
> Reducing that to say 2, does give some reduction in the slab memory
> and also same hackbench performance with reduced slab memory, but I am not
> sure if that could be assumed to be beneficial for all scenarios.
> 
> MIN_PARTIAL
> -----------
> This determines the number of slabs left on the partial list even if they
> are empty. My initial thought was that the default MIN_PARTIAL value of 5
> is on the higher side and we are accumulating MIN_PARTIAL number of
> empty slabs in all caches without freeing them. However I hardly find
> the case where an empty slab is retained during freeing on account of
> partial slabs being lesser than MIN_PARTIAL.
> 
> However what I find in practice is that we are accumulating a lot of partial
> slabs with just one in-use object in the whole slab. High number of such
> partial slabs is indeed contributing to the increased slab memory consumption.
> 
> For example, after a hackbench run, I find the distribution of objects
> like this for kmalloc-2k cache:
> 
> total_objects		3168
> objects			1611
> Nr partial slabs	54
> Nr parital slabs with
> just 1 inuse object	38
> 
> With 64K page-size, so many partial slabs with just 1 inuse object can
> result in high memory usage. Is there any workaround possible prevent this
> kind of situation?
> 
> cpu_partial
> -----------
> Here is how the slab consumption post-boot varies when all the slab
> caches are forced with the fixed cpu_partial value:
> ---------------------------
> cpu_partial	Slab Memory
> ---------------------------
> 0		175872 kB
> 2		187136 kB
> 4		191616 kB
> default		204864 kB
> ---------------------------
> 
> It has been suggested earlier that reducing cpu_partial and/or making
> cpu_partial 64K page-size aware will benefit. In set_cpu_partial(),
> for bigger sized slabs (size > PAGE_SIZE), cpu_partial is already set
> to 2. A bit of tweaking there to introduce cpu_partial=1 for certain
> slabs does give some benefit.
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index a28ed9b8fc61..e09eff1199bf 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3626,7 +3626,9 @@ static void set_cpu_partial(struct kmem_cache *s)
>          */
>         if (!kmem_cache_has_cpu_partial(s))
>                 slub_set_cpu_partial(s, 0);
> -       else if (s->size >= PAGE_SIZE)
> +       else if (s->size >= 8192)
> +               slub_set_cpu_partial(s, 1);
> +       else if (s->size >= 4096)
>                 slub_set_cpu_partial(s, 2);
>         else if (s->size >= 1024)
>                 slub_set_cpu_partial(s, 6);
> 
> With the above change, the slab consumption post-boot reduces to 186048 kB.
> Also, here are the hackbench numbers with and w/o the above change:
> 
> Average of 10 runs of 'hackbench -s 1024 -l 200 -g 200 -f 25 -P'
> Slab consumption captured at the end of each run
> --------------------------------------------------------------
> 		Time		Slab memory
> --------------------------------------------------------------
> Default		11.124s		645580 kB
> Patched		11.032s		584352 kB
> --------------------------------------------------------------
> 
> I have mostly looked at reducing the slab memory consumption here.
> But I do understand that default tunable values have been arrived
> at based on some benchmark numbers. Are there ways or possibilities
> to reduce the slub memory consumption with the existing level of
> performance is what I would like to understand and explore.

Hi Bharata!

I wonder how the distribution of the consumed memory by slab_caches
differs between 4k and 64k pages. In particular, I wonder if
page-sized and larger kmallocs make the difference (or a big part of it)?
There are many places in the kernel which are doing something like
kmalloc(PAGE_SIZE).

Re slub tuning: in general we do care about the number of objects
in a partial list, less about the number of pages. If we can have the
same amount of objects but on fewer pages, it's even better.
So I don't see any reasons why we shouldn't scale down these tunables
if the PAGE_SIZE > 4K.
Idk if it makes sense to switch to byte-sized tunables or just to hardcode
custom default values for the 64k page case. The latter is probably
is easier.

Thanks!

Bharata B Rao Nov. 2, 2020, 11:33 a.m. UTC | #2

On Wed, Oct 28, 2020 at 05:07:57PM -0700, Roman Gushchin wrote:
> On Wed, Oct 28, 2020 at 11:20:30AM +0530, Bharata B Rao wrote:
> > I have mostly looked at reducing the slab memory consumption here.
> > But I do understand that default tunable values have been arrived
> > at based on some benchmark numbers. Are there ways or possibilities
> > to reduce the slub memory consumption with the existing level of
> > performance is what I would like to understand and explore.
> 
> Hi Bharata!
> 
> I wonder how the distribution of the consumed memory by slab_caches
> differs between 4k and 64k pages. In particular, I wonder if
> page-sized and larger kmallocs make the difference (or a big part of it)?
> There are many places in the kernel which are doing something like
> kmalloc(PAGE_SIZE).

Here is comparision of topmost slabs in terms of memory usage b/n
4K and 64K configurations:

Case 1: After boot
==================
4K page-size
------------
Name                   Objects Objsize           Space Slabs/Part/Cpu  O/S O %Fr %Ef Flg
inode_cache              23382     592           14.1M       400/0/33   54 3   0  97 a
dentry                   29484     192            5.7M      592/0/110   42 1   0  98 a
kmalloc-1k                5358    1024            5.6M       130/9/42   32 3   5  97
task_struct                371    9856            4.1M        88/6/40    3 3   4  87
kmalloc-512               6640     512            3.4M       159/3/49   32 2   1  99
...
kmalloc-4k                 530    4096            2.2M        42/6/27    8 3   8  96

64K page-size
-------------
pgtable-2^11               935   16384           38.7M       16/16/58   16 3  21  39
inode_cache              23980     592           14.4M       203/0/17  109 0   0  98 a
thread_stack               709   16384           12.0M         6/1/17   32 3   4  96
task_struct               1012    9856           10.4M         4/1/16   53 3   5  95
kmalloc-64k                144   65536            9.4M         2/0/16    8 3   0 100

Case 2: After hackbench run
===========================
4K page-size
------------
inode_cache              21823     592           13.3M       361/3/46   54 3   0  96 a
kmalloc-512              10309     512            9.4M    433/325/146   32 2  56  55
kmalloc-1k                6207    1024            6.5M      121/12/78   32 3   6  97
dentry                   28923     192            5.9M     468/48/261   42 1   6  92 a
task_struct                418    9856            5.1M      106/24/51    3 3  15  80
...
kmalloc-4k                 510    4096            2.1M       41/10/26    8 3  14  95

64K page-size
-------------
kmalloc-8k                3081    8192           84.9M     241/241/83   32 2  74  29
thread_stack              2919   16384           52.4M       15/10/85   32 3  10  91
pgtable-2^11              1281   16384           50.8M       20/20/77   16 3  20  41
task_struct               3771    9856           40.3M         9/6/68   53 3   7  92
vm_area_struct           92295     200           18.9M        8/8/281  327 0   2  97
...
kmalloc-64k                144   65536            9.4M         2/0/16    8 3   0 100

I can't see any specific pattern wrt to kmalloc cache usage in both the
above cases (boot vs hackbench run). In the boot case, the 64K configuration
consuming more memory can be attributed probably to the bigger page size
itself. However in case of hackbench run, any significant number of
partial slabs does contribute to significant increase of memory for
64K configuration.

> 
> Re slub tuning: in general we do care about the number of objects
> in a partial list, less about the number of pages. If we can have the
> same amount of objects but on fewer pages, it's even better.

Right, but how do we achieve that when few number of inuse objects are
spread across a number of partial slabs? This specifically is the case
we see after a workload run (hackbench in this case)

> So I don't see any reasons why we shouldn't scale down these tunables
> if the PAGE_SIZE > 4K.
> Idk if it makes sense to switch to byte-sized tunables or just to hardcode
> custom default values for the 64k page case. The latter is probably
> is easier.

Right, tuning the mininum number of objects when calculating the page order
of the slab and tuning cpu_partial value show some consistent reduction
in the slab memory consumption. (I have shown this in previous mail)

Thanks for your comments.

Regards,
Bharata.

Vlastimil Babka Nov. 5, 2020, 4:47 p.m. UTC | #3

On 10/28/20 6:50 AM, Bharata B Rao wrote:
> slub_max_order
> --------------
> The most promising tunable that shows consistent reduction in slab memory
> is slub_max_order. Here is a table that shows the number of slabs that
> end up with different orders and the total slab consumption at boot
> for different values of slub_max_order:
> -------------------------------------------
> slub_max_order	Order	NrSlabs	Slab memory
> -------------------------------------------
> 		0	276
> 	3	1	16	207488 kB
>      (default)	2	4
> 		3	11
> -------------------------------------------
> 		0	276
> 	2	1	16	166656 kB
> 		2	4
> -------------------------------------------
> 		0	276	144128 kB
> 	1	1	31
> -------------------------------------------
> 
> Though only a few bigger sized caches fall into order-2 or order-3, they
> seem to make a considerable difference to the overall slab consumption.
> If we take task_struct cache as an example, this is how it ends up when
> slub_max_order is varied:
> 
> task_struct, objsize=9856
> --------------------------------------------
> slub_max_order	objperslab	pagesperslab
> --------------------------------------------
> 3		53		8
> 2		26		4
> 1		13		2
> --------------------------------------------
> 
> The slab page-order and hence the number of objects in a slab has a
> bearing on the performance, but I wonder if some caches like task_struct
> above can be auto-tuned to fall into a conservative order and do good
> both wrt both memory and performance?

Hmm ideally this should be based on objperslab so if there's larger page sizes, 
then the calculated order becomes smaller, even 0?

> mm/slub.c:calulate_order() has the logic which determines the the
> page-order for the slab. It starts with min_objects and attempts
> to arrive at the best configuration for the slab. The min_objects
> is starts like this:
> 
> min_objects = 4 * (fls(nr_cpu_ids) + 1);
> 
> Here nr_cpu_ids depends on the maxcpus and hence this can have a
> significant effect on those systems which define maxcpus. Slab numbers
> post-boot for a KVM pseries guest that has 16 boottime CPUs and varying
> number of maxcpus look like this:
> -------------------------------
> maxcpus		Slab memory(kB)
> -------------------------------
> 64		209280
> 256		253824
> 512		293824
> -------------------------------

Yeah IIRC nr_cpu_ids is related to number of possible cpus which is rather 
excessive on some systems, so a relation to actually online cpus would make more 
sense.

> Page-order is a one time setting and obviously can't be tweaked dynamically
> on CPU hotplug, but just wanted to bring out the effect of the same.
> 
> And that constant multiplicative factor of 4 was infact added by the commit
> 9b2cd506e5f2 - "slub: Calculate min_objects based on number of processors."
> 
> Reducing that to say 2, does give some reduction in the slab memory
> and also same hackbench performance with reduced slab memory, but I am not
> sure if that could be assumed to be beneficial for all scenarios.
> 
> MIN_PARTIAL
> -----------
> This determines the number of slabs left on the partial list even if they
> are empty. My initial thought was that the default MIN_PARTIAL value of 5
> is on the higher side and we are accumulating MIN_PARTIAL number of
> empty slabs in all caches without freeing them. However I hardly find
> the case where an empty slab is retained during freeing on account of
> partial slabs being lesser than MIN_PARTIAL.
> 
> However what I find in practice is that we are accumulating a lot of partial
> slabs with just one in-use object in the whole slab. High number of such
> partial slabs is indeed contributing to the increased slab memory consumption.
> 
> For example, after a hackbench run, I find the distribution of objects
> like this for kmalloc-2k cache:
> 
> total_objects		3168
> objects			1611
> Nr partial slabs	54
> Nr parital slabs with
> just 1 inuse object	38
> 
> With 64K page-size, so many partial slabs with just 1 inuse object can
> result in high memory usage. Is there any workaround possible prevent this
> kind of situation?

Probably not, this is just fundamental internal fragmentation problem and that 
we can't predict which objects will have similar lifetime and thus put it 
together. Larger pages make just make the effect more pronounced. It would be 
wrong if we allocated new pages instead of reusing the partial ones, but that's 
not the case, IIUC?

But you are measuring "after a hackbench run", so is that an important data 
point? If the system was in some kind of steady state workload, the pages would 
be better used I'd expect.

> cpu_partial
> -----------
> Here is how the slab consumption post-boot varies when all the slab
> caches are forced with the fixed cpu_partial value:
> ---------------------------
> cpu_partial	Slab Memory
> ---------------------------
> 0		175872 kB
> 2		187136 kB
> 4		191616 kB
> default		204864 kB
> ---------------------------
> 
> It has been suggested earlier that reducing cpu_partial and/or making
> cpu_partial 64K page-size aware will benefit. In set_cpu_partial(),
> for bigger sized slabs (size > PAGE_SIZE), cpu_partial is already set
> to 2. A bit of tweaking there to introduce cpu_partial=1 for certain
> slabs does give some benefit.
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index a28ed9b8fc61..e09eff1199bf 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3626,7 +3626,9 @@ static void set_cpu_partial(struct kmem_cache *s)
>           */
>          if (!kmem_cache_has_cpu_partial(s))
>                  slub_set_cpu_partial(s, 0);
> -       else if (s->size >= PAGE_SIZE)
> +       else if (s->size >= 8192)
> +               slub_set_cpu_partial(s, 1);
> +       else if (s->size >= 4096)
>                  slub_set_cpu_partial(s, 2);
>          else if (s->size >= 1024)
>                  slub_set_cpu_partial(s, 6);
> 
> With the above change, the slab consumption post-boot reduces to 186048 kB.

Yeah, making it agnostic to PAGE_SIZE makes sense.

> Also, here are the hackbench numbers with and w/o the above change:
> 
> Average of 10 runs of 'hackbench -s 1024 -l 200 -g 200 -f 25 -P'
> Slab consumption captured at the end of each run
> --------------------------------------------------------------
> 		Time		Slab memory
> --------------------------------------------------------------
> Default		11.124s		645580 kB
> Patched		11.032s		584352 kB
> --------------------------------------------------------------
> 
> I have mostly looked at reducing the slab memory consumption here.
> But I do understand that default tunable values have been arrived
> at based on some benchmark numbers. Are there ways or possibilities
> to reduce the slub memory consumption with the existing level of
> performance is what I would like to understand and explore.
> 
> Regards,
> Bharata.
>

Bharata B Rao Nov. 11, 2020, 9:02 a.m. UTC | #4

On Thu, Nov 05, 2020 at 05:47:03PM +0100, Vlastimil Babka wrote:
> On 10/28/20 6:50 AM, Bharata B Rao wrote:
> > slub_max_order
> > --------------
> > The most promising tunable that shows consistent reduction in slab memory
> > is slub_max_order. Here is a table that shows the number of slabs that
> > end up with different orders and the total slab consumption at boot
> > for different values of slub_max_order:
> > -------------------------------------------
> > slub_max_order	Order	NrSlabs	Slab memory
> > -------------------------------------------
> > 		0	276
> > 	3	1	16	207488 kB
> >      (default)	2	4
> > 		3	11
> > -------------------------------------------
> > 		0	276
> > 	2	1	16	166656 kB
> > 		2	4
> > -------------------------------------------
> > 		0	276	144128 kB
> > 	1	1	31
> > -------------------------------------------
> > 
> > Though only a few bigger sized caches fall into order-2 or order-3, they
> > seem to make a considerable difference to the overall slab consumption.
> > If we take task_struct cache as an example, this is how it ends up when
> > slub_max_order is varied:
> > 
> > task_struct, objsize=9856
> > --------------------------------------------
> > slub_max_order	objperslab	pagesperslab
> > --------------------------------------------
> > 3		53		8
> > 2		26		4
> > 1		13		2
> > --------------------------------------------
> > 
> > The slab page-order and hence the number of objects in a slab has a
> > bearing on the performance, but I wonder if some caches like task_struct
> > above can be auto-tuned to fall into a conservative order and do good
> > both wrt both memory and performance?
> 
> Hmm ideally this should be based on objperslab so if there's larger page
> sizes, then the calculated order becomes smaller, even 0?

It is indeed based on number of objects that could be optimally
fit within a slab. As I explain below, curently we start with a
minimum objects value that ends up pushing the page order higher
for some slab sizes and page size combination. The question is can
we start with a more conservative/lower value for min_objects in
calculate_order()?

> 
> > mm/slub.c:calulate_order() has the logic which determines the the
> > page-order for the slab. It starts with min_objects and attempts
> > to arrive at the best configuration for the slab. The min_objects
> > is starts like this:
> > 
> > min_objects = 4 * (fls(nr_cpu_ids) + 1);
> > 
> > Here nr_cpu_ids depends on the maxcpus and hence this can have a
> > significant effect on those systems which define maxcpus. Slab numbers
> > post-boot for a KVM pseries guest that has 16 boottime CPUs and varying
> > number of maxcpus look like this:
> > -------------------------------
> > maxcpus		Slab memory(kB)
> > -------------------------------
> > 64		209280
> > 256		253824
> > 512		293824
> > -------------------------------
> 
> Yeah IIRC nr_cpu_ids is related to number of possible cpus which is rather
> excessive on some systems, so a relation to actually online cpus would make
> more sense.

May be I can send a patch to change the above calculation of
min_objects to be based on online cpus and see how it is received.

> 
> > Page-order is a one time setting and obviously can't be tweaked dynamically
> > on CPU hotplug, but just wanted to bring out the effect of the same.
> > 
> > And that constant multiplicative factor of 4 was infact added by the commit
> > 9b2cd506e5f2 - "slub: Calculate min_objects based on number of processors."
> > 
> > Reducing that to say 2, does give some reduction in the slab memory
> > and also same hackbench performance with reduced slab memory, but I am not
> > sure if that could be assumed to be beneficial for all scenarios.
> > 
> > MIN_PARTIAL
> > -----------
> > This determines the number of slabs left on the partial list even if they
> > are empty. My initial thought was that the default MIN_PARTIAL value of 5
> > is on the higher side and we are accumulating MIN_PARTIAL number of
> > empty slabs in all caches without freeing them. However I hardly find
> > the case where an empty slab is retained during freeing on account of
> > partial slabs being lesser than MIN_PARTIAL.
> > 
> > However what I find in practice is that we are accumulating a lot of partial
> > slabs with just one in-use object in the whole slab. High number of such
> > partial slabs is indeed contributing to the increased slab memory consumption.
> > 
> > For example, after a hackbench run, I find the distribution of objects
> > like this for kmalloc-2k cache:
> > 
> > total_objects		3168
> > objects			1611
> > Nr partial slabs	54
> > Nr parital slabs with
> > just 1 inuse object	38
> > 
> > With 64K page-size, so many partial slabs with just 1 inuse object can
> > result in high memory usage. Is there any workaround possible prevent this
> > kind of situation?
> 
> Probably not, this is just fundamental internal fragmentation problem and
> that we can't predict which objects will have similar lifetime and thus put
> it together. Larger pages make just make the effect more pronounced. It
> would be wrong if we allocated new pages instead of reusing the partial
> ones, but that's not the case, IIUC?

Correct, that shouldn't be the case, I will check by adding some
instrumentation and ascertain if it indeed the case.

> 
> But you are measuring "after a hackbench run", so is that an important data
> point? If the system was in some kind of steady state workload, the pages
> would be better used I'd expect.

May be, I am not sure, we will have to check. I measured at two points: immediately
after boot as initial state and after hackbench run as an exteme state. I chose
hackbench as I see that earlier changes to some of these slab code/tunables
have been supported by hackbench numbers.

> 
> > cpu_partial
> > -----------
> > Here is how the slab consumption post-boot varies when all the slab
> > caches are forced with the fixed cpu_partial value:
> > ---------------------------
> > cpu_partial	Slab Memory
> > ---------------------------
> > 0		175872 kB
> > 2		187136 kB
> > 4		191616 kB
> > default		204864 kB
> > ---------------------------
> > 
> > It has been suggested earlier that reducing cpu_partial and/or making
> > cpu_partial 64K page-size aware will benefit. In set_cpu_partial(),
> > for bigger sized slabs (size > PAGE_SIZE), cpu_partial is already set
> > to 2. A bit of tweaking there to introduce cpu_partial=1 for certain
> > slabs does give some benefit.
> > 
> > diff --git a/mm/slub.c b/mm/slub.c
> > index a28ed9b8fc61..e09eff1199bf 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -3626,7 +3626,9 @@ static void set_cpu_partial(struct kmem_cache *s)
> >           */
> >          if (!kmem_cache_has_cpu_partial(s))
> >                  slub_set_cpu_partial(s, 0);
> > -       else if (s->size >= PAGE_SIZE)
> > +       else if (s->size >= 8192)
> > +               slub_set_cpu_partial(s, 1);
> > +       else if (s->size >= 4096)
> >                  slub_set_cpu_partial(s, 2);
> >          else if (s->size >= 1024)
> >                  slub_set_cpu_partial(s, 6);
> > 
> > With the above change, the slab consumption post-boot reduces to 186048 kB.
> 
> Yeah, making it agnostic to PAGE_SIZE makes sense.

Ok, let me send a separate patch for this.

Thanks for your inputs.

Regards,
Bharata.

diff --git a/mm/slub.c b/mm/slub.c
index a28ed9b8fc61..e09eff1199bf 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3626,7 +3626,9 @@  static void set_cpu_partial(struct kmem_cache *s)
         */
        if (!kmem_cache_has_cpu_partial(s))
                slub_set_cpu_partial(s, 0);
-       else if (s->size >= PAGE_SIZE)
+       else if (s->size >= 8192)
+               slub_set_cpu_partial(s, 1);
+       else if (s->size >= 4096)
                slub_set_cpu_partial(s, 2);
        else if (s->size >= 1024)
                slub_set_cpu_partial(s, 6);