[RFC,0/2] An attempt to improve SLUB on NUMA / under memory pressure


From: Hyeonggon Yoo <42.hyeyoo@gmail.com>
To: Vlastimil Babka <vbabka@suse.cz>,
	Christoph Lameter <cl@linux.com>,
	Pekka Enberg <penberg@kernel.org>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	David Rientjes <rientjes@google.com>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>,
	Feng Tang <feng.tang@intel.com>,
	"Sang, Oliver" <oliver.sang@intel.com>,
	Jay Patel <jaypatel@linux.ibm.com>,
	Binder Makin <merimus@google.com>,
	aneesh.kumar@linux.ibm.com,
	tsahu@linux.ibm.com,
	piyushs@linux.ibm.com,
	fengwei.yin@intel.com,
	ying.huang@intel.com,
	lkp <lkp@intel.com>,
	"oe-lkp@lists.linux.dev" <oe-lkp@lists.linux.dev>,
	linux-mm@kvack.org,
	linux-kernel@vger.kernel.org,
	Hyeonggon Yoo <42.hyeyoo@gmail.com>
Subject: [RFC 0/2] An attempt to improve SLUB on NUMA / under memory pressure
Date: Mon, 24 Jul 2023 04:09:04 +0900
Message-ID: <20230723190906.4082646-1-42.hyeyoo@gmail.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Message

Hyeonggon Yoo July 23, 2023, 7:09 p.m. UTC

Hello folks,

This series is motivated by the kernel test bot report [1] on Jay's patch
that modifies slab order. While the patch was not merged and was not in its
final form, I think it was a good lesson that changing slab order has more
impact on performance than we expected.

While inspecting the report, I found some potential points to improve
SLUB. [2] It's _potential_ because it shows no improvement on hackbench,
but I believe more realistic workloads would benefit from this. Due to
a lack of resources and my limited understanding of *realistic* workloads,
I am asking you to help evaluate this together.

The series consists of only two patches. Patch #1 addresses an inaccuracy
in SLUB's heuristic, which can negatively affect workloads' performance
when large folios are not available from buddy.

Patch #2 changes SLUB's behavior when there are no slabs available on the
local node's partial slab list, increasing NUMA locality when there is
memory available (without reclamation) on the local node from buddy.
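
The ordering that patch #2 describes can be pictured with a toy model. This is only a sketch of one reading of the cover letter, not code from the patch: every name below (slab_source, node_state, pick_source, prefer_locality) is hypothetical, and the assumption that the unpatched allocator tries remote partial slabs before allocating a new slab is stated here, not taken from the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; this models only the order of choices. */
enum slab_source {
	LOCAL_PARTIAL,	/* partial slab already on the local node */
	LOCAL_NEW,	/* new slab from the local node's free pages */
	REMOTE_PARTIAL,	/* partial slab borrowed from another node */
	ANY_NEW,	/* new slab from wherever memory can be found */
};

struct node_state {
	bool local_partial_available;
	bool local_free_without_reclaim;
	bool remote_partial_available;
};

/*
 * prefer_locality == false: assumed unpatched order (remote partial
 * slabs are reused before a new slab is allocated).
 * prefer_locality == true: order described for patch #2 (a new local
 * slab is preferred if it can be had without reclamation).
 */
static enum slab_source pick_source(const struct node_state *st,
				    bool prefer_locality)
{
	assert(st != NULL);

	if (st->local_partial_available)
		return LOCAL_PARTIAL;
	if (prefer_locality && st->local_free_without_reclaim)
		return LOCAL_NEW;
	if (st->remote_partial_available)
		return REMOTE_PARTIAL;
	return ANY_NEW;
}
```

With an empty local partial list, free local memory, and a remote partial slab available, the model returns REMOTE_PARTIAL without the preference and LOCAL_NEW with it, which is the "locality over slight memory saving" trade-off named in the patch title.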

This is at an early stage, but I think it's good enough to start a discussion.
Any feedback and ideas are welcome. Thank you in advance!

Hyeonggon

[1] https://lore.kernel.org/linux-mm/202307172140.3b34825a-oliver.sang@intel.com
[2] https://lore.kernel.org/linux-mm/CAB=+i9S6Ykp90+4N1kCE=hiTJTE4wzJDi8k5pBjjO_3sf0aeqg@mail.gmail.com

Hyeonggon Yoo (2):
  Revert "mm, slub: change percpu partial accounting from objects to
    pages"
  mm/slub: prefer NUMA locality over slight memory saving on NUMA
    machines

 include/linux/slub_def.h |  2 --
 mm/slab.h                |  6 ++++
 mm/slub.c                | 76 ++++++++++++++++++++++++++--------------
 3 files changed, 55 insertions(+), 29 deletions(-)

Comments

Jay Patel Aug. 10, 2023, 10:55 a.m. UTC | #1

On Mon, 2023-07-24 at 04:09 +0900, Hyeonggon Yoo wrote:
> Hello folks,
> 
> This series is motivated by kernel test bot report [1] on Jay's patch
> that modifies slab order. While the patch was not merged and not in the
> final form, I think it was a good lesson that changing slab order has more
> impacts on performance than we expected.
> 
> While inspecting the report, I found some potential points to improve
> SLUB. [2] It's _potential_ because it shows no improvements on hackbench.
> but I believe more realistic workloads would benefit from this. Due to
> lack of resources and lack of my understanding of *realistic* workloads,
> I am asking you to help evaluating this together.

Hi Hyeonggon,
I ran the hackbench test on a PowerPC machine with 16 CPUs and
saw a ~32% regression with the patches applied.

Results:

+-------+----+---------+------------+------------+
|       |    | Normal  | With Patch |            |
+-------+----+---------+------------+------------+
| Amean | 1  | 1.3700  | 2.0353     | ( -32.69%) |
| Amean | 4  | 5.1663  | 7.6563     | ( -32.52%) |
| Amean | 7  | 8.9180  | 13.3353    | ( -33.13%) |
| Amean | 12 | 15.4290 | 23.0757    | ( -33.14%) |
| Amean | 21 | 27.3333 | 40.7823    | ( -32.98%) |
| Amean | 30 | 38.7677 | 58.5300    | ( -33.76%) |
| Amean | 48 | 62.2987 | 92.9850    | ( -33.00%) |
| Amean | 64 | 82.8993 | 123.4717   | ( -32.86%) |
+-------+----+---------+------------+------------+

Thanks
Jay Patel
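
The percentages in the table appear to be measured against the patched value rather than the baseline. That formula is an inference from the numbers, not something stated in the thread, and the helper names below are hypothetical. A small sketch of the arithmetic:

```c
#include <assert.h>

/*
 * Relative difference in percent, measured against the patched value.
 * Negative means the patched kernel took longer. This formula is an
 * assumption inferred from the table, not taken from the thread.
 */
static double diff_vs_patched(double normal, double patched)
{
	assert(patched > 0.0);
	return (normal - patched) / patched * 100.0;
}

/* The same gap expressed as an increase over the unpatched baseline. */
static double increase_vs_baseline(double normal, double patched)
{
	assert(normal > 0.0);
	return (patched - normal) / normal * 100.0;
}
```

For the first row (1.3700 vs. 2.0353), diff_vs_patched() gives about -32.69, matching the table, while increase_vs_baseline() gives roughly 48.6. Under this reading, the "~32%" figure corresponds to run times that are nearly 50% longer than the baseline.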

Hyeonggon Yoo Aug. 10, 2023, 6:06 p.m. UTC | #2

On Thu, Aug 10, 2023 at 7:56 PM Jay Patel <jaypatel@linux.ibm.com> wrote:
> Hi Hyeonggon,
> I tried hackbench test on Powerpc machine with 16 cpus but
> got ~32% of Regression with patch.

Thank you so much for measuring this! That's very helpful.
It's interesting because on an AMD machine with 2 NUMA nodes there was
not much difference.

Does it have more than one socket?

Could you confirm whether the offending patch is patch 1 or 2?
If the offending one is patch 2, could you please check how high the
L3 cache miss rate is during hackbench?


Jay Patel Aug. 18, 2023, 6:45 a.m. UTC | #3

On Fri, 2023-08-11 at 03:06 +0900, Hyeonggon Yoo wrote:
> On Thu, Aug 10, 2023 at 7:56 PM Jay Patel <jaypatel@linux.ibm.com>
> wrote:
> > Hi Hyeonggon,
> > I tried hackbench test on Powerpc machine with 16 cpus but
> > got ~32% of Regression with patch.
> 
> Thank you so much for measuring this! That's very helpful.
> It's interesting because on an AMD machine with 2 NUMA nodes there
> was
> not much difference.
> 
> Does it have more than one socket?

I tested on a single-socket system.
> 
> Could you confirm if the offending patch is patch 1 or 2?
> If the offending one is patch 2, can you please check how large is L3
> cache miss rate
> during hackbench?
> 
The regression I reported is caused by patch 1, 'Revert "mm, slub: change
percpu partial accounting from objects to pages"'.

Thanks 
Jay Patel


Hyeonggon Yoo Aug. 18, 2023, 3:18 p.m. UTC | #4

On Fri, Aug 18, 2023 at 4:11 PM Jay Patel <jaypatel@linux.ibm.com> wrote:
>
> On Fri, 2023-08-11 at 03:06 +0900, Hyeonggon Yoo wrote:
> > On Thu, Aug 10, 2023 at 7:56 PM Jay Patel <jaypatel@linux.ibm.com>
> > wrote:
> > > Hi Hyeonggon,
> > > I tried hackbench test on Powerpc machine with 16 cpus but
> > > got ~32% of Regression with patch.
> >
> > Thank you so much for measuring this! That's very helpful.
> > It's interesting because on an AMD machine with 2 NUMA nodes there
> > was
> > not much difference.
> >
> > Does it have more than one socket?
>
> I have tested on single socket system.
> >
> > Could you confirm if the offending patch is patch 1 or 2?
> > If the offending one is patch 2, can you please check how large is L3
> > cache miss rate
> > during hackbench?
> >
> Below regression is cause by Patch 1 "Revert mm, slub: change percpu
> partial accounting from objects to pages"

Fortunately, I was able to reproduce the regression (5~10%) on my AMD laptop :)
It's interesting, and thank you so much for pointing it out!

The patch only modifies the slowpath, so the overhead of the calculation
itself should be negligible.
I think it's fair to assume that the regression occurs because the patch
shortens the freelist: the existing code rounds up the number of slabs:
> nr_slabs = DIV_ROUND_UP(nr_objects * 2, oo_objects(s->oo));

So before the patch, more objects were cached than intended.
I'll try bumping up the default value to the point where it does not
use more memory than before.

By the way, it is very unclear to me what the optimal default value is.
Obviously, 'good enough for hackbench' is not a good standard,
because hackbench is quite a synthetic workload.
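
To make the rounding effect concrete, here is a minimal userspace sketch of the calculation quoted above. The DIV_ROUND_UP() macro mirrors the kernel helper. The function names and the sample numbers are hypothetical, and "half full" is an assumption read off the `* 2` factor in the quoted line.

```c
#include <assert.h>

/* Same arithmetic as the kernel's DIV_ROUND_UP() helper. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Hypothetical stand-in for the quoted line: assuming slabs are half
 * full, nr_objects objects need about nr_objects * 2 object slots,
 * rounded up to a whole number of slabs.
 */
static unsigned int partial_slabs_limit(unsigned int nr_objects,
					unsigned int objs_per_slab)
{
	assert(objs_per_slab > 0);
	return DIV_ROUND_UP(nr_objects * 2, objs_per_slab);
}

/* Objects held if every slab on the list really is half full. */
static unsigned int objects_at_half_full(unsigned int nr_slabs,
					 unsigned int objs_per_slab)
{
	return nr_slabs * objs_per_slab / 2;
}
```

With made-up inputs of 6 objects and 32 objects per slab, partial_slabs_limit() returns 1, and one half-full slab holds 16 objects, well above the requested 6. With 30 objects and 21 objects per slab it returns 3 slabs, or 31 objects at half full. In both cases the slab-based limit admits at least as many objects as the object-based target, which is consistent with the observation that more objects were cached than intended.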

