
[RFC] mm+net: allow to set kmem_cache create flag for SLAB_NEVER_MERGE

Message ID 167396280045.539803.7540459812377220500.stgit@firesoul (mailing list archive)
State New
Series [RFC] mm+net: allow to set kmem_cache create flag for SLAB_NEVER_MERGE

Commit Message

Jesper Dangaard Brouer Jan. 17, 2023, 1:40 p.m. UTC
Allow API users of kmem_cache_create to specify that they don't want
any slab merging or aliasing (with similarly sized objects). Use this in
the network stack and kfence_test.

The SKB (sk_buff) kmem_cache slab is critical for network performance.
The network stack uses the kmem_cache_{alloc,free}_bulk APIs to gain
performance by amortising the alloc/free cost.

For the bulk API to perform efficiently, the slab fragmentation needs to
be low. Especially for the SLUB allocator, the efficiency of the bulk
free API depends on objects belonging to the same slab (page).
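
As a rough sketch (simplified, not the exact network stack code), the bulk
API usage pattern looks like this:

	void *objs[16];
	int n;

	/* Refill: try to grab up to 16 objects in a single call; the return
	 * value is the number of objects actually allocated. */
	n = kmem_cache_alloc_bulk(skbuff_head_cache, GFP_ATOMIC,
				  ARRAY_SIZE(objs), objs);

	/* ... hand out objs[0..n-1] as struct sk_buff heads ... */

	/* Flush: give everything back in one call; this is cheapest when the
	 * objects belong to the same slab page(s). */
	kmem_cache_free_bulk(skbuff_head_cache, n, objs);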

When running different network performance microbenchmarks, I started
to notice that performance was reduced (slightly) when machines had
longer uptimes. I believe the cause was that 'skbuff_head_cache' got
aliased/merged into the general slab cache for 256-byte sized objects
(with my kernel config, without CONFIG_HARDENED_USERCOPY).

For the SKB kmem_cache the network stack has reasons for not merging,
but they vary depending on the kernel config (e.g.
CONFIG_HARDENED_USERCOPY). We want to explicitly set SLAB_NEVER_MERGE
for this kmem_cache.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/linux/slab.h    |    2 ++
 mm/kfence/kfence_test.c |    7 +++----
 mm/slab.h               |    5 +++--
 mm/slab_common.c        |    8 ++++----
 net/core/skbuff.c       |   13 ++++++++++++-
 5 files changed, 24 insertions(+), 11 deletions(-)

Comments

Christoph Lameter Jan. 17, 2023, 2:54 p.m. UTC | #1
On Tue, 17 Jan 2023, Jesper Dangaard Brouer wrote:

> When running different network performance microbenchmarks, I started
> to notice that performance was reduced (slightly) when machines had
> longer uptimes. I believe the cause was 'skbuff_head_cache' got
> aliased/merged into the general slub for 256 bytes sized objects (with
> my kernel config, without CONFIG_HARDENED_USERCOPY).

Well, that is a common effect that we see in multiple subsystems. This is
due to general memory fragmentation. Depending on the prior load, the
performance could actually be better after some runtime if the caches are
populated, avoiding the page allocator etc.

The merging could actually be beneficial, since there may be more partial
slabs to allocate from, thus avoiding expensive calls to the page
allocator.

I wish we had some effective way of memory defragmentation.

Matthew Wilcox (Oracle) Jan. 18, 2023, 5:17 a.m. UTC | #2
On Tue, Jan 17, 2023 at 03:54:34PM +0100, Christoph Lameter wrote:
> On Tue, 17 Jan 2023, Jesper Dangaard Brouer wrote:
> 
> > When running different network performance microbenchmarks, I started
> > to notice that performance was reduced (slightly) when machines had
> > longer uptimes. I believe the cause was 'skbuff_head_cache' got
> > aliased/merged into the general slub for 256 bytes sized objects (with
> > my kernel config, without CONFIG_HARDENED_USERCOPY).
> 
> Well that is a common effect that we see in multiple subsystems. This is
> due to general memory fragmentation. Depending on the prior load the
> performance could actually be better after some runtime if the caches are
> populated avoiding the page allocator etc.

The page allocator isn't _that_ expensive.  I could see updating several
slabs being more expensive than allocating a new page.

> The merging could actually be beneficial since there may be more partial
> slabs to allocate from and thus avoiding expensive calls to the page
> allocator.

What might be more effective is allocating larger order slabs.  I see
that kmalloc-256 allocates a pair of pages and manages 32 objects within
that pair.  It should perform better in Jesper's scenario if it allocated
4 pages and managed 64 objects per slab.

Simplest way to test that should be booting a kernel with
'slub_min_order=2'.  Does that help matters at all, Jesper?  You could
also try slub_min_order=3.  Going above that starts to get a bit sketchy.

Vlastimil Babka Jan. 18, 2023, 7:36 a.m. UTC | #3
On 1/17/23 14:40, Jesper Dangaard Brouer wrote:
> Allow API users of kmem_cache_create to specify that they don't want
> any slab merge or aliasing (with similar sized objects). Use this in
> network stack and kfence_test.
> 
> The SKB (sk_buff) kmem_cache slab is critical for network performance.
> Network stack uses kmem_cache_{alloc,free}_bulk APIs to gain
> performance by amortising the alloc/free cost.
> 
> For the bulk API to perform efficiently the slub fragmentation need to
> be low. Especially for the SLUB allocator, the efficiency of bulk free
> API depend on objects belonging to the same slab (page).

Incidentally, would you know if anyone still uses SLAB instead of SLUB
because it would perform better for networking? IIRC in the past discussions
networking was one of the reasons for SLAB to stay. We are looking again
into the possibility of removing it, so it would be good to know if there
are benchmarks where SLUB does worse so it can be looked into.

> When running different network performance microbenchmarks, I started
> to notice that performance was reduced (slightly) when machines had
> longer uptimes. I believe the cause was 'skbuff_head_cache' got
> aliased/merged into the general slub for 256 bytes sized objects (with
> my kernel config, without CONFIG_HARDENED_USERCOPY).

So did things improve with SLAB_NEVER_MERGE?

> For SKB kmem_cache network stack have reasons for not merging, but it
> varies depending on kernel config (e.g. CONFIG_HARDENED_USERCOPY).
> We want to explicitly set SLAB_NEVER_MERGE for this kmem_cache.
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  include/linux/slab.h    |    2 ++
>  mm/kfence/kfence_test.c |    7 +++----
>  mm/slab.h               |    5 +++--
>  mm/slab_common.c        |    8 ++++----
>  net/core/skbuff.c       |   13 ++++++++++++-
>  5 files changed, 24 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 45af70315a94..83a89ba7c4be 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -138,6 +138,8 @@
>  #define SLAB_SKIP_KFENCE	0
>  #endif
>  
> +#define SLAB_NEVER_MERGE	((slab_flags_t __force)0x40000000U)

I think there should be an explanation of what this does and when to consider
it. We should discourage blind use / cargo cult / copy-paste from elsewhere,
resulting in excessive proliferation of the flag.

- very specialized internal things like kfence? ok
- preventing a bad user of another cache from corrupting my cache due to
merging? no, use slub_debug to find and fix the root cause
- performance concerns? only after proper evaluation, not prematurely
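
When the flag is warranted at all, it is best gated behind a
subsystem-specific debug Kconfig rather than set unconditionally, so
production kernels keep the merging benefits. A rough sketch
(CONFIG_FOO_DEBUG, struct foo and foo_cache are made-up names):

	#include <linux/init.h>
	#include <linux/errno.h>
	#include <linux/slab.h>

	/* Made-up subsystem object and its dedicated cache */
	struct foo { unsigned long id; };
	static struct kmem_cache *foo_cache;

	static int __init foo_cache_init(void)
	{
		slab_flags_t flags = SLAB_HWCACHE_ALIGN;

		/* Only opt out of merging when the subsystem's own debug
		 * option is enabled, so the cache shows up separately in
		 * /proc/slabinfo and /sys/kernel/slab. */
		if (IS_ENABLED(CONFIG_FOO_DEBUG))
			flags |= SLAB_NEVER_MERGE;

		foo_cache = kmem_cache_create("foo_cache", sizeof(struct foo),
					      0, flags, NULL);
		return foo_cache ? 0 : -ENOMEM;
	}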

> +
>  /* The following flags affect the page allocator grouping pages by mobility */
>  /* Objects are reclaimable */
>  #ifndef CONFIG_SLUB_TINY
> diff --git a/mm/kfence/kfence_test.c b/mm/kfence/kfence_test.c
> index b5d66a69200d..9e83e344ee3c 100644
> --- a/mm/kfence/kfence_test.c
> +++ b/mm/kfence/kfence_test.c
> @@ -191,11 +191,10 @@ static size_t setup_test_cache(struct kunit *test, size_t size, slab_flags_t fla
>  	kunit_info(test, "%s: size=%zu, ctor=%ps\n", __func__, size, ctor);
>  
>  	/*
> -	 * Use SLAB_NOLEAKTRACE to prevent merging with existing caches. Any
> -	 * other flag in SLAB_NEVER_MERGE also works. Use SLAB_ACCOUNT to
> -	 * allocate via memcg, if enabled.
> +	 * Use SLAB_NEVER_MERGE to prevent merging with existing caches.
> +	 * Use SLAB_ACCOUNT to allocate via memcg, if enabled.
>  	 */
> -	flags |= SLAB_NOLEAKTRACE | SLAB_ACCOUNT;
> +	flags |= SLAB_NEVER_MERGE | SLAB_ACCOUNT;
>  	test_cache = kmem_cache_create("test", size, 1, flags, ctor);
>  	KUNIT_ASSERT_TRUE_MSG(test, test_cache, "could not create cache");
>  
> diff --git a/mm/slab.h b/mm/slab.h
> index 7cc432969945..be1383176d3e 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -341,11 +341,11 @@ static inline slab_flags_t kmem_cache_flags(unsigned int object_size,
>  #if defined(CONFIG_SLAB)
>  #define SLAB_CACHE_FLAGS (SLAB_MEM_SPREAD | SLAB_NOLEAKTRACE | \
>  			  SLAB_RECLAIM_ACCOUNT | SLAB_TEMPORARY | \
> -			  SLAB_ACCOUNT)
> +			  SLAB_ACCOUNT | SLAB_NEVER_MERGE)
>  #elif defined(CONFIG_SLUB)
>  #define SLAB_CACHE_FLAGS (SLAB_NOLEAKTRACE | SLAB_RECLAIM_ACCOUNT | \
>  			  SLAB_TEMPORARY | SLAB_ACCOUNT | \
> -			  SLAB_NO_USER_FLAGS | SLAB_KMALLOC)
> +			  SLAB_NO_USER_FLAGS | SLAB_KMALLOC | SLAB_NEVER_MERGE)
>  #else
>  #define SLAB_CACHE_FLAGS (SLAB_NOLEAKTRACE)
>  #endif
> @@ -366,6 +366,7 @@ static inline slab_flags_t kmem_cache_flags(unsigned int object_size,
>  			      SLAB_TEMPORARY | \
>  			      SLAB_ACCOUNT | \
>  			      SLAB_KMALLOC | \
> +			      SLAB_NEVER_MERGE | \
>  			      SLAB_NO_USER_FLAGS)
>  
>  bool __kmem_cache_empty(struct kmem_cache *);
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 1cba98acc486..269f67c5fee6 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -45,9 +45,9 @@ static DECLARE_WORK(slab_caches_to_rcu_destroy_work,
>  /*
>   * Set of flags that will prevent slab merging
>   */
> -#define SLAB_NEVER_MERGE (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
> +#define SLAB_NEVER_MERGE_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER |\
>  		SLAB_TRACE | SLAB_TYPESAFE_BY_RCU | SLAB_NOLEAKTRACE | \
> -		SLAB_FAILSLAB | kasan_never_merge())
> +		SLAB_FAILSLAB | SLAB_NEVER_MERGE | kasan_never_merge())
>  
>  #define SLAB_MERGE_SAME (SLAB_RECLAIM_ACCOUNT | SLAB_CACHE_DMA | \
>  			 SLAB_CACHE_DMA32 | SLAB_ACCOUNT)
> @@ -137,7 +137,7 @@ static unsigned int calculate_alignment(slab_flags_t flags,
>   */
>  int slab_unmergeable(struct kmem_cache *s)
>  {
> -	if (slab_nomerge || (s->flags & SLAB_NEVER_MERGE))
> +	if (slab_nomerge || (s->flags & SLAB_NEVER_MERGE_FLAGS))
>  		return 1;
>  
>  	if (s->ctor)
> @@ -173,7 +173,7 @@ struct kmem_cache *find_mergeable(unsigned int size, unsigned int align,
>  	size = ALIGN(size, align);
>  	flags = kmem_cache_flags(size, flags, name);
>  
> -	if (flags & SLAB_NEVER_MERGE)
> +	if (flags & SLAB_NEVER_MERGE_FLAGS)
>  		return NULL;
>  
>  	list_for_each_entry_reverse(s, &slab_caches, list) {
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 79c9e795a964..799b9914457b 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -4629,12 +4629,23 @@ static void skb_extensions_init(void)
>  static void skb_extensions_init(void) {}
>  #endif
>  
> +/* The SKB kmem_cache slab is critical for network performance.  Never
> + * merge/alias the slab with similar sized objects.  This avoids fragmentation
> + * that hurts performance of kmem_cache_{alloc,free}_bulk APIs.
> + */
> +#ifndef CONFIG_SLUB_TINY
> +#define FLAG_SKB_NEVER_MERGE	SLAB_NEVER_MERGE
> +#else /* CONFIG_SLUB_TINY - simple loop in kmem_cache_alloc_bulk */
> +#define FLAG_SKB_NEVER_MERGE	0
> +#endif
> +
>  void __init skb_init(void)
>  {
>  	skbuff_head_cache = kmem_cache_create_usercopy("skbuff_head_cache",
>  					      sizeof(struct sk_buff),
>  					      0,
> -					      SLAB_HWCACHE_ALIGN|SLAB_PANIC,
> +					      SLAB_HWCACHE_ALIGN|SLAB_PANIC|
> +						FLAG_SKB_NEVER_MERGE,
>  					      offsetof(struct sk_buff, cb),
>  					      sizeof_field(struct sk_buff, cb),
>  					      NULL);
> 
>

Jesper Dangaard Brouer Jan. 19, 2023, 6:08 p.m. UTC | #4
On 18/01/2023 06.17, Matthew Wilcox wrote:
> On Tue, Jan 17, 2023 at 03:54:34PM +0100, Christoph Lameter wrote:
>> On Tue, 17 Jan 2023, Jesper Dangaard Brouer wrote:
>>
>>> When running different network performance microbenchmarks, I started
>>> to notice that performance was reduced (slightly) when machines had
>>> longer uptimes. I believe the cause was 'skbuff_head_cache' got
>>> aliased/merged into the general slub for 256 bytes sized objects (with
>>> my kernel config, without CONFIG_HARDENED_USERCOPY).
>>
>> Well that is a common effect that we see in multiple subsystems. This is
>> due to general memory fragmentation. Depending on the prior load the
>> performance could actually be better after some runtime if the caches are
>> populated avoiding the page allocator etc.
> 
> The page allocator isn't _that_ expensive.  I could see updating several
> slabs being more expensive than allocating a new page.
> 

For 10Gbit/s wirespeed small frames I have a budget of 201 cycles per packet.

I prefer to measure things, so let's see what a page allocation costs, and
also relate this to the cost per 4096 bytes (cycles divided by the number
of 4KiB pages covered by the allocation):

  alloc_pages order:0(4096B/x1)    246 cycles per-4096B 246 cycles
  alloc_pages order:1(8192B/x2)    300 cycles per-4096B 150 cycles
  alloc_pages order:2(16384B/x4)   328 cycles per-4096B 82 cycles
  alloc_pages order:3(32768B/x8)   357 cycles per-4096B 44 cycles
  alloc_pages order:4(65536B/x16)  516 cycles per-4096B 32 cycles
  alloc_pages order:5(131072B/x32) 801 cycles per-4096B 25 cycles

I looked back at my MM-summit presentations [2016][2017], and noticed that
in [2017] I reported that Mel had improved the order-0 page cost to 143
cycles in kernel 4.11-rc1.  According to the above measurements, the kernel
has since regressed in performance.

[2016] https://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf
[2017] https://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf
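
For reference, a minimal sketch of the measurement idea (the numbers above
come from a dedicated benchmark module; bench_alloc_pages below is only a
hypothetical skeleton of that loop):

	#include <linux/gfp.h>
	#include <linux/math64.h>
	#include <asm/timex.h>

	/* Time 'loops' alloc+free pairs at the given order and return the
	 * average cycles per pair.  Divide by (1 << order) pages to get the
	 * per-4096B numbers listed above. */
	static u64 bench_alloc_pages(unsigned int order, unsigned int loops)
	{
		cycles_t start, stop;
		unsigned int i;
		u64 total;

		start = get_cycles();
		for (i = 0; i < loops; i++) {
			struct page *page = alloc_pages(GFP_KERNEL, order);

			if (!page)
				break;
			__free_pages(page, order);
		}
		stop = get_cycles();

		total = stop - start;
		do_div(total, loops);
		return total;
	}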


>> The merging could actually be beneficial since there may be more partial
>> slabs to allocate from and thus avoiding expensive calls to the page
>> allocator.
> 
> What might be more effective is allocating larger order slabs.  I see
> that kmalloc-256 allocates a pair of pages and manages 32 objects within
> that pair.  It should perform better in Jesper's scenario if it allocated
> 4 pages and managed 64 objects per slab.
> 
> Simplest way to test that should be booting a kernel with
> 'slub_min_order=2'.  Does that help matters at all, Jesper?  You could
> also try slub_min_order=3.  Going above that starts to get a bit sketchy.
> 

I have tried this slub_min_order trick before, and it did help.  I've
not tested it recently though.

--Jesper

Jesper Dangaard Brouer Jan. 23, 2023, 4:14 p.m. UTC | #5
On 18/01/2023 08.36, Vlastimil Babka wrote:
> On 1/17/23 14:40, Jesper Dangaard Brouer wrote:
>> Allow API users of kmem_cache_create to specify that they don't want
>> any slab merge or aliasing (with similar sized objects). Use this in
>> network stack and kfence_test.
>>
>> The SKB (sk_buff) kmem_cache slab is critical for network performance.
>> Network stack uses kmem_cache_{alloc,free}_bulk APIs to gain
>> performance by amortising the alloc/free cost.
>>
>> For the bulk API to perform efficiently the slub fragmentation need to
>> be low. Especially for the SLUB allocator, the efficiency of bulk free
>> API depend on objects belonging to the same slab (page).
> 
> Incidentally, would you know if anyone still uses SLAB instead of SLUB
> because it would perform better for networking? IIRC in the past discussions
> networking was one of the reasons for SLAB to stay. We are looking again
> into the possibility of removing it, so it would be good to know if there
> are benchmarks where SLUB does worse so it can be looked into.
> 

I don't know of any users using SLAB for network performance reasons.
I've only been benchmarking with SLUB for a long time.
Anyone else on netdev?

Both SLUB and SLAB have the kmem_cache bulk API implemented.  This is
used today in the network stack to squeeze out extra performance for our
SKB (sk_buff) metadata structure (which points to the packet data).
Details: the network stack caches up to 64 of these SKBs per CPU for
RX-path NAPI-softirq processing, and the cache is repopulated via the
kmem_cache bulk API (bulk alloc 16 and bulk free 32), roughly as sketched
below.
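
A simplified sketch of that per-CPU scheme (the constants match the ones
above; the struct and function names are made up, and the real code in
net/core/skbuff.c has more details):

	#define SKB_CACHE_SIZE	64	/* max cached SKBs per CPU */
	#define SKB_CACHE_BULK	16	/* refill batch */
	#define SKB_CACHE_HALF	32	/* flush batch */

	struct skb_cache {
		unsigned int count;
		void *slots[SKB_CACHE_SIZE];
	};

	static void *skb_cache_get(struct skb_cache *nc)
	{
		/* Refill with one bulk call when the per-CPU cache runs dry */
		if (!nc->count)
			nc->count = kmem_cache_alloc_bulk(skbuff_head_cache,
							  GFP_ATOMIC,
							  SKB_CACHE_BULK,
							  nc->slots);
		return nc->count ? nc->slots[--nc->count] : NULL;
	}

	static void skb_cache_put(struct skb_cache *nc, void *skb)
	{
		nc->slots[nc->count++] = skb;
		if (nc->count == SKB_CACHE_SIZE) {
			/* Flush the upper half with one bulk call; cheapest
			 * when the objects sit on the same slab page(s). */
			kmem_cache_free_bulk(skbuff_head_cache, SKB_CACHE_HALF,
					     nc->slots + SKB_CACHE_HALF);
			nc->count = SKB_CACHE_HALF;
		}
	}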

>> When running different network performance microbenchmarks, I started
>> to notice that performance was reduced (slightly) when machines had
>> longer uptimes. I believe the cause was 'skbuff_head_cache' got
>> aliased/merged into the general slub for 256 bytes sized objects (with
>> my kernel config, without CONFIG_HARDENED_USERCOPY).
> 
> So did things improve with SLAB_NEVER_MERGE?

Yes, but only the stability of the results improved.

The performance tests were microbenchmarks, and as Christoph points out
there might be gains from having more partial slabs when there is more
fragmentation.  The "overload" microbenchmark will always do maximum
bulking, while more realistic workloads might be satisfied from the
partial slabs.  I would need to do a broader range of benchmarks before I
can conclude whether this is always a win.

>> For SKB kmem_cache network stack have reasons for not merging, but it
>> varies depending on kernel config (e.g. CONFIG_HARDENED_USERCOPY).
>> We want to explicitly set SLAB_NEVER_MERGE for this kmem_cache.
>>

In most distro kernel configs the SKB kmem_cache will already not get
merged/aliased.  I was just trying to make this consistent.

>> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
>> ---
>>   include/linux/slab.h    |    2 ++
>>   mm/kfence/kfence_test.c |    7 +++----
>>   mm/slab.h               |    5 +++--
>>   mm/slab_common.c        |    8 ++++----
>>   net/core/skbuff.c       |   13 ++++++++++++-
>>   5 files changed, 24 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/linux/slab.h b/include/linux/slab.h
>> index 45af70315a94..83a89ba7c4be 100644
>> --- a/include/linux/slab.h
>> +++ b/include/linux/slab.h
>> @@ -138,6 +138,8 @@
>>   #define SLAB_SKIP_KFENCE	0
>>   #endif
>>   
>> +#define SLAB_NEVER_MERGE	((slab_flags_t __force)0x40000000U)
> 
> I think there should be an explanation what this does and when to consider
> it. We should discourage blind use / cargo cult / copy paste from elsewhere
> resulting in excessive proliferation of the flag.

I agree.

> - very specialized internal things like kfence? ok
> - prevent a bad user of another cache corrupt my cache due to merging? no,
> use slub_debug to find and fix the root cause

Agree, and the comment could point to the slub_debug trick.

> - performance concerns? only after proper evaluation, not prematurely
>

Yes, and I would need to do more perf eval myself ;-)
I don't have time atm, thus I'll not pursue this RFC patch anytime soon.

--Jesper

Hyeonggon Yoo Jan. 24, 2023, 4:06 p.m. UTC | #6
On Tue, Jan 17, 2023 at 03:54:34PM +0100, Christoph Lameter wrote:
> On Tue, 17 Jan 2023, Jesper Dangaard Brouer wrote:
> 
> > When running different network performance microbenchmarks, I started
> > to notice that performance was reduced (slightly) when machines had
> > longer uptimes. I believe the cause was 'skbuff_head_cache' got
> > aliased/merged into the general slub for 256 bytes sized objects (with
> > my kernel config, without CONFIG_HARDENED_USERCOPY).
> 
> Well that is a common effect that we see in multiple subsystems. This is
> due to general memory fragmentation. Depending on the prior load the
> performance could actually be better after some runtime if the caches are
> populated avoiding the page allocator etc.
> 
> The merging could actually be beneficial since there may be more partial
> slabs to allocate from and thus avoiding expensive calls to the page
> allocator.
> 
> I wish we had some effective way of memory defragmentation.

If general memory fragmentation is the actual cause of this problem,
it may be worsened by [1], due to the assumption that all allocations
are done from slabs of the same order as s->oo when accounting and
limiting the number of percpu slabs.

[1] https://lore.kernel.org/linux-mm/76c63237-c489-b942-bdd9-5720042f52a9@suse.cz

Vlastimil Babka May 31, 2023, 12:03 p.m. UTC | #7
On 1/17/23 14:40, Jesper Dangaard Brouer wrote:
> Allow API users of kmem_cache_create to specify that they don't want
> any slab merge or aliasing (with similar sized objects). Use this in
> network stack and kfence_test.
> 
> The SKB (sk_buff) kmem_cache slab is critical for network performance.
> Network stack uses kmem_cache_{alloc,free}_bulk APIs to gain
> performance by amortising the alloc/free cost.
> 
> For the bulk API to perform efficiently the slub fragmentation need to
> be low. Especially for the SLUB allocator, the efficiency of bulk free
> API depend on objects belonging to the same slab (page).
> 
> When running different network performance microbenchmarks, I started
> to notice that performance was reduced (slightly) when machines had
> longer uptimes. I believe the cause was 'skbuff_head_cache' got
> aliased/merged into the general slub for 256 bytes sized objects (with
> my kernel config, without CONFIG_HARDENED_USERCOPY).
> 
> For SKB kmem_cache network stack have reasons for not merging, but it
> varies depending on kernel config (e.g. CONFIG_HARDENED_USERCOPY).
> We want to explicitly set SLAB_NEVER_MERGE for this kmem_cache.
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

Since this idea was revived by David [1], and neither patch worked as-is,
but yours was more complete and came first, I have fixed it up as below. The
skbuff part itself is best submitted separately afterwards so we don't
get conflicts between trees etc. Comments?

----8<----
From 485d3f58f3e797306b803102573e7f1367af2ad2 Mon Sep 17 00:00:00 2001
From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Tue, 17 Jan 2023 14:40:00 +0100
Subject: [PATCH] mm/slab: introduce kmem_cache flag SLAB_NO_MERGE

Allow API users of kmem_cache_create to specify that they don't want
any slab merging or aliasing (with similarly sized objects). Use this in
kfence_test.

The SKB (sk_buff) kmem_cache slab is critical for network performance.
The network stack uses the kmem_cache_{alloc,free}_bulk APIs to gain
performance by amortising the alloc/free cost.

For the bulk API to perform efficiently, the slab fragmentation needs to
be low. Especially for the SLUB allocator, the efficiency of the bulk
free API depends on objects belonging to the same slab (page).

When running different network performance microbenchmarks, I started
to notice that performance was reduced (slightly) when machines had
longer uptimes. I believe the cause was that 'skbuff_head_cache' got
aliased/merged into the general slab cache for 256-byte sized objects
(with my kernel config, without CONFIG_HARDENED_USERCOPY).

For the SKB kmem_cache the network stack has reasons for not merging,
but they vary depending on the kernel config (e.g.
CONFIG_HARDENED_USERCOPY). We want to explicitly set SLAB_NO_MERGE
for this kmem_cache.

Another use case for the flag has been described by David Sterba [1]:

> This can be used for more fine grained control over the caches or for
> debugging builds where separate slabs can verify that no objects leak.

> The slab_nomerge boot option is too coarse and would need to be
> enabled on all testing hosts. There are some other ways how to disable
> merging, e.g. a slab constructor but this disables poisoning besides
> that it adds additional overhead. Other flags are internal and may
> have other semantics.

> A concrete example what motivates the flag. During 'btrfs balance'
> slab top reported huge increase in caches like

>  1330095 1330095 100%    0.10K  34105       39    136420K Acpi-ParseExt
>  1734684 1734684 100%    0.14K  61953       28    247812K pid_namespace
>  8244036 6873075  83%    0.11K 229001       36    916004K khugepaged_mm_slot

> which was confusing and that it's because of slab merging was not the
> first idea.  After rebooting with slab_nomerge all the caches were
> from btrfs_ namespace as expected.

[1] https://lore.kernel.org/all/20230524101748.30714-1-dsterba@suse.com/

[ vbabka@suse.cz: rename to SLAB_NO_MERGE, change the flag value to the
  one proposed by David so it does not collide with internal SLAB/SLUB
  flags, write a comment for the flag, expand changelog, drop the skbuff
  part to be handled separately ]

Reported-by: David Sterba <dsterba@suse.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/slab.h    | 12 ++++++++++++
 mm/kfence/kfence_test.c |  7 +++----
 mm/slab.h               |  5 +++--
 mm/slab_common.c        |  2 +-
 4 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 6b3e155b70bf..72bc906d8bc7 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -106,6 +106,18 @@
 /* Avoid kmemleak tracing */
 #define SLAB_NOLEAKTRACE	((slab_flags_t __force)0x00800000U)
 
+/*
+ * Prevent merging with compatible kmem caches. This flag should be used
+ * cautiously. Valid use cases:
+ *
+ * - caches created for self-tests (e.g. kunit)
+ * - general caches created and used by a subsystem, only when a
+ *   (subsystem-specific) debug option is enabled
+ * - performance critical caches, should be very rare and consulted with slab
+ *   maintainers, and not used together with CONFIG_SLUB_TINY
+ */
+#define SLAB_NO_MERGE		((slab_flags_t __force)0x01000000U)
+
 /* Fault injection mark */
 #ifdef CONFIG_FAILSLAB
 # define SLAB_FAILSLAB		((slab_flags_t __force)0x02000000U)
diff --git a/mm/kfence/kfence_test.c b/mm/kfence/kfence_test.c
index 6aee19a79236..9e008a336d9f 100644
--- a/mm/kfence/kfence_test.c
+++ b/mm/kfence/kfence_test.c
@@ -191,11 +191,10 @@ static size_t setup_test_cache(struct kunit *test, size_t size, slab_flags_t fla
 	kunit_info(test, "%s: size=%zu, ctor=%ps\n", __func__, size, ctor);
 
 	/*
-	 * Use SLAB_NOLEAKTRACE to prevent merging with existing caches. Any
-	 * other flag in SLAB_NEVER_MERGE also works. Use SLAB_ACCOUNT to
-	 * allocate via memcg, if enabled.
+	 * Use SLAB_NO_MERGE to prevent merging with existing caches.
+	 * Use SLAB_ACCOUNT to allocate via memcg, if enabled.
 	 */
-	flags |= SLAB_NOLEAKTRACE | SLAB_ACCOUNT;
+	flags |= SLAB_NO_MERGE | SLAB_ACCOUNT;
 	test_cache = kmem_cache_create("test", size, 1, flags, ctor);
 	KUNIT_ASSERT_TRUE_MSG(test, test_cache, "could not create cache");
 
diff --git a/mm/slab.h b/mm/slab.h
index f01ac256a8f5..9005ddc51cf8 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -294,11 +294,11 @@ static inline bool is_kmalloc_cache(struct kmem_cache *s)
 #if defined(CONFIG_SLAB)
 #define SLAB_CACHE_FLAGS (SLAB_MEM_SPREAD | SLAB_NOLEAKTRACE | \
 			  SLAB_RECLAIM_ACCOUNT | SLAB_TEMPORARY | \
-			  SLAB_ACCOUNT)
+			  SLAB_ACCOUNT | SLAB_NO_MERGE)
 #elif defined(CONFIG_SLUB)
 #define SLAB_CACHE_FLAGS (SLAB_NOLEAKTRACE | SLAB_RECLAIM_ACCOUNT | \
 			  SLAB_TEMPORARY | SLAB_ACCOUNT | \
-			  SLAB_NO_USER_FLAGS | SLAB_KMALLOC)
+			  SLAB_NO_USER_FLAGS | SLAB_KMALLOC | SLAB_NO_MERGE)
 #else
 #define SLAB_CACHE_FLAGS (SLAB_NOLEAKTRACE)
 #endif
@@ -319,6 +319,7 @@ static inline bool is_kmalloc_cache(struct kmem_cache *s)
 			      SLAB_TEMPORARY | \
 			      SLAB_ACCOUNT | \
 			      SLAB_KMALLOC | \
+			      SLAB_NO_MERGE | \
 			      SLAB_NO_USER_FLAGS)
 
 bool __kmem_cache_empty(struct kmem_cache *);
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 607249785c07..0e0a617eae7d 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -47,7 +47,7 @@ static DECLARE_WORK(slab_caches_to_rcu_destroy_work,
  */
 #define SLAB_NEVER_MERGE (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
 		SLAB_TRACE | SLAB_TYPESAFE_BY_RCU | SLAB_NOLEAKTRACE | \
-		SLAB_FAILSLAB | kasan_never_merge())
+		SLAB_FAILSLAB | SLAB_NO_MERGE | kasan_never_merge())
 
 #define SLAB_MERGE_SAME (SLAB_RECLAIM_ACCOUNT | SLAB_CACHE_DMA | \
 			 SLAB_CACHE_DMA32 | SLAB_ACCOUNT)

Jesper Dangaard Brouer May 31, 2023, 1:59 p.m. UTC | #8
On 31/05/2023 14.03, Vlastimil Babka wrote:
> On 1/17/23 14:40, Jesper Dangaard Brouer wrote:
>> Allow API users of kmem_cache_create to specify that they don't want
>> any slab merge or aliasing (with similar sized objects). Use this in
>> network stack and kfence_test.
>>
>> The SKB (sk_buff) kmem_cache slab is critical for network performance.
>> Network stack uses kmem_cache_{alloc,free}_bulk APIs to gain
>> performance by amortising the alloc/free cost.
>>
>> For the bulk API to perform efficiently the slub fragmentation need to
>> be low. Especially for the SLUB allocator, the efficiency of bulk free
>> API depend on objects belonging to the same slab (page).
>>
>> When running different network performance microbenchmarks, I started
>> to notice that performance was reduced (slightly) when machines had
>> longer uptimes. I believe the cause was 'skbuff_head_cache' got
>> aliased/merged into the general slub for 256 bytes sized objects (with
>> my kernel config, without CONFIG_HARDENED_USERCOPY).
>>
>> For SKB kmem_cache network stack have reasons for not merging, but it
>> varies depending on kernel config (e.g. CONFIG_HARDENED_USERCOPY).
>> We want to explicitly set SLAB_NEVER_MERGE for this kmem_cache.
>>
>> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> 
> Since this idea was revived by David [1], and neither patch worked as is,
> but yours was more complete and first, I have fixed it up as below. The
> skbuff part itself will be best submitted separately afterwards so we don't
> get conflicts between trees etc. Comments?
> 

Thanks for following up on this! :-)
I like the adjustments, ACKed below.

I'm okay with submitting changes to net/core/skbuff.c separately.

> ----8<----
>  From 485d3f58f3e797306b803102573e7f1367af2ad2 Mon Sep 17 00:00:00 2001
> From: Jesper Dangaard Brouer <brouer@redhat.com>
> Date: Tue, 17 Jan 2023 14:40:00 +0100
> Subject: [PATCH] mm/slab: introduce kmem_cache flag SLAB_NO_MERGE
> 
> Allow API users of kmem_cache_create to specify that they don't want
> any slab merge or aliasing (with similar sized objects). Use this in
> kfence_test.
> 
> The SKB (sk_buff) kmem_cache slab is critical for network performance.
> Network stack uses kmem_cache_{alloc,free}_bulk APIs to gain
> performance by amortising the alloc/free cost.
> 
> For the bulk API to perform efficiently the slub fragmentation need to
> be low. Especially for the SLUB allocator, the efficiency of bulk free
> API depend on objects belonging to the same slab (page).
> 
> When running different network performance microbenchmarks, I started
> to notice that performance was reduced (slightly) when machines had
> longer uptimes. I believe the cause was 'skbuff_head_cache' got
> aliased/merged into the general slub for 256 bytes sized objects (with
> my kernel config, without CONFIG_HARDENED_USERCOPY).
> 
> For SKB kmem_cache network stack have reasons for not merging, but it
> varies depending on kernel config (e.g. CONFIG_HARDENED_USERCOPY).
> We want to explicitly set SLAB_NO_MERGE for this kmem_cache.
> 
> Another use case for the flag has been described by David Sterba [1]:
> 
>> This can be used for more fine grained control over the caches or for
>> debugging builds where separate slabs can verify that no objects leak.
> 
>> The slab_nomerge boot option is too coarse and would need to be
>> enabled on all testing hosts. There are some other ways how to disable
>> merging, e.g. a slab constructor but this disables poisoning besides
>> that it adds additional overhead. Other flags are internal and may
>> have other semantics.
> 
>> A concrete example what motivates the flag. During 'btrfs balance'
>> slab top reported huge increase in caches like
> 
>>   1330095 1330095 100%    0.10K  34105       39    136420K Acpi-ParseExt
>>   1734684 1734684 100%    0.14K  61953       28    247812K pid_namespace
>>   8244036 6873075  83%    0.11K 229001       36    916004K khugepaged_mm_slot
> 
>> which was confusing and that it's because of slab merging was not the
>> first idea.  After rebooting with slab_nomerge all the caches were
>> from btrfs_ namespace as expected.
> 
> [1] https://lore.kernel.org/all/20230524101748.30714-1-dsterba@suse.com/
> 
> [ vbabka@suse.cz: rename to SLAB_NO_MERGE, change the flag value to the
>    one proposed by David so it does not collide with internal SLAB/SLUB
>    flags, write a comment for the flag, expand changelog, drop the skbuff
>    part to be handled separately ]
> 
> Reported-by: David Sterba <dsterba@suse.com>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

> ---
>   include/linux/slab.h    | 12 ++++++++++++
>   mm/kfence/kfence_test.c |  7 +++----
>   mm/slab.h               |  5 +++--
>   mm/slab_common.c        |  2 +-
>   4 files changed, 19 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 6b3e155b70bf..72bc906d8bc7 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -106,6 +106,18 @@
>   /* Avoid kmemleak tracing */
>   #define SLAB_NOLEAKTRACE	((slab_flags_t __force)0x00800000U)
>   
> +/*
> + * Prevent merging with compatible kmem caches. This flag should be used
> + * cautiously. Valid use cases:
> + *
> + * - caches created for self-tests (e.g. kunit)
> + * - general caches created and used by a subsystem, only when a
> + *   (subsystem-specific) debug option is enabled
> + * - performance critical caches, should be very rare and consulted with slab
> + *   maintainers, and not used together with CONFIG_SLUB_TINY
> + */
> +#define SLAB_NO_MERGE		((slab_flags_t __force)0x01000000U)
> +
>   /* Fault injection mark */
>   #ifdef CONFIG_FAILSLAB
>   # define SLAB_FAILSLAB		((slab_flags_t __force)0x02000000U)
> diff --git a/mm/kfence/kfence_test.c b/mm/kfence/kfence_test.c
> index 6aee19a79236..9e008a336d9f 100644
> --- a/mm/kfence/kfence_test.c
> +++ b/mm/kfence/kfence_test.c
> @@ -191,11 +191,10 @@ static size_t setup_test_cache(struct kunit *test, size_t size, slab_flags_t fla
>   	kunit_info(test, "%s: size=%zu, ctor=%ps\n", __func__, size, ctor);
>   
>   	/*
> -	 * Use SLAB_NOLEAKTRACE to prevent merging with existing caches. Any
> -	 * other flag in SLAB_NEVER_MERGE also works. Use SLAB_ACCOUNT to
> -	 * allocate via memcg, if enabled.
> +	 * Use SLAB_NO_MERGE to prevent merging with existing caches.
> +	 * Use SLAB_ACCOUNT to allocate via memcg, if enabled.
>   	 */
> -	flags |= SLAB_NOLEAKTRACE | SLAB_ACCOUNT;
> +	flags |= SLAB_NO_MERGE | SLAB_ACCOUNT;
>   	test_cache = kmem_cache_create("test", size, 1, flags, ctor);
>   	KUNIT_ASSERT_TRUE_MSG(test, test_cache, "could not create cache");
>   
> diff --git a/mm/slab.h b/mm/slab.h
> index f01ac256a8f5..9005ddc51cf8 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -294,11 +294,11 @@ static inline bool is_kmalloc_cache(struct kmem_cache *s)
>   #if defined(CONFIG_SLAB)
>   #define SLAB_CACHE_FLAGS (SLAB_MEM_SPREAD | SLAB_NOLEAKTRACE | \
>   			  SLAB_RECLAIM_ACCOUNT | SLAB_TEMPORARY | \
> -			  SLAB_ACCOUNT)
> +			  SLAB_ACCOUNT | SLAB_NO_MERGE)
>   #elif defined(CONFIG_SLUB)
>   #define SLAB_CACHE_FLAGS (SLAB_NOLEAKTRACE | SLAB_RECLAIM_ACCOUNT | \
>   			  SLAB_TEMPORARY | SLAB_ACCOUNT | \
> -			  SLAB_NO_USER_FLAGS | SLAB_KMALLOC)
> +			  SLAB_NO_USER_FLAGS | SLAB_KMALLOC | SLAB_NO_MERGE)
>   #else
>   #define SLAB_CACHE_FLAGS (SLAB_NOLEAKTRACE)
>   #endif
> @@ -319,6 +319,7 @@ static inline bool is_kmalloc_cache(struct kmem_cache *s)
>   			      SLAB_TEMPORARY | \
>   			      SLAB_ACCOUNT | \
>   			      SLAB_KMALLOC | \
> +			      SLAB_NO_MERGE | \
>   			      SLAB_NO_USER_FLAGS)
>   
>   bool __kmem_cache_empty(struct kmem_cache *);
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 607249785c07..0e0a617eae7d 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -47,7 +47,7 @@ static DECLARE_WORK(slab_caches_to_rcu_destroy_work,
>    */
>   #define SLAB_NEVER_MERGE (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
>   		SLAB_TRACE | SLAB_TYPESAFE_BY_RCU | SLAB_NOLEAKTRACE | \
> -		SLAB_FAILSLAB | kasan_never_merge())
> +		SLAB_FAILSLAB | SLAB_NO_MERGE | kasan_never_merge())
>   
>   #define SLAB_MERGE_SAME (SLAB_RECLAIM_ACCOUNT | SLAB_CACHE_DMA | \
>   			 SLAB_CACHE_DMA32 | SLAB_ACCOUNT)

Patch

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 45af70315a94..83a89ba7c4be 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -138,6 +138,8 @@ 
 #define SLAB_SKIP_KFENCE	0
 #endif
 
+#define SLAB_NEVER_MERGE	((slab_flags_t __force)0x40000000U)
+
 /* The following flags affect the page allocator grouping pages by mobility */
 /* Objects are reclaimable */
 #ifndef CONFIG_SLUB_TINY
diff --git a/mm/kfence/kfence_test.c b/mm/kfence/kfence_test.c
index b5d66a69200d..9e83e344ee3c 100644
--- a/mm/kfence/kfence_test.c
+++ b/mm/kfence/kfence_test.c
@@ -191,11 +191,10 @@  static size_t setup_test_cache(struct kunit *test, size_t size, slab_flags_t fla
 	kunit_info(test, "%s: size=%zu, ctor=%ps\n", __func__, size, ctor);
 
 	/*
-	 * Use SLAB_NOLEAKTRACE to prevent merging with existing caches. Any
-	 * other flag in SLAB_NEVER_MERGE also works. Use SLAB_ACCOUNT to
-	 * allocate via memcg, if enabled.
+	 * Use SLAB_NEVER_MERGE to prevent merging with existing caches.
+	 * Use SLAB_ACCOUNT to allocate via memcg, if enabled.
 	 */
-	flags |= SLAB_NOLEAKTRACE | SLAB_ACCOUNT;
+	flags |= SLAB_NEVER_MERGE | SLAB_ACCOUNT;
 	test_cache = kmem_cache_create("test", size, 1, flags, ctor);
 	KUNIT_ASSERT_TRUE_MSG(test, test_cache, "could not create cache");
 
diff --git a/mm/slab.h b/mm/slab.h
index 7cc432969945..be1383176d3e 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -341,11 +341,11 @@  static inline slab_flags_t kmem_cache_flags(unsigned int object_size,
 #if defined(CONFIG_SLAB)
 #define SLAB_CACHE_FLAGS (SLAB_MEM_SPREAD | SLAB_NOLEAKTRACE | \
 			  SLAB_RECLAIM_ACCOUNT | SLAB_TEMPORARY | \
-			  SLAB_ACCOUNT)
+			  SLAB_ACCOUNT | SLAB_NEVER_MERGE)
 #elif defined(CONFIG_SLUB)
 #define SLAB_CACHE_FLAGS (SLAB_NOLEAKTRACE | SLAB_RECLAIM_ACCOUNT | \
 			  SLAB_TEMPORARY | SLAB_ACCOUNT | \
-			  SLAB_NO_USER_FLAGS | SLAB_KMALLOC)
+			  SLAB_NO_USER_FLAGS | SLAB_KMALLOC | SLAB_NEVER_MERGE)
 #else
 #define SLAB_CACHE_FLAGS (SLAB_NOLEAKTRACE)
 #endif
@@ -366,6 +366,7 @@  static inline slab_flags_t kmem_cache_flags(unsigned int object_size,
 			      SLAB_TEMPORARY | \
 			      SLAB_ACCOUNT | \
 			      SLAB_KMALLOC | \
+			      SLAB_NEVER_MERGE | \
 			      SLAB_NO_USER_FLAGS)
 
 bool __kmem_cache_empty(struct kmem_cache *);
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 1cba98acc486..269f67c5fee6 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -45,9 +45,9 @@  static DECLARE_WORK(slab_caches_to_rcu_destroy_work,
 /*
  * Set of flags that will prevent slab merging
  */
-#define SLAB_NEVER_MERGE (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
+#define SLAB_NEVER_MERGE_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER |\
 		SLAB_TRACE | SLAB_TYPESAFE_BY_RCU | SLAB_NOLEAKTRACE | \
-		SLAB_FAILSLAB | kasan_never_merge())
+		SLAB_FAILSLAB | SLAB_NEVER_MERGE | kasan_never_merge())
 
 #define SLAB_MERGE_SAME (SLAB_RECLAIM_ACCOUNT | SLAB_CACHE_DMA | \
 			 SLAB_CACHE_DMA32 | SLAB_ACCOUNT)
@@ -137,7 +137,7 @@  static unsigned int calculate_alignment(slab_flags_t flags,
  */
 int slab_unmergeable(struct kmem_cache *s)
 {
-	if (slab_nomerge || (s->flags & SLAB_NEVER_MERGE))
+	if (slab_nomerge || (s->flags & SLAB_NEVER_MERGE_FLAGS))
 		return 1;
 
 	if (s->ctor)
@@ -173,7 +173,7 @@  struct kmem_cache *find_mergeable(unsigned int size, unsigned int align,
 	size = ALIGN(size, align);
 	flags = kmem_cache_flags(size, flags, name);
 
-	if (flags & SLAB_NEVER_MERGE)
+	if (flags & SLAB_NEVER_MERGE_FLAGS)
 		return NULL;
 
 	list_for_each_entry_reverse(s, &slab_caches, list) {
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 79c9e795a964..799b9914457b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4629,12 +4629,23 @@  static void skb_extensions_init(void)
 static void skb_extensions_init(void) {}
 #endif
 
+/* The SKB kmem_cache slab is critical for network performance.  Never
+ * merge/alias the slab with similar sized objects.  This avoids fragmentation
+ * that hurts performance of kmem_cache_{alloc,free}_bulk APIs.
+ */
+#ifndef CONFIG_SLUB_TINY
+#define FLAG_SKB_NEVER_MERGE	SLAB_NEVER_MERGE
+#else /* CONFIG_SLUB_TINY - simple loop in kmem_cache_alloc_bulk */
+#define FLAG_SKB_NEVER_MERGE	0
+#endif
+
 void __init skb_init(void)
 {
 	skbuff_head_cache = kmem_cache_create_usercopy("skbuff_head_cache",
 					      sizeof(struct sk_buff),
 					      0,
-					      SLAB_HWCACHE_ALIGN|SLAB_PANIC,
+					      SLAB_HWCACHE_ALIGN|SLAB_PANIC|
+						FLAG_SKB_NEVER_MERGE,
 					      offsetof(struct sk_buff, cb),
 					      sizeof_field(struct sk_buff, cb),
 					      NULL);