| Message ID | 20221220184813.1908318-1-roman.gushchin@linux.dev (mailing list archive) |
|---|---|
| State | New |
| Series | [RFC] ipc/mqueue: introduce msg cache |
+Vlastimil

On Tue, Dec 20, 2022 at 10:48 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> Sven Luther reported a regression in the posix message queues
> performance caused by switching to the per-object tracking of
> slab objects introduced by patch series ending with the
> commit 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all
> allocations").
>
> To mitigate the regression cache allocated mqueue messages on a small
> percpu cache instead of releasing and re-allocating them every time.

Seems fine with me, but I am wondering what is stopping us from doing
this caching in the slab layer for all accounted allocations? Does this
only make sense for specific scenarios/use-cases?
On Tue, Dec 20, 2022 at 11:53:25AM -0800, Shakeel Butt wrote:
> +Vlastimil
>
> On Tue, Dec 20, 2022 at 10:48 AM Roman Gushchin
> <roman.gushchin@linux.dev> wrote:
> >
> > Sven Luther reported a regression in the posix message queues
> > performance caused by switching to the per-object tracking of
> > slab objects introduced by patch series ending with the
> > commit 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all
> > allocations").
> >
> > To mitigate the regression cache allocated mqueue messages on a small
> > percpu cache instead of releasing and re-allocating them every time.
>
> Seems fine with me, but I am wondering what is stopping us from doing
> this caching in the slab layer for all accounted allocations? Does this
> only make sense for specific scenarios/use-cases?

It's far from trivial, unfortunately. Here we have an mqueue object to
associate a percpu cache with, and the hit rate is expected to be high,
assuming the mqueue will be used to pass a lot of messages.

With a generic slab cache we return to the necessity of managing the
per-cgroup x per-slab-cache x per-cpu free list (or some other data
structure), which is already far from trivial, based on the previous
experience. It can easily lead to significant memory waste, which can
cancel out all the performance wins.

So probably we need some heuristics to allocate caches only for really
hot slab caches, and use some sort of a hash map (keyed by cgroup and
slab cache) to access freelists. What keeps me from committing more time
to this project (aside from not having the time) is the lack of real
workloads which would noticeably win from this work.

Sven provided a good example and benchmark to reproduce the regression,
so it was easy to justify the work.

Thanks!
On Tue, Dec 20, 2022 at 12:59 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Tue, Dec 20, 2022 at 11:53:25AM -0800, Shakeel Butt wrote:
> > Seems fine with me, but I am wondering what is stopping us from doing
> > this caching in the slab layer for all accounted allocations? Does this
> > only make sense for specific scenarios/use-cases?
>
> [...]
>
> So probably we need some heuristics to allocate caches only for really
> hot slab caches, and use some sort of a hash map (keyed by cgroup and
> slab cache) to access freelists.
>
> Sven provided a good example and benchmark to reproduce the regression,
> so it was easy to justify the work.

Thanks for the explanation. I think we should add this to the commit
message as well.

I do think we should have a general framework for such caching, as there
are other users (e.g. io_uring) doing the same, and some future users can
take advantage as well; e.g. I think this type of caching will be helpful
for filelock_cache too. Anyway, that can be done in the future.
On Tue, Dec 20, 2022 at 03:28:11PM -0800, Shakeel Butt wrote:
> On Tue, Dec 20, 2022 at 12:59 PM Roman Gushchin
> <roman.gushchin@linux.dev> wrote:
> >
> > [...]
> >
> > Sven provided a good example and benchmark to reproduce the regression,
> > so it was easy to justify the work.
>
> Thanks for the explanation. I think we should add this to the commit
> message as well.
>
> I do think we should have a general framework for such caching, as there
> are other users (e.g. io_uring) doing the same, and some future users can
> take advantage as well; e.g. I think this type of caching will be helpful
> for filelock_cache too. Anyway, that can be done in the future.

I agree. One way I'm thinking about is to provide an API for creating
objcg-specific slab caches. All objects belonging to such a slab cache
will belong to the same objcg. In this case the accounting can be even
faster than the previous per-page accounting. And the user of such an API
will be responsible for the lifetime of the cache, as well as for the
decision whether to use the local or the global slab cache.

Thanks!
On 20.12.22 19:48, Roman Gushchin wrote:
> Sven Luther reported a regression in the posix message queues
> performance caused by switching to the per-object tracking of
> slab objects introduced by patch series ending with the
> commit 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all
> allocations").
> [...]

Thx for looking into this!

> The last line reflects the result with this patch ("6.1-rc8+p").
>
> [1]: https://lore.kernel.org/linux-mm/Y46lqCToUa%2FBgt%2Fc@P9FQF9L96D/T/

This link...

> Reported-by: Sven Luther <Sven.Luther@windriver.com>

...should be here and look like this, as tools like regzbot otherwise
won't do the right thing:

Link: https://lore.kernel.org/linux-mm/Y46lqCToUa%2FBgt%2Fc@P9FQF9L96D/ [1]

For details, see:
https://lore.kernel.org/all/CAHk-=wjMmSZzMJ3Xnskdg4+GGz=5p5p+GSYyFBTh0f-DgvdBWg@mail.gmail.com/
https://lore.kernel.org/all/CAHk-=wgs38ZrfPvy=nOwVkVzjpM3VFU1zobP37Fwd_h9iAD5JQ@mail.gmail.com/
https://lore.kernel.org/all/CAHk-=wjxzafG-=J8oT30s7upn4RhBs6TX-uVFZ5rME+L5_DoJA@mail.gmail.com/
Documentation/process/submitting-patches.rst (http://docs.kernel.org/process/submitting-patches.html)
Documentation/process/5.Posting.rst (https://docs.kernel.org/process/5.Posting.html)

> Tested-by: Sven Luther <Sven.Luther@windriver.com>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>

#regzbot ^backmonitor: https://lore.kernel.org/linux-mm/Y46lqCToUa%2FBgt%2Fc@P9FQF9L96D/

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.
On Tue, Dec 20, 2022 at 10:48:13AM -0800, Roman Gushchin wrote:
> Sven Luther reported a regression in the posix message queues
> performance caused by switching to the per-object tracking of
> slab objects introduced by patch series ending with the
> commit 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all
> allocations").
>
> To mitigate the regression cache allocated mqueue messages on a small
> percpu cache instead of releasing and re-allocating them every time.
>
> This change brings the performance measured by a benchmark kindly
> provided by Sven [1] almost back to the original (or even better)
> numbers. Measurements results are also provided by Sven.
>
> +------------+---------------------------------+--------------------------------+
> | kernel     |    mqueue_rcv (ns)    variation |   mqueue_send (ns)   variation |
> | version    |  min  avg   max     min     avg |  min  avg   max    min     avg |
> +------------+---------------------------------+--------------------------------+
> | 4.18.45    |  351  382 17533    base    base |  383  410 13178   base    base |
> | 5.8-good   |  380  392  7156  -7,63%  -2,55% |  376  384  6225   1,86%  6,77% |
> | 5.8-bad    |  524  530  5310 -33,02% -27,92% |  512  519  8775 -25,20% -21,00% |
> | 5.10       |  520  533  4078 -32,20% -28,33% |  518  534  8108 -26,06% -23,22% |
> | 5.15       |  431  444  8440 -18,56% -13,96% |  425  437  6170  -9,88%  -6,18% |
> | 6.0.3      |  474  614  3881 -25,95% -37,79% |  482  693   931 -20,54% -40,84% |
> | 6.1-rc8    |  496  509  8804 -29,23% -24,95% |  493  512  5748 -22,31% -19,92% |
> | 6.1-rc8+p  |  392  397  5479 -10,46%  -3,78% |  364  369 10776   5,22%  11,11% |
> +------------+---------------------------------+--------------------------------+
>
> The last line reflects the result with this patch ("6.1-rc8+p").
>
> [1]: https://lore.kernel.org/linux-mm/Y46lqCToUa%2FBgt%2Fc@P9FQF9L96D/T/
>
> Reported-by: Sven Luther <Sven.Luther@windriver.com>
> Tested-by: Sven Luther <Sven.Luther@windriver.com>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> ---
>  include/linux/memcontrol.h |  12 +++++
>  ipc/mqueue.c               |  20 ++++++--
>  ipc/msg.c                  |  12 ++---
>  ipc/msgutil.c              | 101 +++++++++++++++++++++++++++++++++----
>  ipc/util.h                 |   8 ++-
>  mm/memcontrol.c            |  62 +++++++++++++++++++++++
>  6 files changed, 194 insertions(+), 21 deletions(-)
>
...
> diff --git a/ipc/msgutil.c b/ipc/msgutil.c
> index d0a0e877cadd..8667708fc00a 100644
> --- a/ipc/msgutil.c
> +++ b/ipc/msgutil.c
...
> @@ -39,16 +40,76 @@ struct msg_msgseg {
...
> +static struct msg_msg *alloc_msg(size_t len, struct msg_cache *cache)
>  {
>  	struct msg_msg *msg;
>  	struct msg_msgseg **pseg;
>  	size_t alen;
>
> +	if (cache) {
> +		struct pcpu_msg_cache *pc;
> +
> +		msg = NULL;
> +		pc = get_cpu_ptr(cache->pcpu_cache);
> +		if (pc->msg && pc->len == len) {
> +			struct obj_cgroup *objcg;
> +
> +			rcu_read_lock();
> +			objcg = obj_cgroup_from_current();
> +			if (objcg == pc->objcg) {
> +				msg = pc->msg;
> +				pc->msg = NULL;
> +				obj_cgroup_put(pc->objcg);
> +			}
> +			rcu_read_unlock();
> +		}
> +		put_cpu_ptr(cache->pcpu_cache);
> +		if (msg)
> +			return msg;
> +	}
> +
>  	alen = min(len, DATALEN_MSG);
>  	msg = kmalloc(sizeof(*msg) + alen, GFP_KERNEL_ACCOUNT);
>  	if (msg == NULL)
> @@ -77,18 +138,19 @@ static struct msg_msg *alloc_msg(size_t len)
>  	return msg;
>
>  out_err:
> -	free_msg(msg);
> +	free_msg(msg, NULL);
>  	return NULL;
>  }
>
> -struct msg_msg *load_msg(const void __user *src, size_t len)
> +struct msg_msg *load_msg(const void __user *src, size_t len,
> +			 struct msg_cache *cache)
>  {
>  	struct msg_msg *msg;
>  	struct msg_msgseg *seg;
>  	int err = -EFAULT;
>  	size_t alen;
>
> -	msg = alloc_msg(len);
> +	msg = alloc_msg(len, cache);
>  	if (msg == NULL)
>  		return ERR_PTR(-ENOMEM);
>
> @@ -104,14 +166,16 @@ struct msg_msg *load_msg(const void __user *src, size_t len)
>  		goto out_err;
>  	}
>
> -	err = security_msg_msg_alloc(msg);
> -	if (err)
> -		goto out_err;
> +	if (!msg->security) {
> +		err = security_msg_msg_alloc(msg);
> +		if (err)
> +			goto out_err;
> +	}
>
>  	return msg;
>
>  out_err:
> -	free_msg(msg);
> +	free_msg(msg, NULL);
>  	return ERR_PTR(err);
>  }
>  #ifdef CONFIG_CHECKPOINT_RESTORE
> @@ -166,10 +230,29 @@ int store_msg(void __user *dest, struct msg_msg *msg, size_t len)
>  	return 0;
>  }
>
> -void free_msg(struct msg_msg *msg)
> +void free_msg(struct msg_msg *msg, struct msg_cache *cache)
>  {
>  	struct msg_msgseg *seg;
>
> +	if (cache) {
> +		struct pcpu_msg_cache *pc;
> +		bool cached = false;
> +
> +		pc = get_cpu_ptr(cache->pcpu_cache);
> +		if (!pc->msg) {
> +			pc->objcg = get_obj_cgroup_from_slab_obj(msg);
> +			pc->len = msg->m_ts;
> +			pc->msg = msg;
> +
> +			if (pc->objcg)
> +				cached = true;
> +		}

Hi Roman,

It seems that this is kind of tailored to the ideal conditions
implemented by the test case: i.e., a single, fixed-size message being
passed back and forth on a single cpu. Does that actually represent the
production workload?

I'm a little curious if/how this might work for workloads that might
involve more variable-sized messages, deeper queue depths (i.e. sending
more than one message before attempting a recv) and more tasks across
different cpus. For example, it looks like if an "uncommonly" sized
message ended up cached on a cpu, this would always result in subsequent
misses, because the alloc side requires an exact size match and the free
side never replaces a cached msg. Hm?

Brian

> +		put_cpu_ptr(cache->pcpu_cache);
> +
> +		if (cached)
> +			return;
> +	}
> +
>  	security_msg_msg_free(msg);
>
>  	seg = msg->next;
> diff --git a/ipc/util.h b/ipc/util.h
> index b2906e366539..a2da266386aa 100644
> --- a/ipc/util.h
> +++ b/ipc/util.h
> @@ -196,8 +196,12 @@ static inline void ipc_update_pid(struct pid **pos, struct pid *pid)
>  int ipc_parse_version(int *cmd);
>  #endif
>
> -extern void free_msg(struct msg_msg *msg);
> -extern struct msg_msg *load_msg(const void __user *src, size_t len);
> +struct msg_cache;
> +extern int init_msg_cache(struct msg_cache *cache);
> +extern void free_msg_cache(struct msg_cache *cache);
> +extern void free_msg(struct msg_msg *msg, struct msg_cache *cache);
> +extern struct msg_msg *load_msg(const void __user *src, size_t len,
> +				struct msg_cache *cache);
>  extern struct msg_msg *copy_msg(struct msg_msg *src, struct msg_msg *dst);
>  extern int store_msg(void __user *dest, struct msg_msg *msg, size_t len);
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a1a35c12635e..28528b4da0fb 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3004,6 +3004,28 @@ static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
>  	return objcg;
>  }
>
> +__always_inline struct obj_cgroup *obj_cgroup_from_current(void)
> +{
> +	struct obj_cgroup *objcg = NULL;
> +	struct mem_cgroup *memcg;
> +
> +	if (memcg_kmem_bypass())
> +		return NULL;
> +
> +	if (unlikely(active_memcg()))
> +		memcg = active_memcg();
> +	else
> +		memcg = mem_cgroup_from_task(current);
> +
> +	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
> +		objcg = rcu_dereference(memcg->objcg);
> +		if (likely(objcg))
> +			return objcg;
> +	}
> +
> +	return NULL;
> +}
> +
>  __always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
>  {
>  	struct obj_cgroup *objcg = NULL;
> @@ -3046,6 +3068,46 @@ struct obj_cgroup *get_obj_cgroup_from_page(struct page *page)
>  	return objcg;
>  }
>
> +/*
> + * A passed kernel object must be a slab object or a generic kernel page.
> + * Not suitable for objects, allocated using vmalloc().
> + */
> +struct obj_cgroup *get_obj_cgroup_from_slab_obj(void *p)
> +{
> +	struct obj_cgroup *objcg = NULL;
> +	struct folio *folio;
> +
> +	if (mem_cgroup_disabled())
> +		return NULL;
> +
> +	folio = virt_to_folio(p);
> +	/*
> +	 * Slab object can be either a true slab object, which are accounted
> +	 * individually with objcg pointers stored in a separate objcg array,
> +	 * or it can a generic folio with MEMCG_DATA_KMEM flag set.
> +	 */
> +	if (folio_test_slab(folio)) {
> +		struct obj_cgroup **objcgs;
> +		struct slab *slab;
> +		unsigned int off;
> +
> +		slab = folio_slab(folio);
> +		objcgs = slab_objcgs(slab);
> +		if (!objcgs)
> +			return NULL;
> +
> +		off = obj_to_index(slab->slab_cache, slab, p);
> +		objcg = objcgs[off];
> +	} else {
> +		objcg = __folio_objcg(folio);
> +	}
> +
> +	if (objcg)
> +		obj_cgroup_get(objcg);
> +
> +	return objcg;
> +}
> +
>  static void memcg_account_kmem(struct mem_cgroup *memcg, int nr_pages)
>  {
>  	mod_memcg_state(memcg, MEMCG_KMEM, nr_pages);
> --
> 2.39.0
On Thu, Dec 22, 2022 at 06:52:06AM -0500, Brian Foster wrote:
> On Tue, Dec 20, 2022 at 10:48:13AM -0800, Roman Gushchin wrote:
> > Sven Luther reported a regression in the posix message queues
> > performance caused by switching to the per-object tracking of
> > slab objects introduced by patch series ending with the
> > commit 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all
> > allocations").
> >
> > [...]
>
> Hi Roman,
>
> It seems that this is kind of tailored to the ideal conditions
> implemented by the test case: i.e., a single, fixed-size message being
> passed back and forth on a single cpu. Does that actually represent the
> production workload?

Hi Brian!

Not really, it was all based on Sven's report and the benchmark he
provided. I assume that the benchmark emulates the production workload he
has, but it's up to him to confirm.

Personally I haven't seen a lot of mqueue usage in production, especially
for anything performance-sensitive. Also, Sven reported the regression
which was introduced in 5.9, so I take it as an indicator that not so
many users depend on the mqueue performance.

> I'm a little curious if/how this might work for workloads that might
> involve more variable-sized messages, deeper queue depths (i.e. sending
> more than one message before attempting a recv) and more tasks across
> different cpus. For example, it looks like if an "uncommonly" sized
> message ended up cached on a cpu, this would always result in subsequent
> misses, because the alloc side requires an exact size match and the free
> side never replaces a cached msg. Hm?

Yes, of course it's very primitive. But I'm not sure we want to implement
something complicated here. If there are any specific ideas, I'm totally
up for them. We can cache 2 sizes or 4 sizes or something else, but I
don't know how much value it has.

Btw, in parallel I'm working on some generic improvements to the slab
allocation path [1]; maybe it will be good enough for Sven.

[1]: https://lore.kernel.org/linux-mm/20221220204451.gm5d3pdbfvd5ki6b@google.com/T/

Thanks!
On Thu, Dec 22, 2022 at 08:37:32AM -0800, Roman Gushchin wrote: > On Thu, Dec 22, 2022 at 06:52:06AM -0500, Brian Foster wrote: > > On Tue, Dec 20, 2022 at 10:48:13AM -0800, Roman Gushchin wrote: > > > Sven Luther reported a regression in the posix message queues > > > performance caused by switching to the per-object tracking of > > > slab objects introduced by patch series ending with the > > > commit 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all > > > allocations"). > > > > > > To mitigate the regression cache allocated mqueue messages on a small > > > percpu cache instead of releasing and re-allocating them every time. > > > > > > This change brings the performance measured by a benchmark kindly > > > provided by Sven [1] almost back to the original (or even better) > > > numbers. Measurements results are also provided by Sven. > > > > > > +------------+---------------------------------+--------------------------------+ > > > | kernel | mqueue_rcv (ns) variation | mqueue_send (ns) variation | > > > | version | min avg max min avg | min avg max min avg | > > > +------------+--------------------------+---------------------------------------+ > > > | 4.18.45 | 351 382 17533 base base | 383 410 13178 base base | > > > | 5.8-good | 380 392 7156 -7,63% -2,55% | 376 384 6225 1,86% 6,77% | > > > | 5.8-bad | 524 530 5310 -33,02% -27,92% | 512 519 8775 -25,20% -21,00% | > > > | 5.10 | 520 533 4078 -32,20% -28,33% | 518 534 8108 -26,06% -23,22% | > > > | 5.15 | 431 444 8440 -18,56% -13,96% | 425 437 6170 -9,88% -6,18% | > > > | 6.0.3 | 474 614 3881 -25,95% -37,79% | 482 693 931 -20,54% -40,84% | > > > | 6.1-rc8 | 496 509 8804 -29,23% -24,95% | 493 512 5748 -22,31% -19,92% | > > > | 6.1-rc8+p | 392 397 5479 -10,46% -3,78% | 364 369 10776 5,22% 11,11% | > > > +------------+---------------------------------+--------------------------------+ > > > > > > The last line reflects the result with this patch ("6.1-rc8+p"). 
> > > > > > [1]: https://lore.kernel.org/linux-mm/Y46lqCToUa%2FBgt%2Fc@P9FQF9L96D/T/ > > > > > > Reported-by: Sven Luther <Sven.Luther@windriver.com> > > > Tested-by: Sven Luther <Sven.Luther@windriver.com> > > > Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> > > > --- > > > include/linux/memcontrol.h | 12 +++++ > > > ipc/mqueue.c | 20 ++++++-- > > > ipc/msg.c | 12 ++--- > > > ipc/msgutil.c | 101 +++++++++++++++++++++++++++++++++---- > > > ipc/util.h | 8 ++- > > > mm/memcontrol.c | 62 +++++++++++++++++++++++ > > > 6 files changed, 194 insertions(+), 21 deletions(-) > > > > > ... > > > diff --git a/ipc/msgutil.c b/ipc/msgutil.c > > > index d0a0e877cadd..8667708fc00a 100644 > > > --- a/ipc/msgutil.c > > > +++ b/ipc/msgutil.c > > ... > > > @@ -39,16 +40,76 @@ struct msg_msgseg { > > ... > > > +static struct msg_msg *alloc_msg(size_t len, struct msg_cache *cache) > > > { > > > struct msg_msg *msg; > > > struct msg_msgseg **pseg; > > > size_t alen; > > > > > > + if (cache) { > > > + struct pcpu_msg_cache *pc; > > > + > > > + msg = NULL; > > > + pc = get_cpu_ptr(cache->pcpu_cache); > > > + if (pc->msg && pc->len == len) { > > > + struct obj_cgroup *objcg; > > > + > > > + rcu_read_lock(); > > > + objcg = obj_cgroup_from_current(); > > > + if (objcg == pc->objcg) { > > > + msg = pc->msg; > > > + pc->msg = NULL; > > > + obj_cgroup_put(pc->objcg); > > > + } > > > + rcu_read_unlock(); > > > + } > > > + put_cpu_ptr(cache->pcpu_cache); > > > + if (msg) > > > + return msg; > > > + } > > > + > > > alen = min(len, DATALEN_MSG); > > > msg = kmalloc(sizeof(*msg) + alen, GFP_KERNEL_ACCOUNT); > > > if (msg == NULL) > > > @@ -77,18 +138,19 @@ static struct msg_msg *alloc_msg(size_t len) > > > return msg; > > > > > > out_err: > > > - free_msg(msg); > > > + free_msg(msg, NULL); > > > return NULL; > > > } > > > > > > -struct msg_msg *load_msg(const void __user *src, size_t len) > > > +struct msg_msg *load_msg(const void __user *src, size_t len, > > > + struct msg_cache *cache) > 
> > > {
> > >  	struct msg_msg *msg;
> > >  	struct msg_msgseg *seg;
> > >  	int err = -EFAULT;
> > >  	size_t alen;
> > >
> > > -	msg = alloc_msg(len);
> > > +	msg = alloc_msg(len, cache);
> > >  	if (msg == NULL)
> > >  		return ERR_PTR(-ENOMEM);
> > >
> > > @@ -104,14 +166,16 @@ struct msg_msg *load_msg(const void __user *src, size_t len)
> > >  			goto out_err;
> > >  	}
> > >
> > > -	err = security_msg_msg_alloc(msg);
> > > -	if (err)
> > > -		goto out_err;
> > > +	if (!msg->security) {
> > > +		err = security_msg_msg_alloc(msg);
> > > +		if (err)
> > > +			goto out_err;
> > > +	}
> > >
> > >  	return msg;
> > >
> > >  out_err:
> > > -	free_msg(msg);
> > > +	free_msg(msg, NULL);
> > >  	return ERR_PTR(err);
> > >  }
> > >  #ifdef CONFIG_CHECKPOINT_RESTORE
> > > @@ -166,10 +230,29 @@ int store_msg(void __user *dest, struct msg_msg *msg, size_t len)
> > >  	return 0;
> > >  }
> > >
> > > -void free_msg(struct msg_msg *msg)
> > > +void free_msg(struct msg_msg *msg, struct msg_cache *cache)
> > >  {
> > >  	struct msg_msgseg *seg;
> > >
> > > +	if (cache) {
> > > +		struct pcpu_msg_cache *pc;
> > > +		bool cached = false;
> > > +
> > > +		pc = get_cpu_ptr(cache->pcpu_cache);
> > > +		if (!pc->msg) {
> > > +			pc->objcg = get_obj_cgroup_from_slab_obj(msg);
> > > +			pc->len = msg->m_ts;
> > > +			pc->msg = msg;
> > > +
> > > +			if (pc->objcg)
> > > +				cached = true;
> > > +		}
> >
> > Hi Roman,
> >
> > It seems that this is kind of tailored to the ideal conditions
> > implemented by the test case: i.e., a single, fixed-size message being
> > passed back and forth on a single cpu. Does that actually represent the
> > production workload?

Hi Brian!

Not really, it was all based on Sven's report and the benchmark he provided.
I assume that the benchmark emulates the production workload he has, but
it's up to him to confirm.

Personally, I haven't seen a lot of mqueue usage in production, especially
for anything performance-sensitive. Also, Sven reported a regression that
was introduced back in 5.9, so I take it as an indicator that not many
users depend on mqueue performance.

Ok, thanks for the context. I've not either, TBH. I guess even though it's
quite helpful to simplify a bug/regression into a test program, it might be
useful to confirm that whatever improvement it brings translates back to the
production workload, if that workload happens to be more involved. If it's
really as simple as the test program, then carry on.. nothing to see here :).
But I'll defer to Sven..

> > I'm a little curious if/how this might work for workloads that might
> > involve more variable-sized messages, deeper queue depths (i.e. sending
> > more than one message before attempting a recv) and more tasks across
> > different cpus. For example, it looks like if an "uncommonly" sized
> > message ended up cached on a cpu, this would always result in subsequent
> > misses, because the alloc side requires an exact size match and the free
> > side never replaces a cached msg. Hm?

Yes, of course it's very primitive. But I'm not sure we want to implement
something complicated here. If there are any specific ideas, I'm totally up
for them. We can cache 2 sizes or 4 sizes or something else, but I don't
know how much value that has.

Agreed on avoiding unnecessary complexity. FWIW, a couple of thoughts that
crossed my mind when skimming this patch were to do something like
unconditionally replacing a cached msg with one being freed (and hope for
some size repetition?), or perhaps always replacing a cached msg with a
larger one being freed. Also, if it is possible to use a larger physically
allocated msg to send a smaller actual message (?), perhaps it might make
sense to use the cached msg so long as it is large enough for the current
request (instead of requiring an exact match). Any of those may or may not
be more complex than just caching multiple objects (i.e. they may require
tracking the physical msg size, or reallocating a msg, etc.), or just may
not add any measurable value over the current approach. Just some random
ideas. ;)

Brian

> Btw, in parallel I'm working on some generic improvements to the slab
> allocation path [1], maybe it will be good enough for Sven.
>
> [1]: https://lore.kernel.org/linux-mm/20221220204451.gm5d3pdbfvd5ki6b@google.com/T/
>
> Thanks!
>
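For what it's worth, the "keep the larger buffer" and "first fit from the cache" ideas above can be sketched together in a few lines of userspace C. This is only an illustration of the policy, not kernel code: the `sim_*` structs and names are invented for this sketch, and a real implementation would also have to track the objcg as the patch does.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Single-slot cache with two policy tweaks relative to the patch:
 * - sim_free() keeps the larger of the cached and the freed buffer;
 * - sim_alloc() reuses the cached buffer whenever its capacity is
 *   big enough, instead of requiring an exact size match.
 */
struct sim_msg {
	size_t cap;	/* physically allocated payload size */
	size_t len;	/* logical size of the current message */
};

struct sim_slot {
	struct sim_msg *msg;	/* NULL when the slot is empty */
};

struct sim_msg *sim_alloc(struct sim_slot *slot, size_t len)
{
	struct sim_msg *msg = slot->msg;

	if (msg && msg->cap >= len) {	/* cache hit: first fit */
		slot->msg = NULL;
		msg->len = len;
		return msg;
	}
	msg = malloc(sizeof(*msg));	/* cache miss: fresh allocation */
	if (!msg)
		return NULL;
	msg->cap = len;
	msg->len = len;
	return msg;
}

void sim_free(struct sim_slot *slot, struct sim_msg *msg)
{
	if (!slot->msg) {		/* empty slot: always cache */
		slot->msg = msg;
		return;
	}
	if (msg->cap > slot->msg->cap) {	/* keep the larger buffer */
		free(slot->msg);
		slot->msg = msg;
	} else {
		free(msg);
	}
}
```

With this policy, a larger message gradually takes over the slot instead of a small one pinning it forever; whether that wins anything for real traffic would of course need measurement.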
Hi, this is your Linux kernel regression tracker.

On 20.12.22 19:48, Roman Gushchin wrote:
> Sven Luther reported a regression in the posix message queues
> performance caused by switching to the per-object tracking of
> slab objects introduced by patch series ending with the
> commit 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all
> allocations").

Quick inquiry: what happened to the patch below? It was supposed to fix a
performance regression reported here:
https://lore.kernel.org/linux-mm/PH0PR11MB562641BC03630B4B7A227FD7E9189@PH0PR11MB5626.namprd11.prod.outlook.com/

But it seems this fell through the cracks after Christmas (side note: /me
should have sent this mail way earlier)? Or was it dropped for good reasons?
Or was it merged and I'm just too stupid to find it?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot ^backmonitor: https://lore.kernel.org/linux-mm/PH0PR11MB562641BC03630B4B7A227FD7E9189@PH0PR11MB5626.namprd11.prod.outlook.com/
#regzbot poke

> To mitigate the regression, cache allocated mqueue messages in a small
> percpu cache instead of releasing and re-allocating them every time.
>
> This change brings the performance measured by a benchmark kindly
> provided by Sven [1] almost back to the original (or even better)
> numbers. Measurement results are also provided by Sven.
>
> +------------+---------------------------------+---------------------------------+
> | kernel     |   mqueue_rcv (ns)   variation   |   mqueue_send (ns)  variation   |
> | version    | min avg   max     min     avg   | min avg   max     min     avg   |
> +------------+---------------------------------+---------------------------------+
> | 4.18.45    | 351 382 17533    base    base   | 383 410 13178    base    base   |
> | 5.8-good   | 380 392  7156  -7,63%  -2,55%   | 376 384  6225   1,86%   6,77%   |
> | 5.8-bad    | 524 530  5310 -33,02% -27,92%   | 512 519  8775 -25,20% -21,00%   |
> | 5.10       | 520 533  4078 -32,20% -28,33%   | 518 534  8108 -26,06% -23,22%   |
> | 5.15       | 431 444  8440 -18,56% -13,96%   | 425 437  6170  -9,88%  -6,18%   |
> | 6.0.3      | 474 614  3881 -25,95% -37,79%   | 482 693   931 -20,54% -40,84%   |
> | 6.1-rc8    | 496 509  8804 -29,23% -24,95%   | 493 512  5748 -22,31% -19,92%   |
> | 6.1-rc8+p  | 392 397  5479 -10,46%  -3,78%   | 364 369 10776   5,22%  11,11%   |
> +------------+---------------------------------+---------------------------------+
>
> The last line reflects the result with this patch ("6.1-rc8+p").
>
> [1]: https://lore.kernel.org/linux-mm/Y46lqCToUa%2FBgt%2Fc@P9FQF9L96D/T/
>
> Reported-by: Sven Luther <Sven.Luther@windriver.com>
> Tested-by: Sven Luther <Sven.Luther@windriver.com>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> ---
>  include/linux/memcontrol.h |  12 +++++
>  ipc/mqueue.c               |  20 ++++++--
>  ipc/msg.c                  |  12 ++---
>  ipc/msgutil.c              | 101 +++++++++++++++++++++++++++++++++----
>  ipc/util.h                 |   8 ++-
>  mm/memcontrol.c            |  62 +++++++++++++++++++++++
>  6 files changed, 194 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e1644a24009c..8b7f00d562a2 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1734,8 +1734,10 @@ bool mem_cgroup_kmem_disabled(void);
>  int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order);
>  void __memcg_kmem_uncharge_page(struct page *page, int order);
>
> +struct obj_cgroup *obj_cgroup_from_current(void);
>  struct obj_cgroup *get_obj_cgroup_from_current(void);
>  struct obj_cgroup *get_obj_cgroup_from_page(struct page *page);
> +struct obj_cgroup *get_obj_cgroup_from_slab_obj(void *p);
>
>  int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
>  void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
> @@ -1813,11 +1815,21 @@ static inline void __memcg_kmem_uncharge_page(struct page *page, int order)
>  {
>  }
>
> +static inline struct obj_cgroup *obj_cgroup_from_current(void)
> +{
> +	return NULL;
> +}
> +
>  static inline struct obj_cgroup *get_obj_cgroup_from_page(struct page *page)
>  {
>  	return NULL;
>  }
>
> +static inline struct obj_cgroup *get_obj_cgroup_from_slab_obj(void *p)
> +{
> +	return NULL;
> +}
> +
>  static inline bool memcg_kmem_enabled(void)
>  {
>  	return false;
> diff --git a/ipc/mqueue.c b/ipc/mqueue.c
> index 467a194b8a2e..75bf83a8a36f 100644
> --- a/ipc/mqueue.c
> +++ b/ipc/mqueue.c
> @@ -131,6 +131,11 @@ struct ext_wait_queue {	/* queue of sleeping tasks */
>  	int state;		/* one of STATE_* values */
>  };
>
> +struct pcpu_msg_cache;
> +struct msg_cache {
> +	struct pcpu_msg_cache __percpu *pcpu_cache;
> +};
> +
>  struct mqueue_inode_info {
>  	spinlock_t lock;
>  	struct inode vfs_inode;
> @@ -152,6 +157,8 @@ struct mqueue_inode_info {
>  	/* for tasks waiting for free space and messages, respectively */
>  	struct ext_wait_queue e_wait_q[2];
>
> +	struct msg_cache msg_cache;
> +
>  	unsigned long qsize; /* size of queue in memory (sum of all msgs) */
>  };
>
> @@ -368,6 +375,9 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
>  		mq_bytes = info->attr.mq_maxmsg * info->attr.mq_msgsize;
>  		if (mq_bytes + mq_treesize < mq_bytes)
>  			goto out_inode;
> +		ret = init_msg_cache(&info->msg_cache);
> +		if (ret)
> +			goto out_inode;
>  		mq_bytes += mq_treesize;
>  		info->ucounts = get_ucounts(current_ucounts());
>  		if (info->ucounts) {
> @@ -531,9 +541,11 @@ static void mqueue_evict_inode(struct inode *inode)
>
>  	list_for_each_entry_safe(msg, nmsg, &tmp_msg, m_list) {
>  		list_del(&msg->m_list);
> -		free_msg(msg);
> +		free_msg(msg, NULL);
>  	}
>
> +	free_msg_cache(&info->msg_cache);
> +
>  	if (info->ucounts) {
>  		unsigned long mq_bytes, mq_treesize;
>
> @@ -1108,7 +1120,7 @@ static int do_mq_timedsend(mqd_t mqdes, const char __user *u_msg_ptr,
>
>  	/* First try to allocate memory, before doing anything with
>  	 * existing queues. */
> -	msg_ptr = load_msg(u_msg_ptr, msg_len);
> +	msg_ptr = load_msg(u_msg_ptr, msg_len, &info->msg_cache);
>  	if (IS_ERR(msg_ptr)) {
>  		ret = PTR_ERR(msg_ptr);
>  		goto out_fput;
> @@ -1170,7 +1182,7 @@ static int do_mq_timedsend(mqd_t mqdes, const char __user *u_msg_ptr,
>  	wake_up_q(&wake_q);
>  out_free:
>  	if (ret)
> -		free_msg(msg_ptr);
> +		free_msg(msg_ptr, &info->msg_cache);
>  out_fput:
>  	fdput(f);
>  out:
> @@ -1273,7 +1285,7 @@ static int do_mq_timedreceive(mqd_t mqdes, char __user *u_msg_ptr,
>  			store_msg(u_msg_ptr, msg_ptr, msg_ptr->m_ts)) {
>  			ret = -EFAULT;
>  		}
> -		free_msg(msg_ptr);
> +		free_msg(msg_ptr, &info->msg_cache);
>  	}
>  out_fput:
>  	fdput(f);
> diff --git a/ipc/msg.c b/ipc/msg.c
> index fd08b3cb36d7..fcc09f848490 100644
> --- a/ipc/msg.c
> +++ b/ipc/msg.c
> @@ -287,7 +287,7 @@ static void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
>
>  	list_for_each_entry_safe(msg, t, &msq->q_messages, m_list) {
>  		percpu_counter_sub_local(&ns->percpu_msg_hdrs, 1);
> -		free_msg(msg);
> +		free_msg(msg, NULL);
>  	}
>  	percpu_counter_sub_local(&ns->percpu_msg_bytes, msq->q_cbytes);
>  	ipc_update_pid(&msq->q_lspid, NULL);
> @@ -861,7 +861,7 @@ static long do_msgsnd(int msqid, long mtype, void __user *mtext,
>  	if (mtype < 1)
>  		return -EINVAL;
>
> -	msg = load_msg(mtext, msgsz);
> +	msg = load_msg(mtext, msgsz, NULL);
>  	if (IS_ERR(msg))
>  		return PTR_ERR(msg);
>
> @@ -954,7 +954,7 @@ static long do_msgsnd(int msqid, long mtype, void __user *mtext,
>  out_unlock1:
>  	rcu_read_unlock();
>  	if (msg != NULL)
> -		free_msg(msg);
> +		free_msg(msg, NULL);
>  	return err;
>  }
>
> @@ -1049,7 +1049,7 @@ static inline struct msg_msg *prepare_copy(void __user *buf, size_t bufsz)
>  	/*
>  	 * Create dummy message to copy real message to.
>  	 */
> -	copy = load_msg(buf, bufsz);
> +	copy = load_msg(buf, bufsz, NULL);
>  	if (!IS_ERR(copy))
>  		copy->m_ts = bufsz;
>  	return copy;
> @@ -1058,7 +1058,7 @@ static inline struct msg_msg *prepare_copy(void __user *buf, size_t bufsz)
>  static inline void free_copy(struct msg_msg *copy)
>  {
>  	if (copy)
> -		free_msg(copy);
> +		free_msg(copy, NULL);
>  }
>  #else
>  static inline struct msg_msg *prepare_copy(void __user *buf, size_t bufsz)
> @@ -1256,7 +1256,7 @@ static long do_msgrcv(int msqid, void __user *buf, size_t bufsz, long msgtyp, in
>  	}
>
>  	bufsz = msg_handler(buf, msg, bufsz);
> -	free_msg(msg);
> +	free_msg(msg, NULL);
>
>  	return bufsz;
>  }
> diff --git a/ipc/msgutil.c b/ipc/msgutil.c
> index d0a0e877cadd..8667708fc00a 100644
> --- a/ipc/msgutil.c
> +++ b/ipc/msgutil.c
> @@ -15,6 +15,7 @@
>  #include <linux/proc_ns.h>
>  #include <linux/uaccess.h>
>  #include <linux/sched.h>
> +#include <linux/memcontrol.h>
>
>  #include "util.h"
>
> @@ -39,16 +40,76 @@ struct msg_msgseg {
>  	/* the next part of the message follows immediately */
>  };
>
> +struct pcpu_msg_cache {
> +	struct msg_msg *msg;
> +	struct obj_cgroup *objcg;
> +	size_t len;
> +};
> +
> +struct msg_cache {
> +	struct pcpu_msg_cache __percpu *pcpu_cache;
> +};
> +
>  #define DATALEN_MSG	((size_t)PAGE_SIZE-sizeof(struct msg_msg))
>  #define DATALEN_SEG	((size_t)PAGE_SIZE-sizeof(struct msg_msgseg))
>
> +int init_msg_cache(struct msg_cache *cache)
> +{
> +	cache->pcpu_cache = alloc_percpu(struct pcpu_msg_cache);
> +	if (!cache->pcpu_cache)
> +		return -ENOMEM;
> +
> +	return 0;
> +}
>
> -static struct msg_msg *alloc_msg(size_t len)
> +void free_msg_cache(struct msg_cache *cache)
> +{
> +	int cpu;
> +
> +	if (!cache->pcpu_cache)
> +		return;
> +
> +	for_each_possible_cpu(cpu) {
> +		struct pcpu_msg_cache *pc = per_cpu_ptr(cache->pcpu_cache, cpu);
> +
> +		if (pc->msg) {
> +			if (pc->objcg)
> +				obj_cgroup_put(pc->objcg);
> +			free_msg(pc->msg, NULL);
> +		}
> +	}
> +
> +	free_percpu(cache->pcpu_cache);
> +}
> +
> +static struct msg_msg *alloc_msg(size_t len, struct msg_cache *cache)
>  {
>  	struct msg_msg *msg;
>  	struct msg_msgseg **pseg;
>  	size_t alen;
>
> +	if (cache) {
> +		struct pcpu_msg_cache *pc;
> +
> +		msg = NULL;
> +		pc = get_cpu_ptr(cache->pcpu_cache);
> +		if (pc->msg && pc->len == len) {
> +			struct obj_cgroup *objcg;
> +
> +			rcu_read_lock();
> +			objcg = obj_cgroup_from_current();
> +			if (objcg == pc->objcg) {
> +				msg = pc->msg;
> +				pc->msg = NULL;
> +				obj_cgroup_put(pc->objcg);
> +			}
> +			rcu_read_unlock();
> +		}
> +		put_cpu_ptr(cache->pcpu_cache);
> +		if (msg)
> +			return msg;
> +	}
> +
>  	alen = min(len, DATALEN_MSG);
>  	msg = kmalloc(sizeof(*msg) + alen, GFP_KERNEL_ACCOUNT);
>  	if (msg == NULL)
> @@ -77,18 +138,19 @@ static struct msg_msg *alloc_msg(size_t len)
>  	return msg;
>
>  out_err:
> -	free_msg(msg);
> +	free_msg(msg, NULL);
>  	return NULL;
>  }
>
> -struct msg_msg *load_msg(const void __user *src, size_t len)
> +struct msg_msg *load_msg(const void __user *src, size_t len,
> +			 struct msg_cache *cache)
>  {
>  	struct msg_msg *msg;
>  	struct msg_msgseg *seg;
>  	int err = -EFAULT;
>  	size_t alen;
>
> -	msg = alloc_msg(len);
> +	msg = alloc_msg(len, cache);
>  	if (msg == NULL)
>  		return ERR_PTR(-ENOMEM);
>
> @@ -104,14 +166,16 @@ struct msg_msg *load_msg(const void __user *src, size_t len)
>  			goto out_err;
>  	}
>
> -	err = security_msg_msg_alloc(msg);
> -	if (err)
> -		goto out_err;
> +	if (!msg->security) {
> +		err = security_msg_msg_alloc(msg);
> +		if (err)
> +			goto out_err;
> +	}
>
>  	return msg;
>
>  out_err:
> -	free_msg(msg);
> +	free_msg(msg, NULL);
>  	return ERR_PTR(err);
>  }
>  #ifdef CONFIG_CHECKPOINT_RESTORE
> @@ -166,10 +230,29 @@ int store_msg(void __user *dest, struct msg_msg *msg, size_t len)
>  	return 0;
>  }
>
> -void free_msg(struct msg_msg *msg)
> +void free_msg(struct msg_msg *msg, struct msg_cache *cache)
>  {
>  	struct msg_msgseg *seg;
>
> +	if (cache) {
> +		struct pcpu_msg_cache *pc;
> +		bool cached = false;
> +
> +		pc = get_cpu_ptr(cache->pcpu_cache);
> +		if (!pc->msg) {
> +			pc->objcg = get_obj_cgroup_from_slab_obj(msg);
> +			pc->len = msg->m_ts;
> +			pc->msg = msg;
> +
> +			if (pc->objcg)
> +				cached = true;
> +		}
> +		put_cpu_ptr(cache->pcpu_cache);
> +
> +		if (cached)
> +			return;
> +	}
> +
>  	security_msg_msg_free(msg);
>
>  	seg = msg->next;
> diff --git a/ipc/util.h b/ipc/util.h
> index b2906e366539..a2da266386aa 100644
> --- a/ipc/util.h
> +++ b/ipc/util.h
> @@ -196,8 +196,12 @@ static inline void ipc_update_pid(struct pid **pos, struct pid *pid)
>  int ipc_parse_version(int *cmd);
>  #endif
>
> -extern void free_msg(struct msg_msg *msg);
> -extern struct msg_msg *load_msg(const void __user *src, size_t len);
> +struct msg_cache;
> +extern int init_msg_cache(struct msg_cache *cache);
> +extern void free_msg_cache(struct msg_cache *cache);
> +extern void free_msg(struct msg_msg *msg, struct msg_cache *cache);
> +extern struct msg_msg *load_msg(const void __user *src, size_t len,
> +				struct msg_cache *cache);
>  extern struct msg_msg *copy_msg(struct msg_msg *src, struct msg_msg *dst);
>  extern int store_msg(void __user *dest, struct msg_msg *msg, size_t len);
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a1a35c12635e..28528b4da0fb 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3004,6 +3004,28 @@ static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
>  	return objcg;
>  }
>
> +__always_inline struct obj_cgroup *obj_cgroup_from_current(void)
> +{
> +	struct obj_cgroup *objcg = NULL;
> +	struct mem_cgroup *memcg;
> +
> +	if (memcg_kmem_bypass())
> +		return NULL;
> +
> +	if (unlikely(active_memcg()))
> +		memcg = active_memcg();
> +	else
> +		memcg = mem_cgroup_from_task(current);
> +
> +	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
> +		objcg = rcu_dereference(memcg->objcg);
> +		if (likely(objcg))
> +			return objcg;
> +	}
> +
> +	return NULL;
> +}
> +
>  __always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
>  {
>  	struct obj_cgroup *objcg = NULL;
> @@ -3046,6 +3068,46 @@ struct obj_cgroup *get_obj_cgroup_from_page(struct page *page)
>  	return objcg;
>  }
>
> +/*
> + * A passed kernel object must be a slab object or a generic kernel page.
> + * Not suitable for objects, allocated using vmalloc().
> + */
> +struct obj_cgroup *get_obj_cgroup_from_slab_obj(void *p)
> +{
> +	struct obj_cgroup *objcg = NULL;
> +	struct folio *folio;
> +
> +	if (mem_cgroup_disabled())
> +		return NULL;
> +
> +	folio = virt_to_folio(p);
> +	/*
> +	 * Slab object can be either a true slab object, which are accounted
> +	 * individually with objcg pointers stored in a separate objcg array,
> +	 * or it can a generic folio with MEMCG_DATA_KMEM flag set.
> +	 */
> +	if (folio_test_slab(folio)) {
> +		struct obj_cgroup **objcgs;
> +		struct slab *slab;
> +		unsigned int off;
> +
> +		slab = folio_slab(folio);
> +		objcgs = slab_objcgs(slab);
> +		if (!objcgs)
> +			return NULL;
> +
> +		off = obj_to_index(slab->slab_cache, slab, p);
> +		objcg = objcgs[off];
> +	} else {
> +		objcg = __folio_objcg(folio);
> +	}
> +
> +	if (objcg)
> +		obj_cgroup_get(objcg);
> +
> +	return objcg;
> +}
> +
>  static void memcg_account_kmem(struct mem_cgroup *memcg, int nr_pages)
>  {
>  	mod_memcg_state(memcg, MEMCG_KMEM, nr_pages);
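As an aside, for anyone re-reading the benchmark table quoted above: the variation columns are relative to the 4.18.45 baseline and follow a `(base/new - 1) * 100` convention, so negative values mean slower than 4.18.45. A quick sanity check of that reading (the helper name is made up; this is not part of the patch):

```c
#include <assert.h>

/* Variation of a latency measurement vs. the 4.18.45 baseline, in
 * percent; negative means the newer kernel is slower. */
double variation(double base_ns, double new_ns)
{
	return (base_ns / new_ns - 1.0) * 100.0;
}
```

Plugging in a few cells from the table reproduces the published percentages to two decimal places.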
On Thu, Feb 16, 2023 at 01:29:59PM +0100, Linux regression tracking (Thorsten Leemhuis) wrote:
> Hi, this is your Linux kernel regression tracker.
>
> On 20.12.22 19:48, Roman Gushchin wrote:
> > Sven Luther reported a regression in the posix message queues
> > performance caused by switching to the per-object tracking of
> > slab objects introduced by patch series ending with the
> > commit 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all
> > allocations").
>
> Quick inquiry: what happened to below patch? It was supposed to fix a
> performance regression reported here:

Hi Thorsten!

I wouldn't call it simply a regression; things are a bit more complicated:
it was a switch to a different approach with different trade-offs, which IMO
makes more sense for the majority of real-world workloads. In two words:
individual kernel memory allocations became somewhat slower (but still fast),
while we've saved 40%+ of slab memory on typical systems and reduced memory
fragmentation.

The regression reported by Sven and my "fix" are related to one very specific
case: POSIX message queues. To my knowledge they are not widely used for
anything performance-sensitive, so it's quite a niche use case. My "fix" was
also hand-crafted for the benchmark provided by Sven, so it might not work
for a more generic case. And I don't think it can be easily generalized
without adding cpu or memory overhead.

On the other hand, I'm working on improving the speed of kernel memory
allocations in general (I posted early versions some weeks ago). Hopefully it
will mitigate the problem for Sven as well, so we won't need these message
queue-specific hacks.

Thanks!
[TLDR: This mail is primarily relevant for Linux kernel regression tracking.
See link in footer if these mails annoy you.]

On 17.02.23 19:26, Roman Gushchin wrote:
> On Thu, Feb 16, 2023 at 01:29:59PM +0100, Linux regression tracking (Thorsten Leemhuis) wrote:
>>
>> On 20.12.22 19:48, Roman Gushchin wrote:
>>> Sven Luther reported a regression in the posix message queues
>>> performance caused by switching to the per-object tracking of
>>> slab objects introduced by patch series ending with the
>>> commit 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all
>>> allocations").
>>
>> Quick inquiry: what happened to below patch? It was supposed to fix a
>> performance regression reported here:
>
> I wouldn't call it simple a regression,

Well, performance regressions are regressions, too. That being said:

> things a bit more complicated:
> it was a switch to a different approach with different trade-offs,
> which IMO make more sense for the majority of real-world workloads.
> In two words: individual kernel memory allocations became somewhat slower
> (but still fast), but we've saved 40%+ of slab memory on typical systems
> and reduced the memory fragmentation.
>
> The regression reported by Sven and my "fix" are related to one very specific
> case: posix message queues. To my knowledge they are not widely used for
> anything that performance-sensitive, so it's quite a niche use case.
> My "fix" was also hand-crafted for the benchmark provided by Sven, so it
> might not work for a more generic case. And I don't think it can be easily
> generalized without adding cpu or memory overhead.
>
> On the other hand I'm working on improving the speed of kernel memory
> allocations in general (I posted early versions some weeks ago). Hopefully
> it will mitigate the problem for Sven as well, so we won't need these
> message queue-specific hacks.

Thx for the explanation. Sven didn't complain, and it seems no one else ran
into this, so I think we can live with that state of affairs.

#regzbot inconclusive: not fixed, but caused by a trade-off and a likely
corner case anyway; more optimizations planned to improve things
#regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.
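To make the behavior discussed in this thread easy to see, the hit/miss semantics of the patch's per-CPU slot can be modeled in userspace: the alloc side reuses the cached message only on an exact size match, and the free side parks a message only in an empty slot. The `sim_*` names are invented for illustration; the real `alloc_msg()`/`free_msg()` additionally compare the objcg and store the message pointer itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model of one pcpu_msg_cache slot (size checks only). */
struct sim_cache {
	bool busy;	/* a message is parked in the slot */
	size_t len;	/* its size, as pc->len in the patch */
};

/* Mirrors the fast path of alloc_msg(): exact size match required. */
bool sim_alloc_hit(struct sim_cache *pc, size_t len)
{
	if (pc->busy && pc->len == len) {
		pc->busy = false;
		return true;
	}
	return false;	/* fall back to kmalloc() in the real code */
}

/* Mirrors free_msg(): only an empty slot takes the freed message. */
bool sim_free_cached(struct sim_cache *pc, size_t len)
{
	if (!pc->busy) {
		pc->busy = true;
		pc->len = len;
		return true;
	}
	return false;	/* slot occupied: the message is really freed */
}
```

This also shows the corner case raised in review: once an uncommon size lands in the slot, every differently-sized allocation misses and every free finds the slot occupied, so the cache stops helping until that size is requested again.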