[RFC,00/10] memcg: stock code cleanups

Message ID: 20250314061511.1308152-1-shakeel.butt@linux.dev

Message

Shakeel Butt March 14, 2025, 6:15 a.m. UTC
This cleanup series (the first 7 patches) tries to simplify the memcg
stock code, particularly by removing unnecessary dependencies. The last
3 patches are a prototype to make the per-cpu memcg stock work without
disabling irqs.
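
For context, every stock operation currently runs with irqs disabled.
The sketch below contrasts the two patterns; it is simplified, and the
trylock variant only illustrates the direction of the last 3 patches,
not the exact code:

	unsigned long flags;
	struct memcg_stock_pcp *stock;

	/* Today: irqs stay disabled for the whole stock operation. */
	localtry_lock_irqsave(&memcg_stock.stock_lock, flags);
	stock = this_cpu_ptr(&memcg_stock);
	/* ... consume or refill the cached charge ... */
	localtry_unlock_irqrestore(&memcg_stock.stock_lock, flags);

	/* Prototype direction (illustrative only): no irq disabling;
	 * if the lock is contended, fall back to the slow path. */
	if (!localtry_trylock(&memcg_stock.stock_lock))
		return false;
	/* ... consume or refill ... */
	localtry_unlock(&memcg_stock.stock_lock);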

My plan is to send out the first 7 cleanup patches separately for the
next release window, then iterate more on the last 3 patches and add
functionality for multiple memcgs.

This series is based on next-20250313 plus the following two patches
to simplify the memcg stock code:

Link: https://lore.kernel.org/all/20250312222552.3284173-1-shakeel.butt@linux.dev/
Link: https://lore.kernel.org/all/20250313054812.2185900-1-shakeel.butt@linux.dev/

Shakeel Butt (10):
  memcg: remove root memcg check from refill_stock
  memcg: decouple drain_obj_stock from local stock
  memcg: introduce memcg_uncharge
  memcg: manually inline __refill_stock
  memcg: no refilling stock from obj_cgroup_release
  memcg: do obj_cgroup_put inside drain_obj_stock
  memcg: use __mod_memcg_state in drain_obj_stock
  memcg: assert in_task for couple of local_lock holders
  memcg: trylock stock for objcg
  memcg: no more irq disabling for stock locks

 mm/memcontrol.c | 201 +++++++++++++++++++++++++++---------------------
 1 file changed, 112 insertions(+), 89 deletions(-)

Comments

Vlastimil Babka March 14, 2025, 1:33 p.m. UTC | #1
On 3/14/25 07:15, Shakeel Butt wrote:
[...]

Hi, 

I've been looking at this area too, and yesterday I noticed a different
opportunity for a cleanup + perf improvement. I rebased it on top of
patch 7, as it makes sense to do it before changing the locking - it
reduces the number of places where the local_trylock is taken. If it's
ok with you, please incorporate it into your series.

Thanks,
Vlastimil

----8<----
From 8dda8e4225ee91b48a73759f727a28d448c634b5 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@suse.cz>
Date: Thu, 13 Mar 2025 20:08:32 +0100
Subject: [PATCH] memcg: combine slab obj stock charging and accounting

When handling slab objects, we use obj_cgroup_[un]charge() for
(un)charging and mod_objcg_state() to account NR_SLAB_[UN]RECLAIMABLE_B.
All these operations use the percpu stock for performance. However, with
the calls being separate, the stock_lock is taken twice in each case.

By refactoring the code, we can turn mod_objcg_state() into
__account_obj_stock(), which is called on a stock that's already locked
and validated. On the charging side we can call this function from
consume_obj_stock() when it succeeds, and from refill_obj_stock() in the
fallback. We just expand the parameters of these functions as necessary.
On the uncharge side, __memcg_slab_free_hook() becomes just a call to
refill_obj_stock().

Other callers of obj_cgroup_[un]charge() (i.e. not slab) simply pass the
extra parameters as NULL/zeroes to skip the __account_obj_stock()
operation.
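
Schematically, the slab-side charge path changes like this (a
simplified sketch, not the exact call sites):

	/* Before: charging and vmstat accounting each take stock_lock */
	ret = obj_cgroup_charge(objcg, flags, obj_full_size(s));
	mod_objcg_state(objcg, slab_pgdat(slab),
			cache_vmstat_idx(s), obj_full_size(s));

	/* After: a single critical section; __account_obj_stock() runs
	 * under the stock_lock already taken for the charge. */
	ret = obj_cgroup_charge_account(objcg, flags, obj_full_size(s),
					slab_pgdat(slab), cache_vmstat_idx(s));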

In __memcg_slab_post_alloc_hook() we now charge each object separately,
but that's not a problem, as we already called mod_objcg_state() for
each object separately, and most allocations are non-bulk anyway. This
could be improved by batching all operations until slab_pgdat(slab)
changes.

Some preliminary benchmarking with a kfree(kmalloc()) loop of 10M
iterations with/without __GFP_ACCOUNT:

Before the patch:
kmalloc/kfree !memcg:    581390144 cycles
kmalloc/kfree memcg:     783689984 cycles

After the patch:
kmalloc/kfree memcg:     658723808 cycles

More than half of the overhead of __GFP_ACCOUNT relative to the
non-accounted case appears to be eliminated.
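
A minimal sketch of such a benchmark loop, assuming a kernel-module
context; the function name and the 64-byte object size below are made
up, not the actual harness:

	static void bench_kmalloc_kfree(gfp_t gfp)
	{
		u64 start = get_cycles();
		int i;

		for (i = 0; i < 10 * 1000 * 1000; i++)
			kfree(kmalloc(64, gfp));	/* arbitrary size */

		pr_info("10M kmalloc/kfree: %llu cycles\n",
			(unsigned long long)(get_cycles() - start));
	}

	/* Run once with GFP_KERNEL, once with GFP_KERNEL | __GFP_ACCOUNT. */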

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/memcontrol.c | 77 +++++++++++++++++++++++++++++--------------------
 1 file changed, 46 insertions(+), 31 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index dfe9c2eb7816..553eb1d7250a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2753,25 +2753,17 @@ static void replace_stock_objcg(struct memcg_stock_pcp *stock,
 	WRITE_ONCE(stock->cached_objcg, objcg);
 }
 
-static void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
-		     enum node_stat_item idx, int nr)
+static void __account_obj_stock(struct obj_cgroup *objcg,
+				struct memcg_stock_pcp *stock, int nr,
+				struct pglist_data *pgdat, enum node_stat_item idx)
 {
-	struct memcg_stock_pcp *stock;
-	unsigned long flags;
 	int *bytes;
 
-	localtry_lock_irqsave(&memcg_stock.stock_lock, flags);
-	stock = this_cpu_ptr(&memcg_stock);
-
 	/*
 	 * Save vmstat data in stock and skip vmstat array update unless
-	 * accumulating over a page of vmstat data or when pgdat or idx
-	 * changes.
+	 * accumulating over a page of vmstat data or when pgdat changes.
 	 */
-	if (READ_ONCE(stock->cached_objcg) != objcg) {
-		replace_stock_objcg(stock, objcg);
-		stock->cached_pgdat = pgdat;
-	} else if (stock->cached_pgdat != pgdat) {
+	if (stock->cached_pgdat != pgdat) {
 		/* Flush the existing cached vmstat data */
 		struct pglist_data *oldpg = stock->cached_pgdat;
 
@@ -2808,11 +2800,10 @@ static void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
 	}
 	if (nr)
 		__mod_objcg_mlstate(objcg, pgdat, idx, nr);
-
-	localtry_unlock_irqrestore(&memcg_stock.stock_lock, flags);
 }
 
-static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
+			      struct pglist_data *pgdat, enum node_stat_item idx)
 {
 	struct memcg_stock_pcp *stock;
 	unsigned long flags;
@@ -2824,6 +2815,9 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 	if (objcg == READ_ONCE(stock->cached_objcg) && stock->nr_bytes >= nr_bytes) {
 		stock->nr_bytes -= nr_bytes;
 		ret = true;
+
+		if (pgdat)
+			__account_obj_stock(objcg, stock, nr_bytes, pgdat, idx);
 	}
 
 	localtry_unlock_irqrestore(&memcg_stock.stock_lock, flags);
@@ -2908,7 +2902,8 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 }
 
 static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
-			     bool allow_uncharge)
+		bool allow_uncharge, int nr_acct, struct pglist_data *pgdat,
+		enum node_stat_item idx)
 {
 	struct memcg_stock_pcp *stock;
 	unsigned long flags;
@@ -2923,6 +2918,9 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 	}
 	stock->nr_bytes += nr_bytes;
 
+	if (pgdat)
+		__account_obj_stock(objcg, stock, nr_acct, pgdat, idx);
+
 	if (allow_uncharge && (stock->nr_bytes > PAGE_SIZE)) {
 		nr_pages = stock->nr_bytes >> PAGE_SHIFT;
 		stock->nr_bytes &= (PAGE_SIZE - 1);
@@ -2934,12 +2932,13 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 		obj_cgroup_uncharge_pages(objcg, nr_pages);
 }
 
-int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
+static int obj_cgroup_charge_account(struct obj_cgroup *objcg, gfp_t gfp, size_t size,
+				     struct pglist_data *pgdat, enum node_stat_item idx)
 {
 	unsigned int nr_pages, nr_bytes;
 	int ret;
 
-	if (consume_obj_stock(objcg, size))
+	if (likely(consume_obj_stock(objcg, size, pgdat, idx)))
 		return 0;
 
 	/*
@@ -2972,15 +2971,21 @@ int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
 		nr_pages += 1;
 
 	ret = obj_cgroup_charge_pages(objcg, gfp, nr_pages);
-	if (!ret && nr_bytes)
-		refill_obj_stock(objcg, PAGE_SIZE - nr_bytes, false);
+	if (!ret && (nr_bytes || pgdat))
+		refill_obj_stock(objcg, nr_bytes ? PAGE_SIZE - nr_bytes : 0,
+					 false, size, pgdat, idx);
 
 	return ret;
 }
 
+int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
+{
+	return obj_cgroup_charge_account(objcg, gfp, size, NULL, 0);
+}
+
 void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
 {
-	refill_obj_stock(objcg, size, true);
+	refill_obj_stock(objcg, size, true, 0, NULL, 0);
 }
 
 static inline size_t obj_full_size(struct kmem_cache *s)
@@ -3032,23 +3037,32 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
 			return false;
 	}
 
-	if (obj_cgroup_charge(objcg, flags, size * obj_full_size(s)))
-		return false;
-
 	for (i = 0; i < size; i++) {
 		slab = virt_to_slab(p[i]);
 
 		if (!slab_obj_exts(slab) &&
 		    alloc_slab_obj_exts(slab, s, flags, false)) {
-			obj_cgroup_uncharge(objcg, obj_full_size(s));
 			continue;
 		}
 
+		/*
+		 * if we fail and size is 1, memcg_alloc_abort_single() will
+		 * just free the object, which is ok as we have not assigned
+		 * objcg to its obj_ext yet
+		 *
+		 * for larger sizes, kmem_cache_free_bulk() will uncharge
+		 * any objects that were already charged and obj_ext assigned
+		 *
+		 * TODO: we could batch this until slab_pgdat(slab) changes
+		 * between iterations, with a more complicated undo
+		 */
+		if (obj_cgroup_charge_account(objcg, flags, obj_full_size(s),
+					slab_pgdat(slab), cache_vmstat_idx(s)))
+			return false;
+
 		off = obj_to_index(s, slab, p[i]);
 		obj_cgroup_get(objcg);
 		slab_obj_exts(slab)[off].objcg = objcg;
-		mod_objcg_state(objcg, slab_pgdat(slab),
-				cache_vmstat_idx(s), obj_full_size(s));
 	}
 
 	return true;
@@ -3057,6 +3071,8 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
 void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
 			    void **p, int objects, struct slabobj_ext *obj_exts)
 {
+	size_t obj_size = obj_full_size(s);
+
 	for (int i = 0; i < objects; i++) {
 		struct obj_cgroup *objcg;
 		unsigned int off;
@@ -3067,9 +3083,8 @@ void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
 			continue;
 
 		obj_exts[off].objcg = NULL;
-		obj_cgroup_uncharge(objcg, obj_full_size(s));
-		mod_objcg_state(objcg, slab_pgdat(slab), cache_vmstat_idx(s),
-				-obj_full_size(s));
+		refill_obj_stock(objcg, obj_size, true, -obj_size,
+				 slab_pgdat(slab), cache_vmstat_idx(s));
 		obj_cgroup_put(objcg);
 	}
 }
Shakeel Butt March 14, 2025, 4:03 p.m. UTC | #2
On Fri, Mar 14, 2025 at 02:33:28PM +0100, Vlastimil Babka wrote:
[...]
> Hi, 
> 
> I've been looking at this area too, and yesterday I noticed a different
> opportunity for a cleanup + perf improvement. I rebased it on top of
> patch 7, as it makes sense to do it before changing the locking - it
> reduces the number of places where the local_trylock is taken. If it's
> ok with you, please incorporate it into your series.

Thanks a lot, Vlastimil. I will take it and add a review tag after
reviewing it.

Andrew, I will post only the cleanup series soon. It will be based on
next-20250314. If you decide to take it for the upcoming open window
(6.15-rc), then some coordination with the bpf tree would be needed.
However, for the next window (6.16-rc), I don't expect conflicts.