From patchwork Wed Apr 16 18:02:29 2025
X-Patchwork-Submitter: Shakeel Butt
X-Patchwork-Id: 14054392
From: Shakeel Butt <shakeel.butt@linux.dev>
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Vlastimil Babka, Jakub Kicinski, Eric Dumazet, Soheil Hassas Yeganeh,
	linux-mm@kvack.org, cgroups@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, Meta kernel team
Subject: [PATCH] memcg: multi-memcg percpu charge cache
Date: Wed, 16 Apr 2025 11:02:29 -0700
Message-ID: <20250416180229.2902751-1-shakeel.butt@linux.dev>
MIME-Version: 1.0

Memory cgroup accounting is expensive and, to reduce the cost, the kernel
maintains a per-cpu charge cache for a single memcg. So, if a charge
request comes for a different memcg, the kernel flushes the old memcg's
charge cache, charges the new memcg a fixed amount (64 pages), subtracts
the requested amount, and stores the remainder in the per-cpu charge
cache for the new memcg.

This mechanism is based on the assumption that, for locality, the kernel
keeps a process on a CPU for a long period of time and most of the charge
requests from that process will be served by that CPU's local charge
cache.

However, this assumption breaks down for incoming network traffic on a
multi-tenant machine. We are in the process of running multiple workloads
on a single machine, and when such workloads are network heavy, we see a
very high network memory accounting cost. We have observed multiple CPUs
spending almost 100% of their time in net_rx_action, and almost all of
that time is spent in memcg accounting of the network traffic.
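A minimal sketch of this single-slot behavior (illustrative only, with
simplified types; the pcpu_stock and try_consume names are hypothetical
stand-ins for the kernel's consume_stock fast path, not the actual code):

/*
 * Illustrative model of the pre-patch single-slot cache (hypothetical,
 * simplified; not the kernel's actual consume_stock/refill_stock).
 */
#include <stdbool.h>

#define CHARGE_BATCH 64			/* MEMCG_CHARGE_BATCH in the kernel */

struct mem_cgroup;			/* opaque here */

struct pcpu_stock {
	struct mem_cgroup *cached;	/* only one memcg can be cached per CPU */
	unsigned int nr_pages;		/* pages already charged for 'cached' */
};

/* Fast path: succeeds only if the request is for the currently cached memcg. */
static bool try_consume(struct pcpu_stock *stock, struct mem_cgroup *memcg,
			unsigned int nr_pages)
{
	if (stock->cached == memcg && stock->nr_pages >= nr_pages) {
		stock->nr_pages -= nr_pages;
		return true;
	}
	/*
	 * Miss: the caller must flush the old memcg's remaining pages and
	 * charge a full CHARGE_BATCH for the new memcg -- the expensive
	 * "memcg switch" described below.
	 */
	return false;
}

Any request for a memcg other than the single cached one falls through to
that slow path, which is exactly the switch cost described below.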
More precisely, net_rx_action is serving packets from multiple workloads
and sees a mix of packets from these workloads. The memcg switch of the
per-cpu cache is very expensive, and we are observing a lot of memcg
switches on the machine. Almost all of the time is being spent charging
the new memcg and flushing the older memcg's cache. So we clearly need a
per-cpu cache that supports multiple memcgs for this scenario.

This patch implements a simple (and dumb) multi-memcg percpu charge
cache. We actually started with a more sophisticated LRU-based approach,
but the dumb one was consistently better than the sophisticated one by
1% to 3%, so we are going with the simple approach.

Some of the design choices are:

1. Fit all cached memcgs in a single cacheline.
2. The cache array can be a mix of empty slots and memcg-charged slots,
   so the kernel has to traverse the full array.
3. The cache drain from reclaim drains all cached memcgs, to keep things
   simple.

To evaluate the impact of this optimization, we ran the following
workload on a 72-CPU machine, where each netperf client runs in a
different cgroup. The next-20250415 kernel is used as the base.

 $ netserver -6
 $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

 number of clients | Without patch | With patch
 6                 | 42584.1 Mbps  | 48603.4 Mbps (14.13% improvement)
 12                | 30617.1 Mbps  | 47919.7 Mbps (56.51% improvement)
 18                | 25305.2 Mbps  | 45497.3 Mbps (79.79% improvement)
 24                | 20104.1 Mbps  | 37907.7 Mbps (88.55% improvement)
 30                | 14702.4 Mbps  | 30746.5 Mbps (109.12% improvement)
 36                | 10801.5 Mbps  | 26476.3 Mbps (145.11% improvement)

The results show a drastic improvement for network-intensive workloads.

Signed-off-by: Shakeel Butt
---
 mm/memcontrol.c | 128 ++++++++++++++++++++++++++++++++++--------------
 1 file changed, 91 insertions(+), 37 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1ad326e871c1..0a02ba07561e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1769,10 +1769,11 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg)
 	pr_cont(" are going to be killed due to memory.oom.group set\n");
 }
 
+#define NR_MEMCG_STOCK 7
 struct memcg_stock_pcp {
 	local_trylock_t stock_lock;
-	struct mem_cgroup *cached; /* this never be root cgroup */
-	unsigned int nr_pages;
+	uint8_t nr_pages[NR_MEMCG_STOCK];
+	struct mem_cgroup *cached[NR_MEMCG_STOCK];
 
 	struct obj_cgroup *cached_objcg;
 	struct pglist_data *cached_pgdat;
@@ -1809,9 +1810,10 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
 			  gfp_t gfp_mask)
 {
 	struct memcg_stock_pcp *stock;
-	unsigned int stock_pages;
+	uint8_t stock_pages;
 	unsigned long flags;
 	bool ret = false;
+	int i;
 
 	if (nr_pages > MEMCG_CHARGE_BATCH)
 		return ret;
@@ -1822,10 +1824,17 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
 		return ret;
 
 	stock = this_cpu_ptr(&memcg_stock);
-	stock_pages = READ_ONCE(stock->nr_pages);
-	if (memcg == READ_ONCE(stock->cached) && stock_pages >= nr_pages) {
-		WRITE_ONCE(stock->nr_pages, stock_pages - nr_pages);
-		ret = true;
+
+	for (i = 0; i < NR_MEMCG_STOCK; ++i) {
+		if (memcg != READ_ONCE(stock->cached[i]))
+			continue;
+
+		stock_pages = READ_ONCE(stock->nr_pages[i]);
+		if (stock_pages >= nr_pages) {
+			WRITE_ONCE(stock->nr_pages[i], stock_pages - nr_pages);
+			ret = true;
+		}
+		break;
 	}
 
 	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
@@ -1843,21 +1852,30 @@ static void memcg_uncharge(struct mem_cgroup *memcg, unsigned int nr_pages)
 /*
  * Returns stocks cached in percpu and reset cached information.
  */
-static void drain_stock(struct memcg_stock_pcp *stock)
+static void drain_stock(struct memcg_stock_pcp *stock, int i)
 {
-	unsigned int stock_pages = READ_ONCE(stock->nr_pages);
-	struct mem_cgroup *old = READ_ONCE(stock->cached);
+	struct mem_cgroup *old = READ_ONCE(stock->cached[i]);
+	uint8_t stock_pages;
 
 	if (!old)
 		return;
 
+	stock_pages = READ_ONCE(stock->nr_pages[i]);
 	if (stock_pages) {
 		memcg_uncharge(old, stock_pages);
-		WRITE_ONCE(stock->nr_pages, 0);
+		WRITE_ONCE(stock->nr_pages[i], 0);
 	}
 
 	css_put(&old->css);
-	WRITE_ONCE(stock->cached, NULL);
+	WRITE_ONCE(stock->cached[i], NULL);
+}
+
+static void drain_stock_fully(struct memcg_stock_pcp *stock)
+{
+	int i;
+
+	for (i = 0; i < NR_MEMCG_STOCK; ++i)
+		drain_stock(stock, i);
 }
 
 static void drain_local_stock(struct work_struct *dummy)
@@ -1874,7 +1892,7 @@ static void drain_local_stock(struct work_struct *dummy)
 
 	stock = this_cpu_ptr(&memcg_stock);
 	drain_obj_stock(stock);
-	drain_stock(stock);
+	drain_stock_fully(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 
 	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
@@ -1883,35 +1901,81 @@ static void drain_local_stock(struct work_struct *dummy)
 static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
 	struct memcg_stock_pcp *stock;
-	unsigned int stock_pages;
+	struct mem_cgroup *cached;
+	uint8_t stock_pages;
 	unsigned long flags;
+	bool evict = true;
+	int i;
 
 	VM_WARN_ON_ONCE(mem_cgroup_is_root(memcg));
 
-	if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
+	if (nr_pages > MEMCG_CHARGE_BATCH ||
+	    !local_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
 		/*
-		 * In case of unlikely failure to lock percpu stock_lock
-		 * uncharge memcg directly.
+		 * In case of larger than batch refill or unlikely failure to
+		 * lock the percpu stock_lock, uncharge memcg directly.
 		 */
 		memcg_uncharge(memcg, nr_pages);
 		return;
 	}
 
 	stock = this_cpu_ptr(&memcg_stock);
-	if (READ_ONCE(stock->cached) != memcg) { /* reset if necessary */
-		drain_stock(stock);
-		css_get(&memcg->css);
-		WRITE_ONCE(stock->cached, memcg);
+	for (i = 0; i < NR_MEMCG_STOCK; ++i) {
+again:
+		cached = READ_ONCE(stock->cached[i]);
+		if (!cached) {
+			css_get(&memcg->css);
+			WRITE_ONCE(stock->cached[i], memcg);
+		}
+		if (!cached || memcg == READ_ONCE(stock->cached[i])) {
+			stock_pages = READ_ONCE(stock->nr_pages[i]) + nr_pages;
+			WRITE_ONCE(stock->nr_pages[i], stock_pages);
+			if (stock_pages > MEMCG_CHARGE_BATCH)
+				drain_stock(stock, i);
+			evict = false;
+			break;
+		}
 	}
-	stock_pages = READ_ONCE(stock->nr_pages) + nr_pages;
-	WRITE_ONCE(stock->nr_pages, stock_pages);
-	if (stock_pages > MEMCG_CHARGE_BATCH)
-		drain_stock(stock);
+	if (evict) {
+		i = get_random_u32_below(NR_MEMCG_STOCK);
+		drain_stock(stock, i);
+		goto again;
+	}
 
 	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
 }
 
+static bool is_drain_needed(struct memcg_stock_pcp *stock,
+			    struct mem_cgroup *root_memcg)
+{
+	struct mem_cgroup *memcg;
+	bool flush = false;
+	int i;
+
+	rcu_read_lock();
+
+	if (obj_stock_flush_required(stock, root_memcg)) {
+		flush = true;
+		goto out;
+	}
+
+	for (i = 0; i < NR_MEMCG_STOCK; ++i) {
+		memcg = READ_ONCE(stock->cached[i]);
+		if (!memcg)
+			continue;
+
+		if (READ_ONCE(stock->nr_pages[i]) &&
+		    mem_cgroup_is_descendant(memcg, root_memcg)) {
+			flush = true;
+			break;
+		}
+	}
+out:
+	rcu_read_unlock();
+	return flush;
+}
+
 /*
  * Drains all per-CPU charge caches for given root_memcg resp. subtree
  * of the hierarchy under it.
@@ -1933,17 +1997,7 @@ void drain_all_stock(struct mem_cgroup *root_memcg)
 	curcpu = smp_processor_id();
 	for_each_online_cpu(cpu) {
 		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
-		struct mem_cgroup *memcg;
-		bool flush = false;
-
-		rcu_read_lock();
-		memcg = READ_ONCE(stock->cached);
-		if (memcg && READ_ONCE(stock->nr_pages) &&
-		    mem_cgroup_is_descendant(memcg, root_memcg))
-			flush = true;
-		else if (obj_stock_flush_required(stock, root_memcg))
-			flush = true;
-		rcu_read_unlock();
+		bool flush = is_drain_needed(stock, root_memcg);
 
 		if (flush &&
 		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
@@ -1969,7 +2023,7 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
 	drain_obj_stock(stock);
 	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
 
-	drain_stock(stock);
+	drain_stock_fully(stock);
 
 	return 0;
 }
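A standalone sketch of design choice 1 from the commit message (the
struct memcg_stock_arrays below is a hypothetical stand-in, not the
kernel's memcg_stock_pcp), showing how seven cached memcg pointers plus
their uint8_t page counters fit within a 64-byte cacheline on a typical
64-bit build:

/*
 * Illustration only (not kernel code): with NR_MEMCG_STOCK == 7, the
 * seven cached pointers (7 * 8 = 56 bytes on 64-bit) plus the seven
 * uint8_t page counters (7 bytes, padded to 8 for pointer alignment)
 * total 64 bytes, i.e. one typical cacheline.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define NR_MEMCG_STOCK 7

struct memcg_stock_arrays {		/* hypothetical stand-in struct */
	uint8_t nr_pages[NR_MEMCG_STOCK];
	void *cached[NR_MEMCG_STOCK];	/* stands in for struct mem_cgroup * */
};

int main(void)
{
	printf("size = %zu bytes\n", sizeof(struct memcg_stock_arrays));
	assert(sizeof(struct memcg_stock_arrays) <= 64);
	return 0;
}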