From patchwork Sat Apr 19 18:35:45 2025
X-Patchwork-Submitter: Shakeel Butt
X-Patchwork-Id: 14057959
From: Shakeel Butt
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
 Yosry Ahmed, Tejun Heo, Michal Koutný, Greg Thelen,
 linux-mm@kvack.org, cgroups@vger.kernel.org,
 linux-kernel@vger.kernel.org, Meta kernel team
Subject: [PATCH v2] memcg: introduce non-blocking limit setting option
Date: Sat, 19 Apr 2025 11:35:45 -0700
Message-ID: <20250419183545.1982187-1-shakeel.butt@linux.dev>

Setting the max and high limits can trigger synchronous reclaim and/or
oom-kill if the usage is higher than the given limit. This behavior is
fine for newly created cgroups but it can cause issues for the node
controller while setting limits for existing cgroups.

In our production multi-tenant and overcommitted environment, we are
seeing priority inversion when the node controller dynamically adjusts
the limits of running jobs of different priorities.
Based on the system situation, the node controller may reduce the
limits of lower priority jobs and increase the limits of higher
priority jobs. However, we are seeing the node controller getting stuck
for long periods of time while reclaiming from lower priority jobs when
setting their limits, and also spending a lot of its own CPU.

One of the workarounds we are trying is to fork a new process which
sets the limit of the lower priority job along with setting an alarm to
get itself killed if it gets stuck in the reclaim for the lower
priority job. However, we are finding it very unreliable and costly.
Either we need a good enough time buffer for the alarm to be delivered
after setting the limit and potentially spend a lot of CPU in the
reclaim, or be unreliable in setting the limit for much shorter but
cheaper (less reclaim) alarms.

Let's introduce a new limit setting option which does not trigger
reclaim and/or oom-kill and lets the processes in the target cgroup
trigger reclaim and/or throttling and/or oom-kill in their next charge
request. This will make the node controller in multi-tenant
overcommitted environments much more reliable.

Signed-off-by: Shakeel Butt
---
Changes since v1:
- Instead of new interfaces use O_NONBLOCK flag (Greg, Roman & Tejun)

 Documentation/admin-guide/cgroup-v2.rst | 14 ++++++++++++++
 mm/memcontrol.c                         | 10 ++++++++--
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 8fb14ffab7d1..c14514da4d9a 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1299,6 +1299,13 @@ PAGE_SIZE multiple when read back.
 	monitors the limited cgroup to alleviate heavy reclaim
 	pressure.
 
+	If memory.high is opened with O_NONBLOCK then the synchronous
+	reclaim is bypassed. This is useful for admin processes that
+	need to dynamically adjust the job's memory limits without
+	expending their own CPU resources on memory reclamation. The
+	job will trigger the reclaim and/or get throttled on its
+	next charge request.
+
   memory.max
 	A read-write single value file which exists on non-root
 	cgroups.  The default is "max".
@@ -1316,6 +1323,13 @@ PAGE_SIZE multiple when read back.
 	Caller could retry them differently, return into userspace
 	as -ENOMEM or silently ignore in cases like disk readahead.
 
+	If memory.max is opened with O_NONBLOCK, then the synchronous
+	reclaim and oom-kill are bypassed. This is useful for admin
+	processes that need to dynamically adjust the job's memory limits
+	without expending their own CPU resources on memory reclamation.
+	The job will trigger the reclaim and/or oom-kill on its next
+	charge request.
+
   memory.reclaim
 	A write-only nested-keyed file which exists for all
 	cgroups.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5e2ea8b8a898..6f7362a7756a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4252,6 +4252,9 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 
 	page_counter_set_high(&memcg->memory, high);
 
+	if (of->file->f_flags & O_NONBLOCK)
+		goto out;
+
 	for (;;) {
 		unsigned long nr_pages = page_counter_read(&memcg->memory);
 		unsigned long reclaimed;
@@ -4274,7 +4277,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 		if (!reclaimed && !nr_retries--)
 			break;
 	}
-
+out:
 	memcg_wb_domain_size_changed(memcg);
 	return nbytes;
 }
@@ -4301,6 +4304,9 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 
 	xchg(&memcg->memory.max, max);
 
+	if (of->file->f_flags & O_NONBLOCK)
+		goto out;
+
 	for (;;) {
 		unsigned long nr_pages = page_counter_read(&memcg->memory);
 
@@ -4328,7 +4334,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 			break;
 		cond_resched();
 	}
-
+out:
 	memcg_wb_domain_size_changed(memcg);
 	return nbytes;
 }
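
For illustration only, and not part of the patch: a minimal userspace
sketch of how a node controller might use the O_NONBLOCK behavior
described in the documentation hunks above. The helper name
set_limit_nonblocking() and its error convention are hypothetical; the
only thing taken from the patch is that memory.high or memory.max is
opened with O_NONBLOCK before the new limit is written.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `value` (a byte count such as
 * "1073741824", or "max") to a cgroup v2 limit file such as
 * memory.high or memory.max. The file is opened with O_NONBLOCK so
 * that, with this patch applied, the write only updates the limit and
 * leaves reclaim/oom-kill to the next charge in the target cgroup.
 *
 * Returns 0 on success or a negative errno value on failure.
 */
static int set_limit_nonblocking(const char *path, const char *value)
{
	size_t len = strlen(value);
	ssize_t written;
	int fd, ret = 0;

	fd = open(path, O_WRONLY | O_NONBLOCK);
	if (fd < 0)
		return -errno;

	written = write(fd, value, len);
	if (written < 0)
		ret = -errno;
	else if ((size_t)written != len)
		ret = -EIO;	/* short write: limit string not fully consumed */

	if (close(fd) < 0 && ret == 0)
		ret = -errno;

	return ret;
}
```

A caller would pass the limit file of the target cgroup, for example
set_limit_nonblocking("/sys/fs/cgroup/job/memory.max", "1073741824"),
where the cgroup path is an assumed example. On a kernel without this
patch the same call still succeeds but may block in reclaim as before.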