From patchwork Wed Jul 3 11:25:10 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Maarten Lankhorst
X-Patchwork-Id: 13722108
From: Maarten Lankhorst
To: linux-mm@kvack.org, cgroups@vger.kernel.org, Andrew Morton,
 Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song
Cc: linux-kernel@vger.kernel.org, Maarten Lankhorst
Subject: [PATCH] mm/page_counter: Move calculating protection values to page_counter
Date: Wed, 3 Jul 2024 13:25:10 +0200
Message-ID: <20240703112510.36424-1-maarten.lankhorst@linux.intel.com>
X-Mailer: git-send-email 2.45.2
It's a lot of math, and there is nothing memcontrol-specific about it.
Moving it to page_counter makes it easier to use from the drm cgroup
controller.

Signed-off-by: Maarten Lankhorst
Acked-by: Roman Gushchin
Acked-by: Shakeel Butt
---
 include/linux/page_counter.h |   4 +
 mm/memcontrol.c              | 154 +------------------------------
 mm/page_counter.c            | 173 +++++++++++++++++++++++++++++++++++
 3 files changed, 180 insertions(+), 151 deletions(-)

diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index 8cd858d912c4..904c52f97284 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -81,4 +81,8 @@ static inline void page_counter_reset_watermark(struct page_counter *counter)
 	counter->watermark = page_counter_read(counter);
 }
 
+void page_counter_calculate_protection(struct page_counter *root,
+				       struct page_counter *counter,
+				       bool recursive_protection);
+
 #endif /* _LINUX_PAGE_COUNTER_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 71fe2a95b8bd..9454e1a3120e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7316,122 +7316,6 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.early_init = 0,
 };
 
-/*
- * This function calculates an individual cgroup's effective
- * protection which is derived from its own memory.min/low, its
- * parent's and siblings' settings, as well as the actual memory
- * distribution in the tree.
- *
- * The following rules apply to the effective protection values:
- *
- * 1. At the first level of reclaim, effective protection is equal to
- *    the declared protection in memory.min and memory.low.
- *
- * 2. To enable safe delegation of the protection configuration, at
- *    subsequent levels the effective protection is capped to the
- *    parent's effective protection.
- *
- * 3. To make complex and dynamic subtrees easier to configure, the
- *    user is allowed to overcommit the declared protection at a given
- *    level. If that is the case, the parent's effective protection is
- *    distributed to the children in proportion to how much protection
- *    they have declared and how much of it they are utilizing.
- *
- *    This makes distribution proportional, but also work-conserving:
- *    if one cgroup claims much more protection than it uses memory,
- *    the unused remainder is available to its siblings.
- *
- * 4. Conversely, when the declared protection is undercommitted at a
- *    given level, the distribution of the larger parental protection
- *    budget is NOT proportional. A cgroup's protection from a sibling
- *    is capped to its own memory.min/low setting.
- *
- * 5. However, to allow protecting recursive subtrees from each other
- *    without having to declare each individual cgroup's fixed share
- *    of the ancestor's claim to protection, any unutilized -
- *    "floating" - protection from up the tree is distributed in
- *    proportion to each cgroup's *usage*. This makes the protection
- *    neutral wrt sibling cgroups and lets them compete freely over
- *    the shared parental protection budget, but it protects the
- *    subtree as a whole from neighboring subtrees.
- *
- * Note that 4. and 5. are not in conflict: 4. is about protecting
- * against immediate siblings whereas 5. is about protecting against
- * neighboring subtrees.
- */
-static unsigned long effective_protection(unsigned long usage,
-					  unsigned long parent_usage,
-					  unsigned long setting,
-					  unsigned long parent_effective,
-					  unsigned long siblings_protected)
-{
-	unsigned long protected;
-	unsigned long ep;
-
-	protected = min(usage, setting);
-	/*
-	 * If all cgroups at this level combined claim and use more
-	 * protection than what the parent affords them, distribute
-	 * shares in proportion to utilization.
-	 *
-	 * We are using actual utilization rather than the statically
-	 * claimed protection in order to be work-conserving: claimed
-	 * but unused protection is available to siblings that would
-	 * otherwise get a smaller chunk than what they claimed.
-	 */
-	if (siblings_protected > parent_effective)
-		return protected * parent_effective / siblings_protected;
-
-	/*
-	 * Ok, utilized protection of all children is within what the
-	 * parent affords them, so we know whatever this child claims
-	 * and utilizes is effectively protected.
-	 *
-	 * If there is unprotected usage beyond this value, reclaim
-	 * will apply pressure in proportion to that amount.
-	 *
-	 * If there is unutilized protection, the cgroup will be fully
-	 * shielded from reclaim, but we do return a smaller value for
-	 * protection than what the group could enjoy in theory. This
-	 * is okay. With the overcommit distribution above, effective
-	 * protection is always dependent on how memory is actually
-	 * consumed among the siblings anyway.
-	 */
-	ep = protected;
-
-	/*
-	 * If the children aren't claiming (all of) the protection
-	 * afforded to them by the parent, distribute the remainder in
-	 * proportion to the (unprotected) memory of each cgroup. That
-	 * way, cgroups that aren't explicitly prioritized wrt each
-	 * other compete freely over the allowance, but they are
-	 * collectively protected from neighboring trees.
-	 *
-	 * We're using unprotected memory for the weight so that if
-	 * some cgroups DO claim explicit protection, we don't protect
-	 * the same bytes twice.
-	 *
-	 * Check both usage and parent_usage against the respective
-	 * protected values. One should imply the other, but they
-	 * aren't read atomically - make sure the division is sane.
-	 */
-	if (!(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT))
-		return ep;
-	if (parent_effective > siblings_protected &&
-	    parent_usage > siblings_protected &&
-	    usage > protected) {
-		unsigned long unclaimed;
-
-		unclaimed = parent_effective - siblings_protected;
-		unclaimed *= usage - protected;
-		unclaimed /= parent_usage - siblings_protected;
-
-		ep += unclaimed;
-	}
-
-	return ep;
-}
-
 /**
  * mem_cgroup_calculate_protection - check if memory consumption is in the normal range
  * @root: the top ancestor of the sub-tree being checked
@@ -7443,8 +7327,8 @@ static unsigned long effective_protection(unsigned long usage,
 void mem_cgroup_calculate_protection(struct mem_cgroup *root,
 				     struct mem_cgroup *memcg)
 {
-	unsigned long usage, parent_usage;
-	struct mem_cgroup *parent;
+	bool recursive_protection =
+		cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT;
 
 	if (mem_cgroup_disabled())
 		return;
@@ -7452,39 +7336,7 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
 	if (!root)
 		root = root_mem_cgroup;
 
-	/*
-	 * Effective values of the reclaim targets are ignored so they
-	 * can be stale. Have a look at mem_cgroup_protection for more
-	 * details.
-	 * TODO: calculation should be more robust so that we do not need
-	 * that special casing.
-	 */
-	if (memcg == root)
-		return;
-
-	usage = page_counter_read(&memcg->memory);
-	if (!usage)
-		return;
-
-	parent = parent_mem_cgroup(memcg);
-
-	if (parent == root) {
-		memcg->memory.emin = READ_ONCE(memcg->memory.min);
-		memcg->memory.elow = READ_ONCE(memcg->memory.low);
-		return;
-	}
-
-	parent_usage = page_counter_read(&parent->memory);
-
-	WRITE_ONCE(memcg->memory.emin, effective_protection(usage, parent_usage,
-			READ_ONCE(memcg->memory.min),
-			READ_ONCE(parent->memory.emin),
-			atomic_long_read(&parent->memory.children_min_usage)));
-
-	WRITE_ONCE(memcg->memory.elow, effective_protection(usage, parent_usage,
-			READ_ONCE(memcg->memory.low),
-			READ_ONCE(parent->memory.elow),
-			atomic_long_read(&parent->memory.children_low_usage)));
+	page_counter_calculate_protection(&root->memory, &memcg->memory, recursive_protection);
 }
 
 static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
diff --git a/mm/page_counter.c b/mm/page_counter.c
index db20d6452b71..8ee49cbf71be 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -262,3 +262,176 @@ int page_counter_memparse(const char *buf, const char *max,
 
 	return 0;
 }
+
+
+/*
+ * This function calculates an individual page counter's effective
+ * protection which is derived from its own memory.min/low, its
+ * parent's and siblings' settings, as well as the actual memory
+ * distribution in the tree.
+ *
+ * The following rules apply to the effective protection values:
+ *
+ * 1. At the first level of reclaim, effective protection is equal to
+ *    the declared protection in memory.min and memory.low.
+ *
+ * 2. To enable safe delegation of the protection configuration, at
+ *    subsequent levels the effective protection is capped to the
+ *    parent's effective protection.
+ *
+ * 3. To make complex and dynamic subtrees easier to configure, the
+ *    user is allowed to overcommit the declared protection at a given
+ *    level. If that is the case, the parent's effective protection is
+ *    distributed to the children in proportion to how much protection
+ *    they have declared and how much of it they are utilizing.
+ *
+ *    This makes distribution proportional, but also work-conserving:
+ *    if one counter claims much more protection than it uses memory,
+ *    the unused remainder is available to its siblings.
+ *
+ * 4. Conversely, when the declared protection is undercommitted at a
+ *    given level, the distribution of the larger parental protection
+ *    budget is NOT proportional. A counter's protection from a sibling
+ *    is capped to its own memory.min/low setting.
+ *
+ * 5. However, to allow protecting recursive subtrees from each other
+ *    without having to declare each individual counter's fixed share
+ *    of the ancestor's claim to protection, any unutilized -
+ *    "floating" - protection from up the tree is distributed in
+ *    proportion to each counter's *usage*. This makes the protection
+ *    neutral wrt sibling cgroups and lets them compete freely over
+ *    the shared parental protection budget, but it protects the
+ *    subtree as a whole from neighboring subtrees.
+ *
+ * Note that 4. and 5. are not in conflict: 4. is about protecting
+ * against immediate siblings whereas 5. is about protecting against
+ * neighboring subtrees.
+ */
+static unsigned long effective_protection(unsigned long usage,
+					  unsigned long parent_usage,
+					  unsigned long setting,
+					  unsigned long parent_effective,
+					  unsigned long siblings_protected,
+					  bool recursive_protection)
+{
+	unsigned long protected;
+	unsigned long ep;
+
+	protected = min(usage, setting);
+	/*
+	 * If all cgroups at this level combined claim and use more
+	 * protection than what the parent affords them, distribute
+	 * shares in proportion to utilization.
+	 *
+	 * We are using actual utilization rather than the statically
+	 * claimed protection in order to be work-conserving: claimed
+	 * but unused protection is available to siblings that would
+	 * otherwise get a smaller chunk than what they claimed.
+	 */
+	if (siblings_protected > parent_effective)
+		return protected * parent_effective / siblings_protected;
+
+	/*
+	 * Ok, utilized protection of all children is within what the
+	 * parent affords them, so we know whatever this child claims
+	 * and utilizes is effectively protected.
+	 *
+	 * If there is unprotected usage beyond this value, reclaim
+	 * will apply pressure in proportion to that amount.
+	 *
+	 * If there is unutilized protection, the cgroup will be fully
+	 * shielded from reclaim, but we do return a smaller value for
+	 * protection than what the group could enjoy in theory. This
+	 * is okay. With the overcommit distribution above, effective
+	 * protection is always dependent on how memory is actually
+	 * consumed among the siblings anyway.
+	 */
+	ep = protected;
+
+	/*
+	 * If the children aren't claiming (all of) the protection
+	 * afforded to them by the parent, distribute the remainder in
+	 * proportion to the (unprotected) memory of each cgroup. That
+	 * way, cgroups that aren't explicitly prioritized wrt each
+	 * other compete freely over the allowance, but they are
+	 * collectively protected from neighboring trees.
+	 *
+	 * We're using unprotected memory for the weight so that if
+	 * some cgroups DO claim explicit protection, we don't protect
+	 * the same bytes twice.
+	 *
+	 * Check both usage and parent_usage against the respective
+	 * protected values. One should imply the other, but they
+	 * aren't read atomically - make sure the division is sane.
+	 */
+	if (!recursive_protection)
+		return ep;
+
+	if (parent_effective > siblings_protected &&
+	    parent_usage > siblings_protected &&
+	    usage > protected) {
+		unsigned long unclaimed;
+
+		unclaimed = parent_effective - siblings_protected;
+		unclaimed *= usage - protected;
+		unclaimed /= parent_usage - siblings_protected;
+
+		ep += unclaimed;
+	}
+
+	return ep;
+}
+
+
+/**
+ * page_counter_calculate_protection - check if memory consumption is in the normal range
+ * @root: the top ancestor of the sub-tree being checked
+ * @counter: the page counter to check
+ * @recursive_protection: whether to use memory_recursiveprot behavior
+ *
+ * Calculates elow/emin thresholds for the given page_counter.
+ *
+ * WARNING: This function is not stateless! It can only be used as part
+ * of a top-down tree iteration, not for isolated queries.
+ */
+void page_counter_calculate_protection(struct page_counter *root,
+				       struct page_counter *counter,
+				       bool recursive_protection)
+{
+	unsigned long usage, parent_usage;
+	struct page_counter *parent = counter->parent;
+
+	/*
+	 * Effective values of the reclaim targets are ignored so they
+	 * can be stale. Have a look at mem_cgroup_protection for more
+	 * details.
+	 * TODO: calculation should be more robust so that we do not need
+	 * that special casing.
+	 */
+	if (root == counter)
+		return;
+
+	usage = page_counter_read(counter);
+	if (!usage)
+		return;
+
+	if (parent == root) {
+		counter->emin = READ_ONCE(counter->min);
+		counter->elow = READ_ONCE(counter->low);
+		return;
+	}
+
+	parent_usage = page_counter_read(parent);
+
+	WRITE_ONCE(counter->emin, effective_protection(usage, parent_usage,
+			READ_ONCE(counter->min),
+			READ_ONCE(parent->emin),
+			atomic_long_read(&parent->children_min_usage),
+			recursive_protection));
+
+	WRITE_ONCE(counter->elow, effective_protection(usage, parent_usage,
+			READ_ONCE(counter->low),
+			READ_ONCE(parent->elow),
+			atomic_long_read(&parent->children_low_usage),
+			recursive_protection));
+}
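
[Editorial note: the following is a minimal, standalone sketch of the
arithmetic in effective_protection() above, added for illustration. The
harness (main, min_ul) and the sample numbers are made up; only the
branch logic mirrors the patch. It is userspace C, not kernel code.]

/*
 * Illustration of the effective_protection() arithmetic with made-up
 * numbers. Build with `cc -o ep ep.c` and run.
 */
#include <stdio.h>

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static unsigned long effective_protection(unsigned long usage,
					  unsigned long parent_usage,
					  unsigned long setting,
					  unsigned long parent_effective,
					  unsigned long siblings_protected,
					  int recursive_protection)
{
	unsigned long protected = min_ul(usage, setting);
	unsigned long ep = protected;

	/* Rule 3 (overcommit): scale each child's share to what the
	 * parent's effective protection actually affords. */
	if (siblings_protected > parent_effective)
		return protected * parent_effective / siblings_protected;

	/* Rule 5 (floating protection): with recursive protection,
	 * distribute the parent's unclaimed budget by usage. */
	if (recursive_protection &&
	    parent_effective > siblings_protected &&
	    parent_usage > siblings_protected &&
	    usage > protected) {
		unsigned long unclaimed = parent_effective - siblings_protected;

		unclaimed *= usage - protected;
		unclaimed /= parent_usage - siblings_protected;
		ep += unclaimed;
	}

	return ep;
}

int main(void)
{
	/*
	 * Overcommit: two siblings each claim and use 80 pages under a
	 * parent with 100 pages effective protection. 160 > 100, so
	 * each ends up with 80 * 100 / 160 = 50 pages.
	 */
	printf("overcommitted sibling: %lu\n",
	       effective_protection(80, 160, 80, 100, 160, 0));

	/*
	 * Floating protection: siblings claim only 60 of the parent's
	 * 100. A child using 50 while claiming 30 picks up
	 * (100 - 60) * (50 - 30) / (100 - 60) = 20 extra, so its
	 * effective protection is 50.
	 */
	printf("recursive sibling: %lu\n",
	       effective_protection(50, 100, 30, 100, 60, 1));
	return 0;
}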