From patchwork Thu Feb 27 19:56:05 2020
From: Johannes Weiner <hannes@cmpxchg.org>
To: Andrew Morton
Cc: Roman Gushchin, Michal Hocko, Tejun Heo, Chris Down,
 Michal Koutný, linux-mm@kvack.org, cgroups@vger.kernel.org,
 linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH 2/3] mm: memcontrol: clean up and document effective low/min calculations
Date: Thu, 27 Feb 2020 14:56:05 -0500
Message-Id: <20200227195606.46212-3-hannes@cmpxchg.org>
In-Reply-To: <20200227195606.46212-1-hannes@cmpxchg.org>
References: <20200227195606.46212-1-hannes@cmpxchg.org>
List-ID: linux-mm@kvack.org

The effective protection of any given cgroup is a somewhat
complicated construct that depends on the ancestor's configuration,
siblings' configurations, as well as
current memory utilization in all these groups. It's done this way to
satisfy hierarchical delegation requirements while also making the
configuration semantics flexible and expressive in complex real life
scenarios.

Unfortunately, all the rules and requirements are sparsely documented,
and the code is a little too clever in merging different scenarios
into a single min() expression. This makes it hard to reason about
the implementation and avoid breaking semantics when making changes
to it.

This patch documents each semantic rule individually and splits out
the handling of the overcommit case from the regular case.

Michal Koutný also points out that the points of equilibrium as
described in the existing example scenarios aren't actually accurate.
Delete these examples for now to avoid confusion.

Acked-by: Tejun Heo
Acked-by: Roman Gushchin
Acked-by: Chris Down
Acked-by: Michal Hocko
Signed-off-by: Johannes Weiner
---
 mm/memcontrol.c | 175 +++++++++++++++++++++++-------------------------
 1 file changed, 83 insertions(+), 92 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 874a0b00f89b..836c521bd61f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6204,6 +6204,76 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.early_init = 0,
 };
 
+/*
+ * This function calculates an individual cgroup's effective
+ * protection which is derived from its own memory.min/low, its
+ * parent's and siblings' settings, as well as the actual memory
+ * distribution in the tree.
+ *
+ * The following rules apply to the effective protection values:
+ *
+ * 1. At the first level of reclaim, effective protection is equal to
+ *    the declared protection in memory.min and memory.low.
+ *
+ * 2. To enable safe delegation of the protection configuration, at
+ *    subsequent levels the effective protection is capped to the
+ *    parent's effective protection.
+ *
+ * 3. To make complex and dynamic subtrees easier to configure, the
+ *    user is allowed to overcommit the declared protection at a given
+ *    level. If that is the case, the parent's effective protection is
+ *    distributed to the children in proportion to how much protection
+ *    they have declared and how much of it they are utilizing.
+ *
+ *    This makes distribution proportional, but also work-conserving:
+ *    if one cgroup claims much more protection than it uses memory,
+ *    the unused remainder is available to its siblings.
+ *
+ * 4. Conversely, when the declared protection is undercommitted at a
+ *    given level, the distribution of the larger parental protection
+ *    budget is NOT proportional. A cgroup's protection from a sibling
+ *    is capped to its own memory.min/low setting.
+ *
+ */
+static unsigned long effective_protection(unsigned long usage,
+					  unsigned long setting,
+					  unsigned long parent_effective,
+					  unsigned long siblings_protected)
+{
+	unsigned long protected;
+
+	protected = min(usage, setting);
+	/*
+	 * If all cgroups at this level combined claim and use more
+	 * protection than what the parent affords them, distribute
+	 * shares in proportion to utilization.
+	 *
+	 * We are using actual utilization rather than the statically
+	 * claimed protection in order to be work-conserving: claimed
+	 * but unused protection is available to siblings that would
+	 * otherwise get a smaller chunk than what they claimed.
+	 */
+	if (siblings_protected > parent_effective)
+		return protected * parent_effective / siblings_protected;
+
+	/*
+	 * Ok, utilized protection of all children is within what the
+	 * parent affords them, so we know whatever this child claims
+	 * and utilizes is effectively protected.
+	 *
+	 * If there is unprotected usage beyond this value, reclaim
+	 * will apply pressure in proportion to that amount.
+	 *
+	 * If there is unutilized protection, the cgroup will be fully
+	 * shielded from reclaim, but we do return a smaller value for
+	 * protection than what the group could enjoy in theory. This
+	 * is okay. With the overcommit distribution above, effective
+	 * protection is always dependent on how memory is actually
+	 * consumed among the siblings anyway.
+	 */
+	return protected;
+}
+
 /**
  * mem_cgroup_protected - check if memory consumption is in the normal range
  * @root: the top ancestor of the sub-tree being checked
@@ -6217,67 +6287,11 @@ struct cgroup_subsys memory_cgrp_subsys = {
  * MEMCG_PROT_LOW: cgroup memory is protected as long there is
  * an unprotected supply of reclaimable memory from other cgroups.
  * MEMCG_PROT_MIN: cgroup memory is protected
- *
- * @root is exclusive; it is never protected when looked at directly
- *
- * To provide a proper hierarchical behavior, effective memory.min/low values
- * are used. Below is the description of how effective memory.low is calculated.
- * Effective memory.min values is calculated in the same way.
- *
- * Effective memory.low is always equal or less than the original memory.low.
- * If there is no memory.low overcommittment (which is always true for
- * top-level memory cgroups), these two values are equal.
- * Otherwise, it's a part of parent's effective memory.low,
- * calculated as a cgroup's memory.low usage divided by sum of sibling's
- * memory.low usages, where memory.low usage is the size of actually
- * protected memory.
- *
- *                                             low_usage
- * elow = min( memory.low, parent->elow * ------------------ ),
- *                                        siblings_low_usage
- *
- * low_usage = min(memory.low, memory.current)
- *
- *
- * Such definition of the effective memory.low provides the expected
- * hierarchical behavior: parent's memory.low value is limiting
- * children, unprotected memory is reclaimed first and cgroups,
- * which are not using their guarantee do not affect actual memory
- * distribution.
- *
- * For example, if there are memcgs A, A/B, A/C, A/D and A/E:
- *
- *     A      A/memory.low = 2G, A/memory.current = 6G
- *    //\\
- *   BC  DE   B/memory.low = 3G  B/memory.current = 2G
- *            C/memory.low = 1G  C/memory.current = 2G
- *            D/memory.low = 0   D/memory.current = 2G
- *            E/memory.low = 10G E/memory.current = 0
- *
- * and the memory pressure is applied, the following memory distribution
- * is expected (approximately):
- *
- *     A/memory.current = 2G
- *
- *     B/memory.current = 1.3G
- *     C/memory.current = 0.6G
- *     D/memory.current = 0
- *     E/memory.current = 0
- *
- * These calculations require constant tracking of the actual low usages
- * (see propagate_protected_usage()), as well as recursive calculation of
- * effective memory.low values. But as we do call mem_cgroup_protected()
- * path for each memory cgroup top-down from the reclaim,
- * it's possible to optimize this part, and save calculated elow
- * for next usage. This part is intentionally racy, but it's ok,
- * as memory.low is a best-effort mechanism.
  */
 enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 						struct mem_cgroup *memcg)
 {
 	struct mem_cgroup *parent;
-	unsigned long emin, parent_emin;
-	unsigned long elow, parent_elow;
 	unsigned long usage;
 
 	if (mem_cgroup_disabled())
@@ -6292,52 +6306,29 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 	if (!usage)
 		return MEMCG_PROT_NONE;
 
-	emin = memcg->memory.min;
-	elow = memcg->memory.low;
-
 	parent = parent_mem_cgroup(memcg);
 	/* No parent means a non-hierarchical mode on v1 memcg */
 	if (!parent)
 		return MEMCG_PROT_NONE;
 
-	if (parent == root)
-		goto exit;
-
-	parent_emin = READ_ONCE(parent->memory.emin);
-	emin = min(emin, parent_emin);
-	if (emin && parent_emin) {
-		unsigned long min_usage, siblings_min_usage;
-
-		min_usage = min(usage, memcg->memory.min);
-		siblings_min_usage = atomic_long_read(
-			&parent->memory.children_min_usage);
-
-		if (min_usage && siblings_min_usage)
-			emin = min(emin, parent_emin * min_usage /
-				   siblings_min_usage);
+	if (parent == root) {
+		memcg->memory.emin = memcg->memory.min;
+		memcg->memory.elow = memcg->memory.low;
+		goto out;
 	}
 
-	parent_elow = READ_ONCE(parent->memory.elow);
-	elow = min(elow, parent_elow);
-	if (elow && parent_elow) {
-		unsigned long low_usage, siblings_low_usage;
-
-		low_usage = min(usage, memcg->memory.low);
-		siblings_low_usage = atomic_long_read(
-			&parent->memory.children_low_usage);
+	memcg->memory.emin = effective_protection(usage,
+			memcg->memory.min, READ_ONCE(parent->memory.emin),
+			atomic_long_read(&parent->memory.children_min_usage));
 
-		if (low_usage && siblings_low_usage)
-			elow = min(elow, parent_elow * low_usage /
-				   siblings_low_usage);
-	}
+	memcg->memory.elow = effective_protection(usage,
+			memcg->memory.low, READ_ONCE(parent->memory.elow),
+			atomic_long_read(&parent->memory.children_low_usage));
 
-exit:
-	memcg->memory.emin = emin;
-	memcg->memory.elow = elow;
-
-	if (usage <= emin)
+out:
+	if (usage <= memcg->memory.emin)
 		return MEMCG_PROT_MIN;
-	else if (usage <= elow)
+	else if (usage <= memcg->memory.elow)
 		return MEMCG_PROT_LOW;
 	else
 		return MEMCG_PROT_NONE;