From patchwork Thu Feb 27 19:56:04 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Johannes Weiner X-Patchwork-Id: 11409581 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id DD36714B4 for ; Thu, 27 Feb 2020 19:56:17 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 9D4CE246A5 for ; Thu, 27 Feb 2020 19:56:17 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=cmpxchg-org.20150623.gappssmtp.com header.i=@cmpxchg-org.20150623.gappssmtp.com header.b="oBEvIdqy" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 9D4CE246A5 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=cmpxchg.org Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 95C676B0006; Thu, 27 Feb 2020 14:56:15 -0500 (EST) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 90B3F6B0007; Thu, 27 Feb 2020 14:56:15 -0500 (EST) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 7D4896B0008; Thu, 27 Feb 2020 14:56:15 -0500 (EST) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0038.hostedemail.com [216.40.44.38]) by kanga.kvack.org (Postfix) with ESMTP id 63E5C6B0006 for ; Thu, 27 Feb 2020 14:56:15 -0500 (EST) Received: from smtpin09.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with ESMTP id 430365DEB for ; Thu, 27 Feb 2020 19:56:15 +0000 (UTC) X-FDA: 76536963510.09.car34_368b6725f305f X-Spam-Summary: 2,0,0,1e5d5544b3799b7b,d41d8cd98f00b204,hannes@cmpxchg.org,,RULES_HIT:2:41:69:355:379:541:800:960:973:988:989:1260:1311:1314:1345:1359:1437:1515:1535:1605:1730:1747:1777:1792:1801:2198:2199:2393:2559:2562:2693:2897:3138:3139:3140:3141:3142:3865:3866:3867:3868:3870:3871:3872:3874:4049:4120:4321:4605:5007:6119:6261:6653:7903:8784:9592:11026:11658:11914:12043:12220:12294:12295:12296:12297:12355:12438:12517:12519:12555:12679:12694:12737:12895:12986:13161:13229:13870:13894:14096:14394:21080:21444:21450:21451:21627:21740:21990:30054:30064,0,RBL:209.85.222.169:@cmpxchg.org:.lbl8.mailshell.net-62.2.0.100 66.100.201.201,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fp,MSBL:0,DNSBL:none,Custom_rules:0:1:0,LFtime:24,LUA_SUMMARY:none X-HE-Tag: car34_368b6725f305f X-Filterd-Recvd-Size: 9772 Received: from mail-qk1-f169.google.com (mail-qk1-f169.google.com [209.85.222.169]) by imf46.hostedemail.com (Postfix) with ESMTP for ; Thu, 27 Feb 2020 19:56:14 +0000 (UTC) Received: by mail-qk1-f169.google.com with SMTP id m2so633274qka.7 for ; Thu, 27 Feb 2020 11:56:13 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cmpxchg-org.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=fSikhrFs5BcRusqV72w1ZPleJmM+MohxSSUW83ONj5U=; b=oBEvIdqy/cYuOQutMhWUrNcPeYL2gICCQr+fu7tq7vVplWwxr0u7hW9ENL65iMdRom xFPYrs5j+CvsSDj/SOJ0fb+AQ/JGskuPHfhd/wFPRruaeOtFk7Qsg0IPt9PLNCzvD8GM MmHIl/zx6wFCAnLSLRRaK29JxbAtIbEZ+s4gqBOK5c3KKtIn4PFtJZlJHD01gCLtU7K1 EEkJsiMrCUO5M/l4QvUQDaP2zS6qxa36u/D6n3QWZmfkHyiDGzSoYqRjuxTxE5s1Ij6k dRwDWFVoDGKvv8qDZk+m2mPbpSKjWvMiuAVVYT76aAmsqxeBqxRVUU1ugyX8ZIdHdP1x pdJA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=fSikhrFs5BcRusqV72w1ZPleJmM+MohxSSUW83ONj5U=; b=niR85f6KYV1KFvRdtbjy+BAb4KEn6s5Ty+ehHg0SudVpA5mn/UHrrTtYQvHWM5WnTC oZMD4BfZu1uMYshuMSKly6mG4XWXO9nHJnTkIqvDY3LJJGXiW0TKD/Gwk1n1P5+fhhph S+CDcA+e8OL/u8AxR9PZNImFQx8uAxJXsuJSTUbhQBCb+dsODfZngriYFVAksZxDBUmJ 24A9j9UjXlasuivkHAG6q5CUqgC5MYqg5f2OKuXzhfDwL1cyKFpvOBIMM8AZZFRa2mjv UMv0R9QjRK/5EO/9OTILCxGKKYX4VVNBAJzfeJrIEi3c9ArlOaGw+qKD3okmjD8Sipc3 hDJw== X-Gm-Message-State: APjAAAVINtD3QlvbzkvC/yTIMtHuIkTnS/2xUIwM0dpvVFCru6PcImAd prLPiGikBukO4nqIj2HjtKwBVQ== X-Google-Smtp-Source: APXvYqwzGV55cVF8neavTQ0KHqXTRgctIxpNwffmP9jz1OBsL5ZdIPghXnLFLvTkBlP+8+xsm0fcGg== X-Received: by 2002:a05:620a:42:: with SMTP id t2mr1196414qkt.45.1582833373465; Thu, 27 Feb 2020 11:56:13 -0800 (PST) Received: from localhost ([2620:10d:c091:500::3:2450]) by smtp.gmail.com with ESMTPSA id a14sm3708173qkk.73.2020.02.27.11.56.12 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 27 Feb 2020 11:56:12 -0800 (PST) From: Johannes Weiner To: Andrew Morton Cc: Roman Gushchin , Michal Hocko , Tejun Heo , Chris Down , =?utf-8?q?Mic?= =?utf-8?q?hal_Koutn=C3=BD?= , linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@fb.com Subject: [PATCH 1/3] mm: memcontrol: fix memory.low proportional distribution Date: Thu, 27 Feb 2020 14:56:04 -0500 Message-Id: <20200227195606.46212-2-hannes@cmpxchg.org> X-Mailer: git-send-email 2.24.1 In-Reply-To: <20200227195606.46212-1-hannes@cmpxchg.org> References: <20200227195606.46212-1-hannes@cmpxchg.org> MIME-Version: 1.0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: When memory.low is overcommitted - i.e. the children claim more protection than their shared ancestor grants them - the allowance is distributed in proportion to how much each sibling uses their own declared protection: low_usage = min(memory.low, memory.current) elow = parent_elow * (low_usage / siblings_low_usage) However, siblings_low_usage is not the sum of all low_usages. It sums up the usages of *only those cgroups that are within their memory.low* That means that low_usage can be *bigger* than siblings_low_usage, and consequently the total protection afforded to the children can be bigger than what the ancestor grants the subtree. Consider three groups where two are in excess of their protection: A/memory.low = 10G A/A1/memory.low = 10G, memory.current = 20G A/A2/memory.low = 10G, memory.current = 20G A/A3/memory.low = 10G, memory.current = 8G siblings_low_usage = 8G (only A3 contributes) A1/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(8G) = 12.5G -> 10G A2/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(8G) = 12.5G -> 10G A3/elow = parent_elow(10G) * low_usage(8G) / siblings_low_usage(8G) = 10.0G (the 12.5G are capped to the explicit memory.low setting of 10G) With that, the sum of all awarded protection below A is 30G, when A only grants 10G for the entire subtree. What does this mean in practice? A1 and A2 would still be in excess of their 10G allowance and would be reclaimed, whereas A3 would not. As they eventually drop below their protection setting, they would be counted in siblings_low_usage again and the error would right itself. When reclaim was applied in a binary fashion (cgroup is reclaimed when it's above its protection, otherwise it's skipped) this would actually work out just fine. However, since 1bc63fb1272b ("mm, memcg: make scan aggression always exclude protection"), reclaim pressure is scaled to how much a cgroup is above its protection. As a result this calculation error unduly skews pressure away from A1 and A2 toward the rest of the system. But why did we do it like this in the first place? The reasoning behind exempting groups in excess from siblings_low_usage was to go after them first during reclaim in an overcommitted subtree: A/memory.low = 2G, memory.current = 4G A/A1/memory.low = 3G, memory.current = 2G A/A2/memory.low = 1G, memory.current = 2G siblings_low_usage = 2G (only A1 contributes) A1/elow = parent_elow(2G) * low_usage(2G) / siblings_low_usage(2G) = 2G A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G While the children combined are overcomitting A and are technically both at fault, A2 is actively declaring unprotected memory and we would like to reclaim that first. However, while this sounds like a noble goal on the face of it, it doesn't make much difference in actual memory distribution: Because A is overcommitted, reclaim will not stop once A2 gets pushed back to within its allowance; we'll have to reclaim A1 either way. The end result is still that protection is distributed proportionally, with A1 getting 3/4 (1.5G) and A2 getting 1/4 (0.5G) of A's allowance. [ If A weren't overcommitted, it wouldn't make a difference since each cgroup would just get the protection it declares: A/memory.low = 2G, memory.current = 3G A/A1/memory.low = 1G, memory.current = 1G A/A2/memory.low = 1G, memory.current = 2G With the current calculation: siblings_low_usage = 1G (only A1 contributes) A1/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(1G) = 2G -> 1G A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(1G) = 2G -> 1G Including excess groups in siblings_low_usage: siblings_low_usage = 2G A1/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G -> 1G A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G -> 1G ] Simplify the calculation and fix the proportional reclaim bug by including excess cgroups in siblings_low_usage. After this patch, the effective memory.low distribution from the example above would be as follows: A/memory.low = 10G A/A1/memory.low = 10G, memory.current = 20G A/A2/memory.low = 10G, memory.current = 20G A/A3/memory.low = 10G, memory.current = 8G siblings_low_usage = 28G A1/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(28G) = 3.5G A2/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(28G) = 3.5G A3/elow = parent_elow(10G) * low_usage(8G) / siblings_low_usage(28G) = 2.8G Fixes: 1bc63fb1272b ("mm, memcg: make scan aggression always exclude protection") Fixes: 230671533d64 ("mm: memory.low hierarchical behavior") Acked-by: Tejun Heo Acked-by: Roman Gushchin Acked-by: Chris Down Acked-by: Michal Hocko Signed-off-by: Johannes Weiner --- mm/memcontrol.c | 4 +--- mm/page_counter.c | 12 ++---------- 2 files changed, 3 insertions(+), 13 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c5b5f74cfd4d..874a0b00f89b 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -6236,9 +6236,7 @@ struct cgroup_subsys memory_cgrp_subsys = { * elow = min( memory.low, parent->elow * ------------------ ), * siblings_low_usage * - * | memory.current, if memory.current < memory.low - * low_usage = | - * | 0, otherwise. + * low_usage = min(memory.low, memory.current) * * * Such definition of the effective memory.low provides the expected diff --git a/mm/page_counter.c b/mm/page_counter.c index de31470655f6..75d53f15f040 100644 --- a/mm/page_counter.c +++ b/mm/page_counter.c @@ -23,11 +23,7 @@ static void propagate_protected_usage(struct page_counter *c, return; if (c->min || atomic_long_read(&c->min_usage)) { - if (usage <= c->min) - protected = usage; - else - protected = 0; - + protected = min(usage, c->min); old_protected = atomic_long_xchg(&c->min_usage, protected); delta = protected - old_protected; if (delta) @@ -35,11 +31,7 @@ static void propagate_protected_usage(struct page_counter *c, } if (c->low || atomic_long_read(&c->low_usage)) { - if (usage <= c->low) - protected = usage; - else - protected = 0; - + protected = min(usage, c->low); old_protected = atomic_long_xchg(&c->low_usage, protected); delta = protected - old_protected; if (delta) From patchwork Thu Feb 27 19:56:05 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Johannes Weiner X-Patchwork-Id: 11409583 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 49D5B14B4 for ; Thu, 27 Feb 2020 19:56:20 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id F098B246A2 for ; Thu, 27 Feb 2020 19:56:19 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=cmpxchg-org.20150623.gappssmtp.com header.i=@cmpxchg-org.20150623.gappssmtp.com header.b="GtZ5uMsW" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org F098B246A2 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=cmpxchg.org Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id D2D9E6B0007; Thu, 27 Feb 2020 14:56:16 -0500 (EST) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id C66636B0008; Thu, 27 Feb 2020 14:56:16 -0500 (EST) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id B7C346B000A; Thu, 27 Feb 2020 14:56:16 -0500 (EST) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0084.hostedemail.com [216.40.44.84]) by kanga.kvack.org (Postfix) with ESMTP id 9B2F76B0007 for ; Thu, 27 Feb 2020 14:56:16 -0500 (EST) Received: from smtpin19.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay04.hostedemail.com (Postfix) with ESMTP id 5A09C9438 for ; Thu, 27 Feb 2020 19:56:16 +0000 (UTC) X-FDA: 76536963552.19.boy47_36b55958f8d09 X-Spam-Summary: 2,0,0,0cf1ded480cbc8cc,d41d8cd98f00b204,hannes@cmpxchg.org,,RULES_HIT:1:2:41:69:152:355:379:541:800:960:973:981:982:988:989:1260:1277:1311:1313:1314:1345:1359:1437:1515:1516:1518:1593:1594:1605:1730:1747:1777:1792:1801:2194:2198:2199:2200:2393:2559:2562:2693:2731:2897:3138:3139:3140:3141:3142:3743:3865:3866:3867:3868:3870:3871:3872:3874:4052:4250:4605:5007:6261:6653:7875:7903:8603:8660:8784:9108:9592:10004:11026:11473:11658:11914:12043:12220:12291:12295:12296:12297:12438:12517:12519:12555:12683:12895:13148:13161:13229:13230:13255:13869:13894:14096:14097:14394:14659:21080:21222:21433:21444:21451:21627:21740:21795:21966:21990:30005:30012:30051:30054:30056:30064:30070,0,RBL:209.85.222.195:@cmpxchg.org:.lbl8.mailshell.net-66.100.201.201 62.2.0.100,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fp,MSBL:0,DNSBL:none,Custom_rules:0:0:0,LFtime:24,LUA_SUMMARY:none X-HE-Tag: boy47_36b55958f8d09 X-Filterd-Recvd-Size: 12901 Received: from mail-qk1-f195.google.com (mail-qk1-f195.google.com [209.85.222.195]) by imf01.hostedemail.com (Postfix) with ESMTP for ; Thu, 27 Feb 2020 19:56:15 +0000 (UTC) Received: by mail-qk1-f195.google.com with SMTP id u124so589989qkh.13 for ; Thu, 27 Feb 2020 11:56:15 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cmpxchg-org.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=lDjWpH6grOtzg9jRElfS3afP9Xx2wYz4tow/cIPwfsc=; b=GtZ5uMsWMd0OMOb6SUGkO7O7QOsQlQ2mg1L95hluFzKdMMLVIVxeBRjFb8F4ekY6L7 hVVHDzVeb+davTH5VCuhL/S5/vyR9rGlbZ2lxozVr2eibCvgyB0/8N0LBkp6p3njyCbZ TpwBdF0uJ7pwBd9bBqns/9B2UrTm3nWPDYxwX/urJ12eHqGk/TDNKwyauQpqzmZxzpkT 9/aXGKfJBvTY18ut5BORUiVgRYls+okYGQRe5Xxtt2mBmwmMoeYzeLbduzMDUN78zJb2 dxhteMd7aK8qHaMmCadWzEm8hvaDK8Z63EtOXJMI+RFbiYpU5EpN60n01BPGArtq0kll YbhA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=lDjWpH6grOtzg9jRElfS3afP9Xx2wYz4tow/cIPwfsc=; b=DbcRfZf8C390jMcztkXPfDNtFURouFb/S4UGYpaP8wtUyAydRFtEaXfgkyFlpZVJ1i bKRFm6fZH0ZRr3cWf69/wZ/HIUYPoPTWIxA5nkBAFriF/3/mYWDTwPAK2JJCiD1w8ifP p0J9O4CtBML47ZwOlqjHH3/3HpRuLpeR4k2juTAl2XO5YlgLRdZ5fjPPxLUbbbSCKmpe Euwd/B+POhy27TFVti86DsI9RScEZCN5HsBzLCaEh3TeN52BG4YCu7sppYeZjAhurxpC OJqOSmtFnvj1A9wwBBGs4JoFyOAUhDbzK24/iL7WFXqh5cRV8M8onj+CpmR3u6QbaraQ NROQ== X-Gm-Message-State: APjAAAUPWgLYmThbEV2ecdiCEqy6xM252grjjX8PdyhySKweX/YBmTUp WociiJxkldYE48i18d+p7REd+g== X-Google-Smtp-Source: APXvYqx6zjvhh+3GLc16k1NmRrYkE58F4RCrqVSvjq4hkwz43RZngjuTejLCpp/ODPjN86zwbc+GeQ== X-Received: by 2002:a05:620a:2093:: with SMTP id e19mr1054066qka.355.1582833374935; Thu, 27 Feb 2020 11:56:14 -0800 (PST) Received: from localhost ([2620:10d:c091:500::3:2450]) by smtp.gmail.com with ESMTPSA id u23sm1542073qtk.16.2020.02.27.11.56.14 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 27 Feb 2020 11:56:14 -0800 (PST) From: Johannes Weiner To: Andrew Morton Cc: Roman Gushchin , Michal Hocko , Tejun Heo , Chris Down , =?utf-8?q?Mic?= =?utf-8?q?hal_Koutn=C3=BD?= , linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@fb.com Subject: [PATCH 2/3] mm: memcontrol: clean up and document effective low/min calculations Date: Thu, 27 Feb 2020 14:56:05 -0500 Message-Id: <20200227195606.46212-3-hannes@cmpxchg.org> X-Mailer: git-send-email 2.24.1 In-Reply-To: <20200227195606.46212-1-hannes@cmpxchg.org> References: <20200227195606.46212-1-hannes@cmpxchg.org> MIME-Version: 1.0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: The effective protection of any given cgroup is a somewhat complicated construct that depends on the ancestor's configuration, siblings' configurations, as well as current memory utilization in all these groups. It's done this way to satisfy hierarchical delegation requirements while also making the configuration semantics flexible and expressive in complex real life scenarios. Unfortunately, all the rules and requirements are sparsely documented, and the code is a little too clever in merging different scenarios into a single min() expression. This makes it hard to reason about the implementation and avoid breaking semantics when making changes to it. This patch documents each semantic rule individually and splits out the handling of the overcommit case from the regular case. Michal Koutný also points out that the points of equilibrium as described in the existing example scenarios aren't actually accurate. Delete these examples for now to avoid confusion. Acked-by: Tejun Heo Acked-by: Roman Gushchin Acked-by: Chris Down Acked-by: Michal Hocko Signed-off-by: Johannes Weiner --- mm/memcontrol.c | 175 +++++++++++++++++++++++------------------------- 1 file changed, 83 insertions(+), 92 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 874a0b00f89b..836c521bd61f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -6204,6 +6204,76 @@ struct cgroup_subsys memory_cgrp_subsys = { .early_init = 0, }; +/* + * This function calculates an individual cgroup's effective + * protection which is derived from its own memory.min/low, its + * parent's and siblings' settings, as well as the actual memory + * distribution in the tree. + * + * The following rules apply to the effective protection values: + * + * 1. At the first level of reclaim, effective protection is equal to + * the declared protection in memory.min and memory.low. + * + * 2. To enable safe delegation of the protection configuration, at + * subsequent levels the effective protection is capped to the + * parent's effective protection. + * + * 3. To make complex and dynamic subtrees easier to configure, the + * user is allowed to overcommit the declared protection at a given + * level. If that is the case, the parent's effective protection is + * distributed to the children in proportion to how much protection + * they have declared and how much of it they are utilizing. + * + * This makes distribution proportional, but also work-conserving: + * if one cgroup claims much more protection than it uses memory, + * the unused remainder is available to its siblings. + * + * 4. Conversely, when the declared protection is undercommitted at a + * given level, the distribution of the larger parental protection + * budget is NOT proportional. A cgroup's protection from a sibling + * is capped to its own memory.min/low setting. + * + */ +static unsigned long effective_protection(unsigned long usage, + unsigned long setting, + unsigned long parent_effective, + unsigned long siblings_protected) +{ + unsigned long protected; + + protected = min(usage, setting); + /* + * If all cgroups at this level combined claim and use more + * protection then what the parent affords them, distribute + * shares in proportion to utilization. + * + * We are using actual utilization rather than the statically + * claimed protection in order to be work-conserving: claimed + * but unused protection is available to siblings that would + * otherwise get a smaller chunk than what they claimed. + */ + if (siblings_protected > parent_effective) + return protected * parent_effective / siblings_protected; + + /* + * Ok, utilized protection of all children is within what the + * parent affords them, so we know whatever this child claims + * and utilizes is effectively protected. + * + * If there is unprotected usage beyond this value, reclaim + * will apply pressure in proportion to that amount. + * + * If there is unutilized protection, the cgroup will be fully + * shielded from reclaim, but we do return a smaller value for + * protection than what the group could enjoy in theory. This + * is okay. With the overcommit distribution above, effective + * protection is always dependent on how memory is actually + * consumed among the siblings anyway. + */ + return protected; +} + /** * mem_cgroup_protected - check if memory consumption is in the normal range * @root: the top ancestor of the sub-tree being checked @@ -6217,67 +6287,11 @@ struct cgroup_subsys memory_cgrp_subsys = { * MEMCG_PROT_LOW: cgroup memory is protected as long there is * an unprotected supply of reclaimable memory from other cgroups. * MEMCG_PROT_MIN: cgroup memory is protected - * - * @root is exclusive; it is never protected when looked at directly - * - * To provide a proper hierarchical behavior, effective memory.min/low values - * are used. Below is the description of how effective memory.low is calculated. - * Effective memory.min values is calculated in the same way. - * - * Effective memory.low is always equal or less than the original memory.low. - * If there is no memory.low overcommittment (which is always true for - * top-level memory cgroups), these two values are equal. - * Otherwise, it's a part of parent's effective memory.low, - * calculated as a cgroup's memory.low usage divided by sum of sibling's - * memory.low usages, where memory.low usage is the size of actually - * protected memory. - * - * low_usage - * elow = min( memory.low, parent->elow * ------------------ ), - * siblings_low_usage - * - * low_usage = min(memory.low, memory.current) - * - * - * Such definition of the effective memory.low provides the expected - * hierarchical behavior: parent's memory.low value is limiting - * children, unprotected memory is reclaimed first and cgroups, - * which are not using their guarantee do not affect actual memory - * distribution. - * - * For example, if there are memcgs A, A/B, A/C, A/D and A/E: - * - * A A/memory.low = 2G, A/memory.current = 6G - * //\\ - * BC DE B/memory.low = 3G B/memory.current = 2G - * C/memory.low = 1G C/memory.current = 2G - * D/memory.low = 0 D/memory.current = 2G - * E/memory.low = 10G E/memory.current = 0 - * - * and the memory pressure is applied, the following memory distribution - * is expected (approximately): - * - * A/memory.current = 2G - * - * B/memory.current = 1.3G - * C/memory.current = 0.6G - * D/memory.current = 0 - * E/memory.current = 0 - * - * These calculations require constant tracking of the actual low usages - * (see propagate_protected_usage()), as well as recursive calculation of - * effective memory.low values. But as we do call mem_cgroup_protected() - * path for each memory cgroup top-down from the reclaim, - * it's possible to optimize this part, and save calculated elow - * for next usage. This part is intentionally racy, but it's ok, - * as memory.low is a best-effort mechanism. */ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root, struct mem_cgroup *memcg) { struct mem_cgroup *parent; - unsigned long emin, parent_emin; - unsigned long elow, parent_elow; unsigned long usage; if (mem_cgroup_disabled()) @@ -6292,52 +6306,29 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root, if (!usage) return MEMCG_PROT_NONE; - emin = memcg->memory.min; - elow = memcg->memory.low; - parent = parent_mem_cgroup(memcg); /* No parent means a non-hierarchical mode on v1 memcg */ if (!parent) return MEMCG_PROT_NONE; - if (parent == root) - goto exit; - - parent_emin = READ_ONCE(parent->memory.emin); - emin = min(emin, parent_emin); - if (emin && parent_emin) { - unsigned long min_usage, siblings_min_usage; - - min_usage = min(usage, memcg->memory.min); - siblings_min_usage = atomic_long_read( - &parent->memory.children_min_usage); - - if (min_usage && siblings_min_usage) - emin = min(emin, parent_emin * min_usage / - siblings_min_usage); + if (parent == root) { + memcg->memory.emin = memcg->memory.min; + memcg->memory.elow = memcg->memory.low; + goto out; } - parent_elow = READ_ONCE(parent->memory.elow); - elow = min(elow, parent_elow); - if (elow && parent_elow) { - unsigned long low_usage, siblings_low_usage; - - low_usage = min(usage, memcg->memory.low); - siblings_low_usage = atomic_long_read( - &parent->memory.children_low_usage); + memcg->memory.emin = effective_protection(usage, + memcg->memory.min, READ_ONCE(parent->memory.emin), + atomic_long_read(&parent->memory.children_min_usage)); - if (low_usage && siblings_low_usage) - elow = min(elow, parent_elow * low_usage / - siblings_low_usage); - } + memcg->memory.elow = effective_protection(usage, + memcg->memory.low, READ_ONCE(parent->memory.elow), + atomic_long_read(&parent->memory.children_low_usage)); -exit: - memcg->memory.emin = emin; - memcg->memory.elow = elow; - - if (usage <= emin) +out: + if (usage <= memcg->memory.emin) return MEMCG_PROT_MIN; - else if (usage <= elow) + else if (usage <= memcg->memory.elow) return MEMCG_PROT_LOW; else return MEMCG_PROT_NONE; From patchwork Thu Feb 27 19:56:06 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Johannes Weiner X-Patchwork-Id: 11409585 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id AB1B8930 for ; Thu, 27 Feb 2020 19:56:22 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 5E108246AD for ; Thu, 27 Feb 2020 19:56:22 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=cmpxchg-org.20150623.gappssmtp.com header.i=@cmpxchg-org.20150623.gappssmtp.com header.b="a0Uh+0/N" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 5E108246AD Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=cmpxchg.org Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id DA0676B0008; Thu, 27 Feb 2020 14:56:18 -0500 (EST) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id D523E6B000A; Thu, 27 Feb 2020 14:56:18 -0500 (EST) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id BCB9E6B000C; Thu, 27 Feb 2020 14:56:18 -0500 (EST) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0080.hostedemail.com [216.40.44.80]) by kanga.kvack.org (Postfix) with ESMTP id A59BE6B0008 for ; Thu, 27 Feb 2020 14:56:18 -0500 (EST) Received: from smtpin17.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with ESMTP id 943E098A7 for ; Thu, 27 Feb 2020 19:56:18 +0000 (UTC) X-FDA: 76536963636.17.scale45_36f6928e5f461 X-Spam-Summary: 2,0,0,ca795161fff27c7a,d41d8cd98f00b204,hannes@cmpxchg.org,,RULES_HIT:4:41:69:355:379:541:800:960:966:973:988:989:1042:1260:1311:1314:1345:1359:1437:1515:1605:1730:1747:1777:1792:1801:2194:2196:2198:2199:2200:2201:2393:2559:2562:2638:2693:2731:2897:2919:3138:3139:3140:3141:3142:3622:3743:3865:3866:3867:3868:3870:3871:3872:3874:4250:4321:4384:4385:4395:4605:5007:6119:6261:6653:7875:7903:8603:8784:9121:10004:11026:11232:11233:11473:11658:11914:12043:12198:12291:12296:12297:12438:12517:12519:12555:12660:12679:12683:12895:12986:13161:13215:13229:13869:13894:14096:14394:21080:21222:21394:21433:21444:21451:21627:21740:21772:21795:21810:21966:21990:30005:30051:30054:30062:30064:30070,0,RBL:209.85.222.196:@cmpxchg.org:.lbl8.mailshell.net-62.2.0.100 66.100.201.201,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fp,MSBL:0,DNSBL:none,Custom_rules:0:0:0,LFtime:40,LUA_SUMMARY:none X-HE-Tag: scale45_36f6928e5f461 X-Filterd-Recvd-Size: 15598 Received: from mail-qk1-f196.google.com (mail-qk1-f196.google.com [209.85.222.196]) by imf35.hostedemail.com (Postfix) with ESMTP for ; Thu, 27 Feb 2020 19:56:17 +0000 (UTC) Received: by mail-qk1-f196.google.com with SMTP id f3so663645qkh.3 for ; Thu, 27 Feb 2020 11:56:17 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cmpxchg-org.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=3S75USrJcvg7+14y2qoSr0W2vyLWm3IhoTMN6tsWJzg=; b=a0Uh+0/NzVTWut0foQw1rsKmUUAr2GBVpYMifrpAh0mLWUF4UcX39JJTxul+5VTETA 2H5mtw8r8rQOH0i9OXw3fmFtr+tO1Di/FGo1CEl4bdmnHYOifhR8MsV9t/Pp9ydvSZWZ awIFXmOHjuGDKvpCAeAmuNvRTqEx/Hk5nwAFJ0NzzURtbg7pOXfVbLi02ycND+vrsQyQ 2D+a+vrVzEjEM2Ly0Ro4WKWgGcVOeFks5Rb+BBrxz5eZR/D5HAZp09i5stguskXK9QZH tOcoJcCk+VJce1WPPpw+mIe3sCwr4OOZbN1V0apfNv71cUaa+NFONrk+BZbp6u8z9Kbj /Pfw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=3S75USrJcvg7+14y2qoSr0W2vyLWm3IhoTMN6tsWJzg=; b=qpCSH5Rx3M4aUwm1btQowhhA0WSEI0jd/gUzu4dC1bm7/meZfpAIxXToqk/h9VX+VD yciJPgn+oG65J0590YZe00dIIAEg5YIvNQN1mxyQVnD5hNhfH6s4dmeyVRTqPr2exW+Z v5PjBFUSM0yijvtJm+K2NeTLvxGC2VwU35JlMX5/jHxC/a7Uwh45258A6tMInDHnwT6t qkU3OiybRhduHmmaGC+mAaUM3XYu7Ympu+pUPx2Z84y/dWKGOoTJrvyg+zcFuu8Oi2Q+ txYJOw2zoiPhUedhpBYRVosgviUijZmaRzSdDNBI82M7RVeNImKWzjavy08tBQjC5Qyt fuAA== X-Gm-Message-State: APjAAAXngKnpLfJd4rW7sy70oFOlK9Hxmvg8DP9RrOHOnRvyjgnzVcSB w6uQsXk/Hspc6uPZyDVEIeWHVQ== X-Google-Smtp-Source: APXvYqw295YqlVSaCQGL02EkVC+azawsBSQLP/GA2Wfpln+atCptwejVsTETixabVk/2qhm1/CRvqg== X-Received: by 2002:a37:7a05:: with SMTP id v5mr1107870qkc.329.1582833376542; Thu, 27 Feb 2020 11:56:16 -0800 (PST) Received: from localhost ([2620:10d:c091:500::3:2450]) by smtp.gmail.com with ESMTPSA id o55sm3843680qtf.46.2020.02.27.11.56.15 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 27 Feb 2020 11:56:15 -0800 (PST) From: Johannes Weiner To: Andrew Morton Cc: Roman Gushchin , Michal Hocko , Tejun Heo , Chris Down , =?utf-8?q?Mic?= =?utf-8?q?hal_Koutn=C3=BD?= , linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@fb.com Subject: [PATCH 3/3] mm: memcontrol: recursive memory.low protection Date: Thu, 27 Feb 2020 14:56:06 -0500 Message-Id: <20200227195606.46212-4-hannes@cmpxchg.org> X-Mailer: git-send-email 2.24.1 In-Reply-To: <20200227195606.46212-1-hannes@cmpxchg.org> References: <20200227195606.46212-1-hannes@cmpxchg.org> MIME-Version: 1.0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Right now, the effective protection of any given cgroup is capped by its own explicit memory.low setting, regardless of what the parent says. The reasons for this are mostly historical and ease of implementation: to make delegation of memory.low safe, effective protection is the min() of all memory.low up the tree. Unfortunately, this limitation makes it impossible to protect an entire subtree from another without forcing the user to make explicit protection allocations all the way to the leaf cgroups - something that is highly undesirable in real life scenarios. Consider memory in a data center host. At the cgroup top level, we have a distinction between system management software and the actual workload the system is executing. Both branches are further subdivided into individual services, job components etc. We want to protect the workload as a whole from the system management software, but that doesn't mean we want to protect and prioritize individual workload wrt each other. Their memory demand can vary over time, and we'd want the VM to simply cache the hottest data within the workload subtree. Yet, the current memory.low limitations force us to allocate a fixed amount of protection to each workload component in order to get protection from system management software in general. This results in very inefficient resource distribution. Another concern with mandating downward allocation is that, as the complexity of the cgroup tree grows, it gets harder for the lower levels to be informed about decisions made at the host-level. Consider a container inside a namespace that in turn creates its own nested tree of cgroups to run multiple workloads. It'd be extremely difficult to configure memory.low parameters in those leaf cgroups that on one hand balance pressure among siblings as the container desires, while also reflecting the host-level protection from e.g. rpm upgrades, that lie beyond one or more delegation and namespacing points in the tree. It's highly unusual from a cgroup interface POV that nested levels have to be aware of and reflect decisions made at higher levels for them to be effective. To enable such use cases and scale configurability for complex trees, this patch implements a resource inheritance model for memory that is similar to how the CPU and the IO controller implement work-conserving resource allocations: a share of a resource allocated to a subree always applies to the entire subtree recursively, while allowing, but not mandating, children to further specify distribution rules. That means that if protection is explicitly allocated among siblings, those configured shares are being followed during page reclaim just like they are now. However, if the memory.low set at a higher level is not fully claimed by the children in that subtree, the "floating" remainder is applied to each cgroup in the tree in proportion to its size. Since reclaim pressure is applied in proportion to size as well, each child in that tree gets the same boost, and the effect is neutral among siblings - with respect to each other, they behave as if no memory control was enabled at all, and the VM simply balances the memory demands optimally within the subtree. But collectively those cgroups enjoy a boost over the cgroups in neighboring trees. E.g. a leaf cgroup with a memory.low setting of 0 no longer means that it's not getting a share of the hierarchically assigned resource, just that it doesn't claim a fixed amount of it to protect from its siblings. This allows us to recursively protect one subtree (workload) from another (system management), while letting subgroups compete freely among each other - without having to assign fixed shares to each leaf, and without nested groups having to echo higher-level settings. The floating protection composes naturally with fixed protection. Consider the following example tree: A A: low = 2G / \ A1: low = 1G A1 A2 A2: low = 0G As outside pressure is applied to this tree, A1 will enjoy a fixed protection from A2 of 1G, but the remaining, unclaimed 1G from A is split evenly among A1 and A2, coming out to 1.5G and 0.5G. There is a slight risk of regressing theoretical setups where the top-level cgroups don't know about the true budgeting and set bogusly high "bypass" values that are meaningfully allocated down the tree. Such setups would rely on unclaimed protection to be discarded, and distributing it would change the intended behavior. Be safe and hide the new behavior behind a mount option, 'memory_recursiveprot'. Acked-by: Tejun Heo Acked-by: Roman Gushchin Acked-by: Chris Down Signed-off-by: Johannes Weiner --- Documentation/admin-guide/cgroup-v2.rst | 11 ++++++ include/linux/cgroup-defs.h | 5 +++ kernel/cgroup/cgroup.c | 17 ++++++++- mm/memcontrol.c | 51 +++++++++++++++++++++++-- 4 files changed, 79 insertions(+), 5 deletions(-) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 0636bcb60b5a..e569d83621da 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -186,6 +186,17 @@ cgroup v2 currently supports the following mount options. modified through remount from the init namespace. The mount option is ignored on non-init namespace mounts. + memory_recursiveprot + + Recursively apply memory.min and memory.low protection to + entire subtrees, without requiring explicit downward + propagation into leaf cgroups. This allows protecting entire + subtrees from one another, while retaining free competition + within those subtrees. This should have been the default + behavior but is a mount-option to avoid regressing setups + relying on the original semantics (e.g. specifying bogusly + high 'bypass' protection values at higher tree levels). + Organizing Processes and Threads -------------------------------- diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 63097cb243cb..e1fafed22db1 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -94,6 +94,11 @@ enum { * Enable legacy local memory.events. */ CGRP_ROOT_MEMORY_LOCAL_EVENTS = (1 << 5), + + /* + * Enable recursive subtree protection + */ + CGRP_ROOT_MEMORY_RECURSIVE_PROT = (1 << 6), }; /* cftype->flags */ diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 735af8f15f95..a2f8d2ab8dec 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -1813,12 +1813,14 @@ int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node, enum cgroup2_param { Opt_nsdelegate, Opt_memory_localevents, + Opt_memory_recursiveprot, nr__cgroup2_params }; static const struct fs_parameter_spec cgroup2_param_specs[] = { fsparam_flag("nsdelegate", Opt_nsdelegate), fsparam_flag("memory_localevents", Opt_memory_localevents), + fsparam_flag("memory_recursiveprot", Opt_memory_recursiveprot), {} }; @@ -1844,6 +1846,9 @@ static int cgroup2_parse_param(struct fs_context *fc, struct fs_parameter *param case Opt_memory_localevents: ctx->flags |= CGRP_ROOT_MEMORY_LOCAL_EVENTS; return 0; + case Opt_memory_recursiveprot: + ctx->flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT; + return 0; } return -EINVAL; } @@ -1860,6 +1865,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags) cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_LOCAL_EVENTS; else cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_LOCAL_EVENTS; + + if (root_flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT) + cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT; + else + cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_RECURSIVE_PROT; } } @@ -1869,6 +1879,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root seq_puts(seq, ",nsdelegate"); if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_LOCAL_EVENTS) seq_puts(seq, ",memory_localevents"); + if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT) + seq_puts(seq, ",memory_recursiveprot"); return 0; } @@ -6364,7 +6376,10 @@ static struct kobj_attribute cgroup_delegate_attr = __ATTR_RO(delegate); static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - return snprintf(buf, PAGE_SIZE, "nsdelegate\nmemory_localevents\n"); + return snprintf(buf, PAGE_SIZE, + "nsdelegate\n" + "memory_localevents\n" + "memory_recursiveprot\n"); } static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 836c521bd61f..0dd5d5f70593 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -6234,13 +6234,27 @@ struct cgroup_subsys memory_cgrp_subsys = { * budget is NOT proportional. A cgroup's protection from a sibling * is capped to its own memory.min/low setting. * + * 5. However, to allow protecting recursive subtrees from each other + * without having to declare each individual cgroup's fixed share + * of the ancestor's claim to protection, any unutilized - + * "floating" - protection from up the tree is distributed in + * proportion to each cgroup's *usage*. This makes the protection + * neutral wrt sibling cgroups and lets them compete freely over + * the shared parental protection budget, but it protects the + * subtree as a whole from neighboring subtrees. + * + * Note that 4. and 5. are not in conflict: 4. is about protecting + * against immediate siblings whereas 5. is about protecting against + * neighboring subtrees. */ static unsigned long effective_protection(unsigned long usage, + unsigned long parent_usage, unsigned long setting, unsigned long parent_effective, unsigned long siblings_protected) { unsigned long protected; + unsigned long ep; protected = min(usage, setting); /* @@ -6271,7 +6285,34 @@ static unsigned long effective_protection(unsigned long usage, * protection is always dependent on how memory is actually * consumed among the siblings anyway. */ - return protected; + ep = protected; + + /* + * If the children aren't claiming (all of) the protection + * afforded to them by the parent, distribute the remainder in + * proportion to the (unprotected) memory of each cgroup. That + * way, cgroups that aren't explicitly prioritized wrt each + * other compete freely over the allowance, but they are + * collectively protected from neighboring trees. + * + * We're using unprotected memory for the weight so that if + * some cgroups DO claim explicit protection, we don't protect + * the same bytes twice. + */ + if (!(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT)) + return ep; + + if (parent_effective > siblings_protected && usage > protected) { + unsigned long unclaimed; + + unclaimed = parent_effective - siblings_protected; + unclaimed *= usage - protected; + unclaimed /= parent_usage - siblings_protected; + + ep += unclaimed; + } + + return ep; } /** @@ -6291,8 +6332,8 @@ static unsigned long effective_protection(unsigned long usage, enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root, struct mem_cgroup *memcg) { + unsigned long usage, parent_usage; struct mem_cgroup *parent; - unsigned long usage; if (mem_cgroup_disabled()) return MEMCG_PROT_NONE; @@ -6317,11 +6358,13 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root, goto out; } - memcg->memory.emin = effective_protection(usage, + parent_usage = page_counter_read(&parent->memory); + + memcg->memory.emin = effective_protection(usage, parent_usage, memcg->memory.min, READ_ONCE(parent->memory.emin), atomic_long_read(&parent->memory.children_min_usage)); - memcg->memory.elow = effective_protection(usage, + memcg->memory.elow = effective_protection(usage, parent_usage, memcg->memory.low, READ_ONCE(parent->memory.elow), atomic_long_read(&parent->memory.children_low_usage));