From patchwork Fri Jul 13 23:07:29 2018
X-Patchwork-Submitter: David Rientjes
X-Patchwork-Id: 10524229
Date: Fri, 13 Jul 2018 16:07:29 -0700 (PDT)
From: David Rientjes
To: Andrew Morton, Roman Gushchin
Cc: Michal Hocko, Vladimir Davydov, Johannes Weiner, Tejun Heo,
    cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [patch v3 -mm 3/6] mm, memcg: add hierarchical usage oom policy

One of the three significant concerns brought up about the cgroup-aware
oom killer is that its decision-making can be completely evaded by
creating subcontainers and attaching processes such that the ancestor's
usage does not exceed that of another cgroup on the system.

Consider the example from the previous patch where "memory" is set in
each mem cgroup's cgroup.controllers:

	mem cgroup	cgroup.procs
	==========	============
	/cg1		1 process consuming 250MB
	/cg2		3 processes consuming 100MB each
	/cg3/cg31	2 processes consuming 100MB each
	/cg3/cg32	2 processes consuming 100MB each

If memory.oom_policy is "cgroup", a process from /cg2 is chosen because
it is in the single indivisible memory consumer with the greatest usage.
The true usage of /cg3 is actually 400MB, but a process from /cg2 is
chosen because cgroups are compared individually rather than
hierarchically.
If a system is divided into two users, for example:

	mem cgroup	memory.max
	==========	==========
	/userA		250MB
	/userB		250MB

If /userA runs all processes attached to the local mem cgroup, whereas
/userB distributes their processes over a set of subcontainers under
/userB, /userA will be unfairly penalized.

There is incentive with cgroup v2 to distribute processes over a set of
subcontainers if those processes shall be constrained by other cgroup
controllers; this is a direct result of mandating a single, unified
hierarchy for cgroups.  A user may also reasonably do this for mem
cgroup control or statistics.  And, a user may do this to evade the
cgroup-aware oom killer selection logic.

This patch adds an oom policy, "tree", that accounts for hierarchical
usage when comparing cgroups, provided the cgroup-aware oom killer is
enabled by an ancestor.  This allows administrators, for example, to
require users in their own top-level mem cgroup subtree to be accounted
for with hierarchical usage.  In other words, they can no longer evade
the oom killer by using other controllers or subcontainers.

If an oom policy of "tree" is in place for a subtree, such as /cg3
above, the hierarchical usage is used for comparisons with other cgroups
if either "cgroup" or "tree" is the oom policy of the oom mem cgroup.
Thus, if /cg3/memory.oom_policy is "tree", one of the processes from
/cg3's subcontainers is chosen for oom kill.

Signed-off-by: David Rientjes
---
 Documentation/admin-guide/cgroup-v2.rst | 17 ++++++++++++++---
 include/linux/memcontrol.h              |  5 +++++
 mm/memcontrol.c                         | 18 ++++++++++++------
 3 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1113,6 +1113,10 @@ PAGE_SIZE multiple when read back.
 	memory consumers; that is, they will compare mem cgroup usage rather
 	than process memory footprint.  See the "OOM Killer" section below.
 
+	If "tree", the OOM killer will compare a mem cgroup and its subtree
+	as a single indivisible memory consumer.  This policy cannot be set
+	on the root mem cgroup.  See the "OOM Killer" section below.
+
 	When an OOM condition occurs, the policy is dictated by the mem
 	cgroup that is OOM (the root mem cgroup for a system-wide OOM
 	condition).  If a descendant mem cgroup has a policy of "none", for
@@ -1120,6 +1124,10 @@ PAGE_SIZE multiple when read back.
 	the heuristic will still compare mem cgroups as indivisible memory
 	consumers.
 
+	When an OOM condition occurs in a mem cgroup with an OOM policy of
+	"cgroup" or "tree", the OOM killer will compare mem cgroups with
+	"cgroup" policy individually with "tree" policy subtrees.
+
   memory.events
 	A read-only flat-keyed file which exists on non-root cgroups.
 	The following entries are defined.  Unless specified
@@ -1355,7 +1363,7 @@ out of memory, its memory.oom_policy will dictate how the OOM killer will
 select a process, or cgroup, to kill.  Likewise, when the system is OOM,
 the policy is dictated by the root mem cgroup.
 
-There are currently two available oom policies:
+There are currently three available oom policies:
 
  - "none": default, choose the largest single memory hogging process to
    oom kill, as traditionally the OOM killer has always done.
@@ -1364,6 +1372,9 @@ There are currently two available oom policies:
    subtree as an OOM victim and kill at least one process, depending on
    memory.oom_group, from it.
 
+ - "tree": choose the cgroup with the largest memory footprint considering
+   itself and its subtree and kill at least one process.
+
 When selecting a cgroup as a victim, the OOM killer will kill the process
 with the largest memory footprint.  A user can control this behavior by
 enabling the per-cgroup memory.oom_group option.  If set, it causes the
@@ -1382,8 +1393,8 @@ Please, note that memory charges are not migrating if tasks
 are moved between different memory cgroups. Moving tasks with
 significant memory footprint may affect OOM victim selection logic.
 If it's a case, please, consider creating a common ancestor for
-the source and destination memory cgroups and enabling oom_group
-on ancestor layer.
+the source and destination memory cgroups and setting a policy of "tree"
+and enabling oom_group on an ancestor layer.
 
 
 IO
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -77,6 +77,11 @@ enum memcg_oom_policy {
 	 * mem cgroup as an indivisible consumer
 	 */
 	MEMCG_OOM_POLICY_CGROUP,
+	/*
+	 * Tree cgroup usage for all descendant memcg groups, treating each mem
+	 * cgroup and its subtree as an indivisible consumer
+	 */
+	MEMCG_OOM_POLICY_TREE,
 };
 
 struct mem_cgroup_reclaim_cookie {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2952,7 +2952,7 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
 	/*
 	 * The oom_score is calculated for leaf memory cgroups (including
 	 * the root memcg).
-	 * Non-leaf oom_group cgroups accumulating score of descendant
+	 * Cgroups with oom policy of "tree" accumulate the score of descendant
 	 * leaf memory cgroups.
 	 */
 	rcu_read_lock();
@@ -2961,10 +2961,11 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
 
 		/*
 		 * We don't consider non-leaf non-oom_group memory cgroups
-		 * as OOM victims.
+		 * without the oom policy of "tree" as OOM victims.
 		 */
 		if (memcg_has_children(iter) && iter != root_mem_cgroup &&
-		    !mem_cgroup_oom_group(iter))
+		    !mem_cgroup_oom_group(iter) &&
+		    iter->oom_policy != MEMCG_OOM_POLICY_TREE)
 			continue;
 
 		/*
@@ -3027,7 +3028,7 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
 	else
 		root = root_mem_cgroup;
 
-	if (root->oom_policy != MEMCG_OOM_POLICY_CGROUP)
+	if (root->oom_policy == MEMCG_OOM_POLICY_NONE)
 		return false;
 
 	select_victim_memcg(root, oc);
@@ -5812,11 +5813,14 @@ static int memory_oom_policy_show(struct seq_file *m, void *v)
 
 	switch (policy) {
 	case MEMCG_OOM_POLICY_CGROUP:
-		seq_puts(m, "none [cgroup]\n");
+		seq_puts(m, "none [cgroup] tree\n");
+		break;
+	case MEMCG_OOM_POLICY_TREE:
+		seq_puts(m, "none cgroup [tree]\n");
 		break;
 	case MEMCG_OOM_POLICY_NONE:
 	default:
-		seq_puts(m, "[none] cgroup\n");
+		seq_puts(m, "[none] cgroup tree\n");
 	};
 	return 0;
 }
@@ -5832,6 +5836,8 @@ static ssize_t memory_oom_policy_write(struct kernfs_open_file *of,
 		memcg->oom_policy = MEMCG_OOM_POLICY_NONE;
 	else if (!memcmp("cgroup", buf, min(sizeof("cgroup")-1, nbytes)))
 		memcg->oom_policy = MEMCG_OOM_POLICY_CGROUP;
+	else if (!memcmp("tree", buf, min(sizeof("tree")-1, nbytes)))
+		memcg->oom_policy = MEMCG_OOM_POLICY_TREE;
 	else
 		ret = -EINVAL;