From patchwork Fri Jul 13 23:07:27 2018
X-Patchwork-Submitter: David Rientjes <rientjes@google.com>
X-Patchwork-Id: 10524227
Date: Fri, 13 Jul 2018 16:07:27 -0700 (PDT)
From: David Rientjes <rientjes@google.com>
To: Andrew Morton, Roman Gushchin
Cc: Michal Hocko, Vladimir Davydov, Johannes Weiner, Tejun Heo,
    cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [patch v3 -mm 2/6] mm, memcg: replace cgroup aware oom killer mount option with tunable
User-Agent: Alpine 2.21 (DEB 202 2017-01-01)

Now that each mem cgroup on the system has a memory.oom_policy tunable to
specify oom kill selection behavior, remove the needless "groupoom" mount
option that requires (1) the entire system to be forced, perhaps
unnecessarily, perhaps unexpectedly, into a single oom policy that differs
from the traditional per process selection, and (2) a remount to change.

Instead of enabling the cgroup aware oom killer with the "groupoom" mount
option, set the mem cgroup subtree's memory.oom_policy to "cgroup".

The heuristic used to select a process or cgroup to kill is controlled by
the oom mem cgroup's memory.oom_policy.
This means that if a descendant mem cgroup has an oom policy of "none",
for example, and an oom condition originates in an ancestor with an oom
policy of "cgroup", the selection logic will treat all descendant cgroups
as indivisible memory consumers.

For example, consider a hierarchy where each mem cgroup has "memory" set
in cgroup.controllers:

	mem cgroup	cgroup.procs
	==========	============
	/cg1		1 process consuming 250MB
	/cg2		3 processes consuming 100MB each
	/cg3/cg31	2 processes consuming 100MB each
	/cg3/cg32	2 processes consuming 100MB each

If the root mem cgroup's memory.oom_policy is "none", the process from
/cg1 is chosen as the victim.  If memory.oom_policy is "cgroup", a process
from /cg2 is chosen because it is in the single indivisible memory
consumer with the greatest usage.  This policy of "cgroup" is identical to
the current "groupoom" mount option, now removed.

Note that /cg3 is not the chosen victim when the oom mem cgroup policy is
"cgroup" because cgroups are treated individually without regard to
hierarchical /cg3/memory.current usage.  This will be addressed in a
follow-up patch.

This has the added benefit of allowing descendant cgroups to control their
own oom policies if they have memory.oom_policy file permissions without
being restricted to the system-wide policy.  In the above example, /cg2
and /cg3 can be either "none" or "cgroup" with the same results: the
selection heuristic depends only on the policy of the oom mem cgroup.  If
/cg2 or /cg3 themselves are oom, however, the policy is controlled by
their own oom policies, either process aware or cgroup aware.
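As an editorial aside, the two selection heuristics applied to the example
hierarchy above can be sketched as a small standalone simulation.  This is
a toy model for illustration only (the function and variable names are
invented here), not kernel code:

```python
# Toy model of OOM victim selection for the example hierarchy above.
# Each leaf mem cgroup maps to the per-process memory footprints (MB) of
# the processes attached to it.  Illustrative only; not kernel code.
cgroups = {
    "/cg1": [250],
    "/cg2": [100, 100, 100],
    "/cg3/cg31": [100, 100],
    "/cg3/cg32": [100, 100],
}

def pick_victim(policy):
    """Return the cgroup the victim is chosen from under the given policy."""
    if policy == "none":
        # Traditional per-process selection: the single largest process.
        return max(((cg, mb) for cg, procs in cgroups.items() for mb in procs),
                   key=lambda t: t[1])[0]
    if policy == "cgroup":
        # Cgroup-aware selection: leaf cgroups are indivisible consumers,
        # compared on cumulative usage (no hierarchical accounting yet,
        # so /cg3 is never compared as a whole).
        return max(cgroups, key=lambda cg: sum(cgroups[cg]))
    raise ValueError(policy)

print(pick_victim("none"))    # /cg1 (largest single process, 250MB)
print(pick_victim("cgroup"))  # /cg2 (largest indivisible consumer, 300MB)
```

This reproduces the outcome described above: "none" picks the process in
/cg1, while "cgroup" picks /cg2 and neither /cg3 leaf.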
Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 78 +++++++++++++------------
 include/linux/cgroup-defs.h             |  5 --
 include/linux/memcontrol.h              |  5 ++
 kernel/cgroup/cgroup.c                  | 13 +----
 mm/memcontrol.c                         | 19 +++---
 5 files changed, 56 insertions(+), 64 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1109,6 +1109,17 @@ PAGE_SIZE multiple when read back.
 	Documentation/filesystems/proc.txt).  This is the same policy as if
 	memory cgroups were not even mounted.

+	If "cgroup", the OOM killer will compare mem cgroups as indivisible
+	memory consumers; that is, they will compare mem cgroup usage rather
+	than process memory footprint.  See the "OOM Killer" section below.
+
+	When an OOM condition occurs, the policy is dictated by the mem
+	cgroup that is OOM (the root mem cgroup for a system-wide OOM
+	condition).  If a descendant mem cgroup has a policy of "none", for
+	example, for an OOM condition in a mem cgroup with policy "cgroup",
+	the heuristic will still compare mem cgroups as indivisible memory
+	consumers.
+
   memory.events
 	A read-only flat-keyed file which exists on non-root
 	cgroups.  The following entries are defined.  Unless specified
@@ -1336,43 +1347,36 @@ belonging to the affected files to ensure correct memory ownership.
 OOM Killer
 ~~~~~~~~~~

-Cgroup v2 memory controller implements a cgroup-aware OOM killer.
-It means that it treats cgroups as first class OOM entities.
-
-Cgroup-aware OOM logic is turned off by default and requires
-passing the "groupoom" option on mounting cgroupfs.  It can also
-by remounting cgroupfs with the following command::
-
-  # mount -o remount,groupoom $MOUNT_POINT
-
-Under OOM conditions the memory controller tries to make the best
-choice of a victim, looking for a memory cgroup with the largest
-memory footprint, considering leaf cgroups and cgroups with the
-memory.oom_group option set, which are considered to be an indivisible
-memory consumers.
-
-By default, OOM killer will kill the biggest task in the selected
-memory cgroup. A user can change this behavior by enabling
-the per-cgroup memory.oom_group option. If set, it causes
-the OOM killer to kill all processes attached to the cgroup,
-except processes with oom_score_adj set to -1000.
-
-This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
-the memory controller considers only cgroups belonging to the sub-tree
-of the OOM'ing cgroup.
-
-Leaf cgroups and cgroups with oom_group option set are compared based
-on their cumulative memory usage. The root cgroup is treated as a
-leaf memory cgroup as well, so it is compared with other leaf memory
-cgroups. Due to internal implementation restrictions the size of
-the root cgroup is the cumulative sum of oom_badness of all its tasks
-(in other words oom_score_adj of each task is obeyed). Relying on
-oom_score_adj (apart from OOM_SCORE_ADJ_MIN) can lead to over- or
-underestimation of the root cgroup consumption and it is therefore
-discouraged. This might change in the future, however.
-
-If there are no cgroups with the enabled memory controller,
-the OOM killer is using the "traditional" process-based approach.
+Cgroup v2 memory controller implements an optional cgroup-aware out of
+memory killer, which treats cgroups as indivisible OOM entities.
+
+This policy is controlled by memory.oom_policy.  When a memory cgroup is
+out of memory, its memory.oom_policy will dictate how the OOM killer will
+select a process, or cgroup, to kill.  Likewise, when the system is OOM,
+the policy is dictated by the root mem cgroup.
+
+There are currently two available oom policies:
+
+ - "none": default, choose the largest single memory hogging process to
+   oom kill, as traditionally the OOM killer has always done.
+
+ - "cgroup": choose the cgroup with the largest memory footprint from the
+   subtree as an OOM victim and kill at least one process, depending on
+   memory.oom_group, from it.
+
+When selecting a cgroup as a victim, the OOM killer will kill the process
+with the largest memory footprint.  A user can control this behavior by
+enabling the per-cgroup memory.oom_group option.  If set, it causes the
+OOM killer to kill all processes attached to the cgroup, except processes
+with /proc/pid/oom_score_adj set to -1000 (oom disabled).
+
+The root cgroup is treated as a leaf memory cgroup as well, so it is
+compared with other leaf memory cgroups.  Due to internal implementation
+restrictions the size of the root cgroup is the cumulative sum of
+oom_badness of all its tasks (in other words oom_score_adj of each task
+is obeyed).  Relying on oom_score_adj (apart from OOM_SCORE_ADJ_MIN) can
+lead to over- or underestimation of the root cgroup consumption and it is
+therefore discouraged.  This might change in the future, however.

 Please, note that memory charges are not migrating if tasks
 are moved between different memory cgroups. Moving tasks with
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -81,11 +81,6 @@ enum {
 	 * Enable cpuset controller in v1 cgroup to use v2 behavior.
 	 */
 	CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
-
-	/*
-	 * Enable cgroup-aware OOM killer.
-	 */
-	CGRP_GROUP_OOM = (1 << 5),
 };

 /* cftype->flags */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -72,6 +72,11 @@ enum memcg_oom_policy {
 	 * oom_badness()
 	 */
 	MEMCG_OOM_POLICY_NONE,
+	/*
+	 * Local cgroup usage is used to select a target cgroup, treating each
+	 * mem cgroup as an indivisible consumer
+	 */
+	MEMCG_OOM_POLICY_CGROUP,
 };

 struct mem_cgroup_reclaim_cookie {
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1744,9 +1744,6 @@ static int cgroup2_parse_option(struct fs_context *fc, char *token)
 	if (!strcmp(token, "nsdelegate")) {
 		ctx->flags |= CGRP_ROOT_NS_DELEGATE;
 		return 0;
-	} else if (!strcmp(token, "groupoom")) {
-		ctx->flags |= CGRP_GROUP_OOM;
-		return 0;
 	}

 	return -EINVAL;
@@ -1757,8 +1754,6 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
 	if (current->nsproxy->cgroup_ns == &init_cgroup_ns) {
 		if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
 			seq_puts(seq, ",nsdelegate");
-		if (cgrp_dfl_root.flags & CGRP_GROUP_OOM)
-			seq_puts(seq, ",groupoom");
 	}
 	return 0;
 }
@@ -1770,11 +1765,6 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
 			cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE;
 		else
 			cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
-
-		if (root_flags & CGRP_GROUP_OOM)
-			cgrp_dfl_root.flags |= CGRP_GROUP_OOM;
-		else
-			cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM;
 	}
 }
@@ -6012,8 +6002,7 @@ static struct kobj_attribute cgroup_delegate_attr = __ATTR_RO(delegate);
 static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr,
 			     char *buf)
 {
-	return snprintf(buf, PAGE_SIZE, "nsdelegate\n"
-					"groupoom\n");
+	return snprintf(buf, PAGE_SIZE, "nsdelegate\n");
 }
 static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3022,14 +3022,14 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return false;

-	if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
-		return false;
-
 	if (oc->memcg)
 		root = oc->memcg;
 	else
 		root = root_mem_cgroup;

+	if (root->oom_policy != MEMCG_OOM_POLICY_CGROUP)
+		return false;
+
 	select_victim_memcg(root, oc);

 	return oc->chosen_memcg;
@@ -5683,9 +5683,6 @@ static int memory_oom_group_show(struct seq_file *m, void *v)
 	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
 	bool oom_group = memcg->oom_group;

-	if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
-		return -ENOTSUPP;
-
 	seq_printf(m, "%d\n", oom_group);

 	return 0;
@@ -5699,9 +5696,6 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
 	int oom_group;
 	int err;

-	if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
-		return -ENOTSUPP;
-
 	err = kstrtoint(strstrip(buf), 0, &oom_group);
 	if (err)
 		return err;
@@ -5817,9 +5811,12 @@ static int memory_oom_policy_show(struct seq_file *m, void *v)
 	enum memcg_oom_policy policy = READ_ONCE(memcg->oom_policy);

 	switch (policy) {
+	case MEMCG_OOM_POLICY_CGROUP:
+		seq_puts(m, "none [cgroup]\n");
+		break;
 	case MEMCG_OOM_POLICY_NONE:
 	default:
-		seq_puts(m, "[none]\n");
+		seq_puts(m, "[none] cgroup\n");
 	};
 	return 0;
 }
@@ -5833,6 +5830,8 @@ static ssize_t memory_oom_policy_write(struct kernfs_open_file *of,
 	buf = strstrip(buf);
 	if (!memcmp("none", buf, min(sizeof("none")-1, nbytes)))
 		memcg->oom_policy = MEMCG_OOM_POLICY_NONE;
+	else if (!memcmp("cgroup", buf, min(sizeof("cgroup")-1, nbytes)))
+		memcg->oom_policy = MEMCG_OOM_POLICY_CGROUP;
 	else
 		ret = -EINVAL;
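
As a closing editorial note, the read/write semantics of the new
memory.oom_policy file, as implemented by the memory_oom_policy_show()
and memory_oom_policy_write() hunks in the patch, can be sketched as a
toy userspace model (class and method names invented here; this is not
the kernel implementation):

```python
# Toy model of the memory.oom_policy file semantics from the patch;
# illustrative only.  The active policy is shown in brackets, and any
# string other than a recognized policy name is rejected, mirroring
# the -EINVAL path in memory_oom_policy_write().
class MemCgroup:
    def __init__(self):
        self.oom_policy = "none"  # MEMCG_OOM_POLICY_NONE is the default

    def show_policy(self):
        # Mirrors memory_oom_policy_show(): bracket the active policy.
        if self.oom_policy == "cgroup":
            return "none [cgroup]"
        return "[none] cgroup"

    def write_policy(self, buf):
        buf = buf.strip()  # the kernel side uses strstrip()
        if buf not in ("none", "cgroup"):
            raise ValueError("-EINVAL")
        self.oom_policy = buf

memcg = MemCgroup()
print(memcg.show_policy())      # [none] cgroup
memcg.write_policy("cgroup\n")
print(memcg.show_policy())      # none [cgroup]
```

In practice the file would live at a path such as
/sys/fs/cgroup/<cgroup>/memory.oom_policy on a cgroup2 mount, and writing
"cgroup" to it enables cgroup-aware selection for OOM conditions arising
in that mem cgroup's subtree, per the commit message above.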