From patchwork Thu Jul 9 15:53:07 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Yafang Shao
X-Patchwork-Id: 11654655
From: Yafang Shao
To: mhocko@kernel.org, rientjes@google.com, akpm@linux-foundation.org
Cc: linux-mm@kvack.org, Yafang Shao
Subject: [PATCH v2] mm, oom: make the calculation of oom badness more accurate
Date: Thu, 9 Jul 2020 11:53:07 -0400
Message-Id: <1594309987-9919-1-git-send-email-laoar.shao@gmail.com>
X-Mailer: git-send-email 1.8.3.1
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org

Recently we found an issue in our production environment: when a memcg
oom is triggered, the oom killer doesn't choose the process with the
largest resident memory but instead chooses the first scanned process.
Note that all processes in this memcg have the same oom_score_adj, so
the oom killer should choose the process with the largest resident
memory.

Below is part of the oom info, which is enough to analyze this issue.

[7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
[7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
[7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
[...]
[7516987.983293] [ pid ]   uid  tgid total_vm     rss pgtables_bytes swapents oom_score_adj name
[7516987.983510] [ 5740]     0  5740      257       1          32768        0          -998 pause
[7516987.983574] [58804]     0 58804     4594     771          81920        0          -998 entry_point.bas
[7516987.983577] [58908]     0 58908     7089     689          98304        0          -998 cron
[7516987.983580] [58910]     0 58910    16235    5576         163840        0          -998 supervisord
[7516987.983590] [59620]     0 59620    18074    1395         188416        0          -998 sshd
[7516987.983594] [59622]     0 59622    18680    6679         188416        0          -998 python
[7516987.983598] [59624]     0 59624  1859266    5161         548864        0          -998 odin-agent
[7516987.983600] [59625]     0 59625   707223    9248         983040        0          -998 filebeat
[7516987.983604] [59627]     0 59627   416433   64239         774144        0          -998 odin-log-agent
[7516987.983607] [59631]     0 59631   180671   15012         385024        0          -998 python3
[7516987.983612] [61396]     0 61396   791287    3189         352256        0          -998 client
[7516987.983615] [61641]     0 61641  1844642   29089         946176        0          -998 client
[7516987.983765] [ 9236]     0  9236     2642     467          53248        0          -998 php_scanner
[7516987.983911] [42898]     0 42898    15543     838         167936        0          -998 su
[7516987.983915] [42900]  1000 42900     3673     867          77824        0          -998 exec_script_vr2
[7516987.983918] [42925]  1000 42925    36475   19033         335872        0          -998 python
[7516987.983921] [57146]  1000 57146     3673     848          73728        0          -998 exec_script_J2p
[7516987.983925] [57195]  1000 57195   186359   22958         491520        0          -998 python2
[7516987.983928] [58376]  1000 58376   275764   14402         290816        0          -998 rosmaster
[7516987.983931] [58395]  1000 58395   155166    4449         245760        0          -998 rosout
[7516987.983935] [58406]  1000 58406 18285584 3967322       37101568        0          -998 data_sim
[7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
[7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

We can see that the first scanned process, 5740 (pause), was killed,
even though its rss is only one page. That is because, when we calculate
the oom badness in oom_badness(), we always ignore negative points and
convert all of them to 1. Since all the processes in the targeted memcg
have the same oom_score_adj of -998, the points of these processes are
all negative. As a result, the first scanned process is killed.

The oom_score_adj (-998) in this memcg is set by kubelet because it is a
Guaranteed pod, which has a higher priority to avoid being killed by the
system oom killer.

To fix this issue, we should make the calculation of oom points more
accurate. We can achieve that by converting chosen_points from
'unsigned long' to 'long'.

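The effect described above can be reproduced in user space with a small
sketch. This is not kernel code: struct demo_task, the demo_* helpers and
DEMO_PAGE_SIZE are hypothetical names, the badness formula (rss + swapents
+ page-table pages, plus oom_score_adj scaled by totalpages / 1000) is a
simplified model of oom_badness(), and totalpages is assumed to be the
16777216kB memcg limit in 4 KiB pages (4194304). Three rows from the log
above are used as input.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096L	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for the per-task fields shown in the oom dump. */
struct demo_task {
	int pid;
	long rss;		/* resident pages */
	long pgtables_bytes;
	long swapents;
	long oom_score_adj;
};

/* Signed badness, modelled on the patched oom_badness(): no clamping. */
static long demo_badness_new(const struct demo_task *t, long totalpages)
{
	long points = t->rss + t->swapents + t->pgtables_bytes / DEMO_PAGE_SIZE;
	long adj = t->oom_score_adj;

	assert(totalpages > 0);
	adj *= totalpages / 1000;
	return points + adj;
}

/* Old behaviour: every non-positive result collapses to 1. */
static unsigned long demo_badness_old(const struct demo_task *t, long totalpages)
{
	long points = demo_badness_new(t, totalpages);

	return points > 0 ? (unsigned long)points : 1;
}

/* Selection loop using the patched comparison, starting from LONG_MIN. */
static int demo_select_new(const struct demo_task *tasks, size_t n, long totalpages)
{
	long chosen_points = LONG_MIN;
	int chosen_pid = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		long points = demo_badness_new(&tasks[i], totalpages);

		if (points == LONG_MIN || points < chosen_points)
			continue;
		chosen_pid = tasks[i].pid;
		chosen_points = points;
	}
	return chosen_pid;
}

/* pause, odin-log-agent and data_sim, copied from the log above. */
static const struct demo_task demo_tasks[] = {
	{ 5740, 1, 32768, 0, -998 },
	{ 59627, 64239, 774144, 0, -998 },
	{ 58406, 3967322, 37101568, 0, -998 },
};
```

Under these assumptions, demo_badness_old() returns 1 for both pause and
data_sim, so the old scores carry no information about memory usage and
the victim depends on scan order alone. demo_badness_new() keeps the
scores distinct (-4185603 for pause, -209232 for data_sim), and
demo_select_new(demo_tasks, 3, 4194304) picks pid 58406 (data_sim).
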
Signed-off-by: Yafang Shao
Acked-by: Michal Hocko
---
 drivers/tty/sysrq.c |  1 +
 fs/proc/base.c      |  7 ++++++-
 include/linux/oom.h |  4 ++--
 mm/memcontrol.c     |  1 +
 mm/oom_kill.c       | 19 ++++++++-----------
 mm/page_alloc.c     |  1 +
 6 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index 7c95afa9..e83fd46 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -382,6 +382,7 @@ static void moom_callback(struct work_struct *ignored)
 		.memcg = NULL,
 		.gfp_mask = gfp_mask,
 		.order = -1,
+		.chosen_points = LONG_MIN,
 	};
 
 	mutex_lock(&oom_lock);
diff --git a/fs/proc/base.c b/fs/proc/base.c
index d86c0af..bf16406 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -551,8 +551,13 @@ static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
 {
 	unsigned long totalpages = totalram_pages() + total_swap_pages;
 	unsigned long points = 0;
+	long badness;
 
-	points = oom_badness(task, totalpages) * 1000 / totalpages;
+	badness = oom_badness(task, totalpages);
+	if (badness != LONG_MIN) {
+		/* Let's keep the range of points as [0, 2000]. */
+		points = (1000 + badness * 1000 / (long)totalpages) * 2 / 3;
+	}
 	seq_printf(m, "%lu\n", points);
 
 	return 0;
diff --git a/include/linux/oom.h b/include/linux/oom.h
index c696c26..f022f58 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -48,7 +48,7 @@ struct oom_control {
 	/* Used by oom implementation, do not set */
 	unsigned long totalpages;
 	struct task_struct *chosen;
-	unsigned long chosen_points;
+	long chosen_points;
 
 	/* Used to print the constraint info. */
 	enum oom_constraint constraint;
@@ -107,7 +107,7 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
 
 bool __oom_reap_task_mm(struct mm_struct *mm);
 
-extern unsigned long oom_badness(struct task_struct *p,
+long oom_badness(struct task_struct *p,
 		unsigned long totalpages);
 
 extern bool out_of_memory(struct oom_control *oc);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1962232..df73b30 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1559,6 +1559,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		.memcg = memcg,
 		.gfp_mask = gfp_mask,
 		.order = order,
+		.chosen_points = LONG_MIN,
 	};
 	bool ret;
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6e94962..2dd5a90 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -196,17 +196,17 @@ static bool is_dump_unreclaim_slabs(void)
  * predictable as possible. The goal is to return the highest value for the
  * task consuming the most memory to avoid subsequent oom failures.
  */
-unsigned long oom_badness(struct task_struct *p, unsigned long totalpages)
+long oom_badness(struct task_struct *p, unsigned long totalpages)
 {
 	long points;
 	long adj;
 
 	if (oom_unkillable_task(p))
-		return 0;
+		return LONG_MIN;
 
 	p = find_lock_task_mm(p);
 	if (!p)
-		return 0;
+		return LONG_MIN;
 
 	/*
 	 * Do not even consider tasks which are explicitly marked oom
@@ -218,7 +218,7 @@ unsigned long oom_badness(struct task_struct *p, unsigned long totalpages)
 			test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
 			in_vfork(p)) {
 		task_unlock(p);
-		return 0;
+		return LONG_MIN;
 	}
 
 	/*
@@ -233,11 +233,7 @@ unsigned long oom_badness(struct task_struct *p, unsigned long totalpages)
 	adj *= totalpages / 1000;
 	points += adj;
 
-	/*
-	 * Never return 0 for an eligible task regardless of the root bonus and
-	 * oom_score_adj (oom_score_adj can't be OOM_SCORE_ADJ_MIN here).
-	 */
-	return points > 0 ? points : 1;
+	return points;
 }
 
 static const char * const oom_constraint_text[] = {
@@ -336,12 +332,12 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 	 * killed first if it triggers an oom, then select it.
 	 */
 	if (oom_task_origin(task)) {
-		points = ULONG_MAX;
+		points = LONG_MAX;
 		goto select;
 	}
 
 	points = oom_badness(task, oc->totalpages);
-	if (!points || points < oc->chosen_points)
+	if (points == LONG_MIN || points < oc->chosen_points)
 		goto next;
 
 select:
@@ -1128,6 +1124,7 @@ void pagefault_out_of_memory(void)
 		.memcg = NULL,
 		.gfp_mask = 0,
 		.order = 0,
+		.chosen_points = LONG_MIN,
 	};
 
 	if (mem_cgroup_oom_synchronize(true))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e028b87c..8eec9d65 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3896,6 +3896,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 		.memcg = NULL,
 		.gfp_mask = gfp_mask,
 		.order = order,
+		.chosen_points = LONG_MIN,
 	};
 	struct page *page;
 
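
Note on the proc_oom_score() hunk: the new expression maps a signed
badness onto the range [0, 2000] mentioned in the comment. The sketch
below copies that expression into a stand-alone helper so the mapping can
be checked by hand. demo_oom_score() is a hypothetical user-space name,
not a kernel function, and the sketch assumes badness stays within
[-totalpages, 2 * totalpages], which is what keeps the intermediate
value non-negative.

```c
#include <assert.h>
#include <limits.h>

/*
 * User-space copy of the normalization added to proc_oom_score().
 * LONG_MIN marks an ineligible task and is reported as 0.
 */
static unsigned long demo_oom_score(long badness, unsigned long totalpages)
{
	unsigned long points = 0;

	assert(totalpages > 0);
	if (badness != LONG_MIN) {
		/* badness * 1000 / totalpages lies in [-1000, 2000] here. */
		points = (1000 + badness * 1000 / (long)totalpages) * 2 / 3;
	}
	return points;
}
```

With totalpages = 100000, a badness of -100000 maps to 0, a badness of 0
maps to 666, and a badness of 200000 maps to 2000. A task that uses no
memory and has oom_score_adj 0 therefore no longer reports 0 under this
mapping.
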