diff mbox series

[v3] mm, oom: make the calculation of oom badness more accurate

Message ID 1594396651-9931-1-git-send-email-laoar.shao@gmail.com (mailing list archive)
State New, archived
Headers show
Series [v3] mm, oom: make the calculation of oom badness more accurate | expand

Commit Message

Yafang Shao July 10, 2020, 3:57 p.m. UTC
Recently we found an issue on our production environment that when memcg
oom is triggered the oom killer doesn't chose the process with largest
resident memory but chose the first scanned process. Note that all
processes in this memcg have the same oom_score_adj, so the oom killer
should chose the process with largest resident memory.

Bellow is part of the oom info, which is enough to analyze this issue.
[7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
[7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
[7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
[...]
[7516987.983293] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[7516987.983510] [ 5740]     0  5740      257        1    32768        0          -998 pause
[7516987.983574] [58804]     0 58804     4594      771    81920        0          -998 entry_point.bas
[7516987.983577] [58908]     0 58908     7089      689    98304        0          -998 cron
[7516987.983580] [58910]     0 58910    16235     5576   163840        0          -998 supervisord
[7516987.983590] [59620]     0 59620    18074     1395   188416        0          -998 sshd
[7516987.983594] [59622]     0 59622    18680     6679   188416        0          -998 python
[7516987.983598] [59624]     0 59624  1859266     5161   548864        0          -998 odin-agent
[7516987.983600] [59625]     0 59625   707223     9248   983040        0          -998 filebeat
[7516987.983604] [59627]     0 59627   416433    64239   774144        0          -998 odin-log-agent
[7516987.983607] [59631]     0 59631   180671    15012   385024        0          -998 python3
[7516987.983612] [61396]     0 61396   791287     3189   352256        0          -998 client
[7516987.983615] [61641]     0 61641  1844642    29089   946176        0          -998 client
[7516987.983765] [ 9236]     0  9236     2642      467    53248        0          -998 php_scanner
[7516987.983911] [42898]     0 42898    15543      838   167936        0          -998 su
[7516987.983915] [42900]  1000 42900     3673      867    77824        0          -998 exec_script_vr2
[7516987.983918] [42925]  1000 42925    36475    19033   335872        0          -998 python
[7516987.983921] [57146]  1000 57146     3673      848    73728        0          -998 exec_script_J2p
[7516987.983925] [57195]  1000 57195   186359    22958   491520        0          -998 python2
[7516987.983928] [58376]  1000 58376   275764    14402   290816        0          -998 rosmaster
[7516987.983931] [58395]  1000 58395   155166     4449   245760        0          -998 rosout
[7516987.983935] [58406]  1000 58406 18285584  3967322 37101568        0          -998 data_sim
[7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
[7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

We can find that the first scanned process 5740 (pause) was killed, but its
rss is only one page. That is because, when we calculate the oom badness in
oom_badness(), we always ignore the negtive point and convert all of these
negtive points to 1. Now as oom_score_adj of all the processes in this
targeted memcg have the same value -998, the points of these processes are
all negtive value. As a result, the first scanned process will be killed.

The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a
a Guaranteed pod, which has higher priority to prevent from being killed by
system oom.

To fix this issue, we should make the calculation of oom point more
accurate. We can achieve it by convert the chosen_point from 'unsigned
long' to 'long'.

[cai@lca.pw: reported a issue in the previous version]
[mhocko@suse.com: fixed the issue reported by Cai]
[mhocko@suse.com: add the comment in proc_oom_score()]
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>

---
v2 -> v3:
- fix the type of variable 'point' in oom_evaluate_task()
- initialize oom_control->chosen_points in select_bad_process() per Michal
- update the comment in proc_oom_score() per Michal

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 fs/proc/base.c      | 11 ++++++++++-
 include/linux/oom.h |  4 ++--
 mm/oom_kill.c       | 22 ++++++++++------------
 3 files changed, 22 insertions(+), 15 deletions(-)

Comments

Naresh Kamboju July 12, 2020, 6:34 p.m. UTC | #1
On Fri, 10 Jul 2020 at 21:28, Yafang Shao <laoar.shao@gmail.com> wrote:
>
> Recently we found an issue on our production environment that when memcg
> oom is triggered the oom killer doesn't chose the process with largest
> resident memory but chose the first scanned process. Note that all
> processes in this memcg have the same oom_score_adj, so the oom killer
> should chose the process with largest resident memory.
>
> Bellow is part of the oom info, which is enough to analyze this issue.
> [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
> [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
> [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
[...]
> [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
> [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
> [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
>
> We can find that the first scanned process 5740 (pause) was killed, but its
> rss is only one page. That is because, when we calculate the oom badness in
> oom_badness(), we always ignore the negtive point and convert all of these
> negtive points to 1. Now as oom_score_adj of all the processes in this
> targeted memcg have the same value -998, the points of these processes are
> all negtive value. As a result, the first scanned process will be killed.
>
> The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a
> a Guaranteed pod, which has higher priority to prevent from being killed by
> system oom.
>
> To fix this issue, we should make the calculation of oom point more
> accurate. We can achieve it by convert the chosen_point from 'unsigned
> long' to 'long'.
>
> [cai@lca.pw: reported a issue in the previous version]
> [mhocko@suse.com: fixed the issue reported by Cai]
> [mhocko@suse.com: add the comment in proc_oom_score()]
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Qian Cai <cai@lca.pw>
>
> ---
> v2 -> v3:
> - fix the type of variable 'point' in oom_evaluate_task()
> - initialize oom_control->chosen_points in select_bad_process() per Michal
> - update the comment in proc_oom_score() per Michal
>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>

Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>

I have noticed kernel panic with v2 patch while running LTP mm test suite.

[ 63.451494] Out of memory and no killable processes...
[ 63.456633] Kernel panic - not syncing: System is deadlocked on memory

Then I have removed the v2 patch and applied this below v3 patch and re-tested.
No regression noticed with v3 patch while running LTP mm on x86_64 and arm.

OTOH,
oom01 test case started with 100 iterations but runltp got killed after the
6th iteration [3]. I think this is expected.

test steps:
          - cd /opt/ltp
          - ./runltp -s oom01 -I 100 || true

[  209.052842] Out of memory: Killed process 519 (runltp)
total-vm:10244kB, anon-rss:904kB, file-rss:4kB, shmem-rss:0kB, UID:0
pgtables:60kB oom_score_adj:0
[  209.066782] oom_reaper: reaped process 519 (runltp), now
anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
/lava-1558245/0/tests/0_prep-tmp-disk/run.sh: line 21:   519 Killed
              ./runltp -s oom01 -I 100

> ---
>  fs/proc/base.c      | 11 ++++++++++-
>  include/linux/oom.h |  4 ++--
>  mm/oom_kill.c       | 22 ++++++++++------------
>  3 files changed, 22 insertions(+), 15 deletions(-)


Reference test jobs,
[1] https://lkft.validation.linaro.org/scheduler/job/1558246#L9189
[2] https://lkft.validation.linaro.org/scheduler/job/1558247#L17213
[3] https://lkft.validation.linaro.org/scheduler/job/1558245#L1407
Yafang Shao July 13, 2020, 1:49 a.m. UTC | #2
On Mon, Jul 13, 2020 at 2:34 AM Naresh Kamboju
<naresh.kamboju@linaro.org> wrote:
>
> On Fri, 10 Jul 2020 at 21:28, Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > Recently we found an issue on our production environment that when memcg
> > oom is triggered the oom killer doesn't chose the process with largest
> > resident memory but chose the first scanned process. Note that all
> > processes in this memcg have the same oom_score_adj, so the oom killer
> > should chose the process with largest resident memory.
> >
> > Bellow is part of the oom info, which is enough to analyze this issue.
> > [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
> > [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
> > [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
> [...]
> > [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
> > [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
> > [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> >
> > We can find that the first scanned process 5740 (pause) was killed, but its
> > rss is only one page. That is because, when we calculate the oom badness in
> > oom_badness(), we always ignore the negtive point and convert all of these
> > negtive points to 1. Now as oom_score_adj of all the processes in this
> > targeted memcg have the same value -998, the points of these processes are
> > all negtive value. As a result, the first scanned process will be killed.
> >
> > The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a
> > a Guaranteed pod, which has higher priority to prevent from being killed by
> > system oom.
> >
> > To fix this issue, we should make the calculation of oom point more
> > accurate. We can achieve it by convert the chosen_point from 'unsigned
> > long' to 'long'.
> >
> > [cai@lca.pw: reported a issue in the previous version]
> > [mhocko@suse.com: fixed the issue reported by Cai]
> > [mhocko@suse.com: add the comment in proc_oom_score()]
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > Acked-by: Michal Hocko <mhocko@suse.com>
> > Cc: David Rientjes <rientjes@google.com>
> > Cc: Qian Cai <cai@lca.pw>
> >
> > ---
> > v2 -> v3:
> > - fix the type of variable 'point' in oom_evaluate_task()
> > - initialize oom_control->chosen_points in select_bad_process() per Michal
> > - update the comment in proc_oom_score() per Michal
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
>
> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
>
> I have noticed kernel panic with v2 patch while running LTP mm test suite.
>
> [ 63.451494] Out of memory and no killable processes...
> [ 63.456633] Kernel panic - not syncing: System is deadlocked on memory
>
> Then I have removed the v2 patch and applied this below v3 patch and re-tested.
> No regression noticed with v3 patch while running LTP mm on x86_64 and arm.
>
> OTOH,
> oom01 test case started with 100 iterations but runltp got killed after the
> 6th iteration [3]. I think this is expected.
>
> test steps:
>           - cd /opt/ltp
>           - ./runltp -s oom01 -I 100 || true
>
> [  209.052842] Out of memory: Killed process 519 (runltp)
> total-vm:10244kB, anon-rss:904kB, file-rss:4kB, shmem-rss:0kB, UID:0
> pgtables:60kB oom_score_adj:0
> [  209.066782] oom_reaper: reaped process 519 (runltp), now
> anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> /lava-1558245/0/tests/0_prep-tmp-disk/run.sh: line 21:   519 Killed
>               ./runltp -s oom01 -I 100
>
> > ---
> >  fs/proc/base.c      | 11 ++++++++++-
> >  include/linux/oom.h |  4 ++--
> >  mm/oom_kill.c       | 22 ++++++++++------------
> >  3 files changed, 22 insertions(+), 15 deletions(-)
>
>
> Reference test jobs,
> [1] https://lkft.validation.linaro.org/scheduler/job/1558246#L9189
> [2] https://lkft.validation.linaro.org/scheduler/job/1558247#L17213
> [3] https://lkft.validation.linaro.org/scheduler/job/1558245#L1407

Thanks for the test, Naresh.
Stephen Rothwell July 13, 2020, 2:39 a.m. UTC | #3
Hi all,

On Fri, 10 Jul 2020 11:57:31 -0400 Yafang Shao <laoar.shao@gmail.com> wrote:
>
> ---
> v2 -> v3:
> - fix the type of variable 'point' in oom_evaluate_task()
> - initialize oom_control->chosen_points in select_bad_process() per Michal
> - update the comment in proc_oom_score() per Michal

I have applied the following incremental patch to linux-next today.

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Mon, 13 Jul 2020 12:34:46 +1000
Subject: [PATCH] mm-oom-make-the-calculation-of-oom-badness-more-accurate-fix

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
---
 drivers/tty/sysrq.c |  1 -
 fs/proc/base.c      | 10 +++++++---
 mm/memcontrol.c     |  1 -
 mm/oom_kill.c       |  5 +++--
 mm/page_alloc.c     |  1 -
 5 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index e83fd46f632c..7c95afa905a0 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -382,7 +382,6 @@ static void moom_callback(struct work_struct *ignored)
 		.memcg = NULL,
 		.gfp_mask = gfp_mask,
 		.order = -1,
-		.chosen_points = LONG_MIN,
 	};
 
 	mutex_lock(&oom_lock);
diff --git a/fs/proc/base.c b/fs/proc/base.c
index bf164064b901..10c2a26c0e20 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -554,10 +554,14 @@ static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
 	long badness;
 
 	badness = oom_badness(task, totalpages);
-	if (badness != LONG_MIN) {
-		/* Let's keep the range of points as [0, 2000]. */
+	/*
+	 * Special case OOM_SCORE_ADJ_MIN for all others scale the
+	 * badness value into [0, 2000] range which we have been
+	 * exporting for a long time so userspace might depend on it.
+	 */
+	if (badness != LONG_MIN)
 		points = (1000 + badness * 1000 / (long)totalpages) * 2 / 3;
-	}
+
 	seq_printf(m, "%lu\n", points);
 
 	return 0;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4ae64726f863..430d4c659669 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1666,7 +1666,6 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		.memcg = memcg,
 		.gfp_mask = gfp_mask,
 		.order = order,
-		.chosen_points = LONG_MIN,
 	};
 	bool ret;
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8fdb8b03f5cd..48e0db54d838 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -306,7 +306,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
 static int oom_evaluate_task(struct task_struct *task, void *arg)
 {
 	struct oom_control *oc = arg;
-	unsigned long points;
+	long points;
 
 	if (oom_unkillable_task(task))
 		goto next;
@@ -361,6 +361,8 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
  */
 static void select_bad_process(struct oom_control *oc)
 {
+	oc->chosen_points = LONG_MIN;
+
 	if (is_memcg_oom(oc))
 		mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
 	else {
@@ -1124,7 +1126,6 @@ void pagefault_out_of_memory(void)
 		.memcg = NULL,
 		.gfp_mask = 0,
 		.order = 0,
-		.chosen_points = LONG_MIN,
 	};
 
 	if (mem_cgroup_oom_synchronize(true))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0832b21a7728..65fc0ed784d2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3915,7 +3915,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		.memcg = NULL,
 		.gfp_mask = gfp_mask,
 		.order = order,
-		.chosen_points = LONG_MIN,
 	};
 	struct page *page;
diff mbox series

Patch

diff --git a/fs/proc/base.c b/fs/proc/base.c
index d86c0af..10c2a26 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -551,8 +551,17 @@  static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
 {
 	unsigned long totalpages = totalram_pages() + total_swap_pages;
 	unsigned long points = 0;
+	long badness;
+
+	badness = oom_badness(task, totalpages);
+	/*
+	 * Special case OOM_SCORE_ADJ_MIN for all others scale the
+	 * badness value into [0, 2000] range which we have been
+	 * exporting for a long time so userspace might depend on it.
+	 */
+	if (badness != LONG_MIN)
+		points = (1000 + badness * 1000 / (long)totalpages) * 2 / 3;
 
-	points = oom_badness(task, totalpages) * 1000 / totalpages;
 	seq_printf(m, "%lu\n", points);
 
 	return 0;
diff --git a/include/linux/oom.h b/include/linux/oom.h
index c696c26..f022f58 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -48,7 +48,7 @@  struct oom_control {
 	/* Used by oom implementation, do not set */
 	unsigned long totalpages;
 	struct task_struct *chosen;
-	unsigned long chosen_points;
+	long chosen_points;
 
 	/* Used to print the constraint info. */
 	enum oom_constraint constraint;
@@ -107,7 +107,7 @@  static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
 
 bool __oom_reap_task_mm(struct mm_struct *mm);
 
-extern unsigned long oom_badness(struct task_struct *p,
+long oom_badness(struct task_struct *p,
 		unsigned long totalpages);
 
 extern bool out_of_memory(struct oom_control *oc);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6e94962..b0ac617 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -196,17 +196,17 @@  static bool is_dump_unreclaim_slabs(void)
  * predictable as possible.  The goal is to return the highest value for the
  * task consuming the most memory to avoid subsequent oom failures.
  */
-unsigned long oom_badness(struct task_struct *p, unsigned long totalpages)
+long oom_badness(struct task_struct *p, unsigned long totalpages)
 {
 	long points;
 	long adj;
 
 	if (oom_unkillable_task(p))
-		return 0;
+		return LONG_MIN;
 
 	p = find_lock_task_mm(p);
 	if (!p)
-		return 0;
+		return LONG_MIN;
 
 	/*
 	 * Do not even consider tasks which are explicitly marked oom
@@ -218,7 +218,7 @@  unsigned long oom_badness(struct task_struct *p, unsigned long totalpages)
 			test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
 			in_vfork(p)) {
 		task_unlock(p);
-		return 0;
+		return LONG_MIN;
 	}
 
 	/*
@@ -233,11 +233,7 @@  unsigned long oom_badness(struct task_struct *p, unsigned long totalpages)
 	adj *= totalpages / 1000;
 	points += adj;
 
-	/*
-	 * Never return 0 for an eligible task regardless of the root bonus and
-	 * oom_score_adj (oom_score_adj can't be OOM_SCORE_ADJ_MIN here).
-	 */
-	return points > 0 ? points : 1;
+	return points;
 }
 
 static const char * const oom_constraint_text[] = {
@@ -310,7 +306,7 @@  static enum oom_constraint constrained_alloc(struct oom_control *oc)
 static int oom_evaluate_task(struct task_struct *task, void *arg)
 {
 	struct oom_control *oc = arg;
-	unsigned long points;
+	long points;
 
 	if (oom_unkillable_task(task))
 		goto next;
@@ -336,12 +332,12 @@  static int oom_evaluate_task(struct task_struct *task, void *arg)
 	 * killed first if it triggers an oom, then select it.
 	 */
 	if (oom_task_origin(task)) {
-		points = ULONG_MAX;
+		points = LONG_MAX;
 		goto select;
 	}
 
 	points = oom_badness(task, oc->totalpages);
-	if (!points || points < oc->chosen_points)
+	if (points == LONG_MIN || points < oc->chosen_points)
 		goto next;
 
 select:
@@ -365,6 +361,8 @@  static int oom_evaluate_task(struct task_struct *task, void *arg)
  */
 static void select_bad_process(struct oom_control *oc)
 {
+	oc->chosen_points = LONG_MIN;
+
 	if (is_memcg_oom(oc))
 		mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
 	else {