Message ID: 20190903144512.9374-1-mhocko@kernel.org (mailing list archive)
Series: [RFC] mm, oom: disable dump_tasks by default
On Tue, 2019-09-03 at 16:45 +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> dump_tasks was introduced quite some time ago by commit fef1bdd68c81
> ("oom: add sysctl to enable task memory dump"). Its primary purpose is
> to help analyse the oom victim selection decision. This was certainly
> useful at times when the heuristic to choose a victim was much more
> volatile. Since a63d83f427fb ("oom: badness heuristic rewrite") the
> situation has become much more stable (mostly because the only
> selection criterion is the memory usage) and reports about the wrong
> process being shot down have become effectively non-existent.

Well, I still see the OOM killer sometimes kill the wrong processes,
like ssh or systemd processes, while running LTP OOM tests with
straightforward allocation patterns. I just have not had a chance to
debug them fully. The situation could be worse with more complex
allocations like random stress or fuzz testing.

> dump_tasks can generate a lot of output to the kernel log. It is not
> uncommon that even a relatively small system has hundreds of tasks
> running. Generating a lot of output to the kernel log both makes the
> oom report less convenient to process and also induces a higher load
> on the printk subsystem, which can lead to other problems (e.g. longer
> stalls to flush all the data to consoles).

It only generates output for the victim process where I tested on those
large NUMA machines, and the output is fairly manageable.

> Therefore change the default of oom_dump_tasks to not print the task
> list by default. The sysctl remains in place for anybody who might
> need this additional information. The oom report still provides
> information about the allocation context and the state of the MM
> subsystem, which should be sufficient to analyse most oom situations.
>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/oom_kill.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index eda2e2a0bdc6..d0353705c6e6 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -52,7 +52,7 @@
>
>  int sysctl_panic_on_oom;
>  int sysctl_oom_kill_allocating_task;
> -int sysctl_oom_dump_tasks = 1;
> +int sysctl_oom_dump_tasks;
>
>  /*
>   * Serializes oom killer invocations (out_of_memory()) from all contexts to
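For readers who have not looked at it recently, dump_tasks() walks every
process under RCU and issues one printk() per killable task, which is the
cost being weighed here. A lightly abridged sketch of the 5.3-era function
from mm/oom_kill.c (trimmed for brevity, so treat it as illustrative
rather than the exact source):

  static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask)
  {
          struct task_struct *p;
          struct task_struct *task;

          pr_info("Tasks state (memory values in pages):\n");
          pr_info("[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name\n");
          rcu_read_lock();
          for_each_process(p) {
                  if (oom_unkillable_task(p, memcg, nodemask))
                          continue;

                  task = find_lock_task_mm(p);
                  if (!task)
                          continue;       /* kthread, or mm already detached */

                  /* one line per task: this is the output volume at issue */
                  pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu         %5hd %s\n",
                          task->pid, from_kuid(&init_user_ns, task_uid(task)),
                          task->tgid, task->mm->total_vm, get_mm_rss(task->mm),
                          mm_pgtables_bytes(task->mm),
                          get_mm_counter(task->mm, MM_SWAPENTS),
                          task->signal->oom_score_adj, task->comm);
                  task_unlock(task);
          }
          rcu_read_unlock();
  }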
On Tue 03-09-19 11:02:46, Qian Cai wrote:
> On Tue, 2019-09-03 at 16:45 +0200, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > [...]
> > the situation has become much more stable (mostly because the only
> > selection criterion is the memory usage) and reports about the wrong
> > process being shot down have become effectively non-existent.
>
> Well, I still see the OOM killer sometimes kill the wrong processes,
> like ssh or systemd processes, while running LTP OOM tests with
> straightforward allocation patterns.

Please report those. Most cases I have seen so far just turned out to
work as expected and memory hogs just used oom_score_adj or similar.

> I just have not had a chance to debug them fully. The situation could
> be worse with more complex allocations like random stress or fuzz
> testing.

Nothing really prevents enabling the sysctl when doing OOM oriented
testing.

> > dump_tasks can generate a lot of output to the kernel log. [...]
>
> It only generates output for the victim process where I tested on those
> large NUMA machines, and the output is fairly manageable.

The main question here is whether that information is useful by
_default_ because it is certainly not free. It takes both time to crawl
all processes and cpu cycles to get that information to the console,
because printk is not free either. So if it is more of "nice to have"
than necessary for oom analysis, then it should be disabled by default
IMHO.
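For anyone who wants to do exactly that: the knob is vm.oom_dump_tasks,
so "echo 1 > /proc/sys/vm/oom_dump_tasks" enables it at run time, and
"vm.oom_dump_tasks = 1" in /etc/sysctl.conf makes it persistent. The
patch only flips the initial value of the variable; the sysctl wiring,
roughly as it appears in the 5.3-era vm_table[] in kernel/sysctl.c, is
untouched:

  {
          .procname       = "oom_dump_tasks",
          .data           = &sysctl_oom_dump_tasks,
          .maxlen         = sizeof(sysctl_oom_dump_tasks),
          .mode           = 0644,
          .proc_handler   = proc_dointvec,
  },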
On Tue, 2019-09-03 at 17:13 +0200, Michal Hocko wrote:
> On Tue 03-09-19 11:02:46, Qian Cai wrote:
> > [...]
> > It only generates output for the victim process where I tested on those
> > large NUMA machines, and the output is fairly manageable.
>
> The main question here is whether that information is useful by
> _default_ because it is certainly not free. It takes both time to crawl
> all processes and cpu cycles to get that information to the console,
> because printk is not free either. So if it is more of "nice to have"
> than necessary for oom analysis, then it should be disabled by default
> IMHO.

It also feels more like a band-aid micro-optimization with the
side-effect of hurting debuggability, as there could be loads of console
output anyway during a kernel OOM event, including failed allocation
warnings. I suppose if you want to change the default behavior, the bar
is high: more data and justification are needed.
On Tue 03-09-19 11:32:58, Qian Cai wrote:
> On Tue, 2019-09-03 at 17:13 +0200, Michal Hocko wrote:
> > [...]
> > The main question here is whether that information is useful by
> > _default_ because it is certainly not free. [...]
>
> It also feels more like a band-aid micro-optimization with the
> side-effect of hurting debuggability, as there could be loads of console
> output anyway during a kernel OOM event, including failed allocation
> warnings. I suppose if you want to change the default behavior, the bar
> is high: more data and justification are needed.

Any specific idea what that justification should be? Because this is not
something that you could measure very easily. It is very subjective and
far from black and white. And I am fully aware of that. Hence RFC.

That is why we should apply some common sense and cost/benefit
evaluation. The cost of the additional output should be quite clear. Now
we can argue about the benefit. I would argue that for an absolute
majority of the oom reports I have seen throughout the past many years,
the task list was the least useful part of the report. Sure, I could go
there and double check that the victim was selected as designed. In a
minority of cases I could use the task list to confirm that the
expected-to-be-victim had OOM_SCORE_ADJ_MIN for some reason and that was
why something else had been selected. Those configuration issues tend to
be reproducible and are easier to debug with the sysctl enabled. All
that being said, the arguments tend to weigh more towards disabling it
by default, IMO.
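For context, the post-a63d83f427fb selection Michal refers to really is
this simple. A condensed sketch of the 5.3-era oom_badness() scoring (the
real function also filters unkillable tasks, MMF_OOM_SKIP victims and the
vfork case; only the arithmetic is kept here):

  long points, adj;

  adj = (long)p->signal->oom_score_adj;
  if (adj == OOM_SCORE_ADJ_MIN)   /* -1000: never select this task */
          return 0;

  /* footprint: resident pages + swap entries + page table pages */
  points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
           mm_pgtables_bytes(p->mm) / PAGE_SIZE;

  /* oom_score_adj is in units of 1/1000 of allowed memory */
  points += adj * (long)(totalpages / 1000);

  return points > 0 ? points : 1;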
On Wed 04-09-19 05:52:43, Tetsuo Handa wrote:
> On 2019/09/03 23:45, Michal Hocko wrote:
> > Its primary purpose is to help analyse the oom victim selection
> > decision.
>
> I disagree, for I use the process list for understanding what / how many
> processes are consuming what kind of memory (without crashing the
> system) for anomaly detection purposes. Although we can't dump memory
> consumed by e.g. file descriptors, disabling dump_tasks() loses that
> clue, and that is problematic for me.

Does anything really prevent you from enabling this by sysctl though? Or
do you claim that this is a general usage pattern and therefore the
default change is not acceptable, or do you want the changelog to be
updated?
On Wed, 4 Sep 2019, Michal Hocko wrote:
> > > Its primary purpose is to help analyse the oom victim selection
> > > decision.
> >
> > I disagree, for I use the process list for understanding what / how many
> > processes are consuming what kind of memory (without crashing the
> > system) for anomaly detection purposes. Although we can't dump memory
> > consumed by e.g. file descriptors, disabling dump_tasks() loses that
> > clue, and that is problematic for me.
>
> Does anything really prevent you from enabling this by sysctl though? Or
> do you claim that this is a general usage pattern and therefore the
> default change is not acceptable, or do you want the changelog to be
> updated?

I think the motivation is that users don't want to need to reproduce an
oom kill to figure out why: they want to be able to figure out which
process had higher than normal memory usage. If oom is the normal case,
then they certainly have the ability to disable it by disabling the
sysctl, but that seems like something better to opt out of rather than
need to opt in to and reproduce.
On Thu 05-09-19 22:39:47, Tetsuo Handa wrote:
[...]
> There is nothing that prevents users from enabling oom_dump_tasks by
> sysctl. But that requires a solution for the OOM stalling problem.

You can hardly remove stalling if you are not reducing the amount of
output or getting it into a different context. Whether the latter is
reasonable is another question, but you are essentially losing the
"at the OOM event" state.

> Since I know how difficult it is to avoid problems caused by printk()
> flooding, I insist that we need the "mm,oom: Defer dump_tasks()
> output." patch.

Insisting is not a way to cooperate.
On Tue, 2019-09-03 at 17:13 +0200, Michal Hocko wrote:
> On Tue 03-09-19 11:02:46, Qian Cai wrote:
> > Well, I still see the OOM killer sometimes kill the wrong processes,
> > like ssh or systemd processes, while running LTP OOM tests with
> > straightforward allocation patterns.
>
> Please report those. Most cases I have seen so far just turned out to
> work as expected and memory hogs just used oom_score_adj or similar.

Here is one where oom01 should have been the one to be killed.

[92598.855697][ T2588] Swap cache stats: add 105240923, delete 105250445, find 42196/101577
[92598.893970][ T2588] Free swap = 16383612kB
[92598.913482][ T2588] Total swap = 16465916kB
[92598.932938][ T2588] 7275091 pages RAM
[92598.950212][ T2588] 0 pages HighMem/MovableOnly
[92598.971539][ T2588] 1315554 pages reserved
[92598.990698][ T2588] 16384 pages cma reserved
[92599.010760][ T2588] Tasks state (memory values in pages):
[92599.036265][ T2588] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[92599.080129][ T2588] [   1662]     0  1662    29511     1034   290816      244             0 systemd-journal
[92599.126163][ T2588] [   2586]   998  2586   508086        0   368640     1838             0 polkitd
[92599.168706][ T2588] [   2587]     0  2587    52786        0   421888      500             0 sssd
[92599.210082][ T2588] [   2588]     0  2588    31223        0   139264      195             0 irqbalance
[92599.255606][ T2588] [   2589]    81  2589    18381        0   167936      217          -900 dbus-daemon
[92599.303678][ T2588] [   2590]     0  2590    97260      193   372736      573             0 NetworkManager
[92599.348957][ T2588] [   2594]     0  2594    95350        1   229376      758             0 rngd
[92599.390216][ T2588] [   2598]   995  2598     7364        0    94208      103             0 chronyd
[92599.432447][ T2588] [   2629]     0  2629   106234      399   442368     3836             0 tuned
[92599.473950][ T2588] [   2638]     0  2638    23604        0   212992      240         -1000 sshd
[92599.515158][ T2588] [   2642]     0  2642    10392        0   102400      138             0 rhsmcertd
[92599.560435][ T2588] [   2691]     0  2691    21877        0   208896      277             0 systemd-logind
[92599.605035][ T2588] [   2700]     0  2700     3916        0    69632       45             0 agetty
[92599.646750][ T2588] [   2705]     0  2705    23370        0   225280      393             0 systemd
[92599.688063][ T2588] [   2730]     0  2730    37063        0   294912      667             0 (sd-pam)
[92599.729028][ T2588] [   2922]     0  2922     9020        0    98304      232             0 crond
[92599.769130][ T2588] [   3036]     0  3036    37797        1   307200      305             0 sshd
[92599.813768][ T2588] [   3057]     0  3057    37797        0   303104      335             0 sshd
[92599.853450][ T2588] [   3065]     0  3065     6343        1    86016      163             0 bash
[92599.892899][ T2588] [  38249]     0 38249    58330      293   221184      246             0 rsyslogd
[92599.934457][ T2588] [  11329]     0 11329    55131       73   454656      396             0 sssd_nss
[92599.976240][ T2588] [  11331]     0 11331    54424        1   434176      610             0 sssd_be
[92600.017106][ T2588] [  25247]     0 25247    25746        1   212992      300         -1000 systemd-udevd
[92600.060539][ T2588] [  25391]     0 25391     2184        0    65536       32             0 oom01
[92600.100648][ T2588] [  25392]     0 25392     2184        0    65536       39             0 oom01
[92600.143516][ T2588] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/system.slice/tuned.service,task=tuned,pid=2629,uid=0
[92600.213724][ T2588] Out of memory: Killed process 2629 (tuned) total-vm:424936kB, anon-rss:328kB, file-rss:1268kB, shmem-rss:0kB, UID:0 pgtables:442368kB oom_score_adj:0
[92600.297832][  T305] oom_reaper: reaped process 2629 (tuned), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

> > I just have not had a chance to debug them fully. The situation could
> > be worse with more complex allocations like random stress or fuzz
> > testing.
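Plugging the task table above into the scoring sketched earlier (points ~
rss + swapents + pgtables_bytes/4096, everything in pages) suggests the
victim selection was arguably working as designed by the time this report
was printed:

  tuned   (2629):    399 + 3836 + 442368/4096  ~ 4343   <- highest score
  polkitd (2586):      0 + 1838 + 368640/4096  ~ 1928
  oom01   (25391):     0 +   32 +  65536/4096  ~   48
  sshd    (2638):    oom_score_adj = -1000, never selected

The oom01 instances still listed have almost no footprint, presumably
because the real memory hog had already been killed and reaped by this
point, so tuned genuinely was the largest remaining consumer.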
On Fri, 2019-09-06 at 05:59 +0900, Tetsuo Handa wrote:
> On 2019/09/06 1:10, Qian Cai wrote:
> > On Tue, 2019-09-03 at 17:13 +0200, Michal Hocko wrote:
> > > On Tue 03-09-19 11:02:46, Qian Cai wrote:
> > > > Well, I still see the OOM killer sometimes kill the wrong
> > > > processes, like ssh or systemd processes, while running LTP OOM
> > > > tests with straightforward allocation patterns.
> > >
> > > Please report those. Most cases I have seen so far just turned out to
> > > work as expected and memory hogs just used oom_score_adj or similar.
> >
> > Here is one where oom01 should have been the one to be killed.
>
> I assume that there are previous OOM killer events before the
>
> > [92598.855697][ T2588] Swap cache stats: add 105240923, delete 105250445, find 42196/101577
>
> line. Please be sure to include them.

Sep 5 12:00:52 hp-xl420gen9-01 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Sep 5 12:00:52 hp-xl420gen9-01 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Sep 5 12:00:52 hp-xl420gen9-01 kernel: 104 total pagecache pages
Sep 5 12:00:52 hp-xl420gen9-01 kernel: 72 pages in swap cache
Sep 5 12:00:52 hp-xl420gen9-01 kernel: Swap cache stats: add 105228915, delete 105238918, find 41766/100491
Sep 5 12:00:52 hp-xl420gen9-01 kernel: Free swap = 16382644kB
Sep 5 12:00:52 hp-xl420gen9-01 kernel: Total swap = 16465916kB
Sep 5 12:00:52 hp-xl420gen9-01 kernel: 7275091 pages RAM
Sep 5 12:00:52 hp-xl420gen9-01 kernel: 0 pages HighMem/MovableOnly
Sep 5 12:00:52 hp-xl420gen9-01 kernel: 1315554 pages reserved
Sep 5 12:00:52 hp-xl420gen9-01 kernel: 16384 pages cma reserved
Sep 5 12:00:52 hp-xl420gen9-01 kernel: Tasks state (memory values in pages):
Sep 5 12:00:52 hp-xl420gen9-01 kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Sep 5 12:00:52 hp-xl420gen9-01 kernel: [   1662]     0  1662    29511        0   290816      296             0 systemd-journal
Sep 5 12:00:52 hp-xl420gen9-01 kernel: [   2586]   998  2586   508086        0   368640     1838             0 polkitd
Sep 5 12:00:52 hp-xl420gen9-01 kernel: [   2587]     0  2587    52786        0   421888      500             0 sssd
Sep 5 12:00:52 hp-xl420gen9-01 kernel: [   2588]     0  2588    31223        0   139264      207             0 irqbalance
Sep 5 12:00:52 hp-xl420gen9-01 kernel: [   2589]    81  2589    18381        0   167936      217          -900 dbus-daemon
Sep 5 12:00:52 hp-xl420gen9-01 kernel: [   2590]     0  2590    97260        0   372736      621             0 NetworkManager
Sep 5 12:00:52 hp-xl420gen9-01 kernel: [   2594]     0  2594    95350        1   229376      758             0 rngd
Sep 5 12:00:52 hp-xl420gen9-01 kernel: [   2598]   995  2598     7364        0    94208      102             0 chronyd
Sep 5 12:00:52 hp-xl420gen9-01 kernel: [   2629]     0  2629   106234        0   442368     3959             0 tuned
Sep 5 12:00:52 hp-xl420gen9-01 kernel: [   2638]     0  2638    23604        0   212992      240         -1000 sshd
Sep 5 12:00:52 hp-xl420gen9-01 kernel: [   2642]     0  2642    10392        0   102400      138             0 rhsmcertd
Sep 5 12:00:52 hp-xl420gen9-01 kernel: [   2691]     0  2691    21877        0   208896      277             0 systemd-logind
Sep 5 12:00:52 hp-xl420gen9-01 kernel: [   2700]     0  2700     3916        0    69632       45             0 agetty
Sep 5 12:00:52 hp-xl420gen9-01 kernel: [   2705]     0  2705    23370        0   225280      393             0 systemd
Sep 5 12:00:52 hp-xl420gen9-01 kernel: [   2730]     0  2730    37063        0   294912      667             0 (sd-pam)
Sep 5 12:00:52 hp-xl420gen9-01 kernel: [   2922]     0  2922     9020        0    98304      232             0 crond
Sep 5 12:00:52 hp-xl420gen9-01 kernel: [   3036]     0  3036    37797        1   307200      305             0 sshd
Sep 5 12:00:52 hp-xl420gen9-01 kernel: [   3057]     0  3057    37797        0   303104      335             0 sshd
Sep 5 12:00:52 hp-xl420gen9-01 kernel: [   3065]     0  3065     6343        1    86016      163             0 bash
Sep 5 12:00:52 hp-xl420gen9-01 kernel: [  38249]     0 38249    58330        0   221184      293             0 rsyslogd
Sep 5 12:00:52 hp-xl420gen9-01 kernel: [  11329]     0 11329    55131        0   454656      427             0 sssd_nss
Sep 5 12:00:52 hp-xl420gen9-01 kernel: [  11331]     0 11331    54424        0   434176      637             0 sssd_be
Sep 5 12:00:52 hp-xl420gen9-01 kernel: [  25247]     0 25247    25746        1   212992      300         -1000 systemd-udevd
Sep 5 12:00:52 hp-xl420gen9-01 kernel: [  25391]     0 25391     2184        0    65536       32             0 oom01
Sep 5 12:00:52 hp-xl420gen9-01 kernel: [  25392]     0 25392     2184        0    65536       39             0 oom01
Sep 5 12:00:52 hp-xl420gen9-01 kernel: [  25507]     0 25507  1581195  1411594 11395072       48             0 oom01
Sep 5 12:00:52 hp-xl420gen9-01 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/user.slice,task=oom01,pid=25507,uid=0
Sep 5 12:00:52 hp-xl420gen9-01 kernel: Out of memory: Killed process 25507 (oom01) total-vm:6324780kB, anon-rss:5647168kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:11395072kB oom_score_adj:0
Sep 5 12:00:52 hp-xl420gen9-01 kernel: oom_reaper: reaped process 25507 (oom01), now anon-rss:5647452kB, file-rss:0kB, shmem-rss:0kB
Sep 5 12:00:52 hp-xl420gen9-01 kernel: irqbalance invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Sep 5 12:00:52 hp-xl420gen9-01 kernel: CPU: 40 PID: 2588 Comm: irqbalance Not tainted 5.3.0-rc7-next-20190904+ #5
Sep 5 12:00:52 hp-xl420gen9-01 kernel: Hardware name: HP ProLiant XL420 Gen9/ProLiant XL420 Gen9, BIOS U19 12/27/2015
Sep 5 12:00:52 hp-xl420gen9-01 kernel: Call Trace:
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  dump_stack+0x62/0x9a
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  dump_header+0xf4/0x610
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  ? ___ratelimit+0x4a/0x1a0
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  ? find_lock_task_mm+0x110/0x110
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  ? ___ratelimit+0xfa/0x1a0
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  oom_kill_process+0x136/0x1b0
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  out_of_memory+0x1fb/0x960
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  ? oom_killer_disable+0x230/0x230
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  ? mutex_trylock+0x17d/0x1a0
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  __alloc_pages_nodemask+0x1475/0x1bb0
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  ? gfp_pfmemalloc_allowed+0xc0/0xc0
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  ? __swp_swapcount+0xbf/0x160
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  ? rwlock_bug.part.0+0x60/0x60
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  ? __swp_swapcount+0x14a/0x160
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  ? __swp_swapcount+0x109/0x160
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  alloc_pages_vma+0x18f/0x200
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  __read_swap_cache_async+0x3ba/0x790
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  ? lookup_swap_cache+0x3c0/0x3c0
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  read_swap_cache_async+0x69/0xd0
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  ? __read_swap_cache_async+0x790/0x790
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  ? si_swapinfo+0xc0/0x150
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  swap_cluster_readahead+0x2a4/0x640
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  ? read_swap_cache_async+0xd0/0xd0
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  ? __kasan_check_read+0x11/0x20
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  ? check_chain_key+0x1df/0x2e0
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  ? lookup_swap_cache+0xd3/0x3c0
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  swapin_readahead+0xb9/0x83a
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  ? exit_swap_address_space+0x160/0x160
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  ? lookup_swap_cache+0x124/0x3c0
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  ? swp_swap_info+0x8e/0xe0
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  ? swapcache_prepare+0x20/0x20
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  do_swap_page+0x64e/0x1410
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  ? lock_downgrade+0x390/0x390
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  ? unmap_mapping_range+0x30/0x30
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  __handle_mm_fault+0xe0c/0x1a50
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  ? vmf_insert_mixed_mkwrite+0x20/0x20
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  ? __kasan_check_read+0x11/0x20
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  ? __count_memcg_events+0x56/0x1d0
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  handle_mm_fault+0x17f/0x37e
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  __do_page_fault+0x369/0x630
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  do_page_fault+0x50/0x2d3
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  page_fault+0x2f/0x40
Sep 5 12:00:52 hp-xl420gen9-01 kernel: RIP: 0033:0x7f89fb1a4b14
Sep 5 12:00:52 hp-xl420gen9-01 kernel: Code: Bad RIP value.
Sep 5 12:00:52 hp-xl420gen9-01 kernel: RSP: 002b:00007ffcf7a88148 EFLAGS: 00010206
Sep 5 12:00:52 hp-xl420gen9-01 kernel: RAX: 0000000000000000 RBX: 0000556844dd5db0 RCX: 00007f89fa4bf211
Sep 5 12:00:52 hp-xl420gen9-01 kernel: RDX: 00000000000026f3 RSI: 0000000000000002 RDI: 0000000000000000
Sep 5 12:00:52 hp-xl420gen9-01 kernel: RBP: 0000000000000002 R08: 0000000000000000 R09: 0000556844dc37e0
Sep 5 12:00:52 hp-xl420gen9-01 kernel: R10: 0000556844dc4f20 R11: 0000000000000000 R12: 0000556844d9c720
Sep 5 12:00:52 hp-xl420gen9-01 kernel: R13: 0000000000000000 R14: 00000000000026f3 R15: 0000000000000002
Sep 5 12:00:52 hp-xl420gen9-01 kernel: Mem-Info:
Sep 5 12:00:52 hp-xl420gen9-01 kernel: active_anon:112189 inactive_anon:307421 isolated_anon:0
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  active_file:0 inactive_file:1797 isolated_file:0
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  unevictable:984241 dirty:0 writeback:511 unstable:0
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  slab_reclaimable:14317 slab_unreclaimable:498931
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  mapped:27 shmem:0 pagetables:4114 bounce:0
Sep 5 12:00:52 hp-xl420gen9-01 kernel:  free:36874 free_pcp:549 free_cma:0
Sep 5 12:00:52 hp-xl420gen9-01 kernel: Node 0 active_anon:2328kB inactive_anon:434856kB active_file:0kB inactive_file:20kB unevictable:414264kB isolated(anon):0kB isolated(file):0kB mapped:20kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
Sep 5 12:00:52 hp-xl420gen9-01 kernel: Node 1 active_anon:448804kB inactive_anon:1107116kB active_file:40kB inactive_file:8440kB unevictable:3200440kB isolated(anon):0kB isolated(file):0kB mapped:2076kB dirty:0kB writeback:2304kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2353152kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Sep 5 12:00:52 hp-xl420gen9-01 kernel: Node 0 DMA free:15864kB min:12kB low:24kB high:36kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15948kB managed:15864kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Sep 5 12:00:52 hp-xl420gen9-01 kernel: lowmem_reserve[]: 0 1273 7711 7711 7711
Sep 5 12:00:52 hp-xl420gen9-01 kernel: Node 0 DMA32 free:26876kB min:1064kB low:2368kB high:3672kB active_anon:0kB inactive_anon:246988kB active_file:4kB inactive_file:4kB unevictable:172kB writepending:0kB present:1821440kB managed:1304560kB mlocked:184kB kernel_stack:288kB pagetables:480kB bounce:0kB free_pcp:28kB local_pcp:0kB free_cma:0kB
Sep 5 12:00:52 hp-xl420gen9-01 kernel: lowmem_reserve[]: 0 0 6437 6437 6437
Sep 5 12:00:52 hp-xl420gen9-01 kernel: Node 0 Normal free:5792kB min:5392kB low:11980kB high:18568kB active_anon:2572kB inactive_anon:601564kB active_file:0kB inactive_file:4kB unevictable:244kB writepending:0kB present:10485760kB managed:6591660kB mlocked:4kB kernel_stack:7616kB pagetables:3304kB bounce:0kB free_pcp:292kB local_pcp:0kB free_cma:0kB
Sep 5 12:00:52 hp-xl420gen9-01 kernel: lowmem_reserve[]: 0 0 0 0 0
Sep 5 12:00:52 hp-xl420gen9-01 kernel: Node 1 Normal free:94492kB min:13024kB low:28948kB high:44872kB active_anon:450828kB inactive_anon:4304996kB active_file:52kB inactive_file:10544kB unevictable:268kB writepending:12kB present:16777216kB managed:15926064kB mlocked:60kB kernel_stack:7744kB pagetables:12672kB bounce:0kB free_pcp:6376kB local_pcp:0kB free_cma:0kB
Sep 5 12:00:52 hp-xl420gen9-01 kernel: lowmem_reserve[]: 0 0 0 0 0
Sep 5 12:00:52 hp-xl420gen9-01 kernel: Node 0 DMA: 0*4kB 1*8kB (U) 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15864kB
Sep 5 12:00:52 hp-xl420gen9-01 kernel: Node 0 DMA32: 13*4kB (U) 17*8kB (U) 584*16kB (UMEH) 202*32kB (UMEH) 96*64kB (UMEH) 21*128kB (UMEH) 4*256kB (UM) 0*512kB 1*1024kB (H) 0*2048kB 0*4096kB = 26876kB
Sep 5 12:00:52 hp-xl420gen9-01 kernel: Node 0 Normal: 0*4kB 0*8kB 18*16kB (UME) 176*32kB (UM) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 5920kB
Sep 5 12:00:52 hp-xl420gen9-01 kernel: Node 1 Normal: 0*4kB 0*8kB 10*16kB (UMEH) 10*32kB (MH) 7*64kB (UMEH) 600*128kB (M) 2082*256kB (UM) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 610720kB
Sep 5 12:00:52 hp-xl420gen9-01 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Sep 5 12:00:52 hp-xl420gen9-01 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Sep 5 12:00:52 hp-xl420gen9-01 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Sep 5 12:00:52 hp-xl420gen9-01 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Sep 5 12:00:52 hp-xl420gen9-01 kernel: 3936 total pagecache pages
Sep 5 12:00:52 hp-xl420gen9-01 kernel: 541 pages in swap cache
Sep 5 12:00:52 hp-xl420gen9-01 kernel: Swap cache stats: add 105240923, delete 105250445, find 42196/101577
Sep 5 12:00:52 hp-xl420gen9-01 kernel: Free swap = 16383612kB
Sep 5 12:00:52 hp-xl420gen9-01 kernel: Total swap = 16465916kB
Sep 5 12:00:52 hp-xl420gen9-01 kernel: 7275091 pages RAM
Sep 5 12:00:52 hp-xl420gen9-01 kernel: 0 pages HighMem/MovableOnly
Sep 5 12:00:52 hp-xl420gen9-01 kernel: 1315554 pages reserved
Sep 5 12:00:52 hp-xl420gen9-01 kernel: 16384 pages cma reserved
Sep 5 12:00:52 hp-xl420gen9-01 kernel: Tasks state (memory values in pages):
Sep 5 12:00:52 hp-xl420gen9-01 kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Sep 5 12:00:52 hp-xl420gen9-01 kernel: [   1662]     0  1662    29511     1034   290816      244             0 systemd-journal
Sep 5 12:00:52 hp-xl420gen9-01 kernel: [   2586]   998  2586   508086        0   368640     1838             0 polkitd
Sep 5 12:00:52 hp-xl420gen9-01 kernel: [   2587]     0  2587    52786        0   421888      500             0 sssd
Sep 5 12:00:52 hp-xl420gen9-01 kernel: [   2588]     0  2588    31223        0   139264      195             0 irqbalance
Sep 5 12:00:53 hp-xl420gen9-01 kernel: [   2589]    81  2589    18381        0   167936      217          -900 dbus-daemon
Sep 5 12:00:53 hp-xl420gen9-01 kernel: [   2590]     0  2590    97260      193   372736      573             0 NetworkManager
Sep 5 12:00:53 hp-xl420gen9-01 kernel: [   2594]     0  2594    95350        1   229376      758             0 rngd
Sep 5 12:00:53 hp-xl420gen9-01 kernel: [   2598]   995  2598     7364        0    94208      103             0 chronyd
Sep 5 12:00:53 hp-xl420gen9-01 kernel: [   2629]     0  2629   106234      399   442368     3836             0 tuned
Sep 5 12:00:53 hp-xl420gen9-01 kernel: [   2638]     0  2638    23604        0   212992      240         -1000 sshd
Sep 5 12:00:53 hp-xl420gen9-01 kernel: [   2642]     0  2642    10392        0   102400      138             0 rhsmcertd
Sep 5 12:00:53 hp-xl420gen9-01 kernel: [   2691]     0  2691    21877        0   208896      277             0 systemd-logind
Sep 5 12:00:53 hp-xl420gen9-01 kernel: [   2700]     0  2700     3916        0    69632       45             0 agetty
Sep 5 12:00:53 hp-xl420gen9-01 kernel: [   2705]     0  2705    23370        0   225280      393             0 systemd
Sep 5 12:00:53 hp-xl420gen9-01 kernel: [   2730]     0  2730    37063        0   294912      667             0 (sd-pam)
Sep 5 12:00:53 hp-xl420gen9-01 kernel: [   2922]     0  2922     9020        0    98304      232             0 crond
Sep 5 12:00:53 hp-xl420gen9-01 kernel: [   3036]     0  3036    37797        1   307200      305             0 sshd
Sep 5 12:00:53 hp-xl420gen9-01 kernel: [   3057]     0  3057    37797        0   303104      335             0 sshd
Sep 5 12:00:53 hp-xl420gen9-01 kernel: [   3065]     0  3065     6343        1    86016      163             0 bash
Sep 5 12:00:53 hp-xl420gen9-01 kernel: [  38249]     0 38249    58330      293   221184      246             0 rsyslogd
Sep 5 12:00:53 hp-xl420gen9-01 kernel: [  11329]     0 11329    55131       73   454656      396             0 sssd_nss
Sep 5 12:00:53 hp-xl420gen9-01 kernel: [  11331]     0 11331    54424        1   434176      610             0 sssd_be
Sep 5 12:00:53 hp-xl420gen9-01 kernel: [  25247]     0 25247    25746        1   212992      300         -1000 systemd-udevd
Sep 5 12:00:53 hp-xl420gen9-01 kernel: [  25391]     0 25391     2184        0    65536       32             0 oom01
Sep 5 12:00:53 hp-xl420gen9-01 kernel: [  25392]     0 25392     2184        0    65536       39             0 oom01
Sep 5 12:00:53 hp-xl420gen9-01 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/system.slice/tuned.service,task=tuned,pid=2629,uid=0
Sep 5 12:00:54 hp-xl420gen9-01 kernel: Out of memory: Killed process 2629 (tuned) total-vm:424936kB, anon-rss:328kB, file-rss:1268kB, shmem-rss:0kB, UID:0 pgtables:442368kB oom_score_adj:0
Sep 5 12:00:54 hp-xl420gen9-01 kernel: oom_reaper: reaped process 2629 (tuned), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Sep 5 12:00:54 hp-xl420gen9-01 systemd[1]: tuned.service: Main process exited, code=killed, status=9/KILL
Sep 5 12:00:54 hp-xl420gen9-01 systemd[1]: tuned.service: Failed with result 'signal'.
Sep 5 12:06:32 hp-xl420gen9-01 kernel: oom01 invoked oom-killer: gfp_mask=0x400dc0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), order=0, oom_score_adj=0
Sep 5 12:10:06 hp-xl420gen9-01 kernel: CPU: 35 PID: 25560 Comm: oom01 Not tainted 5.3.0-rc7-next-20190904+ #5
Sep 5 12:10:06 hp-xl420gen9-01 kernel: Hardware name: HP ProLiant XL420 Gen9/ProLiant XL420 Gen9, BIOS U19 12/27/2015

> Any chance it is https://lkml.org/lkml/2017/7/28/317 ?

I am not sure.
On Fri 06-09-19 19:46:10, Tetsuo Handa wrote:
> On 2019/09/05 23:08, Michal Hocko wrote:
> > On Thu 05-09-19 22:39:47, Tetsuo Handa wrote:
> > [...]
> > > There is nothing that prevents users from enabling oom_dump_tasks by
> > > sysctl. But that requires a solution for the OOM stalling problem.
> >
> > You can hardly remove stalling if you are not reducing the amount of
> > output or getting it into a different context. Whether the latter is
> > reasonable is another question, but you are essentially losing the
> > "at the OOM event" state.
>
> I am not losing the "at the OOM event" state. Please find "struct
> oom_task_info" (for now) embedded into "struct task_struct", which
> holds the "at the OOM event" state.
>
> And my patch moves the printk() calls of dump_tasks() from OOM context
> to WQ context.

Workers might be blocked for an unbounded amount of time, and so this
information might be printed late.
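For readers who have not seen Tetsuo's patch, the shape of the idea under
discussion is roughly the following; everything here apart from the
"struct oom_task_info" name is an illustrative reconstruction, not his
actual code:

  /* snapshot at OOM time, print later from workqueue context */
  #define OOM_SNAP_MAX 4096

  static struct oom_task_info {
          pid_t pid;
          unsigned long total_vm, rss, swapents;
          short oom_score_adj;
          char comm[TASK_COMM_LEN];
  } oom_snap[OOM_SNAP_MAX];
  static int oom_snap_count;

  static void oom_snap_print(struct work_struct *work)
  {
          int i;

          /* Michal's objection lands here: if workers are blocked, this
           * output appears arbitrarily late relative to the kill itself. */
          for (i = 0; i < oom_snap_count; i++)
                  pr_info("[%7d] %8lu %8lu %8lu %5hd %s\n",
                          oom_snap[i].pid, oom_snap[i].total_vm,
                          oom_snap[i].rss, oom_snap[i].swapents,
                          oom_snap[i].oom_score_adj, oom_snap[i].comm);
  }
  static DECLARE_WORK(oom_snap_work, oom_snap_print);

  static void dump_tasks_deferred(void)
  {
          struct task_struct *p, *t;

          oom_snap_count = 0;
          rcu_read_lock();
          for_each_process(p) {
                  t = find_lock_task_mm(p);
                  if (!t)
                          continue;
                  if (oom_snap_count < OOM_SNAP_MAX) {
                          struct oom_task_info *s = &oom_snap[oom_snap_count++];

                          s->pid = t->pid;
                          s->total_vm = t->mm->total_vm;
                          s->rss = get_mm_rss(t->mm);
                          s->swapents = get_mm_counter(t->mm, MM_SWAPENTS);
                          s->oom_score_adj = t->signal->oom_score_adj;
                          strscpy(s->comm, t->comm, sizeof(s->comm));
                  }
                  task_unlock(t);
          }
          rcu_read_unlock();
          queue_work(system_unbound_wq, &oom_snap_work);
  }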
On Fri, 2019-09-06 at 19:32 +0900, Tetsuo Handa wrote:
> On 2019/09/06 6:21, Qian Cai wrote:
> > On Fri, 2019-09-06 at 05:59 +0900, Tetsuo Handa wrote:
> > > On 2019/09/06 1:10, Qian Cai wrote:
> > > > Here is one where oom01 should have been the one to be killed.
> > >
> > > I assume that there are previous OOM killer events before the
> > > "Swap cache stats" line. Please be sure to include them.
> >
> > 12:00:52 oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/user.slice,task=oom01,pid=25507,uid=0
> > 12:00:52 Out of memory: Killed process 25507 (oom01) total-vm:6324780kB, anon-rss:5647168kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:11395072kB oom_score_adj:0
> > 12:00:52 oom_reaper: reaped process 25507 (oom01), now anon-rss:5647452kB, file-rss:0kB, shmem-rss:0kB
> > 12:00:52 irqbalance invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
> > (...snipped...)
> > 12:00:53 [  25391]     0 25391     2184        0    65536       32             0 oom01
> > 12:00:53 [  25392]     0 25392     2184        0    65536       39             0 oom01
> > 12:00:53 oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/system.slice/tuned.service,task=tuned,pid=2629,uid=0
> > 12:00:54 Out of memory: Killed process 2629 (tuned) total-vm:424936kB, anon-rss:328kB, file-rss:1268kB, shmem-rss:0kB, UID:0 pgtables:442368kB oom_score_adj:0
> > 12:00:54 oom_reaper: reaped process 2629 (tuned), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
>
> OK. anon-rss did not decrease when the oom_reaper gave up. I think this
> is the same as the https://lkml.org/lkml/2017/7/28/317 case.
>
> The OOM killer does not wait for OOM victims until existing OOM victims
> release their memory by calling exit_mmap(). The OOM killer selects the
> next OOM victim as soon as the OOM reaper sets MMF_OOM_SKIP. As a
> result, when the OOM reaper fails to reclaim memory due to e.g. mlocked
> pages, the OOM killer immediately selects the next OOM victim. But
> since 25391 and 25392 are consuming little memory (maybe these are
> already-reaped OOM victims), neither 25391 nor 25392 was selected as
> the next OOM victim.

Yes, mlocked pages are troublesome. I have other incidents where crond
and systemd-udevd were killed by mistake, and it even tried to kill
kworker/0.

https://cailca.github.io/files/dmesg.txt
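The mechanism Tetsuo describes can be seen in the 5.3-era
oom_evaluate_task() in mm/oom_kill.c (condensed; the surrounding function
is omitted):

  if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
          /*
           * An existing victim blocks further kills only until the
           * reaper gives up and sets MMF_OOM_SKIP; after that the
           * selection loop skips it and scores everyone else again.
           */
          if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
                  goto next;
          goto abort;
  }

So the moment the reaper sets the flag, a reaped-but-still-mlocked victim
no longer holds off the next kill, even though it released little memory.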
On Fri, 2019-09-06 at 22:41 +0900, Tetsuo Handa wrote:
> On 2019/09/06 22:08, Qian Cai wrote:
> > Yes, mlocked pages are troublesome. I have other incidents where crond
> > and systemd-udevd were killed by mistake,
>
> Yes. How to mitigate this regression is a controversial topic. Michal
> thinks that we should make mlocked pages reclaimable, but we haven't
> reached there. I think that we can monitor whether counters decay over
> time (with a timeout), but Michal refuses any timeout-based approach.
> We are deadlocked there.
>
> > and it even tried to kill kworker/0.
>
> No. The OOM killer never tries to kill kernel threads like kworker/0.

I meant this line, i.e. kworker/0 invoking the OOM killer rather than
being killed:

[40040.401575] kworker/0:1 invoked oom-killer: gfp_mask=0x40cc0(GFP_KERNEL|__GFP_COMP), order=2, oom_score_adj=0

> > https://cailca.github.io/files/dmesg.txt
On Fri 06-09-19 22:41:39, Tetsuo Handa wrote:
> On 2019/09/06 22:08, Qian Cai wrote:
> > Yes, mlocked pages are troublesome. I have other incidents where crond
> > and systemd-udevd were killed by mistake,
>
> Yes. How to mitigate this regression is a controversial topic. Michal
> thinks that we should make mlocked pages reclaimable, but we haven't
> reached there. I think that we can monitor whether counters decay over
> time (with a timeout), but Michal refuses any timeout-based approach.
> We are deadlocked there.

There is really nothing controversial here. It just needs somebody to do
the actual work. Besides that, I haven't seen any real workload that
mlocks so much memory that there would not be anything reapable. The LTP
oom test simply doesn't represent any real world workload. So while not
optimal, I do not really think this is something to lose sleep over.
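For completeness, the reason mlock defeats the reaper: __oom_reap_task_mm()
only zaps ranges it can treat like MADV_DONTNEED, and in kernels of that
vintage the gating check (in mm/internal.h, if memory serves) was:

  static inline bool can_madv_dontneed_vma(struct vm_area_struct *vma)
  {
          /* VM_LOCKED (and hugetlb/PFN-mapped) vmas are left alone */
          return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
  }

A victim whose memory is mostly mlocked therefore gets MMF_OOM_SKIP set
with its anon-rss barely reduced, and the killer moves on to the
next-highest scorer.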
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index eda2e2a0bdc6..d0353705c6e6 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -52,7 +52,7 @@
 
 int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
-int sysctl_oom_dump_tasks = 1;
+int sysctl_oom_dump_tasks;
 
 /*
  * Serializes oom killer invocations (out_of_memory()) from all contexts to