Message ID | 20181010151135.25766-1-mhocko@kernel.org (mailing list archive)
---|---
State | New, archived
Series | [RFC] memcg, oom: throttle dump_header for memcg ooms without eligible tasks
Michal Hocko wrote:
> Once we are here, make sure that the reason to trigger the OOM is
> printed without ratelimiting because this is really valuable to
> debug what happened.

Here is my version.

From 0c9ab34fd01837d4c85794042ecb9e922c9eed5a Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Thu, 11 Oct 2018 15:27:42 +0900
Subject: [PATCH] mm: memcontrol: Don't flood OOM messages with no eligible task.

syzbot is hitting RCU stall at shmem_fault() [1]. This is because
memcg-OOM events with no eligible task (the current thread is marked as
OOM-unkillable) kept calling dump_header() from out_of_memory(), enabled
by commit 3100dab2aa09dc6e ("mm: memcontrol: print proper OOM header when
no eligible victim left.").

Let's make sure that the next dump_header() waits for at least 60 seconds
after the previous "Out of memory and no killable processes..." message,
if the current thread already reported it. This change allows the current
thread to complete memory allocation requests for doing recovery
operations, and also allows us to know whether the current thread is
doing a recovery operation.

[1] https://syzkaller.appspot.com/bug?id=4ae3fff7fcf4c33a47c1192d2d62d2e03efffa64

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+77e6b28a7a7106ad0def@syzkaller.appspotmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
---
 include/linux/sched.h |  1 +
 mm/oom_kill.c         | 23 +++++++++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 977cb57..701cf3c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -723,6 +723,7 @@ struct task_struct {
 #endif
 #ifdef CONFIG_MEMCG
 	unsigned			in_user_fault:1;
+	unsigned			memcg_oom_no_eligible_warned:1;
 #ifdef CONFIG_MEMCG_KMEM
 	unsigned			memcg_kmem_skip_account:1;
 #endif
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..f954d99 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1106,6 +1106,21 @@ bool out_of_memory(struct oom_control *oc)
 	select_bad_process(oc);
 	/* Found nothing?!?! */
 	if (!oc->chosen) {
+#ifdef CONFIG_MEMCG
+		/*
+		 * Don't flood with dump_header() if already reported, in case
+		 * current->signal->oom_score_adj == OOM_SCORE_ADJ_MIN. Maybe
+		 * this variable should be per "struct mem_cgroup". But since
+		 * we can't prove that multiple concurrent memcg OOM without
+		 * eligible task won't cause flooding, choose global variable
+		 * for safety.
+		 */
+		static u64 last_warned;
+
+		if (is_memcg_oom(oc) && current->memcg_oom_no_eligible_warned
+		    && time_before64(get_jiffies_64(), last_warned + 60 * HZ))
+			return false;
+#endif
 		dump_header(oc, NULL);
 		pr_warn("Out of memory and no killable processes...\n");
 		/*
@@ -1115,6 +1130,14 @@ bool out_of_memory(struct oom_control *oc)
 		 */
 		if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
 			panic("System is deadlocked on memory\n");
+#ifdef CONFIG_MEMCG
+		if (is_memcg_oom(oc)) {
+			last_warned = get_jiffies_64();
+			current->memcg_oom_no_eligible_warned = 1;
+		}
+	} else if (is_memcg_oom(oc)) {
+		current->memcg_oom_no_eligible_warned = 0;
+#endif
 	}
 	if (oc->chosen && oc->chosen != (void *)-1UL)
 		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
On 2018/10/11 15:37, Tetsuo Handa wrote:
> Michal Hocko wrote:
>> Once we are here, make sure that the reason to trigger the OOM is
>> printed without ratelimiting because this is really valuable to
>> debug what happened.
>
> Here is my version.

Hmm, a per-mem_cgroup flag would be better than a per-task_struct flag.

From 5e80805e4988f23372a3b352fd4185e6bad53f58 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Fri, 12 Oct 2018 16:24:49 +0900
Subject: [PATCH v2] mm: memcontrol: Don't flood OOM messages with no eligible task.

syzbot is hitting RCU stall at shmem_fault() [1]. This is because
memcg-OOM events with no eligible task (the current thread is marked as
OOM-unkillable) kept calling dump_header() from out_of_memory(), enabled
by commit 3100dab2aa09dc6e ("mm: memcontrol: print proper OOM header when
no eligible victim left.").

Let's make sure that the next dump_header() waits for at least 60 seconds
after the previous "Out of memory and no killable processes..." message,
if the requested memcg already reported it. This change allows threads in
the requested memcg to complete memory allocation requests for doing
recovery operations, and also allows us to know whether threads in the
requested memcg are doing recovery operations.

[1] https://syzkaller.appspot.com/bug?id=4ae3fff7fcf4c33a47c1192d2d62d2e03efffa64

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+77e6b28a7a7106ad0def@syzkaller.appspotmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
---
 include/linux/memcontrol.h |  3 +++
 mm/oom_kill.c              | 23 +++++++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 652f602..8da2340 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -230,6 +230,9 @@ struct mem_cgroup {
 	 */
 	bool oom_group;
 
+	/* Previous OOM-kill request failed due to no eligible task? */
+	bool oom_no_eligible_warned;
+
 	/* protected by memcg_oom_lock */
 	bool oom_lock;
 	int under_oom;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..52a8319 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1106,6 +1106,21 @@ bool out_of_memory(struct oom_control *oc)
 	select_bad_process(oc);
 	/* Found nothing?!?! */
 	if (!oc->chosen) {
+#ifdef CONFIG_MEMCG
+		/*
+		 * Don't flood with dump_header() if already reported, in case
+		 * current->signal->oom_score_adj == OOM_SCORE_ADJ_MIN. Maybe
+		 * this variable should be per "struct mem_cgroup". But since
+		 * we can't prove that multiple concurrent memcg OOM without
+		 * eligible task won't cause flooding, choose global variable
+		 * for safety.
+		 */
+		static u64 last_warned;
+
+		if (is_memcg_oom(oc) && oc->memcg->oom_no_eligible_warned &&
+		    time_before64(get_jiffies_64(), last_warned + 60 * HZ))
+			return false;
+#endif
 		dump_header(oc, NULL);
 		pr_warn("Out of memory and no killable processes...\n");
 		/*
@@ -1115,6 +1130,14 @@ bool out_of_memory(struct oom_control *oc)
 		 */
 		if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
 			panic("System is deadlocked on memory\n");
+#ifdef CONFIG_MEMCG
+		if (is_memcg_oom(oc)) {
+			oc->memcg->oom_no_eligible_warned = true;
+			last_warned = get_jiffies_64();
+		}
+	} else if (is_memcg_oom(oc)) {
+		oc->memcg->oom_no_eligible_warned = false;
+#endif
 	}
 	if (oc->chosen && oc->chosen != (void *)-1UL)
 		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
On Wed, Oct 10, 2018 at 05:11:35PM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> syzbot has noticed that it can trigger RCU stalls from the memcg oom
> path:
> RIP: 0010:dump_stack+0x358/0x3ab lib/dump_stack.c:118
> Code: 74 0c 48 c7 c7 f0 f5 31 89 e8 9f 0e 0e fa 48 83 3d 07 15 7d 01 00 0f
> 84 63 fe ff ff e8 1c 89 c9 f9 48 8b bd 70 ff ff ff 57 9d <0f> 1f 44 00 00
> e8 09 89 c9 f9 48 8b 8d 68 ff ff ff b8 ff ff 37 00
> RSP: 0018:ffff88017d3a5c70 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
> RAX: 0000000000040000 RBX: 1ffffffff1263ebe RCX: ffffc90001e5a000
> RDX: 0000000000040000 RSI: ffffffff87b4e0f4 RDI: 0000000000000246
> RBP: ffff88017d3a5d18 R08: ffff8801d7e02480 R09: fffffbfff13da030
> R10: fffffbfff13da030 R11: 0000000000000003 R12: 1ffff1002fa74b96
> R13: 00000000ffffffff R14: 0000000000000200 R15: 0000000000000000
> dump_header+0x27b/0xf72 mm/oom_kill.c:441
> out_of_memory.cold.30+0xf/0x184 mm/oom_kill.c:1109
> mem_cgroup_out_of_memory+0x15e/0x210 mm/memcontrol.c:1386
> mem_cgroup_oom mm/memcontrol.c:1701 [inline]
> try_charge+0xb7c/0x1710 mm/memcontrol.c:2260
> mem_cgroup_try_charge+0x627/0xe20 mm/memcontrol.c:5892
> mem_cgroup_try_charge_delay+0x1d/0xa0 mm/memcontrol.c:5907
> shmem_getpage_gfp+0x186b/0x4840 mm/shmem.c:1784
> shmem_fault+0x25f/0x960 mm/shmem.c:1982
> __do_fault+0x100/0x6b0 mm/memory.c:2996
> do_read_fault mm/memory.c:3408 [inline]
> do_fault mm/memory.c:3531 [inline]
>
> The primary reason of the stall lies in an expensive printk handling
> of oom report flood because a misconfiguration on the syzbot side
> caused that there is simply no eligible task because they have
> OOM_SCORE_ADJ_MIN set. This generates the oom report for each allocation
> from the memcg context.
>
> While normal workloads should be much more careful about potential heavy
> memory consumers that are OOM disabled it makes some sense to rate limit
> a potentially expensive oom reports for cases when there is no eligible
> victim found. Do that by moving the rate limit logic inside dump_header.
> We no longer rely on the caller to do that. It was only oom_kill_process
> which has been throttling. Other two call sites simply didn't have to
> care because one just paniced on the OOM when configured that way and
> no eligible task would panic for the global case as well. Memcg changed
> the picture because we do not panic and we might have multiple sources
> of the same event.
>
> Once we are here, make sure that the reason to trigger the OOM is
> printed without ratelimiting because this is really valuable to
> debug what happened.
>
> Reported-by: syzbot+77e6b28a7a7106ad0def@syzkaller.appspotmail.com
> Cc: guro@fb.com
> Cc: hannes@cmpxchg.org
> Cc: kirill.shutemov@linux.intel.com
> Cc: linux-kernel@vger.kernel.org
> Cc: penguin-kernel@i-love.sakura.ne.jp
> Cc: rientjes@google.com
> Cc: yang.s@alibaba-inc.com
> Signed-off-by: Michal Hocko <mhocko@suse.com>

So not more than 10 dumps in each 5s interval. That looks reasonable
to me. By the time it starts dropping data you have more than enough
information to go on already.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
On Fri 12-10-18 07:20:08, Johannes Weiner wrote:
> On Wed, Oct 10, 2018 at 05:11:35PM +0200, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
[...]
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
>
> So not more than 10 dumps in each 5s interval. That looks reasonable
> to me. By the time it starts dropping data you have more than enough
> information to go on already.

Yeah. Unless we have a storm coming from many different cgroups in
parallel. But even then we have the allocation context for each OOM so
we are not losing everything. Should we ever tune this, it can be done
later with some explicit examples.

> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks! I will post the patch to Andrew early next week.
On 2018/10/12 21:08, Michal Hocko wrote:
>> So not more than 10 dumps in each 5s interval. That looks reasonable
>> to me. By the time it starts dropping data you have more than enough
>> information to go on already.
>
> Yeah. Unless we have a storm coming from many different cgroups in
> parallel. But even then we have the allocation context for each OOM so
> we are not losing everything. Should we ever tune this, it can be done
> later with some explicit examples.
>
>> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>
> Thanks! I will post the patch to Andrew early next week.

How do you handle environments where one dump takes e.g. 3 seconds?
Counting the delay between the first message in the previous dump and the
first message in the next dump is not safe. Unless we count the delay
between the last message in the previous dump and the first message in
the next dump, we cannot guarantee that the system won't lock up due to
printk() flooding.
On Fri, Oct 12, 2018 at 09:10:40PM +0900, Tetsuo Handa wrote:
> On 2018/10/12 21:08, Michal Hocko wrote:
[...]
> How do you handle environments where one dump takes e.g. 3 seconds?
> Counting delay between first message in previous dump and first message
> in next dump is not safe. Unless we count delay between last message
> in previous dump and first message in next dump, we cannot guarantee
> that the system won't lockup due to printk() flooding.

How is that different from any other printk ratelimiting? If a dump
takes 3 seconds you need to fix your console. It doesn't make sense to
design KERN_INFO messages for the slowest serial consoles out there.

That's what we did, btw. We used to patch out the OOM header because
our serial console was so bad, but obviously that's not a generic
upstream solution. We've since changed the loglevel on the serial and
use netconsole[1] for the chattier loglevels.

[1] https://github.com/facebook/fbkutils/tree/master/netconsd
Calling printk() people. ;-)

On 2018/10/12 21:41, Johannes Weiner wrote:
> On Fri, Oct 12, 2018 at 09:10:40PM +0900, Tetsuo Handa wrote:
[...]
> How is that different from any other printk ratelimiting? If a dump
> takes 3 seconds you need to fix your console. It doesn't make sense to
> design KERN_INFO messages for the slowest serial consoles out there.

You can't fix the console. It is a hardware limitation.

> That's what we did, btw. We used to patch out the OOM header because
> our serial console was so bad, but obviously that's not a generic
> upstream solution. We've since changed the loglevel on the serial and
> use netconsole[1] for the chattier loglevels.
>
> [1] https://github.com/facebook/fbkutils/tree/master/netconsd
On 2018/10/12 21:58, Tetsuo Handa wrote:
> On 2018/10/12 21:41, Johannes Weiner wrote:
>> On Fri, Oct 12, 2018 at 09:10:40PM +0900, Tetsuo Handa wrote:
>>> On 2018/10/12 21:08, Michal Hocko wrote:
>>>>> So not more than 10 dumps in each 5s interval. That looks reasonable
>>>>> to me. By the time it starts dropping data you have more than enough
>>>>> information to go on already.

Not reasonable at all.

>>>>
>>>> Yeah. Unless we have a storm coming from many different cgroups in
>>>> parallel. But even then we have the allocation context for each OOM so
>>>> we are not losing everything. Should we ever tune this, it can be done
>>>> later with some explicit examples.
>>>>
>>>>> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>>>>
>>>> Thanks! I will post the patch to Andrew early next week.

One thread from one cgroup is sufficient. I don't think that Michal's
patch is an appropriate mitigation. It still needlessly floods the kernel
log buffer and significantly defers recovery operations.

Nacked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

---------- Testcase ----------

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
	FILE *fp;
	const unsigned long size = 1048576 * 200;
	char *buf = malloc(size);

	mkdir("/sys/fs/cgroup/memory/test1", 0755);
	fp = fopen("/sys/fs/cgroup/memory/test1/memory.limit_in_bytes", "w");
	fprintf(fp, "%lu\n", size / 2);
	fclose(fp);
	fp = fopen("/sys/fs/cgroup/memory/test1/tasks", "w");
	fprintf(fp, "%u\n", getpid());
	fclose(fp);
	fp = fopen("/proc/self/oom_score_adj", "w");
	fprintf(fp, "-1000\n");
	fclose(fp);
	fp = fopen("/dev/zero", "r");
	fread(buf, 1, size, fp);
	fclose(fp);
	return 0;
}

---------- Michal's patch ----------

73133 lines (5.79MB) of kernel messages per one run

[root@ccsecurity ~]# time ./a.out

real	3m44.389s
user	0m0.000s
sys	3m42.334s

[root@ccsecurity ~]# time ./a.out

real	3m41.767s
user	0m0.004s
sys	3m39.779s

---------- My v2 patch ----------

50 lines (3.40 KB) of kernel messages per one run

[root@ccsecurity ~]# time ./a.out

real	0m5.227s
user	0m0.000s
sys	0m4.950s

[root@ccsecurity ~]# time ./a.out

real	0m5.249s
user	0m0.000s
sys	0m4.956s
On Sat, Oct 13, 2018 at 08:09:30PM +0900, Tetsuo Handa wrote:
> ---------- Michal's patch ----------
>
> 73133 lines (5.79MB) of kernel messages per one run
>
> [root@ccsecurity ~]# time ./a.out
>
> real	3m44.389s
> user	0m0.000s
> sys	3m42.334s
>
> [root@ccsecurity ~]# time ./a.out
>
> real	3m41.767s
> user	0m0.004s
> sys	3m39.779s
>
> ---------- My v2 patch ----------
>
> 50 lines (3.40 KB) of kernel messages per one run
>
> [root@ccsecurity ~]# time ./a.out
>
> real	0m5.227s
> user	0m0.000s
> sys	0m4.950s
>
> [root@ccsecurity ~]# time ./a.out
>
> real	0m5.249s
> user	0m0.000s
> sys	0m4.956s

Your patch is suppressing information that I want to have and my
console can handle, just because your console is slow, even though
there is no need to use that console at that log level.

NAK to your patch. I think you're looking at this from the wrong
angle. A console that takes almost 4 minutes to print 70k lines
shouldn't be the baseline for how verbose KERN_INFO is.
On 2018/10/13 20:22, Johannes Weiner wrote:
> On Sat, Oct 13, 2018 at 08:09:30PM +0900, Tetsuo Handa wrote:
[...]
> Your patch is suppressing information that I want to have and my
> console can handle, just because your console is slow, even though
> there is no need to use that console at that log level.

My patch is not suppressing information you want to have.
My patch is mainly suppressing

[ 52.393146] Out of memory and no killable processes...
[ 52.395195] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[ 52.398623] Out of memory and no killable processes...
[ 52.401195] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[ 52.404356] Out of memory and no killable processes...
[ 52.406492] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[ 52.409595] Out of memory and no killable processes...
[ 52.411745] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[ 52.415588] Out of memory and no killable processes...
[ 52.418484] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[ 52.421904] Out of memory and no killable processes...
[ 52.424273] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000

lines which Michal's patch cannot suppress. Also, my console is
console=ttyS0,115200n8. Not slow at all.

> NAK to your patch. I think you're looking at this from the wrong
> angle. A console that takes almost 4 minutes to print 70k lines
> shouldn't be the baseline for how verbose KERN_INFO is.

Run the testcase in your environment.
On Sat 13-10-18 20:28:38, Tetsuo Handa wrote:
> On 2018/10/13 20:22, Johannes Weiner wrote:
[...]
> My patch is not suppressing information you want to have.
> My patch is mainly suppressing
>
> [ 52.393146] Out of memory and no killable processes...
> [ 52.395195] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[...]
> [ 52.424273] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
>
> lines which Michal's patch cannot suppress.

This was a deliberate decision because the allocation failure context is
usually useful information to get. If this is killing a reasonably
configured machine then we can move the ratelimit up and suppress that
information. This will always be a cost vs. benefit decision. And as
such it should be argued in the changelog.

As so many dozens of times before, I will point you to the incremental
nature of changes we really prefer in the mm land. We are also after a
simplicity which your proposal lacks in many aspects. You seem to ignore
that general approach and I have a hard time considering your NAK as
relevant feedback. Going to an extreme and basing a complex solution on
it is not going to fly. No killable process should be a rare event which
requires a seriously misconfigured memcg to happen so wildly. If you can
trigger it with normal user privileges then it would be a clear bug to
address rather than work around with printk throttling.
On 2018/10/15 17:19, Michal Hocko wrote:
> As so many dozens of times before, I will point you to an incremental
> nature of changes we really prefer in the mm land. We are also after a
> simplicity which your proposal lacks in many aspects. You seem to ignore
> that general approach and I have hard time to consider your NAK as a
> relevant feedback. Going to an extreme and basing a complex solution on
> it is not going to fly. No killable process should be a rare event which
> requires a seriously misconfigured memcg to happen so wildly. If you can
> trigger it with a normal user privileges then it would be a clear bug to
> address rather than work around with printk throttling.

I can trigger 200+ times / 900+ lines / 69KB+ of needless OOM messages
with normal user privileges. This is a lot of needless noise/delay.
No killable process is not a rare event, even without root privileges.

[root@ccsecurity kumaneko]# time ./a.out
Killed

real	0m2.396s
user	0m0.000s
sys	0m2.970s
[root@ccsecurity ~]# dmesg | grep 'no killable' | wc -l
202
[root@ccsecurity ~]# dmesg | wc
    942    7335   70716
[root@ccsecurity ~]#

----------------------------------------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sched.h>
#include <sys/mman.h>

#define NUMTHREADS 256
#define MMAPSIZE 4 * 10485760
#define STACKSIZE 4096

static int pipe_fd[2] = { EOF, EOF };

static int memory_eater(void *unused)
{
	int fd = open("/dev/zero", O_RDONLY);
	char *buf = mmap(NULL, MMAPSIZE, PROT_WRITE | PROT_READ,
			 MAP_ANONYMOUS | MAP_SHARED, EOF, 0);

	read(pipe_fd[0], buf, 1);
	read(fd, buf, MMAPSIZE);
	pause();
	return 0;
}

int main(int argc, char *argv[])
{
	int i;
	char *stack;
	FILE *fp;
	const unsigned long size = 1048576 * 200;

	mkdir("/sys/fs/cgroup/memory/test1", 0755);
	fp = fopen("/sys/fs/cgroup/memory/test1/memory.limit_in_bytes", "w");
	fprintf(fp, "%lu\n", size);
	fclose(fp);
	fp = fopen("/sys/fs/cgroup/memory/test1/tasks", "w");
	fprintf(fp, "%u\n", getpid());
	fclose(fp);
	if (setgid(-2) || setuid(-2))
		return 1;
	stack = mmap(NULL, STACKSIZE * NUMTHREADS, PROT_WRITE | PROT_READ,
		     MAP_ANONYMOUS | MAP_SHARED, EOF, 0);
	for (i = 0; i < NUMTHREADS; i++)
		if (clone(memory_eater, stack + (i + 1) * STACKSIZE,
			  CLONE_SIGHAND | CLONE_THREAD | CLONE_VM |
			  CLONE_FS | CLONE_FILES, NULL) == -1)
			break;
	sleep(1);
	close(pipe_fd[1]);
	pause();
	return 0;
}
----------------------------------------
On Mon 15-10-18 19:57:35, Tetsuo Handa wrote:
> On 2018/10/15 17:19, Michal Hocko wrote:
[...]
> I can trigger 200+ times / 900+ lines / 69KB+ of needless OOM messages
> with normal user privileges. This is a lot of needless noise/delay.

I am pretty sure you have understood the part of my message you have
chosen not to quote, where I said that the specific rate limiting
decisions can be changed based on reasonable configurations. There is
absolutely zero reason to NAK a natural decision to unify the throttling
and cook up a per-memcg way for a very specific path instead.

> No killable process is not a rare event, even without root privileges.
>
> [root@ccsecurity kumaneko]# time ./a.out
> Killed
>
> real	0m2.396s
> user	0m0.000s
> sys	0m2.970s
> [root@ccsecurity ~]# dmesg | grep 'no killable' | wc -l
> 202
> [root@ccsecurity ~]# dmesg | wc
>     942    7335   70716

OK, so this is 70kB worth of data pushed through the console. Is this
really killing any machine?
On 2018/10/15 20:24, Michal Hocko wrote:
> On Mon 15-10-18 19:57:35, Tetsuo Handa wrote:
[...]
> OK, so this is 70kB worth of data pushed through the console. Is this
> really killing any machine?

Nobody can prove that it never kills some machine. This is just one
example result of one example stress test tried in my environment. Since
I work on secure programming in the security subsystem, I really hate
your "Can you trigger it?" resistance. Since this is the OOM path, which
nobody tests, being prepared for the worst case from the start keeps
things simple.
On Mon 15-10-18 21:47:08, Tetsuo Handa wrote:
> On 2018/10/15 20:24, Michal Hocko wrote:
> > On Mon 15-10-18 19:57:35, Tetsuo Handa wrote:
> >> On 2018/10/15 17:19, Michal Hocko wrote:
> >>> As so many dozens of times before, I will point you to the incremental
> >>> nature of changes we really prefer in the mm land. We are also after
> >>> simplicity, which your proposal lacks in many aspects. You seem to
> >>> ignore that general approach, and I have a hard time considering your
> >>> NAK as relevant feedback. Going to an extreme and basing a complex
> >>> solution on it is not going to fly. No killable process should be a
> >>> rare event which requires a seriously misconfigured memcg to happen
> >>> so wildly. If you can trigger it with normal user privileges then it
> >>> would be a clear bug to address rather than work around with printk
> >>> throttling.
> >>
> >> I can trigger 200+ times / 900+ lines / 69KB+ of needless OOM messages
> >> with normal user privileges. This is a lot of needless noise/delay.
> >
> > I am pretty sure you have understood the part of my message you have
> > chosen not to quote, where I said that the specific rate-limiting
> > decisions can be changed based on reasonable configurations. There is
> > absolutely zero reason to NAK a natural decision to unify the
> > throttling and cook a per-memcg way for a very specific path instead.
> >
> >> No killable process is not a rare event, even without root privileges.
> >>
> >> [root@ccsecurity kumaneko]# time ./a.out
> >> Killed
> >>
> >> real	0m2.396s
> >> user	0m0.000s
> >> sys	0m2.970s
> >> [root@ccsecurity ~]# dmesg | grep 'no killable' | wc -l
> >> 202
> >> [root@ccsecurity ~]# dmesg | wc
> >> 942 7335 70716
> >
> > OK, so this is 70kB worth of data pushed through the console. Is this
> > really killing any machine?
>
> Nobody can prove that it never kills some machine. This is just one
> example result of one example stress test tried in my environment.
> Since I am a secure-programming person from the security subsystem, I
> really hate your "Can you trigger it?" resistance. Since this is the
> OOM path, where nobody tests, being prepared for the worst case from
> the start keeps things simple.

There is simply no way to be generally safe in this kind of situation.
As soon as your console is so slow that you cannot push the OOM report
through, there is only one option left, and that is to disable the OOM
report altogether. And that might be a viable option. But fiddling with
a per-memcg limit is not going to fly. Just realize what will happen if
you have hundreds of different memcgs triggering this path around the
same time.

So can you start being reasonable and try to look at the wider picture,
finally, please?
On 2018/10/15 22:35, Michal Hocko wrote:
>> Nobody can prove that it never kills some machine. This is just one
>> example result of one example stress test tried in my environment.
>> Since I am a secure-programming person from the security subsystem, I
>> really hate your "Can you trigger it?" resistance. Since this is the
>> OOM path, where nobody tests, being prepared for the worst case from
>> the start keeps things simple.
>
> There is simply no way to be generally safe in this kind of situation.
> As soon as your console is so slow that you cannot push the OOM report
> through, there is only one option left, and that is to disable the OOM
> report altogether. And that might be a viable option.

There is a way to be safe in this kind of situation. The way is to make
sure that printk() is called with a long enough interval. That is, count
the interval between the end of the previous printk() messages and the
beginning of the next printk() messages.

And you are misunderstanding my patch. My patch does not ratelimit the
first occurrence of a memcg OOM in each memcg domain (because the first

  dump_header(oc, NULL);
  pr_warn("Out of memory and no killable processes...\n");

output, which is serialized by the oom_lock mutex, is usually useful
information to have), but my patch cannot cause a lockup because it
ratelimits subsequent outputs in any memcg domain. That is, while my
patch might cause "** %u printk messages dropped **\n" when we have
hundreds of different memcgs triggering this path around the same time,
it refrains from disturbing the administrator's manual recovery
operation at the console by printing the

  "%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
  "Out of memory and no killable processes...\n"

lines on each page fault event from hundreds of different memcgs
triggering this path.

There is no need to print the

  "%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
  "Out of memory and no killable processes...\n"

lines on every page fault event. A kernel which consumes multiple
milliseconds on each page fault event (due to printk() messages from the
dysfunctional OOM killer) is stupid.

> But fiddling with a per-memcg limit is not going to fly. Just realize
> what will happen if you have hundreds of different memcgs triggering
> this path around the same time.

You have just said that "No killable process should be a rare event
which requires a seriously misconfigured memcg to happen so wildly.",
and now you refer to a very bad case: "Just realize what will happen if
you have hundreds of different memcgs triggering this path around the
same time.", which makes your former comment suspicious.

> So can you start being reasonable and try to look at the wider picture,
> finally, please?

Honestly, I can't look at a wider picture because I have never been
shown a picture by you. What we are doing is an endless loop of "let's
do ... because ..." and "hmm, our assumption was wrong because ...";
that is, making changes without first considering the worst case. For
example, the OOM victim behavior David Rientjes is complaining about
shows that our assumption that "__oom_reap_task_mm() can reclaim the
majority of memory" was wrong. (And your proposal to hand over is
getting no response.) For another example, the __set_oom_adj() path
which Yong-Taek Lee is trying to optimize shows that our assumption
that "we already succeeded in enforcing the same oom_score_adj among
multiple thread groups" was wrong. For yet another example,
CVE-2018-1000200 and CVE-2016-10723 were caused by ignoring my concern.
And the funny thing is that you negated the rationale of "mm, oom:
remove sleep from under oom_lock" with "mm, oom: remove oom_lock from
oom_reaper" after only 4 days...

Anyway, I'm OK if we apply _BOTH_ your patch and my patch.
Or I'm OK with the simplified one shown below (because you don't like
the per-memcg limit).
---
 mm/oom_kill.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..9056f9b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1106,6 +1106,11 @@ bool out_of_memory(struct oom_control *oc)
 	select_bad_process(oc);
 	/* Found nothing?!?! */
 	if (!oc->chosen) {
+		static unsigned long last_warned;
+
+		if ((is_sysrq_oom(oc) || is_memcg_oom(oc)) &&
+		    time_in_range(jiffies, last_warned, last_warned + 60 * HZ))
+			return false;
 		dump_header(oc, NULL);
 		pr_warn("Out of memory and no killable processes...\n");
 		/*
@@ -1115,6 +1120,7 @@ bool out_of_memory(struct oom_control *oc)
 		 */
 		if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
 			panic("System is deadlocked on memory\n");
+		last_warned = jiffies;
 	}
 	if (oc->chosen && oc->chosen != (void *)-1UL)
 		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
On Tue 16-10-18 09:55:06, Tetsuo Handa wrote:
> On 2018/10/15 22:35, Michal Hocko wrote:
> >> Nobody can prove that it never kills some machine. This is just one
> >> example result of one example stress test tried in my environment.
> >> Since I am a secure-programming person from the security subsystem,
> >> I really hate your "Can you trigger it?" resistance. Since this is
> >> the OOM path, where nobody tests, being prepared for the worst case
> >> from the start keeps things simple.
> >
> > There is simply no way to be generally safe in this kind of situation.
> > As soon as your console is so slow that you cannot push the OOM report
> > through, there is only one option left, and that is to disable the OOM
> > report altogether. And that might be a viable option.
>
> There is a way to be safe in this kind of situation. The way is to make
> sure that printk() is called with a long enough interval. That is, count
> the interval between the end of the previous printk() messages and the
> beginning of the next printk() messages.

You are simply wrong. Any interval is meaningless without knowing the
printk throughput.

[...]

> lines on every page fault event. A kernel which consumes multiple
> milliseconds on each page fault event (due to printk() messages from
> the dysfunctional OOM killer) is stupid.

Not if it represents an unusual situation where there is no eligible
task available. This is an exceptional case where the cost of the printk
is simply not relevant.

[...]

I am sorry to skip a large part of your message, but this discussion,
like many others, doesn't lead anywhere. You simply refuse to understand
some of the core assumptions in this area.

> Anyway, I'm OK if we apply _BOTH_ your patch and my patch. Or I'm OK
> with the simplified one shown below (because you don't like the
> per-memcg limit).

My patch is adding a rate limit! I really fail to see why we need yet
another one on top of it. This is just ridiculous. I can see reasons to
tune that rate limit, but adding two different mechanisms is just wrong.

If your NAK on unifying the rate limit for dump_header for all paths
still holds, then I do not care too much to push it forward. But I find
this way of giving review feedback counterproductive.
On 2018/10/16 18:20, Michal Hocko wrote:
>> Anyway, I'm OK if we apply _BOTH_ your patch and my patch. Or I'm OK
>> with the simplified one shown below (because you don't like the
>> per-memcg limit).
>
> My patch is adding a rate limit! I really fail to see why we need yet
> another one on top of it. This is just ridiculous. I can see reasons to
> tune that rate limit, but adding two different mechanisms is just wrong.
>
> If your NAK on unifying the rate limit for dump_header for all paths
> still holds, then I do not care too much to push it forward. But I find
> this way of giving review feedback counterproductive.

Your patch is _NOT_ adding a rate limit for the

  "%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
  "Out of memory and no killable processes...\n"

lines!

[   97.519229] Out of memory and no killable processes...
[   97.522060] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.525507] Out of memory and no killable processes...
[   97.528817] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.532345] Out of memory and no killable processes...
[   97.534813] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.538270] Out of memory and no killable processes...
[   97.541449] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.546268] Out of memory and no killable processes...
[   97.548823] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.552399] Out of memory and no killable processes...
[   97.554939] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.558616] Out of memory and no killable processes...
[   97.562257] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.565998] Out of memory and no killable processes...
[   97.568642] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.572169] Out of memory and no killable processes...
[   97.575200] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.579357] Out of memory and no killable processes...
[   97.581912] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.585414] Out of memory and no killable processes...
[   97.589191] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.593586] Out of memory and no killable processes...
[   97.596527] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.600118] Out of memory and no killable processes...
[   97.603237] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.606837] Out of memory and no killable processes...
[   97.611550] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.615244] Out of memory and no killable processes...
[   97.617859] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.621634] Out of memory and no killable processes...
[   97.624884] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.629256] Out of memory and no killable processes...
[   97.631885] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.635367] Out of memory and no killable processes...
[   97.638033] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.641827] Out of memory and no killable processes...
[   97.641993] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.648453] Out of memory and no killable processes...
[   97.651481] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.655082] Out of memory and no killable processes...
[   97.657941] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.663036] Out of memory and no killable processes...
[   97.665890] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.669473] Out of memory and no killable processes...

We don't need to print these lines every few milliseconds. Even though
this is an exceptional case, it is a DoS for console users. Printing
once (or a few times) per minute will be enough. Otherwise, users cannot
see what they are typing and what is printed.
On Tue 16-10-18 20:05:47, Tetsuo Handa wrote:
> On 2018/10/16 18:20, Michal Hocko wrote:
> >> Anyway, I'm OK if we apply _BOTH_ your patch and my patch. Or I'm OK
> >> with the simplified one shown below (because you don't like the
> >> per-memcg limit).
> >
> > My patch is adding a rate limit! I really fail to see why we need yet
> > another one on top of it. This is just ridiculous. I can see reasons
> > to tune that rate limit, but adding two different mechanisms is just
> > wrong.
> >
> > If your NAK on unifying the rate limit for dump_header for all paths
> > still holds, then I do not care too much to push it forward. But I
> > find this way of giving review feedback counterproductive.
>
> Your patch is _NOT_ adding a rate limit for the
>
>   "%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
>   "Out of memory and no killable processes...\n"
>
> lines!

And I've said I do not have objections to having an _incremental_ patch
to move the rate limit up, with a clear cost/benefit evaluation.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa5360616..4ee393c85e27 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -430,6 +430,9 @@ static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask)
 
 static void dump_header(struct oom_control *oc, struct task_struct *p)
 {
+	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
+
 	pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n",
 		current->comm, oc->gfp_mask, &oc->gfp_mask,
 		nodemask_pr_args(oc->nodemask), oc->order,
@@ -437,6 +440,9 @@ static void dump_header(struct oom_control *oc, struct task_struct *p)
 	if (!IS_ENABLED(CONFIG_COMPACTION) && oc->order)
 		pr_warn("COMPACTION is disabled!!!\n");
 
+	if (!__ratelimit(&oom_rs))
+		return;
+
 	cpuset_print_current_mems_allowed();
 	dump_stack();
 	if (is_memcg_oom(oc))
@@ -931,8 +937,6 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	struct task_struct *t;
 	struct mem_cgroup *oom_group;
 	unsigned int victim_points = 0;
-	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
-				      DEFAULT_RATELIMIT_BURST);
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -949,8 +953,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	}
 	task_unlock(p);
 
-	if (__ratelimit(&oom_rs))
-		dump_header(oc, p);
+	dump_header(oc, p);
 
 	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
 		message, task_pid_nr(p), p->comm, points);