Message ID | 20210612204634.1102472-1-atomlin@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2] mm/oom_kill: show oom eligibility when displaying the current memory state of all tasks | expand |
On Sat, 12 Jun 2021, Aaron Tomlin wrote: > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index eefd3f5fde46..094b7b61d66f 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -160,6 +160,27 @@ static inline bool is_sysrq_oom(struct oom_control *oc) > return oc->order == -1; > } > > +/** > + * is_task_eligible_oom - determine if and why a task cannot be OOM killed > + * @tsk: task to check > + * > + * Needs to be called with task_lock(). > + */ > +static const char * is_task_oom_eligible(struct task_struct *p) You should be able to just return a char. > +{ > + long adj; > + > + adj = (long)p->signal->oom_score_adj; > + if (adj == OOM_SCORE_ADJ_MIN) > + return "S"; The value is already printed in the task dump, this doesn't look to add any information. > + else if (test_bit(MMF_OOM_SKIP, &p->mm->flags) > + return "R"; We should be doing the task dump only when we're killing a victim (unless we're panicking), so something else has been chosen. Since we would have oom killed a process with MMF_OOM_SKIP already, can we simply choose to not print a line for this process? > + else if (in_vfork(p)) > + return "V"; Is this a transition state that we can simply disregard from the output as well unless/until it becomes eligible? > + else > + return ""; > +} > + > /* return true if the task is not adequate as candidate victim task. */ > static bool oom_unkillable_task(struct task_struct *p) > { > @@ -401,12 +422,13 @@ static int dump_task(struct task_struct *p, void *arg) > return 0; > } > > - pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu %5hd %s\n", > + pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu %5hd %13s %s\n", 13 characters for one char output? > task->pid, from_kuid(&init_user_ns, task_uid(task)), > task->tgid, task->mm->total_vm, get_mm_rss(task->mm), > mm_pgtables_bytes(task->mm), > get_mm_counter(task->mm, MM_SWAPENTS), > - task->signal->oom_score_adj, task->comm); > + task->signal->oom_score_adj, is_task_oom_eligible(task), > + task->comm); > task_unlock(task); > > return 0; > @@ -420,12 +442,13 @@ static int dump_task(struct task_struct *p, void *arg) > * memcg, not in the same cpuset, or bound to a disjoint set of mempolicy nodes > * are not shown. > * State information includes task's pid, uid, tgid, vm size, rss, > - * pgtables_bytes, swapents, oom_score_adj value, and name. > + * pgtables_bytes, swapents, oom_score_adj value, oom eligibility status > + * and name. > */ > static void dump_tasks(struct oom_control *oc) > { > pr_info("Tasks state (memory values in pages):\n"); > - pr_info("[ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name\n"); > + pr_info("[ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj oom eligible? name\n"); Field names are single words. > > if (is_memcg_oom(oc)) > mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
On Sat 12-06-21 21:46:34, Aaron Tomlin wrote: > Changes since v2: > - Use single character (e.g. 'R' for MMF_OOM_SKIP) as suggested > by Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> > - Add new header to oom_dump_tasks documentation > > > At the present time, when showing potential OOM victims, we do not > exclude tasks which already have MMF_OOM_SKIP set; it is possible that > the last OOM killable victim was already OOM killed, yet the OOM > reaper failed to reclaim memory and set MMF_OOM_SKIP. > This can be confusing/or perhaps even misleading, to the reader of the > OOM report. Now, we already unconditionally display a task's > oom_score_adj_min value that can be set to OOM_SCORE_ADJ_MIN which is > indicative of an "unkillable" task i.e. is not eligible. Well, I have to say that I have a bit hard time understand the problem statement here. First of all you are very likely basing your observation on an old kernel which is missing a fix which should make the situation impossible IIRC. You should be focusing on a justification why the new information is helpful for the current tree. Historically, all tasks eligible for the oom killing have been printed. That includes also tasks which are excluded later in the selection. E.g. OOMS_SCORE_ADJ_MIN which can be tricky indeed. IIRC the primary reason was to have a sufficient amount of information to evaluate whether the system is configured properly (e.g. OOMS_SCORE_ADJ_MIN is not used too extensively). More excluded criterion have been added due to implementation details (e.g.MMF_OOM_SKIP or mm shared with otherwise ineligible task. You are correctly pointing out that those internal states are not exposed but you should focus on explanation why that gap really stands in the way for the current upstream. Who is going to consume that information and for what purpose? > This patch provides a clear indication with regard to the OOM > eligibility of each displayed task. This should provide an example of the output with a clarification of the meaning. [...]
On Mon 2021-06-14 08:44 +0200, Michal Hocko wrote: > Well, I have to say that I have a bit hard time understand the problem > statement here. First of all you are very likely basing your observation > on an old kernel which is missing a fix which should make the situation > impossible IIRC. You should be focusing on a justification why the new > information is helpful for the current tree. Michal, Not exactly. See oom_reap_task(). Let's assume an OOM event occurred within the context of a memcg and 'memory.oom.group' was not set. If I understand correctly, once all attempts to OOM reap the specified task were "unsuccessful" then MMF_OOM_SKIP is applied; and, the assumption is it will be terminated shorty due to the pending fatal signal (see __oom_kill_process()) i.e. a SIGKILL is sent to the "victim" before the OOM reaper is notified. Now assuming the above task did not exited yet, another task, in the same memcg, could also trigger an OOM event. Therefore, when showing potential OOM victims the task above with MMF_OOM_SKIP set, will indeed be displayed. I understanding the point on OOM_SCORE_ADJ_MIN. This can be easily identified and is clear to the viewer. However, the same cannot be stated for MMF_OOM_SKIP. So, if we prefer to display rather than exclude such tasks, in my opinion having a flag/or marker of some description might prove useful to avoid any misunderstanding. > This should provide an example of the output with a clarification of the > meaning. Fair enough. Kind regards,
On Sun 2021-06-13 16:47 -0700, David Rientjes wrote: > On Sat, 12 Jun 2021, Aaron Tomlin wrote: > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > index eefd3f5fde46..094b7b61d66f 100644 > > --- a/mm/oom_kill.c > > +++ b/mm/oom_kill.c > > @@ -160,6 +160,27 @@ static inline bool is_sysrq_oom(struct oom_control *oc) > > return oc->order == -1; > > } > > > > +/** > > + * is_task_eligible_oom - determine if and why a task cannot be OOM killed > > + * @tsk: task to check > > + * > > + * Needs to be called with task_lock(). > > + */ > > +static const char * is_task_oom_eligible(struct task_struct *p) > > You should be able to just return a char. I see, sure. > > +{ > > + long adj; > > + > > + adj = (long)p->signal->oom_score_adj; > > + if (adj == OOM_SCORE_ADJ_MIN) > > + return "S"; > > The value is already printed in the task dump, this doesn't look to add > any information. I understand and perhaps it does not make sense; albeit, the reader might not understand the meaning of OOM_SCORE_ADJ_MIN. > > + else if (test_bit(MMF_OOM_SKIP, &p->mm->flags) > > + return "R"; > > We should be doing the task dump only when we're killing a victim (unless > we're panicking), so something else has been chosen. Since we would have > oom killed a process with MMF_OOM_SKIP already, can we simply choose to > not print a line for this process? I'd prefer not to show such tasks, when displaying potential OOM victims (including those in_vfork()) as in my opinion, it can be misleading to the reader. That said, a case has been made to maintain their inclusion. However, should in_vfork() even be shown in the actual report? > > > @@ -401,12 +422,13 @@ static int dump_task(struct task_struct *p, void *arg) > > return 0; > > } > > > > - pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu %5hd %s\n", > > + pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu %5hd %13s %s\n", > > 13 characters for one char output? This was to maintain some alignment but fair enough. > > static void dump_tasks(struct oom_control *oc) > > { > > pr_info("Tasks state (memory values in pages):\n"); > > - pr_info("[ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name\n"); > > + pr_info("[ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj oom eligible? name\n"); > > Field names are single words. Understood. Kind regards,
On Tue 15-06-21 12:51:47, Aaron Tomlin wrote: > On Mon 2021-06-14 08:44 +0200, Michal Hocko wrote: > > Well, I have to say that I have a bit hard time understand the problem > > statement here. First of all you are very likely basing your observation > > on an old kernel which is missing a fix which should make the situation > > impossible IIRC. You should be focusing on a justification why the new > > information is helpful for the current tree. > > Michal, > > Not exactly. > > See oom_reap_task(). Let's assume an OOM event occurred within the context > of a memcg and 'memory.oom.group' was not set. If I understand correctly, > once all attempts to OOM reap the specified task were "unsuccessful" then > MMF_OOM_SKIP is applied; and, the assumption is it will be terminated > shorty due to the pending fatal signal (see __oom_kill_process()) i.e. a > SIGKILL is sent to the "victim" before the OOM reaper is notified. Now > assuming the above task did not exited yet, another task, in the same > memcg, could also trigger an OOM event. Therefore, when showing potential > OOM victims the task above with MMF_OOM_SKIP set, will indeed be displayed. > > I understanding the point on OOM_SCORE_ADJ_MIN. This can be easily > identified and is clear to the viewer. However, the same cannot be stated > for MMF_OOM_SKIP. This is all true but it is not really clear why that is really a problem. Kernel log already contains information about reaped processes as they are reported to the log. I fully acknowledge that this is rather spartan but on the other hand from years of experience reading oom reports I have to say the dump_tasks is the least interesting part of the report (while being the most verbose one). All that being said, I am not really opposing extending the information although I am a bit worried about leaking too much internal state to the log. What I am asking for here is a justification why this addition is a general improvement and how it helps uderstanding oom reports further. So please focus on that part.
On Tue 2021-06-15 14:42 +0200, Michal Hocko wrote: > This is all true but it is not really clear why that is really a > problem. Kernel log already contains information about reaped processes > as they are reported to the log. I fully acknowledge that this is rather > spartan but on the other hand from years of experience reading oom > reports I have to say the dump_tasks is the least interesting part of > the report (while being the most verbose one). I understand. I suppose, in a situation whereby dump_tasks() output is only available, for whatever reason, it can provide at least some insight into what tasks were actually considered not OOM eligible and why. > All that being said, I am not really opposing extending the information > although I am a bit worried about leaking too much internal state to the > log. Fair enough. That said, I still feel highlighting such "ineligible" tasks could be useful to the viewer for troubleshooting purposes; we already display OOM_SCORE_ADJ_MIN. Consider a situation when only a few tasks in a memcg are displayed as possibly OOM eligible but one had MMF_OOM_SKIP applied. In my opinion, perhaps it is better to just exclude such details altogether. That being said, as you know, we only provide this facility when one is interested in it anyway i.e., if oom_dump_tasks is enabled. > What I am asking for here is a justification why this addition is a > general improvement and how it helps uderstanding oom reports further. > So please focus on that part. Sure; albeit, thinking about this more, it does not provide much understanding in simple isolation. Kind regards,
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index 586cd4b86428..123be642bc7e 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -658,8 +658,9 @@ oom_dump_tasks Enables a system-wide task dump (excluding kernel threads) to be produced when the kernel performs an OOM-killing and includes such information as pid, uid, tgid, vm size, rss, pgtables_bytes, swapents, oom_score_adj -score, and name. This is helpful to determine why the OOM killer was -invoked, to identify the rogue task that caused it, and to determine why +score, oom eligibility status and name. This is helpful to determine why +the OOM killer was invoked, to identify the rogue task that caused it, and +to determine why the OOM killer chose the task it did to kill. If this is set to zero, this information is suppressed. On very diff --git a/mm/oom_kill.c b/mm/oom_kill.c index eefd3f5fde46..094b7b61d66f 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -160,6 +160,27 @@ static inline bool is_sysrq_oom(struct oom_control *oc) return oc->order == -1; } +/** + * is_task_eligible_oom - determine if and why a task cannot be OOM killed + * @tsk: task to check + * + * Needs to be called with task_lock(). + */ +static const char * is_task_oom_eligible(struct task_struct *p) +{ + long adj; + + adj = (long)p->signal->oom_score_adj; + if (adj == OOM_SCORE_ADJ_MIN) + return "S"; + else if (test_bit(MMF_OOM_SKIP, &p->mm->flags) + return "R"; + else if (in_vfork(p)) + return "V"; + else + return ""; +} + /* return true if the task is not adequate as candidate victim task. */ static bool oom_unkillable_task(struct task_struct *p) { @@ -401,12 +422,13 @@ static int dump_task(struct task_struct *p, void *arg) return 0; } - pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu %5hd %s\n", + pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu %5hd %13s %s\n", task->pid, from_kuid(&init_user_ns, task_uid(task)), task->tgid, task->mm->total_vm, get_mm_rss(task->mm), mm_pgtables_bytes(task->mm), get_mm_counter(task->mm, MM_SWAPENTS), - task->signal->oom_score_adj, task->comm); + task->signal->oom_score_adj, is_task_oom_eligible(task), + task->comm); task_unlock(task); return 0; @@ -420,12 +442,13 @@ static int dump_task(struct task_struct *p, void *arg) * memcg, not in the same cpuset, or bound to a disjoint set of mempolicy nodes * are not shown. * State information includes task's pid, uid, tgid, vm size, rss, - * pgtables_bytes, swapents, oom_score_adj value, and name. + * pgtables_bytes, swapents, oom_score_adj value, oom eligibility status + * and name. */ static void dump_tasks(struct oom_control *oc) { pr_info("Tasks state (memory values in pages):\n"); - pr_info("[ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name\n"); + pr_info("[ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj oom eligible? name\n"); if (is_memcg_oom(oc)) mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
Changes since v2: - Use single character (e.g. 'R' for MMF_OOM_SKIP) as suggested by Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> - Add new header to oom_dump_tasks documentation At the present time, when showing potential OOM victims, we do not exclude tasks which already have MMF_OOM_SKIP set; it is possible that the last OOM killable victim was already OOM killed, yet the OOM reaper failed to reclaim memory and set MMF_OOM_SKIP. This can be confusing/or perhaps even misleading, to the reader of the OOM report. Now, we already unconditionally display a task's oom_score_adj_min value that can be set to OOM_SCORE_ADJ_MIN which is indicative of an "unkillable" task i.e. is not eligible. This patch provides a clear indication with regard to the OOM eligibility of each displayed task. Signed-off-by: Aaron Tomlin <atomlin@redhat.com> --- Documentation/admin-guide/sysctl/vm.rst | 5 ++-- mm/oom_kill.c | 31 +++++++++++++++++++++---- 2 files changed, 30 insertions(+), 6 deletions(-)