diff mbox series

[v3] mm/oom_kill: show oom eligibility when displaying the current memory state of all tasks

Message ID 20210730162002.279678-1-atomlin@redhat.com (mailing list archive)
State New
Headers show
Series [v3] mm/oom_kill: show oom eligibility when displaying the current memory state of all tasks | expand

Commit Message

Aaron Tomlin July 30, 2021, 4:20 p.m. UTC
Changes since v2:
 - Use single character (e.g. 'R' for MMF_OOM_SKIP) as suggested
   by Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
 - Add new header to oom_dump_tasks documentation
 - Provide further justification


The output generated by dump_tasks() can be helpful to determine why
there was an OOM condition and which rogue task potentially caused it.
Please note that this is only provided when sysctl oom_dump_tasks is
enabled.

At the present time, when showing potential OOM victims, we do not
exclude any task that are not OOM eligible e.g. those that have
MMF_OOM_SKIP set; it is possible that the last OOM killable victim was
already OOM killed, yet the OOM reaper failed to reclaim memory and set
MMF_OOM_SKIP. This can be confusing (or perhaps even be misleading) to the
viewer. Now, we already unconditionally display a task's oom_score_adj_min
value that can be set to OOM_SCORE_ADJ_MIN which is indicative of an
"unkillable" task.

This patch provides a clear indication with regard to the OOM ineligibility
(and why) of each displayed task with the addition of a new column namely
"oom_skipped". An example is provided below:

    [ 5084.524970] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj oom_skipped name
    [ 5084.526397] [660417]     0 660417    35869      683   167936        0         -1000 M conmon
    [ 5084.526400] [660452]     0 660452   175834      472    86016        0          -998  pod
    [ 5084.527460] [752415]     0 752415    35869      650   172032        0         -1000 M conmon
    [ 5084.527462] [752575] 1001050000 752575   184205    11158   700416        0           999  npm
    [ 5084.527467] [753606] 1001050000 753606   183380    46843  2134016        0           999  node
    [ 5084.527581] Memory cgroup out of memory: Killed process 753606 (node) total-vm:733520kB, anon-rss:161228kB, file-rss:26144kB, shmem-rss:0kB, UID:1001050000

So, a single character 'M' is for OOM_SCORE_ADJ_MIN, 'R' MMF_OOM_SKIP and
'V' for in_vfork().


Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
---
 Documentation/admin-guide/sysctl/vm.rst |  5 ++--
 mm/oom_kill.c                           | 31 +++++++++++++++++++++----
 2 files changed, 30 insertions(+), 6 deletions(-)

Comments

Andrew Morton Aug. 1, 2021, 8:01 p.m. UTC | #1
On Fri, 30 Jul 2021 17:20:02 +0100 Aaron Tomlin <atomlin@redhat.com> wrote:

> Changes since v2:
>  - Use single character (e.g. 'R' for MMF_OOM_SKIP) as suggested
>    by Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
>  - Add new header to oom_dump_tasks documentation
>  - Provide further justification
> 
> 
> The output generated by dump_tasks() can be helpful to determine why
> there was an OOM condition and which rogue task potentially caused it.
> Please note that this is only provided when sysctl oom_dump_tasks is
> enabled.
> 
> At the present time, when showing potential OOM victims, we do not
> exclude any task that are not OOM eligible e.g. those that have
> MMF_OOM_SKIP set; it is possible that the last OOM killable victim was
> already OOM killed, yet the OOM reaper failed to reclaim memory and set
> MMF_OOM_SKIP. This can be confusing (or perhaps even be misleading) to the
> viewer. Now, we already unconditionally display a task's oom_score_adj_min
> value that can be set to OOM_SCORE_ADJ_MIN which is indicative of an
> "unkillable" task.
> 
> This patch provides a clear indication with regard to the OOM ineligibility
> (and why) of each displayed task with the addition of a new column namely
> "oom_skipped". An example is provided below:
> 
>     [ 5084.524970] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj oom_skipped name
>     [ 5084.526397] [660417]     0 660417    35869      683   167936        0         -1000 M conmon
>     [ 5084.526400] [660452]     0 660452   175834      472    86016        0          -998  pod
>     [ 5084.527460] [752415]     0 752415    35869      650   172032        0         -1000 M conmon
>     [ 5084.527462] [752575] 1001050000 752575   184205    11158   700416        0           999  npm
>     [ 5084.527467] [753606] 1001050000 753606   183380    46843  2134016        0           999  node
>     [ 5084.527581] Memory cgroup out of memory: Killed process 753606 (node) total-vm:733520kB, anon-rss:161228kB, file-rss:26144kB, shmem-rss:0kB, UID:1001050000
> 
> So, a single character 'M' is for OOM_SCORE_ADJ_MIN, 'R' MMF_OOM_SKIP and
> 'V' for in_vfork().
> 
> index 003d5cc3751b..4c79fa00ddb3 100644
> --- a/Documentation/admin-guide/sysctl/vm.rst
> +++ b/Documentation/admin-guide/sysctl/vm.rst
> @@ -650,8 +650,9 @@ oom_dump_tasks
>  Enables a system-wide task dump (excluding kernel threads) to be produced
>  when the kernel performs an OOM-killing and includes such information as
>  pid, uid, tgid, vm size, rss, pgtables_bytes, swapents, oom_score_adj
> -score, and name.  This is helpful to determine why the OOM killer was
> -invoked, to identify the rogue task that caused it, and to determine why
> +score, oom eligibility status and name.  This is helpful to determine why
> +the OOM killer was invoked, to identify the rogue task that caused it, and
> +to determine why

It would be better if the meaning of 'M', 'R' and 'V' were described here.

>  the OOM killer chose the task it did to kill.
>  
> +/**
> + * is_task_eligible_oom - determine if and why a task cannot be OOM killed
> + * @tsk: task to check
> + *
> + * Needs to be called with task_lock().
> + */
> +static const char * const is_task_oom_eligible(struct task_struct *p)

Name seems inappropriate.  task_oom_eligibility()?

> +{
> +	long adj;
> +
> +	adj = (long)p->signal->oom_score_adj;
> +	if (adj == OOM_SCORE_ADJ_MIN)
> +		return "M";
> +	else if (test_bit(MMF_OOM_SKIP, &p->mm->flags)
> +		return "R";
> +	else if (in_vfork(p))
> +		return "V";
> +	else
> +		return "";
> +}
David Rientjes Aug. 2, 2021, 3:49 a.m. UTC | #2
On Fri, 30 Jul 2021, Aaron Tomlin wrote:

>  Documentation/admin-guide/sysctl/vm.rst |  5 ++--
>  mm/oom_kill.c                           | 31 +++++++++++++++++++++----
>  2 files changed, 30 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> index 003d5cc3751b..4c79fa00ddb3 100644
> --- a/Documentation/admin-guide/sysctl/vm.rst
> +++ b/Documentation/admin-guide/sysctl/vm.rst
> @@ -650,8 +650,9 @@ oom_dump_tasks
>  Enables a system-wide task dump (excluding kernel threads) to be produced
>  when the kernel performs an OOM-killing and includes such information as
>  pid, uid, tgid, vm size, rss, pgtables_bytes, swapents, oom_score_adj
> -score, and name.  This is helpful to determine why the OOM killer was
> -invoked, to identify the rogue task that caused it, and to determine why
> +score, oom eligibility status and name.  This is helpful to determine why
> +the OOM killer was invoked, to identify the rogue task that caused it, and
> +to determine why
>  the OOM killer chose the task it did to kill.
>  
>  If this is set to zero, this information is suppressed.  On very
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index c729a4c4a1ac..36daa6917b62 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -160,6 +160,27 @@ static inline bool is_sysrq_oom(struct oom_control *oc)
>  	return oc->order == -1;
>  }
>  
> +/**
> + * is_task_eligible_oom - determine if and why a task cannot be OOM killed
> + * @tsk: task to check
> + *
> + * Needs to be called with task_lock().
> + */
> +static const char * const is_task_oom_eligible(struct task_struct *p)
> +{
> +	long adj;
> +
> +	adj = (long)p->signal->oom_score_adj;
> +	if (adj == OOM_SCORE_ADJ_MIN)
> +		return "M";

oom_score_adj is shown already in the tasklist dump, I'm not sure what 
value this adds.

> +	else if (test_bit(MMF_OOM_SKIP, &p->mm->flags)
> +		return "R";
> +	else if (in_vfork(p))
> +		return "V";

This is going to be racy, we can't show that a task that is emitted as 
part of the tasklist dump was did not have in_vfork() == true at the time 
oom_badness() was called.

Wouldn't it be better to simply print the output of oom_badness() to the 
tasklist dump instead so we get complete information?

We could simply special case a LONG_MIN return value as -1000 or "min".

> +	else
> +		return "";
> +}
> +
>  /* return true if the task is not adequate as candidate victim task. */
>  static bool oom_unkillable_task(struct task_struct *p)
>  {
> @@ -401,12 +422,13 @@ static int dump_task(struct task_struct *p, void *arg)
>  		return 0;
>  	}
>  
> -	pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu         %5hd %s\n",
> +	pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu         %5hd %1s %s\n",
>  		task->pid, from_kuid(&init_user_ns, task_uid(task)),
>  		task->tgid, task->mm->total_vm, get_mm_rss(task->mm),
>  		mm_pgtables_bytes(task->mm),
>  		get_mm_counter(task->mm, MM_SWAPENTS),
> -		task->signal->oom_score_adj, task->comm);
> +		task->signal->oom_score_adj, is_task_oom_eligible(task),
> +		task->comm);
>  	task_unlock(task);
>  
>  	return 0;
> @@ -420,12 +442,13 @@ static int dump_task(struct task_struct *p, void *arg)
>   * memcg, not in the same cpuset, or bound to a disjoint set of mempolicy nodes
>   * are not shown.
>   * State information includes task's pid, uid, tgid, vm size, rss,
> - * pgtables_bytes, swapents, oom_score_adj value, and name.
> + * pgtables_bytes, swapents, oom_score_adj value, oom eligibility status
> + * and name.
>   */
>  static void dump_tasks(struct oom_control *oc)
>  {
>  	pr_info("Tasks state (memory values in pages):\n");
> -	pr_info("[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name\n");
> +	pr_info("[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj oom_skipped name\n");
>  
>  	if (is_memcg_oom(oc))
>  		mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
> -- 
> 2.31.1
> 
>
Michal Hocko Aug. 2, 2021, 6:34 a.m. UTC | #3
On Fri 30-07-21 17:20:02, Aaron Tomlin wrote:
> Changes since v2:
>  - Use single character (e.g. 'R' for MMF_OOM_SKIP) as suggested
>    by Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
>  - Add new header to oom_dump_tasks documentation
>  - Provide further justification
> 
> 
> The output generated by dump_tasks() can be helpful to determine why
> there was an OOM condition and which rogue task potentially caused it.
> Please note that this is only provided when sysctl oom_dump_tasks is
> enabled.
> 
> At the present time, when showing potential OOM victims, we do not
> exclude any task that are not OOM eligible e.g.

Well, this is not precise. We do exclude ineligible. Consider tasks that
are outside of the OOM domain for example. You are right that we are not
excluding all of them though.

> those that have
> MMF_OOM_SKIP set; it is possible that the last OOM killable victim was
> already OOM killed, yet the OOM reaper failed to reclaim memory and set
> MMF_OOM_SKIP. This can be confusing (or perhaps even be misleading) to the
> viewer. Now, we already unconditionally display a task's oom_score_adj_min
> value that can be set to OOM_SCORE_ADJ_MIN which is indicative of an
> "unkillable" task.
> 
> This patch provides a clear indication with regard to the OOM ineligibility
> (and why) of each displayed task with the addition of a new column namely
> "oom_skipped". An example is provided below:
> 
>     [ 5084.524970] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj oom_skipped name
>     [ 5084.526397] [660417]     0 660417    35869      683   167936        0         -1000 M conmon
>     [ 5084.526400] [660452]     0 660452   175834      472    86016        0          -998  pod
>     [ 5084.527460] [752415]     0 752415    35869      650   172032        0         -1000 M conmon
>     [ 5084.527462] [752575] 1001050000 752575   184205    11158   700416        0           999  npm
>     [ 5084.527467] [753606] 1001050000 753606   183380    46843  2134016        0           999  node
>     [ 5084.527581] Memory cgroup out of memory: Killed process 753606 (node) total-vm:733520kB, anon-rss:161228kB, file-rss:26144kB, shmem-rss:0kB, UID:1001050000
> 
> So, a single character 'M' is for OOM_SCORE_ADJ_MIN, 'R' MMF_OOM_SKIP and
> 'V' for in_vfork().

I have to say I dislike this for two reasons. First and foremost it
makes parsing unnecessarily more complex. Now you have a potentially
an empty column to special case. Secondly MMF_OOM_SKIP is an internal
state that shouldn't really leak to userspace IMO. in_vfork is a racy
check and M is already expressed via oom_score_adj so it duplicates an
existing information.

If you really want/need to make any change here then I would propose to
either add E(eligible)/I(ligible) column without any specifics or
consistently skip over all tasks which are not eligible.
> 2.31.1
Aaron Tomlin Aug. 2, 2021, 2:50 p.m. UTC | #4
On Sun 2021-08-01 20:49 -0700, David Rientjes wrote:
> oom_score_adj is shown already in the tasklist dump, I'm not sure what 
> value this adds.

Fair enough.

> 
> > +	else if (test_bit(MMF_OOM_SKIP, &p->mm->flags)
> > +		return "R";
> > +	else if (in_vfork(p))
> > +		return "V";
> 
> This is going to be racy, we can't show that a task that is emitted as 
> part of the tasklist dump was did not have in_vfork() == true at the time 
> oom_badness() was called.

Yes, this is true.

> Wouldn't it be better to simply print the output of oom_badness() to the 
> tasklist dump instead so we get complete information?

I think this would be acceptable.

> We could simply special case a LONG_MIN return value as -1000 or "min".

Agreed.





Kind regards,
Aaron Tomlin Aug. 2, 2021, 3:12 p.m. UTC | #5
On Mon 2021-08-02 08:34 +0200, Michal Hocko wrote:
> If you really want/need to make any change here then I would propose to
> either add E(eligible)/I(ligible) column without any specifics or
> consistently skip over all tasks which are not eligible.

How about the suggestion made by David i.e. exposing the value returned by
oom_badness(), as if I understand correctly, this would provide a more
complete picture with regard to the rationale used?


Kind regards,
Michal Hocko Aug. 3, 2021, 7:05 a.m. UTC | #6
On Mon 02-08-21 16:12:50, Aaron Tomlin wrote:
> On Mon 2021-08-02 08:34 +0200, Michal Hocko wrote:
> > If you really want/need to make any change here then I would propose to
> > either add E(eligible)/I(ligible) column without any specifics or
> > consistently skip over all tasks which are not eligible.
> 
> How about the suggestion made by David i.e. exposing the value returned by
> oom_badness(), as if I understand correctly, this would provide a more
> complete picture with regard to the rationale used?

There were some attempts to print oom_score during OOM. E.g.
http://lkml.kernel.org/r/20190808183247.28206-1-echron@arista.com. That
one was rejected on the grounds that the number on its own doesn't
really present any real value. It is really only valuable when comparing
to other potential oom victims. I have to say I am still worried about
printing this internal scoring as it should have really been an
implementation detail but with /proc/<pid>/oom_score this is likely a
lost battle and I am willing to give up on that front.

I am still not entirely convinced this is worth doing though.
oom_badness is not a cheap operation. task_lock has to be taken again
during dump_tasks for each task so the already quite expensive operation
will be more so. Is this really something we cannot live without?
Aaron Tomlin Aug. 3, 2021, 10:32 a.m. UTC | #7
On Tue 2021-08-03 09:05 +0200, Michal Hocko wrote:
> There were some attempts to print oom_score during OOM. E.g.
> http://lkml.kernel.org/r/20190808183247.28206-1-echron@arista.com. That
> one was rejected on the grounds that the number on its own doesn't
> really present any real value. It is really only valuable when comparing
> to other potential oom victims. I have to say I am still worried about
> printing this internal scoring as it should have really been an
> implementation detail but with /proc/<pid>/oom_score this is likely a
> lost battle and I am willing to give up on that front.

Understood.

> I am still not entirely convinced this is worth doing though.
> oom_badness is not a cheap operation. task_lock has to be taken again
> during dump_tasks for each task so the already quite expensive operation
> will be more so. Is this really something we cannot live without?

Fair enough and I now agree, it is unquestionably not worth it.



Kind regards,
diff mbox series

Patch

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 003d5cc3751b..4c79fa00ddb3 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -650,8 +650,9 @@  oom_dump_tasks
 Enables a system-wide task dump (excluding kernel threads) to be produced
 when the kernel performs an OOM-killing and includes such information as
 pid, uid, tgid, vm size, rss, pgtables_bytes, swapents, oom_score_adj
-score, and name.  This is helpful to determine why the OOM killer was
-invoked, to identify the rogue task that caused it, and to determine why
+score, oom eligibility status and name.  This is helpful to determine why
+the OOM killer was invoked, to identify the rogue task that caused it, and
+to determine why
 the OOM killer chose the task it did to kill.
 
 If this is set to zero, this information is suppressed.  On very
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index c729a4c4a1ac..36daa6917b62 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -160,6 +160,27 @@  static inline bool is_sysrq_oom(struct oom_control *oc)
 	return oc->order == -1;
 }
 
+/**
+ * is_task_eligible_oom - determine if and why a task cannot be OOM killed
+ * @tsk: task to check
+ *
+ * Needs to be called with task_lock().
+ */
+static const char * const is_task_oom_eligible(struct task_struct *p)
+{
+	long adj;
+
+	adj = (long)p->signal->oom_score_adj;
+	if (adj == OOM_SCORE_ADJ_MIN)
+		return "M";
+	else if (test_bit(MMF_OOM_SKIP, &p->mm->flags)
+		return "R";
+	else if (in_vfork(p))
+		return "V";
+	else
+		return "";
+}
+
 /* return true if the task is not adequate as candidate victim task. */
 static bool oom_unkillable_task(struct task_struct *p)
 {
@@ -401,12 +422,13 @@  static int dump_task(struct task_struct *p, void *arg)
 		return 0;
 	}
 
-	pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu         %5hd %s\n",
+	pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu         %5hd %1s %s\n",
 		task->pid, from_kuid(&init_user_ns, task_uid(task)),
 		task->tgid, task->mm->total_vm, get_mm_rss(task->mm),
 		mm_pgtables_bytes(task->mm),
 		get_mm_counter(task->mm, MM_SWAPENTS),
-		task->signal->oom_score_adj, task->comm);
+		task->signal->oom_score_adj, is_task_oom_eligible(task),
+		task->comm);
 	task_unlock(task);
 
 	return 0;
@@ -420,12 +442,13 @@  static int dump_task(struct task_struct *p, void *arg)
  * memcg, not in the same cpuset, or bound to a disjoint set of mempolicy nodes
  * are not shown.
  * State information includes task's pid, uid, tgid, vm size, rss,
- * pgtables_bytes, swapents, oom_score_adj value, and name.
+ * pgtables_bytes, swapents, oom_score_adj value, oom eligibility status
+ * and name.
  */
 static void dump_tasks(struct oom_control *oc)
 {
 	pr_info("Tasks state (memory values in pages):\n");
-	pr_info("[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name\n");
+	pr_info("[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj oom_skipped name\n");
 
 	if (is_memcg_oom(oc))
 		mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);