diff mbox series

mm, memcg: introduce per memcg oom_score_adj

Message ID 1566464189-1631-1-git-send-email-laoar.shao@gmail.com (mailing list archive)
State New, archived
Headers show
Series mm, memcg: introduce per memcg oom_score_adj | expand

Commit Message

Yafang Shao Aug. 22, 2019, 8:56 a.m. UTC
- Why we need a per memcg oom_score_adj setting ?
This is easy to deploy and very convenient for container.
When we use container, we always treat memcg as a whole, if we have a per
memcg oom_score_adj setting we don't need to set it process by process.
It will make the user exhausted to set it to all processes in a memcg.

In this patch, a file named memory.oom.score_adj is introduced.
The valid value of it is from -1000 to +1000, which is same with
process-level oom_score_adj.
When OOM is invoked, the effective oom_score_adj is as bellow,
    effective oom_score_adj = original oom_score_adj + memory.oom.score_adj
The valid effective value is also from -1000 to +1000.
This is something like a hook to re-calculate the oom_score_adj.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h | 24 ++++++++++++++++++++++++
 mm/memcontrol.c            | 38 ++++++++++++++++++++++++++++++++++++++
 mm/oom_kill.c              | 20 ++++++++------------
 3 files changed, 70 insertions(+), 12 deletions(-)

Comments

Michal Hocko Aug. 22, 2019, 9:19 a.m. UTC | #1
On Thu 22-08-19 04:56:29, Yafang Shao wrote:
> - Why we need a per memcg oom_score_adj setting ?
> This is easy to deploy and very convenient for container.
> When we use container, we always treat memcg as a whole, if we have a per
> memcg oom_score_adj setting we don't need to set it process by process.

Why cannot an initial process in the cgroup set the oom_score_adj and
other processes just inherit it from there? This sounds trivial to do
with a startup script.

> It will make the user exhausted to set it to all processes in a memcg.

Then let's have scripts to set it as they are less prone to exhaustion
;)
But seriously

> In this patch, a file named memory.oom.score_adj is introduced.
> The valid value of it is from -1000 to +1000, which is same with
> process-level oom_score_adj.
> When OOM is invoked, the effective oom_score_adj is as bellow,
>     effective oom_score_adj = original oom_score_adj + memory.oom.score_adj

This doesn't make any sense to me. Say that process has oom_score_adj
-1000 (never kill) then group oom_score_adj will simply break the
expectation and the task becomes killable for any value but -1000.
Why is summing up those values even sensible?

> The valid effective value is also from -1000 to +1000.
> This is something like a hook to re-calculate the oom_score_adj.

Besides that. What is the hierarchical semantic? Say you have hierarchy
	A (oom_score_adj = 1000)
	 \
	  B (oom_score_adj = 500)
	   \
	    C (oom_score_adj = -1000)

put the above summing up aside for now and just focus on the memcg
adjusting?
Yafang Shao Aug. 22, 2019, 9:34 a.m. UTC | #2
On Thu, Aug 22, 2019 at 5:19 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 22-08-19 04:56:29, Yafang Shao wrote:
> > - Why we need a per memcg oom_score_adj setting ?
> > This is easy to deploy and very convenient for container.
> > When we use container, we always treat memcg as a whole, if we have a per
> > memcg oom_score_adj setting we don't need to set it process by process.
>
> Why cannot an initial process in the cgroup set the oom_score_adj and
> other processes just inherit it from there? This sounds trivial to do
> with a startup script.
>

That is what we used to do before.
But it can't apply to the running containers.


> > It will make the user exhausted to set it to all processes in a memcg.
>
> Then let's have scripts to set it as they are less prone to exhaustion
> ;)

That is not easy to deploy it to the production environment.

> But seriously
>
> > In this patch, a file named memory.oom.score_adj is introduced.
> > The valid value of it is from -1000 to +1000, which is same with
> > process-level oom_score_adj.
> > When OOM is invoked, the effective oom_score_adj is as bellow,
> >     effective oom_score_adj = original oom_score_adj + memory.oom.score_adj
>
> This doesn't make any sense to me. Say that process has oom_score_adj
> -1000 (never kill) then group oom_score_adj will simply break the
> expectation and the task becomes killable for any value but -1000.
> Why is summing up those values even sensible?
>

Ah, good catch. This needs to be improved.

> > The valid effective value is also from -1000 to +1000.
> > This is something like a hook to re-calculate the oom_score_adj.
>
> Besides that. What is the hierarchical semantic? Say you have hierarchy
>         A (oom_score_adj = 1000)
>          \
>           B (oom_score_adj = 500)
>            \
>             C (oom_score_adj = -1000)
>
> put the above summing up aside for now and just focus on the memcg
> adjusting?

I think that there's no conflict between children's oom_score_adj,
that is different with memory.max.
So it is not neccessary to consider the parent's oom_sore_adj.

Thanks

Yafang
Michal Hocko Aug. 22, 2019, 10:59 a.m. UTC | #3
On Thu 22-08-19 17:34:54, Yafang Shao wrote:
> On Thu, Aug 22, 2019 at 5:19 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 22-08-19 04:56:29, Yafang Shao wrote:
> > > - Why we need a per memcg oom_score_adj setting ?
> > > This is easy to deploy and very convenient for container.
> > > When we use container, we always treat memcg as a whole, if we have a per
> > > memcg oom_score_adj setting we don't need to set it process by process.
> >
> > Why cannot an initial process in the cgroup set the oom_score_adj and
> > other processes just inherit it from there? This sounds trivial to do
> > with a startup script.
> >
> 
> That is what we used to do before.
> But it can't apply to the running containers.
> 
> 
> > > It will make the user exhausted to set it to all processes in a memcg.
> >
> > Then let's have scripts to set it as they are less prone to exhaustion
> > ;)
> 
> That is not easy to deploy it to the production environment.

What is hard about a simple loop over tasklist exported by cgroup and
apply a value to oom_score_adj?

[...]

> > Besides that. What is the hierarchical semantic? Say you have hierarchy
> >         A (oom_score_adj = 1000)
> >          \
> >           B (oom_score_adj = 500)
> >            \
> >             C (oom_score_adj = -1000)
> >
> > put the above summing up aside for now and just focus on the memcg
> > adjusting?
> 
> I think that there's no conflict between children's oom_score_adj,
> that is different with memory.max.
> So it is not neccessary to consider the parent's oom_sore_adj.

Each exported cgroup tuning _has_ to be hierarchical so that an admin
can override children setting in order to safely delegate the
configuration.

Last but not least, oom_score_adj has proven to be a terrible interface
that is essentially close to unusable to anything outside of extreme
values (-1000 and very arguably 1000). Making it cgroup aware without
changing oom victim selection to consider cgroup as a whole will also be
a pain so I am afraid that this is a dead end path.

We can discuss cgroup aware oom victim selection for sure and there are
certainly reasonable usecases to back that functionality. Please refer
to discussion from 2017/2018 (dubbed as "cgroup-aware OOM killer"). But
be warned this is a tricky area and there was a fundamental disagreement
on how things should be classified without a clear way to reach
consensus. What we have right now is the only agreement we could reach.
It is likely possible that the only more clever cgroup aware oom
selection has to be implemented in the userspace with an understanding
of the specific workload.
Roman Gushchin Aug. 22, 2019, 10:46 p.m. UTC | #4
On Thu, Aug 22, 2019 at 12:59:18PM +0200, Michal Hocko wrote:
> On Thu 22-08-19 17:34:54, Yafang Shao wrote:
> > On Thu, Aug 22, 2019 at 5:19 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 22-08-19 04:56:29, Yafang Shao wrote:
> > > > - Why we need a per memcg oom_score_adj setting ?
> > > > This is easy to deploy and very convenient for container.
> > > > When we use container, we always treat memcg as a whole, if we have a per
> > > > memcg oom_score_adj setting we don't need to set it process by process.
> > >
> > > Why cannot an initial process in the cgroup set the oom_score_adj and
> > > other processes just inherit it from there? This sounds trivial to do
> > > with a startup script.
> > >
> > 
> > That is what we used to do before.
> > But it can't apply to the running containers.
> > 
> > 
> > > > It will make the user exhausted to set it to all processes in a memcg.
> > >
> > > Then let's have scripts to set it as they are less prone to exhaustion
> > > ;)
> > 
> > That is not easy to deploy it to the production environment.
> 
> What is hard about a simple loop over tasklist exported by cgroup and
> apply a value to oom_score_adj?
> 
> [...]
> 
> > > Besides that. What is the hierarchical semantic? Say you have hierarchy
> > >         A (oom_score_adj = 1000)
> > >          \
> > >           B (oom_score_adj = 500)
> > >            \
> > >             C (oom_score_adj = -1000)
> > >
> > > put the above summing up aside for now and just focus on the memcg
> > > adjusting?
> > 
> > I think that there's no conflict between children's oom_score_adj,
> > that is different with memory.max.
> > So it is not neccessary to consider the parent's oom_sore_adj.
> 
> Each exported cgroup tuning _has_ to be hierarchical so that an admin
> can override children setting in order to safely delegate the
> configuration.

+1

> 
> Last but not least, oom_score_adj has proven to be a terrible interface
> that is essentially close to unusable to anything outside of extreme
> values (-1000 and very arguably 1000). Making it cgroup aware without
> changing oom victim selection to consider cgroup as a whole will also be
> a pain so I am afraid that this is a dead end path.
> 
> We can discuss cgroup aware oom victim selection for sure and there are
> certainly reasonable usecases to back that functionality. Please refer
> to discussion from 2017/2018 (dubbed as "cgroup-aware OOM killer"). But
> be warned this is a tricky area and there was a fundamental disagreement
> on how things should be classified without a clear way to reach
> consensus. What we have right now is the only agreement we could reach.
> It is likely possible that the only more clever cgroup aware oom
> selection has to be implemented in the userspace with an understanding
> of the specific workload.

I think the agreement is that the main goal of the kernel OOM killer is to
prevent different memory dead- and live-lock scenarios. And everything
that involves policies which define which workloads are preferable over
others should be kept in userspace.

So the biggest issue of the kernel OOM killer right now is that it often kicks
in too late, if at all (which has been discussed recently). And it looks like
the best answer now is PSI. So I'd really look into that direction to enhance
it.

Thanks!
Yafang Shao Aug. 23, 2019, 1:26 a.m. UTC | #5
On Fri, Aug 23, 2019 at 6:47 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Thu, Aug 22, 2019 at 12:59:18PM +0200, Michal Hocko wrote:
> > On Thu 22-08-19 17:34:54, Yafang Shao wrote:
> > > On Thu, Aug 22, 2019 at 5:19 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Thu 22-08-19 04:56:29, Yafang Shao wrote:
> > > > > - Why we need a per memcg oom_score_adj setting ?
> > > > > This is easy to deploy and very convenient for container.
> > > > > When we use container, we always treat memcg as a whole, if we have a per
> > > > > memcg oom_score_adj setting we don't need to set it process by process.
> > > >
> > > > Why cannot an initial process in the cgroup set the oom_score_adj and
> > > > other processes just inherit it from there? This sounds trivial to do
> > > > with a startup script.
> > > >
> > >
> > > That is what we used to do before.
> > > But it can't apply to the running containers.
> > >
> > >
> > > > > It will make the user exhausted to set it to all processes in a memcg.
> > > >
> > > > Then let's have scripts to set it as they are less prone to exhaustion
> > > > ;)
> > >
> > > That is not easy to deploy it to the production environment.
> >
> > What is hard about a simple loop over tasklist exported by cgroup and
> > apply a value to oom_score_adj?
> >
> > [...]
> >
> > > > Besides that. What is the hierarchical semantic? Say you have hierarchy
> > > >         A (oom_score_adj = 1000)
> > > >          \
> > > >           B (oom_score_adj = 500)
> > > >            \
> > > >             C (oom_score_adj = -1000)
> > > >
> > > > put the above summing up aside for now and just focus on the memcg
> > > > adjusting?
> > >
> > > I think that there's no conflict between children's oom_score_adj,
> > > that is different with memory.max.
> > > So it is not neccessary to consider the parent's oom_sore_adj.
> >
> > Each exported cgroup tuning _has_ to be hierarchical so that an admin
> > can override children setting in order to safely delegate the
> > configuration.
>
> +1
>
> >
> > Last but not least, oom_score_adj has proven to be a terrible interface
> > that is essentially close to unusable to anything outside of extreme
> > values (-1000 and very arguably 1000). Making it cgroup aware without
> > changing oom victim selection to consider cgroup as a whole will also be
> > a pain so I am afraid that this is a dead end path.
> >
> > We can discuss cgroup aware oom victim selection for sure and there are
> > certainly reasonable usecases to back that functionality. Please refer
> > to discussion from 2017/2018 (dubbed as "cgroup-aware OOM killer"). But
> > be warned this is a tricky area and there was a fundamental disagreement
> > on how things should be classified without a clear way to reach
> > consensus. What we have right now is the only agreement we could reach.
> > It is likely possible that the only more clever cgroup aware oom
> > selection has to be implemented in the userspace with an understanding
> > of the specific workload.
>
> I think the agreement is that the main goal of the kernel OOM killer is to
> prevent different memory dead- and live-lock scenarios.

I argree with you that this is the most improtant thing in OOM, and
then we should consider OOM QoS.

> And everything
> that involves policies which define which workloads are preferable over
> others should be kept in userspace.
>

I think it would be better if the kernel could provide some HOOKs to
adjust the OOM QoS.
Something like ebpf or some interfaces like memory.oom.score_adj or
something else.

> So the biggest issue of the kernel OOM killer right now is that it often kicks
> in too late, if at all (which has been discussed recently). And it looks like
> the best answer now is PSI. So I'd really look into that direction to enhance
> it.
>

Agreed.
The kernel OOM killer kicks in only when there's almost no available
memory, that may cause system hang.

Thanks
Yafang
diff mbox series

Patch

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 2cd4359..d2dbde5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -21,6 +21,7 @@ 
 #include <linux/vmstat.h>
 #include <linux/writeback.h>
 #include <linux/page-flags.h>
+#include <linux/oom.h>
 
 struct mem_cgroup;
 struct page;
@@ -224,6 +225,7 @@  struct mem_cgroup {
 	 * Should the OOM killer kill all belonging tasks, had it kill one?
 	 */
 	bool oom_group;
+	short oom_score_adj;
 
 	/* protected by memcg_oom_lock */
 	bool		oom_lock;
@@ -538,6 +540,23 @@  static inline bool task_in_memcg_oom(struct task_struct *p)
 	return p->memcg_in_oom;
 }
 
+static inline int mem_cgroup_score_adj(struct task_struct *p, int task_adj)
+{
+	struct mem_cgroup *memcg;
+	int adj = task_adj;
+
+	memcg = mem_cgroup_from_task(p);
+	if (memcg != root_mem_cgroup) {
+		adj += memcg->oom_score_adj;
+		if (adj < OOM_SCORE_ADJ_MIN)
+			adj = OOM_SCORE_ADJ_MIN;
+		else if (adj > OOM_SCORE_ADJ_MAX)
+			adj = OOM_SCORE_ADJ_MAX;
+	}
+
+	return adj;
+}
+
 bool mem_cgroup_oom_synchronize(bool wait);
 struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
 					    struct mem_cgroup *oom_domain);
@@ -987,6 +1006,11 @@  static inline bool task_in_memcg_oom(struct task_struct *p)
 	return false;
 }
 
+static inline int mem_cgroup_score_adj(struct task_struct *p, int task_adj)
+{
+	return task_adj;
+}
+
 static inline bool mem_cgroup_oom_synchronize(bool wait)
 {
 	return false;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6f5c0c5..065285c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5856,6 +5856,38 @@  static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static int memory_oom_score_adj_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+	seq_printf(m, "%d\n", memcg->oom_score_adj);
+
+	return 0;
+}
+
+static ssize_t memory_oom_score_adj_write(struct kernfs_open_file *of,
+					  char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	int oom_score_adj;
+	int ret;
+
+	buf = strstrip(buf);
+	if (!buf)
+		return -EINVAL;
+
+	ret = kstrtoint(buf, 0, &oom_score_adj);
+	if (ret)
+		return ret;
+
+	if (oom_score_adj > 1000 || oom_score_adj < -1000)
+		return -EINVAL;
+
+	memcg->oom_score_adj = oom_score_adj;
+
+	return nbytes;
+}
+
 static struct cftype memory_files[] = {
 	{
 		.name = "current",
@@ -5909,6 +5941,12 @@  static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
 		.seq_show = memory_oom_group_show,
 		.write = memory_oom_group_write,
 	},
+	{
+		.name = "oom.score_adj",
+		.flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
+		.seq_show = memory_oom_score_adj_show,
+		.write = memory_oom_score_adj_write,
+	},
 	{ }	/* terminate */
 };
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index eda2e2a..f3b0276 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -212,13 +212,7 @@  unsigned long oom_badness(struct task_struct *p, unsigned long totalpages)
 	 * unkillable or have been already oom reaped or the are in
 	 * the middle of vfork
 	 */
-	adj = (long)p->signal->oom_score_adj;
-	if (adj == OOM_SCORE_ADJ_MIN ||
-			test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
-			in_vfork(p)) {
-		task_unlock(p);
-		return 0;
-	}
+	adj = mem_cgroup_score_adj(p, p->signal->oom_score_adj);
 
 	/*
 	 * The baseline for the badness score is the proportion of RAM that each
@@ -404,7 +398,8 @@  static int dump_task(struct task_struct *p, void *arg)
 		task->tgid, task->mm->total_vm, get_mm_rss(task->mm),
 		mm_pgtables_bytes(task->mm),
 		get_mm_counter(task->mm, MM_SWAPENTS),
-		task->signal->oom_score_adj, task->comm);
+		mem_cgroup_score_adj(task, task->signal->oom_score_adj),
+		task->comm);
 	task_unlock(task);
 
 	return 0;
@@ -453,7 +448,7 @@  static void dump_header(struct oom_control *oc, struct task_struct *p)
 {
 	pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, oom_score_adj=%hd\n",
 		current->comm, oc->gfp_mask, &oc->gfp_mask, oc->order,
-			current->signal->oom_score_adj);
+		mem_cgroup_score_adj(current, current->signal->oom_score_adj));
 	if (!IS_ENABLED(CONFIG_COMPACTION) && oc->order)
 		pr_warn("COMPACTION is disabled!!!\n");
 
@@ -939,8 +934,8 @@  static void __oom_kill_process(struct task_struct *victim, const char *message)
  */
 static int oom_kill_memcg_member(struct task_struct *task, void *message)
 {
-	if (task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN &&
-	    !is_global_init(task)) {
+	if (mem_cgroup_score_adj(task, task->signal->oom_score_adj) !=
+	    OOM_SCORE_ADJ_MIN && !is_global_init(task)) {
 		get_task_struct(task);
 		__oom_kill_process(task, message);
 	}
@@ -1085,7 +1080,8 @@  bool out_of_memory(struct oom_control *oc)
 	if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
 	    current->mm && !oom_unkillable_task(current) &&
 	    oom_cpuset_eligible(current, oc) &&
-	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
+	    mem_cgroup_score_adj(current, current->signal->oom_score_adj) !=
+	    OOM_SCORE_ADJ_MIN) {
 		get_task_struct(current);
 		oc->chosen = current;
 		oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");