Message ID | 1566464189-1631-1-git-send-email-laoar.shao@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Series | mm, memcg: introduce per memcg oom_score_adj |
On Thu 22-08-19 04:56:29, Yafang Shao wrote:
> - Why we need a per memcg oom_score_adj setting ?
> This is easy to deploy and very convenient for containers.
> When we use containers, we always treat the memcg as a whole; if we have a per
> memcg oom_score_adj setting we don't need to set it process by process.

Why cannot an initial process in the cgroup set the oom_score_adj and
other processes just inherit it from there? This sounds trivial to do
with a startup script.

> It will make the user exhausted to set it to all processes in a memcg.

Then let's have scripts to set it as they are less prone to exhaustion
;) But seriously

> In this patch, a file named memory.oom.score_adj is introduced.
> The valid value of it is from -1000 to +1000, which is the same as the
> process-level oom_score_adj.
> When OOM is invoked, the effective oom_score_adj is as below:
>
>     effective oom_score_adj = original oom_score_adj + memory.oom.score_adj

This doesn't make any sense to me. Say that a process has oom_score_adj
-1000 (never kill); then the group oom_score_adj will simply break that
expectation and the task becomes killable for any value but -1000.
Why is summing up those values even sensible?

> The valid effective value is also from -1000 to +1000.
> This is something like a hook to re-calculate the oom_score_adj.

Besides that, what is the hierarchical semantic? Say you have the hierarchy

    A (oom_score_adj = 1000)
     \
      B (oom_score_adj = 500)
       \
        C (oom_score_adj = -1000)

put the above summing up aside for now and just focus on the memcg
adjusting.
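Michal's objection can be illustrated with a quick sketch of the patch's summing-and-clamping rule. The `effective_adj` function below is a hypothetical helper mirroring the arithmetic, not code from the patch: a task marked unkillable with oom_score_adj -1000 becomes killable again for any nonzero memcg adjustment.

```shell
# Sketch of the patch's "effective oom_score_adj" arithmetic in plain
# shell; effective_adj is an illustrative helper, not part of the patch.
effective_adj() {
    task_adj=$1
    memcg_adj=$2
    adj=$(( task_adj + memcg_adj ))
    # clamp to the valid [-1000, 1000] range, as the patch does
    [ "$adj" -lt -1000 ] && adj=-1000
    [ "$adj" -gt 1000 ] && adj=1000
    echo "$adj"
}

effective_adj -1000 500   # prints -500: the "never kill" task is killable again
effective_adj -1000 0     # prints -1000: only a zero memcg adjustment preserves it
```

This is exactly the breakage described above: -1000 is a contract ("never kill"), and a sum treats it as just another point on the scale.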
On Thu, Aug 22, 2019 at 5:19 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 22-08-19 04:56:29, Yafang Shao wrote:
> > - Why we need a per memcg oom_score_adj setting ?
> > This is easy to deploy and very convenient for containers.
> > When we use containers, we always treat the memcg as a whole; if we have a per
> > memcg oom_score_adj setting we don't need to set it process by process.
>
> Why cannot an initial process in the cgroup set the oom_score_adj and
> other processes just inherit it from there? This sounds trivial to do
> with a startup script.
>

That is what we used to do before,
but it can't be applied to already running containers.

> > It will make the user exhausted to set it to all processes in a memcg.
>
> Then let's have scripts to set it as they are less prone to exhaustion
> ;)

That is not easy to deploy to the production environment.

> But seriously
>
> > In this patch, a file named memory.oom.score_adj is introduced.
> > The valid value of it is from -1000 to +1000, which is the same as the
> > process-level oom_score_adj.
> > When OOM is invoked, the effective oom_score_adj is as below:
> >
> >     effective oom_score_adj = original oom_score_adj + memory.oom.score_adj
>
> This doesn't make any sense to me. Say that a process has oom_score_adj
> -1000 (never kill); then the group oom_score_adj will simply break that
> expectation and the task becomes killable for any value but -1000.
> Why is summing up those values even sensible?
>

Ah, good catch. This needs to be improved.

> > The valid effective value is also from -1000 to +1000.
> > This is something like a hook to re-calculate the oom_score_adj.
>
> Besides that, what is the hierarchical semantic? Say you have the hierarchy
>
>     A (oom_score_adj = 1000)
>      \
>       B (oom_score_adj = 500)
>        \
>         C (oom_score_adj = -1000)
>
> put the above summing up aside for now and just focus on the memcg
> adjusting.

I think there's no conflict between the children's oom_score_adj values;
that is different from memory.max.
So it is not necessary to consider the parent's oom_score_adj.

Thanks
Yafang
On Thu 22-08-19 17:34:54, Yafang Shao wrote:
> On Thu, Aug 22, 2019 at 5:19 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 22-08-19 04:56:29, Yafang Shao wrote:
> > > - Why we need a per memcg oom_score_adj setting ?
> > > This is easy to deploy and very convenient for containers.
> > > When we use containers, we always treat the memcg as a whole; if we have a per
> > > memcg oom_score_adj setting we don't need to set it process by process.
> >
> > Why cannot an initial process in the cgroup set the oom_score_adj and
> > other processes just inherit it from there? This sounds trivial to do
> > with a startup script.
> >
>
> That is what we used to do before,
> but it can't be applied to already running containers.
>
> > > It will make the user exhausted to set it to all processes in a memcg.
> >
> > Then let's have scripts to set it as they are less prone to exhaustion
> > ;)
>
> That is not easy to deploy to the production environment.

What is hard about a simple loop over the tasklist exported by the
cgroup that applies a value to oom_score_adj?

[...]

> > Besides that, what is the hierarchical semantic? Say you have the hierarchy
> >
> >     A (oom_score_adj = 1000)
> >      \
> >       B (oom_score_adj = 500)
> >        \
> >         C (oom_score_adj = -1000)
> >
> > put the above summing up aside for now and just focus on the memcg
> > adjusting.
>
> I think there's no conflict between the children's oom_score_adj values;
> that is different from memory.max.
> So it is not necessary to consider the parent's oom_score_adj.

Each exported cgroup tuning _has_ to be hierarchical so that an admin
can override the children's settings in order to safely delegate the
configuration.

Last but not least, oom_score_adj has proven to be a terrible interface
that is essentially close to unusable for anything outside of the extreme
values (-1000 and, very arguably, 1000). Making it cgroup aware without
changing oom victim selection to consider the cgroup as a whole will also
be a pain, so I am afraid that this is a dead end path.

We can discuss cgroup aware oom victim selection for sure, and there are
certainly reasonable usecases to back that functionality. Please refer
to the discussion from 2017/2018 (dubbed the "cgroup-aware OOM killer"),
but be warned that this is a tricky area and there was a fundamental
disagreement on how things should be classified, without a clear way to
reach consensus. What we have right now is the only agreement we could
reach. It is quite possible that any more clever cgroup aware oom
selection has to be implemented in userspace with an understanding of
the specific workload.
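The "simple loop" Michal refers to can be sketched in a few lines of shell. The function below is illustrative only: the proc root and the cgroup's `cgroup.procs` path are parameters rather than hard-coded system locations, so nothing here names a real cgroup on any particular host.

```shell
# Apply one oom_score_adj value to every task a cgroup currently lists
# in its cgroup.procs file. proc_root is parameterized so the sketch
# can be exercised against a fake /proc layout; this illustrates the
# userspace alternative discussed in the thread, not code from it.
set_cgroup_oom_score_adj() {
    proc_root=$1    # normally /proc
    cg_procs=$2     # e.g. /sys/fs/cgroup/<group>/cgroup.procs
    adj=$3          # value in [-1000, 1000]

    while read -r pid; do
        # the regular per-task knob the kernel already provides
        echo "$adj" > "$proc_root/$pid/oom_score_adj"
    done < "$cg_procs"
}
```

On a live system this would be invoked as something like `set_cgroup_oom_score_adj /proc /sys/fs/cgroup/mygroup/cgroup.procs 500` (the cgroup path is an assumption), re-run whenever new tasks join the group — which is also the weakness Yafang points at: it is not atomic with respect to tasks entering the cgroup.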
On Thu, Aug 22, 2019 at 12:59:18PM +0200, Michal Hocko wrote:
> On Thu 22-08-19 17:34:54, Yafang Shao wrote:
> > On Thu, Aug 22, 2019 at 5:19 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 22-08-19 04:56:29, Yafang Shao wrote:
> > > > - Why we need a per memcg oom_score_adj setting ?
> > > > This is easy to deploy and very convenient for containers.
> > > > When we use containers, we always treat the memcg as a whole; if we have a per
> > > > memcg oom_score_adj setting we don't need to set it process by process.
> > >
> > > Why cannot an initial process in the cgroup set the oom_score_adj and
> > > other processes just inherit it from there? This sounds trivial to do
> > > with a startup script.
> > >
> >
> > That is what we used to do before,
> > but it can't be applied to already running containers.
> >
> > > > It will make the user exhausted to set it to all processes in a memcg.
> > >
> > > Then let's have scripts to set it as they are less prone to exhaustion
> > > ;)
> >
> > That is not easy to deploy to the production environment.
>
> What is hard about a simple loop over the tasklist exported by the
> cgroup that applies a value to oom_score_adj?
>
> [...]
>
> > > Besides that, what is the hierarchical semantic? Say you have the hierarchy
> > >
> > >     A (oom_score_adj = 1000)
> > >      \
> > >       B (oom_score_adj = 500)
> > >        \
> > >         C (oom_score_adj = -1000)
> > >
> > > put the above summing up aside for now and just focus on the memcg
> > > adjusting.
> >
> > I think there's no conflict between the children's oom_score_adj values;
> > that is different from memory.max.
> > So it is not necessary to consider the parent's oom_score_adj.
>
> Each exported cgroup tuning _has_ to be hierarchical so that an admin
> can override the children's settings in order to safely delegate the
> configuration.

+1

> Last but not least, oom_score_adj has proven to be a terrible interface
> that is essentially close to unusable for anything outside of the extreme
> values (-1000 and, very arguably, 1000). Making it cgroup aware without
> changing oom victim selection to consider the cgroup as a whole will also
> be a pain, so I am afraid that this is a dead end path.
>
> We can discuss cgroup aware oom victim selection for sure, and there are
> certainly reasonable usecases to back that functionality. Please refer
> to the discussion from 2017/2018 (dubbed the "cgroup-aware OOM killer"),
> but be warned that this is a tricky area and there was a fundamental
> disagreement on how things should be classified, without a clear way to
> reach consensus. What we have right now is the only agreement we could
> reach. It is quite possible that any more clever cgroup aware oom
> selection has to be implemented in userspace with an understanding of
> the specific workload.

I think the agreement is that the main goal of the kernel OOM killer is to
prevent different memory dead- and live-lock scenarios. And everything
that involves policies which define which workloads are preferable over
others should be kept in userspace.

So the biggest issue of the kernel OOM killer right now is that it often
kicks in too late, if at all (which has been discussed recently). And it
looks like the best answer now is PSI. So I'd really look into that
direction to enhance it.

Thanks!
On Fri, Aug 23, 2019 at 6:47 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Thu, Aug 22, 2019 at 12:59:18PM +0200, Michal Hocko wrote:
> > On Thu 22-08-19 17:34:54, Yafang Shao wrote:
> > > On Thu, Aug 22, 2019 at 5:19 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Thu 22-08-19 04:56:29, Yafang Shao wrote:
> > > > > - Why we need a per memcg oom_score_adj setting ?
> > > > > This is easy to deploy and very convenient for containers.
> > > > > When we use containers, we always treat the memcg as a whole; if we have a per
> > > > > memcg oom_score_adj setting we don't need to set it process by process.
> > > >
> > > > Why cannot an initial process in the cgroup set the oom_score_adj and
> > > > other processes just inherit it from there? This sounds trivial to do
> > > > with a startup script.
> > > >
> > >
> > > That is what we used to do before,
> > > but it can't be applied to already running containers.
> > >
> > > > > It will make the user exhausted to set it to all processes in a memcg.
> > > >
> > > > Then let's have scripts to set it as they are less prone to exhaustion
> > > > ;)
> > >
> > > That is not easy to deploy to the production environment.
> >
> > What is hard about a simple loop over the tasklist exported by the
> > cgroup that applies a value to oom_score_adj?
> >
> > [...]
> >
> > > > Besides that, what is the hierarchical semantic? Say you have the hierarchy
> > > >
> > > >     A (oom_score_adj = 1000)
> > > >      \
> > > >       B (oom_score_adj = 500)
> > > >        \
> > > >         C (oom_score_adj = -1000)
> > > >
> > > > put the above summing up aside for now and just focus on the memcg
> > > > adjusting.
> > >
> > > I think there's no conflict between the children's oom_score_adj values;
> > > that is different from memory.max.
> > > So it is not necessary to consider the parent's oom_score_adj.
> >
> > Each exported cgroup tuning _has_ to be hierarchical so that an admin
> > can override the children's settings in order to safely delegate the
> > configuration.
>
> +1
>
> > Last but not least, oom_score_adj has proven to be a terrible interface
> > that is essentially close to unusable for anything outside of the extreme
> > values (-1000 and, very arguably, 1000). Making it cgroup aware without
> > changing oom victim selection to consider the cgroup as a whole will also
> > be a pain, so I am afraid that this is a dead end path.
> >
> > We can discuss cgroup aware oom victim selection for sure, and there are
> > certainly reasonable usecases to back that functionality. Please refer
> > to the discussion from 2017/2018 (dubbed the "cgroup-aware OOM killer"),
> > but be warned that this is a tricky area and there was a fundamental
> > disagreement on how things should be classified, without a clear way to
> > reach consensus. What we have right now is the only agreement we could
> > reach. It is quite possible that any more clever cgroup aware oom
> > selection has to be implemented in userspace with an understanding of
> > the specific workload.
>
> I think the agreement is that the main goal of the kernel OOM killer is to
> prevent different memory dead- and live-lock scenarios.

I agree with you that this is the most important thing in OOM; beyond
that, we should consider OOM QoS.

> And everything
> that involves policies which define which workloads are preferable over
> others should be kept in userspace.

I think it would be better if the kernel could provide some hooks to
adjust the OOM QoS — something like ebpf, or some interfaces like
memory.oom.score_adj, or something else.

> So the biggest issue of the kernel OOM killer right now is that it often
> kicks in too late, if at all (which has been discussed recently). And it
> looks like the best answer now is PSI. So I'd really look into that
> direction to enhance it.

Agreed. The kernel OOM killer kicks in only when there's almost no
available memory, which may cause the system to hang.

Thanks
Yafang
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 2cd4359..d2dbde5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -21,6 +21,7 @@
 #include <linux/vmstat.h>
 #include <linux/writeback.h>
 #include <linux/page-flags.h>
+#include <linux/oom.h>

 struct mem_cgroup;
 struct page;
@@ -224,6 +225,7 @@ struct mem_cgroup {
	 * Should the OOM killer kill all belonging tasks, had it kill one?
	 */
	bool oom_group;
+	short oom_score_adj;

	/* protected by memcg_oom_lock */
	bool oom_lock;
@@ -538,6 +540,23 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
	return p->memcg_in_oom;
 }

+static inline int mem_cgroup_score_adj(struct task_struct *p, int task_adj)
+{
+	struct mem_cgroup *memcg;
+	int adj = task_adj;
+
+	memcg = mem_cgroup_from_task(p);
+	if (memcg != root_mem_cgroup) {
+		adj += memcg->oom_score_adj;
+		if (adj < OOM_SCORE_ADJ_MIN)
+			adj = OOM_SCORE_ADJ_MIN;
+		else if (adj > OOM_SCORE_ADJ_MAX)
+			adj = OOM_SCORE_ADJ_MAX;
+	}
+
+	return adj;
+}
+
 bool mem_cgroup_oom_synchronize(bool wait);
 struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
					    struct mem_cgroup *oom_domain);
@@ -987,6 +1006,11 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
	return false;
 }

+static inline int mem_cgroup_score_adj(struct task_struct *p, int task_adj)
+{
+	return task_adj;
+}
+
 static inline bool mem_cgroup_oom_synchronize(bool wait)
 {
	return false;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6f5c0c5..065285c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5856,6 +5856,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
	return nbytes;
 }

+static int memory_oom_score_adj_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+	seq_printf(m, "%d\n", memcg->oom_score_adj);
+
+	return 0;
+}
+
+static ssize_t memory_oom_score_adj_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	int oom_score_adj;
+	int ret;
+
+	buf = strstrip(buf);
+	if (!buf)
+		return -EINVAL;
+
+	ret = kstrtoint(buf, 0, &oom_score_adj);
+	if (ret)
+		return ret;
+
+	if (oom_score_adj > 1000 || oom_score_adj < -1000)
+		return -EINVAL;
+
+	memcg->oom_score_adj = oom_score_adj;
+
+	return nbytes;
+}
+
 static struct cftype memory_files[] = {
	{
		.name = "current",
@@ -5909,6 +5941,12 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
		.seq_show = memory_oom_group_show,
		.write = memory_oom_group_write,
	},
+	{
+		.name = "oom.score_adj",
+		.flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
+		.seq_show = memory_oom_score_adj_show,
+		.write = memory_oom_score_adj_write,
+	},
	{ }	/* terminate */
 };
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index eda2e2a..f3b0276 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -212,13 +212,7 @@ unsigned long oom_badness(struct task_struct *p, unsigned long totalpages)
	 * unkillable or have been already oom reaped or the are in
	 * the middle of vfork
	 */
-	adj = (long)p->signal->oom_score_adj;
-	if (adj == OOM_SCORE_ADJ_MIN ||
-			test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
-			in_vfork(p)) {
-		task_unlock(p);
-		return 0;
-	}
+	adj = mem_cgroup_score_adj(p, p->signal->oom_score_adj);

	/*
	 * The baseline for the badness score is the proportion of RAM that each
@@ -404,7 +398,8 @@ static int dump_task(struct task_struct *p, void *arg)
		task->tgid, task->mm->total_vm, get_mm_rss(task->mm),
		mm_pgtables_bytes(task->mm),
		get_mm_counter(task->mm, MM_SWAPENTS),
-		task->signal->oom_score_adj, task->comm);
+		mem_cgroup_score_adj(task, task->signal->oom_score_adj),
+		task->comm);
	task_unlock(task);

	return 0;
@@ -453,7 +448,7 @@ static void dump_header(struct oom_control *oc, struct task_struct *p)
 {
	pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, oom_score_adj=%hd\n",
		current->comm, oc->gfp_mask, &oc->gfp_mask, oc->order,
-		current->signal->oom_score_adj);
+		mem_cgroup_score_adj(current, current->signal->oom_score_adj));
	if (!IS_ENABLED(CONFIG_COMPACTION) && oc->order)
		pr_warn("COMPACTION is disabled!!!\n");
@@ -939,8 +934,8 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
  */
 static int oom_kill_memcg_member(struct task_struct *task, void *message)
 {
-	if (task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN &&
-	    !is_global_init(task)) {
+	if (mem_cgroup_score_adj(task, task->signal->oom_score_adj) !=
+	    OOM_SCORE_ADJ_MIN && !is_global_init(task)) {
		get_task_struct(task);
		__oom_kill_process(task, message);
	}
@@ -1085,7 +1080,8 @@ bool out_of_memory(struct oom_control *oc)
	if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
	    current->mm && !oom_unkillable_task(current) &&
	    oom_cpuset_eligible(current, oc) &&
-	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
+	    mem_cgroup_score_adj(current, current->signal->oom_score_adj) !=
+	    OOM_SCORE_ADJ_MIN) {
		get_task_struct(current);
		oc->chosen = current;
		oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
- Why we need a per memcg oom_score_adj setting ?
  This is easy to deploy and very convenient for containers.
  When we use containers, we always treat the memcg as a whole; if we have a per
  memcg oom_score_adj setting we don't need to set it process by process.
  It will make the user exhausted to set it to all processes in a memcg.

In this patch, a file named memory.oom.score_adj is introduced.
The valid value of it is from -1000 to +1000, which is the same as the
process-level oom_score_adj.
When OOM is invoked, the effective oom_score_adj is as below:

	effective oom_score_adj = original oom_score_adj + memory.oom.score_adj

The valid effective value is also from -1000 to +1000.
This is something like a hook to re-calculate the oom_score_adj.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h | 24 ++++++++++++++++++++++++
 mm/memcontrol.c            | 38 ++++++++++++++++++++++++++++++++++++++
 mm/oom_kill.c              | 20 ++++++++------------
 3 files changed, 70 insertions(+), 12 deletions(-)