Message ID | 20180802003201.817-4-guro@fb.com (mailing list archive)
---|---
State | New, archived
Series | introduce memory.oom.group
On 2018/08/02 9:32, Roman Gushchin wrote:
> For some workloads an intervention from the OOM killer
> can be painful. Killing a random task can bring
> the workload into an inconsistent state.
>
> Historically, there are two common solutions for this
> problem:
> 1) enabling panic_on_oom,
> 2) using a userspace daemon to monitor OOMs and kill
>    all outstanding processes.
>
> Both approaches have their downsides:
> rebooting on each OOM is an obvious waste of capacity,
> and handling all in userspace is tricky and requires
> a userspace agent, which will monitor all cgroups
> for OOMs.

We could start a one-time userspace agent which handles
a cgroup OOM event and then terminates...

> +/**
> + * mem_cgroup_get_oom_group - get a memory cgroup to clean up after OOM
> + * @victim: task to be killed by the OOM killer
> + * @oom_domain: memcg in case of memcg OOM, NULL in case of system-wide OOM
> + *
> + * Returns a pointer to a memory cgroup, which has to be cleaned up
> + * by killing all belonging OOM-killable tasks.
> + *
> + * Caller has to call mem_cgroup_put() on the returned non-NULL memcg.
> + */
> +struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
> +					    struct mem_cgroup *oom_domain)
> +{
> +	struct mem_cgroup *oom_group = NULL;
> +	struct mem_cgroup *memcg;
> +
> +	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> +		return NULL;
> +
> +	if (!oom_domain)
> +		oom_domain = root_mem_cgroup;
> +
> +	rcu_read_lock();
> +
> +	memcg = mem_cgroup_from_task(victim);

Isn't this racy? I guess that the memcg of this "victim" can change to
somewhere else from the one it was in when the final candidate was
determined. This "victim" might have already passed exit_mm()/cgroup_exit()
from do_exit(). This "victim" might be moving to a memcg which is different
from the one used for determining the final candidate.

> +	if (memcg == root_mem_cgroup)
> +		goto out;
> +
> +	/*
> +	 * Traverse the memory cgroup hierarchy from the victim task's
> +	 * cgroup up to the OOMing cgroup (or root) to find the
> +	 * highest-level memory cgroup with oom.group set.
> +	 */
> +	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> +		if (memcg->oom_group)
> +			oom_group = memcg;
> +
> +		if (memcg == oom_domain)
> +			break;
> +	}
> +
> +	if (oom_group)
> +		css_get(&oom_group->css);
> +out:
> +	rcu_read_unlock();
> +
> +	return oom_group;
> +}

> @@ -974,7 +988,23 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> 	}
> 	read_unlock(&tasklist_lock);
>
> +	/*
> +	 * Do we need to kill the entire memory cgroup?
> +	 * Or even one of the ancestor memory cgroups?
> +	 * Check this out before killing the victim task.
> +	 */
> +	oom_group = mem_cgroup_get_oom_group(victim, oc->memcg);
> +
> 	__oom_kill_process(victim);
> +
> +	/*
> +	 * If necessary, kill all tasks in the selected memory cgroup.
> +	 */
> +	if (oom_group) {

Isn't "killing a child process of the biggest memory hog" and "killing all
processes which belong to a memcg which the child process of the biggest
memory hog belongs to" strange? The intent of selecting a child is to try
to minimize lost work, while the intent of oom_cgroup is to try to discard
all work. If oom_cgroup is enabled, I feel that we should

  pr_err("%s: Kill all processes in ", message);
  pr_cont_cgroup_path(memcg->css.cgroup);
  pr_cont(" due to memory.oom.group set\n");

without

  pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
         message, task_pid_nr(p), p->comm, points);

(I mean, don't try to select a child).

> +		mem_cgroup_print_oom_group(oom_group);
> +		mem_cgroup_scan_tasks(oom_group, oom_kill_memcg_member, NULL);
> +		mem_cgroup_put(oom_group);
> +	}
> }
>
> /*
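[For illustration: a one-time agent of the kind suggested above could block
on the cgroup v2 memory.events file, which generates a poll() notification
whenever one of its counters changes. A minimal userspace sketch, assuming
only that notification behaviour; the cgroup path and the "handle one event,
then exit" policy are hypothetical, not part of the patch:

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Hypothetical workload cgroup; adjust to the real hierarchy. */
	const char *path = "/sys/fs/cgroup/workload/memory.events";
	struct pollfd pfd = { .events = POLLPRI };
	char buf[4096];
	ssize_t len;
	char *p;

	pfd.fd = open(path, O_RDONLY);
	if (pfd.fd < 0) {
		perror("open");
		return 1;
	}

	/* Block until the kernel notifies a change in memory.events. */
	if (poll(&pfd, 1, -1) < 0) {
		perror("poll");
		return 1;
	}

	/* Re-read the file and check the oom_kill counter. */
	len = pread(pfd.fd, buf, sizeof(buf) - 1, 0);
	if (len < 0) {
		perror("pread");
		return 1;
	}
	buf[len] = '\0';

	p = strstr(buf, "oom_kill ");
	if (p && atol(p + strlen("oom_kill ")) > 0) {
		/* Kill the rest of the workload here, then terminate. */
		printf("OOM kill observed in %s\n", path);
	}
	return 0;
}

As Roman notes further down the thread, keeping even such a one-shot agent
runnable under memory pressure is the hard part.]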
On Thu 02-08-18 19:53:13, Tetsuo Handa wrote:
> On 2018/08/02 9:32, Roman Gushchin wrote:
[...]
> > +struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
> > +					      struct mem_cgroup *oom_domain)
> > +{
[...]
> > +	memcg = mem_cgroup_from_task(victim);
>
> Isn't this racy? I guess that the memcg of this "victim" can change to
> somewhere else from the one it was in when the final candidate was
> determined.

How is this any different from the existing code? We select a victim and
then kill it. The victim might move away and won't be part of the oom
memcg anymore but we will still kill it. I do not remember this ever
being a problem. Migration is a privileged operation. If you lose this
restriction you shouldn't allow moving outside of the oom domain.

> This "victim" might have already passed exit_mm()/cgroup_exit() from do_exit().

Why does this matter? The victim hasn't been killed yet, so if it exits
on its own I do not think we really have to tear the whole cgroup down.

> This "victim" might be moving to a memcg which is different from the one
> used for determining the final candidate.
>
[...]
> > @@ -974,7 +988,23 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
[...]
> > +	/*
> > +	 * If necessary, kill all tasks in the selected memory cgroup.
> > +	 */
> > +	if (oom_group) {
>
> Isn't "killing a child process of the biggest memory hog" and "killing all
> processes which belong to a memcg which the child process of the biggest
> memory hog belongs to" strange? The intent of selecting a child is to try
> to minimize lost work, while the intent of oom_cgroup is to try to discard
> all work.
[...]
> (I mean, don't try to select a child).

Well, the child can belong to a different memcg. Whether the heuristic
to pick up the child is sensible is another question and I do not think
it is related to this patchset. The code works as intended, albeit being
questionable.
On 2018/08/02 20:21, Michal Hocko wrote:
> On Thu 02-08-18 19:53:13, Tetsuo Handa wrote:
>> On 2018/08/02 9:32, Roman Gushchin wrote:
> [...]
>>> +	memcg = mem_cgroup_from_task(victim);
>>
>> Isn't this racy? I guess that the memcg of this "victim" can change to
>> somewhere else from the one it was in when the final candidate was
>> determined.
>
> How is this any different from the existing code? We select a victim and
> then kill it. The victim might move away and won't be part of the oom
> memcg anymore but we will still kill it. I do not remember this ever
> being a problem. Migration is a privileged operation. If you lose this
> restriction you shouldn't allow moving outside of the oom domain.

The existing code kills one process (plus other processes sharing the mm,
if any). But oom_cgroup kills multiple processes. Thus, whether we made
the decision based on the correct memcg becomes important.

>> This "victim" might have already passed exit_mm()/cgroup_exit() from do_exit().
>
> Why does this matter? The victim hasn't been killed yet, so if it exits
> on its own I do not think we really have to tear the whole cgroup down.

The existing code does not send SIGKILL if find_lock_task_mm() failed.
Who can guarantee that the victim is not inside do_exit() yet when this
code is executed?
On Thu 02-08-18 20:53:14, Tetsuo Handa wrote:
> On 2018/08/02 20:21, Michal Hocko wrote:
> > On Thu 02-08-18 19:53:13, Tetsuo Handa wrote:
[...]
> > How is this any different from the existing code? We select a victim and
> > then kill it. The victim might move away and won't be part of the oom
> > memcg anymore but we will still kill it. I do not remember this ever
> > being a problem. Migration is a privileged operation. If you lose this
> > restriction you shouldn't allow moving outside of the oom domain.
>
> The existing code kills one process (plus other processes sharing the mm,
> if any). But oom_cgroup kills multiple processes. Thus, whether we made
> the decision based on the correct memcg becomes important.

Yes, but a proper configuration should already mitigate the harm because
you shouldn't be able to migrate the task outside of the oom domain.

	  A (oom.group = 1)
	 / \
	B   C

Moving a task between B and C should be harmless, while moving it out of
the A subtree completely is a dubious configuration.

> >> This "victim" might have already passed exit_mm()/cgroup_exit() from do_exit().
> >
> > Why does this matter? The victim hasn't been killed yet, so if it exits
> > on its own I do not think we really have to tear the whole cgroup down.
>
> The existing code does not send SIGKILL if find_lock_task_mm() failed.
> Who can guarantee that the victim is not inside do_exit() yet when this
> code is executed?

I do not follow. Why does this matter at all?
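[To make the diagram concrete: with such a layout, setting memory.oom.group
on A alone is enough; the upward walk in mem_cgroup_get_oom_group() resolves
tasks in B or C to A whenever the OOM domain is A or wider, while an OOM
confined to B or C never crosses into the sibling because the walk stops at
the OOM domain. A sketch of that configuration from userspace — the cgroupfs
mount point and the cgroup names A/B/C are hypothetical:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int set_oom_group(const char *cgroup)
{
	char path[128];
	int fd, ret;

	snprintf(path, sizeof(path),
		 "/sys/fs/cgroup/%s/memory.oom.group", cgroup);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	/* memory_oom_group_write() accepts only "0" or "1". */
	ret = (write(fd, "1", 1) == 1) ? 0 : -1;
	close(fd);
	return ret;
}

int main(void)
{
	/* Mark only A as an indivisible unit; B and C need no setting. */
	return set_oom_group("A") ? 1 : 0;
}
]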
On Thu, Aug 02, 2018 at 07:53:13PM +0900, Tetsuo Handa wrote:
> On 2018/08/02 9:32, Roman Gushchin wrote:
> > For some workloads an intervention from the OOM killer
> > can be painful. Killing a random task can bring
> > the workload into an inconsistent state.
[...]
> We could start a one-time userspace agent which handles
> a cgroup OOM event and then terminates...

That might not be so trivial if there is a shortage of memory.

[...]
> > +	memcg = mem_cgroup_from_task(victim);
>
> Isn't this racy? I guess that the memcg of this "victim" can change to
> somewhere else from the one it was in when the final candidate was
> determined. This "victim" might have already passed exit_mm()/cgroup_exit()
> from do_exit(). This "victim" might be moving to a memcg which is different
> from the one used for determining the final candidate.

It is, as well as _all_ OOM handling code. E.g. what if a user sets
oom_score_adj to -1000 at the last moment? It really doesn't matter:
the OOM killer should guarantee forward progress without making too
stupid decisions. It doesn't provide any strict guarantees and really
shouldn't.

[...]
> > +	/*
> > +	 * If necessary, kill all tasks in the selected memory cgroup.
> > +	 */
> > +	if (oom_group) {
>
> Isn't "killing a child process of the biggest memory hog" and "killing all
> processes which belong to a memcg which the child process of the biggest
> memory hog belongs to" strange?
[...]
> (I mean, don't try to select a child).

We can do this optimization, but I would be careful with changing the
dmesg output format. Although it never was a part of the ABI, I wonder
how many users are using something like
"kill process [0-9]+ or sacrifice child" regexps?

Thanks!
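[The kind of log scraping being worried about here is easy to picture; a
hypothetical consumer matching the current format with POSIX regex — the
sample line mimics the existing pr_err() format and is not real output:

#include <regex.h>
#include <stdio.h>

int main(void)
{
	regex_t re;
	const char *line =
		"Out of memory: Kill process 4123 (mem-hog) score 891 "
		"or sacrifice child";

	/* A pattern of the sort in-field monitoring tools might use. */
	if (regcomp(&re, "Kill process [0-9]+ .* or sacrifice child",
		    REG_EXTENDED | REG_NOSUB))
		return 1;
	printf("%s\n", regexec(&re, line, 0, NULL, 0) == 0 ?
			"matches" : "no match");
	regfree(&re);
	return 0;
}

Dropping the per-victim line would silently break such consumers, which is
the hesitation expressed above.]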
On Wed, Aug 01, 2018 at 05:32:01PM -0700, Roman Gushchin wrote:
> For some workloads an intervention from the OOM killer
> can be painful. Killing a random task can bring
> the workload into an inconsistent state.

For patches 1-3,

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 8a2c52d5c53b..7b4364962fbb 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1069,6 +1069,24 @@ PAGE_SIZE multiple when read back.
 	high limit is used and monitored properly, this limit's
 	utility is limited to providing the final safety net.
 
+  memory.oom.group
+	A read-write single value file which exists on non-root
+	cgroups. The default value is "0".
+
+	Determines whether the cgroup should be treated as
+	an indivisible workload by the OOM killer. If set,
+	all tasks belonging to the cgroup or to its descendants
+	(if the memory cgroup is not a leaf cgroup) are killed
+	together or not at all. This can be used to avoid
+	partial kills to guarantee workload integrity.
+
+	Tasks with the OOM protection (oom_score_adj set to -1000)
+	are treated as an exception and are never killed.
+
+	If the OOM killer is invoked in a cgroup, it's not going
+	to kill any tasks outside of this cgroup, regardless of
+	memory.oom.group values of ancestor cgroups.
+
   memory.events
 	A read-only flat-keyed file which exists on non-root
 	cgroups. The following entries are defined. Unless specified
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e53e00cdbe3f..5b26ab460565 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -213,6 +213,11 @@ struct mem_cgroup {
 	 */
 	bool use_hierarchy;
 
+	/*
+	 * Should the OOM killer kill all belonging tasks, had it kill one?
+	 */
+	bool oom_group;
+
 	/* protected by memcg_oom_lock */
 	bool oom_lock;
 	int under_oom;
@@ -517,6 +522,9 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
 }
 
 bool mem_cgroup_oom_synchronize(bool wait);
+struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
+					    struct mem_cgroup *oom_domain);
+void mem_cgroup_print_oom_group(struct mem_cgroup *memcg);
 
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
@@ -951,6 +959,16 @@ static inline bool mem_cgroup_oom_synchronize(bool wait)
 	return false;
 }
 
+static inline struct mem_cgroup *mem_cgroup_get_oom_group(
+	struct task_struct *victim, struct mem_cgroup *oom_domain)
+{
+	return NULL;
+}
+
+static inline void mem_cgroup_print_oom_group(struct mem_cgroup *memcg)
+{
+}
+
 static inline unsigned long memcg_page_state(struct mem_cgroup *memcg,
 					     int idx)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8c0280b3143e..23045398ad21 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1577,6 +1577,62 @@ bool mem_cgroup_oom_synchronize(bool handle)
 	return true;
 }
 
+/**
+ * mem_cgroup_get_oom_group - get a memory cgroup to clean up after OOM
+ * @victim: task to be killed by the OOM killer
+ * @oom_domain: memcg in case of memcg OOM, NULL in case of system-wide OOM
+ *
+ * Returns a pointer to a memory cgroup, which has to be cleaned up
+ * by killing all belonging OOM-killable tasks.
+ *
+ * Caller has to call mem_cgroup_put() on the returned non-NULL memcg.
+ */
+struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
+					    struct mem_cgroup *oom_domain)
+{
+	struct mem_cgroup *oom_group = NULL;
+	struct mem_cgroup *memcg;
+
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		return NULL;
+
+	if (!oom_domain)
+		oom_domain = root_mem_cgroup;
+
+	rcu_read_lock();
+
+	memcg = mem_cgroup_from_task(victim);
+	if (memcg == root_mem_cgroup)
+		goto out;
+
+	/*
+	 * Traverse the memory cgroup hierarchy from the victim task's
+	 * cgroup up to the OOMing cgroup (or root) to find the
+	 * highest-level memory cgroup with oom.group set.
+	 */
+	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+		if (memcg->oom_group)
+			oom_group = memcg;
+
+		if (memcg == oom_domain)
+			break;
+	}
+
+	if (oom_group)
+		css_get(&oom_group->css);
+out:
+	rcu_read_unlock();
+
+	return oom_group;
+}
+
+void mem_cgroup_print_oom_group(struct mem_cgroup *memcg)
+{
+	pr_info("Tasks in ");
+	pr_cont_cgroup_path(memcg->css.cgroup);
+	pr_cont(" are going to be killed due to memory.oom.group set\n");
+}
+
 /**
  * lock_page_memcg - lock a page->mem_cgroup binding
  * @page: the page
@@ -5328,6 +5384,37 @@ static int memory_stat_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+static int memory_oom_group_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+
+	seq_printf(m, "%d\n", memcg->oom_group);
+
+	return 0;
+}
+
+static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
+				      char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	int ret, oom_group;
+
+	buf = strstrip(buf);
+	if (!buf)
+		return -EINVAL;
+
+	ret = kstrtoint(buf, 0, &oom_group);
+	if (ret)
+		return ret;
+
+	if (oom_group != 0 && oom_group != 1)
+		return -EINVAL;
+
+	memcg->oom_group = oom_group;
+
+	return nbytes;
+}
+
 static struct cftype memory_files[] = {
 	{
 		.name = "current",
@@ -5369,6 +5456,12 @@ static struct cftype memory_files[] = {
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.seq_show = memory_stat_show,
 	},
+	{
+		.name = "oom.group",
+		.flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
+		.seq_show = memory_oom_group_show,
+		.write = memory_oom_group_write,
+	},
 	{ }	/* terminate */
 };
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8bded6b3205b..f10eb301f6bf 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -914,6 +914,19 @@ static void __oom_kill_process(struct task_struct *victim)
 }
 #undef K
 
+/*
+ * Kill provided task unless it's secured by setting
+ * oom_score_adj to OOM_SCORE_ADJ_MIN.
+ */
+static int oom_kill_memcg_member(struct task_struct *task, void *unused)
+{
+	if (task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
+		get_task_struct(task);
+		__oom_kill_process(task);
+	}
+	return 0;
+}
+
 static void oom_kill_process(struct oom_control *oc, const char *message)
 {
 	struct task_struct *p = oc->chosen;
@@ -921,6 +934,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	struct task_struct *victim = p;
 	struct task_struct *child;
 	struct task_struct *t;
+	struct mem_cgroup *oom_group;
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
@@ -974,7 +988,23 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	}
 	read_unlock(&tasklist_lock);
 
+	/*
+	 * Do we need to kill the entire memory cgroup?
+	 * Or even one of the ancestor memory cgroups?
+	 * Check this out before killing the victim task.
+	 */
+	oom_group = mem_cgroup_get_oom_group(victim, oc->memcg);
+
 	__oom_kill_process(victim);
+
+	/*
+	 * If necessary, kill all tasks in the selected memory cgroup.
+	 */
+	if (oom_group) {
+		mem_cgroup_print_oom_group(oom_group);
+		mem_cgroup_scan_tasks(oom_group, oom_kill_memcg_member, NULL);
+		mem_cgroup_put(oom_group);
+	}
 }
 
 /*
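[For completeness, exercising the new interface from userspace might look
like the sketch below. The cgroup path is hypothetical, and per
memory_oom_group_write() above, anything other than "0" or "1" (after
whitespace stripping) is rejected with -EINVAL:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/fs/cgroup/job/memory.oom.group";
	char buf[8];
	ssize_t len;
	int fd;

	/* Enable group kills; the trailing newline is stripped in-kernel. */
	fd = open(path, O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, "1\n", 2) != 2)
		perror("write");
	close(fd);

	/* Read the value back via memory_oom_group_show(). */
	fd = open(path, O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	len = read(fd, buf, sizeof(buf) - 1);
	if (len > 0) {
		buf[len] = '\0';
		printf("memory.oom.group = %s", buf);
	}
	close(fd);
	return 0;
}
]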