Message ID | 20220921064710.89663-1-ligang.bdlg@bytedance.com (mailing list archive)
---|---
State | New
Series | [RFC,v1] mm: oom: introduce cpuset oom
On Wed, 21 Sep 2022, Gang Li wrote:

> cpuset confines processes to processor and memory node subsets.
> When a process in a cpuset triggers oom, it may kill a completely
> irrelevant process on another NUMA node, which will not release any
> memory for this cpuset.
>
> It seems that `CONSTRAINT_CPUSET` is not really doing much these
> days. Using CONSTRAINT_CPUSET, we can easily achieve node-aware oom
> killing by selecting the victim from the cpuset which triggered the oom.
>
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>

Hmm, is this the right approach?

If a cpuset results in an oom condition, is there a reason why we'd need
to find a process from within that cpuset to kill? I think the idea is to
free memory on the oom set of nodes (cpuset.mems), and that can happen by
killing a process that is not a member of this cpuset.

I understand the challenges of creating a NUMA-aware oom killer to target
memory that is actually resident on an oom node, but this approach doesn't
seem right and could actually lead to pathological cases where a small
process trying to fork in an otherwise empty cpuset repeatedly oom kills
when we'd actually prefer to kill a single large process.

> ---
> This idea comes from a previous patch:
> mm, oom: Introduce per numa node oom for CONSTRAINT_MEMORY_POLICY
> https://lore.kernel.org/all/YoJ%2FioXwGTdCywUE@dhcp22.suse.cz/
>
> Any comments are welcome.
> ---
>  include/linux/cpuset.h |  6 ++++++
>  kernel/cgroup/cpuset.c | 17 +++++++++++++++++
>  mm/oom_kill.c          |  4 ++++
>  3 files changed, 27 insertions(+)
>
> [quoted patch body snipped; the full diff is reproduced further down]
>
> --
> 2.20.1
On Thu 22-09-22 12:18:04, David Rientjes wrote:
> On Wed, 21 Sep 2022, Gang Li wrote:
>
> > cpuset confines processes to processor and memory node subsets.
> > When a process in a cpuset triggers oom, it may kill a completely
> > irrelevant process on another NUMA node, which will not release any
> > memory for this cpuset.
> >
> > It seems that `CONSTRAINT_CPUSET` is not really doing much these
> > days. Using CONSTRAINT_CPUSET, we can easily achieve node-aware oom
> > killing by selecting the victim from the cpuset which triggered the oom.
> >
> > Suggested-by: Michal Hocko <mhocko@suse.com>
> > Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
>
> Hmm, is this the right approach?
>
> If a cpuset results in an oom condition, is there a reason why we'd need
> to find a process from within that cpuset to kill? I think the idea is to
> free memory on the oom set of nodes (cpuset.mems), and that can happen by
> killing a process that is not a member of this cpuset.

I would argue that the current cpuset should be considered first, because
chances are that it will already have the biggest memory consumption from
the constrained NUMA nodes. At least that would be the case when cpusets
are used to partition the system into exclusive NUMA domains.

The situation gets more complex with overlapping nodemasks in different
cpusets, but I believe our existing semantic already sucks for those use
cases, because we just shoot a random process with an unknown amount of
memory allocated from the constrained nodemask. The new semantic is not
much worse. We could find the real oom victim under a different cpuset,
but the current semantic could just as well kill a large memory consumer
with a tiny footprint on the target node. With the cpuset view, the
potential damage is more targeted in many cases.

> I understand the challenges of creating a NUMA-aware oom killer to target
> memory that is actually resident on an oom node, but this approach doesn't
> seem right and could actually lead to pathological cases where a small
> process trying to fork in an otherwise empty cpuset repeatedly oom kills
> when we'd actually prefer to kill a single large process.

Yeah, that is possible and something to consider. One way to go about
that is to make the selection from all cpusets that overlap the
requested nodemask (probably with a preference for more constrained
ones). In any case, let's keep in mind that this is a mere heuristic.
We just need to kill some process; it is not really feasible to aim for
the best selection. We should just try to reduce the harm. Our existing
cpuset based OOM is effectively random, without any clear relation to
cpusets, so I would be open to experimenting in this area.
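To make the overlap idea concrete, a rough and untested sketch follows. The
helper name cpuset_scan_overlapping() is invented here, and since
cpuset_for_each_descendant_pre(), top_cpuset and struct cpuset are private
to kernel/cgroup/cpuset.c, it would have to live in that file:

/*
 * Hypothetical sketch of the suggestion above: evaluate tasks in every
 * cpuset whose mems_allowed intersects the OOM-constrained nodemask,
 * not only the current task's cpuset.  Not a tested patch.
 */
static int cpuset_scan_overlapping(nodemask_t *mask,
				   int (*fn)(struct task_struct *, void *),
				   void *arg)
{
	struct cgroup_subsys_state *pos_css;
	struct css_task_iter it;
	struct task_struct *task;
	struct cpuset *cs;
	int ret = 0;

	rcu_read_lock();
	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
		/* skip cpusets with no memory on the oom nodes */
		if (!nodes_intersects(cs->mems_allowed, *mask))
			continue;
		css_task_iter_start(&cs->css, CSS_TASK_ITER_PROCS, &it);
		while (!ret && (task = css_task_iter_next(&it)))
			ret = fn(task, arg);
		css_task_iter_end(&it);
		if (ret)
			break;
	}
	rcu_read_unlock();

	return ret;
}

The preference for more constrained cpusets could then be layered on top,
e.g. by first scanning only cpusets whose mems_allowed is a subset of the
requested mask and widening the scan if nothing eligible is found.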
On 2022/9/23 03:18, David Rientjes wrote:
> On Wed, 21 Sep 2022, Gang Li wrote:
>
>> cpuset confines processes to processor and memory node subsets.
>> When a process in a cpuset triggers oom, it may kill a completely
>> irrelevant process on another NUMA node, which will not release any
>> memory for this cpuset.
>>
>> It seems that `CONSTRAINT_CPUSET` is not really doing much these
>> days. Using CONSTRAINT_CPUSET, we can easily achieve node-aware oom
>> killing by selecting the victim from the cpuset which triggered the oom.
>>
>> Suggested-by: Michal Hocko <mhocko@suse.com>
>> Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
>
> Hmm, is this the right approach?
>
> If a cpuset results in an oom condition, is there a reason why we'd need
> to find a process from within that cpuset to kill? I think the idea is to
> free memory on the oom set of nodes (cpuset.mems), and that can happen by
> killing a process that is not a member of this cpuset.
>

Hi,

My last patch implemented this idea[1][2]. But it needs to inc/dec a
per-mm_struct counter on every page allocation, release and migration.
As Unixbench shows, this costs 0%-3% of performance, depending on the
workload[2]. So Michal Hocko inspired me to use cpuset[3].

[1]. https://lore.kernel.org/all/20220512044634.63586-1-ligang.bdlg@bytedance.com/
[2]. https://lore.kernel.org/all/20220708082129.80115-1-ligang.bdlg@bytedance.com/
[3]. https://lore.kernel.org/all/YoJ%2FioXwGTdCywUE@dhcp22.suse.cz/

> I understand the challenges of creating a NUMA-aware oom killer to target
> memory that is actually resident on an oom node, but this approach doesn't
> seem right and could actually lead to pathological cases where a small
> process trying to fork in an otherwise empty cpuset repeatedly oom kills
> when we'd actually prefer to kill a single large process.
>

I think there are three ways to achieve a NUMA-aware oom killer (a rough
sketch of the first one follows below):

1. Count every page operation, which causes a 0%-3% performance loss[2].
2. Iterate over the pages of every process (like show_numa_map), which
   may stall the oom path.
3. Select the victim within a cpuset, which may lead to pathological
   kills (this patch).

None of them are perfect and I'm getting stuck. Do you have any ideas?
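For reference, approach 1 boils down to something like the following. The
struct, field placement and helper names are hypothetical illustrations,
not the code from [1]; the point is that the extra atomic on every
fault/free/migration hot path is where the measured 0%-3% comes from:

/* Hypothetical sketch of approach 1: per-mm, per-node RSS accounting. */
struct mm_numa_stats {
	atomic_long_t rss[MAX_NUMNODES];	/* would hang off mm_struct */
};

/* would be called on every page allocation/release/migration for this mm */
static inline void mm_numa_rss_add(struct mm_numa_stats *st, int nid,
				   long delta)
{
	atomic_long_add(delta, &st->rss[nid]);
}

/* oom side: a task's footprint on the constrained nodes is then cheap */
static long mm_numa_badness(struct mm_numa_stats *st, const nodemask_t *mask)
{
	long pages = 0;
	int nid;

	for_each_node_mask(nid, *mask)
		pages += atomic_long_read(&st->rss[nid]);

	return pages;
}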
On 2022/9/23 15:45, Michal Hocko wrote:
> Yeah, that is possible and something to consider. One way to go about
> that is to make the selection from all cpusets that overlap the
> requested nodemask (probably with a preference for more constrained
> ones). In any case, let's keep in mind that this is a mere heuristic.
> We just need to kill some process; it is not really feasible to aim for
> the best selection. We should just try to reduce the harm. Our existing
> cpuset based OOM is effectively random, without any clear relation to
> cpusets, so I would be open to experimenting in this area.

In addition to cpuset, users can also bind memory to NUMA nodes through
mbind(), as in the example below. So I also want to handle NUMA-binding
applications that are not managed by any cpuset. Do you have any ideas?
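For illustration, this is the kind of task that slips through a cpuset
based scan: plain userspace binding memory with mbind() (declared in
libnuma's numaif.h; link with -lnuma; the 1 GiB size and node 0 are
arbitrary choices for the example):

#include <numaif.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1UL << 30;			/* 1 GiB, arbitrary */
	unsigned long nodemask = 1UL << 0;	/* node 0 only */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	/* MPOL_BIND restricts this mapping's pages to node 0 */
	if (mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0))
		return 1;
	memset(p, 1, len);	/* fault the pages in on node 0 */

	return 0;
}

Such a process stays in the root cpuset, so an oom it triggers on node 0
is classified as CONSTRAINT_MEMORY_POLICY rather than CONSTRAINT_CPUSET,
and the new victim-selection path never sees it.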
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index d58e0476ee8e..7475f613ab90 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -178,6 +178,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 	task_unlock(current);
 }
 
+int cpuset_cgroup_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg);
+
 #else /* !CONFIG_CPUSETS */
 
 static inline bool cpusets_enabled(void) { return false; }
@@ -299,6 +301,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
 	return false;
 }
 
+static inline int cpuset_cgroup_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
+{
+	return 0;
+}
 #endif /* !CONFIG_CPUSETS */
 
 #endif /* _LINUX_CPUSET_H */
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index b474289c15b8..1f1238b4276d 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3943,6 +3943,23 @@ void cpuset_print_current_mems_allowed(void)
 	rcu_read_unlock();
 }
 
+int cpuset_cgroup_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
+{
+	int ret = 0;
+	struct cgroup *cgrp;
+	struct css_task_iter it;
+	struct task_struct *task;
+
+	rcu_read_lock();
+	css_task_iter_start(&(task_cs(current)->css), CSS_TASK_ITER_PROCS, &it);
+	while (!ret && (task = css_task_iter_next(&it)))
+		ret = fn(task, arg);
+	css_task_iter_end(&it);
+	rcu_read_unlock();
+
+	return ret;
+}
+
 /*
  * Collection of memory_pressure is suppressed unless
  * this flag is enabled by writing "1" to the special
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 46e7e073f137..8cea787b359c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -367,6 +367,8 @@ static void select_bad_process(struct oom_control *oc)
 
 	if (is_memcg_oom(oc))
 		mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
+	else if (oc->constraint == CONSTRAINT_CPUSET)
+		cpuset_cgroup_scan_tasks(oom_evaluate_task, oc);
 	else {
 		struct task_struct *p;
 
@@ -427,6 +429,8 @@ static void dump_tasks(struct oom_control *oc)
 
 	if (is_memcg_oom(oc))
 		mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
+	else if (oc->constraint == CONSTRAINT_CPUSET)
+		cpuset_cgroup_scan_tasks(dump_task, oc);
 	else {
 		struct task_struct *p;
 
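For context on when the new branches run: oc->constraint is set by
constrained_alloc() in mm/oom_kill.c, which distinguishes mempolicy from
cpuset constraints roughly like this (abridged and paraphrased from
memory; totalpages accounting and special cases such as __GFP_THISNODE
are omitted, so check the actual tree):

static enum oom_constraint constrained_alloc(struct oom_control *oc)
{
	enum zone_type highest_zoneidx = gfp_zone(oc->gfp_mask);
	bool cpuset_limited = false;
	struct zoneref *z;
	struct zone *zone;

	/*
	 * A truncated nodemask here means a mempolicy (e.g. mbind) is in
	 * effect; cpuset restrictions do not show up in oc->nodemask.
	 */
	if (oc->nodemask &&
	    !nodes_subset(node_states[N_MEMORY], *oc->nodemask))
		return CONSTRAINT_MEMORY_POLICY;

	/* check whether the failure was caused by the cpuset's "walls" */
	for_each_zone_zonelist_nodemask(zone, z, oc->zonelist,
					highest_zoneidx, oc->nodemask)
		if (!cpuset_zone_allowed(zone, oc->gfp_mask))
			cpuset_limited = true;

	if (cpuset_limited)
		return CONSTRAINT_CPUSET;

	return CONSTRAINT_NONE;
}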
cpuset confines processes to processor and memory node subsets.
When a process in a cpuset triggers oom, it may kill a completely
irrelevant process on another NUMA node, which will not release any
memory for this cpuset.

It seems that `CONSTRAINT_CPUSET` is not really doing much these
days. Using CONSTRAINT_CPUSET, we can easily achieve node-aware oom
killing by selecting the victim from the cpuset which triggered the oom.

Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
---
This idea comes from a previous patch:
mm, oom: Introduce per numa node oom for CONSTRAINT_MEMORY_POLICY
https://lore.kernel.org/all/YoJ%2FioXwGTdCywUE@dhcp22.suse.cz/

Any comments are welcome.
---
 include/linux/cpuset.h |  6 ++++++
 kernel/cgroup/cpuset.c | 17 +++++++++++++++++
 mm/oom_kill.c          |  4 ++++
 3 files changed, 27 insertions(+)
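A minimal way to exercise the new path, sketched under the assumption of a
cgroup v1 cpuset hierarchy mounted at /sys/fs/cgroup/cpuset (run as root;
the group name "oomtest" and node 0 are arbitrary):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static void write_file(const char *path, const char *buf)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0 || write(fd, buf, strlen(buf)) < 0) {
		perror(path);
		exit(1);
	}
	close(fd);
}

int main(void)
{
	char pid[32];

	/* confine ourselves to cpu 0 / memory node 0 */
	mkdir("/sys/fs/cgroup/cpuset/oomtest", 0755);
	write_file("/sys/fs/cgroup/cpuset/oomtest/cpuset.cpus", "0");
	write_file("/sys/fs/cgroup/cpuset/oomtest/cpuset.mems", "0");
	snprintf(pid, sizeof(pid), "%d", getpid());
	write_file("/sys/fs/cgroup/cpuset/oomtest/tasks", pid);

	/* allocate and touch memory until node 0 runs dry */
	for (;;) {
		char *p = malloc(1 << 20);

		if (!p)
			break;
		memset(p, 1, 1 << 20);
	}

	return 0;
}

With the patch applied, the resulting CONSTRAINT_CPUSET oom should select
its victim from the tasks in the oomtest cpuset rather than from the whole
system.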