Message ID | 20230410025056.22103-1-ligang.bdlg@bytedance.com (mailing list archive) |
---|---|
State | New |
Series | [v3] mm: oom: introduce cpuset oom |
On 4/9/23 22:50, Gang Li wrote:
> Cpusets constrain the CPU and Memory placement of tasks.
> `CONSTRAINT_CPUSET` type in oom has existed for a long time, but
> has never been utilized.
>
> When a process in a cpuset that constrains memory placement triggers
> oom, it may kill a completely irrelevant process on other numa nodes,
> which will not release any memory for this cpuset.
>
> We can easily achieve node aware oom by using `CONSTRAINT_CPUSET` and
> selecting victim from all cpusets with the same mems_allowed as the
> current cpuset.
>
> Example:
>
> Create two processes named mem_on_node0 and mem_on_node1 constrained
> by cpusets respectively. These two processes alloc memory on their
> own node. Now node0 has run out of memory, so OOM will be invoked by
> mem_on_node0.
>
> Before this patch:
>
> Since `CONSTRAINT_CPUSET` does nothing, the victim will be selected from
> the entire system. Therefore, the OOM is highly likely to kill
> mem_on_node1, which will not free any memory for mem_on_node0. This
> is a useless kill.
>
> ```
> [ 2786.519080] mem_on_node0 invoked oom-killer
> [ 2786.885738] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
> [ 2787.181724] [ 13432] 0 13432 787016 786745 6344704 0 0 mem_on_node1
> [ 2787.189115] [ 13457] 0 13457 787002 785504 6340608 0 0 mem_on_node0
> [ 2787.216534] oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=test,mems_allowed=0
> [ 2787.229991] Out of memory: Killed process 13432 (mem_on_node1)
> ```
>
> After this patch:
>
> The victim will be selected only in all cpusets that have the same
> mems_allowed as the cpuset that invoked oom. This will prevent
> useless kills and protect innocent victims.
>
> ```
> [ 395.922444] mem_on_node0 invoked oom-killer
> [ 396.239777] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
> [ 396.246128] [ 2614] 0 2614 1311294 1144192 9224192 0 0 mem_on_node0
> [ 396.252655] oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=test,mems_allowed=0
> [ 396.264068] Out of memory: Killed process 2614 (mem_on_node0)
> ```
>
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Cc: <cgroups@vger.kernel.org>
> Cc: <linux-mm@kvack.org>
> Cc: <rientjes@google.com>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Zefan Li <lizefan.x@bytedance.com>
> Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>

Thanks for the update.

> ---
> Changes in v3:
> - Provide more details about the use case, testing, implementation.
> - Document the userspace visible change in Documentation.
> - Rename cpuset_cgroup_scan_tasks() to cpuset_scan_tasks() and add
>   a doctext comment about its purpose and how it should be used.
> - Take cpuset_rwsem to ensure that cpusets are stable.
>
> Changes in v2:
> - https://lore.kernel.org/all/20230404115509.14299-1-ligang.bdlg@bytedance.com/
> - Select victim from all cpusets with the same mems_allowed as the current cpuset.
>   (David Rientjes <rientjes@google.com>)
>
> v1:
> - https://lore.kernel.org/all/20220921064710.89663-1-ligang.bdlg@bytedance.com/
> - Introduce cpuset oom.
> ---
>  .../admin-guide/cgroup-v1/cpusets.rst         | 14 +++++-
>  include/linux/cpuset.h                        |  6 +++
>  kernel/cgroup/cpuset.c                        | 44 +++++++++++++++++++
>  mm/oom_kill.c                                 |  4 ++
>  4 files changed, 66 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v1/cpusets.rst b/Documentation/admin-guide/cgroup-v1/cpusets.rst
> index 5d844ed4df69..d686cd47e91d 100644
> --- a/Documentation/admin-guide/cgroup-v1/cpusets.rst
> +++ b/Documentation/admin-guide/cgroup-v1/cpusets.rst
> @@ -25,7 +25,8 @@ Written by Simon.Derr@bull.net
>       1.6 What is memory spread ?
>       1.7 What is sched_load_balance ?
>       1.8 What is sched_relax_domain_level ?
> -     1.9 How do I use cpusets ?
> +     1.9 What is cpuset oom ?
> +     1.10 How do I use cpusets ?
>     2. Usage Examples and Syntax
>       2.1 Basic Usage
>       2.2 Adding/removing cpus
> @@ -607,8 +608,17 @@ If your situation is:
>    - The latency is required even it sacrifices cache hit rate etc.
>      then increasing 'sched_relax_domain_level' would benefit you.
>
> +1.9 What is cpuset oom ?
> +--------------------------
> +If there is no available memory to allocate on the nodes specified by
> +cpuset.mems, then an OOM (Out-Of-Memory) will be invoked.
> +
> +Since the victim selection is a heuristic algorithm, we cannot select
> +the "perfect" victim. Therefore, currently, the victim will be selected
> +from all the cpusets that have the same mems_allowed as the cpuset
> +which invoked OOM.

Nit: That feature is not specific to cgroup v1, as it applies to v2 as
well. Maybe you can be more specific about that.

>
> -1.9 How do I use cpusets ?
> +1.10 How do I use cpusets ?
>  --------------------------
>
>  In order to minimize the impact of cpusets on critical kernel
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 980b76a1237e..75465bf58f74 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -171,6 +171,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
>  	task_unlock(current);
>  }
>
> +int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg);
> +
>  #else /* !CONFIG_CPUSETS */
>
>  static inline bool cpusets_enabled(void) { return false; }
> @@ -287,6 +289,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
>  	return false;
>  }
>
> +static inline int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
> +{
> +	return 0;
> +}
>  #endif /* !CONFIG_CPUSETS */
>
>  #endif /* _LINUX_CPUSET_H */
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index bc4dcfd7bee5..4c51225568aa 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -4013,6 +4013,50 @@ void cpuset_print_current_mems_allowed(void)
>  	rcu_read_unlock();
>  }
>
> +/**
> + * cpuset_scan_tasks - specify the oom scan range
> + * @fn: callback function to select oom victim
> + * @arg: argument for callback function, usually a pointer to struct oom_control
> + *
> + * Description: This function is used to specify the oom scan range. Return 0 if
> + * no task is selected, otherwise return 1. The selected task will be stored in
> + * arg->chosen. This function can only be called in select_bad_process()
> + * while oc->constraint == CONSTRAINT_CPUSET.

Nit: That is not strictly correct as dump_tasks() will call this as well.

Cheers,
Longman
diff --git a/Documentation/admin-guide/cgroup-v1/cpusets.rst b/Documentation/admin-guide/cgroup-v1/cpusets.rst
index 5d844ed4df69..d686cd47e91d 100644
--- a/Documentation/admin-guide/cgroup-v1/cpusets.rst
+++ b/Documentation/admin-guide/cgroup-v1/cpusets.rst
@@ -25,7 +25,8 @@ Written by Simon.Derr@bull.net
      1.6 What is memory spread ?
      1.7 What is sched_load_balance ?
      1.8 What is sched_relax_domain_level ?
-     1.9 How do I use cpusets ?
+     1.9 What is cpuset oom ?
+     1.10 How do I use cpusets ?
    2. Usage Examples and Syntax
      2.1 Basic Usage
      2.2 Adding/removing cpus
@@ -607,8 +608,17 @@ If your situation is:
   - The latency is required even it sacrifices cache hit rate etc.
     then increasing 'sched_relax_domain_level' would benefit you.
 
+1.9 What is cpuset oom ?
+--------------------------
+If there is no available memory to allocate on the nodes specified by
+cpuset.mems, then an OOM (Out-Of-Memory) will be invoked.
+
+Since the victim selection is a heuristic algorithm, we cannot select
+the "perfect" victim. Therefore, currently, the victim will be selected
+from all the cpusets that have the same mems_allowed as the cpuset
+which invoked OOM.
 
-1.9 How do I use cpusets ?
+1.10 How do I use cpusets ?
 --------------------------
 
 In order to minimize the impact of cpusets on critical kernel
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 980b76a1237e..75465bf58f74 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -171,6 +171,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 	task_unlock(current);
 }
 
+int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg);
+
 #else /* !CONFIG_CPUSETS */
 
 static inline bool cpusets_enabled(void) { return false; }
@@ -287,6 +289,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
 	return false;
 }
 
+static inline int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
+{
+	return 0;
+}
 #endif /* !CONFIG_CPUSETS */
 
 #endif /* _LINUX_CPUSET_H */
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index bc4dcfd7bee5..4c51225568aa 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4013,6 +4013,50 @@ void cpuset_print_current_mems_allowed(void)
 	rcu_read_unlock();
 }
 
+/**
+ * cpuset_scan_tasks - specify the oom scan range
+ * @fn: callback function to select oom victim
+ * @arg: argument for callback function, usually a pointer to struct oom_control
+ *
+ * Description: This function is used to specify the oom scan range. Return 0 if
+ * no task is selected, otherwise return 1. The selected task will be stored in
+ * arg->chosen. This function can only be called in select_bad_process()
+ * while oc->constraint == CONSTRAINT_CPUSET.
+ *
+ * The selection algorithm is heuristic, therefore requires constant iteration
+ * based on user feedback. Currently, we just iterate through all cpusets with
+ * the same mems_allowed as the current cpuset.
+ */
+int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
+{
+	int ret = 0;
+	struct css_task_iter it;
+	struct task_struct *task;
+	struct cpuset *cs;
+	struct cgroup_subsys_state *pos_css;
+
+	/*
+	 * Situation gets complex with overlapping nodemasks in different cpusets.
+	 * TODO: Maybe we should calculate the "distance" between different mems_allowed.
+	 *
+	 * But for now, let's make it simple. Just iterate through all cpusets
+	 * with the same mems_allowed as the current cpuset.
+	 */
+	cpuset_read_lock();
+	rcu_read_lock();
+	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
+		if (nodes_equal(cs->mems_allowed, task_cs(current)->mems_allowed)) {
+			css_task_iter_start(&(cs->css), CSS_TASK_ITER_PROCS, &it);
+			while (!ret && (task = css_task_iter_next(&it)))
+				ret = fn(task, arg);
+			css_task_iter_end(&it);
+		}
+	}
+	rcu_read_unlock();
+	cpuset_read_unlock();
+	return ret;
+}
+
 /*
  * Collection of memory_pressure is suppressed unless
  * this flag is enabled by writing "1" to the special
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 044e1eed720e..228257788d9e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -367,6 +367,8 @@ static void select_bad_process(struct oom_control *oc)
 
 	if (is_memcg_oom(oc))
 		mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
+	else if (oc->constraint == CONSTRAINT_CPUSET)
+		cpuset_scan_tasks(oom_evaluate_task, oc);
 	else {
 		struct task_struct *p;
 
@@ -427,6 +429,8 @@ static void dump_tasks(struct oom_control *oc)
 
 	if (is_memcg_oom(oc))
 		mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
+	else if (oc->constraint == CONSTRAINT_CPUSET)
+		cpuset_scan_tasks(dump_task, oc);
 	else {
 		struct task_struct *p;
 
Cpusets constrain the CPU and Memory placement of tasks.
`CONSTRAINT_CPUSET` type in oom has existed for a long time, but
has never been utilized.

When a process in a cpuset that constrains memory placement triggers
oom, it may kill a completely irrelevant process on other numa nodes,
which will not release any memory for this cpuset.

We can easily achieve node aware oom by using `CONSTRAINT_CPUSET` and
selecting victim from all cpusets with the same mems_allowed as the
current cpuset.

Example:

Create two processes named mem_on_node0 and mem_on_node1 constrained
by cpusets respectively. These two processes alloc memory on their
own node. Now node0 has run out of memory, so OOM will be invoked by
mem_on_node0.

Before this patch:

Since `CONSTRAINT_CPUSET` does nothing, the victim will be selected from
the entire system. Therefore, the OOM is highly likely to kill
mem_on_node1, which will not free any memory for mem_on_node0. This
is a useless kill.

```
[ 2786.519080] mem_on_node0 invoked oom-killer
[ 2786.885738] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[ 2787.181724] [ 13432] 0 13432 787016 786745 6344704 0 0 mem_on_node1
[ 2787.189115] [ 13457] 0 13457 787002 785504 6340608 0 0 mem_on_node0
[ 2787.216534] oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=test,mems_allowed=0
[ 2787.229991] Out of memory: Killed process 13432 (mem_on_node1)
```

After this patch:

The victim will be selected only in all cpusets that have the same
mems_allowed as the cpuset that invoked oom. This will prevent
useless kills and protect innocent victims.

```
[ 395.922444] mem_on_node0 invoked oom-killer
[ 396.239777] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[ 396.246128] [ 2614] 0 2614 1311294 1144192 9224192 0 0 mem_on_node0
[ 396.252655] oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=test,mems_allowed=0
[ 396.264068] Out of memory: Killed process 2614 (mem_on_node0)
```

Suggested-by: Michal Hocko <mhocko@suse.com>
Cc: <cgroups@vger.kernel.org>
Cc: <linux-mm@kvack.org>
Cc: <rientjes@google.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
---
Changes in v3:
- Provide more details about the use case, testing, implementation.
- Document the userspace visible change in Documentation.
- Rename cpuset_cgroup_scan_tasks() to cpuset_scan_tasks() and add
  a doctext comment about its purpose and how it should be used.
- Take cpuset_rwsem to ensure that cpusets are stable.

Changes in v2:
- https://lore.kernel.org/all/20230404115509.14299-1-ligang.bdlg@bytedance.com/
- Select victim from all cpusets with the same mems_allowed as the current cpuset.
  (David Rientjes <rientjes@google.com>)

v1:
- https://lore.kernel.org/all/20220921064710.89663-1-ligang.bdlg@bytedance.com/
- Introduce cpuset oom.
---
 .../admin-guide/cgroup-v1/cpusets.rst         | 14 +++++-
 include/linux/cpuset.h                        |  6 +++
 kernel/cgroup/cpuset.c                        | 44 +++++++++++++++++++
 mm/oom_kill.c                                 |  4 ++
 4 files changed, 66 insertions(+), 2 deletions(-)
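
For readers who want to try the selection rule outside the kernel, here is a minimal userspace sketch of what the commit message describes: only tasks in cpusets whose mems_allowed equals that of the cpuset that hit OOM are passed to the callback. It is a model, not the patch. Every model_* name and demo_select_victim() is hypothetical, a nodemask is reduced to a plain bitmask, and the callback ranks by RSS only, which is a simplification of the kernel's scoring. It also assumes one cpuset per node; the pid and rss values are copied from the "before" log.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are kernel types. */
typedef unsigned long model_nodemask_t;	/* bit n set: NUMA node n is allowed */

struct model_task {
	int pid;
	long rss_pages;
};

struct model_cpuset {
	model_nodemask_t mems_allowed;
	const struct model_task *tasks;
	size_t nr_tasks;
};

struct model_victim {
	int pid;		/* -1 while nothing has been chosen */
	long rss_pages;
};

typedef int (*model_scan_fn)(const struct model_task *task, void *arg);

/*
 * Visit every task in every cpuset whose mems_allowed equals that of the
 * cpuset that triggered the OOM. A nonzero return from fn stops the scan
 * and is passed back to the caller.
 */
static int model_scan_tasks(const struct model_cpuset *sets, size_t nr_sets,
			    const struct model_cpuset *current_cs,
			    model_scan_fn fn, void *arg)
{
	int ret = 0;
	size_t i, j;

	assert(current_cs != NULL && fn != NULL);
	for (i = 0; i < nr_sets; i++) {
		if (sets[i].mems_allowed != current_cs->mems_allowed)
			continue;
		for (j = 0; !ret && j < sets[i].nr_tasks; j++)
			ret = fn(&sets[i].tasks[j], arg);
	}
	return ret;
}

/* Simplified scoring (an assumption): remember the task with the largest RSS. */
static int model_pick_largest_rss(const struct model_task *task, void *arg)
{
	struct model_victim *victim = arg;

	if (task->rss_pages > victim->rss_pages) {
		victim->pid = task->pid;
		victim->rss_pages = task->rss_pages;
	}
	return 0;	/* keep scanning */
}

/* Returns the pid chosen when the cpuset at index current_idx hits OOM. */
static int demo_select_victim(size_t current_idx)
{
	/* pid and rss values are taken from the "before" log. */
	static const struct model_task node0_tasks[] = { { 13457, 785504 } };
	static const struct model_task node1_tasks[] = { { 13432, 786745 } };
	const struct model_cpuset sets[] = {
		{ 1UL << 0, node0_tasks, 1 },	/* mems_allowed = node 0 */
		{ 1UL << 1, node1_tasks, 1 },	/* mems_allowed = node 1 */
	};
	struct model_victim victim = { -1, -1 };

	assert(current_idx < 2);
	model_scan_tasks(sets, 2, &sets[current_idx],
			 model_pick_largest_rss, &victim);
	return victim.pid;
}
```

Calling demo_select_victim(0) models mem_on_node0 triggering the OOM: the scan never visits the node 1 cpuset, so pid 13457 is chosen even though pid 13432 has the larger RSS. Cpusets with overlapping but unequal masks are skipped entirely, which is the limitation the TODO in the patch points at.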