| Message ID | 20230411065816.9798-1-ligang.bdlg@bytedance.com |
|---|---|
| State | New |
| Series | [v4] mm: oom: introduce cpuset oom |
Hello.

On Tue, Apr 11, 2023 at 02:58:15PM +0800, Gang Li <ligang.bdlg@bytedance.com> wrote:
> +int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
> +{
> +	int ret = 0;
> +	struct css_task_iter it;
> +	struct task_struct *task;
> +	struct cpuset *cs;
> +	struct cgroup_subsys_state *pos_css;
> +
> +	/*
> +	 * Situation gets complex with overlapping nodemasks in different cpusets.
> +	 * TODO: Maybe we should calculate the "distance" between different mems_allowed.
> +	 *
> +	 * But for now, let's make it simple. Just iterate through all cpusets
> +	 * with the same mems_allowed as the current cpuset.
> +	 */
> +	cpuset_read_lock();
> +	rcu_read_lock();
> +	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
> +		if (nodes_equal(cs->mems_allowed, task_cs(current)->mems_allowed)) {
> +			css_task_iter_start(&(cs->css), CSS_TASK_ITER_PROCS, &it);
> +			while (!ret && (task = css_task_iter_next(&it)))
> +				ret = fn(task, arg);
> +			css_task_iter_end(&it);
> +		}
> +	}
> +	rcu_read_unlock();
> +	cpuset_read_unlock();
> +	return ret;
> +}

I see this traverses all cpusets without the hierarchy actually
mattering that much. Wouldn't the CONSTRAINT_CPUSET better achieved by
globally (or per-memcg) scanning all processes and filtering with:

	nodes_intersect(current->mems_allowed, p->mems_allowed)

(`current` triggers the OOM, `p` is the iterated task)?

Thanks,
Michal
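A rough sketch of what such a global scan could look like, in the context of mm/oom_kill.c (hypothetical and not part of the patch; `oom_evaluate_task()` is the existing callback there, the wrapper name and the exact filtering are assumptions):

```c
/*
 * Hypothetical sketch of the suggestion above: scan every process and
 * filter by nodemask overlap instead of walking the cpuset hierarchy.
 * Locking follows what select_bad_process() already does for the
 * global case.
 */
static void cpuset_constrained_scan_sketch(struct oom_control *oc)
{
	struct task_struct *p;

	rcu_read_lock();
	for_each_process(p) {
		/* skip tasks whose allowed nodes cannot relieve this OOM */
		if (!nodes_intersects(current->mems_allowed, p->mems_allowed))
			continue;
		if (oom_evaluate_task(p, oc))
			break;
	}
	rcu_read_unlock();
}
```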
On 2023/4/11 20:23, Michal Koutný wrote:
> Hello.
>
> On Tue, Apr 11, 2023 at 02:58:15PM +0800, Gang Li <ligang.bdlg@bytedance.com> wrote:
>> +	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
>> +		if (nodes_equal(cs->mems_allowed, task_cs(current)->mems_allowed)) {
>> +			css_task_iter_start(&(cs->css), CSS_TASK_ITER_PROCS, &it);
>> +			while (!ret && (task = css_task_iter_next(&it)))
>> +				ret = fn(task, arg);
>> +			css_task_iter_end(&it);
>> +		}
>> +	}
>> +	rcu_read_unlock();
>> +	cpuset_read_unlock();
>> +	return ret;
>> +}
>
> I see this traverses all cpusets without the hierarchy actually
> mattering that much. Wouldn't the CONSTRAINT_CPUSET better achieved by
> globally (or per-memcg) scanning all processes and filtering with:

Oh I see, you mean scanning all processes in all cpusets and scanning
all processes globally are equivalent.

> nodes_intersect(current->mems_allowed, p->mems_allowed)

Perhaps it would be better to use nodes_equal first, and if no suitable
victim is found, then downgrade to nodes_intersect?

The NUMA balancing mechanism tends to keep memory on the same NUMA node,
and if the selected victim's memory happens to be on a node that does not
intersect with the current process's nodes, we still won't be able to
free up any memory.

In this example:

	A->mems_allowed: 0,1
	B->mems_allowed: 1,2

	nodes_intersect(A->mems_allowed, B->mems_allowed) == true

Memory distribution:

	+=======+=======+=======+
	| Node0 | Node1 | Node2 |
	+=======+=======+=======+
	|   A   |       |       |
	+-------+-------+-------+
	|       |       |   B   |
	+-------+-------+-------+

Process A invokes OOM, which then kills B. But A still can't get any
free memory on nodes 0 and 1.

> (`current` triggers the OOM, `p` is the iterated task)
> ?
>
> Thanks,
> Michal
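A minimal sketch of that two-pass fallback (hypothetical and not part of the patch; the function name is made up, and the cpuset_read_lock()/rcu_read_lock() pair that the posted code takes is omitted for brevity):

```c
/*
 * Hypothetical two-pass victim scan: first restrict to cpusets whose
 * mems_allowed equals the current one; if nothing was chosen, relax the
 * filter to any intersecting mems_allowed. Locking omitted.
 */
static int cpuset_scan_tasks_two_pass(int (*fn)(struct task_struct *, void *),
				      void *arg)
{
	struct cgroup_subsys_state *pos_css;
	struct css_task_iter it;
	struct task_struct *task;
	struct cpuset *cs;
	bool relaxed = false;
	int ret = 0;

retry:
	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
		bool match = relaxed ?
			nodes_intersects(cs->mems_allowed,
					 task_cs(current)->mems_allowed) :
			nodes_equal(cs->mems_allowed,
				    task_cs(current)->mems_allowed);

		if (!match)
			continue;
		css_task_iter_start(&cs->css, CSS_TASK_ITER_PROCS, &it);
		while (!ret && (task = css_task_iter_next(&it)))
			ret = fn(task, arg);
		css_task_iter_end(&it);
	}
	if (!ret && !relaxed) {
		/* strict pass found nothing: widen to intersecting cpusets */
		relaxed = true;
		goto retry;
	}
	return ret;
}
```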
On Tue 11-04-23 21:04:18, Gang Li wrote:
>
> On 2023/4/11 20:23, Michal Koutný wrote:
> > Hello.
> >
> > On Tue, Apr 11, 2023 at 02:58:15PM +0800, Gang Li <ligang.bdlg@bytedance.com> wrote:
> > > +	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
> > > +		if (nodes_equal(cs->mems_allowed, task_cs(current)->mems_allowed)) {
> > > +			css_task_iter_start(&(cs->css), CSS_TASK_ITER_PROCS, &it);
> > > +			while (!ret && (task = css_task_iter_next(&it)))
> > > +				ret = fn(task, arg);
> > > +			css_task_iter_end(&it);
> > > +		}
> > > +	}
> > > +	rcu_read_unlock();
> > > +	cpuset_read_unlock();
> > > +	return ret;
> > > +}
> >
> > I see this traverses all cpusets without the hierarchy actually
> > mattering that much. Wouldn't the CONSTRAINT_CPUSET better achieved by
> > globally (or per-memcg) scanning all processes and filtering with:
>
> Oh I see, you mean scanning all processes in all cpusets and scanning
> all processes globally are equivalent.

Why can't you simply select a process from the cpuset the allocating
process belongs to? I thought the whole idea was to handle well
partitioned workloads.

> > nodes_intersect(current->mems_allowed, p->mems_allowed)
>
> Perhaps it would be better to use nodes_equal first, and if no suitable
> victim is found, then downgrade to nodes_intersect?

How can this happen?

> NUMA balancing mechanism tends to keep memory on the same NUMA node, and
> if the selected victim's memory happens to be on a node that does not
> intersect with the current process's node, we still won't be able to
> free up any memory.

AFAIR NUMA balancing doesn't touch processes with memory policies.
On 2023/4/11 21:12, Michal Hocko wrote:
> On Tue 11-04-23 21:04:18, Gang Li wrote:
>>
>> On 2023/4/11 20:23, Michal Koutný wrote:
>>> Hello.
>>>
>>> On Tue, Apr 11, 2023 at 02:58:15PM +0800, Gang Li <ligang.bdlg@bytedance.com> wrote:
>>>> +	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
>>>> +		if (nodes_equal(cs->mems_allowed, task_cs(current)->mems_allowed)) {
>>>> +			css_task_iter_start(&(cs->css), CSS_TASK_ITER_PROCS, &it);
>>>> +			while (!ret && (task = css_task_iter_next(&it)))
>>>> +				ret = fn(task, arg);
>>>> +			css_task_iter_end(&it);
>>>> +		}
>>>> +	}
>>>> +	rcu_read_unlock();
>>>> +	cpuset_read_unlock();
>>>> +	return ret;
>>>> +}
>>>
>>> I see this traverses all cpusets without the hierarchy actually
>>> mattering that much. Wouldn't the CONSTRAINT_CPUSET better achieved by
>>> globally (or per-memcg) scanning all processes and filtering with:
>>
>> Oh I see, you mean scanning all processes in all cpusets and scanning
>> all processes globally are equivalent.
>
> Why can't you simply select a process from the cpuset the allocating
> process belongs to? I thought the whole idea was to handle well
> partitioned workloads.

Yes I can :) It's much easier.

>>> nodes_intersect(current->mems_allowed, p->mems_allowed)
>>
>> Perhaps it would be better to use nodes_equal first, and if no suitable
>> victim is found, then downgrade to nodes_intersect?
>
> How can this happen?
>
>> NUMA balancing mechanism tends to keep memory on the same NUMA node, and
>> if the selected victim's memory happens to be on a node that does not
>> intersect with the current process's node, we still won't be able to
>> free up any memory.
>
> AFAIR NUMA balancing doesn't touch processes with memory policies.
On Tue 11-04-23 14:58:15, Gang Li wrote:
> Cpusets constrain the CPU and Memory placement of tasks.
> The `CONSTRAINT_CPUSET` type in oom has existed for a long time, but
> has never been utilized.
>
> When a process in a cpuset which constrains memory placement triggers
> oom, it may kill a completely irrelevant process on other numa nodes,
> which will not release any memory for this cpuset.
>
> We can easily achieve node aware oom by using `CONSTRAINT_CPUSET` and
> selecting the victim from cpusets with the same mems_allowed as the
> current one.

I believe it still wouldn't hurt to be more specific here.
CONSTRAINT_CPUSET is rather obscure. Looking at this just makes my head
spin.

	/* Check this allocation failure is caused by cpuset's wall function */
	for_each_zone_zonelist_nodemask(zone, z, oc->zonelist,
			highest_zoneidx, oc->nodemask)
		if (!cpuset_zone_allowed(zone, oc->gfp_mask))
			cpuset_limited = true;

Does this even work properly and why? prepare_alloc_pages sets
oc->nodemask to current->mems_allowed but the above gives us
cpuset_limited only if there is at least one zone/node that is not
oc->nodemask compatible. So it seems like this wouldn't ever get set
unless oc->nodemask got reset somewhere. This is a maze indeed. Is there
any reason why we cannot rely on __GFP_HARDWALL here? Or should we
instead rely on the fact that the nodemask should be the same as
current->mems_allowed?

I do realize that this is not directly related to your patch, but
considering this has been mostly doing nothing, maybe we want to
document it better or even rework it on this occasion.

> Example:
>
> Create two processes named mem_on_node0 and mem_on_node1 constrained
> by cpusets respectively. These two processes alloc memory on their
> own node. Now node0 has run out of memory, and OOM will be invoked by
> mem_on_node0.

Don't you have an actual real life example with a properly partitioned
system which clearly misbehaves and this patch addresses that?
On Tue, Apr 11, 2023 at 03:12:34PM +0200, Michal Hocko <mhocko@suse.com> wrote:
> > Oh I see, you mean scanning all processes in all cpusets and scanning
> > all processes globally are equivalent.
>
> Why can't you simply select a process from the cpuset the allocating
> process belongs to? I thought the whole idea was to handle well
> partitioned workloads.

Ah, I was confused by the top_cpuset implementation.

The iteration should then start in
nearest_hardwall_ancestor(task_cs(current)) (in the 1st approximation).
The nodes_intersect/nodes_subset/nodes_equal/whatnot heuristics are
secondary.

HTH,
Michal
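A rough sketch of the narrowed walk being suggested (hypothetical; `nearest_hardwall_ancestor()` is a static helper in kernel/cgroup/cpuset.c, the function name below is made up, and locking plus any nodemask filtering are left out):

```c
/*
 * Hypothetical: anchor the victim search at the current task's nearest
 * hardwalled ancestor instead of top_cpuset, so the cpuset hierarchy
 * itself bounds which tasks can be selected. Locking omitted.
 */
static int cpuset_scan_tasks_hardwall(int (*fn)(struct task_struct *, void *),
				      void *arg)
{
	struct cgroup_subsys_state *pos_css;
	struct css_task_iter it;
	struct task_struct *task;
	struct cpuset *cs, *root_cs;
	int ret = 0;

	root_cs = nearest_hardwall_ancestor(task_cs(current));

	cpuset_for_each_descendant_pre(cs, pos_css, root_cs) {
		css_task_iter_start(&cs->css, CSS_TASK_ITER_PROCS, &it);
		while (!ret && (task = css_task_iter_next(&it)))
			ret = fn(task, arg);
		css_task_iter_end(&it);
	}
	return ret;
}
```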
Apologies for the extremely delayed response. I was previously occupied
with work unrelated to the Linux kernel.

On 2023/4/11 22:36, Michal Hocko wrote:
> I believe it still wouldn't hurt to be more specific here.
> CONSTRAINT_CPUSET is rather obscure. Looking at this just makes my head
> spin.
>
>	/* Check this allocation failure is caused by cpuset's wall function */
>	for_each_zone_zonelist_nodemask(zone, z, oc->zonelist,
>			highest_zoneidx, oc->nodemask)
>		if (!cpuset_zone_allowed(zone, oc->gfp_mask))
>			cpuset_limited = true;
>
> Does this even work properly and why? prepare_alloc_pages sets
> oc->nodemask to current->mems_allowed but the above gives us
> cpuset_limited only if there is at least one zone/node that is not
> oc->nodemask compatible. So it seems like this wouldn't ever get set
> unless oc->nodemask got reset somewhere. This is a maze indeed. Is there

In __alloc_pages:

```
	/*
	 * Restore the original nodemask if it was potentially replaced with
	 * &cpuset_current_mems_allowed to optimize the fast-path attempt.
	 */
	ac.nodemask = nodemask;

	page = __alloc_pages_slowpath(alloc_gfp, order, &ac);
```

__alloc_pages sets ac.nodemask back to the mempolicy nodemask before
calling __alloc_pages_slowpath.

> any reason why we cannot rely on __GFP_HARDWALL here? Or should we

In prepare_alloc_pages:

```
	if (cpusets_enabled()) {
		*alloc_gfp |= __GFP_HARDWALL;
		...
	}
```

Since __GFP_HARDWALL is set as long as cpuset is enabled, I think we can
use it to determine whether we are under the constraint of CPUSET.

But I have a question: why do we always set __GFP_HARDWALL when cpuset
is enabled, regardless of the value of cpuset.mem_hardwall?

Thanks,
Gang Li
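If constrained_alloc() were to rely on __GFP_HARDWALL as suggested, the check might look roughly like the sketch below. This is purely hypothetical, not existing kernel code and not part of the patch; in particular, using node_states[N_MEMORY] to decide whether the cpuset actually constrains placement is an assumption:

```c
/*
 * Hypothetical replacement for the zonelist walk in constrained_alloc():
 * the allocation was hardwalled to the cpuset, and the cpuset does not
 * cover every memory node, so treat this as a cpuset-constrained OOM.
 */
if (cpusets_enabled() && (oc->gfp_mask & __GFP_HARDWALL) &&
    !nodes_subset(node_states[N_MEMORY], cpuset_current_mems_allowed))
	return CONSTRAINT_CPUSET;
```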
On 8/17/23 04:40, Gang Li wrote:
>
> Since __GFP_HARDWALL is set as long as cpuset is enabled, I think we can
> use it to determine whether we are under the constraint of CPUSET.
>
> But I have a question: why do we always set __GFP_HARDWALL when cpuset
> is enabled, regardless of the value of cpuset.mem_hardwall?

There is no direct dependency between cpuset.mem_hardwall and the
__GFP_HARDWALL flag. When cpuset is enabled, all user memory allocations
should be subjected to the cpuset memory constraint. In the case of
non-user memory allocation, it can fall back to the node mask of an
ancestor up to the root cgroup, i.e. all memory nodes.
cpuset.mem_hardwall enables a barrier to this upward search.

Note that cpuset.mem_hardwall is a v1 feature that is not available in
cgroup v2.

Cheers,
Longman
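A condensed sketch of the behaviour described above (an approximation of the cpuset_node_allowed() logic, not verbatim kernel code; the interrupt and OOM-victim special cases, PF_EXITING handling, and locking are left out):

```c
/* Approximate decision logic for whether an allocation may use a node. */
static bool node_allowed_sketch(int node, gfp_t gfp_mask)
{
	struct cpuset *cs;

	/* the node is inside the task's own cpuset: always fine */
	if (node_isset(node, current->mems_allowed))
		return true;

	/* user allocations carry __GFP_HARDWALL: no fallback allowed */
	if (gfp_mask & __GFP_HARDWALL)
		return false;

	/*
	 * Kernel allocations may fall back to the nearest ancestor with
	 * mem_hardwall set (the root cpuset if none); this is the barrier
	 * cpuset.mem_hardwall introduces.
	 */
	cs = nearest_hardwall_ancestor(task_cs(current));
	return node_isset(node, cs->mems_allowed);
}
```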
Hi,

On 2023/8/17 16:40, Gang Li wrote:
> On 2023/4/11 22:36, Michal Hocko wrote:
>> I believe it still wouldn't hurt to be more specific here.
>> CONSTRAINT_CPUSET is rather obscure. Looking at this just makes my head
>> spin.
>>
>>	/* Check this allocation failure is caused by cpuset's wall function */
>>	for_each_zone_zonelist_nodemask(zone, z, oc->zonelist,
>>			highest_zoneidx, oc->nodemask)
>>		if (!cpuset_zone_allowed(zone, oc->gfp_mask))
>>			cpuset_limited = true;
>>
>> Does this even work properly and why? prepare_alloc_pages sets
>> oc->nodemask to current->mems_allowed but the above gives us
>> cpuset_limited only if there is at least one zone/node that is not
>> oc->nodemask compatible. So it seems like this wouldn't ever get set
>> unless oc->nodemask got reset somewhere. This is a maze indeed. Is there
>
> In __alloc_pages:
> ```
> /*
>  * Restore the original nodemask if it was potentially replaced with
>  * &cpuset_current_mems_allowed to optimize the fast-path attempt.
>  */
> ac.nodemask = nodemask;
>
> page = __alloc_pages_slowpath(alloc_gfp, order, &ac);
> ```
>
> __alloc_pages sets ac.nodemask back to the mempolicy nodemask before
> calling __alloc_pages_slowpath.
>
>> any reason why we cannot rely on __GFP_HARDWALL here? Or should we
>
> In prepare_alloc_pages:
> ```
> if (cpusets_enabled()) {
> 	*alloc_gfp |= __GFP_HARDWALL;
> 	...
> }
> ```
>
> Since __GFP_HARDWALL is set as long as cpuset is enabled, I think we can
> use it to determine whether we are under the constraint of CPUSET.

We have two nodemasks here: one from the parameters of __alloc_pages and
another from cpuset. If a node allowed by the __alloc_pages parameters is
not allowed by cpuset, it means that this page allocation is constrained
by cpuset, and thus CONSTRAINT_CPUSET can be returned.

I guess this piece of code is reasonable and we can keep it as it is.

Thanks,
Gang Li
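As a concrete (hypothetical) illustration of the two-nodemask situation described above: a task in a cpuset with cpuset.mems=0 on a two-node machine allocates without any mempolicy, so the nodemask passed to __alloc_pages is NULL. The fast path temporarily substitutes cpuset_current_mems_allowed, but the slow path restores the NULL nodemask, so the loop in constrained_alloc() also visits node 1's zones; cpuset_zone_allowed() fails there because the allocation is hardwalled and node 1 is not in mems_allowed, cpuset_limited becomes true, and CONSTRAINT_CPUSET is returned.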
diff --git a/Documentation/admin-guide/cgroup-v1/cpusets.rst b/Documentation/admin-guide/cgroup-v1/cpusets.rst
index 5d844ed4df69..51ffdc0eb167 100644
--- a/Documentation/admin-guide/cgroup-v1/cpusets.rst
+++ b/Documentation/admin-guide/cgroup-v1/cpusets.rst
@@ -25,7 +25,8 @@ Written by Simon.Derr@bull.net
     1.6 What is memory spread ?
     1.7 What is sched_load_balance ?
     1.8 What is sched_relax_domain_level ?
-    1.9 How do I use cpusets ?
+    1.9 What is cpuset oom ?
+    1.10 How do I use cpusets ?
 2. Usage Examples and Syntax
     2.1 Basic Usage
     2.2 Adding/removing cpus
@@ -607,8 +608,19 @@ If your situation is:
  - The latency is required even it sacrifices cache hit rate etc.
 then increasing 'sched_relax_domain_level' would benefit you.
 
+1.9 What is cpuset oom ?
+--------------------------
+If there is no available memory to allocate on the nodes specified by
+cpuset.mems, then an OOM (Out-Of-Memory) will be invoked.
+
+Since the victim selection is a heuristic algorithm, we cannot select
+the "perfect" victim. Therefore, currently, the victim will be selected
+from all the cpusets that have the same mems_allowed as the cpuset
+which invoked OOM.
+
+Cpuset oom works in both cgroup v1 and v2.
 
-1.9 How do I use cpusets ?
+1.10 How do I use cpusets ?
 --------------------------
 
 In order to minimize the impact of cpusets on critical kernel
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index f67c0829350b..594aa71cf441 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2199,6 +2199,10 @@ Cpuset Interface Files
 	a need to change "cpuset.mems" with active tasks, it shouldn't
 	be done frequently.
 
+	When a process invokes oom due to the constraint of cpuset.mems,
+	the victim will be selected from cpusets with the same
+	mems_allowed as the current one.
+
   cpuset.mems.effective
 	A read-only multiple values file which exists on all
 	cpuset-enabled cgroups.
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 980b76a1237e..75465bf58f74 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -171,6 +171,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 	task_unlock(current);
 }
 
+int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg);
+
 #else /* !CONFIG_CPUSETS */
 
 static inline bool cpusets_enabled(void) { return false; }
@@ -287,6 +289,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
 	return false;
 }
 
+static inline int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
+{
+	return 0;
+}
 #endif /* !CONFIG_CPUSETS */
 
 #endif /* _LINUX_CPUSET_H */
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index bc4dcfd7bee5..cb6b49245e18 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4013,6 +4013,49 @@ void cpuset_print_current_mems_allowed(void)
 	rcu_read_unlock();
 }
 
+/**
+ * cpuset_scan_tasks - specify the oom scan range
+ * @fn: callback function to select oom victim
+ * @arg: argument for callback function, usually a pointer to struct oom_control
+ *
+ * Description: This function is used to specify the oom scan range. Return 0 if
+ * no task is selected, otherwise return 1. The selected task will be stored in
+ * arg->chosen. This function can only be called in cpuset oom context.
+ *
+ * The selection algorithm is heuristic, therefore requires constant iteration
+ * based on user feedback. Currently, we just iterate through all cpusets with
+ * the same mems_allowed as the current cpuset.
+ */
+int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
+{
+	int ret = 0;
+	struct css_task_iter it;
+	struct task_struct *task;
+	struct cpuset *cs;
+	struct cgroup_subsys_state *pos_css;
+
+	/*
+	 * Situation gets complex with overlapping nodemasks in different cpusets.
+	 * TODO: Maybe we should calculate the "distance" between different mems_allowed.
+	 *
+	 * But for now, let's make it simple. Just iterate through all cpusets
+	 * with the same mems_allowed as the current cpuset.
+	 */
+	cpuset_read_lock();
+	rcu_read_lock();
+	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
+		if (nodes_equal(cs->mems_allowed, task_cs(current)->mems_allowed)) {
+			css_task_iter_start(&(cs->css), CSS_TASK_ITER_PROCS, &it);
+			while (!ret && (task = css_task_iter_next(&it)))
+				ret = fn(task, arg);
+			css_task_iter_end(&it);
+		}
+	}
+	rcu_read_unlock();
+	cpuset_read_unlock();
+	return ret;
+}
+
 /*
  * Collection of memory_pressure is suppressed unless
  * this flag is enabled by writing "1" to the special
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 044e1eed720e..228257788d9e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -367,6 +367,8 @@ static void select_bad_process(struct oom_control *oc)
 
 	if (is_memcg_oom(oc))
 		mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
+	else if (oc->constraint == CONSTRAINT_CPUSET)
+		cpuset_scan_tasks(oom_evaluate_task, oc);
 	else {
 		struct task_struct *p;
 
@@ -427,6 +429,8 @@ static void dump_tasks(struct oom_control *oc)
 
 	if (is_memcg_oom(oc))
 		mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
+	else if (oc->constraint == CONSTRAINT_CPUSET)
+		cpuset_scan_tasks(dump_task, oc);
 	else {
 		struct task_struct *p;
Cpusets constrain the CPU and Memory placement of tasks. The
`CONSTRAINT_CPUSET` type in oom has existed for a long time, but has
never been utilized.

When a process in a cpuset which constrains memory placement triggers
oom, it may kill a completely irrelevant process on other numa nodes,
which will not release any memory for this cpuset.

We can easily achieve node aware oom by using `CONSTRAINT_CPUSET` and
selecting the victim from cpusets with the same mems_allowed as the
current one.

Example:

Create two processes named mem_on_node0 and mem_on_node1 constrained by
cpusets respectively. These two processes alloc memory on their own
node. Now node0 has run out of memory, and OOM will be invoked by
mem_on_node0.

Before this patch:

Since `CONSTRAINT_CPUSET` does nothing, the victim will be selected from
the entire system. Therefore, the OOM is highly likely to kill
mem_on_node1, which will not free any memory for mem_on_node0. This is a
useless kill.

```
[ 2786.519080] mem_on_node0 invoked oom-killer
[ 2786.885738] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[ 2787.181724] [  13432]     0 13432   787016   786745  6344704        0             0 mem_on_node1
[ 2787.189115] [  13457]     0 13457   787002   785504  6340608        0             0 mem_on_node0
[ 2787.216534] oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=test,mems_allowed=0
[ 2787.229991] Out of memory: Killed process 13432 (mem_on_node1)
```

After this patch:

The victim will be selected only from the cpusets that have the same
mems_allowed as the cpuset that invoked oom. This prevents useless kills
and protects innocent victims.

```
[  395.922444] mem_on_node0 invoked oom-killer
[  396.239777] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[  396.246128] [   2614]     0  2614  1311294  1144192  9224192        0             0 mem_on_node0
[  396.252655] oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=test,mems_allowed=0
[  396.264068] Out of memory: Killed process 2614 (mem_on_node0)
```

Suggested-by: Michal Hocko <mhocko@suse.com>
Cc: <cgroups@vger.kernel.org>
Cc: <linux-mm@kvack.org>
Cc: <rientjes@google.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
---
Changes in v4:
- Modify comments and documentation.

Changes in v3:
- https://lore.kernel.org/all/20230410025056.22103-1-ligang.bdlg@bytedance.com/
- Provide more details about the use case, testing, implementation.
- Document the userspace visible change in Documentation.
- Rename cpuset_cgroup_scan_tasks() to cpuset_scan_tasks() and add a
  doctext comment about its purpose and how it should be used.
- Take cpuset_rwsem to ensure that cpusets are stable.

Changes in v2:
- https://lore.kernel.org/all/20230404115509.14299-1-ligang.bdlg@bytedance.com/
- Select victim from all cpusets with the same mems_allowed as the
  current cpuset.

v1:
- https://lore.kernel.org/all/20220921064710.89663-1-ligang.bdlg@bytedance.com/
- Introduce cpuset oom.
---
 .../admin-guide/cgroup-v1/cpusets.rst | 16 ++++++-
 Documentation/admin-guide/cgroup-v2.rst |  4 ++
 include/linux/cpuset.h                  |  6 +++
 kernel/cgroup/cpuset.c                  | 43 +++++++++++++++++++
 mm/oom_kill.c                           |  4 ++
 5 files changed, 71 insertions(+), 2 deletions(-)
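For reference, a minimal sketch of what the mem_on_nodeX test programs described above could look like (hypothetical; the original test code is not part of this posting). Each instance would be started inside a cpuset whose cpuset.mems contains a single node, and it simply allocates and touches memory until allocation fails or the OOM killer intervenes:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical "mem_on_nodeX": run inside a single-node cpuset and keep
 * allocating until malloc() fails or the OOM killer kills the process. */
int main(void)
{
	const size_t step = 128UL << 20;	/* 128 MiB per iteration */

	for (;;) {
		char *p = malloc(step);
		if (!p)
			break;
		memset(p, 0xab, step);		/* fault the pages in */
	}
	return 0;
}
```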