| Message ID | 20230411065816.9798-1-ligang.bdlg@bytedance.com |
|---|---|
| State | New |
| Series | [v4] mm: oom: introduce cpuset oom |
Hello.

On Tue, Apr 11, 2023 at 02:58:15PM +0800, Gang Li <ligang.bdlg@bytedance.com> wrote:
> +int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
> +{
> +	int ret = 0;
> +	struct css_task_iter it;
> +	struct task_struct *task;
> +	struct cpuset *cs;
> +	struct cgroup_subsys_state *pos_css;
> +
> +	/*
> +	 * Situation gets complex with overlapping nodemasks in different cpusets.
> +	 * TODO: Maybe we should calculate the "distance" between different mems_allowed.
> +	 *
> +	 * But for now, let's make it simple. Just iterate through all cpusets
> +	 * with the same mems_allowed as the current cpuset.
> +	 */
> +	cpuset_read_lock();
> +	rcu_read_lock();
> +	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
> +		if (nodes_equal(cs->mems_allowed, task_cs(current)->mems_allowed)) {
> +			css_task_iter_start(&(cs->css), CSS_TASK_ITER_PROCS, &it);
> +			while (!ret && (task = css_task_iter_next(&it)))
> +				ret = fn(task, arg);
> +			css_task_iter_end(&it);
> +		}
> +	}
> +	rcu_read_unlock();
> +	cpuset_read_unlock();
> +	return ret;
> +}

I see this traverses all cpusets without the hierarchy actually
mattering that much. Wouldn't the CONSTRAINT_CPUSET better achieved by
globally (or per-memcg) scanning all processes and filtering with:

	nodes_intersect(current->mems_allowed, p->mems_allowed)

(`current` triggers the OOM, `p` is the iterated task)?

Thanks,
Michal
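A rough sketch of what such a global scan could look like, in the context of mm/oom_kill.c (hypothetical and not part of the patch; `oom_evaluate_task()` is the existing callback there, the wrapper name and the exact filtering are assumptions):

```c
/*
 * Hypothetical sketch of the suggestion above: scan every process and
 * filter by nodemask overlap instead of walking the cpuset hierarchy.
 * Locking follows what select_bad_process() already does for the
 * global case.
 */
static void cpuset_constrained_scan_sketch(struct oom_control *oc)
{
	struct task_struct *p;

	rcu_read_lock();
	for_each_process(p) {
		/* skip tasks whose allowed nodes cannot relieve this OOM */
		if (!nodes_intersects(current->mems_allowed, p->mems_allowed))
			continue;
		if (oom_evaluate_task(p, oc))
			break;
	}
	rcu_read_unlock();
}
```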
On 2023/4/11 20:23, Michal Koutný wrote:
> Hello.
>
> On Tue, Apr 11, 2023 at 02:58:15PM +0800, Gang Li <ligang.bdlg@bytedance.com> wrote:
>> +	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
>> +		if (nodes_equal(cs->mems_allowed, task_cs(current)->mems_allowed)) {
>> +			css_task_iter_start(&(cs->css), CSS_TASK_ITER_PROCS, &it);
>> +			while (!ret && (task = css_task_iter_next(&it)))
>> +				ret = fn(task, arg);
>> +			css_task_iter_end(&it);
>> +		}
>> +	}
>> +	rcu_read_unlock();
>> +	cpuset_read_unlock();
>> +	return ret;
>> +}
>
> I see this traverses all cpusets without the hierarchy actually
> mattering that much. Wouldn't the CONSTRAINT_CPUSET better achieved by
> globally (or per-memcg) scanning all processes and filtering with:

Oh I see, you mean scanning all processes in all cpusets and scanning
all processes globally are equivalent.

> nodes_intersect(current->mems_allowed, p->mems_allowed)

Perhaps it would be better to use nodes_equal first, and if no suitable
victim is found, then downgrade to nodes_intersect?

The NUMA balancing mechanism tends to keep memory on the same NUMA node,
and if the selected victim's memory happens to be on a node that does not
intersect with the current process's nodes, we still won't be able to
free up any memory.

In this example:

	A->mems_allowed: 0,1
	B->mems_allowed: 1,2

	nodes_intersect(A->mems_allowed, B->mems_allowed) == true

Memory distribution:

	+=======+=======+=======+
	| Node0 | Node1 | Node2 |
	+=======+=======+=======+
	|   A   |       |       |
	+-------+-------+-------+
	|       |       |   B   |
	+-------+-------+-------+

Process A invokes OOM, which then kills B. But A still can't get any
free memory on nodes 0 and 1.

> (`current` triggers the OOM, `p` is the iterated task)
> ?
>
> Thanks,
> Michal
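A minimal sketch of that two-pass fallback (hypothetical and not part of the patch; the function name is made up, and the cpuset_read_lock()/rcu_read_lock() pair that the posted code takes is omitted for brevity):

```c
/*
 * Hypothetical two-pass victim scan: first restrict to cpusets whose
 * mems_allowed equals the current one; if nothing was chosen, relax the
 * filter to any intersecting mems_allowed. Locking omitted.
 */
static int cpuset_scan_tasks_two_pass(int (*fn)(struct task_struct *, void *),
				      void *arg)
{
	struct cgroup_subsys_state *pos_css;
	struct css_task_iter it;
	struct task_struct *task;
	struct cpuset *cs;
	bool relaxed = false;
	int ret = 0;

retry:
	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
		bool match = relaxed ?
			nodes_intersects(cs->mems_allowed,
					 task_cs(current)->mems_allowed) :
			nodes_equal(cs->mems_allowed,
				    task_cs(current)->mems_allowed);

		if (!match)
			continue;
		css_task_iter_start(&cs->css, CSS_TASK_ITER_PROCS, &it);
		while (!ret && (task = css_task_iter_next(&it)))
			ret = fn(task, arg);
		css_task_iter_end(&it);
	}
	if (!ret && !relaxed) {
		/* strict pass found nothing: widen to intersecting cpusets */
		relaxed = true;
		goto retry;
	}
	return ret;
}
```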
On Tue 11-04-23 21:04:18, Gang Li wrote:
>
> On 2023/4/11 20:23, Michal Koutný wrote:
> > Hello.
> >
> > On Tue, Apr 11, 2023 at 02:58:15PM +0800, Gang Li <ligang.bdlg@bytedance.com> wrote:
> > > +	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
> > > +		if (nodes_equal(cs->mems_allowed, task_cs(current)->mems_allowed)) {
> > > +			css_task_iter_start(&(cs->css), CSS_TASK_ITER_PROCS, &it);
> > > +			while (!ret && (task = css_task_iter_next(&it)))
> > > +				ret = fn(task, arg);
> > > +			css_task_iter_end(&it);
> > > +		}
> > > +	}
> > > +	rcu_read_unlock();
> > > +	cpuset_read_unlock();
> > > +	return ret;
> > > +}
> >
> > I see this traverses all cpusets without the hierarchy actually
> > mattering that much. Wouldn't the CONSTRAINT_CPUSET better achieved by
> > globally (or per-memcg) scanning all processes and filtering with:
>
> Oh I see, you mean scanning all processes in all cpusets and scanning
> all processes globally are equivalent.

Why can't you simply select a process from the cpuset the allocating
process belongs to? I thought the whole idea was to handle well
partitioned workloads.

> > nodes_intersect(current->mems_allowed, p->mems_allowed)
>
> Perhaps it would be better to use nodes_equal first, and if no suitable
> victim is found, then downgrade to nodes_intersect?

How can this happen?

> NUMA balancing mechanism tends to keep memory on the same NUMA node, and
> if the selected victim's memory happens to be on a node that does not
> intersect with the current process's node, we still won't be able to
> free up any memory.

AFAIR NUMA balancing doesn't touch processes with memory policies.
On 2023/4/11 21:12, Michal Hocko wrote:
> On Tue 11-04-23 21:04:18, Gang Li wrote:
>>
>> On 2023/4/11 20:23, Michal Koutný wrote:
>>> Hello.
>>>
>>> On Tue, Apr 11, 2023 at 02:58:15PM +0800, Gang Li <ligang.bdlg@bytedance.com> wrote:
>>>> +	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
>>>> +		if (nodes_equal(cs->mems_allowed, task_cs(current)->mems_allowed)) {
>>>> +			css_task_iter_start(&(cs->css), CSS_TASK_ITER_PROCS, &it);
>>>> +			while (!ret && (task = css_task_iter_next(&it)))
>>>> +				ret = fn(task, arg);
>>>> +			css_task_iter_end(&it);
>>>> +		}
>>>> +	}
>>>> +	rcu_read_unlock();
>>>> +	cpuset_read_unlock();
>>>> +	return ret;
>>>> +}
>>>
>>> I see this traverses all cpusets without the hierarchy actually
>>> mattering that much. Wouldn't the CONSTRAINT_CPUSET better achieved by
>>> globally (or per-memcg) scanning all processes and filtering with:
>>
>> Oh I see, you mean scanning all processes in all cpusets and scanning
>> all processes globally are equivalent.
>
> Why can't you simply select a process from the cpuset the allocating
> process belongs to? I thought the whole idea was to handle well
> partitioned workloads.

Yes I can :) It's much easier.

>>> nodes_intersect(current->mems_allowed, p->mems_allowed)
>>
>> Perhaps it would be better to use nodes_equal first, and if no suitable
>> victim is found, then downgrade to nodes_intersect?
>
> How can this happen?
>
>> NUMA balancing mechanism tends to keep memory on the same NUMA node, and
>> if the selected victim's memory happens to be on a node that does not
>> intersect with the current process's node, we still won't be able to
>> free up any memory.
>
> AFAIR NUMA balancing doesn't touch processes with memory policies.
On Tue 11-04-23 14:58:15, Gang Li wrote:
> Cpusets constrain the CPU and Memory placement of tasks.
> The `CONSTRAINT_CPUSET` type in oom has existed for a long time, but
> has never been utilized.
>
> When a process in a cpuset which constrains memory placement triggers
> oom, it may kill a completely irrelevant process on other numa nodes,
> which will not release any memory for this cpuset.
>
> We can easily achieve node aware oom by using `CONSTRAINT_CPUSET` and
> selecting the victim from cpusets with the same mems_allowed as the
> current one.

I believe it still wouldn't hurt to be more specific here.
CONSTRAINT_CPUSET is rather obscure. Looking at this just makes my head
spin.

	/* Check this allocation failure is caused by cpuset's wall function */
	for_each_zone_zonelist_nodemask(zone, z, oc->zonelist,
			highest_zoneidx, oc->nodemask)
		if (!cpuset_zone_allowed(zone, oc->gfp_mask))
			cpuset_limited = true;

Does this even work properly and why? prepare_alloc_pages sets
oc->nodemask to current->mems_allowed but the above gives us
cpuset_limited only if there is at least one zone/node that is not
oc->nodemask compatible. So it seems like this wouldn't ever get set
unless oc->nodemask got reset somewhere. This is a maze indeed. Is there
any reason why we cannot rely on __GFP_HARDWALL here? Or should we
instead rely on the fact that the nodemask should be the same as
current->mems_allowed?

I do realize that this is not directly related to your patch, but
considering this has been mostly doing nothing, maybe we want to
document it better or even rework it on this occasion.

> Example:
>
> Create two processes named mem_on_node0 and mem_on_node1 constrained
> by cpusets respectively. These two processes alloc memory on their
> own node. Now node0 has run out of memory, and OOM will be invoked by
> mem_on_node0.

Don't you have an actual real life example with a properly partitioned
system which clearly misbehaves and this patch addresses that?
On Tue, Apr 11, 2023 at 03:12:34PM +0200, Michal Hocko <mhocko@suse.com> wrote:
> > Oh I see, you mean scanning all processes in all cpusets and scanning
> > all processes globally are equivalent.
>
> Why can't you simply select a process from the cpuset the allocating
> process belongs to? I thought the whole idea was to handle well
> partitioned workloads.

Ah, I was confused by the top_cpuset implementation.

The iteration should then start in
nearest_hardwall_ancestor(task_cs(current)) (in the 1st approximation).
The nodes_intersect/nodes_subset/nodes_equal/whatnot heuristics are
secondary.

HTH,
Michal
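A rough sketch of the narrowed walk being suggested (hypothetical; `nearest_hardwall_ancestor()` is a static helper in kernel/cgroup/cpuset.c, the function name below is made up, and locking plus any nodemask filtering are left out):

```c
/*
 * Hypothetical: anchor the victim search at the current task's nearest
 * hardwalled ancestor instead of top_cpuset, so the cpuset hierarchy
 * itself bounds which tasks can be selected. Locking omitted.
 */
static int cpuset_scan_tasks_hardwall(int (*fn)(struct task_struct *, void *),
				      void *arg)
{
	struct cgroup_subsys_state *pos_css;
	struct css_task_iter it;
	struct task_struct *task;
	struct cpuset *cs, *root_cs;
	int ret = 0;

	root_cs = nearest_hardwall_ancestor(task_cs(current));

	cpuset_for_each_descendant_pre(cs, pos_css, root_cs) {
		css_task_iter_start(&cs->css, CSS_TASK_ITER_PROCS, &it);
		while (!ret && (task = css_task_iter_next(&it)))
			ret = fn(task, arg);
		css_task_iter_end(&it);
	}
	return ret;
}
```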
Apologies for the extremely delayed response. I was previously occupied
with work unrelated to the Linux kernel.

On 2023/4/11 22:36, Michal Hocko wrote:
> I believe it still wouldn't hurt to be more specific here.
> CONSTRAINT_CPUSET is rather obscure. Looking at this just makes my head
> spin.
>
>	/* Check this allocation failure is caused by cpuset's wall function */
>	for_each_zone_zonelist_nodemask(zone, z, oc->zonelist,
>			highest_zoneidx, oc->nodemask)
>		if (!cpuset_zone_allowed(zone, oc->gfp_mask))
>			cpuset_limited = true;
>
> Does this even work properly and why? prepare_alloc_pages sets
> oc->nodemask to current->mems_allowed but the above gives us
> cpuset_limited only if there is at least one zone/node that is not
> oc->nodemask compatible. So it seems like this wouldn't ever get set
> unless oc->nodemask got reset somewhere. This is a maze indeed. Is there

In __alloc_pages:

```
	/*
	 * Restore the original nodemask if it was potentially replaced with
	 * &cpuset_current_mems_allowed to optimize the fast-path attempt.
	 */
	ac.nodemask = nodemask;

	page = __alloc_pages_slowpath(alloc_gfp, order, &ac);
```

__alloc_pages sets ac.nodemask back to the mempolicy nodemask before
calling __alloc_pages_slowpath.

> any reason why we cannot rely on __GFP_HARDWALL here? Or should we

In prepare_alloc_pages:

```
	if (cpusets_enabled()) {
		*alloc_gfp |= __GFP_HARDWALL;
		...
	}
```

Since __GFP_HARDWALL is set as long as cpuset is enabled, I think we can
use it to determine whether we are under the constraint of CPUSET.

But I have a question: why do we always set __GFP_HARDWALL when cpuset
is enabled, regardless of the value of cpuset.mem_hardwall?

Thanks,
Gang Li
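If constrained_alloc() were to rely on __GFP_HARDWALL as suggested, the check might look roughly like the sketch below. This is purely hypothetical, not existing kernel code and not part of the patch; in particular, using node_states[N_MEMORY] to decide whether the cpuset actually constrains placement is an assumption:

```c
/*
 * Hypothetical replacement for the zonelist walk in constrained_alloc():
 * the allocation was hardwalled to the cpuset, and the cpuset does not
 * cover every memory node, so treat this as a cpuset-constrained OOM.
 */
if (cpusets_enabled() && (oc->gfp_mask & __GFP_HARDWALL) &&
    !nodes_subset(node_states[N_MEMORY], cpuset_current_mems_allowed))
	return CONSTRAINT_CPUSET;
```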
On 8/17/23 04:40, Gang Li wrote:
>
> Since __GFP_HARDWALL is set as long as cpuset is enabled, I think we can
> use it to determine whether we are under the constraint of CPUSET.
>
> But I have a question: why do we always set __GFP_HARDWALL when cpuset
> is enabled, regardless of the value of cpuset.mem_hardwall?

There is no direct dependency between cpuset.mem_hardwall and the
__GFP_HARDWALL flag. When cpuset is enabled, all user memory allocations
should be subjected to the cpuset memory constraint. In the case of
non-user memory allocation, it can fall back to the node mask of an
ancestor up to the root cgroup, i.e. all memory nodes.
cpuset.mem_hardwall enables a barrier to this upward search.

Note that cpuset.mem_hardwall is a v1 feature that is not available in
cgroup v2.

Cheers,
Longman
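A condensed sketch of the behaviour described above (an approximation of the cpuset_node_allowed() logic, not verbatim kernel code; the interrupt and OOM-victim special cases, PF_EXITING handling, and locking are left out):

```c
/* Approximate decision logic for whether an allocation may use a node. */
static bool node_allowed_sketch(int node, gfp_t gfp_mask)
{
	struct cpuset *cs;

	/* the node is inside the task's own cpuset: always fine */
	if (node_isset(node, current->mems_allowed))
		return true;

	/* user allocations carry __GFP_HARDWALL: no fallback allowed */
	if (gfp_mask & __GFP_HARDWALL)
		return false;

	/*
	 * Kernel allocations may fall back to the nearest ancestor with
	 * mem_hardwall set (the root cpuset if none); this is the barrier
	 * cpuset.mem_hardwall introduces.
	 */
	cs = nearest_hardwall_ancestor(task_cs(current));
	return node_isset(node, cs->mems_allowed);
}
```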
Hi,

On 2023/8/17 16:40, Gang Li wrote:
> On 2023/4/11 22:36, Michal Hocko wrote:
>> I believe it still wouldn't hurt to be more specific here.
>> CONSTRAINT_CPUSET is rather obscure. Looking at this just makes my head
>> spin.
>>
>>	/* Check this allocation failure is caused by cpuset's wall function */
>>	for_each_zone_zonelist_nodemask(zone, z, oc->zonelist,
>>			highest_zoneidx, oc->nodemask)
>>		if (!cpuset_zone_allowed(zone, oc->gfp_mask))
>>			cpuset_limited = true;
>>
>> Does this even work properly and why? prepare_alloc_pages sets
>> oc->nodemask to current->mems_allowed but the above gives us
>> cpuset_limited only if there is at least one zone/node that is not
>> oc->nodemask compatible. So it seems like this wouldn't ever get set
>> unless oc->nodemask got reset somewhere. This is a maze indeed. Is there
>
> In __alloc_pages:
> ```
> /*
>  * Restore the original nodemask if it was potentially replaced with
>  * &cpuset_current_mems_allowed to optimize the fast-path attempt.
>  */
> ac.nodemask = nodemask;
>
> page = __alloc_pages_slowpath(alloc_gfp, order, &ac);
> ```
>
> __alloc_pages sets ac.nodemask back to the mempolicy nodemask before
> calling __alloc_pages_slowpath.
>
>> any reason why we cannot rely on __GFP_HARDWALL here? Or should we
>
> In prepare_alloc_pages:
> ```
> if (cpusets_enabled()) {
> 	*alloc_gfp |= __GFP_HARDWALL;
> 	...
> }
> ```
>
> Since __GFP_HARDWALL is set as long as cpuset is enabled, I think we can
> use it to determine whether we are under the constraint of CPUSET.

We have two nodemasks here: one from the parameters of __alloc_pages and
another from cpuset. If a node allowed by the __alloc_pages parameters is
not allowed by cpuset, it means that this page allocation is constrained
by cpuset, and thus CONSTRAINT_CPUSET can be returned.

I guess this piece of code is reasonable and we can keep it as it is.

Thanks,
Gang Li
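As a concrete (hypothetical) illustration of the two-nodemask situation described above: a task in a cpuset with cpuset.mems=0 on a two-node machine allocates without any mempolicy, so the nodemask passed to __alloc_pages is NULL. The fast path temporarily substitutes cpuset_current_mems_allowed, but the slow path restores the NULL nodemask, so the loop in constrained_alloc() also visits node 1's zones; cpuset_zone_allowed() fails there because the allocation is hardwalled and node 1 is not in mems_allowed, cpuset_limited becomes true, and CONSTRAINT_CPUSET is returned.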
diff --git a/Documentation/admin-guide/cgroup-v1/cpusets.rst b/Documentation/admin-guide/cgroup-v1/cpusets.rst
index 5d844ed4df69..51ffdc0eb167 100644
--- a/Documentation/admin-guide/cgroup-v1/cpusets.rst
+++ b/Documentation/admin-guide/cgroup-v1/cpusets.rst
@@ -25,7 +25,8 @@ Written by Simon.Derr@bull.net
     1.6 What is memory spread ?
     1.7 What is sched_load_balance ?
     1.8 What is sched_relax_domain_level ?
-    1.9 How do I use cpusets ?
+    1.9 What is cpuset oom ?
+    1.10 How do I use cpusets ?
 2. Usage Examples and Syntax
     2.1 Basic Usage
     2.2 Adding/removing cpus
@@ -607,8 +608,19 @@ If your situation is:
  - The latency is required even it sacrifices cache hit rate etc.
 then increasing 'sched_relax_domain_level' would benefit you.
 
+1.9 What is cpuset oom ?
+--------------------------
+If there is no available memory to allocate on the nodes specified by
+cpuset.mems, then an OOM (Out-Of-Memory) will be invoked.
+
+Since the victim selection is a heuristic algorithm, we cannot select
+the "perfect" victim. Therefore, currently, the victim will be selected
+from all the cpusets that have the same mems_allowed as the cpuset
+which invoked OOM.
+
+Cpuset oom works in both cgroup v1 and v2.
 
-1.9 How do I use cpusets ?
+1.10 How do I use cpusets ?
 --------------------------
 
 In order to minimize the impact of cpusets on critical kernel
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index f67c0829350b..594aa71cf441 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2199,6 +2199,10 @@ Cpuset Interface Files
 	a need to change "cpuset.mems" with active tasks, it shouldn't
 	be done frequently.
 
+	When a process invokes oom due to the constraint of cpuset.mems,
+	the victim will be selected from cpusets with the same
+	mems_allowed as the current one.
+
   cpuset.mems.effective
 	A read-only multiple values file which exists on all
 	cpuset-enabled cgroups.
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 980b76a1237e..75465bf58f74 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -171,6 +171,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 	task_unlock(current);
 }
 
+int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg);
+
 #else /* !CONFIG_CPUSETS */
 
 static inline bool cpusets_enabled(void) { return false; }
@@ -287,6 +289,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
 	return false;
 }
 
+static inline int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
+{
+	return 0;
+}
 #endif /* !CONFIG_CPUSETS */
 
 #endif /* _LINUX_CPUSET_H */
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index bc4dcfd7bee5..cb6b49245e18 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4013,6 +4013,49 @@ void cpuset_print_current_mems_allowed(void)
 	rcu_read_unlock();
 }
 
+/**
+ * cpuset_scan_tasks - specify the oom scan range
+ * @fn: callback function to select oom victim
+ * @arg: argument for callback function, usually a pointer to struct oom_control
+ *
+ * Description: This function is used to specify the oom scan range. Return 0 if
+ * no task is selected, otherwise return 1. The selected task will be stored in
+ * arg->chosen. This function can only be called in cpuset oom context.
+ *
+ * The selection algorithm is heuristic, therefore requires constant iteration
+ * based on user feedback. Currently, we just iterate through all cpusets with
+ * the same mems_allowed as the current cpuset.
+ */
+int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
+{
+	int ret = 0;
+	struct css_task_iter it;
+	struct task_struct *task;
+	struct cpuset *cs;
+	struct cgroup_subsys_state *pos_css;
+
+	/*
+	 * Situation gets complex with overlapping nodemasks in different cpusets.
+	 * TODO: Maybe we should calculate the "distance" between different mems_allowed.
+	 *
+	 * But for now, let's make it simple. Just iterate through all cpusets
+	 * with the same mems_allowed as the current cpuset.
+	 */
+	cpuset_read_lock();
+	rcu_read_lock();
+	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
+		if (nodes_equal(cs->mems_allowed, task_cs(current)->mems_allowed)) {
+			css_task_iter_start(&(cs->css), CSS_TASK_ITER_PROCS, &it);
+			while (!ret && (task = css_task_iter_next(&it)))
+				ret = fn(task, arg);
+			css_task_iter_end(&it);
+		}
+	}
+	rcu_read_unlock();
+	cpuset_read_unlock();
+	return ret;
+}
+
 /*
  * Collection of memory_pressure is suppressed unless
  * this flag is enabled by writing "1" to the special
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 044e1eed720e..228257788d9e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -367,6 +367,8 @@ static void select_bad_process(struct oom_control *oc)
 
 	if (is_memcg_oom(oc))
 		mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
+	else if (oc->constraint == CONSTRAINT_CPUSET)
+		cpuset_scan_tasks(oom_evaluate_task, oc);
 	else {
 		struct task_struct *p;
 
@@ -427,6 +429,8 @@ static void dump_tasks(struct oom_control *oc)
 
 	if (is_memcg_oom(oc))
 		mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
+	else if (oc->constraint == CONSTRAINT_CPUSET)
+		cpuset_scan_tasks(dump_task, oc);
 	else {
 		struct task_struct *p;
Cpusets constrain the CPU and Memory placement of tasks. The
`CONSTRAINT_CPUSET` type in oom has existed for a long time, but has
never been utilized.

When a process in a cpuset which constrains memory placement triggers
oom, it may kill a completely irrelevant process on other numa nodes,
which will not release any memory for this cpuset.

We can easily achieve node aware oom by using `CONSTRAINT_CPUSET` and
selecting the victim from cpusets with the same mems_allowed as the
current one.

Example:

Create two processes named mem_on_node0 and mem_on_node1 constrained by
cpusets respectively. These two processes alloc memory on their own
node. Now node0 has run out of memory, and OOM will be invoked by
mem_on_node0.

Before this patch:

Since `CONSTRAINT_CPUSET` does nothing, the victim will be selected from
the entire system. Therefore, the OOM is highly likely to kill
mem_on_node1, which will not free any memory for mem_on_node0. This is a
useless kill.

```
[ 2786.519080] mem_on_node0 invoked oom-killer
[ 2786.885738] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[ 2787.181724] [  13432]     0 13432   787016   786745  6344704        0             0 mem_on_node1
[ 2787.189115] [  13457]     0 13457   787002   785504  6340608        0             0 mem_on_node0
[ 2787.216534] oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=test,mems_allowed=0
[ 2787.229991] Out of memory: Killed process 13432 (mem_on_node1)
```

After this patch:

The victim will be selected only from the cpusets that have the same
mems_allowed as the cpuset that invoked oom. This prevents useless kills
and protects innocent victims.

```
[  395.922444] mem_on_node0 invoked oom-killer
[  396.239777] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[  396.246128] [   2614]     0  2614  1311294  1144192  9224192        0             0 mem_on_node0
[  396.252655] oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=test,mems_allowed=0
[  396.264068] Out of memory: Killed process 2614 (mem_on_node0)
```

Suggested-by: Michal Hocko <mhocko@suse.com>
Cc: <cgroups@vger.kernel.org>
Cc: <linux-mm@kvack.org>
Cc: <rientjes@google.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
---
Changes in v4:
- Modify comments and documentation.

Changes in v3:
- https://lore.kernel.org/all/20230410025056.22103-1-ligang.bdlg@bytedance.com/
- Provide more details about the use case, testing, implementation.
- Document the userspace visible change in Documentation.
- Rename cpuset_cgroup_scan_tasks() to cpuset_scan_tasks() and add a
  doctext comment about its purpose and how it should be used.
- Take cpuset_rwsem to ensure that cpusets are stable.

Changes in v2:
- https://lore.kernel.org/all/20230404115509.14299-1-ligang.bdlg@bytedance.com/
- Select victim from all cpusets with the same mems_allowed as the
  current cpuset.

v1:
- https://lore.kernel.org/all/20220921064710.89663-1-ligang.bdlg@bytedance.com/
- Introduce cpuset oom.
---
 .../admin-guide/cgroup-v1/cpusets.rst | 16 ++++++-
 Documentation/admin-guide/cgroup-v2.rst |  4 ++
 include/linux/cpuset.h                  |  6 +++
 kernel/cgroup/cpuset.c                  | 43 +++++++++++++++++++
 mm/oom_kill.c                           |  4 ++
 5 files changed, 71 insertions(+), 2 deletions(-)
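For reference, a minimal sketch of what the mem_on_nodeX test programs described above could look like (hypothetical; the original test code is not part of this posting). Each instance would be started inside a cpuset whose cpuset.mems contains a single node, and it simply allocates and touches memory until allocation fails or the OOM killer intervenes:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical "mem_on_nodeX": run inside a single-node cpuset and keep
 * allocating until malloc() fails or the OOM killer kills the process. */
int main(void)
{
	const size_t step = 128UL << 20;	/* 128 MiB per iteration */

	for (;;) {
		char *p = malloc(step);
		if (!p)
			break;
		memset(p, 0xab, step);		/* fault the pages in */
	}
	return 0;
}
```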