diff mbox series

[v5] mm: oom: introduce cpuset oom

Message ID 20230411134539.45046-1-ligang.bdlg@bytedance.com (mailing list archive)
State New
Headers show
Series [v5] mm: oom: introduce cpuset oom | expand

Commit Message

Gang Li April 11, 2023, 1:45 p.m. UTC
Cpusets constrain the CPU and Memory placement of tasks.
`CONSTRAINT_CPUSET` type in oom  has existed for a long time, but
has never been utilized.

When a process in cpuset which constrain memory placement triggers
oom, it may kill a completely irrelevant process on other numa nodes,
which will not release any memory for this cpuset.

We can easily achieve node aware oom by using `CONSTRAINT_CPUSET` and
selecting victim from cpuset the allocating process belongs to.

Example:

Create two processes named mem_on_node0 and mem_on_node1 constrained
by cpusets respectively. These two processes alloc memory on their
own node. Now node0 has run out of memory, OOM will be invokled by
mem_on_node0.

Before this patch:

Since `CONSTRAINT_CPUSET` do nothing, the victim will be selected from
the entire system. Therefore, the OOM is highly likely to kill
mem_on_node1, which will not free any memory for mem_on_node0. This
is a useless kill.

```
[ 2786.519080] mem_on_node0 invoked oom-killer
[ 2786.885738] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[ 2787.181724] [  13432]     0 13432   787016   786745  6344704        0             0 mem_on_node1
[ 2787.189115] [  13457]     0 13457   787002   785504  6340608        0             0 mem_on_node0
[ 2787.216534] oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=test,mems_allowed=0
[ 2787.229991] Out of memory: Killed process 13432 (mem_on_node1)
```

After this patch:

The victim will be selected only in mem_on_node0's own cpuset. This will
prevent useless kill and protect innocent victims.

```
[  395.922444] mem_on_node0 invoked oom-killer
[  396.239777] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[  396.246128] [   2614]     0  2614  1311294  1144192  9224192        0             0 mem_on_node0
[  396.252655] oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=test,mems_allowed=0
[  396.264068] Out of memory: Killed process 2614 (mem_on_node0)
```

Suggested-by: Michal Hocko <mhocko@suse.com>
Cc: <cgroups@vger.kernel.org>
Cc: <linux-mm@kvack.org>
Cc: <rientjes@google.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
---
Changes in v5:
- Select victim in the cpuset the allocating process belongs to.

Changes in v4:
- https://lore.kernel.org/all/20230411065816.9798-1-ligang.bdlg@bytedance.com/
- Modify comments and documentation.

Changes in v3:
- https://lore.kernel.org/all/20230410025056.22103-1-ligang.bdlg@bytedance.com/
- Provide more details about the use case, testing, implementation.
- Document the userspace visible change in Documentation.
- Rename cpuset_cgroup_scan_tasks() to cpuset_scan_tasks() and add
  a doctext comment about its purpose and how it should be used.
- Take cpuset_rwsem to ensure that cpusets are stable.

Changes in v2:
- https://lore.kernel.org/all/20230404115509.14299-1-ligang.bdlg@bytedance.com/
- Select victim from all cpusets with the same mems_allowed as the current cpuset.

v1:
- https://lore.kernel.org/all/20220921064710.89663-1-ligang.bdlg@bytedance.com/
- Introduce cpuset oom.
---
 .../admin-guide/cgroup-v1/cpusets.rst         | 15 ++++++--
 Documentation/admin-guide/cgroup-v2.rst       |  4 +++
 include/linux/cpuset.h                        |  6 ++++
 kernel/cgroup/cpuset.c                        | 34 +++++++++++++++++++
 mm/oom_kill.c                                 |  4 +++
 5 files changed, 61 insertions(+), 2 deletions(-)
diff mbox series

Patch

diff --git a/Documentation/admin-guide/cgroup-v1/cpusets.rst b/Documentation/admin-guide/cgroup-v1/cpusets.rst
index 5d844ed4df69..57bc15782d56 100644
--- a/Documentation/admin-guide/cgroup-v1/cpusets.rst
+++ b/Documentation/admin-guide/cgroup-v1/cpusets.rst
@@ -25,7 +25,8 @@  Written by Simon.Derr@bull.net
      1.6 What is memory spread ?
      1.7 What is sched_load_balance ?
      1.8 What is sched_relax_domain_level ?
-     1.9 How do I use cpusets ?
+     1.9 What is cpuset oom ?
+     1.10 How do I use cpusets ?
    2. Usage Examples and Syntax
      2.1 Basic Usage
      2.2 Adding/removing cpus
@@ -607,8 +608,18 @@  If your situation is:
  - The latency is required even it sacrifices cache hit rate etc.
    then increasing 'sched_relax_domain_level' would benefit you.
 
+1.9 What is cpuset oom ?
+--------------------------
+If there is no available memory to allocate on the nodes specified by
+cpuset.mems, then an OOM (Out-Of-Memory) will be invoked.
+
+Since the victim selection is a heuristic algorithm, we cannot select
+the "perfect" victim. So just select a process from the cpuset the
+allocating process belongs to.
+
+Cpuset oom works in both cgroup v1 and v2.
 
-1.9 How do I use cpusets ?
+1.10 How do I use cpusets ?
 --------------------------
 
 In order to minimize the impact of cpusets on critical kernel
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index f67c0829350b..5db84fb4f1cc 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2199,6 +2199,10 @@  Cpuset Interface Files
 	a need to change "cpuset.mems" with active tasks, it shouldn't
 	be done frequently.
 
+	When a process invokes oom due to the constraint of cpuset.mems,
+	the victim will be selected from cpuset the allocating process
+	belongs to.
+
   cpuset.mems.effective
 	A read-only multiple values file which exists on all
 	cpuset-enabled cgroups.
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 980b76a1237e..75465bf58f74 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -171,6 +171,8 @@  static inline void set_mems_allowed(nodemask_t nodemask)
 	task_unlock(current);
 }
 
+int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg);
+
 #else /* !CONFIG_CPUSETS */
 
 static inline bool cpusets_enabled(void) { return false; }
@@ -287,6 +289,10 @@  static inline bool read_mems_allowed_retry(unsigned int seq)
 	return false;
 }
 
+static inline int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
+{
+	return 0;
+}
 #endif /* !CONFIG_CPUSETS */
 
 #endif /* _LINUX_CPUSET_H */
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index bc4dcfd7bee5..624454368605 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4013,6 +4013,40 @@  void cpuset_print_current_mems_allowed(void)
 	rcu_read_unlock();
 }
 
+/**
+ * cpuset_scan_tasks - specify the oom scan range
+ * @fn: callback function to select oom victim
+ * @arg: argument for callback function, usually a pointer to struct oom_control
+ *
+ * Description: This function is used to specify the oom scan range. Return 0 if
+ * no task is selected, otherwise return 1. The selected task will be stored in
+ * arg->chosen. This function can only be called in cpuset oom context.
+ *
+ * The selection algorithm is heuristic, therefore requires constant iteration
+ * based on user feedback. Currently, we just scan the current cpuset.
+ */
+int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
+{
+	int ret = 0;
+	struct css_task_iter it;
+	struct task_struct *task;
+
+	/*
+	 * Situation gets complex with overlapping nodemasks in different cpusets.
+	 * TODO: Maybe we should calculate the "distance" between different mems_allowed.
+	 *
+	 * But for now, let's make it simple. Just scan current cpuset.
+	 */
+	rcu_read_lock();
+	css_task_iter_start(&(task_cs(current)->css), CSS_TASK_ITER_PROCS, &it);
+	while (!ret && (task = css_task_iter_next(&it)))
+		ret = fn(task, arg);
+	css_task_iter_end(&it);
+	rcu_read_unlock();
+
+	return ret;
+}
+
 /*
  * Collection of memory_pressure is suppressed unless
  * this flag is enabled by writing "1" to the special
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 044e1eed720e..228257788d9e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -367,6 +367,8 @@  static void select_bad_process(struct oom_control *oc)
 
 	if (is_memcg_oom(oc))
 		mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
+	else if (oc->constraint == CONSTRAINT_CPUSET)
+		cpuset_scan_tasks(oom_evaluate_task, oc);
 	else {
 		struct task_struct *p;
 
@@ -427,6 +429,8 @@  static void dump_tasks(struct oom_control *oc)
 
 	if (is_memcg_oom(oc))
 		mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
+	else if (oc->constraint == CONSTRAINT_CPUSET)
+		cpuset_scan_tasks(dump_task, oc);
 	else {
 		struct task_struct *p;