[RFC,v1] mm: oom: introduce cpuset oom

Message ID 20220921064710.89663-1-ligang.bdlg@bytedance.com (mailing list archive)
State New
Series [RFC,v1] mm: oom: introduce cpuset oom

Commit Message

Gang Li Sept. 21, 2022, 6:47 a.m. UTC
cpuset confines processes to processor and memory node subsets.
When a process in a cpuset triggers oom, it may kill a completely
irrelevant process on another NUMA node, which will not release any
memory for this cpuset.

It seems that `CONSTRAINT_CPUSET` is not really doing much these
days. Using CONSTRAINT_CPUSET, we can easily achieve node-aware oom
killing by selecting a victim from the cpuset which triggered the oom.

Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
---
This idea comes from a previous patch:
mm, oom: Introduce per numa node oom for CONSTRAINT_MEMORY_POLICY
https://lore.kernel.org/all/YoJ%2FioXwGTdCywUE@dhcp22.suse.cz/

Any comments are welcome.
---
 include/linux/cpuset.h |  6 ++++++
 kernel/cgroup/cpuset.c | 17 +++++++++++++++++
 mm/oom_kill.c          |  4 ++++
 3 files changed, 27 insertions(+)

Comments

David Rientjes Sept. 22, 2022, 7:18 p.m. UTC | #1
On Wed, 21 Sep 2022, Gang Li wrote:

> cpuset confines processes to processor and memory node subsets.
> When a process in a cpuset triggers oom, it may kill a completely
> irrelevant process on another NUMA node, which will not release any
> memory for this cpuset.
> 
> It seems that `CONSTRAINT_CPUSET` is not really doing much these
> days. Using CONSTRAINT_CPUSET, we can easily achieve node-aware oom
> killing by selecting a victim from the cpuset which triggered the oom.
> 
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>

Hmm, is this the right approach?

If a cpuset results in an oom condition, is there a reason why we'd need to 
find a process from within that cpuset to kill?  I think the idea is to 
free memory on the oom set of nodes (cpuset.mems) and that can happen by 
killing a process that is not a member of this cpuset.

I understand the challenges of creating a NUMA aware oom killer to target 
memory that is actually resident on an oom node, but this approach doesn't 
seem right and could actually lead to pathological cases where a small 
process trying to fork in an otherwise empty cpuset is repeatedly oom 
killed when we'd actually prefer to kill a single large process.

> ---
> This idea comes from a previous patch:
> mm, oom: Introduce per numa node oom for CONSTRAINT_MEMORY_POLICY
> https://lore.kernel.org/all/YoJ%2FioXwGTdCywUE@dhcp22.suse.cz/
> 
> Any comments are welcome.
> ---
>  include/linux/cpuset.h |  6 ++++++
>  kernel/cgroup/cpuset.c | 17 +++++++++++++++++
>  mm/oom_kill.c          |  4 ++++
>  3 files changed, 27 insertions(+)
> 
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index d58e0476ee8e..7475f613ab90 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -178,6 +178,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
>  	task_unlock(current);
>  }
>  
> +int cpuset_cgroup_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg);
> +
>  #else /* !CONFIG_CPUSETS */
>  
>  static inline bool cpusets_enabled(void) { return false; }
> @@ -299,6 +301,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
>  	return false;
>  }
>  
> +static inline int cpuset_cgroup_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
> +{
> +	return 0;
> +}
>  #endif /* !CONFIG_CPUSETS */
>  
>  #endif /* _LINUX_CPUSET_H */
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index b474289c15b8..1f1238b4276d 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -3943,6 +3943,23 @@ void cpuset_print_current_mems_allowed(void)
>  	rcu_read_unlock();
>  }
>  
> +int cpuset_cgroup_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
> +{
> +	int ret = 0;
> +	struct cgroup *cgrp;
> +	struct css_task_iter it;
> +	struct task_struct *task;
> +
> +	rcu_read_lock();
> +	css_task_iter_start(&(task_cs(current)->css), CSS_TASK_ITER_PROCS, &it);
> +	while (!ret && (task = css_task_iter_next(&it)))
> +		ret = fn(task, arg);
> +	css_task_iter_end(&it);
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
>  /*
>   * Collection of memory_pressure is suppressed unless
>   * this flag is enabled by writing "1" to the special
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 46e7e073f137..8cea787b359c 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -367,6 +367,8 @@ static void select_bad_process(struct oom_control *oc)
>  
>  	if (is_memcg_oom(oc))
>  		mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
> +	else if (oc->constraint == CONSTRAINT_CPUSET)
> +		cpuset_cgroup_scan_tasks(oom_evaluate_task, oc);
>  	else {
>  		struct task_struct *p;
>  
> @@ -427,6 +429,8 @@ static void dump_tasks(struct oom_control *oc)
>  
>  	if (is_memcg_oom(oc))
>  		mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
> +	else if (oc->constraint == CONSTRAINT_CPUSET)
> +		cpuset_cgroup_scan_tasks(dump_task, oc);
>  	else {
>  		struct task_struct *p;
>  
> -- 
> 2.20.1
> 
> 
>
Michal Hocko Sept. 23, 2022, 7:45 a.m. UTC | #2
On Thu 22-09-22 12:18:04, David Rientjes wrote:
> On Wed, 21 Sep 2022, Gang Li wrote:
> 
> > cpuset confines processes to processor and memory node subsets.
> > When a process in a cpuset triggers oom, it may kill a completely
> > irrelevant process on another NUMA node, which will not release any
> > memory for this cpuset.
> > 
> > It seems that `CONSTRAINT_CPUSET` is not really doing much these
> > days. Using CONSTRAINT_CPUSET, we can easily achieve node-aware oom
> > killing by selecting a victim from the cpuset which triggered the oom.
> > 
> > Suggested-by: Michal Hocko <mhocko@suse.com>
> > Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
> 
> Hmm, is this the right approach?
> 
> If a cpuset results in an oom condition, is there a reason why we'd need to 
> find a process from within that cpuset to kill?  I think the idea is to 
> free memory on the oom set of nodes (cpuset.mems) and that can happen by 
> killing a process that is not a member of this cpuset.

I would argue that the current cpuset should be considered first because
chances are that it will already have the biggest memory consumption
from the constrained NUMA nodes. At least that would be the case when
cpusets are used to partition the system into exclusive NUMA domains.

The situation gets more complex with overlapping nodemasks in different
cpusets, but I believe our existing semantic already sucks for those
usecases because we just shoot a random process with an unknown
amount of memory allocated from the constrained nodemask.

This new semantic is not much worse. The real oom victim could live
under a different cpuset, but the current semantic could just as well
kill a large memory consumer with only a tiny footprint on the target
node. With the cpuset view the potential damage is more targeted in
many cases.

> I understand the challenges of creating a NUMA aware oom killer to target 
> memory that is actually resident on an oom node, but this approach doesn't 
> seem right and could actually lead to pathological cases where a small 
> process trying to fork in an otherwise empty cpuset is repeatedly oom 
> killed when we'd actually prefer to kill a single large process.

Yeah, that is possible and something to consider. One way to go about
that is to make the selection from all cpusets that overlap with the
requested nodemask (probably with a preference for more constrained
ones). In any case let's keep in mind that this is a mere heuristic. We
just need to kill some process; it is not really feasible to aim for the
best selection. We should just try to reduce the harm. Our existing
cpuset-based OOM is effectively random without any clear relation to
cpusets, so I would be open to experimenting in this area.
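
Roughly, such an overlap-based scan might look like the sketch below.
This is only an untested illustration written against the cpuset.c
internals already used by the patch above (css_cs(), top_cpuset and the
css_task_iter API); the function name is invented here and locking
details beyond RCU are left out, so it is not part of the proposal.

static int cpuset_scan_overlapping_tasks(const nodemask_t *mask,
                                         int (*fn)(struct task_struct *, void *),
                                         void *arg)
{
        struct cgroup_subsys_state *pos;
        int ret = 0;

        rcu_read_lock();
        css_for_each_descendant_pre(pos, &top_cpuset.css) {
                struct cpuset *cs = css_cs(pos);
                struct css_task_iter it;
                struct task_struct *task;

                /* Skip cpusets that cannot free memory on the oom nodes. */
                if (!nodes_intersects(cs->mems_allowed, *mask))
                        continue;

                css_task_iter_start(pos, CSS_TASK_ITER_PROCS, &it);
                while (!ret && (task = css_task_iter_next(&it)))
                        ret = fn(task, arg);
                css_task_iter_end(&it);
                if (ret)
                        break;
        }
        rcu_read_unlock();

        return ret;
}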
Gang Li Sept. 26, 2022, 3:38 a.m. UTC | #3
On 2022/9/23 03:18, David Rientjes wrote:
> On Wed, 21 Sep 2022, Gang Li wrote:
> 
>> cpuset confines processes to processor and memory node subsets.
>> When a process in a cpuset triggers oom, it may kill a completely
>> irrelevant process on another NUMA node, which will not release any
>> memory for this cpuset.
>>
>> It seems that `CONSTRAINT_CPUSET` is not really doing much these
>> days. Using CONSTRAINT_CPUSET, we can easily achieve node-aware oom
>> killing by selecting a victim from the cpuset which triggered the oom.
>>
>> Suggested-by: Michal Hocko <mhocko@suse.com>
>> Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
> 
> Hmm, is this the right approach?
> 
> If a cpuset results in an oom condition, is there a reason why we'd need to
> find a process from within that cpuset to kill?  I think the idea is to
> free memory on the oom set of nodes (cpuset.mems) and that can happen by
> killing a process that is not a member of this cpuset.
> 
Hi,

My last patch implemented this idea[1][2]. But it needs to inc/dec a
per-mm_struct counter on every page allocation/release/migration.

As the Unixbench results show, this costs 0%-3% of performance on
different workloads[2]. So Michal Hocko suggested using cpuset instead[3].

[1]. https://lore.kernel.org/all/20220512044634.63586-1-ligang.bdlg@bytedance.com/
[2]. https://lore.kernel.org/all/20220708082129.80115-1-ligang.bdlg@bytedance.com/
[3]. https://lore.kernel.org/all/YoJ%2FioXwGTdCywUE@dhcp22.suse.cz/

> I understand the challenges of creating a NUMA aware oom killer to target
> memory that is actually resident on an oom node, but this approach doesn't
> seem right and could actually lead to pathological cases where a small
> process trying to fork in an otherwise empty cpuset is repeatedly oom
> killed when we'd actually prefer to kill a single large process.
> 

I think there are three ways to achieve a NUMA-aware oom killer:

1. Count every page operation, which causes a performance loss[2]
   (a rough sketch of the bookkeeping follows below).
2. Iterate over pages (like show_numa_map) for all processes, which may
   stall the oom path.
3. Select a victim within a cpuset, which may lead to pathological
   kills (this patch).

None of them is perfect and I'm getting stuck. Do you have any ideas?
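
To make option 1 concrete, the bookkeeping it implies is roughly the
sketch below. This is only an illustration of the idea, not the code
from [1]/[2]; the struct and helper names are invented here.

struct mm_numa_stat {
        /* per-mm count of resident pages on each node */
        atomic_long_t pages[MAX_NUMNODES];
};

/* Would have to be called on every allocation, free and migration. */
static inline void mm_numa_account(struct mm_numa_stat *stat, int nid, long nr)
{
        atomic_long_add(nr, &stat->pages[nid]);
}

/* The oom killer could then rank candidates by stat->pages[oom_nid]. */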
Gang Li Sept. 29, 2022, 3:15 a.m. UTC | #4
On 2022/9/23 15:45, Michal Hocko wrote:
> Yeah, that is possible and something to consider. One way to go about
> that is to make the selection from all cpusets that overlap with the
> requested nodemask (probably with a preference for more constrained
> ones). In any case let's keep in mind that this is a mere heuristic. We
> just need to kill some process; it is not really feasible to aim for the
> best selection. We should just try to reduce the harm. Our existing
> cpuset-based OOM is effectively random without any clear relation to
> cpusets, so I would be open to experimenting in this area.

In addition to cpusets, users can also bind to NUMA nodes through
mbind(). So I also want to handle NUMA-bound applications that are not
managed by any cpuset. Do you have any ideas?
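
For reference, the mbind() case looks roughly like this from userspace
(a minimal illustration assuming the libnuma <numaif.h> header, built
with -lnuma). An oom triggered under such a policy is classified as
CONSTRAINT_MEMORY_POLICY rather than CONSTRAINT_CPUSET, so the
cpuset-based selection above would not cover it.

#include <numaif.h>
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
        size_t len = 1UL << 30;                 /* 1 GiB */
        unsigned long nodemask = 1UL << 0;      /* allow node 0 only */
        void *p;

        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
                return 1;

        /* Restrict this mapping to node 0; once node 0 runs out of
         * memory, faulting it in triggers a policy-constrained oom. */
        if (mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0))
                perror("mbind");

        memset(p, 0, len);      /* fault the pages in */
        return 0;
}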
Patch

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index d58e0476ee8e..7475f613ab90 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -178,6 +178,8 @@  static inline void set_mems_allowed(nodemask_t nodemask)
 	task_unlock(current);
 }
 
+int cpuset_cgroup_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg);
+
 #else /* !CONFIG_CPUSETS */
 
 static inline bool cpusets_enabled(void) { return false; }
@@ -299,6 +301,10 @@  static inline bool read_mems_allowed_retry(unsigned int seq)
 	return false;
 }
 
+static inline int cpuset_cgroup_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
+{
+	return 0;
+}
 #endif /* !CONFIG_CPUSETS */
 
 #endif /* _LINUX_CPUSET_H */
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index b474289c15b8..1f1238b4276d 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3943,6 +3943,23 @@  void cpuset_print_current_mems_allowed(void)
 	rcu_read_unlock();
 }
 
+int cpuset_cgroup_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
+{
+	int ret = 0;
+	struct cgroup *cgrp;
+	struct css_task_iter it;
+	struct task_struct *task;
+
+	rcu_read_lock();
+	css_task_iter_start(&(task_cs(current)->css), CSS_TASK_ITER_PROCS, &it);
+	while (!ret && (task = css_task_iter_next(&it)))
+		ret = fn(task, arg);
+	css_task_iter_end(&it);
+	rcu_read_unlock();
+
+	return ret;
+}
+
 /*
  * Collection of memory_pressure is suppressed unless
  * this flag is enabled by writing "1" to the special
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 46e7e073f137..8cea787b359c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -367,6 +367,8 @@  static void select_bad_process(struct oom_control *oc)
 
 	if (is_memcg_oom(oc))
 		mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
+	else if (oc->constraint == CONSTRAINT_CPUSET)
+		cpuset_cgroup_scan_tasks(oom_evaluate_task, oc);
 	else {
 		struct task_struct *p;
 
@@ -427,6 +429,8 @@  static void dump_tasks(struct oom_control *oc)
 
 	if (is_memcg_oom(oc))
 		mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
+	else if (oc->constraint == CONSTRAINT_CPUSET)
+		cpuset_cgroup_scan_tasks(dump_task, oc);
 	else {
 		struct task_struct *p;