diff mbox series

[v4,4/6] cgroup/cpuset: Allow non-top parent partition root to distribute out all CPUs

Message ID 20210811030607.13824-5-longman@redhat.com (mailing list archive)
State New
Headers show
Series cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus | expand

Commit Message

Waiman Long Aug. 11, 2021, 3:06 a.m. UTC
Currently, a parent partition root cannot distribute all its CPUs to
child partition roots with no CPUs left. However in some use cases,
a management application may want to create a parent partition root as
a management unit with no task associated with it and has all its CPUs
distributed to various child partition roots dynamically according to
their needs. Leaving a cpu in the parent partition root in such a case is
now a waste.

To accommodate such use cases, a parent partition root can now have
all its CPUs distributed to its child partition roots as long as:
 1) it is not the top cpuset; and
 2) there is no task directly associated with the parent.

Once an empty parent partition root is formed, no new task can be moved
into it.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 91 ++++++++++++++++++++++++++++--------------
 1 file changed, 61 insertions(+), 30 deletions(-)

Comments

Tejun Heo Aug. 11, 2021, 6:13 p.m. UTC | #1
On Tue, Aug 10, 2021 at 11:06:05PM -0400, Waiman Long wrote:
> Currently, a parent partition root cannot distribute all its CPUs to
> child partition roots with no CPUs left. However in some use cases,
> a management application may want to create a parent partition root as
> a management unit with no task associated with it and has all its CPUs
> distributed to various child partition roots dynamically according to
> their needs. Leaving a cpu in the parent partition root in such a case is
> now a waste.
> 
> To accommodate such use cases, a parent partition root can now have
> all its CPUs distributed to its child partition roots as long as:
>  1) it is not the top cpuset; and

>  2) there is no task directly associated with the parent.
> 
> Once an empty parent partition root is formed, no new task can be moved
> into it.

The above are already enforced by cgroup2 core, right? No intermediate
cgroup with controllers enabled can have processes. From controllers' POV,
only leaves can have processes.

Thanks.
Waiman Long Aug. 11, 2021, 6:18 p.m. UTC | #2
On 8/11/21 2:13 PM, Tejun Heo wrote:
> On Tue, Aug 10, 2021 at 11:06:05PM -0400, Waiman Long wrote:
>> Currently, a parent partition root cannot distribute all its CPUs to
>> child partition roots with no CPUs left. However in some use cases,
>> a management application may want to create a parent partition root as
>> a management unit with no task associated with it and has all its CPUs
>> distributed to various child partition roots dynamically according to
>> their needs. Leaving a cpu in the parent partition root in such a case is
>> now a waste.
>>
>> To accommodate such use cases, a parent partition root can now have
>> all its CPUs distributed to its child partition roots as long as:
>>   1) it is not the top cpuset; and
>>   2) there is no task directly associated with the parent.
>>
>> Once an empty parent partition root is formed, no new task can be moved
>> into it.
> The above are already enforced by cgroup2 core, right? No intermediate
> cgroup with controllers enabled can have processes. From controllers' POV,
> only leaves can have processes.
>
I don't think that is true. A task can reside anywhere in the cgroup 
hierarchy. I have encountered no problem moving tasks around.

Cheers,
Longman
Tejun Heo Aug. 11, 2021, 6:21 p.m. UTC | #3
On Wed, Aug 11, 2021 at 02:18:17PM -0400, Waiman Long wrote:
> I don't think that is true. A task can reside anywhere in the cgroup
> hierarchy. I have encountered no problem moving tasks around.

Oh, that shouldn't be happening with controllers enabled. Can you please
share a repro?

Thanks.
Waiman Long Aug. 11, 2021, 6:46 p.m. UTC | #4
On 8/11/21 2:21 PM, Tejun Heo wrote:
> On Wed, Aug 11, 2021 at 02:18:17PM -0400, Waiman Long wrote:
>> I don't think that is true. A task can reside anywhere in the cgroup
>> hierarchy. I have encountered no problem moving tasks around.
> Oh, that shouldn't be happening with controllers enabled. Can you please
> share a repro?

I have done further testing. Enabling controllers won't prohibit moving 
a task into a parent cgroup as long as the child cgroups have no tasks. 
Once the child cgroup has task, moving another task to the parent is not 
allowed (-EBUSY). Similarly if a parent cgroup has tasks, you can't put 
new tasks into the child cgroup. I don't realize that we have such 
constraints as I usually do my testing with a cgroup hierarchy with no 
tasks initially. Anyway, a new lesson learned.

I will try to see how to address that in the patch, but the additional 
check added is still valid in some special case.

Cheers,
Longman
Tejun Heo Aug. 12, 2021, 9:51 p.m. UTC | #5
On Wed, Aug 11, 2021 at 02:46:24PM -0400, Waiman Long wrote:
> On 8/11/21 2:21 PM, Tejun Heo wrote:
> > On Wed, Aug 11, 2021 at 02:18:17PM -0400, Waiman Long wrote:
> > > I don't think that is true. A task can reside anywhere in the cgroup
> > > hierarchy. I have encountered no problem moving tasks around.
> > Oh, that shouldn't be happening with controllers enabled. Can you please
> > share a repro?
> 
> I have done further testing. Enabling controllers won't prohibit moving a
> task into a parent cgroup as long as the child cgroups have no tasks. Once

Should be "as long as there's no child cgroups".

  root@test /s/f/cgroup# mkdir test
  root@test /s/f/cgroup# mkdir -p test/A
  root@test /s/f/cgroup# echo +io > test/cgroup.subtree_control 
  root@test /s/f/cgroup# echo $fish_pid > test/cgroup.procs
  write: Device or resource busy

> the child cgroup has task, moving another task to the parent is not allowed
> (-EBUSY). Similarly if a parent cgroup has tasks, you can't put new tasks
> into the child cgroup. I don't realize that we have such constraints as I

You can't enable controller from a populated cgroup:

  root@test /s/f/cgroup# mkdir test
  root@test /s/f/cgroup# echo +io > test/cgroup.subtree_control 
  root@test /s/f/cgroup# echo $fish_pid > test/cgroup.procs

> usually do my testing with a cgroup hierarchy with no tasks initially.
> Anyway, a new lesson learned.

The invariant is that from each controller's POV, all cgroups with processes
in them are leaves. This is all pretty well documented in cgroup-v2.rst.

Thanks.
diff mbox series

Patch

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 498fd418fd30..d8c1d01bdd81 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -279,6 +279,11 @@  static inline void notify_partition_change(struct cpuset *cs,
 		cgroup_file_notify(&cs->partition_file);
 }
 
+static inline int cpuset_has_tasks(const struct cpuset *cs)
+{
+	return cs->css.cgroup->nr_populated_csets;
+}
+
 static struct cpuset top_cpuset = {
 	.flags = ((1 << CS_ONLINE) | (1 << CS_CPU_EXCLUSIVE) |
 		  (1 << CS_MEM_EXCLUSIVE)),
@@ -1186,22 +1191,32 @@  static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 	if ((cmd != partcmd_update) && css_has_online_children(&cpuset->css))
 		return -EBUSY;
 
-	/*
-	 * Enabling partition root is not allowed if not all the CPUs
-	 * can be granted from parent's effective_cpus or at least one
-	 * CPU will be left after that.
-	 */
-	if ((cmd == partcmd_enable) &&
-	   (!cpumask_subset(cpuset->cpus_allowed, parent->effective_cpus) ||
-	     cpumask_equal(cpuset->cpus_allowed, parent->effective_cpus)))
-		return -EINVAL;
-
 	/*
 	 * A cpumask update cannot make parent's effective_cpus become empty.
 	 */
 	adding = deleting = false;
 	old_prs = new_prs = cpuset->partition_root_state;
 	if (cmd == partcmd_enable) {
+		bool parent_is_top_cpuset = !parent_cs(parent);
+		bool no_cpu_in_parent = cpumask_equal(cpuset->cpus_allowed,
+						      parent->effective_cpus);
+		/*
+		 * Enabling partition root is not allowed if not all the CPUs
+		 * can be granted from parent's effective_cpus. If the parent
+		 * is the top cpuset, at least one CPU must be left after that.
+		 */
+		if (!cpumask_subset(cpuset->cpus_allowed, parent->effective_cpus) ||
+		    (parent_is_top_cpuset && no_cpu_in_parent))
+			return -EINVAL;
+
+		/*
+		 * A non-top parent can be left with no CPU as long as there
+		 * is no task directly associated with the parent. For such
+		 * a parent, no new task can be moved into it.
+		 */
+		if (no_cpu_in_parent && cpuset_has_tasks(parent))
+			return -EINVAL;
+
 		cpumask_copy(tmp->addmask, cpuset->cpus_allowed);
 		adding = true;
 	} else if (cmd == partcmd_disable) {
@@ -1231,20 +1246,21 @@  static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		adding = cpumask_andnot(tmp->addmask, tmp->addmask,
 					parent->subparts_cpus);
 		/*
-		 * Return error if parent's effective_cpus could become empty.
+		 * Make partition invalid if parent's effective_cpus could
+		 * become empty and there are tasks in the parent.
 		 */
-		if (adding &&
+		if (adding && cpuset_has_tasks(parent) &&
 		    cpumask_equal(parent->effective_cpus, tmp->addmask)) {
 			if (!deleting)
-				return -EINVAL;
+				part_error = true;
 			/*
 			 * As some of the CPUs in subparts_cpus might have
 			 * been offlined, we need to compute the real delmask
 			 * to confirm that.
 			 */
-			if (!cpumask_and(tmp->addmask, tmp->delmask,
-					 cpu_active_mask))
-				return -EINVAL;
+			else if (!cpumask_and(tmp->addmask, tmp->delmask,
+					      cpu_active_mask))
+				part_error = true;
 			cpumask_copy(tmp->addmask, parent->effective_cpus);
 		}
 	} else {
@@ -1266,7 +1282,8 @@  static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 				     parent->effective_cpus);
 		part_error = (is_partition_root(cpuset) &&
 			      !parent->nr_subparts_cpus) ||
-			     cpumask_equal(tmp->addmask, parent->effective_cpus);
+			     (cpumask_equal(tmp->addmask, parent->effective_cpus) &&
+			      cpuset_has_tasks(parent));
 	}
 
 	if (cmd == partcmd_update) {
@@ -1367,9 +1384,15 @@  static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 
 		/*
 		 * If it becomes empty, inherit the effective mask of the
-		 * parent, which is guaranteed to have some CPUs.
+		 * parent, which is guaranteed to have some CPUs unless
+		 * it is a partition root that has explicitly distributed
+		 * out all its CPUs.
 		 */
 		if (is_in_v2_mode() && cpumask_empty(tmp->new_cpus)) {
+			if (is_partition_root(cp) &&
+			    cpumask_equal(cp->cpus_allowed, cp->subparts_cpus))
+				goto update_parent_subparts;
+
 			cpumask_copy(tmp->new_cpus, parent->effective_cpus);
 			if (!cp->use_parent_ecpus) {
 				cp->use_parent_ecpus = true;
@@ -1391,6 +1414,7 @@  static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 			continue;
 		}
 
+update_parent_subparts:
 		/*
 		 * update_parent_subparts_cpumask() should have been called
 		 * for cs already in update_cpumask(). We should also call
@@ -1466,12 +1490,9 @@  static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 			 */
 			cpumask_andnot(cp->effective_cpus, cp->effective_cpus,
 				       cp->subparts_cpus);
-			WARN_ON_ONCE(cpumask_empty(cp->effective_cpus));
 		}
 
-		if (new_prs != old_prs)
-			cp->partition_root_state = new_prs;
-
+		cp->partition_root_state = new_prs;
 		spin_unlock_irq(&callback_lock);
 		notify_partition_change(cp, old_prs, new_prs);
 
@@ -2215,6 +2236,13 @@  static int cpuset_can_attach(struct cgroup_taskset *tset)
 	    (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed)))
 		goto out_unlock;
 
+	/*
+	 * On default hierarchy, task cannot be moved to a cpuset with empty
+	 * effective cpus.
+	 */
+	if (is_in_v2_mode() && cpumask_empty(cs->effective_cpus))
+		goto out_unlock;
+
 	cgroup_taskset_for_each(task, css, tset) {
 		ret = task_can_attach(task, cs->cpus_allowed);
 		if (ret)
@@ -3082,7 +3110,8 @@  hotplug_update_tasks(struct cpuset *cs,
 		     struct cpumask *new_cpus, nodemask_t *new_mems,
 		     bool cpus_updated, bool mems_updated)
 {
-	if (cpumask_empty(new_cpus))
+	/* A partition root is allowed to have empty effective cpus */
+	if (cpumask_empty(new_cpus) && !is_partition_root(cs))
 		cpumask_copy(new_cpus, parent_cs(cs)->effective_cpus);
 	if (nodes_empty(*new_mems))
 		*new_mems = parent_cs(cs)->effective_mems;
@@ -3151,22 +3180,24 @@  static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 
 	/*
 	 * In the unlikely event that a partition root has empty
-	 * effective_cpus, we will have to force any child partitions,
-	 * if present, to become invalid by setting nr_subparts_cpus to 0
-	 * without causing itself to become invalid.
+	 * effective_cpus with tasks, we will have to force any child
+	 * partitions, if present, to become invalid by setting
+	 * nr_subparts_cpus to 0 without causing itself to become invalid.
 	 */
 	if (is_partition_root(cs) && cs->nr_subparts_cpus &&
-	    cpumask_empty(&new_cpus)) {
+	    cpumask_empty(&new_cpus) && cpuset_has_tasks(cs)) {
 		cs->nr_subparts_cpus = 0;
 		cpumask_clear(cs->subparts_cpus);
 		compute_effective_cpumask(&new_cpus, cs, parent);
 	}
 
 	/*
-	 * If empty effective_cpus or zero nr_subparts_cpus or its parent
-	 * becomes erroneous, we have to transition it to the erroneous state.
+	 * If empty effective_cpus with tasks or zero nr_subparts_cpus or
+	 * its parent becomes erroneous, we have to transition it to the
+	 * erroneous state.
 	 */
-	if (is_partition_root(cs) && (cpumask_empty(&new_cpus) ||
+	if (is_partition_root(cs) &&
+	   ((cpumask_empty(&new_cpus) && cpuset_has_tasks(cs)) ||
 	    (parent->partition_root_state == PRS_ERROR) ||
 	    !parent->nr_subparts_cpus)) {
 		int old_prs;