
[v11,2/5] sched/core: uclamp: Propagate parent clamps

Message ID 20190708084357.12944-3-patrick.bellasi@arm.com (mailing list archive)
State Not Applicable, archived
Series Add utilization clamping support (CGroups API)

Commit Message

Patrick Bellasi July 8, 2019, 8:43 a.m. UTC
In order to properly support hierarchical resources control, the cgroup
delegation model requires that attribute writes from a child group never
fail but are still locally consistent and constrained by the parent's
assigned resources. This requires properly propagating and aggregating
parent attributes down to their descendants.

Implement this mechanism by adding a new "effective" clamp value for each
task group. The effective clamp value is defined as the smaller value
between the clamp value of a group and the effective clamp value of its
parent. This is the actual clamp value enforced on tasks in a task group.

Since it's possible for a cpu.uclamp.min value to be bigger than the
cpu.uclamp.max value, ensure local consistency by restricting each
"protection" (i.e. min utilization) with the corresponding "limit"
(i.e. max utilization). Do that when propagating effective clamps, so
that user-space writes never fail while the most restrictive values are
still always tracked.
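
For example, with purely illustrative values:

   child requested : uclamp.min=80  uclamp.max=50
   parent effective: uclamp.min=100 uclamp.max=60

   eff.max = min(50, 60)           = 50
   eff.min = min(min(80, 100), 50) = 50

i.e. the child's protection is first capped by the parent's effective
protection and then by its own effective limit.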

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>

---
Changes in v11:
 Message-ID: <20190624174607.GQ657710@devbig004.ftw2.facebook.com>
 - Removed user-space uclamp.{min,max}.effective API
 - Ensure group limits always clamps group protections
---
 kernel/sched/core.c  | 65 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  2 ++
 2 files changed, 67 insertions(+)

Comments

Michal Koutný July 15, 2019, 4:42 p.m. UTC | #1
On Mon, Jul 08, 2019 at 09:43:54AM +0100, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> Since it's possible for a cpu.uclamp.min value to be bigger than the
> cpu.uclamp.max value, ensure local consistency by restricting each
> "protection" (i.e. min utilization) with the corresponding "limit"
> (i.e. max utilization).
I think this constraint should be mentioned in the Documentation/....

> +static void cpu_util_update_eff(struct cgroup_subsys_state *css)
> +{
> +	struct cgroup_subsys_state *top_css = css;
> +	struct uclamp_se *uc_se = NULL;
> +	unsigned int eff[UCLAMP_CNT];
> +	unsigned int clamp_id;
> +	unsigned int clamps;
> +
> +	css_for_each_descendant_pre(css, top_css) {
> +		uc_se = css_tg(css)->parent
> +			? css_tg(css)->parent->uclamp : NULL;
> +
> +		for_each_clamp_id(clamp_id) {
> +			/* Assume effective clamps match requested clamps */
> +			eff[clamp_id] = css_tg(css)->uclamp_req[clamp_id].value;
> +			/* Cap effective clamps with parent's effective clamps */
> +			if (uc_se &&
> +			    eff[clamp_id] > uc_se[clamp_id].value) {
> +				eff[clamp_id] = uc_se[clamp_id].value;
> +			}
> +		}
> +		/* Ensure protection is always capped by limit */
> +		eff[UCLAMP_MIN] = min(eff[UCLAMP_MIN], eff[UCLAMP_MAX]);
> +
> +		/* Propagate most restrictive effective clamps */
> +		clamps = 0x0;
> +		uc_se = css_tg(css)->uclamp;
(Nitpick only: reassigning the child pointer where the parent was before
decreases readability, IMO.)

> +		for_each_clamp_id(clamp_id) {
> +			if (eff[clamp_id] == uc_se[clamp_id].value)
> +				continue;
> +			uc_se[clamp_id].value = eff[clamp_id];
> +			uc_se[clamp_id].bucket_id = uclamp_bucket_id(eff[clamp_id]);
Shouldn't these writes be synchronized with writes from
__setscheduler_uclamp()?

Patrick Bellasi July 16, 2019, 2:07 p.m. UTC | #2
Hi Michal,

On 15-Jul 18:42, Michal Koutný wrote:
> On Mon, Jul 08, 2019 at 09:43:54AM +0100, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> > Since it's possible for a cpu.uclamp.min value to be bigger than the
> > cpu.uclamp.max value, ensure local consistency by restricting each
> > "protection" (i.e. min utilization) with the corresponding "limit"
> > (i.e. max utilization).
> I think this constraint should be mentioned in the Documentation/....

That note comes from the previous review cycle and it's based on a
request from Tejun to align uclamp behaviors with the way the
delegation model is supposed to work.

I guess this part of the documentation:
   https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html?highlight=protections#resource-distribution-models
should already cover the expected uclamp min/max behaviors.

However, I guess "repetita iuvant" (repetition helps) in this case.
I'll call this out explicitly in the description of cpu.uclamp.min.

> > +static void cpu_util_update_eff(struct cgroup_subsys_state *css)
> > +{
> > +	struct cgroup_subsys_state *top_css = css;
> > +	struct uclamp_se *uc_se = NULL;
> > +	unsigned int eff[UCLAMP_CNT];
> > +	unsigned int clamp_id;
> > +	unsigned int clamps;
> > +
> > +	css_for_each_descendant_pre(css, top_css) {
> > +		uc_se = css_tg(css)->parent
> > +			? css_tg(css)->parent->uclamp : NULL;
> > +
> > +		for_each_clamp_id(clamp_id) {
> > +			/* Assume effective clamps match requested clamps */
> > +			eff[clamp_id] = css_tg(css)->uclamp_req[clamp_id].value;
> > +			/* Cap effective clamps with parent's effective clamps */
> > +			if (uc_se &&
> > +			    eff[clamp_id] > uc_se[clamp_id].value) {
> > +				eff[clamp_id] = uc_se[clamp_id].value;
> > +			}
> > +		}
> > +		/* Ensure protection is always capped by limit */
> > +		eff[UCLAMP_MIN] = min(eff[UCLAMP_MIN], eff[UCLAMP_MAX]);
> > +
> > +		/* Propagate most restrictive effective clamps */
> > +		clamps = 0x0;
> > +		uc_se = css_tg(css)->uclamp;
> (Nitpick only: reassigning the child pointer where the parent was before
> decreases readability, IMO.)

I did not check, but I think the compiler will figure out it can still
use a single pointer for both assignments.
I'll let the compiler do its job and add a dedicated stack var for the
parent pointer.
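
Something like this (just an untested sketch; uc_parent is a new local
holding the parent's effective clamps):

	css_for_each_descendant_pre(css, top_css) {
		struct uclamp_se *uc_parent = css_tg(css)->parent
			? css_tg(css)->parent->uclamp : NULL;
		struct uclamp_se *uc_se = css_tg(css)->uclamp;

		for_each_clamp_id(clamp_id) {
			/* Assume effective clamps match requested clamps */
			eff[clamp_id] = css_tg(css)->uclamp_req[clamp_id].value;
			/* Cap effective clamps with parent's effective clamps */
			if (uc_parent && eff[clamp_id] > uc_parent[clamp_id].value)
				eff[clamp_id] = uc_parent[clamp_id].value;
		}
		/* ... rest unchanged, using uc_se for the propagation ... */
	}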


> > +		for_each_clamp_id(clamp_id) {
> > +			if (eff[clamp_id] == uc_se[clamp_id].value)
> > +				continue;
> > +			uc_se[clamp_id].value = eff[clamp_id];
> > +			uc_se[clamp_id].bucket_id = uclamp_bucket_id(eff[clamp_id]);
> Shouldn't these writes be synchronized with writes from
> __setscheduler_uclamp()?

You're right, the synchronization is introduced by a later patch:

   sched/core: uclamp: Update CPU's refcount on TG's clamp changes

Cheers,
Patrick
Michal Koutný July 16, 2019, 3:29 p.m. UTC | #3
On Tue, Jul 16, 2019 at 03:07:06PM +0100, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> That note comes from the previous review cycle and it's based on a
> request from Tejun to align uclamp behaviors with the way the
> delegation model is supposed to work.
I saw and hopefully understood that reasoning -- uclamp.min has the
protection semantics and uclamp.max the limit semantics.

However, what took me some time to comprehend is what happens when the
effective uclamp.min and uclamp.max cross over, i.e. that uclamp.min is
then bound by uclamp.max (besides the parent's uclamp.min). Your commit
message explains that and I think it's relevant for the kernel docs
file itself.

> You're right, the synchronization is introduced by a later patch:
> 
>    sched/core: uclamp: Update CPU's refcount on TG's clamp changes
I saw that lock but didn't realize __setscheduler_uclamp() touches only
the task's struct uclamp_se, none of the task_group's/css's (which are
under uclamp_mutex). That seems correct.
Patrick Bellasi July 16, 2019, 5:55 p.m. UTC | #4
On 16-Jul 17:29, Michal Koutný wrote:
> On Tue, Jul 16, 2019 at 03:07:06PM +0100, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> > That note comes from the previous review cycle and it's based on a
> > request from Tejun to align uclamp behaviors with the way the
> > delegation model is supposed to work.
> I saw and hopefully understood that reasoning -- uclamp.min has the
> protection semantics and uclamp.max the limit semantics.
> 
> However, what took me some time to comprehend is what happens when the
> effective uclamp.min and uclamp.max cross over, i.e. that uclamp.min is
> then bound by uclamp.max (besides the parent's uclamp.min). Your commit
> message explains that and I think it's relevant for the kernel docs
> file itself.

Right, I've just added a paragraph to the cpu.uclamp.min documentation.

> > You're right, the synchronization is introduced by a later patch:
> > 
> >    sched/core: uclamp: Update CPU's refcount on TG's clamp changes
> I saw that lock but didn't realize __setscheduler_uclamp() touches only
> the task's struct uclamp_se, none of the task_group's/css's (which are
> under uclamp_mutex). That seems correct.

Right, the mutex is used only on the cgroup side. That's because the
CGroup API can affect multiple tasks running on different CPUs, thus
we want to make sure we don't end up with race conditions. In that
path we can also afford to go a bit slower.

In the fast path instead we rely on the rq-locks to ensure
serialization on RUNNABLE tasks clamp updates.

Coming from the __setscheduler_uclamp() side, however, we don't sync
RUNNABLE tasks immediately. We delay the update to the next enqueue
opportunity.
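
To put the scheme schematically (pseudo-code only, not the actual
implementation):

	/*
	 * cgroup write (slow path):
	 *   mutex_lock(&uclamp_mutex);  // serialize user-space updates
	 *   <update tg->uclamp_req and propagate tg->uclamp>
	 *   mutex_unlock(&uclamp_mutex);
	 *
	 * enqueue/dequeue (fast path):
	 *   <under the per-CPU rq lock> refcount and apply the clamps
	 *   currently in effect for each RUNNABLE task
	 *
	 * __setscheduler_uclamp():
	 *   updates only the task's own uclamp_se; the new values are
	 *   picked up at the task's next enqueue, under the rq lock
	 */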

Cheers,
Patrick

Patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 17ebdaaf7cd9..ec91f4518752 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -773,6 +773,18 @@ static void set_load_weight(struct task_struct *p, bool update_load)
 }
 
 #ifdef CONFIG_UCLAMP_TASK
+/*
+ * Serializes updates of utilization clamp values
+ *
+ * The (slow-path) user-space triggers utilization clamp value updates which
+ * can require updates on (fast-path) scheduler's data structures used to
+ * support enqueue/dequeue operations.
+ * While the per-CPU rq lock protects fast-path update operations, user-space
+ * requests are serialized using a mutex to reduce the risk of conflicting
+ * updates or API abuses.
+ */
+static DEFINE_MUTEX(uclamp_mutex);
+
 /* Max allowed minimum utilization */
 unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;
 
@@ -1137,6 +1149,8 @@ static void __init init_uclamp(void)
 	unsigned int clamp_id;
 	int cpu;
 
+	mutex_init(&uclamp_mutex);
+
 	for_each_possible_cpu(cpu) {
 		memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
 		cpu_rq(cpu)->uclamp_flags = 0;
@@ -1153,6 +1167,7 @@ static void __init init_uclamp(void)
 		uclamp_default[clamp_id] = uc_max;
 #ifdef CONFIG_UCLAMP_TASK_GROUP
 		root_task_group.uclamp_req[clamp_id] = uc_max;
+		root_task_group.uclamp[clamp_id] = uc_max;
 #endif
 	}
 }
@@ -6738,6 +6753,7 @@ static inline void alloc_uclamp_sched_group(struct task_group *tg,
 	for_each_clamp_id(clamp_id) {
 		uclamp_se_set(&tg->uclamp_req[clamp_id],
 			      uclamp_none(clamp_id), false);
+		tg->uclamp[clamp_id] = parent->uclamp[clamp_id];
 	}
 #endif
 }
@@ -6988,6 +7004,45 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
 }
 
 #ifdef CONFIG_UCLAMP_TASK_GROUP
+static void cpu_util_update_eff(struct cgroup_subsys_state *css)
+{
+	struct cgroup_subsys_state *top_css = css;
+	struct uclamp_se *uc_se = NULL;
+	unsigned int eff[UCLAMP_CNT];
+	unsigned int clamp_id;
+	unsigned int clamps;
+
+	css_for_each_descendant_pre(css, top_css) {
+		uc_se = css_tg(css)->parent
+			? css_tg(css)->parent->uclamp : NULL;
+
+		for_each_clamp_id(clamp_id) {
+			/* Assume effective clamps match requested clamps */
+			eff[clamp_id] = css_tg(css)->uclamp_req[clamp_id].value;
+			/* Cap effective clamps with parent's effective clamps */
+			if (uc_se &&
+			    eff[clamp_id] > uc_se[clamp_id].value) {
+				eff[clamp_id] = uc_se[clamp_id].value;
+			}
+		}
+		/* Ensure protection is always capped by limit */
+		eff[UCLAMP_MIN] = min(eff[UCLAMP_MIN], eff[UCLAMP_MAX]);
+
+		/* Propagate most restrictive effective clamps */
+		clamps = 0x0;
+		uc_se = css_tg(css)->uclamp;
+		for_each_clamp_id(clamp_id) {
+			if (eff[clamp_id] == uc_se[clamp_id].value)
+				continue;
+			uc_se[clamp_id].value = eff[clamp_id];
+			uc_se[clamp_id].bucket_id = uclamp_bucket_id(eff[clamp_id]);
+			clamps |= (0x1 << clamp_id);
+		}
+		if (!clamps)
+			css = css_rightmost_descendant(css);
+	}
+}
+
 static inline int uclamp_scale_from_percent(char *buf, u64 *value)
 {
 	*value = SCHED_CAPACITY_SCALE;
@@ -7027,13 +7082,18 @@ static ssize_t cpu_uclamp_min_write(struct kernfs_open_file *of,
 	if (min_value > SCHED_CAPACITY_SCALE)
 		return -ERANGE;
 
+	mutex_lock(&uclamp_mutex);
 	rcu_read_lock();
 
 	tg = css_tg(of_css(of));
 	if (tg->uclamp_req[UCLAMP_MIN].value != min_value)
 		uclamp_se_set(&tg->uclamp_req[UCLAMP_MIN], min_value, false);
 
+	/* Update effective clamps to track the most restrictive value */
+	cpu_util_update_eff(of_css(of));
+
 	rcu_read_unlock();
+	mutex_unlock(&uclamp_mutex);
 
 	return nbytes;
 }
@@ -7052,13 +7112,18 @@ static ssize_t cpu_uclamp_max_write(struct kernfs_open_file *of,
 	if (max_value > SCHED_CAPACITY_SCALE)
 		return -ERANGE;
 
+	mutex_lock(&uclamp_mutex);
 	rcu_read_lock();
 
 	tg = css_tg(of_css(of));
 	if (tg->uclamp_req[UCLAMP_MAX].value != max_value)
 		uclamp_se_set(&tg->uclamp_req[UCLAMP_MAX], max_value, false);
 
+	/* Update effective clamps to track the most restrictive value */
+	cpu_util_update_eff(of_css(of));
+
 	rcu_read_unlock();
+	mutex_unlock(&uclamp_mutex);
 
 	return nbytes;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3723037ea80d..8c3aefdaf0ef 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -397,6 +397,8 @@ struct task_group {
 #ifdef CONFIG_UCLAMP_TASK_GROUP
 	/* Clamp values requested for a task group */
 	struct uclamp_se	uclamp_req[UCLAMP_CNT];
+	/* Effective clamp values used for a task group */
+	struct uclamp_se	uclamp[UCLAMP_CNT];
 #endif
 
 };