Message ID | 20190708084357.12944-2-patrick.bellasi@arm.com
---|---
State | Not Applicable, archived
Series | Add utilization clamping support (CGroups API)
Hi Patrick,

On Monday 08 Jul 2019 at 09:43:53 (+0100), Patrick Bellasi wrote:
> +static inline int uclamp_scale_from_percent(char *buf, u64 *value)
> +{
> +	*value = SCHED_CAPACITY_SCALE;
> +
> +	buf = strim(buf);
> +	if (strncmp("max", buf, 4)) {
> +		s64 percent;
> +		int ret;
> +
> +		ret = cgroup_parse_float(buf, 2, &percent);
> +		if (ret)
> +			return ret;
> +
> +		percent <<= SCHED_CAPACITY_SHIFT;
> +		*value = DIV_ROUND_CLOSEST_ULL(percent, 10000);
> +	}
> +
> +	return 0;
> +}
> +
> +static inline u64 uclamp_percent_from_scale(u64 value)
> +{
> +	return DIV_ROUND_CLOSEST_ULL(value * 10000, SCHED_CAPACITY_SCALE);
> +}

FWIW, I tried the patches and realized these conversions result in a
'funny' behaviour from a user's perspective. Things like this happen:

	$ echo 20 > cpu.uclamp.min
	$ cat cpu.uclamp.min
	20.2
	$ echo 20.2 > cpu.uclamp.min
	$ cat cpu.uclamp.min
	20.21

Having looked at the code, I get why this is happening, but I'm not sure
if a random user will. It's not an issue per se, but it's just a bit
weird.

I guess one way to fix this would be to revert back to having a
1024-scale for the cgroup interface too ... Though I understand Tejun
wanted % for consistency with other things.

So, I'm not sure if this is still up for discussion, but in any case I
wanted to say I support your original idea of using a 1024-scale for the
cgroups interface, since that would solve the 'issue' above and keeps
things consistent with the per-task API too.

Thanks,
Quentin
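The rounding drift can be replayed outside the kernel. Below is a minimal
userspace sketch of the two conversions quoted above; div_closest() is a
stand-in for the kernel's DIV_ROUND_CLOSEST_ULL(), and SCHED_CAPACITY_SHIFT
is assumed to be 10:

#include <stdio.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1ULL << SCHED_CAPACITY_SHIFT)

/* stand-in for the kernel's DIV_ROUND_CLOSEST_ULL() */
static unsigned long long div_closest(unsigned long long x, unsigned long long d)
{
	return (x + d / 2) / d;
}

int main(void)
{
	/* "20" parsed by cgroup_parse_float(buf, 2, ...) -> 2000 */
	unsigned long long percent = 2000;

	/* write path: two-decimal percent -> 1024-scale utilization */
	unsigned long long scale = div_closest(percent << SCHED_CAPACITY_SHIFT, 10000);

	/* read path: 1024-scale utilization -> two-decimal percent */
	unsigned long long back = div_closest(scale * 10000, SCHED_CAPACITY_SCALE);

	/* prints "2000 -> 205 -> 2002": 20% is stored as 205/1024 = ~20.02% */
	printf("%llu -> %llu -> %llu\n", percent, scale, back);
	return 0;
}

The stored 1024-scale value 205 reads back as 2002, i.e. 20.02%, which the
"%llu.%u" format in cpu_uclamp_print() then renders as "20.2" (see Tejun's
note further down in the thread).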
On 08-Jul 12:08, Quentin Perret wrote:
> Hi Patrick,

Hi Quentin!

> On Monday 08 Jul 2019 at 09:43:53 (+0100), Patrick Bellasi wrote:
> > +static inline int uclamp_scale_from_percent(char *buf, u64 *value)
> > +{
> > +	*value = SCHED_CAPACITY_SCALE;
> > +
> > +	buf = strim(buf);
> > +	if (strncmp("max", buf, 4)) {
> > +		s64 percent;
> > +		int ret;
> > +
> > +		ret = cgroup_parse_float(buf, 2, &percent);
> > +		if (ret)
> > +			return ret;
> > +
> > +		percent <<= SCHED_CAPACITY_SHIFT;
> > +		*value = DIV_ROUND_CLOSEST_ULL(percent, 10000);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static inline u64 uclamp_percent_from_scale(u64 value)
> > +{
> > +	return DIV_ROUND_CLOSEST_ULL(value * 10000, SCHED_CAPACITY_SCALE);
> > +}
>
> FWIW, I tried the patches and realized these conversions result in a
> 'funny' behaviour from a user's perspective. Things like this happen:
>
>	$ echo 20 > cpu.uclamp.min
>	$ cat cpu.uclamp.min
>	20.2
>	$ echo 20.2 > cpu.uclamp.min
>	$ cat cpu.uclamp.min
>	20.21
>
> Having looked at the code, I get why this is happening, but I'm not sure
> if a random user will. It's not an issue per se, but it's just a bit
> weird.

Yes, that's what we get if we need to use a "two decimal digit precision
percentage" to represent a 1024 range in kernel space. I don't think the
"percent <=> utilization" conversion code can be made more robust.

The only alternative I see to get back exactly what we write in is to
store the actual request in kernel space, alongside its conversion to the
SCHED_CAPACITY_SCALE value required by the actual scheduler code.

Something along these lines (on top of what we have in this series):

---8<---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ddc5fcd4b9cf..82b28cfa5c3f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7148,40 +7148,35 @@ static void cpu_util_update_eff(struct cgroup_subsys_state *css)
 	}
 }
 
-static inline int uclamp_scale_from_percent(char *buf, u64 *value)
+static inline int uclamp_scale_from_percent(char *buf, s64 *percent, u64 *scale)
 {
-	*value = SCHED_CAPACITY_SCALE;
+	*scale = SCHED_CAPACITY_SCALE;
 
 	buf = strim(buf);
 	if (strncmp("max", buf, 4)) {
-		s64 percent;
 		int ret;
 
-		ret = cgroup_parse_float(buf, 2, &percent);
+		ret = cgroup_parse_float(buf, 2, percent);
 		if (ret)
 			return ret;
 
-		percent <<= SCHED_CAPACITY_SHIFT;
-		*value = DIV_ROUND_CLOSEST_ULL(percent, 10000);
+		*scale = *percent << SCHED_CAPACITY_SHIFT;
+		*scale = DIV_ROUND_CLOSEST_ULL(*scale, 10000);
 	}
 
 	return 0;
 }
 
-static inline u64 uclamp_percent_from_scale(u64 value)
-{
-	return DIV_ROUND_CLOSEST_ULL(value * 10000, SCHED_CAPACITY_SCALE);
-}
-
 static ssize_t cpu_uclamp_min_write(struct kernfs_open_file *of,
 				    char *buf, size_t nbytes,
 				    loff_t off)
 {
 	struct task_group *tg;
 	u64 min_value;
+	s64 percent;
 	int ret;
 
-	ret = uclamp_scale_from_percent(buf, &min_value);
+	ret = uclamp_scale_from_percent(buf, &percent, &min_value);
 	if (ret)
 		return ret;
 	if (min_value > SCHED_CAPACITY_SCALE)
@@ -7197,6 +7192,9 @@ static ssize_t cpu_uclamp_min_write(struct kernfs_open_file *of,
 	/* Update effective clamps to track the most restrictive value */
 	cpu_util_update_eff(of_css(of));
 
+	/* Keep track of the actual requested value */
+	tg->uclamp_pct[UCLAMP_MIN] = percent;
+
 	rcu_read_unlock();
 	mutex_unlock(&uclamp_mutex);
 
@@ -7209,9 +7207,10 @@ static ssize_t cpu_uclamp_max_write(struct kernfs_open_file *of,
 {
 	struct task_group *tg;
 	u64 max_value;
+	s64 percent;
 	int ret;
 
-	ret = uclamp_scale_from_percent(buf, &max_value);
+	ret = uclamp_scale_from_percent(buf, &percent, &max_value);
 	if (ret)
 		return ret;
 	if (max_value > SCHED_CAPACITY_SCALE)
@@ -7227,6 +7226,9 @@ static ssize_t cpu_uclamp_max_write(struct kernfs_open_file *of,
 	/* Update effective clamps to track the most restrictive value */
 	cpu_util_update_eff(of_css(of));
 
+	/* Keep track of the actual requested value */
+	tg->uclamp_pct[UCLAMP_MAX] = percent;
+
 	rcu_read_unlock();
 	mutex_unlock(&uclamp_mutex);
 
@@ -7251,7 +7253,7 @@ static inline void cpu_uclamp_print(struct seq_file *sf,
 		return;
 	}
 
-	percent = uclamp_percent_from_scale(util_clamp);
+	percent = tg->uclamp_pct[clamp_id];
 	percent = div_u64_rem(percent, 100, &rem);
 	seq_printf(sf, "%llu.%u\n", percent, rem);
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0e37f4a4e536..4f9b0c660310 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -395,6 +395,8 @@ struct task_group {
 	struct cfs_bandwidth	cfs_bandwidth;
 
 #ifdef CONFIG_UCLAMP_TASK_GROUP
+	/* The two decimal precision [%] value requested from user-space */
+	unsigned int		uclamp_pct[UCLAMP_CNT];
 	/* Clamp values requested for a task group */
 	struct uclamp_se	uclamp_req[UCLAMP_CNT];
 	/* Effective clamp values used for a task group */
---8<---

> I guess one way to fix this would be to revert back to having a
> 1024-scale for the cgroup interface too ... Though I understand Tejun
> wanted % for consistency with other things.

Yes, that would be another option, which would also keep the per-task and
system-wide APIs aligned with the cgroup one. Although, AFAIU, having two
different APIs is not considered a major issue.

> So, I'm not sure if this is still up for discussion, but in any case I
> wanted to say I support your original idea of using a 1024-scale for the
> cgroups interface, since that would solve the 'issue' above and keeps
> things consistent with the per-task API too.

Right, I'm personally leaning toward either going back to using
SCHED_CAPACITY_SCALE or adding the small change I suggested above.

Tejun, Peter: any preference? Alternative suggestions?

> Thanks,
> Quentin

Cheers,
Patrick
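For reference, a sketch of the user-visible behaviour the uclamp_pct
bookkeeping above aims for, as a hypothetical session assuming the diff is
applied on top of this series:

	$ echo 20.2 > cpu.uclamp.min
	$ cat cpu.uclamp.min
	20.20
	$ echo 20.20 > cpu.uclamp.min
	$ cat cpu.uclamp.min
	20.20

Since the requested percentage is stored verbatim in uclamp_pct[] and echoed
back by cpu_uclamp_print(), reads no longer drift, while the scheduler keeps
using the converted SCHED_CAPACITY_SCALE value internally.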
Hello, Patrick.

On Mon, Jul 08, 2019 at 09:43:53AM +0100, Patrick Bellasi wrote:
> +static inline void cpu_uclamp_print(struct seq_file *sf,
> +				    enum uclamp_id clamp_id)
> +{
> +	struct task_group *tg;
> +	u64 util_clamp;
> +	u64 percent;
> +	u32 rem;
> +
> +	rcu_read_lock();
> +	tg = css_tg(seq_css(sf));
> +	util_clamp = tg->uclamp_req[clamp_id].value;
> +	rcu_read_unlock();
> +
> +	if (util_clamp == SCHED_CAPACITY_SCALE) {
> +		seq_puts(sf, "max\n");
> +		return;
> +	}
> +
> +	percent = uclamp_percent_from_scale(util_clamp);
> +	percent = div_u64_rem(percent, 100, &rem);
> +	seq_printf(sf, "%llu.%u\n", percent, rem);

"%llu.%02u" otherwise 20.01 gets printed as 20.1

Thanks.
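The difference is easy to check with a minimal userspace sketch; printf()
formats like seq_printf() here, and the whole/remainder split mirrors what
div_u64_rem() produces (the variable names are illustrative):

#include <stdio.h>

int main(void)
{
	/* 20.01% stored as the two-decimal integer 2001 */
	unsigned long long whole = 2001 / 100;	/* whole part: 20 */
	unsigned int rem = 2001 % 100;		/* remainder: 1  */

	printf("%llu.%u\n", whole, rem);	/* "20.1"  - ambiguous */
	printf("%llu.%02u\n", whole, rem);	/* "20.01" - correct   */
	return 0;
}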
On 18-Jul 07:52, Tejun Heo wrote:
> Hello, Patrick.
>
> On Mon, Jul 08, 2019 at 09:43:53AM +0100, Patrick Bellasi wrote:
> > +static inline void cpu_uclamp_print(struct seq_file *sf,
> > +				    enum uclamp_id clamp_id)
> > +{
> > +	struct task_group *tg;
> > +	u64 util_clamp;
> > +	u64 percent;
> > +	u32 rem;
> > +
> > +	rcu_read_lock();
> > +	tg = css_tg(seq_css(sf));
> > +	util_clamp = tg->uclamp_req[clamp_id].value;
> > +	rcu_read_unlock();
> > +
> > +	if (util_clamp == SCHED_CAPACITY_SCALE) {
> > +		seq_puts(sf, "max\n");
> > +		return;
> > +	}
> > +
> > +	percent = uclamp_percent_from_scale(util_clamp);
> > +	percent = div_u64_rem(percent, 100, &rem);
> > +	seq_printf(sf, "%llu.%u\n", percent, rem);
>
> "%llu.%02u" otherwise 20.01 gets printed as 20.1

Yup!... good point! :)

Since we have already collected plenty of feedback, I've got a v12 ready
for posting. Maybe you'd better wait for that before going on with the
review.

Thanks,
Patrick
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index a5c845338d6d..1d49426b4c1e 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -951,6 +951,13 @@ controller implements weight and absolute bandwidth limit models for
 normal scheduling policy and absolute bandwidth allocation model for
 realtime scheduling policy.
 
+In all the above models, cycles distribution is defined only on a temporal
+base and it does not account for the frequency at which tasks are executed.
+The (optional) utilization clamping support allows to hint the schedutil
+cpufreq governor about the minimum desired frequency which should always be
+provided by a CPU, as well as the maximum desired frequency, which should not
+be exceeded by a CPU.
+
 WARNING: cgroup2 doesn't yet support control of realtime processes and
 the cpu controller can only be enabled when all RT processes are in
 the root cgroup. Be aware that system management software may already
@@ -1016,6 +1023,29 @@ All time durations are in microseconds.
 	Shows pressure stall information for CPU. See
 	Documentation/accounting/psi.txt for details.
 
+  cpu.uclamp.min
+	A read-write single value file which exists on non-root cgroups.
+	The default is "0", i.e. no utilization boosting.
+
+	The requested minimum utilization as a percentage rational number,
+	e.g. 12.34 for 12.34%.
+
+	This interface allows reading and setting minimum utilization clamp
+	values similar to the sched_setattr(2). This minimum utilization
+	value is used to clamp the task specific minimum utilization clamp.
+
+  cpu.uclamp.max
+	A read-write single value file which exists on non-root cgroups.
+	The default is "max". i.e. no utilization capping
+
+	The requested maximum utilization as a percentage rational number,
+	e.g. 98.76 for 98.76%.
+
+	This interface allows reading and setting maximum utilization clamp
+	values similar to the sched_setattr(2). This maximum utilization
+	value is used to clamp the task specific maximum utilization clamp.
+
+
 Memory
 ------
diff --git a/init/Kconfig b/init/Kconfig
index bf96faf3fe43..68a21188786c 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -903,6 +903,28 @@ config RT_GROUP_SCHED
 
 endif #CGROUP_SCHED
 
+config UCLAMP_TASK_GROUP
+	bool "Utilization clamping per group of tasks"
+	depends on CGROUP_SCHED
+	depends on UCLAMP_TASK
+	default n
+	help
+	  This feature enables the scheduler to track the clamped utilization
+	  of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
+
+	  When this option is enabled, the user can specify a min and max
+	  CPU bandwidth which is allowed for each single task in a group.
+	  The max bandwidth allows to clamp the maximum frequency a task
+	  can use, while the min bandwidth allows to define a minimum
+	  frequency a task will always use.
+
+	  When task group based utilization clamping is enabled, an eventually
+	  specified task-specific clamp value is constrained by the cgroup
+	  specified clamp value. Both minimum and maximum task clamping cannot
+	  be bigger than the corresponding clamping defined at task group level.
+
+	  If in doubt, say N.
+
 config CGROUP_PIDS
 	bool "PIDs controller"
 	help
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fa43ce3962e7..17ebdaaf7cd9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1149,8 +1149,12 @@ static void __init init_uclamp(void)
 
 	/* System defaults allow max clamp values for both indexes */
 	uclamp_se_set(&uc_max, uclamp_none(UCLAMP_MAX), false);
-	for_each_clamp_id(clamp_id)
+	for_each_clamp_id(clamp_id) {
 		uclamp_default[clamp_id] = uc_max;
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+		root_task_group.uclamp_req[clamp_id] = uc_max;
+#endif
+	}
 }
 
 #else /* CONFIG_UCLAMP_TASK */
@@ -6725,6 +6729,19 @@ void ia64_set_curr_task(int cpu, struct task_struct *p)
 /* task_group_lock serializes the addition/removal of task groups */
 static DEFINE_SPINLOCK(task_group_lock);
 
+static inline void alloc_uclamp_sched_group(struct task_group *tg,
+					    struct task_group *parent)
+{
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+	int clamp_id;
+
+	for_each_clamp_id(clamp_id) {
+		uclamp_se_set(&tg->uclamp_req[clamp_id],
+			      uclamp_none(clamp_id), false);
+	}
+#endif
+}
+
 static void sched_free_group(struct task_group *tg)
 {
 	free_fair_sched_group(tg);
@@ -6748,6 +6765,8 @@ struct task_group *sched_create_group(struct task_group *parent)
 	if (!alloc_rt_sched_group(tg, parent))
 		goto err;
 
+	alloc_uclamp_sched_group(tg, parent);
+
 	return tg;
 
 err:
@@ -6968,6 +6987,118 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
 		sched_move_task(task);
 }
 
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+static inline int uclamp_scale_from_percent(char *buf, u64 *value)
+{
+	*value = SCHED_CAPACITY_SCALE;
+
+	buf = strim(buf);
+	if (strncmp("max", buf, 4)) {
+		s64 percent;
+		int ret;
+
+		ret = cgroup_parse_float(buf, 2, &percent);
+		if (ret)
+			return ret;
+
+		percent <<= SCHED_CAPACITY_SHIFT;
+		*value = DIV_ROUND_CLOSEST_ULL(percent, 10000);
+	}
+
+	return 0;
+}
+
+static inline u64 uclamp_percent_from_scale(u64 value)
+{
+	return DIV_ROUND_CLOSEST_ULL(value * 10000, SCHED_CAPACITY_SCALE);
+}
+
+static ssize_t cpu_uclamp_min_write(struct kernfs_open_file *of,
+				    char *buf, size_t nbytes,
+				    loff_t off)
+{
+	struct task_group *tg;
+	u64 min_value;
+	int ret;
+
+	ret = uclamp_scale_from_percent(buf, &min_value);
+	if (ret)
+		return ret;
+	if (min_value > SCHED_CAPACITY_SCALE)
+		return -ERANGE;
+
+	rcu_read_lock();
+
+	tg = css_tg(of_css(of));
+	if (tg->uclamp_req[UCLAMP_MIN].value != min_value)
+		uclamp_se_set(&tg->uclamp_req[UCLAMP_MIN], min_value, false);
+
+	rcu_read_unlock();
+
+	return nbytes;
+}
+
+static ssize_t cpu_uclamp_max_write(struct kernfs_open_file *of,
+				    char *buf, size_t nbytes,
+				    loff_t off)
+{
+	struct task_group *tg;
+	u64 max_value;
+	int ret;
+
+	ret = uclamp_scale_from_percent(buf, &max_value);
+	if (ret)
+		return ret;
+	if (max_value > SCHED_CAPACITY_SCALE)
+		return -ERANGE;
+
+	rcu_read_lock();
+
+	tg = css_tg(of_css(of));
+	if (tg->uclamp_req[UCLAMP_MAX].value != max_value)
+		uclamp_se_set(&tg->uclamp_req[UCLAMP_MAX], max_value, false);
+
+	rcu_read_unlock();
+
+	return nbytes;
+}
+
+static inline void cpu_uclamp_print(struct seq_file *sf,
+				    enum uclamp_id clamp_id)
+{
+	struct task_group *tg;
+	u64 util_clamp;
+	u64 percent;
+	u32 rem;
+
+	rcu_read_lock();
+	tg = css_tg(seq_css(sf));
+	util_clamp = tg->uclamp_req[clamp_id].value;
+	rcu_read_unlock();
+
+	if (util_clamp == SCHED_CAPACITY_SCALE) {
+		seq_puts(sf, "max\n");
+		return;
+	}
+
+	percent = uclamp_percent_from_scale(util_clamp);
+	percent = div_u64_rem(percent, 100, &rem);
+	seq_printf(sf, "%llu.%u\n", percent, rem);
+}
+
+static int cpu_uclamp_min_show(struct seq_file *sf, void *v)
+{
+	cpu_uclamp_print(sf, UCLAMP_MIN);
+	return 0;
+}
+
+static int cpu_uclamp_max_show(struct seq_file *sf, void *v)
+{
+	cpu_uclamp_print(sf, UCLAMP_MAX);
+	return 0;
+}
+#endif /* CONFIG_UCLAMP_TASK_GROUP */
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
 				struct cftype *cftype, u64 shareval)
@@ -7312,6 +7443,20 @@ static struct cftype cpu_legacy_files[] = {
 		.read_u64 = cpu_rt_period_read_uint,
 		.write_u64 = cpu_rt_period_write_uint,
 	},
+#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+	{
+		.name = "uclamp.min",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cpu_uclamp_min_show,
+		.write = cpu_uclamp_min_write,
+	},
+	{
+		.name = "uclamp.max",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cpu_uclamp_max_show,
+		.write = cpu_uclamp_max_write,
+	},
 #endif
 	{ }	/* Terminate */
 };
@@ -7479,6 +7624,20 @@ static struct cftype cpu_files[] = {
 		.seq_show = cpu_max_show,
 		.write = cpu_max_write,
 	},
+#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+	{
+		.name = "uclamp.min",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cpu_uclamp_min_show,
+		.write = cpu_uclamp_min_write,
+	},
+	{
+		.name = "uclamp.max",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cpu_uclamp_max_show,
+		.write = cpu_uclamp_max_write,
+	},
 #endif
 	{ }	/* terminate */
 };
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 802b1f3405f2..3723037ea80d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -393,6 +393,12 @@ struct task_group {
 #endif
 
 	struct cfs_bandwidth	cfs_bandwidth;
+
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+	/* Clamp values requested for a task group */
+	struct uclamp_se	uclamp_req[UCLAMP_CNT];
+#endif
+
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
The cgroup CPU bandwidth controller allows assigning a specified (maximum)
bandwidth to the tasks of a group. However, this bandwidth is defined and
enforced only on a temporal basis, without considering the actual frequency
a CPU is running at. Thus, the amount of computation completed by a task
within an allocated bandwidth can vary widely depending on the frequency
the CPU runs at while executing that task. The amount of computation is
also affected by the specific CPU the task runs on, especially on
asymmetric-capacity systems like Arm's big.LITTLE.

With the availability of schedutil, the scheduler is now able to drive
frequency selection based on actual task utilization. Moreover, the
utilization clamping support provides a mechanism to bias the frequency
selection operated by schedutil depending on constraints assigned to the
tasks currently RUNNABLE on a CPU.

Given the mechanisms described above, it is now possible to extend the cpu
controller to specify the minimum (or maximum) utilization which should be
considered for tasks RUNNABLE on a CPU. This makes it possible to better
define the actual computational power assigned to task groups, thus
improving the cgroup CPU bandwidth controller, which is currently based
just on time constraints.

Extend the CPU controller with a couple of new attributes, uclamp.{min,max},
which allow enforcing utilization boosting and capping for all the tasks in
a group. Specifically:

- uclamp.min: defines the minimum utilization which should be considered,
  i.e. the RUNNABLE tasks of this group will run at least at the minimum
  frequency which corresponds to the uclamp.min utilization

- uclamp.max: defines the maximum utilization which should be considered,
  i.e. the RUNNABLE tasks of this group will run up to the maximum
  frequency which corresponds to the uclamp.max utilization

These attributes:

a) are available only on non-root nodes, on both the default and legacy
   hierarchies, while system-wide clamps are defined by a generic interface
   which does not depend on cgroups. This system-wide interface enforces
   constraints on tasks in the root node.

b) enforce effective constraints at each level of the hierarchy, which are
   a restriction of the group's requests considering its parent's effective
   constraints. Root group effective constraints are defined by the
   system-wide interface. This mechanism allows each (non-root) level of
   the hierarchy to:
   - request whatever clamp values it would like to get
   - effectively get only up to the maximum amount allowed by its parent

c) have higher priority than task-specific clamps defined via
   sched_setattr(), thus making it possible to control and restrict task
   requests.

Add two new attributes to the cpu controller to collect "requested" clamp
values. Allow that at each non-root level of the hierarchy. Validate local
consistency by enforcing uclamp.min < uclamp.max. Keep it simple by not yet
handling "effective" value computation and propagation along the hierarchy.
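A hypothetical session sketching the interface these attributes add; the
unified-hierarchy mount point /sys/fs/cgroup and the group name "app" are
illustrative assumptions, not taken from the patch:

	# enable the cpu controller for child groups
	$ echo "+cpu" > /sys/fs/cgroup/cgroup.subtree_control
	$ mkdir /sys/fs/cgroup/app

	# run the group's tasks at a frequency corresponding to at least
	# 25% utilization and at most 75% utilization
	$ echo 25 > /sys/fs/cgroup/app/cpu.uclamp.min
	$ echo 75 > /sys/fs/cgroup/app/cpu.uclamp.max

	$ cat /sys/fs/cgroup/app/cpu.uclamp.min
	25.0
	$ cat /sys/fs/cgroup/app/cpu.uclamp.max
	75.0

25% and 75% map exactly onto the 1024 scale (256 and 768 respectively), so
they read back unchanged, sidestepping the rounding discussed earlier in
the thread.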
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>

---
Changes in v11:
Message-ID: <20190624175215.GR657710@devbig004.ftw2.facebook.com>
- remove checks for cpu_uclamp_{min,max}_write() from root group
- remove enforcement for "protection" being smaller than "limits"
- rephrase uclamp extension description to avoid explicit mentioning of
  the bandwidth concept
---
 Documentation/admin-guide/cgroup-v2.rst |  30 +++++
 init/Kconfig                            |  22 ++++
 kernel/sched/core.c                     | 161 +++++++++++++++++++++++-
 kernel/sched/sched.h                    |   6 +
 4 files changed, 218 insertions(+), 1 deletion(-)