Message ID: 20190515094459.10317-13-patrick.bellasi@arm.com (mailing list archive)
State:      Not Applicable, archived
Series:     Add utilization clamping support
Hello, Patrick.

On Wed, May 15, 2019 at 10:44:55AM +0100, Patrick Bellasi wrote:
> Extend the CPU controller with a couple of new attributes util.{min,max} which allow enforcing utilization boosting and capping for all the tasks in a group. Specifically:
>
> - util.min: defines the minimum utilization which should be considered, i.e. the RUNNABLE tasks of this group will run at least at a minimum frequency which corresponds to the util.min utilization
>
> - util.max: defines the maximum utilization which should be considered, i.e. the RUNNABLE tasks of this group will run up to a maximum frequency which corresponds to the util.max utilization

Let's please use a prefix which is more specific. It's clamping the utilization estimates of the member tasks, which in turn affect scheduling / frequency decisions, but cpu.util.max reads like it's gonna limit the cpu utilization directly. Maybe just use uclamp?

> These attributes:
>
> a) are available only for non-root nodes, both on default and legacy hierarchies, while system wide clamps are defined by a generic interface which does not depend on cgroups. This system wide interface enforces constraints on tasks in the root node.

I'd much prefer if they weren't entangled this way. The system wide limits should work the same regardless of cgroup's existence. cgroup can put further restrictions on top, but mere creation of cgroups with the cpu controller enabled shouldn't take them out of the system-wide limits.

> b) enforce effective constraints at each level of the hierarchy which are a restriction of the group requests considering its parent's effective constraints. Root group effective constraints are defined by the system wide interface.
>    This mechanism allows each (non-root) level of the hierarchy to:
>    - request whatever clamp values it would like to get
>    - effectively get only up to the maximum amount allowed by its parent

I'll come back to this later.

> c) have higher priority than task-specific clamps, defined via sched_setattr(), thus allowing to control and restrict task requests

This sounds good.

> Add two new attributes to the cpu controller to collect "requested" clamp values. Allow that at each non-root level of the hierarchy. Validate local consistency by enforcing util.min <= util.max. Keep it simple by not caring, for now, about "effective" values computation and propagation along the hierarchy.

So, the following is what we're doing for hierarchical protection and limit propagation.

* Limits (high / max) default to max. Protections (low / min) default to 0. A new cgroup by default doesn't constrain itself further and doesn't have any protection.

* A limit defines the upper ceiling for the subtree. If an ancestor has a limit of X, none of its descendants can have more than X.

* A protection defines the upper ceiling of protections for the subtree. If an ancestor has a protection of X, none of its descendants can have more protection than X. Note that there's no way for an ancestor to enforce protection on its descendants. It can only allow them to claim some. This is intentional, as the alternative is descendants losing the ability to further distribute protections as they see fit.

For proportions (as opposed to weights), we use percentage rational numbers - e.g. 38.44 for 38.44%. I have parser and doc update commits pending. I'll put them on cgroup/for-5.3.

Thanks.
On 31-May 08:35, Tejun Heo wrote:
> Hello, Patrick.

Hi Tejun!

> On Wed, May 15, 2019 at 10:44:55AM +0100, Patrick Bellasi wrote:
> > Extend the CPU controller with a couple of new attributes util.{min,max} which allow enforcing utilization boosting and capping for all the tasks in a group. Specifically:
> >
> > - util.min: defines the minimum utilization which should be considered, i.e. the RUNNABLE tasks of this group will run at least at a minimum frequency which corresponds to the util.min utilization
> >
> > - util.max: defines the maximum utilization which should be considered, i.e. the RUNNABLE tasks of this group will run up to a maximum frequency which corresponds to the util.max utilization
>
> Let's please use a prefix which is more specific. It's clamping the utilization estimates of the member tasks, which in turn affect scheduling / frequency decisions, but cpu.util.max reads like it's gonna limit the cpu utilization directly. Maybe just use uclamp?

Doesn't being too specific risk exposing implementation details?

If that's not a problem and Peter likes:

   cpu.uclamp.{min,max}

that's ok with me.
On 31-May 08:35, Tejun Heo wrote:

[...]

> > These attributes:
> >
> > a) are available only for non-root nodes, both on default and legacy hierarchies, while system wide clamps are defined by a generic interface which does not depend on cgroups. This system wide interface enforces constraints on tasks in the root node.
>
> I'd much prefer if they weren't entangled this way. The system wide limits should work the same regardless of cgroup's existence. cgroup can put further restrictions on top, but mere creation of cgroups with the cpu controller enabled shouldn't take them out of the system-wide limits.

That's correct, and what you describe matches, at least in its intent, the current implementation provided in:

   [PATCH v9 14/16] sched/core: uclamp: Propagate system defaults to root group
   https://lore.kernel.org/lkml/20190515094459.10317-15-patrick.bellasi@arm.com/

System clamps always work the same way, independently from cgroups: they define the upper bound for both min and max clamps. When cgroups are not available, task specific clamps are always capped by system clamps.

When cgroups are available, the root task group clamps are capped by the system clamps, which affects its "effective" clamps and propagates them down the hierarchy to the children's "effective" clamps. That's done in:

   [PATCH v9 13/16] sched/core: uclamp: Propagate parent clamps
   https://lore.kernel.org/lkml/20190515094459.10317-14-patrick.bellasi@arm.com/

Example 1
---------

Here is an example of system and groups clamps aggregation:

                            min   max
   system defaults          400   600

   cg_name                  min  min.effective   max  max.effective
   /uclamp                 1024            400   500            500
   /uclamp/app              512            400   512            500
   /uclamp/app/pid_smalls   100            100   200            200
   /uclamp/app/pid_bigs     500            400   700            500

The ".effective" clamps are used to define the actual clamp value to apply to tasks, according to the aggregation rules defined in:

   [PATCH v9 15/16] sched/core: uclamp: Use TG's clamps to restrict TASK's clamps
   https://lore.kernel.org/lkml/20190515094459.10317-16-patrick.bellasi@arm.com/

All the above, to me, means that:
 - cgroups are always capped by system clamps
 - cgroups can further restrict system clamps

Does that match your view?

> > b) enforce effective constraints at each level of the hierarchy which are a restriction of the group requests considering its parent's effective constraints. Root group effective constraints are defined by the system wide interface.
> >    This mechanism allows each (non-root) level of the hierarchy to:
> >    - request whatever clamp values it would like to get
> >    - effectively get only up to the maximum amount allowed by its parent
>
> I'll come back to this later.
>
> > c) have higher priority than task-specific clamps, defined via sched_setattr(), thus allowing to control and restrict task requests
>
> This sounds good.
>
> > Add two new attributes to the cpu controller to collect "requested" clamp values. Allow that at each non-root level of the hierarchy. Validate local consistency by enforcing util.min <= util.max. Keep it simple by not caring, for now, about "effective" values computation and propagation along the hierarchy.
>
> So, the following is what we're doing for hierarchical protection and limit propagation.
>
> * Limits (high / max) default to max. Protections (low / min) default to 0. A new cgroup by default doesn't constrain itself further and doesn't have any protection.
Example 2
---------

Let's say we have:

   /tg1:
      util_min=200 (as a protection)
      util_max=800 (as a limit)

The moment we create a subgroup /tg1/tg11, in v9 it is initialized with the same limits _and protections_ of its parent:

   /tg1/tg11:
      util_min=200 (protection inherited from /tg1)
      util_max=800 (limit inherited from /tg1)

Do you mean that we should have instead:

   /tg1/tg11:
      util_min=0   (no protection by default at creation time)
      util_max=800 (limit inherited from /tg1)

i.e. we need to reset the protection of a newly created subgroup?

> * A limit defines the upper ceiling for the subtree. If an ancestor has a limit of X, none of its descendants can have more than X.

That's correct, however we distinguish between "requested" and "effective" values.

Example 3
---------

We can have:

   cg_name                max  max.effective
   /uclamp/app            400            400
   /uclamp/app/pid_bigs   500            400

which means that a subgroup can "request" a limit (max=500) higher than its parent (max=400), while still getting only up to what its parent allows (max.effective=400).

Example 4
---------

Tracking the actually requested limit (max=500) is useful so it can be enforced once the parent's limit is relaxed. For example, we will then have:

   cg_name                max  max.effective
   /uclamp/app            600            600
   /uclamp/app/pid_bigs   500            500

where a subgroup gets no more than what it has been configured for.

This is the logic implemented by cpu_util_update_eff() in:

   [PATCH v9 13/16] sched/core: uclamp: Propagate parent clamps
   https://lore.kernel.org/lkml/20190515094459.10317-14-patrick.bellasi@arm.com/

> * A protection defines the upper ceiling of protections for the subtree. If an ancestor has a protection of X, none of its descendants can have more protection than X.

Right, that's the current behavior in v9.

> Note that there's no way for an ancestor to enforce protection on its descendants. It can only allow them to claim some. This is intentional, as the alternative is descendants losing the ability to further distribute protections as they see fit.

Ok, that means I need to update in v10 the initialization of subgroup min clamps to be none by default, as discussed in Example 2 above, right?

[...]

Cheers, Patrick
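[The "requested" vs "effective" aggregation illustrated in Examples 3 and 4 can be sketched as a small standalone model. This is not the actual cpu_util_update_eff() code from the series; the struct and helper names below are illustrative only.]

   #include <stdio.h>

   /* Simplified model of per-group clamp state; not the kernel's struct task_group. */
   struct tg {
   	unsigned int max_req;	/* "requested" limit, written by the user      */
   	unsigned int max_eff;	/* "effective" limit, capped by the ancestors  */
   	struct tg *parent;
   };

   /*
    * Recompute the effective limit of @t; a real implementation would walk
    * the whole subtree whenever @t or one of its ancestors changes.
    */
   static void update_eff(struct tg *t)
   {
   	unsigned int ceiling = t->parent ? t->parent->max_eff : 1024;

   	/* A group never gets more than it asked for (Example 4) and never
   	 * more than its parent's effective value allows (Example 3). */
   	t->max_eff = t->max_req < ceiling ? t->max_req : ceiling;
   }

   int main(void)
   {
   	struct tg app      = { .max_req = 400, .parent = NULL };
   	struct tg pid_bigs = { .max_req = 500, .parent = &app  };

   	update_eff(&app);
   	update_eff(&pid_bigs);
   	printf("%u\n", pid_bigs.max_eff);	/* 400: capped by the parent  */

   	app.max_req = 600;			/* parent limit is relaxed    */
   	update_eff(&app);
   	update_eff(&pid_bigs);
   	printf("%u\n", pid_bigs.max_eff);	/* 500: own request now wins  */
   	return 0;
   }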
On 31-May 08:35, Tejun Heo wrote:
> Hello, Patrick.
>
> On Wed, May 15, 2019 at 10:44:55AM +0100, Patrick Bellasi wrote:

[...]

> For proportions (as opposed to weights), we use percentage rational numbers - e.g. 38.44 for 38.44%. I have parser and doc update commits pending. I'll put them on cgroup/for-5.3.

That's a point worth discussing with Peter; we already changed once from percentages to the 1024 scale.

Utilization clamps are expressed as percentages by definition; they are just encoded in a convenient 1024 scale which should not be alien to people using those knobs.

If we wanna use a "more specific" name like uclamp.{min,max}, then we should probably also accept to use a "more specific" metric, don't we?

I personally like the [0..1024] range, but I guess that's really up to you and Peter to agree upon.

> Thanks.
>
> --
> tejun

Cheers, Patrick
Hello,

On Mon, Jun 03, 2019 at 01:27:25PM +0100, Patrick Bellasi wrote:
> All the above, to me, means that:
>  - cgroups are always capped by system clamps
>  - cgroups can further restrict system clamps
>
> Does that match your view?

Yeah, as long as what's defined at the system level clamps everything in the system, whether they're in cgroups or not, it's all good.

> > * Limits (high / max) default to max. Protections (low / min) default to 0. A new cgroup by default doesn't constrain itself further and doesn't have any protection.
>
> Example 2
> ---------
>
> Let's say we have:
>
>    /tg1:
>       util_min=200 (as a protection)
>       util_max=800 (as a limit)
>
> The moment we create a subgroup /tg1/tg11, in v9 it is initialized with the same limits _and protections_ of its parent:
>
>    /tg1/tg11:
>       util_min=200 (protection inherited from /tg1)
>       util_max=800 (limit inherited from /tg1)
>
> Do you mean that we should have instead:
>
>    /tg1/tg11:
>       util_min=0   (no protection by default at creation time)
>       util_max=800 (limit inherited from /tg1)
>
> i.e. we need to reset the protection of a newly created subgroup?

The default value for limits should be max, for protections 0. Don't inherit config values from the parent. That gets confusing super fast, because then both what the parent config is set to and when each child is created play into the overall configuration. Hierarchical confinements should always be enforced, and a new cgroup should always start afresh in terms of its own configuration.

> > * A limit defines the upper ceiling for the subtree. If an ancestor has a limit of X, none of its descendants can have more than X.
>
> That's correct, however we distinguish between "requested" and "effective" values.

Sure, all property propagating controllers should.

> > Note that there's no way for an ancestor to enforce protection on its descendants. It can only allow them to claim some. This is intentional, as the alternative is descendants losing the ability to further distribute protections as they see fit.
>
> Ok, that means I need to update in v10 the initialization of subgroup min clamps to be none by default, as discussed in the above Example 2, right?

Yeah, and max to max.

Thanks.
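[Against the v9 diff below, the change being asked for amounts to initializing the "requested" values to neutral defaults instead of copying the parent's requests. A rough, untested sketch of how alloc_uclamp_sched_group() could look after that change — effective-value bookkeeping is left to the later patches in the series, and the parent is then only used via those effective values:]

   static inline void alloc_uclamp_sched_group(struct task_group *tg,
   					    struct task_group *parent)
   {
   #ifdef CONFIG_UCLAMP_TASK_GROUP
   	int clamp_id;

   	for_each_clamp_id(clamp_id) {
   		/*
   		 * A new group starts afresh: no protection requested
   		 * (uclamp_none(UCLAMP_MIN) == 0) and no limit requested
   		 * (uclamp_none(UCLAMP_MAX) == SCHED_CAPACITY_SCALE).
   		 */
   		uclamp_se_set(&tg->uclamp_req[clamp_id],
   			      uclamp_none(clamp_id), false);
   	}
   	/*
   	 * The parent still confines the child via the "effective" values,
   	 * recomputed hierarchically, not via the requested ones.
   	 */
   #endif
   }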
Hello,

On Mon, Jun 03, 2019 at 01:29:29PM +0100, Patrick Bellasi wrote:
> On 31-May 08:35, Tejun Heo wrote:
> > Hello, Patrick.
> >
> > On Wed, May 15, 2019 at 10:44:55AM +0100, Patrick Bellasi wrote:
>
> [...]
>
> > For proportions (as opposed to weights), we use percentage rational numbers - e.g. 38.44 for 38.44%. I have parser and doc update commits pending. I'll put them on cgroup/for-5.3.
>
> That's a point worth discussing with Peter; we already changed once from percentages to the 1024 scale.

cgroup tries to use uniform units for its interface files as much as possible, even when that deviates from non-cgroup interfaces. We can bikeshed the pros and cons of that design choice for sure, but I don't think it makes sense to deviate from it at this point unless there are really strong reasons to do so.

> Utilization clamps are expressed as percentages by definition; they are just encoded in a convenient 1024 scale which should not be alien to people using those knobs.
>
> If we wanna use a "more specific" name like uclamp.{min,max}, then we should probably also accept to use a "more specific" metric, don't we?

Heh, this actually made me chuckle. It's an interesting bargaining take, but I don't think the same word being in two different places makes them tradable entities. We can go into the weeds with the semantics, but how about us using an alternative adjective, "misleading", for the cpu.util.min/max names to short-circuit that?

Thanks.
On 05-Jun 07:03, Tejun Heo wrote:
> Hello,

Hi!

> On Mon, Jun 03, 2019 at 01:27:25PM +0100, Patrick Bellasi wrote:
> > All the above, to me, means that:
> >  - cgroups are always capped by system clamps
> >  - cgroups can further restrict system clamps
> >
> > Does that match your view?
>
> Yeah, as long as what's defined at the system level clamps everything in the system, whether they're in cgroups or not, it's all good.

Right, then we are good with v9 on this point.

> > > * Limits (high / max) default to max. Protections (low / min) default to 0. A new cgroup by default doesn't constrain itself further and doesn't have any protection.
> >
> > Example 2
> > ---------
> >
> > Let's say we have:
> >
> >    /tg1:
> >       util_min=200 (as a protection)
> >       util_max=800 (as a limit)
> >
> > The moment we create a subgroup /tg1/tg11, in v9 it is initialized with the same limits _and protections_ of its parent:
> >
> >    /tg1/tg11:
> >       util_min=200 (protection inherited from /tg1)
> >       util_max=800 (limit inherited from /tg1)
> >
> > Do you mean that we should have instead:
> >
> >    /tg1/tg11:
> >       util_min=0   (no protection by default at creation time)
> >       util_max=800 (limit inherited from /tg1)
> >
> > i.e. we need to reset the protection of a newly created subgroup?
>
> The default value for limits should be max, for protections 0. Don't inherit config values from the parent. That gets confusing super fast, because then both what the parent config is set to and when each child is created play into the overall configuration. Hierarchical confinements should always be enforced, and a new cgroup should always start afresh in terms of its own configuration.

Got it, so in the example above we will create:

   /tg1/tg11:
      util_min=0    (no requested protection by default at creation time)
      util_max=1024 (no requested limit by default at creation time)

That's it for the "requested" values side, while the "effective" values are enforced by the hierarchical confinement rules since creation time. Which means we will enforce the effective values as:

   /tg1/tg11:
      util_min.effective=0
         i.e. keep the child protection since it's smaller than the parent's
      util_max.effective=800
         i.e. keep the parent limit since it's stricter than the child's

Please shout if I got it wrong, otherwise I'll update v10 to implement the above logic.

> > > * A limit defines the upper ceiling for the subtree. If an ancestor has a limit of X, none of its descendants can have more than X.
> >
> > That's correct, however we distinguish between "requested" and "effective" values.
>
> Sure, all property propagating controllers should.

Right.

> > > Note that there's no way for an ancestor to enforce protection on its descendants. It can only allow them to claim some. This is intentional, as the alternative is descendants losing the ability to further distribute protections as they see fit.
> >
> > Ok, that means I need to update in v10 the initialization of subgroup min clamps to be none by default, as discussed in the above Example 2, right?
>
> Yeah, and max to max.

Right, I've got it now.

> Thanks.

Cheers, Patrick
Hello,

On Wed, Jun 05, 2019 at 03:39:50PM +0100, Patrick Bellasi wrote:
> Which means we will enforce the effective values as:
>
>    /tg1/tg11:
>       util_min.effective=0
>          i.e. keep the child protection since it's smaller than the parent's
>       util_max.effective=800
>          i.e. keep the parent limit since it's stricter than the child's
>
> Please shout if I got it wrong, otherwise I'll update v10 to implement the above logic.

Everything sounds good to me. Please note that cgroup interface files actually use the literal "max" for limit/protection max settings, so that 0 and "max" mean the same things for all limit/protection knobs.

Thanks.
On 05-Jun 07:09, Tejun Heo wrote:
> Hello,

Hi,

> On Mon, Jun 03, 2019 at 01:29:29PM +0100, Patrick Bellasi wrote:
> > On 31-May 08:35, Tejun Heo wrote:
> > > Hello, Patrick.
> > >
> > > On Wed, May 15, 2019 at 10:44:55AM +0100, Patrick Bellasi wrote:
> >
> > [...]
> >
> > > For proportions (as opposed to weights), we use percentage rational numbers - e.g. 38.44 for 38.44%. I have parser and doc update commits pending. I'll put them on cgroup/for-5.3.
> >
> > That's a point worth discussing with Peter; we already changed once from percentages to the 1024 scale.
>
> cgroup tries to use uniform units for its interface files as much as possible, even when that deviates from non-cgroup interfaces. We can bikeshed the pros and cons of that design choice for sure, but I don't think it makes sense to deviate from it at this point unless there are really strong reasons to do so.

That makes sense to me; having a uniform interface certainly has value.

The only additional point I can think of as a (slightly) stronger reason is that I guess we would like to have the same API for cgroups as well as for the task specific and the system wide settings.

The task specific values come in via the sched_setattr() syscall:

   [PATCH v9 06/16] sched/core: uclamp: Extend sched_setattr() to support utilization clamping
   https://lore.kernel.org/lkml/20190515094459.10317-7-patrick.bellasi@arm.com/

where we need to encode each clamp into a __u32 value.

System wide settings are exposed similarly to these:

   grep '' /proc/sys/kernel/sched_*

where we always have integer numbers.

AFAIU your proposal will require using a "scaled percentage" - e.g. 3844 for 38.44% - which however is still not quite the same as writing the string "38.44".

Not sure that's a strong enough argument, is it?

> > Utilization clamps are expressed as percentages by definition; they are just encoded in a convenient 1024 scale which should not be alien to people using those knobs.
> >
> > If we wanna use a "more specific" name like uclamp.{min,max}, then we should probably also accept to use a "more specific" metric, don't we?
>
> Heh, this actually made me chuckle.

:)

> It's an interesting bargaining take, but I don't think the same word being in two different places makes them tradable entities.

Sure, that was not my intention. I was just trying to see if the need to be more specific could be an argument for having also a more specific value.

> We can go into the weeds with the semantics, but how about us using an alternative adjective, "misleading", for the cpu.util.min/max names to short-circuit that?

Not quite sure I get what you mean here. Are you pointing out that with clamps we don't strictly enforce a bandwidth but we just set a bias?

Cheers, Patrick
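[For reference, the translation between the two units under discussion is trivial either way; a small, purely illustrative userspace C sketch of converting between the kernel's [0..1024] capacity scale and the two-decimal percentages cgroup v2 uses elsewhere:]

   #include <stdio.h>

   #define SCHED_CAPACITY_SCALE 1024	/* the kernel's utilization/capacity scale */

   /* [0..1024] utilization -> percentage with two decimals, e.g. 512 -> 50.00 */
   static double util_to_percent(unsigned int util)
   {
   	return util * 100.0 / SCHED_CAPACITY_SCALE;
   }

   /* percentage -> nearest value on the [0..1024] scale, e.g. 38.44 -> 394 */
   static unsigned int percent_to_util(double pct)
   {
   	return (unsigned int)(pct * SCHED_CAPACITY_SCALE / 100.0 + 0.5);
   }

   int main(void)
   {
   	printf("%.2f\n", util_to_percent(512));		/* 50.00 */
   	printf("%u\n", percent_to_util(38.44));		/* 394   */
   	return 0;
   }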
Hello, Patrick.

On Wed, Jun 05, 2019 at 04:06:30PM +0100, Patrick Bellasi wrote:
> The only additional point I can think of as a (slightly) stronger reason is that I guess we would like to have the same API for cgroups as well as for the task specific and the system wide settings.
>
> The task specific values come in via the sched_setattr() syscall:
>
>    [PATCH v9 06/16] sched/core: uclamp: Extend sched_setattr() to support utilization clamping
>    https://lore.kernel.org/lkml/20190515094459.10317-7-patrick.bellasi@arm.com/
>
> where we need to encode each clamp into a __u32 value.
>
> System wide settings are exposed similarly to these:
>
>    grep '' /proc/sys/kernel/sched_*
>
> where we always have integer numbers.
>
> AFAIU your proposal will require using a "scaled percentage" - e.g. 3844 for 38.44% - which however is still not quite the same as writing the string "38.44".
>
> Not sure that's a strong enough argument, is it?

It definitely is an argument, but the thing is that the units we use in kernel APIs are all over the place anyway. Even for something as simple as sizes, we use bytes, 512 byte sectors, kilobytes and pages all over the place - some for good reasons (as you mentioned above) and others for historical / random ones. So, I'm generally not too concerned about units differing between the cgroup interface and, say, a syscall interface. That ship has sailed a long while ago and we have to deal with it everywhere anyway (in many cases there isn't even a good unit to pick for compatibility because the existing interfaces are already mixing units heavily). As long as the translation is trivial, it isn't a big issue.

Note that some translations are not trivial. For example, the sched nice value mapping to weight has a separate unit matching knob for that reason.

> > We can go into the weeds with the semantics, but how about us using an alternative adjective, "misleading", for the cpu.util.min/max names to short-circuit that?
>
> Not quite sure I get what you mean here. Are you pointing out that with clamps we don't strictly enforce a bandwidth but we just set a bias?

It's just that "util" is already used a lot, and cpu.util.max reads like it should cap cpu utilization (wallclock based) to 80%, and it's likely that it'd read that way to many other folks too. A more distinctive name signals that it isn't something that obvious.

Thanks.
On 05-Jun 07:44, Tejun Heo wrote:
> Hello,

Hi,

> On Wed, Jun 05, 2019 at 03:39:50PM +0100, Patrick Bellasi wrote:
> > Which means we will enforce the effective values as:
> >
> >    /tg1/tg11:
> >       util_min.effective=0
> >          i.e. keep the child protection since it's smaller than the parent's
> >       util_max.effective=800
> >          i.e. keep the parent limit since it's stricter than the child's
> >
> > Please shout if I got it wrong, otherwise I'll update v10 to implement the above logic.
>
> Everything sounds good to me. Please note that cgroup interface files actually use the literal "max" for limit/protection max settings, so that 0 and "max" mean the same things for all limit/protection knobs.

Lemme see if I've got it right, do you mean that we can:

 1) write the _string_ "max" into a cgroup attribute to:

    - set 0 for util_max, since it's a protection
    - set 1024 for util_min, since it's a limit

 2) write the _string_ "0" into a cgroup attribute to:

    - set 1024 for util_max, since it's a protection
    - set 0 for util_min, since it's a limit

Is that correct, or is it just me being totally confused?

> Thanks.
>
> --
> tejun

Cheers, Patrick
Hello, Patrick.

On Wed, Jun 05, 2019 at 04:37:43PM +0100, Patrick Bellasi wrote:
> > Everything sounds good to me. Please note that cgroup interface files actually use the literal "max" for limit/protection max settings, so that 0 and "max" mean the same things for all limit/protection knobs.
>
> Lemme see if I've got it right, do you mean that we can:
>
>  1) write the _string_ "max" into a cgroup attribute to:
>
>     - set 0 for util_max, since it's a protection
>     - set 1024 for util_min, since it's a limit
>
>  2) write the _string_ "0" into a cgroup attribute to:
>
>     - set 1024 for util_max, since it's a protection
>     - set 0 for util_min, since it's a limit
>
> Is that correct, or is it just me being totally confused?

Heh, sorry about not being clearer. "max" just means the numerically highest possible config for the config knob, so in your case, "max" would always map to 1024.

Thanks.
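[To make the "max" token concrete: the write handlers in the patch below are still write_u64 based, so accepting the literal would mean moving to string-based writes. A kernel-flavoured sketch of just the parsing step — illustrative only, not a handler from this series:]

   /*
    * Sketch only: accept either the literal "max" or a plain number
    * in [0, SCHED_CAPACITY_SCALE] from a cgroup write buffer.
    */
   static int uclamp_parse_value(const char *buf, unsigned int *value)
   {
   	if (sysfs_streq(buf, "max")) {
   		/* "max" == the numerically highest config, i.e. 1024 here */
   		*value = SCHED_CAPACITY_SCALE;
   		return 0;
   	}
   	if (kstrtouint(buf, 10, value) || *value > SCHED_CAPACITY_SCALE)
   		return -EINVAL;
   	return 0;
   }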
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 20f92c16ffbf..3a940bfe4e8c 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -909,6 +909,12 @@ controller implements weight and absolute bandwidth limit models for
 normal scheduling policy and absolute bandwidth allocation model for
 realtime scheduling policy.
 
+Cycles distribution is based, by default, on a temporal base and it
+does not account for the frequency at which tasks are executed.
+The (optional) utilization clamping support allows to enforce a minimum
+bandwidth, which should always be provided by a CPU, and a maximum bandwidth,
+which should never be exceeded by a CPU.
+
 WARNING: cgroup2 doesn't yet support control of realtime processes and
 the cpu controller can only be enabled when all RT processes are in
 the root cgroup. Be aware that system management software may already
@@ -974,6 +980,27 @@ All time durations are in microseconds.
 	Shows pressure stall information for CPU. See
 	Documentation/accounting/psi.txt for details.
 
+  cpu.util.min
+	A read-write single value file which exists on non-root cgroups.
+	The default is "0", i.e. no utilization boosting.
+
+	The requested minimum utilization in the range [0, 1024].
+
+	This interface allows reading and setting minimum utilization clamp
+	values similar to the sched_setattr(2). This minimum utilization
+	value is used to clamp the task specific minimum utilization clamp.
+
+  cpu.util.max
+	A read-write single value file which exists on non-root cgroups.
+	The default is "1024". i.e. no utilization capping
+
+	The requested maximum utilization in the range [0, 1024].
+
+	This interface allows reading and setting maximum utilization clamp
+	values similar to the sched_setattr(2). This maximum utilization
+	value is used to clamp the task specific maximum utilization clamp.
+
+
 Memory
 ------
 
diff --git a/init/Kconfig b/init/Kconfig
index 8e103505456a..5617742b97e5 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -894,6 +894,28 @@ config RT_GROUP_SCHED
 
 endif #CGROUP_SCHED
 
+config UCLAMP_TASK_GROUP
+	bool "Utilization clamping per group of tasks"
+	depends on CGROUP_SCHED
+	depends on UCLAMP_TASK
+	default n
+	help
+	  This feature enables the scheduler to track the clamped utilization
+	  of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
+
+	  When this option is enabled, the user can specify a min and max
+	  CPU bandwidth which is allowed for each single task in a group.
+	  The max bandwidth allows to clamp the maximum frequency a task
+	  can use, while the min bandwidth allows to define a minimum
+	  frequency a task will always use.
+
+	  When task group based utilization clamping is enabled, an eventually
+	  specified task-specific clamp value is constrained by the cgroup
+	  specified clamp value. Both minimum and maximum task clamping cannot
+	  be bigger than the corresponding clamping defined at task group level.
+
+	  If in doubt, say N.
+
 config CGROUP_PIDS
 	bool "PIDs controller"
 	help
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index eed7664437ac..19437257a08d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1137,8 +1137,12 @@ static void __init init_uclamp(void)
 
 	/* System defaults allow max clamp values for both indexes */
 	uclamp_se_set(&uc_max, uclamp_none(UCLAMP_MAX), false);
-	for_each_clamp_id(clamp_id)
+	for_each_clamp_id(clamp_id) {
 		uclamp_default[clamp_id] = uc_max;
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+		root_task_group.uclamp_req[clamp_id] = uc_max;
+#endif
+	}
 }
 
 #else /* CONFIG_UCLAMP_TASK */
@@ -6695,6 +6699,17 @@ void ia64_set_curr_task(int cpu, struct task_struct *p)
 /* task_group_lock serializes the addition/removal of task groups */
 static DEFINE_SPINLOCK(task_group_lock);
 
+static inline void alloc_uclamp_sched_group(struct task_group *tg,
+					    struct task_group *parent)
+{
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+	int clamp_id;
+
+	for_each_clamp_id(clamp_id)
+		tg->uclamp_req[clamp_id] = parent->uclamp_req[clamp_id];
+#endif
+}
+
 static void sched_free_group(struct task_group *tg)
 {
 	free_fair_sched_group(tg);
@@ -6718,6 +6733,8 @@ struct task_group *sched_create_group(struct task_group *parent)
 	if (!alloc_rt_sched_group(tg, parent))
 		goto err;
 
+	alloc_uclamp_sched_group(tg, parent);
+
 	return tg;
 
 err:
@@ -6938,6 +6955,96 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
 	sched_move_task(task);
 }
 
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
+				  struct cftype *cftype, u64 min_value)
+{
+	struct task_group *tg;
+	int ret = 0;
+
+	if (min_value > SCHED_CAPACITY_SCALE)
+		return -ERANGE;
+
+	rcu_read_lock();
+
+	tg = css_tg(css);
+	if (tg == &root_task_group) {
+		ret = -EINVAL;
+		goto out;
+	}
+	if (tg->uclamp_req[UCLAMP_MIN].value == min_value)
+		goto out;
+	if (tg->uclamp_req[UCLAMP_MAX].value < min_value) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	uclamp_se_set(&tg->uclamp_req[UCLAMP_MIN], min_value, false);
+
+out:
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
+				  struct cftype *cftype, u64 max_value)
+{
+	struct task_group *tg;
+	int ret = 0;
+
+	if (max_value > SCHED_CAPACITY_SCALE)
+		return -ERANGE;
+
+	rcu_read_lock();
+
+	tg = css_tg(css);
+	if (tg == &root_task_group) {
+		ret = -EINVAL;
+		goto out;
+	}
+	if (tg->uclamp_req[UCLAMP_MAX].value == max_value)
+		goto out;
+	if (tg->uclamp_req[UCLAMP_MIN].value > max_value) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	uclamp_se_set(&tg->uclamp_req[UCLAMP_MAX], max_value, false);
+
+out:
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
+				  enum uclamp_id clamp_id)
+{
+	struct task_group *tg;
+	u64 util_clamp;
+
+	rcu_read_lock();
+	tg = css_tg(css);
+	util_clamp = tg->uclamp_req[clamp_id].value;
+	rcu_read_unlock();
+
+	return util_clamp;
+}
+
+static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css,
+				 struct cftype *cft)
+{
+	return cpu_uclamp_read(css, UCLAMP_MIN);
+}
+
+static u64 cpu_util_max_read_u64(struct cgroup_subsys_state *css,
+				 struct cftype *cft)
+{
+	return cpu_uclamp_read(css, UCLAMP_MAX);
+}
+#endif /* CONFIG_UCLAMP_TASK_GROUP */
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
 				struct cftype *cftype, u64 shareval)
@@ -7282,6 +7389,18 @@ static struct cftype cpu_legacy_files[] = {
 		.read_u64 = cpu_rt_period_read_uint,
 		.write_u64 = cpu_rt_period_write_uint,
 	},
+#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+	{
+		.name = "util.min",
+		.read_u64 = cpu_util_min_read_u64,
+		.write_u64 = cpu_util_min_write_u64,
+	},
+	{
+		.name = "util.max",
+		.read_u64 = cpu_util_max_read_u64,
+		.write_u64 = cpu_util_max_write_u64,
+	},
 #endif
 	{ }	/* Terminate */
 };
@@ -7449,6 +7568,20 @@ static struct cftype cpu_files[] = {
 		.seq_show = cpu_max_show,
 		.write = cpu_max_write,
 	},
+#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+	{
+		.name = "util.min",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_util_min_read_u64,
+		.write_u64 = cpu_util_min_write_u64,
+	},
+	{
+		.name = "util.max",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_util_max_read_u64,
+		.write_u64 = cpu_util_max_write_u64,
+	},
 #endif
 	{ }	/* terminate */
 };
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6335cfcc81ba..fd31527fdcc8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -399,6 +399,12 @@ struct task_group {
 #endif
 
 	struct cfs_bandwidth	cfs_bandwidth;
+
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+	/* Clamp values requested for a task group */
+	struct uclamp_se	uclamp_req[UCLAMP_CNT];
+#endif
+
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
The cgroup CPU bandwidth controller allows to assign a specified (maximum) bandwidth to the tasks of a group. However this bandwidth is defined and enforced only on a temporal base, without considering the actual frequency a CPU is running on. Thus, the amount of computation completed by a task within an allocated bandwidth can be very different depending on the actual frequency the CPU is running that task at. The amount of computation can be affected also by the specific CPU a task is running on, especially when running on asymmetric capacity systems like Arm's big.LITTLE.

With the availability of schedutil, the scheduler is now able to drive frequency selections based on actual task utilization. Moreover, the utilization clamping support provides a mechanism to bias the frequency selection operated by schedutil depending on constraints assigned to the tasks currently RUNNABLE on a CPU.

Given the mechanisms described above, it is now possible to extend the cpu controller to specify the minimum (or maximum) utilization which should be considered for tasks RUNNABLE on a cpu. This makes it possible to better define the actual computational power assigned to task groups, thus improving the cgroup CPU bandwidth controller which is currently based just on time constraints.

Extend the CPU controller with a couple of new attributes util.{min,max} which allow enforcing utilization boosting and capping for all the tasks in a group. Specifically:

- util.min: defines the minimum utilization which should be considered, i.e. the RUNNABLE tasks of this group will run at least at a minimum frequency which corresponds to the util.min utilization

- util.max: defines the maximum utilization which should be considered, i.e. the RUNNABLE tasks of this group will run up to a maximum frequency which corresponds to the util.max utilization

These attributes:

a) are available only for non-root nodes, both on default and legacy hierarchies, while system wide clamps are defined by a generic interface which does not depend on cgroups. This system wide interface enforces constraints on tasks in the root node.

b) enforce effective constraints at each level of the hierarchy which are a restriction of the group requests considering its parent's effective constraints. Root group effective constraints are defined by the system wide interface.
   This mechanism allows each (non-root) level of the hierarchy to:
   - request whatever clamp values it would like to get
   - effectively get only up to the maximum amount allowed by its parent

c) have higher priority than task-specific clamps, defined via sched_setattr(), thus allowing to control and restrict task requests

Add two new attributes to the cpu controller to collect "requested" clamp values. Allow that at each non-root level of the hierarchy. Validate local consistency by enforcing util.min <= util.max. Keep it simple by not caring, for now, about "effective" values computation and propagation along the hierarchy.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>

---
Changes in v9:

Message-ID: <20190507114232.npsvba4itex5qpvl@e110439-lin>
- make alloc_uclamp_sched_group() a void function

Message-ID: <20190508190011.GB32547@worktop.programming.kicks-ass.net>
- use for_each_clamp_id() and uclamp_se_set() to make code less fragile
---
 Documentation/admin-guide/cgroup-v2.rst |  27 +++++
 init/Kconfig                            |  22 ++++
 kernel/sched/core.c                     | 135 +++++++++++++++++++++++-
 kernel/sched/sched.h                    |   6 ++
 4 files changed, 189 insertions(+), 1 deletion(-)
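[As a rough numeric illustration of the boosting/capping effect the changelog describes — the clamp is applied to the utilization seen by the frequency governor. This is a standalone sketch, not schedutil code; the ~1.25 headroom factor mirrors the usual "util * 1.25 / capacity" frequency formula but is an assumption here, as are the CPU figures.]

   #include <stdio.h>

   #define SCHED_CAPACITY_SCALE 1024

   /* Boost low-utilization tasks up to util.min, cap busy ones at util.max. */
   static unsigned int clamp_util(unsigned int util, unsigned int umin, unsigned int umax)
   {
   	if (util < umin)
   		return umin;
   	if (util > umax)
   		return umax;
   	return util;
   }

   /* Schedutil-style pick: ~1.25x headroom over the (clamped) utilization. */
   static unsigned int pick_freq(unsigned int max_freq_khz, unsigned int util)
   {
   	return (max_freq_khz + (max_freq_khz >> 2)) * util / SCHED_CAPACITY_SCALE;
   }

   int main(void)
   {
   	/* Hypothetical 2 GHz CPU, group clamped to [200, 800]. */
   	unsigned int max_freq = 2000000;	/* kHz */
   	unsigned int umin = 200, umax = 800;

   	/* A mostly idle task still gets boosted to ~488 MHz... */
   	printf("%u kHz\n", pick_freq(max_freq, clamp_util(50, umin, umax)));
   	/* ...while a fully busy task is capped at ~1.95 GHz. */
   	printf("%u kHz\n", pick_freq(max_freq, clamp_util(1000, umin, umax)));
   	return 0;
   }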