
[v7,01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

Message ID 20190208100554.32196-2-patrick.bellasi@arm.com (mailing list archive)
State Not Applicable, archived
Series Add utilization clamping support

Commit Message

Patrick Bellasi Feb. 8, 2019, 10:05 a.m. UTC
Utilization clamping allows clamping the CPU's utilization within a
[util_min, util_max] range, depending on the set of RUNNABLE tasks on
that CPU. Each task references two "clamp buckets" defining its minimum
and maximum (util_{min,max}) utilization "clamp values". A CPU's clamp
bucket is active if there is at least one RUNNABLE task enqueued on
that CPU and refcounting that bucket.

When a task is {en,de}queued {on,from} a rq, the set of active clamp
buckets on that CPU can change. Since each clamp bucket enforces a
different utilization clamp value, when the set of active clamp buckets
changes, a new "aggregated" clamp value is computed for that CPU.

Clamp values are always MAX aggregated for both util_min and util_max.
This ensures that no tasks can affect the performance of other
co-scheduled tasks which are more boosted (i.e. with higher util_min
clamp) or less capped (i.e. with higher util_max clamp).

Each task has a:
   task_struct::uclamp[clamp_id]::bucket_id
to track the "bucket index" of the CPU's clamp bucket it refcounts while
enqueued, for each clamp index (clamp_id).

Each CPU's rq has a:
   rq::uclamp[clamp_id]::bucket[bucket_id].tasks
to track how many tasks, currently RUNNABLE on that CPU, refcount each
clamp bucket (bucket_id) of a clamp index (clamp_id).

Each CPU's rq has also a:
   rq::uclamp[clamp_id]::bucket[bucket_id].value
to track the clamp value of each clamp bucket (bucket_id) of a clamp
index (clamp_id).
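
Taken together, the per-task and per-rq state described above boils
down to a few small structures (a simplified sketch: field names follow
the description above, while the exact types and bitfield packing used
by the actual patch may differ):

/* Per-task clamp bucket reference, one instance per clamp_id. */
struct uclamp_se {
	unsigned int value;		/* requested clamp value */
	unsigned int bucket_id;		/* pre-computed bucket index */
};

/* Per-rq refcounting of RUNNABLE tasks for one clamp bucket. */
struct uclamp_bucket {
	unsigned int value;		/* clamp value tracked by this bucket */
	unsigned int tasks;		/* RUNNABLE tasks refcounting this bucket */
};

/* Per-rq clamp state, one instance per clamp_id. */
struct uclamp_rq {
	unsigned int value;		/* MAX aggregated clamp value */
	struct uclamp_bucket bucket[UCLAMP_BUCKETS];
};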

The rq::uclamp::bucket[clamp_id][] array is scanned every time we need
to find a new MAX aggregated clamp value for a clamp_id. This operation
is required only when we dequeue the last task of a clamp bucket
tracking the current MAX aggregated clamp value. In these cases, the CPU
is either entering IDLE or going to schedule a less boosted or more
clamped task.
The expected number of different clamp values, configured at build time,
is small enough to fit the full unordered array into a single cache
line.
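
In code, that scan is the uclamp_rq_update() helper added by this patch
(quoted in full in the review below); in simplified form:

static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
{
	struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
	unsigned int max_value = uclamp_none(clamp_id);
	unsigned int bucket_id = UCLAMP_BUCKETS;

	/* The topmost bucket with RUNNABLE tasks defines the rq's clamp. */
	do {
		--bucket_id;
		if (!bucket[bucket_id].tasks)
			continue;
		max_value = bucket[bucket_id].value;
		break;
	} while (bucket_id);

	WRITE_ONCE(rq->uclamp[clamp_id].value, max_value);
}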

Add the basic data structures required to refcount, in each CPU's rq,
the number of RUNNABLE tasks for each clamp bucket. Add also the max
aggregation required to update the rq's clamp value at each
enqueue/dequeue event.

Use a simple linear mapping of clamp values into clamp buckets.
Pre-compute and cache bucket_id to avoid integer divisions at
enqueue/dequeue time.
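
For reference, these are the v7 mapping helpers (the same code quoted
in the review below; the rounding used for UCLAMP_BUCKET_DELTA is one
of the points discussed there):

/* Integer ceil-rounded range for each bucket */
#define UCLAMP_BUCKET_DELTA ((SCHED_CAPACITY_SCALE / UCLAMP_BUCKETS) + 1)

static inline unsigned int uclamp_bucket_id(unsigned int clamp_value)
{
	return clamp_value / UCLAMP_BUCKET_DELTA;
}

static inline unsigned int uclamp_bucket_value(unsigned int clamp_value)
{
	return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
}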

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>

---
Changes in v7:
 Message-ID: <20190123191007.GG17749@hirez.programming.kicks-ass.net>
 - removed buckets mapping code
 - use a simpler linear mapping of clamp values into buckets
 Message-ID: <20190124161443.lv2pw5fsspyelckq@e110439-lin>
 - move this patch at the beginning of the series,
   in the attempt to make the overall series easier to digest by moving
   at the very beginning the core bits and main data structures
 Others:
 - update the mapping logic to use exactly and only
   UCLAMP_BUCKETS_COUNT buckets, i.e. no more "special" bucket
 - update uclamp_rq_update() to do top-bottom max search
---
 include/linux/log2.h           |  37 ++++++++
 include/linux/sched.h          |  39 ++++++++
 include/linux/sched/topology.h |   6 --
 init/Kconfig                   |  53 +++++++++++
 kernel/sched/core.c            | 165 +++++++++++++++++++++++++++++++++
 kernel/sched/sched.h           |  59 +++++++++++-
 6 files changed, 350 insertions(+), 9 deletions(-)

Comments

Dietmar Eggemann March 12, 2019, 12:52 p.m. UTC | #1
On 2/8/19 11:05 AM, Patrick Bellasi wrote:

[...]

> +config UCLAMP_BUCKETS_COUNT
> +	int "Number of supported utilization clamp buckets"
> +	range 5 20
> +	default 5
> +	depends on UCLAMP_TASK
> +	help
> +	  Defines the number of clamp buckets to use. The range of each bucket
> +	  will be SCHED_CAPACITY_SCALE/UCLAMP_BUCKETS_COUNT. The higher the
> +	  number of clamp buckets the finer their granularity and the higher
> +	  the precision of clamping aggregation and tracking at run-time.
> +
> +	  For example, with the default configuration we will have 5 clamp
> +	  buckets tracking 20% utilization each. A 25% boosted tasks will be
> +	  refcounted in the [20..39]% bucket and will set the bucket clamp
> +	  effective value to 25%.
> +	  If a second 30% boosted task should be co-scheduled on the same CPU,
> +	  that task will be refcounted in the same bucket of the first task and
> +	  it will boost the bucket clamp effective value to 30%.
> +	  The clamp effective value of a bucket is reset to its nominal value
> +	  (20% in the example above) when there are anymore tasks refcounted in

this sounds weird.

[...]

> +static inline unsigned int uclamp_bucket_value(unsigned int clamp_value)
> +{
> +	return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
> +}

Something like uclamp_bucket_nominal_value() should be clearer.

> +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
> +{
> +	struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
> +	unsigned int max_value = uclamp_none(clamp_id);
> +	unsigned int bucket_id;

unsigned int bucket_id = UCLAMP_BUCKETS;

> +
> +	/*
> +	 * Both min and max clamps are MAX aggregated, thus the topmost
> +	 * bucket with some tasks defines the rq's clamp value.
> +	 */
> +	bucket_id = UCLAMP_BUCKETS;

to get rid of this line?

> +	do {
> +		--bucket_id;
> +		if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks)

if (!bucket[bucket_id].tasks)

[...]

> +/*
> + * When a task is enqueued on a rq, the clamp bucket currently defined by the
> + * task's uclamp::bucket_id is reference counted on that rq. This also
> + * immediately updates the rq's clamp value if required.
> + *
> + * Since tasks know their specific value requested from user-space, we track
> + * within each bucket the maximum value for tasks refcounted in that bucket.
> + * This provide a further aggregation (local clamping) which allows to track

s/This provide/This provides

> + * within each bucket the exact "requested" clamp value whenever all tasks
> + * RUNNABLE in that bucket require the same clamp.
> + */
> +static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
> +				    unsigned int clamp_id)
> +{
> +	unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> +	unsigned int rq_clamp, bkt_clamp, tsk_clamp;

Wouldn't it be easier to have a pointer to the task's and rq's uclamp 
structure as well to the bucket?

-       unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
+       struct uclamp_se *uc_se = &p->uclamp[clamp_id];
+       struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
+       struct uclamp_bucket *bucket = &uc_rq->bucket[uc_se->bucket_id];

The code in uclamp_rq_inc_id() and uclamp_rq_dec_id() for example 
becomes much more readable.
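
For illustration, with those three locals the enqueue path could be
reshaped roughly as follows (a sketch of the suggested refactor, not
code from the posted patch):

static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
				    unsigned int clamp_id)
{
	struct uclamp_se *uc_se = &p->uclamp[clamp_id];
	struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
	struct uclamp_bucket *bucket = &uc_rq->bucket[uc_se->bucket_id];
	unsigned int rq_clamp;

	bucket->tasks++;

	/* Local clamping: track the max "requested" clamp in this bucket. */
	bucket->value = max(bucket->value, uc_se->value);

	rq_clamp = READ_ONCE(uc_rq->value);
	WRITE_ONCE(uc_rq->value, max(rq_clamp, uc_se->value));
}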

[...]

>   struct sched_class {
>   	const struct sched_class *next;
>   
> +#ifdef CONFIG_UCLAMP_TASK
> +	int uclamp_enabled;
> +#endif
> +
>   	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
>   	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
> -	void (*yield_task)   (struct rq *rq);
> -	bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);
>   
>   	void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);
>   
> @@ -1685,7 +1734,6 @@ struct sched_class {
>   	void (*set_curr_task)(struct rq *rq);
>   	void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
>   	void (*task_fork)(struct task_struct *p);
> -	void (*task_dead)(struct task_struct *p);
>   
>   	/*
>   	 * The switched_from() call is allowed to drop rq->lock, therefore we
> @@ -1702,12 +1750,17 @@ struct sched_class {
>   
>   	void (*update_curr)(struct rq *rq);
>   
> +	void (*yield_task)   (struct rq *rq);
> +	bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);
> +
>   #define TASK_SET_GROUP		0
>   #define TASK_MOVE_GROUP		1
>   
>   #ifdef CONFIG_FAIR_GROUP_SCHED
>   	void (*task_change_group)(struct task_struct *p, int type);
>   #endif
> +
> +	void (*task_dead)(struct task_struct *p);

Why do you move yield_task, yield_to_task and task_dead here?

[...]
Peter Zijlstra March 12, 2019, 3:20 p.m. UTC | #2
On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> +/* Integer ceil-rounded range for each bucket */
> +#define UCLAMP_BUCKET_DELTA ((SCHED_CAPACITY_SCALE / UCLAMP_BUCKETS) + 1)

Uhm, should that not be ((x+y-1)/y), aka. DIV_ROUND_UP(x,y) ?

The above would give 4 for 9/3, which is clearly buggered.
Patrick Bellasi March 12, 2019, 3:50 p.m. UTC | #3
On 12-Mar 16:20, Peter Zijlstra wrote:
> On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> > +/* Integer ceil-rounded range for each bucket */
> > +#define UCLAMP_BUCKET_DELTA ((SCHED_CAPACITY_SCALE / UCLAMP_BUCKETS) + 1)
> 
> Uhm, should that not me ((x+y-1)/y), aka. DIV_ROUND_UP(x,y) ?

Well, there is certainly some rounding to be done...

> The above would give 4 for 9/3, which is clearly buggered.

.. still the math above should work fine within the boundaries we
define for UCLAMP_BUCKET_DELTA (5..20 groups) and considering that
SCHED_CAPACITY_SCALE will never be smaller than 1024.

The above is designed to shrink the topmost bucket wrt all the others,
but it will never shrink by more than ~30%.

Here are the start values computed for each bucket using the math
above and the computed shrinking percentage for the topmost bucket:

bukets size: 205, top bucket start@820 (err:  0.49%), buckets: {0: 0, 1: 205, 2: 410, 3: 615, 4: 820}
bukets size: 171, top bucket start@855 (err:  1.17%), buckets: {0: 0, 1: 171, 2: 342, 3: 513, 4: 684, 5: 855}
bukets size: 147, top bucket start@882 (err:  3.40%), buckets: {0: 0, 1: 147, 2: 294, 3: 441, 4: 588, 5: 735, 6: 882}
bukets size: 129, top bucket start@903 (err:  6.20%), buckets: {0: 0, 1: 129, 2: 258, 3: 387, 4: 516, 5: 645, 6: 774, 7: 903}
bukets size: 114, top bucket start@912 (err:  1.75%), buckets: {0: 0, 1: 114, 2: 228, 3: 342, 4: 456, 5: 570, 6: 684, 7: 798, 8: 912}
bukets size: 103, top bucket start@927 (err:  5.83%), buckets: {0: 0, 1: 103, 2: 206, 3: 309, 4: 412, 5: 515, 6: 618, 7: 721, 8: 824, 9: 927}
bukets size:  94, top bucket start@940 (err: 10.64%), buckets: {0: 0, 1:  94, 2: 188, 3: 282, 4: 376, 5: 470, 6: 564, 7: 658, 8: 752, 9: 846, 10: 940}
bukets size:  86, top bucket start@946 (err:  9.30%), buckets: {0: 0, 1:  86, 2: 172, 3: 258, 4: 344, 5: 430, 6: 516, 7: 602, 8: 688, 9: 774, 10: 860, 11: 946}
bukets size:  79, top bucket start@948 (err:  3.80%), buckets: {0: 0, 1:  79, 2: 158, 3: 237, 4: 316, 5: 395, 6: 474, 7: 553, 8: 632, 9: 711, 10: 790, 11: 869, 12: 948}
bukets size:  74, top bucket start@962 (err: 16.22%), buckets: {0: 0, 1:  74, 2: 148, 3: 222, 4: 296, 5: 370, 6: 444, 7: 518, 8: 592, 9: 666, 10: 740, 11: 814, 12: 888, 13: 962}
bukets size:  69, top bucket start@966 (err: 15.94%), buckets: {0: 0, 1:  69, 2: 138, 3: 207, 4: 276, 5: 345, 6: 414, 7: 483, 8: 552, 9: 621, 10: 690, 11: 759, 12: 828, 13: 897, 14: 966}
bukets size:  65, top bucket start@975 (err: 24.62%), buckets: {0: 0, 1:  65, 2: 130, 3: 195, 4: 260, 5: 325, 6: 390, 7: 455, 8: 520, 9: 585, 10: 650, 11: 715, 12: 780, 13: 845, 14: 910, 15: 975}
bukets size:  61, top bucket start@976 (err: 21.31%), buckets: {0: 0, 1:  61, 2: 122, 3: 183, 4: 244, 5: 305, 6: 366, 7: 427, 8: 488, 9: 549, 10: 610, 11: 671, 12: 732, 13: 793, 14: 854, 15: 915, 16: 976}
bukets size:  57, top bucket start@969 (err:  3.51%), buckets: {0: 0, 1:  57, 2: 114, 3: 171, 4: 228, 5: 285, 6: 342, 7: 399, 8: 456, 9: 513, 10: 570, 11: 627, 12: 684, 13: 741, 14: 798, 15: 855, 16: 912, 17: 969}
bukets size:  54, top bucket start@972 (err:  3.70%), buckets: {0: 0, 1:  54, 2: 108, 3: 162, 4: 216, 5: 270, 6: 324, 7: 378, 8: 432, 9: 486, 10: 540, 11: 594, 12: 648, 13: 702, 14: 756, 15: 810, 16: 864, 17: 918, 18: 972}
bukets size:  52, top bucket start@988 (err: 30.77%), buckets: {0: 0, 1:  52, 2: 104, 3: 156, 4: 208, 5: 260, 6: 312, 7: 364, 8: 416, 9: 468, 10: 520, 11: 572, 12: 624, 13: 676, 14: 728, 15: 780, 16: 832, 17: 884, 18: 936, 19: 988}

Does that make sense?
Peter Zijlstra March 13, 2019, 8:19 a.m. UTC | #4
On Tue, Mar 12, 2019 at 03:50:43PM +0000, Patrick Bellasi wrote:
> On 12-Mar 16:20, Peter Zijlstra wrote:
> > On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> > > +/* Integer ceil-rounded range for each bucket */

          ^^^^^^^^^^^^^^^^^^^^^^^^^^^

> > > +#define UCLAMP_BUCKET_DELTA ((SCHED_CAPACITY_SCALE / UCLAMP_BUCKETS) + 1)

                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

simply do not match.

> > Uhm, should that not me ((x+y-1)/y), aka. DIV_ROUND_UP(x,y) ?
> 
> Well, there is certainly some rounding to be done...
> 
> > The above would give 4 for 9/3, which is clearly buggered.
> 
> .. still the math above should work fine within the boundaries we
> define for UCLAMP_BUCKET_DELTA (5..20 groups) and considering that
> SCHED_CAPACITY_SCALE will never be smaller then 1024.

That's a very poor reason to write utter nonsense :-)

> The above is designed to shrink the topmost bucket wrt all the others
> but it will never be smaller than ~30%.

30% sounds like a lot, esp. for this range.

> Here are the start values computed for each bucket using the math
> above and the computed shrinking percentage for the topmost bucket:

If you use a regular rounding, the error is _much_ smaller:

$ for ((x=5;x<21;x++)) ; do let d=(1024+x/2)/x; let s=(x-1)*d; let e=1024-s; let p=100*(d-e)/d; echo $x $d $s $e $p%; done
5 205 820 204 0%
6 171 855 169 1%
7 146 876 148 -1%
8 128 896 128 0%
9 114 912 112 1%
10 102 918 106 -3%
11 93 930 94 -1%
12 85 935 89 -4%
13 79 948 76 3%
14 73 949 75 -2%
15 68 952 72 -5%
16 64 960 64 0%
17 60 960 64 -6%
18 57 969 55 3%
19 54 972 52 3%
20 51 969 55 -7%

Funnily enough, we have a helper for that too: DIV_ROUND_CLOSEST().

Now, if we go further, the error will obviously increase because we run
out of precision, but even there, regular rounding will be better than
either floor or ceil.
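
A standalone userspace sketch (not kernel code) of the same comparison,
printing the per-bucket delta and the resulting size of the topmost
bucket for the posted floor+1, DIV_ROUND_UP and DIV_ROUND_CLOSEST
variants:

#include <stdio.h>

#define SCALE	1024	/* SCHED_CAPACITY_SCALE */

int main(void)
{
	for (int n = 5; n <= 20; n++) {
		int floor1 = SCALE / n + 1;		/* v7: floor + 1 */
		int up     = (SCALE + n - 1) / n;	/* DIV_ROUND_UP */
		int close  = (SCALE + n / 2) / n;	/* DIV_ROUND_CLOSEST */

		/* Size of the topmost bucket for each choice of delta. */
		printf("%2d buckets: delta %3d/%3d/%3d  top bucket %3d/%3d/%3d\n",
		       n, floor1, up, close,
		       SCALE - (n - 1) * floor1,
		       SCALE - (n - 1) * up,
		       SCALE - (n - 1) * close);
	}
	return 0;
}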
Patrick Bellasi March 13, 2019, 11:37 a.m. UTC | #5
On 13-Mar 09:19, Peter Zijlstra wrote:
> On Tue, Mar 12, 2019 at 03:50:43PM +0000, Patrick Bellasi wrote:
> > On 12-Mar 16:20, Peter Zijlstra wrote:
> > > On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> > > > +/* Integer ceil-rounded range for each bucket */
> 
>           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> > > > +#define UCLAMP_BUCKET_DELTA ((SCHED_CAPACITY_SCALE / UCLAMP_BUCKETS) + 1)
> 
>                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> simply do not match.

Right, those don't match when UCLAMP_BUCKETS is a divisor of
SCHED_CAPACITY_SCALE, i.e. when we use 8 or 16 buckets.

> > > Uhm, should that not me ((x+y-1)/y), aka. DIV_ROUND_UP(x,y) ?
> > 
> > Well, there is certainly some rounding to be done...
> > 
> > > The above would give 4 for 9/3, which is clearly buggered.
> > 
> > .. still the math above should work fine within the boundaries we
> > define for UCLAMP_BUCKET_DELTA (5..20 groups) and considering that
> > SCHED_CAPACITY_SCALE will never be smaller then 1024.
> 
> That's a very poor reason to write utter nonsense :-)
> 
> > The above is designed to shrink the topmost bucket wrt all the others
> > but it will never be smaller than ~30%.
> 
> 30% sounds like a lot, esp. for this range.

Well, that 30% is really just ~16 utilization units on a scale of 1024
when buckets have a size of 52.

Still, yes, we can argue that's big but that's also the same error
generated by DIV_ROUND_UP() when UCLAMP_BUCKETS is not 8 or 16.

> > Here are the start values computed for each bucket using the math
> > above and the computed shrinking percentage for the topmost bucket:
> 
> If you use a regular rounding, the error is _much_ smaller:
> 
> $ for ((x=5;x<21;x++)) ; do let d=(1024+x/2)/x; let s=(x-1)*d; let e=1024-s; let p=100*(d-e)/d; echo $x $d $s $e $p%; done
                                    ^^^^^^^^^^^^^
> 5 205 820 204 0%
> 6 171 855 169 1%
> 7 146 876 148 -1%
> 8 128 896 128 0%
> 9 114 912 112 1%
> 10 102 918 106 -3%
> 11 93 930 94 -1%
> 12 85 935 89 -4%
> 13 79 948 76 3%
> 14 73 949 75 -2%
> 15 68 952 72 -5%
> 16 64 960 64 0%
> 17 60 960 64 -6%
> 18 57 969 55 3%
> 19 54 972 52 3%
> 20 51 969 55 -7%
> 
> Funnily enough, we have a helper for that too: DIV_ROUND_CLOSEST().
                                                 ^^^^^^^^^^^^^^^^^^^^
This is different than DIV_ROUND_UP() and actually better across the
full range.

> Now, if we go further, the error will obviously increase because we run
> out of precision, but even there, regular rounding will be better than
> either floor or ceil.

I don't think we will have to cover other values in the future but I
agree that this "closest rounding" is definitely better.

Thanks for spotting it, will update in v8.
Peter Zijlstra March 13, 2019, 1:40 p.m. UTC | #6
On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> +static inline unsigned int uclamp_bucket_id(unsigned int clamp_value)
> +{
> +	return clamp_value / UCLAMP_BUCKET_DELTA;
> +}
> +
> +static inline unsigned int uclamp_bucket_value(unsigned int clamp_value)
> +{
> +	return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);

	return clamp_value - (clamp_value % UCLAMP_BUCKET_DELTA);

might generate better code; just a single division, instead of a div and
mult.

> +}
> +
> +static inline unsigned int uclamp_none(int clamp_id)
> +{
> +	if (clamp_id == UCLAMP_MIN)
> +		return 0;
> +	return SCHED_CAPACITY_SCALE;
> +}
> +
> +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
> +{
> +	struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
> +	unsigned int max_value = uclamp_none(clamp_id);
> +	unsigned int bucket_id;
> +
> +	/*
> +	 * Both min and max clamps are MAX aggregated, thus the topmost
> +	 * bucket with some tasks defines the rq's clamp value.
> +	 */
> +	bucket_id = UCLAMP_BUCKETS;
> +	do {
> +		--bucket_id;
> +		if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks)
> +			continue;
> +		max_value = bucket[bucket_id].value;
> +		break;

If you flip the if condition the code will be nicer.

> +	} while (bucket_id);

But you can also use a for loop:

	for (i = UCLAMP_BUCKETS-1; i>=0; i--) {
		if (rq->uclamp[clamp_id].bucket[i].tasks) {
			max_value = bucket[i].value;
			break;
		}
	}

> +	WRITE_ONCE(rq->uclamp[clamp_id].value, max_value);
> +}
Peter Zijlstra March 13, 2019, 1:52 p.m. UTC | #7
On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> +/*
> + * When a task is enqueued on a rq, the clamp bucket currently defined by the
> + * task's uclamp::bucket_id is reference counted on that rq. This also
> + * immediately updates the rq's clamp value if required.
> + *
> + * Since tasks know their specific value requested from user-space, we track
> + * within each bucket the maximum value for tasks refcounted in that bucket.
> + * This provide a further aggregation (local clamping) which allows to track
> + * within each bucket the exact "requested" clamp value whenever all tasks
> + * RUNNABLE in that bucket require the same clamp.
> + */
> +static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
> +				    unsigned int clamp_id)
> +{
> +	unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> +	unsigned int rq_clamp, bkt_clamp, tsk_clamp;
> +
> +	rq->uclamp[clamp_id].bucket[bucket_id].tasks++;
> +
> +	/*
> +	 * Local clamping: rq's buckets always track the max "requested"
> +	 * clamp value from all RUNNABLE tasks in that bucket.
> +	 */
> +	tsk_clamp = p->uclamp[clamp_id].value;
> +	bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value;
> +	rq->uclamp[clamp_id].bucket[bucket_id].value = max(bkt_clamp, tsk_clamp);

So, if I read this correct:

 - here we track a max value in a bucket,

> +	rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value);
> +	WRITE_ONCE(rq->uclamp[clamp_id].value, max(rq_clamp, tsk_clamp));
> +}
> +
> +/*
> + * When a task is dequeued from a rq, the clamp bucket reference counted by
> + * the task is released. If this is the last task reference counting the rq's
> + * max active clamp value, then the rq's clamp value is updated.
> + * Both the tasks reference counter and the rq's cached clamp values are
> + * expected to be always valid, if we detect they are not we skip the updates,
> + * enforce a consistent state and warn.
> + */
> +static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
> +				    unsigned int clamp_id)
> +{
> +	unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> +	unsigned int rq_clamp, bkt_clamp;
> +
> +	SCHED_WARN_ON(!rq->uclamp[clamp_id].bucket[bucket_id].tasks);
> +	if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
> +		rq->uclamp[clamp_id].bucket[bucket_id].tasks--;
> +
> +	/*
> +	 * Keep "local clamping" simple and accept to (possibly) overboost
> +	 * still RUNNABLE tasks in the same bucket.
> +	 */
> +	if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
> +		return;

(Oh man, I hope that generates semi sane code; long live CSE passes I
suppose)

But we never decrement that bkt_clamp value on dequeue.

> +	bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value;
> +
> +	/* The rq's clamp value is expected to always track the max */
> +	rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value);
> +	SCHED_WARN_ON(bkt_clamp > rq_clamp);
> +	if (bkt_clamp >= rq_clamp) {

head hurts, this reads ==, how can this ever not be so?

> +		/*
> +		 * Reset rq's clamp bucket value to its nominal value whenever
> +		 * there are anymore RUNNABLE tasks refcounting it.

-ENOPARSE

> +		 */
> +		rq->uclamp[clamp_id].bucket[bucket_id].value =
> +			uclamp_bucket_value(rq_clamp);

But basically you decrement the bucket value to the nominal value.

> +		uclamp_rq_update(rq, clamp_id);
> +	}
> +}

Given all that, what is to stop the bucket value to climbing to
uclamp_bucket_value(+1)-1 and staying there (provided there's someone
runnable)?

Why are we doing this... ?
Peter Zijlstra March 13, 2019, 2:06 p.m. UTC | #8
On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> +static void __init init_uclamp(void)
> +{
> +	unsigned int clamp_id;
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu)
> +		memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
> +

Is that really needed? Doesn't rq come from .bss ?
Peter Zijlstra March 13, 2019, 2:09 p.m. UTC | #9
On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> +static inline unsigned int uclamp_none(int clamp_id)
> +{
> +	if (clamp_id == UCLAMP_MIN)
> +		return 0;
> +	return SCHED_CAPACITY_SCALE;
> +}
> +
> +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
> +{
> +	struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
> +	unsigned int max_value = uclamp_none(clamp_id);

That's 1024 for uclamp_max

> +	unsigned int bucket_id;
> +
> +	/*
> +	 * Both min and max clamps are MAX aggregated, thus the topmost
> +	 * bucket with some tasks defines the rq's clamp value.
> +	 */
> +	bucket_id = UCLAMP_BUCKETS;
> +	do {
> +		--bucket_id;
> +		if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks)
> +			continue;
> +		max_value = bucket[bucket_id].value;

but this will then _lower_ it. That's not a MAX aggregate.

> +		break;
> +	} while (bucket_id);
> +
> +	WRITE_ONCE(rq->uclamp[clamp_id].value, max_value);
> +}
Patrick Bellasi March 13, 2019, 3:15 p.m. UTC | #10
On 12-Mar 13:52, Dietmar Eggemann wrote:
> On 2/8/19 11:05 AM, Patrick Bellasi wrote:
> 
> [...]
> 
> > +config UCLAMP_BUCKETS_COUNT
> > +	int "Number of supported utilization clamp buckets"
> > +	range 5 20
> > +	default 5
> > +	depends on UCLAMP_TASK
> > +	help
> > +	  Defines the number of clamp buckets to use. The range of each bucket
> > +	  will be SCHED_CAPACITY_SCALE/UCLAMP_BUCKETS_COUNT. The higher the
> > +	  number of clamp buckets the finer their granularity and the higher
> > +	  the precision of clamping aggregation and tracking at run-time.
> > +
> > +	  For example, with the default configuration we will have 5 clamp
> > +	  buckets tracking 20% utilization each. A 25% boosted tasks will be
> > +	  refcounted in the [20..39]% bucket and will set the bucket clamp
> > +	  effective value to 25%.
> > +	  If a second 30% boosted task should be co-scheduled on the same CPU,
> > +	  that task will be refcounted in the same bucket of the first task and
> > +	  it will boost the bucket clamp effective value to 30%.
> > +	  The clamp effective value of a bucket is reset to its nominal value
> > +	  (20% in the example above) when there are anymore tasks refcounted in
> 
> this sounds weird.

Why ?

> 
> [...]
> 
> > +static inline unsigned int uclamp_bucket_value(unsigned int clamp_value)
> > +{
> > +	return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
> > +}
> 
> Soemthing like uclamp_bucket_nominal_value() should be clearer.

Maybe... can update it in v8

> > +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
> > +{
> > +	struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
> > +	unsigned int max_value = uclamp_none(clamp_id);
> > +	unsigned int bucket_id;
> 
> unsigned int bucket_id = UCLAMP_BUCKETS;
> 
> > +
> > +	/*
> > +	 * Both min and max clamps are MAX aggregated, thus the topmost
> > +	 * bucket with some tasks defines the rq's clamp value.
> > +	 */
> > +	bucket_id = UCLAMP_BUCKETS;
> 
> to get rid of this line?

I put it on a different line as a justification for the loop variable
initialization described in the comment above.

> 
> > +	do {
> > +		--bucket_id;
> > +		if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks)
> 
> if (!bucket[bucket_id].tasks)

Right... that's some leftover from the last refactoring!

[...]

> > + * within each bucket the exact "requested" clamp value whenever all tasks
> > + * RUNNABLE in that bucket require the same clamp.
> > + */
> > +static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
> > +				    unsigned int clamp_id)
> > +{
> > +	unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> > +	unsigned int rq_clamp, bkt_clamp, tsk_clamp;
> 
> Wouldn't it be easier to have a pointer to the task's and rq's uclamp
> structure as well to the bucket?
> 
> -       unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> +       struct uclamp_se *uc_se = &p->uclamp[clamp_id];
> +       struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
> +       struct uclamp_bucket *bucket = &uc_rq->bucket[uc_se->bucket_id];

I think I went back and forth a couple of times between using pointers
and the extended version, which both have pros and cons.

I personally prefer the pointers as you suggest but I've got the
impression in the past that since everybody cleared "basic C trainings"
it's not so difficult to read the code above too.

> The code in uclamp_rq_inc_id() and uclamp_rq_dec_id() for example becomes
> much more readable.

Agree... let's try to switch once again in v8 and see ;)

> [...]
> 
> >   struct sched_class {
> >   	const struct sched_class *next;
> > +#ifdef CONFIG_UCLAMP_TASK
> > +	int uclamp_enabled;
> > +#endif
> > +
> >   	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
> >   	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
> > -	void (*yield_task)   (struct rq *rq);
> > -	bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);
> >   	void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);
> > @@ -1685,7 +1734,6 @@ struct sched_class {
> >   	void (*set_curr_task)(struct rq *rq);
> >   	void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
> >   	void (*task_fork)(struct task_struct *p);
> > -	void (*task_dead)(struct task_struct *p);
> >   	/*
> >   	 * The switched_from() call is allowed to drop rq->lock, therefore we
> > @@ -1702,12 +1750,17 @@ struct sched_class {
> >   	void (*update_curr)(struct rq *rq);
> > +	void (*yield_task)   (struct rq *rq);
> > +	bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);
> > +
> >   #define TASK_SET_GROUP		0
> >   #define TASK_MOVE_GROUP		1
> >   #ifdef CONFIG_FAIR_GROUP_SCHED
> >   	void (*task_change_group)(struct task_struct *p, int type);
> >   #endif
> > +
> > +	void (*task_dead)(struct task_struct *p);
> 
> Why do you move yield_task, yield_to_task and task_dead here?

Since I'm adding a new field at the beginning of the struct, which is
used at enqueue/dequeue time, this is to ensure that all the
callbacks used in these paths are grouped together and don't fall
across a cache line... but yes, that's supposed to be a
micro-optimization which I can skip in this patch.
Patrick Bellasi March 13, 2019, 3:23 p.m. UTC | #11
On 13-Mar 15:09, Peter Zijlstra wrote:
> On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> > +static inline unsigned int uclamp_none(int clamp_id)
> > +{
> > +	if (clamp_id == UCLAMP_MIN)
> > +		return 0;
> > +	return SCHED_CAPACITY_SCALE;
> > +}
> > +
> > +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
> > +{
> > +	struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
> > +	unsigned int max_value = uclamp_none(clamp_id);
> 
> That's 1024 for uclamp_max
> 
> > +	unsigned int bucket_id;
> > +
> > +	/*
> > +	 * Both min and max clamps are MAX aggregated, thus the topmost
> > +	 * bucket with some tasks defines the rq's clamp value.
> > +	 */
> > +	bucket_id = UCLAMP_BUCKETS;
> > +	do {
> > +		--bucket_id;
> > +		if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks)
> > +			continue;
> > +		max_value = bucket[bucket_id].value;
> 
> but this will then _lower_ it. That's not a MAX aggregate.

For uclamp_max we want max_value=1024 when there are no active tasks,
which means: no max clamp enforced on CFS/RT "idle" cpus.

If instead there are active RT/CFS tasks then we want the clamp value
of the max group, which means: MAX aggregate active clamps.

That's what the code above does and the comment says.

> > +		break;
> > +	} while (bucket_id);
> > +
> > +	WRITE_ONCE(rq->uclamp[clamp_id].value, max_value);
> > +}
Patrick Bellasi March 13, 2019, 3:28 p.m. UTC | #12
On 13-Mar 15:06, Peter Zijlstra wrote:
> On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> > +static void __init init_uclamp(void)
> > +{
> > +	unsigned int clamp_id;
> > +	int cpu;
> > +
> > +	for_each_possible_cpu(cpu)
> > +		memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
> > +
> 
> Is that really needed? Doesn't rq come from .bss ?

Will check better... I've just assumed not since in sched_init() we
already have tons of rq-related zero initializations.
Patrick Bellasi March 13, 2019, 3:59 p.m. UTC | #13
On 13-Mar 14:52, Peter Zijlstra wrote:
> On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> > +/*
> > + * When a task is enqueued on a rq, the clamp bucket currently defined by the
> > + * task's uclamp::bucket_id is reference counted on that rq. This also
> > + * immediately updates the rq's clamp value if required.
> > + *
> > + * Since tasks know their specific value requested from user-space, we track
> > + * within each bucket the maximum value for tasks refcounted in that bucket.
> > + * This provide a further aggregation (local clamping) which allows to track
> > + * within each bucket the exact "requested" clamp value whenever all tasks
> > + * RUNNABLE in that bucket require the same clamp.
> > + */
> > +static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
> > +				    unsigned int clamp_id)
> > +{
> > +	unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> > +	unsigned int rq_clamp, bkt_clamp, tsk_clamp;
> > +
> > +	rq->uclamp[clamp_id].bucket[bucket_id].tasks++;
> > +
> > +	/*
> > +	 * Local clamping: rq's buckets always track the max "requested"
> > +	 * clamp value from all RUNNABLE tasks in that bucket.
> > +	 */
> > +	tsk_clamp = p->uclamp[clamp_id].value;
> > +	bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value;
> > +	rq->uclamp[clamp_id].bucket[bucket_id].value = max(bkt_clamp, tsk_clamp);
> 
> So, if I read this correct:
> 
>  - here we track a max value in a bucket,
> 
> > +	rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value);
> > +	WRITE_ONCE(rq->uclamp[clamp_id].value, max(rq_clamp, tsk_clamp));
> > +}
> > +
> > +/*
> > + * When a task is dequeued from a rq, the clamp bucket reference counted by
> > + * the task is released. If this is the last task reference counting the rq's
> > + * max active clamp value, then the rq's clamp value is updated.
> > + * Both the tasks reference counter and the rq's cached clamp values are
> > + * expected to be always valid, if we detect they are not we skip the updates,
> > + * enforce a consistent state and warn.
> > + */
> > +static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
> > +				    unsigned int clamp_id)
> > +{
> > +	unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> > +	unsigned int rq_clamp, bkt_clamp;
> > +
> > +	SCHED_WARN_ON(!rq->uclamp[clamp_id].bucket[bucket_id].tasks);
> > +	if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
> > +		rq->uclamp[clamp_id].bucket[bucket_id].tasks--;
> > +
> > +	/*
> > +	 * Keep "local clamping" simple and accept to (possibly) overboost
> > +	 * still RUNNABLE tasks in the same bucket.
> > +	 */
> > +	if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
> > +		return;
> 
> (Oh man, I hope that generates semi sane code; long live CSE passes I
> suppose)

What do you mean ?

> But we never decrement that bkt_clamp value on dequeue.

We decrement the bkt_clamp value only when the bucket becomes empty
and thus we pass the condition above. That's what the comment above is
there to call out.


> > +	bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value;
> > +
> > +	/* The rq's clamp value is expected to always track the max */
> > +	rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value);
> > +	SCHED_WARN_ON(bkt_clamp > rq_clamp);
> > +	if (bkt_clamp >= rq_clamp) {
> 
> head hurts, this reads ==, how can this ever not be so?

Never, given the current code, that's just defensive programming.

If in the future the accounting should be accidentally broken by some
refactoring we warn and fix the corrupted data structures at first
chance.

Is that so bad?

> > +		/*
> > +		 * Reset rq's clamp bucket value to its nominal value whenever
> > +		 * there are anymore RUNNABLE tasks refcounting it.
> 
> -ENOPARSE

That's related to the comment above, when you say we don't decrement
the bkt_clamp.

Because of bucketization, we potentially end up tracking tasks with
different requested clamp values in the same bucket.

For example, with 20% bucket size, we can have:
  Task1: util_min=25%
  Task2: util_min=35%
accounted in the same bucket.

This ensures that while they are both running we boost 35%. If Task1
should run longer than Task2, Task1 will be "overboosted" until its
end. The bucket value will be reset to 20% (nominal value) when both
tasks are idle.


> > +		 */
> > +		rq->uclamp[clamp_id].bucket[bucket_id].value =
> > +			uclamp_bucket_value(rq_clamp);
> 
> But basically you decrement the bucket value to the nominal value.

Yes, at this point we know there are no more tasks in this bucket and
we reset its value.

> 
> > +		uclamp_rq_update(rq, clamp_id);
> > +	}
> > +}
> 
> Given all that, what is to stop the bucket value to climbing to
> uclamp_bucket_value(+1)-1 and staying there (provided there's someone
> runnable)?

Nothing... but that's an expected consequence of bucketization.

> Why are we doing this... ?

You can either decide to:

 a) always boost tasks to just the bucket nominal value
    thus always penalizing both Task1 and Task2 of the example above

 b) always boost tasks to the bucket "max" value
    thus always overboosting both Task1 and Task2 of the example above

The solution above instead has a very good property: in systems
where you have only a few and well defined clamp values we always
provide the exact boost.

For example, if your system requires only 23% and 47% boost values
(totally random numbers), then you can always get the exact boost
required using just 3 buckets of ~33% size each.

In systems where you don't know which boost values you will have, you
can still define the maximum overboost granularity you accept for
each task by just tuning the number of clamp groups. For example, with
20 groups you can have a 5% max overboost.
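
A back-of-the-envelope userspace sketch of that granularity argument
(assuming the v7 bucket width; the worst-case overboost is one bucket
width minus one utilization unit):

#include <stdio.h>

#define SCALE	1024	/* SCHED_CAPACITY_SCALE */

int main(void)
{
	for (int buckets = 5; buckets <= 20; buckets += 5) {
		int width = SCALE / buckets + 1;	/* v7 bucket width */
		int worst = width - 1;			/* max distance from nominal */

		printf("%2d buckets: width %3d -> worst-case overboost ~%d%%\n",
		       buckets, width, (100 * worst + SCALE / 2) / SCALE);
	}
	return 0;
}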
Patrick Bellasi March 13, 2019, 4:12 p.m. UTC | #14
On 13-Mar 14:40, Peter Zijlstra wrote:
> On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> > +static inline unsigned int uclamp_bucket_id(unsigned int clamp_value)
> > +{
> > +	return clamp_value / UCLAMP_BUCKET_DELTA;
> > +}
> > +
> > +static inline unsigned int uclamp_bucket_value(unsigned int clamp_value)
> > +{
> > +	return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
> 
> 	return clamp_value - (clamp_value % UCLAMP_BUCKET_DELTA);
> 
> might generate better code; just a single division, instead of a div and
> mult.

Wondering if compilers cannot do these optimizations... but yes, looks
cool and will do it in v8, thanks.

> > +}
> > +
> > +static inline unsigned int uclamp_none(int clamp_id)
> > +{
> > +	if (clamp_id == UCLAMP_MIN)
> > +		return 0;
> > +	return SCHED_CAPACITY_SCALE;
> > +}
> > +
> > +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
> > +{
> > +	struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
> > +	unsigned int max_value = uclamp_none(clamp_id);
> > +	unsigned int bucket_id;
> > +
> > +	/*
> > +	 * Both min and max clamps are MAX aggregated, thus the topmost
> > +	 * bucket with some tasks defines the rq's clamp value.
> > +	 */
> > +	bucket_id = UCLAMP_BUCKETS;
> > +	do {
> > +		--bucket_id;
> > +		if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks)
> > +			continue;
> > +		max_value = bucket[bucket_id].value;
> > +		break;
> 
> If you flip the if condition the code will be nicer.
> 
> > +	} while (bucket_id);
> 
> But you can also use a for loop:
> 
> 	for (i = UCLAMP_BUCKETS-1; i>=0; i--) {
> 		if (rq->uclamp[clamp_id].bucket[i].tasks) {
> 			max_value = bucket[i].value;
> 			break;
> 		}
> 	}

Yes, the for looks better, but perhaps like that:

	unsigned int bucket_id = UCLAMP_BUCKETS;

 	/*
 	 * Both min and max clamps are MAX aggregated, thus the topmost
 	 * bucket with some tasks defines the rq's clamp value.
 	 */
	for (; bucket_id >= 0; --bucket_id) {
		if (!bucket[bucket_id].tasks)
			continue;
		max_value = bucket[bucket_id].value;
		break;
	}

... just to save a {} block.


> > +	WRITE_ONCE(rq->uclamp[clamp_id].value, max_value);
> > +}
Peter Zijlstra March 13, 2019, 5:22 p.m. UTC | #15
On Wed, Mar 13, 2019 at 04:12:29PM +0000, Patrick Bellasi wrote:
> Yes, the for looks better, but perhaps like that:
> 
> 	unsigned int bucket_id = UCLAMP_BUCKETS;
> 
>  	/*
>  	 * Both min and max clamps are MAX aggregated, thus the topmost
>  	 * bucket with some tasks defines the rq's clamp value.
>  	 */
> 	for (; bucket_id >= 0; --bucket_id) {

GCC will be clever and figure that unsigned will never be smaller than 0
and turn the above into an infinite loop or something daft.

That is; use 'int', not 'unsigned int' for bucket_id.
Patrick Bellasi March 13, 2019, 6:22 p.m. UTC | #16
On 13-Mar 18:22, Peter Zijlstra wrote:
> On Wed, Mar 13, 2019 at 04:12:29PM +0000, Patrick Bellasi wrote:
> > Yes, the for looks better, but perhaps like that:
> > 
> > 	unsigned int bucket_id = UCLAMP_BUCKETS;
> > 
> >  	/*
> >  	 * Both min and max clamps are MAX aggregated, thus the topmost
> >  	 * bucket with some tasks defines the rq's clamp value.
> >  	 */
> > 	for (; bucket_id >= 0; --bucket_id) {
> 
> GCC will be clever and figure that unsigned will never be smaller than 0
> and turn the above into an infinite loop or something daft.
> 
> That is; use 'int', not 'unsigned int' for bucket_id.

Right, which reminds me now why I originally went for a do { } while();
Peter Zijlstra March 13, 2019, 7:30 p.m. UTC | #17
On Wed, Mar 13, 2019 at 03:59:54PM +0000, Patrick Bellasi wrote:
> On 13-Mar 14:52, Peter Zijlstra wrote:
> > > +static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
> > > +				    unsigned int clamp_id)
> > > +{
> > > +	unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> > > +	unsigned int rq_clamp, bkt_clamp;
> > > +
> > > +	SCHED_WARN_ON(!rq->uclamp[clamp_id].bucket[bucket_id].tasks);
> > > +	if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
> > > +		rq->uclamp[clamp_id].bucket[bucket_id].tasks--;
> > > +
> > > +	/*
> > > +	 * Keep "local clamping" simple and accept to (possibly) overboost
> > > +	 * still RUNNABLE tasks in the same bucket.
> > > +	 */
> > > +	if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
> > > +		return;
> > 
> > (Oh man, I hope that generates semi sane code; long live CSE passes I
> > suppose)
> 
> What do you mean ?

that does: 'rq->uclamp[clamp_id].bucket[bucket_id].tasks' three times in
a row. And yes the compiler _should_ dtrt, but....
Peter Zijlstra March 13, 2019, 7:39 p.m. UTC | #18
On Wed, Mar 13, 2019 at 03:59:54PM +0000, Patrick Bellasi wrote:
> On 13-Mar 14:52, Peter Zijlstra wrote:

> Because of backetization, we potentially end up tracking tasks with
> different requested clamp values in the same bucket.
> 
> For example, with 20% bucket size, we can have:
>   Task1: util_min=25%
>   Task2: util_min=35%
> accounted in the same bucket.

> > Given all that, what is to stop the bucket value to climbing to
> > uclamp_bucket_value(+1)-1 and staying there (provided there's someone
> > runnable)?
> 
> Nothing... but that's an expected consequence of bucketization.

No, it is not.

> > Why are we doing this... ?
> 
> You can either decide to:
> 
>  a) always boost tasks to just the bucket nominal value
>     thus always penalizing both Task1 and Task2 of the example above

This is the expected behaviour. When was the last time your histogram
did something like b?

>  b) always boost tasks to the bucket "max" value
>     thus always overboosting both Task1 and Task2 of the example above
> 
> The solution above instead has a very good property: in systems
> where you have only few and well defined clamp values we always
> provide the exact boost.
> 
> For example, if your system requires only 23% and 47% boost values
> (totally random numbers), then you can always get the exact boost
> required using just 3 bucksts or ~33% size each.
> 
> In systems where you don't know which boost values you will have, you
> can still defined the maximum overboost granularity you accept for
> each task by just tuning the number of clamp groups. For example, with
> 20 groups you can have a 5% max overboost.

Maybe, but this is not a direct consequence of buckets, but an
additional heuristic that might work well in this case.

Maybe split this out in a separate patch? So start with the trivial
bucket, and then do this change on top with the above few paragraphs as
changelog?
Peter Zijlstra March 13, 2019, 7:46 p.m. UTC | #19
On Wed, Mar 13, 2019 at 03:23:59PM +0000, Patrick Bellasi wrote:
> On 13-Mar 15:09, Peter Zijlstra wrote:
> > On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:

> > > +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
> > > +{
> > > +	struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
> > > +	unsigned int max_value = uclamp_none(clamp_id);
> > 
> > That's 1024 for uclamp_max
> > 
> > > +	unsigned int bucket_id;
> > > +
> > > +	/*
> > > +	 * Both min and max clamps are MAX aggregated, thus the topmost
> > > +	 * bucket with some tasks defines the rq's clamp value.
> > > +	 */
> > > +	bucket_id = UCLAMP_BUCKETS;
> > > +	do {
> > > +		--bucket_id;
> > > +		if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks)
> > > +			continue;
> > > +		max_value = bucket[bucket_id].value;
> > 
> > but this will then _lower_ it. That's not a MAX aggregate.
> 
> For uclamp_max we want max_value=1024 when there are no active tasks,
> which means: no max clamp enforced on CFS/RT "idle" cpus.
> 
> If instead there are active RT/CFS tasks then we want the clamp value
> of the max group, which means: MAX aggregate active clamps.
> 
> That's what the code above does and the comment says.

That's (obviously) not how I read it... maybe something like:

static inline unsigned int uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
{
	struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
	int i;

	/*
	 * Since both min and max clamps are max aggregated, find the
	 * top most bucket with tasks in.
	 */
	for (i = UCLAMP_BUCKETS-1; i>=0; i--) {
		if (!bucket[i].tasks)
			continue;
		return bucket[i].value;
	}

	/* No tasks -- default clamp values */
	return uclamp_none(clamp_id);
}

would make it clearer?
Peter Zijlstra March 13, 2019, 7:48 p.m. UTC | #20
On Wed, Mar 13, 2019 at 04:12:29PM +0000, Patrick Bellasi wrote:
> On 13-Mar 14:40, Peter Zijlstra wrote:
> > On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> > > +static inline unsigned int uclamp_bucket_id(unsigned int clamp_value)
> > > +{
> > > +	return clamp_value / UCLAMP_BUCKET_DELTA;
> > > +}
> > > +
> > > +static inline unsigned int uclamp_bucket_value(unsigned int clamp_value)
> > > +{
> > > +	return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
> > 
> > 	return clamp_value - (clamp_value % UCLAMP_BUCKET_DELTA);
> > 
> > might generate better code; just a single division, instead of a div and
> > mult.
> 
> Wondering if compilers cannot do these optimizations... but yes, looks
> cool and will do it in v8, thanks.

I'd be most impressed if they pull this off. Check the generated code
and see I suppose :-)
Suren Baghdasaryan March 13, 2019, 9:01 p.m. UTC | #21
On Wed, Mar 13, 2019 at 8:15 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote:
>
> On 12-Mar 13:52, Dietmar Eggemann wrote:
> > On 2/8/19 11:05 AM, Patrick Bellasi wrote:
> >
> > [...]
> >
> > > +config UCLAMP_BUCKETS_COUNT
> > > +   int "Number of supported utilization clamp buckets"
> > > +   range 5 20
> > > +   default 5
> > > +   depends on UCLAMP_TASK
> > > +   help
> > > +     Defines the number of clamp buckets to use. The range of each bucket
> > > +     will be SCHED_CAPACITY_SCALE/UCLAMP_BUCKETS_COUNT. The higher the
> > > +     number of clamp buckets the finer their granularity and the higher
> > > +     the precision of clamping aggregation and tracking at run-time.
> > > +
> > > +     For example, with the default configuration we will have 5 clamp
> > > +     buckets tracking 20% utilization each. A 25% boosted tasks will be
> > > +     refcounted in the [20..39]% bucket and will set the bucket clamp
> > > +     effective value to 25%.
> > > +     If a second 30% boosted task should be co-scheduled on the same CPU,
> > > +     that task will be refcounted in the same bucket of the first task and
> > > +     it will boost the bucket clamp effective value to 30%.
> > > +     The clamp effective value of a bucket is reset to its nominal value
> > > +     (20% in the example above) when there are anymore tasks refcounted in
> >
> > this sounds weird.
>
> Why ?

Should probably be "when there are no more tasks refcounted"

> >
> > [...]
> >
> > > +static inline unsigned int uclamp_bucket_value(unsigned int clamp_value)
> > > +{
> > > +   return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
> > > +}
> >
> > Soemthing like uclamp_bucket_nominal_value() should be clearer.
>
> Maybe... can update it in v8
>

uclamp_bucket_base_value is a little shorter, just to consider :)

> > > +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
> > > +{
> > > +   struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
> > > +   unsigned int max_value = uclamp_none(clamp_id);
> > > +   unsigned int bucket_id;
> >
> > unsigned int bucket_id = UCLAMP_BUCKETS;
> >
> > > +
> > > +   /*
> > > +    * Both min and max clamps are MAX aggregated, thus the topmost
> > > +    * bucket with some tasks defines the rq's clamp value.
> > > +    */
> > > +   bucket_id = UCLAMP_BUCKETS;
> >
> > to get rid of this line?
>
> I put it on a different line as a justfication for the loop variable
> initialization described in the comment above.
>
> >
> > > +   do {
> > > +           --bucket_id;
> > > +           if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks)
> >
> > if (!bucket[bucket_id].tasks)
>
> Right... that's some leftover from the last refactoring!
>
> [...]
>
> > > + * within each bucket the exact "requested" clamp value whenever all tasks
> > > + * RUNNABLE in that bucket require the same clamp.
> > > + */
> > > +static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
> > > +                               unsigned int clamp_id)
> > > +{
> > > +   unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> > > +   unsigned int rq_clamp, bkt_clamp, tsk_clamp;
> >
> > Wouldn't it be easier to have a pointer to the task's and rq's uclamp
> > structure as well to the bucket?
> >
> > -       unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> > +       struct uclamp_se *uc_se = &p->uclamp[clamp_id];
> > +       struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
> > +       struct uclamp_bucket *bucket = &uc_rq->bucket[uc_se->bucket_id];
>
> I think I went back/forth a couple of times in using pointer or the
> extended version, which both have pros and cons.
>
> I personally prefer the pointers as you suggest but I've got the
> impression in the past that since everybody cleared "basic C trainings"
> it's not so difficult to read the code above too.
>
> > The code in uclamp_rq_inc_id() and uclamp_rq_dec_id() for example becomes
> > much more readable.
>
> Agree... let's try to switch once again in v8 and see ;)
>
> > [...]
> >
> > >   struct sched_class {
> > >     const struct sched_class *next;
> > > +#ifdef CONFIG_UCLAMP_TASK
> > > +   int uclamp_enabled;
> > > +#endif
> > > +
> > >     void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
> > >     void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
> > > -   void (*yield_task)   (struct rq *rq);
> > > -   bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);
> > >     void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);
> > > @@ -1685,7 +1734,6 @@ struct sched_class {
> > >     void (*set_curr_task)(struct rq *rq);
> > >     void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
> > >     void (*task_fork)(struct task_struct *p);
> > > -   void (*task_dead)(struct task_struct *p);
> > >     /*
> > >      * The switched_from() call is allowed to drop rq->lock, therefore we
> > > @@ -1702,12 +1750,17 @@ struct sched_class {
> > >     void (*update_curr)(struct rq *rq);
> > > +   void (*yield_task)   (struct rq *rq);
> > > +   bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);
> > > +
> > >   #define TASK_SET_GROUP            0
> > >   #define TASK_MOVE_GROUP           1
> > >   #ifdef CONFIG_FAIR_GROUP_SCHED
> > >     void (*task_change_group)(struct task_struct *p, int type);
> > >   #endif
> > > +
> > > +   void (*task_dead)(struct task_struct *p);
> >
> > Why do you move yield_task, yield_to_task and task_dead here?
>
> Since I'm adding a new field at the beginning of the struct, which is
> used at enqueue/dequeue time, this is to ensure that all the
> callbacks used in these paths are grouped together and don't fall
> across a cache line... but yes, that's supposed to be a
> micro-optimization which I can skip in this patch.
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi
Suren Baghdasaryan March 13, 2019, 9:08 p.m. UTC | #22
On Wed, Mar 13, 2019 at 12:46 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Mar 13, 2019 at 03:23:59PM +0000, Patrick Bellasi wrote:
> > On 13-Mar 15:09, Peter Zijlstra wrote:
> > > On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
>
> > > > +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
> > > > +{
> > > > + struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
> > > > + unsigned int max_value = uclamp_none(clamp_id);
> > >
> > > That's 1024 for uclamp_max
> > >
> > > > + unsigned int bucket_id;
> > > > +
> > > > + /*
> > > > +  * Both min and max clamps are MAX aggregated, thus the topmost
> > > > +  * bucket with some tasks defines the rq's clamp value.
> > > > +  */
> > > > + bucket_id = UCLAMP_BUCKETS;
> > > > + do {
> > > > +         --bucket_id;
> > > > +         if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks)
> > > > +                 continue;
> > > > +         max_value = bucket[bucket_id].value;
> > >
> > > but this will then _lower_ it. That's not a MAX aggregate.
> >
> > For uclamp_max we want max_value=1024 when there are no active tasks,
> > which means: no max clamp enforced on CFS/RT "idle" cpus.
> >
> > If instead there are active RT/CFS tasks then we want the clamp value
> > of the max group, which means: MAX aggregate active clamps.
> >
> > That's what the code above does and the comment says.
>
> That's (obviously) not how I read it... maybe something like:
>
> static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
> {
>         struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
>         int i;
>
>         /*
>          * Since both min and max clamps are max aggregated, find the
>          * top most bucket with tasks in.
>          */
>         for (i = UCLMAP_BUCKETS-1; i>=0; i--) {
>                 if (!bucket[i].tasks)
>                         continue;
>                 return bucket[i].value;
>         }
>
>         /* No tasks -- default clamp values */
>         return uclamp_none(clamp_id);
> }
>
> would make it clearer?

This way it's also more readable/obvious when it's used inside
uclamp_rq_dec_id, assuming uclamp_rq_update is renamed into something like
get_max_rq_uclamp.
Suren Baghdasaryan March 13, 2019, 9:23 p.m. UTC | #23
On Wed, Mar 13, 2019 at 6:52 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> > +/*
> > + * When a task is enqueued on a rq, the clamp bucket currently defined by the
> > + * task's uclamp::bucket_id is reference counted on that rq. This also
> > + * immediately updates the rq's clamp value if required.
> > + *
> > + * Since tasks know their specific value requested from user-space, we track
> > + * within each bucket the maximum value for tasks refcounted in that bucket.
> > + * This provide a further aggregation (local clamping) which allows to track
> > + * within each bucket the exact "requested" clamp value whenever all tasks
> > + * RUNNABLE in that bucket require the same clamp.
> > + */
> > +static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
> > +                                 unsigned int clamp_id)
> > +{
> > +     unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> > +     unsigned int rq_clamp, bkt_clamp, tsk_clamp;
> > +
> > +     rq->uclamp[clamp_id].bucket[bucket_id].tasks++;
> > +
> > +     /*
> > +      * Local clamping: rq's buckets always track the max "requested"
> > +      * clamp value from all RUNNABLE tasks in that bucket.
> > +      */
> > +     tsk_clamp = p->uclamp[clamp_id].value;
> > +     bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value;
> > +     rq->uclamp[clamp_id].bucket[bucket_id].value = max(bkt_clamp, tsk_clamp);
>
> So, if I read this correct:
>
>  - here we track a max value in a bucket,
>
> > +     rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value);
> > +     WRITE_ONCE(rq->uclamp[clamp_id].value, max(rq_clamp, tsk_clamp));
> > +}
> > +
> > +/*
> > + * When a task is dequeued from a rq, the clamp bucket reference counted by
> > + * the task is released. If this is the last task reference counting the rq's
> > + * max active clamp value, then the rq's clamp value is updated.
> > + * Both the tasks reference counter and the rq's cached clamp values are
> > + * expected to be always valid, if we detect they are not we skip the updates,
> > + * enforce a consistent state and warn.
> > + */
> > +static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
> > +                                 unsigned int clamp_id)
> > +{
> > +     unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> > +     unsigned int rq_clamp, bkt_clamp;
> > +
> > +     SCHED_WARN_ON(!rq->uclamp[clamp_id].bucket[bucket_id].tasks);
> > +     if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
> > +             rq->uclamp[clamp_id].bucket[bucket_id].tasks--;
> > +
> > +     /*
> > +      * Keep "local clamping" simple and accept to (possibly) overboost
> > +      * still RUNNABLE tasks in the same bucket.
> > +      */
> > +     if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
> > +             return;
>
> (Oh man, I hope that generates semi sane code; long live CSE passes I
> suppose)
>
> But we never decrement that bkt_clamp value on dequeue.
>
> > +     bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value;
> > +
> > +     /* The rq's clamp value is expected to always track the max */
> > +     rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value);
> > +     SCHED_WARN_ON(bkt_clamp > rq_clamp);
> > +     if (bkt_clamp >= rq_clamp) {
>
> head hurts, this reads ==, how can this ever not be so?
>
> > +             /*
> > +              * Reset rq's clamp bucket value to its nominal value whenever
> > +              * there are anymore RUNNABLE tasks refcounting it.
>
> -ENOPARSE
>
> > +              */
> > +             rq->uclamp[clamp_id].bucket[bucket_id].value =
> > +                     uclamp_bucket_value(rq_clamp);
>
> But basically you decrement the bucket value to the nominal value.
>
> > +             uclamp_rq_update(rq, clamp_id);
> > +     }
> > +}
>
> Given all that, what is to stop the bucket value to climbing to
> uclamp_bucket_value(+1)-1 and staying there (provided there's someone
> runnable)?
>
> Why are we doing this... ?

I agree with Peter, this part of the patch was the hardest to read.
SCHED_WARN_ON line makes sense to me. The condition that follows and
the following comment are a little baffling. Condition seems to
indicate that the code that follows should be executed only if we are
in the top-most occupied bucket (the bucket which has tasks and has
the highest uclamp value). So this bucket just lost its last task and
we should update rq->uclamp[clamp_id].value. However that's not
exactly what the code does... It also resets
rq->uclamp[clamp_id].bucket[bucket_id].value. So if I understand
correctly, unless the bucket that just lost its last task is the
top-most one its value will not be reset to nominal value. That looks
like a bug to me. Am I missing something?

Side note: some more explanation would be very helpful.
Suren Baghdasaryan March 13, 2019, 9:32 p.m. UTC | #24
On Fri, Feb 8, 2019 at 2:06 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote:
>
> Utilization clamping allows to clamp the CPU's utilization within a
> [util_min, util_max] range, depending on the set of RUNNABLE tasks on
> that CPU. Each task references two "clamp buckets" defining its minimum
> and maximum (util_{min,max}) utilization "clamp values". A CPU's clamp
> bucket is active if there is at least one RUNNABLE tasks enqueued on
> that CPU and refcounting that bucket.
>
> When a task is {en,de}queued {on,from} a rq, the set of active clamp
> buckets on that CPU can change. Since each clamp bucket enforces a
> different utilization clamp value, when the set of active clamp buckets
> changes, a new "aggregated" clamp value is computed for that CPU.
>
> Clamp values are always MAX aggregated for both util_min and util_max.
> This ensures that no tasks can affect the performance of other
> co-scheduled tasks which are more boosted (i.e. with higher util_min
> clamp) or less capped (i.e. with higher util_max clamp).
>
> Each task has a:
>    task_struct::uclamp[clamp_id]::bucket_id
> to track the "bucket index" of the CPU's clamp bucket it refcounts while
> enqueued, for each clamp index (clamp_id).
>
> Each CPU's rq has a:
>    rq::uclamp[clamp_id]::bucket[bucket_id].tasks
> to track how many tasks, currently RUNNABLE on that CPU, refcount each
> clamp bucket (bucket_id) of a clamp index (clamp_id).
>
> Each CPU's rq has also a:
>    rq::uclamp[clamp_id]::bucket[bucket_id].value
> to track the clamp value of each clamp bucket (bucket_id) of a clamp
> index (clamp_id).
>
> The rq::uclamp::bucket[clamp_id][] array is scanned every time we need
> to find a new MAX aggregated clamp value for a clamp_id. This operation
> is required only when we dequeue the last task of a clamp bucket
> tracking the current MAX aggregated clamp value. In these cases, the CPU
> is either entering IDLE or going to schedule a less boosted or more
> clamped task.
> The expected number of different clamp values, configured at build time,
> is small enough to fit the full unordered array into a single cache
> line.

I assume you are talking about "struct uclamp_rq uclamp[UCLAMP_CNT]"
here. uclamp_rq size depends on UCLAMP_BUCKETS configurable to be up
to 20. sizeof(long)*20 is already more than 64 bytes. What am I
missing?

> Add the basic data structures required to refcount, in each CPU's rq,
> the number of RUNNABLE tasks for each clamp bucket. Add also the max
> aggregation required to update the rq's clamp value at each
> enqueue/dequeue event.
>
> Use a simple linear mapping of clamp values into clamp buckets.
> Pre-compute and cache bucket_id to avoid integer divisions at
> enqueue/dequeue time.
>
> Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
>
> ---
> Changes in v7:
>  Message-ID: <20190123191007.GG17749@hirez.programming.kicks-ass.net>
>  - removed buckets mapping code
>  - use a simpler linear mapping of clamp values into buckets
>  Message-ID: <20190124161443.lv2pw5fsspyelckq@e110439-lin>
>  - move this patch at the beginning of the series,
>    in the attempt to make the overall series easier to digest by moving
>    at the very beginning the core bits and main data structures
>  Others:
>  - update the mapping logic to use exactly and only
>    UCLAMP_BUCKETS_COUNT buckets, i.e. no more "special" bucket
>  - update uclamp_rq_update() to do top-bottom max search
> ---
>  include/linux/log2.h           |  37 ++++++++
>  include/linux/sched.h          |  39 ++++++++
>  include/linux/sched/topology.h |   6 --
>  init/Kconfig                   |  53 +++++++++++
>  kernel/sched/core.c            | 165 +++++++++++++++++++++++++++++++++
>  kernel/sched/sched.h           |  59 +++++++++++-
>  6 files changed, 350 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/log2.h b/include/linux/log2.h
> index 2af7f77866d0..e2db25734532 100644
> --- a/include/linux/log2.h
> +++ b/include/linux/log2.h
> @@ -224,4 +224,41 @@ int __order_base_2(unsigned long n)
>                 ilog2((n) - 1) + 1) :           \
>         __order_base_2(n)                       \
>  )
> +
> +static inline __attribute__((const))
> +int __bits_per(unsigned long n)
> +{
> +       if (n < 2)
> +               return 1;
> +       if (is_power_of_2(n))
> +               return order_base_2(n) + 1;
> +       return order_base_2(n);
> +}
> +
> +/**
> + * bits_per - calculate the number of bits required for the argument
> + * @n: parameter
> + *
> + * This is constant-capable and can be used for compile time
> + * initiaizations, e.g bitfields.
> + *
> + * The first few values calculated by this routine:
> + * bf(0) = 1
> + * bf(1) = 1
> + * bf(2) = 2
> + * bf(3) = 2
> + * bf(4) = 3
> + * ... and so on.
> + */
> +#define bits_per(n)                            \
> +(                                              \
> +       __builtin_constant_p(n) ? (             \
> +               ((n) == 0 || (n) == 1) ? 1 : (  \
> +               ((n) & (n - 1)) == 0 ?          \
> +                       ilog2((n) - 1) + 2 :    \
> +                       ilog2((n) - 1) + 1      \
> +               )                               \
> +       ) :                                     \
> +       __bits_per(n)                           \
> +)
>  #endif /* _LINUX_LOG2_H */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 4112639c2a85..45460e7a3eee 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -281,6 +281,18 @@ struct vtime {
>         u64                     gtime;
>  };
>
> +/*
> + * Utilization clamp constraints.
> + * @UCLAMP_MIN:        Minimum utilization
> + * @UCLAMP_MAX:        Maximum utilization
> + * @UCLAMP_CNT:        Utilization clamp constraints count
> + */
> +enum uclamp_id {
> +       UCLAMP_MIN = 0,
> +       UCLAMP_MAX,
> +       UCLAMP_CNT
> +};
> +
>  struct sched_info {
>  #ifdef CONFIG_SCHED_INFO
>         /* Cumulative counters: */
> @@ -312,6 +324,10 @@ struct sched_info {
>  # define SCHED_FIXEDPOINT_SHIFT                10
>  # define SCHED_FIXEDPOINT_SCALE                (1L << SCHED_FIXEDPOINT_SHIFT)
>
> +/* Increase resolution of cpu_capacity calculations */
> +# define SCHED_CAPACITY_SHIFT          SCHED_FIXEDPOINT_SHIFT
> +# define SCHED_CAPACITY_SCALE          (1L << SCHED_CAPACITY_SHIFT)
> +
>  struct load_weight {
>         unsigned long                   weight;
>         u32                             inv_weight;
> @@ -560,6 +576,25 @@ struct sched_dl_entity {
>         struct hrtimer inactive_timer;
>  };
>
> +#ifdef CONFIG_UCLAMP_TASK
> +/* Number of utilization clamp buckets (shorter alias) */
> +#define UCLAMP_BUCKETS CONFIG_UCLAMP_BUCKETS_COUNT
> +
> +/*
> + * Utilization clamp for a scheduling entity
> + * @value:             clamp value "requested" by a se
> + * @bucket_id:         clamp bucket corresponding to the "requested" value
> + *
> + * The bucket_id is the index of the clamp bucket matching the clamp value
> + * which is pre-computed and stored to avoid expensive integer divisions from
> + * the fast path.
> + */
> +struct uclamp_se {
> +       unsigned int value              : bits_per(SCHED_CAPACITY_SCALE);
> +       unsigned int bucket_id          : bits_per(UCLAMP_BUCKETS);
> +};
> +#endif /* CONFIG_UCLAMP_TASK */
> +
>  union rcu_special {
>         struct {
>                 u8                      blocked;
> @@ -640,6 +675,10 @@ struct task_struct {
>  #endif
>         struct sched_dl_entity          dl;
>
> +#ifdef CONFIG_UCLAMP_TASK
> +       struct uclamp_se                uclamp[UCLAMP_CNT];
> +#endif
> +
>  #ifdef CONFIG_PREEMPT_NOTIFIERS
>         /* List of struct preempt_notifier: */
>         struct hlist_head               preempt_notifiers;
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index c31d3a47a47c..04beadac6985 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -6,12 +6,6 @@
>
>  #include <linux/sched/idle.h>
>
> -/*
> - * Increase resolution of cpu_capacity calculations
> - */
> -#define SCHED_CAPACITY_SHIFT   SCHED_FIXEDPOINT_SHIFT
> -#define SCHED_CAPACITY_SCALE   (1L << SCHED_CAPACITY_SHIFT)
> -
>  /*
>   * sched-domains (multiprocessor balancing) declarations:
>   */
> diff --git a/init/Kconfig b/init/Kconfig
> index 513fa544a134..34e23d5d95d1 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -640,6 +640,59 @@ config HAVE_UNSTABLE_SCHED_CLOCK
>  config GENERIC_SCHED_CLOCK
>         bool
>
> +menu "Scheduler features"
> +
> +config UCLAMP_TASK
> +       bool "Enable utilization clamping for RT/FAIR tasks"
> +       depends on CPU_FREQ_GOV_SCHEDUTIL
> +       help
> +         This feature enables the scheduler to track the clamped utilization
> +         of each CPU based on RUNNABLE tasks scheduled on that CPU.
> +
> +         With this option, the user can specify the min and max CPU
> +         utilization allowed for RUNNABLE tasks. The max utilization defines
> +         the maximum frequency a task should use while the min utilization
> +         defines the minimum frequency it should use.
> +
> +         Both min and max utilization clamp values are hints to the scheduler,
> +         aiming at improving its frequency selection policy, but they do not
> +         enforce or grant any specific bandwidth for tasks.
> +
> +         If in doubt, say N.
> +
> +config UCLAMP_BUCKETS_COUNT
> +       int "Number of supported utilization clamp buckets"
> +       range 5 20
> +       default 5
> +       depends on UCLAMP_TASK
> +       help
> +         Defines the number of clamp buckets to use. The range of each bucket
> +         will be SCHED_CAPACITY_SCALE/UCLAMP_BUCKETS_COUNT. The higher the
> +         number of clamp buckets the finer their granularity and the higher
> +         the precision of clamping aggregation and tracking at run-time.
> +
> +         For example, with the default configuration we will have 5 clamp
> +         buckets tracking 20% utilization each. A 25% boosted tasks will be
> +         refcounted in the [20..39]% bucket and will set the bucket clamp
> +         effective value to 25%.
> +         If a second 30% boosted task should be co-scheduled on the same CPU,
> +         that task will be refcounted in the same bucket of the first task and
> +         it will boost the bucket clamp effective value to 30%.
> +         The clamp effective value of a bucket is reset to its nominal value
> +         (20% in the example above) when there are anymore tasks refcounted in
> +         that bucket.
> +
> +         An additional boost/capping margin can be added to some tasks. In the
> +         example above the 25% task will be boosted to 30% until it exits the
> +         CPU. If that should be considered not acceptable on certain systems,
> +         it's always possible to reduce the margin by increasing the number of
> +         clamp buckets to trade off used memory for run-time tracking
> +         precision.
> +
> +         If in doubt, use the default value.
> +
> +endmenu
> +
>  #
>  # For architectures that want to enable the support for NUMA-affine scheduler
>  # balancing logic:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ec1b67a195cc..8ecf5470058c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -719,6 +719,167 @@ static void set_load_weight(struct task_struct *p, bool update_load)
>         }
>  }
>
> +#ifdef CONFIG_UCLAMP_TASK
> +
> +/* Integer ceil-rounded range for each bucket */
> +#define UCLAMP_BUCKET_DELTA ((SCHED_CAPACITY_SCALE / UCLAMP_BUCKETS) + 1)
> +
> +static inline unsigned int uclamp_bucket_id(unsigned int clamp_value)
> +{
> +       return clamp_value / UCLAMP_BUCKET_DELTA;
> +}
> +
> +static inline unsigned int uclamp_bucket_value(unsigned int clamp_value)
> +{
> +       return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
> +}
> +
> +static inline unsigned int uclamp_none(int clamp_id)
> +{
> +       if (clamp_id == UCLAMP_MIN)
> +               return 0;
> +       return SCHED_CAPACITY_SCALE;
> +}
> +
> +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
> +{
> +       struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
> +       unsigned int max_value = uclamp_none(clamp_id);
> +       unsigned int bucket_id;
> +
> +       /*
> +        * Both min and max clamps are MAX aggregated, thus the topmost
> +        * bucket with some tasks defines the rq's clamp value.
> +        */
> +       bucket_id = UCLAMP_BUCKETS;
> +       do {
> +               --bucket_id;
> +               if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks)
> +                       continue;
> +               max_value = bucket[bucket_id].value;
> +               break;
> +       } while (bucket_id);
> +
> +       WRITE_ONCE(rq->uclamp[clamp_id].value, max_value);
> +}
> +
> +/*
> + * When a task is enqueued on a rq, the clamp bucket currently defined by the
> + * task's uclamp::bucket_id is reference counted on that rq. This also
> + * immediately updates the rq's clamp value if required.
> + *
> + * Since tasks know their specific value requested from user-space, we track
> + * within each bucket the maximum value for tasks refcounted in that bucket.
> + * This provide a further aggregation (local clamping) which allows to track
> + * within each bucket the exact "requested" clamp value whenever all tasks
> + * RUNNABLE in that bucket require the same clamp.
> + */
> +static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
> +                                   unsigned int clamp_id)
> +{
> +       unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> +       unsigned int rq_clamp, bkt_clamp, tsk_clamp;
> +
> +       rq->uclamp[clamp_id].bucket[bucket_id].tasks++;
> +
> +       /*
> +        * Local clamping: rq's buckets always track the max "requested"
> +        * clamp value from all RUNNABLE tasks in that bucket.
> +        */
> +       tsk_clamp = p->uclamp[clamp_id].value;
> +       bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value;
> +       rq->uclamp[clamp_id].bucket[bucket_id].value = max(bkt_clamp, tsk_clamp);
> +
> +       rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value);
> +       WRITE_ONCE(rq->uclamp[clamp_id].value, max(rq_clamp, tsk_clamp));
> +}
> +
> +/*
> + * When a task is dequeued from a rq, the clamp bucket reference counted by
> + * the task is released. If this is the last task reference counting the rq's
> + * max active clamp value, then the rq's clamp value is updated.
> + * Both the tasks reference counter and the rq's cached clamp values are
> + * expected to be always valid, if we detect they are not we skip the updates,
> + * enforce a consistent state and warn.
> + */
> +static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
> +                                   unsigned int clamp_id)
> +{
> +       unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> +       unsigned int rq_clamp, bkt_clamp;
> +
> +       SCHED_WARN_ON(!rq->uclamp[clamp_id].bucket[bucket_id].tasks);
> +       if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
> +               rq->uclamp[clamp_id].bucket[bucket_id].tasks--;
> +
> +       /*
> +        * Keep "local clamping" simple and accept to (possibly) overboost
> +        * still RUNNABLE tasks in the same bucket.
> +        */
> +       if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
> +               return;
> +       bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value;
> +
> +       /* The rq's clamp value is expected to always track the max */
> +       rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value);
> +       SCHED_WARN_ON(bkt_clamp > rq_clamp);
> +       if (bkt_clamp >= rq_clamp) {
> +               /*
> +                * Reset rq's clamp bucket value to its nominal value whenever
> +                * there are anymore RUNNABLE tasks refcounting it.
> +                */
> +               rq->uclamp[clamp_id].bucket[bucket_id].value =
> +                       uclamp_bucket_value(rq_clamp);
> +               uclamp_rq_update(rq, clamp_id);
> +       }
> +}
> +
> +static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
> +{
> +       unsigned int clamp_id;
> +
> +       if (unlikely(!p->sched_class->uclamp_enabled))
> +               return;
> +
> +       for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> +               uclamp_rq_inc_id(p, rq, clamp_id);
> +}
> +
> +static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
> +{
> +       unsigned int clamp_id;
> +
> +       if (unlikely(!p->sched_class->uclamp_enabled))
> +               return;
> +
> +       for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> +               uclamp_rq_dec_id(p, rq, clamp_id);
> +}
> +
> +static void __init init_uclamp(void)
> +{
> +       unsigned int clamp_id;
> +       int cpu;
> +
> +       for_each_possible_cpu(cpu)
> +               memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
> +
> +       for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> +               unsigned int clamp_value = uclamp_none(clamp_id);
> +               unsigned int bucket_id = uclamp_bucket_id(clamp_value);
> +               struct uclamp_se *uc_se = &init_task.uclamp[clamp_id];
> +
> +               uc_se->bucket_id = bucket_id;
> +               uc_se->value = clamp_value;
> +       }
> +}
> +
> +#else /* CONFIG_UCLAMP_TASK */
> +static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
> +static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
> +static inline void init_uclamp(void) { }
> +#endif /* CONFIG_UCLAMP_TASK */
> +
>  static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
>  {
>         if (!(flags & ENQUEUE_NOCLOCK))
> @@ -729,6 +890,7 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
>                 psi_enqueue(p, flags & ENQUEUE_WAKEUP);
>         }
>
> +       uclamp_rq_inc(rq, p);
>         p->sched_class->enqueue_task(rq, p, flags);
>  }
>
> @@ -742,6 +904,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
>                 psi_dequeue(p, flags & DEQUEUE_SLEEP);
>         }
>
> +       uclamp_rq_dec(rq, p);
>         p->sched_class->dequeue_task(rq, p, flags);
>  }
>
> @@ -6075,6 +6238,8 @@ void __init sched_init(void)
>
>         psi_init();
>
> +       init_uclamp();
> +
>         scheduler_running = 1;
>  }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index c688ef5012e5..ea9e28723946 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -797,6 +797,48 @@ extern void rto_push_irq_work_func(struct irq_work *work);
>  #endif
>  #endif /* CONFIG_SMP */
>
> +#ifdef CONFIG_UCLAMP_TASK
> +/*
> + * struct uclamp_bucket - Utilization clamp bucket
> + * @value: utilization clamp value for tasks on this clamp bucket
> + * @tasks: number of RUNNABLE tasks on this clamp bucket
> + *
> + * Keep track of how many tasks are RUNNABLE for a given utilization
> + * clamp value.
> + */
> +struct uclamp_bucket {
> +       unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
> +       unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
> +};
> +
> +/*
> + * struct uclamp_rq - rq's utilization clamp
> + * @value: currently active clamp values for a rq
> + * @bucket: utilization clamp buckets affecting a rq
> + *
> + * Keep track of RUNNABLE tasks on a rq to aggregate their clamp values.
> + * A clamp value is affecting a rq when there is at least one task RUNNABLE
> + * (or actually running) with that value.
> + *
> + * We have up to UCLAMP_CNT possible different clamp values, which are
> + * currently only two: minmum utilization and maximum utilization.
> + *
> + * All utilization clamping values are MAX aggregated, since:
> + * - for util_min: we want to run the CPU at least at the max of the minimum
> + *   utilization required by its currently RUNNABLE tasks.
> + * - for util_max: we want to allow the CPU to run up to the max of the
> + *   maximum utilization allowed by its currently RUNNABLE tasks.
> + *
> + * Since on each system we expect only a limited number of different
> + * utilization clamp values (UCLAMP_BUCKETS), we use a simple array to track
> + * the metrics required to compute all the per-rq utilization clamp values.
> + */
> +struct uclamp_rq {
> +       unsigned int value;
> +       struct uclamp_bucket bucket[UCLAMP_BUCKETS];
> +};
> +#endif /* CONFIG_UCLAMP_TASK */
> +
>  /*
>   * This is the main, per-CPU runqueue data structure.
>   *
> @@ -835,6 +877,11 @@ struct rq {
>         unsigned long           nr_load_updates;
>         u64                     nr_switches;
>
> +#ifdef CONFIG_UCLAMP_TASK
> +       /* Utilization clamp values based on CPU's RUNNABLE tasks */
> +       struct uclamp_rq        uclamp[UCLAMP_CNT] ____cacheline_aligned;
> +#endif
> +
>         struct cfs_rq           cfs;
>         struct rt_rq            rt;
>         struct dl_rq            dl;
> @@ -1649,10 +1696,12 @@ extern const u32                sched_prio_to_wmult[40];
>  struct sched_class {
>         const struct sched_class *next;
>
> +#ifdef CONFIG_UCLAMP_TASK
> +       int uclamp_enabled;
> +#endif
> +
>         void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
>         void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
> -       void (*yield_task)   (struct rq *rq);
> -       bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);
>
>         void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);
>
> @@ -1685,7 +1734,6 @@ struct sched_class {
>         void (*set_curr_task)(struct rq *rq);
>         void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
>         void (*task_fork)(struct task_struct *p);
> -       void (*task_dead)(struct task_struct *p);
>
>         /*
>          * The switched_from() call is allowed to drop rq->lock, therefore we
> @@ -1702,12 +1750,17 @@ struct sched_class {
>
>         void (*update_curr)(struct rq *rq);
>
> +       void (*yield_task)   (struct rq *rq);
> +       bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);
> +
>  #define TASK_SET_GROUP         0
>  #define TASK_MOVE_GROUP                1
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>         void (*task_change_group)(struct task_struct *p, int type);
>  #endif
> +
> +       void (*task_dead)(struct task_struct *p);
>  };
>
>  static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
> --
> 2.20.1
>
Patrick Bellasi March 14, 2019, 11:03 a.m. UTC | #25
On 13-Mar 20:30, Peter Zijlstra wrote:
> On Wed, Mar 13, 2019 at 03:59:54PM +0000, Patrick Bellasi wrote:
> > On 13-Mar 14:52, Peter Zijlstra wrote:
> > > > +static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
> > > > +				    unsigned int clamp_id)
> > > > +{
> > > > +	unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> > > > +	unsigned int rq_clamp, bkt_clamp;
> > > > +
> > > > +	SCHED_WARN_ON(!rq->uclamp[clamp_id].bucket[bucket_id].tasks);
> > > > +	if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
> > > > +		rq->uclamp[clamp_id].bucket[bucket_id].tasks--;
> > > > +
> > > > +	/*
> > > > +	 * Keep "local clamping" simple and accept to (possibly) overboost
> > > > +	 * still RUNNABLE tasks in the same bucket.
> > > > +	 */
> > > > +	if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
> > > > +		return;
> > > 
> > > (Oh man, I hope that generates semi sane code; long live CSE passes I
> > > suppose)
> > 
> > What do you mean ?
> 
> that does: 'rq->uclamp[clamp_id].bucket[bucket_id].tasks' three times in
> a row. And yes the compiler _should_ dtrt, but....

Sorry, don't follow you here... but it's an interesting point. :)

The code above becomes:

   if (__builtin_expect(!!(rq->uclamp[clamp_id].bucket[bucket_id].tasks), 1))
           return;

Are you referring to the resolution of the memory references, i.e.
   1) rq->uclamp
   2) rq->uclamp[clamp_id]
   3) rq->uclamp[clamp_id].bucket[bucket_id]
?

By playing with:

   https://godbolt.org/z/OPLpyR

I can see that this simplified version:

---8<---
#define BUCKETS 5
#define CLAMPS 2

struct uclamp {
    unsigned int value;
    struct bucket {
        unsigned int value;
        unsigned int tasks;
    } bucket[BUCKETS];
};

struct rq {
    struct uclamp uclamp[CLAMPS];
};

void uclamp_rq_dec_id(struct rq *rq, int clamp_id, int bucket_id) {
    if (__builtin_expect(!!(rq->uclamp[clamp_id].bucket[bucket_id].tasks), 1))
        return;
    rq->uclamp[clamp_id].bucket[bucket_id].tasks--;
}
---8<---

generates something like:

---8<---
uclamp_rq_dec_id:
        sxtw    x1, w1
        add     x3, x1, x1, lsl 1
        lsl     x3, x3, 2
        sub     x3, x3, x1
        lsl     x3, x3, 2
        add     x2, x3, x2, sxtw 3
        add     x0, x0, x2
        ldr     w1, [x0, 8]
        cbz     w1, .L4
        ret
.L4:
        mov     w1, -1
        str     w1, [x0, 8]
        ret
---8<---


which looks "sane" and quite expected, isn't it?
Patrick Bellasi March 14, 2019, 11:18 a.m. UTC | #26
On 13-Mar 20:39, Peter Zijlstra wrote:
> On Wed, Mar 13, 2019 at 03:59:54PM +0000, Patrick Bellasi wrote:
> > On 13-Mar 14:52, Peter Zijlstra wrote:
> 
> > Because of backetization, we potentially end up tracking tasks with
> > different requested clamp values in the same bucket.
> > 
> > For example, with 20% bucket size, we can have:
> >   Task1: util_min=25%
> >   Task2: util_min=35%
> > accounted in the same bucket.
> 
> > > Given all that, what is to stop the bucket value to climbing to
> > > uclamp_bucket_value(+1)-1 and staying there (provided there's someone
> > > runnable)?
> > 
> > Nothing... but that's an expected consequence of bucketization.
> 
> No, it is not.
> 
> > > Why are we doing this... ?
> > 
> > You can either decide to:
> > 
> >  a) always boost tasks to just the bucket nominal value
> >     thus always penalizing both Task1 and Task2 of the example above
> 
> This is the expected behaviour. When was the last time your histogram
> did something like b?

Right, I see what you mean... strictly speaking histograms always do a
floor aggregation.

> >  b) always boost tasks to the bucket "max" value
> >     thus always overboosting both Task1 and Task2 of the example above
> > 
> > The solution above instead has a very good property: in systems
> > where you have only few and well defined clamp values we always
> > provide the exact boost.
> > 
> > For example, if your system requires only 23% and 47% boost values
> > (totally random numbers), then you can always get the exact boost
> > required using just 3 bucksts or ~33% size each.
> > 
> > In systems where you don't know which boost values you will have, you
> > can still defined the maximum overboost granularity you accept for
> > each task by just tuning the number of clamp groups. For example, with
> > 20 groups you can have a 5% max overboost.
> 
> Maybe, but this is not a direct concequence of buckets, but an
> additional heuristic that might work well in this case.

Right... that's the point.

We started with mapping to be able to track exact clamp values.
Then we switched to a linear mapping to remove that complexity, but we
would still like to be able to track exact values whenever possible.
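
For illustration, here is a small, stand-alone sketch of that property
(purely illustrative, not kernel code), using the linear mapping from
this patch with 3 buckets: the two example boosts of 23% and 47% land
in different buckets, so each bucket's local max is exactly the
requested value:

---8<---
#include <stdio.h>

#define SCHED_CAPACITY_SCALE	1024
#define UCLAMP_BUCKETS		3
#define UCLAMP_BUCKET_DELTA	((SCHED_CAPACITY_SCALE / UCLAMP_BUCKETS) + 1)

static unsigned int uclamp_bucket_id(unsigned int clamp_value)
{
	return clamp_value / UCLAMP_BUCKET_DELTA;
}

int main(void)
{
	/* ~23% and ~47% of SCHED_CAPACITY_SCALE */
	unsigned int boosts[] = { 235, 481 };
	unsigned int bucket_max[UCLAMP_BUCKETS] = { 0 };
	int i;

	for (i = 0; i < 2; i++) {
		unsigned int b = uclamp_bucket_id(boosts[i]);

		/* Bucket local max: exact when all tasks request the same */
		if (boosts[i] > bucket_max[b])
			bucket_max[b] = boosts[i];
		printf("boost=%u -> bucket=%u local_max=%u\n",
		       boosts[i], b, bucket_max[b]);
	}
	return 0;
}
---8<---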

> Maybe split this out in a separate patch? So start with the trivial
> bucket, and then do this change on top with the above few paragraphs as
> changelog?

That's doable, otherwise maybe we can just add the above paragraphs to
the changelog of this patch. But given your comment above I assume you
prefer to split it out... just let me know otherwise.
Patrick Bellasi March 14, 2019, 11:45 a.m. UTC | #27
On 13-Mar 20:46, Peter Zijlstra wrote:
> On Wed, Mar 13, 2019 at 03:23:59PM +0000, Patrick Bellasi wrote:
> > On 13-Mar 15:09, Peter Zijlstra wrote:
> > > On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> 
> > > > +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
> > > > +{
> > > > +	struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
> > > > +	unsigned int max_value = uclamp_none(clamp_id);
> > > 
> > > That's 1024 for uclamp_max
> > > 
> > > > +	unsigned int bucket_id;
> > > > +
> > > > +	/*
> > > > +	 * Both min and max clamps are MAX aggregated, thus the topmost
> > > > +	 * bucket with some tasks defines the rq's clamp value.
> > > > +	 */
> > > > +	bucket_id = UCLAMP_BUCKETS;
> > > > +	do {
> > > > +		--bucket_id;
> > > > +		if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks)
> > > > +			continue;
> > > > +		max_value = bucket[bucket_id].value;
> > > 
> > > but this will then _lower_ it. That's not a MAX aggregate.
> > 
> > For uclamp_max we want max_value=1024 when there are no active tasks,
> > which means: no max clamp enforced on CFS/RT "idle" cpus.
> > 
> > If instead there are active RT/CFS tasks then we want the clamp value
> > of the max group, which means: MAX aggregate active clamps.
> > 
> > That's what the code above does and the comment says.
> 
> That's (obviously) not how I read it.... maybe something like:
> 
> static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
> {
> 	struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
> 	int i;
> 
> 	/*
> 	 * Since both min and max clamps are max aggregated, find the
> 	 * top most bucket with tasks in.
> 	 */
> 	for (i = UCLMAP_BUCKETS-1; i>=0; i--) {
> 		if (!bucket[i].tasks)
> 			continue;
> 		return bucket[i].value;
> 	}
> 
> 	/* No tasks -- default clamp value */
> 	return uclamp_none(clamp_id);
> }
> 
> would make it clearer?

Fine for me, I'll then change the name to something else, since it's no
longer an "_update" once the WRITE_ONCE moves into the caller.
Patrick Bellasi March 14, 2019, 12:13 p.m. UTC | #28
On 13-Mar 20:48, Peter Zijlstra wrote:
> On Wed, Mar 13, 2019 at 04:12:29PM +0000, Patrick Bellasi wrote:
> > On 13-Mar 14:40, Peter Zijlstra wrote:
> > > On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> > > > +static inline unsigned int uclamp_bucket_id(unsigned int clamp_value)
> > > > +{
> > > > +	return clamp_value / UCLAMP_BUCKET_DELTA;
> > > > +}
> > > > +
> > > > +static inline unsigned int uclamp_bucket_value(unsigned int clamp_value)
> > > > +{
> > > > +	return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
> > > 
> > > 	return clamp_value - (clamp_value % UCLAMP_BUCKET_DELTA);
> > > 
> > > might generate better code; just a single division, instead of a div and
> > > mult.
> > 
> > Wondering if compilers cannot do these optimizations... but yes, looks
> > cool and will do it in v8, thanks.
> 
> I'd be most impressed if they pull this off. Check the generated code
> and see I suppose :-)

On x86 the code generated looks exactly the same:

   https://godbolt.org/z/PjmA7k

While on arm64 it seems the difference boils down to:
 - one single "mul" instruction
vs
 - two instructions: "sub" _plus_ one "multiply subtract"

  https://godbolt.org/z/0shU0S

So, if I didn't get something wrong... perhaps the original version is
even better, isn't it?


Test code:

---8<---
#define UCLAMP_BUCKET_DELTA 52

static inline unsigned int uclamp_bucket_id(unsigned int clamp_value)
{
    return clamp_value / UCLAMP_BUCKET_DELTA;
}

static inline unsigned int uclamp_bucket_value1(unsigned int clamp_value)
{
    return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
}

static inline unsigned int uclamp_bucket_value2(unsigned int clamp_value)
{
    return clamp_value - (clamp_value % UCLAMP_BUCKET_DELTA);
}

int test1(int argc, char *argv[]) {
    return uclamp_bucket_value1(argc);
}

int test2(int argc, char *argv[]) {
    return uclamp_bucket_value2(argc);
}

int test3(int argc, char *argv[]) {
    return uclamp_bucket_value1(argc) - uclamp_bucket_value2(argc);
}
---8<---

which gives on arm64:

---8<---
test1:
        mov     w1, 60495
        movk    w1, 0x4ec4, lsl 16
        umull   x0, w0, w1
        lsr     x0, x0, 36
        mov     w1, 52
        mul     w0, w0, w1
        ret
test2:
        mov     w1, 60495
        movk    w1, 0x4ec4, lsl 16
        umull   x1, w0, w1
        lsr     x1, x1, 36
        mov     w2, 52
        msub    w1, w1, w2, w0
        sub     w0, w0, w1
        ret
test3:
        mov     w0, 0
        ret
---8<---
Patrick Bellasi March 14, 2019, 12:22 p.m. UTC | #29
On 13-Mar 14:08, Suren Baghdasaryan wrote:
> On Wed, Mar 13, 2019 at 12:46 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Wed, Mar 13, 2019 at 03:23:59PM +0000, Patrick Bellasi wrote:
> > > On 13-Mar 15:09, Peter Zijlstra wrote:
> > > > On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> >
> > > > > +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
> > > > > +{
> > > > > + struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
> > > > > + unsigned int max_value = uclamp_none(clamp_id);
> > > >
> > > > That's 1024 for uclamp_max
> > > >
> > > > > + unsigned int bucket_id;
> > > > > +
> > > > > + /*
> > > > > +  * Both min and max clamps are MAX aggregated, thus the topmost
> > > > > +  * bucket with some tasks defines the rq's clamp value.
> > > > > +  */
> > > > > + bucket_id = UCLAMP_BUCKETS;
> > > > > + do {
> > > > > +         --bucket_id;
> > > > > +         if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks)
> > > > > +                 continue;
> > > > > +         max_value = bucket[bucket_id].value;
> > > >
> > > > but this will then _lower_ it. That's not a MAX aggregate.
> > >
> > > For uclamp_max we want max_value=1024 when there are no active tasks,
> > > which means: no max clamp enforced on CFS/RT "idle" cpus.
> > >
> > > If instead there are active RT/CFS tasks then we want the clamp value
> > > of the max group, which means: MAX aggregate active clamps.
> > >
> > > That's what the code above does and the comment says.
> >
> > That's (obviously) not how I read it... maybe something like:
> >
> > static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
> > {
> >         struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
> >         int i;
> >
> >         /*
> >          * Since both min and max clamps are max aggregated, find the
> >          * top most bucket with tasks in.
> >          */
> >         for (i = UCLMAP_BUCKETS-1; i>=0; i--) {
> >                 if (!bucket[i].tasks)
> >                         continue;
> >                 return bucket[i].value;
> >         }
> >
> >         /* No tasks -- default clamp values */
> >         return uclamp_none(clamp_id);
> > }
> >
> > would make it clearer?
> 
> This way it's also more readable/obvious when it's used inside
> uclamp_rq_dec_id, assuming uclamp_rq_update is renamed into smth like
> get_max_rq_uclamp.

Right, I now have something like this:

---8<---
static inline unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id)
{
	struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
	int bucket_id;

	/*
	 * Since both min and max clamps are max aggregated, find the
	 * top most bucket with tasks in.
	 */
	for (bucket_id = UCLAMP_BUCKETS - 1; bucket_id >= 0; bucket_id--) {
		if (!bucket[bucket_id].tasks)
			continue;
		return bucket[bucket_id].value;
	}

	/* No tasks -- default clamp value */
	return uclamp_none(clamp_id);
}

static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
				    unsigned int clamp_id)
{
        //...
	if (bucket->value >= rq_clamp) {
		/*
		 * Reset rq's clamp bucket value to its nominal value whenever
		 * there are no more RUNNABLE tasks refcounting it.
		 */
		bucket->value = uclamp_bucket_nominal_value(rq_clamp);
		WRITE_ONCE(uc_rq->value, uclamp_rq_max_value(rq, clamp_id));
	}
}
---8<---
Patrick Bellasi March 14, 2019, 12:43 p.m. UTC | #30
On 13-Mar 14:23, Suren Baghdasaryan wrote:
> On Wed, Mar 13, 2019 at 6:52 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> > > +/*
> > > + * When a task is enqueued on a rq, the clamp bucket currently defined by the
> > > + * task's uclamp::bucket_id is reference counted on that rq. This also
> > > + * immediately updates the rq's clamp value if required.
> > > + *
> > > + * Since tasks know their specific value requested from user-space, we track
> > > + * within each bucket the maximum value for tasks refcounted in that bucket.
> > > + * This provide a further aggregation (local clamping) which allows to track
> > > + * within each bucket the exact "requested" clamp value whenever all tasks
> > > + * RUNNABLE in that bucket require the same clamp.
> > > + */
> > > +static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
> > > +                                 unsigned int clamp_id)
> > > +{
> > > +     unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> > > +     unsigned int rq_clamp, bkt_clamp, tsk_clamp;
> > > +
> > > +     rq->uclamp[clamp_id].bucket[bucket_id].tasks++;
> > > +
> > > +     /*
> > > +      * Local clamping: rq's buckets always track the max "requested"
> > > +      * clamp value from all RUNNABLE tasks in that bucket.
> > > +      */
> > > +     tsk_clamp = p->uclamp[clamp_id].value;
> > > +     bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value;
> > > +     rq->uclamp[clamp_id].bucket[bucket_id].value = max(bkt_clamp, tsk_clamp);
> >
> > So, if I read this correct:
> >
> >  - here we track a max value in a bucket,
> >
> > > +     rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value);
> > > +     WRITE_ONCE(rq->uclamp[clamp_id].value, max(rq_clamp, tsk_clamp));
> > > +}
> > > +
> > > +/*
> > > + * When a task is dequeued from a rq, the clamp bucket reference counted by
> > > + * the task is released. If this is the last task reference counting the rq's
> > > + * max active clamp value, then the rq's clamp value is updated.
> > > + * Both the tasks reference counter and the rq's cached clamp values are
> > > + * expected to be always valid, if we detect they are not we skip the updates,
> > > + * enforce a consistent state and warn.
> > > + */
> > > +static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
> > > +                                 unsigned int clamp_id)
> > > +{
> > > +     unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> > > +     unsigned int rq_clamp, bkt_clamp;
> > > +
> > > +     SCHED_WARN_ON(!rq->uclamp[clamp_id].bucket[bucket_id].tasks);
> > > +     if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
> > > +             rq->uclamp[clamp_id].bucket[bucket_id].tasks--;
> > > +
> > > +     /*
> > > +      * Keep "local clamping" simple and accept to (possibly) overboost
> > > +      * still RUNNABLE tasks in the same bucket.
> > > +      */
> > > +     if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
> > > +             return;
> >
> > (Oh man, I hope that generates semi sane code; long live CSE passes I
> > suppose)
> >
> > But we never decrement that bkt_clamp value on dequeue.
> >
> > > +     bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value;
> > > +
> > > +     /* The rq's clamp value is expected to always track the max */
> > > +     rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value);
> > > +     SCHED_WARN_ON(bkt_clamp > rq_clamp);
> > > +     if (bkt_clamp >= rq_clamp) {
> >
> > head hurts, this reads ==, how can this ever not be so?
> >
> > > +             /*
> > > +              * Reset rq's clamp bucket value to its nominal value whenever
> > > +              * there are anymore RUNNABLE tasks refcounting it.
> >
> > -ENOPARSE
> >
> > > +              */
> > > +             rq->uclamp[clamp_id].bucket[bucket_id].value =
> > > +                     uclamp_bucket_value(rq_clamp);
> >
> > But basically you decrement the bucket value to the nominal value.
> >
> > > +             uclamp_rq_update(rq, clamp_id);
> > > +     }
> > > +}
> >
> > Given all that, what is to stop the bucket value to climbing to
> > uclamp_bucket_value(+1)-1 and staying there (provided there's someone
> > runnable)?
> >
> > Why are we doing this... ?
> 
> I agree with Peter, this part of the patch was the hardest to read.
> SCHED_WARN_ON line makes sense to me. The condition that follows and
> the following comment are a little baffling. Condition seems to
> indicate that the code that follows should be executed only if we are
> in the top-most occupied bucket (the bucket which has tasks and has
> the highest uclamp value).
> So this bucket just lost its last task and we should update
> rq->uclamp[clamp_id].value.

Right.

> However that's not exactly what the code does... It also resets
> rq->uclamp[clamp_id].bucket[bucket_id].value.

Right...

> So if I understand correctly, unless the bucket that just lost its
> last task is the top-most one its value will not be reset to nominal
> value. That looks like a bug to me. Am I missing something?

... and I think you've got a point here!

The reset to nominal value line should be done unconditionally.
I'll move it outside its current block. Thanks for spotting it.

> Side note: some more explanation would be very helpful.

Will move that "bucket local max" management code into a separate
patch as suggested by Peter. Hopefully that should make the logic
clearer and allow me to add some notes in the changelog.
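
To make the intended end state concrete, here is a stand-alone,
user-space sketch of the dequeue path with the reset applied
unconditionally, as agreed above. Everything below is illustrative only
(single rq, util_min only, no locking, made-up helper names); it is not
the patch code:

---8<---
#include <stdio.h>

#define SCHED_CAPACITY_SCALE	1024
#define UCLAMP_BUCKETS		5
#define UCLAMP_BUCKET_DELTA	((SCHED_CAPACITY_SCALE / UCLAMP_BUCKETS) + 1)

struct bucket { unsigned int value, tasks; };

static struct bucket bucket[UCLAMP_BUCKETS];
static unsigned int rq_value;

/* MAX aggregation: the topmost non-empty bucket defines the rq's clamp */
static unsigned int rq_max_value(void)
{
	int i;

	for (i = UCLAMP_BUCKETS - 1; i >= 0; i--) {
		if (bucket[i].tasks)
			return bucket[i].value;
	}
	return 0;	/* no tasks: default value for util_min */
}

static void inc_id(unsigned int clamp_value)
{
	unsigned int b = clamp_value / UCLAMP_BUCKET_DELTA;

	bucket[b].tasks++;
	/* Bucket local max: track the max "requested" clamp value */
	if (clamp_value > bucket[b].value)
		bucket[b].value = clamp_value;
	if (clamp_value > rq_value)
		rq_value = clamp_value;
}

static void dec_id(unsigned int bucket_id)
{
	unsigned int bkt_clamp;

	if (bucket[bucket_id].tasks)
		bucket[bucket_id].tasks--;
	if (bucket[bucket_id].tasks)
		return;

	/* Last task gone: always reset the bucket to its nominal value */
	bkt_clamp = bucket[bucket_id].value;
	bucket[bucket_id].value = UCLAMP_BUCKET_DELTA * bucket_id;

	/* Refresh the rq-wide max only if this bucket was defining it */
	if (bkt_clamp >= rq_value)
		rq_value = rq_max_value();
}

int main(void)
{
	unsigned int b = 235 / UCLAMP_BUCKET_DELTA;
	int i;

	for (i = 0; i < UCLAMP_BUCKETS; i++)
		bucket[i].value = UCLAMP_BUCKET_DELTA * i;

	inc_id(235);	/* Task1: ~23% boost */
	inc_id(300);	/* Task2: ~29% boost, same bucket */
	printf("runnable: bucket[%u].value=%u rq=%u\n",
	       b, bucket[b].value, rq_value);

	dec_id(b);	/* Task2 dequeues: bucket still refcounted */
	dec_id(b);	/* Task1 dequeues: bucket reset, rq recomputed */
	printf("idle:     bucket[%u].value=%u rq=%u\n",
	       b, bucket[b].value, rq_value);

	return 0;
}
---8<---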
Peter Zijlstra March 14, 2019, 1:27 p.m. UTC | #31
On Thu, Mar 14, 2019 at 11:03:30AM +0000, Patrick Bellasi wrote:

> void uclamp_rq_dec_id(struct rq *rq, int clamp_id, int bucket_id) {
>     if (__builtin_expect(!!(rq->uclamp[clamp_id].bucket[bucket_id].tasks), 1))
>         return;
>     rq->uclamp[clamp_id].bucket[bucket_id].tasks--;
> }
> ---8<---
> 
> generates something like:
> 
> ---8<---
> uclamp_rq_dec_id:
>         sxtw    x1, w1
>         add     x3, x1, x1, lsl 1
>         lsl     x3, x3, 2
>         sub     x3, x3, x1
>         lsl     x3, x3, 2
>         add     x2, x3, x2, sxtw 3
>         add     x0, x0, x2
>         ldr     w1, [x0, 8]
>         cbz     w1, .L4
>         ret
> .L4:
>         mov     w1, -1
>         str     w1, [x0, 8]
>         ret
> ---8<---
> 
> 
> which looks "sane" and quite expected, isn't it?

Yep, thanks! Sometimes I worry about silly things.
Peter Zijlstra March 14, 2019, 1:32 p.m. UTC | #32
On Thu, Mar 14, 2019 at 12:13:15PM +0000, Patrick Bellasi wrote:
> > I'd be most impressed if they pull this off. Check the generated code
> > and see I suppose :-)
> 
> On x86 the code generated looks exactly the same:
> 
>    https://godbolt.org/z/PjmA7k

Argh, they do mult by inverse to avoid the division, and can do this
because it's a constant.

And then yes, your arm version looks worse.

It does what I expected with -Os, but as Rutland said the other day,
that stands for Optimize for Sadness.
Patrick Bellasi March 14, 2019, 2:46 p.m. UTC | #33
On 13-Mar 14:32, Suren Baghdasaryan wrote:
> On Fri, Feb 8, 2019 at 2:06 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> >
> > Utilization clamping allows to clamp the CPU's utilization within a
> > [util_min, util_max] range, depending on the set of RUNNABLE tasks on
> > that CPU. Each task references two "clamp buckets" defining its minimum
> > and maximum (util_{min,max}) utilization "clamp values". A CPU's clamp
> > bucket is active if there is at least one RUNNABLE tasks enqueued on
> > that CPU and refcounting that bucket.
> >
> > When a task is {en,de}queued {on,from} a rq, the set of active clamp
> > buckets on that CPU can change. Since each clamp bucket enforces a
> > different utilization clamp value, when the set of active clamp buckets
> > changes, a new "aggregated" clamp value is computed for that CPU.
> >
> > Clamp values are always MAX aggregated for both util_min and util_max.
> > This ensures that no tasks can affect the performance of other
> > co-scheduled tasks which are more boosted (i.e. with higher util_min
> > clamp) or less capped (i.e. with higher util_max clamp).
> >
> > Each task has a:
> >    task_struct::uclamp[clamp_id]::bucket_id
> > to track the "bucket index" of the CPU's clamp bucket it refcounts while
> > enqueued, for each clamp index (clamp_id).
> >
> > Each CPU's rq has a:
> >    rq::uclamp[clamp_id]::bucket[bucket_id].tasks
> > to track how many tasks, currently RUNNABLE on that CPU, refcount each
> > clamp bucket (bucket_id) of a clamp index (clamp_id).
> >
> > Each CPU's rq has also a:
> >    rq::uclamp[clamp_id]::bucket[bucket_id].value
> > to track the clamp value of each clamp bucket (bucket_id) of a clamp
> > index (clamp_id).
> >
> > The rq::uclamp::bucket[clamp_id][] array is scanned every time we need
> > to find a new MAX aggregated clamp value for a clamp_id. This operation
> > is required only when we dequeue the last task of a clamp bucket
> > tracking the current MAX aggregated clamp value. In these cases, the CPU
> > is either entering IDLE or going to schedule a less boosted or more
> > clamped task.
> > The expected number of different clamp values, configured at build time,
> > is small enough to fit the full unordered array into a single cache
> > line.
> 
> I assume you are talking about "struct uclamp_rq uclamp[UCLAMP_CNT]"
> here.

No, I'm talking about the rq::uclamp::bucket[clamp_id][], which is an
array of:

   struct uclamp_bucket {
	unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
	unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
   };

defined as part of:

   struct uclamp_rq {
	unsigned int value;
	struct uclamp_bucket bucket[UCLAMP_BUCKETS];
   };


So, it's an array of UCLAMP_BUCKETS (value, tasks) pairs.

> uclamp_rq size depends on UCLAMP_BUCKETS configurable to be up
> to 20. sizeof(long)*20 is already more than 64 bytes. What am I
> missing?

Right, the comment above refers to the default configuration, which is
5 buckets. With that configuration we have:


$> pahole kernel/sched/core.o

---8<---
   struct uclamp_bucket {
           long unsigned int          value:11;             /*     0:53  8 */
           long unsigned int          tasks:53;             /*     0: 0  8 */

           /* size: 8, cachelines: 1, members: 2 */
           /* last cacheline: 8 bytes */
   };

   struct uclamp_rq {
           unsigned int               value;                /*     0     4 */

           /* XXX 4 bytes hole, try to pack */

           struct uclamp_bucket       bucket[5];            /*     8    40 */

           /* size: 48, cachelines: 1, members: 2 */
           /* sum members: 44, holes: 1, sum holes: 4 */
           /* last cacheline: 48 bytes */
   };

   struct rq {
           // ...
           /* --- cacheline 2 boundary (128 bytes) --- */
           struct uclamp_rq           uclamp[2];            /*   128    96 */
           /* --- cacheline 3 boundary (192 bytes) was 32 bytes ago --- */
           // ...
   };
---8<---

As you can see, the array fits into a single cache line.

Actually I notice now that, since we removed the bucket dedicated
to the default values, we have some spare space and we can
probably increase the default (and minimum) value of UCLAMP_BUCKETS to
be 7.

This will use two full cache lines in struct rq, one for each clamp
index... Although 7 is a bit of an odd number and gives default
buckets of ~14% size instead of ~20%.
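
As a quick sanity check of that sizing (a stand-alone sketch, assuming
a 64-bit build and the same bitfield layout reported by pahole above),
with 7 buckets each uclamp_rq becomes exactly one cache line:

---8<---
#include <assert.h>

struct uclamp_bucket {
	unsigned long value : 11;
	unsigned long tasks : 53;
};

struct uclamp_rq {
	unsigned int value;			/* 4 bytes + 4 bytes hole */
	struct uclamp_bucket bucket[7];		/* 7 * 8 = 56 bytes */
};

int main(void)
{
	assert(sizeof(struct uclamp_bucket) == 8);
	assert(sizeof(struct uclamp_rq) == 64);	/* one 64B cache line */
	return 0;
}
---8<---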

Thoughts ?

[...]
Patrick Bellasi March 14, 2019, 2:54 p.m. UTC | #34
On 13-Mar 14:01, Suren Baghdasaryan wrote:
> On Wed, Mar 13, 2019 at 8:15 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> >
> > On 12-Mar 13:52, Dietmar Eggemann wrote:
> > > On 2/8/19 11:05 AM, Patrick Bellasi wrote:

[...]

> > > > +static inline unsigned int uclamp_bucket_value(unsigned int clamp_value)
> > > > +{
> > > > +   return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
> > > > +}
> > >
> > > Soemthing like uclamp_bucket_nominal_value() should be clearer.
> >
> > Maybe... can update it in v8
> >
> 
> uclamp_bucket_base_value is a little shorter, just to consider :)

Right, I also like it better ;)
Patrick Bellasi March 14, 2019, 3 p.m. UTC | #35
On 13-Mar 15:15, Patrick Bellasi wrote:
> On 12-Mar 13:52, Dietmar Eggemann wrote:
> > On 2/8/19 11:05 AM, Patrick Bellasi wrote:

[...]

> > > + * within each bucket the exact "requested" clamp value whenever all tasks
> > > + * RUNNABLE in that bucket require the same clamp.
> > > + */
> > > +static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
> > > +				    unsigned int clamp_id)
> > > +{
> > > +	unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> > > +	unsigned int rq_clamp, bkt_clamp, tsk_clamp;
> > 
> > Wouldn't it be easier to have a pointer to the task's and rq's uclamp
> > structure as well to the bucket?
> > 
> > -       unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> > +       struct uclamp_se *uc_se = &p->uclamp[clamp_id];
> > +       struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
> > +       struct uclamp_bucket *bucket = &uc_rq->bucket[uc_se->bucket_id];
> 
> I think I went back/forth a couple of times in using pointer or the
> extended version, which both have pros and cons.
> 
> I personally prefer the pointers as you suggest but I've got the
> impression in the past that since everybody cleared "basic C trainings"
> it's not so difficult to read the code above too.
> 
> > The code in uclamp_rq_inc_id() and uclamp_rq_dec_id() for example becomes
> > much more readable.
> 
> Agree... let's try to switch once again in v8 and see ;)

This is not as straightforward as I thought.

We either still need local variables to use with max(), which does not
play well with bitfield values, or we have to avoid using it and go
back to conditional updates:

---8<---
static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
				    unsigned int clamp_id)
{
	struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
	struct uclamp_req *uc_se = &p->uclamp_se[clamp_id];
	struct uclamp_bucket *bucket = &uc_rq->bucket[uc_se->bucket_id];

	bucket->tasks++;

	/*
	 * Local clamping: rq's buckets always track the max "requested"
	 * clamp value from all RUNNABLE tasks in that bucket.
	 */
	if (uc_se->value > bucket->value)
		bucket->value = uc_se->value;

	if (uc_se->value > READ_ONCE(uc_rq->value))
		WRITE_ONCE(uc_rq->value, uc_se->value);
}
---8<---

I remember Peter asking for max() in one of the past reviews... but the
code above looks simpler to me too... let's see if this time it can be
accepted. :)
Patrick Bellasi March 14, 2019, 3:07 p.m. UTC | #36
On 14-Mar 14:32, Peter Zijlstra wrote:
> On Thu, Mar 14, 2019 at 12:13:15PM +0000, Patrick Bellasi wrote:
> > > I'd be most impressed if they pull this off. Check the generated code
> > > and see I suppose :-)
> > 
> > On x86 the code generated looks exactly the same:
> > 
> >    https://godbolt.org/z/PjmA7k
> 
> Argh, they do mult by inverse to avoid the division, and can do this
> because its a constant.
> 
> And then yes, your arm version looks worse.

your "arm version" is worst then x86, or "your version" is worse?

IOW, should I keep the code as the original? Do you prefer your
version? Or... we don't really care...

> It does what I expected with -Os, but as Rutland said the other day,
> that stands for Optimize for Sadness.

Yes, I guess we cannot optimize for all flags... however, just let me
know what you prefer and I'll put that version in ;)
Suren Baghdasaryan March 14, 2019, 3:29 p.m. UTC | #37
On Thu, Mar 14, 2019 at 7:46 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote:
>
> On 13-Mar 14:32, Suren Baghdasaryan wrote:
> > On Fri, Feb 8, 2019 at 2:06 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> > >
> > > Utilization clamping allows to clamp the CPU's utilization within a
> > > [util_min, util_max] range, depending on the set of RUNNABLE tasks on
> > > that CPU. Each task references two "clamp buckets" defining its minimum
> > > and maximum (util_{min,max}) utilization "clamp values". A CPU's clamp
> > > bucket is active if there is at least one RUNNABLE tasks enqueued on
> > > that CPU and refcounting that bucket.
> > >
> > > When a task is {en,de}queued {on,from} a rq, the set of active clamp
> > > buckets on that CPU can change. Since each clamp bucket enforces a
> > > different utilization clamp value, when the set of active clamp buckets
> > > changes, a new "aggregated" clamp value is computed for that CPU.
> > >
> > > Clamp values are always MAX aggregated for both util_min and util_max.
> > > This ensures that no tasks can affect the performance of other
> > > co-scheduled tasks which are more boosted (i.e. with higher util_min
> > > clamp) or less capped (i.e. with higher util_max clamp).
> > >
> > > Each task has a:
> > >    task_struct::uclamp[clamp_id]::bucket_id
> > > to track the "bucket index" of the CPU's clamp bucket it refcounts while
> > > enqueued, for each clamp index (clamp_id).
> > >
> > > Each CPU's rq has a:
> > >    rq::uclamp[clamp_id]::bucket[bucket_id].tasks
> > > to track how many tasks, currently RUNNABLE on that CPU, refcount each
> > > clamp bucket (bucket_id) of a clamp index (clamp_id).
> > >
> > > Each CPU's rq has also a:
> > >    rq::uclamp[clamp_id]::bucket[bucket_id].value
> > > to track the clamp value of each clamp bucket (bucket_id) of a clamp
> > > index (clamp_id).
> > >
> > > The rq::uclamp::bucket[clamp_id][] array is scanned every time we need
> > > to find a new MAX aggregated clamp value for a clamp_id. This operation
> > > is required only when we dequeue the last task of a clamp bucket
> > > tracking the current MAX aggregated clamp value. In these cases, the CPU
> > > is either entering IDLE or going to schedule a less boosted or more
> > > clamped task.
> > > The expected number of different clamp values, configured at build time,
> > > is small enough to fit the full unordered array into a single cache
> > > line.
> >
> > I assume you are talking about "struct uclamp_rq uclamp[UCLAMP_CNT]"
> > here.
>
> No, I'm talking about the rq::uclamp::bucket[clamp_id][], which is an
> array of:
>
>    struct uclamp_bucket {
>         unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
>         unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
>    };
>
> defined as part of:
>
>    struct uclamp_rq {
>         unsigned int value;
>         struct uclamp_bucket bucket[UCLAMP_BUCKETS];
>    };
>
>
> So, it's an array of UCLAMP_BUCKETS (value, tasks) pairs.
>
> > uclamp_rq size depends on UCLAMP_BUCKETS, which is configurable up
> > to 20. sizeof(long)*20 is already more than 64 bytes. What am I
> > missing?
>
> Right, the comment above refers to the default configuration, which is
> 5 buckets. With that configuration we have:
>
>
> $> pahole kernel/sched/core.o
>
> ---8<---
>    struct uclamp_bucket {
>            long unsigned int          value:11;             /*     0:53  8 */
>            long unsigned int          tasks:53;             /*     0: 0  8 */
>
>            /* size: 8, cachelines: 1, members: 2 */
>            /* last cacheline: 8 bytes */
>    };
>
>    struct uclamp_rq {
>            unsigned int               value;                /*     0     4 */
>
>            /* XXX 4 bytes hole, try to pack */
>
>            struct uclamp_bucket       bucket[5];            /*     8    40 */
>
>            /* size: 48, cachelines: 1, members: 2 */
>            /* sum members: 44, holes: 1, sum holes: 4 */
>            /* last cacheline: 48 bytes */
>    };
>
>    struct rq {
>            // ...
>            /* --- cacheline 2 boundary (128 bytes) --- */
>            struct uclamp_rq           uclamp[2];            /*   128    96 */
>            /* --- cacheline 3 boundary (192 bytes) was 32 bytes ago --- */
>            // ...
>    };
> ---8<---
>
> Where you see the array fits into a single cache line.
>
> Actually I notice now that, since we removed the bucket dedicated
> to the default values, we now have some spare space and we can
> probably increase the default (and minimum) value of UCLAMP_BUCKETS
> to 7.
>
> This will use two full cache lines in struct rq, one for each clamp
> index... Although 7 is a bit of an odd number and by default gives
> buckets of ~14% size instead of ~20%.
>
> Thoughts ?

Got it. From reading the documentation at the beginning my impression
was that whatever value I choose within the allowed 5-20 range would
still fit in a cache line. To disambiguate it might be worth
mentioning that this is true for the default value or for values up to
7. Thanks!

> [...]
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi
Patrick Bellasi March 14, 2019, 3:40 p.m. UTC | #38
On 14-Mar 08:29, Suren Baghdasaryan wrote:
> On Thu, Mar 14, 2019 at 7:46 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> > On 13-Mar 14:32, Suren Baghdasaryan wrote:
> > > On Fri, Feb 8, 2019 at 2:06 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote:

[...]

> > > > The rq::uclamp::bucket[clamp_id][] array is scanned every time we need
> > > > to find a new MAX aggregated clamp value for a clamp_id. This operation
> > > > is required only when we dequeue the last task of a clamp bucket
> > > > tracking the current MAX aggregated clamp value. In these cases, the CPU
> > > > is either entering IDLE or going to schedule a less boosted or more
> > > > clamped task.

The following:

> > > > The expected number of different clamp values, configured at build time,
> > > > is small enough to fit the full unordered array into a single cache
> > > > line.

will read:

The expected number of different clamp values, configured at build time,
is small enough to fit the full unordered array into a single cache
line for the default UCLAMP_BUCKETS configuration of 7 buckets.

[...]

> Got it. From reading the documentation at the beginning my impression
> was that whatever value I choose within the allowed 5-20 range would
> still fit in a cache line. To disambiguate it might be worth
> mentioning that this is true for the default value or for values up to
> 7. Thanks!

Right, I hope the above proposed change helps to clarify that.
Suren Baghdasaryan March 14, 2019, 4:39 p.m. UTC | #39
On Thu, Mar 14, 2019 at 8:41 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote:
>
> On 14-Mar 08:29, Suren Baghdasaryan wrote:
> > On Thu, Mar 14, 2019 at 7:46 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> > > On 13-Mar 14:32, Suren Baghdasaryan wrote:
> > > > On Fri, Feb 8, 2019 at 2:06 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote:
>
> [...]
>
> > > > > The rq::uclamp::bucket[clamp_id][] array is scanned every time we need
> > > > > to find a new MAX aggregated clamp value for a clamp_id. This operation
> > > > > is required only when we dequeue the last task of a clamp bucket
> > > > > tracking the current MAX aggregated clamp value. In these cases, the CPU
> > > > > is either entering IDLE or going to schedule a less boosted or more
> > > > > clamped task.
>
> The following:
>
> > > > > The expected number of different clamp values, configured at build time,
> > > > > is small enough to fit the full unordered array into a single cache
> > > > > line.
>
> will read:
>
> The expected number of different clamp values, configured at build time,
> is small enough to fit the full unordered array into a single cache
> line for the default UCLAMP_BUCKETS configuration of 7 buckets.

I think keeping the default at 5 is good. As you mentioned, it's a nice
round number, and keeping it at the minimum also hints that this is not
a free resource: the more buckets you use, the more you pay.
The documentation might say "to fit the full unordered array into a
single cache line for configurations of up to 7 buckets".
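
For reference, the sizing arithmetic behind "up to 7 buckets" can be
checked with a quick userspace sketch. The names below are made up for
illustration and only mirror the layout shown in the pahole output
earlier in the thread; a 64-byte cache line and an LP64 build (8-byte
unsigned long) are assumed:

---8<---
#include <assert.h>

#define DEMO_CAPACITY_BITS	11	/* bits_per(SCHED_CAPACITY_SCALE) for 1024 */
#define DEMO_BUCKETS		7	/* try 5 (default) up to 20 */

/* One (value, tasks) pair packed into a single 8-byte word. */
struct demo_bucket {
	unsigned long value : DEMO_CAPACITY_BITS;
	unsigned long tasks : 64 - DEMO_CAPACITY_BITS;
};

/* 4-byte clamp value + 4 bytes of padding + DEMO_BUCKETS * 8 bytes. */
struct demo_rq_clamp {
	unsigned int value;
	struct demo_bucket bucket[DEMO_BUCKETS];
};

/* 8 + 7*8 = 64 bytes: exactly one cache line; 8+ buckets spill over. */
static_assert(sizeof(struct demo_rq_clamp) <= 64,
	      "per-clamp-index data no longer fits a single cache line");

int main(void) { return 0; }
---8<---

With the default 5 buckets the same structure is 48 bytes, matching the
pahole output above.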

> [...]
>
> > Got it. From reading the documentation at the beginning my impression
> > was that whatever value I choose within the allowed 5-20 range would
> > still fit in a cache line. To disambiguate it might be worth
> > mentioning that this is true for the default value or for values up to
> > 7. Thanks!
>
> Right, I hope the above proposed change helps to clarify that.
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi
Peter Zijlstra March 14, 2019, 7:18 p.m. UTC | #40
On Thu, Mar 14, 2019 at 03:07:53PM +0000, Patrick Bellasi wrote:
> On 14-Mar 14:32, Peter Zijlstra wrote:
> > On Thu, Mar 14, 2019 at 12:13:15PM +0000, Patrick Bellasi wrote:
> > > > I'd be most impressed if they pull this off. Check the generated code
> > > > and see I suppose :-)
> > > 
> > > On x86 the code generated looks exactly the same:
> > > 
> > >    https://godbolt.org/z/PjmA7k
> > 
> > Argh, they do mult by inverse to avoid the division, and can do this
> > because it's a constant.
> > 
> > And then yes, your arm version looks worse.
> 
> your "arm version" is worst then x86, or "your version" is worse?
> 
> IOW, should I keep the code as the original? Do you prefer your
> version? Or... we don't really care...

Yeah, keep the original, it didn't matter on x86 and arm regressed with
my version.

> > It does what I expected with -Os, but as Rutland said the other day,
> > that stands for Optimize for Sadness.
> 
> Yes, I guess we cannot optimize for all flags... however, just let me
> know what you prefer and I'll put that version in ;)

Yeah, don't bother optimizing for -Os, it generally creates atrocious
crap, hence Rutland calling it 'Optimize for Sadness'.
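
As an aside, Peter's "mult by inverse" point can be reproduced with a
minimal standalone sketch; the helper names below are invented, and the
macros just mirror the default 5-bucket configuration from the patch.
Because UCLAMP_BUCKET_DELTA is a compile-time constant, compilers can
lower the division in the bucket-mapping helper to a multiply plus
shift, while a runtime divisor forces a real divide:

---8<---
#define SCHED_CAPACITY_SCALE	1024
#define UCLAMP_BUCKETS		5
#define UCLAMP_BUCKET_DELTA	((SCHED_CAPACITY_SCALE / UCLAMP_BUCKETS) + 1)

/* Constant divisor (205): typically compiled to a multiply-by-inverse. */
static inline unsigned int demo_bucket_id(unsigned int clamp_value)
{
	return clamp_value / UCLAMP_BUCKET_DELTA;
}

/* Runtime divisor: this one needs an actual division instruction. */
unsigned int demo_bucket_id_dyn(unsigned int clamp_value, unsigned int delta)
{
	return clamp_value / delta;
}

/* E.g. a 25% boost (clamp value 256) lands in bucket 256/205 = 1. */
unsigned int demo(unsigned int clamp_value)
{
	return demo_bucket_id(clamp_value);
}
---8<---

Compiling both at -O2 (e.g. on godbolt, as in the link quoted above)
should show no divide instruction in the first helper, only in the
second.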
diff mbox series

Patch

diff --git a/include/linux/log2.h b/include/linux/log2.h
index 2af7f77866d0..e2db25734532 100644
--- a/include/linux/log2.h
+++ b/include/linux/log2.h
@@ -224,4 +224,41 @@  int __order_base_2(unsigned long n)
 		ilog2((n) - 1) + 1) :		\
 	__order_base_2(n)			\
 )
+
+static inline __attribute__((const))
+int __bits_per(unsigned long n)
+{
+	if (n < 2)
+		return 1;
+	if (is_power_of_2(n))
+		return order_base_2(n) + 1;
+	return order_base_2(n);
+}
+
+/**
+ * bits_per - calculate the number of bits required for the argument
+ * @n: parameter
+ *
+ * This is constant-capable and can be used for compile time
+ * initializations, e.g. bitfields.
+ *
+ * The first few values calculated by this routine:
+ * bf(0) = 1
+ * bf(1) = 1
+ * bf(2) = 2
+ * bf(3) = 2
+ * bf(4) = 3
+ * ... and so on.
+ */
+#define bits_per(n)				\
+(						\
+	__builtin_constant_p(n) ? (		\
+		((n) == 0 || (n) == 1) ? 1 : (	\
+		((n) & (n - 1)) == 0 ?		\
+			ilog2((n) - 1) + 2 :	\
+			ilog2((n) - 1) + 1	\
+		)				\
+	) :					\
+	__bits_per(n)				\
+)
 #endif /* _LINUX_LOG2_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4112639c2a85..45460e7a3eee 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -281,6 +281,18 @@  struct vtime {
 	u64			gtime;
 };
 
+/*
+ * Utilization clamp constraints.
+ * @UCLAMP_MIN:	Minimum utilization
+ * @UCLAMP_MAX:	Maximum utilization
+ * @UCLAMP_CNT:	Utilization clamp constraints count
+ */
+enum uclamp_id {
+	UCLAMP_MIN = 0,
+	UCLAMP_MAX,
+	UCLAMP_CNT
+};
+
 struct sched_info {
 #ifdef CONFIG_SCHED_INFO
 	/* Cumulative counters: */
@@ -312,6 +324,10 @@  struct sched_info {
 # define SCHED_FIXEDPOINT_SHIFT		10
 # define SCHED_FIXEDPOINT_SCALE		(1L << SCHED_FIXEDPOINT_SHIFT)
 
+/* Increase resolution of cpu_capacity calculations */
+# define SCHED_CAPACITY_SHIFT		SCHED_FIXEDPOINT_SHIFT
+# define SCHED_CAPACITY_SCALE		(1L << SCHED_CAPACITY_SHIFT)
+
 struct load_weight {
 	unsigned long			weight;
 	u32				inv_weight;
@@ -560,6 +576,25 @@  struct sched_dl_entity {
 	struct hrtimer inactive_timer;
 };
 
+#ifdef CONFIG_UCLAMP_TASK
+/* Number of utilization clamp buckets (shorter alias) */
+#define UCLAMP_BUCKETS CONFIG_UCLAMP_BUCKETS_COUNT
+
+/*
+ * Utilization clamp for a scheduling entity
+ * @value:		clamp value "requested" by a se
+ * @bucket_id:		clamp bucket corresponding to the "requested" value
+ *
+ * The bucket_id is the index of the clamp bucket matching the clamp value
+ * which is pre-computed and stored to avoid expensive integer divisions from
+ * the fast path.
+ */
+struct uclamp_se {
+	unsigned int value		: bits_per(SCHED_CAPACITY_SCALE);
+	unsigned int bucket_id		: bits_per(UCLAMP_BUCKETS);
+};
+#endif /* CONFIG_UCLAMP_TASK */
+
 union rcu_special {
 	struct {
 		u8			blocked;
@@ -640,6 +675,10 @@  struct task_struct {
 #endif
 	struct sched_dl_entity		dl;
 
+#ifdef CONFIG_UCLAMP_TASK
+	struct uclamp_se		uclamp[UCLAMP_CNT];
+#endif
+
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	/* List of struct preempt_notifier: */
 	struct hlist_head		preempt_notifiers;
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index c31d3a47a47c..04beadac6985 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -6,12 +6,6 @@ 
 
 #include <linux/sched/idle.h>
 
-/*
- * Increase resolution of cpu_capacity calculations
- */
-#define SCHED_CAPACITY_SHIFT	SCHED_FIXEDPOINT_SHIFT
-#define SCHED_CAPACITY_SCALE	(1L << SCHED_CAPACITY_SHIFT)
-
 /*
  * sched-domains (multiprocessor balancing) declarations:
  */
diff --git a/init/Kconfig b/init/Kconfig
index 513fa544a134..34e23d5d95d1 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -640,6 +640,59 @@  config HAVE_UNSTABLE_SCHED_CLOCK
 config GENERIC_SCHED_CLOCK
 	bool
 
+menu "Scheduler features"
+
+config UCLAMP_TASK
+	bool "Enable utilization clamping for RT/FAIR tasks"
+	depends on CPU_FREQ_GOV_SCHEDUTIL
+	help
+	  This feature enables the scheduler to track the clamped utilization
+	  of each CPU based on RUNNABLE tasks scheduled on that CPU.
+
+	  With this option, the user can specify the min and max CPU
+	  utilization allowed for RUNNABLE tasks. The max utilization defines
+	  the maximum frequency a task should use while the min utilization
+	  defines the minimum frequency it should use.
+
+	  Both min and max utilization clamp values are hints to the scheduler,
+	  aiming at improving its frequency selection policy, but they do not
+	  enforce or grant any specific bandwidth for tasks.
+
+	  If in doubt, say N.
+
+config UCLAMP_BUCKETS_COUNT
+	int "Number of supported utilization clamp buckets"
+	range 5 20
+	default 5
+	depends on UCLAMP_TASK
+	help
+	  Defines the number of clamp buckets to use. The range of each bucket
+	  will be SCHED_CAPACITY_SCALE/UCLAMP_BUCKETS_COUNT. The higher the
+	  number of clamp buckets the finer their granularity and the higher
+	  the precision of clamping aggregation and tracking at run-time.
+
+	  For example, with the default configuration we will have 5 clamp
+	  buckets tracking 20% utilization each. A 25% boosted task will be
+	  refcounted in the [20..39]% bucket and will set the bucket clamp
+	  effective value to 25%.
+	  If a second 30% boosted task should be co-scheduled on the same CPU,
+	  that task will be refcounted in the same bucket of the first task and
+	  it will boost the bucket clamp effective value to 30%.
+	  The clamp effective value of a bucket is reset to its nominal value
+	  (20% in the example above) when there are no more tasks refcounted in
+	  that bucket.
+
+	  An additional boost/capping margin can be added to some tasks. In the
+	  example above the 25% task will be boosted to 30% until it exits the
+	  CPU. If that is considered not acceptable on certain systems,
+	  it's always possible to reduce the margin by increasing the number of
+	  clamp buckets to trade off used memory for run-time tracking
+	  precision.
+
+	  If in doubt, use the default value.
+
+endmenu
+
 #
 # For architectures that want to enable the support for NUMA-affine scheduler
 # balancing logic:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ec1b67a195cc..8ecf5470058c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -719,6 +719,167 @@  static void set_load_weight(struct task_struct *p, bool update_load)
 	}
 }
 
+#ifdef CONFIG_UCLAMP_TASK
+
+/* Integer ceil-rounded range for each bucket */
+#define UCLAMP_BUCKET_DELTA ((SCHED_CAPACITY_SCALE / UCLAMP_BUCKETS) + 1)
+
+static inline unsigned int uclamp_bucket_id(unsigned int clamp_value)
+{
+	return clamp_value / UCLAMP_BUCKET_DELTA;
+}
+
+static inline unsigned int uclamp_bucket_value(unsigned int clamp_value)
+{
+	return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
+}
+
+static inline unsigned int uclamp_none(int clamp_id)
+{
+	if (clamp_id == UCLAMP_MIN)
+		return 0;
+	return SCHED_CAPACITY_SCALE;
+}
+
+static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
+{
+	struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
+	unsigned int max_value = uclamp_none(clamp_id);
+	unsigned int bucket_id;
+
+	/*
+	 * Both min and max clamps are MAX aggregated, thus the topmost
+	 * bucket with some tasks defines the rq's clamp value.
+	 */
+	bucket_id = UCLAMP_BUCKETS;
+	do {
+		--bucket_id;
+		if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks)
+			continue;
+		max_value = bucket[bucket_id].value;
+		break;
+	} while (bucket_id);
+
+	WRITE_ONCE(rq->uclamp[clamp_id].value, max_value);
+}
+
+/*
+ * When a task is enqueued on a rq, the clamp bucket currently defined by the
+ * task's uclamp::bucket_id is reference counted on that rq. This also
+ * immediately updates the rq's clamp value if required.
+ *
+ * Since tasks know their specific value requested from user-space, we track
+ * within each bucket the maximum value for tasks refcounted in that bucket.
+ * This provides a further aggregation (local clamping) which allows tracking
+ * within each bucket the exact "requested" clamp value whenever all tasks
+ * RUNNABLE in that bucket require the same clamp.
+ */
+static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
+				    unsigned int clamp_id)
+{
+	unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
+	unsigned int rq_clamp, bkt_clamp, tsk_clamp;
+
+	rq->uclamp[clamp_id].bucket[bucket_id].tasks++;
+
+	/*
+	 * Local clamping: rq's buckets always track the max "requested"
+	 * clamp value from all RUNNABLE tasks in that bucket.
+	 */
+	tsk_clamp = p->uclamp[clamp_id].value;
+	bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value;
+	rq->uclamp[clamp_id].bucket[bucket_id].value = max(bkt_clamp, tsk_clamp);
+
+	rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value);
+	WRITE_ONCE(rq->uclamp[clamp_id].value, max(rq_clamp, tsk_clamp));
+}
+
+/*
+ * When a task is dequeued from a rq, the clamp bucket reference counted by
+ * the task is released. If this is the last task reference counting the rq's
+ * max active clamp value, then the rq's clamp value is updated.
+ * Both the tasks reference counter and the rq's cached clamp values are
+ * expected to be always valid; if we detect they are not, we skip the updates,
+ * enforce a consistent state and warn.
+ */
+static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
+				    unsigned int clamp_id)
+{
+	unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
+	unsigned int rq_clamp, bkt_clamp;
+
+	SCHED_WARN_ON(!rq->uclamp[clamp_id].bucket[bucket_id].tasks);
+	if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
+		rq->uclamp[clamp_id].bucket[bucket_id].tasks--;
+
+	/*
+	 * Keep "local clamping" simple and accept to (possibly) overboost
+	 * still RUNNABLE tasks in the same bucket.
+	 */
+	if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
+		return;
+	bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value;
+
+	/* The rq's clamp value is expected to always track the max */
+	rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value);
+	SCHED_WARN_ON(bkt_clamp > rq_clamp);
+	if (bkt_clamp >= rq_clamp) {
+		/*
+		 * Reset rq's clamp bucket value to its nominal value whenever
+		 * there are no more RUNNABLE tasks refcounting it.
+		 */
+		rq->uclamp[clamp_id].bucket[bucket_id].value =
+			uclamp_bucket_value(rq_clamp);
+		uclamp_rq_update(rq, clamp_id);
+	}
+}
+
+static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
+{
+	unsigned int clamp_id;
+
+	if (unlikely(!p->sched_class->uclamp_enabled))
+		return;
+
+	for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+		uclamp_rq_inc_id(p, rq, clamp_id);
+}
+
+static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
+{
+	unsigned int clamp_id;
+
+	if (unlikely(!p->sched_class->uclamp_enabled))
+		return;
+
+	for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+		uclamp_rq_dec_id(p, rq, clamp_id);
+}
+
+static void __init init_uclamp(void)
+{
+	unsigned int clamp_id;
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
+
+	for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+		unsigned int clamp_value = uclamp_none(clamp_id);
+		unsigned int bucket_id = uclamp_bucket_id(clamp_value);
+		struct uclamp_se *uc_se = &init_task.uclamp[clamp_id];
+
+		uc_se->bucket_id = bucket_id;
+		uc_se->value = clamp_value;
+	}
+}
+
+#else /* CONFIG_UCLAMP_TASK */
+static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
+static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
+static inline void init_uclamp(void) { }
+#endif /* CONFIG_UCLAMP_TASK */
+
 static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 {
 	if (!(flags & ENQUEUE_NOCLOCK))
@@ -729,6 +890,7 @@  static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 		psi_enqueue(p, flags & ENQUEUE_WAKEUP);
 	}
 
+	uclamp_rq_inc(rq, p);
 	p->sched_class->enqueue_task(rq, p, flags);
 }
 
@@ -742,6 +904,7 @@  static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 		psi_dequeue(p, flags & DEQUEUE_SLEEP);
 	}
 
+	uclamp_rq_dec(rq, p);
 	p->sched_class->dequeue_task(rq, p, flags);
 }
 
@@ -6075,6 +6238,8 @@  void __init sched_init(void)
 
 	psi_init();
 
+	init_uclamp();
+
 	scheduler_running = 1;
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c688ef5012e5..ea9e28723946 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -797,6 +797,48 @@  extern void rto_push_irq_work_func(struct irq_work *work);
 #endif
 #endif /* CONFIG_SMP */
 
+#ifdef CONFIG_UCLAMP_TASK
+/*
+ * struct uclamp_bucket - Utilization clamp bucket
+ * @value: utilization clamp value for tasks on this clamp bucket
+ * @tasks: number of RUNNABLE tasks on this clamp bucket
+ *
+ * Keep track of how many tasks are RUNNABLE for a given utilization
+ * clamp value.
+ */
+struct uclamp_bucket {
+	unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
+	unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
+};
+
+/*
+ * struct uclamp_rq - rq's utilization clamp
+ * @value: currently active clamp values for a rq
+ * @bucket: utilization clamp buckets affecting a rq
+ *
+ * Keep track of RUNNABLE tasks on a rq to aggregate their clamp values.
+ * A clamp value is affecting a rq when there is at least one task RUNNABLE
+ * (or actually running) with that value.
+ *
+ * We have up to UCLAMP_CNT possible different clamp values, which are
+ * currently only two: minimum utilization and maximum utilization.
+ *
+ * All utilization clamping values are MAX aggregated, since:
+ * - for util_min: we want to run the CPU at least at the max of the minimum
+ *   utilization required by its currently RUNNABLE tasks.
+ * - for util_max: we want to allow the CPU to run up to the max of the
+ *   maximum utilization allowed by its currently RUNNABLE tasks.
+ *
+ * Since on each system we expect only a limited number of different
+ * utilization clamp values (UCLAMP_BUCKETS), we use a simple array to track
+ * the metrics required to compute all the per-rq utilization clamp values.
+ */
+struct uclamp_rq {
+	unsigned int value;
+	struct uclamp_bucket bucket[UCLAMP_BUCKETS];
+};
+#endif /* CONFIG_UCLAMP_TASK */
+
 /*
  * This is the main, per-CPU runqueue data structure.
  *
@@ -835,6 +877,11 @@  struct rq {
 	unsigned long		nr_load_updates;
 	u64			nr_switches;
 
+#ifdef CONFIG_UCLAMP_TASK
+	/* Utilization clamp values based on CPU's RUNNABLE tasks */
+	struct uclamp_rq	uclamp[UCLAMP_CNT] ____cacheline_aligned;
+#endif
+
 	struct cfs_rq		cfs;
 	struct rt_rq		rt;
 	struct dl_rq		dl;
@@ -1649,10 +1696,12 @@  extern const u32		sched_prio_to_wmult[40];
 struct sched_class {
 	const struct sched_class *next;
 
+#ifdef CONFIG_UCLAMP_TASK
+	int uclamp_enabled;
+#endif
+
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
-	void (*yield_task)   (struct rq *rq);
-	bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);
 
 	void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);
 
@@ -1685,7 +1734,6 @@  struct sched_class {
 	void (*set_curr_task)(struct rq *rq);
 	void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
 	void (*task_fork)(struct task_struct *p);
-	void (*task_dead)(struct task_struct *p);
 
 	/*
 	 * The switched_from() call is allowed to drop rq->lock, therefore we
@@ -1702,12 +1750,17 @@  struct sched_class {
 
 	void (*update_curr)(struct rq *rq);
 
+	void (*yield_task)   (struct rq *rq);
+	bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);
+
 #define TASK_SET_GROUP		0
 #define TASK_MOVE_GROUP		1
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	void (*task_change_group)(struct task_struct *p, int type);
 #endif
+
+	void (*task_dead)(struct task_struct *p);
 };
 
 static inline void put_prev_task(struct rq *rq, struct task_struct *prev)