Message ID | 20190208100554.32196-2-patrick.bellasi@arm.com (mailing list archive)
State | Not Applicable, archived
Series | Add utilization clamping support
On 2/8/19 11:05 AM, Patrick Bellasi wrote: [...] > +config UCLAMP_BUCKETS_COUNT > + int "Number of supported utilization clamp buckets" > + range 5 20 > + default 5 > + depends on UCLAMP_TASK > + help > + Defines the number of clamp buckets to use. The range of each bucket > + will be SCHED_CAPACITY_SCALE/UCLAMP_BUCKETS_COUNT. The higher the > + number of clamp buckets the finer their granularity and the higher > + the precision of clamping aggregation and tracking at run-time. > + > + For example, with the default configuration we will have 5 clamp > + buckets tracking 20% utilization each. A 25% boosted tasks will be > + refcounted in the [20..39]% bucket and will set the bucket clamp > + effective value to 25%. > + If a second 30% boosted task should be co-scheduled on the same CPU, > + that task will be refcounted in the same bucket of the first task and > + it will boost the bucket clamp effective value to 30%. > + The clamp effective value of a bucket is reset to its nominal value > + (20% in the example above) when there are anymore tasks refcounted in this sounds weird. [...] > +static inline unsigned int uclamp_bucket_value(unsigned int clamp_value) > +{ > + return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value); > +} Soemthing like uclamp_bucket_nominal_value() should be clearer. > +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id) > +{ > + struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket; > + unsigned int max_value = uclamp_none(clamp_id); > + unsigned int bucket_id; unsigned int bucket_id = UCLAMP_BUCKETS; > + > + /* > + * Both min and max clamps are MAX aggregated, thus the topmost > + * bucket with some tasks defines the rq's clamp value. > + */ > + bucket_id = UCLAMP_BUCKETS; to get rid of this line? > + do { > + --bucket_id; > + if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks) if (!bucket[bucket_id].tasks) [...] > +/* > + * When a task is enqueued on a rq, the clamp bucket currently defined by the > + * task's uclamp::bucket_id is reference counted on that rq. This also > + * immediately updates the rq's clamp value if required. > + * > + * Since tasks know their specific value requested from user-space, we track > + * within each bucket the maximum value for tasks refcounted in that bucket. > + * This provide a further aggregation (local clamping) which allows to track s/This provide/This provides > + * within each bucket the exact "requested" clamp value whenever all tasks > + * RUNNABLE in that bucket require the same clamp. > + */ > +static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq, > + unsigned int clamp_id) > +{ > + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id; > + unsigned int rq_clamp, bkt_clamp, tsk_clamp; Wouldn't it be easier to have a pointer to the task's and rq's uclamp structure as well to the bucket? - unsigned int bucket_id = p->uclamp[clamp_id].bucket_id; + struct uclamp_se *uc_se = &p->uclamp[clamp_id]; + struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id]; + struct uclamp_bucket *bucket = &uc_rq->bucket[uc_se->bucket_id]; The code in uclamp_rq_inc_id() and uclamp_rq_dec_id() for example becomes much more readable. [...] 
> struct sched_class { > const struct sched_class *next; > > +#ifdef CONFIG_UCLAMP_TASK > + int uclamp_enabled; > +#endif > + > void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags); > void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags); > - void (*yield_task) (struct rq *rq); > - bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt); > > void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags); > > @@ -1685,7 +1734,6 @@ struct sched_class { > void (*set_curr_task)(struct rq *rq); > void (*task_tick)(struct rq *rq, struct task_struct *p, int queued); > void (*task_fork)(struct task_struct *p); > - void (*task_dead)(struct task_struct *p); > > /* > * The switched_from() call is allowed to drop rq->lock, therefore we > @@ -1702,12 +1750,17 @@ struct sched_class { > > void (*update_curr)(struct rq *rq); > > + void (*yield_task) (struct rq *rq); > + bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt); > + > #define TASK_SET_GROUP 0 > #define TASK_MOVE_GROUP 1 > > #ifdef CONFIG_FAIR_GROUP_SCHED > void (*task_change_group)(struct task_struct *p, int type); > #endif > + > + void (*task_dead)(struct task_struct *p); Why do you move yield_task, yield_to_task and task_dead here? [...]
On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote: > +/* Integer ceil-rounded range for each bucket */ > +#define UCLAMP_BUCKET_DELTA ((SCHED_CAPACITY_SCALE / UCLAMP_BUCKETS) + 1) Uhm, should that not be ((x+y-1)/y), aka. DIV_ROUND_UP(x,y)? The above would give 4 for 9/3, which is clearly buggered.
On 12-Mar 16:20, Peter Zijlstra wrote: > On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote: > > +/* Integer ceil-rounded range for each bucket */ > > +#define UCLAMP_BUCKET_DELTA ((SCHED_CAPACITY_SCALE / UCLAMP_BUCKETS) + 1) > > Uhm, should that not me ((x+y-1)/y), aka. DIV_ROUND_UP(x,y) ? Well, there is certainly some rounding to be done... > The above would give 4 for 9/3, which is clearly buggered. .. still the math above should work fine within the boundaries we define for UCLAMP_BUCKET_DELTA (5..20 groups) and considering that SCHED_CAPACITY_SCALE will never be smaller then 1024. The above is designed to shrink the topmost bucket wrt all the others but it will never be smaller than ~30%. Here are the start values computed for each bucket using the math above and the computed shrinking percentage for the topmost bucket: bukets size: 205, top bucket start@820 (err: 0.49%), buckets: {0: 0, 1: 205, 2: 410, 3: 615, 4: 820} bukets size: 171, top bucket start@855 (err: 1.17%), buckets: {0: 0, 1: 171, 2: 342, 3: 513, 4: 684, 5: 855} bukets size: 147, top bucket start@882 (err: 3.40%), buckets: {0: 0, 1: 147, 2: 294, 3: 441, 4: 588, 5: 735, 6: 882} bukets size: 129, top bucket start@903 (err: 6.20%), buckets: {0: 0, 1: 129, 2: 258, 3: 387, 4: 516, 5: 645, 6: 774, 7: 903} bukets size: 114, top bucket start@912 (err: 1.75%), buckets: {0: 0, 1: 114, 2: 228, 3: 342, 4: 456, 5: 570, 6: 684, 7: 798, 8: 912} bukets size: 103, top bucket start@927 (err: 5.83%), buckets: {0: 0, 1: 103, 2: 206, 3: 309, 4: 412, 5: 515, 6: 618, 7: 721, 8: 824, 9: 927} bukets size: 94, top bucket start@940 (err: 10.64%), buckets: {0: 0, 1: 94, 2: 188, 3: 282, 4: 376, 5: 470, 6: 564, 7: 658, 8: 752, 9: 846, 10: 940} bukets size: 86, top bucket start@946 (err: 9.30%), buckets: {0: 0, 1: 86, 2: 172, 3: 258, 4: 344, 5: 430, 6: 516, 7: 602, 8: 688, 9: 774, 10: 860, 11: 946} bukets size: 79, top bucket start@948 (err: 3.80%), buckets: {0: 0, 1: 79, 2: 158, 3: 237, 4: 316, 5: 395, 6: 474, 7: 553, 8: 632, 9: 711, 10: 790, 11: 869, 12: 948} bukets size: 74, top bucket start@962 (err: 16.22%), buckets: {0: 0, 1: 74, 2: 148, 3: 222, 4: 296, 5: 370, 6: 444, 7: 518, 8: 592, 9: 666, 10: 740, 11: 814, 12: 888, 13: 962} bukets size: 69, top bucket start@966 (err: 15.94%), buckets: {0: 0, 1: 69, 2: 138, 3: 207, 4: 276, 5: 345, 6: 414, 7: 483, 8: 552, 9: 621, 10: 690, 11: 759, 12: 828, 13: 897, 14: 966} bukets size: 65, top bucket start@975 (err: 24.62%), buckets: {0: 0, 1: 65, 2: 130, 3: 195, 4: 260, 5: 325, 6: 390, 7: 455, 8: 520, 9: 585, 10: 650, 11: 715, 12: 780, 13: 845, 14: 910, 15: 975} bukets size: 61, top bucket start@976 (err: 21.31%), buckets: {0: 0, 1: 61, 2: 122, 3: 183, 4: 244, 5: 305, 6: 366, 7: 427, 8: 488, 9: 549, 10: 610, 11: 671, 12: 732, 13: 793, 14: 854, 15: 915, 16: 976} bukets size: 57, top bucket start@969 (err: 3.51%), buckets: {0: 0, 1: 57, 2: 114, 3: 171, 4: 228, 5: 285, 6: 342, 7: 399, 8: 456, 9: 513, 10: 570, 11: 627, 12: 684, 13: 741, 14: 798, 15: 855, 16: 912, 17: 969} bukets size: 54, top bucket start@972 (err: 3.70%), buckets: {0: 0, 1: 54, 2: 108, 3: 162, 4: 216, 5: 270, 6: 324, 7: 378, 8: 432, 9: 486, 10: 540, 11: 594, 12: 648, 13: 702, 14: 756, 15: 810, 16: 864, 17: 918, 18: 972} bukets size: 52, top bucket start@988 (err: 30.77%), buckets: {0: 0, 1: 52, 2: 104, 3: 156, 4: 208, 5: 260, 6: 312, 7: 364, 8: 416, 9: 468, 10: 520, 11: 572, 12: 624, 13: 676, 14: 728, 15: 780, 16: 832, 17: 884, 18: 936, 19: 988} Does that makes sense?
On Tue, Mar 12, 2019 at 03:50:43PM +0000, Patrick Bellasi wrote: > On 12-Mar 16:20, Peter Zijlstra wrote: > > On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote: > > > +/* Integer ceil-rounded range for each bucket */ ^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > +#define UCLAMP_BUCKET_DELTA ((SCHED_CAPACITY_SCALE / UCLAMP_BUCKETS) + 1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ simply do not match. > > Uhm, should that not me ((x+y-1)/y), aka. DIV_ROUND_UP(x,y) ? > > Well, there is certainly some rounding to be done... > > > The above would give 4 for 9/3, which is clearly buggered. > > .. still the math above should work fine within the boundaries we > define for UCLAMP_BUCKET_DELTA (5..20 groups) and considering that > SCHED_CAPACITY_SCALE will never be smaller then 1024. That's a very poor reason to write utter nonsense :-) > The above is designed to shrink the topmost bucket wrt all the others > but it will never be smaller than ~30%. 30% sounds like a lot, esp. for this range. > Here are the start values computed for each bucket using the math > above and the computed shrinking percentage for the topmost bucket: If you use a regular rounding, the error is _much_ smaller: $ for ((x=5;x<21;x++)) ; do let d=(1024+x/2)/x; let s=(x-1)*d; let e=1024-s; let p=100*(d-e)/d; echo $x $d $s $e $p%; done 5 205 820 204 0% 6 171 855 169 1% 7 146 876 148 -1% 8 128 896 128 0% 9 114 912 112 1% 10 102 918 106 -3% 11 93 930 94 -1% 12 85 935 89 -4% 13 79 948 76 3% 14 73 949 75 -2% 15 68 952 72 -5% 16 64 960 64 0% 17 60 960 64 -6% 18 57 969 55 3% 19 54 972 52 3% 20 51 969 55 -7% Funnily enough, we have a helper for that too: DIV_ROUND_CLOSEST(). Now, if we go further, the error will obviously increase because we run out of precision, but even there, regular rounding will be better than either floor or ceil.
On 13-Mar 09:19, Peter Zijlstra wrote: > On Tue, Mar 12, 2019 at 03:50:43PM +0000, Patrick Bellasi wrote: > > On 12-Mar 16:20, Peter Zijlstra wrote: > > > On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote: > > > > +/* Integer ceil-rounded range for each bucket */ > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > > > +#define UCLAMP_BUCKET_DELTA ((SCHED_CAPACITY_SCALE / UCLAMP_BUCKETS) + 1) > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > simply do not match. Right, that don't match when UCLAMP_BUCKETS is a divider of SCHED_CAPACITY_SCALE, i.e. when we use 8 or 16 buckets. > > > Uhm, should that not me ((x+y-1)/y), aka. DIV_ROUND_UP(x,y) ? > > > > Well, there is certainly some rounding to be done... > > > > > The above would give 4 for 9/3, which is clearly buggered. > > > > .. still the math above should work fine within the boundaries we > > define for UCLAMP_BUCKET_DELTA (5..20 groups) and considering that > > SCHED_CAPACITY_SCALE will never be smaller then 1024. > > That's a very poor reason to write utter nonsense :-) > > > The above is designed to shrink the topmost bucket wrt all the others > > but it will never be smaller than ~30%. > > 30% sounds like a lot, esp. for this range. Well, that 30% is really just ~16 utiliation units on a scale of 1024 when buckets have a size of 52. Still, yes, we can argue that's big but that's also the same error generated by DIV_ROUND_UP() when UCLAMP_BUCKETS is not 8 or 16. > > Here are the start values computed for each bucket using the math > > above and the computed shrinking percentage for the topmost bucket: > > If you use a regular rounding, the error is _much_ smaller: > > $ for ((x=5;x<21;x++)) ; do let d=(1024+x/2)/x; let s=(x-1)*d; let e=1024-s; let p=100*(d-e)/d; echo $x $d $s $e $p%; done ^^^^^^^^^^^^^ > 5 205 820 204 0% > 6 171 855 169 1% > 7 146 876 148 -1% > 8 128 896 128 0% > 9 114 912 112 1% > 10 102 918 106 -3% > 11 93 930 94 -1% > 12 85 935 89 -4% > 13 79 948 76 3% > 14 73 949 75 -2% > 15 68 952 72 -5% > 16 64 960 64 0% > 17 60 960 64 -6% > 18 57 969 55 3% > 19 54 972 52 3% > 20 51 969 55 -7% > > Funnily enough, we have a helper for that too: DIV_ROUND_CLOSEST(). ^^^^^^^^^^^^^^^^^^^^ This is different than DIV_ROUND_UP() and actually better across the full range. > Now, if we go further, the error will obviously increase because we run > out of precision, but even there, regular rounding will be better than > either floor or ceil. I don't think we will have to cover other values in the further but I agree that this "closest rounding" is definitively better. Thanks for spotting it, will update in v8.
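For reference, the rounding options discussed above can be compared with a short standalone userspace program (not part of the patch; the DIV_ROUND_* helpers are re-defined here in their simplified kernel form, and "v7" denotes the floor-plus-one macro in the posted patch):

#include <stdio.h>

#define SCHED_CAPACITY_SCALE    1024
#define DIV_ROUND_UP(n, d)      (((n) + (d) - 1) / (d))
#define DIV_ROUND_CLOSEST(n, d) (((n) + (d) / 2) / (d))

int main(void)
{
        for (int buckets = 5; buckets <= 20; buckets++) {
                int d_v7      = SCHED_CAPACITY_SCALE / buckets + 1;
                int d_ceil    = DIV_ROUND_UP(SCHED_CAPACITY_SCALE, buckets);
                int d_closest = DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, buckets);
                /* size of the leftover topmost bucket with the "closest" delta */
                int top_closest = SCHED_CAPACITY_SCALE - (buckets - 1) * d_closest;

                printf("%2d buckets: v7=%3d ceil=%3d closest=%3d (top bucket: %3d)\n",
                       buckets, d_v7, d_ceil, d_closest, top_closest);
        }
        return 0;
}

The output matches the tables already posted in this sub-thread, e.g. 205/205/205 for 5 buckets and 52/52/51 for 20 buckets, where only the closest-rounding delta keeps the topmost bucket close to the nominal size.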
On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote: > +static inline unsigned int uclamp_bucket_id(unsigned int clamp_value) > +{ > + return clamp_value / UCLAMP_BUCKET_DELTA; > +} > + > +static inline unsigned int uclamp_bucket_value(unsigned int clamp_value) > +{ > + return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value); return clamp_value - (clamp_value % UCLAMP_BUCKET_DELTA); might generate better code; just a single division, instead of a div and mult. > +} > + > +static inline unsigned int uclamp_none(int clamp_id) > +{ > + if (clamp_id == UCLAMP_MIN) > + return 0; > + return SCHED_CAPACITY_SCALE; > +} > + > +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id) > +{ > + struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket; > + unsigned int max_value = uclamp_none(clamp_id); > + unsigned int bucket_id; > + > + /* > + * Both min and max clamps are MAX aggregated, thus the topmost > + * bucket with some tasks defines the rq's clamp value. > + */ > + bucket_id = UCLAMP_BUCKETS; > + do { > + --bucket_id; > + if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks) > + continue; > + max_value = bucket[bucket_id].value; > + break; If you flip the if condition the code will be nicer. > + } while (bucket_id); But you can also use a for loop: for (i = UCLAMP_BUCKETS-1; i>=0; i--) { if (rq->uclamp[clamp_id].bucket[i].tasks) { max_value = bucket[i].value; break; } } > + WRITE_ONCE(rq->uclamp[clamp_id].value, max_value); > +}
On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote: > +/* > + * When a task is enqueued on a rq, the clamp bucket currently defined by the > + * task's uclamp::bucket_id is reference counted on that rq. This also > + * immediately updates the rq's clamp value if required. > + * > + * Since tasks know their specific value requested from user-space, we track > + * within each bucket the maximum value for tasks refcounted in that bucket. > + * This provide a further aggregation (local clamping) which allows to track > + * within each bucket the exact "requested" clamp value whenever all tasks > + * RUNNABLE in that bucket require the same clamp. > + */ > +static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq, > + unsigned int clamp_id) > +{ > + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id; > + unsigned int rq_clamp, bkt_clamp, tsk_clamp; > + > + rq->uclamp[clamp_id].bucket[bucket_id].tasks++; > + > + /* > + * Local clamping: rq's buckets always track the max "requested" > + * clamp value from all RUNNABLE tasks in that bucket. > + */ > + tsk_clamp = p->uclamp[clamp_id].value; > + bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value; > + rq->uclamp[clamp_id].bucket[bucket_id].value = max(bkt_clamp, tsk_clamp); So, if I read this correct: - here we track a max value in a bucket, > + rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value); > + WRITE_ONCE(rq->uclamp[clamp_id].value, max(rq_clamp, tsk_clamp)); > +} > + > +/* > + * When a task is dequeued from a rq, the clamp bucket reference counted by > + * the task is released. If this is the last task reference counting the rq's > + * max active clamp value, then the rq's clamp value is updated. > + * Both the tasks reference counter and the rq's cached clamp values are > + * expected to be always valid, if we detect they are not we skip the updates, > + * enforce a consistent state and warn. > + */ > +static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq, > + unsigned int clamp_id) > +{ > + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id; > + unsigned int rq_clamp, bkt_clamp; > + > + SCHED_WARN_ON(!rq->uclamp[clamp_id].bucket[bucket_id].tasks); > + if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks)) > + rq->uclamp[clamp_id].bucket[bucket_id].tasks--; > + > + /* > + * Keep "local clamping" simple and accept to (possibly) overboost > + * still RUNNABLE tasks in the same bucket. > + */ > + if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks)) > + return; (Oh man, I hope that generates semi sane code; long live CSE passes I suppose) But we never decrement that bkt_clamp value on dequeue. > + bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value; > + > + /* The rq's clamp value is expected to always track the max */ > + rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value); > + SCHED_WARN_ON(bkt_clamp > rq_clamp); > + if (bkt_clamp >= rq_clamp) { head hurts, this reads ==, how can this ever not be so? > + /* > + * Reset rq's clamp bucket value to its nominal value whenever > + * there are anymore RUNNABLE tasks refcounting it. -ENOPARSE > + */ > + rq->uclamp[clamp_id].bucket[bucket_id].value = > + uclamp_bucket_value(rq_clamp); But basically you decrement the bucket value to the nominal value. > + uclamp_rq_update(rq, clamp_id); > + } > +} Given all that, what is to stop the bucket value to climbing to uclamp_bucket_value(+1)-1 and staying there (provided there's someone runnable)? Why are we doing this... ?
On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote: > +static void __init init_uclamp(void) > +{ > + unsigned int clamp_id; > + int cpu; > + > + for_each_possible_cpu(cpu) > + memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq)); > + Is that really needed? Doesn't rq come from .bss ?
On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote: > +static inline unsigned int uclamp_none(int clamp_id) > +{ > + if (clamp_id == UCLAMP_MIN) > + return 0; > + return SCHED_CAPACITY_SCALE; > +} > + > +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id) > +{ > + struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket; > + unsigned int max_value = uclamp_none(clamp_id); That's 1024 for uclamp_max > + unsigned int bucket_id; > + > + /* > + * Both min and max clamps are MAX aggregated, thus the topmost > + * bucket with some tasks defines the rq's clamp value. > + */ > + bucket_id = UCLAMP_BUCKETS; > + do { > + --bucket_id; > + if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks) > + continue; > + max_value = bucket[bucket_id].value; but this will then _lower_ it. That's not a MAX aggregate. > + break; > + } while (bucket_id); > + > + WRITE_ONCE(rq->uclamp[clamp_id].value, max_value); > +}
On 12-Mar 13:52, Dietmar Eggemann wrote: > On 2/8/19 11:05 AM, Patrick Bellasi wrote: > > [...] > > > +config UCLAMP_BUCKETS_COUNT > > + int "Number of supported utilization clamp buckets" > > + range 5 20 > > + default 5 > > + depends on UCLAMP_TASK > > + help > > + Defines the number of clamp buckets to use. The range of each bucket > > + will be SCHED_CAPACITY_SCALE/UCLAMP_BUCKETS_COUNT. The higher the > > + number of clamp buckets the finer their granularity and the higher > > + the precision of clamping aggregation and tracking at run-time. > > + > > + For example, with the default configuration we will have 5 clamp > > + buckets tracking 20% utilization each. A 25% boosted tasks will be > > + refcounted in the [20..39]% bucket and will set the bucket clamp > > + effective value to 25%. > > + If a second 30% boosted task should be co-scheduled on the same CPU, > > + that task will be refcounted in the same bucket of the first task and > > + it will boost the bucket clamp effective value to 30%. > > + The clamp effective value of a bucket is reset to its nominal value > > + (20% in the example above) when there are anymore tasks refcounted in > > this sounds weird. Why ? > > [...] > > > +static inline unsigned int uclamp_bucket_value(unsigned int clamp_value) > > +{ > > + return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value); > > +} > > Soemthing like uclamp_bucket_nominal_value() should be clearer. Maybe... can update it in v8 > > +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id) > > +{ > > + struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket; > > + unsigned int max_value = uclamp_none(clamp_id); > > + unsigned int bucket_id; > > unsigned int bucket_id = UCLAMP_BUCKETS; > > > + > > + /* > > + * Both min and max clamps are MAX aggregated, thus the topmost > > + * bucket with some tasks defines the rq's clamp value. > > + */ > > + bucket_id = UCLAMP_BUCKETS; > > to get rid of this line? I put it on a different line as a justfication for the loop variable initialization described in the comment above. > > > + do { > > + --bucket_id; > > + if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks) > > if (!bucket[bucket_id].tasks) Right... that's some leftover from the last refactoring! [...] > > + * within each bucket the exact "requested" clamp value whenever all tasks > > + * RUNNABLE in that bucket require the same clamp. > > + */ > > +static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq, > > + unsigned int clamp_id) > > +{ > > + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id; > > + unsigned int rq_clamp, bkt_clamp, tsk_clamp; > > Wouldn't it be easier to have a pointer to the task's and rq's uclamp > structure as well to the bucket? > > - unsigned int bucket_id = p->uclamp[clamp_id].bucket_id; > + struct uclamp_se *uc_se = &p->uclamp[clamp_id]; > + struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id]; > + struct uclamp_bucket *bucket = &uc_rq->bucket[uc_se->bucket_id]; I think I went back/forth a couple of times in using pointer or the extended version, which both have pros and cons. I personally prefer the pointers as you suggest but I've got the impression in the past that since everybody cleared "basic C trainings" it's not so difficult to read the code above too. > The code in uclamp_rq_inc_id() and uclamp_rq_dec_id() for example becomes > much more readable. Agree... let's try to switch once again in v8 and see ;) > [...] 
> > > struct sched_class { > > const struct sched_class *next; > > +#ifdef CONFIG_UCLAMP_TASK > > + int uclamp_enabled; > > +#endif > > + > > void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags); > > void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags); > > - void (*yield_task) (struct rq *rq); > > - bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt); > > void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags); > > @@ -1685,7 +1734,6 @@ struct sched_class { > > void (*set_curr_task)(struct rq *rq); > > void (*task_tick)(struct rq *rq, struct task_struct *p, int queued); > > void (*task_fork)(struct task_struct *p); > > - void (*task_dead)(struct task_struct *p); > > /* > > * The switched_from() call is allowed to drop rq->lock, therefore we > > @@ -1702,12 +1750,17 @@ struct sched_class { > > void (*update_curr)(struct rq *rq); > > + void (*yield_task) (struct rq *rq); > > + bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt); > > + > > #define TASK_SET_GROUP 0 > > #define TASK_MOVE_GROUP 1 > > #ifdef CONFIG_FAIR_GROUP_SCHED > > void (*task_change_group)(struct task_struct *p, int type); > > #endif > > + > > + void (*task_dead)(struct task_struct *p); > > Why do you move yield_task, yield_to_task and task_dead here? Since I'm adding a new field at the beginning of the struct, which is used at enqueue/dequeue time, this is to ensure that all the callbacks used in these paths are grouped together and don't fall across a cache line... but yes, that's supposed to be a micro-optimization which I can skip in this patch.
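To make the refactoring being agreed on here concrete, a minimal sketch of uclamp_rq_inc_id() rewritten with the helper pointers Dietmar proposes, keeping the v7 field names and semantics (untested, shown only to compare readability):

static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
                                    unsigned int clamp_id)
{
        struct uclamp_se *uc_se = &p->uclamp[clamp_id];
        struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
        struct uclamp_bucket *bucket = &uc_rq->bucket[uc_se->bucket_id];
        unsigned int tsk_clamp = uc_se->value;
        unsigned int bkt_clamp = bucket->value;
        unsigned int rq_clamp;

        bucket->tasks++;

        /*
         * Local clamping: the bucket always tracks the max "requested"
         * clamp value of its RUNNABLE tasks.
         */
        bucket->value = max(bkt_clamp, tsk_clamp);

        /* The rq-wide clamp value tracks the max across all buckets. */
        rq_clamp = READ_ONCE(uc_rq->value);
        WRITE_ONCE(uc_rq->value, max(rq_clamp, tsk_clamp));
}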
On 13-Mar 15:09, Peter Zijlstra wrote: > On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote: > > +static inline unsigned int uclamp_none(int clamp_id) > > +{ > > + if (clamp_id == UCLAMP_MIN) > > + return 0; > > + return SCHED_CAPACITY_SCALE; > > +} > > + > > +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id) > > +{ > > + struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket; > > + unsigned int max_value = uclamp_none(clamp_id); > > That's 1024 for uclamp_max > > > + unsigned int bucket_id; > > + > > + /* > > + * Both min and max clamps are MAX aggregated, thus the topmost > > + * bucket with some tasks defines the rq's clamp value. > > + */ > > + bucket_id = UCLAMP_BUCKETS; > > + do { > > + --bucket_id; > > + if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks) > > + continue; > > + max_value = bucket[bucket_id].value; > > but this will then _lower_ it. That's not a MAX aggregate. For uclamp_max we want max_value=1024 when there are no active tasks, which means: no max clamp enforced on CFS/RT "idle" cpus. If instead there are active RT/CFS tasks then we want the clamp value of the max group, which means: MAX aggregate active clamps. That's what the code above does and the comment says. > > + break; > > + } while (bucket_id); > > + > > + WRITE_ONCE(rq->uclamp[clamp_id].value, max_value); > > +}
On 13-Mar 15:06, Peter Zijlstra wrote: > On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote: > > +static void __init init_uclamp(void) > > +{ > > + unsigned int clamp_id; > > + int cpu; > > + > > + for_each_possible_cpu(cpu) > > + memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq)); > > + > > Is that really needed? Doesn't rq come from .bss ? Will check better... I've just assumed not since in sched_init() we already have tons of rq's related zero initializations.
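For context on the .bss question: the runqueues are static per-CPU data, defined in kernel/sched/core.c as:

        DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);

Statically defined per-CPU variables with no initializer are zero-filled just like ordinary .bss objects, so the explicit memset() should indeed be redundant as long as nothing touches the uclamp fields before init_uclamp() runs.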
On 13-Mar 14:52, Peter Zijlstra wrote: > On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote: > > +/* > > + * When a task is enqueued on a rq, the clamp bucket currently defined by the > > + * task's uclamp::bucket_id is reference counted on that rq. This also > > + * immediately updates the rq's clamp value if required. > > + * > > + * Since tasks know their specific value requested from user-space, we track > > + * within each bucket the maximum value for tasks refcounted in that bucket. > > + * This provide a further aggregation (local clamping) which allows to track > > + * within each bucket the exact "requested" clamp value whenever all tasks > > + * RUNNABLE in that bucket require the same clamp. > > + */ > > +static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq, > > + unsigned int clamp_id) > > +{ > > + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id; > > + unsigned int rq_clamp, bkt_clamp, tsk_clamp; > > + > > + rq->uclamp[clamp_id].bucket[bucket_id].tasks++; > > + > > + /* > > + * Local clamping: rq's buckets always track the max "requested" > > + * clamp value from all RUNNABLE tasks in that bucket. > > + */ > > + tsk_clamp = p->uclamp[clamp_id].value; > > + bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value; > > + rq->uclamp[clamp_id].bucket[bucket_id].value = max(bkt_clamp, tsk_clamp); > > So, if I read this correct: > > - here we track a max value in a bucket, > > > + rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value); > > + WRITE_ONCE(rq->uclamp[clamp_id].value, max(rq_clamp, tsk_clamp)); > > +} > > + > > +/* > > + * When a task is dequeued from a rq, the clamp bucket reference counted by > > + * the task is released. If this is the last task reference counting the rq's > > + * max active clamp value, then the rq's clamp value is updated. > > + * Both the tasks reference counter and the rq's cached clamp values are > > + * expected to be always valid, if we detect they are not we skip the updates, > > + * enforce a consistent state and warn. > > + */ > > +static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq, > > + unsigned int clamp_id) > > +{ > > + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id; > > + unsigned int rq_clamp, bkt_clamp; > > + > > + SCHED_WARN_ON(!rq->uclamp[clamp_id].bucket[bucket_id].tasks); > > + if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks)) > > + rq->uclamp[clamp_id].bucket[bucket_id].tasks--; > > + > > + /* > > + * Keep "local clamping" simple and accept to (possibly) overboost > > + * still RUNNABLE tasks in the same bucket. > > + */ > > + if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks)) > > + return; > > (Oh man, I hope that generates semi sane code; long live CSE passes I > suppose) What do you mean ? > But we never decrement that bkt_clamp value on dequeue. We decrement the bkt_clamp value only when the bucket becomes empty and thus we pass the condition above. That's what the comment above is there to call out. > > + bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value; > > + > > + /* The rq's clamp value is expected to always track the max */ > > + rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value); > > + SCHED_WARN_ON(bkt_clamp > rq_clamp); > > + if (bkt_clamp >= rq_clamp) { > > head hurts, this reads ==, how can this ever not be so? Never, given the current code, that's just defensive programming. If in the future the accounting should be accidentally broken by some refactoring we warn and fix the corrupted data structures at first chance. Is that so bad? 
> > + /* > > + * Reset rq's clamp bucket value to its nominal value whenever > > + * there are anymore RUNNABLE tasks refcounting it. > > -ENOPARSE That's related to the comment above, when you say we don't decrement the bkt_clamp. Because of bucketization, we potentially end up tracking tasks with different requested clamp values in the same bucket. For example, with 20% bucket size, we can have: Task1: util_min=25% Task2: util_min=35% accounted in the same bucket. This ensures that while they are both running we boost 35%. If Task1 should run longer than Task2, Task1 will be "overboosted" until its end. The bucket value will be reset to 20% (nominal value) when both tasks are idle. > > + */ > > + rq->uclamp[clamp_id].bucket[bucket_id].value = > > + uclamp_bucket_value(rq_clamp); > > But basically you decrement the bucket value to the nominal value. Yes, at this point we know there are no more tasks in this bucket and we reset its value. > > > + uclamp_rq_update(rq, clamp_id); > > + } > > +} > > Given all that, what is to stop the bucket value to climbing to > uclamp_bucket_value(+1)-1 and staying there (provided there's someone > runnable)? Nothing... but that's an expected consequence of bucketization. > Why are we doing this... ? You can either decide to: a) always boost tasks to just the bucket nominal value thus always penalizing both Task1 and Task2 of the example above b) always boost tasks to the bucket "max" value thus always overboosting both Task1 and Task2 of the example above The solution above instead has a very good property: in systems where you have only few and well defined clamp values we always provide the exact boost. For example, if your system requires only 23% and 47% boost values (totally random numbers), then you can always get the exact boost required using just 3 buckets of ~33% size each. In systems where you don't know which boost values you will have, you can still define the maximum overboost granularity you accept for each task by just tuning the number of clamp groups. For example, with 20 groups you can have a 5% max overboost.
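To make the behaviour described above easier to follow, here is the same v7 dequeue path re-expressed with the helper pointers suggested earlier in the thread (a readability sketch only; semantics unchanged from the posted patch):

static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
                                    unsigned int clamp_id)
{
        struct uclamp_se *uc_se = &p->uclamp[clamp_id];
        struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
        struct uclamp_bucket *bucket = &uc_rq->bucket[uc_se->bucket_id];
        unsigned int rq_clamp, bkt_clamp;

        SCHED_WARN_ON(!bucket->tasks);
        if (likely(bucket->tasks))
                bucket->tasks--;

        /*
         * Keep "local clamping" simple: tasks still RUNNABLE in this
         * bucket keep the (possibly over-)boosted bucket value.
         */
        if (likely(bucket->tasks))
                return;

        bkt_clamp = bucket->value;

        /* The rq's clamp value is expected to always track the max. */
        rq_clamp = READ_ONCE(uc_rq->value);
        SCHED_WARN_ON(bkt_clamp > rq_clamp);
        if (bkt_clamp >= rq_clamp) {
                /*
                 * The now-empty bucket was defining the rq-wide clamp:
                 * reset it to its nominal value and recompute the max.
                 */
                bucket->value = uclamp_bucket_value(rq_clamp);
                uclamp_rq_update(rq, clamp_id);
        }
}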
On 13-Mar 14:40, Peter Zijlstra wrote: > On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote: > > +static inline unsigned int uclamp_bucket_id(unsigned int clamp_value) > > +{ > > + return clamp_value / UCLAMP_BUCKET_DELTA; > > +} > > + > > +static inline unsigned int uclamp_bucket_value(unsigned int clamp_value) > > +{ > > + return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value); > > return clamp_value - (clamp_value % UCLAMP_BUCKET_DELTA); > > might generate better code; just a single division, instead of a div and > mult. Wondering if compilers cannot do these optimizations... but yes, looks cool and will do it in v8, thanks. > > +} > > + > > +static inline unsigned int uclamp_none(int clamp_id) > > +{ > > + if (clamp_id == UCLAMP_MIN) > > + return 0; > > + return SCHED_CAPACITY_SCALE; > > +} > > + > > +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id) > > +{ > > + struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket; > > + unsigned int max_value = uclamp_none(clamp_id); > > + unsigned int bucket_id; > > + > > + /* > > + * Both min and max clamps are MAX aggregated, thus the topmost > > + * bucket with some tasks defines the rq's clamp value. > > + */ > > + bucket_id = UCLAMP_BUCKETS; > > + do { > > + --bucket_id; > > + if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks) > > + continue; > > + max_value = bucket[bucket_id].value; > > + break; > > If you flip the if condition the code will be nicer. > > > + } while (bucket_id); > > But you can also use a for loop: > > for (i = UCLAMP_BUCKETS-1; i>=0; i--) { > if (rq->uclamp[clamp_id].bucket[i].tasks) { > max_value = bucket[i].value; > break; > } > } Yes, the for looks better, but perhaps like that: unsigned int bucket_id = UCLAMP_BUCKETS; /* * Both min and max clamps are MAX aggregated, thus the topmost * bucket with some tasks defines the rq's clamp value. */ for (; bucket_id >= 0; --bucket_id) { if (!bucket[bucket_id].tasks) continue; max_value = bucket[bucket_id].value; break; } ... just to save a {} block. > > + WRITE_ONCE(rq->uclamp[clamp_id].value, max_value); > > +}
On Wed, Mar 13, 2019 at 04:12:29PM +0000, Patrick Bellasi wrote: > Yes, the for looks better, but perhaps like that: > > unsigned int bucket_id = UCLAMP_BUCKETS; > > /* > * Both min and max clamps are MAX aggregated, thus the topmost > * bucket with some tasks defines the rq's clamp value. > */ > for (; bucket_id >= 0; --bucket_id) { GCC will be clever and figure that unsigned will never be smaller than 0 and turn the above into an infinite loop or something daft. That is; use 'int', not 'unsigned int' for bucket_id.
On 13-Mar 18:22, Peter Zijlstra wrote: > On Wed, Mar 13, 2019 at 04:12:29PM +0000, Patrick Bellasi wrote: > > Yes, the for looks better, but perhaps like that: > > > > unsigned int bucket_id = UCLAMP_BUCKETS; > > > > /* > > * Both min and max clamps are MAX aggregated, thus the topmost > > * bucket with some tasks defines the rq's clamp value. > > */ > > for (; bucket_id >= 0; --bucket_id) { > > GCC will be clever and figure that unsigned will never be smaller than 0 > and turn the above into an infinite loop or something daft. > > That is; use 'int', not 'unsigned int' for bucket_id. Right, which reminds me now why I originally went for a do { } while();
On Wed, Mar 13, 2019 at 03:59:54PM +0000, Patrick Bellasi wrote: > On 13-Mar 14:52, Peter Zijlstra wrote: > > > +static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq, > > > + unsigned int clamp_id) > > > +{ > > > + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id; > > > + unsigned int rq_clamp, bkt_clamp; > > > + > > > + SCHED_WARN_ON(!rq->uclamp[clamp_id].bucket[bucket_id].tasks); > > > + if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks)) > > > + rq->uclamp[clamp_id].bucket[bucket_id].tasks--; > > > + > > > + /* > > > + * Keep "local clamping" simple and accept to (possibly) overboost > > > + * still RUNNABLE tasks in the same bucket. > > > + */ > > > + if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks)) > > > + return; > > > > (Oh man, I hope that generates semi sane code; long live CSE passes I > > suppose) > > What do you mean ? that does: 'rq->uclamp[clamp_id].bucket[bucket_id].tasks' three times in a row. And yes the compiler _should_ dtrt, but....
On Wed, Mar 13, 2019 at 03:59:54PM +0000, Patrick Bellasi wrote: > On 13-Mar 14:52, Peter Zijlstra wrote: > Because of backetization, we potentially end up tracking tasks with > different requested clamp values in the same bucket. > > For example, with 20% bucket size, we can have: > Task1: util_min=25% > Task2: util_min=35% > accounted in the same bucket. > > Given all that, what is to stop the bucket value to climbing to > > uclamp_bucket_value(+1)-1 and staying there (provided there's someone > > runnable)? > > Nothing... but that's an expected consequence of bucketization. No, it is not. > > Why are we doing this... ? > > You can either decide to: > > a) always boost tasks to just the bucket nominal value > thus always penalizing both Task1 and Task2 of the example above This is the expected behaviour. When was the last time your histogram did something like b? > b) always boost tasks to the bucket "max" value > thus always overboosting both Task1 and Task2 of the example above > > The solution above instead has a very good property: in systems > where you have only few and well defined clamp values we always > provide the exact boost. > > For example, if your system requires only 23% and 47% boost values > (totally random numbers), then you can always get the exact boost > required using just 3 bucksts or ~33% size each. > > In systems where you don't know which boost values you will have, you > can still defined the maximum overboost granularity you accept for > each task by just tuning the number of clamp groups. For example, with > 20 groups you can have a 5% max overboost. Maybe, but this is not a direct concequence of buckets, but an additional heuristic that might work well in this case. Maybe split this out in a separate patch? So start with the trivial bucket, and then do this change on top with the above few paragraphs as changelog?
On Wed, Mar 13, 2019 at 03:23:59PM +0000, Patrick Bellasi wrote: > On 13-Mar 15:09, Peter Zijlstra wrote: > > On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote: > > > +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id) > > > +{ > > > + struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket; > > > + unsigned int max_value = uclamp_none(clamp_id); > > > > That's 1024 for uclamp_max > > > > > + unsigned int bucket_id; > > > + > > > + /* > > > + * Both min and max clamps are MAX aggregated, thus the topmost > > > + * bucket with some tasks defines the rq's clamp value. > > > + */ > > > + bucket_id = UCLAMP_BUCKETS; > > > + do { > > > + --bucket_id; > > > + if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks) > > > + continue; > > > + max_value = bucket[bucket_id].value; > > > > but this will then _lower_ it. That's not a MAX aggregate. > > For uclamp_max we want max_value=1024 when there are no active tasks, > which means: no max clamp enforced on CFS/RT "idle" cpus. > > If instead there are active RT/CFS tasks then we want the clamp value > of the max group, which means: MAX aggregate active clamps. > > That's what the code above does and the comment says. That's (obviously) not how I read it... maybe something like: static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id) { struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket; int i; /* * Since both min and max clamps are max aggregated, find the * top most bucket with tasks in. */ for (i = UCLMAP_BUCKETS-1; i>=0; i--) { if (!bucket[i].tasks) continue; return bucket[i].value; } /* No tasks -- default clamp values */ return uclamp_none(clamp_id); } would make it clearer?
On Wed, Mar 13, 2019 at 04:12:29PM +0000, Patrick Bellasi wrote: > On 13-Mar 14:40, Peter Zijlstra wrote: > > On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote: > > > +static inline unsigned int uclamp_bucket_id(unsigned int clamp_value) > > > +{ > > > + return clamp_value / UCLAMP_BUCKET_DELTA; > > > +} > > > + > > > +static inline unsigned int uclamp_bucket_value(unsigned int clamp_value) > > > +{ > > > + return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value); > > > > return clamp_value - (clamp_value % UCLAMP_BUCKET_DELTA); > > > > might generate better code; just a single division, instead of a div and > > mult. > > Wondering if compilers cannot do these optimizations... but yes, looks > cool and will do it in v8, thanks. I'd be most impressed if they pull this off. Check the generated code and see I suppose :-)
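A quick way to act on this is to compile the two formulations side by side (e.g. "gcc -O2 -S") and diff the output; a minimal standalone pair, using an illustrative delta value, could be:

#define UCLAMP_BUCKET_DELTA 205 /* illustrative value only */

unsigned int bucket_value_mul(unsigned int clamp_value)
{
        /* v7 formulation: one division plus one multiplication */
        return UCLAMP_BUCKET_DELTA * (clamp_value / UCLAMP_BUCKET_DELTA);
}

unsigned int bucket_value_mod(unsigned int clamp_value)
{
        /* Peter's formulation: a single modulo (i.e. one division) */
        return clamp_value - (clamp_value % UCLAMP_BUCKET_DELTA);
}

Since UCLAMP_BUCKET_DELTA is a compile-time constant, the compiler may well strength-reduce the division in both variants and the generated code could end up equivalent, but inspecting the assembly settles it.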
On Wed, Mar 13, 2019 at 8:15 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote: > > On 12-Mar 13:52, Dietmar Eggemann wrote: > > On 2/8/19 11:05 AM, Patrick Bellasi wrote: > > > > [...] > > > > > +config UCLAMP_BUCKETS_COUNT > > > + int "Number of supported utilization clamp buckets" > > > + range 5 20 > > > + default 5 > > > + depends on UCLAMP_TASK > > > + help > > > + Defines the number of clamp buckets to use. The range of each bucket > > > + will be SCHED_CAPACITY_SCALE/UCLAMP_BUCKETS_COUNT. The higher the > > > + number of clamp buckets the finer their granularity and the higher > > > + the precision of clamping aggregation and tracking at run-time. > > > + > > > + For example, with the default configuration we will have 5 clamp > > > + buckets tracking 20% utilization each. A 25% boosted tasks will be > > > + refcounted in the [20..39]% bucket and will set the bucket clamp > > > + effective value to 25%. > > > + If a second 30% boosted task should be co-scheduled on the same CPU, > > > + that task will be refcounted in the same bucket of the first task and > > > + it will boost the bucket clamp effective value to 30%. > > > + The clamp effective value of a bucket is reset to its nominal value > > > + (20% in the example above) when there are anymore tasks refcounted in > > > > this sounds weird. > > Why ? Should probably be "when there are no more tasks refcounted" > > > > [...] > > > > > +static inline unsigned int uclamp_bucket_value(unsigned int clamp_value) > > > +{ > > > + return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value); > > > +} > > > > Soemthing like uclamp_bucket_nominal_value() should be clearer. > > Maybe... can update it in v8 > uclamp_bucket_base_value is a little shorter, just to consider :) > > > +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id) > > > +{ > > > + struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket; > > > + unsigned int max_value = uclamp_none(clamp_id); > > > + unsigned int bucket_id; > > > > unsigned int bucket_id = UCLAMP_BUCKETS; > > > > > + > > > + /* > > > + * Both min and max clamps are MAX aggregated, thus the topmost > > > + * bucket with some tasks defines the rq's clamp value. > > > + */ > > > + bucket_id = UCLAMP_BUCKETS; > > > > to get rid of this line? > > I put it on a different line as a justfication for the loop variable > initialization described in the comment above. > > > > > > + do { > > > + --bucket_id; > > > + if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks) > > > > if (!bucket[bucket_id].tasks) > > Right... that's some leftover from the last refactoring! > > [...] > > > > + * within each bucket the exact "requested" clamp value whenever all tasks > > > + * RUNNABLE in that bucket require the same clamp. > > > + */ > > > +static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq, > > > + unsigned int clamp_id) > > > +{ > > > + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id; > > > + unsigned int rq_clamp, bkt_clamp, tsk_clamp; > > > > Wouldn't it be easier to have a pointer to the task's and rq's uclamp > > structure as well to the bucket? > > > > - unsigned int bucket_id = p->uclamp[clamp_id].bucket_id; > > + struct uclamp_se *uc_se = &p->uclamp[clamp_id]; > > + struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id]; > > + struct uclamp_bucket *bucket = &uc_rq->bucket[uc_se->bucket_id]; > > I think I went back/forth a couple of times in using pointer or the > extended version, which both have pros and cons. 
> > I personally prefer the pointers as you suggest but I've got the > impression in the past that since everybody cleared "basic C trainings" > it's not so difficult to read the code above too. > > > The code in uclamp_rq_inc_id() and uclamp_rq_dec_id() for example becomes > > much more readable. > > Agree... let's try to switch once again in v8 and see ;) > > > [...] > > > > > struct sched_class { > > > const struct sched_class *next; > > > +#ifdef CONFIG_UCLAMP_TASK > > > + int uclamp_enabled; > > > +#endif > > > + > > > void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags); > > > void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags); > > > - void (*yield_task) (struct rq *rq); > > > - bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt); > > > void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags); > > > @@ -1685,7 +1734,6 @@ struct sched_class { > > > void (*set_curr_task)(struct rq *rq); > > > void (*task_tick)(struct rq *rq, struct task_struct *p, int queued); > > > void (*task_fork)(struct task_struct *p); > > > - void (*task_dead)(struct task_struct *p); > > > /* > > > * The switched_from() call is allowed to drop rq->lock, therefore we > > > @@ -1702,12 +1750,17 @@ struct sched_class { > > > void (*update_curr)(struct rq *rq); > > > + void (*yield_task) (struct rq *rq); > > > + bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt); > > > + > > > #define TASK_SET_GROUP 0 > > > #define TASK_MOVE_GROUP 1 > > > #ifdef CONFIG_FAIR_GROUP_SCHED > > > void (*task_change_group)(struct task_struct *p, int type); > > > #endif > > > + > > > + void (*task_dead)(struct task_struct *p); > > > > Why do you move yield_task, yield_to_task and task_dead here? > > Since I'm adding a new field at the beginning of the struct, which is > used at enqueue/dequeue time, this is to ensure that all the > callbacks used in these paths are grouped together and don't fall > across a cache line... but yes, that's supposed to be a > micro-optimization which I can skip in this patch. > > -- > #include <best/regards.h> > > Patrick Bellasi
On Wed, Mar 13, 2019 at 12:46 PM Peter Zijlstra <peterz@infradead.org> wrote: > > On Wed, Mar 13, 2019 at 03:23:59PM +0000, Patrick Bellasi wrote: > > On 13-Mar 15:09, Peter Zijlstra wrote: > > > On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote: > > > > > +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id) > > > > +{ > > > > + struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket; > > > > + unsigned int max_value = uclamp_none(clamp_id); > > > > > > That's 1024 for uclamp_max > > > > > > > + unsigned int bucket_id; > > > > + > > > > + /* > > > > + * Both min and max clamps are MAX aggregated, thus the topmost > > > > + * bucket with some tasks defines the rq's clamp value. > > > > + */ > > > > + bucket_id = UCLAMP_BUCKETS; > > > > + do { > > > > + --bucket_id; > > > > + if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks) > > > > + continue; > > > > + max_value = bucket[bucket_id].value; > > > > > > but this will then _lower_ it. That's not a MAX aggregate. > > > > For uclamp_max we want max_value=1024 when there are no active tasks, > > which means: no max clamp enforced on CFS/RT "idle" cpus. > > > > If instead there are active RT/CFS tasks then we want the clamp value > > of the max group, which means: MAX aggregate active clamps. > > > > That's what the code above does and the comment says. > > That's (obviously) not how I read it... maybe something like: > > static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id) > { > struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket; > int i; > > /* > * Since both min and max clamps are max aggregated, find the > * top most bucket with tasks in. > */ > for (i = UCLMAP_BUCKETS-1; i>=0; i--) { > if (!bucket[i].tasks) > continue; > return bucket[i].value; > } > > /* No tasks -- default clamp values */ > return uclamp_none(clamp_id); > } > > would make it clearer? This way it's also more readable/obvious when it's used inside uclamp_rq_dec_id, assuming uclamp_rq_update is renamed into smth like get_max_rq_uclamp.
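Combining Peter's sketch with Suren's renaming suggestion, a compile-clean version might look like the following (the helper name is illustrative; the rq-wide write moves to a thin wrapper):

static inline unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id)
{
        struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
        int i;

        /*
         * Both min and max clamps are max aggregated: the topmost
         * non-empty bucket defines the rq's clamp value.
         */
        for (i = UCLAMP_BUCKETS - 1; i >= 0; i--) {
                if (bucket[i].tasks)
                        return bucket[i].value;
        }

        /* No RUNNABLE tasks: fall back to the default clamp value. */
        return uclamp_none(clamp_id);
}

static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
{
        WRITE_ONCE(rq->uclamp[clamp_id].value, uclamp_rq_max_value(rq, clamp_id));
}

Using a signed loop index here also sidesteps the unsigned-wraparound pitfall discussed above.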
On Wed, Mar 13, 2019 at 6:52 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote: > > +/* > > + * When a task is enqueued on a rq, the clamp bucket currently defined by the > > + * task's uclamp::bucket_id is reference counted on that rq. This also > > + * immediately updates the rq's clamp value if required. > > + * > > + * Since tasks know their specific value requested from user-space, we track > > + * within each bucket the maximum value for tasks refcounted in that bucket. > > + * This provide a further aggregation (local clamping) which allows to track > > + * within each bucket the exact "requested" clamp value whenever all tasks > > + * RUNNABLE in that bucket require the same clamp. > > + */ > > +static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq, > > + unsigned int clamp_id) > > +{ > > + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id; > > + unsigned int rq_clamp, bkt_clamp, tsk_clamp; > > + > > + rq->uclamp[clamp_id].bucket[bucket_id].tasks++; > > + > > + /* > > + * Local clamping: rq's buckets always track the max "requested" > > + * clamp value from all RUNNABLE tasks in that bucket. > > + */ > > + tsk_clamp = p->uclamp[clamp_id].value; > > + bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value; > > + rq->uclamp[clamp_id].bucket[bucket_id].value = max(bkt_clamp, tsk_clamp); > > So, if I read this correct: > > - here we track a max value in a bucket, > > > + rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value); > > + WRITE_ONCE(rq->uclamp[clamp_id].value, max(rq_clamp, tsk_clamp)); > > +} > > + > > +/* > > + * When a task is dequeued from a rq, the clamp bucket reference counted by > > + * the task is released. If this is the last task reference counting the rq's > > + * max active clamp value, then the rq's clamp value is updated. > > + * Both the tasks reference counter and the rq's cached clamp values are > > + * expected to be always valid, if we detect they are not we skip the updates, > > + * enforce a consistent state and warn. > > + */ > > +static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq, > > + unsigned int clamp_id) > > +{ > > + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id; > > + unsigned int rq_clamp, bkt_clamp; > > + > > + SCHED_WARN_ON(!rq->uclamp[clamp_id].bucket[bucket_id].tasks); > > + if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks)) > > + rq->uclamp[clamp_id].bucket[bucket_id].tasks--; > > + > > + /* > > + * Keep "local clamping" simple and accept to (possibly) overboost > > + * still RUNNABLE tasks in the same bucket. > > + */ > > + if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks)) > > + return; > > (Oh man, I hope that generates semi sane code; long live CSE passes I > suppose) > > But we never decrement that bkt_clamp value on dequeue. > > > + bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value; > > + > > + /* The rq's clamp value is expected to always track the max */ > > + rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value); > > + SCHED_WARN_ON(bkt_clamp > rq_clamp); > > + if (bkt_clamp >= rq_clamp) { > > head hurts, this reads ==, how can this ever not be so? > > > + /* > > + * Reset rq's clamp bucket value to its nominal value whenever > > + * there are anymore RUNNABLE tasks refcounting it. > > -ENOPARSE > > > + */ > > + rq->uclamp[clamp_id].bucket[bucket_id].value = > > + uclamp_bucket_value(rq_clamp); > > But basically you decrement the bucket value to the nominal value. 
> > > + uclamp_rq_update(rq, clamp_id); > > + } > > +} > > Given all that, what is to stop the bucket value to climbing to > uclamp_bucket_value(+1)-1 and staying there (provided there's someone > runnable)? > > Why are we doing this... ? I agree with Peter, this part of the patch was the hardest to read. SCHED_WARN_ON line makes sense to me. The condition that follows and the following comment are a little baffling. Condition seems to indicate that the code that follows should be executed only if we are in the top-most occupied bucket (the bucket which has tasks and has the highest uclamp value). So this bucket just lost its last task and we should update rq->uclamp[clamp_id].value. However that's not exactly what the code does... It also resets rq->uclamp[clamp_id].bucket[bucket_id].value. So if I understand correctly, unless the bucket that just lost its last task is the top-most one its value will not be reset to nominal value. That looks like a bug to me. Am I missing something? Side note: some more explanation would be very helpful.
On Fri, Feb 8, 2019 at 2:06 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote: > > Utilization clamping allows to clamp the CPU's utilization within a > [util_min, util_max] range, depending on the set of RUNNABLE tasks on > that CPU. Each task references two "clamp buckets" defining its minimum > and maximum (util_{min,max}) utilization "clamp values". A CPU's clamp > bucket is active if there is at least one RUNNABLE tasks enqueued on > that CPU and refcounting that bucket. > > When a task is {en,de}queued {on,from} a rq, the set of active clamp > buckets on that CPU can change. Since each clamp bucket enforces a > different utilization clamp value, when the set of active clamp buckets > changes, a new "aggregated" clamp value is computed for that CPU. > > Clamp values are always MAX aggregated for both util_min and util_max. > This ensures that no tasks can affect the performance of other > co-scheduled tasks which are more boosted (i.e. with higher util_min > clamp) or less capped (i.e. with higher util_max clamp). > > Each task has a: > task_struct::uclamp[clamp_id]::bucket_id > to track the "bucket index" of the CPU's clamp bucket it refcounts while > enqueued, for each clamp index (clamp_id). > > Each CPU's rq has a: > rq::uclamp[clamp_id]::bucket[bucket_id].tasks > to track how many tasks, currently RUNNABLE on that CPU, refcount each > clamp bucket (bucket_id) of a clamp index (clamp_id). > > Each CPU's rq has also a: > rq::uclamp[clamp_id]::bucket[bucket_id].value > to track the clamp value of each clamp bucket (bucket_id) of a clamp > index (clamp_id). > > The rq::uclamp::bucket[clamp_id][] array is scanned every time we need > to find a new MAX aggregated clamp value for a clamp_id. This operation > is required only when we dequeue the last task of a clamp bucket > tracking the current MAX aggregated clamp value. In these cases, the CPU > is either entering IDLE or going to schedule a less boosted or more > clamped task. > The expected number of different clamp values, configured at build time, > is small enough to fit the full unordered array into a single cache > line. I assume you are talking about "struct uclamp_rq uclamp[UCLAMP_CNT]" here. uclamp_rq size depends on UCLAMP_BUCKETS configurable to be up to 20. sizeof(long)*20 is already more than 64 bytes. What am I missing? > Add the basic data structures required to refcount, in each CPU's rq, > the number of RUNNABLE tasks for each clamp bucket. Add also the max > aggregation required to update the rq's clamp value at each > enqueue/dequeue event. > > Use a simple linear mapping of clamp values into clamp buckets. > Pre-compute and cache bucket_id to avoid integer divisions at > enqueue/dequeue time. > > Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com> > Cc: Ingo Molnar <mingo@redhat.com> > Cc: Peter Zijlstra <peterz@infradead.org> > > --- > Changes in v7: > Message-ID: <20190123191007.GG17749@hirez.programming.kicks-ass.net> > - removed buckets mapping code > - use a simpler linear mapping of clamp values into buckets > Message-ID: <20190124161443.lv2pw5fsspyelckq@e110439-lin> > - move this patch at the beginning of the series, > in the attempt to make the overall series easier to digest by moving > at the very beginning the core bits and main data structures > Others: > - update the mapping logic to use exactly and only > UCLAMP_BUCKETS_COUNT buckets, i.e. 
no more "special" bucket > - update uclamp_rq_update() to do top-bottom max search > --- > include/linux/log2.h | 37 ++++++++ > include/linux/sched.h | 39 ++++++++ > include/linux/sched/topology.h | 6 -- > init/Kconfig | 53 +++++++++++ > kernel/sched/core.c | 165 +++++++++++++++++++++++++++++++++ > kernel/sched/sched.h | 59 +++++++++++- > 6 files changed, 350 insertions(+), 9 deletions(-) > > diff --git a/include/linux/log2.h b/include/linux/log2.h > index 2af7f77866d0..e2db25734532 100644 > --- a/include/linux/log2.h > +++ b/include/linux/log2.h > @@ -224,4 +224,41 @@ int __order_base_2(unsigned long n) > ilog2((n) - 1) + 1) : \ > __order_base_2(n) \ > ) > + > +static inline __attribute__((const)) > +int __bits_per(unsigned long n) > +{ > + if (n < 2) > + return 1; > + if (is_power_of_2(n)) > + return order_base_2(n) + 1; > + return order_base_2(n); > +} > + > +/** > + * bits_per - calculate the number of bits required for the argument > + * @n: parameter > + * > + * This is constant-capable and can be used for compile time > + * initiaizations, e.g bitfields. > + * > + * The first few values calculated by this routine: > + * bf(0) = 1 > + * bf(1) = 1 > + * bf(2) = 2 > + * bf(3) = 2 > + * bf(4) = 3 > + * ... and so on. > + */ > +#define bits_per(n) \ > +( \ > + __builtin_constant_p(n) ? ( \ > + ((n) == 0 || (n) == 1) ? 1 : ( \ > + ((n) & (n - 1)) == 0 ? \ > + ilog2((n) - 1) + 2 : \ > + ilog2((n) - 1) + 1 \ > + ) \ > + ) : \ > + __bits_per(n) \ > +) > #endif /* _LINUX_LOG2_H */ > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 4112639c2a85..45460e7a3eee 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -281,6 +281,18 @@ struct vtime { > u64 gtime; > }; > > +/* > + * Utilization clamp constraints. > + * @UCLAMP_MIN: Minimum utilization > + * @UCLAMP_MAX: Maximum utilization > + * @UCLAMP_CNT: Utilization clamp constraints count > + */ > +enum uclamp_id { > + UCLAMP_MIN = 0, > + UCLAMP_MAX, > + UCLAMP_CNT > +}; > + > struct sched_info { > #ifdef CONFIG_SCHED_INFO > /* Cumulative counters: */ > @@ -312,6 +324,10 @@ struct sched_info { > # define SCHED_FIXEDPOINT_SHIFT 10 > # define SCHED_FIXEDPOINT_SCALE (1L << SCHED_FIXEDPOINT_SHIFT) > > +/* Increase resolution of cpu_capacity calculations */ > +# define SCHED_CAPACITY_SHIFT SCHED_FIXEDPOINT_SHIFT > +# define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT) > + > struct load_weight { > unsigned long weight; > u32 inv_weight; > @@ -560,6 +576,25 @@ struct sched_dl_entity { > struct hrtimer inactive_timer; > }; > > +#ifdef CONFIG_UCLAMP_TASK > +/* Number of utilization clamp buckets (shorter alias) */ > +#define UCLAMP_BUCKETS CONFIG_UCLAMP_BUCKETS_COUNT > + > +/* > + * Utilization clamp for a scheduling entity > + * @value: clamp value "requested" by a se > + * @bucket_id: clamp bucket corresponding to the "requested" value > + * > + * The bucket_id is the index of the clamp bucket matching the clamp value > + * which is pre-computed and stored to avoid expensive integer divisions from > + * the fast path. 
> + */ > +struct uclamp_se { > + unsigned int value : bits_per(SCHED_CAPACITY_SCALE); > + unsigned int bucket_id : bits_per(UCLAMP_BUCKETS); > +}; > +#endif /* CONFIG_UCLAMP_TASK */ > + > union rcu_special { > struct { > u8 blocked; > @@ -640,6 +675,10 @@ struct task_struct { > #endif > struct sched_dl_entity dl; > > +#ifdef CONFIG_UCLAMP_TASK > + struct uclamp_se uclamp[UCLAMP_CNT]; > +#endif > + > #ifdef CONFIG_PREEMPT_NOTIFIERS > /* List of struct preempt_notifier: */ > struct hlist_head preempt_notifiers; > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h > index c31d3a47a47c..04beadac6985 100644 > --- a/include/linux/sched/topology.h > +++ b/include/linux/sched/topology.h > @@ -6,12 +6,6 @@ > > #include <linux/sched/idle.h> > > -/* > - * Increase resolution of cpu_capacity calculations > - */ > -#define SCHED_CAPACITY_SHIFT SCHED_FIXEDPOINT_SHIFT > -#define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT) > - > /* > * sched-domains (multiprocessor balancing) declarations: > */ > diff --git a/init/Kconfig b/init/Kconfig > index 513fa544a134..34e23d5d95d1 100644 > --- a/init/Kconfig > +++ b/init/Kconfig > @@ -640,6 +640,59 @@ config HAVE_UNSTABLE_SCHED_CLOCK > config GENERIC_SCHED_CLOCK > bool > > +menu "Scheduler features" > + > +config UCLAMP_TASK > + bool "Enable utilization clamping for RT/FAIR tasks" > + depends on CPU_FREQ_GOV_SCHEDUTIL > + help > + This feature enables the scheduler to track the clamped utilization > + of each CPU based on RUNNABLE tasks scheduled on that CPU. > + > + With this option, the user can specify the min and max CPU > + utilization allowed for RUNNABLE tasks. The max utilization defines > + the maximum frequency a task should use while the min utilization > + defines the minimum frequency it should use. > + > + Both min and max utilization clamp values are hints to the scheduler, > + aiming at improving its frequency selection policy, but they do not > + enforce or grant any specific bandwidth for tasks. > + > + If in doubt, say N. > + > +config UCLAMP_BUCKETS_COUNT > + int "Number of supported utilization clamp buckets" > + range 5 20 > + default 5 > + depends on UCLAMP_TASK > + help > + Defines the number of clamp buckets to use. The range of each bucket > + will be SCHED_CAPACITY_SCALE/UCLAMP_BUCKETS_COUNT. The higher the > + number of clamp buckets the finer their granularity and the higher > + the precision of clamping aggregation and tracking at run-time. > + > + For example, with the default configuration we will have 5 clamp > + buckets tracking 20% utilization each. A 25% boosted tasks will be > + refcounted in the [20..39]% bucket and will set the bucket clamp > + effective value to 25%. > + If a second 30% boosted task should be co-scheduled on the same CPU, > + that task will be refcounted in the same bucket of the first task and > + it will boost the bucket clamp effective value to 30%. > + The clamp effective value of a bucket is reset to its nominal value > + (20% in the example above) when there are anymore tasks refcounted in > + that bucket. > + > + An additional boost/capping margin can be added to some tasks. In the > + example above the 25% task will be boosted to 30% until it exits the > + CPU. If that should be considered not acceptable on certain systems, > + it's always possible to reduce the margin by increasing the number of > + clamp buckets to trade off used memory for run-time tracking > + precision. > + > + If in doubt, use the default value. 
> + > +endmenu > + > # > # For architectures that want to enable the support for NUMA-affine scheduler > # balancing logic: > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index ec1b67a195cc..8ecf5470058c 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -719,6 +719,167 @@ static void set_load_weight(struct task_struct *p, bool update_load) > } > } > > +#ifdef CONFIG_UCLAMP_TASK > + > +/* Integer ceil-rounded range for each bucket */ > +#define UCLAMP_BUCKET_DELTA ((SCHED_CAPACITY_SCALE / UCLAMP_BUCKETS) + 1) > + > +static inline unsigned int uclamp_bucket_id(unsigned int clamp_value) > +{ > + return clamp_value / UCLAMP_BUCKET_DELTA; > +} > + > +static inline unsigned int uclamp_bucket_value(unsigned int clamp_value) > +{ > + return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value); > +} > + > +static inline unsigned int uclamp_none(int clamp_id) > +{ > + if (clamp_id == UCLAMP_MIN) > + return 0; > + return SCHED_CAPACITY_SCALE; > +} > + > +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id) > +{ > + struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket; > + unsigned int max_value = uclamp_none(clamp_id); > + unsigned int bucket_id; > + > + /* > + * Both min and max clamps are MAX aggregated, thus the topmost > + * bucket with some tasks defines the rq's clamp value. > + */ > + bucket_id = UCLAMP_BUCKETS; > + do { > + --bucket_id; > + if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks) > + continue; > + max_value = bucket[bucket_id].value; > + break; > + } while (bucket_id); > + > + WRITE_ONCE(rq->uclamp[clamp_id].value, max_value); > +} > + > +/* > + * When a task is enqueued on a rq, the clamp bucket currently defined by the > + * task's uclamp::bucket_id is reference counted on that rq. This also > + * immediately updates the rq's clamp value if required. > + * > + * Since tasks know their specific value requested from user-space, we track > + * within each bucket the maximum value for tasks refcounted in that bucket. > + * This provide a further aggregation (local clamping) which allows to track > + * within each bucket the exact "requested" clamp value whenever all tasks > + * RUNNABLE in that bucket require the same clamp. > + */ > +static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq, > + unsigned int clamp_id) > +{ > + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id; > + unsigned int rq_clamp, bkt_clamp, tsk_clamp; > + > + rq->uclamp[clamp_id].bucket[bucket_id].tasks++; > + > + /* > + * Local clamping: rq's buckets always track the max "requested" > + * clamp value from all RUNNABLE tasks in that bucket. > + */ > + tsk_clamp = p->uclamp[clamp_id].value; > + bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value; > + rq->uclamp[clamp_id].bucket[bucket_id].value = max(bkt_clamp, tsk_clamp); > + > + rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value); > + WRITE_ONCE(rq->uclamp[clamp_id].value, max(rq_clamp, tsk_clamp)); > +} > + > +/* > + * When a task is dequeued from a rq, the clamp bucket reference counted by > + * the task is released. If this is the last task reference counting the rq's > + * max active clamp value, then the rq's clamp value is updated. > + * Both the tasks reference counter and the rq's cached clamp values are > + * expected to be always valid, if we detect they are not we skip the updates, > + * enforce a consistent state and warn. 
> + */ > +static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq, > + unsigned int clamp_id) > +{ > + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id; > + unsigned int rq_clamp, bkt_clamp; > + > + SCHED_WARN_ON(!rq->uclamp[clamp_id].bucket[bucket_id].tasks); > + if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks)) > + rq->uclamp[clamp_id].bucket[bucket_id].tasks--; > + > + /* > + * Keep "local clamping" simple and accept to (possibly) overboost > + * still RUNNABLE tasks in the same bucket. > + */ > + if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks)) > + return; > + bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value; > + > + /* The rq's clamp value is expected to always track the max */ > + rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value); > + SCHED_WARN_ON(bkt_clamp > rq_clamp); > + if (bkt_clamp >= rq_clamp) { > + /* > + * Reset rq's clamp bucket value to its nominal value whenever > + * there are anymore RUNNABLE tasks refcounting it. > + */ > + rq->uclamp[clamp_id].bucket[bucket_id].value = > + uclamp_bucket_value(rq_clamp); > + uclamp_rq_update(rq, clamp_id); > + } > +} > + > +static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) > +{ > + unsigned int clamp_id; > + > + if (unlikely(!p->sched_class->uclamp_enabled)) > + return; > + > + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) > + uclamp_rq_inc_id(p, rq, clamp_id); > +} > + > +static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) > +{ > + unsigned int clamp_id; > + > + if (unlikely(!p->sched_class->uclamp_enabled)) > + return; > + > + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) > + uclamp_rq_dec_id(p, rq, clamp_id); > +} > + > +static void __init init_uclamp(void) > +{ > + unsigned int clamp_id; > + int cpu; > + > + for_each_possible_cpu(cpu) > + memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq)); > + > + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) { > + unsigned int clamp_value = uclamp_none(clamp_id); > + unsigned int bucket_id = uclamp_bucket_id(clamp_value); > + struct uclamp_se *uc_se = &init_task.uclamp[clamp_id]; > + > + uc_se->bucket_id = bucket_id; > + uc_se->value = clamp_value; > + } > +} > + > +#else /* CONFIG_UCLAMP_TASK */ > +static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { } > +static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { } > +static inline void init_uclamp(void) { } > +#endif /* CONFIG_UCLAMP_TASK */ > + > static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags) > { > if (!(flags & ENQUEUE_NOCLOCK)) > @@ -729,6 +890,7 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags) > psi_enqueue(p, flags & ENQUEUE_WAKEUP); > } > > + uclamp_rq_inc(rq, p); > p->sched_class->enqueue_task(rq, p, flags); > } > > @@ -742,6 +904,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags) > psi_dequeue(p, flags & DEQUEUE_SLEEP); > } > > + uclamp_rq_dec(rq, p); > p->sched_class->dequeue_task(rq, p, flags); > } > > @@ -6075,6 +6238,8 @@ void __init sched_init(void) > > psi_init(); > > + init_uclamp(); > + > scheduler_running = 1; > } > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index c688ef5012e5..ea9e28723946 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -797,6 +797,48 @@ extern void rto_push_irq_work_func(struct irq_work *work); > #endif > #endif /* CONFIG_SMP */ > > +#ifdef CONFIG_UCLAMP_TASK > +/* > + * struct uclamp_bucket - Utilization clamp bucket 
> + * @value: utilization clamp value for tasks on this clamp bucket > + * @tasks: number of RUNNABLE tasks on this clamp bucket > + * > + * Keep track of how many tasks are RUNNABLE for a given utilization > + * clamp value. > + */ > +struct uclamp_bucket { > + unsigned long value : bits_per(SCHED_CAPACITY_SCALE); > + unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE); > +}; > + > +/* > + * struct uclamp_rq - rq's utilization clamp > + * @value: currently active clamp values for a rq > + * @bucket: utilization clamp buckets affecting a rq > + * > + * Keep track of RUNNABLE tasks on a rq to aggregate their clamp values. > + * A clamp value is affecting a rq when there is at least one task RUNNABLE > + * (or actually running) with that value. > + * > + * We have up to UCLAMP_CNT possible different clamp values, which are > + * currently only two: minmum utilization and maximum utilization. > + * > + * All utilization clamping values are MAX aggregated, since: > + * - for util_min: we want to run the CPU at least at the max of the minimum > + * utilization required by its currently RUNNABLE tasks. > + * - for util_max: we want to allow the CPU to run up to the max of the > + * maximum utilization allowed by its currently RUNNABLE tasks. > + * > + * Since on each system we expect only a limited number of different > + * utilization clamp values (UCLAMP_BUCKETS), we use a simple array to track > + * the metrics required to compute all the per-rq utilization clamp values. > + */ > +struct uclamp_rq { > + unsigned int value; > + struct uclamp_bucket bucket[UCLAMP_BUCKETS]; > +}; > +#endif /* CONFIG_UCLAMP_TASK */ > + > /* > * This is the main, per-CPU runqueue data structure. > * > @@ -835,6 +877,11 @@ struct rq { > unsigned long nr_load_updates; > u64 nr_switches; > > +#ifdef CONFIG_UCLAMP_TASK > + /* Utilization clamp values based on CPU's RUNNABLE tasks */ > + struct uclamp_rq uclamp[UCLAMP_CNT] ____cacheline_aligned; > +#endif > + > struct cfs_rq cfs; > struct rt_rq rt; > struct dl_rq dl; > @@ -1649,10 +1696,12 @@ extern const u32 sched_prio_to_wmult[40]; > struct sched_class { > const struct sched_class *next; > > +#ifdef CONFIG_UCLAMP_TASK > + int uclamp_enabled; > +#endif > + > void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags); > void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags); > - void (*yield_task) (struct rq *rq); > - bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt); > > void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags); > > @@ -1685,7 +1734,6 @@ struct sched_class { > void (*set_curr_task)(struct rq *rq); > void (*task_tick)(struct rq *rq, struct task_struct *p, int queued); > void (*task_fork)(struct task_struct *p); > - void (*task_dead)(struct task_struct *p); > > /* > * The switched_from() call is allowed to drop rq->lock, therefore we > @@ -1702,12 +1750,17 @@ struct sched_class { > > void (*update_curr)(struct rq *rq); > > + void (*yield_task) (struct rq *rq); > + bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt); > + > #define TASK_SET_GROUP 0 > #define TASK_MOVE_GROUP 1 > > #ifdef CONFIG_FAIR_GROUP_SCHED > void (*task_change_group)(struct task_struct *p, int type); > #endif > + > + void (*task_dead)(struct task_struct *p); > }; > > static inline void put_prev_task(struct rq *rq, struct task_struct *prev) > -- > 2.20.1 >
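As a side note on the mapping above: a minimal userspace sketch of the two helpers quoted from the patch (assuming SCHED_CAPACITY_SCALE=1024 and the default 5 buckets, so UCLAMP_BUCKET_DELTA works out to 205) reproduces the Kconfig example, with both the 25% (256) and the 30% (307) requests landing in bucket_id=1, whose nominal value is 205, roughly 20%:

---8<---
/* build: gcc -O2 uclamp_map.c -o uclamp_map */
#include <stdio.h>

#define SCHED_CAPACITY_SCALE    1024    /* 1L << SCHED_FIXEDPOINT_SHIFT */
#define UCLAMP_BUCKETS          5       /* default CONFIG_UCLAMP_BUCKETS_COUNT */

/* Integer ceil-rounded range for each bucket, as in the patch */
#define UCLAMP_BUCKET_DELTA     ((SCHED_CAPACITY_SCALE / UCLAMP_BUCKETS) + 1)

static unsigned int uclamp_bucket_id(unsigned int clamp_value)
{
        return clamp_value / UCLAMP_BUCKET_DELTA;
}

static unsigned int uclamp_bucket_value(unsigned int clamp_value)
{
        return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
}

int main(void)
{
        /* 25% and 30% of SCHED_CAPACITY_SCALE, plus the two extremes */
        unsigned int samples[] = { 256, 307, 0, SCHED_CAPACITY_SCALE - 1 };
        unsigned int i;

        for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
                printf("clamp=%4u -> bucket_id=%u nominal=%u\n",
                       samples[i],
                       uclamp_bucket_id(samples[i]),
                       uclamp_bucket_value(samples[i]));
        }
        return 0;
}
---8<---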
On 13-Mar 20:30, Peter Zijlstra wrote: > On Wed, Mar 13, 2019 at 03:59:54PM +0000, Patrick Bellasi wrote: > > On 13-Mar 14:52, Peter Zijlstra wrote: > > > > +static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq, > > > > + unsigned int clamp_id) > > > > +{ > > > > + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id; > > > > + unsigned int rq_clamp, bkt_clamp; > > > > + > > > > + SCHED_WARN_ON(!rq->uclamp[clamp_id].bucket[bucket_id].tasks); > > > > + if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks)) > > > > + rq->uclamp[clamp_id].bucket[bucket_id].tasks--; > > > > + > > > > + /* > > > > + * Keep "local clamping" simple and accept to (possibly) overboost > > > > + * still RUNNABLE tasks in the same bucket. > > > > + */ > > > > + if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks)) > > > > + return; > > > > > > (Oh man, I hope that generates semi sane code; long live CSE passes I > > > suppose) > > > > What do you mean ? > > that does: 'rq->uclamp[clamp_id].bucket[bucket_id].tasks' three times in > a row. And yes the compiler _should_ dtrt, but.... Sorry, don't follow you here... but it's an interesting point. :) The code above becomes: if (__builtin_expect(!!(rq->uclamp[clamp_id].bucket[bucket_id].tasks), 1)) return; Are you referring to the resolution of the memory references, i.e 1) rq->uclamp 2) rq->uclamp[clamp_id] 3) rq->uclamp[clamp_id].bucket[bucket_id] ? By playing with: https://godbolt.org/z/OPLpyR I can see that this simplified version: ---8<--- #define BUCKETS 5 #define CLAMPS 2 struct uclamp { unsigned int value; struct bucket { unsigned int value; unsigned int tasks; } bucket[BUCKETS]; }; struct rq { struct uclamp uclamp[CLAMPS]; }; void uclamp_rq_dec_id(struct rq *rq, int clamp_id, int bucket_id) { if (__builtin_expect(!!(rq->uclamp[clamp_id].bucket[bucket_id].tasks), 1)) return; rq->uclamp[clamp_id].bucket[bucket_id].tasks--; } ---8<--- generates something like: ---8<--- uclamp_rq_dec_id: sxtw x1, w1 add x3, x1, x1, lsl 1 lsl x3, x3, 2 sub x3, x3, x1 lsl x3, x3, 2 add x2, x3, x2, sxtw 3 add x0, x0, x2 ldr w1, [x0, 8] cbz w1, .L4 ret .L4: mov w1, -1 str w1, [x0, 8] ret ---8<--- which looks "sane" and quite expected, isn't it?
On 13-Mar 20:39, Peter Zijlstra wrote: > On Wed, Mar 13, 2019 at 03:59:54PM +0000, Patrick Bellasi wrote: > > On 13-Mar 14:52, Peter Zijlstra wrote: > > > Because of backetization, we potentially end up tracking tasks with > > different requested clamp values in the same bucket. > > > > For example, with 20% bucket size, we can have: > > Task1: util_min=25% > > Task2: util_min=35% > > accounted in the same bucket. > > > > Given all that, what is to stop the bucket value to climbing to > > > uclamp_bucket_value(+1)-1 and staying there (provided there's someone > > > runnable)? > > > > Nothing... but that's an expected consequence of bucketization. > > No, it is not. > > > > Why are we doing this... ? > > > > You can either decide to: > > > > a) always boost tasks to just the bucket nominal value > > thus always penalizing both Task1 and Task2 of the example above > > This is the expected behaviour. When was the last time your histogram > did something like b? Right, I see what you mean... strictly speaking histograms always do a floor aggregation. > > b) always boost tasks to the bucket "max" value > > thus always overboosting both Task1 and Task2 of the example above > > > > The solution above instead has a very good property: in systems > > where you have only few and well defined clamp values we always > > provide the exact boost. > > > > For example, if your system requires only 23% and 47% boost values > > (totally random numbers), then you can always get the exact boost > > required using just 3 bucksts or ~33% size each. > > > > In systems where you don't know which boost values you will have, you > > can still defined the maximum overboost granularity you accept for > > each task by just tuning the number of clamp groups. For example, with > > 20 groups you can have a 5% max overboost. > > Maybe, but this is not a direct concequence of buckets, but an > additional heuristic that might work well in this case. Right... that's the point. We started with mapping to be able to track exact clamp values. Then we switched to linear mapping to remove the complexity of mapping, but we would like to still have the possibility to track exact numbers whenever possible. > Maybe split this out in a separate patch? So start with the trivial > bucket, and then do this change on top with the above few paragraphs as > changelog? That's doable, otherwise maybe we can just add the above paragraphs to the changelog of this patch. But give your comment above I assume you prefer to split it out... just let me know otherwise.
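To make the trade-off discussed above concrete, here is a toy model (userspace sketch: single CPU, single clamp index, no locking, 1024 scale and 5 buckets assumed, reset-to-nominal taken as the intended behaviour): the bucket tracks the max value requested by the tasks currently refcounted in it, so a co-scheduled 25% and 30% pair over-boosts the 25% task to 30%, and the bucket falls back to its nominal ~20% only once it empties.

---8<---
#include <stdio.h>

#define SCALE   1024
#define BUCKETS 5
#define DELTA   ((SCALE / BUCKETS) + 1)

/* toy rq state: one clamp index, refcount checks omitted */
static struct { unsigned int tasks, value; } bucket[BUCKETS];

static void enq(unsigned int req)
{
        unsigned int id = req / DELTA;

        bucket[id].tasks++;
        if (req > bucket[id].value)     /* local max: exact value when tasks agree */
                bucket[id].value = req;
}

static void deq(unsigned int req)
{
        unsigned int id = req / DELTA;

        if (--bucket[id].tasks)
                return;                 /* possibly keep over-boosting the others */
        bucket[id].value = DELTA * id;  /* empty: back to the bucket nominal value */
}

int main(void)
{
        enq(256);       /* 25% boosted task */
        printf("after enq(256): value=%u\n", bucket[1].value);  /* 256: exact */
        enq(307);       /* 30% boosted task, same bucket */
        printf("after enq(307): value=%u\n", bucket[1].value);  /* 307 */
        deq(307);
        printf("after deq(307): value=%u\n", bucket[1].value);  /* 307: over-boost */
        deq(256);
        printf("after deq(256): value=%u\n", bucket[1].value);  /* 205: nominal */
        return 0;
}
---8<---

With more buckets the worst-case over-boost shrinks accordingly (DELTA drops from ~20% at 5 buckets to ~5% at 20), which is the memory versus precision trade-off the Kconfig help text describes.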
On 13-Mar 20:46, Peter Zijlstra wrote: > On Wed, Mar 13, 2019 at 03:23:59PM +0000, Patrick Bellasi wrote: > > On 13-Mar 15:09, Peter Zijlstra wrote: > > > On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote: > > > > > +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id) > > > > +{ > > > > + struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket; > > > > + unsigned int max_value = uclamp_none(clamp_id); > > > > > > That's 1024 for uclamp_max > > > > > > > + unsigned int bucket_id; > > > > + > > > > + /* > > > > + * Both min and max clamps are MAX aggregated, thus the topmost > > > > + * bucket with some tasks defines the rq's clamp value. > > > > + */ > > > > + bucket_id = UCLAMP_BUCKETS; > > > > + do { > > > > + --bucket_id; > > > > + if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks) > > > > + continue; > > > > + max_value = bucket[bucket_id].value; > > > > > > but this will then _lower_ it. That's not a MAX aggregate. > > > > For uclamp_max we want max_value=1024 when there are no active tasks, > > which means: no max clamp enforced on CFS/RT "idle" cpus. > > > > If instead there are active RT/CFS tasks then we want the clamp value > > of the max group, which means: MAX aggregate active clamps. > > > > That's what the code above does and the comment says. > > That's (obviously) not how I read it.... maybe something like: > > static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id) > { > struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket; > int i; > > /* > * Since both min and max clamps are max aggregated, find the > * top most bucket with tasks in. > */ > for (i = UCLMAP_BUCKETS-1; i>=0; i--) { > if (!bucket[i].tasks) > continue; > return bucket[i].value; > } > > /* No tasks -- default clamp value */ > return uclamp_none(clamp_id); > } > > would make it clearer? Fine for me, I'll then change the name in something else since that's not more an "_update" by moving the WRITE_ONCE into the caller.
On 13-Mar 20:48, Peter Zijlstra wrote: > On Wed, Mar 13, 2019 at 04:12:29PM +0000, Patrick Bellasi wrote: > > On 13-Mar 14:40, Peter Zijlstra wrote: > > > On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote: > > > > +static inline unsigned int uclamp_bucket_id(unsigned int clamp_value) > > > > +{ > > > > + return clamp_value / UCLAMP_BUCKET_DELTA; > > > > +} > > > > + > > > > +static inline unsigned int uclamp_bucket_value(unsigned int clamp_value) > > > > +{ > > > > + return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value); > > > > > > return clamp_value - (clamp_value % UCLAMP_BUCKET_DELTA); > > > > > > might generate better code; just a single division, instead of a div and > > > mult. > > > > Wondering if compilers cannot do these optimizations... but yes, looks > > cool and will do it in v8, thanks. > > I'd be most impressed if they pull this off. Check the generated code > and see I suppose :-) On x86 the code generated looks exactly the same: https://godbolt.org/z/PjmA7k While on on arm64 it seems the difference boils down to: - one single "mul" instruction vs - two instructions: "sub" _plus_ one "multiply subtract" https://godbolt.org/z/0shU0S So, if I din't get something wrong... perhaps the original version is even better, isn't it? Test code: ---8<--- #define UCLAMP_BUCKET_DELTA 52 static inline unsigned int uclamp_bucket_id(unsigned int clamp_value) { return clamp_value / UCLAMP_BUCKET_DELTA; } static inline unsigned int uclamp_bucket_value1(unsigned int clamp_value) { return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value); } static inline unsigned int uclamp_bucket_value2(unsigned int clamp_value) { return clamp_value - (clamp_value % UCLAMP_BUCKET_DELTA); } int test1(int argc, char *argv[]) { return uclamp_bucket_value1(argc); } int test2(int argc, char *argv[]) { return uclamp_bucket_value2(argc); } int test3(int argc, char *argv[]) { return uclamp_bucket_value1(argc) - uclamp_bucket_value2(argc); } ---8<--- which gives on arm64: ---8<--- test1: mov w1, 60495 movk w1, 0x4ec4, lsl 16 umull x0, w0, w1 lsr x0, x0, 36 mov w1, 52 mul w0, w0, w1 ret test2: mov w1, 60495 movk w1, 0x4ec4, lsl 16 umull x1, w0, w1 lsr x1, x1, 36 mov w2, 52 msub w1, w1, w2, w0 sub w0, w0, w1 ret test3: mov w0, 0 ret ---8<---
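For completeness, the two formulations only differ in generated code, not in result: for unsigned values, d * (v / d) == v - (v % d). A throwaway check over the 0..1024 range the scheduler actually uses, assuming the same DELTA=52 as in the godbolt test above:

---8<---
#include <assert.h>

#define UCLAMP_BUCKET_DELTA 52  /* same constant as in the test code above */

int main(void)
{
        unsigned int v;

        /* clamp values never exceed SCHED_CAPACITY_SCALE (1024) */
        for (v = 0; v <= 1024; v++) {
                assert(UCLAMP_BUCKET_DELTA * (v / UCLAMP_BUCKET_DELTA) ==
                       v - (v % UCLAMP_BUCKET_DELTA));
        }
        return 0;
}
---8<---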
On 13-Mar 14:08, Suren Baghdasaryan wrote: > On Wed, Mar 13, 2019 at 12:46 PM Peter Zijlstra <peterz@infradead.org> wrote: > > > > On Wed, Mar 13, 2019 at 03:23:59PM +0000, Patrick Bellasi wrote: > > > On 13-Mar 15:09, Peter Zijlstra wrote: > > > > On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote: > > > > > > > +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id) > > > > > +{ > > > > > + struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket; > > > > > + unsigned int max_value = uclamp_none(clamp_id); > > > > > > > > That's 1024 for uclamp_max > > > > > > > > > + unsigned int bucket_id; > > > > > + > > > > > + /* > > > > > + * Both min and max clamps are MAX aggregated, thus the topmost > > > > > + * bucket with some tasks defines the rq's clamp value. > > > > > + */ > > > > > + bucket_id = UCLAMP_BUCKETS; > > > > > + do { > > > > > + --bucket_id; > > > > > + if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks) > > > > > + continue; > > > > > + max_value = bucket[bucket_id].value; > > > > > > > > but this will then _lower_ it. That's not a MAX aggregate. > > > > > > For uclamp_max we want max_value=1024 when there are no active tasks, > > > which means: no max clamp enforced on CFS/RT "idle" cpus. > > > > > > If instead there are active RT/CFS tasks then we want the clamp value > > > of the max group, which means: MAX aggregate active clamps. > > > > > > That's what the code above does and the comment says. > > > > That's (obviously) not how I read it... maybe something like: > > > > static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id) > > { > > struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket; > > int i; > > > > /* > > * Since both min and max clamps are max aggregated, find the > > * top most bucket with tasks in. > > */ > > for (i = UCLMAP_BUCKETS-1; i>=0; i--) { > > if (!bucket[i].tasks) > > continue; > > return bucket[i].value; > > } > > > > /* No tasks -- default clamp values */ > > return uclamp_none(clamp_id); > > } > > > > would make it clearer? > > This way it's also more readable/obvious when it's used inside > uclamp_rq_dec_id, assuming uclamp_rq_update is renamed into smth like > get_max_rq_uclamp. Rightm, I have now something like that: ---8<--- static inline unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id) { struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket; int bucket_id; /* * Since both min and max clamps are max aggregated, find the * top most bucket with tasks in. */ for (bucket_id = UCLMAP_BUCKETS-1; bucket_id >= 0; bucket_id--) { if (!bucket[bucket_id].tasks) continue; return bucket[bucket_id].value; } /* No tasks -- default clamp value */ return uclamp_none(clamp_id); } static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq, unsigned int clamp_id) { //... if (bucket->value >= rq_clamp) { /* * Reset rq's clamp bucket value to its nominal value whenever * there are anymore RUNNABLE tasks refcounting it. */ bucket->value = uclamp_bucket_nominal_value(rq_clamp); WRITE_ONCE(uc_rq->value, uclamp_rq_max_value(rq, clamp_id)); } } ---8<---
On 13-Mar 14:23, Suren Baghdasaryan wrote: > On Wed, Mar 13, 2019 at 6:52 AM Peter Zijlstra <peterz@infradead.org> wrote: > > > > On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote: > > > +/* > > > + * When a task is enqueued on a rq, the clamp bucket currently defined by the > > > + * task's uclamp::bucket_id is reference counted on that rq. This also > > > + * immediately updates the rq's clamp value if required. > > > + * > > > + * Since tasks know their specific value requested from user-space, we track > > > + * within each bucket the maximum value for tasks refcounted in that bucket. > > > + * This provide a further aggregation (local clamping) which allows to track > > > + * within each bucket the exact "requested" clamp value whenever all tasks > > > + * RUNNABLE in that bucket require the same clamp. > > > + */ > > > +static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq, > > > + unsigned int clamp_id) > > > +{ > > > + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id; > > > + unsigned int rq_clamp, bkt_clamp, tsk_clamp; > > > + > > > + rq->uclamp[clamp_id].bucket[bucket_id].tasks++; > > > + > > > + /* > > > + * Local clamping: rq's buckets always track the max "requested" > > > + * clamp value from all RUNNABLE tasks in that bucket. > > > + */ > > > + tsk_clamp = p->uclamp[clamp_id].value; > > > + bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value; > > > + rq->uclamp[clamp_id].bucket[bucket_id].value = max(bkt_clamp, tsk_clamp); > > > > So, if I read this correct: > > > > - here we track a max value in a bucket, > > > > > + rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value); > > > + WRITE_ONCE(rq->uclamp[clamp_id].value, max(rq_clamp, tsk_clamp)); > > > +} > > > + > > > +/* > > > + * When a task is dequeued from a rq, the clamp bucket reference counted by > > > + * the task is released. If this is the last task reference counting the rq's > > > + * max active clamp value, then the rq's clamp value is updated. > > > + * Both the tasks reference counter and the rq's cached clamp values are > > > + * expected to be always valid, if we detect they are not we skip the updates, > > > + * enforce a consistent state and warn. > > > + */ > > > +static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq, > > > + unsigned int clamp_id) > > > +{ > > > + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id; > > > + unsigned int rq_clamp, bkt_clamp; > > > + > > > + SCHED_WARN_ON(!rq->uclamp[clamp_id].bucket[bucket_id].tasks); > > > + if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks)) > > > + rq->uclamp[clamp_id].bucket[bucket_id].tasks--; > > > + > > > + /* > > > + * Keep "local clamping" simple and accept to (possibly) overboost > > > + * still RUNNABLE tasks in the same bucket. > > > + */ > > > + if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks)) > > > + return; > > > > (Oh man, I hope that generates semi sane code; long live CSE passes I > > suppose) > > > > But we never decrement that bkt_clamp value on dequeue. > > > > > + bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value; > > > + > > > + /* The rq's clamp value is expected to always track the max */ > > > + rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value); > > > + SCHED_WARN_ON(bkt_clamp > rq_clamp); > > > + if (bkt_clamp >= rq_clamp) { > > > > head hurts, this reads ==, how can this ever not be so? > > > > > + /* > > > + * Reset rq's clamp bucket value to its nominal value whenever > > > + * there are anymore RUNNABLE tasks refcounting it. 
> > > > -ENOPARSE > > > > > + */ > > > + rq->uclamp[clamp_id].bucket[bucket_id].value = > > > + uclamp_bucket_value(rq_clamp); > > > > But basically you decrement the bucket value to the nominal value. > > > > > + uclamp_rq_update(rq, clamp_id); > > > + } > > > +} > > > > Given all that, what is to stop the bucket value to climbing to > > uclamp_bucket_value(+1)-1 and staying there (provided there's someone > > runnable)? > > > > Why are we doing this... ? > > I agree with Peter, this part of the patch was the hardest to read. > SCHED_WARN_ON line makes sense to me. The condition that follows and > the following comment are a little baffling. Condition seems to > indicate that the code that follows should be executed only if we are > in the top-most occupied bucket (the bucket which has tasks and has > the highest uclamp value). > So this bucket just lost its last task and we should update > rq->uclamp[clamp_id].value. Right. > However that's not exactly what the code does... It also resets > rq->uclamp[clamp_id].bucket[bucket_id].value. Right... > So if I understand correctly, unless the bucket that just lost its > last task is the top-most one its value will not be reset to nominal > value. That looks like a bug to me. Am I missing something? ... and I think you've got a point here! The reset to nominal value line should be done unconditionally. I'll move it outside its current block. Thanks for spotting it. > Side note: some more explanation would be very helpful. Will move that "bucket local max" management code into a separate patch as suggested by Peter. Hopefully that should make the logic more clear and allows me to add some notes in the changelog.
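Putting the agreed fix in context, one possible shape of the dequeue path with the reset moved out of the conditional block is sketched below; this is only an illustration built from the helpers already quoted in the patch, not the updated patch itself:

---8<---
/*
 * Sketch (for discussion, not a posted patch) of uclamp_rq_dec_id() with
 * the bucket reset made unconditional: an emptied bucket always falls
 * back to its nominal value, while the more expensive rq-wide rescan
 * still runs only when the emptied bucket was the one defining the
 * current rq clamp value.
 */
static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
                                    unsigned int clamp_id)
{
        unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
        unsigned int rq_clamp, bkt_clamp;

        SCHED_WARN_ON(!rq->uclamp[clamp_id].bucket[bucket_id].tasks);
        if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
                rq->uclamp[clamp_id].bucket[bucket_id].tasks--;

        /* Keep (possibly) over-boosting tasks still RUNNABLE in the bucket */
        if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
                return;

        /*
         * Unconditional reset: an empty bucket must never keep a stale
         * "local max". Any value refcounted in this bucket maps back to
         * the bucket's nominal value, i.e. UCLAMP_BUCKET_DELTA * bucket_id.
         */
        bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value;
        rq->uclamp[clamp_id].bucket[bucket_id].value =
                uclamp_bucket_value(bkt_clamp);

        /* Only the bucket tracking the current max forces a rescan */
        rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value);
        if (bkt_clamp >= rq_clamp)
                uclamp_rq_update(rq, clamp_id);
}
---8<---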
On Thu, Mar 14, 2019 at 11:03:30AM +0000, Patrick Bellasi wrote: > void uclamp_rq_dec_id(struct rq *rq, int clamp_id, int bucket_id) { > if (__builtin_expect(!!(rq->uclamp[clamp_id].bucket[bucket_id].tasks), 1)) > return; > rq->uclamp[clamp_id].bucket[bucket_id].tasks--; > } > ---8<--- > > generates something like: > > ---8<--- > uclamp_rq_dec_id: > sxtw x1, w1 > add x3, x1, x1, lsl 1 > lsl x3, x3, 2 > sub x3, x3, x1 > lsl x3, x3, 2 > add x2, x3, x2, sxtw 3 > add x0, x0, x2 > ldr w1, [x0, 8] > cbz w1, .L4 > ret > .L4: > mov w1, -1 > str w1, [x0, 8] > ret > ---8<--- > > > which looks "sane" and quite expected, isn't it? Yep, thanks! Sometimes I worry about silly things.
On Thu, Mar 14, 2019 at 12:13:15PM +0000, Patrick Bellasi wrote: > > I'd be most impressed if they pull this off. Check the generated code > > and see I suppose :-) > > On x86 the code generated looks exactly the same: > > https://godbolt.org/z/PjmA7k Argh, they do mult by inverse to avoid the division, and can do this because its a constant. And then yes, your arm version looks worse. It does what I expected with -Os, but as Rutland said the other day, that stands for Optimize for Sadness.
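For the record, the arm64 listing a few messages up shows exactly that trick: w1 is built as 0x4EC4EC4F (60495 plus 0x4ec4 shifted left by 16), which is ceil(2^36 / 52), so the division turns into a multiply and a 36-bit shift. A throwaway userspace check of that identity over the 0..1024 range the scheduler cares about:

---8<---
#include <assert.h>
#include <stdint.h>

int main(void)
{
        uint64_t v;

        /* 0x4EC4EC4F == ceil(2^36 / 52): divide by multiply-and-shift */
        for (v = 0; v <= 1024; v++)
                assert(v / 52 == (v * 0x4EC4EC4FULL) >> 36);
        return 0;
}
---8<---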
On 13-Mar 14:32, Suren Baghdasaryan wrote: > On Fri, Feb 8, 2019 at 2:06 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote: > > > > Utilization clamping allows to clamp the CPU's utilization within a > > [util_min, util_max] range, depending on the set of RUNNABLE tasks on > > that CPU. Each task references two "clamp buckets" defining its minimum > > and maximum (util_{min,max}) utilization "clamp values". A CPU's clamp > > bucket is active if there is at least one RUNNABLE tasks enqueued on > > that CPU and refcounting that bucket. > > > > When a task is {en,de}queued {on,from} a rq, the set of active clamp > > buckets on that CPU can change. Since each clamp bucket enforces a > > different utilization clamp value, when the set of active clamp buckets > > changes, a new "aggregated" clamp value is computed for that CPU. > > > > Clamp values are always MAX aggregated for both util_min and util_max. > > This ensures that no tasks can affect the performance of other > > co-scheduled tasks which are more boosted (i.e. with higher util_min > > clamp) or less capped (i.e. with higher util_max clamp). > > > > Each task has a: > > task_struct::uclamp[clamp_id]::bucket_id > > to track the "bucket index" of the CPU's clamp bucket it refcounts while > > enqueued, for each clamp index (clamp_id). > > > > Each CPU's rq has a: > > rq::uclamp[clamp_id]::bucket[bucket_id].tasks > > to track how many tasks, currently RUNNABLE on that CPU, refcount each > > clamp bucket (bucket_id) of a clamp index (clamp_id). > > > > Each CPU's rq has also a: > > rq::uclamp[clamp_id]::bucket[bucket_id].value > > to track the clamp value of each clamp bucket (bucket_id) of a clamp > > index (clamp_id). > > > > The rq::uclamp::bucket[clamp_id][] array is scanned every time we need > > to find a new MAX aggregated clamp value for a clamp_id. This operation > > is required only when we dequeue the last task of a clamp bucket > > tracking the current MAX aggregated clamp value. In these cases, the CPU > > is either entering IDLE or going to schedule a less boosted or more > > clamped task. > > The expected number of different clamp values, configured at build time, > > is small enough to fit the full unordered array into a single cache > > line. > > I assume you are talking about "struct uclamp_rq uclamp[UCLAMP_CNT]" > here. No, I'm talking about the rq::uclamp::bucket[clamp_id][], which is an array of: struct uclamp_bucket { unsigned long value : bits_per(SCHED_CAPACITY_SCALE); unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE); }; defined as part of: struct uclamp_rq { unsigned int value; struct uclamp_bucket bucket[UCLAMP_BUCKETS]; }; So, it's an array of UCLAMP_BUCKETS (value, tasks) pairs. > uclamp_rq size depends on UCLAMP_BUCKETS configurable to be up > to 20. sizeof(long)*20 is already more than 64 bytes. What am I > missing? Right, the comment above refers to the default configuration, which is 5 buckets. With that configuration we have: $> pahole kernel/sched/core.o ---8<--- struct uclamp_bucket { long unsigned int value:11; /* 0:53 8 */ long unsigned int tasks:53; /* 0: 0 8 */ /* size: 8, cachelines: 1, members: 2 */ /* last cacheline: 8 bytes */ }; struct uclamp_rq { unsigned int value; /* 0 4 */ /* XXX 4 bytes hole, try to pack */ struct uclamp_bucket bucket[5]; /* 8 40 */ /* size: 48, cachelines: 1, members: 2 */ /* sum members: 44, holes: 1, sum holes: 4 */ /* last cacheline: 48 bytes */ }; struct rq { // ... 
/* --- cacheline 2 boundary (128 bytes) --- */ struct uclamp_rq uclamp[2]; /* 128 96 */ /* --- cacheline 3 boundary (192 bytes) was 32 bytes ago --- */ // ... }; ---8<--- Where you see the array fits into a single cache line. Actually I notice now that, since when we removed the bucket dedicated to the default values, we now have some spare space and we can probably increase the default (and minimum) value of UCLAMP_BUCKETS to be 7. This will uses two full cache lines in struct rq, one for each clamp index... Although 7 it's a bit of a odd number and gives by default buckets of ~14% size instead of the ~20%. Thoughts ? [...]
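The pahole numbers are easy to reproduce from userspace when playing with different bucket counts; a mirror of the two structs (assuming 64-bit longs and default alignment) prints 8 bytes per bucket and 48 bytes per uclamp_rq with the default 5 buckets, and bumping UCLAMP_BUCKETS to 7 gives exactly 64 bytes, i.e. one cache line per clamp index as mentioned above:

---8<---
#include <stdio.h>

#define UCLAMP_BUCKETS  5       /* try 7 or 20 here */

struct uclamp_bucket {
        unsigned long value : 11;       /* bits_per(SCHED_CAPACITY_SCALE) */
        unsigned long tasks : 64 - 11;
};

struct uclamp_rq {
        unsigned int value;
        struct uclamp_bucket bucket[UCLAMP_BUCKETS];
};

int main(void)
{
        printf("uclamp_bucket: %3zu bytes\n", sizeof(struct uclamp_bucket));
        printf("uclamp_rq:     %3zu bytes with %d buckets\n",
               sizeof(struct uclamp_rq), UCLAMP_BUCKETS);
        return 0;
}
---8<---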
On 13-Mar 14:01, Suren Baghdasaryan wrote: > On Wed, Mar 13, 2019 at 8:15 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote: > > > > On 12-Mar 13:52, Dietmar Eggemann wrote: > > > On 2/8/19 11:05 AM, Patrick Bellasi wrote: [...] > > > > +static inline unsigned int uclamp_bucket_value(unsigned int clamp_value) > > > > +{ > > > > + return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value); > > > > +} > > > > > > Soemthing like uclamp_bucket_nominal_value() should be clearer. > > > > Maybe... can update it in v8 > > > > uclamp_bucket_base_value is a little shorter, just to consider :) Right, I also like it better ;)
On 13-Mar 15:15, Patrick Bellasi wrote: > On 12-Mar 13:52, Dietmar Eggemann wrote: > > On 2/8/19 11:05 AM, Patrick Bellasi wrote: [...] > > > + * within each bucket the exact "requested" clamp value whenever all tasks > > > + * RUNNABLE in that bucket require the same clamp. > > > + */ > > > +static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq, > > > + unsigned int clamp_id) > > > +{ > > > + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id; > > > + unsigned int rq_clamp, bkt_clamp, tsk_clamp; > > > > Wouldn't it be easier to have a pointer to the task's and rq's uclamp > > structure as well to the bucket? > > > > - unsigned int bucket_id = p->uclamp[clamp_id].bucket_id; > > + struct uclamp_se *uc_se = &p->uclamp[clamp_id]; > > + struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id]; > > + struct uclamp_bucket *bucket = &uc_rq->bucket[uc_se->bucket_id]; > > I think I went back/forth a couple of times in using pointer or the > extended version, which both have pros and cons. > > I personally prefer the pointers as you suggest but I've got the > impression in the past that since everybody cleared "basic C trainings" > it's not so difficult to read the code above too. > > > The code in uclamp_rq_inc_id() and uclamp_rq_dec_id() for example becomes > > much more readable. > > Agree... let's try to switch once again in v8 and see ;) This is not as straightforward as I thought. We either still need local variables to use with max(), which does not play well with bitfields values, or we have to avoid using it and go back to conditional updates: ---8<--- static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq, unsigned int clamp_id) { struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id]; struct uclamp_req *uc_se = &p->uclamp_se[clamp_id]; struct uclamp_bucket *bucket = &uc_rq->bucket[uc_se->bucket_id]; bucket->tasks++; /* * Local clamping: rq's buckets always track the max "requested" * clamp value from all RUNNABLE tasks in that bucket. */ if (uc_se->value > bucket->value) bucket->value = uc_se->value; if (uc_se->value > READ_ONCE(uc_rq->value)) WRITE_ONCE(uc_rq->value, uc_se->value); } ---8<--- I remember Peter asking for max() in one of the past reviews.. but the code above looks simpler to me too... let see if this time it can be accepted. :)
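For anyone wondering what "does not play well with bitfields" means here: a typeof()-based max(), like the kernel one, cannot take a bit-field lvalue as an argument (GCC rejects typeof applied to a bit-field), so the bucket fields have to be copied into plain locals first. A userspace sketch with a simplified stand-in macro, not the kernel's max():

---8<---
#include <stdio.h>

struct bucket {
        unsigned long value : 11;
        unsigned long tasks : 53;
};

/* simplified stand-in for the kernel's typeof()-based max() */
#define max_like(x, y) ({               \
        __typeof__(x) _x = (x);         \
        __typeof__(y) _y = (y);         \
        _x > _y ? _x : _y;              \
})

static unsigned int demo(struct bucket *bucket, unsigned int tsk_clamp)
{
        unsigned int bkt_clamp = bucket->value; /* plain local copy */

        /*
         * max_like(bucket->value, tsk_clamp) would not build here:
         * GCC rejects typeof applied to a bit-field.
         */
        return max_like(bkt_clamp, tsk_clamp);
}

int main(void)
{
        struct bucket b = { .value = 256 };

        printf("%u\n", demo(&b, 307)); /* prints 307 */
        return 0;
}
---8<---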
On 14-Mar 14:32, Peter Zijlstra wrote: > On Thu, Mar 14, 2019 at 12:13:15PM +0000, Patrick Bellasi wrote: > > > I'd be most impressed if they pull this off. Check the generated code > > > and see I suppose :-) > > > > On x86 the code generated looks exactly the same: > > > > https://godbolt.org/z/PjmA7k > > Argh, they do mult by inverse to avoid the division, and can do this > because its a constant. > > And then yes, your arm version looks worse. your "arm version" is worse than x86, or "your version" is worse? IOW, should I keep the code as the original? Do you prefer your version? Or... we don't really care... > It does what I expected with -Os, but as Rutland said the other day, > that stands for Optimize for Sadness. Yes, I guess we cannot optimize for all flags... however, just let me know what you prefer and I'll put that version in ;)
On Thu, Mar 14, 2019 at 7:46 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote: > > On 13-Mar 14:32, Suren Baghdasaryan wrote: > > On Fri, Feb 8, 2019 at 2:06 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote: > > > > > > Utilization clamping allows to clamp the CPU's utilization within a > > > [util_min, util_max] range, depending on the set of RUNNABLE tasks on > > > that CPU. Each task references two "clamp buckets" defining its minimum > > > and maximum (util_{min,max}) utilization "clamp values". A CPU's clamp > > > bucket is active if there is at least one RUNNABLE tasks enqueued on > > > that CPU and refcounting that bucket. > > > > > > When a task is {en,de}queued {on,from} a rq, the set of active clamp > > > buckets on that CPU can change. Since each clamp bucket enforces a > > > different utilization clamp value, when the set of active clamp buckets > > > changes, a new "aggregated" clamp value is computed for that CPU. > > > > > > Clamp values are always MAX aggregated for both util_min and util_max. > > > This ensures that no tasks can affect the performance of other > > > co-scheduled tasks which are more boosted (i.e. with higher util_min > > > clamp) or less capped (i.e. with higher util_max clamp). > > > > > > Each task has a: > > > task_struct::uclamp[clamp_id]::bucket_id > > > to track the "bucket index" of the CPU's clamp bucket it refcounts while > > > enqueued, for each clamp index (clamp_id). > > > > > > Each CPU's rq has a: > > > rq::uclamp[clamp_id]::bucket[bucket_id].tasks > > > to track how many tasks, currently RUNNABLE on that CPU, refcount each > > > clamp bucket (bucket_id) of a clamp index (clamp_id). > > > > > > Each CPU's rq has also a: > > > rq::uclamp[clamp_id]::bucket[bucket_id].value > > > to track the clamp value of each clamp bucket (bucket_id) of a clamp > > > index (clamp_id). > > > > > > The rq::uclamp::bucket[clamp_id][] array is scanned every time we need > > > to find a new MAX aggregated clamp value for a clamp_id. This operation > > > is required only when we dequeue the last task of a clamp bucket > > > tracking the current MAX aggregated clamp value. In these cases, the CPU > > > is either entering IDLE or going to schedule a less boosted or more > > > clamped task. > > > The expected number of different clamp values, configured at build time, > > > is small enough to fit the full unordered array into a single cache > > > line. > > > > I assume you are talking about "struct uclamp_rq uclamp[UCLAMP_CNT]" > > here. > > No, I'm talking about the rq::uclamp::bucket[clamp_id][], which is an > array of: > > struct uclamp_bucket { > unsigned long value : bits_per(SCHED_CAPACITY_SCALE); > unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE); > }; > > defined as part of: > > struct uclamp_rq { > unsigned int value; > struct uclamp_bucket bucket[UCLAMP_BUCKETS]; > }; > > > So, it's an array of UCLAMP_BUCKETS (value, tasks) pairs. > > > uclamp_rq size depends on UCLAMP_BUCKETS configurable to be up > > to 20. sizeof(long)*20 is already more than 64 bytes. What am I > > missing? > > Right, the comment above refers to the default configuration, which is > 5 buckets. 
With that configuration we have: > > > $> pahole kernel/sched/core.o > > ---8<--- > struct uclamp_bucket { > long unsigned int value:11; /* 0:53 8 */ > long unsigned int tasks:53; /* 0: 0 8 */ > > /* size: 8, cachelines: 1, members: 2 */ > /* last cacheline: 8 bytes */ > }; > > struct uclamp_rq { > unsigned int value; /* 0 4 */ > > /* XXX 4 bytes hole, try to pack */ > > struct uclamp_bucket bucket[5]; /* 8 40 */ > > /* size: 48, cachelines: 1, members: 2 */ > /* sum members: 44, holes: 1, sum holes: 4 */ > /* last cacheline: 48 bytes */ > }; > > struct rq { > // ... > /* --- cacheline 2 boundary (128 bytes) --- */ > struct uclamp_rq uclamp[2]; /* 128 96 */ > /* --- cacheline 3 boundary (192 bytes) was 32 bytes ago --- */ > // ... > }; > ---8<--- > > Where you see the array fits into a single cache line. > > Actually I notice now that, since when we removed the bucket dedicated > to the default values, we now have some spare space and we can > probably increase the default (and minimum) value of UCLAMP_BUCKETS to > be 7. > > This will uses two full cache lines in struct rq, one for each clamp > index... Although 7 it's a bit of a odd number and gives by default > buckets of ~14% size instead of the ~20%. > > Thoughts ? Got it. From reading the documentation at the beginning my impression was that whatever value I choose within allowed 5-20 range it would still fit in a cache line. To disambiguate it might be worse mentioning that this is true for the default value or for values up to 7. Thanks! > [...] > > -- > #include <best/regards.h> > > Patrick Bellasi
On 14-Mar 08:29, Suren Baghdasaryan wrote: > On Thu, Mar 14, 2019 at 7:46 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote: > > On 13-Mar 14:32, Suren Baghdasaryan wrote: > > > On Fri, Feb 8, 2019 at 2:06 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote: [...] > > > > The rq::uclamp::bucket[clamp_id][] array is scanned every time we need > > > > to find a new MAX aggregated clamp value for a clamp_id. This operation > > > > is required only when we dequeue the last task of a clamp bucket > > > > tracking the current MAX aggregated clamp value. In these cases, the CPU > > > > is either entering IDLE or going to schedule a less boosted or more > > > > clamped task. The following: > > > > The expected number of different clamp values, configured at build time, > > > > is small enough to fit the full unordered array into a single cache > > > > line. will read: The expected number of different clamp values, configured at build time, is small enough to fit the full unordered array into a single cache line for the default UCLAMP_BUCKETS configuration of 7 buckets. [...] > Got it. From reading the documentation at the beginning my impression > was that whatever value I choose within allowed 5-20 range it would > still fit in a cache line. To disambiguate it might be worse > mentioning that this is true for the default value or for values up to > 7. Thanks! Right, I hope the above proposed change helps to clarify that.
On Thu, Mar 14, 2019 at 8:41 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote: > > On 14-Mar 08:29, Suren Baghdasaryan wrote: > > On Thu, Mar 14, 2019 at 7:46 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote: > > > On 13-Mar 14:32, Suren Baghdasaryan wrote: > > > > On Fri, Feb 8, 2019 at 2:06 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote: > > [...] > > > > > > The rq::uclamp::bucket[clamp_id][] array is scanned every time we need > > > > > to find a new MAX aggregated clamp value for a clamp_id. This operation > > > > > is required only when we dequeue the last task of a clamp bucket > > > > > tracking the current MAX aggregated clamp value. In these cases, the CPU > > > > > is either entering IDLE or going to schedule a less boosted or more > > > > > clamped task. > > The following: > > > > > > The expected number of different clamp values, configured at build time, > > > > > is small enough to fit the full unordered array into a single cache > > > > > line. > > will read: > > The expected number of different clamp values, configured at build time, > is small enough to fit the full unordered array into a single cache > line for the default UCLAMP_BUCKETS configuration of 7 buckets. I think keeping default to be 5 is good. As you mentioned it's a nice round number and keeping it at the minimum also hints that this is not a free resource and the more buckets you use the more you pay. Documentation might say "to fit the full unordered array into a single cache line for configurations of up to 7 buckets". > [...] > > > Got it. From reading the documentation at the beginning my impression > > was that whatever value I choose within allowed 5-20 range it would > > still fit in a cache line. To disambiguate it might be worse > > mentioning that this is true for the default value or for values up to > > 7. Thanks! > > Right, I hope the above proposed change helps to clarify that. > > -- > #include <best/regards.h> > > Patrick Bellasi
On Thu, Mar 14, 2019 at 03:07:53PM +0000, Patrick Bellasi wrote: > On 14-Mar 14:32, Peter Zijlstra wrote: > > On Thu, Mar 14, 2019 at 12:13:15PM +0000, Patrick Bellasi wrote: > > > > I'd be most impressed if they pull this off. Check the generated code > > > > and see I suppose :-) > > > > > > On x86 the code generated looks exactly the same: > > > > > > https://godbolt.org/z/PjmA7k > > > > Argh, they do mult by inverse to avoid the division, and can do this > > because its a constant. > > > > And then yes, your arm version looks worse. > > your "arm version" is worse than x86, or "your version" is worse? > > IOW, should I keep the code as the original? Do you prefer your > version? Or... we don't really care... Yeah, keep the original, it didn't matter on x86 and arm regressed with my version. > > It does what I expected with -Os, but as Rutland said the other day, > > that stands for Optimize for Sadness. > > Yes, I guess we cannot optimize for all flags... however, just let me > know what you prefer and I'll put that version in ;) Yeah, don't bother optimizing for -Os, it generally creates atrocious crap, hence Rutland calling it 'Optimize for Sadness'.
diff --git a/include/linux/log2.h b/include/linux/log2.h index 2af7f77866d0..e2db25734532 100644 --- a/include/linux/log2.h +++ b/include/linux/log2.h @@ -224,4 +224,41 @@ int __order_base_2(unsigned long n) ilog2((n) - 1) + 1) : \ __order_base_2(n) \ ) + +static inline __attribute__((const)) +int __bits_per(unsigned long n) +{ + if (n < 2) + return 1; + if (is_power_of_2(n)) + return order_base_2(n) + 1; + return order_base_2(n); +} + +/** + * bits_per - calculate the number of bits required for the argument + * @n: parameter + * + * This is constant-capable and can be used for compile time + * initiaizations, e.g bitfields. + * + * The first few values calculated by this routine: + * bf(0) = 1 + * bf(1) = 1 + * bf(2) = 2 + * bf(3) = 2 + * bf(4) = 3 + * ... and so on. + */ +#define bits_per(n) \ +( \ + __builtin_constant_p(n) ? ( \ + ((n) == 0 || (n) == 1) ? 1 : ( \ + ((n) & (n - 1)) == 0 ? \ + ilog2((n) - 1) + 2 : \ + ilog2((n) - 1) + 1 \ + ) \ + ) : \ + __bits_per(n) \ +) #endif /* _LINUX_LOG2_H */ diff --git a/include/linux/sched.h b/include/linux/sched.h index 4112639c2a85..45460e7a3eee 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -281,6 +281,18 @@ struct vtime { u64 gtime; }; +/* + * Utilization clamp constraints. + * @UCLAMP_MIN: Minimum utilization + * @UCLAMP_MAX: Maximum utilization + * @UCLAMP_CNT: Utilization clamp constraints count + */ +enum uclamp_id { + UCLAMP_MIN = 0, + UCLAMP_MAX, + UCLAMP_CNT +}; + struct sched_info { #ifdef CONFIG_SCHED_INFO /* Cumulative counters: */ @@ -312,6 +324,10 @@ struct sched_info { # define SCHED_FIXEDPOINT_SHIFT 10 # define SCHED_FIXEDPOINT_SCALE (1L << SCHED_FIXEDPOINT_SHIFT) +/* Increase resolution of cpu_capacity calculations */ +# define SCHED_CAPACITY_SHIFT SCHED_FIXEDPOINT_SHIFT +# define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT) + struct load_weight { unsigned long weight; u32 inv_weight; @@ -560,6 +576,25 @@ struct sched_dl_entity { struct hrtimer inactive_timer; }; +#ifdef CONFIG_UCLAMP_TASK +/* Number of utilization clamp buckets (shorter alias) */ +#define UCLAMP_BUCKETS CONFIG_UCLAMP_BUCKETS_COUNT + +/* + * Utilization clamp for a scheduling entity + * @value: clamp value "requested" by a se + * @bucket_id: clamp bucket corresponding to the "requested" value + * + * The bucket_id is the index of the clamp bucket matching the clamp value + * which is pre-computed and stored to avoid expensive integer divisions from + * the fast path. 
+ */ +struct uclamp_se { + unsigned int value : bits_per(SCHED_CAPACITY_SCALE); + unsigned int bucket_id : bits_per(UCLAMP_BUCKETS); +}; +#endif /* CONFIG_UCLAMP_TASK */ + union rcu_special { struct { u8 blocked; @@ -640,6 +675,10 @@ struct task_struct { #endif struct sched_dl_entity dl; +#ifdef CONFIG_UCLAMP_TASK + struct uclamp_se uclamp[UCLAMP_CNT]; +#endif + #ifdef CONFIG_PREEMPT_NOTIFIERS /* List of struct preempt_notifier: */ struct hlist_head preempt_notifiers; diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h index c31d3a47a47c..04beadac6985 100644 --- a/include/linux/sched/topology.h +++ b/include/linux/sched/topology.h @@ -6,12 +6,6 @@ #include <linux/sched/idle.h> -/* - * Increase resolution of cpu_capacity calculations - */ -#define SCHED_CAPACITY_SHIFT SCHED_FIXEDPOINT_SHIFT -#define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT) - /* * sched-domains (multiprocessor balancing) declarations: */ diff --git a/init/Kconfig b/init/Kconfig index 513fa544a134..34e23d5d95d1 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -640,6 +640,59 @@ config HAVE_UNSTABLE_SCHED_CLOCK config GENERIC_SCHED_CLOCK bool +menu "Scheduler features" + +config UCLAMP_TASK + bool "Enable utilization clamping for RT/FAIR tasks" + depends on CPU_FREQ_GOV_SCHEDUTIL + help + This feature enables the scheduler to track the clamped utilization + of each CPU based on RUNNABLE tasks scheduled on that CPU. + + With this option, the user can specify the min and max CPU + utilization allowed for RUNNABLE tasks. The max utilization defines + the maximum frequency a task should use while the min utilization + defines the minimum frequency it should use. + + Both min and max utilization clamp values are hints to the scheduler, + aiming at improving its frequency selection policy, but they do not + enforce or grant any specific bandwidth for tasks. + + If in doubt, say N. + +config UCLAMP_BUCKETS_COUNT + int "Number of supported utilization clamp buckets" + range 5 20 + default 5 + depends on UCLAMP_TASK + help + Defines the number of clamp buckets to use. The range of each bucket + will be SCHED_CAPACITY_SCALE/UCLAMP_BUCKETS_COUNT. The higher the + number of clamp buckets the finer their granularity and the higher + the precision of clamping aggregation and tracking at run-time. + + For example, with the default configuration we will have 5 clamp + buckets tracking 20% utilization each. A 25% boosted tasks will be + refcounted in the [20..39]% bucket and will set the bucket clamp + effective value to 25%. + If a second 30% boosted task should be co-scheduled on the same CPU, + that task will be refcounted in the same bucket of the first task and + it will boost the bucket clamp effective value to 30%. + The clamp effective value of a bucket is reset to its nominal value + (20% in the example above) when there are anymore tasks refcounted in + that bucket. + + An additional boost/capping margin can be added to some tasks. In the + example above the 25% task will be boosted to 30% until it exits the + CPU. If that should be considered not acceptable on certain systems, + it's always possible to reduce the margin by increasing the number of + clamp buckets to trade off used memory for run-time tracking + precision. + + If in doubt, use the default value. 
+ +endmenu + # # For architectures that want to enable the support for NUMA-affine scheduler # balancing logic: diff --git a/kernel/sched/core.c b/kernel/sched/core.c index ec1b67a195cc..8ecf5470058c 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -719,6 +719,167 @@ static void set_load_weight(struct task_struct *p, bool update_load) } } +#ifdef CONFIG_UCLAMP_TASK + +/* Integer ceil-rounded range for each bucket */ +#define UCLAMP_BUCKET_DELTA ((SCHED_CAPACITY_SCALE / UCLAMP_BUCKETS) + 1) + +static inline unsigned int uclamp_bucket_id(unsigned int clamp_value) +{ + return clamp_value / UCLAMP_BUCKET_DELTA; +} + +static inline unsigned int uclamp_bucket_value(unsigned int clamp_value) +{ + return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value); +} + +static inline unsigned int uclamp_none(int clamp_id) +{ + if (clamp_id == UCLAMP_MIN) + return 0; + return SCHED_CAPACITY_SCALE; +} + +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id) +{ + struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket; + unsigned int max_value = uclamp_none(clamp_id); + unsigned int bucket_id; + + /* + * Both min and max clamps are MAX aggregated, thus the topmost + * bucket with some tasks defines the rq's clamp value. + */ + bucket_id = UCLAMP_BUCKETS; + do { + --bucket_id; + if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks) + continue; + max_value = bucket[bucket_id].value; + break; + } while (bucket_id); + + WRITE_ONCE(rq->uclamp[clamp_id].value, max_value); +} + +/* + * When a task is enqueued on a rq, the clamp bucket currently defined by the + * task's uclamp::bucket_id is reference counted on that rq. This also + * immediately updates the rq's clamp value if required. + * + * Since tasks know their specific value requested from user-space, we track + * within each bucket the maximum value for tasks refcounted in that bucket. + * This provide a further aggregation (local clamping) which allows to track + * within each bucket the exact "requested" clamp value whenever all tasks + * RUNNABLE in that bucket require the same clamp. + */ +static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq, + unsigned int clamp_id) +{ + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id; + unsigned int rq_clamp, bkt_clamp, tsk_clamp; + + rq->uclamp[clamp_id].bucket[bucket_id].tasks++; + + /* + * Local clamping: rq's buckets always track the max "requested" + * clamp value from all RUNNABLE tasks in that bucket. + */ + tsk_clamp = p->uclamp[clamp_id].value; + bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value; + rq->uclamp[clamp_id].bucket[bucket_id].value = max(bkt_clamp, tsk_clamp); + + rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value); + WRITE_ONCE(rq->uclamp[clamp_id].value, max(rq_clamp, tsk_clamp)); +} + +/* + * When a task is dequeued from a rq, the clamp bucket reference counted by + * the task is released. If this is the last task reference counting the rq's + * max active clamp value, then the rq's clamp value is updated. + * Both the tasks reference counter and the rq's cached clamp values are + * expected to be always valid, if we detect they are not we skip the updates, + * enforce a consistent state and warn. 
+ */
+static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
+				    unsigned int clamp_id)
+{
+	unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
+	unsigned int rq_clamp, bkt_clamp;
+
+	SCHED_WARN_ON(!rq->uclamp[clamp_id].bucket[bucket_id].tasks);
+	if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
+		rq->uclamp[clamp_id].bucket[bucket_id].tasks--;
+
+	/*
+	 * Keep "local clamping" simple and accept to (possibly) overboost
+	 * still RUNNABLE tasks in the same bucket.
+	 */
+	if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
+		return;
+	bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value;
+
+	/* The rq's clamp value is expected to always track the max */
+	rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value);
+	SCHED_WARN_ON(bkt_clamp > rq_clamp);
+	if (bkt_clamp >= rq_clamp) {
+		/*
+		 * Reset rq's clamp bucket value to its nominal value whenever
+		 * there are no more RUNNABLE tasks refcounting it.
+		 */
+		rq->uclamp[clamp_id].bucket[bucket_id].value =
+			uclamp_bucket_value(rq_clamp);
+		uclamp_rq_update(rq, clamp_id);
+	}
+}
+
+static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
+{
+	unsigned int clamp_id;
+
+	if (unlikely(!p->sched_class->uclamp_enabled))
+		return;
+
+	for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+		uclamp_rq_inc_id(p, rq, clamp_id);
+}
+
+static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
+{
+	unsigned int clamp_id;
+
+	if (unlikely(!p->sched_class->uclamp_enabled))
+		return;
+
+	for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+		uclamp_rq_dec_id(p, rq, clamp_id);
+}
+
+static void __init init_uclamp(void)
+{
+	unsigned int clamp_id;
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
+
+	for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+		unsigned int clamp_value = uclamp_none(clamp_id);
+		unsigned int bucket_id = uclamp_bucket_id(clamp_value);
+		struct uclamp_se *uc_se = &init_task.uclamp[clamp_id];
+
+		uc_se->bucket_id = bucket_id;
+		uc_se->value = clamp_value;
+	}
+}
+
+#else /* CONFIG_UCLAMP_TASK */
+static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
+static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
+static inline void init_uclamp(void) { }
+#endif /* CONFIG_UCLAMP_TASK */
+
 static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 {
 	if (!(flags & ENQUEUE_NOCLOCK))
@@ -729,6 +890,7 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 		psi_enqueue(p, flags & ENQUEUE_WAKEUP);
 	}
 
+	uclamp_rq_inc(rq, p);
 	p->sched_class->enqueue_task(rq, p, flags);
 }
 
@@ -742,6 +904,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 		psi_dequeue(p, flags & DEQUEUE_SLEEP);
 	}
 
+	uclamp_rq_dec(rq, p);
 	p->sched_class->dequeue_task(rq, p, flags);
 }
 
@@ -6075,6 +6238,8 @@ void __init sched_init(void)
 
 	psi_init();
 
+	init_uclamp();
+
 	scheduler_running = 1;
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c688ef5012e5..ea9e28723946 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -797,6 +797,48 @@ extern void rto_push_irq_work_func(struct irq_work *work);
 #endif
 #endif /* CONFIG_SMP */
 
+#ifdef CONFIG_UCLAMP_TASK
+/*
+ * struct uclamp_bucket - Utilization clamp bucket
+ * @value: utilization clamp value for tasks on this clamp bucket
+ * @tasks: number of RUNNABLE tasks on this clamp bucket
+ *
+ * Keep track of how many tasks are RUNNABLE for a given utilization
+ * clamp value.
+ */
+struct uclamp_bucket {
+	unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
+	unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
+};
+
+/*
+ * struct uclamp_rq - rq's utilization clamp
+ * @value: currently active clamp values for a rq
+ * @bucket: utilization clamp buckets affecting a rq
+ *
+ * Keep track of RUNNABLE tasks on a rq to aggregate their clamp values.
+ * A clamp value affects a rq when there is at least one task RUNNABLE
+ * (or actually running) with that value.
+ *
+ * We have up to UCLAMP_CNT possible different clamp values, which are
+ * currently only two: minimum utilization and maximum utilization.
+ *
+ * All utilization clamping values are MAX aggregated, since:
+ * - for util_min: we want to run the CPU at least at the max of the minimum
+ *   utilization required by its currently RUNNABLE tasks.
+ * - for util_max: we want to allow the CPU to run up to the max of the
+ *   maximum utilization allowed by its currently RUNNABLE tasks.
+ *
+ * Since on each system we expect only a limited number of different
+ * utilization clamp values (UCLAMP_BUCKETS), we use a simple array to track
+ * the metrics required to compute all the per-rq utilization clamp values.
+ */
+struct uclamp_rq {
+	unsigned int value;
+	struct uclamp_bucket bucket[UCLAMP_BUCKETS];
+};
+#endif /* CONFIG_UCLAMP_TASK */
+
 /*
  * This is the main, per-CPU runqueue data structure.
  *
@@ -835,6 +877,11 @@ struct rq {
 	unsigned long		nr_load_updates;
 	u64			nr_switches;
 
+#ifdef CONFIG_UCLAMP_TASK
+	/* Utilization clamp values based on CPU's RUNNABLE tasks */
+	struct uclamp_rq	uclamp[UCLAMP_CNT] ____cacheline_aligned;
+#endif
+
 	struct cfs_rq		cfs;
 	struct rt_rq		rt;
 	struct dl_rq		dl;
@@ -1649,10 +1696,12 @@ extern const u32 sched_prio_to_wmult[40];
 struct sched_class {
 	const struct sched_class *next;
 
+#ifdef CONFIG_UCLAMP_TASK
+	int uclamp_enabled;
+#endif
+
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
-	void (*yield_task)   (struct rq *rq);
-	bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);
 
 	void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);
 
@@ -1685,7 +1734,6 @@ struct sched_class {
 	void (*set_curr_task)(struct rq *rq);
 	void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
 	void (*task_fork)(struct task_struct *p);
-	void (*task_dead)(struct task_struct *p);
 
 	/*
 	 * The switched_from() call is allowed to drop rq->lock, therefore we
@@ -1702,12 +1750,17 @@ struct sched_class {
 
 	void (*update_curr)(struct rq *rq);
 
+	void (*yield_task)   (struct rq *rq);
+	bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);
+
 #define TASK_SET_GROUP		0
 #define TASK_MOVE_GROUP		1
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	void (*task_change_group)(struct task_struct *p, int type);
 #endif
+
+	void (*task_dead)(struct task_struct *p);
 };
 
 static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
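As an aside, the bucket mapping is easier to follow with concrete numbers.
The snippet below is a self-contained user-space sketch, not part of the
patch: it simply mirrors UCLAMP_BUCKET_DELTA, uclamp_bucket_id() and
uclamp_bucket_value() with the default SCHED_CAPACITY_SCALE (1024) and
CONFIG_UCLAMP_BUCKETS_COUNT (5), reproducing the 25%/30% example from the
Kconfig help text.

/*
 * Illustrative user-space sketch (not kernel code): linear mapping of
 * clamp values into buckets with the default configuration.
 */
#include <stdio.h>

#define SCHED_CAPACITY_SCALE	1024	/* kernel default: 1 << 10 */
#define UCLAMP_BUCKETS		5	/* CONFIG_UCLAMP_BUCKETS_COUNT default */

/* Integer ceil-rounded range for each bucket, as in the patch */
#define UCLAMP_BUCKET_DELTA	((SCHED_CAPACITY_SCALE / UCLAMP_BUCKETS) + 1)

static unsigned int uclamp_bucket_id(unsigned int clamp_value)
{
	return clamp_value / UCLAMP_BUCKET_DELTA;
}

static unsigned int uclamp_bucket_value(unsigned int clamp_value)
{
	return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
}

int main(void)
{
	/* 25% and 30% boosts expressed in SCHED_CAPACITY_SCALE units */
	unsigned int boost25 = SCHED_CAPACITY_SCALE * 25 / 100;	/* 256 */
	unsigned int boost30 = SCHED_CAPACITY_SCALE * 30 / 100;	/* 307 */

	/* Both land in bucket 1, whose nominal value is 205 (~20%) */
	printf("25%% -> bucket %u (nominal %u)\n",
	       uclamp_bucket_id(boost25), uclamp_bucket_value(boost25));
	printf("30%% -> bucket %u (nominal %u)\n",
	       uclamp_bucket_id(boost30), uclamp_bucket_value(boost30));
	return 0;
}

With these defaults UCLAMP_BUCKET_DELTA is 205, so both the 25% (256) and
the 30% (307) requests map to bucket 1, whose nominal value is 205 (~20%).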
Utilization clamping allows clamping the CPU's utilization within a
[util_min, util_max] range, depending on the set of RUNNABLE tasks on that
CPU. Each task references two "clamp buckets" defining its minimum and
maximum (util_{min,max}) utilization "clamp values". A CPU's clamp bucket is
active if there is at least one RUNNABLE task enqueued on that CPU and
refcounting that bucket.

When a task is {en,de}queued {on,from} a rq, the set of active clamp buckets
on that CPU can change. Since each clamp bucket enforces a different
utilization clamp value, when the set of active clamp buckets changes, a new
"aggregated" clamp value is computed for that CPU.

Clamp values are always MAX aggregated for both util_min and util_max. This
ensures that no task can affect the performance of other co-scheduled tasks
which are more boosted (i.e. with higher util_min clamp) or less capped
(i.e. with higher util_max clamp).

Each task has a:
   task_struct::uclamp[clamp_id]::bucket_id
to track the "bucket index" of the CPU's clamp bucket it refcounts while
enqueued, for each clamp index (clamp_id).

Each CPU's rq has a:
   rq::uclamp[clamp_id]::bucket[bucket_id].tasks
to track how many tasks, currently RUNNABLE on that CPU, refcount each
clamp bucket (bucket_id) of a clamp index (clamp_id).

Each CPU's rq also has a:
   rq::uclamp[clamp_id]::bucket[bucket_id].value
to track the clamp value of each clamp bucket (bucket_id) of a clamp index
(clamp_id).

The rq::uclamp[clamp_id]::bucket[] array is scanned every time we need to
find a new MAX aggregated clamp value for a clamp_id. This operation is
required only when we dequeue the last task of a clamp bucket tracking the
current MAX aggregated clamp value. In these cases, the CPU is either
entering IDLE or going to schedule a less boosted or more clamped task.
The expected number of different clamp values, configured at build time, is
small enough to fit the full unordered array into a single cache line.

Add the basic data structures required to refcount, in each CPU's rq, the
number of RUNNABLE tasks for each clamp bucket. Also add the max aggregation
required to update the rq's clamp value at each enqueue/dequeue event.

Use a simple linear mapping of clamp values into clamp buckets. Pre-compute
and cache bucket_id to avoid integer divisions at enqueue/dequeue time.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>

---
Changes in v7:
 Message-ID: <20190123191007.GG17749@hirez.programming.kicks-ass.net>
 - removed buckets mapping code
 - use a simpler linear mapping of clamp values into buckets
 Message-ID: <20190124161443.lv2pw5fsspyelckq@e110439-lin>
 - moved this patch to the beginning of the series, in an attempt to make
   the overall series easier to digest by putting the core bits and main
   data structures first
 Others:
 - update the mapping logic to use exactly and only UCLAMP_BUCKETS_COUNT
   buckets, i.e. no more "special" bucket
 - update uclamp_rq_update() to do top-bottom max search
---
 include/linux/log2.h           |  37 ++++++++
 include/linux/sched.h          |  39 ++++++++
 include/linux/sched/topology.h |   6 --
 init/Kconfig                   |  53 +++++++++++
 kernel/sched/core.c            | 165 +++++++++++++++++++++++++++++++++
 kernel/sched/sched.h           |  59 +++++++++++-
 6 files changed, 350 insertions(+), 9 deletions(-)
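For readers who want to see the refcounting and MAX aggregation described
above end to end, here is a hedged user-space sketch for a single clamp
index. The names rq_uclamp, enqueue(), dequeue() and rq_update() are
invented here for illustration only; they approximate, but do not
reproduce, uclamp_rq_inc_id(), uclamp_rq_dec_id() and uclamp_rq_update()
from the patch (for instance, the bkt_clamp >= rq_clamp check is dropped
because rq_update() recomputes the maximum unconditionally).

/*
 * Illustrative sketch (assumed names, not kernel code): per-rq bucket
 * refcounting with MAX aggregation for a single clamp index (util_min).
 */
#include <stdio.h>

#define SCHED_CAPACITY_SCALE	1024
#define UCLAMP_BUCKETS		5
#define UCLAMP_BUCKET_DELTA	((SCHED_CAPACITY_SCALE / UCLAMP_BUCKETS) + 1)

struct bucket { unsigned int value, tasks; };
struct rq_uclamp { unsigned int value; struct bucket bucket[UCLAMP_BUCKETS]; };

static unsigned int bucket_id(unsigned int v) { return v / UCLAMP_BUCKET_DELTA; }

/* Top-down scan: the highest bucket with tasks defines the rq clamp value */
static void rq_update(struct rq_uclamp *rq)
{
	int b;

	rq->value = 0;			/* uclamp_none(UCLAMP_MIN) */
	for (b = UCLAMP_BUCKETS - 1; b >= 0; b--) {
		if (rq->bucket[b].tasks) {
			rq->value = rq->bucket[b].value;
			break;
		}
	}
}

static void enqueue(struct rq_uclamp *rq, unsigned int task_clamp)
{
	struct bucket *bkt = &rq->bucket[bucket_id(task_clamp)];

	bkt->tasks++;
	/* Local clamping: the bucket tracks the max requested value */
	if (task_clamp > bkt->value)
		bkt->value = task_clamp;
	if (task_clamp > rq->value)
		rq->value = task_clamp;
}

static void dequeue(struct rq_uclamp *rq, unsigned int task_clamp)
{
	struct bucket *bkt = &rq->bucket[bucket_id(task_clamp)];

	if (bkt->tasks)
		bkt->tasks--;
	if (!bkt->tasks) {
		/* Last task left: reset to the nominal value and re-aggregate */
		bkt->value = bucket_id(task_clamp) * UCLAMP_BUCKET_DELTA;
		rq_update(rq);
	}
}

int main(void)
{
	struct rq_uclamp rq = { 0 };

	enqueue(&rq, 256);	/* 25% boosted task */
	enqueue(&rq, 307);	/* 30% boosted task, same bucket */
	printf("rq clamp = %u\n", rq.value);	/* 307 */
	dequeue(&rq, 307);
	printf("rq clamp = %u\n", rq.value);	/* still 307: overboost accepted */
	dequeue(&rq, 256);
	printf("rq clamp = %u\n", rq.value);	/* 0: no RUNNABLE task left */
	return 0;
}

The output walks through the changelog's scenario: the second dequeue keeps
the possibly overboosted bucket value (as the patch's "local clamping"
comment accepts), and only when the bucket empties is the rq clamp value
recomputed with the top-down scan.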