Message ID | 1415598358-26505-2-git-send-email-shilpa.bhat@linux.vnet.ibm.com (mailing list archive)
---|---
State | RFC, archived
On Mon, Nov 10, 2014 at 11:15:57AM +0530, Shilpasri G Bhat wrote:
> +/**
> + * task_cumulative_load - return the cumulative load of
> + * the previous task if cpu is the current cpu OR the
> + * cumulative load of current task on the cpu. If cpu
> + * is idle then return 0.
> + *
> + * Invoked by the cpufreq governor to calculate the
> + * load when the CPU is woken from an idle state.
> + *
> + */
> +unsigned int task_cumulative_load(int cpu)
> +{
> +	struct rq *rq = cpu_rq(cpu);
> +	struct task_struct *p;
> +
> +	if (cpu == smp_processor_id()) {
> +		if (rq->prev == rq->idle)
> +			goto idle;
> +		p = rq->prev;
> +	} else {
> +		if (rq->curr == rq->idle)
> +			goto idle;
> +		p = rq->curr;
> +	}
> +	/*
> +	 * Removing the priority as we are interested in CPU
> +	 * utilization of the task
> +	 */
> +	return (100 * p->se.avg.cumulative_avg / p->se.load.weight);
> +idle:
> +	return 0;
> +}
> +EXPORT_SYMBOL(task_cumulative_load);

This doesn't make any sense, it's wrong and it's also broken.

It doesn't make any sense because the load of a cpu is unrelated to
whatever task ran there last. You want to look at the task that is
waking now.

It is wrong because dividing out the load.weight doesn't quite work
right. Also, there are patches that introduce unweighted stats like you
want; you could have used those.

It is broken because who says rq->prev still exists?

> @@ -2476,11 +2478,13 @@ static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
>  static inline void __update_task_entity_contrib(struct sched_entity *se)
>  {
>  	u32 contrib;
> -
>  	/* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
>  	contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
>  	contrib /= (se->avg.runnable_avg_period + 1);
>  	se->avg.load_avg_contrib = scale_load(contrib);
> +	se->avg.cumulative_avg *= se->avg.cumulative_avg_count;
> +	se->avg.cumulative_avg += se->avg.load_avg_contrib;
> +	se->avg.cumulative_avg /= ++se->avg.cumulative_avg_count;
>  }

This is not a numerically stable algorithm. Look what happens when
cumulative_avg_count gets large. Also, whatever made you choose an
absolute decay filter like that?
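To make the numerical-stability objection concrete, here is a small userspace sketch (not kernel code; the zero-seeded average, the sample value of 1022 and the iteration counts are made-up assumptions) that mirrors the patch's cumulative_avg update. Once the count is large, a new sample carries a weight of only 1/count, and with integer arithmetic the average can stop moving at all:

/* toy_cumulative_avg.c -- userspace illustration, not kernel code */
#include <stdio.h>

static unsigned long avg;	/* starts at 0 here; the patch seeds it with load.weight */
static unsigned long count = 1;

/* Mirrors the update in the patch: avg = (avg * count + sample) / ++count */
static void update(unsigned long sample)
{
	avg *= count;		/* note: avg * count can also overflow for long-lived tasks */
	avg += sample;
	avg /= ++count;
}

int main(void)
{
	unsigned long i;

	/* Phase 1: a long history of idle samples. */
	for (i = 0; i < 100000; i++)
		update(0);
	printf("after idle history: avg=%lu count=%lu\n", avg, count);

	/* Phase 2: the task suddenly becomes fully busy. */
	for (i = 0; i < 100000; i++)
		update(1022);
	printf("after busy phase:   avg=%lu count=%lu\n", avg, count);

	/*
	 * The second printf still shows avg=0: each new sample of 1022 is
	 * divided by the ever-growing count, so it truncates away entirely.
	 * Even with exact arithmetic a new sample only has weight 1/count,
	 * so the filter becomes arbitrarily slow instead of having a fixed
	 * time constant like a geometric (exponentially decaying) average.
	 */
	return 0;
}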
On 11/10/2014 07:19 PM, Peter Zijlstra wrote:
> On Mon, Nov 10, 2014 at 11:15:57AM +0530, Shilpasri G Bhat wrote:
>> +/**
>> + * task_cumulative_load - return the cumulative load of
>> + * the previous task if cpu is the current cpu OR the
>> + * cumulative load of current task on the cpu. If cpu
>> + * is idle then return 0.
>> + *
>> + * Invoked by the cpufreq governor to calculate the
>> + * load when the CPU is woken from an idle state.
>> + *
>> + */
>> +unsigned int task_cumulative_load(int cpu)
>> +{
>> +	struct rq *rq = cpu_rq(cpu);
>> +	struct task_struct *p;
>> +
>> +	if (cpu == smp_processor_id()) {
>> +		if (rq->prev == rq->idle)
>> +			goto idle;
>> +		p = rq->prev;
>> +	} else {
>> +		if (rq->curr == rq->idle)
>> +			goto idle;
>> +		p = rq->curr;
>> +	}
>> +	/*
>> +	 * Removing the priority as we are interested in CPU
>> +	 * utilization of the task
>> +	 */
>> +	return (100 * p->se.avg.cumulative_avg / p->se.load.weight);
>> +idle:
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL(task_cumulative_load);
>
> This doesn't make any sense, it's wrong and it's also broken.
>
> It doesn't make any sense because the load of a cpu is unrelated to
> whatever task ran there last. You want to look at the task that is
> waking now.

Yes, I agree that the task in consideration should be the current task
on the cpu and not the last task. But the above code is run by the
cpufreq governor's kworker, so the task in context while executing
task_cumulative_load() is the kworker, and we are not interested in the
kworker's load. The load we want to account for is that of the
kworker's predecessor, i.e. the task that woke up on the idle cpu.

> It is wrong because dividing out the load.weight doesn't quite work
> right. Also, there are patches that introduce unweighted stats like you
> want; you could have used those.

Okay.

> It is broken because who says rq->prev still exists?

task_cumulative_load() is called from the cpufreq governor's kworker,
and the kworker is queued only if there is a task running on the cpu,
which guarantees the existence of rq->prev in a running state.

>> @@ -2476,11 +2478,13 @@ static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
>>  static inline void __update_task_entity_contrib(struct sched_entity *se)
>>  {
>>  	u32 contrib;
>> -
>>  	/* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
>>  	contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
>>  	contrib /= (se->avg.runnable_avg_period + 1);
>>  	se->avg.load_avg_contrib = scale_load(contrib);
>> +	se->avg.cumulative_avg *= se->avg.cumulative_avg_count;
>> +	se->avg.cumulative_avg += se->avg.load_avg_contrib;
>> +	se->avg.cumulative_avg /= ++se->avg.cumulative_avg_count;
>>  }
>
> This is not a numerically stable algorithm. Look what happens when
> cumulative_avg_count gets large. Also, whatever made you choose an
> absolute decay filter like that?

Yes, I agree. The problem I am trying to solve with this metric is the
following: if the cpufreq governor's timer is handled very early after
a task wakes up to run on an idle cpu, then that task's
load_avg_contrib is still very low when the governor's kworker is
switched in. This happens because the task has not run long enough to
update its load_avg_contrib to reflect a considerable amount of load.

Cumulative load was our initial attempt at overcoming this. It is true
that this is not the best solution. We wanted the community's
suggestion on how to get around this issue. What do you propose?
Thanks and Regards,
Shilpa
On Tue, Nov 11, 2014 at 08:22:55PM +0530, Shilpasri G Bhat wrote:
> Cumulative load was our initial attempt at overcoming this. It is true
> that this is not the best solution. We wanted the community's
> suggestion on how to get around this issue. What do you propose?

Shoot cpufreq in the head basically :-)

We're looking at completely reworking the way we do things. We
basically want to integrate cpufreq into the scheduler and effect
frequency changes from the scheduler context where possible, only
reverting to schedulable context (kworker) where there is no other
possibility.
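As a rough illustration of that direction only: the sketch below is hypothetical, not an existing kernel API. The hook structure, the registration helper and every gov_* helper name are invented for this example (and their initialization is omitted); the point is simply that the scheduler would invoke a governor callback from scheduler context, and the governor would fall back to a kworker only when the frequency change needs a schedulable context.

#include <linux/percpu.h>
#include <linux/types.h>
#include <linux/workqueue.h>

/* Hypothetical per-cpu hook the scheduler would call on load updates. */
struct freq_update_hook {
	void (*update)(struct freq_update_hook *hook, int cpu, u64 time);
};

static DEFINE_PER_CPU(struct freq_update_hook *, freq_hook);
static DEFINE_PER_CPU(struct work_struct, gov_work);

/* Made-up governor helpers, declared only so the sketch is self-contained. */
extern bool gov_driver_fast_switch_possible(int cpu);
extern unsigned int gov_pick_freq(int cpu);
extern void gov_set_freq_fast(int cpu, unsigned int freq);

/* Scheduler side: invoked wherever per-cpu load statistics are updated. */
static inline void sched_freq_update(int cpu, u64 now)
{
	struct freq_update_hook *hook = per_cpu(freq_hook, cpu);

	if (hook)
		hook->update(hook, cpu, now);
}

/* Governor side: change the frequency in place if possible, else defer. */
static void example_gov_update(struct freq_update_hook *hook, int cpu, u64 now)
{
	if (gov_driver_fast_switch_possible(cpu))
		gov_set_freq_fast(cpu, gov_pick_freq(cpu));	/* scheduler context */
	else
		schedule_work(&per_cpu(gov_work, cpu));		/* kworker fallback */
}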
On Tue, Nov 11, 2014 at 08:22:55PM +0530, Shilpasri G Bhat wrote:
> On 11/10/2014 07:19 PM, Peter Zijlstra wrote:
> > It is broken because who says rq->prev still exists?
>
> task_cumulative_load() is called from the cpufreq governor's kworker,
> and the kworker is queued only if there is a task running on the cpu,
> which guarantees the existence of rq->prev in a running state.

Not so, there is no guarantee it will still be running by the time the
kworker actually comes around to running.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5e344bb..212a0a7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1081,6 +1081,8 @@ struct sched_avg {
 	u64 last_runnable_update;
 	s64 decay_count;
 	unsigned long load_avg_contrib;
+	unsigned long cumulative_avg;
+	unsigned long cumulative_avg_count;
 };
 
 #ifdef CONFIG_SCHEDSTATS
@@ -3032,3 +3034,5 @@ static inline unsigned long rlimit_max(unsigned int limit)
 }
 
 #endif
+
+extern unsigned int task_cumulative_load(int cpu);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4499950..b3d0d5a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2853,6 +2853,7 @@ need_resched:
 	if (likely(prev != next)) {
 		rq->nr_switches++;
 		rq->curr = next;
+		rq->prev = prev;
 		++*switch_count;
 
 		context_switch(rq, prev, next); /* unlocks the rq */
@@ -8219,3 +8220,37 @@ void dump_cpu_task(int cpu)
 	pr_info("Task dump for CPU %d:\n", cpu);
 	sched_show_task(cpu_curr(cpu));
 }
+
+/**
+ * task_cumulative_load - return the cumulative load of
+ * the previous task if cpu is the current cpu OR the
+ * cumulative load of current task on the cpu. If cpu
+ * is idle then return 0.
+ *
+ * Invoked by the cpufreq governor to calculate the
+ * load when the CPU is woken from an idle state.
+ *
+ */
+unsigned int task_cumulative_load(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+	struct task_struct *p;
+
+	if (cpu == smp_processor_id()) {
+		if (rq->prev == rq->idle)
+			goto idle;
+		p = rq->prev;
+	} else {
+		if (rq->curr == rq->idle)
+			goto idle;
+		p = rq->curr;
+	}
+	/*
+	 * Removing the priority as we are interested in CPU
+	 * utilization of the task
+	 */
+	return (100 * p->se.avg.cumulative_avg / p->se.load.weight);
+idle:
+	return 0;
+}
+EXPORT_SYMBOL(task_cumulative_load);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0b069bf..58c27e3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -680,6 +680,8 @@ void init_task_runnable_average(struct task_struct *p)
 	slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
 	p->se.avg.runnable_avg_sum = slice;
 	p->se.avg.runnable_avg_period = slice;
+	p->se.avg.cumulative_avg_count = 1;
+	p->se.avg.cumulative_avg = p->se.load.weight;
 	__update_task_entity_contrib(&p->se);
 }
 #else
@@ -2476,11 +2478,13 @@ static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
 static inline void __update_task_entity_contrib(struct sched_entity *se)
 {
 	u32 contrib;
-
 	/* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
 	contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
 	contrib /= (se->avg.runnable_avg_period + 1);
 	se->avg.load_avg_contrib = scale_load(contrib);
+	se->avg.cumulative_avg *= se->avg.cumulative_avg_count;
+	se->avg.cumulative_avg += se->avg.load_avg_contrib;
+	se->avg.cumulative_avg /= ++se->avg.cumulative_avg_count;
 }
 
 /* Compute the current contribution to load_avg by se, return any delta */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 24156c84..064d6b1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -565,7 +565,7 @@ struct rq {
 	 */
 	unsigned long nr_uninterruptible;
 
-	struct task_struct *curr, *idle, *stop;
+	struct task_struct *curr, *idle, *stop, *prev;
 	unsigned long next_balance;
 	struct mm_struct *prev_mm;
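As a side note on how the new export might be consumed, here is a minimal, hypothetical sketch of a governor-side load estimate that falls back to task_cumulative_load() right after an idle wakeup. The function name, its parameters and the idle-wakeup flag are assumptions for illustration, not part of the posted patch:

/*
 * Illustrative only -- not part of the posted patch.  When the sample
 * window follows an idle wakeup, the usual time-based load
 * ((wall - idle) / wall) is not yet meaningful, so fall back to the
 * per-task estimate, which the patch already scales to a 0..100 range.
 */
static unsigned int gov_estimate_load(int cpu, unsigned int wall_time,
				      unsigned int idle_time,
				      bool woke_from_idle)
{
	unsigned int load;

	if (woke_from_idle || !wall_time || wall_time < idle_time)
		load = task_cumulative_load(cpu);
	else
		load = 100 * (wall_time - idle_time) / wall_time;

	return load;
}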
This patch aims to identify a suitable metric which can define the
nature of a task as CPU intensive or non-intensive. This metric's value
will be used to optimize the logic that scales CPU frequency during an
idle-wakeup scenario, i.e. to make the cpufreq governor scale the
frequency in a way that is optimized for the workload.

Given the properties of 'load_avg_contrib' of a task's sched entity, it
is a potential candidate for frequency scaling. By virtue of its design
this metric picks up load slowly during an idle wakeup, so at the
instant of wakeup we cannot rely on its latest value. This can be seen
in the test results below.

I ran a modified version of ebizzy which sleeps in between its
execution such that its utilization is a user-defined value. The below
trace was observed for a single-thread ebizzy run at 40% utilization.

T1 ebizzy  309.613862: .__update_entity_load_avg_contrib: ebizzy load_avg_contrib=1022
T2 ebizzy  309.613864: sched_switch: prev_comm=ebizzy ==> next_comm=swapper/8
T3 <idle>  310.062932: .__update_entity_load_avg_contrib: ebizzy load_avg_contrib=0
T4 <idle>  310.062936: sched_switch: prev_comm=swapper/8 ==> next_comm=ebizzy
T5 ebizzy  310.063104: .__update_entity_load_avg_contrib: ebizzy load_avg_contrib=67
T6 ebizzy  310.063106: .__update_entity_load_avg_contrib: kworker/8:1 load_avg_contrib=0
T7 ebizzy  310.063108: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:1

At 309.613862 the 'load_avg_contrib' value of ebizzy is 1022, which
indicates its high utilization. It goes to sleep at 309.613864. After a
long idle period ebizzy wakes up at 310.062932. Once a cpu wakes up
from an idle state, all the timers that were deferred on that cpu are
fired. The cpufreq governor's timer is one such deferred timer, and it
fires shortly after ebizzy wakes up. On the next context switch at
310.063104 we can see that load_avg_contrib is updated to 67. At
310.063108 the cpufreq governor's kworker is switched in to calculate
the load; if it were to consider ebizzy's 'load_avg_contrib' at this
point we would not benefit, as the recent value is far lower than the
value ebizzy had before going to sleep.

We can hide these fluctuations of 'load_avg_contrib' by calculating the
average of all of its values. The cumulative average of
'load_avg_contrib' preserves the long-term behavior of the task, so we
can use it to scale the cpu frequency as best suited for the task.

'load_avg_contrib' of ebizzy is updated at T1, T3 and T5. So in the
period T1-T7 ebizzy has the following values:

       load_avg_contrib    cumulative_average
T1     1022                1022/1 = 1022
T3     0                   (1022+0)/2 = 511
T5     67                  (1022+0+67)/3 = 363

At T5 the cumulative_average is 363, which is a better indicator of the
nature of the task than the load_avg_contrib value of 67. Thus we can
use the cumulative average to scale the cpu frequency during an idle
wakeup.

Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Suggested-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
 include/linux/sched.h |  4 ++++
 kernel/sched/core.c   | 35 +++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c   |  6 +++++-
 kernel/sched/sched.h  |  2 +-
 4 files changed, 45 insertions(+), 2 deletions(-)
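The cumulative_average column above can be reproduced with a few lines of userspace C (illustration only; it starts the average from zero and ignores the load.weight seeding done in init_task_runnable_average()):

#include <stdio.h>

/* Reproduces the cumulative_average column from the trace samples above. */
int main(void)
{
	unsigned long samples[] = { 1022, 0, 67 };	/* load_avg_contrib at T1, T3, T5 */
	const char *when[] = { "T1", "T3", "T5" };
	unsigned long avg = 0, count = 0;
	int i;

	for (i = 0; i < 3; i++) {
		/* same update as the patch: running mean over all samples seen so far */
		avg = (avg * count + samples[i]) / ++count;
		printf("%s: load_avg_contrib=%4lu  cumulative_average=%lu\n",
		       when[i], samples[i], avg);
	}
	/* prints 1022, 511 and 363, matching the table in the changelog */
	return 0;
}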