[RFC,1/2] sched/fair: Add cumulative average of load_avg_contrib to a task

Message ID 1415598358-26505-2-git-send-email-shilpa.bhat@linux.vnet.ibm.com (mailing list archive)
State RFC, archived

Commit Message

Shilpasri G Bhat Nov. 10, 2014, 5:45 a.m. UTC
This patch aims to identify a suitable metric that can classify a task
as CPU intensive or non-intensive. The metric's value will be used to
optimize the logic that scales the CPU frequency in an idle-wakeup
scenario, i.e., to let the cpufreq governor scale the frequency in a way
that suits the workload.

Given its properties, the 'load_avg_contrib' of a task's sched entity is
a potential candidate for driving frequency scaling. However, by design
this metric picks up the load slowly after an idle wakeup, so at the
instant of wakeup we cannot rely on its latest value. This can be seen
in the test results below.

I ran a modified version of ebizzy which sleeps in between its execution
so that its utilization is a user-defined value (a minimal sketch of such
a duty-cycled loop is shown below). The following trace was observed for
a single-threaded ebizzy run at 40% utilization.
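
A minimal user-space sketch of such a duty-cycled workload. This is an
illustration only, not the actual ebizzy modification; run_duty_cycle()
and its parameters are made up for this example:

#include <time.h>
#include <unistd.h>

/* Busy-loop for util_pct percent of each period and sleep for the rest,
 * so the task's average utilization settles at the requested value. */
static void run_duty_cycle(int util_pct, long period_us)
{
	long busy_us = period_us * util_pct / 100;
	long idle_us = period_us - busy_us;

	for (;;) {
		struct timespec start, now;

		clock_gettime(CLOCK_MONOTONIC, &start);
		do {	/* burn CPU for the busy part of the period */
			clock_gettime(CLOCK_MONOTONIC, &now);
		} while ((now.tv_sec - start.tv_sec) * 1000000L +
			 (now.tv_nsec - start.tv_nsec) / 1000L < busy_us);

		usleep(idle_us);	/* sleep for the rest of the period */
	}
}

int main(void)
{
	run_duty_cycle(40, 100000);	/* 40% utilization, 100ms period */
	return 0;
}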

T1 ebizzy   309.613862: .__update_entity_load_avg_contrib: ebizzy load_avg_contrib=1022
T2 ebizzy   309.613864: sched_switch: prev_comm=ebizzy ==> next_comm=swapper/8
T3 <idle>   310.062932: .__update_entity_load_avg_contrib: ebizzy load_avg_contrib=0
T4 <idle>   310.062936: sched_switch: prev_comm=swapper/8 ==> next_comm=ebizzy
T5 ebizzy   310.063104: .__update_entity_load_avg_contrib: ebizzy load_avg_contrib=67
T6 ebizzy   310.063106: .__update_entity_load_avg_contrib: kworker/8:1 load_avg_contrib=0
T7 ebizzy   310.063108: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:1

At 309.613862 the 'load_avg_contrib' value of ebizzy is equal to 1022,
which indicates its high utilization. It goes to sleep at 309.613864.
After a long idle period ebizzy wakes up at 310.062932. Once a cpu wakes
up from an idle state, all the timers that were deferred on that cpu are
fired. The cpufreq governor's timer is one such deferred timer, and it
fires soon after ebizzy wakes up. At 310.063104, around the next context
switch, we can see that load_avg_contrib's value gets updated to 67.
Further, at 310.063108 the cpufreq governor's kworker is switched in to
calculate the load; if it were to consider ebizzy's 'load_avg_contrib'
value now, we would not benefit, as the recent value is far lower than
the value ebizzy had before going to sleep.
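
A rough back-of-the-envelope estimate, assuming the per-entity load
tracking behaviour in which a contribution roughly halves every 32ms
(32 periods of 1024us); this is only a sketch to show why the old value
of 1022 is effectively gone by the time ebizzy wakes up at T3:

#include <math.h>
#include <stdio.h>

int main(void)
{
	double half_life_ms = 32.768;	/* 32 periods of 1024us */
	double idle_ms = 449.0;		/* 310.0629s - 309.6139s, in ms */
	double old_contrib = 1022.0;

	/* 1022 halved ~13.7 times is ~0.08, i.e. effectively zero */
	printf("decayed contrib ~= %.2f\n",
	       old_contrib / pow(2.0, idle_ms / half_life_ms));
	return 0;
}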

We can hide these fluctuations of 'load_avg_contrib' by calculating the
average of all of its values. The cumulative average of
'load_avg_contrib' always preserves the long-term behavior of the task.
Thus, using the cumulative average of 'load_avg_contrib', we can scale
the cpu frequency to best suit the task.

'load_avg_contrib' of ebizzy is updated at T1, T3 and T5. So over the
period T1-T7 ebizzy has the following values:
	load_avg_contrib 	cumulative_average
T1		1022		1022/1        = 1022
T3		0		(1022+0)/2    = 511
T5		67		(1022+0+67)/3 = 363

At T5 the cumulative_average is 363, which reflects the nature of the
task better than the load_avg_contrib value of 67. Thus we can use the
cumulative average to scale the cpu frequency during an idle wakeup.
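
The running average above follows avg_n = (avg_(n-1) * (n-1) + x_n) / n,
which is essentially what the __update_task_entity_contrib() hunk in the
patch below computes (ignoring the seeding done in
init_task_runnable_average()). A small stand-alone sketch that reproduces
the T1/T3/T5 values from the table:

#include <stdio.h>

int main(void)
{
	/* load_avg_contrib samples of ebizzy at T1, T3 and T5 */
	unsigned long samples[] = { 1022, 0, 67 };
	unsigned long avg = 0, count = 0;
	int i;

	for (i = 0; i < 3; i++) {
		count++;
		avg = (avg * (count - 1) + samples[i]) / count;
		printf("sample=%lu cumulative_avg=%lu\n", samples[i], avg);
	}
	return 0;	/* prints 1022, 511, 363 */
}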

Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Suggested-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
 include/linux/sched.h |  4 ++++
 kernel/sched/core.c   | 35 +++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c   |  6 +++++-
 kernel/sched/sched.h  |  2 +-
 4 files changed, 45 insertions(+), 2 deletions(-)

Comments

Peter Zijlstra Nov. 10, 2014, 1:49 p.m. UTC | #1
On Mon, Nov 10, 2014 at 11:15:57AM +0530, Shilpasri G Bhat wrote:
> +/**
> + * task_cumulative_load - return the cumulative load of
> + * the previous task if cpu is the current cpu OR the
> + * cumulative load of current task on the cpu. If cpu
> + * is idle then return 0.
> + *
> + * Invoked by the cpufreq governor to calculate the
> + * load when the CPU is woken from an idle state.
> + *
> + */
> +unsigned int task_cumulative_load(int cpu)
> +{
> +	struct rq *rq = cpu_rq(cpu);
> +	struct task_struct *p;
> +
> +	if (cpu == smp_processor_id()) {
> +		if (rq->prev == rq->idle)
> +			goto idle;
> +		p = rq->prev;
> +	} else {
> +		if (rq->curr == rq->idle)
> +			goto idle;
> +		p = rq->curr;
> +	}
> +	/*
> +	 * Removing the priority as we are interested in CPU
> +	 * utilization of the task
> +	 */
> +	return (100 * p->se.avg.cumulative_avg / p->se.load.weight);
> +idle:
> +	return 0;
> +}
> +EXPORT_SYMBOL(task_cumulative_load);

This doesn't make any sense, it's wrong and also it's broken.

This doesn't make any sense, because the load of a cpu is unrelated to
whatever task ran there last. You want to look at the task that is
waking now.

It is wrong because dividing out the load.weight doesn't quite work
right. Also, there are patches that introduce unweighted stats like you
want; you could have used those.

It is broken because who says rq->prev still exists?

> @@ -2476,11 +2478,13 @@ static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
>  static inline void __update_task_entity_contrib(struct sched_entity *se)
>  {
>  	u32 contrib;
> -
>  	/* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
>  	contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
>  	contrib /= (se->avg.runnable_avg_period + 1);
>  	se->avg.load_avg_contrib = scale_load(contrib);
> +	se->avg.cumulative_avg *= se->avg.cumulative_avg_count;
> +	se->avg.cumulative_avg += se->avg.load_avg_contrib;
> +	se->avg.cumulative_avg /= ++se->avg.cumulative_avg_count;
>  }

This is not a numerically stable algorithm. Look what happens when
cumulative_avg_count gets large. Also, whatever made you choose an
absolute decay filter like that?
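
To illustrate the point (a stand-alone sketch, not kernel code): with a
plain cumulative average, a new sample moves the average by only
(sample - avg)/count, and with integer division it may not move it at
all once cumulative_avg_count is large, so the metric stops reacting to
changes in the task's behaviour. An exponentially weighted average with
a fixed decay would keep reacting regardless of the task's age.

#include <stdio.h>

int main(void)
{
	/* a long-lived, mostly idle task whose behaviour suddenly changes */
	unsigned long avg = 100, count = 1000000;
	unsigned long sample = 1022;	/* now fully CPU bound */
	int i;

	for (i = 0; i < 10; i++) {
		count++;
		avg = (avg * (count - 1) + sample) / count;
	}
	/* the average is still 100: integer truncation swallowed the delta */
	printf("avg=%lu after %d maximal samples\n", avg, i);
	return 0;
}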
Shilpasri G Bhat Nov. 11, 2014, 2:52 p.m. UTC | #2
On 11/10/2014 07:19 PM, Peter Zijlstra wrote:
> On Mon, Nov 10, 2014 at 11:15:57AM +0530, Shilpasri G Bhat wrote:
>> +/**
>> + * task_cumulative_load - return the cumulative load of
>> + * the previous task if cpu is the current cpu OR the
>> + * cumulative load of current task on the cpu. If cpu
>> + * is idle then return 0.
>> + *
>> + * Invoked by the cpufreq governor to calculate the
>> + * load when the CPU is woken from an idle state.
>> + *
>> + */
>> +unsigned int task_cumulative_load(int cpu)
>> +{
>> +	struct rq *rq = cpu_rq(cpu);
>> +	struct task_struct *p;
>> +
>> +	if (cpu == smp_processor_id()) {
>> +		if (rq->prev == rq->idle)
>> +			goto idle;
>> +		p = rq->prev;
>> +	} else {
>> +		if (rq->curr == rq->idle)
>> +			goto idle;
>> +		p = rq->curr;
>> +	}
>> +	/*
>> +	 * Removing the priority as we are interested in CPU
>> +	 * utilization of the task
>> +	 */
>> +	return (100 * p->se.avg.cumulative_avg / p->se.load.weight);
>> +idle:
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL(task_cumulative_load);
> 
> This doesn't make any sense, it's wrong and also it's broken.
> 
> This doesn't make any sense, because the load of a cpu is unrelated to
> whatever task ran there last. You want to look at the task that is
> waking now.

Yes, I agree that the task in consideration should be the current task
on the cpu and not the last task. But the above code is invoked from the
cpufreq governor's kworker, so the task in context while executing
'task_cumulative_load()' is the kworker, and we are not interested in
the kworker's load. The task whose load we want to account for is the
predecessor of the kworker, i.e., the task that woke up on the idle cpu.

> 
> It is wrong because dividing out the load.weight doesn't quite work
> right. Also, there are patches that introduce unweighted stats like you
> want; you could have used those.

Okay.
> 
> It is broken because who says rq->prev still exists?

'task_cumulative_load()' is handled by the cpufreq governor's kworker.
And the kworker is queued only if there is a task running on the cpu,
which guarantees the existence of rq->prev in a running state.

> 
>> @@ -2476,11 +2478,13 @@ static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
>>  static inline void __update_task_entity_contrib(struct sched_entity *se)
>>  {
>>  	u32 contrib;
>> -
>>  	/* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
>>  	contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
>>  	contrib /= (se->avg.runnable_avg_period + 1);
>>  	se->avg.load_avg_contrib = scale_load(contrib);
>> +	se->avg.cumulative_avg *= se->avg.cumulative_avg_count;
>> +	se->avg.cumulative_avg += se->avg.load_avg_contrib;
>> +	se->avg.cumulative_avg /= ++se->avg.cumulative_avg_count;
>>  }
> 
> This is not a numerically stable algorithm. Look what happens when
> cumulative_avg_count gets large. Also, whatever made you choose an
> absolute decay filter like that?
> 

Yes I agree.

The problem I am trying to solve with this metric is: if the cpufreq
governor's timer is handled very early after a task wakes up to run on
an idle cpu, then that task's 'load_avg_contrib' value is very low when
the governor's kworker is switched in. This happens because the task has
not run long enough to update its 'load_avg_contrib' to reflect a
considerable amount of load.

Cumulative load was our initial attempt at overcoming this. It is true
that this is not the best solution. We wanted the community's suggestion
on how to get around this issue. What do you propose?

Thanks and Regards,
Shilpa

Peter Zijlstra Nov. 11, 2014, 5:29 p.m. UTC | #3
On Tue, Nov 11, 2014 at 08:22:55PM +0530, Shilpasri G Bhat wrote:
> Cumulative load was our initial attempt at overcoming this. It is true
> that this is not the best solution. We wanted the community's suggestion
> on how to get around this issue. What do you propose?

Shoot cpufreq in the head basically :-) We're looking at completely
reworking the way we do things.

We basically want to integrate cpufreq into the scheduler and effect
frequency changes from the scheduler context where possible, only
reverting to schedulable context (kworker) where there is no other
possibility.
Peter Zijlstra Nov. 11, 2014, 5:30 p.m. UTC | #4
On Tue, Nov 11, 2014 at 08:22:55PM +0530, Shilpasri G Bhat wrote:
> On 11/10/2014 07:19 PM, Peter Zijlstra wrote:
> > It is broken because who says rq->prev still exists?
> 
> 'task_cumulative_load()' is handled by the cpufreq governor's kworker.
> And the kworker is queued only if there is a task running on the cpu,
> which guarantees the existence of rq->prev in a running state.

Not so, there is no guarantee it will still be running by the time the
kworker actually comes around to running.

Patch

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5e344bb..212a0a7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1081,6 +1081,8 @@  struct sched_avg {
 	u64 last_runnable_update;
 	s64 decay_count;
 	unsigned long load_avg_contrib;
+	unsigned long cumulative_avg;
+	unsigned long cumulative_avg_count;
 };
 
 #ifdef CONFIG_SCHEDSTATS
@@ -3032,3 +3034,5 @@  static inline unsigned long rlimit_max(unsigned int limit)
 }
 
 #endif
+
+extern unsigned int task_cumulative_load(int cpu);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4499950..b3d0d5a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2853,6 +2853,7 @@  need_resched:
 	if (likely(prev != next)) {
 		rq->nr_switches++;
 		rq->curr = next;
+		rq->prev = prev;
 		++*switch_count;
 
 		context_switch(rq, prev, next); /* unlocks the rq */
@@ -8219,3 +8220,37 @@  void dump_cpu_task(int cpu)
 	pr_info("Task dump for CPU %d:\n", cpu);
 	sched_show_task(cpu_curr(cpu));
 }
+
+/**
+ * task_cumulative_load - return the cumulative load of
+ * the previous task if cpu is the current cpu OR the
+ * cumulative load of current task on the cpu. If cpu
+ * is idle then return 0.
+ *
+ * Invoked by the cpufreq governor to calculate the
+ * load when the CPU is woken from an idle state.
+ *
+ */
+unsigned int task_cumulative_load(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+	struct task_struct *p;
+
+	if (cpu == smp_processor_id()) {
+		if (rq->prev == rq->idle)
+			goto idle;
+		p = rq->prev;
+	} else {
+		if (rq->curr == rq->idle)
+			goto idle;
+		p = rq->curr;
+	}
+	/*
+	 * Removing the priority as we are interested in CPU
+	 * utilization of the task
+	 */
+	return (100 * p->se.avg.cumulative_avg / p->se.load.weight);
+idle:
+	return 0;
+}
+EXPORT_SYMBOL(task_cumulative_load);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0b069bf..58c27e3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -680,6 +680,8 @@  void init_task_runnable_average(struct task_struct *p)
 	slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
 	p->se.avg.runnable_avg_sum = slice;
 	p->se.avg.runnable_avg_period = slice;
+	p->se.avg.cumulative_avg_count = 1;
+	p->se.avg.cumulative_avg = p->se.load.weight;
 	__update_task_entity_contrib(&p->se);
 }
 #else
@@ -2476,11 +2478,13 @@  static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
 static inline void __update_task_entity_contrib(struct sched_entity *se)
 {
 	u32 contrib;
-
 	/* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
 	contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
 	contrib /= (se->avg.runnable_avg_period + 1);
 	se->avg.load_avg_contrib = scale_load(contrib);
+	se->avg.cumulative_avg *= se->avg.cumulative_avg_count;
+	se->avg.cumulative_avg += se->avg.load_avg_contrib;
+	se->avg.cumulative_avg /= ++se->avg.cumulative_avg_count;
 }
 
 /* Compute the current contribution to load_avg by se, return any delta */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 24156c84..064d6b1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -565,7 +565,7 @@  struct rq {
 	 */
 	unsigned long nr_uninterruptible;
 
-	struct task_struct *curr, *idle, *stop;
+	struct task_struct *curr, *idle, *stop, *prev;
 	unsigned long next_balance;
 	struct mm_struct *prev_mm;