Message ID | 1436293469-25707-44-git-send-email-morten.rasmussen@arm.com (mailing list archive)
---|---
State | RFC
Hi Juri,

On 07/07/2015 11:24 AM, Morten Rasmussen wrote:
> From: Juri Lelli <juri.lelli@arm.com>
>
> When a CPU is going idle it is pointless to ask for an OPP update as we
> would wake up another task only to ask for the same capacity we are
> already running at (utilization gets moved to blocked_utilization). We
> thus add a cpufreq_sched_reset_capacity() interface to just reset our
> current capacity request without triggering any real update. At wakeup
> we will use the decayed utilization to select an appropriate OPP.
...
> +void cpufreq_sched_reset_cap(int cpu)
> +{
> +	per_cpu(pcpu_capacity, cpu) = 0;
> +}
> +
...
> @@ -4427,9 +4427,13 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> 		if (sched_energy_freq() && task_sleep) {
> 			unsigned long req_cap = get_cpu_usage(cpu_of(rq));
>
> -			req_cap = req_cap * capacity_margin
> -					>> SCHED_CAPACITY_SHIFT;
> -			cpufreq_sched_set_cap(cpu_of(rq), req_cap);
> +			if (rq->cfs.nr_running) {
> +				req_cap = req_cap * capacity_margin
> +						>> SCHED_CAPACITY_SHIFT;
> +				cpufreq_sched_set_cap(cpu_of(rq), req_cap);
> +			} else {
> +				cpufreq_sched_reset_cap(cpu_of(rq));
> +			}
> 		}

Though I understand the initial stated motivation here (avoiding a
redundant capacity request upon idle entry), releasing the CPU's
capacity request altogether on idle seems like it could be a contentious
policy decision.

An example to illustrate my concern:
- 2 CPU single frequency domain topology
- task A is a small frequently-running task on CPU0
- task B is a heavier intermittent task running on CPU1

Task B is driving the frequency of the cluster high, but whenever it
sleeps CPU1 becomes idle and the capacity request is dropped. If there's
any activity on CPU0 that causes cpufreq_sched_set_cap() to be called
(which is likely, given task A runs often), the cluster frequency will
be lowered. Task B's performance will be impacted when it wakes up
because initially the OPP will be insufficient. Power may or may not be
impacted, depending on the characteristics of the workload and system,
and whether energy is saved with the additional frequency scaling.

The decision of when a CPU's vote should be decayed or removed is more a
policy question, one where I believe there's no single right answer and
which, in the past, has been solved with tunables. The interactive
governor's slack timer controls how long it will allow an idle CPU to
request a frequency > fmin.

If the utilization of a long-idle CPU could be decayed by a different
CPU in the system, perhaps that CPU could take care of updating the idle
CPU's vote?
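To make the scenario concrete, here is a rough sketch of the per-domain
vote aggregation I am assuming, based on the per_cpu(pcpu_capacity, cpu)
bookkeeping in this series; domain_capacity_request() is a made-up
helper, not code from the patch. The domain-wide request is the max of
the per-CPU votes, so zeroing CPU1's vote at idle entry lets CPU0's
smaller request win as soon as anything triggers a re-evaluation:

	/*
	 * Illustrative sketch only: derive a frequency domain's capacity
	 * request from the per-CPU votes. domain_capacity_request() is a
	 * hypothetical helper; pcpu_capacity is the per-CPU vote that
	 * this series maintains.
	 */
	static unsigned long domain_capacity_request(const struct cpumask *domain)
	{
		unsigned long cap = 0;
		int cpu;

		/* the highest vote in the domain decides the OPP */
		for_each_cpu(cpu, domain)
			cap = max(cap, per_cpu(pcpu_capacity, cpu));

		return cap;
	}

With cpufreq_sched_reset_cap() setting CPU1's entry to 0 at dequeue, the
max above collapses to whatever CPU0 last requested.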
Hi Steve,

On 08/10/15 21:40, Steve Muckle wrote:
> Hi Juri,
>
> On 07/07/2015 11:24 AM, Morten Rasmussen wrote:
>> From: Juri Lelli <juri.lelli@arm.com>
>>
>> When a CPU is going idle it is pointless to ask for an OPP update as we
>> would wake up another task only to ask for the same capacity we are
>> already running at (utilization gets moved to blocked_utilization). We
>> thus add a cpufreq_sched_reset_capacity() interface to just reset our
>> current capacity request without triggering any real update. At wakeup
>> we will use the decayed utilization to select an appropriate OPP.
> ...
>> +void cpufreq_sched_reset_cap(int cpu)
>> +{
>> +	per_cpu(pcpu_capacity, cpu) = 0;
>> +}
>> +
> ...
>> @@ -4427,9 +4427,13 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> 		if (sched_energy_freq() && task_sleep) {
>> 			unsigned long req_cap = get_cpu_usage(cpu_of(rq));
>>
>> -			req_cap = req_cap * capacity_margin
>> -					>> SCHED_CAPACITY_SHIFT;
>> -			cpufreq_sched_set_cap(cpu_of(rq), req_cap);
>> +			if (rq->cfs.nr_running) {
>> +				req_cap = req_cap * capacity_margin
>> +						>> SCHED_CAPACITY_SHIFT;
>> +				cpufreq_sched_set_cap(cpu_of(rq), req_cap);
>> +			} else {
>> +				cpufreq_sched_reset_cap(cpu_of(rq));
>> +			}
>> 		}
>
> Though I understand the initial stated motivation here (avoiding a
> redundant capacity request upon idle entry), releasing the CPU's
> capacity request altogether on idle seems like it could be a contentious
> policy decision.
>
> An example to illustrate my concern:
> - 2 CPU single frequency domain topology
> - task A is a small frequently-running task on CPU0
> - task B is a heavier intermittent task running on CPU1
>
> Task B is driving the frequency of the cluster high, but whenever it
> sleeps CPU1 becomes idle and the capacity request is dropped. If there's
> any activity on CPU0 that causes cpufreq_sched_set_cap() to be called
> (which is likely, given task A runs often), the cluster frequency will
> be lowered. Task B's performance will be impacted when it wakes up
> because initially the OPP will be insufficient. Power may or may not be

With the current implementation you are right: B's utilization will be
decayed and B will have to build it up again, losing performance. What
if we try to change this as discussed at Connect? At enqueue time we use
B's pre-decayed utilization, so that it generates an OPP transition at
the required capacity on wakeup.

> impacted, depending on the characteristics of the workload and system,
> and whether energy is saved with the additional frequency scaling.
>
> The decision of when a CPU's vote should be decayed or removed is more a
> policy question, one where I believe there's no single right answer and
> which, in the past, has been solved with tunables. The interactive
> governor's slack timer controls how long it will allow an idle CPU to
> request a frequency > fmin.
>

Mmm, IMHO there is still a bit of room for making the current
implementation better before we give up and add a tunable :-).

> If the utilization of a long-idle CPU could be decayed by a different
> CPU in the system, perhaps that CPU could take care of updating the idle
> CPU's vote?
>

Yeah, that's exactly what we thought of as well. Actually, we also tried
to do some sort of "fake" pre-decaying based on information coming from
set timers, but it didn't quite fix the issue. Decaying someone else's
stuff implies taking locks, I fear, and that is not good. However, I
agree that finding a clean way of properly updating the utilization of
idle CPUs would be a better solution.
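To make the enqueue-time idea above more concrete, something like the
following is what I have in mind. This is a rough, untested sketch
against the existing enqueue path; task_util_before_sleep() is a made-up
placeholder for however we end up recovering the task's pre-sleep
utilization:

	/* sketch only: wakeup path in enqueue_task_fair() */
	if (sched_energy_freq()) {
		unsigned long req_cap = get_cpu_usage(cpu_of(rq));

		/*
		 * Bias the request with the waking task's pre-decayed
		 * utilization, so the OPP ramps up right away instead of
		 * waiting for the utilization signal to build back up.
		 * task_util_before_sleep() is hypothetical.
		 */
		req_cap = max(req_cap, task_util_before_sleep(p));
		req_cap = req_cap * capacity_margin >> SCHED_CAPACITY_SHIFT;
		cpufreq_sched_set_cap(cpu_of(rq), req_cap);
	}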
Thanks,

- Juri
On 10/09/2015 02:14 AM, Juri Lelli wrote:
>> Though I understand the initial stated motivation here (avoiding a
>> redundant capacity request upon idle entry), releasing the CPU's
>> capacity request altogether on idle seems like it could be a contentious
>> policy decision.
>>
>> An example to illustrate my concern:
>> - 2 CPU single frequency domain topology
>> - task A is a small frequently-running task on CPU0
>> - task B is a heavier intermittent task running on CPU1
>>
>> Task B is driving the frequency of the cluster high, but whenever it
>> sleeps CPU1 becomes idle and the capacity request is dropped. If there's
>> any activity on CPU0 that causes cpufreq_sched_set_cap() to be called
>> (which is likely, given task A runs often), the cluster frequency will
>> be lowered. Task B's performance will be impacted when it wakes up
>> because initially the OPP will be insufficient. Power may or may not be
>
> With the current implementation you are right: B's utilization will be
> decayed and B will have to build it up again, losing performance. What
> if we try to change this as discussed at Connect? At enqueue time we use
> B's pre-decayed utilization, so that it generates an OPP transition at
> the required capacity on wakeup.

Actually, I wasn't even really considering the decay of B's utilization -
just that the CPU's OPP will have been lowered due to the reset of CPU1's
reservation when B slept and the subsequent task activity on CPU0, and
will then have to be raised (to something, depending on whether the pre-
or post-decayed utilization is used) when B wakes. The latency of OPP
transitions may be considerable, or at least nontrivial, compared to a
task's wake/sleep pattern, meaning that a good portion of the task's
activity may occur while the OPP is suboptimal for it. Frequent OPP
transitions may also have a nontrivial overhead in terms of CPU usage
and energy.

I don't have an opinion to offer at the moment on using the pre- or
post-decayed utilization in enqueue. That seems like a tough policy
choice which may require a lot of power/perf data to clearly justify
either way. My concern here is limited to whether a CPU's dvfs
contribution/vote should be entirely removed when the last task on it is
dequeued, removed gradually (decayed) over time, or removed entirely
after some timeout, etc.

>> The decision of when a CPU's vote should be decayed or removed is more a
>> policy question, one where I believe there's no single right answer and
>> which, in the past, has been solved with tunables. The interactive
>> governor's slack timer controls how long it will allow an idle CPU to
>> request a frequency > fmin.
>
> Mmm, IMHO there is still a bit of room for making the current
> implementation better before we give up and add a tunable :-).

Agreed. As a tunable apologist, my attempt to offer background on one way
this is solved today ended up looking more like a request :) .
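For reference, the slack-timer behavior I was alluding to boils down to
something like this. It's an illustrative sketch only, not proposed
code: capacity_slack_ms and pcpu_slack_timer are invented names, and
timer setup/teardown is omitted. The idea is that an idle CPU's vote is
released only after a configurable period of idleness:

	/* sketch: drop an idle CPU's vote only after a timeout */
	static DEFINE_PER_CPU(struct timer_list, pcpu_slack_timer);
	static unsigned int capacity_slack_ms = 20; /* would be the tunable */

	static void slack_timer_fn(unsigned long cpu)
	{
		/* CPU stayed idle for the whole slack period: release its vote */
		cpufreq_sched_reset_cap(cpu);
	}

	/* called at idle entry instead of cpufreq_sched_reset_cap() */
	static void cpufreq_sched_defer_reset(int cpu)
	{
		mod_timer(&per_cpu(pcpu_slack_timer, cpu),
			  jiffies + msecs_to_jiffies(capacity_slack_ms));
	}

The timer would of course have to be cancelled again on the next
enqueue, before a new capacity request is made.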
diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
index 7071528..06ff183 100644
--- a/kernel/sched/cpufreq_sched.c
+++ b/kernel/sched/cpufreq_sched.c
@@ -203,6 +203,18 @@ void cpufreq_sched_set_cap(int cpu, unsigned long capacity)
 	return;
 }
 
+/**
+ * cpufreq_sched_reset_cap - interface to scheduler for resetting capacity
+ *                           requests
+ * @cpu: cpu whose capacity request has to be reset
+ *
+ * This _won't trigger_ any capacity update.
+ */
+void cpufreq_sched_reset_cap(int cpu)
+{
+	per_cpu(pcpu_capacity, cpu) = 0;
+}
+
 static inline void set_sched_energy_freq(void)
 {
 	if (!sched_energy_freq())
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bb49499..323331f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4427,9 +4427,13 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (sched_energy_freq() && task_sleep) {
 			unsigned long req_cap = get_cpu_usage(cpu_of(rq));
 
-			req_cap = req_cap * capacity_margin
-					>> SCHED_CAPACITY_SHIFT;
-			cpufreq_sched_set_cap(cpu_of(rq), req_cap);
+			if (rq->cfs.nr_running) {
+				req_cap = req_cap * capacity_margin
+						>> SCHED_CAPACITY_SHIFT;
+				cpufreq_sched_set_cap(cpu_of(rq), req_cap);
+			} else {
+				cpufreq_sched_reset_cap(cpu_of(rq));
+			}
 		}
 	}
 	hrtick_update(rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 85be5d8..f1ff5bb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1487,9 +1487,12 @@ static inline bool sched_energy_freq(void)
 
 #ifdef CONFIG_CPU_FREQ_GOV_SCHED
 void cpufreq_sched_set_cap(int cpu, unsigned long util);
+void cpufreq_sched_reset_cap(int cpu);
 #else
 static inline void cpufreq_sched_set_cap(int cpu, unsigned long util)
 { }
+static inline void cpufreq_sched_reset_cap(int cpu)
+{ }
 #endif
 
 static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)