Message ID: 20180320094312.24081-6-dietmar.eggemann@arm.com (mailing list archive)
State: RFC, archived
On 20-Mar 09:43, Dietmar Eggemann wrote:
> From: Quentin Perret <quentin.perret@arm.com>

[...]

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 76bd46502486..65a1bead0773 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6513,6 +6513,60 @@ static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
>  	return energy;
>  }
>
> +static bool task_fits(struct task_struct *p, int cpu)
> +{
> +	unsigned long next_util = cpu_util_next(cpu, p, cpu);
> +
> +	return util_fits_capacity(next_util, capacity_orig_of(cpu));
                                             ^^^^^^^^^^^^^^^^^^^^^

Since here we are scheduling CFS tasks, should we not better use
capacity_of() to account for RT/IRQ pressure?

> +}
> +
> +static int find_energy_efficient_cpu(struct sched_domain *sd,
> +				     struct task_struct *p, int prev_cpu)
> +{
> +	unsigned long cur_energy, prev_energy, best_energy;
> +	int cpu, best_cpu = prev_cpu;
> +
> +	if (!task_util(p))

We are still waking up a task... what if the task was previously
running on a big CPU which is now idle?

I understand that from a _relative_ energy_diff standpoint there is
not much to do for a 0-utilization task. However, for those tasks we
can still try to return the most energy-efficient CPU among the ones
in their cpus_allowed mask. It should be a relatively low overhead
(maybe contained in a fallback most_energy_efficient_cpu() kind of
function) which allows, for example on ARM big.LITTLE systems, to
consolidate those tasks on LITTLE CPUs instead of keeping them running
on a big CPU.

> +		return prev_cpu;
> +
> +	/* Compute the energy impact of leaving the task on prev_cpu. */
> +	prev_energy = best_energy = compute_energy(p, prev_cpu);
> +
> +	/* Look for the CPU that minimizes the energy. */
       ^^^^^^^^^^

nit-pick: would say explicitly "best_energy" here, just to avoid
confusion about (non-)possible overflows in the following if check ;)

> +	for_each_cpu_and(cpu, &p->cpus_allowed, sched_domain_span(sd)) {
> +		if (!task_fits(p, cpu) || cpu == prev_cpu)

nit-pick: to me it would read better as:

	if (cpu == prev_cpu)
		continue;
	if (!task_fits(p, cpu))
		continue;

but it's more a matter of (personal) taste than efficiency.

> +			continue;
> +		cur_energy = compute_energy(p, cpu);
> +		if (cur_energy < best_energy) {
> +			best_energy = cur_energy;
> +			best_cpu = cpu;
> +		}
> +	}
> +
> +	/*
> +	 * We pick the best CPU only if it saves at least 1.5% of the
> +	 * energy used by prev_cpu.
> +	 */
> +	if ((prev_energy - best_energy) > (prev_energy >> 6))
> +		return best_cpu;
> +
> +	return prev_cpu;
> +}

[...]

> @@ -6555,6 +6613,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>  			break;
>  		}
>
> +		/*
> +		 * Energy-aware task placement is performed on the highest
> +		 * non-overutilized domain spanning over cpu and prev_cpu.
> +		 */
> +		if (want_energy && !sd_overutilized(tmp) &&
> +		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
> +			energy_sd = tmp;
> +

Not entirely sure, but I was trying to understand if we can avoid
modifying the definition of want_affine (in the previous chunk) and
move this block before the previous "if (want_affine..." (in mainline
but not in this chunk), which will become an else, e.g.:

	if (want_energy && !sd_overutilized(tmp) &&
	// ...
	else if (want_energy && !sd_overutilized(tmp) &&
	// ...

Isn't that the same?

Maybe there is a code path I'm missing... but otherwise it seems a
more self-contained modification of select_task_rq_fair...

>  		if (tmp->flags & sd_flag)
>  			sd = tmp;
>  		else if (!want_affine)
> @@ -6586,6 +6652,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>  			if (want_affine)
>  				current->recent_used_cpu = cpu;
>  		}
> +	} else if (energy_sd) {
> +		new_cpu = find_energy_efficient_cpu(energy_sd, p, prev_cpu);
>  	} else {
>  		new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
>  	}
Hi, On Tue, Mar 20, 2018 at 2:43 AM, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > > From: Quentin Perret <quentin.perret@arm.com> > > In case an energy model is available, waking tasks are re-routed into a > new energy-aware placement algorithm. The eligible CPUs to be used in the > energy-aware wakeup path are restricted to the highest non-overutilized > sched_domain containing prev_cpu and this_cpu. If no such domain is found, > the tasks go through the usual wake-up path, hence energy-aware placement > happens only in lightly utilized scenarios. > > The selection of the most energy-efficient CPU for a task is achieved by > estimating the impact on system-level active energy resulting from the > placement of the task on each candidate CPU. The best CPU energy-wise is > then selected if it saves a large enough amount of energy with respect to > prev_cpu. > > Although it has already shown significant benefits on some existing > targets, this brute force approach clearly cannot scale to platforms with > numerous CPUs. This patch is an attempt to do something useful as writing > a fast heuristic that performs reasonably well on a broad spectrum of > architectures isn't an easy task. As a consequence, the scope of usability > of the energy-aware wake-up path is restricted to systems with the > SD_ASYM_CPUCAPACITY flag set. These systems not only show the most > promising opportunities for saving energy but also typically feature a > limited number of logical CPUs. 
> > Cc: Ingo Molnar <mingo@redhat.com> > Cc: Peter Zijlstra <peterz@infradead.org> > Signed-off-by: Quentin Perret <quentin.perret@arm.com> > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> > --- > kernel/sched/fair.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++--- > 1 file changed, 71 insertions(+), 3 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 76bd46502486..65a1bead0773 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -6513,6 +6513,60 @@ static unsigned long compute_energy(struct task_struct *p, int dst_cpu) > return energy; > } > > +static bool task_fits(struct task_struct *p, int cpu) > +{ > + unsigned long next_util = cpu_util_next(cpu, p, cpu); > + > + return util_fits_capacity(next_util, capacity_orig_of(cpu)); > +} > + > +static int find_energy_efficient_cpu(struct sched_domain *sd, > + struct task_struct *p, int prev_cpu) > +{ > + unsigned long cur_energy, prev_energy, best_energy; > + int cpu, best_cpu = prev_cpu; > + > + if (!task_util(p)) > + return prev_cpu; > + > + /* Compute the energy impact of leaving the task on prev_cpu. */ > + prev_energy = best_energy = compute_energy(p, prev_cpu); Is it possible that before the wakeup, the task's affinity is changed so that p->cpus_allowed no longer contains prev_cpu ? In that case prev_energy wouldn't matter since previous CPU is no longer an option? > + > + /* Look for the CPU that minimizes the energy. */ > + for_each_cpu_and(cpu, &p->cpus_allowed, sched_domain_span(sd)) { > + if (!task_fits(p, cpu) || cpu == prev_cpu) > + continue; > + cur_energy = compute_energy(p, cpu); > + if (cur_energy < best_energy) { > + best_energy = cur_energy; > + best_cpu = cpu; > + } > + } > + > + /* > + * We pick the best CPU only if it saves at least 1.5% of the > + * energy used by prev_cpu. 
> + */ > + if ((prev_energy - best_energy) > (prev_energy >> 6)) > + return best_cpu; > + > + return prev_cpu; > +} > + > +static inline bool wake_energy(struct task_struct *p, int prev_cpu) > +{ > + struct sched_domain *sd; > + > + if (!static_branch_unlikely(&sched_energy_present)) > + return false; > + > + sd = rcu_dereference_sched(cpu_rq(prev_cpu)->sd); > + if (!sd || sd_overutilized(sd)) Shouldn't you check for the SD_ASYM_CPUCAPACITY flag here? > + return false; > + > + return true; > +} > + > /* > * select_task_rq_fair: Select target runqueue for the waking task in domains > * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE, > @@ -6529,18 +6583,22 @@ static int > select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags) > { > struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL; > + struct sched_domain *energy_sd = NULL; > int cpu = smp_processor_id(); > int new_cpu = prev_cpu; > - int want_affine = 0; > + int want_affine = 0, want_energy = 0; > int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING); > > + rcu_read_lock(); > + > if (sd_flag & SD_BALANCE_WAKE) { > record_wakee(p); > + want_energy = wake_energy(p, prev_cpu); > want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu) > - && cpumask_test_cpu(cpu, &p->cpus_allowed); > + && cpumask_test_cpu(cpu, &p->cpus_allowed) > + && !want_energy; > } > > - rcu_read_lock(); > for_each_domain(cpu, tmp) { > if (!(tmp->flags & SD_LOAD_BALANCE)) > break; > @@ -6555,6 +6613,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f > break; > } > > + /* > + * Energy-aware task placement is performed on the highest > + * non-overutilized domain spanning over cpu and prev_cpu. > + */ > + if (want_energy && !sd_overutilized(tmp) && > + cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) Shouldn't you check for the SD_ASYM_CPUCAPACITY flag here for tmp level? 
> + energy_sd = tmp; > + > if (tmp->flags & sd_flag) > sd = tmp; > else if (!want_affine) > @@ -6586,6 +6652,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f > if (want_affine) > current->recent_used_cpu = cpu; > } > + } else if (energy_sd) { > + new_cpu = find_energy_efficient_cpu(energy_sd, p, prev_cpu); Even if want_affine = 0 (want_energy = 1), we can have sd = NULL if sd_flag and tmp->flags don't match. In this case we wont enter the EAS selection path because sd will be = NULL. So should you be setting sd = NULL explicitly if energy_sd != NULL , or rather do the if (energy_sd) before doing the if (!sd) ? If you still want to keep the logic this way, then probably you should also check if (tmp->flags & sd_flag) == true in the loop? That way energy_sd wont be set at all (Since we're basically saying we dont want to do wake up across this sd (in energy aware fashion in this case) if the domain flags don't watch the wake up sd_flag. thanks, - Joel
On 22-Mar 09:27, Joel Fernandes wrote: > Hi, > > On Tue, Mar 20, 2018 at 2:43 AM, Dietmar Eggemann > <dietmar.eggemann@arm.com> wrote: > > > > From: Quentin Perret <quentin.perret@arm.com> [...] > > +static inline bool wake_energy(struct task_struct *p, int prev_cpu) > > +{ > > + struct sched_domain *sd; > > + > > + if (!static_branch_unlikely(&sched_energy_present)) > > + return false; > > + > > + sd = rcu_dereference_sched(cpu_rq(prev_cpu)->sd); > > + if (!sd || sd_overutilized(sd)) > > Shouldn't you check for the SD_ASYM_CPUCAPACITY flag here? I think that should be already covered by the static key check above... > > > + return false; > > + > > + return true; > > +} > > + > > /* > > * select_task_rq_fair: Select target runqueue for the waking task in domains > > * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE, > > @@ -6529,18 +6583,22 @@ static int > > select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags) > > { > > struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL; > > + struct sched_domain *energy_sd = NULL; > > int cpu = smp_processor_id(); > > int new_cpu = prev_cpu; > > - int want_affine = 0; > > + int want_affine = 0, want_energy = 0; > > int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING); > > > > + rcu_read_lock(); > > + > > if (sd_flag & SD_BALANCE_WAKE) { > > record_wakee(p); > > + want_energy = wake_energy(p, prev_cpu); > > want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu) > > - && cpumask_test_cpu(cpu, &p->cpus_allowed); > > + && cpumask_test_cpu(cpu, &p->cpus_allowed) > > + && !want_energy; > > } > > > > - rcu_read_lock(); > > for_each_domain(cpu, tmp) { > > if (!(tmp->flags & SD_LOAD_BALANCE)) > > break; > > @@ -6555,6 +6613,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f > > break; > > } > > > > + /* > > + * Energy-aware task placement is performed on the highest > > + * non-overutilized domain spanning over cpu and 
prev_cpu. > > + */ > > + if (want_energy && !sd_overutilized(tmp) && > > + cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) > > Shouldn't you check for the SD_ASYM_CPUCAPACITY flag here for tmp level? ... and this then should be covered by the previous check in wake_energy(), which sets want_energy. > > > + energy_sd = tmp; > > + > > if (tmp->flags & sd_flag) > > sd = tmp; > > else if (!want_affine) > > @@ -6586,6 +6652,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f > > if (want_affine) > > current->recent_used_cpu = cpu; > > } > > + } else if (energy_sd) { > > + new_cpu = find_energy_efficient_cpu(energy_sd, p, prev_cpu); > > Even if want_affine = 0 (want_energy = 1), we can have sd = NULL if > sd_flag and tmp->flags don't match. In this case we wont enter the EAS > selection path because sd will be = NULL. So should you be setting sd > = NULL explicitly if energy_sd != NULL , or rather do the if > (energy_sd) before doing the if (!sd) ? That's the same think I was also proposing in my reply to this patch. But in my case the point was mainly to make the code easier to follow... which at the end it's also to void all the consideration on dependencies you describe above. Joel, can you have a look at what I proposed... I was not entirely sure that we miss some code paths doing it that way. > If you still want to keep the logic this way, then probably you should > also check if (tmp->flags & sd_flag) == true in the loop? That way > energy_sd wont be set at all (Since we're basically saying we dont > want to do wake up across this sd (in energy aware fashion in this > case) if the domain flags don't watch the wake up sd_flag. > > thanks, > > - Joel
On Wed, Mar 21, 2018 at 8:35 AM, Patrick Bellasi <patrick.bellasi@arm.com> wrote: > [...] > >> @@ -6555,6 +6613,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f >> break; >> } >> >> + /* >> + * Energy-aware task placement is performed on the highest >> + * non-overutilized domain spanning over cpu and prev_cpu. >> + */ >> + if (want_energy && !sd_overutilized(tmp) && >> + cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) >> + energy_sd = tmp; >> + > > Not entirely sure, but I was trying to understand if we can avoid to > modify the definition of want_affine (in the previous chunk) and move > this block before the previous "if (want_affine..." (in mainline but > not in this chunk), which will became an else, e.g. > > if (want_energy && !sd_overutilized(tmp) && > // ... > else if (want_energy && !sd_overutilized(tmp) && > // ... > > Isn't that the same? > > Maybe there is a code path I'm missing... but otherwise it seems a > more self contained modification of select_task_rq_fair... Just replying to this here Patrick instead of the other thread. I think this is the right place for the block from Quentin quoted above because we want to search for the highest domain that is !overutilized and look among those for the candidates. So from that perspective, we can't move the block to the beginning and it seems to be in the right place. My main concern on the other thread was different, I was talking about the cases where sd_flag & tmp->flags don't match. In that case, sd = NULL would trump EAS and I was wondering if that's the right thing to do... thanks, - Joel [...]
On Thu, Mar 22, 2018 at 11:06 AM, Patrick Bellasi <patrick.bellasi@arm.com> wrote: [..] >> > +static inline bool wake_energy(struct task_struct *p, int prev_cpu) >> > +{ >> > + struct sched_domain *sd; >> > + >> > + if (!static_branch_unlikely(&sched_energy_present)) >> > + return false; >> > + >> > + sd = rcu_dereference_sched(cpu_rq(prev_cpu)->sd); >> > + if (!sd || sd_overutilized(sd)) >> >> Shouldn't you check for the SD_ASYM_CPUCAPACITY flag here? > > I think that should be already covered by the static key check > above... > I understand there is an assumption like that but I was wondering if its more future proof for checking it explicity. I am Ok if everyone thinks its a valid assumption... >> >> > + return false; >> > + >> > + return true; >> > +} >> > + >> > /* >> > * select_task_rq_fair: Select target runqueue for the waking task in domains >> > * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE, >> > @@ -6529,18 +6583,22 @@ static int >> > select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags) >> > { >> > struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL; >> > + struct sched_domain *energy_sd = NULL; >> > int cpu = smp_processor_id(); >> > int new_cpu = prev_cpu; >> > - int want_affine = 0; >> > + int want_affine = 0, want_energy = 0; >> > int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING); >> > >> > + rcu_read_lock(); >> > + >> > if (sd_flag & SD_BALANCE_WAKE) { >> > record_wakee(p); >> > + want_energy = wake_energy(p, prev_cpu); >> > want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu) >> > - && cpumask_test_cpu(cpu, &p->cpus_allowed); >> > + && cpumask_test_cpu(cpu, &p->cpus_allowed) >> > + && !want_energy; >> > } >> > >> > - rcu_read_lock(); >> > for_each_domain(cpu, tmp) { >> > if (!(tmp->flags & SD_LOAD_BALANCE)) >> > break; >> > @@ -6555,6 +6613,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f >> > break; >> > } >> > >> > + /* 
>> > + * Energy-aware task placement is performed on the highest >> > + * non-overutilized domain spanning over cpu and prev_cpu. >> > + */ >> > + if (want_energy && !sd_overutilized(tmp) && >> > + cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) >> >> Shouldn't you check for the SD_ASYM_CPUCAPACITY flag here for tmp level? > > ... and this then should be covered by the previous check in > wake_energy(), which sets want_energy. Right, but in a scenario which probably doesn't exist today where we have both SD_ASYM_CPUCAPACITY and !SD_ASYM_CPUCAPACITY domains in the hierarchy for which want_energy = 1, I was thinking if its more future proof to check it and not make assumptions... >> >> > + energy_sd = tmp; >> > + >> > if (tmp->flags & sd_flag) >> > sd = tmp; >> > else if (!want_affine) >> > @@ -6586,6 +6652,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f >> > if (want_affine) >> > current->recent_used_cpu = cpu; >> > } >> > + } else if (energy_sd) { >> > + new_cpu = find_energy_efficient_cpu(energy_sd, p, prev_cpu); >> >> Even if want_affine = 0 (want_energy = 1), we can have sd = NULL if >> sd_flag and tmp->flags don't match. In this case we wont enter the EAS >> selection path because sd will be = NULL. So should you be setting sd >> = NULL explicitly if energy_sd != NULL , or rather do the if >> (energy_sd) before doing the if (!sd) ? > > That's the same think I was also proposing in my reply to this patch. > But in my case the point was mainly to make the code easier to > follow... which at the end it's also to void all the consideration on > dependencies you describe above. > > Joel, can you have a look at what I proposed... I was not entirely > sure that we miss some code paths doing it that way. Replied to this in the other thread. thanks, - Joel
On Thu, Mar 22, 2018 at 01:10:22PM -0700, Joel Fernandes wrote: > On Wed, Mar 21, 2018 at 8:35 AM, Patrick Bellasi > <patrick.bellasi@arm.com> wrote: > > [...] > > > >> @@ -6555,6 +6613,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f > >> break; > >> } > >> > >> + /* > >> + * Energy-aware task placement is performed on the highest > >> + * non-overutilized domain spanning over cpu and prev_cpu. > >> + */ > >> + if (want_energy && !sd_overutilized(tmp) && > >> + cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) > >> + energy_sd = tmp; > >> + > > > > Not entirely sure, but I was trying to understand if we can avoid to > > modify the definition of want_affine (in the previous chunk) and move > > this block before the previous "if (want_affine..." (in mainline but > > not in this chunk), which will became an else, e.g. > > > > if (want_energy && !sd_overutilized(tmp) && > > // ... > > else if (want_energy && !sd_overutilized(tmp) && > > // ... > > > > Isn't that the same? > > > > Maybe there is a code path I'm missing... but otherwise it seems a > > more self contained modification of select_task_rq_fair... > > Just replying to this here Patrick instead of the other thread. > > I think this is the right place for the block from Quentin quoted > above because we want to search for the highest domain that is > !overutilized and look among those for the candidates. So from that > perspective, we can't move the block to the beginning and it seems to > be in the right place. My main concern on the other thread was > different, I was talking about the cases where sd_flag & tmp->flags > don't match. In that case, sd = NULL would trump EAS and I was > wondering if that's the right thing to do... You mean if SD_BALANCE_WAKE isn't set on sched_domains? The current code seems to rely on that flag to be set to work correctly. 
Otherwise, the loop might bail out on !want_affine and we end up doing
the find_energy_efficient_cpu() on the lowest level sched_domain even if
there is a higher level one which isn't over-utilized.

However, SD_BALANCE_WAKE should be set if SD_ASYM_CPUCAPACITY is set, so
sd == NULL shouldn't be possible? This only holds as long as we only
want EAS for asymmetric systems.
On Thu, Mar 22, 2018 at 09:27:43AM -0700, Joel Fernandes wrote: > Hi, > > On Tue, Mar 20, 2018 at 2:43 AM, Dietmar Eggemann > <dietmar.eggemann@arm.com> wrote: > > > > From: Quentin Perret <quentin.perret@arm.com> > > > > In case an energy model is available, waking tasks are re-routed into a > > new energy-aware placement algorithm. The eligible CPUs to be used in the > > energy-aware wakeup path are restricted to the highest non-overutilized > > sched_domain containing prev_cpu and this_cpu. If no such domain is found, > > the tasks go through the usual wake-up path, hence energy-aware placement > > happens only in lightly utilized scenarios. > > > > The selection of the most energy-efficient CPU for a task is achieved by > > estimating the impact on system-level active energy resulting from the > > placement of the task on each candidate CPU. The best CPU energy-wise is > > then selected if it saves a large enough amount of energy with respect to > > prev_cpu. > > > > Although it has already shown significant benefits on some existing > > targets, this brute force approach clearly cannot scale to platforms with > > numerous CPUs. This patch is an attempt to do something useful as writing > > a fast heuristic that performs reasonably well on a broad spectrum of > > architectures isn't an easy task. As a consequence, the scope of usability > > of the energy-aware wake-up path is restricted to systems with the > > SD_ASYM_CPUCAPACITY flag set. These systems not only show the most > > promising opportunities for saving energy but also typically feature a > > limited number of logical CPUs. 
> > > > Cc: Ingo Molnar <mingo@redhat.com> > > Cc: Peter Zijlstra <peterz@infradead.org> > > Signed-off-by: Quentin Perret <quentin.perret@arm.com> > > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> > > --- > > kernel/sched/fair.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++--- > > 1 file changed, 71 insertions(+), 3 deletions(-) > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index 76bd46502486..65a1bead0773 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -6513,6 +6513,60 @@ static unsigned long compute_energy(struct task_struct *p, int dst_cpu) > > return energy; > > } > > > > +static bool task_fits(struct task_struct *p, int cpu) > > +{ > > + unsigned long next_util = cpu_util_next(cpu, p, cpu); > > + > > + return util_fits_capacity(next_util, capacity_orig_of(cpu)); > > +} > > + > > +static int find_energy_efficient_cpu(struct sched_domain *sd, > > + struct task_struct *p, int prev_cpu) > > +{ > > + unsigned long cur_energy, prev_energy, best_energy; > > + int cpu, best_cpu = prev_cpu; > > + > > + if (!task_util(p)) > > + return prev_cpu; > > + > > + /* Compute the energy impact of leaving the task on prev_cpu. */ > > + prev_energy = best_energy = compute_energy(p, prev_cpu); > > Is it possible that before the wakeup, the task's affinity is changed > so that p->cpus_allowed no longer contains prev_cpu ? In that case > prev_energy wouldn't matter since previous CPU is no longer an option? It is possible to wake-up with a disallowed prev_cpu. In fact select_idle_sibling() may happily return a disallowed cpu in that case. The mistake gets fixed in select_task_rq() which uses select_fallback_rq() to find an allowed cpu instead. Could we fix the issue in find_energy_efficient_cpu() by a simple test like below if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed)) prev_energy = best_energy = compute_energy(p, prev_cpu); else prev_energy = best_energy = ULONG_MAX;
On Fri, Mar 23, 2018 at 9:00 AM, Morten Rasmussen <morten.rasmussen@arm.com> wrote: > On Thu, Mar 22, 2018 at 09:27:43AM -0700, Joel Fernandes wrote: >> > >> > In case an energy model is available, waking tasks are re-routed into a >> > new energy-aware placement algorithm. The eligible CPUs to be used in the >> > energy-aware wakeup path are restricted to the highest non-overutilized >> > sched_domain containing prev_cpu and this_cpu. If no such domain is found, >> > the tasks go through the usual wake-up path, hence energy-aware placement >> > happens only in lightly utilized scenarios. >> > >> > The selection of the most energy-efficient CPU for a task is achieved by >> > estimating the impact on system-level active energy resulting from the >> > placement of the task on each candidate CPU. The best CPU energy-wise is >> > then selected if it saves a large enough amount of energy with respect to >> > prev_cpu. >> > >> > Although it has already shown significant benefits on some existing >> > targets, this brute force approach clearly cannot scale to platforms with >> > numerous CPUs. This patch is an attempt to do something useful as writing >> > a fast heuristic that performs reasonably well on a broad spectrum of >> > architectures isn't an easy task. As a consequence, the scope of usability >> > of the energy-aware wake-up path is restricted to systems with the >> > SD_ASYM_CPUCAPACITY flag set. These systems not only show the most >> > promising opportunities for saving energy but also typically feature a >> > limited number of logical CPUs. 
>> > >> > Cc: Ingo Molnar <mingo@redhat.com> >> > Cc: Peter Zijlstra <peterz@infradead.org> >> > Signed-off-by: Quentin Perret <quentin.perret@arm.com> >> > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> >> > --- >> > kernel/sched/fair.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++--- >> > 1 file changed, 71 insertions(+), 3 deletions(-) >> > >> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> > index 76bd46502486..65a1bead0773 100644 >> > --- a/kernel/sched/fair.c >> > +++ b/kernel/sched/fair.c >> > @@ -6513,6 +6513,60 @@ static unsigned long compute_energy(struct task_struct *p, int dst_cpu) >> > return energy; >> > } >> > >> > +static bool task_fits(struct task_struct *p, int cpu) >> > +{ >> > + unsigned long next_util = cpu_util_next(cpu, p, cpu); >> > + >> > + return util_fits_capacity(next_util, capacity_orig_of(cpu)); >> > +} >> > + >> > +static int find_energy_efficient_cpu(struct sched_domain *sd, >> > + struct task_struct *p, int prev_cpu) >> > +{ >> > + unsigned long cur_energy, prev_energy, best_energy; >> > + int cpu, best_cpu = prev_cpu; >> > + >> > + if (!task_util(p)) >> > + return prev_cpu; >> > + >> > + /* Compute the energy impact of leaving the task on prev_cpu. */ >> > + prev_energy = best_energy = compute_energy(p, prev_cpu); >> >> Is it possible that before the wakeup, the task's affinity is changed >> so that p->cpus_allowed no longer contains prev_cpu ? In that case >> prev_energy wouldn't matter since previous CPU is no longer an option? > > It is possible to wake-up with a disallowed prev_cpu. In fact > select_idle_sibling() may happily return a disallowed cpu in that case. > The mistake gets fixed in select_task_rq() which uses > select_fallback_rq() to find an allowed cpu instead. 
>
> Could we fix the issue in find_energy_efficient_cpu() by a simple test
> like below
>
> 	if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))
> 		prev_energy = best_energy = compute_energy(p, prev_cpu);
> 	else
> 		prev_energy = best_energy = ULONG_MAX;

Yes, I think setting to ULONG_MAX in this case is OK with me.

thanks,

- Joel
Hi Morten, On Fri, Mar 23, 2018 at 8:47 AM, Morten Rasmussen <morten.rasmussen@arm.com> wrote: > On Thu, Mar 22, 2018 at 01:10:22PM -0700, Joel Fernandes wrote: >> On Wed, Mar 21, 2018 at 8:35 AM, Patrick Bellasi >> <patrick.bellasi@arm.com> wrote: >> > [...] >> > >> >> @@ -6555,6 +6613,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f >> >> break; >> >> } >> >> >> >> + /* >> >> + * Energy-aware task placement is performed on the highest >> >> + * non-overutilized domain spanning over cpu and prev_cpu. >> >> + */ >> >> + if (want_energy && !sd_overutilized(tmp) && >> >> + cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) >> >> + energy_sd = tmp; >> >> + >> > >> > Not entirely sure, but I was trying to understand if we can avoid to >> > modify the definition of want_affine (in the previous chunk) and move >> > this block before the previous "if (want_affine..." (in mainline but >> > not in this chunk), which will became an else, e.g. >> > >> > if (want_energy && !sd_overutilized(tmp) && >> > // ... >> > else if (want_energy && !sd_overutilized(tmp) && >> > // ... >> > >> > Isn't that the same? >> > >> > Maybe there is a code path I'm missing... but otherwise it seems a >> > more self contained modification of select_task_rq_fair... >> >> Just replying to this here Patrick instead of the other thread. >> >> I think this is the right place for the block from Quentin quoted >> above because we want to search for the highest domain that is >> !overutilized and look among those for the candidates. So from that >> perspective, we can't move the block to the beginning and it seems to >> be in the right place. My main concern on the other thread was >> different, I was talking about the cases where sd_flag & tmp->flags >> don't match. In that case, sd = NULL would trump EAS and I was >> wondering if that's the right thing to do... > > You mean if SD_BALANCE_WAKE isn't set on sched_domains? Yes. 
> The current code seems to rely on that flag to be set to work correctly.
> Otherwise, the loop might bail out on !want_affine and we end up doing
> the find_energy_efficient_cpu() on the lowest level sched_domain even if
> there is a higher level one which isn't over-utilized.
>
> However, SD_BALANCE_WAKE should be set if SD_ASYM_CPUCAPACITY is set, so
> sd == NULL shouldn't be possible? This only holds as long as we only
> want EAS for asymmetric systems.

Yes, I see you had topology code that set SD_BALANCE_WAKE for ASYM. It
makes sense to me then, thanks for the clarification.

Still I feel it is a bit tedious/confusing when reading the code to draw
the conclusion about why sd is checked first before doing
find_energy_efficient_cpu() (and that sd will be != NULL for ASYM
systems). If energy_sd is set, then we can just proceed with EAS without
checking that sd != NULL. This function in mainline is already pretty
confusing as it is :-(

Regards,

- Joel
On Friday 23 Mar 2018 at 15:47:45 (+0000), Morten Rasmussen wrote:
> On Thu, Mar 22, 2018 at 01:10:22PM -0700, Joel Fernandes wrote:
> > On Wed, Mar 21, 2018 at 8:35 AM, Patrick Bellasi
> > <patrick.bellasi@arm.com> wrote:
> > >
> > > [...]
> > >
> > >> @@ -6555,6 +6613,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> > >> 			break;
> > >> 	}
> > >>
> > >> +	/*
> > >> +	 * Energy-aware task placement is performed on the highest
> > >> +	 * non-overutilized domain spanning over cpu and prev_cpu.
> > >> +	 */
> > >> +	if (want_energy && !sd_overutilized(tmp) &&
> > >> +	    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
> > >> +		energy_sd = tmp;
> > >> +
> > >
> > > Not entirely sure, but I was trying to understand if we can avoid to
> > > modify the definition of want_affine (in the previous chunk) and move
> > > this block before the previous "if (want_affine..." (in mainline but
> > > not in this chunk), which will became an else, e.g.
> > >
> > >	if (want_energy && !sd_overutilized(tmp) &&
> > >	// ...
> > >	else if (want_energy && !sd_overutilized(tmp) &&
> > >	// ...
> > >
> > > Isn't that the same?
> > >
> > > Maybe there is a code path I'm missing... but otherwise it seems a
> > > more self contained modification of select_task_rq_fair...
> >
> > Just replying to this here Patrick instead of the other thread.
> >
> > I think this is the right place for the block from Quentin quoted
> > above because we want to search for the highest domain that is
> > !overutilized and look among those for the candidates. So from that
> > perspective, we can't move the block to the beginning and it seems to
> > be in the right place. My main concern on the other thread was
> > different, I was talking about the cases where sd_flag & tmp->flags
> > don't match. In that case, sd = NULL would trump EAS and I was
> > wondering if that's the right thing to do...
>
> You mean if SD_BALANCE_WAKE isn't set on sched_domains?
>
> The current code seems to rely on that flag to be set to work correctly.
> Otherwise, the loop might bail out on !want_affine and we end up doing
> the find_energy_efficient_cpu() on the lowest level sched_domain even if
> there is higher level one which isn't over-utilized.
>
> However, SD_BALANCE_WAKE should be set if SD_ASYM_CPUCAPACITY is set so
> sd == NULL shouldn't be possible? This only holds as long as we only
> want EAS for asymmetric systems.

That's correct, we are under the assumption that the SD_ASYM_CPUCAPACITY
flag is set somewhere in the hierarchy here. If a sched domain has this
flag set, SD_BALANCE_WAKE is propagated to all lower sched domains (see
sd_init() in kernel/sched/topology.c), so we should be fine.
On Friday 23 Mar 2018 at 18:13:56 (-0700), Joel Fernandes wrote:
> Hi Morten,
>
> On Fri, Mar 23, 2018 at 8:47 AM, Morten Rasmussen
> <morten.rasmussen@arm.com> wrote:
> > On Thu, Mar 22, 2018 at 01:10:22PM -0700, Joel Fernandes wrote:

[...]

> > You mean if SD_BALANCE_WAKE isn't set on sched_domains?
>
> Yes.
>
> > The current code seems to rely on that flag to be set to work correctly.
> > Otherwise, the loop might bail out on !want_affine and we end up doing
> > the find_energy_efficient_cpu() on the lowest level sched_domain even if
> > there is higher level one which isn't over-utilized.
> >
> > However, SD_BALANCE_WAKE should be set if SD_ASYM_CPUCAPACITY is set so
> > sd == NULL shouldn't be possible? This only holds as long as we only
> > want EAS for asymmetric systems.
>
> Yes, I see you had topology code that set SD_BALANCE_WAKE for ASYM. It
> makes sense to me then, thanks for the clarification.
>
> Still I feel it is a bit tedious/confusing when reading code to draw
> the conclusion about why sd is checked first before doing
> find_energy_efficient_cpu (and that sd will != NULL for ASYM systems).
> If energy_sd is set, then we can just proceed with EAS without
> checking that sd != NULL. This function in mainline is already pretty
> confusing as it is :-(

Right, I see your point. The code is correct as is, but I agree that
having the code structured as

	if (energy_sd) {
		new_cpu = find_energy_efficient_cpu(energy_sd, p, prev_cpu);
	} else if (!sd) {
		...

might be easier to understand and functionally equivalent. What do you
think?

Thanks,
Quentin
On Thursday 22 Mar 2018 at 13:19:03 (-0700), Joel Fernandes wrote:
> On Thu, Mar 22, 2018 at 11:06 AM, Patrick Bellasi
> <patrick.bellasi@arm.com> wrote:

[...]

> >> > @@ -6555,6 +6613,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> >> > 			break;
> >> > 	}
> >> >
> >> > +	/*
> >> > +	 * Energy-aware task placement is performed on the highest
> >> > +	 * non-overutilized domain spanning over cpu and prev_cpu.
> >> > +	 */
> >> > +	if (want_energy && !sd_overutilized(tmp) &&
> >> > +	    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
> >>
> >> Shouldn't you check for the SD_ASYM_CPUCAPACITY flag here for tmp level?
> >
> > ... and this then should be covered by the previous check in
> > wake_energy(), which sets want_energy.
>
> Right, but in a scenario which probably doesn't exist today where we
> have both SD_ASYM_CPUCAPACITY and !SD_ASYM_CPUCAPACITY domains in the
> hierarchy for which want_energy = 1, I was thinking if its more future
> proof to check it and not make assumptions...

So we can definitely have cases where SD_ASYM_CPUCAPACITY is not set at
all sd levels. Today, on mobile systems, this flag is typically set only
at DIE level for big.LITTLE platforms, and not at MC level.
We enable EAS if we find _at least_ one domain that has this flag in the
hierarchy, just to make sure we don't enable EAS for symmetric
platforms. It's just a way to check a property of the topology when EAS
starts, not really a way to actually select the sd at which we do
scheduling at runtime.

I hope that helps!

Thanks,
Quentin
On March 23, 2018 6:34:22 PM PDT, Quentin Perret <quentin.perret@arm.com> wrote:
>On Friday 23 Mar 2018 at 18:13:56 (-0700), Joel Fernandes wrote:
>> Hi Morten,
>>
>> On Fri, Mar 23, 2018 at 8:47 AM, Morten Rasmussen
>> <morten.rasmussen@arm.com> wrote:
>> > On Thu, Mar 22, 2018 at 01:10:22PM -0700, Joel Fernandes wrote:
>
>[...]
>
>> > You mean if SD_BALANCE_WAKE isn't set on sched_domains?
>>
>> Yes.
>>
>> > The current code seems to rely on that flag to be set to work correctly.
>> > Otherwise, the loop might bail out on !want_affine and we end up doing
>> > the find_energy_efficient_cpu() on the lowest level sched_domain even if
>> > there is higher level one which isn't over-utilized.
>> >
>> > However, SD_BALANCE_WAKE should be set if SD_ASYM_CPUCAPACITY is set so
>> > sd == NULL shouldn't be possible? This only holds as long as we only
>> > want EAS for asymmetric systems.
>>
>> Yes, I see you had topology code that set SD_BALANCE_WAKE for ASYM. It
>> makes sense to me then, thanks for the clarification.
>>
>> Still I feel it is a bit tedious/confusing when reading code to draw
>> the conclusion about why sd is checked first before doing
>> find_energy_efficient_cpu (and that sd will != NULL for ASYM systems).
>> If energy_sd is set, then we can just proceed with EAS without
>> checking that sd != NULL. This function in mainline is already pretty
>> confusing as it is :-(
>
>Right I see your point. The code is correct as is, but I agree that
>having the code structured as
>
>	if (energy_sd) {
>		new_cpu = find_energy_efficient_cpu(energy_sd, p, prev_cpu);
>	} else if (!sd) {
>	...
>
>might be easier to understand and functionally equivalent. What do you
>think ?

Yeah definitely. Go for it.

- Joel
On Fri, Mar 23, 2018 at 6:47 PM, Quentin Perret <quentin.perret@arm.com> wrote:
> On Thursday 22 Mar 2018 at 13:19:03 (-0700), Joel Fernandes wrote:
>> On Thu, Mar 22, 2018 at 11:06 AM, Patrick Bellasi
>> <patrick.bellasi@arm.com> wrote:
>
> [...]
>
>> >> > @@ -6555,6 +6613,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>> >> > 			break;
>> >> > 	}
>> >> >
>> >> > +	/*
>> >> > +	 * Energy-aware task placement is performed on the highest
>> >> > +	 * non-overutilized domain spanning over cpu and prev_cpu.
>> >> > +	 */
>> >> > +	if (want_energy && !sd_overutilized(tmp) &&
>> >> > +	    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
>> >>
>> >> Shouldn't you check for the SD_ASYM_CPUCAPACITY flag here for tmp level?
>> >
>> > ... and this then should be covered by the previous check in
>> > wake_energy(), which sets want_energy.
>>
>> Right, but in a scenario which probably doesn't exist today where we
>> have both SD_ASYM_CPUCAPACITY and !SD_ASYM_CPUCAPACITY domains in the
>> hierarchy for which want_energy = 1, I was thinking if its more future
>> proof to check it and not make assumptions...
>
> So we can definitely have cases where SD_ASYM_CPUCAPACITY is not set at all
> sd levels. Today, on mobile systems, this flag is typically set only at DIE
> level for big.LITTLE platforms, and not at MC level.
> We enable EAS if we find _at least_ one domain that has this flag in the
> hierarchy, just to make sure we don't enable EAS for symmetric platform.
> It's just a way to check a property about the topology when EAS starts, not
> really a way to actually select the sd at which we do scheduling at
> runtime.

Yes, OK, you're right that we do have the ASYM flag set at some sd levels
but not others at the moment. Sorry about the hasty comment. I understand
what you're doing now, I am OK with that.

thanks,
- Joel
On Friday 23 Mar 2018 at 16:00:59 (+0000), Morten Rasmussen wrote:
> On Thu, Mar 22, 2018 at 09:27:43AM -0700, Joel Fernandes wrote:
> > Hi,
> >
> > On Tue, Mar 20, 2018 at 2:43 AM, Dietmar Eggemann
> > <dietmar.eggemann@arm.com> wrote:

[...]

> > Is it possible that before the wakeup, the task's affinity is changed
> > so that p->cpus_allowed no longer contains prev_cpu ? In that case
> > prev_energy wouldn't matter since previous CPU is no longer an option?
>
> It is possible to wake-up with a disallowed prev_cpu. In fact
> select_idle_sibling() may happily return a disallowed cpu in that case.
> The mistake gets fixed in select_task_rq() which uses
> select_fallback_rq() to find an allowed cpu instead.
>
> Could we fix the issue in find_energy_efficient_cpu() by a simple test
> like below
>
> 	if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))
> 		prev_energy = best_energy = compute_energy(p, prev_cpu);
> 	else
> 		prev_energy = best_energy = ULONG_MAX;

Right, that should work. I'll change this in v2.

Thanks,
Quentin
On Wednesday 21 Mar 2018 at 15:35:18 (+0000), Patrick Bellasi wrote:
> On 20-Mar 09:43, Dietmar Eggemann wrote:
> > From: Quentin Perret <quentin.perret@arm.com>
>
> [...]
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 76bd46502486..65a1bead0773 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6513,6 +6513,60 @@ static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
> > 	return energy;
> > }
> >
> > +static bool task_fits(struct task_struct *p, int cpu)
> > +{
> > +	unsigned long next_util = cpu_util_next(cpu, p, cpu);
> > +
> > +	return util_fits_capacity(next_util, capacity_orig_of(cpu));
>                                              ^^^^^^^^^^^^^^^^^^^^^
> Since here we are at scheduling CFS tasks, should we not better use
> capacity_of() to account for RT/IRQ pressure ?

Yes, definitely. I'll change this in v2.

> > +}
> > +
> > +static int find_energy_efficient_cpu(struct sched_domain *sd,
> > +				     struct task_struct *p, int prev_cpu)
> > +{
> > +	unsigned long cur_energy, prev_energy, best_energy;
> > +	int cpu, best_cpu = prev_cpu;
> > +
> > +	if (!task_util(p))
>
> We are still waking up a task... what if the task was previously
> running on a big CPU which is now idle?
>
> I understand that from a _relative_ energy_diff standpoint there is
> not much to do for a 0 utilization task. However, for those tasks we
> can still try to return the most energy efficient CPU among the ones
> in their cpus_allowed mask.
>
> It should be a relatively low overhead (maybe contained in a fallback
> most_energy_efficient_cpu() kind of function) which allows, for
> example on ARM big.LITTLE systems, to consolidate those tasks on
> LITTLE CPUs instead of, for example, keeping them running on a big CPU.

Hmmm, so the difficult thing about a task with 0 util is that you don't
know if this is really a small task, or a big task with a very long
period. The only useful thing you know for sure about the task is where
it ran last time, so I guess it makes sense to use that information
rather than make assumptions. There is no perfect solution using the
util_avg of the task.

Now, UTIL_EST is changing the game here. If we use it for task placement
(which I think is the right thing to do), this issue should be a lot
easier to solve.

What do you think?

Thanks,
Quentin
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 76bd46502486..65a1bead0773 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6513,6 +6513,60 @@ static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
 	return energy;
 }
 
+static bool task_fits(struct task_struct *p, int cpu)
+{
+	unsigned long next_util = cpu_util_next(cpu, p, cpu);
+
+	return util_fits_capacity(next_util, capacity_orig_of(cpu));
+}
+
+static int find_energy_efficient_cpu(struct sched_domain *sd,
+				     struct task_struct *p, int prev_cpu)
+{
+	unsigned long cur_energy, prev_energy, best_energy;
+	int cpu, best_cpu = prev_cpu;
+
+	if (!task_util(p))
+		return prev_cpu;
+
+	/* Compute the energy impact of leaving the task on prev_cpu. */
+	prev_energy = best_energy = compute_energy(p, prev_cpu);
+
+	/* Look for the CPU that minimizes the energy. */
+	for_each_cpu_and(cpu, &p->cpus_allowed, sched_domain_span(sd)) {
+		if (!task_fits(p, cpu) || cpu == prev_cpu)
+			continue;
+		cur_energy = compute_energy(p, cpu);
+		if (cur_energy < best_energy) {
+			best_energy = cur_energy;
+			best_cpu = cpu;
+		}
+	}
+
+	/*
+	 * We pick the best CPU only if it saves at least 1.5% of the
+	 * energy used by prev_cpu.
+	 */
+	if ((prev_energy - best_energy) > (prev_energy >> 6))
+		return best_cpu;
+
+	return prev_cpu;
+}
+
+static inline bool wake_energy(struct task_struct *p, int prev_cpu)
+{
+	struct sched_domain *sd;
+
+	if (!static_branch_unlikely(&sched_energy_present))
+		return false;
+
+	sd = rcu_dereference_sched(cpu_rq(prev_cpu)->sd);
+	if (!sd || sd_overutilized(sd))
+		return false;
+
+	return true;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -6529,18 +6583,22 @@ static int
 select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
 {
 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
+	struct sched_domain *energy_sd = NULL;
 	int cpu = smp_processor_id();
 	int new_cpu = prev_cpu;
-	int want_affine = 0;
+	int want_affine = 0, want_energy = 0;
 	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
 
+	rcu_read_lock();
+
 	if (sd_flag & SD_BALANCE_WAKE) {
 		record_wakee(p);
+		want_energy = wake_energy(p, prev_cpu);
 		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
-			      && cpumask_test_cpu(cpu, &p->cpus_allowed);
+			      && cpumask_test_cpu(cpu, &p->cpus_allowed)
+			      && !want_energy;
 	}
 
-	rcu_read_lock();
 	for_each_domain(cpu, tmp) {
 		if (!(tmp->flags & SD_LOAD_BALANCE))
 			break;
@@ -6555,6 +6613,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 			break;
 	}
 
+	/*
+	 * Energy-aware task placement is performed on the highest
+	 * non-overutilized domain spanning over cpu and prev_cpu.
+	 */
+	if (want_energy && !sd_overutilized(tmp) &&
+	    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
+		energy_sd = tmp;
+
 	if (tmp->flags & sd_flag)
 		sd = tmp;
 	else if (!want_affine)
@@ -6586,6 +6652,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		if (want_affine)
 			current->recent_used_cpu = cpu;
 		}
+	} else if (energy_sd) {
+		new_cpu = find_energy_efficient_cpu(energy_sd, p, prev_cpu);
 	} else {
 		new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
 	}