Message ID: 20180406153607.17815-5-dietmar.eggemann@arm.com (mailing list archive)
State: RFC, archived
On Fri, Apr 06, 2018 at 04:36:05PM +0100, Dietmar Eggemann wrote: > +static inline > +struct capacity_state *find_cap_state(int cpu, unsigned long util) > +{ > + struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu); > + struct capacity_state *cs = NULL; > + int i; > + > + util += util >> 2; > + > + for (i = 0; i < em->nr_cap_states; i++) { > + cs = &em->cap_states[i]; > + if (cs->cap >= util) > + break; > + } > + > + return cs; > +} So in the last thread there was some discussion about this; in particular on how this related to schedutil and if we should tie it into that. I think for starters tying it to schedutil is not a bad idea; ideally people _should_ migrate towards using that. Also; I think it makes sense to better integrate cpufreq and the energy-model values like what rjw already suggested, such that maybe we can have cpufreq_driver_resolve_freq() return a structure containing the relevant information for the selected frequency. But implementing the frequency selection thing in multiple places like now sounds like a very bad idea to me.
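For illustration only, the "structure containing the relevant information for the selected frequency" being suggested here might look roughly like the sketch below. The struct name and fields are made up for this note; nothing of the sort exists in the series as posted.

	/*
	 * Hypothetical return type for a cpufreq_driver_resolve_freq()-style
	 * helper, bundling what the energy model needs for the chosen OPP.
	 * All names are illustrative, not part of the posted patches.
	 */
	struct resolved_freq_info {
		unsigned int	freq;	/* resolved frequency, in kHz */
		unsigned long	cap;	/* capacity at that frequency */
		unsigned long	power;	/* active power at that OPP */
	};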
On Tuesday 10 Apr 2018 at 14:51:05 (+0200), Peter Zijlstra wrote: > On Fri, Apr 06, 2018 at 04:36:05PM +0100, Dietmar Eggemann wrote: > > +static inline > > +struct capacity_state *find_cap_state(int cpu, unsigned long util) > > +{ > > + struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu); > > + struct capacity_state *cs = NULL; > > + int i; > > + > > + util += util >> 2; > > + > > + for (i = 0; i < em->nr_cap_states; i++) { > > + cs = &em->cap_states[i]; > > + if (cs->cap >= util) > > + break; > > + } > > + > > + return cs; > > +} > > So in the last thread there was some discussion about this; in > particular on how this related to schedutil and if we should tie it into > that. > > I think for starters tying it to schedutil is not a bad idea; ideally > people _should_ migrate towards using that. > > Also; I think it makes sense to better integrate cpufreq and the > energy-model values like what rjw already suggested, such that maybe we > can have cpufreq_driver_resolve_freq() return a structure containing the > relevant information for the selected frequency. I guess if we want to do that in the wake-up path, we would also need to add a new parameter to it to make sure we don't actually call into cpufreq_driver->resolve_freq() ... But then, we could sort of rely on cpufreq_schedutil.c::get_next_freq() to replace find_cap_state() ... Is this what you had in mind ? > > But implementing the frequency selection thing in multiple places like > now sounds like a very bad idea to me. Understood. Making sure we share the same code everywhere might have consequences though. I guess we'll have to either accept the cost of function calls in the wake-up path, or to accept to inline those functions for ex. Or maybe you had something else in mind ? Anyways, that's probably another good discussion topic for OSPM next week :-) Thanks, Quentin
On Tue, Apr 10, 2018 at 02:56:41PM +0100, Quentin Perret wrote: > > So in the last thread there was some discussion about this; in > > particular on how this related to schedutil and if we should tie it into > > that. > > > > I think for starters tying it to schedutil is not a bad idea; ideally > > people _should_ migrate towards using that. > > > > Also; I think it makes sense to better integrate cpufreq and the > > energy-model values like what rjw already suggested, such that maybe we > > can have cpufreq_driver_resolve_freq() return a structure containing the > > relevant information for the selected frequency. > > I guess if we want to do that in the wake-up path, we would also need to > add a new parameter to it to make sure we don't actually call into > cpufreq_driver->resolve_freq() ... > > But then, we could sort of rely on cpufreq_schedutil.c::get_next_freq() > to replace find_cap_state() ... Is this what you had in mind ? Yes, something along those lines; we could also of course factor get_next_freq() into two parts. > > But implementing the frequency selection thing in multiple places like > > now sounds like a very bad idea to me. > > Understood. Making sure we share the same code everywhere might have > consequences though. I guess we'll have to either accept the cost of > function calls in the wake-up path, or to accept to inline those > functions for ex. Or maybe you had something else in mind ? > > Anyways, that's probably another good discussion topic for OSPM > next week :-) Yes, that wants a wee bit of discussion. Ideally we'd have a shared data structure we can iterate instead of a chain of indirect calls.
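One way to read "factor get_next_freq() into two parts" is to pull the pure util-to-frequency mapping out of schedutil so the governor and find_cap_state() share the same 1.25 margin. A rough sketch, with a made-up helper name, could be:

	/*
	 * Pure part of schedutil's get_next_freq(): map a utilization to a
	 * raw (unresolved) frequency using the same 1.25 margin that
	 * find_cap_state() applies via "util += util >> 2". The helper name
	 * is hypothetical; the policy/driver interaction (cached_raw_freq,
	 * cpufreq_driver_resolve_freq()) would stay in schedutil.
	 */
	static inline unsigned int map_util_to_freq(unsigned long util,
						    unsigned long max,
						    unsigned int max_freq)
	{
		return (max_freq + (max_freq >> 2)) * util / max;
	}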
On 06-04-18, 16:36, Dietmar Eggemann wrote: > +static inline struct capacity_state > +*find_cap_state(int cpu, unsigned long util) { return NULL; } I saw this somewhere else as well in this series. I believe the line break should happen after "*" as "struct capacity_state *" should be read together as one entity.
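Concretely, the stub declaration Viresh is referring to would then read along these lines:

	static inline struct capacity_state *
	find_cap_state(int cpu, unsigned long util) { return NULL; }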
On Fri, Apr 06, 2018 at 04:36:05PM +0100, Dietmar Eggemann wrote: > From: Quentin Perret <quentin.perret@arm.com> > > In preparation for the definition of an energy-aware wakeup path, a > helper function is provided to estimate the consequence on system energy > when a specific task wakes-up on a specific CPU. compute_energy() > estimates the OPPs to be reached by all frequency domains and estimates > the consumption of each online CPU according to its energy model and its > percentage of busy time. > > Cc: Ingo Molnar <mingo@redhat.com> > Cc: Peter Zijlstra <peterz@infradead.org> > Signed-off-by: Quentin Perret <quentin.perret@arm.com> > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> > --- > include/linux/sched/energy.h | 20 +++++++++++++ > kernel/sched/fair.c | 68 ++++++++++++++++++++++++++++++++++++++++++++ > kernel/sched/sched.h | 2 +- > 3 files changed, 89 insertions(+), 1 deletion(-) > > diff --git a/include/linux/sched/energy.h b/include/linux/sched/energy.h > index 941071eec013..b4110b145228 100644 > --- a/include/linux/sched/energy.h > +++ b/include/linux/sched/energy.h > @@ -27,6 +27,24 @@ static inline bool sched_energy_enabled(void) > return static_branch_unlikely(&sched_energy_present); > } > > +static inline > +struct capacity_state *find_cap_state(int cpu, unsigned long util) > +{ > + struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu); > + struct capacity_state *cs = NULL; > + int i; > + > + util += util >> 2; > + > + for (i = 0; i < em->nr_cap_states; i++) { > + cs = &em->cap_states[i]; > + if (cs->cap >= util) > + break; > + } > + > + return cs; 'cs' is possible to return NULL. > +} > + > static inline struct cpumask *freq_domain_span(struct freq_domain *fd) > { > return &fd->span; > @@ -42,6 +60,8 @@ struct freq_domain; > static inline bool sched_energy_enabled(void) { return false; } > static inline struct cpumask > *freq_domain_span(struct freq_domain *fd) { return NULL; } > +static inline struct capacity_state > +*find_cap_state(int cpu, unsigned long util) { return NULL; } > static inline void init_sched_energy(void) { } > #define for_each_freq_domain(fdom) for (; fdom; fdom = NULL) > #endif > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 6960e5ef3c14..8cb9fb04fff2 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -6633,6 +6633,74 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu) > } > > /* > + * Returns the util of "cpu" if "p" wakes up on "dst_cpu". 
> + */ > +static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu) > +{ > + unsigned long util, util_est; > + struct cfs_rq *cfs_rq; > + > + /* Task is where it should be, or has no impact on cpu */ > + if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu)) > + return cpu_util(cpu); > + > + cfs_rq = &cpu_rq(cpu)->cfs; > + util = READ_ONCE(cfs_rq->avg.util_avg); > + > + if (dst_cpu == cpu) > + util += task_util(p); > + else > + util = max_t(long, util - task_util(p), 0); I tried to understand the logic at here, below code is more clear for myself: int prev_cpu = task_cpu(p); cfs_rq = &cpu_rq(cpu)->cfs; util = READ_ONCE(cfs_rq->avg.util_avg); /* Bail out if src and dst CPUs are the same one */ if (prev_cpu == cpu && dst_cpu == cpu) return util; /* Remove task utilization for src CPU */ if (cpu == prev_cpu) util = max_t(long, util - task_util(p), 0); /* Add task utilization for dst CPU */ if (dst_cpu == cpu) util += task_util(p); BTW, CPU utilization is decayed value and task_util() is not decayed value, so 'util - task_util(p)' calculates a smaller value than the prev CPU pure utilization, right? Another question is can we reuse the function cpu_util_wake() and just compenstate task util for dst cpu? > + if (sched_feat(UTIL_EST)) { > + util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued); > + if (dst_cpu == cpu) > + util_est += _task_util_est(p); > + else > + util_est = max_t(long, util_est - _task_util_est(p), 0); > + util = max(util, util_est); > + } > + > + return min_t(unsigned long, util, capacity_orig_of(cpu)); > +} > + > +/* > + * Estimates the system level energy assuming that p wakes-up on dst_cpu. > + * > + * compute_energy() is safe to call only if an energy model is available for > + * the platform, which is when sched_energy_enabled() is true. > + */ > +static unsigned long compute_energy(struct task_struct *p, int dst_cpu) > +{ > + unsigned long util, max_util, sum_util; > + struct capacity_state *cs; > + unsigned long energy = 0; > + struct freq_domain *fd; > + int cpu; > + > + for_each_freq_domain(fd) { > + max_util = sum_util = 0; > + for_each_cpu_and(cpu, freq_domain_span(fd), cpu_online_mask) { > + util = cpu_util_next(cpu, p, dst_cpu); > + util += cpu_util_dl(cpu_rq(cpu)); > + max_util = max(util, max_util); > + sum_util += util; > + } > + > + /* > + * Here we assume that the capacity states of CPUs belonging to > + * the same frequency domains are shared. Hence, we look at the > + * capacity state of the first CPU and re-use it for all. > + */ > + cpu = cpumask_first(freq_domain_span(fd)); > + cs = find_cap_state(cpu, max_util); > + energy += cs->power * sum_util / cs->cap; > + } This means all CPUs will be iterated for calculation, the complexity is O(n)... Thanks, Leo Yan > + return energy; > +} > + > +/* > * select_task_rq_fair: Select target runqueue for the waking task in domains > * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE, > * SD_BALANCE_FORK, or SD_BALANCE_EXEC. > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index 5d552c0d7109..6eb38f41d5d9 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -2156,7 +2156,7 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {} > # define arch_scale_freq_invariant() false > #endif > > -#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL > +#ifdef CONFIG_SMP > static inline unsigned long cpu_util_dl(struct rq *rq) > { > return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT; > -- > 2.11.0 >
On Tuesday 17 Apr 2018 at 23:22:13 (+0800), Leo Yan wrote: > On Fri, Apr 06, 2018 at 04:36:05PM +0100, Dietmar Eggemann wrote: > > From: Quentin Perret <quentin.perret@arm.com> > > > > In preparation for the definition of an energy-aware wakeup path, a > > helper function is provided to estimate the consequence on system energy > > when a specific task wakes-up on a specific CPU. compute_energy() > > estimates the OPPs to be reached by all frequency domains and estimates > > the consumption of each online CPU according to its energy model and its > > percentage of busy time. > > > > Cc: Ingo Molnar <mingo@redhat.com> > > Cc: Peter Zijlstra <peterz@infradead.org> > > Signed-off-by: Quentin Perret <quentin.perret@arm.com> > > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> > > --- > > include/linux/sched/energy.h | 20 +++++++++++++ > > kernel/sched/fair.c | 68 ++++++++++++++++++++++++++++++++++++++++++++ > > kernel/sched/sched.h | 2 +- > > 3 files changed, 89 insertions(+), 1 deletion(-) > > > > diff --git a/include/linux/sched/energy.h b/include/linux/sched/energy.h > > index 941071eec013..b4110b145228 100644 > > --- a/include/linux/sched/energy.h > > +++ b/include/linux/sched/energy.h > > @@ -27,6 +27,24 @@ static inline bool sched_energy_enabled(void) > > return static_branch_unlikely(&sched_energy_present); > > } > > > > +static inline > > +struct capacity_state *find_cap_state(int cpu, unsigned long util) > > +{ > > + struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu); > > + struct capacity_state *cs = NULL; > > + int i; > > + > > + util += util >> 2; > > + > > + for (i = 0; i < em->nr_cap_states; i++) { > > + cs = &em->cap_states[i]; > > + if (cs->cap >= util) > > + break; > > + } > > + > > + return cs; > > 'cs' is possible to return NULL. Only if em-nr_cap_states==0, and that shouldn't be possible if sched_energy_present==True, so this code should be safe :-) > > > +} > > + > > static inline struct cpumask *freq_domain_span(struct freq_domain *fd) > > { > > return &fd->span; > > @@ -42,6 +60,8 @@ struct freq_domain; > > static inline bool sched_energy_enabled(void) { return false; } > > static inline struct cpumask > > *freq_domain_span(struct freq_domain *fd) { return NULL; } > > +static inline struct capacity_state > > +*find_cap_state(int cpu, unsigned long util) { return NULL; } > > static inline void init_sched_energy(void) { } > > #define for_each_freq_domain(fdom) for (; fdom; fdom = NULL) > > #endif > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index 6960e5ef3c14..8cb9fb04fff2 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -6633,6 +6633,74 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu) > > } > > > > /* > > + * Returns the util of "cpu" if "p" wakes up on "dst_cpu". 
> > + */ > > +static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu) > > +{ > > + unsigned long util, util_est; > > + struct cfs_rq *cfs_rq; > > + > > + /* Task is where it should be, or has no impact on cpu */ > > + if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu)) > > + return cpu_util(cpu); > > + > > + cfs_rq = &cpu_rq(cpu)->cfs; > > + util = READ_ONCE(cfs_rq->avg.util_avg); > > + > > + if (dst_cpu == cpu) > > + util += task_util(p); > > + else > > + util = max_t(long, util - task_util(p), 0); > > I tried to understand the logic at here, below code is more clear for > myself: > > int prev_cpu = task_cpu(p); > > cfs_rq = &cpu_rq(cpu)->cfs; > util = READ_ONCE(cfs_rq->avg.util_avg); > > /* Bail out if src and dst CPUs are the same one */ > if (prev_cpu == cpu && dst_cpu == cpu) > return util; > > /* Remove task utilization for src CPU */ > if (cpu == prev_cpu) > util = max_t(long, util - task_util(p), 0); > > /* Add task utilization for dst CPU */ > if (dst_cpu == cpu) > util += task_util(p); > > BTW, CPU utilization is decayed value and task_util() is not decayed > value, so 'util - task_util(p)' calculates a smaller value than the > prev CPU pure utilization, right? task_util() is the raw PELT signal, without UTIL_EST, so I think it's fine to do `util - task_util()`. > > Another question is can we reuse the function cpu_util_wake() and > just compenstate task util for dst cpu? Well it's not that simple. cpu_util_wake() will give you the max between the util_avg and the util_est value, so which task_util() should you add to it ? The util_avg or the uti_est value ? Here we are trying to predict what will be the cpu_util signal in the future, so the only always-correct implementation of this function has to predict what will be the CPU util_avg and util_est signals in parallel and take the max of the two. > > > + if (sched_feat(UTIL_EST)) { > > + util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued); > > + if (dst_cpu == cpu) > > + util_est += _task_util_est(p); > > + else > > + util_est = max_t(long, util_est - _task_util_est(p), 0); > > + util = max(util, util_est); > > + } > > + > > + return min_t(unsigned long, util, capacity_orig_of(cpu)); > > +} > > + > > +/* > > + * Estimates the system level energy assuming that p wakes-up on dst_cpu. > > + * > > + * compute_energy() is safe to call only if an energy model is available for > > + * the platform, which is when sched_energy_enabled() is true. > > + */ > > +static unsigned long compute_energy(struct task_struct *p, int dst_cpu) > > +{ > > + unsigned long util, max_util, sum_util; > > + struct capacity_state *cs; > > + unsigned long energy = 0; > > + struct freq_domain *fd; > > + int cpu; > > + > > + for_each_freq_domain(fd) { > > + max_util = sum_util = 0; > > + for_each_cpu_and(cpu, freq_domain_span(fd), cpu_online_mask) { > > + util = cpu_util_next(cpu, p, dst_cpu); > > + util += cpu_util_dl(cpu_rq(cpu)); > > + max_util = max(util, max_util); > > + sum_util += util; > > + } > > + > > + /* > > + * Here we assume that the capacity states of CPUs belonging to > > + * the same frequency domains are shared. Hence, we look at the > > + * capacity state of the first CPU and re-use it for all. > > + */ > > + cpu = cpumask_first(freq_domain_span(fd)); > > + cs = find_cap_state(cpu, max_util); > > + energy += cs->power * sum_util / cs->cap; > > + } > > This means all CPUs will be iterated for calculation, the complexity is > O(n)... 
> > > + return energy; > > +} > > + > > +/* > > * select_task_rq_fair: Select target runqueue for the waking task in domains > > * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE, > > * SD_BALANCE_FORK, or SD_BALANCE_EXEC. > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > > index 5d552c0d7109..6eb38f41d5d9 100644 > > --- a/kernel/sched/sched.h > > +++ b/kernel/sched/sched.h > > @@ -2156,7 +2156,7 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {} > > # define arch_scale_freq_invariant() false > > #endif > > > > -#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL > > +#ifdef CONFIG_SMP > > static inline unsigned long cpu_util_dl(struct rq *rq) > > { > > return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT; > > -- > > 2.11.0 > >
On Wed, Apr 18, 2018 at 09:13:39AM +0100, Quentin Perret wrote: > On Tuesday 17 Apr 2018 at 23:22:13 (+0800), Leo Yan wrote: > > On Fri, Apr 06, 2018 at 04:36:05PM +0100, Dietmar Eggemann wrote: > > > From: Quentin Perret <quentin.perret@arm.com> > > > > > > In preparation for the definition of an energy-aware wakeup path, a > > > helper function is provided to estimate the consequence on system energy > > > when a specific task wakes-up on a specific CPU. compute_energy() > > > estimates the OPPs to be reached by all frequency domains and estimates > > > the consumption of each online CPU according to its energy model and its > > > percentage of busy time. > > > > > > Cc: Ingo Molnar <mingo@redhat.com> > > > Cc: Peter Zijlstra <peterz@infradead.org> > > > Signed-off-by: Quentin Perret <quentin.perret@arm.com> > > > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> > > > --- > > > include/linux/sched/energy.h | 20 +++++++++++++ > > > kernel/sched/fair.c | 68 ++++++++++++++++++++++++++++++++++++++++++++ > > > kernel/sched/sched.h | 2 +- > > > 3 files changed, 89 insertions(+), 1 deletion(-) > > > > > > diff --git a/include/linux/sched/energy.h b/include/linux/sched/energy.h > > > index 941071eec013..b4110b145228 100644 > > > --- a/include/linux/sched/energy.h > > > +++ b/include/linux/sched/energy.h > > > @@ -27,6 +27,24 @@ static inline bool sched_energy_enabled(void) > > > return static_branch_unlikely(&sched_energy_present); > > > } > > > > > > +static inline > > > +struct capacity_state *find_cap_state(int cpu, unsigned long util) > > > +{ > > > + struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu); > > > + struct capacity_state *cs = NULL; > > > + int i; > > > + > > > + util += util >> 2; > > > + > > > + for (i = 0; i < em->nr_cap_states; i++) { > > > + cs = &em->cap_states[i]; > > > + if (cs->cap >= util) > > > + break; > > > + } > > > + > > > + return cs; > > > > 'cs' is possible to return NULL. > > Only if em-nr_cap_states==0, and that shouldn't be possible if > sched_energy_present==True, so this code should be safe :-) You are right. Thanks for explanation. > > > +} > > > + > > > static inline struct cpumask *freq_domain_span(struct freq_domain *fd) > > > { > > > return &fd->span; > > > @@ -42,6 +60,8 @@ struct freq_domain; > > > static inline bool sched_energy_enabled(void) { return false; } > > > static inline struct cpumask > > > *freq_domain_span(struct freq_domain *fd) { return NULL; } > > > +static inline struct capacity_state > > > +*find_cap_state(int cpu, unsigned long util) { return NULL; } > > > static inline void init_sched_energy(void) { } > > > #define for_each_freq_domain(fdom) for (; fdom; fdom = NULL) > > > #endif > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > index 6960e5ef3c14..8cb9fb04fff2 100644 > > > --- a/kernel/sched/fair.c > > > +++ b/kernel/sched/fair.c > > > @@ -6633,6 +6633,74 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu) > > > } > > > > > > /* > > > + * Returns the util of "cpu" if "p" wakes up on "dst_cpu". 
> > > + */ > > > +static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu) > > > +{ > > > + unsigned long util, util_est; > > > + struct cfs_rq *cfs_rq; > > > + > > > + /* Task is where it should be, or has no impact on cpu */ > > > + if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu)) > > > + return cpu_util(cpu); > > > + > > > + cfs_rq = &cpu_rq(cpu)->cfs; > > > + util = READ_ONCE(cfs_rq->avg.util_avg); > > > + > > > + if (dst_cpu == cpu) > > > + util += task_util(p); > > > + else > > > + util = max_t(long, util - task_util(p), 0); > > > > I tried to understand the logic at here, below code is more clear for > > myself: > > > > int prev_cpu = task_cpu(p); > > > > cfs_rq = &cpu_rq(cpu)->cfs; > > util = READ_ONCE(cfs_rq->avg.util_avg); > > > > /* Bail out if src and dst CPUs are the same one */ > > if (prev_cpu == cpu && dst_cpu == cpu) > > return util; > > > > /* Remove task utilization for src CPU */ > > if (cpu == prev_cpu) > > util = max_t(long, util - task_util(p), 0); > > > > /* Add task utilization for dst CPU */ > > if (dst_cpu == cpu) > > util += task_util(p); > > > > BTW, CPU utilization is decayed value and task_util() is not decayed > > value, so 'util - task_util(p)' calculates a smaller value than the > > prev CPU pure utilization, right? > > task_util() is the raw PELT signal, without UTIL_EST, so I think it's > fine to do `util - task_util()`. > > > > > Another question is can we reuse the function cpu_util_wake() and > > just compenstate task util for dst cpu? > > Well it's not that simple. cpu_util_wake() will give you the max between > the util_avg and the util_est value, so which task_util() should you add > to it ? The util_avg or the uti_est value ? If feature 'UTIL_EST' is enabled, then add task's util_est; otherwise add task util_avg value. I think cpu_util_wake() has similiar logic with here, it merely returns CPU level util; but here needs to accumulate CPU level util + task level util. So seems to me, the logic is: cpu_util_wake() + task_util_wake() > Here we are trying to predict what will be the cpu_util signal in the > future, so the only always-correct implementation of this function has > to predict what will be the CPU util_avg and util_est signals in > parallel and take the max of the two. I totally agree with this, just want to check if can reuse existed code, so we can have more consistent logic accrossing scheduler. > > > + if (sched_feat(UTIL_EST)) { > > > + util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued); > > > + if (dst_cpu == cpu) > > > + util_est += _task_util_est(p); > > > + else > > > + util_est = max_t(long, util_est - _task_util_est(p), 0); > > > + util = max(util, util_est); > > > + } > > > + > > > + return min_t(unsigned long, util, capacity_orig_of(cpu)); > > > +} > > > + > > > +/* > > > + * Estimates the system level energy assuming that p wakes-up on dst_cpu. > > > + * > > > + * compute_energy() is safe to call only if an energy model is available for > > > + * the platform, which is when sched_energy_enabled() is true. 
> > > + */ > > > +static unsigned long compute_energy(struct task_struct *p, int dst_cpu) > > > +{ > > > + unsigned long util, max_util, sum_util; > > > + struct capacity_state *cs; > > > + unsigned long energy = 0; > > > + struct freq_domain *fd; > > > + int cpu; > > > + > > > + for_each_freq_domain(fd) { > > > + max_util = sum_util = 0; > > > + for_each_cpu_and(cpu, freq_domain_span(fd), cpu_online_mask) { > > > + util = cpu_util_next(cpu, p, dst_cpu); > > > + util += cpu_util_dl(cpu_rq(cpu)); > > > + max_util = max(util, max_util); > > > + sum_util += util; > > > + } > > > + > > > + /* > > > + * Here we assume that the capacity states of CPUs belonging to > > > + * the same frequency domains are shared. Hence, we look at the > > > + * capacity state of the first CPU and re-use it for all. > > > + */ > > > + cpu = cpumask_first(freq_domain_span(fd)); > > > + cs = find_cap_state(cpu, max_util); > > > + energy += cs->power * sum_util / cs->cap; > > > + } > > > > This means all CPUs will be iterated for calculation, the complexity is > > O(n)... > > > > > + return energy; > > > +} > > > + > > > +/* > > > * select_task_rq_fair: Select target runqueue for the waking task in domains > > > * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE, > > > * SD_BALANCE_FORK, or SD_BALANCE_EXEC. > > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > > > index 5d552c0d7109..6eb38f41d5d9 100644 > > > --- a/kernel/sched/sched.h > > > +++ b/kernel/sched/sched.h > > > @@ -2156,7 +2156,7 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {} > > > # define arch_scale_freq_invariant() false > > > #endif > > > > > > -#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL > > > +#ifdef CONFIG_SMP > > > static inline unsigned long cpu_util_dl(struct rq *rq) > > > { > > > return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT; > > > -- > > > 2.11.0 > > >
On Fri, Apr 06, 2018 at 04:36:05PM +0100, Dietmar Eggemann wrote: [...] > +/* > + * Estimates the system level energy assuming that p wakes-up on dst_cpu. > + * > + * compute_energy() is safe to call only if an energy model is available for > + * the platform, which is when sched_energy_enabled() is true. > + */ > +static unsigned long compute_energy(struct task_struct *p, int dst_cpu) > +{ > + unsigned long util, max_util, sum_util; > + struct capacity_state *cs; > + unsigned long energy = 0; > + struct freq_domain *fd; > + int cpu; > + > + for_each_freq_domain(fd) { > + max_util = sum_util = 0; > + for_each_cpu_and(cpu, freq_domain_span(fd), cpu_online_mask) { > + util = cpu_util_next(cpu, p, dst_cpu); > + util += cpu_util_dl(cpu_rq(cpu)); > + max_util = max(util, max_util); > + sum_util += util; > + } > + > + /* > + * Here we assume that the capacity states of CPUs belonging to > + * the same frequency domains are shared. Hence, we look at the > + * capacity state of the first CPU and re-use it for all. > + */ > + cpu = cpumask_first(freq_domain_span(fd)); > + cs = find_cap_state(cpu, max_util); > + energy += cs->power * sum_util / cs->cap; I am a bit worry about the resolution issue, especially when the capacity value is a quite high value and sum_util is a minor value. > + } > + > + return energy; > +}
On Wednesday 18 Apr 2018 at 17:19:28 (+0800), Leo Yan wrote: > > > BTW, CPU utilization is decayed value and task_util() is not decayed > > > value, so 'util - task_util(p)' calculates a smaller value than the > > > prev CPU pure utilization, right? > > > > task_util() is the raw PELT signal, without UTIL_EST, so I think it's > > fine to do `util - task_util()`. > > > > > > > > Another question is can we reuse the function cpu_util_wake() and > > > just compenstate task util for dst cpu? > > > > Well it's not that simple. cpu_util_wake() will give you the max between > > the util_avg and the util_est value, so which task_util() should you add > > to it ? The util_avg or the uti_est value ? > > If feature 'UTIL_EST' is enabled, then add task's util_est; otherwise > add task util_avg value. I don't think this is correct. If UTIL_EST is enabled, cpu_util_wake() will return the max between the util_avg and the util_est. When you call it, you don't know what you get. Adding the _task_util_est() to cpu_util_wake() is wrong if cpu_util_avg > cpu_util_est for example. > > I think cpu_util_wake() has similiar logic with here, it merely returns > CPU level util; but here needs to accumulate CPU level util + task level > util. So seems to me, the logic is: > > cpu_util_wake() + task_util_wake() I'm not sure to get that one ... What is task_util_wake() ? Thanks, Quentin
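A small example with made-up numbers may help illustrate the point being made here. Suppose on the destination CPU cfs_rq->avg.util_avg = 600 and cfs_rq->avg.util_est.enqueued = 310, while the waking task has task_util(p) = 50 and _task_util_est(p) = 200. The prediction cpu_util_next() makes is max(600 + 50, 310 + 200) = 650. Starting instead from the single value max(600, 310) = 600 that cpu_util_wake() would return, adding task_util() happens to give 650, but adding _task_util_est() gives 800. With util_avg = 300 and util_est = 310 on the CPU, the correct answer is max(350, 510) = 510, which only adding _task_util_est() reproduces. No single choice of task contribution works in both cases, which is why cpu_util_next() compensates the two signals separately before taking the max.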
Hi Quentin, On Fri, Apr 06, 2018 at 04:36:05PM +0100, Dietmar Eggemann wrote: > From: Quentin Perret <quentin.perret@arm.com> > > In preparation for the definition of an energy-aware wakeup path, a > helper function is provided to estimate the consequence on system energy > when a specific task wakes-up on a specific CPU. compute_energy() > estimates the OPPs to be reached by all frequency domains and estimates > the consumption of each online CPU according to its energy model and its > percentage of busy time. > > Cc: Ingo Molnar <mingo@redhat.com> > Cc: Peter Zijlstra <peterz@infradead.org> > Signed-off-by: Quentin Perret <quentin.perret@arm.com> > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> > --- > include/linux/sched/energy.h | 20 +++++++++++++ > kernel/sched/fair.c | 68 ++++++++++++++++++++++++++++++++++++++++++++ > kernel/sched/sched.h | 2 +- > 3 files changed, 89 insertions(+), 1 deletion(-) > > diff --git a/include/linux/sched/energy.h b/include/linux/sched/energy.h > index 941071eec013..b4110b145228 100644 > --- a/include/linux/sched/energy.h > +++ b/include/linux/sched/energy.h > @@ -27,6 +27,24 @@ static inline bool sched_energy_enabled(void) > return static_branch_unlikely(&sched_energy_present); > } > > +static inline > +struct capacity_state *find_cap_state(int cpu, unsigned long util) > +{ > + struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu); > + struct capacity_state *cs = NULL; > + int i; > + > + util += util >> 2; > + > + for (i = 0; i < em->nr_cap_states; i++) { > + cs = &em->cap_states[i]; > + if (cs->cap >= util) > + break; > + } > + > + return cs; > +} > + > static inline struct cpumask *freq_domain_span(struct freq_domain *fd) > { > return &fd->span; > @@ -42,6 +60,8 @@ struct freq_domain; > static inline bool sched_energy_enabled(void) { return false; } > static inline struct cpumask > *freq_domain_span(struct freq_domain *fd) { return NULL; } > +static inline struct capacity_state > +*find_cap_state(int cpu, unsigned long util) { return NULL; } > static inline void init_sched_energy(void) { } > #define for_each_freq_domain(fdom) for (; fdom; fdom = NULL) > #endif > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 6960e5ef3c14..8cb9fb04fff2 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -6633,6 +6633,74 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu) > } > > /* > + * Returns the util of "cpu" if "p" wakes up on "dst_cpu". > + */ > +static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu) > +{ > + unsigned long util, util_est; > + struct cfs_rq *cfs_rq; > + > + /* Task is where it should be, or has no impact on cpu */ > + if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu)) > + return cpu_util(cpu); > + > + cfs_rq = &cpu_rq(cpu)->cfs; > + util = READ_ONCE(cfs_rq->avg.util_avg); > + > + if (dst_cpu == cpu) > + util += task_util(p); > + else > + util = max_t(long, util - task_util(p), 0); > + > + if (sched_feat(UTIL_EST)) { > + util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued); > + if (dst_cpu == cpu) > + util_est += _task_util_est(p); > + else > + util_est = max_t(long, util_est - _task_util_est(p), 0); > + util = max(util, util_est); > + } > + > + return min_t(unsigned long, util, capacity_orig_of(cpu)); > +} > + > +/* > + * Estimates the system level energy assuming that p wakes-up on dst_cpu. 
> + * > + * compute_energy() is safe to call only if an energy model is available for > + * the platform, which is when sched_energy_enabled() is true. > + */ > +static unsigned long compute_energy(struct task_struct *p, int dst_cpu) > +{ > + unsigned long util, max_util, sum_util; > + struct capacity_state *cs; > + unsigned long energy = 0; > + struct freq_domain *fd; > + int cpu; > + > + for_each_freq_domain(fd) { > + max_util = sum_util = 0; > + for_each_cpu_and(cpu, freq_domain_span(fd), cpu_online_mask) { > + util = cpu_util_next(cpu, p, dst_cpu); > + util += cpu_util_dl(cpu_rq(cpu)); > + max_util = max(util, max_util); > + sum_util += util; > + } > + > + /* > + * Here we assume that the capacity states of CPUs belonging to > + * the same frequency domains are shared. Hence, we look at the > + * capacity state of the first CPU and re-use it for all. > + */ > + cpu = cpumask_first(freq_domain_span(fd)); > + cs = find_cap_state(cpu, max_util); > + energy += cs->power * sum_util / cs->cap; > + } Sorry I introduce mess at here to spread my questions in several replying, later will try to ask questions in one replying. Below are more questions which it's good to bring up: The code for energy computation is quite neat and simple, but I think the energy computation mixes two concepts for CPU util: one concept is the estimated CPU util which is used to select CPU OPP in schedutil, another concept is the raw CPU util according to CPU real running time; for example, cpu_util_next() predicts CPU util but this value might be much higher than cpu_util(), especially after enabled UTIL_EST feature (I have shallow understanding for UTIL_EST so correct me as needed); but this patch simply computes CPU capacity and energy with the single one CPU utilization value (and it will be an inflated value afte enable UTIL_EST). Is this purposed for simple implementation? IMHO, cpu_util_next() can be used to predict CPU capacity, on the other hand, should we use the CPU util without UTIL_EST capping for 'sum_util', this can be more reasonable to reflect the CPU utilization? Furthermore, if we consider RT thread is running on CPU and connect with 'schedutil' governor, the CPU will run at maximum frequency, but we cannot say the CPU has 100% utilization. The RT thread case is not handled in this patch. > + > + return energy; > +} > + > +/* > * select_task_rq_fair: Select target runqueue for the waking task in domains > * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE, > * SD_BALANCE_FORK, or SD_BALANCE_EXEC. > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index 5d552c0d7109..6eb38f41d5d9 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -2156,7 +2156,7 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {} > # define arch_scale_freq_invariant() false > #endif > > -#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL > +#ifdef CONFIG_SMP > static inline unsigned long cpu_util_dl(struct rq *rq) > { > return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT; > -- > 2.11.0 >
Hi Leo, On Wednesday 18 Apr 2018 at 20:15:47 (+0800), Leo Yan wrote: > Sorry I introduce mess at here to spread my questions in several > replying, later will try to ask questions in one replying. Below are > more questions which it's good to bring up: > > The code for energy computation is quite neat and simple, but I think > the energy computation mixes two concepts for CPU util: one concept is > the estimated CPU util which is used to select CPU OPP in schedutil, > another concept is the raw CPU util according to CPU real running time; > for example, cpu_util_next() predicts CPU util but this value might be > much higher than cpu_util(), especially after enabled UTIL_EST feature > (I have shallow understanding for UTIL_EST so correct me as needed); I'm not not sure to understand what you mean by higher than cpu_util() here ... In which case would that happen ? cpu_util_next() is basically used to figure out what will be the cpu_util() of CPU A after task p has been enqueued on CPU B (no matter what A and B are). > but this patch simply computes CPU capacity and energy with the single > one CPU utilization value (and it will be an inflated value afte enable > UTIL_EST). Is this purposed for simple implementation? > > IMHO, cpu_util_next() can be used to predict CPU capacity, on the other > hand, should we use the CPU util without UTIL_EST capping for 'sum_util', > this can be more reasonable to reflect the CPU utilization? Why would a decayed utilisation be a better estimate of the time that a task is going to spend on a CPU ? > > Furthermore, if we consider RT thread is running on CPU and connect with > 'schedutil' governor, the CPU will run at maximum frequency, but we > cannot say the CPU has 100% utilization. The RT thread case is not > handled in this patch. Right, we don't account for RT tasks in the OPP prediction for now. Vincent's patches to have a util_avg for RT runqueues could help us do that I suppose ... Thanks ! Quentin
On Wednesday 18 Apr 2018 at 17:23:16 (+0800), Leo Yan wrote: > On Fri, Apr 06, 2018 at 04:36:05PM +0100, Dietmar Eggemann wrote: > > [...] > > > +/* > > + * Estimates the system level energy assuming that p wakes-up on dst_cpu. > > + * > > + * compute_energy() is safe to call only if an energy model is available for > > + * the platform, which is when sched_energy_enabled() is true. > > + */ > > +static unsigned long compute_energy(struct task_struct *p, int dst_cpu) > > +{ > > + unsigned long util, max_util, sum_util; > > + struct capacity_state *cs; > > + unsigned long energy = 0; > > + struct freq_domain *fd; > > + int cpu; > > + > > + for_each_freq_domain(fd) { > > + max_util = sum_util = 0; > > + for_each_cpu_and(cpu, freq_domain_span(fd), cpu_online_mask) { > > + util = cpu_util_next(cpu, p, dst_cpu); > > + util += cpu_util_dl(cpu_rq(cpu)); > > + max_util = max(util, max_util); > > + sum_util += util; > > + } > > + > > + /* > > + * Here we assume that the capacity states of CPUs belonging to > > + * the same frequency domains are shared. Hence, we look at the > > + * capacity state of the first CPU and re-use it for all. > > + */ > > + cpu = cpumask_first(freq_domain_span(fd)); > > + cs = find_cap_state(cpu, max_util); > > + energy += cs->power * sum_util / cs->cap; > > I am a bit worry about the resolution issue, especially when the > capacity value is a quite high value and sum_util is a minor value. Good point. As of now, the cs->power values happen to be in micro-watts for the platforms we've been testing with, so they're typically high enough to avoid significant resolution problem I guess. Now, the energy model code has to be reworked significantly as we have to remove the dependency on PM_OPP, so I'll try to make sure to keep this issue in mind for the next version. Thanks, Quentin
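To make the resolution concern concrete with made-up numbers: with cs->power expressed in micro-watts, say cs->power = 120000, cs->cap = 512 and sum_util = 10, the term contributes 120000 * 10 / 512 = 2343, so small utilization differences between candidate placements still show up in the comparison. If the same OPP were described in milli-watts (cs->power = 120), the integer division would give 120 * 10 / 512 = 2 and 120 * 12 / 512 = 2 as well, so two noticeably different placements would become indistinguishable. Keeping the power values in a fine-grained unit, or scaling them up internally, is one way to side-step this.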
On Fri, Apr 20, 2018 at 03:42:45PM +0100, Quentin Perret wrote: > Hi Leo, > > On Wednesday 18 Apr 2018 at 20:15:47 (+0800), Leo Yan wrote: > > Sorry I introduce mess at here to spread my questions in several > > replying, later will try to ask questions in one replying. Below are > > more questions which it's good to bring up: > > > > The code for energy computation is quite neat and simple, but I think > > the energy computation mixes two concepts for CPU util: one concept is > > the estimated CPU util which is used to select CPU OPP in schedutil, > > another concept is the raw CPU util according to CPU real running time; > > for example, cpu_util_next() predicts CPU util but this value might be > > much higher than cpu_util(), especially after enabled UTIL_EST feature > > (I have shallow understanding for UTIL_EST so correct me as needed); > > I'm not not sure to understand what you mean by higher than cpu_util() > here ... In which case would that happen ? After UTIL_EST feature is enabled, cpu_util_next() returns higher value than cpu_util(), see below code 'util = max(util, util_est);'; as result cpu_util_next() takes consideration for extra compensention introduced by UTIL_EST. if (sched_feat(UTIL_EST)) { util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued); if (dst_cpu == cpu) util_est += _task_util_est(p); else util_est = max_t(long, util_est - _task_util_est(p), 0); util = max(util, util_est); } > cpu_util_next() is basically used to figure out what will be the > cpu_util() of CPU A after task p has been enqueued on CPU B (no matter > what A and B are). Same with upper description, cpu_util_next() is not the same thing with cpu_util(), cpu_util_next() takes consideration for extra compensention introduced by UTIL_EST. > > but this patch simply computes CPU capacity and energy with the single > > one CPU utilization value (and it will be an inflated value afte enable > > UTIL_EST). Is this purposed for simple implementation? > > > > IMHO, cpu_util_next() can be used to predict CPU capacity, on the other > > hand, should we use the CPU util without UTIL_EST capping for 'sum_util', > > this can be more reasonable to reflect the CPU utilization? > > Why would a decayed utilisation be a better estimate of the time that > a task is going to spend on a CPU ? IIUC, in the scheduler waken up path task_util() is the task utilisation before task sleeping, so it's not a decayed value. cpu_util() is decayed value, but is this just we want to reflect cpu historic utilisation at the recent past time? This is the reason I bring up to use 'cpu_util() + task_util()' as estimation. I understand this patch tries to use pre-decayed value, please review below example has issue or not: if one CPU's cfs_rq->avg.util_est.enqueued is quite high value, then this CPU enter idle state and sleep for long while, if we use cfs_rq->avg.util_est.enqueued to estimate CPU utilisation, this might have big deviation than the CPU run time if place wake task on it? On the other hand, cpu_util() can decay for CPU idle time... > > Furthermore, if we consider RT thread is running on CPU and connect with > > 'schedutil' governor, the CPU will run at maximum frequency, but we > > cannot say the CPU has 100% utilization. The RT thread case is not > > handled in this patch. > > Right, we don't account for RT tasks in the OPP prediction for now. > Vincent's patches to have a util_avg for RT runqueues could help us > do that I suppose ... Good to know this. > Thanks ! > Quentin
Hi Leo, Sorry for the delay in responding... On Saturday 21 Apr 2018 at 00:27:53 (+0800), Leo Yan wrote: > On Fri, Apr 20, 2018 at 03:42:45PM +0100, Quentin Perret wrote: > > Hi Leo, > > > > On Wednesday 18 Apr 2018 at 20:15:47 (+0800), Leo Yan wrote: > > > Sorry I introduce mess at here to spread my questions in several > > > replying, later will try to ask questions in one replying. Below are > > > more questions which it's good to bring up: > > > > > > The code for energy computation is quite neat and simple, but I think > > > the energy computation mixes two concepts for CPU util: one concept is > > > the estimated CPU util which is used to select CPU OPP in schedutil, > > > another concept is the raw CPU util according to CPU real running time; > > > for example, cpu_util_next() predicts CPU util but this value might be > > > much higher than cpu_util(), especially after enabled UTIL_EST feature > > > (I have shallow understanding for UTIL_EST so correct me as needed); > > > > I'm not not sure to understand what you mean by higher than cpu_util() > > here ... In which case would that happen ? > > After UTIL_EST feature is enabled, cpu_util_next() returns higher value > than cpu_util(), see below code 'util = max(util, util_est);'; as > result cpu_util_next() takes consideration for extra compensention > introduced by UTIL_EST. > > if (sched_feat(UTIL_EST)) { > util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued); > if (dst_cpu == cpu) > util_est += _task_util_est(p); > else > util_est = max_t(long, util_est - _task_util_est(p), 0); > util = max(util, util_est); > } So, cpu_util() accounts for the UTIL_EST compensation: static inline unsigned long cpu_util(int cpu) { struct cfs_rq *cfs_rq; unsigned int util; cfs_rq = &cpu_rq(cpu)->cfs; util = READ_ONCE(cfs_rq->avg.util_avg); if (sched_feat(UTIL_EST)) util = max(util, READ_ONCE(cfs_rq->avg.util_est.enqueued)); return min_t(unsigned long, util, capacity_orig_of(cpu)); } So cpu_util_next() just mimics that. > > > cpu_util_next() is basically used to figure out what will be the > > cpu_util() of CPU A after task p has been enqueued on CPU B (no matter > > what A and B are). > > Same with upper description, cpu_util_next() is not the same thing > with cpu_util(), cpu_util_next() takes consideration for extra > compensention introduced by UTIL_EST. > > > > but this patch simply computes CPU capacity and energy with the single > > > one CPU utilization value (and it will be an inflated value afte enable > > > UTIL_EST). Is this purposed for simple implementation? > > > > > > IMHO, cpu_util_next() can be used to predict CPU capacity, on the other > > > hand, should we use the CPU util without UTIL_EST capping for 'sum_util', > > > this can be more reasonable to reflect the CPU utilization? > > > > Why would a decayed utilisation be a better estimate of the time that > > a task is going to spend on a CPU ? > > IIUC, in the scheduler waken up path task_util() is the task utilisation > before task sleeping, so it's not a decayed value. I don't think this is correct. sync_entity_load_avg() is called in select_task_rq_fair() so task_util() *is* decayed upon wakeup. > cpu_util() is > decayed value, This is not necessarily correct either. As mentioned above, cpu_util() includes the UTIL_EST compensation, so the value isn't necessarily decayed. > but is this just we want to reflect cpu historic > utilisation at the recent past time? This is the reason I bring up to > use 'cpu_util() + task_util()' as estimation. 
> > I understand this patch tries to use pre-decayed value, No, this patch tries to estimate what will be the return value of cpu_util() if the task is enqueued on a specific CPU. This value can be the util_avg (decayed) or the util_est (non-decayed) depending on the conditions. > please review > below example has issue or not: > if one CPU's cfs_rq->avg.util_est.enqueued is quite high value, then this > CPU enter idle state and sleep for long while, if we use > cfs_rq->avg.util_est.enqueued to estimate CPU utilisation, this might > have big deviation than the CPU run time if place wake task on it? On > the other hand, cpu_util() can decay for CPU idle time... > > > > Furthermore, if we consider RT thread is running on CPU and connect with > > > 'schedutil' governor, the CPU will run at maximum frequency, but we > > > cannot say the CPU has 100% utilization. The RT thread case is not > > > handled in this patch. > > > > Right, we don't account for RT tasks in the OPP prediction for now. > > Vincent's patches to have a util_avg for RT runqueues could help us > > do that I suppose ... > > Good to know this. > > > Thanks ! > > Quentin
diff --git a/include/linux/sched/energy.h b/include/linux/sched/energy.h index 941071eec013..b4110b145228 100644 --- a/include/linux/sched/energy.h +++ b/include/linux/sched/energy.h @@ -27,6 +27,24 @@ static inline bool sched_energy_enabled(void) return static_branch_unlikely(&sched_energy_present); } +static inline +struct capacity_state *find_cap_state(int cpu, unsigned long util) +{ + struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu); + struct capacity_state *cs = NULL; + int i; + + util += util >> 2; + + for (i = 0; i < em->nr_cap_states; i++) { + cs = &em->cap_states[i]; + if (cs->cap >= util) + break; + } + + return cs; +} + static inline struct cpumask *freq_domain_span(struct freq_domain *fd) { return &fd->span; @@ -42,6 +60,8 @@ struct freq_domain; static inline bool sched_energy_enabled(void) { return false; } static inline struct cpumask *freq_domain_span(struct freq_domain *fd) { return NULL; } +static inline struct capacity_state +*find_cap_state(int cpu, unsigned long util) { return NULL; } static inline void init_sched_energy(void) { } #define for_each_freq_domain(fdom) for (; fdom; fdom = NULL) #endif diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6960e5ef3c14..8cb9fb04fff2 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6633,6 +6633,74 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu) } /* + * Returns the util of "cpu" if "p" wakes up on "dst_cpu". + */ +static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu) +{ + unsigned long util, util_est; + struct cfs_rq *cfs_rq; + + /* Task is where it should be, or has no impact on cpu */ + if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu)) + return cpu_util(cpu); + + cfs_rq = &cpu_rq(cpu)->cfs; + util = READ_ONCE(cfs_rq->avg.util_avg); + + if (dst_cpu == cpu) + util += task_util(p); + else + util = max_t(long, util - task_util(p), 0); + + if (sched_feat(UTIL_EST)) { + util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued); + if (dst_cpu == cpu) + util_est += _task_util_est(p); + else + util_est = max_t(long, util_est - _task_util_est(p), 0); + util = max(util, util_est); + } + + return min_t(unsigned long, util, capacity_orig_of(cpu)); +} + +/* + * Estimates the system level energy assuming that p wakes-up on dst_cpu. + * + * compute_energy() is safe to call only if an energy model is available for + * the platform, which is when sched_energy_enabled() is true. + */ +static unsigned long compute_energy(struct task_struct *p, int dst_cpu) +{ + unsigned long util, max_util, sum_util; + struct capacity_state *cs; + unsigned long energy = 0; + struct freq_domain *fd; + int cpu; + + for_each_freq_domain(fd) { + max_util = sum_util = 0; + for_each_cpu_and(cpu, freq_domain_span(fd), cpu_online_mask) { + util = cpu_util_next(cpu, p, dst_cpu); + util += cpu_util_dl(cpu_rq(cpu)); + max_util = max(util, max_util); + sum_util += util; + } + + /* + * Here we assume that the capacity states of CPUs belonging to + * the same frequency domains are shared. Hence, we look at the + * capacity state of the first CPU and re-use it for all. + */ + cpu = cpumask_first(freq_domain_span(fd)); + cs = find_cap_state(cpu, max_util); + energy += cs->power * sum_util / cs->cap; + } + + return energy; +} + +/* * select_task_rq_fair: Select target runqueue for the waking task in domains * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE, * SD_BALANCE_FORK, or SD_BALANCE_EXEC. 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 5d552c0d7109..6eb38f41d5d9 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2156,7 +2156,7 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {} # define arch_scale_freq_invariant() false #endif -#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL +#ifdef CONFIG_SMP static inline unsigned long cpu_util_dl(struct rq *rq) { return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;