[RFC,v2,4/6] sched/fair: Introduce an energy estimation helper function

Message ID 20180406153607.17815-5-dietmar.eggemann@arm.com (mailing list archive)
State RFC, archived

Commit Message

Dietmar Eggemann April 6, 2018, 3:36 p.m. UTC
From: Quentin Perret <quentin.perret@arm.com>

In preparation for the definition of an energy-aware wakeup path, a
helper function is provided to estimate the consequence on system energy
when a specific task wakes up on a specific CPU. compute_energy()
estimates the OPPs to be reached by all frequency domains and estimates
the consumption of each online CPU according to its energy model and its
percentage of busy time.
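
As a rough illustration of the arithmetic (with made-up numbers, not
taken from any real platform): consider a frequency domain with two
online CPUs whose estimated utilizations, once the task is placed, are
300 and 100, so max_util = 300 and sum_util = 400.  find_cap_state()
adds a ~25% margin (300 + 300/4 = 375) and picks the first capacity
state with cap >= 375, say {cap = 512, power = 150}; that domain then
contributes 150 * 400 / 512 ~= 117 energy units, and summing over all
frequency domains gives the system-level estimate.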

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 include/linux/sched/energy.h | 20 +++++++++++++
 kernel/sched/fair.c          | 68 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h         |  2 +-
 3 files changed, 89 insertions(+), 1 deletion(-)

Comments

Peter Zijlstra April 10, 2018, 12:51 p.m. UTC | #1
On Fri, Apr 06, 2018 at 04:36:05PM +0100, Dietmar Eggemann wrote:
> +static inline
> +struct capacity_state *find_cap_state(int cpu, unsigned long util)
> +{
> +	struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu);
> +	struct capacity_state *cs = NULL;
> +	int i;
> +
> +	util += util >> 2;
> +
> +	for (i = 0; i < em->nr_cap_states; i++) {
> +		cs = &em->cap_states[i];
> +		if (cs->cap >= util)
> +			break;
> +	}
> +
> +	return cs;
> +}

So in the last thread there was some discussion about this; in
particular on how this related to schedutil and if we should tie it into
that.

I think for starters tying it to schedutil is not a bad idea; ideally
people _should_ migrate towards using that.

Also; I think it makes sense to better integrate cpufreq and the
energy-model values like what rjw already suggested, such that maybe we
can have cpufreq_driver_resolve_freq() return a structure containing the
relevant information for the selected frequency.
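
Something along these lines perhaps (field names made up, just to
illustrate the shape of it):

	struct cpufreq_resolved {
		unsigned int	freq;	/* selected frequency in kHz */
		unsigned long	cap;	/* capacity at that frequency */
		unsigned long	power;	/* power cost at that frequency */
	};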

But implementing the frequency selection thing in multiple places like
now sounds like a very bad idea to me.
Quentin Perret April 10, 2018, 1:56 p.m. UTC | #2
On Tuesday 10 Apr 2018 at 14:51:05 (+0200), Peter Zijlstra wrote:
> On Fri, Apr 06, 2018 at 04:36:05PM +0100, Dietmar Eggemann wrote:
> > +static inline
> > +struct capacity_state *find_cap_state(int cpu, unsigned long util)
> > +{
> > +	struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu);
> > +	struct capacity_state *cs = NULL;
> > +	int i;
> > +
> > +	util += util >> 2;
> > +
> > +	for (i = 0; i < em->nr_cap_states; i++) {
> > +		cs = &em->cap_states[i];
> > +		if (cs->cap >= util)
> > +			break;
> > +	}
> > +
> > +	return cs;
> > +}
> 
> So in the last thread there was some discussion about this; in
> particular on how this related to schedutil and if we should tie it into
> that.
> 
> I think for starters tying it to schedutil is not a bad idea; ideally
> people _should_ migrate towards using that.
> 
> Also; I think it makes sense to better integrate cpufreq and the
> energy-model values like what rjw already suggested, such that maybe we
> can have cpufreq_driver_resolve_freq() return a structure containing the
> relevant information for the selected frequency.

I guess if we want to do that in the wake-up path, we would also need to
add a new parameter to it to make sure we don't actually call into
cpufreq_driver->resolve_freq() ...

But then, we could sort of rely on cpufreq_schedutil.c::get_next_freq()
to replace find_cap_state()... Is this what you had in mind?

> 
> But implementing the frequency selection thing in multiple places like
> now sounds like a very bad idea to me.

Understood. Making sure we share the same code everywhere might have
consequences though. I guess we'll have to either accept the cost of
function calls in the wake-up path, or accept inlining those functions,
for example. Or maybe you had something else in mind?

Anyways, that's probably another good discussion topic for OSPM
next week :-)

Thanks,
Quentin
Peter Zijlstra April 10, 2018, 2:08 p.m. UTC | #3
On Tue, Apr 10, 2018 at 02:56:41PM +0100, Quentin Perret wrote:
> > So in the last thread there was some discussion about this; in
> > particular on how this related to schedutil and if we should tie it into
> > that.
> > 
> > I think for starters tying it to schedutil is not a bad idea; ideally
> > people _should_ migrate towards using that.
> > 
> > Also; I think it makes sense to better integrate cpufreq and the
> > energy-model values like what rjw already suggested, such that maybe we
> > can have cpufreq_driver_resolve_freq() return a structure containing the
> > relevant information for the selected frequency.
> 
> I guess if we want to do that in the wake-up path, we would also need to
> add a new parameter to it to make sure we don't actually call into
> cpufreq_driver->resolve_freq() ...
> 
> But then, we could sort of rely on cpufreq_schedutil.c::get_next_freq()
> to replace find_cap_state() ... Is this what you had in mind ?

Yes, something along those lines; we could also of course factor
get_next_freq() into two parts.
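
E.g. split the pure util-to-frequency mapping (which uses the same 1.25
margin as the 'util += util >> 2' in find_cap_state()) out of the part
that talks to the driver; a rough sketch only, with a made-up name:

	static inline unsigned long map_util_freq(unsigned long util,
						  unsigned long freq,
						  unsigned long max)
	{
		return (freq + (freq >> 2)) * util / max;
	}

Then both schedutil and the energy code could share that, and only
schedutil would go on to call cpufreq_driver_resolve_freq().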

> > But implementing the frequency selection thing in multiple places like
> > now sounds like a very bad idea to me.
> 
> Understood. Making sure we share the same code everywhere might have
> consequences though. I guess we'll have to either accept the cost of
> function calls in the wake-up path, or to accept to inline those
> functions for ex. Or maybe you had something else in mind ?
> 
> Anyways, that's probably another good discussion topic for OSPM
> next week :-)

Yes, that wants a wee bit of discussion. Ideally we'd have a shared data
structure we can iterate instead of a chain of indirect calls.
Viresh Kumar April 13, 2018, 6:27 a.m. UTC | #4
On 06-04-18, 16:36, Dietmar Eggemann wrote:
> +static inline struct capacity_state
> +*find_cap_state(int cpu, unsigned long util) { return NULL; }

I saw this somewhere else as well in this series. I believe the line break
should happen after "*" as "struct capacity_state *" should be read together as
one entity.
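
I.e. something like:

	static inline struct capacity_state *
	find_cap_state(int cpu, unsigned long util) { return NULL; }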
Leo Yan April 17, 2018, 3:22 p.m. UTC | #5
On Fri, Apr 06, 2018 at 04:36:05PM +0100, Dietmar Eggemann wrote:
> From: Quentin Perret <quentin.perret@arm.com>
> 
> In preparation for the definition of an energy-aware wakeup path, a
> helper function is provided to estimate the consequence on system energy
> when a specific task wakes-up on a specific CPU. compute_energy()
> estimates the OPPs to be reached by all frequency domains and estimates
> the consumption of each online CPU according to its energy model and its
> percentage of busy time.
> 
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Quentin Perret <quentin.perret@arm.com>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> ---
>  include/linux/sched/energy.h | 20 +++++++++++++
>  kernel/sched/fair.c          | 68 ++++++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/sched.h         |  2 +-
>  3 files changed, 89 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/sched/energy.h b/include/linux/sched/energy.h
> index 941071eec013..b4110b145228 100644
> --- a/include/linux/sched/energy.h
> +++ b/include/linux/sched/energy.h
> @@ -27,6 +27,24 @@ static inline bool sched_energy_enabled(void)
>  	return static_branch_unlikely(&sched_energy_present);
>  }
>  
> +static inline
> +struct capacity_state *find_cap_state(int cpu, unsigned long util)
> +{
> +	struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu);
> +	struct capacity_state *cs = NULL;
> +	int i;
> +
> +	util += util >> 2;
> +
> +	for (i = 0; i < em->nr_cap_states; i++) {
> +		cs = &em->cap_states[i];
> +		if (cs->cap >= util)
> +			break;
> +	}
> +
> +	return cs;

It's possible for 'cs' to be returned as NULL here.

> +}
> +
>  static inline struct cpumask *freq_domain_span(struct freq_domain *fd)
>  {
>  	return &fd->span;
> @@ -42,6 +60,8 @@ struct freq_domain;
>  static inline bool sched_energy_enabled(void) { return false; }
>  static inline struct cpumask
>  *freq_domain_span(struct freq_domain *fd) { return NULL; }
> +static inline struct capacity_state
> +*find_cap_state(int cpu, unsigned long util) { return NULL; }
>  static inline void init_sched_energy(void) { }
>  #define for_each_freq_domain(fdom) for (; fdom; fdom = NULL)
>  #endif
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6960e5ef3c14..8cb9fb04fff2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6633,6 +6633,74 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
>  }
>  
>  /*
> + * Returns the util of "cpu" if "p" wakes up on "dst_cpu".
> + */
> +static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
> +{
> +	unsigned long util, util_est;
> +	struct cfs_rq *cfs_rq;
> +
> +	/* Task is where it should be, or has no impact on cpu */
> +	if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu))
> +		return cpu_util(cpu);
> +
> +	cfs_rq = &cpu_rq(cpu)->cfs;
> +	util = READ_ONCE(cfs_rq->avg.util_avg);
> +
> +	if (dst_cpu == cpu)
> +		util += task_util(p);
> +	else
> +		util = max_t(long, util - task_util(p), 0);

I tried to understand the logic here; the code below is clearer to
me:

        int prev_cpu = task_cpu(p);

        cfs_rq = &cpu_rq(cpu)->cfs;
        util = READ_ONCE(cfs_rq->avg.util_avg);

        /* Bail out if src and dst CPUs are the same one */
        if (prev_cpu == cpu && dst_cpu == cpu)
                return util;

        /* Remove task utilization for src CPU */
        if (cpu == prev_cpu)
                util = max_t(long, util - task_util(p), 0);

        /* Add task utilization for dst CPU */
        if (dst_cpu == cpu)
                util += task_util(p);

BTW, the CPU utilization is a decayed value and task_util() is not a
decayed value, so 'util - task_util(p)' calculates a smaller value than
the prev CPU's pure utilization, right?

Another question is: can we reuse the function cpu_util_wake() and
just compensate for the task util on the dst CPU?

> +	if (sched_feat(UTIL_EST)) {
> +		util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> +		if (dst_cpu == cpu)
> +			util_est += _task_util_est(p);
> +		else
> +			util_est = max_t(long, util_est - _task_util_est(p), 0);
> +		util = max(util, util_est);
> +	}
> +
> +	return min_t(unsigned long, util, capacity_orig_of(cpu));
> +}
> +
> +/*
> + * Estimates the system level energy assuming that p wakes-up on dst_cpu.
> + *
> + * compute_energy() is safe to call only if an energy model is available for
> + * the platform, which is when sched_energy_enabled() is true.
> + */
> +static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
> +{
> +	unsigned long util, max_util, sum_util;
> +	struct capacity_state *cs;
> +	unsigned long energy = 0;
> +	struct freq_domain *fd;
> +	int cpu;
> +
> +	for_each_freq_domain(fd) {
> +		max_util = sum_util = 0;
> +		for_each_cpu_and(cpu, freq_domain_span(fd), cpu_online_mask) {
> +			util = cpu_util_next(cpu, p, dst_cpu);
> +			util += cpu_util_dl(cpu_rq(cpu));
> +			max_util = max(util, max_util);
> +			sum_util += util;
> +		}
> +
> +		/*
> +		 * Here we assume that the capacity states of CPUs belonging to
> +		 * the same frequency domains are shared. Hence, we look at the
> +		 * capacity state of the first CPU and re-use it for all.
> +		 */
> +		cpu = cpumask_first(freq_domain_span(fd));
> +		cs = find_cap_state(cpu, max_util);
> +		energy += cs->power * sum_util / cs->cap;
> +	}

This means all CPUs will be iterated over for the calculation, so the
complexity is O(n)...

Thanks,
Leo Yan

> +	return energy;
> +}
> +
> +/*
>   * select_task_rq_fair: Select target runqueue for the waking task in domains
>   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
>   * SD_BALANCE_FORK, or SD_BALANCE_EXEC.
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 5d552c0d7109..6eb38f41d5d9 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2156,7 +2156,7 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
>  # define arch_scale_freq_invariant()	false
>  #endif
>  
> -#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> +#ifdef CONFIG_SMP
>  static inline unsigned long cpu_util_dl(struct rq *rq)
>  {
>  	return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> -- 
> 2.11.0
>
Quentin Perret April 18, 2018, 8:13 a.m. UTC | #6
On Tuesday 17 Apr 2018 at 23:22:13 (+0800), Leo Yan wrote:
> On Fri, Apr 06, 2018 at 04:36:05PM +0100, Dietmar Eggemann wrote:
> > From: Quentin Perret <quentin.perret@arm.com>
> > 
> > In preparation for the definition of an energy-aware wakeup path, a
> > helper function is provided to estimate the consequence on system energy
> > when a specific task wakes-up on a specific CPU. compute_energy()
> > estimates the OPPs to be reached by all frequency domains and estimates
> > the consumption of each online CPU according to its energy model and its
> > percentage of busy time.
> > 
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Quentin Perret <quentin.perret@arm.com>
> > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > ---
> >  include/linux/sched/energy.h | 20 +++++++++++++
> >  kernel/sched/fair.c          | 68 ++++++++++++++++++++++++++++++++++++++++++++
> >  kernel/sched/sched.h         |  2 +-
> >  3 files changed, 89 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/sched/energy.h b/include/linux/sched/energy.h
> > index 941071eec013..b4110b145228 100644
> > --- a/include/linux/sched/energy.h
> > +++ b/include/linux/sched/energy.h
> > @@ -27,6 +27,24 @@ static inline bool sched_energy_enabled(void)
> >  	return static_branch_unlikely(&sched_energy_present);
> >  }
> >  
> > +static inline
> > +struct capacity_state *find_cap_state(int cpu, unsigned long util)
> > +{
> > +	struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu);
> > +	struct capacity_state *cs = NULL;
> > +	int i;
> > +
> > +	util += util >> 2;
> > +
> > +	for (i = 0; i < em->nr_cap_states; i++) {
> > +		cs = &em->cap_states[i];
> > +		if (cs->cap >= util)
> > +			break;
> > +	}
> > +
> > +	return cs;
> 
> 'cs' is possible to return NULL.

Only if em->nr_cap_states == 0, and that shouldn't be possible if
sched_energy_present == true, so this code should be safe :-)

> 
> > +}
> > +
> >  static inline struct cpumask *freq_domain_span(struct freq_domain *fd)
> >  {
> >  	return &fd->span;
> > @@ -42,6 +60,8 @@ struct freq_domain;
> >  static inline bool sched_energy_enabled(void) { return false; }
> >  static inline struct cpumask
> >  *freq_domain_span(struct freq_domain *fd) { return NULL; }
> > +static inline struct capacity_state
> > +*find_cap_state(int cpu, unsigned long util) { return NULL; }
> >  static inline void init_sched_energy(void) { }
> >  #define for_each_freq_domain(fdom) for (; fdom; fdom = NULL)
> >  #endif
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6960e5ef3c14..8cb9fb04fff2 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6633,6 +6633,74 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> >  }
> >  
> >  /*
> > + * Returns the util of "cpu" if "p" wakes up on "dst_cpu".
> > + */
> > +static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
> > +{
> > +	unsigned long util, util_est;
> > +	struct cfs_rq *cfs_rq;
> > +
> > +	/* Task is where it should be, or has no impact on cpu */
> > +	if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu))
> > +		return cpu_util(cpu);
> > +
> > +	cfs_rq = &cpu_rq(cpu)->cfs;
> > +	util = READ_ONCE(cfs_rq->avg.util_avg);
> > +
> > +	if (dst_cpu == cpu)
> > +		util += task_util(p);
> > +	else
> > +		util = max_t(long, util - task_util(p), 0);
> 
> I tried to understand the logic at here, below code is more clear for
> myself:
> 
>         int prev_cpu = task_cpu(p);
> 
>         cfs_rq = &cpu_rq(cpu)->cfs;
>         util = READ_ONCE(cfs_rq->avg.util_avg);
> 
>         /* Bail out if src and dst CPUs are the same one */
>         if (prev_cpu == cpu && dst_cpu == cpu)
>                 return util;
> 
>         /* Remove task utilization for src CPU */
>         if (cpu == prev_cpu)
>                 util = max_t(long, util - task_util(p), 0);
> 
>         /* Add task utilization for dst CPU */
>         if (dst_cpu == cpu)
>                 util += task_util(p);
> 
> BTW, CPU utilization is decayed value and task_util() is not decayed
> value, so 'util - task_util(p)' calculates a smaller value than the
> prev CPU pure utilization, right?

task_util() is the raw PELT signal, without UTIL_EST, so I think it's
fine to do `util - task_util()`.

> 
> Another question is can we reuse the function cpu_util_wake() and
> just compenstate task util for dst cpu?

Well, it's not that simple. cpu_util_wake() will give you the max between
the util_avg and the util_est value, so which task_util() should you add
to it? The util_avg or the util_est value?

Here we are trying to predict what will be the cpu_util signal in the
future, so the only always-correct implementation of this function has
to predict what will be the CPU util_avg and util_est signals in
parallel and take the max of the two.

> 
> > +	if (sched_feat(UTIL_EST)) {
> > +		util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> > +		if (dst_cpu == cpu)
> > +			util_est += _task_util_est(p);
> > +		else
> > +			util_est = max_t(long, util_est - _task_util_est(p), 0);
> > +		util = max(util, util_est);
> > +	}
> > +
> > +	return min_t(unsigned long, util, capacity_orig_of(cpu));
> > +}
> > +
> > +/*
> > + * Estimates the system level energy assuming that p wakes-up on dst_cpu.
> > + *
> > + * compute_energy() is safe to call only if an energy model is available for
> > + * the platform, which is when sched_energy_enabled() is true.
> > + */
> > +static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
> > +{
> > +	unsigned long util, max_util, sum_util;
> > +	struct capacity_state *cs;
> > +	unsigned long energy = 0;
> > +	struct freq_domain *fd;
> > +	int cpu;
> > +
> > +	for_each_freq_domain(fd) {
> > +		max_util = sum_util = 0;
> > +		for_each_cpu_and(cpu, freq_domain_span(fd), cpu_online_mask) {
> > +			util = cpu_util_next(cpu, p, dst_cpu);
> > +			util += cpu_util_dl(cpu_rq(cpu));
> > +			max_util = max(util, max_util);
> > +			sum_util += util;
> > +		}
> > +
> > +		/*
> > +		 * Here we assume that the capacity states of CPUs belonging to
> > +		 * the same frequency domains are shared. Hence, we look at the
> > +		 * capacity state of the first CPU and re-use it for all.
> > +		 */
> > +		cpu = cpumask_first(freq_domain_span(fd));
> > +		cs = find_cap_state(cpu, max_util);
> > +		energy += cs->power * sum_util / cs->cap;
> > +	}
> 
> This means all CPUs will be iterated for calculation, the complexity is
> O(n)...
> 
> > +	return energy;
> > +}
> > +
> > +/*
> >   * select_task_rq_fair: Select target runqueue for the waking task in domains
> >   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> >   * SD_BALANCE_FORK, or SD_BALANCE_EXEC.
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 5d552c0d7109..6eb38f41d5d9 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -2156,7 +2156,7 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
> >  # define arch_scale_freq_invariant()	false
> >  #endif
> >  
> > -#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> > +#ifdef CONFIG_SMP
> >  static inline unsigned long cpu_util_dl(struct rq *rq)
> >  {
> >  	return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> > -- 
> > 2.11.0
> >
Leo Yan April 18, 2018, 9:19 a.m. UTC | #7
On Wed, Apr 18, 2018 at 09:13:39AM +0100, Quentin Perret wrote:
> On Tuesday 17 Apr 2018 at 23:22:13 (+0800), Leo Yan wrote:
> > On Fri, Apr 06, 2018 at 04:36:05PM +0100, Dietmar Eggemann wrote:
> > > From: Quentin Perret <quentin.perret@arm.com>
> > > 
> > > In preparation for the definition of an energy-aware wakeup path, a
> > > helper function is provided to estimate the consequence on system energy
> > > when a specific task wakes-up on a specific CPU. compute_energy()
> > > estimates the OPPs to be reached by all frequency domains and estimates
> > > the consumption of each online CPU according to its energy model and its
> > > percentage of busy time.
> > > 
> > > Cc: Ingo Molnar <mingo@redhat.com>
> > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > Signed-off-by: Quentin Perret <quentin.perret@arm.com>
> > > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > > ---
> > >  include/linux/sched/energy.h | 20 +++++++++++++
> > >  kernel/sched/fair.c          | 68 ++++++++++++++++++++++++++++++++++++++++++++
> > >  kernel/sched/sched.h         |  2 +-
> > >  3 files changed, 89 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/include/linux/sched/energy.h b/include/linux/sched/energy.h
> > > index 941071eec013..b4110b145228 100644
> > > --- a/include/linux/sched/energy.h
> > > +++ b/include/linux/sched/energy.h
> > > @@ -27,6 +27,24 @@ static inline bool sched_energy_enabled(void)
> > >  	return static_branch_unlikely(&sched_energy_present);
> > >  }
> > >  
> > > +static inline
> > > +struct capacity_state *find_cap_state(int cpu, unsigned long util)
> > > +{
> > > +	struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu);
> > > +	struct capacity_state *cs = NULL;
> > > +	int i;
> > > +
> > > +	util += util >> 2;
> > > +
> > > +	for (i = 0; i < em->nr_cap_states; i++) {
> > > +		cs = &em->cap_states[i];
> > > +		if (cs->cap >= util)
> > > +			break;
> > > +	}
> > > +
> > > +	return cs;
> > 
> > 'cs' is possible to return NULL.
> 
> Only if em-nr_cap_states==0, and that shouldn't be possible if
> sched_energy_present==True, so this code should be safe :-)

You are right. Thanks for the explanation.

> > > +}
> > > +
> > >  static inline struct cpumask *freq_domain_span(struct freq_domain *fd)
> > >  {
> > >  	return &fd->span;
> > > @@ -42,6 +60,8 @@ struct freq_domain;
> > >  static inline bool sched_energy_enabled(void) { return false; }
> > >  static inline struct cpumask
> > >  *freq_domain_span(struct freq_domain *fd) { return NULL; }
> > > +static inline struct capacity_state
> > > +*find_cap_state(int cpu, unsigned long util) { return NULL; }
> > >  static inline void init_sched_energy(void) { }
> > >  #define for_each_freq_domain(fdom) for (; fdom; fdom = NULL)
> > >  #endif
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 6960e5ef3c14..8cb9fb04fff2 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -6633,6 +6633,74 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> > >  }
> > >  
> > >  /*
> > > + * Returns the util of "cpu" if "p" wakes up on "dst_cpu".
> > > + */
> > > +static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
> > > +{
> > > +	unsigned long util, util_est;
> > > +	struct cfs_rq *cfs_rq;
> > > +
> > > +	/* Task is where it should be, or has no impact on cpu */
> > > +	if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu))
> > > +		return cpu_util(cpu);
> > > +
> > > +	cfs_rq = &cpu_rq(cpu)->cfs;
> > > +	util = READ_ONCE(cfs_rq->avg.util_avg);
> > > +
> > > +	if (dst_cpu == cpu)
> > > +		util += task_util(p);
> > > +	else
> > > +		util = max_t(long, util - task_util(p), 0);
> > 
> > I tried to understand the logic at here, below code is more clear for
> > myself:
> > 
> >         int prev_cpu = task_cpu(p);
> > 
> >         cfs_rq = &cpu_rq(cpu)->cfs;
> >         util = READ_ONCE(cfs_rq->avg.util_avg);
> > 
> >         /* Bail out if src and dst CPUs are the same one */
> >         if (prev_cpu == cpu && dst_cpu == cpu)
> >                 return util;
> > 
> >         /* Remove task utilization for src CPU */
> >         if (cpu == prev_cpu)
> >                 util = max_t(long, util - task_util(p), 0);
> > 
> >         /* Add task utilization for dst CPU */
> >         if (dst_cpu == cpu)
> >                 util += task_util(p);
> > 
> > BTW, CPU utilization is decayed value and task_util() is not decayed
> > value, so 'util - task_util(p)' calculates a smaller value than the
> > prev CPU pure utilization, right?
> 
> task_util() is the raw PELT signal, without UTIL_EST, so I think it's
> fine to do `util - task_util()`.
> 
> > 
> > Another question is can we reuse the function cpu_util_wake() and
> > just compenstate task util for dst cpu?
> 
> Well it's not that simple. cpu_util_wake() will give you the max between
> the util_avg and the util_est value, so which task_util() should you add
> to it ? The util_avg or the uti_est value ?

If the 'UTIL_EST' feature is enabled, then add the task's util_est;
otherwise add the task's util_avg value.

I think cpu_util_wake() has similar logic to what we have here; it merely
returns the CPU-level util, but here we need to accumulate the CPU-level
util + the task-level util.  So it seems to me the logic is:

  cpu_util_wake() + task_util_wake()

> Here we are trying to predict what will be the cpu_util signal in the
> future, so the only always-correct implementation of this function has
> to predict what will be the CPU util_avg and util_est signals in
> parallel and take the max of the two.

I totally agree with this; I just want to check if we can reuse existing
code, so we can have more consistent logic across the scheduler.

> > > +	if (sched_feat(UTIL_EST)) {
> > > +		util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> > > +		if (dst_cpu == cpu)
> > > +			util_est += _task_util_est(p);
> > > +		else
> > > +			util_est = max_t(long, util_est - _task_util_est(p), 0);
> > > +		util = max(util, util_est);
> > > +	}
> > > +
> > > +	return min_t(unsigned long, util, capacity_orig_of(cpu));
> > > +}
> > > +
> > > +/*
> > > + * Estimates the system level energy assuming that p wakes-up on dst_cpu.
> > > + *
> > > + * compute_energy() is safe to call only if an energy model is available for
> > > + * the platform, which is when sched_energy_enabled() is true.
> > > + */
> > > +static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
> > > +{
> > > +	unsigned long util, max_util, sum_util;
> > > +	struct capacity_state *cs;
> > > +	unsigned long energy = 0;
> > > +	struct freq_domain *fd;
> > > +	int cpu;
> > > +
> > > +	for_each_freq_domain(fd) {
> > > +		max_util = sum_util = 0;
> > > +		for_each_cpu_and(cpu, freq_domain_span(fd), cpu_online_mask) {
> > > +			util = cpu_util_next(cpu, p, dst_cpu);
> > > +			util += cpu_util_dl(cpu_rq(cpu));
> > > +			max_util = max(util, max_util);
> > > +			sum_util += util;
> > > +		}
> > > +
> > > +		/*
> > > +		 * Here we assume that the capacity states of CPUs belonging to
> > > +		 * the same frequency domains are shared. Hence, we look at the
> > > +		 * capacity state of the first CPU and re-use it for all.
> > > +		 */
> > > +		cpu = cpumask_first(freq_domain_span(fd));
> > > +		cs = find_cap_state(cpu, max_util);
> > > +		energy += cs->power * sum_util / cs->cap;
> > > +	}
> > 
> > This means all CPUs will be iterated for calculation, the complexity is
> > O(n)...
> > 
> > > +	return energy;
> > > +}
> > > +
> > > +/*
> > >   * select_task_rq_fair: Select target runqueue for the waking task in domains
> > >   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> > >   * SD_BALANCE_FORK, or SD_BALANCE_EXEC.
> > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > > index 5d552c0d7109..6eb38f41d5d9 100644
> > > --- a/kernel/sched/sched.h
> > > +++ b/kernel/sched/sched.h
> > > @@ -2156,7 +2156,7 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
> > >  # define arch_scale_freq_invariant()	false
> > >  #endif
> > >  
> > > -#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> > > +#ifdef CONFIG_SMP
> > >  static inline unsigned long cpu_util_dl(struct rq *rq)
> > >  {
> > >  	return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> > > -- 
> > > 2.11.0
> > >
Leo Yan April 18, 2018, 9:23 a.m. UTC | #8
On Fri, Apr 06, 2018 at 04:36:05PM +0100, Dietmar Eggemann wrote:

[...]

> +/*
> + * Estimates the system level energy assuming that p wakes-up on dst_cpu.
> + *
> + * compute_energy() is safe to call only if an energy model is available for
> + * the platform, which is when sched_energy_enabled() is true.
> + */
> +static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
> +{
> +	unsigned long util, max_util, sum_util;
> +	struct capacity_state *cs;
> +	unsigned long energy = 0;
> +	struct freq_domain *fd;
> +	int cpu;
> +
> +	for_each_freq_domain(fd) {
> +		max_util = sum_util = 0;
> +		for_each_cpu_and(cpu, freq_domain_span(fd), cpu_online_mask) {
> +			util = cpu_util_next(cpu, p, dst_cpu);
> +			util += cpu_util_dl(cpu_rq(cpu));
> +			max_util = max(util, max_util);
> +			sum_util += util;
> +		}
> +
> +		/*
> +		 * Here we assume that the capacity states of CPUs belonging to
> +		 * the same frequency domains are shared. Hence, we look at the
> +		 * capacity state of the first CPU and re-use it for all.
> +		 */
> +		cpu = cpumask_first(freq_domain_span(fd));
> +		cs = find_cap_state(cpu, max_util);
> +		energy += cs->power * sum_util / cs->cap;

I am a bit worried about the resolution issue, especially when the
capacity value is quite high and sum_util is a small value.

> +	}
> +
> +	return energy;
> +}
Quentin Perret April 18, 2018, 11:06 a.m. UTC | #9
On Wednesday 18 Apr 2018 at 17:19:28 (+0800), Leo Yan wrote:
> > > BTW, CPU utilization is decayed value and task_util() is not decayed
> > > value, so 'util - task_util(p)' calculates a smaller value than the
> > > prev CPU pure utilization, right?
> > 
> > task_util() is the raw PELT signal, without UTIL_EST, so I think it's
> > fine to do `util - task_util()`.
> > 
> > > 
> > > Another question is can we reuse the function cpu_util_wake() and
> > > just compenstate task util for dst cpu?
> > 
> > Well it's not that simple. cpu_util_wake() will give you the max between
> > the util_avg and the util_est value, so which task_util() should you add
> > to it ? The util_avg or the uti_est value ?
> 
> If feature 'UTIL_EST' is enabled, then add task's util_est; otherwise
> add task util_avg value.

I don't think this is correct. If UTIL_EST is enabled, cpu_util_wake()
will return the max between the util_avg and the util_est. When you call
it, you don't know what you get. Adding the _task_util_est() to
cpu_util_wake() is wrong if cpu_util_avg > cpu_util_est for example.
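
To put some made-up numbers on it: say the dst CPU (not the task's prev
CPU) has util_avg = 600 and util_est.enqueued = 100, and the task has
task_util() = 50 and _task_util_est() = 300.  The prediction we want is
max(600 + 50, 100 + 300) = 650, but cpu_util_wake() + _task_util_est()
would give max(600, 100) + 300 = 900.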

> 
> I think cpu_util_wake() has similiar logic with here, it merely returns
> CPU level util; but here needs to accumulate CPU level util + task level
> util.  So seems to me, the logic is:
> 
>   cpu_util_wake() + task_util_wake()

I'm not sure I get that one... What is task_util_wake()?

Thanks,
Quentin
Leo Yan April 18, 2018, 12:15 p.m. UTC | #10
Hi Quentin,

On Fri, Apr 06, 2018 at 04:36:05PM +0100, Dietmar Eggemann wrote:
> From: Quentin Perret <quentin.perret@arm.com>
> 
> In preparation for the definition of an energy-aware wakeup path, a
> helper function is provided to estimate the consequence on system energy
> when a specific task wakes-up on a specific CPU. compute_energy()
> estimates the OPPs to be reached by all frequency domains and estimates
> the consumption of each online CPU according to its energy model and its
> percentage of busy time.
> 
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Quentin Perret <quentin.perret@arm.com>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> ---
>  include/linux/sched/energy.h | 20 +++++++++++++
>  kernel/sched/fair.c          | 68 ++++++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/sched.h         |  2 +-
>  3 files changed, 89 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/sched/energy.h b/include/linux/sched/energy.h
> index 941071eec013..b4110b145228 100644
> --- a/include/linux/sched/energy.h
> +++ b/include/linux/sched/energy.h
> @@ -27,6 +27,24 @@ static inline bool sched_energy_enabled(void)
>  	return static_branch_unlikely(&sched_energy_present);
>  }
>  
> +static inline
> +struct capacity_state *find_cap_state(int cpu, unsigned long util)
> +{
> +	struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu);
> +	struct capacity_state *cs = NULL;
> +	int i;
> +
> +	util += util >> 2;
> +
> +	for (i = 0; i < em->nr_cap_states; i++) {
> +		cs = &em->cap_states[i];
> +		if (cs->cap >= util)
> +			break;
> +	}
> +
> +	return cs;
> +}
> +
>  static inline struct cpumask *freq_domain_span(struct freq_domain *fd)
>  {
>  	return &fd->span;
> @@ -42,6 +60,8 @@ struct freq_domain;
>  static inline bool sched_energy_enabled(void) { return false; }
>  static inline struct cpumask
>  *freq_domain_span(struct freq_domain *fd) { return NULL; }
> +static inline struct capacity_state
> +*find_cap_state(int cpu, unsigned long util) { return NULL; }
>  static inline void init_sched_energy(void) { }
>  #define for_each_freq_domain(fdom) for (; fdom; fdom = NULL)
>  #endif
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6960e5ef3c14..8cb9fb04fff2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6633,6 +6633,74 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
>  }
>  
>  /*
> + * Returns the util of "cpu" if "p" wakes up on "dst_cpu".
> + */
> +static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
> +{
> +	unsigned long util, util_est;
> +	struct cfs_rq *cfs_rq;
> +
> +	/* Task is where it should be, or has no impact on cpu */
> +	if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu))
> +		return cpu_util(cpu);
> +
> +	cfs_rq = &cpu_rq(cpu)->cfs;
> +	util = READ_ONCE(cfs_rq->avg.util_avg);
> +
> +	if (dst_cpu == cpu)
> +		util += task_util(p);
> +	else
> +		util = max_t(long, util - task_util(p), 0);
> +
> +	if (sched_feat(UTIL_EST)) {
> +		util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> +		if (dst_cpu == cpu)
> +			util_est += _task_util_est(p);
> +		else
> +			util_est = max_t(long, util_est - _task_util_est(p), 0);
> +		util = max(util, util_est);
> +	}
> +
> +	return min_t(unsigned long, util, capacity_orig_of(cpu));
> +}
> +
> +/*
> + * Estimates the system level energy assuming that p wakes-up on dst_cpu.
> + *
> + * compute_energy() is safe to call only if an energy model is available for
> + * the platform, which is when sched_energy_enabled() is true.
> + */
> +static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
> +{
> +	unsigned long util, max_util, sum_util;
> +	struct capacity_state *cs;
> +	unsigned long energy = 0;
> +	struct freq_domain *fd;
> +	int cpu;
> +
> +	for_each_freq_domain(fd) {
> +		max_util = sum_util = 0;
> +		for_each_cpu_and(cpu, freq_domain_span(fd), cpu_online_mask) {
> +			util = cpu_util_next(cpu, p, dst_cpu);
> +			util += cpu_util_dl(cpu_rq(cpu));
> +			max_util = max(util, max_util);
> +			sum_util += util;
> +		}
> +
> +		/*
> +		 * Here we assume that the capacity states of CPUs belonging to
> +		 * the same frequency domains are shared. Hence, we look at the
> +		 * capacity state of the first CPU and re-use it for all.
> +		 */
> +		cpu = cpumask_first(freq_domain_span(fd));
> +		cs = find_cap_state(cpu, max_util);
> +		energy += cs->power * sum_util / cs->cap;
> +	}

Sorry for introducing a mess here by spreading my questions across
several replies; later I will try to ask my questions in a single reply.
Below are more questions which are good to bring up:

The code for the energy computation is quite neat and simple, but I think
it mixes two concepts of CPU util: one concept is the estimated CPU util
which is used to select the CPU OPP in schedutil, the other is the raw
CPU util according to the CPU's real running time; for example,
cpu_util_next() predicts the CPU util, but this value might be much
higher than cpu_util(), especially after the UTIL_EST feature is enabled
(I have a shallow understanding of UTIL_EST, so correct me as needed);
but this patch simply computes the CPU capacity and energy with a single
CPU utilization value (and it will be an inflated value after enabling
UTIL_EST).  Is this intended to keep the implementation simple?

IMHO, cpu_util_next() can be used to predict the CPU capacity; on the
other hand, should we use the CPU util without the UTIL_EST capping for
'sum_util'?  That would more reasonably reflect the CPU utilization.

Furthermore, if we consider an RT thread running on a CPU coupled with
the 'schedutil' governor, the CPU will run at the maximum frequency, but
we cannot say the CPU has 100% utilization.  The RT thread case is not
handled in this patch.

> +
> +	return energy;
> +}
> +
> +/*
>   * select_task_rq_fair: Select target runqueue for the waking task in domains
>   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
>   * SD_BALANCE_FORK, or SD_BALANCE_EXEC.
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 5d552c0d7109..6eb38f41d5d9 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2156,7 +2156,7 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
>  # define arch_scale_freq_invariant()	false
>  #endif
>  
> -#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> +#ifdef CONFIG_SMP
>  static inline unsigned long cpu_util_dl(struct rq *rq)
>  {
>  	return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> -- 
> 2.11.0
>
Quentin Perret April 20, 2018, 2:42 p.m. UTC | #11
Hi Leo,

On Wednesday 18 Apr 2018 at 20:15:47 (+0800), Leo Yan wrote:
> Sorry I introduce mess at here to spread my questions in several
> replying, later will try to ask questions in one replying.  Below are
> more questions which it's good to bring up:
> 
> The code for energy computation is quite neat and simple, but I think
> the energy computation mixes two concepts for CPU util: one concept is
> the estimated CPU util which is used to select CPU OPP in schedutil,
> another concept is the raw CPU util according to CPU real running time;
> for example, cpu_util_next() predicts CPU util but this value might be
> much higher than cpu_util(), especially after enabled UTIL_EST feature
> (I have shallow understanding for UTIL_EST so correct me as needed);

I'm not sure I understand what you mean by higher than cpu_util()
here... In which case would that happen?

cpu_util_next() is basically used to figure out what will be the
cpu_util() of CPU A after task p has been enqueued on CPU B (no matter
what A and B are).

> but this patch simply computes CPU capacity and energy with the single
> one CPU utilization value (and it will be an inflated value afte enable
> UTIL_EST).  Is this purposed for simple implementation?
> 
> IMHO, cpu_util_next() can be used to predict CPU capacity, on the other
> hand, should we use the CPU util without UTIL_EST capping for 'sum_util',
> this can be more reasonable to reflect the CPU utilization?

Why would a decayed utilisation be a better estimate of the time that
a task is going to spend on a CPU?

> 
> Furthermore, if we consider RT thread is running on CPU and connect with
> 'schedutil' governor, the CPU will run at maximum frequency, but we
> cannot say the CPU has 100% utilization.  The RT thread case is not
> handled in this patch.

Right, we don't account for RT tasks in the OPP prediction for now.
Vincent's patches to have a util_avg for RT runqueues could help us
do that I suppose ...

Thanks !
Quentin
Quentin Perret April 20, 2018, 2:51 p.m. UTC | #12
On Wednesday 18 Apr 2018 at 17:23:16 (+0800), Leo Yan wrote:
> On Fri, Apr 06, 2018 at 04:36:05PM +0100, Dietmar Eggemann wrote:
> 
> [...]
> 
> > +/*
> > + * Estimates the system level energy assuming that p wakes-up on dst_cpu.
> > + *
> > + * compute_energy() is safe to call only if an energy model is available for
> > + * the platform, which is when sched_energy_enabled() is true.
> > + */
> > +static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
> > +{
> > +	unsigned long util, max_util, sum_util;
> > +	struct capacity_state *cs;
> > +	unsigned long energy = 0;
> > +	struct freq_domain *fd;
> > +	int cpu;
> > +
> > +	for_each_freq_domain(fd) {
> > +		max_util = sum_util = 0;
> > +		for_each_cpu_and(cpu, freq_domain_span(fd), cpu_online_mask) {
> > +			util = cpu_util_next(cpu, p, dst_cpu);
> > +			util += cpu_util_dl(cpu_rq(cpu));
> > +			max_util = max(util, max_util);
> > +			sum_util += util;
> > +		}
> > +
> > +		/*
> > +		 * Here we assume that the capacity states of CPUs belonging to
> > +		 * the same frequency domains are shared. Hence, we look at the
> > +		 * capacity state of the first CPU and re-use it for all.
> > +		 */
> > +		cpu = cpumask_first(freq_domain_span(fd));
> > +		cs = find_cap_state(cpu, max_util);
> > +		energy += cs->power * sum_util / cs->cap;
> 
> I am a bit worry about the resolution issue, especially when the
> capacity value is a quite high value and sum_util is a minor value.

Good point. As of now, the cs->power values happen to be in micro-watts
for the platforms we've been testing with, so they're typically high
enough to avoid significant resolution problems, I guess.
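
For instance, with made-up numbers: with cs->cap = 1024 and
sum_util = 10, a cost of 50 expressed in milli-watts gives
50 * 10 / 1024 = 0 with integer arithmetic, whereas the same cost
expressed in micro-watts gives 50000 * 10 / 1024 = 488, which keeps
enough resolution.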

Now, the energy model code has to be reworked significantly as we have
to remove the dependency on PM_OPP, so I'll try to make sure to keep
this issue in mind for the next version.

Thanks,
Quentin
Leo Yan April 20, 2018, 4:27 p.m. UTC | #13
On Fri, Apr 20, 2018 at 03:42:45PM +0100, Quentin Perret wrote:
> Hi Leo,
> 
> On Wednesday 18 Apr 2018 at 20:15:47 (+0800), Leo Yan wrote:
> > Sorry I introduce mess at here to spread my questions in several
> > replying, later will try to ask questions in one replying.  Below are
> > more questions which it's good to bring up:
> > 
> > The code for energy computation is quite neat and simple, but I think
> > the energy computation mixes two concepts for CPU util: one concept is
> > the estimated CPU util which is used to select CPU OPP in schedutil,
> > another concept is the raw CPU util according to CPU real running time;
> > for example, cpu_util_next() predicts CPU util but this value might be
> > much higher than cpu_util(), especially after enabled UTIL_EST feature
> > (I have shallow understanding for UTIL_EST so correct me as needed);
> 
> I'm not not sure to understand what you mean by higher than cpu_util()
> here ... In which case would that happen ?

After the UTIL_EST feature is enabled, cpu_util_next() returns a higher
value than cpu_util(); see the code below, 'util = max(util, util_est);'.
As a result, cpu_util_next() takes into consideration the extra
compensation introduced by UTIL_EST.

	if (sched_feat(UTIL_EST)) {
	        util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
	        if (dst_cpu == cpu)
	                util_est += _task_util_est(p);
	        else
	                util_est = max_t(long, util_est - _task_util_est(p), 0);
	        util = max(util, util_est);
	}

> cpu_util_next() is basically used to figure out what will be the
> cpu_util() of CPU A after task p has been enqueued on CPU B (no matter
> what A and B are).

Same as the description above: cpu_util_next() is not the same thing as
cpu_util(); cpu_util_next() takes into consideration the extra
compensation introduced by UTIL_EST.

> > but this patch simply computes CPU capacity and energy with the single
> > one CPU utilization value (and it will be an inflated value afte enable
> > UTIL_EST).  Is this purposed for simple implementation?
> > 
> > IMHO, cpu_util_next() can be used to predict CPU capacity, on the other
> > hand, should we use the CPU util without UTIL_EST capping for 'sum_util',
> > this can be more reasonable to reflect the CPU utilization?
> 
> Why would a decayed utilisation be a better estimate of the time that
> a task is going to spend on a CPU ?

IIUC, in the scheduler wakeup path task_util() is the task utilisation
from before the task slept, so it's not a decayed value.  cpu_util() is
a decayed value, but is that just because we want to reflect the CPU's
historic utilisation in the recent past?  This is the reason I bring up
using 'cpu_util() + task_util()' as the estimation.

I understand this patch tries to use the pre-decayed value; please
review whether the example below has an issue or not:
if one CPU's cfs_rq->avg.util_est.enqueued is a quite high value, and
then this CPU enters the idle state and sleeps for a long while, then if
we use cfs_rq->avg.util_est.enqueued to estimate the CPU utilisation,
this might deviate a lot from the actual CPU running time when we place
the waking task on it.  On the other hand, cpu_util() can decay over the
CPU's idle time...

> > Furthermore, if we consider RT thread is running on CPU and connect with
> > 'schedutil' governor, the CPU will run at maximum frequency, but we
> > cannot say the CPU has 100% utilization.  The RT thread case is not
> > handled in this patch.
> 
> Right, we don't account for RT tasks in the OPP prediction for now.
> Vincent's patches to have a util_avg for RT runqueues could help us
> do that I suppose ...

Good to know this.

> Thanks !
> Quentin
Quentin Perret April 25, 2018, 8:23 a.m. UTC | #14
Hi Leo,

Sorry for the delay in responding...

On Saturday 21 Apr 2018 at 00:27:53 (+0800), Leo Yan wrote:
> On Fri, Apr 20, 2018 at 03:42:45PM +0100, Quentin Perret wrote:
> > Hi Leo,
> > 
> > On Wednesday 18 Apr 2018 at 20:15:47 (+0800), Leo Yan wrote:
> > > Sorry I introduce mess at here to spread my questions in several
> > > replying, later will try to ask questions in one replying.  Below are
> > > more questions which it's good to bring up:
> > > 
> > > The code for energy computation is quite neat and simple, but I think
> > > the energy computation mixes two concepts for CPU util: one concept is
> > > the estimated CPU util which is used to select CPU OPP in schedutil,
> > > another concept is the raw CPU util according to CPU real running time;
> > > for example, cpu_util_next() predicts CPU util but this value might be
> > > much higher than cpu_util(), especially after enabled UTIL_EST feature
> > > (I have shallow understanding for UTIL_EST so correct me as needed);
> > 
> > I'm not not sure to understand what you mean by higher than cpu_util()
> > here ... In which case would that happen ?
> 
> After UTIL_EST feature is enabled, cpu_util_next() returns higher value
> than cpu_util(), see below code 'util = max(util, util_est);';  as
> result cpu_util_next() takes consideration for extra compensention
> introduced by UTIL_EST.
> 
> 	if (sched_feat(UTIL_EST)) {
> 	        util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> 	        if (dst_cpu == cpu)
> 	                util_est += _task_util_est(p);
> 	        else
> 	                util_est = max_t(long, util_est - _task_util_est(p), 0);
> 	        util = max(util, util_est);
> 	}

So, cpu_util() accounts for the UTIL_EST compensation:

	static inline unsigned long cpu_util(int cpu)
	{
		struct cfs_rq *cfs_rq;
		unsigned int util;

		cfs_rq = &cpu_rq(cpu)->cfs;
		util = READ_ONCE(cfs_rq->avg.util_avg);

		if (sched_feat(UTIL_EST))
			util = max(util, READ_ONCE(cfs_rq->avg.util_est.enqueued));

		return min_t(unsigned long, util, capacity_orig_of(cpu));
	}

So cpu_util_next() just mimics that.

> 
> > cpu_util_next() is basically used to figure out what will be the
> > cpu_util() of CPU A after task p has been enqueued on CPU B (no matter
> > what A and B are).
> 
> Same with upper description, cpu_util_next() is not the same thing
> with cpu_util(), cpu_util_next() takes consideration for extra
> compensention introduced by UTIL_EST.
> 
> > > but this patch simply computes CPU capacity and energy with the single
> > > one CPU utilization value (and it will be an inflated value afte enable
> > > UTIL_EST).  Is this purposed for simple implementation?
> > > 
> > > IMHO, cpu_util_next() can be used to predict CPU capacity, on the other
> > > hand, should we use the CPU util without UTIL_EST capping for 'sum_util',
> > > this can be more reasonable to reflect the CPU utilization?
> > 
> > Why would a decayed utilisation be a better estimate of the time that
> > a task is going to spend on a CPU ?
> 
> IIUC, in the scheduler waken up path task_util() is the task utilisation
> before task sleeping, so it's not a decayed value.

I don't think this is correct. sync_entity_load_avg() is called in
select_task_rq_fair() so task_util() *is* decayed upon wakeup.

> cpu_util() is
> decayed value,

This is not necessarily correct either. As mentioned above, cpu_util()
includes the UTIL_EST compensation, so the value isn't necessarily
decayed.

> but is this just we want to reflect cpu historic
> utilisation at the recent past time?  This is the reason I bring up to
> use 'cpu_util() + task_util()' as estimation.
> 
> I understand this patch tries to use pre-decayed value,

No, this patch tries to estimate what will be the return value of
cpu_util() if the task is enqueued on a specific CPU. This value can be
the util_avg (decayed) or the util_est (non-decayed) depending on the
conditions.

> please review
> below example has issue or not:
> if one CPU's cfs_rq->avg.util_est.enqueued is quite high value, then this
> CPU enter idle state and sleep for long while, if we use
> cfs_rq->avg.util_est.enqueued to estimate CPU utilisation, this might
> have big deviation than the CPU run time if place wake task on it?  On
> the other hand, cpu_util() can decay for CPU idle time...
> 
> > > Furthermore, if we consider RT thread is running on CPU and connect with
> > > 'schedutil' governor, the CPU will run at maximum frequency, but we
> > > cannot say the CPU has 100% utilization.  The RT thread case is not
> > > handled in this patch.
> > 
> > Right, we don't account for RT tasks in the OPP prediction for now.
> > Vincent's patches to have a util_avg for RT runqueues could help us
> > do that I suppose ...
> 
> Good to know this.
> 
> > Thanks !
> > Quentin

Patch

diff --git a/include/linux/sched/energy.h b/include/linux/sched/energy.h
index 941071eec013..b4110b145228 100644
--- a/include/linux/sched/energy.h
+++ b/include/linux/sched/energy.h
@@ -27,6 +27,24 @@  static inline bool sched_energy_enabled(void)
 	return static_branch_unlikely(&sched_energy_present);
 }
 
+static inline
+struct capacity_state *find_cap_state(int cpu, unsigned long util)
+{
+	struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu);
+	struct capacity_state *cs = NULL;
+	int i;
+
+	util += util >> 2;
+
+	for (i = 0; i < em->nr_cap_states; i++) {
+		cs = &em->cap_states[i];
+		if (cs->cap >= util)
+			break;
+	}
+
+	return cs;
+}
+
 static inline struct cpumask *freq_domain_span(struct freq_domain *fd)
 {
 	return &fd->span;
@@ -42,6 +60,8 @@  struct freq_domain;
 static inline bool sched_energy_enabled(void) { return false; }
 static inline struct cpumask
 *freq_domain_span(struct freq_domain *fd) { return NULL; }
+static inline struct capacity_state
+*find_cap_state(int cpu, unsigned long util) { return NULL; }
 static inline void init_sched_energy(void) { }
 #define for_each_freq_domain(fdom) for (; fdom; fdom = NULL)
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6960e5ef3c14..8cb9fb04fff2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6633,6 +6633,74 @@  static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
 }
 
 /*
+ * Returns the util of "cpu" if "p" wakes up on "dst_cpu".
+ */
+static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
+{
+	unsigned long util, util_est;
+	struct cfs_rq *cfs_rq;
+
+	/* Task is where it should be, or has no impact on cpu */
+	if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu))
+		return cpu_util(cpu);
+
+	cfs_rq = &cpu_rq(cpu)->cfs;
+	util = READ_ONCE(cfs_rq->avg.util_avg);
+
+	if (dst_cpu == cpu)
+		util += task_util(p);
+	else
+		util = max_t(long, util - task_util(p), 0);
+
+	if (sched_feat(UTIL_EST)) {
+		util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
+		if (dst_cpu == cpu)
+			util_est += _task_util_est(p);
+		else
+			util_est = max_t(long, util_est - _task_util_est(p), 0);
+		util = max(util, util_est);
+	}
+
+	return min_t(unsigned long, util, capacity_orig_of(cpu));
+}
+
+/*
+ * Estimates the system level energy assuming that p wakes-up on dst_cpu.
+ *
+ * compute_energy() is safe to call only if an energy model is available for
+ * the platform, which is when sched_energy_enabled() is true.
+ */
+static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
+{
+	unsigned long util, max_util, sum_util;
+	struct capacity_state *cs;
+	unsigned long energy = 0;
+	struct freq_domain *fd;
+	int cpu;
+
+	for_each_freq_domain(fd) {
+		max_util = sum_util = 0;
+		for_each_cpu_and(cpu, freq_domain_span(fd), cpu_online_mask) {
+			util = cpu_util_next(cpu, p, dst_cpu);
+			util += cpu_util_dl(cpu_rq(cpu));
+			max_util = max(util, max_util);
+			sum_util += util;
+		}
+
+		/*
+		 * Here we assume that the capacity states of CPUs belonging to
+		 * the same frequency domains are shared. Hence, we look at the
+		 * capacity state of the first CPU and re-use it for all.
+		 */
+		cpu = cpumask_first(freq_domain_span(fd));
+		cs = find_cap_state(cpu, max_util);
+		energy += cs->power * sum_util / cs->cap;
+	}
+
+	return energy;
+}
+
+/*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
  * SD_BALANCE_FORK, or SD_BALANCE_EXEC.
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5d552c0d7109..6eb38f41d5d9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2156,7 +2156,7 @@  static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
 # define arch_scale_freq_invariant()	false
 #endif
 
-#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
+#ifdef CONFIG_SMP
 static inline unsigned long cpu_util_dl(struct rq *rq)
 {
 	return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;