[RFC,4/6] sched/fair: Introduce an energy estimation helper function

Message ID 20180320094312.24081-5-dietmar.eggemann@arm.com (mailing list archive)
State RFC, archived

Commit Message

Dietmar Eggemann March 20, 2018, 9:43 a.m. UTC
From: Quentin Perret <quentin.perret@arm.com>

In preparation for the definition of an energy-aware wakeup path, a
helper function is provided to estimate the effect on system energy
when a specific task wakes up on a specific CPU. compute_energy()
estimates the OPPs to be reached by all frequency domains and the
consumption of each online CPU according to its energy model and its
percentage of busy time.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/fair.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 81 insertions(+)

Comments

Juri Lelli March 21, 2018, 9:04 a.m. UTC | #1
Hi,

On 20/03/18 09:43, Dietmar Eggemann wrote:
> From: Quentin Perret <quentin.perret@arm.com>
> 
> In preparation for the definition of an energy-aware wakeup path, a
> helper function is provided to estimate the effect on system energy
> when a specific task wakes up on a specific CPU. compute_energy()
> estimates the OPPs to be reached by all frequency domains and the
> consumption of each online CPU according to its energy model and its
> percentage of busy time.
> 
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Quentin Perret <quentin.perret@arm.com>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> ---
>  kernel/sched/fair.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 81 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6c72a5e7b1b0..76bd46502486 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6409,6 +6409,30 @@ static inline int cpu_overutilized(int cpu)
>  }
>  
>  /*
> + * Returns the util of "cpu" if "p" wakes up on "dst_cpu".
> + */
> +static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
> +{
> +	unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;

What about other classes? Shouldn't we now also take into account
DEADLINE (as schedutil does)?
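
For reference, schedutil's aggregation at the time looked roughly like
this (simplified from sugov_aggregate_util() in
kernel/sched/cpufreq_schedutil.c; treat the details as an approximation
rather than the exact code):

---8<---
static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
{
        struct rq *rq = cpu_rq(sg_cpu->cpu);
        unsigned long util;

        if (rq->rt.rt_nr_running) {
                /* RT tasks runnable: request the maximum OPP */
                util = sg_cpu->max;
        } else {
                util = sg_cpu->util_dl;
                if (rq->cfs.h_nr_running)
                        util += sg_cpu->util_cfs;
        }

        /* DL + CFS can exceed the CPU capacity; clamp it */
        return min(util, sg_cpu->max);
}
---8<---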

BTW, we now also have a getter method in sched/sched.h; it takes
UTIL_EST into account, though. Do we need to take that into account when
estimating energy consumption?
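
(The getter in question is along these lines; a simplified sketch with
the util_est field names assumed from the UTIL_EST series, not a
verbatim copy:)

---8<---
static inline unsigned long cpu_util(int cpu)
{
        struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
        unsigned int util = READ_ONCE(cfs_rq->avg.util_avg);

        if (sched_feat(UTIL_EST))
                util = max(util, READ_ONCE(cfs_rq->avg.util_est.enqueued));

        /* Clamp to the original CPU capacity */
        return min_t(unsigned long, util, capacity_orig_of(cpu));
}
---8<---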

> +	unsigned long capacity = capacity_orig_of(cpu);
> +
> +	/*
> +	 * If p is where it should be, or if it has no impact on cpu, there is
> +	 * not much to do.
> +	 */
> +	if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu))
> +		goto clamp_util;
> +
> +	if (dst_cpu == cpu)
> +		util += task_util(p);
> +	else
> +		util = max_t(long, util - task_util(p), 0);
> +
> +clamp_util:
> +	return (util >= capacity) ? capacity : util;
> +}
> +
> +/*
>   * Disable WAKE_AFFINE in the case where task @p doesn't fit in the
>   * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
>   *
> @@ -6432,6 +6456,63 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
>  	return !util_fits_capacity(task_util(p), min_cap);
>  }
>  
> +static struct capacity_state *find_cap_state(int cpu, unsigned long util)
> +{
> +	struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu);
> +	struct capacity_state *cs = NULL;
> +	int i;
> +
> +	/*
> +	 * As the goal is to estimate the OPP reached for a specific util
> +	 * value, mimic the behaviour of schedutil with a 1.25 coefficient
> +	 */
> +	util += util >> 2;

What about other governors (ondemand, for example)? Is this supposed to
work only when schedutil is in use (if so we should probably make it
conditional on that)?

Also, even when schedutil is in use, shouldn't we ask it for a util
"computation" instead of replicating its _current_ heuristic? I fear
the two might diverge in the future.

Best,

- Juri
Patrick Bellasi March 21, 2018, 12:26 p.m. UTC | #2
On 21-Mar 10:04, Juri Lelli wrote:
> Hi,
> 
> On 20/03/18 09:43, Dietmar Eggemann wrote:
> > From: Quentin Perret <quentin.perret@arm.com>
> > 
> > In preparation for the definition of an energy-aware wakeup path, a
> > helper function is provided to estimate the effect on system energy
> > when a specific task wakes up on a specific CPU. compute_energy()
> > estimates the OPPs to be reached by all frequency domains and the
> > consumption of each online CPU according to its energy model and its
> > percentage of busy time.
> > 
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Quentin Perret <quentin.perret@arm.com>
> > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > ---
> >  kernel/sched/fair.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 81 insertions(+)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6c72a5e7b1b0..76bd46502486 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6409,6 +6409,30 @@ static inline int cpu_overutilized(int cpu)
> >  }
> >  
> >  /*
> > + * Returns the util of "cpu" if "p" wakes up on "dst_cpu".
> > + */
> > +static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
> > +{
> > +	unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
> 
> What about other classes? Shouldn't we now also take into account
> DEADLINE (as schedutil does)?

Good point, although that would likely require factoring the
utilization aggregation function out of schedutil, wouldn't it?

> BTW, we now also have a getter method in sched/sched.h; it takes
> UTIL_EST into account, though. Do we need to take that into account when
> estimating energy consumption?

Actually I think that this whole function can be written "just" as:

---8<---
   unsigned long util = cpu_util_wake(cpu, p);

   if (cpu != dst_cpu)
        return util;

   return min(util + task_util(p), capacity_orig_of(cpu));
---8<---

which will reuse existing functions as well as getting for free other
stuff (like the CPU util_est).

Considering your observation above, it also makes it easy to add the
DEADLINE and RT utilizations into util, just before returning the
value, as sketched below.
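
(A sketch only; cpu_util_dl() and cpu_util_rt() are assumed helpers
here, not something this series provides:)

---8<---
   unsigned long util = cpu_util_wake(cpu, p);

   if (cpu == dst_cpu)
        util += task_util(p);

   /* Fold in the other classes' utilization before clamping */
   util += cpu_util_dl(cpu);
   util += cpu_util_rt(cpu);

   return min(util, capacity_orig_of(cpu));
---8<---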

> > +	unsigned long capacity = capacity_orig_of(cpu);
> > +
> > +	/*
> > +	 * If p is where it should be, or if it has no impact on cpu, there is
> > +	 * not much to do.
> > +	 */
> > +	if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu))
> > +		goto clamp_util;
> > +
> > +	if (dst_cpu == cpu)
> > +		util += task_util(p);
> > +	else
> > +		util = max_t(long, util - task_util(p), 0);
> > +
> > +clamp_util:
> > +	return (util >= capacity) ? capacity : util;
> > +}
> > +
> > +/*
> >   * Disable WAKE_AFFINE in the case where task @p doesn't fit in the
> >   * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
> >   *
> > @@ -6432,6 +6456,63 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> >  	return !util_fits_capacity(task_util(p), min_cap);
> >  }
> >  
> > +static struct capacity_state *find_cap_state(int cpu, unsigned long util)
> > +{
> > +	struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu);
> > +	struct capacity_state *cs = NULL;
> > +	int i;
> > +
> > +	/*
> > +	 * As the goal is to estimate the OPP reached for a specific util
> > +	 * value, mimic the behaviour of schedutil with a 1.25 coefficient
> > +	 */
> > +	util += util >> 2;
> 
> What about other governors (ondemand, for example)? Is this supposed to
> work only when schedutil is in use (if so we should probably make it
> conditional on that)?

Yes, I would say that EAS mostly makes sense when you have a "minimum"
control on OPPs... otherwise all the energy estimations are really
fuzzy.

> Also, even when schedutil is in use, shouldn't we ask it for a util
> "computation" instead of replicating its _current_ heuristic?

Are you proposing to have the 1.25 factor only here and remove it from
schedutil?

> I fear the two might diverge in the future.

That could be avoided by factoring the "compensation" factor out of
schedutil into a proper function to be used by all the interested
players, couldn't it?
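
For illustration, such a shared helper could be as small as this (the
name is hypothetical, not something either codebase provides today):

---8<---
/*
 * Apply the frequency-selection headroom (C = 1.25) to a utilization
 * value. Meant to be shared by schedutil's get_next_freq() and EAS's
 * find_cap_state() so the two heuristics cannot diverge.
 */
static inline unsigned long sched_util_freq_margin(unsigned long util)
{
        return util + (util >> 2);
}
---8<---
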
Patrick Bellasi March 21, 2018, 12:39 p.m. UTC | #3
On 20-Mar 09:43, Dietmar Eggemann wrote:
> From: Quentin Perret <quentin.perret@arm.com>

[...]

> +static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
> +{
> +	unsigned long util, fdom_max_util;
> +	struct capacity_state *cs;
> +	unsigned long energy = 0;
> +	struct freq_domain *fdom;
> +	int cpu;
> +
> +	for_each_freq_domain(fdom) {
> +		fdom_max_util = 0;
> +		for_each_cpu_and(cpu, &(fdom->span), cpu_online_mask) {
> +			util = cpu_util_next(cpu, p, dst_cpu);

It would be nice to find a way to cache all these util values and reuse
them below... even just to ensure data consistency between the "cs"
computation and its usage...

> +			fdom_max_util = max(util, fdom_max_util);
> +		}
> +
> +		/*
> +		 * Here we assume that the capacity states of CPUs belonging to
> +		 * the same frequency domains are shared. Hence, we look at the
> +		 * capacity state of the first CPU and re-use it for all.
> +		 */
> +		cpu = cpumask_first(&(fdom->span));
> +		cs = find_cap_state(cpu, fdom_max_util);
                ^^^^

The above code could theoretically return NULL, although likely EAS is
completely disabled if em->nr_cap_states == 0, right?

If that's the case then, in the previous function, you can certainly
avoid the initialization of *cs and maybe also add an explicit:

    BUG_ON(em->nr_cap_states == 0);

which helps even just as "in code documentation".

But, I'm not sure if maintainers like BUG_ON in scheduler code :)


> +
> +		/*
> +		 * The energy consumed by each CPU is derived from the power
                      ^

Should we make it more explicit that this is just the "active" energy
consumed?

> +		 * it dissipates at the expected OPP and its percentage of
> +		 * busy time.
> +		 */
> +		for_each_cpu_and(cpu, &(fdom->span), cpu_online_mask) {
> +			util = cpu_util_next(cpu, p, dst_cpu);
> +			energy += cs->power * util / cs->cap;
> +		}
> +	}

nit-pick: empty line before return?

> +	return energy;
> +}
> +
>  /*
>   * select_task_rq_fair: Select target runqueue for the waking task in domains
>   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> -- 
> 2.11.0
>
Juri Lelli March 21, 2018, 12:59 p.m. UTC | #4
On 21/03/18 12:26, Patrick Bellasi wrote:
> On 21-Mar 10:04, Juri Lelli wrote:
> > Hi,
> > 
> > On 20/03/18 09:43, Dietmar Eggemann wrote:
> > > From: Quentin Perret <quentin.perret@arm.com>
> > > 
> > > In preparation for the definition of an energy-aware wakeup path, a
> > > helper function is provided to estimate the effect on system energy
> > > when a specific task wakes up on a specific CPU. compute_energy()
> > > estimates the OPPs to be reached by all frequency domains and the
> > > consumption of each online CPU according to its energy model and its
> > > percentage of busy time.
> > > 
> > > Cc: Ingo Molnar <mingo@redhat.com>
> > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > Signed-off-by: Quentin Perret <quentin.perret@arm.com>
> > > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > > ---
> > >  kernel/sched/fair.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 81 insertions(+)
> > > 
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 6c72a5e7b1b0..76bd46502486 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -6409,6 +6409,30 @@ static inline int cpu_overutilized(int cpu)
> > >  }
> > >  
> > >  /*
> > > + * Returns the util of "cpu" if "p" wakes up on "dst_cpu".
> > > + */
> > > +static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
> > > +{
> > > +	unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
> > 
> > What about other classes? Shouldn't we now also take into account
> > DEADLINE (as schedutil does)?
> 
> Good point, although that would likely require factoring the
> utilization aggregation function out of schedutil, wouldn't it?

Maybe, or simply use getter methods and aggregate again here.

> 
> > BTW, we now also have a getter method in sched/sched.h; it takes
> > UTIL_EST into account, though. Do we need to take that into account when
> > estimating energy consumption?
> 
> Actually I think that this whole function can be written "just" as:
> 
> ---8<---
>    unsigned long util = cpu_util_wake(cpu, p);
> 
>    if (cpu != dst_cpu)
>         return util;
> 
>    return min(util + task_util(p), capacity_orig_of(cpu));
> ---8<---
> 
> which will reuse existing functions as well as getting for free other
> stuff (like the CPU util_est).
> 
> Considering your observation above, it also makes it easy to add the
> DEADLINE and RT utilizations into util, just before returning the
> value.

Well, for RT we should probably consider the fact that schedutil is
going to select max OPP...

Apart from that I guess it could work like you said.

> 
> > > +	unsigned long capacity = capacity_orig_of(cpu);
> > > +
> > > +	/*
> > > +	 * If p is where it should be, or if it has no impact on cpu, there is
> > > +	 * not much to do.
> > > +	 */
> > > +	if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu))
> > > +		goto clamp_util;
> > > +
> > > +	if (dst_cpu == cpu)
> > > +		util += task_util(p);
> > > +	else
> > > +		util = max_t(long, util - task_util(p), 0);
> > > +
> > > +clamp_util:
> > > +	return (util >= capacity) ? capacity : util;
> > > +}
> > > +
> > > +/*
> > >   * Disable WAKE_AFFINE in the case where task @p doesn't fit in the
> > >   * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
> > >   *
> > > @@ -6432,6 +6456,63 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> > >  	return !util_fits_capacity(task_util(p), min_cap);
> > >  }
> > >  
> > > +static struct capacity_state *find_cap_state(int cpu, unsigned long util)
> > > +{
> > > +	struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu);
> > > +	struct capacity_state *cs = NULL;
> > > +	int i;
> > > +
> > > +	/*
> > > +	 * As the goal is to estimate the OPP reached for a specific util
> > > +	 * value, mimic the behaviour of schedutil with a 1.25 coefficient
> > > +	 */
> > > +	util += util >> 2;
> > 
> > What about other governors (ondemand, for example)? Is this supposed to
> > work only when schedutil is in use (if so we should probably make it
> > conditional on that)?
> 
> Yes, I would say that EAS mostly makes sense when you have a "minimum"
> control on OPPs... otherwise all the energy estimations are really
> fuzzy.

Makes sense to me. Shouldn't we then make all this conditional on using
schedutil?

> 
> > Also, even when schedutil is in use, shouldn't we ask it for a util
> > "computation" instead of replicating its _current_ heuristic?
> 
> Are you proposing to have the 1.25 factor only here and remove it from
> schedutil?

I'm only saying that we probably shouldn't have two places where we add
this 1.25 factor to utilization before using it, as in the future one of
the two might modify that 1.25 to something else and then we'll have
problems. So, maybe a common wrapper that adds such a factor?

> 
> > I fear the two might diverge in the future.
> 
> That could be avoided by factoring the "compensation" factor out of
> schedutil into a proper function to be used by all the interested
> players, couldn't it?

And I should have read till the end before writing the above paragraph
it seems. :)

Thanks,

- Juri
Quentin Perret March 21, 2018, 1:55 p.m. UTC | #5
On Wednesday 21 Mar 2018 at 13:59:25 (+0100), Juri Lelli wrote:
> On 21/03/18 12:26, Patrick Bellasi wrote:
> > On 21-Mar 10:04, Juri Lelli wrote:
> > > Hi,
> > > 
> > > On 20/03/18 09:43, Dietmar Eggemann wrote:
> > > > From: Quentin Perret <quentin.perret@arm.com>
> > > > 
> > > > In preparation for the definition of an energy-aware wakeup path, a
> > > > helper function is provided to estimate the effect on system energy
> > > > when a specific task wakes up on a specific CPU. compute_energy()
> > > > estimates the OPPs to be reached by all frequency domains and the
> > > > consumption of each online CPU according to its energy model and its
> > > > percentage of busy time.
> > > > 
> > > > Cc: Ingo Molnar <mingo@redhat.com>
> > > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > > Signed-off-by: Quentin Perret <quentin.perret@arm.com>
> > > > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > > > ---
> > > >  kernel/sched/fair.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > >  1 file changed, 81 insertions(+)
> > > > 
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 6c72a5e7b1b0..76bd46502486 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -6409,6 +6409,30 @@ static inline int cpu_overutilized(int cpu)
> > > >  }
> > > >  
> > > >  /*
> > > > + * Returns the util of "cpu" if "p" wakes up on "dst_cpu".
> > > > + */
> > > > +static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
> > > > +{
> > > > +	unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
> > > 
> > > What about other classes? Shouldn't we now also take into account
> > > DEADLINE (as schedutil does)?
> > 
> > Good point, although that would likely require factoring the
> > utilization aggregation function out of schedutil, wouldn't it?
> 
> Maybe, or simply use getter methods and aggregate again here.

I agree with you both, taking DL into account here is most likely the
right thing to do. Thinking about this, there are other places in those
patches where we should really use capacity_of() instead of
capacity_orig_of() (in task_fits() in patch 5/6, for example) in order to
avoid CPUs under heavy RT pressure. I'll try to improve the integration
with other scheduling classes in v2.

> 
> > 
> > > BTW, we now also have a getter method in sched/sched.h; it takes
> > > UTIL_EST into account, though. Do we need to take that into account when
> > > estimating energy consumption?

Yes, I think that using UTIL_EST makes a lot of sense for energy
calculation. This is what is used for frequency selection (with
schedutil obviously) and this is also our best guess on how much time
a task will spend on a CPU. V2 will be rebased on the latest
tip/sched/core and I'll make sure to integrate things better with
util_est.

> > 
> > Actually I think that this whole function can be written "just" as:
> > 
> > ---8<---
> >    unsigned long util = cpu_util_wake(cpu, p);
> > 
> >    if (cpu != dst_cpu)
> >         return util;
> > 
> >    return min(util + task_util(p), capacity_orig_of(cpu));
> > ---8<---
> > 
> > which will reuse existing functions as well as getting for free other
> > stuff (like the CPU util_est).
> > 
> > Considering your observation above, it also makes it easy to add the
> > DEADLINE and RT utilizations into util, just before returning the
> > value.
> 
> Well, for RT we should probably consider the fact that schedutil is
> going to select max OPP...

Right, but I need to think about the right place to put that, and how to
compute the energy accurately in this case. Some modification might also
be required in find_cap_state() (patch 5/6).
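
One option (just a sketch on top of this RFC's find_cap_state(), using
the fields the patch already defines) would be to short-circuit to the
highest capacity state whenever RT tasks are runnable, mirroring
schedutil's behaviour:

---8<---
        /* RT runnable: schedutil requests the max OPP, assume it here too */
        if (cpu_rq(cpu)->rt.rt_nr_running)
                return &em->cap_states[em->nr_cap_states - 1];
---8<---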

> 
> Apart from that I guess it could work like you said.
> 
> > 
> > > > +	unsigned long capacity = capacity_orig_of(cpu);
> > > > +
> > > > +	/*
> > > > +	 * If p is where it should be, or if it has no impact on cpu, there is
> > > > +	 * not much to do.
> > > > +	 */
> > > > +	if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu))
> > > > +		goto clamp_util;
> > > > +
> > > > +	if (dst_cpu == cpu)
> > > > +		util += task_util(p);
> > > > +	else
> > > > +		util = max_t(long, util - task_util(p), 0);
> > > > +
> > > > +clamp_util:
> > > > +	return (util >= capacity) ? capacity : util;
> > > > +}
> > > > +
> > > > +/*
> > > >   * Disable WAKE_AFFINE in the case where task @p doesn't fit in the
> > > >   * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
> > > >   *
> > > > @@ -6432,6 +6456,63 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> > > >  	return !util_fits_capacity(task_util(p), min_cap);
> > > >  }
> > > >  
> > > > +static struct capacity_state *find_cap_state(int cpu, unsigned long util)
> > > > +{
> > > > +	struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu);
> > > > +	struct capacity_state *cs = NULL;
> > > > +	int i;
> > > > +
> > > > +	/*
> > > > +	 * As the goal is to estimate the OPP reached for a specific util
> > > > +	 * value, mimic the behaviour of schedutil with a 1.25 coefficient
> > > > +	 */
> > > > +	util += util >> 2;
> > > 
> > > What about other governors (ondemand, for example)? Is this supposed to
> > > work only when schedutil is in use (if so we should probably make it
> > > conditional on that)?
> > 
> > Yes, I would say that EAS mostly makes sense when you have a "minimum"
> > control on OPPs... otherwise all the energy estimations are really
> > fuzzy.
> 
> Makes sense to me. Shouldn't we then make all this conditional on using
> schedutil?

So, in theory, EAS could make sense even for other governors than
schedutil. Even with the performance governor it is probably more
energy efficient (although users using "performance" probably don't care
about energy, but that's just an example) to place small tasks onto little
CPUs up to a certain point given by the energy model. The ideal solution
would be to change the behaviour of find_cap_state() depending on the
governor being used, but I don't know if this extra complexity is worth
it really.
I'm happy to make all this conditional on schedutil as a first step and
we can see later if that makes sense to extend EAS to other use-cases.

> 
> > 
> > > Also, even when schedutil is in use, shouldn't we ask it for a util
> > > "computation" instead of replicating its _current_ heuristic?
> > 
> > Are you proposing to have the 1.25 factor only here and remove it from
> > schedutil?
> 
> I'm only saying that we probably shouldn't have two places where we add
> this 1.25 factor to utilization before using it, as in the future one of
> the two might modify that 1.25 to something else and then we'll have
> problems. So, maybe a common wrapper that adds such a factor?

Ok, I can definitely factor this code out between schedutil and EAS. And
BTW, would it make sense to make schedutil use "capacity_margin" instead
of an arbitrary value? The semantics feel pretty close. Out of curiosity,
what was the reason to use C=1.25 in the first place?
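
For reference, fair.c already defines the same ratio at this point; the
snippet below just restates the arithmetic:

---8<---
/* From kernel/sched/fair.c: */
unsigned long capacity_margin = 1280; /* ~20% */

/*
 * util * capacity_margin / SCHED_CAPACITY_SCALE, with
 * SCHED_CAPACITY_SCALE == 1024, equals util + (util >> 2):
 * the same C = 1.25 headroom schedutil open-codes.
 */
---8<---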

> 
> > 
> > > I fear the two might diverge in the future.
> > 
> > That could be avoided by factoring out from schedutil the
> > "compensation" factor into a proper function to be used by all the
> > interested playes, isn't it?
> 
> And I should have read till the end before writing the above paragraph
> it seems. :)
> 
> Thanks,
> 
> - Juri

Thank you very much for the feedback !
Quentin Perret March 21, 2018, 2:02 p.m. UTC | #6
On Wednesday 21 Mar 2018 at 12:26:21 (+0000), Patrick Bellasi wrote:
> On 21-Mar 10:04, Juri Lelli wrote:
> > Hi,
> > 
> > On 20/03/18 09:43, Dietmar Eggemann wrote:
> > > From: Quentin Perret <quentin.perret@arm.com>
> > > 
> > > In preparation for the definition of an energy-aware wakeup path, a
> > > helper function is provided to estimate the effect on system energy
> > > when a specific task wakes up on a specific CPU. compute_energy()
> > > estimates the OPPs to be reached by all frequency domains and the
> > > consumption of each online CPU according to its energy model and its
> > > percentage of busy time.
> > > 
> > > Cc: Ingo Molnar <mingo@redhat.com>
> > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > Signed-off-by: Quentin Perret <quentin.perret@arm.com>
> > > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > > ---
> > >  kernel/sched/fair.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 81 insertions(+)
> > > 
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 6c72a5e7b1b0..76bd46502486 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -6409,6 +6409,30 @@ static inline int cpu_overutilized(int cpu)
> > >  }
> > >  
> > >  /*
> > > + * Returns the util of "cpu" if "p" wakes up on "dst_cpu".
> > > + */
> > > +static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
> > > +{
> > > +	unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
> > 
> > What about other classes? Shouldn't we now also take into account
> > DEADLINE (as schedutil does)?
> 
> Good point, although that would likely require factoring the
> utilization aggregation function out of schedutil, wouldn't it?
> 
> > BTW, we now also have a getter method in sched/sched.h; it takes
> > UTIL_EST into account, though. Do we need to take that into account when
> > estimating energy consumption?
> 
> Actually I think that this whole function can be written "just" as:
> 
> ---8<---
>    unsigned long util = cpu_util_wake(cpu, p);
> 
>    if (cpu != dst_cpu)
>         return util;
> 
>    return min(util + task_util(p), capacity_orig_of(cpu));
> ---8<---
> 

Yes, this should be functionally equivalent. However, with your
suggestion you can potentially remove the task contribution from the
cpu_util in cpu_util_wake() and then add it back right after if
cpu==dst_cpu. This is sub-optimal and that's why I implemented things
slightly differently. But maybe this optimization really is too small to
justify the extra complexity involved ...

> which will reuse existing functions as well as getting for free other
> stuff (like the CPU util_est).
> 
> Considering your observation above, it also makes it easy to add the
> DEADLINE and RT utilizations into util, just before returning the
> value.
> 
> > > +	unsigned long capacity = capacity_orig_of(cpu);
> > > +
> > > +	/*
> > > +	 * If p is where it should be, or if it has no impact on cpu, there is
> > > +	 * not much to do.
> > > +	 */
> > > +	if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu))
> > > +		goto clamp_util;
> > > +
> > > +	if (dst_cpu == cpu)
> > > +		util += task_util(p);
> > > +	else
> > > +		util = max_t(long, util - task_util(p), 0);
> > > +
> > > +clamp_util:
> > > +	return (util >= capacity) ? capacity : util;
> > > +}
> > > +
> > > +/*
> > >   * Disable WAKE_AFFINE in the case where task @p doesn't fit in the
> > >   * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
> > >   *
> > > @@ -6432,6 +6456,63 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> > >  	return !util_fits_capacity(task_util(p), min_cap);
> > >  }
> > >  
> > > +static struct capacity_state *find_cap_state(int cpu, unsigned long util)
> > > +{
> > > +	struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu);
> > > +	struct capacity_state *cs = NULL;
> > > +	int i;
> > > +
> > > +	/*
> > > +	 * As the goal is to estimate the OPP reached for a specific util
> > > +	 * value, mimic the behaviour of schedutil with a 1.25 coefficient
> > > +	 */
> > > +	util += util >> 2;
> > 
> > What about other governors (ondemand, for example)? Is this supposed to
> > work only when schedutil is in use (if so we should probably make it
> > conditional on that)?
> 
> Yes, I would say that EAS mostly makes sense when you have a "minimum"
> control on OPPs... otherwise all the energy estimations are really
> fuzzy.
> 
> > Also, even when schedutil is in use, shouldn't we ask it for a util
> > "computation" instead of replicating its _current_ heuristic?
> 
> Are you proposing to have the 1.25 factor only here and remove it from
> schedutil?
> 
> > I fear the two might diverge in the future.
> 
> That could be avoided by factoring the "compensation" factor out of
> schedutil into a proper function to be used by all the interested
> players, couldn't it?
> 
> -- 
> #include <best/regards.h>
> 
> Patrick Bellasi
Quentin Perret March 21, 2018, 2:26 p.m. UTC | #7
On Wednesday 21 Mar 2018 at 12:39:21 (+0000), Patrick Bellasi wrote:
> On 20-Mar 09:43, Dietmar Eggemann wrote:
> > From: Quentin Perret <quentin.perret@arm.com>
> 
> [...]
> 
> > +static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
> > +{
> > +	unsigned long util, fdom_max_util;
> > +	struct capacity_state *cs;
> > +	unsigned long energy = 0;
> > +	struct freq_domain *fdom;
> > +	int cpu;
> > +
> > +	for_each_freq_domain(fdom) {
> > +		fdom_max_util = 0;
> > +		for_each_cpu_and(cpu, &(fdom->span), cpu_online_mask) {
> > +			util = cpu_util_next(cpu, p, dst_cpu);
> 
> It would be nice to find a way to cache all these util values and reuse
> them below... even just to ensure data consistency between the "cs"
> computation and its usage...

So actually, what I can do is add something like

    fdom_tot_util += util;

to this loop and compute

    energy = cs->power * fdom_tot_util / cs->cap;

only once, instead of having the second loop to compute the energy. We don't
have to scale the util for each and every CPU since they share the same
cap state. That would save some divisions and ensure the consistency
between the selection of the cap state and the associated energy
computation. What do you think?

Or maybe you were talking about consistency between several consecutive
calls to compute_energy()?
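
For clarity, the reworked function would look roughly like this (a
sketch of the proposal above, not the code actually posted in this
RFC):

---8<---
static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
{
        unsigned long util, fdom_max_util, fdom_tot_util;
        struct capacity_state *cs;
        unsigned long energy = 0;
        struct freq_domain *fdom;
        int cpu;

        for_each_freq_domain(fdom) {
                fdom_max_util = fdom_tot_util = 0;
                for_each_cpu_and(cpu, &(fdom->span), cpu_online_mask) {
                        util = cpu_util_next(cpu, p, dst_cpu);
                        fdom_max_util = max(util, fdom_max_util);
                        fdom_tot_util += util;
                }

                /* One capacity state per frequency domain, as before */
                cpu = cpumask_first(&(fdom->span));
                cs = find_cap_state(cpu, fdom_max_util);

                /* Scale the summed utilization once per domain */
                energy += cs->power * fdom_tot_util / cs->cap;
        }

        return energy;
}
---8<---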

> 
> > +			fdom_max_util = max(util, fdom_max_util);
> > +		}
> > +
> > +		/*
> > +		 * Here we assume that the capacity states of CPUs belonging to
> > +		 * the same frequency domains are shared. Hence, we look at the
> > +		 * capacity state of the first CPU and re-use it for all.
> > +		 */
> > +		cpu = cpumask_first(&(fdom->span));
> > +		cs = find_cap_state(cpu, fdom_max_util);
>                 ^^^^
> 
> The above code could theoretically return NULL, although likely EAS is
> completely disabled if em->nr_cap_states == 0, right?

That's right. sched_energy_present cannot be enabled with
em->nr_cap_states == 0, and compute_energy() is never called without
sched_energy_present in the proposed implementation.

> 
> If that's the case then, in the previous function, you can certainly
> avoid the initialization of *cs and maybe also add an explicit:
> 
>     BUG_ON(em->nr_cap_states == 0);
> 
> which helps even just as "in code documentation".
> 
> But, I'm not sure if maintainers like BUG_ON in scheduler code :)

Yes, I'm not sure about the BUG_ON either :). I agree that it would be
nice to document somewhere that compute_energy() is unsafe to call
without sched_energy_present. I can simply add a proper doc comment to
this function actually. Would that work?
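
Something along these lines, say (wording illustrative only):

---8<---
/*
 * compute_energy(): Estimate the total energy consumed by the online
 * CPUs if task @p were to wake up on @dst_cpu.
 *
 * Must not be called without an energy model, i.e. when
 * sched_energy_present is false: it assumes every online CPU belongs
 * to a frequency domain with valid capacity states.
 */
---8<---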

> 
> 
> > +
> > +		/*
> > +		 * The energy consumed by each CPU is derived from the power
>                       ^
> 
> Should we make it more explicit that this is just the "active" energy
> consumed?

Ok, np.

> 
> > +		 * it dissipates at the expected OPP and its percentage of
> > +		 * busy time.
> > +		 */
> > +		for_each_cpu_and(cpu, &(fdom->span), cpu_online_mask) {
> > +			util = cpu_util_next(cpu, p, dst_cpu);
> > +			energy += cs->power * util / cs->cap;
> > +		}
> > +	}
> 
> nit-pick: empty line before return?

Will do.

> 
> > +	return energy;
> > +}
> > +
> >  /*
> >   * select_task_rq_fair: Select target runqueue for the waking task in domains
> >   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> > -- 
> > 2.11.0
> > 
> 
> -- 
> #include <best/regards.h>
> 
> Patrick Bellasi

Thanks,
Quentin
Juri Lelli March 21, 2018, 2:50 p.m. UTC | #8
On 21/03/18 14:26, Quentin Perret wrote:
> On Wednesday 21 Mar 2018 at 12:39:21 (+0000), Patrick Bellasi wrote:
> > On 20-Mar 09:43, Dietmar Eggemann wrote:

[...]

> > 
> > If that's the case then, in the previous function, you can certainly
> > avoid the initialization of *cs and maybe also add an explicit:
> > 
> >     BUG_ON(em->nr_cap_states == 0);
> > 
> > which helps even just as "in code documentation".
> > 
> > But, I'm not sure if maintainers like BUG_ON in scheduler code :)
> 
> Yes, I'm not sure about the BUG_ON either :). I agree that it would be
> nice to document somewhere that compute_energy() is unsafe to call
> without sched_energy_present. I can simply add a proper doc comment to
> this function actually. Would that work?

If it is something that must not happen and it is also non-recoverable
at runtime, then...

$ git grep BUG_ON -- kernel/sched/ | wc -l
50

:)
Juri Lelli March 21, 2018, 3:15 p.m. UTC | #9
On 21/03/18 13:55, Quentin Perret wrote:
> On Wednesday 21 Mar 2018 at 13:59:25 (+0100), Juri Lelli wrote:
> > On 21/03/18 12:26, Patrick Bellasi wrote:
> > > On 21-Mar 10:04, Juri Lelli wrote:

[...]

> > > > > +	/*
> > > > > +	 * As the goal is to estimate the OPP reached for a specific util
> > > > > +	 * value, mimic the behaviour of schedutil with a 1.25 coefficient
> > > > > +	 */
> > > > > +	util += util >> 2;
> > > > 
> > > > What about other governors (ondemand, for example)? Is this supposed to
> > > > work only when schedutil is in use (if so we should probably make it
> > > > conditional on that)?
> > > 
> > > Yes, I would say that EAS mostly makes sense when you have a "minimum"
> > > control on OPPs... otherwise all the energy estimations are really
> > > fuzzy.
> > 
> > Makes sense to me. Shouldn't we then make all this conditional on using
> > schedutil?
> 
> So, in theory, EAS could make sense even for other governors than
> schedutil. Even with the performance governor it is probably more
> energy efficient (although users using "performance" probably don't care
> about energy, but that's just an example) to place small tasks onto little
> CPUs up to a certain point given by the energy model. The ideal solution
> would be to change the behaviour of find_cap_state() depending on the
> governor being used, but I don't know if this extra complexity is worth
> it really.
> I'm happy to make all this conditional on schedutil as a first step and
> we can see later if that makes sense to extend EAS to other use-cases.

I agree that EAS still makes sense even for !schedutil cases (your
performance example being one of them, powersave maybe another one?).
Making it work with ondemand is tricky, though.

So, not sure what's the best thing to do, but we should at least be
aware of the limitations.

> 
> > 
> > > 
> > > > Also, even when schedutil is in use, shouldn't we ask it for a util
> > > > "computation" instead of replicating its _current_ heuristic?
> > > 
> > > Are you proposing to have the 1.25 factor only here and remove it from
> > > schedutil?
> > 
> > I'm only saying that we probably shouldn't have two places where we add
> > this 1.25 factor to utilization before using it, as in the future one of
> > the two might modify that 1.25 to something else and then we'll have
> > problems. So, maybe a common wrapper that adds such a factor?
> 
> Ok, I can definitely factor this code out between schedutil and EAS. And
> BTW, would it make sense to make schedutil use "capacity_margin" instead
> of an arbitrary value? The semantics feel pretty close. Out of curiosity,
> what was the reason to use C=1.25 in the first place?

I seem to remember it was chosen based on experiments, but I may well
be wrong and Rafael, Viresh, or others will correct me. :)

> 
> > 
> > > 
> > > > I fear the two might diverge in the future.
> > > 
> > > That could be avoided by factoring the "compensation" factor out of
> > > schedutil into a proper function to be used by all the interested
> > > players, couldn't it?
> > 
> > And I should have read till the end before writing the above paragraph
> > it seems. :)
> > 
> > Thanks,
> > 
> > - Juri
> 
> Thank you very much for the feedback !

No problem. I'm of course very interested in this. Could you please
directly Cc me (juri.lelli@redhat.com) in next versions?

Thanks,

- Juri
Patrick Bellasi March 21, 2018, 3:54 p.m. UTC | #10
On 21-Mar 14:26, Quentin Perret wrote:
> On Wednesday 21 Mar 2018 at 12:39:21 (+0000), Patrick Bellasi wrote:
> > On 20-Mar 09:43, Dietmar Eggemann wrote:
> > > From: Quentin Perret <quentin.perret@arm.com>
> > 
> > [...]
> > 
> > > +static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
> > > +{
> > > +	unsigned long util, fdom_max_util;
> > > +	struct capacity_state *cs;
> > > +	unsigned long energy = 0;
> > > +	struct freq_domain *fdom;
> > > +	int cpu;
> > > +
> > > +	for_each_freq_domain(fdom) {
> > > +		fdom_max_util = 0;
> > > +		for_each_cpu_and(cpu, &(fdom->span), cpu_online_mask) {
> > > +			util = cpu_util_next(cpu, p, dst_cpu);
> > 
> > It would be nice to find a way to cache all these util values and reuse
> > them below... even just to ensure data consistency between the "cs"
> > computation and its usage...
> 
> So actually, what I can do is add something like
> 
>     fdom_tot_util += util;
> 
> to this loop and compute
> 
>     energy = cs->power * fdom_tot_util / cs->cap;
> 
> only once, instead of having the second loop to compute the energy. We don't
> have to scale the util for each and every CPU since they share the same
> cap state. That would save some divisions and ensure the consistency
> between the selection of the cap state and the associated energy
> computation. What do you think?

Right, I would say that under the hypothesis that we are in the same
frequency domain (and we are, because of fdom->span), that's basically
doing:

   sum_i(P_x * U_i / C_x) => P_x / C_x * sum_i(U_i)

Where (C_x, P_x) are the EM reported capacity and power for the
expected frequency domain OPP.

> Or maybe you were talking about consistency between several consecutive
> calls to compute_energy()?

Nope, the above +1

> > > +			fdom_max_util = max(util, fdom_max_util);
> > > +		}
> > > +
> > > +		/*
> > > +		 * Here we assume that the capacity states of CPUs belonging to
> > > +		 * the same frequency domains are shared. Hence, we look at the
> > > +		 * capacity state of the first CPU and re-use it for all.
> > > +		 */
> > > +		cpu = cpumask_first(&(fdom->span));
> > > +		cs = find_cap_state(cpu, fdom_max_util);
> >                 ^^^^
> > 
> > The above code could theoretically return NULL, although likely EAS is
> > completely disabled if em->nr_cap_states == 0, right?
> 
> That's right. sched_energy_present cannot be enabled with
> em->nr_cap_states == 0, and compute_energy() is never called without
> sched_energy_present in the proposed implementation.
> 
> > 
> > If that's the case then, in the previous function, you can certainly
> > avoid the initialization of *cs and maybe also add an explicit:
> > 
> >     BUG_ON(em->nr_cap_states == 0);
> > 
> > which helps even just as "in code documentation".
> > 
> > But, I'm not sure if maintainers like BUG_ON in scheduler code :)
> 
> Yes, I'm not sure about the BUG_ON either :).

FWIW, there are already some BUG_ONs in fair.c... thus, if they can
pinpoint a specific bug in case of errors, they should be acceptable?

> I agree that it would be nice to document somewhere that
> compute_energy() is unsafe to call without sched_energy_present.
> I can simply add a proper doc comment to this function actually.
> Would that work?

Right, it's just that _maybe_ an explicit BUG_ON improves the
documentation by making the error more explicit during testing?

Thus, I would probably add both... but Peter will tell you for sure ;)
Morten Rasmussen March 21, 2018, 4:26 p.m. UTC | #11
On Wed, Mar 21, 2018 at 04:15:13PM +0100, Juri Lelli wrote:
> On 21/03/18 13:55, Quentin Perret wrote:
> > On Wednesday 21 Mar 2018 at 13:59:25 (+0100), Juri Lelli wrote:
> > > On 21/03/18 12:26, Patrick Bellasi wrote:
> > > > On 21-Mar 10:04, Juri Lelli wrote:
> 
> [...]
> 
> > > > > > +	/*
> > > > > > +	 * As the goal is to estimate the OPP reached for a specific util
> > > > > > +	 * value, mimic the behaviour of schedutil with a 1.25 coefficient
> > > > > > +	 */
> > > > > > +	util += util >> 2;
> > > > > 
> > > > > What about other governors (ondemand, for example)? Is this supposed to
> > > > > work only when schedutil is in use (if so we should probably make it
> > > > > conditional on that)?
> > > > 
> > > > Yes, I would say that EAS mostly makes sense when you have a "minimum"
> > > > control on OPPs... otherwise all the energy estimations are really
> > > > fuzzy.
> > > 
> > > Makes sense to me. Shouldn't we then make all this conditional on using
> > > schedutil?
> > 
> > So, in theory, EAS could make sense even for other governors than
> > schedutil. Even with the performance governor it is probably more
> > energy efficient (although users using "performance" probably don't care
> > about energy, but that's just an example) to place small tasks onto little
> > CPUs up to a certain point given by the energy model. The ideal solution
> > would be to change the behaviour of find_cap_state() depending on the
> > governor being used, but I don't know if this extra complexity is worth
> > it really.
> > I'm happy to make all this conditional on schedutil as a first step and
> > we can see later if that makes sense to extend EAS to other use-cases.
> 
> I agree that EAS still makes sense even for !schedutil cases (your
> performance example being one of them, powersave maybe another one?).
> Making it work with ondemand is tricky, though.
> 
> So, not sure what's the best thing to do, but we should at least be
> aware of the limitations.

I would suggest making as few assumptions about the OPP selection as
possible. Even when we do use schedutil, there could be a number of
reasons why we don't actually get the OPP schedutil requests (thermal,
hardware-says-no,...).

In the previous energy model-driven scheduling postings, years back, I
went with the assumption that OPP would follow the utilization. So if we
put more tasks on a cpu, the OPP would increase to match. If cpufreq or
hardware decided to go faster, that is fine but it could lead to
suboptimal decisions.

We could call into schedutil somehow to make sure that we at least
request the same OPP as the energy model assumes if the overhead is
small and we can present schedutil with all the information it needs to
choose the OPP for the proposed task placement. I wonder if it is worth
it, or if we should just decide on a simple assumption on OPP selection
for energy estimation and stick it in a comment.
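
(The latter could be as simple as a comment above compute_energy(),
e.g.:)

---8<---
/*
 * Energy estimation assumes schedutil-like OPP selection: each
 * frequency domain runs at the lowest OPP whose capacity is >= 1.25
 * times the highest CPU utilization in the domain. Thermal capping and
 * platform-initiated frequency changes are not modelled.
 */
---8<---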

Morten
Juri Lelli March 21, 2018, 5:02 p.m. UTC | #12
On 21/03/18 16:26, Morten Rasmussen wrote:
> On Wed, Mar 21, 2018 at 04:15:13PM +0100, Juri Lelli wrote:
> > On 21/03/18 13:55, Quentin Perret wrote:
> > > On Wednesday 21 Mar 2018 at 13:59:25 (+0100), Juri Lelli wrote:
> > > > On 21/03/18 12:26, Patrick Bellasi wrote:
> > > > > On 21-Mar 10:04, Juri Lelli wrote:
> > 
> > [...]
> > 
> > > > > > > +	/*
> > > > > > > +	 * As the goal is to estimate the OPP reached for a specific util
> > > > > > > +	 * value, mimic the behaviour of schedutil with a 1.25 coefficient
> > > > > > > +	 */
> > > > > > > +	util += util >> 2;
> > > > > > 
> > > > > > What about other governors (ondemand, for example)? Is this supposed to
> > > > > > work only when schedutil is in use (if so we should probably make it
> > > > > > conditional on that)?
> > > > > 
> > > > > Yes, I would say that EAS mostly makes sense when you have a "minimum"
> > > > > control on OPPs... otherwise all the energy estimations are really
> > > > > fuzzy.
> > > > 
> > > > Makes sense to me. Shouldn't we then make all this conditional on using
> > > > schedutil?
> > > 
> > > So, in theory, EAS could make sense even for other governors than
> > > schedutil. Even with the performance governor it is probably more
> > > energy efficient (although users using "performance" probably don't care
> > > about energy, but that's just an example) to place small tasks onto little
> > > CPUs up to a certain point given by the energy model. The ideal solution
> > > would be to change the behaviour of find_cap_state() depending on the
> > > governor being used, but I don't know if this extra complexity is worth
> > > it really.
> > > I'm happy to make all this conditional on schedutil as a first step and
> > > we can see later if that makes sense to extend EAS to other use-cases.
> > 
> > I agree that EAS still makes sense even for !schedutil cases (your
> > performance example being one of them, powersave maybe another one?).
> > Making it work with ondemand is tricky, though.
> > 
> > So, not sure what's the best thing to do, but we should at least be
> > aware of the limitations.
> 
> I would suggest making as few assumptions about the OPP selection as
> possible. Even when we do use schedutil, there could be a number of
> reasons why we don't actually get the OPP schedutil requests (thermal,
> hardware-says-no,...).
> 
> In the previous energy model-driven scheduling postings, years back, I
> went with the assumption that OPP would follow the utilization. So if we
> put more tasks on a cpu, the OPP would increase to match. If cpufreq or
> hardware decided to go faster, that is fine but it could lead to
> suboptimal decisions.
> 
> We could call into schedutil somehow to make sure that we at least
> request the same OPP as the energy model assumes if the overhead is
> small and we can present schedutil with all the information it needs to
> choose the OPP for the proposed task placement. I wonder if it is worth
> it, or if we should just decide on a simple assumption on OPP selection
> for energy estimation and stick it in a comment.

Right, I see your point. Refactoring the 1.25 coefficient calculation
into some getter method hopefully shouldn't add much overhead, but yes,
it might not give us much in terms of correctness in certain situations.

Best,

- Juri
Quentin Perret March 22, 2018, 5:05 a.m. UTC | #13
On Wednesday 21 Mar 2018 at 15:54:58 (+0000), Patrick Bellasi wrote:
> On 21-Mar 14:26, Quentin Perret wrote:
> > On Wednesday 21 Mar 2018 at 12:39:21 (+0000), Patrick Bellasi wrote:
> > > On 20-Mar 09:43, Dietmar Eggemann wrote:
> > > > From: Quentin Perret <quentin.perret@arm.com>

[...]

> > So actually, what I can do is add something like
> > 
> >     fdom_tot_util += util;
> > 
> > to this loop and compute
> > 
> >     energy = cs->power * fdom_tot_util / cs->cap;
> > 
> > only once, instead of having the second loop to compute the energy. We don't
> > have to scale the util for each and every CPU since they share the same
> > cap state. That would save some divisions and ensure the consistency
> > between the selection of the cap state and the associated energy
> > computation. What do you think?
> 
> Right, I would say that under the hypothesis that we are in the same
> frequency domain (and we are, because of fdom->span), that's basically
> doing:
> 
>    sum_i(P_x * U_i / C_x) => P_x / C_x * sum_i(U_i)
> 
> Where (C_x, P_x) are the EM reported capacity and power for the
> expected frequency domain OPP.
> 

Yes, that's exactly it. I'll do the change in v2.

> > Or maybe you were talking about consistency between several consecutive
> > calls to compute_energy()?
> 
> Nope, the above +1
> 

[...]

> > I agree that it would be nice to document somewhere that
> > compute_energy() is unsafe to call without sched_energy_present.
> > I can simply add a proper doc comment to this function actually.
> > Would that work?
> 
> Right, it's just that _maybe_ an explicit BUG_ON improves the
> documentation by making the error more explicit during testing?
> 
> Thus, I would probably add both... but Peter will tell you for sure ;)
> 

Right, but I'm still not sure if the BUG_ON is the right thing to do. I
mean, if we really want to make this check, then we could also try
to recover into a working state ... If we enter compute_energy() without
having an energy model, and if we detect it in time, we could bail out
and disable sched_energy_present ASAP with an error message for example.
That would let us know if EAS is broken without making the system
unusable.
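
Something like this, for instance (a sketch only, placed in
find_cap_state() instead of a BUG_ON(); it assumes sched_energy_present
is a static key and that disabling it from here is safe):

---8<---
        /* Bail out and turn EAS off instead of crashing the box */
        if (unlikely(!em || !em->nr_cap_states)) {
                WARN_ONCE(1, "EAS: no energy model for CPU%d, disabling\n",
                          cpu);
                static_branch_disable(&sched_energy_present);
                return NULL;    /* compute_energy() would bail out on NULL */
        }
---8<---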

Anyway, if there is general agreement, or if the maintainers think
that the BUG_ON is the right thing to do here, I'm happy to change that
in future versions :)

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6c72a5e7b1b0..76bd46502486 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6409,6 +6409,30 @@ static inline int cpu_overutilized(int cpu)
 }
 
 /*
+ * Returns the util of "cpu" if "p" wakes up on "dst_cpu".
+ */
+static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
+{
+	unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
+	unsigned long capacity = capacity_orig_of(cpu);
+
+	/*
+	 * If p is where it should be, or if it has no impact on cpu, there is
+	 * not much to do.
+	 */
+	if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu))
+		goto clamp_util;
+
+	if (dst_cpu == cpu)
+		util += task_util(p);
+	else
+		util = max_t(long, util - task_util(p), 0);
+
+clamp_util:
+	return (util >= capacity) ? capacity : util;
+}
+
+/*
  * Disable WAKE_AFFINE in the case where task @p doesn't fit in the
  * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
  *
@@ -6432,6 +6456,63 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
 	return !util_fits_capacity(task_util(p), min_cap);
 }
 
+static struct capacity_state *find_cap_state(int cpu, unsigned long util)
+{
+	struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu);
+	struct capacity_state *cs = NULL;
+	int i;
+
+	/*
+	 * As the goal is to estimate the OPP reached for a specific util
+	 * value, mimic the behaviour of schedutil with a 1.25 coefficient
+	 */
+	util += util >> 2;
+
+	for (i = 0; i < em->nr_cap_states; i++) {
+		cs = &em->cap_states[i];
+		if (cs->cap >= util)
+			break;
+	}
+
+	return cs;
+}
+
+static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
+{
+	unsigned long util, fdom_max_util;
+	struct capacity_state *cs;
+	unsigned long energy = 0;
+	struct freq_domain *fdom;
+	int cpu;
+
+	for_each_freq_domain(fdom) {
+		fdom_max_util = 0;
+		for_each_cpu_and(cpu, &(fdom->span), cpu_online_mask) {
+			util = cpu_util_next(cpu, p, dst_cpu);
+			fdom_max_util = max(util, fdom_max_util);
+		}
+
+		/*
+		 * Here we assume that the capacity states of CPUs belonging to
+		 * the same frequency domains are shared. Hence, we look at the
+		 * capacity state of the first CPU and re-use it for all.
+		 */
+		cpu = cpumask_first(&(fdom->span));
+		cs = find_cap_state(cpu, fdom_max_util);
+
+		/*
+		 * The energy consumed by each CPU is derived from the power
+		 * it dissipates at the expected OPP and its percentage of
+		 * busy time.
+		 */
+		for_each_cpu_and(cpu, &(fdom->span), cpu_online_mask) {
+			util = cpu_util_next(cpu, p, dst_cpu);
+			energy += cs->power * util / cs->cap;
+		}
+	}
+	return energy;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,