Message ID | 20180320094312.24081-5-dietmar.eggemann@arm.com (mailing list archive) |
---|---|
State | RFC, archived |
Hi, On 20/03/18 09:43, Dietmar Eggemann wrote: > From: Quentin Perret <quentin.perret@arm.com> > > In preparation for the definition of an energy-aware wakeup path, a > helper function is provided to estimate the consequence on system energy > when a specific task wakes-up on a specific CPU. compute_energy() > estimates the OPPs to be reached by all frequency domains and estimates > the consumption of each online CPU according to its energy model and its > percentage of busy time. > > Cc: Ingo Molnar <mingo@redhat.com> > Cc: Peter Zijlstra <peterz@infradead.org> > Signed-off-by: Quentin Perret <quentin.perret@arm.com> > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> > --- > kernel/sched/fair.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 81 insertions(+) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 6c72a5e7b1b0..76bd46502486 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -6409,6 +6409,30 @@ static inline int cpu_overutilized(int cpu) > } > > /* > + * Returns the util of "cpu" if "p" wakes up on "dst_cpu". > + */ > +static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu) > +{ > + unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg; What about other classes? Shouldn't we now also take into account DEADLINE (as schedutil does)? BTW, we now also have a getter method in sched/sched.h; it takes UTIL_EST into account, though. Do we need to take that into account when estimating energy consumption? > + unsigned long capacity = capacity_orig_of(cpu); > + > + /* > + * If p is where it should be, or if it has no impact on cpu, there is > + * not much to do. > + */ > + if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu)) > + goto clamp_util; > + > + if (dst_cpu == cpu) > + util += task_util(p); > + else > + util = max_t(long, util - task_util(p), 0); > + > +clamp_util: > + return (util >= capacity) ? capacity : util; > +} > + > +/* > * Disable WAKE_AFFINE in the case where task @p doesn't fit in the > * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu. > * > @@ -6432,6 +6456,63 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu) > return !util_fits_capacity(task_util(p), min_cap); > } > > +static struct capacity_state *find_cap_state(int cpu, unsigned long util) > +{ > + struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu); > + struct capacity_state *cs = NULL; > + int i; > + > + /* > + * As the goal is to estimate the OPP reached for a specific util > + * value, mimic the behaviour of schedutil with a 1.25 coefficient > + */ > + util += util >> 2; What about other governors (ondemand for example). Is this supposed to work only when schedutil is in use (if so we should probably make it conditional on that)? Also, even when schedutil is in use, shouldn't we ask it for a util "computation" instead of replicating its _current_ heuristic? I fear the two might diverge in the future. Best, - Juri
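For reference, the getters Juri mentions are cpu_util_cfs() and cpu_util_dl() in kernel/sched/sched.h, and cpu_util_cfs() is the one that folds in UTIL_EST when enabled. A minimal sketch of a DEADLINE-aware estimate built on them could look as follows; the helper name cpu_util_total() and the clamping policy are illustrative assumptions, not part of the patch:

---8<---
/*
 * Sketch only: a DEADLINE-aware CPU utilization estimate built on the
 * getters from kernel/sched/sched.h. The helper name is made up here.
 */
static unsigned long cpu_util_total(int cpu)
{
        struct rq *rq = cpu_rq(cpu);

        /* cpu_util_cfs() already accounts for UTIL_EST when enabled */
        unsigned long util = cpu_util_cfs(rq) + cpu_util_dl(rq);

        /* Clamp to the original capacity of the CPU */
        return min(util, capacity_orig_of(cpu));
}
---8<---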
On 21-Mar 10:04, Juri Lelli wrote: > Hi, > > On 20/03/18 09:43, Dietmar Eggemann wrote: > > From: Quentin Perret <quentin.perret@arm.com> > > > > In preparation for the definition of an energy-aware wakeup path, a > > helper function is provided to estimate the consequence on system energy > > when a specific task wakes-up on a specific CPU. compute_energy() > > estimates the OPPs to be reached by all frequency domains and estimates > > the consumption of each online CPU according to its energy model and its > > percentage of busy time. > > > > Cc: Ingo Molnar <mingo@redhat.com> > > Cc: Peter Zijlstra <peterz@infradead.org> > > Signed-off-by: Quentin Perret <quentin.perret@arm.com> > > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> > > --- > > kernel/sched/fair.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++ > > 1 file changed, 81 insertions(+) > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index 6c72a5e7b1b0..76bd46502486 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -6409,6 +6409,30 @@ static inline int cpu_overutilized(int cpu) > > } > > > > /* > > + * Returns the util of "cpu" if "p" wakes up on "dst_cpu". > > + */ > > +static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu) > > +{ > > + unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg; > > What about other classes? Shouldn't we now also take into account > DEADLINE (as schedutil does)? Good point, although that would likely require factoring out the utilization aggregation function from schedutil, wouldn't it? > BTW, we now also have a getter method in sched/sched.h; it takes > UTIL_EST into account, though. Do we need to take that into account when > estimating energy consumption? Actually I think that this whole function can be written "just" as: ---8<--- unsigned long util = cpu_util_wake(cpu); if (cpu != dst_cpu) return util; return min(util + task_util(p), capacity_orig_of(cpu)); ---8<--- which will reuse existing functions as well as getting for free other stuff (like the CPU util_est). Considering your observation above, it also makes it easy to add the DEADLINE and RT utilizations into util, just before returning the value. > > + unsigned long capacity = capacity_orig_of(cpu); > > + > > + /* > > + * If p is where it should be, or if it has no impact on cpu, there is > > + * not much to do. > > + */ > > + if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu)) > > + goto clamp_util; > > + > > + if (dst_cpu == cpu) > > + util += task_util(p); > > + else > > + util = max_t(long, util - task_util(p), 0); > > + > > +clamp_util: > > + return (util >= capacity) ? capacity : util; > > +} > > + > > +/* > > * Disable WAKE_AFFINE in the case where task @p doesn't fit in the > > * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu. > > * > > @@ -6432,6 +6456,63 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu) > > return !util_fits_capacity(task_util(p), min_cap); > > } > > > > +static struct capacity_state *find_cap_state(int cpu, unsigned long util) > > +{ > > + struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu); > > + struct capacity_state *cs = NULL; > > + int i; > > + > > + /* > > + * As the goal is to estimate the OPP reached for a specific util > > + * value, mimic the behaviour of schedutil with a 1.25 coefficient > > + */ > > + util += util >> 2; > > What about other governors (ondemand for example).
Is this supposed to
> work only when schedutil is in use (if so we should probably make it
> conditional on that)?

Yes, I would say that EAS mostly makes sense when you have a "minimum"
control on OPPs... otherwise all the energy estimations are really
fuzzy.

> Also, even when schedutil is in use, shouldn't we ask it for a util
> "computation" instead of replicating its _current_ heuristic?

Are you proposing to have the 1.25 factor only here and remove it from
schedutil?

> I fear the two might diverge in the future.

That could be avoided by factoring out from schedutil the
"compensation" factor into a proper function to be used by all the
interested players, couldn't it?
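The shared "compensation" helper Patrick is asking for could be as small as the sketch below; the name and placement are hypothetical. The 1.25 coefficient it encodes is the headroom schedutil applies in get_next_freq() (kernel/sched/cpufreq_schedutil.c):

---8<---
/*
 * Hypothetical shared helper: the 25% headroom schedutil applies when
 * mapping utilization to a frequency, factored out so that EAS and
 * schedutil cannot silently diverge.
 */
static inline unsigned long util_with_margin(unsigned long util)
{
        /* util + util/4 == 1.25 * util, i.e. schedutil's C = 1.25 */
        return util + (util >> 2);
}
---8<---

find_cap_state() and get_next_freq() would then both call this instead of open-coding the shift. Note also that fair.c's capacity_margin (1280) expresses the same ratio, since 1280/1024 == 1.25, which is the point Quentin raises further down the thread.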
On 20-Mar 09:43, Dietmar Eggemann wrote: > From: Quentin Perret <quentin.perret@arm.com> [...] > +static unsigned long compute_energy(struct task_struct *p, int dst_cpu) > +{ > + unsigned long util, fdom_max_util; > + struct capacity_state *cs; > + unsigned long energy = 0; > + struct freq_domain *fdom; > + int cpu; > + > + for_each_freq_domain(fdom) { > + fdom_max_util = 0; > + for_each_cpu_and(cpu, &(fdom->span), cpu_online_mask) { > + util = cpu_util_next(cpu, p, dst_cpu); Would be nice to find a way to cache all these util and reuse them below... even just to ensure data consistency between the "cs" computation and its usage... > + fdom_max_util = max(util, fdom_max_util); > + } > + > + /* > + * Here we assume that the capacity states of CPUs belonging to > + * the same frequency domains are shared. Hence, we look at the > + * capacity state of the first CPU and re-use it for all. > + */ > + cpu = cpumask_first(&(fdom->span)); > + cs = find_cap_state(cpu, fdom_max_util); ^^^^ The above code could theoretically return NULL, although likely EAS is completely disabled if em->nb_cap_states == 0, right? If that's the case then, in the previous function, you can certainly avoid the initialization of *cs and maybe also add an explicit: BUG_ON(em->nb_cap_states == 0); which helps even just as "in code documentation". But, I'm not sure if maintainers like BUG_ON in scheduler code :) > + > + /* > + * The energy consumed by each CPU is derived from the power ^ Should we make more explicit that this is just the "active" energy consumed? > + * it dissipates at the expected OPP and its percentage of > + * busy time. > + */ > + for_each_cpu_and(cpu, &(fdom->span), cpu_online_mask) { > + util = cpu_util_next(cpu, p, dst_cpu); > + energy += cs->power * util / cs->cap; > + } > + } nit-pick: empty line before return? > + return energy; > +} > + > /* > * select_task_rq_fair: Select target runqueue for the waking task in domains > * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE, > -- > 2.11.0 >
On 21/03/18 12:26, Patrick Bellasi wrote: > On 21-Mar 10:04, Juri Lelli wrote: > > Hi, > > > > On 20/03/18 09:43, Dietmar Eggemann wrote: > > > From: Quentin Perret <quentin.perret@arm.com> > > > > > > In preparation for the definition of an energy-aware wakeup path, a > > > helper function is provided to estimate the consequence on system energy > > > when a specific task wakes-up on a specific CPU. compute_energy() > > > estimates the OPPs to be reached by all frequency domains and estimates > > > the consumption of each online CPU according to its energy model and its > > > percentage of busy time. > > > > > > Cc: Ingo Molnar <mingo@redhat.com> > > > Cc: Peter Zijlstra <peterz@infradead.org> > > > Signed-off-by: Quentin Perret <quentin.perret@arm.com> > > > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> > > > --- > > > kernel/sched/fair.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++ > > > 1 file changed, 81 insertions(+) > > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > index 6c72a5e7b1b0..76bd46502486 100644 > > > --- a/kernel/sched/fair.c > > > +++ b/kernel/sched/fair.c > > > @@ -6409,6 +6409,30 @@ static inline int cpu_overutilized(int cpu) > > > } > > > > > > /* > > > + * Returns the util of "cpu" if "p" wakes up on "dst_cpu". > > > + */ > > > +static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu) > > > +{ > > > + unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg; > > > > What about other classes? Shouldn't we now also take into account > > DEADLINE (as schedutil does)? > > Good point, although that would likely require to factor out from > schedutil the utilization aggregation function, isn't it? Maybe, or simply use getter methods and aggregate again here. > > > BTW, we now also have a getter method in sched/sched.h; it takes > > UTIL_EST into account, though. Do we need to take that into account when > > estimating energy consumption? > > Actually I think that this whole function can be written "just" as: > > ---8<--- > unsigned long util = cpu_util_wake(cpu); > > if (cpu != dst_cpu) > return util; > > return min(util + task_util(p), capacity_orig_of(cpu)); > ---8<--- > > which will reuse existing functions as well as getting for free other > stuff (like the CPU util_est). > > Considering your observation above, it makes also easy to add into > util the DEADLINE and RT utilizations, just before returning the > value. Well, for RT we should probably consider the fact that schedutil is going to select max OPP... Apart from that I guess it could work like you said. > > > > + unsigned long capacity = capacity_orig_of(cpu); > > > + > > > + /* > > > + * If p is where it should be, or if it has no impact on cpu, there is > > > + * not much to do. > > > + */ > > > + if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu)) > > > + goto clamp_util; > > > + > > > + if (dst_cpu == cpu) > > > + util += task_util(p); > > > + else > > > + util = max_t(long, util - task_util(p), 0); > > > + > > > +clamp_util: > > > + return (util >= capacity) ? capacity : util; > > > +} > > > + > > > +/* > > > * Disable WAKE_AFFINE in the case where task @p doesn't fit in the > > > * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
> > > * > > > @@ -6432,6 +6456,63 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu) > > > return !util_fits_capacity(task_util(p), min_cap); > > > } > > > > > > +static struct capacity_state *find_cap_state(int cpu, unsigned long util) > > > +{ > > > + struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu); > > > + struct capacity_state *cs = NULL; > > > + int i; > > > + > > > + /* > > > + * As the goal is to estimate the OPP reached for a specific util > > > + * value, mimic the behaviour of schedutil with a 1.25 coefficient > > > + */ > > > + util += util >> 2; > > > > What about other governors (ondemand for example). Is this supposed to > > work only when schedutil is in use (if so we should probably make it > > conditional on that)? > > Yes, I would say that EAS mostly makes sense when you have a "minimum" > control on OPPs... otherwise all the energy estimations are really > fuzzy. Makes sense to me. Shouldn't we then make all this conditional on using schedutil? > > > Also, even when schedutil is in use, shouldn't we ask it for a util > > "computation" instead of replicating its _current_ heuristic? > > Are you proposing to have the 1.25 factor only here and remove it from > schedutil? I'm only saying that we probably shouldn't have two places where we add this 1.25 factor to utilization before using it, as in the future one of the two might modify that 1.25 to something else and then we'll have problems. So, maybe a common wrapper that adds such factor? > > > I fear the two might diverge in the future. > > That could be avoided by factoring out from schedutil the > "compensation" factor into a proper function to be used by all the > interested playes, isn't it? And I should have read till the end before writing the above paragraph it seems. :) Thanks, - Juri
On Wednesday 21 Mar 2018 at 13:59:25 (+0100), Juri Lelli wrote: > On 21/03/18 12:26, Patrick Bellasi wrote: > > On 21-Mar 10:04, Juri Lelli wrote: > > > Hi, > > > > > > On 20/03/18 09:43, Dietmar Eggemann wrote: > > > > From: Quentin Perret <quentin.perret@arm.com> > > > > > > > > In preparation for the definition of an energy-aware wakeup path, a > > > > helper function is provided to estimate the consequence on system energy > > > > when a specific task wakes-up on a specific CPU. compute_energy() > > > > estimates the OPPs to be reached by all frequency domains and estimates > > > > the consumption of each online CPU according to its energy model and its > > > > percentage of busy time. > > > > > > > > Cc: Ingo Molnar <mingo@redhat.com> > > > > Cc: Peter Zijlstra <peterz@infradead.org> > > > > Signed-off-by: Quentin Perret <quentin.perret@arm.com> > > > > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> > > > > --- > > > > kernel/sched/fair.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++ > > > > 1 file changed, 81 insertions(+) > > > > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > > index 6c72a5e7b1b0..76bd46502486 100644 > > > > --- a/kernel/sched/fair.c > > > > +++ b/kernel/sched/fair.c > > > > @@ -6409,6 +6409,30 @@ static inline int cpu_overutilized(int cpu) > > > > } > > > > > > > > /* > > > > + * Returns the util of "cpu" if "p" wakes up on "dst_cpu". > > > > + */ > > > > +static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu) > > > > +{ > > > > + unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg; > > > > > > What about other classes? Shouldn't we now also take into account > > > DEADLINE (as schedutil does)? > > > > Good point, although that would likely require to factor out from > > schedutil the utilization aggregation function, isn't it? > > Maybe, or simply use getter methods and aggregate again here. I agree with you both, taking DL into account here is most likely the right thing to do. Thinking about this, there are other places in those patches where we should really use capacity_of() instead of capacity_orig_of() (in task_fits() in patch 5/6 for ex) in order to avoid CPUs under heavy RT pressure. I'll try to improve the integration with other scheduling classes in v2. > > > > > > BTW, we now also have a getter method in sched/sched.h; it takes > > > UTIL_EST into account, though. Do we need to take that into account when > > > estimating energy consumption? Yes, I think that using UTIL_EST makes a lot of sense for energy calculation. This is what is used for frequency selection (with schedutil obviously) and this is also our best guess on how much time a task will spend on a CPU. V2 will be rebased on the latest tip/sched/core and I'll make sure to integrate things better with util_est. > > > > Actually I think that this whole function can be written "just" as: > > > > ---8<--- > > unsigned long util = cpu_util_wake(cpu); > > > > if (cpu != dst_cpu) > > return util; > > > > return min(util + task_util(p), capacity_orig_of(cpu)); > > ---8<--- > > > > which will reuse existing functions as well as getting for free other > > stuff (like the CPU util_est). > > > > Considering your observation above, it makes also easy to add into > > util the DEADLINE and RT utilizations, just before returning the > > value. > > Well, for RT we should problably consider the fact that schedutil is > going to select max OPP... 
Right, but I need to think about the right place to put that, and how to compute the energy accurately in this case. Some modification might also be required in find_cap_state() (patch 5/6). > > Apart from that I guess it could work like you said. > > > > > > > + unsigned long capacity = capacity_orig_of(cpu); > > > > + > > > > + /* > > > > + * If p is where it should be, or if it has no impact on cpu, there is > > > > + * not much to do. > > > > + */ > > > > + if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu)) > > > > + goto clamp_util; > > > > + > > > > + if (dst_cpu == cpu) > > > > + util += task_util(p); > > > > + else > > > > + util = max_t(long, util - task_util(p), 0); > > > > + > > > > +clamp_util: > > > > + return (util >= capacity) ? capacity : util; > > > > +} > > > > + > > > > +/* > > > > * Disable WAKE_AFFINE in the case where task @p doesn't fit in the > > > > * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu. > > > > * > > > > @@ -6432,6 +6456,63 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu) > > > > return !util_fits_capacity(task_util(p), min_cap); > > > > } > > > > > > > > +static struct capacity_state *find_cap_state(int cpu, unsigned long util) > > > > +{ > > > > + struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu); > > > > + struct capacity_state *cs = NULL; > > > > + int i; > > > > + > > > > + /* > > > > + * As the goal is to estimate the OPP reached for a specific util > > > > + * value, mimic the behaviour of schedutil with a 1.25 coefficient > > > > + */ > > > > + util += util >> 2; > > > > > > What about other governors (ondemand for example). Is this supposed to > > > work only when schedutil is in use (if so we should probably make it > > > conditional on that)? > > > > Yes, I would say that EAS mostly makes sense when you have a "minimum" > > control on OPPs... otherwise all the energy estimations are really > > fuzzy. > > Make sense to me. Shouldn't we then make all this conditional on using > schedutil? So, in theory, EAS could make sense even for other governors than schedutil. Even with the performance governor it is probably more energy efficient (although users using "performance" probably don't care about energy, but that's just an example) to place small tasks onto little CPUs up to a certain point given by the energy model. The ideal solution would be to change the behaviour of find_cap_state() depending on the governor being used, but I don't know if this extra complexity is worth it really. I'm happy to make all this conditional on schedutil as a first step and we can see later if that makes sense to extend EAS to other use-cases. > > > > > > Also, even when schedutil is in use, shouldn't we ask it for a util > > > "computation" instead of replicating its _current_ heuristic? > > > > Are you proposing to have the 1.25 factor only here and remove it from > > schedutil? > > I'm only saying that we shouldn't probably have two places where we add > this 1.25 factor to utilization before using it, as in the future one of > the two might modify that 1.25 to something else and then we'll have > problems. So, maybe a common wrapper that adds such factor? Ok, I can definitely factorize this code between schedutil and EAS. And BTW, would it make sense to make schedutil use "capacity_margin" instead of an arbitrary value ? The semantics feels pretty close. Out of curiosity, what was the reason to use C=1.25 in the first place ? > > > > > > I fear the two might diverge in the future. 
> > That could be avoided by factoring out from schedutil the
> > "compensation" factor into a proper function to be used by all the
> > interested playes, isn't it?
>
> And I should have read till the end before writing the above paragraph
> it seems. :)
>
> Thanks,
>
> - Juri

Thank you very much for the feedback!
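On the capacity_of() point raised earlier in this email: capacity_orig_of() returns the full original capacity of a CPU, while capacity_of() reflects what remains once RT/IRQ pressure has been subtracted, so a fit test based on the latter naturally steers tasks away from CPUs under heavy RT load. A sketch of what the task_fits() change hinted at for v2 might look like (illustrative only; task_fits() lives in patch 5/6, which is not shown here):

---8<---
/*
 * Illustrative only: a fit test accounting for RT/IRQ pressure by
 * using capacity_of() (remaining capacity) instead of
 * capacity_orig_of() (full original capacity).
 */
static inline bool task_fits(struct task_struct *p, int cpu)
{
        return util_fits_capacity(task_util(p), capacity_of(cpu));
}
---8<---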
On Wednesday 21 Mar 2018 at 12:26:21 (+0000), Patrick Bellasi wrote: > On 21-Mar 10:04, Juri Lelli wrote: > > Hi, > > > > On 20/03/18 09:43, Dietmar Eggemann wrote: > > > From: Quentin Perret <quentin.perret@arm.com> > > > > > > In preparation for the definition of an energy-aware wakeup path, a > > > helper function is provided to estimate the consequence on system energy > > > when a specific task wakes-up on a specific CPU. compute_energy() > > > estimates the OPPs to be reached by all frequency domains and estimates > > > the consumption of each online CPU according to its energy model and its > > > percentage of busy time. > > > > > > Cc: Ingo Molnar <mingo@redhat.com> > > > Cc: Peter Zijlstra <peterz@infradead.org> > > > Signed-off-by: Quentin Perret <quentin.perret@arm.com> > > > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> > > > --- > > > kernel/sched/fair.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++ > > > 1 file changed, 81 insertions(+) > > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > index 6c72a5e7b1b0..76bd46502486 100644 > > > --- a/kernel/sched/fair.c > > > +++ b/kernel/sched/fair.c > > > @@ -6409,6 +6409,30 @@ static inline int cpu_overutilized(int cpu) > > > } > > > > > > /* > > > + * Returns the util of "cpu" if "p" wakes up on "dst_cpu". > > > + */ > > > +static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu) > > > +{ > > > + unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg; > > > > What about other classes? Shouldn't we now also take into account > > DEADLINE (as schedutil does)? > > Good point, although that would likely require to factor out from > schedutil the utilization aggregation function, isn't it? > > > BTW, we now also have a getter method in sched/sched.h; it takes > > UTIL_EST into account, though. Do we need to take that into account when > > estimating energy consumption? > > Actually I think that this whole function can be written "just" as: > > ---8<--- > unsigned long util = cpu_util_wake(cpu); > > if (cpu != dst_cpu) > return util; > > return min(util + task_util(p), capacity_orig_of(cpu)); > ---8<--- > Yes this should be functionally equivalent. However, with your suggestion you can potentially remove the task contribution from the cpu_util in cpu_util_wake() and then add it back right after if cpu==dst_cpu. This is sub-optimal and that's why I implemented things slightly differently. But maybe this optimization really is too small to justify the extra complexity involved ... > which will reuse existing functions as well as getting for free other > stuff (like the CPU util_est). > > Considering your observation above, it makes also easy to add into > util the DEADLINE and RT utilizations, just before returning the > value. > > > > + unsigned long capacity = capacity_orig_of(cpu); > > > + > > > + /* > > > + * If p is where it should be, or if it has no impact on cpu, there is > > > + * not much to do. > > > + */ > > > + if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu)) > > > + goto clamp_util; > > > + > > > + if (dst_cpu == cpu) > > > + util += task_util(p); > > > + else > > > + util = max_t(long, util - task_util(p), 0); > > > + > > > +clamp_util: > > > + return (util >= capacity) ? capacity : util; > > > +} > > > + > > > +/* > > > * Disable WAKE_AFFINE in the case where task @p doesn't fit in the > > > * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu. 
> > > * > > > @@ -6432,6 +6456,63 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu) > > > return !util_fits_capacity(task_util(p), min_cap); > > > } > > > > > > +static struct capacity_state *find_cap_state(int cpu, unsigned long util) > > > +{ > > > + struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu); > > > + struct capacity_state *cs = NULL; > > > + int i; > > > + > > > + /* > > > + * As the goal is to estimate the OPP reached for a specific util > > > + * value, mimic the behaviour of schedutil with a 1.25 coefficient > > > + */ > > > + util += util >> 2; > > > > What about other governors (ondemand for example). Is this supposed to > > work only when schedutil is in use (if so we should probably make it > > conditional on that)? > > Yes, I would say that EAS mostly makes sense when you have a "minimum" > control on OPPs... otherwise all the energy estimations are really > fuzzy. > > > Also, even when schedutil is in use, shouldn't we ask it for a util > > "computation" instead of replicating its _current_ heuristic? > > Are you proposing to have the 1.25 factor only here and remove it from > schedutil? > > > I fear the two might diverge in the future. > > That could be avoided by factoring out from schedutil the > "compensation" factor into a proper function to be used by all the > interested playes, isn't it? > > -- > #include <best/regards.h> > > Patrick Bellasi
On Wednesday 21 Mar 2018 at 12:39:21 (+0000), Patrick Bellasi wrote: > On 20-Mar 09:43, Dietmar Eggemann wrote: > > From: Quentin Perret <quentin.perret@arm.com> > > [...] > > > +static unsigned long compute_energy(struct task_struct *p, int dst_cpu) > > +{ > > + unsigned long util, fdom_max_util; > > + struct capacity_state *cs; > > + unsigned long energy = 0; > > + struct freq_domain *fdom; > > + int cpu; > > + > > + for_each_freq_domain(fdom) { > > + fdom_max_util = 0; > > + for_each_cpu_and(cpu, &(fdom->span), cpu_online_mask) { > > + util = cpu_util_next(cpu, p, dst_cpu); > > Would be nice to find a way to cache all these util and reuse them > below... even just to ensure data consistency between the "cs" > computation and its usage... So actually, what I can do is add something like fdom_tot_util += util; to this loop and compute energy = cs->power * fdom_tot_util / cs->cap; only once, instead of having the second loop to compute the energy. We don't have to scale the util for each and every CPU since they share the same cap state. That would save some divisions and ensure the consistency between the selection of the cap state and the associated energy computation. What do you think ? Or maybe you were talking about consistency between several consecutive calls to compute_energy() ? > > > + fdom_max_util = max(util, fdom_max_util); > > + } > > + > > + /* > > + * Here we assume that the capacity states of CPUs belonging to > > + * the same frequency domains are shared. Hence, we look at the > > + * capacity state of the first CPU and re-use it for all. > > + */ > > + cpu = cpumask_first(&(fdom->span)); > > + cs = find_cap_state(cpu, fdom_max_util); > ^^^^ > > The above code could theoretically return NULL, although likely EAS is > completely disabled if em->nb_cap_states == 0, right? That's right. sched_energy_present cannot be enabled with em->nb_cap_states == 0, and compute_energy() is never called without sched_energy_present in the proposed implementation. > > If that's the case then, in the previous function, you can certainly > avoid the initialization of *cs and maybe also add an explicit: > > BUG_ON(em->nb_cap_states == 0); > > which helps even just as "in code documentation". > > But, I'm not sure if maintainers like BUG_ON in scheduler code :) Yes, I'm not sure about the BUG_ON either :). I agree that it would be nice to document somewhere that compute_energy() is unsafe to call without sched_energy_present. I can simply add a proper doc comment to this function actually. Would that work ? > > > > + > > + /* > > + * The energy consumed by each CPU is derived from the power > ^ > > Should we make more explicit that this is just the "active" energy > consumed? Ok, np. > > > + * it dissipates at the expected OPP and its percentage of > > + * busy time. > > + */ > > + for_each_cpu_and(cpu, &(fdom->span), cpu_online_mask) { > > + util = cpu_util_next(cpu, p, dst_cpu); > > + energy += cs->power * util / cs->cap; > > + } > > + } > > nit-pick: empty line before return? Will do. > > > + return energy; > > +} > > + > > /* > > * select_task_rq_fair: Select target runqueue for the waking task in domains > > * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE, > > -- > > 2.11.0 > > > > -- > #include <best/regards.h> > > Patrick Bellasi Thanks, Quentin
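Put together, the single-pass variant Quentin describes could look like the following sketch (same structures as the RFC, and the same assumption that all CPUs of a frequency domain share their capacity states):

---8<---
static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
{
        unsigned long util, fdom_max_util, fdom_tot_util;
        struct capacity_state *cs;
        unsigned long energy = 0;
        struct freq_domain *fdom;
        int cpu;

        for_each_freq_domain(fdom) {
                fdom_max_util = fdom_tot_util = 0;
                for_each_cpu_and(cpu, &(fdom->span), cpu_online_mask) {
                        util = cpu_util_next(cpu, p, dst_cpu);
                        fdom_max_util = max(util, fdom_max_util);
                        fdom_tot_util += util;  /* cache the sum for below */
                }

                /* Capacity states are shared within a frequency domain */
                cpu = cpumask_first(&(fdom->span));
                cs = find_cap_state(cpu, fdom_max_util);

                /* P_x / C_x * sum_i(U_i): one scaling per domain */
                energy += cs->power * fdom_tot_util / cs->cap;
        }

        return energy;
}
---8<---

The second per-CPU loop disappears, and the capacity state selection and the energy computation are guaranteed to see the same utilization values, which addresses the consistency concern.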
On 21/03/18 14:26, Quentin Perret wrote:
> On Wednesday 21 Mar 2018 at 12:39:21 (+0000), Patrick Bellasi wrote:
> > On 20-Mar 09:43, Dietmar Eggemann wrote:

[...]

> > If that's the case then, in the previous function, you can certainly
> > avoid the initialization of *cs and maybe also add an explicit:
> >
> >   BUG_ON(em->nb_cap_states == 0);
> >
> > which helps even just as "in code documentation".
> >
> > But, I'm not sure if maintainers like BUG_ON in scheduler code :)
>
> Yes, I'm not sure about the BUG_ON either :). I agree that it would be
> nice to document somewhere that compute_energy() is unsafe to call
> without sched_energy_present. I can simply add a proper doc comment to
> this function actually. Would that work ?

If it is something that must not happen and it is also non-recoverable
at runtime, then...

$ git grep BUG_ON -- kernel/sched/ | wc -l
50

:)
On 21/03/18 13:55, Quentin Perret wrote: > On Wednesday 21 Mar 2018 at 13:59:25 (+0100), Juri Lelli wrote: > > On 21/03/18 12:26, Patrick Bellasi wrote: > > > On 21-Mar 10:04, Juri Lelli wrote: [...] > > > > > + /* > > > > > + * As the goal is to estimate the OPP reached for a specific util > > > > > + * value, mimic the behaviour of schedutil with a 1.25 coefficient > > > > > + */ > > > > > + util += util >> 2; > > > > > > > > What about other governors (ondemand for example). Is this supposed to > > > > work only when schedutil is in use (if so we should probably make it > > > > conditional on that)? > > > > > > Yes, I would say that EAS mostly makes sense when you have a "minimum" > > > control on OPPs... otherwise all the energy estimations are really > > > fuzzy. > > > > Make sense to me. Shouldn't we then make all this conditional on using > > schedutil? > > So, in theory, EAS could make sense even for other governors than > schedutil. Even with the performance governor it is probably more > energy efficient (although users using "performance" probably don't care > about energy, but that's just an example) to place small tasks onto little > CPUs up to a certain point given by the energy model. The ideal solution > would be to change the behaviour of find_cap_state() depending on the > governor being used, but I don't know if this extra complexity is worth > it really. > I'm happy to make all this conditional on schedutil as a first step and > we can see later if that makes sense to extend EAS to other use-cases. I agree that EAS still makes sense even for !schedutil cases (your performance example being one of them, powersave maybe another one?). Making it work with ondemand is tricky, though. So, not sure what's the best thing to do, but we should be at least aware of limitations. > > > > > > > > > > Also, even when schedutil is in use, shouldn't we ask it for a util > > > > "computation" instead of replicating its _current_ heuristic? > > > > > > Are you proposing to have the 1.25 factor only here and remove it from > > > schedutil? > > > > I'm only saying that we shouldn't probably have two places where we add > > this 1.25 factor to utilization before using it, as in the future one of > > the two might modify that 1.25 to something else and then we'll have > > problems. So, maybe a common wrapper that adds such factor? > > Ok, I can definitely factorize this code between schedutil and EAS. And > BTW, would it make sense to make schedutil use "capacity_margin" instead > of an arbitrary value ? The semantics feels pretty close. Out of curiosity, > what was the reason to use C=1.25 in the first place ? I seem to remember it was chosen out of experiments, but I may well be wrong and Rafael, Viresh, others will correct me. :) > > > > > > > > > > I fear the two might diverge in the future. > > > > > > That could be avoided by factoring out from schedutil the > > > "compensation" factor into a proper function to be used by all the > > > interested playes, isn't it? > > > > And I should have read till the end before writing the above paragraph > > it seems. :) > > > > Thanks, > > > > - Juri > > Thank you very much for the feedback ! No problem. I'm of course very interested in this. Could you please directly Cc me (juri.lelli@redhat.com) in next versions? Thanks, - Juri
On 21-Mar 14:26, Quentin Perret wrote: > On Wednesday 21 Mar 2018 at 12:39:21 (+0000), Patrick Bellasi wrote: > > On 20-Mar 09:43, Dietmar Eggemann wrote: > > > From: Quentin Perret <quentin.perret@arm.com> > > > > [...] > > > > > +static unsigned long compute_energy(struct task_struct *p, int dst_cpu) > > > +{ > > > + unsigned long util, fdom_max_util; > > > + struct capacity_state *cs; > > > + unsigned long energy = 0; > > > + struct freq_domain *fdom; > > > + int cpu; > > > + > > > + for_each_freq_domain(fdom) { > > > + fdom_max_util = 0; > > > + for_each_cpu_and(cpu, &(fdom->span), cpu_online_mask) { > > > + util = cpu_util_next(cpu, p, dst_cpu); > > > > Would be nice to find a way to cache all these util and reuse them > > below... even just to ensure data consistency between the "cs" > > computation and its usage... > > So actually, what I can do is add something like > > fdom_tot_util += util; > > to this loop and compute > > energy = cs->power * fdom_tot_util / cs->cap; > > only once, instead of having the second loop to compute the energy. We don't > have to scale the util for each and every CPU since they share the same > cap state. That would save some divisions and ensure the consistency > between the selection of the cap state and the associated energy > computation. What do you think ? Right, would say that under the hypothesis that we are in the same frequency domain (and we are because of fdom->span), that's basically doing: sum_i(P_x * U_i / C_x) => P_x / C_x * sum_i(U_i) Where (C_x, P_x) are the EM reported capacity and power for the expected frequency domain OPP. > Or maybe you were talking about consistency between several consecutive > calls to compute_energy() ? Nope, the above +1 > > > + fdom_max_util = max(util, fdom_max_util); > > > + } > > > + > > > + /* > > > + * Here we assume that the capacity states of CPUs belonging to > > > + * the same frequency domains are shared. Hence, we look at the > > > + * capacity state of the first CPU and re-use it for all. > > > + */ > > > + cpu = cpumask_first(&(fdom->span)); > > > + cs = find_cap_state(cpu, fdom_max_util); > > ^^^^ > > > > The above code could theoretically return NULL, although likely EAS is > > completely disabled if em->nb_cap_states == 0, right? > > That's right. sched_energy_present cannot be enabled with > em->nb_cap_states == 0, and compute_energy() is never called without > sched_energy_present in the proposed implementation. > > > > > If that's the case then, in the previous function, you can certainly > > avoid the initialization of *cs and maybe also add an explicit: > > > > BUG_ON(em->nb_cap_states == 0); > > > > which helps even just as "in code documentation". > > > > But, I'm not sure if maintainers like BUG_ON in scheduler code :) > > Yes, I'm not sure about the BUG_ON either :). FWIW, there are already some BUG_ON in fair.c... thus, if they can pinpoint a specific bug in case of errors, they should be acceptable? > I agree that it would be nice to document somewhere that > compute_energy() is unsafe to call without sched_energy_present. > I can simply add a proper doc comment to this function actually. > Would that work ? Right, it's just that _maybe_ an explicit BUG_ON is improving the documentation by making more explicit the error on testing? Thus, I would probably add both... but Peter will tell you for sure ;)
On Wed, Mar 21, 2018 at 04:15:13PM +0100, Juri Lelli wrote: > On 21/03/18 13:55, Quentin Perret wrote: > > On Wednesday 21 Mar 2018 at 13:59:25 (+0100), Juri Lelli wrote: > > > On 21/03/18 12:26, Patrick Bellasi wrote: > > > > On 21-Mar 10:04, Juri Lelli wrote: > > [...] > > > > > > > + /* > > > > > > + * As the goal is to estimate the OPP reached for a specific util > > > > > > + * value, mimic the behaviour of schedutil with a 1.25 coefficient > > > > > > + */ > > > > > > + util += util >> 2; > > > > > > > > > > What about other governors (ondemand for example). Is this supposed to > > > > > work only when schedutil is in use (if so we should probably make it > > > > > conditional on that)? > > > > > > > > Yes, I would say that EAS mostly makes sense when you have a "minimum" > > > > control on OPPs... otherwise all the energy estimations are really > > > > fuzzy. > > > > > > Make sense to me. Shouldn't we then make all this conditional on using > > > schedutil? > > > > So, in theory, EAS could make sense even for other governors than > > schedutil. Even with the performance governor it is probably more > > energy efficient (although users using "performance" probably don't care > > about energy, but that's just an example) to place small tasks onto little > > CPUs up to a certain point given by the energy model. The ideal solution > > would be to change the behaviour of find_cap_state() depending on the > > governor being used, but I don't know if this extra complexity is worth > > it really. > > I'm happy to make all this conditional on schedutil as a first step and > > we can see later if that makes sense to extend EAS to other use-cases. > > I agree that EAS makes still sense even for !schedutil cases (your > performance example being one of them, powersave maybe another one?). > Making it work with ondemand is tricky, though. > > So, not sure what's the best thing to do, but we should be at least aware > of limitations. I would suggest making as few assumptions about the OPP selection as possible. Even when we do use schedutil, there could be a number of reasons why we don't actually get the OPP schedutil requests (thermal, hardware-says-no,...). In the previous energy model-driven scheduling postings, years back, I went with the assumption that OPP would follow the utilization. So if we put more tasks on a cpu, the OPP would increase to match. If cpufreq or hardware decided to go faster, that is fine but it could lead to suboptimal decisions. We could call into schedutil somehow to make sure that we at least request the same OPP as the energy model assumes if the overhead is small and we can present schedutil with all the information it needs to choose the OPP for the proposed task placement. I wonder if it is worth it, or if we should just decide on a simple assumption on OPP selection for energy estimation and stick it in a comment. Morten
On 21/03/18 16:26, Morten Rasmussen wrote: > On Wed, Mar 21, 2018 at 04:15:13PM +0100, Juri Lelli wrote: > > On 21/03/18 13:55, Quentin Perret wrote: > > > On Wednesday 21 Mar 2018 at 13:59:25 (+0100), Juri Lelli wrote: > > > > On 21/03/18 12:26, Patrick Bellasi wrote: > > > > > On 21-Mar 10:04, Juri Lelli wrote: > > > > [...] > > > > > > > > > + /* > > > > > > > + * As the goal is to estimate the OPP reached for a specific util > > > > > > > + * value, mimic the behaviour of schedutil with a 1.25 coefficient > > > > > > > + */ > > > > > > > + util += util >> 2; > > > > > > > > > > > > What about other governors (ondemand for example). Is this supposed to > > > > > > work only when schedutil is in use (if so we should probably make it > > > > > > conditional on that)? > > > > > > > > > > Yes, I would say that EAS mostly makes sense when you have a "minimum" > > > > > control on OPPs... otherwise all the energy estimations are really > > > > > fuzzy. > > > > > > > > Make sense to me. Shouldn't we then make all this conditional on using > > > > schedutil? > > > > > > So, in theory, EAS could make sense even for other governors than > > > schedutil. Even with the performance governor it is probably more > > > energy efficient (although users using "performance" probably don't care > > > about energy, but that's just an example) to place small tasks onto little > > > CPUs up to a certain point given by the energy model. The ideal solution > > > would be to change the behaviour of find_cap_state() depending on the > > > governor being used, but I don't know if this extra complexity is worth > > > it really. > > > I'm happy to make all this conditional on schedutil as a first step and > > > we can see later if that makes sense to extend EAS to other use-cases. > > > > I agree that EAS makes still sense even for !schedutil cases (your > > performance example being one of them, powersave maybe another one?). > > Making it work with ondemand is tricky, though. > > > > So, not sure what's the best thing to do, but we should be at least aware > > of limitations. > > I would suggest making as few assumptions about the OPP selection as > possible. Even when we do use schedutil, there could be number of > reasons why we don't actually get the OPP schedutil requests (thermal, > hardware-says-no,...). > > In the previous energy model-driven scheduling postings, years back, I > went with the assumption that OPP would follow the utilization. So if we > put more tasks on a cpu, the OPP would increase to match. If cpufreq or > hardware decided to go faster, that is fine but it could lead to > suboptimal decisions. > > We could call into schedutil somehow to make sure that we at least > request the same OPP as the energy model assumes if the overhead is > small and we can present schedutil with all the information it needs to > choose the OPP for the proposed task placement. I wonder if it is worth > it, or if we should just decide on a simple assumption on OPP selection > for energy estimation and stick it in a comment. Right, I see your point. Refactoring the 1.25 coefficient calculation in some getter method shouldn't hopefully add much overhead, but yes, it might not give us much in terms of correctness for certain situations. Best, - Juri
On Wednesday 21 Mar 2018 at 15:54:58 (+0000), Patrick Bellasi wrote: > On 21-Mar 14:26, Quentin Perret wrote: > > On Wednesday 21 Mar 2018 at 12:39:21 (+0000), Patrick Bellasi wrote: > > > On 20-Mar 09:43, Dietmar Eggemann wrote: > > > > From: Quentin Perret <quentin.perret@arm.com> [...] > > So actually, what I can do is add something like > > > > fdom_tot_util += util; > > > > to this loop and compute > > > > energy = cs->power * fdom_tot_util / cs->cap; > > > > only once, instead of having the second loop to compute the energy. We don't > > have to scale the util for each and every CPU since they share the same > > cap state. That would save some divisions and ensure the consistency > > between the selection of the cap state and the associated energy > > computation. What do you think ? > > Right, would say that under the hypothesis the we are in the same > frequency domain (and we are because of fdom->span), that's basically > doing: > > sum_i(P_x * U_i / C_x) => P_x / C_x * sum_i(U_i) > > Where (C_x, P_x) are the EM reported capacity and power for the > expected frequency domain OPP. > Yes that's exactly that. I'll do the change in v2. > > Or maybe you were talking about consistency between several consecutive > > calls to compute_energy() ? > > Nope, the above +1 > [...] > > I agree that it would be nice to document somewhere that > > compute_energy() is unsafe to call > > without sched_energy_present. I can simply add a proper doc comment to > > this function actually. > > Would that work ? > > Right, it's just that _maybe_ an explicit BUG_ON is improving the > documentation by making more explicit the error on testing ? > > Thus, I would probably add both... but Peter will tell you for sure ;) > Right, but I'm still not sure if the BUG_ON is the right thing to do. I mean, if we really want to make this check, then we could also try to recover into a working state ... If we enter compute_energy() without having an energy model, and if we detect it in time, we could bail out and disable sched_energy_present ASAP with an error message for example. That would let us know if EAS is broken without making the system unusable. Anyway, if there is a general agreement, or if the maintainers think that the BUG_ON is the right thing to do here, I'm happy to change that in future versions :)
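For illustration, the recover-instead-of-crash option could look like the sketch below. The warn-and-bail policy is an assumption, and how sched_energy_present would actually get cleared depends on the rest of the series, so it is left to the caller here:

---8<---
static struct capacity_state *find_cap_state(int cpu, unsigned long util)
{
        struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu);
        struct capacity_state *cs;
        int i;

        /*
         * Defensive alternative to a BUG_ON(): warn once and report
         * failure so the caller can fall back to the regular wakeup
         * path (and possibly disable EAS altogether).
         */
        if (WARN_ON_ONCE(!em || em->nr_cap_states == 0))
                return NULL;

        /* Mimic schedutil's 1.25 OPP selection coefficient */
        util += util >> 2;

        for (i = 0; i < em->nr_cap_states; i++) {
                cs = &em->cap_states[i];
                if (cs->cap >= util)
                        break;
        }

        return cs;
}
---8<---

This also makes Patrick's earlier remark concrete: with the check in place, the cs = NULL initialization of the original becomes unnecessary, since the loop is guaranteed to run at least once.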
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6c72a5e7b1b0..76bd46502486 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6409,6 +6409,30 @@ static inline int cpu_overutilized(int cpu)
 }
 
 /*
+ * Returns the util of "cpu" if "p" wakes up on "dst_cpu".
+ */
+static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
+{
+	unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
+	unsigned long capacity = capacity_orig_of(cpu);
+
+	/*
+	 * If p is where it should be, or if it has no impact on cpu, there is
+	 * not much to do.
+	 */
+	if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu))
+		goto clamp_util;
+
+	if (dst_cpu == cpu)
+		util += task_util(p);
+	else
+		util = max_t(long, util - task_util(p), 0);
+
+clamp_util:
+	return (util >= capacity) ? capacity : util;
+}
+
+/*
  * Disable WAKE_AFFINE in the case where task @p doesn't fit in the
  * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
  *
@@ -6432,6 +6456,63 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
 	return !util_fits_capacity(task_util(p), min_cap);
 }
 
+static struct capacity_state *find_cap_state(int cpu, unsigned long util)
+{
+	struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu);
+	struct capacity_state *cs = NULL;
+	int i;
+
+	/*
+	 * As the goal is to estimate the OPP reached for a specific util
+	 * value, mimic the behaviour of schedutil with a 1.25 coefficient
+	 */
+	util += util >> 2;
+
+	for (i = 0; i < em->nr_cap_states; i++) {
+		cs = &em->cap_states[i];
+		if (cs->cap >= util)
+			break;
+	}
+
+	return cs;
+}
+
+static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
+{
+	unsigned long util, fdom_max_util;
+	struct capacity_state *cs;
+	unsigned long energy = 0;
+	struct freq_domain *fdom;
+	int cpu;
+
+	for_each_freq_domain(fdom) {
+		fdom_max_util = 0;
+		for_each_cpu_and(cpu, &(fdom->span), cpu_online_mask) {
+			util = cpu_util_next(cpu, p, dst_cpu);
+			fdom_max_util = max(util, fdom_max_util);
+		}
+
+		/*
+		 * Here we assume that the capacity states of CPUs belonging to
+		 * the same frequency domains are shared. Hence, we look at the
+		 * capacity state of the first CPU and re-use it for all.
+		 */
+		cpu = cpumask_first(&(fdom->span));
+		cs = find_cap_state(cpu, fdom_max_util);
+
+		/*
+		 * The energy consumed by each CPU is derived from the power
+		 * it dissipates at the expected OPP and its percentage of
+		 * busy time.
+		 */
+		for_each_cpu_and(cpu, &(fdom->span), cpu_online_mask) {
+			util = cpu_util_next(cpu, p, dst_cpu);
+			energy += cs->power * util / cs->cap;
+		}
+	}
+	return energy;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
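As a worked example of the energy accumulation above (all numbers invented for illustration): take a two-CPU frequency domain whose selected capacity state reports cs->power = 600 and cs->cap = 800, with cpu_util_next() returning 200 for one CPU and 400 for the other. The inner loop then adds 600 * 200 / 800 = 150 and 600 * 400 / 800 = 300, for a domain total of 450: each CPU contributes the power of the chosen OPP scaled by its estimated fraction of busy time, util / cap.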