[RFC,v3,0/6] sched/cpufreq: Make schedutil energy aware

Message ID	20191011134500.235736-1-douglas.raillard@arm.com (mailing list archive)
Headers	show Return-Path: <SRS0=Jp3p=YE=vger.kernel.org=linux-pm-owner@kernel.org> From: Douglas RAILLARD <douglas.raillard@arm.com> To: linux-kernel@vger.kernel.org Cc: linux-pm@vger.kernel.org, mingo@redhat.com, peterz@infradead.org, rjw@rjwysocki.net, viresh.kumar@linaro.org, juri.lelli@redhat.com, vincent.guittot@linaro.org, douglas.raillard@arm.com, dietmar.eggemann@arm.com, qperret@qperret.net, patrick.bellasi@matbug.net, dh.han@samsung.com Subject: [RFC PATCH v3 0/6] sched/cpufreq: Make schedutil energy aware Date: Fri, 11 Oct 2019 14:44:54 +0100 Message-Id: <20191011134500.235736-1-douglas.raillard@arm.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: linux-pm-owner@vger.kernel.org Precedence: bulk
Series	sched/cpufreq: Make schedutil energy aware \| expand [RFC,v3,0/6] sched/cpufreq: Make schedutil energy aware [RFC,v3,1/6] PM: Introduce em_pd_get_higher_freq() [RFC,v3,2/6] sched/cpufreq: Attach perf domain to sugov policy [RFC,v3,3/6] sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq() [RFC,v3,4/6] sched/cpufreq: Introduce sugov_cpu_ramp_boost [RFC,v3,5/6] sched/cpufreq: Boost schedutil frequency ramp up [RFC,v3,6/6] sched/cpufreq: Add schedutil_em_tp tracepoint

Douglas RAILLARD Oct. 11, 2019, 1:44 p.m. UTC

Make schedutil cpufreq governor energy-aware.

- patch 1 introduces a function to retrieve a frequency given a base
  frequency and an energy cost margin.
- patch 2 links Energy Model perf_domain to sugov_policy.
- patch 3 updates get_next_freq() to make use of the Energy Model.
- patch 4 adds sugov_cpu_ramp_boost() function.
- patch 5 updates sugov_update_(single|shared)() to make use of
  sugov_cpu_ramp_boost().
- patch 6 introduces a tracepoint in get_next_freq() for
  testing/debugging. Since it's not a trace event, it's not exposed to
  userspace in a directly usable way, allowing for painless future
  updates/removal.

The benefits of using the EM in schedutil are twofold:

1) Selecting the highest possible frequency for a given cost. Some
   platforms can have lower frequencies that are less efficient than
   higher ones, in which case they should be skipped for most purposes.
   They can still be useful to give more freedom to thermal throttling
   mechanisms, but not under normal circumstances.
   note: the EM framework will warn about such OPPs "hertz/watts ratio
   non-monotonically decreasing"

2) Driving the frequency selection with power in mind, in addition to
   maximizing the utilization of the non-idle CPUs in the system.

Point 1) is implemented in "PM: Introduce em_pd_get_higher_freq()" and
enabled in schedutil by
"sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq()".

Point 2) is enabled in
"sched/cpufreq: Boost schedutil frequency ramp up". It allows using
higher frequencies when it is known that the true utilization of
currently running tasks is exceeding their previous stable point.
The benefits are:

* Boosting the frequency when the behavior of a runnable task changes,
  leading to an increase in utilization. That shortens the frequency
  ramp up duration, which in turns allows the utilization signal to
  reach stable values quicker.  Since the allowed frequency boost is
  bounded in energy, it will behave consistently across platforms,
  regardless of the OPP cost range.

* The boost is only transient, and should not impact a lot the energy
  consumed of workloads with very stable utilization signals.

This has been ligthly tested with a rtapp task ramping from 10% to 75%
utilisation on a big core. Results are improved by fast ramp-up
EWMA [1], since it greatly reduces the oscillation in frequency at first
idle when ramping up.

[1] [PATCH] sched/fair: util_est: fast ramp-up EWMA on utilization increases
    Message-ID: <20190620150555.15717-1-patrick.bellasi@arm.com>
    https://lore.kernel.org/lkml/20190620150555.15717-1-patrick.bellasi@arm.com/


v1 -> v2:

  * Split the new sugov_cpu_ramp_boost() from the existing
    sugov_cpu_is_busy() as they seem to seek a different goal.

  * Implement sugov_cpu_ramp_boost() based on CFS util_avg and
    util_est_enqueued signals, rather than using idle calls count.
    This makes the ramp boost much more accurate in finding boost
    opportunities, and give a "continuous" output rather than a boolean.

  * Add EM_COST_MARGIN_SCALE=1024 to represent the
    margin values of em_pd_get_higher_freq().

v2 -> v3:

  * Check util_avg >= sg_cpu->util_avg in sugov_cpu_ramp_boost_update()
    to avoid boosting when the utilization is decreasing.

  * Add a tracepoint for testing. 

Douglas RAILLARD (6):
  PM: Introduce em_pd_get_higher_freq()
  sched/cpufreq: Attach perf domain to sugov policy
  sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq()
  sched/cpufreq: Introduce sugov_cpu_ramp_boost
  sched/cpufreq: Boost schedutil frequency ramp up
  sched/cpufreq: Add schedutil_em_tp tracepoint

 include/linux/energy_model.h     |  53 ++++++++++++++
 include/trace/events/power.h     |   9 +++
 kernel/sched/cpufreq_schedutil.c | 122 +++++++++++++++++++++++++++++--
 3 files changed, 178 insertions(+), 6 deletions(-)

Peter Zijlstra Oct. 14, 2019, 2:53 p.m. UTC | #1

On Fri, Oct 11, 2019 at 02:44:54PM +0100, Douglas RAILLARD wrote:
> This has been ligthly tested with a rtapp task ramping from 10% to 75%
> utilisation on a big core. Results are improved by fast ramp-up
> EWMA [1], since it greatly reduces the oscillation in frequency at first
> idle when ramping up.
> 
> [1] [PATCH] sched/fair: util_est: fast ramp-up EWMA on utilization increases
>     Message-ID: <20190620150555.15717-1-patrick.bellasi@arm.com>
>     https://lore.kernel.org/lkml/20190620150555.15717-1-patrick.bellasi@arm.com/


I don't really see anything fundamentally weird here. Any actual numbers
or other means of quantifying the improvement these patches bring?

Douglas RAILLARD Oct. 14, 2019, 3:50 p.m. UTC | #2

Hi Peter,

On 10/14/19 3:53 PM, Peter Zijlstra wrote:
> On Fri, Oct 11, 2019 at 02:44:54PM +0100, Douglas RAILLARD wrote:
>> This has been ligthly tested with a rtapp task ramping from 10% to 75%
>> utilisation on a big core. Results are improved by fast ramp-up
>> EWMA [1], since it greatly reduces the oscillation in frequency at first
>> idle when ramping up.
>>
>> [1] [PATCH] sched/fair: util_est: fast ramp-up EWMA on utilization increases
>>      Message-ID: <20190620150555.15717-1-patrick.bellasi@arm.com>
>>      https://lore.kernel.org/lkml/20190620150555.15717-1-patrick.bellasi@arm.com/
> 
> 
> I don't really see anything fundamentally weird here. Any actual numbers
> or other means of quantifying the improvement these patches bring?
> 

I posted some numbers based on a similar experiment on the v2 of that series that
are still applicable:

TL;DR the rt-app negative slack is divided by 1.75 by this series, with an
increase of 3% in total energy consumption. There is a burst every 0.6s, and
the average power consumption increase is proportional to the average number
of bursts.


The workload is an rt-app task ramping up from 5% to 75% util in one big step,
pinned on a big core. The whole cycle is 0.6s long (0.3s at 5% followed by 0.3s at 75%).
This cycle is repeated 20 times and the average of boosting is taken.

The test device is a Google Pixel 3 (Qcom Snapdragon 845) phone.
It has a lot more OPPs than a hikey 960, so gradations in boosting
are better reflected on frequency selection.

avg slack (higher=better):
     Average time between task sleep and its next periodic activation.
     See rt-app doc: https://github.com/scheduler-tools/rt-app/blob/9a50d76f726d7c325c82ac8c7ed9ed70e1c97937/doc/tutorial.txt#L631

avg negative slack (lower in absolute value=better):
     Same as avg slack, but only taking into account negative values.
     Negative slack means a task activation did not have enough time to complete before the next
     periodic activation fired, which is what we want to avoid.

boost energy overhead (lower=better):
     Extra power consumption induced by ramp boost, assuming continuous OPP space (infinite number of OPP)
     and single-CPU policies. In practice, fixed number of OPP decrease this value, and more CPU per policy increases it,
     since boost(policy) = max(boost(cpu) foreach cpu of policy)).

Without ramp boost:
+--------------------+--------------------+
|avg slack (us)      |avg negative slack  |
|                    |(us)                |
+--------------------+--------------------+
|6598.72             |-10217.13           |
|6595.49             |-10200.13           |
|6613.72             |-10401.06           |
|6600.29             |-9860.872           |
|6605.53             |-10057.64           |
|6612.05             |-10267.50           |
|6599.01             |-9939.60            |
|6593.79             |-9445.633           |
|6613.56             |-10276.75           |
|6595.44             |-9751.770           |
+--------------------+--------------------+
|average                                  |
+--------------------+--------------------+
|6602.76             |-10041.81           |
+--------------------+--------------------+


With ramp boost enabled:
+--------------------+--------------------+--------------------+
|boost energy        |avg slack (us)      |avg negative slack  |
|overhead (%)        |                    |(us)                |
+--------------------+--------------------+--------------------+
|3.05                |7148.93             |-5664.26            |
|3.04                |7144.69             |-5667.77            |
|3.05                |7149.05             |-5698.31            |
|2.97                |7126.71             |-6040.23            |
|3.02                |7140.28             |-5826.78            |
|3.03                |7135.11             |-5749.62            |
|3.05                |7140.24             |-5750.0             |
|3.05                |7144.84             |-5667.04            |
|3.07                |7157.30             |-5656.65            |
|3.06                |7154.65             |-5653.76            |
+--------------------+--------------------+--------------------+
|average                                                       |
+--------------------+--------------------+--------------------+
|3.039000            |7144.18             |-5737.44            |
+--------------------+--------------------+--------------------+


The negative slack is due to missed activations while the utilization signals
increase during the big utilization step. Ramp boost is designed to boost frequency during
that phase, which materializes in 1.75 less negative slack, for an extra power
consumption under 3%.


Regards,
Douglas

Peter Zijlstra Oct. 17, 2019, 9:50 a.m. UTC | #3

On Mon, Oct 14, 2019 at 04:50:24PM +0100, Douglas Raillard wrote:

> I posted some numbers based on a similar experiment on the v2 of that series that
> are still applicable:
> 
> TL;DR the rt-app negative slack is divided by 1.75 by this series, with an
> increase of 3% in total energy consumption. There is a burst every 0.6s, and
> the average power consumption increase is proportional to the average number
> of bursts.
> 
> 
> The workload is an rt-app task ramping up from 5% to 75% util in one big step,
> pinned on a big core. The whole cycle is 0.6s long (0.3s at 5% followed by 0.3s at 75%).
> This cycle is repeated 20 times and the average of boosting is taken.
> 
> The test device is a Google Pixel 3 (Qcom Snapdragon 845) phone.
> It has a lot more OPPs than a hikey 960, so gradations in boosting
> are better reflected on frequency selection.
> 
> avg slack (higher=better):
>     Average time between task sleep and its next periodic activation.
>     See rt-app doc: https://github.com/scheduler-tools/rt-app/blob/9a50d76f726d7c325c82ac8c7ed9ed70e1c97937/doc/tutorial.txt#L631
> 
> avg negative slack (lower in absolute value=better):
>     Same as avg slack, but only taking into account negative values.
>     Negative slack means a task activation did not have enough time to complete before the next
>     periodic activation fired, which is what we want to avoid.
> 
> boost energy overhead (lower=better):
>     Extra power consumption induced by ramp boost, assuming continuous OPP space (infinite number of OPP)
>     and single-CPU policies. In practice, fixed number of OPP decrease this value, and more CPU per policy increases it,
>     since boost(policy) = max(boost(cpu) foreach cpu of policy)).
> 
> Without ramp boost:
> +--------------------+--------------------+
> |avg slack (us)      |avg negative slack  |
> |                    |(us)                |
> +--------------------+--------------------+
> |6598.72             |-10217.13           |
> |6595.49             |-10200.13           |
> |6613.72             |-10401.06           |
> |6600.29             |-9860.872           |
> |6605.53             |-10057.64           |
> |6612.05             |-10267.50           |
> |6599.01             |-9939.60            |
> |6593.79             |-9445.633           |
> |6613.56             |-10276.75           |
> |6595.44             |-9751.770           |
> +--------------------+--------------------+
> |average                                  |
> +--------------------+--------------------+
> |6602.76             |-10041.81           |
> +--------------------+--------------------+
> 
> 
> With ramp boost enabled:
> +--------------------+--------------------+--------------------+
> |boost energy        |avg slack (us)      |avg negative slack  |
> |overhead (%)        |                    |(us)                |
> +--------------------+--------------------+--------------------+
> |3.05                |7148.93             |-5664.26            |
> |3.04                |7144.69             |-5667.77            |
> |3.05                |7149.05             |-5698.31            |
> |2.97                |7126.71             |-6040.23            |
> |3.02                |7140.28             |-5826.78            |
> |3.03                |7135.11             |-5749.62            |
> |3.05                |7140.24             |-5750.0             |
> |3.05                |7144.84             |-5667.04            |
> |3.07                |7157.30             |-5656.65            |
> |3.06                |7154.65             |-5653.76            |
> +--------------------+--------------------+--------------------+
> |average                                                       |
> +--------------------+--------------------+--------------------+
> |3.039000            |7144.18             |-5737.44            |
> +--------------------+--------------------+--------------------+
> 
> 
> The negative slack is due to missed activations while the utilization signals
> increase during the big utilization step. Ramp boost is designed to boost frequency during
> that phase, which materializes in 1.75 less negative slack, for an extra power
> consumption under 3%.

OK, so I think I see what it is doing, and why.

Normally we use (map_util_freq):

	freq = C * max_freq * util / max ; C=1.25

But here, when util is increasing, we effectively increase our C to
allow picking a higher OPP. Because of that higher OPP we finish our
work sooner (avg slack increases) and miss our activation less often
(avg neg slack decreases).

Now, the thing is, we use map_util_freq() in more places, and should we
not reflect this increase in C for all of them? That is, why is this
patch changing get_next_freq() and not map_util_freq().

I don't think that question is answered in the Changelogs.

Exactly because it does change the energy consumption (it must) should
that not also be reflected in the EAS logic?

I'm still thinking about the exact means you're using to raise C; that
is, the 'util - util_est' as cost_margin. It hurts my brain still.

Quentin Perret Oct. 17, 2019, 11:11 a.m. UTC | #4

On Thursday 17 Oct 2019 at 11:50:15 (+0200), Peter Zijlstra wrote:
> Now, the thing is, we use map_util_freq() in more places, and should we
> not reflect this increase in C for all of them? That is, why is this
> patch changing get_next_freq() and not map_util_freq().
> 
> I don't think that question is answered in the Changelogs.
> 
> Exactly because it does change the energy consumption (it must) should
> that not also be reflected in the EAS logic?

Right that shouldn't hurt and keep things consistent. That probably
won't have a huge impact in practice (the boost should be != 0 only when
the util signals haven't converged IIUC, which is a case where the EAS
calculation is already 'wrong' anyway), but that still feels like the
right thing to do.

> I'm still thinking about the exact means you're using to raise C; that
> is, the 'util - util_est' as cost_margin. It hurts my brain still.

+1 ...

Thanks,
Quentin

Peter Zijlstra Oct. 17, 2019, 2:11 p.m. UTC | #5

On Thu, Oct 17, 2019 at 12:11:16PM +0100, Quentin Perret wrote:
> On Thursday 17 Oct 2019 at 11:50:15 (+0200), Peter Zijlstra wrote:
> > Now, the thing is, we use map_util_freq() in more places, and should we
> > not reflect this increase in C for all of them? That is, why is this
> > patch changing get_next_freq() and not map_util_freq().
> > 
> > I don't think that question is answered in the Changelogs.
> > 
> > Exactly because it does change the energy consumption (it must) should
> > that not also be reflected in the EAS logic?
> 
> Right that shouldn't hurt and keep things consistent. That probably
> won't have a huge impact in practice (the boost should be != 0 only when
> the util signals haven't converged IIUC, which is a case where the EAS
> calculation is already 'wrong' anyway), but that still feels like the
> right thing to do.

It only boosts when 'rq->cfs.avg.util' increases while
'rq->cfs.avg.util_est.enqueued' remains unchanged (and util > util_est
obv).

This condition can be true for select_task_rq_fair(), because that is
ran before we do enqueue_task_fair() (for obvious raisins).

> > I'm still thinking about the exact means you're using to raise C; that
> > is, the 'util - util_est' as cost_margin. It hurts my brain still.
> 
> +1 ...

cost_i = capacity_i / power_i ; for the i-th OPP

We then do: 'x = util - util_avg' and use that on the first
OPP >= min_freq:

	cost(x) = cost_j + cost_j * x ; freq_j >= min_freq
	        = cost_j * (1 + x)

And then we find the highest OPP that has:

	cost_k <= cost(x)

Now, 'P = C*V^2*f', which under 'V ~ f' turns into 'P ~ f^3'.

(this assumption is important because we don't have V_i, but know that
when f increases, V also increases and the linear relation is the
simplest)

This then gives us:

	capacity_i = f_i / f_max
	power_i ~ f_i ^ 3
	cost_i = capacity_i / power_i
	       ~ (f_i / f_max) / f_i^3
	       ~ 1 / (f_max * f_i^2)

(your changelog already called if efficiency, but then went on calling
it cost; as per the above, you see that higher frequencies have lower
efficiency, as expected)

cost(x) then turns into something like:

	cost(x) = cost_j * (1 + x)
	        ~ (1 + x) / (f_max * f_j^2)

We then get the following equation (assuming inf. OPPs):

	cost_k = cost(x)

	1 / (f_max * f_k^2) = (1 + x) / (f_max * f_j^2)

From which we can solve f_k:

	f_k = f_j / sqrt(1 + x) ; x = util - util_est

Which, given positive 'x' results in f_k < f_j, IOW. we're selecting a
lower frequency.

Given that 'cost' really is efficiency, and we've made the equations
about selecting a higher efficiency, that makes sense in so far that it
would always end up at the knee in the graph (our most efficient OPP).

Is this really what we're wanting to do?

Douglas RAILLARD Oct. 17, 2019, 2:23 p.m. UTC | #6

Hi Peter,

On 10/17/19 10:50 AM, Peter Zijlstra wrote:
> On Mon, Oct 14, 2019 at 04:50:24PM +0100, Douglas Raillard wrote:
> 
>> I posted some numbers based on a similar experiment on the v2 of that series that
>> are still applicable:
>>
>> TL;DR the rt-app negative slack is divided by 1.75 by this series, with an
>> increase of 3% in total energy consumption. There is a burst every 0.6s, and
>> the average power consumption increase is proportional to the average number
>> of bursts.
>>
>>
>> The workload is an rt-app task ramping up from 5% to 75% util in one big step,
>> pinned on a big core. The whole cycle is 0.6s long (0.3s at 5% followed by 0.3s at 75%).
>> This cycle is repeated 20 times and the average of boosting is taken.
>>
>> The test device is a Google Pixel 3 (Qcom Snapdragon 845) phone.
>> It has a lot more OPPs than a hikey 960, so gradations in boosting
>> are better reflected on frequency selection.
>>
>> avg slack (higher=better):
>>      Average time between task sleep and its next periodic activation.
>>      See rt-app doc: https://github.com/scheduler-tools/rt-app/blob/9a50d76f726d7c325c82ac8c7ed9ed70e1c97937/doc/tutorial.txt#L631
>>
>> avg negative slack (lower in absolute value=better):
>>      Same as avg slack, but only taking into account negative values.
>>      Negative slack means a task activation did not have enough time to complete before the next
>>      periodic activation fired, which is what we want to avoid.
>>
>> boost energy overhead (lower=better):
>>      Extra power consumption induced by ramp boost, assuming continuous OPP space (infinite number of OPP)
>>      and single-CPU policies. In practice, fixed number of OPP decrease this value, and more CPU per policy increases it,
>>      since boost(policy) = max(boost(cpu) foreach cpu of policy)).
>>
>> Without ramp boost:
>> +--------------------+--------------------+
>> |avg slack (us)      |avg negative slack  |
>> |                    |(us)                |
>> +--------------------+--------------------+
>> |6598.72             |-10217.13           |
>> |6595.49             |-10200.13           |
>> |6613.72             |-10401.06           |
>> |6600.29             |-9860.872           |
>> |6605.53             |-10057.64           |
>> |6612.05             |-10267.50           |
>> |6599.01             |-9939.60            |
>> |6593.79             |-9445.633           |
>> |6613.56             |-10276.75           |
>> |6595.44             |-9751.770           |
>> +--------------------+--------------------+
>> |average                                  |
>> +--------------------+--------------------+
>> |6602.76             |-10041.81           |
>> +--------------------+--------------------+
>>
>>
>> With ramp boost enabled:
>> +--------------------+--------------------+--------------------+
>> |boost energy        |avg slack (us)      |avg negative slack  |
>> |overhead (%)        |                    |(us)                |
>> +--------------------+--------------------+--------------------+
>> |3.05                |7148.93             |-5664.26            |
>> |3.04                |7144.69             |-5667.77            |
>> |3.05                |7149.05             |-5698.31            |
>> |2.97                |7126.71             |-6040.23            |
>> |3.02                |7140.28             |-5826.78            |
>> |3.03                |7135.11             |-5749.62            |
>> |3.05                |7140.24             |-5750.0             |
>> |3.05                |7144.84             |-5667.04            |
>> |3.07                |7157.30             |-5656.65            |
>> |3.06                |7154.65             |-5653.76            |
>> +--------------------+--------------------+--------------------+
>> |average                                                       |
>> +--------------------+--------------------+--------------------+
>> |3.039000            |7144.18             |-5737.44            |
>> +--------------------+--------------------+--------------------+
>>
>>
>> The negative slack is due to missed activations while the utilization signals
>> increase during the big utilization step. Ramp boost is designed to boost frequency during
>> that phase, which materializes in 1.75 less negative slack, for an extra power
>> consumption under 3%.
> 
> OK, so I think I see what it is doing, and why.
> 
> Normally we use (map_util_freq):
> 
> 	freq = C * max_freq * util / max ; C=1.25
> 
> But here, when util is increasing, we effectively increase our C to
> allow picking a higher OPP. Because of that higher OPP we finish our
> work sooner (avg slack increases)

Indeed. The increase in avg slack is much smaller than the absolute decrease in negative slack,
which means we don't boost when that's not necessary
(i.e. on activations where the slack is already positive).

> and miss our activation less often
> (avg neg slack decreases).

We actually miss the activation as often, but by a lesser amount of time,
which is desirable. Preventing from missing activations altogether could
probably be achieved by very aggressive boosting, at which point it might
be better/easier to just assume util=1024 and let it decrease to its
correct value. None of that sounds very appealing in the general case though.

> Now, the thing is, we use map_util_freq() in more places, and should we
> not reflect this increase in C for all of them? That is, why is this
> patch changing get_next_freq() and not map_util_freq().
> 
> I don't think that question is answered in the Changelogs.
> 
> Exactly because it does change the energy consumption (it must) should
> that not also be reflected in the EAS logic?

map_util_freq() is only used in schedutil and in EAS by compute_energy(), so
I retarget the question at compute_energy(). It is supposed to compute
the energy consumed by a given CPU if a given task was migrated on it.
This implies some assumptions on util signals:
1) util(_est.enqueued) of the task is representative of the task's duty cycle
   (the semantic of util is somehow blurry for aperiodic tasks AFAIK).
2) util of the task is CPU-invariant
3) task util + target CPU util = new target CPU util for the foreseeable future,
    i.e. the amount of future we can predict with reasonable accuracy. Otherwise
    we would end up moving things around all the time.

Since ramp boost is designed to be transient, integrating it (indirectly) in "target CPU util" will
add some noise to the placement decision, potentially rendering it obsolete as soon
as the boosting stops. Basing a costly decision on that does not sound like a good idea IMHO.

> I'm still thinking about the exact means you're using to raise C; that
> is, the 'util - util_est' as cost_margin. It hurts my brain still.

util_est is currently the best approximation of the actual portion of the CPU the task needs:
1) for periodic tasks, it's not too far from the duty cycle, and is always higher

2) for aperiodic tasks, it (indirectly) takes into account the total time it took
   to complete the previous activation, so the signal is not 100% composed of logical signals
   only relevant for periodic tasks (although it's a big part of it).

3) Point 1) and 2) together allows util_est to adapt to periodic tasks that changes
their duty cycle over time, without needing a very long history (the last task period
is sufficient).

For periodic tasks, the distance between instantaneous util_avg and the actual task
duty cycle indicates somehow what is our best guess of the (potential) change in the task
duty cycle.

util_est is the threshold (assuming util_avg increasing) for util_avg after which we know
for sure that even if the task stopped right now, its duty cycle would be higher than
during the previous period.
This means for a given task and with (util >= util_est):

1) util - util_est == 0 means the task duty cycle will be equal to the one during
   during the previous activation, if the tasks stopped executing right now.

2) util - util_est > 0 means the task duty cycle will be higher to the one during
   during the previous activation, if the tasks stopped executing right now.

Using the difference (util - util_est) will therefore give these properties to the boost signal:
* no boost will be applied as long as the task has a constant or decreasing duty cycle.

* when we can detect that the duty cycle increases, we temporarily increase the frequency.
   We start with a slight increase, and the longer we wait for the current period to finish,
   the more we boost, since the more likely it is that the task has a much larger duty cycle
   than anticipated. More specifically, the evaluation of "how much more" is done the exact
   same way as it is done for PELT, since the dynamic of the boost is "inherited" from PELT.

Now if the task is aperiodic, the boost will allow reaching the highest frequency faster,
which may or may not be desired. Ultimately, it's not more or less wrong than just picking
the freq based on util_est alone, since util_est is already somewhat meaningless for aperiodic
tasks. It just allows reaching the max freq at some point without waiting for too long, which is
all what we can do without more info on the task.

When applying these boosting rules on the runqueue util signals, we are able to detect if at least one
task needs boosting according to these rules. That only holds as long as the history we look at is
the result of a stable set of tasks, i.e. no tasks added or removed from the rq.

Regards,
Douglas

Peter Zijlstra Oct. 17, 2019, 2:53 p.m. UTC | #7

On Thu, Oct 17, 2019 at 03:23:04PM +0100, Douglas Raillard wrote:
> On 10/17/19 10:50 AM, Peter Zijlstra wrote:
> > Now, the thing is, we use map_util_freq() in more places, and should we
> > not reflect this increase in C for all of them? That is, why is this
> > patch changing get_next_freq() and not map_util_freq().
> > 
> > I don't think that question is answered in the Changelogs.
> > 
> > Exactly because it does change the energy consumption (it must) should
> > that not also be reflected in the EAS logic?
> 
> map_util_freq() is only used in schedutil and in EAS by compute_energy(), so
> I retarget the question at compute_energy(). It is supposed to compute
> the energy consumed by a given CPU if a given task was migrated on it.
> This implies some assumptions on util signals:
> 1) util(_est.enqueued) of the task is representative of the task's
>    duty cycle (the semantic of util is somehow blurry for aperiodic tasks
>    AFAIK).
> 2) util of the task is CPU-invariant

( we know this to be false, but do indeed make this assumption because
simplicity, taking IPC differences into account would just complicate
things more )

> 3) task util + target CPU util = new target CPU util for the
>    foreseeable future, i.e. the amount of future we can predict with
>    reasonable accuracy. Otherwise we would end up moving things around
>    all the time.
> 
> Since ramp boost is designed to be transient, integrating it
> (indirectly) in "target CPU util" will add some noise to the placement
> decision, potentially rendering it obsolete as soon as the boosting
> stops. Basing a costly decision on that does not sound like a good
> idea IMHO.

Well, we _hope_ recent past is a reasonable predictor for the near
future. We of course know it'll be complete crap every so often, but
hope that on average it is true more than false :-)

Anyway, the above seems like a sensible enough argument, and seems
worthy of being part of the Changelog. Also maybe a comment in schedutil
as to why there is a map_util_freq() modifier there.

Peter Zijlstra Oct. 17, 2019, 7:07 p.m. UTC | #8

On Thu, Oct 17, 2019 at 03:23:04PM +0100, Douglas Raillard wrote:
> On 10/17/19 10:50 AM, Peter Zijlstra wrote:

> > I'm still thinking about the exact means you're using to raise C; that
> > is, the 'util - util_est' as cost_margin. It hurts my brain still.
> 
> util_est is currently the best approximation of the actual portion of the CPU the task needs:
> 1) for periodic tasks, it's not too far from the duty cycle, and is always higher
> 
> 2) for aperiodic tasks, it (indirectly) takes into account the total time it took
>   to complete the previous activation, so the signal is not 100% composed of logical signals
>   only relevant for periodic tasks (although it's a big part of it).
> 
> 3) Point 1) and 2) together allows util_est to adapt to periodic tasks that changes
> their duty cycle over time, without needing a very long history (the last task period
> is sufficient).
> 
> For periodic tasks, the distance between instantaneous util_avg and the actual task
> duty cycle indicates somehow what is our best guess of the (potential) change in the task
> duty cycle.
> 
> util_est is the threshold (assuming util_avg increasing) for util_avg after which we know
> for sure that even if the task stopped right now, its duty cycle would be higher than
> during the previous period.
> This means for a given task and with (util >= util_est):
> 
> 1) util - util_est == 0 means the task duty cycle will be equal to the one during
>   during the previous activation, if the tasks stopped executing right now.
> 
> 2) util - util_est > 0 means the task duty cycle will be higher to the one during
>   during the previous activation, if the tasks stopped executing right now.

So far I can follow, 2) is indeed a fairly sane indication that
utilization is growing.

> Using the difference (util - util_est) will therefore give these properties to the boost signal:
> * no boost will be applied as long as the task has a constant or decreasing duty cycle.
> 
> * when we can detect that the duty cycle increases, we temporarily increase the frequency.
>   We start with a slight increase, and the longer we wait for the current period to finish,
>   the more we boost, since the more likely it is that the task has a much larger duty cycle
>   than anticipated. More specifically, the evaluation of "how much more" is done the exact
>   same way as it is done for PELT, since the dynamic of the boost is "inherited" from PELT.

Right, because as long it keeps running, util_est will not be changed,
so the difference will continue to increase.

What I don't see is how that that difference makes sense as input to:

  cost(x) : (1 + x) * cost_j

I suppose that limits the additional OPP to twice the previously
selected cost / efficiency (see the confusion from that other email).
But given that efficency drops (or costs rise) for higher OPPs that
still doesn't really make sense..

> Now if the task is aperiodic, the boost will allow reaching the highest frequency faster,
> which may or may not be desired. Ultimately, it's not more or less wrong than just picking
> the freq based on util_est alone, since util_est is already somewhat meaningless for aperiodic
> tasks. It just allows reaching the max freq at some point without waiting for too long, which is
> all what we can do without more info on the task.
> 
> When applying these boosting rules on the runqueue util signals, we are able to detect if at least one
> task needs boosting according to these rules. That only holds as long as the history we look at is
> the result of a stable set of tasks, i.e. no tasks added or removed from the rq.

So while I agree that 2) is a reasonable signal to work from, everything
that comes after is still much confusing me.

Dietmar Eggemann Oct. 18, 2019, 7:44 a.m. UTC | #9

On 17/10/2019 16:11, Peter Zijlstra wrote:
> On Thu, Oct 17, 2019 at 12:11:16PM +0100, Quentin Perret wrote:

[...]

> It only boosts when 'rq->cfs.avg.util' increases while
> 'rq->cfs.avg.util_est.enqueued' remains unchanged (and util > util_est
> obv).
> 
> This condition can be true for select_task_rq_fair(), because that is
> ran before we do enqueue_task_fair() (for obvious raisins).
> 
>>> I'm still thinking about the exact means you're using to raise C; that
>>> is, the 'util - util_est' as cost_margin. It hurts my brain still.
>>
>> +1 ...
> 
> cost_i = capacity_i / power_i ; for the i-th OPP

I get confused by this definition. efficiency=capacity/power but the
cs->cost value used in em_pd_get_higher_freq() is defined as

cs_cost = cs->power * cpu_max_freq / cs->freq [energy_model.h]

> We then do: 'x = util - util_avg' and use that on the first
> OPP >= min_freq:
> 
> 	cost(x) = cost_j + cost_j * x ; freq_j >= min_freq
> 	        = cost_j * (1 + x)
> 
> And then we find the highest OPP that has:
> 
> 	cost_k <= cost(x)

[...]

Peter Zijlstra Oct. 18, 2019, 7:59 a.m. UTC | #10

On Fri, Oct 18, 2019 at 09:44:44AM +0200, Dietmar Eggemann wrote:
> On 17/10/2019 16:11, Peter Zijlstra wrote:
> > On Thu, Oct 17, 2019 at 12:11:16PM +0100, Quentin Perret wrote:
> 
> [...]
> 
> > It only boosts when 'rq->cfs.avg.util' increases while
> > 'rq->cfs.avg.util_est.enqueued' remains unchanged (and util > util_est
> > obv).
> > 
> > This condition can be true for select_task_rq_fair(), because that is
> > ran before we do enqueue_task_fair() (for obvious raisins).
> > 
> >>> I'm still thinking about the exact means you're using to raise C; that
> >>> is, the 'util - util_est' as cost_margin. It hurts my brain still.
> >>
> >> +1 ...
> > 
> > cost_i = capacity_i / power_i ; for the i-th OPP
> 
> I get confused by this definition. efficiency=capacity/power but the
> cs->cost value used in em_pd_get_higher_freq() is defined as
> 
> cs_cost = cs->power * cpu_max_freq / cs->freq [energy_model.h]

Well, chalk that up to confusion inspired by the Changelog of patch 1.

Let me redo it with that formula then.

Peter Zijlstra Oct. 18, 2019, 8:11 a.m. UTC | #11

On Fri, Oct 18, 2019 at 09:44:44AM +0200, Dietmar Eggemann wrote:
> On 17/10/2019 16:11, Peter Zijlstra wrote:
> > On Thu, Oct 17, 2019 at 12:11:16PM +0100, Quentin Perret wrote:
> 
> [...]
> 
> > It only boosts when 'rq->cfs.avg.util' increases while
> > 'rq->cfs.avg.util_est.enqueued' remains unchanged (and util > util_est
> > obv).
> > 
> > This condition can be true for select_task_rq_fair(), because that is
> > ran before we do enqueue_task_fair() (for obvious raisins).
> > 
> >>> I'm still thinking about the exact means you're using to raise C; that
> >>> is, the 'util - util_est' as cost_margin. It hurts my brain still.
> >>
> >> +1 ...
> > 
> > cost_i = capacity_i / power_i ; for the i-th OPP
> 
> I get confused by this definition. efficiency=capacity/power but the
> cs->cost value used in em_pd_get_higher_freq() is defined as
> 
> cs_cost = cs->power * cpu_max_freq / cs->freq [energy_model.h]

	cost_i = power_i * f_max / f_i

	cost(x) = cost_j * (1 + x) ; f_j >= min_freq

	cost_k <= cost(x)

	P = C*V^2*f, V ~ f -> P ~ f^3

	cost_i ~ f_i^3 * f_max / f_i
	       = f_i^2 * f_max

	cost(x) = (1 + x) * f_j^2 * f_max

	cost_k = cost(x)

	f_k^2 * f_max = (1 + x) * f_j^2 * f_max

	f_k = sqrt(1 + x) * f_j

Which does indeed make more sense... However, I still struggle with
using our 'x = util - util_est' as input for an OPP specific increase.

Douglas RAILLARD Oct. 18, 2019, 11:46 a.m. UTC | #12

On 10/17/19 8:07 PM, Peter Zijlstra wrote:
> On Thu, Oct 17, 2019 at 03:23:04PM +0100, Douglas Raillard wrote:
>> On 10/17/19 10:50 AM, Peter Zijlstra wrote:
> 
>>> I'm still thinking about the exact means you're using to raise C; that
>>> is, the 'util - util_est' as cost_margin. It hurts my brain still.
>>
>> util_est is currently the best approximation of the actual portion of the CPU the task needs:
>> 1) for periodic tasks, it's not too far from the duty cycle, and is always higher
>>
>> 2) for aperiodic tasks, it (indirectly) takes into account the total time it took
>>    to complete the previous activation, so the signal is not 100% composed of logical signals
>>    only relevant for periodic tasks (although it's a big part of it).
>>
>> 3) Point 1) and 2) together allows util_est to adapt to periodic tasks that changes
>> their duty cycle over time, without needing a very long history (the last task period
>> is sufficient).
>>
>> For periodic tasks, the distance between instantaneous util_avg and the actual task
>> duty cycle indicates somehow what is our best guess of the (potential) change in the task
>> duty cycle.
>>
>> util_est is the threshold (assuming util_avg increasing) for util_avg after which we know
>> for sure that even if the task stopped right now, its duty cycle would be higher than
>> during the previous period.
>> This means for a given task and with (util >= util_est):
>>
>> 1) util - util_est == 0 means the task duty cycle will be equal to the one during
>>    during the previous activation, if the tasks stopped executing right now.
>>
>> 2) util - util_est > 0 means the task duty cycle will be higher to the one during
>>    during the previous activation, if the tasks stopped executing right now.
> 
> So far I can follow, 2) is indeed a fairly sane indication that
> utilization is growing.
> 
>> Using the difference (util - util_est) will therefore give these properties to the boost signal:
>> * no boost will be applied as long as the task has a constant or decreasing duty cycle.
>>
>> * when we can detect that the duty cycle increases, we temporarily increase the frequency.
>>    We start with a slight increase, and the longer we wait for the current period to finish,
>>    the more we boost, since the more likely it is that the task has a much larger duty cycle
>>    than anticipated. More specifically, the evaluation of "how much more" is done the exact
>>    same way as it is done for PELT, since the dynamic of the boost is "inherited" from PELT.
> 
> Right, because as long it keeps running, util_est will not be changed,
> so the difference will continue to increase.
> 
> What I don't see is how that that difference makes sense as input to:
> 
>    cost(x) : (1 + x) * cost_j

The actual input is:
x = (EM_COST_MARGIN_SCALE/SCHED_CAPACITY_SCALE) * (util - util_est)

Since EM_COST_MARGIN_SCALE == SCHED_CAPACITY_SCALE == 1024, this factor 
of 1 is not directly reflected in the code but is important for units 
consistency.

Non-zero x means that we are going to allow spending more energy than 
what schedutil currently thinks of being the minimal energy.
(it's actually not that minimal, since we have this 1.25 factor, plus 
the fact that we use util_est.enqueued, which will always over-estimate 
the actual util of the task).

> 
> I suppose that limits the additional OPP to twice the previously
> selected cost / efficiency (see the confusion from that other email).
> But given that efficency drops (or costs rise) for higher OPPs that
> still doesn't really make sense..
Yes, this current limit to +100% freq boosting is somehow arbitrary and 
could probably benefit from being tunable in some way (Kconfig option 
maybe). When (margin > 0), we end up selecting an OPP that has a higher 
cost than the one strictly required, which is expected. The goal is to 
speed things up at the expense of more power consumed to achieve the 
same work, hence at a lower efficiency (== higher cost).

That's the main reason why this boosting apply a margin on the cost of 
the selected OPP rather than just inflating the util. This allows 
controlling directly how much more power (battery life) we are going to 
spend to achieve some work that we know could be achieved with less 
power. This "how much more" is the margin. A policy for such boosting 
must obviously be quite picky in when it decides to boost (not boosting 
all the time). Decreasing the amount of negative slack is one situation 
where spending a bit more power to ensure the work is done in time can 
be more important than just efficiency, within reasonable limits. (the 
eternal efficiency vs throughput vs latency trade-off).

> 
>> Now if the task is aperiodic, the boost will allow reaching the highest frequency faster,
>> which may or may not be desired. Ultimately, it's not more or less wrong than just picking
>> the freq based on util_est alone, since util_est is already somewhat meaningless for aperiodic
>> tasks. It just allows reaching the max freq at some point without waiting for too long, which is
>> all what we can do without more info on the task.
>>
>> When applying these boosting rules on the runqueue util signals, we are able to detect if at least one
>> task needs boosting according to these rules. That only holds as long as the history we look at is
>> the result of a stable set of tasks, i.e. no tasks added or removed from the rq.
> 
> So while I agree that 2) is a reasonable signal to work from, everything
> that comes after is still much confusing me.

"Now if the task is aperiodic ...": What I mean by that is that AFAIK, 
there isn't really any fundamentally right or wrong way of choosing 
frequencies if the tasks we are dealing with are not periodic, unless 
you add more constraints to the problem. We currently base the decision 
on an overestimation of some kind of average running time per second of 
the task. The average being the EWMA implemented by PELT, not to be 
confused with util_est.ewma that adds an extra EWMA on top.

What "window" of time used for that average (or EWMA half life in our 
case) will change the value of the average for aperiodic tasks. The 
choice of that half life is driven by how much of the task history we 
want to take into account. That is driven by how often we expect tasks 
to change their "execution profile" on average so to speak (a thread 
pool picking disparate work items at random would change its profile 
very often for example).

Once this window or half life is chosen, we can ensure that the CPU will 
use a frequency high enough to avoid work piling up more than what can 
be computed using a simple proportional controller. The goal of the 
schedutil controller is to make sure that:
current CPU capa == current util * 1.25

All that to say that in the aperiodic case, some pieces of the setup are 
directly provided by the policy (PELT half life), which are empirically 
determined to perform well, without any way of computing an provably 
optimal value (unless we know for sure exactly when a task is going to 
change its workload and CPU share it will require). For periodic tasks, 
we can compute the exact frequency that will lead to using 100% of the 
CPU just by looking at the duty cycle of the tasks, and that's more or 
less what schedutil does.

"When applying these boosting rules on the runqueue util signals ...":
Assuming the set of enqueued tasks stays the same between 2 observations 
from schedutil, if we see the rq util_avg increase above its 
util_est.enqueued, that means that at least one task had its util_avg go 
above util_est.enqueued. We might miss some boosting opportunities if 
some (util - util_est) compensates:
TASK_1(util - util_est) = - TASK_2(util - util_est)
but working on the aggregated value is much easier in schedutil, to 
avoid crawling the list of entities.

Peter Zijlstra Oct. 18, 2019, 12:07 p.m. UTC | #13

On Fri, Oct 18, 2019 at 12:46:25PM +0100, Douglas Raillard wrote:

> > What I don't see is how that that difference makes sense as input to:
> > 
> >    cost(x) : (1 + x) * cost_j
> 
> The actual input is:
> x = (EM_COST_MARGIN_SCALE/SCHED_CAPACITY_SCALE) * (util - util_est)
> 
> Since EM_COST_MARGIN_SCALE == SCHED_CAPACITY_SCALE == 1024, this factor of 1
> is not directly reflected in the code but is important for units
> consistency.

But completely irrelevant for the actual math and conceptual
understanding. Just because computers suck at real numbers, and floats
are expensive, doesn't mean we have to burden ourselves with fixed point
when writing equations.

Also, as a physicist I'm prone to normalizing everything to 1, because
that's lazy.

> > I suppose that limits the additional OPP to twice the previously
> > selected cost / efficiency (see the confusion from that other email).
> > But given that efficency drops (or costs rise) for higher OPPs that
> > still doesn't really make sense..

> Yes, this current limit to +100% freq boosting is somehow arbitrary and
> could probably benefit from being tunable in some way (Kconfig option
> maybe). When (margin > 0), we end up selecting an OPP that has a higher cost
> than the one strictly required, which is expected. The goal is to speed
> things up at the expense of more power consumed to achieve the same work,
> hence at a lower efficiency (== higher cost).

No, no Kconfig knobs.

> That's the main reason why this boosting apply a margin on the cost of the
> selected OPP rather than just inflating the util. This allows controlling
> directly how much more power (battery life) we are going to spend to achieve
> some work that we know could be achieved with less power.

But you're not; the margin is relative to the OPP, it is not absolute.

Or rather, the only actual limit is in relation to the max OPP. So you
have very little actual control over how much more energy you're
spending.

> > So while I agree that 2) is a reasonable signal to work from, everything
> > that comes after is still much confusing me.

> "When applying these boosting rules on the runqueue util signals ...":
> Assuming the set of enqueued tasks stays the same between 2 observations
> from schedutil, if we see the rq util_avg increase above its
> util_est.enqueued, that means that at least one task had its util_avg go
> above util_est.enqueued. We might miss some boosting opportunities if some
> (util - util_est) compensates:
> TASK_1(util - util_est) = - TASK_2(util - util_est)
> but working on the aggregated value is much easier in schedutil, to avoid
> crawling the list of entities.

That still does not explain why 'util - util_est', when >0, makes for a
sensible input into an OPP relative function.

I agree that 'util - util_est', when >0, indicates utilization is
increasing (for the aperiodic blah blah blah). But after that I'm still
confused.

Douglas RAILLARD Oct. 18, 2019, 2:44 p.m. UTC | #14

On 10/18/19 1:07 PM, Peter Zijlstra wrote:
> On Fri, Oct 18, 2019 at 12:46:25PM +0100, Douglas Raillard wrote:
> 
>>> What I don't see is how that that difference makes sense as input to:
>>>
>>>     cost(x) : (1 + x) * cost_j
>>
>> The actual input is:
>> x = (EM_COST_MARGIN_SCALE/SCHED_CAPACITY_SCALE) * (util - util_est)
>>
>> Since EM_COST_MARGIN_SCALE == SCHED_CAPACITY_SCALE == 1024, this factor of 1
>> is not directly reflected in the code but is important for units
>> consistency.
> 
> But completely irrelevant for the actual math and conceptual
> understanding.

 > how that that difference makes sense as input to
I was unsure if you referred to the units being inconsistent or the 
actual way of computing values being strange, so I provided some 
justification for both.

> Just because computers suck at real numbers, and floats
> are expensive, doesn't mean we have to burden ourselves with fixed point
> when writing equations.
> 
> Also, as a physicist I'm prone to normalizing everything to 1, because
> that's lazy.
> 
>>> I suppose that limits the additional OPP to twice the previously
>>> selected cost / efficiency (see the confusion from that other email).
>>> But given that efficency drops (or costs rise) for higher OPPs that
>>> still doesn't really make sense..
> 
>> Yes, this current limit to +100% freq boosting is somehow arbitrary and
>> could probably benefit from being tunable in some way (Kconfig option
>> maybe). When (margin > 0), we end up selecting an OPP that has a higher cost
>> than the one strictly required, which is expected. The goal is to speed
>> things up at the expense of more power consumed to achieve the same work,
>> hence at a lower efficiency (== higher cost).
> 
> No, no Kconfig knobs.
> 
>> That's the main reason why this boosting apply a margin on the cost of the
>> selected OPP rather than just inflating the util. This allows controlling
>> directly how much more power (battery life) we are going to spend to achieve
>> some work that we know could be achieved with less power.
> 
> But you're not; the margin is relative to the OPP, it is not absolute.

Considering a CPU with 1024 max capacity (since we are not talking about 
migrations here, we can ignore CPU invariance):

work = normalized number of iterations of a given busy loop
# Thanks to freq invariance
work = util (between 0 and 1)
util = f/f_max

# f(work) is the min freq that is admissible for "work", which we will
# abbreviate as "f"
f(work) = work * f_max

# from struct em_cap_state doc in energy_model.h
cost(f) = power(f) * f_max / f
cost(f) = power(f) / util
cost(f) = power(f) / work
power(f) = cost(f) * work

boosted_cost(f) = cost(f) + x
boosted_power(f) = boosted_cost(f) * work
boosted_power(f) = (cost(f) + x) * work

# Let's normalize cost() so we can forget about f and deal only with work.
cost'(work) = cost(f)/cost(f_max)
x' = x/cost(f_max)
boosted_power'(work) = (cost'(work) + x') * work
boosted_power'(work) = cost'(work) * work + x' * work
boosted_power'(work) = power'(work) + x' * work
boosted_power'(work) = power'(work) + A(work)

# Over a duration T, spend an extra B unit of energy
B(work) = A(work) * T
lost_battery_percent(work) = 100 * B(work)/total_battery_energy
lost_battery_percent(work) = 100 * T * x' * work /total_battery_energy
lost_battery_percent(work) =
  (100 * T / cost(f_max) / total_battery_energy) * x * work

This means that the effect of boosting on battery life is proportional 
to "x" unless I made a mistake somewhere.

> 
> Or rather, the only actual limit is in relation to the max OPP. So you
> have very little actual control over how much more energy you're
> spending.
> 
>>> So while I agree that 2) is a reasonable signal to work from, everything
>>> that comes after is still much confusing me.
> 
>> "When applying these boosting rules on the runqueue util signals ...":
>> Assuming the set of enqueued tasks stays the same between 2 observations
>> from schedutil, if we see the rq util_avg increase above its
>> util_est.enqueued, that means that at least one task had its util_avg go
>> above util_est.enqueued. We might miss some boosting opportunities if some
>> (util - util_est) compensates:
>> TASK_1(util - util_est) = - TASK_2(util - util_est)
>> but working on the aggregated value is much easier in schedutil, to avoid
>> crawling the list of entities.
> 
> That still does not explain why 'util - util_est', when >0, makes for a
> sensible input into an OPP relative function > I agree that 'util - util_est', when >0, indicates utilization is
> increasing (for the aperiodic blah blah blah). But after that I'm still
> confused.

For the same reason PELT makes a sensible input for OPP selection.
Currently, OPP selection is based on max(util_avg, util_est.enqueued) 
(from cpu_util_cfs in sched.h), so as soon as we have
(util - util_est > 0), the OPP will be selected according to util_avg. 
In a way, using util_avg there is already some kind of boosting.

Since the boosting is essentially (util - constant), it grows the same 
way as util. If we think of (util - util_est) as being some estimation 
of how wrong we were in the estimation of the task "true" utilization of 
the CPU, then it makes sense to feed that to the boost. The wronger we 
were, the more we want to boost, because the more time passes, the more 
the scheduler realizes it actually does not know what the task needs. In 
doubt, provide a higher freq than usual until we get to know this task 
better. When that happens (at the next period), boosting is disabled and 
we revert to the usual behavior (aka margin=0).

Hope we are converging to some wording that makes sense.

Vincent Guittot Oct. 18, 2019, 3:15 p.m. UTC | #15

On Fri, 18 Oct 2019 at 16:44, Douglas Raillard <douglas.raillard@arm.com> wrote:
>
>
>
> On 10/18/19 1:07 PM, Peter Zijlstra wrote:
> > On Fri, Oct 18, 2019 at 12:46:25PM +0100, Douglas Raillard wrote:
> >
> >>> What I don't see is how that that difference makes sense as input to:
> >>>
> >>>     cost(x) : (1 + x) * cost_j
> >>
> >> The actual input is:
> >> x = (EM_COST_MARGIN_SCALE/SCHED_CAPACITY_SCALE) * (util - util_est)
> >>
> >> Since EM_COST_MARGIN_SCALE == SCHED_CAPACITY_SCALE == 1024, this factor of 1
> >> is not directly reflected in the code but is important for units
> >> consistency.
> >
> > But completely irrelevant for the actual math and conceptual
> > understanding.
>
>  > how that that difference makes sense as input to
> I was unsure if you referred to the units being inconsistent or the
> actual way of computing values being strange, so I provided some
> justification for both.
>
> > Just because computers suck at real numbers, and floats
> > are expensive, doesn't mean we have to burden ourselves with fixed point
> > when writing equations.
> >
> > Also, as a physicist I'm prone to normalizing everything to 1, because
> > that's lazy.
> >
> >>> I suppose that limits the additional OPP to twice the previously
> >>> selected cost / efficiency (see the confusion from that other email).
> >>> But given that efficency drops (or costs rise) for higher OPPs that
> >>> still doesn't really make sense..
> >
> >> Yes, this current limit to +100% freq boosting is somehow arbitrary and
> >> could probably benefit from being tunable in some way (Kconfig option
> >> maybe). When (margin > 0), we end up selecting an OPP that has a higher cost
> >> than the one strictly required, which is expected. The goal is to speed
> >> things up at the expense of more power consumed to achieve the same work,
> >> hence at a lower efficiency (== higher cost).
> >
> > No, no Kconfig knobs.
> >
> >> That's the main reason why this boosting apply a margin on the cost of the
> >> selected OPP rather than just inflating the util. This allows controlling
> >> directly how much more power (battery life) we are going to spend to achieve
> >> some work that we know could be achieved with less power.
> >
> > But you're not; the margin is relative to the OPP, it is not absolute.
>
> Considering a CPU with 1024 max capacity (since we are not talking about
> migrations here, we can ignore CPU invariance):
>
> work = normalized number of iterations of a given busy loop
> # Thanks to freq invariance
> work = util (between 0 and 1)
> util = f/f_max
>
> # f(work) is the min freq that is admissible for "work", which we will
> # abbreviate as "f"
> f(work) = work * f_max
>
> # from struct em_cap_state doc in energy_model.h
> cost(f) = power(f) * f_max / f
> cost(f) = power(f) / util
> cost(f) = power(f) / work
> power(f) = cost(f) * work
>
> boosted_cost(f) = cost(f) + x

In em_pd_get_higher_freq, the boost is a % of cost(f)  so it should be
boosted_cost(f)=cost(f)1+ cost(f)*x

> boosted_power(f) = boosted_cost(f) * work
> boosted_power(f) = (cost(f) + x) * work
>
> # Let's normalize cost() so we can forget about f and deal only with work.
> cost'(work) = cost(f)/cost(f_max)
> x' = x/cost(f_max)
> boosted_power'(work) = (cost'(work) + x') * work
> boosted_power'(work) = cost'(work) * work + x' * work
> boosted_power'(work) = power'(work) + x' * work
> boosted_power'(work) = power'(work) + A(work)
>
> # Over a duration T, spend an extra B unit of energy
> B(work) = A(work) * T
> lost_battery_percent(work) = 100 * B(work)/total_battery_energy
> lost_battery_percent(work) = 100 * T * x' * work /total_battery_energy
> lost_battery_percent(work) =
>   (100 * T / cost(f_max) / total_battery_energy) * x * work
>
> This means that the effect of boosting on battery life is proportional
> to "x" unless I made a mistake somewhere.
>
> >
> > Or rather, the only actual limit is in relation to the max OPP. So you
> > have very little actual control over how much more energy you're
> > spending.
> >
> >>> So while I agree that 2) is a reasonable signal to work from, everything
> >>> that comes after is still much confusing me.
> >
> >> "When applying these boosting rules on the runqueue util signals ...":
> >> Assuming the set of enqueued tasks stays the same between 2 observations
> >> from schedutil, if we see the rq util_avg increase above its
> >> util_est.enqueued, that means that at least one task had its util_avg go
> >> above util_est.enqueued. We might miss some boosting opportunities if some
> >> (util - util_est) compensates:
> >> TASK_1(util - util_est) = - TASK_2(util - util_est)
> >> but working on the aggregated value is much easier in schedutil, to avoid
> >> crawling the list of entities.
> >
> > That still does not explain why 'util - util_est', when >0, makes for a
> > sensible input into an OPP relative function > I agree that 'util - util_est', when >0, indicates utilization is
> > increasing (for the aperiodic blah blah blah). But after that I'm still
> > confused.
>
> For the same reason PELT makes a sensible input for OPP selection.
> Currently, OPP selection is based on max(util_avg, util_est.enqueued)
> (from cpu_util_cfs in sched.h), so as soon as we have
> (util - util_est > 0), the OPP will be selected according to util_avg.
> In a way, using util_avg there is already some kind of boosting.
>
> Since the boosting is essentially (util - constant), it grows the same
> way as util. If we think of (util - util_est) as being some estimation
> of how wrong we were in the estimation of the task "true" utilization of
> the CPU, then it makes sense to feed that to the boost. The wronger we
> were, the more we want to boost, because the more time passes, the more
> the scheduler realizes it actually does not know what the task needs. In
> doubt, provide a higher freq than usual until we get to know this task
> better. When that happens (at the next period), boosting is disabled and
> we revert to the usual behavior (aka margin=0).
>
> Hope we are converging to some wording that makes sense.

Vincent Guittot Oct. 18, 2019, 3:20 p.m. UTC | #16

On Fri, 18 Oct 2019 at 16:44, Douglas Raillard <douglas.raillard@arm.com> wrote:
>
>
>
> On 10/18/19 1:07 PM, Peter Zijlstra wrote:
> > On Fri, Oct 18, 2019 at 12:46:25PM +0100, Douglas Raillard wrote:
> >
> >>> What I don't see is how that that difference makes sense as input to:
> >>>
> >>>     cost(x) : (1 + x) * cost_j
> >>
> >> The actual input is:
> >> x = (EM_COST_MARGIN_SCALE/SCHED_CAPACITY_SCALE) * (util - util_est)
> >>
> >> Since EM_COST_MARGIN_SCALE == SCHED_CAPACITY_SCALE == 1024, this factor of 1
> >> is not directly reflected in the code but is important for units
> >> consistency.
> >
> > But completely irrelevant for the actual math and conceptual
> > understanding.
>
>  > how that that difference makes sense as input to
> I was unsure if you referred to the units being inconsistent or the
> actual way of computing values being strange, so I provided some
> justification for both.
>
> > Just because computers suck at real numbers, and floats
> > are expensive, doesn't mean we have to burden ourselves with fixed point
> > when writing equations.
> >
> > Also, as a physicist I'm prone to normalizing everything to 1, because
> > that's lazy.
> >
> >>> I suppose that limits the additional OPP to twice the previously
> >>> selected cost / efficiency (see the confusion from that other email).
> >>> But given that efficency drops (or costs rise) for higher OPPs that
> >>> still doesn't really make sense..
> >
> >> Yes, this current limit to +100% freq boosting is somehow arbitrary and
> >> could probably benefit from being tunable in some way (Kconfig option
> >> maybe). When (margin > 0), we end up selecting an OPP that has a higher cost
> >> than the one strictly required, which is expected. The goal is to speed
> >> things up at the expense of more power consumed to achieve the same work,
> >> hence at a lower efficiency (== higher cost).
> >
> > No, no Kconfig knobs.
> >
> >> That's the main reason why this boosting apply a margin on the cost of the
> >> selected OPP rather than just inflating the util. This allows controlling
> >> directly how much more power (battery life) we are going to spend to achieve
> >> some work that we know could be achieved with less power.
> >
> > But you're not; the margin is relative to the OPP, it is not absolute.
>
> Considering a CPU with 1024 max capacity (since we are not talking about
> migrations here, we can ignore CPU invariance):
>
> work = normalized number of iterations of a given busy loop
> # Thanks to freq invariance
> work = util (between 0 and 1)
> util = f/f_max
>
> # f(work) is the min freq that is admissible for "work", which we will
> # abbreviate as "f"
> f(work) = work * f_max
>
> # from struct em_cap_state doc in energy_model.h
> cost(f) = power(f) * f_max / f
> cost(f) = power(f) / util
> cost(f) = power(f) / work
> power(f) = cost(f) * work
>
> boosted_cost(f) = cost(f) + x
> boosted_power(f) = boosted_cost(f) * work
> boosted_power(f) = (cost(f) + x) * work
>
> # Let's normalize cost() so we can forget about f and deal only with work.
> cost'(work) = cost(f)/cost(f_max)
> x' = x/cost(f_max)
> boosted_power'(work) = (cost'(work) + x') * work
> boosted_power'(work) = cost'(work) * work + x' * work
> boosted_power'(work) = power'(work) + x' * work
> boosted_power'(work) = power'(work) + A(work)
>
> # Over a duration T, spend an extra B unit of energy
> B(work) = A(work) * T
> lost_battery_percent(work) = 100 * B(work)/total_battery_energy
> lost_battery_percent(work) = 100 * T * x' * work /total_battery_energy
> lost_battery_percent(work) =
>   (100 * T / cost(f_max) / total_battery_energy) * x * work
>
> This means that the effect of boosting on battery life is proportional
> to "x" unless I made a mistake somewhere.

Because the boost is relative to cost(f) and cost is not linear to the
frequency, I don't think that it's is a linear relation.

>
> >
> > Or rather, the only actual limit is in relation to the max OPP. So you
> > have very little actual control over how much more energy you're
> > spending.
> >
> >>> So while I agree that 2) is a reasonable signal to work from, everything
> >>> that comes after is still much confusing me.
> >
> >> "When applying these boosting rules on the runqueue util signals ...":
> >> Assuming the set of enqueued tasks stays the same between 2 observations
> >> from schedutil, if we see the rq util_avg increase above its
> >> util_est.enqueued, that means that at least one task had its util_avg go
> >> above util_est.enqueued. We might miss some boosting opportunities if some
> >> (util - util_est) compensates:
> >> TASK_1(util - util_est) = - TASK_2(util - util_est)
> >> but working on the aggregated value is much easier in schedutil, to avoid
> >> crawling the list of entities.
> >
> > That still does not explain why 'util - util_est', when >0, makes for a
> > sensible input into an OPP relative function > I agree that 'util - util_est', when >0, indicates utilization is
> > increasing (for the aperiodic blah blah blah). But after that I'm still
> > confused.
>
> For the same reason PELT makes a sensible input for OPP selection.
> Currently, OPP selection is based on max(util_avg, util_est.enqueued)
> (from cpu_util_cfs in sched.h), so as soon as we have
> (util - util_est > 0), the OPP will be selected according to util_avg.
> In a way, using util_avg there is already some kind of boosting.
>
> Since the boosting is essentially (util - constant), it grows the same
> way as util. If we think of (util - util_est) as being some estimation
> of how wrong we were in the estimation of the task "true" utilization of
> the CPU, then it makes sense to feed that to the boost. The wronger we
> were, the more we want to boost, because the more time passes, the more
> the scheduler realizes it actually does not know what the task needs. In
> doubt, provide a higher freq than usual until we get to know this task
> better. When that happens (at the next period), boosting is disabled and
> we revert to the usual behavior (aka margin=0).
>
> Hope we are converging to some wording that makes sense.

Douglas RAILLARD Oct. 18, 2019, 4:03 p.m. UTC | #17

On 10/18/19 4:15 PM, Vincent Guittot wrote:
> On Fri, 18 Oct 2019 at 16:44, Douglas Raillard <douglas.raillard@arm.com> wrote:
>>
>>
>>
>> On 10/18/19 1:07 PM, Peter Zijlstra wrote:
>>> On Fri, Oct 18, 2019 at 12:46:25PM +0100, Douglas Raillard wrote:
>>>
>>>>> What I don't see is how that that difference makes sense as input to:
>>>>>
>>>>>      cost(x) : (1 + x) * cost_j
>>>>
>>>> The actual input is:
>>>> x = (EM_COST_MARGIN_SCALE/SCHED_CAPACITY_SCALE) * (util - util_est)
>>>>
>>>> Since EM_COST_MARGIN_SCALE == SCHED_CAPACITY_SCALE == 1024, this factor of 1
>>>> is not directly reflected in the code but is important for units
>>>> consistency.
>>>
>>> But completely irrelevant for the actual math and conceptual
>>> understanding.
>>
>>   > how that that difference makes sense as input to
>> I was unsure if you referred to the units being inconsistent or the
>> actual way of computing values being strange, so I provided some
>> justification for both.
>>
>>> Just because computers suck at real numbers, and floats
>>> are expensive, doesn't mean we have to burden ourselves with fixed point
>>> when writing equations.
>>>
>>> Also, as a physicist I'm prone to normalizing everything to 1, because
>>> that's lazy.
>>>
>>>>> I suppose that limits the additional OPP to twice the previously
>>>>> selected cost / efficiency (see the confusion from that other email).
>>>>> But given that efficency drops (or costs rise) for higher OPPs that
>>>>> still doesn't really make sense..
>>>
>>>> Yes, this current limit to +100% freq boosting is somehow arbitrary and
>>>> could probably benefit from being tunable in some way (Kconfig option
>>>> maybe). When (margin > 0), we end up selecting an OPP that has a higher cost
>>>> than the one strictly required, which is expected. The goal is to speed
>>>> things up at the expense of more power consumed to achieve the same work,
>>>> hence at a lower efficiency (== higher cost).
>>>
>>> No, no Kconfig knobs.
>>>
>>>> That's the main reason why this boosting apply a margin on the cost of the
>>>> selected OPP rather than just inflating the util. This allows controlling
>>>> directly how much more power (battery life) we are going to spend to achieve
>>>> some work that we know could be achieved with less power.
>>>
>>> But you're not; the margin is relative to the OPP, it is not absolute.
>>
>> Considering a CPU with 1024 max capacity (since we are not talking about
>> migrations here, we can ignore CPU invariance):
>>
>> work = normalized number of iterations of a given busy loop
>> # Thanks to freq invariance
>> work = util (between 0 and 1)
>> util = f/f_max
>>
>> # f(work) is the min freq that is admissible for "work", which we will
>> # abbreviate as "f"
>> f(work) = work * f_max
>>
>> # from struct em_cap_state doc in energy_model.h
>> cost(f) = power(f) * f_max / f
>> cost(f) = power(f) / util
>> cost(f) = power(f) / work
>> power(f) = cost(f) * work
>>
>> boosted_cost(f) = cost(f) + x
> 
> In em_pd_get_higher_freq, the boost is a % of cost(f)  so it should be
> boosted_cost(f)=cost(f)1+ cost(f)*x

Good point, this means that we need to change "x" in these equations:
x = cost(f) * margin

Which leads to:
lost_battery_percent(work) =
(100 * T / cost(f_max) / total_battery_energy) * cost'(work) * margin * work

lost_battery_percent(work) is still proportional to something that can
easily be traced and averaged (cost'(work,t) * margin(work,t)). At the
end of the day, since the impact depends on whether the workload will
make the condition to trigger, tracing is necessary to see how it
performs.

Other than that, I agree that the thing becomes simpler if
em_pd_get_higher_freq() takes an absolute margin (as a per-1024 of max
cost) rather than something proportional to cost(f). I'll make the
change for v4.

>> boosted_power(f) = boosted_cost(f) * work
>> boosted_power(f) = (cost(f) + x) * work
>>
>> # Let's normalize cost() so we can forget about f and deal only with work.
>> cost'(work) = cost(f)/cost(f_max)
>> x' = x/cost(f_max)
>> boosted_power'(work) = (cost'(work) + x') * work
>> boosted_power'(work) = cost'(work) * work + x' * work
>> boosted_power'(work) = power'(work) + x' * work
>> boosted_power'(work) = power'(work) + A(work)
>>
>> # Over a duration T, spend an extra B unit of energy
>> B(work) = A(work) * T
>> lost_battery_percent(work) = 100 * B(work)/total_battery_energy
>> lost_battery_percent(work) = 100 * T * x' * work /total_battery_energy
>> lost_battery_percent(work) =
>>    (100 * T / cost(f_max) / total_battery_energy) * x * work
>>
>> This means that the effect of boosting on battery life is proportional
>> to "x" unless I made a mistake somewhere.
>>
>>>
>>> Or rather, the only actual limit is in relation to the max OPP. So you
>>> have very little actual control over how much more energy you're
>>> spending.
>>>
>>>>> So while I agree that 2) is a reasonable signal to work from, everything
>>>>> that comes after is still much confusing me.
>>>
>>>> "When applying these boosting rules on the runqueue util signals ...":
>>>> Assuming the set of enqueued tasks stays the same between 2 observations
>>>> from schedutil, if we see the rq util_avg increase above its
>>>> util_est.enqueued, that means that at least one task had its util_avg go
>>>> above util_est.enqueued. We might miss some boosting opportunities if some
>>>> (util - util_est) compensates:
>>>> TASK_1(util - util_est) = - TASK_2(util - util_est)
>>>> but working on the aggregated value is much easier in schedutil, to avoid
>>>> crawling the list of entities.
>>>
>>> That still does not explain why 'util - util_est', when >0, makes for a
>>> sensible input into an OPP relative function > I agree that 'util - util_est', when >0, indicates utilization is
>>> increasing (for the aperiodic blah blah blah). But after that I'm still
>>> confused.
>>
>> For the same reason PELT makes a sensible input for OPP selection.
>> Currently, OPP selection is based on max(util_avg, util_est.enqueued)
>> (from cpu_util_cfs in sched.h), so as soon as we have
>> (util - util_est > 0), the OPP will be selected according to util_avg.
>> In a way, using util_avg there is already some kind of boosting.
>>
>> Since the boosting is essentially (util - constant), it grows the same
>> way as util. If we think of (util - util_est) as being some estimation
>> of how wrong we were in the estimation of the task "true" utilization of
>> the CPU, then it makes sense to feed that to the boost. The wronger we
>> were, the more we want to boost, because the more time passes, the more
>> the scheduler realizes it actually does not know what the task needs. In
>> doubt, provide a higher freq than usual until we get to know this task
>> better. When that happens (at the next period), boosting is disabled and
>> we revert to the usual behavior (aka margin=0).
>>
>> Hope we are converging to some wording that makes sense.

Douglas RAILLARD Oct. 18, 2019, 5:24 p.m. UTC | #18

On 10/18/19 8:59 AM, Peter Zijlstra wrote:
> On Fri, Oct 18, 2019 at 09:44:44AM +0200, Dietmar Eggemann wrote:
>> On 17/10/2019 16:11, Peter Zijlstra wrote:
>>> On Thu, Oct 17, 2019 at 12:11:16PM +0100, Quentin Perret wrote:
>>
>> [...]
>>
>>> It only boosts when 'rq->cfs.avg.util' increases while
>>> 'rq->cfs.avg.util_est.enqueued' remains unchanged (and util > util_est
>>> obv).
>>>
>>> This condition can be true for select_task_rq_fair(), because that is
>>> ran before we do enqueue_task_fair() (for obvious raisins).
>>>
>>>>> I'm still thinking about the exact means you're using to raise C; that
>>>>> is, the 'util - util_est' as cost_margin. It hurts my brain still.
>>>>
>>>> +1 ...
>>>
>>> cost_i = capacity_i / power_i ; for the i-th OPP
>>
>> I get confused by this definition. efficiency=capacity/power but the
>> cs->cost value used in em_pd_get_higher_freq() is defined as
>>
>> cs_cost = cs->power * cpu_max_freq / cs->freq [energy_model.h]
> 
> Well, chalk that up to confusion inspired by the Changelog of patch 1.

I've updated the commit message to include that ordering OPPs by
increasing efficiency=capa/power on one CPU leads to the same ordering
as ordering by decreasing cost=power*f_max/f.

efficiency=(cpu_max_capa/1024) * (f/f_max) / power
efficiency=(cpu_max_capa/1024) / cost


> Let me redo it with that formula then.
>

[RFC,v3,0/6] sched/cpufreq: Make schedutil energy aware

Message

Comments