diff mbox

[RFC,v3,03/10] PM: Introduce an Energy Model management framework

Message ID b47cb1f1-6ca1-7674-903f-724be43a57a0@arm.com (mailing list archive)
State Not Applicable, archived
Headers show

Commit Message

Dietmar Eggemann June 6, 2018, 1:12 p.m. UTC
On 05/21/2018 04:24 PM, Quentin Perret wrote:
> Several subsystems in the kernel (scheduler and/or thermal at the time
> of writing) can benefit from knowing about the energy consumed by CPUs.
> Yet, this information can come from different sources (DT or firmware for
> example), in different formats, hence making it hard to exploit without
> a standard API.
> 
> This patch attempts to solve this issue by introducing a centralized
> Energy Model (EM) framework which can be used to interface the data
> providers with the client subsystems. This framework standardizes the
> API to expose power costs, and to access them from multiple locations.
> 
> The current design assumes that all CPUs in a frequency domain share the
> same micro-architecture. As such, the EM data is structured in a
> per-frequency-domain fashion. Drivers aware of frequency domains
> (typically, but not limited to, CPUFreq drivers) are expected to register
> data in the EM framework using the em_register_freq_domain() API. To do
> so, the drivers must provide a callback function that will be called by
> the EM framework to populate the tables. As of today, only the active
> power of the CPUs is considered. For each frequency domain, the EM
> includes a list of <frequency, power, capacity> tuples for the capacity
> states of the domain alongside a cpumask covering the involved CPUs.
> 
> The EM framework also provides an API to re-scale the capacity values
> of the model asynchronously, after it has been created. This is required
> for architectures where the capacity scale factor of CPUs can change at
> run-time. This is the case for Arm/Arm64 for example where the
> arch_topology driver recomputes the capacity scale factors of the CPUs
> after the maximum frequency of all CPUs has been discovered. Although
> complex, the process of creating and re-scaling the EM has to be kept in
> two separate steps to fulfill the needs of the different users. The thermal
> subsystem doesn't use the capacity values and shouldn't have dependencies
> on subsystems providing them. On the other hand, the task scheduler needs
> the capacity values, and it will benefit from seeing them up-to-date when
> applicable.
> 
> Because of this need for asynchronous update, the capacity state table
> of each frequency domain is protected by RCU, hence guaranteeing a safe
> modification of the table and a fast access to readers in latency-sensitive
> code paths.
> 
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> Signed-off-by: Quentin Perret <quentin.perret@arm.com>

[...]

> +static void fd_update_cs_table(struct em_cs_table *cs_table, int cpu)
> +{
> +	unsigned long cmax = arch_scale_cpu_capacity(NULL, cpu);
> +	int max_cap_state = cs_table->nr_cap_states - 1;
> +	unsigned long fmax = cs_table->state[max_cap_state].frequency;
> +	int i;
> +
> +	for (i = 0; i < cs_table->nr_cap_states; i++)
> +		cs_table->state[i].capacity = cmax *
> +					cs_table->state[i].frequency / fmax;
> +}

This has issues on a 32bit system. cs_table->state[i].capacity (unsigned
long) overflows with the frequency values stored in Hz.

Maybe something like this to cure it:

the Energy Model in the system. Shouldn't the units of frequency and power
not standardized, maybe Mhz and mW?
The task scheduler doesn't care since it is only interested in power diffs
but other user might do.

[...]

Comments

Quentin Perret June 6, 2018, 2:37 p.m. UTC | #1
Hi Dietmar,

On Wednesday 06 Jun 2018 at 15:12:15 (+0200), Dietmar Eggemann wrote:
> > +static void fd_update_cs_table(struct em_cs_table *cs_table, int cpu)
> > +{
> > +	unsigned long cmax = arch_scale_cpu_capacity(NULL, cpu);
> > +	int max_cap_state = cs_table->nr_cap_states - 1;
> > +	unsigned long fmax = cs_table->state[max_cap_state].frequency;
> > +	int i;
> > +
> > +	for (i = 0; i < cs_table->nr_cap_states; i++)
> > +		cs_table->state[i].capacity = cmax *
> > +					cs_table->state[i].frequency / fmax;
> > +}
> 
> This has issues on a 32bit system. cs_table->state[i].capacity (unsigned
> long) overflows with the frequency values stored in Hz.

Ah, thank you very much for pointing this out ! I haven't tried on a 32bit
machine yet, my bad. I'll fix that for v4.

> 
> Maybe something like this to cure it:
> 
> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> index 6ad53f1cf7e6..c13b3eb8bf35 100644
> --- a/kernel/power/energy_model.c
> +++ b/kernel/power/energy_model.c
> @@ -144,9 +144,11 @@ static void fd_update_cs_table(struct em_cs_table *cs_table, int cpu)
>  	unsigned long fmax = cs_table->state[max_cap_state].frequency;
>  	int i;
>  
> -	for (i = 0; i < cs_table->nr_cap_states; i++)
> -		cs_table->state[i].capacity = cmax *
> -					cs_table->state[i].frequency / fmax;
> +	for (i = 0; i < cs_table->nr_cap_states; i++) {
> +		u64 val = (u64)cmax * cs_table->state[i].frequency;
> +		do_div(val, fmax);
> +		cs_table->state[i].capacity = (unsigned long)val;
> +	}
>  }

Hmmm yes, that should work.

> 
> This brings me to another question. Let's say there are multiple users of
> the Energy Model in the system. Shouldn't the units of frequency and power
> not standardized, maybe Mhz and mW?
> The task scheduler doesn't care since it is only interested in power diffs
> but other user might do.

So the good thing about specifying units is that we can probably assume
ranges on the values. If the power is in mW, assuming that we're talking
about a single CPU, it'll probably fit in 16 bits. 65W/core should be
a reasonable upper-bound ?
But there are also vendors who might not be happy with disclosing absolute
values ... These are sometimes considered sensitive and only relative
numbers are discussed publicly. Now, you can also argue that we already
have units specified in IPA for ex, and that it doesn't really matter if
a driver "lies" about the real value, as long as the ratios are correct.
And I guess that anyone can do measurement on the hardware and get those
values anyway. So specifying a unit (mW) for the power is probably a
good idea.

For the frequency, I would rather keep it consistent with what CPUFreq
manages. But that should be specified somewhere, I agree.

What do you think ?

Thanks !
Quentin
Juri Lelli June 6, 2018, 3:20 p.m. UTC | #2
On 06/06/18 15:37, Quentin Perret wrote:
> Hi Dietmar,
> 
> On Wednesday 06 Jun 2018 at 15:12:15 (+0200), Dietmar Eggemann wrote:
> > > +static void fd_update_cs_table(struct em_cs_table *cs_table, int cpu)
> > > +{
> > > +	unsigned long cmax = arch_scale_cpu_capacity(NULL, cpu);
> > > +	int max_cap_state = cs_table->nr_cap_states - 1;
> > > +	unsigned long fmax = cs_table->state[max_cap_state].frequency;
> > > +	int i;
> > > +
> > > +	for (i = 0; i < cs_table->nr_cap_states; i++)
> > > +		cs_table->state[i].capacity = cmax *
> > > +					cs_table->state[i].frequency / fmax;
> > > +}
> > 
> > This has issues on a 32bit system. cs_table->state[i].capacity (unsigned
> > long) overflows with the frequency values stored in Hz.
> 
> Ah, thank you very much for pointing this out ! I haven't tried on a 32bit
> machine yet, my bad. I'll fix that for v4.
> 
> > 
> > Maybe something like this to cure it:
> > 
> > diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> > index 6ad53f1cf7e6..c13b3eb8bf35 100644
> > --- a/kernel/power/energy_model.c
> > +++ b/kernel/power/energy_model.c
> > @@ -144,9 +144,11 @@ static void fd_update_cs_table(struct em_cs_table *cs_table, int cpu)
> >  	unsigned long fmax = cs_table->state[max_cap_state].frequency;
> >  	int i;
> >  
> > -	for (i = 0; i < cs_table->nr_cap_states; i++)
> > -		cs_table->state[i].capacity = cmax *
> > -					cs_table->state[i].frequency / fmax;
> > +	for (i = 0; i < cs_table->nr_cap_states; i++) {
> > +		u64 val = (u64)cmax * cs_table->state[i].frequency;
> > +		do_div(val, fmax);
> > +		cs_table->state[i].capacity = (unsigned long)val;
> > +	}
> >  }
> 
> Hmmm yes, that should work.
> 
> > 
> > This brings me to another question. Let's say there are multiple users of
> > the Energy Model in the system. Shouldn't the units of frequency and power
> > not standardized, maybe Mhz and mW?
> > The task scheduler doesn't care since it is only interested in power diffs
> > but other user might do.
> 
> So the good thing about specifying units is that we can probably assume
> ranges on the values. If the power is in mW, assuming that we're talking
> about a single CPU, it'll probably fit in 16 bits. 65W/core should be
> a reasonable upper-bound ?
> But there are also vendors who might not be happy with disclosing absolute
> values ... These are sometimes considered sensitive and only relative
> numbers are discussed publicly. Now, you can also argue that we already
> have units specified in IPA for ex, and that it doesn't really matter if
> a driver "lies" about the real value, as long as the ratios are correct.
> And I guess that anyone can do measurement on the hardware and get those
> values anyway. So specifying a unit (mW) for the power is probably a
> good idea.

Mmm, I remember we fought quite a bit while getting capacity-dmpis-mhz
binding accepted, and one of the musts was that the values were going to
be normalized. So, normalized power values again maybe?

Best,

- Juri
Quentin Perret June 6, 2018, 3:29 p.m. UTC | #3
On Wednesday 06 Jun 2018 at 17:20:00 (+0200), Juri Lelli wrote:
> > > This brings me to another question. Let's say there are multiple users of
> > > the Energy Model in the system. Shouldn't the units of frequency and power
> > > not standardized, maybe Mhz and mW?
> > > The task scheduler doesn't care since it is only interested in power diffs
> > > but other user might do.
> > 
> > So the good thing about specifying units is that we can probably assume
> > ranges on the values. If the power is in mW, assuming that we're talking
> > about a single CPU, it'll probably fit in 16 bits. 65W/core should be
> > a reasonable upper-bound ?
> > But there are also vendors who might not be happy with disclosing absolute
> > values ... These are sometimes considered sensitive and only relative
> > numbers are discussed publicly. Now, you can also argue that we already
> > have units specified in IPA for ex, and that it doesn't really matter if
> > a driver "lies" about the real value, as long as the ratios are correct.
> > And I guess that anyone can do measurement on the hardware and get those
> > values anyway. So specifying a unit (mW) for the power is probably a
> > good idea.
> 
> Mmm, I remember we fought quite a bit while getting capacity-dmpis-mhz
> binding accepted, and one of the musts was that the values were going to
> be normalized. So, normalized power values again maybe?

Hmmm, that's a very good point ... There should be no problems on the
scheduler side -- we're only interested in correct ratios. But I'm not
sure on the thermal side ... I will double check that.

Javi, Viresh, Eduardo: any thoughts about this ?

Thanks !
Quentin
Quentin Perret June 6, 2018, 4:26 p.m. UTC | #4
On Wednesday 06 Jun 2018 at 16:29:50 (+0100), Quentin Perret wrote:
> On Wednesday 06 Jun 2018 at 17:20:00 (+0200), Juri Lelli wrote:
> > > > This brings me to another question. Let's say there are multiple users of
> > > > the Energy Model in the system. Shouldn't the units of frequency and power
> > > > not standardized, maybe Mhz and mW?
> > > > The task scheduler doesn't care since it is only interested in power diffs
> > > > but other user might do.
> > > 
> > > So the good thing about specifying units is that we can probably assume
> > > ranges on the values. If the power is in mW, assuming that we're talking
> > > about a single CPU, it'll probably fit in 16 bits. 65W/core should be
> > > a reasonable upper-bound ?
> > > But there are also vendors who might not be happy with disclosing absolute
> > > values ... These are sometimes considered sensitive and only relative
> > > numbers are discussed publicly. Now, you can also argue that we already
> > > have units specified in IPA for ex, and that it doesn't really matter if
> > > a driver "lies" about the real value, as long as the ratios are correct.
> > > And I guess that anyone can do measurement on the hardware and get those
> > > values anyway. So specifying a unit (mW) for the power is probably a
> > > good idea.
> > 
> > Mmm, I remember we fought quite a bit while getting capacity-dmpis-mhz
> > binding accepted, and one of the musts was that the values were going to
> > be normalized. So, normalized power values again maybe?
> 
> Hmmm, that's a very good point ... There should be no problems on the
> scheduler side -- we're only interested in correct ratios. But I'm not
> sure on the thermal side ... I will double check that.

So, IPA needs to compare the power of the CPUs with the power of other
things (e.g. GPUs). So we can't normalize the power of the CPUs without
normalizing in the same scale the power of the other devices. I see two
possibilities:

1) we don't normalize the CPU power values, we specify them in mW, and
   we document (and maybe throw a warning if we see an issue at runtime)
   the max range of values. The max expected power for a single core
   could be 65K for ex (16bits). And based on that we can verify
   overflow and precision issues in the algorithms, and we keep it easy
   to compare the CPU power numbers with other devices.

2) we normalize the power values, but that means that the EM framework
   has to manage not only CPUs, but also other types of devices, and
   normalized their power values as well. That's required to keep the
   scale consistent across all of them, and keep comparisons doable.
   But if we do this, we still have to keep a normalized and a "raw"
   version of the power for all devices. And the "raw" power must still
   be in the same unit across all devices, otherwise the re-scaling is
   broken. The main benefit of doing this is that the range of
   acceptable "raw" power values can be larger, probably 32bits, and
   that the precision of the normalized range is arbitrary.

I feel like 2) involves a lot of complexity, and not so many benefits,
so I'd be happy to go with 1). Unless I forgot something ?

Thanks,
Quentin
Dietmar Eggemann June 7, 2018, 3:58 p.m. UTC | #5
On 06/06/2018 06:26 PM, Quentin Perret wrote:
> On Wednesday 06 Jun 2018 at 16:29:50 (+0100), Quentin Perret wrote:
>> On Wednesday 06 Jun 2018 at 17:20:00 (+0200), Juri Lelli wrote:
>>>>> This brings me to another question. Let's say there are multiple users of
>>>>> the Energy Model in the system. Shouldn't the units of frequency and power
>>>>> not standardized, maybe Mhz and mW?
>>>>> The task scheduler doesn't care since it is only interested in power diffs
>>>>> but other user might do.
>>>>
>>>> So the good thing about specifying units is that we can probably assume
>>>> ranges on the values. If the power is in mW, assuming that we're talking
>>>> about a single CPU, it'll probably fit in 16 bits. 65W/core should be
>>>> a reasonable upper-bound ?
>>>> But there are also vendors who might not be happy with disclosing absolute
>>>> values ... These are sometimes considered sensitive and only relative
>>>> numbers are discussed publicly. Now, you can also argue that we already
>>>> have units specified in IPA for ex, and that it doesn't really matter if
>>>> a driver "lies" about the real value, as long as the ratios are correct.
>>>> And I guess that anyone can do measurement on the hardware and get those
>>>> values anyway. So specifying a unit (mW) for the power is probably a
>>>> good idea.
>>>
>>> Mmm, I remember we fought quite a bit while getting capacity-dmpis-mhz
>>> binding accepted, and one of the musts was that the values were going to
>>> be normalized. So, normalized power values again maybe?
>>
>> Hmmm, that's a very good point ... There should be no problems on the
>> scheduler side -- we're only interested in correct ratios. But I'm not
>> sure on the thermal side ... I will double check that.
> 
> So, IPA needs to compare the power of the CPUs with the power of other
> things (e.g. GPUs). So we can't normalize the power of the CPUs without
> normalizing in the same scale the power of the other devices. I see two
> possibilities:
> 
> 1) we don't normalize the CPU power values, we specify them in mW, and
>     we document (and maybe throw a warning if we see an issue at runtime)
>     the max range of values. The max expected power for a single core
>     could be 65K for ex (16bits). And based on that we can verify
>     overflow and precision issues in the algorithms, and we keep it easy
>     to compare the CPU power numbers with other devices.

I would say we need 1). 32bit values with units and proper documentation 
of the possible ranges.

[...]
Javi Merino June 8, 2018, 1:39 p.m. UTC | #6
On Wed, Jun 06, 2018 at 05:26:47PM +0100, Quentin Perret wrote:
> On Wednesday 06 Jun 2018 at 16:29:50 (+0100), Quentin Perret wrote:
> > On Wednesday 06 Jun 2018 at 17:20:00 (+0200), Juri Lelli wrote:
> > > > > This brings me to another question. Let's say there are multiple users of
> > > > > the Energy Model in the system. Shouldn't the units of frequency and power
> > > > > not standardized, maybe Mhz and mW?
> > > > > The task scheduler doesn't care since it is only interested in power diffs
> > > > > but other user might do.
> > > > 
> > > > So the good thing about specifying units is that we can probably assume
> > > > ranges on the values. If the power is in mW, assuming that we're talking
> > > > about a single CPU, it'll probably fit in 16 bits. 65W/core should be
> > > > a reasonable upper-bound ?
> > > > But there are also vendors who might not be happy with disclosing absolute
> > > > values ... These are sometimes considered sensitive and only relative
> > > > numbers are discussed publicly. Now, you can also argue that we already
> > > > have units specified in IPA for ex, and that it doesn't really matter if
> > > > a driver "lies" about the real value, as long as the ratios are correct.
> > > > And I guess that anyone can do measurement on the hardware and get those
> > > > values anyway. So specifying a unit (mW) for the power is probably a
> > > > good idea.
> > > 
> > > Mmm, I remember we fought quite a bit while getting capacity-dmpis-mhz
> > > binding accepted, and one of the musts was that the values were going to
> > > be normalized. So, normalized power values again maybe?
> > 
> > Hmmm, that's a very good point ... There should be no problems on the
> > scheduler side -- we're only interested in correct ratios. But I'm not
> > sure on the thermal side ... I will double check that.
> 
> So, IPA needs to compare the power of the CPUs with the power of other
> things (e.g. GPUs). So we can't normalize the power of the CPUs without
> normalizing in the same scale the power of the other devices. I see two
> possibilities:
> 
> 1) we don't normalize the CPU power values, we specify them in mW, and
>    we document (and maybe throw a warning if we see an issue at runtime)
>    the max range of values. The max expected power for a single core
>    could be 65K for ex (16bits). And based on that we can verify
>    overflow and precision issues in the algorithms, and we keep it easy
>    to compare the CPU power numbers with other devices.
> 
> 2) we normalize the power values, but that means that the EM framework
>    has to manage not only CPUs, but also other types of devices, and
>    normalized their power values as well. That's required to keep the
>    scale consistent across all of them, and keep comparisons doable.
>    But if we do this, we still have to keep a normalized and a "raw"
>    version of the power for all devices. And the "raw" power must still
>    be in the same unit across all devices, otherwise the re-scaling is
>    broken. The main benefit of doing this is that the range of
>    acceptable "raw" power values can be larger, probably 32bits, and
>    that the precision of the normalized range is arbitrary.
> 
> I feel like 2) involves a lot of complexity, and not so many benefits,
> so I'd be happy to go with 1). Unless I forgot something ?

From the thermal point of view, the power values don't need to have
any given unit, as long as the values are comparable to each other.
Do we need to normalize anything in the kernel though?  Can't we just
assume that whatever the platform is telling us is correct?  Quentin
mentioned it earlier: sometimes absolute values are considered
sensitive and we only get ones that are correct relative to the rest
of the system.

(This reminds me that the units in
Documentation/thermal/cpu-cooling-api.txt are wrong and I need to fix
it :-X )

Cheers,
Javi
Quentin Perret June 8, 2018, 3:47 p.m. UTC | #7
Hi Javi,

On Friday 08 Jun 2018 at 14:39:42 (+0100), Javi Merino wrote:
> On Wed, Jun 06, 2018 at 05:26:47PM +0100, Quentin Perret wrote:
> > On Wednesday 06 Jun 2018 at 16:29:50 (+0100), Quentin Perret wrote:
> > > On Wednesday 06 Jun 2018 at 17:20:00 (+0200), Juri Lelli wrote:
> > > > > > This brings me to another question. Let's say there are multiple users of
> > > > > > the Energy Model in the system. Shouldn't the units of frequency and power
> > > > > > not standardized, maybe Mhz and mW?
> > > > > > The task scheduler doesn't care since it is only interested in power diffs
> > > > > > but other user might do.
> > > > > 
> > > > > So the good thing about specifying units is that we can probably assume
> > > > > ranges on the values. If the power is in mW, assuming that we're talking
> > > > > about a single CPU, it'll probably fit in 16 bits. 65W/core should be
> > > > > a reasonable upper-bound ?
> > > > > But there are also vendors who might not be happy with disclosing absolute
> > > > > values ... These are sometimes considered sensitive and only relative
> > > > > numbers are discussed publicly. Now, you can also argue that we already
> > > > > have units specified in IPA for ex, and that it doesn't really matter if
> > > > > a driver "lies" about the real value, as long as the ratios are correct.
> > > > > And I guess that anyone can do measurement on the hardware and get those
> > > > > values anyway. So specifying a unit (mW) for the power is probably a
> > > > > good idea.
> > > > 
> > > > Mmm, I remember we fought quite a bit while getting capacity-dmpis-mhz
> > > > binding accepted, and one of the musts was that the values were going to
> > > > be normalized. So, normalized power values again maybe?
> > > 
> > > Hmmm, that's a very good point ... There should be no problems on the
> > > scheduler side -- we're only interested in correct ratios. But I'm not
> > > sure on the thermal side ... I will double check that.
> > 
> > So, IPA needs to compare the power of the CPUs with the power of other
> > things (e.g. GPUs). So we can't normalize the power of the CPUs without
> > normalizing in the same scale the power of the other devices. I see two
> > possibilities:
> > 
> > 1) we don't normalize the CPU power values, we specify them in mW, and
> >    we document (and maybe throw a warning if we see an issue at runtime)
> >    the max range of values. The max expected power for a single core
> >    could be 65K for ex (16bits). And based on that we can verify
> >    overflow and precision issues in the algorithms, and we keep it easy
> >    to compare the CPU power numbers with other devices.
> > 
> > 2) we normalize the power values, but that means that the EM framework
> >    has to manage not only CPUs, but also other types of devices, and
> >    normalized their power values as well. That's required to keep the
> >    scale consistent across all of them, and keep comparisons doable.
> >    But if we do this, we still have to keep a normalized and a "raw"
> >    version of the power for all devices. And the "raw" power must still
> >    be in the same unit across all devices, otherwise the re-scaling is
> >    broken. The main benefit of doing this is that the range of
> >    acceptable "raw" power values can be larger, probably 32bits, and
> >    that the precision of the normalized range is arbitrary.
> > 
> > I feel like 2) involves a lot of complexity, and not so many benefits,
> > so I'd be happy to go with 1). Unless I forgot something ?
> 
> From the thermal point of view, the power values don't need to have
> any given unit, as long as the values are comparable to each other.

OK, thanks for confirming that :-)

> Do we need to normalize anything in the kernel though?  Can't we just
> assume that whatever the platform is telling us is correct?  Quentin
> mentioned it earlier: sometimes absolute values are considered
> sensitive and we only get ones that are correct relative to the rest
> of the system.

I'm happy to specify the units as mW and let the drivers lie about the
true values. At least that helps them lie coherently if another
subsystem requires power in uW for example.

> 
> (This reminds me that the units in
> Documentation/thermal/cpu-cooling-api.txt are wrong and I need to fix
> it :-X )

;-)

Thanks,
Quentin
Javi Merino June 9, 2018, 8:24 a.m. UTC | #8
On Fri, Jun 08, 2018 at 04:47:39PM +0100, Quentin Perret wrote:
> Hi Javi,
> 
> On Friday 08 Jun 2018 at 14:39:42 (+0100), Javi Merino wrote:
> > On Wed, Jun 06, 2018 at 05:26:47PM +0100, Quentin Perret wrote:
> > > On Wednesday 06 Jun 2018 at 16:29:50 (+0100), Quentin Perret wrote:
> > > > On Wednesday 06 Jun 2018 at 17:20:00 (+0200), Juri Lelli wrote:
> > > > > > > This brings me to another question. Let's say there are multiple users of
> > > > > > > the Energy Model in the system. Shouldn't the units of frequency and power
> > > > > > > not standardized, maybe Mhz and mW?
> > > > > > > The task scheduler doesn't care since it is only interested in power diffs
> > > > > > > but other user might do.
> > > > > > 
> > > > > > So the good thing about specifying units is that we can probably assume
> > > > > > ranges on the values. If the power is in mW, assuming that we're talking
> > > > > > about a single CPU, it'll probably fit in 16 bits. 65W/core should be
> > > > > > a reasonable upper-bound ?
> > > > > > But there are also vendors who might not be happy with disclosing absolute
> > > > > > values ... These are sometimes considered sensitive and only relative
> > > > > > numbers are discussed publicly. Now, you can also argue that we already
> > > > > > have units specified in IPA for ex, and that it doesn't really matter if
> > > > > > a driver "lies" about the real value, as long as the ratios are correct.
> > > > > > And I guess that anyone can do measurement on the hardware and get those
> > > > > > values anyway. So specifying a unit (mW) for the power is probably a
> > > > > > good idea.
> > > > > 
> > > > > Mmm, I remember we fought quite a bit while getting capacity-dmpis-mhz
> > > > > binding accepted, and one of the musts was that the values were going to
> > > > > be normalized. So, normalized power values again maybe?
> > > > 
> > > > Hmmm, that's a very good point ... There should be no problems on the
> > > > scheduler side -- we're only interested in correct ratios. But I'm not
> > > > sure on the thermal side ... I will double check that.
> > > 
> > > So, IPA needs to compare the power of the CPUs with the power of other
> > > things (e.g. GPUs). So we can't normalize the power of the CPUs without
> > > normalizing in the same scale the power of the other devices. I see two
> > > possibilities:
> > > 
> > > 1) we don't normalize the CPU power values, we specify them in mW, and
> > >    we document (and maybe throw a warning if we see an issue at runtime)
> > >    the max range of values. The max expected power for a single core
> > >    could be 65K for ex (16bits). And based on that we can verify
> > >    overflow and precision issues in the algorithms, and we keep it easy
> > >    to compare the CPU power numbers with other devices.
> > > 
> > > 2) we normalize the power values, but that means that the EM framework
> > >    has to manage not only CPUs, but also other types of devices, and
> > >    normalized their power values as well. That's required to keep the
> > >    scale consistent across all of them, and keep comparisons doable.
> > >    But if we do this, we still have to keep a normalized and a "raw"
> > >    version of the power for all devices. And the "raw" power must still
> > >    be in the same unit across all devices, otherwise the re-scaling is
> > >    broken. The main benefit of doing this is that the range of
> > >    acceptable "raw" power values can be larger, probably 32bits, and
> > >    that the precision of the normalized range is arbitrary.
> > > 
> > > I feel like 2) involves a lot of complexity, and not so many benefits,
> > > so I'd be happy to go with 1). Unless I forgot something ?
> > 
> > From the thermal point of view, the power values don't need to have
> > any given unit, as long as the values are comparable to each other.
> 
> OK, thanks for confirming that :-)
> 
> > Do we need to normalize anything in the kernel though?  Can't we just
> > assume that whatever the platform is telling us is correct?  Quentin
> > mentioned it earlier: sometimes absolute values are considered
> > sensitive and we only get ones that are correct relative to the rest
> > of the system.
> 
> I'm happy to specify the units as mW and let the drivers lie about the
> true values. At least that helps them lie coherently if another
> subsystem requires power in uW for example.

I think this is a good option.

Cheers,
Javi
diff mbox

Patch

diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 6ad53f1cf7e6..c13b3eb8bf35 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -144,9 +144,11 @@  static void fd_update_cs_table(struct em_cs_table *cs_table, int cpu)
 	unsigned long fmax = cs_table->state[max_cap_state].frequency;
 	int i;
 
-	for (i = 0; i < cs_table->nr_cap_states; i++)
-		cs_table->state[i].capacity = cmax *
-					cs_table->state[i].frequency / fmax;
+	for (i = 0; i < cs_table->nr_cap_states; i++) {
+		u64 val = (u64)cmax * cs_table->state[i].frequency;
+		do_div(val, fmax);
+		cs_table->state[i].capacity = (unsigned long)val;
+	}
 }

This brings me to another question. Let's say there are multiple users of