diff mbox series

base: arch_topology: Use policy->max to calculate freq_factor

Message ID 20211115201010.68567-1-thara.gopinath@linaro.org (mailing list archive)
State Not Applicable
Headers show
Series base: arch_topology: Use policy->max to calculate freq_factor | expand

Commit Message

Thara Gopinath Nov. 15, 2021, 8:10 p.m. UTC
cpuinfo.max_freq can reflect boost frequency if enabled during boot.  Since
we don't consider boost frequencies while calculating cpu capacities, use
policy->max to populate the freq_factor during boot up.

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
---
 drivers/base/arch_topology.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Rafael J. Wysocki Nov. 16, 2021, 7:05 p.m. UTC | #1
On Mon, Nov 15, 2021 at 9:10 PM Thara Gopinath
<thara.gopinath@linaro.org> wrote:
>
> cpuinfo.max_freq can reflect boost frequency if enabled during boot.  Since
> we don't consider boost frequencies while calculating cpu capacities, use
> policy->max to populate the freq_factor during boot up.

I'm not sure about this.  schedutil uses cpuinfo.max_freq as the max frequency.

> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> ---
>  drivers/base/arch_topology.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index 43407665918f..df818b439bc3 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -334,7 +334,7 @@ init_cpu_capacity_callback(struct notifier_block *nb,
>         cpumask_andnot(cpus_to_visit, cpus_to_visit, policy->related_cpus);
>
>         for_each_cpu(cpu, policy->related_cpus)
> -               per_cpu(freq_factor, cpu) = policy->cpuinfo.max_freq / 1000;
> +               per_cpu(freq_factor, cpu) = policy->max / 1000;
>
>         if (cpumask_empty(cpus_to_visit)) {
>                 topology_normalize_cpu_scale();
> --
> 2.25.1
>
Lukasz Luba Nov. 17, 2021, 10:12 a.m. UTC | #2
On 11/15/21 8:10 PM, Thara Gopinath wrote:
> cpuinfo.max_freq can reflect boost frequency if enabled during boot.  Since
> we don't consider boost frequencies while calculating cpu capacities, use
> policy->max to populate the freq_factor during boot up.
> 
> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> ---
>   drivers/base/arch_topology.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index 43407665918f..df818b439bc3 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -334,7 +334,7 @@ init_cpu_capacity_callback(struct notifier_block *nb,
>   	cpumask_andnot(cpus_to_visit, cpus_to_visit, policy->related_cpus);
>   
>   	for_each_cpu(cpu, policy->related_cpus)
> -		per_cpu(freq_factor, cpu) = policy->cpuinfo.max_freq / 1000;
> +		per_cpu(freq_factor, cpu) = policy->max / 1000;
>   
>   	if (cpumask_empty(cpus_to_visit)) {
>   		topology_normalize_cpu_scale();
> 

LGTM

Reviewed-by: Lukasz Luba <lukasz.luba@armc.com>
Lukasz Luba Nov. 17, 2021, 10:46 a.m. UTC | #3
Hi Rafael,

On 11/16/21 7:05 PM, Rafael J. Wysocki wrote:
> On Mon, Nov 15, 2021 at 9:10 PM Thara Gopinath
> <thara.gopinath@linaro.org> wrote:
>>
>> cpuinfo.max_freq can reflect boost frequency if enabled during boot.  Since
>> we don't consider boost frequencies while calculating cpu capacities, use
>> policy->max to populate the freq_factor during boot up.
> 
> I'm not sure about this.  schedutil uses cpuinfo.max_freq as the max frequency.

Agree it's tricky how we treat the boost frequencies and also combine
them with thermal pressure.
We probably would have consider these design bits:
1. Should thermal pressure include boost frequency?
2. Should max capacity 1024 be a boost frequency so scheduler
    would see it explicitly?
- if no, then schedutil could still request boost freq thanks to
   map_util_perf() where we add 25% to the util and then
   map_util_freq() would return a boost freq when util was > 1024


I can see in schedutil only one place when cpuinfo.max_freq is used:
get_next_freq(). If the value stored in there is a boost,
then don't we get a higher freq value for the same util?
Dietmar Eggemann Nov. 17, 2021, 11:21 a.m. UTC | #4
On 16.11.21 20:05, Rafael J. Wysocki wrote:
> On Mon, Nov 15, 2021 at 9:10 PM Thara Gopinath
> <thara.gopinath@linaro.org> wrote:
>>
>> cpuinfo.max_freq can reflect boost frequency if enabled during boot.  Since
>> we don't consider boost frequencies while calculating cpu capacities, use
>> policy->max to populate the freq_factor during boot up.
> 
> I'm not sure about this.  schedutil uses cpuinfo.max_freq as the max frequency.

Same question here. There is this:

  capacity = (freq / freq_max) * arch_scale_cpu_capacity()

in cpufreq_set_cur_state() [drivers/thermal/cpufreq_cooling.c]

where freq_max is `cpufreq_cdev->policy->cpuinfo.max_freq`

And this is then used to calc the PELT thermal pressure.

> 
>> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
>> ---
>>  drivers/base/arch_topology.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
>> index 43407665918f..df818b439bc3 100644
>> --- a/drivers/base/arch_topology.c
>> +++ b/drivers/base/arch_topology.c
>> @@ -334,7 +334,7 @@ init_cpu_capacity_callback(struct notifier_block *nb,
>>         cpumask_andnot(cpus_to_visit, cpus_to_visit, policy->related_cpus);
>>
>>         for_each_cpu(cpu, policy->related_cpus)
>> -               per_cpu(freq_factor, cpu) = policy->cpuinfo.max_freq / 1000;
>> +               per_cpu(freq_factor, cpu) = policy->max / 1000;
>>
>>         if (cpumask_empty(cpus_to_visit)) {
>>                 topology_normalize_cpu_scale();
>> --
>> 2.25.1
>>
Rafael J. Wysocki Nov. 17, 2021, 12:49 p.m. UTC | #5
On Wed, Nov 17, 2021 at 11:46 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> Hi Rafael,
>
> On 11/16/21 7:05 PM, Rafael J. Wysocki wrote:
> > On Mon, Nov 15, 2021 at 9:10 PM Thara Gopinath
> > <thara.gopinath@linaro.org> wrote:
> >>
> >> cpuinfo.max_freq can reflect boost frequency if enabled during boot.  Since
> >> we don't consider boost frequencies while calculating cpu capacities, use
> >> policy->max to populate the freq_factor during boot up.
> >
> > I'm not sure about this.  schedutil uses cpuinfo.max_freq as the max frequency.
>
> Agree it's tricky how we treat the boost frequencies and also combine
> them with thermal pressure.
> We probably would have consider these design bits:
> 1. Should thermal pressure include boost frequency?

Well, I guess so.

Running at a boost frequency certainly increases thermal pressure.

> 2. Should max capacity 1024 be a boost frequency so scheduler
>     would see it explicitly?

That's what it is now if cpuinfo.max_freq is a boost frequency.

> - if no, then schedutil could still request boost freq thanks to
>    map_util_perf() where we add 25% to the util and then
>    map_util_freq() would return a boost freq when util was > 1024
>
>
> I can see in schedutil only one place when cpuinfo.max_freq is used:
> get_next_freq(). If the value stored in there is a boost,
> then don't we get a higher freq value for the same util?

Yes. we do, which basically is my point.

The schedutil's response is proportional to cpuinfo.max_freq and that
needs to be taken into account for the results to be consistent.
Lukasz Luba Nov. 17, 2021, 3:08 p.m. UTC | #6
On 11/17/21 12:49 PM, Rafael J. Wysocki wrote:
> On Wed, Nov 17, 2021 at 11:46 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>> Hi Rafael,
>>
>> On 11/16/21 7:05 PM, Rafael J. Wysocki wrote:
>>> On Mon, Nov 15, 2021 at 9:10 PM Thara Gopinath
>>> <thara.gopinath@linaro.org> wrote:
>>>>
>>>> cpuinfo.max_freq can reflect boost frequency if enabled during boot.  Since
>>>> we don't consider boost frequencies while calculating cpu capacities, use
>>>> policy->max to populate the freq_factor during boot up.
>>>
>>> I'm not sure about this.  schedutil uses cpuinfo.max_freq as the max frequency.
>>
>> Agree it's tricky how we treat the boost frequencies and also combine
>> them with thermal pressure.
>> We probably would have consider these design bits:
>> 1. Should thermal pressure include boost frequency?
> 
> Well, I guess so.
> 
> Running at a boost frequency certainly increases thermal pressure.
> 
>> 2. Should max capacity 1024 be a boost frequency so scheduler
>>      would see it explicitly?
> 
> That's what it is now if cpuinfo.max_freq is a boost frequency.
> 
>> - if no, then schedutil could still request boost freq thanks to
>>     map_util_perf() where we add 25% to the util and then
>>     map_util_freq() would return a boost freq when util was > 1024
>>
>>
>> I can see in schedutil only one place when cpuinfo.max_freq is used:
>> get_next_freq(). If the value stored in there is a boost,
>> then don't we get a higher freq value for the same util?
> 
> Yes. we do, which basically is my point.
> 
> The schedutil's response is proportional to cpuinfo.max_freq and that
> needs to be taken into account for the results to be consistent.
> 

This boost thing wasn't an issue for us, because we didn't have
platforms which come with it (till recently). I've checked that you have
quite a few CPUs which support huge boost freq, e.g. 5GHz vs. 3.6GHz
nominal max freq [1]. Am I reading this correctly as kernel boost freq?
Do you represent this 5GHz as 1024 capacity?
 From this schedutil get_next_freq() I would guess yes.

I cannot find if you use thermal pressure, could you help me with this,
please?


[1] 
https://ark.intel.com/content/www/us/en/ark/products/186605/intel-core-i99900k-processor-16m-cache-up-to-5-00-ghz.html
Rafael J. Wysocki Nov. 17, 2021, 3:17 p.m. UTC | #7
On Wed, Nov 17, 2021 at 4:08 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
>
>
> On 11/17/21 12:49 PM, Rafael J. Wysocki wrote:
> > On Wed, Nov 17, 2021 at 11:46 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
> >>
> >> Hi Rafael,
> >>
> >> On 11/16/21 7:05 PM, Rafael J. Wysocki wrote:
> >>> On Mon, Nov 15, 2021 at 9:10 PM Thara Gopinath
> >>> <thara.gopinath@linaro.org> wrote:
> >>>>
> >>>> cpuinfo.max_freq can reflect boost frequency if enabled during boot.  Since
> >>>> we don't consider boost frequencies while calculating cpu capacities, use
> >>>> policy->max to populate the freq_factor during boot up.
> >>>
> >>> I'm not sure about this.  schedutil uses cpuinfo.max_freq as the max frequency.
> >>
> >> Agree it's tricky how we treat the boost frequencies and also combine
> >> them with thermal pressure.
> >> We probably would have consider these design bits:
> >> 1. Should thermal pressure include boost frequency?
> >
> > Well, I guess so.
> >
> > Running at a boost frequency certainly increases thermal pressure.
> >
> >> 2. Should max capacity 1024 be a boost frequency so scheduler
> >>      would see it explicitly?
> >
> > That's what it is now if cpuinfo.max_freq is a boost frequency.
> >
> >> - if no, then schedutil could still request boost freq thanks to
> >>     map_util_perf() where we add 25% to the util and then
> >>     map_util_freq() would return a boost freq when util was > 1024
> >>
> >>
> >> I can see in schedutil only one place when cpuinfo.max_freq is used:
> >> get_next_freq(). If the value stored in there is a boost,
> >> then don't we get a higher freq value for the same util?
> >
> > Yes. we do, which basically is my point.
> >
> > The schedutil's response is proportional to cpuinfo.max_freq and that
> > needs to be taken into account for the results to be consistent.
> >
>
> This boost thing wasn't an issue for us, because we didn't have
> platforms which come with it (till recently). I've checked that you have
> quite a few CPUs which support huge boost freq, e.g. 5GHz vs. 3.6GHz
> nominal max freq [1]. Am I reading this correctly as kernel boost freq?

That actually depends on the driver.

For instance, intel_pstate can be run with turbo (== boost) enabled or
disabled.  If turbo is enabled, cpuinfo.max_freq is the max turbo
frequency.

In acpi_cpufreq things are sort of weird, because the highest bin in
there is a turbo frequency, but not the max one and it is used to
enable the entire turbo range.  The driver sets cpuinfo.max_freq to
this one if boost is enabled IIRC.

> Do you represent this 5GHz as 1024 capacity?

Yes (but see above).

>  From this schedutil get_next_freq() I would guess yes.
>
> I cannot find if you use thermal pressure, could you help me with this,
> please?

It is not used on x86 AFAICS.
Lukasz Luba Nov. 17, 2021, 3:47 p.m. UTC | #8
On 11/17/21 3:17 PM, Rafael J. Wysocki wrote:
> On Wed, Nov 17, 2021 at 4:08 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>>
>>
>> On 11/17/21 12:49 PM, Rafael J. Wysocki wrote:
>>> On Wed, Nov 17, 2021 at 11:46 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>>>
>>>> Hi Rafael,
>>>>
>>>> On 11/16/21 7:05 PM, Rafael J. Wysocki wrote:
>>>>> On Mon, Nov 15, 2021 at 9:10 PM Thara Gopinath
>>>>> <thara.gopinath@linaro.org> wrote:
>>>>>>
>>>>>> cpuinfo.max_freq can reflect boost frequency if enabled during boot.  Since
>>>>>> we don't consider boost frequencies while calculating cpu capacities, use
>>>>>> policy->max to populate the freq_factor during boot up.
>>>>>
>>>>> I'm not sure about this.  schedutil uses cpuinfo.max_freq as the max frequency.
>>>>
>>>> Agree it's tricky how we treat the boost frequencies and also combine
>>>> them with thermal pressure.
>>>> We probably would have consider these design bits:
>>>> 1. Should thermal pressure include boost frequency?
>>>
>>> Well, I guess so.
>>>
>>> Running at a boost frequency certainly increases thermal pressure.
>>>
>>>> 2. Should max capacity 1024 be a boost frequency so scheduler
>>>>       would see it explicitly?
>>>
>>> That's what it is now if cpuinfo.max_freq is a boost frequency.
>>>
>>>> - if no, then schedutil could still request boost freq thanks to
>>>>      map_util_perf() where we add 25% to the util and then
>>>>      map_util_freq() would return a boost freq when util was > 1024
>>>>
>>>>
>>>> I can see in schedutil only one place when cpuinfo.max_freq is used:
>>>> get_next_freq(). If the value stored in there is a boost,
>>>> then don't we get a higher freq value for the same util?
>>>
>>> Yes. we do, which basically is my point.
>>>
>>> The schedutil's response is proportional to cpuinfo.max_freq and that
>>> needs to be taken into account for the results to be consistent.
>>>
>>
>> This boost thing wasn't an issue for us, because we didn't have
>> platforms which come with it (till recently). I've checked that you have
>> quite a few CPUs which support huge boost freq, e.g. 5GHz vs. 3.6GHz
>> nominal max freq [1]. Am I reading this correctly as kernel boost freq?
> 
> That actually depends on the driver.
> 
> For instance, intel_pstate can be run with turbo (== boost) enabled or
> disabled.  If turbo is enabled, cpuinfo.max_freq is the max turbo
> frequency.
> 
> In acpi_cpufreq things are sort of weird, because the highest bin in
> there is a turbo frequency, but not the max one and it is used to
> enable the entire turbo range.  The driver sets cpuinfo.max_freq to
> this one if boost is enabled IIRC.
> 
>> Do you represent this 5GHz as 1024 capacity?
> 
> Yes (but see above).
> 
>>   From this schedutil get_next_freq() I would guess yes.
>>
>> I cannot find if you use thermal pressure, could you help me with this,
>> please?
> 
> It is not used on x86 AFAICS.
> 

Thank you Rafael for all these information. We will have to re-visit
many places on our platform and how this boost should work. It looks
for the first glance that its a full-time task for one of our
team members. We would have to organize this investigation
internally to get better understanding of all affected places.

While this patch change is easy, since the policy->max should
contain the nominal max freq at this setup time (which is what
we want for calculating capacity), the schedutil usage of
cpuinfo.max_freq is not easy to judge and solve. Currently,
on our platforms we stick to the design where nominal max freq
is 1024 capacity, but I don't know if that would hold for long...
Thara Gopinath Nov. 17, 2021, 5:01 p.m. UTC | #9
Hi,

On 11/17/21 7:49 AM, Rafael J. Wysocki wrote:
> On Wed, Nov 17, 2021 at 11:46 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>> Hi Rafael,
>>
>> On 11/16/21 7:05 PM, Rafael J. Wysocki wrote:
>>> On Mon, Nov 15, 2021 at 9:10 PM Thara Gopinath
>>> <thara.gopinath@linaro.org> wrote:
>>>>
>>>> cpuinfo.max_freq can reflect boost frequency if enabled during boot.  Since
>>>> we don't consider boost frequencies while calculating cpu capacities, use
>>>> policy->max to populate the freq_factor during boot up.
>>>
>>> I'm not sure about this.  schedutil uses cpuinfo.max_freq as the max frequency.
>>
>> Agree it's tricky how we treat the boost frequencies and also combine
>> them with thermal pressure.
>> We probably would have consider these design bits:
>> 1. Should thermal pressure include boost frequency?
> 
> Well, I guess so.
> 
> Running at a boost frequency certainly increases thermal pressure.
> 
>> 2. Should max capacity 1024 be a boost frequency so scheduler
>>      would see it explicitly?
> 
> That's what it is now if cpuinfo.max_freq is a boost frequency.
> 
>> - if no, then schedutil could still request boost freq thanks to
>>     map_util_perf() where we add 25% to the util and then
>>     map_util_freq() would return a boost freq when util was > 1024
>>
>>
>> I can see in schedutil only one place when cpuinfo.max_freq is used:
>> get_next_freq(). If the value stored in there is a boost,
>> then don't we get a higher freq value for the same util?
> 
> Yes. we do, which basically is my point.
> 
> The schedutil's response is proportional to cpuinfo.max_freq and that
> needs to be taken into account for the results to be consistent.

So IIUC, cpuinfo.max_freq is always supposed to be the highest supported 
frequency of a cpu, irrespective of whether boost is enabled or not. 
Where as policy->max is the currently available maximum cpu frequency 
which can be equal to cpuinfo.max_freq or lower (depending on whether 
boost is enabled, whether there is a constraint on policy->max placed by 
thermal etc). So in this case isn't it better for schedutil to consider 
policy->max instead of cpuinfo.max ? Like you mentioned above same 
utilization will relate to different frequencies depending on the 
maximum frequency.



>
Rafael J. Wysocki Nov. 17, 2021, 5:59 p.m. UTC | #10
On Wed, Nov 17, 2021 at 6:01 PM Thara Gopinath
<thara.gopinath@linaro.org> wrote:
>
> Hi,
>
> On 11/17/21 7:49 AM, Rafael J. Wysocki wrote:
> > On Wed, Nov 17, 2021 at 11:46 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
> >>
> >> Hi Rafael,
> >>
> >> On 11/16/21 7:05 PM, Rafael J. Wysocki wrote:
> >>> On Mon, Nov 15, 2021 at 9:10 PM Thara Gopinath
> >>> <thara.gopinath@linaro.org> wrote:
> >>>>
> >>>> cpuinfo.max_freq can reflect boost frequency if enabled during boot.  Since
> >>>> we don't consider boost frequencies while calculating cpu capacities, use
> >>>> policy->max to populate the freq_factor during boot up.
> >>>
> >>> I'm not sure about this.  schedutil uses cpuinfo.max_freq as the max frequency.
> >>
> >> Agree it's tricky how we treat the boost frequencies and also combine
> >> them with thermal pressure.
> >> We probably would have consider these design bits:
> >> 1. Should thermal pressure include boost frequency?
> >
> > Well, I guess so.
> >
> > Running at a boost frequency certainly increases thermal pressure.
> >
> >> 2. Should max capacity 1024 be a boost frequency so scheduler
> >>      would see it explicitly?
> >
> > That's what it is now if cpuinfo.max_freq is a boost frequency.
> >
> >> - if no, then schedutil could still request boost freq thanks to
> >>     map_util_perf() where we add 25% to the util and then
> >>     map_util_freq() would return a boost freq when util was > 1024
> >>
> >>
> >> I can see in schedutil only one place when cpuinfo.max_freq is used:
> >> get_next_freq(). If the value stored in there is a boost,
> >> then don't we get a higher freq value for the same util?
> >
> > Yes. we do, which basically is my point.
> >
> > The schedutil's response is proportional to cpuinfo.max_freq and that
> > needs to be taken into account for the results to be consistent.
>
> So IIUC, cpuinfo.max_freq is always supposed to be the highest supported
> frequency of a cpu, irrespective of whether boost is enabled or not.
> Where as policy->max is the currently available maximum cpu frequency
> which can be equal to cpuinfo.max_freq or lower (depending on whether
> boost is enabled, whether there is a constraint on policy->max placed by
> thermal etc).

It may also depend on the limit set by user space.

> So in this case isn't it better for schedutil to consider
> policy->max instead of cpuinfo.max ?

Not really.

In that case setting policy->max to 1/2 of cpuinfo.max_freq would
cause schedutil to choose 1/4 of cpuinfo.max_freq for 50% utilization
which would be rather unexpected.

policy->max is a cap, not the current maximum capacity.

> Like you mentioned above same
> utilization will relate to different frequencies depending on the
> maximum frequency.

Which is not how it is expected (and defined) to work, though.

If you really want to play with the current maximum capacity, you need
to change it whenever boost is disabled or enabled - and there is a
mechanism for updating cpufinfo.max_freq in such cases.
Lukasz Luba Nov. 24, 2021, 4:22 p.m. UTC | #11
On 11/17/21 10:12 AM, Lukasz Luba wrote:
> 
> 
> On 11/15/21 8:10 PM, Thara Gopinath wrote:
>> cpuinfo.max_freq can reflect boost frequency if enabled during boot.  
>> Since
>> we don't consider boost frequencies while calculating cpu capacities, use
>> policy->max to populate the freq_factor during boot up.
>>
>> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
>> ---
>>   drivers/base/arch_topology.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
>> index 43407665918f..df818b439bc3 100644
>> --- a/drivers/base/arch_topology.c
>> +++ b/drivers/base/arch_topology.c
>> @@ -334,7 +334,7 @@ init_cpu_capacity_callback(struct notifier_block *nb,
>>       cpumask_andnot(cpus_to_visit, cpus_to_visit, policy->related_cpus);
>>       for_each_cpu(cpu, policy->related_cpus)
>> -        per_cpu(freq_factor, cpu) = policy->cpuinfo.max_freq / 1000;
>> +        per_cpu(freq_factor, cpu) = policy->max / 1000;
>>       if (cpumask_empty(cpus_to_visit)) {
>>           topology_normalize_cpu_scale();
>>
> 
> LGTM
> 
> Reviewed-by: Lukasz Luba <lukasz.luba@armc.com>

Rafael, Thara, please ignore for a while this review.
We are going to do full investigation of this boost frequency,
capacity, schedutil util-to-freq mapping with cpuinfo.max_freq.
The code pointed by Rafael in that sched_util function
already has issue [1]. We have to figure out the consistent
solution for all platforms.

Regards,
Lukasz

[1] 
https://elixir.bootlin.com/linux/v5.16-rc2/source/kernel/sched/cpufreq_schedutil.c#L152
Morten Rasmussen Dec. 2, 2021, 10:50 a.m. UTC | #12
On Wed, Nov 17, 2021 at 06:59:05PM +0100, Rafael J. Wysocki wrote:
> On Wed, Nov 17, 2021 at 6:01 PM Thara Gopinath
> <thara.gopinath@linaro.org> wrote:
> >
> > Hi,
> >
> > On 11/17/21 7:49 AM, Rafael J. Wysocki wrote:
> > > On Wed, Nov 17, 2021 at 11:46 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
> > >>
> > >> Hi Rafael,
> > >>
> > >> On 11/16/21 7:05 PM, Rafael J. Wysocki wrote:
> > >>> On Mon, Nov 15, 2021 at 9:10 PM Thara Gopinath
> > >>> <thara.gopinath@linaro.org> wrote:
> > >>>>
> > >>>> cpuinfo.max_freq can reflect boost frequency if enabled during boot.  Since
> > >>>> we don't consider boost frequencies while calculating cpu capacities, use
> > >>>> policy->max to populate the freq_factor during boot up.
> > >>>
> > >>> I'm not sure about this.  schedutil uses cpuinfo.max_freq as the max frequency.
> > >>
> > >> Agree it's tricky how we treat the boost frequencies and also combine
> > >> them with thermal pressure.
> > >> We probably would have consider these design bits:
> > >> 1. Should thermal pressure include boost frequency?
> > >
> > > Well, I guess so.
> > >
> > > Running at a boost frequency certainly increases thermal pressure.
> > >
> > >> 2. Should max capacity 1024 be a boost frequency so scheduler
> > >>      would see it explicitly?
> > >
> > > That's what it is now if cpuinfo.max_freq is a boost frequency.
> > >
> > >> - if no, then schedutil could still request boost freq thanks to
> > >>     map_util_perf() where we add 25% to the util and then
> > >>     map_util_freq() would return a boost freq when util was > 1024
> > >>
> > >>
> > >> I can see in schedutil only one place when cpuinfo.max_freq is used:
> > >> get_next_freq(). If the value stored in there is a boost,
> > >> then don't we get a higher freq value for the same util?
> > >
> > > Yes. we do, which basically is my point.
> > >
> > > The schedutil's response is proportional to cpuinfo.max_freq and that
> > > needs to be taken into account for the results to be consistent.
> >
> > So IIUC, cpuinfo.max_freq is always supposed to be the highest supported
> > frequency of a cpu, irrespective of whether boost is enabled or not.
> > Where as policy->max is the currently available maximum cpu frequency
> > which can be equal to cpuinfo.max_freq or lower (depending on whether
> > boost is enabled, whether there is a constraint on policy->max placed by
> > thermal etc).
> 
> It may also depend on the limit set by user space.
> 
> > So in this case isn't it better for schedutil to consider
> > policy->max instead of cpuinfo.max ?
> 
> Not really.
> 
> In that case setting policy->max to 1/2 of cpuinfo.max_freq would
> cause schedutil to choose 1/4 of cpuinfo.max_freq for 50% utilization
> which would be rather unexpected.
> 
> policy->max is a cap, not the current maximum capacity.
> 
> > Like you mentioned above same
> > utilization will relate to different frequencies depending on the
> > maximum frequency.
> 
> Which is not how it is expected (and defined) to work, though.
> 
> If you really want to play with the current maximum capacity, you need
> to change it whenever boost is disabled or enabled - and there is a
> mechanism for updating cpufinfo.max_freq in such cases.

I don't see why we would want to change max capacity on the fly. It is
not a cheap operation as we would need to normalize the capacity for all
CPUs if the CPU(s) with capacity = 1024 changes its capacity. Worst case
we even have to rebuild the sched_domain hierarchy to update flags. The
update would also temporarily mess with load and utilization signals, so
not a cheap operation.

Morten
Rafael J. Wysocki Dec. 2, 2021, 4:31 p.m. UTC | #13
On Thu, Dec 2, 2021 at 11:50 AM Morten Rasmussen
<morten.rasmussen@arm.com> wrote:
>
> On Wed, Nov 17, 2021 at 06:59:05PM +0100, Rafael J. Wysocki wrote:
> > On Wed, Nov 17, 2021 at 6:01 PM Thara Gopinath
> > <thara.gopinath@linaro.org> wrote:
> > >
> > > Hi,
> > >
> > > On 11/17/21 7:49 AM, Rafael J. Wysocki wrote:
> > > > On Wed, Nov 17, 2021 at 11:46 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
> > > >>
> > > >> Hi Rafael,
> > > >>
> > > >> On 11/16/21 7:05 PM, Rafael J. Wysocki wrote:
> > > >>> On Mon, Nov 15, 2021 at 9:10 PM Thara Gopinath
> > > >>> <thara.gopinath@linaro.org> wrote:
> > > >>>>
> > > >>>> cpuinfo.max_freq can reflect boost frequency if enabled during boot.  Since
> > > >>>> we don't consider boost frequencies while calculating cpu capacities, use
> > > >>>> policy->max to populate the freq_factor during boot up.
> > > >>>
> > > >>> I'm not sure about this.  schedutil uses cpuinfo.max_freq as the max frequency.
> > > >>
> > > >> Agree it's tricky how we treat the boost frequencies and also combine
> > > >> them with thermal pressure.
> > > >> We probably would have consider these design bits:
> > > >> 1. Should thermal pressure include boost frequency?
> > > >
> > > > Well, I guess so.
> > > >
> > > > Running at a boost frequency certainly increases thermal pressure.
> > > >
> > > >> 2. Should max capacity 1024 be a boost frequency so scheduler
> > > >>      would see it explicitly?
> > > >
> > > > That's what it is now if cpuinfo.max_freq is a boost frequency.
> > > >
> > > >> - if no, then schedutil could still request boost freq thanks to
> > > >>     map_util_perf() where we add 25% to the util and then
> > > >>     map_util_freq() would return a boost freq when util was > 1024
> > > >>
> > > >>
> > > >> I can see in schedutil only one place when cpuinfo.max_freq is used:
> > > >> get_next_freq(). If the value stored in there is a boost,
> > > >> then don't we get a higher freq value for the same util?
> > > >
> > > > Yes. we do, which basically is my point.
> > > >
> > > > The schedutil's response is proportional to cpuinfo.max_freq and that
> > > > needs to be taken into account for the results to be consistent.
> > >
> > > So IIUC, cpuinfo.max_freq is always supposed to be the highest supported
> > > frequency of a cpu, irrespective of whether boost is enabled or not.
> > > Where as policy->max is the currently available maximum cpu frequency
> > > which can be equal to cpuinfo.max_freq or lower (depending on whether
> > > boost is enabled, whether there is a constraint on policy->max placed by
> > > thermal etc).
> >
> > It may also depend on the limit set by user space.
> >
> > > So in this case isn't it better for schedutil to consider
> > > policy->max instead of cpuinfo.max ?
> >
> > Not really.
> >
> > In that case setting policy->max to 1/2 of cpuinfo.max_freq would
> > cause schedutil to choose 1/4 of cpuinfo.max_freq for 50% utilization
> > which would be rather unexpected.
> >
> > policy->max is a cap, not the current maximum capacity.
> >
> > > Like you mentioned above same
> > > utilization will relate to different frequencies depending on the
> > > maximum frequency.
> >
> > Which is not how it is expected (and defined) to work, though.
> >
> > If you really want to play with the current maximum capacity, you need
> > to change it whenever boost is disabled or enabled - and there is a
> > mechanism for updating cpufinfo.max_freq in such cases.
>
> I don't see why we would want to change max capacity on the fly. It is
> not a cheap operation as we would need to normalize the capacity for all
> CPUs if the CPU(s) with capacity = 1024 changes its capacity. Worst case
> we even have to rebuild the sched_domain hierarchy to update flags. The
> update would also temporarily mess with load and utilization signals, so
> not a cheap operation.

I didn't say it was cheap. :-)

However, boost frequencies are not disabled and enabled very often, so
it may be acceptable to do it then.  I actually don't know.

The point is that if you set the max capacity to correspond to the max
boosted perf and it is never reached (because boost is disabled), the
scaling will cause CPUs to appear as underutilized, but in fact there
is no spare capacity in the system.

Conversely, if the max capacity corresponds to the max non-boost perf
and boost is used very often, the scaling will cause the CPUs to
appear to be 100% loaded, but there may be still spare capacity in the
system.
Morten Rasmussen Dec. 3, 2021, 9:48 a.m. UTC | #14
On Thu, Dec 02, 2021 at 05:31:53PM +0100, Rafael J. Wysocki wrote:
> On Thu, Dec 2, 2021 at 11:50 AM Morten Rasmussen
> <morten.rasmussen@arm.com> wrote:
> >
> > On Wed, Nov 17, 2021 at 06:59:05PM +0100, Rafael J. Wysocki wrote:
> > > On Wed, Nov 17, 2021 at 6:01 PM Thara Gopinath
> > > <thara.gopinath@linaro.org> wrote:
> > > >
> > > > Hi,
> > > >
> > > > On 11/17/21 7:49 AM, Rafael J. Wysocki wrote:
> > > > > On Wed, Nov 17, 2021 at 11:46 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
> > > > >>
> > > > >> Hi Rafael,
> > > > >>
> > > > >> On 11/16/21 7:05 PM, Rafael J. Wysocki wrote:
> > > > >>> On Mon, Nov 15, 2021 at 9:10 PM Thara Gopinath
> > > > >>> <thara.gopinath@linaro.org> wrote:
> > > > >>>>
> > > > >>>> cpuinfo.max_freq can reflect boost frequency if enabled during boot.  Since
> > > > >>>> we don't consider boost frequencies while calculating cpu capacities, use
> > > > >>>> policy->max to populate the freq_factor during boot up.
> > > > >>>
> > > > >>> I'm not sure about this.  schedutil uses cpuinfo.max_freq as the max frequency.
> > > > >>
> > > > >> Agree it's tricky how we treat the boost frequencies and also combine
> > > > >> them with thermal pressure.
> > > > >> We probably would have consider these design bits:
> > > > >> 1. Should thermal pressure include boost frequency?
> > > > >
> > > > > Well, I guess so.
> > > > >
> > > > > Running at a boost frequency certainly increases thermal pressure.
> > > > >
> > > > >> 2. Should max capacity 1024 be a boost frequency so scheduler
> > > > >>      would see it explicitly?
> > > > >
> > > > > That's what it is now if cpuinfo.max_freq is a boost frequency.
> > > > >
> > > > >> - if no, then schedutil could still request boost freq thanks to
> > > > >>     map_util_perf() where we add 25% to the util and then
> > > > >>     map_util_freq() would return a boost freq when util was > 1024
> > > > >>
> > > > >>
> > > > >> I can see in schedutil only one place when cpuinfo.max_freq is used:
> > > > >> get_next_freq(). If the value stored in there is a boost,
> > > > >> then don't we get a higher freq value for the same util?
> > > > >
> > > > > Yes. we do, which basically is my point.
> > > > >
> > > > > The schedutil's response is proportional to cpuinfo.max_freq and that
> > > > > needs to be taken into account for the results to be consistent.
> > > >
> > > > So IIUC, cpuinfo.max_freq is always supposed to be the highest supported
> > > > frequency of a cpu, irrespective of whether boost is enabled or not.
> > > > Where as policy->max is the currently available maximum cpu frequency
> > > > which can be equal to cpuinfo.max_freq or lower (depending on whether
> > > > boost is enabled, whether there is a constraint on policy->max placed by
> > > > thermal etc).
> > >
> > > It may also depend on the limit set by user space.
> > >
> > > > So in this case isn't it better for schedutil to consider
> > > > policy->max instead of cpuinfo.max ?
> > >
> > > Not really.
> > >
> > > In that case setting policy->max to 1/2 of cpuinfo.max_freq would
> > > cause schedutil to choose 1/4 of cpuinfo.max_freq for 50% utilization
> > > which would be rather unexpected.
> > >
> > > policy->max is a cap, not the current maximum capacity.
> > >
> > > > Like you mentioned above same
> > > > utilization will relate to different frequencies depending on the
> > > > maximum frequency.
> > >
> > > Which is not how it is expected (and defined) to work, though.
> > >
> > > If you really want to play with the current maximum capacity, you need
> > > to change it whenever boost is disabled or enabled - and there is a
> > > mechanism for updating cpufinfo.max_freq in such cases.
> >
> > I don't see why we would want to change max capacity on the fly. It is
> > not a cheap operation as we would need to normalize the capacity for all
> > CPUs if the CPU(s) with capacity = 1024 changes its capacity. Worst case
> > we even have to rebuild the sched_domain hierarchy to update flags. The
> > update would also temporarily mess with load and utilization signals, so
> > not a cheap operation.
> 
> I didn't say it was cheap. :-)

You didn't :-) But I thought it was worth pointing out in case someone
would think we need to constantly renormalize to the highest achievable
performance level taking all factors into account, including thermal
capping.

> However, boost frequencies are not disabled and enabled very often, so
> it may be acceptable to do it then.  I actually don't know.

Agree.

> 
> The point is that if you set the max capacity to correspond to the max
> boosted perf and it is never reached (because boost is disabled), the
> scaling will cause CPUs to appear as underutilized, but in fact there
> is no spare capacity in the system.

We kind of have the problem already with thermal capping but addressed
it by having the thermal pressure signal to indicate the some of the
capacity is unavailable. Perhaps the thermal pressure signal should be extended
to cover all reasons for capacity being unavailable, or we should have
another signal to track boost frequencies not being delivered, manually
disabled or not possible due to system circumstances?

> Conversely, if the max capacity corresponds to the max non-boost perf
> and boost is used very often, the scaling will cause the CPUs to
> appear to be 100% loaded, but there may be still spare capacity in the
> system.

It is even worse than that. Allowing delivered performance to exceed the
CPU capacity will break utilization scale invariance at it will make
per-task utilization appear smaller than it really is potentially
leading to wrong task placement.

I think we have to ensure that the full performance range is visible to
the OS. If part of it is often unachievable we need to track the gap
between requested and delivered performance and somehow take that into
account when making task placement decisions.

Morten
Rafael J. Wysocki Dec. 3, 2021, 3:07 p.m. UTC | #15
On Fri, Dec 3, 2021 at 10:48 AM Morten Rasmussen
<morten.rasmussen@arm.com> wrote:
>
> On Thu, Dec 02, 2021 at 05:31:53PM +0100, Rafael J. Wysocki wrote:
> > On Thu, Dec 2, 2021 at 11:50 AM Morten Rasmussen
> > <morten.rasmussen@arm.com> wrote:
> > >
> > > On Wed, Nov 17, 2021 at 06:59:05PM +0100, Rafael J. Wysocki wrote:
> > > > On Wed, Nov 17, 2021 at 6:01 PM Thara Gopinath
> > > > <thara.gopinath@linaro.org> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > On 11/17/21 7:49 AM, Rafael J. Wysocki wrote:
> > > > > > On Wed, Nov 17, 2021 at 11:46 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
> > > > > >>
> > > > > >> Hi Rafael,
> > > > > >>
> > > > > >> On 11/16/21 7:05 PM, Rafael J. Wysocki wrote:
> > > > > >>> On Mon, Nov 15, 2021 at 9:10 PM Thara Gopinath
> > > > > >>> <thara.gopinath@linaro.org> wrote:
> > > > > >>>>
> > > > > >>>> cpuinfo.max_freq can reflect boost frequency if enabled during boot.  Since
> > > > > >>>> we don't consider boost frequencies while calculating cpu capacities, use
> > > > > >>>> policy->max to populate the freq_factor during boot up.
> > > > > >>>
> > > > > >>> I'm not sure about this.  schedutil uses cpuinfo.max_freq as the max frequency.
> > > > > >>
> > > > > >> Agree it's tricky how we treat the boost frequencies and also combine
> > > > > >> them with thermal pressure.
> > > > > >> We probably would have consider these design bits:
> > > > > >> 1. Should thermal pressure include boost frequency?
> > > > > >
> > > > > > Well, I guess so.
> > > > > >
> > > > > > Running at a boost frequency certainly increases thermal pressure.
> > > > > >
> > > > > >> 2. Should max capacity 1024 be a boost frequency so scheduler
> > > > > >>      would see it explicitly?
> > > > > >
> > > > > > That's what it is now if cpuinfo.max_freq is a boost frequency.
> > > > > >
> > > > > >> - if no, then schedutil could still request boost freq thanks to
> > > > > >>     map_util_perf() where we add 25% to the util and then
> > > > > >>     map_util_freq() would return a boost freq when util was > 1024
> > > > > >>
> > > > > >>
> > > > > >> I can see in schedutil only one place when cpuinfo.max_freq is used:
> > > > > >> get_next_freq(). If the value stored in there is a boost,
> > > > > >> then don't we get a higher freq value for the same util?
> > > > > >
> > > > > > Yes. we do, which basically is my point.
> > > > > >
> > > > > > The schedutil's response is proportional to cpuinfo.max_freq and that
> > > > > > needs to be taken into account for the results to be consistent.
> > > > >
> > > > > So IIUC, cpuinfo.max_freq is always supposed to be the highest supported
> > > > > frequency of a cpu, irrespective of whether boost is enabled or not.
> > > > > Where as policy->max is the currently available maximum cpu frequency
> > > > > which can be equal to cpuinfo.max_freq or lower (depending on whether
> > > > > boost is enabled, whether there is a constraint on policy->max placed by
> > > > > thermal etc).
> > > >
> > > > It may also depend on the limit set by user space.
> > > >
> > > > > So in this case isn't it better for schedutil to consider
> > > > > policy->max instead of cpuinfo.max ?
> > > >
> > > > Not really.
> > > >
> > > > In that case setting policy->max to 1/2 of cpuinfo.max_freq would
> > > > cause schedutil to choose 1/4 of cpuinfo.max_freq for 50% utilization
> > > > which would be rather unexpected.
> > > >
> > > > policy->max is a cap, not the current maximum capacity.
> > > >
> > > > > Like you mentioned above same
> > > > > utilization will relate to different frequencies depending on the
> > > > > maximum frequency.
> > > >
> > > > Which is not how it is expected (and defined) to work, though.
> > > >
> > > > If you really want to play with the current maximum capacity, you need
> > > > to change it whenever boost is disabled or enabled - and there is a
> > > > mechanism for updating cpufinfo.max_freq in such cases.
> > >
> > > I don't see why we would want to change max capacity on the fly. It is
> > > not a cheap operation as we would need to normalize the capacity for all
> > > CPUs if the CPU(s) with capacity = 1024 changes its capacity. Worst case
> > > we even have to rebuild the sched_domain hierarchy to update flags. The
> > > update would also temporarily mess with load and utilization signals, so
> > > not a cheap operation.
> >
> > I didn't say it was cheap. :-)
>
> You didn't :-) But I thought it was worth pointing out in case someone
> would think we need to constantly renormalize to the highest achievable
> performance level taking all factors into account, including thermal
> capping.
>
> > However, boost frequencies are not disabled and enabled very often, so
> > it may be acceptable to do it then.  I actually don't know.
>
> Agree.
>
> >
> > The point is that if you set the max capacity to correspond to the max
> > boosted perf and it is never reached (because boost is disabled), the
> > scaling will cause CPUs to appear as underutilized, but in fact there
> > is no spare capacity in the system.
>
> We kind of have the problem already with thermal capping but addressed
> it by having the thermal pressure signal to indicate the some of the
> capacity is unavailable. Perhaps the thermal pressure signal should be extended
> to cover all reasons for capacity being unavailable, or we should have
> another signal to track boost frequencies not being delivered, manually
> disabled or not possible due to system circumstances?

Well, even without boost frequencies, the capacity that's effectively
available may not be the advertised max.  For example,
scaling_max_freq may be set below the advertised max value (and that's
applied after the governor has produced its output), there may be
power capping in place etc.

Taking the thermal pressure in particular into account helps to reduce
it, but that may just be part of the difference between the advertised
max and the effectively available perf, and not even the dominating
one for that matter.

And boost frequencies complicate the picture even further, because
they are more-or-less unsustainable and as a rule there's no
information on how sustainable they are or how much time it takes to
get to the max boost perf (and that may be configurable even).

So IMO the advertised max ought to be treated as the upper bound in
general, but it makes sense to adjust it when it is known to be too
large and it may stay so forever (which is the case when boost
frequencies are disabled).

> > Conversely, if the max capacity corresponds to the max non-boost perf
> > and boost is used very often, the scaling will cause the CPUs to
> > appear to be 100% loaded, but there may be still spare capacity in the
> > system.
>
> It is even worse than that. Allowing delivered performance to exceed the
> CPU capacity will break utilization scale invariance at it will make
> per-task utilization appear smaller than it really is potentially
> leading to wrong task placement.
>
> I think we have to ensure that the full performance range is visible to
> the OS. If part of it is often unachievable we need to track the gap
> between requested and delivered performance and somehow take that into
> account when making task placement decisions.

I generally agree, but let me say that correlating what was asked for
with the delivered perf need not be straightforward.
Morten Rasmussen Dec. 8, 2021, 9:20 a.m. UTC | #16
On Fri, Dec 03, 2021 at 04:07:30PM +0100, Rafael J. Wysocki wrote:
> On Fri, Dec 3, 2021 at 10:48 AM Morten Rasmussen
> <morten.rasmussen@arm.com> wrote:
> >
> > On Thu, Dec 02, 2021 at 05:31:53PM +0100, Rafael J. Wysocki wrote:
> > > On Thu, Dec 2, 2021 at 11:50 AM Morten Rasmussen
> > > <morten.rasmussen@arm.com> wrote:
> > > >
> > > > On Wed, Nov 17, 2021 at 06:59:05PM +0100, Rafael J. Wysocki wrote:
> > > > > On Wed, Nov 17, 2021 at 6:01 PM Thara Gopinath
> > > > > <thara.gopinath@linaro.org> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > On 11/17/21 7:49 AM, Rafael J. Wysocki wrote:
> > > > > > > On Wed, Nov 17, 2021 at 11:46 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
> > > > > > >>
> > > > > > >> Hi Rafael,
> > > > > > >>
> > > > > > >> On 11/16/21 7:05 PM, Rafael J. Wysocki wrote:
> > > > > > >>> On Mon, Nov 15, 2021 at 9:10 PM Thara Gopinath
> > > > > > >>> <thara.gopinath@linaro.org> wrote:
> > > > > > >>>>
> > > > > > >>>> cpuinfo.max_freq can reflect boost frequency if enabled during boot.  Since
> > > > > > >>>> we don't consider boost frequencies while calculating cpu capacities, use
> > > > > > >>>> policy->max to populate the freq_factor during boot up.
> > > > > > >>>
> > > > > > >>> I'm not sure about this.  schedutil uses cpuinfo.max_freq as the max frequency.
> > > > > > >>
> > > > > > >> Agree it's tricky how we treat the boost frequencies and also combine
> > > > > > >> them with thermal pressure.
> > > > > > >> We probably would have consider these design bits:
> > > > > > >> 1. Should thermal pressure include boost frequency?
> > > > > > >
> > > > > > > Well, I guess so.
> > > > > > >
> > > > > > > Running at a boost frequency certainly increases thermal pressure.
> > > > > > >
> > > > > > >> 2. Should max capacity 1024 be a boost frequency so scheduler
> > > > > > >>      would see it explicitly?
> > > > > > >
> > > > > > > That's what it is now if cpuinfo.max_freq is a boost frequency.
> > > > > > >
> > > > > > >> - if no, then schedutil could still request boost freq thanks to
> > > > > > >>     map_util_perf() where we add 25% to the util and then
> > > > > > >>     map_util_freq() would return a boost freq when util was > 1024
> > > > > > >>
> > > > > > >>
> > > > > > >> I can see in schedutil only one place when cpuinfo.max_freq is used:
> > > > > > >> get_next_freq(). If the value stored in there is a boost,
> > > > > > >> then don't we get a higher freq value for the same util?
> > > > > > >
> > > > > > > Yes. we do, which basically is my point.
> > > > > > >
> > > > > > > The schedutil's response is proportional to cpuinfo.max_freq and that
> > > > > > > needs to be taken into account for the results to be consistent.
> > > > > >
> > > > > > So IIUC, cpuinfo.max_freq is always supposed to be the highest supported
> > > > > > frequency of a cpu, irrespective of whether boost is enabled or not.
> > > > > > Where as policy->max is the currently available maximum cpu frequency
> > > > > > which can be equal to cpuinfo.max_freq or lower (depending on whether
> > > > > > boost is enabled, whether there is a constraint on policy->max placed by
> > > > > > thermal etc).
> > > > >
> > > > > It may also depend on the limit set by user space.
> > > > >
> > > > > > So in this case isn't it better for schedutil to consider
> > > > > > policy->max instead of cpuinfo.max ?
> > > > >
> > > > > Not really.
> > > > >
> > > > > In that case setting policy->max to 1/2 of cpuinfo.max_freq would
> > > > > cause schedutil to choose 1/4 of cpuinfo.max_freq for 50% utilization
> > > > > which would be rather unexpected.
> > > > >
> > > > > policy->max is a cap, not the current maximum capacity.
> > > > >
> > > > > > Like you mentioned above same
> > > > > > utilization will relate to different frequencies depending on the
> > > > > > maximum frequency.
> > > > >
> > > > > Which is not how it is expected (and defined) to work, though.
> > > > >
> > > > > If you really want to play with the current maximum capacity, you need
> > > > > to change it whenever boost is disabled or enabled - and there is a
> > > > > mechanism for updating cpufinfo.max_freq in such cases.
> > > >
> > > > I don't see why we would want to change max capacity on the fly. It is
> > > > not a cheap operation as we would need to normalize the capacity for all
> > > > CPUs if the CPU(s) with capacity = 1024 changes its capacity. Worst case
> > > > we even have to rebuild the sched_domain hierarchy to update flags. The
> > > > update would also temporarily mess with load and utilization signals, so
> > > > not a cheap operation.
> > >
> > > I didn't say it was cheap. :-)
> >
> > You didn't :-) But I thought it was worth pointing out in case someone
> > would think we need to constantly renormalize to the highest achievable
> > performance level taking all factors into account, including thermal
> > capping.
> >
> > > However, boost frequencies are not disabled and enabled very often, so
> > > it may be acceptable to do it then.  I actually don't know.
> >
> > Agree.
> >
> > >
> > > The point is that if you set the max capacity to correspond to the max
> > > boosted perf and it is never reached (because boost is disabled), the
> > > scaling will cause CPUs to appear as underutilized, but in fact there
> > > is no spare capacity in the system.
> >
> > We kind of have the problem already with thermal capping but addressed
> > it by having the thermal pressure signal to indicate the some of the
> > capacity is unavailable. Perhaps the thermal pressure signal should be extended
> > to cover all reasons for capacity being unavailable, or we should have
> > another signal to track boost frequencies not being delivered, manually
> > disabled or not possible due to system circumstances?
> 
> Well, even without boost frequencies, the capacity that's effectively
> available may not be the advertised max.  For example,
> scaling_max_freq may be set below the advertised max value (and that's
> applied after the governor has produced its output), there may be
> power capping in place etc.
> 
> Taking the thermal pressure in particular into account helps to reduce
> it, but that may just be part of the difference between the advertised
> max and the effectively available perf, and not even the dominating
> one for that matter.
> 
> And boost frequencies complicate the picture even further, because
> they are more-or-less unsustainable and as a rule there's no
> information on how sustainable they are or how much time it takes to
> get to the max boost perf (and that may be configurable even).
> 
> So IMO the advertised max ought to be treated as the upper bound in
> general, but it makes sense to adjust it when it is known to be too
> large and it may stay so forever (which is the case when boost
> frequencies are disabled).

I agree that max performance level should be treated as upper bound.
Thermal capping help us somewhat to figure out the currently achievable
performance level. Removing disabled boost levels would help too. There
might still be quite a gap between requested and delivered performance
though.

> 
> > > Conversely, if the max capacity corresponds to the max non-boost perf
> > > and boost is used very often, the scaling will cause the CPUs to
> > > appear to be 100% loaded, but there may be still spare capacity in the
> > > system.
> >
> > It is even worse than that. Allowing delivered performance to exceed the
> > CPU capacity will break utilization scale invariance at it will make
> > per-task utilization appear smaller than it really is potentially
> > leading to wrong task placement.
> >
> > I think we have to ensure that the full performance range is visible to
> > the OS. If part of it is often unachievable we need to track the gap
> > between requested and delivered performance and somehow take that into
> > account when making task placement decisions.
> 
> I generally agree, but let me say that correlating what was asked for
> with the delivered perf need not be straightforward.

Yes, it won't necessarily be very accurate. I'm just wondering if we can
do better than only taking thermal capping and maybe disabled boost
levels into account? Tracking delivered vs requested performance avoids
the problem of having to deal explicitly with every kind of mechanism
that can reduce delivered performance. Delivered performance can't be
taken as a true upper limit for performance as it might change very
quickly.

Morten
diff mbox series

Patch

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 43407665918f..df818b439bc3 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -334,7 +334,7 @@  init_cpu_capacity_callback(struct notifier_block *nb,
 	cpumask_andnot(cpus_to_visit, cpus_to_visit, policy->related_cpus);
 
 	for_each_cpu(cpu, policy->related_cpus)
-		per_cpu(freq_factor, cpu) = policy->cpuinfo.max_freq / 1000;
+		per_cpu(freq_factor, cpu) = policy->max / 1000;
 
 	if (cpumask_empty(cpus_to_visit)) {
 		topology_normalize_cpu_scale();