
[1/1] cpufreq: pcc-cpufreq: Re-introduce deadband effect to reduce number of frequency changes

Message ID 20160819122101.GB17296@suselix.suse.de (mailing list archive)
State Not Applicable, archived

Commit Message

Andreas Herrmann Aug. 19, 2016, 12:21 p.m. UTC
Commit 6393d6a102 ("cpufreq: ondemand: Eliminate the deadband effect")
introduced a performance regression for systems using pcc-cpufreq with
the ondemand governor. This is measurable with different workloads,
e.g. the wall-clock time for a kernel compilation increased
significantly.

The elimination of the deadband effect significantly increased the
number of frequency changes with pcc-cpufreq.

Instead of reverting commit 6393d6a102, I suggest adding a workaround
in pcc-cpufreq that re-introduces the deadband effect for this driver
only, restoring the old performance behaviour of pcc-cpufreq with the
ondemand governor.

Below are some performance numbers for similar kernel compilations to
illustrate the effect of commit 6393d6a102 and the proposed fix.

Typical numbers from kernel compilation tests with a varying number of
compile jobs:

                     v4.8.0-rc2               4.8.0-rc2-pcc-cpufreq-deadband
 # of jobs    user     sys   elapsed   CPU     user     sys   elapsed   CPU
       2     440.39  116.49  4:33.35   203%   404.85  109.10  4:10.35   205%
       4     436.87  133.39  2:22.88   399%   381.83  128.00  2:06.84   401%
       8     475.49  157.68  1:22.24   769%   344.36  149.08  1:04.29   767%
      16     620.69  188.33  0:54.74  1477%   374.60  157.40  0:36.76  1447%
      32     815.79  209.58  0:37.22  2754%   490.46  160.22  0:24.87  2616%
      64     394.13   60.55  0:13.54  3355%   386.54   60.33  0:12.79  3493%
     120     398.24   61.55  0:14.60  3148%   390.44   61.19  0:13.07  3453%

(HP ProLiant DL580 Gen8 system, 60 CPUs @ 2.80GHz)

Link: http://marc.info/?l=linux-pm&m=147160912625600
Signed-off-by: Andreas Herrmann <aherrmann@suse.com>
---
 drivers/cpufreq/pcc-cpufreq.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

If this change is accepted, it might be a good idea to also tag it
for stable kernels, e.g. starting with v4.4.


Thanks,

Andreas

Comments

Viresh Kumar Aug. 29, 2016, 6:01 a.m. UTC | #1
On 19-08-16, 14:21, Andreas Herrmann wrote:
> 
> Commit 6393d6a102 (cpufreq: ondemand: Eliminate the deadband effect)
> introduced a performance regression for systems using pcc-cpufreq and
> ondemand governor. This is measurable with different workloads. E.g.
> wall-clock time for kernel compilation significantly increased.
> 
> The elimination of the deadband effect significantly increased the
> number of frequency changes with pcc-cpufreq.
> 
> Instead of reverting commit 6393d6a102 I suggest to add a workaround
> in pcc-cpufreq to re-introduce the deadband effect for this driver
> only - to restore the old performance behaviour with pcc-cpufreq with
> ondemand governor.
> 
> Following some performance numbers for similar kernel compilations to
> illustrate the effect of commit 6393d6a102 and the proposed fix.
> 
> Following typical numbers of kernel compilation tests with varying number of
> compile jobs:
> 
>                      v4.8.0-rc2               4.8.0-rc2-pcc-cpufreq-deadband
>  # of jobs    user     sys   elapsed   CPU     user     sys   elapsed   CPU
>        2     440.39  116.49  4:33.35   203%   404.85  109.10  4:10.35   205%
>        4     436.87  133.39  2:22.88   399%   381.83  128.00  2:06.84   401%
>        8     475.49  157.68  1:22.24   769%   344.36  149.08  1:04.29   767%
>       16     620.69  188.33  0:54.74  1477%   374.60  157.40  0:36.76  1447%
>       32     815.79  209.58  0:37.22  2754%   490.46  160.22  0:24.87  2616%
>       64     394.13   60.55  0:13.54  3355%   386.54   60.33  0:12.79  3493%
>      120     398.24   61.55  0:14.60  3148%   390.44   61.19  0:13.07  3453%
> 
> (HP ProLiant DL580 Gen8 system, 60 CPUs @ 2.80GHz)
> 
> Link: http://marc.info/?l=linux-pm&m=147160912625600
> Signed-off-by: Andreas Herrmann <aherrmann@suse.com>
> ---
>  drivers/cpufreq/pcc-cpufreq.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
> 
> If this change is accepted maybe it's a good idea to tag it also for
> stable kernels, e.g. starting with v4.4.

I am _really_ worried about such hacks in drivers to negate the
effect of a patch that was actually good.

Did you try to increase the sampling period of the ondemand governor
to see if that helps without this patch?

Also, it is important to understand why the performance is going down
when the original commit should have made it better. Is it only about
more transitions?
Andreas Herrmann Sept. 1, 2016, 1:21 p.m. UTC | #2
On Mon, Aug 29, 2016 at 11:31:53AM +0530, Viresh Kumar wrote:
> On 19-08-16, 14:21, Andreas Herrmann wrote:
> > 
> > Commit 6393d6a102 (cpufreq: ondemand: Eliminate the deadband effect)
> > introduced a performance regression for systems using pcc-cpufreq and
> > ondemand governor. This is measurable with different workloads. E.g.
> > wall-clock time for kernel compilation significantly increased.
> > 
> > The elimination of the deadband effect significantly increased the
> > number of frequency changes with pcc-cpufreq.
> > 
> > Instead of reverting commit 6393d6a102 I suggest to add a workaround
> > in pcc-cpufreq to re-introduce the deadband effect for this driver
> > only - to restore the old performance behaviour with pcc-cpufreq with
> > ondemand governor.
> > 
> > Following some performance numbers for similar kernel compilations to
> > illustrate the effect of commit 6393d6a102 and the proposed fix.
> > 
> > Following typical numbers of kernel compilation tests with varying number of
> > compile jobs:
> > 
> >                      v4.8.0-rc2               4.8.0-rc2-pcc-cpufreq-deadband
> >  # of jobs    user     sys   elapsed   CPU     user     sys   elapsed   CPU
> >        2     440.39  116.49  4:33.35   203%   404.85  109.10  4:10.35   205%
> >        4     436.87  133.39  2:22.88   399%   381.83  128.00  2:06.84   401%
> >        8     475.49  157.68  1:22.24   769%   344.36  149.08  1:04.29   767%
> >       16     620.69  188.33  0:54.74  1477%   374.60  157.40  0:36.76  1447%
> >       32     815.79  209.58  0:37.22  2754%   490.46  160.22  0:24.87  2616%
> >       64     394.13   60.55  0:13.54  3355%   386.54   60.33  0:12.79  3493%
> >      120     398.24   61.55  0:14.60  3148%   390.44   61.19  0:13.07  3453%
> > 
> > (HP ProLiant DL580 Gen8 system, 60 CPUs @ 2.80GHz)
> > 
> > Link: http://marc.info/?l=linux-pm&m=147160912625600
> > Signed-off-by: Andreas Herrmann <aherrmann@suse.com>
> > ---
> >  drivers/cpufreq/pcc-cpufreq.c | 20 ++++++++++++++++++++
> >  1 file changed, 20 insertions(+)
> > 
> > If this change is accepted maybe it's a good idea to tag it also for
> > stable kernels, e.g. starting with v4.4.

> I am _really_ worried about such hacks in drivers to negate the effect of a
> patch, that was actually good.

> Did you try to increase the sampling period of ondemand governor to see if that
> helps without this patch.

With an older kernel I've modified the transition_latency of the
driver, which in turn is used to calculate the sampling rate.

I started with the value returned as "nominal latency" for PCC. This
was 300000 ns on the test system and made things worse. I've tested
other values as well until I found a local optimum at 45000 ns, but
performance was still lower than with my hack applied.

> Also, it is important to understand why is the performance going
> down, while the original commit should have made it better.

My understanding is that the original commit was tested with certain
combinations of hardware and cpufreq drivers, and the claim was that
for those (two?) tested combinations performance increased and power
consumption was lower. So I am not so sure what to expect from all
other cpufreq-driver/hardware combinations.

> Is it only about more transitions ?

I think this is the main issue.

In an older kernel version I activated/added debug output in
__cpufreq_driver_target(). Of course this creates a huge amount of
messages. But with the original patch reverted it looked like this:

[   29.489677] cpufreq: target for CPU 0: 1760000 kHz (1200000 kHz), relation 2, requested 1760000 kHz
[   29.570364] cpufreq: target for CPU 0: 1216000 kHz (1760000 kHz), relation 2, requested 1216000 kHz
[   29.571055] cpufreq: target for CPU 1: 1200000 kHz (1148000 kHz), relation 0, requested 1200000 kHz
[   29.571483] cpufreq: target for CPU 1: 1200000 kHz (1200000 kHz), relation 2, requested 1200000 kHz
[   29.572042] cpufreq: target for CPU 2: 1200000 kHz (1064000 kHz), relation 0, requested 1200000 kHz
[   29.572503] cpufreq: target for CPU 2: 1200000 kHz (1200000 kHz), relation 2, requested 1200000 kHz

a lot of stuff, but the system could still handle it and booted to
the prompt.

With the original patch applied the system was really flooded and
eventually became unresponsive:

** 459 printk messages dropped ** [   29.838689] cpufreq: target for CPU 43: 1408000 kHz (2384000 kHz), relation 2, requested 1408000 kHz
** 480 printk messages dropped ** [   29.993849] cpufreq: target for CPU 54: 1200000 kHz (1248000 kHz), relation 2, requested 1200000 kHz
** 413 printk messages dropped ** [   30.113921] cpufreq: target for CPU 59: 2064000 kHz (1248000 kHz), relation 2, requested 2064000 kHz
** 437 printk messages dropped ** [   30.245846] cpufreq: target for CPU 21: 1296000 kHz (1296000 kHz), relation 2, requested 1296000 kHz
** 435 printk messages dropped ** [   30.397748] cpufreq: target for CPU 13: 1280000 kHz (2640000 kHz), relation 2, requested 1280000 kHz
** 480 printk messages dropped ** [   30.541846] cpufreq: target for CPU 58: 2112000 kHz (1632000 kHz), relation 2, requested 2112000 kHz


Of course, ideas to further debug this are welcome, as are
suggestions to fix the issue in another way.


Thanks,

Andreas
Viresh Kumar Sept. 7, 2016, 5:02 a.m. UTC | #3
On 01-09-16, 15:21, Andreas Herrmann wrote:
> On Mon, Aug 29, 2016 at 11:31:53AM +0530, Viresh Kumar wrote:

> > I am _really_ worried about such hacks in drivers to negate the effect of a
> > patch, that was actually good.
> 
> > Did you try to increase the sampling period of ondemand governor to see if that
> > helps without this patch.
> 
> With an older kernel I've modified transition_latency of the driver
> which in turn is used to calculate the sampling rate.

Naah, that isn't what I was looking for, sorry :(

To explain it a bit more, this is what the patch did.

Suppose your platform supports frequencies F1 (lowest), F2, F3, F4,
F5, F6 and F7 (highest). The cpufreq governor (ondemand), based on a
sampling rate and the system load, tries to change the frequency of
the underlying hardware and selects one of those.

Before the original patch came in, F2 and F3 were never getting
selected and the system was stuck in F1 for a long time, which
decreases performance for that period, as we really should have
switched to a higher frequency.

With the new patch we switch to the frequency proportional to the
current load.

The requests made by the cpufreq governor wouldn't have changed at all
with that, but the number of frequency changes at the hardware level
may actually change, as we might have been returning very quickly
whenever the target frequency evaluated to F1 in the earlier case.

That may lead to switching to frequencies F2 and F3, which wasn't
happening earlier.

I don't think the result of that should be too bad; we should have
performed better almost everywhere.
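
(Just to make the two mappings concrete, here is a small stand-alone
sketch -- not kernel code -- using a 1200000..2800000 kHz range like
the test system. The "old" calculation includes the clamping to
min_freq that produced the deadband.)

#include <stdio.h>

int main(void)
{
        const unsigned int min_f = 1200000, max_f = 2800000;
        unsigned int load, old_f, new_f;

        for (load = 0; load <= 100; load += 10) {
                /* before 6393d6a102: requests below min_f end up
                 * clamped to min_f -> the "deadband" */
                old_f = load * max_f / 100;
                if (old_f < min_f)
                        old_f = min_f;

                /* after 6393d6a102: proportional between min and max */
                new_f = min_f + load * (max_f - min_f) / 100;

                printf("load %3u%%: old %7u kHz, new %7u kHz\n",
                       load, old_f, new_f);
        }
        return 0;
}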

transition_latency is used only to initialize the sampling rate, but
that may get modified from userspace later on. Please tell us the
value read from the sysfs file sampling_rate present in:

/sys/devices/system/cpu/cpufreq/policy0/ondemand/

And try playing with that variable a bit to see if you can make things
better or not.
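
(Something along these lines, in case a tiny helper is easier than
poking sysfs by hand. The path below assumes global governor
tunables; with per-policy tunables it would live under
policy0/ondemand/ instead.)

#include <stdio.h>

int main(int argc, char **argv)
{
        const char *path =
                "/sys/devices/system/cpu/cpufreq/ondemand/sampling_rate";
        unsigned long rate;
        FILE *f = fopen(path, "r");

        /* print the current value */
        if (!f || fscanf(f, "%lu", &rate) != 1) {
                perror(path);
                return 1;
        }
        fclose(f);
        printf("sampling_rate: %lu\n", rate);

        /* optionally set a new one (as root), e.g. "./a.out 45000" */
        if (argc > 1) {
                f = fopen(path, "w");
                if (!f) {
                        perror(path);
                        return 1;
                }
                fprintf(f, "%s\n", argv[1]);
                fclose(f);
        }
        return 0;
}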

> I started with the value return as "nominal latency" for PCC.  This
> was 300000 ns on the test system and made things worse.  I've tested
> other values as well unitl I've found a local optimium at 45000ns but
> performance was lower in comparison to when I've applied my hack.

Can you try to use the kernel tracer (ftrace) and see how the
frequencies are getting changed and how often?

We need to understand the problem better, as I am not 100% sure what's
going on right now.

> > Also, it is important to understand why is the performance going
> > down, while the original commit should have made it better.
> 
> My understanding is that the original commit was tested with certain
> combinations of hardware and cpufreq-drivers and the claim was that
> for those (two?) tested combinations performance increased and power
> consumption was lower. So I am not so sure what to expect from all
> other cpufreq-driver/hardware combinations.

It was principally the right thing to do IMO. And I don't think any
other hardware should get affected badly. At most, the tuning needs
to be made a bit better.

> > Is it only about more transitions ?
> 
> I think this is the main issue.

Then it can be controlled with the sampling rate from userspace.

> In an older kernel version I activated/added debug output in
> __cpufreq_driver_target(). Of course this creates a huge amount of
> messages. But with original patch reverted it was like:
> 
> [   29.489677] cpufreq: target for CPU 0: 1760000 kHz (1200000 kHz), relation 2, requested 1760000 kHz
> [   29.570364] cpufreq: target for CPU 0: 1216000 kHz (1760000 kHz), relation 2, requested 1216000 kHz
> [   29.571055] cpufreq: target for CPU 1: 1200000 kHz (1148000 kHz), relation 0, requested 1200000 kHz
> [   29.571483] cpufreq: target for CPU 1: 1200000 kHz (1200000 kHz), relation 2, requested 1200000 kHz
> [   29.572042] cpufreq: target for CPU 2: 1200000 kHz (1064000 kHz), relation 0, requested 1200000 kHz
> [   29.572503] cpufreq: target for CPU 2: 1200000 kHz (1200000 kHz), relation 2, requested 1200000 kHz

Your platform is a bit special as it uses the ->target() callback and
not ->target_index(). And so you can pretty much switch to any
frequency.

Can you please read the values of all the sysfs files present in the
governor directory? That would be helpful. Maybe we can play with some
more files like up_threshold to see what the results are.

> a lot of stuff, but system could still handle it and booted to the
> prompt.
> 
> With the original patch applied the system was really flooded and
> eventually became unresponsive:
> 
> ** 459 printk messages dropped ** [   29.838689] cpufreq: target for CPU 43: 1408000 kHz (2384000 kHz), relation 2, requested 1408000 kHz
> ** 480 printk messages dropped ** [   29.993849] cpufreq: target for CPU 54: 1200000 kHz (1248000 kHz), relation 2, requested 1200000 kHz
> ** 413 printk messages dropped ** [   30.113921] cpufreq: target for CPU 59: 2064000 kHz (1248000 kHz), relation 2, requested 2064000 kHz
> ** 437 printk messages dropped ** [   30.245846] cpufreq: target for CPU 21: 1296000 kHz (1296000 kHz), relation 2, requested 1296000 kHz
> ** 435 printk messages dropped ** [   30.397748] cpufreq: target for CPU 13: 1280000 kHz (2640000 kHz), relation 2, requested 1280000 kHz
> ** 480 printk messages dropped ** [   30.541846] cpufreq: target for CPU 58: 2112000 kHz (1632000 kHz), relation 2, requested 2112000 kHz

This looks even more dangerous :)
Andreas Herrmann Sept. 13, 2016, 10:53 a.m. UTC | #4
On Wed, Sep 07, 2016 at 10:32:01AM +0530, Viresh Kumar wrote:
> On 01-09-16, 15:21, Andreas Herrmann wrote:
> > On Mon, Aug 29, 2016 at 11:31:53AM +0530, Viresh Kumar wrote:
> 
> > > I am _really_ worried about such hacks in drivers to negate the effect of a
> > > patch, that was actually good.
> > 
> > > Did you try to increase the sampling period of ondemand governor to see if that
> > > helps without this patch.
> > 
> > With an older kernel I've modified transition_latency of the driver
> > which in turn is used to calculate the sampling rate.
> 
> Naah, that isn't what I was looking for, sorry :(
> 
> To explain it a bit more, this is what the patch did.
> 
> Suppose, your platform supports frequencies: F1 (lowest), F2, F3, F4,
> F5, F6 and F7 (highest). The cpufreq governor (ondemand) based on a
> sampling rate and system load tries to change the frequency of the
> underlying hardware and select one of those.
> 
> Before the original patch came in, F2 and F3 were never getting
> selected and the system was stuck in F1 for a long time. Which will
> decrease the performance for that period of time as we should have
> switched to a higher frequency really.
> 
> With the new patch we switch to the frequency proportional to current
> load.
> 
> The requests made from cpufreq-governor wouldn't have changed at all
> with that, but the number of freq changes at hardware level may
> actually change as we might be returning very quickly if the target
> freq evaluated to F1 in the earlier case.
> 
> That may lead to switching to frequencies F2 and F3, which wasn't
> happening earlier.
> 
> I don't think the result of that should be too bad.. We should have
> performed better almost everywhere.
> 
> transition_latency is used only to initialize sampling rate, but that
> may get modified from userspace later on. Please tell us the value
> read from sysfs file sampling_rate present in:
> 
> /sys/devices/system/cpu/cpufreq/policy0/ondemand/

/sys/devices/system/cpu/cpufreq/ondemand contains the following values:

ignore_nice_load           : 0
io_is_busy                 : 1
min_sampling_rate          : 10000
powersave_bias             : 0
up_threshold               : 95
sampling_rate              : 10000
sampling_down_factor       : 1

FYI, for each CPU the directories /sys/devices/system/cpu/cpufreq/policy* look like:

scaling_cur_freq           : 1200000
cpuinfo_max_freq           : 2800000
cpuinfo_transition_latency : 0
cpuinfo_min_freq           : 1200000
cpuinfo_cur_freq           : 1204000
affected_cpus              : 0
scaling_governor           : ondemand
scaling_driver             : pcc-cpufreq
scaling_max_freq           : 2800000
related_cpus               : 0
scaling_setspeed           : <unsupported>
scaling_available_governors: ondemand performance 
scaling_min_freq           : 1200000

(which is for policy0 for CPU0)

> And try playing with that variable a bit to see if you can make things
> better or not.

Done that. Below are the elapsed times with the ondemand governor and
different values for the sampling rate (for each of 4.8-rc5
vanilla / modified with my patch / with commit 6393d6a reverted).

4.8-rc5 (vanilla)
                                                 sample_rate
           10000            20000            30000            45000            60000            90000
jobs  seconds  % CPU   seconds  % CPU   seconds  % CPU   seconds  % CPU   seconds  % CPU   seconds  % CPU
2     283.47   203.00  273.79   204.00  276.08   204.00  276.99   204.00  281.80   203.60  291.00   203.00 
4     146.92   399.20  140.98   399.80  141.55   399.80  140.91   400.00  142.91   400.00  146.49   399.80 
8      81.66   770.00   77.44   769.20   76.81   769.20   73.74   769.20   73.17   767.20   72.75   767.00 
16     54.31  1466.20   49.58  1458.40   48.37  1456.60   44.29  1447.80   42.89  1436.40   41.12  1416.80
32     36.71  2720.60   30.70  2672.80   29.22  2578.80   26.34  2563.20   25.57  2506.80   24.58  2457.80
64     13.99  3291.20   13.63  3295.40   13.55  3309.20   13.30  3354.60   13.66  3264.40   13.55  3288.60
120    14.84  3114.20   14.10  3215.00   14.57  3115.20   14.11  3209.80   14.17  3198.60   14.09  3220.80
------------------------------
4.8-rc5 re-intro deadband for pcc-cpufreq only (my patch)
                                                 sample_rate
           10000            20000            30000            45000            60000            90000
jobs  seconds  % CPU   seconds  % CPU   seconds  % CPU   seconds  % CPU   seconds  % CPU   seconds  % CPU
2     270.88   192.00  270.87   192.80  270.65   192.00  274.08   192.00  277.97   192.80  289.85   193.00 
4     140.23   379.20  138.94   379.80  138.20   378.80  138.99   379.20  140.81   379.60  145.83   380.40 
8      76.15   738.20   73.86   733.60   71.58   734.20   70.43   730.40   70.32   730.60   71.14   728.20 
16     47.39  1414.80   44.20  1395.60   43.02  1384.60   41.09  1368.00   40.37  1362.80   39.56  1351.20
32     32.78  2659.80   29.05  2531.80   27.18  2482.40   25.36  2471.60   24.55  2441.00   24.18  2365.80
64     13.34  3213.20   13.29  3204.60   13.38  3158.60   13.30  3162.40   13.57  3089.60   13.48  3125.20
120    13.89  3103.00   14.00  3047.40   13.87  3068.40   13.83  3077.80   13.93  3054.80   14.10  3033.60
------------------------------
4.8-rc5 reverted commit 6393d6a
                                                 sample_rate
           10000            20000            30000            45000            60000            90000
jobs  seconds  % CPU   seconds  % CPU   seconds  % CPU   seconds  % CPU   seconds  % CPU   seconds  % CPU
2     250.94   205.00  256.26   205.00  261.29   205.00  268.92   205.00  275.60   204.80  288.64   204.00 
4     127.20   401.20  129.63   401.80  131.80   402.00  135.41   401.20  138.88   401.20  144.96   401.00 
8      64.24   767.60   64.18   769.60   64.49   770.20   65.69   766.00   66.38   768.60   68.64   767.40 
16     37.02  1438.00   36.09  1432.20   36.10  1429.40   36.10  1424.60   36.61  1408.80   36.87  1405.60
32     24.73  2618.80   23.28  2533.80   23.00  2513.40   22.45  2526.00   22.92  2463.00   23.12  2422.20
64     12.62  3550.60   12.94  3439.60   12.95  3425.60   12.94  3421.80   13.29  3336.80   13.65  3257.80
120    13.30  3408.80   13.43  3346.40   13.44  3347.80   13.55  3322.60   13.76  3277.20   14.19  3192.40

> > I started with the value return as "nominal latency" for PCC.  This
> > was 300000 ns on the test system and made things worse.  I've tested
> > other values as well unitl I've found a local optimium at 45000ns but
> > performance was lower in comparison to when I've applied my hack.
> 
> Can you try to use kernel tracer (ftrace) and see how the frequencies
> are getting changed and at what frequency.

I am currently looking into this ...

> We need to understand the problem better, as I am not 100% sure what's
> going on right now.

... to gather related data.

> > > Also, it is important to understand why is the performance going
> > > down, while the original commit should have made it better.
> > 
> > My understanding is that the original commit was tested with certain
> > combinations of hardware and cpufreq-drivers and the claim was that
> > for those (two?) tested combinations performance increased and power
> > consumption was lower. So I am not so sure what to expect from all
> > other cpufreq-driver/hardware combinations.
> 
> It was principally the right thing to do IMO. And I don't think any
> other hardware should get affected badly. At the max, the tuning needs
> to be made a bit better.
> 
> > > Is it only about more transitions ?
> > 
> > I think this is the main issue.
> 
> Then it can be controlled with sampling rate from userspace.
> 
> > In an older kernel version I activated/added debug output in
> > __cpufreq_driver_target(). Of course this creates a huge amount of
> > messages. But with original patch reverted it was like:
> > 
> > [   29.489677] cpufreq: target for CPU 0: 1760000 kHz (1200000 kHz), relation 2, requested 1760000 kHz
> > [   29.570364] cpufreq: target for CPU 0: 1216000 kHz (1760000 kHz), relation 2, requested 1216000 kHz
> > [   29.571055] cpufreq: target for CPU 1: 1200000 kHz (1148000 kHz), relation 0, requested 1200000 kHz
> > [   29.571483] cpufreq: target for CPU 1: 1200000 kHz (1200000 kHz), relation 2, requested 1200000 kHz
> > [   29.572042] cpufreq: target for CPU 2: 1200000 kHz (1064000 kHz), relation 0, requested 1200000 kHz
> > [   29.572503] cpufreq: target for CPU 2: 1200000 kHz (1200000 kHz), relation 2, requested 1200000 kHz
> 
> Your platform is a bit special as it uses ->target() callback and not
> ->target_index(). And so you can pretty much switch to any frequency.
> 
> Can you please read value of all the sysfs files present in the
> governor directory? That would be helpful. Maybe we can play with some
> more files like: up_threshold to see what the results are.

For the sysfs values see above. I've not yet played with up_threshold
but plan to do so as well.

> > a lot of stuff, but system could still handle it and booted to the
> > prompt.
> > 
> > With the original patch applied the system was really flooded and
> > eventually became unresponsive:
> > 
> > ** 459 printk messages dropped ** [   29.838689] cpufreq: target for CPU 43: 1408000 kHz (2384000 kHz), relation 2, requested 1408000 kHz
> > ** 480 printk messages dropped ** [   29.993849] cpufreq: target for CPU 54: 1200000 kHz (1248000 kHz), relation 2, requested 1200000 kHz
> > ** 413 printk messages dropped ** [   30.113921] cpufreq: target for CPU 59: 2064000 kHz (1248000 kHz), relation 2, requested 2064000 kHz
> > ** 437 printk messages dropped ** [   30.245846] cpufreq: target for CPU 21: 1296000 kHz (1296000 kHz), relation 2, requested 1296000 kHz
> > ** 435 printk messages dropped ** [   30.397748] cpufreq: target for CPU 13: 1280000 kHz (2640000 kHz), relation 2, requested 1280000 kHz
> > ** 480 printk messages dropped ** [   30.541846] cpufreq: target for CPU 58: 2112000 kHz (1632000 kHz), relation 2, requested 2112000 kHz
> 
> This looks even more dangerous :)
> 
> -- 
> viresh


Thanks,

Andreas
Andreas Herrmann Sept. 14, 2016, 2:56 p.m. UTC | #5
On Wed, Sep 07, 2016 at 10:32:01AM +0530, Viresh Kumar wrote:
> On 01-09-16, 15:21, Andreas Herrmann wrote:

  ---8<---

> > I started with the value return as "nominal latency" for PCC.  This
> > was 300000 ns on the test system and made things worse.  I've tested
> > other values as well unitl I've found a local optimium at 45000ns but
> > performance was lower in comparison to when I've applied my hack.
> 
> Can you try to use kernel tracer (ftrace) and see how the frequencies
> are getting changed and at what frequency.

Below is some trace data. I hope it is of some help.

(A) - a 10 s sampling period while the system is idle
(B) - a 10 s sampling period while the system is partially loaded
      (kernel compilation using 2 jobs)

(1) 4.8-rc5
(2) 4.8-rc5 with my patch (re-introduction of the deadband effect
    within pcc-cpufreq)
(3) 4.8-rc5 with commit 6393d6a102 (cpufreq: ondemand: Eliminate
    the deadband effect) reverted

Let me know whether you are looking for other trace data wrt this
issue.


Thanks,

Andreas

---

(A)-(1)

 # Total Lost Samples: 0
 # Samples: 41  of event 'power:cpu_frequency'
 # Event count (approx.): 41
 # Overhead  Command       Shared Object     Symbol                       
 # ........  ............  ................  .............................
     39.02%  kworker/14:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
     29.27%  kworker/0:0   [kernel.vmlinux]  [k] cpufreq_notify_transition
     19.51%  kworker/10:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
      7.32%  kworker/5:2   [kernel.vmlinux]  [k] cpufreq_notify_transition
      2.44%  kworker/23:2  [kernel.vmlinux]  [k] cpufreq_notify_transition
      2.44%  kworker/40:1  [kernel.vmlinux]  [k] cpufreq_notify_transition

(A)-(2)

 # Total Lost Samples: 0
 # Samples: 6  of event 'power:cpu_frequency'
 # Event count (approx.): 6
 # Overhead  Command       Shared Object     Symbol                       
 # ........  ............  ................  .............................
     33.33%  kworker/1:2   [kernel.vmlinux]  [k] cpufreq_notify_transition
     16.67%  kworker/16:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
     16.67%  kworker/22:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
     16.67%  kworker/26:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
     16.67%  kworker/33:1  [kernel.vmlinux]  [k] cpufreq_notify_transition

(A)-(3)

 # Total Lost Samples: 0
 # Samples: 7  of event 'power:cpu_frequency'
 # Event count (approx.): 7
 # Overhead  Command       Shared Object     Symbol                       
 # ........  ............  ................  .............................
     28.57%  kworker/58:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
     14.29%  kworker/19:2  [kernel.vmlinux]  [k] cpufreq_notify_transition
     14.29%  kworker/20:2  [kernel.vmlinux]  [k] cpufreq_notify_transition
     14.29%  kworker/22:2  [kernel.vmlinux]  [k] cpufreq_notify_transition
     14.29%  kworker/23:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
     14.29%  kworker/35:1  [kernel.vmlinux]  [k] cpufreq_notify_transition

---

(B)-(1)

 # Total Lost Samples: 0
 # Samples: 2K of event 'power:cpu_frequency'
 # Event count (approx.): 2382
 # Overhead  Command       Shared Object     Symbol                       
 # ........  ............  ................  .............................
      5.75%  kworker/0:0   [kernel.vmlinux]  [k] cpufreq_notify_transition
      4.16%  kworker/12:2  [kernel.vmlinux]  [k] cpufreq_notify_transition
      3.11%  kworker/17:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
      2.94%  kworker/2:1   [kernel.vmlinux]  [k] cpufreq_notify_transition
      2.73%  kworker/19:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
      ...

(B)-(2)

 # Total Lost Samples: 0
 # Samples: 320  of event 'power:cpu_frequency'
 # Event count (approx.): 320
 # Overhead  Command       Shared Object     Symbol                       
 # ........  ............  ................  .............................
      4.69%  kworker/56:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
      4.06%  kworker/12:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
      4.06%  kworker/28:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
      4.06%  kworker/6:2   [kernel.vmlinux]  [k] cpufreq_notify_transition
      3.75%  kworker/32:2  [kernel.vmlinux]  [k] cpufreq_notify_transition
      ...

(B)-(3)

 # Total Lost Samples: 0
 # Samples: 333  of event 'power:cpu_frequency'
 # Event count (approx.): 333
 # Overhead  Command       Shared Object     Symbol                       
 # ........  ............  ................  .............................
      4.80%  kworker/51:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
      4.50%  kworker/39:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
      4.20%  kworker/47:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
      3.90%  kworker/59:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
      3.90%  kworker/7:2   [kernel.vmlinux]  [k] cpufreq_notify_transition
      ...

---

With (1), pcc-cpufreq tries to set pretty much every frequency even
if the system is idle; e.g. here is the start of the perf script
output from (A)-(1):

     kworker/5:2   904 [005]   255.778343: power:cpu_frequency: state=1216000 cpu_id=5
     kworker/5:2   904 [005]   255.794382: power:cpu_frequency: state=1264000 cpu_id=5
     kworker/5:2   904 [005]   256.102400: power:cpu_frequency: state=1200000 cpu_id=5
    kworker/10:1   171 [010]   258.010362: power:cpu_frequency: state=2224000 cpu_id=10
    kworker/10:1   171 [010]   258.026366: power:cpu_frequency: state=1264000 cpu_id=10
    kworker/10:1   171 [010]   258.594514: power:cpu_frequency: state=1200000 cpu_id=10
    kworker/10:1   171 [010]   258.618417: power:cpu_frequency: state=1232000 cpu_id=10
    kworker/10:1   171 [010]   258.634409: power:cpu_frequency: state=1264000 cpu_id=10
    kworker/10:1   171 [010]   258.674467: power:cpu_frequency: state=1200000 cpu_id=10
    kworker/10:1   171 [010]   258.730486: power:cpu_frequency: state=1216000 cpu_id=10
    kworker/40:1   388 [040]   258.730999: power:cpu_frequency: state=1200000 cpu_id=40
    kworker/23:2   775 [023]   258.731504: power:cpu_frequency: state=1200000 cpu_id=23
    kworker/14:1   178 [014]   258.732013: power:cpu_frequency: state=1216000 cpu_id=14
    kworker/10:1   171 [010]   258.906434: power:cpu_frequency: state=1200000 cpu_id=10
    kworker/14:1   178 [014]   258.970500: power:cpu_frequency: state=1200000 cpu_id=14
    kworker/14:1   178 [014]   258.998440: power:cpu_frequency: state=1232000 cpu_id=14
    kworker/14:1   178 [014]   259.034490: power:cpu_frequency: state=1200000 cpu_id=14
    kworker/14:1   178 [014]   259.095089: power:cpu_frequency: state=1216000 cpu_id=14
    kworker/14:1   178 [014]   259.270470: power:cpu_frequency: state=1200000 cpu_id=14
    ...
Andreas Herrmann Sept. 16, 2016, 9:47 a.m. UTC | #6
On Wed, Sep 07, 2016 at 10:32:01AM +0530, Viresh Kumar wrote:
> On 01-09-16, 15:21, Andreas Herrmann wrote:
> > On Mon, Aug 29, 2016 at 11:31:53AM +0530, Viresh Kumar wrote:

> > > I am _really_ worried about such hacks in drivers to negate the effect of a
> > > patch, that was actually good.
> > 
> > > Did you try to increase the sampling period of ondemand governor to see if that
> > > helps without this patch.
> > 
> > With an older kernel I've modified transition_latency of the driver
> > which in turn is used to calculate the sampling rate.

> Naah, that isn't what I was looking for, sorry :(

> To explain it a bit more, this is what the patch did.

> Suppose, your platform supports frequencies: F1 (lowest), F2, F3, F4,
> F5, F6 and F7 (highest). The cpufreq governor (ondemand) based on a
> sampling rate and system load tries to change the frequency of the
> underlying hardware and select one of those.

> Before the original patch came in, F2 and F3 were never getting
> selected and the system was stuck in F1 for a long time.

I think this is not a general statement. Such behaviour is not
common to all systems. Before commit 6393d6a the target frequency was
based on

   freq_next = load * policy->cpuinfo.max_freq / 100;

F2 would have been selected if

  load = F2 * 100 / F7

If F2 was not seen it can mean

(1) either the load value was not hit in practice during monitoring of
    a certain workload

(2) or the calculated load value (in integer representation) would
    select F1 or F3 (there is no corresponding integer value that
    would select F2)

E.g. for the Intel i7-3770 system mentioned in the commit message for
6393d6a, I think a load value of 49 should have selected 1700000,
which is not shown in the provided frequency table.

What essentially changed was how load values are mapped to target
frequencies. For the HP system (min_freq=1200000, max_freq=2800000)
that I used in my tests, the old code would create the following
mapping:

load | freq_next | used target frequency
Stratos Karafotis Sept. 16, 2016, 6:48 p.m. UTC | #7
Hi,

On 16/09/2016 12:47 μμ, Andreas Herrmann wrote:
> On Wed, Sep 07, 2016 at 10:32:01AM +0530, Viresh Kumar wrote:
>> On 01-09-16, 15:21, Andreas Herrmann wrote:
>>> On Mon, Aug 29, 2016 at 11:31:53AM +0530, Viresh Kumar wrote:
> 
>>>> I am _really_ worried about such hacks in drivers to negate the effect of a
>>>> patch, that was actually good.
>>>
>>>> Did you try to increase the sampling period of ondemand governor to see if that
>>>> helps without this patch.
>>>
>>> With an older kernel I've modified transition_latency of the driver
>>> which in turn is used to calculate the sampling rate.
> 
>> Naah, that isn't what I was looking for, sorry :(
> 
>> To explain it a bit more, this is what the patch did.
> 
>> Suppose, your platform supports frequencies: F1 (lowest), F2, F3, F4,
>> F5, F6 and F7 (highest). The cpufreq governor (ondemand) based on a
>> sampling rate and system load tries to change the frequency of the
>> underlying hardware and select one of those.
> 
>> Before the original patch came in, F2 and F3 were never getting
>> selected and the system was stuck in F1 for a long time.
> 
> I think this is not a general statement. Such a behaviour is not
> common to all systems. Before commit 6393d6a target frequency was
> based on
> 
>    freq_next = load * policy->cpuinfo.max_freq / 100;
> 
> F2 would have been selected if
> 
>   load = F2 * 100 / F7
> 
> If F2 was not seen it can mean
> 
> (1) either the load value was not hit in practice during monitoring of
>     a certain workload
> 
> (2) or the calculated load value (in integer representation) would
>     select F1 or F3 (there is no corresponding integer value that
>     would select F2)
> 
> E.g. for the Intel i7-3770 system mentioned in commit message for
> 6393d6a I think a load value of 49 should have selected 1700000 which
> is not shown in the provided frequency table.

I think this is not true, because before the specific patch the
relation used for frequency selection was CPUFREQ_RELATION_L. This is
the reason that a load value of 49 with a freq_next of 1666490 would
result in a target frequency of 1600000.

> 
> What essentially changed was how load values are mapped to target
> frequencies. For the HP system (min_freq=1200000, max_freq=2800000)
> that I used in my tests, the old code would create following mapping:
> 
> load | freq_next | used target frequency
> ________________________________________
> 0      0            1200000
> 10     280000       1200000
> 20     560000       1200000
> 30     840000       1200000
> 40     1120000      1200000
> 42     1176000      1200000
> 43     1204000      1204000
> 50     1400000      1400000
> 60     1680000      1680000
> 70     1960000      1960000
> 80     2240000      2240000
> 90     2520000      2520000
> 100    2800000      2800000
> 
> The new code (introduced with commit 6393d6a) changed the mapping as
> follows:
> 
> load | freq_next | used target frequency
> ________________________________________
> 0      1200000      1200000
> 10     1360000	    1360000
> 20     1520000	    1520000
> 30     1680000	    1680000
> 40     1840000	    1840000
> 42     1872000	    1872000
> 43     1888000	    1888000
> 50     2000000	    2000000
> 60     2160000	    2160000
> 70     2320000	    2320000
> 80     2480000	    2480000
> 90     2640000	    2640000
> 100    2800000	    2800000
> 
> My patch creates a third mapping. It basically ensures that up to a
> load value of 42 the minimum frequency is used.
> 
>> Which will decrease the performance for that period of time as we
>> should have switched to a higher frequency really.
> 
> I am not sure whether it's really useful for all systems using
> ondemand governor to increase frequency above min_freq even if load is
> just above 0. Of course expectation is that performance will be equal
> or better than before. But how overall power consumption changes
> depends on the hardware and its power saving capabilites.
> 
>   ---8<---
> 
>>> My understanding is that the original commit was tested with certain
>>> combinations of hardware and cpufreq-drivers and the claim was that
>>> for those (two?) tested combinations performance increased and power
>>> consumption was lower. So I am not so sure what to expect from all
>>> other cpufreq-driver/hardware combinations.
> 
>> It was principally the right thing to do IMO. And I don't think any
>> other hardware should get affected badly. At the max, the tuning needs
>> to be made a bit better.
> 
>   ---8<---
> 
> It seems that the decision how to best map load values to target
> frequencies is kind of hardware specific.
> 
> Maybe a solution to this is that the cpufreq driver should be able to
> provide a mapping function to overwrite the current default
> calculation.
> 

I'm not familiar with the pcc-cpufreq driver, but maybe patch 6393d6
just uncovered an "issue" that already existed, only at higher loads.

Because, with or without patch 6393d6, if the specific CPU doesn't
use a frequency table, there will be many frequency transitions at
higher loads too. I believe, though, that the side effect is smaller
at higher frequencies because CPUs tend to run at the lowest and
highest frequencies.

What about a patch in the pcc-cpufreq driver that permits frequency
changes only in specific steps and not at arbitrary values?
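
(Roughly something like this, just as a sketch. The helper name, the
step count and the rounding are arbitrary; DIV_ROUND_UP comes from
<linux/kernel.h>.)

static unsigned int pcc_quantize_freq(struct cpufreq_policy *policy,
                                      unsigned int target_freq)
{
        unsigned int min_f = policy->cpuinfo.min_freq;
        unsigned int max_f = policy->cpuinfo.max_freq;
        unsigned int step = (max_f - min_f) / 10;
        unsigned int freq;

        if (!step || target_freq <= min_f)
                return min_f;
        if (target_freq >= max_f)
                return max_f;

        /* round up to the next of ~10 step boundaries above min_f */
        freq = min_f + DIV_ROUND_UP(target_freq - min_f, step) * step;

        return freq > max_f ? max_f : freq;
}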


Regards,
Stratos
Andreas Herrmann Sept. 19, 2016, 4:16 p.m. UTC | #8
On Fri, Sep 16, 2016 at 09:58:42PM +0300, Stratos Karafotis wrote:
> Hi,
> 
> [ I 'm resending this message, because I think some recipients didn't receive
> it. ]
> 
> On 16/09/2016 12:47 μμ, Andreas Herrmann wrote:
> > On Wed, Sep 07, 2016 at 10:32:01AM +0530, Viresh Kumar wrote:
> >> On 01-09-16, 15:21, Andreas Herrmann wrote:
> >>> On Mon, Aug 29, 2016 at 11:31:53AM +0530, Viresh Kumar wrote:
> >
> >>>> I am _really_ worried about such hacks in drivers to negate the effect of
> a
> >>>> patch, that was actually good.
> >>>
> >>>> Did you try to increase the sampling period of ondemand governor to see if
> that
> >>>> helps without this patch.
> >>>
> >>> With an older kernel I've modified transition_latency of the driver
> >>> which in turn is used to calculate the sampling rate.
> >
> >> Naah, that isn't what I was looking for, sorry 
> >
> >> To explain it a bit more, this is what the patch did.
> >
> >> Suppose, your platform supports frequencies: F1 (lowest), F2, F3, F4,
> >> F5, F6 and F7 (highest). The cpufreq governor (ondemand) based on a
> >> sampling rate and system load tries to change the frequency of the
> >> underlying hardware and select one of those.
> >
> >> Before the original patch came in, F2 and F3 were never getting
> >> selected and the system was stuck in F1 for a long time.
> >
> > I think this is not a general statement. Such a behaviour is not
> > common to all systems. Before commit 6393d6a target frequency was
> > based on
> >
> >    freq_next = load * policy->cpuinfo.max_freq / 100;
> >
> > F2 would have been selected if
> >
> >   load = F2 * 100 / F7
> >
> > If F2 was not seen it can mean
> >
> > (1) either the load value was not hit in practice during monitoring of
> >     a certain workload
> >
> > (2) or the calculated load value (in integer representation) would
> >     select F1 or F3 (there is no corresponding integer value that
> >     would select F2)
> >
> > E.g. for the Intel i7-3770 system mentioned in commit message for
> > 6393d6a I think a load value of 49 should have selected 1700000 which
> > is not shown in the provided frequency table.
> 
> I think this is not true, because before the specific patch the relation
> of frequency selection was CPUFREQ_RELATION_L. This is the reason
> that a load value of 49 with a freq_next 1666490 would have a
> target frequency 1600000.

Hmm...
CPUFREQ_RELATION_L should select the "lowest frequency at or above
target", which is 1700000 in this case. Otherwise (if it selected the
"highest frequency at or below target") this would imply that load
values of 50, 51, 52 should select 1700000, which would contradict
what was written in the commit message of 6393d6a1.

In any case the probability of seeing such load values, and thus
selecting a frequency of 1700000, is quite low. So I fully understand
why the patch was introduced.

> > What essentially changed was how load values are mapped to target
> > frequencies. For the HP system (min_freq=1200000, max_freq=2800000)
> > that I used in my tests, the old code would create following mapping:
> >
> > load | freq_next | used target frequency
> > ________________________________________
> > 0      0            1200000
> > 10     280000       1200000
> > 20     560000       1200000
> > 30     840000       1200000
> > 40     1120000      1200000
> > 42     1176000      1200000
> > 43     1204000      1204000
> > 50     1400000      1400000
> > 60     1680000      1680000
> > 70     1960000      1960000
> > 80     2240000      2240000
> > 90     2520000      2520000
> > 100    2800000      2800000
> >
> > The new code (introduced with commit 6393d6a) changed the mapping as
> > follows:
> >
> > load | freq_next | used target frequency
> > ________________________________________
> > 0      1200000      1200000
> > 10     1360000    1360000
> > 20     1520000    1520000
> > 30     1680000    1680000
> > 40     1840000    1840000
> > 42     1872000    1872000
> > 43     1888000    1888000
> > 50     2000000    2000000
> > 60     2160000    2160000
> > 70     2320000    2320000
> > 80     2480000    2480000
> > 90     2640000    2640000
> > 100    2800000    2800000
> >
> > My patch creates a third mapping. It basically ensures that up to a
> > load value of 42 the minimum frequency is used.
> >
> >> Which will decrease the performance for that period of time as we
> >> should have switched to a higher frequency really.
> >
> > I am not sure whether it's really useful for all systems using
> > ondemand governor to increase frequency above min_freq even if load is
> > just above 0. Of course expectation is that performance will be equal
> > or better than before. But how overall power consumption changes
> > depends on the hardware and its power saving capabilites.
> >
> >   ---8<---
> >
> >>> My understanding is that the original commit was tested with certain
> >>> combinations of hardware and cpufreq-drivers and the claim was that
> >>> for those (two?) tested combinations performance increased and power
> >>> consumption was lower. So I am not so sure what to expect from all
> >>> other cpufreq-driver/hardware combinations.
> >
> >> It was principally the right thing to do IMO. And I don't think any
> >> other hardware should get affected badly. At the max, the tuning needs
> >> to be made a bit better.
> >
> >   ---8<---
> >
> > It seems that the decision how to best map load values to target
> > frequencies is kind of hardware specific.
> >
> > Maybe a solution to this is that the cpufreq driver should be able to
> > provide a mapping function to overwrite the current default
> > calculation.
> >
> 
> I'm not familiar with ppc-cpufreq drive but maybe patch 6393d6 just
> uncovered an "issue" that was already existed but only on higher loads.
>
> Because, with or without patch 6393d6, if the specific CPU doesn't
> use a frequency table, there will many frequency transitions in
> higher loads too. I believe, though, that the side effect it's smaller
> in higher frequencies because CPUs tend to work on lowest and highest
> frequencies.

Might be. I didn't test this specifically.

> What about a patch in ppc-cpufreq driver that permits frequency
> changes only in specific steps and not in arbitrary values?

Which steps would you use? What scheme would be universally usable
for all affected systems using this driver?

I had played with an approach that only makes use of min_freq and
max_freq, which eventually didn't result in better performance in
comparison to the code before commit 6393d6.


Andreas

> Regards,
> Stratos

  ---8<---
Stratos Karafotis Sept. 19, 2016, 7:39 p.m. UTC | #9
On Mon, Sep 19, 2016 at 7:16 PM, Andreas Herrmann <aherrmann@suse.com> wrote:
> On Fri, Sep 16, 2016 at 09:58:42PM +0300, Stratos Karafotis wrote:
>> Hi,
>>
>> [ I 'm resending this message, because I think some recipients didn't receive
>> it. ]
>>
>> On 16/09/2016 12:47 μμ, Andreas Herrmann wrote:
>> > On Wed, Sep 07, 2016 at 10:32:01AM +0530, Viresh Kumar wrote:
>> >> On 01-09-16, 15:21, Andreas Herrmann wrote:
>> >>> On Mon, Aug 29, 2016 at 11:31:53AM +0530, Viresh Kumar wrote:
>> >
>> >>>> I am _really_ worried about such hacks in drivers to negate the effect of
>> a
>> >>>> patch, that was actually good.
>> >>>
>> >>>> Did you try to increase the sampling period of ondemand governor to see if
>> that
>> >>>> helps without this patch.
>> >>>
>> >>> With an older kernel I've modified transition_latency of the driver
>> >>> which in turn is used to calculate the sampling rate.
>> >
>> >> Naah, that isn't what I was looking for, sorry
>> >
>> >> To explain it a bit more, this is what the patch did.
>> >
>> >> Suppose, your platform supports frequencies: F1 (lowest), F2, F3, F4,
>> >> F5, F6 and F7 (highest). The cpufreq governor (ondemand) based on a
>> >> sampling rate and system load tries to change the frequency of the
>> >> underlying hardware and select one of those.
>> >
>> >> Before the original patch came in, F2 and F3 were never getting
>> >> selected and the system was stuck in F1 for a long time.
>> >
>> > I think this is not a general statement. Such a behaviour is not
>> > common to all systems. Before commit 6393d6a target frequency was
>> > based on
>> >
>> >    freq_next = load * policy->cpuinfo.max_freq / 100;
>> >
>> > F2 would have been selected if
>> >
>> >   load = F2 * 100 / F7
>> >
>> > If F2 was not seen it can mean
>> >
>> > (1) either the load value was not hit in practice during monitoring of
>> >     a certain workload
>> >
>> > (2) or the calculated load value (in integer representation) would
>> >     select F1 or F3 (there is no corresponding integer value that
>> >     would select F2)
>> >
>> > E.g. for the Intel i7-3770 system mentioned in commit message for
>> > 6393d6a I think a load value of 49 should have selected 1700000 which
>> > is not shown in the provided frequency table.
>>
>> I think this is not true, because before the specific patch the relation
>> of frequency selection was CPUFREQ_RELATION_L. This is the reason
>> that a load value of 49 with a freq_next 1666490 would have a
>> target frequency 1600000.
>
> Hmm...
> CPUFREQ_RELATION_L should select "lowest frequency at or above target"
> being 1700000 in this case. Otherwise (if it would select "highest
> frequency at or below target") this would imply that load values of
> 50, 51, 52 should select 1700000 which would contradict what was
> written in commit message of 6393d6a1.

Yes, you are right. I'm sorry for the noise.

> In any case probability of seeing such load values and thus selecting
> a frequency of 1700000 is quite low. So I fully understand why the
> patch was introduced.

>> > What essentially changed was how load values are mapped to target
>> > frequencies. For the HP system (min_freq=1200000, max_freq=2800000)
>> > that I used in my tests, the old code would create following mapping:
>> >
>> > load | freq_next | used target frequency
>> > ________________________________________
>> > 0      0            1200000
>> > 10     280000       1200000
>> > 20     560000       1200000
>> > 30     840000       1200000
>> > 40     1120000      1200000
>> > 42     1176000      1200000
>> > 43     1204000      1204000
>> > 50     1400000      1400000
>> > 60     1680000      1680000
>> > 70     1960000      1960000
>> > 80     2240000      2240000
>> > 90     2520000      2520000
>> > 100    2800000      2800000
>> >
>> > The new code (introduced with commit 6393d6a) changed the mapping as
>> > follows:
>> >
>> > load | freq_next | used target frequency
>> > ________________________________________
>> > 0      1200000      1200000
>> > 10     1360000    1360000
>> > 20     1520000    1520000
>> > 30     1680000    1680000
>> > 40     1840000    1840000
>> > 42     1872000    1872000
>> > 43     1888000    1888000
>> > 50     2000000    2000000
>> > 60     2160000    2160000
>> > 70     2320000    2320000
>> > 80     2480000    2480000
>> > 90     2640000    2640000
>> > 100    2800000    2800000
>> >
>> > My patch creates a third mapping. It basically ensures that up to a
>> > load value of 42 the minimum frequency is used.
>> >
>> >> Which will decrease the performance for that period of time as we
>> >> should have switched to a higher frequency really.
>> >
>> > I am not sure whether it's really useful for all systems using
>> > ondemand governor to increase frequency above min_freq even if load is
>> > just above 0. Of course expectation is that performance will be equal
>> > or better than before. But how overall power consumption changes
>> > depends on the hardware and its power saving capabilites.
>> >
>> >   ---8<---
>> >
>> >>> My understanding is that the original commit was tested with certain
>> >>> combinations of hardware and cpufreq-drivers and the claim was that
>> >>> for those (two?) tested combinations performance increased and power
>> >>> consumption was lower. So I am not so sure what to expect from all
>> >>> other cpufreq-driver/hardware combinations.
>> >
>> >> It was principally the right thing to do IMO. And I don't think any
>> >> other hardware should get affected badly. At the max, the tuning needs
>> >> to be made a bit better.
>> >
>> >   ---8<---
>> >
>> > It seems that the decision how to best map load values to target
>> > frequencies is kind of hardware specific.
>> >
>> > Maybe a solution to this is that the cpufreq driver should be able to
>> > provide a mapping function to overwrite the current default
>> > calculation.
>> >
>>
>> I'm not familiar with ppc-cpufreq drive but maybe patch 6393d6 just
>> uncovered an "issue" that was already existed but only on higher loads.
>>
>> Because, with or without patch 6393d6, if the specific CPU doesn't
>> use a frequency table, there will many frequency transitions in
>> higher loads too. I believe, though, that the side effect it's smaller
>> in higher frequencies because CPUs tend to work on lowest and highest
>> frequencies.
>
> Might be. I didn't test this specifically.
>
>> What about a patch in ppc-cpufreq driver that permits frequency
>> changes only in specific steps and not in arbitrary values?
>
> Which steps would you use? What scheme would be universal usable for
> all affected system using this driver?

Just an idea. I would split the frequency range (max_freq - min_freq)
into ~10 steps. But I'm not familiar with the affected systems and
of course I can't prove this is an ideal approach.

> I had played with an approach to only make use of min_freq and
> max_freq which eventually didn't result in better performance
> in comparison to code before commit 6393d6.
>


Regards,
Stratos
Andreas Herrmann Sept. 22, 2016, 5:54 p.m. UTC | #10
On Mon, Sep 19, 2016 at 10:39:18PM +0300, Stratos Karafotis wrote:
> On Mon, Sep 19, 2016 at 7:16 PM, Andreas Herrmann <aherrmann@suse.com> wrote:
> > On Fri, Sep 16, 2016 at 09:58:42PM +0300, Stratos Karafotis wrote:
> >> Hi,
> >>
> >> [ I 'm resending this message, because I think some recipients didn't receive
> >> it. ]
> >>
> >> On 16/09/2016 12:47 μμ, Andreas Herrmann wrote:
> >> > On Wed, Sep 07, 2016 at 10:32:01AM +0530, Viresh Kumar wrote:
> >> >> On 01-09-16, 15:21, Andreas Herrmann wrote:
> >> >>> On Mon, Aug 29, 2016 at 11:31:53AM +0530, Viresh Kumar wrote:

  ---8<---

> >> > It seems that the decision how to best map load values to target
> >> > frequencies is kind of hardware specific.
> >> >
> >> > Maybe a solution to this is that the cpufreq driver should be able to
> >> > provide a mapping function to overwrite the current default
> >> > calculation.

FYI, I've created new patches to address the issue.

The first one introduces a map_load_to_freq function, the default
being what commit 6393d6 introduced (no deadband). The second patch
then introduces a specific function for pcc-cpufreq to fall back to
what was used before commit 6393d6.
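
(The shape I have in mind is roughly the following; just a sketch,
the function names and where the hook ends up may still change.)

/* default: the current no-deadband mapping of load (0..100) to freq */
static unsigned int od_map_load_to_freq_default(struct cpufreq_policy *policy,
                                                unsigned int load)
{
        unsigned int min_f = policy->cpuinfo.min_freq;
        unsigned int max_f = policy->cpuinfo.max_freq;

        /* what commit 6393d6 introduced */
        return min_f + load * (max_f - min_f) / 100;
}

/* pcc-cpufreq override: pre-6393d6 mapping, i.e. deadband below min_freq */
static unsigned int pcc_map_load_to_freq(struct cpufreq_policy *policy,
                                         unsigned int load)
{
        unsigned int freq = load * policy->cpuinfo.max_freq / 100;

        return freq < policy->cpuinfo.min_freq ?
                policy->cpuinfo.min_freq : freq;
}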

I just want to assemble the gathered performance data and I am
planning to send those patches tomorrow.

> >> I'm not familiar with ppc-cpufreq drive but maybe patch 6393d6 just
> >> uncovered an "issue" that was already existed but only on higher loads.
> >>
> >> Because, with or without patch 6393d6, if the specific CPU doesn't
> >> use a frequency table, there will many frequency transitions in
> >> higher loads too. I believe, though, that the side effect it's smaller
> >> in higher frequencies because CPUs tend to work on lowest and highest
> >> frequencies.
> >
> > Might be. I didn't test this specifically.

Hopefully I'll also find time to gather some ftrace data wrt this.

> >> What about a patch in the pcc-cpufreq driver that permits frequency
> >> changes only in specific steps and not at arbitrary values?
> >
> > Which steps would you use? What scheme would be universally usable for
> > all affected systems using this driver?
> 
> Just an idea. I would split the frequency range (max_freq - min_freq)
> into ~10 steps. But I'm not familiar with the affected systems and
> of course I can't prove this is an ideal approach.

I've modified the pcc-cpufreq-specific map_load_to_freq function to do
just that (map load values to 10 discrete frequency values) instead of
falling back to the deadband (pre-commit-6393d6) behaviour.

Unfortunately this resulted in lower performance compared to the
pre-commit-6393d6 version.
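
For reference, the mapping I tried was roughly along these lines (a
sketch of the idea only, not the exact code; the step count and the
rounding are arbitrary choices):

#include <linux/cpufreq.h>
#include <linux/kernel.h>

#define PCC_NR_STEPS	10

/* Sketch: map load (0..100) onto ~10 evenly spaced frequencies
 * between min_freq and max_freq instead of arbitrary kHz values. */
static unsigned int pcc_map_load_to_steps(struct cpufreq_policy *policy,
					  unsigned int load)
{
	unsigned int min_f = policy->cpuinfo.min_freq;
	unsigned int max_f = policy->cpuinfo.max_freq;
	unsigned int step = (max_f - min_f) / PCC_NR_STEPS;
	unsigned int idx;

	/* Round the load up into one of PCC_NR_STEPS buckets; a load
	 * of e.g. 37% requests the 4th step above min_freq. */
	idx = DIV_ROUND_UP(load * PCC_NR_STEPS, 100);

	return min(min_f + idx * step, max_f);
}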

> > I had played with an approach that only makes use of min_freq and
> > max_freq, which eventually didn't result in better performance
> > compared to the code before commit 6393d6.
> 
> Regards,
> Stratos


Regards,

Andreas
Viresh Kumar Oct. 5, 2016, 5:17 a.m. UTC | #11
Sorry for being late, Andreas :(

On 14-09-16, 16:56, Andreas Herrmann wrote:

First of all, thanks for your hard work in getting these numbers out.
Really appreciate it.

> Below is some trace data. I hope it is of some help.
> 
> (A) - sampling 10s period when system is idle
> (B) - sampling 10s period when system partially loaded (kernel
>       compilation using 2 jobs)
> 
> (1) 4.8-rc5
> (2) 4.8-rc5 with my patch (reintro of deadband effect within
>     pcc-cpufreq)
> (3) 4.8-rc5 with reversal of 6393d6a102 (cpufreq: ondemand: Eliminate
>     the deadband effect)
> 
> Let me know whether you are looking for other trace data wrt this
> issue.
> 
> 
> Thanks,
> 
> Andreas
> 
> ---
> 
> (A)-(1)
> 
>  # Total Lost Samples: 0
>  # Samples: 41  of event 'power:cpu_frequency'
>  # Event count (approx.): 41
>  # Overhead  Command       Shared Object     Symbol                       
>  # ........  ............  ................  .............................
>      39.02%  kworker/14:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
>      29.27%  kworker/0:0   [kernel.vmlinux]  [k] cpufreq_notify_transition
>      19.51%  kworker/10:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
>       7.32%  kworker/5:2   [kernel.vmlinux]  [k] cpufreq_notify_transition
>       2.44%  kworker/23:2  [kernel.vmlinux]  [k] cpufreq_notify_transition
>       2.44%  kworker/40:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
> 
> (A)-(2)
> 
>  # Total Lost Samples: 0
>  # Samples: 6  of event 'power:cpu_frequency'
>  # Event count (approx.): 6
>  # Overhead  Command       Shared Object     Symbol                       
>  # ........  ............  ................  .............................
>      33.33%  kworker/1:2   [kernel.vmlinux]  [k] cpufreq_notify_transition
>      16.67%  kworker/16:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
>      16.67%  kworker/22:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
>      16.67%  kworker/26:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
>      16.67%  kworker/33:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
> 
> (A)-(3)
> 
>  # Total Lost Samples: 0
>  # Samples: 7  of event 'power:cpu_frequency'
>  # Event count (approx.): 7
>  # Overhead  Command       Shared Object     Symbol                       
>  # ........  ............  ................  .............................
>      28.57%  kworker/58:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
>      14.29%  kworker/19:2  [kernel.vmlinux]  [k] cpufreq_notify_transition
>      14.29%  kworker/20:2  [kernel.vmlinux]  [k] cpufreq_notify_transition
>      14.29%  kworker/22:2  [kernel.vmlinux]  [k] cpufreq_notify_transition
>      14.29%  kworker/23:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
>      14.29%  kworker/35:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
> 
> ---
> 
> (B)-(1)
> 
>  # Total Lost Samples: 0
>  # Samples: 2K of event 'power:cpu_frequency'
>  # Event count (approx.): 2382
>  # Overhead  Command       Shared Object     Symbol                       
>  # ........  ............  ................  .............................
>       5.75%  kworker/0:0   [kernel.vmlinux]  [k] cpufreq_notify_transition
>       4.16%  kworker/12:2  [kernel.vmlinux]  [k] cpufreq_notify_transition
>       3.11%  kworker/17:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
>       2.94%  kworker/2:1   [kernel.vmlinux]  [k] cpufreq_notify_transition
>       2.73%  kworker/19:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
>       ...
> 
> (B)-(2)
> 
>  # Total Lost Samples: 0
>  # Samples: 320  of event 'power:cpu_frequency'
>  # Event count (approx.): 320
>  # Overhead  Command       Shared Object     Symbol                       
>  # ........  ............  ................  .............................
>       4.69%  kworker/56:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
>       4.06%  kworker/12:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
>       4.06%  kworker/28:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
>       4.06%  kworker/6:2   [kernel.vmlinux]  [k] cpufreq_notify_transition
>       3.75%  kworker/32:2  [kernel.vmlinux]  [k] cpufreq_notify_transition
>       ...
> 
> (B)-(3)
> 
>  # Total Lost Samples: 0
>  # Samples: 333  of event 'power:cpu_frequency'
>  # Event count (approx.): 333
>  # Overhead  Command       Shared Object     Symbol                       
>  # ........  ............  ................  .............................
>       4.80%  kworker/51:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
>       4.50%  kworker/39:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
>       4.20%  kworker/47:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
>       3.90%  kworker/59:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
>       3.90%  kworker/7:2   [kernel.vmlinux]  [k] cpufreq_notify_transition
>       ...

I am worried by all of these. So, it's not the DVFS transition that
takes the time, but cpufreq_notify_transition(). And probably one of
the drivers in your setup that has registered with
cpufreq_register_notifier() is screwing up here.

Can you please take a look at that? Just check which routines are
getting called as part of srcu_notifier_call_chain().

Also, a simple (hacky) solution to the problem you are seeing is to
divide the frequency range into steps (which you have already done,
AFAICT, in one of your patches) and leave out the frequencies that
were part of the deadband earlier. That will keep the CPU at either
the low frequency or some higher frequencies.

But I am quite sure that we have an abuser here, and we had better
find it now.
Viresh Kumar Oct. 5, 2016, 5:21 a.m. UTC | #12
On 22-09-16, 19:54, Andreas Herrmann wrote:
> On Mon, Sep 19, 2016 at 10:39:18PM +0300, Stratos Karafotis wrote:
> > Just an idea. I would split the frequency range (max_freq - min_freq)
> > into ~10 steps. But I'm not familiar with the affected systems and
> > of course I can't prove this is an ideal approach.
> 
> I've modified the pcc-cpufreq-specific map_load_to_freq function to do
> just that (map load values to 10 discrete frequency values) instead of
> falling back to the deadband (pre-commit-6393d6) behaviour.

And in that case you can remove the deadband frequencies from the table
to get almost the same results, as I already stated in an earlier email.
Andreas Herrmann Oct. 11, 2016, 6:28 a.m. UTC | #13
On Wed, Oct 05, 2016 at 10:47:53AM +0530, Viresh Kumar wrote:
> Sorry for being late, Andreas :(
> 
> On 14-09-16, 16:56, Andreas Herrmann wrote:
> 
> First of all, thanks for your hard work in getting these numbers out.
> Really appreciate it.
> 
> > Below is some trace data. I hope it is of some help.
> > 
> > (A) - sampling 10s period when system is idle
> > (B) - sampling 10s period when system partially loaded (kernel
> >       compilation using 2 jobs)
> > 
> > (1) 4.8-rc5
> > (2) 4.8-rc5 with my patch (reintro of deadband effect within
> >     pcc-cpufreq)
> > (3) 4.8-rc5 with reversal of 6393d6a102 (cpufreq: ondemand: Eliminate
> >     the deadband effect)
> > 
> > Let me know whether you are looking for other trace data wrt this
> > issue.
> > 
> > 
> > Thanks,
> > 
> > Andreas
> > 
> > 
> >   ---8<---
> 
> I am worried by all of these. So, it's not the DVFS transition that
> takes the time, but cpufreq_notify_transition(). And probably one of
> the drivers in your setup that has registered with
> cpufreq_register_notifier() is screwing up here.
> 
> Can you please take a look at that? Just check which routines are
> getting called as part of srcu_notifier_call_chain().

I'll look into this.
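
(Something along these lines should do for a first check -- a
temporary debug hack, sketched here and not tested, dropped into
drivers/cpufreq/cpufreq.c where cpufreq_transition_notifier_list is
visible; the function name is made up:)

/* Debug sketch: print the callbacks currently registered on the
 * cpufreq transition notifier chain.  %ps resolves the symbol name
 * from the function pointer. */
static void debug_dump_transition_notifiers(void)
{
	struct notifier_block *nb;
	int idx;

	idx = srcu_read_lock(&cpufreq_transition_notifier_list.srcu);
	for (nb = srcu_dereference(cpufreq_transition_notifier_list.head,
				   &cpufreq_transition_notifier_list.srcu);
	     nb;
	     nb = srcu_dereference(nb->next,
				   &cpufreq_transition_notifier_list.srcu))
		pr_info("cpufreq transition notifier: %ps\n",
			nb->notifier_call);
	srcu_read_unlock(&cpufreq_transition_notifier_list.srcu, idx);
}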

> Also, a simple (hacky) solution to the problem you are seeing is to
> divide the frequency range into steps (which you have already done,
> AFAICT, in one of your patches) and leave out the frequencies that
> were part of the deadband earlier. That will keep the CPU at either
> the low frequency or some higher frequencies.

Yes, that seems worth trying.
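
(Combined with the discrete steps, the snapping could look roughly
like this -- the helper and the choice of ~10 steps are hypothetical,
and deadband_limit would be the limit computed as in the patch below:)

#include <linux/cpufreq.h>
#include <linux/kernel.h>

/* Sketch: snap the governor's request to either policy->min (if it
 * falls into the old deadband) or to one of ~10 discrete steps above
 * the deadband limit, so deadband frequencies are never requested. */
static unsigned int pcc_snap_target_freq(struct cpufreq_policy *policy,
					 unsigned int target_freq,
					 unsigned int deadband_limit)
{
	unsigned int max_f = policy->cpuinfo.max_freq;
	unsigned int step = (max_f - deadband_limit) / 10;
	unsigned int idx;

	if (target_freq < deadband_limit)
		return policy->min;
	if (!step)			/* degenerate range */
		return max_f;

	idx = DIV_ROUND_UP(target_freq - deadband_limit, step);
	return min(deadband_limit + idx * step, max_f);
}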

> But I am quite sure that we have an abuser here, and we had better
> find it now.


Thanks,

Andreas

diff mbox

Patch

diff --git a/drivers/cpufreq/pcc-cpufreq.c b/drivers/cpufreq/pcc-cpufreq.c
index 3f0ce2a..d2e2963 100644
--- a/drivers/cpufreq/pcc-cpufreq.c
+++ b/drivers/cpufreq/pcc-cpufreq.c
@@ -200,10 +200,26 @@  static int pcc_cpufreq_target(struct cpufreq_policy *policy,
 {
 	struct pcc_cpu *pcc_cpu_data;
 	struct cpufreq_freqs freqs;
+	static u32 limit = 0;
+
 	u16 status;
 	u32 input_buffer;
 	int cpu;
 
+	if (!limit) {
+		u32 f_min = policy->cpuinfo.min_freq / 1000;
+		u32 f_max = policy->cpuinfo.max_freq / 1000;
+		limit = (f_max - f_min) * f_min;
+		limit /= f_max;
+		limit *= 1000;
+		limit += f_min * 1000;
+		pr_debug("pcc-cpufreq: setting deadband limit to %u kHz\n",
+			limit);
+	}
+
+	if (target_freq < limit)
+		target_freq = policy->min;
+
 	cpu = policy->cpu;
 	pcc_cpu_data = per_cpu_ptr(pcc_cpu_info, cpu);
 
@@ -214,6 +230,10 @@  static int pcc_cpufreq_target(struct cpufreq_policy *policy,
 
 	freqs.old = policy->cur;
 	freqs.new = target_freq;
+
+	if (freqs.new == freqs.old)
+		return 0;
+
 	cpufreq_freq_transition_begin(policy, &freqs);
 	spin_lock(&pcc_lock);