Message ID | 20160819122101.GB17296@suselix.suse.de (mailing list archive)
---|---
State | Not Applicable, archived
On 19-08-16, 14:21, Andreas Herrmann wrote:
>
> Commit 6393d6a102 (cpufreq: ondemand: Eliminate the deadband effect)
> introduced a performance regression for systems using pcc-cpufreq and
> ondemand governor. This is measurable with different workloads. E.g.
> wall-clock time for kernel compilation significantly increased.
>
> The elimination of the deadband effect significantly increased the
> number of frequency changes with pcc-cpufreq.
>
> Instead of reverting commit 6393d6a102 I suggest to add a workaround
> in pcc-cpufreq to re-introduce the deadband effect for this driver
> only - to restore the old performance behaviour with pcc-cpufreq with
> ondemand governor.
>
> Following some performance numbers for similar kernel compilations to
> illustrate the effect of commit 6393d6a102 and the proposed fix.
>
> Following typical numbers of kernel compilation tests with varying number of
> compile jobs:
>
>                     v4.8.0-rc2                        4.8.0-rc2-pcc-cpufreq-deadband
> # of jobst  user    sys     elapsed  CPU      user    sys     elapsed  CPU
>    2        440.39  116.49  4:33.35   203%    404.85  109.10  4:10.35   205%
>    4        436.87  133.39  2:22.88   399%    381.83  128.00  2:06.84   401%
>    8        475.49  157.68  1:22.24   769%    344.36  149.08  1:04.29   767%
>   16        620.69  188.33  0:54.74  1477%    374.60  157.40  0:36.76  1447%
>   32        815.79  209.58  0:37.22  2754%    490.46  160.22  0:24.87  2616%
>   64        394.13   60.55  0:13.54  3355%    386.54   60.33  0:12.79  3493%
>  120        398.24   61.55  0:14.60  3148%    390.44   61.19  0:13.07  3453%
>
> (HP ProLiant DL580 Gen8 system, 60 CPUs @ 2.80GHz)
>
> Link: http://marc.info/?l=linux-pm&m=147160912625600
> Signed-off-by: Andreas Herrmann <aherrmann@suse.com>
> ---
>  drivers/cpufreq/pcc-cpufreq.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
>
> If this change is accepted maybe it's a good idea to tag it also for
> stable kernels, e.g. starting with v4.4.

I am _really_ worried about such hacks in drivers to negate the effect of a
patch, that was actually good.

Did you try to increase the sampling period of ondemand governor to see if that
helps without this patch.

Also, it is important to understand why is the performance going down, while
the original commit should have made it better.

Is it only about more transitions ?
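The patch itself is not reproduced in this archive. As a rough illustration of the approach being discussed (re-introducing the deadband only inside pcc-cpufreq), a driver-side clamp could look like the sketch below; the function name and the exact back-mapping are assumptions, not the code Andreas posted.

```c
/*
 * Illustrative sketch only, not the patch from this thread.  The idea is to
 * undo ondemand's post-6393d6a102 mapping (min + load * (max - min) / 100)
 * inside the driver's ->target() path and clamp anything the old
 * "load * max / 100" formula would have put below min_freq back down to
 * min_freq, restoring the deadband for pcc-cpufreq only.
 */
#include <linux/cpufreq.h>

static unsigned int pcc_apply_deadband(struct cpufreq_policy *policy,
				       unsigned int target_freq)
{
	unsigned int min = policy->cpuinfo.min_freq;
	unsigned int max = policy->cpuinfo.max_freq;
	unsigned int load, old_style;

	if (max <= min || target_freq <= min)
		return min;

	/* Recover the load value ondemand used to compute target_freq. */
	load = (target_freq - min) * 100 / (max - min);

	/* Pre-6393d6a102 mapping: everything below min_freq is a deadband. */
	old_style = load * max / 100;

	return old_style < min ? min : old_style;
}
```

With the HP system's 1200000/2800000 kHz limits this keeps load values up to about 42% at min_freq, which matches the mapping Andreas describes later in the thread.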
On Mon, Aug 29, 2016 at 11:31:53AM +0530, Viresh Kumar wrote: > On 19-08-16, 14:21, Andreas Herrmann wrote: > > > > Commit 6393d6a102 (cpufreq: ondemand: Eliminate the deadband effect) > > introduced a performance regression for systems using pcc-cpufreq and > > ondemand governor. This is measurable with different workloads. E.g. > > wall-clock time for kernel compilation significantly increased. > > > > The elimination of the deadband effect significantly increased the > > number of frequency changes with pcc-cpufreq. > > > > Instead of reverting commit 6393d6a102 I suggest to add a workaround > > in pcc-cpufreq to re-introduce the deadband effect for this driver > > only - to restore the old performance behaviour with pcc-cpufreq with > > ondemand governor. > > > > Following some performance numbers for similar kernel compilations to > > illustrate the effect of commit 6393d6a102 and the proposed fix. > > > > Following typical numbers of kernel compilation tests with varying number of > > compile jobs: > > > > v4.8.0-rc2 4.8.0-rc2-pcc-cpufreq-deadband > > # of jobst user sys elapsed CPU user sys elapsed CPU > > 2 440.39 116.49 4:33.35 203% 404.85 109.10 4:10.35 205% > > 4 436.87 133.39 2:22.88 399% 381.83 128.00 2:06.84 401% > > 8 475.49 157.68 1:22.24 769% 344.36 149.08 1:04.29 767% > > 16 620.69 188.33 0:54.74 1477% 374.60 157.40 0:36.76 1447% > > 32 815.79 209.58 0:37.22 2754% 490.46 160.22 0:24.87 2616% > > 64 394.13 60.55 0:13.54 3355% 386.54 60.33 0:12.79 3493% > > 120 398.24 61.55 0:14.60 3148% 390.44 61.19 0:13.07 3453% > > > > (HP ProLiant DL580 Gen8 system, 60 CPUs @ 2.80GHz) > > > > Link: http://marc.info/?l=linux-pm&m=147160912625600 > > Signed-off-by: Andreas Herrmann <aherrmann@suse.com> > > --- > > drivers/cpufreq/pcc-cpufreq.c | 20 ++++++++++++++++++++ > > 1 file changed, 20 insertions(+) > > > > If this change is accepted maybe it's a good idea to tag it also for > > stable kernels, e.g. starting with v4.4. > I am _really_ worried about such hacks in drivers to negate the effect of a > patch, that was actually good. > Did you try to increase the sampling period of ondemand governor to see if that > helps without this patch. With an older kernel I've modified transition_latency of the driver which in turn is used to calculate the sampling rate. I started with the value return as "nominal latency" for PCC. This was 300000 ns on the test system and made things worse. I've tested other values as well unitl I've found a local optimium at 45000ns but performance was lower in comparison to when I've applied my hack. > Also, it is important to understand why is the performance going > down, while the original commit should have made it better. My understanding is that the original commit was tested with certain combinations of hardware and cpufreq-drivers and the claim was that for those (two?) tested combinations performance increased and power consumption was lower. So I am not so sure what to expect from all other cpufreq-driver/hardware combinations. > Is it only about more transitions ? I think this is the main issue. In an older kernel version I activated/added debug output in __cpufreq_driver_target(). Of course this creates a huge amount of messages. 
But with original patch reverted it was like: [ 29.489677] cpufreq: target for CPU 0: 1760000 kHz (1200000 kHz), relation 2, requested 1760000 kHz [ 29.570364] cpufreq: target for CPU 0: 1216000 kHz (1760000 kHz), relation 2, requested 1216000 kHz [ 29.571055] cpufreq: target for CPU 1: 1200000 kHz (1148000 kHz), relation 0, requested 1200000 kHz [ 29.571483] cpufreq: target for CPU 1: 1200000 kHz (1200000 kHz), relation 2, requested 1200000 kHz [ 29.572042] cpufreq: target for CPU 2: 1200000 kHz (1064000 kHz), relation 0, requested 1200000 kHz [ 29.572503] cpufreq: target for CPU 2: 1200000 kHz (1200000 kHz), relation 2, requested 1200000 kHz a lot of stuff, but system could still handle it and booted to the prompt. With the original patch applied the system was really flooded and eventually became unresponsive: ** 459 printk messages dropped ** [ 29.838689] cpufreq: target for CPU 43: 1408000 kHz (2384000 kHz), relation 2, requested 1408000 kHz ** 480 printk messages dropped ** [ 29.993849] cpufreq: target for CPU 54: 1200000 kHz (1248000 kHz), relation 2, requested 1200000 kHz ** 413 printk messages dropped ** [ 30.113921] cpufreq: target for CPU 59: 2064000 kHz (1248000 kHz), relation 2, requested 2064000 kHz ** 437 printk messages dropped ** [ 30.245846] cpufreq: target for CPU 21: 1296000 kHz (1296000 kHz), relation 2, requested 1296000 kHz ** 435 printk messages dropped ** [ 30.397748] cpufreq: target for CPU 13: 1280000 kHz (2640000 kHz), relation 2, requested 1280000 kHz ** 480 printk messages dropped ** [ 30.541846] cpufreq: target for CPU 58: 2112000 kHz (1632000 kHz), relation 2, requested 2112000 kHz Of course, ideas to further debug this are welcome, or suggestions to fix the issue in another way. Thanks, Andreas -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 01-09-16, 15:21, Andreas Herrmann wrote: > On Mon, Aug 29, 2016 at 11:31:53AM +0530, Viresh Kumar wrote: > > I am _really_ worried about such hacks in drivers to negate the effect of a > > patch, that was actually good. > > > Did you try to increase the sampling period of ondemand governor to see if that > > helps without this patch. > > With an older kernel I've modified transition_latency of the driver > which in turn is used to calculate the sampling rate. Naah, that isn't what I was looking for, sorry :( To explain it a bit more, this is what the patch did. Suppose, your platform supports frequencies: F1 (lowest), F2, F3, F4, F5, F6 and F7 (highest). The cpufreq governor (ondemand) based on a sampling rate and system load tries to change the frequency of the underlying hardware and select one of those. Before the original patch came in, F2 and F3 were never getting selected and the system was stuck in F1 for a long time. Which will decrease the performance for that period of time as we should have switched to a higher frequency really. With the new patch we switch to the frequency proportional to current load. The requests made from cpufreq-governor wouldn't have changed at all with that, but the number of freq changes at hardware level may actually change as we might be returning very quickly if the target freq evaluated to F1 in the earlier case. That may lead to switching to frequencies F2 and F3, which wasn't happening earlier. I don't think the result of that should be too bad.. We should have performed better almost everywhere. transition_latency is used only to initialize sampling rate, but that may get modified from userspace later on. Please tell us the value read from sysfs file sampling_rate present in: /sys/devices/system/cpu/cpufreq/policy0/ondemand/ And try playing with that variable a bit to see if you can make things better or not. > I started with the value return as "nominal latency" for PCC. This > was 300000 ns on the test system and made things worse. I've tested > other values as well unitl I've found a local optimium at 45000ns but > performance was lower in comparison to when I've applied my hack. Can you try to use kernel tracer (ftrace) and see how the frequencies are getting changed and at what frequency. We need to understand the problem better, as I am not 100% sure what's going on right now. > > Also, it is important to understand why is the performance going > > down, while the original commit should have made it better. > > My understanding is that the original commit was tested with certain > combinations of hardware and cpufreq-drivers and the claim was that > for those (two?) tested combinations performance increased and power > consumption was lower. So I am not so sure what to expect from all > other cpufreq-driver/hardware combinations. It was principally the right thing to do IMO. And I don't think any other hardware should get affected badly. At the max, the tuning needs to be made a bit better. > > Is it only about more transitions ? > > I think this is the main issue. Then it can be controlled with sampling rate from userspace. > In an older kernel version I activated/added debug output in > __cpufreq_driver_target(). Of course this creates a huge amount of > messages. 
But with original patch reverted it was like: > > [ 29.489677] cpufreq: target for CPU 0: 1760000 kHz (1200000 kHz), relation 2, requested 1760000 kHz > [ 29.570364] cpufreq: target for CPU 0: 1216000 kHz (1760000 kHz), relation 2, requested 1216000 kHz > [ 29.571055] cpufreq: target for CPU 1: 1200000 kHz (1148000 kHz), relation 0, requested 1200000 kHz > [ 29.571483] cpufreq: target for CPU 1: 1200000 kHz (1200000 kHz), relation 2, requested 1200000 kHz > [ 29.572042] cpufreq: target for CPU 2: 1200000 kHz (1064000 kHz), relation 0, requested 1200000 kHz > [ 29.572503] cpufreq: target for CPU 2: 1200000 kHz (1200000 kHz), relation 2, requested 1200000 kHz Your platform is a bit special as it uses ->target() callback and not ->target_index(). And so you can pretty much switch to any frequency. Can you please read value of all the sysfs files present in the governor directory? That would be helpful. Maybe we can play with some more files like: up_threshold to see what the results are. > a lot of stuff, but system could still handle it and booted to the > prompt. > > With the original patch applied the system was really flooded and > eventually became unresponsive: > > ** 459 printk messages dropped ** [ 29.838689] cpufreq: target for CPU 43: 1408000 kHz (2384000 kHz), relation 2, requested 1408000 kHz > ** 480 printk messages dropped ** [ 29.993849] cpufreq: target for CPU 54: 1200000 kHz (1248000 kHz), relation 2, requested 1200000 kHz > ** 413 printk messages dropped ** [ 30.113921] cpufreq: target for CPU 59: 2064000 kHz (1248000 kHz), relation 2, requested 2064000 kHz > ** 437 printk messages dropped ** [ 30.245846] cpufreq: target for CPU 21: 1296000 kHz (1296000 kHz), relation 2, requested 1296000 kHz > ** 435 printk messages dropped ** [ 30.397748] cpufreq: target for CPU 13: 1280000 kHz (2640000 kHz), relation 2, requested 1280000 kHz > ** 480 printk messages dropped ** [ 30.541846] cpufreq: target for CPU 58: 2112000 kHz (1632000 kHz), relation 2, requested 2112000 kHz This looks even more dangerous :)
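To make the two mappings described above concrete, the small standalone program below prints the frequency ondemand requests for a given load before and after commit 6393d6a102, using the 1.2 GHz / 2.8 GHz limits of the HP test system. The formulas follow the ones quoted later in the thread; this is an illustration, not the governor code itself.

```c
/* Standalone illustration of the two ondemand load-to-frequency mappings
 * discussed in this thread.  Compile with any C99 compiler and run. */
#include <stdio.h>

#define MIN_FREQ 1200000U	/* kHz */
#define MAX_FREQ 2800000U	/* kHz */

/* Pre-6393d6a102: requests below min_freq end up clamped to min_freq, so
 * loads of 0..42% (here) all land on min_freq: the "deadband". */
static unsigned int freq_old(unsigned int load)
{
	unsigned int f = load * MAX_FREQ / 100;

	return f < MIN_FREQ ? MIN_FREQ : f;
}

/* Post-6393d6a102: load maps proportionally onto [min_freq, max_freq], so
 * even small loads already request a frequency above min_freq. */
static unsigned int freq_new(unsigned int load)
{
	return MIN_FREQ + load * (MAX_FREQ - MIN_FREQ) / 100;
}

int main(void)
{
	for (unsigned int load = 0; load <= 100; load += 10)
		printf("load %3u%%: old %u kHz, new %u kHz\n",
		       load, freq_old(load), freq_new(load));
	return 0;
}
```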
On Wed, Sep 07, 2016 at 10:32:01AM +0530, Viresh Kumar wrote: > On 01-09-16, 15:21, Andreas Herrmann wrote: > > On Mon, Aug 29, 2016 at 11:31:53AM +0530, Viresh Kumar wrote: > > > > I am _really_ worried about such hacks in drivers to negate the effect of a > > > patch, that was actually good. > > > > > Did you try to increase the sampling period of ondemand governor to see if that > > > helps without this patch. > > > > With an older kernel I've modified transition_latency of the driver > > which in turn is used to calculate the sampling rate. > > Naah, that isn't what I was looking for, sorry :( > > To explain it a bit more, this is what the patch did. > > Suppose, your platform supports frequencies: F1 (lowest), F2, F3, F4, > F5, F6 and F7 (highest). The cpufreq governor (ondemand) based on a > sampling rate and system load tries to change the frequency of the > underlying hardware and select one of those. > > Before the original patch came in, F2 and F3 were never getting > selected and the system was stuck in F1 for a long time. Which will > decrease the performance for that period of time as we should have > switched to a higher frequency really. > > With the new patch we switch to the frequency proportional to current > load. > > The requests made from cpufreq-governor wouldn't have changed at all > with that, but the number of freq changes at hardware level may > actually change as we might be returning very quickly if the target > freq evaluated to F1 in the earlier case. > > That may lead to switching to frequencies F2 and F3, which wasn't > happening earlier. > > I don't think the result of that should be too bad.. We should have > performed better almost everywhere. > > transition_latency is used only to initialize sampling rate, but that > may get modified from userspace later on. Please tell us the value > read from sysfs file sampling_rate present in: > > /sys/devices/system/cpu/cpufreq/policy0/ondemand/ /sys/devices/system/cpu/cpufreq/ondemand contains following values: ignore_nice_load : 0 io_is_busy : 1 min_sampling_rate : 10000 powersave_bias : 0 up_threshold : 95 sampling_rate : 10000 sampling_down_factor : 1 FYI, for each CPU direcories /sys/devices/system/cpu/cpufreq/policy* look like: scaling_cur_freq : 1200000 cpuinfo_max_freq : 2800000 cpuinfo_transition_latency : 0 cpuinfo_min_freq : 1200000 cpuinfo_cur_freq : 1204000 affected_cpus : 0 scaling_governor : ondemand scaling_driver : pcc-cpufreq scaling_max_freq : 2800000 related_cpus : 0 scaling_setspeed : <unsupported> scaling_available_governors: ondemand performance scaling_min_freq : 1200000 (which is for policy0 for CPU0) > And try playing with that variable a bit to see if you can make things > better or not. Done that. Following the elapsed time with ondemand governor and different values for sampling rate (for each of 4.8-rc5 vanilla/modified with my patch/commit 6393d6a reverted). 
4.8-rc5 (vanilla)

sample_rate     10000           20000           30000           45000           60000           90000
jobs        seconds  % CPU  seconds  % CPU  seconds  % CPU  seconds  % CPU  seconds  % CPU  seconds  % CPU
  2          283.47  203.00  273.79  204.00  276.08  204.00  276.99  204.00  281.80  203.60  291.00  203.00
  4          146.92  399.20  140.98  399.80  141.55  399.80  140.91  400.00  142.91  400.00  146.49  399.80
  8           81.66  770.00   77.44  769.20   76.81  769.20   73.74  769.20   73.17  767.20   72.75  767.00
 16           54.31 1466.20   49.58 1458.40   48.37 1456.60   44.29 1447.80   42.89 1436.40   41.12 1416.80
 32           36.71 2720.60   30.70 2672.80   29.22 2578.80   26.34 2563.20   25.57 2506.80   24.58 2457.80
 64           13.99 3291.20   13.63 3295.40   13.55 3309.20   13.30 3354.60   13.66 3264.40   13.55 3288.60
120           14.84 3114.20   14.10 3215.00   14.57 3115.20   14.11 3209.80   14.17 3198.60   14.09 3220.80

------------------------------
4.8-rc5 re-intro deadband for pcc-cpufreq only (my patch)

sample_rate     10000           20000           30000           45000           60000           90000
jobs        seconds  % CPU  seconds  % CPU  seconds  % CPU  seconds  % CPU  seconds  % CPU  seconds  % CPU
  2          270.88  192.00  270.87  192.80  270.65  192.00  274.08  192.00  277.97  192.80  289.85  193.00
  4          140.23  379.20  138.94  379.80  138.20  378.80  138.99  379.20  140.81  379.60  145.83  380.40
  8           76.15  738.20   73.86  733.60   71.58  734.20   70.43  730.40   70.32  730.60   71.14  728.20
 16           47.39 1414.80   44.20 1395.60   43.02 1384.60   41.09 1368.00   40.37 1362.80   39.56 1351.20
 32           32.78 2659.80   29.05 2531.80   27.18 2482.40   25.36 2471.60   24.55 2441.00   24.18 2365.80
 64           13.34 3213.20   13.29 3204.60   13.38 3158.60   13.30 3162.40   13.57 3089.60   13.48 3125.20
120           13.89 3103.00   14.00 3047.40   13.87 3068.40   13.83 3077.80   13.93 3054.80   14.10 3033.60

------------------------------
4.8-rc5 reverted commit 6393d6a

sample_rate     10000           20000           30000           45000           60000           90000
jobs        seconds  % CPU  seconds  % CPU  seconds  % CPU  seconds  % CPU  seconds  % CPU  seconds  % CPU
  2          250.94  205.00  256.26  205.00  261.29  205.00  268.92  205.00  275.60  204.80  288.64  204.00
  4          127.20  401.20  129.63  401.80  131.80  402.00  135.41  401.20  138.88  401.20  144.96  401.00
  8           64.24  767.60   64.18  769.60   64.49  770.20   65.69  766.00   66.38  768.60   68.64  767.40
 16           37.02 1438.00   36.09 1432.20   36.10 1429.40   36.10 1424.60   36.61 1408.80   36.87 1405.60
 32           24.73 2618.80   23.28 2533.80   23.00 2513.40   22.45 2526.00   22.92 2463.00   23.12 2422.20
 64           12.62 3550.60   12.94 3439.60   12.95 3425.60   12.94 3421.80   13.29 3336.80   13.65 3257.80
120           13.30 3408.80   13.43 3346.40   13.44 3347.80   13.55 3322.60   13.76 3277.20   14.19 3192.40

> > I started with the value return as "nominal latency" for PCC. This
> > was 300000 ns on the test system and made things worse. I've tested
> > other values as well unitl I've found a local optimium at 45000ns but
> > performance was lower in comparison to when I've applied my hack.
>
> Can you try to use kernel tracer (ftrace) and see how the frequencies
> are getting changed and at what frequency.

I am currently looking into this ...

> We need to understand the problem better, as I am not 100% sure what's
> going on right now.

... to gather related data.

> > > Also, it is important to understand why is the performance going
> > > down, while the original commit should have made it better.
> >
> > My understanding is that the original commit was tested with certain
> > combinations of hardware and cpufreq-drivers and the claim was that
> > for those (two?) tested combinations performance increased and power
> > consumption was lower. So I am not so sure what to expect from all
> > other cpufreq-driver/hardware combinations.
>
> It was principally the right thing to do IMO. And I don't think any
> other hardware should get affected badly.
At the max, the tuning needs > to be made a bit better. > > > > Is it only about more transitions ? > > > > I think this is the main issue. > > Then it can be controlled with sampling rate from userspace. > > > In an older kernel version I activated/added debug output in > > __cpufreq_driver_target(). Of course this creates a huge amount of > > messages. But with original patch reverted it was like: > > > > [ 29.489677] cpufreq: target for CPU 0: 1760000 kHz (1200000 kHz), relation 2, requested 1760000 kHz > > [ 29.570364] cpufreq: target for CPU 0: 1216000 kHz (1760000 kHz), relation 2, requested 1216000 kHz > > [ 29.571055] cpufreq: target for CPU 1: 1200000 kHz (1148000 kHz), relation 0, requested 1200000 kHz > > [ 29.571483] cpufreq: target for CPU 1: 1200000 kHz (1200000 kHz), relation 2, requested 1200000 kHz > > [ 29.572042] cpufreq: target for CPU 2: 1200000 kHz (1064000 kHz), relation 0, requested 1200000 kHz > > [ 29.572503] cpufreq: target for CPU 2: 1200000 kHz (1200000 kHz), relation 2, requested 1200000 kHz > > Your platform is a bit special as it uses ->target() callback and not > ->target_index(). And so you can pretty much switch to any frequency. > > Can you please read value of all the sysfs files present in the > governor directory? That would be helpful. Maybe we can play with some > more files like: up_threshold to see what the results are. For sysfs values see above. I've not yet played with up_threshold but plan to do this as well. > > a lot of stuff, but system could still handle it and booted to the > > prompt. > > > > With the original patch applied the system was really flooded and > > eventually became unresponsive: > > > > ** 459 printk messages dropped ** [ 29.838689] cpufreq: target for CPU 43: 1408000 kHz (2384000 kHz), relation 2, requested 1408000 kHz > > ** 480 printk messages dropped ** [ 29.993849] cpufreq: target for CPU 54: 1200000 kHz (1248000 kHz), relation 2, requested 1200000 kHz > > ** 413 printk messages dropped ** [ 30.113921] cpufreq: target for CPU 59: 2064000 kHz (1248000 kHz), relation 2, requested 2064000 kHz > > ** 437 printk messages dropped ** [ 30.245846] cpufreq: target for CPU 21: 1296000 kHz (1296000 kHz), relation 2, requested 1296000 kHz > > ** 435 printk messages dropped ** [ 30.397748] cpufreq: target for CPU 13: 1280000 kHz (2640000 kHz), relation 2, requested 1280000 kHz > > ** 480 printk messages dropped ** [ 30.541846] cpufreq: target for CPU 58: 2112000 kHz (1632000 kHz), relation 2, requested 2112000 kHz > > This looks even more dangerous :) > > -- > viresh Thanks, Andreas -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Sep 07, 2016 at 10:32:01AM +0530, Viresh Kumar wrote: > On 01-09-16, 15:21, Andreas Herrmann wrote: ---8<--- > > I started with the value return as "nominal latency" for PCC. This > > was 300000 ns on the test system and made things worse. I've tested > > other values as well unitl I've found a local optimium at 45000ns but > > performance was lower in comparison to when I've applied my hack. > > Can you try to use kernel tracer (ftrace) and see how the frequencies > are getting changed and at what frequency. Below is some trace data. I hope it is of some help. (A) - sampling 10s period when system is idle (B) - sampling 10s period when system partially loaded (kernel compilation using 2 jobs) (1) 4.8-rc5 (2) 4.8-rc5 with my patch (reintro of deadband effect within pcc-cpufreq) (3) 4.8-rc5 with reversal of 6393d6a102 (cpufreq: ondemand: Eliminate the deadband effect) Let me know whether you are looking for other trace data wrt this issue. Thanks, Andreas --- (A)-(1) # Total Lost Samples: 0 # Samples: 41 of event 'power:cpu_frequency' # Event count (approx.): 41 # Overhead Command Shared Object Symbol # ........ ............ ................ ............................. 39.02% kworker/14:1 [kernel.vmlinux] [k] cpufreq_notify_transition 29.27% kworker/0:0 [kernel.vmlinux] [k] cpufreq_notify_transition 19.51% kworker/10:1 [kernel.vmlinux] [k] cpufreq_notify_transition 7.32% kworker/5:2 [kernel.vmlinux] [k] cpufreq_notify_transition 2.44% kworker/23:2 [kernel.vmlinux] [k] cpufreq_notify_transition 2.44% kworker/40:1 [kernel.vmlinux] [k] cpufreq_notify_transition (A)-(2) # Total Lost Samples: 0 # Samples: 6 of event 'power:cpu_frequency' # Event count (approx.): 6 # Overhead Command Shared Object Symbol # ........ ............ ................ ............................. 33.33% kworker/1:2 [kernel.vmlinux] [k] cpufreq_notify_transition 16.67% kworker/16:1 [kernel.vmlinux] [k] cpufreq_notify_transition 16.67% kworker/22:1 [kernel.vmlinux] [k] cpufreq_notify_transition 16.67% kworker/26:1 [kernel.vmlinux] [k] cpufreq_notify_transition 16.67% kworker/33:1 [kernel.vmlinux] [k] cpufreq_notify_transition (A)-(3) # Total Lost Samples: 0 # Samples: 7 of event 'power:cpu_frequency' # Event count (approx.): 7 # Overhead Command Shared Object Symbol # ........ ............ ................ ............................. 28.57% kworker/58:1 [kernel.vmlinux] [k] cpufreq_notify_transition 14.29% kworker/19:2 [kernel.vmlinux] [k] cpufreq_notify_transition 14.29% kworker/20:2 [kernel.vmlinux] [k] cpufreq_notify_transition 14.29% kworker/22:2 [kernel.vmlinux] [k] cpufreq_notify_transition 14.29% kworker/23:1 [kernel.vmlinux] [k] cpufreq_notify_transition 14.29% kworker/35:1 [kernel.vmlinux] [k] cpufreq_notify_transition --- (B)-(1) # Total Lost Samples: 0 # Samples: 2K of event 'power:cpu_frequency' # Event count (approx.): 2382 # Overhead Command Shared Object Symbol # ........ ............ ................ ............................. 5.75% kworker/0:0 [kernel.vmlinux] [k] cpufreq_notify_transition 4.16% kworker/12:2 [kernel.vmlinux] [k] cpufreq_notify_transition 3.11% kworker/17:1 [kernel.vmlinux] [k] cpufreq_notify_transition 2.94% kworker/2:1 [kernel.vmlinux] [k] cpufreq_notify_transition 2.73% kworker/19:1 [kernel.vmlinux] [k] cpufreq_notify_transition ... (B)-(2) # Total Lost Samples: 0 # Samples: 320 of event 'power:cpu_frequency' # Event count (approx.): 320 # Overhead Command Shared Object Symbol # ........ ............ ................ ............................. 
4.69% kworker/56:1 [kernel.vmlinux] [k] cpufreq_notify_transition 4.06% kworker/12:1 [kernel.vmlinux] [k] cpufreq_notify_transition 4.06% kworker/28:1 [kernel.vmlinux] [k] cpufreq_notify_transition 4.06% kworker/6:2 [kernel.vmlinux] [k] cpufreq_notify_transition 3.75% kworker/32:2 [kernel.vmlinux] [k] cpufreq_notify_transition ... (B)-(3) # Total Lost Samples: 0 # Samples: 333 of event 'power:cpu_frequency' # Event count (approx.): 333 # Overhead Command Shared Object Symbol # ........ ............ ................ ............................. 4.80% kworker/51:1 [kernel.vmlinux] [k] cpufreq_notify_transition 4.50% kworker/39:1 [kernel.vmlinux] [k] cpufreq_notify_transition 4.20% kworker/47:1 [kernel.vmlinux] [k] cpufreq_notify_transition 3.90% kworker/59:1 [kernel.vmlinux] [k] cpufreq_notify_transition 3.90% kworker/7:2 [kernel.vmlinux] [k] cpufreq_notify_transition ... --- With (1) pcc-cpufreq tries to set pretty much every frequency even if the system is idle, e.g. here is the start of (perf script output) from (A)-(1): kworker/5:2 904 [005] 255.778343: power:cpu_frequency: state=1216000 cpu_id=5 kworker/5:2 904 [005] 255.794382: power:cpu_frequency: state=1264000 cpu_id=5 kworker/5:2 904 [005] 256.102400: power:cpu_frequency: state=1200000 cpu_id=5 kworker/10:1 171 [010] 258.010362: power:cpu_frequency: state=2224000 cpu_id=10 kworker/10:1 171 [010] 258.026366: power:cpu_frequency: state=1264000 cpu_id=10 kworker/10:1 171 [010] 258.594514: power:cpu_frequency: state=1200000 cpu_id=10 kworker/10:1 171 [010] 258.618417: power:cpu_frequency: state=1232000 cpu_id=10 kworker/10:1 171 [010] 258.634409: power:cpu_frequency: state=1264000 cpu_id=10 kworker/10:1 171 [010] 258.674467: power:cpu_frequency: state=1200000 cpu_id=10 kworker/10:1 171 [010] 258.730486: power:cpu_frequency: state=1216000 cpu_id=10 kworker/40:1 388 [040] 258.730999: power:cpu_frequency: state=1200000 cpu_id=40 kworker/23:2 775 [023] 258.731504: power:cpu_frequency: state=1200000 cpu_id=23 kworker/14:1 178 [014] 258.732013: power:cpu_frequency: state=1216000 cpu_id=14 kworker/10:1 171 [010] 258.906434: power:cpu_frequency: state=1200000 cpu_id=10 kworker/14:1 178 [014] 258.970500: power:cpu_frequency: state=1200000 cpu_id=14 kworker/14:1 178 [014] 258.998440: power:cpu_frequency: state=1232000 cpu_id=14 kworker/14:1 178 [014] 259.034490: power:cpu_frequency: state=1200000 cpu_id=14 kworker/14:1 178 [014] 259.095089: power:cpu_frequency: state=1216000 cpu_id=14 kworker/14:1 178 [014] 259.270470: power:cpu_frequency: state=1200000 cpu_id=14 ... -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Sep 07, 2016 at 10:32:01AM +0530, Viresh Kumar wrote: > On 01-09-16, 15:21, Andreas Herrmann wrote: > > On Mon, Aug 29, 2016 at 11:31:53AM +0530, Viresh Kumar wrote: > > > I am _really_ worried about such hacks in drivers to negate the effect of a > > > patch, that was actually good. > > > > > Did you try to increase the sampling period of ondemand governor to see if that > > > helps without this patch. > > > > With an older kernel I've modified transition_latency of the driver > > which in turn is used to calculate the sampling rate. > Naah, that isn't what I was looking for, sorry :( > To explain it a bit more, this is what the patch did. > Suppose, your platform supports frequencies: F1 (lowest), F2, F3, F4, > F5, F6 and F7 (highest). The cpufreq governor (ondemand) based on a > sampling rate and system load tries to change the frequency of the > underlying hardware and select one of those. > Before the original patch came in, F2 and F3 were never getting > selected and the system was stuck in F1 for a long time. I think this is not a general statement. Such a behaviour is not common to all systems. Before commit 6393d6a target frequency was based on freq_next = load * policy->cpuinfo.max_freq / 100; F2 would have been selected if load = F2 * 100 / F7 If F2 was not seen it can mean (1) either the load value was not hit in practice during monitoring of a certain workload (2) or the calculated load value (in integer representation) would select F1 or F3 (there is no corresponding integer value that would select F2) E.g. for the Intel i7-3770 system mentioned in commit message for 6393d6a I think a load value of 49 should have selected 1700000 which is not shown in the provided frequency table. What essentially changed was how load values are mapped to target frequencies. For the HP system (min_freq=1200000, max_freq=2800000) that I used in my tests, the old code would create following mapping: load | freq_next | used target frequency
Hi, On 16/09/2016 12:47 μμ, Andreas Herrmann wrote: > On Wed, Sep 07, 2016 at 10:32:01AM +0530, Viresh Kumar wrote: >> On 01-09-16, 15:21, Andreas Herrmann wrote: >>> On Mon, Aug 29, 2016 at 11:31:53AM +0530, Viresh Kumar wrote: > >>>> I am _really_ worried about such hacks in drivers to negate the effect of a >>>> patch, that was actually good. >>> >>>> Did you try to increase the sampling period of ondemand governor to see if that >>>> helps without this patch. >>> >>> With an older kernel I've modified transition_latency of the driver >>> which in turn is used to calculate the sampling rate. > >> Naah, that isn't what I was looking for, sorry :( > >> To explain it a bit more, this is what the patch did. > >> Suppose, your platform supports frequencies: F1 (lowest), F2, F3, F4, >> F5, F6 and F7 (highest). The cpufreq governor (ondemand) based on a >> sampling rate and system load tries to change the frequency of the >> underlying hardware and select one of those. > >> Before the original patch came in, F2 and F3 were never getting >> selected and the system was stuck in F1 for a long time. > > I think this is not a general statement. Such a behaviour is not > common to all systems. Before commit 6393d6a target frequency was > based on > > freq_next = load * policy->cpuinfo.max_freq / 100; > > F2 would have been selected if > > load = F2 * 100 / F7 > > If F2 was not seen it can mean > > (1) either the load value was not hit in practice during monitoring of > a certain workload > > (2) or the calculated load value (in integer representation) would > select F1 or F3 (there is no corresponding integer value that > would select F2) > > E.g. for the Intel i7-3770 system mentioned in commit message for > 6393d6a I think a load value of 49 should have selected 1700000 which > is not shown in the provided frequency table. I think this is not true, because before the specific patch the relation of frequency selection was CPUFREQ_RELATION_L. This is the reason that a load value of 49 with a freq_next 1666490 would have a target frequency 1600000. > > What essentially changed was how load values are mapped to target > frequencies. For the HP system (min_freq=1200000, max_freq=2800000) > that I used in my tests, the old code would create following mapping: > > load | freq_next | used target frequency > ________________________________________ > 0 0 1200000 > 10 280000 1200000 > 20 560000 1200000 > 30 840000 1200000 > 40 1120000 1200000 > 42 1176000 1200000 > 43 1204000 1204000 > 50 1400000 1400000 > 60 1680000 1680000 > 70 1960000 1960000 > 80 2240000 2240000 > 90 2520000 2520000 > 100 2800000 2800000 > > The new code (introduced with commit 6393d6a) changed the mapping as > follows: > > load | freq_next | used target frequency > ________________________________________ > 0 1200000 1200000 > 10 1360000 1360000 > 20 1520000 1520000 > 30 1680000 1680000 > 40 1840000 1840000 > 42 1872000 1872000 > 43 1888000 1888000 > 50 2000000 2000000 > 60 2160000 2160000 > 70 2320000 2320000 > 80 2480000 2480000 > 90 2640000 2640000 > 100 2800000 2800000 > > My patch creates a third mapping. It basically ensures that up to a > load value of 42 the minimum frequency is used. > >> Which will decrease the performance for that period of time as we >> should have switched to a higher frequency really. > > I am not sure whether it's really useful for all systems using > ondemand governor to increase frequency above min_freq even if load is > just above 0. 
Of course expectation is that performance will be equal > or better than before. But how overall power consumption changes > depends on the hardware and its power saving capabilites. > > ---8<--- > >>> My understanding is that the original commit was tested with certain >>> combinations of hardware and cpufreq-drivers and the claim was that >>> for those (two?) tested combinations performance increased and power >>> consumption was lower. So I am not so sure what to expect from all >>> other cpufreq-driver/hardware combinations. > >> It was principally the right thing to do IMO. And I don't think any >> other hardware should get affected badly. At the max, the tuning needs >> to be made a bit better. > > ---8<--- > > It seems that the decision how to best map load values to target > frequencies is kind of hardware specific. > > Maybe a solution to this is that the cpufreq driver should be able to > provide a mapping function to overwrite the current default > calculation. > I'm not familiar with ppc-cpufreq drive but maybe patch 6393d6 just uncovered an "issue" that was already existed but only on higher loads. Because, with or without patch 6393d6, if the specific CPU doesn't use a frequency table, there will many frequency transitions in higher loads too. I believe, though, that the side effect it's smaller in higher frequencies because CPUs tend to work on lowest and highest frequencies. What about a patch in ppc-cpufreq driver that permits frequency changes only in specific steps and not in arbitrary values? Regards, Stratos -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
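A minimal sketch of the step quantization suggested above, assuming ten evenly spaced steps and round-up behaviour inside the driver; the helper name and the step count are illustrative, not an existing pcc-cpufreq function.

```c
/*
 * Illustrative sketch of the suggestion above: quantize the requested
 * frequency to one of NR_STEPS evenly spaced values between policy->min and
 * policy->max before programming the hardware, so arbitrary kHz values from
 * the governor collapse onto a small set of target frequencies.
 */
#include <linux/cpufreq.h>
#include <linux/kernel.h>

#define NR_STEPS	10

static unsigned int quantize_target_freq(struct cpufreq_policy *policy,
					 unsigned int target_freq)
{
	unsigned int min = policy->min;
	unsigned int max = policy->max;
	unsigned int step, idx;

	if (max <= min || target_freq <= min)
		return min;

	step = (max - min) / NR_STEPS;
	if (!step)
		return target_freq;

	/* Round up to the next step boundary (RELATION_L-like behaviour). */
	idx = DIV_ROUND_UP(target_freq - min, step);
	if (idx >= NR_STEPS)
		return max;

	return min + idx * step;
}
```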
On Fri, Sep 16, 2016 at 09:58:42PM +0300, Stratos Karafotis wrote: > Hi, > > [ I 'm resending this message, because I think some recipients didn't receive > it. ] > > On 16/09/2016 12:47 μμ, Andreas Herrmann wrote: > > On Wed, Sep 07, 2016 at 10:32:01AM +0530, Viresh Kumar wrote: > >> On 01-09-16, 15:21, Andreas Herrmann wrote: > >>> On Mon, Aug 29, 2016 at 11:31:53AM +0530, Viresh Kumar wrote: > > > >>>> I am _really_ worried about such hacks in drivers to negate the effect of > a > >>>> patch, that was actually good. > >>> > >>>> Did you try to increase the sampling period of ondemand governor to see if > that > >>>> helps without this patch. > >>> > >>> With an older kernel I've modified transition_latency of the driver > >>> which in turn is used to calculate the sampling rate. > > > >> Naah, that isn't what I was looking for, sorry > > > >> To explain it a bit more, this is what the patch did. > > > >> Suppose, your platform supports frequencies: F1 (lowest), F2, F3, F4, > >> F5, F6 and F7 (highest). The cpufreq governor (ondemand) based on a > >> sampling rate and system load tries to change the frequency of the > >> underlying hardware and select one of those. > > > >> Before the original patch came in, F2 and F3 were never getting > >> selected and the system was stuck in F1 for a long time. > > > > I think this is not a general statement. Such a behaviour is not > > common to all systems. Before commit 6393d6a target frequency was > > based on > > > > freq_next = load * policy->cpuinfo.max_freq / 100; > > > > F2 would have been selected if > > > > load = F2 * 100 / F7 > > > > If F2 was not seen it can mean > > > > (1) either the load value was not hit in practice during monitoring of > > a certain workload > > > > (2) or the calculated load value (in integer representation) would > > select F1 or F3 (there is no corresponding integer value that > > would select F2) > > > > E.g. for the Intel i7-3770 system mentioned in commit message for > > 6393d6a I think a load value of 49 should have selected 1700000 which > > is not shown in the provided frequency table. > > I think this is not true, because before the specific patch the relation > of frequency selection was CPUFREQ_RELATION_L. This is the reason > that a load value of 49 with a freq_next 1666490 would have a > target frequency 1600000. Hmm... CPUFREQ_RELATION_L should select "lowest frequency at or above target" being 1700000 in this case. Otherwise (if it would select "highest frequency at or below target") this would imply that load values of 50, 51, 52 should select 1700000 which would contradict what was written in commit message of 6393d6a1. In any case probability of seeing such load values and thus selecting a frequency of 1700000 is quite low. So I fully understand why the patch was introduced. > > What essentially changed was how load values are mapped to target > > frequencies. 
For the HP system (min_freq=1200000, max_freq=2800000) > > that I used in my tests, the old code would create following mapping: > > > > load | freq_next | used target frequency > > ________________________________________ > > 0 0 1200000 > > 10 280000 1200000 > > 20 560000 1200000 > > 30 840000 1200000 > > 40 1120000 1200000 > > 42 1176000 1200000 > > 43 1204000 1204000 > > 50 1400000 1400000 > > 60 1680000 1680000 > > 70 1960000 1960000 > > 80 2240000 2240000 > > 90 2520000 2520000 > > 100 2800000 2800000 > > > > The new code (introduced with commit 6393d6a) changed the mapping as > > follows: > > > > load | freq_next | used target frequency > > ________________________________________ > > 0 1200000 1200000 > > 10 1360000 1360000 > > 20 1520000 1520000 > > 30 1680000 1680000 > > 40 1840000 1840000 > > 42 1872000 1872000 > > 43 1888000 1888000 > > 50 2000000 2000000 > > 60 2160000 2160000 > > 70 2320000 2320000 > > 80 2480000 2480000 > > 90 2640000 2640000 > > 100 2800000 2800000 > > > > My patch creates a third mapping. It basically ensures that up to a > > load value of 42 the minimum frequency is used. > > > >> Which will decrease the performance for that period of time as we > >> should have switched to a higher frequency really. > > > > I am not sure whether it's really useful for all systems using > > ondemand governor to increase frequency above min_freq even if load is > > just above 0. Of course expectation is that performance will be equal > > or better than before. But how overall power consumption changes > > depends on the hardware and its power saving capabilites. > > > > ---8<--- > > > >>> My understanding is that the original commit was tested with certain > >>> combinations of hardware and cpufreq-drivers and the claim was that > >>> for those (two?) tested combinations performance increased and power > >>> consumption was lower. So I am not so sure what to expect from all > >>> other cpufreq-driver/hardware combinations. > > > >> It was principally the right thing to do IMO. And I don't think any > >> other hardware should get affected badly. At the max, the tuning needs > >> to be made a bit better. > > > > ---8<--- > > > > It seems that the decision how to best map load values to target > > frequencies is kind of hardware specific. > > > > Maybe a solution to this is that the cpufreq driver should be able to > > provide a mapping function to overwrite the current default > > calculation. > > > > I'm not familiar with ppc-cpufreq drive but maybe patch 6393d6 just > uncovered an "issue" that was already existed but only on higher loads. > > Because, with or without patch 6393d6, if the specific CPU doesn't > use a frequency table, there will many frequency transitions in > higher loads too. I believe, though, that the side effect it's smaller > in higher frequencies because CPUs tend to work on lowest and highest > frequencies. Might be. I didn't test this specifically. > What about a patch in ppc-cpufreq driver that permits frequency > changes only in specific steps and not in arbitrary values? Which steps would you use? What scheme would be universal usable for all affected system using this driver? I had played with an approach to only make use of min_freq and max_freq which eventually didn't result in better performance in comparison to code before commit 6393d6. 
Andreas > Regards, > Stratos ---8<--- -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
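The RELATION_L / RELATION_H distinction discussed above can be checked with a few lines of standalone code. Only the 1600000 and 1700000 kHz entries and the 3400000 kHz maximum are taken from the thread; the remaining table values are filler for the example.

```c
/* Standalone illustration: CPUFREQ_RELATION_L picks the lowest table
 * frequency at or above the target, CPUFREQ_RELATION_H the highest at or
 * below it.  Table values are illustrative, not the real i7-3770 table. */
#include <stdio.h>

static const unsigned int table[] = { 1600000, 1700000, 1800000, 3400000 };
#define N (sizeof(table) / sizeof(table[0]))

static unsigned int relation_l(unsigned int target)
{
	for (unsigned int i = 0; i < N; i++)
		if (table[i] >= target)
			return table[i];
	return table[N - 1];
}

static unsigned int relation_h(unsigned int target)
{
	unsigned int best = table[0];

	for (unsigned int i = 0; i < N; i++)
		if (table[i] <= target)
			best = table[i];
	return best;
}

int main(void)
{
	unsigned int target = 49 * 3400000 / 100;	/* load 49% -> 1666000 kHz */

	printf("target %u kHz: RELATION_L -> %u, RELATION_H -> %u\n",
	       target, relation_l(target), relation_h(target));
	return 0;
}
```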
On Mon, Sep 19, 2016 at 7:16 PM, Andreas Herrmann <aherrmann@suse.com> wrote: > On Fri, Sep 16, 2016 at 09:58:42PM +0300, Stratos Karafotis wrote: >> Hi, >> >> [ I 'm resending this message, because I think some recipients didn't receive >> it. ] >> >> On 16/09/2016 12:47 μμ, Andreas Herrmann wrote: >> > On Wed, Sep 07, 2016 at 10:32:01AM +0530, Viresh Kumar wrote: >> >> On 01-09-16, 15:21, Andreas Herrmann wrote: >> >>> On Mon, Aug 29, 2016 at 11:31:53AM +0530, Viresh Kumar wrote: >> > >> >>>> I am _really_ worried about such hacks in drivers to negate the effect of >> a >> >>>> patch, that was actually good. >> >>> >> >>>> Did you try to increase the sampling period of ondemand governor to see if >> that >> >>>> helps without this patch. >> >>> >> >>> With an older kernel I've modified transition_latency of the driver >> >>> which in turn is used to calculate the sampling rate. >> > >> >> Naah, that isn't what I was looking for, sorry >> > >> >> To explain it a bit more, this is what the patch did. >> > >> >> Suppose, your platform supports frequencies: F1 (lowest), F2, F3, F4, >> >> F5, F6 and F7 (highest). The cpufreq governor (ondemand) based on a >> >> sampling rate and system load tries to change the frequency of the >> >> underlying hardware and select one of those. >> > >> >> Before the original patch came in, F2 and F3 were never getting >> >> selected and the system was stuck in F1 for a long time. >> > >> > I think this is not a general statement. Such a behaviour is not >> > common to all systems. Before commit 6393d6a target frequency was >> > based on >> > >> > freq_next = load * policy->cpuinfo.max_freq / 100; >> > >> > F2 would have been selected if >> > >> > load = F2 * 100 / F7 >> > >> > If F2 was not seen it can mean >> > >> > (1) either the load value was not hit in practice during monitoring of >> > a certain workload >> > >> > (2) or the calculated load value (in integer representation) would >> > select F1 or F3 (there is no corresponding integer value that >> > would select F2) >> > >> > E.g. for the Intel i7-3770 system mentioned in commit message for >> > 6393d6a I think a load value of 49 should have selected 1700000 which >> > is not shown in the provided frequency table. >> >> I think this is not true, because before the specific patch the relation >> of frequency selection was CPUFREQ_RELATION_L. This is the reason >> that a load value of 49 with a freq_next 1666490 would have a >> target frequency 1600000. > > Hmm... > CPUFREQ_RELATION_L should select "lowest frequency at or above target" > being 1700000 in this case. Otherwise (if it would select "highest > frequency at or below target") this would imply that load values of > 50, 51, 52 should select 1700000 which would contradict what was > written in commit message of 6393d6a1. Yes, you are right. I'm sorry for the noise. > In any case probability of seeing such load values and thus selecting > a frequency of 1700000 is quite low. So I fully understand why the > patch was introduced. >> > What essentially changed was how load values are mapped to target >> > frequencies. 
For the HP system (min_freq=1200000, max_freq=2800000) >> > that I used in my tests, the old code would create following mapping: >> > >> > load | freq_next | used target frequency >> > ________________________________________ >> > 0 0 1200000 >> > 10 280000 1200000 >> > 20 560000 1200000 >> > 30 840000 1200000 >> > 40 1120000 1200000 >> > 42 1176000 1200000 >> > 43 1204000 1204000 >> > 50 1400000 1400000 >> > 60 1680000 1680000 >> > 70 1960000 1960000 >> > 80 2240000 2240000 >> > 90 2520000 2520000 >> > 100 2800000 2800000 >> > >> > The new code (introduced with commit 6393d6a) changed the mapping as >> > follows: >> > >> > load | freq_next | used target frequency >> > ________________________________________ >> > 0 1200000 1200000 >> > 10 1360000 1360000 >> > 20 1520000 1520000 >> > 30 1680000 1680000 >> > 40 1840000 1840000 >> > 42 1872000 1872000 >> > 43 1888000 1888000 >> > 50 2000000 2000000 >> > 60 2160000 2160000 >> > 70 2320000 2320000 >> > 80 2480000 2480000 >> > 90 2640000 2640000 >> > 100 2800000 2800000 >> > >> > My patch creates a third mapping. It basically ensures that up to a >> > load value of 42 the minimum frequency is used. >> > >> >> Which will decrease the performance for that period of time as we >> >> should have switched to a higher frequency really. >> > >> > I am not sure whether it's really useful for all systems using >> > ondemand governor to increase frequency above min_freq even if load is >> > just above 0. Of course expectation is that performance will be equal >> > or better than before. But how overall power consumption changes >> > depends on the hardware and its power saving capabilites. >> > >> > ---8<--- >> > >> >>> My understanding is that the original commit was tested with certain >> >>> combinations of hardware and cpufreq-drivers and the claim was that >> >>> for those (two?) tested combinations performance increased and power >> >>> consumption was lower. So I am not so sure what to expect from all >> >>> other cpufreq-driver/hardware combinations. >> > >> >> It was principally the right thing to do IMO. And I don't think any >> >> other hardware should get affected badly. At the max, the tuning needs >> >> to be made a bit better. >> > >> > ---8<--- >> > >> > It seems that the decision how to best map load values to target >> > frequencies is kind of hardware specific. >> > >> > Maybe a solution to this is that the cpufreq driver should be able to >> > provide a mapping function to overwrite the current default >> > calculation. >> > >> >> I'm not familiar with ppc-cpufreq drive but maybe patch 6393d6 just >> uncovered an "issue" that was already existed but only on higher loads. >> >> Because, with or without patch 6393d6, if the specific CPU doesn't >> use a frequency table, there will many frequency transitions in >> higher loads too. I believe, though, that the side effect it's smaller >> in higher frequencies because CPUs tend to work on lowest and highest >> frequencies. > > Might be. I didn't test this specifically. > >> What about a patch in ppc-cpufreq driver that permits frequency >> changes only in specific steps and not in arbitrary values? > > Which steps would you use? What scheme would be universal usable for > all affected system using this driver? Just an idea. I would split the frequency range (max_freq - min_freq) into ~10 steps. But I'm not familiar with the affected systems and of course I can't prove this is an ideal approach. 
> I had played with an approach to only make use of min_freq and > max_freq which eventually didn't result in better performance > in comparison to code before commit 6393d6. > Regards, Stratos -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Sep 19, 2016 at 10:39:18PM +0300, Stratos Karafotis wrote: > On Mon, Sep 19, 2016 at 7:16 PM, Andreas Herrmann <aherrmann@suse.com> wrote: > > On Fri, Sep 16, 2016 at 09:58:42PM +0300, Stratos Karafotis wrote: > >> Hi, > >> > >> [ I 'm resending this message, because I think some recipients didn't receive > >> it. ] > >> > >> On 16/09/2016 12:47 μμ, Andreas Herrmann wrote: > >> > On Wed, Sep 07, 2016 at 10:32:01AM +0530, Viresh Kumar wrote: > >> >> On 01-09-16, 15:21, Andreas Herrmann wrote: > >> >>> On Mon, Aug 29, 2016 at 11:31:53AM +0530, Viresh Kumar wrote: ---8<--- > >> > It seems that the decision how to best map load values to target > >> > frequencies is kind of hardware specific. > >> > > >> > Maybe a solution to this is that the cpufreq driver should be able to > >> > provide a mapping function to overwrite the current default > >> > calculation. FYI, I've created new patches to address the issue. First one will be to introduce a map_load_to_freq function. The default being what commit 6393d6 introduced (no deadband). Second patch will than introduce a specific function for pcc-cpufreq to fall back to what was used before commit 6393d6. I just want to assemble gathered performance data and I am planning to send those patches tomorrow. > >> I'm not familiar with ppc-cpufreq drive but maybe patch 6393d6 just > >> uncovered an "issue" that was already existed but only on higher loads. > >> > >> Because, with or without patch 6393d6, if the specific CPU doesn't > >> use a frequency table, there will many frequency transitions in > >> higher loads too. I believe, though, that the side effect it's smaller > >> in higher frequencies because CPUs tend to work on lowest and highest > >> frequencies. > > > > Might be. I didn't test this specifically. Hopefully I'll also find time to gather some ftrace data wrt this. > >> What about a patch in ppc-cpufreq driver that permits frequency > >> changes only in specific steps and not in arbitrary values? > > > > Which steps would you use? What scheme would be universal usable for > > all affected system using this driver? > > Just an idea. I would split the frequency range (max_freq - min_freq) > into ~10 steps. But I'm not familiar with the affected systems and > of course I can't prove this is an ideal approach. I've modified the pcc-cpufreq specific map_load_to_freq function to do just that (map load values to 10 discrete frequency values) instead of falling back to the deadband (pre-commit-6393d6-version). Unfortunately this resulted in lower performance compared to pre-commit-6393d6-version. > > I had played with an approach to only make use of min_freq and > > max_freq which eventually didn't result in better performance > > in comparison to code before commit 6393d6. > > Regards, > Stratos Regards, Andreas -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
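The two follow-up patches announced above are not part of this thread, so the sketch below is only a guess at the shape of such a hook: the name map_load_to_freq comes from the message itself, but the signature and wiring are assumptions.

```c
/*
 * Rough sketch of the callback idea described above.  The real patches were
 * only announced here; hook name, signature and placement are assumptions.
 * A driver could override how the ondemand governor maps load to a target
 * frequency, with the default keeping the post-6393d6a102 behaviour.
 */
#include <linux/cpufreq.h>
#include <linux/kernel.h>

/* Default: proportional mapping without deadband (commit 6393d6a102). */
static unsigned int default_map_load_to_freq(struct cpufreq_policy *policy,
					     unsigned int load)
{
	return policy->cpuinfo.min_freq +
	       load * (policy->cpuinfo.max_freq - policy->cpuinfo.min_freq) / 100;
}

/* pcc-cpufreq override: pre-6393d6a102 mapping, deadband below min_freq. */
static unsigned int pcc_map_load_to_freq(struct cpufreq_policy *policy,
					 unsigned int load)
{
	unsigned int freq = load * policy->cpuinfo.max_freq / 100;

	return max(freq, policy->cpuinfo.min_freq);
}
```

The governor would call the driver's hook when one is provided and fall back to the default mapping otherwise.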
Sorry for being late Andreas :( On 14-09-16, 16:56, Andreas Herrmann wrote: First of all, thanks for your hardwork in getting these numbers out. Really appreciate it. > Below is some trace data. I hope it is of some help. > > (A) - sampling 10s period when system is idle > (B) - sampling 10s period when system partially loaded (kernel > compilation using 2 jobs) > > (1) 4.8-rc5 > (2) 4.8-rc5 with my patch (reintro of deadband effect within > pcc-cpufreq) > (3) 4.8-rc5 with reversal of 6393d6a102 (cpufreq: ondemand: Eliminate > the deadband effect) > > Let me know whether you are looking for other trace data wrt this > issue. > > > Thanks, > > Andreas > > --- > > (A)-(1) > > # Total Lost Samples: 0 > # Samples: 41 of event 'power:cpu_frequency' > # Event count (approx.): 41 > # Overhead Command Shared Object Symbol > # ........ ............ ................ ............................. > 39.02% kworker/14:1 [kernel.vmlinux] [k] cpufreq_notify_transition > 29.27% kworker/0:0 [kernel.vmlinux] [k] cpufreq_notify_transition > 19.51% kworker/10:1 [kernel.vmlinux] [k] cpufreq_notify_transition > 7.32% kworker/5:2 [kernel.vmlinux] [k] cpufreq_notify_transition > 2.44% kworker/23:2 [kernel.vmlinux] [k] cpufreq_notify_transition > 2.44% kworker/40:1 [kernel.vmlinux] [k] cpufreq_notify_transition > > (A)-(2) > > # Total Lost Samples: 0 > # Samples: 6 of event 'power:cpu_frequency' > # Event count (approx.): 6 > # Overhead Command Shared Object Symbol > # ........ ............ ................ ............................. > 33.33% kworker/1:2 [kernel.vmlinux] [k] cpufreq_notify_transition > 16.67% kworker/16:1 [kernel.vmlinux] [k] cpufreq_notify_transition > 16.67% kworker/22:1 [kernel.vmlinux] [k] cpufreq_notify_transition > 16.67% kworker/26:1 [kernel.vmlinux] [k] cpufreq_notify_transition > 16.67% kworker/33:1 [kernel.vmlinux] [k] cpufreq_notify_transition > > (A)-(3) > > # Total Lost Samples: 0 > # Samples: 7 of event 'power:cpu_frequency' > # Event count (approx.): 7 > # Overhead Command Shared Object Symbol > # ........ ............ ................ ............................. > 28.57% kworker/58:1 [kernel.vmlinux] [k] cpufreq_notify_transition > 14.29% kworker/19:2 [kernel.vmlinux] [k] cpufreq_notify_transition > 14.29% kworker/20:2 [kernel.vmlinux] [k] cpufreq_notify_transition > 14.29% kworker/22:2 [kernel.vmlinux] [k] cpufreq_notify_transition > 14.29% kworker/23:1 [kernel.vmlinux] [k] cpufreq_notify_transition > 14.29% kworker/35:1 [kernel.vmlinux] [k] cpufreq_notify_transition > > --- > > (B)-(1) > > # Total Lost Samples: 0 > # Samples: 2K of event 'power:cpu_frequency' > # Event count (approx.): 2382 > # Overhead Command Shared Object Symbol > # ........ ............ ................ ............................. > 5.75% kworker/0:0 [kernel.vmlinux] [k] cpufreq_notify_transition > 4.16% kworker/12:2 [kernel.vmlinux] [k] cpufreq_notify_transition > 3.11% kworker/17:1 [kernel.vmlinux] [k] cpufreq_notify_transition > 2.94% kworker/2:1 [kernel.vmlinux] [k] cpufreq_notify_transition > 2.73% kworker/19:1 [kernel.vmlinux] [k] cpufreq_notify_transition > ... > > (B)-(2) > > # Total Lost Samples: 0 > # Samples: 320 of event 'power:cpu_frequency' > # Event count (approx.): 320 > # Overhead Command Shared Object Symbol > # ........ ............ ................ ............................. 
> 4.69% kworker/56:1 [kernel.vmlinux] [k] cpufreq_notify_transition > 4.06% kworker/12:1 [kernel.vmlinux] [k] cpufreq_notify_transition > 4.06% kworker/28:1 [kernel.vmlinux] [k] cpufreq_notify_transition > 4.06% kworker/6:2 [kernel.vmlinux] [k] cpufreq_notify_transition > 3.75% kworker/32:2 [kernel.vmlinux] [k] cpufreq_notify_transition > ... > > (B)-(3) > > # Total Lost Samples: 0 > # Samples: 333 of event 'power:cpu_frequency' > # Event count (approx.): 333 > # Overhead Command Shared Object Symbol > # ........ ............ ................ ............................. > 4.80% kworker/51:1 [kernel.vmlinux] [k] cpufreq_notify_transition > 4.50% kworker/39:1 [kernel.vmlinux] [k] cpufreq_notify_transition > 4.20% kworker/47:1 [kernel.vmlinux] [k] cpufreq_notify_transition > 3.90% kworker/59:1 [kernel.vmlinux] [k] cpufreq_notify_transition > 3.90% kworker/7:2 [kernel.vmlinux] [k] cpufreq_notify_transition > ... I am worried by all of these.. So, its not the DVFS transition which took time, but cpufreq_notify_transition(). And probably one of the drivers in your setup is screwing up here, which has registered with cpufreq_register_notifier(). Can you please take a look at that ? Just check which all routines are getting called as part of srcu_notifier_call_chain(). Also a simple (hacky) solution to fix the problem you have got is to divide the range of frequency in steps (which you have already done AFAICT in one of your patches) and not mention the frequencies that were part of the deadband earlier. That will keep the CPU at either low freq or some higher freqs. But I am quite sure that we have an abuser here and we better find it now.
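For reference, the callbacks Viresh is referring to are ordinary notifier blocks registered with cpufreq_register_notifier(); srcu_notifier_call_chain() walks them from cpufreq_notify_transition(). A minimal transition notifier, the kind of callee to look for when checking that chain, looks roughly like this:

```c
/*
 * Minimal example of a cpufreq transition notifier.  A slow callback
 * registered like this is invoked from cpufreq_notify_transition() via
 * srcu_notifier_call_chain() and would show up as time spent in
 * cpufreq_notify_transition() in the profiles above.
 */
#include <linux/cpufreq.h>
#include <linux/module.h>
#include <linux/notifier.h>
#include <linux/printk.h>

static int freq_trans_cb(struct notifier_block *nb, unsigned long val,
			 void *data)
{
	struct cpufreq_freqs *freqs = data;

	if (val == CPUFREQ_POSTCHANGE)
		pr_debug("cpu%u: %u kHz -> %u kHz\n",
			 freqs->cpu, freqs->old, freqs->new);

	return NOTIFY_OK;
}

static struct notifier_block freq_trans_nb = {
	.notifier_call = freq_trans_cb,
};

static int __init freq_trans_init(void)
{
	return cpufreq_register_notifier(&freq_trans_nb,
					 CPUFREQ_TRANSITION_NOTIFIER);
}

static void __exit freq_trans_exit(void)
{
	cpufreq_unregister_notifier(&freq_trans_nb,
				    CPUFREQ_TRANSITION_NOTIFIER);
}

module_init(freq_trans_init);
module_exit(freq_trans_exit);
MODULE_LICENSE("GPL");
```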
On 22-09-16, 19:54, Andreas Herrmann wrote:
> On Mon, Sep 19, 2016 at 10:39:18PM +0300, Stratos Karafotis wrote:
> > Just an idea. I would split the frequency range (max_freq - min_freq)
> > into ~10 steps. But I'm not familiar with the affected systems and
> > of course I can't prove this is an ideal approach.
>
> I've modified the pcc-cpufreq-specific map_load_to_freq function to do
> just that (map load values to 10 discrete frequency values) instead of
> falling back to the deadband (pre-commit-6393d6a102 version).

And in that case you can remove the deadband frequencies from the table
to get almost the same results, as I already stated in an earlier
email.
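For concreteness, one way the "~10 discrete steps" idea could look inside
pcc-cpufreq is sketched below. This is only an illustration of the approach
discussed above, not the patch that was actually tested; the function name
pcc_quantize_target is invented.

#include <linux/cpufreq.h>

/*
 * Sketch: snap whatever target the governor asks for to one of roughly
 * ten evenly spaced frequencies between the policy limits, so that small
 * load variations no longer produce a distinct PCC request each time.
 */
static unsigned int pcc_quantize_target(struct cpufreq_policy *policy,
					unsigned int target_freq)
{
	unsigned int f_min = policy->cpuinfo.min_freq;
	unsigned int f_max = policy->cpuinfo.max_freq;
	unsigned int step = (f_max - f_min) / 10;
	unsigned int idx;

	if (!step || target_freq <= f_min)
		return f_min;
	if (target_freq >= f_max)
		return f_max;

	/* Round down to the nearest step boundary above f_min. */
	idx = (target_freq - f_min) / step;
	return f_min + idx * step;
}

Rounding down rather than to the nearest step is deliberate in this sketch:
requests landing in the lowest step collapse onto the minimum frequency, which
is close in effect to dropping the former deadband frequencies from the table,
as suggested above.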
On Wed, Oct 05, 2016 at 10:47:53AM +0530, Viresh Kumar wrote:
> Sorry for being late Andreas :(
>
> On 14-09-16, 16:56, Andreas Herrmann wrote:
>
> First of all, thanks for your hard work in getting these numbers out.
> Really appreciate it.
>
> > Below is some trace data. I hope it is of some help.
> >
> > (A) - sampling 10s period when system is idle
> > (B) - sampling 10s period when system partially loaded (kernel
> >       compilation using 2 jobs)
> >
> > (1) 4.8-rc5
> > (2) 4.8-rc5 with my patch (reintro of deadband effect within
> >     pcc-cpufreq)
> > (3) 4.8-rc5 with reversal of 6393d6a102 (cpufreq: ondemand: Eliminate
> >     the deadband effect)
> >
> > Let me know whether you are looking for other trace data wrt this
> > issue.
> >
> >
> > Thanks,
> >
> > Andreas
> >
> > ---
> >
> > (A)-(1)
> >
> > # Total Lost Samples: 0
> > # Samples: 41  of event 'power:cpu_frequency'
> > # Event count (approx.): 41
> > # Overhead  Command       Shared Object     Symbol
> > # ........  ............  ................  .............................
> >     39.02%  kworker/14:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
> >     29.27%  kworker/0:0   [kernel.vmlinux]  [k] cpufreq_notify_transition
> >     19.51%  kworker/10:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
> >      7.32%  kworker/5:2   [kernel.vmlinux]  [k] cpufreq_notify_transition
> >      2.44%  kworker/23:2  [kernel.vmlinux]  [k] cpufreq_notify_transition
> >      2.44%  kworker/40:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
> >
> > (A)-(2)
> >
> > # Total Lost Samples: 0
> > # Samples: 6  of event 'power:cpu_frequency'
> > # Event count (approx.): 6
> > # Overhead  Command       Shared Object     Symbol
> > # ........  ............  ................  .............................
> >     33.33%  kworker/1:2   [kernel.vmlinux]  [k] cpufreq_notify_transition
> >     16.67%  kworker/16:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
> >     16.67%  kworker/22:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
> >     16.67%  kworker/26:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
> >     16.67%  kworker/33:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
> >
> > (A)-(3)
> >
> > # Total Lost Samples: 0
> > # Samples: 7  of event 'power:cpu_frequency'
> > # Event count (approx.): 7
> > # Overhead  Command       Shared Object     Symbol
> > # ........  ............  ................  .............................
> >     28.57%  kworker/58:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
> >     14.29%  kworker/19:2  [kernel.vmlinux]  [k] cpufreq_notify_transition
> >     14.29%  kworker/20:2  [kernel.vmlinux]  [k] cpufreq_notify_transition
> >     14.29%  kworker/22:2  [kernel.vmlinux]  [k] cpufreq_notify_transition
> >     14.29%  kworker/23:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
> >     14.29%  kworker/35:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
> >
> > ---
> >
> > (B)-(1)
> >
> > # Total Lost Samples: 0
> > # Samples: 2K of event 'power:cpu_frequency'
> > # Event count (approx.): 2382
> > # Overhead  Command       Shared Object     Symbol
> > # ........  ............  ................  .............................
> >      5.75%  kworker/0:0   [kernel.vmlinux]  [k] cpufreq_notify_transition
> >      4.16%  kworker/12:2  [kernel.vmlinux]  [k] cpufreq_notify_transition
> >      3.11%  kworker/17:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
> >      2.94%  kworker/2:1   [kernel.vmlinux]  [k] cpufreq_notify_transition
> >      2.73%  kworker/19:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
> >      ...
> >
> > (B)-(2)
> >
> > # Total Lost Samples: 0
> > # Samples: 320  of event 'power:cpu_frequency'
> > # Event count (approx.): 320
> > # Overhead  Command       Shared Object     Symbol
> > # ........  ............  ................  .............................
> >      4.69%  kworker/56:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
> >      4.06%  kworker/12:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
> >      4.06%  kworker/28:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
> >      4.06%  kworker/6:2   [kernel.vmlinux]  [k] cpufreq_notify_transition
> >      3.75%  kworker/32:2  [kernel.vmlinux]  [k] cpufreq_notify_transition
> >      ...
> >
> > (B)-(3)
> >
> > # Total Lost Samples: 0
> > # Samples: 333  of event 'power:cpu_frequency'
> > # Event count (approx.): 333
> > # Overhead  Command       Shared Object     Symbol
> > # ........  ............  ................  .............................
> >      4.80%  kworker/51:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
> >      4.50%  kworker/39:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
> >      4.20%  kworker/47:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
> >      3.90%  kworker/59:1  [kernel.vmlinux]  [k] cpufreq_notify_transition
> >      3.90%  kworker/7:2   [kernel.vmlinux]  [k] cpufreq_notify_transition
> >      ...
>
> I am worried by all of these. So it's not the DVFS transition itself
> that takes the time, but cpufreq_notify_transition(). And probably one
> of the drivers in your setup that has registered with
> cpufreq_register_notifier() is misbehaving here.
>
> Can you please take a look at that? Just check which routines are
> getting called as part of srcu_notifier_call_chain().

I'll take a look at this.

> Also, a simple (hacky) solution to fix the problem you have got is to
> divide the frequency range into steps (which you have already done,
> AFAICT, in one of your patches) and not list the frequencies that were
> part of the deadband earlier. That will keep the CPU at either the low
> frequency or some higher frequencies.

Yes, this seems worth trying.

> But I am quite sure that we have an abuser here, and we had better
> find it now.


Thanks,

Andreas
diff --git a/drivers/cpufreq/pcc-cpufreq.c b/drivers/cpufreq/pcc-cpufreq.c
index 3f0ce2a..d2e2963 100644
--- a/drivers/cpufreq/pcc-cpufreq.c
+++ b/drivers/cpufreq/pcc-cpufreq.c
@@ -200,10 +200,26 @@ static int pcc_cpufreq_target(struct cpufreq_policy *policy,
 {
 	struct pcc_cpu *pcc_cpu_data;
 	struct cpufreq_freqs freqs;
+	static u32 limit = 0;
+
 	u16 status;
 	u32 input_buffer;
 	int cpu;
 
+	if (!limit) {
+		u32 f_min = policy->cpuinfo.min_freq / 1000;
+		u32 f_max = policy->cpuinfo.max_freq / 1000;
+		limit = (f_max - f_min) * f_min;
+		limit /= f_max;
+		limit *= 1000;
+		limit += f_min * 1000;
+		pr_debug("pcc-cpufreq: setting deadband limit to %u kHz\n",
+			 limit);
+	}
+
+	if (target_freq < limit)
+		target_freq = policy->min;
+
 	cpu = policy->cpu;
 	pcc_cpu_data = per_cpu_ptr(pcc_cpu_info, cpu);
 
@@ -214,6 +230,10 @@ static int pcc_cpufreq_target(struct cpufreq_policy *policy,
 
 	freqs.old = policy->cur;
 	freqs.new = target_freq;
+
+	if (freqs.new == freqs.old)
+		return 0;
+
 	cpufreq_freq_transition_begin(policy, &freqs);
 
 	spin_lock(&pcc_lock);
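As a rough sanity check of what the static 'limit' in the first hunk works out
to, here is the arithmetic with assumed policy bounds. The 2.80 GHz maximum
matches the reported test system, but its cpuinfo.min_freq is not stated in
this thread, so the 1.2 GHz minimum below is only a placeholder.

#include <stdio.h>

/*
 * Worked example of the 'limit' computation above, using assumed bounds
 * of 1.2 GHz (min) and 2.8 GHz (max).
 */
int main(void)
{
	unsigned int f_min = 1200000 / 1000;	/* cpuinfo.min_freq, kHz -> MHz */
	unsigned int f_max = 2800000 / 1000;	/* cpuinfo.max_freq, kHz -> MHz */
	unsigned int limit;

	limit  = (f_max - f_min) * f_min;	/* 1600 * 1200 = 1920000          */
	limit /= f_max;				/* 1920000 / 2800 = 685 (integer) */
	limit *= 1000;				/* back to kHz: 685000            */
	limit += f_min * 1000;			/* 685000 + 1200000 = 1885000     */

	printf("deadband limit: %u kHz\n", limit);	/* prints 1885000 */
	return 0;
}

With these assumed bounds, every ondemand request below roughly 1.9 GHz is
clamped to policy->min, i.e. about the lower 43% of the min-max range behaves
as a deadband again; the second hunk then skips the transition (and its
notifier round trip) entirely whenever the clamped target equals the current
frequency.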