Message ID: 20140607204114.25673.87194.stgit@srivatsabhat.in.ibm.com (mailing list archive)
State: Accepted, archived
On Sunday, June 08, 2014 02:11:43 AM Srivatsa S. Bhat wrote:
> Cpufreq governors like the ondemand governor calculate the load on the CPU
> periodically by employing deferrable timers. A deferrable timer won't fire
> if the CPU is completely idle (and there are no other timers to be run), in
> order to avoid unnecessary wakeups and thus save CPU power.
>
> However, the load calculation logic is agnostic to all this, and this can
> lead to the problem described below.
>
> Time (ms)    CPU 1
>
> 100          Task-A running
>
> 110          Governor's timer fires, finds load as 100% in the last
>              10ms interval and increases the CPU frequency.
>
> 110.5        Task-A running
>
> 120          Governor's timer fires, finds load as 100% in the last
>              10ms interval and increases the CPU frequency.
>
> 125          Task-A went to sleep. With nothing else to do, CPU 1
>              went completely idle.
>
> 200          Task-A woke up and started running again.
>
> 200.5        Governor's deferred timer (which was originally programmed
>              to fire at time 130) fires now. It calculates load for the
>              time period 120 to 200.5, and finds the load is almost zero.
>              Hence it decreases the CPU frequency to the minimum.
>
> 210          Governor's timer fires, finds load as 100% in the last
>              10ms interval and increases the CPU frequency.
>
> So, after the workload woke up and started running, the frequency was
> suddenly dropped to the absolute minimum, and after that there was an
> unnecessary delay of 10ms (one sampling period) before the CPU frequency
> was raised back to a reasonable value. This pattern repeats on every
> wake-up-from-cpu-idle for that workload, which can be quite undesirable
> for latency- or response-time-sensitive bursty workloads. So we need to
> fix the governor's logic to detect such wake-up-from-cpu-idle scenarios
> and start the workload at a reasonably high CPU frequency.
>
> One extreme solution would be to fake a load of 100% in such scenarios.
> But that might lead to undesirable side-effects such as frequency spikes
> (which might also need voltage changes), especially if the previous
> frequency happened to be very low.
>
> We just want to avoid dropping the frequency to the minimum and then
> enduring a needless (and long) delay before ramping it back up. So, let
> us simply carry forward the previous load - that is, let us just pretend
> that the 'load' for the current time-window is the same as the load for
> the previous window. That way, the frequency and voltage will continue to
> be set to whatever values they were set to previously. This means that
> bursty workloads will get a chance to influence the CPU frequency at
> which they wake up from cpu-idle, based on their past execution history.
> Thus, they might be able to avoid suffering from slow wakeups and long
> response-times.
>
> However, we should take care not to over-do this. For example, such a
> "copy previous load" logic will benefit cases like this (where #
> represents busy and . represents idle):
>
> ##########.........#########.........###########...........##########........
>
> but it will be detrimental in cases like the one shown below, because it
> will retain the high frequency (copied from the previous interval) even
> in a mostly idle system:
>
> ##########.........#.................#.....................#...............
>
> (i.e., the workload finished and the remaining tasks are such that their
> busy periods are smaller than the sampling interval, which causes the
> timer to always get deferred. So this will make the copy-previous-load
> logic copy the initial high load to subsequent idle periods over and over
> again, thus keeping the frequency high unnecessarily.)
>
> So, we modify the copy-previous-load logic such that it is used only once
> upon every wakeup-from-idle. Thus, if we have 2 consecutive idle periods,
> the previous load won't get blindly copied over; cpufreq will freshly
> evaluate the load in the second idle interval, thus ensuring that the
> system comes back to its normal state.
>
> [ The right way to solve this whole problem is to teach the CPU frequency
>   governors to also track load on a per-task basis, not just a per-CPU
>   basis, and then use both data sources intelligently to set the
>   appropriate frequency on the CPUs. But that involves redesigning the
>   cpufreq subsystem, so this patch should make the situation bearable
>   until then. ]
>
> Experimental results:
> +-------------------+
>
> I ran a modified version of ebizzy (called 'sleeping-ebizzy') that sleeps
> in between its execution such that its total utilization can be a
> user-defined value, say 10% or 20% (the higher the specified utilization,
> the fewer the sleeps injected). This ebizzy was run single-threaded,
> pinned to CPU 8.
>
> Behavior observed with tracing (sample taken from the 40% utilization runs):
> ----------------------------------------------------------------------------
>
> Without patch:
> ~~~~~~~~~~~~~~
>
> kworker/8:2-12137  416.335742: cpu_frequency: state=2061000 cpu_id=8
> kworker/8:2-12137  416.335744: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
>       <...>-40753  416.345741: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
> kworker/8:2-12137  416.345744: cpu_frequency: state=4123000 cpu_id=8
> kworker/8:2-12137  416.345746: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
>       <...>-40753  416.355738: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
> <snip> --------------------------------------------------------------------- <snip>
>       <...>-40753  416.402202: sched_switch: prev_comm=ebizzy ==> next_comm=swapper/8
>      <idle>-0      416.502130: sched_switch: prev_comm=swapper/8 ==> next_comm=ebizzy
>       <...>-40753  416.505738: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
> kworker/8:2-12137  416.505739: cpu_frequency: state=2061000 cpu_id=8
> kworker/8:2-12137  416.505741: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
>       <...>-40753  416.515739: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
> kworker/8:2-12137  416.515742: cpu_frequency: state=4123000 cpu_id=8
> kworker/8:2-12137  416.515744: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
>
> Observation: Ebizzy went idle at 416.402202 and started running again at
> 416.502130. But cpufreq noticed the long idle period and dropped the
> frequency at 416.505739, only to increase it back again at 416.515742,
> realizing that the workload is in fact CPU-bound. Thus, for almost 13
> milliseconds after waking up, ebizzy ran at an unnecessarily low frequency
> (including one full sample period at the minimum frequency), and this
> pattern repeats on every sleep-wakeup. This could hurt latency-sensitive
> workloads quite a lot.
>
> With patch:
> ~~~~~~~~~~~
>
> kworker/8:2-29802  464.832535: cpu_frequency: state=2061000 cpu_id=8
> <snip> --------------------------------------------------------------------- <snip>
> kworker/8:2-29802  464.962538: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
>       <...>-40738  464.972533: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
> kworker/8:2-29802  464.972536: cpu_frequency: state=4123000 cpu_id=8
> kworker/8:2-29802  464.972538: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
>       <...>-40738  464.982531: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
> <snip> --------------------------------------------------------------------- <snip>
> kworker/8:2-29802  465.022533: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
>       <...>-40738  465.032531: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
> kworker/8:2-29802  465.032532: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
>       <...>-40738  465.035797: sched_switch: prev_comm=ebizzy ==> next_comm=swapper/8
>      <idle>-0      465.240178: sched_switch: prev_comm=swapper/8 ==> next_comm=ebizzy
>       <...>-40738  465.242533: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
> kworker/8:2-29802  465.242535: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
>       <...>-40738  465.252531: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
>
> Observation: Ebizzy went idle at 465.035797 and started running again at
> 465.240178. Since ebizzy was the only real workload running on this CPU,
> cpufreq retained the frequency at 4.1 GHz throughout the run of ebizzy,
> no matter how many times ebizzy slept and woke up in between. Thus ebizzy
> got about 10ms worth of 4.1 GHz benefit on every sleep-wakeup (as compared
> to the run without the patch), and this boost gave a modest improvement in
> total throughput, as shown below.
>
> Sleeping-ebizzy records-per-second:
> -----------------------------------
>
> Utilization   Without patch   With patch   Difference (absolute and %)
>     10%            274767        277046        +  2279 (+0.829%)
>     20%            543429        553484        + 10055 (+1.850%)
>     40%           1090744       1107959        + 17215 (+1.578%)
>     60%           1634908       1662018        + 27110 (+1.658%)
>
> Even a rudimentary and only approximately latency-sensitive workload such
> as sleeping-ebizzy showed a consistent, noticeable performance improvement
> with this patch. Hence, workloads that are truly latency-sensitive should
> benefit quite a bit from this change. Moreover, this is an overall win-win,
> since this patch does not hurt power-savings at all (because it does not reduce
> the idle time or idle residency; and the high frequency of the CPU when it
> goes to cpu-idle does not affect/hurt the power-savings of deep idle states).
>
> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
> ---
>
> Changes in v3:
> * Modified the "copy-previous-load" logic to copy only once, upon the
>   first wakeup from idle, to fix the flaw pointed out by Pavel Machek.
>
> * Fixed the 64-bit division issue reported by Fengguang Wu's build robot.

Applied to bleeding-edge, thanks!
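To make the mechanism concrete, the decision described in the changelog can
be condensed into a small standalone C function. This is a simplified,
user-space illustration using the patch's variable names; the compute_load()
wrapper itself is hypothetical and not part of the kernel code:

	#include <stdbool.h>

	/*
	 * Windowed load calculation with a one-shot "copy previous load"
	 * heuristic: a wall time much larger than the sampling rate means
	 * the deferrable timer was deferred across an idle period, so the
	 * previous window's load is reused - exactly once.
	 */
	static unsigned int compute_load(unsigned int wall_time,
					 unsigned int idle_time,
					 unsigned int sampling_rate,
					 unsigned int *prev_load,
					 bool *copy_prev_load)
	{
		unsigned int load;

		if (wall_time > 2 * sampling_rate && *copy_prev_load) {
			/* Wake-up from idle: carry the old load forward. */
			load = *prev_load;
			*copy_prev_load = false;
		} else {
			/* Normal window: fresh load, re-arm the one-shot. */
			load = 100 * (wall_time - idle_time) / wall_time;
			*prev_load = load;
			*copy_prev_load = true;
		}
		return load;
	}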
> > the idle time or idle residency; and the high frequency of the CPU when it
> > goes to cpu-idle does not affect/hurt the power-savings of deep idle states).
> >
> > Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> > Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> > Acked-by: Viresh Kumar <viresh.kumar@linaro.org>

Acked-by: Pavel Machek <pavel@ucw.cz>

Pavel
On 8 June 2014 02:11, Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> wrote:
> diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c

> +		if (unlikely(wall_time > (2 * sampling_rate)) &&
> +		    j_cdbs->copy_prev_load) {
> +			load = j_cdbs->prev_load;
> +			j_cdbs->copy_prev_load = false;
> +		} else {
> +			load = 100 * (wall_time - idle_time) / wall_time;
> +			j_cdbs->prev_load = load;
> +			j_cdbs->copy_prev_load = true;
> +		}

Hmm, slight modifications over the weekend :)

What do you think about removing this extra variable and using prev_load
only, i.e. making it zero in the else part? Also, adding a comment for this
would be helpful. I will try a patch before you come to office :)
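A sketch of how that suggestion might look if applied to the hunk quoted
above: the copy_prev_load flag disappears, and prev_load == 0 itself serves
as the "already copied once" marker. (Illustrative only, not a quote of an
actual follow-up patch; zeroing at the point where the copy is consumed is
one possible reading of the suggestion.)

		if (unlikely(wall_time > (2 * sampling_rate)) &&
		    j_cdbs->prev_load) {
			load = j_cdbs->prev_load;
			/*
			 * Destructive copy: zero it after use, so that two
			 * consecutive long windows cannot both reuse it.
			 */
			j_cdbs->prev_load = 0;
		} else {
			load = 100 * (wall_time - idle_time) / wall_time;
			j_cdbs->prev_load = load;
		}

A side-effect of this encoding is that a genuine previous load of zero can
never be carried forward, which is harmless here: copying a zero load would
not have raised the frequency anyway.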
diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
index e1c6433..9004450 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -36,14 +36,29 @@ void dbs_check_cpu(struct dbs_data *dbs_data, int cpu)
 	struct od_dbs_tuners *od_tuners = dbs_data->tuners;
 	struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
 	struct cpufreq_policy *policy;
+	unsigned int sampling_rate;
 	unsigned int max_load = 0;
 	unsigned int ignore_nice;
 	unsigned int j;
 
-	if (dbs_data->cdata->governor == GOV_ONDEMAND)
+	if (dbs_data->cdata->governor == GOV_ONDEMAND) {
+		struct od_cpu_dbs_info_s *od_dbs_info =
+				dbs_data->cdata->get_cpu_dbs_info_s(cpu);
+
+		/*
+		 * Sometimes, the ondemand governor uses an additional
+		 * multiplier to give long delays. So apply this multiplier to
+		 * the 'sampling_rate', so as to keep the wake-up-from-idle
+		 * detection logic a bit conservative.
+		 */
+		sampling_rate = od_tuners->sampling_rate;
+		sampling_rate *= od_dbs_info->rate_mult;
+
 		ignore_nice = od_tuners->ignore_nice_load;
-	else
+	} else {
+		sampling_rate = cs_tuners->sampling_rate;
 		ignore_nice = cs_tuners->ignore_nice_load;
+	}
 
 	policy = cdbs->cur_policy;
 
@@ -96,7 +111,36 @@ void dbs_check_cpu(struct dbs_data *dbs_data, int cpu)
 		if (unlikely(!wall_time || wall_time < idle_time))
 			continue;
 
-		load = 100 * (wall_time - idle_time) / wall_time;
+		/*
+		 * If the CPU had gone completely idle, and a task just woke up
+		 * on this CPU now, it would be unfair to calculate 'load' the
+		 * usual way for this elapsed time-window, because it will show
+		 * near-zero load, irrespective of how CPU intensive that task
+		 * actually is. This is undesirable for latency-sensitive bursty
+		 * workloads.
+		 *
+		 * To avoid this, we reuse the 'load' from the previous
+		 * time-window and give this task a chance to start with a
+		 * reasonably high CPU frequency. (However, we shouldn't over-do
+		 * this copy, lest we get stuck at a high load (high frequency)
+		 * for too long, even when the current system load has actually
+		 * dropped down. So we perform the copy only once, upon the
+		 * first wake-up from idle.)
+		 *
+		 * Detecting this situation is easy: the governor's deferrable
+		 * timer would not have fired during CPU-idle periods. Hence
+		 * an unusually large 'wall_time' (as compared to the sampling
+		 * rate) indicates this scenario.
+		 */
+		if (unlikely(wall_time > (2 * sampling_rate)) &&
+		    j_cdbs->copy_prev_load) {
+			load = j_cdbs->prev_load;
+			j_cdbs->copy_prev_load = false;
+		} else {
+			load = 100 * (wall_time - idle_time) / wall_time;
+			j_cdbs->prev_load = load;
+			j_cdbs->copy_prev_load = true;
+		}
 
 		if (load > max_load)
 			max_load = load;
@@ -318,11 +362,19 @@ int cpufreq_governor_dbs(struct cpufreq_policy *policy,
 		for_each_cpu(j, policy->cpus) {
 			struct cpu_dbs_common_info *j_cdbs =
 				dbs_data->cdata->get_cpu_cdbs(j);
+			unsigned int prev_load;
 
 			j_cdbs->cpu = j;
 			j_cdbs->cur_policy = policy;
 			j_cdbs->prev_cpu_idle = get_cpu_idle_time(j,
					&j_cdbs->prev_cpu_wall, io_busy);
+
+			prev_load = (unsigned int)
+				(j_cdbs->prev_cpu_wall - j_cdbs->prev_cpu_idle);
+			j_cdbs->prev_load = 100 * prev_load /
+					(unsigned int) j_cdbs->prev_cpu_wall;
+			j_cdbs->copy_prev_load = true;
+
 			if (ignore_nice)
 				j_cdbs->prev_cpu_nice =
					kcpustat_cpu(j).cpustat[CPUTIME_NICE];
diff --git a/drivers/cpufreq/cpufreq_governor.h b/drivers/cpufreq/cpufreq_governor.h
index bfb9ae1..c2a5b7e 100644
--- a/drivers/cpufreq/cpufreq_governor.h
+++ b/drivers/cpufreq/cpufreq_governor.h
@@ -134,6 +134,12 @@ struct cpu_dbs_common_info {
 	u64 prev_cpu_idle;
 	u64 prev_cpu_wall;
 	u64 prev_cpu_nice;
+	unsigned int prev_load;
+	/*
+	 * Flag to ensure that we copy the previous load only once, upon the
+	 * first wake-up from idle.
+	 */
+	bool copy_prev_load;
 	struct cpufreq_policy *cur_policy;
 	struct delayed_work work;
 	/*
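A note on the 64-bit division issue mentioned in the v3 changelog: the
prev_load initialization above divides values derived from u64 counters
(prev_cpu_wall, prev_cpu_idle), and a plain '/' on u64 operands fails to
link on 32-bit kernels (the kernel provides no libgcc __udivdi3). The casts
to unsigned int in the hunk above side-step this. A hedged sketch of the
general-purpose alternative, using the kernel's do_div() helper from
<asm/div64.h> on the same fields (illustrative only; not what the patch
does):

	u64 busy = j_cdbs->prev_cpu_wall - j_cdbs->prev_cpu_idle;

	busy *= 100;
	/*
	 * do_div() divides the u64 dividend in place by a 32-bit divisor
	 * and returns the remainder; 'busy' becomes the quotient.
	 */
	do_div(busy, (unsigned int)j_cdbs->prev_cpu_wall);
	j_cdbs->prev_load = (unsigned int)busy;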