diff mbox

cpufreq: governor: Be friendly towards latency-sensitive bursty workloads

Message ID 538D9631.9090500@linux.vnet.ibm.com (mailing list archive)
State Superseded, archived
Headers show

Commit Message

Srivatsa S. Bhat June 3, 2014, 9:32 a.m. UTC
On 06/03/2014 01:48 PM, Viresh Kumar wrote:
> On 27 May 2014 02:23, Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> wrote:
> 
> Looks fine, some nits..
> 
>> diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
>> -void dbs_check_cpu(struct dbs_data *dbs_data, int cpu)
>> +void dbs_check_cpu(struct dbs_data *dbs_data, int cpu,
>> +                  unsigned int sampling_rate)
> 
> We don't need to pass a new argument, we can get all the information from
> dbs_data alone. Its already done for multiple routines. Let me know if you
> find it difficult to figure out..
> 

Sure, that would be a good improvement. Does something like the patch below
look good? I have only compile-tested it. I'll send out the patch with changelog
once I finish testing it.

Thank you!

Regards,
Srivatsa S. Bhat

----------------------------------------------------------------------------


--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Viresh Kumar June 3, 2014, 9:39 a.m. UTC | #1
On 3 June 2014 15:02, Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> wrote:
> diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
> index e1c6433..3e8588f 100644
> --- a/drivers/cpufreq/cpufreq_governor.c
> +++ b/drivers/cpufreq/cpufreq_governor.c
> @@ -36,14 +36,29 @@ void dbs_check_cpu(struct dbs_data *dbs_data, int cpu)
>         struct od_dbs_tuners *od_tuners = dbs_data->tuners;
>         struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
>         struct cpufreq_policy *policy;
> +       unsigned int sampling_rate;
>         unsigned int max_load = 0;
>         unsigned int ignore_nice;
>         unsigned int j;
>
> -       if (dbs_data->cdata->governor == GOV_ONDEMAND)
> +       if (dbs_data->cdata->governor == GOV_ONDEMAND) {
> +               struct od_cpu_dbs_info_s *od_dbs_info;
> +
> +               /*
> +                * Sometimes, the ondemand governor uses an additional
> +                * multiplier to give long delays. So apply this multiplier to
> +                * the 'sampling_rate', so as to keep the wake-up-from-idle
> +                * detection logic a bit conservative.
> +                */
> +               sampling_rate = od_tuners->sampling_rate;
> +               od_dbs_info = dbs_data->cdata->get_cpu_dbs_info_s(cpu);

Probably do both above right after definition of od_dbs_info or merge
above with it. and just keep below after the comment ?

> +               sampling_rate *= od_dbs_info->rate_mult;
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Srivatsa S. Bhat June 3, 2014, 10:04 a.m. UTC | #2
On 06/03/2014 03:09 PM, Viresh Kumar wrote:
> On 3 June 2014 15:02, Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>> diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
>> index e1c6433..3e8588f 100644
>> --- a/drivers/cpufreq/cpufreq_governor.c
>> +++ b/drivers/cpufreq/cpufreq_governor.c
>> @@ -36,14 +36,29 @@ void dbs_check_cpu(struct dbs_data *dbs_data, int cpu)
>>         struct od_dbs_tuners *od_tuners = dbs_data->tuners;
>>         struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
>>         struct cpufreq_policy *policy;
>> +       unsigned int sampling_rate;
>>         unsigned int max_load = 0;
>>         unsigned int ignore_nice;
>>         unsigned int j;
>>
>> -       if (dbs_data->cdata->governor == GOV_ONDEMAND)
>> +       if (dbs_data->cdata->governor == GOV_ONDEMAND) {
>> +               struct od_cpu_dbs_info_s *od_dbs_info;
>> +
>> +               /*
>> +                * Sometimes, the ondemand governor uses an additional
>> +                * multiplier to give long delays. So apply this multiplier to
>> +                * the 'sampling_rate', so as to keep the wake-up-from-idle
>> +                * detection logic a bit conservative.
>> +                */
>> +               sampling_rate = od_tuners->sampling_rate;
>> +               od_dbs_info = dbs_data->cdata->get_cpu_dbs_info_s(cpu);
> 
> Probably do both above right after definition of od_dbs_info or merge
> above with it. and just keep below after the comment ?
> 
>> +               sampling_rate *= od_dbs_info->rate_mult;
> 

Well, the method I used keeps the organization such that the code following
the comment does precisely what the comment says (i.e, get the sampling_rate,
fetch the multiplier, and then multiply). So I feel it makes it easier to
understand.

Regards,
Srivatsa S. Bhat

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Viresh Kumar June 3, 2014, 10:08 a.m. UTC | #3
On 3 June 2014 15:34, Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> wrote:
> Well, the method I used keeps the organization such that the code following
> the comment does precisely what the comment says (i.e, get the sampling_rate,
> fetch the multiplier, and then multiply). So I feel it makes it easier to
> understand.

It looked like the comment is there only for this special statement:

>>> +               sampling_rate *= od_dbs_info->rate_mult;

And so suggested that :)

Anyway move this up as it doesn't belong to comment for sure.
>> +               od_dbs_info = dbs_data->cdata->get_cpu_dbs_info_s(cpu);
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
index e1c6433..3e8588f 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -36,14 +36,29 @@  void dbs_check_cpu(struct dbs_data *dbs_data, int cpu)
 	struct od_dbs_tuners *od_tuners = dbs_data->tuners;
 	struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
 	struct cpufreq_policy *policy;
+	unsigned int sampling_rate;
 	unsigned int max_load = 0;
 	unsigned int ignore_nice;
 	unsigned int j;
 
-	if (dbs_data->cdata->governor == GOV_ONDEMAND)
+	if (dbs_data->cdata->governor == GOV_ONDEMAND) {
+		struct od_cpu_dbs_info_s *od_dbs_info;
+
+		/*
+		 * Sometimes, the ondemand governor uses an additional
+		 * multiplier to give long delays. So apply this multiplier to
+		 * the 'sampling_rate', so as to keep the wake-up-from-idle
+		 * detection logic a bit conservative.
+		 */
+		sampling_rate = od_tuners->sampling_rate;
+		od_dbs_info = dbs_data->cdata->get_cpu_dbs_info_s(cpu);
+		sampling_rate *= od_dbs_info->rate_mult;
+
 		ignore_nice = od_tuners->ignore_nice_load;
-	else
+	} else {
+		sampling_rate = cs_tuners->sampling_rate;
 		ignore_nice = cs_tuners->ignore_nice_load;
+	}
 
 	policy = cdbs->cur_policy;
 
@@ -96,7 +111,29 @@  void dbs_check_cpu(struct dbs_data *dbs_data, int cpu)
 		if (unlikely(!wall_time || wall_time < idle_time))
 			continue;
 
-		load = 100 * (wall_time - idle_time) / wall_time;
+		/*
+		 * If the CPU had gone completely idle, and a task just woke up
+		 * on this CPU now, it would be unfair to calculate 'load' the
+		 * usual way for this elapsed time-window, because it will show
+		 * near-zero load, irrespective of how CPU intensive the new
+		 * task is. This is undesirable for latency-sensitive bursty
+		 * workloads.
+		 *
+		 * To avoid this, we reuse the 'load' from the previous
+		 * time-window and give this task a chance to start with a
+		 * reasonably high CPU frequency.
+		 *
+		 * Detecting this situation is easy: the governor's deferrable
+		 * timer would not have fired during CPU-idle periods. Hence
+		 * an unusually large 'wall_time' (as compared to the sampling
+		 * rate) indicates this scenario.
+		 */
+		if (unlikely(wall_time > (2 * sampling_rate))) {
+			load = j_cdbs->prev_load;
+		} else {
+			load = 100 * (wall_time - idle_time) / wall_time;
+			j_cdbs->prev_load = load;
+		}
 
 		if (load > max_load)
 			max_load = load;
@@ -323,6 +360,10 @@  int cpufreq_governor_dbs(struct cpufreq_policy *policy,
 			j_cdbs->cur_policy = policy;
 			j_cdbs->prev_cpu_idle = get_cpu_idle_time(j,
 					       &j_cdbs->prev_cpu_wall, io_busy);
+			j_cdbs->prev_load = 100 * (j_cdbs->prev_cpu_wall -
+						   j_cdbs->prev_cpu_idle) /
+						   j_cdbs->prev_cpu_wall;
+
 			if (ignore_nice)
 				j_cdbs->prev_cpu_nice =
 					kcpustat_cpu(j).cpustat[CPUTIME_NICE];
diff --git a/drivers/cpufreq/cpufreq_governor.h b/drivers/cpufreq/cpufreq_governor.h
index bfb9ae1..b56552b 100644
--- a/drivers/cpufreq/cpufreq_governor.h
+++ b/drivers/cpufreq/cpufreq_governor.h
@@ -134,6 +134,7 @@  struct cpu_dbs_common_info {
 	u64 prev_cpu_idle;
 	u64 prev_cpu_wall;
 	u64 prev_cpu_nice;
+	unsigned int prev_load;
 	struct cpufreq_policy *cur_policy;
 	struct delayed_work work;
 	/*