
cpuidle: Allow menu governor to enter deeper sleep states after some time

Message ID 4491b9f9-bc6f-f173-5816-c46a2f41de8e@tu-dresden.de (mailing list archive)
State RFC

Commit Message

Thomas Ilsche July 27, 2017, 12:50 p.m. UTC
On several contemporary Intel systems, we have observed that the idle
power consumption sometimes stays significantly too high for several
seconds, e.g. 105 W instead of 74 W.

We tracked this issue down to patterns that confuse the heuristic of the
menu idle governor. If the governor observes several consecutive short
sleeps, it expects the next sleep to be short as well, even though no
immediate timers are pending, and sends the CPU into a shallow sleep
state. On a dyntick-idle kernel, there is then no event that lets the
core switch to a more efficient sleep state if the prediction was wrong.
Ironically, this is particularly problematic on systems where some cores
have very infrequent activity, i.e. systems with many cores and a
configuration optimized for low idle power.
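For illustration, here is a minimal userspace sketch of such a
triggering pattern (the counts and durations are made up for
illustration and are not the exact sequence from the paper):

/*
 * Hypothetical trigger workload: a burst of very short sleeps trains
 * the menu governor to predict short idle; the following long idle
 * phase is then spent in a shallow C-state, because nothing wakes the
 * core up to revise that decision.
 */
#include <time.h>
#include <unistd.h>

int main(void)
{
	const struct timespec short_sleep = { 0, 50 * 1000 };	/* 50 us */
	int i;

	for (;;) {
		/* several consecutive short sleeps: the heuristic now
		 * expects the next sleep to be short as well */
		for (i = 0; i < 16; i++)
			nanosleep(&short_sleep, NULL);

		/* long idle phase: mispredicted, power stays high */
		sleep(10);
	}
}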

The effect, cause, triggers, mitigation technique and verification
thereof are described in detail in a paper that is pending
publication [1].

Fixing the heuristic to make a better prediction does not seem to be
generally feasible. The following patch addresses this issue by setting a
timer such that, when an expected immediate wake-up fails to appear, the
CPU is woken up to go into a deeper sleep state. If the heuristic was
right, the timer is simply canceled.

Please consider the patch a draft for discussion. We need to address:

  * Avoid programming the fallback timer when the deepest sleep state is
    already chosen.
  * Determine good default values for the introduced configurables. This
    is difficult due to the large variety of system configurations affected
    by the menu governor. Nevertheless we believe this can be done such
    that many systems benefit without significant degradation in any case.
  * Document (or remove) the sysfs entries.

But first, I would like to invite comments on whether this is going in
the right direction, or whether this should be addressed in a different
way.

[1] https://tu-dresden.de/zih/forschung/ressourcen/dateien/projekte/haec/powernightmares.pdf

Signed-off-by: Thomas Ilsche <thomas.ilsche@tu-dresden.de>
Signed-off-by: Marcus Hähnel <mhaehnel@os.inf.tu-dresden.de>
---

Comments

Len Brown Oct. 19, 2017, 7:46 a.m. UTC | #1
Thomas,

Thanks for doing this investigation and sending this note.   It was
very helpful that you boiled down the stimulus to a simple example
sequence.  I created an "idle-virus" using that sequence, and I can
reproduce your results on my hardware.

Indeed, it is possible to observe this issue by comparing a big idle
system in single-user mode vs. multi-user mode.  Simply run turbostat
in both scenarios and watch the C-states requested, the Pkg%pc6
achieved, and the RAPL DC power consumed.  The "background OS noise"
in multi-user mode is enough to trigger the issue you describe:
C1E is requested and Pkg%pc6 sometimes goes to zero, even when Busy%
is 0.

To prove that this is the result of OS C-state demotion, I have
implemented option "d)" in your paper.  I added a module param,
currently called "cpuidle.powersave", that causes cpuidle to bypass
the governor altogether and select the deepest state that is allowed
by PM_QOS.  (maybe I should call it cpuidle.bypass, or
cpuidle.deepest...)  When I set this parameter, multi-user C-state
residency and power equal those seen in single-user mode, with no
spikes.
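As a rough sketch of the idea (this is not the actual patch; the
helper below and its placement are assumptions), the selection boils
down to taking the deepest enabled state that still honours PM_QOS:

/*
 * Sketch only, not the real cpuidle.powersave implementation:
 * bypass the governor and pick the deepest enabled state whose
 * exit latency fits within the PM_QOS latency limit.
 */
#include <linux/cpuidle.h>
#include <linux/module.h>

static bool powersave __read_mostly;	/* assumed module parameter */
module_param(powersave, bool, 0444);

static int select_deepest_state(struct cpuidle_driver *drv,
				struct cpuidle_device *dev,
				int latency_req)
{
	int i, deepest = 0;

	for (i = 1; i < drv->state_count; i++) {
		struct cpuidle_state *s = &drv->states[i];

		if (s->disabled || dev->states_usage[i].disable)
			continue;
		if (s->exit_latency > latency_req)
			continue;
		deepest = i;	/* states are ordered shallow to deep */
	}
	return deepest;
}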

Note that the paper is not correct when it states that the OS cannot
enforce latency constraints when relying on HW C-state selection.  The
OS PM_QOS constraint is still effective in this path because it can
still limit the OS request, and the HW will never "promote" to a
deeper state than that requested by the OS -- the hardware can only
demote and then un-demote -- but never deeper than the OS request.

Indeed, on systems where PM_QOS is used properly (and it is working
correctly, a different story), and where the hardware is smart about
demoting C-states, one can make a case that the menu governor is not
needed at all.

Regarding your proposed patch.  I agree with the concept, but not the
implementation.  Rather than create a new timer, we should simply
decline to disable the OS tick when we predict a short idle on the
tickless-idle kernel.  That existing timer is a sufficient "safety
net" to ensure that the shallow C-state does not run too long.
Indeed, Aubrey Li has discovered that it is also a latency issue for
the OS to be disabling and then re-enabling the timer tick for short
idles, so this is a situation where if we do less work, we can help
both performance and power! :-)

Also, I think we should re-examine the menu governor's heuristic that
you pointed out, where it throws out some results -- maybe that isn't
such a great heuristic...

thanks,
-Len

ps. I strongly encourage you to universally use the words "shallow"
and "deep" to describe C-states, and never use the words "high" and
"low".  The former are never ambiguous.  The latter can get very
confusing, because a higher-latency C-state has lower power, and the
numbering of C-states gives lower-power C-states higher numbers...
Thomas Ilsche Oct. 20, 2017, 4:31 p.m. UTC | #2
Len,

thank you for the detailed response. Answers below.

> To prove that this is the result of OS C-state demotion, I have
> implemented option "d)" in your paper.  I added a module param,
> currently called "cpuidle.powersave", that causes cpuidle to bypass
> the governor altogether and select the deepest state that is allowed
> by PM_QOS.  (maybe I should call it cpuidle.bypass, or
> cpuidle.deepest...)  When I set this parameter, multi-user C-state
> residency and power equal those seen in single-user mode, with no
> spikes.

Can you please share the patch? I'd like to use it for some experiments.

> Note that the paper is not correct when it states that the OS cannot
> enforce latency constraints when relying on HW C-state selection.  The
> OS PM_QOS constraint is still effective in this path because it can
> still limit the OS request, and the HW will never "promote" to a
> deeper state than that requested by the OS -- the hardware can only
> demote and then un-demote -- but never deeper than the OS request.

Agreed. Since the only HW promotion is C1E promotion, which is AFAIK
disabled on Linux, in practice one would only use HW (un)demotion, with
which the OS can still enforce constraints.

> Indeed, on systems where PM_QOS is used properly (and it is working
> correctly, a different story), and where the hardware is smart about
> demoting C-states, one can make a case that the menu governor is not
> needed at all.

We haven't looked at auto-demotion in practice yet, but we could do
that. Do you think this is a serious option for Linux, given that it
means handing considerable control to opaque hardware?

There might also be some concern about different architectures.

> Regarding your proposed patch.  I agree with the concept, but not the
> implementation.  Rather than create a new timer, we should simply
> decline to disable the OS tick when we predict a short idle on the
> tickless-idle kernel.  That existing timer is a sufficient "safety
> net" to ensure that the shallow C-state does not run too long.
> Indeed, Aubrey Li has discovered that it is also a latency issue for
> the OS to be disabling and then re-enabling the timer tick for short
> idles, so this is a situation where if we do less work, we can help
> both performance and power! :-)

That sounds like a viable alternative. I would like to try that, but I
am unsure how to implement it. If I read the code correctly, the tick
timer is disabled in the outer idle loop in do_idle(), whereas the state
decision is made later in the code.

Deferring tick_nohz_idle_enter() would create some redundancy and I'm
not sure it is free of side effects. It might also complicate the calls
from the idle loop into the governor.
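One possible shape, purely as a sketch with hypothetical helper names
(the current code does not split the calls this way), would be to defer
the tick-stop until after the governor has made its prediction:

/*
 * Sketch of a restructured idle loop.  tick_nohz_idle_enter_nostop(),
 * tick_nohz_stop_tick_sketch() and predicted_interval_us() are
 * hypothetical helpers; drv/dev are obtained as in cpuidle_idle_call().
 * For a predicted short idle the still-armed tick acts as the safety
 * net Len describes; only a predicted long idle stops the tick.
 */
static void do_idle_sketch(void)
{
	tick_nohz_idle_enter_nostop();	/* nohz bookkeeping, tick stays on */

	while (!need_resched()) {
		int state = cpuidle_select(drv, dev);

		if (predicted_interval_us(state) > jiffies_to_usecs(1))
			tick_nohz_stop_tick_sketch();

		call_cpuidle(drv, dev, state);
	}

	tick_nohz_idle_exit();
}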

> Also, I think we should re-examine the menu governor's heuristic that
> you pointed out, where it throws out some results -- maybe that isn't
> such a great heuristic...

I agree, but
1) we need the mitigation anyway, because the heuristic will never be
   perfect.
2) I find it hard to optimize the heuristic with an incomplete
   understanding of the workload patterns and the latency optimization
   target. I assume there is a reason for some of the odd aspects of
   the heuristic, but I don't know it.

As a secondary quest, the outlier detection could for instance be
limited to below-median cases, as sketched below.
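To make that concrete, a hypothetical sketch (the helpers below are
not part of menu.c, and this is only one possible reading of the
idea): clamp the discard-the-maximum loop in get_typical_interval()
so it cannot push the estimate below the median of the recorded
intervals.

#include <linux/sort.h>
#include <linux/string.h>

static int cmp_uint(const void *a, const void *b)
{
	unsigned int x = *(const unsigned int *)a;
	unsigned int y = *(const unsigned int *)b;

	return (x > y) - (x < y);
}

/* median of the last INTERVALS observed sleep intervals */
static unsigned int interval_median(struct menu_device *data)
{
	unsigned int sorted[INTERVALS];

	memcpy(sorted, data->intervals, sizeof(sorted));
	sort(sorted, INTERVALS, sizeof(sorted[0]), cmp_uint, NULL);
	return sorted[INTERVALS / 2];
}

/* in get_typical_interval(), before retrying with thresh = max - 1: */
	if (max - 1 < interval_median(data))
		return avg;	/* do not discard below the median */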

Best,
Thomas
Doug Smythies Oct. 21, 2017, 2:27 p.m. UTC | #3
On 2017.10.20 09:32 Thomas Ilsche wrote:

>> To prove that this is the result of OS C-state demotion, I have
>> implemented option "d)" in your paper.  I added a module param,
>> currently called "cpuidle.powersave", that causes cpuidle to bypass
>> the governor altogether and select the deepest state that is allowed
>> by PM_QOS.  (maybe I should call it cpuidle.bypass, or
>> cpuidle.deepest...)  When I set this parameter, multi-user C-state
>> residency and power equal those seen in single-user mode, with no
>> spikes.
>
> Can you please share the patch, I'd like to use it for some experiments.

Can you not achieve the same conditions, or at least similar ones that
are good enough for testing, by disabling all the other idle states?

On my system, the max idle state is 4 (C6), and I just disable states
0-3 to always force state 4.

... Doug
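A minimal userspace sketch of that approach, assuming the standard
cpuidle sysfs layout (the CPU count and state numbering are
machine-specific):

/*
 * Force the deepest C-state by disabling all shallower states via
 * sysfs.  On the system described above, state 4 (C6) is the deepest,
 * so states 0-3 are disabled; adjust the counts for other machines.
 */
#include <stdio.h>

int main(void)
{
	char path[128];
	int cpu, state;

	for (cpu = 0; cpu < 8; cpu++) {			/* assumed 8 CPUs */
		for (state = 0; state < 4; state++) {	/* keep state 4 */
			FILE *f;

			snprintf(path, sizeof(path),
				 "/sys/devices/system/cpu/cpu%d/cpuidle/state%d/disable",
				 cpu, state);
			f = fopen(path, "w");
			if (!f)
				continue;	/* CPU or state absent */
			fputs("1\n", f);	/* 1 = disable this state */
			fclose(f);
		}
	}
	return 0;
}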

Patch

diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 61b64c2b2cb8..45bbeb362809 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -22,6 +22,7 @@ 
  #include <linux/sched/stat.h>
  #include <linux/math64.h>
  #include <linux/cpu.h>
+#include <linux/smp.h>
  
  /*
   * Please note when changing the tuning values:
@@ -130,6 +131,10 @@  struct menu_device {
  	unsigned int	correction_factor[BUCKETS];
  	unsigned int	intervals[INTERVALS];
  	int		interval_ptr;
+
+	struct hrtimer	fallback_timer;
+	int             have_timer;
+	unsigned int	disregard_past;
  };
  
  
@@ -190,6 +195,85 @@  static inline int performance_multiplier(unsigned long nr_iowaiters, unsigned lo
  
  static DEFINE_PER_CPU(struct menu_device, menu_devices);
  
+static unsigned int fallback_timer_disregard_past = 1;
+static unsigned int diff_threshold_bits = 8;
+static unsigned int fallback_timer_enabled;
+static unsigned int fallback_timer_interval_us = 10000;
+
+#define MENU_ATTR_RW(name, var, range_min, range_max, wfun) \
+	static ssize_t show_##name(struct device *dev, \
+				   struct device_attribute *attr, char *buf) \
+	{ \
+		return snprintf(buf, 12, "%i\n", var); \
+	} \
+	static ssize_t store_##name(struct device *dev, \
+				    struct device_attribute *attr, \
+				    const char *buf, size_t count) \
+	{ \
+		unsigned int tmp; \
+		int ret = kstrtouint(buf, 10, &tmp); \
+		if (ret) \
+			return -EINVAL; \
+		if (tmp > range_max || tmp < range_min) { \
+			pr_warn("value outside of valid range [%u, %u]\n", \
+				range_min, range_max); \
+			return -EINVAL; \
+		} \
+		var = tmp; \
+		wfun \
+		return count; \
+	} \
+	static DEVICE_ATTR(fallback_timer_##name, 0644, \
+			   show_##name, store_##name)
+
+MENU_ATTR_RW(threshold_bits, diff_threshold_bits, 0, 31, {});
+
+MENU_ATTR_RW(enable, fallback_timer_enabled, 0, 1, {
+	int i;
+
+	for_each_possible_cpu(i) {
+		struct menu_device *data = per_cpu_ptr(&menu_devices, i);
+
+		if (!fallback_timer_enabled) {
+			data->have_timer = 0;
+			hrtimer_cancel(&(data->fallback_timer));
+		}
+	} });
+
+MENU_ATTR_RW(interval_us, fallback_timer_interval_us, 1, 1000000, {});
+
+MENU_ATTR_RW(disregard_past, fallback_timer_disregard_past, 0, 1, {
+	int i;
+
+	for_each_possible_cpu(i) {
+		struct menu_device *data = per_cpu_ptr(&menu_devices, i);
+
+		data->disregard_past = 0;
+	} });
+
+static struct attribute *menu_attrs[] = {
+	&dev_attr_fallback_timer_threshold_bits.attr,
+	&dev_attr_fallback_timer_enable.attr,
+	&dev_attr_fallback_timer_interval_us.attr,
+	&dev_attr_fallback_timer_disregard_past.attr,
+	NULL
+};
+
+static struct attribute_group menu_attr_group = {
+	.attrs = menu_attrs,
+	.name = "cpuidle_menu",
+};
+
+int menu_add_interface(struct device *dev)
+{
+	return sysfs_create_group(&dev->kobj, &menu_attr_group);
+}
+
+void menu_remove_interface(struct device *dev)
+{
+	sysfs_remove_group(&dev->kobj, &menu_attr_group);
+}
+
  static void menu_update(struct cpuidle_driver *drv, struct cpuidle_device *dev);
  
  /*
@@ -275,6 +359,14 @@  static unsigned int get_typical_interval(struct menu_device *data)
  	goto again;
  }
  
+static enum hrtimer_restart fallback_timer_fun(struct hrtimer *tmr)
+{
+	struct menu_device *mdata = this_cpu_ptr(&menu_devices);
+
+	mdata->disregard_past = 1;
+	return HRTIMER_NORESTART;
+}
+
  /**
   * menu_select - selects the next idle state to enter
   * @drv: cpuidle driver containing state data
@@ -293,6 +385,11 @@  static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
  	unsigned long nr_iowaiters, cpu_load;
  	int resume_latency = dev_pm_qos_raw_read_value(device);
  
+	if (fallback_timer_enabled && data->have_timer) {
+		data->have_timer = 0;
+		hrtimer_cancel(&(data->fallback_timer));
+	}
+
  	if (data->needs_update) {
  		menu_update(drv, dev);
  		data->needs_update = 0;
@@ -322,7 +419,32 @@  static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
  					 RESOLUTION * DECAY);
  
  	expected_interval = get_typical_interval(data);
-	expected_interval = min(expected_interval, data->next_timer_us);
+
+	if (fallback_timer_enabled && fallback_timer_disregard_past
+	    && data->disregard_past) {
+		expected_interval = data->next_timer_us;
+		/* Only disregard the past once, then try again. */
+		data->disregard_past = 0;
+	} else {
+		if (fallback_timer_enabled
+		    && expected_interval < (data->next_timer_us >> diff_threshold_bits)
+		    && data->next_timer_us > fallback_timer_interval_us * 2) {
+			/*
+			 * Program the fallback timer if the interval
+			 * expected by the heuristic and the next regular
+			 * timer are too far apart.  However, only do this
+			 * when we didn't just wake up from such a timer
+			 * and are told to disregard the heuristic.
+			 */
+			ktime_t interval =
+				ktime_set(0, fallback_timer_interval_us * 1000);
+
+			hrtimer_start(&(data->fallback_timer), interval,
+				      HRTIMER_MODE_REL_PINNED);
+			data->have_timer = 1;
+		}
+		expected_interval = min(expected_interval, data->next_timer_us);
+	}
  
  	if (CPUIDLE_DRIVER_STATE_START > 0) {
  		struct cpuidle_state *s = &drv->states[CPUIDLE_DRIVER_STATE_START];
@@ -398,6 +520,11 @@  static void menu_reflect(struct cpuidle_device *dev, int index)
  {
  	struct menu_device *data = this_cpu_ptr(&menu_devices);
  
+	if (fallback_timer_enabled && data->have_timer) {
+		data->have_timer = 0;
+		hrtimer_cancel(&data->fallback_timer);
+	}
+
  	data->last_state_idx = index;
  	data->needs_update = 1;
  }
@@ -486,6 +613,10 @@  static int menu_enable_device(struct cpuidle_driver *drv,
  
  	memset(data, 0, sizeof(struct menu_device));
  
+	hrtimer_init(&(data->fallback_timer),
+		     CLOCK_REALTIME, HRTIMER_MODE_REL_PINNED);
+	data->fallback_timer.function = fallback_timer_fun;
+
  	/*
  	 * if the correction factor is 0 (eg first time init or cpu hotplug
  	 * etc), we actually want to start out with a unity factor.
@@ -509,6 +640,10 @@  static struct cpuidle_governor menu_governor = {
   */
  static int __init init_menu(void)
  {
+	int ret = menu_add_interface(cpu_subsys.dev_root);
+
+	if (ret)
+		return ret;
  	return cpuidle_register_governor(&menu_governor);
  }