Message ID | 4491b9f9-bc6f-f173-5816-c46a2f41de8e@tu-dresden.de (mailing list archive) |
---|---|
State | RFC |
Thomas,

Thanks for doing this investigation and sending this note. It was very helpful that you boiled down the stimulus to a simple example sequence. I created an "idle-virus" using that sequence, and I can reproduce your results on my hardware.

Indeed, it is actually possible to observe this issue by comparing a big idle system in single-user mode vs. multi-user mode. Simply run turbostat in both scenarios and watch the C-states requested, the Pkg%pc6 achieved, and the RAPL DC power consumed. The "background OS noise" in multi-user mode is enough to trigger the issue you describe: C1E is requested, and Pkg%pc6 sometimes goes to zero, even when Busy% is 0.

To prove that this is the result of OS C-state demotion, I have implemented option "d)" in your paper. I added a module param, currently called "cpuidle.powersave", that causes cpuidle to bypass the governor altogether and select the deepest state that is allowed by PM_QOS. (Maybe I should call it cpuidle.bypass, or cpuidle.deepest...) When I set this parameter, multi-user C-state residency and power equal those seen in single-user mode, with no spikes.

Note that the paper is not correct when it states that the OS cannot enforce latency constraints when relying on HW C-state selection. The OS PM_QOS constraint is still effective in this path, because it still limits the OS request, and the HW will never "promote" to a deeper state than the one requested by the OS -- the hardware can only demote and then un-demote, never going deeper than the OS request. Indeed, on systems where PM_QOS is used properly (and is working correctly -- a different story), and where the hardware is smart about demoting C-states, one can make a case that the menu governor is not needed at all.

Regarding your proposed patch: I agree with the concept, but not the implementation. Rather than create a new timer, we should simply decline to disable the OS tick when we predict a short idle on a tickless-idle kernel. That existing tick timer is a sufficient "safety net" to assure that the shallow C-state does not run too long. Indeed, Aubrey Li has discovered that disabling and then re-enabling the timer tick for short idles is itself a latency issue for the OS, so this is a situation where doing less work can help both performance and power! :-)

Also, I think we should re-examine the menu governor's heuristic that you pointed out, where it throws out some results -- maybe that isn't such a great heuristic...

thanks,
-Len

ps. I strongly encourage you to universally use the words "shallow" and "deep" to describe C-states, and never the words "high" and "low". The former are never ambiguous. The latter can get very confusing, because a higher-latency C-state has lower power, and the numbering of C-states gives lower-power C-states higher numbers...
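Len's module-param patch itself is not included in this message, but option "d)" is easy to picture: bypassing the governor means walking the driver's state table and returning the deepest enabled state whose exit latency still satisfies the PM_QOS limit. The following is only an illustrative sketch against the 4.14-era cpuidle structures -- the function name and its hook point are assumptions, not Len's actual code:

static int cpuidle_select_deepest_sketch(struct cpuidle_driver *drv,
                                         struct cpuidle_device *dev)
{
        /* current global PM_QOS limit on wakeup latency, in usec */
        s32 latency_req = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);
        int i, deepest = CPUIDLE_DRIVER_STATE_START;

        for (i = CPUIDLE_DRIVER_STATE_START; i < drv->state_count; i++) {
                struct cpuidle_state *s = &drv->states[i];

                if (s->disabled || dev->states_usage[i].disable)
                        continue;
                /* states are ordered shallow to deep; stop at the QoS bound */
                if (s->exit_latency > latency_req)
                        break;
                deepest = i;
        }
        return deepest;
}

Note that this keeps the PM_QOS bound intact in both directions: the OS never requests deeper than latency_req, and, as described above, the hardware can only demote from there, never promote past it.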
Len,

thank you for the detailed response. Answers below.

> To prove that this is the result of OS C-state demotion, I have
> implemented option "d)" in your paper. I added a module param,
> currently called "cpuidle.powersave", that causes cpuidle to bypass
> the governor altogether and select the deepest state that is allowed
> by PM_QOS. (Maybe I should call it cpuidle.bypass, or
> cpuidle.deepest...) When I set this parameter, multi-user C-state
> residency and power equal those seen in single-user mode, with no
> spikes.

Can you please share the patch? I'd like to use it for some experiments.

> Note that the paper is not correct when it states that the OS cannot
> enforce latency constraints when relying on HW C-state selection. The
> OS PM_QOS constraint is still effective in this path, because it
> still limits the OS request, and the HW will never "promote" to a
> deeper state than the one requested by the OS -- the hardware can
> only demote and then un-demote, never going deeper than the OS
> request.

Agreed. Since C1E is the only HW promotion, and that is AFAIK disabled on Linux, in practice one would only use HW (un)demotion, with which the OS can still enforce constraints.

> Indeed, on systems where PM_QOS is used properly (and is working
> correctly -- a different story), and where the hardware is smart
> about demoting C-states, one can make a case that the menu governor
> is not needed at all.

We haven't looked at auto-demotion in practice yet, but could do that. Do you think this is a serious option for Linux, given that it means handing quite some control to opaque hardware? There might also be some concern about different architectures.

> Regarding your proposed patch: I agree with the concept, but not the
> implementation. Rather than create a new timer, we should simply
> decline to disable the OS tick when we predict a short idle on a
> tickless-idle kernel. That existing tick timer is a sufficient
> "safety net" to assure that the shallow C-state does not run too
> long. Indeed, Aubrey Li has discovered that disabling and then
> re-enabling the timer tick for short idles is itself a latency issue
> for the OS, so this is a situation where doing less work can help
> both performance and power! :-)

That sounds like a viable alternative. I would like to try it, but I am unsure how to implement it. If I read the code correctly, the tick timer is disabled in the outer idle loop in do_idle(), whereas the state decision is made later in the call chain. Deferring tick_nohz_idle_enter() would create some redundancy, and I am not sure it is free of side effects. It might also complicate the calls from the idle loop into the governor.

> Also, I think we should re-examine the menu governor's heuristic that
> you pointed out, where it throws out some results -- maybe that isn't
> such a great heuristic...

I agree, but 1) we need the mitigation anyway, because the heuristic will never be perfect, and 2) I find it hard to optimize the heuristic with an incomplete understanding of the workload patterns and the latency optimization target. I assume there are reasons for some of the odd aspects of the heuristic, but I don't know them. As a secondary quest, the outlier detection could for instance be limited to below-median cases.

Best,
Thomas
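For reference, the restructuring under discussion would amount to letting the governor decide about the tick: cpuidle_select() would report whether stopping the tick is worthwhile, and do_idle() would act on that instead of stopping it unconditionally in tick_nohz_idle_enter(). A hypothetical sketch of that shape -- neither the stop_tick out-parameter nor tick_nohz_idle_stop_tick() exist in the 4.14 code being discussed here:

/*
 * Hypothetical idle-loop fragment: ask the governor first, and only
 * stop the tick when a long idle is predicted. Otherwise the still-
 * armed tick is the safety net that bounds how long a shallow state
 * can run.
 */
static void do_idle_fragment(struct cpuidle_driver *drv,
                             struct cpuidle_device *dev)
{
        bool stop_tick = true;
        int state;

        /* governor would also predict whether the idle will be short */
        state = cpuidle_select(drv, dev, &stop_tick);

        if (stop_tick)
                tick_nohz_idle_stop_tick();

        cpuidle_enter(drv, dev, state);
}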
On 2017.10.20 09:32 Thomas Ilsche wrote:

>> To prove that this is the result of OS C-state demotion, I have
>> implemented option "d)" in your paper. I added a module param,
>> currently called "cpuidle.powersave", that causes cpuidle to bypass
>> the governor altogether and select the deepest state that is allowed
>> by PM_QOS. (Maybe I should call it cpuidle.bypass, or
>> cpuidle.deepest...) When I set this parameter, multi-user C-state
>> residency and power equal those seen in single-user mode, with no
>> spikes.
>
> Can you please share the patch? I'd like to use it for some experiments.

Can you not achieve the same conditions, or at least similar ones that are good enough for testing, by disabling all the other idle states? On my system, the max idle state is 4 (C6), and I just disable states 0-3 to force state 4 at all times.

... Doug
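The per-state knob Doug is using is the sysfs "disable" attribute, one file per CPU and per state. A minimal userspace sketch along those lines, forcing cpu0 into its deepest state by disabling states 0-3 (the state count is machine-specific, and the writes must be repeated for every CPU):

/* Disable cpuidle states 0..3 for cpu0 via sysfs; run as root. */
#include <stdio.h>

int main(void)
{
        char path[96];
        int state;

        for (state = 0; state <= 3; state++) {
                snprintf(path, sizeof(path),
                         "/sys/devices/system/cpu/cpu0/cpuidle/state%d/disable",
                         state);

                FILE *f = fopen(path, "w");
                if (!f) {
                        perror(path);
                        continue;
                }
                fputs("1\n", f);
                fclose(f);
        }
        return 0;
}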
diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 61b64c2b2cb8..45bbeb362809 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -22,6 +22,7 @@
 #include <linux/sched/stat.h>
 #include <linux/math64.h>
 #include <linux/cpu.h>
+#include <linux/smp.h>
 
 /*
  * Please note when changing the tuning values:
@@ -130,6 +131,10 @@ struct menu_device {
 	unsigned int	correction_factor[BUCKETS];
 	unsigned int	intervals[INTERVALS];
 	int		interval_ptr;
+
+	struct hrtimer fallback_timer;
+	int have_timer;
+	unsigned int disregard_past;
 };
 
@@ -190,6 +195,85 @@ static inline int performance_multiplier(unsigned long nr_iowaiters, unsigned lo
 
 static DEFINE_PER_CPU(struct menu_device, menu_devices);
 
+static unsigned int fallback_timer_disregard_past = 1;
+static unsigned int diff_threshold_bits = 8;
+static unsigned int fallback_timer_enabled;
+static unsigned int fallback_timer_interval_us = 10000;
+
+#define MENU_ATTR_RW(name, var, range_min, range_max, wfun) \
+	static ssize_t show_##name(struct device *dev, \
+				   struct device_attribute *attr, char *buf) \
+	{ \
+		return snprintf(buf, 12, "%i\n", var); \
+	} \
+	static ssize_t store_##name(struct device *dev, \
+				    struct device_attribute *attr, \
+				    const char *buf, size_t count) \
+	{ \
+		unsigned int tmp; \
+		int ret = kstrtouint(buf, 10, &tmp); \
+		if (ret) \
+			return -EINVAL; \
+		if (tmp > range_max || tmp < range_min) { \
+			pr_warn("value outside of valid range [%u, %u]\n", \
+				range_min, range_max); \
+			return -EINVAL; \
+		} \
+		var = tmp; \
+		wfun \
+		return count; \
+	} \
+	static DEVICE_ATTR(fallback_timer_##name, 0644, \
+			   show_##name, store_##name)
+
+MENU_ATTR_RW(threshold_bits, diff_threshold_bits, 0, 31, {});
+
+MENU_ATTR_RW(enable, fallback_timer_enabled, 0, 1, {
+	int i;
+
+	for_each_possible_cpu(i) {
+		struct menu_device *data = per_cpu_ptr(&menu_devices, i);
+
+		if (!fallback_timer_enabled) {
+			data->have_timer = 0;
+			hrtimer_cancel(&(data->fallback_timer));
+		}
+	} });
+
+MENU_ATTR_RW(interval_us, fallback_timer_interval_us, 1, 1000000, {});
+
+MENU_ATTR_RW(disregard_past, fallback_timer_disregard_past, 0, 1, {
+	int i;
+
+	for_each_possible_cpu(i) {
+		struct menu_device *data = per_cpu_ptr(&menu_devices, i);
+
+		data->disregard_past = 0;
+	} });
+
+static struct attribute *menu_attrs[] = {
+	&dev_attr_fallback_timer_threshold_bits.attr,
+	&dev_attr_fallback_timer_enable.attr,
+	&dev_attr_fallback_timer_interval_us.attr,
+	&dev_attr_fallback_timer_disregard_past.attr,
+	NULL
+};
+
+static struct attribute_group menu_attr_group = {
+	.attrs = menu_attrs,
+	.name = "cpuidle_menu",
+};
+
+int menu_add_interface(struct device *dev)
+{
+	return sysfs_create_group(&dev->kobj, &menu_attr_group);
+}
+
+void menu_remove_interface(struct device *dev)
+{
+	sysfs_remove_group(&dev->kobj, &menu_attr_group);
+}
+
 static void menu_update(struct cpuidle_driver *drv, struct cpuidle_device *dev);
 
 /*
@@ -275,6 +359,14 @@ static unsigned int get_typical_interval(struct menu_device *data)
 	goto again;
 }
 
+static enum hrtimer_restart fallback_timer_fun(struct hrtimer *tmr)
+{
+	struct menu_device *mdata = this_cpu_ptr(&menu_devices);
+
+	mdata->disregard_past = 1;
+	return HRTIMER_NORESTART;
+}
+
 /**
  * menu_select - selects the next idle state to enter
  * @drv: cpuidle driver containing state data
@@ -293,6 +385,11 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 	unsigned long nr_iowaiters, cpu_load;
 	int resume_latency = dev_pm_qos_raw_read_value(device);
 
+	if (fallback_timer_enabled && data->have_timer) {
+		data->have_timer = 0;
+		hrtimer_cancel(&(data->fallback_timer));
+	}
+
 	if (data->needs_update) {
 		menu_update(drv, dev);
 		data->needs_update = 0;
@@ -322,7 +419,32 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 					 RESOLUTION * DECAY);
 
 	expected_interval = get_typical_interval(data);
-	expected_interval = min(expected_interval, data->next_timer_us);
+
+	if (fallback_timer_enabled && fallback_timer_disregard_past
+	    && data->disregard_past) {
+		expected_interval = data->next_timer_us;
+		/* Only disregard the past once! Then try again */
+		data->disregard_past = 0;
+	} else {
+		if (fallback_timer_enabled
+		    && expected_interval < (data->next_timer_us >> diff_threshold_bits)
+		    && data->next_timer_us > fallback_timer_interval_us * 2) {
+			/*
+			 * Program the fallback timer if the gap between the
+			 * expected interval by heuristic and the next regular
+			 * timer are too far apart.
+			 * However, only do this when we didn't just wake up
+			 * from a timer and are told to disregard the heuristic
+			 */
+			ktime_t interval =
+				ktime_set(0, fallback_timer_interval_us * 1000);
+
+			hrtimer_start(&(data->fallback_timer), interval,
+				      HRTIMER_MODE_REL_PINNED);
+			data->have_timer = 1;
+		}
+		expected_interval = min(expected_interval, data->next_timer_us);
+	}
 
 	if (CPUIDLE_DRIVER_STATE_START > 0) {
 		struct cpuidle_state *s = &drv->states[CPUIDLE_DRIVER_STATE_START];
@@ -398,6 +520,11 @@ static void menu_reflect(struct cpuidle_device *dev, int index)
 {
 	struct menu_device *data = this_cpu_ptr(&menu_devices);
 
+	if (fallback_timer_enabled && data->have_timer) {
+		data->have_timer = 0;
+		hrtimer_cancel(&data->fallback_timer);
+	}
+
 	data->last_state_idx = index;
 	data->needs_update = 1;
 }
@@ -486,6 +613,10 @@ static int menu_enable_device(struct cpuidle_driver *drv,
 
 	memset(data, 0, sizeof(struct menu_device));
 
+	hrtimer_init(&(data->fallback_timer),
+		     CLOCK_REALTIME, HRTIMER_MODE_REL_PINNED);
+	data->fallback_timer.function = fallback_timer_fun;
+
 	/*
 	 * if the correction factor is 0 (eg first time init or cpu hotplug
 	 * etc), we actually want to start out with a unity factor.
@@ -509,6 +640,10 @@ static struct cpuidle_governor menu_governor = {
  */
static int __init init_menu(void)
 {
+	int ret = menu_add_interface(cpu_subsys.dev_root);
+
+	if (ret)
+		return ret;
 	return cpuidle_register_governor(&menu_governor);
 }
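With the patch applied, the tunables show up under /sys/devices/system/cpu/cpuidle_menu/ (the attribute group is registered on cpu_subsys.dev_root). As a worked example of the arming condition with the defaults (diff_threshold_bits = 8, fallback_timer_interval_us = 10000): if the heuristic predicts a 50 us idle while the next regular timer is 25 ms away, then 50 < (25000 >> 8) = 97 and 25000 > 2 * 10000, so the 10 ms fallback timer is armed. If it fires, disregard_past makes the next menu_select() trust next_timer_us once instead of the heuristic's prediction.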