Message ID | 20230728145808.902892871@infradead.org (mailing list archive)
---|---
State | RFC, archived
Series | cpuidle,teo: Improve TEO tick decisions
On Fri, Jul 28, 2023 at 5:01 PM Peter Zijlstra <peterz@infradead.org> wrote: > > With cpuidle having added a TICK bucket, TEO will account all TICK and > longer idles there. This means we can now make an informed decision > about stopping the tick. If the sum of 'hit+intercepts' of all states > below the TICK bucket is more than 50%, it is most likely we'll not > reach the tick this time around either, so stopping the tick doesn't > make sense. > > If we don't stop the tick, don't bother calling > tick_nohz_get_sleep_length() and assume duration is no longer than a > tick (could be improved to still look at the current pending time and > timers). > > Since we have this extra state, remove the state_count based early > decisions. > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> > --- > drivers/cpuidle/governors/teo.c | 97 ++++++++++++++-------------------------- > 1 file changed, 34 insertions(+), 63 deletions(-) > > --- a/drivers/cpuidle/governors/teo.c > +++ b/drivers/cpuidle/governors/teo.c > @@ -139,6 +139,7 @@ > #include <linux/sched/clock.h> > #include <linux/sched/topology.h> > #include <linux/tick.h> > +#include "../cpuidle.h" > > /* > * The number of bits to shift the CPU's capacity by in order to determine > @@ -197,7 +198,6 @@ struct teo_cpu { > int next_recent_idx; > int recent_idx[NR_RECENT]; > unsigned long util_threshold; > - bool utilized; > }; > > static DEFINE_PER_CPU(struct teo_cpu, teo_cpus); > @@ -276,11 +276,11 @@ static void teo_update(struct cpuidle_dr > > cpu_data->total += bin->hits + bin->intercepts; > > - if (target_residency_ns <= cpu_data->sleep_length_ns) { > + if (target_residency_ns <= cpu_data->sleep_length_ns) > idx_timer = i; > - if (target_residency_ns <= measured_ns) > - idx_duration = i; > - } > + > + if (target_residency_ns <= measured_ns) > + idx_duration = i; I'm not quite sure what happens here. 
> } > > i = cpu_data->next_recent_idx++; > @@ -362,11 +362,12 @@ static int teo_select(struct cpuidle_dri > unsigned int recent_sum = 0; > unsigned int idx_hit_sum = 0; > unsigned int hit_sum = 0; > + unsigned int tick_sum = 0; > int constraint_idx = 0; > int idx0 = 0, idx = -1; > bool alt_intercepts, alt_recent; > ktime_t delta_tick; > - s64 duration_ns; > + s64 duration_ns = TICK_NSEC; > int i; > > if (dev->last_state_idx >= 0) { > @@ -376,36 +377,26 @@ static int teo_select(struct cpuidle_dri > > cpu_data->time_span_ns = local_clock(); > > - duration_ns = tick_nohz_get_sleep_length(&delta_tick); > - cpu_data->sleep_length_ns = duration_ns; > + /* Should we stop the tick? */ Who's we? I'd prefer something like "Should the tick be stopped?" here (analogously below). > + for (i = 1; i < drv->state_count; i++) { > + struct teo_bin *prev_bin = &cpu_data->state_bins[i-1]; > + struct cpuidle_state *s = &drv->states[i]; > > - /* Check if there is any choice in the first place. */ > - if (drv->state_count < 2) { > - idx = 0; > - goto end; > - } > - if (!dev->states_usage[0].disable) { > - idx = 0; > - if (drv->states[1].target_residency_ns > duration_ns) > - goto end; > - } > + tick_sum += prev_bin->intercepts; > + tick_sum += prev_bin->hits; > > - cpu_data->utilized = teo_cpu_is_utilized(dev->cpu, cpu_data); > - /* > - * If the CPU is being utilized over the threshold and there are only 2 > - * states to choose from, the metrics need not be considered, so choose > - * the shallowest non-polling state and exit. 
> - */ > - if (drv->state_count < 3 && cpu_data->utilized) { > - for (i = 0; i < drv->state_count; ++i) { > - if (!dev->states_usage[i].disable && > - !(drv->states[i].flags & CPUIDLE_FLAG_POLLING)) { > - idx = i; > - goto end; > - } > - } > + if (s->target_residency_ns >= SHORT_TICK_NSEC) > + break; > } > > + if (2*tick_sum > cpu_data->total) > + *stop_tick = false; This means "if over 50% of all the events fall into the buckets below the tick period length, don't stop the tick". Fair enough, but this covers long-term only and what about the most recent events? I think that they need to be taken into account here too. > + > + /* If we do stop the tick, ask for the next timer. */ > + if (*stop_tick) > + duration_ns = tick_nohz_get_sleep_length(&delta_tick); > + cpu_data->sleep_length_ns = duration_ns; If the decision is made to retain the tick and the time to the closest tick event is very small, it would be better to refine the state selection so as to avoid returning a state with the target residency above that time (which essentially is wasting energy). That's what delta_tick above is for, but now tick_nohz_get_sleep_length() is never called when the tick is not going to be stopped. Besides, if I'm not mistaken, setting sleep_length_ns to TICK_NSEC every time the tick is not stopped will not really work on systems where there are real idle states with target residencies beyond TICK_NSEC. > + > /* > * Find the deepest idle state whose target residency does not exceed > * the current sleep length and the deepest idle state not deeper than > @@ -446,13 +437,13 @@ static int teo_select(struct cpuidle_dri > idx_recent_sum = recent_sum; > } > > - /* Avoid unnecessary overhead. */ > - if (idx < 0) { > - idx = 0; /* No states enabled, must use 0. 
*/ > - goto end; > - } else if (idx == idx0) { > - goto end; > - } > + /* No states enabled, must use 0 */ > + if (idx < 0) > + return 0; > + > + /* No point looking for something shallower than the first enabled state */ > + if (idx == idx0) > + return idx; > > /* > * If the sum of the intercepts metric for all of the idle states > @@ -541,29 +532,9 @@ static int teo_select(struct cpuidle_dri > * If the CPU is being utilized over the threshold, choose a shallower > * non-polling state to improve latency > */ > - if (cpu_data->utilized) > + if (teo_cpu_is_utilized(dev->cpu, cpu_data)) > idx = teo_find_shallower_state(drv, dev, idx, duration_ns, true); > > -end: > - /* > - * Don't stop the tick if the selected state is a polling one or if the > - * expected idle duration is shorter than the tick period length. > - */ > - if (((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) || > - duration_ns < TICK_NSEC) && !tick_nohz_tick_stopped()) { > - *stop_tick = false; > - > - /* > - * The tick is not going to be stopped, so if the target > - * residency of the state to be returned is not within the time > - * till the closest timer including the tick, try to correct > - * that. > - */ > - if (idx > idx0 && > - drv->states[idx].target_residency_ns > delta_tick) > - idx = teo_find_shallower_state(drv, dev, idx, delta_tick, false); > - } > - > return idx; > } Overall, I think that the problem with calling tick_nohz_get_sleep_length() is limited to the cases when the CPU is almost fully loaded, so the overall amount of idle time on it is tiny. I would rather use a special path for those cases and I would register all of the wakeups as "intercepts" in those cases.
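For readers following along, the 50% rule being debated here reduces to a one-line comparison. A minimal stand-alone sketch (the function name and plain integer inputs are illustrative stand-ins for the patch's decayed per-bin counters, not the patch's exact code):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy sketch of the tick-retention check discussed above.  tick_sum
 * stands in for the decayed hits+intercepts of all bins whose target
 * residency lies below the tick period; total stands in for the same
 * sum over all bins.
 */
static bool keep_tick(unsigned int tick_sum, unsigned int total)
{
	/* Over 50% of observed idle periods ended below the tick period. */
	return 2 * tick_sum > total;
}
```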
On Fri, Jul 28, 2023 at 06:56:24PM +0200, Rafael J. Wysocki wrote: > On Fri, Jul 28, 2023 at 5:01 PM Peter Zijlstra <peterz@infradead.org> wrote: > > @@ -276,11 +276,11 @@ static void teo_update(struct cpuidle_dr > > > > cpu_data->total += bin->hits + bin->intercepts; > > > > - if (target_residency_ns <= cpu_data->sleep_length_ns) { > > + if (target_residency_ns <= cpu_data->sleep_length_ns) > > idx_timer = i; > > - if (target_residency_ns <= measured_ns) > > - idx_duration = i; > > - } > > + > > + if (target_residency_ns <= measured_ns) > > + idx_duration = i; > > I'm not quite sure what happens here. Oh, I couldn't convince myself that measured_ns <= sleep_length_ns. If measured was longer we still want the higher index. But yeah, I forgots I had that hunk in. > > } > > > > i = cpu_data->next_recent_idx++; > > @@ -362,11 +362,12 @@ static int teo_select(struct cpuidle_dri > > unsigned int recent_sum = 0; > > unsigned int idx_hit_sum = 0; > > unsigned int hit_sum = 0; > > + unsigned int tick_sum = 0; > > int constraint_idx = 0; > > int idx0 = 0, idx = -1; > > bool alt_intercepts, alt_recent; > > ktime_t delta_tick; > > - s64 duration_ns; > > + s64 duration_ns = TICK_NSEC; > > int i; > > > > if (dev->last_state_idx >= 0) { > > @@ -376,36 +377,26 @@ static int teo_select(struct cpuidle_dri > > > > cpu_data->time_span_ns = local_clock(); > > > > - duration_ns = tick_nohz_get_sleep_length(&delta_tick); > > - cpu_data->sleep_length_ns = duration_ns; > > + /* Should we stop the tick? */ > > Who's we? I'd prefer something like "Should the tick be stopped?" > here (analogously below). Sure. > > + for (i = 1; i < drv->state_count; i++) { > > + struct teo_bin *prev_bin = &cpu_data->state_bins[i-1]; > > + struct cpuidle_state *s = &drv->states[i]; > > > > - /* Check if there is any choice in the first place. 
*/ > > - if (drv->state_count < 2) { > > - idx = 0; > > - goto end; > > - } > > - if (!dev->states_usage[0].disable) { > > - idx = 0; > > - if (drv->states[1].target_residency_ns > duration_ns) > > - goto end; > > - } > > + tick_sum += prev_bin->intercepts; > > + tick_sum += prev_bin->hits; > > > > - cpu_data->utilized = teo_cpu_is_utilized(dev->cpu, cpu_data); > > - /* > > - * If the CPU is being utilized over the threshold and there are only 2 > > - * states to choose from, the metrics need not be considered, so choose > > - * the shallowest non-polling state and exit. > > - */ > > - if (drv->state_count < 3 && cpu_data->utilized) { > > - for (i = 0; i < drv->state_count; ++i) { > > - if (!dev->states_usage[i].disable && > > - !(drv->states[i].flags & CPUIDLE_FLAG_POLLING)) { > > - idx = i; > > - goto end; > > - } > > - } > > + if (s->target_residency_ns >= SHORT_TICK_NSEC) > > + break; > > } > > > > + if (2*tick_sum > cpu_data->total) > > + *stop_tick = false; > > This means "if over 50% of all the events fall into the buckets below > the tick period length, don't stop the tick". Fair enough, but this > covers long-term only and what about the most recent events? I think > that they need to be taken into account here too. From looking at a few traces this 'long' term is around 8-10 samples. Which I figured was quick enough. Note that DECAY_SHIFT is 3, while the pulse is 10 bits, so 3-4 cycles will drain most of the history when there's a distinct phase shift. That said; I did look at the recent thing and those seem geared towards the intercepts, while I think hits+intercepts makes more sense here. Given it adjusted quickly enough I didn't put more time in it. > > + > > + /* If we do stop the tick, ask for the next timer. 
*/ > > + if (*stop_tick) > > + duration_ns = tick_nohz_get_sleep_length(&delta_tick); > > + cpu_data->sleep_length_ns = duration_ns; > > If the decision is made to retain the tick and the time to the closest > tick event is very small, it would be better to refine the state > selection so as to avoid returning a state with the target residency > above that time (which essentially is wasting energy). That's what > delta_tick above is for, but now tick_nohz_get_sleep_length() is never > called when the tick is not going to be stopped. Right, so I did ponder using something like ktime_sub(tick_nohz_get_next_hrtimer(), now) instead of TICK_NSEC to get a more accurate measure, but I didn't do so yet. > Besides, if I'm not mistaken, setting sleep_length_ns to TICK_NSEC > every time the tick is not stopped will not really work on systems > where there are real idle states with target residencies beyond > TICK_NSEC. It does work; you really don't want to select such a state if the tick is still active -- you'll never get your residency. Such a state should really only be used when the tick is off. > > + > > /* > > * Find the deepest idle state whose target residency does not exceed > > * the current sleep length and the deepest idle state not deeper than > > @@ -446,13 +437,13 @@ static int teo_select(struct cpuidle_dri > > idx_recent_sum = recent_sum; > > } > > > > - /* Avoid unnecessary overhead. */ > > - if (idx < 0) { > > - idx = 0; /* No states enabled, must use 0. 
*/ > > - goto end; > > - } else if (idx == idx0) { > > - goto end; > > - } > > + /* No states enabled, must use 0 */ > > + if (idx < 0) > > + return 0; > > + > > + /* No point looking for something shallower than the first enabled state */ > > + if (idx == idx0) > > + return idx; > > > > /* > > * If the sum of the intercepts metric for all of the idle states > > @@ -541,29 +532,9 @@ static int teo_select(struct cpuidle_dri > > * If the CPU is being utilized over the threshold, choose a shallower > > * non-polling state to improve latency > > */ > > - if (cpu_data->utilized) > > + if (teo_cpu_is_utilized(dev->cpu, cpu_data)) > > idx = teo_find_shallower_state(drv, dev, idx, duration_ns, true); > > > > -end: > > - /* > > - * Don't stop the tick if the selected state is a polling one or if the > > - * expected idle duration is shorter than the tick period length. > > - */ > > - if (((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) || > > - duration_ns < TICK_NSEC) && !tick_nohz_tick_stopped()) { > > - *stop_tick = false; > > - > > - /* > > - * The tick is not going to be stopped, so if the target > > - * residency of the state to be returned is not within the time > > - * till the closest timer including the tick, try to correct > > - * that. > > - */ > > - if (idx > idx0 && > > - drv->states[idx].target_residency_ns > delta_tick) > > - idx = teo_find_shallower_state(drv, dev, idx, delta_tick, false); > > - } > > - > > return idx; > > } > > Overall, I think that the problem with calling > tick_nohz_get_sleep_length() is limited to the cases when the CPU is > almost fully loaded, so the overall amount of idle time on it is tiny. > I would rather use a special pah for those cases and I would register > all of the wakeups as "intercepts" in those cases. I'm not sure what you're proposing. If we track the tick+ bucket -- as we must in order to say anything useful about it, then we can decide the tick state before (as I do here) calling sleep_length(). 
The timer-pull rework from Anna-Maria unfortunately makes the tick_nohz_get_sleep_length() thing excessively expensive, and it really doesn't make sense to call it when we retain the tick. It's all a bit of a chicken-and-egg situation: cpuidle wants to know when the next timer is, but telling when that is requires knowing whether the tick stays. We need to break that somehow -- I propose by not calling it when we know we'll keep the tick.
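The decay argument above can be illustrated with a toy model. This is not the actual teo_update() code: the two-bin setup, the function names, and the steady-state starting value (PULSE summed over the geometric decay, i.e. 1024 / (1 - 7/8) = 8192) are assumptions made for illustration; only DECAY_SHIFT and PULSE match the constants mentioned in the thread.

```c
#include <assert.h>

#define DECAY_SHIFT 3
#define PULSE 1024

/*
 * Toy model of teo's decayed counters after a phase shift: "old_bin"
 * holds the accumulated history, "new_bin" receives every pulse from
 * now on.  Each update multiplies both by roughly 7/8 and credits the
 * bin that matched the wakeup.
 */
static void step(unsigned int *old_bin, unsigned int *new_bin)
{
	*old_bin -= *old_bin >> DECAY_SHIFT;
	*new_bin -= *new_bin >> DECAY_SHIFT;
	*new_bin += PULSE;	/* all wakeups now land in the new bin */
}

/* How many updates until the fresh bin outweighs the stale history? */
static int cycles_until_overtake(void)
{
	unsigned int old_bin = PULSE << DECAY_SHIFT;	/* steady state */
	unsigned int new_bin = 0;
	int n = 0;

	while (new_bin <= old_bin) {
		step(&old_bin, &new_bin);
		n++;
	}
	return n;
}
```

In this toy the fresh bin overtakes a fully saturated stale one within half a dozen updates; starting from a less saturated history it happens correspondingly faster.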
On Sat, Jul 29, 2023 at 12:01 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Fri, Jul 28, 2023 at 06:56:24PM +0200, Rafael J. Wysocki wrote: > > On Fri, Jul 28, 2023 at 5:01 PM Peter Zijlstra <peterz@infradead.org> wrote: > > > > @@ -276,11 +276,11 @@ static void teo_update(struct cpuidle_dr > > > > > > cpu_data->total += bin->hits + bin->intercepts; > > > > > > - if (target_residency_ns <= cpu_data->sleep_length_ns) { > > > + if (target_residency_ns <= cpu_data->sleep_length_ns) > > > idx_timer = i; > > > - if (target_residency_ns <= measured_ns) > > > - idx_duration = i; > > > - } > > > + > > > + if (target_residency_ns <= measured_ns) > > > + idx_duration = i; > > > > I'm not quite sure what happens here. > > Oh, I couldn't convince myself that measured_ns <= sleep_length_ns. If > measured was longer we still want the higher index. > > But yeah, I forgots I had that hunk in. > > > > } > > > > > > i = cpu_data->next_recent_idx++; > > > @@ -362,11 +362,12 @@ static int teo_select(struct cpuidle_dri > > > unsigned int recent_sum = 0; > > > unsigned int idx_hit_sum = 0; > > > unsigned int hit_sum = 0; > > > + unsigned int tick_sum = 0; > > > int constraint_idx = 0; > > > int idx0 = 0, idx = -1; > > > bool alt_intercepts, alt_recent; > > > ktime_t delta_tick; > > > - s64 duration_ns; > > > + s64 duration_ns = TICK_NSEC; > > > int i; > > > > > > if (dev->last_state_idx >= 0) { > > > @@ -376,36 +377,26 @@ static int teo_select(struct cpuidle_dri > > > > > > cpu_data->time_span_ns = local_clock(); > > > > > > - duration_ns = tick_nohz_get_sleep_length(&delta_tick); > > > - cpu_data->sleep_length_ns = duration_ns; > > > + /* Should we stop the tick? */ > > > > Who's we? I'd prefer something like "Should the tick be stopped?" > > here (analogously below). > > Sure. 
> > > > + for (i = 1; i < drv->state_count; i++) { > > > + struct teo_bin *prev_bin = &cpu_data->state_bins[i-1]; > > > + struct cpuidle_state *s = &drv->states[i]; > > > > > > - /* Check if there is any choice in the first place. */ > > > - if (drv->state_count < 2) { > > > - idx = 0; > > > - goto end; > > > - } > > > - if (!dev->states_usage[0].disable) { > > > - idx = 0; > > > - if (drv->states[1].target_residency_ns > duration_ns) > > > - goto end; > > > - } > > > + tick_sum += prev_bin->intercepts; > > > + tick_sum += prev_bin->hits; > > > > > > - cpu_data->utilized = teo_cpu_is_utilized(dev->cpu, cpu_data); > > > - /* > > > - * If the CPU is being utilized over the threshold and there are only 2 > > > - * states to choose from, the metrics need not be considered, so choose > > > - * the shallowest non-polling state and exit. > > > - */ > > > - if (drv->state_count < 3 && cpu_data->utilized) { > > > - for (i = 0; i < drv->state_count; ++i) { > > > - if (!dev->states_usage[i].disable && > > > - !(drv->states[i].flags & CPUIDLE_FLAG_POLLING)) { > > > - idx = i; > > > - goto end; > > > - } > > > - } > > > + if (s->target_residency_ns >= SHORT_TICK_NSEC) > > > + break; > > > } > > > > > > + if (2*tick_sum > cpu_data->total) > > > + *stop_tick = false; > > > > This means "if over 50% of all the events fall into the buckets below > > the tick period length, don't stop the tick". Fair enough, but this > > covers long-term only and what about the most recent events? I think > > that they need to be taken into account here too. > > From looking at a few traces this 'long' term is around 8-10 samples. > Which I figured was quick enough. > > Note that DECAY_SHIFT is 3, while the pulse is 10 bits, so 3-4 cycles > will drain most of the history when there's a distinct phase shift. > > That said; I did look at the recent thing and those seem geared towards > the intercepts, while I think hits+intercepts makes more sense here. 
> Given it adjusted quickly enough I didn't put more time in it. > > > > + > > > + /* If we do stop the tick, ask for the next timer. */ > > > + if (*stop_tick) > > > + duration_ns = tick_nohz_get_sleep_length(&delta_tick); > > > + cpu_data->sleep_length_ns = duration_ns; > > > > If the decision is made to retain the tick and the time to the closest > > tick event is very small, it would be better to refine the state > > selection so as to avoid returning a state with the target residency > > above that time (which essentially is wasting energy). That's what > > delta_tick above is for, but now tick_nohz_get_sleep_length() is never > > called when the tick is not going to be stopped. > > Right, so I did ponder using something like > ktime_sub(tick_nohz_get_next_hrtimer(), now) instead of TICK_NSEC to get > a more accurate measure, but I didn't do so yet. > > > Besides, if I'm not mistaken, setting sleep_length_ns to TICK_NSEC > > every time the tick is not stopped will not really work on systems > > where there are real idle states with target residencies beyond > > TICK_NSEC. > > It does work; you really don't want to select such a state if the tick > is still active -- you'll never get your residency. Such a state should > really only be used when the tick is off. > > > > + > > > /* > > > * Find the deepest idle state whose target residency does not exceed > > > * the current sleep length and the deepest idle state not deeper than > > > @@ -446,13 +437,13 @@ static int teo_select(struct cpuidle_dri > > > idx_recent_sum = recent_sum; > > > } > > > > > > - /* Avoid unnecessary overhead. */ > > > - if (idx < 0) { > > > - idx = 0; /* No states enabled, must use 0. 
*/ > > > - goto end; > > > - } else if (idx == idx0) { > > > - goto end; > > > - } > > > + /* No states enabled, must use 0 */ > > > + if (idx < 0) > > > + return 0; > > > + > > > + /* No point looking for something shallower than the first enabled state */ > > > + if (idx == idx0) > > > + return idx; > > > > > > /* > > > * If the sum of the intercepts metric for all of the idle states > > > @@ -541,29 +532,9 @@ static int teo_select(struct cpuidle_dri > > > * If the CPU is being utilized over the threshold, choose a shallower > > > * non-polling state to improve latency > > > */ > > > - if (cpu_data->utilized) > > > + if (teo_cpu_is_utilized(dev->cpu, cpu_data)) > > > idx = teo_find_shallower_state(drv, dev, idx, duration_ns, true); > > > > > > -end: > > > - /* > > > - * Don't stop the tick if the selected state is a polling one or if the > > > - * expected idle duration is shorter than the tick period length. > > > - */ > > > - if (((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) || > > > - duration_ns < TICK_NSEC) && !tick_nohz_tick_stopped()) { > > > - *stop_tick = false; > > > - > > > - /* > > > - * The tick is not going to be stopped, so if the target > > > - * residency of the state to be returned is not within the time > > > - * till the closest timer including the tick, try to correct > > > - * that. > > > - */ > > > - if (idx > idx0 && > > > - drv->states[idx].target_residency_ns > delta_tick) > > > - idx = teo_find_shallower_state(drv, dev, idx, delta_tick, false); > > > - } > > > - > > > return idx; > > > } > > > > Overall, I think that the problem with calling > > tick_nohz_get_sleep_length() is limited to the cases when the CPU is > > almost fully loaded, so the overall amount of idle time on it is tiny. > > I would rather use a special pah for those cases and I would register > > all of the wakeups as "intercepts" in those cases. > > I'm not sure what you're proposing. Something really simple like: 1. 
Check sched_cpu_util() (which is done by teo anyway). 2. If that is around 90% of the maximum CPU capacity, select the first non-polling idle state and be done (don't stop the tick, as per my other reply earlier today). In 2, tick_nohz_get_sleep_length() need not be called, because the idle state selection doesn't depend on it, and sleep_length_ns can be set to KTIME_MAX (or indeed anything greater than TICK_NSEC), because if the CPU is woken up by the tick, teo_reflect() will take care of this and otherwise the wakeup should be recorded as an "intercept". > If we track the tick+ bucket -- as > we must in order to say anything useful about it, then we can decide the > tick state before (as I do here) calling sleep_length(). > > The timer-pull rework from Anna-Maria unfortunately makes the > tick_nohz_get_sleep_length() thing excessively expensive and it really > doesn't make sense to call it when we retain the tick. > > It's all a bit of a chicken-egg situation, cpuidle wants to know when > the next timer is, but telling when that is, wants to know if the tick > stays. We need to break that somehow -- I propose by not calling it when > we know we'll keep the tick. By selecting a state whose target residency will not be met, we lose on both energy and performance, so doing this really should be avoided, unless the state is really shallow in which case there may be no time for making this consideration.
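A rough stand-alone sketch of step 2 of this proposal, picking the shallowest enabled non-polling state. All type and function names here are invented for illustration; the real governor would walk drv->states and dev->states_usage instead.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for the per-state enable/flag information. */
struct toy_state {
	bool disabled;
	bool polling;
};

/*
 * Return the index of the shallowest enabled non-polling state, as the
 * proposed utilization fast path would; fall back to state 0 if no
 * state qualifies.
 */
static int first_nonpolling_enabled(const struct toy_state *s, int count)
{
	for (int i = 0; i < count; i++) {
		if (!s[i].disabled && !s[i].polling)
			return i;
	}
	return 0;
}
```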
On Mon, Jul 31, 2023 at 12:17:27PM +0200, Rafael J. Wysocki wrote: > Something really simple like: > > 1. Check sched_cpu_util() (which is done by teo anyway). > 2. If that is around 90% of the maximum CPU capacity, select the first > non-polling idle state and be done (don't stop the tick as my other > replay earlier today). So I really don't like using cpu_util() here; yes, 90% is a high number, but it doesn't say *anything* about the idle duration. Remember, this is a 32ms window, so 90% of that is 28.8ms. (not entirely accurate, since it's an exponential average, but that doesn't change the overall argument, only some of the particulars) That is, 90% util, at best, says there is no idle longer than 3.2 ms. But that is still vastly longer than pretty much all residencies. Heck, that is still 3 ticks' worth at HZ=1000. So 90% util should not preclude disabling the tick (at HZ=1000). Now, typically this won't be the case, and at 90% you'll have lots of small idles adding up to 3.2ms total idle. But the point is, you can't tell the difference. And as such util is a horrible measure to use for cpuidle. > > If we track the tick+ bucket -- as > > we must in order to say anything useful about it, then we can decide the > > tick state before (as I do here) calling sleep_length(). > > > > The timer-pull rework from Anna-Maria unfortunately makes the > > tick_nohz_get_sleep_length() thing excessively expensive and it really > > doesn't make sense to call it when we retain the tick. > > > > It's all a bit of a chicken-egg situation, cpuidle wants to know when > > the next timer is, but telling when that is, wants to know if the tick > > stays. We need to break that somehow -- I propose by not calling it when > > we know we'll keep the tick.
> > By selecting a state whose target residency will not be met, we lose > > on both energy and performance, so doing this really should be > > avoided, unless the state is really shallow in which case there may be > > no time for making this consideration. I'm not sure how that relates to what I propose above. By adding the tick+ bucket we have more historical information relative to the tick boundary; how does that make us select states we won't match residency for?
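The arithmetic in the utilization argument above, spelled out with whole numbers (the 32 ms figure is the approximation used in the thread for the utilization averaging window, not an exact PELT horizon):

```c
#include <assert.h>

#define WINDOW_US	32000	/* ~32 ms averaging window, as assumed above */
#define TICK_US_HZ1000	1000	/* tick period at HZ=1000 */

/*
 * At util_pct percent utilization, at most this many microseconds of
 * the window are idle -- with no information on how that idle time is
 * distributed, which is exactly the objection being made.
 */
static unsigned int max_idle_us(unsigned int util_pct)
{
	return WINDOW_US * (100 - util_pct) / 100;
}
```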
On Mon, Jul 31, 2023 at 2:02 PM Peter Zijlstra <peterz@infradead.org> wrote: > > On Mon, Jul 31, 2023 at 12:17:27PM +0200, Rafael J. Wysocki wrote: > > > Something really simple like: > > > > 1. Check sched_cpu_util() (which is done by teo anyway). > > 2. If that is around 90% of the maximum CPU capacity, select the first > > non-polling idle state and be done (don't stop the tick as my other > > replay earlier today). > > So I really don't like using cpu_util() here, yes, 90% is a high number, > but it doesn't say *anything* about the idle duration. Remember, this is > a 32ms window, so 90% of that is 28.8ms. It doesn't have to say anything about the idle duration as long as it says something about the governor's "accuracy". If it is observed experimentally that the governor is generally likely to mispredict a deeper state if CPU utilization is above a certain threshold, then it makes sense to use this information to counter that. That's how it is used today. And yes, you are right about the most immediate idle duration, but overall the rule that if the CPU utilization is high, then selecting deep idle states is not a good idea in general does seem to hold. > (not entirely accurate, since it's an exponential average, but that > doesn't change the overal argument, only some of the particulars) > > That is, 90% util, at best, says there is no idle longer than 3.2 ms. > But that is still vastly longer than pretty much all residencies. Heck, > that is still 3 ticks worth of HZ=1000 ticks. So 90% util should not > preclude disabling the tick (at HZ=1000). > > Now, typically this won't be the case, and at 90% you'll have lots of > small idles adding up to 3.2ms total idle. But the point is, you can't > tell the difference. And as such util is a horrible measure to use for > cpuidle. No, it is not IMO, because idle is all about the combined outcome of multiple cycles.
> > > If we track the tick+ bucket -- as > > > we must in order to say anything useful about it, then we can decide the > > > tick state before (as I do here) calling sleep_length(). > > > > > > The timer-pull rework from Anna-Maria unfortunately makes the > > > tick_nohz_get_sleep_length() thing excessively expensive and it really > > > doesn't make sense to call it when we retain the tick. > > > > > > It's all a bit of a chicken-egg situation, cpuidle wants to know when > > > the next timer is, but telling when that is, wants to know if the tick > > > stays. We need to break that somehow -- I propose by not calling it when > > > we know we'll keep the tick. > > > > By selecting a state whose target residency will not be met, we lose > > on both energy and performance, so doing this really should be > > avoided, unless the state is really shallow in which case there may be > > no time for making this consideration. > > I'm not sure how that relates to what I propose above. By adding the > tick+ bucket we have more historical information as related to the tick > boundary, how does that make us select states we won't match residency > for? As stated in my last reply, the only case in which it makes a difference is when the deepest idle state's target residency is below the tick, and I'm not really sure if that difference is demonstrably significant. So I would do the following to start with: 1. Rearrange the teo code so that it considers all of the bins every time without calling tick_nohz_get_sleep_length(). 2. The sched_cpu_util() check will still be applied to the resulting candidate state as it is now. 3. If it finds that the candidate state is shallow enough (for instance, it is a polling state or the first non-polling one), it will return this state without calling tick_nohz_get_sleep_length() and without stopping the tick. 4. Otherwise it will call tick_nohz_get_sleep_length() to see what the timers look like and refine the selection (towards the shallower states) if need be. 
5. If the candidate state is not the deepest one and its target residency is below the tick, it will be returned and the tick will not be stopped. 6. Otherwise, the candidate state will be returned and the tick will be stopped. If this still doesn't get us where we want to be, the extra bin can be added (as long as it makes a measurable difference).
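Reduced to the tick decision alone, steps 3, 5 and 6 of this proposal can be sketched as a pure function. This is illustrative only: the boolean inputs stand in for checks the governor would actually perform against drv->states and the candidate index.

```c
#include <assert.h>
#include <stdbool.h>

enum tick_action { TICK_KEPT, TICK_STOPPED };

/*
 * Tick decision from the proposed flow above:
 *  - step 3: a shallow candidate is returned without querying timers
 *    and without stopping the tick;
 *  - step 5: a non-deepest candidate with target residency below the
 *    tick is returned with the tick kept running;
 *  - step 6: otherwise the tick is stopped.
 */
static enum tick_action tick_decision(bool candidate_is_shallow,
				      bool candidate_is_deepest,
				      bool residency_below_tick)
{
	if (candidate_is_shallow)
		return TICK_KEPT;

	if (!candidate_is_deepest && residency_below_tick)
		return TICK_KEPT;

	return TICK_STOPPED;
}
```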