Message ID:   3514439.dzOWKx1Cjx@aspire.rjw.lan (mailing list archive)
State:        Mainlined
Delegated to: Rafael Wysocki
Series:       [v2] cpuidle: Add 'above' and 'below' idle state metrics
On Mon, Dec 10, 2018 at 12:30:23PM +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Add two new metrics for CPU idle states, "above" and "below", to count
> the number of times the given state had been asked for (or entered
> from the kernel's perspective), but the observed idle duration turned
> out to be too short or too long for it (respectively).
>
> These metrics help to estimate the quality of the CPU idle governor
> in use.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> @@ -260,6 +262,33 @@ int cpuidle_enter_state(struct cpuidle_d
>  		dev->last_residency = (int)diff;
>  		dev->states_usage[entered_state].time += dev->last_residency;
>  		dev->states_usage[entered_state].usage++;
> +
> +		if (diff < drv->states[entered_state].target_residency) {
> +			for (i = entered_state - 1; i >= 0; i--) {
> +				if (drv->states[i].disabled ||
> +				    dev->states_usage[i].disable)
> +					continue;
> +
> +				/* Shallower states are enabled, so update. */
> +				dev->states_usage[entered_state].above++;
> +				break;
> +			}
> +		} else if (diff > delay) {
> +			for (i = entered_state + 1; i < drv->state_count; i++) {
> +				if (drv->states[i].disabled ||
> +				    dev->states_usage[i].disable)
> +					continue;
> +
> +				/*
> +				 * Update if a deeper state would have been a
> +				 * better match for the observed idle duration.
> +				 */
> +				if (diff - delay >= drv->states[i].target_residency)
> +					dev->states_usage[entered_state].below++;
> +
> +				break;
> +			}
> +		}

One question on this; why is this tracked unconditionally?

Would not a tracepoint be better?; then there is no overhead in the
normal case where nobody gives a crap about these here numbers.
On Mon, Dec 10, 2018 at 1:21 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Dec 10, 2018 at 12:30:23PM +0100, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > Add two new metrics for CPU idle states, "above" and "below", to count
> > the number of times the given state had been asked for (or entered
> > from the kernel's perspective), but the observed idle duration turned
> > out to be too short or too long for it (respectively).
> >
> > These metrics help to estimate the quality of the CPU idle governor
> > in use.
> >
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

[ ... ]

> One question on this; why is this tracked unconditionally?

Because I didn't quite see how to make that conditional in a sensible
way.

These things are counters and counting with the help of tracepoints
isn't particularly convenient (and one needs debugfs to be there to use
tracepoints and they require root access etc).

> Would not a tracepoint be better?; then there is no overhead in the
> normal case where nobody gives a crap about these here numbers.

There is an existing tracepoint that in principle could be used to
produce this information, but it is such a major PITA in practice that
nobody does that. Guess why. :-)

Also, the "usage" and "time" counters are there in sysfs, so why not
these two?

And is the overhead really that horrible?
On Mon, Dec 10, 2018 at 10:36:40PM +0100, Rafael J. Wysocki wrote:
> On Mon, Dec 10, 2018 at 1:21 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> > One question on this; why is this tracked unconditionally?
>
> Because I didn't quite see how to make that conditional in a sensible way.

Something like:

	if (static_branch_unlikely(__tracepoint_idle_above) ||
	    static_branch_unlikely(__tracepoint_idle_below)) {

		// do stuff that calls trace_idle_above() /
		// trace_idle_below().

	}

> These things are counters and counting with the help of tracepoints
> isn't particularly convenient (and one needs debugfs to be there to
> use tracepoints and they require root access etc).

Root only should not be a problem for a developer; and aren't these
numbers only really interesting if you're prodding at the idle
governor?

> > Would not a tracepoint be better?; then there is no overhead in the
> > normal case where nobody gives a crap about these here numbers.
>
> There is an existing tracepoint that in principle could be used to
> produce this information, but it is such a major PITA in practice that
> nobody does that. Guess why. :-)

Sounds like you need to ship a convenient script or something :-)

> Also, the "usage" and "time" counters are there in sysfs, so why not these two?
>
> And is the overhead really that horrible?

Dunno; it could be cold cachelines, at which point it can be fairly
expensive. Also, being stuck with API is fairly horrible if you want to
'fix' it.
On 2018.12.10 02:52 Peter Zijlstra wrote:
> On Mon, Dec 10, 2018 at 10:36:40PM +0100, Rafael J. Wysocki wrote:
> > On Mon, Dec 10, 2018 at 1:21 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > > Would not a tracepoint be better?; then there is no overhead in the
> > > normal case where nobody gives a crap about these here numbers.
> >
> > There is an existing tracepoint that in principle could be used to
> > produce this information, but it is such a major PITA in practice that
> > nobody does that. Guess why. :-)
>
> Sounds like you need to ship a convenient script or something :-)

For the histogram plots of idle durations that I sometimes provide,
trace is used. It is more work to do it the trace way. Very often, when
the rate of idle state entries/exits is high, turning on trace
influences the system under test significantly. Also, even if I
allocate the majority of my memory to the trace buffer (i.e. 15 of my
16 gigabytes), I can only acquire a few minutes of trace data before
filling the buffer.

Some of my tests run for hours, and these new counters provide a way to
acquire potentially useful information (I don't have enough experience
with them yet to know how useful), while having no influence on the
system under test, because I only take samples once per minute, or
sometimes 4 times per minute.

> > Also, the "usage" and "time" counters are there in sysfs, so why not these two?

I agree; how are these two counters any different?

In about a week or so, I'll have some test data comparing 4.20-rc5 with
TEOv6 and TEOv7, along with the idle data (graphs) that I usually
provide and also these new counters.

... Doug
On Mon, Dec 10, 2018 at 11:51 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Dec 10, 2018 at 10:36:40PM +0100, Rafael J. Wysocki wrote:
> > On Mon, Dec 10, 2018 at 1:21 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > One question on this; why is this tracked unconditionally?
> >
> > Because I didn't quite see how to make that conditional in a sensible way.
>
> Something like:
>
>	if (static_branch_unlikely(__tracepoint_idle_above) ||
>	    static_branch_unlikely(__tracepoint_idle_below)) {
>
>		// do stuff that calls trace_idle_above() /
>		// trace_idle_below().
>
>	}
>
> > These things are counters and counting with the help of tracepoints
> > isn't particularly convenient (and one needs debugfs to be there to
> > use tracepoints and they require root access etc).
>
> Root only should not be a problem for a developer; and aren't these
> numbers only really interesting if you're prodding at the idle governor?

What about regression testing?

> > > Would not a tracepoint be better?; then there is no overhead in the
> > > normal case where nobody gives a crap about these here numbers.
> >
> > There is an existing tracepoint that in principle could be used to
> > produce this information, but it is such a major PITA in practice that
> > nobody does that. Guess why. :-)
>
> Sounds like you need to ship a convenient script or something :-)

That would be a very ugly script. Also I'd need to expose
drv->states[i].disabled somehow (that's not accessible from user space
now).

> > Also, the "usage" and "time" counters are there in sysfs, so why not these two?
> >
> > And is the overhead really that horrible?
>
> Dunno; it could be cold cachelines, at which point it can be fairly
> expensive. Also, being stuck with API is fairly horrible if you want to
> 'fix' it.

All of the cache lines involved should've been touched earlier in this
code path by the governor. At least menu and the new one both touch
them.

The API part I'm not too worried about. I know it is useful and two
other people have told that to me already. :-)
On Tue, Dec 11, 2018 at 10:51:48AM +0100, Rafael J. Wysocki wrote:
> On Mon, Dec 10, 2018 at 11:51 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> > Dunno; it could be cold cachelines, at which point it can be fairly
> > expensive. Also, being stuck with API is fairly horrible if you want to
> > 'fix' it.
>
> All of the cache lines involved should've been touched earlier in this
> code path by the governor. At least menu and the new one both touch
> them.
>
> The API part I'm not too worried about. I know it is useful and two
> other people have told that to me already. :-)

Like said on IRC; I mostly wanted to raise the issue of overhead due to
stats and ABI -- it's something I've been bitten by too many times :/

If you're confident you're hitting the same lines with the already
extant accounting (time and usage) and you cannot make the whole lot
conditional because of ABI (bah) then I'll not stand in the way here.

I agree the numbers are useful, I'm just wary of overhead.
On Wed, 12 Dec 2018 at 10:46, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Dec 11, 2018 at 10:51:48AM +0100, Rafael J. Wysocki wrote:
> > On Mon, Dec 10, 2018 at 11:51 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > Dunno; it could be cold cachelines, at which point it can be fairly
> > > expensive. Also, being stuck with API is fairly horrible if you want to
> > > 'fix' it.
> >
> > All of the cache lines involved should've been touched earlier in this
> > code path by the governor. At least menu and the new one both touch
> > them.
> >
> > The API part I'm not too worried about. I know it is useful and two
> > other people have told that to me already. :-)
>
> Like said on IRC; I mostly wanted to raise the issue of overhead due to
> stats and ABI -- it's something I've been bitten by too many times :/
>
> If you're confident you're hitting the same lines with the already
> extant accounting (time and usage) and you cannot make the whole lot
> conditional because of ABI (bah) then I'll not stand in the way here.
>
> I agree the numbers are useful, I'm just wary of overhead.

I tend to agree. So then why not move this to debugfs? It seems like
that's really where it belongs. No?

Kind regards
Uffe
On Wed, Dec 12, 2018 at 10:57 AM Ulf Hansson <ulf.hansson@linaro.org> wrote:
>
> On Wed, 12 Dec 2018 at 10:46, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Tue, Dec 11, 2018 at 10:51:48AM +0100, Rafael J. Wysocki wrote:
> >
> > [ ... ]
> >
> > Like said on IRC; I mostly wanted to raise the issue of overhead due to
> > stats and ABI -- it's something I've been bitten by too many times :/
> >
> > If you're confident you're hitting the same lines with the already
> > extant accounting (time and usage) and you cannot make the whole lot
> > conditional because of ABI (bah) then I'll not stand in the way here.
> >
> > I agree the numbers are useful, I'm just wary of overhead.
>
> I tend to agree. So then why not move this to debugfs? It seems like
> that's really where it belongs. No?

The "usage" and "time" counters are there in sysfs already and they are
ABI, so putting the new ones into debugfs only makes them harder to use
for no real gain.
On 10/12/2018 12:30, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Add two new metrics for CPU idle states, "above" and "below", to count
> the number of times the given state had been asked for (or entered
> from the kernel's perspective), but the observed idle duration turned
> out to be too short or too long for it (respectively).
>
> These metrics help to estimate the quality of the CPU idle governor
> in use.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

[ ... ]

> @@ -248,6 +247,9 @@ int cpuidle_enter_state(struct cpuidle_d
>  	local_irq_enable();
>
>  	if (entered_state >= 0) {
> +		s64 diff, delay = drv->states[entered_state].exit_latency;
> +		int i;
> +
>  		/*
>  		 * Update cpuidle counters
>  		 * This can be moved to within driver enter routine,
> @@ -260,6 +262,33 @@ int cpuidle_enter_state(struct cpuidle_d
>  		dev->last_residency = (int)diff;

Shouldn't we subtract the 'delay' from the computed 'diff' in any case?

Otherwise the 'last_residency' accumulates the effective sleep time and
the time to wake up. We are interested in the sleep time only for
prediction and metrics, no?

>  		dev->states_usage[entered_state].time += dev->last_residency;
>  		dev->states_usage[entered_state].usage++;
> +
> +		if (diff < drv->states[entered_state].target_residency) {

[ ... ]
On Thu, Jan 10, 2019 at 10:53 AM Daniel Lezcano <daniel.lezcano@linaro.org> wrote:
>
> On 10/12/2018 12:30, Rafael J. Wysocki wrote:
>
> [ ... ]
>
> > @@ -260,6 +262,33 @@ int cpuidle_enter_state(struct cpuidle_d
> >  		dev->last_residency = (int)diff;
>
> Shouldn't we subtract the 'delay' from the computed 'diff' in any case?

No.

> Otherwise the 'last_residency' accumulates the effective sleep time and
> the time to wake up. We are interested in the sleep time only for
> prediction and metrics, no?

Yes, but 'delay' is the worst-case latency and not the actual one
experienced, most of the time, and (on average) we would underestimate
the sleep time if it was always subtracted.

The idea here is to only count the wakeup as 'above' if the total
'last_residency' is below the target residency of the idle state that
was asked for (as in that case we know for certain that the CPU has
been woken up too early) and to only count it as 'below' if the
difference between 'last_residency' and 'delay' is greater than or
equal to the target residency of a deeper idle state (as in that case
we know for certain that the CPU has been woken up too late).

Of course, this means that there is a "gray area" in which we are not
really sure if the sleep time has matched the idle state that was asked
for, but there's not much we can do about that IMO.
Hi Rafael,

sorry for the delay.

On 10/01/2019 11:20, Rafael J. Wysocki wrote:

[ ... ]

> >>  	if (entered_state >= 0) {
> >> +		s64 diff, delay = drv->states[entered_state].exit_latency;
> >> +		int i;
> >> +
> >>  		/*
> >>  		 * Update cpuidle counters
> >>  		 * This can be moved to within driver enter routine,
> >> @@ -260,6 +262,33 @@ int cpuidle_enter_state(struct cpuidle_d
> >>  		dev->last_residency = (int)diff;
> >
> > Shouldn't we subtract the 'delay' from the computed 'diff' in any case?
>
> No.
>
> > Otherwise the 'last_residency' accumulates the effective sleep time and
> > the time to wake up. We are interested in the sleep time only for
> > prediction and metrics, no?
>
> Yes, but 'delay' is the worst-case latency and not the actual one
> experienced, most of the time, and (on average) we would underestimate
> the sleep time if it was always subtracted.

IMO, the exit latency is more or less constant for the cpu power down
state. When it is the cluster power down state, the first cpu waking up
has the worst latency, then the others have the same as the cpu power
down state.

If we can model that, the gray area you mention below can be reduced.
There are platforms where the exit latency is very high [1] and not
taking it into account will give very wrong metrics.

> The idea here is to only count the wakeup as 'above' if the total
> 'last_residency' is below the target residency of the idle state that
> was asked for (as in that case we know for certain that the CPU has
> been woken up too early) and to only count it as 'below' if the
> difference between 'last_residency' and 'delay' is greater than or
> equal to the target residency of a deeper idle state (as in that case
> we know for certain that the CPU has been woken up too late).
>
> Of course, this means that there is a "gray area" in which we are not
> really sure if the sleep time has matched the idle state that was asked
> for, but there's not much we can do about that IMO.

There is another aspect of the metric which can be improved: the 'above'
and the 'below' counters give a rough indication about the correctness
of the prediction, but they don't tell us which idle state we should
have selected (we can be constantly choosing state3 instead of state1,
for example).

It would be nice to add a 'missed' field for each idle state, so when
we check for an 'above' or a 'below' condition, we increment the
'missed' field of the idle state we should have selected.

-- Daniel

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/boot/dts/hisilicon/hi3660.dtsi#n199
On Mon, Jan 14, 2019 at 11:39 AM Daniel Lezcano <daniel.lezcano@linaro.org> wrote:
>
> Hi Rafael,
>
> sorry for the delay.
>
> On 10/01/2019 11:20, Rafael J. Wysocki wrote:
>
> [ ... ]
>
> > Yes, but 'delay' is the worst-case latency and not the actual one
> > experienced, most of the time, and (on average) we would underestimate
> > the sleep time if it was always subtracted.
>
> IMO, the exit latency is more or less constant for the cpu power down
> state. When it is the cluster power down state, the first cpu waking up
> has the worst latency, then the others have the same as the cpu power
> down state.
>
> If we can model that, the gray area you mention below can be reduced.
> There are platforms where the exit latency is very high [1] and not
> taking it into account will give very wrong metrics.

That is kind of a special case, though, and there is no way for the
cpuidle core to distinguish it from all of the other cases.

> > The idea here is to only count the wakeup as 'above' if the total
> > 'last_residency' is below the target residency of the idle state that
> > was asked for (as in that case we know for certain that the CPU has
> > been woken up too early) and to only count it as 'below' if the
> > difference between 'last_residency' and 'delay' is greater than or
> > equal to the target residency of a deeper idle state (as in that case
> > we know for certain that the CPU has been woken up too late).
> >
> > Of course, this means that there is a "gray area" in which we are not
> > really sure if the sleep time has matched the idle state that was asked
> > for, but there's not much we can do about that IMO.
>
> There is another aspect of the metric which can be improved: the 'above'
> and the 'below' counters give a rough indication about the correctness
> of the prediction, but they don't tell us which idle state we should
> have selected (we can be constantly choosing state3 instead of state1,
> for example).
>
> It would be nice to add a 'missed' field for each idle state, so when
> we check for an 'above' or a 'below' condition, we increment the
> 'missed' field of the idle state we should have selected.

That's a governor's job, however. That's why there is the ->reflect
governor callback, after all, among other things.
Index: linux-pm/drivers/cpuidle/cpuidle.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/cpuidle.c
+++ linux-pm/drivers/cpuidle/cpuidle.c
@@ -202,7 +202,6 @@ int cpuidle_enter_state(struct cpuidle_d
 	struct cpuidle_state *target_state = &drv->states[index];
 	bool broadcast = !!(target_state->flags & CPUIDLE_FLAG_TIMER_STOP);
 	ktime_t time_start, time_end;
-	s64 diff;
 
 	/*
 	 * Tell the time framework to switch to a broadcast timer because our
@@ -248,6 +247,9 @@ int cpuidle_enter_state(struct cpuidle_d
 	local_irq_enable();
 
 	if (entered_state >= 0) {
+		s64 diff, delay = drv->states[entered_state].exit_latency;
+		int i;
+
 		/*
 		 * Update cpuidle counters
 		 * This can be moved to within driver enter routine,
@@ -260,6 +262,33 @@ int cpuidle_enter_state(struct cpuidle_d
 		dev->last_residency = (int)diff;
 		dev->states_usage[entered_state].time += dev->last_residency;
 		dev->states_usage[entered_state].usage++;
+
+		if (diff < drv->states[entered_state].target_residency) {
+			for (i = entered_state - 1; i >= 0; i--) {
+				if (drv->states[i].disabled ||
+				    dev->states_usage[i].disable)
+					continue;
+
+				/* Shallower states are enabled, so update. */
+				dev->states_usage[entered_state].above++;
+				break;
+			}
+		} else if (diff > delay) {
+			for (i = entered_state + 1; i < drv->state_count; i++) {
+				if (drv->states[i].disabled ||
+				    dev->states_usage[i].disable)
+					continue;
+
+				/*
+				 * Update if a deeper state would have been a
+				 * better match for the observed idle duration.
+				 */
+				if (diff - delay >= drv->states[i].target_residency)
+					dev->states_usage[entered_state].below++;
+
+				break;
+			}
+		}
 	} else {
 		dev->last_residency = 0;
 	}
Index: linux-pm/include/linux/cpuidle.h
===================================================================
--- linux-pm.orig/include/linux/cpuidle.h
+++ linux-pm/include/linux/cpuidle.h
@@ -33,6 +33,8 @@ struct cpuidle_state_usage {
 	unsigned long long disable;
 	unsigned long long usage;
 	unsigned long long time; /* in US */
+	unsigned long long above; /* Number of times it's been too deep */
+	unsigned long long below; /* Number of times it's been too shallow */
 #ifdef CONFIG_SUSPEND
 	unsigned long long s2idle_usage;
 	unsigned long long s2idle_time; /* in US */
Index: linux-pm/drivers/cpuidle/sysfs.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/sysfs.c
+++ linux-pm/drivers/cpuidle/sysfs.c
@@ -301,6 +301,8 @@ define_show_state_str_function(name)
 define_show_state_str_function(desc)
 define_show_state_ull_function(disable)
 define_store_state_ull_function(disable)
+define_show_state_ull_function(above)
+define_show_state_ull_function(below)
 
 define_one_state_ro(name, show_state_name);
 define_one_state_ro(desc, show_state_desc);
@@ -310,6 +312,8 @@ define_one_state_ro(power, show_state_po
 define_one_state_ro(usage, show_state_usage);
 define_one_state_ro(time, show_state_time);
 define_one_state_rw(disable, show_state_disable, store_state_disable);
+define_one_state_ro(above, show_state_above);
+define_one_state_ro(below, show_state_below);
 
 static struct attribute *cpuidle_state_default_attrs[] = {
 	&attr_name.attr,
@@ -320,6 +324,8 @@ static struct attribute *cpuidle_state_d
 	&attr_usage.attr,
 	&attr_time.attr,
 	&attr_disable.attr,
+	&attr_above.attr,
+	&attr_below.attr,
 	NULL
 };
Index: linux-pm/Documentation/admin-guide/pm/cpuidle.rst
===================================================================
--- linux-pm.orig/Documentation/admin-guide/pm/cpuidle.rst
+++ linux-pm/Documentation/admin-guide/pm/cpuidle.rst
@@ -398,6 +398,16 @@ deeper the (effective) idle state repres
 a number of files (attributes) representing the properties of the idle
 state object corresponding to it, as follows:
 
+``above``
+	Total number of times this idle state had been asked for, but the
+	observed idle duration was certainly too short to match its target
+	residency.
+
+``below``
+	Total number of times this idle state had been asked for, but certainly
+	a deeper idle state would have been a better match for the observed idle
+	duration.
+
 ``desc``
 	Description of the idle state.
 
Index: linux-pm/Documentation/ABI/testing/sysfs-devices-system-cpu
===================================================================
--- linux-pm.orig/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ linux-pm/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -145,6 +145,8 @@ What:		/sys/devices/system/cpu/cpuX/cpui
 		/sys/devices/system/cpu/cpuX/cpuidle/stateN/power
 		/sys/devices/system/cpu/cpuX/cpuidle/stateN/time
 		/sys/devices/system/cpu/cpuX/cpuidle/stateN/usage
+		/sys/devices/system/cpu/cpuX/cpuidle/stateN/above
+		/sys/devices/system/cpu/cpuX/cpuidle/stateN/below
 Date:		September 2007
 KernelVersion:	v2.6.24
 Contact:	Linux power management list <linux-pm@vger.kernel.org>
@@ -166,6 +168,11 @@ Description:
 		usage: (RO) Number of times this state was entered (a count).
 
+		above: (RO) Number of times this state was entered, but the
+		observed CPU idle duration was too short for it (a count).
+
+		below: (RO) Number of times this state was entered, but the
+		observed CPU idle duration was too long for it (a count).
 
 What:		/sys/devices/system/cpu/cpuX/cpuidle/stateN/desc
 Date:		February 2008