
[v2] cpuidle: Add 'above' and 'below' idle state metrics

Message ID 3514439.dzOWKx1Cjx@aspire.rjw.lan (mailing list archive)
State Mainlined
Delegated to: Rafael Wysocki
Series: [v2] cpuidle: Add 'above' and 'below' idle state metrics

Commit Message

Rafael J. Wysocki Dec. 10, 2018, 11:30 a.m. UTC
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Add two new metrics for CPU idle states, "above" and "below", to count
the number of times the given state had been asked for (or entered
from the kernel's perspective), but the observed idle duration turned
out to be too short or too long for it (respectively).

These metrics help to estimate the quality of the CPU idle governor
in use.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

-> v2: Fix a leftover in the documentation from the previous versions
       of the patch and a typo in the changelog.

---
 Documentation/ABI/testing/sysfs-devices-system-cpu |    7 ++++
 Documentation/admin-guide/pm/cpuidle.rst           |   10 ++++++
 drivers/cpuidle/cpuidle.c                          |   31 ++++++++++++++++++++-
 drivers/cpuidle/sysfs.c                            |    6 ++++
 include/linux/cpuidle.h                            |    2 +
 5 files changed, 55 insertions(+), 1 deletion(-)

Comments

Peter Zijlstra Dec. 10, 2018, 12:21 p.m. UTC | #1
On Mon, Dec 10, 2018 at 12:30:23PM +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Add two new metrics for CPU idle states, "above" and "below", to count
> the number of times the given state had been asked for (or entered
> from the kernel's perspective), but the observed idle duration turned
> out to be too short or too long for it (respectively).
> 
> These metrics help to estimate the quality of the CPU idle governor
> in use.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> @@ -260,6 +262,33 @@ int cpuidle_enter_state(struct cpuidle_d
>  		dev->last_residency = (int)diff;
>  		dev->states_usage[entered_state].time += dev->last_residency;
>  		dev->states_usage[entered_state].usage++;
> +
> +		if (diff < drv->states[entered_state].target_residency) {
> +			for (i = entered_state - 1; i >= 0; i--) {
> +				if (drv->states[i].disabled ||
> +				    dev->states_usage[i].disable)
> +					continue;
> +
> +				/* Shallower states are enabled, so update. */
> +				dev->states_usage[entered_state].above++;
> +				break;
> +			}
> +		} else if (diff > delay) {
> +			for (i = entered_state + 1; i < drv->state_count; i++) {
> +				if (drv->states[i].disabled ||
> +				    dev->states_usage[i].disable)
> +					continue;
> +
> +				/*
> +				 * Update if a deeper state would have been a
> +				 * better match for the observed idle duration.
> +				 */
> +				if (diff - delay >= drv->states[i].target_residency)
> +					dev->states_usage[entered_state].below++;
> +
> +				break;
> +			}
> +		}

One question on this; why is this tracked unconditionally?

Would not a tracepoint be better?; then there is no overhead in the
normal case where nobody gives a crap about these here numbers.
Rafael J. Wysocki Dec. 10, 2018, 9:36 p.m. UTC | #2
On Mon, Dec 10, 2018 at 1:21 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Dec 10, 2018 at 12:30:23PM +0100, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > Add two new metrics for CPU idle states, "above" and "below", to count
> > the number of times the given state had been asked for (or entered
> > from the kernel's perspective), but the observed idle duration turned
> > out to be too short or too long for it (respectively).
> >
> > These metrics help to estimate the quality of the CPU idle governor
> > in use.
> >
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> > @@ -260,6 +262,33 @@ int cpuidle_enter_state(struct cpuidle_d
> >               dev->last_residency = (int)diff;
> >               dev->states_usage[entered_state].time += dev->last_residency;
> >               dev->states_usage[entered_state].usage++;
> > +
> > +             if (diff < drv->states[entered_state].target_residency) {
> > +                     for (i = entered_state - 1; i >= 0; i--) {
> > +                             if (drv->states[i].disabled ||
> > +                                 dev->states_usage[i].disable)
> > +                                     continue;
> > +
> > +                             /* Shallower states are enabled, so update. */
> > +                             dev->states_usage[entered_state].above++;
> > +                             break;
> > +                     }
> > +             } else if (diff > delay) {
> > +                     for (i = entered_state + 1; i < drv->state_count; i++) {
> > +                             if (drv->states[i].disabled ||
> > +                                 dev->states_usage[i].disable)
> > +                                     continue;
> > +
> > +                             /*
> > +                              * Update if a deeper state would have been a
> > +                              * better match for the observed idle duration.
> > +                              */
> > +                             if (diff - delay >= drv->states[i].target_residency)
> > +                                     dev->states_usage[entered_state].below++;
> > +
> > +                             break;
> > +                     }
> > +             }
>
> One question on this; why is this tracked unconditionally?

Because I didn't quite see how to make that conditional in a sensible way.

These things are counters, and counting with the help of tracepoints
isn't particularly convenient (one needs debugfs to be present to use
tracepoints, they require root access, and so on).

> Would not a tracepoint be better?; then there is no overhead in the
> normal case where nobody gives a crap about these here numbers.

There is an existing tracepoint that in principle could be used to
produce this information, but it is such a major PITA in practice that
nobody does that.  Guess why. :-)

Also, the "usage" and "time" counters are there in sysfs, so why not these two?

And is the overhead really that horrible?
Peter Zijlstra Dec. 10, 2018, 10:51 p.m. UTC | #3
On Mon, Dec 10, 2018 at 10:36:40PM +0100, Rafael J. Wysocki wrote:
> On Mon, Dec 10, 2018 at 1:21 PM Peter Zijlstra <peterz@infradead.org> wrote:

> > One question on this; why is this tracked unconditionally?
> 
> Because I didn't quite see how to make that conditional in a sensible way.

Something like:

	if (static_branch_unlikely(__tracepoint_idle_above) ||
	    static_branch_unlikely(__tracepoint_idle_below)) {

		// do stuff that calls trace_idle_above() /
		// trace_idle_below().

	}

> These things are counters and counting with the help of tracepoints
> isn't particularly convenient (and one needs debugfs to be there to
> use tracepoints and they require root access etc).

Root-only access should not be a problem for a developer; and aren't these
numbers only really interesting if you're prodding at the idle governor?

> > Would not a tracepoint be better?; then there is no overhead in the
> > normal case where nobody gives a crap about these here numbers.
> 
> There is an existing tracepoint that in principle could be used to
> produce this information, but it is such a major PITA in practice that
> nobody does that.  Guess why. :-)

Sounds like you need to ship a convenient script or something :-)

> Also, the "usage" and "time" counters are there in sysfs, so why not these two?
> 
> And is the overhead really that horrible?

Dunno; it could be cold cachelines, at which point it can be fairly
expensive. Also, being stuck with an API is fairly horrible if you want
to 'fix' it.
Doug Smythies Dec. 11, 2018, 7:28 a.m. UTC | #4
On 2018.12.10 02:52 Peter Zijlstra wrote:
>On Mon, Dec 10, 2018 at 10:36:40PM +0100, Rafael J. Wysocki wrote:
>> On Mon, Dec 10, 2018 at 1:21 PM Peter Zijlstra <peterz@infradead.org> wrote:

>>> Would not a tracepoint be better?; then there is no overhead in the
>>> normal case where nobody gives a crap about these here numbers.
>> 
>> There is an existing tracepoint that in principle could be used to
>> produce this information, but it is such a major PITA in practice that
>> nobody does that.  Guess why. :-)
>
> Sounds like you need to ship a convenient script or something :-)

For the histogram plots of idle durations that I sometimes provide, trace
is used. It is more work to do it the trace way. Very often, when the rate
of idle state entries/exits is high, turning on trace influences the system
under test significantly. Also, even if I allocate the majority of my memory
to the trace buffer (i.e. 15 of my 16 gigabytes), I can only acquire a few
minutes of trace data before filling the buffer.

Some of my tests run for hours, and these new counters provide a way to acquire
potentially useful (I don't have enough experience with them yet to know how useful)
information, while having no influence on the system under test because
I only take samples once per minute, or sometimes 4 times per minute.

>> Also, the "usage" and "time" counters are there in sysfs, so why not these two?

I agree, how are these two counters any different?

In about a week or so, I'll have some test data comparing 4.20-rc5 with
teov6 and teov7, along with the idle data (graphs) that I usually provide
and also these new counters.

... Doug
Rafael J. Wysocki Dec. 11, 2018, 9:51 a.m. UTC | #5
On Mon, Dec 10, 2018 at 11:51 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Dec 10, 2018 at 10:36:40PM +0100, Rafael J. Wysocki wrote:
> > On Mon, Dec 10, 2018 at 1:21 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> > > One question on this; why is this tracked unconditionally?
> >
> > Because I didn't quite see how to make that conditional in a sensible way.
>
> Something like:
>
>         if (static_branch_unlikely(__tracepoint_idle_above) ||
>             static_branch_unlikely(__tracepoint_idle_below)) {
>
>                 // do stuff that calls trace_idle_above() /
>                 // trace_idle_below().
>
>         }
>
> > These things are counters and counting with the help of tracepoints
> > isn't particularly convenient (and one needs debugfs to be there to
> > use tracepoints and they require root access etc).
>
> Root only should not be a problem for a developer; and aren't these
> numbers only really interesting if you're prodding at the idle governor?

What about regression testing?

> > > Would not a tracepoint be better?; then there is no overhead in the
> > > normal case where nobody gives a crap about these here numbers.
> >
> > There is an existing tracepoint that in principle could be used to
> > produce this information, but it is such a major PITA in practice that
> > nobody does that.  Guess why. :-)
>
> Sounds like you need to ship a convenient script or something :-)

That would be a very ugly script.  Also I'd need to expose
drv->states[i].disabled somehow (that's not accessible from user space
now).

> > Also, the "usage" and "time" counters are there in sysfs, so why not these two?
> >
> > And is the overhead really that horrible?
>
> Dunno; it could be cold cachelines, at which point it can be fairly
> expensive. Also, being stuck with API is fairly horrible if you want to
> 'fix' it.

All of the cache lines involved should've been touched earlier in this
code path by the governor.  At least menu and the new one both touch
them.

The API part I'm not too worried about.  I know it is useful and two
other people have told that to me already. :-)
Peter Zijlstra Dec. 12, 2018, 9:46 a.m. UTC | #6
On Tue, Dec 11, 2018 at 10:51:48AM +0100, Rafael J. Wysocki wrote:
> On Mon, Dec 10, 2018 at 11:51 PM Peter Zijlstra <peterz@infradead.org> wrote:

> > Dunno; it could be cold cachelines, at which point it can be fairly
> > expensive. Also, being stuck with API is fairly horrible if you want to
> > 'fix' it.
> 
> All of the cache lines involved should've been touched earlier in this
> code path by the governor.  At least menu and the new one both touch
> them.
> 
> The API part I'm not too worried about.  I know it is useful and two
> other people have told that to me already. :-)

Like I said on IRC; I mostly wanted to raise the issue of overhead due to
stats and ABI -- it's something I've been bitten by too many times :/

If you're confident you're hitting the same lines with the already
extant accounting (time and usage), and you cannot make the whole lot
conditional because of ABI (bah), then I'll not stand in the way here.

I agree the numbers are useful, I'm just wary of overhead.
Ulf Hansson Dec. 12, 2018, 9:57 a.m. UTC | #7
On Wed, 12 Dec 2018 at 10:46, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Dec 11, 2018 at 10:51:48AM +0100, Rafael J. Wysocki wrote:
> > On Mon, Dec 10, 2018 at 11:51 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> > > Dunno; it could be cold cachelines, at which point it can be fairly
> > > expensive. Also, being stuck with API is fairly horrible if you want to
> > > 'fix' it.
> >
> > All of the cache lines involved should've been touched earlier in this
> > code path by the governor.  At least menu and the new one both touch
> > them.
> >
> > The API part I'm not too worried about.  I know it is useful and two
> > other people have told that to me already. :-)
>
> Like I said on IRC; I mostly wanted to raise the issue of overhead due to
> stats and ABI -- it's something I've been bitten by too many times :/
>
> If you're confident you're hitting the same lines with the already
> extant accounting (time and usage), and you cannot make the whole lot
> conditional because of ABI (bah), then I'll not stand in the way here.
>
> I agree the numbers are useful, I'm just wary of overhead.

I tend to agree. So then why not move this to debugfs? It seems like
that's really where it belongs. No?

Kind regards
Uffe
Rafael J. Wysocki Dec. 12, 2018, 10:17 a.m. UTC | #8
On Wed, Dec 12, 2018 at 10:57 AM Ulf Hansson <ulf.hansson@linaro.org> wrote:
>
> On Wed, 12 Dec 2018 at 10:46, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Tue, Dec 11, 2018 at 10:51:48AM +0100, Rafael J. Wysocki wrote:
> > > On Mon, Dec 10, 2018 at 11:51 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > > Dunno; it could be cold cachelines, at which point it can be fairly
> > > > expensive. Also, being stuck with API is fairly horrible if you want to
> > > > 'fix' it.
> > >
> > > All of the cache lines involved should've been touched earlier in this
> > > code path by the governor.  At least menu and the new one both touch
> > > them.
> > >
> > > The API part I'm not too worried about.  I know it is useful and two
> > > other people have told that to me already. :-)
> >
> > Like I said on IRC; I mostly wanted to raise the issue of overhead due to
> > stats and ABI -- it's something I've been bitten by too many times :/
> >
> > If you're confident you're hitting the same lines with the already
> > extant accounting (time and usage), and you cannot make the whole lot
> > conditional because of ABI (bah), then I'll not stand in the way here.
> >
> > I agree the numbers are useful, I'm just wary of overhead.
>
> I tend to agree. So then why not move this to debugfs? It seems like
> that's really where it belongs. No?

The "usage" and "time" counters are there in sysfs already and they
are ABI, so putting the new ones into debugfs only makes them harder
to use for no real gain.
Daniel Lezcano Jan. 10, 2019, 9:52 a.m. UTC | #9
On 10/12/2018 12:30, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Add two new metrics for CPU idle states, "above" and "below", to count
> the number of times the given state had been asked for (or entered
> from the kernel's perspective), but the observed idle duration turned
> out to be too short or too long for it (respectively).
> 
> These metrics help to estimate the quality of the CPU idle governor
> in use.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> 
> -> v2: Fix a leftover in the documentation from the previous versions
>        of the patch and a typo in the changelog.
> 
> ---
>  Documentation/ABI/testing/sysfs-devices-system-cpu |    7 ++++
>  Documentation/admin-guide/pm/cpuidle.rst           |   10 ++++++
>  drivers/cpuidle/cpuidle.c                          |   31 ++++++++++++++++++++-
>  drivers/cpuidle/sysfs.c                            |    6 ++++
>  include/linux/cpuidle.h                            |    2 +
>  5 files changed, 55 insertions(+), 1 deletion(-)
> 
> Index: linux-pm/drivers/cpuidle/cpuidle.c
> ===================================================================
> --- linux-pm.orig/drivers/cpuidle/cpuidle.c
> +++ linux-pm/drivers/cpuidle/cpuidle.c
> @@ -202,7 +202,6 @@ int cpuidle_enter_state(struct cpuidle_d
>  	struct cpuidle_state *target_state = &drv->states[index];
>  	bool broadcast = !!(target_state->flags & CPUIDLE_FLAG_TIMER_STOP);
>  	ktime_t time_start, time_end;
> -	s64 diff;
>  
>  	/*
>  	 * Tell the time framework to switch to a broadcast timer because our
> @@ -248,6 +247,9 @@ int cpuidle_enter_state(struct cpuidle_d
>  		local_irq_enable();
>  
>  	if (entered_state >= 0) {
> +		s64 diff, delay = drv->states[entered_state].exit_latency;
> +		int i;
> +
>  		/*
>  		 * Update cpuidle counters
>  		 * This can be moved to within driver enter routine,
> @@ -260,6 +262,33 @@ int cpuidle_enter_state(struct cpuidle_d
>  		dev->last_residency = (int)diff;

Shouldn't we subtract the 'delay' from the computed 'diff' in any case?

Otherwise the 'last_residency' accumulates the effective sleep time and
the time to wake up. We are interested only in the sleep time for
prediction and metrics, no?

>  		dev->states_usage[entered_state].time += dev->last_residency;
>  		dev->states_usage[entered_state].usage++;
> +
> +		if (diff < drv->states[entered_state].target_residency) {
> +			for (i = entered_state - 1; i >= 0; i--) {
> +				if (drv->states[i].disabled ||
> +				    dev->states_usage[i].disable)
> +					continue;
> +
> +				/* Shallower states are enabled, so update. */
> +				dev->states_usage[entered_state].above++;
> +				break;
> +			}
> +		} else if (diff > delay) {
> +			for (i = entered_state + 1; i < drv->state_count; i++) {
> +				if (drv->states[i].disabled ||
> +				    dev->states_usage[i].disable)
> +					continue;
> +
> +				/*
> +				 * Update if a deeper state would have been a
> +				 * better match for the observed idle duration.
> +				 */
> +				if (diff - delay >= drv->states[i].target_residency)
> +					dev->states_usage[entered_state].below++;
> +
> +				break;
> +			}
> +		}
>  	} else {
Rafael J. Wysocki Jan. 10, 2019, 10:20 a.m. UTC | #10
On Thu, Jan 10, 2019 at 10:53 AM Daniel Lezcano
<daniel.lezcano@linaro.org> wrote:
>
> On 10/12/2018 12:30, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > Add two new metrics for CPU idle states, "above" and "below", to count
> > the number of times the given state had been asked for (or entered
> > from the kernel's perspective), but the observed idle duration turned
> > out to be too short or too long for it (respectively).
> >
> > These metrics help to estimate the quality of the CPU idle governor
> > in use.
> >
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> >
> > -> v2: Fix a leftover in the documentation from the previous versions
> >        of the patch and a typo in the changelog.
> >
> > ---
> >  Documentation/ABI/testing/sysfs-devices-system-cpu |    7 ++++
> >  Documentation/admin-guide/pm/cpuidle.rst           |   10 ++++++
> >  drivers/cpuidle/cpuidle.c                          |   31 ++++++++++++++++++++-
> >  drivers/cpuidle/sysfs.c                            |    6 ++++
> >  include/linux/cpuidle.h                            |    2 +
> >  5 files changed, 55 insertions(+), 1 deletion(-)
> >
> > Index: linux-pm/drivers/cpuidle/cpuidle.c
> > ===================================================================
> > --- linux-pm.orig/drivers/cpuidle/cpuidle.c
> > +++ linux-pm/drivers/cpuidle/cpuidle.c
> > @@ -202,7 +202,6 @@ int cpuidle_enter_state(struct cpuidle_d
> >       struct cpuidle_state *target_state = &drv->states[index];
> >       bool broadcast = !!(target_state->flags & CPUIDLE_FLAG_TIMER_STOP);
> >       ktime_t time_start, time_end;
> > -     s64 diff;
> >
> >       /*
> >        * Tell the time framework to switch to a broadcast timer because our
> > @@ -248,6 +247,9 @@ int cpuidle_enter_state(struct cpuidle_d
> >               local_irq_enable();
> >
> >       if (entered_state >= 0) {
> > +             s64 diff, delay = drv->states[entered_state].exit_latency;
> > +             int i;
> > +
> >               /*
> >                * Update cpuidle counters
> >                * This can be moved to within driver enter routine,
> > @@ -260,6 +262,33 @@ int cpuidle_enter_state(struct cpuidle_d
> >               dev->last_residency = (int)diff;
>
> Shouldn't we subtract the 'delay' from the computed 'diff' in any case ?

No.

> Otherwise the 'last_residency' accumulates the effective sleep time and
> the time to wakeup. We are interested in the sleep time only for
> prediction and metrics no ?

Yes, but 'delay' is the worst-case latency, not the one actually
experienced most of the time, and (on average) we would underestimate
the sleep time if it were always subtracted.

The idea here is to only count the wakeup as 'above' if the total
'last_residency' is below the target residency of the idle state that
was asked for (as in that case we know for certain that the CPU has
been woken up too early) and to only count it as 'below' if the
difference between 'last_residency' and 'delay' is greater than or
equal to the target residency of a deeper idle state (as in that case
we know for certain that the CPU has been woken up too late).

Of course, this means that there is a "gray area" in which we are not
really sure if the sleep time has matched the idle state that was
asked for, but there's not much we can do about that IMO.
Daniel Lezcano Jan. 14, 2019, 10:39 a.m. UTC | #11
Hi Rafael,

sorry for the delay.

On 10/01/2019 11:20, Rafael J. Wysocki wrote:

[ ... ]

>>>       if (entered_state >= 0) {
>>> +             s64 diff, delay = drv->states[entered_state].exit_latency;
>>> +             int i;
>>> +
>>>               /*
>>>                * Update cpuidle counters
>>>                * This can be moved to within driver enter routine,
>>> @@ -260,6 +262,33 @@ int cpuidle_enter_state(struct cpuidle_d
>>>               dev->last_residency = (int)diff;
>>
>> Shouldn't we subtract the 'delay' from the computed 'diff' in any case ?
> 
> No.
> 
>> Otherwise the 'last_residency' accumulates the effective sleep time and
>> the time to wakeup. We are interested in the sleep time only for
>> prediction and metrics no ?
> 
> Yes, but 'delay' is the worst-case latency and not the actual one
> experienced, most of the time, and (on average) we would underestimate
> the sleep time if it was always subtracted.

IMO, the exit latency is more or less constant for the cpu power down
state. When it is the cluster power down state, the first cpu waking up
has the worst latency, and then the others have the same latency as the
cpu power down state.

If we can model that, the gray area you mention below can be reduced.
There are platforms where the exit latency is very high [1], and not
taking it into account gives very wrong metrics.

> The idea here is to only count the wakeup as 'above' if the total
> 'last_residency' is below the target residency of the idle state that
> was asked for (as in that case we know for certain that the CPU has
> been woken up too early) and to only count it as 'below' if the
> difference between 'last_residency' and 'delay' is greater than or
> equal to the target residency of a deeper idle state (as in that case
> we know for certain that the CPU has been woken up too late).
> 
> Of course, this means that there is a "gray area" in which we are not
> really sure if the sleep time has matched the idle state that was
> asked for, but there's not much we can do about that IMO.

There is another aspect of the metric which can be improved: the 'above'
and 'below' counters give a rough indication of the correctness of the
prediction, but they don't tell us which idle state we should have
selected (we could be constantly choosing state3 instead of state1, for
example).

It would be nice to add a 'missed' field for each idle state, so that
when we detect an 'above' or 'below' condition, we increment the
'missed' field of the idle state we should have selected.

  -- Daniel

[1]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/boot/dts/hisilicon/hi3660.dtsi#n199
Rafael J. Wysocki Jan. 14, 2019, 11:33 p.m. UTC | #12
On Mon, Jan 14, 2019 at 11:39 AM Daniel Lezcano
<daniel.lezcano@linaro.org> wrote:
>
>
> Hi Rafael,
>
> sorry for the delay.
>
> On 10/01/2019 11:20, Rafael J. Wysocki wrote:
>
> [ ... ]
>
> >>>       if (entered_state >= 0) {
> >>> +             s64 diff, delay = drv->states[entered_state].exit_latency;
> >>> +             int i;
> >>> +
> >>>               /*
> >>>                * Update cpuidle counters
> >>>                * This can be moved to within driver enter routine,
> >>> @@ -260,6 +262,33 @@ int cpuidle_enter_state(struct cpuidle_d
> >>>               dev->last_residency = (int)diff;
> >>
> >> Shouldn't we subtract the 'delay' from the computed 'diff' in any case ?
> >
> > No.
> >
> >> Otherwise the 'last_residency' accumulates the effective sleep time and
> >> the time to wakeup. We are interested in the sleep time only for
> >> prediction and metrics no ?
> >
> > Yes, but 'delay' is the worst-case latency and not the actual one
> > experienced, most of the time, and (on average) we would underestimate
> > the sleep time if it was always subtracted.
>
> IMO, the exit latency is more or less constant for the cpu power down
> state. When it is the cluster power down state, the first cpu waking up
> has the worst latency, and then the others have the same latency as the
> cpu power down state.
>
> If we can model that, the gray area you mention below can be reduced.
> There are platforms where the exit latency is very high [1], and not
> taking it into account gives very wrong metrics.

That is kind of a special case, though, and there is no way for the
cpuidle core to distinguish it from all of the other cases.

> > The idea here is to only count the wakeup as 'above' if the total
> > 'last_residency' is below the target residency of the idle state that
> > was asked for (as in that case we know for certain that the CPU has
> > been woken up too early) and to only count it as 'below' if the
> > difference between 'last_residency' and 'delay' is greater than or
> > equal to the target residency of a deeper idle state (as in that case
> > we know for certain that the CPU has been woken up too late).
> >
> > Of course, this means that there is a "gray area" in which we are not
> > really sure if the sleep time has matched the idle state that was
> > asked for, but there's not much we can do about that IMO.
>
> There is another aspect of the metric which can be improved: the 'above'
> and 'below' counters give a rough indication of the correctness of the
> prediction, but they don't tell us which idle state we should have
> selected (we could be constantly choosing state3 instead of state1, for
> example).
>
> It would be nice to add a 'missed' field for each idle state, so that
> when we detect an 'above' or 'below' condition, we increment the
> 'missed' field of the idle state we should have selected.

That's a governor's job, however; that's why there is the ->reflect
governor callback, among other things.

Patch

Index: linux-pm/drivers/cpuidle/cpuidle.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/cpuidle.c
+++ linux-pm/drivers/cpuidle/cpuidle.c
@@ -202,7 +202,6 @@  int cpuidle_enter_state(struct cpuidle_d
 	struct cpuidle_state *target_state = &drv->states[index];
 	bool broadcast = !!(target_state->flags & CPUIDLE_FLAG_TIMER_STOP);
 	ktime_t time_start, time_end;
-	s64 diff;
 
 	/*
 	 * Tell the time framework to switch to a broadcast timer because our
@@ -248,6 +247,9 @@  int cpuidle_enter_state(struct cpuidle_d
 		local_irq_enable();
 
 	if (entered_state >= 0) {
+		s64 diff, delay = drv->states[entered_state].exit_latency;
+		int i;
+
 		/*
 		 * Update cpuidle counters
 		 * This can be moved to within driver enter routine,
@@ -260,6 +262,33 @@  int cpuidle_enter_state(struct cpuidle_d
 		dev->last_residency = (int)diff;
 		dev->states_usage[entered_state].time += dev->last_residency;
 		dev->states_usage[entered_state].usage++;
+
+		if (diff < drv->states[entered_state].target_residency) {
+			for (i = entered_state - 1; i >= 0; i--) {
+				if (drv->states[i].disabled ||
+				    dev->states_usage[i].disable)
+					continue;
+
+				/* Shallower states are enabled, so update. */
+				dev->states_usage[entered_state].above++;
+				break;
+			}
+		} else if (diff > delay) {
+			for (i = entered_state + 1; i < drv->state_count; i++) {
+				if (drv->states[i].disabled ||
+				    dev->states_usage[i].disable)
+					continue;
+
+				/*
+				 * Update if a deeper state would have been a
+				 * better match for the observed idle duration.
+				 */
+				if (diff - delay >= drv->states[i].target_residency)
+					dev->states_usage[entered_state].below++;
+
+				break;
+			}
+		}
 	} else {
 		dev->last_residency = 0;
 	}
Index: linux-pm/include/linux/cpuidle.h
===================================================================
--- linux-pm.orig/include/linux/cpuidle.h
+++ linux-pm/include/linux/cpuidle.h
@@ -33,6 +33,8 @@  struct cpuidle_state_usage {
 	unsigned long long	disable;
 	unsigned long long	usage;
 	unsigned long long	time; /* in US */
+	unsigned long long	above; /* Number of times it's been too deep */
+	unsigned long long	below; /* Number of times it's been too shallow */
 #ifdef CONFIG_SUSPEND
 	unsigned long long	s2idle_usage;
 	unsigned long long	s2idle_time; /* in US */
Index: linux-pm/drivers/cpuidle/sysfs.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/sysfs.c
+++ linux-pm/drivers/cpuidle/sysfs.c
@@ -301,6 +301,8 @@  define_show_state_str_function(name)
 define_show_state_str_function(desc)
 define_show_state_ull_function(disable)
 define_store_state_ull_function(disable)
+define_show_state_ull_function(above)
+define_show_state_ull_function(below)
 
 define_one_state_ro(name, show_state_name);
 define_one_state_ro(desc, show_state_desc);
@@ -310,6 +312,8 @@  define_one_state_ro(power, show_state_po
 define_one_state_ro(usage, show_state_usage);
 define_one_state_ro(time, show_state_time);
 define_one_state_rw(disable, show_state_disable, store_state_disable);
+define_one_state_ro(above, show_state_above);
+define_one_state_ro(below, show_state_below);
 
 static struct attribute *cpuidle_state_default_attrs[] = {
 	&attr_name.attr,
@@ -320,6 +324,8 @@  static struct attribute *cpuidle_state_d
 	&attr_usage.attr,
 	&attr_time.attr,
 	&attr_disable.attr,
+	&attr_above.attr,
+	&attr_below.attr,
 	NULL
 };
 
Index: linux-pm/Documentation/admin-guide/pm/cpuidle.rst
===================================================================
--- linux-pm.orig/Documentation/admin-guide/pm/cpuidle.rst
+++ linux-pm/Documentation/admin-guide/pm/cpuidle.rst
@@ -398,6 +398,16 @@  deeper the (effective) idle state repres
 a number of files (attributes) representing the properties of the idle state
 object corresponding to it, as follows:
 
+``above``
+	Total number of times this idle state had been asked for, but the
+	observed idle duration was certainly too short to match its target
+	residency.
+
+``below``
+	Total number of times this idle state had been asked for, but certainly
+	a deeper idle state would have been a better match for the observed idle
+	duration.
+
 ``desc``
 	Description of the idle state.
 
Index: linux-pm/Documentation/ABI/testing/sysfs-devices-system-cpu
===================================================================
--- linux-pm.orig/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ linux-pm/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -145,6 +145,8 @@  What:		/sys/devices/system/cpu/cpuX/cpui
 		/sys/devices/system/cpu/cpuX/cpuidle/stateN/power
 		/sys/devices/system/cpu/cpuX/cpuidle/stateN/time
 		/sys/devices/system/cpu/cpuX/cpuidle/stateN/usage
+		/sys/devices/system/cpu/cpuX/cpuidle/stateN/above
+		/sys/devices/system/cpu/cpuX/cpuidle/stateN/below
 Date:		September 2007
 KernelVersion:	v2.6.24
 Contact:	Linux power management list <linux-pm@vger.kernel.org>
@@ -166,6 +168,11 @@  Description:
 
 		usage: (RO) Number of times this state was entered (a count).
 
+		above: (RO) Number of times this state was entered, but the
+		       observed CPU idle duration was too short for it (a count).
+
+		below: (RO) Number of times this state was entered, but the
+		       observed CPU idle duration was too long for it (a count).
 
 What:		/sys/devices/system/cpu/cpuX/cpuidle/stateN/desc
 Date:		February 2008