
[RFT,v1,0/5] cpuidle: menu: Avoid discarding useful information when processing recent idle intervals

Message ID 1916668.tdWV9SEqCh@rjwysocki.net (mailing list archive)

Message

Rafael J. Wysocki Feb. 6, 2025, 2:21 p.m. UTC
Hi Everyone,

This work had been triggered by a report that commit 0611a640e60a ("eventpoll:
prefer kfree_rcu() in __ep_remove()") had caused the critical-jOPS metric of
the SPECjbb 2015 benchmark [1] to drop by around 50% even though it generally
reduced kernel overhead.  Indeed, it was found during further investigation
that the total interrupt rate while running the SPECjbb workload had fallen as
a result of that commit by 55% and the local timer interrupt rate had fallen by
almost 80%.

That turned out to cause the menu cpuidle governor to select the deepest idle
state supplied by the cpuidle driver (intel_idle) much more often which added
significantly more idle state latency to the workload and that led to the
decrease of the critical-jOPS score.

Interestingly enough, this problem was not visible when the teo cpuidle
governor was used instead of menu, so it appeared to be specific to the
latter.  CPU wakeup event statistics collected while running the workload
indicated that the menu governor was effectively ignoring non-timer wakeup
information and all of its idle state selection decisions appeared to be
based on timer wakeups only.  Thus, it appeared that the reduction of the
local timer interrupt rate caused the governor to predict a long idle duration
much more often while running the workload and the deepest idle state was
selected significantly more often as a result of that.

A subsequent inspection of the get_typical_interval() function in the menu
governor indicated that it might return UINT_MAX too often, which then caused
the governor's decisions to be based entirely on information related to timers.

Generally speaking, UINT_MAX is returned by get_typical_interval() if it
cannot make a prediction based on the most recent idle intervals data with
sufficiently high confidence, but at least in some cases this means that
useful information is not taken into account at all which may lead to
significant idle state selection mistakes.  Moreover, this is not really
unlikely to happen.
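
As a rough illustration of the logic described above, here is a simplified,
self-contained C sketch loosely based on get_typical_interval() (the constant
and variable names are illustrative; this is not the actual kernel code):

#include <stdint.h>
#include <limits.h>

#define INTERVALS 8	/* number of recent idle intervals kept per CPU */

/*
 * Simplified sketch: average the recent idle intervals (in microseconds) and
 * accept the average as the prediction only if the standard deviation is
 * small enough (within 1/6 of the average, or within 20 us when the average
 * is close to zero).  Otherwise drop the largest sample and retry, giving up
 * with UINT_MAX once 1/4 of the samples have been eliminated.
 */
static unsigned int typical_interval_sketch(const unsigned int intervals[INTERVALS])
{
	unsigned int thresh = UINT_MAX;	/* samples above this are ignored */

	for (;;) {
		uint64_t sum = 0, variance = 0;
		unsigned int avg, max = 0, divisor = 0;
		int i;

		for (i = 0; i < INTERVALS; i++) {
			if (intervals[i] <= thresh) {
				sum += intervals[i];
				divisor++;
				if (intervals[i] > max)
					max = intervals[i];
			}
		}
		avg = sum / divisor;

		for (i = 0; i < INTERVALS; i++) {
			if (intervals[i] <= thresh) {
				uint64_t diff = intervals[i] > avg ?
					intervals[i] - avg : avg - intervals[i];

				variance += diff * diff;
			}
		}
		variance /= divisor;

		/*
		 * stddev <= avg/6 (avg^2 > 36 * variance) or stddev <= 20 us;
		 * the kernel additionally guards against arithmetic overflow.
		 */
		if ((uint64_t)avg * avg > 36 * variance || variance <= 400)
			return avg;

		/* Give up once 1/4 of the samples have been eliminated. */
		if (divisor * 4 <= INTERVALS * 3)
			return UINT_MAX;

		/* Drop the largest remaining sample and recompute. */
		thresh = max - 1;
	}
}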

One issue with get_typical_interval() is that, when it eliminates outliers from
the sample set in an attempt to reduce the standard deviation (and so improve
the prediction confidence), it does that by dropping high-end samples only,
while samples at the low end of the set are retained.  However, the samples
at the low end very well may be the outliers and they should be eliminated
from the sample set instead of the high-end samples.  Accordingly, the
likelihood of making a meaningful idle duration prediction can be improved
by making it also eliminate low-end samples if they are farther from the
average than high-end samples.  This is done in patch [4/5].
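
As a sketch of that idea only (not the actual patch [4/5]), the elimination
step could compare how far the lowest and the highest remaining samples are
from the average and drop whichever end is the farther outlier.  Relative to
the sketch above, this assumes "min" is tracked alongside "max" and a low-end
cutoff "min_thresh" is added next to "thresh":

static void drop_farther_outlier(unsigned int avg, unsigned int min,
				 unsigned int max,
				 unsigned int *min_thresh,
				 unsigned int *thresh)
{
	if (max - avg >= avg - min) {
		/* The high end is the farther outlier: drop it as before. */
		*thresh = max - 1;
	} else {
		/* The low end is the farther outlier: drop it instead. */
		*min_thresh = min + 1;
	}
}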

Another issue is that get_typical_interval() gives up after eliminating 1/4
of the samples if the standard deviation is still not as low as desired (within
1/6 of the average or within 20 us if the average is close to 0), but the
remaining samples in the set still represent useful information at that point
and discarding them altogether may lead to suboptimal idle state selection.

For instance, the largest idle duration value in the get_typical_interval()
data set is the maximum idle duration observed recently and it is likely that
the upcoming idle duration will not exceed it.  Therefore, in the absence of
a better choice, this value can be used as an upper bound on the target
residency of the idle state to select.  Patch [5/5] works along these lines,
but it takes the maximum data point remaining after the elimination of
outliers.
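
As a sketch of that idea (again, not the actual patch [5/5]), the tail of the
retry loop in the earlier sketch would change so that the "give up" path
returns the largest surviving sample rather than UINT_MAX:

		if ((uint64_t)avg * avg > 36 * variance || variance <= 400)
			return avg;

		/*
		 * "max" is the largest idle interval remaining after outlier
		 * elimination, so the upcoming idle duration is unlikely to
		 * exceed it.  Return it as an upper bound on the target
		 * residency of the idle state to select, instead of falling
		 * back to timer information alone.
		 */
		if (divisor * 4 <= INTERVALS * 3)
			return max;	/* was: return UINT_MAX; */

		thresh = max - 1;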

The first two patches in the series are straightforward cleanups (in fact,
the first patch is kind of reversed by patch [4/5], but it is there because
it can be applied without the latter) and patch [3/5] is a cosmetic change
made in preparation for the subsequent ones.

This series turns out to restore the SPECjbb critical-jOPS metric on affected
systems to the level from before commit 0611a640e60a and it also happens to
increase its max-jOPS metric by around 3%.

For easier reference/testing it is present in the git branch at

https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=experimental/menu

based on the cpuidle material that went into 6.14-rc1.

If possible, please let me know if it works for you.

Thanks!


[1] Link: https://www.spec.org/jbb2015/

Comments

Artem Bityutskiy Feb. 7, 2025, 2:48 p.m. UTC | #1
Hi,

thanks for the patches!

On Thu, 2025-02-06 at 15:21 +0100, Rafael J. Wysocki wrote:
> Hi Everyone,
> 
> This work had been triggered by a report that commit 0611a640e60a ("eventpoll:
> prefer kfree_rcu() in __ep_remove()") had caused the critical-jOPS metric of
> the SPECjbb 2015 benchmark [1] to drop by around 50% even though it generally
> reduced kernel overhead.  Indeed, it was found during further investigation
> that the total interrupt rate while running the SPECjbb workload had fallen as
> a result of that commit by 55% and the local timer interrupt rate had fallen
> by
> almost 80%.

I ran SPECjbb2015 with this series and it doubles critical-jOPS, basically
making it "normal" again. Thanks!

Reported-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Christian Loehle Feb. 7, 2025, 3:24 p.m. UTC | #2
On 2/7/25 14:48, Artem Bityutskiy wrote:
> Hi,
> 
> thanks for the patches!
> 
> On Thu, 2025-02-06 at 15:21 +0100, Rafael J. Wysocki wrote:
>> Hi Everyone,
>>
>> This work had been triggered by a report that commit 0611a640e60a ("eventpoll:
>> prefer kfree_rcu() in __ep_remove()") had caused the critical-jOPS metric of
>> the SPECjbb 2015 benchmark [1] to drop by around 50% even though it generally
>> reduced kernel overhead.  Indeed, it was found during further investigation
>> that the total interrupt rate while running the SPECjbb workload had fallen as
>> a result of that commit by 55% and the local timer interrupt rate had fallen
>> by
>> almost 80%.
> 
> I ran SPECjbb2015 with and it doubles critical-jOPS and basically makes it
> "normal" again. Thanks!
> 
> Reported-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
> 

I'll go take a look in-depth; honestly, the statistical test of
get_typical_interval() is somewhat black magic to me both before and after
4/5, so if that actually works better, fine with me.
I'll run some tests, too.
Rafael J. Wysocki Feb. 7, 2025, 3:35 p.m. UTC | #3
On Fri, Feb 7, 2025 at 4:24 PM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 2/7/25 14:48, Artem Bityutskiy wrote:
> > Hi,
> >
> > thanks for the patches!
> >
> > On Thu, 2025-02-06 at 15:21 +0100, Rafael J. Wysocki wrote:
> >> Hi Everyone,
> >>
> >> This work had been triggered by a report that commit 0611a640e60a ("eventpoll:
> >> prefer kfree_rcu() in __ep_remove()") had caused the critical-jOPS metric of
> >> the SPECjbb 2015 benchmark [1] to drop by around 50% even though it generally
> >> reduced kernel overhead.  Indeed, it was found during further investigation
> >> that the total interrupt rate while running the SPECjbb workload had fallen as
> >> a result of that commit by 55% and the local timer interrupt rate had fallen
> >> by
> >> almost 80%.
> >
> > I ran SPECjbb2015 with and it doubles critical-jOPS and basically makes it
> > "normal" again. Thanks!
> >
> > Reported-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
> > Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
> >
>
> I'll go take a look in-depth, honestly the statistical test of
> get_typical_interval() is somewhat black magic to me before and after
> 4/5, so if that actually works better fine with me.
> I'll run some tests, too.

Thank you!
Rafael J. Wysocki Feb. 7, 2025, 3:45 p.m. UTC | #4
On Fri, Feb 7, 2025 at 3:48 PM Artem Bityutskiy
<artem.bityutskiy@linux.intel.com> wrote:
>
> Hi,
>
> thanks for the patches!
>
> On Thu, 2025-02-06 at 15:21 +0100, Rafael J. Wysocki wrote:
> > Hi Everyone,
> >
> > This work had been triggered by a report that commit 0611a640e60a ("eventpoll:
> > prefer kfree_rcu() in __ep_remove()") had caused the critical-jOPS metric of
> > the SPECjbb 2015 benchmark [1] to drop by around 50% even though it generally
> > reduced kernel overhead.  Indeed, it was found during further investigation
> > that the total interrupt rate while running the SPECjbb workload had fallen as
> > a result of that commit by 55% and the local timer interrupt rate had fallen
> > by
> > almost 80%.
>
> I ran SPECjbb2015 with and it doubles critical-jOPS and basically makes it
> "normal" again. Thanks!
>
> Reported-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>

Thank you!
Christian Loehle Feb. 10, 2025, 2:15 p.m. UTC | #5
On 2/6/25 14:21, Rafael J. Wysocki wrote:
> Hi Everyone,
> 
> This work had been triggered by a report that commit 0611a640e60a ("eventpoll:
> prefer kfree_rcu() in __ep_remove()") had caused the critical-jOPS metric of
> the SPECjbb 2015 benchmark [1] to drop by around 50% even though it generally
> reduced kernel overhead.  Indeed, it was found during further investigation
> that the total interrupt rate while running the SPECjbb workload had fallen as
> a result of that commit by 55% and the local timer interrupt rate had fallen by
> almost 80%.
> 
> That turned out to cause the menu cpuidle governor to select the deepest idle
> state supplied by the cpuidle driver (intel_idle) much more often which added
> significantly more idle state latency to the workload and that led to the
> decrease of the critical-jOPS score.
> 
> Interestingly enough, this problem was not visible when the teo cpuidle
> governor was used instead of menu, so it appeared to be specific to the
> latter.  CPU wakeup event statistics collected while running the workload
> indicated that the menu governor was effectively ignoring non-timer wakeup
> information and all of its idle state selection decisions appeared to be
> based on timer wakeups only.  Thus, it appeared that the reduction of the
> local timer interrupt rate caused the governor to predict a idle duration
> much more often while running the workload and the deepest idle state was
> selected significantly more often as a result of that.
> 
> A subsequent inspection of the get_typical_interval() function in the menu
> governor indicated that it might return UINT_MAX too often which then caused
> the governor's decisions to be based entirely on information related to timers.
> 
> Generally speaking, UINT_MAX is returned by get_typical_interval() if it
> cannot make a prediction based on the most recent idle intervals data with
> sufficiently high confidence, but at least in some cases this means that
> useful information is not taken into account at all which may lead to
> significant idle state selection mistakes.  Moreover, this is not really
> unlikely to happen.
> 
> One issue with get_typical_interval() is that, when it eliminates outliers from
> the sample set in an attempt to reduce the standard deviation (and so improve
> the prediction confidence), it does that by dropping high-end samples only,
> while samples at the low end of the set are retained.  However, the samples
> at the low end very well may be the outliers and they should be eliminated
> from the sample set instead of the high-end samples.  Accordingly, the
> likelihood of making a meaningful idle duration prediction can be improved
> by making it also eliminate low-end samples if they are farther from the
> average than high-end samples.  This is done in patch [4/5].
> 
> Another issue is that get_typical_interval() gives up after eliminating 1/4
> of the samples if the standard deviation is still not as low as desired (within
> 1/6 of the average or within 20 us if the average is close to 0), but the
> remaining samples in the set still represent useful information at that point
> and discarding them altogether may lead to suboptimal idle state selection.
> 
> For instance, the largest idle duration value in the get_typical_interval()
> data set is the maximum idle duration observed recently and it is likely that
> the upcoming idle duration will not exceed it.  Therefore, in the absence of
> a better choice, this value can be used as an upper bound on the target
> residency of the idle state to select.  Patch [5/5] works along these lines,
> but it takes the maximum data point remaining after the elimination of
> outliers.
> 
> The first two patches in the series are straightforward cleanups (in fact,
> the first patch is kind of reversed by patch [4/5], but it is there because
> it can be applied without the latter) and patch [3/5] is a cosmetic change
> made in preparation for the subsequent ones.
> 
> This series turns out to restore the SPECjbb critical-jOPS metric on affected
> systems to the level from before commit 0611a640e60a and it also happens to
> increase its max-jOPS metric by around 3%.
> 
> For easier reference/testing it is present in the git branch at
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=experimental/menu
> 
> based on the cpuidle material that went into 6.14-rc1.
> 
> If possible, please let me know if it works for you.
> 
> Thanks!
> 
> 
> [1] Link: https://www.spec.org/jbb2015/

5/5 shows significant IO workload improvements (the shorter wakeup scenario is
much more likely to be picked up now).
I don't see a significant regression in idle misses so far. I'll try Android
backports soon and some other system.

Here's a full dump; sorry, it's from a different system (rk3588, only two idle
states), as eth networking is apparently broken now on 6.14-rc1 on rk3399 :(

For dm-delay 51ms (dm-slow) the command is (8 CPUs)
fio --minimal --time_based --group_reporting --name=fiotest --filename=/dev/mapper/dm-slow --runtime=30s --numjobs=16 --rw=randread --bs=4k --ioengine=psync --iodepth=1 --direct=1
For the rest:
fio --minimal --time_based --name=fiotest --filename=/dev/mmcblk1 --runtime=30s --rw=randread --bs=4k --ioengine=psync --iodepth=1 --direct=1

device	 gov	 iter	 iops	 idles	 idle_misses	 idle_miss_ratio	 belows	 aboves	
mapper/dm-slow 	menu 	0 	307 	99648 	318 	0.003 	318 	0
mapper/dm-slow 	menu 	1 	307 	100948 	389 	0.004 	389 	0
mapper/dm-slow 	menu 	2 	307 	99512 	380 	0.004 	380 	0
mapper/dm-slow 	menu 	3 	307 	99212 	307 	0.003 	307 	0
mapper/dm-slow 	menu 	4 	307 	100156 	343 	0.003 	343 	0
mapper/dm-slow 	menu-1 	0 	307 	97434 	260 	0.003 	260 	0
mapper/dm-slow 	menu-1 	1 	307 	94628 	324 	0.003 	324 	0
mapper/dm-slow 	menu-1 	2 	307 	98004 	248 	0.003 	248 	0
mapper/dm-slow 	menu-1 	3 	307 	97524 	263 	0.003 	263 	0
mapper/dm-slow 	menu-1 	4 	307 	97048 	304 	0.003 	304 	0
mapper/dm-slow 	menu-2 	0 	307 	98340 	376 	0.004 	376 	0
mapper/dm-slow 	menu-2 	1 	307 	96246 	275 	0.003 	275 	0
mapper/dm-slow 	menu-2 	2 	307 	96456 	317 	0.003 	317 	0
mapper/dm-slow 	menu-2 	3 	307 	100054 	268 	0.003 	268 	0
mapper/dm-slow 	menu-2 	4 	307 	93378 	288 	0.003 	288 	0
mapper/dm-slow 	menu-3 	0 	307 	95140 	303 	0.003 	303 	0
mapper/dm-slow 	menu-3 	1 	307 	95858 	318 	0.003 	318 	0
mapper/dm-slow 	menu-3 	2 	307 	100528 	302 	0.003 	302 	0
mapper/dm-slow 	menu-3 	3 	307 	98274 	311 	0.003 	311 	0
mapper/dm-slow 	menu-3 	4 	307 	98428 	327 	0.003 	327 	0
mapper/dm-slow 	menu-4 	0 	307 	100340 	304 	0.003 	304 	0
mapper/dm-slow 	menu-4 	1 	307 	101628 	359 	0.004 	359 	0
mapper/dm-slow 	menu-4 	2 	307 	100624 	281 	0.003 	281 	0
mapper/dm-slow 	menu-4 	3 	307 	99824 	340 	0.003 	340 	0
mapper/dm-slow 	menu-4 	4 	307 	98318 	290 	0.003 	290 	0
mapper/dm-slow 	menu-5 	0 	307 	96842 	310 	0.003 	310 	0
mapper/dm-slow 	menu-5 	1 	307 	98884 	271 	0.003 	271 	0
mapper/dm-slow 	menu-5 	2 	307 	99706 	259 	0.003 	259 	0
mapper/dm-slow 	menu-5 	3 	307 	93096 	270 	0.003 	270 	0
mapper/dm-slow 	menu-5 	4 	307 	101590 	333 	0.003 	333 	0
mapper/dm-slow 	menu-m 	0 	307 	94270 	297 	0.003 	297 	0
mapper/dm-slow 	menu-m 	1 	307 	99820 	355 	0.004 	355 	0
mapper/dm-slow 	menu-m 	2 	307 	99284 	313 	0.003 	313 	0
mapper/dm-slow 	menu-m 	3 	307 	99320 	288 	0.003 	288 	0
mapper/dm-slow 	menu-m 	4 	307 	99666 	269 	0.003 	269 	0
mmcblk1 	menu 	0 	818 	227246 	32716 	0.144 	32716 	0
mmcblk1 	menu 	1 	818 	252552 	33582 	0.133 	33582 	0
mmcblk1 	menu 	2 	825 	255822 	31958 	0.125 	31958 	0
mmcblk1 	menu 	3 	822 	255814 	33374 	0.130 	33374 	0
mmcblk1 	menu 	4 	822 	253200 	33310 	0.132 	33310 	0
mmcblk1 	menu-1 	0 	822 	254768 	33545 	0.132 	33545 	0
mmcblk1 	menu-1 	1 	819 	249476 	33289 	0.133 	33289 	0
mmcblk1 	menu-1 	2 	823 	256152 	32838 	0.128 	32838 	0
mmcblk1 	menu-1 	3 	824 	231098 	31120 	0.135 	31120 	0
mmcblk1 	menu-1 	4 	820 	254590 	33189 	0.130 	33189 	0
mmcblk1 	menu-2 	0 	824 	256084 	32927 	0.129 	32927 	0
mmcblk1 	menu-2 	1 	806 	240166 	33672 	0.140 	33672 	0
mmcblk1 	menu-2 	2 	808 	253178 	33963 	0.134 	33963 	0
mmcblk1 	menu-2 	3 	822 	240628 	32860 	0.137 	32860 	0
mmcblk1 	menu-2 	4 	811 	251522 	33478 	0.133 	33478 	0
mmcblk1 	menu-3 	0 	810 	251914 	32477 	0.129 	32477 	0
mmcblk1 	menu-3 	1 	811 	253324 	32344 	0.128 	32344 	0
mmcblk1 	menu-3 	2 	826 	239634 	31478 	0.131 	31478 	0
mmcblk1 	menu-3 	3 	811 	252462 	33810 	0.134 	33810 	0
mmcblk1 	menu-3 	4 	806 	231730 	33646 	0.145 	33646 	0
mmcblk1 	menu-4 	0 	826 	231986 	32301 	0.139 	32301 	0
mmcblk1 	menu-4 	1 	821 	256988 	34290 	0.133 	34290 	0
mmcblk1 	menu-4 	2 	805 	247456 	35092 	0.142 	35092 	0
mmcblk1 	menu-4 	3 	807 	255072 	35291 	0.138 	35291 	0
mmcblk1 	menu-4 	4 	808 	255076 	35222 	0.138 	35222 	0
mmcblk1 	menu-5 	0 	861 	308822 	26267 	0.085 	26267 	0
mmcblk1 	menu-5 	1 	835 	288153 	26496 	0.092 	26496 	0
mmcblk1 	menu-5 	2 	841 	304148 	26916 	0.088 	26916 	0
mmcblk1 	menu-5 	3 	858 	304838 	26347 	0.086 	26347 	0
mmcblk1 	menu-5 	4 	859 	303370 	26090 	0.086 	26090 	0
mmcblk1 	menu-m 	0 	811 	243486 	33215 	0.136 	33215 	0
mmcblk1 	menu-m 	1 	827 	256902 	32863 	0.128 	32863 	0
mmcblk1 	menu-m 	2 	807 	249032 	34080 	0.137 	34080 	0
mmcblk1 	menu-m 	3 	809 	253537 	33718 	0.133 	33718 	0
mmcblk1 	menu-m 	4 	824 	241996 	32842 	0.136 	32842 	0
mmcblk1 	teo 	0 	874 	346720 	18326 	0.053 	18326 	0
mmcblk1 	teo 	1 	889 	350712 	19364 	0.055 	19364 	0
mmcblk1 	teo 	2 	874 	341195 	19004 	0.056 	19004 	0
mmcblk1 	teo 	3 	870 	343718 	18770 	0.055 	18770 	0
mmcblk1 	teo 	4 	871 	321152 	18415 	0.057 	18415 	0
nvme0n1 	menu 	0 	11546 	819014 	110717 	0.135 	110717 	0
nvme0n1 	menu 	1 	10507 	745534 	86297 	0.116 	86297 	0
nvme0n1 	menu 	2 	11758 	829030 	110667 	0.133 	110667 	0
nvme0n1 	menu 	3 	10762 	768898 	93655 	0.122 	93655 	0
nvme0n1 	menu 	4 	11719 	820536 	110456 	0.135 	110456 	0
nvme0n1 	menu-1 	0 	11409 	811374 	111285 	0.137 	111285 	0
nvme0n1 	menu-1 	1 	11432 	805208 	108621 	0.135 	108621 	0
nvme0n1 	menu-1 	2 	11154 	781534 	100566 	0.129 	100566 	0
nvme0n1 	menu-1 	3 	10180 	724944 	73523 	0.101 	73523 	0
nvme0n1 	menu-1 	4 	11667 	827804 	110505 	0.133 	110505 	0
nvme0n1 	menu-2 	0 	11091 	791998 	105824 	0.134 	105824 	0
nvme0n1 	menu-2 	1 	10664 	748122 	90282 	0.121 	90282 	0
nvme0n1 	menu-2 	2 	10921 	773806 	95668 	0.124 	95668 	0
nvme0n1 	menu-2 	3 	11445 	807918 	112475 	0.139 	112475 	0
nvme0n1 	menu-2 	4 	10629 	761546 	90181 	0.118 	90181 	0
nvme0n1 	menu-3 	0 	10330 	723824 	74813 	0.103 	74813 	0
nvme0n1 	menu-3 	1 	10242 	717762 	74187 	0.103 	74187 	0
nvme0n1 	menu-3 	2 	10579 	754108 	86841 	0.115 	86841 	0
nvme0n1 	menu-3 	3 	10161 	730416 	76722 	0.105 	76722 	0
nvme0n1 	menu-3 	4 	11665 	820052 	112621 	0.137 	112621 	0
nvme0n1 	menu-4 	0 	11279 	789456 	106411 	0.135 	106411 	0
nvme0n1 	menu-4 	1 	11095 	766714 	98036 	0.128 	98036 	0
nvme0n1 	menu-4 	2 	11003 	786088 	98979 	0.126 	98979 	0
nvme0n1 	menu-4 	3 	10371 	746978 	77039 	0.103 	77039 	0
nvme0n1 	menu-4 	4 	10761 	770218 	89958 	0.117 	89958 	0
nvme0n1 	menu-5 	0 	13243 	926672 	514 	0.001 	514 	0
nvme0n1 	menu-5 	1 	14235 	985852 	1054 	0.001 	1054 	0
nvme0n1 	menu-5 	2 	13032 	911560 	506 	0.001 	506 	0
nvme0n1 	menu-5 	3 	13074 	917252 	691 	0.001 	691 	0
nvme0n1 	menu-5 	4 	13361 	933126 	466 	0.000 	466 	0
nvme0n1 	menu-m 	0 	10290 	739468 	73692 	0.100 	73692 	0
nvme0n1 	menu-m 	1 	10647 	763144 	80430 	0.105 	80430 	0
nvme0n1 	menu-m 	2 	11067 	790362 	98525 	0.125 	98525 	0
nvme0n1 	menu-m 	3 	11337 	806888 	102446 	0.127 	102446 	0
nvme0n1 	menu-m 	4 	11519 	818128 	110233 	0.135 	110233 	0
nvme0n1 	teo 	0 	14267 	994532 	273 	0.000 	273 	0
nvme0n1 	teo 	1 	13857 	965726 	395 	0.000 	395 	0
nvme0n1 	teo 	2 	12762 	892900 	311 	0.000 	311 	0
nvme0n1 	teo 	3 	13056 	900172 	269 	0.000 	269 	0
nvme0n1 	teo 	4 	13687 	956048 	240 	0.000 	240 	0
sda 	menu 	0 	1943 	1044428 	162298 	0.155 	162298 	0
sda 	menu 	1 	1601 	860152 	232733 	0.271 	232733 	0
sda 	menu 	2 	1947 	1089550 	154879 	0.142 	154879 	0
sda 	menu 	3 	1917 	992278 	146316 	0.147 	146316 	0
sda 	menu 	4 	1706 	947224 	257686 	0.272 	257686 	0
sda 	menu-1 	0 	1981 	1109204 	174590 	0.157 	174590 	0
sda 	menu-1 	1 	1778 	989142 	271685 	0.275 	271685 	0
sda 	menu-1 	2 	1759 	955310 	252735 	0.265 	252735 	0
sda 	menu-1 	3 	1818 	985389 	180365 	0.183 	180365 	0
sda 	menu-1 	4 	1782 	915060 	247016 	0.270 	247016 	0
sda 	menu-2 	0 	1877 	959734 	181691 	0.189 	181691 	0
sda 	menu-2 	1 	1718 	961724 	262950 	0.273 	262950 	0
sda 	menu-2 	2 	1751 	949092 	259223 	0.273 	259223 	0
sda 	menu-2 	3 	1808 	1011822 	211016 	0.209 	211016 	0
sda 	menu-2 	4 	1734 	959348 	261769 	0.273 	261769 	0
sda 	menu-3 	0 	1723 	952826 	260493 	0.273 	260493 	0
sda 	menu-3 	1 	1718 	931974 	254462 	0.273 	254462 	0
sda 	menu-3 	2 	1773 	984232 	239335 	0.243 	239335 	0
sda 	menu-3 	3 	1741 	969477 	265131 	0.273 	265131 	0
sda 	menu-3 	4 	1735 	970372 	263907 	0.272 	263907 	0
sda 	menu-4 	0 	1911 	1030290 	170538 	0.166 	170538 	0
sda 	menu-4 	1 	1769 	972168 	233029 	0.240 	233029 	0
sda 	menu-4 	2 	1737 	969896 	260880 	0.269 	260880 	0
sda 	menu-4 	3 	1738 	941298 	253874 	0.270 	253874 	0
sda 	menu-4 	4 	1701 	953710 	258250 	0.271 	258250 	0
sda 	menu-5 	0 	2463 	1349556 	26158 	0.019 	26158 	0
sda 	menu-5 	1 	2359 	1344306 	80343 	0.060 	80343 	0
sda 	menu-5 	2 	2280 	1306554 	115670 	0.089 	115670 	0
sda 	menu-5 	3 	2573 	1420702 	4765 	0.003 	4765 	0
sda 	menu-5 	4 	2348 	1355996 	70428 	0.052 	70428 	0
sda 	menu-m 	0 	1738 	962150 	261205 	0.271 	261205 	0
sda 	menu-m 	1 	1667 	922214 	238208 	0.258 	238208 	0
sda 	menu-m 	2 	1696 	911352 	255364 	0.280 	255364 	0
sda 	menu-m 	3 	1840 	1006556 	193333 	0.192 	193333 	0
sda 	menu-m 	4 	1681 	919693 	251029 	0.273 	251029 	0
sda 	teo 	0 	2503 	1427634 	25997 	0.018 	25997 	0
sda 	teo 	1 	2424 	1401434 	35228 	0.025 	35228 	0
sda 	teo 	2 	2527 	1454382 	27546 	0.019 	27546 	0
sda 	teo 	3 	2481 	1430128 	16678 	0.012 	16678 	0
sda 	teo 	4 	2589 	1481254 	13389 	0.009 	13389 	0
nullb0 	menu 	0 	337827 	88502 	200 	0.002 	200 	0
nullb0 	menu 	1 	337833 	87476 	188 	0.002 	188 	0
nullb0 	menu 	2 	336378 	88862 	92 	0.001 	92 	0
nullb0 	menu 	3 	336022 	86174 	188 	0.002 	188 	0
nullb0 	menu 	4 	335158 	87880 	156 	0.002 	156 	0
nullb0 	menu-1 	0 	334663 	89150 	199 	0.002 	199 	0
nullb0 	menu-1 	1 	338526 	88184 	111 	0.001 	111 	0
nullb0 	menu-1 	2 	336671 	89210 	211 	0.002 	211 	0
nullb0 	menu-1 	3 	337454 	82408 	198 	0.002 	198 	0
nullb0 	menu-1 	4 	338256 	86994 	118 	0.001 	118 	0
nullb0 	menu-2 	0 	336636 	82202 	165 	0.002 	165 	0
nullb0 	menu-2 	1 	337580 	77918 	171 	0.002 	171 	0
nullb0 	menu-2 	2 	336260 	89198 	226 	0.003 	226 	0
nullb0 	menu-2 	3 	338440 	85444 	215 	0.003 	215 	0
nullb0 	menu-2 	4 	333633 	87244 	119 	0.001 	119 	0
nullb0 	menu-3 	0 	336890 	88096 	122 	0.001 	122 	0
nullb0 	menu-3 	1 	335804 	68502 	79 	0.001 	79 	0
nullb0 	menu-3 	2 	336863 	87258 	195 	0.002 	195 	0
nullb0 	menu-3 	3 	337091 	76452 	127 	0.002 	127 	0
nullb0 	menu-3 	4 	336142 	80664 	83 	0.001 	83 	0
nullb0 	menu-4 	0 	336840 	86936 	128 	0.001 	128 	0
nullb0 	menu-4 	1 	334498 	88792 	113 	0.001 	113 	0
nullb0 	menu-4 	2 	336736 	88542 	104 	0.001 	104 	0
nullb0 	menu-4 	3 	336476 	64548 	70 	0.001 	70 	0
nullb0 	menu-4 	4 	337513 	84776 	107 	0.001 	107 	0
nullb0 	menu-5 	0 	338498 	89216 	135 	0.002 	135 	0
nullb0 	menu-5 	1 	335087 	87424 	85 	0.001 	85 	0
nullb0 	menu-5 	2 	336965 	75456 	179 	0.002 	179 	0
nullb0 	menu-5 	3 	337415 	88112 	114 	0.001 	114 	0
nullb0 	menu-5 	4 	332365 	76456 	82 	0.001 	82 	0
nullb0 	menu-m 	0 	337718 	88018 	125 	0.001 	125 	0
nullb0 	menu-m 	1 	337801 	86584 	164 	0.002 	164 	0
nullb0 	menu-m 	2 	336760 	84262 	102 	0.001 	102 	0
nullb0 	menu-m 	3 	337524 	87902 	147 	0.002 	147 	0
nullb0 	menu-m 	4 	333724 	87916 	117 	0.001 	117 	0
nullb0 	teo 	0 	336215 	88312 	231 	0.003 	231 	0
nullb0 	teo 	1 	337653 	88802 	266 	0.003 	266 	0
nullb0 	teo 	2 	337198 	87960 	234 	0.003 	234 	0
nullb0 	teo 	3 	338716 	88516 	227 	0.003 	227 	0
nullb0 	teo 	4 	336334 	88978 	261 	0.003 	261 	0
Rafael J. Wysocki Feb. 10, 2025, 2:43 p.m. UTC | #6
On Mon, Feb 10, 2025 at 3:15 PM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 2/6/25 14:21, Rafael J. Wysocki wrote:
> > Hi Everyone,
> >
> > This work had been triggered by a report that commit 0611a640e60a ("eventpoll:
> > prefer kfree_rcu() in __ep_remove()") had caused the critical-jOPS metric of
> > the SPECjbb 2015 benchmark [1] to drop by around 50% even though it generally
> > reduced kernel overhead.  Indeed, it was found during further investigation
> > that the total interrupt rate while running the SPECjbb workload had fallen as
> > a result of that commit by 55% and the local timer interrupt rate had fallen by
> > almost 80%.
> >
> > That turned out to cause the menu cpuidle governor to select the deepest idle
> > state supplied by the cpuidle driver (intel_idle) much more often which added
> > significantly more idle state latency to the workload and that led to the
> > decrease of the critical-jOPS score.
> >
> > Interestingly enough, this problem was not visible when the teo cpuidle
> > governor was used instead of menu, so it appeared to be specific to the
> > latter.  CPU wakeup event statistics collected while running the workload
> > indicated that the menu governor was effectively ignoring non-timer wakeup
> > information and all of its idle state selection decisions appeared to be
> > based on timer wakeups only.  Thus, it appeared that the reduction of the
> > local timer interrupt rate caused the governor to predict a idle duration
> > much more often while running the workload and the deepest idle state was
> > selected significantly more often as a result of that.
> >
> > A subsequent inspection of the get_typical_interval() function in the menu
> > governor indicated that it might return UINT_MAX too often which then caused
> > the governor's decisions to be based entirely on information related to timers.
> >
> > Generally speaking, UINT_MAX is returned by get_typical_interval() if it
> > cannot make a prediction based on the most recent idle intervals data with
> > sufficiently high confidence, but at least in some cases this means that
> > useful information is not taken into account at all which may lead to
> > significant idle state selection mistakes.  Moreover, this is not really
> > unlikely to happen.
> >
> > One issue with get_typical_interval() is that, when it eliminates outliers from
> > the sample set in an attempt to reduce the standard deviation (and so improve
> > the prediction confidence), it does that by dropping high-end samples only,
> > while samples at the low end of the set are retained.  However, the samples
> > at the low end very well may be the outliers and they should be eliminated
> > from the sample set instead of the high-end samples.  Accordingly, the
> > likelihood of making a meaningful idle duration prediction can be improved
> > by making it also eliminate low-end samples if they are farther from the
> > average than high-end samples.  This is done in patch [4/5].
> >
> > Another issue is that get_typical_interval() gives up after eliminating 1/4
> > of the samples if the standard deviation is still not as low as desired (within
> > 1/6 of the average or within 20 us if the average is close to 0), but the
> > remaining samples in the set still represent useful information at that point
> > and discarding them altogether may lead to suboptimal idle state selection.
> >
> > For instance, the largest idle duration value in the get_typical_interval()
> > data set is the maximum idle duration observed recently and it is likely that
> > the upcoming idle duration will not exceed it.  Therefore, in the absence of
> > a better choice, this value can be used as an upper bound on the target
> > residency of the idle state to select.  Patch [5/5] works along these lines,
> > but it takes the maximum data point remaining after the elimination of
> > outliers.
> >
> > The first two patches in the series are straightforward cleanups (in fact,
> > the first patch is kind of reversed by patch [4/5], but it is there because
> > it can be applied without the latter) and patch [3/5] is a cosmetic change
> > made in preparation for the subsequent ones.
> >
> > This series turns out to restore the SPECjbb critical-jOPS metric on affected
> > systems to the level from before commit 0611a640e60a and it also happens to
> > increase its max-jOPS metric by around 3%.
> >
> > For easier reference/testing it is present in the git branch at
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=experimental/menu
> >
> > based on the cpuidle material that went into 6.14-rc1.
> >
> > If possible, please let me know if it works for you.
> >
> > Thanks!
> >
> >
> > [1] Link: https://www.spec.org/jbb2015/
>
> 5/5 shows significant IO workload improvements (the shorter wakeup scenario is
> much more likely to be picked up now).
> I don't see a significant regression in idle misses so far, I'll try Android
> backports soon and some other system.

Sounds good, thanks!

> Here's a full dump, sorry it's from a different system (rk3588, only two idle
> states), apparently eth networking is broken on 6.14-rc1 now on rk3399 :(
>
> For dm-delay 51ms (dm-slow) the command is (8 CPUs)
> fio --minimal --time_based --group_reporting --name=fiotest --filename=/dev/mapper/dm-slow --runtime=30s --numjobs=16 --rw=randread --bs=4k --ioengine=psync --iodepth=1 --direct=1
> For the rest:
> fio --minimal --time_based --name=fiotest --filename=/dev/mmcblk1 --runtime=30s --rw=randread --bs=4k --ioengine=psync --iodepth=1 --direct=1

Thanks for the data!

Do I understand correctly that menu-X is menu with patches [1-X/5]
applied?  And what's menu-m?

> device   gov     iter    iops    idles   idle_misses     idle_miss_ratio         belows  aboves
> mapper/dm-slow  menu    0       307     99648   318     0.003   318     0
> mapper/dm-slow  menu    1       307     100948  389     0.004   389     0
> mapper/dm-slow  menu    2       307     99512   380     0.004   380     0
> mapper/dm-slow  menu    3       307     99212   307     0.003   307     0
> mapper/dm-slow  menu    4       307     100156  343     0.003   343     0
> mapper/dm-slow  menu-1  0       307     97434   260     0.003   260     0
> mapper/dm-slow  menu-1  1       307     94628   324     0.003   324     0
> mapper/dm-slow  menu-1  2       307     98004   248     0.003   248     0
> mapper/dm-slow  menu-1  3       307     97524   263     0.003   263     0
> mapper/dm-slow  menu-1  4       307     97048   304     0.003   304     0
> mapper/dm-slow  menu-2  0       307     98340   376     0.004   376     0
> mapper/dm-slow  menu-2  1       307     96246   275     0.003   275     0
> mapper/dm-slow  menu-2  2       307     96456   317     0.003   317     0
> mapper/dm-slow  menu-2  3       307     100054  268     0.003   268     0
> mapper/dm-slow  menu-2  4       307     93378   288     0.003   288     0
> mapper/dm-slow  menu-3  0       307     95140   303     0.003   303     0
> mapper/dm-slow  menu-3  1       307     95858   318     0.003   318     0
> mapper/dm-slow  menu-3  2       307     100528  302     0.003   302     0
> mapper/dm-slow  menu-3  3       307     98274   311     0.003   311     0
> mapper/dm-slow  menu-3  4       307     98428   327     0.003   327     0
> mapper/dm-slow  menu-4  0       307     100340  304     0.003   304     0
> mapper/dm-slow  menu-4  1       307     101628  359     0.004   359     0
> mapper/dm-slow  menu-4  2       307     100624  281     0.003   281     0
> mapper/dm-slow  menu-4  3       307     99824   340     0.003   340     0
> mapper/dm-slow  menu-4  4       307     98318   290     0.003   290     0
> mapper/dm-slow  menu-5  0       307     96842   310     0.003   310     0
> mapper/dm-slow  menu-5  1       307     98884   271     0.003   271     0
> mapper/dm-slow  menu-5  2       307     99706   259     0.003   259     0
> mapper/dm-slow  menu-5  3       307     93096   270     0.003   270     0
> mapper/dm-slow  menu-5  4       307     101590  333     0.003   333     0
> mapper/dm-slow  menu-m  0       307     94270   297     0.003   297     0
> mapper/dm-slow  menu-m  1       307     99820   355     0.004   355     0
> mapper/dm-slow  menu-m  2       307     99284   313     0.003   313     0
> mapper/dm-slow  menu-m  3       307     99320   288     0.003   288     0
> mapper/dm-slow  menu-m  4       307     99666   269     0.003   269     0
> mmcblk1         menu    0       818     227246  32716   0.144   32716   0
> mmcblk1         menu    1       818     252552  33582   0.133   33582   0
> mmcblk1         menu    2       825     255822  31958   0.125   31958   0
> mmcblk1         menu    3       822     255814  33374   0.130   33374   0
> mmcblk1         menu    4       822     253200  33310   0.132   33310   0
> mmcblk1         menu-1  0       822     254768  33545   0.132   33545   0
> mmcblk1         menu-1  1       819     249476  33289   0.133   33289   0
> mmcblk1         menu-1  2       823     256152  32838   0.128   32838   0
> mmcblk1         menu-1  3       824     231098  31120   0.135   31120   0
> mmcblk1         menu-1  4       820     254590  33189   0.130   33189   0
> mmcblk1         menu-2  0       824     256084  32927   0.129   32927   0
> mmcblk1         menu-2  1       806     240166  33672   0.140   33672   0
> mmcblk1         menu-2  2       808     253178  33963   0.134   33963   0
> mmcblk1         menu-2  3       822     240628  32860   0.137   32860   0
> mmcblk1         menu-2  4       811     251522  33478   0.133   33478   0
> mmcblk1         menu-3  0       810     251914  32477   0.129   32477   0
> mmcblk1         menu-3  1       811     253324  32344   0.128   32344   0
> mmcblk1         menu-3  2       826     239634  31478   0.131   31478   0
> mmcblk1         menu-3  3       811     252462  33810   0.134   33810   0
> mmcblk1         menu-3  4       806     231730  33646   0.145   33646   0
> mmcblk1         menu-4  0       826     231986  32301   0.139   32301   0
> mmcblk1         menu-4  1       821     256988  34290   0.133   34290   0
> mmcblk1         menu-4  2       805     247456  35092   0.142   35092   0
> mmcblk1         menu-4  3       807     255072  35291   0.138   35291   0
> mmcblk1         menu-4  4       808     255076  35222   0.138   35222   0
> mmcblk1         menu-5  0       861     308822  26267   0.085   26267   0
> mmcblk1         menu-5  1       835     288153  26496   0.092   26496   0
> mmcblk1         menu-5  2       841     304148  26916   0.088   26916   0
> mmcblk1         menu-5  3       858     304838  26347   0.086   26347   0
> mmcblk1         menu-5  4       859     303370  26090   0.086   26090   0
> mmcblk1         menu-m  0       811     243486  33215   0.136   33215   0
> mmcblk1         menu-m  1       827     256902  32863   0.128   32863   0
> mmcblk1         menu-m  2       807     249032  34080   0.137   34080   0
> mmcblk1         menu-m  3       809     253537  33718   0.133   33718   0
> mmcblk1         menu-m  4       824     241996  32842   0.136   32842   0
> mmcblk1         teo     0       874     346720  18326   0.053   18326   0
> mmcblk1         teo     1       889     350712  19364   0.055   19364   0
> mmcblk1         teo     2       874     341195  19004   0.056   19004   0
> mmcblk1         teo     3       870     343718  18770   0.055   18770   0
> mmcblk1         teo     4       871     321152  18415   0.057   18415   0
> nvme0n1         menu    0       11546   819014  110717  0.135   110717  0
> nvme0n1         menu    1       10507   745534  86297   0.116   86297   0
> nvme0n1         menu    2       11758   829030  110667  0.133   110667  0
> nvme0n1         menu    3       10762   768898  93655   0.122   93655   0
> nvme0n1         menu    4       11719   820536  110456  0.135   110456  0
> nvme0n1         menu-1  0       11409   811374  111285  0.137   111285  0
> nvme0n1         menu-1  1       11432   805208  108621  0.135   108621  0
> nvme0n1         menu-1  2       11154   781534  100566  0.129   100566  0
> nvme0n1         menu-1  3       10180   724944  73523   0.101   73523   0
> nvme0n1         menu-1  4       11667   827804  110505  0.133   110505  0
> nvme0n1         menu-2  0       11091   791998  105824  0.134   105824  0
> nvme0n1         menu-2  1       10664   748122  90282   0.121   90282   0
> nvme0n1         menu-2  2       10921   773806  95668   0.124   95668   0
> nvme0n1         menu-2  3       11445   807918  112475  0.139   112475  0
> nvme0n1         menu-2  4       10629   761546  90181   0.118   90181   0
> nvme0n1         menu-3  0       10330   723824  74813   0.103   74813   0
> nvme0n1         menu-3  1       10242   717762  74187   0.103   74187   0
> nvme0n1         menu-3  2       10579   754108  86841   0.115   86841   0
> nvme0n1         menu-3  3       10161   730416  76722   0.105   76722   0
> nvme0n1         menu-3  4       11665   820052  112621  0.137   112621  0
> nvme0n1         menu-4  0       11279   789456  106411  0.135   106411  0
> nvme0n1         menu-4  1       11095   766714  98036   0.128   98036   0
> nvme0n1         menu-4  2       11003   786088  98979   0.126   98979   0
> nvme0n1         menu-4  3       10371   746978  77039   0.103   77039   0
> nvme0n1         menu-4  4       10761   770218  89958   0.117   89958   0
> nvme0n1         menu-5  0       13243   926672  514     0.001   514     0
> nvme0n1         menu-5  1       14235   985852  1054    0.001   1054    0
> nvme0n1         menu-5  2       13032   911560  506     0.001   506     0
> nvme0n1         menu-5  3       13074   917252  691     0.001   691     0
> nvme0n1         menu-5  4       13361   933126  466     0.000   466     0
> nvme0n1         menu-m  0       10290   739468  73692   0.100   73692   0
> nvme0n1         menu-m  1       10647   763144  80430   0.105   80430   0
> nvme0n1         menu-m  2       11067   790362  98525   0.125   98525   0
> nvme0n1         menu-m  3       11337   806888  102446  0.127   102446  0
> nvme0n1         menu-m  4       11519   818128  110233  0.135   110233  0
> nvme0n1         teo     0       14267   994532  273     0.000   273     0
> nvme0n1         teo     1       13857   965726  395     0.000   395     0
> nvme0n1         teo     2       12762   892900  311     0.000   311     0
> nvme0n1         teo     3       13056   900172  269     0.000   269     0
> nvme0n1         teo     4       13687   956048  240     0.000   240     0
> sda     menu    0       1943    1044428         162298  0.155   162298  0
> sda     menu    1       1601    860152  232733  0.271   232733  0
> sda     menu    2       1947    1089550         154879  0.142   154879  0
> sda     menu    3       1917    992278  146316  0.147   146316  0
> sda     menu    4       1706    947224  257686  0.272   257686  0
> sda     menu-1  0       1981    1109204         174590  0.157   174590  0
> sda     menu-1  1       1778    989142  271685  0.275   271685  0
> sda     menu-1  2       1759    955310  252735  0.265   252735  0
> sda     menu-1  3       1818    985389  180365  0.183   180365  0
> sda     menu-1  4       1782    915060  247016  0.270   247016  0
> sda     menu-2  0       1877    959734  181691  0.189   181691  0
> sda     menu-2  1       1718    961724  262950  0.273   262950  0
> sda     menu-2  2       1751    949092  259223  0.273   259223  0
> sda     menu-2  3       1808    1011822         211016  0.209   211016  0
> sda     menu-2  4       1734    959348  261769  0.273   261769  0
> sda     menu-3  0       1723    952826  260493  0.273   260493  0
> sda     menu-3  1       1718    931974  254462  0.273   254462  0
> sda     menu-3  2       1773    984232  239335  0.243   239335  0
> sda     menu-3  3       1741    969477  265131  0.273   265131  0
> sda     menu-3  4       1735    970372  263907  0.272   263907  0
> sda     menu-4  0       1911    1030290         170538  0.166   170538  0
> sda     menu-4  1       1769    972168  233029  0.240   233029  0
> sda     menu-4  2       1737    969896  260880  0.269   260880  0
> sda     menu-4  3       1738    941298  253874  0.270   253874  0
> sda     menu-4  4       1701    953710  258250  0.271   258250  0
> sda     menu-5  0       2463    1349556         26158   0.019   26158   0
> sda     menu-5  1       2359    1344306         80343   0.060   80343   0
> sda     menu-5  2       2280    1306554         115670  0.089   115670  0
> sda     menu-5  3       2573    1420702         4765    0.003   4765    0
> sda     menu-5  4       2348    1355996         70428   0.052   70428   0
> sda     menu-m  0       1738    962150  261205  0.271   261205  0
> sda     menu-m  1       1667    922214  238208  0.258   238208  0
> sda     menu-m  2       1696    911352  255364  0.280   255364  0
> sda     menu-m  3       1840    1006556         193333  0.192   193333  0
> sda     menu-m  4       1681    919693  251029  0.273   251029  0
> sda     teo     0       2503    1427634         25997   0.018   25997   0
> sda     teo     1       2424    1401434         35228   0.025   35228   0
> sda     teo     2       2527    1454382         27546   0.019   27546   0
> sda     teo     3       2481    1430128         16678   0.012   16678   0
> sda     teo     4       2589    1481254         13389   0.009   13389   0
> nullb0  menu    0       337827  88502   200     0.002   200     0
> nullb0  menu    1       337833  87476   188     0.002   188     0
> nullb0  menu    2       336378  88862   92      0.001   92      0
> nullb0  menu    3       336022  86174   188     0.002   188     0
> nullb0  menu    4       335158  87880   156     0.002   156     0
> nullb0  menu-1  0       334663  89150   199     0.002   199     0
> nullb0  menu-1  1       338526  88184   111     0.001   111     0
> nullb0  menu-1  2       336671  89210   211     0.002   211     0
> nullb0  menu-1  3       337454  82408   198     0.002   198     0
> nullb0  menu-1  4       338256  86994   118     0.001   118     0
> nullb0  menu-2  0       336636  82202   165     0.002   165     0
> nullb0  menu-2  1       337580  77918   171     0.002   171     0
> nullb0  menu-2  2       336260  89198   226     0.003   226     0
> nullb0  menu-2  3       338440  85444   215     0.003   215     0
> nullb0  menu-2  4       333633  87244   119     0.001   119     0
> nullb0  menu-3  0       336890  88096   122     0.001   122     0
> nullb0  menu-3  1       335804  68502   79      0.001   79      0
> nullb0  menu-3  2       336863  87258   195     0.002   195     0
> nullb0  menu-3  3       337091  76452   127     0.002   127     0
> nullb0  menu-3  4       336142  80664   83      0.001   83      0
> nullb0  menu-4  0       336840  86936   128     0.001   128     0
> nullb0  menu-4  1       334498  88792   113     0.001   113     0
> nullb0  menu-4  2       336736  88542   104     0.001   104     0
> nullb0  menu-4  3       336476  64548   70      0.001   70      0
> nullb0  menu-4  4       337513  84776   107     0.001   107     0
> nullb0  menu-5  0       338498  89216   135     0.002   135     0
> nullb0  menu-5  1       335087  87424   85      0.001   85      0
> nullb0  menu-5  2       336965  75456   179     0.002   179     0
> nullb0  menu-5  3       337415  88112   114     0.001   114     0
> nullb0  menu-5  4       332365  76456   82      0.001   82      0
> nullb0  menu-m  0       337718  88018   125     0.001   125     0
> nullb0  menu-m  1       337801  86584   164     0.002   164     0
> nullb0  menu-m  2       336760  84262   102     0.001   102     0
> nullb0  menu-m  3       337524  87902   147     0.002   147     0
> nullb0  menu-m  4       333724  87916   117     0.001   117     0
> nullb0  teo     0       336215  88312   231     0.003   231     0
> nullb0  teo     1       337653  88802   266     0.003   266     0
> nullb0  teo     2       337198  87960   234     0.003   234     0
> nullb0  teo     3       338716  88516   227     0.003   227     0
> nullb0  teo     4       336334  88978   261     0.003   261     0
>
Christian Loehle Feb. 10, 2025, 2:47 p.m. UTC | #7
On 2/10/25 14:43, Rafael J. Wysocki wrote:
> On Mon, Feb 10, 2025 at 3:15 PM Christian Loehle
> <christian.loehle@arm.com> wrote:
>>
>> On 2/6/25 14:21, Rafael J. Wysocki wrote:
>>> Hi Everyone,
>>>
>>> This work had been triggered by a report that commit 0611a640e60a ("eventpoll:
>>> prefer kfree_rcu() in __ep_remove()") had caused the critical-jOPS metric of
>>> the SPECjbb 2015 benchmark [1] to drop by around 50% even though it generally
>>> reduced kernel overhead.  Indeed, it was found during further investigation
>>> that the total interrupt rate while running the SPECjbb workload had fallen as
>>> a result of that commit by 55% and the local timer interrupt rate had fallen by
>>> almost 80%.
>>>
>>> That turned out to cause the menu cpuidle governor to select the deepest idle
>>> state supplied by the cpuidle driver (intel_idle) much more often which added
>>> significantly more idle state latency to the workload and that led to the
>>> decrease of the critical-jOPS score.
>>>
>>> Interestingly enough, this problem was not visible when the teo cpuidle
>>> governor was used instead of menu, so it appeared to be specific to the
>>> latter.  CPU wakeup event statistics collected while running the workload
>>> indicated that the menu governor was effectively ignoring non-timer wakeup
>>> information and all of its idle state selection decisions appeared to be
>>> based on timer wakeups only.  Thus, it appeared that the reduction of the
>>> local timer interrupt rate caused the governor to predict a idle duration
>>> much more often while running the workload and the deepest idle state was
>>> selected significantly more often as a result of that.
>>>
>>> A subsequent inspection of the get_typical_interval() function in the menu
>>> governor indicated that it might return UINT_MAX too often which then caused
>>> the governor's decisions to be based entirely on information related to timers.
>>>
>>> Generally speaking, UINT_MAX is returned by get_typical_interval() if it
>>> cannot make a prediction based on the most recent idle intervals data with
>>> sufficiently high confidence, but at least in some cases this means that
>>> useful information is not taken into account at all which may lead to
>>> significant idle state selection mistakes.  Moreover, this is not really
>>> unlikely to happen.
>>>
>>> One issue with get_typical_interval() is that, when it eliminates outliers from
>>> the sample set in an attempt to reduce the standard deviation (and so improve
>>> the prediction confidence), it does that by dropping high-end samples only,
>>> while samples at the low end of the set are retained.  However, the samples
>>> at the low end very well may be the outliers and they should be eliminated
>>> from the sample set instead of the high-end samples.  Accordingly, the
>>> likelihood of making a meaningful idle duration prediction can be improved
>>> by making it also eliminate low-end samples if they are farther from the
>>> average than high-end samples.  This is done in patch [4/5].
>>>
>>> Another issue is that get_typical_interval() gives up after eliminating 1/4
>>> of the samples if the standard deviation is still not as low as desired (within
>>> 1/6 of the average or within 20 us if the average is close to 0), but the
>>> remaining samples in the set still represent useful information at that point
>>> and discarding them altogether may lead to suboptimal idle state selection.
>>>
>>> For instance, the largest idle duration value in the get_typical_interval()
>>> data set is the maximum idle duration observed recently and it is likely that
>>> the upcoming idle duration will not exceed it.  Therefore, in the absence of
>>> a better choice, this value can be used as an upper bound on the target
>>> residency of the idle state to select.  Patch [5/5] works along these lines,
>>> but it takes the maximum data point remaining after the elimination of
>>> outliers.
>>>
>>> The first two patches in the series are straightforward cleanups (in fact,
>>> the first patch is kind of reversed by patch [4/5], but it is there because
>>> it can be applied without the latter) and patch [3/5] is a cosmetic change
>>> made in preparation for the subsequent ones.
>>>
>>> This series turns out to restore the SPECjbb critical-jOPS metric on affected
>>> systems to the level from before commit 0611a640e60a and it also happens to
>>> increase its max-jOPS metric by around 3%.
>>>
>>> For easier reference/testing it is present in the git branch at
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=experimental/menu
>>>
>>> based on the cpuidle material that went into 6.14-rc1.
>>>
>>> If possible, please let me know if it works for you.
>>>
>>> Thanks!
>>>
>>>
>>> [1] Link: https://www.spec.org/jbb2015/
>>
>> 5/5 shows significant IO workload improvements (the shorter wakeup scenario is
>> much more likely to be picked up now).
>> I don't see a significant regression in idle misses so far, I'll try Android
>> backports soon and some other system.
> 
> Sounds good, thanks!
> 
>> Here's a full dump, sorry it's from a different system (rk3588, only two idle
>> states), apparently eth networking is broken on 6.14-rc1 now on rk3399 :(
>>
>> For dm-delay 51ms (dm-slow) the command is (8 CPUs)
>> fio --minimal --time_based --group_reporting --name=fiotest --filename=/dev/mapper/dm-slow --runtime=30s --numjobs=16 --rw=randread --bs=4k --ioengine=psync --iodepth=1 --direct=1
>> For the rest:
>> fio --minimal --time_based --name=fiotest --filename=/dev/mmcblk1 --runtime=30s --rw=randread --bs=4k --ioengine=psync --iodepth=1 --direct=1
> 
> Thanks for the data!
> 
> Do I understand correctly that menu-X is menu with patches [1-X/5]
> applied?  And what's menu-m?

Correct about -1 to -5.
-m is just mainline; I changed that over time, so it's now just equivalent to menu.
I'll just run that twice in the future (to be able to check for side-effects
like thermals because they all run one after the other).
Doug Smythies Feb. 14, 2025, 4:30 a.m. UTC | #8
On 2025.02.06 06:22 Rafael J. Wysocki wrote:

> Hi Everyone,

Hi Rafael,

>
> This work had been triggered by a report that commit 0611a640e60a ("eventpoll:
> prefer kfree_rcu() in __ep_remove()") had caused the critical-jOPS metric of
... deleted ...

This is a long email. It contains test results for several recent idle governor patches:

cpuidle: teo: Cleanups and very frequent wakeups handling update
cpuidle: teo: Avoid selecting deepest idle state over-eagerly (Testing aborted, after the patch was dropped.)
cpuidle: menu: Avoid discarding useful information when processing recent idle intervals

Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
Distro: Ubuntu 24.04.1, server, no desktop GUI.

CPU frequency scaling driver: intel_pstate
HWP: disabled.
CPU frequency scaling governor: performance

Idle driver: intel_idle
Idle governor: as per individual test
Idle states: 4: name : description:
   state0/name:POLL             desc:CPUIDLE CORE POLL IDLE
   state1/name:C1_ACPI          desc:ACPI FFH MWAIT 0x0
   state2/name:C2_ACPI          desc:ACPI FFH MWAIT 0x30
   state3/name:C3_ACPI          desc:ACPI FFH MWAIT 0x60

Legend:
teo-613: teo governor - Kernel 6.13: before "cpuidle: teo: Cleanups and very frequent wakeups handling update"
menu-613: menu governor - Kernel 6.13: before "cpuidle: teo: Cleanups and very frequent wakeups handling update"
teo-614: teo governor - Kernel 6.14-rc1: Includes cpuidle: teo: Cleanups and very frequent wakeups handling update
menu-614: menu governor - Kernel 6.14-rc1: Includes cpuidle: teo: Cleanups and very frequent wakeups handling update
teo-614-p: teo governor - Kernel 6.14-rc1-p: Includes "cpuidle: teo: Avoid selecting deepest idle state over-eagerly"
menu-614-p: menu governor - Kernel 6.14-rc1-p: Includes "cpuidle: menu: Avoid discarding useful information when processing recent idle intervals"

I do a set of tests adopted over some years now.
Readers may recall that some of the tests search over a wide range of operating conditions looking for areas to focus on in more detail.
One interesting observation is that everything seems to run slower than the last time I did this, June 2024, Kernel 6.10-rc2,
which was also slower than the time before that, August 2023, Kernel 6.5-rc4.
There are some repeatability issues with the tests.

I was unable to get the "cpuidle: teo: Cleanups and very frequent wakeups handling update" patch set to apply to kernel 6.13, and so just used kernel 6.14-rc1, but that means that all the other commits
between the kernel versions are included. This could cast doubt on the test results, and indeed some differences in test results are observed with the menu idle governor, which did not change.

Test 1: System Idle

Purpose: Basic starting point test. To observe and check an idle system for excessive power consumption.

teo-613: 1.752 watts (reference: 0.0%)
menu-613: 1.909 watts (+9.0%)
teo-614: 2.199 watts (+25.51%)   <<< Test flawed. Needs to be redone. Will be less.
teo-614-2: 2.112 watts (+17.05%) <<< Re-test of teo-614. (don't care about 0.4 watts)
menu-614: 1.873 watts (+6.91%)
teo-614-p: 9.401 watts (+436.6%)  <<< Very bad regression.
menu-614-p: 1.820 watts (+3.9%)

Further details:
http://smythies.com/~doug/linux/idle/teo-6.14/idle/perf/

Test 2: 2 core ping pong sweep:

Pass a token between 2 CPUs on 2 different cores.
Do a variable amount of work at each stop.
NOT a timer based test.

Purpose: To utilize the shallowest idle states
and observe the transition from using more of 1
idle state to another.
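
For reference, a minimal sketch of what such a 2-CPU token-passing loop might
look like (this is not the actual test program; the CPU numbers, the amount of
work per stop, and the loop count are illustrative):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

#define LOOPS 100000
#define WORK  1000	/* busy-work iterations per stop (swept in the real test) */

static volatile unsigned long sink;

static void pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	sched_setaffinity(0, sizeof(set), &set);
}

/* A variable amount of busy work at each stop; no timers are involved. */
static void do_work(void)
{
	for (int i = 0; i < WORK; i++)
		sink += i;
}

/* Wait for the token, do some work, pass the token back. */
static void bounce(int rfd, int wfd)
{
	char token;

	for (int i = 0; i < LOOPS; i++) {
		if (read(rfd, &token, 1) != 1)
			exit(1);
		do_work();
		if (write(wfd, &token, 1) != 1)
			exit(1);
	}
}

int main(void)
{
	int a[2], b[2];		/* a: parent -> child, b: child -> parent */
	struct timespec t0, t1;
	char token = 'x';
	double us;

	if (pipe(a) || pipe(b))
		return 1;

	if (fork() == 0) {
		pin_to_cpu(2);	/* assumed to sit on a different core than CPU 0 */
		bounce(a[0], b[1]);
		return 0;
	}

	pin_to_cpu(0);
	clock_gettime(CLOCK_MONOTONIC, &t0);
	if (write(a[1], &token, 1) != 1)	/* inject the token */
		return 1;
	bounce(b[0], a[1]);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	wait(NULL);

	us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
	printf("%.3f uSec/loop\n", us / LOOPS);
	return 0;
}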

Results relative to teo-613 (negative is better):
        menu-613        teo-614         menu-614        menu-614-p
average -2.06%          -0.32%          -2.33%          -2.52%
max     9.42%           12.72%          8.29%           8.55%
min     -10.36%         -3.82%          -11.89%         -12.13%

No significant issues here. There are differences in idle state preferences.

Standard "fast" dwell test:

teo-613: average 3.826 uSec/loop reference
menu-613: average 4.159 +8.70%
teo-614: average 3.751 -1.94%
menu-614: average 4.076 +6.54%
menu-614-p: average 4.178 +9.21%

Interestingly, teo-614 also uses a little less power.
Note that there is an offsetting region for the menu governor where it performs better
than teo, but it was not extracted and done as a dwell test.

Standard "medium dwell test:

teo-613: 12.241 average uSec/loop reference
menu-613: 12.251 average +0.08%
teo-614: 12.121 average -0.98%
menu-614: 12.123 average -0.96%
menu-614-p: 12.236 average -0.04%

Standard "slow" dwell test: Not done.

Further details:
http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-2/loop-times.png
http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-2/loop-times-relative.png
http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-2/perf/
http://smythies.com/~doug/linux/idle/teo-6.14/many-0-400000000-2/perf/
http://smythies.com/~doug/linux/idle/teo-6.14/many-3000-100000000-2/

Test 3: 6 core ping pong sweep:

Pass a token between 6 CPUs on 6 different cores.
Do a variable amount of work at each stop.
NOT a timer based test.

Purpose: To utilize the midrange idle states
and observe the transitions between use of
idle states.

Note: This test has uncertainty in an area where the performance is bi-stable for all idle governors,
transitioning between much less power and slower performance and much more power and higher performance.
On either side of this area, the differences between all idle governors are small.
Only data from before this area (from results 1 to 95) was included in the below results.

Results relative to teo-613 (negative is better):
        teo-614 menu-613        menu-614        menu-614-p
average 1.60%   0.18%           0.02%           0.02%
max     5.91%   0.97%           1.12%           0.85%
min     -1.79%  -1.11%          -1.88%          -1.52%

A further dwell test was done in the area where teo-614 performed worse.
There was a slight regression in both performance and power:

teo-613: average 21.34068 uSec per loop
teo-614: average 20.55809 usec per loop 3.67% regression

teo-613: average 37.17577 watts.
teo-614: average 38.06375 watts. 2.3% regression.

Further details:
http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/loop-times.png
http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/loop-times-detail-a.png
http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/loop-times-detail-b.png
http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/perf/
http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/dwell/perf/

Test 4: 12 CPU ping pong sweep:

Pass a token between all 12 CPUs.
Do a variable amount of work at each stop.
NOT a timer based test.

Purpose: To utilize the deeper idle states
and observe the transitions between use of
idle states.

This test was added last time at the request of Christian Loehle.

Note: This test has uncertainty in an area where the performance is bi-stable for all idle governors,
transitioning between much less power and slower performance and much more power and higher performance.
On either side of this area, the differences between all idle governors are small.

Only data from before this area (from results 1 to 60) was included in the below results:

Results relative to teo-613 (negative is better):
        teo-614 menu-613        menu-614        teo-614-p       menu-614-p
ave     1.73%   0.97%           1.29%           1.70%           0.43%
max     16.79%  3.52%           3.95%           17.48%          4.98%
min     -0.35%  -0.35%          -0.18%          -0.40%          -0.54%

Only data from after the uncertainty area (from results 170-300) was included in the below results:

        teo-614 menu-613        menu-614        teo-614-p       menu-614-p
ave     1.65%   0.04%           0.98%           -0.56%          0.73%
max     5.04%   2.10%           4.58%           2.44%           3.82%
min     0.00%   -1.89%          -1.17%          -1.95%          -1.38%

A further dwell test was done in the area where teo-614 performed worse and there is a 15.74%
throughput regression for teo-614 and a 5.4% regression in power.

Further details:
http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/sweep-loop-times.png
http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/sweep-loop-times-detail-a.png
http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/sweep-relative-times.png
http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/perf/
http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/dwell/times.txt
http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/dwell/perf/

Test 5: sleeping ebizzy - 128 threads.

Purpose: This test has given interesting results in the past.
The test varies the sleep interval between record lookups.
The result is varying usage of idle states.

Results: Nothing significant to report just from the performance data.
However, there do seem to be power differences worth considering.

A further dwell test was done on a cherry-picked spot.
It is important to note that teo-614 removed a sawtooth performance
pattern that was present with teo-613, i.e. it is more stable. See:
http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/dwell/rps-only-teo.png

Further details:
http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/interval-sweep.png
http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/relative-performance.png
http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/perf/
http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/dwell/rps.png
http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/dwell/rps-relative.png
http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/dwell/perf/

Test 6: adrestia wakeup latency tests. 500 threads.

Purpose: The test was reported in 2023.09 by the kernel test robot; it both looked
interesting and gave interesting results, so I added it to the tests I run.

Results:
teo-613.txt:wakeup cost (periodic, 20us): 3331nSec reference
teo-614.txt:wakeup cost (periodic, 20us): 3375nSec +1.32%
menu-613.txt:wakeup cost (periodic, 20us): 3207nSec -3.72%
menu-614.txt:wakeup cost (periodic, 20us): 3315nSec -0.48%
menu-614-p.txt:wakeup cost (periodic, 20us): 3353nSec +0.66%

Further details:
http://smythies.com/~doug/linux/idle/teo-6.14/adrestia/periodic/histogram.png
http://smythies.com/~doug/linux/idle/teo-6.14/adrestia/periodic/histogram-detail-a.png
http://smythies.com/~doug/linux/idle/teo-6.14/adrestia/periodic/perf/


Test 7: consume: periodic workflow. Various work/sleep frequencies and loads.

Purpose: To search for anomalies and hysteresis over all possible workloads at various work/sleep frequencies.
work/sleep frequencies tested: 73, 113, 211, 347, and 401 Hertz.
IS a timer based test.
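
For illustration, a minimal sketch of this kind of periodic work/sleep load is below.
It is not the actual "consume" program; the duty cycle, the spin and absolute-timer
mechanism, and the loop counts are assumptions made for the example (only the tested
frequencies come from this report).

/*
 * Illustrative sketch only (not the real "consume" program): wake up at a
 * fixed frequency, spin for a fraction of the period, then sleep until the
 * next absolute deadline, so every wakeup is timer driven.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv)
{
        long hz = argc > 1 ? atol(argv[1]) : 73;        /* 73, 113, 211, 347 or 401 in the tests */
        double duty = argc > 2 ? atof(argv[2]) : 0.1;   /* fraction of the period spent working */
        long periods = argc > 3 ? atol(argv[3]) : 100000;
        long period_ns = 1000000000L / hz;
        long work_ns = (long)(period_ns * duty);
        struct timespec next, now, start;
        long n;

        clock_gettime(CLOCK_MONOTONIC, &next);
        for (n = 0; n < periods; n++) {
                /* the "work" phase: spin until work_ns has elapsed */
                clock_gettime(CLOCK_MONOTONIC, &start);
                do {
                        clock_gettime(CLOCK_MONOTONIC, &now);
                } while ((now.tv_sec - start.tv_sec) * 1000000000L +
                         (now.tv_nsec - start.tv_nsec) < work_ns);

                /* the "sleep" phase: sleep until the next period boundary */
                next.tv_nsec += period_ns;
                if (next.tv_nsec >= 1000000000L) {
                        next.tv_nsec -= 1000000000L;
                        next.tv_sec++;
                }
                clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
        }
        return 0;
}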

NOTE: Repeatability issues. More work needed.

Tests show instability with teo-614, but a re-test was much less unstable and showed better power.
Idle statistics were collected for the re-test and do show teo-614 overly favoring idle state 1, with
"Idle state 1 was too shallow" at 70% versus 15% for teo-613.

Further details:
http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf73/
http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf113/
http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf211/
http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf347/
http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf401/
http://smythies.com/~doug/linux/idle/teo-6.14/consume/test/
http://smythies.com/~doug/linux/idle/teo-6.14/consume/test-idle/

Test 8: shell-intensive serialized workloads.

Variable: PIDs per second, amount of work each task does.
Note: Single threaded.

Dountil the list of tasks is finished:
    Start the next task in the list of stuff to do (with a new PID).
    Wait for it to finish
Enduntil

This workflow represents a challenge for CPU frequency scaling drivers,
schedulers, and therefore idle drivers.
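
A minimal sketch of this kind of serialized workflow is below. It is illustrative
only; /bin/true stands in for the per-task work, which in the real test is variable.

/*
 * Illustrative sketch only: run the tasks strictly one after another, each
 * as a freshly forked and exec'd process (a new PID per task), waiting for
 * each one to finish before starting the next.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        long tasks = argc > 1 ? atol(argv[1]) : 10000;  /* how many PIDs to create */
        long n;

        for (n = 0; n < tasks; n++) {
                pid_t pid = fork();

                if (pid == 0) {
                        /* /bin/true is a stand-in for the per-task work */
                        execlp("/bin/true", "true", (char *)NULL);
                        _exit(127);
                }
                if (pid > 0)
                        waitpid(pid, NULL, 0);  /* serialize: wait before the next task */
        }
        return 0;
}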

Also, the best performance is achieved by overriding
the scheduler and forcing CPU affinity. This "best" case is the
master reference, requiring additional legend definitions:
1cpu-613: Kernel 6.13, execution forced onto CPU 3.
1cpu-614: Kernel 6.14-rc1, execution forced onto CPU 3.

Ideally the two 1cpu graphs would be identical, but they are not,
likely due to other changes between the two kernels.

Results:
teo-614 is absolutely outstanding in this test.
Considerably better than any previous result over many years.

Further details:
http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/times.png
http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/times-log.png
http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/relative.png
http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/relative-log.png

Test 9: Many threads, periodic workflow

500 threads, each doing a little work and then sleeping a little.
IS a timer based test.
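
A minimal sketch of this kind of workload is below (illustrative only; the
50000-iteration work loop and the 1 ms sleep are arbitrary stand-ins for
"a little work" and "sleep a little").

/*
 * Illustrative sketch only: many threads, each spinning briefly and then
 * sleeping briefly, so every wakeup is timer driven.  Build with -pthread.
 */
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

#define NTHREADS 500

static void *worker(void *arg)
{
        volatile unsigned long i;

        (void)arg;
        for (;;) {
                for (i = 0; i < 50000; i++)     /* "do a little work" */
                        ;
                usleep(1000);                   /* "sleep a little" (1 ms here) */
        }
        return NULL;
}

int main(void)
{
        pthread_t tid[NTHREADS];
        int n;

        for (n = 0; n < NTHREADS; n++)
                pthread_create(&tid[n], NULL, worker, NULL);
        pause();        /* let the workers run until interrupted */
        return 0;
}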

Results:
Kernel 6.13 teo:    reference
Kernel 6.13 menu:   -0.06%
Kernel 6.14 teo:    -0.09%
Kernel 6.14 menu:   +0.49%
Kernel 6.14+p menu: +0.33%

What is interesting is the significant differences in idle state selection.
Powers might be interesting, but much longer tests would be needed to achieve thermal equilibrium.

Rafael J. Wysocki Feb. 14, 2025, 10:10 p.m. UTC | #9
Hi Doug,

On Fri, Feb 14, 2025 at 5:30 AM Doug Smythies <dsmythies@telus.net> wrote:
>
> On 2025.02.06 06:22 Rafael J. Wysocki wrote:
>
> > Hi Everyone,
>
> Hi Rafael,
>
> >
> > This work had been triggered by a report that commit 0611a640e60a ("eventpoll:
> > prefer kfree_rcu() in __ep_remove()") had caused the critical-jOPS metric of
> ... deleted ...
>
> This is a long email. It contains test results for several recent idle governor patches:

Thanks a lot for this data, it's really helpful!

> cpuidle: teo: Cleanups and very frequent wakeups handling update
> cpuidle: teo: Avoid selecting deepest idle state over-eagerly (Testing aborted, after the patch was dropped.)
> cpuidle: menu: Avoid discarding useful information when processing recent idle intervals
>
> Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
> Distro: Ubuntu 24.04.1, server, no desktop GUI.
>
> CPU frequency scaling driver: intel_pstate
> HWP: disabled.
> CPU frequency scaling governor: performance
>
> Idle driver: intel_idle
> Idle governor: as per individual test
> Idle states: 4: name : description:
>    state0/name:POLL             desc:CPUIDLE CORE POLL IDLE
>    state1/name:C1_ACPI          desc:ACPI FFH MWAIT 0x0
>    state2/name:C2_ACPI          desc:ACPI FFH MWAIT 0x30
>    state3/name:C3_ACPI          desc:ACPI FFH MWAIT 0x60
>
> Legend:
> teo-613: teo governor - Kernel 6.13: before "cpuidle: teo: Cleanups and very frequent wakeups handling update"
> menu-613: menu governor - Kernel 6.13: before "cpuidle: teo: Cleanups and very frequent wakeups handling update"
> teo-614: teo governor - Kernel 6.14-rc1: Includes cpuidle: teo: Cleanups and very frequent wakeups handling update
> menu-614: menu governor - Kernel 6.14-rc1: Includes cpuidle: teo: Cleanups and very frequent wakeups handling update
> teo-614-p: teo governor - Kernel 6.14-rc1-p: Includes "cpuidle: teo: Avoid selecting deepest idle state over-eagerly"
> menu-614-p: menu governor - Kernel 6.14-rc1-p: Includes "cpuidle: menu: Avoid discarding useful information when processing recent idle intervals"
>
> I do a set of tests adopted over some years now.
> Readers may recall that some of the tests search over a wide range of operating conditions looking for areas to focus on in more detail.
> One interesting observation is that everything seems to run slower than the last time I did this, June 2024, Kernel 6.10-rc2,
> which was also slower than the time before that, August 2023, Kernel 6.5-rc4.
> There are some repeatability issues with the tests.
>
> I was unable to get the "cpuidle: teo: Cleanups and very frequent wakeups handling update" patch set to apply to kernel 6.13, and so just used kernel 6.14-rc1, but that means that all the other commits
> between the kernel versions are included. This could cast doubt on the test results, and indeed some differences in test results are observed with the menu idle governor, which did not change.
>
> Test 1: System Idle
>
> Purpose: Basic starting point test. To observe and check an idle system for excessive power consumption.
>
> teo-613: 1.752 watts (reference: 0.0%)
> menu-613: 1.909 watts (+9.0%)
> teo-614: 2.199 watts (+25.51%)   <<< Test flawed. Needs to be redone. Will be less.
> teo-614-2: 2.112 watts (+17.05%) <<< Re-test of teo-614. (don't care about 0.4 watts)
> menu-614: 1.873 watts (+6.91%)
> teo-614-p: 9.401 watts (+436.6%)  <<< Very bad regression.

Already noted.

Since I've decided to withdraw this patch, I will not talk about it below.

> menu-614-p: 1.820 watts (+3.9%)

And this is an improvement worth noting.

Generally speaking, I'm mostly interested in the differences between
teo-613 and teo-614 and between menu-6.14 and menu-614-p.

> Further details:
> http://smythies.com/~doug/linux/idle/teo-6.14/idle/perf/
>
> Test 2: 2 core ping pong sweep:
>
> Pass a token between 2 CPUs on 2 different cores.
> Do a variable amount of work at each stop.
> NOT a timer based test.
>
> Purpose: To utilize the shallowest idle states
> and observe the transition from using more of 1
> idle state to another.
>
> Results relative to teo-613 (negative is better):
>         menu-613        teo-614         menu-614        menu-614-p
> average -2.06%          -0.32%          -2.33%          -2.52%
> max     9.42%           12.72%          8.29%           8.55%
> min     -10.36%         -3.82%          -11.89%         -12.13%
>
> No significant issues here. There are differences on idle state preferences.
>
> Standard "fast" dwell test:
>
> teo-613: average 3.826 uSec/loop reference
> menu-613: average 4.159 +8.70%
> teo-614: average 3.751 -1.94%

A small improvement.

> menu-614: average 4.076 +6.54%
> menu-614-p: average 4.178 +9.21%
>
> Interestingly, teo-614 also uses a little less power.
> Note that there is an offsetting region for the menu governor where it performs better
> than teo, but it was not extracted and done as a dwell test.
>
> Standard "medium dwell test:
>
> teo-613: 12.241 average uSec/loop reference
> menu-613: 12.251 average +0.08%
> teo-614: 12.121 average -0.98%

Similarly here, but smaller.

> menu-614: 12.123 average -0.96%
> menu-614-p: 12.236 average -0.04%
>
> Standard "slow" dwell test: Not done.
>
> Further details:
> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-2/loop-times.png
> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-2/loop-times-relative.png
> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-2/perf/
> http://smythies.com/~doug/linux/idle/teo-6.14/many-0-400000000-2/perf/
> http://smythies.com/~doug/linux/idle/teo-6.14/many-3000-100000000-2/
>
> Test 3: 6 core ping pong sweep:
>
> Pass a token between 6 CPUs on 6 different cores.
> Do a variable amount of work at each stop.
> NOT a timer based test.
>
> Purpose: To utilize the midrange idle states
> and observe the transitions between use of
> idle states.
>
> Note: This test has uncertainty in an area where the performance is bi-stable for all idle governors,
> transitioning between much less power and slower performance and much more power and higher performance.
> On either side of this area, the differences between all idle governors are small.
> Only data from before this area (from results 1 to 95) was included in the below results.
>
> Results relative to teo-613 (negative is better):
>         teo-614 menu-613        menu-614        menu-614-p
> average 1.60%   0.18%           0.02%           0.02%
> max     5.91%   0.97%           1.12%           0.85%
> min     -1.79%  -1.11%          -1.88%          -1.52%
>
> A further dwell test was done in the area where teo-614 performed worse.
> There was a slight regression in both performance and power:
>
> teo-613: average 21.34068 uSec per loop
> teo-614: average 20.55809 usec per loop 3.67% regression

As this is usec per loop, I'd think that smaller would be better?

> teo-613: average 37.17577 watts.
> teo-614: average 38.06375 watts. 2.3% regression.

Which would be consistent with this.

> Further details:
> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/loop-times.png
> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/loop-times-detail-a.png
> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/loop-times-detail-b.png
> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/perf/
> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/dwell/perf/
>
> Test 4: 12 CPU ping pong sweep:
>
> Pass a token between all 12 CPUs.
> Do a variable amount of work at each stop.
> NOT a timer based test.
>
> Purpose: To utilize the deeper idle states
> and observe the transitions between use of
> idle states.
>
> This test was added last time at the request of Christian Loehle.
>
> Note: This test has uncertainty in an area where the performance is bi-stable for all idle governors,
> transitioning between much less power and slower performance and much more power and higher performance.
> On either side of this area, the differences between all idle governors are small.
>
> Only data from before this area (from results 1 to 60) was included in the below results:
>
> Results relative to teo-613 (negative is better):
>         teo-614 menu-613        menu-614        teo-614-p       menu-614-p
> ave     1.73%   0.97%           1.29%           1.70%           0.43%
> max     16.79%  3.52%           3.95%           17.48%          4.98%
> min     -0.35%  -0.35%          -0.18%          -0.40%          -0.54%
>
> Only data from after the uncertainty area (from results 170-300) was included in the below results:
>
>         teo-614 menu-613        menu-614        teo-614-p       menu-614-p
> ave     1.65%   0.04%           0.98%           -0.56%          0.73%
> max     5.04%   2.10%           4.58%           2.44%           3.82%
> min     0.00%   -1.89%          -1.17%          -1.95%          -1.38%
>
> A further dwell test was done in the area where teo-614 performed worse and there is a 15.74%
> throughput regression for teo-614 and a 5.4% regression in power.
>
> Further details:
> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/sweep-loop-times.png
> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/sweep-loop-times-detail-a.png
> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/sweep-relative-times.png
> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/perf/
> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/dwell/times.txt
> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/dwell/perf/
>
> Test 5: sleeping ebizzy - 128 threads.
>
> Purpose: This test has given interesting results in the past.
> The test varies the sleep interval between record lookups.
> The result is varying usage of idle states.
>
> Results: Nothing significant to report just from the performance data.
> However, there does seem to be power differences worth considering.
>
> A further dwell test was done on a cherry-picked spot.
> It is important to note that teo-614 removed a sawtooth performance
> pattern that was present with teo-613, i.e. it is more stable. See:
> http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/dwell/rps-only-teo.png
>
> Further details:
> http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/interval-sweep.png
> http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/relative-performance.png
> http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/perf/
> http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/dwell/rps.png
> http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/dwell/rps-relative.png
> http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/dwell/perf/
>
> Test 6: adrestia wakeup latency tests. 500 threads.
>
> Purpose: The test was reported in 2023.09 by the kernel test robot and looked
> both interesting and gave interesting results, so I added it to the tests I run.
>
> Results:
> teo-613.txt:wakeup cost (periodic, 20us): 3331nSec reference
> teo-614.txt:wakeup cost (periodic, 20us): 3375nSec +1.32%
> menu-613.txt:wakeup cost (periodic, 20us): 3207nSec -3.72%
> menu-614.txt:wakeup cost (periodic, 20us): 3315nSec -0.48%
> menu-614-p.txt:wakeup cost (periodic, 20us): 3353nSec +0.66%
>
> Further details:
> http://smythies.com/~doug/linux/idle/teo-6.14/adrestia/periodic/histogram.png
> http://smythies.com/~doug/linux/idle/teo-6.14/adrestia/periodic/histogram-detail-a.png
> http://smythies.com/~doug/linux/idle/teo-6.14/adrestia/periodic/perf/
>
>
> Test 7: consume: periodic workflow. Various work/sleep frequencies and loads.
>
> Purpose: To search for anomalies and hysteresis over all possible workloads at various work/sleep frequencies.
> work/sleep frequencies tested: 73, 113, 211, 347, and 401 Hertz.
> IS a timer based test.
>
> NOTE: Repeatability issues. More work needed.
>
> Tests show instability with teo-614, but a re-test was much less unstable and better power.
> Idle statistics were collected for the re-test and does show teo-614 overly favoring idle state 1, with
> "Idle state 1 was too shallow" of 70% verses 15% for teo-613.
>
> Further details:
> http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf73/
> http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf113/
> http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf211/
> http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf347/
> http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf401/
> http://smythies.com/~doug/linux/idle/teo-6.14/consume/test/
> http://smythies.com/~doug/linux/idle/teo-6.14/consume/test-idle/
>
> Test 8: shell-intensive serialized workloads.
>
> Variable: PIDs per second, amount of work each task does.
> Note: Single threaded.
>
> Dountil the list of tasks is finished:
>     Start the next task in the list of stuff to do (with a new PID).
>     Wait for it to finish
> Enduntil
>
> This workflow represents a challenge for CPU frequency scaling drivers,
> schedulers, and therefore idle drivers.
>
> Also, the best performance is achieved by overriding
> the scheduler and forcing CPU affinity. This "best" case is the
> master reference, requiring additional legend definitions:
> 1cpu-613: Kernel 6.13, execution forced onto CPU 3.
> 1cpu-614: Kernel 6.14-rc1, execution forced onto CPU 3.
>
> Ideally the two 1cpu graphs would be identical, but they are not,
> likely due to other changes between the two kernels.
>
> Results:
> teo-614 is absolutely outstanding in this test.
> Considerably better than any previous result over many years.

Sounds good!

> Further details:
> http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/times.png
> http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/times-log.png
> http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/relative.png
> http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/relative-log.png
>
> Test 9: Many threads, periodic workflow
>
> 500 threads of do a little work and then sleep a little.
> IS a timer based test.
>
> Results:
> Kernel 6.13 teo:    reference
> Kernel 6.13 menu:   -0.06%
> Kernel 6.14 teo:    -0.09%
> Kernel 6.14 menu:   +0.49%
> Kernel 6.14+p menu: +0.33%
>
> What is interesting is the significant differences in idle state selection.
> Powers might be interesting, but much longer tests would be needed to achieve thermal equilibrium.
>
> doug@s19:~/idle/teo/6.14$ nano README.txt
> doug@s19:~/idle/teo/6.14$ rsync --archive --delete --verbose ./ doug@s15.smythies.com:/home/doug/public_html/linux/idle/teo-6.14
> doug@s15.smythies.com's password:
> sending incremental file list
> ./
> README.txt
> idle/
> idle/teo-614-2.xlsx
>
> sent 61,869 bytes  received 214 bytes  13,796.22 bytes/sec
> total size is 20,642,833  speedup is 332.50
> doug@s19:~/idle/teo/6.14$ uname -a
> Linux s19 6.14.0-rc1-stock #1339 SMP PREEMPT_DYNAMIC Sun Feb  2 16:45:39 PST 2025 x86_64 x86_64 x86_64 GNU/Linux
> doug@s19:~/idle/teo/6.14$ uname -a
> Linux s19 6.14.0-rc1-stock #1339 SMP PREEMPT_DYNAMIC Sun Feb  2 16:45:39 PST 2025 x86_64 x86_64 x86_64 GNU/Linux
> doug@s19:~/idle/teo/6.14$
> doug@s19:~/idle/teo/6.14$
> doug@s19:~/idle/teo/6.14$
> doug@s19:~/idle/teo/6.14$ cat READEME.txt
> cat: READEME.txt: No such file or directory
> doug@s19:~/idle/teo/6.14$ cat README.txt
> 2025.02.13 Notes on this round of idle governors testing:
>
> Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
> Distro: Ubuntu 24.04.1, server, no desktop GUI.
>
> CPU frequency scaling driver: intel_pstate
> HWP: disabled.
> CPU frequency scaling governor: performance

What's the difference between this configuration and the one above?

> Idle driver: intel_idle
> Idle governor: as per individual test
> Idle states: 4: name : description:
>    state0/name:POLL             desc:CPUIDLE CORE POLL IDLE
>    state1/name:C1_ACPI          desc:ACPI FFH MWAIT 0x0
>    state2/name:C2_ACPI          desc:ACPI FFH MWAIT 0x30
>    state3/name:C3_ACPI          desc:ACPI FFH MWAIT 0x60
>
> Legend:
> teo-613: teo governor - Kernel 6.13: before "cpuidle: teo: Cleanups and very frequent wakeups handling update"
> menu-613: menu governor - Kernel 6.13: before "cpuidle: teo: Cleanups and very frequent wakeups handling update"
> teo-614: teo governor - Kernel 6.14-rc1: Includes cpuidle: teo: Cleanups and very frequent wakeups handling update
> menu-614: menu governor - Kernel 6.14-rc1: Includes cpuidle: teo: Cleanups and very frequent wakeups handling update
> teo-614-p: teo governor - Kernel 6.14-rc1-p: Includes "cpuidle: teo: Avoid selecting deepest idle state over-eagerly"
> menu-614-p: menu governor - Kernel 6.14-rc1-p: Includes "cpuidle: menu: Avoid discarding useful information when processing recent idle intervals"
>
> I do a set of tests adopted over some years now.
> Readers may recall that some of the tests search over a wide range of operating conditions looking for areas to focus on in more detail.
> One interesting observation is that everything seems to run slower than the last time I did this, June 2024, Kernel 6.10-rc2,
> which was also slower than the time before that, August 2023, Kernel 6.5-rc4.
> There are some repeatability issues with the tests.
>
> I was unable to get the "cpuidle: teo: Cleanups and very frequent wakeups handling update"
> patch set to apply to kernel 6.13, and so just used kernel 6.14-rc1, but that means that
> all the other commits between the kernel versions are included. This could cast doubt on
> the test results, and indeed some differences in test results are observed with the menu
> idle governor, which did not change.
>
> Test 1: System Idle
>
> Purpose: Basic starting point test. To observe and check an idle system for excessive power consumption.
>
> teo-613: 1.752 watts (reference: 0.0%)
> menu-613: 1.909 watts (+9.0%)
> teo-614: 2.199 watts (+25.51%)   <<< Test flawed. Needs to be redone. Will be less.
> teo-614-2: 2.112 watts (+17.05%) <<< Re-test of teo-614. (don't care about 0.4 watts)
> menu-614: 1.873 watts (+6.91%)
> teo-614-p: 9.401 watts (+436.6%)  <<< Very bad regression.
> menu-614-p: 1.820 watts (+3.9%)
>
> Further details:
> http://smythies.com/~doug/linux/idle/teo-6.14/idle/perf/
>
> Test 2: 2 core ping pong sweep:
>
> Pass a token between 2 CPUs on 2 different cores.
> Do a variable amount of work at each stop.
> NOT a timer based test.
>
> Purpose: To utilize the shallowest idle states
> and observe the transition from using more of 1
> idle state to another.
>
> Results relative to teo-613 (negative is better):
>         menu-613        teo-614         menu-614        menu-614-p
> average -2.06%          -0.32%          -2.33%          -2.52%
> max     9.42%           12.72%          8.29%           8.55%
> min     -10.36%         -3.82%          -11.89%         -12.13%
>
> No significant issues here. There are differences on idle state preferences.
>
> Standard "fast" dwell test:
>
> teo-613: average 3.826 uSec/loop reference
> menu-613: average 4.159 +8.70%
> teo-614: average 3.751 -1.94%
> menu-614: average 4.076 +6.54%
> menu-614-p: average 4.178 +9.21%
>
> Interestingly, teo-614 also uses a little less power.
> Note that there is an offsetting region for the menu governor where it performs better
> than teo, but it was not extracted and done as a dwell test.
>
> Standard "medium dwell test:
>
> teo-613: 12.241 average uSec/loop reference
> menu-613: 12.251 average +0.08%
> teo-614: 12.121 average -0.98%
> menu-614: 12.123 average -0.96%
> menu-614-p: 12.236 average -0.04%
>
> Standard "slow" dwell test: Not done.
>
> Further details:
> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-2/loop-times.png
> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-2/loop-times-relative.png
> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-2/perf/
> http://smythies.com/~doug/linux/idle/teo-6.14/many-0-400000000-2/perf/
> http://smythies.com/~doug/linux/idle/teo-6.14/many-3000-100000000-2/
>
> Test 3: 6 core ping pong sweep:
>
> Pass a token between 6 CPUs on 6 different cores.
> Do a variable amount of work at each stop.
> NOT a timer based test.
>
> Purpose: To utilize the midrange idle states
> and observe the transitions between use of
> idle states.
>
> Note: This test has uncertainty in an area where the performance is bi-stable for all idle governors,
> transitioning between much less power and slower performance and much more power and higher performance.
> On either side of this area, the differences between all idle governors are small.
> Only data from before this area (from results 1 to 95) was included in the below results.
>
> Results relative to teo-613 (negative is better):
>         teo-614 menu-613        menu-614        menu-614-p
> average 1.60%   0.18%           0.02%           0.02%
> max     5.91%   0.97%           1.12%           0.85%
> min     -1.79%  -1.11%          -1.88%          -1.52%
>
> A further dwell test was done in the area where teo-614 performed worse.
> There was a slight regression in both performance and power:
>
> teo-613: average 21.34068 uSec per loop
> teo-614: average 20.55809 usec per loop 3.67% regression
>
> teo-613: average 37.17577 watts.
> teo-614: average 38.06375 watts. 2.3% regression.
>
> Further details:
> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/loop-times.png
> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/loop-times-detail-a.png
> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/loop-times-detail-b.png
> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/perf/
> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/dwell/perf/
>
> Test 4: 12 CPU ping pong sweep:
>
> Pass a token between all 12 CPUs.
> Do a variable amount of work at each stop.
> NOT a timer based test.
>
> Purpose: To utilize the deeper idle states
> and observe the transitions between use of
> idle states.
>
> This test was added last time at the request of Christian Loehle.
>
> Note: This test has uncertainty in an area where the performance is bi-stable for all idle governors,
> transitioning between much less power and slower performance and much more power and higher performance.
> On either side of this area, the differences between all idle governors are small.
>
> Only data from before this area (from results 1 to 60) was included in the below results:
>
> Results relative to teo-613 (negative is better):
>         teo-614 menu-613        menu-614        teo-614-p       menu-614-p
> ave     1.73%   0.97%           1.29%           1.70%           0.43%
> max     16.79%  3.52%           3.95%           17.48%          4.98%
> min     -0.35%  -0.35%          -0.18%          -0.40%          -0.54%
>
> Only data from after the uncertainty area (from results 170-300) was included in the below results:
>
>         teo-614 menu-613        menu-614        teo-614-p       menu-614-p
> ave     1.65%   0.04%           0.98%           -0.56%          0.73%
> max     5.04%   2.10%           4.58%           2.44%           3.82%
> min     0.00%   -1.89%          -1.17%          -1.95%          -1.38%
>
> A further dwell test was done in the area where teo-614 performed worse and there is a 15.74%
> throughput regression for teo-614 and a 5.4% regression in power.
>
> Further details:
> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/sweep-loop-times.png
> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/sweep-loop-times-detail-a.png
> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/sweep-relative-times.png
> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/perf/
> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/dwell/times.txt
> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/dwell/perf/
>
> Test 5: sleeping ebizzy - 128 threads.
>
> Purpose: This test has given interesting results in the past.
> The test varies the sleep interval between record lookups.
> The result is varying usage of idle states.
>
> Results: Nothing significant to report just from the performance data.
> However, there does seem to be power differences worth considering.
>
> A further dwell test was done on a cherry-picked spot.
> It is important to note that teo-614 removed a sawtooth performance
> pattern that was present with teo-613, i.e. it is more stable. See:
> http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/dwell/rps-only-teo.png
>
> Further details:
> http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/interval-sweep.png
> http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/relative-performance.png
> http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/perf/
> http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/dwell/rps.png
> http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/dwell/rps-relative.png
> http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/dwell/perf/
>
> Test 6: adrestia wakeup latency tests. 500 threads.
>
> Purpose: The test was reported in 2023.09 by the kernel test robot and looked
> both interesting and gave interesting results, so I added it to the tests I run.
>
> Results:
> teo-613.txt:wakeup cost (periodic, 20us): 3331nSec reference
> teo-614.txt:wakeup cost (periodic, 20us): 3375nSec +1.32%
> menu-613.txt:wakeup cost (periodic, 20us): 3207nSec -3.72%
> menu-614.txt:wakeup cost (periodic, 20us): 3315nSec -0.48%
> menu-614-p.txt:wakeup cost (periodic, 20us): 3353nSec +0.66%
>
> Further details:
> http://smythies.com/~doug/linux/idle/teo-6.14/adrestia/periodic/histogram.png
> http://smythies.com/~doug/linux/idle/teo-6.14/adrestia/periodic/histogram-detail-a.png
> http://smythies.com/~doug/linux/idle/teo-6.14/adrestia/periodic/perf/
>
>
> Test 7: consume: periodic workflow. Various work/sleep frequencies and loads.
>
> Purpose: To search for anomalies and hysteresis over all possible workloads at various work/sleep frequencies.
> work/sleep frequencies tested: 73, 113, 211, 347, and 401 Hertz.
> IS a timer based test.
>
> NOTE: Repeatability issues. More work needed.
>
> Tests show instability with teo-614, but a re-test was much less unstable and better power.
> Idle statistics were collected for the re-test and does show teo-614 overly favoring idle state 1, with
> "Idle state 1 was too shallow" of 70% verses 15% for teo-613.
>
> Further details:
> http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf73/
> http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf113/
> http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf211/
> http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf347/
> http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf401/
> http://smythies.com/~doug/linux/idle/teo-6.14/consume/test/
> http://smythies.com/~doug/linux/idle/teo-6.14/consume/test-idle/
>
> Test 8: shell-intensive serialized workloads.
>
> Variable: PIDs per second, amount of work each task does.
> Note: Single threaded.
>
> Dountil the list of tasks is finished:
>     Start the next task in the list of stuff to do (with a new PID).
>     Wait for it to finish
> Enduntil
>
> This workflow represents a challenge for CPU frequency scaling drivers,
> schedulers, and therefore idle drivers.
>
> Also, the best performance is achieved by overriding
> the scheduler and forcing CPU affinity. This "best" case is the
> master reference, requiring additional legend definitions:
> 1cpu-613: Kernel 6.13, execution forced onto CPU 3.
> 1cpu-614: Kernel 6.14-rc1, execution forced onto CPU 3.
>
> Ideally the two 1cpu graphs would be identical, but they are not,
> likely due to other changes between the two kernels.
>
> Results:
> teo-614 is absolutely outstanding in this test.
> Considerably better than any previous result over many years.
>
> Further details:
> http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/times.png
> http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/times-log.png
> http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/relative.png
> http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/relative-log.png
>
> Test 9: Many threads, periodic workflow
>
> 500 threads of do a little work and then sleep a little.
> IS a timer based test.
>
> Results:
> Kernel 6.13 teo:    reference
> Kernel 6.13 menu:   -0.06%
> Kernel 6.14 teo:    -0.09%
> Kernel 6.14 menu:   +0.49%
> Kernel 6.14+p menu: +0.33%
>
> What is interesting is the significant differences in idle state selection.
> Powers might be interesting, but much longer tests would be needed to achieve thermal equilibrium.

Overall, having seen these results, I'm not worried about the change
from teo-613 to teo-614.  The motivation for it was mostly code
consistency and IMV the results indicate that it was worth doing.

Also, if I'm not mistaken, the differences between menu-6.14 and
menu-6.14-p in the majority of your tests are relatively small (if not
in the noise) which, given that the latter is a major improvement for
the SPECjbb workload as reported by Artem, makes me think that I
should queue up menu-614-p for 6.15.

Thanks!
Doug Smythies Feb. 16, 2025, 4:16 p.m. UTC | #10
On 2025.02.14 14:10 Rafael J. Wysocki wrote:
> On Fri, Feb 14, 2025 at 5:30 AM Doug Smythies <dsmythies@telus.net> wrote:
>> On 2025.02.06 06:22 Rafael J. Wysocki wrote:
>>>
>>> This work had been triggered by a report that commit 0611a640e60a ("eventpoll:
>>> prefer kfree_rcu() in __ep_remove()") had caused the critical-jOPS metric of
>> ... deleted ...
>>
>> This is a long email. It contains test results for several recent idle governor patches:
>
> Thanks a lot for this data, it's really helpful!
>
>> cpuidle: teo: Cleanups and very frequent wakeups handling update
>> cpuidle: teo: Avoid selecting deepest idle state over-eagerly (Testing aborted, after the patch was dropped.)
>> cpuidle: menu: Avoid discarding useful information when processing recent idle intervals
>>
>> Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
>> Distro: Ubuntu 24.04.1, server, no desktop GUI.
>>
>> CPU frequency scaling driver: intel_pstate
>> HWP: disabled.
>> CPU frequency scaling governor: performance
>>
>> Idle driver: intel_idle
>> Idle governor: as per individual test
>> Idle states: 4: name : description:
>>    state0/name:POLL             desc:CPUIDLE CORE POLL IDLE
>>    state1/name:C1_ACPI          desc:ACPI FFH MWAIT 0x0
>>    state2/name:C2_ACPI          desc:ACPI FFH MWAIT 0x30
>>    state3/name:C3_ACPI          desc:ACPI FFH MWAIT 0x60
>>
>> Legend:
>> teo-613: teo governor - Kernel 6.13: before "cpuidle: teo: Cleanups and very frequent wakeups handling update"
>> menu-613: menu governor - Kernel 6.13: before "cpuidle: teo: Cleanups and very frequent wakeups handling update"
>> teo-614: teo governor - Kernel 6.14-rc1: Includes cpuidle: teo: Cleanups and very frequent wakeups handling update
>> menu-614: menu governor - Kernel 6.14-rc1: Includes cpuidle: teo: Cleanups and very frequent wakeups handling update
>> teo-614-p: teo governor - Kernel 6.14-rc1-p: Includes "cpuidle: teo: Avoid selecting deepest idle state over-eagerly"
>> menu-614-p: menu governor - Kernel 6.14-rc1-p: Includes "cpuidle: menu: Avoid discarding useful information when processing recent idle intervals"
>>
>> I do a set of tests adopted over some years now.
>> Readers may recall that some of the tests search over a wide range of operating conditions looking for areas to focus on in more detail.
>> One interesting observation is that everything seems to run slower than the last time I did this, June 2024, Kernel 6.10-rc2,
>> which was also slower than the time before that, August 2023, Kernel 6.5-rc4.
>> There are some repeatability issues with the tests.
>>
>> I was unable to get the "cpuidle: teo: Cleanups and very frequent wakeups handling update" patch set to apply to kernel 6.13, and so just used kernel 6.14-rc1, but that means that all the other commits
>> between the kernel versions are included. This could cast doubt on the test results, and indeed some differences in test results are observed with the menu idle governor, which did not change.
>>
>> Test 1: System Idle
>>
>> Purpose: Basic starting point test. To observe and check an idle system for excessive power consumption.
>>
>> teo-613: 1.752 watts (reference: 0.0%)
>> menu-613: 1.909 watts (+9.0%)
>> teo-614: 2.199 watts (+25.51%)   <<< Test flawed. Needs to be redone. Will be less.
>> teo-614-2: 2.112 watts (+17.05%) <<< Re-test of teo-614. (don't care about 0.4 watts)
>> menu-614: 1.873 watts (+6.91%)
>> teo-614-p: 9.401 watts (+436.6%)  <<< Very bad regression.
>
> Already noted.
>
> Since I've decided to withdraw this patch, I will not talk about it below.

Yes, it is just repeated here for completeness. I didn't reprocess the already-completed work to remove the teo-614-p results.
>
>> menu-614-p: 1.820 watts (+3.9%)
>
> And this is an improvement worth noting.
>
> Generally speaking, I'm mostly interested in the differences between
> teo-613 and teo-614 and between menu-6.14 and menu-614-p.
>
>> Further details:
>> http://smythies.com/~doug/linux/idle/teo-6.14/idle/perf/
>>
>> Test 2: 2 core ping pong sweep:
>>
>> Pass a token between 2 CPUs on 2 different cores.
>> Do a variable amount of work at each stop.
>> NOT a timer based test.
>>
>> Purpose: To utilize the shallowest idle states
>> and observe the transition from using more of 1
>> idle state to another.
>>
>> Results relative to teo-613 (negative is better):
>>         menu-613        teo-614         menu-614        menu-614-p
>> average -2.06%          -0.32%          -2.33%          -2.52%
>> max     9.42%           12.72%          8.29%           8.55%
>> min     -10.36%         -3.82%          -11.89%         -12.13%
>>
>> No significant issues here. There are differences on idle state preferences.
>>
>> Standard "fast" dwell test:
>>
>> teo-613: average 3.826 uSec/loop reference
>> menu-613: average 4.159 +8.70%
>> teo-614: average 3.751 -1.94%
>
> A small improvement.
>
>> menu-614: average 4.076 +6.54%
>> menu-614-p: average 4.178 +9.21%
>>
>> Interestingly, teo-614 also uses a little less power.
>> Note that there is an offsetting region for the menu governor where it performs better
>> than teo, but it was not extracted and done as a dwell test.
>>
>> Standard "medium dwell test:
>>
>> teo-613: 12.241 average uSec/loop reference
>> menu-613: 12.251 average +0.08%
>> teo-614: 12.121 average -0.98%
>
> Similarly here, but smaller.
>
>> menu-614: 12.123 average -0.96%
>> menu-614-p: 12.236 average -0.04%
>>
>> Standard "slow" dwell test: Not done.
>>
>> Further details:
>> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-2/loop-times.png
>> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-2/loop-times-relative.png
>> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-2/perf/
>> http://smythies.com/~doug/linux/idle/teo-6.14/many-0-400000000-2/perf/
>> http://smythies.com/~doug/linux/idle/teo-6.14/many-3000-100000000-2/
>>
>> Test 3: 6 core ping pong sweep:
>>
>> Pass a token between 6 CPUs on 6 different cores.
>> Do a variable amount of work at each stop.
>> NOT a timer based test.
>>
>> Purpose: To utilize the midrange idle states
>> and observe the transitions between use of
>> idle states.
>>
>> Note: This test has uncertainty in an area where the performance is bi-stable for all idle governors,
>> transitioning between much less power and slower performance and much more power and higher performance.
>> On either side of this area, the differences between all idle governors are small.
>> Only data from before this area (from results 1 to 95) was included in the below results.
>>
>> Results relative to teo-613 (negative is better):
>>         teo-614 menu-613        menu-614        menu-614-p
>> average 1.60%   0.18%           0.02%           0.02%
>> max     5.91%   0.97%           1.12%           0.85%
>> min     -1.79%  -1.11%          -1.88%          -1.52%
>>
>> A further dwell test was done in the area where teo-614 performed worse.
>> There was a slight regression in both performance and power:
>>
>> teo-613: average 21.34068 uSec per loop
>> teo-614: average 20.55809 usec per loop 3.67% regression
>
> As this is usec per loop, I'd think that smaller would be better?

Sorry, my mistake. That was written backwards, corrected below:

teo-613: average 20.55809 uSec per loop
teo-614: average 21.34068 usec per loop 3.67% regression
>
>> teo-613: average 37.17577 watts.
>> teo-614: average 38.06375 watts. 2.3% regression.
>
> Which would be consistent with this.

There was both a regression in performance and power at this operating point.

Another dwell test was done where menu-614-p did better than menu-614:
uSec per loop:

        menu-614        menu-614-p
average 807.896         772.376
max     962.265         946.880
min     798.375         755.430

        menu-614        menu-614-p
average 0.00%           -4.40%
max     19.11%          17.20%
min     -1.18%          -6.49%

menu-614: average 28.056 watts.
menu-614-p: average 28.863 watts. 2.88% more.
Note: to avoid inclusion of thermal stabilization times, only data from 30 to 45 minutes into the test were included in the average power calculation.

>> Further details:
>> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/loop-times.png
>> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/loop-times-detail-a.png
>> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/loop-times-detail-b.png
>> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/perf/
>> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/dwell/perf/
http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/dwell-2/loop-times.png
http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/dwell-2/perf/
>>
>> Test 4: 12 CPU ping pong sweep:
>>
>> Pass a token between all 12 CPUs.
>> Do a variable amount of work at each stop.
>> NOT a timer based test.
>>
>> Purpose: To utilize the deeper idle states
>> and observe the transitions between use of
>> idle states.
>>
>> This test was added last time at the request of Christian Loehle.
>>
>> Note: This test has uncertainty in an area where the performance is bi-stable for all idle governors,
>> transitioning between a lower-power, slower-performance mode and a higher-power, higher-performance mode.
>> On either side of this area, the differences between all idle governors are small.
>>
>> Only data from before this area (from results 1 to 60) was included in the below results:
>>
>> Results relative to teo-613 (negative is better):
>>         teo-614 menu-613        menu-614        teo-614-p       menu-614-p
>> ave     1.73%   0.97%           1.29%           1.70%           0.43%
>> max     16.79%  3.52%           3.95%           17.48%          4.98%
>> min     -0.35%  -0.35%          -0.18%          -0.40%          -0.54%
>>
>> Only data from after the uncertainty area (from results 170-300) was included in the below results:
>>
>>         teo-614 menu-613        menu-614        teo-614-p       menu-614-p
>> ave     1.65%   0.04%           0.98%           -0.56%          0.73%
>> max     5.04%   2.10%           4.58%           2.44%           3.82%
>> min     0.00%   -1.89%          -1.17%          -1.95%          -1.38%
>>
>> A further dwell test was done in the area where teo-614 performed worse; there is a 15.74%
>> throughput regression for teo-614 and a 5.4% regression in power.

My input is to consider this test further in the decision making.

>> Further details:
>> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/sweep-loop-times.png
>> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/sweep-loop-times-detail-a.png
>> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/sweep-relative-times.png
>> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/perf/
>> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/dwell/times.txt
>> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/dwell/perf/
>>
>> Test 5: sleeping ebizzy - 128 threads.
>>
>> Purpose: This test has given interesting results in the past.
>> The test varies the sleep interval between record lookups.
>> The result is varying usage of idle states.
>>
>> Results: Nothing significant to report just from the performance data.
>> However, there does seem to be power differences worth considering.
>>
>> A further dwell test was done in a cherry-picked spot.
>> It is important to note that teo-614 removed a sawtooth performance
>> pattern that was present with teo-613, i.e. it is more stable. See:
>> http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/dwell/rps-only-teo.png

While re-examining menu-614 and menu-614-p, and to reduce clutter, this graph was made:
http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/dwell/rps-only-menu.png
menu-614: average 8722.787 records per second
menu-614-p: average 8683.387 records per second 0.45% regression (i.e. negligible)

>> Further details:
>> http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/interval-sweep.png
>> http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/relative-performance.png
>> http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/perf/
>> http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/dwell/rps.png
>> http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/dwell/rps-relative.png
>> http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/dwell/perf/
>>
>> Test 6: adrestia wakeup latency tests. 500 threads.
>>
>> Purpose: The test was reported in 2023.09 by the kernel test robot; it looked
>> interesting and gave interesting results, so I added it to the tests I run.
>>
>> Results:
>> teo-613.txt:wakeup cost (periodic, 20us): 3331nSec reference
>> teo-614.txt:wakeup cost (periodic, 20us): 3375nSec +1.32%
>> menu-613.txt:wakeup cost (periodic, 20us): 3207nSec -3.72%
>> menu-614.txt:wakeup cost (periodic, 20us): 3315nSec -0.48%
>> menu-614-p.txt:wakeup cost (periodic, 20us): 3353nSec +0.66%
>>
>> Further details:
>> http://smythies.com/~doug/linux/idle/teo-6.14/adrestia/periodic/histogram.png
>> http://smythies.com/~doug/linux/idle/teo-6.14/adrestia/periodic/histogram-detail-a.png
>> http://smythies.com/~doug/linux/idle/teo-6.14/adrestia/periodic/perf/
>>
>> Test 7: consume: periodic workflow. Various work/sleep frequencies and loads.
>>
>> Purpose: To search for anomalies and hysteresis over all possible workloads at various work/sleep frequencies.
>> work/sleep frequencies tested: 73, 113, 211, 347, and 401 Hertz.
>> IS a timer based test.
>>
>> NOTE: Repeatability issues. More work needed.
>>
>> Tests show instability with teo-614, but a re-test was much less unstable and showed better power.
>> Idle statistics were collected for the re-test and do show teo-614 overly favoring idle state 1, with
>> "Idle state 1 was too shallow" at 70% versus 15% for teo-613.

I'll try to do some more experiments with these timer based periodic type workflows.
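For anyone wanting to reproduce this class of workload, a rough sketch of a single periodic work/sleep task is below (illustrative only, not the actual "consume" program; the frequency and duty-cycle handling are simplified placeholders):

#define _GNU_SOURCE
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv)
{
	double hz = argc > 1 ? atof(argv[1]) : 73.0;	/* work/sleep frequency */
	double load = argc > 2 ? atof(argv[2]) : 0.1;	/* fraction of each period spent working */
	long long period_ns = (long long)(1e9 / hz);
	long long work_ns = (long long)(period_ns * load);
	struct timespec next;

	clock_gettime(CLOCK_MONOTONIC, &next);
	for (;;) {
		struct timespec now;
		long long start_ns, t_ns;

		/* busy-work for roughly work_ns */
		clock_gettime(CLOCK_MONOTONIC, &now);
		start_ns = now.tv_sec * 1000000000LL + now.tv_nsec;
		do {
			clock_gettime(CLOCK_MONOTONIC, &now);
			t_ns = now.tv_sec * 1000000000LL + now.tv_nsec;
		} while (t_ns - start_ns < work_ns);

		/* sleep until the next period boundary (timer-based wakeup) */
		next.tv_nsec += period_ns;
		while (next.tv_nsec >= 1000000000L) {
			next.tv_nsec -= 1000000000L;
			next.tv_sec++;
		}
		clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
	}
	return 0;
}
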

>> Further details:
>> http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf73/
>> http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf113/
>> http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf211/
>> http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf347/
>> http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf401/
>> http://smythies.com/~doug/linux/idle/teo-6.14/consume/test/
>> http://smythies.com/~doug/linux/idle/teo-6.14/consume/test-idle/
>>
>> Test 8: shell-intensive serialized workloads.
>>
>> Variable: PIDs per second, amount of work each task does.
>> Note: Single threaded.
>>
>> Dountil the list of tasks is finished:
>>     Start the next task in the list of stuff to do (with a new PID).
>>     Wait for it to finish
>> Enduntil
>>
>> This workflow represents a challenge for CPU frequency scaling drivers,
>> schedulers, and therefore idle drivers.
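A rough sketch of such a serialized workflow (the task command and task count are placeholders, not the actual test script):

/*
 * Rough sketch: run a list of tasks strictly one after another, each in
 * a freshly spawned shell (new PID), waiting for each to finish before
 * starting the next.  The command and count are placeholders.
 */
#include <stdlib.h>

int main(void)
{
	for (int i = 0; i < 10000; i++)
		system("/bin/true");	/* fork a shell, run the task, wait for it */
	return 0;
}

The PIDs-per-second rate is then set by how much work each task does.
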
>>
>> Also, the best performance is achieved by overriding
>> the scheduler and forcing CPU affinity. This "best" case is the
>> master reference, requiring additional legend definitions:
>> 1cpu-613: Kernel 6.13, execution forced onto CPU 3.
>> 1cpu-614: Kernel 6.14-rc1, execution forced onto CPU 3.
>>
>> Ideally the two 1cpu graphs would be identical, but they are not,
>> likely due to other changes between the two kernels.
>>
>> Results:
>> teo-614 is absolutely outstanding in this test.
>> Considerably better than any previous result over many years.
>
> Sounds good!

This improvement is significant.
I redid the teo-614 test to prove repeatability and dismiss operator error.
It was also good. I did not re-do the published graphs.

>> Further details:
>> http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/times.png
>> http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/times-log.png
>> http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/relative.png
>> http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/relative-log.png
>>
>> Test 9: Many threads, periodic workflow
>>
>> 500 threads, each doing a little work and then sleeping a little.
>> IS a timer based test.
>>
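A rough sketch of this kind of many-threads periodic workload (thread count, work amount, and sleep time are illustrative; not the actual test program):

/*
 * Rough sketch: N threads each do a little busy-work and then sleep a
 * little, so the CPUs see a steady stream of short, timer-based idle
 * periods.  Build with -pthread.
 */
#include <pthread.h>
#include <unistd.h>

#define NTHREADS 500

static void *worker(void *arg)
{
	volatile unsigned long x = 0;

	(void)arg;
	for (;;) {
		for (unsigned long i = 0; i < 10000; i++)	/* a little work */
			x += i;
		usleep(1000);					/* sleep a little (1 ms) */
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	pthread_join(tid[0], NULL);	/* runs until interrupted */
	return 0;
}
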
>> Results:
>> Kernel 6.13 teo:    reference
>> Kernel 6.13 menu:   -0.06%
>> Kernel 6.14 teo:    -0.09%
>> Kernel 6.14 menu:   +0.49%
>> Kernel 6.14+p menu: +0.33%
>>
>> What is interesting is the significant differences in idle state selection.
>> Powers might be interesting, but much longer tests would be needed to achieve thermal equilibrium.
>>
... mess deleted ...
>
> What's the difference between this configuration and the one above?

So sorry, big big screwup in my composition of the original email.
The rest was a copy of the above with the typos fixed.
Apologies.

... redundant stuff deleted ...

> Overall, having seen these results, I'm not worried about the change
> from teo-613 to teo-614.  The motivation for it was mostly code
> consistency and IMV the results indicate that it was worth doing.

Agree, with hesitation. There are both negative and positive results,
but overall okay.

> Also, if I'm not mistaken, the differences between menu-6.14 and
> menu-6.14-p in the majority of your tests are relatively small (if not
> in the noise) which, given that the latter is a major improvement for
> the SPECjbb workload as reported by Artem, makes me think that I
> should queue up menu-614-p for 6.15.

Agreed.

> Thanks!
Christian Loehle Feb. 18, 2025, 9:17 p.m. UTC | #11
On 2/10/25 14:15, Christian Loehle wrote:
> On 2/6/25 14:21, Rafael J. Wysocki wrote:
>> [cut]

Another dump for x86 idle this time, tldr: no worrying idle/power numbers

gov	 iter	 Joules	 idles	 idle_misses	 idle_miss_ratio	 belows	 aboves	
teo 	0 	121.76 	12202 	813 	0.067 	390 	423
teo 	1 	158.53 	8223 	536 	0.065 	328 	208
teo 	2 	219.37 	8373 	527 	0.063 	294 	233
teo 	3 	220.7 	8241 	538 	0.065 	340 	198
teo 	4 	211.8 	7923 	442 	0.056 	268 	174
menu 	0 	151.63 	8185 	326 	0.040 	308 	18
menu 	1 	183.45 	8873 	364 	0.041 	334 	30
menu 	2 	171.96 	8633 	380 	0.044 	369 	11
menu 	3 	164.95 	8451 	358 	0.042 	330 	28
menu 	4 	175.87 	8273 	340 	0.041 	317 	23
menu-1 	0 	119.77 	9041 	394 	0.044 	356 	38
menu-1 	1 	145.73 	8603 	335 	0.039 	293 	42
menu-1 	2 	157.89 	8345 	321 	0.038 	276 	45
menu-1 	3 	119.13 	8447 	346 	0.041 	290 	56
menu-1 	4 	142.77 	8331 	331 	0.040 	312 	19
menu-2 	0 	159.81 	8653 	342 	0.040 	296 	46
menu-2 	1 	165.01 	8421 	307 	0.036 	282 	25
menu-2 	2 	225.06 	8647 	376 	0.043 	317 	59
menu-2 	3 	232.13 	8095 	358 	0.044 	325 	33
menu-2 	4 	150.79 	8231 	323 	0.039 	299 	24
menu-3 	0 	168.87 	8153 	355 	0.044 	330 	25
menu-3 	1 	187.68 	9143 	405 	0.044 	338 	67
menu-3 	2 	129.77 	9705 	384 	0.040 	301 	83
menu-3 	3 	152.49 	9679 	469 	0.048 	374 	95
menu-3 	4 	131.0 	9077 	321 	0.035 	283 	38
menu-4 	0 	116.68 	9107 	373 	0.041 	333 	40
menu-4 	1 	164.1 	8655 	297 	0.034 	287 	10
menu-4 	2 	157.52 	8009 	300 	0.037 	297 	3
menu-4 	3 	138.47 	8567 	345 	0.040 	341 	4
menu-4 	4 	130.84 	8027 	324 	0.040 	316 	8
menu-5 	0 	139.77 	8533 	327 	0.038 	317 	10
menu-5 	1 	157.22 	9127 	433 	0.047 	373 	60
menu-5 	2 	144.54 	8313 	329 	0.040 	311 	18
menu-5 	3 	151.55 	8675 	316 	0.036 	301 	15
menu-5 	4 	137.49 	8823 	354 	0.040 	336 	18
menu 	0 	128.97 	8383 	329 	0.039 	284 	45
menu 	1 	141.97 	8945 	402 	0.045 	344 	58
menu 	2 	88.16 	8829 	368 	0.042 	307 	61
menu 	3 	81.49 	9165 	430 	0.047 	371 	59
menu 	4 	107.58 	9193 	401 	0.044 	335 	66
teo 	0 	149.28 	8399 	521 	0.062 	287 	234
teo 	1 	105.61 	8717 	563 	0.065 	306 	257
teo 	2 	116.65 	7893 	550 	0.070 	284 	266
teo 	3 	119.57 	8259 	489 	0.059 	282 	207
teo 	4 	187.64 	7897 	471 	0.060 	303 	168

And the rk3399 numbers as promised; just like rk3588, we see much better
IO performance (menu-5) without significantly worse idle decisions:

device	 gov	 iter	 iops	 idles	 idle_misses	 idle_miss_ratio	 belows	 aboves	
mapper/dm-slow 	teo 	0 	461 	102980 	32260 	0.313 	6036 	26224
mapper/dm-slow 	teo 	1 	461 	100698 	31149 	0.309 	5796 	25353
mapper/dm-slow 	teo 	2 	461 	100346 	32840 	0.327 	6022 	26818
mapper/dm-slow 	teo 	3 	461 	98450 	31513 	0.320 	5129 	26384
mapper/dm-slow 	teo 	4 	461 	98689 	29982 	0.304 	3937 	26045
mapper/dm-slow 	menu 	0 	461 	97860 	22520 	0.230 	5505 	17015
mapper/dm-slow 	menu 	1 	461 	98002 	20858 	0.213 	3596 	17262
mapper/dm-slow 	menu 	2 	461 	100046 	22523 	0.225 	5333 	17190
mapper/dm-slow 	menu 	3 	461 	95020 	20827 	0.219 	4069 	16758
mapper/dm-slow 	menu 	4 	461 	98040 	22302 	0.227 	5498 	16804
mapper/dm-slow 	menu-1 	0 	461 	98186 	20648 	0.210 	3210 	17438
mapper/dm-slow 	menu-1 	1 	461 	94360 	20297 	0.215 	4184 	16113
mapper/dm-slow 	menu-1 	2 	461 	98818 	21680 	0.219 	4750 	16930
mapper/dm-slow 	menu-1 	3 	461 	97822 	20605 	0.211 	3469 	17136
mapper/dm-slow 	menu-1 	4 	461 	100748 	21740 	0.216 	4403 	17337
mapper/dm-slow 	menu-2 	0 	461 	94388 	20289 	0.215 	3449 	16840
mapper/dm-slow 	menu-2 	1 	460 	89124 	18897 	0.212 	2401 	16496
mapper/dm-slow 	menu-2 	2 	461 	94932 	20692 	0.218 	2949 	17743
mapper/dm-slow 	menu-2 	3 	461 	95270 	20612 	0.216 	3048 	17564
mapper/dm-slow 	menu-2 	4 	461 	101954 	23493 	0.230 	5978 	17515
mapper/dm-slow 	menu-3 	0 	461 	98452 	21161 	0.215 	4247 	16914
mapper/dm-slow 	menu-3 	1 	461 	100342 	21035 	0.210 	3891 	17144
mapper/dm-slow 	menu-3 	2 	461 	101156 	23322 	0.231 	5924 	17398
mapper/dm-slow 	menu-3 	3 	461 	98052 	20862 	0.213 	3927 	16935
mapper/dm-slow 	menu-3 	4 	461 	97746 	20977 	0.215 	3706 	17271
mapper/dm-slow 	menu-4 	0 	461 	99826 	23727 	0.238 	5055 	18672
mapper/dm-slow 	menu-4 	1 	461 	101686 	24859 	0.244 	5175 	19684
mapper/dm-slow 	menu-4 	2 	461 	99934 	23568 	0.236 	4477 	19091
mapper/dm-slow 	menu-4 	3 	461 	97298 	22142 	0.228 	3644 	18498
mapper/dm-slow 	menu-4 	4 	461 	98546 	24023 	0.244 	5086 	18937
mapper/dm-slow 	menu-5 	0 	461 	100545 	22833 	0.227 	3830 	19003
mapper/dm-slow 	menu-5 	1 	461 	100827 	23999 	0.238 	5217 	18782
mapper/dm-slow 	menu-5 	2 	461 	97044 	22628 	0.233 	2910 	19718
mapper/dm-slow 	menu-5 	3 	461 	100234 	23303 	0.232 	4819 	18484
mapper/dm-slow 	menu-5 	4 	461 	102358 	24488 	0.239 	4770 	19718
mapper/dm-slow 	menu 	0 	461 	97008 	21114 	0.218 	4540 	16574
mapper/dm-slow 	menu 	1 	461 	96088 	21470 	0.223 	3650 	17820
mapper/dm-slow 	menu 	2 	461 	99008 	21019 	0.212 	3405 	17614
mapper/dm-slow 	menu 	3 	461 	96608 	20145 	0.209 	3729 	16416
mapper/dm-slow 	menu 	4 	461 	83152 	17469 	0.210 	2426 	15043
mapper/dm-slow 	teo 	0 	461 	99340 	32077 	0.323 	5772 	26305
mapper/dm-slow 	teo 	1 	461 	98694 	29426 	0.298 	3585 	25841
mapper/dm-slow 	teo 	2 	461 	100294 	29810 	0.297 	3561 	26249
mapper/dm-slow 	teo 	3 	461 	98726 	29496 	0.299 	3644 	25852
mapper/dm-slow 	teo 	4 	461 	101424 	32654 	0.322 	6029 	26625
mmcblk1 	teo 	0 	2016 	559362 	29994 	0.054 	2896 	27098
mmcblk1 	teo 	1 	2037 	562153 	30001 	0.053 	3171 	26830
mmcblk1 	teo 	2 	2016 	557360 	30185 	0.054 	2986 	27199
mmcblk1 	menu 	0 	1279 	335364 	103600 	0.309 	87662 	15938
mmcblk1 	menu 	1 	1292 	342036 	105446 	0.308 	89031 	16415
mmcblk1 	menu 	2 	1294 	352954 	108588 	0.308 	90420 	18168
mmcblk1 	menu-1 	0 	1271 	331220 	103163 	0.311 	87602 	15561
mmcblk1 	menu-1 	1 	1291 	350084 	108982 	0.311 	90670 	18312
mmcblk1 	menu-1 	2 	1284 	346412 	107494 	0.310 	89899 	17595
mmcblk1 	menu-2 	0 	1306 	344316 	106253 	0.309 	89650 	16603
mmcblk1 	menu-2 	1 	1278 	345684 	107893 	0.312 	90292 	17601
mmcblk1 	menu-2 	2 	1268 	334528 	104494 	0.312 	88457 	16037
mmcblk1 	menu-3 	0 	1270 	333456 	104160 	0.312 	88392 	15768
mmcblk1 	menu-3 	1 	1273 	338328 	105477 	0.312 	88798 	16679
mmcblk1 	menu-3 	2 	1280 	337002 	104623 	0.310 	88516 	16107
mmcblk1 	menu-4 	0 	1311 	344896 	104192 	0.302 	87051 	17141
mmcblk1 	menu-4 	1 	1292 	343878 	106459 	0.310 	88297 	18162
mmcblk1 	menu-4 	2 	1286 	340172 	105502 	0.310 	87753 	17749
mmcblk1 	menu-5 	0 	2006 	550266 	24981 	0.045 	6762 	18219
mmcblk1 	menu-5 	1 	1997 	553590 	26974 	0.049 	6955 	20019
mmcblk1 	menu-5 	2 	1994 	539494 	17652 	0.033 	3903 	13749
mmcblk2 	teo 	0 	5691 	820134 	29346 	0.036 	3078 	26268
mmcblk2 	teo 	1 	5684 	856976 	23202 	0.027 	1908 	21294
mmcblk2 	teo 	2 	5783 	824666 	13984 	0.017 	3938 	10046
mmcblk2 	menu 	0 	2770 	433474 	144860 	0.334 	127466 	17394
mmcblk2 	menu 	1 	3308 	367848 	89597 	0.244 	72668 	16929
mmcblk2 	menu 	2 	2882 	422844 	133523 	0.316 	117170 	16353
mmcblk2 	menu-1 	0 	3323 	394674 	115764 	0.293 	99328 	16436
mmcblk2 	menu-1 	1 	2778 	420262 	139538 	0.332 	122356 	17182
mmcblk2 	menu-1 	2 	2895 	400774 	124841 	0.311 	109572 	15269
mmcblk2 	menu-2 	0 	2679 	429818 	148494 	0.345 	131513 	16981
mmcblk2 	menu-2 	1 	3162 	363888 	96102 	0.264 	79200 	16902
mmcblk2 	menu-2 	2 	2684 	422324 	144606 	0.342 	128528 	16078
mmcblk2 	menu-3 	0 	2953 	392124 	118629 	0.303 	101068 	17561
mmcblk2 	menu-3 	1 	3003 	402614 	120567 	0.299 	103321 	17246
mmcblk2 	menu-3 	2 	2858 	422576 	136118 	0.322 	119485 	16633
mmcblk2 	menu-4 	0 	3288 	436860 	118566 	0.271 	100329 	18237
mmcblk2 	menu-4 	1 	3062 	462484 	139897 	0.302 	121424 	18473
mmcblk2 	menu-4 	2 	3257 	424458 	115493 	0.272 	97739 	17754
mmcblk2 	menu-5 	0 	5316 	573050 	52502 	0.092 	33285 	19217
mmcblk2 	menu-5 	1 	5446 	825538 	44073 	0.053 	24355 	19718
mmcblk2 	menu-5 	2 	5292 	796000 	52828 	0.066 	38640 	14188
nvme0n1 	teo 	0 	11371 	807338 	29879 	0.037 	2961 	26918
nvme0n1 	teo 	1 	11557 	815682 	29116 	0.036 	2947 	26169
nvme0n1 	teo 	2 	11424 	810108 	29800 	0.037 	2953 	26847
nvme0n1 	menu 	0 	7754 	574116 	93148 	0.162 	76482 	16666
nvme0n1 	menu 	1 	8371 	618954 	95502 	0.154 	77657 	17845
nvme0n1 	menu 	2 	5111 	412030 	73440 	0.178 	55997 	17443
nvme0n1 	menu-1 	0 	6628 	506618 	91832 	0.181 	71427 	20405
nvme0n1 	menu-1 	1 	4923 	390294 	68772 	0.176 	52880 	15892
nvme0n1 	menu-1 	2 	5015 	396160 	68840 	0.174 	52867 	15973
nvme0n1 	menu-2 	0 	7883 	589296 	97497 	0.165 	79635 	17862
nvme0n1 	menu-2 	1 	6465 	493796 	81561 	0.165 	64629 	16932
nvme0n1 	menu-2 	2 	5363 	430614 	75499 	0.175 	57528 	17971
nvme0n1 	menu-3 	0 	5090 	415018 	74191 	0.179 	56015 	18176
nvme0n1 	menu-3 	1 	4919 	401452 	71994 	0.179 	54457 	17537
nvme0n1 	menu-3 	2 	5183 	413186 	74199 	0.180 	57542 	16657
nvme0n1 	menu-4 	0 	5402 	424860 	67413 	0.159 	49399 	18014
nvme0n1 	menu-4 	1 	5343 	420538 	67713 	0.161 	49826 	17887
nvme0n1 	menu-4 	2 	9151 	669840 	107892 	0.161 	87774 	20118
nvme0n1 	menu-5 	0 	10475 	741376 	20204 	0.027 	1827 	18377
nvme0n1 	menu-5 	1 	10603 	747262 	19228 	0.026 	1489 	17739
nvme0n1 	menu-5 	2 	11658 	824996 	22954 	0.028 	2631 	20323
sda 	teo 	0 	2328 	1334552 	48960 	0.037 	20922 	28038
sda 	teo 	1 	2328 	1267840 	37740 	0.030 	11934 	25806
sda 	teo 	2 	2394 	1360679 	21853 	0.016 	3394 	18459
sda 	menu 	0 	1004 	587054 	 198775 	0.339 	184002 	14773
sda 	menu 	1 	1205 	663838 	 209623 	0.316 	193325 	16298
sda 	menu 	2 	1117 	615382 	 208813 	0.339 	191893 	16920
sda 	menu-1 	0 	1103 	627838 	 212955 	0.339 	195703 	17252
sda 	menu-1 	1 	1024 	611658 	 221754 	0.363 	203710 	18044
sda 	menu-1 	2 	1209 	639008 	 180597 	0.283 	163837 	16760
sda 	menu-2 	0 	1200 	655398 	 205664 	0.314 	190750 	14914
sda 	menu-2 	1 	1100 	582222 	 201983 	0.347 	185874 	16109
sda 	menu-2 	2 	1124 	602988 	 199623 	0.331 	183798 	15825
sda 	menu-3 	0 	1089 	612112 	 211470 	0.345 	195156 	16314
sda 	menu-3 	1 	1077 	613556 	 213484 	0.348 	196839 	16645
sda 	menu-3 	2 	1157 	636904 	 195439 	0.307 	179535 	15904
sda 	menu-4 	0 	1126 	643468 	 208132 	0.323 	189334 	18798
sda 	menu-4 	1 	1112 	634480 	 216012 	0.340 	196841 	19171
sda 	menu-4 	2 	1190 	594398 	 196059 	0.330 	176190 	19869
sda 	menu-5 	0 	2074 	1134718 	81294 	0.072 	61820 	19474
sda 	menu-5 	1 	2179 	1249056 	76679 	0.061 	55461 	21218
sda 	menu-5 	2 	2075 	1183214 	124075 	0.105 	101650 	22425
nullb0 	teo 	0 	104833 	85906 	29085 	0.339 	3409 	25676
nullb0 	teo 	1 	103787 	88419 	29833 	0.337 	2980 	26853
nullb0 	teo 	2 	104611 	86284 	29390 	0.341 	3315 	26075
nullb0 	menu 	0 	103671 	87146 	20514 	0.235 	2643 	17871
nullb0 	menu 	1 	104380 	70086 	16855 	0.240 	1642 	15213
nullb0 	menu 	2 	103249 	81414 	19403 	0.238 	2274 	17129
nullb0 	menu-1 	0 	103424 	86984 	20448 	0.235 	2617 	17831
nullb0 	menu-1 	1 	103857 	85658 	20840 	0.243 	2544 	18296
nullb0 	menu-1 	2 	103907 	86644 	20639 	0.238 	2586 	18053
nullb0 	menu-2 	0 	103668 	82558 	20053 	0.243 	2655 	17398
nullb0 	menu-2 	1 	104277 	86914 	20472 	0.236 	2593 	17879
nullb0 	menu-2 	2 	103697 	82952 	20221 	0.244 	2410 	17811
nullb0 	menu-3 	0 	103696 	86534 	20782 	0.240 	2968 	17814
nullb0 	menu-3 	1 	103996 	81902 	19795 	0.242 	2773 	17022
nullb0 	menu-3 	2 	103790 	82474 	20058 	0.243 	2344 	17714
nullb0 	menu-4 	0 	103288 	87475 	22688 	0.259 	2596 	20092
nullb0 	menu-4 	1 	103848 	70906 	18106 	0.255 	1557 	16549
nullb0 	menu-4 	2 	104141 	84528 	22147 	0.262 	2969 	19178
nullb0 	menu-5 	0 	103812 	79234 	17989 	0.227 	1302 	16687
nullb0 	menu-5 	1 	104334 	87752 	22878 	0.261 	2511 	20367
nullb0 	menu-5 	2 	104059 	88681 	22765 	0.257 	3030 	19735
mtdblock3 	teo 	0 	257 	604294 	17359 	0.029 	3241 	14118
mtdblock3 	teo 	1 	256 	332344 	31631 	0.095 	4644 	26987
mtdblock3 	teo 	2 	257 	549736 	29559 	0.054 	2841 	26718
mtdblock3 	menu 	0 	148 	417388 	134505 	0.322 	118487 	16018
mtdblock3 	menu 	1 	137 	422132 	149336 	0.354 	132655 	16681
mtdblock3 	menu 	2 	194 	223808 	59631 	0.266 	43076 	16555
mtdblock3 	menu-1 	0 	145 	529186 	143789 	0.272 	129433 	14356
mtdblock3 	menu-1 	1 	147 	452302 	140418 	0.310 	125009 	15409
mtdblock3 	menu-1 	2 	138 	415152 	146470 	0.353 	130607 	15863
mtdblock3 	menu-2 	0 	155 	365750 	118483 	0.324 	102676 	15807
mtdblock3 	menu-2 	1 	165 	316818 	101968 	0.322 	85597 	16371
mtdblock3 	menu-2 	2 	143 	515664 	126014 	0.244 	115854 	10160
mtdblock3 	menu-3 	0 	135 	488188 	150917 	0.309 	136442 	14475
mtdblock3 	menu-3 	1 	125 	437774 	158893 	0.363 	143319 	15574
mtdblock3 	menu-3 	2 	138 	433332 	152457 	0.352 	135017 	17440
mtdblock3 	menu-4 	0 	173 	314250 	101648 	0.323 	81511 	20137
mtdblock3 	menu-4 	1 	149 	489030 	139551 	0.285 	124126 	15425
mtdblock3 	menu-4 	2 	148 	381488 	133543 	0.350 	115885 	17658
mtdblock3 	menu-5 	0 	222 	430158 	63885 	0.149 	45240 	18645
mtdblock3 	menu-5 	1 	218 	752248 	80500 	0.107 	66453 	14047
mtdblock3 	menu-5 	2 	203 	528828 	105885 	0.200 	89573 	16312

And finally a longer firefox youtube 4k playback (10mins) on x86:
gov	 iter	 Joules	 idles	 idle_misses	 idle_miss_ratio	 belows	 aboves	
menu 	0 	1064.48 	357559 	106048 	0.297 	105920 	128
menu 	1 	1029.85 	345569 	104050 	0.301 	103938 	112
menu 	2 	1105.93 	347105 	104958 	0.302 	104885 	73
menu 	3 	1085.86 	347365 	106061 	0.305 	106001 	60
menu 	4 	1115.24 	352609 	107913 	0.306 	107812 	101
menu-5 	0 	1139.09 	345827 	90172 	0.261 	89609 	563
menu-5 	1 	1111.58 	335521 	88953 	0.265 	88904 	49
menu-5 	2 	1093.73 	328645 	85949 	0.262 	85839 	110
menu-5 	3 	1036.69 	330547 	86163 	0.261 	86077 	86
menu-5 	4 	1117.31 	316143 	81707 	0.258 	81580 	127
menu 	0 	1099.72 	353895 	106574 	0.301 	106523 	51
menu 	1 	1148.44 	357867 	107578 	0.301 	107369 	209
menu 	2 	1098.15 	341957 	101995 	0.298 	101870 	125
menu 	3 	1124.41 	350423 	105592 	0.301 	105481 	111
menu 	4 	1185.94 	366799 	111132 	0.303 	111029 	103
menu-5 	0 	1129.85 	332413 	86991 	0.262 	86885 	106
menu-5 	1 	1086.59 	318221 	82020 	0.258 	81924 	96
menu-5 	2 	1063.5  	320273 	83099 	0.259 	83048 	51
menu-5 	3 	1070.7  	331179 	85998 	0.260 	85839 	159
menu-5 	4 	1067.82 	322689 	83634 	0.259 	83548 	86

Significantly improving idle miss belows.

I'll do the Android tests, but that is very unlikely to show something this
doesn't (there's only one non-WFI idle state and most workloads are intercept
heavy, so if anything menu-5 should improve the overall situation.)
Feel free to already add:
Tested-by: Christian Loehle <christian.loehle@arm.com>
Rafael J. Wysocki Feb. 19, 2025, 12:06 p.m. UTC | #12
On Tue, Feb 18, 2025 at 10:17 PM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 2/10/25 14:15, Christian Loehle wrote:
> > On 2/6/25 14:21, Rafael J. Wysocki wrote:
> >> [cut]
>
> Another dump for x86 idle this time, tldr: no worrying idle/power numbers

[cut]

> Significantly improving idle miss belows.
>
> I'll do the Android tests, but that is very unlikely to show something this
> doesn't (there's only one non-WFI idle state and most workloads are intercept
> heavy, so if anything menu-5 should improve the overall situation.)
> Feel free to already add:
> Tested-by: Christian Loehle <christian.loehle@arm.com>

Thank you!