Message ID: 1916668.tdWV9SEqCh@rjwysocki.net (mailing list archive)
Series: cpuidle: menu: Avoid discarding useful information when processing recent idle intervals
Hi, thanks for the patches!

On Thu, 2025-02-06 at 15:21 +0100, Rafael J. Wysocki wrote:
> Hi Everyone,
>
> This work had been triggered by a report that commit 0611a640e60a ("eventpoll:
> prefer kfree_rcu() in __ep_remove()") had caused the critical-jOPS metric of
> the SPECjbb 2015 benchmark [1] to drop by around 50% even though it generally
> reduced kernel overhead. Indeed, it was found during further investigation
> that the total interrupt rate while running the SPECjbb workload had fallen as
> a result of that commit by 55% and the local timer interrupt rate had fallen by
> almost 80%.

I ran SPECjbb2015 with the patches and it doubles critical-jOPS and basically
makes it "normal" again. Thanks!

Reported-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
On 2/7/25 14:48, Artem Bityutskiy wrote:
> Hi,
>
> thanks for the patches!
>
> On Thu, 2025-02-06 at 15:21 +0100, Rafael J. Wysocki wrote:
>> Hi Everyone,
>>
>> This work had been triggered by a report that commit 0611a640e60a ("eventpoll:
>> prefer kfree_rcu() in __ep_remove()") had caused the critical-jOPS metric of
>> the SPECjbb 2015 benchmark [1] to drop by around 50% even though it generally
>> reduced kernel overhead. Indeed, it was found during further investigation
>> that the total interrupt rate while running the SPECjbb workload had fallen as
>> a result of that commit by 55% and the local timer interrupt rate had fallen by
>> almost 80%.
>
> I ran SPECjbb2015 with the patches and it doubles critical-jOPS and basically
> makes it "normal" again. Thanks!
>
> Reported-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>

I'll go take a look in depth. Honestly, the statistical test in
get_typical_interval() is somewhat black magic to me both before and after
4/5, so if that actually works better, fine with me.
I'll run some tests, too.
On Fri, Feb 7, 2025 at 4:24 PM Christian Loehle <christian.loehle@arm.com> wrote:
>
> On 2/7/25 14:48, Artem Bityutskiy wrote:
> > Hi,
> >
> > thanks for the patches!
> >
> > On Thu, 2025-02-06 at 15:21 +0100, Rafael J. Wysocki wrote:
> >> Hi Everyone,
> >>
> >> This work had been triggered by a report that commit 0611a640e60a ("eventpoll:
> >> prefer kfree_rcu() in __ep_remove()") had caused the critical-jOPS metric of
> >> the SPECjbb 2015 benchmark [1] to drop by around 50% even though it generally
> >> reduced kernel overhead. Indeed, it was found during further investigation
> >> that the total interrupt rate while running the SPECjbb workload had fallen as
> >> a result of that commit by 55% and the local timer interrupt rate had fallen by
> >> almost 80%.
> >
> > I ran SPECjbb2015 with the patches and it doubles critical-jOPS and basically
> > makes it "normal" again. Thanks!
> >
> > Reported-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
> > Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
>
> I'll go take a look in depth. Honestly, the statistical test in
> get_typical_interval() is somewhat black magic to me both before and after
> 4/5, so if that actually works better, fine with me.
> I'll run some tests, too.

Thank you!
On Fri, Feb 7, 2025 at 3:48 PM Artem Bityutskiy
<artem.bityutskiy@linux.intel.com> wrote:
>
> Hi,
>
> thanks for the patches!
>
> On Thu, 2025-02-06 at 15:21 +0100, Rafael J. Wysocki wrote:
> > Hi Everyone,
> >
> > This work had been triggered by a report that commit 0611a640e60a ("eventpoll:
> > prefer kfree_rcu() in __ep_remove()") had caused the critical-jOPS metric of
> > the SPECjbb 2015 benchmark [1] to drop by around 50% even though it generally
> > reduced kernel overhead. Indeed, it was found during further investigation
> > that the total interrupt rate while running the SPECjbb workload had fallen as
> > a result of that commit by 55% and the local timer interrupt rate had fallen by
> > almost 80%.
>
> I ran SPECjbb2015 with the patches and it doubles critical-jOPS and basically
> makes it "normal" again. Thanks!
>
> Reported-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>

Thank you!
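For readers unfamiliar with the statistical test being discussed, here is a
rough userspace sketch of what the cover letter (quoted in the messages below)
describes get_typical_interval() as doing. It is only an illustration under
those assumptions, not the actual code in drivers/cpuidle/governors/menu.c,
and all names in it are invented:

#include <limits.h>
#include <math.h>

#define NR_INTERVALS	8

/*
 * Average the recent idle intervals and accept the average if the spread is
 * small enough (within 1/6 of the average, or within 20 us when the average
 * is close to 0); otherwise drop the highest sample and retry, giving up
 * with UINT_MAX once 1/4 of the samples have been eliminated.
 */
static unsigned int typical_interval_sketch(const unsigned int intervals[])
{
	unsigned int set[NR_INTERVALS];
	int count = NR_INTERVALS;
	int i;

	for (i = 0; i < NR_INTERVALS; i++)
		set[i] = intervals[i];

	while (count >= NR_INTERVALS - NR_INTERVALS / 4) {
		double sum = 0.0, var = 0.0, avg, stddev;
		unsigned int max = 0;
		int max_idx = 0;

		for (i = 0; i < count; i++)
			sum += set[i];
		avg = sum / count;

		for (i = 0; i < count; i++) {
			var += (set[i] - avg) * (set[i] - avg);
			if (set[i] >= max) {
				max = set[i];
				max_idx = i;
			}
		}
		stddev = sqrt(var / count);

		/* Confident enough: use the average as the prediction. */
		if (6.0 * stddev <= avg || stddev <= 20.0)
			return (unsigned int)avg;

		/* Drop the high-end sample (pre-[4/5] behaviour) and retry. */
		set[max_idx] = set[--count];
	}

	return UINT_MAX;	/* no sufficiently confident prediction */
}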
On 2/6/25 14:21, Rafael J. Wysocki wrote: > Hi Everyone, > > This work had been triggered by a report that commit 0611a640e60a ("eventpoll: > prefer kfree_rcu() in __ep_remove()") had caused the critical-jOPS metric of > the SPECjbb 2015 benchmark [1] to drop by around 50% even though it generally > reduced kernel overhead. Indeed, it was found during further investigation > that the total interrupt rate while running the SPECjbb workload had fallen as > a result of that commit by 55% and the local timer interrupt rate had fallen by > almost 80%. > > That turned out to cause the menu cpuidle governor to select the deepest idle > state supplied by the cpuidle driver (intel_idle) much more often which added > significantly more idle state latency to the workload and that led to the > decrease of the critical-jOPS score. > > Interestingly enough, this problem was not visible when the teo cpuidle > governor was used instead of menu, so it appeared to be specific to the > latter. CPU wakeup event statistics collected while running the workload > indicated that the menu governor was effectively ignoring non-timer wakeup > information and all of its idle state selection decisions appeared to be > based on timer wakeups only. Thus, it appeared that the reduction of the > local timer interrupt rate caused the governor to predict a idle duration > much more often while running the workload and the deepest idle state was > selected significantly more often as a result of that. > > A subsequent inspection of the get_typical_interval() function in the menu > governor indicated that it might return UINT_MAX too often which then caused > the governor's decisions to be based entirely on information related to timers. > > Generally speaking, UINT_MAX is returned by get_typical_interval() if it > cannot make a prediction based on the most recent idle intervals data with > sufficiently high confidence, but at least in some cases this means that > useful information is not taken into account at all which may lead to > significant idle state selection mistakes. Moreover, this is not really > unlikely to happen. > > One issue with get_typical_interval() is that, when it eliminates outliers from > the sample set in an attempt to reduce the standard deviation (and so improve > the prediction confidence), it does that by dropping high-end samples only, > while samples at the low end of the set are retained. However, the samples > at the low end very well may be the outliers and they should be eliminated > from the sample set instead of the high-end samples. Accordingly, the > likelihood of making a meaningful idle duration prediction can be improved > by making it also eliminate low-end samples if they are farther from the > average than high-end samples. This is done in patch [4/5]. > > Another issue is that get_typical_interval() gives up after eliminating 1/4 > of the samples if the standard deviation is still not as low as desired (within > 1/6 of the average or within 20 us if the average is close to 0), but the > remaining samples in the set still represent useful information at that point > and discarding them altogether may lead to suboptimal idle state selection. > > For instance, the largest idle duration value in the get_typical_interval() > data set is the maximum idle duration observed recently and it is likely that > the upcoming idle duration will not exceed it. 
Therefore, in the absence of > a better choice, this value can be used as an upper bound on the target > residency of the idle state to select. Patch [5/5] works along these lines, > but it takes the maximum data point remaining after the elimination of > outliers. > > The first two patches in the series are straightforward cleanups (in fact, > the first patch is kind of reversed by patch [4/5], but it is there because > it can be applied without the latter) and patch [3/5] is a cosmetic change > made in preparation for the subsequent ones. > > This series turns out to restore the SPECjbb critical-jOPS metric on affected > systems to the level from before commit 0611a640e60a and it also happens to > increase its max-jOPS metric by around 3%. > > For easier reference/testing it is present in the git branch at > > https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=experimental/menu > > based on the cpuidle material that went into 6.14-rc1. > > If possible, please let me know if it works for you. > > Thanks! > > > [1] Link: https://www.spec.org/jbb2015/ 5/5 shows significant IO workload improvements (the shorter wakeup scenario is much more likely to be picked up now). I don't see a significant regression in idle misses so far, I'll try Android backports soon and some other system. Here's a full dump, sorry it's from a different system (rk3588, only two idle states), apparently eth networking is broken on 6.14-rc1 now on rk3399 :( For dm-delay 51ms (dm-slow) the command is (8 CPUs) fio --minimal --time_based --group_reporting --name=fiotest --filename=/dev/mapper/dm-slow --runtime=30s --numjobs=16 --rw=randread --bs=4k --ioengine=psync --iodepth=1 --direct=1 For the rest: fio --minimal --time_based --name=fiotest --filename=/dev/mmcblk1 --runtime=30s --rw=randread --bs=4k --ioengine=psync --iodepth=1 --direct=1 device gov iter iops idles idle_misses idle_miss_ratio belows aboves mapper/dm-slow menu 0 307 99648 318 0.003 318 0 mapper/dm-slow menu 1 307 100948 389 0.004 389 0 mapper/dm-slow menu 2 307 99512 380 0.004 380 0 mapper/dm-slow menu 3 307 99212 307 0.003 307 0 mapper/dm-slow menu 4 307 100156 343 0.003 343 0 mapper/dm-slow menu-1 0 307 97434 260 0.003 260 0 mapper/dm-slow menu-1 1 307 94628 324 0.003 324 0 mapper/dm-slow menu-1 2 307 98004 248 0.003 248 0 mapper/dm-slow menu-1 3 307 97524 263 0.003 263 0 mapper/dm-slow menu-1 4 307 97048 304 0.003 304 0 mapper/dm-slow menu-2 0 307 98340 376 0.004 376 0 mapper/dm-slow menu-2 1 307 96246 275 0.003 275 0 mapper/dm-slow menu-2 2 307 96456 317 0.003 317 0 mapper/dm-slow menu-2 3 307 100054 268 0.003 268 0 mapper/dm-slow menu-2 4 307 93378 288 0.003 288 0 mapper/dm-slow menu-3 0 307 95140 303 0.003 303 0 mapper/dm-slow menu-3 1 307 95858 318 0.003 318 0 mapper/dm-slow menu-3 2 307 100528 302 0.003 302 0 mapper/dm-slow menu-3 3 307 98274 311 0.003 311 0 mapper/dm-slow menu-3 4 307 98428 327 0.003 327 0 mapper/dm-slow menu-4 0 307 100340 304 0.003 304 0 mapper/dm-slow menu-4 1 307 101628 359 0.004 359 0 mapper/dm-slow menu-4 2 307 100624 281 0.003 281 0 mapper/dm-slow menu-4 3 307 99824 340 0.003 340 0 mapper/dm-slow menu-4 4 307 98318 290 0.003 290 0 mapper/dm-slow menu-5 0 307 96842 310 0.003 310 0 mapper/dm-slow menu-5 1 307 98884 271 0.003 271 0 mapper/dm-slow menu-5 2 307 99706 259 0.003 259 0 mapper/dm-slow menu-5 3 307 93096 270 0.003 270 0 mapper/dm-slow menu-5 4 307 101590 333 0.003 333 0 mapper/dm-slow menu-m 0 307 94270 297 0.003 297 0 mapper/dm-slow menu-m 1 307 99820 355 0.004 355 0 mapper/dm-slow menu-m 2 
307 99284 313 0.003 313 0 mapper/dm-slow menu-m 3 307 99320 288 0.003 288 0 mapper/dm-slow menu-m 4 307 99666 269 0.003 269 0 mmcblk1 menu 0 818 227246 32716 0.144 32716 0 mmcblk1 menu 1 818 252552 33582 0.133 33582 0 mmcblk1 menu 2 825 255822 31958 0.125 31958 0 mmcblk1 menu 3 822 255814 33374 0.130 33374 0 mmcblk1 menu 4 822 253200 33310 0.132 33310 0 mmcblk1 menu-1 0 822 254768 33545 0.132 33545 0 mmcblk1 menu-1 1 819 249476 33289 0.133 33289 0 mmcblk1 menu-1 2 823 256152 32838 0.128 32838 0 mmcblk1 menu-1 3 824 231098 31120 0.135 31120 0 mmcblk1 menu-1 4 820 254590 33189 0.130 33189 0 mmcblk1 menu-2 0 824 256084 32927 0.129 32927 0 mmcblk1 menu-2 1 806 240166 33672 0.140 33672 0 mmcblk1 menu-2 2 808 253178 33963 0.134 33963 0 mmcblk1 menu-2 3 822 240628 32860 0.137 32860 0 mmcblk1 menu-2 4 811 251522 33478 0.133 33478 0 mmcblk1 menu-3 0 810 251914 32477 0.129 32477 0 mmcblk1 menu-3 1 811 253324 32344 0.128 32344 0 mmcblk1 menu-3 2 826 239634 31478 0.131 31478 0 mmcblk1 menu-3 3 811 252462 33810 0.134 33810 0 mmcblk1 menu-3 4 806 231730 33646 0.145 33646 0 mmcblk1 menu-4 0 826 231986 32301 0.139 32301 0 mmcblk1 menu-4 1 821 256988 34290 0.133 34290 0 mmcblk1 menu-4 2 805 247456 35092 0.142 35092 0 mmcblk1 menu-4 3 807 255072 35291 0.138 35291 0 mmcblk1 menu-4 4 808 255076 35222 0.138 35222 0 mmcblk1 menu-5 0 861 308822 26267 0.085 26267 0 mmcblk1 menu-5 1 835 288153 26496 0.092 26496 0 mmcblk1 menu-5 2 841 304148 26916 0.088 26916 0 mmcblk1 menu-5 3 858 304838 26347 0.086 26347 0 mmcblk1 menu-5 4 859 303370 26090 0.086 26090 0 mmcblk1 menu-m 0 811 243486 33215 0.136 33215 0 mmcblk1 menu-m 1 827 256902 32863 0.128 32863 0 mmcblk1 menu-m 2 807 249032 34080 0.137 34080 0 mmcblk1 menu-m 3 809 253537 33718 0.133 33718 0 mmcblk1 menu-m 4 824 241996 32842 0.136 32842 0 mmcblk1 teo 0 874 346720 18326 0.053 18326 0 mmcblk1 teo 1 889 350712 19364 0.055 19364 0 mmcblk1 teo 2 874 341195 19004 0.056 19004 0 mmcblk1 teo 3 870 343718 18770 0.055 18770 0 mmcblk1 teo 4 871 321152 18415 0.057 18415 0 nvme0n1 menu 0 11546 819014 110717 0.135 110717 0 nvme0n1 menu 1 10507 745534 86297 0.116 86297 0 nvme0n1 menu 2 11758 829030 110667 0.133 110667 0 nvme0n1 menu 3 10762 768898 93655 0.122 93655 0 nvme0n1 menu 4 11719 820536 110456 0.135 110456 0 nvme0n1 menu-1 0 11409 811374 111285 0.137 111285 0 nvme0n1 menu-1 1 11432 805208 108621 0.135 108621 0 nvme0n1 menu-1 2 11154 781534 100566 0.129 100566 0 nvme0n1 menu-1 3 10180 724944 73523 0.101 73523 0 nvme0n1 menu-1 4 11667 827804 110505 0.133 110505 0 nvme0n1 menu-2 0 11091 791998 105824 0.134 105824 0 nvme0n1 menu-2 1 10664 748122 90282 0.121 90282 0 nvme0n1 menu-2 2 10921 773806 95668 0.124 95668 0 nvme0n1 menu-2 3 11445 807918 112475 0.139 112475 0 nvme0n1 menu-2 4 10629 761546 90181 0.118 90181 0 nvme0n1 menu-3 0 10330 723824 74813 0.103 74813 0 nvme0n1 menu-3 1 10242 717762 74187 0.103 74187 0 nvme0n1 menu-3 2 10579 754108 86841 0.115 86841 0 nvme0n1 menu-3 3 10161 730416 76722 0.105 76722 0 nvme0n1 menu-3 4 11665 820052 112621 0.137 112621 0 nvme0n1 menu-4 0 11279 789456 106411 0.135 106411 0 nvme0n1 menu-4 1 11095 766714 98036 0.128 98036 0 nvme0n1 menu-4 2 11003 786088 98979 0.126 98979 0 nvme0n1 menu-4 3 10371 746978 77039 0.103 77039 0 nvme0n1 menu-4 4 10761 770218 89958 0.117 89958 0 nvme0n1 menu-5 0 13243 926672 514 0.001 514 0 nvme0n1 menu-5 1 14235 985852 1054 0.001 1054 0 nvme0n1 menu-5 2 13032 911560 506 0.001 506 0 nvme0n1 menu-5 3 13074 917252 691 0.001 691 0 nvme0n1 menu-5 4 13361 933126 466 0.000 466 0 nvme0n1 menu-m 0 10290 739468 73692 
0.100 73692 0 nvme0n1 menu-m 1 10647 763144 80430 0.105 80430 0 nvme0n1 menu-m 2 11067 790362 98525 0.125 98525 0 nvme0n1 menu-m 3 11337 806888 102446 0.127 102446 0 nvme0n1 menu-m 4 11519 818128 110233 0.135 110233 0 nvme0n1 teo 0 14267 994532 273 0.000 273 0 nvme0n1 teo 1 13857 965726 395 0.000 395 0 nvme0n1 teo 2 12762 892900 311 0.000 311 0 nvme0n1 teo 3 13056 900172 269 0.000 269 0 nvme0n1 teo 4 13687 956048 240 0.000 240 0 sda menu 0 1943 1044428 162298 0.155 162298 0 sda menu 1 1601 860152 232733 0.271 232733 0 sda menu 2 1947 1089550 154879 0.142 154879 0 sda menu 3 1917 992278 146316 0.147 146316 0 sda menu 4 1706 947224 257686 0.272 257686 0 sda menu-1 0 1981 1109204 174590 0.157 174590 0 sda menu-1 1 1778 989142 271685 0.275 271685 0 sda menu-1 2 1759 955310 252735 0.265 252735 0 sda menu-1 3 1818 985389 180365 0.183 180365 0 sda menu-1 4 1782 915060 247016 0.270 247016 0 sda menu-2 0 1877 959734 181691 0.189 181691 0 sda menu-2 1 1718 961724 262950 0.273 262950 0 sda menu-2 2 1751 949092 259223 0.273 259223 0 sda menu-2 3 1808 1011822 211016 0.209 211016 0 sda menu-2 4 1734 959348 261769 0.273 261769 0 sda menu-3 0 1723 952826 260493 0.273 260493 0 sda menu-3 1 1718 931974 254462 0.273 254462 0 sda menu-3 2 1773 984232 239335 0.243 239335 0 sda menu-3 3 1741 969477 265131 0.273 265131 0 sda menu-3 4 1735 970372 263907 0.272 263907 0 sda menu-4 0 1911 1030290 170538 0.166 170538 0 sda menu-4 1 1769 972168 233029 0.240 233029 0 sda menu-4 2 1737 969896 260880 0.269 260880 0 sda menu-4 3 1738 941298 253874 0.270 253874 0 sda menu-4 4 1701 953710 258250 0.271 258250 0 sda menu-5 0 2463 1349556 26158 0.019 26158 0 sda menu-5 1 2359 1344306 80343 0.060 80343 0 sda menu-5 2 2280 1306554 115670 0.089 115670 0 sda menu-5 3 2573 1420702 4765 0.003 4765 0 sda menu-5 4 2348 1355996 70428 0.052 70428 0 sda menu-m 0 1738 962150 261205 0.271 261205 0 sda menu-m 1 1667 922214 238208 0.258 238208 0 sda menu-m 2 1696 911352 255364 0.280 255364 0 sda menu-m 3 1840 1006556 193333 0.192 193333 0 sda menu-m 4 1681 919693 251029 0.273 251029 0 sda teo 0 2503 1427634 25997 0.018 25997 0 sda teo 1 2424 1401434 35228 0.025 35228 0 sda teo 2 2527 1454382 27546 0.019 27546 0 sda teo 3 2481 1430128 16678 0.012 16678 0 sda teo 4 2589 1481254 13389 0.009 13389 0 nullb0 menu 0 337827 88502 200 0.002 200 0 nullb0 menu 1 337833 87476 188 0.002 188 0 nullb0 menu 2 336378 88862 92 0.001 92 0 nullb0 menu 3 336022 86174 188 0.002 188 0 nullb0 menu 4 335158 87880 156 0.002 156 0 nullb0 menu-1 0 334663 89150 199 0.002 199 0 nullb0 menu-1 1 338526 88184 111 0.001 111 0 nullb0 menu-1 2 336671 89210 211 0.002 211 0 nullb0 menu-1 3 337454 82408 198 0.002 198 0 nullb0 menu-1 4 338256 86994 118 0.001 118 0 nullb0 menu-2 0 336636 82202 165 0.002 165 0 nullb0 menu-2 1 337580 77918 171 0.002 171 0 nullb0 menu-2 2 336260 89198 226 0.003 226 0 nullb0 menu-2 3 338440 85444 215 0.003 215 0 nullb0 menu-2 4 333633 87244 119 0.001 119 0 nullb0 menu-3 0 336890 88096 122 0.001 122 0 nullb0 menu-3 1 335804 68502 79 0.001 79 0 nullb0 menu-3 2 336863 87258 195 0.002 195 0 nullb0 menu-3 3 337091 76452 127 0.002 127 0 nullb0 menu-3 4 336142 80664 83 0.001 83 0 nullb0 menu-4 0 336840 86936 128 0.001 128 0 nullb0 menu-4 1 334498 88792 113 0.001 113 0 nullb0 menu-4 2 336736 88542 104 0.001 104 0 nullb0 menu-4 3 336476 64548 70 0.001 70 0 nullb0 menu-4 4 337513 84776 107 0.001 107 0 nullb0 menu-5 0 338498 89216 135 0.002 135 0 nullb0 menu-5 1 335087 87424 85 0.001 85 0 nullb0 menu-5 2 336965 75456 179 0.002 179 0 nullb0 menu-5 3 337415 88112 
114 0.001 114 0 nullb0 menu-5 4 332365 76456 82 0.001 82 0 nullb0 menu-m 0 337718 88018 125 0.001 125 0 nullb0 menu-m 1 337801 86584 164 0.002 164 0 nullb0 menu-m 2 336760 84262 102 0.001 102 0 nullb0 menu-m 3 337524 87902 147 0.002 147 0 nullb0 menu-m 4 333724 87916 117 0.001 117 0 nullb0 teo 0 336215 88312 231 0.003 231 0 nullb0 teo 1 337653 88802 266 0.003 266 0 nullb0 teo 2 337198 87960 234 0.003 234 0 nullb0 teo 3 338716 88516 227 0.003 227 0 nullb0 teo 4 336334 88978 261 0.003 261 0
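As an aside for readers following along: the change the cover letter quoted
above describes for patch [4/5] amounts to picking which extreme sample to
treat as the outlier instead of always dropping the highest one. In the
sketch shown earlier in the thread this would replace the "drop the highest
sample" step. A hypothetical helper along those lines (invented for
illustration, not the actual diff) could look like this:

/*
 * Pick the sample to eliminate: the low-end sample if it is farther from
 * the average than the high-end one, otherwise the high-end sample as
 * before.
 */
static int pick_outlier(const unsigned int set[], int count, unsigned int avg)
{
	unsigned int lo = set[0], hi = set[0];
	int lo_idx = 0, hi_idx = 0, i;

	for (i = 1; i < count; i++) {
		if (set[i] < lo) {
			lo = set[i];
			lo_idx = i;
		}
		if (set[i] > hi) {
			hi = set[i];
			hi_idx = i;
		}
	}

	if (avg - lo > hi - avg)
		return lo_idx;	/* the short interval is the outlier */

	return hi_idx;
}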
On Mon, Feb 10, 2025 at 3:15 PM Christian Loehle <christian.loehle@arm.com> wrote: > > On 2/6/25 14:21, Rafael J. Wysocki wrote: > > Hi Everyone, > > > > This work had been triggered by a report that commit 0611a640e60a ("eventpoll: > > prefer kfree_rcu() in __ep_remove()") had caused the critical-jOPS metric of > > the SPECjbb 2015 benchmark [1] to drop by around 50% even though it generally > > reduced kernel overhead. Indeed, it was found during further investigation > > that the total interrupt rate while running the SPECjbb workload had fallen as > > a result of that commit by 55% and the local timer interrupt rate had fallen by > > almost 80%. > > > > That turned out to cause the menu cpuidle governor to select the deepest idle > > state supplied by the cpuidle driver (intel_idle) much more often which added > > significantly more idle state latency to the workload and that led to the > > decrease of the critical-jOPS score. > > > > Interestingly enough, this problem was not visible when the teo cpuidle > > governor was used instead of menu, so it appeared to be specific to the > > latter. CPU wakeup event statistics collected while running the workload > > indicated that the menu governor was effectively ignoring non-timer wakeup > > information and all of its idle state selection decisions appeared to be > > based on timer wakeups only. Thus, it appeared that the reduction of the > > local timer interrupt rate caused the governor to predict a idle duration > > much more often while running the workload and the deepest idle state was > > selected significantly more often as a result of that. > > > > A subsequent inspection of the get_typical_interval() function in the menu > > governor indicated that it might return UINT_MAX too often which then caused > > the governor's decisions to be based entirely on information related to timers. > > > > Generally speaking, UINT_MAX is returned by get_typical_interval() if it > > cannot make a prediction based on the most recent idle intervals data with > > sufficiently high confidence, but at least in some cases this means that > > useful information is not taken into account at all which may lead to > > significant idle state selection mistakes. Moreover, this is not really > > unlikely to happen. > > > > One issue with get_typical_interval() is that, when it eliminates outliers from > > the sample set in an attempt to reduce the standard deviation (and so improve > > the prediction confidence), it does that by dropping high-end samples only, > > while samples at the low end of the set are retained. However, the samples > > at the low end very well may be the outliers and they should be eliminated > > from the sample set instead of the high-end samples. Accordingly, the > > likelihood of making a meaningful idle duration prediction can be improved > > by making it also eliminate low-end samples if they are farther from the > > average than high-end samples. This is done in patch [4/5]. > > > > Another issue is that get_typical_interval() gives up after eliminating 1/4 > > of the samples if the standard deviation is still not as low as desired (within > > 1/6 of the average or within 20 us if the average is close to 0), but the > > remaining samples in the set still represent useful information at that point > > and discarding them altogether may lead to suboptimal idle state selection. 
> > > > For instance, the largest idle duration value in the get_typical_interval() > > data set is the maximum idle duration observed recently and it is likely that > > the upcoming idle duration will not exceed it. Therefore, in the absence of > > a better choice, this value can be used as an upper bound on the target > > residency of the idle state to select. Patch [5/5] works along these lines, > > but it takes the maximum data point remaining after the elimination of > > outliers. > > > > The first two patches in the series are straightforward cleanups (in fact, > > the first patch is kind of reversed by patch [4/5], but it is there because > > it can be applied without the latter) and patch [3/5] is a cosmetic change > > made in preparation for the subsequent ones. > > > > This series turns out to restore the SPECjbb critical-jOPS metric on affected > > systems to the level from before commit 0611a640e60a and it also happens to > > increase its max-jOPS metric by around 3%. > > > > For easier reference/testing it is present in the git branch at > > > > https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=experimental/menu > > > > based on the cpuidle material that went into 6.14-rc1. > > > > If possible, please let me know if it works for you. > > > > Thanks! > > > > > > [1] Link: https://www.spec.org/jbb2015/ > > 5/5 shows significant IO workload improvements (the shorter wakeup scenario is > much more likely to be picked up now). > I don't see a significant regression in idle misses so far, I'll try Android > backports soon and some other system. Sounds good, thanks! > Here's a full dump, sorry it's from a different system (rk3588, only two idle > states), apparently eth networking is broken on 6.14-rc1 now on rk3399 :( > > For dm-delay 51ms (dm-slow) the command is (8 CPUs) > fio --minimal --time_based --group_reporting --name=fiotest --filename=/dev/mapper/dm-slow --runtime=30s --numjobs=16 --rw=randread --bs=4k --ioengine=psync --iodepth=1 --direct=1 > For the rest: > fio --minimal --time_based --name=fiotest --filename=/dev/mmcblk1 --runtime=30s --rw=randread --bs=4k --ioengine=psync --iodepth=1 --direct=1 Thanks for the data! Do I understand correctly that menu-X is menu with patches [1-X/5] applied? And what's menu-m? 
On 2/10/25 14:43, Rafael J. Wysocki wrote: > On Mon, Feb 10, 2025 at 3:15 PM Christian Loehle > <christian.loehle@arm.com> wrote: >> >> On 2/6/25 14:21, Rafael J. Wysocki wrote: >>> Hi Everyone, >>> >>> This work had been triggered by a report that commit 0611a640e60a ("eventpoll: >>> prefer kfree_rcu() in __ep_remove()") had caused the critical-jOPS metric of >>> the SPECjbb 2015 benchmark [1] to drop by around 50% even though it generally >>> reduced kernel overhead. Indeed, it was found during further investigation >>> that the total interrupt rate while running the SPECjbb workload had fallen as >>> a result of that commit by 55% and the local timer interrupt rate had fallen by >>> almost 80%. >>> >>> That turned out to cause the menu cpuidle governor to select the deepest idle >>> state supplied by the cpuidle driver (intel_idle) much more often which added >>> significantly more idle state latency to the workload and that led to the >>> decrease of the critical-jOPS score. >>> >>> Interestingly enough, this problem was not visible when the teo cpuidle >>> governor was used instead of menu, so it appeared to be specific to the >>> latter. CPU wakeup event statistics collected while running the workload >>> indicated that the menu governor was effectively ignoring non-timer wakeup >>> information and all of its idle state selection decisions appeared to be >>> based on timer wakeups only. Thus, it appeared that the reduction of the >>> local timer interrupt rate caused the governor to predict a idle duration >>> much more often while running the workload and the deepest idle state was >>> selected significantly more often as a result of that. >>> >>> A subsequent inspection of the get_typical_interval() function in the menu >>> governor indicated that it might return UINT_MAX too often which then caused >>> the governor's decisions to be based entirely on information related to timers. >>> >>> Generally speaking, UINT_MAX is returned by get_typical_interval() if it >>> cannot make a prediction based on the most recent idle intervals data with >>> sufficiently high confidence, but at least in some cases this means that >>> useful information is not taken into account at all which may lead to >>> significant idle state selection mistakes. Moreover, this is not really >>> unlikely to happen. >>> >>> One issue with get_typical_interval() is that, when it eliminates outliers from >>> the sample set in an attempt to reduce the standard deviation (and so improve >>> the prediction confidence), it does that by dropping high-end samples only, >>> while samples at the low end of the set are retained. However, the samples >>> at the low end very well may be the outliers and they should be eliminated >>> from the sample set instead of the high-end samples. Accordingly, the >>> likelihood of making a meaningful idle duration prediction can be improved >>> by making it also eliminate low-end samples if they are farther from the >>> average than high-end samples. This is done in patch [4/5]. >>> >>> Another issue is that get_typical_interval() gives up after eliminating 1/4 >>> of the samples if the standard deviation is still not as low as desired (within >>> 1/6 of the average or within 20 us if the average is close to 0), but the >>> remaining samples in the set still represent useful information at that point >>> and discarding them altogether may lead to suboptimal idle state selection. 
>>> >>> For instance, the largest idle duration value in the get_typical_interval() >>> data set is the maximum idle duration observed recently and it is likely that >>> the upcoming idle duration will not exceed it. Therefore, in the absence of >>> a better choice, this value can be used as an upper bound on the target >>> residency of the idle state to select. Patch [5/5] works along these lines, >>> but it takes the maximum data point remaining after the elimination of >>> outliers. >>> >>> The first two patches in the series are straightforward cleanups (in fact, >>> the first patch is kind of reversed by patch [4/5], but it is there because >>> it can be applied without the latter) and patch [3/5] is a cosmetic change >>> made in preparation for the subsequent ones. >>> >>> This series turns out to restore the SPECjbb critical-jOPS metric on affected >>> systems to the level from before commit 0611a640e60a and it also happens to >>> increase its max-jOPS metric by around 3%. >>> >>> For easier reference/testing it is present in the git branch at >>> >>> https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=experimental/menu >>> >>> based on the cpuidle material that went into 6.14-rc1. >>> >>> If possible, please let me know if it works for you. >>> >>> Thanks! >>> >>> >>> [1] Link: https://www.spec.org/jbb2015/ >> >> 5/5 shows significant IO workload improvements (the shorter wakeup scenario is >> much more likely to be picked up now). >> I don't see a significant regression in idle misses so far, I'll try Android >> backports soon and some other system. > > Sounds good, thanks! > >> Here's a full dump, sorry it's from a different system (rk3588, only two idle >> states), apparently eth networking is broken on 6.14-rc1 now on rk3399 :( >> >> For dm-delay 51ms (dm-slow) the command is (8 CPUs) >> fio --minimal --time_based --group_reporting --name=fiotest --filename=/dev/mapper/dm-slow --runtime=30s --numjobs=16 --rw=randread --bs=4k --ioengine=psync --iodepth=1 --direct=1 >> For the rest: >> fio --minimal --time_based --name=fiotest --filename=/dev/mmcblk1 --runtime=30s --rw=randread --bs=4k --ioengine=psync --iodepth=1 --direct=1 > > Thanks for the data! > > Do I understand correctly that menu-X is menu with patches [1-X/5] > applied? And what's menu-m? Correct about -1 to -5. -m is just mainline. I changed that over time, it's just equivalent to menu now. I'll just run that twice in the future (to be able to check for side-effects like thermals because they all run one after the other).
On 2025.02.06 06:22 Rafael J. Wysocki wrote:
> Hi Everyone,

Hi Rafael,

> This work had been triggered by a report that commit 0611a640e60a ("eventpoll:
> prefer kfree_rcu() in __ep_remove()") had caused the critical-jOPS metric of

... deleted ...

This is a long email. It contains test results for several recent idle
governor patches:

cpuidle: teo: Cleanups and very frequent wakeups handling update
cpuidle: teo: Avoid selecting deepest idle state over-eagerly
  (Testing aborted, after the patch was dropped.)
cpuidle: menu: Avoid discarding useful information when processing recent idle intervals

Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
Distro: Ubuntu 24.04.1, server, no desktop GUI.
CPU frequency scaling driver: intel_pstate
HWP: disabled.
CPU frequency scaling governor: performance
Idle driver: intel_idle
Idle governor: as per individual test
Idle states: 4: name : description:
state0/name:POLL desc:CPUIDLE CORE POLL IDLE
state1/name:C1_ACPI desc:ACPI FFH MWAIT 0x0
state2/name:C2_ACPI desc:ACPI FFH MWAIT 0x30
state3/name:C3_ACPI desc:ACPI FFH MWAIT 0x60

Legend:
teo-613: teo governor - Kernel 6.13: before "cpuidle: teo: Cleanups and very frequent wakeups handling update"
menu-613: menu governor - Kernel 6.13: before "cpuidle: teo: Cleanups and very frequent wakeups handling update"
teo-614: teo governor - Kernel 6.14-rc1: Includes cpuidle: teo: Cleanups and very frequent wakeups handling update
menu-614: menu governor - Kernel 6.14-rc1: Includes cpuidle: teo: Cleanups and very frequent wakeups handling update
teo-614-p: teo governor - Kernel 6.14-rc1-p: Includes "cpuidle: teo: Avoid selecting deepest idle state over-eagerly"
menu-614-p: menu governor - Kernel 6.14-rc1-p: Includes "cpuidle: menu: Avoid discarding useful information when processing recent idle intervals"

I do a set of tests adopted over some years now. Readers may recall that some
of the tests search over a wide range of operating conditions looking for
areas to focus on in more detail.

One interesting observation is that everything seems to run slower than the
last time I did this, June 2024, Kernel 6.10-rc2, which was also slower than
the time before that, August 2023, Kernel 6.5-rc4.

There are some repeatability issues with the tests.

I was unable to get the "cpuidle: teo: Cleanups and very frequent wakeups
handling update" patch set to apply to kernel 6.13, and so just used kernel
6.14-rc1, but that means that all the other commits between the kernel
versions are included. This could cast doubt on the test results, and indeed
some differences in test results are observed with the menu idle governor,
which did not change.

Test 1: System Idle

Purpose: Basic starting point test. To observe and check an idle system for
excessive power consumption.

teo-613:    1.752 watts (reference: 0.0%)
menu-613:   1.909 watts (+9.0%)
teo-614:    2.199 watts (+25.51%) <<< Test flawed. Needs to be redone. Will be less.
teo-614-2:  2.112 watts (+17.05%) <<< Re-test of teo-614. (don't care about 0.4 watts)
menu-614:   1.873 watts (+6.91%)
teo-614-p:  9.401 watts (+436.6%) <<< Very bad regression.
menu-614-p: 1.820 watts (+3.9%)

Further details:
http://smythies.com/~doug/linux/idle/teo-6.14/idle/perf/

Test 2: 2 core ping pong sweep:

Pass a token between 2 CPUs on 2 different cores. Do a variable amount of
work at each stop. NOT a timer based test.

Purpose: To utilize the shallowest idle states and observe the transition
from using more of 1 idle state to another.
Results relative to teo-613 (negative is better):

         menu-613  teo-614  menu-614  menu-614-p
average    -2.06%   -0.32%    -2.33%      -2.52%
max         9.42%   12.72%     8.29%       8.55%
min       -10.36%   -3.82%   -11.89%     -12.13%

No significant issues here. There are differences in idle state preferences.

Standard "fast" dwell test:
teo-613:    average 3.826 uSec/loop  reference
menu-613:   average 4.159  +8.70%
teo-614:    average 3.751  -1.94%
menu-614:   average 4.076  +6.54%
menu-614-p: average 4.178  +9.21%

Interestingly, teo-614 also uses a little less power.
Note that there is an offsetting region for the menu governor where it
performs better than teo, but it was not extracted and done as a dwell test.

Standard "medium" dwell test:
teo-613:    12.241 average uSec/loop  reference
menu-613:   12.251 average  +0.08%
teo-614:    12.121 average  -0.98%
menu-614:   12.123 average  -0.96%
menu-614-p: 12.236 average  -0.04%

Standard "slow" dwell test: Not done.

Further details:
http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-2/loop-times.png
http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-2/loop-times-relative.png
http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-2/perf/
http://smythies.com/~doug/linux/idle/teo-6.14/many-0-400000000-2/perf/
http://smythies.com/~doug/linux/idle/teo-6.14/many-3000-100000000-2/

Test 3: 6 core ping pong sweep:

Pass a token between 6 CPUs on 6 different cores. Do a variable amount of
work at each stop. NOT a timer based test.

Purpose: To utilize the midrange idle states and observe the transitions
between use of idle states.

Note: This test has uncertainty in an area where the performance is bi-stable
for all idle governors, transitioning between much less power and slower
performance and much more power and higher performance. On either side of
this area, the differences between all idle governors are small. Only data
from before this area (from results 1 to 95) was included in the below
results.

Results relative to teo-613 (negative is better):

         teo-614  menu-613  menu-614  menu-614-p
average    1.60%     0.18%     0.02%       0.02%
max        5.91%     0.97%     1.12%       0.85%
min       -1.79%    -1.11%    -1.88%      -1.52%

A further dwell test was done in the area where teo-614 performed worse.
There was a slight regression in both performance and power:

teo-613: average 21.34068 uSec per loop
teo-614: average 20.55809 uSec per loop  3.67% regression
teo-613: average 37.17577 watts.
teo-614: average 38.06375 watts.  2.3% regression.

Further details:
http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/loop-times.png
http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/loop-times-detail-a.png
http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/loop-times-detail-b.png
http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/perf/
http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/dwell/perf/

Test 4: 12 CPU ping pong sweep:

Pass a token between all 12 CPUs. Do a variable amount of work at each stop.
NOT a timer based test.

Purpose: To utilize the deeper idle states and observe the transitions
between use of idle states. This test was added last time at the request of
Christian Loehle.

Note: This test has uncertainty in an area where the performance is bi-stable
for all idle governors, transitioning between much less power and slower
performance and much more power and higher performance. On either side of
this area, the differences between all idle governors are small.
Only data from before this area (from results 1 to 60) was included in the
below results:

Results relative to teo-613 (negative is better):

      teo-614  menu-613  menu-614  teo-614-p  menu-614-p
ave     1.73%     0.97%     1.29%      1.70%       0.43%
max    16.79%     3.52%     3.95%     17.48%       4.98%
min    -0.35%    -0.35%    -0.18%     -0.40%      -0.54%

Only data from after the uncertainty area (from results 170-300) was included
in the below results:

      teo-614  menu-613  menu-614  teo-614-p  menu-614-p
ave     1.65%     0.04%     0.98%     -0.56%       0.73%
max     5.04%     2.10%     4.58%      2.44%       3.82%
min     0.00%    -1.89%    -1.17%     -1.95%      -1.38%

A further dwell test was done in the area where teo-614 performed worse and
there is a 15.74% throughput regression for teo-614 and a 5.4% regression in
power.

Further details:
http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/sweep-loop-times.png
http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/sweep-loop-times-detail-a.png
http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/sweep-relative-times.png
http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/perf/
http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/dwell/times.txt
http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/dwell/perf/

Test 5: sleeping ebizzy - 128 threads.

Purpose: This test has given interesting results in the past. The test varies
the sleep interval between record lookups. The result is varying usage of
idle states.

Results: Nothing significant to report just from the performance data.
However, there does seem to be power differences worth considering.

A further dwell test was done on a cherry-picked spot. It is important to
note that teo-614 removed a sawtooth performance pattern that was present
with teo-613, i.e. it is more stable. See:
http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/dwell/rps-only-teo.png

Further details:
http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/interval-sweep.png
http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/relative-performance.png
http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/perf/
http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/dwell/rps.png
http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/dwell/rps-relative.png
http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/dwell/perf/

Test 6: adrestia wakeup latency tests. 500 threads.

Purpose: The test was reported in 2023.09 by the kernel test robot and looked
both interesting and gave interesting results, so I added it to the tests I
run.

Results:
teo-613.txt:wakeup cost (periodic, 20us): 3331nSec  reference
teo-614.txt:wakeup cost (periodic, 20us): 3375nSec  +1.32%
menu-613.txt:wakeup cost (periodic, 20us): 3207nSec  -3.72%
menu-614.txt:wakeup cost (periodic, 20us): 3315nSec  -0.48%
menu-614-p.txt:wakeup cost (periodic, 20us): 3353nSec  +0.66%

Further details:
http://smythies.com/~doug/linux/idle/teo-6.14/adrestia/periodic/histogram.png
http://smythies.com/~doug/linux/idle/teo-6.14/adrestia/periodic/histogram-detail-a.png
http://smythies.com/~doug/linux/idle/teo-6.14/adrestia/periodic/perf/

Test 7: consume: periodic workflow. Various work/sleep frequencies and loads.

Purpose: To search for anomalies and hysteresis over all possible workloads
at various work/sleep frequencies. work/sleep frequencies tested: 73, 113,
211, 347, and 401 Hertz. IS a timer based test.

NOTE: Repeatability issues. More work needed. Tests show instability with
teo-614, but a re-test was much less unstable and better power.
Idle statistics were collected for the re-test and do show teo-614 overly
favoring idle state 1, with "Idle state 1 was too shallow" of 70% versus 15%
for teo-613.

Further details:
http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf73/
http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf113/
http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf211/
http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf347/
http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf401/
http://smythies.com/~doug/linux/idle/teo-6.14/consume/test/
http://smythies.com/~doug/linux/idle/teo-6.14/consume/test-idle/

Test 8: shell-intensive serialized workloads.

Variable: PIDs per second, amount of work each task does.
Note: Single threaded.

Dountil the list of tasks is finished:
   Start the next task in the list of stuff to do (with a new PID).
   Wait for it to finish
Enduntil

This workflow represents a challenge for CPU frequency scaling drivers,
schedulers, and therefore idle drivers. Also, the best performance is
achieved by overriding the scheduler and forcing CPU affinity. This "best"
case is the master reference, requiring additional legend definitions:

1cpu-613: Kernel 6.13, execution forced onto CPU 3.
1cpu-614: Kernel 6.14-rc1, execution forced onto CPU 3.

Ideally the two 1cpu graphs would be identical, but they are not, likely due
to other changes between the two kernels.

Results: teo-614 is absolutely outstanding in this test. Considerably better
than any previous result over many years.

Further details:
http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/times.png
http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/times-log.png
http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/relative.png
http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/relative-log.png

Test 9: Many threads, periodic workflow

500 threads of do a little work and then sleep a little. IS a timer based
test.

Results:
Kernel 6.13 teo:     reference
Kernel 6.13 menu:    -0.06%
Kernel 6.14 teo:     -0.09%
Kernel 6.14 menu:    +0.49%
Kernel 6.14+p menu:  +0.33%

What is interesting is the significant differences in idle state selection.
Power numbers might be interesting, but much longer tests would be needed to
achieve thermal equilibrium.
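The ping pong sweep tests above (Tests 2, 3 and 4) pass a token between
pinned CPUs with an adjustable busy loop at each stop. For readers who want
to try something similar, here is a minimal sketch of such a loop under
stated assumptions (pipes carry the token, sched_setaffinity() pins the
processes, and the CPU numbers and loop counts are arbitrary placeholders);
it is not the actual test program used for the results above:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static void pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (sched_setaffinity(0, sizeof(set), &set))
		perror("sched_setaffinity");
}

static void do_work(unsigned long loops)
{
	volatile unsigned long x = 0;

	while (loops--)
		x++;
}

int main(void)
{
	int a2b[2], b2a[2];
	char token = 't';
	unsigned long i, loops = 10000, work = 1000;

	if (pipe(a2b) || pipe(b2a))
		return 1;

	if (fork() == 0) {			/* side B, pinned to another core */
		pin_to_cpu(2);
		for (i = 0; i < loops; i++) {
			read(a2b[0], &token, 1);	/* wait for the token */
			do_work(work);			/* variable work at this stop */
			write(b2a[1], &token, 1);	/* pass it back */
		}
		return 0;
	}

	pin_to_cpu(0);				/* side A */
	for (i = 0; i < loops; i++) {
		write(a2b[1], &token, 1);
		do_work(work);
		read(b2a[0], &token, 1);
	}
	return 0;
}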
Dountil the list of tasks is finished: Start the next task in the list of stuff to do (with a new PID). Wait for it to finish Enduntil This workflow represents a challenge for CPU frequency scaling drivers, schedulers, and therefore idle drivers. Also, the best performance is achieved by overriding the scheduler and forcing CPU affinity. This "best" case is the master reference, requiring additional legend definitions: 1cpu-613: Kernel 6.13, execution forced onto CPU 3. 1cpu-614: Kernel 6.14-rc1, execution forced onto CPU 3. Ideally the two 1cpu graphs would be identical, but they are not, likely due to other changes between the two kernels. Results: teo-614 is absolutely outstanding in this test. Considerably better than any previous result over many years. Further details: http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/times.png http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/times-log.png http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/relative.png http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/relative-log.png Test 9: Many threads, periodic workflow 500 threads of do a little work and then sleep a little. IS a timer based test. Results: Kernel 6.13 teo: reference Kernel 6.13 menu: -0.06% Kernel 6.14 teo: -0.09% Kernel 6.14 menu: +0.49% Kernel 6.14+p menu: +0.33% What is interesting is the significant differences in idle state selection. Powers might be interesting, but much longer tests would be needed to achieve thermal equilibrium.
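A note on the shape of the Test 8 workload: the "Dountil ... Enduntil" pseudocode above is a serialized spawn-and-wait loop in which every task gets a new PID and the parent sits idle while waiting for it. A minimal C sketch of that kind of loop (the launched command and the task count are placeholders, not the actual test script) could look like this:

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        const int ntasks = 10000;               /* placeholder task count */

        for (int i = 0; i < ntasks; i++) {
                pid_t pid = fork();             /* every task gets a new PID */

                if (pid < 0) {
                        perror("fork");
                        return 1;
                }
                if (pid == 0) {
                        /* Child: run one short task, then exit. */
                        execlp("true", "true", (char *)NULL);
                        _exit(127);             /* only reached if exec fails */
                }
                /* Parent: wait for this task before starting the next one. */
                if (waitpid(pid, NULL, 0) < 0) {
                        perror("waitpid");
                        return 1;
                }
        }
        return 0;
}

Because the parent blocks in waitpid() between tasks, every fork/exec/exit cycle leaves the CPU briefly idle, which is why the achievable PIDs-per-second rate is sensitive to frequency scaling and idle state selection.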
Hi Doug, On Fri, Feb 14, 2025 at 5:30 AM Doug Smythies <dsmythies@telus.net> wrote: > > On 2025.02.06 06:22 Rafael J. Wysocki wrote: > > > Hi Everyone, > > Hi Rafael, > > > > > This work had been triggered by a report that commit 0611a640e60a ("eventpoll: > > prefer kfree_rcu() in __ep_remove()") had caused the critical-jOPS metric of > ... deleted ... > > This is a long email. It contains test results for several recent idle governor patches: Thanks a lot for this data, it's really helpful! > cpuidle: teo: Cleanups and very frequent wakeups handling update > cpuidle: teo: Avoid selecting deepest idle state over-eagerly (Testing aborted, after the patch was dropped.) > cpuidle: menu: Avoid discarding useful information when processing recent idle intervals > > Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz > Distro: Ubuntu 24.04.1, server, no desktop GUI. > > CPU frequency scaling driver: intel_pstate > HWP: disabled. > CPU frequency scaling governor: performance > > Ilde driver: intel_idle > Idle governor: as per individual test > Idle states: 4: name : description: > state0/name:POLL desc:CPUIDLE CORE POLL IDLE > state1/name:C1_ACPI desc:ACPI FFH MWAIT 0x0 > state2/name:C2_ACPI desc:ACPI FFH MWAIT 0x30 > state3/name:C3_ACPI desc:ACPI FFH MWAIT 0x60 > > Legend: > teo-613: teo governor - Kernel 6.13: before "cpuidle: teo: Cleanups and very frequent wakeups handling update" > menu-613: menu governor - Kernel 6.13: before "cpuidle: teo: Cleanups and very frequent wakeups handling update" > teo-614: teo governor - Kernel 6.14-rc1: Includes cpuidle: teo: Cleanups and very frequent wakeups handling update > menu-614: menu governor - Kernel 6.14-rc1: Includes cpuidle: teo: Cleanups and very frequent wakeups handling update > teo-614-p: teo governor - Kernel 6.14-rc1-p: Includes "cpuidle: teo: Avoid selecting deepest idle state over-eagerly" > menu-614-p: menu governor - Kernel 6.14-rc1-p: Includes "cpuidle: menu: Avoid discarding useful information when processing recent idle intervals" > > I do a set of tests adopted over some years now. > Readers may recall that some of the tests search over a wide range of operating conditions looking for areas to focus on in more detail. > One interesting observation is that everything seems to run slower than the last time I did this, June 2024, Kernel 6.10-rc2, > which was also slower than the time before that, August 2023, Kernel 6.5-rc4. > There are some repatabilty issues with the tests. > > I was unable to get the "cpuidle: teo: Cleanups and very frequent wakeups handling update" patch set to apply to kernel 6.13, and so just used kernel 6.14-rc1, but that means that all the other commits > between the kernel versions are included. This could cast doubt on the test results, and indeed some differences in test results are observed with the menu idle governor, which did not change. > > Test 1: System Idle > > Purpose: Basic starting point test. To observee and check an idle system for excessive power consumption. > > teo-613: 1.752 watts (reference: 0.0%) > menu-613: 1.909 watts (+9.0%) > teo-614: 2.199 watts (+25.51%) <<< Test flawed. Needs to be redone. Will be less. > teo-614-2: 2.112 watts (+17.05%) <<< Re-test of teo-614. (don't care about 0.4 watts) > menu-614: 1.873 watts (+6.91%) > teo-614-p: 9.401 watts (+436.6%) <<< Very bad regression. Already noted. Since I've decided to withdraw this patch, I will not talk about it below. > menu-614-p: 1.820 watts (+3.9%) And this is an improvement worth noting. 
Generally speaking, I'm mostly interested in the differences between teo-613 and teo-614 and between menu-6.14 and menu-614-p. > Further details: > http://smythies.com/~doug/linux/idle/teo-6.14/idle/perf/ > > Test 2: 2 core ping pong sweep: > > Pass a token between 2 CPUs on 2 different cores. > Do a variable amount of work at each stop. > NOT a timer based test. > > Purpose: To utilize the shallowest idle states > and observe the transition from using more of 1 > idle state to another. > > Results relative to teo-613 (negative is better): > menu-613 teo-614 menu-614 menu-614-p > average -2.06% -0.32% -2.33% -2.52% > max 9.42% 12.72% 8.29% 8.55% > min -10.36% -3.82% -11.89% -12.13% > > No significant issues here. There are differences on idle state preferences. > > Standard "fast" dwell test: > > teo-613: average 3.826 uSec/loop reference > menu-613: average 4.159 +8.70% > teo-614: average 3.751 -1.94% A small improvement. > menu-614: average 4.076 +6.54% > menu-614-p: average 4.178 +9.21% > > Intrestingly, teo-614 also uses a little less power. > Note that there is an offsetting region for the menu governor where it performs better > than teo, but it was not extracted and done as a dwell test. > > Standard "medium dwell test: > > teo-613: 12.241 average uSec/loop reference > menu-613: 12.251 average +0.08% > teo-614: 12.121 average -0.98% Similarly here, but smaller. > menu-614: 12.123 average -0.96% > menu-614-p: 12.236 average -0.04% > > Standard "slow" dwell test: Not done. > > Further details: > http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-2/loop-times.png > http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-2/loop-times-relative.png > http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-2/perf/ > http://smythies.com/~doug/linux/idle/teo-6.14/many-0-400000000-2/perf/ > http://smythies.com/~doug/linux/idle/teo-6.14/many-3000-100000000-2/ > > Test 3: 6 core ping pong sweep: > > Pass a token between 6 CPUs on 6 different cores. > Do a variable amount of work at each stop. > NOT a timer based test. > > Purpose: To utilize the midrange idle states > and observe the transitions between use of > idle states. > > Note: This test has uncertainty in an area where the performance is bi-stable for all idle governors, > transitioning between much less power and slower performance and much more power and higher performance. > On either side of this area, the differences between all idle governors are small. > Only data from before this area (from results 1 to 95) was included in the below results. > > Results relative to teo-613 (negative is better): > teo-614 menu-613 menu-614 menu-614-p > average 1.60% 0.18% 0.02% 0.02% > max 5.91% 0.97% 1.12% 0.85% > min -1.79% -1.11% -1.88% -1.52% > > A further dwell test was done in the area where teo-614 performed worse. > There was a slight regression in both performance and power: > > teo-613: average 21.34068 uSec per loop > teo-614: average 20.55809 usec per loop 3.67% regression As this is usec per loop, I'd think that smaller would be better? > teo-613: average 37.17577 watts. > teo-614: average 38.06375 watts. 2.3% regression. Which would be consistent with this. 
> Further details: > http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/loop-times.png > http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/loop-times-detail-a.png > http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/loop-times-detail-b.png > http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/perf/ > http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/dwell/perf/ > > Test 4: 12 CPU ping pong sweep: > > Pass a token between all 12 CPUs. > Do a variable amount of work at each stop. > NOT a timer based test. > > Purpose: To utilize the deeper idle states > and observe the transitions between use of > idle states. > > This test was added last time at the request of Christian Loehle. > > Note: This test has uncertainty in an area where the performance is bi-stable for all idle governors, > transitioning between much less power and slower performance and much more power and higher performance. > On either side of this area, the differences between all idle governors are small. > > Only data from before this area (from results 1 to 60) was included in the below results: > > Results relative to teo-613 (negative is better): > teo-614 menu-613 menu-614 teo-614-p menu-614-p > ave 1.73% 0.97% 1.29% 1.70% 0.43% > max 16.79% 3.52% 3.95% 17.48% 4.98% > min -0.35% -0.35% -0.18% -0.40% -0.54% > > Only data from after the uncertainty area (from results 170-300) was included in the below results: > > teo-614 menu-613 menu-614 teo-614-p menu-614-p > ave 1.65% 0.04% 0.98% -0.56% 0.73% > max 5.04% 2.10% 4.58% 2.44% 3.82% > min 0.00% -1.89% -1.17% -1.95% -1.38% > > A further dwell test was done in the area where teo-614 performed worse and there is a 15.74% > throughput regression for teo-614 and a 5.4% regression in power. > > Further details: > http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/sweep-loop-times.png > http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/sweep-loop-times-detail-a.png > http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/sweep-relative-times.png > http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/perf/ > http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/dwell/times.txt > http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/dwell/perf/ > > Test 5: sleeping ebizzy - 128 threads. > > Purpose: This test has given interesting results in the past. > The test varies the sleep interval between record lookups. > The result is varying usage of idle states. > > Results: Nothing significant to report just from the performance data. > However, there does seem to be power differences worth considering. > > A futher dwell test was done in a cherry picked spot. > It it is important to note that teo-614 removed a sawtooth performance > pattern that was present with teo-613. I.E. it is more stable. See: > http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/dwell/rps-only-teo.png > > Further details: > http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/interval-sweep.png > http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/relative-performance.png > http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/perf/ > http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/dwell/rps.png > http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/dwell/rps-relative.png > http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/dwell/perf/ > > Test 6: adrestia wakeup latency tests. 500 threads. 
> > Purpose: The test was reported in 2023.09 by the kernel test robot and looked > both interesting and gave interesting results, so I added it to the tests I run. > > Results: > teo-613.txt:wakeup cost (periodic, 20us): 3331nSec reference > teo-614.txt:wakeup cost (periodic, 20us): 3375nSec +1.32% > menu-613.txt:wakeup cost (periodic, 20us): 3207nSec -3.72% > menu-614.txt:wakeup cost (periodic, 20us): 3315nSec -0.48% > menu-614-p.txt:wakeup cost (periodic, 20us): 3353nSec +0.66% > > Further details: > http://smythies.com/~doug/linux/idle/teo-6.14/adrestia/periodic/histogram.png > http://smythies.com/~doug/linux/idle/teo-6.14/adrestia/periodic/histogram-detail-a.png > http://smythies.com/~doug/linux/idle/teo-6.14/adrestia/periodic/perf/ > > > Test 7: consume: periodic workflow. Various work/sleep frequencies and loads. > > Purpose: To search for anomalies and hysteresis over all possible workloads at various work/sleep frequencies. > work/sleep frequencies tested: 73, 113, 211, 347, and 401 Hertz. > IS a timer based test. > > NOTE: Repeatability issues. More work needed. > > Tests show instability with teo-614, but a re-test was much less unstable and better power. > Idle statistics were collected for the re-test and does show teo-614 overly favoring idle state 1, with > "Idle state 1 was too shallow" of 70% verses 15% for teo-613. > > Further details: > http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf73/ > http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf113/ > http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf211/ > http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf347/ > http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf401/ > http://smythies.com/~doug/linux/idle/teo-6.14/consume/test/ > http://smythies.com/~doug/linux/idle/teo-6.14/consume/test-idle/ > > Test 8: shell-intensive serialized workloads. > > Variable: PIDs per second, amount of work each task does. > Note: Single threaded. > > Dountil the list of tasks is finished: > Start the next task in the list of stuff to do (with a new PID). > Wait for it to finish > Enduntil > > This workflow represents a challenge for CPU frequency scaling drivers, > schedulers, and therefore idle drivers. > > Also, the best performance is achieved by overriding > the scheduler and forcing CPU affinity. This "best" case is the > master reference, requiring additional legend definitions: > 1cpu-613: Kernel 6.13, execution forced onto CPU 3. > 1cpu-614: Kernel 6.14-rc1, execution forced onto CPU 3. > > Ideally the two 1cpu graphs would be identical, but they are not, > likely due to other changes betwwen the two kernels. > > Results: > teo-614 is abaolutely outstanding in this test. > Considerably better than any previous result over many years. Sounds good! > Further details: > http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/times.png > http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/times-log.png > http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/relative.png > http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/relative-log.png > > Test 9: Many threads, periodic workflow > > 500 threads of do a little work and then sleep a little. > IS a timer based test. > > Results: > Kernel 6.13 teo: reference > Kernel 6.13 menu: -0.06% > Kernel 6.14 teo: -0.09% > Kernel 6.14 menu: +0.49% > Kernel 6.14+p menu: +0.33% > > What is interesting is the significant differences in idle state selection. 
> Powers might be interesting, but much longer tests would be needed to acheive thermal equalibrium. > > doug@s19:~/idle/teo/6.14$ nano README.txt > doug@s19:~/idle/teo/6.14$ rsync --archive --delete --verbose ./ doug@s15.smythies.com:/home/doug/public_html/linux/idle/teo-6.14 > doug@s15.smythies.com's password: > sending incremental file list > ./ > README.txt > idle/ > idle/teo-614-2.xlsx > > sent 61,869 bytes received 214 bytes 13,796.22 bytes/sec > total size is 20,642,833 speedup is 332.50 > doug@s19:~/idle/teo/6.14$ uname -a > Linux s19 6.14.0-rc1-stock #1339 SMP PREEMPT_DYNAMIC Sun Feb 2 16:45:39 PST 2025 x86_64 x86_64 x86_64 GNU/Linux > doug@s19:~/idle/teo/6.14$ uname -a > Linux s19 6.14.0-rc1-stock #1339 SMP PREEMPT_DYNAMIC Sun Feb 2 16:45:39 PST 2025 x86_64 x86_64 x86_64 GNU/Linux > doug@s19:~/idle/teo/6.14$ > doug@s19:~/idle/teo/6.14$ > doug@s19:~/idle/teo/6.14$ > doug@s19:~/idle/teo/6.14$ cat READEME.txt > cat: READEME.txt: No such file or directory > doug@s19:~/idle/teo/6.14$ cat README.txt > 2025.02.13 Notes on this round of idle governors testing: > > Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz > Distro: Ubuntu 24.04.1, server, no desktop GUI. > > CPU frequency scaling driver: intel_pstate > HWP: disabled. > CPU frequency scaling governor: performance What's the difference between this configuration and the one above? > Ilde driver: intel_idle > Idle governor: as per individual test > Idle states: 4: name : description: > state0/name:POLL desc:CPUIDLE CORE POLL IDLE > state1/name:C1_ACPI desc:ACPI FFH MWAIT 0x0 > state2/name:C2_ACPI desc:ACPI FFH MWAIT 0x30 > state3/name:C3_ACPI desc:ACPI FFH MWAIT 0x60 > > Legend: > teo-613: teo governor - Kernel 6.13: before "cpuidle: teo: Cleanups and very frequent wakeups handling update" > menu-613: menu governor - Kernel 6.13: before "cpuidle: teo: Cleanups and very frequent wakeups handling update" > teo-614: teo governor - Kernel 6.14-rc1: Includes cpuidle: teo: Cleanups and very frequent wakeups handling update > menu-614: menu governor - Kernel 6.14-rc1: Includes cpuidle: teo: Cleanups and very frequent wakeups handling update > teo-614-p: teo governor - Kernel 6.14-rc1-p: Includes "cpuidle: teo: Avoid selecting deepest idle state over-eagerly" > menu-614-p: menu governor - Kernel 6.14-rc1-p: Includes "cpuidle: menu: Avoid discarding useful information when processing recent idle intervals" > > I do a set of tests adopted over some years now. > Readers may recall that some of the tests search over a wide range of operating conditions looking for areas to focus on in more detail. > One interesting observation is that everything seems to run slower than the last time I did this, June 2024, Kernel 6.10-rc2, > which was also slower than the time before that, August 2023, Kernel 6.5-rc4. > There are some repeatability issues with the tests. > > I was unable to get the "cpuidle: teo: Cleanups and very frequent wakeups handling update" > patch set to apply to kernel 6.13, and so just used kernel 6.14-rc1, but that means that > all the other commits between the kernel versions are included. This could cast doubt on > the test results, and indeed some differences in test results are observed with the menu > idle governor, which did not change. > > Test 1: System Idle > > Purpose: Basic starting point test. To observe and check an idle system for excessive power consumption. > > teo-613: 1.752 watts (reference: 0.0%) > menu-613: 1.909 watts (+9.0%) > teo-614: 2.199 watts (+25.51%) <<< Test flawed. Needs to be redone. Will be less. 
> teo-614-2: 2.112 watts (+17.05%) <<< Re-test of teo-614. (don't care about 0.4 watts) > menu-614: 1.873 watts (+6.91%) > teo-614-p: 9.401 watts (+436.6%) <<< Very bad regression. > menu-614-p: 1.820 watts (+3.9%) > > Further details: > http://smythies.com/~doug/linux/idle/teo-6.14/idle/perf/ > > Test 2: 2 core ping pong sweep: > > Pass a token between 2 CPUs on 2 different cores. > Do a variable amount of work at each stop. > NOT a timer based test. > > Purpose: To utilize the shallowest idle states > and observe the transition from using more of 1 > idle state to another. > > Results relative to teo-613 (negative is better): > menu-613 teo-614 menu-614 menu-614-p > average -2.06% -0.32% -2.33% -2.52% > max 9.42% 12.72% 8.29% 8.55% > min -10.36% -3.82% -11.89% -12.13% > > No significant issues here. There are differences on idle state preferences. > > Standard "fast" dwell test: > > teo-613: average 3.826 uSec/loop reference > menu-613: average 4.159 +8.70% > teo-614: average 3.751 -1.94% > menu-614: average 4.076 +6.54% > menu-614-p: average 4.178 +9.21% > > Interestingly, teo-614 also uses a little less power. > Note that there is an offsetting region for the menu governor where it performs better > than teo, but it was not extracted and done as a dwell test. > > Standard "medium dwell test: > > teo-613: 12.241 average uSec/loop reference > menu-613: 12.251 average +0.08% > teo-614: 12.121 average -0.98% > menu-614: 12.123 average -0.96% > menu-614-p: 12.236 average -0.04% > > Standard "slow" dwell test: Not done. > > Further details: > http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-2/loop-times.png > http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-2/loop-times-relative.png > http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-2/perf/ > http://smythies.com/~doug/linux/idle/teo-6.14/many-0-400000000-2/perf/ > http://smythies.com/~doug/linux/idle/teo-6.14/many-3000-100000000-2/ > > Test 3: 6 core ping pong sweep: > > Pass a token between 6 CPUs on 6 different cores. > Do a variable amount of work at each stop. > NOT a timer based test. > > Purpose: To utilize the midrange idle states > and observe the transitions between use of > idle states. > > Note: This test has uncertainty in an area where the performance is bi-stable for all idle governors, > transitioning between much less power and slower performance and much more power and higher performance. > On either side of this area, the differences between all idle governors are small. > Only data from before this area (from results 1 to 95) was included in the below results. > > Results relative to teo-613 (negative is better): > teo-614 menu-613 menu-614 menu-614-p > average 1.60% 0.18% 0.02% 0.02% > max 5.91% 0.97% 1.12% 0.85% > min -1.79% -1.11% -1.88% -1.52% > > A further dwell test was done in the area where teo-614 performed worse. > There was a slight regression in both performance and power: > > teo-613: average 21.34068 uSec per loop > teo-614: average 20.55809 usec per loop 3.67% regression > > teo-613: average 37.17577 watts. > teo-614: average 38.06375 watts. 2.3% regression. 
> > Further details: > http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/loop-times.png > http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/loop-times-detail-a.png > http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/loop-times-detail-b.png > http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/perf/ > http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/dwell/perf/ > > Test 4: 12 CPU ping pong sweep: > > Pass a token between all 12 CPUs. > Do a variable amount of work at each stop. > NOT a timer based test. > > Purpose: To utilize the deeper idle states > and observe the transitions between use of > idle states. > > This test was added last time at the request of Christian Loehle. > > Note: This test has uncertainty in an area where the performance is bi-stable for all idle governors, > transitioning between much less power and slower performance and much more power and higher performance. > On either side of this area, the differences between all idle governors are small. > > Only data from before this area (from results 1 to 60) was included in the below results: > > Results relative to teo-613 (negative is better): > teo-614 menu-613 menu-614 teo-614-p menu-614-p > ave 1.73% 0.97% 1.29% 1.70% 0.43% > max 16.79% 3.52% 3.95% 17.48% 4.98% > min -0.35% -0.35% -0.18% -0.40% -0.54% > > Only data from after the uncertainty area (from results 170-300) was included in the below results: > > teo-614 menu-613 menu-614 teo-614-p menu-614-p > ave 1.65% 0.04% 0.98% -0.56% 0.73% > max 5.04% 2.10% 4.58% 2.44% 3.82% > min 0.00% -1.89% -1.17% -1.95% -1.38% > > A further dwell test was done in the area where teo-614 performed worse and there is a 15.74% > throughput regression for teo-614 and a 5.4% regression in power. > > Further details: > http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/sweep-loop-times.png > http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/sweep-loop-times-detail-a.png > http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/sweep-relative-times.png > http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/perf/ > http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/dwell/times.txt > http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/dwell/perf/ > > Test 5: sleeping ebizzy - 128 threads. > > Purpose: This test has given interesting results in the past. > The test varies the sleep interval between record lookups. > The result is varying usage of idle states. > > Results: Nothing significant to report just from the performance data. > However, there does seem to be power differences worth considering. > > A further dwell test was done on a cherry-picked spot. > It it is important to note that teo-614 removed a sawtooth performance > pattern that was present with teo-613. I.E. it is more stable. See: > http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/dwell/rps-only-teo.png > > Further details: > http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/interval-sweep.png > http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/relative-performance.png > http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/perf/ > http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/dwell/rps.png > http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/dwell/rps-relative.png > http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/dwell/perf/ > > Test 6: adrestia wakeup latency tests. 500 threads. 
> > Purpose: The test was reported in 2023.09 by the kernel test robot and looked > both interesting and gave interesting results, so I added it to the tests I run. > > Results: > teo-613.txt:wakeup cost (periodic, 20us): 3331nSec reference > teo-614.txt:wakeup cost (periodic, 20us): 3375nSec +1.32% > menu-613.txt:wakeup cost (periodic, 20us): 3207nSec -3.72% > menu-614.txt:wakeup cost (periodic, 20us): 3315nSec -0.48% > menu-614-p.txt:wakeup cost (periodic, 20us): 3353nSec +0.66% > > Further details: > http://smythies.com/~doug/linux/idle/teo-6.14/adrestia/periodic/histogram.png > http://smythies.com/~doug/linux/idle/teo-6.14/adrestia/periodic/histogram-detail-a.png > http://smythies.com/~doug/linux/idle/teo-6.14/adrestia/periodic/perf/ > > > Test 7: consume: periodic workflow. Various work/sleep frequencies and loads. > > Purpose: To search for anomalies and hysteresis over all possible workloads at various work/sleep frequencies. > work/sleep frequencies tested: 73, 113, 211, 347, and 401 Hertz. > IS a timer based test. > > NOTE: Repeatability issues. More work needed. > > Tests show instability with teo-614, but a re-test was much less unstable and better power. > Idle statistics were collected for the re-test and does show teo-614 overly favoring idle state 1, with > "Idle state 1 was too shallow" of 70% verses 15% for teo-613. > > Further details: > http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf73/ > http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf113/ > http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf211/ > http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf347/ > http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf401/ > http://smythies.com/~doug/linux/idle/teo-6.14/consume/test/ > http://smythies.com/~doug/linux/idle/teo-6.14/consume/test-idle/ > > Test 8: shell-intensive serialized workloads. > > Variable: PIDs per second, amount of work each task does. > Note: Single threaded. > > Dountil the list of tasks is finished: > Start the next task in the list of stuff to do (with a new PID). > Wait for it to finish > Enduntil > > This workflow represents a challenge for CPU frequency scaling drivers, > schedulers, and therefore idle drivers. > > Also, the best performance is achieved by overriding > the scheduler and forcing CPU affinity. This "best" case is the > master reference, requiring additional legend definitions: > 1cpu-613: Kernel 6.13, execution forced onto CPU 3. > 1cpu-614: Kernel 6.14-rc1, execution forced onto CPU 3. > > Ideally the two 1cpu graphs would be identical, but they are not, > likely due to other changes between the two kernels. > > Results: > teo-614 is absolutely outstanding in this test. > Considerably better than any previous result over many years. > > Further details: > http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/times.png > http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/times-log.png > http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/relative.png > http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/relative-log.png > > Test 9: Many threads, periodic workflow > > 500 threads of do a little work and then sleep a little. > IS a timer based test. > > Results: > Kernel 6.13 teo: reference > Kernel 6.13 menu: -0.06% > Kernel 6.14 teo: -0.09% > Kernel 6.14 menu: +0.49% > Kernel 6.14+p menu: +0.33% > > What is interesting is the significant differences in idle state selection. 
> Powers might be interesting, but much longer tests would be needed to achieve thermal equilibrium.

Overall, having seen these results, I'm not worried about the change from teo-613 to teo-614. The motivation for it was mostly code consistency and IMV the results indicate that it was worth doing.

Also, if I'm not mistaken, the differences between menu-6.14 and menu-6.14-p in the majority of your tests are relatively small (if not in the noise) which, given that the latter is a major improvement for the SPECjbb workload as reported by Artem, makes me think that I should queue up menu-614-p for 6.15.

Thanks!
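For reference, the timer-based workloads in Tests 7 and 9 quoted above are of the "do a little work, then sleep a little" form, so the wakeups are timer events. A minimal sketch of one such periodic thread (the period and the amount of busy work below are placeholders, not the actual test parameters) could be:

#define _GNU_SOURCE
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

static volatile uint64_t acc;           /* sink for the placeholder busy work */

static void do_work(long loops)
{
        for (long i = 0; i < loops; i++)
                acc += i;
}

int main(void)
{
        const long period_ns = NSEC_PER_SEC / 347;      /* one of the tested rates */
        struct timespec next;

        clock_gettime(CLOCK_MONOTONIC, &next);
        for (;;) {
                do_work(100000);                        /* placeholder load */

                /* Sleep until the next period boundary; the wakeup is a timer. */
                next.tv_nsec += period_ns;
                if (next.tv_nsec >= NSEC_PER_SEC) {
                        next.tv_nsec -= NSEC_PER_SEC;
                        next.tv_sec++;
                }
                clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
        }
}

Test 7 sweeps that work/sleep frequency across 73, 113, 211, 347 and 401 Hz at various loads; Test 9 runs 500 such threads concurrently.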
On 2025.02.14 14:10 Rafael J. Wysocki wrote: > On Fri, Feb 14, 2025 at 5:30 AM Doug Smythies <dsmythies@telus.net> wrote: >> On 2025.02.06 06:22 Rafael J. Wysocki wrote: >>> >>> This work had been triggered by a report that commit 0611a640e60a ("eventpoll: >>> prefer kfree_rcu() in __ep_remove()") had caused the critical-jOPS metric of >> ... deleted ... >> >> This is a long email. It contains test results for several recent idle governor patches: > > Thanks a lot for this data, it's really helpful! > >> cpuidle: teo: Cleanups and very frequent wakeups handling update >> cpuidle: teo: Avoid selecting deepest idle state over-eagerly (Testing aborted, after the patch was dropped.) >> cpuidle: menu: Avoid discarding useful information when processing recent idle intervals >> >> Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz >> Distro: Ubuntu 24.04.1, server, no desktop GUI. >> >> CPU frequency scaling driver: intel_pstate >> HWP: disabled. >> CPU frequency scaling governor: performance >> >> Ilde driver: intel_idle >> Idle governor: as per individual test >> Idle states: 4: name : description: >> state0/name:POLL desc:CPUIDLE CORE POLL IDLE >> state1/name:C1_ACPI desc:ACPI FFH MWAIT 0x0 >> state2/name:C2_ACPI desc:ACPI FFH MWAIT 0x30 >> state3/name:C3_ACPI desc:ACPI FFH MWAIT 0x60 >> >> Legend: >> teo-613: teo governor - Kernel 6.13: before "cpuidle: teo: Cleanups and very frequent wakeups handling update" >> menu-613: menu governor - Kernel 6.13: before "cpuidle: teo: Cleanups and very frequent wakeups handling update" >> teo-614: teo governor - Kernel 6.14-rc1: Includes cpuidle: teo: Cleanups and very frequent wakeups handling update >> menu-614: menu governor - Kernel 6.14-rc1: Includes cpuidle: teo: Cleanups and very frequent wakeups handling update >> teo-614-p: teo governor - Kernel 6.14-rc1-p: Includes "cpuidle: teo: Avoid selecting deepest idle state over-eagerly" >> menu-614-p: menu governor - Kernel 6.14-rc1-p: Includes "cpuidle: menu: Avoid discarding useful information when processing recent idle intervals" >> >> I do a set of tests adopted over some years now. >> Readers may recall that some of the tests search over a wide range of operating conditions looking for areas to focus on in more detail. >> One interesting observation is that everything seems to run slower than the last time I did this, June 2024, Kernel 6.10-rc2, >> which was also slower than the time before that, August 2023, Kernel 6.5-rc4. >> There are some repeatability issues with the tests. >> >> I was unable to get the "cpuidle: teo: Cleanups and very frequent wakeups handling update" patch set to apply to kernel 6.13, and so just used kernel 6.14-rc1, but that means that all the other commits >> between the kernel versions are included. This could cast doubt on the test results, and indeed some differences in test results are observed with the menu idle governor, which did not change. >> >> Test 1: System Idle >> >> Purpose: Basic starting point test. To observe and check an idle system for excessive power consumption. >> >> teo-613: 1.752 watts (reference: 0.0%) >> menu-613: 1.909 watts (+9.0%) >> teo-614: 2.199 watts (+25.51%) <<< Test flawed. Needs to be redone. Will be less. >> teo-614-2: 2.112 watts (+17.05%) <<< Re-test of teo-614. (don't care about 0.4 watts) >> menu-614: 1.873 watts (+6.91%) >> teo-614-p: 9.401 watts (+436.6%) <<< Very bad regression. > > Already noted. > > Since I've decided to withdraw this patch, I will not talk about it below. Yes, just repeated here for completeness. 
And, I didn't reprocess work already done to delete teo-614-p results. > >> menu-614-p: 1.820 watts (+3.9%) > > And this is an improvement worth noting. > > Generally speaking, I'm mostly interested in the differences between > teo-613 and teo-614 and between menu-6.14 and menu-614-p. > >> Further details: >> http://smythies.com/~doug/linux/idle/teo-6.14/idle/perf/ >> >> Test 2: 2 core ping pong sweep: >> >> Pass a token between 2 CPUs on 2 different cores. >> Do a variable amount of work at each stop. >> NOT a timer based test. >> >> Purpose: To utilize the shallowest idle states >> and observe the transition from using more of 1 >> idle state to another. >> >> Results relative to teo-613 (negative is better): >> menu-613 teo-614 menu-614 menu-614-p >> average -2.06% -0.32% -2.33% -2.52% >> max 9.42% 12.72% 8.29% 8.55% >> min -10.36% -3.82% -11.89% -12.13% >> >> No significant issues here. There are differences on idle state preferences. >> >> Standard "fast" dwell test: >> >> teo-613: average 3.826 uSec/loop reference >> menu-613: average 4.159 +8.70% >> teo-614: average 3.751 -1.94% > > A small improvement. > >> menu-614: average 4.076 +6.54% >> menu-614-p: average 4.178 +9.21% >> >> Interestingly, teo-614 also uses a little less power. >> Note that there is an offsetting region for the menu governor where it performs better >> than teo, but it was not extracted and done as a dwell test. >> >> Standard "medium dwell test: >> >> teo-613: 12.241 average uSec/loop reference >> menu-613: 12.251 average +0.08% >> teo-614: 12.121 average -0.98% > > Similarly here, but smaller. > >> menu-614: 12.123 average -0.96% >> menu-614-p: 12.236 average -0.04% >> >> Standard "slow" dwell test: Not done. >> >> Further details: >> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-2/loop-times.png >> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-2/loop-times-relative.png >> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-2/perf/ >> http://smythies.com/~doug/linux/idle/teo-6.14/many-0-400000000-2/perf/ >> http://smythies.com/~doug/linux/idle/teo-6.14/many-3000-100000000-2/ >> >> Test 3: 6 core ping pong sweep: >> >> Pass a token between 6 CPUs on 6 different cores. >> Do a variable amount of work at each stop. >> NOT a timer based test. >> >> Purpose: To utilize the midrange idle states >> and observe the transitions between use of >> idle states. >> >> Note: This test has uncertainty in an area where the performance is bi-stable for all idle governors, >> transitioning between much less power and slower performance and much more power and higher performance. >> On either side of this area, the differences between all idle governors are small. >> Only data from before this area (from results 1 to 95) was included in the below results. >> >> Results relative to teo-613 (negative is better): >> teo-614 menu-613 menu-614 menu-614-p >> average 1.60% 0.18% 0.02% 0.02% >> max 5.91% 0.97% 1.12% 0.85% >> min -1.79% -1.11% -1.88% -1.52% >> >> A further dwell test was done in the area where teo-614 performed worse. >> There was a slight regression in both performance and power: >> >> teo-613: average 21.34068 uSec per loop >> teo-614: average 20.55809 usec per loop 3.67% regression > > As this is usec per loop, I'd think that smaller would be better? Sorry, my mistake. That was written backwards, corrected below: teo-613: average 20.55809 uSec per loop teo-614: average 21.34068 usec per loop 3.67% regression > >> teo-613: average 37.17577 watts. >> teo-614: average 38.06375 watts. 
2.3% regression. > > Which would be consistent with this. There was both a regression in performance and power at this operating point. Another dwell test was done where menu-614-p did better than menu-614: uSec per loop: menu-614 menu-614-p average 807.896 772.376 max 962.265 946.880 min 798.375 755.430 menu-614 menu-614-p average 0.00% -4.40% max 19.11% 17.20% min -1.18% -6.49% menu-614: average 28.056 watts. menu-614-p: average 28.863 watts. 2.88% more. Note: to avoid inclusion of thermal stabilization times, only data from 30 to 45 minutes into the test were included in the average power calculation. >> Further details: >> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/loop-times.png >> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/loop-times-detail-a.png >> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/loop-times-detail-b.png >> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/perf/ >> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/dwell/perf/ http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/dwell-2/loop-times.png http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-6/dwell-2/perf/ >> >> Test 4: 12 CPU ping pong sweep: >> >> Pass a token between all 12 CPUs. >> Do a variable amount of work at each stop. >> NOT a timer based test. >> >> Purpose: To utilize the deeper idle states >> and observe the transitions between use of >> idle states. >> >> This test was added last time at the request of Christian Loehle. >> >> Note: This test has uncertainty in an area where the performance is bi-stable for all idle governors, >> transitioning between much less power and slower performance and much more power and higher performance. >> On either side of this area, the differences between all idle governors are small. >> >> Only data from before this area (from results 1 to 60) was included in the below results: >> >> Results relative to teo-613 (negative is better): >> teo-614 menu-613 menu-614 teo-614-p menu-614-p >> ave 1.73% 0.97% 1.29% 1.70% 0.43% >> max 16.79% 3.52% 3.95% 17.48% 4.98% >> min -0.35% -0.35% -0.18% -0.40% -0.54% >> >> Only data from after the uncertainty area (from results 170-300) was included in the below results: >> >> teo-614 menu-613 menu-614 teo-614-p menu-614-p >> ave 1.65% 0.04% 0.98% -0.56% 0.73% >> max 5.04% 2.10% 4.58% 2.44% 3.82% >> min 0.00% -1.89% -1.17% -1.95% -1.38% >> >> A further dwell test was done in the area where teo-614 performed worse and there is a 15.74% >> throughput regression for teo-614 and a 5.4% regression in power. My input is to consider this test further in the decision making. >> Further details: >> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/sweep-loop-times.png >> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/sweep-loop-times-detail-a.png >> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/sweep-relative-times.png >> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/perf/ >> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/dwell/times.txt >> http://smythies.com/~doug/linux/idle/teo-6.14/ping-sweep-12/dwell/perf/ >> >> Test 5: sleeping ebizzy - 128 threads. >> >> Purpose: This test has given interesting results in the past. >> The test varies the sleep interval between record lookups. >> The result is varying usage of idle states. >> >> Results: Nothing significant to report just from the performance data. >> However, there does seem to be power differences worth considering. 
>> >> A futher dwell test was done in a cherry picked spot. >> It it is important to note that teo-614 removed a sawtooth performance >> pattern that was present with teo-613. I.E. it is more stable. See: >> http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/dwell/rps-only-teo.png While re-examining menu-614 and menu-614-p, and to reduce clutter, this graph was made: http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/dwell/rps-only-menu.png menu-614: average 8722.787 records per second menu-614-p: average 8683.387 records per second 0.45% regression (i.e. negligible) >> Further details: >> http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/interval-sweep.png >> http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/relative-performance.png >> http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/perf/ >> http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/dwell/rps.png >> http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/dwell/rps-relative.png >> http://smythies.com/~doug/linux/idle/teo-6.14/ebizzy/dwell/perf/ >> >> Test 6: adrestia wakeup latency tests. 500 threads. >> >> Purpose: The test was reported in 2023.09 by the kernel test robot and looked >> both interesting and gave interesting results, so I added it to the tests I run. >> >> Results: >> teo-613.txt:wakeup cost (periodic, 20us): 3331nSec reference >> teo-614.txt:wakeup cost (periodic, 20us): 3375nSec +1.32% >> menu-613.txt:wakeup cost (periodic, 20us): 3207nSec -3.72% >> menu-614.txt:wakeup cost (periodic, 20us): 3315nSec -0.48% >> menu-614-p.txt:wakeup cost (periodic, 20us): 3353nSec +0.66% >> >> Further details: >> http://smythies.com/~doug/linux/idle/teo-6.14/adrestia/periodic/histogram.png >> http://smythies.com/~doug/linux/idle/teo-6.14/adrestia/periodic/histogram-detail-a.png >> http://smythies.com/~doug/linux/idle/teo-6.14/adrestia/periodic/perf/ >> >> Test 7: consume: periodic workflow. Various work/sleep frequencies and loads. >> >> Purpose: To search for anomalies and hysteresis over all possible workloads at various work/sleep frequencies. >> work/sleep frequencies tested: 73, 113, 211, 347, and 401 Hertz. >> IS a timer based test. >> >> NOTE: Repeatability issues. More work needed. >> >> Tests show instability with teo-614, but a re-test was much less unstable and better power. >> Idle statistics were collected for the re-test and does show teo-614 overly favoring idle state 1, with >> "Idle state 1 was too shallow" of 70% verses 15% for teo-613. I'll try to do some more experiments with these timer based periodic type workflows. >> Further details: >> http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf73/ >> http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf113/ >> http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf211/ >> http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf347/ >> http://smythies.com/~doug/linux/idle/teo-6.14/consume/pf401/ >> http://smythies.com/~doug/linux/idle/teo-6.14/consume/test/ >> http://smythies.com/~doug/linux/idle/teo-6.14/consume/test-idle/ >> >> Test 8: shell-intensive serialized workloads. >> >> Variable: PIDs per second, amount of work each task does. >> Note: Single threaded. >> >> Dountil the list of tasks is finished: >> Start the next task in the list of stuff to do (with a new PID). >> Wait for it to finish >> Enduntil >> >> This workflow represents a challenge for CPU frequency scaling drivers, >> schedulers, and therefore idle drivers. >> >> Also, the best performance is achieved by overriding >> the scheduler and forcing CPU affinity. 
This "best" case is the >> master reference, requiring additional legend definitions: >> 1cpu-613: Kernel 6.13, execution forced onto CPU 3. >> 1cpu-614: Kernel 6.14-rc1, execution forced onto CPU 3. >> >> Ideally the two 1cpu graphs would be identical, but they are not, >> likely due to other changes between the two kernels. >> >> Results: >> teo-614 is absolutely outstanding in this test. >> Considerably better than any previous result over many years. > > Sounds good! This improvement is significant. I redid the teo-614 test to prove repeatability and dismiss operator error. It was also good. I did not re-do the published graphs. >> Further details: >> http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/times.png >> http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/times-log.png >> http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/relative.png >> http://smythies.com/~doug/linux/idle/teo-6.14/pid-per-sec/perf-3/relative-log.png >> >> Test 9: Many threads, periodic workflow >> >> 500 threads of do a little work and then sleep a little. >> IS a timer based test. >> >> Results: >> Kernel 6.13 teo: reference >> Kernel 6.13 menu: -0.06% >> Kernel 6.14 teo: -0.09% >> Kernel 6.14 menu: +0.49% >> Kernel 6.14+p menu: +0.33% >> >> What is interesting is the significant differences in idle state selection. >> Powers might be interesting, but much longer tests would be needed to achieve thermal equilibrium. >> ... mess deleted ... > > What's the difference between this configuration and the one above? So sorry, big big screwup in my composition of the original email. The rest was copy of the above with the typos fixed. Apologies. ... redundant stuff deleted ... > Overall, having seen these results, I'm not worried about the change > from teo-613 to teo-614. The motivation for it was mostly code > consistency and IMV the results indicate that it was worth doing. Agree, with hesitation. There are both negative and positive results, but overall okay. > Also, if I'm not mistaken, the differences between menu-6.14 and > menu-6.14-p in the majority of your tests are relatively small (if not > in the noise) which, given that the latter is a major improvement for > the SPECjbb workload as reported by Artem, makes me think that I > should queue up menu-614-p for 6.15. Agreed. > Thanks!
On 2/10/25 14:15, Christian Loehle wrote: > On 2/6/25 14:21, Rafael J. Wysocki wrote: >> Hi Everyone, >> >> This work had been triggered by a report that commit 0611a640e60a ("eventpoll: >> prefer kfree_rcu() in __ep_remove()") had caused the critical-jOPS metric of >> the SPECjbb 2015 benchmark [1] to drop by around 50% even though it generally >> reduced kernel overhead. Indeed, it was found during further investigation >> that the total interrupt rate while running the SPECjbb workload had fallen as >> a result of that commit by 55% and the local timer interrupt rate had fallen by >> almost 80%. >> >> That turned out to cause the menu cpuidle governor to select the deepest idle >> state supplied by the cpuidle driver (intel_idle) much more often which added >> significantly more idle state latency to the workload and that led to the >> decrease of the critical-jOPS score. >> >> Interestingly enough, this problem was not visible when the teo cpuidle >> governor was used instead of menu, so it appeared to be specific to the >> latter. CPU wakeup event statistics collected while running the workload >> indicated that the menu governor was effectively ignoring non-timer wakeup >> information and all of its idle state selection decisions appeared to be >> based on timer wakeups only. Thus, it appeared that the reduction of the >> local timer interrupt rate caused the governor to predict a idle duration >> much more often while running the workload and the deepest idle state was >> selected significantly more often as a result of that. >> >> A subsequent inspection of the get_typical_interval() function in the menu >> governor indicated that it might return UINT_MAX too often which then caused >> the governor's decisions to be based entirely on information related to timers. >> >> Generally speaking, UINT_MAX is returned by get_typical_interval() if it >> cannot make a prediction based on the most recent idle intervals data with >> sufficiently high confidence, but at least in some cases this means that >> useful information is not taken into account at all which may lead to >> significant idle state selection mistakes. Moreover, this is not really >> unlikely to happen. >> >> One issue with get_typical_interval() is that, when it eliminates outliers from >> the sample set in an attempt to reduce the standard deviation (and so improve >> the prediction confidence), it does that by dropping high-end samples only, >> while samples at the low end of the set are retained. However, the samples >> at the low end very well may be the outliers and they should be eliminated >> from the sample set instead of the high-end samples. Accordingly, the >> likelihood of making a meaningful idle duration prediction can be improved >> by making it also eliminate low-end samples if they are farther from the >> average than high-end samples. This is done in patch [4/5]. >> >> Another issue is that get_typical_interval() gives up after eliminating 1/4 >> of the samples if the standard deviation is still not as low as desired (within >> 1/6 of the average or within 20 us if the average is close to 0), but the >> remaining samples in the set still represent useful information at that point >> and discarding them altogether may lead to suboptimal idle state selection. >> >> For instance, the largest idle duration value in the get_typical_interval() >> data set is the maximum idle duration observed recently and it is likely that >> the upcoming idle duration will not exceed it. 
Therefore, in the absence of
>> a better choice, this value can be used as an upper bound on the target
>> residency of the idle state to select. Patch [5/5] works along these lines,
>> but it takes the maximum data point remaining after the elimination of
>> outliers.
>>
>> The first two patches in the series are straightforward cleanups (in fact,
>> the first patch is kind of reversed by patch [4/5], but it is there because
>> it can be applied without the latter) and patch [3/5] is a cosmetic change
>> made in preparation for the subsequent ones.
>>
>> This series turns out to restore the SPECjbb critical-jOPS metric on affected
>> systems to the level from before commit 0611a640e60a and it also happens to
>> increase its max-jOPS metric by around 3%.
>>
>> For easier reference/testing it is present in the git branch at
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=experimental/menu
>>
>> based on the cpuidle material that went into 6.14-rc1.
>>
>> If possible, please let me know if it works for you.
>>
>> Thanks!
>>
>>
>> [1] Link: https://www.spec.org/jbb2015/

Another dump, for x86 idle this time. tl;dr: no worrying idle or power numbers.

teo 0 121.76 12202 813 0.067 390 423
teo 1 158.53 8223 536 0.065 328 208
teo 2 219.37 8373 527 0.063 294 233
teo 3 220.7 8241 538 0.065 340 198
teo 4 211.8 7923 442 0.056 268 174
menu 0 151.63 8185 326 0.040 308 18
menu 1 183.45 8873 364 0.041 334 30
menu 2 171.96 8633 380 0.044 369 11
menu 3 164.95 8451 358 0.042 330 28
menu 4 175.87 8273 340 0.041 317 23
menu-1 0 119.77 9041 394 0.044 356 38
menu-1 1 145.73 8603 335 0.039 293 42
menu-1 2 157.89 8345 321 0.038 276 45
menu-1 3 119.13 8447 346 0.041 290 56
menu-1 4 142.77 8331 331 0.040 312 19
menu-2 0 159.81 8653 342 0.040 296 46
menu-2 1 165.01 8421 307 0.036 282 25
menu-2 2 225.06 8647 376 0.043 317 59
menu-2 3 232.13 8095 358 0.044 325 33
menu-2 4 150.79 8231 323 0.039 299 24
menu-3 0 168.87 8153 355 0.044 330 25
menu-3 1 187.68 9143 405 0.044 338 67
menu-3 2 129.77 9705 384 0.040 301 83
menu-3 3 152.49 9679 469 0.048 374 95
menu-3 4 131.0 9077 321 0.035 283 38
menu-4 0 116.68 9107 373 0.041 333 40
menu-4 1 164.1 8655 297 0.034 287 10
menu-4 2 157.52 8009 300 0.037 297 3
menu-4 3 138.47 8567 345 0.040 341 4
menu-4 4 130.84 8027 324 0.040 316 8
menu-5 0 139.77 8533 327 0.038 317 10
menu-5 1 157.22 9127 433 0.047 373 60
menu-5 2 144.54 8313 329 0.040 311 18
menu-5 3 151.55 8675 316 0.036 301 15
menu-5 4 137.49 8823 354 0.040 336 18
menu 0 128.97 8383 329 0.039 284 45
menu 1 141.97 8945 402 0.045 344 58
menu 2 88.16 8829 368 0.042 307 61
menu 3 81.49 9165 430 0.047 371 59
menu 4 107.58 9193 401 0.044 335 66
teo 0 149.28 8399 521 0.062 287 234
teo 1 105.61 8717 563 0.065 306 257
teo 2 116.65 7893 550 0.070 284 266
teo 3 119.57 8259 489 0.059 282 207
teo 4 187.64 7897 471 0.060 303 168

And here are the rk3399 numbers as promised; just like on rk3588, we see much
better IO performance with menu-5 without significantly worse idle decisions:

device gov iter iops idles idle_misses idle_miss_ratio belows aboves
mapper/dm-slow teo 0 461 102980 32260 0.313 6036 26224
mapper/dm-slow teo 1 461 100698 31149 0.309 5796 25353
mapper/dm-slow teo 2 461 100346 32840 0.327 6022 26818
mapper/dm-slow teo 3 461 98450 31513 0.320 5129 26384
mapper/dm-slow teo 4 461 98689 29982 0.304 3937 26045
mapper/dm-slow menu 0 461 97860 22520 0.230 5505 17015
mapper/dm-slow menu 1 461 98002 20858 0.213 3596 17262
mapper/dm-slow menu 2 461 100046 22523 0.225 5333 17190
mapper/dm-slow menu 3 461 95020 20827 0.219 4069 16758
mapper/dm-slow menu 4 461 98040 22302 0.227 5498 16804
mapper/dm-slow menu-1 0 461 98186 20648 0.210 3210 17438
mapper/dm-slow menu-1 1 461 94360 20297 0.215 4184 16113
mapper/dm-slow menu-1 2 461 98818 21680 0.219 4750 16930
mapper/dm-slow menu-1 3 461 97822 20605 0.211 3469 17136
mapper/dm-slow menu-1 4 461 100748 21740 0.216 4403 17337
mapper/dm-slow menu-2 0 461 94388 20289 0.215 3449 16840
mapper/dm-slow menu-2 1 460 89124 18897 0.212 2401 16496
mapper/dm-slow menu-2 2 461 94932 20692 0.218 2949 17743
mapper/dm-slow menu-2 3 461 95270 20612 0.216 3048 17564
mapper/dm-slow menu-2 4 461 101954 23493 0.230 5978 17515
mapper/dm-slow menu-3 0 461 98452 21161 0.215 4247 16914
mapper/dm-slow menu-3 1 461 100342 21035 0.210 3891 17144
mapper/dm-slow menu-3 2 461 101156 23322 0.231 5924 17398
mapper/dm-slow menu-3 3 461 98052 20862 0.213 3927 16935
mapper/dm-slow menu-3 4 461 97746 20977 0.215 3706 17271
mapper/dm-slow menu-4 0 461 99826 23727 0.238 5055 18672
mapper/dm-slow menu-4 1 461 101686 24859 0.244 5175 19684
mapper/dm-slow menu-4 2 461 99934 23568 0.236 4477 19091
mapper/dm-slow menu-4 3 461 97298 22142 0.228 3644 18498
mapper/dm-slow menu-4 4 461 98546 24023 0.244 5086 18937
mapper/dm-slow menu-5 0 461 100545 22833 0.227 3830 19003
mapper/dm-slow menu-5 1 461 100827 23999 0.238 5217 18782
mapper/dm-slow menu-5 2 461 97044 22628 0.233 2910 19718
mapper/dm-slow menu-5 3 461 100234 23303 0.232 4819 18484
mapper/dm-slow menu-5 4 461 102358 24488 0.239 4770 19718
mapper/dm-slow menu 0 461 97008 21114 0.218 4540 16574
mapper/dm-slow menu 1 461 96088 21470 0.223 3650 17820
mapper/dm-slow menu 2 461 99008 21019 0.212 3405 17614
mapper/dm-slow menu 3 461 96608 20145 0.209 3729 16416
mapper/dm-slow menu 4 461 83152 17469 0.210 2426 15043
mapper/dm-slow teo 0 461 99340 32077 0.323 5772 26305
mapper/dm-slow teo 1 461 98694 29426 0.298 3585 25841
mapper/dm-slow teo 2 461 100294 29810 0.297 3561 26249
mapper/dm-slow teo 3 461 98726 29496 0.299 3644 25852
mapper/dm-slow teo 4 461 101424 32654 0.322 6029 26625
mmcblk1 teo 0 2016 559362 29994 0.054 2896 27098
mmcblk1 teo 1 2037 562153 30001 0.053 3171 26830
mmcblk1 teo 2 2016 557360 30185 0.054 2986 27199
mmcblk1 menu 0 1279 335364 103600 0.309 87662 15938
mmcblk1 menu 1 1292 342036 105446 0.308 89031 16415
mmcblk1 menu 2 1294 352954 108588 0.308 90420 18168
mmcblk1 menu-1 0 1271 331220 103163 0.311 87602 15561
mmcblk1 menu-1 1 1291 350084 108982 0.311 90670 18312
mmcblk1 menu-1 2 1284 346412 107494 0.310 89899 17595
mmcblk1 menu-2 0 1306 344316 106253 0.309 89650 16603
mmcblk1 menu-2 1 1278 345684 107893 0.312 90292 17601
mmcblk1 menu-2 2 1268 334528 104494 0.312 88457 16037
mmcblk1 menu-3 0 1270 333456 104160 0.312 88392 15768
mmcblk1 menu-3 1 1273 338328 105477 0.312 88798 16679
mmcblk1 menu-3 2 1280 337002 104623 0.310 88516 16107
mmcblk1 menu-4 0 1311 344896 104192 0.302 87051 17141
mmcblk1 menu-4 1 1292 343878 106459 0.310 88297 18162
mmcblk1 menu-4 2 1286 340172 105502 0.310 87753 17749
mmcblk1 menu-5 0 2006 550266 24981 0.045 6762 18219
mmcblk1 menu-5 1 1997 553590 26974 0.049 6955 20019
mmcblk1 menu-5 2 1994 539494 17652 0.033 3903 13749
mmcblk2 teo 0 5691 820134 29346 0.036 3078 26268
mmcblk2 teo 1 5684 856976 23202 0.027 1908 21294
mmcblk2 teo 2 5783 824666 13984 0.017 3938 10046
mmcblk2 menu 0 2770 433474 144860 0.334 127466 17394
mmcblk2 menu 1 3308 367848 89597 0.244 72668 16929
mmcblk2 menu 2 2882 422844 133523 0.316 117170 16353
mmcblk2 menu-1 0 3323 394674 115764 0.293 99328 16436
mmcblk2 menu-1 1 2778 420262 139538 0.332 122356 17182
mmcblk2 menu-1 2 2895 400774 124841 0.311 109572 15269
mmcblk2 menu-2 0 2679 429818 148494 0.345 131513 16981
mmcblk2 menu-2 1 3162 363888 96102 0.264 79200 16902
mmcblk2 menu-2 2 2684 422324 144606 0.342 128528 16078
mmcblk2 menu-3 0 2953 392124 118629 0.303 101068 17561
mmcblk2 menu-3 1 3003 402614 120567 0.299 103321 17246
mmcblk2 menu-3 2 2858 422576 136118 0.322 119485 16633
mmcblk2 menu-4 0 3288 436860 118566 0.271 100329 18237
mmcblk2 menu-4 1 3062 462484 139897 0.302 121424 18473
mmcblk2 menu-4 2 3257 424458 115493 0.272 97739 17754
mmcblk2 menu-5 0 5316 573050 52502 0.092 33285 19217
mmcblk2 menu-5 1 5446 825538 44073 0.053 24355 19718
mmcblk2 menu-5 2 5292 796000 52828 0.066 38640 14188
nvme0n1 teo 0 11371 807338 29879 0.037 2961 26918
nvme0n1 teo 1 11557 815682 29116 0.036 2947 26169
nvme0n1 teo 2 11424 810108 29800 0.037 2953 26847
nvme0n1 menu 0 7754 574116 93148 0.162 76482 16666
nvme0n1 menu 1 8371 618954 95502 0.154 77657 17845
nvme0n1 menu 2 5111 412030 73440 0.178 55997 17443
nvme0n1 menu-1 0 6628 506618 91832 0.181 71427 20405
nvme0n1 menu-1 1 4923 390294 68772 0.176 52880 15892
nvme0n1 menu-1 2 5015 396160 68840 0.174 52867 15973
nvme0n1 menu-2 0 7883 589296 97497 0.165 79635 17862
nvme0n1 menu-2 1 6465 493796 81561 0.165 64629 16932
nvme0n1 menu-2 2 5363 430614 75499 0.175 57528 17971
nvme0n1 menu-3 0 5090 415018 74191 0.179 56015 18176
nvme0n1 menu-3 1 4919 401452 71994 0.179 54457 17537
nvme0n1 menu-3 2 5183 413186 74199 0.180 57542 16657
nvme0n1 menu-4 0 5402 424860 67413 0.159 49399 18014
nvme0n1 menu-4 1 5343 420538 67713 0.161 49826 17887
nvme0n1 menu-4 2 9151 669840 107892 0.161 87774 20118
nvme0n1 menu-5 0 10475 741376 20204 0.027 1827 18377
nvme0n1 menu-5 1 10603 747262 19228 0.026 1489 17739
nvme0n1 menu-5 2 11658 824996 22954 0.028 2631 20323
sda teo 0 2328 1334552 48960 0.037 20922 28038
sda teo 1 2328 1267840 37740 0.030 11934 25806
sda teo 2 2394 1360679 21853 0.016 3394 18459
sda menu 0 1004 587054 198775 0.339 184002 14773
sda menu 1 1205 663838 209623 0.316 193325 16298
sda menu 2 1117 615382 208813 0.339 191893 16920
sda menu-1 0 1103 627838 212955 0.339 195703 17252
sda menu-1 1 1024 611658 221754 0.363 203710 18044
sda menu-1 2 1209 639008 180597 0.283 163837 16760
sda menu-2 0 1200 655398 205664 0.314 190750 14914
sda menu-2 1 1100 582222 201983 0.347 185874 16109
sda menu-2 2 1124 602988 199623 0.331 183798 15825
sda menu-3 0 1089 612112 211470 0.345 195156 16314
sda menu-3 1 1077 613556 213484 0.348 196839 16645
sda menu-3 2 1157 636904 195439 0.307 179535 15904
sda menu-4 0 1126 643468 208132 0.323 189334 18798
sda menu-4 1 1112 634480 216012 0.340 196841 19171
sda menu-4 2 1190 594398 196059 0.330 176190 19869
sda menu-5 0 2074 1134718 81294 0.072 61820 19474
sda menu-5 1 2179 1249056 76679 0.061 55461 21218
sda menu-5 2 2075 1183214 124075 0.105 101650 22425
nullb0 teo 0 104833 85906 29085 0.339 3409 25676
nullb0 teo 1 103787 88419 29833 0.337 2980 26853
nullb0 teo 2 104611 86284 29390 0.341 3315 26075
nullb0 menu 0 103671 87146 20514 0.235 2643 17871
nullb0 menu 1 104380 70086 16855 0.240 1642 15213
nullb0 menu 2 103249 81414 19403 0.238 2274 17129
nullb0 menu-1 0 103424 86984 20448 0.235 2617 17831
nullb0 menu-1 1 103857 85658 20840 0.243 2544 18296
nullb0 menu-1 2 103907 86644 20639 0.238 2586 18053
nullb0 menu-2 0 103668 82558 20053 0.243 2655 17398
nullb0 menu-2 1 104277 86914 20472 0.236 2593 17879
nullb0 menu-2 2 103697 82952 20221 0.244 2410 17811
nullb0 menu-3 0 103696 86534 20782 0.240 2968 17814
nullb0 menu-3 1 103996 81902 19795 0.242 2773 17022
nullb0 menu-3 2 103790 82474 20058 0.243 2344 17714
nullb0 menu-4 0 103288 87475 22688 0.259 2596 20092
nullb0 menu-4 1 103848 70906 18106 0.255 1557 16549
nullb0 menu-4 2 104141 84528 22147 0.262 2969 19178
nullb0 menu-5 0 103812 79234 17989 0.227 1302 16687
nullb0 menu-5 1 104334 87752 22878 0.261 2511 20367
nullb0 menu-5 2 104059 88681 22765 0.257 3030 19735
mtdblock3 teo 0 257 604294 17359 0.029 3241 14118
mtdblock3 teo 1 256 332344 31631 0.095 4644 26987
mtdblock3 teo 2 257 549736 29559 0.054 2841 26718
mtdblock3 menu 0 148 417388 134505 0.322 118487 16018
mtdblock3 menu 1 137 422132 149336 0.354 132655 16681
mtdblock3 menu 2 194 223808 59631 0.266 43076 16555
mtdblock3 menu-1 0 145 529186 143789 0.272 129433 14356
mtdblock3 menu-1 1 147 452302 140418 0.310 125009 15409
mtdblock3 menu-1 2 138 415152 146470 0.353 130607 15863
mtdblock3 menu-2 0 155 365750 118483 0.324 102676 15807
mtdblock3 menu-2 1 165 316818 101968 0.322 85597 16371
mtdblock3 menu-2 2 143 515664 126014 0.244 115854 10160
mtdblock3 menu-3 0 135 488188 150917 0.309 136442 14475
mtdblock3 menu-3 1 125 437774 158893 0.363 143319 15574
mtdblock3 menu-3 2 138 433332 152457 0.352 135017 17440
mtdblock3 menu-4 0 173 314250 101648 0.323 81511 20137
mtdblock3 menu-4 1 149 489030 139551 0.285 124126 15425
mtdblock3 menu-4 2 148 381488 133543 0.350 115885 17658
mtdblock3 menu-5 0 222 430158 63885 0.149 45240 18645
mtdblock3 menu-5 1 218 752248 80500 0.107 66453 14047
mtdblock3 menu-5 2 203 528828 105885 0.200 89573 1631

And finally a longer Firefox YouTube 4K playback (10 minutes) on x86:

device gov iter Joules idles idle_misses idle_miss_ratio belows aboves
menu 0 1064.48 357559 106048 0.297 105920 128
menu 1 1029.85 345569 104050 0.301 103938 112
menu 2 1105.93 347105 104958 0.302 104885 73
menu 3 1085.86 347365 106061 0.305 106001 60
menu 4 1115.24 352609 107913 0.306 107812 101
menu-5 0 1139.09 345827 90172 0.261 89609 563
menu-5 1 1111.58 335521 88953 0.265 88904 49
menu-5 2 1093.73 328645 85949 0.262 85839 110
menu-5 3 1036.69 330547 86163 0.261 86077 86
menu-5 4 1117.31 316143 81707 0.258 81580 127
menu 0 1099.72 353895 106574 0.301 106523 51
menu 1 1148.44 357867 107578 0.301 107369 209
menu 2 1098.15 341957 101995 0.298 101870 125
menu 3 1124.41 350423 105592 0.301 105481 111
menu 4 1185.94 366799 111132 0.303 111029 103
menu-5 0 1129.85 332413 86991 0.262 86885 106
menu-5 1 1086.59 318221 82020 0.258 81924 96
menu-5 2 1063.5 320273 83099 0.259 83048 51
menu-5 3 1070.7 331179 85998 0.260 85839 159
menu-5 4 1067.82 322689 83634 0.259 83548 86

menu-5 significantly reduces the 'belows' idle misses here.

I'll do the Android tests, but those are very unlikely to show something this
doesn't (there's only one non-WFI idle state and most workloads are
intercept-heavy, so if anything menu-5 should improve the overall situation).
Feel free to already add:

Tested-by: Christian Loehle <christian.loehle@arm.com>
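In the tables above, idle_misses appears to be belows + aboves and idle_miss_ratio
is idle_misses divided by idles (for example, 390 + 423 = 813 and 813 / 12202 is
about 0.067 in the first teo row). As a hedged illustration of where such counters
can come from, here is a minimal userspace sketch that sums the per-state "usage",
"below" and "above" attributes exposed by cpuidle in sysfs; the exact tooling used
for these runs is not shown in the thread, so treat this only as one possible way
to collect comparable numbers:

/*
 * Hedged sketch: sum the per-state "usage", "below" and "above" counters
 * that cpuidle exposes under sysfs.  Illustration only; not the tooling
 * actually used for the tables above.
 */
#include <stdio.h>

static int read_counter(int cpu, int state, const char *name,
			unsigned long long *val)
{
	char path[128];
	FILE *f;
	int ok;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%d/cpuidle/state%d/%s",
		 cpu, state, name);
	f = fopen(path, "r");
	if (!f)
		return 0;
	ok = (fscanf(f, "%llu", val) == 1);
	fclose(f);
	return ok;
}

int main(void)
{
	unsigned long long usage_sum = 0, below_sum = 0, above_sum = 0;

	for (int cpu = 0; ; cpu++) {
		unsigned long long v;
		int states = 0;

		for (int state = 0; ; state++) {
			if (!read_counter(cpu, state, "usage", &v))
				break;
			usage_sum += v;
			if (read_counter(cpu, state, "below", &v))
				below_sum += v;
			if (read_counter(cpu, state, "above", &v))
				above_sum += v;
			states++;
		}
		/* Stop at the first CPU without a cpuidle directory. */
		if (!states)
			break;
	}

	printf("idles=%llu belows=%llu aboves=%llu miss_ratio=%.3f\n",
	       usage_sum, below_sum, above_sum,
	       usage_sum ? (double)(below_sum + above_sum) / usage_sum : 0.0);
	return 0;
}

Summing over all CPUs and states gives totals in the same spirit as the idles,
belows and aboves columns, although per-benchmark numbers would need the counters
sampled before and after each run and the difference taken.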
On Tue, Feb 18, 2025 at 10:17 PM Christian Loehle <christian.loehle@arm.com> wrote:
>
> On 2/10/25 14:15, Christian Loehle wrote:
> > On 2/6/25 14:21, Rafael J. Wysocki wrote:
> >> Hi Everyone,
> >>
> >> This work had been triggered by a report that commit 0611a640e60a ("eventpoll:
> >> prefer kfree_rcu() in __ep_remove()") had caused the critical-jOPS metric of
> >> the SPECjbb 2015 benchmark [1] to drop by around 50% even though it generally
> >> reduced kernel overhead. Indeed, it was found during further investigation
> >> that the total interrupt rate while running the SPECjbb workload had fallen as
> >> a result of that commit by 55% and the local timer interrupt rate had fallen by
> >> almost 80%.
> >>
> >> That turned out to cause the menu cpuidle governor to select the deepest idle
> >> state supplied by the cpuidle driver (intel_idle) much more often which added
> >> significantly more idle state latency to the workload and that led to the
> >> decrease of the critical-jOPS score.
> >>
> >> Interestingly enough, this problem was not visible when the teo cpuidle
> >> governor was used instead of menu, so it appeared to be specific to the
> >> latter. CPU wakeup event statistics collected while running the workload
> >> indicated that the menu governor was effectively ignoring non-timer wakeup
> >> information and all of its idle state selection decisions appeared to be
> >> based on timer wakeups only. Thus, it appeared that the reduction of the
> >> local timer interrupt rate caused the governor to predict an idle duration
> >> much more often while running the workload and the deepest idle state was
> >> selected significantly more often as a result of that.
> >>
> >> A subsequent inspection of the get_typical_interval() function in the menu
> >> governor indicated that it might return UINT_MAX too often which then caused
> >> the governor's decisions to be based entirely on information related to timers.
> >>
> >> Generally speaking, UINT_MAX is returned by get_typical_interval() if it
> >> cannot make a prediction based on the most recent idle intervals data with
> >> sufficiently high confidence, but at least in some cases this means that
> >> useful information is not taken into account at all which may lead to
> >> significant idle state selection mistakes. Moreover, this is not really
> >> unlikely to happen.
> >>
> >> One issue with get_typical_interval() is that, when it eliminates outliers from
> >> the sample set in an attempt to reduce the standard deviation (and so improve
> >> the prediction confidence), it does that by dropping high-end samples only,
> >> while samples at the low end of the set are retained. However, the samples
> >> at the low end very well may be the outliers and they should be eliminated
> >> from the sample set instead of the high-end samples. Accordingly, the
> >> likelihood of making a meaningful idle duration prediction can be improved
> >> by making it also eliminate low-end samples if they are farther from the
> >> average than high-end samples. This is done in patch [4/5].
> >>
> >> Another issue is that get_typical_interval() gives up after eliminating 1/4
> >> of the samples if the standard deviation is still not as low as desired (within
> >> 1/6 of the average or within 20 us if the average is close to 0), but the
> >> remaining samples in the set still represent useful information at that point
> >> and discarding them altogether may lead to suboptimal idle state selection.
> >>
> >> For instance, the largest idle duration value in the get_typical_interval()
> >> data set is the maximum idle duration observed recently and it is likely that
> >> the upcoming idle duration will not exceed it. Therefore, in the absence of
> >> a better choice, this value can be used as an upper bound on the target
> >> residency of the idle state to select. Patch [5/5] works along these lines,
> >> but it takes the maximum data point remaining after the elimination of
> >> outliers.
> >>
> >> The first two patches in the series are straightforward cleanups (in fact,
> >> the first patch is kind of reversed by patch [4/5], but it is there because
> >> it can be applied without the latter) and patch [3/5] is a cosmetic change
> >> made in preparation for the subsequent ones.
> >>
> >> This series turns out to restore the SPECjbb critical-jOPS metric on affected
> >> systems to the level from before commit 0611a640e60a and it also happens to
> >> increase its max-jOPS metric by around 3%.
> >>
> >> For easier reference/testing it is present in the git branch at
> >>
> >> https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=experimental/menu
> >>
> >> based on the cpuidle material that went into 6.14-rc1.
> >>
> >> If possible, please let me know if it works for you.
> >>
> >> Thanks!
> >>
> >>
> >> [1] Link: https://www.spec.org/jbb2015/
>
> Another dump for x86 idle this time, tldr: no worrying idle/power numbers

[cut]

> Significantly improving idle miss belows.
>
> I'll do the Android tests, but that is very unlikely to show something this
> doesn't (there's only one non-WFI idle state and most workloads are intercept
> heavy, so if anything menu-5 should improve the overall situation.)
> Feel free to already add:
> Tested-by: Christian Loehle <christian.loehle@arm.com>

Thank you!
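For readers who want to experiment with the idea outside the kernel, here is a
hedged userspace sketch of the approach described in the cover letter above. It
is not the actual menu governor code: the function name, thresholds and sample
count are made up for illustration. It trims an outlier from whichever end of
the sorted sample set is farther from the average (the [4/5] idea) and, if the
spread is still not acceptable after dropping 1/4 of the samples, returns the
maximum remaining sample as an upper bound instead of giving up (the [5/5] idea):

/*
 * Illustrative userspace sketch (NOT the kernel implementation) of the
 * two ideas above: two-sided outlier elimination and falling back to the
 * maximum remaining sample instead of "no prediction".
 */
#include <stdint.h>
#include <stdio.h>

#define NR_SAMPLES	8

static unsigned int predict_interval_us(const unsigned int *samples, int n)
{
	unsigned int buf[NR_SAMPLES];
	int lo = 0, hi = n, dropped = 0;

	if (n < 1 || n > NR_SAMPLES)
		return 0;

	for (int i = 0; i < n; i++)
		buf[i] = samples[i];

	/* Sort ascending; n is tiny, so insertion sort is fine. */
	for (int i = 1; i < n; i++) {
		unsigned int v = buf[i];
		int j = i - 1;

		while (j >= 0 && buf[j] > v) {
			buf[j + 1] = buf[j];
			j--;
		}
		buf[j + 1] = v;
	}

	for (;;) {
		int cnt = hi - lo;
		uint64_t sum = 0, sqsum = 0, avg, var;

		for (int i = lo; i < hi; i++)
			sum += buf[i];
		avg = sum / cnt;
		for (int i = lo; i < hi; i++) {
			uint64_t d = buf[i] > avg ? buf[i] - avg : avg - buf[i];

			sqsum += d * d;
		}
		var = sqsum / cnt;

		/*
		 * "Good enough" check, loosely modelled on the rule of thumb
		 * quoted above: standard deviation within 1/6 of the average,
		 * or within 20 us when the average is small.
		 */
		if (var * 36 <= avg * avg || var <= 400)
			return (unsigned int)avg;

		/* [5/5] idea: don't discard everything, bound by the maximum. */
		if (++dropped > n / 4)
			return buf[hi - 1];

		/* [4/5] idea: drop whichever end is farther from the average. */
		if (avg - buf[lo] > (uint64_t)buf[hi - 1] - avg)
			lo++;
		else
			hi--;
	}
}

int main(void)
{
	/* One short outlier at the low end, otherwise ~1000 us intervals. */
	unsigned int samples[NR_SAMPLES] = {
		20, 980, 1010, 995, 1020, 990, 1005, 1000
	};

	printf("predicted interval: %u us\n",
	       predict_interval_us(samples, NR_SAMPLES));
	return 0;
}

With this example data the 20 us low-end outlier is the sample that gets dropped
and the prediction settles at 1000 us, whereas trimming only from the high end
would not bring the spread down within the 1/4 budget for this set.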