Message ID | 20230710093100.918337-1-dedekind1@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | Sapphire Rapids C0.x idle states support | expand |
On Mon, Jul 10, 2023 at 11:31 AM Artem Bityutskiy <dedekind1@gmail.com> wrote: > > From: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> > > Background > ---------- > > Idle states reduce power consumption when a CPU has no work to do. The most > shallow CPU idle state is "POLL". It has lowest wake up latency, but saves > little power. The next idle state on Intel platforms is "C1". It has has higher > latency, but saves more power than "POLL". > > Sapphire Rapids Xeons add new C0.1 and C0.2 (later C0.x) idle states which > conceptually sit between "POLL" and "C1". These provide a very attractive > midpoint: near-POLL wake-up latency and power consumption halfway between > "POLL" and "C1". > > In other words, the expectation is that most latency-sensitive users will > prefer C0.x over POLL. > > Enable C0.2 idle state support on Sapphire Rapids Xeon (later - SPR) by adding > it between POLL and C1. > > Base commit > ----------- > > Based on the "linux-next" branch of "linux-pm" git tree. > > base-commit: bd9bb08847da3b1eba2ea8cebf514d9287e7f4fb > > Changelog > --------- > > * v4: > - Address issues pointed out by Thomas Gleixner. > . mwait.h: use 'IS_ENABLED()' instead of '#ifdef'. > . mwait.h: use '__always_inline'. > . mwait.h: use inline stub instead for macro for "!CONFIG_X86_64" case. > . mwait.h: use proper commentaries for '#endif' and '#else'. > . mwait.h: tested with llvm/clang. > . Use imperative form (removed "this patch"). > - intel_idle: rename 'intel_idle_hlt_irq_on()' for consistency. I'm wondering if the v4 looks better than the previous versions to the x86 folks (and Thomas in particular). Note that patch [3/4] will not be needed any more as it refines a commit that is going to be reverted. > * v3 > - Dropped patch 'x86/umwait: Increase tpause and umwait quanta' after, as > suggested by Andy Lutomirski. > - Followed Peter Zijlstra's suggestion and removed explicit 'umwait' > deadline. Rely on the global implicit deadline instead. > - Rebased on top of Arjan's patches. > - C0.2 was tested in a VM by Arjan van de Ven. > - Re-measured on 2S and 4S Sapphire Rapids Xeon. > * v2 > - Do not mix 'raw_local_irq_enable()' and 'local_irq_disable()'. I failed to > directly verify it, but I believe it'll address the '.noinstr.text' warning. > - Minor kerneldoc commentary fix. > > C0.2 vs POLL latency and power > ------------------------------ > > I compared POLL to C0.2 using 'wult' tool (https://github.com/intel/wult), > which measures idle states latency. > > * In "POLL" experiments, all C-states except for POLL were disabled. > * In "C0.2" experiments, all C-states except for POLL and C0.2 were disabled. > > Here are the measurement results. The numbers are percent change from POLL to > C0.2. > > -----------|-----------|----------|----------- > Median IR | 99th % IR | AC Power | RAPL power > -----------|-----------|----------|----------- > 24% | 12% | -13% | -18% > -----------------------|----------|----------- > > * IR stands for interrupt latency. The table provides the median and 99th > percentile. Wult measures it as the delay between the moment a timer > interrupt fires to the moment the CPU reaches the interrupt handler. > * AC Power is the wall socket AC power. > * RAPL power is the CPU package power, measured using the 'turbostat' tool. > > Hackbench measurements > ---------------------- > > I ran the 'hackbench' benchmark using the following commands: > > # 4 groups, 200 threads > hackbench -s 128 -l 100000000 -g4 -f 25 -P > # 8 groups, 400 threads. > hackbench -s 128 -l 100000000 -g8 -f 25 -P > > My SPR system has 224 CPUs, so the first command did not use all CPUs, the > second command used all of them. However, in both cases CPU power reached TDP. > > I ran hackbench 5 times for every configuration and compared hackbench "score" > averages. > > In case of 4 groups, C0.2 improved the score by about 4%, and in case of 8 > groups by about 0.6%. > > Q&A > --- > > 1. Can C0.2 be disabled? > > C0.2 can be disabled via sysfs and with the following kernel boot option: > > intel_idle.states_off=2 > > 2. Why C0.2, not C0.1? > > I measured both C0.1 and C0.2, but did not notice a clear C0.1 advantage in > terms of latency, but did notice that C0.2 saves more power. > > But if users want to try using C0.1 instead of C0.2, they can do this: > > echo 0 > /sys/devices/system/cpu/umwait_control/enable_c02 > > This will make sure that C0.2 requests from 'intel_idle' are automatically > converted to C0.1 requests. > > 3. How did you verify that system enters C0.2? > > I used 'perf' to read the corresponding PMU counters: > > perf stat -e CPU_CLK_UNHALTED.C01,CPU_CLK_UNHALTED.C02,cycles -a sleep 1 > > 4. Ho to change the global explicit 'umwait' deadline? > > Via '/sys/devices/system/cpu/umwait_control/max_time' > > Artem Bityutskiy (4): > x86/umwait: use 'IS_ENABLED()' > x86/mwait: Add support for idle via umwait > intel_idle: rename 'intel_idle_hlt_irq_on()' > intel_idle: add C0.2 state for Sapphire Rapids Xeon > > arch/x86/include/asm/mwait.h | 85 ++++++++++++++++++++++++++++++++---- > drivers/idle/intel_idle.c | 52 +++++++++++++++++++--- > 2 files changed, 123 insertions(+), 14 deletions(-) > > -- > 2.40.1 >
Hi, On Mon, 2023-07-10 at 12:30 +0300, Artem Bityutskiy wrote: > Artem Bityutskiy (4): > x86/umwait: use 'IS_ENABLED()' > x86/mwait: Add support for idle via umwait If these 2 patches are OK now, is there a chance to get them merged? > intel_idle: rename 'intel_idle_hlt_irq_on()' > intel_idle: add C0.2 state for Sapphire Rapids Xeon The intel_idle part requires a refresh, but I thought I'd do that if/when the x86 bits are merged. Thanks!
On Mon, 2023-08-28 at 19:43 +0300, Artem Bityutskiy wrote: > Hi, > > On Mon, 2023-07-10 at 12:30 +0300, Artem Bityutskiy wrote: > > > Artem Bityutskiy (4): > > x86/umwait: use 'IS_ENABLED()' > > x86/mwait: Add support for idle via umwait > > If these 2 patches are OK now, is there a chance to get them merged? Hi, I wonder if these 2 patches have any chances to get merged? May be there is something I am expected to do have not done? If so, please, accept my apologies and give me a direction. Thanks!
On Wed, Sep 13, 2023 at 1:37 PM Artem Bityutskiy <dedekind1@gmail.com> wrote: > > On Mon, 2023-08-28 at 19:43 +0300, Artem Bityutskiy wrote: > > Hi, > > > > On Mon, 2023-07-10 at 12:30 +0300, Artem Bityutskiy wrote: > > > > > Artem Bityutskiy (4): > > > x86/umwait: use 'IS_ENABLED()' > > > x86/mwait: Add support for idle via umwait > > > > If these 2 patches are OK now, is there a chance to get them merged? > > Hi, I wonder if these 2 patches have any chances to get merged? > > May be there is something I am expected to do have not done? If so, please, > accept my apologies and give me a direction. I think that they were based on the Arjan\s changes that got dropped. If so, they need to be rebased and resent.
On Wed, 2023-09-13 at 14:34 +0200, Rafael J. Wysocki wrote: > On Wed, Sep 13, 2023 at 1:37 PM Artem Bityutskiy <dedekind1@gmail.com> wrote: > > > > On Mon, 2023-08-28 at 19:43 +0300, Artem Bityutskiy wrote: > > > Hi, > > > > > > On Mon, 2023-07-10 at 12:30 +0300, Artem Bityutskiy wrote: > > > > > > > Artem Bityutskiy (4): > > > > x86/umwait: use 'IS_ENABLED()' > > > > x86/mwait: Add support for idle via umwait > > > > > > If these 2 patches are OK now, is there a chance to get them merged? > > > > Hi, I wonder if these 2 patches have any chances to get merged? > > > > May be there is something I am expected to do have not done? If so, please, > > accept my apologies and give me a direction. > > I think that they were based on the Arjan\s changes that got dropped. > If so, they need to be rebased and resent. The above quoted 2 patches are x86 bits, they introduce the basic umwait primitives. They are not based on Arjan's patches. They still apply cleanly, and do not need a refresh, as far as I can see. I was hoping to get arch/x86 bits merged. Then I would send the refreshed version of the intel_idle patches. Thanks, Artem.
On Wed, Sep 13, 2023 at 2:49 PM Artem Bityutskiy <dedekind1@gmail.com> wrote: > > On Wed, 2023-09-13 at 14:34 +0200, Rafael J. Wysocki wrote: > > On Wed, Sep 13, 2023 at 1:37 PM Artem Bityutskiy <dedekind1@gmail.com> wrote: > > > > > > On Mon, 2023-08-28 at 19:43 +0300, Artem Bityutskiy wrote: > > > > Hi, > > > > > > > > On Mon, 2023-07-10 at 12:30 +0300, Artem Bityutskiy wrote: > > > > > > > > > Artem Bityutskiy (4): > > > > > x86/umwait: use 'IS_ENABLED()' > > > > > x86/mwait: Add support for idle via umwait > > > > > > > > If these 2 patches are OK now, is there a chance to get them merged? > > > > > > Hi, I wonder if these 2 patches have any chances to get merged? > > > > > > May be there is something I am expected to do have not done? If so, please, > > > accept my apologies and give me a direction. > > > > I think that they were based on the Arjan\s changes that got dropped. > > If so, they need to be rebased and resent. > > The above quoted 2 patches are x86 bits, they introduce the basic umwait > primitives. They are not based on Arjan's patches. They still apply cleanly, and > do not need a refresh, as far as I can see. > > I was hoping to get arch/x86 bits merged. This is up to the x86 maintainers, then. > Then I would send the refreshed version of the intel_idle patches. Well, if the x86 changes are only needed because of the subsequent intel_idle changes, it may be still better to send a new version of the whole patchset.
From: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Background ---------- Idle states reduce power consumption when a CPU has no work to do. The most shallow CPU idle state is "POLL". It has lowest wake up latency, but saves little power. The next idle state on Intel platforms is "C1". It has has higher latency, but saves more power than "POLL". Sapphire Rapids Xeons add new C0.1 and C0.2 (later C0.x) idle states which conceptually sit between "POLL" and "C1". These provide a very attractive midpoint: near-POLL wake-up latency and power consumption halfway between "POLL" and "C1". In other words, the expectation is that most latency-sensitive users will prefer C0.x over POLL. Enable C0.2 idle state support on Sapphire Rapids Xeon (later - SPR) by adding it between POLL and C1. Base commit ----------- Based on the "linux-next" branch of "linux-pm" git tree. base-commit: bd9bb08847da3b1eba2ea8cebf514d9287e7f4fb Changelog --------- * v4: - Address issues pointed out by Thomas Gleixner. . mwait.h: use 'IS_ENABLED()' instead of '#ifdef'. . mwait.h: use '__always_inline'. . mwait.h: use inline stub instead for macro for "!CONFIG_X86_64" case. . mwait.h: use proper commentaries for '#endif' and '#else'. . mwait.h: tested with llvm/clang. . Use imperative form (removed "this patch"). - intel_idle: rename 'intel_idle_hlt_irq_on()' for consistency. * v3 - Dropped patch 'x86/umwait: Increase tpause and umwait quanta' after, as suggested by Andy Lutomirski. - Followed Peter Zijlstra's suggestion and removed explicit 'umwait' deadline. Rely on the global implicit deadline instead. - Rebased on top of Arjan's patches. - C0.2 was tested in a VM by Arjan van de Ven. - Re-measured on 2S and 4S Sapphire Rapids Xeon. * v2 - Do not mix 'raw_local_irq_enable()' and 'local_irq_disable()'. I failed to directly verify it, but I believe it'll address the '.noinstr.text' warning. - Minor kerneldoc commentary fix. C0.2 vs POLL latency and power ------------------------------ I compared POLL to C0.2 using 'wult' tool (https://github.com/intel/wult), which measures idle states latency. * In "POLL" experiments, all C-states except for POLL were disabled. * In "C0.2" experiments, all C-states except for POLL and C0.2 were disabled. Here are the measurement results. The numbers are percent change from POLL to C0.2. -----------|-----------|----------|----------- Median IR | 99th % IR | AC Power | RAPL power -----------|-----------|----------|----------- 24% | 12% | -13% | -18% -----------------------|----------|----------- * IR stands for interrupt latency. The table provides the median and 99th percentile. Wult measures it as the delay between the moment a timer interrupt fires to the moment the CPU reaches the interrupt handler. * AC Power is the wall socket AC power. * RAPL power is the CPU package power, measured using the 'turbostat' tool. Hackbench measurements ---------------------- I ran the 'hackbench' benchmark using the following commands: # 4 groups, 200 threads hackbench -s 128 -l 100000000 -g4 -f 25 -P # 8 groups, 400 threads. hackbench -s 128 -l 100000000 -g8 -f 25 -P My SPR system has 224 CPUs, so the first command did not use all CPUs, the second command used all of them. However, in both cases CPU power reached TDP. I ran hackbench 5 times for every configuration and compared hackbench "score" averages. In case of 4 groups, C0.2 improved the score by about 4%, and in case of 8 groups by about 0.6%. Q&A --- 1. Can C0.2 be disabled? C0.2 can be disabled via sysfs and with the following kernel boot option: intel_idle.states_off=2 2. Why C0.2, not C0.1? I measured both C0.1 and C0.2, but did not notice a clear C0.1 advantage in terms of latency, but did notice that C0.2 saves more power. But if users want to try using C0.1 instead of C0.2, they can do this: echo 0 > /sys/devices/system/cpu/umwait_control/enable_c02 This will make sure that C0.2 requests from 'intel_idle' are automatically converted to C0.1 requests. 3. How did you verify that system enters C0.2? I used 'perf' to read the corresponding PMU counters: perf stat -e CPU_CLK_UNHALTED.C01,CPU_CLK_UNHALTED.C02,cycles -a sleep 1 4. Ho to change the global explicit 'umwait' deadline? Via '/sys/devices/system/cpu/umwait_control/max_time' Artem Bityutskiy (4): x86/umwait: use 'IS_ENABLED()' x86/mwait: Add support for idle via umwait intel_idle: rename 'intel_idle_hlt_irq_on()' intel_idle: add C0.2 state for Sapphire Rapids Xeon arch/x86/include/asm/mwait.h | 85 ++++++++++++++++++++++++++++++++---- drivers/idle/intel_idle.c | 52 +++++++++++++++++++--- 2 files changed, 123 insertions(+), 14 deletions(-)