Message ID | a0b3c3f6-2abc-353b-92f9-367fa57af8ef@suse.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | x86/mwait-idle: updates from Linux (and more) | expand |
On Thu, Jan 20, 2022 at 03:04:39PM +0100, Jan Beulich wrote: > From: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> > > Enable local interrupts before requesting C1 on the last two generations > of Intel Xeon platforms: Sky Lake, Cascade Lake, Cooper Lake, Ice Lake. > This decreases average C1 interrupt latency by about 5-10%, as measured > with the 'wult' tool. > > The '->enter()' function of the driver enters C-states with local > interrupts disabled by executing the 'monitor' and 'mwait' pair of > instructions. If an interrupt happens, the CPU exits the C-state and > continues executing instructions after 'mwait'. It does not jump to > the interrupt handler, because local interrupts are disabled. The > cpuidle subsystem enables interrupts a bit later, after doing some > housekeeping. > > With this patch, we enable local interrupts before requesting C1. In > this case, if the CPU wakes up because of an interrupt, it will jump > to the interrupt handler right away. The cpuidle housekeeping will be > done after the pending interrupt(s) are handled. > > Enabling interrupts before entering a C-state has measurable impact > for faster C-states, like C1. Deeper, but slower C-states like C6 do > not really benefit from this sort of change, because their latency is > a lot higher comparing to the delay added by cpuidle housekeeping. > > This change was also tested with cyclictest and dbench. In case of Ice > Lake, the average cyclictest latency decreased by 5.1%, and the average > 'dbench' throughput increased by about 0.8%. Both tests were run for 4 > hours with only C1 enabled (all other idle states, including 'POLL', > were disabled). CPU frequency was pinned to HFM, and uncore frequency > was pinned to the maximum value. The other platforms had similar > single-digit percentage improvements. > > It is worth noting that this patch affects 'cpuidle' statistics a tiny > bit. Before this patch, C1 residency did not include the interrupt > handling time, but with this patch, it will include it. This is similar > to what happens in case of the 'POLL' state, which also runs with > interrupts enabled. > > Suggested-by: Len Brown <len.brown@intel.com> > Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> > [Linux commit: c227233ad64c77e57db738ab0e46439db71822a3] > > We don't have a pointer into cpuidle_state_table[] readily available. > To compensate, make use of the new flag only appearing for C1 and hence > only in the first table entry. We could likely use the 8 padding bits in acpi_processor_cx between entry_method and method to store a flags field? It would seems more resilient, as I guess at some point we could enable interrupts for further states? > > Unlike Linux we want to disable IRQs again after MWAITing, as > subsequently invoked functions assume so. I'm also wondering whether there could be interrupts that rely on some of the housekeeping that's done when returning from mwait. I guess it's unlikely for an interrupt handler to have anything to do with the TSC not having been restored. > > Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> I think it's important to keep in sync with our upstream that's Linux there. Albeit I would prefer if we could carry the flags into acpi_processor_cx. Thanks, Roger.
On 20.01.2022 16:52, Roger Pau Monné wrote: > On Thu, Jan 20, 2022 at 03:04:39PM +0100, Jan Beulich wrote: >> From: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> >> >> Enable local interrupts before requesting C1 on the last two generations >> of Intel Xeon platforms: Sky Lake, Cascade Lake, Cooper Lake, Ice Lake. >> This decreases average C1 interrupt latency by about 5-10%, as measured >> with the 'wult' tool. >> >> The '->enter()' function of the driver enters C-states with local >> interrupts disabled by executing the 'monitor' and 'mwait' pair of >> instructions. If an interrupt happens, the CPU exits the C-state and >> continues executing instructions after 'mwait'. It does not jump to >> the interrupt handler, because local interrupts are disabled. The >> cpuidle subsystem enables interrupts a bit later, after doing some >> housekeeping. >> >> With this patch, we enable local interrupts before requesting C1. In >> this case, if the CPU wakes up because of an interrupt, it will jump >> to the interrupt handler right away. The cpuidle housekeeping will be >> done after the pending interrupt(s) are handled. >> >> Enabling interrupts before entering a C-state has measurable impact >> for faster C-states, like C1. Deeper, but slower C-states like C6 do >> not really benefit from this sort of change, because their latency is >> a lot higher comparing to the delay added by cpuidle housekeeping. >> >> This change was also tested with cyclictest and dbench. In case of Ice >> Lake, the average cyclictest latency decreased by 5.1%, and the average >> 'dbench' throughput increased by about 0.8%. Both tests were run for 4 >> hours with only C1 enabled (all other idle states, including 'POLL', >> were disabled). CPU frequency was pinned to HFM, and uncore frequency >> was pinned to the maximum value. The other platforms had similar >> single-digit percentage improvements. >> >> It is worth noting that this patch affects 'cpuidle' statistics a tiny >> bit. Before this patch, C1 residency did not include the interrupt >> handling time, but with this patch, it will include it. This is similar >> to what happens in case of the 'POLL' state, which also runs with >> interrupts enabled. >> >> Suggested-by: Len Brown <len.brown@intel.com> >> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> >> [Linux commit: c227233ad64c77e57db738ab0e46439db71822a3] >> >> We don't have a pointer into cpuidle_state_table[] readily available. >> To compensate, make use of the new flag only appearing for C1 and hence >> only in the first table entry. > > We could likely use the 8 padding bits in acpi_processor_cx between > entry_method and method to store a flags field? When looking there initially, I thought it wouldn't be good to add field there. But looking again, I now don't see why I thought what I thought. (We actually have an 8-bit padding slot there and a 32-bit one.) > It would seems more resilient, as I guess at some point we could > enable interrupts for further states? C1E maybe; I don't think anything beyond, for having higher wakeup latency anyway. >> Unlike Linux we want to disable IRQs again after MWAITing, as >> subsequently invoked functions assume so. > > I'm also wondering whether there could be interrupts that rely on some > of the housekeeping that's done when returning from mwait. I guess > it's unlikely for an interrupt handler to have anything to do with the > TSC not having been restored. Actually this is a good point you make: We don't want to enable IRQs when cstate_restore_tsc() is not a no-op, or else we might confuse the time rendezvous. (I thought that I would remember TSC to stop only in deeper C-states, but maybe I'm mixing this up with the APIC timer.) >> Signed-off-by: Jan Beulich <jbeulich@suse.com> > > Acked-by: Roger Pau Monné <roger.pau@citrix.com> Thanks, but as per above - not just yet. Jan
On Mon, Jan 24, 2022 at 03:44:53PM +0100, Jan Beulich wrote: > On 20.01.2022 16:52, Roger Pau Monné wrote: > > On Thu, Jan 20, 2022 at 03:04:39PM +0100, Jan Beulich wrote: > >> From: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> > >> Unlike Linux we want to disable IRQs again after MWAITing, as > >> subsequently invoked functions assume so. > > > > I'm also wondering whether there could be interrupts that rely on some > > of the housekeeping that's done when returning from mwait. I guess > > it's unlikely for an interrupt handler to have anything to do with the > > TSC not having been restored. > > Actually this is a good point you make: We don't want to enable > IRQs when cstate_restore_tsc() is not a no-op, or else we might > confuse the time rendezvous. (I thought that I would remember > TSC to stop only in deeper C-states, but maybe I'm mixing this up > with the APIC timer.) There's a comment in time.c that mentions the TSC only stopping in 'deep C states'. Also note that in that case the rendezvous function already updates the TSC, so I'm not sure whether calling it with an out of date TSC would be harmful - it will be updated anyway to match the master TSC. Might be safer to disable interrupts unconditionally on CPUs that don't have a non-stop TSC just to be on the safe side. Thanks, Roger.
--- a/xen/arch/x86/cpu/mwait-idle.c +++ b/xen/arch/x86/cpu/mwait-idle.c @@ -108,6 +108,11 @@ static const struct cpuidle_state { #define CPUIDLE_FLAG_DISABLED 0x1 /* + * Enable interrupts before entering the C-state. On some platforms and for + * some C-states, this may measurably decrease interrupt latency. + */ +#define CPUIDLE_FLAG_IRQ_ENABLE 0x8000 +/* * Set this flag for states where the HW flushes the TLB for us * and so we don't need cross-calls to keep it consistent. * If this flag is set, SW flushes the TLB, so even if the @@ -539,7 +544,7 @@ static struct cpuidle_state __read_mostl static struct cpuidle_state __read_mostly skx_cstates[] = { { .name = "C1", - .flags = MWAIT2flg(0x00), + .flags = MWAIT2flg(0x00) | CPUIDLE_FLAG_IRQ_ENABLE, .exit_latency = 2, .target_residency = 2, }, @@ -561,7 +566,7 @@ static struct cpuidle_state __read_mostl static const struct cpuidle_state icx_cstates[] = { { .name = "C1", - .flags = MWAIT2flg(0x00), + .flags = MWAIT2flg(0x00) | CPUIDLE_FLAG_IRQ_ENABLE, .exit_latency = 1, .target_residency = 1, }, @@ -776,6 +781,7 @@ static void mwait_idle(void) unsigned int next_state; u64 before, after; u32 exp = 0, pred = 0, irq_traced[4] = { 0 }; + bool irq_enable_early = false; if (max_cstate > 0 && power && (next_state = cpuidle_current_governor->select(power)) > 0) { @@ -806,6 +812,12 @@ static void mwait_idle(void) return; } + if (cx->idx == 1 && cx->type == ACPI_STATE_C1 && + (cpuidle_state_table[0].flags & CPUIDLE_FLAG_IRQ_ENABLE)) { + ASSERT(cx->address == flg2MWAIT(cpuidle_state_table[0].flags)); + irq_enable_early = true; + } + cpufreq_dbs_timer_suspend(); rcu_idle_enter(cpu); @@ -842,9 +854,15 @@ static void mwait_idle(void) update_last_cx_stat(power, cx, before); - if (cpu_is_haltable(cpu)) + if (cpu_is_haltable(cpu)) { + if (irq_enable_early) + local_irq_enable(); + mwait_idle_with_hints(cx->address, MWAIT_ECX_INTERRUPT_BREAK); + local_irq_disable(); + } + after = alternative_call(cpuidle_get_tick); cstate_restore_tsc();