Message ID | 20230620180354.1020392-1-arjan@linux.intel.com (mailing list archive) |
---|---|
State | Mainlined, archived |
Series | intel_idle: Add a "Long HLT" C1 state for the VM guest mode |
On Tue, Jun 20, 2023 at 8:04 PM <arjan@linux.intel.com> wrote:
>
> From: Arjan van de Ven <arjan@linux.intel.com>
>
> intel_idle will, for the bare metal case, usually have one or more deep
> power states that have the CPUIDLE_FLAG_TLB_FLUSHED flag set. When
> a state with this flag is selected by the cpuidle framework, it will also
> flush the TLBs as part of entering this state. The benefit of doing this is
> that the kernel does not need to wake the cpu out of this deep power state
> just to flush the TLBs... for which the latency can be very high due to
> the exit latency of deep power states.
>
> In a VM guest currently, this benefit of avoiding the wakeup does not exist,
> while the problem (long exit latency) is even more severe. Linux will need
> to wake up a vCPU (causing the host to either come out of a deep C state,
> or the VMM to have to deschedule something else to schedule the vCPU), which
> can take a very long time, adding a lot of latency to TLB flush operations
> (including munmap and others).
>
> To solve this, add a "Long HLT" C state to the state table for the VM guest
> case that has the CPUIDLE_FLAG_TLB_FLUSHED flag set. The result of that is
> that for long idle periods (where the VMM is likely to do things that cause
> large latency) the cpuidle framework will flush the TLBs (and avoid the
> wakeups), while for short/quick idle durations the existing behavior is
> retained.
>
> Now, there is still only "hlt" available in the guest, but for long idle,
> the host can go to a deeper state (say C6). There is a reasonable debate
> one can have as to what to set for the exit_latency and break even point
> for this "Long HLT" state. The good news is that intel_idle has these values
> available for the underlying CPU (even when mwait is not exposed). The
> solution thus is to just use the latency and break even of the deepest state
> from the bare metal CPU. This is under the assumption that this is a pretty
> reasonable estimate of what the VMM would do to cause latency.
>
> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
> ---
>  drivers/idle/intel_idle.c | 55 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 55 insertions(+)
>
> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
> index a80e1f520293..34201d7ef33e 100644
> --- a/drivers/idle/intel_idle.c
> +++ b/drivers/idle/intel_idle.c
> @@ -1287,6 +1287,13 @@ static struct cpuidle_state vmguest_cstates[] __initdata = {
>                 .exit_latency = 5,
>                 .target_residency = 10,
>                 .enter = &intel_idle_hlt, },
> +       {
> +               .name = "C1L",
> +               .desc = "Long HLT",
> +               .flags = MWAIT2flg(0x00) | CPUIDLE_FLAG_TLB_FLUSHED,
> +               .exit_latency = 5,
> +               .target_residency = 200,
> +               .enter = &intel_idle_hlt, },
>         {
>                 .enter = NULL }
>  };
> @@ -2130,6 +2137,45 @@ static void __init intel_idle_cpuidle_devices_uninit(void)
>                 cpuidle_unregister_device(per_cpu_ptr(intel_idle_cpuidle_devices, i));
>  }
>
> +/*
> + * Match up the latency and break even point of the bare metal (cpu based)
> + * states with the deepest VM available state.
> + *
> + * We only want to do this for the deepest state, the ones that has
> + * the TLB_FLUSHED flag set on the .
> + *
> + * All our short idle states are dominated by vmexit/vmenter latencies,
> + * not the underlying hardware latencies so we keep our values for these.
> + */
> +static void matchup_vm_state_with_baremetal(void)
> +{
> +       int cstate;
> +
> +       for (cstate = 0; cstate < CPUIDLE_STATE_MAX; ++cstate) {
> +               int matching_cstate;
> +
> +               if (intel_idle_max_cstate_reached(cstate))
> +                       break;
> +
> +               if (!cpuidle_state_table[cstate].enter)
> +                       break;
> +
> +               if (!(cpuidle_state_table[cstate].flags & CPUIDLE_FLAG_TLB_FLUSHED))
> +                       continue;
> +
> +               for (matching_cstate = 0; matching_cstate < CPUIDLE_STATE_MAX; ++matching_cstate) {
> +                       if (!icpu->state_table[matching_cstate].enter)
> +                               break;
> +                       if (icpu->state_table[matching_cstate].exit_latency > cpuidle_state_table[cstate].exit_latency) {
> +                               cpuidle_state_table[cstate].exit_latency = icpu->state_table[matching_cstate].exit_latency;
> +                               cpuidle_state_table[cstate].target_residency = icpu->state_table[matching_cstate].target_residency;
> +                       }
> +               }
> +
> +       }
> +}
> +
> +
>  static int __init intel_idle_vminit(const struct x86_cpu_id *id)
>  {
>         int retval;
> @@ -2145,6 +2191,15 @@ static int __init intel_idle_vminit(const struct x86_cpu_id *id)
>         if (!intel_idle_cpuidle_devices)
>                 return -ENOMEM;
>
> +       /*
> +        * We don't know exactly what the host will do when we go idle, but as a worst estimate
> +        * we can assume that the exit latency of the deepest host state will be hit for our
> +        * deep (long duration) guest idle state.
> +        * The same logic applies to the break even point for the long duration guest idle state.
> +        * So lets copy these two properties from the table we found for the host CPU type.
> +        */
> +       matchup_vm_state_with_baremetal();
> +
>         intel_idle_cpuidle_driver_init(&intel_idle_driver);
>
>         retval = cpuidle_register_driver(&intel_idle_driver);
> --

Applied as 6.5 material, thanks!
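Editor's note: the new state only changes behaviour for long predicted idle periods because a cpuidle governor, in essence, picks the deepest state whose target_residency fits the predicted idle time and whose exit_latency fits the latency limit. The sketch below is a minimal userspace illustration of that selection rule, not the actual menu/teo governor code; the names (vm_cstate, pick_state, flushes_tlb) are illustration-only, and the C1/C1L numbers are the ones from vmguest_cstates[] before matchup_vm_state_with_baremetal() possibly raises the C1L values.

```c
/*
 * Simplified sketch of governor-style state selection: short guest
 * idles keep using plain C1, only long idles use the TLB-flushing
 * "Long HLT" (C1L) state added by the patch.
 */
#include <stdio.h>

struct vm_cstate {
	const char *name;
	unsigned int exit_latency_us;
	unsigned int target_residency_us;
	int flushes_tlb;	/* models CPUIDLE_FLAG_TLB_FLUSHED */
};

static const struct vm_cstate states[] = {
	{ "C1",  5,  10, 0 },	/* plain HLT */
	{ "C1L", 5, 200, 1 },	/* "Long HLT" from the patch */
};

/* Return the index of the deepest state that fits, or -1 if none does. */
static int pick_state(unsigned int predicted_us, unsigned int latency_limit_us)
{
	int i, best = -1;

	for (i = 0; i < (int)(sizeof(states) / sizeof(states[0])); i++) {
		if (states[i].target_residency_us > predicted_us)
			continue;
		if (states[i].exit_latency_us > latency_limit_us)
			continue;
		best = i;	/* deeper state that still fits */
	}
	return best;
}

int main(void)
{
	unsigned int predicted[] = { 15, 50, 500, 5000 };	/* us */

	for (int i = 0; i < 4; i++) {
		int s = pick_state(predicted[i], ~0U);

		printf("predicted %4u us -> %s%s\n", predicted[i],
		       s < 0 ? "none" : states[s].name,
		       s >= 0 && states[s].flushes_tlb ?
		       " (TLB flushed on entry)" : "");
	}
	return 0;
}
```

With these numbers, 15 us and 50 us of predicted idle stay in C1, while 500 us and 5000 us land in C1L, which is exactly the "flush only for long idle" split described in the changelog.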
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index a80e1f520293..34201d7ef33e 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -1287,6 +1287,13 @@ static struct cpuidle_state vmguest_cstates[] __initdata = {
                .exit_latency = 5,
                .target_residency = 10,
                .enter = &intel_idle_hlt, },
+       {
+               .name = "C1L",
+               .desc = "Long HLT",
+               .flags = MWAIT2flg(0x00) | CPUIDLE_FLAG_TLB_FLUSHED,
+               .exit_latency = 5,
+               .target_residency = 200,
+               .enter = &intel_idle_hlt, },
        {
                .enter = NULL }
 };
@@ -2130,6 +2137,45 @@ static void __init intel_idle_cpuidle_devices_uninit(void)
                cpuidle_unregister_device(per_cpu_ptr(intel_idle_cpuidle_devices, i));
 }
 
+/*
+ * Match up the latency and break even point of the bare metal (cpu based)
+ * states with the deepest VM available state.
+ *
+ * We only want to do this for the deepest state, the ones that has
+ * the TLB_FLUSHED flag set on the .
+ *
+ * All our short idle states are dominated by vmexit/vmenter latencies,
+ * not the underlying hardware latencies so we keep our values for these.
+ */
+static void matchup_vm_state_with_baremetal(void)
+{
+       int cstate;
+
+       for (cstate = 0; cstate < CPUIDLE_STATE_MAX; ++cstate) {
+               int matching_cstate;
+
+               if (intel_idle_max_cstate_reached(cstate))
+                       break;
+
+               if (!cpuidle_state_table[cstate].enter)
+                       break;
+
+               if (!(cpuidle_state_table[cstate].flags & CPUIDLE_FLAG_TLB_FLUSHED))
+                       continue;
+
+               for (matching_cstate = 0; matching_cstate < CPUIDLE_STATE_MAX; ++matching_cstate) {
+                       if (!icpu->state_table[matching_cstate].enter)
+                               break;
+                       if (icpu->state_table[matching_cstate].exit_latency > cpuidle_state_table[cstate].exit_latency) {
+                               cpuidle_state_table[cstate].exit_latency = icpu->state_table[matching_cstate].exit_latency;
+                               cpuidle_state_table[cstate].target_residency = icpu->state_table[matching_cstate].target_residency;
+                       }
+               }
+
+       }
+}
+
+
 static int __init intel_idle_vminit(const struct x86_cpu_id *id)
 {
        int retval;
@@ -2145,6 +2191,15 @@ static int __init intel_idle_vminit(const struct x86_cpu_id *id)
        if (!intel_idle_cpuidle_devices)
                return -ENOMEM;
 
+       /*
+        * We don't know exactly what the host will do when we go idle, but as a worst estimate
+        * we can assume that the exit latency of the deepest host state will be hit for our
+        * deep (long duration) guest idle state.
+        * The same logic applies to the break even point for the long duration guest idle state.
+        * So lets copy these two properties from the table we found for the host CPU type.
+        */
+       matchup_vm_state_with_baremetal();
+
        intel_idle_cpuidle_driver_init(&intel_idle_driver);
 
        retval = cpuidle_register_driver(&intel_idle_driver);
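Editor's note: the effect of matchup_vm_state_with_baremetal() is easy to reproduce outside the kernel. The standalone sketch below assumes a made-up host state table (the C6 numbers are placeholders, not real silicon values) and illustration-only names (guest_table, host_table, FLAG_TLB_FLUSHED, matchup); it only mirrors the copy rule from the patch, it is not the driver code.

```c
/*
 * Userspace mock of the matchup rule: the TLB_FLUSHED guest state
 * inherits exit_latency/target_residency from whichever "host" state
 * has the largest exit_latency.
 */
#include <stdio.h>

#define STATE_MAX 8
#define FLAG_TLB_FLUSHED 0x1

struct cstate {
	const char *name;
	unsigned int flags;
	unsigned int exit_latency;      /* us */
	unsigned int target_residency;  /* us */
	int valid;                      /* stands in for .enter != NULL */
};

static struct cstate guest_table[STATE_MAX] = {
	{ "C1",  0,                5,  10, 1 },
	{ "C1L", FLAG_TLB_FLUSHED, 5, 200, 1 },
};

static const struct cstate host_table[STATE_MAX] = {	/* made-up values */
	{ "C1",  0,   2,   2, 1 },
	{ "C1E", 0,   4,   8, 1 },
	{ "C6",  0, 290, 800, 1 },
};

static void matchup(void)
{
	for (int g = 0; g < STATE_MAX && guest_table[g].valid; g++) {
		if (!(guest_table[g].flags & FLAG_TLB_FLUSHED))
			continue;

		for (int h = 0; h < STATE_MAX && host_table[h].valid; h++) {
			/* same rule as the patch: only ever raise the numbers */
			if (host_table[h].exit_latency > guest_table[g].exit_latency) {
				guest_table[g].exit_latency = host_table[h].exit_latency;
				guest_table[g].target_residency = host_table[h].target_residency;
			}
		}
	}
}

int main(void)
{
	matchup();
	for (int g = 0; g < STATE_MAX && guest_table[g].valid; g++)
		printf("%-4s exit_latency=%u us target_residency=%u us\n",
		       guest_table[g].name, guest_table[g].exit_latency,
		       guest_table[g].target_residency);
	/* C1 stays at 5/10; C1L ends up with 290/800, copied from the made-up C6. */
	return 0;
}
```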