Message ID | 20170930005046-mutt-send-email-mst@kernel.org (mailing list archive)
---|---
State | New, archived
On Sat, Sep 30, 2017 at 12:01 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> intel idle driver does not DTRT when running within a VM:
> when going into a deep power state, the right thing to
> do is to exit to hypervisor rather than to keep polling
> within guest using mwait.
>
> Currently the solution is just to exit to hypervisor each time we go
> idle - this is why kvm does not expose the mwait leaf to guests even
> when it allows guests to do mwait.
>
> But that's not ideal - it seems better to use the idle driver to
> guess when the next interrupt will arrive.

The idle driver alone is not sufficient for that, though.

Thanks,
Rafael
On Sat, 30 Sep 2017 01:21:43 +0200
"Rafael J. Wysocki" <rafael@kernel.org> wrote:

> On Sat, Sep 30, 2017 at 12:01 AM, Michael S. Tsirkin <mst@redhat.com>
> wrote:
> > intel idle driver does not DTRT when running within a VM:
> > when going into a deep power state, the right thing to
> > do is to exit to hypervisor rather than to keep polling
> > within guest using mwait.
> >
> > Currently the solution is just to exit to hypervisor each time we go
> > idle - this is why kvm does not expose the mwait leaf to guests even
> > when it allows guests to do mwait.
> >
> > But that's not ideal - it seems better to use the idle driver to
> > guess when the next interrupt will arrive.
>
> The idle driver alone is not sufficient for that, though.
>
I second that. Why try to solve this problem at the vendor-specific
driver level? Perhaps just a pv idle driver that decides whether to
vmexit based on something like local per-vCPU timer expiration? I guess
we can't predict other wake events such as interrupts, e.g.:

	if (get_next_timer_interrupt() > kvm_halt_target_residency)
		vmexit
	else
		poll

Jacob
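A minimal sketch of the pv idle decision described above, recast in cpuidle terms. All kvm_* names here are illustrative, not an existing API, and (as Thomas points out in the next message) get_next_timer_interrupt() is off limits to drivers, so this version leans on the target residency the governor has already attached to the chosen state:

#include <linux/cpuidle.h>
#include <asm/irqflags.h>	/* safe_halt() */
#include <asm/processor.h>	/* cpu_relax() */

/* Illustrative threshold, mirroring the pseudocode above (in us). */
static int kvm_halt_target_residency = 400;

static int kvm_pv_idle_enter(struct cpuidle_device *dev,
			     struct cpuidle_driver *drv, int index)
{
	/*
	 * The governor already estimated the idle duration when it
	 * picked this state, so the state's target_residency stands in
	 * for the "is the next timer far away?" test.
	 */
	if (drv->states[index].target_residency >= kvm_halt_target_residency)
		safe_halt();	/* HLT: vmexit, let the host run other work */
	else
		cpu_relax();	/* stay in the guest; a real driver would poll */
	return index;
}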
On Mon, 2 Oct 2017, Jacob Pan wrote:
> On Sat, 30 Sep 2017 01:21:43 +0200
> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>
> > On Sat, Sep 30, 2017 at 12:01 AM, Michael S. Tsirkin <mst@redhat.com>
> > wrote:
> > > intel idle driver does not DTRT when running within a VM:
> > > when going into a deep power state, the right thing to
> > > do is to exit to hypervisor rather than to keep polling
> > > within guest using mwait.
> > >
> > > Currently the solution is just to exit to hypervisor each time we go
> > > idle - this is why kvm does not expose the mwait leaf to guests even
> > > when it allows guests to do mwait.
> > >
> > > But that's not ideal - it seems better to use the idle driver to
> > > guess when the next interrupt will arrive.
> >
> > The idle driver alone is not sufficient for that, though.
> >
> I second that. Why try to solve this problem at the vendor-specific
> driver level? Perhaps just a pv idle driver that decides whether to
> vmexit based on something like local per-vCPU timer expiration? I guess
> we can't predict other wake events such as interrupts, e.g.:
> 	if (get_next_timer_interrupt() > kvm_halt_target_residency)

Bah, no. get_next_timer_interrupt() is not available for abuse in random
cpuidle driver code. It has state and it's tied to the nohz code.

There is the series from Aubrey which makes use of the various idle
prediction mechanisms - scheduler, irq timings, idle governor - to get an
idea about the estimated idle time. Exactly this information can be fed to
the kvm idle driver, which can act accordingly.

Hacking a random hardware-specific idle driver is definitely the wrong
approach. It might be useful to chain the kvm idle driver and
hardware-specific drivers at some point, i.e. if the kvm driver decides
not to exit, it delegates the mwait decision to the proper hardware driver
in order not to reimplement all the required logic again. But that's a
different story.

See http://lkml.kernel.org/r/1506756034-6340-1-git-send-email-aubrey.li@intel.com

Thanks,

	tglx
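To make the chaining idea concrete, a rough sketch under the assumption of a delegation hook that does not exist in the cpuidle core today; hw_drv and its use here are placeholders, and whether the core should grow such a mechanism is exactly the question raised in the follow-up below:

/* Sketch only: chain the kvm idle driver in front of the hardware driver. */
static struct cpuidle_driver *hw_drv;	/* e.g. intel_idle, found at init time */
static int kvm_halt_target_residency = 400;	/* same illustrative knob as above */

static int kvm_idle_enter(struct cpuidle_device *dev,
			  struct cpuidle_driver *drv, int index)
{
	if (drv->states[index].target_residency >= kvm_halt_target_residency) {
		safe_halt();	/* long idle expected: exit to the hypervisor */
		return index;
	}
	/* Short idle: delegate the mwait decision to the hardware driver. */
	return hw_drv->states[index].enter(dev, hw_drv, index);
}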
On Mon, Oct 02, 2017 at 10:12:49AM -0700, Jacob Pan wrote:
> On Sat, 30 Sep 2017 01:21:43 +0200
> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>
> > On Sat, Sep 30, 2017 at 12:01 AM, Michael S. Tsirkin <mst@redhat.com>
> > wrote:
> > > intel idle driver does not DTRT when running within a VM:
> > > when going into a deep power state, the right thing to
> > > do is to exit to hypervisor rather than to keep polling
> > > within guest using mwait.
> > >
> > > Currently the solution is just to exit to hypervisor each time we go
> > > idle - this is why kvm does not expose the mwait leaf to guests even
> > > when it allows guests to do mwait.
> > >
> > > But that's not ideal - it seems better to use the idle driver to
> > > guess when the next interrupt will arrive.
> >
> > The idle driver alone is not sufficient for that, though.
> >
> I second that. Why try to solve this problem at the vendor-specific
> driver level?

Well, we still want to e.g. mwait if possible - saves power.

> Perhaps just a pv idle driver that decides whether to vmexit
> based on something like local per-vCPU timer expiration? I guess we
> can't predict other wake events such as interrupts, e.g.:
> 	if (get_next_timer_interrupt() > kvm_halt_target_residency)
> 		vmexit
> 	else
> 		poll
>
> Jacob

It's not always a poll; on x86, putting the CPU in a low power state is
possible within a VM.

Does not seem possible on other CPUs - that's why it's vendor specific.
On Tue, Oct 03, 2017 at 11:02:55PM +0200, Thomas Gleixner wrote:
> On Mon, 2 Oct 2017, Jacob Pan wrote:
> > On Sat, 30 Sep 2017 01:21:43 +0200
> > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> >
> > > On Sat, Sep 30, 2017 at 12:01 AM, Michael S. Tsirkin <mst@redhat.com>
> > > wrote:
> > > > intel idle driver does not DTRT when running within a VM:
> > > > when going into a deep power state, the right thing to
> > > > do is to exit to hypervisor rather than to keep polling
> > > > within guest using mwait.
> > > >
> > > > Currently the solution is just to exit to hypervisor each time we go
> > > > idle - this is why kvm does not expose the mwait leaf to guests even
> > > > when it allows guests to do mwait.
> > > >
> > > > But that's not ideal - it seems better to use the idle driver to
> > > > guess when the next interrupt will arrive.
> > >
> > > The idle driver alone is not sufficient for that, though.
> > >
> > I second that. Why try to solve this problem at the vendor-specific
> > driver level? Perhaps just a pv idle driver that decides whether to
> > vmexit based on something like local per-vCPU timer expiration? I guess
> > we can't predict other wake events such as interrupts, e.g.:
> > 	if (get_next_timer_interrupt() > kvm_halt_target_residency)
>
> Bah, no. get_next_timer_interrupt() is not available for abuse in random
> cpuidle driver code. It has state and it's tied to the nohz code.
>
> There is the series from Aubrey which makes use of the various idle
> prediction mechanisms - scheduler, irq timings, idle governor - to get an
> idea about the estimated idle time. Exactly this information can be fed to
> the kvm idle driver, which can act accordingly.
>
> Hacking a random hardware-specific idle driver is definitely the wrong
> approach. It might be useful to chain the kvm idle driver and
> hardware-specific drivers at some point, i.e. if the kvm driver decides
> not to exit, it delegates the mwait decision to the proper hardware driver
> in order not to reimplement all the required logic again.

By making changes to the idle core to allow that chaining?
Does this sound like something reasonable?

> But that's a different story.
>
> See http://lkml.kernel.org/r/1506756034-6340-1-git-send-email-aubrey.li@intel.com

Will read that, thanks a lot.

> Thanks,
>
> 	tglx
On Wed, 4 Oct 2017, Michael S. Tsirkin wrote:
> On Tue, Oct 03, 2017 at 11:02:55PM +0200, Thomas Gleixner wrote:
> > There is the series from Aubrey which makes use of the various idle
> > prediction mechanisms - scheduler, irq timings, idle governor - to get
> > an idea about the estimated idle time. Exactly this information can be
> > fed to the kvm idle driver, which can act accordingly.
> >
> > Hacking a random hardware-specific idle driver is definitely the wrong
> > approach. It might be useful to chain the kvm idle driver and
> > hardware-specific drivers at some point, i.e. if the kvm driver decides
> > not to exit, it delegates the mwait decision to the proper hardware
> > driver in order not to reimplement all the required logic again.
>
> By making changes to the idle core to allow that chaining?
> Does this sound like something reasonable?

At least for me it makes sense to avoid code duplication. But that's up
to the cpuidle maintainers to decide in the end.

Thanks,

	tglx
On Wed, 4 Oct 2017 05:09:09 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Mon, Oct 02, 2017 at 10:12:49AM -0700, Jacob Pan wrote:
> > On Sat, 30 Sep 2017 01:21:43 +0200
> > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> >
> > > On Sat, Sep 30, 2017 at 12:01 AM, Michael S. Tsirkin
> > > <mst@redhat.com> wrote:
> > > > intel idle driver does not DTRT when running within a VM:
> > > > when going into a deep power state, the right thing to
> > > > do is to exit to hypervisor rather than to keep polling
> > > > within guest using mwait.
> > > >
> > > > Currently the solution is just to exit to hypervisor each time
> > > > we go idle - this is why kvm does not expose the mwait leaf to
> > > > guests even when it allows guests to do mwait.
> > > >
> > > > But that's not ideal - it seems better to use the idle driver to
> > > > guess when the next interrupt will arrive.
> > >
> > > The idle driver alone is not sufficient for that, though.
> > >
> > I second that. Why try to solve this problem at the vendor-specific
> > driver level?
>
> Well, we still want to e.g. mwait if possible - saves power.
>
> > Perhaps just a pv idle driver that decides whether to vmexit
> > based on something like local per-vCPU timer expiration? I guess we
> > can't predict other wake events such as interrupts, e.g.:
> > 	if (get_next_timer_interrupt() > kvm_halt_target_residency)
> > 		vmexit
> > 	else
> > 		poll
> >
> > Jacob
>
> It's not always a poll; on x86, putting the CPU in a low power state
> is possible within a VM.
>
Are you talking about using mwait/monitor in user space, which are
available on some Intel CPUs such as Xeon Phi? I guess if the guest
can identify the host CPU id, it is doable.

> Does not seem possible on other CPUs - that's why it's vendor
> specific.
>
[Jacob Pan]
On Wed, Oct 04, 2017 at 10:09:39AM -0700, Jacob Pan wrote:
> On Wed, 4 Oct 2017 05:09:09 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
> > On Mon, Oct 02, 2017 at 10:12:49AM -0700, Jacob Pan wrote:
> > > On Sat, 30 Sep 2017 01:21:43 +0200
> > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > >
> > > > On Sat, Sep 30, 2017 at 12:01 AM, Michael S. Tsirkin
> > > > <mst@redhat.com> wrote:
> > > > > intel idle driver does not DTRT when running within a VM:
> > > > > when going into a deep power state, the right thing to
> > > > > do is to exit to hypervisor rather than to keep polling
> > > > > within guest using mwait.
> > > > >
> > > > > Currently the solution is just to exit to hypervisor each time
> > > > > we go idle - this is why kvm does not expose the mwait leaf to
> > > > > guests even when it allows guests to do mwait.
> > > > >
> > > > > But that's not ideal - it seems better to use the idle driver
> > > > > to guess when the next interrupt will arrive.
> > > >
> > > > The idle driver alone is not sufficient for that, though.
> > > >
> > > I second that. Why try to solve this problem at the
> > > vendor-specific driver level?
> >
> > Well, we still want to e.g. mwait if possible - saves power.
> >
> > > Perhaps just a pv idle driver that decides whether to vmexit
> > > based on something like local per-vCPU timer expiration? I guess
> > > we can't predict other wake events such as interrupts, e.g.:
> > > 	if (get_next_timer_interrupt() > kvm_halt_target_residency)
> > > 		vmexit
> > > 	else
> > > 		poll
> > >
> > > Jacob
> >
> > It's not always a poll; on x86, putting the CPU in a low power state
> > is possible within a VM.
> >
> Are you talking about using mwait/monitor in user space, which are
> available on some Intel CPUs such as Xeon Phi? I guess if the guest
> can identify the host CPU id, it is doable.

Not really.

Please take a look at the patch in question - it does mwait in the
guest kernel, and there is no need to identify the host CPU id.

> > Does not seem possible on other CPUs - that's why it's vendor
> > specific.
>
> [Jacob Pan]
On Wed, 4 Oct 2017 20:12:28 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Oct 04, 2017 at 10:09:39AM -0700, Jacob Pan wrote:
> > On Wed, 4 Oct 2017 05:09:09 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >
> > > On Mon, Oct 02, 2017 at 10:12:49AM -0700, Jacob Pan wrote:
> > > > On Sat, 30 Sep 2017 01:21:43 +0200
> > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > >
> > > > > On Sat, Sep 30, 2017 at 12:01 AM, Michael S. Tsirkin
> > > > > <mst@redhat.com> wrote:
> > > > > > intel idle driver does not DTRT when running within a VM:
> > > > > > when going into a deep power state, the right thing to
> > > > > > do is to exit to hypervisor rather than to keep polling
> > > > > > within guest using mwait.
> > > > > >
> > > > > > Currently the solution is just to exit to hypervisor each
> > > > > > time we go idle - this is why kvm does not expose the mwait
> > > > > > leaf to guests even when it allows guests to do mwait.
> > > > > >
> > > > > > But that's not ideal - it seems better to use the idle
> > > > > > driver to guess when the next interrupt will arrive.
> > > > >
> > > > > The idle driver alone is not sufficient for that, though.
> > > > >
> > > > I second that. Why try to solve this problem at the
> > > > vendor-specific driver level?
> > >
> > > Well, we still want to e.g. mwait if possible - saves power.
> > >
> > > > Perhaps just a pv idle driver that decides whether to vmexit
> > > > based on something like local per-vCPU timer expiration? I
> > > > guess we can't predict other wake events such as interrupts,
> > > > e.g.:
> > > > 	if (get_next_timer_interrupt() > kvm_halt_target_residency)
> > > > 		vmexit
> > > > 	else
> > > > 		poll
> > > >
> > > > Jacob
> > >
> > > It's not always a poll; on x86, putting the CPU in a low power
> > > state is possible within a VM.
> > >
> > Are you talking about using mwait/monitor in user space, which are
> > available on some Intel CPUs such as Xeon Phi? I guess if the
> > guest can identify the host CPU id, it is doable.
>
> Not really.
>
> Please take a look at the patch in question - it does mwait in the
> guest kernel, and there is no need to identify the host CPU id.
>
I may be missing something: in your patch I only see HLT being used in
the guest OS, and that would cause a VM exit, right? If you do mwait in
the guest kernel, it will also exit. So I don't see how you can enter a
low power state within the VM guest.

+static int intel_halt(struct cpuidle_device *dev,
+		      struct cpuidle_driver *drv, int index)
+{
+	printk_once(KERN_ERR "safe_halt started\n");
+	safe_halt();
+	printk_once(KERN_ERR "safe_halt done\n");
+	return index;
+}

> > > Does not seem possible on other CPUs - that's why it's vendor
> > > specific.
> >
> > [Jacob Pan]

[Jacob Pan]
On Wed, Oct 4, 2017 at 9:56 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Wed, 4 Oct 2017, Michael S. Tsirkin wrote:
>> On Tue, Oct 03, 2017 at 11:02:55PM +0200, Thomas Gleixner wrote:
>> > There is the series from Aubrey which makes use of the various idle
>> > prediction mechanisms - scheduler, irq timings, idle governor - to get
>> > an idea about the estimated idle time. Exactly this information can be
>> > fed to the kvm idle driver, which can act accordingly.
>> >
>> > Hacking a random hardware-specific idle driver is definitely the wrong
>> > approach. It might be useful to chain the kvm idle driver and
>> > hardware-specific drivers at some point, i.e. if the kvm driver decides
>> > not to exit, it delegates the mwait decision to the proper hardware
>> > driver in order not to reimplement all the required logic again.
>>
>> By making changes to the idle core to allow that chaining?
>> Does this sound like something reasonable?
>
> At least for me it makes sense to avoid code duplication.

Well, I agree.

Thanks,
Rafael
On 04/10/2017 20:31, Jacob Pan wrote:
> On Wed, 4 Oct 2017 20:12:28 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
>> On Wed, Oct 04, 2017 at 10:09:39AM -0700, Jacob Pan wrote:
>>> On Wed, 4 Oct 2017 05:09:09 +0300
>>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>>
>>>> On Mon, Oct 02, 2017 at 10:12:49AM -0700, Jacob Pan wrote:
>>>>> On Sat, 30 Sep 2017 01:21:43 +0200
>>>>> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>>>>>
>>>>>> On Sat, Sep 30, 2017 at 12:01 AM, Michael S. Tsirkin
>>>>>> <mst@redhat.com> wrote:
>>>>>>> intel idle driver does not DTRT when running within a VM:
>>>>>>> when going into a deep power state, the right thing to
>>>>>>> do is to exit to hypervisor rather than to keep polling
>>>>>>> within guest using mwait.
>>>>>>>
>>>>>>> Currently the solution is just to exit to hypervisor each
>>>>>>> time we go idle - this is why kvm does not expose the mwait
>>>>>>> leaf to guests even when it allows guests to do mwait.
>>>>>>>
>>>>>>> But that's not ideal - it seems better to use the idle
>>>>>>> driver to guess when the next interrupt will arrive.
>>>>>>
>>>>>> The idle driver alone is not sufficient for that, though.
>>>>>>
>>>>> I second that. Why try to solve this problem at the
>>>>> vendor-specific driver level?
>>>>
>>>> Well, we still want to e.g. mwait if possible - saves power.
>>>>
>>>>> Perhaps just a pv idle driver that decides whether to vmexit
>>>>> based on something like local per-vCPU timer expiration? I
>>>>> guess we can't predict other wake events such as interrupts,
>>>>> e.g.:
>>>>> 	if (get_next_timer_interrupt() > kvm_halt_target_residency)
>>>>> 		vmexit
>>>>> 	else
>>>>> 		poll
>>>>>
>>>>> Jacob
>>>>
>>>> It's not always a poll; on x86, putting the CPU in a low power
>>>> state is possible within a VM.
>>>>
>>> Are you talking about using mwait/monitor in user space, which
>>> are available on some Intel CPUs such as Xeon Phi? I guess if the
>>> guest can identify the host CPU id, it is doable.
>>
>> Not really.
>>
>> Please take a look at the patch in question - it does mwait in the
>> guest kernel, and there is no need to identify the host CPU id.
>>
> I may be missing something: in your patch I only see HLT being used in
> the guest OS, and that would cause a VM exit, right? If you do mwait in
> the guest kernel, it will also exit. So I don't see how you can enter a
> low power state within the VM guest.

KVM does not exit on MWAIT (though it doesn't show it in CPUID by
default); see commit 668fffa3f838edfcb1679f842f7ef1afa61c3e9a.

Paolo

> +static int intel_halt(struct cpuidle_device *dev,
> +		      struct cpuidle_driver *drv, int index)
> +{
> +	printk_once(KERN_ERR "safe_halt started\n");
> +	safe_halt();
> +	printk_once(KERN_ERR "safe_halt done\n");
> +	return index;
> +}
>
>>>> Does not seem possible on other CPUs - that's why it's vendor
>>>> specific.
>>>
>>> [Jacob Pan]
>
> [Jacob Pan]
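For reference, the host-side behavior Paolo cites boils down to the following shape; this is a paraphrased sketch of commit 668fffa3f838 ("kvm: better MWAIT emulation for guests"), not the literal vmx.c code. When KVM is willing to let guests execute MWAIT natively, it simply leaves the MWAIT/MONITOR-exiting VM-execution controls clear, so a guest MWAIT parks the physical CPU without trapping:

/* Paraphrased sketch of the idea in commit 668fffa3f838. */
static u32 adjust_cpu_based_exec_controls(u32 exec_control)
{
	if (kvm_mwait_in_guest())
		exec_control &= ~(CPU_BASED_MWAIT_EXITING |
				  CPU_BASED_MONITOR_EXITING);
	return exec_control;
}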
On Wed, Oct 04, 2017 at 11:31:43AM -0700, Jacob Pan wrote:
> On Wed, 4 Oct 2017 20:12:28 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
> > On Wed, Oct 04, 2017 at 10:09:39AM -0700, Jacob Pan wrote:
> > > On Wed, 4 Oct 2017 05:09:09 +0300
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > >
> > > > On Mon, Oct 02, 2017 at 10:12:49AM -0700, Jacob Pan wrote:
> > > > > On Sat, 30 Sep 2017 01:21:43 +0200
> > > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > > >
> > > > > > On Sat, Sep 30, 2017 at 12:01 AM, Michael S. Tsirkin
> > > > > > <mst@redhat.com> wrote:
> > > > > > > intel idle driver does not DTRT when running within a VM:
> > > > > > > when going into a deep power state, the right thing to
> > > > > > > do is to exit to hypervisor rather than to keep polling
> > > > > > > within guest using mwait.
> > > > > > >
> > > > > > > Currently the solution is just to exit to hypervisor each
> > > > > > > time we go idle - this is why kvm does not expose the mwait
> > > > > > > leaf to guests even when it allows guests to do mwait.
> > > > > > >
> > > > > > > But that's not ideal - it seems better to use the idle
> > > > > > > driver to guess when the next interrupt will arrive.
> > > > > >
> > > > > > The idle driver alone is not sufficient for that, though.
> > > > > >
> > > > > I second that. Why try to solve this problem at the
> > > > > vendor-specific driver level?
> > > >
> > > > Well, we still want to e.g. mwait if possible - saves power.
> > > >
> > > > > Perhaps just a pv idle driver that decides whether to vmexit
> > > > > based on something like local per-vCPU timer expiration? I
> > > > > guess we can't predict other wake events such as interrupts,
> > > > > e.g.:
> > > > > 	if (get_next_timer_interrupt() > kvm_halt_target_residency)
> > > > > 		vmexit
> > > > > 	else
> > > > > 		poll
> > > > >
> > > > > Jacob
> > > >
> > > > It's not always a poll; on x86, putting the CPU in a low power
> > > > state is possible within a VM.
> > > >
> > > Are you talking about using mwait/monitor in user space, which
> > > are available on some Intel CPUs such as Xeon Phi? I guess if the
> > > guest can identify the host CPU id, it is doable.
> >
> > Not really.
> >
> > Please take a look at the patch in question - it does mwait in the
> > guest kernel, and there is no need to identify the host CPU id.
> >
> I may be missing something: in your patch I only see HLT being used in
> the guest OS, and that would cause a VM exit, right? If you do mwait in
> the guest kernel, it will also exit.

No, mwait won't exit if running on kvm. See commit
668fffa3f838edfcb1679f842f7ef1afa61c3e9a.

> So I don't see how you can enter a low power state within the VM
> guest.
>
> +static int intel_halt(struct cpuidle_device *dev,
> +		      struct cpuidle_driver *drv, int index)
> +{
> +	printk_once(KERN_ERR "safe_halt started\n");
> +	safe_halt();
> +	printk_once(KERN_ERR "safe_halt done\n");
> +	return index;
> +}
>
> > > > Does not seem possible on other CPUs - that's why it's vendor
> > > > specific.
> > >
> > > [Jacob Pan]
>
> [Jacob Pan]
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index c2ae819..6fa58ad 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -65,8 +65,10 @@
 #include <asm/intel-family.h>
 #include <asm/mwait.h>
 #include <asm/msr.h>
+#include <linux/kvm_para.h>
 
 #define INTEL_IDLE_VERSION "0.4.1"
+#define PREFIX "intel_idle: "
 
 static struct cpuidle_driver intel_idle_driver = {
 	.name = "intel_idle",
@@ -94,6 +96,7 @@ struct idle_cpu {
 };
 
 static const struct idle_cpu *icpu;
+static struct idle_cpu icpus;
 static struct cpuidle_device __percpu *intel_idle_cpuidle_devices;
 static int intel_idle(struct cpuidle_device *dev,
 			struct cpuidle_driver *drv, int index);
@@ -119,6 +122,49 @@ static struct cpuidle_state *cpuidle_state_table;
 #define flg2MWAIT(flags) (((flags) >> 24) & 0xFF)
 #define MWAIT2flg(eax) ((eax & 0xFF) << 24)
 
+static int intel_halt(struct cpuidle_device *dev,
+		      struct cpuidle_driver *drv, int index)
+{
+	printk_once(KERN_ERR "safe_halt started\n");
+	safe_halt();
+	printk_once(KERN_ERR "safe_halt done\n");
+	return index;
+}
+
+static int kvm_halt_target_residency = 400; /* Halt above this target residency */
+module_param(kvm_halt_target_residency, int, 0444);
+static int kvm_halt_native = 0; /* Use native mwait substates */
+module_param(kvm_halt_native, int, 0444);
+static int kvm_pv_mwait = 0; /* Whether to do mwait within KVM */
+module_param(kvm_pv_mwait, int, 0444);
+
+static struct cpuidle_state kvm_halt_cstate = {
+	.name = "HALT-KVM",
+	.desc = "HALT",
+	.flags = MWAIT2flg(0x10),
+	.exit_latency = 0,
+	.target_residency = 0,
+	.enter = &intel_halt,
+};
+
+static struct cpuidle_state kvm_cstates[] = {
+	{
+		.name = "C1-NHM",
+		.desc = "MWAIT 0x00",
+		.flags = MWAIT2flg(0x00),
+		.exit_latency = 3,
+		.target_residency = 6,
+		.enter = &intel_idle,
+		.enter_freeze = intel_idle_freeze, },
+	{
+		.name = "HALT-KVM",
+		.desc = "HALT",
+		.flags = MWAIT2flg(0x10),
+		.exit_latency = 30,
+		.target_residency = 399,
+		.enter = &intel_halt, }
+};
+
 /*
  * States are indexed by the cstate number,
  * which is also the index into the MWAIT hint array.
@@ -927,8 +973,11 @@ static __cpuidle int intel_idle(struct cpuidle_device *dev,
 	if (!(lapic_timer_reliable_states & (1 << (cstate))))
 		tick_broadcast_enter();
 
+	printk_once(KERN_ERR "mwait_idle_with_hints started\n");
 	mwait_idle_with_hints(eax, ecx);
+	printk_once(KERN_ERR "mwait_idle_with_hints done\n");
+
 	if (!(lapic_timer_reliable_states & (1 << (cstate))))
 		tick_broadcast_exit();
 
@@ -989,6 +1038,11 @@ static const struct idle_cpu idle_cpu_tangier = {
 	.state_table = tangier_cstates,
 };
 
+static const struct idle_cpu idle_cpu_kvm = {
+	.state_table = kvm_cstates,
+};
+
+
 static const struct idle_cpu idle_cpu_lincroft = {
 	.state_table = atom_cstates,
 	.auto_demotion_disable_flags = ATM_LNC_C6_AUTO_DEMOTE,
@@ -1061,7 +1115,7 @@ static const struct idle_cpu idle_cpu_dnv = {
 };
 
 #define ICPU(model, cpu) \
-	{ X86_VENDOR_INTEL, 6, model, X86_FEATURE_MWAIT, (unsigned long)&cpu }
+	{ X86_VENDOR_INTEL, 6, model, X86_FEATURE_ANY, (unsigned long)&cpu }
 
 static const struct x86_cpu_id intel_idle_ids[] __initconst = {
 	ICPU(INTEL_FAM6_NEHALEM_EP,		idle_cpu_nehalem),
@@ -1115,6 +1169,7 @@ static int __init intel_idle_probe(void)
 		pr_debug("disabled\n");
 		return -EPERM;
 	}
+	pr_err(PREFIX "enabled\n");
 
 	id = x86_match_cpu(intel_idle_ids);
 	if (!id) {
@@ -1125,19 +1180,39 @@ static int __init intel_idle_probe(void)
 		return -ENODEV;
 	}
 
-	if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
-		return -ENODEV;
+	icpus = *(struct idle_cpu *)id->driver_data;
+
+	if (kvm_pv_mwait) {
+
+		if (!kvm_halt_native)
+			icpus = idle_cpu_kvm;
+
+		pr_debug(PREFIX "MWAIT enabled by KVM\n");
+		mwait_substates = 0x1;
+		/*
+		 * these MSRs do not work on kvm maybe they should?
+		 * more likely we need to poke at CPUID before using MSRs
+		 */
+		icpus.auto_demotion_disable_flags = 0;
+		icpus.disable_promotion_to_c1e = 0;
+	} else {
+		if (!cpu_has(&boot_cpu_data, X86_FEATURE_MWAIT))
+			return -ENODEV;
+
+		if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
+			return -ENODEV;
 
-	cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &mwait_substates);
+		cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &mwait_substates);
 
-	if (!(ecx & CPUID5_ECX_EXTENSIONS_SUPPORTED) ||
-	    !(ecx & CPUID5_ECX_INTERRUPT_BREAK) ||
-	    !mwait_substates)
+		if (!(ecx & CPUID5_ECX_EXTENSIONS_SUPPORTED) ||
+		    !(ecx & CPUID5_ECX_INTERRUPT_BREAK) ||
+		    !mwait_substates)
 			return -ENODEV;
 
-	pr_debug("MWAIT substates: 0x%x\n", mwait_substates);
+		pr_debug(PREFIX "MWAIT substates: 0x%x\n", mwait_substates);
+	}
 
-	icpu = (const struct idle_cpu *)id->driver_data;
+	icpu = &icpus;
 	cpuidle_state_table = icpu->state_table;
 
 	pr_debug("v" INTEL_IDLE_VERSION " model 0x%X\n",
@@ -1340,6 +1415,11 @@ static void __init intel_idle_cpuidle_driver_init(void)
 		    (cpuidle_state_table[cstate].enter_freeze == NULL))
 			break;
 
+		if (kvm_pv_mwait &&
+		    cpuidle_state_table[cstate].target_residency >=
+		    kvm_halt_target_residency)
+			break;
+
 		if (cstate + 1 > max_cstate) {
 			pr_info("max_cstate %d reached\n", max_cstate);
 			break;
@@ -1353,7 +1433,7 @@ static void __init intel_idle_cpuidle_driver_init(void)
 			& MWAIT_SUBSTATE_MASK;
 
 		/* if NO sub-states for this state in CPUID, skip it */
-		if (num_substates == 0)
+		if (num_substates == 0 && !kvm_pv_mwait)
 			continue;
 
 		/* if state marked as disabled, skip it */
@@ -1375,6 +1455,20 @@ static void __init intel_idle_cpuidle_driver_init(void)
 		drv->state_count += 1;
 	}
 
+	if (kvm_halt_native && kvm_pv_mwait) {
+		drv->states[drv->state_count] =	/* structure copy */
+			kvm_halt_cstate;
+		drv->states[drv->state_count].exit_latency =
+			drv->state_count > 1 ?
+			drv->states[drv->state_count - 1].exit_latency + 1 : 1;
+		drv->states[drv->state_count].target_residency =
+			kvm_halt_target_residency;
+
+		drv->state_count += 1;
+	}
+
+	printk(KERN_ERR "detected states: %d\n\n", drv->state_count);
+
 	if (icpu->byt_auto_demotion_disable_flag) {
 		wrmsrl(MSR_CC6_DEMOTION_POLICY_CONFIG, 0);
 		wrmsrl(MSR_MC6_DEMOTION_POLICY_CONFIG, 0);
@@ -1452,7 +1546,8 @@ static int __init intel_idle_init(void)
 		goto init_driver_fail;
 	}
 
-	if (boot_cpu_has(X86_FEATURE_ARAT))	/* Always Reliable APIC Timer */
+	if (boot_cpu_has(X86_FEATURE_ARAT) ||	/* Always Reliable APIC Timer */
+	    kvm_para_available())
 		lapic_timer_reliable_states = LAPIC_TIMER_ALWAYS_RELIABLE;
 
 	retval = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "idle/intel:online",
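For anyone wanting to experiment with the RFC: the three module_param() knobs above use mode 0444, so they are read-only via sysfs and take effect at boot. With intel_idle built in, that means kernel command line parameters along these lines (parameter names from the patch, values illustrative):

	intel_idle.kvm_pv_mwait=1 intel_idle.kvm_halt_native=1 intel_idle.kvm_halt_target_residency=400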