[v5,untested] kvm: better MWAIT emulation for guests

Message ID 20170316132426.GB4085@HEDWIG.INI.CMU.EDU (mailing list archive)
State New, archived

Commit Message

Gabriel L. Somlo March 16, 2017, 1:24 p.m. UTC
On Thu, Mar 16, 2017 at 01:41:28AM +0200, Michael S. Tsirkin wrote:
> On Wed, Mar 15, 2017 at 07:35:34PM -0400, Gabriel L. Somlo wrote:
> > On Wed, Mar 15, 2017 at 11:22:18PM +0200, Michael S. Tsirkin wrote:
> > > Guests running Mac OS X 10.5, 10.6, and 10.7 (Leopard through Lion) have a problem:
> > > unless explicitly provided with kernel command line argument
> > > "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
> > > without checking CPUID.
> > > 
> > > We currently emulate that as a NOP but on VMX we can do better: let
> > > guest stop the CPU until timer, IPI or memory change.  CPU will be busy
> > > but that isn't any worse than a NOP emulation.
> > > 
> > > Note that mwait within guests is not the same as on real hardware
> > > because halt causes an exit while mwait doesn't.  For this reason it
> > > might not be a good idea to use the regular MWAIT flag in CPUID to
> > > signal this capability.  Add a flag in the hypervisor leaf instead.
> > > 
> > > Additionally, we add a capability for QEMU - e.g. if it knows there's an
> > > isolated CPU dedicated for the VCPU it can set the standard MWAIT flag
> > > to improve guest behaviour.
> > 
> > Same behavior (on the Mac Pro 1,1 running F22 with a custom-compiled
> > kernel from kvm git master, plus this patch on top).
> > 
> > The OS X 10.7 kernel hangs (or at least progresses extremely slowly)
> > on boot and does not bring up the guest graphical interface within the
> > 10 minutes that I waited for it. That is in contrast with the default
> > NOP-based emulation, where the guest comes up within 30 seconds.
> 
> 
> Thanks a lot, meanwhile I'll try to write a unit-test and experiment
> with various behaviours.
> 
> > I will run another round of tests on a newer Mac (4-year-old MacBook
> > Air) and report back tomorrow.
> > 
> > Going off on a tangent, why would encouraging otherwise well-behaved
> > guests (like Linux ones, for example) to use MWAIT be desirable to
> > begin with? Is it a matter of minimizing the overhead associated with
> > exiting and re-entering L1? Because if so, AFAIR staying inside L1 and
> > running guest-mode MWAIT in a tight loop will actually waste the host
> > CPU without the opportunity to yield to some other L0 thread. Sorry if
> > I fell into the middle of an ongoing conversation on this and missed
> > most of the relevant context, in which case please feel free to ignore
> > me... :)
> > 
> > Thanks,
> > --G
> 
> It's just some experiments I'm running, I'm not ready to describe it
> yet. I thought this part might be useful to at least some guests, so
> trying to upstream it right now.

OK, so on a MacBook Air running F25 and the latest kvm git master plus
your v5 patch (4.11.0-rc2+), things appear to work.

host-side cpuid output:
eax=0x000040 ebx=0x000040 ecx=0x000003 edx=0x021120

guest-side cpuid output:
eax=00000000 ebx=00000000 ecx=0x000003 edx=00000000
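
For readers decoding the two register dumps above: assuming they come from CPUID leaf 5 (the MONITOR/MWAIT leaf; the message does not say which leaf was queried), the fields can be unpacked as in the minimal C sketch below. The struct and function names are hypothetical, and the bit layout is the one the Intel SDM gives for leaf 05H.

```c
#include <stdint.h>

/* Hypothetical container for the fields of CPUID leaf 05H (MONITOR/MWAIT). */
struct mwait_leaf {
    uint16_t smallest_line;   /* EAX[15:0]: smallest monitor-line size, bytes */
    uint16_t largest_line;    /* EBX[15:0]: largest monitor-line size, bytes */
    int      has_extensions;  /* ECX[0]: MWAIT extensions are enumerated */
    int      irq_break_event; /* ECX[1]: interrupts break MWAIT even if IF=0 */
    uint8_t  substates[8];    /* EDX: 4 bits per C-state, C0 in the low nibble */
};

/* Unpack raw register values; assumes they were read from leaf 5. */
struct mwait_leaf decode_mwait_leaf(uint32_t eax, uint32_t ebx,
                                    uint32_t ecx, uint32_t edx)
{
    struct mwait_leaf m;
    int i;

    m.smallest_line = (uint16_t)(eax & 0xffff);
    m.largest_line = (uint16_t)(ebx & 0xffff);
    m.has_extensions = (int)(ecx & 1);
    m.irq_break_event = (int)((ecx >> 1) & 1);
    for (i = 0; i < 8; i++)
        m.substates[i] = (uint8_t)((edx >> (4 * i)) & 0xf);
    return m;
}
```

Under that assumption, the host values decode to 64-byte monitor lines, both ECX bits set, and sub-state counts of 2, 1, 1 and 2 for C1 through C4. In the guest dump only the two ECX bits survive.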

processor	: 3
vendor_id	: GenuineIntel
cpu family	: 6
model		: 42
model name	: Intel(R) Core(TM) i7-2677M CPU @ 1.80GHz
stepping	: 7
microcode	: 0x29
cpu MHz		: 1157.849
cache size	: 4096 KB
physical id	: 0
siblings	: 4
core id		: 1
cpu cores	: 2
apicid		: 3
initial apicid	: 3
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts
bugs		:
bogomips	: 3604.68
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

After studying your patch a bit more carefully (sorry, it's crazy
around here right now :) ) I realized you're simply trying to
(selectively) decide when to exit L1 and emulate as NOP vs. when to
just allow L1 to execute MONITOR & MWAIT natively.

Is that right? Because if so, the issues I saw on my MacPro1,1 are
weird and inexplicable, given that allowing L>=1 to run MONITOR/MWAIT
natively was one of the options Alex Graf and Rene Rebe used back in
the very early days of OS X on QEMU, at the time I got involved with
that project. Here's part of an out-of-tree patch against 3.4 which did
just that, and worked, as far as I remember, on *any* MWAIT-capable
Intel chip I had access to back in 2010:

##############################################################################
# 99-mwait.patch.kvm-kmod (Rene Rebe <rene@exactcode.de>) 2010-04-27
##############################################################################

If all you're trying to do is (selectively) revert to this behavior,
that "shouldn't" mess it up for the MacPro either, so I'm thoroughly
confused at this point :)

Back in 2010, running MWAIT in L>=1 behaved exactly like a NOP: it
didn't power down the physical CPU, just immediately moved on to the
next instruction. As such, there was no power saving and no
opportunity to yield to another L0 thread either, unlike with NOP
emulation at L0.

Did that change on newer Intel chips (i.e., is guest-mode MWAIT now
doing something smarter than just acting as a guest-mode NOP)?

Thanks,
--Gabriel

Comments

Michael S. Tsirkin March 16, 2017, 2:04 p.m. UTC | #1
On Thu, Mar 16, 2017 at 09:24:27AM -0400, Gabriel L. Somlo wrote:
> After studying your patch a bit more carefully (sorry, it's crazy
> around here right now :) ) I realized you're simply trying to
> (selectively) decide when to exit L1 and emulate as NOP vs. when to
> just allow L1 to execute MONITOR & MWAIT natively.
> 
> Is that right? Because if so, the issues I saw on my MacPro1,1 are
> weird and inexplicable, given that allowing L>=1 to run MONITOR/MWAIT
> natively was one of the options Alex Graf and Rene Rebe used back in
> the very early days of OS X on QEMU, at the time I got involved with
> that project. Here's part of an out-of-tree patch against 3.4 which did
> just that, and worked, as far as I remember, on *any* MWAIT-capable
> Intel chip I had access to back in 2010:
> 
> ##############################################################################
> # 99-mwait.patch.kvm-kmod (Rene Rebe <rene@exactcode.de>) 2010-04-27
> ##############################################################################
> diff -pNarU5 linux-3.4/arch/x86/kvm/cpuid.c linux-3.4-mac/arch/x86/kvm/cpuid.c
> --- linux-3.4/arch/x86/kvm/cpuid.c	2012-05-20 18:29:13.000000000 -0400
> +++ linux-3.4-mac/arch/x86/kvm/cpuid.c	2012-10-09 11:42:59.921215750 -0400
> @@ -222,11 +222,11 @@ static int do_cpuid_ent(struct kvm_cpuid
>  		f_nx | 0 /* Reserved */ | F(MMXEXT) | F(MMX) |
>  		F(FXSR) | F(FXSR_OPT) | f_gbpages | f_rdtscp |
>  		0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
>  	/* cpuid 1.ecx */
>  	const u32 kvm_supported_word4_x86_features =
> -		F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
> +		F(XMM3) | F(PCLMULQDQ) | F(MWAIT) /* DTES64, MONITOR */ |
>  		0 /* DS-CPL, VMX, SMX, EST */ |
>  		0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
>  		F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
>  		0 /* Reserved, DCA */ | F(XMM4_1) |
>  		F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
> diff -pNarU5 linux-3.4/arch/x86/kvm/svm.c linux-3.4-mac/arch/x86/kvm/svm.c
> --- linux-3.4/arch/x86/kvm/svm.c	2012-05-20 18:29:13.000000000 -0400
> +++ linux-3.4-mac/arch/x86/kvm/svm.c	2012-10-09 11:44:41.598997481 -0400
> @@ -1102,12 +1102,10 @@ static void init_vmcb(struct vcpu_svm *s
>  	set_intercept(svm, INTERCEPT_VMSAVE);
>  	set_intercept(svm, INTERCEPT_STGI);
>  	set_intercept(svm, INTERCEPT_CLGI);
>  	set_intercept(svm, INTERCEPT_SKINIT);
>  	set_intercept(svm, INTERCEPT_WBINVD);
> -	set_intercept(svm, INTERCEPT_MONITOR);
> -	set_intercept(svm, INTERCEPT_MWAIT);
>  	set_intercept(svm, INTERCEPT_XSETBV);
>  
>  	control->iopm_base_pa = iopm_base;
>  	control->msrpm_base_pa = __pa(svm->msrpm);
>  	control->int_ctl = V_INTR_MASKING_MASK;
> diff -pNarU5 linux-3.4/arch/x86/kvm/vmx.c linux-3.4-mac/arch/x86/kvm/vmx.c
> --- linux-3.4/arch/x86/kvm/vmx.c	2012-05-20 18:29:13.000000000 -0400
> +++ linux-3.4-mac/arch/x86/kvm/vmx.c	2012-10-09 11:42:59.925215977 -0400
> @@ -1938,11 +1938,11 @@ static __init void nested_vmx_setup_ctls
>  		nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
>  	nested_vmx_procbased_ctls_low = 0;
>  	nested_vmx_procbased_ctls_high &=
>  		CPU_BASED_VIRTUAL_INTR_PENDING | CPU_BASED_USE_TSC_OFFSETING |
>  		CPU_BASED_HLT_EXITING | CPU_BASED_INVLPG_EXITING |
> -		CPU_BASED_MWAIT_EXITING | CPU_BASED_CR3_LOAD_EXITING |
> +		CPU_BASED_CR3_LOAD_EXITING |
>  		CPU_BASED_CR3_STORE_EXITING |
>  #ifdef CONFIG_X86_64
>  		CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING |
>  #endif
>  		CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
> @@ -2404,12 +2404,10 @@ static __init int setup_vmcs_config(stru
>  	      CPU_BASED_CR3_LOAD_EXITING |
>  	      CPU_BASED_CR3_STORE_EXITING |
>  	      CPU_BASED_USE_IO_BITMAPS |
>  	      CPU_BASED_MOV_DR_EXITING |
>  	      CPU_BASED_USE_TSC_OFFSETING |
> -	      CPU_BASED_MWAIT_EXITING |
> -	      CPU_BASED_MONITOR_EXITING |
>  	      CPU_BASED_INVLPG_EXITING |
>  	      CPU_BASED_RDPMC_EXITING;
>  
>  	opt = CPU_BASED_TPR_SHADOW |
>  	      CPU_BASED_USE_MSR_BITMAPS |
> 
> If all you're trying to do is (selectively) revert to this behavior,
> that "shouldn't" mess it up for the MacPro either, so I'm thoroughly
> confused at this point :)

Yes.  Me too. Want to try that other patch and see what happens?

> Back in 2010, running MWAIT in L>=1 behaved exactly like a NOP: it
> didn't power down the physical CPU, just immediately moved on to the
> next instruction. As such, there was no power saving and no
> opportunity to yield to another L0 thread either, unlike with NOP
> emulation at L0.
> 
> Did that change on newer Intel chips (i.e., is guest-mode MWAIT now
> doing something smarter than just acting as a guest-mode NOP)?
> 
> Thanks,
> --Gabriel

Interesting. What the Intel manual seems to say is this:

MWAIT. Behavior of the MWAIT instruction (which always causes an invalid-
opcode exception—#UD—if CPL > 0) is determined by the setting of the “MWAIT
exiting” VM-execution control:
— If the “MWAIT exiting” VM-execution control is 1, MWAIT causes a VM exit
(see Section 22.1.3).
— If the “MWAIT exiting” VM-execution control is 0, MWAIT operates normally if
any of the following is true: (1) the “interrupt-window exiting” VM-execution
control is 0; (2) ECX[0] is 0; or (3) RFLAGS.IF = 1.
— If the “MWAIT exiting” VM-execution control is 0, the “interrupt-window
exiting” VM-execution control is 1, ECX[0] = 1, and RFLAGS.IF = 0, MWAIT
does not cause the processor to enter an implementation-dependent
optimized state; instead, control passes to the instruction following the
MWAIT instruction.


And since interrupt-window exiting is 0 most of the time for KVM,
I would expect MWAIT to behave normally.
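
The three cases in the manual text quoted above can be restated as a small decision function. This is only a C paraphrase of the quoted rules, not KVM code: the enum and function names are made up for illustration, and the CPL > 0 case (#UD) is left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define RFLAGS_IF (1ULL << 9)   /* interrupt-enable flag, bit 9 of RFLAGS */

/* Hypothetical names for the three outcomes described in the quoted text. */
enum mwait_outcome {
    MWAIT_VM_EXIT,       /* "MWAIT exiting" is 1: the instruction traps */
    MWAIT_NORMAL,        /* MWAIT "operates normally" */
    MWAIT_FALL_THROUGH   /* no optimized state; continue at next instruction */
};

/* Classify a guest MWAIT at CPL 0 from the two VM-execution controls,
 * the guest's ECX argument, and the guest's RFLAGS. */
enum mwait_outcome mwait_in_guest(bool mwait_exiting, bool intr_window_exiting,
                                  uint32_t ecx, uint64_t rflags)
{
    if (mwait_exiting)
        return MWAIT_VM_EXIT;
    /* Any one of these three conditions is enough for normal operation. */
    if (!intr_window_exiting || !(ecx & 1) || (rflags & RFLAGS_IF))
        return MWAIT_NORMAL;
    /* Remaining case: window exiting = 1, ECX[0] = 1, RFLAGS.IF = 0. */
    return MWAIT_FALL_THROUGH;
}
```

With "interrupt-window exiting" clear, the function returns MWAIT_NORMAL whatever the guest passes in ECX or has in RFLAGS, which is the case argued for here.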

Gabriel L. Somlo March 16, 2017, 2:58 p.m. UTC | #2
On Thu, Mar 16, 2017 at 04:04:12PM +0200, Michael S. Tsirkin wrote:
> On Thu, Mar 16, 2017 at 09:24:27AM -0400, Gabriel L. Somlo wrote:
> > After studying your patch a bit more carefully (sorry, it's crazy
> > around here right now :) ) I realized you're simply trying to
> > (selectively) decide when to exit L1 and emulate as NOP vs. when to
> > just allow L1 to execute MONITOR & MWAIT natively.
> > 
> > Is that right? Because if so, the issues I saw on my MacPro1,1 are
> > weird and inexplicable, given that allowing L>=1 to run MONITOR/MWAIT
> > natively was one of the options Alex Graf and Rene Rebe used back in
> > the very early days of OS X on QEMU, at the time I got involved with
> > that project. Here's part of an out-of-tree patch against 3.4 which did
> > just that, and worked, as far as I remember, on *any* MWAIT-capable
> > Intel chip I had access to back in 2010:
> > 
> > If all you're trying to do is (selectively) revert to this behavior,
> > that "shouldn't" mess it up for the MacPro either, so I'm thoroughly
> > confused at this point :)
> 
> Yes.  Me too. Want to try that other patch and see what happens?

You mean the old 3.4 patch against current KVM? I'll try to do that,
but it might take me a while :)

> > Back in 2010, running MWAIT in L>=1 behaved exactly like a NOP: it
> > didn't power down the physical CPU, just immediately moved on to the
> > next instruction. As such, there was no power saving and no
> > opportunity to yield to another L0 thread either, unlike with NOP
> > emulation at L0.
> > 
> > Did that change on newer Intel chips (i.e., is guest-mode MWAIT now
> > doing something smarter than just acting as a guest-mode NOP)?
> > 
> > Thanks,
> > --Gabriel
> 
> Interesting. What the Intel manual seems to say is this:
> 
> MWAIT. Behavior of the MWAIT instruction (which always causes an invalid-
> opcode exception—#UD—if CPL > 0) is determined by the setting of the “MWAIT
> exiting” VM-execution control:
> — If the “MWAIT exiting” VM-execution control is 1, MWAIT causes a VM exit
> (see Section 22.1.3).
> — If the “MWAIT exiting” VM-execution control is 0, MWAIT operates normally if
> any of the following is true: (1) the “interrupt-window exiting” VM-execution
> control is 0; (2) ECX[0] is 0; or (3) RFLAGS.IF = 1.
> — If the “MWAIT exiting” VM-execution control is 0, the “interrupt-window
> exiting” VM-execution control is 1, ECX[0] = 1, and RFLAGS.IF = 0, MWAIT
> does not cause the processor to enter an implementation-dependent
> optimized state; instead, control passes to the instruction following the
> MWAIT instruction.
> 
> 
> And since interrupt-window exiting is 0 most of the time for KVM,
> I would expect MWAIT to behave normally.

The Intel manual said the same thing back in 2010 as well. However,
no matter how the flags were set, interrupt-window exiting or not,
"normal" L1 MWAIT behavior was that it woke up immediately.
Remember, never going to sleep is still correct ("normal" ?) behavior
per the ISA definition of MWAIT :)

Also, when I tested your patch on the MacBook Air (where it worked),
not only was the host reporting 400% CPU for qemu (which is to be
expected), but the cooling fan also shifted up into high
gear, which means the physical CPU got hot, which it shouldn't have if
the guest-mode MWAIT had actually put the host CPU into a low-power state.

So at least on this 4-year-old Core i7 chip, the story Intel tells in
its manual still doesn't check out. I could never get any
clarification on what they mean by "operates normally" :)

Michael S. Tsirkin March 16, 2017, 3:23 p.m. UTC | #3
On Thu, Mar 16, 2017 at 10:58:20AM -0400, Gabriel L. Somlo wrote:
> On Thu, Mar 16, 2017 at 04:04:12PM +0200, Michael S. Tsirkin wrote:
> > On Thu, Mar 16, 2017 at 09:24:27AM -0400, Gabriel L. Somlo wrote:
> > > After studying your patch a bit more carefully (sorry, it's crazy
> > > around here right now :) ) I realized you're simply trying to
> > > (selectively) decide when to exit L1 and emulate as NOP vs. when to
> > > just allow L1 to execute MONITOR & MWAIT natively.
> > > 
> > > Is that right? Because if so, the issues I saw on my MacPro1,1 are
> > > weird and inexplicable, given that allowing L>=1 to run MONITOR/MWAIT
> > > natively was one of the options Alex Graf and Rene Rebe used back in
> > > the very early days of OS X on QEMU, at the time I got involved with
> > > that project. Here's part of an out-of-tree patch against 3.4 which did
> > > just that, and worked, as far as I remember, on *any* MWAIT-capable
> > > Intel chip I had access to back in 2010:
> > > 
> > > If all you're trying to do is (selectively) revert to this behavior,
> > > that "shouldn't" mess it up for the MacPro either, so I'm thoroughly
> > > confused at this point :)
> > 
> > Yes.  Me too. Want to try that other patch and see what happens?
> 
> You mean the old 3.4 patch against current KVM? I'll try to do that,
> but it might take me a while :)

I can rebase them for you if you send me a link.

> > > Back in 2010, running MWAIT in L>=1 behaved exactly like a NOP: it
> > > didn't power down the physical CPU, just immediately moved on to the
> > > next instruction. As such, there was no power saving and no
> > > opportunity to yield to another L0 thread either, unlike with NOP
> > > emulation at L0.
> > > 
> > > Did that change on newer Intel chips (i.e., is guest-mode MWAIT now
> > > doing something smarter than just acting as a guest-mode NOP)?
> > > 
> > > Thanks,
> > > --Gabriel
> > 
> > Interesting. What the Intel manual seems to say is this:
> > 
> > MWAIT. Behavior of the MWAIT instruction (which always causes an invalid-
> > opcode exception—#UD—if CPL > 0) is determined by the setting of the “MWAIT
> > exiting” VM-execution control:
> > — If the “MWAIT exiting” VM-execution control is 1, MWAIT causes a VM exit
> > (see Section 22.1.3).
> > — If the “MWAIT exiting” VM-execution control is 0, MWAIT operates normally if
> > any of the following is true: (1) the “interrupt-window exiting” VM-execution
> > control is 0; (2) ECX[0] is 0; or (3) RFLAGS.IF = 1.
> > — If the “MWAIT exiting” VM-execution control is 0, the “interrupt-window
> > exiting” VM-execution control is 1, ECX[0] = 1, and RFLAGS.IF = 0, MWAIT
> > does not cause the processor to enter an implementation-dependent
> > optimized state; instead, control passes to the instruction following the
> > MWAIT instruction.
> > 
> > 
> > And since interrupt-window exiting is 0 most of the time for KVM,
> > I would expect MWAIT to behave normally.
> 
> The Intel manual said the same thing back in 2010 as well. However,
> no matter how the flags were set, interrupt-window exiting or not,
> "normal" L1 MWAIT behavior was that it woke up immediately.
> Remember, never going to sleep is still correct ("normal" ?) behavior
> per the ISA definition of MWAIT :)
> 
> Also, when I tested your patch on the MacBook Air (where it worked),
> not only was the host reporting 400% CPU for qemu (which is to be
> expected), but the cooling fan also shifted up into high
> gear, which means the physical CPU got hot, which it shouldn't have if
> the guest-mode MWAIT had actually put the host CPU into a low-power state.

Does the same happen with NOP, btw?

> So at least on this 4-year-old Core i7 chip, the story Intel tells in
> its manual still doesn't check out. I could never get any
> clarification on what they mean by "operates normally" :)

It could be that Mac OS sets ECX[0] = 1 and RFLAGS.IF = 0.

Patch

diff -pNarU5 linux-3.4/arch/x86/kvm/cpuid.c linux-3.4-mac/arch/x86/kvm/cpuid.c
--- linux-3.4/arch/x86/kvm/cpuid.c	2012-05-20 18:29:13.000000000 -0400
+++ linux-3.4-mac/arch/x86/kvm/cpuid.c	2012-10-09 11:42:59.921215750 -0400
@@ -222,11 +222,11 @@  static int do_cpuid_ent(struct kvm_cpuid
 		f_nx | 0 /* Reserved */ | F(MMXEXT) | F(MMX) |
 		F(FXSR) | F(FXSR_OPT) | f_gbpages | f_rdtscp |
 		0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
 	/* cpuid 1.ecx */
 	const u32 kvm_supported_word4_x86_features =
-		F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
+		F(XMM3) | F(PCLMULQDQ) | F(MWAIT) /* DTES64, MONITOR */ |
 		0 /* DS-CPL, VMX, SMX, EST */ |
 		0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
 		F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
 		0 /* Reserved, DCA */ | F(XMM4_1) |
 		F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
diff -pNarU5 linux-3.4/arch/x86/kvm/svm.c linux-3.4-mac/arch/x86/kvm/svm.c
--- linux-3.4/arch/x86/kvm/svm.c	2012-05-20 18:29:13.000000000 -0400
+++ linux-3.4-mac/arch/x86/kvm/svm.c	2012-10-09 11:44:41.598997481 -0400
@@ -1102,12 +1102,10 @@  static void init_vmcb(struct vcpu_svm *s
 	set_intercept(svm, INTERCEPT_VMSAVE);
 	set_intercept(svm, INTERCEPT_STGI);
 	set_intercept(svm, INTERCEPT_CLGI);
 	set_intercept(svm, INTERCEPT_SKINIT);
 	set_intercept(svm, INTERCEPT_WBINVD);
-	set_intercept(svm, INTERCEPT_MONITOR);
-	set_intercept(svm, INTERCEPT_MWAIT);
 	set_intercept(svm, INTERCEPT_XSETBV);
 
 	control->iopm_base_pa = iopm_base;
 	control->msrpm_base_pa = __pa(svm->msrpm);
 	control->int_ctl = V_INTR_MASKING_MASK;
diff -pNarU5 linux-3.4/arch/x86/kvm/vmx.c linux-3.4-mac/arch/x86/kvm/vmx.c
--- linux-3.4/arch/x86/kvm/vmx.c	2012-05-20 18:29:13.000000000 -0400
+++ linux-3.4-mac/arch/x86/kvm/vmx.c	2012-10-09 11:42:59.925215977 -0400
@@ -1938,11 +1938,11 @@  static __init void nested_vmx_setup_ctls
 		nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
 	nested_vmx_procbased_ctls_low = 0;
 	nested_vmx_procbased_ctls_high &=
 		CPU_BASED_VIRTUAL_INTR_PENDING | CPU_BASED_USE_TSC_OFFSETING |
 		CPU_BASED_HLT_EXITING | CPU_BASED_INVLPG_EXITING |
-		CPU_BASED_MWAIT_EXITING | CPU_BASED_CR3_LOAD_EXITING |
+		CPU_BASED_CR3_LOAD_EXITING |
 		CPU_BASED_CR3_STORE_EXITING |
 #ifdef CONFIG_X86_64
 		CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING |
 #endif
 		CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
@@ -2404,12 +2404,10 @@  static __init int setup_vmcs_config(stru
 	      CPU_BASED_CR3_LOAD_EXITING |
 	      CPU_BASED_CR3_STORE_EXITING |
 	      CPU_BASED_USE_IO_BITMAPS |
 	      CPU_BASED_MOV_DR_EXITING |
 	      CPU_BASED_USE_TSC_OFFSETING |
-	      CPU_BASED_MWAIT_EXITING |
-	      CPU_BASED_MONITOR_EXITING |
 	      CPU_BASED_INVLPG_EXITING |
 	      CPU_BASED_RDPMC_EXITING;
 
 	opt = CPU_BASED_TPR_SHADOW |
 	      CPU_BASED_USE_MSR_BITMAPS |
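
As a rough illustration of what the two vmx.c hunks above amount to: they drop the MWAIT and MONITOR exiting bits from the processor-based VM-execution controls KVM asks for, so the guest executes both instructions natively. The sketch below models that with plain bit masks. The bit positions (HLT exiting = bit 7, MWAIT exiting = bit 10, MONITOR exiting = bit 29) follow the Intel SDM; the helper name is hypothetical and nothing here touches a real VMCS.

```c
#include <stdint.h>

/* Processor-based VM-execution control bits, positions per the Intel SDM. */
#define CPU_BASED_HLT_EXITING      0x00000080u  /* bit 7 */
#define CPU_BASED_MWAIT_EXITING    0x00000400u  /* bit 10 */
#define CPU_BASED_MONITOR_EXITING  0x20000000u  /* bit 29 */

/* Hypothetical helper: clear the two intercept bits and leave every other
 * control as it was, mirroring the effect of the removed lines. */
uint32_t allow_native_mwait(uint32_t exec_ctrl)
{
    return exec_ctrl & ~(CPU_BASED_MWAIT_EXITING | CPU_BASED_MONITOR_EXITING);
}
```

HLT exiting is left set by this helper, which matches the remark earlier in the thread that halt causes an exit while mwait doesn't.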