diff mbox

[v2] kvm: x86: emulate monitor and mwait instructions as nop

Message ID 20140507205210.GA30030@ERROL.INI.CMU.EDU (mailing list archive)
State New, archived
Headers show

Commit Message

Gabriel L. Somlo May 7, 2014, 8:52 p.m. UTC
Treat monitor and mwait instructions as nop, which is architecturally
correct (but inefficient) behavior. We do this to prevent misbehaving
guests (e.g. OS X <= 10.7) from crashing after they fail to check for
monitor/mwait availability via cpuid.

Since mwait-based idle loops relying on these nop-emulated instructions
would keep the host CPU pegged at 100%, do NOT advertise their presence
via cpuid, to prevent compliant guests from using them inadvertently.

Signed-off-by: Gabriel L. Somlo <somlo@cmu.edu>
---

New in v2: remove invalid_op handler functions which were only used to
           handle exits caused by monitor and mwait

On Wed, May 07, 2014 at 08:31:27PM +0200, Alexander Graf wrote:
> On 05/07/2014 08:15 PM, Michael S. Tsirkin wrote:
> >If we really want to be paranoid and worry about guests
> >that use this strange way to trigger invalid opcode,
> >we can make it possible for userspace to enable/disable
> >this hack, and teach qemu to set it.
> >
> >That would make it even safer than it was.
> >
> >Not sure it's worth it, just a thought.
> 
> Since we don't trap on non-exposed other instructions (new SSE and
> whatdoiknow) I don't think it's really bad to just expose
> MONITOR/MWAIT as nops.

So AFAICT, linux prefers to use mwait for idling if cpuid tells it that
it's available. If we keep telling everyone that we do NOT have monitor
and mwait available, compliant guests will never end up using them, and
this hack would remain completely invisible to them, which is good
(better to use hlt-based idle loops when you're a vm guest, that would
actually allow the host to relax while you're halted :)

So the only time anyone would be able to tell we have this hack would be
when they're about to receive an invalid opcode for using monitor/mwait
in violation of what CPUID (would have) told them. That's what happens
to OS X prior to 10.8, which is when I'm hypothesizing the Apple devs
begain to seriously think about their OS running as a vm guest (on fusion
and parallels)...

Instead of killing the misbehaving guest with an invalid opcode, we'd
allow them to peg the host CPU with their monitor == mwait == nop idle
loop instead, which, at least on OS X, should be tolerable long enough
to run 'rm -rf System/Library/Extensions/AppleIntelCPUPowerManagement.kext'
and reboot the guest, after which things would settle down by reverting
the guest to a hlt-based idle loop.

The only reason I can think of to add functionality for enabling/disabling
this hack would be to protect against a malicious guest which would use
mwait *on purpose* to peg the host CPU. But a malicious guest could just
run "for(;;);" in ring 0 and accomplish the same goal, so we wouldn't
really gain anything in exchange for the added complexity...

Thanks,
  Gabriel

 arch/x86/kvm/cpuid.c |  2 ++
 arch/x86/kvm/svm.c   | 28 ++++++++++++++++++++--------
 arch/x86/kvm/vmx.c   | 20 ++++++++++++++++----
 3 files changed, 38 insertions(+), 12 deletions(-)

Comments

Gabriel L. Somlo June 2, 2014, 7:25 p.m. UTC | #1
On Wed, May 07, 2014 at 04:52:13PM -0400, Gabriel L. Somlo wrote:
> Treat monitor and mwait instructions as nop, which is architecturally
> correct (but inefficient) behavior. We do this to prevent misbehaving
> guests (e.g. OS X <= 10.7) from crashing after they fail to check for
> monitor/mwait availability via cpuid.
> 
> Since mwait-based idle loops relying on these nop-emulated instructions
> would keep the host CPU pegged at 100%, do NOT advertise their presence
> via cpuid, to prevent compliant guests from using them inadvertently.
> 
> Signed-off-by: Gabriel L. Somlo <somlo@cmu.edu>
> ---
> 
> New in v2: remove invalid_op handler functions which were only used to
>            handle exits caused by monitor and mwait
> 
> On Wed, May 07, 2014 at 08:31:27PM +0200, Alexander Graf wrote:
> > On 05/07/2014 08:15 PM, Michael S. Tsirkin wrote:
> > >If we really want to be paranoid and worry about guests
> > >that use this strange way to trigger invalid opcode,
> > >we can make it possible for userspace to enable/disable
> > >this hack, and teach qemu to set it.
> > >
> > >That would make it even safer than it was.
> > >
> > >Not sure it's worth it, just a thought.
> > 
> > Since we don't trap on non-exposed other instructions (new SSE and
> > whatdoiknow) I don't think it's really bad to just expose
> > MONITOR/MWAIT as nops.

Would it make sense to make this a module parameter,
(e.g., "int emulate_mwait") ?

Default would be 0 (no emulation). 1 would mean "emulate as nop", and
if anyone ever figures out how to do proper page-locking based
emulation we could use 2 to enable that, etc. ?

Not sure we'd want qemu to enable/disable it automatically, though...

What do you all think ?

Thanks,
--Gabriel

> 
> So AFAICT, linux prefers to use mwait for idling if cpuid tells it that
> it's available. If we keep telling everyone that we do NOT have monitor
> and mwait available, compliant guests will never end up using them, and
> this hack would remain completely invisible to them, which is good
> (better to use hlt-based idle loops when you're a vm guest, that would
> actually allow the host to relax while you're halted :)
> 
> So the only time anyone would be able to tell we have this hack would be
> when they're about to receive an invalid opcode for using monitor/mwait
> in violation of what CPUID (would have) told them. That's what happens
> to OS X prior to 10.8, which is when I'm hypothesizing the Apple devs
> begain to seriously think about their OS running as a vm guest (on fusion
> and parallels)...
> 
> Instead of killing the misbehaving guest with an invalid opcode, we'd
> allow them to peg the host CPU with their monitor == mwait == nop idle
> loop instead, which, at least on OS X, should be tolerable long enough
> to run 'rm -rf System/Library/Extensions/AppleIntelCPUPowerManagement.kext'
> and reboot the guest, after which things would settle down by reverting
> the guest to a hlt-based idle loop.
> 
> The only reason I can think of to add functionality for enabling/disabling
> this hack would be to protect against a malicious guest which would use
> mwait *on purpose* to peg the host CPU. But a malicious guest could just
> run "for(;;);" in ring 0 and accomplish the same goal, so we wouldn't
> really gain anything in exchange for the added complexity...
> 
> Thanks,
>   Gabriel
> 
>  arch/x86/kvm/cpuid.c |  2 ++
>  arch/x86/kvm/svm.c   | 28 ++++++++++++++++++++--------
>  arch/x86/kvm/vmx.c   | 20 ++++++++++++++++----
>  3 files changed, 38 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index f47a104..d094fc6 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -283,6 +283,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>  		0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
>  	/* cpuid 1.ecx */
>  	const u32 kvm_supported_word4_x86_features =
> +		/* NOTE: MONITOR (and MWAIT) are emulated as NOP,
> +		 * but *not* advertised to guests via CPUID ! */
>  		F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
>  		0 /* DS-CPL, VMX, SMX, EST */ |
>  		0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 7f4f9c2..0b7d58d 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -2770,12 +2770,6 @@ static int xsetbv_interception(struct vcpu_svm *svm)
>  	return 1;
>  }
>  
> -static int invalid_op_interception(struct vcpu_svm *svm)
> -{
> -	kvm_queue_exception(&svm->vcpu, UD_VECTOR);
> -	return 1;
> -}
> -
>  static int task_switch_interception(struct vcpu_svm *svm)
>  {
>  	u16 tss_selector;
> @@ -3287,6 +3281,24 @@ static int pause_interception(struct vcpu_svm *svm)
>  	return 1;
>  }
>  
> +static int nop_interception(struct vcpu_svm *svm)
> +{
> +	skip_emulated_instruction(&(svm->vcpu));
> +	return 1;
> +}
> +
> +static int monitor_interception(struct vcpu_svm *svm)
> +{
> +	printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n");
> +	return nop_interception(svm);
> +}
> +
> +static int mwait_interception(struct vcpu_svm *svm)
> +{
> +	printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n");
> +	return nop_interception(svm);
> +}
> +
>  static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = {
>  	[SVM_EXIT_READ_CR0]			= cr_interception,
>  	[SVM_EXIT_READ_CR3]			= cr_interception,
> @@ -3344,8 +3356,8 @@ static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = {
>  	[SVM_EXIT_CLGI]				= clgi_interception,
>  	[SVM_EXIT_SKINIT]			= skinit_interception,
>  	[SVM_EXIT_WBINVD]                       = emulate_on_interception,
> -	[SVM_EXIT_MONITOR]			= invalid_op_interception,
> -	[SVM_EXIT_MWAIT]			= invalid_op_interception,
> +	[SVM_EXIT_MONITOR]			= monitor_interception,
> +	[SVM_EXIT_MWAIT]			= mwait_interception,
>  	[SVM_EXIT_XSETBV]			= xsetbv_interception,
>  	[SVM_EXIT_NPF]				= pf_interception,
>  };
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 33e8c02..3ccbcb1 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -5669,12 +5669,24 @@ static int handle_pause(struct kvm_vcpu *vcpu)
>  	return 1;
>  }
>  
> -static int handle_invalid_op(struct kvm_vcpu *vcpu)
> +static int handle_nop(struct kvm_vcpu *vcpu)
>  {
> -	kvm_queue_exception(vcpu, UD_VECTOR);
> +	skip_emulated_instruction(vcpu);
>  	return 1;
>  }
>  
> +static int handle_mwait(struct kvm_vcpu *vcpu)
> +{
> +	printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n");
> +	return handle_nop(vcpu);
> +}
> +
> +static int handle_monitor(struct kvm_vcpu *vcpu)
> +{
> +	printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n");
> +	return handle_nop(vcpu);
> +}
> +
>  /*
>   * To run an L2 guest, we need a vmcs02 based on the L1-specified vmcs12.
>   * We could reuse a single VMCS for all the L2 guests, but we also want the
> @@ -6571,8 +6583,8 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
>  	[EXIT_REASON_EPT_VIOLATION]	      = handle_ept_violation,
>  	[EXIT_REASON_EPT_MISCONFIG]           = handle_ept_misconfig,
>  	[EXIT_REASON_PAUSE_INSTRUCTION]       = handle_pause,
> -	[EXIT_REASON_MWAIT_INSTRUCTION]	      = handle_invalid_op,
> -	[EXIT_REASON_MONITOR_INSTRUCTION]     = handle_invalid_op,
> +	[EXIT_REASON_MWAIT_INSTRUCTION]	      = handle_mwait,
> +	[EXIT_REASON_MONITOR_INSTRUCTION]     = handle_monitor,
>  	[EXIT_REASON_INVEPT]                  = handle_invept,
>  };
>  
> -- 
> 1.9.0
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander Graf June 2, 2014, 7:48 p.m. UTC | #2
> Am 02.06.2014 um 21:25 schrieb "Gabriel L. Somlo" <gsomlo@gmail.com>:
> 
>> On Wed, May 07, 2014 at 04:52:13PM -0400, Gabriel L. Somlo wrote:
>> Treat monitor and mwait instructions as nop, which is architecturally
>> correct (but inefficient) behavior. We do this to prevent misbehaving
>> guests (e.g. OS X <= 10.7) from crashing after they fail to check for
>> monitor/mwait availability via cpuid.
>> 
>> Since mwait-based idle loops relying on these nop-emulated instructions
>> would keep the host CPU pegged at 100%, do NOT advertise their presence
>> via cpuid, to prevent compliant guests from using them inadvertently.
>> 
>> Signed-off-by: Gabriel L. Somlo <somlo@cmu.edu>
>> ---
>> 
>> New in v2: remove invalid_op handler functions which were only used to
>>           handle exits caused by monitor and mwait
>> 
>>> On Wed, May 07, 2014 at 08:31:27PM +0200, Alexander Graf wrote:
>>>> On 05/07/2014 08:15 PM, Michael S. Tsirkin wrote:
>>>> If we really want to be paranoid and worry about guests
>>>> that use this strange way to trigger invalid opcode,
>>>> we can make it possible for userspace to enable/disable
>>>> this hack, and teach qemu to set it.
>>>> 
>>>> That would make it even safer than it was.
>>>> 
>>>> Not sure it's worth it, just a thought.
>>> 
>>> Since we don't trap on non-exposed other instructions (new SSE and
>>> whatdoiknow) I don't think it's really bad to just expose
>>> MONITOR/MWAIT as nops.
> 
> Would it make sense to make this a module parameter,
> (e.g., "int emulate_mwait") ?
> 
> Default would be 0 (no emulation). 1 would mean "emulate as nop", and
> if anyone ever figures out how to do proper page-locking based
> emulation we could use 2 to enable that, etc. ?
> 
> Not sure we'd want qemu to enable/disable it automatically, though...
> 
> What do you all think ?

I don't like module parameters - they're system global and there's a good chance you want to run non-osx in parallel ;).

I'd either link this to the cpuid bits or enable it forcefully through ENABLE_CAP per vcpu.

Alex

> 
> Thanks,
> --Gabriel
> 
>> 
>> So AFAICT, linux prefers to use mwait for idling if cpuid tells it that
>> it's available. If we keep telling everyone that we do NOT have monitor
>> and mwait available, compliant guests will never end up using them, and
>> this hack would remain completely invisible to them, which is good
>> (better to use hlt-based idle loops when you're a vm guest, that would
>> actually allow the host to relax while you're halted :)
>> 
>> So the only time anyone would be able to tell we have this hack would be
>> when they're about to receive an invalid opcode for using monitor/mwait
>> in violation of what CPUID (would have) told them. That's what happens
>> to OS X prior to 10.8, which is when I'm hypothesizing the Apple devs
>> begain to seriously think about their OS running as a vm guest (on fusion
>> and parallels)...
>> 
>> Instead of killing the misbehaving guest with an invalid opcode, we'd
>> allow them to peg the host CPU with their monitor == mwait == nop idle
>> loop instead, which, at least on OS X, should be tolerable long enough
>> to run 'rm -rf System/Library/Extensions/AppleIntelCPUPowerManagement.kext'
>> and reboot the guest, after which things would settle down by reverting
>> the guest to a hlt-based idle loop.
>> 
>> The only reason I can think of to add functionality for enabling/disabling
>> this hack would be to protect against a malicious guest which would use
>> mwait *on purpose* to peg the host CPU. But a malicious guest could just
>> run "for(;;);" in ring 0 and accomplish the same goal, so we wouldn't
>> really gain anything in exchange for the added complexity...
>> 
>> Thanks,
>>  Gabriel
>> 
>> arch/x86/kvm/cpuid.c |  2 ++
>> arch/x86/kvm/svm.c   | 28 ++++++++++++++++++++--------
>> arch/x86/kvm/vmx.c   | 20 ++++++++++++++++----
>> 3 files changed, 38 insertions(+), 12 deletions(-)
>> 
>> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
>> index f47a104..d094fc6 100644
>> --- a/arch/x86/kvm/cpuid.c
>> +++ b/arch/x86/kvm/cpuid.c
>> @@ -283,6 +283,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>>        0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
>>    /* cpuid 1.ecx */
>>    const u32 kvm_supported_word4_x86_features =
>> +        /* NOTE: MONITOR (and MWAIT) are emulated as NOP,
>> +         * but *not* advertised to guests via CPUID ! */
>>        F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
>>        0 /* DS-CPL, VMX, SMX, EST */ |
>>        0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>> index 7f4f9c2..0b7d58d 100644
>> --- a/arch/x86/kvm/svm.c
>> +++ b/arch/x86/kvm/svm.c
>> @@ -2770,12 +2770,6 @@ static int xsetbv_interception(struct vcpu_svm *svm)
>>    return 1;
>> }
>> 
>> -static int invalid_op_interception(struct vcpu_svm *svm)
>> -{
>> -    kvm_queue_exception(&svm->vcpu, UD_VECTOR);
>> -    return 1;
>> -}
>> -
>> static int task_switch_interception(struct vcpu_svm *svm)
>> {
>>    u16 tss_selector;
>> @@ -3287,6 +3281,24 @@ static int pause_interception(struct vcpu_svm *svm)
>>    return 1;
>> }
>> 
>> +static int nop_interception(struct vcpu_svm *svm)
>> +{
>> +    skip_emulated_instruction(&(svm->vcpu));
>> +    return 1;
>> +}
>> +
>> +static int monitor_interception(struct vcpu_svm *svm)
>> +{
>> +    printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n");
>> +    return nop_interception(svm);
>> +}
>> +
>> +static int mwait_interception(struct vcpu_svm *svm)
>> +{
>> +    printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n");
>> +    return nop_interception(svm);
>> +}
>> +
>> static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = {
>>    [SVM_EXIT_READ_CR0]            = cr_interception,
>>    [SVM_EXIT_READ_CR3]            = cr_interception,
>> @@ -3344,8 +3356,8 @@ static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = {
>>    [SVM_EXIT_CLGI]                = clgi_interception,
>>    [SVM_EXIT_SKINIT]            = skinit_interception,
>>    [SVM_EXIT_WBINVD]                       = emulate_on_interception,
>> -    [SVM_EXIT_MONITOR]            = invalid_op_interception,
>> -    [SVM_EXIT_MWAIT]            = invalid_op_interception,
>> +    [SVM_EXIT_MONITOR]            = monitor_interception,
>> +    [SVM_EXIT_MWAIT]            = mwait_interception,
>>    [SVM_EXIT_XSETBV]            = xsetbv_interception,
>>    [SVM_EXIT_NPF]                = pf_interception,
>> };
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index 33e8c02..3ccbcb1 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -5669,12 +5669,24 @@ static int handle_pause(struct kvm_vcpu *vcpu)
>>    return 1;
>> }
>> 
>> -static int handle_invalid_op(struct kvm_vcpu *vcpu)
>> +static int handle_nop(struct kvm_vcpu *vcpu)
>> {
>> -    kvm_queue_exception(vcpu, UD_VECTOR);
>> +    skip_emulated_instruction(vcpu);
>>    return 1;
>> }
>> 
>> +static int handle_mwait(struct kvm_vcpu *vcpu)
>> +{
>> +    printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n");
>> +    return handle_nop(vcpu);
>> +}
>> +
>> +static int handle_monitor(struct kvm_vcpu *vcpu)
>> +{
>> +    printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n");
>> +    return handle_nop(vcpu);
>> +}
>> +
>> /*
>>  * To run an L2 guest, we need a vmcs02 based on the L1-specified vmcs12.
>>  * We could reuse a single VMCS for all the L2 guests, but we also want the
>> @@ -6571,8 +6583,8 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
>>    [EXIT_REASON_EPT_VIOLATION]          = handle_ept_violation,
>>    [EXIT_REASON_EPT_MISCONFIG]           = handle_ept_misconfig,
>>    [EXIT_REASON_PAUSE_INSTRUCTION]       = handle_pause,
>> -    [EXIT_REASON_MWAIT_INSTRUCTION]          = handle_invalid_op,
>> -    [EXIT_REASON_MONITOR_INSTRUCTION]     = handle_invalid_op,
>> +    [EXIT_REASON_MWAIT_INSTRUCTION]          = handle_mwait,
>> +    [EXIT_REASON_MONITOR_INSTRUCTION]     = handle_monitor,
>>    [EXIT_REASON_INVEPT]                  = handle_invept,
>> };
>> 
>> -- 
>> 1.9.0
>> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin June 2, 2014, 8:20 p.m. UTC | #3
On Mon, Jun 02, 2014 at 09:48:19PM +0200, Alexander Graf wrote:
> 
> 
> > Am 02.06.2014 um 21:25 schrieb "Gabriel L. Somlo" <gsomlo@gmail.com>:
> > 
> >> On Wed, May 07, 2014 at 04:52:13PM -0400, Gabriel L. Somlo wrote:
> >> Treat monitor and mwait instructions as nop, which is architecturally
> >> correct (but inefficient) behavior. We do this to prevent misbehaving
> >> guests (e.g. OS X <= 10.7) from crashing after they fail to check for
> >> monitor/mwait availability via cpuid.
> >> 
> >> Since mwait-based idle loops relying on these nop-emulated instructions
> >> would keep the host CPU pegged at 100%, do NOT advertise their presence
> >> via cpuid, to prevent compliant guests from using them inadvertently.
> >> 
> >> Signed-off-by: Gabriel L. Somlo <somlo@cmu.edu>
> >> ---
> >> 
> >> New in v2: remove invalid_op handler functions which were only used to
> >>           handle exits caused by monitor and mwait
> >> 
> >>> On Wed, May 07, 2014 at 08:31:27PM +0200, Alexander Graf wrote:
> >>>> On 05/07/2014 08:15 PM, Michael S. Tsirkin wrote:
> >>>> If we really want to be paranoid and worry about guests
> >>>> that use this strange way to trigger invalid opcode,
> >>>> we can make it possible for userspace to enable/disable
> >>>> this hack, and teach qemu to set it.
> >>>> 
> >>>> That would make it even safer than it was.
> >>>> 
> >>>> Not sure it's worth it, just a thought.
> >>> 
> >>> Since we don't trap on non-exposed other instructions (new SSE and
> >>> whatdoiknow) I don't think it's really bad to just expose
> >>> MONITOR/MWAIT as nops.
> > 
> > Would it make sense to make this a module parameter,
> > (e.g., "int emulate_mwait") ?
> > 
> > Default would be 0 (no emulation). 1 would mean "emulate as nop", and
> > if anyone ever figures out how to do proper page-locking based
> > emulation we could use 2 to enable that, etc. ?
> > 
> > Not sure we'd want qemu to enable/disable it automatically, though...
> > 
> > What do you all think ?
> 
> I don't like module parameters - they're system global and there's a good chance you want to run non-osx in parallel ;).
> 
> I'd either link this to the cpuid bits or enable it forcefully through ENABLE_CAP per vcpu.
> 
> Alex

Point is that.
Paolo here thinks it's safe to just make it a NOP unconditionally.
so module parameter would be there as a debugging tool:
as a means for users to test with old kvm behaviour if they see breakage.
Which we don't expect, so no need to waste cycles creating a pretty
interface for it.

> > 
> > Thanks,
> > --Gabriel
> > 
> >> 
> >> So AFAICT, linux prefers to use mwait for idling if cpuid tells it that
> >> it's available. If we keep telling everyone that we do NOT have monitor
> >> and mwait available, compliant guests will never end up using them, and
> >> this hack would remain completely invisible to them, which is good
> >> (better to use hlt-based idle loops when you're a vm guest, that would
> >> actually allow the host to relax while you're halted :)
> >> 
> >> So the only time anyone would be able to tell we have this hack would be
> >> when they're about to receive an invalid opcode for using monitor/mwait
> >> in violation of what CPUID (would have) told them. That's what happens
> >> to OS X prior to 10.8, which is when I'm hypothesizing the Apple devs
> >> begain to seriously think about their OS running as a vm guest (on fusion
> >> and parallels)...
> >> 
> >> Instead of killing the misbehaving guest with an invalid opcode, we'd
> >> allow them to peg the host CPU with their monitor == mwait == nop idle
> >> loop instead, which, at least on OS X, should be tolerable long enough
> >> to run 'rm -rf System/Library/Extensions/AppleIntelCPUPowerManagement.kext'
> >> and reboot the guest, after which things would settle down by reverting
> >> the guest to a hlt-based idle loop.
> >> 
> >> The only reason I can think of to add functionality for enabling/disabling
> >> this hack would be to protect against a malicious guest which would use
> >> mwait *on purpose* to peg the host CPU. But a malicious guest could just
> >> run "for(;;);" in ring 0 and accomplish the same goal, so we wouldn't
> >> really gain anything in exchange for the added complexity...
> >> 
> >> Thanks,
> >>  Gabriel
> >> 
> >> arch/x86/kvm/cpuid.c |  2 ++
> >> arch/x86/kvm/svm.c   | 28 ++++++++++++++++++++--------
> >> arch/x86/kvm/vmx.c   | 20 ++++++++++++++++----
> >> 3 files changed, 38 insertions(+), 12 deletions(-)
> >> 
> >> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> >> index f47a104..d094fc6 100644
> >> --- a/arch/x86/kvm/cpuid.c
> >> +++ b/arch/x86/kvm/cpuid.c
> >> @@ -283,6 +283,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> >>        0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
> >>    /* cpuid 1.ecx */
> >>    const u32 kvm_supported_word4_x86_features =
> >> +        /* NOTE: MONITOR (and MWAIT) are emulated as NOP,
> >> +         * but *not* advertised to guests via CPUID ! */
> >>        F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
> >>        0 /* DS-CPL, VMX, SMX, EST */ |
> >>        0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
> >> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> >> index 7f4f9c2..0b7d58d 100644
> >> --- a/arch/x86/kvm/svm.c
> >> +++ b/arch/x86/kvm/svm.c
> >> @@ -2770,12 +2770,6 @@ static int xsetbv_interception(struct vcpu_svm *svm)
> >>    return 1;
> >> }
> >> 
> >> -static int invalid_op_interception(struct vcpu_svm *svm)
> >> -{
> >> -    kvm_queue_exception(&svm->vcpu, UD_VECTOR);
> >> -    return 1;
> >> -}
> >> -
> >> static int task_switch_interception(struct vcpu_svm *svm)
> >> {
> >>    u16 tss_selector;
> >> @@ -3287,6 +3281,24 @@ static int pause_interception(struct vcpu_svm *svm)
> >>    return 1;
> >> }
> >> 
> >> +static int nop_interception(struct vcpu_svm *svm)
> >> +{
> >> +    skip_emulated_instruction(&(svm->vcpu));
> >> +    return 1;
> >> +}
> >> +
> >> +static int monitor_interception(struct vcpu_svm *svm)
> >> +{
> >> +    printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n");
> >> +    return nop_interception(svm);
> >> +}
> >> +
> >> +static int mwait_interception(struct vcpu_svm *svm)
> >> +{
> >> +    printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n");
> >> +    return nop_interception(svm);
> >> +}
> >> +
> >> static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = {
> >>    [SVM_EXIT_READ_CR0]            = cr_interception,
> >>    [SVM_EXIT_READ_CR3]            = cr_interception,
> >> @@ -3344,8 +3356,8 @@ static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = {
> >>    [SVM_EXIT_CLGI]                = clgi_interception,
> >>    [SVM_EXIT_SKINIT]            = skinit_interception,
> >>    [SVM_EXIT_WBINVD]                       = emulate_on_interception,
> >> -    [SVM_EXIT_MONITOR]            = invalid_op_interception,
> >> -    [SVM_EXIT_MWAIT]            = invalid_op_interception,
> >> +    [SVM_EXIT_MONITOR]            = monitor_interception,
> >> +    [SVM_EXIT_MWAIT]            = mwait_interception,
> >>    [SVM_EXIT_XSETBV]            = xsetbv_interception,
> >>    [SVM_EXIT_NPF]                = pf_interception,
> >> };
> >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >> index 33e8c02..3ccbcb1 100644
> >> --- a/arch/x86/kvm/vmx.c
> >> +++ b/arch/x86/kvm/vmx.c
> >> @@ -5669,12 +5669,24 @@ static int handle_pause(struct kvm_vcpu *vcpu)
> >>    return 1;
> >> }
> >> 
> >> -static int handle_invalid_op(struct kvm_vcpu *vcpu)
> >> +static int handle_nop(struct kvm_vcpu *vcpu)
> >> {
> >> -    kvm_queue_exception(vcpu, UD_VECTOR);
> >> +    skip_emulated_instruction(vcpu);
> >>    return 1;
> >> }
> >> 
> >> +static int handle_mwait(struct kvm_vcpu *vcpu)
> >> +{
> >> +    printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n");
> >> +    return handle_nop(vcpu);
> >> +}
> >> +
> >> +static int handle_monitor(struct kvm_vcpu *vcpu)
> >> +{
> >> +    printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n");
> >> +    return handle_nop(vcpu);
> >> +}
> >> +
> >> /*
> >>  * To run an L2 guest, we need a vmcs02 based on the L1-specified vmcs12.
> >>  * We could reuse a single VMCS for all the L2 guests, but we also want the
> >> @@ -6571,8 +6583,8 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
> >>    [EXIT_REASON_EPT_VIOLATION]          = handle_ept_violation,
> >>    [EXIT_REASON_EPT_MISCONFIG]           = handle_ept_misconfig,
> >>    [EXIT_REASON_PAUSE_INSTRUCTION]       = handle_pause,
> >> -    [EXIT_REASON_MWAIT_INSTRUCTION]          = handle_invalid_op,
> >> -    [EXIT_REASON_MONITOR_INSTRUCTION]     = handle_invalid_op,
> >> +    [EXIT_REASON_MWAIT_INSTRUCTION]          = handle_mwait,
> >> +    [EXIT_REASON_MONITOR_INSTRUCTION]     = handle_monitor,
> >>    [EXIT_REASON_INVEPT]                  = handle_invept,
> >> };
> >> 
> >> -- 
> >> 1.9.0
> >> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin June 2, 2014, 8:24 p.m. UTC | #4
On Mon, Jun 02, 2014 at 03:25:30PM -0400, Gabriel L. Somlo wrote:
> On Wed, May 07, 2014 at 04:52:13PM -0400, Gabriel L. Somlo wrote:
> > Treat monitor and mwait instructions as nop, which is architecturally
> > correct (but inefficient) behavior. We do this to prevent misbehaving
> > guests (e.g. OS X <= 10.7) from crashing after they fail to check for
> > monitor/mwait availability via cpuid.
> > 
> > Since mwait-based idle loops relying on these nop-emulated instructions
> > would keep the host CPU pegged at 100%, do NOT advertise their presence
> > via cpuid, to prevent compliant guests from using them inadvertently.
> > 
> > Signed-off-by: Gabriel L. Somlo <somlo@cmu.edu>
> > ---
> > 
> > New in v2: remove invalid_op handler functions which were only used to
> >            handle exits caused by monitor and mwait
> > 
> > On Wed, May 07, 2014 at 08:31:27PM +0200, Alexander Graf wrote:
> > > On 05/07/2014 08:15 PM, Michael S. Tsirkin wrote:
> > > >If we really want to be paranoid and worry about guests
> > > >that use this strange way to trigger invalid opcode,
> > > >we can make it possible for userspace to enable/disable
> > > >this hack, and teach qemu to set it.
> > > >
> > > >That would make it even safer than it was.
> > > >
> > > >Not sure it's worth it, just a thought.
> > > 
> > > Since we don't trap on non-exposed other instructions (new SSE and
> > > whatdoiknow) I don't think it's really bad to just expose
> > > MONITOR/MWAIT as nops.
> 
> Would it make sense to make this a module parameter,
> (e.g., "int emulate_mwait") ?
> 
> Default would be 0 (no emulation). 1 would mean "emulate as nop", and
> if anyone ever figures out how to do proper page-locking based
> emulation we could use 2 to enable that, etc. ?

If it's a module parameter, default should be one that's
safe for all guests, that would be 1 in the list above.

> Not sure we'd want qemu to enable/disable it automatically, though...
> 
> What do you all think ?
> 
> Thanks,
> --Gabriel
> 
> > 
> > So AFAICT, linux prefers to use mwait for idling if cpuid tells it that
> > it's available. If we keep telling everyone that we do NOT have monitor
> > and mwait available, compliant guests will never end up using them, and
> > this hack would remain completely invisible to them, which is good
> > (better to use hlt-based idle loops when you're a vm guest, that would
> > actually allow the host to relax while you're halted :)
> > 
> > So the only time anyone would be able to tell we have this hack would be
> > when they're about to receive an invalid opcode for using monitor/mwait
> > in violation of what CPUID (would have) told them. That's what happens
> > to OS X prior to 10.8, which is when I'm hypothesizing the Apple devs
> > begain to seriously think about their OS running as a vm guest (on fusion
> > and parallels)...
> > 
> > Instead of killing the misbehaving guest with an invalid opcode, we'd
> > allow them to peg the host CPU with their monitor == mwait == nop idle
> > loop instead, which, at least on OS X, should be tolerable long enough
> > to run 'rm -rf System/Library/Extensions/AppleIntelCPUPowerManagement.kext'
> > and reboot the guest, after which things would settle down by reverting
> > the guest to a hlt-based idle loop.
> > 
> > The only reason I can think of to add functionality for enabling/disabling
> > this hack would be to protect against a malicious guest which would use
> > mwait *on purpose* to peg the host CPU. But a malicious guest could just
> > run "for(;;);" in ring 0 and accomplish the same goal, so we wouldn't
> > really gain anything in exchange for the added complexity...
> > 
> > Thanks,
> >   Gabriel
> > 
> >  arch/x86/kvm/cpuid.c |  2 ++
> >  arch/x86/kvm/svm.c   | 28 ++++++++++++++++++++--------
> >  arch/x86/kvm/vmx.c   | 20 ++++++++++++++++----
> >  3 files changed, 38 insertions(+), 12 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > index f47a104..d094fc6 100644
> > --- a/arch/x86/kvm/cpuid.c
> > +++ b/arch/x86/kvm/cpuid.c
> > @@ -283,6 +283,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> >  		0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
> >  	/* cpuid 1.ecx */
> >  	const u32 kvm_supported_word4_x86_features =
> > +		/* NOTE: MONITOR (and MWAIT) are emulated as NOP,
> > +		 * but *not* advertised to guests via CPUID ! */
> >  		F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
> >  		0 /* DS-CPL, VMX, SMX, EST */ |
> >  		0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
> > diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> > index 7f4f9c2..0b7d58d 100644
> > --- a/arch/x86/kvm/svm.c
> > +++ b/arch/x86/kvm/svm.c
> > @@ -2770,12 +2770,6 @@ static int xsetbv_interception(struct vcpu_svm *svm)
> >  	return 1;
> >  }
> >  
> > -static int invalid_op_interception(struct vcpu_svm *svm)
> > -{
> > -	kvm_queue_exception(&svm->vcpu, UD_VECTOR);
> > -	return 1;
> > -}
> > -
> >  static int task_switch_interception(struct vcpu_svm *svm)
> >  {
> >  	u16 tss_selector;
> > @@ -3287,6 +3281,24 @@ static int pause_interception(struct vcpu_svm *svm)
> >  	return 1;
> >  }
> >  
> > +static int nop_interception(struct vcpu_svm *svm)
> > +{
> > +	skip_emulated_instruction(&(svm->vcpu));
> > +	return 1;
> > +}
> > +
> > +static int monitor_interception(struct vcpu_svm *svm)
> > +{
> > +	printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n");
> > +	return nop_interception(svm);
> > +}
> > +
> > +static int mwait_interception(struct vcpu_svm *svm)
> > +{
> > +	printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n");
> > +	return nop_interception(svm);
> > +}
> > +
> >  static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = {
> >  	[SVM_EXIT_READ_CR0]			= cr_interception,
> >  	[SVM_EXIT_READ_CR3]			= cr_interception,
> > @@ -3344,8 +3356,8 @@ static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = {
> >  	[SVM_EXIT_CLGI]				= clgi_interception,
> >  	[SVM_EXIT_SKINIT]			= skinit_interception,
> >  	[SVM_EXIT_WBINVD]                       = emulate_on_interception,
> > -	[SVM_EXIT_MONITOR]			= invalid_op_interception,
> > -	[SVM_EXIT_MWAIT]			= invalid_op_interception,
> > +	[SVM_EXIT_MONITOR]			= monitor_interception,
> > +	[SVM_EXIT_MWAIT]			= mwait_interception,
> >  	[SVM_EXIT_XSETBV]			= xsetbv_interception,
> >  	[SVM_EXIT_NPF]				= pf_interception,
> >  };
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 33e8c02..3ccbcb1 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -5669,12 +5669,24 @@ static int handle_pause(struct kvm_vcpu *vcpu)
> >  	return 1;
> >  }
> >  
> > -static int handle_invalid_op(struct kvm_vcpu *vcpu)
> > +static int handle_nop(struct kvm_vcpu *vcpu)
> >  {
> > -	kvm_queue_exception(vcpu, UD_VECTOR);
> > +	skip_emulated_instruction(vcpu);
> >  	return 1;
> >  }
> >  
> > +static int handle_mwait(struct kvm_vcpu *vcpu)
> > +{
> > +	printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n");
> > +	return handle_nop(vcpu);
> > +}
> > +
> > +static int handle_monitor(struct kvm_vcpu *vcpu)
> > +{
> > +	printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n");
> > +	return handle_nop(vcpu);
> > +}
> > +
> >  /*
> >   * To run an L2 guest, we need a vmcs02 based on the L1-specified vmcs12.
> >   * We could reuse a single VMCS for all the L2 guests, but we also want the
> > @@ -6571,8 +6583,8 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
> >  	[EXIT_REASON_EPT_VIOLATION]	      = handle_ept_violation,
> >  	[EXIT_REASON_EPT_MISCONFIG]           = handle_ept_misconfig,
> >  	[EXIT_REASON_PAUSE_INSTRUCTION]       = handle_pause,
> > -	[EXIT_REASON_MWAIT_INSTRUCTION]	      = handle_invalid_op,
> > -	[EXIT_REASON_MONITOR_INSTRUCTION]     = handle_invalid_op,
> > +	[EXIT_REASON_MWAIT_INSTRUCTION]	      = handle_mwait,
> > +	[EXIT_REASON_MONITOR_INSTRUCTION]     = handle_monitor,
> >  	[EXIT_REASON_INVEPT]                  = handle_invept,
> >  };
> >  
> > -- 
> > 1.9.0
> > 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander Graf June 2, 2014, 8:35 p.m. UTC | #5
> Am 02.06.2014 um 22:20 schrieb "Michael S. Tsirkin" <mst@redhat.com>:
> 
>> On Mon, Jun 02, 2014 at 09:48:19PM +0200, Alexander Graf wrote:
>> 
>> 
>>>> Am 02.06.2014 um 21:25 schrieb "Gabriel L. Somlo" <gsomlo@gmail.com>:
>>>> 
>>>> On Wed, May 07, 2014 at 04:52:13PM -0400, Gabriel L. Somlo wrote:
>>>> Treat monitor and mwait instructions as nop, which is architecturally
>>>> correct (but inefficient) behavior. We do this to prevent misbehaving
>>>> guests (e.g. OS X <= 10.7) from crashing after they fail to check for
>>>> monitor/mwait availability via cpuid.
>>>> 
>>>> Since mwait-based idle loops relying on these nop-emulated instructions
>>>> would keep the host CPU pegged at 100%, do NOT advertise their presence
>>>> via cpuid, to prevent compliant guests from using them inadvertently.
>>>> 
>>>> Signed-off-by: Gabriel L. Somlo <somlo@cmu.edu>
>>>> ---
>>>> 
>>>> New in v2: remove invalid_op handler functions which were only used to
>>>>          handle exits caused by monitor and mwait
>>>> 
>>>>>> On Wed, May 07, 2014 at 08:31:27PM +0200, Alexander Graf wrote:
>>>>>> On 05/07/2014 08:15 PM, Michael S. Tsirkin wrote:
>>>>>> If we really want to be paranoid and worry about guests
>>>>>> that use this strange way to trigger invalid opcode,
>>>>>> we can make it possible for userspace to enable/disable
>>>>>> this hack, and teach qemu to set it.
>>>>>> 
>>>>>> That would make it even safer than it was.
>>>>>> 
>>>>>> Not sure it's worth it, just a thought.
>>>>> 
>>>>> Since we don't trap on non-exposed other instructions (new SSE and
>>>>> whatdoiknow) I don't think it's really bad to just expose
>>>>> MONITOR/MWAIT as nops.
>>> 
>>> Would it make sense to make this a module parameter,
>>> (e.g., "int emulate_mwait") ?
>>> 
>>> Default would be 0 (no emulation). 1 would mean "emulate as nop", and
>>> if anyone ever figures out how to do proper page-locking based
>>> emulation we could use 2 to enable that, etc. ?
>>> 
>>> Not sure we'd want qemu to enable/disable it automatically, though...
>>> 
>>> What do you all think ?
>> 
>> I don't like module parameters - they're system global and there's a good chance you want to run non-osx in parallel ;).
>> 
>> I'd either link this to the cpuid bits or enable it forcefully through ENABLE_CAP per vcpu.
>> 
>> Alex
> 
> Point is that.
> Paolo here thinks it's safe to just make it a NOP unconditionally.
> so module parameter would be there as a debugging tool:
> as a means for users to test with old kvm behaviour if they see breakage.
> Which we don't expect, so no need to waste cycles creating a pretty
> interface for it.

Both interfaces already exist, so where's the problem? I'm fine with making it always nop too though.

Gabriel was asking how to make it switchable - and the only thing I'd nak is a module parameter because it's not useful.


Alex


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin June 2, 2014, 8:41 p.m. UTC | #6
On Mon, Jun 02, 2014 at 10:35:56PM +0200, Alexander Graf wrote:
> 
> 
> > Am 02.06.2014 um 22:20 schrieb "Michael S. Tsirkin" <mst@redhat.com>:
> > 
> >> On Mon, Jun 02, 2014 at 09:48:19PM +0200, Alexander Graf wrote:
> >> 
> >> 
> >>>> Am 02.06.2014 um 21:25 schrieb "Gabriel L. Somlo" <gsomlo@gmail.com>:
> >>>> 
> >>>> On Wed, May 07, 2014 at 04:52:13PM -0400, Gabriel L. Somlo wrote:
> >>>> Treat monitor and mwait instructions as nop, which is architecturally
> >>>> correct (but inefficient) behavior. We do this to prevent misbehaving
> >>>> guests (e.g. OS X <= 10.7) from crashing after they fail to check for
> >>>> monitor/mwait availability via cpuid.
> >>>> 
> >>>> Since mwait-based idle loops relying on these nop-emulated instructions
> >>>> would keep the host CPU pegged at 100%, do NOT advertise their presence
> >>>> via cpuid, to prevent compliant guests from using them inadvertently.
> >>>> 
> >>>> Signed-off-by: Gabriel L. Somlo <somlo@cmu.edu>
> >>>> ---
> >>>> 
> >>>> New in v2: remove invalid_op handler functions which were only used to
> >>>>          handle exits caused by monitor and mwait
> >>>> 
> >>>>>> On Wed, May 07, 2014 at 08:31:27PM +0200, Alexander Graf wrote:
> >>>>>> On 05/07/2014 08:15 PM, Michael S. Tsirkin wrote:
> >>>>>> If we really want to be paranoid and worry about guests
> >>>>>> that use this strange way to trigger invalid opcode,
> >>>>>> we can make it possible for userspace to enable/disable
> >>>>>> this hack, and teach qemu to set it.
> >>>>>> 
> >>>>>> That would make it even safer than it was.
> >>>>>> 
> >>>>>> Not sure it's worth it, just a thought.
> >>>>> 
> >>>>> Since we don't trap on non-exposed other instructions (new SSE and
> >>>>> whatdoiknow) I don't think it's really bad to just expose
> >>>>> MONITOR/MWAIT as nops.
> >>> 
> >>> Would it make sense to make this a module parameter,
> >>> (e.g., "int emulate_mwait") ?
> >>> 
> >>> Default would be 0 (no emulation). 1 would mean "emulate as nop", and
> >>> if anyone ever figures out how to do proper page-locking based
> >>> emulation we could use 2 to enable that, etc. ?
> >>> 
> >>> Not sure we'd want qemu to enable/disable it automatically, though...
> >>> 
> >>> What do you all think ?
> >> 
> >> I don't like module parameters - they're system global and there's a good chance you want to run non-osx in parallel ;).
> >> 
> >> I'd either link this to the cpuid bits or enable it forcefully through ENABLE_CAP per vcpu.
> >> 
> >> Alex
> > 
> > Point is that.
> > Paolo here thinks it's safe to just make it a NOP unconditionally.
> > so module parameter would be there as a debugging tool:
> > as a means for users to test with old kvm behaviour if they see breakage.
> > Which we don't expect, so no need to waste cycles creating a pretty
> > interface for it.
> 
> Both interfaces already exist, so where's the problem?

Hmm sorry which interfaces for enabling mwait nop emulation exist?

> I'm fine with making it always nop too though.
> 
> Gabriel was asking how to make it switchable - and the only thing I'd nak is a module parameter because it's not useful.
> 
> 
> Alex
>
Alexander Graf June 2, 2014, 9:01 p.m. UTC | #7
> Am 02.06.2014 um 22:41 schrieb "Michael S. Tsirkin" <mst@redhat.com>:
> 
>> On Mon, Jun 02, 2014 at 10:35:56PM +0200, Alexander Graf wrote:
>> 
>> 
>>>> Am 02.06.2014 um 22:20 schrieb "Michael S. Tsirkin" <mst@redhat.com>:
>>>> 
>>>> On Mon, Jun 02, 2014 at 09:48:19PM +0200, Alexander Graf wrote:
>>>> 
>>>> 
>>>>>> Am 02.06.2014 um 21:25 schrieb "Gabriel L. Somlo" <gsomlo@gmail.com>:
>>>>>> 
>>>>>> On Wed, May 07, 2014 at 04:52:13PM -0400, Gabriel L. Somlo wrote:
>>>>>> Treat monitor and mwait instructions as nop, which is architecturally
>>>>>> correct (but inefficient) behavior. We do this to prevent misbehaving
>>>>>> guests (e.g. OS X <= 10.7) from crashing after they fail to check for
>>>>>> monitor/mwait availability via cpuid.
>>>>>> 
>>>>>> Since mwait-based idle loops relying on these nop-emulated instructions
>>>>>> would keep the host CPU pegged at 100%, do NOT advertise their presence
>>>>>> via cpuid, to prevent compliant guests from using them inadvertently.
>>>>>> 
>>>>>> Signed-off-by: Gabriel L. Somlo <somlo@cmu.edu>
>>>>>> ---
>>>>>> 
>>>>>> New in v2: remove invalid_op handler functions which were only used to
>>>>>>         handle exits caused by monitor and mwait
>>>>>> 
>>>>>>>> On Wed, May 07, 2014 at 08:31:27PM +0200, Alexander Graf wrote:
>>>>>>>> On 05/07/2014 08:15 PM, Michael S. Tsirkin wrote:
>>>>>>>> If we really want to be paranoid and worry about guests
>>>>>>>> that use this strange way to trigger invalid opcode,
>>>>>>>> we can make it possible for userspace to enable/disable
>>>>>>>> this hack, and teach qemu to set it.
>>>>>>>> 
>>>>>>>> That would make it even safer than it was.
>>>>>>>> 
>>>>>>>> Not sure it's worth it, just a thought.
>>>>>>> 
>>>>>>> Since we don't trap on non-exposed other instructions (new SSE and
>>>>>>> whatdoiknow) I don't think it's really bad to just expose
>>>>>>> MONITOR/MWAIT as nops.
>>>>> 
>>>>> Would it make sense to make this a module parameter,
>>>>> (e.g., "int emulate_mwait") ?
>>>>> 
>>>>> Default would be 0 (no emulation). 1 would mean "emulate as nop", and
>>>>> if anyone ever figures out how to do proper page-locking based
>>>>> emulation we could use 2 to enable that, etc. ?
>>>>> 
>>>>> Not sure we'd want qemu to enable/disable it automatically, though...
>>>>> 
>>>>> What do you all think ?
>>>> 
>>>> I don't like module parameters - they're system global and there's a good chance you want to run non-osx in parallel ;).
>>>> 
>>>> I'd either link this to the cpuid bits or enable it forcefully through ENABLE_CAP per vcpu.
>>>> 
>>>> Alex
>>> 
>>> Point is that.
>>> Paolo here thinks it's safe to just make it a NOP unconditionally.
>>> so module parameter would be there as a debugging tool:
>>> as a means for users to test with old kvm behaviour if they see breakage.
>>> Which we don't expect, so no need to waste cycles creating a pretty
>>> interface for it.
>> 
>> Both interfaces already exist, so where's the problem?
> 
> Hmm sorry which interfaces for enabling mwait nop emulation exist?

User space can force cpuid bits that kvm doesn't return as supported, so we do have a negative-by-default switch.

We also have an ENABLE_CAP ioctl. Enabling the monitor/mwait nop ability explicitly by that is a 5 line patch.

Either way is very flexible and not system wide.


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gabriel L. Somlo June 3, 2014, 1:55 a.m. UTC | #8
On Mon, Jun 02, 2014 at 11:01:07PM +0200, Alexander Graf wrote:
> 
> 
> > Am 02.06.2014 um 22:41 schrieb "Michael S. Tsirkin" <mst@redhat.com>:
> > 
> >> On Mon, Jun 02, 2014 at 10:35:56PM +0200, Alexander Graf wrote:
> >> 
> >> 
> >>>> Am 02.06.2014 um 22:20 schrieb "Michael S. Tsirkin" <mst@redhat.com>:
> >>>> 
> >>>> On Mon, Jun 02, 2014 at 09:48:19PM +0200, Alexander Graf wrote:
> >>>> 
> >>>> 
> >>>>>> Am 02.06.2014 um 21:25 schrieb "Gabriel L. Somlo" <gsomlo@gmail.com>:
> >>>>>> 
> >>>>>> On Wed, May 07, 2014 at 04:52:13PM -0400, Gabriel L. Somlo wrote:
> >>>>>> Treat monitor and mwait instructions as nop, which is architecturally
> >>>>>> correct (but inefficient) behavior. We do this to prevent misbehaving
> >>>>>> guests (e.g. OS X <= 10.7) from crashing after they fail to check for
> >>>>>> monitor/mwait availability via cpuid.
> >>>>>> 
> >>>>>> Since mwait-based idle loops relying on these nop-emulated instructions
> >>>>>> would keep the host CPU pegged at 100%, do NOT advertise their presence
> >>>>>> via cpuid, to prevent compliant guests from using them inadvertently.
> >>>>>> 
> >>>>>> Signed-off-by: Gabriel L. Somlo <somlo@cmu.edu>
> >>>>>> ---
> >>>>>> 
> >>>>>> New in v2: remove invalid_op handler functions which were only used to
> >>>>>>         handle exits caused by monitor and mwait
> >>>>>> 
> >>>>>>>> On Wed, May 07, 2014 at 08:31:27PM +0200, Alexander Graf wrote:
> >>>>>>>> On 05/07/2014 08:15 PM, Michael S. Tsirkin wrote:
> >>>>>>>> If we really want to be paranoid and worry about guests
> >>>>>>>> that use this strange way to trigger invalid opcode,
> >>>>>>>> we can make it possible for userspace to enable/disable
> >>>>>>>> this hack, and teach qemu to set it.
> >>>>>>>> 
> >>>>>>>> That would make it even safer than it was.
> >>>>>>>> 
> >>>>>>>> Not sure it's worth it, just a thought.
> >>>>>>> 
> >>>>>>> Since we don't trap on non-exposed other instructions (new SSE and
> >>>>>>> whatdoiknow) I don't think it's really bad to just expose
> >>>>>>> MONITOR/MWAIT as nops.
> >>>>> 
> >>>>> Would it make sense to make this a module parameter,
> >>>>> (e.g., "int emulate_mwait") ?
> >>>>> 
> >>>>> Default would be 0 (no emulation). 1 would mean "emulate as nop", and
> >>>>> if anyone ever figures out how to do proper page-locking based
> >>>>> emulation we could use 2 to enable that, etc. ?
> >>>>> 
> >>>>> Not sure we'd want qemu to enable/disable it automatically, though...
> >>>>> 
> >>>>> What do you all think ?
> >>>> 
> >>>> I don't like module parameters - they're system global and there's a good chance you want to run non-osx in parallel ;).
> >>>> 
> >>>> I'd either link this to the cpuid bits or enable it forcefully through ENABLE_CAP per vcpu.
> >>>> 
> >>>> Alex
> >>> 
> >>> Point is that.
> >>> Paolo here thinks it's safe to just make it a NOP unconditionally.
> >>> so module parameter would be there as a debugging tool:
> >>> as a means for users to test with old kvm behaviour if they see breakage.
> >>> Which we don't expect, so no need to waste cycles creating a pretty
> >>> interface for it.
> >> 
> >> Both interfaces already exist, so where's the problem?
> > 
> > Hmm sorry which interfaces for enabling mwait nop emulation exist?
> 
> User space can force cpuid bits that kvm doesn't return as supported, so we do have a negative-by-default switch.
> 
> We also have an ENABLE_CAP ioctl. Enabling the monitor/mwait nop ability explicitly by that is a 5 line patch.
> 
> Either way is very flexible and not system wide.

W.r.t. monitor/mwait, a guest can do one of the following:

1. Never check CPUID, and never use monitor/mwait
	- This is great, we don't have to do anything about these

2. Check CPUID for mwait, use it to idle in preference over hlt
	- Linux, Windows, and Mavericks (10.9) do this
	- we never want to have CPUID say "yes" to these, since
	  monitor/mwait support will be clunky in the best case,
	  and hlt is overwhelmingly preferable! [*]

3. Never check CPUID, use monitor/mwait with abandon
	- OS X 10.6 .. 10.8 does this
	- emulating monitor/mwait here allows us to boot the guest
	  and use it, and perform sysadmin surgery to force a hlt
	  based idle

4. Check CPUID, panic if unavailable
	- OS X 10.5 did this, IIRC.
	- whether I can do kext surgery and get it to stop checking
	  CPUID *in addition to* falling back to hlt-based idle is
	  TBD.
	- emulating monitor/mwait allows us to boot this type of
	  guest, BUT WE ALSO HAVE TO ADVERTISE IT VIA CPUID !!!

I like telling qemu on the command line "do monitor = mwait = nop;
for this guest only", and having qemu pass that on to KVM for only the
VCPUs associated with this guest, optionally, for cases 3 and 4 only
(everyone else gets the invalid opcode fault behavior as before).


[*] I think we've been over this a few times already, but here's a
    quick recap:

	- monitor == mwait == NOP is correct (albeitwasteful) behavior
		- mwait MUST expect and deal with spurious wakeups
		  (per the Intel manual)
		- mwait == nop is an INSTANT spurious wakeup (hence
		  works OK with any correctly written program) !
		- monitor == nop won't "arm" anything, but that
		  doesn't matter if mwait always immediately wakes up !

		- this pegs the host CPU to 100%, so MUCH worse than
		  hlt, shouldn't do it unless we ABSOLUTELY HAVE TO !!!

	- guest-mode mwait should NEVER be allowed to stop the host CPU
	  (and, according to the Intel manual, it's HARD to try and
	  make it do so, which I think is on purpose !)

	- instead, guest-mode mwait should map to a host-side
	  condition-wait (where a write to a monitor-ed area
	  acts as condition-signal).

		- the most likely way to implement something like that
		  would be to write-protect pages and handle write faults
		- and I never got it working *properly* (but I'm a n00b,
		  so that ain't saying much :)
		- but the granularity would be all wrong compared to any
		  real CPU (1 page >> typical monitored area size)
		- but I still don't see it being any better than
		  hlt-based idle, even if we *did* get it to work correctly !!!



I'll look into ENABLE_CAP, and how to expose that on the qemu command
line (I think I might need both methods mentioned by Alex in tandem,
but I'll have to study existing examples before I can say anything
useful here). Any extra words of wisdom on how to do that, what
examples might be best to study for inspiration, etc, much appreciated !!!

Thanks,
--Gabriel
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paolo Bonzini June 3, 2014, 9:17 a.m. UTC | #9
Il 02/06/2014 21:25, Gabriel L. Somlo ha scritto:
> Would it make sense to make this a module parameter,
> (e.g., "int emulate_mwait") ?
>
> Default would be 0 (no emulation). 1 would mean "emulate as nop", and
> if anyone ever figures out how to do proper page-locking based
> emulation we could use 2 to enable that, etc. ?
>
> Not sure we'd want qemu to enable/disable it automatically, though...
>
> What do you all think ?

I think it's fine as it is now. :)

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gabriel L. Somlo June 3, 2014, 2:21 p.m. UTC | #10
On Tue, Jun 03, 2014 at 11:17:48AM +0200, Paolo Bonzini wrote:
> 
> I think it's fine as it is now. :)

On Mon, Jun 02, 2014 at 09:55:18PM -0400, Gabriel L. Somlo wrote:
> 
> W.r.t. monitor/mwait, a guest can do one of the following:
> 
> 1. Never check CPUID, and never use monitor/mwait
> 	- This is great, we don't have to do anything about these
> 
> 2. Check CPUID for mwait, use it to idle in preference over hlt
> 	- Linux, Windows, and Mavericks (10.9) do this
> 	- we never want to have CPUID say "yes" to these, since
> 	  monitor/mwait support will be clunky in the best case,
> 	  and hlt is overwhelmingly preferable! [*]
> 
> 3. Never check CPUID, use monitor/mwait with abandon
> 	- OS X 10.6 .. 10.8 does this
> 	- emulating monitor/mwait here allows us to boot the guest
> 	  and use it, and perform sysadmin surgery to force a hlt
> 	  based idle
> 
> 4. Check CPUID, panic if unavailable
> 	- OS X 10.5 did this, IIRC.
> 	- whether I can do kext surgery and get it to stop checking
> 	  CPUID *in addition to* falling back to hlt-based idle is
> 	  TBD.
> 	- emulating monitor/mwait allows us to boot this type of
> 	  guest, BUT WE ALSO HAVE TO ADVERTISE IT VIA CPUID !!!

As it is right now, #4 is not being addressed (and we can't just
advertise mwait via cpuid, or we'd be screwing up #2).

I also feel a bit weird about the "undocumented feature" aspect
of NOT generating an invalid opcode for something that *should*
be an invalid opcode according to the feature set advertised via
cpuid...

So if there's a way to make it so we can tell QEMU/KVM to
"--enable-mwait" on a per-guest basis, I think that'd be better
than an always-on "undocumented" behavior...

But then again, I'm most likely missing something about the big
picture... :)

Thanks much,
--Gabriel
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander Graf June 3, 2014, 3:37 p.m. UTC | #11
On 06/03/2014 04:21 PM, Gabriel L. Somlo wrote:
> On Tue, Jun 03, 2014 at 11:17:48AM +0200, Paolo Bonzini wrote:
>> I think it's fine as it is now. :)
> On Mon, Jun 02, 2014 at 09:55:18PM -0400, Gabriel L. Somlo wrote:
>> W.r.t. monitor/mwait, a guest can do one of the following:
>>
>> 1. Never check CPUID, and never use monitor/mwait
>> 	- This is great, we don't have to do anything about these
>>
>> 2. Check CPUID for mwait, use it to idle in preference over hlt
>> 	- Linux, Windows, and Mavericks (10.9) do this
>> 	- we never want to have CPUID say "yes" to these, since
>> 	  monitor/mwait support will be clunky in the best case,
>> 	  and hlt is overwhelmingly preferable! [*]
>>
>> 3. Never check CPUID, use monitor/mwait with abandon
>> 	- OS X 10.6 .. 10.8 does this
>> 	- emulating monitor/mwait here allows us to boot the guest
>> 	  and use it, and perform sysadmin surgery to force a hlt
>> 	  based idle
>>
>> 4. Check CPUID, panic if unavailable
>> 	- OS X 10.5 did this, IIRC.
>> 	- whether I can do kext surgery and get it to stop checking
>> 	  CPUID *in addition to* falling back to hlt-based idle is
>> 	  TBD.
>> 	- emulating monitor/mwait allows us to boot this type of
>> 	  guest, BUT WE ALSO HAVE TO ADVERTISE IT VIA CPUID !!!
> As it is right now, #4 is not being addressed (and we can't just
> advertise mwait via cpuid, or we'd be screwing up #2).

I think we should be able to handle #4 by doing -cpu core2duo,+monitor 
on the QEMU command line which should override the cpuid bits that the 
kernel tells us.


Alex

>
> I also feel a bit weird about the "undocumented feature" aspect
> of NOT generating an invalid opcode for something that *should*
> be an invalid opcode according to the feature set advertised via
> cpuid...
>
> So if there's a way to make it so we can tell QEMU/KVM to
> "--enable-mwait" on a per-guest basis, I think that'd be better
> than an always-on "undocumented" behavior...
>
> But then again, I'm most likely missing something about the big
> picture... :)
>
> Thanks much,
> --Gabriel

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gabriel L. Somlo June 3, 2014, 7:07 p.m. UTC | #12
On Tue, Jun 03, 2014 at 05:37:08PM +0200, Alexander Graf wrote:
> On 06/03/2014 04:21 PM, Gabriel L. Somlo wrote:
> >On Tue, Jun 03, 2014 at 11:17:48AM +0200, Paolo Bonzini wrote:
> >>I think it's fine as it is now. :)
> >On Mon, Jun 02, 2014 at 09:55:18PM -0400, Gabriel L. Somlo wrote:
> >>W.r.t. monitor/mwait, a guest can do one of the following:
> >>
> >>1. Never check CPUID, and never use monitor/mwait
> >>	- This is great, we don't have to do anything about these
> >>
> >>2. Check CPUID for mwait, use it to idle in preference over hlt
> >>	- Linux, Windows, and Mavericks (10.9) do this
> >>	- we never want to have CPUID say "yes" to these, since
> >>	  monitor/mwait support will be clunky in the best case,
> >>	  and hlt is overwhelmingly preferable! [*]
> >>
> >>3. Never check CPUID, use monitor/mwait with abandon
> >>	- OS X 10.6 .. 10.8 does this
> >>	- emulating monitor/mwait here allows us to boot the guest
> >>	  and use it, and perform sysadmin surgery to force a hlt
> >>	  based idle
> >>
> >>4. Check CPUID, panic if unavailable
> >>	- OS X 10.5 did this, IIRC.
> >>	- whether I can do kext surgery and get it to stop checking
> >>	  CPUID *in addition to* falling back to hlt-based idle is
> >>	  TBD.
> >>	- emulating monitor/mwait allows us to boot this type of
> >>	  guest, BUT WE ALSO HAVE TO ADVERTISE IT VIA CPUID !!!
> >As it is right now, #4 is not being addressed (and we can't just
> >advertise mwait via cpuid, or we'd be screwing up #2).
> 
> I think we should be able to handle #4 by doing -cpu core2duo,+monitor on
> the QEMU command line which should override the cpuid bits that the kernel
> tells us.

Using that (with kvm patched to do monitor = mwait = nop, but without
advertising it in cpuid), I still get a "Monitor feature not present"
panic with the 10.5.6 Leopard install DVD. Patching kvm to add F(MWAIT)
to kvm_supported_word4_x86_features gets me past that panic. Did you
mean "-cpu core2duo,+monitor" should work right now, or that we'd have
to add a quick patch to QEMU to make it happen ? (it doesn't complain
when I do that now, just doesn't work :)

Thx,
--Gabriel

> 
> 
> Alex
> 
> >
> >I also feel a bit weird about the "undocumented feature" aspect
> >of NOT generating an invalid opcode for something that *should*
> >be an invalid opcode according to the feature set advertised via
> >cpuid...
> >
> >So if there's a way to make it so we can tell QEMU/KVM to
> >"--enable-mwait" on a per-guest basis, I think that'd be better
> >than an always-on "undocumented" behavior...
> >
> >But then again, I'm most likely missing something about the big
> >picture... :)
> >
> >Thanks much,
> >--Gabriel
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gabriel L. Somlo June 4, 2014, 2:39 p.m. UTC | #13
Paolo,

I noticed the monitor=mwait=nop patch is making its way upstream, so
thanks !

I'm still interested in following up with something that would enable
this behavior only conditionally (e.g. following an ioctl call from
userspace to enable it only for the (set of) vcpu(s) belonging to one
guest VM at a time), which should then also include advertising the
feature in CPUID.

I grep-ed through the kvm sources for KVM_CAP for some inspiration,
and it looks more like KVM_CAP_* is a way to tell userspace what the
kernel supports, but nothing I saw showed me an example of a "tunable"
feature that userspace may ask to be turned on or off (e.g per-vcpu).

Is there something like that I could use as an example ?

Obviously, if you really like the current behavior better you can
always reject whatever patch I'll come up with, but I'd like to at
least try and see what it would look like :)

Thanks,
--Gabriel

On Tue, Jun 03, 2014 at 11:17:48AM +0200, Paolo Bonzini wrote:
> Il 02/06/2014 21:25, Gabriel L. Somlo ha scritto:
> >Would it make sense to make this a module parameter,
> >(e.g., "int emulate_mwait") ?
> >
> >Default would be 0 (no emulation). 1 would mean "emulate as nop", and
> >if anyone ever figures out how to do proper page-locking based
> >emulation we could use 2 to enable that, etc. ?
> >
> >Not sure we'd want qemu to enable/disable it automatically, though...
> >
> >What do you all think ?
> 
> I think it's fine as it is now. :)
> 
> Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander Graf June 4, 2014, 2:44 p.m. UTC | #14
On 04.06.14 16:39, Gabriel L. Somlo wrote:
> Paolo,
>
> I noticed the monitor=mwait=nop patch is making its way upstream, so
> thanks !
>
> I'm still interested in following up with something that would enable
> this behavior only conditionally (e.g. following an ioctl call from
> userspace to enable it only for the (set of) vcpu(s) belonging to one
> guest VM at a time), which should then also include advertising the
> feature in CPUID.
>
> I grep-ed through the kvm sources for KVM_CAP for some inspiration,
> and it looks more like KVM_CAP_* is a way to tell userspace what the
> kernel supports, but nothing I saw showed me an example of a "tunable"
> feature that userspace may ask to be turned on or off (e.g per-vcpu).
>
> Is there something like that I could use as an example ?

Sure, we use it all over the place on PPC :).

> Obviously, if you really like the current behavior better you can
> always reject whatever patch I'll come up with, but I'd like to at
> least try and see what it would look like :)

I think it's perfectly fine to leave mwait always implemented as NOP - 
it's valid behavior.

As for the CPUID exposure, that should be a pure QEMU thing. If 
overriding CPUID bits the kernel mask tells us doesn't work today, we 
should just make it possible :).

Eventually I really think that -cpu foo,+mwait,+monitor or whatever the 
bits are should override any safety net that KVM gives us on features it 
thinks are safe to use.


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gabriel L. Somlo June 4, 2014, 3:05 p.m. UTC | #15
On Wed, Jun 04, 2014 at 04:44:13PM +0200, Alexander Graf wrote:
> 
> On 04.06.14 16:39, Gabriel L. Somlo wrote:
> >Paolo,
> >
> >I noticed the monitor=mwait=nop patch is making its way upstream, so
> >thanks !
> >
> >I'm still interested in following up with something that would enable
> >this behavior only conditionally (e.g. following an ioctl call from
> >userspace to enable it only for the (set of) vcpu(s) belonging to one
> >guest VM at a time), which should then also include advertising the
> >feature in CPUID.
> >
> >I grep-ed through the kvm sources for KVM_CAP for some inspiration,
> >and it looks more like KVM_CAP_* is a way to tell userspace what the
> >kernel supports, but nothing I saw showed me an example of a "tunable"
> >feature that userspace may ask to be turned on or off (e.g per-vcpu).
> >
> >Is there something like that I could use as an example ?
> 
> Sure, we use it all over the place on PPC :).

Allright, I'll grep harder, then :)

> >Obviously, if you really like the current behavior better you can
> >always reject whatever patch I'll come up with, but I'd like to at
> >least try and see what it would look like :)
> 
> I think it's perfectly fine to leave mwait always implemented as NOP - it's
> valid behavior.

NOP is valid MWAIT behavior, *unless* MWAIT should generate an invalid
opcode (i.e., if CPUID says mwait not supported). In that respect,
we're cheating only to hook up guests which misbehave. I'd feel less
"dirty" if I could explicitly tell KVM "ok, just this once is OK, but
don't make a habit of it" :)

> As for the CPUID exposure, that should be a pure QEMU thing. If overriding
> CPUID bits the kernel mask tells us doesn't work today, we should just make
> it possible :).
> 
> Eventually I really think that -cpu foo,+mwait,+monitor or whatever the bits
> are should override any safety net that KVM gives us on features it thinks
> are safe to use.

I need to look at the qemu source, doing what you said
(+monitor,+mwait,+whatever) right now "works", doesn't generate an error,
but silently ignores you if it's not implemented. So I'd actually have to
generate a patch to make something happen when they're present on the
command line.

The part I'm unsure about is "how bad is it to cheat the way we do right
now", vs. "how much is it worth to be pedantic and require explicitly
enabling things, in both qemu and kvm"... I feel like I don't know
enough to 1. have a strong opinion either way, and 2. have my opinion
be *right* :) Which is why I won't let it go already (and thanks for
all your patience, BTW) :)

Thanks,
--Gabriel
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander Graf June 4, 2014, 3:09 p.m. UTC | #16
On 04.06.14 17:05, Gabriel L. Somlo wrote:
> On Wed, Jun 04, 2014 at 04:44:13PM +0200, Alexander Graf wrote:
>> On 04.06.14 16:39, Gabriel L. Somlo wrote:
>>> Paolo,
>>>
>>> I noticed the monitor=mwait=nop patch is making its way upstream, so
>>> thanks !
>>>
>>> I'm still interested in following up with something that would enable
>>> this behavior only conditionally (e.g. following an ioctl call from
>>> userspace to enable it only for the (set of) vcpu(s) belonging to one
>>> guest VM at a time), which should then also include advertising the
>>> feature in CPUID.
>>>
>>> I grep-ed through the kvm sources for KVM_CAP for some inspiration,
>>> and it looks more like KVM_CAP_* is a way to tell userspace what the
>>> kernel supports, but nothing I saw showed me an example of a "tunable"
>>> feature that userspace may ask to be turned on or off (e.g per-vcpu).
>>>
>>> Is there something like that I could use as an example ?
>> Sure, we use it all over the place on PPC :).
> Allright, I'll grep harder, then :)
>
>>> Obviously, if you really like the current behavior better you can
>>> always reject whatever patch I'll come up with, but I'd like to at
>>> least try and see what it would look like :)
>> I think it's perfectly fine to leave mwait always implemented as NOP - it's
>> valid behavior.
> NOP is valid MWAIT behavior, *unless* MWAIT should generate an invalid
> opcode (i.e., if CPUID says mwait not supported). In that respect,
> we're cheating only to hook up guests which misbehave. I'd feel less
> "dirty" if I could explicitly tell KVM "ok, just this once is OK, but
> don't make a habit of it" :)

We don't limit instructions the guest can execute properly anyway. If 
CPUID doesn't expose AVX, but the host CPU supports AVX, the guest can 
still call AVX instructions.

So I think we're safe to always handle MWAIT :).

>
>> As for the CPUID exposure, that should be a pure QEMU thing. If overriding
>> CPUID bits the kernel mask tells us doesn't work today, we should just make
>> it possible :).
>>
>> Eventually I really think that -cpu foo,+mwait,+monitor or whatever the bits
>> are should override any safety net that KVM gives us on features it thinks
>> are safe to use.
> I need to look at the qemu source, doing what you said
> (+monitor,+mwait,+whatever) right now "works", doesn't generate an error,
> but silently ignores you if it's not implemented. So I'd actually have to
> generate a patch to make something happen when they're present on the
> command line.
>
> The part I'm unsure about is "how bad is it to cheat the way we do right
> now", vs. "how much is it worth to be pedantic and require explicitly
> enabling things, in both qemu and kvm"... I feel like I don't know
> enough to 1. have a strong opinion either way, and 2. have my opinion
> be *right* :) Which is why I won't let it go already (and thanks for
> all your patience, BTW) :)

I think it's sane behavior to not expose the MWAIT capability in the 
default CPUID mask (which comes from KVM) unless we can actually emulate 
it properly ;).

However, I think it's very important to be able to force CPUID bits to 
on from QEMU even when KVM says it doesn't support them. I actually 
thought we could do that already, but that code got refactored a number 
of times over the years, so maybe that ability got lost.

Basically KVM gives QEMU 2 ioctls:

   * get list of KVM supported CPUIDs
   * set guest exposed CPUIDs

Whether QEMU wants to only set CPUID bits that the kernel actually 
supports is up to its own implementation. Usually the "enforce" option 
is there to guarantee that all CPUID bits are actually supported. 
Apparently all unsupported bits just get dropped silently today. IMHO 
they shouldn't if they were specified through -cpu ...,+feature.


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paolo Bonzini June 4, 2014, 4:34 p.m. UTC | #17
Il 04/06/2014 16:44, Alexander Graf ha scritto:
>
>
>> Obviously, if you really like the current behavior better you can
>> always reject whatever patch I'll come up with, but I'd like to at
>> least try and see what it would look like :)
>
> I think it's perfectly fine to leave mwait always implemented as NOP -
> it's valid behavior.
>
> As for the CPUID exposure, that should be a pure QEMU thing. If
> overriding CPUID bits the kernel mask tells us doesn't work today, we
> should just make it possible :).

That should be the purpose of KVM_GET_EMULATED_CPUID, so MWAIT could be 
added in __do_cpuid_ent_emulated.  However, the corresponding QEMU 
patches were never included.  Borislav, can you refresh them?

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gabriel L. Somlo June 4, 2014, 5:07 p.m. UTC | #18
On Wed, Jun 04, 2014 at 05:09:49PM +0200, Alexander Graf wrote:
> >>>
> >>>I grep-ed through the kvm sources for KVM_CAP for some inspiration,
> >>>and it looks more like KVM_CAP_* is a way to tell userspace what the
> >>>kernel supports, but nothing I saw showed me an example of a "tunable"
> >>>feature that userspace may ask to be turned on or off (e.g per-vcpu).
> >>>
> >>>Is there something like that I could use as an example ?
> >>Sure, we use it all over the place on PPC :).
> >Allright, I'll grep harder, then :)

Aah, I think I found it: KVM_ENABLE_CAP, currently only on ppc and s390.
I'd have to port this over to x86 before I could use it to enable mwait
on demand.

Would that be useful/desirable for any other use cases ?

> >>I think it's perfectly fine to leave mwait always implemented as NOP - it's
> >>valid behavior.
> >NOP is valid MWAIT behavior, *unless* MWAIT should generate an invalid
> >opcode (i.e., if CPUID says mwait not supported). In that respect,
> >we're cheating only to hook up guests which misbehave. I'd feel less
> >"dirty" if I could explicitly tell KVM "ok, just this once is OK, but
> >don't make a habit of it" :)
> 
> We don't limit instructions the guest can execute properly anyway. If CPUID
> doesn't expose AVX, but the host CPU supports AVX, the guest can still call
> AVX instructions.
> 
> So I think we're safe to always handle MWAIT :).
> 
> >
> >>As for the CPUID exposure, that should be a pure QEMU thing. If overriding
> >>CPUID bits the kernel mask tells us doesn't work today, we should just make
> >>it possible :).
> >>
> >>Eventually I really think that -cpu foo,+mwait,+monitor or whatever the bits
> >>are should override any safety net that KVM gives us on features it thinks
> >>are safe to use.
> >I need to look at the qemu source, doing what you said
> >(+monitor,+mwait,+whatever) right now "works", doesn't generate an error,
> >but silently ignores you if it's not implemented. So I'd actually have to
> >generate a patch to make something happen when they're present on the
> >command line.
> >
> >The part I'm unsure about is "how bad is it to cheat the way we do right
> >now", vs. "how much is it worth to be pedantic and require explicitly
> >enabling things, in both qemu and kvm"... I feel like I don't know
> >enough to 1. have a strong opinion either way, and 2. have my opinion
> >be *right* :) Which is why I won't let it go already (and thanks for
> >all your patience, BTW) :)
> 
> I think it's sane behavior to not expose the MWAIT capability in the default
> CPUID mask (which comes from KVM) unless we can actually emulate it properly
> ;).
> 
> However, I think it's very important to be able to force CPUID bits to on
> from QEMU even when KVM says it doesn't support them. I actually thought we
> could do that already, but that code got refactored a number of times over
> the years, so maybe that ability got lost.
> 
> Basically KVM gives QEMU 2 ioctls:
> 
>   * get list of KVM supported CPUIDs
>   * set guest exposed CPUIDs

Ah, so kvm_vcpu_ioctl_set_cpuid() and friends, morally similar to
kvm_vcpu_ioctl_enable_cap() on ppc, except it turns on cpuid flags
instead of entire kvm capabilities.

So we either have

	1 always-on but masked-by-default monitor/mwait as
	  nop, and enable just the cpuid flag on demand via the
	  existing ioctl_enable_cap() mechanism (and I have to
	  check out the qemu parser for cpuid command-line flags),

or

	2 off-by-default monitor/mwait/cpuid-flag, enabled via
	  ioctl_enable_cap(), which would have to first be ported
	  to x86, and would require somewhat more extensive qemu
	  hackery to take advantage of.

I think I sense a "path of least resistance" here, even though IMHO
#2 is still "The Right Thing To Do (TM)"  :) :)

Thanks,
--Gabriel
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin June 4, 2014, 7:06 p.m. UTC | #19
On Wed, Jun 04, 2014 at 01:07:21PM -0400, Gabriel L. Somlo wrote:
> On Wed, Jun 04, 2014 at 05:09:49PM +0200, Alexander Graf wrote:
> > >>>
> > >>>I grep-ed through the kvm sources for KVM_CAP for some inspiration,
> > >>>and it looks more like KVM_CAP_* is a way to tell userspace what the
> > >>>kernel supports, but nothing I saw showed me an example of a "tunable"
> > >>>feature that userspace may ask to be turned on or off (e.g per-vcpu).
> > >>>
> > >>>Is there something like that I could use as an example ?
> > >>Sure, we use it all over the place on PPC :).
> > >Allright, I'll grep harder, then :)
> 
> Aah, I think I found it: KVM_ENABLE_CAP, currently only on ppc and s390.
> I'd have to port this over to x86 before I could use it to enable mwait
> on demand.
> 
> Would that be useful/desirable for any other use cases ?
> 
> > >>I think it's perfectly fine to leave mwait always implemented as NOP - it's
> > >>valid behavior.
> > >NOP is valid MWAIT behavior, *unless* MWAIT should generate an invalid
> > >opcode (i.e., if CPUID says mwait not supported). In that respect,
> > >we're cheating only to hook up guests which misbehave. I'd feel less
> > >"dirty" if I could explicitly tell KVM "ok, just this once is OK, but
> > >don't make a habit of it" :)
> > 
> > We don't limit instructions the guest can execute properly anyway. If CPUID
> > doesn't expose AVX, but the host CPU supports AVX, the guest can still call
> > AVX instructions.
> > 
> > So I think we're safe to always handle MWAIT :).
> > 
> > >
> > >>As for the CPUID exposure, that should be a pure QEMU thing. If overriding
> > >>CPUID bits the kernel mask tells us doesn't work today, we should just make
> > >>it possible :).
> > >>
> > >>Eventually I really think that -cpu foo,+mwait,+monitor or whatever the bits
> > >>are should override any safety net that KVM gives us on features it thinks
> > >>are safe to use.
> > >I need to look at the qemu source, doing what you said
> > >(+monitor,+mwait,+whatever) right now "works", doesn't generate an error,
> > >but silently ignores you if it's not implemented. So I'd actually have to
> > >generate a patch to make something happen when they're present on the
> > >command line.
> > >
> > >The part I'm unsure about is "how bad is it to cheat the way we do right
> > >now", vs. "how much is it worth to be pedantic and require explicitly
> > >enabling things, in both qemu and kvm"... I feel like I don't know
> > >enough to 1. have a strong opinion either way, and 2. have my opinion
> > >be *right* :) Which is why I won't let it go already (and thanks for
> > >all your patience, BTW) :)
> > 
> > I think it's sane behavior to not expose the MWAIT capability in the default
> > CPUID mask (which comes from KVM) unless we can actually emulate it properly
> > ;).
> > 
> > However, I think it's very important to be able to force CPUID bits to on
> > from QEMU even when KVM says it doesn't support them. I actually thought we
> > could do that already, but that code got refactored a number of times over
> > the years, so maybe that ability got lost.
> > 
> > Basically KVM gives QEMU 2 ioctls:
> > 
> >   * get list of KVM supported CPUIDs
> >   * set guest exposed CPUIDs
> 
> Ah, so kvm_vcpu_ioctl_set_cpuid() and friends, morally similar to
> kvm_vcpu_ioctl_enable_cap() on ppc, except it turns on cpuid flags
> instead of entire kvm capabilities.
> 
> So we either have
> 
> 	1 always-on but masked-by-default monitor/mwait as
> 	  nop, and enable just the cpuid flag on demand via the
> 	  existing ioctl_enable_cap() mechanism (and I have to
> 	  check out the qemu parser for cpuid command-line flags),
> 
> or
> 
> 	2 off-by-default monitor/mwait/cpuid-flag, enabled via
> 	  ioctl_enable_cap(), which would have to first be ported
> 	  to x86, and would require somewhat more extensive qemu
> 	  hackery to take advantage of.
> 
> I think I sense a "path of least resistance" here, even though IMHO
> #2 is still "The Right Thing To Do (TM)"  :) :)
> 
> Thanks,
> --Gabriel

I think it's worng.
We really can't emulate mwait at the moment.
All we manage to do is a work-around for broken guests.

So let's not pretend that we can, just enable nop
unconditionally and be done with it.
Paolo already said it's OK with him, and I'll ack too.

Otherwise you are giving bad information to well-behaved guests,
so e.g. linux will try to use mwait. You don't want this.

The advantage is that if at some point CPUs can
actually support mwait in VMs, at that point
we will enable the CPUID bit, and userspace and guests
will be able to detect that and rely on that bit
to mean "mwait works and is efficient".
Michael S. Tsirkin June 4, 2014, 7:08 p.m. UTC | #20
On Wed, Jun 04, 2014 at 06:34:04PM +0200, Paolo Bonzini wrote:
> Il 04/06/2014 16:44, Alexander Graf ha scritto:
> >
> >
> >>Obviously, if you really like the current behavior better you can
> >>always reject whatever patch I'll come up with, but I'd like to at
> >>least try and see what it would look like :)
> >
> >I think it's perfectly fine to leave mwait always implemented as NOP -
> >it's valid behavior.
> >
> >As for the CPUID exposure, that should be a pure QEMU thing. If
> >overriding CPUID bits the kernel mask tells us doesn't work today, we
> >should just make it possible :).
> 
> That should be the purpose of KVM_GET_EMULATED_CPUID, so MWAIT could be
> added in __do_cpuid_ent_emulated.  However, the corresponding QEMU patches
> were never included.  Borislav, can you refresh them?
> 
> Paolo

I don't understand why would we want mwait bit set in CPUID.
The only reason we want the nop is because of broken guests which
don't check CPUID.
Nadav Amit June 4, 2014, 7:12 p.m. UTC | #21
On Jun 4, 2014, at 7:34 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:

> Il 04/06/2014 16:44, Alexander Graf ha scritto:
>> 
>> 
>>> Obviously, if you really like the current behavior better you can
>>> always reject whatever patch I'll come up with, but I'd like to at
>>> least try and see what it would look like :)
>> 
>> I think it's perfectly fine to leave mwait always implemented as NOP -
>> it's valid behavior.
>> 
>> As for the CPUID exposure, that should be a pure QEMU thing. If
>> overriding CPUID bits the kernel mask tells us doesn't work today, we
>> should just make it possible :).
> 
> That should be the purpose of KVM_GET_EMULATED_CPUID, so MWAIT could be added in __do_cpuid_ent_emulated.  However, the corresponding QEMU patches were never included.  Borislav, can you refresh them?


Regardless to the whole discussion of what the guest is informed about, I think it might be better to implement mwait and monitor correctly according to the spec and let the instructions to be fully emulated.
Both mwait and monitor may encounter exceptions (#GP, #PF, regardless of #UD), so this behaviour should be correct.
If you want me, I’ll send my version which looks something like:

static int em_monitor(struct x86_emulate_ctxt *ctxt)
{
	int rc;
	struct segmented_address addr;
	u64 rcx = reg_read(ctxt, VCPU_REGS_RCX);
	u64 rax = reg_read(ctxt, VCPU_REGS_RAX);
	u8 byte;

	rc = check_mwait_supported(ctxt);
	if (rc != X86EMUL_CONTINUE)
		return rc;

	if (ctxt->mode != X86EMUL_MODE_PROT64)
		rcx = (u32)rcx;

	if (rcx != 0)
		return emulate_gp(ctxt, 0);

	addr.seg = seg_override(ctxt);
	addr.ea = ctxt->ad_bytes == 8 ? rax : (u32)rax;

	rc = segmented_read(ctxt, addr, &byte, 1);
	if (rc != X86EMUL_CONTINUE)
		return rc;

	return X86EMUL_CONTINUE;
}

static int em_mwait(struct x86_emulate_ctxt *ctxt)
{
	u64 rcx = reg_read(ctxt, VCPU_REGS_RCX);
	int rc = check_mwait_supported(ctxt);
	if (rc != X86EMUL_CONTINUE)
		return rc;
	if (ctxt->mode != X86EMUL_MODE_PROT64)
		rcx = (u32)rcx;

	if ((rcx & ~(u64)1) != 0)
		return emulate_gp(ctxt, 0);

	if (rcx & 1) {
		/* Interrupt as break event */
		u32 ebx, ecx, edx, eax;
		eax = 5;
		ecx = 0;
		ctxt->ops->get_cpuid(ctxt, &eax, &ebx, &ecx, &edx);
		if (!(ecx & 1))
			return emulate_gp(ctxt, 0);
	}
	return X86EMUL_CONTINUE;
}
Gabriel L. Somlo June 4, 2014, 7:24 p.m. UTC | #22
On Wed, Jun 04, 2014 at 10:06:18PM +0300, Michael S. Tsirkin wrote:
> On Wed, Jun 04, 2014 at 01:07:21PM -0400, Gabriel L. Somlo wrote:
> > Ah, so kvm_vcpu_ioctl_set_cpuid() and friends, morally similar to
> > kvm_vcpu_ioctl_enable_cap() on ppc, except it turns on cpuid flags
> > instead of entire kvm capabilities.
> > 
> > So we either have
> > 
> > 	1 always-on but masked-by-default monitor/mwait as
> > 	  nop, and enable just the cpuid flag on demand via the
> > 	  existing ioctl_enable_cap() mechanism (and I have to
> > 	  check out the qemu parser for cpuid command-line flags),
> > 
> > or
> > 
> > 	2 off-by-default monitor/mwait/cpuid-flag, enabled via
> > 	  ioctl_enable_cap(), which would have to first be ported
> > 	  to x86, and would require somewhat more extensive qemu
> > 	  hackery to take advantage of.
> > 
> > I think I sense a "path of least resistance" here, even though IMHO
> > #2 is still "The Right Thing To Do (TM)"  :) :)
> > 
> > Thanks,
> > --Gabriel
> 
> I think it's worng.
> We really can't emulate mwait at the moment.
> All we manage to do is a work-around for broken guests.
> 
> So let's not pretend that we can, just enable nop
> unconditionally and be done with it.
> Paolo already said it's OK with him, and I'll ack too.
> 
> Otherwise you are giving bad information to well-behaved guests,
> so e.g. linux will try to use mwait. You don't want this.

That's why I suggested having it default to disabled, and only
allowing it to be turned on per VM, explicitly.

> The advantage is that if at some point CPUs can
> actually support mwait in VMs, at that point
> we will enable the CPUID bit, and userspace and guests
> will be able to detect that and rely on that bit
> to mean "mwait works and is efficient".

OK, that makes sense. Thanks for helping me finally get over it ;)

--Gabriel

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gabriel L. Somlo June 4, 2014, 7:33 p.m. UTC | #23
On Wed, Jun 04, 2014 at 10:08:12PM +0300, Michael S. Tsirkin wrote:
> On Wed, Jun 04, 2014 at 06:34:04PM +0200, Paolo Bonzini wrote:
> > Il 04/06/2014 16:44, Alexander Graf ha scritto:
> > >
> > >
> > >>Obviously, if you really like the current behavior better you can
> > >>always reject whatever patch I'll come up with, but I'd like to at
> > >>least try and see what it would look like :)
> > >
> > >I think it's perfectly fine to leave mwait always implemented as NOP -
> > >it's valid behavior.
> > >
> > >As for the CPUID exposure, that should be a pure QEMU thing. If
> > >overriding CPUID bits the kernel mask tells us doesn't work today, we
> > >should just make it possible :).
> > 
> > That should be the purpose of KVM_GET_EMULATED_CPUID, so MWAIT could be
> > added in __do_cpuid_ent_emulated.  However, the corresponding QEMU patches
> > were never included.  Borislav, can you refresh them?
> > 
> > Paolo
> 
> I don't understand why would we want mwait bit set in CPUID.
> The only reason we want the nop is because of broken guests which
> don't check CPUID.

E.g., OS X 10.5 *does* check CPUID, and panics if it doesn't find it.
It needs the MONITOR cpuid flag to be on, *and* the actual
instructions to work.

HOWEVER: I really do NOT want us to bend over backwards to support it,
I think having 10.6 and up working is good enough. Besides, there are
other problems we'd run into with 10.5 if we got the mwait situation
taken care of, so even more reason to leave well enough alone.

Thanks,
--Gabriel

PS. Thanks again for everyone's patience while I wrapped my head
around the fine details :)
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin June 4, 2014, 7:37 p.m. UTC | #24
On Wed, Jun 04, 2014 at 03:24:06PM -0400, Gabriel L. Somlo wrote:
> On Wed, Jun 04, 2014 at 10:06:18PM +0300, Michael S. Tsirkin wrote:
> > On Wed, Jun 04, 2014 at 01:07:21PM -0400, Gabriel L. Somlo wrote:
> > > Ah, so kvm_vcpu_ioctl_set_cpuid() and friends, morally similar to
> > > kvm_vcpu_ioctl_enable_cap() on ppc, except it turns on cpuid flags
> > > instead of entire kvm capabilities.
> > > 
> > > So we either have
> > > 
> > > 	1 always-on but masked-by-default monitor/mwait as
> > > 	  nop, and enable just the cpuid flag on demand via the
> > > 	  existing ioctl_enable_cap() mechanism (and I have to
> > > 	  check out the qemu parser for cpuid command-line flags),
> > > 
> > > or
> > > 
> > > 	2 off-by-default monitor/mwait/cpuid-flag, enabled via
> > > 	  ioctl_enable_cap(), which would have to first be ported
> > > 	  to x86, and would require somewhat more extensive qemu
> > > 	  hackery to take advantage of.
> > > 
> > > I think I sense a "path of least resistance" here, even though IMHO
> > > #2 is still "The Right Thing To Do (TM)"  :) :)
> > > 
> > > Thanks,
> > > --Gabriel
> > 
> > I think it's worng.
> > We really can't emulate mwait at the moment.
> > All we manage to do is a work-around for broken guests.
> > 
> > So let's not pretend that we can, just enable nop
> > unconditionally and be done with it.
> > Paolo already said it's OK with him, and I'll ack too.
> > 
> > Otherwise you are giving bad information to well-behaved guests,
> > so e.g. linux will try to use mwait. You don't want this.
> 
> That's why I suggested having it default to disabled, and only
> allowing it to be turned on per VM, explicitly.

Turn on the bit in CPUID? Why is it useful?
I thought the point of the nop hack is work around
for broken guests.

> > The advantage is that if at some point CPUs can
> > actually support mwait in VMs, at that point
> > we will enable the CPUID bit, and userspace and guests
> > will be able to detect that and rely on that bit
> > to mean "mwait works and is efficient".
> 
> OK, that makes sense. Thanks for helping me finally get over it ;)
> 
> --Gabriel
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin June 4, 2014, 7:40 p.m. UTC | #25
On Wed, Jun 04, 2014 at 03:33:38PM -0400, Gabriel L. Somlo wrote:
> On Wed, Jun 04, 2014 at 10:08:12PM +0300, Michael S. Tsirkin wrote:
> > On Wed, Jun 04, 2014 at 06:34:04PM +0200, Paolo Bonzini wrote:
> > > Il 04/06/2014 16:44, Alexander Graf ha scritto:
> > > >
> > > >
> > > >>Obviously, if you really like the current behavior better you can
> > > >>always reject whatever patch I'll come up with, but I'd like to at
> > > >>least try and see what it would look like :)
> > > >
> > > >I think it's perfectly fine to leave mwait always implemented as NOP -
> > > >it's valid behavior.
> > > >
> > > >As for the CPUID exposure, that should be a pure QEMU thing. If
> > > >overriding CPUID bits the kernel mask tells us doesn't work today, we
> > > >should just make it possible :).
> > > 
> > > That should be the purpose of KVM_GET_EMULATED_CPUID, so MWAIT could be
> > > added in __do_cpuid_ent_emulated.  However, the corresponding QEMU patches
> > > were never included.  Borislav, can you refresh them?
> > > 
> > > Paolo
> > 
> > I don't understand why would we want mwait bit set in CPUID.
> > The only reason we want the nop is because of broken guests which
> > don't check CPUID.
> 
> E.g., OS X 10.5 *does* check CPUID, and panics if it doesn't find it.
> It needs the MONITOR cpuid flag to be on, *and* the actual
> instructions to work.

Aha. I didn't realize this. I thought it used mwait without
checking CPUID.

> HOWEVER: I really do NOT want us to bend over backwards to support it,
> I think having 10.6 and up working is good enough. Besides, there are
> other problems we'd run into with 10.5 if we got the mwait situation
> taken care of, so even more reason to leave well enough alone.
> 
> Thanks,
> --Gabriel
> 
> PS. Thanks again for everyone's patience while I wrapped my head
> around the fine details :)

It's up to you of course. If some guests refuse to run without
the CPUID bit, then of course an option to set it might be useful.
Sorry about misunderstanding and making noise.
Gabriel L. Somlo June 4, 2014, 7:43 p.m. UTC | #26
On Wed, Jun 04, 2014 at 10:12:39PM +0300, Nadav Amit wrote:
> Regardless to the whole discussion of what the guest is informed about, I think it might be better to implement mwait and monitor correctly according to the spec and let the instructions to be fully emulated.
> Both mwait and monitor may encounter exceptions (#GP, #PF, regardless of #UD), so this behaviour should be correct.
> If you want me, I?ll send my version which looks something like:
> 
> static int em_monitor(struct x86_emulate_ctxt *ctxt)
> {
> 	int rc;
> 	struct segmented_address addr;
> 	u64 rcx = reg_read(ctxt, VCPU_REGS_RCX);
> 	u64 rax = reg_read(ctxt, VCPU_REGS_RAX);
> 	u8 byte;
> 
> 	rc = check_mwait_supported(ctxt);
> 	if (rc != X86EMUL_CONTINUE)
> 		return rc;
> 
> 	if (ctxt->mode != X86EMUL_MODE_PROT64)
> 		rcx = (u32)rcx;
> 
> 	if (rcx != 0)
> 		return emulate_gp(ctxt, 0);
> 
> 	addr.seg = seg_override(ctxt);
> 	addr.ea = ctxt->ad_bytes == 8 ? rax : (u32)rax;
> 
> 	rc = segmented_read(ctxt, addr, &byte, 1);
> 	if (rc != X86EMUL_CONTINUE)
> 		return rc;
> 
> 	return X86EMUL_CONTINUE;
> }
> 
> static int em_mwait(struct x86_emulate_ctxt *ctxt)
> {
> 	u64 rcx = reg_read(ctxt, VCPU_REGS_RCX);
> 	int rc = check_mwait_supported(ctxt);
> 	if (rc != X86EMUL_CONTINUE)
> 		return rc;
> 	if (ctxt->mode != X86EMUL_MODE_PROT64)
> 		rcx = (u32)rcx;
> 
> 	if ((rcx & ~(u64)1) != 0)
> 		return emulate_gp(ctxt, 0);
> 
> 	if (rcx & 1) {
> 		/* Interrupt as break event */
> 		u32 ebx, ecx, edx, eax;
> 		eax = 5;
> 		ecx = 0;
> 		ctxt->ops->get_cpuid(ctxt, &eax, &ebx, &ecx, &edx);
> 		if (!(ecx & 1))
> 			return emulate_gp(ctxt, 0);
> 	}
> 	return X86EMUL_CONTINUE;
> }

I'd be curious how you're dealing with the "hidden" CPU state which
tells MWAIT to sleep until someone or something writes to the
monitored memory area set up by a corresponding MONITOR instruction.

Thanks,
--Gabriel
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Borislav Petkov June 4, 2014, 8:44 p.m. UTC | #27
On Wed, Jun 04, 2014 at 06:34:04PM +0200, Paolo Bonzini wrote:
> That should be the purpose of KVM_GET_EMULATED_CPUID, so MWAIT could be
> added in __do_cpuid_ent_emulated.  However, the corresponding QEMU patches
> were never included.  Borislav, can you refresh them?

/me goes and swaps in all the details from the discussion at that
time.... tries to decipher what the whole story was about...

Well, AFAIR - the operative word being "remember" - Eduardo had a
problem with diffentiating between emulated features and real hw
features and what was being filtered out... here's where the thread
starts:

http://lkml.kernel.org/r/20130923162856.GC7264@otherpad.lan.raisama.net

I think we remained at that someone, maybe he, would have to clean all
that discrepancy before we do the emulated gunk. Let me CC him.
Eduardo Habkost June 5, 2014, 2:40 p.m. UTC | #28
On Wed, Jun 04, 2014 at 10:44:48PM +0200, Borislav Petkov wrote:
> On Wed, Jun 04, 2014 at 06:34:04PM +0200, Paolo Bonzini wrote:
> > That should be the purpose of KVM_GET_EMULATED_CPUID, so MWAIT could be
> > added in __do_cpuid_ent_emulated.  However, the corresponding QEMU patches
> > were never included.  Borislav, can you refresh them?
> 
> /me goes and swaps in all the details from the discussion at that
> time.... tries to decipher what the whole story was about...
> 
> Well, AFAIR - the operative word being "remember" - Eduardo had a
> problem with diffentiating between emulated features and real hw
> features and what was being filtered out... here's where the thread
> starts:
> 
> http://lkml.kernel.org/r/20130923162856.GC7264@otherpad.lan.raisama.net
> 
> I think we remained at that someone, maybe he, would have to clean all
> that discrepancy before we do the emulated gunk. Let me CC him.

I didn't work on it because of other tasks, and because the
GET_SUPPORTED_CPUID code on QEMU needed some cleanup first. Now it
should be easy to implement GET_EMULATED_CPUID support in the same way
the "migratable" property was implemented[1]. I will send a patch today.

[1] See tree at
    https://github.com/ehabkost/qemu-hacks/tree/work/qom-cpu-fixes
Eric Northup June 5, 2014, 8:59 p.m. UTC | #29
On Wed, May 7, 2014 at 1:52 PM, Gabriel L. Somlo <gsomlo@gmail.com> wrote:
> Treat monitor and mwait instructions as nop, which is architecturally
> correct (but inefficient) behavior. We do this to prevent misbehaving
> guests (e.g. OS X <= 10.7) from crashing after they fail to check for
> monitor/mwait availability via cpuid.
>
> Since mwait-based idle loops relying on these nop-emulated instructions
> would keep the host CPU pegged at 100%, do NOT advertise their presence
> via cpuid, to prevent compliant guests from using them inadvertently.

If it's going to peg the host CPU at 100% anyway, why bother emulating
it?  Just let the guest run the mwait instruction!  Have a condition
that controls whether CPU_BASED_MWAIT_EXITING gets set in the VMCS
processor execution controls.  Go ahead and put it in CPUID, since
it's actually allowed.

>
> Signed-off-by: Gabriel L. Somlo <somlo@cmu.edu>
> ---
>
> New in v2: remove invalid_op handler functions which were only used to
>            handle exits caused by monitor and mwait
>
> On Wed, May 07, 2014 at 08:31:27PM +0200, Alexander Graf wrote:
>> On 05/07/2014 08:15 PM, Michael S. Tsirkin wrote:
>> >If we really want to be paranoid and worry about guests
>> >that use this strange way to trigger invalid opcode,
>> >we can make it possible for userspace to enable/disable
>> >this hack, and teach qemu to set it.
>> >
>> >That would make it even safer than it was.
>> >
>> >Not sure it's worth it, just a thought.
>>
>> Since we don't trap on non-exposed other instructions (new SSE and
>> whatdoiknow) I don't think it's really bad to just expose
>> MONITOR/MWAIT as nops.
>
> So AFAICT, linux prefers to use mwait for idling if cpuid tells it that
> it's available. If we keep telling everyone that we do NOT have monitor
> and mwait available, compliant guests will never end up using them, and
> this hack would remain completely invisible to them, which is good
> (better to use hlt-based idle loops when you're a vm guest, that would
> actually allow the host to relax while you're halted :)
>
> So the only time anyone would be able to tell we have this hack would be
> when they're about to receive an invalid opcode for using monitor/mwait
> in violation of what CPUID (would have) told them. That's what happens
> to OS X prior to 10.8, which is when I'm hypothesizing the Apple devs
> begain to seriously think about their OS running as a vm guest (on fusion
> and parallels)...
>
> Instead of killing the misbehaving guest with an invalid opcode, we'd
> allow them to peg the host CPU with their monitor == mwait == nop idle
> loop instead, which, at least on OS X, should be tolerable long enough
> to run 'rm -rf System/Library/Extensions/AppleIntelCPUPowerManagement.kext'
> and reboot the guest, after which things would settle down by reverting
> the guest to a hlt-based idle loop.
>
> The only reason I can think of to add functionality for enabling/disabling
> this hack would be to protect against a malicious guest which would use
> mwait *on purpose* to peg the host CPU. But a malicious guest could just
> run "for(;;);" in ring 0 and accomplish the same goal, so we wouldn't
> really gain anything in exchange for the added complexity...
>
> Thanks,
>   Gabriel
>
>  arch/x86/kvm/cpuid.c |  2 ++
>  arch/x86/kvm/svm.c   | 28 ++++++++++++++++++++--------
>  arch/x86/kvm/vmx.c   | 20 ++++++++++++++++----
>  3 files changed, 38 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index f47a104..d094fc6 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -283,6 +283,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>                 0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
>         /* cpuid 1.ecx */
>         const u32 kvm_supported_word4_x86_features =
> +               /* NOTE: MONITOR (and MWAIT) are emulated as NOP,
> +                * but *not* advertised to guests via CPUID ! */
>                 F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
>                 0 /* DS-CPL, VMX, SMX, EST */ |
>                 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 7f4f9c2..0b7d58d 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -2770,12 +2770,6 @@ static int xsetbv_interception(struct vcpu_svm *svm)
>         return 1;
>  }
>
> -static int invalid_op_interception(struct vcpu_svm *svm)
> -{
> -       kvm_queue_exception(&svm->vcpu, UD_VECTOR);
> -       return 1;
> -}
> -
>  static int task_switch_interception(struct vcpu_svm *svm)
>  {
>         u16 tss_selector;
> @@ -3287,6 +3281,24 @@ static int pause_interception(struct vcpu_svm *svm)
>         return 1;
>  }
>
> +static int nop_interception(struct vcpu_svm *svm)
> +{
> +       skip_emulated_instruction(&(svm->vcpu));
> +       return 1;
> +}
> +
> +static int monitor_interception(struct vcpu_svm *svm)
> +{
> +       printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n");
> +       return nop_interception(svm);
> +}
> +
> +static int mwait_interception(struct vcpu_svm *svm)
> +{
> +       printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n");
> +       return nop_interception(svm);
> +}
> +
>  static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = {
>         [SVM_EXIT_READ_CR0]                     = cr_interception,
>         [SVM_EXIT_READ_CR3]                     = cr_interception,
> @@ -3344,8 +3356,8 @@ static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = {
>         [SVM_EXIT_CLGI]                         = clgi_interception,
>         [SVM_EXIT_SKINIT]                       = skinit_interception,
>         [SVM_EXIT_WBINVD]                       = emulate_on_interception,
> -       [SVM_EXIT_MONITOR]                      = invalid_op_interception,
> -       [SVM_EXIT_MWAIT]                        = invalid_op_interception,
> +       [SVM_EXIT_MONITOR]                      = monitor_interception,
> +       [SVM_EXIT_MWAIT]                        = mwait_interception,
>         [SVM_EXIT_XSETBV]                       = xsetbv_interception,
>         [SVM_EXIT_NPF]                          = pf_interception,
>  };
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 33e8c02..3ccbcb1 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -5669,12 +5669,24 @@ static int handle_pause(struct kvm_vcpu *vcpu)
>         return 1;
>  }
>
> -static int handle_invalid_op(struct kvm_vcpu *vcpu)
> +static int handle_nop(struct kvm_vcpu *vcpu)
>  {
> -       kvm_queue_exception(vcpu, UD_VECTOR);
> +       skip_emulated_instruction(vcpu);
>         return 1;
>  }
>
> +static int handle_mwait(struct kvm_vcpu *vcpu)
> +{
> +       printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n");
> +       return handle_nop(vcpu);
> +}
> +
> +static int handle_monitor(struct kvm_vcpu *vcpu)
> +{
> +       printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n");
> +       return handle_nop(vcpu);
> +}
> +
>  /*
>   * To run an L2 guest, we need a vmcs02 based on the L1-specified vmcs12.
>   * We could reuse a single VMCS for all the L2 guests, but we also want the
> @@ -6571,8 +6583,8 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
>         [EXIT_REASON_EPT_VIOLATION]           = handle_ept_violation,
>         [EXIT_REASON_EPT_MISCONFIG]           = handle_ept_misconfig,
>         [EXIT_REASON_PAUSE_INSTRUCTION]       = handle_pause,
> -       [EXIT_REASON_MWAIT_INSTRUCTION]       = handle_invalid_op,
> -       [EXIT_REASON_MONITOR_INSTRUCTION]     = handle_invalid_op,
> +       [EXIT_REASON_MWAIT_INSTRUCTION]       = handle_mwait,
> +       [EXIT_REASON_MONITOR_INSTRUCTION]     = handle_monitor,
>         [EXIT_REASON_INVEPT]                  = handle_invept,
>  };
>
> --
> 1.9.0
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gabriel L. Somlo June 5, 2014, 9:19 p.m. UTC | #30
On Thu, Jun 05, 2014 at 01:59:17PM -0700, Eric Northup wrote:
> On Wed, May 7, 2014 at 1:52 PM, Gabriel L. Somlo <gsomlo@gmail.com> wrote:
> > Treat monitor and mwait instructions as nop, which is architecturally
> > correct (but inefficient) behavior. We do this to prevent misbehaving
> > guests (e.g. OS X <= 10.7) from crashing after they fail to check for
> > monitor/mwait availability via cpuid.
> >
> > Since mwait-based idle loops relying on these nop-emulated instructions
> > would keep the host CPU pegged at 100%, do NOT advertise their presence
> > via cpuid, to prevent compliant guests from using them inadvertently.
> 
> If it's going to peg the host CPU at 100% anyway, why bother emulating
> it?  Just let the guest run the mwait instruction!  Have a condition
> that controls whether CPU_BASED_MWAIT_EXITING gets set in the VMCS
> processor execution controls.  Go ahead and put it in CPUID, since
> it's actually allowed.

If a well-behaved guest *does* check CPUID, we want it to get a
negative answer and fall back to a HLT-based idle loop (rather than
use mwait and peg the host cpu to 100%).

We only emulate it so that guests which use it blindly, regardless of
what CPUID says (e.g., OS X 10.6 .. 10.8) would still run instead of
crashing with an invalid opcode fault. That is, work long enough for
someone to manually force a hlt-idle fallback on the guest-side by
removing System/Library/Extensions/AppleIntelCPUPowerManagement.kext :)

Causing a VM exit (as opposed to letting the MWAIT just be a NOP in VM
guest mode) is nice, since it allows us to try and do something more
fancy in the future (i.e. yield the host cpu instead of re-entering
immediately, or attempting to otherwise emulate the instruction in a
way that's "better").

Anyhow, my $0.02 :)

Cheers,
--Gabriel
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin June 10, 2014, 10:16 a.m. UTC | #31
On Tue, Jun 03, 2014 at 10:21:58AM -0400, Gabriel L. Somlo wrote:
> On Tue, Jun 03, 2014 at 11:17:48AM +0200, Paolo Bonzini wrote:
> > 
> > I think it's fine as it is now. :)
> 
> On Mon, Jun 02, 2014 at 09:55:18PM -0400, Gabriel L. Somlo wrote:
> > 
> > W.r.t. monitor/mwait, a guest can do one of the following:
> > 
> > 1. Never check CPUID, and never use monitor/mwait
> > 	- This is great, we don't have to do anything about these
> > 
> > 2. Check CPUID for mwait, use it to idle in preference over hlt
> > 	- Linux, Windows, and Mavericks (10.9) do this
> > 	- we never want to have CPUID say "yes" to these, since
> > 	  monitor/mwait support will be clunky in the best case,
> > 	  and hlt is overwhelmingly preferable! [*]
> > 
> > 3. Never check CPUID, use monitor/mwait with abandon
> > 	- OS X 10.6 .. 10.8 does this
> > 	- emulating monitor/mwait here allows us to boot the guest
> > 	  and use it, and perform sysadmin surgery to force a hlt
> > 	  based idle
> > 
> > 4. Check CPUID, panic if unavailable
> > 	- OS X 10.5 did this, IIRC.
> > 	- whether I can do kext surgery and get it to stop checking
> > 	  CPUID *in addition to* falling back to hlt-based idle is
> > 	  TBD.
> > 	- emulating monitor/mwait allows us to boot this type of
> > 	  guest, BUT WE ALSO HAVE TO ADVERTISE IT VIA CPUID !!!
> 
> As it is right now, #4 is not being addressed (and we can't just
> advertise mwait via cpuid, or we'd be screwing up #2).

Yes, I didn't understand 10.5 did #4.

> I also feel a bit weird about the "undocumented feature" aspect
> of NOT generating an invalid opcode for something that *should*
> be an invalid opcode according to the feature set advertised via
> cpuid...
> 
> So if there's a way to make it so we can tell QEMU/KVM to
> "--enable-mwait" on a per-guest basis, I think that'd be better
> than an always-on "undocumented" behavior...
> 
> But then again, I'm most likely missing something about the big
> picture... :)
> 
> Thanks much,
> --Gabriel
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index f47a104..d094fc6 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -283,6 +283,8 @@  static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 		0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
 	/* cpuid 1.ecx */
 	const u32 kvm_supported_word4_x86_features =
+		/* NOTE: MONITOR (and MWAIT) are emulated as NOP,
+		 * but *not* advertised to guests via CPUID ! */
 		F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
 		0 /* DS-CPL, VMX, SMX, EST */ |
 		0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 7f4f9c2..0b7d58d 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2770,12 +2770,6 @@  static int xsetbv_interception(struct vcpu_svm *svm)
 	return 1;
 }
 
-static int invalid_op_interception(struct vcpu_svm *svm)
-{
-	kvm_queue_exception(&svm->vcpu, UD_VECTOR);
-	return 1;
-}
-
 static int task_switch_interception(struct vcpu_svm *svm)
 {
 	u16 tss_selector;
@@ -3287,6 +3281,24 @@  static int pause_interception(struct vcpu_svm *svm)
 	return 1;
 }
 
+static int nop_interception(struct vcpu_svm *svm)
+{
+	skip_emulated_instruction(&(svm->vcpu));
+	return 1;
+}
+
+static int monitor_interception(struct vcpu_svm *svm)
+{
+	printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n");
+	return nop_interception(svm);
+}
+
+static int mwait_interception(struct vcpu_svm *svm)
+{
+	printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n");
+	return nop_interception(svm);
+}
+
 static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = {
 	[SVM_EXIT_READ_CR0]			= cr_interception,
 	[SVM_EXIT_READ_CR3]			= cr_interception,
@@ -3344,8 +3356,8 @@  static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = {
 	[SVM_EXIT_CLGI]				= clgi_interception,
 	[SVM_EXIT_SKINIT]			= skinit_interception,
 	[SVM_EXIT_WBINVD]                       = emulate_on_interception,
-	[SVM_EXIT_MONITOR]			= invalid_op_interception,
-	[SVM_EXIT_MWAIT]			= invalid_op_interception,
+	[SVM_EXIT_MONITOR]			= monitor_interception,
+	[SVM_EXIT_MWAIT]			= mwait_interception,
 	[SVM_EXIT_XSETBV]			= xsetbv_interception,
 	[SVM_EXIT_NPF]				= pf_interception,
 };
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 33e8c02..3ccbcb1 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5669,12 +5669,24 @@  static int handle_pause(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
-static int handle_invalid_op(struct kvm_vcpu *vcpu)
+static int handle_nop(struct kvm_vcpu *vcpu)
 {
-	kvm_queue_exception(vcpu, UD_VECTOR);
+	skip_emulated_instruction(vcpu);
 	return 1;
 }
 
+static int handle_mwait(struct kvm_vcpu *vcpu)
+{
+	printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n");
+	return handle_nop(vcpu);
+}
+
+static int handle_monitor(struct kvm_vcpu *vcpu)
+{
+	printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n");
+	return handle_nop(vcpu);
+}
+
 /*
  * To run an L2 guest, we need a vmcs02 based on the L1-specified vmcs12.
  * We could reuse a single VMCS for all the L2 guests, but we also want the
@@ -6571,8 +6583,8 @@  static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[EXIT_REASON_EPT_VIOLATION]	      = handle_ept_violation,
 	[EXIT_REASON_EPT_MISCONFIG]           = handle_ept_misconfig,
 	[EXIT_REASON_PAUSE_INSTRUCTION]       = handle_pause,
-	[EXIT_REASON_MWAIT_INSTRUCTION]	      = handle_invalid_op,
-	[EXIT_REASON_MONITOR_INSTRUCTION]     = handle_invalid_op,
+	[EXIT_REASON_MWAIT_INSTRUCTION]	      = handle_mwait,
+	[EXIT_REASON_MONITOR_INSTRUCTION]     = handle_monitor,
 	[EXIT_REASON_INVEPT]                  = handle_invept,
 };