[v2] KVM: nVMX: Fix setting of CR0 and CR4 in guest mode

Message ID	5134F8AD.8030607@web.de (mailing list archive)
State	New, archived
Headers	show Return-Path: <kvm-owner@vger.kernel.org> Message-ID: <5134F8AD.8030607@web.de> Date: Mon, 04 Mar 2013 20:40:29 +0100 From: Jan Kiszka <jan.kiszka@web.de> User-Agent: Mozilla/5.0 (X11; U; Linux i686 (x86_64); de; rv:1.8.1.12) Gecko/20080226 SUSE/2.0.0.12-1.1 Thunderbird/2.0.0.12 Mnenhy/0.7.5.666 MIME-Version: 1.0 To: Gleb Natapov <gleb@redhat.com>, Marcelo Tosatti <mtosatti@redhat.com> CC: kvm <kvm@vger.kernel.org>, Nadav Har'El <nyh@math.technion.ac.il> Subject: [PATCH v2] KVM: nVMX: Fix setting of CR0 and CR4 in guest mode Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="----enig2MNWURWWLDDGJCXVQOREO" Sender: kvm-owner@vger.kernel.org Precedence: bulk

Jan Kiszka March 4, 2013, 7:40 p.m. UTC

The logic for calculating the value with which we call kvm_set_cr0/4 was
broken (will definitely be visible with nested unrestricted guest mode
support). Also, we performed the check regarding CR0_ALWAYSON too early
when in guest mode.

What really needs to be done on both CR0 and CR4 is to mask out L1-owned
bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and
arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus,
are not suited as input.

For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and
refuse the update if it fails. To be fully consistent, we implement this
check now also for CR4.

Finally, we have to set the shadow to the value L2 wanted to write
originally.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---

Changes in v2:
 - keep the non-misleading part of the comment in handle_set_cr0

 arch/x86/kvm/vmx.c |   46 +++++++++++++++++++++++++++++++---------------
 1 files changed, 31 insertions(+), 15 deletions(-)

Gleb Natapov March 5, 2013, 12:57 p.m. UTC | #1

On Mon, Mar 04, 2013 at 08:40:29PM +0100, Jan Kiszka wrote:
> The logic for calculating the value with which we call kvm_set_cr0/4 was
> broken (will definitely be visible with nested unrestricted guest mode
> support). Also, we performed the check regarding CR0_ALWAYSON too early
> when in guest mode.
> 
> What really needs to be done on both CR0 and CR4 is to mask out L1-owned
> bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and
> arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus,
> are not suited as input.
> 
> For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and
> refuse the update if it fails. To be fully consistent, we implement this
> check now also for CR4.
> 
> Finally, we have to set the shadow to the value L2 wanted to write
> originally.
> 
> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> ---
> 
> Changes in v2:
>  - keep the non-misleading part of the comment in handle_set_cr0
> 
>  arch/x86/kvm/vmx.c |   46 +++++++++++++++++++++++++++++++---------------
>  1 files changed, 31 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 7cc566b..832b7b4 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -4605,37 +4605,53 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
>  /* called to set cr0 as appropriate for a mov-to-cr0 exit. */
>  static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val)
>  {
> -	if (to_vmx(vcpu)->nested.vmxon &&
> -	    ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON))
> -		return 1;
> -
>  	if (is_guest_mode(vcpu)) {
> +		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> +		unsigned long orig_val = val;
> +
>  		/*
>  		 * We get here when L2 changed cr0 in a way that did not change
>  		 * any of L1's shadowed bits (see nested_vmx_exit_handled_cr),
> -		 * but did change L0 shadowed bits. This can currently happen
> -		 * with the TS bit: L0 may want to leave TS on (for lazy fpu
> -		 * loading) while pretending to allow the guest to change it.
> +		 * but did change L0 shadowed bits.
>  		 */
> -		if (kvm_set_cr0(vcpu, (val & vcpu->arch.cr0_guest_owned_bits) |
> -			 (vcpu->arch.cr0 & ~vcpu->arch.cr0_guest_owned_bits)))
> +		val = (val & ~vmcs12->cr0_guest_host_mask) |
> +			(vmcs_read64(GUEST_CR0) & vmcs12->cr0_guest_host_mask);
> +		if ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON)
>  			return 1;
The more I look at it the more it looks correct to me. I will continue
looking, but I think we can move VMXON_CR0_ALWAYSON check to
vmx_set_cr0(). Same for cr4 case.

> -		vmcs_writel(CR0_READ_SHADOW, val);
> +
> +		if (kvm_set_cr0(vcpu, val))
> +			return 1;
> +		vmcs_writel(CR0_READ_SHADOW, orig_val);
>  		return 0;
> -	} else
> +	} else {
> +		if (to_vmx(vcpu)->nested.vmxon &&
> +		    ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON))
> +			return 1;
>  		return kvm_set_cr0(vcpu, val);
> +	}
>  }
>  
>  static int handle_set_cr4(struct kvm_vcpu *vcpu, unsigned long val)
>  {
>  	if (is_guest_mode(vcpu)) {
> -		if (kvm_set_cr4(vcpu, (val & vcpu->arch.cr4_guest_owned_bits) |
> -			 (vcpu->arch.cr4 & ~vcpu->arch.cr4_guest_owned_bits)))
> +		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> +		unsigned long orig_val = val;
> +
> +		val = (val & ~vmcs12->cr4_guest_host_mask) |
> +			(vmcs_readl(GUEST_CR4) & vmcs12->cr4_guest_host_mask);
> +		if ((val & VMXON_CR4_ALWAYSON) != VMXON_CR4_ALWAYSON)
> +			return 1;
> +
> +		if (kvm_set_cr4(vcpu, val))
>  			return 1;
> -		vmcs_writel(CR4_READ_SHADOW, val);
> +		vmcs_writel(CR4_READ_SHADOW, orig_val);
>  		return 0;
> -	} else
> +	} else {
> +		if (to_vmx(vcpu)->nested.vmxon &&
> +		    ((val & VMXON_CR4_ALWAYSON) != VMXON_CR4_ALWAYSON))
> +			return 1;
>  		return kvm_set_cr4(vcpu, val);
> +	}
>  }
>  
>  /* called to set cr0 as approriate for clts instruction exit. */
> -- 
> 1.7.3.4
> 



--
			Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Gleb Natapov March 5, 2013, 1:02 p.m. UTC | #2

On Tue, Mar 05, 2013 at 02:57:03PM +0200, Gleb Natapov wrote:
> On Mon, Mar 04, 2013 at 08:40:29PM +0100, Jan Kiszka wrote:
> > The logic for calculating the value with which we call kvm_set_cr0/4 was
> > broken (will definitely be visible with nested unrestricted guest mode
> > support). Also, we performed the check regarding CR0_ALWAYSON too early
> > when in guest mode.
> > 
> > What really needs to be done on both CR0 and CR4 is to mask out L1-owned
> > bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and
> > arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus,
> > are not suited as input.
> > 
> > For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and
> > refuse the update if it fails. To be fully consistent, we implement this
> > check now also for CR4.
> > 
> > Finally, we have to set the shadow to the value L2 wanted to write
> > originally.
> > 
> > Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> > ---
> > 
> > Changes in v2:
> >  - keep the non-misleading part of the comment in handle_set_cr0
> > 
> >  arch/x86/kvm/vmx.c |   46 +++++++++++++++++++++++++++++++---------------
> >  1 files changed, 31 insertions(+), 15 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 7cc566b..832b7b4 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -4605,37 +4605,53 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
> >  /* called to set cr0 as appropriate for a mov-to-cr0 exit. */
> >  static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val)
> >  {
> > -	if (to_vmx(vcpu)->nested.vmxon &&
> > -	    ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON))
> > -		return 1;
> > -
> >  	if (is_guest_mode(vcpu)) {
> > +		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> > +		unsigned long orig_val = val;
> > +
> >  		/*
> >  		 * We get here when L2 changed cr0 in a way that did not change
> >  		 * any of L1's shadowed bits (see nested_vmx_exit_handled_cr),
> > -		 * but did change L0 shadowed bits. This can currently happen
> > -		 * with the TS bit: L0 may want to leave TS on (for lazy fpu
> > -		 * loading) while pretending to allow the guest to change it.
> > +		 * but did change L0 shadowed bits.
> >  		 */
> > -		if (kvm_set_cr0(vcpu, (val & vcpu->arch.cr0_guest_owned_bits) |
> > -			 (vcpu->arch.cr0 & ~vcpu->arch.cr0_guest_owned_bits)))
> > +		val = (val & ~vmcs12->cr0_guest_host_mask) |
> > +			(vmcs_read64(GUEST_CR0) & vmcs12->cr0_guest_host_mask);
> > +		if ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON)
> >  			return 1;
> The more I look at it the more it looks correct to me. I will continue
> looking, but I think we can move VMXON_CR0_ALWAYSON check to
> vmx_set_cr0(). Same for cr4 case.
> 
BTW would be nice to have unit test for nested :)

--
			Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jan Kiszka March 5, 2013, 1:18 p.m. UTC | #3

On 2013-03-05 14:02, Gleb Natapov wrote:
> On Tue, Mar 05, 2013 at 02:57:03PM +0200, Gleb Natapov wrote:
>> On Mon, Mar 04, 2013 at 08:40:29PM +0100, Jan Kiszka wrote:
>>> The logic for calculating the value with which we call kvm_set_cr0/4 was
>>> broken (will definitely be visible with nested unrestricted guest mode
>>> support). Also, we performed the check regarding CR0_ALWAYSON too early
>>> when in guest mode.
>>>
>>> What really needs to be done on both CR0 and CR4 is to mask out L1-owned
>>> bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and
>>> arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus,
>>> are not suited as input.
>>>
>>> For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and
>>> refuse the update if it fails. To be fully consistent, we implement this
>>> check now also for CR4.
>>>
>>> Finally, we have to set the shadow to the value L2 wanted to write
>>> originally.
>>>
>>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
>>> ---
>>>
>>> Changes in v2:
>>>  - keep the non-misleading part of the comment in handle_set_cr0
>>>
>>>  arch/x86/kvm/vmx.c |   46 +++++++++++++++++++++++++++++++---------------
>>>  1 files changed, 31 insertions(+), 15 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>> index 7cc566b..832b7b4 100644
>>> --- a/arch/x86/kvm/vmx.c
>>> +++ b/arch/x86/kvm/vmx.c
>>> @@ -4605,37 +4605,53 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
>>>  /* called to set cr0 as appropriate for a mov-to-cr0 exit. */
>>>  static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val)
>>>  {
>>> -	if (to_vmx(vcpu)->nested.vmxon &&
>>> -	    ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON))
>>> -		return 1;
>>> -
>>>  	if (is_guest_mode(vcpu)) {
>>> +		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
>>> +		unsigned long orig_val = val;
>>> +
>>>  		/*
>>>  		 * We get here when L2 changed cr0 in a way that did not change
>>>  		 * any of L1's shadowed bits (see nested_vmx_exit_handled_cr),
>>> -		 * but did change L0 shadowed bits. This can currently happen
>>> -		 * with the TS bit: L0 may want to leave TS on (for lazy fpu
>>> -		 * loading) while pretending to allow the guest to change it.
>>> +		 * but did change L0 shadowed bits.
>>>  		 */
>>> -		if (kvm_set_cr0(vcpu, (val & vcpu->arch.cr0_guest_owned_bits) |
>>> -			 (vcpu->arch.cr0 & ~vcpu->arch.cr0_guest_owned_bits)))
>>> +		val = (val & ~vmcs12->cr0_guest_host_mask) |
>>> +			(vmcs_read64(GUEST_CR0) & vmcs12->cr0_guest_host_mask);
>>> +		if ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON)
>>>  			return 1;
>> The more I look at it the more it looks correct to me. I will continue
>> looking, but I think we can move VMXON_CR0_ALWAYSON check to
>> vmx_set_cr0(). Same for cr4 case.
>>
> BTW would be nice to have unit test for nested :)

Yeah, and I want a pony. :)

Possible, but "a bit" more complex than what it takes to write native tests.

Jan

Marcelo Tosatti March 7, 2013, 12:33 a.m. UTC | #4

On Mon, Mar 04, 2013 at 08:40:29PM +0100, Jan Kiszka wrote:
> The logic for calculating the value with which we call kvm_set_cr0/4 was
> broken (will definitely be visible with nested unrestricted guest mode
> support). Also, we performed the check regarding CR0_ALWAYSON too early
> when in guest mode.
> 
> What really needs to be done on both CR0 and CR4 is to mask out L1-owned
> bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and
> arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus,
> are not suited as input.
> 
> For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and
> refuse the update if it fails. To be fully consistent, we implement this
> check now also for CR4.
> 
> Finally, we have to set the shadow to the value L2 wanted to write
> originally.
> 
> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>

Applied, thanks.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Gleb Natapov March 7, 2013, 7:51 a.m. UTC | #5

On Mon, Mar 04, 2013 at 08:40:29PM +0100, Jan Kiszka wrote:
> The logic for calculating the value with which we call kvm_set_cr0/4 was
> broken (will definitely be visible with nested unrestricted guest mode
> support). Also, we performed the check regarding CR0_ALWAYSON too early
> when in guest mode.
> 
> What really needs to be done on both CR0 and CR4 is to mask out L1-owned
> bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and
> arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus,
> are not suited as input.
> 
> For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and
> refuse the update if it fails. To be fully consistent, we implement this
> check now also for CR4.
> 
> Finally, we have to set the shadow to the value L2 wanted to write
> originally.
> 
> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> ---
> 
> Changes in v2:
>  - keep the non-misleading part of the comment in handle_set_cr0
> 
>  arch/x86/kvm/vmx.c |   46 +++++++++++++++++++++++++++++++---------------
>  1 files changed, 31 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 7cc566b..832b7b4 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -4605,37 +4605,53 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
>  /* called to set cr0 as appropriate for a mov-to-cr0 exit. */
>  static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val)
>  {
> -	if (to_vmx(vcpu)->nested.vmxon &&
> -	    ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON))
> -		return 1;
> -
>  	if (is_guest_mode(vcpu)) {
> +		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> +		unsigned long orig_val = val;
> +
>  		/*
>  		 * We get here when L2 changed cr0 in a way that did not change
>  		 * any of L1's shadowed bits (see nested_vmx_exit_handled_cr),
> -		 * but did change L0 shadowed bits. This can currently happen
> -		 * with the TS bit: L0 may want to leave TS on (for lazy fpu
> -		 * loading) while pretending to allow the guest to change it.
> +		 * but did change L0 shadowed bits.
>  		 */
> -		if (kvm_set_cr0(vcpu, (val & vcpu->arch.cr0_guest_owned_bits) |
> -			 (vcpu->arch.cr0 & ~vcpu->arch.cr0_guest_owned_bits)))
> +		val = (val & ~vmcs12->cr0_guest_host_mask) |
> +			(vmcs_read64(GUEST_CR0) & vmcs12->cr0_guest_host_mask);
I think using GUEST_CR0 here is incorrect. It contains combination of bits
set by L2, L1 and L0 and here we need to get only L2/L1 mix which is in
vcpu->arch.cr0 (almost, but good enough for this case). Why vcpu->arch.cr0
contains right L2/L1 mix? Because it was set to vmcs12->guest_cr0 during
L2 #vmentry. While L2 is running three things may happen to CR0:

 1. L2 writes to a bit that is not shadowed neither by L1 nor by L0. It
    will go strait to GUEST_CR0.
 2. L2 writes to a bit shadowed by L1. L1 #vmexit will be emulated. On the
    next #vmetry vcpu->arch.cr0 will be set to whatever value L1 calculated.
 3. L2 writes to a bit shadowed by L0, but not L1. This is the case we
    are handling here. And if we will do it right vcpu->arch.cr0 will be
    up-to-date at the end.

The only case when, while this code running, vcpu->arch.cr0 has not
up-to-date value is if 1 happened, but since L2 guest overwriting cr0
here anyway it does not matter what it previously set in GUEST_CR0. The
correct bits are in a new cr0 value provided by val and accessible by
(val & ~vmcs12->cr0_guest_host_mask).



--
			Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jan Kiszka March 7, 2013, 8:12 a.m. UTC | #6

On 2013-03-07 08:51, Gleb Natapov wrote:
> On Mon, Mar 04, 2013 at 08:40:29PM +0100, Jan Kiszka wrote:
>> The logic for calculating the value with which we call kvm_set_cr0/4 was
>> broken (will definitely be visible with nested unrestricted guest mode
>> support). Also, we performed the check regarding CR0_ALWAYSON too early
>> when in guest mode.
>>
>> What really needs to be done on both CR0 and CR4 is to mask out L1-owned
>> bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and
>> arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus,
>> are not suited as input.
>>
>> For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and
>> refuse the update if it fails. To be fully consistent, we implement this
>> check now also for CR4.
>>
>> Finally, we have to set the shadow to the value L2 wanted to write
>> originally.
>>
>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
>> ---
>>
>> Changes in v2:
>>  - keep the non-misleading part of the comment in handle_set_cr0
>>
>>  arch/x86/kvm/vmx.c |   46 +++++++++++++++++++++++++++++++---------------
>>  1 files changed, 31 insertions(+), 15 deletions(-)
>>
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index 7cc566b..832b7b4 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -4605,37 +4605,53 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
>>  /* called to set cr0 as appropriate for a mov-to-cr0 exit. */
>>  static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val)
>>  {
>> -	if (to_vmx(vcpu)->nested.vmxon &&
>> -	    ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON))
>> -		return 1;
>> -
>>  	if (is_guest_mode(vcpu)) {
>> +		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
>> +		unsigned long orig_val = val;
>> +
>>  		/*
>>  		 * We get here when L2 changed cr0 in a way that did not change
>>  		 * any of L1's shadowed bits (see nested_vmx_exit_handled_cr),
>> -		 * but did change L0 shadowed bits. This can currently happen
>> -		 * with the TS bit: L0 may want to leave TS on (for lazy fpu
>> -		 * loading) while pretending to allow the guest to change it.
>> +		 * but did change L0 shadowed bits.
>>  		 */
>> -		if (kvm_set_cr0(vcpu, (val & vcpu->arch.cr0_guest_owned_bits) |
>> -			 (vcpu->arch.cr0 & ~vcpu->arch.cr0_guest_owned_bits)))
>> +		val = (val & ~vmcs12->cr0_guest_host_mask) |
>> +			(vmcs_read64(GUEST_CR0) & vmcs12->cr0_guest_host_mask);
> I think using GUEST_CR0 here is incorrect. It contains combination of bits
> set by L2, L1 and L0 and here we need to get only L2/L1 mix which is in
> vcpu->arch.cr0 (almost, but good enough for this case). Why vcpu->arch.cr0
> contains right L2/L1 mix?

L0/L1. E.g., kvm_set_cr0 unconditionally injects X86_CR0_ET and masks
out reserved bits. But you are right, GUEST_CR0 isn't much better. And
maybe that mangling in kvm_set_cr0 is a corner case we can ignore.

> Because it was set to vmcs12->guest_cr0 during
> L2 #vmentry. While L2 is running three things may happen to CR0:
> 
>  1. L2 writes to a bit that is not shadowed neither by L1 nor by L0. It
>     will go strait to GUEST_CR0.
>  2. L2 writes to a bit shadowed by L1. L1 #vmexit will be emulated. On the
>     next #vmetry vcpu->arch.cr0 will be set to whatever value L1 calculated.
>  3. L2 writes to a bit shadowed by L0, but not L1. This is the case we
>     are handling here. And if we will do it right vcpu->arch.cr0 will be
>     up-to-date at the end.
> 
> The only case when, while this code running, vcpu->arch.cr0 has not
> up-to-date value is if 1 happened, but since L2 guest overwriting cr0
> here anyway it does not matter what it previously set in GUEST_CR0. The
> correct bits are in a new cr0 value provided by val and accessible by
> (val & ~vmcs12->cr0_guest_host_mask).

I need to think about it again. Maybe vmcs12->guest_cr0 is best, but
that's a shot from the hips now.

Jan

Jan Kiszka March 7, 2013, 8:27 a.m. UTC | #7

On 2013-03-07 09:12, Jan Kiszka wrote:
> On 2013-03-07 08:51, Gleb Natapov wrote:
>> On Mon, Mar 04, 2013 at 08:40:29PM +0100, Jan Kiszka wrote:
>>> The logic for calculating the value with which we call kvm_set_cr0/4 was
>>> broken (will definitely be visible with nested unrestricted guest mode
>>> support). Also, we performed the check regarding CR0_ALWAYSON too early
>>> when in guest mode.
>>>
>>> What really needs to be done on both CR0 and CR4 is to mask out L1-owned
>>> bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and
>>> arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus,
>>> are not suited as input.
>>>
>>> For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and
>>> refuse the update if it fails. To be fully consistent, we implement this
>>> check now also for CR4.
>>>
>>> Finally, we have to set the shadow to the value L2 wanted to write
>>> originally.
>>>
>>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
>>> ---
>>>
>>> Changes in v2:
>>>  - keep the non-misleading part of the comment in handle_set_cr0
>>>
>>>  arch/x86/kvm/vmx.c |   46 +++++++++++++++++++++++++++++++---------------
>>>  1 files changed, 31 insertions(+), 15 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>> index 7cc566b..832b7b4 100644
>>> --- a/arch/x86/kvm/vmx.c
>>> +++ b/arch/x86/kvm/vmx.c
>>> @@ -4605,37 +4605,53 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
>>>  /* called to set cr0 as appropriate for a mov-to-cr0 exit. */
>>>  static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val)
>>>  {
>>> -	if (to_vmx(vcpu)->nested.vmxon &&
>>> -	    ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON))
>>> -		return 1;
>>> -
>>>  	if (is_guest_mode(vcpu)) {
>>> +		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
>>> +		unsigned long orig_val = val;
>>> +
>>>  		/*
>>>  		 * We get here when L2 changed cr0 in a way that did not change
>>>  		 * any of L1's shadowed bits (see nested_vmx_exit_handled_cr),
>>> -		 * but did change L0 shadowed bits. This can currently happen
>>> -		 * with the TS bit: L0 may want to leave TS on (for lazy fpu
>>> -		 * loading) while pretending to allow the guest to change it.
>>> +		 * but did change L0 shadowed bits.
>>>  		 */
>>> -		if (kvm_set_cr0(vcpu, (val & vcpu->arch.cr0_guest_owned_bits) |
>>> -			 (vcpu->arch.cr0 & ~vcpu->arch.cr0_guest_owned_bits)))
>>> +		val = (val & ~vmcs12->cr0_guest_host_mask) |
>>> +			(vmcs_read64(GUEST_CR0) & vmcs12->cr0_guest_host_mask);
>> I think using GUEST_CR0 here is incorrect. It contains combination of bits
>> set by L2, L1 and L0 and here we need to get only L2/L1 mix which is in
>> vcpu->arch.cr0 (almost, but good enough for this case). Why vcpu->arch.cr0
>> contains right L2/L1 mix?
> 
> L0/L1. E.g., kvm_set_cr0 unconditionally injects X86_CR0_ET and masks
> out reserved bits. But you are right, GUEST_CR0 isn't much better. And
> maybe that mangling in kvm_set_cr0 is a corner case we can ignore.
> 
>> Because it was set to vmcs12->guest_cr0 during
>> L2 #vmentry. While L2 is running three things may happen to CR0:
>>
>>  1. L2 writes to a bit that is not shadowed neither by L1 nor by L0. It
>>     will go strait to GUEST_CR0.
>>  2. L2 writes to a bit shadowed by L1. L1 #vmexit will be emulated. On the
>>     next #vmetry vcpu->arch.cr0 will be set to whatever value L1 calculated.
>>  3. L2 writes to a bit shadowed by L0, but not L1. This is the case we
>>     are handling here. And if we will do it right vcpu->arch.cr0 will be
>>     up-to-date at the end.
>>
>> The only case when, while this code running, vcpu->arch.cr0 has not
>> up-to-date value is if 1 happened, but since L2 guest overwriting cr0
>> here anyway it does not matter what it previously set in GUEST_CR0. The
>> correct bits are in a new cr0 value provided by val and accessible by
>> (val & ~vmcs12->cr0_guest_host_mask).
> 
> I need to think about it again. Maybe vmcs12->guest_cr0 is best, but
> that's a shot from the hips now.

Yes, vmcs12->guest_cr0/4 is the correct input. We cannot tolerate to
pick up L0-mangled bits from arch.cr0/4 as we will perform checks on the
result and report errors back to the guest. Do you agree?

Marcelo, can you drop this version of the patch (I do not see it in
public git yet) and wait for some v3 I will send later? Otherwise I will
fix on top.

Thanks,
Jan

Gleb Natapov March 7, 2013, 8:43 a.m. UTC | #8

On Thu, Mar 07, 2013 at 09:12:19AM +0100, Jan Kiszka wrote:
> On 2013-03-07 08:51, Gleb Natapov wrote:
> > On Mon, Mar 04, 2013 at 08:40:29PM +0100, Jan Kiszka wrote:
> >> The logic for calculating the value with which we call kvm_set_cr0/4 was
> >> broken (will definitely be visible with nested unrestricted guest mode
> >> support). Also, we performed the check regarding CR0_ALWAYSON too early
> >> when in guest mode.
> >>
> >> What really needs to be done on both CR0 and CR4 is to mask out L1-owned
> >> bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and
> >> arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus,
> >> are not suited as input.
> >>
> >> For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and
> >> refuse the update if it fails. To be fully consistent, we implement this
> >> check now also for CR4.
> >>
> >> Finally, we have to set the shadow to the value L2 wanted to write
> >> originally.
> >>
> >> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> >> ---
> >>
> >> Changes in v2:
> >>  - keep the non-misleading part of the comment in handle_set_cr0
> >>
> >>  arch/x86/kvm/vmx.c |   46 +++++++++++++++++++++++++++++++---------------
> >>  1 files changed, 31 insertions(+), 15 deletions(-)
> >>
> >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >> index 7cc566b..832b7b4 100644
> >> --- a/arch/x86/kvm/vmx.c
> >> +++ b/arch/x86/kvm/vmx.c
> >> @@ -4605,37 +4605,53 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
> >>  /* called to set cr0 as appropriate for a mov-to-cr0 exit. */
> >>  static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val)
> >>  {
> >> -	if (to_vmx(vcpu)->nested.vmxon &&
> >> -	    ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON))
> >> -		return 1;
> >> -
> >>  	if (is_guest_mode(vcpu)) {
> >> +		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> >> +		unsigned long orig_val = val;
> >> +
> >>  		/*
> >>  		 * We get here when L2 changed cr0 in a way that did not change
> >>  		 * any of L1's shadowed bits (see nested_vmx_exit_handled_cr),
> >> -		 * but did change L0 shadowed bits. This can currently happen
> >> -		 * with the TS bit: L0 may want to leave TS on (for lazy fpu
> >> -		 * loading) while pretending to allow the guest to change it.
> >> +		 * but did change L0 shadowed bits.
> >>  		 */
> >> -		if (kvm_set_cr0(vcpu, (val & vcpu->arch.cr0_guest_owned_bits) |
> >> -			 (vcpu->arch.cr0 & ~vcpu->arch.cr0_guest_owned_bits)))
> >> +		val = (val & ~vmcs12->cr0_guest_host_mask) |
> >> +			(vmcs_read64(GUEST_CR0) & vmcs12->cr0_guest_host_mask);
> > I think using GUEST_CR0 here is incorrect. It contains combination of bits
> > set by L2, L1 and L0 and here we need to get only L2/L1 mix which is in
> > vcpu->arch.cr0 (almost, but good enough for this case). Why vcpu->arch.cr0
> > contains right L2/L1 mix?
> 
> L0/L1. E.g., kvm_set_cr0 unconditionally injects X86_CR0_ET and masks
> out reserved bits. But you are right, GUEST_CR0 isn't much better. And
> maybe that mangling in kvm_set_cr0 is a corner case we can ignore.
> 
I think we can. ET is R/O and wired to 1, so it does not matter what
guest writes there it should be treated as 1. About reserved bits spec
says that software should write what it reads there and does not specify
what happens if software does not follow this.

> > Because it was set to vmcs12->guest_cr0 during
> > L2 #vmentry. While L2 is running three things may happen to CR0:
> > 
> >  1. L2 writes to a bit that is not shadowed neither by L1 nor by L0. It
> >     will go strait to GUEST_CR0.
> >  2. L2 writes to a bit shadowed by L1. L1 #vmexit will be emulated. On the
> >     next #vmetry vcpu->arch.cr0 will be set to whatever value L1 calculated.
> >  3. L2 writes to a bit shadowed by L0, but not L1. This is the case we
> >     are handling here. And if we will do it right vcpu->arch.cr0 will be
> >     up-to-date at the end.
> > 
> > The only case when, while this code running, vcpu->arch.cr0 has not
> > up-to-date value is if 1 happened, but since L2 guest overwriting cr0
> > here anyway it does not matter what it previously set in GUEST_CR0. The
> > correct bits are in a new cr0 value provided by val and accessible by
> > (val & ~vmcs12->cr0_guest_host_mask).
> 
> I need to think about it again. Maybe vmcs12->guest_cr0 is best, but
> that's a shot from the hips now.
> 
I do not think it is correct because case 3 does not update it. So if 3
happens twice without L1 #vmexit between then vmcs12->guest_cr0 will be
outdated.

--
			Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Gleb Natapov March 7, 2013, 8:48 a.m. UTC | #9

On Thu, Mar 07, 2013 at 09:27:10AM +0100, Jan Kiszka wrote:
> On 2013-03-07 09:12, Jan Kiszka wrote:
> > On 2013-03-07 08:51, Gleb Natapov wrote:
> >> On Mon, Mar 04, 2013 at 08:40:29PM +0100, Jan Kiszka wrote:
> >>> The logic for calculating the value with which we call kvm_set_cr0/4 was
> >>> broken (will definitely be visible with nested unrestricted guest mode
> >>> support). Also, we performed the check regarding CR0_ALWAYSON too early
> >>> when in guest mode.
> >>>
> >>> What really needs to be done on both CR0 and CR4 is to mask out L1-owned
> >>> bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and
> >>> arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus,
> >>> are not suited as input.
> >>>
> >>> For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and
> >>> refuse the update if it fails. To be fully consistent, we implement this
> >>> check now also for CR4.
> >>>
> >>> Finally, we have to set the shadow to the value L2 wanted to write
> >>> originally.
> >>>
> >>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> >>> ---
> >>>
> >>> Changes in v2:
> >>>  - keep the non-misleading part of the comment in handle_set_cr0
> >>>
> >>>  arch/x86/kvm/vmx.c |   46 +++++++++++++++++++++++++++++++---------------
> >>>  1 files changed, 31 insertions(+), 15 deletions(-)
> >>>
> >>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >>> index 7cc566b..832b7b4 100644
> >>> --- a/arch/x86/kvm/vmx.c
> >>> +++ b/arch/x86/kvm/vmx.c
> >>> @@ -4605,37 +4605,53 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
> >>>  /* called to set cr0 as appropriate for a mov-to-cr0 exit. */
> >>>  static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val)
> >>>  {
> >>> -	if (to_vmx(vcpu)->nested.vmxon &&
> >>> -	    ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON))
> >>> -		return 1;
> >>> -
> >>>  	if (is_guest_mode(vcpu)) {
> >>> +		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> >>> +		unsigned long orig_val = val;
> >>> +
> >>>  		/*
> >>>  		 * We get here when L2 changed cr0 in a way that did not change
> >>>  		 * any of L1's shadowed bits (see nested_vmx_exit_handled_cr),
> >>> -		 * but did change L0 shadowed bits. This can currently happen
> >>> -		 * with the TS bit: L0 may want to leave TS on (for lazy fpu
> >>> -		 * loading) while pretending to allow the guest to change it.
> >>> +		 * but did change L0 shadowed bits.
> >>>  		 */
> >>> -		if (kvm_set_cr0(vcpu, (val & vcpu->arch.cr0_guest_owned_bits) |
> >>> -			 (vcpu->arch.cr0 & ~vcpu->arch.cr0_guest_owned_bits)))
> >>> +		val = (val & ~vmcs12->cr0_guest_host_mask) |
> >>> +			(vmcs_read64(GUEST_CR0) & vmcs12->cr0_guest_host_mask);
> >> I think using GUEST_CR0 here is incorrect. It contains combination of bits
> >> set by L2, L1 and L0 and here we need to get only L2/L1 mix which is in
> >> vcpu->arch.cr0 (almost, but good enough for this case). Why vcpu->arch.cr0
> >> contains right L2/L1 mix?
> > 
> > L0/L1. E.g., kvm_set_cr0 unconditionally injects X86_CR0_ET and masks
> > out reserved bits. But you are right, GUEST_CR0 isn't much better. And
> > maybe that mangling in kvm_set_cr0 is a corner case we can ignore.
> > 
> >> Because it was set to vmcs12->guest_cr0 during
> >> L2 #vmentry. While L2 is running three things may happen to CR0:
> >>
> >>  1. L2 writes to a bit that is not shadowed neither by L1 nor by L0. It
> >>     will go strait to GUEST_CR0.
> >>  2. L2 writes to a bit shadowed by L1. L1 #vmexit will be emulated. On the
> >>     next #vmetry vcpu->arch.cr0 will be set to whatever value L1 calculated.
> >>  3. L2 writes to a bit shadowed by L0, but not L1. This is the case we
> >>     are handling here. And if we will do it right vcpu->arch.cr0 will be
> >>     up-to-date at the end.
> >>
> >> The only case when, while this code running, vcpu->arch.cr0 has not
> >> up-to-date value is if 1 happened, but since L2 guest overwriting cr0
> >> here anyway it does not matter what it previously set in GUEST_CR0. The
> >> correct bits are in a new cr0 value provided by val and accessible by
> >> (val & ~vmcs12->cr0_guest_host_mask).
> > 
> > I need to think about it again. Maybe vmcs12->guest_cr0 is best, but
> > that's a shot from the hips now.
> 
> Yes, vmcs12->guest_cr0/4 is the correct input. We cannot tolerate to
> pick up L0-mangled bits from arch.cr0/4 as we will perform checks on the
> result and report errors back to the guest. Do you agree?
> 
See my answer to your previous email about when vmcs12->guest_cr0/4 may
contain incorrect value. About mangling in arch.cr0/4 I think it is
benign, but we can do L1 #vmexit if reserved bit is not zero or ET bit
is not set in L2 provided cr0 and let L1 handle it.

--
			Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jan Kiszka March 7, 2013, 8:53 a.m. UTC | #10

On 2013-03-07 09:43, Gleb Natapov wrote:
> On Thu, Mar 07, 2013 at 09:12:19AM +0100, Jan Kiszka wrote:
>> On 2013-03-07 08:51, Gleb Natapov wrote:
>>> On Mon, Mar 04, 2013 at 08:40:29PM +0100, Jan Kiszka wrote:
>>>> The logic for calculating the value with which we call kvm_set_cr0/4 was
>>>> broken (will definitely be visible with nested unrestricted guest mode
>>>> support). Also, we performed the check regarding CR0_ALWAYSON too early
>>>> when in guest mode.
>>>>
>>>> What really needs to be done on both CR0 and CR4 is to mask out L1-owned
>>>> bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and
>>>> arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus,
>>>> are not suited as input.
>>>>
>>>> For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and
>>>> refuse the update if it fails. To be fully consistent, we implement this
>>>> check now also for CR4.
>>>>
>>>> Finally, we have to set the shadow to the value L2 wanted to write
>>>> originally.
>>>>
>>>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
>>>> ---
>>>>
>>>> Changes in v2:
>>>>  - keep the non-misleading part of the comment in handle_set_cr0
>>>>
>>>>  arch/x86/kvm/vmx.c |   46 +++++++++++++++++++++++++++++++---------------
>>>>  1 files changed, 31 insertions(+), 15 deletions(-)
>>>>
>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>>> index 7cc566b..832b7b4 100644
>>>> --- a/arch/x86/kvm/vmx.c
>>>> +++ b/arch/x86/kvm/vmx.c
>>>> @@ -4605,37 +4605,53 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
>>>>  /* called to set cr0 as appropriate for a mov-to-cr0 exit. */
>>>>  static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val)
>>>>  {
>>>> -	if (to_vmx(vcpu)->nested.vmxon &&
>>>> -	    ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON))
>>>> -		return 1;
>>>> -
>>>>  	if (is_guest_mode(vcpu)) {
>>>> +		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
>>>> +		unsigned long orig_val = val;
>>>> +
>>>>  		/*
>>>>  		 * We get here when L2 changed cr0 in a way that did not change
>>>>  		 * any of L1's shadowed bits (see nested_vmx_exit_handled_cr),
>>>> -		 * but did change L0 shadowed bits. This can currently happen
>>>> -		 * with the TS bit: L0 may want to leave TS on (for lazy fpu
>>>> -		 * loading) while pretending to allow the guest to change it.
>>>> +		 * but did change L0 shadowed bits.
>>>>  		 */
>>>> -		if (kvm_set_cr0(vcpu, (val & vcpu->arch.cr0_guest_owned_bits) |
>>>> -			 (vcpu->arch.cr0 & ~vcpu->arch.cr0_guest_owned_bits)))
>>>> +		val = (val & ~vmcs12->cr0_guest_host_mask) |
>>>> +			(vmcs_read64(GUEST_CR0) & vmcs12->cr0_guest_host_mask);
>>> I think using GUEST_CR0 here is incorrect. It contains combination of bits
>>> set by L2, L1 and L0 and here we need to get only L2/L1 mix which is in
>>> vcpu->arch.cr0 (almost, but good enough for this case). Why vcpu->arch.cr0
>>> contains right L2/L1 mix?
>>
>> L0/L1. E.g., kvm_set_cr0 unconditionally injects X86_CR0_ET and masks
>> out reserved bits. But you are right, GUEST_CR0 isn't much better. And
>> maybe that mangling in kvm_set_cr0 is a corner case we can ignore.
>>
> I think we can. ET is R/O and wired to 1, so it does not matter what
> guest writes there it should be treated as 1. About reserved bits spec
> says that software should write what it reads there and does not specify
> what happens if software does not follow this.
> 
>>> Because it was set to vmcs12->guest_cr0 during
>>> L2 #vmentry. While L2 is running three things may happen to CR0:
>>>
>>>  1. L2 writes to a bit that is not shadowed neither by L1 nor by L0. It
>>>     will go strait to GUEST_CR0.
>>>  2. L2 writes to a bit shadowed by L1. L1 #vmexit will be emulated. On the
>>>     next #vmetry vcpu->arch.cr0 will be set to whatever value L1 calculated.
>>>  3. L2 writes to a bit shadowed by L0, but not L1. This is the case we
>>>     are handling here. And if we will do it right vcpu->arch.cr0 will be
>>>     up-to-date at the end.
>>>
>>> The only case when, while this code running, vcpu->arch.cr0 has not
>>> up-to-date value is if 1 happened, but since L2 guest overwriting cr0
>>> here anyway it does not matter what it previously set in GUEST_CR0. The
>>> correct bits are in a new cr0 value provided by val and accessible by
>>> (val & ~vmcs12->cr0_guest_host_mask).
>>
>> I need to think about it again. Maybe vmcs12->guest_cr0 is best, but
>> that's a shot from the hips now.
>>
> I do not think it is correct because case 3 does not update it. So if 3
> happens twice without L1 #vmexit between then vmcs12->guest_cr0 will be
> outdated.

Again, the only thing that matters here is L1's, not L0's view on the
"real" CR0 value. So guest_cr0 is never outdated (/wrt
cr0_guest_host_mask) as it will be updated by L1 in step 2. Even if
arch.cr0 vs. guest_cr0 makes no difference in practice, the latter is
more consistent, so I will go for it unless you can convince me it is wrong.

Jan

Gleb Natapov March 7, 2013, 8:57 a.m. UTC | #11

On Thu, Mar 07, 2013 at 09:53:49AM +0100, Jan Kiszka wrote:
> On 2013-03-07 09:43, Gleb Natapov wrote:
> > On Thu, Mar 07, 2013 at 09:12:19AM +0100, Jan Kiszka wrote:
> >> On 2013-03-07 08:51, Gleb Natapov wrote:
> >>> On Mon, Mar 04, 2013 at 08:40:29PM +0100, Jan Kiszka wrote:
> >>>> The logic for calculating the value with which we call kvm_set_cr0/4 was
> >>>> broken (will definitely be visible with nested unrestricted guest mode
> >>>> support). Also, we performed the check regarding CR0_ALWAYSON too early
> >>>> when in guest mode.
> >>>>
> >>>> What really needs to be done on both CR0 and CR4 is to mask out L1-owned
> >>>> bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and
> >>>> arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus,
> >>>> are not suited as input.
> >>>>
> >>>> For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and
> >>>> refuse the update if it fails. To be fully consistent, we implement this
> >>>> check now also for CR4.
> >>>>
> >>>> Finally, we have to set the shadow to the value L2 wanted to write
> >>>> originally.
> >>>>
> >>>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> >>>> ---
> >>>>
> >>>> Changes in v2:
> >>>>  - keep the non-misleading part of the comment in handle_set_cr0
> >>>>
> >>>>  arch/x86/kvm/vmx.c |   46 +++++++++++++++++++++++++++++++---------------
> >>>>  1 files changed, 31 insertions(+), 15 deletions(-)
> >>>>
> >>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >>>> index 7cc566b..832b7b4 100644
> >>>> --- a/arch/x86/kvm/vmx.c
> >>>> +++ b/arch/x86/kvm/vmx.c
> >>>> @@ -4605,37 +4605,53 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
> >>>>  /* called to set cr0 as appropriate for a mov-to-cr0 exit. */
> >>>>  static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val)
> >>>>  {
> >>>> -	if (to_vmx(vcpu)->nested.vmxon &&
> >>>> -	    ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON))
> >>>> -		return 1;
> >>>> -
> >>>>  	if (is_guest_mode(vcpu)) {
> >>>> +		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> >>>> +		unsigned long orig_val = val;
> >>>> +
> >>>>  		/*
> >>>>  		 * We get here when L2 changed cr0 in a way that did not change
> >>>>  		 * any of L1's shadowed bits (see nested_vmx_exit_handled_cr),
> >>>> -		 * but did change L0 shadowed bits. This can currently happen
> >>>> -		 * with the TS bit: L0 may want to leave TS on (for lazy fpu
> >>>> -		 * loading) while pretending to allow the guest to change it.
> >>>> +		 * but did change L0 shadowed bits.
> >>>>  		 */
> >>>> -		if (kvm_set_cr0(vcpu, (val & vcpu->arch.cr0_guest_owned_bits) |
> >>>> -			 (vcpu->arch.cr0 & ~vcpu->arch.cr0_guest_owned_bits)))
> >>>> +		val = (val & ~vmcs12->cr0_guest_host_mask) |
> >>>> +			(vmcs_read64(GUEST_CR0) & vmcs12->cr0_guest_host_mask);
> >>> I think using GUEST_CR0 here is incorrect. It contains combination of bits
> >>> set by L2, L1 and L0 and here we need to get only L2/L1 mix which is in
> >>> vcpu->arch.cr0 (almost, but good enough for this case). Why vcpu->arch.cr0
> >>> contains right L2/L1 mix?
> >>
> >> L0/L1. E.g., kvm_set_cr0 unconditionally injects X86_CR0_ET and masks
> >> out reserved bits. But you are right, GUEST_CR0 isn't much better. And
> >> maybe that mangling in kvm_set_cr0 is a corner case we can ignore.
> >>
> > I think we can. ET is R/O and wired to 1, so it does not matter what
> > guest writes there it should be treated as 1. About reserved bits spec
> > says that software should write what it reads there and does not specify
> > what happens if software does not follow this.
> > 
> >>> Because it was set to vmcs12->guest_cr0 during
> >>> L2 #vmentry. While L2 is running three things may happen to CR0:
> >>>
> >>>  1. L2 writes to a bit that is not shadowed neither by L1 nor by L0. It
> >>>     will go strait to GUEST_CR0.
> >>>  2. L2 writes to a bit shadowed by L1. L1 #vmexit will be emulated. On the
> >>>     next #vmetry vcpu->arch.cr0 will be set to whatever value L1 calculated.
> >>>  3. L2 writes to a bit shadowed by L0, but not L1. This is the case we
> >>>     are handling here. And if we will do it right vcpu->arch.cr0 will be
> >>>     up-to-date at the end.
> >>>
> >>> The only case when, while this code running, vcpu->arch.cr0 has not
> >>> up-to-date value is if 1 happened, but since L2 guest overwriting cr0
> >>> here anyway it does not matter what it previously set in GUEST_CR0. The
> >>> correct bits are in a new cr0 value provided by val and accessible by
> >>> (val & ~vmcs12->cr0_guest_host_mask).
> >>
> >> I need to think about it again. Maybe vmcs12->guest_cr0 is best, but
> >> that's a shot from the hips now.
> >>
> > I do not think it is correct because case 3 does not update it. So if 3
> > happens twice without L1 #vmexit between then vmcs12->guest_cr0 will be
> > outdated.
> 
> Again, the only thing that matters here is L1's, not L0's view on the
> "real" CR0 value. So guest_cr0 is never outdated (/wrt
> cr0_guest_host_mask) as it will be updated by L1 in step 2. Even if
> arch.cr0 vs. guest_cr0 makes no difference in practice, the latter is
> more consistent, so I will go for it unless you can convince me it is wrong.
> 
Hmm, yes you are right that wrt cr0_guest_host_mask guest_cr0 should be
up-to-date. Please write a big comment about it. And what about moving
VMXON_CR0_ALWAYSON check into vmx_set_cr0()?

--
			Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jan Kiszka March 7, 2013, 10:37 a.m. UTC | #12

On 2013-03-07 09:57, Gleb Natapov wrote:
> On Thu, Mar 07, 2013 at 09:53:49AM +0100, Jan Kiszka wrote:
>> On 2013-03-07 09:43, Gleb Natapov wrote:
>>> On Thu, Mar 07, 2013 at 09:12:19AM +0100, Jan Kiszka wrote:
>>>> On 2013-03-07 08:51, Gleb Natapov wrote:
>>>>> On Mon, Mar 04, 2013 at 08:40:29PM +0100, Jan Kiszka wrote:
>>>>>> The logic for calculating the value with which we call kvm_set_cr0/4 was
>>>>>> broken (will definitely be visible with nested unrestricted guest mode
>>>>>> support). Also, we performed the check regarding CR0_ALWAYSON too early
>>>>>> when in guest mode.
>>>>>>
>>>>>> What really needs to be done on both CR0 and CR4 is to mask out L1-owned
>>>>>> bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and
>>>>>> arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus,
>>>>>> are not suited as input.
>>>>>>
>>>>>> For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and
>>>>>> refuse the update if it fails. To be fully consistent, we implement this
>>>>>> check now also for CR4.
>>>>>>
>>>>>> Finally, we have to set the shadow to the value L2 wanted to write
>>>>>> originally.
>>>>>>
>>>>>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
>>>>>> ---
>>>>>>
>>>>>> Changes in v2:
>>>>>>  - keep the non-misleading part of the comment in handle_set_cr0
>>>>>>
>>>>>>  arch/x86/kvm/vmx.c |   46 +++++++++++++++++++++++++++++++---------------
>>>>>>  1 files changed, 31 insertions(+), 15 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>>>>> index 7cc566b..832b7b4 100644
>>>>>> --- a/arch/x86/kvm/vmx.c
>>>>>> +++ b/arch/x86/kvm/vmx.c
>>>>>> @@ -4605,37 +4605,53 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
>>>>>>  /* called to set cr0 as appropriate for a mov-to-cr0 exit. */
>>>>>>  static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val)
>>>>>>  {
>>>>>> -	if (to_vmx(vcpu)->nested.vmxon &&
>>>>>> -	    ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON))
>>>>>> -		return 1;
>>>>>> -
>>>>>>  	if (is_guest_mode(vcpu)) {
>>>>>> +		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
>>>>>> +		unsigned long orig_val = val;
>>>>>> +
>>>>>>  		/*
>>>>>>  		 * We get here when L2 changed cr0 in a way that did not change
>>>>>>  		 * any of L1's shadowed bits (see nested_vmx_exit_handled_cr),
>>>>>> -		 * but did change L0 shadowed bits. This can currently happen
>>>>>> -		 * with the TS bit: L0 may want to leave TS on (for lazy fpu
>>>>>> -		 * loading) while pretending to allow the guest to change it.
>>>>>> +		 * but did change L0 shadowed bits.
>>>>>>  		 */
>>>>>> -		if (kvm_set_cr0(vcpu, (val & vcpu->arch.cr0_guest_owned_bits) |
>>>>>> -			 (vcpu->arch.cr0 & ~vcpu->arch.cr0_guest_owned_bits)))
>>>>>> +		val = (val & ~vmcs12->cr0_guest_host_mask) |
>>>>>> +			(vmcs_read64(GUEST_CR0) & vmcs12->cr0_guest_host_mask);
>>>>> I think using GUEST_CR0 here is incorrect. It contains combination of bits
>>>>> set by L2, L1 and L0 and here we need to get only L2/L1 mix which is in
>>>>> vcpu->arch.cr0 (almost, but good enough for this case). Why vcpu->arch.cr0
>>>>> contains right L2/L1 mix?
>>>>
>>>> L0/L1. E.g., kvm_set_cr0 unconditionally injects X86_CR0_ET and masks
>>>> out reserved bits. But you are right, GUEST_CR0 isn't much better. And
>>>> maybe that mangling in kvm_set_cr0 is a corner case we can ignore.
>>>>
>>> I think we can. ET is R/O and wired to 1, so it does not matter what
>>> guest writes there it should be treated as 1. About reserved bits spec
>>> says that software should write what it reads there and does not specify
>>> what happens if software does not follow this.
>>>
>>>>> Because it was set to vmcs12->guest_cr0 during
>>>>> L2 #vmentry. While L2 is running three things may happen to CR0:
>>>>>
>>>>>  1. L2 writes to a bit that is not shadowed neither by L1 nor by L0. It
>>>>>     will go strait to GUEST_CR0.
>>>>>  2. L2 writes to a bit shadowed by L1. L1 #vmexit will be emulated. On the
>>>>>     next #vmetry vcpu->arch.cr0 will be set to whatever value L1 calculated.
>>>>>  3. L2 writes to a bit shadowed by L0, but not L1. This is the case we
>>>>>     are handling here. And if we will do it right vcpu->arch.cr0 will be
>>>>>     up-to-date at the end.
>>>>>
>>>>> The only case when, while this code running, vcpu->arch.cr0 has not
>>>>> up-to-date value is if 1 happened, but since L2 guest overwriting cr0
>>>>> here anyway it does not matter what it previously set in GUEST_CR0. The
>>>>> correct bits are in a new cr0 value provided by val and accessible by
>>>>> (val & ~vmcs12->cr0_guest_host_mask).
>>>>
>>>> I need to think about it again. Maybe vmcs12->guest_cr0 is best, but
>>>> that's a shot from the hips now.
>>>>
>>> I do not think it is correct because case 3 does not update it. So if 3
>>> happens twice without L1 #vmexit between then vmcs12->guest_cr0 will be
>>> outdated.
>>
>> Again, the only thing that matters here is L1's, not L0's view on the
>> "real" CR0 value. So guest_cr0 is never outdated (/wrt
>> cr0_guest_host_mask) as it will be updated by L1 in step 2. Even if
>> arch.cr0 vs. guest_cr0 makes no difference in practice, the latter is
>> more consistent, so I will go for it unless you can convince me it is wrong.
>>
> Hmm, yes you are right that wrt cr0_guest_host_mask guest_cr0 should be
> up-to-date. Please write a big comment about it.

Will do.

> And what about moving VMXON_CR0_ALWAYSON check into vmx_set_cr0()?

That doesn't make much sense for CR0 (due to the differences between
vmxon and guest mode - and lacking return code of set_cr4). But I can
consolidate the CR4 checks.

Jan

Gleb Natapov March 7, 2013, 11:06 a.m. UTC | #13

On Thu, Mar 07, 2013 at 11:37:43AM +0100, Jan Kiszka wrote:
> On 2013-03-07 09:57, Gleb Natapov wrote:
> > On Thu, Mar 07, 2013 at 09:53:49AM +0100, Jan Kiszka wrote:
> >> On 2013-03-07 09:43, Gleb Natapov wrote:
> >>> On Thu, Mar 07, 2013 at 09:12:19AM +0100, Jan Kiszka wrote:
> >>>> On 2013-03-07 08:51, Gleb Natapov wrote:
> >>>>> On Mon, Mar 04, 2013 at 08:40:29PM +0100, Jan Kiszka wrote:
> >>>>>> The logic for calculating the value with which we call kvm_set_cr0/4 was
> >>>>>> broken (will definitely be visible with nested unrestricted guest mode
> >>>>>> support). Also, we performed the check regarding CR0_ALWAYSON too early
> >>>>>> when in guest mode.
> >>>>>>
> >>>>>> What really needs to be done on both CR0 and CR4 is to mask out L1-owned
> >>>>>> bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and
> >>>>>> arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus,
> >>>>>> are not suited as input.
> >>>>>>
> >>>>>> For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and
> >>>>>> refuse the update if it fails. To be fully consistent, we implement this
> >>>>>> check now also for CR4.
> >>>>>>
> >>>>>> Finally, we have to set the shadow to the value L2 wanted to write
> >>>>>> originally.
> >>>>>>
> >>>>>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> >>>>>> ---
> >>>>>>
> >>>>>> Changes in v2:
> >>>>>>  - keep the non-misleading part of the comment in handle_set_cr0
> >>>>>>
> >>>>>>  arch/x86/kvm/vmx.c |   46 +++++++++++++++++++++++++++++++---------------
> >>>>>>  1 files changed, 31 insertions(+), 15 deletions(-)
> >>>>>>
> >>>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >>>>>> index 7cc566b..832b7b4 100644
> >>>>>> --- a/arch/x86/kvm/vmx.c
> >>>>>> +++ b/arch/x86/kvm/vmx.c
> >>>>>> @@ -4605,37 +4605,53 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
> >>>>>>  /* called to set cr0 as appropriate for a mov-to-cr0 exit. */
> >>>>>>  static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val)
> >>>>>>  {
> >>>>>> -	if (to_vmx(vcpu)->nested.vmxon &&
> >>>>>> -	    ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON))
> >>>>>> -		return 1;
> >>>>>> -
> >>>>>>  	if (is_guest_mode(vcpu)) {
> >>>>>> +		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> >>>>>> +		unsigned long orig_val = val;
> >>>>>> +
> >>>>>>  		/*
> >>>>>>  		 * We get here when L2 changed cr0 in a way that did not change
> >>>>>>  		 * any of L1's shadowed bits (see nested_vmx_exit_handled_cr),
> >>>>>> -		 * but did change L0 shadowed bits. This can currently happen
> >>>>>> -		 * with the TS bit: L0 may want to leave TS on (for lazy fpu
> >>>>>> -		 * loading) while pretending to allow the guest to change it.
> >>>>>> +		 * but did change L0 shadowed bits.
> >>>>>>  		 */
> >>>>>> -		if (kvm_set_cr0(vcpu, (val & vcpu->arch.cr0_guest_owned_bits) |
> >>>>>> -			 (vcpu->arch.cr0 & ~vcpu->arch.cr0_guest_owned_bits)))
> >>>>>> +		val = (val & ~vmcs12->cr0_guest_host_mask) |
> >>>>>> +			(vmcs_read64(GUEST_CR0) & vmcs12->cr0_guest_host_mask);
> >>>>> I think using GUEST_CR0 here is incorrect. It contains combination of bits
> >>>>> set by L2, L1 and L0 and here we need to get only L2/L1 mix which is in
> >>>>> vcpu->arch.cr0 (almost, but good enough for this case). Why vcpu->arch.cr0
> >>>>> contains right L2/L1 mix?
> >>>>
> >>>> L0/L1. E.g., kvm_set_cr0 unconditionally injects X86_CR0_ET and masks
> >>>> out reserved bits. But you are right, GUEST_CR0 isn't much better. And
> >>>> maybe that mangling in kvm_set_cr0 is a corner case we can ignore.
> >>>>
> >>> I think we can. ET is R/O and wired to 1, so it does not matter what
> >>> guest writes there it should be treated as 1. About reserved bits spec
> >>> says that software should write what it reads there and does not specify
> >>> what happens if software does not follow this.
> >>>
> >>>>> Because it was set to vmcs12->guest_cr0 during
> >>>>> L2 #vmentry. While L2 is running three things may happen to CR0:
> >>>>>
> >>>>>  1. L2 writes to a bit that is not shadowed neither by L1 nor by L0. It
> >>>>>     will go strait to GUEST_CR0.
> >>>>>  2. L2 writes to a bit shadowed by L1. L1 #vmexit will be emulated. On the
> >>>>>     next #vmetry vcpu->arch.cr0 will be set to whatever value L1 calculated.
> >>>>>  3. L2 writes to a bit shadowed by L0, but not L1. This is the case we
> >>>>>     are handling here. And if we will do it right vcpu->arch.cr0 will be
> >>>>>     up-to-date at the end.
> >>>>>
> >>>>> The only case when, while this code running, vcpu->arch.cr0 has not
> >>>>> up-to-date value is if 1 happened, but since L2 guest overwriting cr0
> >>>>> here anyway it does not matter what it previously set in GUEST_CR0. The
> >>>>> correct bits are in a new cr0 value provided by val and accessible by
> >>>>> (val & ~vmcs12->cr0_guest_host_mask).
> >>>>
> >>>> I need to think about it again. Maybe vmcs12->guest_cr0 is best, but
> >>>> that's a shot from the hips now.
> >>>>
> >>> I do not think it is correct because case 3 does not update it. So if 3
> >>> happens twice without L1 #vmexit between then vmcs12->guest_cr0 will be
> >>> outdated.
> >>
> >> Again, the only thing that matters here is L1's, not L0's view on the
> >> "real" CR0 value. So guest_cr0 is never outdated (/wrt
> >> cr0_guest_host_mask) as it will be updated by L1 in step 2. Even if
> >> arch.cr0 vs. guest_cr0 makes no difference in practice, the latter is
> >> more consistent, so I will go for it unless you can convince me it is wrong.
> >>
> > Hmm, yes you are right that wrt cr0_guest_host_mask guest_cr0 should be
> > up-to-date. Please write a big comment about it.
> 
> Will do.
> 
> > And what about moving VMXON_CR0_ALWAYSON check into vmx_set_cr0()?
> 
> That doesn't make much sense for CR0 (due to the differences between
> vmxon and guest mode - and lacking return code of set_cr4). But I can
> consolidate the CR4 checks.
> 
Isn't vmxon check is implicit in a guest mode. i.e if is_guest_mode() is
trues then vmxon is on? Return code can be added.

--
			Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jan Kiszka March 7, 2013, 11:25 a.m. UTC | #14

On 2013-03-07 12:06, Gleb Natapov wrote:
> On Thu, Mar 07, 2013 at 11:37:43AM +0100, Jan Kiszka wrote:
>> On 2013-03-07 09:57, Gleb Natapov wrote:
>>> On Thu, Mar 07, 2013 at 09:53:49AM +0100, Jan Kiszka wrote:
>>>> On 2013-03-07 09:43, Gleb Natapov wrote:
>>>>> On Thu, Mar 07, 2013 at 09:12:19AM +0100, Jan Kiszka wrote:
>>>>>> On 2013-03-07 08:51, Gleb Natapov wrote:
>>>>>>> On Mon, Mar 04, 2013 at 08:40:29PM +0100, Jan Kiszka wrote:
>>>>>>>> The logic for calculating the value with which we call kvm_set_cr0/4 was
>>>>>>>> broken (will definitely be visible with nested unrestricted guest mode
>>>>>>>> support). Also, we performed the check regarding CR0_ALWAYSON too early
>>>>>>>> when in guest mode.
>>>>>>>>
>>>>>>>> What really needs to be done on both CR0 and CR4 is to mask out L1-owned
>>>>>>>> bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and
>>>>>>>> arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus,
>>>>>>>> are not suited as input.
>>>>>>>>
>>>>>>>> For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and
>>>>>>>> refuse the update if it fails. To be fully consistent, we implement this
>>>>>>>> check now also for CR4.
>>>>>>>>
>>>>>>>> Finally, we have to set the shadow to the value L2 wanted to write
>>>>>>>> originally.
>>>>>>>>
>>>>>>>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
>>>>>>>> ---
>>>>>>>>
>>>>>>>> Changes in v2:
>>>>>>>>  - keep the non-misleading part of the comment in handle_set_cr0
>>>>>>>>
>>>>>>>>  arch/x86/kvm/vmx.c |   46 +++++++++++++++++++++++++++++++---------------
>>>>>>>>  1 files changed, 31 insertions(+), 15 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>>>>>>> index 7cc566b..832b7b4 100644
>>>>>>>> --- a/arch/x86/kvm/vmx.c
>>>>>>>> +++ b/arch/x86/kvm/vmx.c
>>>>>>>> @@ -4605,37 +4605,53 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
>>>>>>>>  /* called to set cr0 as appropriate for a mov-to-cr0 exit. */
>>>>>>>>  static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val)
>>>>>>>>  {
>>>>>>>> -	if (to_vmx(vcpu)->nested.vmxon &&
>>>>>>>> -	    ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON))
>>>>>>>> -		return 1;
>>>>>>>> -
>>>>>>>>  	if (is_guest_mode(vcpu)) {
>>>>>>>> +		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
>>>>>>>> +		unsigned long orig_val = val;
>>>>>>>> +
>>>>>>>>  		/*
>>>>>>>>  		 * We get here when L2 changed cr0 in a way that did not change
>>>>>>>>  		 * any of L1's shadowed bits (see nested_vmx_exit_handled_cr),
>>>>>>>> -		 * but did change L0 shadowed bits. This can currently happen
>>>>>>>> -		 * with the TS bit: L0 may want to leave TS on (for lazy fpu
>>>>>>>> -		 * loading) while pretending to allow the guest to change it.
>>>>>>>> +		 * but did change L0 shadowed bits.
>>>>>>>>  		 */
>>>>>>>> -		if (kvm_set_cr0(vcpu, (val & vcpu->arch.cr0_guest_owned_bits) |
>>>>>>>> -			 (vcpu->arch.cr0 & ~vcpu->arch.cr0_guest_owned_bits)))
>>>>>>>> +		val = (val & ~vmcs12->cr0_guest_host_mask) |
>>>>>>>> +			(vmcs_read64(GUEST_CR0) & vmcs12->cr0_guest_host_mask);
>>>>>>> I think using GUEST_CR0 here is incorrect. It contains combination of bits
>>>>>>> set by L2, L1 and L0 and here we need to get only L2/L1 mix which is in
>>>>>>> vcpu->arch.cr0 (almost, but good enough for this case). Why vcpu->arch.cr0
>>>>>>> contains right L2/L1 mix?
>>>>>>
>>>>>> L0/L1. E.g., kvm_set_cr0 unconditionally injects X86_CR0_ET and masks
>>>>>> out reserved bits. But you are right, GUEST_CR0 isn't much better. And
>>>>>> maybe that mangling in kvm_set_cr0 is a corner case we can ignore.
>>>>>>
>>>>> I think we can. ET is R/O and wired to 1, so it does not matter what
>>>>> guest writes there it should be treated as 1. About reserved bits spec
>>>>> says that software should write what it reads there and does not specify
>>>>> what happens if software does not follow this.
>>>>>
>>>>>>> Because it was set to vmcs12->guest_cr0 during
>>>>>>> L2 #vmentry. While L2 is running three things may happen to CR0:
>>>>>>>
>>>>>>>  1. L2 writes to a bit that is not shadowed neither by L1 nor by L0. It
>>>>>>>     will go strait to GUEST_CR0.
>>>>>>>  2. L2 writes to a bit shadowed by L1. L1 #vmexit will be emulated. On the
>>>>>>>     next #vmetry vcpu->arch.cr0 will be set to whatever value L1 calculated.
>>>>>>>  3. L2 writes to a bit shadowed by L0, but not L1. This is the case we
>>>>>>>     are handling here. And if we will do it right vcpu->arch.cr0 will be
>>>>>>>     up-to-date at the end.
>>>>>>>
>>>>>>> The only case when, while this code running, vcpu->arch.cr0 has not
>>>>>>> up-to-date value is if 1 happened, but since L2 guest overwriting cr0
>>>>>>> here anyway it does not matter what it previously set in GUEST_CR0. The
>>>>>>> correct bits are in a new cr0 value provided by val and accessible by
>>>>>>> (val & ~vmcs12->cr0_guest_host_mask).
>>>>>>
>>>>>> I need to think about it again. Maybe vmcs12->guest_cr0 is best, but
>>>>>> that's a shot from the hips now.
>>>>>>
>>>>> I do not think it is correct because case 3 does not update it. So if 3
>>>>> happens twice without L1 #vmexit between then vmcs12->guest_cr0 will be
>>>>> outdated.
>>>>
>>>> Again, the only thing that matters here is L1's, not L0's view on the
>>>> "real" CR0 value. So guest_cr0 is never outdated (/wrt
>>>> cr0_guest_host_mask) as it will be updated by L1 in step 2. Even if
>>>> arch.cr0 vs. guest_cr0 makes no difference in practice, the latter is
>>>> more consistent, so I will go for it unless you can convince me it is wrong.
>>>>
>>> Hmm, yes you are right that wrt cr0_guest_host_mask guest_cr0 should be
>>> up-to-date. Please write a big comment about it.
>>
>> Will do.
>>
>>> And what about moving VMXON_CR0_ALWAYSON check into vmx_set_cr0()?
>>
>> That doesn't make much sense for CR0 (due to the differences between
>> vmxon and guest mode - and lacking return code of set_cr4). But I can
>> consolidate the CR4 checks.
>>
> Isn't vmxon check is implicit in a guest mode. i.e if is_guest_mode() is
> trues then vmxon is on? Return code can be added.

Ah, sorry, you are not seeing what I'm looking at: The test will change
for L2 context once unrestricted guest mode is added. At that point, it
makes more sense to split it into one version that checks against
VMXON_CR0_ALWAYSON while in vmxon, targeting L1, and another that does
more complex evaluation for L2, depending on nested_cpu_has2(vmcs12,
SECONDARY_EXEC_UNRESTRICTED_GUEST).

Jan

Gleb Natapov March 7, 2013, 11:50 a.m. UTC | #15

On Thu, Mar 07, 2013 at 12:25:26PM +0100, Jan Kiszka wrote:
> On 2013-03-07 12:06, Gleb Natapov wrote:
> > On Thu, Mar 07, 2013 at 11:37:43AM +0100, Jan Kiszka wrote:
> >> On 2013-03-07 09:57, Gleb Natapov wrote:
> >>> On Thu, Mar 07, 2013 at 09:53:49AM +0100, Jan Kiszka wrote:
> >>>> On 2013-03-07 09:43, Gleb Natapov wrote:
> >>>>> On Thu, Mar 07, 2013 at 09:12:19AM +0100, Jan Kiszka wrote:
> >>>>>> On 2013-03-07 08:51, Gleb Natapov wrote:
> >>>>>>> On Mon, Mar 04, 2013 at 08:40:29PM +0100, Jan Kiszka wrote:
> >>>>>>>> The logic for calculating the value with which we call kvm_set_cr0/4 was
> >>>>>>>> broken (will definitely be visible with nested unrestricted guest mode
> >>>>>>>> support). Also, we performed the check regarding CR0_ALWAYSON too early
> >>>>>>>> when in guest mode.
> >>>>>>>>
> >>>>>>>> What really needs to be done on both CR0 and CR4 is to mask out L1-owned
> >>>>>>>> bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and
> >>>>>>>> arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus,
> >>>>>>>> are not suited as input.
> >>>>>>>>
> >>>>>>>> For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and
> >>>>>>>> refuse the update if it fails. To be fully consistent, we implement this
> >>>>>>>> check now also for CR4.
> >>>>>>>>
> >>>>>>>> Finally, we have to set the shadow to the value L2 wanted to write
> >>>>>>>> originally.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> >>>>>>>> ---
> >>>>>>>>
> >>>>>>>> Changes in v2:
> >>>>>>>>  - keep the non-misleading part of the comment in handle_set_cr0
> >>>>>>>>
> >>>>>>>>  arch/x86/kvm/vmx.c |   46 +++++++++++++++++++++++++++++++---------------
> >>>>>>>>  1 files changed, 31 insertions(+), 15 deletions(-)
> >>>>>>>>
> >>>>>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >>>>>>>> index 7cc566b..832b7b4 100644
> >>>>>>>> --- a/arch/x86/kvm/vmx.c
> >>>>>>>> +++ b/arch/x86/kvm/vmx.c
> >>>>>>>> @@ -4605,37 +4605,53 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
> >>>>>>>>  /* called to set cr0 as appropriate for a mov-to-cr0 exit. */
> >>>>>>>>  static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val)
> >>>>>>>>  {
> >>>>>>>> -	if (to_vmx(vcpu)->nested.vmxon &&
> >>>>>>>> -	    ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON))
> >>>>>>>> -		return 1;
> >>>>>>>> -
> >>>>>>>>  	if (is_guest_mode(vcpu)) {
> >>>>>>>> +		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> >>>>>>>> +		unsigned long orig_val = val;
> >>>>>>>> +
> >>>>>>>>  		/*
> >>>>>>>>  		 * We get here when L2 changed cr0 in a way that did not change
> >>>>>>>>  		 * any of L1's shadowed bits (see nested_vmx_exit_handled_cr),
> >>>>>>>> -		 * but did change L0 shadowed bits. This can currently happen
> >>>>>>>> -		 * with the TS bit: L0 may want to leave TS on (for lazy fpu
> >>>>>>>> -		 * loading) while pretending to allow the guest to change it.
> >>>>>>>> +		 * but did change L0 shadowed bits.
> >>>>>>>>  		 */
> >>>>>>>> -		if (kvm_set_cr0(vcpu, (val & vcpu->arch.cr0_guest_owned_bits) |
> >>>>>>>> -			 (vcpu->arch.cr0 & ~vcpu->arch.cr0_guest_owned_bits)))
> >>>>>>>> +		val = (val & ~vmcs12->cr0_guest_host_mask) |
> >>>>>>>> +			(vmcs_read64(GUEST_CR0) & vmcs12->cr0_guest_host_mask);
> >>>>>>> I think using GUEST_CR0 here is incorrect. It contains combination of bits
> >>>>>>> set by L2, L1 and L0 and here we need to get only L2/L1 mix which is in
> >>>>>>> vcpu->arch.cr0 (almost, but good enough for this case). Why vcpu->arch.cr0
> >>>>>>> contains right L2/L1 mix?
> >>>>>>
> >>>>>> L0/L1. E.g., kvm_set_cr0 unconditionally injects X86_CR0_ET and masks
> >>>>>> out reserved bits. But you are right, GUEST_CR0 isn't much better. And
> >>>>>> maybe that mangling in kvm_set_cr0 is a corner case we can ignore.
> >>>>>>
> >>>>> I think we can. ET is R/O and wired to 1, so it does not matter what
> >>>>> guest writes there it should be treated as 1. About reserved bits spec
> >>>>> says that software should write what it reads there and does not specify
> >>>>> what happens if software does not follow this.
> >>>>>
> >>>>>>> Because it was set to vmcs12->guest_cr0 during
> >>>>>>> L2 #vmentry. While L2 is running three things may happen to CR0:
> >>>>>>>
> >>>>>>>  1. L2 writes to a bit that is not shadowed neither by L1 nor by L0. It
> >>>>>>>     will go strait to GUEST_CR0.
> >>>>>>>  2. L2 writes to a bit shadowed by L1. L1 #vmexit will be emulated. On the
> >>>>>>>     next #vmetry vcpu->arch.cr0 will be set to whatever value L1 calculated.
> >>>>>>>  3. L2 writes to a bit shadowed by L0, but not L1. This is the case we
> >>>>>>>     are handling here. And if we will do it right vcpu->arch.cr0 will be
> >>>>>>>     up-to-date at the end.
> >>>>>>>
> >>>>>>> The only case when, while this code running, vcpu->arch.cr0 has not
> >>>>>>> up-to-date value is if 1 happened, but since L2 guest overwriting cr0
> >>>>>>> here anyway it does not matter what it previously set in GUEST_CR0. The
> >>>>>>> correct bits are in a new cr0 value provided by val and accessible by
> >>>>>>> (val & ~vmcs12->cr0_guest_host_mask).
> >>>>>>
> >>>>>> I need to think about it again. Maybe vmcs12->guest_cr0 is best, but
> >>>>>> that's a shot from the hips now.
> >>>>>>
> >>>>> I do not think it is correct because case 3 does not update it. So if 3
> >>>>> happens twice without L1 #vmexit between then vmcs12->guest_cr0 will be
> >>>>> outdated.
> >>>>
> >>>> Again, the only thing that matters here is L1's, not L0's view on the
> >>>> "real" CR0 value. So guest_cr0 is never outdated (/wrt
> >>>> cr0_guest_host_mask) as it will be updated by L1 in step 2. Even if
> >>>> arch.cr0 vs. guest_cr0 makes no difference in practice, the latter is
> >>>> more consistent, so I will go for it unless you can convince me it is wrong.
> >>>>
> >>> Hmm, yes you are right that wrt cr0_guest_host_mask guest_cr0 should be
> >>> up-to-date. Please write a big comment about it.
> >>
> >> Will do.
> >>
> >>> And what about moving VMXON_CR0_ALWAYSON check into vmx_set_cr0()?
> >>
> >> That doesn't make much sense for CR0 (due to the differences between
> >> vmxon and guest mode - and lacking return code of set_cr4). But I can
> >> consolidate the CR4 checks.
> >>
> > Isn't vmxon check is implicit in a guest mode. i.e if is_guest_mode() is
> > trues then vmxon is on? Return code can be added.
> 
> Ah, sorry, you are not seeing what I'm looking at: The test will change
> for L2 context once unrestricted guest mode is added. At that point, it
> makes more sense to split it into one version that checks against
> VMXON_CR0_ALWAYSON while in vmxon, targeting L1, and another that does
> more complex evaluation for L2, depending on nested_cpu_has2(vmcs12,
> SECONDARY_EXEC_UNRESTRICTED_GUEST).
> 
Ah, OK. Hard to argue that those checks can be consolidated without
seeing them :) So you want to implement unrestricted L1 on restricted L0 and
let L0 emulate real mode of L2 directly?

--
			Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jan Kiszka March 7, 2013, 11:57 a.m. UTC | #16

On 2013-03-07 12:50, Gleb Natapov wrote:
> On Thu, Mar 07, 2013 at 12:25:26PM +0100, Jan Kiszka wrote:
>> On 2013-03-07 12:06, Gleb Natapov wrote:
>>> On Thu, Mar 07, 2013 at 11:37:43AM +0100, Jan Kiszka wrote:
>>>> On 2013-03-07 09:57, Gleb Natapov wrote:
>>>>> On Thu, Mar 07, 2013 at 09:53:49AM +0100, Jan Kiszka wrote:
>>>>>> On 2013-03-07 09:43, Gleb Natapov wrote:
>>>>>>> On Thu, Mar 07, 2013 at 09:12:19AM +0100, Jan Kiszka wrote:
>>>>>>>> On 2013-03-07 08:51, Gleb Natapov wrote:
>>>>>>>>> On Mon, Mar 04, 2013 at 08:40:29PM +0100, Jan Kiszka wrote:
>>>>>>>>>> The logic for calculating the value with which we call kvm_set_cr0/4 was
>>>>>>>>>> broken (will definitely be visible with nested unrestricted guest mode
>>>>>>>>>> support). Also, we performed the check regarding CR0_ALWAYSON too early
>>>>>>>>>> when in guest mode.
>>>>>>>>>>
>>>>>>>>>> What really needs to be done on both CR0 and CR4 is to mask out L1-owned
>>>>>>>>>> bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and
>>>>>>>>>> arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus,
>>>>>>>>>> are not suited as input.
>>>>>>>>>>
>>>>>>>>>> For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and
>>>>>>>>>> refuse the update if it fails. To be fully consistent, we implement this
>>>>>>>>>> check now also for CR4.
>>>>>>>>>>
>>>>>>>>>> Finally, we have to set the shadow to the value L2 wanted to write
>>>>>>>>>> originally.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
>>>>>>>>>> ---
>>>>>>>>>>
>>>>>>>>>> Changes in v2:
>>>>>>>>>>  - keep the non-misleading part of the comment in handle_set_cr0
>>>>>>>>>>
>>>>>>>>>>  arch/x86/kvm/vmx.c |   46 +++++++++++++++++++++++++++++++---------------
>>>>>>>>>>  1 files changed, 31 insertions(+), 15 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>>>>>>>>> index 7cc566b..832b7b4 100644
>>>>>>>>>> --- a/arch/x86/kvm/vmx.c
>>>>>>>>>> +++ b/arch/x86/kvm/vmx.c
>>>>>>>>>> @@ -4605,37 +4605,53 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
>>>>>>>>>>  /* called to set cr0 as appropriate for a mov-to-cr0 exit. */
>>>>>>>>>>  static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val)
>>>>>>>>>>  {
>>>>>>>>>> -	if (to_vmx(vcpu)->nested.vmxon &&
>>>>>>>>>> -	    ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON))
>>>>>>>>>> -		return 1;
>>>>>>>>>> -
>>>>>>>>>>  	if (is_guest_mode(vcpu)) {
>>>>>>>>>> +		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
>>>>>>>>>> +		unsigned long orig_val = val;
>>>>>>>>>> +
>>>>>>>>>>  		/*
>>>>>>>>>>  		 * We get here when L2 changed cr0 in a way that did not change
>>>>>>>>>>  		 * any of L1's shadowed bits (see nested_vmx_exit_handled_cr),
>>>>>>>>>> -		 * but did change L0 shadowed bits. This can currently happen
>>>>>>>>>> -		 * with the TS bit: L0 may want to leave TS on (for lazy fpu
>>>>>>>>>> -		 * loading) while pretending to allow the guest to change it.
>>>>>>>>>> +		 * but did change L0 shadowed bits.
>>>>>>>>>>  		 */
>>>>>>>>>> -		if (kvm_set_cr0(vcpu, (val & vcpu->arch.cr0_guest_owned_bits) |
>>>>>>>>>> -			 (vcpu->arch.cr0 & ~vcpu->arch.cr0_guest_owned_bits)))
>>>>>>>>>> +		val = (val & ~vmcs12->cr0_guest_host_mask) |
>>>>>>>>>> +			(vmcs_read64(GUEST_CR0) & vmcs12->cr0_guest_host_mask);
>>>>>>>>> I think using GUEST_CR0 here is incorrect. It contains combination of bits
>>>>>>>>> set by L2, L1 and L0 and here we need to get only L2/L1 mix which is in
>>>>>>>>> vcpu->arch.cr0 (almost, but good enough for this case). Why vcpu->arch.cr0
>>>>>>>>> contains right L2/L1 mix?
>>>>>>>>
>>>>>>>> L0/L1. E.g., kvm_set_cr0 unconditionally injects X86_CR0_ET and masks
>>>>>>>> out reserved bits. But you are right, GUEST_CR0 isn't much better. And
>>>>>>>> maybe that mangling in kvm_set_cr0 is a corner case we can ignore.
>>>>>>>>
>>>>>>> I think we can. ET is R/O and wired to 1, so it does not matter what
>>>>>>> guest writes there it should be treated as 1. About reserved bits spec
>>>>>>> says that software should write what it reads there and does not specify
>>>>>>> what happens if software does not follow this.
>>>>>>>
>>>>>>>>> Because it was set to vmcs12->guest_cr0 during
>>>>>>>>> L2 #vmentry. While L2 is running three things may happen to CR0:
>>>>>>>>>
>>>>>>>>>  1. L2 writes to a bit that is not shadowed neither by L1 nor by L0. It
>>>>>>>>>     will go strait to GUEST_CR0.
>>>>>>>>>  2. L2 writes to a bit shadowed by L1. L1 #vmexit will be emulated. On the
>>>>>>>>>     next #vmetry vcpu->arch.cr0 will be set to whatever value L1 calculated.
>>>>>>>>>  3. L2 writes to a bit shadowed by L0, but not L1. This is the case we
>>>>>>>>>     are handling here. And if we will do it right vcpu->arch.cr0 will be
>>>>>>>>>     up-to-date at the end.
>>>>>>>>>
>>>>>>>>> The only case when, while this code running, vcpu->arch.cr0 has not
>>>>>>>>> up-to-date value is if 1 happened, but since L2 guest overwriting cr0
>>>>>>>>> here anyway it does not matter what it previously set in GUEST_CR0. The
>>>>>>>>> correct bits are in a new cr0 value provided by val and accessible by
>>>>>>>>> (val & ~vmcs12->cr0_guest_host_mask).
>>>>>>>>
>>>>>>>> I need to think about it again. Maybe vmcs12->guest_cr0 is best, but
>>>>>>>> that's a shot from the hips now.
>>>>>>>>
>>>>>>> I do not think it is correct because case 3 does not update it. So if 3
>>>>>>> happens twice without L1 #vmexit between then vmcs12->guest_cr0 will be
>>>>>>> outdated.
>>>>>>
>>>>>> Again, the only thing that matters here is L1's, not L0's view on the
>>>>>> "real" CR0 value. So guest_cr0 is never outdated (/wrt
>>>>>> cr0_guest_host_mask) as it will be updated by L1 in step 2. Even if
>>>>>> arch.cr0 vs. guest_cr0 makes no difference in practice, the latter is
>>>>>> more consistent, so I will go for it unless you can convince me it is wrong.
>>>>>>
>>>>> Hmm, yes you are right that wrt cr0_guest_host_mask guest_cr0 should be
>>>>> up-to-date. Please write a big comment about it.
>>>>
>>>> Will do.
>>>>
>>>>> And what about moving VMXON_CR0_ALWAYSON check into vmx_set_cr0()?
>>>>
>>>> That doesn't make much sense for CR0 (due to the differences between
>>>> vmxon and guest mode - and lacking return code of set_cr4). But I can
>>>> consolidate the CR4 checks.
>>>>
>>> Isn't vmxon check is implicit in a guest mode. i.e if is_guest_mode() is
>>> trues then vmxon is on? Return code can be added.
>>
>> Ah, sorry, you are not seeing what I'm looking at: The test will change
>> for L2 context once unrestricted guest mode is added. At that point, it
>> makes more sense to split it into one version that checks against
>> VMXON_CR0_ALWAYSON while in vmxon, targeting L1, and another that does
>> more complex evaluation for L2, depending on nested_cpu_has2(vmcs12,
>> SECONDARY_EXEC_UNRESTRICTED_GUEST).
>>
> Ah, OK. Hard to argue that those checks can be consolidated without
> seeing them :) So you want to implement unrestricted L1 on restricted L0 and
> let L0 emulate real mode of L2 directly?

Err, no. :) Well, that emulation might even work but doesn't help unless
you also emulate EPT (not unrestricted guest mode without EPT support -
according to the spec).

I just want make L0's unrestricted guest mode support available for L1
(in fact, I already did this in my hacking branch).

Jan

Gleb Natapov March 7, 2013, 12:05 p.m. UTC | #17

On Thu, Mar 07, 2013 at 12:57:27PM +0100, Jan Kiszka wrote:
> >> Ah, sorry, you are not seeing what I'm looking at: The test will change
> >> for L2 context once unrestricted guest mode is added. At that point, it
> >> makes more sense to split it into one version that checks against
> >> VMXON_CR0_ALWAYSON while in vmxon, targeting L1, and another that does
> >> more complex evaluation for L2, depending on nested_cpu_has2(vmcs12,
> >> SECONDARY_EXEC_UNRESTRICTED_GUEST).
> >>
> > Ah, OK. Hard to argue that those checks can be consolidated without
> > seeing them :) So you want to implement unrestricted L1 on restricted L0 and
> > let L0 emulate real mode of L2 directly?
> 
> Err, no. :) Well, that emulation might even work but doesn't help unless
> you also emulate EPT (not unrestricted guest mode without EPT support -
> according to the spec).
Yes, of course EPT is needed, but patches are available :) I think it
should speedup L2 real mode substantially. No need to go to L1 for each
instruction emulation and L1 will have to exit to L0 many times during
emulation of some instructions.
> 
> I just want make L0's unrestricted guest mode support available for L1
> (in fact, I already did this in my hacking branch).
> 
Don't you need EPT for that too?

--
			Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jan Kiszka March 7, 2013, 12:18 p.m. UTC | #18

On 2013-03-07 13:05, Gleb Natapov wrote:
> On Thu, Mar 07, 2013 at 12:57:27PM +0100, Jan Kiszka wrote:
>>>> Ah, sorry, you are not seeing what I'm looking at: The test will change
>>>> for L2 context once unrestricted guest mode is added. At that point, it
>>>> makes more sense to split it into one version that checks against
>>>> VMXON_CR0_ALWAYSON while in vmxon, targeting L1, and another that does
>>>> more complex evaluation for L2, depending on nested_cpu_has2(vmcs12,
>>>> SECONDARY_EXEC_UNRESTRICTED_GUEST).
>>>>
>>> Ah, OK. Hard to argue that those checks can be consolidated without
>>> seeing them :) So you want to implement unrestricted L1 on restricted L0 and
>>> let L0 emulate real mode of L2 directly?
>>
>> Err, no. :) Well, that emulation might even work but doesn't help unless
>> you also emulate EPT (not unrestricted guest mode without EPT support -
>> according to the spec).
> Yes, of course EPT is needed, but patches are available :) I think it
> should speedup L2 real mode substantially. No need to go to L1 for each
> instruction emulation and L1 will have to exit to L0 many times during
> emulation of some instructions.

The point is: If you already have EPT on the host, you likely also have
native unrestricted guest mode. You just need to expose it and adjust
some minor things (like this bug here) along the way. Not sure how many
CPUs had EPT but no unrestricted guest mode. Do you have numbers?

>>
>> I just want make L0's unrestricted guest mode support available for L1
>> (in fact, I already did this in my hacking branch).
>>
> Don't you need EPT for that too?

Sure. As I wrote, I'm using Nadav's patches of last August against some
3.6 baseline.

Jan

Gleb Natapov March 7, 2013, 12:21 p.m. UTC | #19

On Thu, Mar 07, 2013 at 01:18:24PM +0100, Jan Kiszka wrote:
> On 2013-03-07 13:05, Gleb Natapov wrote:
> > On Thu, Mar 07, 2013 at 12:57:27PM +0100, Jan Kiszka wrote:
> >>>> Ah, sorry, you are not seeing what I'm looking at: The test will change
> >>>> for L2 context once unrestricted guest mode is added. At that point, it
> >>>> makes more sense to split it into one version that checks against
> >>>> VMXON_CR0_ALWAYSON while in vmxon, targeting L1, and another that does
> >>>> more complex evaluation for L2, depending on nested_cpu_has2(vmcs12,
> >>>> SECONDARY_EXEC_UNRESTRICTED_GUEST).
> >>>>
> >>> Ah, OK. Hard to argue that those checks can be consolidated without
> >>> seeing them :) So you want to implement unrestricted L1 on restricted L0 and
> >>> let L0 emulate real mode of L2 directly?
> >>
> >> Err, no. :) Well, that emulation might even work but doesn't help unless
> >> you also emulate EPT (not unrestricted guest mode without EPT support -
> >> according to the spec).
> > Yes, of course EPT is needed, but patches are available :) I think it
> > should speedup L2 real mode substantially. No need to go to L1 for each
> > instruction emulation and L1 will have to exit to L0 many times during
> > emulation of some instructions.
> 
> The point is: If you already have EPT on the host, you likely also have
> native unrestricted guest mode. You just need to expose it and adjust
> some minor things (like this bug here) along the way. Not sure how many
> CPUs had EPT but no unrestricted guest mode. Do you have numbers?
> 
AFAIK every single one before Westmere. Nehalem does no have it for
sure.

--
			Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jan Kiszka March 7, 2013, 12:48 p.m. UTC | #20

On 2013-03-07 13:21, Gleb Natapov wrote:
> On Thu, Mar 07, 2013 at 01:18:24PM +0100, Jan Kiszka wrote:
>> On 2013-03-07 13:05, Gleb Natapov wrote:
>>> On Thu, Mar 07, 2013 at 12:57:27PM +0100, Jan Kiszka wrote:
>>>>>> Ah, sorry, you are not seeing what I'm looking at: The test will change
>>>>>> for L2 context once unrestricted guest mode is added. At that point, it
>>>>>> makes more sense to split it into one version that checks against
>>>>>> VMXON_CR0_ALWAYSON while in vmxon, targeting L1, and another that does
>>>>>> more complex evaluation for L2, depending on nested_cpu_has2(vmcs12,
>>>>>> SECONDARY_EXEC_UNRESTRICTED_GUEST).
>>>>>>
>>>>> Ah, OK. Hard to argue that those checks can be consolidated without
>>>>> seeing them :) So you want to implement unrestricted L1 on restricted L0 and
>>>>> let L0 emulate real mode of L2 directly?
>>>>
>>>> Err, no. :) Well, that emulation might even work but doesn't help unless
>>>> you also emulate EPT (not unrestricted guest mode without EPT support -
>>>> according to the spec).
>>> Yes, of course EPT is needed, but patches are available :) I think it
>>> should speedup L2 real mode substantially. No need to go to L1 for each
>>> instruction emulation and L1 will have to exit to L0 many times during
>>> emulation of some instructions.
>>
>> The point is: If you already have EPT on the host, you likely also have
>> native unrestricted guest mode. You just need to expose it and adjust
>> some minor things (like this bug here) along the way. Not sure how many
>> CPUs had EPT but no unrestricted guest mode. Do you have numbers?
>>
> AFAIK every single one before Westmere. Nehalem does no have it for
> sure.

OK. Hmm, will it be more than just faking unrestricted mode toward L1
and emulating in L0 then (which should happen automagically)? Maybe I
will play with this under unrestricted_guest=0 when I have some time.

Jan

Gleb Natapov March 7, 2013, 1:04 p.m. UTC | #21

On Thu, Mar 07, 2013 at 01:48:55PM +0100, Jan Kiszka wrote:
> On 2013-03-07 13:21, Gleb Natapov wrote:
> > On Thu, Mar 07, 2013 at 01:18:24PM +0100, Jan Kiszka wrote:
> >> On 2013-03-07 13:05, Gleb Natapov wrote:
> >>> On Thu, Mar 07, 2013 at 12:57:27PM +0100, Jan Kiszka wrote:
> >>>>>> Ah, sorry, you are not seeing what I'm looking at: The test will change
> >>>>>> for L2 context once unrestricted guest mode is added. At that point, it
> >>>>>> makes more sense to split it into one version that checks against
> >>>>>> VMXON_CR0_ALWAYSON while in vmxon, targeting L1, and another that does
> >>>>>> more complex evaluation for L2, depending on nested_cpu_has2(vmcs12,
> >>>>>> SECONDARY_EXEC_UNRESTRICTED_GUEST).
> >>>>>>
> >>>>> Ah, OK. Hard to argue that those checks can be consolidated without
> >>>>> seeing them :) So you want to implement unrestricted L1 on restricted L0 and
> >>>>> let L0 emulate real mode of L2 directly?
> >>>>
> >>>> Err, no. :) Well, that emulation might even work but doesn't help unless
> >>>> you also emulate EPT (not unrestricted guest mode without EPT support -
> >>>> according to the spec).
> >>> Yes, of course EPT is needed, but patches are available :) I think it
> >>> should speedup L2 real mode substantially. No need to go to L1 for each
> >>> instruction emulation and L1 will have to exit to L0 many times during
> >>> emulation of some instructions.
> >>
> >> The point is: If you already have EPT on the host, you likely also have
> >> native unrestricted guest mode. You just need to expose it and adjust
> >> some minor things (like this bug here) along the way. Not sure how many
> >> CPUs had EPT but no unrestricted guest mode. Do you have numbers?
> >>
> > AFAIK every single one before Westmere. Nehalem does no have it for
> > sure.
> 
> OK. Hmm, will it be more than just faking unrestricted mode toward L1
> and emulating in L0 then (which should happen automagically)? Maybe I
> will play with this under unrestricted_guest=0 when I have some time.
> 
We will probably need to teach emulator about more instruction that may
cause vmexit.

--
			Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[v2] KVM: nVMX: Fix setting of CR0 and CR4 in guest mode

Commit Message

Comments

Patch