Message ID: 5134F8AD.8030607@web.de (mailing list archive)
State: New, archived
On Mon, Mar 04, 2013 at 08:40:29PM +0100, Jan Kiszka wrote: > The logic for calculating the value with which we call kvm_set_cr0/4 was > broken (will definitely be visible with nested unrestricted guest mode > support). Also, we performed the check regarding CR0_ALWAYSON too early > when in guest mode. > > What really needs to be done on both CR0 and CR4 is to mask out L1-owned > bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and > arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus, > are not suited as input. > > For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and > refuse the update if it fails. To be fully consistent, we implement this > check now also for CR4. > > Finally, we have to set the shadow to the value L2 wanted to write > originally. > > Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> > --- > > Changes in v2: > - keep the non-misleading part of the comment in handle_set_cr0 > > arch/x86/kvm/vmx.c | 46 +++++++++++++++++++++++++++++++--------------- > 1 files changed, 31 insertions(+), 15 deletions(-) > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > index 7cc566b..832b7b4 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -4605,37 +4605,53 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall) > /* called to set cr0 as appropriate for a mov-to-cr0 exit. */ > static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val) > { > - if (to_vmx(vcpu)->nested.vmxon && > - ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON)) > - return 1; > - > if (is_guest_mode(vcpu)) { > + struct vmcs12 *vmcs12 = get_vmcs12(vcpu); > + unsigned long orig_val = val; > + > /* > * We get here when L2 changed cr0 in a way that did not change > * any of L1's shadowed bits (see nested_vmx_exit_handled_cr), > - * but did change L0 shadowed bits. This can currently happen > - * with the TS bit: L0 may want to leave TS on (for lazy fpu > - * loading) while pretending to allow the guest to change it. > + * but did change L0 shadowed bits. > */ > - if (kvm_set_cr0(vcpu, (val & vcpu->arch.cr0_guest_owned_bits) | > - (vcpu->arch.cr0 & ~vcpu->arch.cr0_guest_owned_bits))) > + val = (val & ~vmcs12->cr0_guest_host_mask) | > + (vmcs_read64(GUEST_CR0) & vmcs12->cr0_guest_host_mask); > + if ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON) > return 1; The more I look at it the more it looks correct to me. I will continue looking, but I think we can move VMXON_CR0_ALWAYSON check to vmx_set_cr0(). Same for cr4 case. 
> - vmcs_writel(CR0_READ_SHADOW, val); > + > + if (kvm_set_cr0(vcpu, val)) > + return 1; > + vmcs_writel(CR0_READ_SHADOW, orig_val); > return 0; > - } else > + } else { > + if (to_vmx(vcpu)->nested.vmxon && > + ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON)) > + return 1; > return kvm_set_cr0(vcpu, val); > + } > } > > static int handle_set_cr4(struct kvm_vcpu *vcpu, unsigned long val) > { > if (is_guest_mode(vcpu)) { > - if (kvm_set_cr4(vcpu, (val & vcpu->arch.cr4_guest_owned_bits) | > - (vcpu->arch.cr4 & ~vcpu->arch.cr4_guest_owned_bits))) > + struct vmcs12 *vmcs12 = get_vmcs12(vcpu); > + unsigned long orig_val = val; > + > + val = (val & ~vmcs12->cr4_guest_host_mask) | > + (vmcs_readl(GUEST_CR4) & vmcs12->cr4_guest_host_mask); > + if ((val & VMXON_CR4_ALWAYSON) != VMXON_CR4_ALWAYSON) > + return 1; > + > + if (kvm_set_cr4(vcpu, val)) > return 1; > - vmcs_writel(CR4_READ_SHADOW, val); > + vmcs_writel(CR4_READ_SHADOW, orig_val); > return 0; > - } else > + } else { > + if (to_vmx(vcpu)->nested.vmxon && > + ((val & VMXON_CR4_ALWAYSON) != VMXON_CR4_ALWAYSON)) > + return 1; > return kvm_set_cr4(vcpu, val); > + } > } > > /* called to set cr0 as approriate for clts instruction exit. */ > -- > 1.7.3.4 > -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
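The merge the patch performs can be modelled outside the kernel. Below is a self-contained illustration in plain C; the CR0_* constants, CR0_ALWAYSON and the sample values are simplified stand-ins, not the kernel's X86_CR0_* and VMXON_CR0_ALWAYSON definitions. Bits covered by L1's cr0_guest_host_mask keep L1's view of CR0, the remaining bits take the value L2 just wrote, the result is checked against the always-on mask before being passed on, and the read shadow keeps L2's original value.

/* Illustrative sketch only: simplified stand-in constants and values. */
#include <stdio.h>

#define CR0_PE (1UL << 0)
#define CR0_TS (1UL << 3)
#define CR0_NE (1UL << 5)
#define CR0_PG (1UL << 31)
#define CR0_ALWAYSON (CR0_PE | CR0_NE | CR0_PG) /* stand-in for VMXON_CR0_ALWAYSON */

int main(void)
{
	unsigned long l1_view_of_cr0 = CR0_PE | CR0_NE | CR0_PG;      /* L1's view of L2's CR0 */
	unsigned long cr0_guest_host_mask = CR0_PE | CR0_PG;          /* bits owned by L1 */
	unsigned long val = CR0_PE | CR0_NE | CR0_PG | CR0_TS;        /* value L2 tried to write */
	unsigned long orig_val = val;

	/* L1-owned bits come from L1's view, everything else from L2's write */
	val = (val & ~cr0_guest_host_mask) | (l1_view_of_cr0 & cr0_guest_host_mask);

	if ((val & CR0_ALWAYSON) != CR0_ALWAYSON) {
		printf("refused\n");
		return 1;
	}
	printf("cr0 passed to kvm_set_cr0: %#lx, CR0_READ_SHADOW: %#lx\n", val, orig_val);
	return 0;
}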
On Tue, Mar 05, 2013 at 02:57:03PM +0200, Gleb Natapov wrote: > On Mon, Mar 04, 2013 at 08:40:29PM +0100, Jan Kiszka wrote: > > The logic for calculating the value with which we call kvm_set_cr0/4 was > > broken (will definitely be visible with nested unrestricted guest mode > > support). Also, we performed the check regarding CR0_ALWAYSON too early > > when in guest mode. > > > > What really needs to be done on both CR0 and CR4 is to mask out L1-owned > > bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and > > arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus, > > are not suited as input. > > > > For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and > > refuse the update if it fails. To be fully consistent, we implement this > > check now also for CR4. > > > > Finally, we have to set the shadow to the value L2 wanted to write > > originally. > > > > Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> > > --- > > > > Changes in v2: > > - keep the non-misleading part of the comment in handle_set_cr0 > > > > arch/x86/kvm/vmx.c | 46 +++++++++++++++++++++++++++++++--------------- > > 1 files changed, 31 insertions(+), 15 deletions(-) > > > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > > index 7cc566b..832b7b4 100644 > > --- a/arch/x86/kvm/vmx.c > > +++ b/arch/x86/kvm/vmx.c > > @@ -4605,37 +4605,53 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall) > > /* called to set cr0 as appropriate for a mov-to-cr0 exit. */ > > static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val) > > { > > - if (to_vmx(vcpu)->nested.vmxon && > > - ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON)) > > - return 1; > > - > > if (is_guest_mode(vcpu)) { > > + struct vmcs12 *vmcs12 = get_vmcs12(vcpu); > > + unsigned long orig_val = val; > > + > > /* > > * We get here when L2 changed cr0 in a way that did not change > > * any of L1's shadowed bits (see nested_vmx_exit_handled_cr), > > - * but did change L0 shadowed bits. This can currently happen > > - * with the TS bit: L0 may want to leave TS on (for lazy fpu > > - * loading) while pretending to allow the guest to change it. > > + * but did change L0 shadowed bits. > > */ > > - if (kvm_set_cr0(vcpu, (val & vcpu->arch.cr0_guest_owned_bits) | > > - (vcpu->arch.cr0 & ~vcpu->arch.cr0_guest_owned_bits))) > > + val = (val & ~vmcs12->cr0_guest_host_mask) | > > + (vmcs_read64(GUEST_CR0) & vmcs12->cr0_guest_host_mask); > > + if ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON) > > return 1; > The more I look at it the more it looks correct to me. I will continue > looking, but I think we can move VMXON_CR0_ALWAYSON check to > vmx_set_cr0(). Same for cr4 case. > BTW would be nice to have unit test for nested :) -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2013-03-05 14:02, Gleb Natapov wrote: > On Tue, Mar 05, 2013 at 02:57:03PM +0200, Gleb Natapov wrote: >> On Mon, Mar 04, 2013 at 08:40:29PM +0100, Jan Kiszka wrote: >>> The logic for calculating the value with which we call kvm_set_cr0/4 was >>> broken (will definitely be visible with nested unrestricted guest mode >>> support). Also, we performed the check regarding CR0_ALWAYSON too early >>> when in guest mode. >>> >>> What really needs to be done on both CR0 and CR4 is to mask out L1-owned >>> bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and >>> arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus, >>> are not suited as input. >>> >>> For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and >>> refuse the update if it fails. To be fully consistent, we implement this >>> check now also for CR4. >>> >>> Finally, we have to set the shadow to the value L2 wanted to write >>> originally. >>> >>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> >>> --- >>> >>> Changes in v2: >>> - keep the non-misleading part of the comment in handle_set_cr0 >>> >>> arch/x86/kvm/vmx.c | 46 +++++++++++++++++++++++++++++++--------------- >>> 1 files changed, 31 insertions(+), 15 deletions(-) >>> >>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c >>> index 7cc566b..832b7b4 100644 >>> --- a/arch/x86/kvm/vmx.c >>> +++ b/arch/x86/kvm/vmx.c >>> @@ -4605,37 +4605,53 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall) >>> /* called to set cr0 as appropriate for a mov-to-cr0 exit. */ >>> static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val) >>> { >>> - if (to_vmx(vcpu)->nested.vmxon && >>> - ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON)) >>> - return 1; >>> - >>> if (is_guest_mode(vcpu)) { >>> + struct vmcs12 *vmcs12 = get_vmcs12(vcpu); >>> + unsigned long orig_val = val; >>> + >>> /* >>> * We get here when L2 changed cr0 in a way that did not change >>> * any of L1's shadowed bits (see nested_vmx_exit_handled_cr), >>> - * but did change L0 shadowed bits. This can currently happen >>> - * with the TS bit: L0 may want to leave TS on (for lazy fpu >>> - * loading) while pretending to allow the guest to change it. >>> + * but did change L0 shadowed bits. >>> */ >>> - if (kvm_set_cr0(vcpu, (val & vcpu->arch.cr0_guest_owned_bits) | >>> - (vcpu->arch.cr0 & ~vcpu->arch.cr0_guest_owned_bits))) >>> + val = (val & ~vmcs12->cr0_guest_host_mask) | >>> + (vmcs_read64(GUEST_CR0) & vmcs12->cr0_guest_host_mask); >>> + if ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON) >>> return 1; >> The more I look at it the more it looks correct to me. I will continue >> looking, but I think we can move VMXON_CR0_ALWAYSON check to >> vmx_set_cr0(). Same for cr4 case. >> > BTW would be nice to have unit test for nested :) Yeah, and I want a pony. :) Possible, but "a bit" more complex than what it takes to write native tests. Jan
On Mon, Mar 04, 2013 at 08:40:29PM +0100, Jan Kiszka wrote: > The logic for calculating the value with which we call kvm_set_cr0/4 was > broken (will definitely be visible with nested unrestricted guest mode > support). Also, we performed the check regarding CR0_ALWAYSON too early > when in guest mode. > > What really needs to be done on both CR0 and CR4 is to mask out L1-owned > bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and > arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus, > are not suited as input. > > For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and > refuse the update if it fails. To be fully consistent, we implement this > check now also for CR4. > > Finally, we have to set the shadow to the value L2 wanted to write > originally. > > Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Applied, thanks. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Mar 04, 2013 at 08:40:29PM +0100, Jan Kiszka wrote: > The logic for calculating the value with which we call kvm_set_cr0/4 was > broken (will definitely be visible with nested unrestricted guest mode > support). Also, we performed the check regarding CR0_ALWAYSON too early > when in guest mode. > > What really needs to be done on both CR0 and CR4 is to mask out L1-owned > bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and > arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus, > are not suited as input. > > For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and > refuse the update if it fails. To be fully consistent, we implement this > check now also for CR4. > > Finally, we have to set the shadow to the value L2 wanted to write > originally. > > Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> > --- > > Changes in v2: > - keep the non-misleading part of the comment in handle_set_cr0 > > arch/x86/kvm/vmx.c | 46 +++++++++++++++++++++++++++++++--------------- > 1 files changed, 31 insertions(+), 15 deletions(-) > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > index 7cc566b..832b7b4 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -4605,37 +4605,53 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall) > /* called to set cr0 as appropriate for a mov-to-cr0 exit. */ > static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val) > { > - if (to_vmx(vcpu)->nested.vmxon && > - ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON)) > - return 1; > - > if (is_guest_mode(vcpu)) { > + struct vmcs12 *vmcs12 = get_vmcs12(vcpu); > + unsigned long orig_val = val; > + > /* > * We get here when L2 changed cr0 in a way that did not change > * any of L1's shadowed bits (see nested_vmx_exit_handled_cr), > - * but did change L0 shadowed bits. This can currently happen > - * with the TS bit: L0 may want to leave TS on (for lazy fpu > - * loading) while pretending to allow the guest to change it. > + * but did change L0 shadowed bits. > */ > - if (kvm_set_cr0(vcpu, (val & vcpu->arch.cr0_guest_owned_bits) | > - (vcpu->arch.cr0 & ~vcpu->arch.cr0_guest_owned_bits))) > + val = (val & ~vmcs12->cr0_guest_host_mask) | > + (vmcs_read64(GUEST_CR0) & vmcs12->cr0_guest_host_mask); I think using GUEST_CR0 here is incorrect. It contains a combination of bits set by L2, L1 and L0, and here we need to get only the L2/L1 mix, which is in vcpu->arch.cr0 (almost, but good enough for this case). Why does vcpu->arch.cr0 contain the right L2/L1 mix? Because it was set to vmcs12->guest_cr0 during L2 #vmentry. While L2 is running, three things may happen to CR0: 1. L2 writes to a bit that is shadowed by neither L1 nor L0. It will go straight to GUEST_CR0. 2. L2 writes to a bit shadowed by L1. An L1 #vmexit will be emulated. On the next #vmentry vcpu->arch.cr0 will be set to whatever value L1 calculated. 3. L2 writes to a bit shadowed by L0, but not L1. This is the case we are handling here. And if we do it right, vcpu->arch.cr0 will be up-to-date at the end. The only case in which, while this code is running, vcpu->arch.cr0 does not have an up-to-date value is if 1 happened, but since the L2 guest is overwriting cr0 here anyway, it does not matter what it previously set in GUEST_CR0. The correct bits are in the new cr0 value provided by val and accessible via (val & ~vmcs12->cr0_guest_host_mask). -- Gleb.
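A toy example, in plain userspace C with made-up values, of the distinction drawn above: for an L1-owned bit, the hardware GUEST_CR0 may carry an L0-injected value (for instance TS forced on for lazy FPU handling, the case the removed patch comment mentions) that differs from L1's view of the bit, so it matters which source the L1-owned bits are merged from.

/* Illustrative values only. TS is owned by L1 here (set in cr0_guest_host_mask),
 * L1's view of the bit is 0, but L0 forced TS on in the hardware GUEST_CR0,
 * so the two candidate sources disagree.
 */
#include <stdio.h>

#define CR0_TS (1UL << 3)

int main(void)
{
	unsigned long cr0_guest_host_mask = CR0_TS; /* L1 owns TS            */
	unsigned long l1_view_guest_cr0   = 0;      /* L1's view: TS clear   */
	unsigned long hw_guest_cr0        = CR0_TS; /* L0 forced TS on       */
	unsigned long val                 = 0;      /* value L2 tries to set */

	printf("merged from hardware GUEST_CR0: %#lx\n",
	       (val & ~cr0_guest_host_mask) | (hw_guest_cr0 & cr0_guest_host_mask));
	printf("merged from L1's view:          %#lx\n",
	       (val & ~cr0_guest_host_mask) | (l1_view_guest_cr0 & cr0_guest_host_mask));
	return 0;
}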
On 2013-03-07 08:51, Gleb Natapov wrote: > On Mon, Mar 04, 2013 at 08:40:29PM +0100, Jan Kiszka wrote: >> The logic for calculating the value with which we call kvm_set_cr0/4 was >> broken (will definitely be visible with nested unrestricted guest mode >> support). Also, we performed the check regarding CR0_ALWAYSON too early >> when in guest mode. >> >> What really needs to be done on both CR0 and CR4 is to mask out L1-owned >> bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and >> arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus, >> are not suited as input. >> >> For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and >> refuse the update if it fails. To be fully consistent, we implement this >> check now also for CR4. >> >> Finally, we have to set the shadow to the value L2 wanted to write >> originally. >> >> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> >> --- >> >> Changes in v2: >> - keep the non-misleading part of the comment in handle_set_cr0 >> >> arch/x86/kvm/vmx.c | 46 +++++++++++++++++++++++++++++++--------------- >> 1 files changed, 31 insertions(+), 15 deletions(-) >> >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c >> index 7cc566b..832b7b4 100644 >> --- a/arch/x86/kvm/vmx.c >> +++ b/arch/x86/kvm/vmx.c >> @@ -4605,37 +4605,53 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall) >> /* called to set cr0 as appropriate for a mov-to-cr0 exit. */ >> static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val) >> { >> - if (to_vmx(vcpu)->nested.vmxon && >> - ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON)) >> - return 1; >> - >> if (is_guest_mode(vcpu)) { >> + struct vmcs12 *vmcs12 = get_vmcs12(vcpu); >> + unsigned long orig_val = val; >> + >> /* >> * We get here when L2 changed cr0 in a way that did not change >> * any of L1's shadowed bits (see nested_vmx_exit_handled_cr), >> - * but did change L0 shadowed bits. This can currently happen >> - * with the TS bit: L0 may want to leave TS on (for lazy fpu >> - * loading) while pretending to allow the guest to change it. >> + * but did change L0 shadowed bits. >> */ >> - if (kvm_set_cr0(vcpu, (val & vcpu->arch.cr0_guest_owned_bits) | >> - (vcpu->arch.cr0 & ~vcpu->arch.cr0_guest_owned_bits))) >> + val = (val & ~vmcs12->cr0_guest_host_mask) | >> + (vmcs_read64(GUEST_CR0) & vmcs12->cr0_guest_host_mask); > I think using GUEST_CR0 here is incorrect. It contains combination of bits > set by L2, L1 and L0 and here we need to get only L2/L1 mix which is in > vcpu->arch.cr0 (almost, but good enough for this case). Why vcpu->arch.cr0 > contains right L2/L1 mix? L0/L1. E.g., kvm_set_cr0 unconditionally injects X86_CR0_ET and masks out reserved bits. But you are right, GUEST_CR0 isn't much better. And maybe that mangling in kvm_set_cr0 is a corner case we can ignore. > Because it was set to vmcs12->guest_cr0 during > L2 #vmentry. While L2 is running three things may happen to CR0: > > 1. L2 writes to a bit that is not shadowed neither by L1 nor by L0. It > will go strait to GUEST_CR0. > 2. L2 writes to a bit shadowed by L1. L1 #vmexit will be emulated. On the > next #vmetry vcpu->arch.cr0 will be set to whatever value L1 calculated. > 3. L2 writes to a bit shadowed by L0, but not L1. This is the case we > are handling here. And if we will do it right vcpu->arch.cr0 will be > up-to-date at the end. 
> > The only case when, while this code running, vcpu->arch.cr0 has not > up-to-date value is if 1 happened, but since L2 guest overwriting cr0 > here anyway it does not matter what it previously set in GUEST_CR0. The > correct bits are in a new cr0 value provided by val and accessible by > (val & ~vmcs12->cr0_guest_host_mask). I need to think about it again. Maybe vmcs12->guest_cr0 is best, but that's a shot from the hip now. Jan
On 2013-03-07 09:12, Jan Kiszka wrote: > On 2013-03-07 08:51, Gleb Natapov wrote: >> On Mon, Mar 04, 2013 at 08:40:29PM +0100, Jan Kiszka wrote: >>> The logic for calculating the value with which we call kvm_set_cr0/4 was >>> broken (will definitely be visible with nested unrestricted guest mode >>> support). Also, we performed the check regarding CR0_ALWAYSON too early >>> when in guest mode. >>> >>> What really needs to be done on both CR0 and CR4 is to mask out L1-owned >>> bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and >>> arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus, >>> are not suited as input. >>> >>> For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and >>> refuse the update if it fails. To be fully consistent, we implement this >>> check now also for CR4. >>> >>> Finally, we have to set the shadow to the value L2 wanted to write >>> originally. >>> >>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> >>> --- >>> >>> Changes in v2: >>> - keep the non-misleading part of the comment in handle_set_cr0 >>> >>> arch/x86/kvm/vmx.c | 46 +++++++++++++++++++++++++++++++--------------- >>> 1 files changed, 31 insertions(+), 15 deletions(-) >>> >>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c >>> index 7cc566b..832b7b4 100644 >>> --- a/arch/x86/kvm/vmx.c >>> +++ b/arch/x86/kvm/vmx.c >>> @@ -4605,37 +4605,53 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall) >>> /* called to set cr0 as appropriate for a mov-to-cr0 exit. */ >>> static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val) >>> { >>> - if (to_vmx(vcpu)->nested.vmxon && >>> - ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON)) >>> - return 1; >>> - >>> if (is_guest_mode(vcpu)) { >>> + struct vmcs12 *vmcs12 = get_vmcs12(vcpu); >>> + unsigned long orig_val = val; >>> + >>> /* >>> * We get here when L2 changed cr0 in a way that did not change >>> * any of L1's shadowed bits (see nested_vmx_exit_handled_cr), >>> - * but did change L0 shadowed bits. This can currently happen >>> - * with the TS bit: L0 may want to leave TS on (for lazy fpu >>> - * loading) while pretending to allow the guest to change it. >>> + * but did change L0 shadowed bits. >>> */ >>> - if (kvm_set_cr0(vcpu, (val & vcpu->arch.cr0_guest_owned_bits) | >>> - (vcpu->arch.cr0 & ~vcpu->arch.cr0_guest_owned_bits))) >>> + val = (val & ~vmcs12->cr0_guest_host_mask) | >>> + (vmcs_read64(GUEST_CR0) & vmcs12->cr0_guest_host_mask); >> I think using GUEST_CR0 here is incorrect. It contains combination of bits >> set by L2, L1 and L0 and here we need to get only L2/L1 mix which is in >> vcpu->arch.cr0 (almost, but good enough for this case). Why vcpu->arch.cr0 >> contains right L2/L1 mix? > > L0/L1. E.g., kvm_set_cr0 unconditionally injects X86_CR0_ET and masks > out reserved bits. But you are right, GUEST_CR0 isn't much better. And > maybe that mangling in kvm_set_cr0 is a corner case we can ignore. > >> Because it was set to vmcs12->guest_cr0 during >> L2 #vmentry. While L2 is running three things may happen to CR0: >> >> 1. L2 writes to a bit that is not shadowed neither by L1 nor by L0. It >> will go strait to GUEST_CR0. >> 2. L2 writes to a bit shadowed by L1. L1 #vmexit will be emulated. On the >> next #vmetry vcpu->arch.cr0 will be set to whatever value L1 calculated. >> 3. L2 writes to a bit shadowed by L0, but not L1. This is the case we >> are handling here. And if we will do it right vcpu->arch.cr0 will be >> up-to-date at the end. 
>> >> The only case when, while this code running, vcpu->arch.cr0 has not >> up-to-date value is if 1 happened, but since L2 guest overwriting cr0 >> here anyway it does not matter what it previously set in GUEST_CR0. The >> correct bits are in a new cr0 value provided by val and accessible by >> (val & ~vmcs12->cr0_guest_host_mask). > > I need to think about it again. Maybe vmcs12->guest_cr0 is best, but > that's a shot from the hips now. Yes, vmcs12->guest_cr0/4 is the correct input. We cannot tolerate picking up L0-mangled bits from arch.cr0/4, as we will perform checks on the result and report errors back to the guest. Do you agree? Marcelo, can you drop this version of the patch (I do not see it in public git yet) and wait for a v3 that I will send later? Otherwise I will fix it on top. Thanks, Jan
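A rough, self-contained model in plain C of the direction agreed on here, taking the L1-owned bits from vmcs12->guest_cr0. The struct, the function name and ALWAYSON_MODEL are made-up stand-ins for illustration; this is not the actual v3 patch.

#include <stdbool.h>
#include <stdio.h>

/* Stand-ins for the relevant vmcs12 fields and for VMXON_CR0_ALWAYSON. */
struct vmcs12_model {
	unsigned long guest_cr0;           /* L1's view of L2's CR0 */
	unsigned long cr0_guest_host_mask; /* bits owned by L1      */
};

#define ALWAYSON_MODEL ((1UL << 0) | (1UL << 5) | (1UL << 31)) /* PE | NE | PG */

/* Merge L2's write with L1's view and apply the always-on check. */
static bool l2_cr0_write_ok(const struct vmcs12_model *vmcs12,
			    unsigned long val, unsigned long *merged)
{
	*merged = (val & ~vmcs12->cr0_guest_host_mask) |
		  (vmcs12->guest_cr0 & vmcs12->cr0_guest_host_mask);
	return (*merged & ALWAYSON_MODEL) == ALWAYSON_MODEL;
}

int main(void)
{
	struct vmcs12_model v = { .guest_cr0 = ALWAYSON_MODEL,
				  .cr0_guest_host_mask = (1UL << 31) };
	unsigned long merged;

	printf("%s, merged = %#lx\n",
	       l2_cr0_write_ok(&v, ALWAYSON_MODEL | (1UL << 3), &merged) ? "ok" : "refused",
	       merged);
	return 0;
}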
On Thu, Mar 07, 2013 at 09:12:19AM +0100, Jan Kiszka wrote: > On 2013-03-07 08:51, Gleb Natapov wrote: > > On Mon, Mar 04, 2013 at 08:40:29PM +0100, Jan Kiszka wrote: > >> The logic for calculating the value with which we call kvm_set_cr0/4 was > >> broken (will definitely be visible with nested unrestricted guest mode > >> support). Also, we performed the check regarding CR0_ALWAYSON too early > >> when in guest mode. > >> > >> What really needs to be done on both CR0 and CR4 is to mask out L1-owned > >> bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and > >> arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus, > >> are not suited as input. > >> > >> For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and > >> refuse the update if it fails. To be fully consistent, we implement this > >> check now also for CR4. > >> > >> Finally, we have to set the shadow to the value L2 wanted to write > >> originally. > >> > >> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> > >> --- > >> > >> Changes in v2: > >> - keep the non-misleading part of the comment in handle_set_cr0 > >> > >> arch/x86/kvm/vmx.c | 46 +++++++++++++++++++++++++++++++--------------- > >> 1 files changed, 31 insertions(+), 15 deletions(-) > >> > >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > >> index 7cc566b..832b7b4 100644 > >> --- a/arch/x86/kvm/vmx.c > >> +++ b/arch/x86/kvm/vmx.c > >> @@ -4605,37 +4605,53 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall) > >> /* called to set cr0 as appropriate for a mov-to-cr0 exit. */ > >> static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val) > >> { > >> - if (to_vmx(vcpu)->nested.vmxon && > >> - ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON)) > >> - return 1; > >> - > >> if (is_guest_mode(vcpu)) { > >> + struct vmcs12 *vmcs12 = get_vmcs12(vcpu); > >> + unsigned long orig_val = val; > >> + > >> /* > >> * We get here when L2 changed cr0 in a way that did not change > >> * any of L1's shadowed bits (see nested_vmx_exit_handled_cr), > >> - * but did change L0 shadowed bits. This can currently happen > >> - * with the TS bit: L0 may want to leave TS on (for lazy fpu > >> - * loading) while pretending to allow the guest to change it. > >> + * but did change L0 shadowed bits. > >> */ > >> - if (kvm_set_cr0(vcpu, (val & vcpu->arch.cr0_guest_owned_bits) | > >> - (vcpu->arch.cr0 & ~vcpu->arch.cr0_guest_owned_bits))) > >> + val = (val & ~vmcs12->cr0_guest_host_mask) | > >> + (vmcs_read64(GUEST_CR0) & vmcs12->cr0_guest_host_mask); > > I think using GUEST_CR0 here is incorrect. It contains combination of bits > > set by L2, L1 and L0 and here we need to get only L2/L1 mix which is in > > vcpu->arch.cr0 (almost, but good enough for this case). Why vcpu->arch.cr0 > > contains right L2/L1 mix? > > L0/L1. E.g., kvm_set_cr0 unconditionally injects X86_CR0_ET and masks > out reserved bits. But you are right, GUEST_CR0 isn't much better. And > maybe that mangling in kvm_set_cr0 is a corner case we can ignore. > I think we can. ET is R/O and wired to 1, so it does not matter what guest writes there it should be treated as 1. About reserved bits spec says that software should write what it reads there and does not specify what happens if software does not follow this. > > Because it was set to vmcs12->guest_cr0 during > > L2 #vmentry. While L2 is running three things may happen to CR0: > > > > 1. L2 writes to a bit that is not shadowed neither by L1 nor by L0. It > > will go strait to GUEST_CR0. > > 2. 
L2 writes to a bit shadowed by L1. L1 #vmexit will be emulated. On the > > next #vmetry vcpu->arch.cr0 will be set to whatever value L1 calculated. > > 3. L2 writes to a bit shadowed by L0, but not L1. This is the case we > > are handling here. And if we will do it right vcpu->arch.cr0 will be > > up-to-date at the end. > > > > The only case when, while this code running, vcpu->arch.cr0 has not > > up-to-date value is if 1 happened, but since L2 guest overwriting cr0 > > here anyway it does not matter what it previously set in GUEST_CR0. The > > correct bits are in a new cr0 value provided by val and accessible by > > (val & ~vmcs12->cr0_guest_host_mask). > > I need to think about it again. Maybe vmcs12->guest_cr0 is best, but > that's a shot from the hips now. > I do not think it is correct, because case 3 does not update it. So if 3 happens twice without an L1 #vmexit in between, then vmcs12->guest_cr0 will be outdated. -- Gleb.
On Thu, Mar 07, 2013 at 09:27:10AM +0100, Jan Kiszka wrote: > On 2013-03-07 09:12, Jan Kiszka wrote: > > On 2013-03-07 08:51, Gleb Natapov wrote: > >> On Mon, Mar 04, 2013 at 08:40:29PM +0100, Jan Kiszka wrote: > >>> The logic for calculating the value with which we call kvm_set_cr0/4 was > >>> broken (will definitely be visible with nested unrestricted guest mode > >>> support). Also, we performed the check regarding CR0_ALWAYSON too early > >>> when in guest mode. > >>> > >>> What really needs to be done on both CR0 and CR4 is to mask out L1-owned > >>> bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and > >>> arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus, > >>> are not suited as input. > >>> > >>> For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and > >>> refuse the update if it fails. To be fully consistent, we implement this > >>> check now also for CR4. > >>> > >>> Finally, we have to set the shadow to the value L2 wanted to write > >>> originally. > >>> > >>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> > >>> --- > >>> > >>> Changes in v2: > >>> - keep the non-misleading part of the comment in handle_set_cr0 > >>> > >>> arch/x86/kvm/vmx.c | 46 +++++++++++++++++++++++++++++++--------------- > >>> 1 files changed, 31 insertions(+), 15 deletions(-) > >>> > >>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > >>> index 7cc566b..832b7b4 100644 > >>> --- a/arch/x86/kvm/vmx.c > >>> +++ b/arch/x86/kvm/vmx.c > >>> @@ -4605,37 +4605,53 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall) > >>> /* called to set cr0 as appropriate for a mov-to-cr0 exit. */ > >>> static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val) > >>> { > >>> - if (to_vmx(vcpu)->nested.vmxon && > >>> - ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON)) > >>> - return 1; > >>> - > >>> if (is_guest_mode(vcpu)) { > >>> + struct vmcs12 *vmcs12 = get_vmcs12(vcpu); > >>> + unsigned long orig_val = val; > >>> + > >>> /* > >>> * We get here when L2 changed cr0 in a way that did not change > >>> * any of L1's shadowed bits (see nested_vmx_exit_handled_cr), > >>> - * but did change L0 shadowed bits. This can currently happen > >>> - * with the TS bit: L0 may want to leave TS on (for lazy fpu > >>> - * loading) while pretending to allow the guest to change it. > >>> + * but did change L0 shadowed bits. > >>> */ > >>> - if (kvm_set_cr0(vcpu, (val & vcpu->arch.cr0_guest_owned_bits) | > >>> - (vcpu->arch.cr0 & ~vcpu->arch.cr0_guest_owned_bits))) > >>> + val = (val & ~vmcs12->cr0_guest_host_mask) | > >>> + (vmcs_read64(GUEST_CR0) & vmcs12->cr0_guest_host_mask); > >> I think using GUEST_CR0 here is incorrect. It contains combination of bits > >> set by L2, L1 and L0 and here we need to get only L2/L1 mix which is in > >> vcpu->arch.cr0 (almost, but good enough for this case). Why vcpu->arch.cr0 > >> contains right L2/L1 mix? > > > > L0/L1. E.g., kvm_set_cr0 unconditionally injects X86_CR0_ET and masks > > out reserved bits. But you are right, GUEST_CR0 isn't much better. And > > maybe that mangling in kvm_set_cr0 is a corner case we can ignore. > > > >> Because it was set to vmcs12->guest_cr0 during > >> L2 #vmentry. While L2 is running three things may happen to CR0: > >> > >> 1. L2 writes to a bit that is not shadowed neither by L1 nor by L0. It > >> will go strait to GUEST_CR0. > >> 2. L2 writes to a bit shadowed by L1. L1 #vmexit will be emulated. On the > >> next #vmetry vcpu->arch.cr0 will be set to whatever value L1 calculated. 
> >> 3. L2 writes to a bit shadowed by L0, but not L1. This is the case we > >> are handling here. And if we will do it right vcpu->arch.cr0 will be > >> up-to-date at the end. > >> > >> The only case when, while this code running, vcpu->arch.cr0 has not > >> up-to-date value is if 1 happened, but since L2 guest overwriting cr0 > >> here anyway it does not matter what it previously set in GUEST_CR0. The > >> correct bits are in a new cr0 value provided by val and accessible by > >> (val & ~vmcs12->cr0_guest_host_mask). > > > > I need to think about it again. Maybe vmcs12->guest_cr0 is best, but > > that's a shot from the hips now. > > Yes, vmcs12->guest_cr0/4 is the correct input. We cannot tolerate to > pick up L0-mangled bits from arch.cr0/4 as we will perform checks on the > result and report errors back to the guest. Do you agree? > See my answer to your previous email about when vmcs12->guest_cr0/4 may contain incorrect value. About mangling in arch.cr0/4 I think it is benign, but we can do L1 #vmexit if reserved bit is not zero or ET bit is not set in L2 provided cr0 and let L1 handle it. -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2013-03-07 09:43, Gleb Natapov wrote: > On Thu, Mar 07, 2013 at 09:12:19AM +0100, Jan Kiszka wrote: >> On 2013-03-07 08:51, Gleb Natapov wrote: >>> On Mon, Mar 04, 2013 at 08:40:29PM +0100, Jan Kiszka wrote: >>>> The logic for calculating the value with which we call kvm_set_cr0/4 was >>>> broken (will definitely be visible with nested unrestricted guest mode >>>> support). Also, we performed the check regarding CR0_ALWAYSON too early >>>> when in guest mode. >>>> >>>> What really needs to be done on both CR0 and CR4 is to mask out L1-owned >>>> bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and >>>> arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus, >>>> are not suited as input. >>>> >>>> For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and >>>> refuse the update if it fails. To be fully consistent, we implement this >>>> check now also for CR4. >>>> >>>> Finally, we have to set the shadow to the value L2 wanted to write >>>> originally. >>>> >>>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> >>>> --- >>>> >>>> Changes in v2: >>>> - keep the non-misleading part of the comment in handle_set_cr0 >>>> >>>> arch/x86/kvm/vmx.c | 46 +++++++++++++++++++++++++++++++--------------- >>>> 1 files changed, 31 insertions(+), 15 deletions(-) >>>> >>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c >>>> index 7cc566b..832b7b4 100644 >>>> --- a/arch/x86/kvm/vmx.c >>>> +++ b/arch/x86/kvm/vmx.c >>>> @@ -4605,37 +4605,53 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall) >>>> /* called to set cr0 as appropriate for a mov-to-cr0 exit. */ >>>> static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val) >>>> { >>>> - if (to_vmx(vcpu)->nested.vmxon && >>>> - ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON)) >>>> - return 1; >>>> - >>>> if (is_guest_mode(vcpu)) { >>>> + struct vmcs12 *vmcs12 = get_vmcs12(vcpu); >>>> + unsigned long orig_val = val; >>>> + >>>> /* >>>> * We get here when L2 changed cr0 in a way that did not change >>>> * any of L1's shadowed bits (see nested_vmx_exit_handled_cr), >>>> - * but did change L0 shadowed bits. This can currently happen >>>> - * with the TS bit: L0 may want to leave TS on (for lazy fpu >>>> - * loading) while pretending to allow the guest to change it. >>>> + * but did change L0 shadowed bits. >>>> */ >>>> - if (kvm_set_cr0(vcpu, (val & vcpu->arch.cr0_guest_owned_bits) | >>>> - (vcpu->arch.cr0 & ~vcpu->arch.cr0_guest_owned_bits))) >>>> + val = (val & ~vmcs12->cr0_guest_host_mask) | >>>> + (vmcs_read64(GUEST_CR0) & vmcs12->cr0_guest_host_mask); >>> I think using GUEST_CR0 here is incorrect. It contains combination of bits >>> set by L2, L1 and L0 and here we need to get only L2/L1 mix which is in >>> vcpu->arch.cr0 (almost, but good enough for this case). Why vcpu->arch.cr0 >>> contains right L2/L1 mix? >> >> L0/L1. E.g., kvm_set_cr0 unconditionally injects X86_CR0_ET and masks >> out reserved bits. But you are right, GUEST_CR0 isn't much better. And >> maybe that mangling in kvm_set_cr0 is a corner case we can ignore. >> > I think we can. ET is R/O and wired to 1, so it does not matter what > guest writes there it should be treated as 1. About reserved bits spec > says that software should write what it reads there and does not specify > what happens if software does not follow this. > >>> Because it was set to vmcs12->guest_cr0 during >>> L2 #vmentry. While L2 is running three things may happen to CR0: >>> >>> 1. 
L2 writes to a bit that is not shadowed neither by L1 nor by L0. It >>> will go strait to GUEST_CR0. >>> 2. L2 writes to a bit shadowed by L1. L1 #vmexit will be emulated. On the >>> next #vmetry vcpu->arch.cr0 will be set to whatever value L1 calculated. >>> 3. L2 writes to a bit shadowed by L0, but not L1. This is the case we >>> are handling here. And if we will do it right vcpu->arch.cr0 will be >>> up-to-date at the end. >>> >>> The only case when, while this code running, vcpu->arch.cr0 has not >>> up-to-date value is if 1 happened, but since L2 guest overwriting cr0 >>> here anyway it does not matter what it previously set in GUEST_CR0. The >>> correct bits are in a new cr0 value provided by val and accessible by >>> (val & ~vmcs12->cr0_guest_host_mask). >> >> I need to think about it again. Maybe vmcs12->guest_cr0 is best, but >> that's a shot from the hips now. >> > I do not think it is correct because case 3 does not update it. So if 3 > happens twice without L1 #vmexit between then vmcs12->guest_cr0 will be > outdated. Again, the only thing that matters here is L1's, not L0's view on the "real" CR0 value. So guest_cr0 is never outdated (/wrt cr0_guest_host_mask) as it will be updated by L1 in step 2. Even if arch.cr0 vs. guest_cr0 makes no difference in practice, the latter is more consistent, so I will go for it unless you can convince me it is wrong. Jan
On Thu, Mar 07, 2013 at 09:53:49AM +0100, Jan Kiszka wrote: > On 2013-03-07 09:43, Gleb Natapov wrote: > > On Thu, Mar 07, 2013 at 09:12:19AM +0100, Jan Kiszka wrote: > >> On 2013-03-07 08:51, Gleb Natapov wrote: > >>> On Mon, Mar 04, 2013 at 08:40:29PM +0100, Jan Kiszka wrote: > >>>> The logic for calculating the value with which we call kvm_set_cr0/4 was > >>>> broken (will definitely be visible with nested unrestricted guest mode > >>>> support). Also, we performed the check regarding CR0_ALWAYSON too early > >>>> when in guest mode. > >>>> > >>>> What really needs to be done on both CR0 and CR4 is to mask out L1-owned > >>>> bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and > >>>> arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus, > >>>> are not suited as input. > >>>> > >>>> For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and > >>>> refuse the update if it fails. To be fully consistent, we implement this > >>>> check now also for CR4. > >>>> > >>>> Finally, we have to set the shadow to the value L2 wanted to write > >>>> originally. > >>>> > >>>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> > >>>> --- > >>>> > >>>> Changes in v2: > >>>> - keep the non-misleading part of the comment in handle_set_cr0 > >>>> > >>>> arch/x86/kvm/vmx.c | 46 +++++++++++++++++++++++++++++++--------------- > >>>> 1 files changed, 31 insertions(+), 15 deletions(-) > >>>> > >>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > >>>> index 7cc566b..832b7b4 100644 > >>>> --- a/arch/x86/kvm/vmx.c > >>>> +++ b/arch/x86/kvm/vmx.c > >>>> @@ -4605,37 +4605,53 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall) > >>>> /* called to set cr0 as appropriate for a mov-to-cr0 exit. */ > >>>> static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val) > >>>> { > >>>> - if (to_vmx(vcpu)->nested.vmxon && > >>>> - ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON)) > >>>> - return 1; > >>>> - > >>>> if (is_guest_mode(vcpu)) { > >>>> + struct vmcs12 *vmcs12 = get_vmcs12(vcpu); > >>>> + unsigned long orig_val = val; > >>>> + > >>>> /* > >>>> * We get here when L2 changed cr0 in a way that did not change > >>>> * any of L1's shadowed bits (see nested_vmx_exit_handled_cr), > >>>> - * but did change L0 shadowed bits. This can currently happen > >>>> - * with the TS bit: L0 may want to leave TS on (for lazy fpu > >>>> - * loading) while pretending to allow the guest to change it. > >>>> + * but did change L0 shadowed bits. > >>>> */ > >>>> - if (kvm_set_cr0(vcpu, (val & vcpu->arch.cr0_guest_owned_bits) | > >>>> - (vcpu->arch.cr0 & ~vcpu->arch.cr0_guest_owned_bits))) > >>>> + val = (val & ~vmcs12->cr0_guest_host_mask) | > >>>> + (vmcs_read64(GUEST_CR0) & vmcs12->cr0_guest_host_mask); > >>> I think using GUEST_CR0 here is incorrect. It contains combination of bits > >>> set by L2, L1 and L0 and here we need to get only L2/L1 mix which is in > >>> vcpu->arch.cr0 (almost, but good enough for this case). Why vcpu->arch.cr0 > >>> contains right L2/L1 mix? > >> > >> L0/L1. E.g., kvm_set_cr0 unconditionally injects X86_CR0_ET and masks > >> out reserved bits. But you are right, GUEST_CR0 isn't much better. And > >> maybe that mangling in kvm_set_cr0 is a corner case we can ignore. > >> > > I think we can. ET is R/O and wired to 1, so it does not matter what > > guest writes there it should be treated as 1. 
About reserved bits spec > > says that software should write what it reads there and does not specify > > what happens if software does not follow this. > > > >>> Because it was set to vmcs12->guest_cr0 during > >>> L2 #vmentry. While L2 is running three things may happen to CR0: > >>> > >>> 1. L2 writes to a bit that is not shadowed neither by L1 nor by L0. It > >>> will go strait to GUEST_CR0. > >>> 2. L2 writes to a bit shadowed by L1. L1 #vmexit will be emulated. On the > >>> next #vmetry vcpu->arch.cr0 will be set to whatever value L1 calculated. > >>> 3. L2 writes to a bit shadowed by L0, but not L1. This is the case we > >>> are handling here. And if we will do it right vcpu->arch.cr0 will be > >>> up-to-date at the end. > >>> > >>> The only case when, while this code running, vcpu->arch.cr0 has not > >>> up-to-date value is if 1 happened, but since L2 guest overwriting cr0 > >>> here anyway it does not matter what it previously set in GUEST_CR0. The > >>> correct bits are in a new cr0 value provided by val and accessible by > >>> (val & ~vmcs12->cr0_guest_host_mask). > >> > >> I need to think about it again. Maybe vmcs12->guest_cr0 is best, but > >> that's a shot from the hips now. > >> > > I do not think it is correct because case 3 does not update it. So if 3 > > happens twice without L1 #vmexit between then vmcs12->guest_cr0 will be > > outdated. > > Again, the only thing that matters here is L1's, not L0's view on the > "real" CR0 value. So guest_cr0 is never outdated (/wrt > cr0_guest_host_mask) as it will be updated by L1 in step 2. Even if > arch.cr0 vs. guest_cr0 makes no difference in practice, the latter is > more consistent, so I will go for it unless you can convince me it is wrong. > Hmm, yes you are right that wrt cr0_guest_host_mask guest_cr0 should be up-to-date. Please write a big comment about it. And what about moving VMXON_CR0_ALWAYSON check into vmx_set_cr0()? -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2013-03-07 09:57, Gleb Natapov wrote: > On Thu, Mar 07, 2013 at 09:53:49AM +0100, Jan Kiszka wrote: >> On 2013-03-07 09:43, Gleb Natapov wrote: >>> On Thu, Mar 07, 2013 at 09:12:19AM +0100, Jan Kiszka wrote: >>>> On 2013-03-07 08:51, Gleb Natapov wrote: >>>>> On Mon, Mar 04, 2013 at 08:40:29PM +0100, Jan Kiszka wrote: >>>>>> The logic for calculating the value with which we call kvm_set_cr0/4 was >>>>>> broken (will definitely be visible with nested unrestricted guest mode >>>>>> support). Also, we performed the check regarding CR0_ALWAYSON too early >>>>>> when in guest mode. >>>>>> >>>>>> What really needs to be done on both CR0 and CR4 is to mask out L1-owned >>>>>> bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and >>>>>> arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus, >>>>>> are not suited as input. >>>>>> >>>>>> For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and >>>>>> refuse the update if it fails. To be fully consistent, we implement this >>>>>> check now also for CR4. >>>>>> >>>>>> Finally, we have to set the shadow to the value L2 wanted to write >>>>>> originally. >>>>>> >>>>>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> >>>>>> --- >>>>>> >>>>>> Changes in v2: >>>>>> - keep the non-misleading part of the comment in handle_set_cr0 >>>>>> >>>>>> arch/x86/kvm/vmx.c | 46 +++++++++++++++++++++++++++++++--------------- >>>>>> 1 files changed, 31 insertions(+), 15 deletions(-) >>>>>> >>>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c >>>>>> index 7cc566b..832b7b4 100644 >>>>>> --- a/arch/x86/kvm/vmx.c >>>>>> +++ b/arch/x86/kvm/vmx.c >>>>>> @@ -4605,37 +4605,53 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall) >>>>>> /* called to set cr0 as appropriate for a mov-to-cr0 exit. */ >>>>>> static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val) >>>>>> { >>>>>> - if (to_vmx(vcpu)->nested.vmxon && >>>>>> - ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON)) >>>>>> - return 1; >>>>>> - >>>>>> if (is_guest_mode(vcpu)) { >>>>>> + struct vmcs12 *vmcs12 = get_vmcs12(vcpu); >>>>>> + unsigned long orig_val = val; >>>>>> + >>>>>> /* >>>>>> * We get here when L2 changed cr0 in a way that did not change >>>>>> * any of L1's shadowed bits (see nested_vmx_exit_handled_cr), >>>>>> - * but did change L0 shadowed bits. This can currently happen >>>>>> - * with the TS bit: L0 may want to leave TS on (for lazy fpu >>>>>> - * loading) while pretending to allow the guest to change it. >>>>>> + * but did change L0 shadowed bits. >>>>>> */ >>>>>> - if (kvm_set_cr0(vcpu, (val & vcpu->arch.cr0_guest_owned_bits) | >>>>>> - (vcpu->arch.cr0 & ~vcpu->arch.cr0_guest_owned_bits))) >>>>>> + val = (val & ~vmcs12->cr0_guest_host_mask) | >>>>>> + (vmcs_read64(GUEST_CR0) & vmcs12->cr0_guest_host_mask); >>>>> I think using GUEST_CR0 here is incorrect. It contains combination of bits >>>>> set by L2, L1 and L0 and here we need to get only L2/L1 mix which is in >>>>> vcpu->arch.cr0 (almost, but good enough for this case). Why vcpu->arch.cr0 >>>>> contains right L2/L1 mix? >>>> >>>> L0/L1. E.g., kvm_set_cr0 unconditionally injects X86_CR0_ET and masks >>>> out reserved bits. But you are right, GUEST_CR0 isn't much better. And >>>> maybe that mangling in kvm_set_cr0 is a corner case we can ignore. >>>> >>> I think we can. ET is R/O and wired to 1, so it does not matter what >>> guest writes there it should be treated as 1. 
About reserved bits spec >>> says that software should write what it reads there and does not specify >>> what happens if software does not follow this. >>> >>>>> Because it was set to vmcs12->guest_cr0 during >>>>> L2 #vmentry. While L2 is running three things may happen to CR0: >>>>> >>>>> 1. L2 writes to a bit that is not shadowed neither by L1 nor by L0. It >>>>> will go strait to GUEST_CR0. >>>>> 2. L2 writes to a bit shadowed by L1. L1 #vmexit will be emulated. On the >>>>> next #vmetry vcpu->arch.cr0 will be set to whatever value L1 calculated. >>>>> 3. L2 writes to a bit shadowed by L0, but not L1. This is the case we >>>>> are handling here. And if we will do it right vcpu->arch.cr0 will be >>>>> up-to-date at the end. >>>>> >>>>> The only case when, while this code running, vcpu->arch.cr0 has not >>>>> up-to-date value is if 1 happened, but since L2 guest overwriting cr0 >>>>> here anyway it does not matter what it previously set in GUEST_CR0. The >>>>> correct bits are in a new cr0 value provided by val and accessible by >>>>> (val & ~vmcs12->cr0_guest_host_mask). >>>> >>>> I need to think about it again. Maybe vmcs12->guest_cr0 is best, but >>>> that's a shot from the hips now. >>>> >>> I do not think it is correct because case 3 does not update it. So if 3 >>> happens twice without L1 #vmexit between then vmcs12->guest_cr0 will be >>> outdated. >> >> Again, the only thing that matters here is L1's, not L0's view on the >> "real" CR0 value. So guest_cr0 is never outdated (/wrt >> cr0_guest_host_mask) as it will be updated by L1 in step 2. Even if >> arch.cr0 vs. guest_cr0 makes no difference in practice, the latter is >> more consistent, so I will go for it unless you can convince me it is wrong. >> > Hmm, yes you are right that wrt cr0_guest_host_mask guest_cr0 should be > up-to-date. Please write a big comment about it. Will do. > And what about moving VMXON_CR0_ALWAYSON check into vmx_set_cr0()? That doesn't make much sense for CR0 (due to the differences between vmxon and guest mode - and lacking return code of set_cr4). But I can consolidate the CR4 checks. Jan
On Thu, Mar 07, 2013 at 11:37:43AM +0100, Jan Kiszka wrote: > On 2013-03-07 09:57, Gleb Natapov wrote: > > On Thu, Mar 07, 2013 at 09:53:49AM +0100, Jan Kiszka wrote: > >> On 2013-03-07 09:43, Gleb Natapov wrote: > >>> On Thu, Mar 07, 2013 at 09:12:19AM +0100, Jan Kiszka wrote: > >>>> On 2013-03-07 08:51, Gleb Natapov wrote: > >>>>> On Mon, Mar 04, 2013 at 08:40:29PM +0100, Jan Kiszka wrote: > >>>>>> The logic for calculating the value with which we call kvm_set_cr0/4 was > >>>>>> broken (will definitely be visible with nested unrestricted guest mode > >>>>>> support). Also, we performed the check regarding CR0_ALWAYSON too early > >>>>>> when in guest mode. > >>>>>> > >>>>>> What really needs to be done on both CR0 and CR4 is to mask out L1-owned > >>>>>> bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and > >>>>>> arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus, > >>>>>> are not suited as input. > >>>>>> > >>>>>> For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and > >>>>>> refuse the update if it fails. To be fully consistent, we implement this > >>>>>> check now also for CR4. > >>>>>> > >>>>>> Finally, we have to set the shadow to the value L2 wanted to write > >>>>>> originally. > >>>>>> > >>>>>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> > >>>>>> --- > >>>>>> > >>>>>> Changes in v2: > >>>>>> - keep the non-misleading part of the comment in handle_set_cr0 > >>>>>> > >>>>>> arch/x86/kvm/vmx.c | 46 +++++++++++++++++++++++++++++++--------------- > >>>>>> 1 files changed, 31 insertions(+), 15 deletions(-) > >>>>>> > >>>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > >>>>>> index 7cc566b..832b7b4 100644 > >>>>>> --- a/arch/x86/kvm/vmx.c > >>>>>> +++ b/arch/x86/kvm/vmx.c > >>>>>> @@ -4605,37 +4605,53 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall) > >>>>>> /* called to set cr0 as appropriate for a mov-to-cr0 exit. */ > >>>>>> static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val) > >>>>>> { > >>>>>> - if (to_vmx(vcpu)->nested.vmxon && > >>>>>> - ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON)) > >>>>>> - return 1; > >>>>>> - > >>>>>> if (is_guest_mode(vcpu)) { > >>>>>> + struct vmcs12 *vmcs12 = get_vmcs12(vcpu); > >>>>>> + unsigned long orig_val = val; > >>>>>> + > >>>>>> /* > >>>>>> * We get here when L2 changed cr0 in a way that did not change > >>>>>> * any of L1's shadowed bits (see nested_vmx_exit_handled_cr), > >>>>>> - * but did change L0 shadowed bits. This can currently happen > >>>>>> - * with the TS bit: L0 may want to leave TS on (for lazy fpu > >>>>>> - * loading) while pretending to allow the guest to change it. > >>>>>> + * but did change L0 shadowed bits. > >>>>>> */ > >>>>>> - if (kvm_set_cr0(vcpu, (val & vcpu->arch.cr0_guest_owned_bits) | > >>>>>> - (vcpu->arch.cr0 & ~vcpu->arch.cr0_guest_owned_bits))) > >>>>>> + val = (val & ~vmcs12->cr0_guest_host_mask) | > >>>>>> + (vmcs_read64(GUEST_CR0) & vmcs12->cr0_guest_host_mask); > >>>>> I think using GUEST_CR0 here is incorrect. It contains combination of bits > >>>>> set by L2, L1 and L0 and here we need to get only L2/L1 mix which is in > >>>>> vcpu->arch.cr0 (almost, but good enough for this case). Why vcpu->arch.cr0 > >>>>> contains right L2/L1 mix? > >>>> > >>>> L0/L1. E.g., kvm_set_cr0 unconditionally injects X86_CR0_ET and masks > >>>> out reserved bits. But you are right, GUEST_CR0 isn't much better. And > >>>> maybe that mangling in kvm_set_cr0 is a corner case we can ignore. > >>>> > >>> I think we can. 
ET is R/O and wired to 1, so it does not matter what > >>> guest writes there it should be treated as 1. About reserved bits spec > >>> says that software should write what it reads there and does not specify > >>> what happens if software does not follow this. > >>> > >>>>> Because it was set to vmcs12->guest_cr0 during > >>>>> L2 #vmentry. While L2 is running three things may happen to CR0: > >>>>> > >>>>> 1. L2 writes to a bit that is not shadowed neither by L1 nor by L0. It > >>>>> will go strait to GUEST_CR0. > >>>>> 2. L2 writes to a bit shadowed by L1. L1 #vmexit will be emulated. On the > >>>>> next #vmetry vcpu->arch.cr0 will be set to whatever value L1 calculated. > >>>>> 3. L2 writes to a bit shadowed by L0, but not L1. This is the case we > >>>>> are handling here. And if we will do it right vcpu->arch.cr0 will be > >>>>> up-to-date at the end. > >>>>> > >>>>> The only case when, while this code running, vcpu->arch.cr0 has not > >>>>> up-to-date value is if 1 happened, but since L2 guest overwriting cr0 > >>>>> here anyway it does not matter what it previously set in GUEST_CR0. The > >>>>> correct bits are in a new cr0 value provided by val and accessible by > >>>>> (val & ~vmcs12->cr0_guest_host_mask). > >>>> > >>>> I need to think about it again. Maybe vmcs12->guest_cr0 is best, but > >>>> that's a shot from the hips now. > >>>> > >>> I do not think it is correct because case 3 does not update it. So if 3 > >>> happens twice without L1 #vmexit between then vmcs12->guest_cr0 will be > >>> outdated. > >> > >> Again, the only thing that matters here is L1's, not L0's view on the > >> "real" CR0 value. So guest_cr0 is never outdated (/wrt > >> cr0_guest_host_mask) as it will be updated by L1 in step 2. Even if > >> arch.cr0 vs. guest_cr0 makes no difference in practice, the latter is > >> more consistent, so I will go for it unless you can convince me it is wrong. > >> > > Hmm, yes you are right that wrt cr0_guest_host_mask guest_cr0 should be > > up-to-date. Please write a big comment about it. > > Will do. > > > And what about moving VMXON_CR0_ALWAYSON check into vmx_set_cr0()? > > That doesn't make much sense for CR0 (due to the differences between > vmxon and guest mode - and lacking return code of set_cr4). But I can > consolidate the CR4 checks. > Isn't the vmxon check implicit in guest mode, i.e. if is_guest_mode() is true then vmxon is on? A return code can be added. -- Gleb.
On 2013-03-07 12:06, Gleb Natapov wrote: > On Thu, Mar 07, 2013 at 11:37:43AM +0100, Jan Kiszka wrote: >> On 2013-03-07 09:57, Gleb Natapov wrote: >>> On Thu, Mar 07, 2013 at 09:53:49AM +0100, Jan Kiszka wrote: >>>> On 2013-03-07 09:43, Gleb Natapov wrote: >>>>> On Thu, Mar 07, 2013 at 09:12:19AM +0100, Jan Kiszka wrote: >>>>>> On 2013-03-07 08:51, Gleb Natapov wrote: >>>>>>> On Mon, Mar 04, 2013 at 08:40:29PM +0100, Jan Kiszka wrote: >>>>>>>> The logic for calculating the value with which we call kvm_set_cr0/4 was >>>>>>>> broken (will definitely be visible with nested unrestricted guest mode >>>>>>>> support). Also, we performed the check regarding CR0_ALWAYSON too early >>>>>>>> when in guest mode. >>>>>>>> >>>>>>>> What really needs to be done on both CR0 and CR4 is to mask out L1-owned >>>>>>>> bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and >>>>>>>> arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus, >>>>>>>> are not suited as input. >>>>>>>> >>>>>>>> For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and >>>>>>>> refuse the update if it fails. To be fully consistent, we implement this >>>>>>>> check now also for CR4. >>>>>>>> >>>>>>>> Finally, we have to set the shadow to the value L2 wanted to write >>>>>>>> originally. >>>>>>>> >>>>>>>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> >>>>>>>> --- >>>>>>>> >>>>>>>> Changes in v2: >>>>>>>> - keep the non-misleading part of the comment in handle_set_cr0 >>>>>>>> >>>>>>>> arch/x86/kvm/vmx.c | 46 +++++++++++++++++++++++++++++++--------------- >>>>>>>> 1 files changed, 31 insertions(+), 15 deletions(-) >>>>>>>> >>>>>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c >>>>>>>> index 7cc566b..832b7b4 100644 >>>>>>>> --- a/arch/x86/kvm/vmx.c >>>>>>>> +++ b/arch/x86/kvm/vmx.c >>>>>>>> @@ -4605,37 +4605,53 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall) >>>>>>>> /* called to set cr0 as appropriate for a mov-to-cr0 exit. */ >>>>>>>> static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val) >>>>>>>> { >>>>>>>> - if (to_vmx(vcpu)->nested.vmxon && >>>>>>>> - ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON)) >>>>>>>> - return 1; >>>>>>>> - >>>>>>>> if (is_guest_mode(vcpu)) { >>>>>>>> + struct vmcs12 *vmcs12 = get_vmcs12(vcpu); >>>>>>>> + unsigned long orig_val = val; >>>>>>>> + >>>>>>>> /* >>>>>>>> * We get here when L2 changed cr0 in a way that did not change >>>>>>>> * any of L1's shadowed bits (see nested_vmx_exit_handled_cr), >>>>>>>> - * but did change L0 shadowed bits. This can currently happen >>>>>>>> - * with the TS bit: L0 may want to leave TS on (for lazy fpu >>>>>>>> - * loading) while pretending to allow the guest to change it. >>>>>>>> + * but did change L0 shadowed bits. >>>>>>>> */ >>>>>>>> - if (kvm_set_cr0(vcpu, (val & vcpu->arch.cr0_guest_owned_bits) | >>>>>>>> - (vcpu->arch.cr0 & ~vcpu->arch.cr0_guest_owned_bits))) >>>>>>>> + val = (val & ~vmcs12->cr0_guest_host_mask) | >>>>>>>> + (vmcs_read64(GUEST_CR0) & vmcs12->cr0_guest_host_mask); >>>>>>> I think using GUEST_CR0 here is incorrect. It contains combination of bits >>>>>>> set by L2, L1 and L0 and here we need to get only L2/L1 mix which is in >>>>>>> vcpu->arch.cr0 (almost, but good enough for this case). Why vcpu->arch.cr0 >>>>>>> contains right L2/L1 mix? >>>>>> >>>>>> L0/L1. E.g., kvm_set_cr0 unconditionally injects X86_CR0_ET and masks >>>>>> out reserved bits. But you are right, GUEST_CR0 isn't much better. 
And >>>>>> maybe that mangling in kvm_set_cr0 is a corner case we can ignore. >>>>>> >>>>> I think we can. ET is R/O and wired to 1, so it does not matter what >>>>> guest writes there it should be treated as 1. About reserved bits spec >>>>> says that software should write what it reads there and does not specify >>>>> what happens if software does not follow this. >>>>> >>>>>>> Because it was set to vmcs12->guest_cr0 during >>>>>>> L2 #vmentry. While L2 is running three things may happen to CR0: >>>>>>> >>>>>>> 1. L2 writes to a bit that is not shadowed neither by L1 nor by L0. It >>>>>>> will go strait to GUEST_CR0. >>>>>>> 2. L2 writes to a bit shadowed by L1. L1 #vmexit will be emulated. On the >>>>>>> next #vmetry vcpu->arch.cr0 will be set to whatever value L1 calculated. >>>>>>> 3. L2 writes to a bit shadowed by L0, but not L1. This is the case we >>>>>>> are handling here. And if we will do it right vcpu->arch.cr0 will be >>>>>>> up-to-date at the end. >>>>>>> >>>>>>> The only case when, while this code running, vcpu->arch.cr0 has not >>>>>>> up-to-date value is if 1 happened, but since L2 guest overwriting cr0 >>>>>>> here anyway it does not matter what it previously set in GUEST_CR0. The >>>>>>> correct bits are in a new cr0 value provided by val and accessible by >>>>>>> (val & ~vmcs12->cr0_guest_host_mask). >>>>>> >>>>>> I need to think about it again. Maybe vmcs12->guest_cr0 is best, but >>>>>> that's a shot from the hips now. >>>>>> >>>>> I do not think it is correct because case 3 does not update it. So if 3 >>>>> happens twice without L1 #vmexit between then vmcs12->guest_cr0 will be >>>>> outdated. >>>> >>>> Again, the only thing that matters here is L1's, not L0's view on the >>>> "real" CR0 value. So guest_cr0 is never outdated (/wrt >>>> cr0_guest_host_mask) as it will be updated by L1 in step 2. Even if >>>> arch.cr0 vs. guest_cr0 makes no difference in practice, the latter is >>>> more consistent, so I will go for it unless you can convince me it is wrong. >>>> >>> Hmm, yes you are right that wrt cr0_guest_host_mask guest_cr0 should be >>> up-to-date. Please write a big comment about it. >> >> Will do. >> >>> And what about moving VMXON_CR0_ALWAYSON check into vmx_set_cr0()? >> >> That doesn't make much sense for CR0 (due to the differences between >> vmxon and guest mode - and lacking return code of set_cr4). But I can >> consolidate the CR4 checks. >> > Isn't vmxon check is implicit in a guest mode. i.e if is_guest_mode() is > trues then vmxon is on? Return code can be added. Ah, sorry, you are not seeing what I'm looking at: The test will change for L2 context once unrestricted guest mode is added. At that point, it makes more sense to split it into one version that checks against VMXON_CR0_ALWAYSON while in vmxon, targeting L1, and another that does more complex evaluation for L2, depending on nested_cpu_has2(vmcs12, SECONDARY_EXEC_UNRESTRICTED_GUEST). Jan
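A rough illustration of the split Jan describes at the end of this mail; the helper names are invented here, and the L2 variant only sketches the relaxation that unrestricted guest mode would bring (PE and PG may then be cleared):

	/* L1 writes CR0 while in VMX operation, outside guest mode. */
	static bool nested_cr0_valid_for_l1(unsigned long val)
	{
		return (val & VMXON_CR0_ALWAYSON) == VMXON_CR0_ALWAYSON;
	}

	/* L2 writes CR0; honor L1's use of unrestricted guest mode. */
	static bool nested_cr0_valid_for_l2(struct vmcs12 *vmcs12, unsigned long val)
	{
		unsigned long always_on = VMXON_CR0_ALWAYSON;

		if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_UNRESTRICTED_GUEST))
			always_on &= ~(X86_CR0_PE | X86_CR0_PG);

		return (val & always_on) == always_on;
	}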
On Thu, Mar 07, 2013 at 12:25:26PM +0100, Jan Kiszka wrote: > On 2013-03-07 12:06, Gleb Natapov wrote: > > On Thu, Mar 07, 2013 at 11:37:43AM +0100, Jan Kiszka wrote: > >> On 2013-03-07 09:57, Gleb Natapov wrote: > >>> On Thu, Mar 07, 2013 at 09:53:49AM +0100, Jan Kiszka wrote: > >>>> On 2013-03-07 09:43, Gleb Natapov wrote: > >>>>> On Thu, Mar 07, 2013 at 09:12:19AM +0100, Jan Kiszka wrote: > >>>>>> On 2013-03-07 08:51, Gleb Natapov wrote: > >>>>>>> On Mon, Mar 04, 2013 at 08:40:29PM +0100, Jan Kiszka wrote: > >>>>>>>> The logic for calculating the value with which we call kvm_set_cr0/4 was > >>>>>>>> broken (will definitely be visible with nested unrestricted guest mode > >>>>>>>> support). Also, we performed the check regarding CR0_ALWAYSON too early > >>>>>>>> when in guest mode. > >>>>>>>> > >>>>>>>> What really needs to be done on both CR0 and CR4 is to mask out L1-owned > >>>>>>>> bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and > >>>>>>>> arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus, > >>>>>>>> are not suited as input. > >>>>>>>> > >>>>>>>> For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and > >>>>>>>> refuse the update if it fails. To be fully consistent, we implement this > >>>>>>>> check now also for CR4. > >>>>>>>> > >>>>>>>> Finally, we have to set the shadow to the value L2 wanted to write > >>>>>>>> originally. > >>>>>>>> > >>>>>>>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> > >>>>>>>> --- > >>>>>>>> > >>>>>>>> Changes in v2: > >>>>>>>> - keep the non-misleading part of the comment in handle_set_cr0 > >>>>>>>> > >>>>>>>> arch/x86/kvm/vmx.c | 46 +++++++++++++++++++++++++++++++--------------- > >>>>>>>> 1 files changed, 31 insertions(+), 15 deletions(-) > >>>>>>>> > >>>>>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > >>>>>>>> index 7cc566b..832b7b4 100644 > >>>>>>>> --- a/arch/x86/kvm/vmx.c > >>>>>>>> +++ b/arch/x86/kvm/vmx.c > >>>>>>>> @@ -4605,37 +4605,53 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall) > >>>>>>>> /* called to set cr0 as appropriate for a mov-to-cr0 exit. */ > >>>>>>>> static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val) > >>>>>>>> { > >>>>>>>> - if (to_vmx(vcpu)->nested.vmxon && > >>>>>>>> - ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON)) > >>>>>>>> - return 1; > >>>>>>>> - > >>>>>>>> if (is_guest_mode(vcpu)) { > >>>>>>>> + struct vmcs12 *vmcs12 = get_vmcs12(vcpu); > >>>>>>>> + unsigned long orig_val = val; > >>>>>>>> + > >>>>>>>> /* > >>>>>>>> * We get here when L2 changed cr0 in a way that did not change > >>>>>>>> * any of L1's shadowed bits (see nested_vmx_exit_handled_cr), > >>>>>>>> - * but did change L0 shadowed bits. This can currently happen > >>>>>>>> - * with the TS bit: L0 may want to leave TS on (for lazy fpu > >>>>>>>> - * loading) while pretending to allow the guest to change it. > >>>>>>>> + * but did change L0 shadowed bits. > >>>>>>>> */ > >>>>>>>> - if (kvm_set_cr0(vcpu, (val & vcpu->arch.cr0_guest_owned_bits) | > >>>>>>>> - (vcpu->arch.cr0 & ~vcpu->arch.cr0_guest_owned_bits))) > >>>>>>>> + val = (val & ~vmcs12->cr0_guest_host_mask) | > >>>>>>>> + (vmcs_read64(GUEST_CR0) & vmcs12->cr0_guest_host_mask); > >>>>>>> I think using GUEST_CR0 here is incorrect. It contains combination of bits > >>>>>>> set by L2, L1 and L0 and here we need to get only L2/L1 mix which is in > >>>>>>> vcpu->arch.cr0 (almost, but good enough for this case). Why vcpu->arch.cr0 > >>>>>>> contains right L2/L1 mix? > >>>>>> > >>>>>> L0/L1. 
E.g., kvm_set_cr0 unconditionally injects X86_CR0_ET and masks > >>>>>> out reserved bits. But you are right, GUEST_CR0 isn't much better. And > >>>>>> maybe that mangling in kvm_set_cr0 is a corner case we can ignore. > >>>>>> > >>>>> I think we can. ET is R/O and wired to 1, so it does not matter what > >>>>> guest writes there it should be treated as 1. About reserved bits spec > >>>>> says that software should write what it reads there and does not specify > >>>>> what happens if software does not follow this. > >>>>> > >>>>>>> Because it was set to vmcs12->guest_cr0 during > >>>>>>> L2 #vmentry. While L2 is running three things may happen to CR0: > >>>>>>> > >>>>>>> 1. L2 writes to a bit that is not shadowed neither by L1 nor by L0. It > >>>>>>> will go strait to GUEST_CR0. > >>>>>>> 2. L2 writes to a bit shadowed by L1. L1 #vmexit will be emulated. On the > >>>>>>> next #vmetry vcpu->arch.cr0 will be set to whatever value L1 calculated. > >>>>>>> 3. L2 writes to a bit shadowed by L0, but not L1. This is the case we > >>>>>>> are handling here. And if we will do it right vcpu->arch.cr0 will be > >>>>>>> up-to-date at the end. > >>>>>>> > >>>>>>> The only case when, while this code running, vcpu->arch.cr0 has not > >>>>>>> up-to-date value is if 1 happened, but since L2 guest overwriting cr0 > >>>>>>> here anyway it does not matter what it previously set in GUEST_CR0. The > >>>>>>> correct bits are in a new cr0 value provided by val and accessible by > >>>>>>> (val & ~vmcs12->cr0_guest_host_mask). > >>>>>> > >>>>>> I need to think about it again. Maybe vmcs12->guest_cr0 is best, but > >>>>>> that's a shot from the hips now. > >>>>>> > >>>>> I do not think it is correct because case 3 does not update it. So if 3 > >>>>> happens twice without L1 #vmexit between then vmcs12->guest_cr0 will be > >>>>> outdated. > >>>> > >>>> Again, the only thing that matters here is L1's, not L0's view on the > >>>> "real" CR0 value. So guest_cr0 is never outdated (/wrt > >>>> cr0_guest_host_mask) as it will be updated by L1 in step 2. Even if > >>>> arch.cr0 vs. guest_cr0 makes no difference in practice, the latter is > >>>> more consistent, so I will go for it unless you can convince me it is wrong. > >>>> > >>> Hmm, yes you are right that wrt cr0_guest_host_mask guest_cr0 should be > >>> up-to-date. Please write a big comment about it. > >> > >> Will do. > >> > >>> And what about moving VMXON_CR0_ALWAYSON check into vmx_set_cr0()? > >> > >> That doesn't make much sense for CR0 (due to the differences between > >> vmxon and guest mode - and lacking return code of set_cr4). But I can > >> consolidate the CR4 checks. > >> > > Isn't vmxon check is implicit in a guest mode. i.e if is_guest_mode() is > > trues then vmxon is on? Return code can be added. > > Ah, sorry, you are not seeing what I'm looking at: The test will change > for L2 context once unrestricted guest mode is added. At that point, it > makes more sense to split it into one version that checks against > VMXON_CR0_ALWAYSON while in vmxon, targeting L1, and another that does > more complex evaluation for L2, depending on nested_cpu_has2(vmcs12, > SECONDARY_EXEC_UNRESTRICTED_GUEST). > Ah, OK. Hard to argue that those checks can be consolidated without seeing them :) So you want to implement unrestricted L1 on restricted L0 and let L0 emulate real mode of L2 directly? -- Gleb. 
On 2013-03-07 12:50, Gleb Natapov wrote: > On Thu, Mar 07, 2013 at 12:25:26PM +0100, Jan Kiszka wrote: >> On 2013-03-07 12:06, Gleb Natapov wrote: >>> On Thu, Mar 07, 2013 at 11:37:43AM +0100, Jan Kiszka wrote: >>>> On 2013-03-07 09:57, Gleb Natapov wrote: >>>>> On Thu, Mar 07, 2013 at 09:53:49AM +0100, Jan Kiszka wrote: >>>>>> On 2013-03-07 09:43, Gleb Natapov wrote: >>>>>>> On Thu, Mar 07, 2013 at 09:12:19AM +0100, Jan Kiszka wrote: >>>>>>>> On 2013-03-07 08:51, Gleb Natapov wrote: >>>>>>>>> On Mon, Mar 04, 2013 at 08:40:29PM +0100, Jan Kiszka wrote: >>>>>>>>>> The logic for calculating the value with which we call kvm_set_cr0/4 was >>>>>>>>>> broken (will definitely be visible with nested unrestricted guest mode >>>>>>>>>> support). Also, we performed the check regarding CR0_ALWAYSON too early >>>>>>>>>> when in guest mode. >>>>>>>>>> >>>>>>>>>> What really needs to be done on both CR0 and CR4 is to mask out L1-owned >>>>>>>>>> bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and >>>>>>>>>> arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus, >>>>>>>>>> are not suited as input. >>>>>>>>>> >>>>>>>>>> For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and >>>>>>>>>> refuse the update if it fails. To be fully consistent, we implement this >>>>>>>>>> check now also for CR4. >>>>>>>>>> >>>>>>>>>> Finally, we have to set the shadow to the value L2 wanted to write >>>>>>>>>> originally. >>>>>>>>>> >>>>>>>>>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> >>>>>>>>>> --- >>>>>>>>>> >>>>>>>>>> Changes in v2: >>>>>>>>>> - keep the non-misleading part of the comment in handle_set_cr0 >>>>>>>>>> >>>>>>>>>> arch/x86/kvm/vmx.c | 46 +++++++++++++++++++++++++++++++--------------- >>>>>>>>>> 1 files changed, 31 insertions(+), 15 deletions(-) >>>>>>>>>> >>>>>>>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c >>>>>>>>>> index 7cc566b..832b7b4 100644 >>>>>>>>>> --- a/arch/x86/kvm/vmx.c >>>>>>>>>> +++ b/arch/x86/kvm/vmx.c >>>>>>>>>> @@ -4605,37 +4605,53 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall) >>>>>>>>>> /* called to set cr0 as appropriate for a mov-to-cr0 exit. */ >>>>>>>>>> static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val) >>>>>>>>>> { >>>>>>>>>> - if (to_vmx(vcpu)->nested.vmxon && >>>>>>>>>> - ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON)) >>>>>>>>>> - return 1; >>>>>>>>>> - >>>>>>>>>> if (is_guest_mode(vcpu)) { >>>>>>>>>> + struct vmcs12 *vmcs12 = get_vmcs12(vcpu); >>>>>>>>>> + unsigned long orig_val = val; >>>>>>>>>> + >>>>>>>>>> /* >>>>>>>>>> * We get here when L2 changed cr0 in a way that did not change >>>>>>>>>> * any of L1's shadowed bits (see nested_vmx_exit_handled_cr), >>>>>>>>>> - * but did change L0 shadowed bits. This can currently happen >>>>>>>>>> - * with the TS bit: L0 may want to leave TS on (for lazy fpu >>>>>>>>>> - * loading) while pretending to allow the guest to change it. >>>>>>>>>> + * but did change L0 shadowed bits. >>>>>>>>>> */ >>>>>>>>>> - if (kvm_set_cr0(vcpu, (val & vcpu->arch.cr0_guest_owned_bits) | >>>>>>>>>> - (vcpu->arch.cr0 & ~vcpu->arch.cr0_guest_owned_bits))) >>>>>>>>>> + val = (val & ~vmcs12->cr0_guest_host_mask) | >>>>>>>>>> + (vmcs_read64(GUEST_CR0) & vmcs12->cr0_guest_host_mask); >>>>>>>>> I think using GUEST_CR0 here is incorrect. It contains combination of bits >>>>>>>>> set by L2, L1 and L0 and here we need to get only L2/L1 mix which is in >>>>>>>>> vcpu->arch.cr0 (almost, but good enough for this case). 
Why vcpu->arch.cr0 >>>>>>>>> contains right L2/L1 mix? >>>>>>>> >>>>>>>> L0/L1. E.g., kvm_set_cr0 unconditionally injects X86_CR0_ET and masks >>>>>>>> out reserved bits. But you are right, GUEST_CR0 isn't much better. And >>>>>>>> maybe that mangling in kvm_set_cr0 is a corner case we can ignore. >>>>>>>> >>>>>>> I think we can. ET is R/O and wired to 1, so it does not matter what >>>>>>> guest writes there it should be treated as 1. About reserved bits spec >>>>>>> says that software should write what it reads there and does not specify >>>>>>> what happens if software does not follow this. >>>>>>> >>>>>>>>> Because it was set to vmcs12->guest_cr0 during >>>>>>>>> L2 #vmentry. While L2 is running three things may happen to CR0: >>>>>>>>> >>>>>>>>> 1. L2 writes to a bit that is not shadowed neither by L1 nor by L0. It >>>>>>>>> will go strait to GUEST_CR0. >>>>>>>>> 2. L2 writes to a bit shadowed by L1. L1 #vmexit will be emulated. On the >>>>>>>>> next #vmetry vcpu->arch.cr0 will be set to whatever value L1 calculated. >>>>>>>>> 3. L2 writes to a bit shadowed by L0, but not L1. This is the case we >>>>>>>>> are handling here. And if we will do it right vcpu->arch.cr0 will be >>>>>>>>> up-to-date at the end. >>>>>>>>> >>>>>>>>> The only case when, while this code running, vcpu->arch.cr0 has not >>>>>>>>> up-to-date value is if 1 happened, but since L2 guest overwriting cr0 >>>>>>>>> here anyway it does not matter what it previously set in GUEST_CR0. The >>>>>>>>> correct bits are in a new cr0 value provided by val and accessible by >>>>>>>>> (val & ~vmcs12->cr0_guest_host_mask). >>>>>>>> >>>>>>>> I need to think about it again. Maybe vmcs12->guest_cr0 is best, but >>>>>>>> that's a shot from the hips now. >>>>>>>> >>>>>>> I do not think it is correct because case 3 does not update it. So if 3 >>>>>>> happens twice without L1 #vmexit between then vmcs12->guest_cr0 will be >>>>>>> outdated. >>>>>> >>>>>> Again, the only thing that matters here is L1's, not L0's view on the >>>>>> "real" CR0 value. So guest_cr0 is never outdated (/wrt >>>>>> cr0_guest_host_mask) as it will be updated by L1 in step 2. Even if >>>>>> arch.cr0 vs. guest_cr0 makes no difference in practice, the latter is >>>>>> more consistent, so I will go for it unless you can convince me it is wrong. >>>>>> >>>>> Hmm, yes you are right that wrt cr0_guest_host_mask guest_cr0 should be >>>>> up-to-date. Please write a big comment about it. >>>> >>>> Will do. >>>> >>>>> And what about moving VMXON_CR0_ALWAYSON check into vmx_set_cr0()? >>>> >>>> That doesn't make much sense for CR0 (due to the differences between >>>> vmxon and guest mode - and lacking return code of set_cr4). But I can >>>> consolidate the CR4 checks. >>>> >>> Isn't vmxon check is implicit in a guest mode. i.e if is_guest_mode() is >>> trues then vmxon is on? Return code can be added. >> >> Ah, sorry, you are not seeing what I'm looking at: The test will change >> for L2 context once unrestricted guest mode is added. At that point, it >> makes more sense to split it into one version that checks against >> VMXON_CR0_ALWAYSON while in vmxon, targeting L1, and another that does >> more complex evaluation for L2, depending on nested_cpu_has2(vmcs12, >> SECONDARY_EXEC_UNRESTRICTED_GUEST). >> > Ah, OK. Hard to argue that those checks can be consolidated without > seeing them :) So you want to implement unrestricted L1 on restricted L0 and > let L0 emulate real mode of L2 directly? Err, no. 
:) Well, that emulation might even work but doesn't help unless
you also emulate EPT (no unrestricted guest mode without EPT support -
according to the spec).

I just want to make L0's unrestricted guest mode support available for L1
(in fact, I already did this in my hacking branch).

Jan
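For reference, making L0's unrestricted guest mode available to L1 mostly means advertising the secondary execution control in the nested VMX capability MSRs, roughly like the sketch below (the variable name follows nested_vmx_setup_ctls_msrs() of that code base and is an assumption), plus honoring the bit when building vmcs02 and in the CR0 checks discussed above:

	/* only if L0 itself runs with unrestricted guest mode (implies EPT) */
	if (enable_unrestricted_guest)
		nested_vmx_secondary_ctls_high |=
			SECONDARY_EXEC_UNRESTRICTED_GUEST;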
On Thu, Mar 07, 2013 at 12:57:27PM +0100, Jan Kiszka wrote:
> >> Ah, sorry, you are not seeing what I'm looking at: The test will change
> >> for L2 context once unrestricted guest mode is added. At that point, it
> >> makes more sense to split it into one version that checks against
> >> VMXON_CR0_ALWAYSON while in vmxon, targeting L1, and another that does
> >> more complex evaluation for L2, depending on nested_cpu_has2(vmcs12,
> >> SECONDARY_EXEC_UNRESTRICTED_GUEST).
> >>
> > Ah, OK. Hard to argue that those checks can be consolidated without
> > seeing them :) So you want to implement unrestricted L1 on restricted L0 and
> > let L0 emulate real mode of L2 directly?
>
> Err, no. :) Well, that emulation might even work but doesn't help unless
> you also emulate EPT (not unrestricted guest mode without EPT support -
> according to the spec).
Yes, of course EPT is needed, but patches are available :) I think it
should speed up L2 real mode substantially: no need to go to L1 for each
emulated instruction, and L1 would itself have to exit to L0 many times
while emulating some of those instructions.

> I just want make L0's unrestricted guest mode support available for L1
> (in fact, I already did this in my hacking branch).
>
Don't you need EPT for that too?

--
			Gleb.
On 2013-03-07 13:05, Gleb Natapov wrote: > On Thu, Mar 07, 2013 at 12:57:27PM +0100, Jan Kiszka wrote: >>>> Ah, sorry, you are not seeing what I'm looking at: The test will change >>>> for L2 context once unrestricted guest mode is added. At that point, it >>>> makes more sense to split it into one version that checks against >>>> VMXON_CR0_ALWAYSON while in vmxon, targeting L1, and another that does >>>> more complex evaluation for L2, depending on nested_cpu_has2(vmcs12, >>>> SECONDARY_EXEC_UNRESTRICTED_GUEST). >>>> >>> Ah, OK. Hard to argue that those checks can be consolidated without >>> seeing them :) So you want to implement unrestricted L1 on restricted L0 and >>> let L0 emulate real mode of L2 directly? >> >> Err, no. :) Well, that emulation might even work but doesn't help unless >> you also emulate EPT (not unrestricted guest mode without EPT support - >> according to the spec). > Yes, of course EPT is needed, but patches are available :) I think it > should speedup L2 real mode substantially. No need to go to L1 for each > instruction emulation and L1 will have to exit to L0 many times during > emulation of some instructions. The point is: If you already have EPT on the host, you likely also have native unrestricted guest mode. You just need to expose it and adjust some minor things (like this bug here) along the way. Not sure how many CPUs had EPT but no unrestricted guest mode. Do you have numbers? >> >> I just want make L0's unrestricted guest mode support available for L1 >> (in fact, I already did this in my hacking branch). >> > Don't you need EPT for that too? Sure. As I wrote, I'm using Nadav's patches of last August against some 3.6 baseline. Jan
On Thu, Mar 07, 2013 at 01:18:24PM +0100, Jan Kiszka wrote:
> On 2013-03-07 13:05, Gleb Natapov wrote:
> > On Thu, Mar 07, 2013 at 12:57:27PM +0100, Jan Kiszka wrote:
> >>>> Ah, sorry, you are not seeing what I'm looking at: The test will change
> >>>> for L2 context once unrestricted guest mode is added. At that point, it
> >>>> makes more sense to split it into one version that checks against
> >>>> VMXON_CR0_ALWAYSON while in vmxon, targeting L1, and another that does
> >>>> more complex evaluation for L2, depending on nested_cpu_has2(vmcs12,
> >>>> SECONDARY_EXEC_UNRESTRICTED_GUEST).
> >>>>
> >>> Ah, OK. Hard to argue that those checks can be consolidated without
> >>> seeing them :) So you want to implement unrestricted L1 on restricted L0 and
> >>> let L0 emulate real mode of L2 directly?
> >>
> >> Err, no. :) Well, that emulation might even work but doesn't help unless
> >> you also emulate EPT (not unrestricted guest mode without EPT support -
> >> according to the spec).
> > Yes, of course EPT is needed, but patches are available :) I think it
> > should speedup L2 real mode substantially. No need to go to L1 for each
> > instruction emulation and L1 will have to exit to L0 many times during
> > emulation of some instructions.
>
> The point is: If you already have EPT on the host, you likely also have
> native unrestricted guest mode. You just need to expose it and adjust
> some minor things (like this bug here) along the way. Not sure how many
> CPUs had EPT but no unrestricted guest mode. Do you have numbers?
>
AFAIK every single one before Westmere. Nehalem does not have it for sure.

--
			Gleb.
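Background for the EPT-but-no-unrestricted case: on such CPUs (Nehalem, for example) L0 already disables the feature for itself during setup. A sketch of the expected gating; the exact hardware_setup() logic of that era is from memory and may differ:

	if (!cpu_has_vmx_unrestricted_guest())
		enable_unrestricted_guest = 0;

	/* the SDM ties unrestricted guest mode to "enable EPT" */
	if (!enable_ept)
		enable_unrestricted_guest = 0;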
On 2013-03-07 13:21, Gleb Natapov wrote: > On Thu, Mar 07, 2013 at 01:18:24PM +0100, Jan Kiszka wrote: >> On 2013-03-07 13:05, Gleb Natapov wrote: >>> On Thu, Mar 07, 2013 at 12:57:27PM +0100, Jan Kiszka wrote: >>>>>> Ah, sorry, you are not seeing what I'm looking at: The test will change >>>>>> for L2 context once unrestricted guest mode is added. At that point, it >>>>>> makes more sense to split it into one version that checks against >>>>>> VMXON_CR0_ALWAYSON while in vmxon, targeting L1, and another that does >>>>>> more complex evaluation for L2, depending on nested_cpu_has2(vmcs12, >>>>>> SECONDARY_EXEC_UNRESTRICTED_GUEST). >>>>>> >>>>> Ah, OK. Hard to argue that those checks can be consolidated without >>>>> seeing them :) So you want to implement unrestricted L1 on restricted L0 and >>>>> let L0 emulate real mode of L2 directly? >>>> >>>> Err, no. :) Well, that emulation might even work but doesn't help unless >>>> you also emulate EPT (not unrestricted guest mode without EPT support - >>>> according to the spec). >>> Yes, of course EPT is needed, but patches are available :) I think it >>> should speedup L2 real mode substantially. No need to go to L1 for each >>> instruction emulation and L1 will have to exit to L0 many times during >>> emulation of some instructions. >> >> The point is: If you already have EPT on the host, you likely also have >> native unrestricted guest mode. You just need to expose it and adjust >> some minor things (like this bug here) along the way. Not sure how many >> CPUs had EPT but no unrestricted guest mode. Do you have numbers? >> > AFAIK every single one before Westmere. Nehalem does no have it for > sure. OK. Hmm, will it be more than just faking unrestricted mode toward L1 and emulating in L0 then (which should happen automagically)? Maybe I will play with this under unrestricted_guest=0 when I have some time. Jan
On Thu, Mar 07, 2013 at 01:48:55PM +0100, Jan Kiszka wrote:
> On 2013-03-07 13:21, Gleb Natapov wrote:
> > On Thu, Mar 07, 2013 at 01:18:24PM +0100, Jan Kiszka wrote:
> >> On 2013-03-07 13:05, Gleb Natapov wrote:
> >>> On Thu, Mar 07, 2013 at 12:57:27PM +0100, Jan Kiszka wrote:
> >>>>>> Ah, sorry, you are not seeing what I'm looking at: The test will change
> >>>>>> for L2 context once unrestricted guest mode is added. At that point, it
> >>>>>> makes more sense to split it into one version that checks against
> >>>>>> VMXON_CR0_ALWAYSON while in vmxon, targeting L1, and another that does
> >>>>>> more complex evaluation for L2, depending on nested_cpu_has2(vmcs12,
> >>>>>> SECONDARY_EXEC_UNRESTRICTED_GUEST).
> >>>>>>
> >>>>> Ah, OK. Hard to argue that those checks can be consolidated without
> >>>>> seeing them :) So you want to implement unrestricted L1 on restricted L0 and
> >>>>> let L0 emulate real mode of L2 directly?
> >>>>
> >>>> Err, no. :) Well, that emulation might even work but doesn't help unless
> >>>> you also emulate EPT (not unrestricted guest mode without EPT support -
> >>>> according to the spec).
> >>> Yes, of course EPT is needed, but patches are available :) I think it
> >>> should speedup L2 real mode substantially. No need to go to L1 for each
> >>> instruction emulation and L1 will have to exit to L0 many times during
> >>> emulation of some instructions.
> >>
> >> The point is: If you already have EPT on the host, you likely also have
> >> native unrestricted guest mode. You just need to expose it and adjust
> >> some minor things (like this bug here) along the way. Not sure how many
> >> CPUs had EPT but no unrestricted guest mode. Do you have numbers?
> >>
> > AFAIK every single one before Westmere. Nehalem does no have it for
> > sure.
>
> OK. Hmm, will it be more than just faking unrestricted mode toward L1
> and emulating in L0 then (which should happen automagically)? Maybe I
> will play with this under unrestricted_guest=0 when I have some time.
>
We will probably need to teach the emulator about more instructions that
may cause a vmexit.

--
			Gleb.
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 7cc566b..832b7b4 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4605,37 +4605,53 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
 /* called to set cr0 as appropriate for a mov-to-cr0 exit. */
 static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val)
 {
-	if (to_vmx(vcpu)->nested.vmxon &&
-	    ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON))
-		return 1;
-
 	if (is_guest_mode(vcpu)) {
+		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+		unsigned long orig_val = val;
+
 		/*
 		 * We get here when L2 changed cr0 in a way that did not change
 		 * any of L1's shadowed bits (see nested_vmx_exit_handled_cr),
-		 * but did change L0 shadowed bits. This can currently happen
-		 * with the TS bit: L0 may want to leave TS on (for lazy fpu
-		 * loading) while pretending to allow the guest to change it.
+		 * but did change L0 shadowed bits.
 		 */
-		if (kvm_set_cr0(vcpu, (val & vcpu->arch.cr0_guest_owned_bits) |
-				(vcpu->arch.cr0 & ~vcpu->arch.cr0_guest_owned_bits)))
+		val = (val & ~vmcs12->cr0_guest_host_mask) |
+			(vmcs_read64(GUEST_CR0) & vmcs12->cr0_guest_host_mask);
+		if ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON)
 			return 1;
-		vmcs_writel(CR0_READ_SHADOW, val);
+
+		if (kvm_set_cr0(vcpu, val))
+			return 1;
+		vmcs_writel(CR0_READ_SHADOW, orig_val);
 		return 0;
-	} else
+	} else {
+		if (to_vmx(vcpu)->nested.vmxon &&
+		    ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON))
+			return 1;
 		return kvm_set_cr0(vcpu, val);
+	}
 }
 
 static int handle_set_cr4(struct kvm_vcpu *vcpu, unsigned long val)
 {
 	if (is_guest_mode(vcpu)) {
-		if (kvm_set_cr4(vcpu, (val & vcpu->arch.cr4_guest_owned_bits) |
-				(vcpu->arch.cr4 & ~vcpu->arch.cr4_guest_owned_bits)))
+		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+		unsigned long orig_val = val;
+
+		val = (val & ~vmcs12->cr4_guest_host_mask) |
+			(vmcs_readl(GUEST_CR4) & vmcs12->cr4_guest_host_mask);
+		if ((val & VMXON_CR4_ALWAYSON) != VMXON_CR4_ALWAYSON)
+			return 1;
+
+		if (kvm_set_cr4(vcpu, val))
 			return 1;
-		vmcs_writel(CR4_READ_SHADOW, val);
+		vmcs_writel(CR4_READ_SHADOW, orig_val);
 		return 0;
-	} else
+	} else {
+		if (to_vmx(vcpu)->nested.vmxon &&
+		    ((val & VMXON_CR4_ALWAYSON) != VMXON_CR4_ALWAYSON))
+			return 1;
 		return kvm_set_cr4(vcpu, val);
+	}
 }
 
 /* called to set cr0 as approriate for clts instruction exit. */
The logic for calculating the value with which we call kvm_set_cr0/4 was
broken (will definitely be visible with nested unrestricted guest mode
support). Also, we performed the check regarding CR0_ALWAYSON too early
when in guest mode.

What really needs to be done on both CR0 and CR4 is to mask out L1-owned
bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and
arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus,
are not suited as input.

For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and
refuse the update if it fails. To be fully consistent, we implement this
check now also for CR4.

Finally, we have to set the shadow to the value L2 wanted to write
originally.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---

Changes in v2:
- keep the non-misleading part of the comment in handle_set_cr0

 arch/x86/kvm/vmx.c | 46 +++++++++++++++++++++++++++++++---------------
 1 files changed, 31 insertions(+), 15 deletions(-)
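For reference, the ALWAYSON constants used in these checks are defined in vmx.c as the CR bits that must remain set while in VMX operation; from memory, approximately:

	#define VMXON_CR0_ALWAYSON	(X86_CR0_PE | X86_CR0_PG | X86_CR0_NE)
	#define VMXON_CR4_ALWAYSON	X86_CR4_VMXE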