[20/31] nVMX: Exiting from L2 to L1

Message ID 201105161954.p4GJs9XB001955@rice.haifa.ibm.com (mailing list archive)
State New, archived

Commit Message

Nadav Har'El May 16, 2011, 7:54 p.m. UTC
This patch implements nested_vmx_vmexit(), called when the nested L2 guest
exits and we want to run its L1 parent and let it handle this exit.

Note that this will not necessarily be called on every L2 exit. L0 may decide
to handle a particular exit on its own, without L1's involvement; In that
case, L0 will handle the exit, and resume running L2, without running L1 and
without calling nested_vmx_vmexit(). The logic for deciding whether to handle
a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(),
will appear in a separate patch below.
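
(The dispatch that makes this choice is not in this patch. As a rough sketch,
it presumably hooks into vmx_handle_exit() along these lines - the helper
name nested_vmx_exit_handled() is assumed here, not something this patch
defines:)

	if (is_guest_mode(vcpu) && nested_vmx_exit_handled(vcpu)) {
		/* L1 wants this exit: switch back to vmcs01 and reflect it */
		nested_vmx_vmexit(vcpu);
		return 1;
	}
	/* otherwise L0 handles the exit itself and resumes running L2 */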

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  257 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 257 insertions(+)


Comments

Tian, Kevin May 24, 2011, 12:58 p.m. UTC | #1
> From: Nadav Har'El
> Sent: Tuesday, May 17, 2011 3:54 AM
> 
> This patch implements nested_vmx_vmexit(), called when the nested L2 guest
> exits and we want to run its L1 parent and let it handle this exit.
> 
> Note that this will not necessarily be called on every L2 exit. L0 may decide
> to handle a particular exit on its own, without L1's involvement; In that
> case, L0 will handle the exit, and resume running L2, without running L1 and
> without calling nested_vmx_vmexit(). The logic for deciding whether to handle
> a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(),
> will appear in a separate patch below.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>  arch/x86/kvm/vmx.c |  257 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 257 insertions(+)
> 
> --- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:49.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:49.000000000 +0300
> @@ -6203,6 +6203,263 @@ static int nested_vmx_run(struct kvm_vcp
>  	return 1;
>  }
> 
> +/*
> + * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date
> + * because L2 may have changed some cr0 bits directly (see CR0_GUEST_HOST_MASK)
> + * without L0 trapping the change and updating vmcs12.
> + * This function returns the value we should put in vmcs12.guest_cr0. It's not
> + * enough to just return the current (vmcs02) GUEST_CR0 - that may not be the
> + * guest cr0 that L1 thought it was giving its L2 guest; It is possible that
> + * L1 wished to allow its guest to set some cr0 bit directly, but we (L0) asked
> + * to trap this change and instead set just the read shadow bit. If this is the
> + * case, we need to copy these read-shadow bits back to vmcs12.guest_cr0, where
> + * L1 believes they already are.
> + */
> +static inline unsigned long
> +vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
> +{
> +	/*
> +	 * As explained above, we take a bit from GUEST_CR0 if we allowed the
> +	 * guest to modify it untrapped (vcpu->arch.cr0_guest_owned_bits), or
> +	 * if we did trap it - if we did so because L1 asked to trap this bit
> +	 * (vmcs12->cr0_guest_host_mask). Otherwise (bits we trapped but L1
> +	 * didn't expect us to trap) we read from CR0_READ_SHADOW.
> +	 */
> +	unsigned long guest_cr0_bits =
> +		vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask;
> +	return (vmcs_readl(GUEST_CR0) & guest_cr0_bits) |
> +	       (vmcs_readl(CR0_READ_SHADOW) & ~guest_cr0_bits);
> +}

Hi, Nadav,

Not sure whether I've got the above operation wrong, but it doesn't look
exactly correct to me at a glance. Say a bit is set in both L0's and L1's
cr0_guest_host_mask. In that case the bit of vmcs12 GUEST_CR0 resides in
vmcs02 CR0_READ_SHADOW; however, the above operation will return the
vmcs02 GUEST_CR0 bit instead.

Instead of constructing vmcs12_GUEST_CR0 completely from vmcs02_GUEST_CR0,
why not just update the bits which can have been altered, while keeping the
remaining bits from vmcs12_GUEST_CR0? Say something like:

vmcs12->guest_cr0 &= vmcs12->cr0_guest_host_mask; /* keep unchanged bits */
vmcs12->guest_cr0 |= (vmcs_readl(GUEST_CR0) & vcpu->arch.cr0_guest_owned_bits) |
	(vmcs_readl(CR0_READ_SHADOW) & ~(vcpu->arch.cr0_guest_owned_bits |
					 vmcs12->cr0_guest_host_mask));

Thanks
Kevin
Nadav Har'El May 24, 2011, 1:43 p.m. UTC | #2
On Tue, May 24, 2011, Tian, Kevin wrote about "RE: [PATCH 20/31] nVMX: Exiting from L2 to L1":
> > +vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
> > +{
> > +	/*
> > +	 * As explained above, we take a bit from GUEST_CR0 if we allowed the
> > +	 * guest to modify it untrapped (vcpu->arch.cr0_guest_owned_bits), or
> > +	 * if we did trap it - if we did so because L1 asked to trap this bit
> > +	 * (vmcs12->cr0_guest_host_mask). Otherwise (bits we trapped but L1
> > +	 * didn't expect us to trap) we read from CR0_READ_SHADOW.
> > +	 */
> > +	unsigned long guest_cr0_bits =
> > +		vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask;
> > +	return (vmcs_readl(GUEST_CR0) & guest_cr0_bits) |
> > +	       (vmcs_readl(CR0_READ_SHADOW) & ~guest_cr0_bits);
> > +}
> 
> Hi, Nadav,
> 
> Not sure whether I've got the above operation wrong.

This is one of the trickiest functions in nested VMX, which is why I added
15 lines of comments (!) on just two statements of code.

> But it doesn't look exactly correct to me at a glance. Say a bit is set in
> both L0's and L1's cr0_guest_host_mask. In that case the bit of vmcs12
> GUEST_CR0 resides in vmcs02 CR0_READ_SHADOW; however, the above operation
> will return the vmcs02 GUEST_CR0 bit instead.

This behavior is correct: If a bit is set in L1's cr0_guest_host_mask (and
in particular, if it is set in both L0's and L1's), we always exit to L1 when
L2 changes this bit, and this bit cannot change while L2 is running, so
naturally after the run vmcs02.guest_cr0 and vmcs12.guest_cr0 are still
identical in that bit.
Copying that bit from vmcs02_CR0_READ_SHADOW, as you suggested, would be
completely wrong in this case: When L1 sets a bit in cr0_guest_host_mask,
the vmcs02->cr0_read_shadow is vmcs12->cr0_read_shadow (see nested_read_cr0),
and is just a pretense that L1 set up for L2 - it is NOT the real bit of
guest_cr0, so copying it into guest_cr0 would be wrong.

Note that this function is completely different from nested_read_cr0 (the
new name), which behaves similarly to what you suggested but serves a
completely different (and in some respects, opposite) function.

I think my comments in the code are clearer than what I just wrote here, so
please take a look at them again, and let me know if you find any errors.

> Instead of constructing vmcs12_GUEST_CR0 completely from vmcs02_GUEST_CR0,
> why not just updating bits which can be altered while keeping the rest bits from
> vmcs12_GUEST_CR0? Say something like:
> 
> vmcs12->guest_cr0 &= vmcs12->cr0_guest_host_mask; /* keep unchanged bits */
> vmcs12->guest_cr0 |= (vmcs_readl(GUEST_CR0) & vcpu->arch.cr0_guest_owned_bits) |
> 	(vmcs_readl(CR0_READ_SHADOW) & ~( vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask))

I guess I could do something like this, but do you think it's clearer?
I don't. Behind all the details, my formula emphasises that MOST cr0 bits
can be just copied from vmcs02 to vmcs12 as is - and we only have to do
something strange for special bits - where L0 wanted to trap but L1 didn't.
In your formula, it looks like there are 3 different cases instead of 2.

In any case, your formula is definitely not more correct, because the formulas
are in fact equivalent - let me prove:

If, instead of taking the "unchanged bits" (as you call them) from
vmcs12->guest_cr0, you take them from vmcs02->guest_cr0 (you can,
because they couldn't have changed), you end up with *exactly* the same
formula I used. Here is the proof:

 yourformula =
	(vmcs12->guest_cr0 & vmcs12->cr0_guest_host_mask) |
	(vmcs_readl(GUEST_CR0) & vcpu->arch.cr0_guest_owned_bits) |
	(vmcs_readl(CR0_READ_SHADOW) & ~(vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask))

Now, because of the "unchanged bits",

	(vmcs12->guest_cr0 & vmcs12->cr0_guest_host_mask) ==
	(vmcs02->guest_cr0 & vmcs12->cr0_guest_host_mask)

(and note that vmcs02->guest_cr0 is vmcs_readl(GUEST_CR0)),

so substituting this into yourformula, it becomes

 yourformula =
	(vmcs_readl(GUEST_CR0) & vmcs12->cr0_guest_host_mask) |
	(vmcs_readl(GUEST_CR0) & vcpu->arch.cr0_guest_owned_bits) |
	(vmcs_readl(CR0_READ_SHADOW) & ~(vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask))

or, simplifying

 yourformula =
	(vmcs_readl(GUEST_CR0) & (vmcs12->cr0_guest_host_mask | vcpu->arch.cr0_guest_owned_bits)) |
	(vmcs_readl(CR0_READ_SHADOW) & ~(vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask))

now, using the name I used:
	unsigned long guest_cr0_bits =
		vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask;

you end up with

 yourformula =
	(vmcs_readl(GUEST_CR0) & guest_cr0_bits) |
	(vmcs_readl(CR0_READ_SHADOW) & ~guest_cr0_bits)

Which is, believe it or not, exactly my formula :-)
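
The step above that uses the "unchanged bits" assumption is the only place
the two formulas can diverge. A throwaway user-space sketch (all names ad
hoc, with plain variables standing in for the VMCS reads) brute-forces
single-bit inputs, and confirms they agree exactly when vmcs02 GUEST_CR0 and
vmcs12.guest_cr0 agree on the bits in vmcs12->cr0_guest_host_mask:

#include <stdio.h>

/* Ad-hoc stand-ins for the VMCS fields involved. */
struct cr0_state {
	unsigned long vmcs02_guest_cr0;		/* vmcs_readl(GUEST_CR0) */
	unsigned long vmcs02_read_shadow;	/* vmcs_readl(CR0_READ_SHADOW) */
	unsigned long vmcs12_guest_cr0;
	unsigned long l0_owned;		/* vcpu->arch.cr0_guest_owned_bits */
	unsigned long l1_mask;		/* vmcs12->cr0_guest_host_mask */
};

static unsigned long formula_nyh(const struct cr0_state *s)
{
	unsigned long bits = s->l0_owned | s->l1_mask;

	return (s->vmcs02_guest_cr0 & bits) | (s->vmcs02_read_shadow & ~bits);
}

static unsigned long formula_kevin(const struct cr0_state *s)
{
	return (s->vmcs12_guest_cr0 & s->l1_mask) |
	       (s->vmcs02_guest_cr0 & s->l0_owned) |
	       (s->vmcs02_read_shadow & ~(s->l0_owned | s->l1_mask));
}

int main(void)
{
	int i, diffs = 0;

	for (i = 0; i < 32; i++) {
		struct cr0_state s = {
			.vmcs02_guest_cr0   = (i >> 0) & 1,
			.vmcs02_read_shadow = (i >> 1) & 1,
			.vmcs12_guest_cr0   = (i >> 2) & 1,
			.l0_owned           = (i >> 3) & 1,
			.l1_mask            = (i >> 4) & 1,
		};

		/* A bit L1 traps is never left guest-owned by L0. */
		if (s.l0_owned & s.l1_mask)
			continue;
		/* The formulas differ only when the two guest_cr0 values
		 * disagree on an L1-masked bit, i.e. exactly when the
		 * "unchanged bits" assumption does not hold. */
		if (formula_nyh(&s) != formula_kevin(&s))
			diffs++;
	}
	printf("%d differing single-bit cases\n", diffs);	/* prints 4 */
	return 0;
}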
Tian, Kevin May 25, 2011, 12:55 a.m. UTC | #3
> From: Nadav Har'El [mailto:nyh@math.technion.ac.il]
> Sent: Tuesday, May 24, 2011 9:43 PM
> 
> On Tue, May 24, 2011, Tian, Kevin wrote about "RE: [PATCH 20/31] nVMX: Exiting from L2 to L1":
> > > +vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
> > > +{
> > > +	/*
> > > +	 * As explained above, we take a bit from GUEST_CR0 if we allowed the
> > > +	 * guest to modify it untrapped (vcpu->arch.cr0_guest_owned_bits), or
> > > +	 * if we did trap it - if we did so because L1 asked to trap this bit
> > > +	 * (vmcs12->cr0_guest_host_mask). Otherwise (bits we trapped but L1
> > > +	 * didn't expect us to trap) we read from CR0_READ_SHADOW.
> > > +	 */
> > > +	unsigned long guest_cr0_bits =
> > > +		vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask;
> > > +	return (vmcs_readl(GUEST_CR0) & guest_cr0_bits) |
> > > +	       (vmcs_readl(CR0_READ_SHADOW) & ~guest_cr0_bits);
> > > +}
> >
> > Hi, Nadav,
> >
> > Not sure whether I've got the above operation wrong.
> 
> This is one of the trickiest functions in nested VMX, which is why I added
> 15 lines of comments (!) on just two statements of code.

I read the comment carefully, and the scenario I described is not covered there.

> 
> > But it doesn't look exactly correct to me at a glance. Say a bit is set in
> > both L0's and L1's cr0_guest_host_mask. In that case the bit of vmcs12
> > GUEST_CR0 resides in vmcs02 CR0_READ_SHADOW; however, the above operation
> > will return the vmcs02 GUEST_CR0 bit instead.
> 
> This behavior is correct: If a bit is set in L1's cr0_guest_host_mask (and
> in particular, if it is set in both L0's and L1's), we always exit to L1 when
> L2 changes this bit, and this bit cannot change while L2 is running, so
> naturally after the run vmcs02.guest_cr0 and vmcs12.guest_cr0 are still
> identical in that bit.

Are you sure this is the case? Loading vmcs12.guest_cr0 into vmcs02 is
treated just like L1 trying to update GUEST_CR0, which is why you use
vmx_set_cr0(vcpu, vmcs12->guest_cr0) in prepare_vmcs02. If L0 has a bit set
in its cr0_guest_host_mask, the corresponding bit of vmcs12.guest_cr0 will
be cached in vmcs02.cr0_read_shadow anyway. This is not related to whether
L2 changes that bit.

IOW, I disagree that if L0 and L1 set the same bit in cr0_guest_host_mask,
then the bit is identical in vmcs02.guest_cr0 and vmcs12.guest_cr0, because
L1 has no permission to set its bit effectively in this case.

> Copying that bit from vmcs02_CR0_READ_SHADOW, as you suggested, would be
> completely wrong in this case: When L1 sets a bit in cr0_guest_host_mask,
> the vmcs02->cr0_read_shadow is vmcs12->cr0_read_shadow (see nested_read_cr0),
> and is just a pretense that L1 set up for L2 - it is NOT the real bit of
> guest_cr0, so copying it into guest_cr0 would be wrong.

So I'm talking about preserving that bit from vmcs12.guest_cr0 when it is set
in vmcs12.cr0_guest_host_mask, which is the natural outcome.

> 
> Note that this function is completely different from nested_read_cr0 (the
> new name), which behaves similarly to what you suggested but serves a
> completely different (and in some respects, opposite) function.
> 
> I think my comments in the code are clearer than what I just wrote here, so
> please take a look at them again, and let me know if you find any errors.
> 
> > Instead of constructing vmcs12_GUEST_CR0 completely from vmcs02_GUEST_CR0,
> > why not just update the bits which can have been altered, while keeping the
> > remaining bits from vmcs12_GUEST_CR0? Say something like:
> >
> > vmcs12->guest_cr0 &= vmcs12->cr0_guest_host_mask; /* keep unchanged bits */
> > vmcs12->guest_cr0 |= (vmcs_readl(GUEST_CR0) & vcpu->arch.cr0_guest_owned_bits) |
> > 	(vmcs_readl(CR0_READ_SHADOW) & ~(vcpu->arch.cr0_guest_owned_bits |
> > 					 vmcs12->cr0_guest_host_mask));
> 
> I guess I could do something like this, but do you think it's clearer?
> I don't. Behind all the details, my formula emphasises that MOST cr0 bits
> can be just copied from vmcs02 to vmcs12 as is - and we only have to do
> something strange for special bits - where L0 wanted to trap but L1 didn't.
> In your formula, it looks like there are 3 different cases instead of 2.

But my formula is clearer, given that it sticks to the meaning of
cr0_guest_host_mask: you only need to update the cr0 bits which L2 can
modify without a trap, while keeping the rest.

> 
> In any case, your formula is definitely not more correct, because the formulas
> are in fact equivalent - let me prove:
> 
> If, instead of taking the "unchanged bits" (as you call them) from
> vmcs12->guest_cr0, you take them from vmcs02->guest_cr0 (you can,
> because they couldn't have changed), you end up with *exactly* the same
> formula I used. Here is the proof:
> 
>  yourformula =
> 	(vmcs12->guest_cr0 & vmcs12->cr0_guest_host_mask) |
> 	(vmcs_readl(GUEST_CR0) & vcpu->arch.cr0_guest_owned_bits) |
> 	(vmcs_readl(CR0_READ_SHADOW) & ~(vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask))
> 
> Now, because of the "unchanged bits",
> 
> 	(vmcs12->guest_cr0 & vmcs12->cr0_guest_host_mask) ==
> 	(vmcs02->guest_cr0 & vmcs12->cr0_guest_host_mask)
> 
> (and note that vmcs02->guest_cr0 is vmcs_readl(GUEST_CR0)),

this is the problem:

	(vmcs12->guest_cr0 & vmcs12->cr0_guest_host_mask) !=
	(vmcs02->guest_cr0 & vmcs12->cr0_guest_host_mask)

only the equation below holds true:

	(vmcs12->guest_cr0 & vmcs12->cr0_guest_host_mask & ~L0->cr0_guest_host_mask) ==
	(vmcs02->guest_cr0 & vmcs12->cr0_guest_host_mask & ~L0->cr0_guest_host_mask)

When a bit of vmcs12->cr0_guest_host_mask is set, it simply indicates that L1
wants to control the bit instead of L2. However, whether L1 can really control
that bit still depends on whether L0 allows it to!
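
To spell out where each bit's architecturally correct value actually lives
while L2 runs, here is a sketch (the helper is ad hoc; "traps" means the bit
is set in that level's cr0_guest_host_mask):

/* Which field supplies a given vmcs12.guest_cr0 bit on a nested exit. */
static const char *cr0_bit_source(int l1_traps, int l0_traps)
{
	if (!l0_traps)		/* implies !l1_traps: the bit was guest-owned */
		return "vmcs02 GUEST_CR0";
	if (l1_traps)		/* L1 owns the bit; it cannot have changed */
		return "vmcs12->guest_cr0";
	/* L0 trapped a bit L1 left to its guest: L2's value is kept here */
	return "vmcs02 CR0_READ_SHADOW";
}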

> 
> substituting this into yourformula, it becomes
> 
>  yourformula =
> 	(vmcs_readl(GUEST_CR0) & vmcs12->cr0_guest_host_mask) |
> 	(vmcs_readl(GUEST_CR0) & vcpu->arch.cr0_guest_owned_bits) |
> 	(vmcs_readl(CR0_READ_SHADOW) & ~(vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask))
> 
> or, simplifying
> 
>  yourformula =
> 	(vmcs_readl(GUEST_CR0) & (vmcs12->cr0_guest_host_mask | vcpu->arch.cr0_guest_owned_bits)) |
> 	(vmcs_readl(CR0_READ_SHADOW) & ~(vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask))
> 
> now, using the name I used:
> 	unsigned long guest_cr0_bits =
> 		vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask;
> 
> you end up with
> 
>  yourformula =
> 	(vmcs_readl(GUEST_CR0) & guest_cr0_bits) |
> 	(vmcs_readl(CR0_READ_SHADOW) & ~guest_cr0_bits)
> 
> Which is, believe it or not, exactly my formula :-)
> 

So with my interpretation, the two formulas are different, because you
misuse vmcs12.cr0_guest_host_mask. :-)

Thanks
Kevin
Tian, Kevin May 25, 2011, 2:43 a.m. UTC | #4
> From: Nadav Har'El
> Sent: Tuesday, May 17, 2011 3:54 AM
> 
> This patch implements nested_vmx_vmexit(), called when the nested L2 guest
> exits and we want to run its L1 parent and let it handle this exit.
> 
> Note that this will not necessarily be called on every L2 exit. L0 may decide
> to handle a particular exit on its own, without L1's involvement; In that
> case, L0 will handle the exit, and resume running L2, without running L1 and
> without calling nested_vmx_vmexit(). The logic for deciding whether to handle
> a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(),
> will appear in a separate patch below.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>

> +/*
> + * A part of what we need to do when the nested L2 guest exits and we want to
> + * run its L1 parent, is to reset L1's guest state to the host state specified
> + * in vmcs12.
> + * This function is to be called not only on normal nested exit, but also on
> + * a nested entry failure, as explained in Intel's spec, 3B.23.7 ("VM-Entry
> + * Failures During or After Loading Guest State").
> + * This function should be called when the active VMCS is L1's (vmcs01).
> + */
> +void load_vmcs12_host_state(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
> +{
> +	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_EFER)
> +		vcpu->arch.efer = vmcs12->host_ia32_efer;
> +	if (vmcs12->vm_exit_controls & VM_EXIT_HOST_ADDR_SPACE_SIZE)
> +		vcpu->arch.efer |= (EFER_LMA | EFER_LME);
> +	else
> +		vcpu->arch.efer &= ~(EFER_LMA | EFER_LME);
> +	vmx_set_efer(vcpu, vcpu->arch.efer);
> +
> +	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PAT)
> +		vmcs_write64(GUEST_IA32_PAT, vmcs12->host_ia32_pat);
> +
> +	kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12->host_rsp);
> +	kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs12->host_rip);
> +	/*
> +	 * Note that calling vmx_set_cr0 is important, even if cr0 hasn't
> +	 * actually changed, because it depends on the current state of
> +	 * fpu_active (which may have changed).
> +	 * Note that vmx_set_cr0 refers to efer set above.
> +	 */
> +	kvm_set_cr0(vcpu, vmcs12->host_cr0);
> +	/*
> +	 * If we did fpu_activate()/fpu_deactivate() during L2's run, we need
> +	 * to apply the same changes to L1's vmcs. We just set cr0 correctly,
> +	 * but we also need to update cr0_guest_host_mask and exception_bitmap.
> +	 */
> +	update_exception_bitmap(vcpu);
> +	vcpu->arch.cr0_guest_owned_bits = (vcpu->fpu_active ? X86_CR0_TS : 0);
> +	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
> +
> +	/*
> +	 * Note that CR4_GUEST_HOST_MASK is already set in the original vmcs01
> +	 * (KVM doesn't change it) - no reason to call set_cr4_guest_host_mask();
> +	 */
> +	vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
> +	kvm_set_cr4(vcpu, vmcs12->host_cr4);
> +
> +	/* shadow page tables on either EPT or shadow page tables */
> +	kvm_set_cr3(vcpu, vmcs12->host_cr3);
> +	kvm_mmu_reset_context(vcpu);
> +
> +	if (enable_vpid) {
> +		/*
> +		 * Trivially support vpid by letting L2s share their parent
> +		 * L1's vpid. TODO: move to a more elaborate solution, giving
> +		 * each L2 its own vpid and exposing the vpid feature to L1.
> +		 */
> +		vmx_flush_tlb(vcpu);
> +	}

How about the SYSENTER and PERF_GLOBAL_CTRL MSRs? At least a TODO comment
here would make the whole load process complete. :-)

Also, isn't it saner to update vmcs01's guest segment info based on vmcs12's
host segment info? Though you can assume the environment in L1 doesn't change
from VMLAUNCH/VMRESUME to the VMEXIT handler, it is architecturally cleaner
to load those segment fields according to L1's wishes.

Thanks
Kevin
Nadav Har'El May 25, 2011, 8:06 a.m. UTC | #5
On Wed, May 25, 2011, Tian, Kevin wrote about "RE: [PATCH 20/31] nVMX: Exiting from L2 to L1":
> IOW, I disagree that if L0/L1 set same bit in cr0_guest_host_mask, then
> the bit is identical in vmcs02.guest_cr0 and vmcs12.guest_cr0 because L1
> has no permission to set its bit effectively in this case.
>...
> this is the problem:
> 
> 	(vmcs12->guest_cr0 & vmcs12->cr0_guest_host_mask) !=
> 	(vmcs02->guest_cr0 & vmcs12->cr0_guest_host_mask)

Sorry for arguing previously, this is a very good, and correct, point, which
I missed. When both L0 and L1 are KVM, this didn't cause problems because the
only problematic bit has been the TS bit, and when KVM wants to override this
bit it always sets it to 1.

So I've rewritten this function, based on my new understanding following your
insights. I believe it now implements your formula *exactly*. Please take a
look at the comments and the code, and see if you now agree with them:

/*
 * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date
 * because L2 may have changed some cr0 bits directly (CR0_GUEST_HOST_MASK).
 * This function returns the new value we should put in vmcs12.guest_cr0.
 * It's not enough to just return the vmcs02 GUEST_CR0. Rather,
 *  1. Bits that neither L0 nor L1 trapped, were set directly by L2 and are now
 *     available in vmcs02 GUEST_CR0. (Note: It's enough to check that L0
 *     didn't trap the bit, because if L1 did, so would L0).
 *  2. Bits that L1 asked to trap (and therefore L0 also did) could not have
 *     been modified by L2, and L1 knows it. So just leave the old value of
 *     the bit from vmcs12.guest_cr0. Note that the bit from vmcs02 GUEST_CR0
 *     isn't relevant, because if L0 traps this bit it can set it to anything.
 *  3. Bits that L1 didn't trap, but L0 did. L1 believes the guest could have
 *     changed these bits, and therefore they need to be updated, but L0
 *     didn't necessarily allow them to be changed in GUEST_CR0 - and rather
 *     put them in vmcs02 CR0_READ_SHADOW. So take these bits from there.
 */
static inline unsigned long
vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
{
	return
	/*1*/	(vmcs_readl(GUEST_CR0) & vcpu->arch.cr0_guest_owned_bits) |
	/*2*/	(vmcs12->guest_cr0 & vmcs12->cr0_guest_host_mask) |
	/*3*/	(vmcs_readl(CR0_READ_SHADOW) & ~(vmcs12->cr0_guest_host_mask |
			vcpu->arch.cr0_guest_owned_bits));
}
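
To see case 3 do its job on the one bit that has mattered in practice,
X86_CR0_TS, here is a worked example with illustrative values (plain
variables again stand in for the VMCS reads): L1 leaves TS guest-owned, L0
traps it, L2 sets TS, and L0 emulates that by setting TS only in the vmcs02
read shadow:

#include <stdio.h>

#define X86_CR0_TS (1UL << 3)

int main(void)
{
	/* Illustrative snapshot after L0 emulated L2's "set TS": */
	unsigned long guest_cr0 = 0;		/* vmcs02 GUEST_CR0: TS kept clear */
	unsigned long read_shadow = X86_CR0_TS;	/* vmcs02 CR0_READ_SHADOW: L2's view */
	unsigned long vmcs12_guest_cr0 = 0;	/* stale since the nested entry */
	unsigned long l0_owned = 0;		/* L0 traps TS */
	unsigned long l1_mask = 0;		/* L1 does not trap TS */

	unsigned long new_cr0 =
	/*1*/	(guest_cr0 & l0_owned) |
	/*2*/	(vmcs12_guest_cr0 & l1_mask) |
	/*3*/	(read_shadow & ~(l1_mask | l0_owned));

	/* Case 3 wins: L1 now sees TS set, as L2 requested. */
	printf("TS in vmcs12.guest_cr0: %s\n",
	       (new_cr0 & X86_CR0_TS) ? "set" : "clear");
	return 0;
}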
Tian, Kevin May 25, 2011, 8:23 a.m. UTC | #6
> From: Nadav Har'El [mailto:nyh@math.technion.ac.il]
> Sent: Wednesday, May 25, 2011 4:06 PM
> 
> On Wed, May 25, 2011, Tian, Kevin wrote about "RE: [PATCH 20/31] nVMX: Exiting from L2 to L1":
> > IOW, I disagree that if L0 and L1 set the same bit in cr0_guest_host_mask,
> > then the bit is identical in vmcs02.guest_cr0 and vmcs12.guest_cr0, because
> > L1 has no permission to set its bit effectively in this case.
> >...
> > this is the problem:
> >
> > 	(vmcs12->guest_cr0 & vmcs12->cr0_guest_host_mask) !=
> > 	(vmcs02->guest_cr0 & vmcs12->cr0_guest_host_mask)
> 
> Sorry for arguing previously, this is a very good, and correct, point, which
> I missed. When both L0 and L1 are KVM, this didn't cause problems because the
> only problematic bit has been the TS bit, and when KVM wants to override this
> bit it always sets it to 1.
> 
> So I've rewritten this function, based on my new understanding following your
> insights. I believe it now implements your formula *exactly*. Please take a
> look at the comments and the code, and see if you now agree with them:
> 
> /*
>  * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date
>  * because L2 may have changed some cr0 bits directly (CR0_GUEST_HOST_MASK).
>  * This function returns the new value we should put in vmcs12.guest_cr0.
>  * It's not enough to just return the vmcs02 GUEST_CR0. Rather,
>  *  1. Bits that neither L0 nor L1 trapped, were set directly by L2 and are now
>  *     available in vmcs02 GUEST_CR0. (Note: It's enough to check that L0
>  *     didn't trap the bit, because if L1 did, so would L0).
>  *  2. Bits that L1 asked to trap (and therefore L0 also did) could not have
>  *     been modified by L2, and L1 knows it. So just leave the old value of
>  *     the bit from vmcs12.guest_cr0. Note that the bit from vmcs02 GUEST_CR0
>  *     isn't relevant, because if L0 traps this bit it can set it to anything.
>  *  3. Bits that L1 didn't trap, but L0 did. L1 believes the guest could have
>  *     changed these bits, and therefore they need to be updated, but L0
>  *     didn't necessarily allow them to be changed in GUEST_CR0 - and rather
>  *     put them in vmcs02 CR0_READ_SHADOW. So take these bits from there.
>  */
> static inline unsigned long
> vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
> {
> 	return
> 	/*1*/	(vmcs_readl(GUEST_CR0) & vcpu->arch.cr0_guest_owned_bits) |
> 	/*2*/	(vmcs12->guest_cr0 & vmcs12->cr0_guest_host_mask) |
> 	/*3*/	(vmcs_readl(CR0_READ_SHADOW) & ~(vmcs12->cr0_guest_host_mask |
> 			vcpu->arch.cr0_guest_owned_bits));
> }
> 

This looks good to me. :-)

Thanks
Kevin
Nadav Har'El May 25, 2011, 1:21 p.m. UTC | #7
On Wed, May 25, 2011, Tian, Kevin wrote about "RE: [PATCH 20/31] nVMX: Exiting from L2 to L1":
> How about the SYSENTER and PERF_GLOBAL_CTRL MSRs? At least a TODO comment
> here would make the whole load process complete. :-)
> 
> Also, isn't it saner to update vmcs01's guest segment info based on vmcs12's
> host segment info? Though you can assume the environment in L1 doesn't change
> from VMLAUNCH/VMRESUME to the VMEXIT handler, it is architecturally cleaner
> to load those segment fields according to L1's wishes.

Right... One of these days, I (or some other volunteer ;-)) will need to
print out the relevant sections of the SDM, sit down with a marker, and read
it line by line, marking the lines, fields, capabilities, and so on, which we
forgot to implement...

How about these additions:

	vmcs_write32(GUEST_SYSENTER_CS, vmcs12->host_ia32_sysenter_cs);
	vmcs_writel(GUEST_SYSENTER_ESP, vmcs12->host_ia32_sysenter_esp);
	vmcs_writel(GUEST_SYSENTER_EIP, vmcs12->host_ia32_sysenter_eip);
	vmcs_writel(GUEST_IDTR_BASE, vmcs12->host_idtr_base);
	vmcs_writel(GUEST_GDTR_BASE, vmcs12->host_gdtr_base);
	vmcs_writel(GUEST_TR_BASE, vmcs12->host_tr_base);
	vmcs_writel(GUEST_GS_BASE, vmcs12->host_gs_base);
	vmcs_writel(GUEST_FS_BASE, vmcs12->host_fs_base);
	vmcs_write16(GUEST_ES_SELECTOR, vmcs12->host_es_selector);
	vmcs_write16(GUEST_CS_SELECTOR, vmcs12->host_cs_selector);
	vmcs_write16(GUEST_SS_SELECTOR, vmcs12->host_ss_selector);
	vmcs_write16(GUEST_DS_SELECTOR, vmcs12->host_ds_selector);
	vmcs_write16(GUEST_FS_SELECTOR, vmcs12->host_fs_selector);
	vmcs_write16(GUEST_GS_SELECTOR, vmcs12->host_gs_selector);
	vmcs_write16(GUEST_TR_SELECTOR, vmcs12->host_tr_selector);

	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PAT)
		vmcs_write64(GUEST_IA32_PAT, vmcs12->host_ia32_pat);
	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL)
		vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL,
			vmcs12->host_ia32_perf_global_ctrl);
Tian, Kevin May 26, 2011, 12:41 a.m. UTC | #8
> From: Nadav Har'El
> Sent: Wednesday, May 25, 2011 9:21 PM
> 
> On Wed, May 25, 2011, Tian, Kevin wrote about "RE: [PATCH 20/31] nVMX: Exiting from L2 to L1":
> > How about the SYSENTER and PERF_GLOBAL_CTRL MSRs? At least a TODO comment
> > here would make the whole load process complete. :-)
> >
> > Also, isn't it saner to update vmcs01's guest segment info based on vmcs12's
> > host segment info? Though you can assume the environment in L1 doesn't change
> > from VMLAUNCH/VMRESUME to the VMEXIT handler, it is architecturally cleaner
> > to load those segment fields according to L1's wishes.
> 
> Right... One of these days, I (or some other volunteer ;-)) will need to
> print out the relevant sections of the SDM, sit down with a marker, and read
> it line by line, marking the lines, fields, capabilities, and so on, which we
> forgot to implement...

You've done a great job.

> 
> How about these additions:
> 
> 	vmcs_write32(GUEST_SYSENTER_CS, vmcs12->host_ia32_sysenter_cs);
> 	vmcs_writel(GUEST_SYSENTER_ESP, vmcs12->host_ia32_sysenter_esp);
> 	vmcs_writel(GUEST_SYSENTER_EIP, vmcs12->host_ia32_sysenter_eip);
> 	vmcs_writel(GUEST_IDTR_BASE, vmcs12->host_idtr_base);
> 	vmcs_writel(GUEST_GDTR_BASE, vmcs12->host_gdtr_base);
> 	vmcs_writel(GUEST_TR_BASE, vmcs12->host_tr_base);
> 	vmcs_writel(GUEST_GS_BASE, vmcs12->host_gs_base);
> 	vmcs_writel(GUEST_FS_BASE, vmcs12->host_fs_base);
> 	vmcs_write16(GUEST_ES_SELECTOR, vmcs12->host_es_selector);
> 	vmcs_write16(GUEST_CS_SELECTOR, vmcs12->host_cs_selector);
> 	vmcs_write16(GUEST_SS_SELECTOR, vmcs12->host_ss_selector);
> 	vmcs_write16(GUEST_DS_SELECTOR, vmcs12->host_ds_selector);
> 	vmcs_write16(GUEST_FS_SELECTOR, vmcs12->host_fs_selector);
> 	vmcs_write16(GUEST_GS_SELECTOR, vmcs12->host_gs_selector);
> 	vmcs_write16(GUEST_TR_SELECTOR, vmcs12->host_tr_selector);
> 
> 	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PAT)
> 		vmcs_write64(GUEST_IA32_PAT, vmcs12->host_ia32_pat);
> 	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL)
> 		vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL,
> 			vmcs12->host_ia32_perf_global_ctrl);
> 

looks good.

Thanks
Kevin

Patch

--- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:49.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:49.000000000 +0300
@@ -6203,6 +6203,263 @@  static int nested_vmx_run(struct kvm_vcp
 	return 1;
 }
 
+/*
+ * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date
+ * because L2 may have changed some cr0 bits directly (see CR0_GUEST_HOST_MASK)
+ * without L0 trapping the change and updating vmcs12.
+ * This function returns the value we should put in vmcs12.guest_cr0. It's not
+ * enough to just return the current (vmcs02) GUEST_CR0 - that may not be the
+ * guest cr0 that L1 thought it was giving its L2 guest; It is possible that
+ * L1 wished to allow its guest to set some cr0 bit directly, but we (L0) asked
+ * to trap this change and instead set just the read shadow bit. If this is the
+ * case, we need to copy these read-shadow bits back to vmcs12.guest_cr0, where
+ * L1 believes they already are.
+ */
+static inline unsigned long
+vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
+{
+	/*
+	 * As explained above, we take a bit from GUEST_CR0 if we allowed the
+	 * guest to modify it untrapped (vcpu->arch.cr0_guest_owned_bits), or
+	 * if we did trap it - if we did so because L1 asked to trap this bit
+	 * (vmcs12->cr0_guest_host_mask). Otherwise (bits we trapped but L1
+	 * didn't expect us to trap) we read from CR0_READ_SHADOW.
+	 */
+	unsigned long guest_cr0_bits =
+		vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask;
+	return (vmcs_readl(GUEST_CR0) & guest_cr0_bits) |
+	       (vmcs_readl(CR0_READ_SHADOW) & ~guest_cr0_bits);
+}
+
+static inline unsigned long
+vmcs12_guest_cr4(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
+{
+	unsigned long guest_cr4_bits =
+		vcpu->arch.cr4_guest_owned_bits | vmcs12->cr4_guest_host_mask;
+	return (vmcs_readl(GUEST_CR4) & guest_cr4_bits) |
+	       (vmcs_readl(CR4_READ_SHADOW) & ~guest_cr4_bits);
+}
+
+/*
+ * prepare_vmcs12 is part of what we need to do when the nested L2 guest exits
+ * and we want to prepare to run its L1 parent. L1 keeps a vmcs for L2 (vmcs12),
+ * and this function updates it to reflect the changes to the guest state while
+ * L2 was running (and perhaps made some exits which were handled directly by L0
+ * without going back to L1), and to reflect the exit reason.
+ * Note that we do not have to copy here all VMCS fields, just those that
+ * could have changed by the L2 guest or the exit - i.e., the guest-state and
+ * exit-information fields only. Other fields are modified by L1 with VMWRITE,
+ * which already writes to vmcs12 directly.
+ */
+void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
+{
+	/* update guest state fields: */
+	vmcs12->guest_cr0 = vmcs12_guest_cr0(vcpu, vmcs12);
+	vmcs12->guest_cr4 = vmcs12_guest_cr4(vcpu, vmcs12);
+
+	kvm_get_dr(vcpu, 7, (unsigned long *)&vmcs12->guest_dr7);
+	vmcs12->guest_rsp = kvm_register_read(vcpu, VCPU_REGS_RSP);
+	vmcs12->guest_rip = kvm_register_read(vcpu, VCPU_REGS_RIP);
+	vmcs12->guest_rflags = vmcs_readl(GUEST_RFLAGS);
+
+	vmcs12->guest_es_selector = vmcs_read16(GUEST_ES_SELECTOR);
+	vmcs12->guest_cs_selector = vmcs_read16(GUEST_CS_SELECTOR);
+	vmcs12->guest_ss_selector = vmcs_read16(GUEST_SS_SELECTOR);
+	vmcs12->guest_ds_selector = vmcs_read16(GUEST_DS_SELECTOR);
+	vmcs12->guest_fs_selector = vmcs_read16(GUEST_FS_SELECTOR);
+	vmcs12->guest_gs_selector = vmcs_read16(GUEST_GS_SELECTOR);
+	vmcs12->guest_ldtr_selector = vmcs_read16(GUEST_LDTR_SELECTOR);
+	vmcs12->guest_tr_selector = vmcs_read16(GUEST_TR_SELECTOR);
+	vmcs12->guest_es_limit = vmcs_read32(GUEST_ES_LIMIT);
+	vmcs12->guest_cs_limit = vmcs_read32(GUEST_CS_LIMIT);
+	vmcs12->guest_ss_limit = vmcs_read32(GUEST_SS_LIMIT);
+	vmcs12->guest_ds_limit = vmcs_read32(GUEST_DS_LIMIT);
+	vmcs12->guest_fs_limit = vmcs_read32(GUEST_FS_LIMIT);
+	vmcs12->guest_gs_limit = vmcs_read32(GUEST_GS_LIMIT);
+	vmcs12->guest_ldtr_limit = vmcs_read32(GUEST_LDTR_LIMIT);
+	vmcs12->guest_tr_limit = vmcs_read32(GUEST_TR_LIMIT);
+	vmcs12->guest_gdtr_limit = vmcs_read32(GUEST_GDTR_LIMIT);
+	vmcs12->guest_idtr_limit = vmcs_read32(GUEST_IDTR_LIMIT);
+	vmcs12->guest_es_ar_bytes = vmcs_read32(GUEST_ES_AR_BYTES);
+	vmcs12->guest_cs_ar_bytes = vmcs_read32(GUEST_CS_AR_BYTES);
+	vmcs12->guest_ss_ar_bytes = vmcs_read32(GUEST_SS_AR_BYTES);
+	vmcs12->guest_ds_ar_bytes = vmcs_read32(GUEST_DS_AR_BYTES);
+	vmcs12->guest_fs_ar_bytes = vmcs_read32(GUEST_FS_AR_BYTES);
+	vmcs12->guest_gs_ar_bytes = vmcs_read32(GUEST_GS_AR_BYTES);
+	vmcs12->guest_ldtr_ar_bytes = vmcs_read32(GUEST_LDTR_AR_BYTES);
+	vmcs12->guest_tr_ar_bytes = vmcs_read32(GUEST_TR_AR_BYTES);
+	vmcs12->guest_es_base = vmcs_readl(GUEST_ES_BASE);
+	vmcs12->guest_cs_base = vmcs_readl(GUEST_CS_BASE);
+	vmcs12->guest_ss_base = vmcs_readl(GUEST_SS_BASE);
+	vmcs12->guest_ds_base = vmcs_readl(GUEST_DS_BASE);
+	vmcs12->guest_fs_base = vmcs_readl(GUEST_FS_BASE);
+	vmcs12->guest_gs_base = vmcs_readl(GUEST_GS_BASE);
+	vmcs12->guest_ldtr_base = vmcs_readl(GUEST_LDTR_BASE);
+	vmcs12->guest_tr_base = vmcs_readl(GUEST_TR_BASE);
+	vmcs12->guest_gdtr_base = vmcs_readl(GUEST_GDTR_BASE);
+	vmcs12->guest_idtr_base = vmcs_readl(GUEST_IDTR_BASE);
+
+	vmcs12->guest_activity_state = vmcs_read32(GUEST_ACTIVITY_STATE);
+	vmcs12->guest_interruptibility_info =
+		vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
+	vmcs12->guest_pending_dbg_exceptions =
+		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
+
+	/* TODO: These cannot have changed unless we have MSR bitmaps and
+	 * the relevant bit asks not to trap the change */
+	vmcs12->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
+	if (vmcs12->vm_exit_controls & VM_EXIT_SAVE_IA32_PAT)
+		vmcs12->guest_ia32_pat = vmcs_read64(GUEST_IA32_PAT);
+	vmcs12->guest_sysenter_cs = vmcs_read32(GUEST_SYSENTER_CS);
+	vmcs12->guest_sysenter_esp = vmcs_readl(GUEST_SYSENTER_ESP);
+	vmcs12->guest_sysenter_eip = vmcs_readl(GUEST_SYSENTER_EIP);
+
+	/* update exit information fields: */
+
+	vmcs12->vm_exit_reason  = vmcs_read32(VM_EXIT_REASON);
+	vmcs12->exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+
+	vmcs12->vm_exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
+	vmcs12->vm_exit_intr_error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
+	vmcs12->idt_vectoring_info_field =
+		vmcs_read32(IDT_VECTORING_INFO_FIELD);
+	vmcs12->idt_vectoring_error_code =
+		vmcs_read32(IDT_VECTORING_ERROR_CODE);
+	vmcs12->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
+	vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+
+	/* clear vm-entry fields which are to be cleared on exit */
+	if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY))
+		vmcs12->vm_entry_intr_info_field &= ~INTR_INFO_VALID_MASK;
+}
+
+/*
+ * A part of what we need to do when the nested L2 guest exits and we want to
+ * run its L1 parent, is to reset L1's guest state to the host state specified
+ * in vmcs12.
+ * This function is to be called not only on normal nested exit, but also on
+ * a nested entry failure, as explained in Intel's spec, 3B.23.7 ("VM-Entry
+ * Failures During or After Loading Guest State").
+ * This function should be called when the active VMCS is L1's (vmcs01).
+ */
+void load_vmcs12_host_state(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
+{
+	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_EFER)
+		vcpu->arch.efer = vmcs12->host_ia32_efer;
+	if (vmcs12->vm_exit_controls & VM_EXIT_HOST_ADDR_SPACE_SIZE)
+		vcpu->arch.efer |= (EFER_LMA | EFER_LME);
+	else
+		vcpu->arch.efer &= ~(EFER_LMA | EFER_LME);
+	vmx_set_efer(vcpu, vcpu->arch.efer);
+
+	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PAT)
+		vmcs_write64(GUEST_IA32_PAT, vmcs12->host_ia32_pat);
+
+	kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12->host_rsp);
+	kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs12->host_rip);
+	/*
+	 * Note that calling vmx_set_cr0 is important, even if cr0 hasn't
+	 * actually changed, because it depends on the current state of
+	 * fpu_active (which may have changed).
+	 * Note that vmx_set_cr0 refers to efer set above.
+	 */
+	kvm_set_cr0(vcpu, vmcs12->host_cr0);
+	/*
+	 * If we did fpu_activate()/fpu_deactivate() during L2's run, we need
+	 * to apply the same changes to L1's vmcs. We just set cr0 correctly,
+	 * but we also need to update cr0_guest_host_mask and exception_bitmap.
+	 */
+	update_exception_bitmap(vcpu);
+	vcpu->arch.cr0_guest_owned_bits = (vcpu->fpu_active ? X86_CR0_TS : 0);
+	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
+
+	/*
+	 * Note that CR4_GUEST_HOST_MASK is already set in the original vmcs01
+	 * (KVM doesn't change it) - no reason to call set_cr4_guest_host_mask();
+	 */
+	vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
+	kvm_set_cr4(vcpu, vmcs12->host_cr4);
+
+	/* shadow page tables on either EPT or shadow page tables */
+	kvm_set_cr3(vcpu, vmcs12->host_cr3);
+	kvm_mmu_reset_context(vcpu);
+
+	if (enable_vpid) {
+		/*
+		 * Trivially support vpid by letting L2s share their parent
+		 * L1's vpid. TODO: move to a more elaborate solution, giving
+		 * each L2 its own vpid and exposing the vpid feature to L1.
+		 */
+		vmx_flush_tlb(vcpu);
+	}
+}
+
+/*
+ * Emulate an exit from nested guest (L2) to L1, i.e., prepare to run L1
+ * and modify vmcs12 to make it see what it would expect to see there if
+ * L2 was its real guest. Must only be called when in L2 (is_guest_mode())
+ */
+static void nested_vmx_vmexit(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	int cpu;
+	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+	struct saved_vmcs *saved_vmcs02;
+
+	leave_guest_mode(vcpu);
+	prepare_vmcs12(vcpu, vmcs12);
+
+	/*
+	 * Switch from L2's VMCS, to L1's VMCS. Remember on which CPU the L2
+	 * VMCS was last loaded, and whether it was launched (we need to know
+	 * this next time we use L2), and recall these values as they were for
+	 * L1's VMCS.
+	 */
+	cpu = get_cpu();
+	saved_vmcs02 = nested_get_current_vmcs02(vmx);
+	saved_vmcs02->cpu = vcpu->cpu;
+	saved_vmcs02->launched = vmx->launched;
+	vmx->vmcs = vmx->nested.saved_vmcs01.vmcs;
+	vcpu->cpu = vmx->nested.saved_vmcs01.cpu;
+	vmx->launched = vmx->nested.saved_vmcs01.launched;
+
+	nested_maintain_per_cpu_lists(vmx,
+		&vmx->nested.saved_vmcs01, saved_vmcs02);
+	/* if no vmcs02 cache requested, remove the one we used */
+	if (VMCS02_POOL_SIZE == 0)
+		nested_free_vmcs02(vmx, vmx->nested.current_vmptr);
+
+	vmx_vcpu_put(vcpu);
+	vmx_vcpu_load(vcpu, cpu);
+	vcpu->cpu = cpu;
+	put_cpu();
+
+	load_vmcs12_host_state(vcpu, vmcs12);
+
+	/* Update TSC_OFFSET if vmx_adjust_tsc_offset() was used while L2 ran */
+	vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
+
+	/* This is needed for same reason as it was needed in prepare_vmcs02 */
+	vmx->host_rsp = 0;
+
+	/* Unpin physical memory we referred to in vmcs02 */
+	if (vmx->nested.apic_access_page) {
+		nested_release_page(vmx->nested.apic_access_page);
+		vmx->nested.apic_access_page = 0;
+	}
+
+	/*
+	 * Exiting from L2 to L1, we're now back to L1 which thinks it just
+	 * finished a VMLAUNCH or VMRESUME instruction, so we need to set the
+	 * success or failure flag accordingly.
+	 */
+	if (unlikely(vmx->fail)) {
+		vmx->fail = 0;
+		nested_vmx_failValid(vcpu, vmcs_read32(VM_INSTRUCTION_ERROR));
+	} else
+		nested_vmx_succeed(vcpu);
+}
+
 static int vmx_check_intercept(struct kvm_vcpu *vcpu,
 			       struct x86_instruction_info *info,
 			       enum x86_intercept_stage stage)