From patchwork Thu Jan 27 08:39:35 2011
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Nadav Har'El
X-Patchwork-Id: 510641
Date: Thu, 27 Jan 2011 10:39:35 +0200
Message-Id: <201101270839.p0R8dZv7002659@rice.haifa.ibm.com>
From: "Nadav Har'El"
To: kvm@vger.kernel.org
Cc: gleb@redhat.com, avi@redhat.com
Subject: [PATCH 19/29] nVMX: Exiting from L2 to L1
References: <1296116987-nyh@il.ibm.com>

--- .before/arch/x86/kvm/vmx.c	2011-01-26 18:06:05.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-01-26 18:06:05.000000000 +0200
@@ -5217,6 +5217,8 @@ static void __vmx_complete_interrupts(st
 
 static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 {
+	if (is_guest_mode(&vmx->vcpu))
+		return;
 	__vmx_complete_interrupts(vmx, vmx->idt_vectoring_info,
 				  VM_EXIT_INSTRUCTION_LEN,
 				  IDT_VECTORING_ERROR_CODE);
@@ -6137,6 +6139,252 @@ static int nested_vmx_run(struct kvm_vcp
 	return 1;
 }
 
+/*
+ * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date
+ * because L2 may have changed some cr0 bits directly (see CR0_GUEST_HOST_MASK)
+ * without L0 trapping the change and updating vmcs12.
+ * This function returns the value we should put in vmcs12.guest_cr0. It's not
+ * enough to just return the current (vmcs02) GUEST_CR0. This may not be the
+ * guest cr0 that L1 thought it was giving its L2 guest - it is possible that
+ * L1 wished to allow its guest to set a cr0 bit directly, but we (L0) asked
+ * to trap this change and instead set just the read shadow. If this is the
+ * case, we need to copy these read-shadow bits back to vmcs12.guest_cr0, where
+ * L1 believes they already are.
+ */
+static inline unsigned long
+vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs_fields *vmcs12)
+{
+	unsigned long guest_cr0_bits =
+		vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask;
+	return (vmcs_readl(GUEST_CR0) & guest_cr0_bits) |
+	       (vmcs_readl(CR0_READ_SHADOW) & ~guest_cr0_bits);
+}
+
+static inline unsigned long
+vmcs12_guest_cr4(struct kvm_vcpu *vcpu, struct vmcs_fields *vmcs12)
+{
+	unsigned long guest_cr4_bits =
+		vcpu->arch.cr4_guest_owned_bits | vmcs12->cr4_guest_host_mask;
+	return (vmcs_readl(GUEST_CR4) & guest_cr4_bits) |
+	       (vmcs_readl(CR4_READ_SHADOW) & ~guest_cr4_bits);
+}
+
+/*
+ * prepare_vmcs12 is called when the nested L2 guest exits and we want to
+ * prepare to run its L1 parent. L1 keeps a vmcs for L2 (vmcs12), and this
+ * function updates it to reflect the changes to the guest state while L2 was
+ * running (and perhaps made some exits which were handled directly by L0
+ * without going back to L1), and to reflect the exit reason.
+ * Note that we do not have to copy here all VMCS fields, just those that
+ * could have been changed by the L2 guest or the exit - i.e., the guest-state
+ * and exit-information fields only. Other fields are modified by L1 with
+ * VMWRITE, which already writes to vmcs12 directly.
+ */
+void prepare_vmcs12(struct kvm_vcpu *vcpu)
+{
+	struct vmcs_fields *vmcs12 = get_vmcs12_fields(vcpu);
+
+	/* update guest state fields: */
+	vmcs12->guest_cr0 = vmcs12_guest_cr0(vcpu, vmcs12);
+	vmcs12->guest_cr4 = vmcs12_guest_cr4(vcpu, vmcs12);
+
+	vmcs12->guest_dr7 = vmcs_readl(GUEST_DR7);
+	vmcs12->guest_rsp = vmcs_readl(GUEST_RSP);
+	vmcs12->guest_rip = vmcs_readl(GUEST_RIP);
+	vmcs12->guest_rflags = vmcs_readl(GUEST_RFLAGS);
+
+	vmcs12->guest_es_selector = vmcs_read16(GUEST_ES_SELECTOR);
+	vmcs12->guest_cs_selector = vmcs_read16(GUEST_CS_SELECTOR);
+	vmcs12->guest_ss_selector = vmcs_read16(GUEST_SS_SELECTOR);
+	vmcs12->guest_ds_selector = vmcs_read16(GUEST_DS_SELECTOR);
+	vmcs12->guest_fs_selector = vmcs_read16(GUEST_FS_SELECTOR);
+	vmcs12->guest_gs_selector = vmcs_read16(GUEST_GS_SELECTOR);
+	vmcs12->guest_ldtr_selector = vmcs_read16(GUEST_LDTR_SELECTOR);
+	vmcs12->guest_tr_selector = vmcs_read16(GUEST_TR_SELECTOR);
+	vmcs12->guest_es_limit = vmcs_read32(GUEST_ES_LIMIT);
+	vmcs12->guest_cs_limit = vmcs_read32(GUEST_CS_LIMIT);
+	vmcs12->guest_ss_limit = vmcs_read32(GUEST_SS_LIMIT);
+	vmcs12->guest_ds_limit = vmcs_read32(GUEST_DS_LIMIT);
+	vmcs12->guest_fs_limit = vmcs_read32(GUEST_FS_LIMIT);
+	vmcs12->guest_gs_limit = vmcs_read32(GUEST_GS_LIMIT);
+	vmcs12->guest_ldtr_limit = vmcs_read32(GUEST_LDTR_LIMIT);
+	vmcs12->guest_tr_limit = vmcs_read32(GUEST_TR_LIMIT);
+	vmcs12->guest_gdtr_limit = vmcs_read32(GUEST_GDTR_LIMIT);
+	vmcs12->guest_idtr_limit = vmcs_read32(GUEST_IDTR_LIMIT);
+	vmcs12->guest_es_ar_bytes = vmcs_read32(GUEST_ES_AR_BYTES);
+	vmcs12->guest_cs_ar_bytes = vmcs_read32(GUEST_CS_AR_BYTES);
+	vmcs12->guest_ss_ar_bytes = vmcs_read32(GUEST_SS_AR_BYTES);
+	vmcs12->guest_ds_ar_bytes = vmcs_read32(GUEST_DS_AR_BYTES);
+	vmcs12->guest_fs_ar_bytes = vmcs_read32(GUEST_FS_AR_BYTES);
+	vmcs12->guest_gs_ar_bytes = vmcs_read32(GUEST_GS_AR_BYTES);
+	vmcs12->guest_ldtr_ar_bytes = vmcs_read32(GUEST_LDTR_AR_BYTES);
+	vmcs12->guest_tr_ar_bytes = vmcs_read32(GUEST_TR_AR_BYTES);
+	vmcs12->guest_es_base = vmcs_readl(GUEST_ES_BASE);
+	vmcs12->guest_cs_base = vmcs_readl(GUEST_CS_BASE);
+	vmcs12->guest_ss_base = vmcs_readl(GUEST_SS_BASE);
+	vmcs12->guest_ds_base = vmcs_readl(GUEST_DS_BASE);
+	vmcs12->guest_fs_base = vmcs_readl(GUEST_FS_BASE);
+	vmcs12->guest_gs_base = vmcs_readl(GUEST_GS_BASE);
+	vmcs12->guest_ldtr_base = vmcs_readl(GUEST_LDTR_BASE);
+	vmcs12->guest_tr_base = vmcs_readl(GUEST_TR_BASE);
+	vmcs12->guest_gdtr_base = vmcs_readl(GUEST_GDTR_BASE);
+	vmcs12->guest_idtr_base = vmcs_readl(GUEST_IDTR_BASE);
+
+	vmcs12->guest_activity_state = vmcs_read32(GUEST_ACTIVITY_STATE);
+	vmcs12->guest_interruptibility_info =
+		vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
+	vmcs12->guest_pending_dbg_exceptions =
+		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
+	vmcs12->vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER);
+
+	/* TODO: These cannot have changed unless we have MSR bitmaps and
+	 * the relevant bit asks not to trap the change */
+	vmcs12->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
+	if (vmcs_config.vmexit_ctrl & VM_EXIT_SAVE_IA32_PAT)
+		vmcs12->guest_ia32_pat = vmcs_read64(GUEST_IA32_PAT);
+	vmcs12->guest_sysenter_cs = vmcs_read32(GUEST_SYSENTER_CS);
+	vmcs12->guest_sysenter_esp = vmcs_readl(GUEST_SYSENTER_ESP);
+	vmcs12->guest_sysenter_eip = vmcs_readl(GUEST_SYSENTER_EIP);
+
+	/* update exit information fields: */
+
+	vmcs12->vm_exit_reason = vmcs_read32(VM_EXIT_REASON);
+	vmcs12->exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+
+	vmcs12->vm_exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
+	vmcs12->vm_exit_intr_error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
+	vmcs12->idt_vectoring_info_field =
+		vmcs_read32(IDT_VECTORING_INFO_FIELD);
+	vmcs12->idt_vectoring_error_code =
+		vmcs_read32(IDT_VECTORING_ERROR_CODE);
+	vmcs12->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
+	vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+
+	/* clear vm-entry fields which are to be cleared on exit */
+	if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY))
+		vmcs12->vm_entry_intr_info_field &= ~INTR_INFO_VALID_MASK;
+}
+
+/*
+ * Emulate an exit from the nested guest (L2) to L1, i.e., prepare to run L1
+ * and modify vmcs12 to make it see what it would expect to see there if
+ * L2 was its real guest. Must only be called when in L2 (is_guest_mode()).
+ */
+static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, bool is_interrupt)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct vmcs_fields *vmcs01 = vmx->nested.vmcs01_fields;
+	int cpu;
+	struct saved_vmcs *saved_vmcs02;
+
+	leave_guest_mode(vcpu);
+
+	sync_cached_regs_to_vmcs(vcpu);
+
+	prepare_vmcs12(vcpu);
+
+	if (is_interrupt)
+		get_vmcs12_fields(vcpu)->vm_exit_reason =
+			EXIT_REASON_EXTERNAL_INTERRUPT;
+
+	/*
+	 * Switch from L2's VMCS to L1's VMCS. Remember on which CPU the L2
+	 * VMCS was last loaded, and whether it was launched (we need to know
+	 * this next time we use L2), and recall these values as they were for
+	 * L1's VMCS.
+	 */
+	saved_vmcs02 = nested_get_current_vmcs(vmx);
+	if (saved_vmcs02) {
+		saved_vmcs02->cpu = vcpu->cpu;
+		saved_vmcs02->launched = vmx->launched;
+	}
+	vmx->vmcs = vmx->nested.saved_vmcs01.vmcs;
+	vcpu->cpu = vmx->nested.saved_vmcs01.cpu;
+	vmx->launched = vmx->nested.saved_vmcs01.launched;
+
+	vmx_vcpu_put(vcpu);
+	cpu = get_cpu();
+	vmx_vcpu_load(vcpu, cpu);
+	vcpu->cpu = cpu;
+	put_cpu();
+
+	if (get_vmcs12_fields(vcpu)->vm_exit_controls &
+	    VM_EXIT_HOST_ADDR_SPACE_SIZE)
+		vcpu->arch.efer |= (EFER_LMA | EFER_LME);
+	else
+		vcpu->arch.efer &= ~(EFER_LMA | EFER_LME);
+	vmx_set_efer(vcpu, vcpu->arch.efer);
+
+	/*
+	 * L2 perhaps switched to real mode and set vmx->rmode, but we're back
+	 * in L1 and as it is running VMX, it can't be in real mode.
+	 */
+	vmx->rmode.vm86_active = 0;
+
+	/*
+	 * If L1 set the HOST_* fields in the VMCS, when exiting from L2 to L1
+	 * we need to return to those, not to L1's old values.
+ */
+	vmcs_writel(GUEST_RIP, get_vmcs12_fields(vcpu)->host_rip);
+	vmcs_writel(GUEST_RSP, get_vmcs12_fields(vcpu)->host_rsp);
+	vmcs01->cr0_read_shadow = get_vmcs12_fields(vcpu)->host_cr0;
+
+	/*
+	 * We're running a regular L1 guest again, so we do the regular KVM
+	 * thing: run vmx_set_cr0 with the cr0 bits the guest thinks it has.
+	 * vmx_set_cr0 might use slightly different bits on the new guest_cr0
+	 * it sets, e.g., add TS when !fpu_active.
+	 * Note that vmx_set_cr0 refers to rmode and efer set above.
+	 */
+	vmx_set_cr0(vcpu, guest_readable_cr0(vmcs01));
+	/*
+	 * If we did fpu_activate()/fpu_deactivate() during L2's run, we need
+	 * to apply the same changes to L1's vmcs. We just set cr0 correctly,
+	 * but now we need to also update cr0_guest_host_mask and
+	 * exception_bitmap.
+	 */
+	vmcs_write32(EXCEPTION_BITMAP,
+		(vmcs01->exception_bitmap & ~(1u << NM_VECTOR)) |
+		(vcpu->fpu_active ? 0 : (1u << NM_VECTOR)));
+	vcpu->arch.cr0_guest_owned_bits =
+		(vcpu->fpu_active ? X86_CR0_TS : 0);
+	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
+
+	vmx_set_cr4(vcpu, guest_readable_cr4(vmcs01));
+	vcpu->arch.cr4_guest_owned_bits = ~vmcs01->cr4_guest_host_mask;
+
+	if (enable_ept) {
+		/* shadow page tables on EPT: */
+		set_cr3_and_pdptrs(vcpu, get_vmcs12_fields(vcpu)->host_cr3);
+	} else {
+		/* shadow page tables on shadow page tables: */
+		kvm_set_cr3(vcpu, get_vmcs12_fields(vcpu)->host_cr3);
+		kvm_mmu_reset_context(vcpu);
+		kvm_mmu_load(vcpu);
+	}
+	if (enable_vpid) {
+		/*
+		 * Trivially support vpid by letting L2s share their parent
+		 * L1's vpid. TODO: move to a more elaborate solution, giving
+		 * each L2 its own vpid and exposing the vpid feature to L1.
+ */
+		vmx_flush_tlb(vcpu);
+	}
+
+	kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs01->guest_rsp);
+	kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs01->guest_rip);
+
+	if (unlikely(vmx->fail)) {
+		/*
+		 * When L1 launches L2 and then we (L0) fail to launch L2,
+		 * we do a nested_vmx_vmexit back to L1, but now should let
+		 * it know that the VMLAUNCH failed - with the same error
+		 * that we got when launching L2.
+		 */
+		vmx->fail = 0;
+		nested_vmx_failValid(vcpu, vmcs_read32(VM_INSTRUCTION_ERROR));
+	} else
+		nested_vmx_succeed(vcpu);
+}
+
 static struct kvm_x86_ops vmx_x86_ops = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,