From patchwork Tue Jun 22 16:53:22 2010
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Nadav Har'El
X-Patchwork-Id: 107442
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by demeter.kernel.org (8.14.4/8.14.3) with ESMTP id o5MGrk5e017563
	for ; Tue, 22 Jun 2010 16:53:46 GMT
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1756480Ab0FVQxc (ORCPT ); Tue, 22 Jun 2010 12:53:32 -0400
Received: from mailgw11.technion.ac.il ([132.68.225.11]:25136
	"EHLO mailgw11.technion.ac.il" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1756487Ab0FVQx2 (ORCPT );
	Tue, 22 Jun 2010 12:53:28 -0400
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AkIOAMyHIEyERHMG/2dsb2JhbACaLgIDhGJxsSCQVwKFGQSDUiaHegE
X-IronPort-AV: E=Sophos;i="4.53,461,1272834000"; d="scan'208";a="8487395"
Received: from fermat.math.technion.ac.il ([132.68.115.6])
	by mailgw11.technion.ac.il with ESMTP; 22 Jun 2010 19:53:25 +0300
Received: from fermat.math.technion.ac.il (localhost [127.0.0.1])
	by fermat.math.technion.ac.il (8.12.10/8.12.10) with ESMTP id o5MGrNcQ029738;
	Tue, 22 Jun 2010 19:53:23 +0300 (IDT)
Received: (from nyh@localhost)
	by fermat.math.technion.ac.il (8.12.10/8.12.10/Submit) id o5MGrMBv029736;
	Tue, 22 Jun 2010 19:53:22 +0300 (IDT)
X-Authentication-Warning: fermat.math.technion.ac.il: nyh set sender to nyh@math.technion.ac.il using -f
Date: Tue, 22 Jun 2010 19:53:22 +0300
From: "Nadav Har'El"
To: Avi Kivity
Cc: kvm@vger.kernel.org
Subject: Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1
Message-ID: <20100622165322.GA29629@fermat.math.technion.ac.il>
References: <1276431753-nyh@il.ibm.com> <201006131225.o5DCP79H012922@rice.haifa.ibm.com> <4C15E95D.9000300@redhat.com> <20100622145441.GA23496@fermat.math.technion.ac.il>
Mime-Version: 1.0
Content-Disposition: inline
In-Reply-To: <20100622145441.GA23496@fermat.math.technion.ac.il>
User-Agent: Mutt/1.4.2.2i
Hebrew-Date: 11 Tammuz 5770
Sender: kvm-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: kvm@vger.kernel.org
X-Greylist: IP, sender and recipient auto-whitelisted, not delayed by
	milter-greylist-4.2.3 (demeter.kernel.org [140.211.167.41]);
	Tue, 22 Jun 2010 16:53:47 +0000 (UTC)

--- .before/Documentation/kvm/nested-vmx.txt	2010-06-22 19:50:32.000000000 +0300
+++ .after/Documentation/kvm/nested-vmx.txt	2010-06-22 19:50:32.000000000 +0300
@@ -0,0 +1,233 @@
+Nested VMX
+==========
+
+Overview
+--------
+
+On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
+to easily and efficiently run guest operating systems. Normally, these guests
+*cannot* themselves be hypervisors running their own guests, because in VMX,
+guests cannot use VMX instructions.
+
+The "Nested VMX" feature adds this missing capability - of running guest
+hypervisors (which use VMX) with their own nested guests. It does so by
+allowing a guest to use VMX instructions, and correctly and efficiently
+emulating them using the single level of VMX available in the hardware.
+
+We describe in much greater detail the theory behind the nested VMX feature,
+its implementation and its performance characteristics, in IBM Research report
+H-0282, "The Turtles Project: Design and Implementation of Nested
+Virtualization", available at:
+
+	http://bit.ly/a0o9te
+
+
+Terminology
+-----------
+
+Single-level virtualization has two levels - the host (KVM) and the guests.
+In nested virtualization, we have three levels: The host (KVM), which we call
+L0, the guest hypervisor, which we call L1, and the nested guest, which we
+call L2.
+
+
+Known limitations
+-----------------
+
+The current code supports running Linux under a nested KVM using shadow
+page tables (with bypass_guest_pf disabled). It supports multiple nested
+hypervisors, which can run multiple guests. Only 64-bit nested hypervisors
+are supported. SMP is supported. Additional patches for running Windows under
+nested KVM, and Linux under nested VMware server, and support for nested EPT,
+are currently running in the lab, and will be sent as follow-on patchsets.
+
+
+Running nested VMX
+------------------
+
+The nested VMX feature is disabled by default. It can be enabled by giving
+the "nested=1" option to the kvm-intel module.
+
+
+ABIs
+----
+
+Nested VMX aims to present a standard and (eventually) fully-functional VMX
+implementation for a guest hypervisor to use. As such, the official
+specification of the ABI that it provides is Intel's VMX specification,
+namely volume 3B of their "Intel 64 and IA-32 Architectures Software
+Developer's Manual". Not all of VMX's features are currently fully supported,
+but the goal is to eventually support them all, starting with the VMX features
+which are used in practice by popular hypervisors (KVM and others).
+
+As a VMX implementation, nested VMX presents a VMCS structure to L1.
+As mandated by the spec, other than the two fields revision_id and abort,
+this structure is *opaque* to its user, who is not supposed to know or care
+about its internal structure. Rather, the structure is accessed through the
+VMREAD and VMWRITE instructions.
+Still, for debugging purposes, KVM developers might be interested to know the
+internals of this structure; this is struct vmcs12 from arch/x86/kvm/vmx.c.
+For convenience, we repeat its content here. If the internals of this structure
+change, this can break live migration across KVM versions. VMCS12_REVISION
+(from vmx.c) should be changed if struct vmcs12 or its inner struct shadow_vmcs
+is ever changed.
+
+struct __packed vmcs12 {
+	/* According to the Intel spec, a VMCS region must start with the
+	 * following two fields. Then follow implementation-specific data.
+	 */
+	u32 revision_id;
+	u32 abort;
+
+	struct shadow_vmcs shadow_vmcs;
+
+	bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
+
+	int cpu;
+	int launched;
+};
+
+struct __packed shadow_vmcs {
+	u16 virtual_processor_id;
+	u16 guest_es_selector;
+	u16 guest_cs_selector;
+	u16 guest_ss_selector;
+	u16 guest_ds_selector;
+	u16 guest_fs_selector;
+	u16 guest_gs_selector;
+	u16 guest_ldtr_selector;
+	u16 guest_tr_selector;
+	u16 host_es_selector;
+	u16 host_cs_selector;
+	u16 host_ss_selector;
+	u16 host_ds_selector;
+	u16 host_fs_selector;
+	u16 host_gs_selector;
+	u16 host_tr_selector;
+	u64 io_bitmap_a;
+	u64 io_bitmap_b;
+	u64 msr_bitmap;
+	u64 vm_exit_msr_store_addr;
+	u64 vm_exit_msr_load_addr;
+	u64 vm_entry_msr_load_addr;
+	u64 tsc_offset;
+	u64 virtual_apic_page_addr;
+	u64 apic_access_addr;
+	u64 ept_pointer;
+	u64 guest_physical_address;
+	u64 vmcs_link_pointer;
+	u64 guest_ia32_debugctl;
+	u64 guest_ia32_pat;
+	u64 guest_pdptr0;
+	u64 guest_pdptr1;
+	u64 guest_pdptr2;
+	u64 guest_pdptr3;
+	u64 host_ia32_pat;
+	u32 pin_based_vm_exec_control;
+	u32 cpu_based_vm_exec_control;
+	u32 exception_bitmap;
+	u32 page_fault_error_code_mask;
+	u32 page_fault_error_code_match;
+	u32 cr3_target_count;
+	u32 vm_exit_controls;
+	u32 vm_exit_msr_store_count;
+	u32 vm_exit_msr_load_count;
+	u32 vm_entry_controls;
+	u32 vm_entry_msr_load_count;
+	u32 vm_entry_intr_info_field;
+	u32 vm_entry_exception_error_code;
+	u32 vm_entry_instruction_len;
+	u32 tpr_threshold;
+	u32 secondary_vm_exec_control;
+	u32 vm_instruction_error;
+	u32 vm_exit_reason;
+	u32 vm_exit_intr_info;
+	u32 vm_exit_intr_error_code;
+	u32 idt_vectoring_info_field;
+	u32 idt_vectoring_error_code;
+	u32 vm_exit_instruction_len;
+	u32 vmx_instruction_info;
+	u32 guest_es_limit;
+	u32 guest_cs_limit;
+	u32 guest_ss_limit;
+	u32 guest_ds_limit;
+	u32 guest_fs_limit;
+	u32 guest_gs_limit;
+	u32 guest_ldtr_limit;
+	u32 guest_tr_limit;
+	u32 guest_gdtr_limit;
+	u32 guest_idtr_limit;
+	u32 guest_es_ar_bytes;
+	u32 guest_cs_ar_bytes;
+	u32 guest_ss_ar_bytes;
+	u32 guest_ds_ar_bytes;
+	u32 guest_fs_ar_bytes;
+	u32 guest_gs_ar_bytes;
+	u32 guest_ldtr_ar_bytes;
+	u32 guest_tr_ar_bytes;
+	u32 guest_interruptibility_info;
+	u32 guest_activity_state;
+	u32 guest_sysenter_cs;
+	u32 host_ia32_sysenter_cs;
+	unsigned long cr0_guest_host_mask;
+	unsigned long cr4_guest_host_mask;
+	unsigned long cr0_read_shadow;
+	unsigned long cr4_read_shadow;
+	unsigned long cr3_target_value0;
+	unsigned long cr3_target_value1;
+	unsigned long cr3_target_value2;
+	unsigned long cr3_target_value3;
+	unsigned long exit_qualification;
+	unsigned long guest_linear_address;
+	unsigned long guest_cr0;
+	unsigned long guest_cr3;
+	unsigned long guest_cr4;
+	unsigned long guest_es_base;
+	unsigned long guest_cs_base;
+	unsigned long guest_ss_base;
+	unsigned long guest_ds_base;
+	unsigned long guest_fs_base;
+	unsigned long guest_gs_base;
+	unsigned long guest_ldtr_base;
+	unsigned long guest_tr_base;
+	unsigned long guest_gdtr_base;
+	unsigned long guest_idtr_base;
+	unsigned long guest_dr7;
+	unsigned long guest_rsp;
+	unsigned long guest_rip;
+	unsigned long guest_rflags;
+	unsigned long guest_pending_dbg_exceptions;
+	unsigned long guest_sysenter_esp;
+	unsigned long guest_sysenter_eip;
+	unsigned long host_cr0;
+	unsigned long host_cr3;
+	unsigned long host_cr4;
+	unsigned long host_fs_base;
+	unsigned long host_gs_base;
+	unsigned long host_tr_base;
+	unsigned long host_gdtr_base;
+	unsigned long host_idtr_base;
+	unsigned long host_ia32_sysenter_esp;
+	unsigned long host_ia32_sysenter_eip;
+	unsigned long host_rsp;
+	unsigned long host_rip;
+};
+
+
+Authors
+-------
+
+These patches were written by:
+     Abel Gordon, abelg il.ibm.com
+     Nadav Har'El, nyh il.ibm.com
+     Orit Wasserman, oritw il.ibm.com
+     Ben-Ami Yassor, benami il.ibm.com
+     Muli Ben-Yehuda, muli il.ibm.com
+
+With contributions by:
+     Anthony Liguori, aliguori us.ibm.com
+     Mike Day, mdday us.ibm.com
+
+And valuable reviews by:
+     Avi Kivity, avi redhat.com
+     Gleb Natapov, gleb redhat.com
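
The ABIs section says that L1 reaches the VMCS only through VMREAD and
VMWRITE. As a rough illustration of what that implies for an emulator, here
is a minimal user-space sketch that maps a field encoding to an offset inside
a vmcs12-like structure. It is not the code from this patch set: struct
demo_vmcs12, demo_fields, demo_field_size(), demo_vmread() and demo_vmwrite()
are hypothetical names, the structure is a four-field stand-in rather than
the real layout, a 64-bit L1 is assumed for natural-width fields, and
__attribute__((packed)) assumes GCC or Clang. The field encodings and the
width bits (14:13) are taken from Intel's manual.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-in for struct vmcs12 (illustration only). */
struct __attribute__((packed)) demo_vmcs12 {
	uint32_t revision_id;
	uint32_t abort;
	uint16_t guest_cs_selector;	/* 16-bit field */
	uint64_t tsc_offset;		/* 64-bit field */
	uint32_t exception_bitmap;	/* 32-bit field */
	uint64_t guest_rip;		/* natural width; 64-bit L1 assumed */
};

/* Packing means no compiler-chosen padding follows the two leading fields. */
_Static_assert(offsetof(struct demo_vmcs12, guest_cs_selector) == 8,
	       "revision_id and abort must occupy the first 8 bytes");

/* VMCS field encodings as defined in the Intel manual. */
enum {
	GUEST_CS_SELECTOR = 0x0802,
	TSC_OFFSET        = 0x2010,
	EXCEPTION_BITMAP  = 0x4004,
	GUEST_RIP         = 0x681e,
};

/* Bits 14:13 of an encoding give the field width:
 * 0 = 16-bit, 1 = 64-bit, 2 = 32-bit, 3 = natural width. */
static size_t demo_field_size(uint32_t encoding)
{
	switch ((encoding >> 13) & 0x3) {
	case 0:
		return sizeof(uint16_t);
	case 2:
		return sizeof(uint32_t);
	default:
		return sizeof(uint64_t);	/* 64-bit and natural width */
	}
}

/* Table from field encoding to offset inside the structure. */
static const struct {
	uint32_t encoding;
	size_t offset;
} demo_fields[] = {
	{ GUEST_CS_SELECTOR, offsetof(struct demo_vmcs12, guest_cs_selector) },
	{ TSC_OFFSET,        offsetof(struct demo_vmcs12, tsc_offset) },
	{ EXCEPTION_BITMAP,  offsetof(struct demo_vmcs12, exception_bitmap) },
	{ GUEST_RIP,         offsetof(struct demo_vmcs12, guest_rip) },
};

/* Return the offset of a field, or -1 if this sketch does not know it. */
static long demo_field_offset(uint32_t encoding)
{
	size_t i;

	for (i = 0; i < sizeof(demo_fields) / sizeof(demo_fields[0]); i++)
		if (demo_fields[i].encoding == encoding)
			return (long)demo_fields[i].offset;
	return -1;
}

/* Emulated VMREAD: 0 on success, -1 for a field missing from the table. */
static int demo_vmread(const struct demo_vmcs12 *v, uint32_t encoding,
		       uint64_t *value)
{
	long off = demo_field_offset(encoding);
	const unsigned char *p = (const unsigned char *)v;
	uint16_t v16;
	uint32_t v32;

	assert(v != NULL && value != NULL);
	if (off < 0)
		return -1;
	switch (demo_field_size(encoding)) {
	case 2:
		memcpy(&v16, p + off, sizeof(v16));
		*value = v16;
		break;
	case 4:
		memcpy(&v32, p + off, sizeof(v32));
		*value = v32;
		break;
	default:
		memcpy(value, p + off, sizeof(*value));
		break;
	}
	return 0;
}

/* Emulated VMWRITE: stores the low bits of value that fit the field. */
static int demo_vmwrite(struct demo_vmcs12 *v, uint32_t encoding,
			uint64_t value)
{
	long off = demo_field_offset(encoding);
	unsigned char *p = (unsigned char *)v;
	uint16_t v16 = (uint16_t)value;
	uint32_t v32 = (uint32_t)value;

	assert(v != NULL);
	if (off < 0)
		return -1;
	switch (demo_field_size(encoding)) {
	case 2:
		memcpy(p + off, &v16, sizeof(v16));
		break;
	case 4:
		memcpy(p + off, &v32, sizeof(v32));
		break;
	default:
		memcpy(p + off, &value, sizeof(value));
		break;
	}
	return 0;
}
```

With a zeroed struct demo_vmcs12 v, calling demo_vmwrite(&v, GUEST_RIP, 0xfff0)
and then demo_vmread(&v, GUEST_RIP, &out) leaves 0xfff0 in out, while an
encoding that is missing from demo_fields makes both helpers return -1.
Keying every access by encoding keeps the in-memory layout private to the
implementation, which is what lets the structure stay opaque to L1.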