[31/31] nVMX: Documentation

Message ID 201105161959.p4GJxmUQ002089@rice.haifa.ibm.com (mailing list archive)
State New, archived

Commit Message

Nadav Har'El May 16, 2011, 7:59 p.m. UTC
This patch includes a brief introduction to the nested vmx feature in the
Documentation/kvm directory. The document also includes a copy of the
vmcs12 structure, as requested by Avi Kivity.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 Documentation/kvm/nested-vmx.txt |  243 +++++++++++++++++++++++++++++
 1 file changed, 243 insertions(+)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Tian, Kevin May 25, 2011, 10:33 a.m. UTC | #1
> From: Nadav Har'El
> Sent: Tuesday, May 17, 2011 4:00 AM
> 
> This patch includes a brief introduction to the nested vmx feature in the
> Documentation/kvm directory. The document also includes a copy of the
> vmcs12 structure, as requested by Avi Kivity.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>  Documentation/kvm/nested-vmx.txt |  243 +++++++++++++++++++++++++++++
>  1 file changed, 243 insertions(+)
>
> --- .before/Documentation/kvm/nested-vmx.txt	2011-05-16 22:36:51.000000000 +0300
> +++ .after/Documentation/kvm/nested-vmx.txt	2011-05-16 22:36:51.000000000 +0300
> @@ -0,0 +1,243 @@
> +Nested VMX
> +==========
> +
> +Overview
> +---------
> +
> +On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
> +to easily and efficiently run guest operating systems. Normally, these guests
> +*cannot* themselves be hypervisors running their own guests, because in VMX,
> +guests cannot use VMX instructions.

"because in VMX, guests cannot use VMX instructions" does not look correct, or else
you couldn't add nVMX support. :-) It's just that KVM currently doesn't emulate
those VMX instructions.

> +
> +The "Nested VMX" feature adds this missing capability - of running guest
> +hypervisors (which use VMX) with their own nested guests. It does so by
> +allowing a guest to use VMX instructions, and correctly and efficiently
> +emulating them using the single level of VMX available in the hardware.
> +
> +We describe in much greater detail the theory behind the nested VMX feature,
> +its implementation and its performance characteristics, in the OSDI 2010 paper
> +"The Turtles Project: Design and Implementation of Nested Virtualization",
> +available at:
> +
> +	http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf
> +
> +
> +Terminology
> +-----------
> +
> +Single-level virtualization has two levels - the host (KVM) and the guests.
> +In nested virtualization, we have three levels: The host (KVM), which we call
> +L0, the guest hypervisor, which we call L1, and its nested guest, which we
> +call L2.

Adding a brief introduction to vmcs01/vmcs02/vmcs12 would also be helpful here, given
that this doc is a central place to get a quick picture of nested VMX.

> +
> +
> +Known limitations
> +-----------------
> +
> +The current code supports running Linux guests under KVM guests.
> +Only 64-bit guest hypervisors are supported.
> +
> +Additional patches for running Windows under guest KVM, and Linux under
> +guest VMware server, and support for nested EPT, are currently running in
> +the lab, and will be sent as follow-on patchsets.

Any plans for nested VT-d?

> +
> +
> +Running nested VMX
> +------------------
> +
> +The nested VMX feature is disabled by default. It can be enabled by giving
> +the "nested=1" option to the kvm-intel module.
> +
> +No modifications are required to user space (qemu). However, qemu's default
> +emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be
> +explicitly enabled, by giving qemu one of the following options:
> +
> +     -cpu host              (emulated CPU has all features of the real CPU)
> +
> +     -cpu qemu64,+vmx       (add just the vmx feature to a named CPU type)
> +
> +
> +ABIs
> +----
> +
> +Nested VMX aims to present a standard and (eventually) fully-functional VMX
> +implementation for the a guest hypervisor to use. As such, the official
> +specification of the ABI that it provides is Intel's VMX specification,
> +namely volume 3B of their "Intel 64 and IA-32 Architectures Software
> +Developer's Manual". Not all of VMX's features are currently fully supported,
> +but the goal is to eventually support them all, starting with the VMX features
> +which are used in practice by popular hypervisors (KVM and others).

It'd be good to provide a list of known supported features. With the current code,
people have to read the source to understand the current status. If you can keep a
supported and verified feature list here, it'd be great.

Thanks
Kevin
Nadav Har'El May 25, 2011, 11:54 a.m. UTC | #2
On Wed, May 25, 2011, Tian, Kevin wrote about "RE: [PATCH 31/31] nVMX: Documentation":
> > +On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
> > +to easily and efficiently run guest operating systems. Normally, these guests
> > +*cannot* themselves be hypervisors running their own guests, because in VMX,
> > +guests cannot use VMX instructions.
> 
> "because in VMX, guests cannot use VMX instructions" looks not correct or else
> you can't add nVMX support. :-) It's just because currently KVM doesn't emulate
> those VMX instructions.

It depends on whether you look at the half-empty or half-full part of the
glass ;-)

The VMX instructions, when used in L1, do trap - as mandated by Popek and
Goldberg's theorem (that sensitive instructions must trap) - but they
don't "just work" like, for example, arithmetic instructions just work -
they need to be emulated by the VMM.

> > +Terminology
> > +-----------
> > +
> > +Single-level virtualization has two levels - the host (KVM) and the guests.
> > +In nested virtualization, we have three levels: The host (KVM), which we call
> > +L0, the guest hypervisor, which we call L1, and its nested guest, which we
> > +call L2.
> 
> Add a brief introduction about vmcs01/vmcs02/vmcs12 is also helpful here, given
> that this doc is a centralized place to gain quick picture of the nested VMX.

I'm now adding a short mention. However, I think this file should be viewed
as a user's guide, not a developer's guide. Developers should probably read
our full paper, where this terminology is explained, as well as how vmcs02
is related to the two others.

> > +Additional patches for running Windows under guest KVM, and Linux under
> > +guest VMware server, and support for nested EPT, are currently running in
> > +the lab, and will be sent as follow-on patchsets.
> 
> any plan on nested VTD?

Yes, for some definition of Yes ;-)

We do have an experimental nested IOMMU implementation: In our nested VMX
paper we showed how giving L1 an IOMMU allows for efficient nested device
assignment (L0 assigns a PCI device to L1, and L1 does the same to L2).
In that work we used a very simplistic "paravirtual" IOMMU instead of fully
emulating an IOMMU for L1.
Later, we did develop a full emulation of an IOMMU for L1, although we didn't
test it in the context of nested VMX (we used it to allow L1 to use an IOMMU
for better DMA protection inside the guest).

The IOMMU emulation work was done by Nadav Amit, Muli Ben-Yehuda, et al.,
and will be described in the upcoming Usenix ATC conference
(http://www.usenix.org/event/atc11/tech/techAbstracts.html#Amit).
After the conference in June, the paper will be available at this URL:
http://www.usenix.org/event/atc11/tech/final_files/Amit.pdf

If there is interest, they can perhaps contribute their work to
KVM (and QEMU) - if you're interested, please get in touch with them directly.

> It'd be good to provide a list of known supported features. In your current code,
> people have to look at code to understand current status. If you can keep a
> supported and verified feature list here, it'd be great.

It will be even better to support all features ;-)

But seriously, the VMX spec is hundreds of pages long, with hundreds of
features, sub-features, and sub-sub-features and myriads of subcase-of-
subfeature and combinations thereof, so I don't think such a list would be
practical - or ever be accurate.

In the "Known Limitations" section of this document, I'd like to list major
features which are missing, and perhaps more importantly - L1 and L2
guests which are known NOT to work.

By the way, it appears that you've been going over the patches in increasing
numerical order, and this is the last patch ;-) Have you finished your
review iteration?

Thanks for the reviews!
Nadav.
Tian, Kevin May 25, 2011, 12:11 p.m. UTC | #3
> From: Nadav Har'El
> Sent: Wednesday, May 25, 2011 7:55 PM
> 
> On Wed, May 25, 2011, Tian, Kevin wrote about "RE: [PATCH 31/31] nVMX: Documentation":
> > > +On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
> > > +to easily and efficiently run guest operating systems. Normally, these guests
> > > +*cannot* themselves be hypervisors running their own guests, because in VMX,
> > > +guests cannot use VMX instructions.
> >
> > "because in VMX, guests cannot use VMX instructions" looks not correct or else
> > you can't add nVMX support. :-) It's just because currently KVM doesn't emulate
> > those VMX instructions.
> 
> It depends on whether you look on the half-empty or half-full part of the
> glass ;-)
> 
> The VMX instructions, when used in L1, do trap - as mandated by Popek and
> Goldberg's theorem (that sensitive instructions must trap) - but they
> don't "just work" like, for example, arithmetic instructions just work -
> they need to be emulated by the VMM.
> 
> > > +Terminology
> > > +-----------
> > > +
> > > +Single-level virtualization has two levels - the host (KVM) and the guests.
> > > +In nested virtualization, we have three levels: The host (KVM), which we call
> > > +L0, the guest hypervisor, which we call L1, and its nested guest, which we
> > > +call L2.
> >
> > Add a brief introduction about vmcs01/vmcs02/vmcs12 is also helpful here, given
> > that this doc is a centralized place to gain quick picture of the nested VMX.
> 
> I'm adding now a short mention. However, I think this file should be viewed
> as a user's guide, not a developer's guide. Developers should probably read
> our full paper, where this terminology is explained, as well as how vmcs02
> is related to the two others.

I agree with the purpose of this doc. 

> 
> > > +Additional patches for running Windows under guest KVM, and Linux under
> > > +guest VMware server, and support for nested EPT, are currently running in
> > > +the lab, and will be sent as follow-on patchsets.
> >
> > any plan on nested VTD?
> 
> Yes, for some definition of Yes ;-)
> 
> We do have an experimental nested IOMMU implementation: In our nested VMX
> paper we showed how giving L1 an IOMMU allows for efficient nested device
> assignment (L0 assigns a PCI device to L1, and L1 does the same to L2).
> In that work we used a very simplistic "paravirtual" IOMMU instead of fully
> emulating an IOMMU for L1.
> Later, we did develop a full emulation of an IOMMU for L1, although we didn't
> test it in the context of nested VMX (we used it to allow L1 to use an IOMMU
> for better DMA protection inside the guest).
> 
> The IOMMU emulation work was done by Nadav Amit, Muli Ben-Yehuda, et al.,
> and will be described in the upcoming Usenix ATC conference
> (http://www.usenix.org/event/atc11/tech/techAbstracts.html#Amit).
> After the conference in June, the paper will be available at this URL:
> http://www.usenix.org/event/atc11/tech/final_files/Amit.pdf
> 
> If there is interest, they can perhaps contribute their work to
> KVM (and QEMU) - if you're interested, please get in touch with them directly.

Thanks, that information is good to know.

> 
> > It'd be good to provide a list of known supported features. In your current code,
> > people have to look at code to understand current status. If you can keep a
> > supported and verified feature list here, it'd be great.
> 
> It will be even better to support all features ;-)
> 
> But seriously, the VMX spec is hundreds of pages long, with hundreds of
> features, sub-features, and sub-sub-features and myriads of subcase-of-
> subfeature and combinations thereof, so I don't think such a list would be
> practical - or ever be accurate.

There's no need to list every subfeature. A list of perhaps a dozen features,
updated as people enable them one by one, would be appreciated, especially for
things which may accelerate L2 performance, such as virtual NMI, TPR shadow, virtual x2APIC, ...

> 
> In the "Known Limitations" section of this document, I'd like to list major
> features which are missing, and perhaps more importantly - L1 and L2
> guests which are known NOT to work.

Yes, that info is also important, so that people can easily reproduce your
success.

> 
> By the way, it appears that you've been going over the patches in increasing
> numerical order, and this is the last patch ;-) Have you finished your
> review iteration?
> 

Yes, I've finished my review of all of your v10 patches. :-)

Thanks
Kevin
Muli Ben-Yehuda May 25, 2011, 12:13 p.m. UTC | #4
On Wed, May 25, 2011 at 06:33:30PM +0800, Tian, Kevin wrote:

> > +Known limitations
> > +-----------------
> > +
> > +The current code supports running Linux guests under KVM guests.
> > +Only 64-bit guest hypervisors are supported.
> > +
> > +Additional patches for running Windows under guest KVM, and Linux under
> > +guest VMware server, and support for nested EPT, are currently running in
> > +the lab, and will be sent as follow-on patchsets.
> 
> any plan on nested VTD?

Nadav Amit sent patches for VT-d emulation about a year ago
(http://marc.info/?l=qemu-devel&m=127124206827481&w=2). They don't
apply to the current tree, but rebasing them probably doesn't make
sense until some version of the QEMU IOMMU/DMA API that has been
discussed makes it in.

Cheers,
Muli

Patch

--- .before/Documentation/kvm/nested-vmx.txt	2011-05-16 22:36:51.000000000 +0300
+++ .after/Documentation/kvm/nested-vmx.txt	2011-05-16 22:36:51.000000000 +0300
@@ -0,0 +1,243 @@ 
+Nested VMX
+==========
+
+Overview
+--------
+
+On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
+to easily and efficiently run guest operating systems. Normally, these guests
+*cannot* themselves be hypervisors running their own guests, because in VMX,
+guests cannot use VMX instructions.
+
+The "Nested VMX" feature adds this missing capability - of running guest
+hypervisors (which use VMX) with their own nested guests. It does so by
+allowing a guest to use VMX instructions, and correctly and efficiently
+emulating them using the single level of VMX available in the hardware.
+
+We describe in much greater detail the theory behind the nested VMX feature,
+its implementation and its performance characteristics, in the OSDI 2010 paper
+"The Turtles Project: Design and Implementation of Nested Virtualization",
+available at:
+
+	http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf
+
+
+Terminology
+-----------
+
+Single-level virtualization has two levels - the host (KVM) and the guests.
+In nested virtualization, we have three levels: The host (KVM), which we call
+L0, the guest hypervisor, which we call L1, and its nested guest, which we
+call L2.
+
+
+Known limitations
+-----------------
+
+The current code supports running Linux guests under KVM guests.
+Only 64-bit guest hypervisors are supported.
+
+Additional patches for running Windows under guest KVM, and Linux under
+guest VMware server, and support for nested EPT, are currently running in
+the lab, and will be sent as follow-on patchsets.
+
+
+Running nested VMX
+------------------
+
+The nested VMX feature is disabled by default. It can be enabled by giving
+the "nested=1" option to the kvm-intel module.
+
+No modifications are required to user space (qemu). However, qemu's default
+emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be
+explicitly enabled, by giving qemu one of the following options:
+
+     -cpu host              (emulated CPU has all features of the real CPU)
+
+     -cpu qemu64,+vmx       (add just the vmx feature to a named CPU type)
+
+
+ABIs
+----
+
+Nested VMX aims to present a standard and (eventually) fully-functional VMX
+implementation for a guest hypervisor to use. As such, the official
+specification of the ABI that it provides is Intel's VMX specification,
+namely volume 3B of their "Intel 64 and IA-32 Architectures Software
+Developer's Manual". Not all of VMX's features are currently fully supported,
+but the goal is to eventually support them all, starting with the VMX features
+which are used in practice by popular hypervisors (KVM and others).
+
+As a VMX implementation, nested VMX presents a VMCS structure to L1.
+As mandated by the spec, other than the two fields revision_id and abort,
+this structure is *opaque* to its user, who is not supposed to know or care
+about its internal structure. Rather, the structure is accessed through the
+VMREAD and VMWRITE instructions.
+Still, for debugging purposes, KVM developers might be interested to know the
+internals of this structure; this is struct vmcs12 from arch/x86/kvm/vmx.c.
+For convenience, we repeat its content here. If the internals of this structure
+change, this can break live migration across KVM versions. VMCS12_REVISION
+(from vmx.c) should be changed if struct vmcs12 or its inner struct shadow_vmcs
+is ever changed.
+
+	typedef u64 natural_width;
+	struct __packed vmcs12 {
+		/* According to the Intel spec, a VMCS region must start with
+		 * these two user-visible fields */
+		u32 revision_id;
+		u32 abort;
+
+		u32 launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
+		u32 padding[7]; /* room for future expansion */
+
+		u64 io_bitmap_a;
+		u64 io_bitmap_b;
+		u64 msr_bitmap;
+		u64 vm_exit_msr_store_addr;
+		u64 vm_exit_msr_load_addr;
+		u64 vm_entry_msr_load_addr;
+		u64 tsc_offset;
+		u64 virtual_apic_page_addr;
+		u64 apic_access_addr;
+		u64 ept_pointer;
+		u64 guest_physical_address;
+		u64 vmcs_link_pointer;
+		u64 guest_ia32_debugctl;
+		u64 guest_ia32_pat;
+		u64 guest_ia32_efer;
+		u64 guest_pdptr0;
+		u64 guest_pdptr1;
+		u64 guest_pdptr2;
+		u64 guest_pdptr3;
+		u64 host_ia32_pat;
+		u64 host_ia32_efer;
+		u64 padding64[8]; /* room for future expansion */
+		natural_width cr0_guest_host_mask;
+		natural_width cr4_guest_host_mask;
+		natural_width cr0_read_shadow;
+		natural_width cr4_read_shadow;
+		natural_width cr3_target_value0;
+		natural_width cr3_target_value1;
+		natural_width cr3_target_value2;
+		natural_width cr3_target_value3;
+		natural_width exit_qualification;
+		natural_width guest_linear_address;
+		natural_width guest_cr0;
+		natural_width guest_cr3;
+		natural_width guest_cr4;
+		natural_width guest_es_base;
+		natural_width guest_cs_base;
+		natural_width guest_ss_base;
+		natural_width guest_ds_base;
+		natural_width guest_fs_base;
+		natural_width guest_gs_base;
+		natural_width guest_ldtr_base;
+		natural_width guest_tr_base;
+		natural_width guest_gdtr_base;
+		natural_width guest_idtr_base;
+		natural_width guest_dr7;
+		natural_width guest_rsp;
+		natural_width guest_rip;
+		natural_width guest_rflags;
+		natural_width guest_pending_dbg_exceptions;
+		natural_width guest_sysenter_esp;
+		natural_width guest_sysenter_eip;
+		natural_width host_cr0;
+		natural_width host_cr3;
+		natural_width host_cr4;
+		natural_width host_fs_base;
+		natural_width host_gs_base;
+		natural_width host_tr_base;
+		natural_width host_gdtr_base;
+		natural_width host_idtr_base;
+		natural_width host_ia32_sysenter_esp;
+		natural_width host_ia32_sysenter_eip;
+		natural_width host_rsp;
+		natural_width host_rip;
+		natural_width paddingl[8]; /* room for future expansion */
+		u32 pin_based_vm_exec_control;
+		u32 cpu_based_vm_exec_control;
+		u32 exception_bitmap;
+		u32 page_fault_error_code_mask;
+		u32 page_fault_error_code_match;
+		u32 cr3_target_count;
+		u32 vm_exit_controls;
+		u32 vm_exit_msr_store_count;
+		u32 vm_exit_msr_load_count;
+		u32 vm_entry_controls;
+		u32 vm_entry_msr_load_count;
+		u32 vm_entry_intr_info_field;
+		u32 vm_entry_exception_error_code;
+		u32 vm_entry_instruction_len;
+		u32 tpr_threshold;
+		u32 secondary_vm_exec_control;
+		u32 vm_instruction_error;
+		u32 vm_exit_reason;
+		u32 vm_exit_intr_info;
+		u32 vm_exit_intr_error_code;
+		u32 idt_vectoring_info_field;
+		u32 idt_vectoring_error_code;
+		u32 vm_exit_instruction_len;
+		u32 vmx_instruction_info;
+		u32 guest_es_limit;
+		u32 guest_cs_limit;
+		u32 guest_ss_limit;
+		u32 guest_ds_limit;
+		u32 guest_fs_limit;
+		u32 guest_gs_limit;
+		u32 guest_ldtr_limit;
+		u32 guest_tr_limit;
+		u32 guest_gdtr_limit;
+		u32 guest_idtr_limit;
+		u32 guest_es_ar_bytes;
+		u32 guest_cs_ar_bytes;
+		u32 guest_ss_ar_bytes;
+		u32 guest_ds_ar_bytes;
+		u32 guest_fs_ar_bytes;
+		u32 guest_gs_ar_bytes;
+		u32 guest_ldtr_ar_bytes;
+		u32 guest_tr_ar_bytes;
+		u32 guest_interruptibility_info;
+		u32 guest_activity_state;
+		u32 guest_sysenter_cs;
+		u32 host_ia32_sysenter_cs;
+		u32 padding32[8]; /* room for future expansion */
+		u16 virtual_processor_id;
+		u16 guest_es_selector;
+		u16 guest_cs_selector;
+		u16 guest_ss_selector;
+		u16 guest_ds_selector;
+		u16 guest_fs_selector;
+		u16 guest_gs_selector;
+		u16 guest_ldtr_selector;
+		u16 guest_tr_selector;
+		u16 host_es_selector;
+		u16 host_cs_selector;
+		u16 host_ss_selector;
+		u16 host_ds_selector;
+		u16 host_fs_selector;
+		u16 host_gs_selector;
+		u16 host_tr_selector;
+	};
+
+
+Authors
+-------
+
+These patches were written by:
+     Abel Gordon, abelg <at> il.ibm.com
+     Nadav Har'El, nyh <at> il.ibm.com
+     Orit Wasserman, oritw <at> il.ibm.com
+     Ben-Ami Yassor, benami <at> il.ibm.com
+     Muli Ben-Yehuda, muli <at> il.ibm.com
+
+With contributions by:
+     Anthony Liguori, aliguori <at> us.ibm.com
+     Mike Day, mdday <at> us.ibm.com
+     Michael Factor, factor <at> il.ibm.com
+     Zvi Dubitzky, dubi <at> il.ibm.com
+
+And valuable reviews by:
+     Avi Kivity, avi <at> redhat.com
+     Gleb Natapov, gleb <at> redhat.com
+     and others.
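
Illustration (not part of the patch): the ABIs section says that L1 reaches
vmcs12 only through VMREAD and VMWRITE, so L0 has to map each architectural
field encoding to a location inside the structure. Below is a minimal sketch
of that idea in C; it is not KVM's actual code. It assumes a cut-down,
hypothetical struct mini_vmcs12 holding only a few of the fields listed above,
so its offsets are not those of the real vmcs12, and the helper names
(encoding_width, field_offset, mini_vmread, mini_vmwrite) are invented for
this example. The four encodings and the width bits (14:13) are taken from
the VMCS field-encoding tables in the Intel SDM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cut-down stand-in for struct vmcs12 (offsets are NOT real). */
struct __attribute__((packed)) mini_vmcs12 {
	uint32_t revision_id;
	uint32_t abort;
	uint64_t tsc_offset;
	uint64_t guest_rip;		/* natural width, u64 as in the patch */
	uint32_t vm_exit_reason;
	uint16_t guest_es_selector;
};

/* A layout change breaks saved state, so catch it at build time. */
static_assert(offsetof(struct mini_vmcs12, guest_rip) == 16,
	      "mini_vmcs12 layout changed: bump the revision id");

/* Architectural field encodings (Intel SDM). */
enum {
	ENC_GUEST_ES_SELECTOR = 0x0800,
	ENC_TSC_OFFSET        = 0x2010,
	ENC_VM_EXIT_REASON    = 0x4402,
	ENC_GUEST_RIP         = 0x681e,
};

/* Bits 14:13 of an encoding select the field width. */
enum field_width { WIDTH_U16 = 0, WIDTH_U64 = 1, WIDTH_U32 = 2, WIDTH_NATURAL = 3 };

static enum field_width encoding_width(uint32_t encoding)
{
	return (enum field_width)((encoding >> 13) & 0x3);
}

/* Map an encoding to a byte offset in mini_vmcs12, or -1 if unsupported. */
static int field_offset(uint32_t encoding)
{
	switch (encoding) {
	case ENC_GUEST_ES_SELECTOR:
		return (int)offsetof(struct mini_vmcs12, guest_es_selector);
	case ENC_TSC_OFFSET:
		return (int)offsetof(struct mini_vmcs12, tsc_offset);
	case ENC_VM_EXIT_REASON:
		return (int)offsetof(struct mini_vmcs12, vm_exit_reason);
	case ENC_GUEST_RIP:
		return (int)offsetof(struct mini_vmcs12, guest_rip);
	default:
		return -1;
	}
}

/* Emulated VMREAD: 0 on success, -1 if the field is not supported. */
static int mini_vmread(const struct mini_vmcs12 *v, uint32_t encoding,
		       uint64_t *value)
{
	int off = field_offset(encoding);
	const unsigned char *p;
	uint16_t v16;
	uint32_t v32;

	if (off < 0)
		return -1;
	p = (const unsigned char *)v + off;
	switch (encoding_width(encoding)) {
	case WIDTH_U16: memcpy(&v16, p, sizeof(v16)); *value = v16; break;
	case WIDTH_U32: memcpy(&v32, p, sizeof(v32)); *value = v32; break;
	default:        memcpy(value, p, sizeof(*value)); break;
	}
	return 0;
}

/* Emulated VMWRITE: the value is truncated to the width of the field. */
static int mini_vmwrite(struct mini_vmcs12 *v, uint32_t encoding,
			uint64_t value)
{
	int off = field_offset(encoding);
	unsigned char *p;
	uint16_t v16 = (uint16_t)value;
	uint32_t v32 = (uint32_t)value;

	if (off < 0)
		return -1;
	p = (unsigned char *)v + off;
	switch (encoding_width(encoding)) {
	case WIDTH_U16: memcpy(p, &v16, sizeof(v16)); break;
	case WIDTH_U32: memcpy(p, &v32, sizeof(v32)); break;
	default:        memcpy(p, &value, sizeof(value)); break;
	}
	return 0;
}
```

The static_assert line shows one way a layout change could be caught at build
time, which is the situation in which the text says VMCS12_REVISION should be
changed. Because callers only ever pass encodings, the in-memory order of the
fields stays private to L0, which is what lets the structure remain opaque
to L1.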