From patchwork Sun Jul 31 13:48:04 2011
X-Patchwork-Submitter: Nadav Har'El
X-Patchwork-Id: 1023832
Date: Sun, 31 Jul 2011 16:48:04 +0300
From: "Nadav Har'El"
To: Zachary Amsden
Cc: Bandan Das, KVM Mailing List, Marcelo Tosatti, joerg.roedel@amd.com
Subject: Re: Nested VMX - L1 hangs on running L2
Message-ID: <20110731134804.GA7762@fermat.math.technion.ac.il>
References: <20110708184053.GA23958@stratus.com> <20110718182628.GB5324@amt.cnet> <20110727115100.GA23118@fermat.math.technion.ac.il>
X-Mailing-List: kvm@vger.kernel.org

On Fri, Jul 29, 2011, Zachary Amsden wrote about "Re: Nested VMX - L1 hangs on running L2":
>...
> In that case, you need to distinguish between reads of the TSC MSR by
> the guest and reads by the host (as done internally to track drift and
>...
> Unfortunately, the layering currently doesn't seem to allow for this,
> and it looks like both vendor specific variants currently get this
> wrong.
>...
> 1) Add a new vendor-specific API call, vmx_x86_ops->get_L1_TSC(), and
> transform current uses of the code which does TSC compensation to use
> this new API. *Bonus* - no need to do double indirection through the
> generic MSR code.

Thank you so much for this analysis! I propose the following two patches.
The first one just fixes two small bugs in the existing nested TSC emulation
in the VMX case, and the second patch solves the originally-reported bug as
you outlined above: the code in x86.c which used to call *_get_msr() to get
the TSC no longer does so, because *_get_msr() should only be called with
the intention of reading the MSR of the current guest (L1 or L2) and
returning it to that guest. Instead, I added a new x86_op, read_l1_tsc(vcpu),
whose intention is to always return L1's notion of the current TSC - even if
the current guest is L2.
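Just to spell out the intent (this is only an illustration, not code from the
patches - the helper names below are made up, while the field names follow
the patches), the two views of the TSC in the nested VMX case are:

/*
 * Illustration: with nested VMX two offsets are in play - the offset L0
 * gave L1 (saved as nested.vmcs01_tsc_offset) and the offset L1 gave L2
 * (vmcs12->tsc_offset).  While L2 runs, the hardware VMCS (vmcs02) carries
 * their sum, assuming L1 enabled CPU_BASED_USE_TSC_OFFSETING.
 */
static u64 l1_view_of_tsc(struct vcpu_vmx *vmx, u64 host_tsc)
{
	/* what the new read_l1_tsc() must return, whichever guest runs */
	return host_tsc + vmx->nested.vmcs01_tsc_offset;
}

static u64 l2_view_of_tsc(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12,
			  u64 host_tsc)
{
	/* what a TSC MSR read emulated on behalf of L2 should return */
	return host_tsc + vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset;
}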
All calls in x86.c now use this new read_l1_tsc() function instead of
reading the MSR - does this look correct to you?

> read via RDMSR which is mis-virtualized (this bug exists today in the
> SVM implementation if I am reading it correctly - cc'd Joerg to notify

I believe that you're right. I created (in the patch below) svm.c's
read_l1_tsc() with the same code you had in svm_get_msr(), but I think that
in svm_get_msr() the code should be different in the nested case - I think
you should use svm->vmcb, not get_host_vmcb()? I didn't add this svm.c fix
to the patch, because I'm not sure at all that my understanding here is
correct.

Zachary, Marcelo, do these patches look right to you? Bandan, and others who
can reproduce this bug - do these patches solve the bug?

> So you are right, this is still wrong for the case in which L1 does
> not trap TSC MSR reads. Note however, the RDTSC instruction is still
> virtualized properly, it is only the relatively rare actual TSC MSR
> read via RDMSR which is mis-virtualized (this bug exists today in the
> SVM implementation if I am reading it correctly - cc'd Joerg to notify
> him of that). That, combined with the relative importance of
> supporting a guest which does not trap on these MSR reads suggest this
> is a low priority design issue however (RDTSC still works even if the
> MSR is trapped, correct?)

Yes, RDTSC_EXITING is separate from the MSR bitmap. I agree that these are
indeed obscure corner cases, but I think that it's still really confusing
that a function called kvm_get_msr deliberately works incorrectly (always
returning the L1 TSC) just so that it can fit some other uses. The SVM code
worked in the common case, but it was still confusing why svm_get_msr()
deliberately goes to great lengths to use L1's TSC_OFFSET even when L2 is
running - when this is not how this MSR is really supposed to behave. Not to
mention that one day, somebody *will* want to use these obscure settings and
be surprised that they don't work ;-)

Subject: [PATCH 1/2] nVMX: Fix nested TSC emulation

This patch fixes two corner cases in nested (L2) handling of TSC-related
exits:

1. Somewhat surprisingly, according to the Intel spec, if L1 allows WRMSR to
   the TSC MSR without an exit, then this should set L1's TSC value itself -
   not offset by vmcs12.TSC_OFFSET (as was wrongly done in the previous
   code).

2. Allow L1 to disable the TSC_OFFSETING control, and then correctly ignore
   the vmcs12.TSC_OFFSET.

Signed-off-by: Nadav Har'El
---
 arch/x86/kvm/vmx.c |   31 +++++++++++++++++++++----------
 1 file changed, 21 insertions(+), 10 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-07-31 16:17:30.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-07-31 16:17:30.000000000 +0300
@@ -1762,15 +1762,23 @@ static void vmx_set_tsc_khz(struct kvm_v
  */
 static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
 {
-	vmcs_write64(TSC_OFFSET, offset);
-	if (is_guest_mode(vcpu))
+	if (is_guest_mode(vcpu)) {
 		/*
-		 * We're here if L1 chose not to trap the TSC MSR. Since
-		 * prepare_vmcs12() does not copy tsc_offset, we need to also
-		 * set the vmcs12 field here.
+		 * We're here if L1 chose not to trap WRMSR to TSC. According
+		 * to the spec, this should set L1's TSC; The offset that L1
+		 * set for L2 remains unchanged, and still needs to be added
+		 * to the newly set TSC to get L2's TSC.
 		 */
-		get_vmcs12(vcpu)->tsc_offset = offset -
-			to_vmx(vcpu)->nested.vmcs01_tsc_offset;
+		struct vmcs12 *vmcs12;
+		to_vmx(vcpu)->nested.vmcs01_tsc_offset = offset;
+		/* recalculate vmcs02.TSC_OFFSET: */
+		vmcs12 = get_vmcs12(vcpu);
+		vmcs_write64(TSC_OFFSET, offset +
+			(nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING) ?
+			 vmcs12->tsc_offset : 0));
+	} else {
+		vmcs_write64(TSC_OFFSET, offset);
+	}
 }
 
 static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
@@ -6514,8 +6522,11 @@ static void prepare_vmcs02(struct kvm_vc
 
 	set_cr4_guest_host_mask(vmx);
 
-	vmcs_write64(TSC_OFFSET,
-		vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
+	if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETING)
+		vmcs_write64(TSC_OFFSET,
+			vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
+	else
+		vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
 
 	if (enable_vpid) {
 		/*
@@ -6922,7 +6933,7 @@ static void nested_vmx_vmexit(struct kvm
 
 	load_vmcs12_host_state(vcpu, vmcs12);
 
-	/* Update TSC_OFFSET if vmx_adjust_tsc_offset() was used while L2 ran */
+	/* Update TSC_OFFSET if TSC was changed while L2 ran */
 	vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
 
 	/* This is needed for same reason as it was needed in prepare_vmcs02 */


Subject: [PATCH 2/2] nVMX: L1 TSC handling

KVM assumed in several places that reading the TSC MSR returns the value for
L1. This is incorrect, because when L2 is running, the correct TSC read exit
emulation is to return L2's value.

We therefore add a new x86_ops function, read_l1_tsc, to use in places that
specifically need to read the L1 TSC, NOT the TSC of the current level of
guest.

Signed-off-by: Nadav Har'El
---
 arch/x86/include/asm/kvm_host.h |    2 ++
 arch/x86/kvm/svm.c              |    9 +++++++++
 arch/x86/kvm/vmx.c              |   17 +++++++++++++++++
 arch/x86/kvm/x86.c              |    8 ++++----
 4 files changed, 32 insertions(+), 4 deletions(-)

--- .before/arch/x86/include/asm/kvm_host.h	2011-07-31 16:17:30.000000000 +0300
+++ .after/arch/x86/include/asm/kvm_host.h	2011-07-31 16:17:30.000000000 +0300
@@ -636,6 +636,8 @@ struct kvm_x86_ops {
 			       struct x86_instruction_info *info,
 			       enum x86_intercept_stage stage);
 
+	u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu);
+
 	const struct trace_print_flags *exit_reasons_str;
 };
 
--- .before/arch/x86/kvm/vmx.c	2011-07-31 16:17:30.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-07-31 16:17:30.000000000 +0300
@@ -1748,6 +1748,21 @@ static u64 guest_read_tsc(void)
 }
 
 /*
+ * Like guest_read_tsc, but always returns L1's notion of the timestamp
+ * counter, even if a nested guest (L2) is currently running.
+ */
+u64 vmx_read_l1_tsc(struct kvm_vcpu *vcpu)
+{
+	u64 host_tsc, tsc_offset;
+
+	rdtscll(host_tsc);
+	tsc_offset = is_guest_mode(vcpu) ?
+		to_vmx(vcpu)->nested.vmcs01_tsc_offset :
+		vmcs_read64(TSC_OFFSET);
+	return host_tsc + tsc_offset;
+}
+
+/*
  * Empty call-back. Needs to be implemented when VMX enables the SET_TSC_KHZ
  * ioctl. In this case the call-back should update internal vmx state to make
  * the changes effective.
@@ -7070,6 +7085,8 @@ static struct kvm_x86_ops vmx_x86_ops =
 	.set_tdp_cr3 = vmx_set_cr3,
 
 	.check_intercept = vmx_check_intercept,
+
+	.read_l1_tsc = vmx_read_l1_tsc,
 };
 
 static int __init vmx_init(void)
--- .before/arch/x86/kvm/svm.c	2011-07-31 16:17:30.000000000 +0300
+++ .after/arch/x86/kvm/svm.c	2011-07-31 16:17:30.000000000 +0300
@@ -2894,6 +2894,13 @@ static int cr8_write_interception(struct
 	return 0;
 }
 
+u64 svm_read_l1_tsc(struct kvm_vcpu *vcpu)
+{
+	return get_host_vmcb(to_svm(vcpu))->control.tsc_offset +
+		svm_scale_tsc(vcpu, native_read_tsc());
+}
+
+
 static int svm_get_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 *data)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -4243,6 +4250,8 @@ static struct kvm_x86_ops svm_x86_ops =
 	.set_tdp_cr3 = set_tdp_cr3,
 
 	.check_intercept = svm_check_intercept,
+
+	.read_l1_tsc = svm_read_l1_tsc,
 };
 
 static int __init svm_init(void)
--- .before/arch/x86/kvm/x86.c	2011-07-31 16:17:30.000000000 +0300
+++ .after/arch/x86/kvm/x86.c	2011-07-31 16:17:30.000000000 +0300
@@ -1098,7 +1098,7 @@ static int kvm_guest_time_update(struct
 
 	/* Keep irq disabled to prevent changes to the clock */
 	local_irq_save(flags);
-	kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
+	tsc_timestamp = kvm_x86_ops->read_l1_tsc(v);
 	kernel_ns = get_kernel_ns();
 	this_tsc_khz = vcpu_tsc_khz(v);
 	if (unlikely(this_tsc_khz == 0)) {
@@ -2215,7 +2215,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu
 		s64 tsc_delta;
 		u64 tsc;
 
-		kvm_get_msr(vcpu, MSR_IA32_TSC, &tsc);
+		tsc = kvm_x86_ops->read_l1_tsc(vcpu);
 		tsc_delta = !vcpu->arch.last_guest_tsc ? 0 :
 			     tsc - vcpu->arch.last_guest_tsc;
 
@@ -2239,7 +2239,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *
 {
 	kvm_x86_ops->vcpu_put(vcpu);
 	kvm_put_guest_fpu(vcpu);
-	kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
+	vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
 }
 
 static int is_efer_nx(void)
@@ -5722,7 +5722,7 @@ static int vcpu_enter_guest(struct kvm_v
 	if (hw_breakpoint_active())
 		hw_breakpoint_restore();
 
-	kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
+	vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
 
 	vcpu->mode = OUTSIDE_GUEST_MODE;
 	smp_wmb();