Message ID | 20110731134804.GA7762@fermat.math.technion.ac.il
---|---
State | New, archived
This patch looks good, with one comment noted inline below. Are there no
other call sites for kvm_get_msr(), or ones which alias some other function
to kvm_get_msr(MSR_IA32_TSC)? Did I miss the first patch?

Zach

On Sun, Jul 31, 2011 at 6:48 AM, Nadav Har'El <nyh@math.technion.ac.il> wrote:
> On Fri, Jul 29, 2011, Zachary Amsden wrote about "Re: Nested VMX - L1 hangs on running L2":
>> ...
>> In that case, you need to distinguish between reads of the TSC MSR by
>> the guest and reads by the host (as done internally to track drift and
>> ...
>> Unfortunately, the layering currently doesn't seem to allow for this,
>> and it looks like both vendor-specific variants currently get this
>> wrong.
>> ...
>> 1) Add a new vendor-specific API call, vmx_x86_ops->get_L1_TSC(), and
>> transform current uses of the code which does TSC compensation to use
>> this new API. *Bonus* - no need to do double indirection through the
>> generic MSR code.
>
> Thank you so much for this analysis!
>
> I propose the following two patches. The first one just fixes two small
> bugs in the nested TSC emulation in the VMX case, and the second patch
> solves the originally-reported bug as you outlined above:
>
> The code in x86.c which used to call *_get_msr() to get the TSC no longer
> does so - because *_get_msr() should only be called with the intention of
> reading the MSR of the current guest (L1 or L2) and returning it to that
> guest.
>
> Instead, I added a new x86_op, read_l1_tsc(vcpu), whose intention is to
> always return L1's notion of the current TSC - even if the current guest
> is L2. All calls in x86.c now use this new read_l1_tsc() function instead
> of reading the MSR - does this look correct to you?
>
>> read via RDMSR which is mis-virtualized (this bug exists today in the
>> SVM implementation if I am reading it correctly - cc'd Joerg to notify
>
> I believe that you're right. I created (in the patch below) svm.c's
> read_l1_tsc() with the same code you had in svm_get_msr(), but I think
> that in svm_get_msr() the code should be different in the nested case -
> I think you should use svm->vmcb, not get_host_vmcb()? I didn't add this
> svm.c fix to the patch, because I'm not at all sure that my understanding
> here is correct.
>
> Zachary, Marcelo, do these patches look right to you?
> Bandan, and others who can reproduce this bug - do these patches solve
> the bug?
>
>> So you are right, this is still wrong for the case in which L1 does
>> not trap TSC MSR reads. Note however, the RDTSC instruction is still
>> virtualized properly; it is only the relatively rare actual TSC MSR
>> read via RDMSR which is mis-virtualized (this bug exists today in the
>> SVM implementation if I am reading it correctly - cc'd Joerg to notify
>> him of that). That, combined with the relative importance of
>> supporting a guest which does not trap on these MSR reads, suggests
>> this is a low-priority design issue however (RDTSC still works even if
>> the MSR is trapped, correct?)
>
> Yes, RDTSC_EXITING is separate from the MSR bitmap. I agree that these
> are indeed obscure corner cases, but I think that it's still really
> confusing that a function called kvm_get_msr() deliberately works
> incorrectly (always returning the L1 TSC) just so that it can fit some
> other uses. The SVM code worked in the common case, but it was still
> confusing why svm_get_msr() deliberately goes to great lengths to use
> L1's TSC_OFFSET even when L2 is running - when this is not how this MSR
> is really supposed to behave.
> Not to mention that one day, somebody *will* want to use these obscure
> settings and be surprised that they don't work ;-)
>
> Subject: [PATCH 1/2] nVMX: Fix nested TSC emulation
>
> This patch fixes two corner cases in nested (L2) handling of TSC-related
> exits:
>
> 1. Somewhat surprisingly, according to the Intel spec, if L1 allows WRMSR
> to the TSC MSR without an exit, then this should set L1's TSC value
> itself - not offset by vmcs12.TSC_OFFSET (as was wrongly done in the
> previous code).
>
> 2. Allow L1 to disable the TSC_OFFSETING control, and then correctly
> ignore the vmcs12.TSC_OFFSET.
>
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>  arch/x86/kvm/vmx.c | 31 +++++++++++++++++++++----------
>  1 file changed, 21 insertions(+), 10 deletions(-)
>
> --- .before/arch/x86/kvm/vmx.c	2011-07-31 16:17:30.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2011-07-31 16:17:30.000000000 +0300
> @@ -1762,15 +1762,23 @@ static void vmx_set_tsc_khz(struct kvm_v
>   */
>  static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
>  {
> -	vmcs_write64(TSC_OFFSET, offset);
> -	if (is_guest_mode(vcpu))
> +	if (is_guest_mode(vcpu)) {
>  		/*
> -		 * We're here if L1 chose not to trap the TSC MSR. Since
> -		 * prepare_vmcs12() does not copy tsc_offset, we need to also
> -		 * set the vmcs12 field here.
> +		 * We're here if L1 chose not to trap WRMSR to TSC. According
> +		 * to the spec, this should set L1's TSC; the offset that L1
> +		 * set for L2 remains unchanged, and still needs to be added
> +		 * to the newly set TSC to get L2's TSC.
>  		 */
> -		get_vmcs12(vcpu)->tsc_offset = offset -
> -			to_vmx(vcpu)->nested.vmcs01_tsc_offset;
> +		struct vmcs12 *vmcs12;
> +		to_vmx(vcpu)->nested.vmcs01_tsc_offset = offset;
> +		/* recalculate vmcs02.TSC_OFFSET: */
> +		vmcs12 = get_vmcs12(vcpu);
> +		vmcs_write64(TSC_OFFSET, offset +
> +			(nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING) ?
> +			 vmcs12->tsc_offset : 0));
> +	} else {
> +		vmcs_write64(TSC_OFFSET, offset);
> +	}
>  }
>
>  static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
> @@ -6514,8 +6522,11 @@ static void prepare_vmcs02(struct kvm_vc
>
>  	set_cr4_guest_host_mask(vmx);
>
> -	vmcs_write64(TSC_OFFSET,
> -		vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
> +	if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETING)
> +		vmcs_write64(TSC_OFFSET,
> +			vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
> +	else
> +		vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
>
>  	if (enable_vpid) {
>  		/*
> @@ -6922,7 +6933,7 @@ static void nested_vmx_vmexit(struct kvm
>
>  	load_vmcs12_host_state(vcpu, vmcs12);
>
> -	/* Update TSC_OFFSET if vmx_adjust_tsc_offset() was used while L2 ran */
> +	/* Update TSC_OFFSET if TSC was changed while L2 ran */
>  	vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
>
>  	/* This is needed for same reason as it was needed in prepare_vmcs02 */
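The TSC_OFFSET arithmetic that patch 1 implements can be condensed into a
standalone sketch (illustrative only - the function and parameter names
below are invented for this example, not KVM code):

    #include <stdint.h>
    #include <stdbool.h>

    /*
     * Hardware computes guest_tsc = host_tsc + TSC_OFFSET from the active
     * VMCS. While L2 runs the active VMCS is vmcs02, so its offset combines
     * KVM's offset for L1 with L1's offset for L2 - the latter only when L1
     * actually enabled CPU_BASED_USE_TSC_OFFSETING for L2.
     */
    static uint64_t vmcs02_tsc_offset(uint64_t vmcs01_tsc_offset,
                                      uint64_t vmcs12_tsc_offset,
                                      bool l1_enabled_tsc_offsetting)
    {
            return vmcs01_tsc_offset +
                   (l1_enabled_tsc_offsetting ? vmcs12_tsc_offset : 0);
    }

An untrapped WRMSR to the TSC while L2 runs replaces only the vmcs01
component of this sum; the vmcs12 contribution is then re-applied on top of
it, which is exactly what the rewritten vmx_write_tsc_offset() above does.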
> Subject: [PATCH 2/2] nVMX: L1 TSC handling
>
> KVM assumed in several places that reading the TSC MSR returns the value
> for L1. This is incorrect, because when L2 is running, the correct TSC
> read exit emulation is to return L2's value.
>
> We therefore add a new x86_ops function, read_l1_tsc, to use in places
> that specifically need to read the L1 TSC, NOT the TSC of the current
> level of guest.
>
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  2 ++
>  arch/x86/kvm/svm.c              |  9 +++++++++
>  arch/x86/kvm/vmx.c              | 17 +++++++++++++++++
>  arch/x86/kvm/x86.c              |  8 ++++----
>  4 files changed, 32 insertions(+), 4 deletions(-)
>
> --- .before/arch/x86/include/asm/kvm_host.h	2011-07-31 16:17:30.000000000 +0300
> +++ .after/arch/x86/include/asm/kvm_host.h	2011-07-31 16:17:30.000000000 +0300
> @@ -636,6 +636,8 @@ struct kvm_x86_ops {
>  			       struct x86_instruction_info *info,
>  			       enum x86_intercept_stage stage);
>
> +	u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu);
> +
>  	const struct trace_print_flags *exit_reasons_str;
>  };
>
> --- .before/arch/x86/kvm/vmx.c	2011-07-31 16:17:30.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2011-07-31 16:17:30.000000000 +0300
> @@ -1748,6 +1748,21 @@ static u64 guest_read_tsc(void)
>  }
>
>  /*
> + * Like guest_read_tsc, but always returns L1's notion of the timestamp
> + * counter, even if a nested guest (L2) is currently running.
> + */
> +static u64 vmx_read_l1_tsc(struct kvm_vcpu *vcpu)
> +{
> +	u64 host_tsc, tsc_offset;
> +
> +	rdtscll(host_tsc);
> +	tsc_offset = is_guest_mode(vcpu) ?
> +		to_vmx(vcpu)->nested.vmcs01_tsc_offset :
> +		vmcs_read64(TSC_OFFSET);
> +	return host_tsc + tsc_offset;
> +}
> +
> +/*
>   * Empty call-back. Needs to be implemented when VMX enables the SET_TSC_KHZ
>   * ioctl. In this case the call-back should update internal vmx state to make
>   * the changes effective.
> @@ -7070,6 +7085,8 @@ static struct kvm_x86_ops vmx_x86_ops =
>  	.set_tdp_cr3 = vmx_set_cr3,
>
>  	.check_intercept = vmx_check_intercept,
> +
> +	.read_l1_tsc = vmx_read_l1_tsc,
>  };
>
>  static int __init vmx_init(void)
> --- .before/arch/x86/kvm/svm.c	2011-07-31 16:17:30.000000000 +0300
> +++ .after/arch/x86/kvm/svm.c	2011-07-31 16:17:30.000000000 +0300
> @@ -2894,6 +2894,13 @@ static int cr8_write_interception(struct
>  	return 0;
>  }
>
> +static u64 svm_read_l1_tsc(struct kvm_vcpu *vcpu)
> +{
> +	struct vmcb *vmcb = get_host_vmcb(to_svm(vcpu));
> +
> +	return vmcb->control.tsc_offset +
> +		svm_scale_tsc(vcpu, native_read_tsc());
> +}
> +
>  static int svm_get_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 *data)
>  {
>  	struct vcpu_svm *svm = to_svm(vcpu);
> @@ -4243,6 +4250,8 @@ static struct kvm_x86_ops svm_x86_ops =
>  	.set_tdp_cr3 = set_tdp_cr3,
>
>  	.check_intercept = svm_check_intercept,
> +
> +	.read_l1_tsc = svm_read_l1_tsc,
>  };
>
>  static int __init svm_init(void)
> --- .before/arch/x86/kvm/x86.c	2011-07-31 16:17:30.000000000 +0300
> +++ .after/arch/x86/kvm/x86.c	2011-07-31 16:17:30.000000000 +0300
> @@ -1098,7 +1098,7 @@ static int kvm_guest_time_update(struct
>
>  	/* Keep irq disabled to prevent changes to the clock */
>  	local_irq_save(flags);
> -	kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
> +	tsc_timestamp = kvm_x86_ops->read_l1_tsc(v);
>  	kernel_ns = get_kernel_ns();
>  	this_tsc_khz = vcpu_tsc_khz(v);
>  	if (unlikely(this_tsc_khz == 0)) {
> @@ -2215,7 +2215,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu
>  		s64 tsc_delta;
>  		u64 tsc;
>
> -		kvm_get_msr(vcpu, MSR_IA32_TSC, &tsc);
> +		tsc = kvm_x86_ops->read_l1_tsc(vcpu);
>  		tsc_delta = !vcpu->arch.last_guest_tsc ? 0 :
>  			     tsc - vcpu->arch.last_guest_tsc;

Technically, this call site should no longer exist; this should be
re-factored in terms of the hardware TSC, not the guest TSC. I sent a patch
series which did that, but it has not yet been applied.
> @@ -2239,7 +2239,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *
>  {
>  	kvm_x86_ops->vcpu_put(vcpu);
>  	kvm_put_guest_fpu(vcpu);
> -	kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
> +	vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
>  }
>
>  static int is_efer_nx(void)
> @@ -5722,7 +5722,7 @@ static int vcpu_enter_guest(struct kvm_v
>  	if (hw_breakpoint_active())
>  		hw_breakpoint_restore();
>
> -	kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
> +	vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
>
>  	vcpu->mode = OUTSIDE_GUEST_MODE;
>  	smp_wmb();
>
> --
> Nadav Har'El                        | Sunday, Jul 31 2011, 29 Tammuz 5771
> nyh@math.technion.ac.il             |-----------------------------------------
> Phone +972-523-790466, ICQ 13349191 |Support bacteria - they're the only
> http://nadav.harel.org.il           |culture some people have!
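To make the semantics of the new hook concrete, the distinction patch 2
draws between the two TSC views can be modeled in a few lines (again a
sketch with invented names, not KVM code):

    #include <stdint.h>
    #include <stdbool.h>

    struct tsc_view {
            uint64_t vmcs01_tsc_offset;     /* offset KVM applies for L1 */
            uint64_t vmcs12_tsc_offset;     /* offset L1 applies for L2 */
            bool     l2_active;             /* is a nested guest running? */
    };

    /*
     * What the currently running guest sees - the value that RDMSR exit
     * emulation (kvm_get_msr and the vendor *_get_msr) should return.
     * Assumes L1 enabled TSC offsetting for L2.
     */
    static uint64_t current_guest_tsc(const struct tsc_view *v,
                                      uint64_t host_tsc)
    {
            uint64_t off = v->vmcs01_tsc_offset;

            if (v->l2_active)
                    off += v->vmcs12_tsc_offset;    /* i.e. vmcs02.TSC_OFFSET */
            return host_tsc + off;
    }

    /*
     * What read_l1_tsc() returns - always L1's view, even while L2 runs.
     * This is the value the host-side timekeeping code in x86.c needs.
     */
    static uint64_t l1_tsc(const struct tsc_view *v, uint64_t host_tsc)
    {
            return host_tsc + v->vmcs01_tsc_offset;
    }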
On Sun, Jul 31, 2011, Zachary Amsden wrote about "Re: Nested VMX - L1 hangs on running L2":
> This patch looks good, with one comment noted inline below. Are there
> no other call sites for kvm_get_msr() or which alias some other
> function to kvm_get_msr(MSR_IA32_TSC)?

This is all I found, but I'll try to look again.

> Did I miss the first patch?

Sorry, I sent both patches in one mail, one after another. I think you saw
them both, because your comment was on the second. Next time I'll send them
in separate mails.

Thanks,
Nadav.