Message ID | 20240418193528.41780-3-dwmw2@infradead.org (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Cleaning up the KVM clock mess | expand |
On 18/04/2024 20:34, David Woodhouse wrote: > From: David Woodhouse <dwmw@amazon.co.uk> > > The kvm_guest_time_update() function scales the host TSC frequency to > the guest's using kvm_scale_tsc() and the v->arch.l1_tsc_scaling_ratio > scaling ratio previously calculated for that vCPU. Then calcuates the > scaling factors for the KVM clock itself based on that guest TSC > frequency. > > However, it uses kHz as the unit when scaling, and then multiplies by > 1000 only at the end. > > With a host TSC frequency of 3000MHz and a guest set to 2500MHz, the > result of kvm_scale_tsc() will actually come out at 2,499,999kHz. So > the KVM clock advertised to the guest is based on a frequency of > 2,499,999,000 Hz. > > By using Hz as the unit from the beginning, the KVM clock would be based > on a more accurate frequency of 2,499,999,999 Hz in this example. > > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> > Fixes: 78db6a503796 ("KVM: x86: rewrite handling of scaled TSC for kvmclock") > --- > arch/x86/include/asm/kvm_host.h | 2 +- > arch/x86/kvm/x86.c | 17 +++++++++-------- > arch/x86/kvm/xen.c | 2 +- > 3 files changed, 11 insertions(+), 10 deletions(-) > Reviewed-by: Paul Durrant <paul@xen.org>
On Thu, Apr 18, 2024 at 9:51 PM David Woodhouse <dwmw2@infradead.org> wrote: > gpa_t time; > struct pvclock_vcpu_time_info hv_clock; > - unsigned int hw_tsc_khz; > + unsigned int hw_tsc_hz; Why not change this to u64? 4.3 GHz is scarily close to current processors, though I expect that it will break a lot more software than just KVM. > static int kvm_guest_time_update(struct kvm_vcpu *v) > { > - unsigned long flags, tgt_tsc_khz; > + unsigned long flags; > + uint64_t tgt_tsc_hz; ... especially considering that you did use a 64-bit integer here (though---please use u64 not uint64_t; and BTW if you want to add a patch to change kvm_get_time_scale() to u64, please do. Thanks, Paolo
On Mon, 2024-04-22 at 14:22 +0200, Paolo Bonzini wrote: > On Thu, Apr 18, 2024 at 9:51 PM David Woodhouse <dwmw2@infradead.org> wrote: > > gpa_t time; > > struct pvclock_vcpu_time_info hv_clock; > > - unsigned int hw_tsc_khz; > > + unsigned int hw_tsc_hz; > > Why not change this to u64? 4.3 GHz is scarily close to current > processors, though I expect that it will break a lot more software > than just KVM. I'm not sure that just changing the variable is sufficient. I'm concerned that we may still have places (like kvm_get_time_scale(), although it's hard to tell because it's entirely uncommented!) which assume that they can multiply the value by *any* 32-bit number and it'll still fit in 64 bits. > > static int kvm_guest_time_update(struct kvm_vcpu *v) > > { > > - unsigned long flags, tgt_tsc_khz; > > + unsigned long flags; > > + uint64_t tgt_tsc_hz; > > ... especially considering that you did use a 64-bit integer here > (though---please use u64 not uint64_t; and BTW if you want to add a > patch to change kvm_get_time_scale() to u64, please do. Meh, I'm used to programming in C. Yes, I *am* old enough to have been doing this since the last decade of the 1900s, but it *has* been a long time since 1999, and my fingers have learned :) That said, although I *do* write in C by default, where I'm editing functions which already have that old anachronistic crap, I generally manage to go back and change my additions from proper C to match their surroundings. And I suppose there *are* some of those awful u64/s64 types already in kvm_guest_time_update(), so I would normally have done so. I'll change that in my tree. No way I'm regressing kvm_get_time_scale() myself though :) Heh, looks like it was you who made it uint64_t, in 2016. In a commit (3ae13faac) which said "Prepare for improving the precision in the next patch"... which never came, AFAICT?
On Mon, Apr 22, 2024 at 5:39 PM David Woodhouse <dwmw2@infradead.org> wrote: > > ... especially considering that you did use a 64-bit integer here > > (though---please use u64 not uint64_t; and BTW if you want to add a > > patch to change kvm_get_time_scale() to u64, please do. > > Meh, I'm used to programming in C. Yes, I *am* old enough to have been > doing this since the last decade of the 1900s, but it *has* been a long > time since 1999, and my fingers have learned :) Oh, I am on the same page (working on both QEMU and Linux, adapting my muscle memory to the context sucks) but u64/s64 is the preferred spelling and I have been asked to use them before. > Heh, looks like it was you who made it uint64_t, in 2016. In a commit > (3ae13faac) which said "Prepare for improving the precision in the next > patch"... which never came, AFAICT? Yes, it was posted as https://lore.kernel.org/lkml/1454944711-33022-5-git-send-email-pbonzini@redhat.com/ but not committed. As an aside, we discovered later that the patch you list as "Fixes" fixed another tricky bug: before, kvmclock could jump if the TSC is set within the 250 ppm tolerance that does not activate TSC scaling. This is possible after a first live migration, and then the second live migration used the guest TSC frequency *that userspace desired* instead of the *actual* TSC frequency. Before: this_tsc_khz = __this_cpu_read(cpu_tsc_khz); if (unlikely(vcpu->hw_tsc_khz != this_tsc_khz)) { tgt_tsc_khz = vcpu->virtual_tsc_khz; kvm_get_time_scale(NSEC_PER_SEC / 1000, tgt_tsc_khz, &vcpu->hv_clock.tsc_shift, &vcpu->hv_clock.tsc_to_system_mul); vcpu->hw_tsc_khz = this_tsc_khz; After: tgt_tsc_khz = __this_cpu_read(cpu_tsc_khz); // tgt_tsc_khz unchanged because TSC scaling was not enabled tgt_tsc_khz = kvm_scale_tsc(v, tgt_tsc_khz); if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) { kvm_get_time_scale(NSEC_PER_SEC / 1000, tgt_tsc_khz, &vcpu->hv_clock.tsc_shift, &vcpu->hv_clock.tsc_to_system_mul); vcpu->hw_tsc_khz = tgt_tsc_khz; So in the first case kvm_get_time_scale uses vcpu->virtual_tsc_khz, in the second case it uses __this_cpu_read(cpu_tsc_khz). This then caused a mismatch between the actual guest frequency and what is used by kvm_guest_time_update, which only becomes visible when migration resets the clock with KVM_GET/SET_CLOCK. KVM_GET_CLOCK returns what _should have been_ the same value read by the guest, but it's wrong. Paolo
On Mon, 2024-04-22 at 17:54 +0200, Paolo Bonzini wrote: > On Mon, Apr 22, 2024 at 5:39 PM David Woodhouse <dwmw2@infradead.org> wrote: > > > > ... especially considering that you did use a 64-bit integer here > > > (though---please use u64 not uint64_t; and BTW if you want to add a > > > patch to change kvm_get_time_scale() to u64, please do. > > > > Meh, I'm used to programming in C. Yes, I *am* old enough to have been > > doing this since the last decade of the 1900s, but it *has* been a long > > time since 1999, and my fingers have learned :) > > Oh, I am on the same page (working on both QEMU and Linux, adapting my > muscle memory to the context sucks) but u64/s64 is the preferred > spelling and I have been asked to use them before. Ever heard of Jury Nullification... ? :) > > Heh, looks like it was you who made it uint64_t, in 2016. In a commit > > (3ae13faac) which said "Prepare for improving the precision in the next > > patch"... which never came, AFAICT? > > Yes, it was posted as > https://lore.kernel.org/lkml/1454944711-33022-5-git-send-email-pbonzini@redhat.com/ > but not committed. Ah, thanks. So that isn't about arithmetic precision, but about dealing with the mess we had where the KVM clock was afflicted by NTP skew. We live in a much saner world now where it's simply based on the guest TSC (or, in pathological cases, the host's CLOCK_MONOTONIC_RAW. > As an aside, we discovered later that the patch you list as "Fixes" > fixed another tricky bug: before, kvmclock could jump if the TSC is > set within the 250 ppm tolerance that does not activate TSC scaling. > This is possible after a first live migration, and then the second > live migration used the guest TSC frequency *that userspace desired* > instead of the *actual* TSC frequency. Given our saner world where the KVM clock now *isn't* adjusted for NTP skew, that "bug" was probably better left unfixed. In fact, I may give some thought to reverting commit 78db6a503796 completely. Perhaps we should call that "definition E". I think we're up to five now? But no, let's not add historical ones to the taxonomy :) I believe Jack's KVM_SET_CLOCK_GUEST fixes the fundamental issue there of the clock jumping on migration. It's just a special case of the breakage in KVM_REQ_MASTERCLOCK_UPDATE, where the KVM clock has happily been running as a direct function of the guest TSC, and when we yank it back to some other definition when we create a new masterclock reference point. With KVM_SET_CLOCK_GUEST, the KVM clock is restored as a function of the guest TSC, rather than based on realtime/CLOCK_MONOTONIC_RAW/etc. So even though a new masterclock reference is taken in the new kernel (and the scaling factors in the pvclock may be slightly different, as we discussed in the comment you responded to), the ka->kvmclock_offset is adjusted so that the value of the KVM clock at *that* moment when the new reference is taken, is precisely what it would have been under the old kvmclock regime for the contemporary guest TSC value. > Before: > > this_tsc_khz = __this_cpu_read(cpu_tsc_khz); > if (unlikely(vcpu->hw_tsc_khz != this_tsc_khz)) { > tgt_tsc_khz = vcpu->virtual_tsc_khz; > kvm_get_time_scale(NSEC_PER_SEC / 1000, tgt_tsc_khz, > &vcpu->hv_clock.tsc_shift, > &vcpu->hv_clock.tsc_to_system_mul); > vcpu->hw_tsc_khz = this_tsc_khz; > > After: > > tgt_tsc_khz = __this_cpu_read(cpu_tsc_khz); > > // tgt_tsc_khz unchanged because TSC scaling was not enabled > tgt_tsc_khz = kvm_scale_tsc(v, tgt_tsc_khz); > > if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) { > kvm_get_time_scale(NSEC_PER_SEC / 1000, tgt_tsc_khz, > &vcpu->hv_clock.tsc_shift, > &vcpu->hv_clock.tsc_to_system_mul); > vcpu->hw_tsc_khz = tgt_tsc_khz; > > So in the first case kvm_get_time_scale uses vcpu->virtual_tsc_khz, in > the second case it uses __this_cpu_read(cpu_tsc_khz). > > This then caused a mismatch between the actual guest frequency and > what is used by kvm_guest_time_update, which only becomes visible when > migration resets the clock with KVM_GET/SET_CLOCK. KVM_GET_CLOCK > returns what _should have been_ the same value read by the guest, but > it's wrong. Hm? Until I fixed it in this series, KVM_GET_CLOCK didn't even *try* scaling via the guest's TSC frequency, did it? It just converted from the *host* TSC to nanoseconds (since master_kernel_now) directly. Which was even *more* broken :)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 01c69840647e..8440c4081727 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -887,7 +887,7 @@ struct kvm_vcpu_arch { gpa_t time; struct pvclock_vcpu_time_info hv_clock; - unsigned int hw_tsc_khz; + unsigned int hw_tsc_hz; struct gfn_to_pfn_cache pv_time; /* set guest stopped flag in pvclock flags field */ bool pvclock_set_guest_stopped_request; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 2d2619d3eee4..23281c508c27 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3215,7 +3215,8 @@ static void kvm_setup_guest_pvclock(struct kvm_vcpu *v, static int kvm_guest_time_update(struct kvm_vcpu *v) { - unsigned long flags, tgt_tsc_khz; + unsigned long flags; + uint64_t tgt_tsc_hz; unsigned seq; struct kvm_vcpu_arch *vcpu = &v->arch; struct kvm_arch *ka = &v->kvm->arch; @@ -3252,8 +3253,8 @@ static int kvm_guest_time_update(struct kvm_vcpu *v) /* Keep irq disabled to prevent changes to the clock */ local_irq_save(flags); - tgt_tsc_khz = get_cpu_tsc_khz(); - if (unlikely(tgt_tsc_khz == 0)) { + tgt_tsc_hz = get_cpu_tsc_khz() * 1000LL; + if (unlikely(tgt_tsc_hz == 0)) { local_irq_restore(flags); kvm_make_request(KVM_REQ_CLOCK_UPDATE, v); return 1; @@ -3288,14 +3289,14 @@ static int kvm_guest_time_update(struct kvm_vcpu *v) /* With all the info we got, fill in the values */ if (kvm_caps.has_tsc_control) - tgt_tsc_khz = kvm_scale_tsc(tgt_tsc_khz, - v->arch.l1_tsc_scaling_ratio); + tgt_tsc_hz = kvm_scale_tsc(tgt_tsc_hz, + v->arch.l1_tsc_scaling_ratio); - if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) { - kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL, + if (unlikely(vcpu->hw_tsc_hz != tgt_tsc_hz)) { + kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_hz, &vcpu->hv_clock.tsc_shift, &vcpu->hv_clock.tsc_to_system_mul); - vcpu->hw_tsc_khz = tgt_tsc_khz; + vcpu->hw_tsc_hz = tgt_tsc_hz; kvm_xen_update_tsc_info(v); } diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c index 5a83a8154b79..014048c22652 100644 --- a/arch/x86/kvm/xen.c +++ b/arch/x86/kvm/xen.c @@ -2273,7 +2273,7 @@ void kvm_xen_update_tsc_info(struct kvm_vcpu *vcpu) entry = kvm_find_cpuid_entry_index(vcpu, function, 2); if (entry) - entry->eax = vcpu->arch.hw_tsc_khz; + entry->eax = vcpu->arch.hw_tsc_hz / 1000; } void kvm_xen_init_vm(struct kvm *kvm)