diff mbox series

[02/10] KVM: x86: Improve accuracy of KVM clock when TSC scaling is in force

Message ID 20240418193528.41780-3-dwmw2@infradead.org (mailing list archive)
State New
Headers show
Series Cleaning up the KVM clock mess | expand

Commit Message

David Woodhouse April 18, 2024, 7:34 p.m. UTC
From: David Woodhouse <dwmw@amazon.co.uk>

The kvm_guest_time_update() function scales the host TSC frequency to
the guest's using kvm_scale_tsc() and the v->arch.l1_tsc_scaling_ratio
scaling ratio previously calculated for that vCPU. Then calcuates the
scaling factors for the KVM clock itself based on that guest TSC
frequency.

However, it uses kHz as the unit when scaling, and then multiplies by
1000 only at the end.

With a host TSC frequency of 3000MHz and a guest set to 2500MHz, the
result of kvm_scale_tsc() will actually come out at 2,499,999kHz. So
the KVM clock advertised to the guest is based on a frequency of
2,499,999,000 Hz.

By using Hz as the unit from the beginning, the KVM clock would be based
on a more accurate frequency of 2,499,999,999 Hz in this example.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Fixes: 78db6a503796 ("KVM: x86: rewrite handling of scaled TSC for kvmclock")
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/x86.c              | 17 +++++++++--------
 arch/x86/kvm/xen.c              |  2 +-
 3 files changed, 11 insertions(+), 10 deletions(-)

Comments

Paul Durrant April 19, 2024, 3:29 p.m. UTC | #1
On 18/04/2024 20:34, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> The kvm_guest_time_update() function scales the host TSC frequency to
> the guest's using kvm_scale_tsc() and the v->arch.l1_tsc_scaling_ratio
> scaling ratio previously calculated for that vCPU. Then calcuates the
> scaling factors for the KVM clock itself based on that guest TSC
> frequency.
> 
> However, it uses kHz as the unit when scaling, and then multiplies by
> 1000 only at the end.
> 
> With a host TSC frequency of 3000MHz and a guest set to 2500MHz, the
> result of kvm_scale_tsc() will actually come out at 2,499,999kHz. So
> the KVM clock advertised to the guest is based on a frequency of
> 2,499,999,000 Hz.
> 
> By using Hz as the unit from the beginning, the KVM clock would be based
> on a more accurate frequency of 2,499,999,999 Hz in this example.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> Fixes: 78db6a503796 ("KVM: x86: rewrite handling of scaled TSC for kvmclock")
> ---
>   arch/x86/include/asm/kvm_host.h |  2 +-
>   arch/x86/kvm/x86.c              | 17 +++++++++--------
>   arch/x86/kvm/xen.c              |  2 +-
>   3 files changed, 11 insertions(+), 10 deletions(-)
> 

Reviewed-by: Paul Durrant <paul@xen.org>
Paolo Bonzini April 22, 2024, 12:22 p.m. UTC | #2
On Thu, Apr 18, 2024 at 9:51 PM David Woodhouse <dwmw2@infradead.org> wrote:
>         gpa_t time;
>         struct pvclock_vcpu_time_info hv_clock;
> -       unsigned int hw_tsc_khz;
> +       unsigned int hw_tsc_hz;

Why not change this to u64? 4.3 GHz is scarily close to current
processors, though I expect that it will break a lot more software
than just KVM.

>  static int kvm_guest_time_update(struct kvm_vcpu *v)
>  {
> -       unsigned long flags, tgt_tsc_khz;
> +       unsigned long flags;
> +       uint64_t tgt_tsc_hz;

... especially considering that you did use a 64-bit integer here
(though---please use u64 not uint64_t; and BTW if you want to add a
patch to change kvm_get_time_scale() to u64, please do.

Thanks,

Paolo
David Woodhouse April 22, 2024, 3:39 p.m. UTC | #3
On Mon, 2024-04-22 at 14:22 +0200, Paolo Bonzini wrote:
> On Thu, Apr 18, 2024 at 9:51 PM David Woodhouse <dwmw2@infradead.org> wrote:
> >          gpa_t time;
> >          struct pvclock_vcpu_time_info hv_clock;
> > -       unsigned int hw_tsc_khz;
> > +       unsigned int hw_tsc_hz;
> 
> Why not change this to u64? 4.3 GHz is scarily close to current
> processors, though I expect that it will break a lot more software
> than just KVM.

I'm not sure that just changing the variable is sufficient. I'm
concerned that we may still have places (like kvm_get_time_scale(),
although it's hard to tell because it's entirely uncommented!) which
assume that they can multiply the value by *any* 32-bit number and
it'll still fit in 64 bits.

> >   static int kvm_guest_time_update(struct kvm_vcpu *v)
> >   {
> > -       unsigned long flags, tgt_tsc_khz;
> > +       unsigned long flags;
> > +       uint64_t tgt_tsc_hz;
> 
> ... especially considering that you did use a 64-bit integer here
> (though---please use u64 not uint64_t; and BTW if you want to add a
> patch to change kvm_get_time_scale() to u64, please do.

Meh, I'm used to programming in C. Yes, I *am* old enough to have been
doing this since the last decade of the 1900s, but it *has* been a long
time since 1999, and my fingers have learned :)

That said, although I *do* write in C by default, where I'm editing
functions which already have that old anachronistic crap, I generally
manage to go back and change my additions from proper C to match their
surroundings. And I suppose there *are* some of those awful u64/s64
types already in kvm_guest_time_update(), so I would normally have done
so. I'll change that in my tree.

No way I'm regressing kvm_get_time_scale() myself though :)

Heh, looks like it was you who made it uint64_t, in 2016. In a commit
(3ae13faac) which said "Prepare for improving the precision in the next
patch"... which never came, AFAICT?
Paolo Bonzini April 22, 2024, 3:54 p.m. UTC | #4
On Mon, Apr 22, 2024 at 5:39 PM David Woodhouse <dwmw2@infradead.org> wrote:

> > ... especially considering that you did use a 64-bit integer here
> > (though---please use u64 not uint64_t; and BTW if you want to add a
> > patch to change kvm_get_time_scale() to u64, please do.
>
> Meh, I'm used to programming in C. Yes, I *am* old enough to have been
> doing this since the last decade of the 1900s, but it *has* been a long
> time since 1999, and my fingers have learned :)

Oh, I am on the same page (working on both QEMU and Linux, adapting my
muscle memory to the context sucks) but u64/s64 is the preferred
spelling and I have been asked to use them before.

> Heh, looks like it was you who made it uint64_t, in 2016. In a commit
> (3ae13faac) which said "Prepare for improving the precision in the next
> patch"... which never came, AFAICT?

Yes, it was posted as
https://lore.kernel.org/lkml/1454944711-33022-5-git-send-email-pbonzini@redhat.com/
but not committed.

As an aside, we discovered later that the patch you list as "Fixes"
fixed another tricky bug: before, kvmclock could jump if the TSC is
set within the 250 ppm tolerance that does not activate TSC scaling.
This is possible after a first live migration, and then the second
live migration used the guest TSC frequency *that userspace desired*
instead of the *actual* TSC frequency.

Before:

        this_tsc_khz = __this_cpu_read(cpu_tsc_khz);
        if (unlikely(vcpu->hw_tsc_khz != this_tsc_khz)) {
                tgt_tsc_khz = vcpu->virtual_tsc_khz;
                kvm_get_time_scale(NSEC_PER_SEC / 1000, tgt_tsc_khz,
                        &vcpu->hv_clock.tsc_shift,
                        &vcpu->hv_clock.tsc_to_system_mul);
                vcpu->hw_tsc_khz = this_tsc_khz;

After:

        tgt_tsc_khz = __this_cpu_read(cpu_tsc_khz);

        // tgt_tsc_khz unchanged because TSC scaling was not enabled
        tgt_tsc_khz = kvm_scale_tsc(v, tgt_tsc_khz);

        if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) {
                kvm_get_time_scale(NSEC_PER_SEC / 1000, tgt_tsc_khz,
                        &vcpu->hv_clock.tsc_shift,
                        &vcpu->hv_clock.tsc_to_system_mul);
                vcpu->hw_tsc_khz = tgt_tsc_khz;

So in the first case kvm_get_time_scale uses vcpu->virtual_tsc_khz, in
the second case it uses __this_cpu_read(cpu_tsc_khz).

This then caused a mismatch between the actual guest frequency and
what is used by kvm_guest_time_update, which only becomes visible when
migration resets the clock with KVM_GET/SET_CLOCK. KVM_GET_CLOCK
returns what _should have been_ the same value read by the guest, but
it's wrong.

Paolo
David Woodhouse April 22, 2024, 4:44 p.m. UTC | #5
On Mon, 2024-04-22 at 17:54 +0200, Paolo Bonzini wrote:
> On Mon, Apr 22, 2024 at 5:39 PM David Woodhouse <dwmw2@infradead.org> wrote:
> 
> > > ... especially considering that you did use a 64-bit integer here
> > > (though---please use u64 not uint64_t; and BTW if you want to add a
> > > patch to change kvm_get_time_scale() to u64, please do.
> > 
> > Meh, I'm used to programming in C. Yes, I *am* old enough to have been
> > doing this since the last decade of the 1900s, but it *has* been a long
> > time since 1999, and my fingers have learned :)
> 
> Oh, I am on the same page (working on both QEMU and Linux, adapting my
> muscle memory to the context sucks) but u64/s64 is the preferred
> spelling and I have been asked to use them before.

Ever heard of Jury Nullification... ? :)

> > Heh, looks like it was you who made it uint64_t, in 2016. In a commit
> > (3ae13faac) which said "Prepare for improving the precision in the next
> > patch"... which never came, AFAICT?
> 
> Yes, it was posted as
> https://lore.kernel.org/lkml/1454944711-33022-5-git-send-email-pbonzini@redhat.com/
> but not committed.

Ah, thanks. So that isn't about arithmetic precision, but about dealing
with the mess we had where the KVM clock was afflicted by NTP skew.

We live in a much saner world now where it's simply based on the guest
TSC (or, in pathological cases, the host's CLOCK_MONOTONIC_RAW. 

> As an aside, we discovered later that the patch you list as "Fixes"
> fixed another tricky bug: before, kvmclock could jump if the TSC is
> set within the 250 ppm tolerance that does not activate TSC scaling.
> This is possible after a first live migration, and then the second
> live migration used the guest TSC frequency *that userspace desired*
> instead of the *actual* TSC frequency.

Given our saner world where the KVM clock now *isn't* adjusted for NTP
skew, that "bug" was probably better left unfixed. In fact, I may give
some thought to reverting commit 78db6a503796 completely.

Perhaps we should call that "definition E". I think we're up to five
now? But no, let's not add historical ones to the taxonomy :)

I believe Jack's KVM_SET_CLOCK_GUEST fixes the fundamental issue there
of the clock jumping on migration. It's just a special case of the
breakage in KVM_REQ_MASTERCLOCK_UPDATE, where the KVM clock has happily
been running as a direct function of the guest TSC, and when we yank it
back to some other definition when we create a new masterclock
reference point.

With KVM_SET_CLOCK_GUEST, the KVM clock is restored as a function of
the guest TSC, rather than based on realtime/CLOCK_MONOTONIC_RAW/etc.

So even though a new masterclock reference is taken in the new kernel
(and the scaling factors in the pvclock may be slightly different, as
we discussed in the comment you responded to), the ka->kvmclock_offset
is adjusted so that the value of the KVM clock at *that* moment when
the new reference is taken, is precisely what it would have been under
the old kvmclock regime for the contemporary guest TSC value.




> Before:
> 
>         this_tsc_khz = __this_cpu_read(cpu_tsc_khz);
>         if (unlikely(vcpu->hw_tsc_khz != this_tsc_khz)) {
>                 tgt_tsc_khz = vcpu->virtual_tsc_khz;
>                 kvm_get_time_scale(NSEC_PER_SEC / 1000, tgt_tsc_khz,
>                         &vcpu->hv_clock.tsc_shift,
>                         &vcpu->hv_clock.tsc_to_system_mul);
>                 vcpu->hw_tsc_khz = this_tsc_khz;
> 
> After:
> 
>         tgt_tsc_khz = __this_cpu_read(cpu_tsc_khz);
> 
>         // tgt_tsc_khz unchanged because TSC scaling was not enabled
>         tgt_tsc_khz = kvm_scale_tsc(v, tgt_tsc_khz);
> 
>         if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) {
>                 kvm_get_time_scale(NSEC_PER_SEC / 1000, tgt_tsc_khz,
>                         &vcpu->hv_clock.tsc_shift,
>                         &vcpu->hv_clock.tsc_to_system_mul);
>                 vcpu->hw_tsc_khz = tgt_tsc_khz;
> 
> So in the first case kvm_get_time_scale uses vcpu->virtual_tsc_khz, in
> the second case it uses __this_cpu_read(cpu_tsc_khz).
> 
> This then caused a mismatch between the actual guest frequency and
> what is used by kvm_guest_time_update, which only becomes visible when
> migration resets the clock with KVM_GET/SET_CLOCK. KVM_GET_CLOCK
> returns what _should have been_ the same value read by the guest, but
> it's wrong.

Hm? Until I fixed it in this series, KVM_GET_CLOCK didn't even *try*
scaling via the guest's TSC frequency, did it? It just converted from
the *host* TSC to nanoseconds (since master_kernel_now) directly. Which
was even *more* broken :)
diff mbox series

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 01c69840647e..8440c4081727 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -887,7 +887,7 @@  struct kvm_vcpu_arch {
 
 	gpa_t time;
 	struct pvclock_vcpu_time_info hv_clock;
-	unsigned int hw_tsc_khz;
+	unsigned int hw_tsc_hz;
 	struct gfn_to_pfn_cache pv_time;
 	/* set guest stopped flag in pvclock flags field */
 	bool pvclock_set_guest_stopped_request;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2d2619d3eee4..23281c508c27 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3215,7 +3215,8 @@  static void kvm_setup_guest_pvclock(struct kvm_vcpu *v,
 
 static int kvm_guest_time_update(struct kvm_vcpu *v)
 {
-	unsigned long flags, tgt_tsc_khz;
+	unsigned long flags;
+	uint64_t tgt_tsc_hz;
 	unsigned seq;
 	struct kvm_vcpu_arch *vcpu = &v->arch;
 	struct kvm_arch *ka = &v->kvm->arch;
@@ -3252,8 +3253,8 @@  static int kvm_guest_time_update(struct kvm_vcpu *v)
 
 	/* Keep irq disabled to prevent changes to the clock */
 	local_irq_save(flags);
-	tgt_tsc_khz = get_cpu_tsc_khz();
-	if (unlikely(tgt_tsc_khz == 0)) {
+	tgt_tsc_hz = get_cpu_tsc_khz() * 1000LL;
+	if (unlikely(tgt_tsc_hz == 0)) {
 		local_irq_restore(flags);
 		kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
 		return 1;
@@ -3288,14 +3289,14 @@  static int kvm_guest_time_update(struct kvm_vcpu *v)
 	/* With all the info we got, fill in the values */
 
 	if (kvm_caps.has_tsc_control)
-		tgt_tsc_khz = kvm_scale_tsc(tgt_tsc_khz,
-					    v->arch.l1_tsc_scaling_ratio);
+		tgt_tsc_hz = kvm_scale_tsc(tgt_tsc_hz,
+					   v->arch.l1_tsc_scaling_ratio);
 
-	if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) {
-		kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
+	if (unlikely(vcpu->hw_tsc_hz != tgt_tsc_hz)) {
+		kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_hz,
 				   &vcpu->hv_clock.tsc_shift,
 				   &vcpu->hv_clock.tsc_to_system_mul);
-		vcpu->hw_tsc_khz = tgt_tsc_khz;
+		vcpu->hw_tsc_hz = tgt_tsc_hz;
 		kvm_xen_update_tsc_info(v);
 	}
 
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 5a83a8154b79..014048c22652 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -2273,7 +2273,7 @@  void kvm_xen_update_tsc_info(struct kvm_vcpu *vcpu)
 
 	entry = kvm_find_cpuid_entry_index(vcpu, function, 2);
 	if (entry)
-		entry->eax = vcpu->arch.hw_tsc_khz;
+		entry->eax = vcpu->arch.hw_tsc_hz / 1000;
 }
 
 void kvm_xen_init_vm(struct kvm *kvm)