
pvclock time drifting backward

Message ID: facda6e2-3655-4f2c-9013-ebb18d0e6972@gmail.com (mailing list archive)
State: New, archived
Series: pvclock time drifting backward

Commit Message

Ming Lin March 20, 2025, 5:38 a.m. UTC
Hi,

After performing a live migration on a QEMU guest OS that had been running for over 30 days,
we noticed that the guest OS time was more than 2 seconds behind the actual time.

After extensive debugging, we found that this issue is related to master_kernel_ns and master_cycle_now.

When the guest OS starts, the host initializes a pair of master_kernel_ns and master_cycle_now values.
After live migration, the host updates these values.

Our debugging showed that if the host does not update master_kernel_ns/master_cycle_now,
the guest OS time remains correct.
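
For context, the guest's kvmclock is derived from this master pair plus the
per-VM kvmclock_offset, scaled by the TSC delta. Below is a minimal userspace
sketch of the arithmetic (mirroring pvclock_scale_delta()/__pvclock_read_cycles();
the mul/shift constants are illustrative assumptions, not measured values —
kvm_get_time_scale() derives the real pair from cpu_tsc_khz):

#include <stdint.h>
#include <stdio.h>

/* ns contribution of a TSC delta, as pvclock_scale_delta() computes it:
 * (delta << shift) * mul_frac >> 32, done in 128-bit arithmetic. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul_frac, int shift)
{
    if (shift < 0)
        delta >>= -shift;
    else
        delta <<= shift;
    return (uint64_t)(((unsigned __int128)delta * mul_frac) >> 32);
}

int main(void)
{
    /* Sample values taken from the example output below. */
    uint64_t master_kernel_ns = 15119778316ULL;
    uint64_t master_cycle_now = 37225912658ULL;
    int64_t  kvmclock_offset  = -15010550291LL;
    uint64_t tsc              = 4391329912368ULL;

    /* Assumed 2.3 GHz TSC for illustration. */
    uint32_t mul_frac = 1867377085;   /* ~2^32 * (1 GHz / 2.3 GHz) */
    int      shift    = 0;

    /* What __pvclock_read_cycles() returns for this sample. */
    uint64_t ns = master_kernel_ns + kvmclock_offset +
                  scale_delta(tsc - master_cycle_now, mul_frac, shift);
    printf("guest ns: %llu\n", (unsigned long long)ns);
    return 0;
}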

To illustrate how updating master_kernel_ns/master_cycle_now leads to the guest OS time drifting backward,
we applied the following debug patch:

The patch adds a KVM debugfs entry that triggers the time calculations and prints the results.
It runs on the host side, but calls __pvclock_read_cycles() so that the computation matches how the guest OS reads its time.

Example Output:

# cat /sys/kernel/debug/kvm/946-13/pvclock
old: master_kernel_ns: 15119778316
old: master_cycle_now: 37225912658
old: ns: 1893199569691
new: master_kernel_ns: 1908210098649
new: master_cycle_now: 4391329912268
new: ns: 1893199548401

tsc 4391329912368
kvmclock_offset -15010550291
diff: ns: 21290

Explanation of Parameters:

Input:
"old: master_kernel_ns:" The master_kernel_ns value recorded when the guest OS started (remains unchanged during testing).
"old: master_cycle_now:" The master_cycle_now value recorded when the guest OS started (remains unchanged during testing).
"new: master_kernel_ns:" The latest master_kernel_ns value at the time of reading.
"new: master_cycle_now:" The latest master_cycle_now value at the time of reading.
"tsc:" The rdtsc() value at the time of reading.
"kvmclock_offset:" The offset recorded by KVM_SET_CLOCK when the guest OS started (remains unchanged during testing).

Output:
"old: ns:" Time in nanoseconds calculated using the old master_kernel_ns/master_cycle_now.
"new: ns:" Time in nanoseconds calculated using the new master_kernel_ns/master_cycle_now.
"diff: ns:" (old ns - new ns), representing the time drift relative to the guest OS start time.

Test Script:
#!/bin/bash

qemu_pid=$(pidof qemu-system-x86_64)

while [ 1 ] ; do
     echo "====================================="
     echo "Guest OS running time: $(ps -p $qemu_pid -o etime= | awk '{print $1}')"
     cat /sys/kernel/debug/kvm/*/pvclock
     echo
     sleep 10
done
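
(Note: the script assumes a single qemu-system-x86_64 process; if several VMs are running, pidof returns multiple PIDs and the etime and per-VM pvclock outputs no longer line up.)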

Test Results:
Below are the first and last parts of a >2-hour test run.
As time progresses, the time drift calculated using the latest master_kernel_ns/master_cycle_now increases monotonically.

After 2 hours and 18 minutes, the guest OS time drifted by approximately 93 microseconds.
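(93,114 ns of drift over the 8,282 s of uptime at the last sample works out to roughly 11 ns per second, i.e. a frequency difference of about 0.011 ppm between the two calculations.)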

I have uploaded an image for a more intuitive visualization of the time drift:
https://postimg.cc/crCDWtD7

Is this a real problem?

If there is any fix patch, I’d be happy to test it. Thanks!


     1 =====================================
     2 guest os running time: 00:50
     3 old: master_kernel_ns: 15119778316
     4 old: master_cycle_now: 37225912658
     5 old: ns: 48092694964
     6 new: master_kernel_ns: 63103244699
     7 new: master_cycle_now: 147587790614
     8 new: ns: 48092694425
     9
    10 tsc 147587790654
    11 kvmclock_offset -15010550291
    12 diff: ns: 539
    13
    14 =====================================
    15 guest os running time: 01:00
    16 old: master_kernel_ns: 15119778316
    17 old: master_cycle_now: 37225912658
    18 old: ns: 58139026532
    19 new: master_kernel_ns: 73149576143
    20 new: master_cycle_now: 170694333104
    21 new: ns: 58139025879
    22
    23 tsc 170694333168
    24 kvmclock_offset -15010550291
    25 diff: ns: 653
    26
    27 =====================================
    28 guest os running time: 01:10
    29 old: master_kernel_ns: 15119778316
    30 old: master_cycle_now: 37225912658
    31 old: ns: 68183772122
    32 new: master_kernel_ns: 83194321616
    33 new: master_cycle_now: 193797227862
    34 new: ns: 68183771357
    35
    36 tsc 193797227936
    37 kvmclock_offset -15010550291
    38 diff: ns: 765
    39
    40 =====================================
    41 guest os running time: 01:20
    42 old: master_kernel_ns: 15119778316
    43 old: master_cycle_now: 37225912658
    44 old: ns: 78225289157
    45 new: master_kernel_ns: 93235838545
    46 new: master_cycle_now: 216892696976
    47 new: ns: 78225288279
    48
    49 tsc 216892697034
    50 kvmclock_offset -15010550291
    51 diff: ns: 878
    52
    53 =====================================
    54 guest os running time: 01:30
    55 old: master_kernel_ns: 15119778316
    56 old: master_cycle_now: 37225912658
    57 old: ns: 88268955340
    58 new: master_kernel_ns: 103279504612
    59 new: master_cycle_now: 239993109102
    60 new: ns: 88268954349
    61
    62 tsc 239993109168
    63 kvmclock_offset -15010550291
    64 diff: ns: 991
    65
    66 =====================================
    67 guest os running time: 01:40
    68 old: master_kernel_ns: 15119778316
    69 old: master_cycle_now: 37225912658
    70 old: ns: 98313212581
    71 new: master_kernel_ns: 113323761740
    72 new: master_cycle_now: 263094880668
    73 new: ns: 98313211476
    74
    75 tsc 263094880732
    76 kvmclock_offset -15010550291
    77 diff: ns: 1105
.....
.....
10160 =====================================
10161 guest os running time: 02:17:11
10162 old: master_kernel_ns: 15119778316
10163 old: master_cycle_now: 37225912658
10164 old: ns: 8229817213297
10165 new: master_kernel_ns: 8244827670997
10166 new: master_cycle_now: 18965537819524
10167 new: ns: 8229817120748
10168
10169 tsc 18965537819622
10170 kvmclock_offset -15010550291
10171 diff: ns: 92549
10172 =====================================
10173 guest os running time: 02:17:21
10174 old: master_kernel_ns: 15119778316
10175 old: master_cycle_now: 37225912658
10176 old: ns: 8239861074959
10177 new: master_kernel_ns: 8254871532564
10178 new: master_cycle_now: 18988638681302
10179 new: ns: 8239860982297
10180
10181 tsc 18988638681358
10182 kvmclock_offset -15010550291
10183 diff: ns: 92662
10184 =====================================
10185 guest os running time: 02:17:31
10186 old: master_kernel_ns: 15119778316
10187 old: master_cycle_now: 37225912658
10188 old: ns: 8249904622988
10189 new: master_kernel_ns: 8264915080459
10190 new: master_cycle_now: 19011738821632
10191 new: ns: 8249904530213
10192
10193 tsc 19011738821736
10194 kvmclock_offset -15010550291
10195 diff: ns: 92775
10196 =====================================
10197 guest os running time: 02:17:41
10198 old: master_kernel_ns: 15119778316
10199 old: master_cycle_now: 37225912658
10200 old: ns: 8259949369203
10201 new: master_kernel_ns: 8274959826576
10202 new: master_cycle_now: 19034841717872
10203 new: ns: 8259949276315
10204
10205 tsc 19034841717942
10206 kvmclock_offset -15010550291
10207 diff: ns: 92888
10208 =====================================
10209 guest os running time: 02:17:51
10210 old: master_kernel_ns: 15119778316
10211 old: master_cycle_now: 37225912658
10212 old: ns: 8269996849598
10213 new: master_kernel_ns: 8285007306846
10214 new: master_cycle_now: 19057950902658
10215 new: ns: 8269996756597
10216
10217 tsc 19057950902756
10218 kvmclock_offset -15010550291
10219 diff: ns: 93001
10220 =====================================
10221 guest os running time: 02:18:02
10222 old: master_kernel_ns: 15119778316
10223 old: master_cycle_now: 37225912658
10224 old: ns: 8280039094317
10225 new: master_kernel_ns: 8295049551453
10226 new: master_cycle_now: 19081048045430
10227 new: ns: 8280039001203
10228
10229 tsc 19081048045526
10230 kvmclock_offset -15010550291
10231 diff: ns: 93114



     pvclock debugfs patch
---
  arch/x86/include/asm/kvm_host.h |  4 +++
  arch/x86/kvm/x86.c              | 29 +++++++++++++++-
  b.sh                            |  1 +
  virt/kvm/kvm_main.c             | 75 +++++++++++++++++++++++++++++++++++++++++
  4 files changed, 108 insertions(+), 1 deletion(-)

Comments

Sean Christopherson March 24, 2025, 10:54 p.m. UTC | #1
+David

On Wed, Mar 19, 2025, Ming Lin wrote:
> Hi,
> 
> After performing a live migration on a QEMU guest OS that had been running for over 30 days,
> we noticed that the guest OS time was more than 2 seconds behind the actual time.
> 
> After extensive debugging, we found that this issue is related to master_kernel_ns and master_cycle_now.
> 
> When the guest OS starts, the host initializes a pair of master_kernel_ns and master_cycle_now values.
> After live migration, the host updates these values.
> 
> Our debugging showed that if the host does not update master_kernel_ns/master_cycle_now,
> the guest OS time remains correct.
> 
> To illustrate how updating master_kernel_ns/master_cycle_now leads to the guest OS time drifting backward,
> we applied the following debug patch:
> 
> [... debug patch description, example output, parameter explanation, and test script trimmed; see the original message above ...]
> 
> Is this a real problem?

David can confirm, but I'm pretty sure the drift you are observing is addressed
by David's series to fix a plethora of kvmclock warts.

https://lore.kernel.org/all/20240522001817.619072-1-dwmw2@infradead.org

> If there is any fix patch, I’d be happy to test it. Thanks!
> 
> [... full test log and debug patch trimmed; the patch appears in full below ...]
David Woodhouse March 25, 2025, 11:07 a.m. UTC | #2
On Mon, 2025-03-24 at 15:54 -0700, Sean Christopherson wrote:
> 
> David can confirm, but I'm pretty sure the drift you are observing is addressed
> by David's series to fix a plethora of kvmclock warts.
> 
> https://lore.kernel.org/all/20240522001817.619072-1-dwmw2@infradead.org

Yes, that looks like exactly the problem my series is addressing. We
shouldn't update the pvclock so often. And if/when we do, we shouldn't
clamp it to some *other* clock which progresses at a different rate to
the pvclock, because that causes steps in the pvclock.
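
(Concretely: if the pvclock runs at a relative rate (1 + e) against the
clock it is clamped to, then a clamp after running time T steps the guest
clock by roughly e*T.)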
Ming Lin March 26, 2025, 3:54 p.m. UTC | #3
On Tue, Mar 25, 2025 at 4:07 AM David Woodhouse <dwmw2@infradead.org> wrote:
>
> On Mon, 2025-03-24 at 15:54 -0700, Sean Christopherson wrote:
> >
> > David can confirm, but I'm pretty sure the drift you are observing is addressed
> > by David's series to fix a plethora of kvmclock warts.
> >
> > https://lore.kernel.org/all/20240522001817.619072-1-dwmw2@infradead.org
>
> Yes, that looks like exactly the problem my series is addressing. We
> shouldn't update the pvclock so often. And if/when we do, we shouldn't
> clamp it to some *other* clock which progresses at a different rate to
> the pvclock, because that causes steps in the pvclock.

I applied the patch series on top of 6.9 cleanly and tested it with my
debug tool patch.
But it seems the time drift still increased monotonically.

Could you take a look and check whether the tool patch makes sense?
https://github.com/minggr/linux/commit/5284a211b6bdc9f9041b669539558a6a858e88d0

The tool patch adds a KVM debugfs entry to trigger time calculations
and print the results.
See my first email for more detail.

Test Script:
#!/bin/bash

qemu_pid=$(pidof qemu-system-x86_64)

while [ 1 ] ; do
     echo "====================================="
     echo "Guest OS running time: $(ps -p $qemu_pid -o etime= | awk
'{print $1}')"
     cat /sys/kernel/debug/kvm/*/pvclock
     echo
     sleep 10
done

Thanks,
Ming
David Woodhouse March 27, 2025, 8:10 a.m. UTC | #4
On Wed, 2025-03-26 at 08:54 -0700, Ming Lin wrote:
> I applied the patch series on top of 6.9 cleanly and tested it with my
> debug tool patch.
> But it seems the time drift still increased monotonically.
> 
> Would you help take a look if the tool patch makes sense?
> https://github.com/minggr/linux/commit/5284a211b6bdc9f9041b669539558a6a858e88d0
> 
> The tool patch adds a KVM debugfs entry to trigger time calculations
> and print the results.
> See my first email for more detail.

Your first message seemed to say that the problem occurred with live
migration. This message says "the time drift still increased
monotonically". 

Trying to make sure I fully understand... the time drift between the
host's CLOCK_MONOTONIC_RAW and the guest's kvmclock increases
monotonically *but* the guest only observes the change when its
master_kernel_ns/master_cycle_now are updated (e.g. on live migration)
and its kvmclock is reset back to the host's CLOCK_MONOTONIC_RAW?

Is this live migration from one VMM to another on the same host, so we
don't have to worry about the accuracy of the TSC itself? The guest TSC
remains consistent? And presumably your host does *have* a stable TSC,
and the guest's test case really ought to be checking the
PVCLOCK_TSC_STABLE_BIT to make sure of that?
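
For reference, that check is just a flags test on the guest's shared pvclock
page; a sketch, assuming the struct layout from the kernel's pvclock-abi.h:

#include <stdint.h>

struct pvclock_vcpu_time_info {
    uint32_t version;
    uint32_t pad0;
    uint64_t tsc_timestamp;
    uint64_t system_time;
    uint32_t tsc_to_system_mul;
    int8_t   tsc_shift;
    uint8_t  flags;
    uint8_t  pad[2];
} __attribute__((__packed__));

#define PVCLOCK_TSC_STABLE_BIT (1 << 0)

/* Set by KVM when the master clock is in use (host TSC is stable), so
 * pvclock readings are consistent across vCPUs without guest fixups. */
static int pvclock_tsc_stable(const struct pvclock_vcpu_time_info *ti)
{
    return ti->flags & PVCLOCK_TSC_STABLE_BIT;
}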

If all the above assumptions/interpretations of mine are true, I still
think it's expected that your clock will jump on live migration
*unless* you also taught your VMM to use the new KVM_[GS]ET_CLOCK_GUEST
ioctls which were added in my patch series, specifically to preserve
the mathematical relationship between guest TSC and kvmclock across a
migration.
Ming Lin March 28, 2025, 6:30 p.m. UTC | #5
On Thu, Mar 27, 2025 at 1:10 AM David Woodhouse <dwmw2@infradead.org> wrote:
>
> On Wed, 2025-03-26 at 08:54 -0700, Ming Lin wrote:
> > I applied the patch series on top of 6.9 cleanly and tested it with my
> > debug tool patch.
> > But it seems the time drift still increased monotonically.
> >
> > Would you help take a look if the tool patch makes sense?
> > https://github.com/minggr/linux/commit/5284a211b6bdc9f9041b669539558a6a858e88d0
> >
> > The tool patch adds a KVM debugfs entry to trigger time calculations
> > and print the results.
> > See my first email for more detail.
>
> Your first message seemed to say that the problem occurred with live
> migration. This message says "the time drift still increased
> monotonically".

Yes, we discovered this issue in our production environment, where the time
inside the guest OS fell behind by more than 2 seconds. The problem occurred
both during local live upgrades and remote live migrations.
However, the issue only becomes noticeable after the guest OS has been
running for a long time, typically over 30 days.
Since 30 days is too long to wait, I wrote a debugfs tool to quickly reproduce
the original issue, but now I'm not sure if the tool is working correctly.

>
> Trying to make sure I fully understand... the time drift between the
> host's CLOCK_MONOTONIC_RAW and the guest's kvmclock increases
> monotonically *but* the guest only observes the change when its
> master_kernel_ns/master_cycle_now are updated (e.g. on live migration)
> and its kvmclock is reset back to the host's CLOCK_MONOTONIC_RAW?

Yes, we are using the 5.4 kernel and have verified that the guest OS time
remains correct after live upgrades/migrations, as long as
master_kernel_ns / master_cycle_now are not updated
(i.e., if the old master_kernel_ns / master_cycle_now values are retained).

>
> Is this live migration from one VMM to another on the same host, so we
> don't have to worry about the accuracy of the TSC itself? The guest TSC
> remains consistent? And presumably your host does *have* a stable TSC,
> and the guest's test case really ought to be checking the
> PVCLOCK_TSC_STABLE_BIT to make sure of that?

The live migration is from one VMM to another on a remote host, and we
have also observed the same issue during live upgrades on the same host.

>
> If all the above assumptions/interpretations of mine are true, I still
> think it's expected that your clock will jump on live migration
> *unless* you also taught your VMM to use the new KVM_[GS]ET_CLOCK_GUEST
> ioctls which were added in my patch series, specifically to preserve
> the mathematical relationship between guest TSC and kvmclock across a
> migration.
>

We are planning to test the patches on a 6.9 kernel (where they can be
applied cleanly) and modify the live upgrade/migration code to use the new
KVM_[GS]ET_CLOCK_GUEST ioctls.

BTW, what is the plan for upstreaming these patches?

Thanks,
Ming
David Woodhouse March 29, 2025, 8:20 a.m. UTC | #6
On Fri, 2025-03-28 at 11:30 -0700, Ming Lin wrote:
> 
> > 
> > Is this live migration from one VMM to another on the same host, so we
> > don't have to worry about the accuracy of the TSC itself? The guest TSC
> > remains consistent? And presumably your host does *have* a stable TSC,
> > and the guest's test case really ought to be checking the
> > PVCLOCK_TSC_STABLE_BIT to make sure of that?
> 
> The live migration is from one VMM to another on a remote host, and we
> have also observed the same issue during live upgrades on the same host.

Moving to a remote host also requires that you get the guest TSC to be
reasonably synchronised on the destination. Which is another litany of
sadness, especially if you have TSC scaling in the mix. Or even if your
two "identical" hosts calculated a slightly different TSC frequency
when they measured it at first boot.

In that latter case, the mul/scale factors advertised to the guest in
the pvclock will be *slightly* different on the new host. Your test in
the guest would then fail if it requires that the pvclock be
*identical*. The actual criterion is that the result should be
identical at the time of the live migration (when the TSC frequency
effectively changes).

The code currently requires that the old and new TSC frequencies are
within ±1kHz of each other.
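(For scale, assuming an illustrative 2.5 GHz TSC: a full 1 kHz mismatch is
0.4 ppm, i.e. about 35 ms per day, or roughly a second per month of guest
uptime.)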

> > 
> > If all the above assumptions/interpretations of mine are true, I still
> > think it's expected that your clock will jump on live migration
> > *unless* you also taught your VMM to use the new KVM_[GS]ET_CLOCK_GUEST
> > ioctls which were added in my patch series, specifically to preserve
> > the mathematical relationship between guest TSC and kvmclock across a
> > migration.
> > 
> 
> We are planning to test the patches on a 6.9 kernel (where they can be
> applied cleanly) and modify the live upgrade/migration code to use the new
> KVM_[GS]ET_CLOCK_GUEST ioctls.
> 
> BTW, what is the plan for upstreaming these patches?

I need to find the time to rebase, rework and retest them. You're doing
some of the testing and increasing the motivation... I'll see if I can
get it done in the next week or three.

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 32ae3aa50c7e..5a82a69bfe7a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1411,6 +1411,10 @@  struct kvm_arch {
      struct delayed_work kvmclock_update_work;
      struct delayed_work kvmclock_sync_work;
  
+    u64 old_master_kernel_ns;
+    u64 old_master_cycle_now;
+    s64 old_kvmclock_offset;
+
      struct kvm_xen_hvm_config xen_hvm_config;
  
      /* reads protected by irq_srcu, writes by irq_lock */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4b64ab350bcd..a56511ed8c5b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2819,7 +2819,7 @@  static inline u64 vgettsc(struct pvclock_clock *clock, u64 *tsc_timestamp,
   * As with get_kvmclock_base_ns(), this counts from boot time, at the
   * frequency of CLOCK_MONOTONIC_RAW (hence adding gtos->offs_boot).
   */
-static int do_kvmclock_base(s64 *t, u64 *tsc_timestamp)
+int do_kvmclock_base(s64 *t, u64 *tsc_timestamp)
  {
      struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
      unsigned long seq;
@@ -2861,6 +2861,27 @@  static int do_monotonic(s64 *t, u64 *tsc_timestamp)
      return mode;
  }
  
+u64 mydebug_get_kvmclock_ns(u64 master_kernel_ns, u64 master_cycle_now, s64 kvmclock_offset, u64 tsc)
+{
+        struct pvclock_vcpu_time_info hv_clock;
+        u64 ret;
+
+        hv_clock.tsc_timestamp = master_cycle_now;
+        hv_clock.system_time = master_kernel_ns + kvmclock_offset;
+
+        /* both __this_cpu_read() and rdtsc() should be on the same cpu */
+        get_cpu();
+
+        kvm_get_time_scale(NSEC_PER_SEC, __this_cpu_read(cpu_tsc_khz) * 1000LL,
+                                   &hv_clock.tsc_shift,
+                                   &hv_clock.tsc_to_system_mul);
+        ret = __pvclock_read_cycles(&hv_clock, tsc);
+
+        put_cpu();
+
+        return ret;
+}
+
  static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
  {
      struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
@@ -2988,6 +3009,10 @@  static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
      host_tsc_clocksource = kvm_get_time_and_clockread(
                      &ka->master_kernel_ns,
                      &ka->master_cycle_now);
+    ka->old_master_kernel_ns = ka->master_kernel_ns;
+    ka->old_master_cycle_now = ka->master_cycle_now;
+    printk("MYDEBUG: old_master_kernel_ns = %llu, old_master_cycle_now = %llu\n",
+            ka->old_master_kernel_ns, ka->old_master_cycle_now);
  
      ka->use_master_clock = host_tsc_clocksource && vcpus_matched
                  && !ka->backwards_tsc_observed
@@ -6989,6 +7014,8 @@  static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp)
      else
          now_raw_ns = get_kvmclock_base_ns();
      ka->kvmclock_offset = data.clock - now_raw_ns;
+    ka->old_kvmclock_offset = ka->kvmclock_offset;
+    printk("MYDEBUG: old_kvmclock_offset = %lld\n", ka->old_kvmclock_offset);
      kvm_end_pvclock_update(kvm);
      return 0;
  }
diff --git a/b.sh b/b.sh
new file mode 120000
index 000000000000..0ff9a93fd53f
--- /dev/null
+++ b/b.sh
@@ -0,0 +1 @@ 
+/home/mlin/build.upstream/b.sh
\ No newline at end of file
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ba0327e2d0d3..d6b9a6e7275e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -399,6 +399,7 @@  int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity,
              return mc->nobjs >= min ? 0 : -ENOMEM;
          mc->objects[mc->nobjs++] = obj;
      }
+
      return 0;
  }
  
@@ -998,6 +999,78 @@  static void kvm_destroy_vm_debugfs(struct kvm *kvm)
      }
  }
  
+extern int do_kvmclock_base(s64 *t, u64 *tsc_timestamp);
+extern u64 mydebug_get_kvmclock_ns(u64 master_kernel_ns, u64 master_cycle_now, s64 kvmclock_offset, u64 tsc);
+
+static ssize_t kvm_mydebug_pvclock_read(struct file *file, char __user *buf,
+                                size_t len, loff_t *ppos)
+{
+    struct kvm *kvm = file->private_data;
+    struct kvm_arch *ka;
+    char buffer[512]; /* 256 was occasionally too small and truncated the trailing newline */
+    ssize_t ret, copied;
+    u64 new_master_kernel_ns;
+    u64 new_master_cycle_now;
+    u64 old_ns, new_ns;
+    u64 tsc;
+
+    if (!kvm) {
+        pr_err("file->private_data is NULL\n");
+        return -EINVAL;
+    }
+
+    ka = &kvm->arch;
+
+    do_kvmclock_base((s64 *)&new_master_kernel_ns, &new_master_cycle_now);
+
+    tsc = rdtsc();
+
+    old_ns = mydebug_get_kvmclock_ns(ka->old_master_kernel_ns, ka->old_master_cycle_now, ka->old_kvmclock_offset, tsc);
+    new_ns = mydebug_get_kvmclock_ns(new_master_kernel_ns, new_master_cycle_now, ka->old_kvmclock_offset, tsc);
+
+    ret = snprintf(buffer, sizeof(buffer),
+                   "old: master_kernel_ns: %llu\n"
+                   "old: master_cycle_now: %llu\n"
+                   "old: ns: %llu\n"
+                   "new: master_kernel_ns: %llu\n"
+                   "new: master_cycle_now: %llu\n"
+                   "new: ns: %llu\n\n"
+                   "tsc %llu\n"
+                   "kvmclock_offset %lld\n"
+                   "diff: ns: %lld\n",
+                   ka->old_master_kernel_ns, ka->old_master_cycle_now, old_ns,
+                   new_master_kernel_ns, new_master_cycle_now, new_ns,
+                   tsc, ka->old_kvmclock_offset,
+                  old_ns - new_ns
+                  );
+
+    if (ret < 0)
+        return ret;
+
+    if ((size_t)ret >= sizeof(buffer))
+        ret = sizeof(buffer) - 1; /* don't copy the NUL terminator on truncation */
+
+    if (*ppos >= ret)
+        return 0; /* EOF */
+
+    copied = min(len, (size_t)(ret - *ppos));
+
+    if (copy_to_user(buf, buffer + *ppos, copied)) {
+        pr_err("copy_to_user failed\n");
+        return -EFAULT;
+    }
+
+    *ppos += copied;
+
+    return copied;
+}
+
+static const struct file_operations kvm_pvclock_fops = {
+    .owner = THIS_MODULE,
+    .read = kvm_mydebug_pvclock_read,
+    .open = simple_open,
+};
+
  static int kvm_create_vm_debugfs(struct kvm *kvm, const char *fdname)
  {
      static DEFINE_MUTEX(kvm_debugfs_lock);
@@ -1063,6 +1136,8 @@  static int kvm_create_vm_debugfs(struct kvm *kvm, const char *fdname)
                      &stat_fops_per_vm);
      }
  
+    debugfs_create_file("pvclock", 0444, kvm->debugfs_dentry, kvm, &kvm_pvclock_fops);
+
      kvm_arch_create_vm_debugfs(kvm);
      return 0;
  out_err: