KVM: x86: Refine calculation of guest wall clock to use a single TSC read

Message ID ee446c823002dc92c8ea525f21d00a9f5d27de59.camel@infradead.org (mailing list archive)
State New, archived
Series KVM: x86: Refine calculation of guest wall clock to use a single TSC read

Commit Message

David Woodhouse Oct. 1, 2023, 5:54 p.m. UTC
From: David Woodhouse <dwmw@amazon.co.uk>

When populating the guest's PV wall clock information, KVM currently does
a simple 'ktime_get_real_ns() - get_kvmclock_ns(kvm)'. This is an antipattern
which should be avoided; when working with the relationship between two
clocks, it's never correct to obtain one of them "now" and then the other
at a slightly different "now" after an unspecified period of preemption
(which might not even be under the control of the kernel, if this is an
L1 hosting an L2 guest under nested virtualization).
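
To illustrate the anti-pattern in userspace terms (an analogy only, with
clock_gettime() standing in for the two kernel clocks):

#include <stdio.h>
#include <time.h>

int main(void)
{
	struct timespec rt, mono;

	/* Anti-pattern: the two clocks are sampled at two different
	 * "now"s, so any preemption between the calls goes straight
	 * into the computed offset. */
	clock_gettime(CLOCK_REALTIME, &rt);
	/* ... an arbitrary delay can occur here ... */
	clock_gettime(CLOCK_MONOTONIC, &mono);

	printf("epoch estimate: %lld ns\n",
	       (rt.tv_sec * 1000000000LL + rt.tv_nsec) -
	       (mono.tv_sec * 1000000000LL + mono.tv_nsec));
	return 0;
}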

Add a kvm_get_wall_clock_epoch() function to return the guest wall clock
epoch in nanoseconds using the same method as __get_kvmclock() — by using
kvm_get_walltime_and_clockread() to calculate both the wall clock and KVM
clock time from a *single* TSC reading.

The condition using get_cpu_tsc_khz() is equivalent to the version in
__get_kvmclock(), which separately checks for the CONSTANT_TSC feature or
the per-CPU cpu_tsc_khz value; that is precisely what get_cpu_tsc_khz()
does internally.
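
For reference, the helper looks roughly like this (paraphrased from
arch/x86/kvm/x86.c from memory, not a verbatim quote):

static inline unsigned long get_cpu_tsc_khz(void)
{
	if (static_cpu_has(X86_FEATURE_CONSTANT_TSC))
		return tsc_khz;
	else
		return __this_cpu_read(cpu_tsc_khz);
}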

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
Tested by sticking a printk into it and comparing the values calculated
by the old and new methods, while running the xen_shinfo_test which
keeps relocating the shared info and thus rewriting the PV wall clock
information to it.

They look sane enough but there's still skew over time (in both of
them) as the KVM values get adjusted in presumably similarly sloppy
ways. But we'll get to this. This patch is just the first low-hanging
fruit in a quest to eliminate such sloppiness and get to the point
where we can do live update (pause guests, kexec and resume them again)
with a *single* cycle of skew. After all, the host TSC is still just as
trustworthy so there's no excuse for *anything* changing over the
kexec. Every other clock in the guest should have precisely the *same*
relationship to the host TSC as it did before the kexec.

 arch/x86/kvm/x86.c | 60 ++++++++++++++++++++++++++++++++++++++++------
 arch/x86/kvm/x86.h |  2 ++
 arch/x86/kvm/xen.c |  4 ++--
 3 files changed, 57 insertions(+), 9 deletions(-)

Comments

Dongli Zhang Oct. 2, 2023, 5:14 a.m. UTC | #1
Hi David,

On 10/1/23 10:54, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> When populating the guest's PV wall clock information, KVM currently does
> a simple 'ktime_get_real_ns() - get_kvmclock_ns(kvm)'. This is an antipattern
> which should be avoided; when working with the relationship between two
> clocks, it's never correct to obtain one of them "now" and then the other
> at a slightly different "now" after an unspecified period of preemption
> (which might not even be under the control of the kernel, if this is an
> L1 hosting an L2 guest under nested virtualization).
> 
> Add a kvm_get_wall_clock_epoch() function to return the guest wall clock
> epoch in nanoseconds using the same method as __get_kvmclock() — by using
> kvm_get_walltime_and_clockread() to calculate both the wall clock and KVM
> clock time from a *single* TSC reading.
> 
> The condition using get_cpu_tsc_khz() is equivalent to the version in
> __get_kvmclock(), which separately checks for the CONSTANT_TSC feature or
> the per-CPU cpu_tsc_khz value; that is precisely what get_cpu_tsc_khz()
> does internally.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
> Tested by sticking a printk into it and comparing the values calculated
> by the old and new methods, while running the xen_shinfo_test which
> keeps relocating the shared info and thus rewriting the PV wall clock
> information to it.
> 
> They look sane enough but there's still skew over time (in both of
> them) as the KVM values get adjusted in presumably similarly sloppy
> ways. But we'll get to this. [...]

About the "skew over time", would you mean the results of
kvm_get_wall_clock_epoch() keeps changing over time?

Although I have not tested it, I suspect there are two reasons:

1. Would you mind explaining what "as the KVM values get adjusted" means?

The kvm_get_walltime_and_clockread() call is based on the host monotonic
clock, which may be adjusted (unlike the raw monotonic clock).
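
This divergence is easy to observe from userspace (a standalone sketch,
not kernel code; on an NTP-disciplined host the two deltas differ by the
applied ppm-scale frequency correction):

#include <stdio.h>
#include <time.h>
#include <unistd.h>

static long long ns(clockid_t c)
{
	struct timespec ts;

	clock_gettime(c, &ts);
	return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
	long long m0 = ns(CLOCK_MONOTONIC), r0 = ns(CLOCK_MONOTONIC_RAW);

	sleep(10);

	/* The two samples are not atomic, but the sampling error is
	 * negligible next to a ppm-scale drift over 10 seconds. */
	long long m1 = ns(CLOCK_MONOTONIC), r1 = ns(CLOCK_MONOTONIC_RAW);

	printf("mono vs raw divergence over 10 s: %lld ns\n",
	       (m1 - m0) - (r1 - r0));
	return 0;
}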


2. The host monotonic clock and kvmclock may use different mult/shift.

The equation is A - B.

A is the current host wall clock, while B is how long the VM has been up.

A - B is the wall clock time at which the VM booted.


A: ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec       --> monotonic clock
B: __pvclock_read_cycles(&hv_clock, host_tsc); --> raw monotonic and kvmclock


A comes from kvm_get_walltime_and_clockread(), which returns a paired ns and
TSC reading. It is based on the monotonic clock, i.e., gtod->clock.shift and
gtod->clock.mult.

BTW, the master clock is derived from the raw monotonic clock, which uses
gtod->raw_clock.shift and gtod->raw_clock.mult.

However, the increment between host_tsc and the master clock is scaled with
the mult/shift from kvmclock (i.e., from kvm_get_time_scale()).

Ideally we would expect A and B to advance at the same speed. Because they
may use different mult/shift values and equations, A and B may advance at
different speeds.
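
To get a feel for the size of this effect, here is a toy calculation
(hypothetical mult values, and a simplified scaling with tsc_shift == 0;
the real values come from kvm_get_time_scale() and the clocksource code):

#include <stdio.h>
#include <stdint.h>

/* Simplified pvclock-style scaling: ns = (delta * mult) >> 32 */
static uint64_t scale(uint64_t delta, uint32_t mult)
{
	return (uint64_t)(((unsigned __int128)delta * mult) >> 32);
}

int main(void)
{
	/* For a 3 GHz TSC, mult_a is 2^32 / 3; mult_b is mult_a
	 * adjusted by ~8 ppm, the size of a typical NTP frequency
	 * correction. */
	uint32_t mult_a = 1431655765;
	uint32_t mult_b = 1431655765 + 11453;

	for (uint64_t secs = 1; secs <= 4; secs++) {
		uint64_t delta = secs * 3000000000ULL;	/* TSC cycles */

		printf("%llu s: divergence = %llu ns\n",
		       (unsigned long long)secs,
		       (unsigned long long)(scale(delta, mult_b) -
					    scale(delta, mult_a)));
	}
	return 0;
}

An ~8 ppm difference in mult accumulates to roughly 32 µs over four
seconds, which is the order of magnitude of the skew reported later in
this thread.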

About the 2nd reason, I have a patch in progress to refresh the master clock
periodically, to address the clock drift during CPU hotplug:

https://lore.kernel.org/all/20230926230649.67852-1-dongli.zhang@oracle.com/


Please correct me if the above understanding is wrong.

Thank you very much!

Dongli Zhang

David Woodhouse Oct. 2, 2023, 7:24 a.m. UTC | #2
On Sun, 2023-10-01 at 22:14 -0700, Dongli Zhang wrote:
> > They look sane enough but there's still skew over time (in both of
> > them) as the KVM values get adjusted in presumably similarly sloppy
> > ways. But we'll get to this. [...]
> 
> About the "skew over time", would you mean the results of
> kvm_get_wall_clock_epoch() keeps changing over time?
> 
> Although without testing, I suspect it is because of two reasons:
> 
> 1. Would you mind explaining what does "as the KVM values get adjusted" mean?

I hadn't quite worked it out yet. You'll note that part was below the
'---' of the patch and wasn't quite as coherent. I suspect you already
know, though.

> The kvm_get_walltime_and_clockread() call is based on the host monotonic
> clock, which may be adjusted (unlike the raw monotonic clock).

Well, walltime has other problems too, which is why it's actually the
wrong thing to use for the KVM_CLOCK_REALTIME feature of
get_kvmclock(). When a leap second occurs, certain values of
CLOCK_REALTIME are *ambiguous* — they occur twice, just like certain
times between 1am and 2am on the morning when the clocks go backwards
in the winter. We should have been using CLOCK_TAI for that one. But
that's a different issue...
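
For what it's worth, the kernel's TAI offset is visible from userspace
(a sketch; CLOCK_TAI is Linux-specific, and simply tracks CLOCK_REALTIME
with a zero offset if nothing has ever set it via adjtimex()):

#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>

int main(void)
{
	struct timespec rt, tai;

	clock_gettime(CLOCK_REALTIME, &rt);
	clock_gettime(CLOCK_TAI, &tai);

	/* On a synced host this prints 37 (as of 2023). Unlike
	 * CLOCK_REALTIME, CLOCK_TAI never repeats a value across a
	 * leap second. */
	printf("TAI - REALTIME = %lld s\n",
	       (long long)(tai.tv_sec - rt.tv_sec));
	return 0;
}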

> 2. The host monotonic clock and kvmclock may use different mult/shift.
> 
> The equation is A - B.
>
> A is the current host wall clock, while B is how long the VM has been up.
>
> A - B is the wall clock time at which the VM booted.
> 
> 
> A: ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec       --> monotonic clock
> B: __pvclock_read_cycles(&hv_clock, host_tsc); --> raw monotonic and kvmclock
> 
> 
> A comes from kvm_get_walltime_and_clockread(), which returns a paired ns and
> TSC reading. It is based on the monotonic clock, i.e., gtod->clock.shift and
> gtod->clock.mult.
> 
> BTW, the master clock is derived from the raw monotonic clock, which uses
> gtod->raw_clock.shift and gtod->raw_clock.mult.
> 
> However, the increment between host_tsc and the master clock is scaled with
> the mult/shift from kvmclock (i.e., from kvm_get_time_scale()).
> 
> Ideally we would expect A and B to advance at the same speed. Because they
> may use different mult/shift values and equations, A and B may advance at
> different speeds.

Agreed. There'll be a small drift, but it seemed larger than I expected.
I attempted to post an RFC a few days ago but it didn't seem to make it
to the list. Quoting from it...

But... it's *drifting*. If I run the xen_shinfo_test selftest, which keeps
moving the shared_info around and causing the wallclock information to
be rewritten, I see it start with...

 $ sudo dmesg -w | head -10
[  611.980957] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777068875 (1695916533777068623)
[  611.980995] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777068874 (1695916533777068743)
[  611.981007] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777068873 (1695916533777068787)
[  611.981017] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777068873 (1695916533777068784)
[  611.981027] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777068874 (1695916533777068786)
[  611.981036] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777068873 (1695916533777068786)
[  611.981046] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777068873 (1695916533777068786)
[  611.981055] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777068873 (1695916533777068786)
[  611.981065] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777068873 (1695916533777068786)
[  611.981075] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777068873 (1695916533777068782)
 $ dmesg | tail
[  615.679572] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777037455 (1695916533777037423)
[  615.679575] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777037455 (1695916533777037419)
[  615.679580] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777037454 (1695916533777037418)
[  615.679583] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777037454 (1695916533777037418)
[  615.679587] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777037454 (1695916533777037418)
[  615.679590] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777037454 (1695916533777037419)
[  615.679594] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777037454 (1695916533777037417)
[  615.679598] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777037454 (1695916533777037418)
[  615.679605] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777037454 (1695916533777037423)
[  615.679611] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777037454 (1695916533777037420)

Now, I suppose it should drift a little bit, because it includes the
cumulative delta between CLOCK_REALTIME, which is adjusted by NTP (and
even enjoys leap seconds), and CLOCK_MONOTONIC_RAW, which has neither of
those things. But should it move by about 31µs in the space of 4
seconds?

 $ ntpfrob -d
time = 1695917599.000628357
verbose = "Clock synchronized, no leap second adjustment pending."
time offset (ns) = 0
TAI offset = 37
frequency offset = -515992
error max (us) = 20056
error est (us) = 1083
clock cmd status = 0
pll constant = 2
clock precision (us) = 1
clock tolerance = 32768000
tick (us) = 10000
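
As a back-of-the-envelope check (assuming that "frequency offset" above
is the raw adjtimex frequency in the kernel's 2^-16 scaled-ppm units):
515992 / 65536 ≈ 7.9 ppm, and 31µs over 4 seconds ≈ 7.75 ppm. So the
drift is at least consistent with being the NTP frequency correction
itself, which is applied to CLOCK_REALTIME and CLOCK_MONOTONIC but not
to CLOCK_MONOTONIC_RAW.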

+		printk("KVM %p: Calculated wall epoch %lld (%lld)\n", kvm,
+		       ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec - __pvclock_read_cycles(&hv_clock, host_tsc),
+		       ktime_get_real_ns() - get_kvmclock_ns(kvm));



> About the 2nd reason, I have a patch in progress to refresh the master clock
> periodically, to address the clock drift during CPU hotplug:
> 
> https://lore.kernel.org/all/20230926230649.67852-1-dongli.zhang@oracle.com/

Yeah, that's probably related to what I'm seeing. I'm testing by using
the xen_shinfo_test which keeps relocating the Xen shared_info page,
and *does* keep triggering a master clock update. Although now that I look
harder at it, I'm not sure *why* it does so.

But *why* does the kvmclock use a different mult/shift from the
monotonic_raw clock? Surely it should be the same? I thought the
*point* of using CLOCK_MONOTONIC_RAW was that it was basically just a
direct usage of the host TSC without NTP adjustments...?

We're also seeing the clocksource watchdog trigger in Xen guests,
complaining that the Xen vs. tsc clocksources are drifting w.r.t. one
another — when again they're both supposed to be derived from the
*same* guest TSC without interference. I assume that's the same
problem? And your trick of adjusting what we advertise to the guest
over time won't help there, because that won't stop it actually
drifting w.r.t. the raw TSC that the guest is comparing against.

[ 7200.708099] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 7200.708134] clocksource:                       'xen' wd_nsec: 508905888 wd_now: 68dcc1c556c wd_last: 68dadc70bcc mask: ffffffffffffffff
[ 7200.708139] clocksource:                       'tsc' cs_nsec: 511584233 cs_now: f12d5ec8ad3 cs_last: f128fca662b mask: ffffffffffffffff
[ 7200.708142] clocksource:                       'xen' (not 'tsc') is current clocksource.
[ 7200.708145] tsc: Marking TSC unstable due to clocksource watchdog
Sean Christopherson Oct. 5, 2023, midnight UTC | #3
On Sun, Oct 01, 2023, David Woodhouse wrote:
> +uint64_t kvm_get_wall_clock_epoch(struct kvm *kvm)
> +{
> +	/*
> +	 * The guest calculates current wall clock time by adding
> +	 * system time (updated by kvm_guest_time_update below) to the
> +	 * wall clock specified here.  We do the reverse here.

I would much rather this be a function comment that first explains what "epoch"
means in this context.  "epoch" is a perfect fit, but I suspect it won't be all
that intuitive for many readers (definitely wasn't for me).
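
One possible wording for such a comment (a suggestion only):

/*
 * The pvclock wall clock ABI tells the guest the wall clock time at
 * which the KVM clock (guest "system time") was zero, i.e. the guest's
 * epoch.  The guest computes the current wall clock time by adding the
 * KVM clock to that epoch; this helper does the reverse, ideally from a
 * single TSC reading.
 */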

> +	 */
> +#ifdef CONFIG_X86_64
> +	struct pvclock_vcpu_time_info hv_clock;
> +	struct kvm_arch *ka = &kvm->arch;
> +	unsigned long seq, local_tsc_khz = 0;
> +	struct timespec64 ts;
> +	uint64_t host_tsc;
> +
> +	do {
> +		seq = read_seqcount_begin(&ka->pvclock_sc);
> +
> +		if (!ka->use_master_clock)
> +			break;

This needs to zero local_tsc_khz, no?  E.g. read everything on loop 1, but the
pvclock_sc changes because use_master_clock is disabled, and so loop 2 will bail
and the code below will consume garbage TSC/masterclock information from loop 1.
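
One possible shape for the fix (a sketch only; the rest of the loop body
is unchanged and elided here):

	do {
		seq = read_seqcount_begin(&ka->pvclock_sc);

		local_tsc_khz = 0;	/* reset on every retry */

		if (!ka->use_master_clock)
			break;

		/* ... */
	} while (read_seqcount_retry(&ka->pvclock_sc, seq));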

> +		/* It all has to happen on the same CPU */

Define "it all", e.g. explain exactly why the cutoff for reenabling preemption is
after reading master_kernel_ns and not before, or not after kvm_get_time_scale().

> +		get_cpu();
> +
> +		local_tsc_khz = get_cpu_tsc_khz();
> +
> +		if (local_tsc_khz &&
> +		    !kvm_get_walltime_and_clockread(&ts, &host_tsc))
> +			local_tsc_khz = 0; /* Fall back to old method */
> +
> +		hv_clock.tsc_timestamp = ka->master_cycle_now;
> +		hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
> +
> +		put_cpu();
> +	} while (read_seqcount_retry(&ka->pvclock_sc, seq));
> +
> +	/*
> +	 * If the conditions were right, and obtaining the wallclock+TSC was
> +	 * successful, calculate the KVM clock at the corresponding time and
> +	 * subtract one from the other to get the epoch in nanoseconds.
> +	 */
> +	if (local_tsc_khz) {
> +		kvm_get_time_scale(NSEC_PER_SEC, local_tsc_khz * 1000LL,

s/1000LL/NSEC_PER_USEC?

> +				   &hv_clock.tsc_shift,
> +				   &hv_clock.tsc_to_system_mul);
> +		return ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec -
> +			__pvclock_read_cycles(&hv_clock, host_tsc);
> +	}
> +#endif

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 04b57a336b34..625ec4d9281b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2317,14 +2317,9 @@  static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock, int sec_hi_o
 	if (kvm_write_guest(kvm, wall_clock, &version, sizeof(version)))
 		return;
 
-	/*
-	 * The guest calculates current wall clock time by adding
-	 * system time (updated by kvm_guest_time_update below) to the
-	 * wall clock specified here.  We do the reverse here.
-	 */
-	wall_nsec = ktime_get_real_ns() - get_kvmclock_ns(kvm);
+	wall_nsec = kvm_get_wall_clock_epoch(kvm);
 
-	wc.nsec = do_div(wall_nsec, 1000000000);
+	wc.nsec = do_div(wall_nsec, NSEC_PER_SEC);
 	wc.sec = (u32)wall_nsec; /* overflow in 2106 guest time */
 	wc.version = version;
 
@@ -3229,6 +3224,57 @@  static int kvm_guest_time_update(struct kvm_vcpu *v)
 	return 0;
 }
 
+uint64_t kvm_get_wall_clock_epoch(struct kvm *kvm)
+{
+	/*
+	 * The guest calculates current wall clock time by adding
+	 * system time (updated by kvm_guest_time_update below) to the
+	 * wall clock specified here.  We do the reverse here.
+	 */
+#ifdef CONFIG_X86_64
+	struct pvclock_vcpu_time_info hv_clock;
+	struct kvm_arch *ka = &kvm->arch;
+	unsigned long seq, local_tsc_khz = 0;
+	struct timespec64 ts;
+	uint64_t host_tsc;
+
+	do {
+		seq = read_seqcount_begin(&ka->pvclock_sc);
+
+		if (!ka->use_master_clock)
+			break;
+
+		/* It all has to happen on the same CPU */
+		get_cpu();
+
+		local_tsc_khz = get_cpu_tsc_khz();
+
+		if (local_tsc_khz &&
+		    !kvm_get_walltime_and_clockread(&ts, &host_tsc))
+			local_tsc_khz = 0; /* Fall back to old method */
+
+		hv_clock.tsc_timestamp = ka->master_cycle_now;
+		hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
+
+		put_cpu();
+	} while (read_seqcount_retry(&ka->pvclock_sc, seq));
+
+	/*
+	 * If the conditions were right, and obtaining the wallclock+TSC was
+	 * successful, calculate the KVM clock at the corresponding time and
+	 * subtract one from the other to get the epoch in nanoseconds.
+	 */
+	if (local_tsc_khz) {
+		kvm_get_time_scale(NSEC_PER_SEC, local_tsc_khz * 1000LL,
+				   &hv_clock.tsc_shift,
+				   &hv_clock.tsc_to_system_mul);
+		return ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec -
+			__pvclock_read_cycles(&hv_clock, host_tsc);
+	}
+#endif
+	return ktime_get_real_ns() - get_kvmclock_ns(kvm);
+}
+
 /*
  * kvmclock updates which are isolated to a given vcpu, such as
  * vcpu->cpu migration, should not allow system_timestamp from
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index c544602d07a3..b21743526011 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -290,6 +290,8 @@  static inline bool kvm_check_has_quirk(struct kvm *kvm, u64 quirk)
 	return !(kvm->arch.disabled_quirks & quirk);
 }
 
+uint64_t kvm_get_wall_clock_epoch(struct kvm *kvm);
+
 void kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip);
 
 u64 get_kvmclock_ns(struct kvm *kvm);
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 75586da134b3..6bab715be428 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -59,7 +59,7 @@  static int kvm_xen_shared_info_init(struct kvm *kvm, gfn_t gfn)
 		 * This code mirrors kvm_write_wall_clock() except that it writes
 		 * directly through the pfn cache and doesn't mark the page dirty.
 		 */
-		wall_nsec = ktime_get_real_ns() - get_kvmclock_ns(kvm);
+		wall_nsec = kvm_get_wall_clock_epoch(kvm);
 
 		/* It could be invalid again already, so we need to check */
 		read_lock_irq(&gpc->lock);
@@ -98,7 +98,7 @@  static int kvm_xen_shared_info_init(struct kvm *kvm, gfn_t gfn)
 	wc_version = wc->version = (wc->version + 1) | 1;
 	smp_wmb();
 
-	wc->nsec = do_div(wall_nsec,  1000000000);
+	wc->nsec = do_div(wall_nsec, NSEC_PER_SEC);
 	wc->sec = (u32)wall_nsec;
 	*wc_sec_hi = wall_nsec >> 32;
 	smp_wmb();