Message ID | afc496b886bc46b956ede716d8db6f208e7bab0a.camel@infradead.org
---|---
State | New, archived |
Series | [v3] KVM: x86/xen: improve accuracy of Xen timers
On 14/12/2023 16:54, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> A test program such as http://david.woodhou.se/timerlat.c confirms user
> reports that timers are increasingly inaccurate as the lifetime of a
> guest increases. Reporting the actual delay observed when asking for
> 100µs of sleep, it starts off OK on a newly-launched guest but gets
> worse over time, giving incorrect sleep times:
>
> root@ip-10-0-193-21:~# ./timerlat -c -n 5
> 00000000 latency 103243/100000 (3.2430%)
> 00000001 latency 103243/100000 (3.2430%)
> 00000002 latency 103242/100000 (3.2420%)
> 00000003 latency 103245/100000 (3.2450%)
> 00000004 latency 103245/100000 (3.2450%)
>
> The biggest problem is that get_kvmclock_ns() returns inaccurate values
> when the guest TSC is scaled. The guest sees a TSC value scaled from the
> host TSC by a mul/shift conversion (hopefully done in hardware). The
> guest then converts that guest TSC value into nanoseconds using the
> mul/shift conversion given to it by the KVM pvclock information.
>
> But get_kvmclock_ns() performs only a single conversion directly from
> host TSC to nanoseconds, giving a different result. A test program at
> http://david.woodhou.se/tsdrift.c demonstrates the cumulative error
> over a day.
>
> It's non-trivial to fix get_kvmclock_ns(), although I'll come back to
> that. The actual guest hv_clock is per-CPU, and *theoretically* each
> vCPU could be running at a *different* frequency. But this patch is
> needed anyway because...
>
> The other issue with Xen timers was that the code would snapshot the
> host CLOCK_MONOTONIC at some point in time, and then... after a few
> interrupts may have occurred, some preemption perhaps... would also read
> the guest's kvmclock. Then it would proceed under the false assumption
> that those two happened at the *same* time. Any time which *actually*
> elapsed between reading the two clocks was introduced as inaccuracies
> in the time at which the timer fired.
>
> Fix it to use a variant of kvm_get_time_and_clockread(), which reads the
> host TSC just *once*, then use the returned TSC value to calculate the
> kvmclock (making sure to do that the way the guest would instead of
> making the same mistake get_kvmclock_ns() does).
>
> Sadly, hrtimers based on CLOCK_MONOTONIC_RAW are not supported, so Xen
> timers still have to use CLOCK_MONOTONIC. In practice the difference
> between the two won't matter over the timescales involved, as the
> *absolute* values don't matter; just the delta.
>
> This does mean a new variant of kvm_get_time_and_clockread() is needed;
> called kvm_get_monotonic_and_clockread() because that's what it does.
>
> Fixes: 536395260582 ("KVM: x86/xen: handle PV timers oneshot mode")
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
> v3:
>  • Rebase and repost.
>
> v2:
>  • Fall back to get_kvmclock_ns() if vcpu->arch.hv_clock isn't set up
>    yet, with a big comment explaining why that's actually OK.
>  • Fix do_monotonic() *not* to add the boot time offset.
>  • Rename do_monotonic_raw() → do_kvmclock_base() and add a comment
>    to make it clear that it *does* add the boot time offset. That
>    was just left as a bear trap for the unwary developer, wasn't it?
>
>  arch/x86/kvm/x86.c |  61 +++++++++++++++++++++--
>  arch/x86/kvm/x86.h |   1 +
>  arch/x86/kvm/xen.c | 121 ++++++++++++++++++++++++++++++++++-----------
>  3 files changed, 149 insertions(+), 34 deletions(-)

Reviewed-by: Paul Durrant <paul@xen.org>
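To make the drift described above concrete: the small userspace sketch below is illustrative only, not KVM code. The 2.8 GHz host / 2.5 GHz guest frequencies and the 32.32 fixed-point multipliers are invented for the example. It compares the single host-TSC→nanoseconds conversion that get_kvmclock_ns() performs against the two-step host-TSC→guest-TSC→nanoseconds path the guest actually takes. Because each multiplier is truncated, the composed two-step scaling is not quite equal to the direct one, and with these constants the results diverge by several microseconds over one simulated day:

	/*
	 * Standalone illustration of the systemic drift; NOT kernel code.
	 * A pvclock guest computes ns = (tsc * mul) >> 32 (the real pvclock
	 * also carries a shift, folded into the multipliers here).
	 */
	#include <stdint.h>
	#include <stdio.h>

	/* pvclock-style scaling, widened to 128 bits (GCC/Clang extension) */
	static uint64_t scale32(uint64_t delta, uint32_t mul)
	{
		return (uint64_t)(((unsigned __int128)delta * mul) >> 32);
	}

	int main(void)
	{
		/* Invented: 2.8 GHz host TSC, guest TSC scaled to 2.5 GHz */
		const uint32_t host_to_guest = 0xe4924924; /* ~(2.5/2.8) * 2^32 */
		const uint32_t guest_to_ns   = 0x66666666; /* ~(1/2.5)   * 2^32 */
		const uint32_t host_to_ns    = 0x5b6db6db; /* ~(1/2.8)   * 2^32 */
		const uint64_t host_tsc = 2800000000ULL * 86400; /* one day */

		uint64_t direct  = scale32(host_tsc, host_to_ns);
		uint64_t twostep = scale32(scale32(host_tsc, host_to_guest),
					   guest_to_ns);

		/* Prints a drift on the order of microseconds for one day */
		printf("direct %llu ns, two-step %llu ns, drift %lld ns\n",
		       (unsigned long long)direct, (unsigned long long)twostep,
		       (long long)(direct - twostep));
		return 0;
	}

This is the class of error the patch sidesteps in kvm_xen_start_timer() by scaling host TSC to guest TSC with kvm_read_l1_tsc() and then to nanoseconds with __pvclock_read_cycles(), exactly as the guest would.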
On Fri, 2023-12-15 at 09:07 +0000, Durrant, Paul wrote:
> On 14/12/2023 16:54, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> >
> > [snip: full description quoted above]
>
> Reviewed-by: Paul Durrant <paul@xen.org>

Ping?
On Tue, Jan 16, 2024, David Woodhouse wrote:
> Ping?
It's on my list of things to grab for the next cycle; I'm just waiting for -rc1
to start officially applying anything.
On Thu, Dec 14, 2023, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> A test program such as http://david.woodhou.se/timerlat.c confirms user
> reports that timers are increasingly inaccurate as the lifetime of a
> guest increases.
>
> [snip: full description quoted above]
>
> Fixes: 536395260582 ("KVM: x86/xen: handle PV timers oneshot mode")
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---

Dagnabbit, this one is corrupt too :-/
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6beb6ceb28c1..8faf1bf9e8e3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2852,7 +2852,11 @@ static inline u64 vgettsc(struct pvclock_clock *clock, u64 *tsc_timestamp,
 	return v * clock->mult;
 }
 
-static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
+/*
+ * As with get_kvmclock_base_ns(), this counts from boot time, at the
+ * frequency of CLOCK_MONOTONIC_RAW (hence adding gtod->offs_boot).
+ */
+static int do_kvmclock_base(s64 *t, u64 *tsc_timestamp)
 {
 	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
 	unsigned long seq;
@@ -2871,6 +2875,29 @@ static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
 	return mode;
 }
 
+/*
+ * This calculates CLOCK_MONOTONIC at the time of the TSC snapshot, with
+ * no boot time offset.
+ */
+static int do_monotonic(s64 *t, u64 *tsc_timestamp)
+{
+	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+	unsigned long seq;
+	int mode;
+	u64 ns;
+
+	do {
+		seq = read_seqcount_begin(&gtod->seq);
+		ns = gtod->clock.base_cycles;
+		ns += vgettsc(&gtod->clock, tsc_timestamp, &mode);
+		ns >>= gtod->clock.shift;
+		ns += ktime_to_ns(gtod->clock.offset);
+	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
+	*t = ns;
+
+	return mode;
+}
+
 static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
 {
 	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
@@ -2892,18 +2919,42 @@ static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
 	return mode;
 }
 
-/* returns true if host is using TSC based clocksource */
+/*
+ * Calculates the kvmclock_base_ns (CLOCK_MONOTONIC_RAW + boot time) and
+ * reports the TSC value from which it did so. Returns true if host is
+ * using TSC based clocksource.
+ */
 static bool kvm_get_time_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
 {
 	/* checked again under seqlock below */
 	if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
 		return false;
 
-	return gtod_is_based_on_tsc(do_monotonic_raw(kernel_ns,
-						      tsc_timestamp));
+	return gtod_is_based_on_tsc(do_kvmclock_base(kernel_ns,
+						     tsc_timestamp));
 }
 
-/* returns true if host is using TSC based clocksource */
+/*
+ * Calculates CLOCK_MONOTONIC and reports the TSC value from which it did
+ * so. Returns true if host is using TSC based clocksource.
+ */
+bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
+{
+	/* checked again under seqlock below */
+	if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
+		return false;
+
+	return gtod_is_based_on_tsc(do_monotonic(kernel_ns,
+						 tsc_timestamp));
+}
+
+/*
+ * Calculates CLOCK_REALTIME and reports the TSC value from which it did
+ * so. Returns true if host is using TSC based clocksource.
+ *
+ * DO NOT USE this for anything related to migration. You want CLOCK_TAI
+ * for that.
+ */
 static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
 					   u64 *tsc_timestamp)
 {
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 5184fde1dc54..0d6af2a57af7 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -294,6 +294,7 @@ void kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq,
 				    int inc_eip);
 u64 get_kvmclock_ns(struct kvm *kvm);
 uint64_t kvm_get_wall_clock_epoch(struct kvm *kvm);
+bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp);
 
 int kvm_read_guest_virt(struct kvm_vcpu *vcpu,
 			gva_t addr, void *val, unsigned int bytes,
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 6667f01170f9..a28f60aa91fb 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -24,6 +24,7 @@
 #include <xen/interface/sched.h>
 
 #include <asm/xen/cpuid.h>
+#include <asm/pvclock.h>
 
 #include "cpuid.h"
 #include "trace.h"
@@ -149,8 +150,93 @@ static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
 	return HRTIMER_NORESTART;
 }
 
-static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns)
+static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs,
+				bool linux_wa)
 {
+	uint64_t guest_now;
+	int64_t kernel_now, delta;
+
+	/*
+	 * The guest provides the requested timeout in absolute nanoseconds
+	 * of the KVM clock — as *it* sees it, based on the scaled TSC and
+	 * the pvclock information provided by KVM.
+	 *
+	 * The kernel doesn't support hrtimers based on CLOCK_MONOTONIC_RAW
+	 * so use CLOCK_MONOTONIC. In the timescales covered by timers, the
+	 * difference won't matter much as there is no cumulative effect.
+	 *
+	 * Calculate the time for some arbitrary point in time around "now"
+	 * in terms of both kvmclock and CLOCK_MONOTONIC. Calculate the
+	 * delta between the kvmclock "now" value and the guest's requested
+	 * timeout, apply the "Linux workaround" described below, and add
+	 * the resulting delta to the CLOCK_MONOTONIC "now" value, to get
+	 * the absolute CLOCK_MONOTONIC time at which the timer should
+	 * fire.
+	 */
+	if (vcpu->arch.hv_clock.version && vcpu->kvm->arch.use_master_clock &&
+	    static_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
+		uint64_t host_tsc, guest_tsc;
+
+		if (!IS_ENABLED(CONFIG_64BIT) ||
+		    !kvm_get_monotonic_and_clockread(&kernel_now, &host_tsc)) {
+			/*
+			 * Don't fall back to get_kvmclock_ns() because it's
+			 * broken; it has a systemic error in its results
+			 * because it scales directly from host TSC to
+			 * nanoseconds, and doesn't scale first to guest TSC
+			 * and *then* to nanoseconds as the guest does.
+			 *
+			 * There is a small error introduced here because time
+			 * continues to elapse between the ktime_get() and the
+			 * subsequent rdtsc(). But not the systemic drift due
+			 * to get_kvmclock_ns().
+			 */
+			kernel_now = ktime_get(); /* This is CLOCK_MONOTONIC */
+			host_tsc = rdtsc();
+		}
+
+		/* Calculate the guest kvmclock as the guest would do it. */
+		guest_tsc = kvm_read_l1_tsc(vcpu, host_tsc);
+		guest_now = __pvclock_read_cycles(&vcpu->arch.hv_clock,
+						  guest_tsc);
+	} else {
+		/*
+		 * Without CONSTANT_TSC, get_kvmclock_ns() is the only option.
+		 *
+		 * Also if the guest PV clock hasn't been set up yet, as is
+		 * likely to be the case during migration when the vCPU has
+		 * not been run yet. It would be possible to calculate the
+		 * scaling factors properly in that case but there's not much
+		 * point in doing so. The get_kvmclock_ns() drift accumulates
+		 * over time, so it's OK to use it at startup. Besides, on
+		 * migration there's going to be a little bit of skew in the
+		 * precise moment at which timers fire anyway. Often they'll
+		 * be in the "past" by the time the VM is running again after
+		 * migration.
+		 */
+		guest_now = get_kvmclock_ns(vcpu->kvm);
+		kernel_now = ktime_get();
+	}
+
+	delta = guest_abs - guest_now;
+
+	/* Xen has a 'Linux workaround' in do_set_timer_op() which
+	 * checks for negative absolute timeout values (caused by
+	 * integer overflow), and for values about 13 days in the
+	 * future (2^50ns) which would be caused by jiffies
+	 * overflow. For those cases, it sets the timeout 100ms in
+	 * the future (not *too* soon, since if a guest really did
+	 * set a long timeout on purpose we don't want to keep
+	 * churning CPU time by waking it up).
+	 */
+	if (linux_wa) {
+		if (unlikely((int64_t)guest_abs < 0 ||
+			     (delta > 0 && (uint32_t)(delta >> 50) != 0))) {
+			delta = 100 * NSEC_PER_MSEC;
+			guest_abs = guest_now + delta;
+		}
+	}
+
 	/*
 	 * Avoid races with the old timer firing. Checking timer_expires
 	 * to avoid calling hrtimer_cancel() will only have false positives
@@ -162,12 +248,11 @@ static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_
 	atomic_set(&vcpu->arch.xen.timer_pending, 0);
 	vcpu->arch.xen.timer_expires = guest_abs;
 
-	if (delta_ns <= 0) {
+	if (delta <= 0) {
 		xen_timer_callback(&vcpu->arch.xen.timer);
 	} else {
-		ktime_t ktime_now = ktime_get();
 		hrtimer_start(&vcpu->arch.xen.timer,
-			      ktime_add_ns(ktime_now, delta_ns),
+			      ktime_add_ns(kernel_now, delta),
 			      HRTIMER_MODE_ABS_HARD);
 	}
 }
@@ -991,8 +1076,7 @@ int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
 		/* Start the timer if the new value has a valid vector+expiry. */
 		if (data->u.timer.port && data->u.timer.expires_ns)
 			kvm_xen_start_timer(vcpu, data->u.timer.expires_ns,
-					    data->u.timer.expires_ns -
-					    get_kvmclock_ns(vcpu->kvm));
+					    false);
 
 		r = 0;
 		break;
@@ -1448,7 +1532,6 @@ static bool kvm_xen_hcall_vcpu_op(struct kvm_vcpu *vcpu, bool longmode, int cmd,
 {
 	struct vcpu_set_singleshot_timer oneshot;
 	struct x86_exception e;
-	s64 delta;
 
 	if (!kvm_xen_timer_enabled(vcpu))
 		return false;
@@ -1482,9 +1565,7 @@ static bool kvm_xen_hcall_vcpu_op(struct kvm_vcpu *vcpu, bool longmode, int cmd,
 		return true;
 	}
 
-	/* A delta <= 0 results in an immediate callback, which is what we want */
-	delta = oneshot.timeout_abs_ns - get_kvmclock_ns(vcpu->kvm);
-	kvm_xen_start_timer(vcpu, oneshot.timeout_abs_ns, delta);
+	kvm_xen_start_timer(vcpu, oneshot.timeout_abs_ns, false);
 
 	*r = 0;
 	return true;
@@ -1508,25 +1589,7 @@ static bool kvm_xen_hcall_set_timer_op(struct kvm_vcpu *vcpu, uint64_t timeout,
 		return false;
 
 	if (timeout) {
-		uint64_t guest_now = get_kvmclock_ns(vcpu->kvm);
-		int64_t delta = timeout - guest_now;
-
-		/* Xen has a 'Linux workaround' in do_set_timer_op() which
-		 * checks for negative absolute timeout values (caused by
-		 * integer overflow), and for values about 13 days in the
-		 * future (2^50ns) which would be caused by jiffies
-		 * overflow. For those cases, it sets the timeout 100ms in
-		 * the future (not *too* soon, since if a guest really did
-		 * set a long timeout on purpose we don't want to keep
-		 * churning CPU time by waking it up).
-		 */
-		if (unlikely((int64_t)timeout < 0 ||
-			     (delta > 0 && (uint32_t) (delta >> 50) != 0))) {
-			delta = 100 * NSEC_PER_MSEC;
-			timeout = guest_now + delta;
-		}
-
-		kvm_xen_start_timer(vcpu, timeout, delta);
+		kvm_xen_start_timer(vcpu, timeout, true);
 	} else {
 		kvm_xen_stop_timer(vcpu);
 	}
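As a sanity check on the 'Linux workaround' carried over into kvm_xen_start_timer(): the predicate (uint32_t)(delta >> 50) != 0 fires for any positive delta of at least 2^50 ns, which is where the "about 13 days" in the comment comes from. A standalone check (plain C, not KVM code):

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		/* The workaround flags any positive delta of 2^50 ns or more. */
		uint64_t threshold_ns = 1ULL << 50;

		printf("2^50 ns = %llu ns = %.2f days\n",
		       (unsigned long long)threshold_ns, threshold_ns / 86400e9);

		/* Same predicate as the patch, on a just-over-threshold delta. */
		int64_t delta = (int64_t)threshold_ns + 1;
		printf("flagged: %s\n",
		       (delta > 0 && (uint32_t)(delta >> 50) != 0) ? "yes" : "no");
		return 0;
	}

This prints 1125899906842624 ns, roughly 13.03 days: any timeout further out than that is assumed to be a jiffies-overflow artifact and is clamped to 100 ms in the future.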