From patchwork Sun Oct 1 17:54:16 2023
X-Patchwork-Submitter: David Woodhouse
X-Patchwork-Id: 13405451
Subject: [PATCH] KVM: x86: Refine calculation of guest wall clock to use a
 single TSC read
From: David Woodhouse
To: kvm@vger.kernel.org, sveith@amazon.de
Cc: Sean Christopherson, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
 Borislav Petkov, Dave Hansen, x86@kernel.org, "H. Peter Anvin",
 Paul Durrant, linux-kernel@vger.kernel.org
Date: Sun, 01 Oct 2023 18:54:16 +0100
User-Agent: Evolution 3.44.4-0ubuntu2
X-Mailing-List: kvm@vger.kernel.org

From: David Woodhouse

When populating the guest's PV wall clock information, KVM currently
does a simple 'ktime_get_real_ns() - get_kvmclock_ns(kvm)'.

This is an antipattern which should be avoided: when working with the
relationship between two clocks, it is never correct to obtain one of
them "now" and then the other at a slightly different "now" after an
unspecified period of preemption (which might not even be under the
control of the kernel, if this is an L1 hosting an L2 guest under
nested virtualization).

Add a kvm_get_wall_clock_epoch() function to return the guest wall
clock epoch in nanoseconds using the same method as __get_kvmclock():
use kvm_get_walltime_and_clockread() to calculate both the wall clock
and KVM clock time from a *single* TSC reading.

The condition using get_cpu_tsc_khz() is equivalent to the check in
__get_kvmclock(), which separately tests for the CONSTANT_TSC feature
or the per-CPU cpu_tsc_khz; that is exactly what get_cpu_tsc_khz()
does anyway.

Signed-off-by: David Woodhouse
---
Tested by sticking a printk into it and comparing the values calculated
by the old and new methods, while running the xen_shinfo_test, which
keeps relocating the shared info and thus rewriting the PV wall clock
information to it. They look sane enough, but there is still skew over
time (in both of them) as the KVM values get adjusted in presumably
similarly sloppy ways. But we'll get to this.

This patch is just the first piece of low-hanging fruit in a quest to
eliminate such sloppiness and get to the point where we can do live
update (pause guests, kexec and resume them again) with a *single*
cycle of skew. After all, the host TSC is still just as trustworthy,
so there is no excuse for *anything* changing over the kexec. Every
other clock in the guest should have precisely the *same* relationship
to the host TSC as it did before the kexec.
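As an aside, for anyone who wants to see the arithmetic in isolation,
below is a minimal standalone sketch (plain userspace C, not kernel
code) of what the new helper computes: both the wall clock and the
kvmclock are derived from the same TSC sample and then subtracted. A
128-bit multiply/divide stands in for the tsc_to_system_mul/tsc_shift
pair that kvm_get_time_scale() and __pvclock_read_cycles() use in the
real code, and all of the helper names and input numbers here are made
up purely for illustration.

/*
 * Illustrative sketch only: compute the guest wall clock epoch from a
 * single TSC sample, as kvm_get_wall_clock_epoch() does in the patch.
 * Requires a compiler with unsigned __int128 (gcc/clang).
 */
#include <stdint.h>
#include <stdio.h>

#define NSEC_PER_SEC 1000000000ULL

/* Scale a TSC delta to nanoseconds for a given TSC frequency in kHz. */
static uint64_t tsc_delta_to_ns(uint64_t tsc_delta, uint64_t tsc_khz)
{
        return (uint64_t)(((unsigned __int128)tsc_delta * NSEC_PER_SEC) /
                          (tsc_khz * 1000ULL));
}

/*
 * Hypothetical inputs: one TSC sample (host_tsc) paired with the wall
 * clock at that same instant (wall_ns), plus the kvmclock reference
 * point (master_cycle_now, and master_kernel_ns + kvmclock_offset).
 */
static uint64_t wall_clock_epoch(uint64_t host_tsc, uint64_t wall_ns,
                                 uint64_t master_cycle_now,
                                 uint64_t kvmclock_base_ns,
                                 uint64_t tsc_khz)
{
        uint64_t kvmclock_ns = kvmclock_base_ns +
                tsc_delta_to_ns(host_tsc - master_cycle_now, tsc_khz);

        /* Guest wall time = epoch + kvmclock, so epoch = wall - kvmclock. */
        return wall_ns - kvmclock_ns;
}

int main(void)
{
        /* Made-up numbers purely to show the arithmetic. */
        uint64_t tsc_khz = 2500000;           /* 2.5 GHz TSC             */
        uint64_t master_cycle_now = 1000000;  /* kvmclock ref TSC value  */
        uint64_t kvmclock_base_ns = 5000000;  /* kvmclock ref time in ns */
        uint64_t host_tsc = 6000000;          /* the single TSC sample   */
        uint64_t wall_ns = 1696182856000000000ULL; /* ~date of this mail */

        printf("epoch = %llu ns\n",
               (unsigned long long)wall_clock_epoch(host_tsc, wall_ns,
                                                    master_cycle_now,
                                                    kvmclock_base_ns,
                                                    tsc_khz));
        return 0;
}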
 arch/x86/kvm/x86.c | 60 ++++++++++++++++++++++++++++++++++++++++------
 arch/x86/kvm/x86.h |  2 ++
 arch/x86/kvm/xen.c |  4 ++--
 3 files changed, 57 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 04b57a336b34..625ec4d9281b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2317,14 +2317,9 @@ static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock, int sec_hi_o
 	if (kvm_write_guest(kvm, wall_clock, &version, sizeof(version)))
 		return;
 
-	/*
-	 * The guest calculates current wall clock time by adding
-	 * system time (updated by kvm_guest_time_update below) to the
-	 * wall clock specified here. We do the reverse here.
-	 */
-	wall_nsec = ktime_get_real_ns() - get_kvmclock_ns(kvm);
+	wall_nsec = kvm_get_wall_clock_epoch(kvm);
 
-	wc.nsec = do_div(wall_nsec, 1000000000);
+	wc.nsec = do_div(wall_nsec, NSEC_PER_SEC);
 	wc.sec = (u32)wall_nsec; /* overflow in 2106 guest time */
 	wc.version = version;
 
@@ -3229,6 +3224,57 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 	return 0;
 }
 
+uint64_t kvm_get_wall_clock_epoch(struct kvm *kvm)
+{
+	/*
+	 * The guest calculates current wall clock time by adding
+	 * system time (updated by kvm_guest_time_update below) to the
+	 * wall clock specified here. We do the reverse here.
+	 */
+#ifdef CONFIG_X86_64
+	struct pvclock_vcpu_time_info hv_clock;
+	struct kvm_arch *ka = &kvm->arch;
+	unsigned long seq, local_tsc_khz = 0;
+	struct timespec64 ts;
+	uint64_t host_tsc;
+
+	do {
+		seq = read_seqcount_begin(&ka->pvclock_sc);
+
+		if (!ka->use_master_clock)
+			break;
+
+		/* It all has to happen on the same CPU */
+		get_cpu();
+
+		local_tsc_khz = get_cpu_tsc_khz();
+
+		if (local_tsc_khz &&
+		    !kvm_get_walltime_and_clockread(&ts, &host_tsc))
+			local_tsc_khz = 0; /* Fall back to old method */
+
+		hv_clock.tsc_timestamp = ka->master_cycle_now;
+		hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
+
+		put_cpu();
+	} while (read_seqcount_retry(&ka->pvclock_sc, seq));
+
+	/*
+	 * If the conditions were right, and obtaining the wallclock+TSC was
+	 * successful, calculate the KVM clock at the corresponding time and
+	 * subtract one from the other to get the epoch in nanoseconds.
+	 */
+	if (local_tsc_khz) {
+		kvm_get_time_scale(NSEC_PER_SEC, local_tsc_khz * 1000LL,
+				   &hv_clock.tsc_shift,
+				   &hv_clock.tsc_to_system_mul);
+		return ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec -
+			__pvclock_read_cycles(&hv_clock, host_tsc);
+	}
+#endif
+	return ktime_get_real_ns() - get_kvmclock_ns(kvm);
+}
+
 /*
  * kvmclock updates which are isolated to a given vcpu, such as
  * vcpu->cpu migration, should not allow system_timestamp from
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index c544602d07a3..b21743526011 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -290,6 +290,8 @@ static inline bool kvm_check_has_quirk(struct kvm *kvm, u64 quirk)
 	return !(kvm->arch.disabled_quirks & quirk);
 }
 
+uint64_t kvm_get_wall_clock_epoch(struct kvm *kvm);
+
 void kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip);
 
 u64 get_kvmclock_ns(struct kvm *kvm);
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 75586da134b3..6bab715be428 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -59,7 +59,7 @@ static int kvm_xen_shared_info_init(struct kvm *kvm, gfn_t gfn)
 	 * This code mirrors kvm_write_wall_clock() except that it writes
 	 * directly through the pfn cache and doesn't mark the page dirty.
 	 */
-	wall_nsec = ktime_get_real_ns() - get_kvmclock_ns(kvm);
+	wall_nsec = kvm_get_wall_clock_epoch(kvm);
 
 	/* It could be invalid again already, so we need to check */
 	read_lock_irq(&gpc->lock);
@@ -98,7 +98,7 @@
 	wc_version = wc->version = (wc->version + 1) | 1;
 	smp_wmb();
 
-	wc->nsec = do_div(wall_nsec, 1000000000);
+	wc->nsec = do_div(wall_nsec, NSEC_PER_SEC);
 	wc->sec = (u32)wall_nsec;
 	*wc_sec_hi = wall_nsec >> 32;
 	smp_wmb();