From patchwork Wed Sep 13 14:08:22 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: David Woodhouse
X-Patchwork-Id: 13383200
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
 aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
 by smtp.lore.kernel.org (Postfix) with ESMTP id 1057EEDEC58
 for ; Wed, 13 Sep 2023 14:08:43 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
 id S241222AbjIMOIq (ORCPT ); Wed, 13 Sep 2023 10:08:46 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:47412 "EHLO
 lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S231767AbjIMOIl (ORCPT );
 Wed, 13 Sep 2023 10:08:41 -0400
Received: from casper.infradead.org (casper.infradead.org [IPv6:2001:8b0:10b:1236::1])
 by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 407B919B1
 for ; Wed, 13 Sep 2023 07:08:37 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
 d=infradead.org; s=casper.20170209; h=MIME-Version:Content-Type:Date:Cc:To:
 From:Subject:Message-ID:Sender:Reply-To:Content-Transfer-Encoding:Content-ID:
 Content-Description:In-Reply-To:References;
 bh=jxt9bxGK7CKorrjungcwO4FK3/65JKMsEjvBJMfHGfE=; b=CJR5y/quBvUtHuNGwcdE58puqO
 Ica5/nFJpJ7uxsvSP3pYDgjX3w2XJOa5IfRdtET1jAdXsJ8elllUZvVg1xwKOnYR5aBRpm54rO1cQ
 rCa2iqXV9F+hQc36ik1UtrYs0AInPDX5ig43Zu2EuTZs8AUVcLGIDBiB/NgHyDI0k/s6XN2DhYiEh
 mVnMH9TnmmGFElJm+mdP7SYvkGdDw0Qg4mpvrwE/OjmRLGn8YZu7UL3WiBpQ11isIvC51MrcNh648
 NLBCvK+IGR4ugpx78QxdB3vvIJclwrAELHCC5Lij2yZi5pFkI9RE3B0L/FN/LjUelfgA1BRLLJdvH
 vUwlgHZA==;
Received: from [54.239.6.187] (helo=u3832b3a9db3152.ant.amazon.com)
 by casper.infradead.org with esmtpsa (Exim 4.94.2 #2 (Red Hat Linux))
 id 1qgQXX-00EGyd-R0; Wed, 13 Sep 2023 14:08:24 +0000
Message-ID: <13f256ad95de186e3b6bcfcc1f88da5d0ad0cb71.camel@infradead.org>
Subject: [RFC] KVM: x86: Add KVM_VCPU_TSC_SCALE
 and fix the documentation on TSC migration
From: David Woodhouse
To: kvm@vger.kernel.org
Cc: dff@amazon.com, jmattson@google.com, joro@8bytes.org, oupton@google.com,
 pbonzini@redhat.com, seanjc@google.com, tglx@linutronix.de,
 vkuznets@redhat.com, wanpengli@tencent.com, Simon Veith
Date: Wed, 13 Sep 2023 16:08:22 +0200
User-Agent: Evolution 3.44.4-0ubuntu2
MIME-Version: 1.0
X-SRS-Rewrite: SMTP reverse-path rewritten from by casper.infradead.org.
 See http://www.infradead.org/rpr.html
Precedence: bulk
List-ID:
X-Mailing-List: kvm@vger.kernel.org

From: David Woodhouse

The documentation on TSC migration using KVM_VCPU_TSC_OFFSET is woefully
inadequate. It ignores TSC scaling, and ignores the fact that the host
TSC may differ from one host to the next (and in fact because of the way
the kernel calibrates it, it generally differs from one boot to the next
even on the same hardware).

Add KVM_VCPU_TSC_SCALE to extract the actual scale ratio and frac_bits,
and attempt to document the *awful* process that we're requiring userspace
to follow to merely preserve the TSC across migration.

I may have thrown up in my mouth a little when writing that documentation.
It's an awful API. If we do this, we should be ashamed of ourselves.
(I also haven't tested the documented process yet).

Let's use Simon's KVM_VCPU_TSC_VALUE instead.
https://lore.kernel.org/all/20230202165950.483430-1-sveith@amazon.de/

Signed-off-by: David Woodhouse
---
 Documentation/virt/kvm/devices/vcpu.rst | 80 ++++++++++++++++++++-----
 arch/x86/include/uapi/asm/kvm.h         |  6 ++
 arch/x86/kvm/x86.c                      | 15 +++++
 3 files changed, 86 insertions(+), 15 deletions(-)

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 31f14ec4a65b..b6b6e4b98744 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -216,9 +216,11 @@ Returns:
 Specifies the guest's TSC offset relative to the host's TSC.
 The guest's TSC is then derived by the following equation:
 
-  guest_tsc = host_tsc + KVM_VCPU_TSC_OFFSET
+  guest_tsc = (( host_tsc * tsc_scale_ratio ) >> tsc_scale_bits ) + KVM_VCPU_TSC_OFFSET
 
-This attribute is useful to adjust the guest's TSC on live migration,
+The value of tsc_scale_bits is 48 on Intel and 32 on AMD. You can calculate
+tsc_scale_ratio as (... where you might be able to obtain tsc_scale_bits from
+debugfs if you're lucky. This attribute is useful to adjust the guest's TSC on live migration,
 so that the TSC counts the time during which the VM was paused. The
 following describes a possible algorithm to use for this purpose.
 
@@ -234,9 +236,19 @@ From the source VMM process:
 3. Invoke the KVM_GET_TSC_KHZ ioctl to record the frequency of the
    guest's TSC (freq).
 
+4. Read the KVM_VCPU_TSC_SCALE attribute for each vCPU to obtain the
+   src_tsc_ratio[i] and src_tsc_frac_bits[i] values.
+
+5. For each vCPU[i], calculate the guest TSC value (guest_tsc_src) at time
+   [guest_src] in guest KVM time. This is calculated by the formula:
+   guest_tsc_src[i] = ((tsc_src * src_tsc_ratio[i]) >> src_tsc_frac_bits[i]) + ofs_src[i]
+
 From the destination VMM process:
 
-4. Invoke the KVM_SET_CLOCK ioctl, providing the source nanoseconds from
+6. Invoke the KVM_SET_TSC_KHZ ioctl to set the scaled frequency of the
+   guest's TSC (freq).
+
+7. Invoke the KVM_SET_CLOCK ioctl, providing the source nanoseconds from
    kvmclock (guest_src) and CLOCK_REALTIME (host_src) in their respective
    fields. Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
    structure.
@@ -248,20 +260,58 @@ From the destination VMM process:
    between the source pausing the VMs and the destination executing
    steps 4-7.
 
-5. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_dest) and
+8. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_dest) and
    kvmclock nanoseconds (guest_dest).
 
-6. Adjust the guest TSC offsets for every vCPU to account for (1) time
-   elapsed since recording state and (2) difference in TSCs between the
-   source and destination machine:
+9. Read the KVM_VCPU_TSC_SCALE attribute for each vCPU to obtain the
+   dest_tsc_ratio[i] and dest_tsc_frac_bits[i] values.
+
+10. For each vCPU[i], calculate the guest TSC value (guest_tsc_dest) at time
+    [guest_dest] in guest KVM time, as follows (nanoseconds times kHz, divided by 1000000, gives ticks):
+    guest_tsc_dest[i] = guest_tsc_src[i] + (guest_dest - guest_src) * freq / 1000000
+
+11. For each vCPU[i], calculate what KVM will use internally as the scaled
+    guest time _before_ offsetting at time [guest_dest]:
+    raw_guest_tsc_dest[i] = (tsc_dest * dest_tsc_ratio[i]) >> dest_tsc_frac_bits[i]
+
+12. Calculate the post-scaling guest TSC offsets for every vCPU to account
+    for the difference between the raw scaled value and the intended value:
+
+    ofs_dst[i] = guest_tsc_dest[i] - raw_guest_tsc_dest[i]
+
+13. Write the KVM_VCPU_TSC_OFFSET attribute for every vCPU with the
+    respective value derived in the previous step.
+
+4.2 ATTRIBUTE: KVM_VCPU_TSC_SCALE
+
+:Parameters: 64-bit fixed point TSC scale factor
+
+Returns:
+
+	 ======= ======================================
+	 -EFAULT Error reading the provided parameter
+		 address.
+	 -ENXIO  Attribute not supported
+	 -EINVAL Invalid request to write the attribute
+	 ======= ======================================
+
+This read-only attribute reports the guest's TSC scaling factor, in the form
+of a fixed-point number represented by the following structure:
+
+    struct kvm_vcpu_tsc_scale {
+	    __u64 tsc_ratio;
+	    __u64 tsc_frac_bits;
+    };
+
 
-   ofs_dst[i] = ofs_src[i] -
-     (guest_src - guest_dest) * freq +
-     (tsc_src - tsc_dest)
+The tsc_frac_bits field indicates the location of the fixed point, such that
+host TSC values are converted to guest TSC using the formula:
 
-   ("ofs[i] + tsc - guest * freq" is the guest TSC value corresponding to
-   a time of 0 in kvmclock. The above formula ensures that it is the
-   same on the destination as it was on the source).
+    guest_tsc = ( ( host_tsc * tsc_ratio ) >> tsc_frac_bits) + offset
 
-7. Write the KVM_VCPU_TSC_OFFSET attribute for every vCPU with the
-   respective value derived in the previous step.
+Userspace generally has no need to know this, as it has set the desired
+guest TSC frequency. But since KVM only offers the KVM_VCPU_TSC_OFFSET
+attribute as documented above, and not a KVM_VCPU_TSC_VALUE attribute
+which would have made life much easier, userspace needs to extract these
+values so that it can do for itself all the calculations that the kernel
+could have done more easily.
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 1a6a1f987949..a7b1406e7e62 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -558,6 +558,12 @@ struct kvm_pmu_event_filter {
 /* for KVM_{GET,SET,HAS}_DEVICE_ATTR */
 #define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
 #define   KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
+#define   KVM_VCPU_TSC_SCALE  1 /* attribute for TSC scaling factor */
+
+struct kvm_vcpu_tsc_scale {
+	__u64 tsc_ratio;
+	__u64 tsc_frac_bits;
+};
 
 /* x86-specific KVM_EXIT_HYPERCALL flags. */
 #define KVM_EXIT_HYPERCALL_LONG_MODE	BIT(0)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a6b9bea62fb8..abc951f7bb95 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5462,6 +5462,7 @@ static int kvm_arch_tsc_has_attr(struct kvm_vcpu *vcpu,
 
 	switch (attr->attr) {
 	case KVM_VCPU_TSC_OFFSET:
+	case KVM_VCPU_TSC_SCALE:
 		r = 0;
 		break;
 	default:
@@ -5487,6 +5488,17 @@ static int kvm_arch_tsc_get_attr(struct kvm_vcpu *vcpu,
 			break;
 		r = 0;
 		break;
+	case KVM_VCPU_TSC_SCALE: {
+		struct kvm_vcpu_tsc_scale scale;
+
+		scale.tsc_ratio = vcpu->arch.l1_tsc_scaling_ratio;
+		scale.tsc_frac_bits = kvm_caps.tsc_scaling_ratio_frac_bits;
+		r = -EFAULT;
+		if (copy_to_user(uaddr, &scale, sizeof(scale)))
+			break;
+		r = 0;
+		break;
+	}
 	default:
 		r = -ENXIO;
 	}
@@ -5529,6 +5541,9 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
 		r = 0;
 		break;
 	}
+	case KVM_VCPU_TSC_SCALE:
+		r = -EINVAL; /* Read only */
+		break;
 	default:
 		r = -ENXIO;
 	}
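
The userspace arithmetic in steps 5 and 10-12 of the documented procedure
can be sketched in C as below. This is a minimal illustration under stated
assumptions, not tested against KVM: struct tsc_scale and every function
name are hypothetical (the struct only mirrors the layout of the proposed
struct kvm_vcpu_tsc_scale), freq is taken to be in kHz as returned by
KVM_GET_TSC_KHZ, kvmclock values are nanoseconds, elapsed time is converted
to ticks as ns * kHz / 1000000, and the compiler is assumed to provide
unsigned __int128 (a GCC/Clang extension) for the 64x64-bit multiply.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical local mirror of the proposed struct kvm_vcpu_tsc_scale. */
struct tsc_scale {
	uint64_t tsc_ratio;     /* fixed-point multiplier */
	uint64_t tsc_frac_bits; /* number of fractional bits in tsc_ratio */
};

/* (host_tsc * ratio) >> frac_bits, with a 128-bit intermediate product. */
static uint64_t scale_tsc(uint64_t host_tsc, const struct tsc_scale *s)
{
	assert(s->tsc_frac_bits < 64);
	return (uint64_t)(((unsigned __int128)host_tsc * s->tsc_ratio)
			  >> s->tsc_frac_bits);
}

/* Step 5: guest TSC seen at a given host TSC, scale and offset. */
static uint64_t guest_tsc_at(uint64_t host_tsc, const struct tsc_scale *s,
			     uint64_t offset)
{
	return scale_tsc(host_tsc, s) + offset; /* wraps modulo 2^64 */
}

/*
 * Step 10: advance the guest TSC by the kvmclock time that elapsed between
 * guest_src_ns and guest_dest_ns, at a guest frequency of freq_khz.
 * Assumption: ticks = nanoseconds * kHz / 1000000.
 */
static uint64_t advance_guest_tsc(uint64_t guest_tsc_src,
				  uint64_t guest_src_ns,
				  uint64_t guest_dest_ns,
				  uint64_t freq_khz)
{
	unsigned __int128 ticks =
		(unsigned __int128)(guest_dest_ns - guest_src_ns) * freq_khz
		/ 1000000;
	return guest_tsc_src + (uint64_t)ticks;
}

/* Steps 11-12: offset that makes the destination report guest_tsc_dest. */
static uint64_t dest_tsc_offset(uint64_t guest_tsc_dest, uint64_t tsc_dest,
				const struct tsc_scale *dest)
{
	return guest_tsc_dest - scale_tsc(tsc_dest, dest);
}
```

A VMM would call these once per vCPU, feeding in the values read in steps
1-4 and 8-9, and then write the result of dest_tsc_offset() through
KVM_VCPU_TSC_OFFSET as in step 13. Unsigned wraparound is intentional here,
on the assumption that the offset is treated as a 64-bit value modulo 2^64.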