Message ID | 20240522001817.619072-6-dwmw2@infradead.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Cleaning up the KVM clock mess | expand |
On Wed, May 22, 2024, David Woodhouse wrote: > The guest the records a singular TSC reference point in time and uses it to ^^^^ then > calculate 3 KVM clock values utilizing the 3 recorded PVTI prior. Let's > call each clock value CLK[0-2]. > > In a perfect world CLK[0-2] should all be the same value if the KVM clock > & TSC relationship is preserved across the LU/LM (or faked in this test), > however it is not. > > A delta can be observed between CLK0-CLK1 due to KVM recalculating the PVTI > (and the inaccuracies associated with that). A delta of ~3500ns can be > observed if guest TSC scaling to half host TSC frequency is also enabled, > where as without scaling this is observed at ~180ns. It'd be helpful to explain why TSC scaling results in a larger drift. I'm by no means a clock expert, but I've likely stared at this code more than most and it's not obvious to me why scaling is problematic. If I thought hard maybe I could figure it out, but it's been an -ENOCOFFE sort of week so far, so I wouldn't bet on it :-) > +static void trigger_pvti_update(vm_paddr_t pvti_pa) > +{ > + /* > + * We need a way to trigger KVM to update the fields Please avoid "we", it's unnecessarily confusing as there are too many possible subjects that "we" could apply to. And please use the "full" 80 characters for comments, there's no reason to wrap more aggressively. E.g. /* * Toggle between KVM's old and new system time methods to coerce KVM * into updating the fields in the PV time info struct. */ > + * in the PV time info. The easiest way to do this is > + * to temporarily switch to the old KVM system time > + * method and then switch back to the new one. > + */ > + wrmsr(MSR_KVM_SYSTEM_TIME, pvti_pa | KVM_MSR_ENABLED); > + wrmsr(MSR_KVM_SYSTEM_TIME_NEW, pvti_pa | KVM_MSR_ENABLED); > +} > + > +static void guest_code(vm_paddr_t pvti_pa) > +{ > + struct pvclock_vcpu_time_info *pvti_va = > + (struct pvclock_vcpu_time_info *)pvti_pa; Casting to "void *" will let this vit on a single line. Though I don't see any reason to take a vm_paddr_t, the infrastructure doesn't validate the arg types. > + struct pvclock_vcpu_time_info pvti_boot; > + struct pvclock_vcpu_time_info pvti_uncorrected; > + struct pvclock_vcpu_time_info pvti_corrected; > + uint64_t cycles_boot; > + uint64_t cycles_uncorrected; > + uint64_t cycles_corrected; > + uint64_t tsc_guest; > + > + /* > + * Setup the KVMCLOCK in the guest & store the original s/&/and And wrap less aggressively here too. > + * PV time structure that is used. > + */ > + wrmsr(MSR_KVM_SYSTEM_TIME_NEW, pvti_pa | KVM_MSR_ENABLED); > + pvti_boot = *pvti_va; > + GUEST_SYNC(STAGE_FIRST_BOOT); > + > + /* > + * Trigger an update of the PVTI, if we calculate > + * the KVM clock using this structure we'll see > + * a delta from the TSC. Too many pronouns. Maybe this? /* * Trigger an update of the PVTI and snapshot the time, which at this * point is uncorrected, i.e. have a > + */ > + trigger_pvti_update(pvti_pa); > + pvti_uncorrected = *pvti_va; > + GUEST_SYNC(STAGE_UNCORRECTED); > + > + /* > + * The test should have triggered the correction by this > + * point in time. We have a copy of each of the PVTI structs > + * at each stage now. > + * > + * Let's sample the timestamp at a SINGLE point in time and > + * then calculate what the KVM clock would be using the PVTI > + * from each stage. > + * > + * Then return each of these values to the tester. > + */ /* * Snapshot the corrected time (the host does KVM_SET_CLOCK_GUEST when * handling STAGE_UNCORRECTED). */ > + pvti_corrected = *pvti_va; /* * Sample the timestamp at a SINGLE point in time, and then calculate * the effective KVM clock using the PVTI from each stage, and sync all * values back to the host for verification. */ On that last point though, why sync things back to the host? The verification can be done in the guest via __GUEST_ASSERT(), that way there are few magic fields being passed around, e.g. no need for uc.args[2..4]. > + tsc_guest = rdtsc(); > + > + cycles_boot = __pvclock_read_cycles(&pvti_boot, tsc_guest); > + cycles_uncorrected = __pvclock_read_cycles(&pvti_uncorrected, tsc_guest); > + cycles_corrected = __pvclock_read_cycles(&pvti_corrected, tsc_guest); > + > + GUEST_SYNC_ARGS(STAGE_CORRECTED, cycles_boot, cycles_uncorrected, > + cycles_corrected, 0); > +} > + > +static void run_test(struct kvm_vm *vm, struct kvm_vcpu *vcpu) > +{ > + struct pvclock_vcpu_time_info pvti_before; > + uint64_t before, uncorrected, corrected; > + int64_t delta_uncorrected, delta_corrected; > + struct ucall uc; > + uint64_t ucall_reason; > + > + /* Loop through each stage of the test. */ > + while (true) { > + > + /* Start/restart the running vCPU code. */ > + vcpu_run(vcpu); > + TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO); > + > + /* Retrieve and verify our stage. */ > + ucall_reason = get_ucall(vcpu, &uc); > + TEST_ASSERT(ucall_reason == UCALL_SYNC, > + "Unhandled ucall reason=%lu", > + ucall_reason); Or just TEST_ASSERT_EQ(). > + /* Run host specific code relating to stage. */ > + switch (uc.args[1]) { > + case STAGE_FIRST_BOOT: > + /* Store the KVM clock values before an update. */ > + vcpu_ioctl(vcpu, KVM_GET_CLOCK_GUEST, &pvti_before); > + > + /* Sleep for a set amount of time to increase delta. */ > + sleep(5); This is probably worth plumbing in via command line, e.g. so that the test can run with a shorter sleep() by default while also allowing users to stress things by running with longer delays. Ideally, the default sleep() would be as short as possible while still detecting ~100% of bugs. > + break; > + > + case STAGE_UNCORRECTED: > + /* Restore the KVM clock values. */ > + vcpu_ioctl(vcpu, KVM_SET_CLOCK_GUEST, &pvti_before); > + break; > + > + case STAGE_CORRECTED: > + /* Query the clock information and verify delta. */ > + before = uc.args[2]; > + uncorrected = uc.args[3]; > + corrected = uc.args[4]; > + > + delta_uncorrected = before - uncorrected; > + delta_corrected = before - corrected; > + > + pr_info("before=%lu uncorrected=%lu corrected=%lu\n", > + before, uncorrected, corrected); > + > + pr_info("delta_uncorrected=%ld delta_corrected=%ld\n", > + delta_uncorrected, delta_corrected); > + > + TEST_ASSERT((delta_corrected <= 1) && (delta_corrected >= -1), > + "larger than expected delta detected = %ld", delta_corrected); > + return; > + } > + } > +} > + > +static void configure_pvclock(struct kvm_vm *vm, struct kvm_vcpu *vcpu) > +{ > + unsigned int gpages; I'd prefer something like nr_pages > + > + gpages = vm_calc_num_guest_pages(VM_MODE_DEFAULT, KVMCLOCK_SIZE); > + vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, > + KVMCLOCK_GPA, 1, gpages, 0); > + virt_map(vm, KVMCLOCK_GPA, KVMCLOCK_GPA, gpages); > + > + vcpu_args_set(vcpu, 1, KVMCLOCK_GPA); This is somewhat silly. If you're going to hardcode the address, just use the #define in both the host and the guest. Then this helper doesn't need to take a vCPU and could be more easily expanded to multiple vCPUs (if there's a good reason to do so). > +} > + > +static void configure_scaled_tsc(struct kvm_vcpu *vcpu) > +{ > + uint64_t tsc_khz; > + > + tsc_khz = __vcpu_ioctl(vcpu, KVM_GET_TSC_KHZ, NULL); > + pr_info("scaling tsc from %ldKHz to %ldKHz\n", tsc_khz, tsc_khz / 2); > + tsc_khz /= 2; There's nothing special about scaling to 50%, correct? So rather than hardcode a single testcase, enumerate over a variety of frequencies, and specifically cross the 32-bit boundary, e.g. 1Ghz - 5Ghz at 500Mhz jumps or something, plus the host's native unscaled value. > + vcpu_ioctl(vcpu, KVM_SET_TSC_KHZ, (void *)tsc_khz); > +} > + > +int main(int argc, char *argv[]) > +{ > + struct kvm_vcpu *vcpu; > + struct kvm_vm *vm; > + bool scale_tsc; > + > + scale_tsc = argc > 1 && (!strncmp(argv[1], "-s", 3) || > + !strncmp(argv[1], "--scale-tsc", 10)); I think it's worth adding proper argument parsing, e.g. to print a help. The boilerplate is annoying, but it'll payoff in the long run as I suspect we'll end up with more params, e.g. to configure the sleep/delay, the min/max frequency, the intervals between frequencies, etc. > + > + TEST_REQUIRE(sys_clocksource_is_based_on_tsc()); > + > + vm = vm_create_with_one_vcpu(&vcpu, guest_code); > + > + configure_pvclock(vm, vcpu); > + > + if (scale_tsc) > + configure_scaled_tsc(vcpu); > + > + run_test(vm, vcpu); > + > + return 0; > +} > -- > 2.44.0 >
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile index 871e2de3eb05..e33f56fedb0c 100644 --- a/tools/testing/selftests/kvm/Makefile +++ b/tools/testing/selftests/kvm/Makefile @@ -87,6 +87,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/pmu_counters_test TEST_GEN_PROGS_x86_64 += x86_64/pmu_event_filter_test TEST_GEN_PROGS_x86_64 += x86_64/private_mem_conversions_test TEST_GEN_PROGS_x86_64 += x86_64/private_mem_kvm_exits_test +TEST_GEN_PROGS_x86_64 += x86_64/pvclock_test TEST_GEN_PROGS_x86_64 += x86_64/set_boot_cpu_id TEST_GEN_PROGS_x86_64 += x86_64/set_sregs_test TEST_GEN_PROGS_x86_64 += x86_64/smaller_maxphyaddr_emulation_test diff --git a/tools/testing/selftests/kvm/x86_64/pvclock_test.c b/tools/testing/selftests/kvm/x86_64/pvclock_test.c new file mode 100644 index 000000000000..376ffb730a53 --- /dev/null +++ b/tools/testing/selftests/kvm/x86_64/pvclock_test.c @@ -0,0 +1,192 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright © 2024, Amazon.com, Inc. or its affiliates. + * + * Tests for pvclock API + * KVM_SET_CLOCK_GUEST/KVM_GET_CLOCK_GUEST + */ +#include <asm/pvclock.h> +#include <asm/pvclock-abi.h> +#include <sys/stat.h> +#include <stdint.h> +#include <stdio.h> + +#include "test_util.h" +#include "kvm_util.h" +#include "processor.h" + +enum { + STAGE_FIRST_BOOT, + STAGE_UNCORRECTED, + STAGE_CORRECTED +}; + +#define KVMCLOCK_GPA 0xc0000000ull +#define KVMCLOCK_SIZE sizeof(struct pvclock_vcpu_time_info) + +static void trigger_pvti_update(vm_paddr_t pvti_pa) +{ + /* + * We need a way to trigger KVM to update the fields + * in the PV time info. The easiest way to do this is + * to temporarily switch to the old KVM system time + * method and then switch back to the new one. + */ + wrmsr(MSR_KVM_SYSTEM_TIME, pvti_pa | KVM_MSR_ENABLED); + wrmsr(MSR_KVM_SYSTEM_TIME_NEW, pvti_pa | KVM_MSR_ENABLED); +} + +static void guest_code(vm_paddr_t pvti_pa) +{ + struct pvclock_vcpu_time_info *pvti_va = + (struct pvclock_vcpu_time_info *)pvti_pa; + + struct pvclock_vcpu_time_info pvti_boot; + struct pvclock_vcpu_time_info pvti_uncorrected; + struct pvclock_vcpu_time_info pvti_corrected; + uint64_t cycles_boot; + uint64_t cycles_uncorrected; + uint64_t cycles_corrected; + uint64_t tsc_guest; + + /* + * Setup the KVMCLOCK in the guest & store the original + * PV time structure that is used. + */ + wrmsr(MSR_KVM_SYSTEM_TIME_NEW, pvti_pa | KVM_MSR_ENABLED); + pvti_boot = *pvti_va; + GUEST_SYNC(STAGE_FIRST_BOOT); + + /* + * Trigger an update of the PVTI, if we calculate + * the KVM clock using this structure we'll see + * a delta from the TSC. + */ + trigger_pvti_update(pvti_pa); + pvti_uncorrected = *pvti_va; + GUEST_SYNC(STAGE_UNCORRECTED); + + /* + * The test should have triggered the correction by this + * point in time. We have a copy of each of the PVTI structs + * at each stage now. + * + * Let's sample the timestamp at a SINGLE point in time and + * then calculate what the KVM clock would be using the PVTI + * from each stage. + * + * Then return each of these values to the tester. + */ + pvti_corrected = *pvti_va; + tsc_guest = rdtsc(); + + cycles_boot = __pvclock_read_cycles(&pvti_boot, tsc_guest); + cycles_uncorrected = __pvclock_read_cycles(&pvti_uncorrected, tsc_guest); + cycles_corrected = __pvclock_read_cycles(&pvti_corrected, tsc_guest); + + GUEST_SYNC_ARGS(STAGE_CORRECTED, cycles_boot, cycles_uncorrected, + cycles_corrected, 0); +} + +static void run_test(struct kvm_vm *vm, struct kvm_vcpu *vcpu) +{ + struct pvclock_vcpu_time_info pvti_before; + uint64_t before, uncorrected, corrected; + int64_t delta_uncorrected, delta_corrected; + struct ucall uc; + uint64_t ucall_reason; + + /* Loop through each stage of the test. */ + while (true) { + + /* Start/restart the running vCPU code. */ + vcpu_run(vcpu); + TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO); + + /* Retrieve and verify our stage. */ + ucall_reason = get_ucall(vcpu, &uc); + TEST_ASSERT(ucall_reason == UCALL_SYNC, + "Unhandled ucall reason=%lu", + ucall_reason); + + /* Run host specific code relating to stage. */ + switch (uc.args[1]) { + case STAGE_FIRST_BOOT: + /* Store the KVM clock values before an update. */ + vcpu_ioctl(vcpu, KVM_GET_CLOCK_GUEST, &pvti_before); + + /* Sleep for a set amount of time to increase delta. */ + sleep(5); + break; + + case STAGE_UNCORRECTED: + /* Restore the KVM clock values. */ + vcpu_ioctl(vcpu, KVM_SET_CLOCK_GUEST, &pvti_before); + break; + + case STAGE_CORRECTED: + /* Query the clock information and verify delta. */ + before = uc.args[2]; + uncorrected = uc.args[3]; + corrected = uc.args[4]; + + delta_uncorrected = before - uncorrected; + delta_corrected = before - corrected; + + pr_info("before=%lu uncorrected=%lu corrected=%lu\n", + before, uncorrected, corrected); + + pr_info("delta_uncorrected=%ld delta_corrected=%ld\n", + delta_uncorrected, delta_corrected); + + TEST_ASSERT((delta_corrected <= 1) && (delta_corrected >= -1), + "larger than expected delta detected = %ld", delta_corrected); + return; + } + } +} + +static void configure_pvclock(struct kvm_vm *vm, struct kvm_vcpu *vcpu) +{ + unsigned int gpages; + + gpages = vm_calc_num_guest_pages(VM_MODE_DEFAULT, KVMCLOCK_SIZE); + vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, + KVMCLOCK_GPA, 1, gpages, 0); + virt_map(vm, KVMCLOCK_GPA, KVMCLOCK_GPA, gpages); + + vcpu_args_set(vcpu, 1, KVMCLOCK_GPA); +} + +static void configure_scaled_tsc(struct kvm_vcpu *vcpu) +{ + uint64_t tsc_khz; + + tsc_khz = __vcpu_ioctl(vcpu, KVM_GET_TSC_KHZ, NULL); + pr_info("scaling tsc from %ldKHz to %ldKHz\n", tsc_khz, tsc_khz / 2); + tsc_khz /= 2; + vcpu_ioctl(vcpu, KVM_SET_TSC_KHZ, (void *)tsc_khz); +} + +int main(int argc, char *argv[]) +{ + struct kvm_vcpu *vcpu; + struct kvm_vm *vm; + bool scale_tsc; + + scale_tsc = argc > 1 && (!strncmp(argv[1], "-s", 3) || + !strncmp(argv[1], "--scale-tsc", 10)); + + TEST_REQUIRE(sys_clocksource_is_based_on_tsc()); + + vm = vm_create_with_one_vcpu(&vcpu, guest_code); + + configure_pvclock(vm, vcpu); + + if (scale_tsc) + configure_scaled_tsc(vcpu); + + run_test(vm, vcpu); + + return 0; +}