Message ID: 20210916181538.968978-5-oupton@google.com
Series: KVM: x86: Add idempotent controls for migrating system counter state
On Thu, Sep 16, 2021 at 06:15:35PM +0000, Oliver Upton wrote: > Handling the migration of TSCs correctly is difficult, in part because > Linux does not provide userspace with the ability to retrieve a (TSC, > realtime) clock pair for a single instant in time. In lieu of a more > convenient facility, KVM can report similar information in the kvm_clock > structure. > > Provide userspace with a host TSC & realtime pair iff the realtime clock > is based on the TSC. If userspace provides KVM_SET_CLOCK with a valid > realtime value, advance the KVM clock by the amount of elapsed time. Do > not step the KVM clock backwards, though, as it is a monotonic > oscillator. > > Suggested-by: Paolo Bonzini <pbonzini@redhat.com> > Signed-off-by: Oliver Upton <oupton@google.com> > --- > Documentation/virt/kvm/api.rst | 42 ++++++++++++++++++++++++++------- > arch/x86/include/asm/kvm_host.h | 3 +++ > arch/x86/kvm/x86.c | 36 +++++++++++++++++++++------- > include/uapi/linux/kvm.h | 7 +++++- > 4 files changed, 70 insertions(+), 18 deletions(-) > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst > index a6729c8cf063..d0b9c986cf6c 100644 > --- a/Documentation/virt/kvm/api.rst > +++ b/Documentation/virt/kvm/api.rst > @@ -993,20 +993,34 @@ such as migration. > When KVM_CAP_ADJUST_CLOCK is passed to KVM_CHECK_EXTENSION, it returns the > set of bits that KVM can return in struct kvm_clock_data's flag member. > > -The only flag defined now is KVM_CLOCK_TSC_STABLE. If set, the returned > -value is the exact kvmclock value seen by all VCPUs at the instant > -when KVM_GET_CLOCK was called. If clear, the returned value is simply > -CLOCK_MONOTONIC plus a constant offset; the offset can be modified > -with KVM_SET_CLOCK. KVM will try to make all VCPUs follow this clock, > -but the exact value read by each VCPU could differ, because the host > -TSC is not stable. > +FLAGS: > + > +KVM_CLOCK_TSC_STABLE. 
If set, the returned value is the exact kvmclock > +value seen by all VCPUs at the instant when KVM_GET_CLOCK was called. > +If clear, the returned value is simply CLOCK_MONOTONIC plus a constant > +offset; the offset can be modified with KVM_SET_CLOCK. KVM will try > +to make all VCPUs follow this clock, but the exact value read by each > +VCPU could differ, because the host TSC is not stable. > + > +KVM_CLOCK_REALTIME. If set, the `realtime` field in the kvm_clock_data > +structure is populated with the value of the host's real time > +clocksource at the instant when KVM_GET_CLOCK was called. If clear, > +the `realtime` field does not contain a value. > + > +KVM_CLOCK_HOST_TSC. If set, the `host_tsc` field in the kvm_clock_data > +structure is populated with the value of the host's timestamp counter (TSC) > +at the instant when KVM_GET_CLOCK was called. If clear, the `host_tsc` field > +does not contain a value.

If the host TSCs are not stable, then the KVM_CLOCK_HOST_TSC bit (and host_tsc field) are ambiguous. Shouldn't exposing them be conditional on a stable host TSC?
On 28/09/21 20:53, Marcelo Tosatti wrote:
>> +KVM_CLOCK_HOST_TSC. If set, the `host_tsc` field in the kvm_clock_data
>> +structure is populated with the value of the host's timestamp counter (TSC)
>> +at the instant when KVM_GET_CLOCK was called. If clear, the `host_tsc` field
>> +does not contain a value.
>
> If the host TSCs are not stable, then the KVM_CLOCK_HOST_TSC bit (and
> host_tsc field) are ambiguous. Shouldn't exposing them be conditional on
> a stable host TSC?

Yes, good point.

Paolo
Oliver, Do you have any numbers for the improvement in guests CLOCK_REALTIME accuracy across migration, when this is in place? On Thu, Sep 16, 2021 at 06:15:35PM +0000, Oliver Upton wrote: > Handling the migration of TSCs correctly is difficult, in part because > Linux does not provide userspace with the ability to retrieve a (TSC, > realtime) clock pair for a single instant in time. In lieu of a more > convenient facility, KVM can report similar information in the kvm_clock > structure. > > Provide userspace with a host TSC & realtime pair iff the realtime clock > is based on the TSC. If userspace provides KVM_SET_CLOCK with a valid > realtime value, advance the KVM clock by the amount of elapsed time. Do > not step the KVM clock backwards, though, as it is a monotonic > oscillator. > > Suggested-by: Paolo Bonzini <pbonzini@redhat.com> > Signed-off-by: Oliver Upton <oupton@google.com> > --- > Documentation/virt/kvm/api.rst | 42 ++++++++++++++++++++++++++------- > arch/x86/include/asm/kvm_host.h | 3 +++ > arch/x86/kvm/x86.c | 36 +++++++++++++++++++++------- > include/uapi/linux/kvm.h | 7 +++++- > 4 files changed, 70 insertions(+), 18 deletions(-) > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst > index a6729c8cf063..d0b9c986cf6c 100644 > --- a/Documentation/virt/kvm/api.rst > +++ b/Documentation/virt/kvm/api.rst > @@ -993,20 +993,34 @@ such as migration. > When KVM_CAP_ADJUST_CLOCK is passed to KVM_CHECK_EXTENSION, it returns the > set of bits that KVM can return in struct kvm_clock_data's flag member. > > -The only flag defined now is KVM_CLOCK_TSC_STABLE. If set, the returned > -value is the exact kvmclock value seen by all VCPUs at the instant > -when KVM_GET_CLOCK was called. If clear, the returned value is simply > -CLOCK_MONOTONIC plus a constant offset; the offset can be modified > -with KVM_SET_CLOCK. 
KVM will try to make all VCPUs follow this clock, > -but the exact value read by each VCPU could differ, because the host > -TSC is not stable. > +FLAGS: > + > +KVM_CLOCK_TSC_STABLE. If set, the returned value is the exact kvmclock > +value seen by all VCPUs at the instant when KVM_GET_CLOCK was called. > +If clear, the returned value is simply CLOCK_MONOTONIC plus a constant > +offset; the offset can be modified with KVM_SET_CLOCK. KVM will try > +to make all VCPUs follow this clock, but the exact value read by each > +VCPU could differ, because the host TSC is not stable. > + > +KVM_CLOCK_REALTIME. If set, the `realtime` field in the kvm_clock_data > +structure is populated with the value of the host's real time > +clocksource at the instant when KVM_GET_CLOCK was called. If clear, > +the `realtime` field does not contain a value. > + > +KVM_CLOCK_HOST_TSC. If set, the `host_tsc` field in the kvm_clock_data > +structure is populated with the value of the host's timestamp counter (TSC) > +at the instant when KVM_GET_CLOCK was called. If clear, the `host_tsc` field > +does not contain a value. > > :: > > struct kvm_clock_data { > __u64 clock; /* kvmclock current value */ > __u32 flags; > - __u32 pad[9]; > + __u32 pad0; > + __u64 realtime; > + __u64 host_tsc; > + __u32 pad[4]; > }; > > > @@ -1023,12 +1037,22 @@ Sets the current timestamp of kvmclock to the value specified in its parameter. > In conjunction with KVM_GET_CLOCK, it is used to ensure monotonicity on scenarios > such as migration. > > +FLAGS: > + > +KVM_CLOCK_REALTIME. If set, KVM will compare the value of the `realtime` field > +with the value of the host's real time clocksource at the instant when > +KVM_SET_CLOCK was called. The difference in elapsed time is added to the final > +kvmclock value that will be provided to guests. 
> + > :: > > struct kvm_clock_data { > __u64 clock; /* kvmclock current value */ > __u32 flags; > - __u32 pad[9]; > + __u32 pad0; > + __u64 realtime; > + __u64 host_tsc; > + __u32 pad[4]; > }; > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index be6805fc0260..9c34b5b63e39 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -1936,4 +1936,7 @@ int kvm_cpu_dirty_log_size(void); > > int alloc_all_memslots_rmaps(struct kvm *kvm); > > +#define KVM_CLOCK_VALID_FLAGS \ > + (KVM_CLOCK_TSC_STABLE | KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC) > + > #endif /* _ASM_X86_KVM_HOST_H */ > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 523c4e5c109f..cb5d5cad5124 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -2815,10 +2815,20 @@ static void get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data) > get_cpu(); > > if (__this_cpu_read(cpu_tsc_khz)) { > +#ifdef CONFIG_X86_64 > + struct timespec64 ts; > + > + if (kvm_get_walltime_and_clockread(&ts, &data->host_tsc)) { > + data->realtime = ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec; > + data->flags |= KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC; > + } else > +#endif > + data->host_tsc = rdtsc(); > + > kvm_get_time_scale(NSEC_PER_SEC, __this_cpu_read(cpu_tsc_khz) * 1000LL, > &hv_clock.tsc_shift, > &hv_clock.tsc_to_system_mul); > - data->clock = __pvclock_read_cycles(&hv_clock, rdtsc()); > + data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc); > } else { > data->clock = get_kvmclock_base_ns() + ka->kvmclock_offset; > } > @@ -4062,7 +4072,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) > r = KVM_SYNC_X86_VALID_FIELDS; > break; > case KVM_CAP_ADJUST_CLOCK: > - r = KVM_CLOCK_TSC_STABLE; > + r = KVM_CLOCK_VALID_FLAGS; > break; > case KVM_CAP_X86_DISABLE_EXITS: > r |= KVM_X86_DISABLE_EXITS_HLT | KVM_X86_DISABLE_EXITS_PAUSE | > @@ -5859,12 +5869,12 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user 
*argp)
> {
> 	struct kvm_arch *ka = &kvm->arch;
> 	struct kvm_clock_data data;
> -	u64 now_ns;
> +	u64 now_raw_ns;
>
> 	if (copy_from_user(&data, argp, sizeof(data)))
> 		return -EFAULT;
>
> -	if (data.flags)
> +	if (data.flags & ~KVM_CLOCK_REALTIME)
> 		return -EINVAL;
>
> 	kvm_hv_invalidate_tsc_page(kvm);
> @@ -5878,11 +5888,21 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp)
> 	 * is slightly ahead) here we risk going negative on unsigned
> 	 * 'system_time' when 'data.clock' is very small.
> 	 */
> -	if (kvm->arch.use_master_clock)
> -		now_ns = ka->master_kernel_ns;
> +	if (data.flags & KVM_CLOCK_REALTIME) {
> +		u64 now_real_ns = ktime_get_real_ns();
> +
> +		/*
> +		 * Avoid stepping the kvmclock backwards.
> +		 */
> +		if (now_real_ns > data.realtime)
> +			data.clock += now_real_ns - data.realtime;
> +	}

Forward jumps can also cause problems, for example:

* Kernel watchdogs

* https://patchwork.ozlabs.org/project/qemu-devel/patch/20130618233825.GA19042@amt.cnet/

So perhaps limiting the amount of forward jump that is allowed would be a good thing (which can happen if the two hosts' realtime clocks are off)? Now by how much, I am not sure.

Or, as mentioned earlier, only enable KVM_CLOCK_REALTIME if userspace KVM code checks clock synchronization.

Thomas, CC'ed, has a deeper understanding of the problems with forward time jumps than I do. Thomas, any comments?

As a note: this makes it not OK to use the KVM_CLOCK_REALTIME flag for either vm pause / vm resume (well, if paused for long periods of time) or savevm / restorevm.
> + if (ka->use_master_clock) > + now_raw_ns = ka->master_kernel_ns; > else > - now_ns = get_kvmclock_base_ns(); > - ka->kvmclock_offset = data.clock - now_ns; > + now_raw_ns = get_kvmclock_base_ns(); > + ka->kvmclock_offset = data.clock - now_raw_ns; > kvm_end_pvclock_update(kvm); > return 0; > } > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h > index a067410ebea5..d228bf394465 100644 > --- a/include/uapi/linux/kvm.h > +++ b/include/uapi/linux/kvm.h > @@ -1223,11 +1223,16 @@ struct kvm_irqfd { > > /* Do not use 1, KVM_CHECK_EXTENSION returned it before we had flags. */ > #define KVM_CLOCK_TSC_STABLE 2 > +#define KVM_CLOCK_REALTIME (1 << 2) > +#define KVM_CLOCK_HOST_TSC (1 << 3) > > struct kvm_clock_data { > __u64 clock; > __u32 flags; > - __u32 pad[9]; > + __u32 pad0; > + __u64 realtime; > + __u64 host_tsc; > + __u32 pad[4]; > }; > > /* For KVM_CAP_SW_TLB */ > -- > 2.33.0.309.g3052b89438-goog > >
On Wed, Sep 29, 2021 at 03:56:29PM -0300, Marcelo Tosatti wrote:
> Oliver,
>
> Do you have any numbers for the improvement in guests CLOCK_REALTIME
> accuracy across migration, when this is in place?
>
> [Oliver's patch and the KVM_SET_CLOCK comments quoted in full above; snipped]
>
> Thomas, CC'ed, has deeper understanding of problems with
> forward time jumps than I do. Thomas, any comments?
Thomas,

Based on the earlier discussion about the problems of synchronizing the guest's clock via a notification to the NTP/Chrony daemon (where there is a window in which applications can read a stale value of the clock), a possible solution would be triggering an NMI on the destination (so that it runs ASAP, with higher priority than application/kernel code).

What would this NMI do, exactly?
Marcelo,

On Thu, Sep 30 2021 at 16:21, Marcelo Tosatti wrote:
> On Wed, Sep 29, 2021 at 03:56:29PM -0300, Marcelo Tosatti wrote:
>> On Thu, Sep 16, 2021 at 06:15:35PM +0000, Oliver Upton wrote:
>>
>> Thomas, CC'ed, has deeper understanding of problems with
>> forward time jumps than I do. Thomas, any comments?
>
> Based on the earlier discussion about the problems of synchronizing
> the guests clock via a notification to the NTP/Chrony daemon
> (where there is a window where applications can read the stale
> value of the clock), a possible solution would be triggering
> an NMI on the destination (so that it runs ASAP, with higher
> priority than application/kernel).
>
> What would this NMI do, exactly?

Nothing. You cannot do anything time-related in an NMI.

You might queue irq work which handles that, but that would still not prevent user space or kernel space from observing the stale time stamp, depending on the execution state from which it resumes.

>> As a note: this makes it not OK to use KVM_CLOCK_REALTIME flag
>> for either vm pause / vm resume (well, if paused for long periods of time)
>> or savevm / restorevm.
>
> Maybe with the NMI above, it would be possible to use
> the realtime clock as a way to know time elapsed between
> events and advance guest clock without the current
> problematic window.

However much duct tape you throw at the problem, it cannot ever be solved, because it's fundamentally broken. All you can do is to make the observation windows smaller, but that's just curing the symptom.

The problem is that the guest is paused/resumed without getting any information about that, and the execution of the guest is stopped at an arbitrary instruction boundary from which it resumes after migration or restore. So there is no way to guarantee that after resume all vCPUs are executing in a state which can handle that.

But even if that were the case, then what prevents the stale time stamps from being visible?
Nothing:

    T0: t = now();
    -> pause
    -> resume
    -> magic "fixup"
    T1: dostuff(t);

But that's not a fundamental problem, because every preemptible or interruptible code path has the same issue:

    T0: t = now();
    -> preemption or interrupt
    T1: dostuff(t);

Which is usually not a problem, but it becomes a problem when T1 - T0 is greater than the usual expectations, which can obviously be trivially achieved by guest migration or a savevm, restorevm cycle.

But let's go a step back and look at the clocks and their expectations:

CLOCK_MONOTONIC:

    Monotonically increasing clock which counts unless the system
    is in suspend. On resume it continues counting without jumping
    forward.

    That's the reference clock for everything else and therefore it
    is important that it does _not_ jump around.

    The reasons why CLOCK_MONOTONIC stops during suspend are
    historical, and any attempt to change that breaks the world and
    some more, because making it jump forward will trigger all sorts
    of timeouts, watchdogs and whatever. The last attempt to make
    CLOCK_MONOTONIC behave like CLOCK_BOOTTIME was reverted within 3
    weeks. It's not going to be attempted again. See a3ed0e4393d6
    ("Revert: Unify CLOCK_MONOTONIC and CLOCK_BOOTTIME") for
    details.

Now the proposed change is creating exactly the same problem:

>> > +	if (data.flags & KVM_CLOCK_REALTIME) {
>> > +		u64 now_real_ns = ktime_get_real_ns();
>> > +
>> > +		/*
>> > +		 * Avoid stepping the kvmclock backwards.
>> > +		 */
>> > +		if (now_real_ns > data.realtime)
>> > +			data.clock += now_real_ns - data.realtime;
>> > +	}

IOW, it takes the time between pause and resume into account and forwards the underlying base clock, which makes CLOCK_MONOTONIC jump forward by exactly that amount of time.

So depending on the size of the delta, you are running into exactly the same problem as the final failed attempt to unify CLOCK_MONOTONIC and CLOCK_BOOTTIME, which btw. would have been a magic cure for virt.
Too bad, not going to happen ever unless you fix all affected user space and kernel side issues.

CLOCK_BOOTTIME:

    CLOCK_MONOTONIC + time spent in suspend

CLOCK_REALTIME/TAI:

    CLOCK_MONOTONIC + offset

    The offset is established by reading RTC at boot time and can be
    changed by clock_settime(2) and adjtimex(2). The latter is used
    by NTP/PTP.

    Any user of CLOCK_REALTIME has to be prepared for it to jump in
    both directions, but of course NTP/PTP daemons have expectations
    vs. such time jumps.

They rightfully assume, on a properly configured and administrated system, that there are only two things which can make CLOCK_REALTIME jump:

 1) NTP/PTP daemon controlled

 2) Suspend/resume related updates by the kernel

Just for the record, these assumptions predate virtualization.

So now virt came along and created a hard-to-solve circular dependency problem:

 - If CLOCK_MONOTONIC stops for too long then NTP/PTP gets out of
   sync, but everything else is happy.

 - If CLOCK_MONOTONIC jumps too far forward, then all hell breaks
   loose, but NTP/PTP is happy.

IOW, you are up a creek without a paddle and you have to choose one evil.

That's simply a design fail, because there has been no design for this from day one. But I'm not surprised at all by that, simply because virtualization followed the hardware design fails vs. time and timekeeping which have kept us entertained for the past 20 years on various architectures, but most prominently on X86, which is the uncontended master of disaster in that regard.

Of course virt follows the same approach as hardware by ignoring the problem and coming up with more duct tape and the assumption that lack of design can be "fixed in software". Just the timeframe is slightly different: we're discussing this only for about 10 years now.

Seriously? All you folks can come up with in 10 years is piling duct tape on duct tape instead of sitting down and fixing the underlying root cause once and forever?
I'm aware that chrony especially has tried to deal with this nonsense more gracefully, but that still does not make it great and it still gets upset.

The reason why suspend/resume works perfectly fine is that it's fully coordinated and NTP state is cleared on resume, which makes it easy for the daemon to accommodate.

So again, and I'm telling this for a decade now:

 1) Stop pretending that you can fix the lack of design with duct tape
    engineering

 2) Accept the fundamental properties of Linux timekeeping, as they are
    not going to change, as explained above

 3) Either accept that CLOCK_REALTIME is off and jumping around, which
    confuses NTP/PTP, or get your act together and design and implement a
    proper synchronization mechanism for this:

    - Notify the guest about the intended "pause" or "savevm" event

    - Let the guest go into a lightweight freeze similar to S2IDLE

    - Save the VM for later resume or migrate the saved state

    - Watch everything working as expected on resume

    - Have the benefit that pause/resume and savevm/restorevm have
      exactly the same behaviour

That won't solve the problem for frankenkernels and !paravirt setups, but that's unsolvable and you can keep the pieces by choosing one of two evils. While I do not care at all, I still recommend choosing CLOCK_MONOTONIC correctness for obvious reasons.

The frankenkernel/legacy problem aside, I know you are going to tell me that this is too much overhead and has VM-customer-visible impact. It's your choice, really: either you choose correctness or you decide to ignore correctness for whatever reason. There is no middle ground, simply because you _cannot_ guarantee that your migration time is within the acceptable limits of the CLOCK_MONOTONIC or the CLOCK_REALTIME expectations.

You can limit the damage somehow by making some arbitrary cutoff of how much you forward CLOCK_MONOTONIC, but don't ask me about the right value.
If you decide that correctness is overrated, then please document it clearly instead of trying to pretend being correct.

I'm curious whether the hardware people or the virt folks come to their senses first, but honestly I'm not expecting this to happen before I retire.

Thanks,

        tglx
On Fri, Oct 01, 2021 at 01:02:23AM +0200, Thomas Gleixner wrote: > Marcelo, > > On Thu, Sep 30 2021 at 16:21, Marcelo Tosatti wrote: > > On Wed, Sep 29, 2021 at 03:56:29PM -0300, Marcelo Tosatti wrote: > >> On Thu, Sep 16, 2021 at 06:15:35PM +0000, Oliver Upton wrote: > >> > >> Thomas, CC'ed, has deeper understanding of problems with > >> forward time jumps than I do. Thomas, any comments? > > > > Based on the earlier discussion about the problems of synchronizing > > the guests clock via a notification to the NTP/Chrony daemon > > (where there is a window where applications can read the stale > > value of the clock), a possible solution would be triggering > > an NMI on the destination (so that it runs ASAP, with higher > > priority than application/kernel). > > > > What would this NMI do, exactly? > > Nothing. You cannot do anything time related in an NMI. > > You might queue irq work which handles that, but that would still not > prevent user space or kernel space from observing the stale time stamp > depending on the execution state from where it resumes. Yes. > >> As a note: this makes it not OK to use KVM_CLOCK_REALTIME flag > >> for either vm pause / vm resume (well, if paused for long periods of time) > >> or savevm / restorevm. > > > > Maybe with the NMI above, it would be possible to use > > the realtime clock as a way to know time elapsed between > > events and advance guest clock without the current > > problematic window. > > As much duct tape you throw at the problem, it cannot be solved ever > because it's fundamentally broken. All you can do is to make the > observation windows smaller, but that's just curing the symptom. Yes. > The problem is that the guest is paused/resumed without getting any > information about that and the execution of the guest is stopped at an > arbitrary instruction boundary from which it resumes after migration or > restore. 
So there is no way to guarantee that after resume all vCPUs are > executing in a state which can handle that. > > But even if that would be the case, then what prevents the stale time > stamps to be visible? Nothing: > > T0: t = now(); > -> pause > -> resume > -> magic "fixup" > T1: dostuff(t); Yes. BTW, you could have a userspace notification (then applications could handle this if desired). > But that's not a fundamental problem because every preemptible or > interruptible code has the same issue: > > T0: t = now(); > -> preemption or interrupt > T1: dostuff(t); > > Which is usually not a problem, but It becomes a problem when T1 - T0 is > greater than the usual expectations which can obviously be trivially > achieved by guest migration or a savevm, restorevm cycle. > > But let's go a step back and look at the clocks and their expectations: > > CLOCK_MONOTONIC: > > Monotonically increasing clock which counts unless the system > is in suspend. On resume it continues counting without jumping > forward. > > That's the reference clock for everything else and therefore it > is important that it does _not_ jump around. > > The reasons why CLOCK_MONOTONIC stops during suspend is > historical and any attempt to change that breaks the world and > some more because making it jump forward will trigger all sorts > of timeouts, watchdogs and whatever. The last attempt to make > CLOCK_MONOTONIC behave like CLOCK_BOOTTIME was reverted within 3 > weeks. It's not going to be attempted again. See a3ed0e4393d6 > ("Revert: Unify CLOCK_MONOTONIC and CLOCK_BOOTTIME") for > details. > > Now the proposed change is creating exactly the same problem: > > >> > + if (data.flags & KVM_CLOCK_REALTIME) { > >> > + u64 now_real_ns = ktime_get_real_ns(); > >> > + > >> > + /* > >> > + * Avoid stepping the kvmclock backwards. 
> >> > + */ > >> > + if (now_real_ns > data.realtime) > >> > + data.clock += now_real_ns - data.realtime; > >> > + } > > IOW, it takes the time between pause and resume into account and > forwards the underlying base clock which makes CLOCK_MONOTONIC > jump forward by exactly that amount of time. Well, it is assuming that the T0: t = now(); T1: pause vm() T2: finish vm migration() T3: dostuff(t); Interval between T1 and T2 is small (and that the guest clocks are synchronized up to a given boundary). But i suppose adding a limit to the forward clock advance (in the migration case) is useful: 1) If migration (well actually, only the final steps to finish migration, the time between when guest is paused on source and is resumed on destination) takes too long, then too bad: fix it to be shorter if you want the clocks to have close to zero change to realtime on migration. 2) Avoid the other bugs in case of large forward advance. Maybe having it configurable, with a say, 1 minute maximum by default is a good choice? An alternative would be to advance only the guests REALTIME clock, from data about how long steps T1-T2 took. > So depending on the size of the delta you are running into exactly the > same problem as the final failed attempt to unify CLOCK_MONOTONIC and > CLOCK_BOOTTIME which btw. would have been a magic cure for virt. > > Too bad, not going to happen ever unless you fix all affected user > space and kernel side issues. > > > CLOCK_BOOTTIME: > > CLOCK_MONOTONIC + time spent in suspend > > > CLOCK_REALTIME/TAI: > > CLOCK_MONOTONIC + offset > > The offset is established by reading RTC at boot time and can be > changed by clock_settime(2) and adjtimex(2). The latter is used > by NTP/PTP. > > Any user of CLOCK_REALTIME has to be prepared for it to jump in > both directions, but of course NTP/PTP daemons have expectations > vs. such time jumps. 
> > They rightfully assume on a properly configured and administrated > system that there are only two things which can make CLOCK_REALTIME > jump: > > 1) NTP/PTP daemon controlled > 2) Suspend/resume related updates by the kernel > > > Just for the record, these assumptions predate virtualization. > > So now virt came along and created a hard to solve circular dependency > problem: > > - If CLOCK_MONOTONIC stops for too long then NTP/PTP gets out of > sync, but everything else is happy. > > - If CLOCK_MONOTONIC jumps too far forward, then all hell breaks > lose, but NTP/PTP is happy. One must handle the T0: t = now(); -> pause -> resume -> magic "fixup" T1: dostuff(t); fact if one is going to use savevm/restorevm anyway, so... (it is kind of unfixable, unless you modify your application to accept notifications to redo any computation based on t, isn't it?). > IOW, you are up a creek without a paddle and you have to chose one evil. > > That's simply a design fail because there has been no design for this > from day one. But I'm not surprised at all by that simply because > virtualization followed the hardware design fails vs. time and > timekeeping which keep us entertained for the past 20 years on various > architectures but most prominently on X86 which is the uncontended > master of disaster in that regard. > > Of course virt follows the same approach of hardware by ignoring the > problem and coming up with more duct tape and the assumption that lack > of design can be "fixed in software". Just the timeframe is slightly > different: We're discussing this only for about 10 years now. > > Seriously? All you folks can come up with in 10 years is piling duct > tape on duct tape instead of sitting down and fixing the underlying root > cause once and forever? Been fixing bugs that are reported over 10+ years, yeah. 
Hopefully this thread is the "sitting down and fixing the underlying root cause" :-) > I'm aware that especially chrony has tried to deal with this nonsense > more gracefully, but that still does not make it great and it still gets > upset. > > The reason why suspend/resume works perfectly fine is that it's fully > coordinated and NTP state is cleared on resume which makes it easy for > the deamon to accomodate. This is what is in place now (which is executed on the destination): /* Now, if user has passed a time to set and the system time is set, we * just need to synchronize the hardware clock. However, if no time was * passed, user is requesting the opposite: set the system time from the * hardware clock (RTC). */ pid = fork(); if (pid == 0) { setsid(); reopen_fd_to_null(0); reopen_fd_to_null(1); reopen_fd_to_null(2); /* Use '/sbin/hwclock -w' to set RTC from the system time, * or '/sbin/hwclock -s' to set the system time from RTC. */ execle(hwclock_path, "hwclock", has_time ? "-w" : "-s", NULL, environ); _exit(EXIT_FAILURE); } else if (pid < 0) { error_setg_errno(errp, errno, "failed to create child process"); return; } > > So again and I'm telling this for a decade now: > > 1) Stop pretending that you can fix the lack of design with duct tape > engineering > > 2) Accept the fundamental properties of Linux time keeping as they are > not going to change as explained above > > 3) Either accept that CLOCK_REALTIME is off and jumping around which > confuses NTP/PTP or get your act together and design and implement a > proper synchronization mechanism for this: > > - Notify the guest about the intended "pause" or "savevm" event > > - Let the guest go into a lightweight freeze similar to S2IDLE > > - Save the VM for later resume or migrate the saved state > > - Watch everything working as expected on resume > > - Have the benefit that pause/resume and savevm/restorevm have > exactly the same behaviour OK! 
> That won't solve the problem for frankenkernels and !paravirt setups, > but that's unsolvable and you can keep the pieces by chosing one of two > evils. While I do not care at all, I still recommend to chose > CLOCK_MONOTONIC correctness for obvious reasons. > > The frankenkernel/legacy problem aside, I know you are going to tell me > that this is too much overhead and has VM customer visible impact. No, I think it boils down to someone implementing it. > It's > your choice, really: > > Either you chose correctness or you decide to ignore correctness for > whatever reason. > > There is no middle ground simply because you _cannot_ guarantee that > your migration time is within the acceptable limits of the > CLOCK_MONOTONIC or the CLOCK_REALTIME expectations. > > You can limit the damage somehow by making some arbitrary cutoff of > how much you forward CLOCK_MONOTONIC, but don't ask me about the right > value. > If you decide that correctness is overrated, then please document it > clearly instead of trying to pretend being correct. Based on the above, advancing only CLOCK_REALTIME (and not CLOCK_MONOTONIC) would be correct, right? And it's probably not very hard to do. > I'm curious whether the hardware people or the virt folks come to senses > first, but honestly I'm not expecting this to happen before I retire. > > Thanks, > > tglx Thanks very much for the detailed information! It's a good basis for the document you asked for.
On Fri, Oct 01, 2021 at 09:05:27AM -0300, Marcelo Tosatti wrote: > On Fri, Oct 01, 2021 at 01:02:23AM +0200, Thomas Gleixner wrote: > > Marcelo, > > > > On Thu, Sep 30 2021 at 16:21, Marcelo Tosatti wrote: > > > On Wed, Sep 29, 2021 at 03:56:29PM -0300, Marcelo Tosatti wrote: > > >> On Thu, Sep 16, 2021 at 06:15:35PM +0000, Oliver Upton wrote: > > >> > > >> Thomas, CC'ed, has deeper understanding of problems with > > >> forward time jumps than I do. Thomas, any comments? > > > > > > Based on the earlier discussion about the problems of synchronizing > > > the guests clock via a notification to the NTP/Chrony daemon > > > (where there is a window where applications can read the stale > > > value of the clock), a possible solution would be triggering > > > an NMI on the destination (so that it runs ASAP, with higher > > > priority than application/kernel). > > > > > > What would this NMI do, exactly? > > > > Nothing. You cannot do anything time related in an NMI. > > > > You might queue irq work which handles that, but that would still not > > prevent user space or kernel space from observing the stale time stamp > > depending on the execution state from where it resumes. > > Yes. > > > >> As a note: this makes it not OK to use KVM_CLOCK_REALTIME flag > > >> for either vm pause / vm resume (well, if paused for long periods of time) > > >> or savevm / restorevm. > > > > > > Maybe with the NMI above, it would be possible to use > > > the realtime clock as a way to know time elapsed between > > > events and advance guest clock without the current > > > problematic window. > > > > As much duct tape you throw at the problem, it cannot be solved ever > > because it's fundamentally broken. All you can do is to make the > > observation windows smaller, but that's just curing the symptom. > > Yes. 
> > > The problem is that the guest is paused/resumed without getting any > > information about that and the execution of the guest is stopped at an > > arbitrary instruction boundary from which it resumes after migration or > > restore. So there is no way to guarantee that after resume all vCPUs are > > executing in a state which can handle that. > > > > But even if that would be the case, then what prevents the stale time > > stamps to be visible? Nothing: > > > > T0: t = now(); > > -> pause > > -> resume > > -> magic "fixup" > > T1: dostuff(t); > > Yes. > > BTW, you could have a userspace notification (then applications > could handle this if desired). > > > But that's not a fundamental problem because every preemptible or > > interruptible code has the same issue: > > > > T0: t = now(); > > -> preemption or interrupt > > T1: dostuff(t); > > > > Which is usually not a problem, but It becomes a problem when T1 - T0 is > > greater than the usual expectations which can obviously be trivially > > achieved by guest migration or a savevm, restorevm cycle. > > > > But let's go a step back and look at the clocks and their expectations: > > > > CLOCK_MONOTONIC: > > > > Monotonically increasing clock which counts unless the system > > is in suspend. On resume it continues counting without jumping > > forward. > > > > That's the reference clock for everything else and therefore it > > is important that it does _not_ jump around. > > > > The reasons why CLOCK_MONOTONIC stops during suspend is > > historical and any attempt to change that breaks the world and > > some more because making it jump forward will trigger all sorts > > of timeouts, watchdogs and whatever. The last attempt to make > > CLOCK_MONOTONIC behave like CLOCK_BOOTTIME was reverted within 3 > > weeks. It's not going to be attempted again. See a3ed0e4393d6 > > ("Revert: Unify CLOCK_MONOTONIC and CLOCK_BOOTTIME") for > > details. 
> > > > Now the proposed change is creating exactly the same problem: > > > > >> > + if (data.flags & KVM_CLOCK_REALTIME) { > > >> > + u64 now_real_ns = ktime_get_real_ns(); > > >> > + > > >> > + /* > > >> > + * Avoid stepping the kvmclock backwards. > > >> > + */ > > >> > + if (now_real_ns > data.realtime) > > >> > + data.clock += now_real_ns - data.realtime; > > >> > + } > > > > IOW, it takes the time between pause and resume into account and > > forwards the underlying base clock which makes CLOCK_MONOTONIC > > jump forward by exactly that amount of time. > > Well, it is assuming that the > > T0: t = now(); > T1: pause vm() > T2: finish vm migration() > T3: dostuff(t); > > Interval between T1 and T2 is small (and that the guest > clocks are synchronized up to a given boundary). > > But i suppose adding a limit to the forward clock advance > (in the migration case) is useful: > > 1) If migration (well actually, only the final steps > to finish migration, the time between when guest is paused > on source and is resumed on destination) takes too long, > then too bad: fix it to be shorter if you want the clocks > to have close to zero change to realtime on migration. > > 2) Avoid the other bugs in case of large forward advance. > > Maybe having it configurable, with a say, 1 minute maximum by default > is a good choice? > > An alternative would be to advance only the guests REALTIME clock, from > data about how long steps T1-T2 took. > > > So depending on the size of the delta you are running into exactly the > > same problem as the final failed attempt to unify CLOCK_MONOTONIC and > > CLOCK_BOOTTIME which btw. would have been a magic cure for virt. > > > > Too bad, not going to happen ever unless you fix all affected user > > space and kernel side issues. 
> > > > > > CLOCK_BOOTTIME: > > > > CLOCK_MONOTONIC + time spent in suspend > > > > > > CLOCK_REALTIME/TAI: > > > > CLOCK_MONOTONIC + offset > > > > The offset is established by reading RTC at boot time and can be > > changed by clock_settime(2) and adjtimex(2). The latter is used > > by NTP/PTP. > > > > Any user of CLOCK_REALTIME has to be prepared for it to jump in > > both directions, but of course NTP/PTP daemons have expectations > > vs. such time jumps. > > > > They rightfully assume on a properly configured and administrated > > system that there are only two things which can make CLOCK_REALTIME > > jump: > > > > 1) NTP/PTP daemon controlled > > 2) Suspend/resume related updates by the kernel > > > > > > Just for the record, these assumptions predate virtualization. > > > > So now virt came along and created a hard to solve circular dependency > > problem: > > > > - If CLOCK_MONOTONIC stops for too long then NTP/PTP gets out of > > sync, but everything else is happy. > > > > - If CLOCK_MONOTONIC jumps too far forward, then all hell breaks > > lose, but NTP/PTP is happy. > > One must handle the > > T0: t = now(); > -> pause > -> resume > -> magic "fixup" > T1: dostuff(t); > > fact if one is going to use savevm/restorevm anyway, so... > (it is kind of unfixable, unless you modify your application > to accept notifications to redo any computation based on t, isnt it?). https://lore.kernel.org/lkml/1289503802-22444-2-git-send-email-virtuoso@slind.org/
On 01/10/21 01:02, Thomas Gleixner wrote: > > Now the proposed change is creating exactly the same problem: > > + if (data.flags & KVM_CLOCK_REALTIME) { > + u64 now_real_ns = ktime_get_real_ns(); > + > + /* > + * Avoid stepping the kvmclock backwards. > + */ > + if (now_real_ns > data.realtime) > + data.clock += now_real_ns - data.realtime; > + } Indeed, though it's opt-in (you can always not pass KVM_CLOCK_REALTIME and then the kernel will not muck with the value you gave it). > virt came along and created a hard to solve circular dependency > problem: > > - If CLOCK_MONOTONIC stops for too long then NTP/PTP gets out of > sync, but everything else is happy. > > - If CLOCK_MONOTONIC jumps too far forward, then all hell breaks > lose, but NTP/PTP is happy. Yes, I agree that this sums it up. For example QEMU (meaning: Marcelo :)) has gone for the former, "hoping" that NTP/PTP sorts it out sooner or later. The clock in nanoseconds is sent out to the destination and restored. Google's userspace instead went for the latter. The reason is that they've always started running on the destination before finishing the memory copy[1], therefore it's much easier to bound the CLOCK_MONOTONIC jump. I do very much like the cooperative S2IDLE or even S3 way to handle the brownout during live migration. However, if your stopping time is bounded, these patches are nice because, on current processors that have TSC scaling, they make it possible to keep the illusion of the TSC running. Of course that's a big "if"; however, you can always bound the stopping time by aborting the restart on the destination machine once you get close enough to the limit. Paolo [1] see https://dl.acm.org/doi/pdf/10.1145/3296975.3186415, figure 3
On 16/09/21 20:15, Oliver Upton wrote: > + if (data.flags & ~KVM_CLOCK_REALTIME) > return -EINVAL; Let's accept KVM_CLOCK_HOST_TSC here even though it's not used; there may be programs that expect to send back to KVM_SET_CLOCK whatever they got from KVM_GET_CLOCK.
On 01/10/21 16:39, Paolo Bonzini wrote: > On 16/09/21 20:15, Oliver Upton wrote: >> + if (data.flags & ~KVM_CLOCK_REALTIME) >> return -EINVAL; > > Let's accept KVM_CLOCK_HOST_TSC here even though it's not used; there > may be programs that expect to send back to KVM_SET_CLOCK whatever they > got from KVM_GET_CLOCK. Nevermind, KVM_SET_CLOCK is already rejecting KVM_CLOCK_TSC_STABLE so no need to do that! Paolo
On Fri, Oct 1, 2021 at 7:41 AM Paolo Bonzini <pbonzini@redhat.com> wrote: > > On 01/10/21 16:39, Paolo Bonzini wrote: > > On 16/09/21 20:15, Oliver Upton wrote: > >> + if (data.flags & ~KVM_CLOCK_REALTIME) > >> return -EINVAL; > > > > Let's accept KVM_CLOCK_HOST_TSC here even though it's not used; there > > may be programs that expect to send back to KVM_SET_CLOCK whatever they > > got from KVM_GET_CLOCK. > > Nevermind, KVM_SET_CLOCK is already rejecting KVM_CLOCK_TSC_STABLE so no > need to do that! Yeah, I don't know the story on the interface but it is really odd that userspace needs to blow away flags to successfully write the clock structure. -- Thanks, Oliver
On 01/10/21 17:39, Oliver Upton wrote: >> Nevermind, KVM_SET_CLOCK is already rejecting KVM_CLOCK_TSC_STABLE so no >> need to do that! > > Yeah, I don't know the story on the interface but it is really odd > that userspace needs to blow away flags to successfully write the > clock structure. Yeah, let's fix it now and accept all three flags. I would like that, even though it cannot be fixed in existing kernels. Paolo
Marcelo, On Fri, Oct 01 2021 at 09:05, Marcelo Tosatti wrote: > On Fri, Oct 01, 2021 at 01:02:23AM +0200, Thomas Gleixner wrote: >> But even if that would be the case, then what prevents the stale time >> stamps to be visible? Nothing: >> >> T0: t = now(); >> -> pause >> -> resume >> -> magic "fixup" >> T1: dostuff(t); > > Yes. > > BTW, you could have a userspace notification (then applications > could handle this if desired). Well, we have that via timerfd with TFD_TIMER_CANCEL_ON_SET for CLOCK_REALTIME. That's what applications which are sensitive to clock REALTIME jumps use today. >> Now the proposed change is creating exactly the same problem: >> >> >> > + if (data.flags & KVM_CLOCK_REALTIME) { >> >> > + u64 now_real_ns = ktime_get_real_ns(); >> >> > + >> >> > + /* >> >> > + * Avoid stepping the kvmclock backwards. >> >> > + */ >> >> > + if (now_real_ns > data.realtime) >> >> > + data.clock += now_real_ns - data.realtime; >> >> > + } >> >> IOW, it takes the time between pause and resume into account and >> forwards the underlying base clock which makes CLOCK_MONOTONIC >> jump forward by exactly that amount of time. > > Well, it is assuming that the > > T0: t = now(); > T1: pause vm() > T2: finish vm migration() > T3: dostuff(t); > > Interval between T1 and T2 is small (and that the guest > clocks are synchronized up to a given boundary). Yes, I understand that, but it's an assumption and there is no boundary for the time jump in the proposed patches, which rings my alarm bells :) > But i suppose adding a limit to the forward clock advance > (in the migration case) is useful: > > 1) If migration (well actually, only the final steps > to finish migration, the time between when guest is paused > on source and is resumed on destination) takes too long, > then too bad: fix it to be shorter if you want the clocks > to have close to zero change to realtime on migration. > > 2) Avoid the other bugs in case of large forward advance. 
> > Maybe having it configurable, with a say, 1 minute maximum by default > is a good choice? Don't know what 1 minute does in terms of applications etc. You have to do some experiments on that. > An alternative would be to advance only the guests REALTIME clock, from > data about how long steps T1-T2 took. Yes, that's what would happen in the cooperative S2IDLE or S3 case when the guest resumes. >> So now virt came along and created a hard to solve circular dependency >> problem: >> >> - If CLOCK_MONOTONIC stops for too long then NTP/PTP gets out of >> sync, but everything else is happy. >> >> - If CLOCK_MONOTONIC jumps too far forward, then all hell breaks >> lose, but NTP/PTP is happy. > > One must handle the > > T0: t = now(); > -> pause > -> resume > -> magic "fixup" > T1: dostuff(t); > > fact if one is going to use savevm/restorevm anyway, so... > (it is kind of unfixable, unless you modify your application > to accept notifications to redo any computation based on t, isnt it?). Well yes, but what applications can deal with is CLOCK_REALTIME jumping because that's a property of it. Not so much the CLOCK_MONOTONIC part. >> If you decide that correctness is overrated, then please document it >> clearly instead of trying to pretend being correct. > > Based on the above, advancing only CLOCK_REALTIME (and not CLOCK_MONOTONIC) > would be correct, right? And its probably not very hard to do. Time _is_ hard to get right. So you might experiment with something like this as a stop gap: Provide the guest something like this: u64 migration_seq; u64 realtime_delta_ns; in the shared clock page Do not forward jump clock MONOTONIC. 
On resume kick an IPI where the guest handler does: if (clock_data->migration_seq == migration_seq) return; migration_seq = clock_data->migration_seq; ts64 = { 0, 0 }; timespec64_add_ns(&ts64, clock_data->realtime_delta_ns); timekeeping_inject_sleeptime64(&ts64); Make sure that the IPI completes before you migrate the guest another time or implement it slightly smarter, but you get the idea :) That's what we use for suspend time injection, but it should just work without frozen tasks as well. It will forward clock REALTIME by the amount of time spent during migration. It'll also modify the BOOTTIME offset by the same amount, but that's not a tragedy. The function will also reset NTP state so the NTP/PTP daemon knows that there was a kernel initiated time jump and it can work out easily what to do like it does on resume from an actual suspend. It will also invoke clock_was_set() which makes all the other time related updates trigger and wakeup tasks which have a timerfd with TFD_TIMER_CANCEL_ON_SET armed. This will obviously not work when the guest is in S2IDLE or S3, but for initial experimentation you can ignore that and just avoid to do that in the guest. :) That still is worse than a cooperative S2IDLE/S3, but it's way more sensible than the other two evils you have today. > Thanks very much for the detailed information! Its a good basis > for the document you ask. I volunteer to review that documentation once it materializes :) Thanks, tglx
Hi Thomas, On Fri, Oct 1, 2021 at 12:59 PM Thomas Gleixner <tglx@linutronix.de> wrote: > > Marcelo, > > On Fri, Oct 01 2021 at 09:05, Marcelo Tosatti wrote: > > On Fri, Oct 01, 2021 at 01:02:23AM +0200, Thomas Gleixner wrote: > >> But even if that would be the case, then what prevents the stale time > >> stamps to be visible? Nothing: > >> > >> T0: t = now(); > >> -> pause > >> -> resume > >> -> magic "fixup" > >> T1: dostuff(t); > > > > Yes. > > > > BTW, you could have a userspace notification (then applications > > could handle this if desired). > > Well, we have that via timerfd with TFD_TIMER_CANCEL_ON_SET for > CLOCK_REALTIME. That's what applications which are sensitive to clock > REALTIME jumps use today. > > >> Now the proposed change is creating exactly the same problem: > >> > >> >> > + if (data.flags & KVM_CLOCK_REALTIME) { > >> >> > + u64 now_real_ns = ktime_get_real_ns(); > >> >> > + > >> >> > + /* > >> >> > + * Avoid stepping the kvmclock backwards. > >> >> > + */ > >> >> > + if (now_real_ns > data.realtime) > >> >> > + data.clock += now_real_ns - data.realtime; > >> >> > + } > >> > >> IOW, it takes the time between pause and resume into account and > >> forwards the underlying base clock which makes CLOCK_MONOTONIC > >> jump forward by exactly that amount of time. > > > > Well, it is assuming that the > > > > T0: t = now(); > > T1: pause vm() > > T2: finish vm migration() > > T3: dostuff(t); > > > > Interval between T1 and T2 is small (and that the guest > > clocks are synchronized up to a given boundary). 
> > Yes, I understand that, but it's an assumption and there is no boundary > for the time jump in the proposed patches, which rings my alarm bells :) > > > But i suppose adding a limit to the forward clock advance > > (in the migration case) is useful: > > > > 1) If migration (well actually, only the final steps > > to finish migration, the time between when guest is paused > > on source and is resumed on destination) takes too long, > > then too bad: fix it to be shorter if you want the clocks > > to have close to zero change to realtime on migration. > > > > 2) Avoid the other bugs in case of large forward advance. > > > > Maybe having it configurable, with a say, 1 minute maximum by default > > is a good choice? > > Don't know what 1 minute does in terms of applications etc. You have to > do some experiments on that. I debated quite a bit on what the absolute limit should be for advancing the KVM clock, and settled on doing no checks in the kernel besides the monotonicity invariant. At the end of the day, userspace can ignore all of the rules that KVM will try to enforce on the kvm clock/TSC and jump it as it sees fit (both are already directly writable). But I agree that there has to be some reasoning around what is acceptable. We have an absolute limit on how far forward we will yank the KVM clock and TSC in our userspace, but of course it has a TOCTOU problem for whatever madness can come in between userspace and the time the kernel actually services the ioctl. -- Thanks, Oliver > > An alternative would be to advance only the guests REALTIME clock, from > > data about how long steps T1-T2 took. > > Yes, that's what would happen in the cooperative S2IDLE or S3 case when > the guest resumes. > > >> So now virt came along and created a hard to solve circular dependency > >> problem: > >> > >> - If CLOCK_MONOTONIC stops for too long then NTP/PTP gets out of > >> sync, but everything else is happy. 
> >> > >> - If CLOCK_MONOTONIC jumps too far forward, then all hell breaks > >> lose, but NTP/PTP is happy. > > > > One must handle the > > > > T0: t = now(); > > -> pause > > -> resume > > -> magic "fixup" > > T1: dostuff(t); > > > > fact if one is going to use savevm/restorevm anyway, so... > > (it is kind of unfixable, unless you modify your application > > to accept notifications to redo any computation based on t, isnt it?). > > Well yes, but what applications can deal with is CLOCK_REALTIME jumping > because that's a property of it. Not so much the CLOCK_MONOTONIC part. > > >> If you decide that correctness is overrated, then please document it > >> clearly instead of trying to pretend being correct. > > > > Based on the above, advancing only CLOCK_REALTIME (and not CLOCK_MONOTONIC) > > would be correct, right? And its probably not very hard to do. > > Time _is_ hard to get right. > > So you might experiment with something like this as a stop gap: > > Provide the guest something like this: > > u64 migration_seq; > u64 realtime_delta_ns; > > in the shared clock page > > Do not forward jump clock MONOTONIC. > > On resume kick an IPI where the guest handler does: > > if (clock_data->migration_seq == migration_seq) > return; > > migration_seq = clock_data->migration_seq; > > ts64 = { 0, 0 }; > timespec64_add_ns(&ts64, clock_data->realtime_delta_ns); > timekeeping_inject_sleeptime64(&ts64); > > Make sure that the IPI completes before you migrate the guest another > time or implement it slightly smarter, but you get the idea :) > > That's what we use for suspend time injection, but it should just work > without frozen tasks as well. It will forward clock REALTIME by the > amount of time spent during migration. It'll also modify the BOOTTIME > offset by the same amount, but that's not a tragedy. 
> > The function will also reset NTP state so the NTP/PTP daemon knows that > there was a kernel initiated time jump and it can work out easily what > to do like it does on resume from an actual suspend. It will also > invoke clock_was_set() which makes all the other time related updates > trigger and wakeup tasks which have a timerfd with > TFD_TIMER_CANCEL_ON_SET armed. > > This will obviously not work when the guest is in S2IDLE or S3, but for > initial experimentation you can ignore that and just avoid to do that in > the guest. :) > > That still is worse than a cooperative S2IDLE/S3, but it's way more > sensible than the other two evils you have today. > > > Thanks very much for the detailed information! Its a good basis > > for the document you ask. > > I volunteer to review that documentation once it materializes :) > > Thanks, > > tglx
On Thu, 2021-09-16 at 18:15 +0000, Oliver Upton wrote: > > @@ -5878,11 +5888,21 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp) > * is slightly ahead) here we risk going negative on unsigned > * 'system_time' when 'data.clock' is very small. > */ > - if (kvm->arch.use_master_clock) > - now_ns = ka->master_kernel_ns; > + if (data.flags & KVM_CLOCK_REALTIME) { > + u64 now_real_ns = ktime_get_real_ns(); > + > + /* > + * Avoid stepping the kvmclock backwards. > + */ > + if (now_real_ns > data.realtime) > + data.clock += now_real_ns - data.realtime; > + } > + > + if (ka->use_master_clock) > + now_raw_ns = ka->master_kernel_ns; This looks wrong to me. > else > - now_ns = get_kvmclock_base_ns(); > - ka->kvmclock_offset = data.clock - now_ns; > + now_raw_ns = get_kvmclock_base_ns(); > + ka->kvmclock_offset = data.clock - now_raw_ns; > kvm_end_pvclock_update(kvm); > return 0; > } We use the host CLOCK_MONOTONIC_RAW plus the boot offset, as a 'kvmclock base clock', and get_kvmclock_base_ns() returns that. The KVM clocks for each VMs are based on this 'kvmclock base clock', each offset by a ka->kvmclock_offset which represents the time at which that VM was started — so each VM's clock starts from zero. The values of ka->master_kernel_ns and ka->master_cycle_now represent a single point in time, the former being the value of get_kvmclock_base_ns() at that moment and the latter being the host TSC value. In pvclock_update_vm_gtod_copy(), kvm_get_time_and_clockread() is used to return both values at precisely the same moment, from the *same* rdtsc(). This allows the current 'kvmclock base clock' to be calculated at any moment by reading the TSC, calculating a delta to that reading from ka->master_cycle_now to determine how much time has elapsed since ka->master_kernel_ns. We can then add ka->kvmclock_offset to get the kvmclock for this particular VM. Now, looking at the code quoted above. 
It's given a kvm_clock_data struct which contains a value of the KVM clock which is to be set as the time "now", and all it does is adjust ka->kvmclock_offset accordingly. Which is really simple: now_raw_ns = get_kvmclock_base_ns(); ka->kvmclock_offset = data.clock - now_raw_ns; Et voilà, now get_kvmclock_base_ns() + ka->kvmclock_offset at any given moment in time will result in a kvmclock value according to what was just set. Yay! Except... in the case where the TSC is constant, we actually set 'now_raw_ns' to a value that doesn't represent *now*. Instead, we set it to ka->master_kernel_ns which represents some point in the *past*. We should add the number of TSC ticks since ka->master_cycle_now if we're going to use that, surely?
/cast <Raise Skeleton> On Wed, Jan 17, 2024, David Woodhouse wrote: > On Thu, 2021-09-16 at 18:15 +0000, Oliver Upton wrote: > > > > @@ -5878,11 +5888,21 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp) > > * is slightly ahead) here we risk going negative on unsigned > > * 'system_time' when 'data.clock' is very small. > > */ > > - if (kvm->arch.use_master_clock) > > - now_ns = ka->master_kernel_ns; > > + if (data.flags & KVM_CLOCK_REALTIME) { > > + u64 now_real_ns = ktime_get_real_ns(); > > + > > + /* > > + * Avoid stepping the kvmclock backwards. > > + */ > > + if (now_real_ns > data.realtime) > > + data.clock += now_real_ns - data.realtime; > > + } > > + > > + if (ka->use_master_clock) > > + now_raw_ns = ka->master_kernel_ns; > > This looks wrong to me. > > > else > > - now_ns = get_kvmclock_base_ns(); > > - ka->kvmclock_offset = data.clock - now_ns; > > + now_raw_ns = get_kvmclock_base_ns(); > > + ka->kvmclock_offset = data.clock - now_raw_ns; > > kvm_end_pvclock_update(kvm); > > return 0; > > } > > We use the host CLOCK_MONOTONIC_RAW plus the boot offset, as a > 'kvmclock base clock', and get_kvmclock_base_ns() returns that. The KVM > clocks for each VMs are based on this 'kvmclock base clock', each > offset by a ka->kvmclock_offset which represents the time at which that > VM was started — so each VM's clock starts from zero. > > The values of ka->master_kernel_ns and ka->master_cycle_now represent a > single point in time, the former being the value of > get_kvmclock_base_ns() at that moment and the latter being the host TSC > value. In pvclock_update_vm_gtod_copy(), kvm_get_time_and_clockread() > is used to return both values at precisely the same moment, from the > *same* rdtsc(). > > This allows the current 'kvmclock base clock' to be calculated at any > moment by reading the TSC, calculating a delta to that reading from > ka->master_cycle_now to determine how much time has elapsed since > ka->master_kernel_ns. 
We can then add ka->kvmclock_offset to get the > kvmclock for this particular VM. > > Now, looking at the code quoted above. It's given a kvm_clock_data > struct which contains a value of the KVM clock which is to be set as > the time "now", and all it does is adjust ka->kvmclock_offset > accordingly. Which is really simple: > > now_raw_ns = get_kvmclock_base_ns(); > ka->kvmclock_offset = data.clock - now_raw_ns; > > Et voilà, now get_kvmclock_base_ns() + ka->kvmclock_offset at any given > moment in time will result in a kvmclock value according to what was > just set. Yay! > > Except... in the case where the TSC is constant, we actually set > 'now_raw_ns' to a value that doesn't represent *now*. Instead, we set > it to ka->master_kernel_ns which represents some point in the *past*. > We should add the number of TSC ticks since ka->master_cycle_now if > we're going to use that, surely? Somewhat ironically, without the KVM_CLOCK_REALTIME goo, there's no need to re-read TSC, because the rdtsc() in pvclock_update_vm_gtod_copy() *just* happened. But the call to ktime_get_real_ns() could theoretically spin for a non-trivial amount of time if the clock is being refreshed.
On Wed, 2024-07-24 at 15:24 -0700, Sean Christopherson wrote:
> /cast <Raise Skeleton>
>
> On Wed, Jan 17, 2024, David Woodhouse wrote:
> > On Thu, 2021-09-16 at 18:15 +0000, Oliver Upton wrote:
> > >
> > > @@ -5878,11 +5888,21 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp)
> > >  	 * is slightly ahead) here we risk going negative on unsigned
> > >  	 * 'system_time' when 'data.clock' is very small.
> > >  	 */
> > > -	if (kvm->arch.use_master_clock)
> > > -		now_ns = ka->master_kernel_ns;
> > > +	if (data.flags & KVM_CLOCK_REALTIME) {
> > > +		u64 now_real_ns = ktime_get_real_ns();
> > > +
> > > +		/*
> > > +		 * Avoid stepping the kvmclock backwards.
> > > +		 */
> > > +		if (now_real_ns > data.realtime)
> > > +			data.clock += now_real_ns - data.realtime;
> > > +	}
> > > +
> > > +	if (ka->use_master_clock)
> > > +		now_raw_ns = ka->master_kernel_ns;
> >
> > This looks wrong to me.
> >
> > >  	else
> > > -		now_ns = get_kvmclock_base_ns();
> > > -	ka->kvmclock_offset = data.clock - now_ns;
> > > +		now_raw_ns = get_kvmclock_base_ns();
> > > +	ka->kvmclock_offset = data.clock - now_raw_ns;
> > >  	kvm_end_pvclock_update(kvm);
> > >  	return 0;
> > >  }
> >
> > We use the host CLOCK_MONOTONIC_RAW plus the boot offset as a
> > 'kvmclock base clock', and get_kvmclock_base_ns() returns that. The KVM
> > clock for each VM is based on this 'kvmclock base clock', each
> > offset by a ka->kvmclock_offset which represents the time at which that
> > VM was started — so each VM's clock starts from zero.
> >
> > The values of ka->master_kernel_ns and ka->master_cycle_now represent a
> > single point in time, the former being the value of
> > get_kvmclock_base_ns() at that moment and the latter being the host TSC
> > value. In pvclock_update_vm_gtod_copy(), kvm_get_time_and_clockread()
> > is used to return both values at precisely the same moment, from the
> > *same* rdtsc().
> >
> > This allows the current 'kvmclock base clock' to be calculated at any
> > moment by reading the TSC and calculating the delta from
> > ka->master_cycle_now to that reading, to determine how much time has
> > elapsed since ka->master_kernel_ns. We can then add ka->kvmclock_offset
> > to get the kvmclock for this particular VM.
> >
> > Now, looking at the code quoted above: it's given a kvm_clock_data
> > struct which contains a value of the KVM clock which is to be set as
> > the time "now", and all it does is adjust ka->kvmclock_offset
> > accordingly. Which is really simple:
> >
> >     now_raw_ns = get_kvmclock_base_ns();
> >     ka->kvmclock_offset = data.clock - now_raw_ns;
> >
> > Et voilà, now get_kvmclock_base_ns() + ka->kvmclock_offset at any given
> > moment in time will result in a kvmclock value according to what was
> > just set. Yay!
> >
> > Except... in the case where the TSC is constant, we actually set
> > 'now_raw_ns' to a value that doesn't represent *now*. Instead, we set
> > it to ka->master_kernel_ns, which represents some point in the *past*.
> > We should add the number of TSC ticks since ka->master_cycle_now if
> > we're going to use that, surely?
>
> Somewhat ironically, without the KVM_CLOCK_REALTIME goo there's no need to
> re-read the TSC, because the rdtsc() in pvclock_update_vm_gtod_copy() *just*
> happened. But the call to ktime_get_real_ns() could theoretically spin for a
> non-trivial amount of time if the clock is being refreshed.

Aha, thank you. That makes sense.

I note that in similar code in
https://lore.kernel.org/lkml/20240522001817.619072-4-dwmw2@infradead.org/
Jack explicitly added a comment to precisely that effect:

+	 * The call to pvclock_update_vm_gtod_copy() has created a new time
+	 * reference point in ka->master_cycle_now and ka->master_kernel_ns.

Code comments are wonderful things. Someone must have trained him well :)

(I'll be casting Raise Skeleton on that series too as soon as I have
dispatched the existing horde, FWIW.)

So the remaining problem is that we do re-read the TSC for the
KVM_CLOCK_REALTIME case, as you pointed out, when we should be
calculating the relationship between real time and the KVM clock at
precisely the *same* moment.

But I don't care about that, because this whole API is suboptimal
anyway; we should just set the KVM clock in terms of the guest TSC,
which is what Jack's patch allows.
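The non-backwards-stepping adjustment being discussed is small enough to model in isolation. The helper name below is invented for illustration, but the logic mirrors the KVM_CLOCK_REALTIME branch of the patch: advance the supplied kvmclock value by however much real time has elapsed since the (clock, realtime) pair was captured, and leave it untouched if the host's real time clock appears to have gone backwards.

```c
#include <stdint.h>

/*
 * Model of the KVM_CLOCK_REALTIME handling in kvm_vm_ioctl_set_clock():
 * 'clock' and 'captured_realtime_ns' were sampled together by userspace
 * (e.g. via KVM_GET_CLOCK on the migration source); 'now_real_ns' is
 * the destination host's real time clock at KVM_SET_CLOCK time.
 */
static uint64_t adjust_clock_for_realtime(uint64_t clock,
					  uint64_t captured_realtime_ns,
					  uint64_t now_real_ns)
{
	/* Avoid stepping the kvmclock backwards. */
	if (now_real_ns > captured_realtime_ns)
		clock += now_real_ns - captured_realtime_ns;

	return clock;
}
```

David's complaint is about the *timing* of the inputs, not this arithmetic: now_real_ns comes from a ktime_get_real_ns() call made at a different instant than the master-clock pair used later in the function.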
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index a6729c8cf063..d0b9c986cf6c 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -993,20 +993,34 @@ such as migration.
 When KVM_CAP_ADJUST_CLOCK is passed to KVM_CHECK_EXTENSION, it returns the
 set of bits that KVM can return in struct kvm_clock_data's flag member.
 
-The only flag defined now is KVM_CLOCK_TSC_STABLE. If set, the returned
-value is the exact kvmclock value seen by all VCPUs at the instant
-when KVM_GET_CLOCK was called. If clear, the returned value is simply
-CLOCK_MONOTONIC plus a constant offset; the offset can be modified
-with KVM_SET_CLOCK. KVM will try to make all VCPUs follow this clock,
-but the exact value read by each VCPU could differ, because the host
-TSC is not stable.
+FLAGS:
+
+KVM_CLOCK_TSC_STABLE. If set, the returned value is the exact kvmclock
+value seen by all VCPUs at the instant when KVM_GET_CLOCK was called.
+If clear, the returned value is simply CLOCK_MONOTONIC plus a constant
+offset; the offset can be modified with KVM_SET_CLOCK. KVM will try
+to make all VCPUs follow this clock, but the exact value read by each
+VCPU could differ, because the host TSC is not stable.
+
+KVM_CLOCK_REALTIME. If set, the `realtime` field in the kvm_clock_data
+structure is populated with the value of the host's real time
+clocksource at the instant when KVM_GET_CLOCK was called. If clear,
+the `realtime` field does not contain a value.
+
+KVM_CLOCK_HOST_TSC. If set, the `host_tsc` field in the kvm_clock_data
+structure is populated with the value of the host's timestamp counter (TSC)
+at the instant when KVM_GET_CLOCK was called. If clear, the `host_tsc` field
+does not contain a value.
 
 ::
 
   struct kvm_clock_data {
 	__u64 clock;  /* kvmclock current value */
 	__u32 flags;
-	__u32 pad[9];
+	__u32 pad0;
+	__u64 realtime;
+	__u64 host_tsc;
+	__u32 pad[4];
   };
 
@@ -1023,12 +1037,22 @@ Sets the current timestamp of kvmclock to the value specified in its parameter.
 In conjunction with KVM_GET_CLOCK, it is used to ensure monotonicity on scenarios
 such as migration.
 
+FLAGS:
+
+KVM_CLOCK_REALTIME. If set, KVM will compare the value of the `realtime` field
+with the value of the host's real time clocksource at the instant when
+KVM_SET_CLOCK was called. The difference in elapsed time is added to the final
+kvmclock value that will be provided to guests.
+
 ::
 
   struct kvm_clock_data {
 	__u64 clock;  /* kvmclock current value */
 	__u32 flags;
-	__u32 pad[9];
+	__u32 pad0;
+	__u64 realtime;
+	__u64 host_tsc;
+	__u32 pad[4];
   };
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index be6805fc0260..9c34b5b63e39 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1936,4 +1936,7 @@ int kvm_cpu_dirty_log_size(void);
 
 int alloc_all_memslots_rmaps(struct kvm *kvm);
 
+#define KVM_CLOCK_VALID_FLAGS						\
+	(KVM_CLOCK_TSC_STABLE | KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC)
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 523c4e5c109f..cb5d5cad5124 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2815,10 +2815,20 @@ static void get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
 	get_cpu();
 
 	if (__this_cpu_read(cpu_tsc_khz)) {
+#ifdef CONFIG_X86_64
+		struct timespec64 ts;
+
+		if (kvm_get_walltime_and_clockread(&ts, &data->host_tsc)) {
+			data->realtime = ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec;
+			data->flags |= KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC;
+		} else
+#endif
+			data->host_tsc = rdtsc();
+
 		kvm_get_time_scale(NSEC_PER_SEC, __this_cpu_read(cpu_tsc_khz) * 1000LL,
 				   &hv_clock.tsc_shift,
 				   &hv_clock.tsc_to_system_mul);
-		data->clock = __pvclock_read_cycles(&hv_clock, rdtsc());
+		data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);
 	} else {
 		data->clock = get_kvmclock_base_ns() + ka->kvmclock_offset;
 	}
@@ -4062,7 +4072,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		r = KVM_SYNC_X86_VALID_FIELDS;
 		break;
 	case KVM_CAP_ADJUST_CLOCK:
-		r = KVM_CLOCK_TSC_STABLE;
+		r = KVM_CLOCK_VALID_FLAGS;
 		break;
 	case KVM_CAP_X86_DISABLE_EXITS:
 		r |= KVM_X86_DISABLE_EXITS_HLT | KVM_X86_DISABLE_EXITS_PAUSE |
@@ -5859,12 +5869,12 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_arch *ka = &kvm->arch;
 	struct kvm_clock_data data;
-	u64 now_ns;
+	u64 now_raw_ns;
 
 	if (copy_from_user(&data, argp, sizeof(data)))
 		return -EFAULT;
 
-	if (data.flags)
+	if (data.flags & ~KVM_CLOCK_REALTIME)
 		return -EINVAL;
 
 	kvm_hv_invalidate_tsc_page(kvm);
@@ -5878,11 +5888,21 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp)
 	 * is slightly ahead) here we risk going negative on unsigned
 	 * 'system_time' when 'data.clock' is very small.
 	 */
-	if (kvm->arch.use_master_clock)
-		now_ns = ka->master_kernel_ns;
+	if (data.flags & KVM_CLOCK_REALTIME) {
+		u64 now_real_ns = ktime_get_real_ns();
+
+		/*
+		 * Avoid stepping the kvmclock backwards.
+		 */
+		if (now_real_ns > data.realtime)
+			data.clock += now_real_ns - data.realtime;
+	}
+
+	if (ka->use_master_clock)
+		now_raw_ns = ka->master_kernel_ns;
 	else
-		now_ns = get_kvmclock_base_ns();
-	ka->kvmclock_offset = data.clock - now_ns;
+		now_raw_ns = get_kvmclock_base_ns();
+	ka->kvmclock_offset = data.clock - now_raw_ns;
 	kvm_end_pvclock_update(kvm);
 	return 0;
 }
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index a067410ebea5..d228bf394465 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1223,11 +1223,16 @@ struct kvm_irqfd {
 
 /* Do not use 1, KVM_CHECK_EXTENSION returned it before we had flags. */
 #define KVM_CLOCK_TSC_STABLE	2
+#define KVM_CLOCK_REALTIME	(1 << 2)
+#define KVM_CLOCK_HOST_TSC	(1 << 3)
 
 struct kvm_clock_data {
 	__u64 clock;
 	__u32 flags;
-	__u32 pad[9];
+	__u32 pad0;
+	__u64 realtime;
+	__u64 host_tsc;
+	__u32 pad[4];
 };
 
 /* For KVM_CAP_SW_TLB */
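One point worth checking about the uapi change above: replacing `__u32 pad[9]` with `pad0` plus two `__u64` fields and `pad[4]` keeps struct kvm_clock_data at the same 48-byte size, so the ioctl ABI size is unchanged, and `pad0` keeps `realtime` naturally aligned on an 8-byte boundary. A standalone sketch of that layout check, using stdint types in place of the kernel's `__u64`/`__u32`:

```c
#include <stdint.h>
#include <stddef.h>

/* Old layout: __u64 clock; __u32 flags; __u32 pad[9]; -> 48 bytes. */
struct kvm_clock_data_old {
	uint64_t clock;
	uint32_t flags;
	uint32_t pad[9];
};

/*
 * New layout from the patch: pad0 fills the 4-byte hole after flags so
 * realtime lands at offset 16, and pad[4] preserves the overall size.
 */
struct kvm_clock_data_new {
	uint64_t clock;
	uint32_t flags;
	uint32_t pad0;
	uint64_t realtime;
	uint64_t host_tsc;
	uint32_t pad[4];
};
```

Without `pad0`, the compiler would insert the same 4 bytes of padding implicitly; spelling it out keeps the padding visible and usable as a future field.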
Handling the migration of TSCs correctly is difficult, in part because
Linux does not provide userspace with the ability to retrieve a (TSC,
realtime) clock pair for a single instant in time. In lieu of a more
convenient facility, KVM can report similar information in the kvm_clock
structure.

Provide userspace with a host TSC & realtime pair iff the realtime clock
is based on the TSC. If userspace provides KVM_SET_CLOCK with a valid
realtime value, advance the KVM clock by the amount of elapsed time. Do
not step the KVM clock backwards, though, as it is a monotonic
oscillator.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Oliver Upton <oupton@google.com>
---
 Documentation/virt/kvm/api.rst  | 42 ++++++++++++++++++++++++++-------
 arch/x86/include/asm/kvm_host.h |  3 +++
 arch/x86/kvm/x86.c              | 36 +++++++++++++++++++++-------
 include/uapi/linux/kvm.h        |  7 +++++-
 4 files changed, 70 insertions(+), 18 deletions(-)