Message ID | 20240427111929.9600-2-dwmw2@infradead.org (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v2,01/15] KVM: x86/xen: Do not corrupt KVM clock in kvm_xen_shared_info_init() | expand |
On Sat, Apr 27, 2024 at 12:04:58PM +0100, David Woodhouse wrote: > From: David Woodhouse <dwmw@amazon.co.uk> > > The KVM clock is an interesting thing. It is defined as "nanoseconds > since the guest was created", but in practice it runs at two *different* > rates — or three different rates, if you count implementation bugs. > > Definition A is that it runs synchronously with the CLOCK_MONOTONIC_RAW > of the host, with a delta of kvm->arch.kvmclock_offset. > > But that version doesn't actually get used in the common case, where the > host has a reliable TSC and the guest TSCs are all running at the same > rate and in sync with each other, and kvm->arch.use_master_clock is set. > > In that common case, definition B is used: There is a reference point in > time at kvm->arch.master_kernel_ns (again a CLOCK_MONOTONIC_RAW time), > and a corresponding host TSC value kvm->arch.master_cycle_now. This > fixed point in time is converted to guest units (the time offset by > kvmclock_offset and the TSC Value scaled and offset to be a guest TSC > value) and advertised to the guest in the pvclock structure. While in > this 'use_master_clock' mode, the fixed point in time never needs to be > changed, and the clock runs precisely in time with the guest TSC, at the > rate advertised in the pvclock structure. > > The third definition C is implemented in kvm_get_wall_clock_epoch() and > __get_kvmclock(), using the master_cycle_now and master_kernel_ns fields > but converting the *host* TSC cycles directly to a value in nanoseconds > instead of scaling via the guest TSC. > > One might naïvely think that all three definitions are identical, since > CLOCK_MONOTONIC_RAW is not skewed by NTP frequency corrections; all > three are just the result of counting the host TSC at a known frequency, > or the scaled guest TSC at a known precise fraction of the host's > frequency. The problem is with arithmetic precision, and the way that > frequency scaling is done in a division-free way by multiplying by a > scale factor, then shifting right. In practice, all three ways of > calculating the KVM clock will suffer a systemic drift from each other. > > Eventually, definition C should just be eliminated. Commit 451a707813ae > ("KVM: x86/xen: improve accuracy of Xen timers") worked around it for > the specific case of Xen timers, which are defined in terms of the KVM > clock and suffered from a continually increasing error in timer expiry > times. That commit notes that get_kvmclock_ns() is non-trivial to fix > and says "I'll come back to that", which remains true. > > Definitions A and B do need to coexist, the former to handle the case > where the host or guest TSC is suboptimally configured. But KVM should > be more careful about switching between them, and the discontinuity in > guest time which could result. > > In particular, KVM_REQ_MASTERCLOCK_UPDATE will take a new snapshot of > time as the reference in master_kernel_ns and master_cycle_now, yanking > the guest's clock back to match definition A at that moment. KVM_REQ_MASTERCLOCK_UPDATE stops the vcpus because: * To avoid that problem, do not allow visibility of distinct * system_timestamp/tsc_timestamp values simultaneously: use a master * copy of host monotonic time values. Update that master copy * in lockstep. > When invoked from in 'use_master_clock' mode, kvm_update_masterclock() > should probably *adjust* kvm->arch.kvmclock_offset to account for the > drift, instead of yanking the clock back to defintion A. You are likely correct... > But in the meantime there are a bunch of places where it just doesn't need to be > invoked at all. > > To start with: there is no need to do such an update when a Xen guest > populates the shared_info page. This seems to have been a hangover from > the very first implementation of shared_info which automatically > populated the vcpu_info structures at their default locations, but even > then it should just have raised KVM_REQ_CLOCK_UPDATE on each vCPU > instead of using KVM_REQ_MASTERCLOCK_UPDATE. And now that userspace is > expected to explicitly set the vcpu_info even in its default locations, > there's not even any need for that either. > > Fixes: 629b5348841a1 ("KVM: x86/xen: update wallclock region") > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> > Reviewed-by: Paul Durrant <paul@xen.org> > --- > arch/x86/kvm/xen.c | 2 -- > 1 file changed, 2 deletions(-) > > diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c > index f65b35a05d91..5a83a8154b79 100644 > --- a/arch/x86/kvm/xen.c > +++ b/arch/x86/kvm/xen.c > @@ -98,8 +98,6 @@ static int kvm_xen_shared_info_init(struct kvm *kvm) > wc->version = wc_version + 1; > read_unlock_irq(&gpc->lock); > > - kvm_make_all_cpus_request(kvm, KVM_REQ_MASTERCLOCK_UPDATE); > - > out: > srcu_read_unlock(&kvm->srcu, idx); > return ret; > -- > 2.44.0 So KVM_REQ_MASTERCLOCK_UPDATE is to avoid the race above. In what contexes is kvm_xen_shared_info_init called from again? Not clear to me KVM_REQ_MASTERCLOCK_UPDATE is not needed (or that is needed, for that matter...).
On Sat, 2024-05-04 at 04:42 -0300, Marcelo Tosatti wrote: > On Sat, Apr 27, 2024 at 12:04:58PM +0100, David Woodhouse wrote: > > > > In particular, KVM_REQ_MASTERCLOCK_UPDATE will take a new snapshot of > > time as the reference in master_kernel_ns and master_cycle_now, yanking > > the guest's clock back to match definition A at that moment. > > KVM_REQ_MASTERCLOCK_UPDATE stops the vcpus because: Took me a while to work that one out, btw. The fact that the KVM_REQ_MCLOCK_INPROGRESS request is asserted but never actually *handled*, so all it does is repeatedly kick the vCPU out and make it spin until the request is cleared is... interesting. Likewise the way that we set KVM_REQ_MASTERCLOCK_UPDATE on *all* vCPUs, so they *all* call kvm_update_masterclock(), when only one of them needed to. I may clean that up a little. > * To avoid that problem, do not allow visibility of distinct > * system_timestamp/tsc_timestamp values simultaneously: use a master > * copy of host monotonic time values. Update that master copy > * in lockstep. Right. That comment is a lot longer than the part you cited here, and starts with 'assuming a stable TSC across pCPUS, and a stable TSC across vCPUs'. It's the "if (ka->use_master_clock)" case. And yes, what it's basically saying is a special case of the fact that if you let the KVM clock run at its "natural" rate based on the guest TSC (definition B), but each vCPU runs at that rate from a *different* point on the line that is definition A (the host CLOCK_MONOTONIC_RAW), bad things will happen. I'm OK with it stopping the vCPUs (although I'd like it to do so in a less implicitly awful way). But when we don't need to update the reference time at all, let's not do so. > > When invoked from in 'use_master_clock' mode, kvm_update_masterclock() > > should probably *adjust* kvm->arch.kvmclock_offset to account for the > > drift, instead of yanking the clock back to defintion A. > > You are likely correct... > > > But in the meantime there are a bunch of places where it just doesn't need to be > > invoked at all. > > > > To start with: there is no need to do such an update when a Xen guest > > populates the shared_info page. This seems to have been a hangover from > > the very first implementation of shared_info which automatically > > populated the vcpu_info structures at their default locations, but even > > then it should just have raised KVM_REQ_CLOCK_UPDATE on each vCPU > > instead of using KVM_REQ_MASTERCLOCK_UPDATE. And now that userspace is > > expected to explicitly set the vcpu_info even in its default locations, > > there's not even any need for that either. > > > > Fixes: 629b5348841a1 ("KVM: x86/xen: update wallclock region") > > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> > > Reviewed-by: Paul Durrant <paul@xen.org> > > --- > > arch/x86/kvm/xen.c | 2 -- > > 1 file changed, 2 deletions(-) > > > > diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c > > index f65b35a05d91..5a83a8154b79 100644 > > --- a/arch/x86/kvm/xen.c > > +++ b/arch/x86/kvm/xen.c > > @@ -98,8 +98,6 @@ static int kvm_xen_shared_info_init(struct kvm *kvm) > > wc->version = wc_version + 1; > > read_unlock_irq(&gpc->lock); > > > > - kvm_make_all_cpus_request(kvm, KVM_REQ_MASTERCLOCK_UPDATE); > > - > > out: > > srcu_read_unlock(&kvm->srcu, idx); > > return ret; > > -- > > 2.44.0 > > So KVM_REQ_MASTERCLOCK_UPDATE is to avoid the race above. > > In what contexes is kvm_xen_shared_info_init called from again? > > Not clear to me KVM_REQ_MASTERCLOCK_UPDATE is not needed (or that is > needed, for that matter...). We cal kvm_xen_shared_info_init() when the Xen "shared info" page is set up. The only interesting part of that is the *wallclock* epoch. The wallclock (just like KSR_KVM_WALL_CLOCK{,_NEW}) is *entirely* hosed ever since the KVM clock stopped being based on CLOCK_MONOTONIC, since that means that the value of "wallclock time minus KVM clock time" actually *changes* as the KVM clock runs at a different rate to wallclock time. I'm looking at a replacement for that which uses the gtod information to give the guest a direct mapping of guest TSC to host CLOCK_TAI. And in doing so we can *also* indicate when live migration has potentially disrupted the guest TSC, so any further NTP/PTP refinement that the guest may have done for itself needs to be thrown away.
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c index f65b35a05d91..5a83a8154b79 100644 --- a/arch/x86/kvm/xen.c +++ b/arch/x86/kvm/xen.c @@ -98,8 +98,6 @@ static int kvm_xen_shared_info_init(struct kvm *kvm) wc->version = wc_version + 1; read_unlock_irq(&gpc->lock); - kvm_make_all_cpus_request(kvm, KVM_REQ_MASTERCLOCK_UPDATE); - out: srcu_read_unlock(&kvm->srcu, idx); return ret;