Message ID | 20220309143835.253911-2-dwmw2@infradead.org
---|---
State | New, archived
Series | KVM: Xen PV timer fixups
On 3/9/22 15:38, David Woodhouse wrote:
> +		if (hrtimer_active(&vcpu->arch.xen.timer))
> +			data->u.timer.expires_ns = vcpu->arch.xen.timer_expires;
> +		else
> +			data->u.timer.expires_ns = 0;
>  		r = 0;

This is still racy.  You should instead clear timer_expires in
xen_timer_callback, and only do so after a *successful* write to the
event channel (setting timer_pending is not enough).

Still, this only works if userspace reads the pending events *after*
expires_ns.  It's the usual pattern:

    xen_timer_callback()           userspace
    ------------------------       ------------------------
    kvm_xen_set_evtchn_fast()      read timer_expires
    smp_wmb()                      smp_rmb()
    expires_ns = 0;                read event channel

Now, if expires is read as 0, userspace surely will read the event
properly.

Is this doable with the KVM ioctls that you have, considering that the
pending events are in memory?  If not, you should add pause timer and
resume timer ioctls.  These will change the hrtimer state without
touching timer_expires and timer_pending.

But even better, and even simpler: just set timer_pending in
xen_timer_callback, always go through the delayed injection path, and
remove this "if" from KVM_XEN_VCPU_GET_ATTR.  Then there are no races,
because KVM_RUN and KVM_XEN_VCPU_GET_ATTR are protected with
vcpu->mutex:

    xen_timer_callback()           migration source
    ------------------------       ------------------------
    read event channel
    inc timer_pending              ioctl(KVM_XEN_VCPU_GET_ATTR)
                                     read timer_expires

timer_expires is *not* read as zero, because the clearing happens only
in kvm_xen_inject_timer_irqs(), which cannot be concurrent with
KVM_XEN_VCPU_GET_ATTR.  And the destination does:

    migration destination          xen_timer_callback()
    ------------------------       ------------------------
    ioctl(KVM_XEN_VCPU_SET_ATTR)
      set timer_expires
      hrtimer_start()
                                   inc timer_pending
    ioctl(KVM_RUN)
      kvm_xen_inject_timer_irqs()
        kvm_xen_set_evtchn()
          kvm_xen_set_evtchn_fast()
        expires_ns = 0;

Paolo
On Wed, 2022-03-09 at 16:39 +0100, Paolo Bonzini wrote:
> On 3/9/22 15:38, David Woodhouse wrote:
> > +		if (hrtimer_active(&vcpu->arch.xen.timer))
> > +			data->u.timer.expires_ns = vcpu->arch.xen.timer_expires;
> > +		else
> > +			data->u.timer.expires_ns = 0;
> >  		r = 0;
>
> This is still racy.  You should instead clear timer_expires in
> xen_timer_callback, and only do so after a *successful* write to the
> event channel (setting timer_pending is not enough).

Hm, I thought about setting expires_ns = 0 in xen_timer_callback() but
that can race with the vCPU making a hypercall to reconfigure the
timer.

> Still, this only works if userspace reads the pending events *after*
> expires_ns.  It's the usual pattern:
>
>     xen_timer_callback()           userspace
>     ------------------------       ------------------------
>     kvm_xen_set_evtchn_fast()      read timer_expires
>     smp_wmb()                      smp_rmb()
>     expires_ns = 0;                read event channel
>
> Now, if expires is read as 0, userspace surely will read the event
> properly.
>
> Is this doable with the KVM ioctls that you have, considering that the
> pending events are in memory?  If not, you should add pause timer and
> resume timer ioctls.  These will change the hrtimer state without
> touching timer_expires and timer_pending.

That's OK, as the pending events are in the shared_info and vcpu_info
regions which we have explicitly declared exempt from dirty tracking.
Userspace must always consider them dirty any time an interrupt (which
includes timers) might be delivered, so it has to migrate that memory
in the final sync, after serializing the vCPU state.

In the local APIC delivery mode, the actual interrupt is injected as an
MSI, so userspace needs to read the timer state before the local APIC
state.

> But even better, and even simpler: just set timer_pending in
> xen_timer_callback, always go through the delayed injection path, and
> remove this "if" from KVM_XEN_VCPU_GET_ATTR.  Then there are no races,
> because KVM_RUN and KVM_XEN_VCPU_GET_ATTR are protected with
> vcpu->mutex:

Yeah, I don't think it even increases the latency as the vCPU had
already exited guest mode anyway. I'll do that, retest it and repost.

Thanks.
On 3/9/22 17:11, David Woodhouse wrote:
> That's OK, as the pending events are in the shared_info and vcpu_info
> regions which we have explicitly declared exempt from dirty tracking.
> Userspace must always consider them dirty any time an interrupt (which
> includes timers) might be delivered, so it has to migrate that memory
> in the final sync, after serializing the vCPU state.
>
> In the local APIC delivery mode, the actual interrupt is injected as an
> MSI, so userspace needs to read the timer state before the local APIC
> state.

Please document this.  It's not obvious from the fact that they're not
dirty-tracked.

Paolo
On Wed, 2022-03-09 at 17:28 +0100, Paolo Bonzini wrote:
> On 3/9/22 17:11, David Woodhouse wrote:
> > That's OK, as the pending events are in the shared_info and vcpu_info
> > regions which we have explicitly declared exempt from dirty tracking.
> > Userspace must always consider them dirty any time an interrupt (which
> > includes timers) might be delivered, so it has to migrate that memory
> > in the final sync, after serializing the vCPU state.
> >
> > In the local APIC delivery mode, the actual interrupt is injected as an
> > MSI, so userspace needs to read the timer state before the local APIC
> > state.
>
> Please document this.  It's not obvious from the fact that they're not
> dirty-tracked.

The dirty tracking part is already there:

   Note that the shared info page may be constantly written to by KVM;
   it contains the event channel bitmap used to deliver interrupts to
   a Xen guest, amongst other things. It is exempt from dirty tracking
   mechanisms — KVM will not explicitly mark the page as dirty each
   time an event channel interrupt is delivered to the guest! Thus,
   userspace should always assume that the designated GFN is dirty if
   any vCPU has been running or any event channel interrupts can be
   routed to the guest.

And for vcpu_info:

   KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO
   Sets the guest physical address of the vcpu_info for a given vCPU.
   As with the shared_info page for the VM, the corresponding page may be
   dirtied at any time if event channel interrupt delivery is enabled, so
   userspace should always assume that the page is dirty without relying
   on dirty logging.

For the ordering of the various vCPU state fetches... is there any
existing documentation on that? An emacs buffer just to the right of
this compose window contains the only notes on that topic that I've
ever seen (much of which may well be out of date by now)...

/*
 * Notes on ordering requirements (and other ramblings).
 *
 * KVM_GET_MP_STATE calls kvm_apic_accept_events(), which might modify
 * vCPU/LAPIC state.  As such, it must be done before most everything
 * else, otherwise we cannot restore everything and expect it to work.
 *
 * KVM_GET_VCPU_EVENTS/KVM_SET_VCPU_EVENTS is unsafe if other vCPUs are
 * still running.
 *
 * Some SET ioctls depend on kvm_vcpu_is_bsp(), so if we ever change
 * the BSP, we have to do that before restoring anything. The same
 * seems to be true for CPUID stuff.
 *
 * KVM_GET_LAPIC may change state of LAPIC before returning it.
 *
 * GET_VCPU_EVENTS should probably be last to save.  The code looks as
 * it might as well be affected by internal state modifications of the
 * GET ioctls.
 *
 * SREGS saves/restores a pending interrupt, similar to what
 * VCPU_EVENTS also does.
 *
 * SET_REGS clears pending exceptions unconditionally, thus, it must be
 * done before SET_VCPU_EVENTS, which restores it.
 *
 * SET_LAPIC must come after SET_SREGS, because the latter restores
 * the apic base msr.
 *
 * SET_LAPIC must come before SET_MSRS, because the TSC deadline MSR
 * only restores successfully, when the LAPIC is correctly configured.
 *
 * GET_MSRS requires a pre-populated data structure to do something
 * meaningful. For SET_MSRS it will then contain good data.
 *
 * The KVM API documentation is wrong for GET_MSRS/SET_MSRS. They
 * return the number of successfully read/written values, which is
 * actually helpful to determine where things went wrong.
 */

If there's existing documentation on the ordering, I'd be happy to add
to it. If not, then my comment earlier about timers causing an
interrupt to be injected to the local APIC is far *more* obvious than
most of the rest... :)
On 3/9/22 17:37, David Woodhouse wrote:
> The dirty tracking part is already there:
>
>    Note that the shared info page may be constantly written to by KVM;
>    it contains the event channel bitmap used to deliver interrupts to
>    a Xen guest, amongst other things. It is exempt from dirty tracking
>    mechanisms — KVM will not explicitly mark the page as dirty each
>    time an event channel interrupt is delivered to the guest! Thus,
>    userspace should always assume that the designated GFN is dirty if
>    any vCPU has been running or any event channel interrupts can be
>    routed to the guest.
>
> And for vcpu_info:
>
>    KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO
>    Sets the guest physical address of the vcpu_info for a given vCPU.
>    As with the shared_info page for the VM, the corresponding page may be
>    dirtied at any time if event channel interrupt delivery is enabled, so
>    userspace should always assume that the page is dirty without relying
>    on dirty logging.
>
> For the ordering of the various vCPU state fetches... is there any
> existing documentation on that? An emacs buffer just to the right of
> this compose window contains the only notes on that topic that I've
> ever seen (much of which may well be out of date by now)...

Ok, guilty as charged.

Paolo

> /*
>  * Notes on ordering requirements (and other ramblings).
>  *
>  * KVM_GET_MP_STATE calls kvm_apic_accept_events(), which might modify
>  * vCPU/LAPIC state.  As such, it must be done before most everything
>  * else, otherwise we cannot restore everything and expect it to work.
>  *
>  * KVM_GET_VCPU_EVENTS/KVM_SET_VCPU_EVENTS is unsafe if other vCPUs are
>  * still running.
>  *
>  * Some SET ioctls depend on kvm_vcpu_is_bsp(), so if we ever change
>  * the BSP, we have to do that before restoring anything. The same
>  * seems to be true for CPUID stuff.
>  *
>  * KVM_GET_LAPIC may change state of LAPIC before returning it.
>  *
>  * GET_VCPU_EVENTS should probably be last to save.  The code looks as
>  * it might as well be affected by internal state modifications of the
>  * GET ioctls.
>  *
>  * SREGS saves/restores a pending interrupt, similar to what
>  * VCPU_EVENTS also does.
>  *
>  * SET_REGS clears pending exceptions unconditionally, thus, it must be
>  * done before SET_VCPU_EVENTS, which restores it.
>  *
>  * SET_LAPIC must come after SET_SREGS, because the latter restores
>  * the apic base msr.
>  *
>  * SET_LAPIC must come before SET_MSRS, because the TSC deadline MSR
>  * only restores successfully, when the LAPIC is correctly configured.
>  *
>  * GET_MSRS requires a pre-populated data structure to do something
>  * meaningful. For SET_MSRS it will then contain good data.
>  *
>  * The KVM API documentation is wrong for GET_MSRS/SET_MSRS. They
>  * return the number of successfully read/written values, which is
>  * actually helpful to determine where things went wrong.
>  */
>
> If there's existing documentation on the ordering, I'd be happy to add
> to it. If not, then my comment earlier about timers causing an
> interrupt to be injected to the local APIC is far *more* obvious than
> most of the rest... :)
diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index af2d26fc5458..f371f1292ca3 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -156,7 +156,6 @@ void __kvm_migrate_timers(struct kvm_vcpu *vcpu)
 {
 	__kvm_migrate_apic_timer(vcpu);
 	__kvm_migrate_pit_timer(vcpu);
-	__kvm_migrate_xen_timer(vcpu);
 	static_call_cond(kvm_x86_migrate_timers)(vcpu);
 }
 
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 8c85a71aa8ca..2131e21ce6c7 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -150,29 +150,19 @@ static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
 	return HRTIMER_NORESTART;
 }
 
-void __kvm_migrate_xen_timer(struct kvm_vcpu *vcpu)
+static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns)
 {
-	struct hrtimer *timer;
-
-	if (!kvm_xen_timer_enabled(vcpu))
-		return;
-
-	timer = &vcpu->arch.xen.timer;
-	if (hrtimer_cancel(timer))
-		hrtimer_start_expires(timer, HRTIMER_MODE_ABS_PINNED);
-}
-
-static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, u64 delta_ns)
-{
-	ktime_t ktime_now;
-
 	atomic_set(&vcpu->arch.xen.timer_pending, 0);
 	vcpu->arch.xen.timer_expires = guest_abs;
 
-	ktime_now = ktime_get();
-	hrtimer_start(&vcpu->arch.xen.timer,
-		      ktime_add_ns(ktime_now, delta_ns),
-		      HRTIMER_MODE_ABS_PINNED);
+	if (delta_ns <= 0) {
+		xen_timer_callback(&vcpu->arch.xen.timer);
+	} else {
+		ktime_t ktime_now = ktime_get();
+		hrtimer_start(&vcpu->arch.xen.timer,
+			      ktime_add_ns(ktime_now, delta_ns),
+			      HRTIMER_MODE_ABS_HARD);
+	}
 }
 
 static void kvm_xen_stop_timer(struct kvm_vcpu *vcpu)
@@ -185,7 +175,7 @@ static void kvm_xen_stop_timer(struct kvm_vcpu *vcpu)
 static void kvm_xen_init_timer(struct kvm_vcpu *vcpu)
 {
 	hrtimer_init(&vcpu->arch.xen.timer, CLOCK_MONOTONIC,
-		     HRTIMER_MODE_ABS_PINNED);
+		     HRTIMER_MODE_ABS_HARD);
 	vcpu->arch.xen.timer.function = xen_timer_callback;
 }
 
@@ -839,7 +829,10 @@ int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
 	case KVM_XEN_VCPU_ATTR_TYPE_TIMER:
 		data->u.timer.port = vcpu->arch.xen.timer_virq;
 		data->u.timer.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL;
-		data->u.timer.expires_ns = vcpu->arch.xen.timer_expires;
+		if (hrtimer_active(&vcpu->arch.xen.timer))
+			data->u.timer.expires_ns = vcpu->arch.xen.timer_expires;
+		else
+			data->u.timer.expires_ns = 0;
 		r = 0;
 		break;
 
@@ -1204,7 +1197,7 @@ static bool kvm_xen_hcall_set_timer_op(struct kvm_vcpu *vcpu, uint64_t timeout,
 
 	if (timeout) {
 		uint64_t guest_now = get_kvmclock_ns(vcpu->kvm);
-		long delta = timeout - guest_now;
+		int64_t delta = timeout - guest_now;
 
 		/* Xen has a 'Linux workaround' in do_set_timer_op() which
 		 * checks for negative absolute timeout values (caused by
diff --git a/arch/x86/kvm/xen.h b/arch/x86/kvm/xen.h
index ad0876a7c301..2bbbc1a3953e 100644
--- a/arch/x86/kvm/xen.h
+++ b/arch/x86/kvm/xen.h
@@ -75,7 +75,6 @@ static inline int kvm_xen_has_pending_timer(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
-void __kvm_migrate_xen_timer(struct kvm_vcpu *vcpu);
 void kvm_xen_inject_timer_irqs(struct kvm_vcpu *vcpu);
 #else
 static inline int kvm_xen_write_hypercall_page(struct kvm_vcpu *vcpu, u64 data)