diff mbox series

[1/2] KVM: x86/xen: PV oneshot timer fixes

Message ID 20220309143835.253911-2-dwmw2@infradead.org (mailing list archive)
State New, archived
Series KVM: Xen PV timer fixups

Commit Message

David Woodhouse March 9, 2022, 2:38 p.m. UTC
From: David Woodhouse <dwmw@amazon.co.uk>

Fix the case where a restored timer is supposed to have triggered not
just in the past, but before this kernel even booted. The resulting
integer wrap caused the timer to be set a very long time into the future,
and thus effectively never trigger. Trigger timers immediately when
delta_ns <= 0 to avoid that situation.

Also switch to using HRTIMER_MODE_ABS_HARD, following the changes in the
local APIC timer in commits 2c0d278f3293f ("KVM: LAPIC: Mark hrtimer to
expire in hard interrupt context") and 4d151bf3b89e7 ("KVM: LAPIC: Make
lapic timer unpinned")... and the change I'm about to submit to fix the
local APIC timer not to gratuitously cancel and restart the hrtimer on
migration, since it isn't pinned any more anyway.

On serializing the timer state for migration, only set the expires_ns
if the timer is still active, otherwise return zero in that field.

Finally, fix the 'delta' in kvm_xen_hcall_set_timer_op() to explicitly
use 'int64_t' instead of 'long', so that the sanity-check shift by 50
bits works correctly in the 32-bit build. That last one was:

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/irq.c |  1 -
 arch/x86/kvm/xen.c | 37 +++++++++++++++----------------------
 arch/x86/kvm/xen.h |  1 -
 3 files changed, 15 insertions(+), 24 deletions(-)

Comments

Paolo Bonzini March 9, 2022, 3:39 p.m. UTC | #1
On 3/9/22 15:38, David Woodhouse wrote:
> +		if (hrtimer_active(&vcpu->arch.xen.timer))
> +			data->u.timer.expires_ns = vcpu->arch.xen.timer_expires;
> +		else
> +			data->u.timer.expires_ns = 0;
>   		r = 0;

This is still racy.  You should instead clear timer_expires in 
xen_timer_callback, and only do so after a *successful* write to the 
event channel (setting timer_pending is not enough).

Still, this only works if userspace reads the pending events *after* 
expires_ns.  It's the usual pattern:

	xen_timer_callback()		userspace
	------------------------	------------------------
	kvm_xen_set_evtchn_fast()	read timer_expires
	smp_wmb()			smp_rmb()
	expires_ns = 0;			read event channel

Now, if expires is read as 0, userspace surely will read the event properly.

Is this doable with the KVM ioctls that you have, considering that the 
pending events are in memory?  If not, you should add pause timer and 
resume timer ioctls.  These will change the hrtimer state without 
touching timer_expires and timer_pending.

But even better, and even simpler: just set timer_pending in 
xen_timer_callback, always go through the delayed injection path, and 
remove this "if" from KVM_XEN_VCPU_GET_ATTR.  Then there are no races, 
because KVM_RUN and KVM_XEN_VCPU_GET_ATTR are protected with vcpu->mutex:

	xen_timer_callback()		migration source
	------------------------	------------------------
					read event channel
	inc timer_pending
					ioctl(KVM_XEN_VCPU_GET_ATTR)
					  read timer_expires

timer_expires is *not* read as zero, because the clearing happens only 
in kvm_xen_inject_timer_irqs(), which cannot be concurrent with 
KVM_XEN_VCPU_GET_ATTR.  And the destination does:

	migration destination			xen_timer_callback()
	------------------------		------------------------
	ioctl(KVM_XEN_VCPU_SET_ATTR)
	  set timer_expires
	    hrtimer_start()
					        inc timer_pending
	ioctl(KVM_RUN)
	  kvm_xen_inject_timer_irqs()
	    kvm_xen_set_evtchn()
	      kvm_xen_set_evtchn_fast()
	    expires_ns = 0;

Paolo
David Woodhouse March 9, 2022, 4:11 p.m. UTC | #2
On Wed, 2022-03-09 at 16:39 +0100, Paolo Bonzini wrote:
> On 3/9/22 15:38, David Woodhouse wrote:
> > +		if (hrtimer_active(&vcpu->arch.xen.timer))
> > +			data->u.timer.expires_ns = vcpu->arch.xen.timer_expires;
> > +		else
> > +			data->u.timer.expires_ns = 0;
> >   		r = 0;
> 
> This is still racy.  You should instead clear timer_expires in 
> xen_timer_callback, and only do so after a *successful* write to the 
> event channel (setting timer_pending is not enough).

Hm, I thought about setting expires_ns = 0 in xen_timer_callback() but
that can race with the vCPU making a hypercall to reconfigure the
timer.

> Still, this only works if userspace reads the pending events *after* 
> expires_ns.  It's the usual pattern:
> 
> 	xen_timer_callback()		userspace
> 	------------------------	------------------------
> 	kvm_xen_set_evtchn_fast()	read timer_expires
> 	smp_wmb()			smp_rmb()
> 	expires_ns = 0;			read event channel
> 
> Now, if expires is read as 0, userspace surely will read the event properly.
> 
> Is this doable with the KVM ioctls that you have, considering that the 
> pending events are in memory?  If not, you should add pause timer and 
> resume timer ioctls.  These will change the hrtimer state without 
> touching timer_expires and timer_pending.

That's OK, as the pending events are in the shared_info and vcpu_info
regions which we have explicitly declared exempt from dirty tracking.
Userspace must always consider them dirty any time an interrupt (which
includes timers) might be delivered, so it has to migrate that memory
in the final sync, after serializing the vCPU state.

In the local APIC delivery mode, the actual interrupt is injected as an
MSI, so userspace needs to read the timer state before the local APIC
state. 

> But even better, and even simpler: just set timer_pending in 
> xen_timer_callback, always go through the delayed injection path, and 
> remove this "if" from KVM_XEN_VCPU_GET_ATTR.  Then there are no races, 
> because KVM_RUN and KVM_XEN_VCPU_GET_ATTR are protected with vcpu->mutex:

Yeah, I don't think it even increases the latency as the vCPU had
already exited guest mode anyway. I'll do that, retest it and repost.

Thanks.
Paolo Bonzini March 9, 2022, 4:28 p.m. UTC | #3
On 3/9/22 17:11, David Woodhouse wrote:
> That's OK, as the pending events are in the shared_info and vcpu_info
> regions which we have explicitly declared exempt from dirty tracking.
> Userspace must always consider them dirty any time an interrupt (which
> includes timers) might be delivered, so it has to migrate that memory
> in the final sync, after serializing the vCPU state.
>
> In the local APIC delivery mode, the actual interrupt is injected as an
> MSI, so userspace needs to read the timer state before the local APIC
> state.

Please document this.  It's not obvious from the fact that they're not 
dirty-tracked.

Paolo
David Woodhouse March 9, 2022, 4:37 p.m. UTC | #4
On Wed, 2022-03-09 at 17:28 +0100, Paolo Bonzini wrote:
> On 3/9/22 17:11, David Woodhouse wrote:
> > That's OK, as the pending events are in the shared_info and vcpu_info
> > regions which we have explicitly declared exempt from dirty tracking.
> > Userspace must always consider them dirty any time an interrupt (which
> > includes timers) might be delivered, so it has to migrate that memory
> > in the final sync, after serializing the vCPU state.
> > 
> > In the local APIC delivery mode, the actual interrupt is injected as an
> > MSI, so userspace needs to read the timer state before the local APIC
> > state.
> 
> Please document this.  It's not obvious from the fact that they're not 
> dirty-tracked.

The dirty tracking part is already there:

  Note that the shared info page may be constantly written to by KVM;
  it contains the event channel bitmap used to deliver interrupts to
  a Xen guest, amongst other things. It is exempt from dirty tracking
  mechanisms — KVM will not explicitly mark the page as dirty each
  time an event channel interrupt is delivered to the guest! Thus,
  userspace should always assume that the designated GFN is dirty if
  any vCPU has been running or any event channel interrupts can be
  routed to the guest.

And for vcpu_info:

KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO
  Sets the guest physical address of the vcpu_info for a given vCPU.
  As with the shared_info page for the VM, the corresponding page may be
  dirtied at any time if event channel interrupt delivery is enabled, so
  userspace should always assume that the page is dirty without relying
  on dirty logging.


For the ordering of the various vCPU state fetches... is there any
existing documentation on that? An emacs buffer just to the right of
this compose window contains the only notes on that topic that I've
ever seen (much of which may well be out of date by now)...

	/*
	 * Notes on ordering requirements (and other ramblings).
	 *
	 * KVM_GET_MP_STATE calls kvm_apic_accept_events(), which might modify
	 * vCPU/LAPIC state. As such, it must be done before most everything
	 * else, otherwise we cannot restore everything and expect it to work.
	 *
	 * KVM_GET_VCPU_EVENTS/KVM_SET_VCPU_EVENTS is unsafe if other vCPUs are
	 * still running.
	 *
	 * Some SET ioctls depend on kvm_vcpu_is_bsp(), so if we ever change
	 * the BSP, we have to do that before restoring anything. The same
	 * seems to be true for CPUID stuff.
	 *
	 * KVM_GET_LAPIC may change state of LAPIC before returning it.
	 *
	 * GET_VCPU_EVENTS should probably be last to save. The code looks as
	 * it might as well be affected by internal state modifications of the
	 * GET ioctls.
	 *
	 * SREGS saves/restores a pending interrupt, similar to what
	 * VCPU_EVENTS also does.
	 *
	 * SET_REGS clears pending exceptions unconditionally, thus, it must be
	 * done before SET_VCPU_EVENTS, which restores it.
	 *
	 * SET_LAPIC must come after SET_SREGS, because the latter restores
	 * the apic base msr.
	 *
	 * SET_LAPIC must come before SET_MSRS, because the TSC deadline MSR
	 * only restores successfully, when the LAPIC is correctly configured.
	 *
	 * GET_MSRS requires a pre-populated data structure to do something
	 * meaningful. For SET_MSRS it will then contain good data.
	 *
	 * The KVM API documentation is wrong for GET_MSRS/SET_MSRS. They
	 * return the number of successfully read/written values, which is
	 * actually helpful to determine where things went wrong.
	 */


If there's existing documentation on the ordering, I'd be happy to add
to it. If not, then my comment earlier about timers causing an
interrupt to be injected to the local APIC is far *more* obvious than
most of the rest... :)
Paolo Bonzini March 9, 2022, 4:39 p.m. UTC | #5
On 3/9/22 17:37, David Woodhouse wrote:
> The dirty tracking part is already there:
> 
>    Note that the shared info page may be constantly written to by KVM;
>    it contains the event channel bitmap used to deliver interrupts to
>    a Xen guest, amongst other things. It is exempt from dirty tracking
>    mechanisms — KVM will not explicitly mark the page as dirty each
>    time an event channel interrupt is delivered to the guest! Thus,
>    userspace should always assume that the designated GFN is dirty if
>    any vCPU has been running or any event channel interrupts can be
>    routed to the guest.
> 
> And for vcpu_info:
> 
> KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO
>    Sets the guest physical address of the vcpu_info for a given vCPU.
>    As with the shared_info page for the VM, the corresponding page may be
>    dirtied at any time if event channel interrupt delivery is enabled, so
>    userspace should always assume that the page is dirty without relying
>    on dirty logging.
> 
> 
> For the ordering of the various vCPU state fetches... is there any
> existing documentation on that? An emacs buffer just to the right of
> this compose window contains the only notes on that topic that I've
> ever seen (much of which may well be out of date by now)...

Ok, guilty as charged.

Paolo

> 	/*
> 	 * Notes on ordering requirements (and other ramblings).
> 	 *
> 	 * KVM_GET_MP_STATE calls kvm_apic_accept_events(), which might modify
> 	 * vCPU/LAPIC state. As such, it must be done before most everything
> 	 * else, otherwise we cannot restore everything and expect it to work.
> 	 *
> 	 * KVM_GET_VCPU_EVENTS/KVM_SET_VCPU_EVENTS is unsafe if other vCPUs are
> 	 * still running.
> 	 *
> 	 * Some SET ioctls depend on kvm_vcpu_is_bsp(), so if we ever change
> 	 * the BSP, we have to do that before restoring anything. The same
> 	 * seems to be true for CPUID stuff.
> 	 *
> 	 * KVM_GET_LAPIC may change state of LAPIC before returning it.
> 	 *
> 	 * GET_VCPU_EVENTS should probably be last to save. The code looks as
> 	 * it might as well be affected by internal state modifications of the
> 	 * GET ioctls.
> 	 *
> 	 * SREGS saves/restores a pending interrupt, similar to what
> 	 * VCPU_EVENTS also does.
> 	 *
> 	 * SET_REGS clears pending exceptions unconditionally, thus, it must be
> 	 * done before SET_VCPU_EVENTS, which restores it.
> 	 *
> 	 * SET_LAPIC must come after SET_SREGS, because the latter restores
> 	 * the apic base msr.
> 	 *
> 	 * SET_LAPIC must come before SET_MSRS, because the TSC deadline MSR
> 	 * only restores successfully, when the LAPIC is correctly configured.
> 	 *
> 	 * GET_MSRS requires a pre-populated data structure to do something
> 	 * meaningful. For SET_MSRS it will then contain good data.
> 	 *
> 	 * The KVM API documentation is wrong for GET_MSRS/SET_MSRS. They
> 	 * return the number of successfully read/written values, which is
> 	 * actually helpful to determine where things went wrong.
> 	 */
> 
> 
> If there's existing documentation on the ordering, I'd be happy to add
> to it. If not, then my comment earlier about timers causing an
> interrupt to be injected to the local APIC is far *more* obvious than
> most of the rest... :)

Patch

diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index af2d26fc5458..f371f1292ca3 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -156,7 +156,6 @@  void __kvm_migrate_timers(struct kvm_vcpu *vcpu)
 {
 	__kvm_migrate_apic_timer(vcpu);
 	__kvm_migrate_pit_timer(vcpu);
-	__kvm_migrate_xen_timer(vcpu);
 	static_call_cond(kvm_x86_migrate_timers)(vcpu);
 }
 
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 8c85a71aa8ca..2131e21ce6c7 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -150,29 +150,19 @@  static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
 	return HRTIMER_NORESTART;
 }
 
-void __kvm_migrate_xen_timer(struct kvm_vcpu *vcpu)
+static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns)
 {
-	struct hrtimer *timer;
-
-	if (!kvm_xen_timer_enabled(vcpu))
-		return;
-
-	timer = &vcpu->arch.xen.timer;
-	if (hrtimer_cancel(timer))
-		hrtimer_start_expires(timer, HRTIMER_MODE_ABS_PINNED);
-}
-
-static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, u64 delta_ns)
-{
-	ktime_t ktime_now;
-
 	atomic_set(&vcpu->arch.xen.timer_pending, 0);
 	vcpu->arch.xen.timer_expires = guest_abs;
 
-	ktime_now = ktime_get();
-	hrtimer_start(&vcpu->arch.xen.timer,
-		      ktime_add_ns(ktime_now, delta_ns),
-		      HRTIMER_MODE_ABS_PINNED);
+	if (delta_ns <= 0) {
+		xen_timer_callback(&vcpu->arch.xen.timer);
+	} else {
+		ktime_t ktime_now = ktime_get();
+		hrtimer_start(&vcpu->arch.xen.timer,
+			      ktime_add_ns(ktime_now, delta_ns),
+			      HRTIMER_MODE_ABS_HARD);
+	}
 }
 
 static void kvm_xen_stop_timer(struct kvm_vcpu *vcpu)
@@ -185,7 +175,7 @@  static void kvm_xen_stop_timer(struct kvm_vcpu *vcpu)
 static void kvm_xen_init_timer(struct kvm_vcpu *vcpu)
 {
 	hrtimer_init(&vcpu->arch.xen.timer, CLOCK_MONOTONIC,
-		     HRTIMER_MODE_ABS_PINNED);
+		     HRTIMER_MODE_ABS_HARD);
 	vcpu->arch.xen.timer.function = xen_timer_callback;
 }
 
@@ -839,7 +829,10 @@  int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
 	case KVM_XEN_VCPU_ATTR_TYPE_TIMER:
 		data->u.timer.port = vcpu->arch.xen.timer_virq;
 		data->u.timer.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL;
-		data->u.timer.expires_ns = vcpu->arch.xen.timer_expires;
+		if (hrtimer_active(&vcpu->arch.xen.timer))
+			data->u.timer.expires_ns = vcpu->arch.xen.timer_expires;
+		else
+			data->u.timer.expires_ns = 0;
 		r = 0;
 		break;
 
@@ -1204,7 +1197,7 @@  static bool kvm_xen_hcall_set_timer_op(struct kvm_vcpu *vcpu, uint64_t timeout,
 
 	if (timeout) {
 		uint64_t guest_now = get_kvmclock_ns(vcpu->kvm);
-		long delta = timeout - guest_now;
+		int64_t delta = timeout - guest_now;
 
 		/* Xen has a 'Linux workaround' in do_set_timer_op() which
 		 * checks for negative absolute timeout values (caused by
diff --git a/arch/x86/kvm/xen.h b/arch/x86/kvm/xen.h
index ad0876a7c301..2bbbc1a3953e 100644
--- a/arch/x86/kvm/xen.h
+++ b/arch/x86/kvm/xen.h
@@ -75,7 +75,6 @@  static inline int kvm_xen_has_pending_timer(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
-void __kvm_migrate_xen_timer(struct kvm_vcpu *vcpu);
 void kvm_xen_inject_timer_irqs(struct kvm_vcpu *vcpu);
 #else
 static inline int kvm_xen_write_hypercall_page(struct kvm_vcpu *vcpu, u64 data)