
[v3] KVM: x86: Use fast path for Xen timer delivery

Message ID f21ee3bd852761e7808240d4ecaec3013c649dc7.camel@infradead.org (mailing list archive)
State New, archived

Commit Message

David Woodhouse Sept. 30, 2023, 1:58 p.m. UTC
From: David Woodhouse <dwmw@amazon.co.uk>

Most of the time there's no need to kick the vCPU and deliver the timer
event through kvm_xen_inject_timer_irqs(). Use kvm_xen_set_evtchn_fast()
directly from the timer callback, and only fall back to the slow path
when it's necessary to do so.

This gives a significant improvement in timer latency testing (using
nanosleep() for various periods and then measuring the actual time
elapsed).

However, there was a reason¹ the fast path was dropped when this support
was first added. The current code holds vcpu->mutex for all operations
on the vcpu->arch.xen.timer_expires field, and the fast path introduces a
potential race condition. Avoid that race by ensuring the hrtimer is
(temporarily) cancelled before making changes in kvm_xen_start_timer(),
and also when reading the values out for KVM_XEN_VCPU_ATTR_TYPE_TIMER.

¹ https://lore.kernel.org/kvm/846caa99-2e42-4443-1070-84e49d2f11d2@redhat.com/

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 • v2: Remember, and deal with, those races. 

 • v3: Drop the assertions for vcpu being loaded; those can be done
       separately if at all.

       Reorder the code in xen_timer_callback() to make it clearer
       that vcpu->arch.xen.timer_expires is being cleared in the case
       where the event channel delivery is *complete*, as opposed to
       the -EWOULDBLOCK deferred path.

       Drop the 'pending' variable in kvm_xen_vcpu_get_attr() and
       restart the hrtimer if (vcpu->arch.xen.timer_expires), which
       ought to be exactly the same thing (that's the *point* in
       cancelling the timer, to make it truthful as we return its
       value to userspace).       

       Improve comments.

 arch/x86/kvm/xen.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

Comments

Paul Durrant Oct. 2, 2023, 10:35 a.m. UTC | #1
On 30/09/2023 14:58, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> Most of the time there's no need to kick the vCPU and deliver the timer
> event through kvm_xen_inject_timer_irqs(). Use kvm_xen_set_evtchn_fast()
> directly from the timer callback, and only fall back to the slow path
> when it's necessary to do so.
> 
> [...]

Reviewed-by: Paul Durrant <paul@xen.org>
Sean Christopherson Oct. 2, 2023, 5 p.m. UTC | #2
On Sat, Sep 30, 2023, David Woodhouse wrote:
> @@ -146,6 +160,14 @@ static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
>  
>  static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns)
>  {
> +	/*
> +	 * Avoid races with the old timer firing. Checking timer_expires
> +	 * to avoid calling hrtimer_cancel() will only have false positives
> +	 * so is fine.
> +	 */
> +	if (vcpu->arch.xen.timer_expires)
> +		hrtimer_cancel(&vcpu->arch.xen.timer);
> +
>  	atomic_set(&vcpu->arch.xen.timer_pending, 0);
>  	vcpu->arch.xen.timer_expires = guest_abs;
>  
> @@ -1019,9 +1041,36 @@ int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
>  		break;
>  
>  	case KVM_XEN_VCPU_ATTR_TYPE_TIMER:
> +		/*
> +		 * Ensure a consistent snapshot of state is captured, with a
> +		 * timer either being pending, or the event channel delivered
> +		 * to the corresponding bit in the shared_info. Not still
> +		 * lurking in the timer_pending flag for deferred delivery.
> +		 * Purely as an optimisation, if the timer_expires field is
> +		 * zero, that means the timer isn't active (or even in the
> +		 * timer_pending flag) and there is no need to cancel it.
> +		 */

Ah, kvm_xen_start_timer() zeros timer_pending.

Given that, shouldn't it be impossible for xen_timer_callback() to observe a
non-zero timer_pending value?  E.g. couldn't this code WARN?

	if (atomic_read(&vcpu->arch.xen.timer_pending))
		return HRTIMER_NORESTART;

Obviously not a blocker for this patch, I'm mostly just curious to know if I'm
missing something.
David Woodhouse Oct. 2, 2023, 5:05 p.m. UTC | #3
On Mon, 2023-10-02 at 10:00 -0700, Sean Christopherson wrote:
> 
> >         case KVM_XEN_VCPU_ATTR_TYPE_TIMER:
> > +               /*
> > +                * Ensure a consistent snapshot of state is captured, with a
> > +                * timer either being pending, or the event channel delivered
> > +                * to the corresponding bit in the shared_info. Not still
> > +                * lurking in the timer_pending flag for deferred delivery.
> > +                * Purely as an optimisation, if the timer_expires field is
> > +                * zero, that means the timer isn't active (or even in the
> > +                * timer_pending flag) and there is no need to cancel it.
> > +                */
> 
> Ah, kvm_xen_start_timer() zeros timer_pending.
> 
> Given that, shouldn't it be impossible for xen_timer_callback() to observe a
> non-zero timer_pending value?  E.g. couldn't this code WARN?
> 
>         if (atomic_read(&vcpu->arch.xen.timer_pending))
>                 return HRTIMER_NORESTART;
> 
> Obviously not a blocker for this patch, I'm mostly just curious to know if I'm
> missing something.

Yes, I believe that is true.
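
(For illustration, a minimal sketch of what Sean suggests: turning the early return at the top of xen_timer_callback() into a warning for the case he believes should be impossible. This is hypothetical and not part of the applied patch:

	/*
	 * kvm_xen_start_timer() zeroes timer_pending before (re)arming the
	 * hrtimer, so the callback should never see a non-zero value here.
	 */
	if (WARN_ON_ONCE(atomic_read(&vcpu->arch.xen.timer_pending)))
		return HRTIMER_NORESTART;

WARN_ON_ONCE() returns the value of its condition, so the early return is preserved while the "impossible" case is reported.)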
Sean Christopherson Oct. 5, 2023, 1:29 a.m. UTC | #4
On Sat, 30 Sep 2023 14:58:35 +0100, David Woodhouse wrote:
> Most of the time there's no need to kick the vCPU and deliver the timer
> event through kvm_xen_inject_timer_irqs(). Use kvm_xen_set_evtchn_fast()
> directly from the timer callback, and only fall back to the slow path
> when it's necessary to do so.
> 
> This gives a significant improvement in timer latency testing (using
> nanosleep() for various periods and then measuring the actual time
> elapsed).
> 
> [...]

Applied to kvm-x86 xen, thanks!

[1/1] KVM: x86: Use fast path for Xen timer delivery
      https://github.com/kvm-x86/linux/commit/77c9b9dea4fb

--
https://github.com/kvm-x86/linux/tree/next
Sean Christopherson Feb. 6, 2024, 6:41 p.m. UTC | #5
On Sat, Sep 30, 2023, David Woodhouse wrote:
> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> index 40edf4d1974c..75586da134b3 100644
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
> @@ -134,9 +134,23 @@ static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
>  {
>  	struct kvm_vcpu *vcpu = container_of(timer, struct kvm_vcpu,
>  					     arch.xen.timer);
> +	struct kvm_xen_evtchn e;
> +	int rc;
> +
>  	if (atomic_read(&vcpu->arch.xen.timer_pending))
>  		return HRTIMER_NORESTART;
>  
> +	e.vcpu_id = vcpu->vcpu_id;
> +	e.vcpu_idx = vcpu->vcpu_idx;
> +	e.port = vcpu->arch.xen.timer_virq;
> +	e.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL;
> +
> +	rc = kvm_xen_set_evtchn_fast(&e, vcpu->kvm);
> +	if (rc != -EWOULDBLOCK) {
> +		vcpu->arch.xen.timer_expires = 0;
> +		return HRTIMER_NORESTART;
> +	}
> +
>  	atomic_inc(&vcpu->arch.xen.timer_pending);
>  	kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
>  	kvm_vcpu_kick(vcpu);
> @@ -146,6 +160,14 @@ static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
>  
>  static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns)
>  {
> +	/*
> +	 * Avoid races with the old timer firing. Checking timer_expires
> +	 * to avoid calling hrtimer_cancel() will only have false positives
> +	 * so is fine.
> +	 */
> +	if (vcpu->arch.xen.timer_expires)
> +		hrtimer_cancel(&vcpu->arch.xen.timer);
> +
>  	atomic_set(&vcpu->arch.xen.timer_pending, 0);
>  	vcpu->arch.xen.timer_expires = guest_abs;
>  
> @@ -1019,9 +1041,36 @@ int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
>  		break;
>  
>  	case KVM_XEN_VCPU_ATTR_TYPE_TIMER:
> +		/*
> +		 * Ensure a consistent snapshot of state is captured, with a
> +		 * timer either being pending, or the event channel delivered
> +		 * to the corresponding bit in the shared_info. Not still
> +		 * lurking in the timer_pending flag for deferred delivery.
> +		 * Purely as an optimisation, if the timer_expires field is
> +		 * zero, that means the timer isn't active (or even in the
> +		 * timer_pending flag) and there is no need to cancel it.
> +		 */
> +		if (vcpu->arch.xen.timer_expires) {
> +			hrtimer_cancel(&vcpu->arch.xen.timer);
> +			kvm_xen_inject_timer_irqs(vcpu);

This has an obvious-in-hindsight recursive deadlock bug.  If KVM actually needs
to inject a timer IRQ, and the fast path fails, i.e. the gpc is invalid,
kvm_xen_set_evtchn() will attempt to acquire xen.xen_lock, which is already held.

Not sure if I sucked at testing before, or if I just got "lucky" on a random run.

 ============================================
 WARNING: possible recursive locking detected
 6.8.0-smp--5e10b4d51d77-drs #232 Tainted: G           O      
 --------------------------------------------
 xen_shinfo_test/250013 is trying to acquire lock:
 ffff938c9930cc30 (&kvm->arch.xen.xen_lock){+.+.}-{3:3}, at: kvm_xen_set_evtchn+0x74/0x170 [kvm]
 
 but task is already holding lock:
 ffff938c9930cc30 (&kvm->arch.xen.xen_lock){+.+.}-{3:3}, at: kvm_xen_vcpu_get_attr+0x38/0x250 [kvm]
 
 other info that might help us debug this:
  Possible unsafe locking scenario:
        CPU0
        ----
   lock(&kvm->arch.xen.xen_lock);
   lock(&kvm->arch.xen.xen_lock);
 
  *** DEADLOCK ***
  May be due to missing lock nesting notation
 2 locks held by xen_shinfo_test/250013:
  #0: ffff9228863f21a8 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x8f/0x5b0 [kvm]
  #1: ffff938c9930cc30 (&kvm->arch.xen.xen_lock){+.+.}-{3:3}, at: kvm_xen_vcpu_get_attr+0x38/0x250 [kvm]
 
 stack backtrace:
 CPU: 128 PID: 250013 Comm: xen_shinfo_test Tainted: G           O       6.8.0-smp--5e10b4d51d77-drs #232
 Hardware name: Google, Inc.                                                       Arcadia_IT_80/Arcadia_IT_80, BIOS 34.30.0 07/27/2023
 Call Trace:
  <TASK>
  dump_stack_lvl+0x69/0xa0
  dump_stack+0x14/0x20
  print_deadlock_bug+0x2af/0x2c0
  __lock_acquire+0x13f7/0x2e30
  lock_acquire+0xd4/0x220
  __mutex_lock+0x6a/0xa60
  mutex_lock_nested+0x1f/0x30
  kvm_xen_set_evtchn+0x74/0x170 [kvm]
  kvm_xen_vcpu_get_attr+0x136/0x250 [kvm]
  kvm_arch_vcpu_ioctl+0x942/0x1130 [kvm]
  kvm_vcpu_ioctl+0x484/0x5b0 [kvm]
  __se_sys_ioctl+0x7a/0xc0
  __x64_sys_ioctl+0x21/0x30
  do_syscall_64+0x82/0x160
  entry_SYSCALL_64_after_hwframe+0x63/0x6b
 RIP: 0033:0x460eab
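
(Condensed from the backtrace above, the recursive path is:

	kvm_xen_vcpu_get_attr()          takes kvm->arch.xen.xen_lock
	  kvm_xen_inject_timer_irqs()
	    kvm_xen_set_evtchn()         gpc invalid, tries to take xen_lock again

i.e. the same task attempts to acquire the mutex it already holds.)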
David Woodhouse Feb. 6, 2024, 6:51 p.m. UTC | #6
On Tue, 2024-02-06 at 10:41 -0800, Sean Christopherson wrote:
> 
> This has an obvious-in-hindsight recursive deadlock bug.  If KVM actually needs
> to inject a timer IRQ, and the fast path fails, i.e. the gpc is invalid,
> kvm_xen_set_evtchn() will attempt to acquire xen.xen_lock, which is already held

Hm, right. In fact, kvm_xen_set_evtchn() shouldn't actually *need* the
xen_lock in an ideal world; it's only taking it in order to work around
the fact that the gfn_to_pfn_cache doesn't have its *own* self-
sufficient locking. I have patches for that...

I think the *simplest* of the "patches for that" approaches is just to
use the gpc->refresh_lock to cover all activate, refresh and deactivate
calls. I was waiting for Paul's series to land before sending that one,
but I'll work on it today, and double-check my belief that we can then
just drop xen_lock from kvm_xen_set_evtchn().
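
(For illustration, a rough sketch of the shape such a change could take, with the cache's own mutex taken around activation. The __kvm_gpc_activate() helper is hypothetical and the real patches may differ:

	int kvm_gpc_activate(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned long len)
	{
		int ret;

		/*
		 * Serialise against refresh and deactivate using the cache's
		 * own refresh_lock, so callers such as kvm_xen_set_evtchn()
		 * no longer need an external lock for gpc safety.
		 */
		mutex_lock(&gpc->refresh_lock);
		ret = __kvm_gpc_activate(gpc, gpa, len);
		mutex_unlock(&gpc->refresh_lock);

		return ret;
	}

The same pattern would apply to the refresh and deactivate paths.)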
Sean Christopherson Feb. 7, 2024, 2:58 a.m. UTC | #7
On Tue, Feb 06, 2024, David Woodhouse wrote:
> On Tue, 2024-02-06 at 10:41 -0800, Sean Christopherson wrote:
> > 
> > This has an obvious-in-hindsight recursive deadlock bug.  If KVM actually needs
> > to inject a timer IRQ, and the fast path fails, i.e. the gpc is invalid,
> > kvm_xen_set_evtchn() will attempt to acquire xen.xen_lock, which is already held
> 
> Hm, right. In fact, kvm_xen_set_evtchn() shouldn't actually *need* the
> xen_lock in an ideal world; it's only taking it in order to work around
> the fact that the gfn_to_pfn_cache doesn't have its *own* self-
> sufficient locking. I have patches for that...
> 
> I think the *simplest* of the "patches for that" approaches is just to
> use the gpc->refresh_lock to cover all activate, refresh and deactivate
> calls. I was waiting for Paul's series to land before sending that one,
> but I'll work on it today, and double-check my belief that we can then
> just drop xen_lock from kvm_xen_set_evtchn().

While I definitely want to get rid of arch.xen.xen_lock, I don't want to address
the deadlock by relying on adding more locking to the gpc code.  I want a teeny
tiny patch that is easy to review and backport.  Y'all are *probably* the only
folks that care about Xen emulation, but even so, that's not a valid reason for
taking a roundabout way to fixing a deadlock.

Can't we simply not take xen_lock in kvm_xen_vcpu_get_attr()?  It holds vcpu->mutex
so it's mutually exclusive with kvm_xen_vcpu_set_attr(), and I don't see any other
flows other than vCPU destruction that deactivate (or change) the gpc.

And the worst case scenario is that if _userspace_ is being stupid, userspace gets
a stale GPA.

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 4b4e738c6f1b..50aa28b9ffc4 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -973,8 +973,6 @@ int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
 {
        int r = -ENOENT;
 
-       mutex_lock(&vcpu->kvm->arch.xen.xen_lock);
-
        switch (data->type) {
        case KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO:
                if (vcpu->arch.xen.vcpu_info_cache.active)
@@ -1083,7 +1081,6 @@ int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
                break;
        }
 
-       mutex_unlock(&vcpu->kvm->arch.xen.xen_lock);
        return r;
 }
 
 

If that seems to risky, we could go with an ugly and hacky, but conservative:

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 4b4e738c6f1b..456d05c5b18a 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -1052,7 +1052,9 @@ int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
                 */
                if (vcpu->arch.xen.timer_expires) {
                        hrtimer_cancel(&vcpu->arch.xen.timer);
+                       mutex_unlock(&vcpu->kvm->arch.xen.xen_lock);
                        kvm_xen_inject_timer_irqs(vcpu);
+                       mutex_lock(&vcpu->kvm->arch.xen.xen_lock);
                }
 
                data->u.timer.port = vcpu->arch.xen.timer_virq;
David Woodhouse Feb. 7, 2024, 3:29 a.m. UTC | #8
On Tue, 2024-02-06 at 18:58 -0800, Sean Christopherson wrote:
> On Tue, Feb 06, 2024, David Woodhouse wrote:
> > On Tue, 2024-02-06 at 10:41 -0800, Sean Christopherson wrote:
> > > 
> > > This has an obvious-in-hindsight recursive deadlock bug.  If KVM actually needs
> > > to inject a timer IRQ, and the fast path fails, i.e. the gpc is invalid,
> > > kvm_xen_set_evtchn() will attempt to acquire xen.xen_lock, which is already held
> > 
> > Hm, right. In fact, kvm_xen_set_evtchn() shouldn't actually *need* the
> > xen_lock in an ideal world; it's only taking it in order to work around
> > the fact that the gfn_to_pfn_cache doesn't have its *own* self-
> > sufficient locking. I have patches for that...
> > 
> > I think the *simplest* of the "patches for that" approaches is just to
> > use the gpc->refresh_lock to cover all activate, refresh and deactivate
> > calls. I was waiting for Paul's series to land before sending that one,
> > but I'll work on it today, and double-check my belief that we can then
> > just drop xen_lock from kvm_xen_set_evtchn().
> 
> While I definitely want to get rid of arch.xen.xen_lock, I don't want to address
> the deadlock by relying on adding more locking to the gpc code.  I want a teeny
> tiny patch that is easy to review and backport.  Y'all are *probably* the only
> folks that care about Xen emulation, but even so, that's not a valid reason for
> taking a roundabout way to fixing a deadlock.

I strongly disagree. I get that you're reticent about fixing the gpc
locking, but what I'm proposing is absolutely *not* a 'roundabout way
to fixing a deadlock'. The kvm_xen_set_evtchn() function shouldn't
*need* that lock; it's only taking it because of the underlying problem
with the gpc itself, which needs its caller to do its locking for it.

The solution is not to do further gymnastics with the xen_lock.

> Can't we simply not take xen_lock in kvm_xen_vcpu_get_attr()?  It holds vcpu->mutex
> so it's mutually exclusive with kvm_xen_vcpu_set_attr(), and I don't see any other
> flows other than vCPU destruction that deactivate (or change) the gpc.

Maybe. Although with the gpc locking being incomplete, I'm extremely
concerned about something *implicitly* relying on the xen_lock. We
still need to fix the gpc to have self-contained locking.

I'll put something together and do some testing.
Sean Christopherson Feb. 7, 2024, 4:28 a.m. UTC | #9
On Tue, Feb 06, 2024, David Woodhouse wrote:
> On Tue, 2024-02-06 at 18:58 -0800, Sean Christopherson wrote:
> > On Tue, Feb 06, 2024, David Woodhouse wrote:
> > > On Tue, 2024-02-06 at 10:41 -0800, Sean Christopherson wrote:
> > > > 
> > > > This has an obvious-in-hindsight recursive deadlock bug.  If KVM actually needs
> > > > to inject a timer IRQ, and the fast path fails, i.e. the gpc is invalid,
> > > > kvm_xen_set_evtchn() will attempt to acquire xen.xen_lock, which is already held
> > > 
> > > Hm, right. In fact, kvm_xen_set_evtchn() shouldn't actually *need* the
> > > xen_lock in an ideal world; it's only taking it in order to work around
> > > the fact that the gfn_to_pfn_cache doesn't have its *own* self-
> > > sufficient locking. I have patches for that...
> > > 
> > > I think the *simplest* of the "patches for that" approaches is just to
> > > use the gpc->refresh_lock to cover all activate, refresh and deactivate
> > > calls. I was waiting for Paul's series to land before sending that one,
> > > but I'll work on it today, and double-check my belief that we can then
> > > just drop xen_lock from kvm_xen_set_evtchn().
> > 
> > While I definitely want to get rid of arch.xen.xen_lock, I don't want to address
> > the deadlock by relying on adding more locking to the gpc code.  I want a teeny
> > tiny patch that is easy to review and backport.  Y'all are *probably* the only
> > folks that care about Xen emulation, but even so, that's not a valid reason for
> > taking a roundabout way to fixing a deadlock.
> 
> I strongly disagree. I get that you're reticent about fixing the gpc
> locking, but what I'm proposing is absolutely *not* a 'roundabout way
> to fixing a deadlock'. The kvm_xen_set_evtchn() function shouldn't
> *need* that lock; it's only taking it because of the underlying problem
> with the gpc itself, which needs its caller to do its locking for it.
> 
> The solution is not to do further gymnastics with the xen_lock.

I agree that's the long term solution, but I am not entirely confident that a big
overhaul is 6.9 material at this point.  Squeezing an overhaul into 6.8 (and if
we're being nitpicky, backporting to 6.7) is out of the question.
David Woodhouse Feb. 7, 2024, 4:36 a.m. UTC | #10
On Tue, 2024-02-06 at 20:28 -0800, Sean Christopherson wrote:
> 
> I agree that's the long term solution, but I am not entirely confident that a big
> overhaul is 6.9 material at this point.  Squeezing an overhaul into 6.8 (and if
> we're being nitpicky, backporting to 6.7) is out of the question.

It actually ends up being really simple. We just lift ->refresh_lock up
and use it to protect all the activate/deactivate/refresh paths.

I have a patch in my tree; just need to put it through some testing.

(Currently stuck in a hotel room with a bunch of positive COVID tests,
unstable wifi and no decent screen, getting a bit fractious :)

Patch

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 40edf4d1974c..75586da134b3 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -134,9 +134,23 @@  static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
 {
 	struct kvm_vcpu *vcpu = container_of(timer, struct kvm_vcpu,
 					     arch.xen.timer);
+	struct kvm_xen_evtchn e;
+	int rc;
+
 	if (atomic_read(&vcpu->arch.xen.timer_pending))
 		return HRTIMER_NORESTART;
 
+	e.vcpu_id = vcpu->vcpu_id;
+	e.vcpu_idx = vcpu->vcpu_idx;
+	e.port = vcpu->arch.xen.timer_virq;
+	e.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL;
+
+	rc = kvm_xen_set_evtchn_fast(&e, vcpu->kvm);
+	if (rc != -EWOULDBLOCK) {
+		vcpu->arch.xen.timer_expires = 0;
+		return HRTIMER_NORESTART;
+	}
+
 	atomic_inc(&vcpu->arch.xen.timer_pending);
 	kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
 	kvm_vcpu_kick(vcpu);
@@ -146,6 +160,14 @@  static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
 
 static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns)
 {
+	/*
+	 * Avoid races with the old timer firing. Checking timer_expires
+	 * to avoid calling hrtimer_cancel() will only have false positives
+	 * so is fine.
+	 */
+	if (vcpu->arch.xen.timer_expires)
+		hrtimer_cancel(&vcpu->arch.xen.timer);
+
 	atomic_set(&vcpu->arch.xen.timer_pending, 0);
 	vcpu->arch.xen.timer_expires = guest_abs;
 
@@ -1019,9 +1041,36 @@  int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
 		break;
 
 	case KVM_XEN_VCPU_ATTR_TYPE_TIMER:
+		/*
+		 * Ensure a consistent snapshot of state is captured, with a
+		 * timer either being pending, or the event channel delivered
+		 * to the corresponding bit in the shared_info. Not still
+		 * lurking in the timer_pending flag for deferred delivery.
+		 * Purely as an optimisation, if the timer_expires field is
+		 * zero, that means the timer isn't active (or even in the
+		 * timer_pending flag) and there is no need to cancel it.
+		 */
+		if (vcpu->arch.xen.timer_expires) {
+			hrtimer_cancel(&vcpu->arch.xen.timer);
+			kvm_xen_inject_timer_irqs(vcpu);
+		}
+
 		data->u.timer.port = vcpu->arch.xen.timer_virq;
 		data->u.timer.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL;
 		data->u.timer.expires_ns = vcpu->arch.xen.timer_expires;
+
+		/*
+		 * The hrtimer may trigger and raise the IRQ immediately,
+		 * while the returned state causes it to be set up and
+		 * raised again on the destination system after migration.
+		 * That's fine, as the guest won't even have had a chance
+		 * to run and handle the interrupt. Asserting an already
+		 * pending event channel is idempotent.
+		 */
+		if (vcpu->arch.xen.timer_expires)
+			hrtimer_start_expires(&vcpu->arch.xen.timer,
+					      HRTIMER_MODE_ABS_HARD);
+
 		r = 0;
 		break;