| Message ID | f21ee3bd852761e7808240d4ecaec3013c649dc7.camel@infradead.org (mailing list archive) |
|---|---|
| State | New, archived |
| Series | [v3] KVM: x86: Use fast path for Xen timer delivery |
On 30/09/2023 14:58, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> Most of the time there's no need to kick the vCPU and deliver the timer
> event through kvm_xen_inject_timer_irqs(). Use kvm_xen_set_evtchn_fast()
> directly from the timer callback, and only fall back to the slow path
> when it's necessary to do so.
>
> This gives a significant improvement in timer latency testing (using
> nanosleep() for various periods and then measuring the actual time
> elapsed).
>
> However, there was a reason¹ the fast path was dropped when this support
> was first added. The current code holds vcpu->mutex for all operations
> on the kvm->arch.timer_expires field, and the fast path introduces a
> potential race condition. Avoid that race by ensuring the hrtimer is
> (temporarily) cancelled before making changes in kvm_xen_start_timer(),
> and also when reading the values out for KVM_XEN_VCPU_ATTR_TYPE_TIMER.
>
> ¹ https://lore.kernel.org/kvm/846caa99-2e42-4443-1070-84e49d2f11d2@redhat.com/
>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  • v2: Remember, and deal with, those races.
>
>  • v3: Drop the assertions for vcpu being loaded; those can be done
>        separately if at all.
>
>        Reorder the code in xen_timer_callback() to make it clearer
>        that kvm->arch.xen.timer_expires is being cleared in the case
>        where the event channel delivery is *complete*, as opposed to
>        the -EWOULDBLOCK deferred path.
>
>        Drop the 'pending' variable in kvm_xen_vcpu_get_attr() and
>        restart the hrtimer if (kvm->arch.xen.timer_expires), which
>        ought to be exactly the same thing (that's the *point* in
>        cancelling the timer, to make it truthful as we return its
>        value to userspace).
>
>        Improve comments.
>
>  arch/x86/kvm/xen.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 49 insertions(+)
>

Reviewed-by: Paul Durrant <paul@xen.org>
On Sat, Sep 30, 2023, David Woodhouse wrote:
> @@ -146,6 +160,14 @@ static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
>
>  static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns)
>  {
> +	/*
> +	 * Avoid races with the old timer firing. Checking timer_expires
> +	 * to avoid calling hrtimer_cancel() will only have false positives
> +	 * so is fine.
> +	 */
> +	if (vcpu->arch.xen.timer_expires)
> +		hrtimer_cancel(&vcpu->arch.xen.timer);
> +
>  	atomic_set(&vcpu->arch.xen.timer_pending, 0);
>  	vcpu->arch.xen.timer_expires = guest_abs;
>
> @@ -1019,9 +1041,36 @@ int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
>  		break;
>
>  	case KVM_XEN_VCPU_ATTR_TYPE_TIMER:
> +		/*
> +		 * Ensure a consistent snapshot of state is captured, with a
> +		 * timer either being pending, or the event channel delivered
> +		 * to the corresponding bit in the shared_info. Not still
> +		 * lurking in the timer_pending flag for deferred delivery.
> +		 * Purely as an optimisation, if the timer_expires field is
> +		 * zero, that means the timer isn't active (or even in the
> +		 * timer_pending flag) and there is no need to cancel it.
> +		 */

Ah, kvm_xen_start_timer() zeros timer_pending.

Given that, shouldn't it be impossible for xen_timer_callback() to observe a
non-zero timer_pending value?  E.g. couldn't this code WARN?

	if (atomic_read(&vcpu->arch.xen.timer_pending))
		return HRTIMER_NORESTART;

Obviously not a blocker for this patch, I'm mostly just curious to know if I'm
missing something.
On Mon, 2023-10-02 at 10:00 -0700, Sean Christopherson wrote:
> >  	case KVM_XEN_VCPU_ATTR_TYPE_TIMER:
> > +		/*
> > +		 * Ensure a consistent snapshot of state is captured, with a
> > +		 * timer either being pending, or the event channel delivered
> > +		 * to the corresponding bit in the shared_info. Not still
> > +		 * lurking in the timer_pending flag for deferred delivery.
> > +		 * Purely as an optimisation, if the timer_expires field is
> > +		 * zero, that means the timer isn't active (or even in the
> > +		 * timer_pending flag) and there is no need to cancel it.
> > +		 */
>
> Ah, kvm_xen_start_timer() zeros timer_pending.
>
> Given that, shouldn't it be impossible for xen_timer_callback() to observe a
> non-zero timer_pending value?  E.g. couldn't this code WARN?
>
> 	if (atomic_read(&vcpu->arch.xen.timer_pending))
> 		return HRTIMER_NORESTART;
>
> Obviously not a blocker for this patch, I'm mostly just curious to know if I'm
> missing something.

Yes, I believe that is true.
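For illustration, the warning Sean is suggesting would look roughly like the
fragment below in xen_timer_callback() — a hypothetical sketch, not a patch
posted in this thread:

	/*
	 * Hypothetical sketch of Sean's suggestion: kvm_xen_start_timer()
	 * zeroes timer_pending before (re)arming the hrtimer, so a stale
	 * non-zero count here would indicate a bug rather than an expected
	 * race, and the early return could be promoted to a warning.
	 */
	if (WARN_ON_ONCE(atomic_read(&vcpu->arch.xen.timer_pending)))
		return HRTIMER_NORESTART;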
On Sat, 30 Sep 2023 14:58:35 +0100, David Woodhouse wrote:
> Most of the time there's no need to kick the vCPU and deliver the timer
> event through kvm_xen_inject_timer_irqs(). Use kvm_xen_set_evtchn_fast()
> directly from the timer callback, and only fall back to the slow path
> when it's necessary to do so.
>
> This gives a significant improvement in timer latency testing (using
> nanosleep() for various periods and then measuring the actual time
> elapsed).
>
> [...]

Applied to kvm-x86 xen, thanks!

[1/1] KVM: x86: Use fast path for Xen timer delivery
      https://github.com/kvm-x86/linux/commit/77c9b9dea4fb

--
https://github.com/kvm-x86/linux/tree/next
On Sat, Sep 30, 2023, David Woodhouse wrote:
> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> index 40edf4d1974c..75586da134b3 100644
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
> @@ -134,9 +134,23 @@ static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
>  {
>  	struct kvm_vcpu *vcpu = container_of(timer, struct kvm_vcpu,
>  					     arch.xen.timer);
> +	struct kvm_xen_evtchn e;
> +	int rc;
> +
>  	if (atomic_read(&vcpu->arch.xen.timer_pending))
>  		return HRTIMER_NORESTART;
>
> +	e.vcpu_id = vcpu->vcpu_id;
> +	e.vcpu_idx = vcpu->vcpu_idx;
> +	e.port = vcpu->arch.xen.timer_virq;
> +	e.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL;
> +
> +	rc = kvm_xen_set_evtchn_fast(&e, vcpu->kvm);
> +	if (rc != -EWOULDBLOCK) {
> +		vcpu->arch.xen.timer_expires = 0;
> +		return HRTIMER_NORESTART;
> +	}
> +
>  	atomic_inc(&vcpu->arch.xen.timer_pending);
>  	kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
>  	kvm_vcpu_kick(vcpu);
> @@ -146,6 +160,14 @@ static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
>
>  static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns)
>  {
> +	/*
> +	 * Avoid races with the old timer firing. Checking timer_expires
> +	 * to avoid calling hrtimer_cancel() will only have false positives
> +	 * so is fine.
> +	 */
> +	if (vcpu->arch.xen.timer_expires)
> +		hrtimer_cancel(&vcpu->arch.xen.timer);
> +
>  	atomic_set(&vcpu->arch.xen.timer_pending, 0);
>  	vcpu->arch.xen.timer_expires = guest_abs;
>
> @@ -1019,9 +1041,36 @@ int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
>  		break;
>
>  	case KVM_XEN_VCPU_ATTR_TYPE_TIMER:
> +		/*
> +		 * Ensure a consistent snapshot of state is captured, with a
> +		 * timer either being pending, or the event channel delivered
> +		 * to the corresponding bit in the shared_info. Not still
> +		 * lurking in the timer_pending flag for deferred delivery.
> +		 * Purely as an optimisation, if the timer_expires field is
> +		 * zero, that means the timer isn't active (or even in the
> +		 * timer_pending flag) and there is no need to cancel it.
> +		 */
> +		if (vcpu->arch.xen.timer_expires) {
> +			hrtimer_cancel(&vcpu->arch.xen.timer);
> +			kvm_xen_inject_timer_irqs(vcpu);

This has an obvious-in-hindsight recursive deadlock bug.  If KVM actually needs
to inject a timer IRQ, and the fast path fails, i.e. the gpc is invalid,
kvm_xen_set_evtchn() will attempt to acquire xen.xen_lock, which is already held.

Not sure if I sucked at testing before, or if I just got "lucky" on a random run.

  ============================================
  WARNING: possible recursive locking detected
  6.8.0-smp--5e10b4d51d77-drs #232 Tainted: G O
  --------------------------------------------
  xen_shinfo_test/250013 is trying to acquire lock:
  ffff938c9930cc30 (&kvm->arch.xen.xen_lock){+.+.}-{3:3}, at: kvm_xen_set_evtchn+0x74/0x170 [kvm]

  but task is already holding lock:
  ffff938c9930cc30 (&kvm->arch.xen.xen_lock){+.+.}-{3:3}, at: kvm_xen_vcpu_get_attr+0x38/0x250 [kvm]

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(&kvm->arch.xen.xen_lock);
    lock(&kvm->arch.xen.xen_lock);

   *** DEADLOCK ***

   May be due to missing lock nesting notation

  2 locks held by xen_shinfo_test/250013:
   #0: ffff9228863f21a8 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x8f/0x5b0 [kvm]
   #1: ffff938c9930cc30 (&kvm->arch.xen.xen_lock){+.+.}-{3:3}, at: kvm_xen_vcpu_get_attr+0x38/0x250 [kvm]

  stack backtrace:
  CPU: 128 PID: 250013 Comm: xen_shinfo_test Tainted: G O 6.8.0-smp--5e10b4d51d77-drs #232
  Hardware name: Google, Inc. Arcadia_IT_80/Arcadia_IT_80, BIOS 34.30.0 07/27/2023
  Call Trace:
   <TASK>
   dump_stack_lvl+0x69/0xa0
   dump_stack+0x14/0x20
   print_deadlock_bug+0x2af/0x2c0
   __lock_acquire+0x13f7/0x2e30
   lock_acquire+0xd4/0x220
   __mutex_lock+0x6a/0xa60
   mutex_lock_nested+0x1f/0x30
   kvm_xen_set_evtchn+0x74/0x170 [kvm]
   kvm_xen_vcpu_get_attr+0x136/0x250 [kvm]
   kvm_arch_vcpu_ioctl+0x942/0x1130 [kvm]
   kvm_vcpu_ioctl+0x484/0x5b0 [kvm]
   __se_sys_ioctl+0x7a/0xc0
   __x64_sys_ioctl+0x21/0x30
   do_syscall_64+0x82/0x160
   entry_SYSCALL_64_after_hwframe+0x63/0x6b
  RIP: 0033:0x460eab
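The recursive acquisition lockdep reports above corresponds to roughly the
following call chain, reconstructed from the report and the patch purely for
illustration:

	/*
	 * kvm_xen_vcpu_get_attr()                      takes kvm->arch.xen.xen_lock
	 *   case KVM_XEN_VCPU_ATTR_TYPE_TIMER:
	 *     hrtimer_cancel()
	 *     kvm_xen_inject_timer_irqs()              timer event still pending
	 *       kvm_xen_set_evtchn()                   fast path failed, gpc invalid
	 *         mutex_lock(&kvm->arch.xen.xen_lock)  already held -> deadlock
	 */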
On Tue, 2024-02-06 at 10:41 -0800, Sean Christopherson wrote:
>
> This has an obvious-in-hindsight recursive deadlock bug.  If KVM actually needs
> to inject a timer IRQ, and the fast path fails, i.e. the gpc is invalid,
> kvm_xen_set_evtchn() will attempt to acquire xen.xen_lock, which is already held

Hm, right. In fact, kvm_xen_set_evtchn() shouldn't actually *need* the
xen_lock in an ideal world; it's only taking it in order to work around
the fact that the gfn_to_pfn_cache doesn't have its *own* self-
sufficient locking. I have patches for that...

I think the *simplest* of the "patches for that" approaches is just to
use the gpc->refresh_lock to cover all activate, refresh and deactivate
calls. I was waiting for Paul's series to land before sending that one,
but I'll work on it today, and double-check my belief that we can then
just drop xen_lock from kvm_xen_set_evtchn().
On Tue, Feb 06, 2024, David Woodhouse wrote:
> On Tue, 2024-02-06 at 10:41 -0800, Sean Christopherson wrote:
> >
> > This has an obvious-in-hindsight recursive deadlock bug.  If KVM actually needs
> > to inject a timer IRQ, and the fast path fails, i.e. the gpc is invalid,
> > kvm_xen_set_evtchn() will attempt to acquire xen.xen_lock, which is already held
>
> Hm, right. In fact, kvm_xen_set_evtchn() shouldn't actually *need* the
> xen_lock in an ideal world; it's only taking it in order to work around
> the fact that the gfn_to_pfn_cache doesn't have its *own* self-
> sufficient locking. I have patches for that...
>
> I think the *simplest* of the "patches for that" approaches is just to
> use the gpc->refresh_lock to cover all activate, refresh and deactivate
> calls. I was waiting for Paul's series to land before sending that one,
> but I'll work on it today, and double-check my belief that we can then
> just drop xen_lock from kvm_xen_set_evtchn().

While I definitely want to get rid of arch.xen.xen_lock, I don't want to address
the deadlock by relying on adding more locking to the gpc code.  I want a teeny
tiny patch that is easy to review and backport.  Y'all are *probably* the only
folks that care about Xen emulation, but even so, that's not a valid reason for
taking a roundabout way to fixing a deadlock.

Can't we simply not take xen_lock in kvm_xen_vcpu_get_attr()?  It holds vcpu->mutex
so it's mutually exclusive with kvm_xen_vcpu_set_attr(), and I don't see any other
flows other than vCPU destruction that deactivate (or change) the gpc.  And the
worst case scenario is that if _userspace_ is being stupid, userspace gets a
stale GPA.

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 4b4e738c6f1b..50aa28b9ffc4 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -973,8 +973,6 @@ int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
 {
 	int r = -ENOENT;
 
-	mutex_lock(&vcpu->kvm->arch.xen.xen_lock);
-
 	switch (data->type) {
 	case KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO:
 		if (vcpu->arch.xen.vcpu_info_cache.active)
@@ -1083,7 +1081,6 @@ int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
 		break;
 	}
 
-	mutex_unlock(&vcpu->kvm->arch.xen.xen_lock);
 	return r;
 }

If that seems too risky, we could go with an ugly and hacky, but conservative:

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 4b4e738c6f1b..456d05c5b18a 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -1052,7 +1052,9 @@ int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
 		 */
 		if (vcpu->arch.xen.timer_expires) {
 			hrtimer_cancel(&vcpu->arch.xen.timer);
+			mutex_unlock(&vcpu->kvm->arch.xen.xen_lock);
 			kvm_xen_inject_timer_irqs(vcpu);
+			mutex_lock(&vcpu->kvm->arch.xen.xen_lock);
 		}
 
 		data->u.timer.port = vcpu->arch.xen.timer_virq;
On Tue, 2024-02-06 at 18:58 -0800, Sean Christopherson wrote:
> On Tue, Feb 06, 2024, David Woodhouse wrote:
> > On Tue, 2024-02-06 at 10:41 -0800, Sean Christopherson wrote:
> > >
> > > This has an obvious-in-hindsight recursive deadlock bug.  If KVM actually needs
> > > to inject a timer IRQ, and the fast path fails, i.e. the gpc is invalid,
> > > kvm_xen_set_evtchn() will attempt to acquire xen.xen_lock, which is already held
> >
> > Hm, right. In fact, kvm_xen_set_evtchn() shouldn't actually *need* the
> > xen_lock in an ideal world; it's only taking it in order to work around
> > the fact that the gfn_to_pfn_cache doesn't have its *own* self-
> > sufficient locking. I have patches for that...
> >
> > I think the *simplest* of the "patches for that" approaches is just to
> > use the gpc->refresh_lock to cover all activate, refresh and deactivate
> > calls. I was waiting for Paul's series to land before sending that one,
> > but I'll work on it today, and double-check my belief that we can then
> > just drop xen_lock from kvm_xen_set_evtchn().
>
> While I definitely want to get rid of arch.xen.xen_lock, I don't want to address
> the deadlock by relying on adding more locking to the gpc code.  I want a teeny
> tiny patch that is easy to review and backport.  Y'all are *probably* the only
> folks that care about Xen emulation, but even so, that's not a valid reason for
> taking a roundabout way to fixing a deadlock.

I strongly disagree. I get that you're reticent about fixing the gpc
locking, but what I'm proposing is absolutely *not* a 'roundabout way
to fixing a deadlock'. The kvm_xen_set_evtchn() function shouldn't
*need* that lock; it's only taking it because of the underlying problem
with the gpc itself, which needs its caller to do its locking for it.

The solution is not to do further gymnastics with the xen_lock.

> Can't we simply not take xen_lock in kvm_xen_vcpu_get_attr()?  It holds vcpu->mutex
> so it's mutually exclusive with kvm_xen_vcpu_set_attr(), and I don't see any other
> flows other than vCPU destruction that deactivate (or change) the gpc.

Maybe. Although with the gpc locking being incomplete, I'm extremely
concerned about something *implicitly* relying on the xen_lock. We
still need to fix the gpc to have self-contained locking.

I'll put something together and do some testing.
On Tue, Feb 06, 2024, David Woodhouse wrote:
> On Tue, 2024-02-06 at 18:58 -0800, Sean Christopherson wrote:
> > On Tue, Feb 06, 2024, David Woodhouse wrote:
> > > On Tue, 2024-02-06 at 10:41 -0800, Sean Christopherson wrote:
> > > >
> > > > This has an obvious-in-hindsight recursive deadlock bug.  If KVM actually needs
> > > > to inject a timer IRQ, and the fast path fails, i.e. the gpc is invalid,
> > > > kvm_xen_set_evtchn() will attempt to acquire xen.xen_lock, which is already held
> > >
> > > Hm, right. In fact, kvm_xen_set_evtchn() shouldn't actually *need* the
> > > xen_lock in an ideal world; it's only taking it in order to work around
> > > the fact that the gfn_to_pfn_cache doesn't have its *own* self-
> > > sufficient locking. I have patches for that...
> > >
> > > I think the *simplest* of the "patches for that" approaches is just to
> > > use the gpc->refresh_lock to cover all activate, refresh and deactivate
> > > calls. I was waiting for Paul's series to land before sending that one,
> > > but I'll work on it today, and double-check my belief that we can then
> > > just drop xen_lock from kvm_xen_set_evtchn().
> >
> > While I definitely want to get rid of arch.xen.xen_lock, I don't want to address
> > the deadlock by relying on adding more locking to the gpc code.  I want a teeny
> > tiny patch that is easy to review and backport.  Y'all are *probably* the only
> > folks that care about Xen emulation, but even so, that's not a valid reason for
> > taking a roundabout way to fixing a deadlock.
>
> I strongly disagree. I get that you're reticent about fixing the gpc
> locking, but what I'm proposing is absolutely *not* a 'roundabout way
> to fixing a deadlock'. The kvm_xen_set_evtchn() function shouldn't
> *need* that lock; it's only taking it because of the underlying problem
> with the gpc itself, which needs its caller to do its locking for it.
>
> The solution is not to do further gymnastics with the xen_lock.

I agree that's the long term solution, but I am not entirely confident that a big
overhaul is 6.9 material at this point.  Squeezing an overhaul into 6.8 (and if
we're being nitpicky, backporting to 6.7) is out of the question.
On Tue, 2024-02-06 at 20:28 -0800, Sean Christopherson wrote:
>
> I agree that's the long term solution, but I am not entirely confident that a big
> overhaul is 6.9 material at this point.  Squeezing an overhaul into 6.8 (and if
> we're being nitpicky, backporting to 6.7) is out of the question.

It actually ends up being really simple. We just lift ->refresh_lock
up and use it to protect all the activate/deactivate/refresh paths.

I have a patch in my tree; just need to put it through some testing.

(Currently stuck in a hotel room with a bunch of positive COVID tests,
unstable wifi and no decent screen, getting a bit fractious :)
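As a rough sketch of the idea David describes — letting the pfncache serialise
its own activate/refresh/deactivate paths with its existing ->refresh_lock —
the shape would be roughly as below. The __kvm_gpc_*() helpers are hypothetical
names standing in for the existing unlocked bodies; this is an illustration of
the approach, not the patch in David's tree.

	/*
	 * Illustrative sketch only: wrap the activate/deactivate bodies
	 * (renamed here as hypothetical __kvm_gpc_*() helpers) with the
	 * cache's own refresh_lock, so callers no longer need an external
	 * mutex such as kvm->arch.xen.xen_lock to keep the gpc stable.
	 */
	int kvm_gpc_activate(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned long len)
	{
		int ret;

		mutex_lock(&gpc->refresh_lock);
		ret = __kvm_gpc_activate(gpc, gpa, len);
		mutex_unlock(&gpc->refresh_lock);

		return ret;
	}

	void kvm_gpc_deactivate(struct gfn_to_pfn_cache *gpc)
	{
		mutex_lock(&gpc->refresh_lock);
		__kvm_gpc_deactivate(gpc);
		mutex_unlock(&gpc->refresh_lock);
	}

With something like that in place, kvm_xen_set_evtchn() could rely on the gpc's
internal locking and stop taking kvm->arch.xen.xen_lock, which is the lock the
deadlock above was reported on.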
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 40edf4d1974c..75586da134b3 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -134,9 +134,23 @@ static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
 {
 	struct kvm_vcpu *vcpu = container_of(timer, struct kvm_vcpu,
 					     arch.xen.timer);
+	struct kvm_xen_evtchn e;
+	int rc;
+
 	if (atomic_read(&vcpu->arch.xen.timer_pending))
 		return HRTIMER_NORESTART;
 
+	e.vcpu_id = vcpu->vcpu_id;
+	e.vcpu_idx = vcpu->vcpu_idx;
+	e.port = vcpu->arch.xen.timer_virq;
+	e.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL;
+
+	rc = kvm_xen_set_evtchn_fast(&e, vcpu->kvm);
+	if (rc != -EWOULDBLOCK) {
+		vcpu->arch.xen.timer_expires = 0;
+		return HRTIMER_NORESTART;
+	}
+
 	atomic_inc(&vcpu->arch.xen.timer_pending);
 	kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
 	kvm_vcpu_kick(vcpu);
@@ -146,6 +160,14 @@ static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
 
 static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns)
 {
+	/*
+	 * Avoid races with the old timer firing. Checking timer_expires
+	 * to avoid calling hrtimer_cancel() will only have false positives
+	 * so is fine.
+	 */
+	if (vcpu->arch.xen.timer_expires)
+		hrtimer_cancel(&vcpu->arch.xen.timer);
+
 	atomic_set(&vcpu->arch.xen.timer_pending, 0);
 	vcpu->arch.xen.timer_expires = guest_abs;
 
@@ -1019,9 +1041,36 @@ int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
 		break;
 
 	case KVM_XEN_VCPU_ATTR_TYPE_TIMER:
+		/*
+		 * Ensure a consistent snapshot of state is captured, with a
+		 * timer either being pending, or the event channel delivered
+		 * to the corresponding bit in the shared_info. Not still
+		 * lurking in the timer_pending flag for deferred delivery.
+		 * Purely as an optimisation, if the timer_expires field is
+		 * zero, that means the timer isn't active (or even in the
+		 * timer_pending flag) and there is no need to cancel it.
+		 */
+		if (vcpu->arch.xen.timer_expires) {
+			hrtimer_cancel(&vcpu->arch.xen.timer);
+			kvm_xen_inject_timer_irqs(vcpu);
+		}
+
 		data->u.timer.port = vcpu->arch.xen.timer_virq;
 		data->u.timer.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL;
 		data->u.timer.expires_ns = vcpu->arch.xen.timer_expires;
+
+		/*
+		 * The hrtimer may trigger and raise the IRQ immediately,
+		 * while the returned state causes it to be set up and
+		 * raised again on the destination system after migration.
+		 * That's fine, as the guest won't even have had a chance
+		 * to run and handle the interrupt. Asserting an already
+		 * pending event channel is idempotent.
+		 */
+		if (vcpu->arch.xen.timer_expires)
+			hrtimer_start_expires(&vcpu->arch.xen.timer,
+					      HRTIMER_MODE_ABS_HARD);
+
 		r = 0;
 		break;