Message ID | 20250401154727.835231-3-seanjc@google.com (mailing list archive)
---|---
State | New
Series | KVM: VMX: Fix lockdep false positive on PI wakeup
On Tue, 2025-04-01 at 08:47 -0700, Sean Christopherson wrote:
> From: Yan Zhao <yan.y.zhao@intel.com>
>
> Use a separate subclass when acquiring KVM's per-CPU posted interrupts
> wakeup lock in the scheduled out path, i.e. when adding a vCPU to the list
> of vCPUs to wake, to work around a false positive deadlock.
>
>   Chain exists of:
>     &p->pi_lock --> &rq->__lock --> &per_cpu(wakeup_vcpus_on_cpu_lock, cpu)
>
>   Possible unsafe locking scenario:
>
>         CPU0                          CPU1
>         ----                          ----
>    lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
>                                       lock(&rq->__lock);
>                                       lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
>    lock(&p->pi_lock);
>
>   *** DEADLOCK ***
>
> In the wakeup handler, the callchain is *always*:
>
>   sysvec_kvm_posted_intr_wakeup_ipi()
>   |
>   --> pi_wakeup_handler()
>       |
>       --> kvm_vcpu_wake_up()
>           |
>           --> try_to_wake_up(),
>
> and the lock order is:
>
>   &per_cpu(wakeup_vcpus_on_cpu_lock, cpu) --> &p->pi_lock.
>
> For the schedule out path, the callchain is always (for all intents and
> purposes; if the kernel is preemptible, kvm_sched_out() can be called from
> something other than schedule(), but the beginning of the callchain will
> be the same point in vcpu_block()):
>
>   vcpu_block()
>   |
>   --> schedule()
>       |
>       --> kvm_sched_out()
>           |
>           --> vmx_vcpu_put()
>               |
>               --> vmx_vcpu_pi_put()
>                   |
>                   --> pi_enable_wakeup_handler()
>
> and the lock order is:
>
>   &rq->__lock --> &per_cpu(wakeup_vcpus_on_cpu_lock, cpu)
>
> I.e. lockdep sees AB+BC ordering for schedule out, and CA ordering for
> wakeup, and complains about the A=>C versus C=>A inversion.  In practice,
> deadlock can't occur between schedule out and the wakeup handler as they
> are mutually exclusive.  The entirety of the schedule out code that runs
> with the problematic scheduler locks held does so with IRQs disabled,
> i.e. can't run concurrently with the wakeup handler.
>
> Use a subclass instead of disabling lockdep entirely, and tell lockdep that
> both subclasses are being acquired when loading a vCPU, as the sched_out
> and sched_in paths are NOT mutually exclusive, e.g.
>
>       CPU 0                 CPU 1
>   ---------------       ---------------
>   vCPU0 sched_out
>   vCPU1 sched_in
>   vCPU1 sched_out       vCPU0 sched_in
>
> where vCPU0's sched_in may race with vCPU1's sched_out, on CPU 0's wakeup
> list+lock.
>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/vmx/posted_intr.c | 32 +++++++++++++++++++++++++++++---
>  1 file changed, 29 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index 840d435229a8..51116fe69a50 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -31,6 +31,8 @@ static DEFINE_PER_CPU(struct list_head, wakeup_vcpus_on_cpu);
>   */
>  static DEFINE_PER_CPU(raw_spinlock_t, wakeup_vcpus_on_cpu_lock);
>
> +#define PI_LOCK_SCHED_OUT SINGLE_DEPTH_NESTING
> +
>  static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
>  {
>          return &(to_vmx(vcpu)->pi_desc);
> @@ -89,9 +91,20 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
>           * current pCPU if the task was migrated.
>           */
>          if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR) {
> -                raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> +                raw_spinlock_t *spinlock = &per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu);
> +
> +                /*
> +                 * In addition to taking the wakeup lock for the regular/IRQ
> +                 * context, tell lockdep it is being taken for the "sched out"
> +                 * context as well.  vCPU loads happens in task context, and
> +                 * this is taking the lock of the *previous* CPU, i.e. can race
> +                 * with both the scheduler and the wakeup handler.
> +                 */
> +                raw_spin_lock(spinlock);
> +                spin_acquire(&spinlock->dep_map, PI_LOCK_SCHED_OUT, 0, _RET_IP_);
>                  list_del(&vmx->pi_wakeup_list);
> -                raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> +                spin_release(&spinlock->dep_map, _RET_IP_);
> +                raw_spin_unlock(spinlock);
>          }
>
>          dest = cpu_physical_id(cpu);
> @@ -151,7 +164,20 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
>
>          lockdep_assert_irqs_disabled();
>
> -        raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> +        /*
> +         * Acquire the wakeup lock using the "sched out" context to workaround
> +         * a lockdep false positive.  When this is called, schedule() holds
> +         * various per-CPU scheduler locks.  When the wakeup handler runs, it
> +         * holds this CPU's wakeup lock while calling try_to_wake_up(), which
> +         * can eventually take the aforementioned scheduler locks, which causes
> +         * lockdep to assume there is deadlock.
> +         *
> +         * Deadlock can't actually occur because IRQs are disabled for the
> +         * entirety of the sched_out critical section, i.e. the wakeup handler
> +         * can't run while the scheduler locks are held.
> +         */
> +        raw_spin_lock_nested(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu),
> +                             PI_LOCK_SCHED_OUT);
>          list_add_tail(&vmx->pi_wakeup_list,
>                        &per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
>          raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
        Maxim Levitsky
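For readers tracing the lock-order argument in the changelog, the wakeup-IPI half of the reported inversion comes from the handler that walks the per-CPU wakeup list. The sketch below is reconstructed from the callchain in the changelog rather than copied from posted_intr.c, so details such as pi_test_on() and the exact loop body are best-effort assumptions; it only illustrates why lockdep records the ordering &per_cpu(wakeup_vcpus_on_cpu_lock, cpu) --> &p->pi_lock.

/*
 * Sketch of the wakeup handler path (IRQ context).  Reconstructed from the
 * changelog's callchain, not verbatim kernel source; headers omitted.
 */
void pi_wakeup_handler(void)
{
        int cpu = smp_processor_id();
        struct list_head *wakeup_list = &per_cpu(wakeup_vcpus_on_cpu, cpu);
        raw_spinlock_t *spinlock = &per_cpu(wakeup_vcpus_on_cpu_lock, cpu);
        struct vcpu_vmx *vmx;

        /* Plain acquire, i.e. lockdep subclass 0. */
        raw_spin_lock(spinlock);
        list_for_each_entry(vmx, wakeup_list, pi_wakeup_list) {
                /*
                 * kvm_vcpu_wake_up() -> try_to_wake_up() takes the target
                 * task's p->pi_lock, creating the C --> A edge lockdep sees.
                 */
                if (pi_test_on(&vmx->pi_desc))
                        kvm_vcpu_wake_up(&vmx->vcpu);
        }
        raw_spin_unlock(spinlock);
}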
On Tue, Apr 01, 2025 at 08:47:27AM -0700, Sean Christopherson wrote:
> From: Yan Zhao <yan.y.zhao@intel.com>
>
> Use a separate subclass when acquiring KVM's per-CPU posted interrupts
> wakeup lock in the scheduled out path, i.e. when adding a vCPU to the list
> of vCPUs to wake, to work around a false positive deadlock.
>
>   Chain exists of:
>     &p->pi_lock --> &rq->__lock --> &per_cpu(wakeup_vcpus_on_cpu_lock, cpu)
>
>   Possible unsafe locking scenario:
>
>         CPU0                          CPU1
>         ----                          ----
>    lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
>                                       lock(&rq->__lock);
>                                       lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
>    lock(&p->pi_lock);
>
>   *** DEADLOCK ***
>
> In the wakeup handler, the callchain is *always*:
>
>   sysvec_kvm_posted_intr_wakeup_ipi()
>   |
>   --> pi_wakeup_handler()
>       |
>       --> kvm_vcpu_wake_up()
>           |
>           --> try_to_wake_up(),
>
> and the lock order is:
>
>   &per_cpu(wakeup_vcpus_on_cpu_lock, cpu) --> &p->pi_lock.
>
> For the schedule out path, the callchain is always (for all intents and
> purposes; if the kernel is preemptible, kvm_sched_out() can be called from
> something other than schedule(), but the beginning of the callchain will
> be the same point in vcpu_block()):
>
>   vcpu_block()
>   |
>   --> schedule()
>       |
>       --> kvm_sched_out()
>           |
>           --> vmx_vcpu_put()
>               |
>               --> vmx_vcpu_pi_put()
>                   |
>                   --> pi_enable_wakeup_handler()
>
> and the lock order is:
>
>   &rq->__lock --> &per_cpu(wakeup_vcpus_on_cpu_lock, cpu)
>
> I.e. lockdep sees AB+BC ordering for schedule out, and CA ordering for
> wakeup, and complains about the A=>C versus C=>A inversion.  In practice,
> deadlock can't occur between schedule out and the wakeup handler as they
> are mutually exclusive.  The entirety of the schedule out code that runs
> with the problematic scheduler locks held does so with IRQs disabled,
> i.e. can't run concurrently with the wakeup handler.
>
> Use a subclass instead of disabling lockdep entirely, and tell lockdep that

Paolo initially recommended utilizing the subclass.
Do you think it's good to add his Suggested-by tag?

BTW: is it necessary to state the subclass assignment explicitly in the
patch msg? e.g.,

wakeup handler: subclass 0
sched_out: subclass 1
sched_in: subclasses 0 and 1

Aside from the minor nits, LGTM!
Thanks for polishing the patch and helping with the msg/comments :)

> both subclasses are being acquired when loading a vCPU, as the sched_out
> and sched_in paths are NOT mutually exclusive, e.g.
>
>       CPU 0                 CPU 1
>   ---------------       ---------------
>   vCPU0 sched_out
>   vCPU1 sched_in
>   vCPU1 sched_out       vCPU0 sched_in
>
> where vCPU0's sched_in may race with vCPU1's sched_out, on CPU 0's wakeup
> list+lock.
>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/vmx/posted_intr.c | 32 +++++++++++++++++++++++++++++---
>  1 file changed, 29 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index 840d435229a8..51116fe69a50 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -31,6 +31,8 @@ static DEFINE_PER_CPU(struct list_head, wakeup_vcpus_on_cpu);
>   */
>  static DEFINE_PER_CPU(raw_spinlock_t, wakeup_vcpus_on_cpu_lock);
>
> +#define PI_LOCK_SCHED_OUT SINGLE_DEPTH_NESTING
> +
>  static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
>  {
>          return &(to_vmx(vcpu)->pi_desc);
> @@ -89,9 +91,20 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
>           * current pCPU if the task was migrated.
>           */
>          if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR) {
> -                raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> +                raw_spinlock_t *spinlock = &per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu);
> +
> +                /*
> +                 * In addition to taking the wakeup lock for the regular/IRQ
> +                 * context, tell lockdep it is being taken for the "sched out"
> +                 * context as well.  vCPU loads happens in task context, and
> +                 * this is taking the lock of the *previous* CPU, i.e. can race
> +                 * with both the scheduler and the wakeup handler.
> +                 */
> +                raw_spin_lock(spinlock);
> +                spin_acquire(&spinlock->dep_map, PI_LOCK_SCHED_OUT, 0, _RET_IP_);
>                  list_del(&vmx->pi_wakeup_list);
> -                raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> +                spin_release(&spinlock->dep_map, _RET_IP_);
> +                raw_spin_unlock(spinlock);
>          }
>
>          dest = cpu_physical_id(cpu);
> @@ -151,7 +164,20 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
>
>          lockdep_assert_irqs_disabled();
>
> -        raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> +        /*
> +         * Acquire the wakeup lock using the "sched out" context to workaround
> +         * a lockdep false positive.  When this is called, schedule() holds
> +         * various per-CPU scheduler locks.  When the wakeup handler runs, it
> +         * holds this CPU's wakeup lock while calling try_to_wake_up(), which
> +         * can eventually take the aforementioned scheduler locks, which causes
> +         * lockdep to assume there is deadlock.
> +         *
> +         * Deadlock can't actually occur because IRQs are disabled for the
> +         * entirety of the sched_out critical section, i.e. the wakeup handler
> +         * can't run while the scheduler locks are held.
> +         */
> +        raw_spin_lock_nested(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu),
> +                             PI_LOCK_SCHED_OUT);
>          list_add_tail(&vmx->pi_wakeup_list,
>                        &per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
>          raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> --
> 2.49.0.472.ge94155a9ec-goog
>
On Wed, Apr 02, 2025, Yan Zhao wrote:
> On Tue, Apr 01, 2025 at 08:47:27AM -0700, Sean Christopherson wrote:
> > I.e. lockdep sees AB+BC ordering for schedule out, and CA ordering for
> > wakeup, and complains about the A=>C versus C=>A inversion.  In practice,
> > deadlock can't occur between schedule out and the wakeup handler as they
> > are mutually exclusive.  The entirety of the schedule out code that runs
> > with the problematic scheduler locks held does so with IRQs disabled,
> > i.e. can't run concurrently with the wakeup handler.
> >
> > Use a subclass instead of disabling lockdep entirely, and tell lockdep that
> Paolo initially recommended utilizing the subclass.
> Do you think it's good to add his Suggested-by tag?

Sure.

> BTW: is it necessary to state the subclass assignment explicitly in the
> patch msg? e.g.,
>
> wakeup handler: subclass 0
> sched_out: subclass 1
> sched_in: subclasses 0 and 1

Yeah, explicitly stating the effective rules would be helpful.

If those are the only issues, I'll just fix up the changelog when applying.
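To make the summary Yan asked for concrete, the mapping between acquisition sites and lockdep subclasses that the patch establishes can be sketched as follows. These are illustrative fragments pulled from the diff plus the assumed shape of the existing wakeup handler, not complete functions; they only restate the rule "wakeup handler: 0, sched_out: 1, sched_in: 0 and 1".

/* Wakeup handler (IRQ context): plain acquire, tracked as subclass 0. */
raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));

/*
 * sched_out (pi_enable_wakeup_handler(), IRQs disabled, scheduler locks
 * held): acquired as subclass 1, a.k.a. PI_LOCK_SCHED_OUT.
 */
raw_spin_lock_nested(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu),
                     PI_LOCK_SCHED_OUT);

/*
 * sched_in (vmx_vcpu_pi_load()): physically taken once as subclass 0, and
 * additionally annotated as subclass 1, because loading a vCPU can race
 * with both the wakeup handler and a sched_out touching the same
 * (previous) CPU's wakeup list.
 */
raw_spin_lock(spinlock);
spin_acquire(&spinlock->dep_map, PI_LOCK_SCHED_OUT, 0, _RET_IP_);
/* ... list_del(&vmx->pi_wakeup_list); ... */
spin_release(&spinlock->dep_map, _RET_IP_);
raw_spin_unlock(spinlock);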
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 840d435229a8..51116fe69a50 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -31,6 +31,8 @@ static DEFINE_PER_CPU(struct list_head, wakeup_vcpus_on_cpu);
  */
 static DEFINE_PER_CPU(raw_spinlock_t, wakeup_vcpus_on_cpu_lock);
 
+#define PI_LOCK_SCHED_OUT SINGLE_DEPTH_NESTING
+
 static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
 {
         return &(to_vmx(vcpu)->pi_desc);
@@ -89,9 +91,20 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
          * current pCPU if the task was migrated.
          */
         if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR) {
-                raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
+                raw_spinlock_t *spinlock = &per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu);
+
+                /*
+                 * In addition to taking the wakeup lock for the regular/IRQ
+                 * context, tell lockdep it is being taken for the "sched out"
+                 * context as well.  vCPU loads happens in task context, and
+                 * this is taking the lock of the *previous* CPU, i.e. can race
+                 * with both the scheduler and the wakeup handler.
+                 */
+                raw_spin_lock(spinlock);
+                spin_acquire(&spinlock->dep_map, PI_LOCK_SCHED_OUT, 0, _RET_IP_);
                 list_del(&vmx->pi_wakeup_list);
-                raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
+                spin_release(&spinlock->dep_map, _RET_IP_);
+                raw_spin_unlock(spinlock);
         }
 
         dest = cpu_physical_id(cpu);
@@ -151,7 +164,20 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
 
         lockdep_assert_irqs_disabled();
 
-        raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
+        /*
+         * Acquire the wakeup lock using the "sched out" context to workaround
+         * a lockdep false positive.  When this is called, schedule() holds
+         * various per-CPU scheduler locks.  When the wakeup handler runs, it
+         * holds this CPU's wakeup lock while calling try_to_wake_up(), which
+         * can eventually take the aforementioned scheduler locks, which causes
+         * lockdep to assume there is deadlock.
+         *
+         * Deadlock can't actually occur because IRQs are disabled for the
+         * entirety of the sched_out critical section, i.e. the wakeup handler
+         * can't run while the scheduler locks are held.
+         */
+        raw_spin_lock_nested(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu),
+                             PI_LOCK_SCHED_OUT);
         list_add_tail(&vmx->pi_wakeup_list,
                       &per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
         raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
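As a closing illustration, the subclass pattern applied by the diff above can be shown in a stripped-down, self-contained form. Everything prefixed with demo_ below is invented purely for illustration; only the spinlock/lockdep interfaces (DEFINE_RAW_SPINLOCK, raw_spin_lock_nested, SINGLE_DEPTH_NESTING, lockdep_assert_irqs_disabled) are real kernel APIs, and this is a sketch of the technique rather than a drop-in replacement for the patch.

#include <linux/spinlock.h>
#include <linux/lockdep.h>

/*
 * Hypothetical lock that, like the PI wakeup lock, is taken both from an IRQ
 * handler and, with IRQs disabled, from under scheduler-held locks.
 */
static DEFINE_RAW_SPINLOCK(demo_lock);

/* IRQ-context user: ordinary acquire, tracked by lockdep as subclass 0. */
static void demo_irq_handler(void)
{
        raw_spin_lock(&demo_lock);
        /* ... wake things up, may eventually take p->pi_lock ... */
        raw_spin_unlock(&demo_lock);
}

/*
 * Called with IRQs disabled and scheduler locks held.  Acquiring with
 * subclass 1 keeps lockdep from fusing this acquisition with the IRQ-context
 * one and reporting a false A=>C versus C=>A inversion, while ordering is
 * still checked within each subclass.
 */
static void demo_sched_out(void)
{
        lockdep_assert_irqs_disabled();

        raw_spin_lock_nested(&demo_lock, SINGLE_DEPTH_NESTING);
        /* ... add the vCPU-equivalent to the wakeup list ... */
        raw_spin_unlock(&demo_lock);
}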