Message ID | 20240904072541.8746-2-pstanner@redhat.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Krzysztof WilczyĆski |
Headers | show |
Series | PCI: Fix might_sleep() lockdep warning in pcim_intx() | expand |
On Wed, 4 Sep 2024 09:25:42 +0200 Philipp Stanner <pstanner@redhat.com> wrote: > commit 25216afc9db5 ("PCI: Add managed pcim_intx()") moved the > allocation step for pci_intx()'s device resource from > pcim_enable_device() to pcim_intx(). As before, pcim_enable_device() > sets pci_dev.is_managed to true; and it is never set to false again. > > Under some circumstances it can happen that drivers share a struct > pci_dev. If one driver uses pcim_enable_device() and the other doesn't, > this causes the other driver to run into managed pcim_intx(), which will > try to allocate when called for the first time. I don't think "share" is the correct way to describe this. The struct pci_dev is never shared between drivers, it's more of a lifecycle issue. The struct pci_dev persists for the life of the device, but is_managed is relative to the driver that is currently bound to the device. The issue is that the devres infrastructure doesn't clear this flag when the managed driver that set it releases the struct pci_dev. As we discussed in the other thread, this is a latent issue that has existed for some time, but it's only when pcim_intx() starts allocating memory, as introduced in the Fixes commit below, that the calling requirements for pcim_intx() are narrowed and become incompatible (and visible via lockdep) with existing drivers. > Allocations might sleep, so calling pci_intx() while holding spinlocks > becomes then invalid, which causes lockdep warnings and could cause > deadlocks. > > Have pcim_enable_device()'s release function, pcim_disable_device(), set > pci_dev.is_managed to false so that subsequent drivers using the same > struct pci_dev do not run implicitly into managed code. > > Fixes: 25216afc9db5 ("PCI: Add managed pcim_intx()") > Reported-by: Alex Williamson <alex.williamson@redhat.com> > Closes: https://lore.kernel.org/all/20240903094431.63551744.alex.williamson@redhat.com/ > Signed-off-by: Philipp Stanner <pstanner@redhat.com> > --- > @Alex: > Please have a look whether this fixes your issue. Yes, this works for me. Tested-by: Alex Williamson <alex.williamson@redhat.com> > Might also be cool to include your lockdep warning in the commit > message, if you can provide that. See below. Thanks, Alex ======================================================== WARNING: possible irq lock inversion dependency detected 6.11.0-rc6+ #59 Tainted: G W -------------------------------------------------------- CPU 0/KVM/1537 just changed the state of lock: ffffa0f0cff965f0 (&vdev->irqlock){-...}-{2:2}, at: vfio_intx_handler+0x21/0xd0 [vfio_pci_core] but this lock took another, HARDIRQ-unsafe lock in the past: (fs_reclaim){+.+.}-{0:0} and interrupts could create inverse lock ordering between them. other info that might help us debug this: Possible interrupt unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); local_irq_disable(); lock(&vdev->irqlock); lock(fs_reclaim); <Interrupt> lock(&vdev->irqlock); *** DEADLOCK *** 1 lock held by CPU 0/KVM/1537: #0: ffffa0f0eaffc3b0 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x86/0x9a0 [kvm] the shortest dependencies between 2nd lock and 1st lock: -> (fs_reclaim){+.+.}-{0:0} { HARDIRQ-ON-W at: lock_acquire+0xba/0x2c0 fs_reclaim_acquire+0x99/0xf0 __kmalloc_node_noprof+0x93/0x470 alloc_cpumask_var_node+0x21/0x30 smp_prepare_cpus_common+0x90/0x140 native_smp_prepare_cpus+0xa/0xc0 kernel_init_freeable+0x23d/0x5a0 kernel_init+0x16/0x1d0 ret_from_fork+0x2d/0x50 ret_from_fork_asm+0x1a/0x30 SOFTIRQ-ON-W at: lock_acquire+0xba/0x2c0 fs_reclaim_acquire+0x99/0xf0 __kmalloc_node_noprof+0x93/0x470 alloc_cpumask_var_node+0x21/0x30 smp_prepare_cpus_common+0x90/0x140 native_smp_prepare_cpus+0xa/0xc0 kernel_init_freeable+0x23d/0x5a0 kernel_init+0x16/0x1d0 ret_from_fork+0x2d/0x50 ret_from_fork_asm+0x1a/0x30 INITIAL USE at: lock_acquire+0xba/0x2c0 fs_reclaim_acquire+0x99/0xf0 __kmalloc_node_noprof+0x93/0x470 alloc_cpumask_var_node+0x21/0x30 smp_prepare_cpus_common+0x90/0x140 native_smp_prepare_cpus+0xa/0xc0 kernel_init_freeable+0x23d/0x5a0 kernel_init+0x16/0x1d0 ret_from_fork+0x2d/0x50 ret_from_fork_asm+0x1a/0x30 } ... key at: [<ffffffffb2250300>] __fs_reclaim_map+0x0/0x28 ... acquired at: fs_reclaim_acquire+0x99/0xf0 __kmalloc_node_track_caller_noprof+0x8f/0x460 __devres_alloc_node+0x42/0x80 pcim_intx+0xb6/0xe0 pci_intx+0x67/0x80 __vfio_pci_intx_mask+0x70/0xe0 [vfio_pci_core] vfio_pci_set_intx_mask+0x40/0x50 [vfio_pci_core] vfio_pci_core_ioctl+0x74e/0xf50 [vfio_pci_core] vfio_device_fops_unl_ioctl+0xa5/0x870 [vfio] __x64_sys_ioctl+0x90/0xd0 do_syscall_64+0x8e/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> (&vdev->irqlock){-...}-{2:2} { IN-HARDIRQ-W at: lock_acquire+0xba/0x2c0 _raw_spin_lock_irqsave+0x3b/0x60 vfio_intx_handler+0x21/0xd0 [vfio_pci_core] __handle_irq_event_percpu+0x81/0x260 handle_irq_event+0x34/0x70 handle_fasteoi_irq+0x91/0x210 __common_interrupt+0x6f/0x140 common_interrupt+0xb3/0xd0 asm_common_interrupt+0x22/0x40 vmx_do_interrupt_irqoff+0x10/0x20 [kvm_intel] vmx_handle_exit_irqoff+0xc3/0x220 [kvm_intel] kvm_arch_vcpu_ioctl_run+0x12e9/0x1d20 [kvm] kvm_vcpu_ioctl+0x239/0x9a0 [kvm] __x64_sys_ioctl+0x90/0xd0 do_syscall_64+0x8e/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e INITIAL USE at: lock_acquire+0xba/0x2c0 _raw_spin_lock_irqsave+0x3b/0x60 __vfio_pci_intx_mask+0x30/0xe0 [vfio_pci_core] vfio_pci_set_intx_mask+0x40/0x50 [vfio_pci_core] vfio_pci_core_ioctl+0x74e/0xf50 [vfio_pci_core] vfio_device_fops_unl_ioctl+0xa5/0x870 [vfio] __x64_sys_ioctl+0x90/0xd0 do_syscall_64+0x8e/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e } ... key at: [<ffffffffc0a68800>] __key.90+0x0/0x10 [vfio_pci_core] ... acquired at: __lock_acquire+0x9b0/0x2130 lock_acquire+0xba/0x2c0 _raw_spin_lock_irqsave+0x3b/0x60 vfio_intx_handler+0x21/0xd0 [vfio_pci_core] __handle_irq_event_percpu+0x81/0x260 handle_irq_event+0x34/0x70 handle_fasteoi_irq+0x91/0x210 __common_interrupt+0x6f/0x140 common_interrupt+0xb3/0xd0 asm_common_interrupt+0x22/0x40 vmx_do_interrupt_irqoff+0x10/0x20 [kvm_intel] vmx_handle_exit_irqoff+0xc3/0x220 [kvm_intel] kvm_arch_vcpu_ioctl_run+0x12e9/0x1d20 [kvm] kvm_vcpu_ioctl+0x239/0x9a0 [kvm] __x64_sys_ioctl+0x90/0xd0 do_syscall_64+0x8e/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e stack backtrace: CPU: 4 UID: 107 PID: 1537 Comm: CPU 0/KVM Tainted: G W 6.11.0-rc6+ #59 Tainted: [W]=WARN Hardware name: HP HP ProDesk 600 G3 MT/829D, BIOS P02 Ver. 02.44 09/13/2022 Call Trace: <IRQ> dump_stack_lvl+0x62/0x90 mark_lock+0x3b7/0x960 __lock_acquire+0x9b0/0x2130 lock_acquire+0xba/0x2c0 ? vfio_intx_handler+0x21/0xd0 [vfio_pci_core] _raw_spin_lock_irqsave+0x3b/0x60 ? vfio_intx_handler+0x21/0xd0 [vfio_pci_core] vfio_intx_handler+0x21/0xd0 [vfio_pci_core] __handle_irq_event_percpu+0x81/0x260 handle_irq_event+0x34/0x70 handle_fasteoi_irq+0x91/0x210 __common_interrupt+0x6f/0x140 common_interrupt+0xb3/0xd0 </IRQ> <TASK> asm_common_interrupt+0x22/0x40 RIP: 0010:vmx_do_interrupt_irqoff+0x10/0x20 [kvm_intel] Code: ff ff ff 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5 48 83 e4 f0 6a 18 55 9c 6a 10 ff d7 <0f> 1f 00 48 89 ec 5d c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90 RSP: 0018:ffffbc4d8231ba78 EFLAGS: 00000082 RAX: 0000000000000240 RBX: ffffa0f0eaffc300 RCX: 0000000080000000 RDX: ffffffff00000000 RSI: 0000000080000022 RDI: ffffffffb1000240 RBP: ffffbc4d8231ba78 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffffa0f0ea835000 R13: ffffa0f0e0d00000 R14: ffffa0f0eaffc300 R15: 0000000000000000 ? irq_entries_start+0x10/0x660 vmx_handle_exit_irqoff+0xc3/0x220 [kvm_intel] kvm_arch_vcpu_ioctl_run+0x12e9/0x1d20 [kvm] ? kvm_vcpu_ioctl+0x239/0x9a0 [kvm] kvm_vcpu_ioctl+0x239/0x9a0 [kvm] ? find_held_lock+0x2b/0x80 __x64_sys_ioctl+0x90/0xd0 do_syscall_64+0x8e/0x180 ? lockdep_hardirqs_on+0x77/0x100 ? do_syscall_64+0x9a/0x180 ? vmx_vcpu_put+0xf6/0x270 [kvm_intel] ? kvm_arch_vcpu_put+0x191/0x270 [kvm] ? find_held_lock+0x2b/0x80 ? kvm_vcpu_ioctl+0x197/0x9a0 [kvm] ? lock_release+0x132/0x290 ? rcu_is_watching+0xd/0x40 ? kfree+0x231/0x360 ? __mutex_unlock_slowpath+0x2a/0x260 ? kvm_vcpu_ioctl+0x1a7/0x9a0 [kvm] ? find_held_lock+0x2b/0x80 ? user_return_notifier_unregister+0x3c/0x70 ? do_syscall_64+0x9a/0x180 ? lockdep_hardirqs_on+0x77/0x100 ? do_syscall_64+0x9a/0x180 ? syscall_exit_to_user_mode+0x1c2/0x2b0 ? do_syscall_64+0x9a/0x180 ? lockdep_hardirqs_on+0x77/0x100 ? do_syscall_64+0x9a/0x180 ? lockdep_hardirqs_on+0x77/0x100 ? do_syscall_64+0x9a/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7fd030fa13ed Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00 RSP: 002b:00007fd029bf6420 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 000055b4692ec540 RCX: 00007fd030fa13ed RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000018 RBP: 00007fd029bf6470 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007fd029c006c0 R13: ffffffffffff7a90 R14: 0000000000000000 R15: 00007ffe459ab200 </TASK>
diff --git a/drivers/pci/devres.c b/drivers/pci/devres.c index 3780a9f9ec00..c7affbbf73ab 100644 --- a/drivers/pci/devres.c +++ b/drivers/pci/devres.c @@ -483,6 +483,8 @@ static void pcim_disable_device(void *pdev_raw) if (!pdev->pinned) pci_disable_device(pdev); + + pdev->is_managed = false; } /**
commit 25216afc9db5 ("PCI: Add managed pcim_intx()") moved the allocation step for pci_intx()'s device resource from pcim_enable_device() to pcim_intx(). As before, pcim_enable_device() sets pci_dev.is_managed to true; and it is never set to false again. Under some circumstances it can happen that drivers share a struct pci_dev. If one driver uses pcim_enable_device() and the other doesn't, this causes the other driver to run into managed pcim_intx(), which will try to allocate when called for the first time. Allocations might sleep, so calling pci_intx() while holding spinlocks becomes then invalid, which causes lockdep warnings and could cause deadlocks. Have pcim_enable_device()'s release function, pcim_disable_device(), set pci_dev.is_managed to false so that subsequent drivers using the same struct pci_dev do not run implicitly into managed code. Fixes: 25216afc9db5 ("PCI: Add managed pcim_intx()") Reported-by: Alex Williamson <alex.williamson@redhat.com> Closes: https://lore.kernel.org/all/20240903094431.63551744.alex.williamson@redhat.com/ Signed-off-by: Philipp Stanner <pstanner@redhat.com> --- @Alex: Please have a look whether this fixes your issue. Might also be cool to include your lockdep warning in the commit message, if you can provide that. P. --- drivers/pci/devres.c | 2 ++ 1 file changed, 2 insertions(+)