Message ID: 20240612181625.3604512-2-kbusch@meta.com (mailing list archive)
State: New
Delegated to: Bjorn Helgaas
Series: pcie hotplug and error fixes

On Wed, Jun 12, 2024 at 11:16:24AM -0700, Keith Busch wrote:
> PCIe hotplug events modify the topology in their IRQ thread once it can
> acquire the global pci_rescan_remove_lock.
>
> If a different removal event happens to acquire that lock first, and
> that removal event is for the parent device of the bridge processing the
> other hotplug event, then we are deadlocked: the parent removal will
> wait indefinitely on the child's IRQ thread because the parent is
> holding the global lock the child thread needs to make forward progress.

Yes, that's a known problem. I submitted a fix for it in 2018:

https://lore.kernel.org/all/4c882e25194ba8282b78fe963fec8faae7cf23eb.1529173804.git.lukas@wunner.de/

The patch I proposed was similar to yours, but was smaller and
confined to pciehp_pci.c. It was part of a larger series, and
when respinning that series I dropped the patch, which is the
reason it never got applied. I explained the rationale for
dropping it in this message:

https://lore.kernel.org/all/20180906162634.ylyp3ydwujf5wuxx@wunner.de/

Basically, all these proposals (both mine and yours) are not great
because they add another layer of duct tape without tackling the
underlying problem -- that pci_lock_rescan_remove() is way too
coarse-grained and needs to be replaced by finer-grained locking.
That, however, is a complex task on which we haven't made significant
forward progress in the last couple of years. Something else
always seemed more important.

So I don't really know what to say. I recognize it's a problem,
but I'm hesitant to endorse a duct tape fix. :(

In the second link above I mentioned that my approach doesn't solve
another race discovered by Xiongfeng Wang. pciehp has been refactored
considerably since then, so I'm not sure whether that particular issue
still exists...

Thanks,

Lukas

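The shape of the deadlock, reduced to a self-contained userspace model for
illustration (pthreads stand in for the removal path and the hotplug IRQ
thread; this is a sketch of the lock dependency, not kernel code):

#include <pthread.h>
#include <stdio.h>

/* Stand-in for the global pci_rescan_remove_lock. */
static pthread_mutex_t rescan_remove_lock = PTHREAD_MUTEX_INITIALIZER;

/* Models the child bridge's hotplug IRQ thread: it cannot touch the
 * topology until it owns the global lock. */
static void *child_irq_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&rescan_remove_lock);	/* blocks forever */
	printf("child: processing hotplug event\n");	/* never reached */
	pthread_mutex_unlock(&rescan_remove_lock);
	return NULL;
}

int main(void)
{
	pthread_t child;

	/* Models the parent removal: it takes the global lock first... */
	pthread_mutex_lock(&rescan_remove_lock);
	pthread_create(&child, NULL, child_irq_thread, NULL);

	/* ...then tears down the child bridge, which entails waiting for
	 * the child's IRQ thread to finish -- but that thread is blocked
	 * on the very lock we hold.  Deadlock. */
	pthread_join(child, NULL);

	pthread_mutex_unlock(&rescan_remove_lock);	/* never reached */
	return 0;
}
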
On Thu, Jun 27, 2024 at 09:47:05AM +0200, Lukas Wunner wrote:
> On Wed, Jun 12, 2024 at 11:16:24AM -0700, Keith Busch wrote:
> > PCIe hotplug events modify the topology in their IRQ thread once it can
> > acquire the global pci_rescan_remove_lock.
> >
> > If a different removal event happens to acquire that lock first, and
> > that removal event is for the parent device of the bridge processing the
> > other hotplug event, then we are deadlocked: the parent removal will
> > wait indefinitely on the child's IRQ thread because the parent is
> > holding the global lock the child thread needs to make forward progress.
>
> Yes, that's a known problem. I submitted a fix for it in 2018:
>
> https://lore.kernel.org/all/4c882e25194ba8282b78fe963fec8faae7cf23eb.1529173804.git.lukas@wunner.de/
>
> The patch I proposed was similar to yours, but was smaller and
> confined to pciehp_pci.c. It was part of a larger series, and
> when respinning that series I dropped the patch, which is the
> reason it never got applied. I explained the rationale for
> dropping it in this message:
>
> https://lore.kernel.org/all/20180906162634.ylyp3ydwujf5wuxx@wunner.de/
>
> Basically, all these proposals (both mine and yours) are not great
> because they add another layer of duct tape without tackling the
> underlying problem -- that pci_lock_rescan_remove() is way too
> coarse-grained and needs to be replaced by finer-grained locking.
> That, however, is a complex task on which we haven't made significant
> forward progress in the last couple of years. Something else
> always seemed more important.
>
> So I don't really know what to say. I recognize it's a problem,
> but I'm hesitant to endorse a duct tape fix. :(
>
> In the second link above I mentioned that my approach doesn't solve
> another race discovered by Xiongfeng Wang. pciehp has been refactored
> considerably since then, so I'm not sure whether that particular issue
> still exists...

Thanks for the comments. I agree the current locking scheme is the real
problem here. But I would still selfishly take a duct tape solution at
this point! :) There are so many direct pci_lock_rescan_remove() users
with hardware I don't have that it may be difficult to properly test a
significant change to the locking.

The sysfs race appears to still exist today. My patch wouldn't help
with that either.

On Thu, Jun 27, 2024 at 09:47:05AM +0200, Lukas Wunner wrote:
> On Wed, Jun 12, 2024 at 11:16:24AM -0700, Keith Busch wrote:
> > PCIe hotplug events modify the topology in their IRQ thread once it can
> > acquire the global pci_rescan_remove_lock.
> >
> > If a different removal event happens to acquire that lock first, and
> > that removal event is for the parent device of the bridge processing the
> > other hotplug event, then we are deadlocked: the parent removal will
> > wait indefinitely on the child's IRQ thread because the parent is
> > holding the global lock the child thread needs to make forward progress.
>
> Yes, that's a known problem. I submitted a fix for it in 2018:
>
> https://lore.kernel.org/all/4c882e25194ba8282b78fe963fec8faae7cf23eb.1529173804.git.lukas@wunner.de/
>
> The patch I proposed was similar to yours, but was smaller and
> confined to pciehp_pci.c. It was part of a larger series, and
> when respinning that series I dropped the patch, which is the
> reason it never got applied. I explained the rationale for
> dropping it in this message:
>
> https://lore.kernel.org/all/20180906162634.ylyp3ydwujf5wuxx@wunner.de/
>
> Basically, all these proposals (both mine and yours) are not great
> because they add another layer of duct tape without tackling the
> underlying problem -- that pci_lock_rescan_remove() is way too
> coarse-grained and needs to be replaced by finer-grained locking.
> That, however, is a complex task on which we haven't made significant
> forward progress in the last couple of years. Something else
> always seemed more important.

Thinking about this some more:

The problem is that pci_lock_rescan_remove() is a single global lock.

What if we introduce a lock at each bridge, or one for each pci_bus?
Before a portion of the hierarchy is removed, all locks in that
sub-hierarchy are acquired bottom-up.

I think that should avoid the deadlock. Thoughts?

Thanks,

Lukas

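A minimal sketch of that idea, with the caveat that both the helper name and
the per-bus mutex are assumptions -- struct pci_bus has no such member today:

/* Hypothetical: assumes a "struct mutex lock" member were added to
 * struct pci_bus. */
static void pci_bus_lock_subtree(struct pci_bus *bus)
{
	struct pci_dev *dev;

	/* Recurse into subordinate buses first so the deepest buses are
	 * locked before their ancestors, i.e. bottom-up. */
	list_for_each_entry(dev, &bus->devices, bus_list)
		if (dev->subordinate)
			pci_bus_lock_subtree(dev->subordinate);

	mutex_lock(&bus->lock);
}
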
On Mon, Nov 11, 2024 at 08:38:03AM +0100, Lukas Wunner wrote:
> Thinking about this some more:
>
> The problem is that pci_lock_rescan_remove() is a single global lock.
>
> What if we introduce a lock at each bridge, or one for each pci_bus?
> Before a portion of the hierarchy is removed, all locks in that
> sub-hierarchy are acquired bottom-up.
>
> I think that should avoid the deadlock. Thoughts?

I note that you attempted something similar back in July:

https://lore.kernel.org/all/20240722151936.1452299-9-kbusch@meta.com/

However, I'd suggest solving this differently:

Keep the pci_lock_rescan_remove() everywhere; don't add pci_lock_bus()
adjacent to it.

Instead, amend pci_lock_rescan_remove() to walk the sub-hierarchy
bottom-up and acquire all the bus locks. Obviously you'll have to amend
pci_lock_rescan_remove() to accept a pci_dev which is the bridge atop
the sub-hierarchy. (Or alternatively, the top-most pci_bus in the
sub-hierarchy.)

Thanks,

Lukas

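Under the same assumptions as the sketch above, the amended entry point might
look like this (the pci_dev parameter is the new part; the current kernel
function takes no arguments):

/* Hypothetical amended interface: the caller names the bridge atop the
 * sub-hierarchy it is about to rescan or remove, and every bus lock
 * below it is taken bottom-up. */
void pci_lock_rescan_remove(struct pci_dev *bridge)
{
	if (bridge->subordinate)
		pci_bus_lock_subtree(bridge->subordinate);
}
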
On Mon, Nov 11, 2024 at 09:00:18AM +0100, Lukas Wunner wrote:
> On Mon, Nov 11, 2024 at 08:38:03AM +0100, Lukas Wunner wrote:
> > Thinking about this some more:
> >
> > The problem is that pci_lock_rescan_remove() is a single global lock.
> >
> > What if we introduce a lock at each bridge, or one for each pci_bus?
> > Before a portion of the hierarchy is removed, all locks in that
> > sub-hierarchy are acquired bottom-up.
> >
> > I think that should avoid the deadlock. Thoughts?
>
> I note that you attempted something similar back in July:
>
> https://lore.kernel.org/all/20240722151936.1452299-9-kbusch@meta.com/
>
> However, I'd suggest solving this differently:
>
> Keep the pci_lock_rescan_remove() everywhere; don't add pci_lock_bus()
> adjacent to it.
>
> Instead, amend pci_lock_rescan_remove() to walk the sub-hierarchy
> bottom-up and acquire all the bus locks. Obviously you'll have to amend
> pci_lock_rescan_remove() to accept a pci_dev which is the bridge atop
> the sub-hierarchy. (Or alternatively, the top-most pci_bus in the
> sub-hierarchy.)

I don't think we can walk the bus hierarchy bottom-up without hitting
the same deadlock I'm trying to fix.

diff --git a/drivers/pci/hotplug/pciehp_pci.c b/drivers/pci/hotplug/pciehp_pci.c
index ad12515a4a121..ca6237b0732c8 100644
--- a/drivers/pci/hotplug/pciehp_pci.c
+++ b/drivers/pci/hotplug/pciehp_pci.c
@@ -34,9 +34,12 @@ int pciehp_configure_device(struct controller *ctrl)
 	struct pci_dev *dev;
 	struct pci_dev *bridge = ctrl->pcie->port;
 	struct pci_bus *parent = bridge->subordinate;
-	int num, ret = 0;
+	int num, ret;
 
-	pci_lock_rescan_remove();
+	ret = pci_trylock_rescan_remove(bridge);
+	if (!ret)
+		return -ENODEV;
+	ret = 0;
 
 	dev = pci_get_slot(parent, PCI_DEVFN(0, 0));
 	if (dev) {
@@ -93,6 +96,7 @@ void pciehp_unconfigure_device(struct controller *ctrl, bool presence)
 	struct pci_dev *dev, *temp;
 	struct pci_bus *parent = ctrl->pcie->port->subordinate;
 	u16 command;
+	int ret;
 
 	ctrl_dbg(ctrl, "%s: domain:bus:dev = %04x:%02x:00\n",
 		 __func__, pci_domain_nr(parent), parent->number);
@@ -100,7 +104,9 @@ void pciehp_unconfigure_device(struct controller *ctrl, bool presence)
 	if (!presence)
 		pci_walk_bus(parent, pci_dev_set_disconnected, NULL);
 
-	pci_lock_rescan_remove();
+	ret = pci_trylock_rescan_remove(parent->self);
+	if (!ret)
+		return;
 
 	/*
 	 * Stopping an SR-IOV PF device removes all the associated VFs,
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index fd44565c47562..f525490a02122 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -370,6 +370,7 @@ static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
 {
 	pci_dev_set_io_state(dev, pci_channel_io_perm_failure);
 	pci_doe_disconnected(dev);
+	pci_notify_disconnected();
 
 	return 0;
 }
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 5fbabb4e3425f..d2e19a1d1a45b 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -3302,6 +3302,7 @@ EXPORT_SYMBOL_GPL(pci_rescan_bus);
  * routines should always be executed under this mutex.
  */
 static DEFINE_MUTEX(pci_rescan_remove_lock);
+static DECLARE_WAIT_QUEUE_HEAD(pci_lock_wq);
 
 void pci_lock_rescan_remove(void)
 {
@@ -3309,12 +3310,35 @@ void pci_lock_rescan_remove(void)
 }
 EXPORT_SYMBOL_GPL(pci_lock_rescan_remove);
 
+/*
+ * pci_trylock_rescan_remove() - keep trying to take the lock until successful
+ * or notified the device is disconnected
+ *
+ * Returns 1 if the lock was successfully taken, 0 otherwise.
+ */
+bool pci_trylock_rescan_remove(struct pci_dev *dev)
+{
+	int ret;
+
+	wait_event(pci_lock_wq,
+		   (ret = mutex_trylock(&pci_rescan_remove_lock)) == 1 ||
+		   pci_dev_is_disconnected(dev));
+
+	return ret;
+}
+
 void pci_unlock_rescan_remove(void)
 {
 	mutex_unlock(&pci_rescan_remove_lock);
+	wake_up_all(&pci_lock_wq);
 }
 EXPORT_SYMBOL_GPL(pci_unlock_rescan_remove);
 
+void pci_notify_disconnected(void)
+{
+	wake_up_all(&pci_lock_wq);
+}
+
 static int __init pci_sort_bf_cmp(const struct device *d_a, const struct device *d_b)
 {
diff --git a/include/linux/pci.h b/include/linux/pci.h
index cafc5ab1cbcb4..05f293f7d8b1c 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1442,7 +1442,9 @@ void set_pcie_hotplug_bridge(struct pci_dev *pdev);
 unsigned int pci_rescan_bus_bridge_resize(struct pci_dev *bridge);
 unsigned int pci_rescan_bus(struct pci_bus *bus);
 void pci_lock_rescan_remove(void);
+bool pci_trylock_rescan_remove(struct pci_dev *dev);
 void pci_unlock_rescan_remove(void);
+void pci_notify_disconnected(void);
 
 /* Vital Product Data routines */
 ssize_t pci_read_vpd(struct pci_dev *dev, loff_t pos, size_t count, void *buf);
@@ -2072,6 +2074,8 @@ pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
 {
 	return -ENOSPC;
 }
+
+static inline void pci_notify_disconnected(void) { }
 #endif /* CONFIG_PCI */
 
 /* Include architecture-dependent settings and functions */
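
For illustration, the calling pattern the patch establishes -- a hedged
sketch, where example_hotplug_handler() is hypothetical but the two
primitives are the ones added in the diff above:

/* Follows the same shape as pciehp_configure_device() in the diff. */
static int example_hotplug_handler(struct pci_dev *bridge)
{
	/* Sleeps until the global lock is acquired -- or until the bridge
	 * is marked disconnected, in which case there is no topology left
	 * for this thread to update and it simply bails out. */
	if (!pci_trylock_rescan_remove(bridge))
		return -ENODEV;

	/* ... modify the topology below the bridge ... */

	pci_unlock_rescan_remove();	/* also wakes any remaining waiters */
	return 0;
}

The key to the scheme is that wait_event() re-evaluates its condition on
every wake-up, and both pci_unlock_rescan_remove() and
pci_notify_disconnected() kick the same wait queue, so a blocked IRQ thread
is guaranteed to notice its own disconnection.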