Message ID | 20241012004857.218874-1-marex@denx.de (mailing list archive) |
---|---|
State | New |
Delegated to: | Manivannan Sadhasivam |
Headers | show |
Series | [RFC] PCI/PM: Do not RPM suspend devices without drivers | expand |
On 10/12/24 2:48 AM, Marek Vasut wrote: > I am sending this as RFC, because I can only trigger it sporadically on > Linux 6.6.54 , but I believe this was around for a while. The rationale > might also be far from perfect. This is NOT a proper fix for the issue. > > The pci_host_probe() and pci_bus_type RPM suspend seem to race against each > other at least in Linux kernel up to 6.6.54 . The problem occurs when the > PCIe host controller driver, in this case DWC i.MX6, is sufficiently delayed > by EPROBE_DEFER from one if its clocks, in this case the PCIe bus clock > provided by RS9 clock synthesizer driver which is compiled as a module and > loaded about a minute after boot. Once the RS9 module is loaded and the > bus clock become available, the probe of DWC iMX6 controller driver can > proceed. > > At that point, imx6_pcie_probe() triggers pci_host_probe(), while at the > same time, devices instantiated with pci_bus_type can already enter RPM > suspend via pci_bus_type pci_pm_runtime_idle() / pci_pm_runtime_suspend() > callbacks. > > The pci_host_probe() does reallocate BARs for devices which start up with > uninitialized BAR addresses set to 0 by calling pci_bus_assign_resources(), > which updates the device config space content. > > At the same time, pci_pm_runtime_suspend() triggers pci_save_state() for > all devices which do not have drivers assigned to them to store current > content of their config space registers. > > This leads to a race condition between pci_bus_assign_resources() and > pci_save_state(). In case pci_save_state() wins and gets called before > pci_bus_assign_resources(), the content stored by pci_save_state() is > the incorrect pre-pci_bus_assign_resources() content, which is usually > one with BARs set to invalid addresses and possibly other invalid > configuration. > > Once either a driver or manual RPM control attempts to start the device > up, that invalid content is restored into the device config space and > the device becomes inoperable. If the BARs are restored to zeroes, then > the device stops responding to BAR memory accesses, while it still does > respond to config space accesses. > > Work around the issue by not suspending pci_bus_type devices which do > not have driver assigned to them, keep those devices active to prevent > pci_save_state() from being called. Once a proper driver takes over, it > can RPM manage the device correctly. > > Invalid ordering and backtrace is below, visualized with this extra print > added to drivers/pci/setup-res.c : > > " > @@ -108,6 +108,8 @@ static void pci_std_update_resource(struct pci_dev *dev, int resno) > resno, new, check); > } > > + pci_err(dev, "BAR %d: updated (%#010x != %#010x)\n", resno, new, check); > + > if (res->flags & IORESOURCE_MEM_64) { > new = region.start >> 16 >> 16; > pci_write_config_dword(dev, reg + 4, new); > " > > " > [ 47.042906] pci 0000:01:00.0: save config 0x10: 0x00000004 > ... > [ 47.079863] pci 0000:01:00.0: BAR 0: updated (0x18100004 != 0x18100004) > ... > " > > " > [ 47.274095] pci_update_resource+0x1f0/0x260 > [ 47.278370] pci_assign_resource+0x22c/0x234 > [ 47.282643] assign_requested_resources_sorted+0x6c/0xac > [ 47.287959] __assign_resources_sorted+0xfc/0x424 > [ 47.292669] __pci_bus_assign_resources+0x68/0x1f4 > [ 47.297463] __pci_bus_assign_resources+0xec/0x1f4 > [ 47.302258] pci_bus_assign_resources+0x1c/0x24 > [ 47.306792] pci_host_probe+0x88/0xa4 > [ 47.310457] dw_pcie_host_init+0x17c/0x530 > [ 47.314560] imx6_pcie_probe+0x698/0x708 > [ 47.318487] platform_probe+0x6c/0xb8 > [ 47.322153] really_probe+0x140/0x278 > [ 47.325818] __driver_probe_device+0xf4/0x10c > [ 47.330177] driver_probe_device+0x40/0xf8 > [ 47.334277] __device_attach_driver+0x60/0xd4 > [ 47.338638] bus_for_each_drv+0xb4/0xdc > [ 47.342476] __device_attach_async_helper+0x78/0xcc > [ 47.347357] async_run_entry_fn+0x38/0xe0 > [ 47.351369] process_scheduled_works+0x1cc/0x2b8 > [ 47.355991] worker_thread+0x214/0x25c > [ 47.359744] kthread+0xec/0xfc > [ 47.362804] ret_from_fork+0x10/0x20 > " > " > [ 47.575814] pci_save_state+0xcc/0x224 > [ 47.579567] pci_pm_runtime_suspend+0x44/0x16c > [ 47.584013] __rpm_callback+0x48/0x124 > [ 47.587764] rpm_callback+0x70/0x74 > [ 47.591254] rpm_suspend+0x26c/0x424 > [ 47.594831] rpm_idle+0x190/0x1c0 > [ 47.598149] pm_runtime_work+0x8c/0x9c > [ 47.601900] process_scheduled_works+0x1cc/0x2b8 > [ 47.606524] worker_thread+0x214/0x25c > [ 47.610278] kthread+0xec/0xfc > [ 47.613338] ret_from_fork+0x10/0x20 The backtraces are collected at the very end of pci_update_resource() and pci_save_state() using WARN_ON(), so the timestamps do not match, but at least they include the call stack how those functions were reached when this problem occurred .
On Sat, Oct 12, 2024 at 02:48:48AM +0200, Marek Vasut wrote: > The pci_host_probe() does reallocate BARs for devices which start up with > uninitialized BAR addresses set to 0 by calling pci_bus_assign_resources(), > which updates the device config space content. > > At the same time, pci_pm_runtime_suspend() triggers pci_save_state() for > all devices which do not have drivers assigned to them to store current > content of their config space registers. [...] > Work around the issue by not suspending pci_bus_type devices which do > not have driver assigned to them, keep those devices active to prevent > pci_save_state() from being called. Once a proper driver takes over, it > can RPM manage the device correctly. It sounds like you may want to acquire a runtime PM reference or disable runtime PM for the duration of the bus scan (or at least device scan) rather than the proposed workaround. Thanks, Lukas
On 10/13/24 1:03 PM, Lukas Wunner wrote: > On Sat, Oct 12, 2024 at 02:48:48AM +0200, Marek Vasut wrote: >> The pci_host_probe() does reallocate BARs for devices which start up with >> uninitialized BAR addresses set to 0 by calling pci_bus_assign_resources(), >> which updates the device config space content. >> >> At the same time, pci_pm_runtime_suspend() triggers pci_save_state() for >> all devices which do not have drivers assigned to them to store current >> content of their config space registers. > [...] >> Work around the issue by not suspending pci_bus_type devices which do >> not have driver assigned to them, keep those devices active to prevent >> pci_save_state() from being called. Once a proper driver takes over, it >> can RPM manage the device correctly. > > It sounds like you may want to acquire a runtime PM reference > or disable runtime PM for the duration of the bus scan (or at > least device scan) rather than the proposed workaround. I was hoping to get at least an confirmation that this issue really exists and that I'm not chasing some nonsense. Thoughts ?
On 10/26/2024 2:19 AM, Marek Vasut wrote: > On 10/13/24 1:03 PM, Lukas Wunner wrote: >> On Sat, Oct 12, 2024 at 02:48:48AM +0200, Marek Vasut wrote: >>> The pci_host_probe() does reallocate BARs for devices which start up >>> with >>> uninitialized BAR addresses set to 0 by calling >>> pci_bus_assign_resources(), >>> which updates the device config space content. >>> >>> At the same time, pci_pm_runtime_suspend() triggers pci_save_state() >>> for >>> all devices which do not have drivers assigned to them to store current >>> content of their config space registers. What exactly do you mean by "at the same time"? >> [...] >>> Work around the issue by not suspending pci_bus_type devices which do >>> not have driver assigned to them, keep those devices active to prevent >>> pci_save_state() from being called. Once a proper driver takes over, it >>> can RPM manage the device correctly. >> >> It sounds like you may want to acquire a runtime PM reference >> or disable runtime PM for the duration of the bus scan (or at >> least device scan) rather than the proposed workaround. > I was hoping to get at least an confirmation that this issue really > exists and that I'm not chasing some nonsense. Thoughts ?
On 10/28/24 6:52 PM, Wysocki, Rafael J wrote: > On 10/26/2024 2:19 AM, Marek Vasut wrote: >> On 10/13/24 1:03 PM, Lukas Wunner wrote: >>> On Sat, Oct 12, 2024 at 02:48:48AM +0200, Marek Vasut wrote: >>>> The pci_host_probe() does reallocate BARs for devices which start up >>>> with >>>> uninitialized BAR addresses set to 0 by calling >>>> pci_bus_assign_resources(), >>>> which updates the device config space content. >>>> >>>> At the same time, pci_pm_runtime_suspend() triggers pci_save_state() >>>> for >>>> all devices which do not have drivers assigned to them to store current >>>> content of their config space registers. > > What exactly do you mean by "at the same time"? I mean these two blocks of code run in parallel and likely race each other.
On Mon, Oct 28, 2024 at 7:58 PM Marek Vasut <marex@denx.de> wrote: > > On 10/28/24 6:52 PM, Wysocki, Rafael J wrote: > > On 10/26/2024 2:19 AM, Marek Vasut wrote: > >> On 10/13/24 1:03 PM, Lukas Wunner wrote: > >>> On Sat, Oct 12, 2024 at 02:48:48AM +0200, Marek Vasut wrote: > >>>> The pci_host_probe() does reallocate BARs for devices which start up > >>>> with > >>>> uninitialized BAR addresses set to 0 by calling > >>>> pci_bus_assign_resources(), > >>>> which updates the device config space content. > >>>> > >>>> At the same time, pci_pm_runtime_suspend() triggers pci_save_state() > >>>> for > >>>> all devices which do not have drivers assigned to them to store current > >>>> content of their config space registers. > > > > What exactly do you mean by "at the same time"? > I mean these two blocks of code run in parallel and likely race each other. Which two blocks of code? I'm guessing one of them is pci_host_probe() and what's the other? Unbound PCI devices have their PM-runtime usage counters incremented at init time; see pci_pm_init(). This can be undone by user space if it changes their /sys/devices/.../power/control attributes to "auto", but in that case the device will be suspended immediately from control_store(). If pci_host_probe() can race against this (and it looks like it can from your problem description), it needs to call pm_runtime_get_sync() on each PCI device before accessing its registers.
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c index 35270172c8331..2c21ae4b15217 100644 --- a/drivers/pci/pci-driver.c +++ b/drivers/pci/pci-driver.c @@ -1378,7 +1378,7 @@ static int pci_pm_runtime_idle(struct device *dev) * always remain in D0 regardless of the runtime PM status */ if (!pci_dev->driver) - return 0; + return -EBUSY; if (pm && pm->runtime_idle) return pm->runtime_idle(dev);
I am sending this as RFC, because I can only trigger it sporadically on Linux 6.6.54 , but I believe this was around for a while. The rationale might also be far from perfect. This is NOT a proper fix for the issue. The pci_host_probe() and pci_bus_type RPM suspend seem to race against each other at least in Linux kernel up to 6.6.54 . The problem occurs when the PCIe host controller driver, in this case DWC i.MX6, is sufficiently delayed by EPROBE_DEFER from one if its clocks, in this case the PCIe bus clock provided by RS9 clock synthesizer driver which is compiled as a module and loaded about a minute after boot. Once the RS9 module is loaded and the bus clock become available, the probe of DWC iMX6 controller driver can proceed. At that point, imx6_pcie_probe() triggers pci_host_probe(), while at the same time, devices instantiated with pci_bus_type can already enter RPM suspend via pci_bus_type pci_pm_runtime_idle() / pci_pm_runtime_suspend() callbacks. The pci_host_probe() does reallocate BARs for devices which start up with uninitialized BAR addresses set to 0 by calling pci_bus_assign_resources(), which updates the device config space content. At the same time, pci_pm_runtime_suspend() triggers pci_save_state() for all devices which do not have drivers assigned to them to store current content of their config space registers. This leads to a race condition between pci_bus_assign_resources() and pci_save_state(). In case pci_save_state() wins and gets called before pci_bus_assign_resources(), the content stored by pci_save_state() is the incorrect pre-pci_bus_assign_resources() content, which is usually one with BARs set to invalid addresses and possibly other invalid configuration. Once either a driver or manual RPM control attempts to start the device up, that invalid content is restored into the device config space and the device becomes inoperable. If the BARs are restored to zeroes, then the device stops responding to BAR memory accesses, while it still does respond to config space accesses. Work around the issue by not suspending pci_bus_type devices which do not have driver assigned to them, keep those devices active to prevent pci_save_state() from being called. Once a proper driver takes over, it can RPM manage the device correctly. Invalid ordering and backtrace is below, visualized with this extra print added to drivers/pci/setup-res.c : " @@ -108,6 +108,8 @@ static void pci_std_update_resource(struct pci_dev *dev, int resno) resno, new, check); } + pci_err(dev, "BAR %d: updated (%#010x != %#010x)\n", resno, new, check); + if (res->flags & IORESOURCE_MEM_64) { new = region.start >> 16 >> 16; pci_write_config_dword(dev, reg + 4, new); " " [ 47.042906] pci 0000:01:00.0: save config 0x10: 0x00000004 ... [ 47.079863] pci 0000:01:00.0: BAR 0: updated (0x18100004 != 0x18100004) ... " " [ 47.274095] pci_update_resource+0x1f0/0x260 [ 47.278370] pci_assign_resource+0x22c/0x234 [ 47.282643] assign_requested_resources_sorted+0x6c/0xac [ 47.287959] __assign_resources_sorted+0xfc/0x424 [ 47.292669] __pci_bus_assign_resources+0x68/0x1f4 [ 47.297463] __pci_bus_assign_resources+0xec/0x1f4 [ 47.302258] pci_bus_assign_resources+0x1c/0x24 [ 47.306792] pci_host_probe+0x88/0xa4 [ 47.310457] dw_pcie_host_init+0x17c/0x530 [ 47.314560] imx6_pcie_probe+0x698/0x708 [ 47.318487] platform_probe+0x6c/0xb8 [ 47.322153] really_probe+0x140/0x278 [ 47.325818] __driver_probe_device+0xf4/0x10c [ 47.330177] driver_probe_device+0x40/0xf8 [ 47.334277] __device_attach_driver+0x60/0xd4 [ 47.338638] bus_for_each_drv+0xb4/0xdc [ 47.342476] __device_attach_async_helper+0x78/0xcc [ 47.347357] async_run_entry_fn+0x38/0xe0 [ 47.351369] process_scheduled_works+0x1cc/0x2b8 [ 47.355991] worker_thread+0x214/0x25c [ 47.359744] kthread+0xec/0xfc [ 47.362804] ret_from_fork+0x10/0x20 " " [ 47.575814] pci_save_state+0xcc/0x224 [ 47.579567] pci_pm_runtime_suspend+0x44/0x16c [ 47.584013] __rpm_callback+0x48/0x124 [ 47.587764] rpm_callback+0x70/0x74 [ 47.591254] rpm_suspend+0x26c/0x424 [ 47.594831] rpm_idle+0x190/0x1c0 [ 47.598149] pm_runtime_work+0x8c/0x9c [ 47.601900] process_scheduled_works+0x1cc/0x2b8 [ 47.606524] worker_thread+0x214/0x25c [ 47.610278] kthread+0xec/0xfc [ 47.613338] ret_from_fork+0x10/0x20 " Trigger: " $ modprobe clk_renesas_pcie $ echo on > /sys/devices/platform/soc\@0/33800000.pcie/pci0000\:00/0000\:00\:00.0/0000\:01\:00.0/power/control " Useful hint in the BAR0 direction: https://forums.developer.nvidia.com/t/intel-9260-wifi-on-jetson-nano-jetbot/73360/46 Signed-off-by: Marek Vasut <marex@denx.de> --- Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: linux-pci@vger.kernel.org --- drivers/pci/pci-driver.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)