Message ID | 20200307172044.29645-1-stanspas@amazon.com (mailing list archive) |
---|---|
Series | Improve PCI device post-reset readiness polling |
On Sat, 2020-03-07 at 18:20 +0100, Stanislav Spassov wrote:
> From: Stanislav Spassov <stanspas@amazon.de>
>
> The first version of this patch series can be found here:
> https://lore.kernel.org/linux-pci/20200223122057.6504-1-stanspas@amazon.com
>
> The goal of this patch series is to solve an issue where pci_dev_wait
> can cause system crashes. After a reset, a hung device may keep
> responding with CRS completions indefinitely. If CRS Software Visibility
> is enabled on the Root Port, attempting to read any register other than
> PCI_VENDOR_ID will cause the Root Port to autonomously retry the request
> without reporting back to the CPU core. Unless the number of retries or
> the amount of time spent retrying is limited by platform-specific means,
> this scenario leads to low-level platform timeouts (such as a TOR
> Timeout), which can easily escalate to a crash.
>
> Feedback on the v1 inspired a lot of additional improvements all around the
> device reset codepaths and reducing post-reset delays. These improvements
> were published as part of v2 (v3 is just small build fixes).
>
> It looks like there is immediate demand specifically for the CRS work,
> so I am once again reducing the series to just that. The reset will be
> posted as a separate patch series that will likely require more time and
> iterations to stabilize.

Hm, what happened to this?

Bjorn?
On Fri, 2021-01-22 at 08:54 +0000, David Woodhouse wrote:
> On Sat, 2020-03-07 at 18:20 +0100, Stanislav Spassov wrote:
> > From: Stanislav Spassov <stanspas@amazon.de>
> >
> > The first version of this patch series can be found here:
> > https://lore.kernel.org/linux-pci/20200223122057.6504-1-stanspas@amazon.com
> >
> > The goal of this patch series is to solve an issue where pci_dev_wait
> > can cause system crashes. After a reset, a hung device may keep
> > responding with CRS completions indefinitely. If CRS Software Visibility
> > is enabled on the Root Port, attempting to read any register other than
> > PCI_VENDOR_ID will cause the Root Port to autonomously retry the request
> > without reporting back to the CPU core. Unless the number of retries or
> > the amount of time spent retrying is limited by platform-specific means,
> > this scenario leads to low-level platform timeouts (such as a TOR
> > Timeout), which can easily escalate to a crash.
> >
> > Feedback on the v1 inspired a lot of additional improvements all around the
> > device reset codepaths and reducing post-reset delays. These improvements
> > were published as part of v2 (v3 is just small build fixes).
> >
> > It looks like there is immediate demand specifically for the CRS work,
> > so I am once again reducing the series to just that. The reset will be
> > posted as a separate patch series that will likely require more time and
> > iterations to stabilize.
>
> Hm, what happened to this?
>
> Bjorn?

Ping?
From: Stanislav Spassov <stanspas@amazon.de>

The first version of this patch series can be found here:
https://lore.kernel.org/linux-pci/20200223122057.6504-1-stanspas@amazon.com

The goal of this patch series is to solve an issue where pci_dev_wait()
can cause system crashes. After a reset, a hung device may keep
responding with CRS completions indefinitely. If CRS Software Visibility
is enabled on the Root Port, attempting to read any register other than
PCI_VENDOR_ID will cause the Root Port to autonomously retry the request
without reporting back to the CPU core. Unless the number of retries or
the amount of time spent retrying is limited by platform-specific means,
this scenario leads to low-level platform timeouts (such as a TOR
Timeout), which can easily escalate to a crash.

Feedback on v1 inspired a lot of additional improvements all around the
device reset codepaths, as well as reductions in post-reset delays. These
improvements were published as part of v2 (v3 is just small build fixes).

It looks like there is immediate demand specifically for the CRS work,
so I am once again reducing the series to just that. The rest will be
posted as a separate patch series that will likely require more time and
iterations to stabilize.

Changes since v3:
- In pci_dev_wait(), added "timeout -= waited" to account for the time
  spent polling PCI_VENDOR_ID before falling back to polling PCI_COMMAND
  if device readiness could not be positively established via CRS (i.e.,
  if we stopped receiving CRS completions but did not receive a valid
  vendor ID due to dealing with an SR-IOV VF, or due to a different
  error)
- Simplified the commit message of "PCI: Add CRS handling to
  pci_dev_wait()" to avoid confusion as to when Root Ports will
  autonomously retry requests that resulted in CRS completions.
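For readers who want the two-phase polling and the "timeout -= waited"
accounting spelled out, here is a rough user-space sketch. It is not the
code from the patches: the device model (struct sim_dev), the sim_*
helpers, the SIM_* constants and the simulated millisecond clock are all
made up for illustration, and the backoff is simplified. The only facts it
relies on are that a Vendor ID read returns 0x0001 while CRS is being
reported with CRS Software Visibility enabled, and that an SR-IOV VF reads
its Vendor ID as 0xffff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constants for this simulation, not kernel definitions. */
#define SIM_CRS_VENDOR_ID 0x0001u /* Vendor ID value seen while CRS is reported */
#define SIM_ALL_ONES      0xffffu /* failed read, or the Vendor ID of an SR-IOV VF */

struct sim_dev {
	int crs_ms;   /* CRS completions stop at this simulated time */
	int ready_ms; /* config space is usable from this simulated time */
	bool is_vf;   /* VFs never return a valid Vendor ID */
	int now_ms;   /* simulated clock, advanced by sim_sleep() */
};

static uint16_t sim_read_vendor(const struct sim_dev *d)
{
	if (d->now_ms < d->crs_ms)
		return SIM_CRS_VENDOR_ID;
	if (d->is_vf || d->now_ms < d->ready_ms)
		return SIM_ALL_ONES;
	return 0x1d0f; /* arbitrary valid-looking vendor ID */
}

static uint16_t sim_read_command(const struct sim_dev *d)
{
	/* Reading anything but Vendor ID during CRS is the case to avoid. */
	assert(d->now_ms >= d->crs_ms);
	return d->now_ms < d->ready_ms ? SIM_ALL_ONES : 0x0000;
}

/* Sleep with exponential backoff, never past the remaining budget. */
static void sim_sleep(struct sim_dev *d, int *delay, int *waited, int timeout_ms)
{
	int left = timeout_ms - *waited;
	int step = *delay < left ? *delay : left;

	d->now_ms += step;
	*waited += step;
	*delay *= 2;
}

/* Returns true if the simulated device became ready within timeout_ms. */
static bool sim_dev_wait(struct sim_dev *d, int timeout_ms)
{
	int delay = 1, waited = 0;
	uint16_t id;

	/* Phase 1: poll Vendor ID for as long as CRS is reported. */
	while ((id = sim_read_vendor(d)) == SIM_CRS_VENDOR_ID) {
		if (waited >= timeout_ms)
			return false; /* still CRS: do not touch other registers */
		sim_sleep(d, &delay, &waited, timeout_ms);
	}
	if (id != SIM_ALL_ONES)
		return true; /* valid Vendor ID: readiness positively established */

	/* Phase 2: CRS stopped without a valid ID; charge phase 1 to the budget. */
	timeout_ms -= waited;
	delay = 1;
	waited = 0;
	while (sim_read_command(d) == SIM_ALL_ONES) {
		if (waited >= timeout_ms)
			return false;
		sim_sleep(d, &delay, &waited, timeout_ms);
	}
	return true;
}
```

In this model, a device that never stops returning CRS makes
sim_dev_wait() give up after the budget without ever reading the command
register, and a VF that stops returning CRS but stays unresponsive cannot
make the two phases together exceed the original timeout.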
Stanislav Spassov (3):
  PCI: Refactor polling loop out of pci_dev_wait
  PCI: Cache CRS Software Visibility in struct pci_dev
  PCI: Add CRS handling to pci_dev_wait()

 drivers/pci/pci.c   | 109 +++++++++++++++++++++++++++++++++++---------
 drivers/pci/probe.c |   8 +++-
 include/linux/pci.h |   3 ++
 3 files changed, 98 insertions(+), 22 deletions(-)

base-commit: bb6d3fb354c5ee8d6bde2d576eb7220ea09862b9