diff mbox series

pci: fix broadcom secondary bus reset handling

Message ID 20240501145118.2051595-1-kbusch@meta.com (mailing list archive)
State Changes Requested
Delegated to: Bjorn Helgaas
Headers show
Series pci: fix broadcom secondary bus reset handling | expand

Commit Message

Keith Busch May 1, 2024, 2:51 p.m. UTC
From: Keith Busch <kbusch@kernel.org>

After a link reset, the Broadcom / LSI PEX890xx PCIe Gen 5 Switch in synth
mode will temporarily insert a fake place-holder device, 1000 02b2, before
the link is actually active for the expected downstream device. Confirm
the device's identifier matches what we expect before moving forward.
Otherwise, the pciehp driver may unmask hotplug notifications before
the link is actually active, which triggers an undesired device removal.

Cc: Suganath Prabu S <suganath-prabu.subramani@broadcom.com>
Cc: Peter Delevoryas <pdel@meta.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/pci/pci.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

Comments

Bjorn Helgaas May 1, 2024, 7:55 p.m. UTC | #1
[+cc Lukas since you mentioned pciehp]

On Wed, May 01, 2024 at 07:51:18AM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> After a link reset, the Broadcom / LSI PEX890xx PCIe Gen 5 Switch in synth
> mode will temporarily insert a fake place-holder device, 1000 02b2, before
> the link is actually active for the expected downstream device. Confirm
> the device's identifier matches what we expect before moving forward.
> Otherwise, the pciehp driver may unmask hotplug notifications before
> the link is actually active, which triggers an undesired device removal.

Is this a Switch that doesn't conform to the PCIe spec, or is there
something wrong with the way we're looking for a CRS completion?

In the absence of a device defect, I would not expect to need a
Broacom-specific comment in this code.

> Cc: Suganath Prabu S <suganath-prabu.subramani@broadcom.com>
> Cc: Peter Delevoryas <pdel@meta.com>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
>  drivers/pci/pci.c | 16 +++++++++++-----
>  1 file changed, 11 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e5f243dd42884..4dc00f7411a94 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1255,6 +1255,7 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
>  	int delay = 1;
>  	bool retrain = false;
>  	struct pci_dev *bridge;
> +	u32 vid = dev->vendor | dev->device << 16;
>  
>  	if (pci_is_pcie(dev)) {
>  		bridge = pci_upstream_bridge(dev);
> @@ -1268,17 +1269,22 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
>  	 * responding to them with CRS completions.  The Root Port will
>  	 * generally synthesize ~0 (PCI_ERROR_RESPONSE) data to complete
>  	 * the read (except when CRS SV is enabled and the read was for the
> -	 * Vendor ID; in that case it synthesizes 0x0001 data).
> +	 * Vendor ID; in that case it synthesizes 0x0001 data, or if the device
> +	 * is downstream a Broadcom switch, which syntesizes a fake device)

s/syntesizes/synthesizes/

>  	 * Wait for the device to return a non-CRS completion.  Read the
> -	 * Command register instead of Vendor ID so we don't have to
> -	 * contend with the CRS SV value.
> +	 * Command register instead of Vendor ID so we don't have to contend
> +	 * with the CRS SV value. But, also read the Vendor and Device ID's
> +	 * to defeat Broadcom switch's placeholder device.

s/ID's/IDs/

>  	 */
>  	for (;;) {
> -		u32 id;
> +		u32 id, l;
>  
> +		pci_read_config_dword(dev, PCI_VENDOR_ID, &l);
>  		pci_read_config_dword(dev, PCI_COMMAND, &id);
> -		if (!PCI_POSSIBLE_ERROR(id))
> +
> +		if (!PCI_POSSIBLE_ERROR(id) && !PCI_POSSIBLE_ERROR(l) &&
> +		    l == vid)
>  			break;
>  
>  		if (delay > timeout) {
> -- 
> 2.43.0
>
Lukas Wunner May 1, 2024, 9:50 p.m. UTC | #2
On Wed, May 01, 2024 at 07:51:18AM -0700, Keith Busch wrote:
> After a link reset, the Broadcom / LSI PEX890xx PCIe Gen 5 Switch in synth
> mode will temporarily insert a fake place-holder device, 1000 02b2, before
> the link is actually active for the expected downstream device. Confirm
> the device's identifier matches what we expect before moving forward.
> Otherwise, the pciehp driver may unmask hotplug notifications before
> the link is actually active, which triggers an undesired device removal.

This won't work if the device was hot-swapped with a different one
and thus correctly returns a different Vendor/Device ID.  We'd wait
for the device to report the previous device's Vendor/Device ID,
which doesn't make sense.

It would be possible to raise d3cold_delay in struct pci_dev for
children of affected Broadcom switches.  Have you considered that
as a potential solution?

Thanks,

Lukas
Keith Busch May 2, 2024, 8:47 a.m. UTC | #3
On Wed, May 01, 2024 at 02:55:34PM -0500, Bjorn Helgaas wrote:
> On Wed, May 01, 2024 at 07:51:18AM -0700, Keith Busch wrote:
> > From: Keith Busch <kbusch@kernel.org>
> > 
> > After a link reset, the Broadcom / LSI PEX890xx PCIe Gen 5 Switch in synth
> > mode will temporarily insert a fake place-holder device, 1000 02b2, before
> > the link is actually active for the expected downstream device. Confirm
> > the device's identifier matches what we expect before moving forward.
> > Otherwise, the pciehp driver may unmask hotplug notifications before
> > the link is actually active, which triggers an undesired device removal.
> 
> Is this a Switch that doesn't conform to the PCIe spec, or is there
> something wrong with the way we're looking for a CRS completion?
> 
> In the absence of a device defect, I would not expect to need a
> Broacom-specific comment in this code.

Yeah, you're right. I started off quirking specific devices, then it
evolved to the more generic handling this turned into, but didn't update
the commit log or comments accordingly.
Keith Busch May 2, 2024, 8:56 a.m. UTC | #4
On Wed, May 01, 2024 at 11:50:47PM +0200, Lukas Wunner wrote:
> On Wed, May 01, 2024 at 07:51:18AM -0700, Keith Busch wrote:
> > After a link reset, the Broadcom / LSI PEX890xx PCIe Gen 5 Switch in synth
> > mode will temporarily insert a fake place-holder device, 1000 02b2, before
> > the link is actually active for the expected downstream device. Confirm
> > the device's identifier matches what we expect before moving forward.
> > Otherwise, the pciehp driver may unmask hotplug notifications before
> > the link is actually active, which triggers an undesired device removal.
> 
> This won't work if the device was hot-swapped with a different one
> and thus correctly returns a different Vendor/Device ID.  We'd wait
> for the device to report the previous device's Vendor/Device ID,
> which doesn't make sense.
> 
> It would be possible to raise d3cold_delay in struct pci_dev for
> children of affected Broadcom switches.  Have you considered that
> as a potential solution?

Good point, there's more paths I need to consider here. The path this is
addressing is through pciehp's reset_slot handling, which temporarily
disables the link change and presence detection. In the error scenario,
the secondary bus reset completes too quickly, which re-enables the
pciehp events before the downstream device has settled. Once it settles,
that triggers a Link Change/PDC event, then we lose our device.

I briefly considered a quirk for d3cold_delay, but I was hoping for
something more programatic than adding an arbitrary delay. That might be
okay though.
Keith Busch May 8, 2024, 9:03 p.m. UTC | #5
On Thu, May 02, 2024 at 09:47:31AM +0100, Keith Busch wrote:
> Yeah, you're right. I started off quirking specific devices, then it
> evolved to the more generic handling this turned into, but didn't update
> the commit log or comments accordingly.

Quick update, we're testing configurable options to modify the pcie port
behavior, so a kernel-side update may not happen if that is successful.
diff mbox series

Patch

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e5f243dd42884..4dc00f7411a94 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1255,6 +1255,7 @@  static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
 	int delay = 1;
 	bool retrain = false;
 	struct pci_dev *bridge;
+	u32 vid = dev->vendor | dev->device << 16;
 
 	if (pci_is_pcie(dev)) {
 		bridge = pci_upstream_bridge(dev);
@@ -1268,17 +1269,22 @@  static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
 	 * responding to them with CRS completions.  The Root Port will
 	 * generally synthesize ~0 (PCI_ERROR_RESPONSE) data to complete
 	 * the read (except when CRS SV is enabled and the read was for the
-	 * Vendor ID; in that case it synthesizes 0x0001 data).
+	 * Vendor ID; in that case it synthesizes 0x0001 data, or if the device
+	 * is downstream a Broadcom switch, which syntesizes a fake device)
 	 *
 	 * Wait for the device to return a non-CRS completion.  Read the
-	 * Command register instead of Vendor ID so we don't have to
-	 * contend with the CRS SV value.
+	 * Command register instead of Vendor ID so we don't have to contend
+	 * with the CRS SV value. But, also read the Vendor and Device ID's
+	 * to defeat Broadcom switch's placeholder device.
 	 */
 	for (;;) {
-		u32 id;
+		u32 id, l;
 
+		pci_read_config_dword(dev, PCI_VENDOR_ID, &l);
 		pci_read_config_dword(dev, PCI_COMMAND, &id);
-		if (!PCI_POSSIBLE_ERROR(id))
+
+		if (!PCI_POSSIBLE_ERROR(id) && !PCI_POSSIBLE_ERROR(l) &&
+		    l == vid)
 			break;
 
 		if (delay > timeout) {