diff mbox series

[RFC,1/3] PCI: Wait for device readiness with Configuration RRS

Message ID 20240827234848.4429-2-helgaas@kernel.org (mailing list archive)
State Accepted
Commit 3029763eb3f421854ca8ea6ea447106bf2caf83d
Delegated to: Bjorn Helgaas
Headers show
Series PCI: Use Configuration RRS to wait for device ready | expand

Commit Message

Bjorn Helgaas Aug. 27, 2024, 11:48 p.m. UTC
From: Bjorn Helgaas <bhelgaas@google.com>

After a device reset, delays are required before the device can
successfully complete config accesses.  PCIe r6.0, sec 6.6, specifies some
delays required before software can perform config accesses.  Devices that
require more time after those delays may respond to config accesses with
Configuration Request Retry Status (RRS) completions.

Callers of pci_dev_wait() are responsible for delays until the device can
respond to config accesses.  pci_dev_wait() waits any additional time until
the device can successfully complete config accesses.

Reading config space of devices that are not present or not ready typically
returns ~0 (PCI_ERROR_RESPONSE).  Previously we polled the Command register
until we got a value other than ~0.  This is sometimes a problem because
Root Complex handling of RRS completions may include several retries and
implementation-specific behavior that is invisible to software (see sec
2.3.2), so the exponential backoff in pci_dev_wait() may not work as
intended.

Linux enables Configuration RRS Software Visibility on all Root Ports that
support it.  If it is enabled, read the Vendor ID instead of the Command
register.  RRS completions cause immediate return of the 0x0001 reserved
Vendor ID value, so the pci_dev_wait() backoff works correctly.

When a read of Vendor ID eventually completes successfully by returning a
non-0x0001 value (the Vendor ID or 0xffff for VFs), the device should be
initialized and ready to respond to config requests.

For conventional PCI devices or devices below Root Ports that don't support
Configuration RRS Software Visibility, poll the Command register as before.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/pci.c   | 41 ++++++++++++++++++++++++++++-------------
 drivers/pci/pci.h   |  5 +++++
 drivers/pci/probe.c |  9 +++------
 include/linux/pci.h |  1 +
 4 files changed, 37 insertions(+), 19 deletions(-)

Comments

Lukas Wunner Aug. 28, 2024, 4:17 a.m. UTC | #1
On Tue, Aug 27, 2024 at 06:48:46PM -0500, Bjorn Helgaas wrote:
> @@ -1311,9 +1320,15 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
>  			return -ENOTTY;
>  		}
>  
> -		pci_read_config_dword(dev, PCI_COMMAND, &id);
> -		if (!PCI_POSSIBLE_ERROR(id))
> -			break;
> +		if (root && root->config_crs_sv) {
> +			pci_read_config_dword(dev, PCI_VENDOR_ID, &id);
> +			if (!pci_bus_crs_vendor_id(id))
> +				break;

There was an effort from Amazon back in 2020/2021 to improve CRS support:

https://lore.kernel.org/linux-pci/20200307172044.29645-1-stanspas@amazon.com/

One suggestion you raised in the subsequent discussion was to use a
16-bit (word) instead of a 32-bit (dword) read of the Vendor ID
register to avoid issues with devices that don't implement CRS SV
correctly:

https://lore.kernel.org/linux-pci/20210913160745.GA1329939@bjorn-Precision-5520/

I realize that pci_bus_crs_vendor_id() masks the Device ID bits,
so probably no biggie.  Just want to make sure all lessons learned
during previous discussions on this topic are considered. :)

Thanks,

Lukas
Bjorn Helgaas Aug. 28, 2024, 8:53 p.m. UTC | #2
[+cc Stanislav, Rajat]

On Wed, Aug 28, 2024 at 06:17:04AM +0200, Lukas Wunner wrote:
> On Tue, Aug 27, 2024 at 06:48:46PM -0500, Bjorn Helgaas wrote:
> > @@ -1311,9 +1320,15 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
> >  			return -ENOTTY;
> >  		}
> >  
> > -		pci_read_config_dword(dev, PCI_COMMAND, &id);
> > -		if (!PCI_POSSIBLE_ERROR(id))
> > -			break;
> > +		if (root && root->config_crs_sv) {
> > +			pci_read_config_dword(dev, PCI_VENDOR_ID, &id);
> > +			if (!pci_bus_crs_vendor_id(id))
> > +				break;
> 
> There was an effort from Amazon back in 2020/2021 to improve CRS support:
> 
> https://lore.kernel.org/linux-pci/20200307172044.29645-1-stanspas@amazon.com/

Thanks for reminding me of that, and I'm sorry that series didn't get
applied back then because it's very similar to this one.

> One suggestion you raised in the subsequent discussion was to use a
> 16-bit (word) instead of a 32-bit (dword) read of the Vendor ID
> register to avoid issues with devices that don't implement CRS SV
> correctly:
> 
> https://lore.kernel.org/linux-pci/20210913160745.GA1329939@bjorn-Precision-5520/

Thanks.  Reading that response, I don't understand my point about using
a 16-bit read.  I mentioned 89665a6a7140 ("PCI: Check only the Vendor
ID to identify Configuration Request Retry"), the commit log of which
points to http://lkml.kernel.org/r/4729FC36.3040000@gmail.com, which
documents several defective devices that have a Vendor ID of 0x0001.

E.g., the Realtek rtl8169 controller mentioned there has Vendor/Device
ID of [0001:8168].  Doing either a 16-bit read or a 32-bit read and
checking the low 16 bits would incorrectly treat that as a Config RRS
completion.

So it *looks* to me like we will time out after 60 seconds and
conclude the device never became ready:

  pci_scan_device(..., timeout=60*1000)
    pci_bus_read_dev_vendor_id
      pci_bus_generic_read_dev_vendor_id
        pci_bus_read_config_dword(PCI_VENDOR_ID, l)  # <--
        # *l == 0x81680001
        if (pci_bus_crs_vendor_id(*l))        # 0x81680001 & 0xffff = 0x0001
          pci_bus_wait_crs(..., timeout=60*1000)
            while (pci_bus_crs_vendor_id(*l)) {
              if (delay > timeout)
                return false;                 # device not ready
              pci_bus_read_config_dword(PCI_VENDOR_ID, l)
            }

That *might* be an argument for doing a 32-bit read and checking for
0xffff0001, since the spec requires all 1's in the extra bytes.  But
I'm not aware of any complaints about these broken devices with a
0x0001 Vendor ID, and the 89665a6a7140 commit log says there are also
defective devices that don't fill the extra bytes with all 1's.

So my inclination is to keep the current code that does a 32-bit read
and checks only the low 16 bits.

> I realize that pci_bus_crs_vendor_id() masks the Device ID bits,
> so probably no biggie.  Just want to make sure all lessons learned
> during previous discussions on this topic are considered. :)
Duc Dang Sept. 11, 2024, 12:42 a.m. UTC | #3
Tested-by: Duc Dang <ducdang@google.com>

On Tue, Aug 27, 2024 at 4:57 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> From: Bjorn Helgaas <bhelgaas@google.com>
>
> After a device reset, delays are required before the device can
> successfully complete config accesses.  PCIe r6.0, sec 6.6, specifies some
> delays required before software can perform config accesses.  Devices that
> require more time after those delays may respond to config accesses with
> Configuration Request Retry Status (RRS) completions.
>
> Callers of pci_dev_wait() are responsible for delays until the device can
> respond to config accesses.  pci_dev_wait() waits any additional time until
> the device can successfully complete config accesses.
>
> Reading config space of devices that are not present or not ready typically
> returns ~0 (PCI_ERROR_RESPONSE).  Previously we polled the Command register
> until we got a value other than ~0.  This is sometimes a problem because
> Root Complex handling of RRS completions may include several retries and
> implementation-specific behavior that is invisible to software (see sec
> 2.3.2), so the exponential backoff in pci_dev_wait() may not work as
> intended.
>
> Linux enables Configuration RRS Software Visibility on all Root Ports that
> support it.  If it is enabled, read the Vendor ID instead of the Command
> register.  RRS completions cause immediate return of the 0x0001 reserved
> Vendor ID value, so the pci_dev_wait() backoff works correctly.
>
> When a read of Vendor ID eventually completes successfully by returning a
> non-0x0001 value (the Vendor ID or 0xffff for VFs), the device should be
> initialized and ready to respond to config requests.
>
> For conventional PCI devices or devices below Root Ports that don't support
> Configuration RRS Software Visibility, poll the Command register as before.
>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
>  drivers/pci/pci.c   | 41 ++++++++++++++++++++++++++++-------------
>  drivers/pci/pci.h   |  5 +++++
>  drivers/pci/probe.c |  9 +++------
>  include/linux/pci.h |  1 +
>  4 files changed, 37 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e3a49f66982d..fc2ecb7fe081 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1283,7 +1283,9 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
>  {
>         int delay = 1;
>         bool retrain = false;
> -       struct pci_dev *bridge;
> +       struct pci_dev *root, *bridge;
> +
> +       root = pcie_find_root_port(dev);
>
>         if (pci_is_pcie(dev)) {
>                 bridge = pci_upstream_bridge(dev);
> @@ -1292,16 +1294,23 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
>         }
>
>         /*
> -        * After reset, the device should not silently discard config
> -        * requests, but it may still indicate that it needs more time by
> -        * responding to them with CRS completions.  The Root Port will
> -        * generally synthesize ~0 (PCI_ERROR_RESPONSE) data to complete
> -        * the read (except when CRS SV is enabled and the read was for the
> -        * Vendor ID; in that case it synthesizes 0x0001 data).
> +        * The caller has already waited long enough after a reset that the
> +        * device should respond to config requests, but it may respond
> +        * with Request Retry Status (RRS) if it needs more time to
> +        * initialize.
>          *
> -        * Wait for the device to return a non-CRS completion.  Read the
> -        * Command register instead of Vendor ID so we don't have to
> -        * contend with the CRS SV value.
> +        * If the device is below a Root Port with Configuration RRS
> +        * Software Visibility enabled, reading the Vendor ID returns a
> +        * special data value if the device responded with RRS.  Read the
> +        * Vendor ID until we get non-RRS status.
> +        *
> +        * If there's no Root Port or Configuration RRS Software Visibility
> +        * is not enabled, the device may still respond with RRS, but
> +        * hardware may retry the config request.  If no retries receive
> +        * Successful Completion, hardware generally synthesizes ~0
> +        * (PCI_ERROR_RESPONSE) data to complete the read.  Reading Vendor
> +        * ID for VFs and non-existent devices also returns ~0, so read the
> +        * Command register until it returns something other than ~0.
>          */
>         for (;;) {
>                 u32 id;
> @@ -1311,9 +1320,15 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
>                         return -ENOTTY;
>                 }
>
> -               pci_read_config_dword(dev, PCI_COMMAND, &id);
> -               if (!PCI_POSSIBLE_ERROR(id))
> -                       break;
> +               if (root && root->config_crs_sv) {
> +                       pci_read_config_dword(dev, PCI_VENDOR_ID, &id);
> +                       if (!pci_bus_crs_vendor_id(id))
> +                               break;
> +               } else {
> +                       pci_read_config_dword(dev, PCI_COMMAND, &id);
> +                       if (!PCI_POSSIBLE_ERROR(id))
> +                               break;
> +               }
>
>                 if (delay > timeout) {
>                         pci_warn(dev, "not ready %dms after %s; giving up\n",
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 79c8398f3938..fa1997bc2667 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -139,6 +139,11 @@ bool pci_bridge_d3_possible(struct pci_dev *dev);
>  void pci_bridge_d3_update(struct pci_dev *dev);
>  int pci_bridge_wait_for_secondary_bus(struct pci_dev *dev, char *reset_type);
>
> +static inline bool pci_bus_crs_vendor_id(u32 l)
> +{
> +       return (l & 0xffff) == PCI_VENDOR_ID_PCI_SIG;
> +}
> +
>  static inline void pci_wakeup_event(struct pci_dev *dev)
>  {
>         /* Wait 100 ms before the system can be put into a sleep state. */
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index b14b9876c030..b1615da9eb6b 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -1209,9 +1209,11 @@ static void pci_enable_crs(struct pci_dev *pdev)
>
>         /* Enable CRS Software Visibility if supported */
>         pcie_capability_read_word(pdev, PCI_EXP_RTCAP, &root_cap);
> -       if (root_cap & PCI_EXP_RTCAP_CRSVIS)
> +       if (root_cap & PCI_EXP_RTCAP_CRSVIS) {
>                 pcie_capability_set_word(pdev, PCI_EXP_RTCTL,
>                                          PCI_EXP_RTCTL_CRSSVE);
> +               pdev->config_crs_sv = 1;
> +       }
>  }
>
>  static unsigned int pci_scan_child_bus_extend(struct pci_bus *bus,
> @@ -2343,11 +2345,6 @@ struct pci_dev *pci_alloc_dev(struct pci_bus *bus)
>  }
>  EXPORT_SYMBOL(pci_alloc_dev);
>
> -static bool pci_bus_crs_vendor_id(u32 l)
> -{
> -       return (l & 0xffff) == PCI_VENDOR_ID_PCI_SIG;
> -}
> -
>  static bool pci_bus_wait_crs(struct pci_bus *bus, int devfn, u32 *l,
>                              int timeout)
>  {
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 4cf89a4b4cbc..121d8d94d6d0 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -371,6 +371,7 @@ struct pci_dev {
>                                            can be generated */
>         unsigned int    pme_poll:1;     /* Poll device's PME status bit */
>         unsigned int    pinned:1;       /* Whether this dev is pinned */
> +       unsigned int    config_crs_sv:1; /* Config CRS software visibility */
>         unsigned int    imm_ready:1;    /* Supports Immediate Readiness */
>         unsigned int    d1_support:1;   /* Low power state D1 is supported */
>         unsigned int    d2_support:1;   /* Low power state D2 is supported */
> --
> 2.34.1
>
diff mbox series

Patch

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e3a49f66982d..fc2ecb7fe081 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1283,7 +1283,9 @@  static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
 {
 	int delay = 1;
 	bool retrain = false;
-	struct pci_dev *bridge;
+	struct pci_dev *root, *bridge;
+
+	root = pcie_find_root_port(dev);
 
 	if (pci_is_pcie(dev)) {
 		bridge = pci_upstream_bridge(dev);
@@ -1292,16 +1294,23 @@  static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
 	}
 
 	/*
-	 * After reset, the device should not silently discard config
-	 * requests, but it may still indicate that it needs more time by
-	 * responding to them with CRS completions.  The Root Port will
-	 * generally synthesize ~0 (PCI_ERROR_RESPONSE) data to complete
-	 * the read (except when CRS SV is enabled and the read was for the
-	 * Vendor ID; in that case it synthesizes 0x0001 data).
+	 * The caller has already waited long enough after a reset that the
+	 * device should respond to config requests, but it may respond
+	 * with Request Retry Status (RRS) if it needs more time to
+	 * initialize.
 	 *
-	 * Wait for the device to return a non-CRS completion.  Read the
-	 * Command register instead of Vendor ID so we don't have to
-	 * contend with the CRS SV value.
+	 * If the device is below a Root Port with Configuration RRS
+	 * Software Visibility enabled, reading the Vendor ID returns a
+	 * special data value if the device responded with RRS.  Read the
+	 * Vendor ID until we get non-RRS status.
+	 *
+	 * If there's no Root Port or Configuration RRS Software Visibility
+	 * is not enabled, the device may still respond with RRS, but
+	 * hardware may retry the config request.  If no retries receive
+	 * Successful Completion, hardware generally synthesizes ~0
+	 * (PCI_ERROR_RESPONSE) data to complete the read.  Reading Vendor
+	 * ID for VFs and non-existent devices also returns ~0, so read the
+	 * Command register until it returns something other than ~0.
 	 */
 	for (;;) {
 		u32 id;
@@ -1311,9 +1320,15 @@  static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
 			return -ENOTTY;
 		}
 
-		pci_read_config_dword(dev, PCI_COMMAND, &id);
-		if (!PCI_POSSIBLE_ERROR(id))
-			break;
+		if (root && root->config_crs_sv) {
+			pci_read_config_dword(dev, PCI_VENDOR_ID, &id);
+			if (!pci_bus_crs_vendor_id(id))
+				break;
+		} else {
+			pci_read_config_dword(dev, PCI_COMMAND, &id);
+			if (!PCI_POSSIBLE_ERROR(id))
+				break;
+		}
 
 		if (delay > timeout) {
 			pci_warn(dev, "not ready %dms after %s; giving up\n",
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 79c8398f3938..fa1997bc2667 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -139,6 +139,11 @@  bool pci_bridge_d3_possible(struct pci_dev *dev);
 void pci_bridge_d3_update(struct pci_dev *dev);
 int pci_bridge_wait_for_secondary_bus(struct pci_dev *dev, char *reset_type);
 
+static inline bool pci_bus_crs_vendor_id(u32 l)
+{
+	return (l & 0xffff) == PCI_VENDOR_ID_PCI_SIG;
+}
+
 static inline void pci_wakeup_event(struct pci_dev *dev)
 {
 	/* Wait 100 ms before the system can be put into a sleep state. */
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index b14b9876c030..b1615da9eb6b 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1209,9 +1209,11 @@  static void pci_enable_crs(struct pci_dev *pdev)
 
 	/* Enable CRS Software Visibility if supported */
 	pcie_capability_read_word(pdev, PCI_EXP_RTCAP, &root_cap);
-	if (root_cap & PCI_EXP_RTCAP_CRSVIS)
+	if (root_cap & PCI_EXP_RTCAP_CRSVIS) {
 		pcie_capability_set_word(pdev, PCI_EXP_RTCTL,
 					 PCI_EXP_RTCTL_CRSSVE);
+		pdev->config_crs_sv = 1;
+	}
 }
 
 static unsigned int pci_scan_child_bus_extend(struct pci_bus *bus,
@@ -2343,11 +2345,6 @@  struct pci_dev *pci_alloc_dev(struct pci_bus *bus)
 }
 EXPORT_SYMBOL(pci_alloc_dev);
 
-static bool pci_bus_crs_vendor_id(u32 l)
-{
-	return (l & 0xffff) == PCI_VENDOR_ID_PCI_SIG;
-}
-
 static bool pci_bus_wait_crs(struct pci_bus *bus, int devfn, u32 *l,
 			     int timeout)
 {
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 4cf89a4b4cbc..121d8d94d6d0 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -371,6 +371,7 @@  struct pci_dev {
 					   can be generated */
 	unsigned int	pme_poll:1;	/* Poll device's PME status bit */
 	unsigned int	pinned:1;	/* Whether this dev is pinned */
+	unsigned int	config_crs_sv:1; /* Config CRS software visibility */
 	unsigned int	imm_ready:1;	/* Supports Immediate Readiness */
 	unsigned int	d1_support:1;	/* Low power state D1 is supported */
 	unsigned int	d2_support:1;	/* Low power state D2 is supported */