Message ID:    20191005182129.32538-1-vidyas@nvidia.com (mailing list archive)
State:         New, archived
Delegated to:  Bjorn Helgaas
Series:        PCI: Add CRS timeout for pci_device_is_present()
On Sat, Oct 05, 2019 at 11:51:29PM +0530, Vidya Sagar wrote:
> Adds a 60 seconds timeout to consider CRS (Configuration request Retry
> Status) from a downstream device when Vendor ID read is attempted by
> an upstream device. This helps to work with devices that return CRS
> during system resume. This also makes pci_device_is_present() consistent
> with the probe path where 60 seconds timeout is already being used.
>
> Signed-off-by: Vidya Sagar <vidyas@nvidia.com>
> ---
>  drivers/pci/pci.c   | 3 ++-
>  drivers/pci/pci.h   | 2 ++
>  drivers/pci/probe.c | 2 +-
>  3 files changed, 5 insertions(+), 2 deletions(-)

I think this makes sense, so:

Reviewed-by: Thierry Reding <treding@nvidia.com>

However, it looks like Sinan has researched this extensively in the past
and gave a presentation on this at Plumbers in 2017:

    https://blog.linuxplumbersconf.org/2017/ocw/system/presentations/4732/original/crs.pdf

Adding Sinan to see if he has any concerns about this, since resume time
is explicitly mentioned in the above slides.

Thierry

> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 95dc78ebdded..3ab9f6d3c8a6 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -5905,7 +5905,8 @@ bool pci_device_is_present(struct pci_dev *pdev)
>
>  	if (pci_dev_is_disconnected(pdev))
>  		return false;
> -	return pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, &v, 0);
> +	return pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, &v,
> +					  PCI_CRS_TIMEOUT);
>  }
>  EXPORT_SYMBOL_GPL(pci_device_is_present);
>
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 3f6947ee3324..aa25c5fdc6a5 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -4,6 +4,8 @@
>
>  #include <linux/pci.h>
>
> +#define PCI_CRS_TIMEOUT		(60 * 1000)	/* 60 sec */
> +
>  #define PCI_FIND_CAP_TTL	48
>
>  #define PCI_VSEC_ID_INTEL_TBT	0x1234	/* Thunderbolt */
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 7c5d68b807ef..6e44a03283c8 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -2258,7 +2258,7 @@ static struct pci_dev *pci_scan_device(struct pci_bus *bus, int devfn)
>  	struct pci_dev *dev;
>  	u32 l;
>
> -	if (!pci_bus_read_dev_vendor_id(bus, devfn, &l, 60*1000))
> +	if (!pci_bus_read_dev_vendor_id(bus, devfn, &l, PCI_CRS_TIMEOUT))
>  		return NULL;
>
>  	dev = pci_alloc_dev(bus);
> --
> 2.17.1
>
On Sat, Oct 05, 2019 at 11:51:29PM +0530, Vidya Sagar wrote:
> Adds a 60 seconds timeout to consider CRS (Configuration request Retry
> Status) from a downstream device when Vendor ID read is attempted by
> an upstream device. This helps to work with devices that return CRS
> during system resume. This also makes pci_device_is_present() consistent
> with the probe path where 60 seconds timeout is already being used.

This looks like a good idea.

>
> Signed-off-by: Vidya Sagar <vidyas@nvidia.com>
> ---
>  drivers/pci/pci.c   | 3 ++-
>  drivers/pci/pci.h   | 2 ++
>  drivers/pci/probe.c | 2 +-
>  3 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 95dc78ebdded..3ab9f6d3c8a6 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -5905,7 +5905,8 @@ bool pci_device_is_present(struct pci_dev *pdev)
>
>  	if (pci_dev_is_disconnected(pdev))
>  		return false;
> -	return pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, &v, 0);
> +	return pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, &v,
> +					  PCI_CRS_TIMEOUT);
>  }
>  EXPORT_SYMBOL_GPL(pci_device_is_present);
>
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 3f6947ee3324..aa25c5fdc6a5 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -4,6 +4,8 @@
>
>  #include <linux/pci.h>
>
> +#define PCI_CRS_TIMEOUT		(60 * 1000)	/* 60 sec */
> +

This has the same value as PCIE_RESET_READY_POLL_MS defined in pci.c. When
I look at how PCIE_RESET_READY_POLL_MS is used, it seems that we have two
nearly identical ways to handle the same thing.

pci_bus_wait_crs - this seems to be specifically used for handling CRS SV
when reading the Vendor ID.

pci_dev_wait - this seems to be used after FLR, AF_FLR, bus reset and
D3->D0; in all the use-cases the timeout is 60 seconds. This function waits
for the function to return a non-CRS completion - however it doesn't rely
on the CRS SV value of 0x0001.

What is the reason that pci_dev_wait polls on PCI_COMMAND instead of a
CRS SV value? (Is this a hack to gain some CPU time on RCs without CRS SV?
I.e., rather than the hardware retrying, it just gives up, allowing us to
retry.)

Is there any reason why these two functions can't be combined?

>  #define PCI_FIND_CAP_TTL	48
>
>  #define PCI_VSEC_ID_INTEL_TBT	0x1234	/* Thunderbolt */
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 7c5d68b807ef..6e44a03283c8 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -2258,7 +2258,7 @@ static struct pci_dev *pci_scan_device(struct pci_bus *bus, int devfn)
>  	struct pci_dev *dev;
>  	u32 l;
>
> -	if (!pci_bus_read_dev_vendor_id(bus, devfn, &l, 60*1000))
> +	if (!pci_bus_read_dev_vendor_id(bus, devfn, &l, PCI_CRS_TIMEOUT))

Should you also fix up acpiphp_add_context() in
drivers/pci/hotplug/acpiphp_glue.c to use PCI_CRS_TIMEOUT?

Thanks,

Andrew Murray

>  		return NULL;
>
>  	dev = pci_alloc_dev(bus);
> --
> 2.17.1
>
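For reference, here is a condensed sketch of the two wait mechanisms Andrew
contrasts above, based on drivers/pci/probe.c and drivers/pci/pci.c around
the v5.4 era (abbreviated for illustration, not verbatim kernel code):

	/* Probe path: pci_bus_wait_crs() keys off the CRS SV mechanism,
	 * where a Vendor ID read that got a CRS completion returns the
	 * special value 0x0001. */
	static bool pci_bus_crs_vendor_id(u32 l)
	{
		return (l & 0xffff) == 0x0001;
	}

	/* Reset path: pci_dev_wait() polls PCI_COMMAND instead. Per the
	 * comment in the kernel source, the Root Port generally
	 * synthesizes ~0 data to complete a config read that got a CRS
	 * completion, except when CRS SV is enabled and the read was for
	 * the Vendor ID -- which is likely why this function doesn't
	 * rely on the 0x0001 value. */
	static int pci_dev_wait(struct pci_dev *dev, char *reset_type,
				int timeout)
	{
		int delay = 1;
		u32 id;

		pci_read_config_dword(dev, PCI_COMMAND, &id);
		while (id == ~0) {
			if (delay > timeout)
				return -ENOTTY;		/* give up */
			msleep(delay);
			delay *= 2;			/* exponential backoff */
			pci_read_config_dword(dev, PCI_COMMAND, &id);
		}
		return 0;
	}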
On 10/14/2019 1:20 AM, Thierry Reding wrote:
> I think this makes sense, so:
>
> Reviewed-by: Thierry Reding <treding@nvidia.com>
>
> However, it looks like Sinan has researched this extensively in the past
> and gave a presentation on this at Plumbers in 2017:
>
>     https://blog.linuxplumbersconf.org/2017/ocw/system/presentations/4732/original/crs.pdf
>
> Adding Sinan to see if he has any concerns about this, since resume time
> is explicitly mentioned in the above slides.

Thanks for including me. Let me catch up here.

pci_dev_wait() is supposed to handle this case via pci_pm_reset().

/**
 * pci_pm_reset - Put device into PCI_D3 and back into PCI_D0.
 * @dev: Device to reset.
 * @probe: If set, only check if the device can be reset this way.
 */

Do you know if your execution path hits this function? We might have
missed a use case.
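For context, pci_pm_reset() does indeed end in that wait; a condensed
sketch (abbreviated from drivers/pci/pci.c around the v5.4 era, so the
exact strings and checks may differ):

	static int pci_pm_reset(struct pci_dev *dev, int probe)
	{
		u16 csr;

		/* ... capability and probe checks elided ... */

		csr &= ~PCI_PM_CTRL_STATE_MASK;
		csr |= PCI_D3hot;			/* D0 -> D3hot */
		pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, csr);
		pci_dev_d3_sleep(dev);

		csr &= ~PCI_PM_CTRL_STATE_MASK;		/* D3hot -> D0 */
		pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, csr);
		pci_dev_d3_sleep(dev);

		/* the CRS wait lives here, with the 60 s budget */
		return pci_dev_wait(dev, "PM D3->D0",
				    PCIE_RESET_READY_POLL_MS);
	}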
On Mon, Oct 14, 2019 at 01:21:31PM -0700, Sinan Kaya wrote:
> On 10/14/2019 1:20 AM, Thierry Reding wrote:
> > I think this makes sense, so:
> >
> > Reviewed-by: Thierry Reding <treding@nvidia.com>
> >
> > However, it looks like Sinan has researched this extensively in the past
> > and gave a presentation on this at Plumbers in 2017:
> >
> >     https://blog.linuxplumbersconf.org/2017/ocw/system/presentations/4732/original/crs.pdf
> >
> > Adding Sinan to see if he has any concerns about this, since resume time
> > is explicitly mentioned in the above slides.
>
> Thanks for including me. Let me catch up here.
>
> pci_dev_wait() is supposed to handle this case via pci_pm_reset().
>
> /**
>  * pci_pm_reset - Put device into PCI_D3 and back into PCI_D0.
>  * @dev: Device to reset.
>  * @probe: If set, only check if the device can be reset this way.
>  */
>
> Do you know if your execution path hits this function? We might have
> missed a use case.
>

I see only a couple of callers of pci_device_is_present() in the tree,
this being from next-20191015:

$ git grep -n pci_device_is_present
drivers/net/ethernet/broadcom/tg3.c:9070:	if (!pci_device_is_present(tp->pdev))
drivers/net/ethernet/broadcom/tg3.c:11785:	if (pci_device_is_present(tp->pdev)) {
drivers/net/ethernet/intel/igb/igb_main.c:8838:	if (!pci_device_is_present(pdev))
drivers/nvme/host/pci.c:2866:	if (!pci_device_is_present(pdev)) {
drivers/pci/hotplug/acpiphp_glue.c:650:	alive = pci_device_is_present(dev);
drivers/pci/pci.c:935:	    !pci_device_is_present(dev)) {
drivers/pci/pci.c:5902:bool pci_device_is_present(struct pci_dev *pdev)
drivers/pci/pci.c:5910:EXPORT_SYMBOL_GPL(pci_device_is_present);
drivers/thunderbolt/nhi.c:939:	if (!pci_device_is_present(pdev)) {
include/linux/pci.h:1206:bool pci_device_is_present(struct pci_dev *pdev);

The NVME driver has the call in the ->remove() callback, so I don't
think it's relevant here. Both TG3 and IGB ethernet drivers seem to call
this during resume and so does Thunderbolt.

Vidya, can you clarify for which device you're seeing the issues? Sounds
like adding a call to pci_pm_reset() (via pci_reset_function()) at some
point might help.

Sinan, it looks as if pci_pm_reset() (or any of its callers) is not used
very widely. Is that just because most drivers haven't had a need for it
yet? Or am I missing some core functionality that calls this for every
device anyway?

Thierry
+Rafael

On 10/15/2019 2:30 AM, Thierry Reding wrote:
> Vidya, can you clarify for which device you're seeing the issues? Sounds
> like adding a call to pci_pm_reset() (via pci_reset_function()) at some
> point might help.
>
> Sinan, it looks as if pci_pm_reset() (or any of its callers) is not used
> very widely. Is that just because most drivers haven't had a need for it
> yet? Or am I missing some core functionality that calls this for every
> device anyway?

pci_pm_reset() is there as an alternative reset path. We are not
supposed to call this function directly. Sorry for giving you the wrong
direction here. pci_reset_function() should call it only if no other
suitable reset method is found.

I think the PCI core should be putting the device back into D0 state as
one of the first actions before enumerating. Wake-up could be a
combination of ACPI and/or PCI wake-up depending on where your device
sits in the topology.

On the other hand, the wake-up code doesn't perform the CRS wait. The
CRS wait is deferred until the first Vendor ID read in
pci_scan_device(). I see that it already waits for 60 seconds.

Going back to the patch...

I think we need to find the path that actually needs this sleep and
put pci_dev_wait() there.

+++ b/drivers/pci/pci.c
@@ -5905,7 +5905,8 @@ bool pci_device_is_present(struct pci_dev *pdev)

 	if (pci_dev_is_disconnected(pdev))
 		return false;
-	return pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, &v, 0);
+	return pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, &v,
+					  PCI_CRS_TIMEOUT);
 }

pci_device_is_present() is too low-level a function and may not be
allowed to sleep. It currently uses 0 as the timeout value.
On 10/15/2019 1:51 AM, Sinan Kaya wrote:
> On 10/14/2019 1:20 AM, Thierry Reding wrote:
>> I think this makes sense, so:
>>
>> Reviewed-by: Thierry Reding <treding@nvidia.com>
>>
>> However, it looks like Sinan has researched this extensively in the past
>> and gave a presentation on this at Plumbers in 2017:
>>
>>     https://blog.linuxplumbersconf.org/2017/ocw/system/presentations/4732/original/crs.pdf
>>
>> Adding Sinan to see if he has any concerns about this, since resume time
>> is explicitly mentioned in the above slides.
>
> Thanks for including me. Let me catch up here.
>
> pci_dev_wait() is supposed to handle this case via pci_pm_reset().
>
> /**
>  * pci_pm_reset - Put device into PCI_D3 and back into PCI_D0.
>  * @dev: Device to reset.
>  * @probe: If set, only check if the device can be reset this way.
>  */
>
> Do you know if your execution path hits this function? We might have
> missed a use case.
>

Nope, it doesn't. Following is the stack dump showing how
pci_update_current_state() is called in the resume() path;
pci_device_is_present() is called from inside the
pci_update_current_state() API. My understanding is that
pci_device_is_present() is the API we hit first in the resume() path for
any PCIe device.

[   36.380726] Call trace:
[   36.383270]  dump_backtrace+0x0/0x158
[   36.386802]  show_stack+0x14/0x20
[   36.389749]  dump_stack+0xb0/0xf8
[   36.393451]  pci_update_current_state+0x58/0xe0
[   36.398178]  pci_power_up+0x60/0x70
[   36.401672]  pci_pm_resume_noirq+0x6c/0x130
[   36.405669]  dpm_run_callback.isra.16+0x20/0x70
[   36.410248]  device_resume_noirq+0x120/0x238
[   36.414364]  async_resume_noirq+0x24/0x58
[   36.418364]  async_run_entry_fn+0x40/0x148
[   36.422418]  process_one_work+0x1e8/0x360
[   36.426525]  worker_thread+0x40/0x488
[   36.430201]  kthread+0x118/0x120
[   36.433843]  ret_from_fork+0x10/0x1c

Also, I don't see pci_pm_reset() getting called in the resume() path at
all.

Thanks,
Vidya Sagar
On 10/15/2019 3:00 PM, Thierry Reding wrote:
> On Mon, Oct 14, 2019 at 01:21:31PM -0700, Sinan Kaya wrote:
>> On 10/14/2019 1:20 AM, Thierry Reding wrote:
>>> I think this makes sense, so:
>>>
>>> Reviewed-by: Thierry Reding <treding@nvidia.com>
>>>
>>> However, it looks like Sinan has researched this extensively in the past
>>> and gave a presentation on this at Plumbers in 2017:
>>>
>>>     https://blog.linuxplumbersconf.org/2017/ocw/system/presentations/4732/original/crs.pdf
>>>
>>> Adding Sinan to see if he has any concerns about this, since resume time
>>> is explicitly mentioned in the above slides.
>>
>> Thanks for including me. Let me catch up here.
>>
>> pci_dev_wait() is supposed to handle this case via pci_pm_reset().
>>
>> /**
>>  * pci_pm_reset - Put device into PCI_D3 and back into PCI_D0.
>>  * @dev: Device to reset.
>>  * @probe: If set, only check if the device can be reset this way.
>>  */
>>
>> Do you know if your execution path hits this function? We might have
>> missed a use case.
>>
>
> I see only a couple of callers of pci_device_is_present() in the tree,
> this being from next-20191015:
>
> $ git grep -n pci_device_is_present
> drivers/net/ethernet/broadcom/tg3.c:9070:	if (!pci_device_is_present(tp->pdev))
> drivers/net/ethernet/broadcom/tg3.c:11785:	if (pci_device_is_present(tp->pdev)) {
> drivers/net/ethernet/intel/igb/igb_main.c:8838:	if (!pci_device_is_present(pdev))
> drivers/nvme/host/pci.c:2866:	if (!pci_device_is_present(pdev)) {
> drivers/pci/hotplug/acpiphp_glue.c:650:	alive = pci_device_is_present(dev);
> drivers/pci/pci.c:935:	    !pci_device_is_present(dev)) {
> drivers/pci/pci.c:5902:bool pci_device_is_present(struct pci_dev *pdev)
> drivers/pci/pci.c:5910:EXPORT_SYMBOL_GPL(pci_device_is_present);
> drivers/thunderbolt/nhi.c:939:	if (!pci_device_is_present(pdev)) {
> include/linux/pci.h:1206:bool pci_device_is_present(struct pci_dev *pdev);
>

I think the important one is the following, which is called from inside
the pci_update_current_state() function:

drivers/pci/pci.c:935:	    !pci_device_is_present(dev)) {

I've put a dump_stack() in to see how this is called, and following is
the trace:

[   36.380726] Call trace:
[   36.383270]  dump_backtrace+0x0/0x158
[   36.386802]  show_stack+0x14/0x20
[   36.389749]  dump_stack+0xb0/0xf8
[   36.393451]  pci_update_current_state+0x58/0xe0
[   36.398178]  pci_power_up+0x60/0x70
[   36.401672]  pci_pm_resume_noirq+0x6c/0x130
[   36.405669]  dpm_run_callback.isra.16+0x20/0x70
[   36.410248]  device_resume_noirq+0x120/0x238
[   36.414364]  async_resume_noirq+0x24/0x58
[   36.418364]  async_run_entry_fn+0x40/0x148
[   36.422418]  process_one_work+0x1e8/0x360
[   36.426525]  worker_thread+0x40/0x488
[   36.430201]  kthread+0x118/0x120
[   36.433843]  ret_from_fork+0x10/0x1c

> The NVME driver has the call in the ->remove() callback, so I don't
> think it's relevant here. Both TG3 and IGB ethernet drivers seem to call
> this during resume and so does Thunderbolt.
>
> Vidya, can you clarify for which device you're seeing the issues? Sounds
> like adding a call to pci_pm_reset() (via pci_reset_function()) at some
> point might help.

With a 0 sec wait, I see the issue with an Intel 750 NVMe card. As I
mentioned above, it is called from the PCI PM subsystem, which is where
the timeout is required.

>
> Sinan, it looks as if pci_pm_reset() (or any of its callers) is not used
> very widely. Is that just because most drivers haven't had a need for it
> yet? Or am I missing some core functionality that calls this for every
> device anyway?
>
> Thierry
>
On 10/15/2019 4:40 PM, Sinan Kaya wrote:
> +Rafael
>
> On 10/15/2019 2:30 AM, Thierry Reding wrote:
>> Vidya, can you clarify for which device you're seeing the issues? Sounds
>> like adding a call to pci_pm_reset() (via pci_reset_function()) at some
>> point might help.
>>
>> Sinan, it looks as if pci_pm_reset() (or any of its callers) is not used
>> very widely. Is that just because most drivers haven't had a need for it
>> yet? Or am I missing some core functionality that calls this for every
>> device anyway?
>
> pci_pm_reset() is there as an alternative reset path. We are not
> supposed to call this function directly. Sorry for giving you the wrong
> direction here. pci_reset_function() should call it only if no other
> suitable reset method is found.
>
> I think the PCI core should be putting the device back into D0 state as
> one of the first actions before enumerating. Wake-up could be a
> combination of ACPI and/or PCI wake-up depending on where your device
> sits in the topology.

Yup. It is indeed doing that as part of pci_power_up() in the pci.c file.
But what is confusing to me is the order of the calls. pci_power_up()
has the following calls in this order:

	pci_raw_set_power_state(dev, PCI_D0);
	pci_update_current_state(dev, PCI_D0);

But pci_raw_set_power_state() is accessing config space without calling
pci_device_is_present(), whereas pci_update_current_state(), which is
called later in the flow, is calling pci_device_is_present()...!

>
> On the other hand, the wake-up code doesn't perform the CRS wait. The
> CRS wait is deferred until the first Vendor ID read in
> pci_scan_device(). I see that it already waits for 60 seconds.
>
> Going back to the patch...
>
> I think we need to find the path that actually needs this sleep and
> put pci_dev_wait() there.

Following is the path in the resume() flow:

[   36.380726] Call trace:
[   36.383270]  dump_backtrace+0x0/0x158
[   36.386802]  show_stack+0x14/0x20
[   36.389749]  dump_stack+0xb0/0xf8
[   36.393451]  pci_update_current_state+0x58/0xe0
[   36.398178]  pci_power_up+0x60/0x70
[   36.401672]  pci_pm_resume_noirq+0x6c/0x130
[   36.405669]  dpm_run_callback.isra.16+0x20/0x70
[   36.410248]  device_resume_noirq+0x120/0x238
[   36.414364]  async_resume_noirq+0x24/0x58
[   36.418364]  async_run_entry_fn+0x40/0x148
[   36.422418]  process_one_work+0x1e8/0x360
[   36.426525]  worker_thread+0x40/0x488
[   36.430201]  kthread+0x118/0x120
[   36.433843]  ret_from_fork+0x10/0x1c

>
> +++ b/drivers/pci/pci.c
> @@ -5905,7 +5905,8 @@ bool pci_device_is_present(struct pci_dev *pdev)
>
>  	if (pci_dev_is_disconnected(pdev))
>  		return false;
> -	return pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, &v, 0);
> +	return pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, &v,
> +					  PCI_CRS_TIMEOUT);
>  }
>
> pci_device_is_present() is too low-level a function and may not be
> allowed to sleep. It currently uses 0 as the timeout value.
>
On 10/21/2019 11:13 AM, Vidya Sagar wrote:

Hi Sinan / Rafael,
Apologies for the ping again.
Do you guys have any further comments on this?

-Vidya Sagar

> On 10/15/2019 5:44 PM, Vidya Sagar wrote:
>
> Hi Sinan / Rafael,
> Do you have any further comments on this patch?
>
> - Vidya Sagar
>
>> On 10/15/2019 4:40 PM, Sinan Kaya wrote:
>>> +Rafael
>>>
>>> On 10/15/2019 2:30 AM, Thierry Reding wrote:
>>>> Vidya, can you clarify for which device you're seeing the issues? Sounds
>>>> like adding a call to pci_pm_reset() (via pci_reset_function()) at some
>>>> point might help.
>>>>
>>>> Sinan, it looks as if pci_pm_reset() (or any of its callers) is not used
>>>> very widely. Is that just because most drivers haven't had a need for it
>>>> yet? Or am I missing some core functionality that calls this for every
>>>> device anyway?
>>>
>>> pci_pm_reset() is there as an alternative reset path. We are not
>>> supposed to call this function directly. Sorry for giving you the wrong
>>> direction here. pci_reset_function() should call it only if no other
>>> suitable reset method is found.
>>>
>>> I think the PCI core should be putting the device back into D0 state as
>>> one of the first actions before enumerating. Wake-up could be a
>>> combination of ACPI and/or PCI wake-up depending on where your device
>>> sits in the topology.
>> Yup. It is indeed doing that as part of pci_power_up() in the pci.c file.
>> But what is confusing to me is the order of the calls. pci_power_up()
>> has the following calls in this order:
>> 	pci_raw_set_power_state(dev, PCI_D0);
>> 	pci_update_current_state(dev, PCI_D0);
>> But pci_raw_set_power_state() is accessing config space without calling
>> pci_device_is_present(), whereas pci_update_current_state(), which is
>> called later in the flow, is calling pci_device_is_present()...!
>>
>>>
>>> On the other hand, the wake-up code doesn't perform the CRS wait. The
>>> CRS wait is deferred until the first Vendor ID read in
>>> pci_scan_device(). I see that it already waits for 60 seconds.
>>>
>>> Going back to the patch...
>>>
>>> I think we need to find the path that actually needs this sleep and
>>> put pci_dev_wait() there.
>> Following is the path in the resume() flow:
>> [   36.380726] Call trace:
>> [   36.383270]  dump_backtrace+0x0/0x158
>> [   36.386802]  show_stack+0x14/0x20
>> [   36.389749]  dump_stack+0xb0/0xf8
>> [   36.393451]  pci_update_current_state+0x58/0xe0
>> [   36.398178]  pci_power_up+0x60/0x70
>> [   36.401672]  pci_pm_resume_noirq+0x6c/0x130
>> [   36.405669]  dpm_run_callback.isra.16+0x20/0x70
>> [   36.410248]  device_resume_noirq+0x120/0x238
>> [   36.414364]  async_resume_noirq+0x24/0x58
>> [   36.418364]  async_run_entry_fn+0x40/0x148
>> [   36.422418]  process_one_work+0x1e8/0x360
>> [   36.426525]  worker_thread+0x40/0x488
>> [   36.430201]  kthread+0x118/0x120
>> [   36.433843]  ret_from_fork+0x10/0x1c
>>
>>>
>>> +++ b/drivers/pci/pci.c
>>> @@ -5905,7 +5905,8 @@ bool pci_device_is_present(struct pci_dev *pdev)
>>>
>>>  	if (pci_dev_is_disconnected(pdev))
>>>  		return false;
>>> -	return pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, &v, 0);
>>> +	return pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, &v,
>>> +					  PCI_CRS_TIMEOUT);
>>> }
>>>
>>> pci_device_is_present() is too low-level a function and may not be
>>> allowed to sleep. It currently uses 0 as the timeout value.
>>>
>>
>
On 10/25/2019 7:58 AM, Vidya Sagar wrote:
> On 10/21/2019 11:13 AM, Vidya Sagar wrote:
>
> Hi Sinan / Rafael,
> Apologies for the ping again.
> Do you guys have any further comments on this?
>
> -Vidya Sagar

I think you'll need some attention from Bjorn here to see the complete
picture.

As I said, changing pci_device_is_present() is not right. This needs to
be done one level higher.
On 10/26/2019 7:29 PM, Sinan Kaya wrote:
> On 10/25/2019 7:58 AM, Vidya Sagar wrote:
>> On 10/21/2019 11:13 AM, Vidya Sagar wrote:
>>
>> Hi Sinan / Rafael,
>> Apologies for the ping again.
>> Do you guys have any further comments on this?
>>
>> -Vidya Sagar
>
> I think you'll need some attention from Bjorn here to see the complete
> picture.
>
> As I said, changing pci_device_is_present() is not right. This needs to
> be done one level higher.

Hi Bjorn,
Could you please help me understand why this change can't be done in
the pci_device_is_present() API?

- Vidya Sagar
On Mon, Nov 04, 2019 at 05:13:50PM +0530, Vidya Sagar wrote:
> On 10/26/2019 7:29 PM, Sinan Kaya wrote:
> > On 10/25/2019 7:58 AM, Vidya Sagar wrote:
> > > On 10/21/2019 11:13 AM, Vidya Sagar wrote:
> > >
> > > Hi Sinan / Rafael,
> > > Apologies for the ping again.
> > > Do you guys have any further comments on this?
> > >
> > > -Vidya Sagar
> >
> > I think you'll need some attention from Bjorn here to see the complete
> > picture.
> >
> > As I said, changing pci_device_is_present() is not right. This needs to
> > be done one level higher.
>
> Hi Bjorn,
> Could you please help me understand why this change can't be done in
> the pci_device_is_present() API?

I assume that's because pci_device_is_present() is currently called in
contexts that aren't allowed to sleep, therefore you would trigger a
regression.

Lorenzo
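A condensed look at why a nonzero timeout implies sleeping (based on the
polling loop in pci_bus_wait_crs() in drivers/pci/probe.c around the v5.4
era; abbreviated for illustration):

	int delay = 1;

	while (pci_bus_crs_vendor_id(*l)) {
		if (delay > timeout)	/* with timeout == 0 this triggers  */
			return false;	/* on the first pass: no sleep ever */

		msleep(delay);		/* with timeout > 0 we sleep here,  */
		delay *= 2;		/* which atomic callers must not do */

		if (pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, l))
			return false;
	}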
[+cc Andrew, Lukas]

On Tue, Oct 15, 2019 at 05:44:47PM +0530, Vidya Sagar wrote:
> On 10/15/2019 4:40 PM, Sinan Kaya wrote:
> > ...
> > I think the PCI core should be putting the device back into D0 state as
> > one of the first actions before enumerating. Wake-up could be a
> > combination of ACPI and/or PCI wake-up depending on where your device
> > sits in the topology.
>
> Yup. It is indeed doing that as part of pci_power_up() in the pci.c file.
> But what is confusing to me is the order of the calls. pci_power_up()
> has the following calls in this order:
> 	pci_raw_set_power_state(dev, PCI_D0);
> 	pci_update_current_state(dev, PCI_D0);
> But pci_raw_set_power_state() is accessing config space without calling
> pci_device_is_present(), whereas pci_update_current_state(), which is
> called later in the flow, is calling pci_device_is_present()...!

A device should always respond to config reads unless it is in D3cold
or it is initializing after a reset. IIUC you're doing a resume, not
a reset, so I think your device must be coming out of D3cold. That's
typically done via ACPI, and I think we are missing some kind of delay
before our first config access:

  pci_power_up
    platform_pci_set_power_state(PCI_D0)	# eg, ACPI
    pci_raw_set_power_state
      pci_read_config_word(PCI_PM_CTRL)		# <-- first config access
      pci_write_config_word(PCI_PM_CTRL)
      pci_read_config_word(PCI_PM_CTRL)
    pci_update_current_state
      if (... || !pci_device_is_present())

Mika is working on some delays for the transition out of D3cold [1].
He's more concerned with a secondary bus behind a bridge, so I don't
think his patch addresses this case, but he's certainly familiar with
this area.

Huh, I'm really confused about this, too. I don't understand how resume
ever works without any delay between platform_pci_set_power_state() and
the config reads in pci_raw_set_power_state(). I must be missing
something.

The pci_device_is_present() call in pci_update_current_state() was added
by a6a64026c0cd ("PCI: Recognize D3cold in pci_update_current_state()")
[2]. The purpose there is not to wait for a device to become ready; it
is to learn whether the device is in D3cold. I don't think we'd want a
delay in this path because it would slow down all transitions into
D3cold.

[1] https://lore.kernel.org/r/20191004123947.11087-1-mika.westerberg@linux.intel.com
[2] https://git.kernel.org/linus/a6a64026c0cd

> > On the other hand, the wake-up code doesn't perform the CRS wait. The
> > CRS wait is deferred until the first Vendor ID read in
> > pci_scan_device(). I see that it already waits for 60 seconds.
> >
> > Going back to the patch...
> >
> > I think we need to find the path that actually needs this sleep and
> > put pci_dev_wait() there.
>
> Following is the path in the resume() flow:
> [   36.380726] Call trace:
> [   36.383270]  dump_backtrace+0x0/0x158
> [   36.386802]  show_stack+0x14/0x20
> [   36.389749]  dump_stack+0xb0/0xf8
> [   36.393451]  pci_update_current_state+0x58/0xe0
> [   36.398178]  pci_power_up+0x60/0x70
> [   36.401672]  pci_pm_resume_noirq+0x6c/0x130
> [   36.405669]  dpm_run_callback.isra.16+0x20/0x70
> [   36.410248]  device_resume_noirq+0x120/0x238
> [   36.414364]  async_resume_noirq+0x24/0x58
> [   36.418364]  async_run_entry_fn+0x40/0x148
> [   36.422418]  process_one_work+0x1e8/0x360
> [   36.426525]  worker_thread+0x40/0x488
> [   36.430201]  kthread+0x118/0x120
> [   36.433843]  ret_from_fork+0x10/0x1c
>
> >
> > +++ b/drivers/pci/pci.c
> > @@ -5905,7 +5905,8 @@ bool pci_device_is_present(struct pci_dev *pdev)
> >
> >  	if (pci_dev_is_disconnected(pdev))
> >  		return false;
> > -	return pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, &v, 0);
> > +	return pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, &v,
> > +					  PCI_CRS_TIMEOUT);
> > }
> >
> > pci_device_is_present() is too low-level a function and may not be
> > allowed to sleep. It currently uses 0 as the timeout value.
> >
>
On Monday, November 4, 2019 6:39:04 PM CET Bjorn Helgaas wrote:
> [+cc Andrew, Lukas]
>
> On Tue, Oct 15, 2019 at 05:44:47PM +0530, Vidya Sagar wrote:
> > On 10/15/2019 4:40 PM, Sinan Kaya wrote:
> > > ...
> > > I think the PCI core should be putting the device back into D0 state as
> > > one of the first actions before enumerating. Wake-up could be a
> > > combination of ACPI and/or PCI wake-up depending on where your device
> > > sits in the topology.
> >
> > Yup. It is indeed doing that as part of pci_power_up() in the pci.c file.
> > But what is confusing to me is the order of the calls. pci_power_up()
> > has the following calls in this order:
> > 	pci_raw_set_power_state(dev, PCI_D0);
> > 	pci_update_current_state(dev, PCI_D0);
> > But pci_raw_set_power_state() is accessing config space without calling
> > pci_device_is_present(), whereas pci_update_current_state(), which is
> > called later in the flow, is calling pci_device_is_present()...!
>
> A device should always respond to config reads unless it is in D3cold
> or it is initializing after a reset. IIUC you're doing a resume, not
> a reset, so I think your device must be coming out of D3cold. That's
> typically done via ACPI, and I think we are missing some kind of delay
> before our first config access:
>
>   pci_power_up
>     platform_pci_set_power_state(PCI_D0)	# eg, ACPI
>     pci_raw_set_power_state
>       pci_read_config_word(PCI_PM_CTRL)		# <-- first config access
>       pci_write_config_word(PCI_PM_CTRL)
>       pci_read_config_word(PCI_PM_CTRL)
>     pci_update_current_state
>       if (... || !pci_device_is_present())
>
> Mika is working on some delays for the transition out of D3cold [1].
> He's more concerned with a secondary bus behind a bridge, so I don't
> think his patch addresses this case, but he's certainly familiar with
> this area.
>
> Huh, I'm really confused about this, too. I don't understand how resume
> ever works without any delay between platform_pci_set_power_state() and
> the config reads in pci_raw_set_power_state(). I must be missing
> something.

There is a delay in the runtime_d3cold case, see
__pci_start_power_transition().

But overall platform_pci_power_manageable() only checks whether or not
the platform firmware can change the power state of the device. If it
can, it is expected to take care of any necessary delays while doing
that (because there may be delays required by this particular instance
of the platform firmware, beyond what is mandated by the PCI spec, or
there may not be any need to wait at all). If the platform firmware
becomes responsible for setting the device's power state, there is no
reason why it should not be responsible for the delay part too.

In any case, I'm not sure how useful it is to add delays for everyone
in the cases in which a specific system needs a delay because of its
own PM implementation limitations. It may be better to quirk such
systems explicitly as long as there are not too many quirks in there,
or we'll end up adding more and more *implicit* quirks in the form of
general delays.

Cheers!
On Tue, Nov 05, 2019 at 11:55:45AM +0100, Rafael J. Wysocki wrote:
> On Monday, November 4, 2019 6:39:04 PM CET Bjorn Helgaas wrote:
> > [+cc Andrew, Lukas]
> >
> > On Tue, Oct 15, 2019 at 05:44:47PM +0530, Vidya Sagar wrote:
> > > On 10/15/2019 4:40 PM, Sinan Kaya wrote:
> > > > ...
> > > > I think the PCI core should be putting the device back into D0 state as
> > > > one of the first actions before enumerating. Wake-up could be a
> > > > combination of ACPI and/or PCI wake-up depending on where your device
> > > > sits in the topology.
> > >
> > > Yup. It is indeed doing that as part of pci_power_up() in the pci.c file.
> > > But what is confusing to me is the order of the calls. pci_power_up()
> > > has the following calls in this order:
> > > 	pci_raw_set_power_state(dev, PCI_D0);
> > > 	pci_update_current_state(dev, PCI_D0);
> > > But pci_raw_set_power_state() is accessing config space without calling
> > > pci_device_is_present(), whereas pci_update_current_state(), which is
> > > called later in the flow, is calling pci_device_is_present()...!
> >
> > A device should always respond to config reads unless it is in D3cold
> > or it is initializing after a reset. IIUC you're doing a resume, not
> > a reset, so I think your device must be coming out of D3cold. That's
> > typically done via ACPI, and I think we are missing some kind of delay
> > before our first config access:
> >
> >   pci_power_up
> >     platform_pci_set_power_state(PCI_D0)	# eg, ACPI
> >     pci_raw_set_power_state
> >       pci_read_config_word(PCI_PM_CTRL)		# <-- first config access
> >       pci_write_config_word(PCI_PM_CTRL)
> >       pci_read_config_word(PCI_PM_CTRL)
> >     pci_update_current_state
> >       if (... || !pci_device_is_present())
> >
> > Mika is working on some delays for the transition out of D3cold [1].
> > He's more concerned with a secondary bus behind a bridge, so I don't
> > think his patch addresses this case, but he's certainly familiar with
> > this area.
> >
> > Huh, I'm really confused about this, too. I don't understand how resume
> > ever works without any delay between platform_pci_set_power_state() and
> > the config reads in pci_raw_set_power_state(). I must be missing
> > something.
>
> There is a delay in the runtime_d3cold case, see
> __pci_start_power_transition().

I see the delay in __pci_start_power_transition(), but I don't see how
it's relevant. It's only called by pci_set_power_state(), and while
many drivers call pci_set_power_state() from legacy .resume() methods,
the pci_pm_resume_noirq() path where Vidya is seeing problems doesn't
use it.

> But overall platform_pci_power_manageable() only checks whether or
> not the platform firmware can change the power state of the device.
> If it can, it is expected to take care of any necessary delays while
> doing that (because there may be delays required by this particular
> instance of the platform firmware, beyond what is mandated by the
> PCI spec, or there may not be any need to wait at all). ...

That sounds like a reasonable argument for why firmware should be
responsible for this delay, but I don't think that's very clear in the
ACPI spec, so I wouldn't be surprised if it got missed.

Based on Vidya's backtrace, I think the resume path with problems is
this:

  pci_pm_resume_noirq
    pci_pm_default_resume_early
      pci_power_up
        if (platform_pci_power_manageable(dev))
          platform_pci_set_power_state(dev, PCI_D0)	# <-- FW delay here?
        pci_raw_set_power_state
        pci_update_current_state
          pci_device_is_present		# <-- config read returns CRS

So I think your suggestion is that Vidya's firmware should be doing
the delay inside platform_pci_set_power_state()?

Vidya, you typically work on Tegra, so I assume this is on an arm64
system? Does it have ACPI? Do you have access to the firmware
developers to ask about who they expect to do the delays?

> In any case, I'm not sure how useful it is to add delays for
> everyone in the cases in which a specific system needs a delay
> because of its own PM implementation limitations. It may be better
> to quirk such systems explicitly as long as there are not too many
> quirks in there, or we'll end up adding more and more *implicit*
> quirks in the form of general delays.

I agree, a general delay doesn't sound good. Are you thinking
something like this?

  void pci_power_up(struct pci_dev *dev)
  {
	if (platform_pci_power_manageable(dev)) {
		platform_pci_set_power_state(dev, PCI_D0);
		if (dev->XXX)
			msleep(dev->XXX);
	}
	...

We already have dev->d3_delay and d3cold_delay, so it's getting a bit
messy to keep them all straight.

Bjorn
On 11/6/2019 10:11 PM, Bjorn Helgaas wrote:
> On Tue, Nov 05, 2019 at 11:55:45AM +0100, Rafael J. Wysocki wrote:
>> On Monday, November 4, 2019 6:39:04 PM CET Bjorn Helgaas wrote:
>>> [+cc Andrew, Lukas]
>>>
>>> On Tue, Oct 15, 2019 at 05:44:47PM +0530, Vidya Sagar wrote:
>>>> On 10/15/2019 4:40 PM, Sinan Kaya wrote:
>>>>> ...
>>>>> I think the PCI core should be putting the device back into D0 state as
>>>>> one of the first actions before enumerating. Wake-up could be a
>>>>> combination of ACPI and/or PCI wake-up depending on where your device
>>>>> sits in the topology.
>>>>
>>>> Yup. It is indeed doing that as part of pci_power_up() in the pci.c file.
>>>> But what is confusing to me is the order of the calls. pci_power_up()
>>>> has the following calls in this order:
>>>> 	pci_raw_set_power_state(dev, PCI_D0);
>>>> 	pci_update_current_state(dev, PCI_D0);
>>>> But pci_raw_set_power_state() is accessing config space without calling
>>>> pci_device_is_present(), whereas pci_update_current_state(), which is
>>>> called later in the flow, is calling pci_device_is_present()...!
>>>
>>> A device should always respond to config reads unless it is in D3cold
>>> or it is initializing after a reset. IIUC you're doing a resume, not
>>> a reset, so I think your device must be coming out of D3cold. That's
>>> typically done via ACPI, and I think we are missing some kind of delay
>>> before our first config access:
>>>
>>>   pci_power_up
>>>     platform_pci_set_power_state(PCI_D0)	# eg, ACPI
>>>     pci_raw_set_power_state
>>>       pci_read_config_word(PCI_PM_CTRL)		# <-- first config access
>>>       pci_write_config_word(PCI_PM_CTRL)
>>>       pci_read_config_word(PCI_PM_CTRL)
>>>     pci_update_current_state
>>>       if (... || !pci_device_is_present())
>>>
>>> Mika is working on some delays for the transition out of D3cold [1].
>>> He's more concerned with a secondary bus behind a bridge, so I don't
>>> think his patch addresses this case, but he's certainly familiar with
>>> this area.
>>>
>>> Huh, I'm really confused about this, too. I don't understand how resume
>>> ever works without any delay between platform_pci_set_power_state() and
>>> the config reads in pci_raw_set_power_state(). I must be missing
>>> something.
>>
>> There is a delay in the runtime_d3cold case, see
>> __pci_start_power_transition().
>
> I see the delay in __pci_start_power_transition(), but I don't see how
> it's relevant. It's only called by pci_set_power_state(), and while
> many drivers call pci_set_power_state() from legacy .resume() methods,
> the pci_pm_resume_noirq() path where Vidya is seeing problems doesn't
> use it.
>
>> But overall platform_pci_power_manageable() only checks whether or
>> not the platform firmware can change the power state of the device.
>> If it can, it is expected to take care of any necessary delays while
>> doing that (because there may be delays required by this particular
>> instance of the platform firmware, beyond what is mandated by the
>> PCI spec, or there may not be any need to wait at all). ...
>
> That sounds like a reasonable argument for why firmware should be
> responsible for this delay, but I don't think that's very clear in the
> ACPI spec, so I wouldn't be surprised if it got missed.
>
> Based on Vidya's backtrace, I think the resume path with problems is
> this:
>
>   pci_pm_resume_noirq
>     pci_pm_default_resume_early
>       pci_power_up
>         if (platform_pci_power_manageable(dev))
>           platform_pci_set_power_state(dev, PCI_D0)	# <-- FW delay here?
>         pci_raw_set_power_state
>         pci_update_current_state
>           pci_device_is_present		# <-- config read returns CRS
>
> So I think your suggestion is that Vidya's firmware should be doing
> the delay inside platform_pci_set_power_state()?
>
> Vidya, you typically work on Tegra, so I assume this is on an arm64
> system? Does it have ACPI? Do you have access to the firmware
> developers to ask about who they expect to do the delays?

Yes. This is on arm64 (Tegra) and we don't have any ACPI or any other
firmware for that matter. PCIe is brought up directly in the kernel.

>
>> In any case, I'm not sure how useful it is to add delays for
>> everyone in the cases in which a specific system needs a delay
>> because of its own PM implementation limitations. It may be better
>> to quirk such systems explicitly as long as there are not too many
>> quirks in there, or we'll end up adding more and more *implicit*
>> quirks in the form of general delays.
>
> I agree, a general delay doesn't sound good. Are you thinking
> something like this?
>
>   void pci_power_up(struct pci_dev *dev)
>   {
> 	if (platform_pci_power_manageable(dev)) {
> 		platform_pci_set_power_state(dev, PCI_D0);
> 		if (dev->XXX)
> 			msleep(dev->XXX);
> 	}
> 	...
>
> We already have dev->d3_delay and d3cold_delay, so it's getting a bit
> messy to keep them all straight.
>
> Bjorn
>
On Mon, Nov 11, 2019 at 11:31:18AM +0530, Vidya Sagar wrote:
> On 11/6/2019 10:11 PM, Bjorn Helgaas wrote:
>
> > Based on Vidya's backtrace, I think the resume path with problems
> > is this:
> >
> >   pci_pm_resume_noirq
> >     pci_pm_default_resume_early
> >       pci_power_up
> >         if (platform_pci_power_manageable(dev))
> >           platform_pci_set_power_state(dev, PCI_D0)	# <-- FW delay here?
> >         pci_raw_set_power_state
> >         pci_update_current_state
> >           pci_device_is_present		# <-- config read returns CRS
> >
> > So I think your suggestion is that Vidya's firmware should be
> > doing the delay inside platform_pci_set_power_state()?
> >
> > Vidya, you typically work on Tegra, so I assume this is on an
> > arm64 system? Does it have ACPI? Do you have access to the
> > firmware developers to ask about who they expect to do the delays?
>
> Yes. This is on arm64 (Tegra) and we don't have any ACPI or any
> other firmware for that matter. PCIe is brought up directly in the
> kernel.

I assume that your device is coming out of D3cold because apparently
you're seeing a CRS status from the config read when
pci_update_current_state() calls pci_device_is_present(). CRS status
should only happen after reset or power-on from D3cold, and you're not
doing a reset.

I'm pretty sure platform_pci_power_manageable() returns false on
your system (can you confirm that?) because the only scenarios with
platform power management are MID (Intel platform) and ACPI (which you
don't have).

Maybe you have some other platform-specific mechanism that controls
power to PCI devices, and it's not integrated into the
platform_pci_*() framework?

Bjorn
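For reference, platform_pci_power_manageable() only delegates to whatever
platform PM backend registered itself, so on a DT-only system with no ACPI
it returns false; a condensed sketch (from drivers/pci/pci.c around the
v5.4 era):

	static const struct pci_platform_pm_ops *pci_platform_pm;

	static inline bool platform_pci_power_manageable(struct pci_dev *dev)
	{
		/* pci_platform_pm is set via pci_set_platform_pm(),
		 * e.g. by the ACPI or Intel MID code; with neither
		 * present it stays NULL and this returns false */
		return pci_platform_pm ?
			pci_platform_pm->is_manageable(dev) : false;
	}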
On Mon, Nov 11, 2019 at 04:32:35PM -0600, Bjorn Helgaas wrote:
> On Mon, Nov 11, 2019 at 11:31:18AM +0530, Vidya Sagar wrote:
> > On 11/6/2019 10:11 PM, Bjorn Helgaas wrote:
> > > Based on Vidya's backtrace, I think the resume path with problems
> > > is this:
> > >
> > >   pci_pm_resume_noirq
> > >     pci_pm_default_resume_early
> > >       pci_power_up
> > >         if (platform_pci_power_manageable(dev))
> > >           platform_pci_set_power_state(dev, PCI_D0)	# <-- FW delay here?
> > >         pci_raw_set_power_state
> > >         pci_update_current_state
> > >           pci_device_is_present		# <-- config read returns CRS
> > >
> > > So I think your suggestion is that Vidya's firmware should be
> > > doing the delay inside platform_pci_set_power_state()?
> > >
> > > Vidya, you typically work on Tegra, so I assume this is on an
> > > arm64 system? Does it have ACPI? Do you have access to the
> > > firmware developers to ask about who they expect to do the delays?
> >
> > Yes. This is on arm64 (Tegra) and we don't have any ACPI or any
> > other firmware for that matter. PCIe is brought up directly in the
> > kernel.
>
> I assume that your device is coming out of D3cold because apparently
> you're seeing a CRS status from the config read when
> pci_update_current_state() calls pci_device_is_present(). CRS status
> should only happen after reset or power-on from D3cold, and you're not
> doing a reset.
>
> I'm pretty sure platform_pci_power_manageable() returns false on
> your system (can you confirm that?) because the only scenarios with
> platform power management are MID (Intel platform) and ACPI (which you
> don't have).
>
> Maybe you have some other platform-specific mechanism that controls
> power to PCI devices, and it's not integrated into the
> platform_pci_*() framework?

My understanding after reading the PCIe specification is that CRS is a
mechanism that allows an endpoint to signal that it isn't ready yet for
operation after reset or power-on from D3cold. There's nothing in there
that's platform specific. This is really only for specific endpoints.

I don't see how adding platform specific PM code would help in this
case. At a platform level we don't know if users are going to plug in a
PCI endpoint that needs a long delay before it's ready after reset and/
or exit from D3cold.

I do understand that pci_device_is_present() is perhaps not the best
place to do complex CRS handling, but if a mechanism is clearly
described in the specification, isn't it something that should be dealt
with in the core? That way we don't have to quirk this for every device
and platform.

The PCIe specification says that:

    Software that intends to take advantage of this mechanism must
    ensure that the first access made to a device following a valid
    reset condition is a Configuration Read Request accessing both
    bytes of the Vendor ID field in the device's Configuration Space
    header.

So doesn't that mean that pci_device_is_present() is already much too
late, because we've potentially made other configuration read requests
in the meantime?

Wouldn't it make more sense to push the CRS handling up a bit? The
existing pci_power_up() function seems like it would be a good place.
For example, adding code to deal with CRS right after the platform PCI
PM calls but before pci_raw_set_power_state() seems like it would fit
the restrictions given in the above quote from the specification.

Thierry
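A rough illustration of that suggestion (hypothetical, not actual kernel
code): do the CRS wait in pci_power_up() right after the platform
transition to D0, so the Vendor ID read really is the first config access.
PCI_CRS_TIMEOUT here is the 60 s value introduced by the patch under
discussion:

	void pci_power_up(struct pci_dev *dev)
	{
		u32 v;

		if (platform_pci_power_manageable(dev))
			platform_pci_set_power_state(dev, PCI_D0);

		/* hypothetical: wait out CRS before the PCI_PM_CTRL
		 * accesses done by pci_raw_set_power_state() */
		pci_bus_read_dev_vendor_id(dev->bus, dev->devfn, &v,
					   PCI_CRS_TIMEOUT);

		pci_raw_set_power_state(dev, PCI_D0);
		pci_update_current_state(dev, PCI_D0);
	}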
On Tue, Nov 12, 2019 at 01:59:23PM +0100, Thierry Reding wrote:
> On Mon, Nov 11, 2019 at 04:32:35PM -0600, Bjorn Helgaas wrote:
> > On Mon, Nov 11, 2019 at 11:31:18AM +0530, Vidya Sagar wrote:
> > > On 11/6/2019 10:11 PM, Bjorn Helgaas wrote:
> > > > Based on Vidya's backtrace, I think the resume path with problems
> > > > is this:
> > > >
> > > >   pci_pm_resume_noirq
> > > >     pci_pm_default_resume_early
> > > >       pci_power_up
> > > >         if (platform_pci_power_manageable(dev))
> > > >           platform_pci_set_power_state(dev, PCI_D0)	# <-- FW delay here?
> > > >         pci_raw_set_power_state
> > > >         pci_update_current_state
> > > >           pci_device_is_present		# <-- config read returns CRS
> > > >
> > > > So I think your suggestion is that Vidya's firmware should be
> > > > doing the delay inside platform_pci_set_power_state()?
> > > >
> > > > Vidya, you typically work on Tegra, so I assume this is on an
> > > > arm64 system? Does it have ACPI? Do you have access to the
> > > > firmware developers to ask about who they expect to do the delays?
> > >
> > > Yes. This is on arm64 (Tegra) and we don't have any ACPI or any
> > > other firmware for that matter. PCIe is brought up directly in the
> > > kernel.
> >
> > I assume that your device is coming out of D3cold because apparently
> > you're seeing a CRS status from the config read when
> > pci_update_current_state() calls pci_device_is_present(). CRS status
> > should only happen after reset or power-on from D3cold, and you're not
> > doing a reset.
> >
> > I'm pretty sure platform_pci_power_manageable() returns false on
> > your system (can you confirm that?) because the only scenarios with
> > platform power management are MID (Intel platform) and ACPI (which you
> > don't have).
> >
> > Maybe you have some other platform-specific mechanism that controls
> > power to PCI devices, and it's not integrated into the
> > platform_pci_*() framework?
>
> My understanding after reading the PCIe specification is that CRS is a
> mechanism that allows an endpoint to signal that it isn't ready yet for
> operation after reset or power-on from D3cold. There's nothing in there
> that's platform specific. This is really only for specific endpoints.
>
> I don't see how adding platform specific PM code would help in this
> case. At a platform level we don't know if users are going to plug in a
> PCI endpoint that needs a long delay before it's ready after reset and/
> or exit from D3cold.

Right, see below.

> I do understand that pci_device_is_present() is perhaps not the best
> place to do complex CRS handling, but if a mechanism is clearly
> described in the specification, isn't it something that should be dealt
> with in the core? That way we don't have to quirk this for every device
> and platform.

Definitely; we don't want quirks for endpoints (unless they're
actually broken) or for platforms (unless there's a platform hardware
or firmware defect).

There's no question that we need to delay and handle CRS after
power-on from D3cold. I'm trying to get at the point that PCI itself
doesn't tell us how to do that power-on. The mechanisms defined by
PCI rely on config space, which is only accessible in D0-D3hot, not
D3cold. Power-on from D3cold can only happen via ACPI methods or
other platform-specific mechanisms, and the current design abstracts
those via platform_pci_set_power_state(). This is partly based on
Rafael's response in [1].

> The PCIe specification says that:
>
>     Software that intends to take advantage of this mechanism must
>     ensure that the first access made to a device following a valid
>     reset condition is a Configuration Read Request accessing both
>     bytes of the Vendor ID field in the device's Configuration Space
>     header.
>
> So doesn't that mean that pci_device_is_present() is already much too
> late, because we've potentially made other configuration read requests
> in the meantime?
>
> Wouldn't it make more sense to push the CRS handling up a bit? The
> existing pci_power_up() function seems like it would be a good place.
> For example, adding code to deal with CRS right after the platform PCI
> PM calls but before pci_raw_set_power_state() seems like it would fit
> the restrictions given in the above quote from the specification.

Yep, I think that's the right point. I'm trying to figure out how to
integrate it. Rafael suggests that delays may be platform-specific
and should be in platform_pci_set_power_state(), but the CRS handling
itself isn't platform-specific and maybe could be higher up.

I'm fishing to see if Tegra has some kind of power control for
endpoints that is not related to the platform_pci_*() framework. How
did the endpoint get put in D3cold in the first place? I assume
something in the suspend path did that? Maybe this happens when we
suspend the Tegra RC itself, e.g., tegra_pcie_pm_suspend()?

  tegra_pcie_pm_suspend
    tegra_pcie_phy_power_off
    tegra_pcie_power_off

  tegra_pcie_pm_resume
    tegra_pcie_power_on
    tegra_pcie_phy_power_on

If a path like tegra_pcie_pm_resume() is causing the D3cold -> D0
transition for the endpoint, I don't think we want to do CRS handling
there because that path shouldn't be touching the endpoint. But maybe
it should be doing the delays required by PCIe r5.0, sec 6.6.1, before
any config accesses are issued to devices.

[1] https://lore.kernel.org/r/11429373.7ySiFsEkgL@kreacher
On 11/12/2019 4:02 AM, Bjorn Helgaas wrote:
> On Mon, Nov 11, 2019 at 11:31:18AM +0530, Vidya Sagar wrote:
>> On 11/6/2019 10:11 PM, Bjorn Helgaas wrote:
>
>>> Based on Vidya's backtrace, I think the resume path with problems
>>> is this:
>>>
>>>   pci_pm_resume_noirq
>>>     pci_pm_default_resume_early
>>>       pci_power_up
>>>         if (platform_pci_power_manageable(dev))
>>>           platform_pci_set_power_state(dev, PCI_D0)	# <-- FW delay here?
>>>         pci_raw_set_power_state
>>>         pci_update_current_state
>>>           pci_device_is_present		# <-- config read returns CRS
>>>
>>> So I think your suggestion is that Vidya's firmware should be
>>> doing the delay inside platform_pci_set_power_state()?
>>>
>>> Vidya, you typically work on Tegra, so I assume this is on an
>>> arm64 system? Does it have ACPI? Do you have access to the
>>> firmware developers to ask about who they expect to do the delays?
>>
>> Yes. This is on arm64 (Tegra) and we don't have any ACPI or any
>> other firmware for that matter. PCIe is brought up directly in the
>> kernel.
>
> I assume that your device is coming out of D3cold because apparently
> you're seeing a CRS status from the config read when
> pci_update_current_state() calls pci_device_is_present(). CRS status
> should only happen after reset or power-on from D3cold, and you're not
> doing a reset.
>
> I'm pretty sure platform_pci_power_manageable() returns false on
> your system (can you confirm that?) because the only scenarios with
> platform power management are MID (Intel platform) and ACPI (which you
> don't have).

Yes. I can confirm that platform_pci_power_manageable() is false in the
case of Tegra.

>
> Maybe you have some other platform-specific mechanism that controls
> power to PCI devices, and it's not integrated into the
> platform_pci_*() framework?
>
> Bjorn
>
On 11/12/2019 7:51 PM, Bjorn Helgaas wrote: > On Tue, Nov 12, 2019 at 01:59:23PM +0100, Thierry Reding wrote: >> On Mon, Nov 11, 2019 at 04:32:35PM -0600, Bjorn Helgaas wrote: >>> On Mon, Nov 11, 2019 at 11:31:18AM +0530, Vidya Sagar wrote: >>>> On 11/6/2019 10:11 PM, Bjorn Helgaas wrote: >>> >>>>> Based on Vidya's backtrace, I think the resume path with problems >>>>> is this: >>>>> >>>>> pci_pm_resume_noirq >>>>> pci_pm_default_resume_early >>>>> pci_power_up >>>>> if (platform_pci_power_manageable(dev)) >>>>> platform_pci_set_power_state(dev, PCI_D0) # <-- FW delay here? >>>>> pci_raw_set_power_state >>>>> pci_update_current_state >>>>> pci_device_is_present # <-- config read returns CRS >>>>> >>>>> So I think your suggestion is that Vidya's firmware should be >>>>> doing the delay inside platform_pci_set_power_state()? >>>>> >>>>> Vidya, you typically work on Tegra, so I assume this is on an >>>>> arm64 system? Does it have ACPI? Do you have access to the >>>>> firmware developers to ask about who they expect to do the delays? >>>> >>>> Yes. This is on arm64 (Tegra) and we don't have any ACPI or any >>>> other firmware for that matter. PCIe is brought up directly in the >>>> kernel. >>> >>> I assume that your device is coming out of D3cold because apparently >>> you're seeing a CRS status from the config read when >>> pci_update_current_state() calls pci_device_is_present(). CRS status >>> should only happen after reset or power-on from D3cold, and you're not >>> doing a reset. >>> >>> I'm pretty sure platform_pci_power_manageable() returns false on >>> your system (can you confirm that?) because the only scenarios with >>> platform power management are MID (Intel platform) and ACPI (which you >>> don't have). >>> >>> Maybe you have some other platform-specific mechanism that controls >>> power to PCI devices, and it's not integrated into the >>> platform_pci_*() framework? >> >> My understanding after reading the PCIe specification is that CRS is a >> mechanism that allows an endpoint to signal that it isn't ready yet for >> operation after reset or power-on from D3cold. There's nothing in there >> that's platform specific. This is really only for specific endpoints. >> >> I don't see how adding platform specific PM code would help in this >> case. At a platform level we don't know if users are going to plug in a >> PCI endpoint that needs a long delay before it's ready after reset and/ >> or exit from D3cold. > > Right, see below. > >> I do understand that perhaps pci_device_is_present() is perhaps not the >> best place to do complex CRS handling, but if a mechanism is clearly >> described in the specification, isn't it something that should be dealt >> with in the core? That way we don't have to quirk this for every device >> and platform. > > Definitely; we don't want quirks for endpoints (unless they're > actually broken) or for platforms (unless there's a platform hardware > or firmware defect). > > There's no question that we need to delay and handle CRS after > power-on from D3cold. I'm trying to get at the point that PCI itself > doesn't tell us how to do that power-on. The mechanisms defined by > PCI rely on config space, which is only accessible in D0-D3hot, not > D3cold. Power-on from D3cold can only happen via ACPI methods or > other platform-specific mechanisms, and the current design abstracts > those via platform_pci_set_power_state(). This is partly based on > Rafael's response in [1]. 
> >> The PCIe specification says that: >> >> Software that intends to take advantage of this mechanism must >> ensure that the first access made to a device following a valid >> reset condition is a Configuration Read Request accessing both >> bytes of the Vendor ID field in the device's Configuration Space >> header. >> >> So doesn't that mean that pci_device_is_present() is already much too >> late because we've potentially made other configuration read requests in >> the meantime? >> >> Wouldn't it make more sense to push the CRS handling up a bit? The >> existing pci_power_up() function seems like it would be a good place. >> For example, adding code to deal with CRS right after the platform PCI >> PM calls but before pci_raw_set_power_state() seems like it would fit >> the restrictions given in the above quote from the specification. > > Yep, I think that's the right point. I'm trying to figure out how to > integrate it. Rafael suggests that delays may be platform-specific > and should be in platform_pci_set_power_state(), but the CRS handling > itself isn't platform-specific and maybe could be higher up. > > I'm fishing to see if Tegra has some kind of power control for > endpoints that is not related to the platform_pci_*() framework. How > did the endpoint get put in D3cold in the first place? I assume > something in the suspend path did that? Maybe this happens when we > suspend the Tegra RC itself, e.g., tegra_pcie_pm_suspend()? > > tegra_pcie_pm_suspend > tegra_pcie_phy_power_off > tegra_pcie_power_off > > tegra_pcie_pm_resume > tegra_pcie_power_on > tegra_pcie_phy_power_on > > If a path like tegra_pcie_pm_resume() is causing the D3cold -> D0 > transition for the endpoint, I don't think we want to do CRS handling > there because that path shouldn't be touching the endpoint. But maybe > it should be doing the delays required by PCIe r5.0, sec 6.6.1, before > any config accesses are issued to devices. Here, I'm exercising the suspend-to-RAM sequence, and the PCIe endpoint of concern is an Intel 750 NVMe drive. The PCIe host controller driver already has the 100ms delay required by the spec, but this particular device takes 1023ms to become ready to respond to configuration space requests (until then, it responds with Configuration Request Retry Status). I've put a dump_stack() in to see the path the resume sequence takes, and here it is: Call trace: dump_backtrace+0x0/0x158 show_stack+0x14/0x20 dump_stack+0xb0/0xf4 pci_bus_generic_read_dev_vendor_id+0x19c/0x1a0 pci_bus_read_dev_vendor_id+0x48/0x70 pci_update_current_state+0x68/0xd8 pci_power_up+0x40/0x50 pci_pm_resume_noirq+0x7c/0x138 dpm_run_callback.isra.16+0x20/0x70 device_resume_noirq+0x120/0x238 async_resume_noirq+0x24/0x58 async_run_entry_fn+0x40/0x148 process_one_work+0x1e8/0x360 worker_thread+0x40/0x488 kthread+0x118/0x120 ret_from_fork+0x10/0x1c pci 0005:01:00.0: ready after 1023ms The spec also mentions the following: Unless Readiness Notifications mechanisms are used (see Section 6.23), the Root Complex and/or system software must allow at least 1.0 s after a Conventional Reset of a device, before it may determine that a device which fails to return a Successful Completion status for a valid Configuration Request is a broken device. This period is independent of how quickly Link training completes. My understanding is that this 1 s wait is supposed to be done by the PCIe subsystem and not the host controller driver. FWIW, this particular device takes a little more than 1 s (i.e. 1023 ms). I'm now wondering: why does the CRS handling have a timeout of 60 s rather than just 1 s? > > [1] https://lore.kernel.org/r/11429373.7ySiFsEkgL@kreacher >
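For readers following the CRS discussion above: with CRS Software Visibility enabled, the root complex synthesizes 0x0001 in the Vendor ID field when a Vendor ID config read gets a CRS completion (0x0001 is a reserved Vendor ID, so it cannot collide with a real device), which is why the spec wants the first access after reset to target the Vendor ID. A minimal sketch of the detection, modeled loosely on the pci_bus_crs_vendor_id() helper in drivers/pci/probe.c (crs_pending is an illustrative name, not the kernel's):

#include <linux/pci.h>

/*
 * A Vendor ID read completed with CRS is synthesized by the root
 * complex as 0x0001 when CRS Software Visibility is enabled; reads
 * of other registers are typically synthesized as all ones instead.
 */
static bool crs_pending(u32 vendor_device)
{
	return (vendor_device & 0xffff) == 0x0001;
}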
On Tue, Nov 12, 2019 at 11:29:55PM +0530, Vidya Sagar wrote: > [...] > Here, I'm exercising the suspend-to-RAM sequence, and the PCIe endpoint of concern is an Intel 750 NVMe drive. > The PCIe host controller driver already has the 100ms delay required by the spec, To make this more concrete, where specifically is this delay?
> but this particular device takes 1023ms to become ready to respond to configuration space requests (until then, it responds with Configuration Request Retry Status). > [...] > pci 0005:01:00.0: ready after 1023ms > > The spec also mentions the following: > Unless Readiness Notifications mechanisms are used (see Section 6.23), the Root Complex and/or system software must allow at least 1.0 s after a Conventional Reset of a device, before it may determine that a device which fails to return a Successful Completion status for a valid Configuration Request is a broken device. This period is independent of how quickly Link training completes. > > My understanding is that this 1 s wait is supposed to be done by the PCIe subsystem and not the host controller driver. That doesn't say we must wait 1s; it only says we can't decide the device is broken before 1s. But regardless, I agree that the CRS handling doesn't belong in the driver for either the endpoint or the PCIe host controller. My question is whether this wait should be connected somehow with platform_pci_set_power_state(). It sounds like the tegra host controller driver already does the platform-specific delays, and I'm not sure it's reasonable for platform_pci_set_power_state() to do the CRS polling. Maybe something like this? I'd really like to get Rafael's thinking here. diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index e7982af9a5d8..052fa316c917 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -964,9 +964,14 @@ void pci_refresh_power_state(struct pci_dev *dev) */ void pci_power_up(struct pci_dev *dev) { + pci_power_t prev_state = dev->current_state; + if (platform_pci_power_manageable(dev)) platform_pci_set_power_state(dev, PCI_D0); + if (prev_state == PCI_D3cold) + pci_dev_wait(dev, "D3cold->D0", PCIE_RESET_READY_POLL_MS); + pci_raw_set_power_state(dev, PCI_D0); pci_update_current_state(dev, PCI_D0); } > FWIW, this particular device takes a little more than 1 s (i.e. 1023 ms). > I'm now wondering: why does the CRS handling have a timeout of 60 s rather than just 1 s? pci_bus_wait_crs() does an exponential backoff, i.e., it sleeps 1ms, 2ms, 4ms, 8ms, ..., 512ms between reads, so we don't know *exactly* when the device became ready. All we know is that it became ready somewhere between the last two reads, i.e., between 511ms and 1023ms. But 821cdad5c46c ("PCI: Wait up to 60 seconds for device to become ready after FLR") does suggest that the Intel 750 NVMe may require more than 1s. I think the 60s timeout is just a convenient large number and I'm not sure it's derived from anything in the spec.
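The backoff Bjorn describes looks roughly like the following (a trimmed sketch of the pci_bus_wait_crs() loop in drivers/pci/probe.c, reusing the crs_pending() check sketched earlier; the messages it prints are omitted):

#include <linux/delay.h>
#include <linux/pci.h>

static bool wait_for_device_ready(struct pci_bus *bus, int devfn,
				  u32 *l, int timeout)
{
	int delay = 1;

	while (crs_pending(*l)) {
		if (delay > timeout)
			return false;	/* give up; device may be broken */

		msleep(delay);
		delay *= 2;		/* exponential backoff */
		pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, l);
	}

	return true;
}

The cumulative sleep before the Nth re-read is 2^N - 1 ms (1 + 2 + ... + 512 = 1023), which is where the "ready after 1023ms" message in Vidya's log comes from.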
On 11/13/2019 12:28 AM, Bjorn Helgaas wrote: > On Tue, Nov 12, 2019 at 11:29:55PM +0530, Vidya Sagar wrote: >> [...] >> The PCIe host controller driver already has the 100ms delay required by the spec, > To make this more concrete, where specifically is this delay? It is after PERST# is deasserted and before the first check for DL_Active+: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/controller/dwc/pcie-tegra194.c?h=v5.4-rc7#n816 Also, dw_pcie_wait_for_link() has a maximum timeout of 1 s before bailing out. Anyway, in this case the link comes up within 100 ms, but config space is not accessible because the device returns CRS. > [...] > Maybe something like this? I'd really like to get Rafael's thinking here. > > void pci_power_up(struct pci_dev *dev) > { > + pci_power_t prev_state = dev->current_state; > + > if (platform_pci_power_manageable(dev)) > platform_pci_set_power_state(dev, PCI_D0); > > + if (prev_state == PCI_D3cold) > + pci_dev_wait(dev, "D3cold->D0", PCIE_RESET_READY_POLL_MS); > + > pci_raw_set_power_state(dev, PCI_D0); > pci_update_current_state(dev, PCI_D0); > } I tried out the above patch that you came up with. A couple of points here.
In Tegra194, we put the endpoint into D0 explicitly (instead of leaving it in D3hot) and then go for the link transition to L2 (for Tegra-specific reasons). Even if I comment out the function in pcie-tegra194.c that puts the devices back into D0 (from D3hot) just before the L2 transition, the device ends up in D3hot, not D3cold, so the if (prev_state == PCI_D3cold) check didn't really work. Also, I'm wondering why there should be a check in the first place: why can't pci_dev_wait() be called always? > >>> [1] https://lore.kernel.org/r/11429373.7ySiFsEkgL@kreacher >
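For context on the 100 ms delay Vidya points to, the host-controller side typically follows this pattern: deassert PERST#, wait the 100 ms required by PCIe r5.0 sec 6.6.1 before issuing config requests, then poll for the link to come up. A sketch under those assumptions (deassert_perst() is a hypothetical stand-in for the Tegra194 register sequence; dw_pcie_wait_for_link() is the real DesignWare helper):

#include <linux/delay.h>
#include "pcie-designware.h"

/* Hypothetical helper: the real PERST# handling is platform-specific
 * register/GPIO code in pcie-tegra194.c. */
static void deassert_perst(struct dw_pcie *pci);

static int host_bring_up_link(struct dw_pcie *pci)
{
	deassert_perst(pci);	/* release fundamental reset */

	/* PCIe r5.0 sec 6.6.1: wait 100 ms after reset release before
	 * the first configuration request to the device. */
	msleep(100);

	/* Polls for link-up, giving up after roughly 1 s. */
	return dw_pcie_wait_for_link(pci);
}

Note this only guarantees the 100 ms minimum; a device like the Intel 750 that keeps returning CRS for another second after link-up still needs the config-level wait discussed above.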
On Tue, Nov 12, 2019 at 12:58:44PM -0600, Bjorn Helgaas wrote: > [...] > My question is whether this wait should be connected somehow with > platform_pci_set_power_state(). It sounds like the tegra host > controller driver already does the platform-specific delays, and I'm > not sure it's reasonable for platform_pci_set_power_state() to do the > CRS polling. Maybe something like this? I'd really like to get > Rafael's thinking here. > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > index e7982af9a5d8..052fa316c917 100644 > --- a/drivers/pci/pci.c > +++ b/drivers/pci/pci.c > @@ -964,9 +964,14 @@ void pci_refresh_power_state(struct pci_dev *dev) > */ > void pci_power_up(struct pci_dev *dev) > { > + pci_power_t prev_state = dev->current_state; > + > if (platform_pci_power_manageable(dev)) > platform_pci_set_power_state(dev, PCI_D0); > > + if (prev_state == PCI_D3cold) > + pci_dev_wait(dev, "D3cold->D0", PCIE_RESET_READY_POLL_MS); Is there any reason in particular why you chose to call pci_dev_wait()? It seems to me like that's a little broader than pci_bus_wait_crs(). The latter is static in drivers/pci/probe.c so we'd need to change that in order to use it from drivers/pci/pci.c, but it sounds like the more explicit thing to do. Thierry > + > pci_raw_set_power_state(dev, PCI_D0); > pci_update_current_state(dev, PCI_D0); > }
On Wed, Nov 13, 2019 at 12:20:43PM +0100, Thierry Reding wrote: > On Tue, Nov 12, 2019 at 12:58:44PM -0600, Bjorn Helgaas wrote: > > [...] > > + if (prev_state == PCI_D3cold) > > + pci_dev_wait(dev, "D3cold->D0", PCIE_RESET_READY_POLL_MS); > Is there any reason in particular why you chose to call pci_dev_wait()? > It seems to me like that's a little broader than pci_bus_wait_crs(). The > latter is static in drivers/pci/probe.c so we'd need to change that in > order to use it from drivers/pci/pci.c, but it sounds like the more > explicit thing to do. Broader in what sense? They work essentially identically except that pci_bus_wait_crs() doesn't need a pci_dev * (because it's used during enumeration when we don't have a pci_dev yet) and it does dword reads instead of word reads. It is a shame that the logic is duplicated, but we don't have to worry about that here. I think I would stick with pci_dev_wait() in this case since we do have a pci_dev * and it's a little simpler, unless I'm missing the advantage of pci_bus_wait_crs(). Bjorn
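The pci_dev_wait() loop Bjorn prefers is essentially the same backoff, but it polls the Command register through an existing pci_dev *, relying on the root complex synthesizing all-ones data for CRS completions rather than on the CRS SV Vendor ID value. A trimmed sketch of the drivers/pci/pci.c logic (error reporting omitted):

#include <linux/delay.h>
#include <linux/pci.h>

static int dev_wait_ready(struct pci_dev *dev, int timeout)
{
	int delay = 1;
	u32 id;

	/* While the device returns CRS, the read completes as ~0. */
	pci_read_config_dword(dev, PCI_COMMAND, &id);
	while (id == ~0) {
		if (delay > timeout)
			return -ENOTTY;	/* never became ready */

		msleep(delay);
		delay *= 2;
		pci_read_config_dword(dev, PCI_COMMAND, &id);
	}

	return 0;
}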
On 11/15/2019 12:06 AM, Bjorn Helgaas wrote: > On Wed, Nov 13, 2019 at 12:20:43PM +0100, Thierry Reding wrote: >> On Tue, Nov 12, 2019 at 12:58:44PM -0600, Bjorn Helgaas wrote: >>> [...] >>> + if (prev_state == PCI_D3cold) >>> + pci_dev_wait(dev, "D3cold->D0", PCIE_RESET_READY_POLL_MS); >> Is there any reason in particular why you chose to call pci_dev_wait()? >> It seems to me like that's a little broader than pci_bus_wait_crs(). > [...] > I think I would stick with pci_dev_wait() in this case since we do > have a pci_dev * and it's a little simpler, unless I'm missing the > advantage of pci_bus_wait_crs(). Is there any specific reason why there should be a check for the state? In the Tegra series, I observe that, by the time execution reaches this point, prev_state is PCI_D3hot, and on Tegra194 in particular it is PCI_D0, because the host controller driver explicitly keeps the downstream devices in D0 as a workaround for one of the Tegra194-specific issues. So I feel the check for current_state may not be needed here(?) - Vidya Sagar > > Bjorn >
On Fri, Nov 15, 2019 at 03:34:23PM +0530, Vidya Sagar wrote: > [...] > Is there any specific reason why there should be a check for the state? > In the Tegra series, I observe that, by the time execution reaches this point, > prev_state is PCI_D3hot, and on Tegra194 in particular it is PCI_D0, because > the host controller driver explicitly keeps the downstream devices in D0 as a > workaround for one of the Tegra194-specific issues. I think you're right, we probably should not try to check "prev_state" here because we can't rely on that being accurate. On Tegra, I assume suspend puts the PCIe devices in D3hot; then when we suspend the RC itself, it looks like tegra_pcie_pm_suspend() actually turns off the power, so the PCIe devices probably go to D3cold, but nothing updates their dev->current_state, so it's probably still PCI_D3hot. On Tegra194, the same probably happens, except that when we suspend the RC itself, tegra_pcie_downstream_dev_to_D0() puts the PCIe devices back in D0 (updating their dev->current_state to PCI_D0), and then we turn off the power, so they probably also end up in D3cold but with dev->current_state still set to PCI_D0. > So I feel the check for current_state may not be needed here(?) I think you're right. We can't tell what dev->current_state is when we enter pci_power_up(). Bjorn
On 11/16/2019 4:06 AM, Bjorn Helgaas wrote: > On Fri, Nov 15, 2019 at 03:34:23PM +0530, Vidya Sagar wrote: >> [...] >> So I feel the check for current_state may not be needed here(?) > > I think you're right. We can't tell what dev->current_state is when > we enter pci_power_up(). Thanks, I'll push a change with your suggested modifications. - Vidya Sagar > > Bjorn >
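Given that conclusion, the follow-up change presumably drops the prev_state check and waits unconditionally. Purely as a hypothetical sketch of that shape (this is not the patch Vidya posted; pci_dev_wait() is already a static helper in drivers/pci/pci.c):

/* Hypothetical follow-up, per the discussion above: always wait,
 * since dev->current_state cannot be trusted at this point. */
void pci_power_up(struct pci_dev *dev)
{
	if (platform_pci_power_manageable(dev))
		platform_pci_set_power_state(dev, PCI_D0);

	/* The device may be coming out of D3cold; wait until it stops
	 * returning CRS, up to PCIE_RESET_READY_POLL_MS. */
	pci_dev_wait(dev, "D0", PCIE_RESET_READY_POLL_MS);

	pci_raw_set_power_state(dev, PCI_D0);
	pci_update_current_state(dev, PCI_D0);
}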