Message ID | 20190605145820.37169-3-mika.westerberg@linux.intel.com (mailing list archive)
---|---
State | Changes Requested, archived
Series | PCI: Power management improvements
On Wed, Jun 05, 2019 at 05:58:19PM +0300, Mika Westerberg wrote:
> PME polling does not take into account that a device that is directly
> connected to the host bridge may go into D3cold as well. This leads to a
> situation where the PME poll thread reads from the config space of a
> device that is in D3cold and gets incorrect information because the
> config space is not accessible.
>
> Here is an example from an Intel Ice Lake system where two PCIe root
> ports are in D3cold (I've instrumented the kernel to log the PMCSR
> register contents):
>
> [ 62.971442] pcieport 0000:00:07.1: Check PME status, PMCSR=0xffff
> [ 62.971504] pcieport 0000:00:07.0: Check PME status, PMCSR=0xffff
>
> Since 0xffff is interpreted to mean that a PME is pending, the root
> ports will be runtime resumed. This repeats over and over again,
> essentially blocking all runtime power management.
>
> Prevent this from happening by checking whether the device is in D3cold
> before its PME status is read.

There's more broken here. The below patch fixes a PME polling race and
should also fix the issue you're witnessing; could you verify that?

The patch has been rotting on my development branch for several months;
I just didn't get around to posting it. My apologies.

-- >8 --
Subject: [PATCH] PCI / PM: Fix race on PME polling

Since commit df17e62e5bff ("PCI: Add support for polling PME state on
suspended legacy PCI devices"), the work item pci_pme_list_scan() polls
the PME status flag of devices and wakes them up if the bit is set.

The function checks whether a device's upstream bridge is in D0, because
otherwise the device is inaccessible, rendering PME polling impossible.
However, the check is racy because it is performed before polling the
device. If the upstream bridge runtime suspends to D3hot after
pci_pme_list_scan() checks its power state and before it invokes
pci_pme_wakeup(), the latter will read the PMCSR as "all ones" and
mistake it for a set PME status flag. I am seeing this race play out as
a Thunderbolt controller going to D3cold and occasionally immediately
going to D0 again because PME polling was performed at just the wrong
time.

Avoid this by checking for an "all ones" PMCSR in pci_check_pme_status().

Fixes: 58ff463396ad ("PCI PM: Add function for checking PME status of devices")
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: stable@vger.kernel.org # v2.6.34+
---
 drivers/pci/pci.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index b98a564..2e05348 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1753,6 +1753,8 @@ bool pci_check_pme_status(struct pci_dev *dev)
 	pci_read_config_word(dev, pmcsr_pos, &pmcsr);
 	if (!(pmcsr & PCI_PM_CTRL_PME_STATUS))
 		return false;
+	if (pmcsr == ~0)
+		return false;
 
 	/* Clear PME status. */
 	pmcsr |= PCI_PM_CTRL_PME_STATUS;
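A minimal user-space C sketch (not part of the patch; the constant mirrors
the kernel's PCI_PM_CTRL_PME_STATUS, everything else is purely illustrative)
of why an all-ones config read passes the existing test and why the added
guard rejects it:

#include <stdio.h>
#include <stdint.h>

#define PCI_PM_CTRL_PME_STATUS 0x8000 /* PMCSR bit 15: PME_Status */

int main(void)
{
	/* A config space read from an unreachable device returns all ones. */
	uint16_t pmcsr = 0xffff;

	/* Existing test: bit 15 is set in 0xffff, so the dead read looks
	 * like a pending PME and the device would be woken up. */
	printf("PME looks pending: %d\n", !!(pmcsr & PCI_PM_CTRL_PME_STATUS));

	/* Added guard: reject the all-ones pattern before trusting the flag. */
	printf("rejected as all ones: %d\n", pmcsr == (uint16_t)~0);

	return 0;
}

Both lines print 1, which is why the all-ones check has to run before the
PME status flag is acted on.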
On Wed, Jun 05, 2019 at 09:05:57PM +0200, Lukas Wunner wrote:
> On Wed, Jun 05, 2019 at 05:58:19PM +0300, Mika Westerberg wrote:
> > PME polling does not take into account that a device that is directly
> > connected to the host bridge may go into D3cold as well. This leads to a
> > situation where the PME poll thread reads from the config space of a
> > device that is in D3cold and gets incorrect information because the
> > config space is not accessible.
> >
> > Here is an example from an Intel Ice Lake system where two PCIe root
> > ports are in D3cold (I've instrumented the kernel to log the PMCSR
> > register contents):
> >
> > [ 62.971442] pcieport 0000:00:07.1: Check PME status, PMCSR=0xffff
> > [ 62.971504] pcieport 0000:00:07.0: Check PME status, PMCSR=0xffff
> >
> > Since 0xffff is interpreted to mean that a PME is pending, the root
> > ports will be runtime resumed. This repeats over and over again,
> > essentially blocking all runtime power management.
> >
> > Prevent this from happening by checking whether the device is in D3cold
> > before its PME status is read.
>
> There's more broken here. The below patch fixes a PME polling race and
> should also fix the issue you're witnessing; could you verify that?

It fixes the issue but I needed to tune it a bit ->

> The patch has been rotting on my development branch for several months;
> I just didn't get around to posting it. My apologies.

Better late than never :)

> -- >8 --
> Subject: [PATCH] PCI / PM: Fix race on PME polling
>
> Since commit df17e62e5bff ("PCI: Add support for polling PME state on
> suspended legacy PCI devices"), the work item pci_pme_list_scan() polls
> the PME status flag of devices and wakes them up if the bit is set.
>
> The function checks whether a device's upstream bridge is in D0, because
> otherwise the device is inaccessible, rendering PME polling impossible.
> However, the check is racy because it is performed before polling the
> device. If the upstream bridge runtime suspends to D3hot after
> pci_pme_list_scan() checks its power state and before it invokes
> pci_pme_wakeup(), the latter will read the PMCSR as "all ones" and
> mistake it for a set PME status flag. I am seeing this race play out as
> a Thunderbolt controller going to D3cold and occasionally immediately
> going to D0 again because PME polling was performed at just the wrong
> time.
>
> Avoid this by checking for an "all ones" PMCSR in pci_check_pme_status().
>
> Fixes: 58ff463396ad ("PCI PM: Add function for checking PME status of devices")
> Signed-off-by: Lukas Wunner <lukas@wunner.de>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: stable@vger.kernel.org # v2.6.34+
> ---
>  drivers/pci/pci.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index b98a564..2e05348 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1753,6 +1753,8 @@ bool pci_check_pme_status(struct pci_dev *dev)
>  	pci_read_config_word(dev, pmcsr_pos, &pmcsr);
>  	if (!(pmcsr & PCI_PM_CTRL_PME_STATUS))
>  		return false;
> +	if (pmcsr == ~0)

<- Here I needed to do

	if (pmcsr == (u16)~0)

I think it is because pmcsr is u16, so we end up comparing:

	0xffff == 0xffffffff

> +		return false;
>
>  	/* Clear PME status. */
>  	pmcsr |= PCI_PM_CTRL_PME_STATUS;
> --
> 2.20.1
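The tuning Mika describes follows from C's integer promotions: the u16
PMCSR value is promoted to int before the comparison, while ~0 is already
an int with all bits set, so the two never compare equal unless ~0 is
truncated back to 16 bits. A standalone sketch (illustrative only, using
uint16_t in place of the kernel's u16):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint16_t pmcsr = 0xffff; /* stands in for the all-ones PMCSR read */

	/* pmcsr is promoted to int (0x0000ffff); ~0 is an int with all
	 * bits set (0xffffffff for a 32-bit int), so this is false. */
	printf("pmcsr == ~0           -> %d\n", pmcsr == ~0);

	/* Truncating ~0 to 16 bits yields 0xffff, which does match. */
	printf("pmcsr == (uint16_t)~0 -> %d\n", pmcsr == (uint16_t)~0);

	return 0;
}

Compiled as ordinary C this prints 0 and then 1, which is exactly the
difference between the check as posted and the tuned one.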
On Wed, Jun 05, 2019 at 05:58:19PM +0300, Mika Westerberg wrote:
> PME polling does not take into account that a device that is directly
> connected to the host bridge may go into D3cold as well. This leads to a
> situation where the PME poll thread reads from the config space of a
> device that is in D3cold and gets incorrect information because the
> config space is not accessible.
>
> Here is an example from an Intel Ice Lake system where two PCIe root
> ports are in D3cold (I've instrumented the kernel to log the PMCSR
> register contents):
>
> [ 62.971442] pcieport 0000:00:07.1: Check PME status, PMCSR=0xffff
> [ 62.971504] pcieport 0000:00:07.0: Check PME status, PMCSR=0xffff
>
> Since 0xffff is interpreted to mean that a PME is pending, the root
> ports will be runtime resumed. This repeats over and over again,
> essentially blocking all runtime power management.
>
> Prevent this from happening by checking whether the device is in D3cold
> before its PME status is read.
>
> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>

Reviewed-by: Lukas Wunner <lukas@wunner.de>
Fixes: 71a83bd727cc ("PCI/PM: add runtime PM support to PCIe port")
Cc: stable@vger.kernel.org # v3.6+

Although the patch I've posted today (which checks for an "all ones"
read from config space) covers the issue fixed herein, your patch still
makes sense to avoid unnecessarily accessing config space in the first
place.

Thanks,

Lukas

> ---
>  drivers/pci/pci.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 87a1f902fa8e..720da09d4d73 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -2060,6 +2060,13 @@ static void pci_pme_list_scan(struct work_struct *work)
>  			 */
>  			if (bridge && bridge->current_state != PCI_D0)
>  				continue;
> +			/*
> +			 * If the device is in D3cold it should not be
> +			 * polled either.
> +			 */
> +			if (pme_dev->dev->current_state == PCI_D3cold)
> +				continue;
> +
>  			pci_pme_wakeup(pme_dev->dev, NULL);
>  		} else {
>  			list_del(&pme_dev->list);
> --
> 2.20.1
>
On Wednesday, June 5, 2019 4:58:19 PM CEST Mika Westerberg wrote:
> PME polling does not take into account that a device that is directly
> connected to the host bridge may go into D3cold as well. This leads to a
> situation where the PME poll thread reads from the config space of a
> device that is in D3cold and gets incorrect information because the
> config space is not accessible.
>
> Here is an example from an Intel Ice Lake system where two PCIe root
> ports are in D3cold (I've instrumented the kernel to log the PMCSR
> register contents):
>
> [ 62.971442] pcieport 0000:00:07.1: Check PME status, PMCSR=0xffff
> [ 62.971504] pcieport 0000:00:07.0: Check PME status, PMCSR=0xffff
>
> Since 0xffff is interpreted to mean that a PME is pending, the root
> ports will be runtime resumed. This repeats over and over again,
> essentially blocking all runtime power management.
>
> Prevent this from happening by checking whether the device is in D3cold
> before its PME status is read.
>
> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>

Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> ---
>  drivers/pci/pci.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 87a1f902fa8e..720da09d4d73 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -2060,6 +2060,13 @@ static void pci_pme_list_scan(struct work_struct *work)
>  			 */
>  			if (bridge && bridge->current_state != PCI_D0)
>  				continue;
> +			/*
> +			 * If the device is in D3cold it should not be
> +			 * polled either.
> +			 */
> +			if (pme_dev->dev->current_state == PCI_D3cold)
> +				continue;
> +
>  			pci_pme_wakeup(pme_dev->dev, NULL);
>  		} else {
>  			list_del(&pme_dev->list);
>
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 87a1f902fa8e..720da09d4d73 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2060,6 +2060,13 @@ static void pci_pme_list_scan(struct work_struct *work)
 			 */
 			if (bridge && bridge->current_state != PCI_D0)
 				continue;
+			/*
+			 * If the device is in D3cold it should not be
+			 * polled either.
+			 */
+			if (pme_dev->dev->current_state == PCI_D3cold)
+				continue;
+
 			pci_pme_wakeup(pme_dev->dev, NULL);
 		} else {
 			list_del(&pme_dev->list);
PME polling does not take into account that a device that is directly
connected to the host bridge may go into D3cold as well. This leads to a
situation where the PME poll thread reads from the config space of a
device that is in D3cold and gets incorrect information because the
config space is not accessible.

Here is an example from an Intel Ice Lake system where two PCIe root
ports are in D3cold (I've instrumented the kernel to log the PMCSR
register contents):

[ 62.971442] pcieport 0000:00:07.1: Check PME status, PMCSR=0xffff
[ 62.971504] pcieport 0000:00:07.0: Check PME status, PMCSR=0xffff

Since 0xffff is interpreted to mean that a PME is pending, the root
ports will be runtime resumed. This repeats over and over again,
essentially blocking all runtime power management.

Prevent this from happening by checking whether the device is in D3cold
before its PME status is read.

Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
---
 drivers/pci/pci.c | 7 +++++++
 1 file changed, 7 insertions(+)