Message ID | 20170816193303.GA14147@ulmo (mailing list archive) |
---|---|
State | New, archived |
Delegated to: | Bjorn Helgaas |
On Wed, Aug 16, 2017 at 09:33:03PM +0200, Thierry Reding wrote:
> On Tue, Aug 15, 2017 at 12:03:31PM -0500, Bjorn Helgaas wrote:
> > On Tue, Aug 15, 2017 at 11:24:48PM +0800, Ding Tianhong wrote:
> > > Eric reported an oops when booting the system after applying
> > > the commit a99b646afa8a ("PCI: Disable PCIe Relaxed..."):
> > > ...
> >
> > > It looks like the pci_find_pcie_root_port() was trying to
> > > find the Root Port for the PCI device which is the Root
> > > Port already, it will return NULL and trigger the problem,
> > > so check the highest_pcie_bridge to fix this problem.
> >
> > The problem was actually with a Root Complex Integrated Endpoint that
> > has no upstream PCIe device:
> >
> > 00:05.2 System peripheral: Intel Corporation Device 0e2a (rev 04)
> >     Subsystem: Intel Corporation Device 0e2a
> >     Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
> >     Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> >     Capabilities: [40] Express (v2) Root Complex Integrated Endpoint, MSI 00
> >         DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
> >             ExtTag- RBE- FLReset-
> >         DevCtl: Report errors: Correctable- Non-Fatal- Fatal+ Unsupported+
> >             RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
> >             MaxPayload 128 bytes, MaxReadReq 128 bytes
>
> I've started seeing this crash on Tegra K1 as well.
> Here's the device for which it oopses:
>
> 00:02.0 PCI bridge: NVIDIA Corporation TegraK1 PCIe x1 Bridge (rev a1) (prog-if 00 [Normal decode])
>     Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
>     Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>     Latency: 0, Cache Line Size: 64 bytes
>     Interrupt: pin A routed to IRQ 391
>     Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
>     I/O behind bridge: 00001000-00001fff [size=4K]
>     Memory behind bridge: 13000000-130fffff [size=1M]
>     Prefetchable memory behind bridge: 0000000020000000-00000000200fffff [size=1M]
>     Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
>     BridgeCtl: Parity+ SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
>         PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
>     Capabilities: [40] Subsystem: NVIDIA Corporation TegraK1 PCIe x1 Bridge
>     Capabilities: [48] Power Management version 3
>         Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
>         Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
>     Capabilities: [50] MSI: Enable+ Count=1/2 Maskable- 64bit+
>         Address: 000000fcfffff000 Data: 0000
>     Capabilities: [60] HyperTransport: MSI Mapping Enable- Fixed-
>         Mapping Address Base: 00000000fee00000
>     Capabilities: [80] Express (v2) Root Port (Slot+), MSI 00
>         DevCap: MaxPayload 128 bytes, PhantFunc 0
>             ExtTag+ RBE+
>         DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
>             RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
>             MaxPayload 128 bytes, MaxReadReq 512 bytes
>         DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
>         LnkCap: Port #1, Speed 5GT/s, Width x1, ASPM L0s, Exit Latency L0s <512ns
>             ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp-
>         LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
>             ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>         LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
>         SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
>             Slot #0, PowerLimit 0.000W; Interlock- NoCompl-
>         SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
>             Control: AttnInd Off, PwrInd On, Power- Interlock-
>         SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
>             Changed: MRL- PresDet+ LinkState+
>         RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
>         RootCap: CRSVisible-
>         RootSta: PME ReqID 0000, PMEStatus- PMEPending-
>         DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF Not Supported ARIFwd-
>             AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
>         DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd-
>             AtomicOpsCtl: ReqEn- EgressBlck-
>         LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
>             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
>             Compliance De-emphasis: -6dB
>         LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
>             EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>     Kernel driver in use: pcieport
>
> > > Fixes: a99b646afa8a ("PCI: Disable PCIe Relaxed Ordering if unsupported")
> >
> > This also
> >
> > Fixes: c56d4450eb68 ("PCI: Turn off Request Attributes to avoid Chelsio T5 Completion erratum")
> >
> > which added pci_find_pcie_root_port().  Prior to this Relaxed Ordering
> > series, we only used pci_find_pcie_root_port() in a Chelsio quirk that
> > only applied to non-integrated endpoints, so we didn't trip over the
> > bug.
> >
> > > Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
> > > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> > > Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
> > > ---
> > >  drivers/pci/pci.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > > index af0cc34..7e2022f 100644
> > > --- a/drivers/pci/pci.c
> > > +++ b/drivers/pci/pci.c
> > > @@ -522,7 +522,8 @@ struct pci_dev *pci_find_pcie_root_port(struct pci_dev *dev)
> > >  		bridge = pci_upstream_bridge(bridge);
> > >  	}
> > >
> > > -	if (pci_pcie_type(highest_pcie_bridge) != PCI_EXP_TYPE_ROOT_PORT)
> > > +	if (highest_pcie_bridge &&
> > > +	    pci_pcie_type(highest_pcie_bridge) != PCI_EXP_TYPE_ROOT_PORT)
> > >  		return NULL;
> > >
> > >  	return highest_pcie_bridge;
> > > --
> >
> > I think structuring the fix as follows is a little more readable:
> >
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index af0cc3456dc1..587cd7623ed8 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -522,10 +522,11 @@ struct pci_dev *pci_find_pcie_root_port(struct pci_dev *dev)
> >  		bridge = pci_upstream_bridge(bridge);
> >  	}
> >
> > -	if (pci_pcie_type(highest_pcie_bridge) != PCI_EXP_TYPE_ROOT_PORT)
> > -		return NULL;
> > +	if (highest_pcie_bridge &&
> > +	    pci_pcie_type(highest_pcie_bridge) == PCI_EXP_TYPE_ROOT_PORT)
> > +		return highest_pcie_bridge;
> >
> > -	return highest_pcie_bridge;
> > +	return NULL;
> >  }
> >  EXPORT_SYMBOL(pci_find_pcie_root_port);
>
> In case of Tegra, dev actually points to the root port. Now if I read
> the above code correctly, highest_pcie_bridge will still be NULL in that
> case, which in turn will return NULL from pci_find_pcie_root_port(). But
> shouldn't it really return dev?
>
> The patch that I used to fix the issue is this:
>
> --->8---
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 2c712dcfd37d..dd56c1c05614 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -514,7 +514,7 @@ EXPORT_SYMBOL(pci_find_resource);
>   */
>  struct pci_dev *pci_find_pcie_root_port(struct pci_dev *dev)
>  {
> -	struct pci_dev *bridge, *highest_pcie_bridge = NULL;
> +	struct pci_dev *bridge, *highest_pcie_bridge = dev;
>
>  	bridge = pci_upstream_bridge(dev);
>  	while (bridge && pci_is_pcie(bridge)) {
> --->8---
>
> That works correctly if this function ends up being called on the PCIe
> root port, though perhaps that's not what this function is supposed to
> do. It's somewhat unclear from the kerneldoc what the function should
> be doing when called on a root port device itself.

Your fix looks right to me.
From: Bjorn Helgaas <helgaas@kernel.org>
Date: Wed, 16 Aug 2017 15:02:37 -0500

> Your fix looks right to me.

Someone please submit this fix formally because this change is now in
Linus's tree.

Thank you.
On 2017/8/17 4:59, David Miller wrote:
> From: Bjorn Helgaas <helgaas@kernel.org>
> Date: Wed, 16 Aug 2017 15:02:37 -0500
>
>> Your fix looks right to me.
>
> Someone please submit this fix formally because this change is now in
> Linus's tree.
>

I will send it.

> Thank you.
Thierry Reding <thierry.reding@gmail.com> writes:
...
>
> In case of Tegra, dev actually points to the root port. Now if I read
> the above code correctly, highest_pcie_bridge will still be NULL in that
> case, which in turn will return NULL from pci_find_pcie_root_port(). But
> shouldn't it really return dev?
>
> The patch that I used to fix the issue is this:
>
> --->8---
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 2c712dcfd37d..dd56c1c05614 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -514,7 +514,7 @@ EXPORT_SYMBOL(pci_find_resource);
>   */
>  struct pci_dev *pci_find_pcie_root_port(struct pci_dev *dev)
>  {
> -	struct pci_dev *bridge, *highest_pcie_bridge = NULL;
> +	struct pci_dev *bridge, *highest_pcie_bridge = dev;
>
>  	bridge = pci_upstream_bridge(dev);
>  	while (bridge && pci_is_pcie(bridge)) {
> --->8---
>
> That works correctly if this function ends up being called on the PCIe
> root port, though perhaps that's not what this function is supposed to
> do. It's somewhat unclear from the kerneldoc what the function should
> be doing when called on a root port device itself.

That also works for me on powerpc (oops reported up thread).

cheers
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 2c712dcfd37d..dd56c1c05614 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -514,7 +514,7 @@ EXPORT_SYMBOL(pci_find_resource);
  */
 struct pci_dev *pci_find_pcie_root_port(struct pci_dev *dev)
 {
-	struct pci_dev *bridge, *highest_pcie_bridge = NULL;
+	struct pci_dev *bridge, *highest_pcie_bridge = dev;
 
 	bridge = pci_upstream_bridge(dev);
 	while (bridge && pci_is_pcie(bridge)) {
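
To make the three cases discussed in the thread concrete, here is a small user-space sketch of the lookup with the "= dev" initialization applied. It is not the kernel code: the struct, the enum and the name find_pcie_root_port() are hypothetical stand-ins for struct pci_dev, pci_pcie_type() and pci_find_pcie_root_port(), and the loop body is modeled on the context lines visible in the diffs above. Under those assumptions, a Root Port passed in directly is returned as-is, an endpoint below a Root Port resolves to that port, and a Root Complex Integrated Endpoint with no upstream bridge yields NULL without dereferencing a NULL pointer.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's PCIe device/port types. */
enum pcie_type {
	TYPE_ENDPOINT,
	TYPE_ROOT_PORT,
	TYPE_RC_INTEGRATED_ENDPOINT,
};

/* Hypothetical, heavily reduced model of struct pci_dev. */
struct pci_dev {
	const char *name;
	bool is_pcie;			/* models pci_is_pcie() */
	enum pcie_type type;		/* models pci_pcie_type() */
	struct pci_dev *upstream;	/* models pci_upstream_bridge(); NULL on a root bus */
};

/*
 * Model of the lookup with highest_pcie_bridge initialized to dev
 * instead of NULL, so the type check below never sees a NULL pointer.
 */
static struct pci_dev *find_pcie_root_port(struct pci_dev *dev)
{
	struct pci_dev *bridge, *highest_pcie_bridge = dev;

	/* Walk upstream for as long as the parent bridge is PCIe. */
	bridge = dev->upstream;
	while (bridge && bridge->is_pcie) {
		highest_pcie_bridge = bridge;
		bridge = bridge->upstream;
	}

	/* Only a Root Port counts; anything else means "none found". */
	if (highest_pcie_bridge->type != TYPE_ROOT_PORT)
		return NULL;

	return highest_pcie_bridge;
}
```

With the original "= NULL" initialization, the first and third cases never enter the loop, so the type check would run on a NULL pointer, which matches the oopses reported for the Tegra Root Port and the Intel integrated endpoint.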