Message ID | 20181206041951.22413-1-david@gibson.dropbear.id.au (mailing list archive) |
---|---
State | New, archived |
Delegated to: | Bjorn Helgaas |
Series | PCI: Add no-D3 quirk for Mellanox ConnectX-[45] |
On Thu, Dec 06, 2018 at 03:19:51PM +1100, David Gibson wrote:
> Mellanox ConnectX-5 IB cards (MT27800) seem to cause a call trace when
> unbound from their regular driver and attached to vfio-pci in order to
> pass them through to a guest.
>
> This goes away if the disable_idle_d3 option is used, so it looks like a
> problem with the hardware handling D3 state. To fix that more
> permanently, use a device quirk to disable D3 state for these devices.
>
> We do this by renaming the existing quirk_no_ata_d3() more generally and
> attaching it to the ConnectX-[45] devices (0x15b3:0x1013).
>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  drivers/pci/quirks.c | 17 +++++++++++------
>  1 file changed, 11 insertions(+), 6 deletions(-)

Hi David,

Thanks for your patch.

I would like to reproduce the call trace before moving forward, but I'm
having trouble reproducing the original issue.

I work with vfio-pci and CX-4/5 cards on a daily basis, and manually
entering D3 state just now worked for me.

Can you please post your full call trace and "lspci -s PCI_ID -vv"
output?

Thanks
On Thu, Dec 06, 2018 at 08:45:09AM +0200, Leon Romanovsky wrote:
> On Thu, Dec 06, 2018 at 03:19:51PM +1100, David Gibson wrote:
> > Mellanox ConnectX-5 IB cards (MT27800) seem to cause a call trace when
> > unbound from their regular driver and attached to vfio-pci in order to
> > pass them through to a guest.
> > [...]
>
> Hi David,
>
> Thanks for your patch.
>
> I would like to reproduce the call trace before moving forward, but I'm
> having trouble reproducing the original issue.
>
> I work with vfio-pci and CX-4/5 cards on a daily basis, and manually
> entering D3 state just now worked for me.
>
> Can you please post your full call trace and "lspci -s PCI_ID -vv"
> output?

Sorry, I may have jumped the gun on this. Using disable_idle_d3 seems to
do _something_ for these cards, but there are some other things going
wrong which are confusing the issue. This is on POWER, which might
affect the situation.

I'll get back to you once I have some more information.
Hi David,

I see you're still working on this, but if you do end up going this
direction eventually, would you mind splitting this into two patches:
1) rename the quirk to make it more generic (but not changing any
behavior), and 2) add the ConnectX devices to the quirk. That way the
ConnectX change is smaller and more easily understood/reverted/etc.

On Thu, Dec 06, 2018 at 03:19:51PM +1100, David Gibson wrote:
> Mellanox ConnectX-5 IB cards (MT27800) seem to cause a call trace when
> unbound from their regular driver and attached to vfio-pci in order to
> pass them through to a guest.
> [...]
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 4700d24e5d55..add3f516ca12 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> [...]
> @@ -3367,6 +3368,10 @@ static void mellanox_check_broken_intx_masking(struct pci_dev *pdev)
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_MELLANOX, PCI_ANY_ID,
>  			mellanox_check_broken_intx_masking);
>
> +/* Mellanox MT27800 (ConnectX-5) IB card seems to break with D3
> + * In particular this shows up when the device is bound to the vfio-pci driver */

Follow the usual multi-line comment style, i.e.,

  /*
   * text ...
   * more text ...
   */

> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_CONNECTX4, quirk_no_d3)
> +
>  static void quirk_no_bus_reset(struct pci_dev *dev)
>  {
>  	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
> --
> 2.19.2
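(For illustration, the second patch of such a split would shrink to
roughly the following - a sketch based on the posted diff, with the
comment restyled per the note above and a terminating semicolon assumed;
quirk_no_d3 and PCI_DEVICE_ID_MELLANOX_CONNECTX4 are the names the
original patch uses:)

    /*
     * Mellanox MT27800 (ConnectX-5) IB card seems to break with D3.
     * In particular this shows up when the device is bound to the
     * vfio-pci driver.
     */
    DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_MELLANOX,
                            PCI_DEVICE_ID_MELLANOX_CONNECTX4, quirk_no_d3);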
On Tue, Dec 11, 2018 at 08:01:43AM -0600, Bjorn Helgaas wrote:
> Hi David,
>
> I see you're still working on this, but if you do end up going this
> direction eventually, would you mind splitting this into two patches:
> 1) rename the quirk to make it more generic (but not changing any
> behavior), and 2) add the ConnectX devices to the quirk. That way the
> ConnectX change is smaller and more easily understood/reverted/etc.

Sure. Would it make sense to send (1) as an independent cleanup, while
I'm still working out exactly what (if anything) we need for (2)?
On Tue, Dec 11, 2018 at 6:38 PM David Gibson <david@gibson.dropbear.id.au> wrote:
> On Tue, Dec 11, 2018 at 08:01:43AM -0600, Bjorn Helgaas wrote:
> > [...]
> > would you mind splitting this into two patches: 1) rename the quirk
> > to make it more generic (but not changing any behavior), and 2) add
> > the ConnectX devices to the quirk.
>
> Sure. Would it make sense to send (1) as an independent cleanup, while
> I'm still working out exactly what (if anything) we need for (2)?

You could, but I don't think there's really much benefit in doing the
first without the second, and I think there is some value in handling
both patches at the same time.
On Thu, Dec 06, 2018 at 08:45:09AM +0200, Leon Romanovsky wrote:
> On Thu, Dec 06, 2018 at 03:19:51PM +1100, David Gibson wrote:
> > Mellanox ConnectX-5 IB cards (MT27800) seem to cause a call trace when
> > unbound from their regular driver and attached to vfio-pci in order to
> > pass them through to a guest.
> > [...]
>
> Hi David,
>
> Thanks for your patch.
>
> I would like to reproduce the call trace before moving forward, but I'm
> having trouble reproducing the original issue.
>
> I work with vfio-pci and CX-4/5 cards on a daily basis, and manually
> entering D3 state just now worked for me.

Interesting. I've investigated this further, though I don't have as
many new clues as I'd like. The problem occurs reliably, at least on
one particular type of machine (a POWER8 "Garrison" with ConnectX-4).
I don't yet know if it occurs with other machines; I'm having trouble
getting access to other machines with a suitable card. I didn't manage
to reproduce it on a different POWER8 machine with a ConnectX-5, but I
don't know whether it's the difference in machine or the difference in
card revision that's important.

So possibilities that occur to me:
 * It's something specific about how the vfio-pci driver uses D3 state
   - have you tried rebinding your device to vfio-pci?
 * It's something specific about POWER, either the kernel or the PCI
   bridge hardware
 * It's something specific about this particular type of machine

> Can you please post your full call trace and "lspci -s PCI_ID -vv"
> output?
[root@ibm-p8-garrison-01 ~]# lspci -vv -s 0008:01:00
0008:01:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
    Subsystem: IBM Device 04f1
    Physical Slot: Slot1
    Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Interrupt: pin A routed to IRQ 473
    NUMA node: 1
    Region 0: Memory at 240000000000 (64-bit, prefetchable) [size=512M]
    Capabilities: [60] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
        DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
            MaxPayload 512 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
            AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis+, LTR-, OBFF Disabled
            AtomicOpsCtl: ReqEn-
        LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
            EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
    Capabilities: [48] Vital Product Data
        Product Name: 2-port 100Gb EDR IB PCIe x16 Adapter
        Read-only fields:
            [PN] Part number: 00WT039
            [EC] Engineering changes: P40057
            [FN] Unknown: 30 30 57 54 30 37 35
            [SN] Serial number: YA50YF58P080
            [FC] Unknown: 45 43 33 46
            [CC] Unknown: 32 43 45 41
            [VK] Vendor specific: ipzSeries
            [MN] Manufacture ID: 532X4590060204
            [Z0] Unknown: 49 42 4d 32 31 39 30 31 31 30 30 33 32
            [RV] Reserved: checksum good, 0 byte(s) reserved
        End
    Capabilities: [9c] MSI-X: Enable- Count=128 Masked-
        Vector table: BAR=0 offset=00002000
        PBA: BAR=0 offset=00003000
    Capabilities: [c0] Vendor Specific Information: Len=18 <?>
    Capabilities: [40] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [100 v1] Device Serial Number ba-da-ce-55-de-ad-ca-fe
    Capabilities: [110 v1] Advanced Error Reporting
        UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
        CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        AERCap: First Error Pointer: 04, ECRCGenCap+ ECRCGenEn+ ECRCChkCap+ ECRCChkEn+
            MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
        HeaderLog: 00000000 00000000 00000000 00000000
    Capabilities: [170 v1] Alternative Routing-ID Interpretation (ARI)
        ARICap: MFVC- ACS-, Next Function: 1
        ARICtl: MFVC- ACS-, Function Group: 0
    Capabilities: [1c0 v1] #19
    Kernel driver in use: vfio-pci
    Kernel modules: mlx5_core

0008:01:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
    Subsystem: IBM Device 04f1
    Physical Slot: Slot1
    Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Interrupt: pin A routed to IRQ 473
    NUMA node: 1
    Region 0: Memory at 240020000000 (64-bit, prefetchable) [size=512M]
    Capabilities: [60] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
        DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
            MaxPayload 512 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
            AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis+, LTR-, OBFF Disabled
            AtomicOpsCtl: ReqEn-
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
            EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
    Capabilities: [48] Vital Product Data
        Product Name: 2-port 100Gb EDR IB PCIe x16 Adapter
        Read-only fields:
            [PN] Part number: 00WT039
            [EC] Engineering changes: P40057
            [FN] Unknown: 30 30 57 54 30 37 35
            [SN] Serial number: YA50YF58P080
            [FC] Unknown: 45 43 33 46
            [CC] Unknown: 32 43 45 41
            [VK] Vendor specific: ipzSeries
            [MN] Manufacture ID: 532X4590060204
            [Z0] Unknown: 49 42 4d 32 31 39 30 31 31 30 30 33 32
            [RV] Reserved: checksum good, 0 byte(s) reserved
        End
    Capabilities: [9c] MSI-X: Enable- Count=128 Masked-
        Vector table: BAR=0 offset=00002000
        PBA: BAR=0 offset=00003000
    Capabilities: [c0] Vendor Specific Information: Len=18 <?>
    Capabilities: [40] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [100 v1] Device Serial Number ba-da-ce-55-de-ad-ca-fe
    Capabilities: [110 v1] Advanced Error Reporting
        UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
        CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        AERCap: First Error Pointer: 04, ECRCGenCap+ ECRCGenEn+ ECRCChkCap+ ECRCChkEn+
            MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
        HeaderLog: 00000000 00000000 00000000 00000000
    Capabilities: [170 v1] Alternative Routing-ID Interpretation (ARI)
        ARICap: MFVC- ACS-, Next Function: 0
        ARICtl: MFVC- ACS-, Function Group: 0
    Kernel driver in use: vfio-pci
    Kernel modules: mlx5_core

The problem is manifesting as an EEH failure (a POWER-specific error
reporting system similar in intent to AER, but entirely different in
implementation). That's in turn causing the device to be reset, and the
call trace comes from there.
There are bugs in the EEH recovery that we're pursuing elsewhere, but
the problem at issue here is why we're tripping a hardware-reported
failure in the first place.

Given that, the trace probably isn't very meaningful (it's from the
recovery path, not the mlx or vfio driver), but fwiw:

[  132.573829] EEH: PHB#8 failure detected, location: N/A
[  132.573944] CPU: 64 PID: 397 Comm: kworker/64:0 Kdump: loaded Not tainted 4.18.0-57.el8.ppc64le #1
[  132.574052] Workqueue: events work_for_cpu_fn
[  132.574083] Call Trace:
[  132.574100] [c0000037f54d38c0] [c000000000c9ceec] dump_stack+0xb0/0xf4 (unreliable)
[  132.574147] [c0000037f54d3900] [c000000000042664] eeh_dev_check_failure+0x524/0x5f0
[  132.574300] [c0000037f54d39a0] [c0000000000bf108] pnv_pci_read_config+0x148/0x180
[  132.574348] [c0000037f54d39e0] [c000000000731694] pci_read_config_word+0xa4/0x130
[  132.574393] [c0000037f54d3a40] [c00000000073aa18] pci_raw_set_power_state+0xf8/0x300
[  132.574438] [c0000037f54d3ad0] [c000000000743450] pci_set_power_state+0x60/0x250
[  132.574486] [c0000037f54d3b10] [d000000013561e4c] vfio_pci_probe+0x184/0x270 [vfio_pci]
[  132.574531] [c0000037f54d3bb0] [c00000000074bb3c] local_pci_probe+0x6c/0x140
[  132.574577] [c0000037f54d3c40] [c00000000015aa18] work_for_cpu_fn+0x38/0x60
[  132.574615] [c0000037f54d3c70] [c00000000015fb84] process_one_work+0x2f4/0x5b0
[  132.574660] [c0000037f54d3d10] [c000000000161190] worker_thread+0x330/0x760
[  132.574803] [c0000037f54d3dc0] [c00000000016a4fc] kthread+0x1ac/0x1c0
[  132.574842] [c0000037f54d3e30] [c00000000000b75c] ret_from_kernel_thread+0x5c/0x80
[  132.574894] EEH: Detected error on PHB#8
[  132.574926] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
[  132.574981] EEH: Notify device drivers to shutdown
[  132.575011] EEH: Beginning: 'error_detected(IO frozen)'
[  132.575040] EEH: PE#fe (PCI 0008:00:00.0): no driver
[  132.575193] EEH: PE#0 (PCI 0008:01:00.0): Invoking vfio-pci->error_detected(IO frozen)
[  132.575253] EEH: PE#0 (PCI 0008:01:00.0): vfio-pci driver reports: 'can recover'
[  132.575514] EEH: PE#0 (PCI 0008:01:00.1): Invoking vfio-pci->error_detected(IO frozen)
[  132.575592] EEH: PE#0 (PCI 0008:01:00.1): vfio-pci driver reports: 'can recover'
[  132.575634] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'can recover'
[  132.575684] EEH: Collect temporary log
[  132.575706] PHB3 PHB#8 Diag-data (Version: 1)
[  132.575734] brdgCtl:     0000ffff
[  132.575756] RootSts:     ffffffff ffffffff ffffffff ffffffff 0000ffff
[  132.575790] RootErrSts:  ffffffff ffffffff ffffffff
[  132.575933] RootErrLog:  ffffffff ffffffff ffffffff ffffffff
[  132.575973] RootErrLog1: ffffffff 0000000000000000 0000000000000000
[  132.576014] nFir:        0000808000000000 0030006e00000000 0000800000000000
[  132.576048] PhbSts:      0000001800000000 0000001800000000
[  132.576076] Lem:         0000020000080000 42498e367f502eae 0000000000080000
[  132.576111] OutErr:      0000002000000000 0000002000000000 0000000000000000 0000000000000000
[  132.576159] InAErr:      0000000020000000 0000000020000000 8080000000000000 0000000000000000
[  132.576327] EEH: Reset without hotplug activity
[  132.606003] vfio-pci 0008:01:00.0: Refused to change power state, currently in D3
[  132.606062] iommu: Removing device 0008:01:00.0 from group 0
[  132.636000] vfio-pci 0008:01:00.1: Refused to change power state, currently in D3
[  132.636057] iommu: Removing device 0008:01:00.1 from group 0
[  137.196696] EEH: Sleep 5s ahead of partial hotplug
[  142.236046] pci 0008:01:00.0: [15b3:1013] type 00 class 0x020700
[  142.236156] pci 0008:01:00.0: reg 0x10: [mem 0x240000000000-0x24001fffffff 64bit pref]
[  142.236932] pci 0008:01:00.1: [15b3:1013] type 00 class 0x020700
[  142.237030] pci 0008:01:00.1: reg 0x10: [mem 0x240020000000-0x24003fffffff 64bit pref]
[  142.238763] pci 0008:00:00.0: BAR 14: assigned [mem 0x3fe200000000-0x3fe23fffffff]
[  142.238940] pci 0008:01:00.0: BAR 0: assigned [mem 0x240000000000-0x24001fffffff 64bit pref]
[  142.239021] pci 0008:01:00.1: BAR 0: assigned [mem 0x240020000000-0x24003fffffff 64bit pref]
[  142.239112] pci 0008:01:00.0: Can't enable device memory
[  142.239417] mlx5_core 0008:01:00.0: Cannot enable PCI device, aborting
[  142.239476] mlx5_core 0008:01:00.0: mlx5_pci_init failed with error code -22
[  142.239539] mlx5_core: probe of 0008:01:00.0 failed with error -22
[  142.239590] vfio-pci: probe of 0008:01:00.0 failed with error -22
[  142.239631] pci 0008:01:00.1: Can't enable device memory
[  142.241612] mlx5_core 0008:01:00.1: Cannot enable PCI device, aborting
[  142.241654] mlx5_core 0008:01:00.1: mlx5_pci_init failed with error code -22
[  142.241716] mlx5_core: probe of 0008:01:00.1 failed with error -22
[  142.241762] vfio-pci: probe of 0008:01:00.1 failed with error -22
[  142.241800] EEH: Notify device drivers the completion of reset
[  142.241835] EEH: Beginning: 'slot_reset'
[  142.241856] EEH: PE#fe (PCI 0008:00:00.0): no driver
[  142.241884] EEH: Finished:'slot_reset' with aggregate recovery state:'none'
[  142.241918] EEH: Notify device driver to resume
[  142.241947] EEH: Beginning: 'resume'
[  142.241968] EEH: PE#fe (PCI 0008:00:00.0): no driver
[  142.241996] EEH: Finished:'resume'
[  142.241996] EEH: Recovery successful.
On Fri, Jan 04, 2019 at 02:44:01PM +1100, David Gibson wrote:
> On Thu, Dec 06, 2018 at 08:45:09AM +0200, Leon Romanovsky wrote:
> > On Thu, Dec 06, 2018 at 03:19:51PM +1100, David Gibson wrote:
> > > Mellanox ConnectX-5 IB cards (MT27800) seem to cause a call trace
> > > when unbound from their regular driver and attached to vfio-pci in
> > > order to pass them through to a guest.
> > > [...]
> >
> > I would like to reproduce the call trace before moving forward, but
> > I'm having trouble reproducing the original issue.
> >
> > I work with vfio-pci and CX-4/5 cards on a daily basis, and manually
> > entering D3 state just now worked for me.
>
> Interesting. I've investigated this further, though I don't have as
> many new clues as I'd like. The problem occurs reliably, at least on
> one particular type of machine (a POWER8 "Garrison" with ConnectX-4).
> I don't yet know if it occurs with other machines; I'm having trouble
> getting access to other machines with a suitable card. I didn't manage
> to reproduce it on a different POWER8 machine with a ConnectX-5, but I
> don't know whether it's the difference in machine or the difference in
> card revision that's important.

Making sure the card has the latest firmware is always good advice.

> So possibilities that occur to me:
>  * It's something specific about how the vfio-pci driver uses D3 state
>    - have you tried rebinding your device to vfio-pci?
>  * It's something specific about POWER, either the kernel or the PCI
>    bridge hardware
>  * It's something specific about this particular type of machine

Does the EEH indicate what happened to actually trigger it?

Jason
On Sat, 2019-01-05 at 10:51 -0700, Jason Gunthorpe wrote:
> > Interesting. I've investigated this further, though I don't have as
> > many new clues as I'd like. The problem occurs reliably, at least on
> > one particular type of machine (a POWER8 "Garrison" with ConnectX-4).
> > [...]
>
> Making sure the card has the latest firmware is always good advice.
>
> > So possibilities that occur to me:
> >  * It's something specific about how the vfio-pci driver uses D3 state
> >    - have you tried rebinding your device to vfio-pci?
> >  * It's something specific about POWER, either the kernel or the PCI
> >    bridge hardware
> >  * It's something specific about this particular type of machine
>
> Does the EEH indicate what happened to actually trigger it?

In a very cryptic way that sadly requires manual parsing using
non-public docs, but yes. From the look of it, it's a completion
timeout.

Looks to me like we don't get a response to a config space access
during the change of D state. I don't know if it's the write of the
D3 state itself or the read back though (it's probably detected on
the read back or a subsequent read, but that doesn't tell me which
specific one failed).

Some extra logging in OPAL might help pin that down, by checking the
InA error state in the config accessor after the config write (and
polling on it for a while, since from a CPU perspective I don't know
whether the write is synchronous - probably not).

Cheers,
Ben.
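(For reference, the write and read-back under discussion both live in
pci_raw_set_power_state() in drivers/pci/pci.c. The sketch below
paraphrases that sequence as of roughly v4.18, rather than reproducing
the exact upstream lines; it also shows where the "Refused to change
power state" message in the log above originates:)

    /* Enter the requested state by writing PMCSR ... */
    pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, pmcsr);

    /* Mandatory power management transition delay (PCI PM spec) */
    if (state == PCI_D3hot || dev->current_state == PCI_D3hot)
            pci_dev_d3_sleep(dev);

    /* ... then read PMCSR back to see which state the device accepted */
    pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
    dev->current_state = (pmcsr & PCI_PM_CTRL_STATE_MASK);
    if (dev->current_state != state)
            pci_info(dev, "Refused to change power state, currently in D%d\n",
                     dev->current_state);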
On Sun, Jan 06, 2019 at 09:43:46AM +1100, Benjamin Herrenschmidt wrote:
> On Sat, 2019-01-05 at 10:51 -0700, Jason Gunthorpe wrote:
> > [...]
> > Does the EEH indicate what happened to actually trigger it?
>
> In a very cryptic way that sadly requires manual parsing using
> non-public docs, but yes. From the look of it, it's a completion
> timeout.
>
> Looks to me like we don't get a response to a config space access
> during the change of D state. I don't know if it's the write of the
> D3 state itself or the read back though (it's probably detected on
> the read back or a subsequent read, but that doesn't tell me which
> specific one failed).

If it is just one card doing it (again, check you have the latest
firmware) I wonder if it is a sketchy PCI-E electrical link that is
causing a long re-training cycle? Can you tell if the PCI-E link is
permanently gone or does it eventually return?

Does the card work in Gen 3 when it starts? Is there any indication of
PCI-E link errors?

Every time or sometimes?

Is the POWER8 firmware good? If the link does eventually come back, is
the POWER8's D3 resumption timeout long enough?

If this doesn't lead to an obvious conclusion you'll probably need to
connect to IBM's Mellanox support team to get more information from the
card side.

Jason
On Mon, Jan 07, 2019 at 09:01:29PM -0700, Jason Gunthorpe wrote:
> On Sun, Jan 06, 2019 at 09:43:46AM +1100, Benjamin Herrenschmidt wrote:
> > [...]
> > Looks to me like we don't get a response to a config space access
> > during the change of D state. I don't know if it's the write of the
> > D3 state itself or the read back though (it's probably detected on
> > the read back or a subsequent read, but that doesn't tell me which
> > specific one failed).
>
> [...]
> If this doesn't lead to an obvious conclusion you'll probably need to
> connect to IBM's Mellanox support team to get more information from the
> card side.

+1. I tried to find any Mellanox-internal bugs related to your issue
and didn't find anything concrete.

Thanks

> Jason
On 06/01/2019 09:43, Benjamin Herrenschmidt wrote:
> On Sat, 2019-01-05 at 10:51 -0700, Jason Gunthorpe wrote:
> > [...]
> > Does the EEH indicate what happened to actually trigger it?
>
> In a very cryptic way that sadly requires manual parsing using
> non-public docs, but yes. From the look of it, it's a completion
> timeout.
>
> Looks to me like we don't get a response to a config space access
> during the change of D state. I don't know if it's the write of the
> D3 state itself or the read back though (it's probably detected on
> the read back or a subsequent read, but that doesn't tell me which
> specific one failed).

It is the write:

  pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, pmcsr);

> Some extra logging in OPAL might help pin that down, by checking the
> InA error state in the config accessor after the config write (and
> polling on it for a while, since from a CPU perspective I don't know
> whether the write is synchronous - probably not).
Extra logging gives these straight after that write:

nFir:    0000808000000000 0030006e00000000 0000800000000000
PhbSts:  0000001800000000 0000001800000000
Lem:     0000020000088000 42498e367f502eae 0000000000080000
OutErr:  0000002000000000 0000002000000000 0000000000000000 0000000000000000
InAErr:  0000000030000000 0000000020000000 8080000000000000 0000000000000000

Decoded (my fancy script):

nFir:    0000808000000000 0030006e00000000 0000800000000000
|- PCI Nest Fault Isolation Register(FIR) NestBase+0x00 _BE_ = 0000808000000000h:
|    [0..63] 00000000 00000000 10000000 10000000 00000000 00000000 00000000 00000000
|    #16 set: The PHB had a severe error and has fenced the AIB
|    #24 set: The internal SCOM to ASB bridge has an error
|    #29..30: Error bit from SCOM FIR engine = 0h
|- PCI Nest FIR Mask NestBase+0x03 _BE_ = 0030006e00000000h:
|    [0..63] 00000000 00110000 00000000 01101110 00000000 00000000 00000000 00000000
|    #10 set: Any PowerBus data hang poll error (only checked for CI Stores)
|    #11 set: Any PowerBus command hang error (domestic address range)
|    #25 set: A command received ack_dead, foreign data hang, or Link_chk_abort from the foreign interface
|    #26 set: Any PowerBus command hang error (foreign address range)
|    #28 set: Error bit from BARS SCOM engines, Nest domain
|    #29..30: Error bit from SCOM FIR engine = 3h/[0..1] 11
|- PCI Nest FIR WOF (“Who's on First”) NestBase+0x08 _BE_ = 0000800000000000h:
|    [0..63] 00000000 00000000 10000000 00000000 00000000 00000000 00000000 00000000
|    #16 set: The PHB had a severe error and has fenced the AIB
|    #29..30: Error bit from SCOM FIR engine = 0h

PhbSts:  0000001800000000 0000001800000000
|- 0x0120 Processor Load/Store Status Register _BE_ = 0000001800000000h:
|    [0..63] 00000000 00000000 00000000 00011000 00000000 00000000 00000000 00000000
|    #27 set: One of the PHB3’s error status register bits is set
|    #28 set: One of the PHB3’s first error status register bits is set
|- 0x0110 DMA Channel Status Register _BE_ = 0000001800000000h:
|    [0..63] 00000000 00000000 00000000 00011000 00000000 00000000 00000000 00000000
|    #27 set: One of the PHB3’s error status register bits is set
|    #28 set: One of the PHB3’s first error status register bits is set

Lem:     0000020000088000 42498e367f502eae 0000000000080000
|- 0xC00 LEM FIR Accumulator Register _BE_ = 0000020000088000h:
|    [0..63] 00000000 00000000 00000010 00000000 00000000 00001000 10000000 00000000
|    #22 set: CFG Access Error
|    #44 set: PCT Timeout Error
|    #48 set: PCT Unexpected Completion
|- 0xC18 LEM Error Mask Register = 42498e367f502eaeh
|- 0xC40 LEM WOF Register _BE_ = 0000000000080000h:
|    [0..63] 00000000 00000000 00000000 00000000 00000000 00001000 00000000 00000000
|    #44 set: PCT Timeout Error

OutErr:  0000002000000000 0000002000000000 0000000000000000 0000000000000000
|- 0xD00 Outbound Error Status Register _BE_ = 0000002000000000h:
|    [0..63] 00000000 00000000 00000000 00100000 00000000 00000000 00000000 00000000
|    #26 set: CFG Address/Enable Error
|- 0xD08 Outbound First Error Status Register _BE_ = 0000002000000000h:
|    [0..63] 00000000 00000000 00000000 00100000 00000000 00000000 00000000 00000000
|    #26 set: CFG Address/Enable Error

InAErr:  0000000030000000 0000000020000000 8080000000000000 0000000000000000
|- 0xD80 InboundA Error Status Register _BE_ = 0000000030000000h:
|    [0..63] 00000000 00000000 00000000 00000000 00110000 00000000 00000000 00000000
|    #34 set: PCT Timeout
|    #35 set: PCT Unexpected Completion
|- 0xD88 InboundA First Error Status Register _BE_ = 0000000020000000h:
|    [0..63] 00000000 00000000 00000000 00000000 00100000 00000000 00000000 00000000
|    #34 set: PCT Timeout
|- 0xDC0 InboundA Error Log Register 0 = 8080000000000000h

"A PCI completion timeout occurred for an outstanding PCI-E transaction"
it is.

This is how I bind the device to vfio:

echo vfio-pci > '/sys/bus/pci/devices/0000:01:00.0/driver_override'
echo vfio-pci > '/sys/bus/pci/devices/0000:01:00.1/driver_override'
echo '0000:01:00.0' > '/sys/bus/pci/devices/0000:01:00.0/driver/unbind'
echo '0000:01:00.1' > '/sys/bus/pci/devices/0000:01:00.1/driver/unbind'
echo '0000:01:00.0' > /sys/bus/pci/drivers/vfio-pci/bind
echo '0000:01:00.1' > /sys/bus/pci/drivers/vfio-pci/bind

and I noticed that EEH only happens with the last command. The order
(.0,.1 or .1,.0) does not matter; it seems that putting one function
into D3 is fine, but putting the other one in when the first is already
in D3 produces an EEH. And I do not recall ever seeing this on the
firestone machine. Weird.
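(A quick way to confirm the per-function D-state from userspace is to
read PMCSR directly - a hypothetical diagnostic, not from the thread;
setpci's CAP_PM shorthand resolves the Power Management capability,
which the lspci dump above places at offset 0x40:)

    # PMCSR is PM capability + 4; the low two bits give the current
    # D-state (0b00 = D0, 0b11 = D3hot).
    setpci -s 0000:01:00.0 CAP_PM+4.w
    setpci -s 0000:01:00.1 CAP_PM+4.w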
On Mon, 2019-01-07 at 21:01 -0700, Jason Gunthorpe wrote:
> > In a very cryptic way that sadly requires manual parsing using
> > non-public docs, but yes. From the look of it, it's a completion
> > timeout.
> > [...]
>
> If it is just one card doing it (again, check you have the latest
> firmware) I wonder if it is a sketchy PCI-E electrical link that is
> causing a long re-training cycle? Can you tell if the PCI-E link is
> permanently gone or does it eventually return?

No, it's 100% reproducible on systems with that specific card model -
not card instance - and maybe on different systems/cards as well; I'll
let David & Alexey comment further on that.

> Does the card work in Gen 3 when it starts? Is there any indication of
> PCI-E link errors?

Nope.

> Every time or sometimes?
>
> Is the POWER8 firmware good? If the link does eventually come back, is
> the POWER8's D3 resumption timeout long enough?
>
> If this doesn't lead to an obvious conclusion you'll probably need to
> connect to IBM's Mellanox support team to get more information from the
> card side.

We are IBM :-) So far, it seems to be that the card is doing something
not quite right, but we don't know what. We might need to engage
Mellanox themselves.

Cheers,
Ben.
On Wed, Jan 09, 2019 at 04:09:02PM +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2019-01-07 at 21:01 -0700, Jason Gunthorpe wrote:
> > If it is just one card doing it (again, check you have the latest
> > firmware) I wonder if it is a sketchy PCI-E electrical link that is
> > causing a long re-training cycle? Can you tell if the PCI-E link is
> > permanently gone or does it eventually return?
>
> No, it's 100% reproducible on systems with that specific card model -
> not card instance - and maybe on different systems/cards as well; I'll
> let David & Alexey comment further on that.

Well, it's 100% reproducible on a particular model of system (garrison)
with a particular model of card. I've had some suggestions that it
fails with some other system and card models, but nothing confirmed -
the one other system model I've been able to try, which also had a
newer card model, didn't reproduce the problem.

> > Does the card work in Gen 3 when it starts? Is there any indication
> > of PCI-E link errors?
>
> Nope.
>
> > [...]
> > If this doesn't lead to an obvious conclusion you'll probably need to
> > connect to IBM's Mellanox support team to get more information from
> > the card side.
>
> We are IBM :-) So far, it seems to be that the card is doing something
> not quite right, but we don't know what. We might need to engage
> Mellanox themselves.

Possibly. On the other hand, I've had it reported that this is a
software regression, at least with downstream Red Hat kernels. I
haven't yet been able to eliminate factors that might be confusing
that, or to try to find a working version upstream.
On 09/01/2019 16:30, David Gibson wrote:
> On Wed, Jan 09, 2019 at 04:09:02PM +1100, Benjamin Herrenschmidt wrote:
> > [...]
> > No, it's 100% reproducible on systems with that specific card model -
> > not card instance - and maybe on different systems/cards as well; I'll
> > let David & Alexey comment further on that.
>
> Well, it's 100% reproducible on a particular model of system (garrison)
> with a particular model of card. I've had some suggestions that it
> fails with some other system and card models, but nothing confirmed -
> the one other system model I've been able to try, which also had a
> newer card model, didn't reproduce the problem.

I have just moved the "Mellanox Technologies MT27700 Family [ConnectX-4]"
from the garrison to the firestone machine, and there it does not produce
an EEH, with the same kernel and skiboot (both upstream + my debug). Hm.
I cannot really blame the card, but I cannot see what could cause the
difference in skiboot either. I even tried disabling NPU so garrison
would look like firestone - still EEH'ing.

> > > Does the card work in Gen 3 when it starts? Is there any indication
> > > of PCI-E link errors?
> >
> > Nope.
> > [...]
>
> Possibly. On the other hand, I've had it reported that this is a
> software regression, at least with downstream Red Hat kernels. I
> haven't yet been able to eliminate factors that might be confusing
> that, or to try to find a working version upstream.

Do you have tarballs handy? I'd diff...
On Wed, 2019-01-09 at 15:53 +1100, Alexey Kardashevskiy wrote:
> "A PCI completion timeout occurred for an outstanding PCI-E transaction"
> it is.
>
> This is how I bind the device to vfio:
>
> echo vfio-pci > '/sys/bus/pci/devices/0000:01:00.0/driver_override'
> echo vfio-pci > '/sys/bus/pci/devices/0000:01:00.1/driver_override'
> echo '0000:01:00.0' > '/sys/bus/pci/devices/0000:01:00.0/driver/unbind'
> echo '0000:01:00.1' > '/sys/bus/pci/devices/0000:01:00.1/driver/unbind'
> echo '0000:01:00.0' > /sys/bus/pci/drivers/vfio-pci/bind
> echo '0000:01:00.1' > /sys/bus/pci/drivers/vfio-pci/bind
>
> and I noticed that EEH only happens with the last command. The order
> (.0,.1 or .1,.0) does not matter; it seems that putting one function
> into D3 is fine, but putting the other one in when the first is already
> in D3 produces an EEH. And I do not recall ever seeing this on the
> firestone machine. Weird.

Putting all functions into D3 is what allows the device to actually go
into D3.

Does it work with other devices?

We do have that bug on early P9 revisions where the attempt to bring the
link to L1 as part of the D3 process fails in horrible ways; I thought
P8 would be OK, but maybe not...

Otherwise, it might be that our timeouts are too low (you may want to
talk to our PCIe guys internally).

Cheers,
Ben.
On Wed, 2019-01-09 at 17:32 +1100, Alexey Kardashevskiy wrote:
> I have just moved the "Mellanox Technologies MT27700 Family
> [ConnectX-4]" from the garrison to the firestone machine, and there it
> does not produce an EEH, with the same kernel and skiboot (both
> upstream + my debug). Hm. I cannot really blame the card, but I cannot
> see what could cause the difference in skiboot either. I even tried
> disabling NPU so garrison would look like firestone - still EEH'ing.

The systems have a different chip though: firestone is P8 and garrison
is P8', which has a slightly different PHB revision. Worth checking if
we have anything significantly different in our inits, and poking at the
HW guys.

BTW, are the cards behind a switch in either case?

Cheers,
Ben.
On 09/01/2019 18:25, Benjamin Herrenschmidt wrote:
> On Wed, 2019-01-09 at 17:32 +1100, Alexey Kardashevskiy wrote:
> > [...]
>
> The systems have a different chip though: firestone is P8 and garrison
> is P8', which has a slightly different PHB revision. Worth checking if
> we have anything significantly different in our inits, and poking at
> the HW guys.

Nope, we do not have anything different for these machines. Asking HW
guys has never worked for me :-/ I think the easiest is just doing what
we did for PHB4 and ignoring these D3 requests on garrisons.

> BTW, are the cards behind a switch in either case?

No, directly connected to the root on both.

garrison:
0000:00:00.0 PCI bridge: IBM Device 03dc (rev ff)
0000:01:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4] (rev ff)
0000:01:00.1 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4] (rev ff)

firestone (phb #0 is taken by the nvidia gpu):
0001:00:00.0 PCI bridge: IBM POWER8 Host Bridge (PHB3)
0001:01:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
0001:01:00.1 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
On 09/01/2019 18:24, Benjamin Herrenschmidt wrote:
> On Wed, 2019-01-09 at 15:53 +1100, Alexey Kardashevskiy wrote:
> > [...]
> > and I noticed that EEH only happens with the last command. The order
> > (.0,.1 or .1,.0) does not matter; it seems that putting one function
> > into D3 is fine, but putting the other one in when the first is
> > already in D3 produces an EEH. And I do not recall ever seeing this
> > on the firestone machine. Weird.
>
> Putting all functions into D3 is what allows the device to actually go
> into D3.
>
> Does it work with other devices?

It works fine with these, on the very same garrison:

0009:07:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0009:07:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)

Bizarre.

> We do have that bug on early P9 revisions where the attempt to bring
> the link to L1 as part of the D3 process fails in horrible ways; I
> thought P8 would be OK, but maybe not...
>
> Otherwise, it might be that our timeouts are too low (you may want to
> talk to our PCIe guys internally).

This increases the "Outbound non-posted transactions timeout
configuration" from 16ms to 1s, and does not help anyway:

diff --git a/hw/phb3.c b/hw/phb3.c
index 38b8f46..cb14909 100644
--- a/hw/phb3.c
+++ b/hw/phb3.c
@@ -4065,7 +4065,7 @@ static void phb3_init_utl(struct phb3 *p)
 	/* Init_82: PCI Express port control
 	 * SW283991: Set Outbound Non-Posted request timeout to 16ms (RTOS).
 	 */
-	out_be64(p->regs + UTL_PCIE_PORT_CONTROL, 0x8588007000000000);
+	out_be64(p->regs + UTL_PCIE_PORT_CONTROL, 0x858800d000000000);
On Wed, Jan 09, 2019 at 04:09:02PM +1100, Benjamin Herrenschmidt wrote:
> > Is the POWER8 firmware good? If the link does eventually come back,
> > is the POWER8's D3 resumption timeout long enough?
> >
> > If this doesn't lead to an obvious conclusion you'll probably need to
> > connect to IBM's Mellanox support team to get more information from
> > the card side.
>
> We are IBM :-) So far, it seems to be that the card is doing something
> not quite right, but we don't know what. We might need to engage
> Mellanox themselves.

Sorry, that was unclear - I meant the support team for IBM inside
Mellanox... There might be internal debugging available that can show
if the card is detecting the beacon, how far it gets in renegotiation,
etc.

From all the mails, it really has the feel of a PCI-E interop problem
between these two specific chips.

Jason
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 4700d24e5d55..add3f516ca12 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -1315,23 +1315,24 @@ static void quirk_ide_samemode(struct pci_dev *pdev)
 }
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82801CA_10, quirk_ide_samemode);
 
-/* Some ATA devices break if put into D3 */
-static void quirk_no_ata_d3(struct pci_dev *pdev)
+/* Some devices (including a number of ATA cards) break if put into D3 */
+static void quirk_no_d3(struct pci_dev *pdev)
 {
 	pdev->dev_flags |= PCI_DEV_FLAGS_NO_D3;
 }
+
 /* Quirk the legacy ATA devices only. The AHCI ones are ok */
 DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_SERVERWORKS, PCI_ANY_ID,
-				PCI_CLASS_STORAGE_IDE, 8, quirk_no_ata_d3);
+				PCI_CLASS_STORAGE_IDE, 8, quirk_no_d3);
 DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_ATI, PCI_ANY_ID,
-				PCI_CLASS_STORAGE_IDE, 8, quirk_no_ata_d3);
+				PCI_CLASS_STORAGE_IDE, 8, quirk_no_d3);
 /* ALi loses some register settings that we cannot then restore */
 DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_AL, PCI_ANY_ID,
-				PCI_CLASS_STORAGE_IDE, 8, quirk_no_ata_d3);
+				PCI_CLASS_STORAGE_IDE, 8, quirk_no_d3);
 /* VIA comes back fine but we need to keep it alive or ACPI GTM failures
    occur when mode detecting */
 DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_VIA, PCI_ANY_ID,
-				PCI_CLASS_STORAGE_IDE, 8, quirk_no_ata_d3);
+				PCI_CLASS_STORAGE_IDE, 8, quirk_no_d3);
 
 /*
  * This was originally an Alpha-specific thing, but it really fits here.
@@ -3367,6 +3368,10 @@ static void mellanox_check_broken_intx_masking(struct pci_dev *pdev)
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_MELLANOX, PCI_ANY_ID,
 			mellanox_check_broken_intx_masking);
 
+/* Mellanox MT27800 (ConnectX-5) IB card seems to break with D3
+ * In particular this shows up when the device is bound to the vfio-pci driver */
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_CONNECTX4, quirk_no_d3)
+
 static void quirk_no_bus_reset(struct pci_dev *dev)
 {
 	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
Mellanox ConnectX-5 IB cards (MT27800) seem to cause a call trace when
unbound from their regular driver and attached to vfio-pci in order to
pass them through to a guest.

This goes away if the disable_idle_d3 option is used, so it looks like a
problem with the hardware handling D3 state. To fix that more
permanently, use a device quirk to disable D3 state for these devices.

We do this by renaming the existing quirk_no_ata_d3() more generally and
attaching it to the ConnectX-[45] devices (0x15b3:0x1013).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 drivers/pci/quirks.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)
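(For anyone wanting to reproduce the failure mode, or work around it
without this quirk: disable_idle_d3 is an existing vfio-pci module
parameter - a minimal sketch, assuming vfio-pci is built as a module:)

    # Keep vfio-pci from moving idle devices into D3hot:
    modprobe vfio-pci disable_idle_d3=1
    # If vfio-pci is built in, the equivalent kernel command line is:
    #   vfio-pci.disable_idle_d3=1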