Message ID | 20240125081414.2189572-1-ming4.li@intel.com |
---|---|
State | Superseded |
Headers | show |
Series | [1/1] cxl/pci: Skip to handle RAS errors if CXL.mem device is detached | expand |
Li Ming wrote: > CXL.mem protocol errors are logged in CXL RAS capability, if CXL.mem > device is unbound from CXL.mem driver, will not expect any CXL.mem > protocol errors happen on the endpoint or the dport connected to the > endpoint. Giving up these unexpected errors to avoid error handler to > access unmapped RCH dport's RAS capability. The error handler of CXL PCI > device helps to handle RAS errors happened on RCH dport. The host of the > RCH dport's RAS capability mapping is CXL.mem device, so the error > handler will access unmapped RCH dport's RAS capability after CXL.mem > device is unbound from the CXL.mem driver. Thanks for this Li Ming! I am going to reword this to add more context: --- The PCI AER model is an awkward fit for CXL error handling. While the expectation is that a PCI device can escalate to link reset to recover from an AER event, the same reset on CXL amounts to a suprise memory hotplug of massive amounts of memory. At present, the CXL error handler attempts some optimisitic error handling to unbind the device from the cxl_mem driver after reaping some RAS register values. This results in a "hopeful" attempt to unplug the memory, but there is no guarantee that will succeed. A subsequent AER notification after the memdev unbind event can no longer assume the registers are mapped. Check for memdev bind before reaping status register values to avoid crashes of the form: RIP: 0010:__cxl_handle_ras+0x30/0x110 [cxl_core] Call Trace: <TASK> cxl_handle_rp_ras+0xbc/0xd0 [cxl_core] cxl_error_detected+0x6c/0xf0 [cxl_core] report_error_detected+0xc7/0x1c0 ? __pfx_report_frozen_detected+0x10/0x10 pci_walk_bus+0x73/0x90 pcie_do_recovery+0x23f/0x330 Longer term, the unbind and PCI_ERS_RESULT_DISCONNECT behavior might need to be replaced with a new PCI_ERS_RESULT_PANIC. --- > Fixes: 6ac07883dbb5 ("cxl/pci: Add RCH downstream port error logging") > Suggested-by: Dan Williams <dan.j.williams@intel.com> > Signed-off-by: Li Ming <ming4.li@intel.com>
Hi Li and Dan, I added comment below. On 1/26/2024 12:37 AM, Dan Williams wrote: > Li Ming wrote: >> CXL.mem protocol errors are logged in CXL RAS capability, if CXL.mem >> device is unbound from CXL.mem driver, will not expect any CXL.mem >> protocol errors happen on the endpoint or the dport connected to the >> endpoint. Giving up these unexpected errors to avoid error handler to >> access unmapped RCH dport's RAS capability. The error handler of CXL PCI >> device helps to handle RAS errors happened on RCH dport. The host of the >> RCH dport's RAS capability mapping is CXL.mem device, so the error >> handler will access unmapped RCH dport's RAS capability after CXL.mem >> device is unbound from the CXL.mem driver. > Thanks for this Li Ming! > > I am going to reword this to add more context: > > --- > The PCI AER model is an awkward fit for CXL error handling. While the > expectation is that a PCI device can escalate to link reset to recover > from an AER event, the same reset on CXL amounts to a suprise memory > hotplug of massive amounts of memory. > > At present, the CXL error handler attempts some optimisitic error > handling to unbind the device from the cxl_mem driver after reaping some > RAS register values. This results in a "hopeful" attempt to unplug the > memory, but there is no guarantee that will succeed. > > A subsequent AER notification after the memdev unbind event can no > longer assume the registers are mapped. Check for memdev bind before > reaping status register values to avoid crashes of the form: > > RIP: 0010:__cxl_handle_ras+0x30/0x110 [cxl_core] > Call Trace: > <TASK> > cxl_handle_rp_ras+0xbc/0xd0 [cxl_core] > cxl_error_detected+0x6c/0xf0 [cxl_core] > report_error_detected+0xc7/0x1c0 > ? __pfx_report_frozen_detected+0x10/0x10 > pci_walk_bus+0x73/0x90 > pcie_do_recovery+0x23f/0x330 report_error_detected() includes the same "if (dev->driver)" check before calling the device's err_handler(). The same check again in the CXL device error handler increases the chances of catching the surprise unbind case but not by much. Regards, Terry > Longer term, the unbind and PCI_ERS_RESULT_DISCONNECT behavior might > need to be replaced with a new PCI_ERS_RESULT_PANIC. > --- > >> Fixes: 6ac07883dbb5 ("cxl/pci: Add RCH downstream port error logging") >> Suggested-by: Dan Williams <dan.j.williams@intel.com> >> Signed-off-by: Li Ming <ming4.li@intel.com>
Bowman, Terry wrote: > Hi Li and Dan, > > I added comment below. > > On 1/26/2024 12:37 AM, Dan Williams wrote: > > Li Ming wrote: > >> CXL.mem protocol errors are logged in CXL RAS capability, if CXL.mem > >> device is unbound from CXL.mem driver, will not expect any CXL.mem > >> protocol errors happen on the endpoint or the dport connected to the > >> endpoint. Giving up these unexpected errors to avoid error handler to > >> access unmapped RCH dport's RAS capability. The error handler of CXL PCI > >> device helps to handle RAS errors happened on RCH dport. The host of the > >> RCH dport's RAS capability mapping is CXL.mem device, so the error > >> handler will access unmapped RCH dport's RAS capability after CXL.mem > >> device is unbound from the CXL.mem driver. > > Thanks for this Li Ming! > > > > I am going to reword this to add more context: > > > > --- > > The PCI AER model is an awkward fit for CXL error handling. While the > > expectation is that a PCI device can escalate to link reset to recover > > from an AER event, the same reset on CXL amounts to a suprise memory > > hotplug of massive amounts of memory. > > > > At present, the CXL error handler attempts some optimisitic error > > handling to unbind the device from the cxl_mem driver after reaping some > > RAS register values. This results in a "hopeful" attempt to unplug the > > memory, but there is no guarantee that will succeed. > > > > A subsequent AER notification after the memdev unbind event can no > > longer assume the registers are mapped. Check for memdev bind before > > reaping status register values to avoid crashes of the form: > > > > RIP: 0010:__cxl_handle_ras+0x30/0x110 [cxl_core] > > Call Trace: > > <TASK> > > cxl_handle_rp_ras+0xbc/0xd0 [cxl_core] > > cxl_error_detected+0x6c/0xf0 [cxl_core] > > report_error_detected+0xc7/0x1c0 > > ? __pfx_report_frozen_detected+0x10/0x10 > > pci_walk_bus+0x73/0x90 > > pcie_do_recovery+0x23f/0x330 > > report_error_detected() includes the same "if (dev->driver)" check > before calling the device's err_handler(). The same check again in the > CXL device error handler increases the chances of catching the > surprise unbind case but not by much. So report_error_detected() is checking if pdev->dev.driver is NULL, in this case we are checking whether *cxlmd->dev.driver is NULL*, where cxlmd->dev.parent == pdev. In other words when cxl_pci sees an error it tries to keep the CXL.io up and running while shutting down the CXL.mem side, but it's not clear if that is just making a bad situation worse. So might need a follow-up to just panic() rather than hope that unbinding the cxl_memdev does anything useful.
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c index 6c9c8d92f8f7..480489f5644e 100644 --- a/drivers/cxl/core/pci.c +++ b/drivers/cxl/core/pci.c @@ -932,11 +932,21 @@ static void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds) { } void cxl_cor_error_detected(struct pci_dev *pdev) { struct cxl_dev_state *cxlds = pci_get_drvdata(pdev); + struct device *dev = &cxlds->cxlmd->dev; + + scoped_guard(device, dev) { + if (!dev->driver) { + dev_warn(&pdev->dev, + "%s: memdev disabled, abort error handling\n", + dev_name(dev)); + return; + } - if (cxlds->rcd) - cxl_handle_rdport_errors(cxlds); + if (cxlds->rcd) + cxl_handle_rdport_errors(cxlds); - cxl_handle_endpoint_cor_ras(cxlds); + cxl_handle_endpoint_cor_ras(cxlds); + } } EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, CXL); @@ -948,16 +958,25 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev, struct device *dev = &cxlmd->dev; bool ue; - if (cxlds->rcd) - cxl_handle_rdport_errors(cxlds); + scoped_guard(device, dev) { + if (!dev->driver) { + dev_warn(&pdev->dev, + "%s: memdev disabled, abort error handling\n", + dev_name(dev)); + return PCI_ERS_RESULT_DISCONNECT; + } + + if (cxlds->rcd) + cxl_handle_rdport_errors(cxlds); + /* + * A frozen channel indicates an impending reset which is fatal to + * CXL.mem operation, and will likely crash the system. On the off + * chance the situation is recoverable dump the status of the RAS + * capability registers and bounce the active state of the memdev. + */ + ue = cxl_handle_endpoint_ras(cxlds); + } - /* - * A frozen channel indicates an impending reset which is fatal to - * CXL.mem operation, and will likely crash the system. On the off - * chance the situation is recoverable dump the status of the RAS - * capability registers and bounce the active state of the memdev. - */ - ue = cxl_handle_endpoint_ras(cxlds); switch (state) { case pci_channel_io_normal:
CXL.mem protocol errors are logged in CXL RAS capability, if CXL.mem device is unbound from CXL.mem driver, will not expect any CXL.mem protocol errors happen on the endpoint or the dport connected to the endpoint. Giving up these unexpected errors to avoid error handler to access unmapped RCH dport's RAS capability. The error handler of CXL PCI device helps to handle RAS errors happened on RCH dport. The host of the RCH dport's RAS capability mapping is CXL.mem device, so the error handler will access unmapped RCH dport's RAS capability after CXL.mem device is unbound from the CXL.mem driver. Fixes: 6ac07883dbb5 ("cxl/pci: Add RCH downstream port error logging") Suggested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Li Ming <ming4.li@intel.com> --- drivers/cxl/core/pci.c | 43 ++++++++++++++++++++++++++++++------------ 1 file changed, 31 insertions(+), 12 deletions(-)