Message ID | 164740402242.3912056.8303625392871313860.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive) |
---|---|
Series | cxl/pci: Add fundamental error handling |
On Tue, Mar 15, 2022 at 9:14 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> Add a 'struct pci_error_handlers' instance for the cxl_pci driver.
> Section 8.2.5.9 "CXL RAS Capability Structure" of the CXL 2.0
> specification defines the error sources considered in this
> implementation. The RAS Capability Structure defines protocol, link and
> internal errors which are distinct from memory poison errors that are
> conveyed via direct consumption and/or media scanning.
>
> The errors reported by the RAS registers are categorized into
> correctable and uncorrectable errors, where the uncorrectable errors are
> optionally steered to either fatal or non-fatal AER events. Table 224
> "Device Specific Error Reporting and Nomenclature Guidelines" in the CXL
> 2.0 specification outlines that the remediation for uncorrectable errors
> is a reset to recover. This matches how the Linux PCIe AER core treats
> uncorrectable errors as occasions to reset the device to recover
> operation.
>
> While the specification notes "CXL Reset" or "Secondary Bus Reset" as
> theoretical recovery options, they are not feasible in practice since
> in-flight CXL.mem operations may not terminate and cause knock-on system
> fatal events. Reset is only reliable for recovering CXL.io, it is not
> reliable for recovering CXL.mem. Assuming the system survives, a reset
> causes CXL.mem operation to restart from scratch.
>
> The "ECN: Error Isolation on CXL.mem and CXL.cache" [1] document
> recognizes the CXL Reset vs CXL.mem operational conflict and helps to at
> least provide a mechanism for the Root Port to terminate in flight
> CXL.mem operations with completions. That still poses problems in
> practice if the kernel is running out of "System RAM" backed by the CXL
> device and poison is used to convey the data lost to the protocol error.
>
> Regardless of whether the reset and restart of CXL.mem operations is
> feasible / successful, the logging is still useful. So, the
> implementation reads, reports, and clears the status in the RAS
> Capability Structure registers, and it notifies the 'struct cxl_memdev'
> associated with the given PCIe endpoint to reattach to its driver over
> the reset so that the HDM decoder configuration can be reconstructed.
>
> The first half of the series reworks component register mapping so that
> the cxl_pci driver can own the RAS Capability while the cxl_port driver
> continues to own the HDM Decoder Capability. The last half implements
> the RAS Capability Structure mapping and reporting via 'struct
> pci_error_handlers'.
>
> [1]: https://www.computeexpresslink.org/spec-landing
>
> ---
>
>
> Dan Williams (8):
>       cxl/pci: Cleanup repeated code in cxl_probe_regs() helpers
>       cxl/pci: Cleanup cxl_map_device_regs()
>       cxl/pci: Kill cxl_map_regs()
>       cxl/core/regs: Make cxl_map_{component,device}_regs() device generic
>       cxl/port: Limit the port driver to just the HDM Decoder Capability
>       cxl/pci: Prepare for mapping RAS Capability Structure
>       cxl/pci: Find and map the RAS Capability Structure
>       cxl/pci: Add (hopeful) error handling support
>
>
>  drivers/cxl/core/hdm.c    |   33 +++++----
>  drivers/cxl/core/memdev.c |    1
>  drivers/cxl/core/pci.c    |    3 -
>  drivers/cxl/core/port.c   |    2 -
>  drivers/cxl/core/regs.c   |  172 ++++++++++++++++++++++++++-------------------
>  drivers/cxl/cxl.h         |   36 +++++++--
>  drivers/cxl/cxlmem.h      |    2 +
>  drivers/cxl/cxlpci.h      |    9 --
>  drivers/cxl/pci.c         |  163 ++++++++++++++++++++++++++++++++-----------
>  9 files changed, 273 insertions(+), 148 deletions(-)
>
> base-commit: 74be98774dfbc5b8b795db726bd772e735d2edd4

Apologies, wrong base-commit. This series is based on that commit plus
the following series:

https://lore.kernel.org/linux-cxl/164730733718.3806189.9721916820488234094.stgit@dwillia2-desk3.amr.corp.intel.com/
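
For readers following along, the "reads, reports, and clears" flow described in the cover letter can be sketched in plain userspace C. This is not the code from the patches, only a minimal model under stated assumptions: the register block is a hypothetical in-memory array rather than an MMIO mapping, the offsets and length are assumed values for illustration, the status bits are assumed to be write-1-to-clear, and all `fake_ras_*` / `ras_*` names are invented here.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

/* Assumed byte offsets into the RAS Capability Structure (illustrative). */
#define RAS_UNCORRECTABLE_STATUS 0x00
#define RAS_CORRECTABLE_STATUS   0x0c
#define RAS_CAP_LENGTH           0x58 /* assumed size of the register block */

/* Hypothetical stand-in for the memory-mapped RAS register block. */
struct fake_ras_regs {
	uint32_t reg[RAS_CAP_LENGTH / 4];
};

static uint32_t ras_read(const struct fake_ras_regs *ras, unsigned int offset)
{
	assert(offset % 4 == 0 && offset < RAS_CAP_LENGTH);
	return ras->reg[offset / 4];
}

/* Model write-1-to-clear: every set bit in 'val' clears that status bit. */
static void ras_write_1_to_clear(struct fake_ras_regs *ras,
				 unsigned int offset, uint32_t val)
{
	assert(offset % 4 == 0 && offset < RAS_CAP_LENGTH);
	ras->reg[offset / 4] &= ~val;
}

/* Read one status register, report it if non-zero, then clear it. */
static uint32_t ras_log_and_clear(struct fake_ras_regs *ras,
				  unsigned int offset, const char *name)
{
	uint32_t status = ras_read(ras, offset);

	if (status) {
		printf("%s error status: 0x%08" PRIx32 "\n", name, status);
		ras_write_1_to_clear(ras, offset, status);
	}
	return status;
}

/*
 * Log and clear both status registers. Returns 1 when an uncorrectable
 * error was seen, i.e. the case where the caller would request a reset.
 */
static int ras_handle_error(struct fake_ras_regs *ras)
{
	uint32_t ue;

	ras_log_and_clear(ras, RAS_CORRECTABLE_STATUS, "correctable");
	ue = ras_log_and_clear(ras, RAS_UNCORRECTABLE_STATUS, "uncorrectable");
	return ue != 0;
}
```

In a real driver the return value would presumably map onto the `pci_ers_result_t` returned from the `error_detected` callback, and the accessors would be `readl()`/`writel()` on the mapped capability; the sketch only shows the ordering of read, report, and clear.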