[v3,0/15] Enable CXL PCIe port protocol error handling and logging

Message ID: 20241113215429.3177981-1-terry.bowman@amd.com

Bowman, Terry Nov. 13, 2024, 9:54 p.m. UTC
CXL protocol error handling support exists for CXL endpoint devices and CXL
restricted host downstream port devices. This patchset adds the same
support for CXL PCIe port devices including: CXL root ports, CXL upstream
switch ports, and CXL downstream switch ports.[1]

This implementation separates PCIe protocol error handling from CXL
protocol error handling. The split is necessary because PCIe and CXL
devices have different recovery requirements, specifically for
uncorrectable error (UCE) handling. PCIe AER handling attempts recovery and
disconnects the device if recovery fails. CXL devices must instead trigger
a kernel panic on a UCE. CXL recovery is not attempted because the
procedure could corrupt memory while indicating successful recovery.
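
As a rough illustration of the split described above, the following
userspace C sketch models only the decision, not the kernel code. The names
uce_action and uce_policy are hypothetical and do not appear in the
patchset; in the series itself the CXL path goes through cxl_do_recovery(),
which is visible in the panic traces in the test logs.

```c
#include <stdbool.h>

/* Hypothetical outcome names, used only for this illustration. */
enum uce_action {
	UCE_ACTION_RECOVERED,	/* PCIe path: recovery succeeded */
	UCE_ACTION_DISCONNECT,	/* PCIe path: recovery failed, device disconnected */
	UCE_ACTION_PANIC,	/* CXL path: no recovery attempt, kernel panic */
};

/*
 * Model of the uncorrectable error policy described in the cover text.
 * pcie_recovery_ok is ignored for CXL devices because CXL recovery is
 * never attempted.
 */
enum uce_action uce_policy(bool is_cxl, bool pcie_recovery_ok)
{
	if (is_cxl)
		return UCE_ACTION_PANIC;

	return pcie_recovery_ok ? UCE_ACTION_RECOVERED : UCE_ACTION_DISCONNECT;
}
```

The point of the sketch is that the CXL branch never consults a recovery
result, which is why the two paths need separate handling.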

The first 7 patches update the existing AER service driver to support CXL
PCIe port protocol error handling and reporting. This includes AER service
driver changes for adding correctable and uncorrectable error support, CXL
specific recovery handling, and support for CXL driver callback handlers.

The following 8 patches address CXL driver support for CXL PCIe port
protocol errors. This includes the following changes to the CXL drivers:
mapping CXL port and downstream port RAS registers, interface updates for
common restricted CXL host mode (RCH) and virtual hierarchy mode (VH),
adding port specific error handlers, error logging, and UIE/CIE enablement.

[1] - CXL 3.1 specification, 12.0 Reliability, Availability, and Serviceability

Testing:

Below are test results for this patchset using QEMU with a CXL root
port (0c:00.0), a CXL upstream switch port (0d:00.0), and a CXL downstream
switch port (0e:00.0). The endpoint CE and UCE injection logs are also
included.

This was tested using aer-inject updated to support CE and UCE internal
error injection. CXL RAS was set using a test patch (not upstreamed, but I
can provide it if needed).
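
As an aid for reading the logs, the hex words in the "Injecting errors" and
"error status/mask" lines are bit masks, and the bracketed number printed
under each is the index of the set status bit. Judging from the logs, the
first injected word is the correctable status and the second is the
uncorrectable status. The helper below is a minimal userspace sketch for
decoding such a word. The function names are hypothetical, and the labels
are copied from the logs rather than from any header.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Collect the indices of the set bits in an AER status or mask word,
 * lowest bit first. Returns how many indices were written to bits[].
 */
size_t aer_status_bits(uint32_t status, unsigned int bits[32])
{
	size_t n = 0;

	for (unsigned int i = 0; i < 32; i++) {
		if (status & (UINT32_C(1) << i))
			bits[n++] = i;
	}
	return n;
}

/*
 * Labels copied from the correctable error logs in this cover letter.
 * Only the bits that appear in those logs are named here.
 */
const char *aer_cor_bit_label(unsigned int bit)
{
	switch (bit) {
	case 6:
		return "BadTLP";
	case 14:
		return "CorrIntErr";
	default:
		return "(not named in this sketch)";
	}
}
```

For example, 00004000 has only bit 14 set, matching the "[14] CorrIntErr"
lines, and 00400000 has only bit 22 set, matching "[22] UncorrIntErr".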

 - Root Port Correctable Error 
 root@tbowman-cxl:~/aer-inject# ./root-ce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
 pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
 pcieport 0000:0c:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
 pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00004000/0000a000
 pcieport 0000:0c:00.0:    [14] CorrIntErr
 aer_event: 0000:0c:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
 cxl_port_aer_correctable_error: device=0000:0c:00.0 host=pci0000:0c status='Received Error From Physical Layer'

 - Root Port Uncorrectable Error
 root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
 pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
 pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
 pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/02000000
 pcieport 0000:0c:00.0:    [22] UncorrIntErr
 systemd-journald[482]: Sent WATCHDOG=1 notification.
 aer_event: 0000:0c:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
 cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
 Kernel panic - not syncing: CXL cachemem error. Invoking panic
 CPU: 10 UID: 0 PID: 150 Comm: irq/24-aerdrv Tainted: G            E      6.12.0-rc6test-gb0cd92ab89ad #4507
 Tainted: [E]=UNSIGNED_MODULE
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
 Call Trace:
  <TASK>
  dump_stack_lvl+0x27/0x90
  dump_stack+0x10/0x20
  panic+0x33e/0x380
  cxl_do_recovery+0x122/0x130
  ? srso_return_thunk+0x5/0x5f
  aer_isr+0x3e0/0x710
  irq_thread_fn+0x28/0x70
  irq_thread+0x179/0x240
  ? srso_return_thunk+0x5/0x5f
  ? __pfx_irq_thread_fn+0x10/0x10
  ? __pfx_irq_thread_dtor+0x10/0x10
  ? __pfx_irq_thread+0x10/0x10
  kthread+0xf5/0x130
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x3c/0x60
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>
 Kernel Offset: 0x1800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
 ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---

 - Upstream Port Correctable Error
 root@tbowman-cxl:~/aer-inject# ./us-ce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0d:00.0
 pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0d:00.0
 pcieport 0000:0d:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
 pcieport 0000:0d:00.0:   device [19e5:a128] error status/mask=00004000/0000a000
 pcieport 0000:0d:00.0:    [14] CorrIntErr
 aer_event: 0000:0d:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
 cxl_port_aer_correctable_error: device=0000:0d:00.0 host=0000:0c:00.0 status='Received Error From Physical Layer'

 - Upstream Port Uncorrectable Error
 root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
 pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
 pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
 pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/02000000
 pcieport 0000:0c:00.0:    [22] UncorrIntErr
 systemd-journald[482]: Sent WATCHDOG=1 notification.
 aer_event: 0000:0c:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
 cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
 Kernel panic - not syncing: CXL cachemem error. Invoking panic
 CPU: 10 UID: 0 PID: 150 Comm: irq/24-aerdrv Tainted: G            E      6.12.0-rc6test-gb0cd92ab89ad #4507
 Tainted: [E]=UNSIGNED_MODULE
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
 Call Trace:
  <TASK>
  dump_stack_lvl+0x27/0x90
  dump_stack+0x10/0x20
  panic+0x33e/0x380
  cxl_do_recovery+0x122/0x130
  ? srso_return_thunk+0x5/0x5f
  aer_isr+0x3e0/0x710
  irq_thread_fn+0x28/0x70
  irq_thread+0x179/0x240
  ? srso_return_thunk+0x5/0x5f
  ? __pfx_irq_thread_fn+0x10/0x10
  ? __pfx_irq_thread_dtor+0x10/0x10
  ? __pfx_irq_thread+0x10/0x10
  kthread+0xf5/0x130
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x3c/0x60
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>
 Kernel Offset: 0x1800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
 ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---

 - Downstream Port Correctable Error
 root@tbowman-cxl:~/aer-inject# ./ds-ce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
 pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
 pcieport 0000:0e:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
 pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00004000/0000a000
 pcieport 0000:0e:00.0:    [14] CorrIntErr
 aer_event: 0000:0e:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
 cxl_port_aer_correctable_error: device=0000:0e:00.0 host=0000:0d:00.0 status='Received Error From Physical Layer'

 - Downstream Port Uncorrectable Error
 root@tbowman-cxl:~/aer-inject# ./ds-uce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0e:00.0
 pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:00.0
 pcieport 0000:0e:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
 pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00400000/02000000
 pcieport 0000:0e:00.0:    [22] UncorrIntErr
 aer_event: 0000:0e:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
 cxl_port_aer_uncorrectable_error: device=0000:0e:00.0 host=0000:0d:00.0 status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
 Kernel panic - not syncing: CXL cachemem error. Invoking panic
 CPU: 10 UID: 0 PID: 146 Comm: irq/24-aerdrv Tainted: G            E      6.12.0-rc6test-gb0cd92ab89ad #4507
 Tainted: [E]=UNSIGNED_MODULE
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
 Call Trace:
  <TASK>
  dump_stack_lvl+0x27/0x90
  dump_stack+0x10/0x20
  panic+0x33e/0x380
  cxl_do_recovery+0x122/0x130
  ? srso_return_thunk+0x5/0x5f
  aer_isr+0x3e0/0x710
  irq_thread_fn+0x28/0x70
  irq_thread+0x179/0x240
  ? srso_return_thunk+0x5/0x5f
  ? __pfx_irq_thread_fn+0x10/0x10
  ? __pfx_irq_thread_dtor+0x10/0x10
  ? __pfx_irq_thread+0x10/0x10
  kthread+0xf5/0x130
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x3c/0x60
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>
 Kernel Offset: 0x1ac00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
 ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---

 - Endpoint Correctable Error
 root@tbowman-cxl:~/aer-inject# ./ep-ce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000040/00000000 into device 0000:0f:00.0
 pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0f:00.0
 cxl_pci 0000:0f:00.0: CXL Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
 cxl_pci 0000:0f:00.0:   device [8086:0d93] error status/mask=00000040/0000e000
 systemd-journald[482]: Sent WATCHDOG=1 notification.
 cxl_pci 0000:0f:00.0:    [ 6] BadTLP
 aer_event: 0000:0f:00.0 CXL Bus Error: severity=Corrected, Bad TLP, TLP Header=Not available

 - Endpoint Uncorrectable Error
 root@tbowman-cxl:~/aer-inject# ./ep-uce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00040000 into device 0000:0f:00.0
 pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0f:00.0
 cxl_pci 0000:0f:00.0: AER: CXL Bus Error: severity=Uncorrectable (Fatal), type=Inaccessible, (Unregistered Agent ID)
 aer_event: 0000:0f:00.0 CXL Bus Error: severity=Fatal, , TLP Header=Not available
 cxl_pci 0000:0f:00.0: mem3: frozen state error detected, disable CXL.mem
 cxl_detach_ep: cxl_mem mem3: disconnect mem3 from port2
 cxl_detach_ep: cxl_mem mem3: disconnect mem3 from port1
 pcieport 0000:0e:00.0: unlocked secondary bus reset via: pciehp_reset_slot+0xac/0x160
 pcieport 0000:0e:00.0: AER: Downstream Port link has been reset (0)
 cxl_pci 0000:0f:00.0: mem3: restart CXL.mem after slot reset
 devm_cxl_enumerate_ports: cxl_mem mem3: scan: iter: mem3 dport_dev: 0000:0e:00.0 parent: 0000:0d:00.0
 devm_cxl_enumerate_ports: cxl_mem mem3: found already registered port port2:0000:0d:00.0
 devm_cxl_enumerate_ports: cxl_mem mem3: scan: iter: 0000:0e:00.0 dport_dev: 0000:0c:00.0 parent: pci0000:0c
 devm_cxl_enumerate_ports: cxl_mem mem3: found already registered port port1:pci0000:0c
 cxl_port_alloc: cxl_mem mem3: host-bridge: pci0000:0c
 cxl_cdat_get_length: cxl_port endpoint6: CDAT length 160
 cxl_port_perf_data_calculate: cxl_port endpoint6: Failed to retrieve ep perf coordinates.
 cxl_endpoint_parse_cdat: cxl_port endpoint6: Failed to do perf coord calculations.
 init_hdm_decoder: cxl_port endpoint6: decoder6.0: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
 add_hdm_decoder: cxl decoder6.0: Added to port endpoint6
 init_hdm_decoder: cxl_port endpoint6: decoder6.1: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
 add_hdm_decoder: cxl decoder6.1: Added to port endpoint6
 init_hdm_decoder: cxl_port endpoint6: decoder6.2: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
 add_hdm_decoder: cxl decoder6.2: Added to port endpoint6
 init_hdm_decoder: cxl_port endpoint6: decoder6.3: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
 add_hdm_decoder: cxl decoder6.3: Added to port endpoint6
 cxl_bus_probe: cxl_port endpoint6: probe: 0
 devm_cxl_add_port: cxl_mem mem3: endpoint6 added to port2
 cxl_bus_probe: cxl_mem mem3: probe: 0
 cxl_pci 0000:0f:00.0: mem3: error resume successful
 pcieport 0000:0e:00.0: AER: device recovery successful

 Changes in v2 -> v3
 [Terry] Rebase to 6.12-rc7
 [Terry] Add UIE/CIE port enablement patch. Needed because only RPs are enabled by the AER driver.
 [DaveJ] Isolate reading upstream port's AER info to only the CXL path
 [Jonathan, Dan] Add details about separate handling paths for CXL & PCIe
 [Jonathan] Add details to existing comment in devm_cxl_add_endpoint()
 about call to cxl_init_ep_ports_aer()
 [Jonathan] Updated cxl_init_ep_ports_aer() w/ checks for NULL
 [Jonathan] Move find_cxl_port() patch immediately before patch to create handlers
 [Jonathan] Patch title fix: find_cxl_ports() -> find_cxl_port()
 [Jonathan] Remove 2 unnecessary dev_warns() in cxl_dport_init_ras_reporting() and
 cxl_uport_init_ras_reporting().
 [Jonathan] Remove unnecessary filter on PCIe port devices in dev_is_cxl_pci()
 [Jonathan] Remove cleanup declarations in cxl_pci_port_ras()
 [Jonathan] Fix spacing in 'struct cxl_error_handlers' declaration.
 [Jonathan] Remove unnecessary check for PCI device in __cxl_handle_ras() & __cxl_handle_cor_ras()
 
 Changes in v1 -> v2
 [Jonathan] Remove extra NULL check and cleanup in cxl_pci_port_ras()
 [Jonathan] Update description to DSP map patch description
 [Jonathan] Update cxl_pci_port_ras() to check for NULL port
 [Jonathan] Don't call handler before handler port changes are present (patch order)
 [Bjorn] Fix linebreak in cover sheet URL
 [Bjorn] Remove timestamps from test logs in cover sheet
 [Bjorn] Retitle AER commits to use "PCI/AER:"
 [Bjorn] Retitle patch#3 to use renaming instead of refactoring
 [Bjorn] Fix base commit-id on cover sheet
 [Bjorn] Add VH spec reference/citation
 [Terry] Removed last 2 patches to enable internal errors. They are not needed
 because internal errors are enabled in the AER driver.
 [Dan] Create cxl_do_recovery() and pci_driver::cxl_err_handlers.
 [Dan] Use kernel panic in CXL recovery
 [Dan] cxl_port_hndlrs -> cxl_port_error_handlers
 [Dan] Move cxl_port_error_handlers to pci_driver. Remove module (un)registration.
 [Terry] Add patch w/ cxl_assign_port_error_handlers() and cxl_clear_port_error_handlers()
 [Terry] Removed PCI_ERS_RESULT_PANIC patch. It is no longer needed because the result type parameter
 is not used in the cxl_err_handlers callbacks.

 Changes in RFC -> v1:
 [Dan] Rename cxl_rch_handle_error() to cxl_handle_error()
 [Dan] Add cxl_do_recovery()
 [Jonathan] Flatten cxl_setup_parent_uport()
 [Jonathan] Use cxl_component_regs instead of struct cxl_regs regs
 [Jonathan] Rename cxl_dev_is_pci_type()
 [Ming] bus_find_device(&cxl_bus_type, NULL, &pdev->dev, match_uport) can
 replace these find_cxl_port() and device_find_child().
 [Jonathan] Compact call to cxl_port_map_regs() in cxl_setup_parent_uport()
 [Ming] Don't use endpoint as host to cxl_map_component_regs()
 [Bjorn] Use "PCIe UIE/CIE" instead of "AER UI/CIE"
 [Bjorn] Don't use Kconfig to enable/disable a CXL external interface

Terry Bowman (15):
  PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct
    pci_driver'
  PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe port
    support
  cxl/pci: Introduce PCIe helper functions pcie_is_cxl() and
    pcie_is_cxl_port()
  PCI/AER: Modify AER driver logging to report CXL or PCIe bus error
    type
  PCI/AER: Add CXL PCIe port correctable error support in AER service
    driver
  PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe
    port devices
  PCI/AER: Add CXL PCIe port uncorrectable error recovery in AER service
    driver
  cxl/pci: Map CXL PCIe root port and downstream switch port RAS
    registers
  cxl/pci: Map CXL PCIe upstream switch port RAS registers
  cxl/pci: Update RAS handler interfaces to also support CXL PCIe ports
  cxl/pci: Change find_cxl_port() to non-static
  cxl/pci: Add error handler for CXL PCIe port RAS errors
  cxl/pci: Add trace logging for CXL PCIe port RAS errors
  cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers
  PCI/AER: Enable internal errors for CXL upstream and downstream switch
    ports

 drivers/cxl/core/core.h       |   3 +
 drivers/cxl/core/pci.c        | 183 ++++++++++++++++++++++++++++------
 drivers/cxl/core/port.c       |   4 +-
 drivers/cxl/core/trace.h      |  47 +++++++++
 drivers/cxl/cxl.h             |  10 +-
 drivers/cxl/mem.c             |  36 ++++++-
 drivers/pci/pci.c             |  14 +++
 drivers/pci/pci.h             |   3 +
 drivers/pci/pcie/aer.c        | 104 +++++++++++--------
 drivers/pci/pcie/err.c        |  54 ++++++++++
 drivers/pci/probe.c           |  10 ++
 include/linux/aer.h           |   1 +
 include/linux/pci.h           |  13 +++
 include/ras/ras_event.h       |   9 +-
 include/uapi/linux/pci_regs.h |   3 +-
 15 files changed, 410 insertions(+), 84 deletions(-)


base-commit: 2d5404caa8c7bb5c4e0435f94b28834ae5456623