[v2,0/14] Enable CXL PCIe port protocol error handling and logging

Message ID	20241025210305.27499-1-terry.bowman@amd.com (mailing list archive)
Headers	show Received: from NAM12-MW2-obe.outbound.protection.outlook.com (mail-mw2nam12on2068.outbound.protection.outlook.com [40.107.244.68]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 45810217F2A; Fri, 25 Oct 2024 21:03:17 +0000 (UTC) Received-SPF: Pass (protection.outlook.com: domain of amd.com designates 165.204.84.17 as permitted sender) receiver=protection.outlook.com; client-ip=165.204.84.17; helo=SATLEXMB04.amd.com; pr=C From: Terry Bowman <terry.bowman@amd.com> To: <ming4.li@intel.com>, <linux-cxl@vger.kernel.org>, <linux-kernel@vger.kernel.org>, <linux-pci@vger.kernel.org>, <dave@stgolabs.net>, <jonathan.cameron@huawei.com>, <dave.jiang@intel.com>, <alison.schofield@intel.com>, <vishal.l.verma@intel.com>, <dan.j.williams@intel.com>, <bhelgaas@google.com>, <mahesh@linux.ibm.com>, <ira.weiny@intel.com>, <oohall@gmail.com>, <Benjamin.Cheatham@amd.com>, <rrichter@amd.com>, <nathan.fontenot@amd.com>, <terry.bowman@amd.com>, <Smita.KoralahalliChannabasappa@amd.com> Subject: [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging Date: Fri, 25 Oct 2024 16:02:51 -0500 Message-ID: <20241025210305.27499-1-terry.bowman@amd.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Type: text/plain
Series	Enable CXL PCIe port protocol error handling and logging \| expand [v2,0/14] Enable CXL PCIe port protocol error handling and logging [v2,01/14] PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct pci_driver' [v2,02/14] PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe port support [v2,03/14] cxl/pci: Introduce helper functions pcie_is_cxl() and pcie_is_cxl_port() [v2,04/14] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type [v2,05/14] PCI/AER: Add CXL PCIe port correctable error support in AER service driver [v2,06/14] PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe port devices [v2,07/14] PCI/AER: Add CXL PCIe port uncorrectable error recovery in AER service driver [v2,08/14] cxl/pci: Change find_cxl_ports() to non-static [v2,09/14] cxl/pci: Map CXL PCIe root port and downstream switch port RAS registers [v2,10/14] cxl/pci: Map CXL PCIe upstream switch port RAS registers [v2,11/14] cxl/pci: Rename RAS handler interfaces to also indicate CXL PCIe port support [v2,12/14] cxl/pci: Add error handler for CXL PCIe port RAS errors [v2,13/14] cxl/pci: Add trace logging for CXL PCIe port RAS errors [v2,14/14] cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers

Bowman, Terry Oct. 25, 2024, 9:02 p.m. UTC

This is a continuation of the CXL port error handling RFC from earlier.[1]
The RFC resulted in the decision to add CXL PCIe port error handling to
the existing RCH downstream port handling in the AER service driver. This
patchset adds the CXL PCIe port protocol error handling and logging.

The first 7 patches update the existing AER service driver to support CXL
PCIe port protocol error handling and reporting. This includes AER service
driver changes for adding correctable and uncorrectable error support, CXL
specific recovery handling, and addition of CXL driver callback handlers.

The following 7 patches address CXL driver support for CXL PCIe port
protocol errors. This includes the following changes to the CXL drivers:
mapping CXL port and downstream port RAS registers, interface updates for
common restricted CXL host mode (RCH) and virtual hierarchy mode (VH),
adding port specific error handlers, and protocol error logging.

[1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@amd.com/

Testing:

Below are test results for this patchset using Qemu with CXL root
port(0c:00.0), CXL upstream switchport(0d:00.0), CXL downstream
switchport(0e:00.0). A CXL endpoint(0f:00.0) CE and UCE logs are
also added to show the existing PCIe endpoint handling is not changed.

This was tested using aer-inject updated to support CE and UCE internal
error injection. CXL RAS was set using a test patch (not upstreamed but can
provide if needed).

 - Root port UCE:
 root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
 pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
 pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
 pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/02000000
 pcieport 0000:0c:00.0:    [22] UncorrIntErr
 aer_event: 0000:0c:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
 cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
 Kernel panic - not syncing: CXL cachemem error. Invoking panic
 CPU: 1 UID: 0 PID: 146 Comm: irq/24-aerdrv Tainted: G            E      6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
 Tainted: [E]=UNSIGNED_MODULE
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
 Call Trace:
  <TASK>
  dump_stack_lvl+0x27/0x90
  dump_stack+0x10/0x20
  panic+0x33e/0x380
  cxl_do_recovery+0x116/0x120
  ? srso_return_thunk+0x5/0x5f
  aer_isr+0x3e0/0x710
  irq_thread_fn+0x28/0x70
  irq_thread+0x179/0x240
  ? srso_return_thunk+0x5/0x5f
  ? __pfx_irq_thread_fn+0x10/0x10
  ? __pfx_irq_thread_dtor+0x10/0x10
  ? __pfx_irq_thread+0x10/0x10
  kthread+0xf5/0x130
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x3c/0x60
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>
 Kernel Offset: 0x29000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
 ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---

 - Root port CE:
 root@tbowman-cxl:~/aer-inject# ./root-c[  191.866259] systemd-journald[482]: Sent WATCHDOG=1 notification.
 e-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
 pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
 pcieport 0000:0c:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
 pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00004000/0000a000
 pcieport 0000:0c:00.0:    [14] CorrIntErr
 aer_event: 0000:0c:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
 cxl_port_aer_correctable_error: device=0000:0c:00.0 host=pci0000:0c status='Received Error From Physical Layer'

 - Upstream switch port UCE:
 root@tbowman-cxl:~/aer-inject# ./us-uce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0d:00.0
 pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0d:00.0
 pcieport 0000:0d:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
 pcieport 0000:0d:00.0:   device [19e5:a128] error status/mask=00400000/02000000
 pcieport 0000:0d:00.0:    [22] UncorrIntErr
 aer_event: 0000:0d:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
 cxl_port_aer_uncorrectable_error: device=0000:0d:00.0 host=0000:0c:00.0 status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
 Kernel panic - not syncing: CXL cachemem error. Invoking panic
 CPU: 1 UID: 0 PID: 148 Comm: irq/24-aerdrv Tainted: G            E      6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
 Tainted: [E]=UNSIGNED_MODULE
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
 Call Trace:
  <TASK>
  dump_stack_lvl+0x27/0x90
  dump_stack+0x10/0x20
  panic+0x33e/0x380
  cxl_do_recovery+0x116/0x120
  ? srso_return_thunk+0x5/0x5f
  aer_isr+0x3e0/0x710
  ? free_cpumask_var+0x9/0x10
  ? kfree+0x259/0x2e0
  irq_thread_fn+0x28/0x70
  irq_thread+0x179/0x240
  ? srso_return_thunk+0x5/0x5f
  ? __pfx_irq_thread_fn+0x10/0x10
  ? __pfx_irq_thread_dtor+0x10/0x10
  ? __pfx_irq_thread+0x10/0x10
  kthread+0xf5/0x130
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x3c/0x60
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>
 Kernel Offset: 0x24c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
 ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---

 - Upstream switch port CE:
 root@tbowman-cxl:~/aer-inject# ./us-ce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0d:00.0
 pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0d:00.0
 pcieport 0000:0d:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
 pcieport 0000:0d:00.0:   device [19e5:a128] error status/mask=00004000/0000a000
 pcieport 0000:0d:00.0:    [14] CorrIntErr
 aer_event: 0000:0d:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
 cxl_port_aer_correctable_error: device=0000:0d:00.0 host=0000:0c:00.0 status='Received Error From Physical Layer'

 - Downstream switch port UCE:
 root@tbowman-cxl:~/aer-inject# ./ds-uce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0e:00.0
 pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:00.0
 pcieport 0000:0e:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
 pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00400000/02000000
 pcieport 0000:0e:00.0:    [22] UncorrIntErr
 aer_event: 0000:0e:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
 cxl_port_aer_uncorrectable_error: device=0000:0e:00.0 host=0000:0d:00.0 status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
 Kernel panic - not syncing: CXL cachemem error. Invoking panic
 CPU: 1 UID: 0 PID: 147 Comm: irq/24-aerdrv Tainted: G            E      6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
 Tainted: [E]=UNSIGNED_MODULE
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
 Call Trace:
  <TASK>
  dump_stack_lvl+0x27/0x90
  dump_stack+0x10/0x20
  panic+0x33e/0x380
  cxl_do_recovery+0x116/0x120
  ? srso_return_thunk+0x5/0x5f
  aer_isr+0x3e0/0x710
  irq_thread_fn+0x28/0x70
  irq_thread+0x179/0x240
  ? srso_return_thunk+0x5/0x5f
  ? __pfx_irq_thread_fn+0x10/0x10
  ? __pfx_irq_thread_dtor+0x10/0x10
  ? __pfx_irq_thread+0x10/0x10
  kthread+0xf5/0x130
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x3c/0x60
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>
 Kernel Offset: 0x19c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
 ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---

 - Downstream switch port CE:
 root@tbowman-cxl:~/aer-inject# ./ds-ce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
 pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
 pcieport 0000:0e:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
 pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00004000/0000a000
 pcieport 0000:0e:00.0:    [14] CorrIntErr
 aer_event: 0000:0e:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
 cxl_port_aer_correctable_error: device=0000:0e:00.0 host=0000:0d:00.0 status='Received Error From Physical Layer'

 - Endpoint CE
 root@tbowman-cxl:~/aer-inject# ./ep-ce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000040/00000000 into device 0000:0f:00.0
 pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0f:00.0
 cxl_pci 0000:0f:00.0: CXL Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
 cxl_pci 0000:0f:00.0:   device [8086:0d93] error status/mask=00000040/0000e000
 cxl_pci 0000:0f:00.0:    [ 6] BadTLP
 aer_event: 0000:0f:00.0 CXL Bus Error: severity=Corrected, Bad TLP, TLP Header=Not available
 cxl_aer_correctable_error: memdev=mem1 host=0000:0f:00.0 serial=0: status: 'Received Error From Physical Layer'

 - Endpoint UCE
 root@tbowman-cxl:~/aer-inject# ./ep-uce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00040000 into device 0000:0f:00.0
 pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0f:00.0
 cxl_pci 0000:0f:00.0: AER: CXL Bus Error: severity=Uncorrectable (Fatal), type=Inaccessible, (Unregistered Agent ID)
 aer_event: 0000:0f:00.0 CXL Bus Error: severity=Fatal, , TLP Header=Not available
 cxl_aer_uncorrectable_error: memdev=mem1 host=0000:0f:00.0 serial=0: status: 'Memory Byte Enable Parity Error' firs'
 cxl_pci 0000:0f:00.0: mem1: frozen state error detected, disable CXL.mem
 cxl_detach_ep: cxl_mem mem1: disconnect mem1 from port2
 cxl_detach_ep: cxl_mem mem1: disconnect mem1 from port1
 pcieport 0000:0e:00.0: unlocked secondary bus reset via: pciehp_reset_slot+0xac/0x160
 pcieport 0000:0e:00.0: AER: Downstream Port link has been reset (0)
 cxl_pci 0000:0f:00.0: mem1: restart CXL.mem after slot reset
 devm_cxl_enumerate_ports: cxl_mem mem1: scan: iter: mem1 dport_dev: 0000:0e:00.0 parent: 0000:0d:00.0
 devm_cxl_enumerate_ports: cxl_mem mem1: found already registered port port2:0000:0d:00.0
 devm_cxl_enumerate_ports: cxl_mem mem1: scan: iter: 0000:0e:00.0 dport_dev: 0000:0c:00.0 parent: pci0000:0c
 devm_cxl_enumerate_ports: cxl_mem mem1: found already registered port port1:pci0000:0c
 __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4500
 cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
 __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4500
 cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
 __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
 cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
 __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
 cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
 <snip>
 cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
 __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
 cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
 __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
 cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
 __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
 cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
 cxl_bus_probe: cxl_nvdimm pmem1: probe: 0
 devm_cxl_add_nvdimm: cxl_mem mem1: register pmem1
 pcieport 0000:0e:00.0: RAS is already mapped
 cxl_port port2: RAS is already mapped
 pcieport 0000:0c:00.0: RAS is already mapped
 cxl_port_alloc: cxl_mem mem1: host-bridge: pci0000:0c
 cxl_cdat_get_length: cxl_port endpoint4: CDAT length 160
 cxl_port_perf_data_calculate: cxl_port endpoint4: Failed to retrieve ep perf coordinates.
 cxl_endpoint_parse_cdat: cxl_port endpoint4: Failed to do perf coord calculations.
 init_hdm_decoder: cxl_port endpoint4: decoder4.0: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
 add_hdm_decoder: cxl decoder4.0: Added to port endpoint4
 init_hdm_decoder: cxl_port endpoint4: decoder4.1: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
 add_hdm_decoder: cxl decoder4.1: Added to port endpoint4
 init_hdm_decoder: cxl_port endpoint4: decoder4.2: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
 add_hdm_decoder: cxl decoder4.2: Added to port endpoint4
 init_hdm_decoder: cxl_port endpoint4: decoder4.3: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
 add_hdm_decoder: cxl decoder4.3: Added to port endpoint4
 cxl_bus_probe: cxl_port endpoint4: probe: 0
 devm_cxl_add_port: cxl_mem mem1: endpoint4 added to port2
 cxl_bus_probe: cxl_mem mem1: probe: 0
 cxl_pci 0000:0f:00.0: mem1: error resume successful
 pcieport 0000:0e:00.0: AER: device recovery successful

 Changes in v1 -> v2
 [Jonathan] Remove extra NULL check and cleanup in cxl_pci_port_ras()
 [Jonathan] Update description to DSP map patch description
 [Jonathan] Update cxl_pci_port_ras() to check for NULL port
 [Jonathan] Dont call handler before handler port changes are present (patch order).
 [Bjorn] Fix linebreak in cover sheet URL
 [Bjorn] Remove timestamps from test logs in cover sheet
 [Bjorn] Retitle AER commits to use "PCI/AER:"
 [Bjorn] Retitle patch#3 to use renaming instead of refactoring
 [Bjorn] Fixe base commit-id on cover sheet
 [Bjorn] Add VH spec reference/citation
 [Terry] Removed last 2 patches to enable internal errors. Is not needed
 because internal errors are enabled in AER driver.
 [Dan] Create cxl_do_recovery() and pci_driver::cxl_err_handlers.
 [Dan] Use kernel panic in CXL recovery
 [Dan] cxl_port_hndlrs -> cxl_port_error_handlers
 [Dan] Move cxl_port_error_handlers to pci_driver. Remove module (un)registration.
 [Terry] Add patch w/ qcxl_assign_port_error_handlers() and cxl_clear_port_error_handlers()
 [Terry] Removed PCI_ERS_RESULT_PANIC patch. Is no longer needed because the result type parameter
 is not used in the CXL_err_handlers callabcks.

Changes in RFC -> v1:
 [Dan] Rename cxl_rch_handle_error() becomes cxl_handle_error()
 [Dan] Add cxl_do_recovery()
 [Jonathan] Flatten cxl_setup_parent_uport()
 [Jonathan] Use cxl_component_regs instead of struct cxl_regs regs
 [Jonathan] Rename cxl_dev_is_pci_type()
 [Ming] bus_find_device(&cxl_bus_type, NULL, &pdev->dev, match_uport) can
 replace these find_cxl_port() and device_find_child().
 [Jonathan] Compact call to cxl_port_map_regs() in cxl_setup_parent_uport()
 [Ming] Dont use endpoint as host to cxl_map_component_regs()
 [Bjorn] Use "PCIe UIR/CIE" instesad of "AER UI/CIE"
 [Bjorn] Dont use Kconfig to enable/disable a CXL external interface

Terry Bowman (14):
  PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct
    pci_driver'
  PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe port
    support
  cxl/pci: Introduce helper functions pcie_is_cxl() and
    pcie_is_cxl_port()
  PCI/AER: Modify AER driver logging to report CXL or PCIe bus error
    type
  PCI/AER: Add CXL PCIe port correctable error support in AER service
    driver
  PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe
    port devices
  PCI/AER: Add CXL PCIe port uncorrectable error recovery in AER service
    driver
  cxl/pci: Change find_cxl_ports() to non-static
  cxl/pci: Map CXL PCIe root port and downstream switch port RAS
    registers
  cxl/pci: Map CXL PCIe upstream switch port RAS registers
  cxl/pci: Rename RAS handler interfaces to also indicate CXL PCIe port
    support
  cxl/pci: Add error handler for CXL PCIe port RAS errors
  cxl/pci: Add trace logging for CXL PCIe port RAS errors
  cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers

 drivers/cxl/core/core.h       |   3 +
 drivers/cxl/core/pci.c        | 180 +++++++++++++++++++++++++++-------
 drivers/cxl/core/port.c       |   4 +-
 drivers/cxl/core/trace.h      |  47 +++++++++
 drivers/cxl/cxl.h             |  10 +-
 drivers/cxl/mem.c             |  29 +++++-
 drivers/pci/pci.c             |  14 +++
 drivers/pci/pci.h             |   3 +
 drivers/pci/pcie/aer.c        |  99 ++++++++++++-------
 drivers/pci/pcie/err.c        |  54 ++++++++++
 drivers/pci/probe.c           |  10 ++
 include/linux/pci.h           |  13 +++
 include/ras/ras_event.h       |   9 +-
 include/uapi/linux/pci_regs.h |   3 +-
 14 files changed, 396 insertions(+), 82 deletions(-)


base-commit: 739a5da7ed744578a9477fb322f04afecafca6b0

Bowman, Terry Oct. 27, 2024, 4:59 p.m. UTC | #1

Correction. This applies to:

8cf0b93919e1 (tag: v6.12-rc2) Linux 6.12-rc2

On 10/25/2024 4:02 PM, Terry Bowman wrote:
> This is a continuation of the CXL port error handling RFC from earlier.[1]
> The RFC resulted in the decision to add CXL PCIe port error handling to
> the existing RCH downstream port handling in the AER service driver. This
> patchset adds the CXL PCIe port protocol error handling and logging.
>
> The first 7 patches update the existing AER service driver to support CXL
> PCIe port protocol error handling and reporting. This includes AER service
> driver changes for adding correctable and uncorrectable error support, CXL
> specific recovery handling, and addition of CXL driver callback handlers.
>
> The following 7 patches address CXL driver support for CXL PCIe port
> protocol errors. This includes the following changes to the CXL drivers:
> mapping CXL port and downstream port RAS registers, interface updates for
> common restricted CXL host mode (RCH) and virtual hierarchy mode (VH),
> adding port specific error handlers, and protocol error logging.
>
> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@amd.com/
>
> Testing:
>
> Below are test results for this patchset using Qemu with CXL root
> port(0c:00.0), CXL upstream switchport(0d:00.0), CXL downstream
> switchport(0e:00.0). A CXL endpoint(0f:00.0) CE and UCE logs are
> also added to show the existing PCIe endpoint handling is not changed.
>
> This was tested using aer-inject updated to support CE and UCE internal
> error injection. CXL RAS was set using a test patch (not upstreamed but can
> provide if needed).
>
>  - Root port UCE:
>  root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
>  pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
>  pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>  pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/02000000
>  pcieport 0000:0c:00.0:    [22] UncorrIntErr
>  aer_event: 0000:0c:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
>  cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
>  Kernel panic - not syncing: CXL cachemem error. Invoking panic
>  CPU: 1 UID: 0 PID: 146 Comm: irq/24-aerdrv Tainted: G            E      6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
>  Tainted: [E]=UNSIGNED_MODULE
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>  Call Trace:
>   <TASK>
>   dump_stack_lvl+0x27/0x90
>   dump_stack+0x10/0x20
>   panic+0x33e/0x380
>   cxl_do_recovery+0x116/0x120
>   ? srso_return_thunk+0x5/0x5f
>   aer_isr+0x3e0/0x710
>   irq_thread_fn+0x28/0x70
>   irq_thread+0x179/0x240
>   ? srso_return_thunk+0x5/0x5f
>   ? __pfx_irq_thread_fn+0x10/0x10
>   ? __pfx_irq_thread_dtor+0x10/0x10
>   ? __pfx_irq_thread+0x10/0x10
>   kthread+0xf5/0x130
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork+0x3c/0x60
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork_asm+0x1a/0x30
>   </TASK>
>  Kernel Offset: 0x29000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>  ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>
>  - Root port CE:
>  root@tbowman-cxl:~/aer-inject# ./root-c[  191.866259] systemd-journald[482]: Sent WATCHDOG=1 notification.
>  e-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
>  pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
>  pcieport 0000:0c:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>  pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00004000/0000a000
>  pcieport 0000:0c:00.0:    [14] CorrIntErr
>  aer_event: 0000:0c:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
>  cxl_port_aer_correctable_error: device=0000:0c:00.0 host=pci0000:0c status='Received Error From Physical Layer'
>
>  - Upstream switch port UCE:
>  root@tbowman-cxl:~/aer-inject# ./us-uce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0d:00.0
>  pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0d:00.0
>  pcieport 0000:0d:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>  pcieport 0000:0d:00.0:   device [19e5:a128] error status/mask=00400000/02000000
>  pcieport 0000:0d:00.0:    [22] UncorrIntErr
>  aer_event: 0000:0d:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
>  cxl_port_aer_uncorrectable_error: device=0000:0d:00.0 host=0000:0c:00.0 status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
>  Kernel panic - not syncing: CXL cachemem error. Invoking panic
>  CPU: 1 UID: 0 PID: 148 Comm: irq/24-aerdrv Tainted: G            E      6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
>  Tainted: [E]=UNSIGNED_MODULE
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>  Call Trace:
>   <TASK>
>   dump_stack_lvl+0x27/0x90
>   dump_stack+0x10/0x20
>   panic+0x33e/0x380
>   cxl_do_recovery+0x116/0x120
>   ? srso_return_thunk+0x5/0x5f
>   aer_isr+0x3e0/0x710
>   ? free_cpumask_var+0x9/0x10
>   ? kfree+0x259/0x2e0
>   irq_thread_fn+0x28/0x70
>   irq_thread+0x179/0x240
>   ? srso_return_thunk+0x5/0x5f
>   ? __pfx_irq_thread_fn+0x10/0x10
>   ? __pfx_irq_thread_dtor+0x10/0x10
>   ? __pfx_irq_thread+0x10/0x10
>   kthread+0xf5/0x130
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork+0x3c/0x60
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork_asm+0x1a/0x30
>   </TASK>
>  Kernel Offset: 0x24c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>  ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>
>  - Upstream switch port CE:
>  root@tbowman-cxl:~/aer-inject# ./us-ce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0d:00.0
>  pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0d:00.0
>  pcieport 0000:0d:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>  pcieport 0000:0d:00.0:   device [19e5:a128] error status/mask=00004000/0000a000
>  pcieport 0000:0d:00.0:    [14] CorrIntErr
>  aer_event: 0000:0d:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
>  cxl_port_aer_correctable_error: device=0000:0d:00.0 host=0000:0c:00.0 status='Received Error From Physical Layer'
>
>  - Downstream switch port UCE:
>  root@tbowman-cxl:~/aer-inject# ./ds-uce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0e:00.0
>  pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:00.0
>  pcieport 0000:0e:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>  pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00400000/02000000
>  pcieport 0000:0e:00.0:    [22] UncorrIntErr
>  aer_event: 0000:0e:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
>  cxl_port_aer_uncorrectable_error: device=0000:0e:00.0 host=0000:0d:00.0 status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
>  Kernel panic - not syncing: CXL cachemem error. Invoking panic
>  CPU: 1 UID: 0 PID: 147 Comm: irq/24-aerdrv Tainted: G            E      6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
>  Tainted: [E]=UNSIGNED_MODULE
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>  Call Trace:
>   <TASK>
>   dump_stack_lvl+0x27/0x90
>   dump_stack+0x10/0x20
>   panic+0x33e/0x380
>   cxl_do_recovery+0x116/0x120
>   ? srso_return_thunk+0x5/0x5f
>   aer_isr+0x3e0/0x710
>   irq_thread_fn+0x28/0x70
>   irq_thread+0x179/0x240
>   ? srso_return_thunk+0x5/0x5f
>   ? __pfx_irq_thread_fn+0x10/0x10
>   ? __pfx_irq_thread_dtor+0x10/0x10
>   ? __pfx_irq_thread+0x10/0x10
>   kthread+0xf5/0x130
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork+0x3c/0x60
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork_asm+0x1a/0x30
>   </TASK>
>  Kernel Offset: 0x19c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>  ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>
>  - Downstream switch port CE:
>  root@tbowman-cxl:~/aer-inject# ./ds-ce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
>  pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
>  pcieport 0000:0e:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>  pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00004000/0000a000
>  pcieport 0000:0e:00.0:    [14] CorrIntErr
>  aer_event: 0000:0e:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
>  cxl_port_aer_correctable_error: device=0000:0e:00.0 host=0000:0d:00.0 status='Received Error From Physical Layer'
>
>  - Endpoint CE
>  root@tbowman-cxl:~/aer-inject# ./ep-ce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000040/00000000 into device 0000:0f:00.0
>  pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0f:00.0
>  cxl_pci 0000:0f:00.0: CXL Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
>  cxl_pci 0000:0f:00.0:   device [8086:0d93] error status/mask=00000040/0000e000
>  cxl_pci 0000:0f:00.0:    [ 6] BadTLP
>  aer_event: 0000:0f:00.0 CXL Bus Error: severity=Corrected, Bad TLP, TLP Header=Not available
>  cxl_aer_correctable_error: memdev=mem1 host=0000:0f:00.0 serial=0: status: 'Received Error From Physical Layer'
>
>  - Endpoint UCE
>  root@tbowman-cxl:~/aer-inject# ./ep-uce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00040000 into device 0000:0f:00.0
>  pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0f:00.0
>  cxl_pci 0000:0f:00.0: AER: CXL Bus Error: severity=Uncorrectable (Fatal), type=Inaccessible, (Unregistered Agent ID)
>  aer_event: 0000:0f:00.0 CXL Bus Error: severity=Fatal, , TLP Header=Not available
>  cxl_aer_uncorrectable_error: memdev=mem1 host=0000:0f:00.0 serial=0: status: 'Memory Byte Enable Parity Error' firs'
>  cxl_pci 0000:0f:00.0: mem1: frozen state error detected, disable CXL.mem
>  cxl_detach_ep: cxl_mem mem1: disconnect mem1 from port2
>  cxl_detach_ep: cxl_mem mem1: disconnect mem1 from port1
>  pcieport 0000:0e:00.0: unlocked secondary bus reset via: pciehp_reset_slot+0xac/0x160
>  pcieport 0000:0e:00.0: AER: Downstream Port link has been reset (0)
>  cxl_pci 0000:0f:00.0: mem1: restart CXL.mem after slot reset
>  devm_cxl_enumerate_ports: cxl_mem mem1: scan: iter: mem1 dport_dev: 0000:0e:00.0 parent: 0000:0d:00.0
>  devm_cxl_enumerate_ports: cxl_mem mem1: found already registered port port2:0000:0d:00.0
>  devm_cxl_enumerate_ports: cxl_mem mem1: scan: iter: 0000:0e:00.0 dport_dev: 0000:0c:00.0 parent: pci0000:0c
>  devm_cxl_enumerate_ports: cxl_mem mem1: found already registered port port1:pci0000:0c
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4500
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4500
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  <snip>
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  cxl_bus_probe: cxl_nvdimm pmem1: probe: 0
>  devm_cxl_add_nvdimm: cxl_mem mem1: register pmem1
>  pcieport 0000:0e:00.0: RAS is already mapped
>  cxl_port port2: RAS is already mapped
>  pcieport 0000:0c:00.0: RAS is already mapped
>  cxl_port_alloc: cxl_mem mem1: host-bridge: pci0000:0c
>  cxl_cdat_get_length: cxl_port endpoint4: CDAT length 160
>  cxl_port_perf_data_calculate: cxl_port endpoint4: Failed to retrieve ep perf coordinates.
>  cxl_endpoint_parse_cdat: cxl_port endpoint4: Failed to do perf coord calculations.
>  init_hdm_decoder: cxl_port endpoint4: decoder4.0: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
>  add_hdm_decoder: cxl decoder4.0: Added to port endpoint4
>  init_hdm_decoder: cxl_port endpoint4: decoder4.1: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
>  add_hdm_decoder: cxl decoder4.1: Added to port endpoint4
>  init_hdm_decoder: cxl_port endpoint4: decoder4.2: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
>  add_hdm_decoder: cxl decoder4.2: Added to port endpoint4
>  init_hdm_decoder: cxl_port endpoint4: decoder4.3: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
>  add_hdm_decoder: cxl decoder4.3: Added to port endpoint4
>  cxl_bus_probe: cxl_port endpoint4: probe: 0
>  devm_cxl_add_port: cxl_mem mem1: endpoint4 added to port2
>  cxl_bus_probe: cxl_mem mem1: probe: 0
>  cxl_pci 0000:0f:00.0: mem1: error resume successful
>  pcieport 0000:0e:00.0: AER: device recovery successful
>
>  Changes in v1 -> v2
>  [Jonathan] Remove extra NULL check and cleanup in cxl_pci_port_ras()
>  [Jonathan] Update description to DSP map patch description
>  [Jonathan] Update cxl_pci_port_ras() to check for NULL port
>  [Jonathan] Dont call handler before handler port changes are present (patch order).
>  [Bjorn] Fix linebreak in cover sheet URL
>  [Bjorn] Remove timestamps from test logs in cover sheet
>  [Bjorn] Retitle AER commits to use "PCI/AER:"
>  [Bjorn] Retitle patch#3 to use renaming instead of refactoring
>  [Bjorn] Fixe base commit-id on cover sheet
>  [Bjorn] Add VH spec reference/citation
>  [Terry] Removed last 2 patches to enable internal errors. Is not needed
>  because internal errors are enabled in AER driver.
>  [Dan] Create cxl_do_recovery() and pci_driver::cxl_err_handlers.
>  [Dan] Use kernel panic in CXL recovery
>  [Dan] cxl_port_hndlrs -> cxl_port_error_handlers
>  [Dan] Move cxl_port_error_handlers to pci_driver. Remove module (un)registration.
>  [Terry] Add patch w/ qcxl_assign_port_error_handlers() and cxl_clear_port_error_handlers()
>  [Terry] Removed PCI_ERS_RESULT_PANIC patch. Is no longer needed because the result type parameter
>  is not used in the CXL_err_handlers callabcks.
>
> Changes in RFC -> v1:
>  [Dan] Rename cxl_rch_handle_error() becomes cxl_handle_error()
>  [Dan] Add cxl_do_recovery()
>  [Jonathan] Flatten cxl_setup_parent_uport()
>  [Jonathan] Use cxl_component_regs instead of struct cxl_regs regs
>  [Jonathan] Rename cxl_dev_is_pci_type()
>  [Ming] bus_find_device(&cxl_bus_type, NULL, &pdev->dev, match_uport) can
>  replace these find_cxl_port() and device_find_child().
>  [Jonathan] Compact call to cxl_port_map_regs() in cxl_setup_parent_uport()
>  [Ming] Dont use endpoint as host to cxl_map_component_regs()
>  [Bjorn] Use "PCIe UIR/CIE" instesad of "AER UI/CIE"
>  [Bjorn] Dont use Kconfig to enable/disable a CXL external interface
>
> Terry Bowman (14):
>   PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct
>     pci_driver'
>   PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe port
>     support
>   cxl/pci: Introduce helper functions pcie_is_cxl() and
>     pcie_is_cxl_port()
>   PCI/AER: Modify AER driver logging to report CXL or PCIe bus error
>     type
>   PCI/AER: Add CXL PCIe port correctable error support in AER service
>     driver
>   PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe
>     port devices
>   PCI/AER: Add CXL PCIe port uncorrectable error recovery in AER service
>     driver
>   cxl/pci: Change find_cxl_ports() to non-static
>   cxl/pci: Map CXL PCIe root port and downstream switch port RAS
>     registers
>   cxl/pci: Map CXL PCIe upstream switch port RAS registers
>   cxl/pci: Rename RAS handler interfaces to also indicate CXL PCIe port
>     support
>   cxl/pci: Add error handler for CXL PCIe port RAS errors
>   cxl/pci: Add trace logging for CXL PCIe port RAS errors
>   cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers
>
>  drivers/cxl/core/core.h       |   3 +
>  drivers/cxl/core/pci.c        | 180 +++++++++++++++++++++++++++-------
>  drivers/cxl/core/port.c       |   4 +-
>  drivers/cxl/core/trace.h      |  47 +++++++++
>  drivers/cxl/cxl.h             |  10 +-
>  drivers/cxl/mem.c             |  29 +++++-
>  drivers/pci/pci.c             |  14 +++
>  drivers/pci/pci.h             |   3 +
>  drivers/pci/pcie/aer.c        |  99 ++++++++++++-------
>  drivers/pci/pcie/err.c        |  54 ++++++++++
>  drivers/pci/probe.c           |  10 ++
>  include/linux/pci.h           |  13 +++
>  include/ras/ras_event.h       |   9 +-
>  include/uapi/linux/pci_regs.h |   3 +-
>  14 files changed, 396 insertions(+), 82 deletions(-)
>
>
> base-commit: 739a5da7ed744578a9477fb322f04afecafca6b0

Bowman, Terry Oct. 28, 2024, 1:05 a.m. UTC | #2

Correction. This applies to the following base commit:

8cf0b93919e1 (tag: v6.12-rc2) Linux 6.12-rc2


On 10/25/2024 4:02 PM, Terry Bowman wrote:
> This is a continuation of the CXL port error handling RFC from earlier.[1]
> The RFC resulted in the decision to add CXL PCIe port error handling to
> the existing RCH downstream port handling in the AER service driver. This
> patchset adds the CXL PCIe port protocol error handling and logging.
>
> The first 7 patches update the existing AER service driver to support CXL
> PCIe port protocol error handling and reporting. This includes AER service
> driver changes for adding correctable and uncorrectable error support, CXL
> specific recovery handling, and addition of CXL driver callback handlers.
>
> The following 7 patches address CXL driver support for CXL PCIe port
> protocol errors. This includes the following changes to the CXL drivers:
> mapping CXL port and downstream port RAS registers, interface updates for
> common restricted CXL host mode (RCH) and virtual hierarchy mode (VH),
> adding port specific error handlers, and protocol error logging.
>
> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@amd.com/
>
> Testing:
>
> Below are test results for this patchset using Qemu with CXL root
> port(0c:00.0), CXL upstream switchport(0d:00.0), CXL downstream
> switchport(0e:00.0). A CXL endpoint(0f:00.0) CE and UCE logs are
> also added to show the existing PCIe endpoint handling is not changed.
>
> This was tested using aer-inject updated to support CE and UCE internal
> error injection. CXL RAS was set using a test patch (not upstreamed but can
> provide if needed).
>
>  - Root port UCE:
>  root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
>  pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
>  pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>  pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/02000000
>  pcieport 0000:0c:00.0:    [22] UncorrIntErr
>  aer_event: 0000:0c:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
>  cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
>  Kernel panic - not syncing: CXL cachemem error. Invoking panic
>  CPU: 1 UID: 0 PID: 146 Comm: irq/24-aerdrv Tainted: G            E      6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
>  Tainted: [E]=UNSIGNED_MODULE
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>  Call Trace:
>   <TASK>
>   dump_stack_lvl+0x27/0x90
>   dump_stack+0x10/0x20
>   panic+0x33e/0x380
>   cxl_do_recovery+0x116/0x120
>   ? srso_return_thunk+0x5/0x5f
>   aer_isr+0x3e0/0x710
>   irq_thread_fn+0x28/0x70
>   irq_thread+0x179/0x240
>   ? srso_return_thunk+0x5/0x5f
>   ? __pfx_irq_thread_fn+0x10/0x10
>   ? __pfx_irq_thread_dtor+0x10/0x10
>   ? __pfx_irq_thread+0x10/0x10
>   kthread+0xf5/0x130
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork+0x3c/0x60
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork_asm+0x1a/0x30
>   </TASK>
>  Kernel Offset: 0x29000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>  ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>
>  - Root port CE:
>  root@tbowman-cxl:~/aer-inject# ./root-c[  191.866259] systemd-journald[482]: Sent WATCHDOG=1 notification.
>  e-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
>  pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
>  pcieport 0000:0c:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>  pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00004000/0000a000
>  pcieport 0000:0c:00.0:    [14] CorrIntErr
>  aer_event: 0000:0c:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
>  cxl_port_aer_correctable_error: device=0000:0c:00.0 host=pci0000:0c status='Received Error From Physical Layer'
>
>  - Upstream switch port UCE:
>  root@tbowman-cxl:~/aer-inject# ./us-uce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0d:00.0
>  pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0d:00.0
>  pcieport 0000:0d:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>  pcieport 0000:0d:00.0:   device [19e5:a128] error status/mask=00400000/02000000
>  pcieport 0000:0d:00.0:    [22] UncorrIntErr
>  aer_event: 0000:0d:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
>  cxl_port_aer_uncorrectable_error: device=0000:0d:00.0 host=0000:0c:00.0 status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
>  Kernel panic - not syncing: CXL cachemem error. Invoking panic
>  CPU: 1 UID: 0 PID: 148 Comm: irq/24-aerdrv Tainted: G            E      6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
>  Tainted: [E]=UNSIGNED_MODULE
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>  Call Trace:
>   <TASK>
>   dump_stack_lvl+0x27/0x90
>   dump_stack+0x10/0x20
>   panic+0x33e/0x380
>   cxl_do_recovery+0x116/0x120
>   ? srso_return_thunk+0x5/0x5f
>   aer_isr+0x3e0/0x710
>   ? free_cpumask_var+0x9/0x10
>   ? kfree+0x259/0x2e0
>   irq_thread_fn+0x28/0x70
>   irq_thread+0x179/0x240
>   ? srso_return_thunk+0x5/0x5f
>   ? __pfx_irq_thread_fn+0x10/0x10
>   ? __pfx_irq_thread_dtor+0x10/0x10
>   ? __pfx_irq_thread+0x10/0x10
>   kthread+0xf5/0x130
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork+0x3c/0x60
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork_asm+0x1a/0x30
>   </TASK>
>  Kernel Offset: 0x24c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>  ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>
>  - Upstream switch port CE:
>  root@tbowman-cxl:~/aer-inject# ./us-ce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0d:00.0
>  pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0d:00.0
>  pcieport 0000:0d:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>  pcieport 0000:0d:00.0:   device [19e5:a128] error status/mask=00004000/0000a000
>  pcieport 0000:0d:00.0:    [14] CorrIntErr
>  aer_event: 0000:0d:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
>  cxl_port_aer_correctable_error: device=0000:0d:00.0 host=0000:0c:00.0 status='Received Error From Physical Layer'
>
>  - Downstream switch port UCE:
>  root@tbowman-cxl:~/aer-inject# ./ds-uce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0e:00.0
>  pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:00.0
>  pcieport 0000:0e:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>  pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00400000/02000000
>  pcieport 0000:0e:00.0:    [22] UncorrIntErr
>  aer_event: 0000:0e:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
>  cxl_port_aer_uncorrectable_error: device=0000:0e:00.0 host=0000:0d:00.0 status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
>  Kernel panic - not syncing: CXL cachemem error. Invoking panic
>  CPU: 1 UID: 0 PID: 147 Comm: irq/24-aerdrv Tainted: G            E      6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
>  Tainted: [E]=UNSIGNED_MODULE
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>  Call Trace:
>   <TASK>
>   dump_stack_lvl+0x27/0x90
>   dump_stack+0x10/0x20
>   panic+0x33e/0x380
>   cxl_do_recovery+0x116/0x120
>   ? srso_return_thunk+0x5/0x5f
>   aer_isr+0x3e0/0x710
>   irq_thread_fn+0x28/0x70
>   irq_thread+0x179/0x240
>   ? srso_return_thunk+0x5/0x5f
>   ? __pfx_irq_thread_fn+0x10/0x10
>   ? __pfx_irq_thread_dtor+0x10/0x10
>   ? __pfx_irq_thread+0x10/0x10
>   kthread+0xf5/0x130
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork+0x3c/0x60
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork_asm+0x1a/0x30
>   </TASK>
>  Kernel Offset: 0x19c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>  ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>
>  - Downstream switch port CE:
>  root@tbowman-cxl:~/aer-inject# ./ds-ce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
>  pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
>  pcieport 0000:0e:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>  pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00004000/0000a000
>  pcieport 0000:0e:00.0:    [14] CorrIntErr
>  aer_event: 0000:0e:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
>  cxl_port_aer_correctable_error: device=0000:0e:00.0 host=0000:0d:00.0 status='Received Error From Physical Layer'
>
>  - Endpoint CE
>  root@tbowman-cxl:~/aer-inject# ./ep-ce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000040/00000000 into device 0000:0f:00.0
>  pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0f:00.0
>  cxl_pci 0000:0f:00.0: CXL Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
>  cxl_pci 0000:0f:00.0:   device [8086:0d93] error status/mask=00000040/0000e000
>  cxl_pci 0000:0f:00.0:    [ 6] BadTLP
>  aer_event: 0000:0f:00.0 CXL Bus Error: severity=Corrected, Bad TLP, TLP Header=Not available
>  cxl_aer_correctable_error: memdev=mem1 host=0000:0f:00.0 serial=0: status: 'Received Error From Physical Layer'
>
>  - Endpoint UCE
>  root@tbowman-cxl:~/aer-inject# ./ep-uce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00040000 into device 0000:0f:00.0
>  pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0f:00.0
>  cxl_pci 0000:0f:00.0: AER: CXL Bus Error: severity=Uncorrectable (Fatal), type=Inaccessible, (Unregistered Agent ID)
>  aer_event: 0000:0f:00.0 CXL Bus Error: severity=Fatal, , TLP Header=Not available
>  cxl_aer_uncorrectable_error: memdev=mem1 host=0000:0f:00.0 serial=0: status: 'Memory Byte Enable Parity Error' firs'
>  cxl_pci 0000:0f:00.0: mem1: frozen state error detected, disable CXL.mem
>  cxl_detach_ep: cxl_mem mem1: disconnect mem1 from port2
>  cxl_detach_ep: cxl_mem mem1: disconnect mem1 from port1
>  pcieport 0000:0e:00.0: unlocked secondary bus reset via: pciehp_reset_slot+0xac/0x160
>  pcieport 0000:0e:00.0: AER: Downstream Port link has been reset (0)
>  cxl_pci 0000:0f:00.0: mem1: restart CXL.mem after slot reset
>  devm_cxl_enumerate_ports: cxl_mem mem1: scan: iter: mem1 dport_dev: 0000:0e:00.0 parent: 0000:0d:00.0
>  devm_cxl_enumerate_ports: cxl_mem mem1: found already registered port port2:0000:0d:00.0
>  devm_cxl_enumerate_ports: cxl_mem mem1: scan: iter: 0000:0e:00.0 dport_dev: 0000:0c:00.0 parent: pci0000:0c
>  devm_cxl_enumerate_ports: cxl_mem mem1: found already registered port port1:pci0000:0c
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4500
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4500
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  <snip>
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  cxl_bus_probe: cxl_nvdimm pmem1: probe: 0
>  devm_cxl_add_nvdimm: cxl_mem mem1: register pmem1
>  pcieport 0000:0e:00.0: RAS is already mapped
>  cxl_port port2: RAS is already mapped
>  pcieport 0000:0c:00.0: RAS is already mapped
>  cxl_port_alloc: cxl_mem mem1: host-bridge: pci0000:0c
>  cxl_cdat_get_length: cxl_port endpoint4: CDAT length 160
>  cxl_port_perf_data_calculate: cxl_port endpoint4: Failed to retrieve ep perf coordinates.
>  cxl_endpoint_parse_cdat: cxl_port endpoint4: Failed to do perf coord calculations.
>  init_hdm_decoder: cxl_port endpoint4: decoder4.0: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
>  add_hdm_decoder: cxl decoder4.0: Added to port endpoint4
>  init_hdm_decoder: cxl_port endpoint4: decoder4.1: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
>  add_hdm_decoder: cxl decoder4.1: Added to port endpoint4
>  init_hdm_decoder: cxl_port endpoint4: decoder4.2: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
>  add_hdm_decoder: cxl decoder4.2: Added to port endpoint4
>  init_hdm_decoder: cxl_port endpoint4: decoder4.3: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
>  add_hdm_decoder: cxl decoder4.3: Added to port endpoint4
>  cxl_bus_probe: cxl_port endpoint4: probe: 0
>  devm_cxl_add_port: cxl_mem mem1: endpoint4 added to port2
>  cxl_bus_probe: cxl_mem mem1: probe: 0
>  cxl_pci 0000:0f:00.0: mem1: error resume successful
>  pcieport 0000:0e:00.0: AER: device recovery successful
>
>  Changes in v1 -> v2
>  [Jonathan] Remove extra NULL check and cleanup in cxl_pci_port_ras()
>  [Jonathan] Update description to DSP map patch description
>  [Jonathan] Update cxl_pci_port_ras() to check for NULL port
>  [Jonathan] Dont call handler before handler port changes are present (patch order).
>  [Bjorn] Fix linebreak in cover sheet URL
>  [Bjorn] Remove timestamps from test logs in cover sheet
>  [Bjorn] Retitle AER commits to use "PCI/AER:"
>  [Bjorn] Retitle patch#3 to use renaming instead of refactoring
>  [Bjorn] Fixe base commit-id on cover sheet
>  [Bjorn] Add VH spec reference/citation
>  [Terry] Removed last 2 patches to enable internal errors. Is not needed
>  because internal errors are enabled in AER driver.
>  [Dan] Create cxl_do_recovery() and pci_driver::cxl_err_handlers.
>  [Dan] Use kernel panic in CXL recovery
>  [Dan] cxl_port_hndlrs -> cxl_port_error_handlers
>  [Dan] Move cxl_port_error_handlers to pci_driver. Remove module (un)registration.
>  [Terry] Add patch w/ qcxl_assign_port_error_handlers() and cxl_clear_port_error_handlers()
>  [Terry] Removed PCI_ERS_RESULT_PANIC patch. Is no longer needed because the result type parameter
>  is not used in the CXL_err_handlers callabcks.
>
> Changes in RFC -> v1:
>  [Dan] Rename cxl_rch_handle_error() becomes cxl_handle_error()
>  [Dan] Add cxl_do_recovery()
>  [Jonathan] Flatten cxl_setup_parent_uport()
>  [Jonathan] Use cxl_component_regs instead of struct cxl_regs regs
>  [Jonathan] Rename cxl_dev_is_pci_type()
>  [Ming] bus_find_device(&cxl_bus_type, NULL, &pdev->dev, match_uport) can
>  replace these find_cxl_port() and device_find_child().
>  [Jonathan] Compact call to cxl_port_map_regs() in cxl_setup_parent_uport()
>  [Ming] Dont use endpoint as host to cxl_map_component_regs()
>  [Bjorn] Use "PCIe UIR/CIE" instesad of "AER UI/CIE"
>  [Bjorn] Dont use Kconfig to enable/disable a CXL external interface
>
> Terry Bowman (14):
>   PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct
>     pci_driver'
>   PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe port
>     support
>   cxl/pci: Introduce helper functions pcie_is_cxl() and
>     pcie_is_cxl_port()
>   PCI/AER: Modify AER driver logging to report CXL or PCIe bus error
>     type
>   PCI/AER: Add CXL PCIe port correctable error support in AER service
>     driver
>   PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe
>     port devices
>   PCI/AER: Add CXL PCIe port uncorrectable error recovery in AER service
>     driver
>   cxl/pci: Change find_cxl_ports() to non-static
>   cxl/pci: Map CXL PCIe root port and downstream switch port RAS
>     registers
>   cxl/pci: Map CXL PCIe upstream switch port RAS registers
>   cxl/pci: Rename RAS handler interfaces to also indicate CXL PCIe port
>     support
>   cxl/pci: Add error handler for CXL PCIe port RAS errors
>   cxl/pci: Add trace logging for CXL PCIe port RAS errors
>   cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers
>
>  drivers/cxl/core/core.h       |   3 +
>  drivers/cxl/core/pci.c        | 180 +++++++++++++++++++++++++++-------
>  drivers/cxl/core/port.c       |   4 +-
>  drivers/cxl/core/trace.h      |  47 +++++++++
>  drivers/cxl/cxl.h             |  10 +-
>  drivers/cxl/mem.c             |  29 +++++-
>  drivers/pci/pci.c             |  14 +++
>  drivers/pci/pci.h             |   3 +
>  drivers/pci/pcie/aer.c        |  99 ++++++++++++-------
>  drivers/pci/pcie/err.c        |  54 ++++++++++
>  drivers/pci/probe.c           |  10 ++
>  include/linux/pci.h           |  13 +++
>  include/ras/ras_event.h       |   9 +-
>  include/uapi/linux/pci_regs.h |   3 +-
>  14 files changed, 396 insertions(+), 82 deletions(-)
>
>
> base-commit: 739a5da7ed744578a9477fb322f04afecafca6b0

Fan Ni Nov. 1, 2024, 6 p.m. UTC | #3

On Fri, Oct 25, 2024 at 04:02:51PM -0500, Terry Bowman wrote:
> This is a continuation of the CXL port error handling RFC from earlier.[1]
> The RFC resulted in the decision to add CXL PCIe port error handling to
> the existing RCH downstream port handling in the AER service driver. This
> patchset adds the CXL PCIe port protocol error handling and logging.
> 
> The first 7 patches update the existing AER service driver to support CXL
> PCIe port protocol error handling and reporting. This includes AER service
> driver changes for adding correctable and uncorrectable error support, CXL
> specific recovery handling, and addition of CXL driver callback handlers.
> 
> The following 7 patches address CXL driver support for CXL PCIe port
> protocol errors. This includes the following changes to the CXL drivers:
> mapping CXL port and downstream port RAS registers, interface updates for
> common restricted CXL host mode (RCH) and virtual hierarchy mode (VH),
> adding port specific error handlers, and protocol error logging.
> 
> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@amd.com/
> 
> Testing:

Hi Terry,
I tried to test the patchset with aer_inject tool (with the patch you shared
in the last version), and hit some issues.
Could you help check and give some insights? Thanks.

Below are some test setup info and results.

I tested two topology,
  a. one memdev directly attaced to a HB with only one RP;
  b. a topology with cxl switch:
         HB
        /  \
      RP0   RP1
       |
     switch
       |
 ----------------
 |    |    |    |
mem0 mem1 mem2 mem3

For both topologies, I cannot reproduce the system panic shown in your cover
letter.  

btw, I tried both compile cxl as modules and in the kernel.

Below, I will use the direct-attached topology (a) as an example to show what I
tried, hope can get some clarity about the test and what I missed or did wrong.

-------------------------------------
pci device info on the test VM 
root@fan:~# lspci
00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
00:02.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
00:03.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
00:04.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
00:05.0 Host bridge: Red Hat, Inc. QEMU PCIe Expander bridge
00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
0c:00.0 PCI bridge: Intel Corporation Device 7075
0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)
root@fan:~# 
-------------------------------------

The aer injection input file looks like below,

-------------------------------------
fan:~/cxl/cxl-test-tool$ cat /tmp/internal 
AER
PCI_ID 0000:0c:00.0
UNCOR_STATUS INTERNAL
HEADER_LOG 0 1 2 3
------------------------------------

dmesg after aer injection 

ssh root@localhost -p 2024 "dmesg"
[  613.195352] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
[  613.195830] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
[  613.196253] pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
[  613.198199] pcieport 0000:0c:00.0: AER: No uncorrectable error found. Continuing.
--------------------------------------

The problem seems to be related to the cxl_error_handler not been assigned for
cxlmem device. 

in
cxl_do_recover() {
...
    327     cxl_walk_bridge(bridge, cxl_report_error_detected, &status);                         
    328     if (status)                                                                 
    329         panic("CXL cachemem error. Invoking panic");                   
...
}
The status returned is false, so no panic().

I tried to add some dev_dbg info to the code to debug.
Below are the debug info and kernel code changes for debugging. 
--------------------------------------
fan:~/cxl/cxl-test-tool$ cxl-tool.py --cmd dmesg | grep XXX
[    1.738909] cxl_mem:cxl_mem_probe:205: cxl_mem mem0: XXX: add endpoint
[    1.739188] cxl_mem:devm_cxl_add_endpoint:85: cxl_port port1: XXX: add endpoint
[    1.739509] cxl_mem:devm_cxl_add_endpoint:92: cxl_mem mem0: XXX: init ep port aer
[    1.739876] cxl_core:cxl_dport_init_ras_reporting:907: pcieport 0000:0c:00.0: XXX: assign port error handlers for dport 1
[    1.740338] cxl_core:cxl_dport_init_ras_reporting:913: pcieport 0000:0c:00.0: XXX: assign port error handlers for dport 2
[    1.740812] cxl_core:cxl_dport_init_ras_reporting:927: pcieport 0000:0c:00.0: XXX: assign port error handlers for dport 3
[    1.741273] cxl_core:cxl_assign_port_error_handlers:851: pcieport 0000:0c:00.0: XXX: cxl_err_handler: (____ptrval____)
[    1.741812] cxl_core:cxl_assign_port_error_handlers:855: pcieport 0000:0c:00.0: XXX: cxl_err_handler: (____ptrval____)
[    1.742263] cxl_core:cxl_assign_port_error_handlers:857: pcieport 0000:0c:00.0: XXX: cxl_err_handler: (____ptrval____) (____ptrval____)
fan:~/cxl/cxl-test-tool$ 
--------------------------------------

dmesg after error injection:
--------------------------------------
ssh root@localhost -p 2024 "dmesg"
[  228.544439] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
[  228.544977] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
[  228.545381] pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
[  228.545879] pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/00000000
[  228.546360] pcieport 0000:0c:00.0:    [22] UncorrIntErr          
[  228.546698] pcieport 0000:0c:00.0: AER: XXX: call cxl_err_handler: 00000000a268bfcb 000000009e0da039
[  228.547103] cxl_pci 0000:0d:00.0: AER: XXX: call cxl_err_handler: 00000000b9f08b93 0000000000000000
[  228.547515] pcieport 0000:0c:00.0: AER: No uncorrectable error found. Continuing.
fan:~/cxl/cxl-test-tool$ 
--------------------------------------


Kernel changes:
--------------------------------------
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 5f7570c6173c..bcecd1283fc6 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -848,10 +848,13 @@ static void cxl_assign_port_error_handlers(struct pci_dev *pdev)
 {
 	struct pci_driver *pdrv = pdev->driver;
 
+    dev_dbg(&pdev->dev, "XXX: cxl_err_handler: %p\n enter", pdev);
 	if (!pdrv)
 		return;
 
+    dev_dbg(&pdev->dev, "XXX: cxl_err_handler: %p\n", pdrv);
 	pdrv->cxl_err_handler = &cxl_port_error_handlers;
+    dev_dbg(&pdev->dev, "XXX: cxl_err_handler: %p %p\n", pdrv, pdrv->cxl_err_handler);
 }
 
 static void cxl_clear_port_error_handlers(void *data)
@@ -869,12 +872,14 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
 {
 	struct pci_dev *pdev = to_pci_dev(port->uport_dev);
 
+    dev_dbg(&port->dev, "XXX: assign port error handlers for uport 1\n");
 	/* uport may have more than 1 downstream EP. Check if already mapped. */
 	if (port->uport_regs.ras) {
 		dev_warn(&port->dev, "RAS is already mapped\n");
 		return;
 	}
 
+    dev_dbg(&port->dev, "XXX: assign port error handlers for uport 2\n");
 	port->reg_map.host = &port->dev;
 	if (cxl_map_component_regs(&port->reg_map, &port->uport_regs,
 				   BIT(CXL_CM_CAP_CAP_ID_RAS))) {
@@ -882,6 +887,7 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
 		return;
 	}
 
+    dev_dbg(&port->dev, "XXX: assign port error handlers for uport 3\n");
 	cxl_assign_port_error_handlers(pdev);
 	devm_add_action_or_reset(port->uport_dev, cxl_clear_port_error_handlers, pdev);
 }
@@ -898,11 +904,13 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
 	struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport_dev);
 	struct pci_dev *pdev = to_pci_dev(dport_dev);
 
+    dev_dbg(dport_dev, "XXX: assign port error handlers for dport 1\n");
 	if (dport->rch && host_bridge->native_aer) {
 		cxl_dport_map_rch_aer(dport);
 		cxl_disable_rch_root_ints(dport);
 	}
 
+    dev_dbg(dport_dev, "XXX: assign port error handlers for dport 2\n");
 	/* dport may have more than 1 downstream EP. Check if already mapped. */
 	if (dport->regs.ras) {
 		dev_warn(dport_dev, "RAS is already mapped\n");
@@ -916,6 +924,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
 		return;
 	}
 
+    dev_dbg(dport_dev, "XXX: assign port error handlers for dport 3\n");
 	cxl_assign_port_error_handlers(pdev);
 	devm_add_action_or_reset(dport_dev, cxl_clear_port_error_handlers, pdev);
 }
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index 067fd6389562..aa824584f8dd 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -82,13 +82,15 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
 	 * Now that the path to the root is established record all the
 	 * intervening ports in the chain.
 	 */
+    dev_dbg(host, "XXX: add endpoint\n");
 	for (iter = parent_port, down = NULL; !is_cxl_root(iter);
 	     down = iter, iter = to_cxl_port(iter->dev.parent)) {
 		struct cxl_ep *ep;
 
 		ep = cxl_ep_load(iter, cxlmd);
 		ep->next = down;
-		cxl_init_ep_ports_aer(ep);
+        dev_dbg(ep->ep, "XXX: init ep port aer\n");
+        cxl_init_ep_ports_aer(ep);
 	}
 
 	/* Note: endpoint port component registers are derived from @cxlds */
@@ -200,6 +202,7 @@ static int cxl_mem_probe(struct device *dev)
 			return -ENXIO;
 		}
 
+        dev_dbg(dev, "XXX: add endpoint\n");
 		rc = devm_cxl_add_endpoint(endpoint_parent, cxlmd, dport);
 		if (rc)
 			return rc;
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 3785f4ca5103..8285f14994e8 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -294,6 +294,11 @@ static int cxl_report_error_detected(struct pci_dev *dev, void *data)
 	bool *status = data;
 
 	device_lock(&dev->dev);
+    if (pdrv) {
+        dev_dbg(&dev->dev, "XXX: call cxl_err_handler: %p %p\n", pdrv, pdrv->cxl_err_handler);
+    } else {
+        dev_dbg(&dev->dev, "XXX: call cxl_err_handler: %p no handler\n", pdrv);
+    }
 	if (pdrv && pdrv->cxl_err_handler &&
 	    pdrv->cxl_err_handler->error_detected) {
 		const struct cxl_error_handlers *cxl_err_handler =
--------------------------------------

Fan
> 
> Below are test results for this patchset using Qemu with CXL root
> port(0c:00.0), CXL upstream switchport(0d:00.0), CXL downstream
> switchport(0e:00.0). A CXL endpoint(0f:00.0) CE and UCE logs are
> also added to show the existing PCIe endpoint handling is not changed.
> 
> This was tested using aer-inject updated to support CE and UCE internal
> error injection. CXL RAS was set using a test patch (not upstreamed but can
> provide if needed).
> 
>  - Root port UCE:
>  root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
>  pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
>  pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>  pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/02000000
>  pcieport 0000:0c:00.0:    [22] UncorrIntErr
>  aer_event: 0000:0c:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
>  cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
>  Kernel panic - not syncing: CXL cachemem error. Invoking panic
>  CPU: 1 UID: 0 PID: 146 Comm: irq/24-aerdrv Tainted: G            E      6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
>  Tainted: [E]=UNSIGNED_MODULE
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>  Call Trace:
>   <TASK>
>   dump_stack_lvl+0x27/0x90
>   dump_stack+0x10/0x20
>   panic+0x33e/0x380
>   cxl_do_recovery+0x116/0x120
>   ? srso_return_thunk+0x5/0x5f
>   aer_isr+0x3e0/0x710
>   irq_thread_fn+0x28/0x70
>   irq_thread+0x179/0x240
>   ? srso_return_thunk+0x5/0x5f
>   ? __pfx_irq_thread_fn+0x10/0x10
>   ? __pfx_irq_thread_dtor+0x10/0x10
>   ? __pfx_irq_thread+0x10/0x10
>   kthread+0xf5/0x130
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork+0x3c/0x60
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork_asm+0x1a/0x30
>   </TASK>
>  Kernel Offset: 0x29000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>  ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
...

Bowman, Terry Nov. 1, 2024, 6:28 p.m. UTC | #4

Hi Fan,

I added comments below.

On 11/1/2024 1:00 PM, Fan Ni wrote:
> On Fri, Oct 25, 2024 at 04:02:51PM -0500, Terry Bowman wrote:
>> This is a continuation of the CXL port error handling RFC from earlier.[1]
>> The RFC resulted in the decision to add CXL PCIe port error handling to
>> the existing RCH downstream port handling in the AER service driver. This
>> patchset adds the CXL PCIe port protocol error handling and logging.
>>
>> The first 7 patches update the existing AER service driver to support CXL
>> PCIe port protocol error handling and reporting. This includes AER service
>> driver changes for adding correctable and uncorrectable error support, CXL
>> specific recovery handling, and addition of CXL driver callback handlers.
>>
>> The following 7 patches address CXL driver support for CXL PCIe port
>> protocol errors. This includes the following changes to the CXL drivers:
>> mapping CXL port and downstream port RAS registers, interface updates for
>> common restricted CXL host mode (RCH) and virtual hierarchy mode (VH),
>> adding port specific error handlers, and protocol error logging.
>>
>> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@amd.com/
>>
>> Testing:
> Hi Terry,
> I tried to test the patchset with aer_inject tool (with the patch you shared
> in the last version), and hit some issues.
> Could you help check and give some insights? Thanks.
>
> Below are some test setup info and results.
>
> I tested two topology,
>   a. one memdev directly attaced to a HB with only one RP;
>   b. a topology with cxl switch:
>          HB
>         /  \
>       RP0   RP1
>        |
>      switch
>        |
>  ----------------
>  |    |    |    |
> mem0 mem1 mem2 mem3
>
> For both topologies, I cannot reproduce the system panic shown in your cover
> letter.  
>
> btw, I tried both compile cxl as modules and in the kernel.
>
> Below, I will use the direct-attached topology (a) as an example to show what I
> tried, hope can get some clarity about the test and what I missed or did wrong.
>
> -------------------------------------
> pci device info on the test VM 
> root@fan:~# lspci
> 00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
> 00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
> 00:02.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
> 00:03.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
> 00:04.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
> 00:05.0 Host bridge: Red Hat, Inc. QEMU PCIe Expander bridge
> 00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
> 00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02)
> 00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
> 0c:00.0 PCI bridge: Intel Corporation Device 7075
> 0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)
> root@fan:~# 
> -------------------------------------
>
> The aer injection input file looks like below,
>
> -------------------------------------
> fan:~/cxl/cxl-test-tool$ cat /tmp/internal 
> AER
> PCI_ID 0000:0c:00.0
> UNCOR_STATUS INTERNAL
> HEADER_LOG 0 1 2 3
> ------------------------------------
>
> dmesg after aer injection 
>
> ssh root@localhost -p 2024 "dmesg"
> [  613.195352] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
> [  613.195830] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
> [  613.196253] pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> [  613.198199] pcieport 0000:0c:00.0: AER: No uncorrectable error found. Continuing.
> -----------------------------------

This is likely because the device's CXL RAS status is not set and as a result returns false and bypasses the panic.
Unfortunately, the aer-inject only sets the AER status and triggers the interrupt. The CXL RAS is not set.

I attached 2 'test' patches. The first patch sets the device's RAS status to simulate the error reporting.
This will have to be adjusted as the patch looks for a specific device's bus and this will likely be a different
bus then the device's you test in your setup.

The 2nd patch enables UIE/CIE. I moved this out of the v2 patchset. I need to revisit this to see if it is
needed in the patchset itself (not just a test patch).

Regards,
Terry

>
> The problem seems to be related to the cxl_error_handler not been assigned for
> cxlmem device. 
>
> in
> cxl_do_recover() {
> ...
>     327     cxl_walk_bridge(bridge, cxl_report_error_detected, &status);                         
>     328     if (status)                                                                 
>     329         panic("CXL cachemem error. Invoking panic");                   
> ...
> }
> The status returned is false, so no panic().
>
> I tried to add some dev_dbg info to the code to debug.
> Below are the debug info and kernel code changes for debugging. 
> --------------------------------------
> fan:~/cxl/cxl-test-tool$ cxl-tool.py --cmd dmesg | grep XXX
> [    1.738909] cxl_mem:cxl_mem_probe:205: cxl_mem mem0: XXX: add endpoint
> [    1.739188] cxl_mem:devm_cxl_add_endpoint:85: cxl_port port1: XXX: add endpoint
> [    1.739509] cxl_mem:devm_cxl_add_endpoint:92: cxl_mem mem0: XXX: init ep port aer
> [    1.739876] cxl_core:cxl_dport_init_ras_reporting:907: pcieport 0000:0c:00.0: XXX: assign port error handlers for dport 1
> [    1.740338] cxl_core:cxl_dport_init_ras_reporting:913: pcieport 0000:0c:00.0: XXX: assign port error handlers for dport 2
> [    1.740812] cxl_core:cxl_dport_init_ras_reporting:927: pcieport 0000:0c:00.0: XXX: assign port error handlers for dport 3
> [    1.741273] cxl_core:cxl_assign_port_error_handlers:851: pcieport 0000:0c:00.0: XXX: cxl_err_handler: (____ptrval____)
> [    1.741812] cxl_core:cxl_assign_port_error_handlers:855: pcieport 0000:0c:00.0: XXX: cxl_err_handler: (____ptrval____)
> [    1.742263] cxl_core:cxl_assign_port_error_handlers:857: pcieport 0000:0c:00.0: XXX: cxl_err_handler: (____ptrval____) (____ptrval____)
> fan:~/cxl/cxl-test-tool$ 
> --------------------------------------
>
> dmesg after error injection:
> --------------------------------------
> ssh root@localhost -p 2024 "dmesg"
> [  228.544439] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
> [  228.544977] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
> [  228.545381] pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> [  228.545879] pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/00000000
> [  228.546360] pcieport 0000:0c:00.0:    [22] UncorrIntErr          
> [  228.546698] pcieport 0000:0c:00.0: AER: XXX: call cxl_err_handler: 00000000a268bfcb 000000009e0da039
> [  228.547103] cxl_pci 0000:0d:00.0: AER: XXX: call cxl_err_handler: 00000000b9f08b93 0000000000000000
> [  228.547515] pcieport 0000:0c:00.0: AER: No uncorrectable error found. Continuing.
> fan:~/cxl/cxl-test-tool$ 
> --------------------------------------
>
>
> Kernel changes:
> --------------------------------------
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 5f7570c6173c..bcecd1283fc6 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -848,10 +848,13 @@ static void cxl_assign_port_error_handlers(struct pci_dev *pdev)
>  {
>  	struct pci_driver *pdrv = pdev->driver;
>  
> +    dev_dbg(&pdev->dev, "XXX: cxl_err_handler: %p\n enter", pdev);
>  	if (!pdrv)
>  		return;
>  
> +    dev_dbg(&pdev->dev, "XXX: cxl_err_handler: %p\n", pdrv);
>  	pdrv->cxl_err_handler = &cxl_port_error_handlers;
> +    dev_dbg(&pdev->dev, "XXX: cxl_err_handler: %p %p\n", pdrv, pdrv->cxl_err_handler);
>  }
>  
>  static void cxl_clear_port_error_handlers(void *data)
> @@ -869,12 +872,14 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
>  {
>  	struct pci_dev *pdev = to_pci_dev(port->uport_dev);
>  
> +    dev_dbg(&port->dev, "XXX: assign port error handlers for uport 1\n");
>  	/* uport may have more than 1 downstream EP. Check if already mapped. */
>  	if (port->uport_regs.ras) {
>  		dev_warn(&port->dev, "RAS is already mapped\n");
>  		return;
>  	}
>  
> +    dev_dbg(&port->dev, "XXX: assign port error handlers for uport 2\n");
>  	port->reg_map.host = &port->dev;
>  	if (cxl_map_component_regs(&port->reg_map, &port->uport_regs,
>  				   BIT(CXL_CM_CAP_CAP_ID_RAS))) {
> @@ -882,6 +887,7 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
>  		return;
>  	}
>  
> +    dev_dbg(&port->dev, "XXX: assign port error handlers for uport 3\n");
>  	cxl_assign_port_error_handlers(pdev);
>  	devm_add_action_or_reset(port->uport_dev, cxl_clear_port_error_handlers, pdev);
>  }
> @@ -898,11 +904,13 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
>  	struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport_dev);
>  	struct pci_dev *pdev = to_pci_dev(dport_dev);
>  
> +    dev_dbg(dport_dev, "XXX: assign port error handlers for dport 1\n");
>  	if (dport->rch && host_bridge->native_aer) {
>  		cxl_dport_map_rch_aer(dport);
>  		cxl_disable_rch_root_ints(dport);
>  	}
>  
> +    dev_dbg(dport_dev, "XXX: assign port error handlers for dport 2\n");
>  	/* dport may have more than 1 downstream EP. Check if already mapped. */
>  	if (dport->regs.ras) {
>  		dev_warn(dport_dev, "RAS is already mapped\n");
> @@ -916,6 +924,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
>  		return;
>  	}
>  
> +    dev_dbg(dport_dev, "XXX: assign port error handlers for dport 3\n");
>  	cxl_assign_port_error_handlers(pdev);
>  	devm_add_action_or_reset(dport_dev, cxl_clear_port_error_handlers, pdev);
>  }
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index 067fd6389562..aa824584f8dd 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -82,13 +82,15 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
>  	 * Now that the path to the root is established record all the
>  	 * intervening ports in the chain.
>  	 */
> +    dev_dbg(host, "XXX: add endpoint\n");
>  	for (iter = parent_port, down = NULL; !is_cxl_root(iter);
>  	     down = iter, iter = to_cxl_port(iter->dev.parent)) {
>  		struct cxl_ep *ep;
>  
>  		ep = cxl_ep_load(iter, cxlmd);
>  		ep->next = down;
> -		cxl_init_ep_ports_aer(ep);
> +        dev_dbg(ep->ep, "XXX: init ep port aer\n");
> +        cxl_init_ep_ports_aer(ep);
>  	}
>  
>  	/* Note: endpoint port component registers are derived from @cxlds */
> @@ -200,6 +202,7 @@ static int cxl_mem_probe(struct device *dev)
>  			return -ENXIO;
>  		}
>  
> +        dev_dbg(dev, "XXX: add endpoint\n");
>  		rc = devm_cxl_add_endpoint(endpoint_parent, cxlmd, dport);
>  		if (rc)
>  			return rc;
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index 3785f4ca5103..8285f14994e8 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -294,6 +294,11 @@ static int cxl_report_error_detected(struct pci_dev *dev, void *data)
>  	bool *status = data;
>  
>  	device_lock(&dev->dev);
> +    if (pdrv) {
> +        dev_dbg(&dev->dev, "XXX: call cxl_err_handler: %p %p\n", pdrv, pdrv->cxl_err_handler);
> +    } else {
> +        dev_dbg(&dev->dev, "XXX: call cxl_err_handler: %p no handler\n", pdrv);
> +    }
>  	if (pdrv && pdrv->cxl_err_handler &&
>  	    pdrv->cxl_err_handler->error_detected) {
>  		const struct cxl_error_handlers *cxl_err_handler =
> --------------------------------------
>
> Fan
>> Below are test results for this patchset using Qemu with CXL root
>> port(0c:00.0), CXL upstream switchport(0d:00.0), CXL downstream
>> switchport(0e:00.0). A CXL endpoint(0f:00.0) CE and UCE logs are
>> also added to show the existing PCIe endpoint handling is not changed.
>>
>> This was tested using aer-inject updated to support CE and UCE internal
>> error injection. CXL RAS was set using a test patch (not upstreamed but can
>> provide if needed).
>>
>>  - Root port UCE:
>>  root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
>>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
>>  pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
>>  pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>>  pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/02000000
>>  pcieport 0000:0c:00.0:    [22] UncorrIntErr
>>  aer_event: 0000:0c:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
>>  cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
>>  Kernel panic - not syncing: CXL cachemem error. Invoking panic
>>  CPU: 1 UID: 0 PID: 146 Comm: irq/24-aerdrv Tainted: G            E      6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
>>  Tainted: [E]=UNSIGNED_MODULE
>>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>>  Call Trace:
>>   <TASK>
>>   dump_stack_lvl+0x27/0x90
>>   dump_stack+0x10/0x20
>>   panic+0x33e/0x380
>>   cxl_do_recovery+0x116/0x120
>>   ? srso_return_thunk+0x5/0x5f
>>   aer_isr+0x3e0/0x710
>>   irq_thread_fn+0x28/0x70
>>   irq_thread+0x179/0x240
>>   ? srso_return_thunk+0x5/0x5f
>>   ? __pfx_irq_thread_fn+0x10/0x10
>>   ? __pfx_irq_thread_dtor+0x10/0x10
>>   ? __pfx_irq_thread+0x10/0x10
>>   kthread+0xf5/0x130
>>   ? __pfx_kthread+0x10/0x10
>>   ret_from_fork+0x3c/0x60
>>   ? __pfx_kthread+0x10/0x10
>>   ret_from_fork_asm+0x1a/0x30
>>   </TASK>
>>  Kernel Offset: 0x29000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>>  ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
> ...

Fan Ni Nov. 1, 2024, 7:11 p.m. UTC | #5

On Fri, Nov 01, 2024 at 01:28:12PM -0500, Bowman, Terry wrote:
> Hi Fan,
> 
> I added comments below.
> 
> On 11/1/2024 1:00 PM, Fan Ni wrote:
> > On Fri, Oct 25, 2024 at 04:02:51PM -0500, Terry Bowman wrote:
> >> This is a continuation of the CXL port error handling RFC from earlier.[1]
> >> The RFC resulted in the decision to add CXL PCIe port error handling to
> >> the existing RCH downstream port handling in the AER service driver. This
> >> patchset adds the CXL PCIe port protocol error handling and logging.
> >>
> >> The first 7 patches update the existing AER service driver to support CXL
> >> PCIe port protocol error handling and reporting. This includes AER service
> >> driver changes for adding correctable and uncorrectable error support, CXL
> >> specific recovery handling, and addition of CXL driver callback handlers.
> >>
> >> The following 7 patches address CXL driver support for CXL PCIe port
> >> protocol errors. This includes the following changes to the CXL drivers:
> >> mapping CXL port and downstream port RAS registers, interface updates for
> >> common restricted CXL host mode (RCH) and virtual hierarchy mode (VH),
> >> adding port specific error handlers, and protocol error logging.
> >>
> >> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@amd.com/
> >>
> >> Testing:
> > Hi Terry,
> > I tried to test the patchset with aer_inject tool (with the patch you shared
> > in the last version), and hit some issues.
> > Could you help check and give some insights? Thanks.
> >
> > Below are some test setup info and results.
> >
> > I tested two topology,
> >   a. one memdev directly attaced to a HB with only one RP;
> >   b. a topology with cxl switch:
> >          HB
> >         /  \
> >       RP0   RP1
> >        |
> >      switch
> >        |
> >  ----------------
> >  |    |    |    |
> > mem0 mem1 mem2 mem3
> >
> > For both topologies, I cannot reproduce the system panic shown in your cover
> > letter.  
> >
> > btw, I tried both compile cxl as modules and in the kernel.
> >
> > Below, I will use the direct-attached topology (a) as an example to show what I
> > tried, hope can get some clarity about the test and what I missed or did wrong.
> >
> > -------------------------------------
> > pci device info on the test VM 
> > root@fan:~# lspci
> > 00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
> > 00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
> > 00:02.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
> > 00:03.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
> > 00:04.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
> > 00:05.0 Host bridge: Red Hat, Inc. QEMU PCIe Expander bridge
> > 00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
> > 00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02)
> > 00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
> > 0c:00.0 PCI bridge: Intel Corporation Device 7075
> > 0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)
> > root@fan:~# 
> > -------------------------------------
> >
> > The aer injection input file looks like below,
> >
> > -------------------------------------
> > fan:~/cxl/cxl-test-tool$ cat /tmp/internal 
> > AER
> > PCI_ID 0000:0c:00.0
> > UNCOR_STATUS INTERNAL
> > HEADER_LOG 0 1 2 3
> > ------------------------------------
> >
> > dmesg after aer injection 
> >
> > ssh root@localhost -p 2024 "dmesg"
> > [  613.195352] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
> > [  613.195830] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
> > [  613.196253] pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> > [  613.198199] pcieport 0000:0c:00.0: AER: No uncorrectable error found. Continuing.
> > -----------------------------------
> 
> This is likely because the device's CXL RAS status is not set and as a result returns false and bypasses the panic.
> Unfortunately, the aer-inject only sets the AER status and triggers the interrupt. The CXL RAS is not set.
> 
> I attached 2 'test' patches. The first patch sets the device's RAS status to simulate the error reporting.
> This will have to be adjusted as the patch looks for a specific device's bus and this will likely be a different
> bus then the device's you test in your setup.
> 
> The 2nd patch enables UIE/CIE. I moved this out of the v2 patchset. I need to revisit this to see if it is
> needed in the patchset itself (not just a test patch).
> 
> Regards,
> Terry

Hi Terry,

Thanks for the quick reply. With the two patches applied, the system
panic as expected. 

Thanks,
Fan

> 
> >
> > The problem seems to be related to the cxl_error_handler not been assigned for
> > cxlmem device. 
> >
> > in
> > cxl_do_recover() {
> > ...
> >     327     cxl_walk_bridge(bridge, cxl_report_error_detected, &status);                         
> >     328     if (status)                                                                 
> >     329         panic("CXL cachemem error. Invoking panic");                   
> > ...
> > }
> > The status returned is false, so no panic().
> >
> > I tried to add some dev_dbg info to the code to debug.
> > Below are the debug info and kernel code changes for debugging. 
> > --------------------------------------
> > fan:~/cxl/cxl-test-tool$ cxl-tool.py --cmd dmesg | grep XXX
> > [    1.738909] cxl_mem:cxl_mem_probe:205: cxl_mem mem0: XXX: add endpoint
> > [    1.739188] cxl_mem:devm_cxl_add_endpoint:85: cxl_port port1: XXX: add endpoint
> > [    1.739509] cxl_mem:devm_cxl_add_endpoint:92: cxl_mem mem0: XXX: init ep port aer
> > [    1.739876] cxl_core:cxl_dport_init_ras_reporting:907: pcieport 0000:0c:00.0: XXX: assign port error handlers for dport 1
> > [    1.740338] cxl_core:cxl_dport_init_ras_reporting:913: pcieport 0000:0c:00.0: XXX: assign port error handlers for dport 2
> > [    1.740812] cxl_core:cxl_dport_init_ras_reporting:927: pcieport 0000:0c:00.0: XXX: assign port error handlers for dport 3
> > [    1.741273] cxl_core:cxl_assign_port_error_handlers:851: pcieport 0000:0c:00.0: XXX: cxl_err_handler: (____ptrval____)
> > [    1.741812] cxl_core:cxl_assign_port_error_handlers:855: pcieport 0000:0c:00.0: XXX: cxl_err_handler: (____ptrval____)
> > [    1.742263] cxl_core:cxl_assign_port_error_handlers:857: pcieport 0000:0c:00.0: XXX: cxl_err_handler: (____ptrval____) (____ptrval____)
> > fan:~/cxl/cxl-test-tool$ 
> > --------------------------------------
> >
> > dmesg after error injection:
> > --------------------------------------
> > ssh root@localhost -p 2024 "dmesg"
> > [  228.544439] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
> > [  228.544977] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
> > [  228.545381] pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> > [  228.545879] pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/00000000
> > [  228.546360] pcieport 0000:0c:00.0:    [22] UncorrIntErr          
> > [  228.546698] pcieport 0000:0c:00.0: AER: XXX: call cxl_err_handler: 00000000a268bfcb 000000009e0da039
> > [  228.547103] cxl_pci 0000:0d:00.0: AER: XXX: call cxl_err_handler: 00000000b9f08b93 0000000000000000
> > [  228.547515] pcieport 0000:0c:00.0: AER: No uncorrectable error found. Continuing.
> > fan:~/cxl/cxl-test-tool$ 
> > --------------------------------------
> >
> >
> > Kernel changes:
> > --------------------------------------
> > diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> > index 5f7570c6173c..bcecd1283fc6 100644
> > --- a/drivers/cxl/core/pci.c
> > +++ b/drivers/cxl/core/pci.c
> > @@ -848,10 +848,13 @@ static void cxl_assign_port_error_handlers(struct pci_dev *pdev)
> >  {
> >  	struct pci_driver *pdrv = pdev->driver;
> >  
> > +    dev_dbg(&pdev->dev, "XXX: cxl_err_handler: %p\n enter", pdev);
> >  	if (!pdrv)
> >  		return;
> >  
> > +    dev_dbg(&pdev->dev, "XXX: cxl_err_handler: %p\n", pdrv);
> >  	pdrv->cxl_err_handler = &cxl_port_error_handlers;
> > +    dev_dbg(&pdev->dev, "XXX: cxl_err_handler: %p %p\n", pdrv, pdrv->cxl_err_handler);
> >  }
> >  
> >  static void cxl_clear_port_error_handlers(void *data)
> > @@ -869,12 +872,14 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
> >  {
> >  	struct pci_dev *pdev = to_pci_dev(port->uport_dev);
> >  
> > +    dev_dbg(&port->dev, "XXX: assign port error handlers for uport 1\n");
> >  	/* uport may have more than 1 downstream EP. Check if already mapped. */
> >  	if (port->uport_regs.ras) {
> >  		dev_warn(&port->dev, "RAS is already mapped\n");
> >  		return;
> >  	}
> >  
> > +    dev_dbg(&port->dev, "XXX: assign port error handlers for uport 2\n");
> >  	port->reg_map.host = &port->dev;
> >  	if (cxl_map_component_regs(&port->reg_map, &port->uport_regs,
> >  				   BIT(CXL_CM_CAP_CAP_ID_RAS))) {
> > @@ -882,6 +887,7 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
> >  		return;
> >  	}
> >  
> > +    dev_dbg(&port->dev, "XXX: assign port error handlers for uport 3\n");
> >  	cxl_assign_port_error_handlers(pdev);
> >  	devm_add_action_or_reset(port->uport_dev, cxl_clear_port_error_handlers, pdev);
> >  }
> > @@ -898,11 +904,13 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
> >  	struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport_dev);
> >  	struct pci_dev *pdev = to_pci_dev(dport_dev);
> >  
> > +    dev_dbg(dport_dev, "XXX: assign port error handlers for dport 1\n");
> >  	if (dport->rch && host_bridge->native_aer) {
> >  		cxl_dport_map_rch_aer(dport);
> >  		cxl_disable_rch_root_ints(dport);
> >  	}
> >  
> > +    dev_dbg(dport_dev, "XXX: assign port error handlers for dport 2\n");
> >  	/* dport may have more than 1 downstream EP. Check if already mapped. */
> >  	if (dport->regs.ras) {
> >  		dev_warn(dport_dev, "RAS is already mapped\n");
> > @@ -916,6 +924,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
> >  		return;
> >  	}
> >  
> > +    dev_dbg(dport_dev, "XXX: assign port error handlers for dport 3\n");
> >  	cxl_assign_port_error_handlers(pdev);
> >  	devm_add_action_or_reset(dport_dev, cxl_clear_port_error_handlers, pdev);
> >  }
> > diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> > index 067fd6389562..aa824584f8dd 100644
> > --- a/drivers/cxl/mem.c
> > +++ b/drivers/cxl/mem.c
> > @@ -82,13 +82,15 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
> >  	 * Now that the path to the root is established record all the
> >  	 * intervening ports in the chain.
> >  	 */
> > +    dev_dbg(host, "XXX: add endpoint\n");
> >  	for (iter = parent_port, down = NULL; !is_cxl_root(iter);
> >  	     down = iter, iter = to_cxl_port(iter->dev.parent)) {
> >  		struct cxl_ep *ep;
> >  
> >  		ep = cxl_ep_load(iter, cxlmd);
> >  		ep->next = down;
> > -		cxl_init_ep_ports_aer(ep);
> > +        dev_dbg(ep->ep, "XXX: init ep port aer\n");
> > +        cxl_init_ep_ports_aer(ep);
> >  	}
> >  
> >  	/* Note: endpoint port component registers are derived from @cxlds */
> > @@ -200,6 +202,7 @@ static int cxl_mem_probe(struct device *dev)
> >  			return -ENXIO;
> >  		}
> >  
> > +        dev_dbg(dev, "XXX: add endpoint\n");
> >  		rc = devm_cxl_add_endpoint(endpoint_parent, cxlmd, dport);
> >  		if (rc)
> >  			return rc;
> > diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> > index 3785f4ca5103..8285f14994e8 100644
> > --- a/drivers/pci/pcie/err.c
> > +++ b/drivers/pci/pcie/err.c
> > @@ -294,6 +294,11 @@ static int cxl_report_error_detected(struct pci_dev *dev, void *data)
> >  	bool *status = data;
> >  
> >  	device_lock(&dev->dev);
> > +    if (pdrv) {
> > +        dev_dbg(&dev->dev, "XXX: call cxl_err_handler: %p %p\n", pdrv, pdrv->cxl_err_handler);
> > +    } else {
> > +        dev_dbg(&dev->dev, "XXX: call cxl_err_handler: %p no handler\n", pdrv);
> > +    }
> >  	if (pdrv && pdrv->cxl_err_handler &&
> >  	    pdrv->cxl_err_handler->error_detected) {
> >  		const struct cxl_error_handlers *cxl_err_handler =
> > --------------------------------------
> >
> > Fan
> >> Below are test results for this patchset using Qemu with CXL root
> >> port(0c:00.0), CXL upstream switchport(0d:00.0), CXL downstream
> >> switchport(0e:00.0). A CXL endpoint(0f:00.0) CE and UCE logs are
> >> also added to show the existing PCIe endpoint handling is not changed.
> >>
> >> This was tested using aer-inject updated to support CE and UCE internal
> >> error injection. CXL RAS was set using a test patch (not upstreamed but can
> >> provide if needed).
> >>
> >>  - Root port UCE:
> >>  root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
> >>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
> >>  pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
> >>  pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> >>  pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/02000000
> >>  pcieport 0000:0c:00.0:    [22] UncorrIntErr
> >>  aer_event: 0000:0c:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
> >>  cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
> >>  Kernel panic - not syncing: CXL cachemem error. Invoking panic
> >>  CPU: 1 UID: 0 PID: 146 Comm: irq/24-aerdrv Tainted: G            E      6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
> >>  Tainted: [E]=UNSIGNED_MODULE
> >>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> >>  Call Trace:
> >>   <TASK>
> >>   dump_stack_lvl+0x27/0x90
> >>   dump_stack+0x10/0x20
> >>   panic+0x33e/0x380
> >>   cxl_do_recovery+0x116/0x120
> >>   ? srso_return_thunk+0x5/0x5f
> >>   aer_isr+0x3e0/0x710
> >>   irq_thread_fn+0x28/0x70
> >>   irq_thread+0x179/0x240
> >>   ? srso_return_thunk+0x5/0x5f
> >>   ? __pfx_irq_thread_fn+0x10/0x10
> >>   ? __pfx_irq_thread_dtor+0x10/0x10
> >>   ? __pfx_irq_thread+0x10/0x10
> >>   kthread+0xf5/0x130
> >>   ? __pfx_kthread+0x10/0x10
> >>   ret_from_fork+0x3c/0x60
> >>   ? __pfx_kthread+0x10/0x10
> >>   ret_from_fork_asm+0x1a/0x30
> >>   </TASK>
> >>  Kernel Offset: 0x29000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> >>  ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
> > ...

Fan Ni Nov. 1, 2024, 10:11 p.m. UTC | #6

On Fri, Nov 01, 2024 at 01:28:12PM -0500, Bowman, Terry wrote:
> Hi Fan,
> 
> I added comments below.
> 
> On 11/1/2024 1:00 PM, Fan Ni wrote:
> > On Fri, Oct 25, 2024 at 04:02:51PM -0500, Terry Bowman wrote:
> >> This is a continuation of the CXL port error handling RFC from earlier.[1]
> >> The RFC resulted in the decision to add CXL PCIe port error handling to
> >> the existing RCH downstream port handling in the AER service driver. This
> >> patchset adds the CXL PCIe port protocol error handling and logging.
> >>
> >> The first 7 patches update the existing AER service driver to support CXL
> >> PCIe port protocol error handling and reporting. This includes AER service
> >> driver changes for adding correctable and uncorrectable error support, CXL
> >> specific recovery handling, and addition of CXL driver callback handlers.
> >>
> >> The following 7 patches address CXL driver support for CXL PCIe port
> >> protocol errors. This includes the following changes to the CXL drivers:
> >> mapping CXL port and downstream port RAS registers, interface updates for
> >> common restricted CXL host mode (RCH) and virtual hierarchy mode (VH),
> >> adding port specific error handlers, and protocol error logging.
> >>
> >> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@amd.com/
> >>
> >> Testing:
> > Hi Terry,
> > I tried to test the patchset with aer_inject tool (with the patch you shared
> > in the last version), and hit some issues.
> > Could you help check and give some insights? Thanks.
> >
> > Below are some test setup info and results.
> >
> > I tested two topology,
> >   a. one memdev directly attaced to a HB with only one RP;
> >   b. a topology with cxl switch:
> >          HB
> >         /  \
> >       RP0   RP1
> >        |
> >      switch
> >        |
> >  ----------------
> >  |    |    |    |
> > mem0 mem1 mem2 mem3
> >
> > For both topologies, I cannot reproduce the system panic shown in your cover
> > letter.  
> >
> > btw, I tried both compile cxl as modules and in the kernel.
> >
> > Below, I will use the direct-attached topology (a) as an example to show what I
> > tried, hope can get some clarity about the test and what I missed or did wrong.
> >
> > -------------------------------------
> > pci device info on the test VM 
> > root@fan:~# lspci
> > 00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
> > 00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
> > 00:02.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
> > 00:03.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
> > 00:04.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
> > 00:05.0 Host bridge: Red Hat, Inc. QEMU PCIe Expander bridge
> > 00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
> > 00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02)
> > 00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
> > 0c:00.0 PCI bridge: Intel Corporation Device 7075
> > 0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)
> > root@fan:~# 
> > -------------------------------------
> >
> > The aer injection input file looks like below,
> >
> > -------------------------------------
> > fan:~/cxl/cxl-test-tool$ cat /tmp/internal 
> > AER
> > PCI_ID 0000:0c:00.0
> > UNCOR_STATUS INTERNAL
> > HEADER_LOG 0 1 2 3
> > ------------------------------------
> >
> > dmesg after aer injection 
> >
> > ssh root@localhost -p 2024 "dmesg"
> > [  613.195352] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
> > [  613.195830] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
> > [  613.196253] pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> > [  613.198199] pcieport 0000:0c:00.0: AER: No uncorrectable error found. Continuing.
> > -----------------------------------
> 
> This is likely because the device's CXL RAS status is not set and as a result returns false and bypasses the panic.
> Unfortunately, the aer-inject only sets the AER status and triggers the interrupt. The CXL RAS is not set.
> 
> I attached 2 'test' patches. The first patch sets the device's RAS status to simulate the error reporting.
> This will have to be adjusted as the patch looks for a specific device's bus and this will likely be a different
> bus then the device's you test in your setup.
> 
> The 2nd patch enables UIE/CIE. I moved this out of the v2 patchset. I need to revisit this to see if it is
> needed in the patchset itself (not just a test patch).
> 
> Regards,
> Terry
> 
Hi Terry, 

I checked the two patches you attached, do we really need the first
patch to umask internal error? I see it is already unmasked in
aer_enable_internal_errors() which is called in aer_probe().
I tried to only apply the other patch and test again, it seems the test
output is the same as applying two patches. The system panics as well.

Fan

> >
> > The problem seems to be related to the cxl_error_handler not been assigned for
> > cxlmem device. 
> >
> > in
> > cxl_do_recover() {
> > ...
> >     327     cxl_walk_bridge(bridge, cxl_report_error_detected, &status);                         
> >     328     if (status)                                                                 
> >     329         panic("CXL cachemem error. Invoking panic");                   
> > ...
> > }
> > The status returned is false, so no panic().
> >
> > I tried to add some dev_dbg info to the code to debug.
> > Below are the debug info and kernel code changes for debugging. 
> > --------------------------------------
> > fan:~/cxl/cxl-test-tool$ cxl-tool.py --cmd dmesg | grep XXX
> > [    1.738909] cxl_mem:cxl_mem_probe:205: cxl_mem mem0: XXX: add endpoint
> > [    1.739188] cxl_mem:devm_cxl_add_endpoint:85: cxl_port port1: XXX: add endpoint
> > [    1.739509] cxl_mem:devm_cxl_add_endpoint:92: cxl_mem mem0: XXX: init ep port aer
> > [    1.739876] cxl_core:cxl_dport_init_ras_reporting:907: pcieport 0000:0c:00.0: XXX: assign port error handlers for dport 1
> > [    1.740338] cxl_core:cxl_dport_init_ras_reporting:913: pcieport 0000:0c:00.0: XXX: assign port error handlers for dport 2
> > [    1.740812] cxl_core:cxl_dport_init_ras_reporting:927: pcieport 0000:0c:00.0: XXX: assign port error handlers for dport 3
> > [    1.741273] cxl_core:cxl_assign_port_error_handlers:851: pcieport 0000:0c:00.0: XXX: cxl_err_handler: (____ptrval____)
> > [    1.741812] cxl_core:cxl_assign_port_error_handlers:855: pcieport 0000:0c:00.0: XXX: cxl_err_handler: (____ptrval____)
> > [    1.742263] cxl_core:cxl_assign_port_error_handlers:857: pcieport 0000:0c:00.0: XXX: cxl_err_handler: (____ptrval____) (____ptrval____)
> > fan:~/cxl/cxl-test-tool$ 
> > --------------------------------------
> >
> > dmesg after error injection:
> > --------------------------------------
> > ssh root@localhost -p 2024 "dmesg"
> > [  228.544439] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
> > [  228.544977] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
> > [  228.545381] pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> > [  228.545879] pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/00000000
> > [  228.546360] pcieport 0000:0c:00.0:    [22] UncorrIntErr          
> > [  228.546698] pcieport 0000:0c:00.0: AER: XXX: call cxl_err_handler: 00000000a268bfcb 000000009e0da039
> > [  228.547103] cxl_pci 0000:0d:00.0: AER: XXX: call cxl_err_handler: 00000000b9f08b93 0000000000000000
> > [  228.547515] pcieport 0000:0c:00.0: AER: No uncorrectable error found. Continuing.
> > fan:~/cxl/cxl-test-tool$ 
> > --------------------------------------
> >
> >
> > Kernel changes:
> > --------------------------------------
> > diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> > index 5f7570c6173c..bcecd1283fc6 100644
> > --- a/drivers/cxl/core/pci.c
> > +++ b/drivers/cxl/core/pci.c
> > @@ -848,10 +848,13 @@ static void cxl_assign_port_error_handlers(struct pci_dev *pdev)
> >  {
> >  	struct pci_driver *pdrv = pdev->driver;
> >  
> > +    dev_dbg(&pdev->dev, "XXX: cxl_err_handler: %p\n enter", pdev);
> >  	if (!pdrv)
> >  		return;
> >  
> > +    dev_dbg(&pdev->dev, "XXX: cxl_err_handler: %p\n", pdrv);
> >  	pdrv->cxl_err_handler = &cxl_port_error_handlers;
> > +    dev_dbg(&pdev->dev, "XXX: cxl_err_handler: %p %p\n", pdrv, pdrv->cxl_err_handler);
> >  }
> >  
> >  static void cxl_clear_port_error_handlers(void *data)
> > @@ -869,12 +872,14 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
> >  {
> >  	struct pci_dev *pdev = to_pci_dev(port->uport_dev);
> >  
> > +    dev_dbg(&port->dev, "XXX: assign port error handlers for uport 1\n");
> >  	/* uport may have more than 1 downstream EP. Check if already mapped. */
> >  	if (port->uport_regs.ras) {
> >  		dev_warn(&port->dev, "RAS is already mapped\n");
> >  		return;
> >  	}
> >  
> > +    dev_dbg(&port->dev, "XXX: assign port error handlers for uport 2\n");
> >  	port->reg_map.host = &port->dev;
> >  	if (cxl_map_component_regs(&port->reg_map, &port->uport_regs,
> >  				   BIT(CXL_CM_CAP_CAP_ID_RAS))) {
> > @@ -882,6 +887,7 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
> >  		return;
> >  	}
> >  
> > +    dev_dbg(&port->dev, "XXX: assign port error handlers for uport 3\n");
> >  	cxl_assign_port_error_handlers(pdev);
> >  	devm_add_action_or_reset(port->uport_dev, cxl_clear_port_error_handlers, pdev);
> >  }
> > @@ -898,11 +904,13 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
> >  	struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport_dev);
> >  	struct pci_dev *pdev = to_pci_dev(dport_dev);
> >  
> > +    dev_dbg(dport_dev, "XXX: assign port error handlers for dport 1\n");
> >  	if (dport->rch && host_bridge->native_aer) {
> >  		cxl_dport_map_rch_aer(dport);
> >  		cxl_disable_rch_root_ints(dport);
> >  	}
> >  
> > +    dev_dbg(dport_dev, "XXX: assign port error handlers for dport 2\n");
> >  	/* dport may have more than 1 downstream EP. Check if already mapped. */
> >  	if (dport->regs.ras) {
> >  		dev_warn(dport_dev, "RAS is already mapped\n");
> > @@ -916,6 +924,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
> >  		return;
> >  	}
> >  
> > +    dev_dbg(dport_dev, "XXX: assign port error handlers for dport 3\n");
> >  	cxl_assign_port_error_handlers(pdev);
> >  	devm_add_action_or_reset(dport_dev, cxl_clear_port_error_handlers, pdev);
> >  }
> > diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> > index 067fd6389562..aa824584f8dd 100644
> > --- a/drivers/cxl/mem.c
> > +++ b/drivers/cxl/mem.c
> > @@ -82,13 +82,15 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
> >  	 * Now that the path to the root is established record all the
> >  	 * intervening ports in the chain.
> >  	 */
> > +    dev_dbg(host, "XXX: add endpoint\n");
> >  	for (iter = parent_port, down = NULL; !is_cxl_root(iter);
> >  	     down = iter, iter = to_cxl_port(iter->dev.parent)) {
> >  		struct cxl_ep *ep;
> >  
> >  		ep = cxl_ep_load(iter, cxlmd);
> >  		ep->next = down;
> > -		cxl_init_ep_ports_aer(ep);
> > +        dev_dbg(ep->ep, "XXX: init ep port aer\n");
> > +        cxl_init_ep_ports_aer(ep);
> >  	}
> >  
> >  	/* Note: endpoint port component registers are derived from @cxlds */
> > @@ -200,6 +202,7 @@ static int cxl_mem_probe(struct device *dev)
> >  			return -ENXIO;
> >  		}
> >  
> > +        dev_dbg(dev, "XXX: add endpoint\n");
> >  		rc = devm_cxl_add_endpoint(endpoint_parent, cxlmd, dport);
> >  		if (rc)
> >  			return rc;
> > diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> > index 3785f4ca5103..8285f14994e8 100644
> > --- a/drivers/pci/pcie/err.c
> > +++ b/drivers/pci/pcie/err.c
> > @@ -294,6 +294,11 @@ static int cxl_report_error_detected(struct pci_dev *dev, void *data)
> >  	bool *status = data;
> >  
> >  	device_lock(&dev->dev);
> > +    if (pdrv) {
> > +        dev_dbg(&dev->dev, "XXX: call cxl_err_handler: %p %p\n", pdrv, pdrv->cxl_err_handler);
> > +    } else {
> > +        dev_dbg(&dev->dev, "XXX: call cxl_err_handler: %p no handler\n", pdrv);
> > +    }
> >  	if (pdrv && pdrv->cxl_err_handler &&
> >  	    pdrv->cxl_err_handler->error_detected) {
> >  		const struct cxl_error_handlers *cxl_err_handler =
> > --------------------------------------
> >
> > Fan
> >> Below are test results for this patchset using Qemu with CXL root
> >> port(0c:00.0), CXL upstream switchport(0d:00.0), CXL downstream
> >> switchport(0e:00.0). A CXL endpoint(0f:00.0) CE and UCE logs are
> >> also added to show the existing PCIe endpoint handling is not changed.
> >>
> >> This was tested using aer-inject updated to support CE and UCE internal
> >> error injection. CXL RAS was set using a test patch (not upstreamed but can
> >> provide if needed).
> >>
> >>  - Root port UCE:
> >>  root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
> >>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
> >>  pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
> >>  pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> >>  pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/02000000
> >>  pcieport 0000:0c:00.0:    [22] UncorrIntErr
> >>  aer_event: 0000:0c:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
> >>  cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
> >>  Kernel panic - not syncing: CXL cachemem error. Invoking panic
> >>  CPU: 1 UID: 0 PID: 146 Comm: irq/24-aerdrv Tainted: G            E      6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
> >>  Tainted: [E]=UNSIGNED_MODULE
> >>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> >>  Call Trace:
> >>   <TASK>
> >>   dump_stack_lvl+0x27/0x90
> >>   dump_stack+0x10/0x20
> >>   panic+0x33e/0x380
> >>   cxl_do_recovery+0x116/0x120
> >>   ? srso_return_thunk+0x5/0x5f
> >>   aer_isr+0x3e0/0x710
> >>   irq_thread_fn+0x28/0x70
> >>   irq_thread+0x179/0x240
> >>   ? srso_return_thunk+0x5/0x5f
> >>   ? __pfx_irq_thread_fn+0x10/0x10
> >>   ? __pfx_irq_thread_dtor+0x10/0x10
> >>   ? __pfx_irq_thread+0x10/0x10
> >>   kthread+0xf5/0x130
> >>   ? __pfx_kthread+0x10/0x10
> >>   ret_from_fork+0x3c/0x60
> >>   ? __pfx_kthread+0x10/0x10
> >>   ret_from_fork_asm+0x1a/0x30
> >>   </TASK>
> >>  Kernel Offset: 0x29000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> >>  ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
> > ...

Bowman, Terry Nov. 4, 2024, 9:25 p.m. UTC | #7

On 11/1/2024 5:11 PM, Fan Ni wrote:
> On Fri, Nov 01, 2024 at 01:28:12PM -0500, Bowman, Terry wrote:
>> Hi Fan,
>>
>> I added comments below.
>>
>> On 11/1/2024 1:00 PM, Fan Ni wrote:
>>> On Fri, Oct 25, 2024 at 04:02:51PM -0500, Terry Bowman wrote:
>>>> This is a continuation of the CXL port error handling RFC from earlier.[1]
>>>> The RFC resulted in the decision to add CXL PCIe port error handling to
>>>> the existing RCH downstream port handling in the AER service driver. This
>>>> patchset adds the CXL PCIe port protocol error handling and logging.
>>>>
>>>> The first 7 patches update the existing AER service driver to support CXL
>>>> PCIe port protocol error handling and reporting. This includes AER service
>>>> driver changes for adding correctable and uncorrectable error support, CXL
>>>> specific recovery handling, and addition of CXL driver callback handlers.
>>>>
>>>> The following 7 patches address CXL driver support for CXL PCIe port
>>>> protocol errors. This includes the following changes to the CXL drivers:
>>>> mapping CXL port and downstream port RAS registers, interface updates for
>>>> common restricted CXL host mode (RCH) and virtual hierarchy mode (VH),
>>>> adding port specific error handlers, and protocol error logging.
>>>>
>>>> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@amd.com/
>>>>
>>>> Testing:
>>> Hi Terry,
>>> I tried to test the patchset with aer_inject tool (with the patch you shared
>>> in the last version), and hit some issues.
>>> Could you help check and give some insights? Thanks.
>>>
>>> Below are some test setup info and results.
>>>
>>> I tested two topology,
>>>   a. one memdev directly attaced to a HB with only one RP;
>>>   b. a topology with cxl switch:
>>>          HB
>>>         /  \
>>>       RP0   RP1
>>>        |
>>>      switch
>>>        |
>>>  ----------------
>>>  |    |    |    |
>>> mem0 mem1 mem2 mem3
>>>
>>> For both topologies, I cannot reproduce the system panic shown in your cover
>>> letter.  
>>>
>>> btw, I tried both compile cxl as modules and in the kernel.
>>>
>>> Below, I will use the direct-attached topology (a) as an example to show what I
>>> tried, hope can get some clarity about the test and what I missed or did wrong.
>>>
>>> -------------------------------------
>>> pci device info on the test VM 
>>> root@fan:~# lspci
>>> 00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
>>> 00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
>>> 00:02.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
>>> 00:03.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
>>> 00:04.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
>>> 00:05.0 Host bridge: Red Hat, Inc. QEMU PCIe Expander bridge
>>> 00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
>>> 00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02)
>>> 00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
>>> 0c:00.0 PCI bridge: Intel Corporation Device 7075
>>> 0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)
>>> root@fan:~# 
>>> -------------------------------------
>>>
>>> The aer injection input file looks like below,
>>>
>>> -------------------------------------
>>> fan:~/cxl/cxl-test-tool$ cat /tmp/internal 
>>> AER
>>> PCI_ID 0000:0c:00.0
>>> UNCOR_STATUS INTERNAL
>>> HEADER_LOG 0 1 2 3
>>> ------------------------------------
>>>
>>> dmesg after aer injection 
>>>
>>> ssh root@localhost -p 2024 "dmesg"
>>> [  613.195352] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
>>> [  613.195830] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
>>> [  613.196253] pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>>> [  613.198199] pcieport 0000:0c:00.0: AER: No uncorrectable error found. Continuing.
>>> -----------------------------------
>> This is likely because the device's CXL RAS status is not set and as a result returns false and bypasses the panic.
>> Unfortunately, the aer-inject only sets the AER status and triggers the interrupt. The CXL RAS is not set.
>>
>> I attached 2 'test' patches. The first patch sets the device's RAS status to simulate the error reporting.
>> This will have to be adjusted as the patch looks for a specific device's bus and this will likely be a different
>> bus then the device's you test in your setup.
>>
>> The 2nd patch enables UIE/CIE. I moved this out of the v2 patchset. I need to revisit this to see if it is
>> needed in the patchset itself (not just a test patch).
>>
>> Regards,
>> Terry
>>
> Hi Terry, 
>
> I checked the two patches you attached, do we really need the first
> patch to umask internal error? I see it is already unmasked in
> aer_enable_internal_errors() which is called in aer_probe().
> I tried to only apply the other patch and test again, it seems the test
> output is the same as applying two patches. The system panics as well.
>
> Fan
Hi Fan,

Which device did you inject into? RP, DSP, or USP?

Yes, the RP UIE & CIE are enabled by the AER driver. RCEC too. But, this is not done for CXL DSP
and USP. Below are details from the spec describing how an AER error masked at the source will not
be propagated as notification to the root complex (RP or RCEC).

'If an individual error is masked when it is detected, its error status bit is still affected,
but no error reporting Message is sent to the Root Complex, and the error is not recorded in the
Header Log, TLP Prefix Log, or First Error Pointer.'[1]

[1] PCIe Spec 6.2.3.2.2 Masking Individual Errors

Also, there can be platform BIOS settings that enable/disable UIE/CIE.

Regards,
Terry

Fan Ni Nov. 4, 2024, 9:48 p.m. UTC | #8

On Mon, Nov 04, 2024 at 03:25:38PM -0600, Bowman, Terry wrote:
> 
> 
> On 11/1/2024 5:11 PM, Fan Ni wrote:
> > On Fri, Nov 01, 2024 at 01:28:12PM -0500, Bowman, Terry wrote:
> >> Hi Fan,
> >>
> >> I added comments below.
> >>
> >> On 11/1/2024 1:00 PM, Fan Ni wrote:
> >>> On Fri, Oct 25, 2024 at 04:02:51PM -0500, Terry Bowman wrote:
> >>>> This is a continuation of the CXL port error handling RFC from earlier.[1]
> >>>> The RFC resulted in the decision to add CXL PCIe port error handling to
> >>>> the existing RCH downstream port handling in the AER service driver. This
> >>>> patchset adds the CXL PCIe port protocol error handling and logging.
> >>>>
> >>>> The first 7 patches update the existing AER service driver to support CXL
> >>>> PCIe port protocol error handling and reporting. This includes AER service
> >>>> driver changes for adding correctable and uncorrectable error support, CXL
> >>>> specific recovery handling, and addition of CXL driver callback handlers.
> >>>>
> >>>> The following 7 patches address CXL driver support for CXL PCIe port
> >>>> protocol errors. This includes the following changes to the CXL drivers:
> >>>> mapping CXL port and downstream port RAS registers, interface updates for
> >>>> common restricted CXL host mode (RCH) and virtual hierarchy mode (VH),
> >>>> adding port specific error handlers, and protocol error logging.
> >>>>
> >>>> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@amd.com/
> >>>>
> >>>> Testing:
> >>> Hi Terry,
> >>> I tried to test the patchset with aer_inject tool (with the patch you shared
> >>> in the last version), and hit some issues.
> >>> Could you help check and give some insights? Thanks.
> >>>
> >>> Below are some test setup info and results.
> >>>
> >>> I tested two topology,
> >>>   a. one memdev directly attaced to a HB with only one RP;
> >>>   b. a topology with cxl switch:
> >>>          HB
> >>>         /  \
> >>>       RP0   RP1
> >>>        |
> >>>      switch
> >>>        |
> >>>  ----------------
> >>>  |    |    |    |
> >>> mem0 mem1 mem2 mem3
> >>>
> >>> For both topologies, I cannot reproduce the system panic shown in your cover
> >>> letter.  
> >>>
> >>> btw, I tried both compile cxl as modules and in the kernel.
> >>>
> >>> Below, I will use the direct-attached topology (a) as an example to show what I
> >>> tried, hope can get some clarity about the test and what I missed or did wrong.
> >>>
> >>> -------------------------------------
> >>> pci device info on the test VM 
> >>> root@fan:~# lspci
> >>> 00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
> >>> 00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
> >>> 00:02.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
> >>> 00:03.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
> >>> 00:04.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
> >>> 00:05.0 Host bridge: Red Hat, Inc. QEMU PCIe Expander bridge
> >>> 00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
> >>> 00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02)
> >>> 00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
> >>> 0c:00.0 PCI bridge: Intel Corporation Device 7075
> >>> 0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)
> >>> root@fan:~# 
> >>> -------------------------------------
> >>>
> >>> The aer injection input file looks like below,
> >>>
> >>> -------------------------------------
> >>> fan:~/cxl/cxl-test-tool$ cat /tmp/internal 
> >>> AER
> >>> PCI_ID 0000:0c:00.0
> >>> UNCOR_STATUS INTERNAL
> >>> HEADER_LOG 0 1 2 3
> >>> ------------------------------------
> >>>
> >>> dmesg after aer injection 
> >>>
> >>> ssh root@localhost -p 2024 "dmesg"
> >>> [  613.195352] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
> >>> [  613.195830] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
> >>> [  613.196253] pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> >>> [  613.198199] pcieport 0000:0c:00.0: AER: No uncorrectable error found. Continuing.
> >>> -----------------------------------
> >> This is likely because the device's CXL RAS status is not set and as a result returns false and bypasses the panic.
> >> Unfortunately, the aer-inject only sets the AER status and triggers the interrupt. The CXL RAS is not set.
> >>
> >> I attached 2 'test' patches. The first patch sets the device's RAS status to simulate the error reporting.
> >> This will have to be adjusted as the patch looks for a specific device's bus and this will likely be a different
> >> bus then the device's you test in your setup.
> >>
> >> The 2nd patch enables UIE/CIE. I moved this out of the v2 patchset. I need to revisit this to see if it is
> >> needed in the patchset itself (not just a test patch).
> >>
> >> Regards,
> >> Terry
> >>
> > Hi Terry, 
> >
> > I checked the two patches you attached, do we really need the first
> > patch to umask internal error? I see it is already unmasked in
> > aer_enable_internal_errors() which is called in aer_probe().
> > I tried to only apply the other patch and test again, it seems the test
> > output is the same as applying two patches. The system panics as well.
> >
> > Fan
> Hi Fan,
> 
> Which device did you inject into? RP, DSP, or USP?
> 
> Yes, the RP UIE & CIE are enabled by the AER driver. RCEC too. But, this is not done for CXL DSP
> and USP. Below are details from the spec describing how an AER error masked at the source will not
> be propagated as notification to the root complex (RP or RCEC).
> 
> 'If an individual error is masked when it is detected, its error status bit is still affected,
> but no error reporting Message is sent to the Root Complex, and the error is not recorded in the
> Header Log, TLP Prefix Log, or First Error Pointer.'[1]
> 
> [1] PCIe Spec 6.2.3.2.2 Masking Individual Errors
> 
> Also, there can be platform BIOS settings that enable/disable UIE/CIE.
> 
> Regards,
> Terry
Oh, I see. I did inject into rp in my previous setup. And confirmed we
need extra unmask for downstream port case. 

Thanks for the info.

Fan
>

[v2,0/14] Enable CXL PCIe port protocol error handling and logging

Message

Comments