mbox series

[0/4] pci/aer: Handle Advisory Non-Fatal properly

Message ID 20240111073227.31488-1-qingshun.wang@linux.intel.com (mailing list archive)
Headers show
Series pci/aer: Handle Advisory Non-Fatal properly | expand

Message

Wang, Qingshun Jan. 11, 2024, 7:32 a.m. UTC
According to PCIe specification 4.0 sections 6.2.3.2.4 and 6.2.4.3,
certain uncorrectable errors will signal ERR_COR instead of
ERR_NONFATAL, logged as Advisory Non-Fatal Error, and set bits in
both Correctable Error Status register and Uncorrectable Error Status
register. Currently, when handling AER event the kernel will only look
at CE status or UE status, but never both. In the Advisory
Non-Fatal Error case, bits set in UE status register will not be
reported and cleared until the next Fatal/Non-Fatal error arrives.

For instance, before this patch series, once kernel receives an ANFE
with Poisoned TLP in OS native AER mode, only status of CE will be
reported and cleared:

[  148.459816] pcieport 0000:b7:02.0: AER: Corrected error received: 0000:b7:02.0
[  148.459858] pcieport 0000:b7:02.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
[  148.459863] pcieport 0000:b7:02.0:   device [8086:0db0] error status/mask=00002000/00000000
[  148.459868] pcieport 0000:b7:02.0:    [13] NonFatalErr           

If the kernel receives a Malformed TLP after that, two UE will be
reported, which is unexpected. Malformed TLP Header was lost since
the previous ANF gated the TLP header logs:

[  170.540192] pcieport 0000:b7:02.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
[  170.552420] pcieport 0000:b7:02.0:   device [8086:0db0] error status/mask=00041000/00180020
[  170.561904] pcieport 0000:b7:02.0:    [12] TLP                    (First)
[  170.569656] pcieport 0000:b7:02.0:    [18] MalfTLP       

To handle this case properly, this patch set adds additional fields
in aer_err_info to track both CE and UE status/mask and UE severity.
This information will later be used to determine the register bits
that need to be cleared. Additionally, adds more data to aer_event
tracepoint, which would help to better understand ANFE and other errors
for external observation.

In the previous scenario, after this patch series, both CE status and
related UE status will be reported and cleared after ANFE:

[  148.459816] pcieport 0000:b7:02.0: AER: Corrected error received: 0000:b7:02.0
[  148.459858] pcieport 0000:b7:02.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
[  148.459863] pcieport 0000:b7:02.0:   device [8086:0db0] error status/mask=00002000/00000000
[  148.459868] pcieport 0000:b7:02.0:    [13] NonFatalErr           
[  148.459868] pcieport 0000:b7:02.0:   Uncorrectable errors that may cause Advisory Non-Fatal:
[  148.459868] pcieport 0000:b7:02.0:    [18] TLP

Wang, Qingshun (4):
  pci/aer: Store more information in aer_err_info
  pci/aer: Handle Advisory Non-Fatal properly
  pci/aer: Fetch information for FTrace
  ras: Trace more information in aer_event

 drivers/acpi/apei/ghes.c      |  16 ++-
 drivers/cxl/core/pci.c        |  15 ++-
 drivers/pci/pci.h             |  12 ++-
 drivers/pci/pcie/aer.c        | 188 +++++++++++++++++++++++++++-------
 include/linux/aer.h           |   6 +-
 include/linux/pci.h           |  27 +++++
 include/ras/ras_event.h       |  48 ++++++++-
 include/uapi/linux/pci_regs.h |   1 +
 8 files changed, 266 insertions(+), 47 deletions(-)

base-commit: 610a9b8f49fbcf1100716370d3b5f6f884a2835a

Comments

Wang, Qingshun Jan. 12, 2024, 3:23 a.m. UTC | #1
Hi,
Please ignore the Reivewed-by tags in this patch series, they were added
by mistake.

Best regards,
Wang, Qingshun
Bjorn Helgaas Jan. 12, 2024, 4:41 p.m. UTC | #2
On Thu, Jan 11, 2024 at 03:32:15PM +0800, Wang, Qingshun wrote:
> According to PCIe specification 4.0 sections 6.2.3.2.4 and 6.2.4.3,
> certain uncorrectable errors will signal ERR_COR instead of
> ERR_NONFATAL, logged as Advisory Non-Fatal Error, and set bits in
> both Correctable Error Status register and Uncorrectable Error Status
> register. Currently, when handling AER event the kernel will only look
> at CE status or UE status, but never both. In the Advisory
> Non-Fatal Error case, bits set in UE status register will not be
> reported and cleared until the next Fatal/Non-Fatal error arrives.
> 
> For instance, before this patch series, once kernel receives an ANFE
> with Poisoned TLP in OS native AER mode, only status of CE will be
> reported and cleared:
> 
> [  148.459816] pcieport 0000:b7:02.0: AER: Corrected error received: 0000:b7:02.0
> [  148.459858] pcieport 0000:b7:02.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
> [  148.459863] pcieport 0000:b7:02.0:   device [8086:0db0] error status/mask=00002000/00000000
> [  148.459868] pcieport 0000:b7:02.0:    [13] NonFatalErr           
> 
> If the kernel receives a Malformed TLP after that, two UE will be
> reported, which is unexpected. Malformed TLP Header was lost since
> the previous ANF gated the TLP header logs:
> 
> [  170.540192] pcieport 0000:b7:02.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
> [  170.552420] pcieport 0000:b7:02.0:   device [8086:0db0] error status/mask=00041000/00180020
> [  170.561904] pcieport 0000:b7:02.0:    [12] TLP                    (First)
> [  170.569656] pcieport 0000:b7:02.0:    [18] MalfTLP       
> 
> To handle this case properly, this patch set adds additional fields
> in aer_err_info to track both CE and UE status/mask and UE severity.
> This information will later be used to determine the register bits
> that need to be cleared. Additionally, adds more data to aer_event
> tracepoint, which would help to better understand ANFE and other errors
> for external observation.
> 
> In the previous scenario, after this patch series, both CE status and
> related UE status will be reported and cleared after ANFE:
> 
> [  148.459816] pcieport 0000:b7:02.0: AER: Corrected error received: 0000:b7:02.0
> [  148.459858] pcieport 0000:b7:02.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
> [  148.459863] pcieport 0000:b7:02.0:   device [8086:0db0] error status/mask=00002000/00000000
> [  148.459868] pcieport 0000:b7:02.0:    [13] NonFatalErr           
> [  148.459868] pcieport 0000:b7:02.0:   Uncorrectable errors that may cause Advisory Non-Fatal:
> [  148.459868] pcieport 0000:b7:02.0:    [18] TLP

Thanks for the overview here.  It would be good to put some of these
details in the commit logs of the patches that implement this, because
this cover letter is not preserved when the series is merged.

If/when you do, remove the timestamps because they're not relevant and
are merely distracting.  Indent quoted material a couple spaces.

Update the citations to a current spec revision (PCIe r6.0, or maybe
PCIe r6.1).  The section numbers are probably the same, but there's no
point in citing a revision that's 6.5 years old when newer ones are
available.

Bjorn
Wang, Qingshun Jan. 16, 2024, 8:32 a.m. UTC | #3
On Fri, Jan 12, 2024 at 10:41:07AM -0600, Bjorn Helgaas wrote:
> On Thu, Jan 11, 2024 at 03:32:15PM +0800, Wang, Qingshun wrote:
> > According to PCIe specification 4.0 sections 6.2.3.2.4 and 6.2.4.3,
> > certain uncorrectable errors will signal ERR_COR instead of
> > ERR_NONFATAL, logged as Advisory Non-Fatal Error, and set bits in
> > both Correctable Error Status register and Uncorrectable Error Status
> > register. Currently, when handling AER event the kernel will only look
> > at CE status or UE status, but never both. In the Advisory
> > Non-Fatal Error case, bits set in UE status register will not be
> > reported and cleared until the next Fatal/Non-Fatal error arrives.
> > 
> > For instance, before this patch series, once kernel receives an ANFE
> > with Poisoned TLP in OS native AER mode, only status of CE will be
> > reported and cleared:
> > 
> > [  148.459816] pcieport 0000:b7:02.0: AER: Corrected error received: 0000:b7:02.0
> > [  148.459858] pcieport 0000:b7:02.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
> > [  148.459863] pcieport 0000:b7:02.0:   device [8086:0db0] error status/mask=00002000/00000000
> > [  148.459868] pcieport 0000:b7:02.0:    [13] NonFatalErr           
> > 
> > If the kernel receives a Malformed TLP after that, two UE will be
> > reported, which is unexpected. Malformed TLP Header was lost since
> > the previous ANF gated the TLP header logs:
> > 
> > [  170.540192] pcieport 0000:b7:02.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
> > [  170.552420] pcieport 0000:b7:02.0:   device [8086:0db0] error status/mask=00041000/00180020
> > [  170.561904] pcieport 0000:b7:02.0:    [12] TLP                    (First)
> > [  170.569656] pcieport 0000:b7:02.0:    [18] MalfTLP       
> > 
> > To handle this case properly, this patch set adds additional fields
> > in aer_err_info to track both CE and UE status/mask and UE severity.
> > This information will later be used to determine the register bits
> > that need to be cleared. Additionally, adds more data to aer_event
> > tracepoint, which would help to better understand ANFE and other errors
> > for external observation.
> > 
> > In the previous scenario, after this patch series, both CE status and
> > related UE status will be reported and cleared after ANFE:
> > 
> > [  148.459816] pcieport 0000:b7:02.0: AER: Corrected error received: 0000:b7:02.0
> > [  148.459858] pcieport 0000:b7:02.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
> > [  148.459863] pcieport 0000:b7:02.0:   device [8086:0db0] error status/mask=00002000/00000000
> > [  148.459868] pcieport 0000:b7:02.0:    [13] NonFatalErr           
> > [  148.459868] pcieport 0000:b7:02.0:   Uncorrectable errors that may cause Advisory Non-Fatal:
> > [  148.459868] pcieport 0000:b7:02.0:    [18] TLP
> 
> Thanks for the overview here.  It would be good to put some of these
> details in the commit logs of the patches that implement this, because
> this cover letter is not preserved when the series is merged.
Thanks for your advice, will put some of these details in commit logs, 
mainly in PATCH 2. 
> 
> If/when you do, remove the timestamps because they're not relevant and
> are merely distracting.  Indent quoted material a couple spaces.
Agreed. Thanks.
> 
> Update the citations to a current spec revision (PCIe r6.0, or maybe
> PCIe r6.1).  The section numbers are probably the same, but there's no
> point in citing a revision that's 6.5 years old when newer ones are
> available.
Makes sense, thanks!
> 
> Bjorn