Message ID | 166974414546.1608150.4142682712102935008.stgit@djiang5-desk3.ch.intel.com |
---|---|
State | New, archived |
Headers | show |
Series | cxl/pci: Add fundamental error handling | expand |
On Tue, Nov 29, 2022 at 10:49:05AM -0700, Dave Jiang wrote: > Some new devices such as CXL devices may want to record additional error > information on a corrected error. Add a callback to allow the PCI device > driver to do additional logging such as providing additional stats for user > space RAS monitoring. > > For CXL device, this is actually a need due to CXL needing to write to the > device AER status register in order to clear the unmasked CEs. s/CE/correctable error/ since it's the first use and not common in PCI-land. "device AER status register" sounds like the PCIe AER Correctable Error Status Register (PCIe r6.0, sec 7.8.4.5), but I think you mean something else, maybe a CXL-specific register? The PCIe core needs to own the AER one (PCI_ERR_COR_STATUS) so it can coordinate ownership between firmware and Linux. > Cc: Bjorn Helgaas <bhelgaas@google.com> > Suggested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> > Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> > Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> > Signed-off-by: Dave Jiang <dave.jiang@intel.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> > --- > Documentation/PCI/pci-error-recovery.rst | 7 +++++++ > drivers/pci/pcie/aer.c | 8 +++++++- > include/linux/pci.h | 3 +++ > 3 files changed, 17 insertions(+), 1 deletion(-) > > diff --git a/Documentation/PCI/pci-error-recovery.rst b/Documentation/PCI/pci-error-recovery.rst > index 187f43a03200..690220255d5e 100644 > --- a/Documentation/PCI/pci-error-recovery.rst > +++ b/Documentation/PCI/pci-error-recovery.rst > @@ -83,6 +83,7 @@ This structure has the form:: > int (*mmio_enabled)(struct pci_dev *dev); > int (*slot_reset)(struct pci_dev *dev); > void (*resume)(struct pci_dev *dev); > + void (*cor_error_log)(struct pci_dev *dev); I think I would remove "log" from the name because it suggests this hook should *only* log, and you need to actually clear some status. Maybe "cor_error_detected()" to be analogous to error_detected()? > }; > > The possible channel states are:: > @@ -422,5 +423,11 @@ That is, the recovery API only requires that: > - drivers/net/cxgb3 > - drivers/net/s2io.c > > + The cor_error_log() callback is invoked in handle_error_source() when > + the error severity is "correctable". The callback is optional and allows > + additional logging to be done if desired. See example: > + > + - drivers/cxl/pci.c > + > The End > ------- > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > index e2d8a74f83c3..af1b5eecbb11 100644 > --- a/drivers/pci/pcie/aer.c > +++ b/drivers/pci/pcie/aer.c > @@ -961,8 +961,14 @@ static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info) > if (aer) > pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS, > info->status); > - if (pcie_aer_is_native(dev)) > + if (pcie_aer_is_native(dev)) { > + struct pci_driver *pdrv = dev->driver; > + > + if (pdrv && pdrv->err_handler && > + pdrv->err_handler->cor_error_log) > + pdrv->err_handler->cor_error_log(dev); > pcie_clear_device_status(dev); > + } > } else if (info->severity == AER_NONFATAL) > pcie_do_recovery(dev, pci_channel_io_normal, aer_root_reset); > else if (info->severity == AER_FATAL) > diff --git a/include/linux/pci.h b/include/linux/pci.h > index 575849a100a3..54939b3426a9 100644 > --- a/include/linux/pci.h > +++ b/include/linux/pci.h > @@ -844,6 +844,9 @@ struct pci_error_handlers { > > /* Device driver may resume normal operations */ > void (*resume)(struct pci_dev *dev); > + > + /* Allow device driver to record more details of a correctable error */ > + void (*cor_error_log)(struct pci_dev *dev); > }; > > > >
On 11/30/2022 12:45 PM, Bjorn Helgaas wrote: > On Tue, Nov 29, 2022 at 10:49:05AM -0700, Dave Jiang wrote: >> Some new devices such as CXL devices may want to record additional error >> information on a corrected error. Add a callback to allow the PCI device >> driver to do additional logging such as providing additional stats for user >> space RAS monitoring. >> >> For CXL device, this is actually a need due to CXL needing to write to the >> device AER status register in order to clear the unmasked CEs. > > s/CE/correctable error/ since it's the first use and not common in > PCI-land. > Ok > "device AER status register" sounds like the PCIe AER Correctable > Error Status Register (PCIe r6.0, sec 7.8.4.5), but I think you mean > something else, maybe a CXL-specific register? Yes. It's part of the CXL device RAS structure. I'll add more details. > > The PCIe core needs to own the AER one (PCI_ERR_COR_STATUS) so it can > coordinate ownership between firmware and Linux. > >> Cc: Bjorn Helgaas <bhelgaas@google.com> >> Suggested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> >> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> >> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> >> Signed-off-by: Dave Jiang <dave.jiang@intel.com> > > Acked-by: Bjorn Helgaas <bhelgaas@google.com> Thank you! > >> --- >> Documentation/PCI/pci-error-recovery.rst | 7 +++++++ >> drivers/pci/pcie/aer.c | 8 +++++++- >> include/linux/pci.h | 3 +++ >> 3 files changed, 17 insertions(+), 1 deletion(-) >> >> diff --git a/Documentation/PCI/pci-error-recovery.rst b/Documentation/PCI/pci-error-recovery.rst >> index 187f43a03200..690220255d5e 100644 >> --- a/Documentation/PCI/pci-error-recovery.rst >> +++ b/Documentation/PCI/pci-error-recovery.rst >> @@ -83,6 +83,7 @@ This structure has the form:: >> int (*mmio_enabled)(struct pci_dev *dev); >> int (*slot_reset)(struct pci_dev *dev); >> void (*resume)(struct pci_dev *dev); >> + void (*cor_error_log)(struct pci_dev *dev); > > I think I would remove "log" from the name because it suggests this > hook should *only* log, and you need to actually clear some status. > Maybe "cor_error_detected()" to be analogous to error_detected()? Ok I'll change. > >> }; >> >> The possible channel states are:: >> @@ -422,5 +423,11 @@ That is, the recovery API only requires that: >> - drivers/net/cxgb3 >> - drivers/net/s2io.c >> >> + The cor_error_log() callback is invoked in handle_error_source() when >> + the error severity is "correctable". The callback is optional and allows >> + additional logging to be done if desired. See example: >> + >> + - drivers/cxl/pci.c >> + >> The End >> ------- >> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c >> index e2d8a74f83c3..af1b5eecbb11 100644 >> --- a/drivers/pci/pcie/aer.c >> +++ b/drivers/pci/pcie/aer.c >> @@ -961,8 +961,14 @@ static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info) >> if (aer) >> pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS, >> info->status); >> - if (pcie_aer_is_native(dev)) >> + if (pcie_aer_is_native(dev)) { >> + struct pci_driver *pdrv = dev->driver; >> + >> + if (pdrv && pdrv->err_handler && >> + pdrv->err_handler->cor_error_log) >> + pdrv->err_handler->cor_error_log(dev); >> pcie_clear_device_status(dev); >> + } >> } else if (info->severity == AER_NONFATAL) >> pcie_do_recovery(dev, pci_channel_io_normal, aer_root_reset); >> else if (info->severity == AER_FATAL) >> diff --git a/include/linux/pci.h b/include/linux/pci.h >> index 575849a100a3..54939b3426a9 100644 >> --- a/include/linux/pci.h >> +++ b/include/linux/pci.h >> @@ -844,6 +844,9 @@ struct pci_error_handlers { >> >> /* Device driver may resume normal operations */ >> void (*resume)(struct pci_dev *dev); >> + >> + /* Allow device driver to record more details of a correctable error */ >> + void (*cor_error_log)(struct pci_dev *dev); >> }; >> >> >> >>
diff --git a/Documentation/PCI/pci-error-recovery.rst b/Documentation/PCI/pci-error-recovery.rst index 187f43a03200..690220255d5e 100644 --- a/Documentation/PCI/pci-error-recovery.rst +++ b/Documentation/PCI/pci-error-recovery.rst @@ -83,6 +83,7 @@ This structure has the form:: int (*mmio_enabled)(struct pci_dev *dev); int (*slot_reset)(struct pci_dev *dev); void (*resume)(struct pci_dev *dev); + void (*cor_error_log)(struct pci_dev *dev); }; The possible channel states are:: @@ -422,5 +423,11 @@ That is, the recovery API only requires that: - drivers/net/cxgb3 - drivers/net/s2io.c + The cor_error_log() callback is invoked in handle_error_source() when + the error severity is "correctable". The callback is optional and allows + additional logging to be done if desired. See example: + + - drivers/cxl/pci.c + The End ------- diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index e2d8a74f83c3..af1b5eecbb11 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -961,8 +961,14 @@ static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info) if (aer) pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS, info->status); - if (pcie_aer_is_native(dev)) + if (pcie_aer_is_native(dev)) { + struct pci_driver *pdrv = dev->driver; + + if (pdrv && pdrv->err_handler && + pdrv->err_handler->cor_error_log) + pdrv->err_handler->cor_error_log(dev); pcie_clear_device_status(dev); + } } else if (info->severity == AER_NONFATAL) pcie_do_recovery(dev, pci_channel_io_normal, aer_root_reset); else if (info->severity == AER_FATAL) diff --git a/include/linux/pci.h b/include/linux/pci.h index 575849a100a3..54939b3426a9 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -844,6 +844,9 @@ struct pci_error_handlers { /* Device driver may resume normal operations */ void (*resume)(struct pci_dev *dev); + + /* Allow device driver to record more details of a correctable error */ + void (*cor_error_log)(struct pci_dev *dev); };