Message ID | 20241008221657.1130181-2-terry.bowman@amd.com |
---|---|
State | Superseded |
Headers | show |
Series | Enable CXL PCIe port protocol error handling and logging | expand |
Terry Bowman wrote: > CXL protocol errors are reported to the OS through PCIe correctable and > uncorrectable internal errors. However, since CXL PCIe port devices > are currently bound to the portdrv driver, there is no mechanism to > notify the CXL driver, which is necessary for proper logging and > handling. > > To address this, introduce CXL PCIe port error callbacks along with > register/unregister and accessor functions. The callbacks will be > invoked by the AER driver in the case protocol errors are reported by > a CXL port device. > > The AER driver callbacks will be used in future patches implementing > CXL PCIe port error handling. > > Signed-off-by: Terry Bowman <terry.bowman@amd.com> > --- > drivers/pci/pcie/aer.c | 22 ++++++++++++++++++++++ > include/linux/aer.h | 14 ++++++++++++++ > 2 files changed, 36 insertions(+) > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > index 13b8586924ea..a9792b9576b4 100644 > --- a/drivers/pci/pcie/aer.c > +++ b/drivers/pci/pcie/aer.c > @@ -50,6 +50,8 @@ struct aer_rpc { > DECLARE_KFIFO(aer_fifo, struct aer_err_source, AER_ERROR_SOURCES_MAX); > }; > > +static struct cxl_port_err_hndlrs cxl_port_hndlrs; I think this can afford to splurge on a few more letters and make this static struct cxl_port_error_handlers cxl_port_error_handlers; > + > /* AER stats for the device */ > struct aer_stats { > > @@ -1078,6 +1080,26 @@ static inline void cxl_rch_handle_error(struct pci_dev *dev, > struct aer_err_info *info) { } > #endif > > +void register_cxl_port_hndlrs(struct cxl_port_err_hndlrs *_cxl_port_hndlrs) > +{ > + cxl_port_hndlrs.error_detected = _cxl_port_hndlrs->error_detected; > + cxl_port_hndlrs.cor_error_detected = _cxl_port_hndlrs->cor_error_detected; > +} > +EXPORT_SYMBOL_NS_GPL(register_cxl_port_hndlrs, CXL); > + > +void unregister_cxl_port_hndlrs(void) > +{ > + cxl_port_hndlrs.error_detected = NULL; > + cxl_port_hndlrs.cor_error_detected = NULL; > +} > +EXPORT_SYMBOL_NS_GPL(unregister_cxl_port_hndlrs, CXL); > + > +struct cxl_port_err_hndlrs *find_cxl_port_hndlrs(void) > +{ > + return &cxl_port_hndlrs; > +} > +EXPORT_SYMBOL_NS_GPL(find_cxl_port_hndlrs, CXL); I guess I will need to go deeper into the code, but I would not have expected that new registration interfaces are needed. Each 'struct pci_driver' could optionally include CXL error handlers alongside their PCIe error handlers and when CXL AER errors are broadcast only the CXL handlers are invoked. I.e. the registration is something like: diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c index 6af5e0425872..42db26195bda 100644 --- a/drivers/pci/pcie/portdrv.c +++ b/drivers/pci/pcie/portdrv.c @@ -793,6 +793,7 @@ static struct pci_driver pcie_portdriver = { .shutdown = pcie_portdrv_shutdown, .err_handler = &pcie_portdrv_err_handler, + .cxl_err_handler = &cxl_portdrv_err_handler, .driver_managed_dma = true,
Hi Dan, On 10/21/24 20:53, Dan Williams wrote: > Terry Bowman wrote: >> CXL protocol errors are reported to the OS through PCIe correctable and >> uncorrectable internal errors. However, since CXL PCIe port devices >> are currently bound to the portdrv driver, there is no mechanism to >> notify the CXL driver, which is necessary for proper logging and >> handling. >> >> To address this, introduce CXL PCIe port error callbacks along with >> register/unregister and accessor functions. The callbacks will be >> invoked by the AER driver in the case protocol errors are reported by >> a CXL port device. >> >> The AER driver callbacks will be used in future patches implementing >> CXL PCIe port error handling. >> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com> >> --- >> drivers/pci/pcie/aer.c | 22 ++++++++++++++++++++++ >> include/linux/aer.h | 14 ++++++++++++++ >> 2 files changed, 36 insertions(+) >> >> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c >> index 13b8586924ea..a9792b9576b4 100644 >> --- a/drivers/pci/pcie/aer.c >> +++ b/drivers/pci/pcie/aer.c >> @@ -50,6 +50,8 @@ struct aer_rpc { >> DECLARE_KFIFO(aer_fifo, struct aer_err_source, AER_ERROR_SOURCES_MAX); >> }; >> >> +static struct cxl_port_err_hndlrs cxl_port_hndlrs; > > I think this can afford to splurge on a few more letters and make this > > static struct cxl_port_error_handlers cxl_port_error_handlers; > > Ok. >> + >> /* AER stats for the device */ >> struct aer_stats { >> >> @@ -1078,6 +1080,26 @@ static inline void cxl_rch_handle_error(struct pci_dev *dev, >> struct aer_err_info *info) { } >> #endif >> >> +void register_cxl_port_hndlrs(struct cxl_port_err_hndlrs *_cxl_port_hndlrs) >> +{ >> + cxl_port_hndlrs.error_detected = _cxl_port_hndlrs->error_detected; >> + cxl_port_hndlrs.cor_error_detected = _cxl_port_hndlrs->cor_error_detected; >> +} >> +EXPORT_SYMBOL_NS_GPL(register_cxl_port_hndlrs, CXL); >> + >> +void unregister_cxl_port_hndlrs(void) >> +{ >> + cxl_port_hndlrs.error_detected = NULL; >> + cxl_port_hndlrs.cor_error_detected = NULL; >> +} >> +EXPORT_SYMBOL_NS_GPL(unregister_cxl_port_hndlrs, CXL); >> + >> +struct cxl_port_err_hndlrs *find_cxl_port_hndlrs(void) >> +{ >> + return &cxl_port_hndlrs; >> +} >> +EXPORT_SYMBOL_NS_GPL(find_cxl_port_hndlrs, CXL); > > I guess I will need to go deeper into the code, but I would not have > expected that new registration interfaces are needed. Each 'struct > pci_driver' could optionally include CXL error handlers alongside their > PCIe error handlers and when CXL AER errors are broadcast only the CXL > handlers are invoked. I.e. the registration is something like: > > diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c > index 6af5e0425872..42db26195bda 100644 > --- a/drivers/pci/pcie/portdrv.c > +++ b/drivers/pci/pcie/portdrv.c > @@ -793,6 +793,7 @@ static struct pci_driver pcie_portdriver = { > .shutdown = pcie_portdrv_shutdown, > > .err_handler = &pcie_portdrv_err_handler, > + .cxl_err_handler = &cxl_portdrv_err_handler, > > .driver_managed_dma = true, Ok. I'm thinking to add a definition for 'pci_dev::cxl_err_handler' of type 'struct pci_error_handler'. 'struct pci_error_handler' contains a slot reset(), resume(), and mmio_enabled() fn pointers that are used in PCIe recovery if available. The plan is for CXL devices to call panic for UCE fatal and non-fatal but it might be good to use the 'struct pci_error_handler' type in case there are needs for the other handlers in the future. It also makes the logic to access and use the error handlers common, requiring less code. Regards, Terry
Terry Bowman wrote: [..] > > diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c > > index 6af5e0425872..42db26195bda 100644 > > --- a/drivers/pci/pcie/portdrv.c > > +++ b/drivers/pci/pcie/portdrv.c > > @@ -793,6 +793,7 @@ static struct pci_driver pcie_portdriver = { > > .shutdown = pcie_portdrv_shutdown, > > > > .err_handler = &pcie_portdrv_err_handler, > > + .cxl_err_handler = &cxl_portdrv_err_handler, > > > > .driver_managed_dma = true, > > Ok. I'm thinking to add a definition for 'pci_dev::cxl_err_handler' of type > 'struct pci_error_handler'. > > 'struct pci_error_handler' contains a slot reset(), resume(), and mmio_enabled() fn > pointers that are used in PCIe recovery if available. The plan is for CXL devices to > call panic for UCE fatal and non-fatal but it might be good to use the > 'struct pci_error_handler' type in case there are needs for the other handlers in > the future. It also makes the logic to access and use the error handlers common, > requiring less code. Can you give an example where CXL can reuse 'struct pci_error_handlers` infrastructure? The PCI error handlers are built around the idea that operations can be paused and recovered, CXL operations assume near constant device participation in CPU cache and memory coherency protocol. About the only reuse I can think of is cases where a CXL error could be sent down the PCI error handler path, i.e. ones that would send a 'pci_channel_io_normal' notice to ->error_detected(). Otherwise, pci_channel_state_t and pci_ers_result_t seem to be a poor fit for CXL error handling.
Hi Dan, On 10/22/24 12:09, Dan Williams wrote: > Terry Bowman wrote: > [..] >>> diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c >>> index 6af5e0425872..42db26195bda 100644 >>> --- a/drivers/pci/pcie/portdrv.c >>> +++ b/drivers/pci/pcie/portdrv.c >>> @@ -793,6 +793,7 @@ static struct pci_driver pcie_portdriver = { >>> .shutdown = pcie_portdrv_shutdown, >>> >>> .err_handler = &pcie_portdrv_err_handler, >>> + .cxl_err_handler = &cxl_portdrv_err_handler, >>> >>> .driver_managed_dma = true, >> >> Ok. I'm thinking to add a definition for 'pci_dev::cxl_err_handler' of type >> 'struct pci_error_handler'. >> >> 'struct pci_error_handler' contains a slot reset(), resume(), and mmio_enabled() fn >> pointers that are used in PCIe recovery if available. The plan is for CXL devices to >> call panic for UCE fatal and non-fatal but it might be good to use the >> 'struct pci_error_handler' type in case there are needs for the other handlers in >> the future. It also makes the logic to access and use the error handlers common, >> requiring less code. > > Can you give an example where CXL can reuse 'struct pci_error_handlers` > infrastructure? The PCI error handlers are built around the idea that > operations can be paused and recovered, CXL operations assume near > constant device participation in CPU cache and memory coherency > protocol. > > About the only reuse I can think of is cases where a CXL error could be > sent down the PCI error handler path, i.e. ones that would send a > 'pci_channel_io_normal' notice to ->error_detected(). Otherwise, > pci_channel_state_t and pci_ers_result_t seem to be a poor fit for CXL > error handling. I was referring to reusing separate instance of 'struct pci_error_handlers' for CXL UCE-CE errors. One example where it can be reused in infrastructure is in err.c's report_error_detected(). If both PCIe and CXL errors use 'struct pci_error_handlers' then the updated report_error_detected() becomes a bit simpler with less helper function logic. But, it's not a reason by itself to choose to reuse 'struct pci_error_handlers' for CXL errors. Looking closer at aer,c shows there is no advantage in this file for using 'struct pci_error_handlers' for CXL errors. If I understand correctly you want a new type introduced, 'struct cxl_error_handlers'. And will contain 2 function pointers for CE and UCE handling. Regards, Terry
Terry Bowman wrote: [..] > I was referring to reusing separate instance of 'struct pci_error_handlers' for CXL > UCE-CE errors. > > One example where it can be reused in infrastructure is in err.c's > report_error_detected(). If both PCIe and CXL errors use 'struct pci_error_handlers' > then the updated report_error_detected() becomes a bit simpler with less helper > function logic. report_error_detected() is concerned with link and i/o state (pci_dev_is_disconnected() and pci_dev_set_io_state()). For device disconnects, CXL recovery potentially needs to span multiple devices. For i/o state, CXL.io could be fully operational while CXL.cache and CXL.mem are in fatal state. CXL considerations do not feel welcome in that function. Ideally a PCIe developer never needs to see or understand the CXL error model because it is off in its own path. In other words, if someone maintaining pcie_do_recovery=>report_error_detected() for the PCIe case needs to go find a CXL expert each time they want to touch that path, that feels like a regression in PCIe error handling maintainability. > But, it's not a reason by itself to choose to reuse 'struct > pci_error_handlers' for CXL errors. > > Looking closer at aer,c shows there is no advantage in this file for using 'struct > pci_error_handlers' for CXL errors. > > If I understand correctly you want a new type introduced, 'struct cxl_error_handlers'. Yes, mainly because the bus state and the result of the recovery tend to be a different operational model. If a CXL error fits the PCIe model then it can be sent via pcie_do_recovery(), but I expect that only applies to a handful of correctable errors like CRC_Threshold, Retry_Threshold, or Physical_Layer_Error. Almost everything else *seems* like it has a CXL specific response that would confuse pcie_do_recovery(). So, in general new operational models == new data structures and types. > And will contain 2 function pointers for CE and UCE handling. Unless and until we define a CXL Reset flow, then yes, I assume you mean ->error_detected() and ->cor_error_detected()? I do think there will be some limited fatal cases with CXL accelerators that could be recoverable if the accelerator knows that the memory error can be recovered by resetting the device without surprise data-loss. That work can wait until those failure cases become clearer. I imagine something like CXL error isolation for a host-bridge dedicated to a single accelerator might be recoverable, but anything with general-purpose memory is likely better off with a kernel-panic (see the CXL error isolation discussion: http://lore.kernel.org/e7d4a31a-bd5e-41d9-9b51-fbbd5e8fc9b2@amd.com).
Hi Dan, I added a question below. On 10/22/2024 6:43 PM, Dan Williams wrote: > Terry Bowman wrote: > [..] >> I was referring to reusing separate instance of 'struct pci_error_handlers' for CXL >> UCE-CE errors. >> >> One example where it can be reused in infrastructure is in err.c's >> report_error_detected(). If both PCIe and CXL errors use 'struct pci_error_handlers' >> then the updated report_error_detected() becomes a bit simpler with less helper >> function logic. > report_error_detected() is concerned with link and i/o state > (pci_dev_is_disconnected() and pci_dev_set_io_state()). For device > disconnects, CXL recovery potentially needs to span multiple devices. > For i/o state, CXL.io could be fully operational while CXL.cache and > CXL.mem are in fatal state. > > CXL considerations do not feel welcome in that function. > > Ideally a PCIe developer never needs to see or understand the CXL error > model because it is off in its own path. In other words, if someone > maintaining pcie_do_recovery=>report_error_detected() for the PCIe case > needs to go find a CXL expert each time they want to touch that path, > that feels like a regression in PCIe error handling maintainability. > >> But, it's not a reason by itself to choose to reuse 'struct >> pci_error_handlers' for CXL errors. >> >> Looking closer at aer,c shows there is no advantage in this file for using 'struct >> pci_error_handlers' for CXL errors. >> >> If I understand correctly you want a new type introduced, 'struct cxl_error_handlers'. > Yes, mainly because the bus state and the result of the recovery tend to > be a different operational model. If a CXL error fits the PCIe model > then it can be sent via pcie_do_recovery(), but I expect that only > applies to a handful of correctable errors like CRC_Threshold, > Retry_Threshold, or Physical_Layer_Error. Almost everything else *seems* > like it has a CXL specific response that would confuse > pcie_do_recovery(). > > So, in general new operational models == new data structures and types. Would you like to continue to use the pci_error_handlers for the CXL PCIe endpoint device driver? Or do we change the CXL PCIe endpoint driver to use the cxl_error_handlers ? Regards, Terry
Bowman, Terry wrote: [..] > > So, in general new operational models == new data structures and types. > > Would you like to continue to use the pci_error_handlers for the CXL PCIe > endpoint device driver? Or do we change the CXL PCIe endpoint driver to > use the cxl_error_handlers ? I would expect to extend with new 'cxl_error_handlers' support. 'Extend' not 'replace' because a CXL endpoint could in fact be operating in pure PCIe mode. Otherwise, it is currently ambiguous what an error handler invocation is signaling. In the new scheme fatal CXL errors are panics, while fatal PCIe errors remain device resets. So the question is how to distinguish those events without separate entry points. Either the error (bus) type needs to be added as a parameter to the callbacks or the error type is implied by the 'error handlers' type. I would rather not go disturb the PCIe error handler world by making them deal with an error type passed to existing callbacks, so a new invocation regime seems appropriate. I am open to other thoughts, but this seems the most maintainable way to handle these parallel universes that both happen to fork from an AER error source.
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index 13b8586924ea..a9792b9576b4 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -50,6 +50,8 @@ struct aer_rpc { DECLARE_KFIFO(aer_fifo, struct aer_err_source, AER_ERROR_SOURCES_MAX); }; +static struct cxl_port_err_hndlrs cxl_port_hndlrs; + /* AER stats for the device */ struct aer_stats { @@ -1078,6 +1080,26 @@ static inline void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info) { } #endif +void register_cxl_port_hndlrs(struct cxl_port_err_hndlrs *_cxl_port_hndlrs) +{ + cxl_port_hndlrs.error_detected = _cxl_port_hndlrs->error_detected; + cxl_port_hndlrs.cor_error_detected = _cxl_port_hndlrs->cor_error_detected; +} +EXPORT_SYMBOL_NS_GPL(register_cxl_port_hndlrs, CXL); + +void unregister_cxl_port_hndlrs(void) +{ + cxl_port_hndlrs.error_detected = NULL; + cxl_port_hndlrs.cor_error_detected = NULL; +} +EXPORT_SYMBOL_NS_GPL(unregister_cxl_port_hndlrs, CXL); + +struct cxl_port_err_hndlrs *find_cxl_port_hndlrs(void) +{ + return &cxl_port_hndlrs; +} +EXPORT_SYMBOL_NS_GPL(find_cxl_port_hndlrs, CXL); + /** * pci_aer_handle_error - handle logging error into an event log * @dev: pointer to pci_dev data structure of error source device diff --git a/include/linux/aer.h b/include/linux/aer.h index 4b97f38f3fcf..67fd04c5ae2b 100644 --- a/include/linux/aer.h +++ b/include/linux/aer.h @@ -10,6 +10,7 @@ #include <linux/errno.h> #include <linux/types.h> +#include <linux/pci.h> #define AER_NONFATAL 0 #define AER_FATAL 1 @@ -55,5 +56,18 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity, int cper_severity_to_aer(int cper_severity); void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn, int severity, struct aer_capability_regs *aer_regs); + +struct cxl_port_err_hndlrs { + + /* CXL uncorrectable error detected on this device */ + pci_ers_result_t (*error_detected)(struct pci_dev *dev, + pci_channel_state_t error); + + /* CXL corrected error detected on this device */ + void (*cor_error_detected)(struct pci_dev *dev); +}; +void register_cxl_port_hndlrs(struct cxl_port_err_hndlrs *_cxl_port_hndlrs); +void unregister_cxl_port_hndlrs(void); +struct cxl_port_err_hndlrs *find_cxl_port_hndlrs(void); #endif //_AER_H_
CXL protocol errors are reported to the OS through PCIe correctable and uncorrectable internal errors. However, since CXL PCIe port devices are currently bound to the portdrv driver, there is no mechanism to notify the CXL driver, which is necessary for proper logging and handling. To address this, introduce CXL PCIe port error callbacks along with register/unregister and accessor functions. The callbacks will be invoked by the AER driver in the case protocol errors are reported by a CXL port device. The AER driver callbacks will be used in future patches implementing CXL PCIe port error handling. Signed-off-by: Terry Bowman <terry.bowman@amd.com> --- drivers/pci/pcie/aer.c | 22 ++++++++++++++++++++++ include/linux/aer.h | 14 ++++++++++++++ 2 files changed, 36 insertions(+)