diff mbox series

[1/1] EDAC/igen6: Fix the issue of no error events

Message ID 20230725080427.23883-1-qiuxu.zhuo@intel.com (mailing list archive)
State New, archived
Headers show
Series [1/1] EDAC/igen6: Fix the issue of no error events | expand

Commit Message

Qiuxu Zhuo July 25, 2023, 8:04 a.m. UTC
Current igen6_edac checks for pending errors before the registration
of the error handler. However, there is a possibility that the error
occurs during the registration process, leading to unhandled pending
errors and no future error events. This issue can be reproduced by
repeatedly injecting errors during the loading of the igen6_edac.

Fix this issue by moving the pending error handler after the registration
of the error handler, ensuring that no pending errors are left unhandled.

Fixes: 10590a9d4f23 ("EDAC/igen6: Add EDAC driver for Intel client SoCs using IBECC")
Reported-by: Ee Wey Lim <ee.wey.lim@intel.com>
Tested-by: Ee Wey Lim <ee.wey.lim@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
---
 drivers/edac/igen6_edac.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Comments

Tony Luck July 25, 2023, 3:51 p.m. UTC | #1
> Fix this issue by moving the pending error handler after the registration
> of the error handler, ensuring that no pending errors are left unhandled.

Do you think drivers/edac/e7xxx_edac.c has the same issue?

491
492         /* clear any pending errors, or initial state bits */
493         e7xxx_get_error_info(mci, &discard);
494
495         /* Here we assume that we will never see multiple instances of this
496          * type of memory controller.  The ID is therefore hardcoded to 0.
497          */
498         if (edac_mc_add_mc(mci)) {
499                 edac_dbg(3, "failed edac_mc_add_mc()\n");
500                 goto fail1;
501         }
502
503         /* allocating generic PCI control info */
504         e7xxx_pci = edac_pci_create_generic_ctl(&pdev->dev, EDAC_MOD_STR);

Though it might be hard to find such an old system to test.

-Tony
Qiuxu Zhuo July 26, 2023, 2:31 a.m. UTC | #2
> From: Luck, Tony <tony.luck@intel.com>
> ...
> Subject: RE: [PATCH 1/1] EDAC/igen6: Fix the issue of no error events
> 
> > Fix this issue by moving the pending error handler after the
> > registration of the error handler, ensuring that no pending errors are left
> unhandled.
> 
> Do you think drivers/edac/e7xxx_edac.c has the same issue?

Hi Tony,

Based on the code [1], the e7xxx_edac works in polling mode and the pending 
errors can be handled within one period after the error handler registration.

So, I don't think the e7xxx_edac has the same issue.

[1] e7xxx_probe1()-> mci->edac_check = e7xxx_check;
                                  -> edac_mc_add_mc() -> if (mci->edac_check) { mci->op_state = OP_RUNNING_POLL; ... }

> 491
> 492         /* clear any pending errors, or initial state bits */
> 493         e7xxx_get_error_info(mci, &discard);

                   This function is also invoked periodically in polling mode:
                   mci->edac_check() -> e7xxx_check() -> e7xxx_get_error_info()

> 494
> 495         /* Here we assume that we will never see multiple instances of this
> 496          * type of memory controller.  The ID is therefore hardcoded to 0.
> 497          */
> 498         if (edac_mc_add_mc(mci)) {
> 499                 edac_dbg(3, "failed edac_mc_add_mc()\n");
> 500                 goto fail1;
> 501         }
> 502
> 503         /* allocating generic PCI control info */
> 504         e7xxx_pci = edac_pci_create_generic_ctl(&pdev->dev,
> EDAC_MOD_STR);
> 
> Though it might be hard to find such an old system to test.
> 
> -Tony
Tony Luck Aug. 2, 2023, 8:18 p.m. UTC | #3
> Current igen6_edac checks for pending errors before the registration
> of the error handler. However, there is a possibility that the error
> occurs during the registration process, leading to unhandled pending
> errors and no future error events. This issue can be reproduced by
> repeatedly injecting errors during the loading of the igen6_edac.
>
> Fix this issue by moving the pending error handler after the registration
> of the error handler, ensuring that no pending errors are left unhandled.
>
> Fixes: 10590a9d4f23 ("EDAC/igen6: Add EDAC driver for Intel client SoCs using IBECC")
> Reported-by: Ee Wey Lim <ee.wey.lim@intel.com>
> Tested-by: Ee Wey Lim <ee.wey.lim@intel.com>
> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>

Applied to RAS tree for next merge window.

Thanks

-Tony
diff mbox series

Patch

diff --git a/drivers/edac/igen6_edac.c b/drivers/edac/igen6_edac.c
index 544dd19072ea..1a18693294db 100644
--- a/drivers/edac/igen6_edac.c
+++ b/drivers/edac/igen6_edac.c
@@ -27,7 +27,7 @@ 
 #include "edac_mc.h"
 #include "edac_module.h"
 
-#define IGEN6_REVISION	"v2.5"
+#define IGEN6_REVISION	"v2.5.1"
 
 #define EDAC_MOD_STR	"igen6_edac"
 #define IGEN6_NMI_NAME	"igen6_ibecc"
@@ -1216,9 +1216,6 @@  static int igen6_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	INIT_WORK(&ecclog_work, ecclog_work_cb);
 	init_irq_work(&ecclog_irq_work, ecclog_irq_work_cb);
 
-	/* Check if any pending errors before registering the NMI handler */
-	ecclog_handler();
-
 	rc = register_err_handler();
 	if (rc)
 		goto fail3;
@@ -1230,6 +1227,9 @@  static int igen6_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 		goto fail4;
 	}
 
+	/* Check if any pending errors before/during the registration of the error handler */
+	ecclog_handler();
+
 	igen6_debug_setup();
 	return 0;
 fail4: