diff mbox series

EDAC instances probing

Message ID 20201211181915.GD25974@zn.tnic (mailing list archive)
State New, archived
Headers show
Series EDAC instances probing | expand

Commit Message

Borislav Petkov Dec. 11, 2020, 6:19 p.m. UTC
Hi guys,

so we converted a couple of EDAC drivers to per-CPU-family autoprobing
instead of the PCI device IDs one which needed constant adding of new
device IDs.

However easy the new probing is, it spams dmesg on each CPU as it tries
loading on each CPU, when there's no ECC DIMMs or ECC is disabled.
Here's the output from a 128 CPU box:

$ grep EDAC dmesg.log | sed 's/\[.*\] //' | sort | uniq -c
    128 EDAC amd64: F17h detected (node 0).
    128 EDAC amd64: Node 0: DRAM ECC disabled.
      1 EDAC MC: Ver: 3.0.0

that's 2 lines per CPU.

Btw, people have complained about the spamming.

So I tried something clumsy, see below, which fixes this into what it
should say:

$ dmesg | grep EDAC
[    2.693470] EDAC MC: Ver: 3.0.0
[    8.284461] EDAC amd64: F17h detected (node 0).
[    8.287953] EDAC amd64: Node 0: DRAM ECC disabled.
[    8.381430] EDAC amd64: F17h detected (node 1).
[    8.384684] EDAC amd64: Node 1: DRAM ECC disabled.
[    8.461902] EDAC amd64: F17h detected (node 2).
[    8.461993] EDAC amd64: Node 2: DRAM ECC disabled.
[    8.536907] EDAC amd64: F17h detected (node 3).
[    8.538923] EDAC amd64: Node 3: DRAM ECC disabled.
[    8.643213] EDAC amd64: F17h detected (node 4).
[    8.645474] EDAC amd64: Node 4: DRAM ECC disabled.
[    8.713411] EDAC amd64: F17h detected (node 5).
[    8.714818] EDAC amd64: Node 5: DRAM ECC disabled.
[    8.807825] EDAC amd64: F17h detected (node 6).
[    8.809882] EDAC amd64: Node 6: DRAM ECC disabled.
[    8.908043] EDAC amd64: F17h detected (node 7).
[    8.910883] EDAC amd64: Node 7: DRAM ECC disabled.

Once per driver instance, however each driver accounts an instance -
logical node, physical node, whatever.

So it looks like this, do you guys think this is too ugly to live?

---

Comments

Yazen Ghannam Dec. 11, 2020, 8:35 p.m. UTC | #1
On Fri, Dec 11, 2020 at 07:19:15PM +0100, Borislav Petkov wrote:
> Hi guys,
> 
> so we converted a couple of EDAC drivers to per-CPU-family autoprobing
> instead of the PCI device IDs one which needed constant adding of new
> device IDs.
> 
> However easy the new probing is, it spams dmesg on each CPU as it tries
> loading on each CPU, when there's no ECC DIMMs or ECC is disabled.
> Here's the output from a 128 CPU box:
> 
> $ grep EDAC dmesg.log | sed 's/\[.*\] //' | sort | uniq -c
>     128 EDAC amd64: F17h detected (node 0).
>     128 EDAC amd64: Node 0: DRAM ECC disabled.
>       1 EDAC MC: Ver: 3.0.0
> 
> that's 2 lines per CPU.
> 
> Btw, people have complained about the spamming.
> 
> So I tried something clumsy, see below, which fixes this into what it
> should say:
> 
> $ dmesg | grep EDAC
> [    2.693470] EDAC MC: Ver: 3.0.0
> [    8.284461] EDAC amd64: F17h detected (node 0).
> [    8.287953] EDAC amd64: Node 0: DRAM ECC disabled.
> [    8.381430] EDAC amd64: F17h detected (node 1).
> [    8.384684] EDAC amd64: Node 1: DRAM ECC disabled.
> [    8.461902] EDAC amd64: F17h detected (node 2).
> [    8.461993] EDAC amd64: Node 2: DRAM ECC disabled.
> [    8.536907] EDAC amd64: F17h detected (node 3).
> [    8.538923] EDAC amd64: Node 3: DRAM ECC disabled.
> [    8.643213] EDAC amd64: F17h detected (node 4).
> [    8.645474] EDAC amd64: Node 4: DRAM ECC disabled.
> [    8.713411] EDAC amd64: F17h detected (node 5).
> [    8.714818] EDAC amd64: Node 5: DRAM ECC disabled.
> [    8.807825] EDAC amd64: F17h detected (node 6).
> [    8.809882] EDAC amd64: Node 6: DRAM ECC disabled.
> [    8.908043] EDAC amd64: F17h detected (node 7).
> [    8.910883] EDAC amd64: Node 7: DRAM ECC disabled.
> 
> Once per driver instance, however each driver accounts an instance -
> logical node, physical node, whatever.
> 
> So it looks like this, do you guys think this is too ugly to live?
>

I think it's okay. But it could even be a single boolean rather than a
bitmap. At least for amd64_edac_mod, the driver will probe all the
Northbridge/Data Fabric instances even if some fail. I don't know if the
same applies to other EDAC modules. Does this issue affect other
modules?

Also, would it make sense to go back to PCI device probing? We've still
needed to add PCI IDs for almost every model group. Probing by PCI
device should help us avoid this issue and also prevent some messages
where PCI IDs aren't found for supported families. For example, we had
that problem when Family 17h Models 30h-3Fh came out. The module would
load because it recognized Family 17h, but it would fail because the new
PCI IDs for Models 30h-3Fh were not recognized.

Thanks,
Yazen
Borislav Petkov Dec. 11, 2020, 8:58 p.m. UTC | #2
On Fri, Dec 11, 2020 at 02:35:20PM -0600, Yazen Ghannam wrote:
> I think it's okay. But it could even be a single boolean rather than a
> bitmap. At least for amd64_edac_mod, the driver will probe all the
> Northbridge/Data Fabric instances even if some fail.

Ah, I see what you mean. amd64_edac does the all-or-nothing thing where
when it fails probing an instance, it unwinds. Yeah, I guess we can make
this a bool to say "we probed already".

> I don't know if the same applies to other EDAC modules.

I'll have to have a look.

> Does this issue affect other modules?

Those:

$ git grep x86_match_cpu drivers/edac/
drivers/edac/amd64_edac.c:3662: if (!x86_match_cpu(amd64_cpuids))
drivers/edac/i10nm_base.c:281:  id = x86_match_cpu(i10nm_cpuids);
drivers/edac/pnd2_edac.c:1557:  id = x86_match_cpu(pnd2_cpuids);
drivers/edac/sb_edac.c:3513:    id = x86_match_cpu(sbridge_cpuids);
drivers/edac/skx_base.c:659:    id = x86_match_cpu(skx_cpuids);

> Also, would it make sense to go back to PCI device probing? We've still
> needed to add PCI IDs for almost every model group. Probing by PCI
> device should help us avoid this issue and also prevent some messages
> where PCI IDs aren't found for supported families. For example, we had
> that problem when Family 17h Models 30h-3Fh came out. The module would
> load because it recognized Family 17h, but it would fail because the new
> PCI IDs for Models 30h-3Fh were not recognized.

Just like in the other mail I just sent you.

So that's the thing - we converted to x86_match_cpu() because adding
all those PCI IDs was tedious and using f/m/s would need a lot less
enablement so that the drivers work out of the box most of the time.

It's a tradeoff.
Borislav Petkov Jan. 13, 2021, 8:33 p.m. UTC | #3
I'm soo stupid. Usually I do the simplest solution but this time it has
escaped me.

But I caught it now:

---
diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 9868f95a5622..105d00b27be2 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -3528,8 +3528,7 @@ static bool ecc_enabled(struct amd64_pvt *pvt)
 				     MSR_IA32_MCG_CTL, nid);
 	}
 
-	amd64_info("Node %d: DRAM ECC %s.\n",
-		   nid, (ecc_en ? "enabled" : "disabled"));
+	edac_dbg(3, "Node %d: DRAM ECC %s.\n", nid, (ecc_en ? "enabled" : "disabled"));
 
 	if (!ecc_en || !nb_mce_en)
 		return false;
@@ -3689,11 +3688,6 @@ static struct amd64_family_type *per_family_init(struct amd64_pvt *pvt)
 		return NULL;
 	}
 
-	amd64_info("%s %sdetected (node %d).\n", fam_type->ctl_name,
-		     (pvt->fam == 0xf ?
-				(pvt->ext_model >= K8_REV_F  ? "revF or later "
-							     : "revE or earlier ")
-				 : ""), pvt->mc_node_id);
 	return fam_type;
 }
 
@@ -3867,6 +3861,12 @@ static int probe_one_instance(unsigned int nid)
 
 	dump_misc_regs(pvt);
 
+	amd64_info("%s %sdetected (node %d).\n", fam_type->ctl_name,
+		     (pvt->fam == 0xf ?
+				(pvt->ext_model >= K8_REV_F  ? "revF or later "
+							     : "revE or earlier ")
+				 : ""), pvt->mc_node_id);
+
 	return ret;
 
 err_enable:
Yazen Ghannam Jan. 23, 2021, 4:45 a.m. UTC | #4
On Wed, Jan 13, 2021 at 09:33:16PM +0100, Borislav Petkov wrote:
> I'm soo stupid. Usually I do the simplest solution but this time it has
> escaped me.
> 
> But I caught it now:
> 
> ---
> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
> index 9868f95a5622..105d00b27be2 100644
> --- a/drivers/edac/amd64_edac.c
> +++ b/drivers/edac/amd64_edac.c
> @@ -3528,8 +3528,7 @@ static bool ecc_enabled(struct amd64_pvt *pvt)
>  				     MSR_IA32_MCG_CTL, nid);
>  	}
>  
> -	amd64_info("Node %d: DRAM ECC %s.\n",
> -		   nid, (ecc_en ? "enabled" : "disabled"));
> +	edac_dbg(3, "Node %d: DRAM ECC %s.\n", nid, (ecc_en ? "enabled" : "disabled"));
>  
>  	if (!ecc_en || !nb_mce_en)
>  		return false;
> @@ -3689,11 +3688,6 @@ static struct amd64_family_type *per_family_init(struct amd64_pvt *pvt)
>  		return NULL;
>  	}
>  
> -	amd64_info("%s %sdetected (node %d).\n", fam_type->ctl_name,
> -		     (pvt->fam == 0xf ?
> -				(pvt->ext_model >= K8_REV_F  ? "revF or later "
> -							     : "revE or earlier ")
> -				 : ""), pvt->mc_node_id);
>  	return fam_type;
>  }
>  
> @@ -3867,6 +3861,12 @@ static int probe_one_instance(unsigned int nid)
>  
>  	dump_misc_regs(pvt);
>  
> +	amd64_info("%s %sdetected (node %d).\n", fam_type->ctl_name,
> +		     (pvt->fam == 0xf ?
> +				(pvt->ext_model >= K8_REV_F  ? "revF or later "
> +							     : "revE or earlier ")
> +				 : ""), pvt->mc_node_id);
> +
>  	return ret;
>  
>  err_enable:
> 
> --

I like it. I tested it out on one of my server systems, and it seems to
work well.

Tested-by: Yazen Ghannam <yazen.ghannam@amd.com>

Thanks,
Yazen
diff mbox series

Patch

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index f7087ddddb90..de37d0d9a27b 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -3581,6 +3581,7 @@  static int probe_one_instance(unsigned int nid)
 
 	dump_misc_regs(pvt);
 
+	set_bit(nid, edac_get_probed_instances());
 	return ret;
 
 err_enable:
@@ -3591,6 +3592,7 @@  static int probe_one_instance(unsigned int nid)
 	kfree(s);
 	ecc_stngs[nid] = NULL;
 
+	set_bit(nid, edac_get_probed_instances());
 err_out:
 	return ret;
 }
@@ -3674,6 +3676,10 @@  static int __init amd64_edac_init(void)
 		goto err_free;
 
 	for (i = 0; i < amd_nb_num(); i++) {
+
+		if (test_bit(i, edac_get_probed_instances()))
+			continue;
+
 		err = probe_one_instance(i);
 		if (err) {
 			/* unwind properly */
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index f6d462d0be2d..f97186237ccc 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -53,6 +53,15 @@  static LIST_HEAD(mc_devices);
  */
 static const char *edac_mc_owner;
 
+/* bitmap of already probed driver instances, 64 should be big enough. :-P */
+static DECLARE_BITMAP(probed_instances, 64);
+
+unsigned long *edac_get_probed_instances(void)
+{
+	return probed_instances;
+}
+EXPORT_SYMBOL_GPL(edac_get_probed_instances);
+
 static struct mem_ctl_info *error_desc_to_mci(struct edac_raw_error_desc *e)
 {
 	return container_of(e, struct mem_ctl_info, error_desc);
diff --git a/drivers/edac/edac_mc.h b/drivers/edac/edac_mc.h
index 881b00eadf7a..7c0d4ac7c35a 100644
--- a/drivers/edac/edac_mc.h
+++ b/drivers/edac/edac_mc.h
@@ -255,4 +255,6 @@  void edac_mc_handle_error(const enum hw_event_mc_err_type type,
  */
 extern char *edac_op_state_to_string(int op_state);
 
+unsigned long *edac_get_probed_instances(void);
+
 #endif				/* _EDAC_MC_H_ */