Message ID | 20201211181915.GD25974@zn.tnic (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | EDAC instances probing | expand |
On Fri, Dec 11, 2020 at 07:19:15PM +0100, Borislav Petkov wrote: > Hi guys, > > so we converted a couple of EDAC drivers to per-CPU-family autoprobing > instead of the PCI device IDs one which needed constant adding of new > device IDs. > > However easy the new probing is, it spams dmesg on each CPU as it tries > loading on each CPU, when there's no ECC DIMMs or ECC is disabled. > Here's the output from a 128 CPU box: > > $ grep EDAC dmesg.log | sed 's/\[.*\] //' | sort | uniq -c > 128 EDAC amd64: F17h detected (node 0). > 128 EDAC amd64: Node 0: DRAM ECC disabled. > 1 EDAC MC: Ver: 3.0.0 > > that's 2 lines per CPU. > > Btw, people have complained about the spamming. > > So I tried something clumsy, see below, which fixes this into what it > should say: > > $ dmesg | grep EDAC > [ 2.693470] EDAC MC: Ver: 3.0.0 > [ 8.284461] EDAC amd64: F17h detected (node 0). > [ 8.287953] EDAC amd64: Node 0: DRAM ECC disabled. > [ 8.381430] EDAC amd64: F17h detected (node 1). > [ 8.384684] EDAC amd64: Node 1: DRAM ECC disabled. > [ 8.461902] EDAC amd64: F17h detected (node 2). > [ 8.461993] EDAC amd64: Node 2: DRAM ECC disabled. > [ 8.536907] EDAC amd64: F17h detected (node 3). > [ 8.538923] EDAC amd64: Node 3: DRAM ECC disabled. > [ 8.643213] EDAC amd64: F17h detected (node 4). > [ 8.645474] EDAC amd64: Node 4: DRAM ECC disabled. > [ 8.713411] EDAC amd64: F17h detected (node 5). > [ 8.714818] EDAC amd64: Node 5: DRAM ECC disabled. > [ 8.807825] EDAC amd64: F17h detected (node 6). > [ 8.809882] EDAC amd64: Node 6: DRAM ECC disabled. > [ 8.908043] EDAC amd64: F17h detected (node 7). > [ 8.910883] EDAC amd64: Node 7: DRAM ECC disabled. > > Once per driver instance, however each driver accounts an instance - > logical node, physical node, whatever. > > So it looks like this, do you guys think this is too ugly to live? > I think it's okay. But it could even be a single boolean rather than a bitmap. At least for amd64_edac_mod, the driver will probe all the Northbridge/Data Fabric instances even if some fail. I don't know if the same applies to other EDAC modules. Does this issue affect other modules? Also, would it make sense to go back to PCI device probing? We've still needed to add PCI IDs for almost every model group. Probing by PCI device should help us avoid this issue and also prevent some messages where PCI IDs aren't found for supported families. For example, we had that problem when Family 17h Models 30h-3Fh came out. The module would load because it recognized Family 17h, but it would fail because the new PCI IDs for Models 30h-3Fh were not recognized. Thanks, Yazen
On Fri, Dec 11, 2020 at 02:35:20PM -0600, Yazen Ghannam wrote: > I think it's okay. But it could even be a single boolean rather than a > bitmap. At least for amd64_edac_mod, the driver will probe all the > Northbridge/Data Fabric instances even if some fail. Ah, I see what you mean. amd64_edac does the all-or-nothing thing where when it fails probing an instance, it unwinds. Yeah, I guess we can make this a bool to say "we probed already". > I don't know if the same applies to other EDAC modules. I'll have to have a look. > Does this issue affect other modules? Those: $ git grep x86_match_cpu drivers/edac/ drivers/edac/amd64_edac.c:3662: if (!x86_match_cpu(amd64_cpuids)) drivers/edac/i10nm_base.c:281: id = x86_match_cpu(i10nm_cpuids); drivers/edac/pnd2_edac.c:1557: id = x86_match_cpu(pnd2_cpuids); drivers/edac/sb_edac.c:3513: id = x86_match_cpu(sbridge_cpuids); drivers/edac/skx_base.c:659: id = x86_match_cpu(skx_cpuids); > Also, would it make sense to go back to PCI device probing? We've still > needed to add PCI IDs for almost every model group. Probing by PCI > device should help us avoid this issue and also prevent some messages > where PCI IDs aren't found for supported families. For example, we had > that problem when Family 17h Models 30h-3Fh came out. The module would > load because it recognized Family 17h, but it would fail because the new > PCI IDs for Models 30h-3Fh were not recognized. Just like in the other mail I just sent you. So that's the thing - we converted to x86_match_cpu() because adding all those PCI IDs was tedious and using f/m/s would need a lot less enablement so that the drivers work out of the box most of the time. It's a tradeoff.
I'm soo stupid. Usually I do the simplest solution but this time it has escaped me. But I caught it now: --- diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c index 9868f95a5622..105d00b27be2 100644 --- a/drivers/edac/amd64_edac.c +++ b/drivers/edac/amd64_edac.c @@ -3528,8 +3528,7 @@ static bool ecc_enabled(struct amd64_pvt *pvt) MSR_IA32_MCG_CTL, nid); } - amd64_info("Node %d: DRAM ECC %s.\n", - nid, (ecc_en ? "enabled" : "disabled")); + edac_dbg(3, "Node %d: DRAM ECC %s.\n", nid, (ecc_en ? "enabled" : "disabled")); if (!ecc_en || !nb_mce_en) return false; @@ -3689,11 +3688,6 @@ static struct amd64_family_type *per_family_init(struct amd64_pvt *pvt) return NULL; } - amd64_info("%s %sdetected (node %d).\n", fam_type->ctl_name, - (pvt->fam == 0xf ? - (pvt->ext_model >= K8_REV_F ? "revF or later " - : "revE or earlier ") - : ""), pvt->mc_node_id); return fam_type; } @@ -3867,6 +3861,12 @@ static int probe_one_instance(unsigned int nid) dump_misc_regs(pvt); + amd64_info("%s %sdetected (node %d).\n", fam_type->ctl_name, + (pvt->fam == 0xf ? + (pvt->ext_model >= K8_REV_F ? "revF or later " + : "revE or earlier ") + : ""), pvt->mc_node_id); + return ret; err_enable:
On Wed, Jan 13, 2021 at 09:33:16PM +0100, Borislav Petkov wrote: > I'm soo stupid. Usually I do the simplest solution but this time it has > escaped me. > > But I caught it now: > > --- > diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c > index 9868f95a5622..105d00b27be2 100644 > --- a/drivers/edac/amd64_edac.c > +++ b/drivers/edac/amd64_edac.c > @@ -3528,8 +3528,7 @@ static bool ecc_enabled(struct amd64_pvt *pvt) > MSR_IA32_MCG_CTL, nid); > } > > - amd64_info("Node %d: DRAM ECC %s.\n", > - nid, (ecc_en ? "enabled" : "disabled")); > + edac_dbg(3, "Node %d: DRAM ECC %s.\n", nid, (ecc_en ? "enabled" : "disabled")); > > if (!ecc_en || !nb_mce_en) > return false; > @@ -3689,11 +3688,6 @@ static struct amd64_family_type *per_family_init(struct amd64_pvt *pvt) > return NULL; > } > > - amd64_info("%s %sdetected (node %d).\n", fam_type->ctl_name, > - (pvt->fam == 0xf ? > - (pvt->ext_model >= K8_REV_F ? "revF or later " > - : "revE or earlier ") > - : ""), pvt->mc_node_id); > return fam_type; > } > > @@ -3867,6 +3861,12 @@ static int probe_one_instance(unsigned int nid) > > dump_misc_regs(pvt); > > + amd64_info("%s %sdetected (node %d).\n", fam_type->ctl_name, > + (pvt->fam == 0xf ? > + (pvt->ext_model >= K8_REV_F ? "revF or later " > + : "revE or earlier ") > + : ""), pvt->mc_node_id); > + > return ret; > > err_enable: > > -- I like it. I tested it out on one of my server systems, and it seems to work well. Tested-by: Yazen Ghannam <yazen.ghannam@amd.com> Thanks, Yazen
diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c index f7087ddddb90..de37d0d9a27b 100644 --- a/drivers/edac/amd64_edac.c +++ b/drivers/edac/amd64_edac.c @@ -3581,6 +3581,7 @@ static int probe_one_instance(unsigned int nid) dump_misc_regs(pvt); + set_bit(nid, edac_get_probed_instances()); return ret; err_enable: @@ -3591,6 +3592,7 @@ static int probe_one_instance(unsigned int nid) kfree(s); ecc_stngs[nid] = NULL; + set_bit(nid, edac_get_probed_instances()); err_out: return ret; } @@ -3674,6 +3676,10 @@ static int __init amd64_edac_init(void) goto err_free; for (i = 0; i < amd_nb_num(); i++) { + + if (test_bit(i, edac_get_probed_instances())) + continue; + err = probe_one_instance(i); if (err) { /* unwind properly */ diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c index f6d462d0be2d..f97186237ccc 100644 --- a/drivers/edac/edac_mc.c +++ b/drivers/edac/edac_mc.c @@ -53,6 +53,15 @@ static LIST_HEAD(mc_devices); */ static const char *edac_mc_owner; +/* bitmap of already probed driver instances, 64 should be big enough. :-P */ +static DECLARE_BITMAP(probed_instances, 64); + +unsigned long *edac_get_probed_instances(void) +{ + return probed_instances; +} +EXPORT_SYMBOL_GPL(edac_get_probed_instances); + static struct mem_ctl_info *error_desc_to_mci(struct edac_raw_error_desc *e) { return container_of(e, struct mem_ctl_info, error_desc); diff --git a/drivers/edac/edac_mc.h b/drivers/edac/edac_mc.h index 881b00eadf7a..7c0d4ac7c35a 100644 --- a/drivers/edac/edac_mc.h +++ b/drivers/edac/edac_mc.h @@ -255,4 +255,6 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type, */ extern char *edac_op_state_to_string(int op_state); +unsigned long *edac_get_probed_instances(void); + #endif /* _EDAC_MC_H_ */