[1/3] x86/MCE/AMD: Provide an "Unknown" MCA bank type

Message ID: 20211203020017.728440-2-yazen.ghannam@amd.com
State: New, archived
Series: AMD SMCA Updates

Commit Message

Yazen Ghannam Dec. 3, 2021, 2 a.m. UTC
The AMD MCA Thresholding sysfs interface populates directories for each
bank and thresholding block. The name used for each directory is looked
up in a table of known bank types. However, new bank types won't match
in this list and will return NULL for the name. This will cause the
machinecheck sysfs interface to fail to be populated.

Set new and unknown MCA bank types to the "unknown" type. Also,
ensure that the bank's thresholding block directories have unique names.
This will ensure that the machinecheck sysfs interface can be
initialized.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
 arch/x86/include/asm/mce.h    |  1 +
 arch/x86/kernel/cpu/mce/amd.c | 34 ++++++++++++++++++++++++++++------
 drivers/edac/mce_amd.c        |  3 +++
 3 files changed, 32 insertions(+), 6 deletions(-)
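
The fallback described in the commit message can be sketched in user-space C as below. This is only an illustration, not the kernel code: the trimmed enum, the `hwid_table` values, and the helper names `classify_bank()` and `bank_dir_name()` are all hypothetical. The patch itself appends a catch-all entry to the end of the hardware ID table, whereas this sketch gets the same effect with a default return value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down bank types; the kernel enum has many more. */
enum bank_type { BANK_LS, BANK_IF, BANK_UNKNOWN, N_BANK_TYPES };

/* Directory names indexed by bank type, with a catch-all final entry. */
static const char *const bank_names[N_BANK_TYPES] = {
	[BANK_LS]      = "load_store",
	[BANK_IF]      = "insn_fetch",
	[BANK_UNKNOWN] = "unknown",
};

struct hwid_entry {
	enum bank_type type;
	uint32_t hwid_mcatype;	/* illustrative values, not real hardware IDs */
};

static const struct hwid_entry hwid_table[] = {
	{ BANK_LS, 0x00b00000u },
	{ BANK_IF, 0x00b00001u },
};

/* Map a hardware ID to a bank type, defaulting to BANK_UNKNOWN. */
static enum bank_type classify_bank(uint32_t hwid_mcatype)
{
	size_t i;

	for (i = 0; i < sizeof(hwid_table) / sizeof(hwid_table[0]); i++) {
		if (hwid_table[i].hwid_mcatype == hwid_mcatype)
			return hwid_table[i].type;
	}
	return BANK_UNKNOWN;
}

/* Never returns NULL, so a caller can always create the directory. */
static const char *bank_dir_name(uint32_t hwid_mcatype)
{
	enum bank_type t = classify_bank(hwid_mcatype);

	assert(t < N_BANK_TYPES);
	return bank_names[t];
}
```

With this shape, an ID that is missing from the table resolves to the "unknown" name instead of NULL, which is the property the commit message relies on to keep the sysfs setup from failing.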

Comments

Borislav Petkov Dec. 3, 2021, 10:17 p.m. UTC | #1
On Fri, Dec 03, 2021 at 02:00:15AM +0000, Yazen Ghannam wrote:
> The AMD MCA Thresholding sysfs interface populates directories for each
> bank and thresholding block. The name used for each directory is looked
> up in a table of known bank types. However, new bank types won't match
> in this list and will return NULL for the name. This will cause the
> machinecheck sysfs interface to fail to be populated.
> 
> Set new and unknown MCA bank types to the "unknown" type. Also,
> ensure that the bank's thresholding block directories have unique names.
> This will ensure that the machinecheck sysfs interface can be
> initialized.

What is the advantage of having a sysfs directory structure headed with
an "unknown" entry vs not having that structure at all when the kernel
runs on a machine for which it has not been enabled yet?

IOW, if those new banks would need additional enablement, what's the
point of having "unknown" on older kernels which do not have any
functionality?

IOW, how does this:

/sys/devices/system/machinecheck/machinecheck0/unknown/unknown/
├── error_count
├── interrupt_enable
└── threshold_limit

help a user?

Btw, looking at the current layout:

...
├── insn_fetch
│   └── insn_fetch
│       ├── error_count
│       ├── interrupt_enable
│       └── threshold_limit
├── l2_cache
│   └── l2_cache
│       ├── error_count
│       ├── interrupt_enable
│       └── threshold_limit
...

we have those names repeated which looks wonky and useless too. I'd
expect them to be:

...
├── insn_fetch
│   ├── error_count
│   ├── interrupt_enable
│   └── threshold_limit
├── l2_cache
│   ├── error_count
│   ├── interrupt_enable
│   └── threshold_limit
...

Can we fix that too pls?

Thx.
Yazen Ghannam Dec. 7, 2021, 4:28 p.m. UTC | #2
On Fri, Dec 03, 2021 at 11:17:45PM +0100, Borislav Petkov wrote:
> On Fri, Dec 03, 2021 at 02:00:15AM +0000, Yazen Ghannam wrote:
> > The AMD MCA Thresholding sysfs interface populates directories for each
> > bank and thresholding block. The name used for each directory is looked
> > up in a table of known bank types. However, new bank types won't match
> > in this list and will return NULL for the name. This will cause the
> > machinecheck sysfs interface to fail to be populated.
> > 
> > Set new and unknown MCA bank types to the "unknown" type. Also,
> > ensure that the bank's thresholding block directories have unique names.
> > This will ensure that the machinecheck sysfs interface can be
> > initialized.
> 
> What is the advantage of having a sysfs directory structure headed with
> an "unknown" entry vs not having that structure at all when the kernel
> runs on a machine for which it has not been enabled yet?
> 
> IOW, if those new banks would need additional enablement, what's the
> point of having "unknown" on older kernels which do not have any
> functionality?
> 
> IOW, how does this:
> 
> /sys/devices/system/machinecheck/machinecheck0/unknown/unknown/
> ├── error_count
> ├── interrupt_enable
> └── threshold_limit
> 
> help a user?

Yeah, I see your point.

> 
> Btw, looking at the current layout:
> 
> ...
> ├── insn_fetch
> │   └── insn_fetch
> │       ├── error_count
> │       ├── interrupt_enable
> │       └── threshold_limit
> ├── l2_cache
> │   └── l2_cache
> │       ├── error_count
> │       ├── interrupt_enable
> │       └── threshold_limit
> ...
> 
> we have those names repeated which looks wonky and useless too. I'd
> expect them to be:
> 
> ...
> ├── insn_fetch
> │   ├── error_count
> │   ├── interrupt_enable
> │   └── threshold_limit
> ├── l2_cache
> │   ├── error_count
> │   ├── interrupt_enable
> │   └── threshold_limit
> ...
> 
> Can we fix that too pls?
> 

Sure thing. But I don't think removing the second directory will be okay. The
layout is "bank"/"block". If the "block" has a special use, like DRAM ECC or L3
Cache on older systems, then it'll have a unique name. Otherwise, the block
will take the name of the bank.

I think the more robust solution is to drop the unique names and use generic
names like "bank"/"block". A new file called "type" can be introduced into the
directory structure, and this can return the name of the bank/block. New bank
types will return "<null>" for the "type", but the directory structure should
remain the same and functional.

I've seen this in other sysfs interfaces like cpuidle,
e.g. /sys/devices/system/cpu/cpu0/cpuidle/stateX

The "blockX/type" file is like the "stateX/desc" file. Or the "type" file can
be called "desc", since it's a description of what the bank or block
represents.

Here are a couple of examples:

/sys/devices/system/machinecheck/machinecheck0/
├── th_bank0
│   ├── type ("Instruction Fetch")
│   └── th_block0
│       ├── type ("All Errors")
│       ├── error_count
│       ├── interrupt_enable
│       └── threshold_limit
├── th_bank1
│   ├── type ("Northbridge")
│   ├── th_block0
│   │   ├── type ("DRAM Errors")
│   │   ├── error_count
│   │   ├── interrupt_enable
│   │   └── threshold_limit
│   └── th_block1
│       ├── type ("Link Errors")
│       ├── error_count
│       ├── interrupt_enable
│       └── threshold_limit
...

OR

/sys/devices/system/machinecheck/machinecheck0/thresholding
├── bank0
│   ├── desc ("Instruction Fetch")
│   └── block0
│       ├── desc ("All Errors")
│       ├── error_count
│       ├── interrupt_enable
│       └── threshold_limit
├── bank1
│   ├── desc ("Northbridge")
│   ├── block0
│   │   ├── desc ("DRAM Errors")
│   │   ├── error_count
│   │   ├── interrupt_enable
│   │   └── threshold_limit
│   └── block1
│       ├── desc ("Link Errors")
│       ├── error_count
│       ├── interrupt_enable
│       └── threshold_limit
...

I'm inclined to the second option, since it keeps all the thresholding
functionality under a single directory.

What do you think?

Thanks,
Yazen
Borislav Petkov Dec. 11, 2021, 3:39 p.m. UTC | #3
On Tue, Dec 07, 2021 at 04:28:42PM +0000, Yazen Ghannam wrote:
> Sure thing. But I don't think removing the second directory will be okay. The
> layout is "bank"/"block". If the "block" has a special use, like DRAM ECC or L3
> Cache on older systems, then it'll have a unique name. Otherwise, the block
> will take the name of the bank.

Ah, there was something... and I found a good example on my zen1 box:

├── umc_0
│   ├── dram_ecc
│   │   ├── error_count
│   │   ├── interrupt_enable
│   │   └── threshold_limit
│   └── misc_umc
│       ├── error_count
│       ├── interrupt_enable
│       └── threshold_limit

but yeah, that still doesn't make it clear what the hierarchy is...

> /sys/devices/system/machinecheck/machinecheck0/thresholding
> ├── bank0
> │   ├── desc ("Instruction Fetch")
> │   └── block0
> │       ├── desc ("All Errors")
> │       ├── error_count
> │       ├── interrupt_enable
> │       └── threshold_limit
> ├── bank1
> │   ├── desc ("Northbridge")
> │   ├── block0
> │   │   ├── desc ("DRAM Errors")
> │   │   ├── error_count
> │   │   ├── interrupt_enable
> │   │   └── threshold_limit
> │   └── block1
> │       ├── desc ("Link Errors")
> │       ├── error_count
> │       ├── interrupt_enable
> │       └── threshold_limit
> ...
> 
> I'm inclined to the second option, since it keeps all the thresholding
> functionality under a single directory.

Yeah, that makes it explicit and one can see that a bank can have
multiple blocks.

Renaming will change the ABI but we can always do symlinks later if
people complain. Which I doubt because I've yet to hear of someone using
that thresholding thing at all...

Thx.
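
The generic "thresholding/bankX/blockY" layout agreed on in this thread can be sketched in user-space C as below. Nothing here is from a posted patch: the helper names `block_attr_path()` and `bank_desc_show()` and the `bank_desc` table are hypothetical, and the explicit "<null>" string only mimics what the thread says an unrecognized type would report.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical description table indexed by bank number. A NULL slot
 * stands for a bank whose type this kernel does not recognize.
 */
static const char *const bank_desc[] = {
	"Instruction Fetch",
	"Northbridge",
	NULL,
};

/*
 * Build the path of one attribute file under the proposed layout.
 * Returns the snprintf() result, i.e. the untruncated length.
 */
static int block_attr_path(char *buf, size_t len, unsigned int cpu,
			   unsigned int bank, unsigned int block,
			   const char *attr)
{
	assert(buf && attr);
	return snprintf(buf, len,
			"/sys/devices/system/machinecheck/machinecheck%u"
			"/thresholding/bank%u/block%u/%s",
			cpu, bank, block, attr);
}

/* Contents of a bank's "desc" file; unknown banks still get a string. */
static const char *bank_desc_show(unsigned int bank)
{
	size_t n = sizeof(bank_desc) / sizeof(bank_desc[0]);

	if (bank >= n || !bank_desc[bank])
		return "<null>";
	return bank_desc[bank];
}
```

The point of the design is that the path depends only on the bank and block numbers, so the directories exist even when the description lookup finds nothing. For example, CPU 0, bank 1, block 0 and attribute "error_count" give /sys/devices/system/machinecheck/machinecheck0/thresholding/bank1/block0/error_count.
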
Patch

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index d58f4f2e006f..7c1c35909946 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -319,6 +319,7 @@  enum smca_bank_types {
 	SMCA_XGMI_PCS,	/* xGMI PCS Unit */
 	SMCA_XGMI_PHY,	/* xGMI PHY Unit */
 	SMCA_WAFL_PHY,	/* WAFL PHY Unit */
+	SMCA_UNKNOWN,	/* Unknown type */
 	N_SMCA_BANK_TYPES
 };
 
diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 2f351838d5f7..b9a5a94914a9 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -100,6 +100,7 @@  static struct smca_bank_name smca_names[] = {
 	[SMCA_XGMI_PCS]			= { "xgmi_pcs",		"Ext Global Memory Interconnect PCS Unit" },
 	[SMCA_XGMI_PHY]			= { "xgmi_phy",		"Ext Global Memory Interconnect PHY Unit" },
 	[SMCA_WAFL_PHY]			= { "wafl_phy",		"WAFL PHY Unit" },
+	[SMCA_UNKNOWN]			= { "unknown",		"Unrecognized Bank Type" },
 };
 
 static const char *smca_get_name(enum smca_bank_types t)
@@ -189,6 +190,9 @@  static struct smca_hwid smca_hwid_mcatypes[] = {
 
 	/* WAFL PHY MCA type */
 	{ SMCA_WAFL_PHY, HWID_MCATYPE(0x267, 0x0)	},
+
+	/* Unknown type - this must be last in the list */
+	{ SMCA_UNKNOWN,  HWID_MCATYPE(0xFFF, 0xFFFF)	},
 };
 
 struct smca_bank smca_banks[MAX_NR_BANKS];
@@ -300,7 +304,9 @@  static void smca_configure(unsigned int bank, unsigned int cpu)
 
 	for (i = 0; i < ARRAY_SIZE(smca_hwid_mcatypes); i++) {
 		s_hwid = &smca_hwid_mcatypes[i];
-		if (hwid_mcatype == s_hwid->hwid_mcatype) {
+
+		if (hwid_mcatype == s_hwid->hwid_mcatype ||
+		    s_hwid->bank_type == SMCA_UNKNOWN) {
 			smca_banks[bank].hwid = s_hwid;
 			smca_banks[bank].id = low;
 			smca_banks[bank].sysfs_id = s_hwid->count++;
@@ -1032,12 +1038,28 @@  static const char *get_name(unsigned int bank, struct threshold_block *b)
 		return NULL;
 	}
 
-	if (smca_banks[bank].hwid->count == 1)
-		return smca_get_name(bank_type);
+	if (smca_banks[bank].hwid->count == 1) {
+		if (bank_type == SMCA_UNKNOWN) {
+			snprintf(buf_mcatype, MAX_MCATYPE_NAME_LEN,
+				 "%s_%x", smca_get_name(bank_type),
+					  smca_banks[bank].id);
+
+			return buf_mcatype;
+		} else {
+			return smca_get_name(bank_type);
+		}
+	}
+
+	if (b && bank_type == SMCA_UNKNOWN) {
+		snprintf(buf_mcatype, MAX_MCATYPE_NAME_LEN,
+			 "%s_%x_block_%u", smca_get_name(bank_type),
+			 smca_banks[bank].id, b->block);
+	} else {
+		snprintf(buf_mcatype, MAX_MCATYPE_NAME_LEN,
+			 "%s_%u", smca_get_name(bank_type),
+				  smca_banks[bank].sysfs_id);
+	}
 
-	snprintf(buf_mcatype, MAX_MCATYPE_NAME_LEN,
-		 "%s_%x", smca_get_name(bank_type),
-			  smca_banks[bank].sysfs_id);
 	return buf_mcatype;
 }
 
diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index 67dbf4c31271..5ccc09db0a51 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -1068,6 +1068,9 @@  static void decode_smca_error(struct mce *m)
 
 	pr_emerg(HW_ERR "%s Ext. Error Code: %d", ip_name, xec);
 
+	if (bank_type == SMCA_UNKNOWN)
+		return;
+
 	/* Only print the decode of valid error codes */
 	if (xec < smca_mce_descs[bank_type].num_descs)
 		pr_cont(", %s.\n", smca_mce_descs[bank_type].descs[xec]);
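
The naming rules that the get_name() hunk above introduces can be exercised outside the kernel with a sketch like the following. It is an approximation under stated assumptions: `struct bank_info`, `make_name()` and `NAME_LEN` are hypothetical stand-ins for the kernel's per-bank data, get_name() and MAX_MCATYPE_NAME_LEN, and a negative `block` argument stands in for the kernel's NULL block pointer.

```c
#include <stdio.h>
#include <string.h>

#define NAME_LEN 32	/* hypothetical; the kernel uses MAX_MCATYPE_NAME_LEN */

/* Hypothetical flattened view of the per-bank data get_name() consults. */
struct bank_info {
	const char *type_name;		/* e.g. "umc" or "unknown" */
	int is_unknown;			/* nonzero for the catch-all type */
	unsigned int id;		/* hardware instance ID */
	unsigned int sysfs_id;		/* running index within the type */
	unsigned int type_count;	/* number of banks of this type */
};

/*
 * Mirror the branches of the patched get_name():
 *  - a lone bank of a known type keeps the plain type name,
 *  - a lone unknown bank gets "<name>_<id in hex>",
 *  - an unknown bank with a block gets "<name>_<id>_block_<n>",
 *  - everything else gets "<name>_<sysfs_id>".
 * A negative block means that no block was supplied.
 */
static const char *make_name(char *buf, size_t len,
			     const struct bank_info *b, int block)
{
	if (b->type_count == 1) {
		if (!b->is_unknown)
			return b->type_name;
		snprintf(buf, len, "%s_%x", b->type_name, b->id);
		return buf;
	}

	if (block >= 0 && b->is_unknown)
		snprintf(buf, len, "%s_%x_block_%u", b->type_name,
			 b->id, (unsigned int)block);
	else
		snprintf(buf, len, "%s_%u", b->type_name, b->sysfs_id);

	return buf;
}
```

For instance, an unknown bank with id 0x1a that is one of several unknown banks yields "unknown_1a_block_2" for block 2, while the second of two "umc" banks yields "umc_1". Including the block number is what keeps two blocks of the same unknown bank from colliding on one directory name.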