diff mbox series

x86/mce: Cover grading of AMD machine error checks

Message ID 20220309174107.6113-1-carlos.bilbao@amd.com (mailing list archive)
State New, archived
Headers show
Series x86/mce: Cover grading of AMD machine error checks | expand

Commit Message

Carlos Bilbao March 9, 2022, 5:41 p.m. UTC
AMD's severity grading covers very few machine errors. In the graded cases
there are no user-readable messages, complicating debugging of critical
hardware errors.

Fix the above issues extending the current grading logic for AMD with cases
not previously considered and their corresponding messages.

Signed-off-by: Carlos Bilbao <carlos.bilbao@amd.com>
---
 arch/x86/include/asm/mce.h         |   6 +
 arch/x86/kernel/cpu/mce/severity.c | 232 +++++++++++++++++++++++++----
 2 files changed, 205 insertions(+), 33 deletions(-)


base-commit: 7f1b8e0d6360178e3527d4f14e6921c254a86035

Comments

Borislav Petkov March 9, 2022, 6:37 p.m. UTC | #1
Definitely a step in the right direction.

Now...

On Wed, Mar 09, 2022 at 11:41:07AM -0600, Carlos Bilbao wrote:
> AMD's severity grading covers very few machine errors. In the graded cases
> there are no user-readable messages, complicating debugging of critical
> hardware errors.

That's too generic. What is the actual use case here you're spending all
this time for?

> Fix the above issues extending the current grading logic for AMD with cases
> not previously considered and their corresponding messages.
> 
> Signed-off-by: Carlos Bilbao <carlos.bilbao@amd.com>
> ---
>  arch/x86/include/asm/mce.h         |   6 +
>  arch/x86/kernel/cpu/mce/severity.c | 232 +++++++++++++++++++++++++----
>  2 files changed, 205 insertions(+), 33 deletions(-)

Now, looking at the whole thing, AFAICT all you're interested in is
getting some strings out from those error types. But but, we already
have something like that. That's even mentioned in the patch:

> +	 * Default return values. The poll handler catches these and passes
> +	 * responsibility of decoding them to EDAC

So there's a big fat module mce_amd.c which does convert MCEs to
strings. So why can't that be used and extended instead of adding more
strings to more places in the kernel?

Thx.
Carlos Bilbao March 10, 2022, 6:24 p.m. UTC | #2
On 3/9/2022 12:37 PM, Borislav Petkov wrote:
> Definitely a step in the right direction.
>
> Now...
>
> On Wed, Mar 09, 2022 at 11:41:07AM -0600, Carlos Bilbao wrote:
>> AMD's severity grading covers very few machine errors. In the graded cases
>> there are no user-readable messages, complicating debugging of critical
>> hardware errors.
>
> That's too generic. What is the actual use case here you're spending all
> this time for?
>

We will cover grading of MCEs like deferred memory scrub errors, attempts 
to access poisonous data, etc. I could list all new covered cases in the 
commit message if you think that'd be positive.

>> Fix the above issues extending the current grading logic for AMD with cases
>> not previously considered and their corresponding messages.
>>
>> Signed-off-by: Carlos Bilbao <carlos.bilbao@amd.com>
>> ---
>>  arch/x86/include/asm/mce.h         |   6 +
>>  arch/x86/kernel/cpu/mce/severity.c | 232 +++++++++++++++++++++++++----
>>  2 files changed, 205 insertions(+), 33 deletions(-)
>
> Now, looking at the whole thing, AFAICT all you're interested in is
> getting some strings out from those error types. But but, we already
> have something like that. That's even mentioned in the patch:
>
>> +	 * Default return values. The poll handler catches these and passes
>> +	 * responsibility of decoding them to EDAC
>
> So there's a big fat module mce_amd.c which does convert MCEs to
> strings. So why can't that be used and extended instead of adding more
> strings to more places in the kernel?
>
> Thx.
>

The severity grading logic has a double functionality: (1) Grade the 
machine errors and (2) Assign them a severity message. The first task is
arguably more important because depending on the severity we assign, the
MCE handler do_machine_check() will do things differently. 

The part of this patch that grades new errors is needed for (1). But -as 
you mention- why do we need these new strings if we have the EDAC driver?

If we grade the error as a PANIC, the handler will not notify EDAC, and the
kernel will use our new severity messages when displaying the error with
mce_panic().

If the machine error is not graded as PANIC, the notification chain will
come into play, and EDAC will eventually decode the machine errors without 
using the severity message. Hence, you have a good point, these new strings
will be unnecessary -We should send a new version of the patch that only 
includes msgs for PANIC cases. We probably won't need helper functions by
then. I'll also take some time to think of any other error cases we might
not be covering yet.

Hope that helps clarify,
Carlos
Borislav Petkov March 10, 2022, 10:13 p.m. UTC | #3
On Thu, Mar 10, 2022 at 12:24:08PM -0600, Carlos Bilbao wrote:
> We will cover grading of MCEs like deferred memory scrub errors, attempts 
> to access poisonous data, etc. I could list all new covered cases in the 
> commit message if you think that'd be positive.

So no actual use case - you want to grade error severity for all types
of MCEs.

> Hope that helps clarify,

Yes, it does a bit.

It sounds to me like you want to do at least two patches:

1. Extend the severity grading function with the new types of errors

2. Add string descriptions of the error types mce_severity_amd() looks
at, so that mce_panic() issues them.

I.e., you want to decode the fatal MCEs which panic the machine.

In general, what would help is if you think about what you're trying to
achieve and write it down first. How to achieve that we can figure out
later.

What happens now is you send me a patch and I'm trying to decipher from
the code why you're doing what you're doing. Which is kinda backwards if
you think about it...
Carlos Bilbao March 11, 2022, 2:07 p.m. UTC | #4
On 3/10/2022 4:13 PM, Borislav Petkov wrote:
> On Thu, Mar 10, 2022 at 12:24:08PM -0600, Carlos Bilbao wrote:
>> We will cover grading of MCEs like deferred memory scrub errors, attempts 
>> to access poisonous data, etc. I could list all new covered cases in the 
>> commit message if you think that'd be positive.
> 
> So no actual use case - you want to grade error severity for all types
> of MCEs.
> 
>> Hope that helps clarify,
> 
> Yes, it does a bit.
> 
> It sounds to me like you want to do at least two patches:
> 
> 1. Extend the severity grading function with the new types of errors
> 
> 2. Add string descriptions of the error types mce_severity_amd() looks
> at, so that mce_panic() issues them.
> 
> I.e., you want to decode the fatal MCEs which panic the machine.
> 
> In general, what would help is if you think about what you're trying to
> achieve and write it down first. How to achieve that we can figure out
> later.
> 
> What happens now is you send me a patch and I'm trying to decipher from
> the code why you're doing what you're doing. Which is kinda backwards if
> you think about it...
> 

Glad we are on the same page now. I will prepare a pachset and include more
informative commit messages.

Thanks,
Carlos
diff mbox series

Patch

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index cc73061e7255..6b1ef40f8580 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -50,6 +50,12 @@ 
 #define MCI_STATUS_POISON	BIT_ULL(43)  /* access poisonous data */
 #define MCI_STATUS_SCRUB	BIT_ULL(40)  /* Error detected during scrub operation */
 
+/* AMD Error codes from PPR(s) section 3.1 Machine Check Architecture */
+#define ERRORCODE_T_MSK GENMASK(3, 2) /* Mask for transaction type bits */
+#define ERRORCODE_M_MSK GENMASK(7, 4) /* Mask for memory transaction type */
+#define ERRORCODE_T_DATA  0x4  /* Transaction type of error is Data */
+#define ERRORCODE_M_FETCH 0x50 /* Memory transaction type of error is Instruction Fetch */
+
 /*
  * McaX field if set indicates a given bank supports MCA extensions:
  *  - Deferred error interrupt type is specifiable by bank.
diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
index 1add86935349..764c6caf4bfe 100644
--- a/arch/x86/kernel/cpu/mce/severity.c
+++ b/arch/x86/kernel/cpu/mce/severity.c
@@ -327,59 +327,225 @@  static __always_inline int mce_severity_amd_smca(struct mce *m, enum context err
 }
 
 /*
- * See AMD Error Scope Hierarchy table in a newer BKDG. For example
- * 49125_15h_Models_30h-3Fh_BKDG.pdf, section "RAS Features"
+ * Evaluate the severity of a data load error for AMD systems, depending
+ * on the context in which the MCE happened.
  */
-static noinstr int mce_severity_amd(struct mce *m, struct pt_regs *regs, char **msg, bool is_excp)
+static inline int mce_grade_data_amd(enum context ctx, char **severity_msg)
 {
-	enum context ctx = error_context(m, regs);
+	WARN_ON(!severity_msg);
+
+	switch (ctx) {
+	case IN_USER:
+		*severity_msg = "Action required: data load error in user process";
+		return MCE_AR_SEVERITY;
+	case IN_KERNEL_RECOV:
+		*severity_msg = "Action required: data load in kernel recoverable area";
+		return MCE_AR_SEVERITY;
+	case IN_KERNEL:
+		*severity_msg = "Data load in unrecoverable area of kernel";
+		return MCE_PANIC_SEVERITY;
+	default:
+		*severity_msg = "Data load in unknown context";
+		return MCE_PANIC_SEVERITY;
+	}
+}
 
-	/* Processor Context Corrupt, no need to fumble too much, die! */
-	if (m->status & MCI_STATUS_PCC)
+/*
+ * Evaluate the severity of an instruction fetch error for AMD systems,
+ * depending on the context in which the MCE happened.
+ */
+static inline int mce_grade_fetch_amd(enum context ctx, char **severity_msg)
+{
+	WARN_ON(!severity_msg);
+
+	switch (ctx) {
+	case IN_USER:
+		*severity_msg = "Action required: instruction fetch in user process";
+		return MCE_AR_SEVERITY;
+	case IN_KERNEL_RECOV:
+		*severity_msg = "Instruction fetch in kernel recoverable area";
+#ifdef CONFIG_MEMORY_FAILURE
+		return MCE_AR_SEVERITY;
+#else /* !MCE_PANIC_SEVERITY */
 		return MCE_PANIC_SEVERITY;
+#endif
+	case IN_KERNEL:
+		*severity_msg = "Instruction fetch error in kernel";
+		return MCE_PANIC_SEVERITY;
+	default:
+		*severity_msg = "Instruction fetch error in unknown context";
+		return MCE_PANIC_SEVERITY;
+	}
+}
 
-	if (m->status & MCI_STATUS_UC) {
+/*
+ * Evaluate the severity of a memory poison error for AMD systems,
+ * depending on the context in which the MCE happened.
+ */
+static inline int mce_grade_poison_amd(enum context ctx, char **severity_msg)
+{
 
-		if (ctx == IN_KERNEL)
-			return MCE_PANIC_SEVERITY;
+	WARN_ON(!severity_msg);
+
+	switch (ctx) {
+	case IN_USER:
+		*severity_msg = "Attempt to consume poisoned data in user process";
+		return MCE_AR_SEVERITY;
+	case IN_KERNEL_RECOV:
+		*severity_msg = "Attempt to consume poisoned memory in kernel recoverable area";
+#ifdef CONFIG_MEMORY_FAILURE
+		return MCE_AR_SEVERITY;
+#else /* !CONFIG_MEMORY_FAILURE */
+		return MCE_PANIC_SEVERITY;
+#endif
+	case IN_KERNEL:
+		*severity_msg = "Attempt to consume poisoned data in kernel context";
+		return MCE_PANIC_SEVERITY;
+	default:
+		*severity_msg = "Poisoned data consumption in unknown context";
+		return MCE_PANIC_SEVERITY;
+	}
+}
+
+/*
+ * Evaluate the severity of deferred errors for AMD systems, for which only
+ * scrub error is interesting to notify an action requirement.
+ */
+static noinstr int mce_grade_deferred_amd(struct mce *m, enum context ctx, char **msg)
+{
+	int ret;
+
+	WARN_ON(!msg);
 
+	if (m->status & MCI_STATUS_SCRUB) {
+		ret = MCE_AR_SEVERITY;
+		*msg = "Action required: Error detected on a scrub operation";
+	} else {
 		/*
-		 * On older systems where overflow_recov flag is not present, we
-		 * should simply panic if an error overflow occurs. If
-		 * overflow_recov flag is present and set, then software can try
-		 * to at least kill process to prolong system operation.
+		 * deferred error: poll handler catches these and adds to mce_ring so
+		 * memory-failure can take recovery actions.
 		 */
-		if (mce_flags.overflow_recov) {
-			if (mce_flags.smca)
-				return mce_severity_amd_smca(m, ctx);
+		ret = MCE_DEFERRED_SEVERITY;
+		*msg = "Deferred";
+	}
 
-			/* kill current process */
-			return MCE_AR_SEVERITY;
+	return ret;
+}
+
+/*
+ * Evaluate the severity of an overflow error for AMD systems, dependent on
+ * the recoverable features available.
+ */
+static noinstr int mce_grade_overflow_amd(struct mce *m, enum context ctx, char **msg)
+{
+	int ret;
+
+	WARN_ON(!msg);
+
+	if (ctx == IN_KERNEL) {
+		*msg = "Uncorrectable error in kernel context";
+		return MCE_PANIC_SEVERITY;
+	}
+
+	/*
+	 * On older systems where overflow_recov flag is not present, we
+	 * should simply panic if an error overflow occurs. If
+	 * overflow_recov flag is present and set, then software can try
+	 * to at least kill process to prolong system operation.
+	 */
+	if (mce_flags.overflow_recov) {
+		*msg = "Uncorrected recoverable error";
+		if (mce_flags.smca) {
+			ret = mce_severity_amd_smca(m, ctx);
+			if (ret == MCE_PANIC_SEVERITY)
+				*msg = "Uncorrected unrecoverable error";
 		} else {
-			/* at least one error was not logged */
-			if (m->status & MCI_STATUS_OVER)
-				return MCE_PANIC_SEVERITY;
+			/* kill current process */
+			ret = MCE_AR_SEVERITY;
 		}
+		return ret;
+	}
 
-		/*
-		 * For any other case, return MCE_UC_SEVERITY so that we log the
-		 * error and exit #MC handler.
-		 */
-		return MCE_UC_SEVERITY;
+	/* at least one error was not logged */
+	if (m->status & MCI_STATUS_OVER) {
+		*msg = "Overflow uncorrected";
+		return MCE_PANIC_SEVERITY;
 	}
 
 	/*
-	 * deferred error: poll handler catches these and adds to mce_ring so
-	 * memory-failure can take recovery actions.
+	 * For any other case, return MCE_UC_SEVERITY so that we log the
+	 * error and exit #MC handler.
 	 */
-	if (m->status & MCI_STATUS_DEFERRED)
-		return MCE_DEFERRED_SEVERITY;
+	*msg = "Uncorrected overflow error";
+	return MCE_UC_SEVERITY;
+}
+
+/*
+ * See AMD PPR(s) section 3.1 Machine Check Architecture
+ */
+static noinstr int mce_severity_amd(struct mce *m, struct pt_regs *regs, char **msg, bool is_excp)
+{
+	enum context ctx = error_context(m, regs);
+	char *severity_msg;
+	int ret;
 
 	/*
-	 * corrected error: poll handler catches these and passes responsibility
-	 * of decoding the error to EDAC
+	 * Default return values. The poll handler catches these and passes
+	 * responsibility of decoding them to EDAC
 	 */
-	return MCE_KEEP_SEVERITY;
+	ret = MCE_KEEP_SEVERITY;
+	severity_msg = "Corrected error";
+
+	if (m->status & MCI_STATUS_DEFERRED) {
+		ret = mce_grade_deferred_amd(m, ctx, &severity_msg);
+		goto amd_severity;
+	}
+
+	/* If the UC bit is not set, the error has been corrected */
+	if (!(m->status & MCI_STATUS_UC)) {
+		ret = MCE_KEEP_SEVERITY;
+		severity_msg = "Corrected error";
+		goto amd_severity;
+	}
+
+	if (m->status & MCI_STATUS_POISON) {
+		ret = mce_grade_poison_amd(ctx, &severity_msg);
+		goto amd_severity;
+	}
+
+	/* Processor Context Corrupt, no need to fumble too much, die! */
+	if (m->status & MCI_STATUS_PCC) {
+		severity_msg = "Processor Context Corrupt";
+		ret = MCE_PANIC_SEVERITY;
+		goto amd_severity;
+	}
+
+	if ((m->status & ERRORCODE_T_MSK) == ERRORCODE_T_DATA) {
+		ret = mce_grade_data_amd(ctx, &severity_msg);
+		goto amd_severity;
+	}
+
+	if ((m->status & ERRORCODE_M_MSK) == ERRORCODE_M_FETCH) {
+		ret = mce_grade_fetch_amd(ctx, &severity_msg);
+		goto amd_severity;
+	}
+
+	if (m->status & MCI_STATUS_OVER) {
+		ret = mce_grade_overflow_amd(m, ctx, &severity_msg);
+		goto amd_severity;
+	}
+
+	if (ctx == IN_KERNEL) {
+		ret = MCE_PANIC_SEVERITY;
+		severity_msg = "Uncorrectable error in kernel context";
+	}
+
+amd_severity:
+
+	if (msg)
+		*msg = severity_msg;
+
+	return ret;
 }
 
 static noinstr int mce_severity_intel(struct mce *m, struct pt_regs *regs, char **msg, bool is_excp)