diff mbox series

[v2,1/5] x86/mce: Collect error message for severities below MCE_PANIC_SEVERITY

Message ID 20250217063335.22257-2-xueshuai@linux.alibaba.com (mailing list archive)
State New
Headers show
Series mm/hwpoison: Fix regressions in memory failure handling | expand

Commit Message

Shuai Xue Feb. 17, 2025, 6:33 a.m. UTC
Currently, mce_no_way_out() only collects error messages when the error
severity is equal to `MCE_PANIC_SEVERITY`. To improve diagnostics,
modify the behavior to also collect error messages when the severity is
less than `MCE_PANIC_SEVERITY`.

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
 arch/x86/kernel/cpu/mce/core.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

Comments

Borislav Petkov Feb. 18, 2025, 7:58 a.m. UTC | #1
On Mon, Feb 17, 2025 at 02:33:31PM +0800, Shuai Xue wrote:
> Currently, mce_no_way_out() only collects error messages when the error
> severity is equal to `MCE_PANIC_SEVERITY`. To improve diagnostics,
> modify the behavior to also collect error messages when the severity is
> less than `MCE_PANIC_SEVERITY`.

That function is literally called "no way out" as in, is the error severe
enough so that there is no way out.

Now you went and stomped all over that to "improve diagnostics". What
diagnostsics? Your commit messages need to explain in detail why exactly
a patch exists.

So nope.
Shuai Xue Feb. 18, 2025, 9:39 a.m. UTC | #2
在 2025/2/18 15:58, Borislav Petkov 写道:
> On Mon, Feb 17, 2025 at 02:33:31PM +0800, Shuai Xue wrote:
>> Currently, mce_no_way_out() only collects error messages when the error
>> severity is equal to `MCE_PANIC_SEVERITY`. To improve diagnostics,
>> modify the behavior to also collect error messages when the severity is
>> less than `MCE_PANIC_SEVERITY`.
> 
> That function is literally called "no way out" as in, is the error severe
> enough so that there is no way out.
> 
> Now you went and stomped all over that to "improve diagnostics". What
> diagnostsics? Your commit messages need to explain in detail why exactly
> a patch exists.
> 
> So nope.
> 

Hi, Borislav,

Thank you for reply.

The msg in predefined `severities`, e.g.

	MCESEV(
		AO, "Action optional: last level cache writeback error",
		SER, MASK(MCI_UC_AR|MCACOD, MCI_STATUS_UC|MCACOD_L3WB)
		),

is helpful for users to know what kind of MCE is happened. For a fatal machine
check, kernel panic use the message and I want to extend to collect the message
and print it out for non-fatal one.

If you don't object, let's go on to discuss how to implement it.
Otherwise, you can ignore the following response.

Yes, mce_no_way_out() means "no way out" literally. It only collects message
for MCE_PANIC_SEVERITY but use in common path. So I used this function to
extend it to non-fatal, assuming it was obvious.
  
Is __mc_scan_banks() a proper function to extend?

Thanks.
Shuai
Borislav Petkov Feb. 18, 2025, 9:50 a.m. UTC | #3
On Tue, Feb 18, 2025 at 05:39:33PM +0800, Shuai Xue wrote:
> Is __mc_scan_banks() a proper function to extend?

No, first you need to explain the big picture:

https://lore.kernel.org/r/20250218082727.GCZ7REb7OG6NTAY-V-@fat_crate.local
diff mbox series

Patch

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 0dc00c9894c7..f2e730d4acc5 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -925,12 +925,13 @@  static __always_inline void quirk_zen_ifu(int bank, struct mce *m, struct pt_reg
  * Do a quick check if any of the events requires a panic.
  * This decides if we keep the events around or clear them.
  */
-static __always_inline int mce_no_way_out(struct mce_hw_err *err, char **msg, unsigned long *validp,
-					  struct pt_regs *regs)
+static __always_inline bool mce_no_way_out(struct mce_hw_err *err, char **msg,
+					   unsigned long *validp,
+					   struct pt_regs *regs)
 {
 	struct mce *m = &err->m;
 	char *tmp = *msg;
-	int i;
+	int i, cur_sev = MCE_NO_SEVERITY, sev;
 
 	for (i = 0; i < this_cpu_read(mce_num_banks); i++) {
 		m->status = mce_rdmsrl(mca_msr_reg(i, MCA_STATUS));
@@ -945,13 +946,17 @@  static __always_inline int mce_no_way_out(struct mce_hw_err *err, char **msg, un
 			quirk_zen_ifu(i, m, regs);
 
 		m->bank = i;
-		if (mce_severity(m, regs, &tmp, true) >= MCE_PANIC_SEVERITY) {
+		sev = mce_severity(m, regs, &tmp, true);
+		if (sev >= cur_sev) {
 			mce_read_aux(err, i);
 			*msg = tmp;
-			return 1;
+			cur_sev = sev;
 		}
+
+		if (cur_sev == MCE_PANIC_SEVERITY)
+			return true;
 	}
-	return 0;
+	return false;
 }
 
 /*