Message ID | 20250217063335.22257-2-xueshuai@linux.alibaba.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm/hwpoison: Fix regressions in memory failure handling | expand |
On Mon, Feb 17, 2025 at 02:33:31PM +0800, Shuai Xue wrote: > Currently, mce_no_way_out() only collects error messages when the error > severity is equal to `MCE_PANIC_SEVERITY`. To improve diagnostics, > modify the behavior to also collect error messages when the severity is > less than `MCE_PANIC_SEVERITY`. That function is literally called "no way out" as in, is the error severe enough so that there is no way out. Now you went and stomped all over that to "improve diagnostics". What diagnostsics? Your commit messages need to explain in detail why exactly a patch exists. So nope.
在 2025/2/18 15:58, Borislav Petkov 写道: > On Mon, Feb 17, 2025 at 02:33:31PM +0800, Shuai Xue wrote: >> Currently, mce_no_way_out() only collects error messages when the error >> severity is equal to `MCE_PANIC_SEVERITY`. To improve diagnostics, >> modify the behavior to also collect error messages when the severity is >> less than `MCE_PANIC_SEVERITY`. > > That function is literally called "no way out" as in, is the error severe > enough so that there is no way out. > > Now you went and stomped all over that to "improve diagnostics". What > diagnostsics? Your commit messages need to explain in detail why exactly > a patch exists. > > So nope. > Hi, Borislav, Thank you for reply. The msg in predefined `severities`, e.g. MCESEV( AO, "Action optional: last level cache writeback error", SER, MASK(MCI_UC_AR|MCACOD, MCI_STATUS_UC|MCACOD_L3WB) ), is helpful for users to know what kind of MCE is happened. For a fatal machine check, kernel panic use the message and I want to extend to collect the message and print it out for non-fatal one. If you don't object, let's go on to discuss how to implement it. Otherwise, you can ignore the following response. Yes, mce_no_way_out() means "no way out" literally. It only collects message for MCE_PANIC_SEVERITY but use in common path. So I used this function to extend it to non-fatal, assuming it was obvious. Is __mc_scan_banks() a proper function to extend? Thanks. Shuai
On Tue, Feb 18, 2025 at 05:39:33PM +0800, Shuai Xue wrote:
> Is __mc_scan_banks() a proper function to extend?
No, first you need to explain the big picture:
https://lore.kernel.org/r/20250218082727.GCZ7REb7OG6NTAY-V-@fat_crate.local
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index 0dc00c9894c7..f2e730d4acc5 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -925,12 +925,13 @@ static __always_inline void quirk_zen_ifu(int bank, struct mce *m, struct pt_reg * Do a quick check if any of the events requires a panic. * This decides if we keep the events around or clear them. */ -static __always_inline int mce_no_way_out(struct mce_hw_err *err, char **msg, unsigned long *validp, - struct pt_regs *regs) +static __always_inline bool mce_no_way_out(struct mce_hw_err *err, char **msg, + unsigned long *validp, + struct pt_regs *regs) { struct mce *m = &err->m; char *tmp = *msg; - int i; + int i, cur_sev = MCE_NO_SEVERITY, sev; for (i = 0; i < this_cpu_read(mce_num_banks); i++) { m->status = mce_rdmsrl(mca_msr_reg(i, MCA_STATUS)); @@ -945,13 +946,17 @@ static __always_inline int mce_no_way_out(struct mce_hw_err *err, char **msg, un quirk_zen_ifu(i, m, regs); m->bank = i; - if (mce_severity(m, regs, &tmp, true) >= MCE_PANIC_SEVERITY) { + sev = mce_severity(m, regs, &tmp, true); + if (sev >= cur_sev) { mce_read_aux(err, i); *msg = tmp; - return 1; + cur_sev = sev; } + + if (cur_sev == MCE_PANIC_SEVERITY) + return true; } - return 0; + return false; } /*
Currently, mce_no_way_out() only collects error messages when the error severity is equal to `MCE_PANIC_SEVERITY`. To improve diagnostics, modify the behavior to also collect error messages when the severity is less than `MCE_PANIC_SEVERITY`. Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> --- arch/x86/kernel/cpu/mce/core.c | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-)