Message ID | 20230929181626.210782-1-tony.luck@intel.com (mailing list archive) |
---|---|
Series | Handle corrected machine check interrupt storms |
> Including placing most of the storm tracking code into threshold.c
> instead of bloating core.c.

The lkp test robot complains about some undefined symbols on a
randconfig build with:

# CONFIG_X86_MCE_INTEL is not set
# CONFIG_X86_MCE_AMD is not set

>> core.c:(.text+0x1130): undefined reference to `storm_desc'
>> core.c:(.text+0x1634): undefined reference to `mce_track_storm'

A simple fix would be to move the definition of storm_desc into core.c
and provide a stub in internal.h:

    static inline void mce_track_storm(struct mce *mce) { }

for the case where neither INTEL nor AMD is configured.

-Tony
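The stub approach suggested above can be sketched in user space as follows. This is only an illustration of the pattern, not the kernel code: DEMO_MCE_INTEL and DEMO_MCE_AMD are hypothetical stand-ins for the two Kconfig symbols, and struct mce, storm_calls and log_error() are made up for the example.

```c
#include <stdio.h>

/*
 * Hypothetical stand-ins for CONFIG_X86_MCE_INTEL and CONFIG_X86_MCE_AMD.
 * Leave both undefined to mimic the failing randconfig build.
 */
/* #define DEMO_MCE_INTEL 1 */
/* #define DEMO_MCE_AMD 1 */

/* Minimal placeholder; the real struct mce carries far more state. */
struct mce {
	int bank;
};

/* Counts how often the real tracking code ran (demo only). */
static int storm_calls;

#if defined(DEMO_MCE_INTEL) || defined(DEMO_MCE_AMD)
/* "Real" implementation, only built when a vendor option is enabled. */
static void mce_track_storm(struct mce *mce)
{
	printf("tracking storm state for bank %d\n", mce->bank);
	storm_calls++;
}
#else
/* Empty stub so that common code still compiles and links. */
static inline void mce_track_storm(struct mce *mce)
{
	(void)mce;
}
#endif

/* Common caller (hypothetical) that needs no #ifdef of its own. */
static int log_error(struct mce *m)
{
	mce_track_storm(m);
	return storm_calls;
}
```

With neither macro defined the call compiles away to nothing, so the caller in common code needs no conditional compilation of its own.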
Linux CMCI storm mitigation is a big hammer that just disables the CMCI
interrupt globally and switches to polling all banks.

There are two problems with this:

1) It really is a big hammer. It means that errors reported in other
   banks from different functional units are all subject to the same
   polling delay before being processed.

2) Intel systems signal some uncorrected errors using CMCI (e.g. memory
   controller patrol scrub on Icelake Xeon and newer). Delaying
   processing of these error reports negates some of the benefit of the
   patrol scrubber providing early notice of errors before they are
   consumed and cause a machine check.

This series throws away the old storm implementation and replaces it
with one that keeps track of the weather on each separate machine check
bank.

When a storm is detected from a bank on an Intel system, the storm is
mitigated by setting a very high threshold for corrected errors to
signal CMCI. This threshold does not affect signaling CMCI for
uncorrected errors.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
Changes since v7:

Applied all the suggestions from Yazen's review of v7
Link: https://lore.kernel.org/all/c76723df-f2f1-4888-9e05-61917145503c@amd.com/
Link: https://lore.kernel.org/all/6ae4df67-ba0b-4b50-8c1d-a5d382105ad2@amd.com/

This includes placing most of the storm tracking code into threshold.c
instead of bloating core.c.

Tony Luck (3):
  x86/mce: Remove old CMCI storm mitigation code
  x86/mce: Add per-bank CMCI storm mitigation
  x86/mce: Handle Intel threshold interrupt storms

 arch/x86/kernel/cpu/mce/internal.h  |  47 +++-
 arch/x86/kernel/cpu/mce/core.c      |  45 ++--
 arch/x86/kernel/cpu/mce/intel.c     | 338 ++++++++++++----------------
 arch/x86/kernel/cpu/mce/threshold.c |  86 +++++++
 4 files changed, 293 insertions(+), 223 deletions(-)

base-commit: 6465e260f48790807eef06b583b38ca9789b6072
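To make the per-bank "weather" idea concrete, here is a minimal user-space sketch. It is not the code from this series: the names (bank_weather, track_storm, bank_in_storm), the bank count and both thresholds are assumptions chosen for illustration. Each bank keeps a bit history of recent checks, enters a storm once enough of them saw an error, and leaves it after a run of clean checks.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values only; not taken from the posted patches. */
#define NUM_BANKS        8   /* number of banks tracked in this demo   */
#define STORM_BEGIN      5   /* errors in the 64-check window to begin */
#define STORM_END_POLLS  30  /* consecutive clean checks needed to end */

struct bank_weather {
	uint64_t history;   /* bit 0 = most recent check, 1 = error seen */
	bool     in_storm;
};

static struct bank_weather banks[NUM_BANKS];

/* Portable population count. */
static int count_bits(uint64_t v)
{
	int n = 0;

	while (v) {
		v &= v - 1;   /* clear the lowest set bit */
		n++;
	}
	return n;
}

/*
 * Record the result of one check of @bank (caller ensures bank < NUM_BANKS).
 * Returns true when the bank's storm state changed.
 */
static bool track_storm(unsigned int bank, bool error_seen)
{
	struct bank_weather *w = &banks[bank];

	w->history = (w->history << 1) | (error_seen ? 1 : 0);

	if (!w->in_storm) {
		if (count_bits(w->history) >= STORM_BEGIN) {
			/* A driver would raise this bank's CMCI threshold here. */
			w->in_storm = true;
			return true;
		}
	} else {
		uint64_t recent = w->history &
				  ((UINT64_C(1) << STORM_END_POLLS) - 1);

		if (recent == 0) {
			/* A driver would restore the normal threshold here. */
			w->in_storm = false;
			return true;
		}
	}
	return false;
}

static bool bank_in_storm(unsigned int bank)
{
	return banks[bank].in_storm;
}
```

Because the state lives in each bank's own entry, a storm on one bank leaves the others in their normal mode, which is the property the cover letter is after.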