Message ID | 20230425121829.61755-1-xueshuai@linux.alibaba.com (mailing list archive) |
---|---|
State | New, archived |
Series | x86/mce/amd: init mce severity to handle deferred memory failure |
On 4/25/23 8:18 AM, Shuai Xue wrote:
> When a deferred UE error is detected, e.g. by background patrol scrubber, it
> will be handled in APIC interrupt handler amd_deferred_error_interrupt().
> The handler will collect MCA banks, init mce struct and process it by
> notifying the registered MCE decode chain.
>
> The uc_decode_notifier, one of MCE decode chain, will process memory
> failure but only limit to MCE_AO_SEVERITY and MCE_DEFERRED_SEVERITY.
> However, APIC interrupt handler does not init mce severity and the
> uninitialized severity is 0 (MCE_NO_SEVERITY).
>
> To handle the deferred memory failure case, init mce severity when logging
> MCA banks.
>
> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>

Hi Shuai Xue,

I think this patch is fair to do. But it won't have the intended effect
in practice.

The value in MCA_ADDR for DRAM ECC errors will be a memory controller
"normalized address". This is not a system physical address that the OS
can use to take action.

The mce_usable_address() function needs to be updated to handle this.
I'll send a patchset this week to do so. Afterwards, the
uc_decode_notifier will not attempt to handle these errors.

Thanks,
Yazen

On 2023/5/9 22:25, Yazen Ghannam wrote:
> On 4/25/23 8:18 AM, Shuai Xue wrote:
>> When a deferred UE error is detected, e.g. by background patrol scrubber, it
>> will be handled in APIC interrupt handler amd_deferred_error_interrupt().
>> The handler will collect MCA banks, init mce struct and process it by
>> notifying the registered MCE decode chain.
>>
>> The uc_decode_notifier, one of MCE decode chain, will process memory
>> failure but only limit to MCE_AO_SEVERITY and MCE_DEFERRED_SEVERITY.
>> However, APIC interrupt handler does not init mce severity and the
>> uninitialized severity is 0 (MCE_NO_SEVERITY).
>>
>> To handle the deferred memory failure case, init mce severity when logging
>> MCA banks.
>>
>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>>
>
> Hi Shuai Xue,
>
> I think this patch is fair to do. But it won't have the intended effect
> in practice.
>
> The value in MCA_ADDR for DRAM ECC errors will be a memory controller
> "normalized address". This is not a system physical address that the OS
> can use to take action.
>
> The mce_usable_address() function needs to be updated to handle this.
> I'll send a patchset this week to do so. Afterwards, the
> uc_decode_notifier will not attempt to handle these errors.

From the experience of other platforms (e.g. ARM64 RAS and Intel MCA),
uc_decode_notifier should handle these errors and hard offline the corrupted
page. If the corrupted page is a free buddy page, we can isolate it and avoid
using the page in the future.

In my test case, the error is detected by the patrol scrubber in the memory
controller. The scrubber may lack a system address space perspective and only
report a "normalized address". But we can decode the "normalized address" to a
system address through EDAC (umc_normaddr_to_sysaddr), right?

(I am not very familiar with AMD RAS, so please correct me if I am wrong.)

>
> Thanks,
> Yazen

Thank you.

Best Regards,
Shuai

On 5/9/23 10:17 PM, Shuai Xue wrote:
>
>
> On 2023/5/9 22:25, Yazen Ghannam wrote:
>> On 4/25/23 8:18 AM, Shuai Xue wrote:
>>> When a deferred UE error is detected, e.g. by background patrol scrubber, it
>>> will be handled in APIC interrupt handler amd_deferred_error_interrupt().
>>> The handler will collect MCA banks, init mce struct and process it by
>>> notifying the registered MCE decode chain.
>>>
>>> The uc_decode_notifier, one of MCE decode chain, will process memory
>>> failure but only limit to MCE_AO_SEVERITY and MCE_DEFERRED_SEVERITY.
>>> However, APIC interrupt handler does not init mce severity and the
>>> uninitialized severity is 0 (MCE_NO_SEVERITY).
>>>
>>> To handle the deferred memory failure case, init mce severity when logging
>>> MCA banks.
>>>
>>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>>>
>>
>> Hi Shuai Xue,
>>
>> I think this patch is fair to do. But it won't have the intended effect
>> in practice.
>>
>> The value in MCA_ADDR for DRAM ECC errors will be a memory controller
>> "normalized address". This is not a system physical address that the OS
>> can use to take action.
>>
>> The mce_usable_address() function needs to be updated to handle this.
>> I'll send a patchset this week to do so. Afterwards, the
>> uc_decode_notifier will not attempt to handle these errors.
>
> From the experience of other platforms (e.g. ARM64 RAS and Intel MCA),
> uc_decode_notifier should handle these errors and hard offline the corrupted
> page. If the corrupted page is a free buddy page, we can isolate it and avoid
> using the page in the future.
>
> In my test case, the error is detected by the patrol scrubber in the memory
> controller. The scrubber may lack a system address space perspective and only
> report a "normalized address". But we can decode the "normalized address" to a
> system address through EDAC (umc_normaddr_to_sysaddr), right?
>
> (I am not very familiar with AMD RAS, so please correct me if I am wrong.)
>

Yes, that's correct.

The address translation requires some updates that are still in review.
Afterwards, we can investigate ways to use the translated address. It may
require some rework in the MCE notifier chain or, more simply, calling
memory_failure() from the EDAC module itself.

Thanks,
Yazen

AMD ATL has been merged upstream. Can we merge this patch to process
deferred errors with memory_failure()?

Thanks,
Ruidong

On 4/18/24 04:42, Ruidong Tian wrote:
>
> AMD ATL has merged to upstream, can we merge this patch to process
> deferred error with memory_failure()?
>

Hi Ruidong,

Thanks for the follow up. This patch is made redundant by the following
patch in review.
https://lore.kernel.org/linux-edac/20240404151359.47970-11-yazen.ghannam@amd.com/

Also, this is still not sufficient. The address translation still needs
to be invoked in order for memory_failure() to have a valid system
physical address. Please see the following work-in-progress patch.
https://github.com/AMDESE/linux/commit/6ddd8e90d08edb4a2730ccd02981baef4645bb43

Thanks,
Yazen

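The log in the patch description appears to illustrate this point: the "Memory failure: 0x5dc2" line matches the raw "Error Addr: 0x0000000005dc2a00" value shifted down by the page size, which suggests the untranslated MCA_ADDR value was consumed as if it were a system physical address. A minimal sketch of that arithmetic follows, assuming 4 KiB pages; addr_to_pfn() is a hypothetical helper written for this note, not a kernel function.

```c
#include <stdint.h>

/* Assumption: 4 KiB base pages, so the low 12 bits are the in-page offset. */
#define PAGE_SHIFT_4K 12

/* Hypothetical helper: page frame number for a byte address. */
static uint64_t addr_to_pfn(uint64_t addr)
{
	return addr >> PAGE_SHIFT_4K;
}
```

Calling addr_to_pfn(0x0000000005dc2a00ULL) yields 0x5dc2, the same page frame number that memory_failure() reported as recovered.
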
diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 23c5072fbbb7..b5e1a27b0881 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -734,6 +734,7 @@ static void __log_error(unsigned int bank, u64 status, u64 addr, u64 misc)
 	m.misc = misc;
 	m.bank = bank;
 	m.tsc = rdtsc();
+	m.severity = mce_severity(&m, NULL, NULL, false);
 
 	if (m.status & MCI_STATUS_ADDRV) {
 		m.addr = addr;

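Why a single assignment changes behaviour can be sketched with a small user-space model of the severity filter that the patch description attributes to uc_decode_notifier. This is not the kernel code: the struct, the enum, and notifier_would_act() are hypothetical names, and the enumerator values are illustrative, chosen only so that 0 and 1 line up with the "mce->severity" printk values in the description.

```c
#include <stdbool.h>

/* Illustrative stand-ins for the kernel's severity levels (values assumed). */
enum severity_model {
	SEV_NONE = 0,     /* stands in for MCE_NO_SEVERITY */
	SEV_DEFERRED = 1, /* stands in for MCE_DEFERRED_SEVERITY */
	SEV_AO = 2        /* stands in for MCE_AO_SEVERITY; value is arbitrary */
};

/* Hypothetical, reduced record: only the field the filter looks at. */
struct mce_model {
	enum severity_model severity;
};

/* Models the filter: act only on action-optional or deferred errors. */
static bool notifier_would_act(const struct mce_model *m)
{
	return m->severity == SEV_AO || m->severity == SEV_DEFERRED;
}
```

A zero-initialized struct mce_model fails the check, which corresponds to the "no memory failure log" case before the fix; once severity is set to SEV_DEFERRED the check passes.
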
When a deferred UE error is detected, e.g. by the background patrol scrubber,
it is handled in the APIC interrupt handler amd_deferred_error_interrupt().
The handler collects the MCA banks, initializes the mce struct, and processes
it by notifying the registered MCE decode chain.

The uc_decode_notifier, one of the notifiers on the MCE decode chain,
processes memory failures, but only for MCE_AO_SEVERITY and
MCE_DEFERRED_SEVERITY. However, the APIC interrupt handler does not
initialize the mce severity, so the uninitialized severity is 0
(MCE_NO_SEVERITY).

To handle the deferred memory failure case, initialize the mce severity when
logging MCA banks.

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
Steps to reproduce:

step 1: inject a patrol scrub error with ras-tools
	#einj_mem_uc patrol
step 2: check dmesg; there is no memory failure log
	#dmesg -c
	[51295.686806] mce: [Hardware Error]: Machine check events logged
	[51295.693566] mce->status: 0x942031000400011b
	[51295.698248] mce->misc: 0x00000000
	[51295.701952] mce->severity: 0x00000000 # Manually added printk
	[51295.726640] [Hardware Error]: Deferred error, no action required.
	[51295.733448] [Hardware Error]: CPU:65 (19:11:1) MC21_STATUS[-|-|-|AddrV|-|-|SyndV|UECC|Deferred|-|Scrub]: 0x942031000400011b
	[51295.733452] [Hardware Error]: Error Addr: 0x0000000006350a00
	[51295.733453] [Hardware Error]: PPIN: 0x02b69e294c148024
	[51295.733453] [Hardware Error]: IPID: 0x0000109600250f00, Syndrome: 0x9a4a00000b800000
	[51295.733455] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
	[51295.733463] mce: umc_normaddr_to_sysaddr: Invalid DramBaseAddress range: 0x0.
	[51295.733471] EDAC MC0: 1 UE Cannot decode normalized address on mc#0csrow#0channel#2 (csrow:0 channel:2 page:0x0 offset:0x0 grain:64)
	[51295.733471] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

After this fix:

	[ 514.966892] mce: [Hardware Error]: Machine check events logged
	[ 514.966912] mce->status: 0x942031000400011b
	[ 514.978093] mce->misc: 0x00000000
	[ 514.981796] mce->severity: 0x00000001
	[ 514.985885] <uc_decode_notifier> pre_handler: p->addr = 0x00000000e09e69e4, ip = ffffffff8104b955, flags = 0x282
	[ 514.997253] <uc_decode_notifier> post_handler: p->addr = 0x00000000e09e69e4, flags = 0x282
	[ 515.006501] Memory failure: 0x5dc2: recovery action for free buddy page: Recovered
	[ 515.015188] [Hardware Error]: Deferred error, no action required.
	[ 515.022006] [Hardware Error]: CPU:67 (19:11:1) MC21_STATUS[-|-|-|AddrV|-|-|SyndV|UECC|Deferred|-|Scrub]: 0x942031000400011b
	[ 515.034440] [Hardware Error]: Error Addr: 0x0000000005dc2a00
	[ 515.034442] [Hardware Error]: PPIN: 0x02b69e294c148024
	[ 515.034443] [Hardware Error]: IPID: 0x0000109600650f00, Syndrome: 0x9a4a00000b800008
	[ 515.034445] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
	[ 515.034453] umc_normaddr_to_sysaddr: Invalid DramBaseAddress range: 0x0.
	[ 515.034458] EDAC MC1: 1 UE Cannot decode normalized address on mc#1csrow#0channel#6 (csrow:0 channel:6 page:0x0 offset:0x0 grain:64)
	[ 515.034461] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

Note: memory_failure() handles the wrong physical address because
umc_normaddr_to_sysaddr() fails. I have not figured out why it fails.
---
 arch/x86/kernel/cpu/mce/amd.c | 1 +
 1 file changed, 1 insertion(+)
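The flags printed on the MC21_STATUS line can be cross-checked against the raw status value 0x942031000400011b with a few bit tests. The sketch below is a user-space illustration only; the macro names are local to this example, and the bit positions are stated as I understand the AMD MCA_STATUS layout, so they should be treated as assumptions and checked against the processor documentation.

```c
#include <stdbool.h>
#include <stdint.h>

#define BIT_U64(n) (UINT64_C(1) << (n))

/* Assumed MCA_STATUS bit positions (local names, not kernel macros). */
#define ST_VAL      BIT_U64(63) /* register contents valid */
#define ST_UC       BIT_U64(61) /* uncorrected error */
#define ST_ADDRV    BIT_U64(58) /* MCA_ADDR holds a valid address */
#define ST_SYNDV    BIT_U64(53) /* syndrome valid */
#define ST_UECC     BIT_U64(45) /* uncorrectable ECC error */
#define ST_DEFERRED BIT_U64(44) /* deferred error */
#define ST_SCRUB    BIT_U64(40) /* found by the scrubber */

/* Returns true if every bit in mask is set in status. */
static bool status_has(uint64_t status, uint64_t mask)
{
	return (status & mask) == mask;
}
```

With status 0x942031000400011b, status_has() reports AddrV, SyndV, UECC, Deferred, and Scrub as set and UC as clear, which agrees with the decoded flag list in both logs: a deferred, scrubber-detected DRAM ECC error with a valid address field.
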