Message ID | 20220716033104.903163-1-jane.chu@oracle.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2] x86/mce: retrieve poison range from hardware whenever supported | expand |
Jane Chu wrote: > With Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine > poison granularity") that changed nfit_handle_mce() callback to report > badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's been > discovered that the mce->misc LSB field is 0x1000 bytes, hence injecting > 2 back-to-back poisons and the driver ends up logging 8 badblocks, > because 0x1000 bytes is 8 512-byte. > > Dan Williams noticed that apei_mce_report_mem_error() hardcode > the LSB field to PAGE_SHIFT instead of consulting the input > struct cper_sec_mem_err record. So change to rely on hardware whenever > support is available. > > v1: https://lkml.org/lkml/2022/7/15/1040 This should be "Link:" and it should reference lore.kernel.org by msg-id. Lore is maintained by LF infrastructure and the message-id can be used to search lore.kernel.org mirrors even if the LF infra is down. Link: https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8502@oracle.com > Signed-off-by: Jane Chu <jane.chu@oracle.com> > --- > arch/x86/kernel/cpu/mce/apei.c | 6 +++++- > 1 file changed, 5 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c > index 717192915f28..a4d5893632e0 100644 > --- a/arch/x86/kernel/cpu/mce/apei.c > +++ b/arch/x86/kernel/cpu/mce/apei.c > @@ -29,6 +29,7 @@ > void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err) > { > struct mce m; > + int lsb; > > if (!(mem_err->validation_bits & CPER_MEM_VALID_PA)) > return; > @@ -37,7 +38,10 @@ void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err) > m.bank = -1; > /* Fake a memory read error with unknown channel */ > m.status = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV | MCI_STATUS_MISCV | 0x9f; > - m.misc = (MCI_MISC_ADDR_PHYS << 6) | PAGE_SHIFT; > + lsb = __builtin_ffs(mem_err->physical_addr_mask) - 1; Just use the kernel's __ffs64() which does not require the substraction fixup, and you can assume that physical_address_mask is non-zero just like trace_extlog_mem_event() does. > + if (lsb <= 0) > + lsb = PAGE_SHIFT; This fixup can then be removed as well. After the above comments are addressed you can add: Reviewed-by: Dan Williams <dan.j.williams@intel.com>
On 7/15/2022 9:50 PM, Dan Williams wrote: > Jane Chu wrote: >> With Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine >> poison granularity") that changed nfit_handle_mce() callback to report >> badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's been >> discovered that the mce->misc LSB field is 0x1000 bytes, hence injecting >> 2 back-to-back poisons and the driver ends up logging 8 badblocks, >> because 0x1000 bytes is 8 512-byte. >> >> Dan Williams noticed that apei_mce_report_mem_error() hardcode >> the LSB field to PAGE_SHIFT instead of consulting the input >> struct cper_sec_mem_err record. So change to rely on hardware whenever >> support is available. >> >> v1: https://lkml.org/lkml/2022/7/15/1040 > > This should be "Link:" and it should reference lore.kernel.org by > msg-id. Lore is maintained by LF infrastructure and the message-id can > be used to search lore.kernel.org mirrors even if the LF infra is down. > > Link: https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8502@oracle.com > Got it, thanks! >> Signed-off-by: Jane Chu <jane.chu@oracle.com> >> --- >> arch/x86/kernel/cpu/mce/apei.c | 6 +++++- >> 1 file changed, 5 insertions(+), 1 deletion(-) >> >> diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c >> index 717192915f28..a4d5893632e0 100644 >> --- a/arch/x86/kernel/cpu/mce/apei.c >> +++ b/arch/x86/kernel/cpu/mce/apei.c >> @@ -29,6 +29,7 @@ >> void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err) >> { >> struct mce m; >> + int lsb; >> >> if (!(mem_err->validation_bits & CPER_MEM_VALID_PA)) >> return; >> @@ -37,7 +38,10 @@ void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err) >> m.bank = -1; >> /* Fake a memory read error with unknown channel */ >> m.status = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV | MCI_STATUS_MISCV | 0x9f; >> - m.misc = (MCI_MISC_ADDR_PHYS << 6) | PAGE_SHIFT; >> + lsb = __builtin_ffs(mem_err->physical_addr_mask) - 1; > > Just use the kernel's __ffs64() which does not require the substraction Will do. > fixup, and you can assume that physical_address_mask is non-zero just > like trace_extlog_mem_event() does. I guess to me, a more convincing evidence is /* Error grain */ if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK) e->grain = ~mem_err->physical_addr_mask + 1; in ghes_edac_report_mem_error(). > >> + if (lsb <= 0) >> + lsb = PAGE_SHIFT; > > This fixup can then be removed as well. > > After the above comments are addressed you can add: > > Reviewed-by: Dan Williams <dan.j.williams@intel.com> Thanks! -jane
diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c index 717192915f28..a4d5893632e0 100644 --- a/arch/x86/kernel/cpu/mce/apei.c +++ b/arch/x86/kernel/cpu/mce/apei.c @@ -29,6 +29,7 @@ void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err) { struct mce m; + int lsb; if (!(mem_err->validation_bits & CPER_MEM_VALID_PA)) return; @@ -37,7 +38,10 @@ void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err) m.bank = -1; /* Fake a memory read error with unknown channel */ m.status = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV | MCI_STATUS_MISCV | 0x9f; - m.misc = (MCI_MISC_ADDR_PHYS << 6) | PAGE_SHIFT; + lsb = __builtin_ffs(mem_err->physical_addr_mask) - 1; + if (lsb <= 0) + lsb = PAGE_SHIFT; + m.misc = (MCI_MISC_ADDR_PHYS << 6) | lsb; if (severity >= GHES_SEV_RECOVERABLE) m.status |= MCI_STATUS_UC;
With Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison granularity") that changed nfit_handle_mce() callback to report badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's been discovered that the mce->misc LSB field is 0x1000 bytes, hence injecting 2 back-to-back poisons and the driver ends up logging 8 badblocks, because 0x1000 bytes is 8 512-byte. Dan Williams noticed that apei_mce_report_mem_error() hardcode the LSB field to PAGE_SHIFT instead of consulting the input struct cper_sec_mem_err record. So change to rely on hardware whenever support is available. v1: https://lkml.org/lkml/2022/7/15/1040 Signed-off-by: Jane Chu <jane.chu@oracle.com> --- arch/x86/kernel/cpu/mce/apei.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-)