Message ID: 20180430213358.8319-3-mr.nuke.me@gmail.com (mailing list archive)
State: RFC, archived
On Mon, Apr 30, 2018 at 04:33:52PM -0500, Alexandru Gagniuc wrote:
> The policy was to panic() when GHES said that an error is "Fatal".
> This logic is wrong for several reasons, as it doesn't take into
> account what caused the error.
>
> PCIe fatal errors indicate that the link to a device is either
> unstable or unusable. They don't indicate that the machine is on fire,
> and they are not severe enough that we need to panic(). Instead of
> relying on crackmonkey firmware, evaluate the error severity based on
             ^^^^^^^^^^^^

Please keep the smartass formulations for the ML only and do not let
them leak into commit messages.

> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
> ---
>  drivers/acpi/apei/ghes.c | 45 ++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 42 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index c9f1971333c1..49318fba409c 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -425,8 +425,7 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
>   * GHES_SEV_RECOVERABLE -> AER_NONFATAL
>   * GHES_SEV_RECOVERABLE && CPER_SEC_RESET -> AER_FATAL
>   * These both need to be reported and recovered from by the AER driver.
> - * GHES_SEV_PANIC does not make it to this handling since the kernel must
> - * panic.
> + * GHES_SEV_PANIC -> AER_FATAL
>   */
>  static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
>  {
> @@ -459,6 +458,46 @@ static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
>  #endif
>  }
>
> +/* PCIe errors should not cause a panic. */
> +static int ghes_sec_pcie_severity(struct acpi_hest_generic_data *gdata)
> +{
> +	struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
> +
> +	if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
> +	    pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO &&
> +	    IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER))

How is PCIe error severity dependent on whether the AER error reporting
driver is enabled (and possibly not even loaded) on the system?

> +		return CPER_SEV_RECOVERABLE;
> +
> +	return ghes_cper_severity(gdata->error_severity);
> +}
> +/*
On 05/11/2018 10:40 AM, Borislav Petkov wrote:
> On Mon, Apr 30, 2018 at 04:33:52PM -0500, Alexandru Gagniuc wrote:
>> The policy was to panic() when GHES said that an error is "Fatal".
>> This logic is wrong for several reasons, as it doesn't take into
>> account what caused the error.
>>
>> PCIe fatal errors indicate that the link to a device is either
>> unstable or unusable. They don't indicate that the machine is on fire,
>> and they are not severe enough that we need to panic(). Instead of
>> relying on crackmonkey firmware, evaluate the error severity based on
>              ^^^^^^^^^^^^
>
> Please keep the smartass formulations for the ML only and do not let
> them leak into commit messages.

You're right. The monkeys are not crack. Instead, what a lot of
manufacturers do is maintain large monkey farms with electronic
typewriters. Given a sufficiently large farm, they take those results
which compile. Of those results, they pick and ship the one that takes
longest to boot, without the customers complaining.

That being clarified, should I replace "crackmonkey" with "broken" in
the commit message?

(snip)

>> +/* PCIe errors should not cause a panic. */
>> +static int ghes_sec_pcie_severity(struct acpi_hest_generic_data *gdata)
>> +{
>> +	struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
>> +
>> +	if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
>> +	    pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO &&
>> +	    IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER))
>
> How is PCIe error severity dependent on whether the AER error reporting
> driver is enabled (and possibly not even loaded) on the system?

Borislav, I sense some confusion. AER is not a "reporting" driver. It
handles the errors. You can't leave these errors unhandled. They
propagate to the root complex and can cause fatal MCEs when not handled.
The window to handle the error is pretty large, so it's not a concern
when you're handling it.

Alex

>> +		return CPER_SEV_RECOVERABLE;
>> +
>> +	return ghes_cper_severity(gdata->error_severity);
>> +}
>> +/*
On Fri, May 11, 2018 at 10:54:09AM -0500, Alex G. wrote:
> That being clarified, should I replace "crackmonkey" with "broken" in
> the commit message?

Keep your opinion *outside* of commit messages - their goal is to
explain *why* the change is being made in strictly technical language so
that when someone looks at git history, someone can know *why*.

> Borislav, I sense some confusion. AER is not a "reporting" driver. It
> handles the errors. You can't leave these errors unhandled. They
> propagate to the root complex and can cause fatal MCEs when not handled.
> The window to handle the error is pretty large, so it's not a concern
> when you're handling it.

I think *you* didn't get it: IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER) is not
enough of a check to confirm that there actually *is* an AER driver to
handle the errors. If you really want to make sure the driver is loaded
and functioning, then you need an explicit registering mechanism or some
other way of checking it really is there and handling errors.
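For context, the "explicit registering mechanism" suggested above could be as
small as a flag that the AER recovery path sets once it is ready, which GHES
then consults before downgrading a fatal PCIe error. The sketch below is purely
illustrative: ghes_register_pcie_error_handler() and
ghes_pcie_error_handler_present() are hypothetical names, not existing kernel
interfaces.

/* Hypothetical sketch -- none of these symbols exist in the kernel today. */

#include <linux/atomic.h>
#include <linux/types.h>

static atomic_t ghes_pcie_handler_ready = ATOMIC_INIT(0);

/* The AER recovery code would call this once it can accept queued errors. */
void ghes_register_pcie_error_handler(void)
{
	atomic_set(&ghes_pcie_handler_ready, 1);
}

/* GHES would call this instead of relying on IS_ENABLED() alone. */
static bool ghes_pcie_error_handler_present(void)
{
	return atomic_read(&ghes_pcie_handler_ready) != 0;
}

With something like this in place, ghes_sec_pcie_severity() would return
CPER_SEV_RECOVERABLE only when both the section's validation bits and this
runtime check pass.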
On 05/11/2018 11:02 AM, Borislav Petkov wrote:
> On Fri, May 11, 2018 at 10:54:09AM -0500, Alex G. wrote:
>> That being clarified, should I replace "crackmonkey" with "broken" in
>> the commit message?
>
> Keep your opinion *outside* of commit messages - their goal is to
> explain *why* the change is being made in strictly technical language so
> that when someone looks at git history, someone can know *why*.
>
>> Borislav, I sense some confusion. AER is not a "reporting" driver. It
>> handles the errors. You can't leave these errors unhandled. They
>> propagate to the root complex and can cause fatal MCEs when not handled.
>> The window to handle the error is pretty large, so it's not a concern
>> when you're handling it.
>
> I think *you* didn't get it: IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER) is not
> enough of a check to confirm that there actually *is* an AER driver to
> handle the errors. If you really want to make sure the driver is loaded
> and functioning, then you need an explicit registering mechanism or some
> other way of checking it really is there and handling errors.

config ACPI_APEI_PCIEAER
	bool "APEI PCIe AER logging/recovering support"
	depends on ACPI_APEI && PCIEAER
	help
	  PCIe AER errors may be reported via APEI firmware first mode.
	  Turn on this option to enable the corresponding support.

PCIEAER is not modularizable. QED

Alex
On Fri, May 11, 2018 at 11:12:25AM -0500, Alex G. wrote:
> > I think *you* didn't get it: IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER) is not
> > enough of a check to confirm that there actually *is* an AER driver to
> > handle the errors. If you really want to make sure the driver is loaded
> > and functioning, then you need an explicit registering mechanism or some
> > other way of checking it really is there and handling errors.
>
> config ACPI_APEI_PCIEAER
> 	bool "APEI PCIe AER logging/recovering support"
> 	depends on ACPI_APEI && PCIEAER
> 	help
> 	  PCIe AER errors may be reported via APEI firmware first mode.
> 	  Turn on this option to enable the corresponding support.
>
> PCIEAER is not modularizable. QED

QED my ass.

Read the f*ck my email again: the presence of the *code* is
not enough of a check to confirm the error has been handled.
aer_recover_work_func() can fail as that kfifo_put() in
aer_recover_queue() can too.

You need an *actual* confirmation that the error has been handled
properly and *only* *then* not panic the system. Otherwise you are
potentially leaving those errors unhandled.
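The failure mode referenced here is in aer_recover_queue(): it does a
kfifo_put() into a fixed-size ring and, at the time of this thread, only logged
a message when the ring was full. A hypothetical rework that propagates the
result to its caller, so GHES could keep its panic behaviour when queueing
fails, might look like the sketch below. The int return type, the _checked name
and the caller-side reaction are assumptions; the ring, entry struct and work
item are the existing file-local objects in drivers/pci/pcie/aer/aerdrv_core.c,
so this is meant as a drop-in inside that file, not standalone code.

/* Sketch only: the real aer_recover_queue() returns void. */
static int aer_recover_queue_checked(int domain, unsigned int bus,
				     unsigned int devfn, int severity,
				     struct aer_capability_regs *aer_regs)
{
	struct aer_recover_entry entry = {
		.bus		= bus,
		.devfn		= devfn,
		.domain		= domain,
		.severity	= severity,
		.regs		= aer_regs,
	};

	/* Locking of the ring is omitted here for brevity. */
	if (!kfifo_put(&aer_recover_ring, entry))
		return -ENOSPC;	/* ring full: the error was NOT queued */

	schedule_work(&aer_recover_work);
	return 0;
}

ghes_handle_aer() could then refuse to downgrade the severity (or escalate back
to a panic) whenever this returns an error, which provides the "actual
confirmation" being asked for, at least for the queueing step.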
On 05/11/2018 11:29 AM, Borislav Petkov wrote:
> On Fri, May 11, 2018 at 11:12:25AM -0500, Alex G. wrote:
>>> I think *you* didn't get it: IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER) is not
>>> enough of a check to confirm that there actually *is* an AER driver to
>>> handle the errors. If you really want to make sure the driver is loaded
>>> and functioning, then you need an explicit registering mechanism or some
>>> other way of checking it really is there and handling errors.
>>
>> config ACPI_APEI_PCIEAER
>> 	bool "APEI PCIe AER logging/recovering support"
>> 	depends on ACPI_APEI && PCIEAER
>> 	help
>> 	  PCIe AER errors may be reported via APEI firmware first mode.
>> 	  Turn on this option to enable the corresponding support.
>>
>> PCIEAER is not modularizable. QED
>
> QED my ass.
>
> Read the f*ck my email again: the presence of the *code* is
> not enough of a check to confirm the error has been handled.
> aer_recover_work_func() can fail as that kfifo_put() in
> aer_recover_queue() can too.
>
> You need an *actual* confirmation that the error has been handled
> properly and *only* *then* not panic the system. Otherwise you are
> potentially leaving those errors unhandled.

"How is PCIe error severity dependent on whether the AER error reporting
driver is enabled (and possibly not even loaded) on the system?"

Little about confirmation of error being handled was talked about either
in your **** email, or previous versions of this series. And quite
frankly it's besides the scope of this patch. The scope is to enable
SURPRISE!!! removal of NVMe drives and PCIe devices. For that purpose,
we don't need confirmation that the error was handled. Such a
confirmation requires a rework of GHES handling, or at least the
interaction between GHES and AER, both of which I find to be mostly
satisfactory.

You can't at this point know if the error is going to be handled.
There's code further downstream to handle this. You also didn't like it
when I wanted to handle things downstream.

I understand your concern with unhandled AER errors evolving into MCE's.
That's extremely rare, but when it happens you still panic due to the
MCE. To give you an idea of the rarity, in several months of testing, I
was only able to reproduce MCEs once, and that was with a very defective
drive, and a very idiotic test case.

If you find this solution unacceptable, that's fine. We can fix it in
firmware. We can hide all the events from the OS, contain the downstream
ports, and simulate hot-remove interrupts. All in firmware, all the time.

Alex
On Fri, May 11, 2018 at 12:01:52PM -0500, Alex G. wrote:
> I understand your concern with unhandled AER errors evolving into MCE's.
> That's extremely rare, but when it happens you still panic due to the
> MCE.

I don't like leaving holes in the handling of PCIe errors. You need to
handle only those errors which are caused by hot-removal and not affect
other error types. Or do a comprehensive PCIe errors handling of all
errors in the AER driver.
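One way to act on "handle only those errors which are caused by hot-removal"
would be for GHES to downgrade severity only when the PCIe section's AER
registers show a Surprise Down error, which is the typical signature of a
device yanked from its slot. The helper below is hypothetical (it does not
exist in the kernel); it only reuses the cast that ghes_handle_aer() already
performs on pcie_err->aer_info, and it assumes the firmware filled in the AER
info section.

#include <linux/aer.h>
#include <linux/cper.h>
#include <linux/pci.h>

/* Hypothetical helper: does this CPER PCIe section look like a surprise removal? */
static bool ghes_pcie_is_surprise_removal(struct cper_sec_pcie *pcie_err)
{
	struct aer_capability_regs *aer_regs;

	if (!(pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO))
		return false;

	aer_regs = (struct aer_capability_regs *)pcie_err->aer_info;

	/* Surprise Down Error in the uncorrectable status register. */
	return !!(aer_regs->uncor_status & PCI_ERR_UNC_SURPDN);
}

This would narrow the severity downgrade to the hot-removal case, although not
every platform reports removals as Surprise Down, so it trades completeness for
the containment being asked for.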
On 05/11/2018 12:41 PM, Borislav Petkov wrote:
> On Fri, May 11, 2018 at 12:01:52PM -0500, Alex G. wrote:
>> I understand your concern with unhandled AER errors evolving into MCE's.
>> That's extremely rare, but when it happens you still panic due to the
>> MCE.
>
> I don't like leaving holes in the handling of PCIe errors. You need to
> handle only those errors which are caused by hot-removal and not affect
> other error types. Or do a comprehensive PCIe errors handling of all
> errors in the AER driver.

Forget about how AER works, and worry about parity with native AER. If
AER is buggy, it will have the same bug in native and FFS cases. Right
now we're paranoid, over-babying the errors, and don't even make it to
the handler. How is this better?

Alex
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index c9f1971333c1..49318fba409c 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -425,8 +425,7 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
  * GHES_SEV_RECOVERABLE -> AER_NONFATAL
  * GHES_SEV_RECOVERABLE && CPER_SEC_RESET -> AER_FATAL
  * These both need to be reported and recovered from by the AER driver.
- * GHES_SEV_PANIC does not make it to this handling since the kernel must
- * panic.
+ * GHES_SEV_PANIC -> AER_FATAL
  */
 static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
 {
@@ -459,6 +458,46 @@ static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
 #endif
 }
 
+/* PCIe errors should not cause a panic. */
+static int ghes_sec_pcie_severity(struct acpi_hest_generic_data *gdata)
+{
+	struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
+
+	if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
+	    pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO &&
+	    IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER))
+		return CPER_SEV_RECOVERABLE;
+
+	return ghes_cper_severity(gdata->error_severity);
+}
+/*
+ * The severity field in the status block is oftentimes more severe than it
+ * needs to be. This makes it an unreliable metric for the severity. A more
+ * reliable way is to look at each subsection and correlate it with how well
+ * the error can be handled.
+ *  - SEC_PCIE: All PCIe errors can be handled by AER.
+ */
+static int ghes_severity(struct ghes *ghes)
+{
+	int worst_sev, sec_sev;
+	struct acpi_hest_generic_data *gdata;
+	const guid_t *section_type;
+	const struct acpi_hest_generic_status *estatus = ghes->estatus;
+
+	worst_sev = GHES_SEV_NO;
+	apei_estatus_for_each_section(estatus, gdata) {
+		section_type = (guid_t *)gdata->section_type;
+		sec_sev = ghes_cper_severity(gdata->error_severity);
+
+		if (guid_equal(section_type, &CPER_SEC_PCIE))
+			sec_sev = ghes_sec_pcie_severity(gdata);
+
+		worst_sev = max(worst_sev, sec_sev);
+	}
+
+	return worst_sev;
+}
+
 static void ghes_do_proc(struct ghes *ghes,
 			 const struct acpi_hest_generic_status *estatus)
 {
@@ -944,7 +983,7 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
 			ret = NMI_HANDLED;
 		}
 
-		sev = ghes_cper_severity(ghes->estatus->error_severity);
+		sev = ghes_severity(ghes);
 		if (sev >= GHES_SEV_PANIC) {
 			oops_begin();
 			ghes_print_queued_estatus();
The policy was to panic() when GHES said that an error is "Fatal".
This logic is wrong for several reasons, as it doesn't take into
account what caused the error.

PCIe fatal errors indicate that the link to a device is either
unstable or unusable. They don't indicate that the machine is on fire,
and they are not severe enough that we need to panic(). Instead of
relying on crackmonkey firmware, evaluate the error severity based on
what caused the error (GHES subsections).

Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
---
 drivers/acpi/apei/ghes.c | 45 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 42 insertions(+), 3 deletions(-)