Message ID | 20180416215903.7318-5-mr.nuke.me@gmail.com (mailing list archive) |
---|---|
State | RFC, archived |
Headers | show |
On Mon, Apr 16, 2018 at 04:59:03PM -0500, Alexandru Gagniuc wrote: > There seems to be a culture amongst BIOS teams to want to crash the > OS when an error can't be handled in firmware. Marking GHES errors as > "fatal" is a very common way to do this. > > However, a number of errors reported by GHES may be fatal in the sense > a device or link is lost, but are not fatal to the system. When there > is a disagreement with firmware about the handleability of an error, > print a warning message. > > Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com> > --- > drivers/acpi/apei/ghes.c | 10 +++++++++- > 1 file changed, 9 insertions(+), 1 deletion(-) > > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c > index e0528da4e8f8..6a117825611d 100644 > --- a/drivers/acpi/apei/ghes.c > +++ b/drivers/acpi/apei/ghes.c > @@ -535,13 +535,14 @@ static const struct ghes_handler *get_handler(const guid_t *type) > static void ghes_do_proc(struct ghes *ghes, > const struct acpi_hest_generic_status *estatus) > { > - int sev, sec_sev; > + int sev, sec_sev, corrected_sev; > struct acpi_hest_generic_data *gdata; > const struct ghes_handler *handler; > guid_t *sec_type; > guid_t *fru_id = &NULL_UUID_LE; > char *fru_text = ""; > > + corrected_sev = GHES_SEV_NO; > sev = ghes_severity(estatus->error_severity); > apei_estatus_for_each_section(estatus, gdata) { > sec_type = (guid_t *)gdata->section_type; > @@ -563,6 +564,13 @@ static void ghes_do_proc(struct ghes *ghes, > sec_sev, err, > gdata->error_data_length); > } > + > + corrected_sev = max(corrected_sev, sec_sev); > + } > + > + if ((sev >= GHES_SEV_PANIC) && (corrected_sev < sev)) { > + pr_warn("FIRMWARE BUG: Firmware sent fatal error that we were able to correct"); > + pr_warn("BROKEN FIRMWARE: Complain to your hardware vendor"); No, I don't want any of that crap issuing stuff in dmesg and then people opening bugs and running around and trying to replace hardware. We either can handle the error and log a normal record somewhere or we cannot and explode. The complaining about the FW doesn't bring shit.
On 04/18/2018 12:54 PM, Borislav Petkov wrote: > On Mon, Apr 16, 2018 at 04:59:03PM -0500, Alexandru Gagniuc wrote: (snip) >> + >> + corrected_sev = max(corrected_sev, sec_sev); >> + } >> + >> + if ((sev >= GHES_SEV_PANIC) && (corrected_sev < sev)) { >> + pr_warn("FIRMWARE BUG: Firmware sent fatal error that we were able to correct"); >> + pr_warn("BROKEN FIRMWARE: Complain to your hardware vendor"); > > No, I don't want any of that crap issuing stuff in dmesg and then people > opening bugs and running around and trying to replace hardware. > > We either can handle the error and log a normal record somewhere or we > cannot and explode. There is value in this. From my observations, fw claims it will do everything through FFS, yet fails to fully handle the situation. It's rooted in FW's assumptions about OS behavior. Because the (old) versions of windows, esxi, and rhel used during development crash, fw assumes that _all_ OSes crash. The result in a surprising majority of cases is that FFS doesn't properly handle recurring errors, and fw is, in fact, broken. > The complaining about the FW doesn't bring shit. You are correct. It doesn't bring defecation. It brings a red flag that helps people get closer to the root cause of problems. That being said, I can just drop this patch. Alex -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Apr 19, 2018 at 10:11:03AM -0500, Alex G. wrote: > There is value in this. From my observations, fw claims it will do > everything through FFS, yet fails to fully handle the situation. It's > rooted in FW's assumptions about OS behavior. Because the (old) versions > of windows, esxi, and rhel used during development crash, fw assumes > that _all_ OSes crash. The result in a surprising majority of cases is > that FFS doesn't properly handle recurring errors, and fw is, in fact, > broken. So FW being broken is a social secret. But we don't care. We have tried, nothing happens. No one moves. The crack monkeys which program it have long moved to the next release and you hear crap like, "we don't support linux" and other bullshit. What we do now is to try to make the best of it - we either can handle an error *without* firmware's help or we panic. If we can recover from it, let's do that without screaming about something the user can't deal with anyway. All those FW_ERR printks cause nothing but expensive support calls, the outcome of which is nothing. Just a lot of money down the drain.
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c index e0528da4e8f8..6a117825611d 100644 --- a/drivers/acpi/apei/ghes.c +++ b/drivers/acpi/apei/ghes.c @@ -535,13 +535,14 @@ static const struct ghes_handler *get_handler(const guid_t *type) static void ghes_do_proc(struct ghes *ghes, const struct acpi_hest_generic_status *estatus) { - int sev, sec_sev; + int sev, sec_sev, corrected_sev; struct acpi_hest_generic_data *gdata; const struct ghes_handler *handler; guid_t *sec_type; guid_t *fru_id = &NULL_UUID_LE; char *fru_text = ""; + corrected_sev = GHES_SEV_NO; sev = ghes_severity(estatus->error_severity); apei_estatus_for_each_section(estatus, gdata) { sec_type = (guid_t *)gdata->section_type; @@ -563,6 +564,13 @@ static void ghes_do_proc(struct ghes *ghes, sec_sev, err, gdata->error_data_length); } + + corrected_sev = max(corrected_sev, sec_sev); + } + + if ((sev >= GHES_SEV_PANIC) && (corrected_sev < sev)) { + pr_warn("FIRMWARE BUG: Firmware sent fatal error that we were able to correct"); + pr_warn("BROKEN FIRMWARE: Complain to your hardware vendor"); } }
There seems to be a culture amongst BIOS teams to want to crash the OS when an error can't be handled in firmware. Marking GHES errors as "fatal" is a very common way to do this. However, a number of errors reported by GHES may be fatal in the sense a device or link is lost, but are not fatal to the system. When there is a disagreement with firmware about the handleability of an error, print a warning message. Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com> --- drivers/acpi/apei/ghes.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-)