[RFC,v4,3/3] acpi: apei: Do not panic() on PCIe errors reported through GHES

Message ID 20180430213358.8319-3-mr.nuke.me@gmail.com (mailing list archive)
State RFC, archived

Commit Message

Alex G. April 30, 2018, 9:33 p.m. UTC
The policy was to panic() when GHES said that an error is "Fatal".
This logic is wrong for several reasons, as it doesn't take into
account what caused the error.

PCIe fatal errors indicate that the link to a device is either
unstable or unusable. They don't indicate that the machine is on fire,
and they are not severe enough that we need to panic(). Instead of
relying on crackmonkey firmware, evaluate the error severity based on
what caused the error (GHES subsections).

Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
---
 drivers/acpi/apei/ghes.c | 45 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 42 insertions(+), 3 deletions(-)

Comments

Borislav Petkov May 11, 2018, 3:40 p.m. UTC | #1
On Mon, Apr 30, 2018 at 04:33:52PM -0500, Alexandru Gagniuc wrote:
> The policy was to panic() when GHES said that an error is "Fatal".
> This logic is wrong for several reasons, as it doesn't take into
> account what caused the error.
> 
> PCIe fatal errors indicate that the link to a device is either
> unstable or unusable. They don't indicate that the machine is on fire,
> and they are not severe enough that we need to panic(). Instead of
> relying on crackmonkey firmware, evaluate the error severity based on
	     ^^^^^^^^^^^^

Please keep the smartass formulations for the ML only and do not let
them leak into commit messages.

> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
> ---
>  drivers/acpi/apei/ghes.c | 45 ++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 42 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index c9f1971333c1..49318fba409c 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -425,8 +425,7 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
>   * GHES_SEV_RECOVERABLE -> AER_NONFATAL
>   * GHES_SEV_RECOVERABLE && CPER_SEC_RESET -> AER_FATAL
>   *     These both need to be reported and recovered from by the AER driver.
> - * GHES_SEV_PANIC does not make it to this handling since the kernel must
> - *     panic.
> + * GHES_SEV_PANIC -> AER_FATAL
>   */
>  static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
>  {
> @@ -459,6 +458,46 @@ static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
>  #endif
>  }
>  
> +/* PCIe errors should not cause a panic. */
> +static int ghes_sec_pcie_severity(struct acpi_hest_generic_data *gdata)
> +{
> +	struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
> +
> +	if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
> +	    pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO &&
> +	    IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER))

How is PCIe error severity dependent on whether the AER error reporting
driver is enabled (and possibly not even loaded) on the system?

> +		return CPER_SEV_RECOVERABLE;
> +
> +	return ghes_cper_severity(gdata->error_severity);
> +}
> +/*
Alex G. May 11, 2018, 3:54 p.m. UTC | #2
On 05/11/2018 10:40 AM, Borislav Petkov wrote:
> On Mon, Apr 30, 2018 at 04:33:52PM -0500, Alexandru Gagniuc wrote:
>> The policy was to panic() when GHES said that an error is "Fatal".
>> This logic is wrong for several reasons, as it doesn't take into
>> account what caused the error.
>>
>> PCIe fatal errors indicate that the link to a device is either
>> unstable or unusable. They don't indicate that the machine is on fire,
>> and they are not severe enough that we need to panic(). Instead of
>> relying on crackmonkey firmware, evaluate the error severity based on
> 	     ^^^^^^^^^^^^
> 
> Please keep the smartass formulations for the ML only and do not let
> them leak into commit messages.

You're right. The monkeys are not crack. Instead, what a lot of
manufacturers do is maintain large monkey farms with electronic
typewriters. Given a sufficiently large farm, they take those results
which compile. Of those results, they pick and ship the one that takes
the longest to boot without customers complaining.

That being clarified, should I replace "crackmonkey" with "broken" in
the commit message?

(snip)

>> +/* PCIe errors should not cause a panic. */
>> +static int ghes_sec_pcie_severity(struct acpi_hest_generic_data *gdata)
>> +{
>> +	struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
>> +
>> +	if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
>> +	    pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO &&
>> +	    IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER))
> 
> How is PCIe error severity dependent on whether the AER error reporting
> driver is enabled (and possibly not even loaded) on the system?

Borislav, I sense some confusion. AER is not a "reporting" driver. It
handles the errors. You can't leave these errors unhandled. They
propagate to the root complex and can cause fatal MCEs when not handled.
The window to handle the error is pretty large, so it's not a concern
when you're handling it.

Alex

>> +		return CPER_SEV_RECOVERABLE;
>> +
>> +	return ghes_cper_severity(gdata->error_severity);
>> +}
>> +/*
> 
Borislav Petkov May 11, 2018, 4:02 p.m. UTC | #3
On Fri, May 11, 2018 at 10:54:09AM -0500, Alex G. wrote:
> That being clarified, should I replace "crackmonkey" with "broken" in
> the commit message?

Keep your opinion *outside* of commit messages - their goal is to
explain *why* the change is being made, in strictly technical language,
so that whoever looks at git history later can know *why*.

> Borislav, I sense some confusion. AER is not a "reporting" driver. It
> handles the errors. You can't leave these errors unhandled. They
> propagate to the root complex and can cause fatal MCEs when not handled.
> The window to handle the error is pretty large, so it's not a concern
> when you're handling it.

I think *you* didn't get it: IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER) is not
enough of a check to confirm that there actually *is* an AER driver to
handle the errors. If you really want to make sure the driver is loaded
and functioning, then you need an explicit registering mechanism or some
other way of checking it really is there and handling errors.
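
For illustration, the kind of explicit check being asked for here might
look something like the sketch below: a flag that the AER core would set
once it is actually ready to consume GHES-reported errors, consulted
instead of the bare IS_ENABLED() test. The ghes_aer_handler_register()
hook and the ghes_aer_handler_ready flag are hypothetical and do not
exist in the kernel; this is only a sketch of the idea, not a proposed
implementation.

/* Hypothetical sketch -- neither the flag nor the hook exists in the kernel. */
static bool ghes_aer_handler_ready;

/* Would be called by the AER core once it is initialized and able to
 * consume errors queued from the GHES path. */
void ghes_aer_handler_register(void)
{
	ghes_aer_handler_ready = true;
}

static int ghes_sec_pcie_severity(struct acpi_hest_generic_data *gdata)
{
	struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);

	/* Downgrade the severity only when something is actually prepared
	 * to handle the error, not merely when the AER code happens to be
	 * compiled in. */
	if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
	    pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO &&
	    ghes_aer_handler_ready)
		return CPER_SEV_RECOVERABLE;

	return ghes_cper_severity(gdata->error_severity);
}

Whether such a hook would live in the AER core or in GHES is a design
question the thread leaves open.
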
Alex G. May 11, 2018, 4:12 p.m. UTC | #4
On 05/11/2018 11:02 AM, Borislav Petkov wrote:
> On Fri, May 11, 2018 at 10:54:09AM -0500, Alex G. wrote:
>> That being clarified, should I replace "crackmonkey" with "broken" in
>> the commit message?
> 
> Keep your opinion *outside* of commit messages - their goal is to
> explain *why* the change is being made, in strictly technical language,
> so that whoever looks at git history later can know *why*.
> 
>> Borislav, I sense some confusion. AER is not a "reporting" driver. It
>> handles the errors. You can't leave these errors unhandled. They
>> propagate to the root complex and can cause fatal MCEs when not handled.
>> The window to handle the error is pretty large, so it's not a concern
>> when you're handling it.
> 
> I think *you* didn't get it: IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER) is not
> enough of a check to confirm that there actually *is* an AER driver to
> handle the errors. If you really want to make sure the driver is loaded
> and functioning, then you need an explicit registering mechanism or some
> other way of checking it really is there and handling errors.

config ACPI_APEI_PCIEAER
	bool "APEI PCIe AER logging/recovering support"
	depends on ACPI_APEI && PCIEAER
	help
	  PCIe AER errors may be reported via APEI firmware first mode.
	  Turn on this option to enable the corresponding support.

PCIEAER is not modularizable. QED

Alex
Borislav Petkov May 11, 2018, 4:29 p.m. UTC | #5
On Fri, May 11, 2018 at 11:12:25AM -0500, Alex G. wrote:
> > I think *you* didn't get it: IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER) is not
> > enough of a check to confirm that there actually *is* an AER driver to
> > handle the errors. If you really want to make sure the driver is loaded
> > and functioning, then you need an explicit registering mechanism or some
> > other way of checking it really is there and handling errors.
> 
> config ACPI_APEI_PCIEAER
> 	bool "APEI PCIe AER logging/recovering support"
> 	depends on ACPI_APEI && PCIEAER
> 	help
> 	  PCIe AER errors may be reported via APEI firmware first mode.
> 	  Turn on this option to enable the corresponding support.
> 
> PCIAER is not modularizable. QED

QED my ass.

Read my email again: the presence of the *code* is
not enough of a check to confirm the error has been handled.
aer_recover_work_func() can fail as that kfifo_put() in
aer_recover_queue() can too.

You need an *actual* confirmation that the error has been handled
properly and *only* *then* not panic the system. Otherwise you are
potentially leaving those errors unhandled.
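
For context, the queueing path referred to here looks roughly like the
sketch below (simplified, not the exact kernel source). kfifo_put()
returns false when the ring is full, so an error can be dropped with
nothing more than a log message, and the GHES code that queued it never
learns whether recovery actually ran.

/* Simplified sketch of the AER recovery queueing path. */
#include <linux/aer.h>
#include <linux/kfifo.h>
#include <linux/pci.h>
#include <linux/workqueue.h>

struct aer_recover_entry {
	u8	bus;
	u8	devfn;
	u16	domain;
	int	severity;
	struct aer_capability_regs *regs;
};

static void aer_recover_work_func(struct work_struct *work);
static DECLARE_WORK(aer_recover_work, aer_recover_work_func);
static DEFINE_KFIFO(aer_recover_ring, struct aer_recover_entry, 16);

void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
		       int severity, struct aer_capability_regs *aer_regs)
{
	struct aer_recover_entry entry = {
		.bus		= bus,
		.devfn		= devfn,
		.domain		= domain,
		.severity	= severity,
		.regs		= aer_regs,
	};

	/* If the ring is full, the error is lost as far as the caller is
	 * concerned; only a log message records the drop. */
	if (kfifo_put(&aer_recover_ring, entry))
		schedule_work(&aer_recover_work);
	else
		pr_err("AER recover: Buffer overflow when recovering AER for %04x:%02x:%02x:%x\n",
		       domain, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
}
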
Alex G. May 11, 2018, 5:01 p.m. UTC | #6
On 05/11/2018 11:29 AM, Borislav Petkov wrote:
> On Fri, May 11, 2018 at 11:12:25AM -0500, Alex G. wrote:
>>> I think *you* didn't get it: IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER) is not
>>> enough of a check to confirm that there actually *is* an AER driver to
>>> handle the errors. If you really want to make sure the driver is loaded
>>> and functioning, then you need an explicit registering mechanism or some
>>> other way of checking it really is there and handling errors.
>>
>> config ACPI_APEI_PCIEAER
>> 	bool "APEI PCIe AER logging/recovering support"
>> 	depends on ACPI_APEI && PCIEAER
>> 	help
>> 	  PCIe AER errors may be reported via APEI firmware first mode.
>> 	  Turn on this option to enable the corresponding support.
>>
>> PCIAER is not modularizable. QED
> 
> QED my ass.
> 
> Read the f*ck my email again: the presence of the *code* is
> not enough of a check to confirm the error has been handled.
> aer_recover_work_func() can fail as that kfifo_put() in
> aer_recover_queue() can too.
> 
> You need an *actual* confirmation that the error has been handled
> properly and *only* *then* not panic the system. Otherwise you are
> potentially leaving those errors unhandled.


"How is PCIe error severity dependent on whether the AER error reporting
 driver is enabled (and possibly not even loaded) on the system?"

Little was said about confirming that the error has been handled, either
in your **** email or in previous versions of this series. And quite
frankly, it's beside the scope of this patch.

The scope is to enable SURPRISE!!! removal of NVMe drives and PCIe
devices. For that purpose, we don't need confirmation that the error was
handled. Such a confirmation would require a rework of GHES handling, or
at least of the interaction between GHES and AER, both of which I find
to be mostly satisfactory.

You can't at this point know if the error is going to be handled.
There's code further downstream to handle this. You also didn't like it
when I wanted to handle things downstream.

I understand your concern with unhandled AER errors evolving into MCEs.
That's extremely rare, but when it happens you still panic due to the
MCE. To give you an idea of the rarity, in several months of testing, I
was only able to reproduce MCEs once, and that was with a very defective
drive, and a very idiotic test case.

If you find this solution unacceptable, that's fine. We can fix it in
firmware. We can hide all the events from the OS, contain the downstream
ports, and simulate hot-remove interrupts. All in firmware, all the time.

Alex
Borislav Petkov May 11, 2018, 5:41 p.m. UTC | #7
On Fri, May 11, 2018 at 12:01:52PM -0500, Alex G. wrote:
> I understand your concern with unhandled AER errors evolving into MCEs.
> That's extremely rare, but when it happens you still panic due to the
> MCE.

I don't like leaving holes in the handling of PCIe errors. You need to
handle only those errors which are caused by hot-removal and not affect
other error types. Or do a comprehensive PCIe errors handling of all
errors in the AER driver.
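
As a rough sketch of the narrower handling suggested here, the severity
downgrade could be restricted to errors whose AER status shows a
Surprise Down condition, i.e. the hot-removal case, instead of covering
every PCIe section. The cast of the CPER AER payload to struct
aer_capability_regs mirrors what ghes_handle_aer() already does; the
rest is illustrative only, not part of the posted patch.

/* Illustrative only: downgrade severity just for Surprise Down errors. */
static int ghes_sec_pcie_severity(struct acpi_hest_generic_data *gdata)
{
	struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
	struct aer_capability_regs *aer_regs;

	if (!(pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO))
		return ghes_cper_severity(gdata->error_severity);

	aer_regs = (struct aer_capability_regs *)pcie_err->aer_info;

	/* PCI_ERR_UNC_SURPDN is the Surprise Down bit in the AER
	 * Uncorrectable Error Status register. */
	if (aer_regs->uncor_status & PCI_ERR_UNC_SURPDN)
		return CPER_SEV_RECOVERABLE;

	return ghes_cper_severity(gdata->error_severity);
}
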
Alex G. May 11, 2018, 5:56 p.m. UTC | #8
On 05/11/2018 12:41 PM, Borislav Petkov wrote:
> On Fri, May 11, 2018 at 12:01:52PM -0500, Alex G. wrote:
>> I understand your concern with unhandled AER errors evolving into MCEs.
>> That's extremely rare, but when it happens you still panic due to the
>> MCE.
> 
> I don't like leaving holes in the handling of PCIe errors. You need to
> handle only those errors which are caused by hot-removal and not affect
> other error types. Or do a comprehensive PCIe errors handling of all
> errors in the AER driver.

Forget about how AER works, and worry about parity with native AER. If
AER is buggy, it will have the same bug in both the native and FFS cases.
Right now we're paranoid, over-babying the errors, and they don't even
make it to the handler. How is this better?

Alex

Patch

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index c9f1971333c1..49318fba409c 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -425,8 +425,7 @@  static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
  * GHES_SEV_RECOVERABLE -> AER_NONFATAL
  * GHES_SEV_RECOVERABLE && CPER_SEC_RESET -> AER_FATAL
  *     These both need to be reported and recovered from by the AER driver.
- * GHES_SEV_PANIC does not make it to this handling since the kernel must
- *     panic.
+ * GHES_SEV_PANIC -> AER_FATAL
  */
 static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
 {
@@ -459,6 +458,46 @@  static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
 #endif
 }
 
+/* PCIe errors should not cause a panic. */
+static int ghes_sec_pcie_severity(struct acpi_hest_generic_data *gdata)
+{
+	struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
+
+	if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
+	    pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO &&
+	    IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER))
+		return CPER_SEV_RECOVERABLE;
+
+	return ghes_cper_severity(gdata->error_severity);
+}
+/*
+ * The severity field in the status block is oftentimes more severe than it
+ * needs to be. This makes it an unreliable metric for the severity. A more
+ * reliable way is to look at each subsection and correlate it with how well
+ * the error can be handled.
+ *   - SEC_PCIE: All PCIe errors can be handled by AER.
+ */
+static int ghes_severity(struct ghes *ghes)
+{
+	int worst_sev, sec_sev;
+	struct acpi_hest_generic_data *gdata;
+	const guid_t *section_type;
+	const struct acpi_hest_generic_status *estatus = ghes->estatus;
+
+	worst_sev = GHES_SEV_NO;
+	apei_estatus_for_each_section(estatus, gdata) {
+		section_type = (guid_t *)gdata->section_type;
+		sec_sev = ghes_cper_severity(gdata->error_severity);
+
+		if (guid_equal(section_type, &CPER_SEC_PCIE))
+			sec_sev = ghes_sec_pcie_severity(gdata);
+
+		worst_sev = max(worst_sev, sec_sev);
+	}
+
+	return worst_sev;
+}
+
 static void ghes_do_proc(struct ghes *ghes,
 			 const struct acpi_hest_generic_status *estatus)
 {
@@ -944,7 +983,7 @@  static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
 			ret = NMI_HANDLED;
 		}
 
-		sev = ghes_cper_severity(ghes->estatus->error_severity);
+		sev = ghes_severity(ghes);
 		if (sev >= GHES_SEV_PANIC) {
 			oops_begin();
 			ghes_print_queued_estatus();