diff mbox

[V3,06/10] acpi: apei: panic OS with fatal error status block

Message ID 1475875882-2604-7-git-send-email-tbaicar@codeaurora.org (mailing list archive)
State New, archived
Headers show

Commit Message

Tyler Baicar Oct. 7, 2016, 9:31 p.m. UTC
From: "Jonathan (Zhixiong) Zhang" <zjzhang@codeaurora.org>

Even if an error status block's severity is fatal, the kernel does not
honor the severity level and panic.

With the firmware first model, the platform could inform the OS about a
fatal hardware error through the non-NMI GHES notification type. The OS
should panic when a hardware error record is received with this
severity.

Call panic() after CPER data in error status block is printed if
severity is fatal, before each error section is handled.

Signed-off-by: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
---
 drivers/acpi/apei/ghes.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

Comments

Suzuki K Poulose Oct. 13, 2016, 1 p.m. UTC | #1
On 07/10/16 22:31, Tyler Baicar wrote:
> From: "Jonathan (Zhixiong) Zhang" <zjzhang@codeaurora.org>
>
> Even if an error status block's severity is fatal, the kernel does not
> honor the severity level and panic.
>
> With the firmware first model, the platform could inform the OS about a
> fatal hardware error through the non-NMI GHES notification type. The OS
> should panic when a hardware error record is received with this
> severity.
>
> Call panic() after CPER data in error status block is printed if
> severity is fatal, before each error section is handled.
>
> Signed-off-by: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
> ---
>  drivers/acpi/apei/ghes.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 28d5a09..36894c8 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -141,6 +141,8 @@ static unsigned long ghes_estatus_pool_size_request;
>  static struct ghes_estatus_cache *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
>  static atomic_t ghes_estatus_cache_alloced;
>
> +static int ghes_panic_timeout __read_mostly = 30;
> +
>  static int ghes_ioremap_init(void)
>  {
>  	ghes_ioremap_area = __get_vm_area(PAGE_SIZE * GHES_IOREMAP_PAGES,
> @@ -715,6 +717,12 @@ static int ghes_proc(struct ghes *ghes)
>  		if (ghes_print_estatus(NULL, ghes->generic, ghes->estatus))
>  			ghes_estatus_cache_add(ghes->generic, ghes->estatus);
>  	}
> +	if (ghes_severity(ghes->estatus->error_severity) >= GHES_SEV_PANIC) {
> +		if (panic_timeout == 0)
> +			panic_timeout = ghes_panic_timeout;
> +		panic("Fatal hardware error!");

I think there is a chance that we might miss the o/p of ghes_print_estatus() as we use
no pfx, and it could default to the normal loglevel and would never get printed
if panic() is encountered before it. On the other hand, there is already a
__ghes_panic() which does similar stuff. Is there a way we could reuse
(may be even parts of) it ? Or at least use KERN_EMERG for the ghes_print_estatus(),
if the severity could result in panic() ?

Cheers
Suzuki
Tyler Baicar Oct. 13, 2016, 11:34 p.m. UTC | #2
Hello Suzuki,

On 10/13/2016 7:00 AM, Suzuki K Poulose wrote:
> On 07/10/16 22:31, Tyler Baicar wrote:
>> From: "Jonathan (Zhixiong) Zhang" <zjzhang@codeaurora.org>
>>
>> Even if an error status block's severity is fatal, the kernel does not
>> honor the severity level and panic.
>>
>> With the firmware first model, the platform could inform the OS about a
>> fatal hardware error through the non-NMI GHES notification type. The OS
>> should panic when a hardware error record is received with this
>> severity.
>>
>> Call panic() after CPER data in error status block is printed if
>> severity is fatal, before each error section is handled.
>>
>> Signed-off-by: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
>> ---
>>  drivers/acpi/apei/ghes.c | 10 ++++++++--
>>  1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index 28d5a09..36894c8 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -141,6 +141,8 @@ static unsigned long ghes_estatus_pool_size_request;
>>  static struct ghes_estatus_cache 
>> *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
>>  static atomic_t ghes_estatus_cache_alloced;
>>
>> +static int ghes_panic_timeout __read_mostly = 30;
>> +
>>  static int ghes_ioremap_init(void)
>>  {
>>      ghes_ioremap_area = __get_vm_area(PAGE_SIZE * GHES_IOREMAP_PAGES,
>> @@ -715,6 +717,12 @@ static int ghes_proc(struct ghes *ghes)
>>          if (ghes_print_estatus(NULL, ghes->generic, ghes->estatus))
>>              ghes_estatus_cache_add(ghes->generic, ghes->estatus);
>>      }
>> +    if (ghes_severity(ghes->estatus->error_severity) >= 
>> GHES_SEV_PANIC) {
>> +        if (panic_timeout == 0)
>> +            panic_timeout = ghes_panic_timeout;
>> +        panic("Fatal hardware error!");
>
> I think there is a chance that we might miss the o/p of 
> ghes_print_estatus() as we use
> no pfx, and it could default to the normal loglevel and would never 
> get printed
> if panic() is encountered before it. On the other hand, there is 
> already a
> __ghes_panic() which does similar stuff. Is there a way we could reuse
> (may be even parts of) it ? Or at least use KERN_EMERG for the 
> ghes_print_estatus(),
> if the severity could result in panic() ?
__ghes_panic() does additional handling which we do not want to do here. 
I could make the following a helper function so it is not duplicated though:

if (panic_timeout == 0)
     panic_timeout = ghes_panic_timeout;
panic("Fatal hardware error!");

The pfx is actually being calculated already in __ghes_print_estatus():

         if (pfx == NULL) {
                 if (ghes_severity(estatus->error_severity) <=
                     GHES_SEV_CORRECTED)
                         pfx = KERN_WARNING;
                 else
                         pfx = KERN_ERR;
         }

 From ghes.h:

enum {
         GHES_SEV_NO = 0x0,
         GHES_SEV_CORRECTED = 0x1,
         GHES_SEV_RECOVERABLE = 0x2,
         GHES_SEV_PANIC = 0x3,
};

This will make the pfx KERN_ERR for the case of a panic.

Thanks,
Tyler
>
> Cheers
> Suzuki
>
diff mbox

Patch

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 28d5a09..36894c8 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -141,6 +141,8 @@  static unsigned long ghes_estatus_pool_size_request;
 static struct ghes_estatus_cache *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
 static atomic_t ghes_estatus_cache_alloced;
 
+static int ghes_panic_timeout __read_mostly = 30;
+
 static int ghes_ioremap_init(void)
 {
 	ghes_ioremap_area = __get_vm_area(PAGE_SIZE * GHES_IOREMAP_PAGES,
@@ -715,6 +717,12 @@  static int ghes_proc(struct ghes *ghes)
 		if (ghes_print_estatus(NULL, ghes->generic, ghes->estatus))
 			ghes_estatus_cache_add(ghes->generic, ghes->estatus);
 	}
+	if (ghes_severity(ghes->estatus->error_severity) >= GHES_SEV_PANIC) {
+		if (panic_timeout == 0)
+			panic_timeout = ghes_panic_timeout;
+		panic("Fatal hardware error!");
+	}
+
 	ghes_do_proc(ghes, ghes->estatus);
 
 	if (ghes->generic_v2) {
@@ -859,8 +867,6 @@  static atomic_t ghes_in_nmi = ATOMIC_INIT(0);
 
 static LIST_HEAD(ghes_nmi);
 
-static int ghes_panic_timeout	__read_mostly = 30;
-
 static void ghes_proc_in_irq(struct irq_work *irq_work)
 {
 	struct llist_node *llnode, *next;