
[V8,06/10] acpi: apei: panic OS with fatal error status block

Message ID 1485969413-23577-7-git-send-email-tbaicar@codeaurora.org (mailing list archive)
State New, archived

Commit Message

Tyler Baicar Feb. 1, 2017, 5:16 p.m. UTC
From: "Jonathan (Zhixiong) Zhang" <zjzhang@codeaurora.org>

Even if an error status block's severity is fatal, the kernel does not
honor the severity level by panicking.

With the firmware first model, the platform could inform the OS about a
fatal hardware error through the non-NMI GHES notification type. The OS
should panic when a hardware error record is received with this
severity.

If the severity is fatal, call panic() after the CPER data in the error
status block is printed and before each error section is handled.

Signed-off-by: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
---
 drivers/acpi/apei/ghes.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

Comments

James Morse Feb. 9, 2017, 10:48 a.m. UTC | #1
Hi Jonathan, Tyler,

On 01/02/17 17:16, Tyler Baicar wrote:
> From: "Jonathan (Zhixiong) Zhang" <zjzhang@codeaurora.org>
> 
> Even if an error status block's severity is fatal, the kernel does not
> honor the severity level by panicking.
> 
> With the firmware first model, the platform could inform the OS about a
> fatal hardware error through the non-NMI GHES notification type. The OS
> should panic when a hardware error record is received with this
> severity.
> 
> If the severity is fatal, call panic() after the CPER data in the error
> status block is printed and before each error section is handled.

> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 8756172..86c1f15 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -687,6 +689,13 @@ static int ghes_ack_error(struct acpi_hest_generic_v2 *generic_v2)
>  	return rc;
>  }
>  
> +static void __ghes_call_panic(void)
> +{
> +	if (panic_timeout == 0)
> +		panic_timeout = ghes_panic_timeout;
> +	panic("Fatal hardware error!");
> +}
> +

__ghes_panic() also has:
>	__ghes_print_estatus(KERN_EMERG, ghes->generic, ghes->estatus);

Which prints this estatus regardless of rate limiting and caching.

[ ... ]

> @@ -698,6 +707,10 @@ static int ghes_proc(struct ghes *ghes)
>  		if (ghes_print_estatus(NULL, ghes->generic, ghes->estatus))

ghes_print_estatus() uses some custom rate limiting ('2 messages every 5
seconds'), and GHES_SEV_PANIC shares the same limit as GHES_SEV_RECOVERABLE.

I think it's possible to get 2 recoverable messages, then a panic, in a 5 second
window. The rate limit will kick in and stop the panic estatus block from being
printed, but we still go on to call panic() without the real reason being printed...

(The caching thing only seems to consider identical messages; given we would
never see two panic messages, I don't think that will cause any problems.)
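
For reference, the '2 messages every 5 seconds' limit lives inside
ghes_print_estatus() and looks roughly like this (paraphrased from ghes.c
around this series, not part of this patch):

	/* Not more than 2 messages every 5 seconds */
	static DEFINE_RATELIMIT_STATE(ratelimit_corrected, 5*HZ, 2);
	static DEFINE_RATELIMIT_STATE(ratelimit_uncorrected, 5*HZ, 2);
	struct ratelimit_state *ratelimit;

	if (ghes_severity(estatus->error_severity) <= GHES_SEV_CORRECTED)
		ratelimit = &ratelimit_corrected;
	else
		/* GHES_SEV_RECOVERABLE and GHES_SEV_PANIC both land here */
		ratelimit = &ratelimit_uncorrected;

	if (__ratelimit(ratelimit)) {
		__ghes_print_estatus(pfx, generic, estatus);
		return 1;
	}
	return 0;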

>  			ghes_estatus_cache_add(ghes->generic, ghes->estatus);
>  	}
> +	if (ghes_severity(ghes->estatus->error_severity) >= GHES_SEV_PANIC) {
> +		__ghes_call_panic();
> +	}
> +

I think this ghes_severity() then panic() should go above the:
>	if (!ghes_estatus_cached(ghes->estatus)) {
and we should call __ghes_print_estatus() here too, to make sure the message
definitely got out!
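
Taken together, I'd expect ghes_proc() to end up looking something like this
right after ghes_read_estatus() succeeds (only a sketch of the proposed
ordering, not a tested change):

	/* sketch: check the severity first and always print before panicking */
	if (ghes_severity(ghes->estatus->error_severity) >= GHES_SEV_PANIC) {
		/* print unconditionally: bypass the rate limit and the cache */
		__ghes_print_estatus(KERN_EMERG, ghes->generic, ghes->estatus);
		__ghes_call_panic();
	}

	if (!ghes_estatus_cached(ghes->estatus)) {
		if (ghes_print_estatus(NULL, ghes->generic, ghes->estatus))
			ghes_estatus_cache_add(ghes->generic, ghes->estatus);
	}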


With that,
Reviewed-by: James Morse <james.morse@arm.com>


Thanks,

James
Tyler Baicar Feb. 13, 2017, 10:45 p.m. UTC | #2
Hello James,


On 2/9/2017 3:48 AM, James Morse wrote:
> Hi Jonathan, Tyler,
>
> On 01/02/17 17:16, Tyler Baicar wrote:
>> From: "Jonathan (Zhixiong) Zhang" <zjzhang@codeaurora.org>
>>
>> Even if an error status block's severity is fatal, the kernel does not
>> honor the severity level by panicking.
>>
>> With the firmware first model, the platform could inform the OS about a
>> fatal hardware error through the non-NMI GHES notification type. The OS
>> should panic when a hardware error record is received with this
>> severity.
>>
>> If the severity is fatal, call panic() after the CPER data in the error
>> status block is printed and before each error section is handled.
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index 8756172..86c1f15 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -687,6 +689,13 @@ static int ghes_ack_error(struct acpi_hest_generic_v2 *generic_v2)
>>   	return rc;
>>   }
>>   
>> +static void __ghes_call_panic(void)
>> +{
>> +	if (panic_timeout == 0)
>> +		panic_timeout = ghes_panic_timeout;
>> +	panic("Fatal hardware error!");
>> +}
>> +
> __ghes_panic() also has:
>> 	__ghes_print_estatus(KERN_EMERG, ghes->generic, ghes->estatus);
> Which prints this estatus regardless of rate limiting and caching.
>
> [ ... ]
>
>> @@ -698,6 +707,10 @@ static int ghes_proc(struct ghes *ghes)
>>   		if (ghes_print_estatus(NULL, ghes->generic, ghes->estatus))
> ghes_print_estatus() uses some custom rate limiting ('2 messages every 5
> seconds'), and GHES_SEV_PANIC shares the same limit as GHES_SEV_RECOVERABLE.
>
> I think it's possible to get 2 recoverable messages, then a panic, in a 5 second
> window. The rate limit will kick in and stop the panic estatus block from being
> printed, but we still go on to call panic() without the real reason being printed...
>
> (The caching thing only seems to consider identical messages; given we would
> never see two panic messages, I don't think that will cause any problems.)
>
>>   			ghes_estatus_cache_add(ghes->generic, ghes->estatus);
>>   	}
>> +	if (ghes_severity(ghes->estatus->error_severity) >= GHES_SEV_PANIC) {
>> +		__ghes_call_panic();
>> +	}
>> +
> I think this ghes_severity() then panic() should go above the:
>> 	if (!ghes_estatus_cached(ghes->estatus)) {
> and we should call __ghes_print_estatus() here too, to make sure the message
> definitely got out!
Okay, that makes sense. If we move this up, is there a problem with 
calling __ghes_panic() instead of making the __ghes_print_estatus() and 
__ghes_call_panic() calls here? It looks like that will just add a call 
to oops_begin() and ghes_print_queued_estatus() as well, but this is 
what ghes_notify_nmi() does if the severity is panic.

Thanks,
Tyler
> With that,
> Reviewed-by: James Morse <james.morse@arm.com>
>
>
> Thanks,
>
> James
James Morse Feb. 15, 2017, 12:13 p.m. UTC | #3
Hi Tyler,

On 13/02/17 22:45, Baicar, Tyler wrote:
> On 2/9/2017 3:48 AM, James Morse wrote:
>> On 01/02/17 17:16, Tyler Baicar wrote:
>>> From: "Jonathan (Zhixiong) Zhang" <zjzhang@codeaurora.org>
>>>
>>> Even if an error status block's severity is fatal, the kernel does not
>>> honor the severity level by panicking.
>>>
>>> With the firmware first model, the platform could inform the OS about a
>>> fatal hardware error through the non-NMI GHES notification type. The OS
>>> should panic when a hardware error record is received with this
>>> severity.
>>>
>>> If the severity is fatal, call panic() after the CPER data in the error
>>> status block is printed and before each error section is handled.

>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>>> index 8756172..86c1f15 100644
>>> --- a/drivers/acpi/apei/ghes.c
>>> +++ b/drivers/acpi/apei/ghes.c
>>> @@ -687,6 +689,13 @@ static int ghes_ack_error(struct acpi_hest_generic_v2
>>> *generic_v2)
>>>       return rc;
>>>   }
>>>   +static void __ghes_call_panic(void)
>>> +{
>>> +    if (panic_timeout == 0)
>>> +        panic_timeout = ghes_panic_timeout;
>>> +    panic("Fatal hardware error!");
>>> +}
>>> +

>> __ghes_panic() also has:
>>>     __ghes_print_estatus(KERN_EMERG, ghes->generic, ghes->estatus);
>> Which prints this estatus regardless of rate limiting and caching.

[...]

>>>               ghes_estatus_cache_add(ghes->generic, ghes->estatus);
>>>       }
>>> +    if (ghes_severity(ghes->estatus->error_severity) >= GHES_SEV_PANIC) {
>>> +        __ghes_call_panic();
>>> +    }

>> I think this ghes_severity() then panic() should go above the:
>>>     if (!ghes_estatus_cached(ghes->estatus)) {
>> and we should call __ghes_print_estatus() here too, to make sure the message
>> definitely got out!


> Okay, that makes sense. If we move this up, is there a problem with calling
> __ghes_panic() instead of making the __ghes_print_estatus() and
> __ghes_call_panic() calls here? It looks like that will just add a call to
> oops_begin() and ghes_print_queued_estatus() as well, but this is what
> ghes_notify_nmi() does if the severity is panic.


I don't think the queued stuff is relevant; isn't that just for x86-NMI messages
that it doesn't print out directly?

A quick grep shows arm64 doesn't have oops_begin(); you may have to add some
equivalent mechanism. Let's try and avoid that rabbit hole!

Given __ghes_panic() calls __ghes_print_estatus() too, you could try moving that
into your new __ghes_call_panic()... or whatever results in the fewest lines
changed!
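
For instance, the helper could take the struct ghes and do the printing
itself, something like this (only illustrative; passing ghes in is my
assumption, not a requirement):

static void __ghes_call_panic(struct ghes *ghes)
{
	/* illustrative only: get the fatal estatus out, ignoring rate limit and cache */
	__ghes_print_estatus(KERN_EMERG, ghes->generic, ghes->estatus);

	if (panic_timeout == 0)
		panic_timeout = ghes_panic_timeout;
	panic("Fatal hardware error!");
}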


Thanks,

James
Tyler Baicar Feb. 15, 2017, 5:07 p.m. UTC | #4
On 2/15/2017 5:13 AM, James Morse wrote:
> Hi Tyler,
>
> On 13/02/17 22:45, Baicar, Tyler wrote:
>> On 2/9/2017 3:48 AM, James Morse wrote:
>>> On 01/02/17 17:16, Tyler Baicar wrote:
>>>> From: "Jonathan (Zhixiong) Zhang" <zjzhang@codeaurora.org>
>>>>
>>>> Even if an error status block's severity is fatal, the kernel does not
>>>> honor the severity level by panicking.
>>>>
>>>> With the firmware first model, the platform could inform the OS about a
>>>> fatal hardware error through the non-NMI GHES notification type. The OS
>>>> should panic when a hardware error record is received with this
>>>> severity.
>>>>
>>>> If the severity is fatal, call panic() after the CPER data in the error
>>>> status block is printed and before each error section is handled.
>>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>>>> index 8756172..86c1f15 100644
>>>> --- a/drivers/acpi/apei/ghes.c
>>>> +++ b/drivers/acpi/apei/ghes.c
>>>> @@ -687,6 +689,13 @@ static int ghes_ack_error(struct acpi_hest_generic_v2
>>>> *generic_v2)
>>>>        return rc;
>>>>    }
>>>>    +static void __ghes_call_panic(void)
>>>> +{
>>>> +    if (panic_timeout == 0)
>>>> +        panic_timeout = ghes_panic_timeout;
>>>> +    panic("Fatal hardware error!");
>>>> +}
>>>> +
>>> __ghes_panic() also has:
>>>>      __ghes_print_estatus(KERN_EMERG, ghes->generic, ghes->estatus);
>>> Which prints this estatus regardless of rate limiting and caching.
> [...]
>
>>>>                ghes_estatus_cache_add(ghes->generic, ghes->estatus);
>>>>        }
>>>> +    if (ghes_severity(ghes->estatus->error_severity) >= GHES_SEV_PANIC) {
>>>> +        __ghes_call_panic();
>>>> +    }
>>> I think this ghes_severity() then panic() should go above the:
>>>>      if (!ghes_estatus_cached(ghes->estatus)) {
>>> and we should call __ghes_print_estatus() here too, to make sure the message
>>> definitely got out!
>
>> Okay, that makes sense. If we move this up, is there a problem with calling
>> __ghes_panic() instead of making the __ghes_print_estatus() and
>> __ghes_call_panic() calls here? It looks like that will just add a call to
>> oops_begin() and ghes_print_queued_estatus() as well, but this is what
>> ghes_notify_nmi() does if the severity is panic.
>
> I don't think the queued stuff is relevant; isn't that just for x86-NMI messages
> that it doesn't print out directly?
>
> A quick grep shows arm64 doesn't have oops_begin(); you may have to add some
> equivalent mechanism. Let's try and avoid that rabbit hole!
>
> Given __ghes_panic() calls __ghes_print_estatus() too, you could try moving that
> into your new __ghes_call_panic()... or whatever results in the fewest lines
> changed!
Sounds good. I will just use __ghes_print_estatus() and __ghes_call_panic().

Thanks,
Tyler

Patch

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 8756172..86c1f15 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -133,6 +133,8 @@ 
 static struct ghes_estatus_cache *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
 static atomic_t ghes_estatus_cache_alloced;
 
+static int ghes_panic_timeout __read_mostly = 30;
+
 static int ghes_ioremap_init(void)
 {
 	ghes_ioremap_area = __get_vm_area(PAGE_SIZE * GHES_IOREMAP_PAGES,
@@ -687,6 +689,13 @@  static int ghes_ack_error(struct acpi_hest_generic_v2 *generic_v2)
 	return rc;
 }
 
+static void __ghes_call_panic(void)
+{
+	if (panic_timeout == 0)
+		panic_timeout = ghes_panic_timeout;
+	panic("Fatal hardware error!");
+}
+
 static int ghes_proc(struct ghes *ghes)
 {
 	int rc;
@@ -698,6 +707,10 @@  static int ghes_proc(struct ghes *ghes)
 		if (ghes_print_estatus(NULL, ghes->generic, ghes->estatus))
 			ghes_estatus_cache_add(ghes->generic, ghes->estatus);
 	}
+	if (ghes_severity(ghes->estatus->error_severity) >= GHES_SEV_PANIC) {
+		__ghes_call_panic();
+	}
+
 	ghes_do_proc(ghes, ghes->estatus);
 
 	if (IS_HEST_TYPE_GENERIC_V2(ghes)) {
@@ -828,8 +841,6 @@  static inline void ghes_sea_remove(struct ghes *ghes)
 
 static LIST_HEAD(ghes_nmi);
 
-static int ghes_panic_timeout	__read_mostly = 30;
-
 static void ghes_proc_in_irq(struct irq_work *irq_work)
 {
 	struct llist_node *llnode, *next;
@@ -922,9 +933,7 @@  static void __ghes_panic(struct ghes *ghes)
 	__ghes_print_estatus(KERN_EMERG, ghes->generic, ghes->estatus);
 
 	/* reboot to log the error! */
-	if (panic_timeout == 0)
-		panic_timeout = ghes_panic_timeout;
-	panic("Fatal hardware error!");
+	__ghes_call_panic();
 }
 
 static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)