| Message ID | 1485969413-23577-7-git-send-email-tbaicar@codeaurora.org (mailing list archive) |
|---|---|
| State | New, archived |
Hi Jonathan, Tyler,

On 01/02/17 17:16, Tyler Baicar wrote:
> From: "Jonathan (Zhixiong) Zhang" <zjzhang@codeaurora.org>
>
> Even if an error status block's severity is fatal, the kernel does not
> honor the severity level and panic.
>
> With the firmware first model, the platform could inform the OS about a
> fatal hardware error through the non-NMI GHES notification type. The OS
> should panic when a hardware error record is received with this
> severity.
>
> Call panic() after CPER data in error status block is printed if
> severity is fatal, before each error section is handled.

> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 8756172..86c1f15 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -687,6 +689,13 @@ static int ghes_ack_error(struct acpi_hest_generic_v2 *generic_v2)
>          return rc;
>  }
>
> +static void __ghes_call_panic(void)
> +{
> +        if (panic_timeout == 0)
> +                panic_timeout = ghes_panic_timeout;
> +        panic("Fatal hardware error!");
> +}
> +

__ghes_panic() also has:

>         __ghes_print_estatus(KERN_EMERG, ghes->generic, ghes->estatus);

Which prints this estatus regardless of rate limiting and caching.

[ ... ]

> @@ -698,6 +707,10 @@ static int ghes_proc(struct ghes *ghes)
>                  if (ghes_print_estatus(NULL, ghes->generic, ghes->estatus))

ghes_print_estatus() uses some custom rate limiting ('2 messages every 5
seconds'); GHES_SEV_PANIC shares the same limit as GHES_SEV_RECOVERABLE.

I think it's possible to get 2 recoverable messages, then a panic, in a 5 second
window. The rate limit will kick in to stop the panic estatus block being
printed, but we still go on to call panic() without the real reason being
printed...

(The caching thing only seems to consider identical messages; given we would
never see two panic messages, I don't think that will cause any problems.)

>                          ghes_estatus_cache_add(ghes->generic, ghes->estatus);
>          }
> +        if (ghes_severity(ghes->estatus->error_severity) >= GHES_SEV_PANIC) {
> +                __ghes_call_panic();
> +        }
> +

I think this ghes_severity() then panic() should go above the:

>         if (!ghes_estatus_cached(ghes->estatus)) {

and we should call __ghes_print_estatus() here too, to make sure the message
definitely got out!

With that,
Reviewed-by: James Morse <james.morse@arm.com>

Thanks,

James
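For readers without the tree to hand, the rate limiting James describes sits in ghes_print_estatus(). The sketch below is a simplified rendering of that logic, based on the drivers/acpi/apei/ghes.c of this era rather than a verbatim copy; it shows how recoverable and panic severities end up sharing one ratelimit state.

```c
/*
 * Simplified sketch of the rate limiting James refers to (based on the
 * ghes.c of this era, not verbatim). Corrected errors get one ratelimit
 * state; everything else -- GHES_SEV_RECOVERABLE *and* GHES_SEV_PANIC --
 * shares a second one, so two recoverable records in a five second
 * window can suppress the printout of a subsequent fatal record.
 */
static int ghes_print_estatus(const char *pfx,
                              const struct acpi_hest_generic *generic,
                              const struct acpi_hest_generic_status *estatus)
{
        /* Not more than 2 messages every 5 seconds */
        static DEFINE_RATELIMIT_STATE(ratelimit_corrected, 5*HZ, 2);
        static DEFINE_RATELIMIT_STATE(ratelimit_uncorrected, 5*HZ, 2);
        struct ratelimit_state *ratelimit;

        if (ghes_severity(estatus->error_severity) <= GHES_SEV_CORRECTED)
                ratelimit = &ratelimit_corrected;
        else
                ratelimit = &ratelimit_uncorrected; /* shared with panic severity */

        if (__ratelimit(ratelimit)) {
                __ghes_print_estatus(pfx, generic, estatus);
                return 1;
        }
        return 0;
}
```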
Hello James,

On 2/9/2017 3:48 AM, James Morse wrote:
> Hi Jonathan, Tyler,
>
> On 01/02/17 17:16, Tyler Baicar wrote:
>> From: "Jonathan (Zhixiong) Zhang" <zjzhang@codeaurora.org>
>>
>> Even if an error status block's severity is fatal, the kernel does not
>> honor the severity level and panic.
>>
>> With the firmware first model, the platform could inform the OS about a
>> fatal hardware error through the non-NMI GHES notification type. The OS
>> should panic when a hardware error record is received with this
>> severity.
>>
>> Call panic() after CPER data in error status block is printed if
>> severity is fatal, before each error section is handled.
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index 8756172..86c1f15 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -687,6 +689,13 @@ static int ghes_ack_error(struct acpi_hest_generic_v2 *generic_v2)
>>          return rc;
>>  }
>>
>> +static void __ghes_call_panic(void)
>> +{
>> +        if (panic_timeout == 0)
>> +                panic_timeout = ghes_panic_timeout;
>> +        panic("Fatal hardware error!");
>> +}
>> +
> __ghes_panic() also has:
>>         __ghes_print_estatus(KERN_EMERG, ghes->generic, ghes->estatus);
> Which prints this estatus regardless of rate limiting and caching.
>
> [ ... ]
>
>> @@ -698,6 +707,10 @@ static int ghes_proc(struct ghes *ghes)
>>                  if (ghes_print_estatus(NULL, ghes->generic, ghes->estatus))
> ghes_print_estatus() uses some custom rate limiting ('2 messages every 5
> seconds'); GHES_SEV_PANIC shares the same limit as GHES_SEV_RECOVERABLE.
>
> I think it's possible to get 2 recoverable messages, then a panic, in a 5 second
> window. The rate limit will kick in to stop the panic estatus block being
> printed, but we still go on to call panic() without the real reason being printed...
>
> (The caching thing only seems to consider identical messages; given we would
> never see two panic messages, I don't think that will cause any problems.)
>
>>                          ghes_estatus_cache_add(ghes->generic, ghes->estatus);
>>          }
>> +        if (ghes_severity(ghes->estatus->error_severity) >= GHES_SEV_PANIC) {
>> +                __ghes_call_panic();
>> +        }
>> +
> I think this ghes_severity() then panic() should go above the:
>>         if (!ghes_estatus_cached(ghes->estatus)) {
> and we should call __ghes_print_estatus() here too, to make sure the message
> definitely got out!

Okay, that makes sense. If we move this up, is there a problem with calling
__ghes_panic() instead of making the __ghes_print_estatus() and
__ghes_call_panic() calls here? It looks like that will just add a call to
oops_begin() and ghes_print_queued_estatus() as well, but this is what
ghes_notify_nmi() does if the severity is panic.

Thanks,
Tyler

> With that,
> Reviewed-by: James Morse <james.morse@arm.com>
>
>
> Thanks,
>
> James
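For context on Tyler's question, __ghes_panic() at this point in the tree does roughly the following. This is a sketch assembled from the calls named in the thread and the diff context in the patch at the end of this page, not a verbatim copy of the upstream function.

```c
/*
 * Rough sketch of __ghes_panic() as it stands before this patch
 * (assembled from the calls mentioned in the thread; not verbatim).
 * Calling it from ghes_proc() would pull in oops_begin() and
 * ghes_print_queued_estatus() in addition to the print and panic.
 */
static void __ghes_panic(struct ghes *ghes)
{
        oops_begin();
        ghes_print_queued_estatus();
        __ghes_print_estatus(KERN_EMERG, ghes->generic, ghes->estatus);

        /* reboot to log the error! */
        if (panic_timeout == 0)
                panic_timeout = ghes_panic_timeout;
        panic("Fatal hardware error!");
}
```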
Hi Tyler,

On 13/02/17 22:45, Baicar, Tyler wrote:
> On 2/9/2017 3:48 AM, James Morse wrote:
>> On 01/02/17 17:16, Tyler Baicar wrote:
>>> From: "Jonathan (Zhixiong) Zhang" <zjzhang@codeaurora.org>
>>>
>>> Even if an error status block's severity is fatal, the kernel does not
>>> honor the severity level and panic.
>>>
>>> With the firmware first model, the platform could inform the OS about a
>>> fatal hardware error through the non-NMI GHES notification type. The OS
>>> should panic when a hardware error record is received with this
>>> severity.
>>>
>>> Call panic() after CPER data in error status block is printed if
>>> severity is fatal, before each error section is handled.
>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>>> index 8756172..86c1f15 100644
>>> --- a/drivers/acpi/apei/ghes.c
>>> +++ b/drivers/acpi/apei/ghes.c
>>> @@ -687,6 +689,13 @@ static int ghes_ack_error(struct acpi_hest_generic_v2 *generic_v2)
>>>          return rc;
>>>  }
>>> +static void __ghes_call_panic(void)
>>> +{
>>> +        if (panic_timeout == 0)
>>> +                panic_timeout = ghes_panic_timeout;
>>> +        panic("Fatal hardware error!");
>>> +}
>>> +
>> __ghes_panic() also has:
>>>         __ghes_print_estatus(KERN_EMERG, ghes->generic, ghes->estatus);
>> Which prints this estatus regardless of rate limiting and caching.

[...]

>>>                  ghes_estatus_cache_add(ghes->generic, ghes->estatus);
>>>          }
>>> +        if (ghes_severity(ghes->estatus->error_severity) >= GHES_SEV_PANIC) {
>>> +                __ghes_call_panic();
>>> +        }
>> I think this ghes_severity() then panic() should go above the:
>>>         if (!ghes_estatus_cached(ghes->estatus)) {
>> and we should call __ghes_print_estatus() here too, to make sure the message
>> definitely got out!

> Okay, that makes sense. If we move this up, is there a problem with calling
> __ghes_panic() instead of making the __ghes_print_estatus() and
> __ghes_call_panic() calls here? It looks like that will just add a call to
> oops_begin() and ghes_print_queued_estatus() as well, but this is what
> ghes_notify_nmi() does if the severity is panic.

I don't think the queued stuff is relevant; isn't that just for x86-NMI messages
that it doesn't print out directly?

A quick grep shows arm64 doesn't have oops_begin(); you may have to add some
equivalent mechanism. Let's try to avoid that rabbit hole!

Given __ghes_panic() calls __ghes_print_estatus() too, you could try moving that
into your new __ghes_call_panic()... or whatever results in the fewest lines
changed!

Thanks,

James
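One way to picture James's last suggestion is to fold the unconditional print into the new helper. The signature change below is purely illustrative; it is not what the posted patch does.

```c
/*
 * Illustration only: James's suggestion of moving the unconditional
 * __ghes_print_estatus() into the new panic helper, so every caller
 * gets the estatus out before panicking. The (struct ghes *) argument
 * is hypothetical and not part of the posted patch.
 */
static void __ghes_call_panic(struct ghes *ghes)
{
        __ghes_print_estatus(KERN_EMERG, ghes->generic, ghes->estatus);

        if (panic_timeout == 0)
                panic_timeout = ghes_panic_timeout;
        panic("Fatal hardware error!");
}
```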
On 2/15/2017 5:13 AM, James Morse wrote:
> Hi Tyler,
>
> On 13/02/17 22:45, Baicar, Tyler wrote:
>> On 2/9/2017 3:48 AM, James Morse wrote:
>>> On 01/02/17 17:16, Tyler Baicar wrote:
>>>> From: "Jonathan (Zhixiong) Zhang" <zjzhang@codeaurora.org>
>>>>
>>>> Even if an error status block's severity is fatal, the kernel does not
>>>> honor the severity level and panic.
>>>>
>>>> With the firmware first model, the platform could inform the OS about a
>>>> fatal hardware error through the non-NMI GHES notification type. The OS
>>>> should panic when a hardware error record is received with this
>>>> severity.
>>>>
>>>> Call panic() after CPER data in error status block is printed if
>>>> severity is fatal, before each error section is handled.
>>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>>>> index 8756172..86c1f15 100644
>>>> --- a/drivers/acpi/apei/ghes.c
>>>> +++ b/drivers/acpi/apei/ghes.c
>>>> @@ -687,6 +689,13 @@ static int ghes_ack_error(struct acpi_hest_generic_v2 *generic_v2)
>>>>          return rc;
>>>>  }
>>>> +static void __ghes_call_panic(void)
>>>> +{
>>>> +        if (panic_timeout == 0)
>>>> +                panic_timeout = ghes_panic_timeout;
>>>> +        panic("Fatal hardware error!");
>>>> +}
>>>> +
>>> __ghes_panic() also has:
>>>>         __ghes_print_estatus(KERN_EMERG, ghes->generic, ghes->estatus);
>>> Which prints this estatus regardless of rate limiting and caching.
> [...]
>
>>>>                  ghes_estatus_cache_add(ghes->generic, ghes->estatus);
>>>>          }
>>>> +        if (ghes_severity(ghes->estatus->error_severity) >= GHES_SEV_PANIC) {
>>>> +                __ghes_call_panic();
>>>> +        }
>>> I think this ghes_severity() then panic() should go above the:
>>>>         if (!ghes_estatus_cached(ghes->estatus)) {
>>> and we should call __ghes_print_estatus() here too, to make sure the message
>>> definitely got out!
>
>> Okay, that makes sense. If we move this up, is there a problem with calling
>> __ghes_panic() instead of making the __ghes_print_estatus() and
>> __ghes_call_panic() calls here? It looks like that will just add a call to
>> oops_begin() and ghes_print_queued_estatus() as well, but this is what
>> ghes_notify_nmi() does if the severity is panic.
>
> I don't think the queued stuff is relevant; isn't that just for x86-NMI messages
> that it doesn't print out directly?
>
> A quick grep shows arm64 doesn't have oops_begin(); you may have to add some
> equivalent mechanism. Let's try to avoid that rabbit hole!
>
> Given __ghes_panic() calls __ghes_print_estatus() too, you could try moving that
> into your new __ghes_call_panic()... or whatever results in the fewest lines
> changed!

Sounds good, I will just use __ghes_print_estatus() and __ghes_call_panic().

Thanks,
Tyler
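Putting the agreed points together, the relevant part of ghes_proc() in a follow-up revision would read roughly as follows. This is a sketch of the direction settled on above, not the posted follow-up patch.

```c
/*
 * Sketch of the direction agreed in this thread (not the posted patch):
 * check the severity *before* the rate-limited/cached reporting path,
 * print the estatus unconditionally, then panic. The surrounding code
 * of ghes_proc() is unchanged and omitted here.
 */
        if (ghes_severity(ghes->estatus->error_severity) >= GHES_SEV_PANIC) {
                __ghes_print_estatus(KERN_EMERG, ghes->generic, ghes->estatus);
                __ghes_call_panic();
        }

        if (!ghes_estatus_cached(ghes->estatus)) {
                if (ghes_print_estatus(NULL, ghes->generic, ghes->estatus))
                        ghes_estatus_cache_add(ghes->generic, ghes->estatus);
        }
        ghes_do_proc(ghes, ghes->estatus);
```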
```diff
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 8756172..86c1f15 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -133,6 +133,8 @@
 static struct ghes_estatus_cache *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
 static atomic_t ghes_estatus_cache_alloced;
 
+static int ghes_panic_timeout __read_mostly = 30;
+
 static int ghes_ioremap_init(void)
 {
 	ghes_ioremap_area = __get_vm_area(PAGE_SIZE * GHES_IOREMAP_PAGES,
@@ -687,6 +689,13 @@ static int ghes_ack_error(struct acpi_hest_generic_v2 *generic_v2)
 	return rc;
 }
 
+static void __ghes_call_panic(void)
+{
+	if (panic_timeout == 0)
+		panic_timeout = ghes_panic_timeout;
+	panic("Fatal hardware error!");
+}
+
 static int ghes_proc(struct ghes *ghes)
 {
 	int rc;
@@ -698,6 +707,10 @@ static int ghes_proc(struct ghes *ghes)
 		if (ghes_print_estatus(NULL, ghes->generic, ghes->estatus))
 			ghes_estatus_cache_add(ghes->generic, ghes->estatus);
 	}
+	if (ghes_severity(ghes->estatus->error_severity) >= GHES_SEV_PANIC) {
+		__ghes_call_panic();
+	}
+
 	ghes_do_proc(ghes, ghes->estatus);
 
 	if (IS_HEST_TYPE_GENERIC_V2(ghes)) {
@@ -828,8 +841,6 @@ static inline void ghes_sea_remove(struct ghes *ghes)
 
 static LIST_HEAD(ghes_nmi);
 
-static int ghes_panic_timeout __read_mostly = 30;
-
 static void ghes_proc_in_irq(struct irq_work *irq_work)
 {
 	struct llist_node *llnode, *next;
@@ -922,9 +933,7 @@ static void __ghes_panic(struct ghes *ghes)
 	__ghes_print_estatus(KERN_EMERG, ghes->generic, ghes->estatus);
 
 	/* reboot to log the error! */
-	if (panic_timeout == 0)
-		panic_timeout = ghes_panic_timeout;
-	panic("Fatal hardware error!");
+	__ghes_call_panic();
 }
 
 static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
```
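For context on the `>= GHES_SEV_PANIC` test added in ghes_proc(): ghes_severity() ranks the CPER severity values so that fatal (and unrecognised) records map to the highest GHES severity. A simplified sketch of that mapping, as it appears in the ghes.c of this era (not verbatim):

```c
/*
 * Simplified sketch of the severity mapping behind the new check
 * (based on the ghes.c of this era; not verbatim). CPER_SEV_FATAL --
 * and anything unrecognised -- maps to GHES_SEV_PANIC, the highest
 * GHES severity, so the ">= GHES_SEV_PANIC" test in ghes_proc() fires
 * exactly for records the firmware marked fatal.
 */
static inline int ghes_severity(int severity)
{
        switch (severity) {
        case CPER_SEV_INFORMATIONAL:
                return GHES_SEV_NO;
        case CPER_SEV_CORRECTED:
                return GHES_SEV_CORRECTED;
        case CPER_SEV_RECOVERABLE:
                return GHES_SEV_RECOVERABLE;
        case CPER_SEV_FATAL:
                return GHES_SEV_PANIC;
        default:
                /* Unknown severity: treat it as fatal and panic */
                return GHES_SEV_PANIC;
        }
}
```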