Message ID | 1515254577-6460-4-git-send-email-gengdongjiu@huawei.com (mailing list archive) |
---|---|
State | Not Applicable, archived |
Headers | show |
Hi Dongjiu Geng, (versions of patches 1,2 and 4 have been queued by Catalin) (Nit 'ACPI / APEI:' is the normal subject prefix for ghes.c, this helps the maintainers know which patches they need to pay attention to when you are touching multiple trees) On 06/01/18 16:02, Dongjiu Geng wrote: > ARMv8.2 requires implementation of the RAS extension. > In > this extension, it adds SEI(SError Interrupt) notification > type, this patch adds new GHES error source SEI handling > functions. This reads as if this patch is handling SError RAS notifications generated by a CPU with the RAS extensions. These are about CPU->Software notifications. APEI and GHES are a firmware first mechanism which is Software->Software. Reading the v8.2 documents won't help anyone with the APEI/GHES code. Please describe this from the ACPI view, "ACPI 6.x adds support for NOTIFY_SEI as a GHES notification mechanism... ", its up to the arch code to spot a v8.2 RAS Error based on the cpu caps. > This error source parsing and handling method > is similar with the SEA. There are problems with doing this: Oct. 18, 2017, 10:26 a.m. James Morse wrote: | How do SEA and SEI interact? | | As far as I can see they can both interrupt each other, which isn't something | the single in_nmi() path in APEI can handle. I thinks we should fix this | first. [..] | SEA gets away with a lot of things because its synchronous. SEI isn't. Xie | XiuQi pointed to the memory_failure_queue() code. We can use this directly | from SEA, but not SEI. (what happens if an SError arrives while we are | queueing memory_failure work from an IRQ). | | The one that scares me is the trace-point reporting stuff. What happens if an | SError arrives while we are enabling a trace point? (these are static-keys | right?) | | I don't think we can just plumb SEI in like this and be done with it. | (I'm looking at teasing out the estatus cache code from being x86:NMI only. | This way we solve the same 'cant do this from NMI context' with the same | code'.) I will post what I've got for this estatus-cache thing as an RFC, its not ready to be considered yet. > Expose API ghes_notify_sei() to external users. External > modules can call this exposed API to parse APEI table and > handle the SEI notification. external modules? You mean called by the arch code when it gets this NOTIFY_SEI? Thanks, James -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi James, On 2018/1/23 3:39, James Morse wrote: > Hi Dongjiu Geng, > > (versions of patches 1,2 and 4 have been queued by Catalin) > > (Nit 'ACPI / APEI:' is the normal subject prefix for ghes.c, this helps the > maintainers know which patches they need to pay attention to when you are > touching multiple trees) > > On 06/01/18 16:02, Dongjiu Geng wrote: >> ARMv8.2 requires implementation of the RAS extension. > >> In >> this extension, it adds SEI(SError Interrupt) notification >> type, this patch adds new GHES error source SEI handling >> functions. > > This reads as if this patch is handling SError RAS notifications generated by a > CPU with the RAS extensions. These are about CPU->Software notifications. APEI > and GHES are a firmware first mechanism which is Software->Software. > Reading the v8.2 documents won't help anyone with the APEI/GHES code. > > Please describe this from the ACPI view, "ACPI 6.x adds support for NOTIFY_SEI > as a GHES notification mechanism... ", its up to the arch code to spot a v8.2 > RAS Error based on the cpu caps.Ok, I will modify it. > > >> This error source parsing and handling method >> is similar with the SEA. > > There are problems with doing this: > > Oct. 18, 2017, 10:26 a.m. James Morse wrote: > | How do SEA and SEI interact? > | > | As far as I can see they can both interrupt each other, which isn't something > | the single in_nmi() path in APEI can handle. I thinks we should fix this > | first. > > [..] > > | SEA gets away with a lot of things because its synchronous. SEI isn't. Xie > | XiuQi pointed to the memory_failure_queue() code. We can use this directly > | from SEA, but not SEI. (what happens if an SError arrives while we are > | queueing memory_failure work from an IRQ). > | > | The one that scares me is the trace-point reporting stuff. What happens if an > | SError arrives while we are enabling a trace point? (these are static-keys > | right?) > | > | I don't think we can just plumb SEI in like this and be done with it. > | (I'm looking at teasing out the estatus cache code from being x86:NMI only. > | This way we solve the same 'cant do this from NMI context' with the same > | code'.) > > > I will post what I've got for this estatus-cache thing as an RFC, its not ready > to be considered yet.Yes, I know you are dong that. Your serial's patch will consider all above things, right? If your patch can be consider that, this patch can based on your patchset. thanks. > > >> Expose API ghes_notify_sei() to external users. External >> modules can call this exposed API to parse APEI table and >> handle the SEI notification. > > external modules? You mean called by the arch code when it gets this NOTIFY_SEI? yes, called by kernel ARCH code, such as below, I remember I have discussed with you. asmlinkage void do_serror(struct pt_regs *regs, unsigned int esr) { nmi_enter(); if (!ghes_notify_sei()) return; /* non-RAS errors are not containable */ if (!arm64_is_ras_serror(esr) || arm64_is_fatal_ras_serror(regs, esr)) arm64_serror_panic(regs, esr); nmi_exit(); } > > > Thanks, > > James > > . > -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
sorry fix a typo. On 2018/1/23 17:23, gengdongjiu wrote: >> There are problems with doing this: >> >> Oct. 18, 2017, 10:26 a.m. James Morse wrote: >> | How do SEA and SEI interact? >> | >> | As far as I can see they can both interrupt each other, which isn't something >> | the single in_nmi() path in APEI can handle. I thinks we should fix this >> | first. >> >> [..] >> >> | SEA gets away with a lot of things because its synchronous. SEI isn't. Xie >> | XiuQi pointed to the memory_failure_queue() code. We can use this directly >> | from SEA, but not SEI. (what happens if an SError arrives while we are >> | queueing memory_failure work from an IRQ). >> | >> | The one that scares me is the trace-point reporting stuff. What happens if an >> | SError arrives while we are enabling a trace point? (these are static-keys >> | right?) >> | >> | I don't think we can just plumb SEI in like this and be done with it. >> | (I'm looking at teasing out the estatus cache code from being x86:NMI only. >> | This way we solve the same 'cant do this from NMI context' with the same >> | code'.) >> >> >> I will post what I've got for this estatus-cache thing as an RFC, its not ready >> to be considered yet. Yes, I know you are dong that. Your serial's patch will consider all above things, right? If your patch can be consider that, this patch can based on your patchset. thanks. > >> -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi gengdongjiu, On 23/01/18 09:23, gengdongjiu wrote: > On 2018/1/23 3:39, James Morse wrote: >> gengdongjiu wrote: >>> This error source parsing and handling method >>> is similar with the SEA. >> >> There are problems with doing this: >> >> Oct. 18, 2017, 10:26 a.m. James Morse wrote: >> | How do SEA and SEI interact? >> | >> | As far as I can see they can both interrupt each other, which isn't something >> | the single in_nmi() path in APEI can handle. I thinks we should fix this >> | first. >> >> [..] >> >> | SEA gets away with a lot of things because its synchronous. SEI isn't. Xie >> | XiuQi pointed to the memory_failure_queue() code. We can use this directly >> | from SEA, but not SEI. (what happens if an SError arrives while we are >> | queueing memory_failure work from an IRQ). >> | >> | The one that scares me is the trace-point reporting stuff. What happens if an >> | SError arrives while we are enabling a trace point? (these are static-keys >> | right?) >> | >> | I don't think we can just plumb SEI in like this and be done with it. >> | (I'm looking at teasing out the estatus cache code from being x86:NMI only. >> | This way we solve the same 'cant do this from NMI context' with the same >> | code'.) >> >> >> I will post what I've got for this estatus-cache thing as an RFC, its not ready >> to be considered yet. > Yes, I know you are dong that. Your serial's patch will consider all above things, right? Assuming I got it right, yes. It currently makes the race Xie XiuQi spotted worse, which I want to fix too. (details on the cover letter) > If your patch can be consider that, this patch can based on your patchset. thanks. I'd like to pick these patches onto the end of that series, but first I want to know what NOTIFY_SEI means for any OS. The ACPI spec doesn't say, and because its asynchronous, route-able and mask-able, there are many more corners than NOTFIY_SEA. This thing is a notification using an emulated SError exception. (emulated because physical-SError must be routed to EL3 for firmware-first, and virtual-SError belongs to EL2). Does your firmware emulate SError exactly as the TakeException() pseudo code in the Arm-Arm? Is the emulated SError routed following the routing rules for HCR_EL2.{AMO, TGE}? What does your firmware do when it wants to emulate SError but its masked? (e.g.1: The physical-SError interrupted EL2 and the SPSR shows EL2 had PSTATE.A set. e.g.2: The physical-SError interrupted EL2 but HCR_EL2 indicates the emulated SError should go to EL1. This effectively masks SError.) Answers to these let us determine whether a bug is in the firmware or the kernel. If firmware is expecting the OS to do something special, I'd like to know about it from the beginning! >>> Expose API ghes_notify_sei() to external users. External >>> modules can call this exposed API to parse APEI table and >>> handle the SEI notification. >> >> external modules? You mean called by the arch code when it gets this NOTIFY_SEI? > yes, called by kernel ARCH code, such as below, I remember I have discussed with you. Sure. The phrase 'external modules' usually means the '.ko' files that live in /lib/modules, nothing outside the kernel tree should be doing this stuff. Thanks, James -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/acpi/apei/Kconfig b/drivers/acpi/apei/Kconfig index 52ae543..ff4afc3 100644 --- a/drivers/acpi/apei/Kconfig +++ b/drivers/acpi/apei/Kconfig @@ -55,6 +55,21 @@ config ACPI_APEI_SEA option allows the OS to look for such hardware error record, and take appropriate action. +config ACPI_APEI_SEI + bool "APEI SError(System Error) Interrupt logging/recovering support" + depends on ARM64 && ACPI_APEI_GHES + default y + help + This option should be enabled if the system supports + firmware first handling of SEI (SError interrupt). + + SEI happens with asynchronous external abort for errors on device + memory reads on ARMv8 systems. If a system supports firmware first + handling of SEI, the platform analyzes and handles hardware error + notifications from SEI, and it may then form a hardware error record for + the OS to parse and handle. This option allows the OS to look for + such hardware error record, and take appropriate action. + config ACPI_APEI_MEMORY_FAILURE bool "APEI memory error recovering support" depends on ACPI_APEI && MEMORY_FAILURE diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c index 6a3f824..67cd3a7 100644 --- a/drivers/acpi/apei/ghes.c +++ b/drivers/acpi/apei/ghes.c @@ -855,6 +855,46 @@ static inline void ghes_sea_add(struct ghes *ghes) { } static inline void ghes_sea_remove(struct ghes *ghes) { } #endif /* CONFIG_ACPI_APEI_SEA */ +#ifdef CONFIG_ACPI_APEI_SEI +static LIST_HEAD(ghes_sei); + +/* + * Return 0 only if one of the SEI error sources successfully reported an error + * record sent from the firmware. + */ +int ghes_notify_sei(void) +{ + struct ghes *ghes; + int ret = -ENOENT; + + rcu_read_lock(); + list_for_each_entry_rcu(ghes, &ghes_sei, list) { + if (!ghes_proc(ghes)) + ret = 0; + } + rcu_read_unlock(); + return ret; +} + +static void ghes_sei_add(struct ghes *ghes) +{ + mutex_lock(&ghes_list_mutex); + list_add_rcu(&ghes->list, &ghes_sei); + mutex_unlock(&ghes_list_mutex); +} + +static void ghes_sei_remove(struct ghes *ghes) +{ + mutex_lock(&ghes_list_mutex); + list_del_rcu(&ghes->list); + mutex_unlock(&ghes_list_mutex); + synchronize_rcu(); +} +#else /* CONFIG_ACPI_APEI_SEI */ +static inline void ghes_sei_add(struct ghes *ghes) { } +static inline void ghes_sei_remove(struct ghes *ghes) { } +#endif /* CONFIG_ACPI_APEI_SEI */ + #ifdef CONFIG_HAVE_ACPI_APEI_NMI /* * printk is not safe in NMI context. So in NMI handler, we allocate @@ -1086,6 +1126,13 @@ static int ghes_probe(struct platform_device *ghes_dev) goto err; } break; + case ACPI_HEST_NOTIFY_SEI: + if (!IS_ENABLED(CONFIG_ACPI_APEI_SEI)) { + pr_warn(GHES_PFX "Generic hardware error source: %d notified via SEI is not supported!\n", + generic->header.source_id); + goto err; + } + break; case ACPI_HEST_NOTIFY_NMI: if (!IS_ENABLED(CONFIG_HAVE_ACPI_APEI_NMI)) { pr_warn(GHES_PFX "Generic hardware error source: %d notified via NMI interrupt is not supported!\n", @@ -1158,6 +1205,9 @@ static int ghes_probe(struct platform_device *ghes_dev) case ACPI_HEST_NOTIFY_SEA: ghes_sea_add(ghes); break; + case ACPI_HEST_NOTIFY_SEI: + ghes_sei_add(ghes); + break; case ACPI_HEST_NOTIFY_NMI: ghes_nmi_add(ghes); break; @@ -1211,6 +1261,9 @@ static int ghes_remove(struct platform_device *ghes_dev) case ACPI_HEST_NOTIFY_SEA: ghes_sea_remove(ghes); break; + case ACPI_HEST_NOTIFY_SEI: + ghes_sei_remove(ghes); + break; case ACPI_HEST_NOTIFY_NMI: ghes_nmi_remove(ghes); break; diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h index 8feb0c8..9ba59e2 100644 --- a/include/acpi/ghes.h +++ b/include/acpi/ghes.h @@ -120,5 +120,6 @@ static inline void *acpi_hest_get_next(struct acpi_hest_generic_data *gdata) section = acpi_hest_get_next(section)) int ghes_notify_sea(void); +int ghes_notify_sei(void); #endif /* GHES_H */
ARMv8.2 requires implementation of the RAS extension. In this extension, it adds SEI(SError Interrupt) notification type, this patch adds new GHES error source SEI handling functions. This error source parsing and handling method is similar with the SEA. Expose API ghes_notify_sei() to external users. External modules can call this exposed API to parse APEI table and handle the SEI notification. Signed-off-by: Dongjiu Geng <gengdongjiu@huawei.com> --- drivers/acpi/apei/Kconfig | 15 ++++++++++++++ drivers/acpi/apei/ghes.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++ include/acpi/ghes.h | 1 + 3 files changed, 69 insertions(+)