Message ID | 20180403170830.29282-4-mr.nuke.me@gmail.com (mailing list archive) |
---|---|
State | RFC, archived |
Headers | show |
Hi Alexandru, On 03/04/18 18:08, Alexandru Gagniuc wrote: > BIOSes like to send NMIs for a number of silly reasons often deemed > to be "fatal". For example pin bounce during a PCIE hotplug/unplug > might cause the link to go down and retrain, with fatal PCI errors > being generated while the link is retraining. Sounds fun! > Instead of panic()ing in NMI context, pass fatal errors down to IRQ > context to see if they can be resolved. How do we know we will survive this trip? On arm64 systems it may not be possible to return to the context we took the NMI notification from: we may bounce back into firmware with the same "world is on fire" error. Firmware can rightly assume the OS has made no attempt to handle the error. Your 'not re-arming the error' example makes this worrying. > With these change, PCIe error are handled by AER. Other far less > common errors, such as machine check exceptions, still cause a panic() > in their respective handlers. I agree AER is always going to be different. Could we take a look at the CPER records while still in_nmi() to decide whether linux knows better than firmware? For non-standard or processor-errors I think we should always panic() if they're marked as fatal. For memory-errors we could split memory_failure() up to have {NMI,IRQ,process}-context helpers, all we need to know at NMI-time is whether the affected memory is kernel memory. For the PCI*/AER errors we should be able to try and handle it ... if we can get back to process/IRQ context: What happens if a PCIe driver has interrupts masked and is doing something to cause this error? We can take the NMI and setup the irq-work, but it may never run as we return to interrupts-masked context. > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c > index 2c998125b1d5..7243a99ea57e 100644 > --- a/drivers/acpi/apei/ghes.c > +++ b/drivers/acpi/apei/ghes.c > @@ -955,7 +962,7 @@ static void __process_error(struct ghes *ghes) > static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs) > { > struct ghes *ghes; > - int sev, ret = NMI_DONE; > + int ret = NMI_DONE; > > if (!atomic_add_unless(&ghes_in_nmi, 1, 1)) > return ret; > @@ -968,13 +975,6 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs) > ret = NMI_HANDLED; > } > > - sev = ghes_severity(ghes->estatus->error_severity); > - if (sev >= GHES_SEV_PANIC) { > - oops_begin(); > - ghes_print_queued_estatus(); > - __ghes_panic(ghes); > - } > - > if (!(ghes->flags & GHES_TO_CLEAR)) > continue; For Processor-errors I think this is the wrong thing to do, but we should be able to poke around in the CPER records and find out what we are dealing with. Thanks, James -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi James, On 04/04/2018 02:18 AM, James Morse wrote: > Hi Alexandru, > > On 03/04/18 18:08, Alexandru Gagniuc wrote: >> BIOSes like to send NMIs for a number of silly reasons often deemed >> to be "fatal". For example pin bounce during a PCIE hotplug/unplug >> might cause the link to go down and retrain, with fatal PCI errors >> being generated while the link is retraining. > > Sounds fun! > > >> Instead of panic()ing in NMI context, pass fatal errors down to IRQ >> context to see if they can be resolved. > > How do we know we will survive this trip? We don't. > On arm64 systems it may not be possible to return to the context we took the NMI > notification from: we may bounce back into firmware with the same "world is on > fire" error. Firmware can rightly assume the OS has made no attempt to handle > the error. I'm not aware of the NMI path being used on arm64: $ git grep 'select HAVE_ACPI_APEI_NMI' arch/x86/Kconfig: select HAVE_ACPI_APEI_NMI if ACPI $ A far as jumping back into firmware, the OS should clear the GHES block status before exiting the NMI. THis tells the firmware that the OS got the message [1] I'm not seeing any mechanism to say "handled/not handled", so I'm not sure how firmware can make such assumptions. > Your 'not re-arming the error' example makes this worrying. Those are one class of bugs you get when you let firmware run the show. It's been an issue since day one of EFI. The best you can do in such cases is accept you may later lose the link to a device. Side-note: We have a BIOS defect filed with against this particular issue, and it only seems to affect PCIe SMIs. >> With these change, PCIe error are handled by AER. Other far less >> common errors, such as machine check exceptions, still cause a panic() >> in their respective handlers. > > I agree AER is always going to be different. Could we take a look at the CPER > records while still in_nmi() to decide whether linux knows better than firmware? > > For non-standard or processor-errors I think we should always panic() if they're > marked as fatal. > For memory-errors we could split memory_failure() up to have > {NMI,IRQ,process}-context helpers, all we need to know at NMI-time is whether > the affected memory is kernel memory. I'm trying to avoid splitting up the recovery logic based on _how_ we were notified of the error. The only sure way to know if you've handled an error is to try to handle it. Not handling _any_ fatal error still results in a panic(), and I was very careful to ensure that. The alternative is to keep ghes_do_proc() and ghes_notify_nmi() in sync, duplicate code between the two. But we can also get ghes_proc_irq_work() as an entry point, and this easily turns in a maintenance nightmare. > For the PCI*/AER errors we should be able to try and handle it ... if we can get > back to process/IRQ context: > What happens if a PCIe driver has interrupts masked and is doing something to > cause this error? We can take the NMI and setup the irq-work, but it may never > run as we return to interrupts-masked context. The way the recovery from NMI is set up is confusing. On NMIs, ghes_proc_in_irq() is queued up to do further work. Despite the name, in this case it runs as a task. That further queues up AER work, Either way, PCIe endpoint interrupts are irrelevant here. AER recovery runs independent of drivers or interrupts: - Drivers that implement AER recovery services are expected to do the right thing and shut down IO. To not do so, is a bug in the driver. - Drivers that do not implement AER have their devices taken offline. It is not uncommon for PCIe errors to amplify. Imagine losing the link right after you've fired off several DMA queues. You'd get a stream of Unsupported Request errors. AER handles this fairly well. > >> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c >> index 2c998125b1d5..7243a99ea57e 100644 >> --- a/drivers/acpi/apei/ghes.c >> +++ b/drivers/acpi/apei/ghes.c >> @@ -955,7 +962,7 @@ static void __process_error(struct ghes *ghes) >> static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs) >> { >> struct ghes *ghes; >> - int sev, ret = NMI_DONE; >> + int ret = NMI_DONE; >> >> if (!atomic_add_unless(&ghes_in_nmi, 1, 1)) >> return ret; >> @@ -968,13 +975,6 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs) >> ret = NMI_HANDLED; >> } >> >> - sev = ghes_severity(ghes->estatus->error_severity); >> - if (sev >= GHES_SEV_PANIC) { >> - oops_begin(); >> - ghes_print_queued_estatus(); >> - __ghes_panic(ghes); >> - } >> - >> if (!(ghes->flags & GHES_TO_CLEAR)) >> continue; > > For Processor-errors I think this is the wrong thing to do, but we should be > able to poke around in the CPER records and find out what we are dealing with. I don't think being trigger-happy with the panic button is ever a good thing. The logic to stop the system exists downstream, and is being called. I can see an issue if we somehow lose the ability to run queued tasks, but in that case, we've already lost control of the machine. Alex [1] ACPI 6.2 spec. 18.3.2.8 Generic Hardware Error Source version 2 (GHESv2 - Type 10) > Thanks, > > James > -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Alex, On 04/04/18 16:33, Alex G. wrote: > On 04/04/2018 02:18 AM, James Morse wrote: >> On 03/04/18 18:08, Alexandru Gagniuc wrote: >>> BIOSes like to send NMIs for a number of silly reasons often deemed >>> to be "fatal". For example pin bounce during a PCIE hotplug/unplug >>> might cause the link to go down and retrain, with fatal PCI errors >>> being generated while the link is retraining. >>> Instead of panic()ing in NMI context, pass fatal errors down to IRQ >>> context to see if they can be resolved. >> >> How do we know we will survive this trip? > > We don't. Isn't that even worse than panic()ing? (No output, followed by a watchdog reboot if we're lucky) >> On arm64 systems it may not be possible to return to the context we took the NMI >> notification from: we may bounce back into firmware with the same "world is on >> fire" error. Firmware can rightly assume the OS has made no attempt to handle >> the error. > > I'm not aware of the NMI path being used on arm64: > $ git grep 'select HAVE_ACPI_APEI_NMI' > arch/x86/Kconfig: select HAVE_ACPI_APEI_NMI if ACPI > $ (I agree its not obvious!), arm64 has NOTIFY_SEA, NOTIFY_SEI and NOTIFY_SDEI that all behave like NOTIFY_NMI (in that they can interrupt IRQ-masked code). CONFIG_HAVE_ACPI_APEI_NMI has come to mean NOTIFY_NMI, which requires the arch code to implement register_nmi_handler() and (from memory) behave as a notifier chain, which doesn't fit with how these things behave. NOTIFY_SEA lives behind CONFIG_ACPI_APEI_SEA in driver/acpi/apei/Kconfig. The series to add SDEI support is on the list here: https://www.spinics.net/lists/arm-kernel/msg642946.html (NOTIFY_SEI is very tricky for firmware to get right, I don't think we're there yet.) > A far as jumping back into firmware, the OS should clear the GHES block > status before exiting the NMI. THis tells the firmware that the OS got > the message [1] > I'm not seeing any mechanism to say "handled/not handled", so I'm not > sure how firmware can make such assumptions. (I'm probably assuming GHESv2's ACK stuff). Wouldn't reasonably-intelligent firmware be able to spot the same fault repeatedly in a loop? (last-fault-address == this-fault-address, last-fault-time==now) >> Your 'not re-arming the error' example makes this worrying. > > Those are one class of bugs you get when you let firmware run the show. > It's been an issue since day one of EFI. The best you can do in such > cases is accept you may later lose the link to a device. > Side-note: We have a BIOS defect filed with against this particular > issue, and it only seems to affect PCIe SMIs. This sounds like policy applied too early, fixing the firmware to at least make this configurable would be preferable. My point was there is a class of reported-as-fatal error that really is fatal, and for these trying to defer the work to irq_work is worse than panic()ing as we get less information about what happened. >>> With these change, PCIe error are handled by AER. Other far less >>> common errors, such as machine check exceptions, still cause a panic() >>> in their respective handlers. >> >> I agree AER is always going to be different. Could we take a look at the CPER >> records while still in_nmi() to decide whether linux knows better than firmware? >> >> For non-standard or processor-errors I think we should always panic() if they're >> marked as fatal. >> For memory-errors we could split memory_failure() up to have >> {NMI,IRQ,process}-context helpers, all we need to know at NMI-time is whether >> the affected memory is kernel memory. > > I'm trying to avoid splitting up the recovery logic based on _how_ we > were notified of the error. I agree, last time I checked there were 11 of them. 'in_nmi()' or not is a distinction the ghes code already has for ghes_copy_tofrom_phys(). We already have to consider NMI-like notifications separately as we can't call the recovery-helpers. > The only sure way to know if you've handled > an error is to try to handle it. Not handling _any_ fatal error still > results in a panic(), and I was very careful to ensure that. Are all AER errors recoverable? (is an OS:'fatal' AER error ever valid?) > The alternative is to keep ghes_do_proc() and ghes_notify_nmi() in sync, > duplicate code between the two. But we can also get ghes_proc_irq_work() > as an entry point, and this easily turns in a maintenance nightmare. We can't call all the recovery-helpers from ghes_notify_nmi() as they sleep and allocate memory. I think the trick would be to make the CPER parsing routines NMI-safe so that we can check these 'fatal' records for cases where we're pretty sure linux can do a better job. >> For the PCI*/AER errors we should be able to try and handle it ... if we can get >> back to process/IRQ context: >> What happens if a PCIe driver has interrupts masked and is doing something to >> cause this error? We can take the NMI and setup the irq-work, but it may never >> run as we return to interrupts-masked context. > > The way the recovery from NMI is set up is confusing. On NMIs, > ghes_proc_in_irq() is queued up to do further work. Despite the name, in > this case it runs as a task. That further queues up AER work, > Either way, PCIe endpoint interrupts are irrelevant here. AER recovery > runs independent of drivers or interrupts: Hmm, we may be looking at different things, (and I'm not familiar with x86): ghes_notify_nmi() uses irq_work_queue() to add the work to a percpu llist and call arch_irq_work_raise(). arch_irq_work_raise() sends a self IPI, which, once taken causes the arch code to call irq_work_run() to consume the llist. My badly made point is we could take the NMI that triggers all this from a context that has interrupts masked. We can't take the self-IPI until we return from the NMI, and we return to a context that still has interrupts masked. The original context may cause the NMI to trigger again before it unmasks interrupts. In this case we've livelocked, the recovery work never runs. I think something like polling a status register over a broken-link would do this. (has my config write been acknowledge?) > - Drivers that implement AER recovery services are expected to do the > right thing and shut down IO. To not do so, is a bug in the driver. > - Drivers that do not implement AER have their devices taken offline. > It is not uncommon for PCIe errors to amplify. Imagine losing the link > right after you've fired off several DMA queues. You'd get a stream of > Unsupported Request errors. AER handles this fairly well. >>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c >>> index 2c998125b1d5..7243a99ea57e 100644 >>> --- a/drivers/acpi/apei/ghes.c >>> +++ b/drivers/acpi/apei/ghes.c >>> @@ -955,7 +962,7 @@ static void __process_error(struct ghes *ghes) >>> static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs) >>> { >>> struct ghes *ghes; >>> - int sev, ret = NMI_DONE; >>> + int ret = NMI_DONE; >>> >>> if (!atomic_add_unless(&ghes_in_nmi, 1, 1)) >>> return ret; >>> @@ -968,13 +975,6 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs) >>> ret = NMI_HANDLED; >>> } >>> >>> - sev = ghes_severity(ghes->estatus->error_severity); >>> - if (sev >= GHES_SEV_PANIC) { >>> - oops_begin(); >>> - ghes_print_queued_estatus(); >>> - __ghes_panic(ghes); >>> - } >>> - >>> if (!(ghes->flags & GHES_TO_CLEAR)) >>> continue; >> >> For Processor-errors I think this is the wrong thing to do, but we should be >> able to poke around in the CPER records and find out what we are dealing with. > > I don't think being trigger-happy with the panic button is ever a good > thing. The logic to stop the system exists downstream, and is being > called. > I can see an issue if we somehow lose the ability to run queued > tasks, but in that case, we've already lost control of the machine. about to loose control of the machine: firmware has saved the day giving us a clean slate and some standardised messages that describe the fault. I agree we can do better for AER errors, but I don't think we should make the handling of other classes of error worse. These classes match nicely to the CPER record types, which we already have mapped in ghes_notify_nmi(). Thanks, James -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 04/04/2018 11:53 AM, James Morse wrote: > Hi Alex, (snip) >>> How do we know we will survive this trip? >> >> We don't. > > Isn't that even worse than panic()ing? (No output, followed by a watchdog reboot > if we're lucky) On the contrary. No output, followed by a watchdog reboot is usually what happens when the panic() is in NMI. One of the very few ways that can be debugged is over a NULL modem cable. >>> On arm64 systems it may not be possible to return to the context we took the NMI >>> notification from: we may bounce back into firmware with the same "world is on >>> fire" error. Firmware can rightly assume the OS has made no attempt to handle >>> the error. >> >> I'm not aware of the NMI path being used on arm64: >> $ git grep 'select HAVE_ACPI_APEI_NMI' >> arch/x86/Kconfig: select HAVE_ACPI_APEI_NMI if ACPI >> $ > > (I agree its not obvious!), arm64 has NOTIFY_SEA, NOTIFY_SEI and NOTIFY_SDEI > that all behave like NOTIFY_NMI (in that they can interrupt IRQ-masked code). > > CONFIG_HAVE_ACPI_APEI_NMI has come to mean NOTIFY_NMI, which requires the arch > code to implement register_nmi_handler() and (from memory) behave as a notifier > chain, which doesn't fit with how these things behave. > > NOTIFY_SEA lives behind CONFIG_ACPI_APEI_SEA in driver/acpi/apei/Kconfig. > The series to add SDEI support is on the list here: > https://www.spinics.net/lists/arm-kernel/msg642946.html > (NOTIFY_SEI is very tricky for firmware to get right, I don't think we're there > yet.) From my understanding, your patch series tries to use ghes_notify_nmi() as a common entry point. But if on arm64, you can return to firmware, then we have wildly different constraints. On x86, you can't return to SMM. You can have a new SMI triggered while servicing the NMI, but you can't re-enter an SMI that you've returned from. SMM holds all the cores, and they all leave SMM at roughly the same time, so you don't deal with coexistence issues. This probably also avoids the multiple notifications that you are trying to implement on arm. On quick thought, I think the best way to go for both series is to leave the entry points arch-specific, and prevent hax86 and harm64 from stepping on each other's toes. Thoughts? (snip) > Wouldn't reasonably-intelligent firmware be able to spot the same fault > repeatedly in a loop? (last-fault-address == this-fault-address, > last-fault-time==now) I'm told that the BIOS on Dell machines will detect interrupt storms, and disable the specific error source in such cases. On an x86 server. 50us of SMM eats up several milliseconds-core of CPU time, so they try to be a little careful. >>> Your 'not re-arming the error' example makes this worrying. >> >> Those are one class of bugs you get when you let firmware run the show. >> It's been an issue since day one of EFI. The best you can do in such >> cases is accept you may later lose the link to a device. >> Side-note: We have a BIOS defect filed with against this particular >> issue, and it only seems to affect PCIe SMIs. > > This sounds like policy applied too early That's firmware-first in one sentence. > fixing the firmware to at least make this configurable would be preferable. I can ask for "fixing the firmware" and we might even get the fix, but the fact remains that there are machines in the field with the buggy BIOSes, and there will continue to be such machines even after a fix is released. > My point was there is a class of reported-as-fatal error that really is fatal, > and for these trying to defer the work to irq_work is worse than panic()ing as > we get less information about what happened. How do we get less information? We save the GHES structure on the estatus_queue, and use it as the error source in either case. Is there a class of errors where we can reasonable expect things to be okay in NMI, but go horribly wrong on return from NMI? (snip) > Are all AER errors recoverable? (is an OS:'fatal' AER error ever valid?) PCIe "fatal" means that your link is gone and you might not get it back. You lose all the downstream devices. You may be able to reset the link and get it back. This sort of "fatal" has no impact on a machine's ability to continue operation. From ACPI, FFS should only send a "fatal" error if a machine needs reset [1], so it's clearly a firmware bug to mark any PCIe error "fatal". This won't stop BIOS vendors from taking shortcuts and deciding to crash an OS by sending the error as "fatal". [1]ACPI 6.2: 18.1Hardware Errors and Error Sources >> The alternative is to keep ghes_do_proc() and ghes_notify_nmi() in sync, >> duplicate code between the two. But we can also get ghes_proc_irq_work() >> as an entry point, and this easily turns in a maintenance nightmare. > > We can't call all the recovery-helpers from ghes_notify_nmi() as they sleep and > allocate memory. I think the trick would be to make the CPER parsing routines > NMI-safe so that we can check these 'fatal' records for cases where we're pretty > sure linux can do a better job. can_i_handle_this_nonblocking(struct four_letter_acronum_err *err) ? That's the approach I originally tried, but I didn't like it. The parsing gets deeper and deeper, and, by the time you realize if the error can be handled or not, you're already halfway through recovery. This is a perfectly viable approach. However, is there a palpable (non-theoretical) concern that warrants complicating things this way? >>> For the PCI*/AER errors we should be able to try and handle it ... if we can get >>> back to process/IRQ context: >>> What happens if a PCIe driver has interrupts masked and is doing something to >>> cause this error? We can take the NMI and setup the irq-work, but it may never >>> run as we return to interrupts-masked context. >> >> The way the recovery from NMI is set up is confusing. On NMIs, >> ghes_proc_in_irq() is queued up to do further work. Despite the name, in >> this case it runs as a task. That further queues up AER work, >> Either way, PCIe endpoint interrupts are irrelevant here. AER recovery >> runs independent of drivers or interrupts: > > Hmm, we may be looking at different things, (and I'm not familiar with x86): > ghes_notify_nmi() uses irq_work_queue() to add the work to a percpu llist and > call arch_irq_work_raise(). arch_irq_work_raise() sends a self IPI, which, once > taken causes the arch code to call irq_work_run() to consume the llist. I abstract things at irq_work_queue(): "queue this work and let me return from this interrupt" > My badly made point is we could take the NMI that triggers all this from a > context that has interrupts masked. We can't take the self-IPI until we return > from the NMI, and we return to a context that still has interrupts masked. The > original context may cause the NMI to trigger again before it unmasks > interrupts. In this case we've livelocked, the recovery work never runs. In order to get another NMI from firmware, you need SMM to be triggered again. Remember, SMM has to exit before the NMI can be serviced. I doubt the livelock is a concern in this place, because we're dealing with NMIs invoked by firmware(SMM). NMIs for other reasons would be handled elsewhere. In the event that FFS fails to prevent and SMI storm, and keeps spamming the CPU with NMIs, that is a clear firmware bug. It's also the sort of bug I haven't observed in the wild. > I think something like polling a status register over a broken-link would do > this. (has my config write been acknowledge?) That would not be a valid reason to disable interrupts. Non-posted PCIe requests can take hundreds of milliseconds (memory-mapped or otherwise). I can't think of any sort of code that can cause errors to come in via NMI which would be a suitable atomic context for disabling interrupts. Maybe on arm it's normal to stuff in anything error-related via NMI? On x86, most error sources have their own (maskable) interrupt vector. I am not concert about livelocking the system. (snip) >> I can see an issue if we somehow lose the ability to run queued >> tasks, but in that case, we've already lost control of the machine. > > about to loose control of the machine: firmware has saved the day giving us a > clean slate and some standardised messages that describe the fault. In theory, yes. In practice, by the time OS gets control things are usually worse than they would have been without FFS. "standardized" is really a pipe dream, especially on x86, where each vendor interprets things differently, so the OS still has to figure it all out on its own. > I agree we can do better for AER errors, but I don't think we should make the > handling of other classes of error worse. These classes match nicely to the CPER > record types, which we already have mapped in ghes_notify_nmi(). My approach is to _not_ trust the firmware. I hope arm does it better, but on x86, each hardware vendor does what it thinks is best in firmware. The result is wildly different FFS behavior between vendors. This is something that an OS then, cannot rely upon. Alex -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Alex, On 04/04/18 20:49, Alex G. wrote: > On 04/04/2018 11:53 AM, James Morse wrote: >>>> How do we know we will survive this trip? >>> >>> We don't. >> >> Isn't that even worse than panic()ing? (No output, followed by a watchdog reboot >> if we're lucky) > On the contrary. No output, followed by a watchdog reboot is usually > what happens when the panic() is in NMI. One of the very few ways that > can be debugged is over a NULL modem cable. (IPMI-SOL?) Really? printk has NMI support, and panic()s call to bust_spinlocks() should unblank the screen before printk_safe_flush_on_panic() dumps the messages. (I'm assuming some BMC-VGA exists to capture the 'screen'). >>>> On arm64 systems it may not be possible to return to the context we took the NMI >>>> notification from: we may bounce back into firmware with the same "world is on >>>> fire" error. Firmware can rightly assume the OS has made no attempt to handle >>>> the error. >>> >>> I'm not aware of the NMI path being used on arm64: >>> $ git grep 'select HAVE_ACPI_APEI_NMI' >>> arch/x86/Kconfig: select HAVE_ACPI_APEI_NMI if ACPI >>> $ >> >> (I agree its not obvious!), arm64 has NOTIFY_SEA, NOTIFY_SEI and NOTIFY_SDEI >> that all behave like NOTIFY_NMI (in that they can interrupt IRQ-masked code). >> >> CONFIG_HAVE_ACPI_APEI_NMI has come to mean NOTIFY_NMI, which requires the arch >> code to implement register_nmi_handler() and (from memory) behave as a notifier >> chain, which doesn't fit with how these things behave. >> >> NOTIFY_SEA lives behind CONFIG_ACPI_APEI_SEA in driver/acpi/apei/Kconfig. >> The series to add SDEI support is on the list here: >> https://www.spinics.net/lists/arm-kernel/msg642946.html >> (NOTIFY_SEI is very tricky for firmware to get right, I don't think we're there >> yet.) > From my understanding, your patch series tries to use ghes_notify_nmi() > as a common entry point. It abstracts the NMI-like notifications to use a common helper into the estatus-queue. ghes_notify_nmi() is the arch-code entry point for x86's NOTIFY_NMI. We have ghes_notify_sea() and ghes_sdei_callback() on arm64. They are different as their registration requirements are different. They each call _in_nmi_notify_one() to do the in_nmi() ghes work. > But if on arm64, you can return to firmware, > then we have wildly different constraints. We can't return to firmware, but we can take the error again, causing another trap to firmware. e.g.: If a program accesses a value in the cache and the cache location is discovered to be corrupt/marked-as-poisoned. This causes an 'external abort' exception, which firmware has configured to trap to firmware. Once there, it generates CPER records and sets-up an external-abort-exception for Linux. If linux returns from this external-abort, we return to whatever program accessed the bad-cache-value in the first case, re-run the instruction that generates the external abort, and the same thing happens again, but to firmware it looks like a second fault at the same address. The same thing can happen for a processor error. > On x86, you can't return to SMM. You can have a new SMI triggered while > servicing the NMI, but you can't re-enter an SMI that you've returned > from. SMM holds all the cores, and they all leave SMM at roughly the > same time, so you don't deal with coexistence issues. This probably also > avoids the multiple notifications that you are trying to implement on arm. its the new SMI case. (We don't necessarily have all cores pulled into firmware, (although firmware could do that in software), so different CPUs taking NMI-like notifications is one of the things we have to handle.) > On quick thought, I think the best way to go for both series is to leave > the entry points arch-specific, and prevent hax86 and harm64 from > stepping on each other's toes. Thoughts? The behaviour should be driven from the standard. I agree there is some leeway, which linux should handle on all architectures that support GHES. The entry points from the arch code are already separate, as these have different notification types in the HEST->GHES entries and different registration requirements. Once in ghes.c all the NMI-like notifications get merged together as the only relevant thing about them is we have to use the estatus-queue, and can't call any of the sleepy recovery helpers. I don't think this is the issue though: An NMI notification could by any of the eleven CPER notification record types. Your argument is Linux probably can handle PCIe-AER errors, even if broken firmware thinks they are fatal. This is only one of the eleven notification record types. Returning to handle the error definitely doesn't work for some classes of fatal error, especially when interrupts are masked. I think its straightforward to test whether the CPER records contain only errors where we can do better: (not tested/built, I haven't checked whether this is nmi safe): ---------------------%<--------------------- /* * Return true if only PCIe-AER CPER sections are present. We ignore the fatal * severity in this case as linux knows better. Called in_nmi(). */ static bool ghes_is_deferrable(struct ghes *ghes, const struct acpi_hest_generic_status *estatus) { guid_t *sec_type; struct acpi_hest_generic_data *gdata; apei_estatus_for_each_section(estatus, gdata) { sec_type = (guid_t *)gdata->section_type; if (!guid_equal(sec_type, &CPER_SEC_PCIE)) return false; } return true; } ---------------------%<--------------------- >> Wouldn't reasonably-intelligent firmware be able to spot the same fault >> repeatedly in a loop? (last-fault-address == this-fault-address, >> last-fault-time==now) > > I'm told that the BIOS on Dell machines will detect interrupt storms, > and disable the specific error source in such cases. On an x86 server. > 50us of SMM eats up several milliseconds-core of CPU time, so they try > to be a little careful. > >>>> Your 'not re-arming the error' example makes this worrying. >>> >>> Those are one class of bugs you get when you let firmware run the show. >>> It's been an issue since day one of EFI. The best you can do in such >>> cases is accept you may later lose the link to a device. >>> Side-note: We have a BIOS defect filed with against this particular >>> issue, and it only seems to affect PCIe SMIs. >> >> This sounds like policy applied too early > > That's firmware-first in one sentence. > >> fixing the firmware to at least make this configurable would be preferable. > > I can ask for "fixing the firmware" and we might even get the fix, but > the fact remains that there are machines in the field with the buggy > BIOSes, and there will continue to be such machines even after a fix is > released. > > >> My point was there is a class of reported-as-fatal error that really is fatal, >> and for these trying to defer the work to irq_work is worse than panic()ing as >> we get less information about what happened. > > How do we get less information? We save the GHES structure on the > estatus_queue, and use it as the error source in either case. Is there a > class of errors where we can reasonable expect things to be okay in NMI, > but go horribly wrong on return from NMI? Yes, (described above). This could happen if we see Arm64's 'SEA' CPER records or x86's 'MCE' CPER records. I'm pretty sure it will happen on both architectures if you return to a context with interrupts masked. Things will go horribly wrong if you return from an imprecise exception notified via firmware first. > (snip) >> Are all AER errors recoverable? (is an OS:'fatal' AER error ever valid?) > > PCIe "fatal" means that your link is gone and you might not get it back. > You lose all the downstream devices. You may be able to reset the link > and get it back. This sort of "fatal" has no impact on a machine's > ability to continue operation. > > From ACPI, FFS should only send a "fatal" error if a machine needs reset > [1], so it's clearly a firmware bug to mark any PCIe error "fatal". This > won't stop BIOS vendors from taking shortcuts and deciding to crash an > OS by sending the error as "fatal". > > [1]ACPI 6.2: 18.1Hardware Errors and Error Sources Great! Its never valid. So we can check the CPER records in the NMI handler, and for AER errors, clear the fatal bit printing a nasty message about the firmware. >>> The alternative is to keep ghes_do_proc() and ghes_notify_nmi() in sync, >>> duplicate code between the two. But we can also get ghes_proc_irq_work() >>> as an entry point, and this easily turns in a maintenance nightmare. >> >> We can't call all the recovery-helpers from ghes_notify_nmi() as they sleep and >> allocate memory. I think the trick would be to make the CPER parsing routines >> NMI-safe so that we can check these 'fatal' records for cases where we're pretty >> sure linux can do a better job. > > can_i_handle_this_nonblocking(struct four_letter_acronum_err *err) ? Nothing so complicated. Just clearing the bit for AER-only CPER blocks should do. (we can probably assume that AER & Vendor-Specific are describing the same thing) > That's the approach I originally tried, but I didn't like it. The > parsing gets deeper and deeper, and, by the time you realize if the > error can be handled or not, you're already halfway through recovery. > > This is a perfectly viable approach. However, is there a palpable > (non-theoretical) concern that warrants complicating things this way? For AER your text above says no, we can always handle an error, they are never fatal. For memory-errors, yes. An ECC error for a location on the kernel stack while the affected thread was preempt_disable()d. We can't kick the thread off the CPU, and we can't return from the handler as we can't stop accesses to the stack when we return. We can avoid it being complicated by just separating AER errors out when they come in. You've said they never mean the OS has to reboot. >>>> For the PCI*/AER errors we should be able to try and handle it ... if we can get >>>> back to process/IRQ context: >>>> What happens if a PCIe driver has interrupts masked and is doing something to >>>> cause this error? We can take the NMI and setup the irq-work, but it may never >>>> run as we return to interrupts-masked context. >>> >>> The way the recovery from NMI is set up is confusing. On NMIs, >>> ghes_proc_in_irq() is queued up to do further work. Despite the name, in >>> this case it runs as a task. That further queues up AER work, >>> Either way, PCIe endpoint interrupts are irrelevant here. AER recovery >>> runs independent of drivers or interrupts: >> >> Hmm, we may be looking at different things, (and I'm not familiar with x86): >> ghes_notify_nmi() uses irq_work_queue() to add the work to a percpu llist and >> call arch_irq_work_raise(). arch_irq_work_raise() sends a self IPI, which, once >> taken causes the arch code to call irq_work_run() to consume the llist. > > I abstract things at irq_work_queue(): "queue this work and let me > return from this interrupt" ('from this NMI', to come back later as an IRQ, from where we can call schedule_work_on(). This chaining is because the scheduler takes irqsave spin-locks to schedule the work.) >> My badly made point is we could take the NMI that triggers all this from a >> context that has interrupts masked. We can't take the self-IPI until we return >> from the NMI, and we return to a context that still has interrupts masked. The >> original context may cause the NMI to trigger again before it unmasks >> interrupts. In this case we've livelocked, the recovery work never runs. > In order to get another NMI from firmware, you need SMM to be triggered > again. Remember, SMM has to exit before the NMI can be serviced. > > I doubt the livelock is a concern in this place, because we're dealing > with NMIs invoked by firmware(SMM). NMIs for other reasons would be > handled elsewhere. (aha, this might be the difference in the way we see this) > In the event that FFS fails to prevent and SMI storm, and keeps spamming > the CPU with NMIs, that is a clear firmware bug. It's also the sort of > bug I haven't observed in the wild. Do these SMI ever occur synchronously with the CPUs execution? i.e. if I execute this instruction, I will get an SMI. (or are these your handled-elsewhere-NMI case?) On arm64 poisoned-cache locations triggering an external abort (leading to a NOTIFY_SEA NMI-like notification to APEI) in response to a load is synchronous, hence the live-lock problem. I agree AER is very likely to be different. >> I think something like polling a status register over a broken-link would do >> this. (has my config write been acknowledge?) > That would not be a valid reason to disable interrupts. Non-posted PCIe > requests can take hundreds of milliseconds (memory-mapped or otherwise). > I can't think of any sort of code that can cause errors to come in via > NMI which would be a suitable atomic context for disabling interrupts. > > Maybe on arm it's normal to stuff in anything error-related via NMI? On > x86, most error sources have their own (maskable) interrupt vector. Maybe. On arm systems error sources might have their own interrupts that are taken to firmware, and firmware may be able to mask them. This is all hidden in firmware, which can use something like SDEI to push these onto linux as NMI-like notifications. > I am not concert about livelocking the system. This is what keeps me awake at night. I think your platform has NMI handled as MCE/kernel-first for synchronous CPU errors. SMI triggered by PCI-AER are notified by NMI too, but handled by firmware-first. This hybrid kernel/firmware first depending on the error-source isn't the only way of building things. Appendix-N of the UEFI spec lists CPER records for firmware-first handling of types of error that your platform must 'handle elsewhere'. We have arm systems where kernel-first isn't viable as the equivalent of the MCA registers is platform/integration/silicon-revision specific. These systems handle all their errors using firmware-first. These systems will generate CPER records where fatal means fatal, and returning from the NMI-like notification will either send us into the weeds, or trap us back into firmware. Thanks, James -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi James, On 04/06/2018 01:24 PM, James Morse wrote: Do you have any ETA on when your SEA patches are going to make it upstream? There's not much point in updating my patchset if it's going to conflict with your work. (snip) >> On the contrary. No output, followed by a watchdog reboot is usually >> what happens when the panic() is in NMI. One of the very few ways that >> can be debugged is over a NULL modem cable. > > (IPMI-SOL?) > > Really? printk has NMI support, and panic()s call to bust_spinlocks() should > unblank the screen before printk_safe_flush_on_panic() dumps the messages. (I'm > assuming some BMC-VGA exists to capture the 'screen'). Plausible. There are other mechanisms though that may reset the system before it is practical to capture the screen. (snip) >> But if on arm64, you can return to firmware, >> then we have wildly different constraints. > > We can't return to firmware, but we can take the error again, causing another > trap to firmware. > > e.g.: If a program accesses a value in the cache and the cache location is > discovered to be corrupt/marked-as-poisoned. This causes an 'external abort' > exception, which firmware has configured to trap to firmware. Once there, it > generates CPER records and sets-up an external-abort-exception for Linux. > If linux returns from this external-abort, we return to whatever program > accessed the bad-cache-value in the first case, re-run the instruction that > generates the external abort, and the same thing happens again, but to firmware > it looks like a second fault at the same address. This is a very good example of why we should _not_ panic in NMI. Invalidate the cache-line, next load causes a memory round-trip, and everyone's happy. Of course, FFS can't do that because it doesn't know if it's valid to invalidate (pun intended). But the OS does. So you _want_ to reach the error handler downstream of NMI context, and you do _not_ want to crash. (snip) >> On x86, you can't return to SMM. You can have a new SMI triggered while >> servicing the NMI, but you can't re-enter an SMI that you've returned >> from. SMM holds all the cores, and they all leave SMM at roughly the >> same time, so you don't deal with coexistence issues. This probably also >> avoids the multiple notifications that you are trying to implement on arm. > > its the new SMI case. > > (We don't necessarily have all cores pulled into firmware, (although firmware > could do that in software), so different CPUs taking NMI-like notifications is > one of the things we have to handle.) How does FFS handle race conditions that can occur when accessing HW concurrently with the OS? I'm told it's the main reasons why BIOS doesn't release unused cores from SMM early. >> On quick thought, I think the best way to go for both series is to leave >> the entry points arch-specific, and prevent hax86 and harm64 from >> stepping on each other's toes. Thoughts? > > The behaviour should be driven from the standard. I agree there is some leeway, > which linux should handle on all architectures that support GHES. "The standard" deals mostly with firmware behavior. While I'm all for following specifications in order to ensure smooth interaction and integration, Linux is an OS and it deals with a plethora of "standards". Its job is to serve user requests. When a "standard" deviates from this, such as mandating a crash when one is not required, it's the OS's job to _not_ follow this requirement. > The entry points from the arch code are already separate, as these have > different notification types in the HEST->GHES entries and different > registration requirements. > Once in ghes.c all the NMI-like notifications get merged together as the only > relevant thing about them is we have to use the estatus-queue, and can't call > any of the sleepy recovery helpers. > > I don't think this is the issue though: An NMI notification could by any of the > eleven CPER notification record types. > Your argument is Linux probably can handle PCIe-AER errors, My argument is that Linux should do a best effort to recover from hardware (or FFS) errors. In support of that argument, I presented evidence that Linux already can handle a variety of errors. How those errors are reported determines whether the OS crashes or not, and that's not right. When errors are reported natively, we make it to the error handlers before crashing. And the essence of my argument is that there is no reason to have a different behavior with FFS. > even if broken firmware thinks they are fatal. I think the idea of firmware-first is broken. But it's there, it's shipping in FW, so we have to accommodate it in SW. > This is only one of the eleven notification record types. Returning to handle > the error definitely doesn't work for some classes of fatal error, especially > when interrupts are masked. > I think its straightforward to test whether the CPER records contain only errors > where we can do better: (not tested/built, I haven't checked whether this is nmi > safe): That looks very similar to what I had originally implemented. > ---------------------%<--------------------- > /* > * Return true if only PCIe-AER CPER sections are present. We ignore the fatal > * severity in this case as linux knows better. Called in_nmi(). > */ > static bool ghes_is_deferrable(struct ghes *ghes, > const struct acpi_hest_generic_status *estatus) > { > guid_t *sec_type; > struct acpi_hest_generic_data *gdata; > > apei_estatus_for_each_section(estatus, gdata) { > sec_type = (guid_t *)gdata->section_type; > if (!guid_equal(sec_type, &CPER_SEC_PCIE)) Also need to check if the CPER record contains enough information to identify the cause of the error. There's no guarantee the record even has a valid PCI BDF, let alone error information. Possible? yes. Easy? yes. Advisable? maybe. (snip) >>> My point was there is a class of reported-as-fatal error that really is fatal, >>> and for these trying to defer the work to irq_work is worse than panic()ing as >>> we get less information about what happened. >> >> How do we get less information? We save the GHES structure on the >> estatus_queue, and use it as the error source in either case. Is there a >> class of errors where we can reasonable expect things to be okay in NMI, >> but go horribly wrong on return from NMI? > > Yes, (described above). This could happen if we see Arm64's 'SEA' CPER records > or x86's 'MCE' CPER records. I'm pretty sure it will happen on both > architectures if you return to a context with interrupts masked. And linux can handle a wide subset of MCEs just fine, so the ghes_is_deferrable() logic would, under my argument, agree to pass execution to the actual handlers. I do agree that this is one case where splitting the handler in a top half and bottom half would make sense. Though in that case, the problem is generalized to "how to best handle error Y", rather than "how to handle error Y in FFS". >> (snip) >>> Are all AER errors recoverable? (is an OS:'fatal' AER error ever valid?) >> >> PCIe "fatal" means that your link is gone and you might not get it back. >> You lose all the downstream devices. You may be able to reset the link >> and get it back. This sort of "fatal" has no impact on a machine's >> ability to continue operation. >> >> From ACPI, FFS should only send a "fatal" error if a machine needs reset >> [1], so it's clearly a firmware bug to mark any PCIe error "fatal". This >> won't stop BIOS vendors from taking shortcuts and deciding to crash an >> OS by sending the error as "fatal". >> >> [1]ACPI 6.2: 18.1Hardware Errors and Error Sources > > Great! Its never valid. So we can check the CPER records in the NMI handler, and > for AER errors, clear the fatal bit printing a nasty message about the firmware. There's the counter-argument that the machine is a "XYZ" appliance, and losing the PCIe link to the "XYZ" device is essentially "fatal" to the (purpose of the) machine. As you noted, there is some leeway ;) Sarcasm aside, I do like the idea of a message complaining about the firmware. (snip) >> That's the approach I originally tried, but I didn't like it. The >> parsing gets deeper and deeper, and, by the time you realize if the >> error can be handled or not, you're already halfway through recovery. >> >> This is a perfectly viable approach. However, is there a palpable >> (non-theoretical) concern that warrants complicating things this way? > > For AER your text above says no, we can always handle an error, they are never > fatal. > > For memory-errors, yes. An ECC error for a location on the kernel stack while > the affected thread was preempt_disable()d. We can't kick the thread off the > CPU, and we can't return from the handler as we can't stop accesses to the stack > when we return. > > We can avoid it being complicated by just separating AER errors out when they > come in. You've said they never mean the OS has to reboot. If we're going to split recovery paths, then it makes better sense to have a system where handleable errors are passed down to a lesser context. Or even run non-blocking handlers in whatever context they are received. Much more than having a special case for a specific error. (snip) >> I abstract things at irq_work_queue(): "queue this work and let me >> return from this interrupt" > > ('from this NMI', to come back later as an IRQ, from where we can call > schedule_work_on(). This chaining is because the scheduler takes irqsave > spin-locks to schedule the work.) Hopefully, the handling is set up in such a way that I don't have to worry about these details on a per-error basis. If I do have to worry, them I must be doing something wrong. (snip) >> In the event that FFS fails to prevent and SMI storm, and keeps spamming >> the CPU with NMIs, that is a clear firmware bug. It's also the sort of >> bug I haven't observed in the wild. > > Do these SMI ever occur synchronously with the CPUs execution? i.e. if I execute > this instruction, I will get an SMI. (or are these your handled-elsewhere-NMI case?) I'm not aware of synchronous exceptions that end up as SMIs, though I imagine it is possible on x86. An IO write to an SMI trap is equivalent to that. > On arm64 poisoned-cache locations triggering an external abort (leading to a > NOTIFY_SEA NMI-like notification to APEI) in response to a load is synchronous, > hence the live-lock problem. How is this currently handled in the kernel-first case? (snip) > Maybe. On arm systems error sources might have their own interrupts that are > taken to firmware, That sounds like an even bigger headache than x86. >> I am not concert about livelocking the system. > > This is what keeps me awake at night. Nothing keeps me awake at night. I keep others awake at night. > I think your platform has NMI handled as MCE/kernel-first for synchronous CPU > errors. SMI triggered by PCI-AER are notified by NMI too, but handled by > firmware-first. MCEs come in via FFS. There's a school of thought that sees a certain value in logging ECC errors to the BMC. It's arguable whether the ECC errors are synchronous in 100% of cases. > This hybrid kernel/firmware first depending on the error-source isn't the only > way of building things. Appendix-N of the UEFI spec lists CPER records for > firmware-first handling of types of error that your platform must 'handle > elsewhere'. > > We have arm systems where kernel-first isn't viable as the equivalent of the MCA > registers is platform/integration/silicon-revision specific. These systems > handle all their errors using firmware-first. These systems will generate CPER > records where fatal means fatal, and returning from the NMI-like notification > will either send us into the weeds, or trap us back into firmware. Without getting into details, the non-viablity of kernel-first is subjective. Alex -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Alex, On 09/04/18 19:11, Alex G. wrote: > On 04/06/2018 01:24 PM, James Morse wrote: > Do you have any ETA on when your SEA patches are going to make it > upstream? There's not much point in updating my patchset if it's going > to conflict with your work. The SEA stuff went in with 7edda0886bc3 ("acpi: apei: handle SEA notification type for ARMv8"). My series is moving it to use the estatus-queue in the same way as x86's NOTIFY_NMI does. This lets us safely add the other two NMI-like notifications. I have no idea on the ETA, it depends on review feedback! >>> But if on arm64, you can return to firmware, >>> then we have wildly different constraints. >> >> We can't return to firmware, but we can take the error again, causing another >> trap to firmware. >> >> e.g.: If a program accesses a value in the cache and the cache location is >> discovered to be corrupt/marked-as-poisoned. This causes an 'external abort' >> exception, which firmware has configured to trap to firmware. Once there, it >> generates CPER records and sets-up an external-abort-exception for Linux. >> If linux returns from this external-abort, we return to whatever program >> accessed the bad-cache-value in the first case, re-run the instruction that >> generates the external abort, and the same thing happens again, but to firmware >> it looks like a second fault at the same address. > This is a very good example of why we should _not_ panic in NMI. > Invalidate the cache-line, next load causes a memory round-trip, and > everyone's happy. Of course, FFS can't do that because it doesn't know > if it's valid to invalidate (pun intended). But the OS does. So you > _want_ to reach the error handler downstream of NMI context, and you do > _not_ want to crash. This assumes a cache-invalidate will clear the error, which I don't think we're guaranteed on arm. It also destroys any adjacent data, "everyone's happy" includes the thread that got a chunk of someone-else's stack frame, I don't think it will be happy for very long! (this is a side issue for AER though) >>> On x86, you can't return to SMM. You can have a new SMI triggered while >>> servicing the NMI, but you can't re-enter an SMI that you've returned >>> from. SMM holds all the cores, and they all leave SMM at roughly the >>> same time, so you don't deal with coexistence issues. This probably also >>> avoids the multiple notifications that you are trying to implement on arm. >> >> its the new SMI case. >> >> (We don't necessarily have all cores pulled into firmware, (although firmware >> could do that in software), so different CPUs taking NMI-like notifications is >> one of the things we have to handle.) > How does FFS handle race conditions that can occur when accessing HW > concurrently with the OS? I'm told it's the main reasons why BIOS > doesn't release unused cores from SMM early. This is firmware's problem, it depends on whether there is any hardware that is shared with the OS. Some hardware can be marked 'secure' in which case only firmware can access it, alternatively firmware can trap or just disable the OS's access to the shared hardware. For example, with the v8.2 RAS Extensions, there are some per-cpu error registers. Firmware can disable these for the OS, so that it always reads 0 from them. Instead firmware takes the error via FF, reads the registers from firmware, and dumps CPER records into the OS's memory. If there is a shared hardware resource that both the OS and firmware may be accessing, yes firmware needs to pull the other CPUs in, but this depends on the SoC design, it doesn't necessarily happen. >>> On quick thought, I think the best way to go for both series is to leave >>> the entry points arch-specific, and prevent hax86 and harm64 from >>> stepping on each other's toes. Thoughts? >> >> The behaviour should be driven from the standard. I agree there is some leeway, >> which linux should handle on all architectures that support GHES. > > "The standard" deals mostly with firmware behavior. While I'm all for > following specifications in order to ensure smooth interaction and > integration, Linux is an OS and it deals with a plethora of "standards". > Its job is to serve user requests. When a "standard" deviates from this, > such as mandating a crash when one is not required, it's the OS's job to > _not_ follow this requirement. Sure, we're quirking our behaviour based on a high level of mistrust for the firmware. My point here was we shouldn't duplicate the implementation because we want x86:{AER,CPU,MEM} to behave differently to arm64:{AER,CPU,MEM}. I'd rather the quirked-behaviour was along the *:{AER} versus *:{CPU,MEM} line. If we have extra code to spot deferrable errors, we should use it on both architectures. >> Once in ghes.c all the NMI-like notifications get merged together as the only >> relevant thing about them is we have to use the estatus-queue, and can't call >> any of the sleepy recovery helpers. >> >> I don't think this is the issue though: An NMI notification could by any of the >> eleven CPER notification record types. >> Your argument is Linux probably can handle PCIe-AER errors, > > My argument is that Linux should do a best effort to recover from > hardware (or FFS) errors. In support of that argument, I presented > evidence that Linux already can handle a variety of errors. How those > errors are reported determines whether the OS crashes or not, and that's > not right. For AER we agree, these never mean 'the CPU is on fire'. > When errors are reported natively, we make it to the error > handlers before crashing. And the essence of my argument is that there > is no reason to have a different behavior with FFS. For the native CPU-error case, the CPU generates some kind of interrupt out of firey-death, into the error handler and we put out the fire or panic. For the firmware-first case this patch returned us to the firey-death, in the hope of taking an IRQ that would bring us into the error handler. This won't work if the CPU came out of a mode where interrupts are masked. Even if they aren't, its not guaranteed on arm64 that we take the interrupt 'in time'. Doesn't x86 guarantee to execute at least one instruction after an iret? (I recall something about interrupt storms and forward-progress guarantees). Some triage of the error in the NMI handler is the right thing to do here. I agree with your point that AER errors are always deferrable, lets tackle those first. (I think user-space memory errors can be deferred if we took the error out of user-space and can prevent our return with TIF_NEED_RESCHED or equivalent. The same might work for CPU errors affecting user-space context) >> even if broken firmware thinks they are fatal. > I think the idea of firmware-first is broken. But it's there, it's > shipping in FW, so we have to accommodate it in SW. Part of our different-views here is firmware-first is taking something away from you, whereas for me its giving me information that would otherwise be in secret-soc-specific registers. >> This is only one of the eleven notification record types. Returning to handle >> the error definitely doesn't work for some classes of fatal error, especially >> when interrupts are masked. > >> I think its straightforward to test whether the CPER records contain only errors >> where we can do better: (not tested/built, I haven't checked whether this is nmi >> safe): > That looks very similar to what I had originally implemented. >> ---------------------%<--------------------- >> /* >> * Return true if only PCIe-AER CPER sections are present. We ignore the fatal >> * severity in this case as linux knows better. Called in_nmi(). >> */ >> static bool ghes_is_deferrable(struct ghes *ghes, >> const struct acpi_hest_generic_status *estatus) >> { >> guid_t *sec_type; >> struct acpi_hest_generic_data *gdata; >> >> apei_estatus_for_each_section(estatus, gdata) { >> sec_type = (guid_t *)gdata->section_type; >> if (!guid_equal(sec_type, &CPER_SEC_PCIE)) > > Also need to check if the CPER record contains enough information to > identify the cause of the error. There's no guarantee the record even > has a valid PCI BDF, let alone error information. Possible? yes. Easy? > yes. Advisable? maybe. If AER errors never prevent the CPU getting to the deferred error handler, we can do that sort of thing there. >>>> My point was there is a class of reported-as-fatal error that really is fatal, >>>> and for these trying to defer the work to irq_work is worse than panic()ing as >>>> we get less information about what happened. >>> >>> How do we get less information? We save the GHES structure on the >>> estatus_queue, and use it as the error source in either case. Is there a >>> class of errors where we can reasonable expect things to be okay in NMI, >>> but go horribly wrong on return from NMI? >> >> Yes, (described above). This could happen if we see Arm64's 'SEA' CPER records >> or x86's 'MCE' CPER records. I'm pretty sure it will happen on both >> architectures if you return to a context with interrupts masked. > And linux can handle a wide subset of MCEs just fine, so the > ghes_is_deferrable() logic would, under my argument, agree to pass > execution to the actual handlers. For some classes of error we can't safely get there. > I do agree that this is one case where > splitting the handler in a top half and bottom half would make sense. Yes, splitting this stuff up to have all the NMI-safe stuff up-front, then drop to IRQ-context for the next round would make this a lot easier. > Though in that case, the problem is generalized to "how to best handle > error Y", rather than "how to handle error Y in FFS". (that's a good thing yes?) I assume x86 can take MCE errors out of IRQ-masked code. Sharing the handle_foo_error_nmi() code between the two paths would be a good thing. >>>> Are all AER errors recoverable? (is an OS:'fatal' AER error ever valid?) >>> >>> PCIe "fatal" means that your link is gone and you might not get it back. >>> You lose all the downstream devices. You may be able to reset the link >>> and get it back. This sort of "fatal" has no impact on a machine's >>> ability to continue operation. >>> >>> From ACPI, FFS should only send a "fatal" error if a machine needs reset >>> [1], so it's clearly a firmware bug to mark any PCIe error "fatal". This >>> won't stop BIOS vendors from taking shortcuts and deciding to crash an >>> OS by sending the error as "fatal". >>> >>> [1]ACPI 6.2: 18.1Hardware Errors and Error Sources >> >> Great! Its never valid. So we can check the CPER records in the NMI handler, and >> for AER errors, clear the fatal bit printing a nasty message about the firmware. > There's the counter-argument that the machine is a "XYZ" appliance, and > losing the PCIe link to the "XYZ" device is essentially "fatal" to the > (purpose of the) machine. As you noted, there is some leeway ;) For any appliance I'd expect some health-monitor-thing to spot all the disks in the disk-enclosure have disappeared and reboot. (Losing the root-filsystem is probably the general case). > Sarcasm aside, I do like the idea of a message complaining about the > firmware. It wouldn't be appropriate in all cases, but for AER it looks like 'fatal' is always an over-reaction. For example, given a memory error firmware can't know whether we can re-read some page of kernel memory from disk and then reschedule the thread. >>> That's the approach I originally tried, but I didn't like it. The >>> parsing gets deeper and deeper, and, by the time you realize if the >>> error can be handled or not, you're already halfway through recovery. >>> >>> This is a perfectly viable approach. However, is there a palpable >>> (non-theoretical) concern that warrants complicating things this way? >> >> For AER your text above says no, we can always handle an error, they are never >> fatal. >> >> For memory-errors, yes. An ECC error for a location on the kernel stack while >> the affected thread was preempt_disable()d. We can't kick the thread off the >> CPU, and we can't return from the handler as we can't stop accesses to the stack >> when we return. >> >> We can avoid it being complicated by just separating AER errors out when they >> come in. You've said they never mean the OS has to reboot. > > If we're going to split recovery paths, then it makes better sense to > have a system where handleable errors are passed down to a lesser > context. Or even run non-blocking handlers in whatever context they are > received. Much more than having a special case for a specific error. Yes. I think running the NMI-safe parts of the handler first, then the IRQ-parts when we drop to IRQ context is the way to do this. >>> I abstract things at irq_work_queue(): "queue this work and let me >>> return from this interrupt" >> >> ('from this NMI', to come back later as an IRQ, from where we can call >> schedule_work_on(). This chaining is because the scheduler takes irqsave >> spin-locks to schedule the work.) > > Hopefully, the handling is set up in such a way that I don't have to > worry about these details on a per-error basis. If I do have to worry, > them I must be doing something wrong. I agree, but something has to know. Lets tackle AER first, where we have none of theses concerns, then tackle those that are more closely tied to the CPU state. >>> In the event that FFS fails to prevent and SMI storm, and keeps spamming >>> the CPU with NMIs, that is a clear firmware bug. It's also the sort of >>> bug I haven't observed in the wild. >> >> Do these SMI ever occur synchronously with the CPUs execution? i.e. if I execute >> this instruction, I will get an SMI. (or are these your handled-elsewhere-NMI case?) > > I'm not aware of synchronous exceptions that end up as SMIs, though I > imagine it is possible on x86. An IO write to an SMI trap is equivalent > to that. Okay, that's good to know. >> On arm64 poisoned-cache locations triggering an external abort (leading to a >> NOTIFY_SEA NMI-like notification to APEI) in response to a load is synchronous, >> hence the live-lock problem. > > How is this currently handled in the kernel-first case? We don't have any kernel-first support today it would need the Arm version of MCA (v8.2 RAS Extensions) which are relatively new. Everything we've seen so far is using firmware-first as it has to know a little about the SoC topology. How would it be handled? For the kernel-first version of NOTIFY_SEA the code would get the interrupted pt_regs and can probe the RAS Error registers to get the physical address and properties of the fault. It can then determine whether this can be handled, and kick off the appropriate recovery helper. If it interrupted the kernel, the helper has to schedule the recovery work, if it interrupted interrupts-masked code, it would have to self-IPI to then be able to schedule the work. The same return-to-a-broken-context problem exists for kernel-first, we can't do it there either. >>> I am not concert about livelocking the system. >> >> This is what keeps me awake at night. > > Nothing keeps me awake at night. I keep others awake at night. (Ha! I owe you a beer!) >> I think your platform has NMI handled as MCE/kernel-first for synchronous CPU >> errors. SMI triggered by PCI-AER are notified by NMI too, but handled by >> firmware-first. > > MCEs come in via FFS. Oh? I thought those were to separate worlds. I assumed MCE-NMIs didn't go via SMM. > There's a school of thought that sees a certain > value in logging ECC errors to the BMC. It's arguable whether the ECC > errors are synchronous in 100% of cases. >> This hybrid kernel/firmware first depending on the error-source isn't the only >> way of building things. Appendix-N of the UEFI spec lists CPER records for >> firmware-first handling of types of error that your platform must 'handle >> elsewhere'. >> >> We have arm systems where kernel-first isn't viable as the equivalent of the MCA >> registers is platform/integration/silicon-revision specific. These systems >> handle all their errors using firmware-first. These systems will generate CPER >> records where fatal means fatal, and returning from the NMI-like notification >> will either send us into the weeds, or trap us back into firmware. > > Without getting into details, the non-viablity of kernel-first is > subjective. Sure, there are some similar problems for both. I think the way forward here is to spot defer-able errors, even if they are marked fatal. Initially this will just be AER, but if we can split up the handlers to have NMI/IRQ/Process context counterparts, we may be able to do enough work early-on to defer other types of error. Thanks, James -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 04/13/2018 11:38 AM, James Morse wrote: > Hi Alex, > > On 09/04/18 19:11, Alex G. wrote: >> On 04/06/2018 01:24 PM, James Morse wrote: >> Do you have any ETA on when your SEA patches are going to make it >> upstream? There's not much point in updating my patchset if it's going >> to conflict with your work. > > The SEA stuff went in with 7edda0886bc3 ("acpi: apei: handle SEA notification > type for ARMv8"). My series is moving it to use the estatus-queue in the same > way as x86's NOTIFY_NMI does. This lets us safely add the other two NMI-like > notifications. I have no idea on the ETA, it depends on review feedback! Okay. I'll get a v2 out soonish then. (snip) > This assumes a cache-invalidate will clear the error, which I don't think we're > guaranteed on arm. > It also destroys any adjacent data, "everyone's happy" includes the thread that > got a chunk of someone-else's stack frame, I don't think it will be happy for > very long! Hmm, no cache-line (or page) invalidation on arm64? How does dma_map/unmap_*() work then? You may not guarantee to fix the error, but I don't buy into the "let's crash without trying" argument. > (this is a side issue for AER though) Somebody muddled up AER with these tables, so we now have to worry about it. :) (snip) >> How does FFS handle race conditions that can occur when accessing HW >> concurrently with the OS? I'm told it's the main reasons why BIOS >> doesn't release unused cores from SMM early. > > This is firmware's problem, it depends on whether there is any hardware that is > shared with the OS. Some hardware can be marked 'secure' in which case only > firmware can access it, alternatively firmware can trap or just disable the OS's > access to the shared hardware. It's everyone's problem. It's the firmware's responsibility. > For example, with the v8.2 RAS Extensions, there are some per-cpu error > registers. Firmware can disable these for the OS, so that it always reads 0 from > them. Instead firmware takes the error via FF, reads the registers from > firmware, and dumps CPER records into the OS's memory. > > If there is a shared hardware resource that both the OS and firmware may be > accessing, yes firmware needs to pull the other CPUs in, but this depends on the > SoC design, it doesn't necessarily happen. The problem with shared resources is just a problem. I've seen systems where all 100 cores are held up for 300+ ms. In latency-critical applications reliability drops exponentially. Am I correct in assuming your answer would be to "hide" more stuff from the OS? (snip) > Sure, we're quirking our behaviour based on a high level of mistrust for the > firmware. My point here was we shouldn't duplicate the implementation because we > want x86:{AER,CPU,MEM} to behave differently to arm64:{AER,CPU,MEM}. I'd rather > the quirked-behaviour was along the *:{AER} versus *:{CPU,MEM} line. If we have > extra code to spot deferrable errors, we should use it on both architectures. It's a well earned and well deserved mistrust. Firmware is evil (*). (*) sarcastic overstatement of facts (snip) > For AER we agree, these never mean 'the CPU is on fire'. Sounds like a good marketing slogan: "ACME's new turbo-encabulated CPU -- it's on fire!" (snip) >>> even if broken firmware thinks they are fatal. > >> I think the idea of firmware-first is broken. But it's there, it's >> shipping in FW, so we have to accommodate it in SW. > > Part of our different-views here is firmware-first is taking something away from > you, whereas for me its giving me information that would otherwise be in > secret-soc-specific registers. Under this interpretation, FFS is a band-aid to the problem of "secret" registers. "Secret" hardware doesn't really fit well into the idea of an OS [1]. The irony of the solution is that the response is centered on firmware extensions which were designed to stile interoperability [2]. You are right, FFS is a poopstorm and headache for me. It takes my sanity away. I once wrote FW for a laptop where the only use of SMM was to run the "Enable ACPI" call -- which only disabled further SMM. We didn't need an iota of FFS because at that time AMD wasn't secretive, and there was no need to have "secret" registers in the first place. [1] https://www.youtube.com/watch?v=_36yNWw_07g [2] http://antitrust.slated.org/www.iowaconsumercase.org/011607/3000/PX03020.pdf (snip) >> And linux can handle a wide subset of MCEs just fine, so the >> ghes_is_deferrable() logic would, under my argument, agree to pass >> execution to the actual handlers. > > For some classes of error we can't safely get there. Optimize for the common case. (snip) >> Though in that case, the problem is generalized to "how to best handle >> error Y", rather than "how to handle error Y in FFS". > > (that's a good thing yes?) I assume x86 can take MCE errors out of IRQ-masked > code. Sharing the handle_foo_error_nmi() code between the two paths would be a > good thing. That's a good thing, yes. But, by ignorance, FFS doesn't always let you do this. Getting to your kernel data structures is not IRQ-safe in the general case. In the AER case, you need pci_get_domain_bus_and_slot(), which is not IRQ-safe. So, while in the native case, you can stage context pointers, and separate things based on interrupt vector, you don't get this sort of flexibility with FFS. There is the extra step of recovering your context, which is not the sort of situation you often get with kernel-first. Most of the discussion has been around whether to crash before we even get this context. Stay tuned for v2 though. (snip) >> Sarcasm aside, I do like the idea of a message complaining about the >> firmware. > > It wouldn't be appropriate in all cases, but for AER it looks like 'fatal' is > always an over-reaction. > For example, given a memory error firmware can't know whether we can re-read > some page of kernel memory from disk and then reschedule the thread. If firmware can't know the error is recoverable, it also can't know it's fatal. When firmware does try to make a determination from insufficient information, it opens itself up to sarky remarks. That's fair game. (snip) >> If we're going to split recovery paths, then it makes better sense to >> have a system where handleable errors are passed down to a lesser >> context. Or even run non-blocking handlers in whatever context they are >> received. Much more than having a special case for a specific error. > > Yes. I think running the NMI-safe parts of the handler first, then the IRQ-parts > when we drop to IRQ context is the way to do this. Coming soon to a ML near you. >>>> I abstract things at irq_work_queue(): "queue this work and let me >>>> return from this interrupt" >>> >>> ('from this NMI', to come back later as an IRQ, from where we can call >>> schedule_work_on(). This chaining is because the scheduler takes irqsave >>> spin-locks to schedule the work.) >> >> Hopefully, the handling is set up in such a way that I don't have to >> worry about these details on a per-error basis. If I do have to worry, >> them I must be doing something wrong. > > I agree, but something has to know. Lets tackle AER first, The actual handler knows. For AER, that's do_recovery(). > where we have none of theses concerns Untamed PCIe errors can end up as MCEs and triple-fault the CPU :) You didn't think we get a free ride with AER, did you? (snip) >>> On arm64 poisoned-cache locations triggering an external abort (leading to a >>> NOTIFY_SEA NMI-like notification to APEI) in response to a load is synchronous, >>> hence the live-lock problem. >> >> How is this currently handled in the kernel-first case? > > We don't have any kernel-first support today :( > it would need the Arm version of > MCA (v8.2 RAS Extensions) which are relatively new. Everything we've seen so far > is using firmware-first as it has to know a little about the SoC topology. > > How would it be handled? For the kernel-first version of NOTIFY_SEA the code > would get the interrupted pt_regs and can probe the RAS Error registers (snip) Sounds like a regular, standard, raceless error handler :) >>>> I am not concert about livelocking the system. >>> >>> This is what keeps me awake at night. >> >> Nothing keeps me awake at night. I keep others awake at night. > > (Ha! I owe you a beer!) (I will not forget) >>> I think your platform has NMI handled as MCE/kernel-first for synchronous CPU >>> errors. SMI triggered by PCI-AER are notified by NMI too, but handled by >>> firmware-first. >> >> MCEs come in via FFS. > > Oh? I thought those were to separate worlds. I assumed MCE-NMIs didn't go via SMM. FFS is done in SMM. I'm certain there's a very simple and easy way in which MCEs can be reported kernel-first, but I've only seen it done FFS recently. (snip) > I think the way forward here is to spot defer-able errors, even if they are > marked fatal. Initially this will just be AER, but if we can split up the > handlers to have NMI/IRQ/Process context counterparts, we may be able to do > enough work early-on to defer other types of error. You're gonna love my v2. Alex -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Alex, On 04/16/2018 10:59 PM, Alex G. wrote: > On 04/13/2018 11:38 AM, James Morse wrote: >> This assumes a cache-invalidate will clear the error, which I don't think we're >> guaranteed on arm. >> It also destroys any adjacent data, "everyone's happy" includes the thread that >> got a chunk of someone-else's stack frame, I don't think it will be happy for >> very long! > > Hmm, no cache-line (or page) invalidation on arm64? How does > dma_map/unmap_*() work then? You may not guarantee to fix the error, but There are cache-invalidate instructions, but I don't think 'solving' a RAS error with them is the right thing to do. > I don't buy into the "let's crash without trying" argument. Our 'cache writeback granule' may be as large as 2K, so we may have to invalidate up to 2K of data to convince the hardware this address is okay again. All we've done here is differently-corrupt the data so that it no longer generates a RAS fault, it just gives you the wrong data instead. Cache-invalidation is destructive. I don't think there is a one-size-fits-all solution here. >> (this is a side issue for AER though) > > Somebody muddled up AER with these tables, so we now have to worry about > it. :) Eh? I see there is a v2, maybe I'll understand this comment once I read it. >>> How does FFS handle race conditions that can occur when accessing HW >>> concurrently with the OS? I'm told it's the main reasons why BIOS >>> doesn't release unused cores from SMM early. >> >> This is firmware's problem, it depends on whether there is any hardware that is >> shared with the OS. Some hardware can be marked 'secure' in which case only >> firmware can access it, alternatively firmware can trap or just disable the OS's >> access to the shared hardware. > > It's everyone's problem. It's the firmware's responsibility. It depends on the SoC design. If there is no hardware that the OS and firmware both need to access to handle an error then I don't think firmware needs to do this. >> For example, with the v8.2 RAS Extensions, there are some per-cpu error >> registers. Firmware can disable these for the OS, so that it always reads 0 from >> them. Instead firmware takes the error via FF, reads the registers from >> firmware, and dumps CPER records into the OS's memory. >> >> If there is a shared hardware resource that both the OS and firmware may be >> accessing, yes firmware needs to pull the other CPUs in, but this depends on the >> SoC design, it doesn't necessarily happen. > > The problem with shared resources is just a problem. I've seen systems > where all 100 cores are held up for 300+ ms. In latency-critical > applications reliability drops exponentially. Am I correct in assuming > your answer would be to "hide" more stuff from the OS? No, I'm not a fan of firmware cycle stealing. If you can design the SoC or firmware so that the 'all CPUs' stuff doesn't need to happen, then you won't get these issues. (I don't design these things, I'm sure they're much more complicated than I think!) Because the firmware is SoC-specific, so it only needs to do exactly what is necessary. >>> I think the idea of firmware-first is broken. But it's there, it's >>> shipping in FW, so we have to accommodate it in SW. >> >> Part of our different-views here is firmware-first is taking something away from >> you, whereas for me its giving me information that would otherwise be in >> secret-soc-specific registers. > > Under this interpretation, FFS is a band-aid to the problem of "secret" > registers. "Secret" hardware doesn't really fit well into the idea of an > OS [1]. Sorry, I'm being sloppy with my terminology, by secret-soc-specific I mean either Linux can't access them (firmware privilege-level only) or Linux can't reasonably know where these registers are, as they're soc-specific and vary by manufacture. >>> And linux can handle a wide subset of MCEs just fine, so the >>> ghes_is_deferrable() logic would, under my argument, agree to pass >>> execution to the actual handlers. >> >> For some classes of error we can't safely get there. > > Optimize for the common case. At the expense of reliability? Thanks, James -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 04/20/2018 02:27 AM, James Morse wrote: > Hi Alex, > > On 04/16/2018 10:59 PM, Alex G. wrote: >> On 04/13/2018 11:38 AM, James Morse wrote: >>> This assumes a cache-invalidate will clear the error, which I don't > think we're >>> guaranteed on arm. >>> It also destroys any adjacent data, "everyone's happy" includes the > thread that >>> got a chunk of someone-else's stack frame, I don't think it will be > happy for >>> very long! >> >> Hmm, no cache-line (or page) invalidation on arm64? How does >> dma_map/unmap_*() work then? You may not guarantee to fix the error, but > > There are cache-invalidate instructions, but I don't think 'solving' a > RAS error with them is the right thing to do. You seem to be putting RAS on a pedestal in a very cloudy and foggy day. I admit that I fail to see the specialness of RAS in comparison to other errors. >> I don't buy into the "let's crash without trying" argument. > > Our 'cache writeback granule' may be as large as 2K, so we may have to > invalidate up to 2K of data to convince the hardware this address is > okay again. Eureka! OS can invalidate the entire page. 1:1 mapping with the memory management data. > All we've done here is differently-corrupt the data so that it no longer > generates a RAS fault, it just gives you the wrong data instead. > Cache-invalidation is destructive. > > I don't think there is a one-size-fits-all solution here. Of course there isn't. That's not the issue. A cache corruption is a special case of a memory access issue, and that, we already know how to handle. Triple-fault and cpu-on-fire concerns apply wrt returning to the context which triggered the problem. We've already figured that out. There is a lot of opportunity here for using well tested code paths and not crashing on first go. Why let firmware make this a problem again? >>> (this is a side issue for AER though) >> >> Somebody muddled up AER with these tables, so we now have to worry about >> it. :) > > Eh? I see there is a v2, maybe I'll understand this comment once I read it. I meant that somebody (the spec writers) decided to put ominous errors (PCIe) on the same severity scale with "cpu is on fire" errors. >>>> How does FFS handle race conditions that can occur when accessing HW >>>> concurrently with the OS? I'm told it's the main reasons why BIOS >>>> doesn't release unused cores from SMM early. >>> >>> This is firmware's problem, it depends on whether there is any > hardware that is >>> shared with the OS. Some hardware can be marked 'secure' in which > case only >>> firmware can access it, alternatively firmware can trap or just > disable the OS's >>> access to the shared hardware. >> >> It's everyone's problem. It's the firmware's responsibility. > > It depends on the SoC design. If there is no hardware that the OS and > firmware both need to access to handle an error then I don't think > firmware needs to do this. > > >>> For example, with the v8.2 RAS Extensions, there are some per-cpu error >>> registers. Firmware can disable these for the OS, so that it always > reads 0 from >>> them. Instead firmware takes the error via FF, reads the registers from >>> firmware, and dumps CPER records into the OS's memory. >>> >>> If there is a shared hardware resource that both the OS and firmware > may be >>> accessing, yes firmware needs to pull the other CPUs in, but this > depends on the >>> SoC design, it doesn't necessarily happen. >> >> The problem with shared resources is just a problem. I've seen systems >> where all 100 cores are held up for 300+ ms. In latency-critical >> applications reliability drops exponentially. Am I correct in assuming >> your answer would be to "hide" more stuff from the OS? > > No, I'm not a fan of firmware cycle stealing. If you can design the SoC or > firmware so that the 'all CPUs' stuff doesn't need to happen, then you > won't get > these issues. (I don't design these things, I'm sure they're much more > complicated > than I think!) > > Because the firmware is SoC-specific, so it only needs to do exactly > what is necessary. Irrespective of the hardware design, there's devicetree, ACPI methods, and a few other ways to inform the OS of non-standard bits. They don't have the resource sharing problem. I'm confused as to why FFS is used when there are concerns about resource conflicts instead of race-free alternatives. >>>> I think the idea of firmware-first is broken. But it's there, it's >>>> shipping in FW, so we have to accommodate it in SW. >>> >>> Part of our different-views here is firmware-first is taking > something away from >>> you, whereas for me its giving me information that would otherwise be in >>> secret-soc-specific registers. >> >> Under this interpretation, FFS is a band-aid to the problem of "secret" >> registers. "Secret" hardware doesn't really fit well into the idea of an >> OS [1]. > > Sorry, I'm being sloppy with my terminology, by secret-soc-specific I > mean either Linux can't access them (firmware privilege-level only) or > Linux can't reasonably know where these registers are, as they're > soc-specific and vary by manufacture. This is still a software problem. I'm assuming register access can be granted to the OS, and I'm also assuming that there exists a non-FFS way to describe the registers to the OS. >>>> And linux can handle a wide subset of MCEs just fine, so the >>>> ghes_is_deferrable() logic would, under my argument, agree to pass >>>> execution to the actual handlers. >>> >>> For some classes of error we can't safely get there. >> >> Optimize for the common case. > > At the expense of reliability? Who suggested to sacrifice reliability? Alex -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c index 2c998125b1d5..7243a99ea57e 100644 --- a/drivers/acpi/apei/ghes.c +++ b/drivers/acpi/apei/ghes.c @@ -428,8 +428,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int * GHES_SEV_RECOVERABLE -> AER_NONFATAL * GHES_SEV_RECOVERABLE && CPER_SEC_RESET -> AER_FATAL * These both need to be reported and recovered from by the AER driver. - * GHES_SEV_PANIC does not make it to this handling since the kernel must - * panic. + * GHES_SEV_PANIC -> AER_FATAL */ static bool ghes_handle_aer(struct acpi_hest_generic_data *gdata) { @@ -899,6 +898,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work) struct ghes_estatus_node *estatus_node; struct acpi_hest_generic *generic; struct acpi_hest_generic_status *estatus; + int corrected_sev; u32 len, node_len; llnode = llist_del_all(&ghes_estatus_llist); @@ -914,7 +914,14 @@ static void ghes_proc_in_irq(struct irq_work *irq_work) estatus = GHES_ESTATUS_FROM_NODE(estatus_node); len = cper_estatus_len(estatus); node_len = GHES_ESTATUS_NODE_LEN(len); - ghes_do_proc(estatus_node->ghes, estatus); + corrected_sev = ghes_do_proc(estatus_node->ghes, estatus); + + if (corrected_sev >= GHES_SEV_PANIC) { + oops_begin(); + ghes_print_queued_estatus(); + __ghes_panic(estatus_node->ghes); + } + if (!ghes_estatus_cached(estatus)) { generic = estatus_node->generic; if (ghes_print_estatus(NULL, generic, estatus)) @@ -955,7 +962,7 @@ static void __process_error(struct ghes *ghes) static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs) { struct ghes *ghes; - int sev, ret = NMI_DONE; + int ret = NMI_DONE; if (!atomic_add_unless(&ghes_in_nmi, 1, 1)) return ret; @@ -968,13 +975,6 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs) ret = NMI_HANDLED; } - sev = ghes_severity(ghes->estatus->error_severity); - if (sev >= GHES_SEV_PANIC) { - oops_begin(); - ghes_print_queued_estatus(); - __ghes_panic(ghes); - } - if (!(ghes->flags & GHES_TO_CLEAR)) continue;
BIOSes like to send NMIs for a number of silly reasons often deemed to be "fatal". For example pin bounce during a PCIE hotplug/unplug might cause the link to go down and retrain, with fatal PCI errors being generated while the link is retraining. Instead of panic()ing in NMI context, pass fatal errors down to IRQ context to see if they can be resolved. With these change, PCIe error are handled by AER. Other far less common errors, such as machine check exceptions, still cause a panic() in their respective handlers. Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com> --- drivers/acpi/apei/ghes.c | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-)