Message ID | 20220719195325.402745-11-gpiccoli@igalia.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | The panic notifiers refactor strikes back - fixes/clean-ups | expand |
On 19/07/2022 16:53, Guilherme G. Piccoli wrote: > The altera_edac panic notifier performs some data collection with > regards errors detected; such code relies in the regmap layer to > perform reads/writes, so the code is abstracted and there is some > risk level to execute that, since the panic path runs in atomic > context, with interrupts/preemption and secondary CPUs disabled. > > Users want the information collected in this panic notifier though, > so in order to balance the risk/benefit, let's skip the altera panic > notifier if kdump is loaded. While at it, remove a useless header > and encompass a macro inside the sole ifdef block it is used. > > Cc: Dinh Nguyen <dinguyen@kernel.org> > Cc: Petr Mladek <pmladek@suse.com> > Cc: Tony Luck <tony.luck@intel.com> > Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com> > > --- > > V2: > - new patch, based on the discussion in [0]. > [0] https://lore.kernel.org/lkml/62a63fc2-346f-f375-043a-fa21385279df@igalia.com/ > > drivers/edac/altera_edac.c | 16 ++++++++++++---- > 1 file changed, 12 insertions(+), 4 deletions(-) > [...] Hey Tony / Dinh, do you think this patch is good, based on the discussion we had [0] last iteration? Thanks in advance, Guilherme [0] https://lore.kernel.org/lkml/599b72f6-76a4-8e6d-5432-56fb1ffd7e0b@igalia.com/
On 7/19/22 14:53, Guilherme G. Piccoli wrote: > The altera_edac panic notifier performs some data collection with > regards errors detected; such code relies in the regmap layer to > perform reads/writes, so the code is abstracted and there is some > risk level to execute that, since the panic path runs in atomic > context, with interrupts/preemption and secondary CPUs disabled. > > Users want the information collected in this panic notifier though, > so in order to balance the risk/benefit, let's skip the altera panic > notifier if kdump is loaded. While at it, remove a useless header > and encompass a macro inside the sole ifdef block it is used. > > Cc: Dinh Nguyen <dinguyen@kernel.org> > Cc: Petr Mladek <pmladek@suse.com> > Cc: Tony Luck <tony.luck@intel.com> > Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com> > > --- > > V2: > - new patch, based on the discussion in [0]. > [0] https://lore.kernel.org/lkml/62a63fc2-346f-f375-043a-fa21385279df@igalia.com/ > Acked-by: Dinh Nguyen <dinguyen@kernel.org>
On 16/08/2022 15:44, Dinh Nguyen wrote: > [...] >> V2: >> - new patch, based on the discussion in [0]. >> [0] https://lore.kernel.org/lkml/62a63fc2-346f-f375-043a-fa21385279df@igalia.com/ >> > > Acked-by: Dinh Nguyen <dinguyen@kernel.org> Thanks a lot Dinh! There is something I'm asking for maintainers on patches in this series, so here it goes for you (CC Borislav / Tony): do you think it's possible to pick this one in your tree directly, from this series, or do you prefer I re-submit as a standalone patch, on linux-edac maybe? Thanks, Guilherme
On Tue, Jul 19, 2022 at 04:53:23PM -0300, Guilherme G. Piccoli wrote: > The altera_edac panic notifier performs some data collection with > regards errors detected; such code relies in the regmap layer to > perform reads/writes, so the code is abstracted and there is some > risk level to execute that, since the panic path runs in atomic > context, with interrupts/preemption and secondary CPUs disabled. > > Users want the information collected in this panic notifier though, > so in order to balance the risk/benefit, let's skip the altera panic > notifier if kdump is loaded. While at it, remove a useless header > and encompass a macro inside the sole ifdef block it is used. How does the fact that kdump is loaded, obviate the need to print information about the errors? Are you suggesting that people who have the whole vmcore would be able to piece together the error information?
On 17/08/2022 14:31, Borislav Petkov wrote: > [...] > > How does the fact that kdump is loaded, obviate the need to print > information about the errors? > > Are you suggesting that people who have the whole vmcore would be able > to piece together the error information? > Hi Boris, thanks for chiming in. So, this is part of an effort to clean-up the panic path. Currently, if a kdump happens, *all* the panic notifiers are skipped by default, including this one. In this scenario, this patch seems like a no-op. But happens that in the refactor we are proposing [0], some notifiers should run before the kdump. We are basically putting some ordering in the way notifiers are executed, while documenting this properly and with the goal of not increasing the failure risk for kdump. This patch is useful so we can bring the altera EDAC notifier to run earlier while not increasing the risk on kdump - this operation is a bit "delicate" to happen in the panic scenario. The origin of this patch was a discussion with Tony/Peter [1], guess we can call it a "compromise solution". Let me know if you disagree or have more questions, and in case you'd like to check/participate in the whole panic notifiers refactor discussion, it would be great =) Cheers, Guilherme [0] https://lore.kernel.org/lkml/20220427224924.592546-1-gpiccoli@igalia.com/ [1] https://lore.kernel.org/lkml/62a63fc2-346f-f375-043a-fa21385279df@igalia.com/
On Wed, Aug 17, 2022 at 03:45:30PM -0300, Guilherme G. Piccoli wrote: > But happens that in the refactor we are proposing [0], some notifiers > should run before the kdump. We are basically putting some ordering in > the way notifiers are executed, while documenting this properly and with > the goal of not increasing the failure risk for kdump. What is "the failure risk for kdump"? Some of the notifiers which run before kdump might fail and thus prevent the machine from kdumping? > This patch is useful so we can bring the altera EDAC notifier to run > earlier while not increasing the risk on kdump - this operation is a bit > "delicate" to happen in the panic scenario. The origin of this patch was > a discussion with Tony/Peter [1], guess we can call it a "compromise > solution". My question stands: if kdump is loaded and the s10_edac_dberr_handler() does not read the the fatal errors and they don't get shown in dmesg before the machine panics, how do you intend to show that information to the user? Because fatal errors are something you absolutely wanna show, at least, in dmesg! I don't think you can "read" the errors from vmcore - they need to be read from the hw registers before the machine dies. Thx.
On 17/08/2022 16:34, Borislav Petkov wrote: > [...] > > What is "the failure risk for kdump"? > > Some of the notifiers which run before kdump might fail and thus prevent > the machine from kdumping? > Exactly; some notifiers could break the machine and prevent a successful kdump. The EDAC one is consider medium risk, due to invasive operations (register readings on panic situation). > [...] > My question stands: if kdump is loaded and the s10_edac_dberr_handler() > does not read the the fatal errors and they don't get shown in dmesg > before the machine panics, how do you intend to show that information to > the user? > > Because fatal errors are something you absolutely wanna show, at least, > in dmesg! > > I don't think you can "read" the errors from vmcore - they need to be > read from the hw registers before the machine dies. > My understanding is the same as yours, i.e., this is not possible to collect from vmcore, it requires register reading. But again: if you kdump your machine today, you won't collect this information, patch changed nothing in that regard. The one thing it changes is that you'd skip the altera register dump if kdump is set AND you managed to also set "crash_kexec_post_notifiers". In case you / Dinh / Tony disagrees with the patch, it's fine and we can discard it, but then this notifier couldn't run early in the refactor we are doing, it'd postponed to run later. This are is full of trade-offs, we just need to choose what compromise solution is preferred by the majority of developers =) Cheers, Guilherme
On Wed, Aug 17, 2022 at 05:28:34PM -0300, Guilherme G. Piccoli wrote: > My understanding is the same as yours, i.e., this is not possible to > collect from vmcore, it requires register reading. But again: if you > kdump your machine today, you won't collect this information, patch > changed nothing in that regard. Why won't you be able to collect it? You can certainly access dmesg in the vmcore and see those errors logged there. > The one thing it changes is that you'd skip the altera register dump if > kdump is set AND you managed to also set "crash_kexec_post_notifiers". What your patch changes is, it prevents s10_edac_dberr_handler() from logging potentially important fatal hw errors when kdump is loaded. If Dinh is fine with that, I'll take the patch. But it looks like a bad idea to me.
On 17/08/2022 18:02, Borislav Petkov wrote: > On Wed, Aug 17, 2022 at 05:28:34PM -0300, Guilherme G. Piccoli wrote: >> My understanding is the same as yours, i.e., this is not possible to >> collect from vmcore, it requires register reading. But again: if you >> kdump your machine today, you won't collect this information, patch >> changed nothing in that regard. > > Why won't you be able to collect it? You can certainly access dmesg in > the vmcore and see those errors logged there. Sorry for the confusion, let me try to be a bit more clear: (1) if we kdump but we *didn't run* s10_edac_dberr_handler() before kdump, the information is lost, since s10_edac_dberr_handler() performs register readings. That information is not contained inside the vmcore. (2) If for some reason the function s10_edac_dberr_handler() *was executed prior to kdump*, of course the registers information would be on dmesg, easy to collect in the vmcore. Makes sense? > >> The one thing it changes is that you'd skip the altera register dump if >> kdump is set AND you managed to also set "crash_kexec_post_notifiers". > > What your patch changes is, it prevents s10_edac_dberr_handler() from > logging potentially important fatal hw errors when kdump is loaded. Agreed. If kdump is loaded, we cannot log that information (modulo that we do not collect it today by default on kdump as well). The other part of story (the reason of the patch) is that we plan to start running this panic notifier a bit earlier, being able to collect such edac logs with pstore, for example. > > If Dinh is fine with that, I'll take the patch. But it looks like a bad > idea to me. > I think we should seek what the majority of the folks consider the best, in order to converge to some well-accepted solution. I'm completely OK in dropping this one and rework with some other idea, or we can leave this panic notifier as is, continue running that a bit later. Tony / Petr (when back), suggestions are welcome =) Cheers, Guilherme
On Wed, Aug 17, 2022 at 06:39:07PM -0300, Guilherme G. Piccoli wrote:
> Sorry for the confusion, let me try to be a bit more clear:
I think you're missing the point. Lemme try again:
You *absolutely* must log those errors because they're important. It
doesn't matter if this is done in a panic notifier and you're changing
that whole shebang or through some other magic.
If you do, then this driver needs to *still* *log* those fatal errors -
regardless through a panic notifier or some novel contraption it wants
to use.
So if you want to change the panic notifiers, you *must* make sure those
errors still do get logged.
Better?
On 17/08/2022 18:46, Borislav Petkov wrote: > On Wed, Aug 17, 2022 at 06:39:07PM -0300, Guilherme G. Piccoli wrote: >> Sorry for the confusion, let me try to be a bit more clear: > > I think you're missing the point. Lemme try again: > > You *absolutely* must log those errors because they're important. It > doesn't matter if this is done in a panic notifier and you're changing > that whole shebang or through some other magic. > > If you do, then this driver needs to *still* *log* those fatal errors - > regardless through a panic notifier or some novel contraption it wants > to use. > > So if you want to change the panic notifiers, you *must* make sure those > errors still do get logged. > > Better? > Boris, I understand the importance of the logs, for sure! But do you agree that currently, in case of a kdump, that information *is not collected*, with our without my patch?
On Wed, Aug 17, 2022 at 06:56:11PM -0300, Guilherme G. Piccoli wrote: > But do you agree that currently, in case of a kdump, that information > *is not collected*, with our without my patch? If for some reason that panic notifier does not get run before kdump, then that's a bug and that driver should not use a panic notifier to log hw errors in the first place.
On 17/08/2022 19:00, Borislav Petkov wrote: > On Wed, Aug 17, 2022 at 06:56:11PM -0300, Guilherme G. Piccoli wrote: >> But do you agree that currently, in case of a kdump, that information >> *is not collected*, with our without my patch? > > If for some reason that panic notifier does not get run before kdump, > then that's a bug and that driver should not use a panic notifier to log > hw errors in the first place. > Indeed, the notifiers don't run before kdump by default, in a conscious decision of the kdump maintainers. You might be right, in the sense that maybe the edac error handler shouldn't run as a panic notifier. Let's see if Tony / Dinh can chime in on that discussion - we could move it to run in the kexec event as well, so it'd always run before a kdump, but maybe the risk it offers during panic time is not worth. Again - a matter of a trade-off, a good compromise must be agreed by all parties (kdump maintainers are usually extremely afraid of taking risks to not break kdump). Cheers!
On Wed, Aug 17, 2022 at 07:09:26PM -0300, Guilherme G. Piccoli wrote: > Again - a matter of a trade-off, a good compromise must be agreed by all > parties (kdump maintainers are usually extremely afraid of taking risks > to not break kdump). Right, and logging hw errors from a panic notifier feels simply weird. x86 has its own, special notifier exactly for that and it is independent from the panic path but it gets run right after the exception raised due to the hw error is done. Dunno if ARM has such facilities tho... Thx.
On 17/08/2022 19:19, Borislav Petkov wrote: > On Wed, Aug 17, 2022 at 07:09:26PM -0300, Guilherme G. Piccoli wrote: >> Again - a matter of a trade-off, a good compromise must be agreed by all >> parties (kdump maintainers are usually extremely afraid of taking risks >> to not break kdump). > > Right, and logging hw errors from a panic notifier feels simply weird. > > x86 has its own, special notifier exactly for that and it is independent > from the panic path but it gets run right after the exception raised due > to the hw error is done. > > Dunno if ARM has such facilities tho... > > Thx. > Yeah, MCE stuff right? Not sure for ARM and other architectures as well. Cheers, Guilherme
diff --git a/drivers/edac/altera_edac.c b/drivers/edac/altera_edac.c index e7e8e624a436..741fe5539154 100644 --- a/drivers/edac/altera_edac.c +++ b/drivers/edac/altera_edac.c @@ -16,7 +16,6 @@ #include <linux/kernel.h> #include <linux/mfd/altera-sysmgr.h> #include <linux/mfd/syscon.h> -#include <linux/notifier.h> #include <linux/of_address.h> #include <linux/of_irq.h> #include <linux/of_platform.h> @@ -24,6 +23,7 @@ #include <linux/platform_device.h> #include <linux/regmap.h> #include <linux/types.h> +#include <linux/kexec.h> #include <linux/uaccess.h> #include "altera_edac.h" @@ -2063,22 +2063,30 @@ static const struct irq_domain_ops a10_eccmgr_ic_ops = { }; /************** Stratix 10 EDAC Double Bit Error Handler ************/ -#define to_a10edac(p, m) container_of(p, struct altr_arria10_edac, m) - #ifdef CONFIG_64BIT /* panic routine issues reboot on non-zero panic_timeout */ extern int panic_timeout; +#define to_a10edac(p, m) container_of(p, struct altr_arria10_edac, m) + /* * The double bit error is handled through SError which is fatal. This is * called as a panic notifier to printout ECC error info as part of the panic. + * + * Notice that if kdump is set, we take the risk avoidance approach and + * skip the notifier, given that users are expected to have access to a + * full vmcore. */ static int s10_edac_dberr_handler(struct notifier_block *this, unsigned long event, void *ptr) { - struct altr_arria10_edac *edac = to_a10edac(this, panic_notifier); + struct altr_arria10_edac *edac; int err_addr, dberror; + if (kexec_crash_loaded()) + return NOTIFY_DONE; + + edac = to_a10edac(this, panic_notifier); regmap_read(edac->ecc_mgr_map, S10_SYSMGR_ECC_INTSTAT_DERR_OFST, &dberror); regmap_write(edac->ecc_mgr_map, S10_SYSMGR_UE_VAL_OFST, dberror);
The altera_edac panic notifier performs some data collection with regards errors detected; such code relies in the regmap layer to perform reads/writes, so the code is abstracted and there is some risk level to execute that, since the panic path runs in atomic context, with interrupts/preemption and secondary CPUs disabled. Users want the information collected in this panic notifier though, so in order to balance the risk/benefit, let's skip the altera panic notifier if kdump is loaded. While at it, remove a useless header and encompass a macro inside the sole ifdef block it is used. Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com> --- V2: - new patch, based on the discussion in [0]. [0] https://lore.kernel.org/lkml/62a63fc2-346f-f375-043a-fa21385279df@igalia.com/ drivers/edac/altera_edac.c | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-)