Message ID | 20220819221731.480795-9-gpiccoli@igalia.com (mailing list archive) |
---|---|
State | New, archived |
Series | The panic notifiers refactor - fixes/clean-ups (V3) |
On 19/08/2022 19:17, Guilherme G. Piccoli wrote:
> The altera_edac panic notifier performs some data collection with
> regards errors detected; such code relies in the regmap layer to
> perform reads/writes, so the code is abstracted and there is some
> risk level to execute that, since the panic path runs in atomic
> context, with interrupts/preemption and secondary CPUs disabled.
>
> Users want the information collected in this panic notifier though,
> so in order to balance the risk/benefit, let's skip the altera panic
> notifier if kdump is loaded. While at it, remove a useless header
> and encompass a macro inside the sole ifdef block it is used.
>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Petr Mladek <pmladek@suse.com>
> Cc: Tony Luck <tony.luck@intel.com>
> Acked-by: Dinh Nguyen <dinguyen@kernel.org>
> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
>
> ---
>
> V3:
> - added the ack tag from Dinh - thanks!
> - had a good discussion with Boris about that in V2 [0],
> hopefully we can continue and reach a consensus in this V3.
> [0] https://lore.kernel.org/lkml/46137c67-25b4-6657-33b7-cffdc7afc0d7@igalia.com/
>
> V2:
> - new patch, based on the discussion in [1].
> [1] https://lore.kernel.org/lkml/62a63fc2-346f-f375-043a-fa21385279df@igalia.com/
>
> [...]

Hi Dinh, Tony, Boris - sorry for the ping.

Appreciate reviews on this one - Dinh already ACKed the patch but Boris
raised some points in the past version [0], so any opinions or
discussions are welcome!

Thanks,

Guilherme

[0] https://lore.kernel.org/lkml/46137c67-25b4-6657-33b7-cffdc7afc0d7@igalia.com/
On 18/09/2022 11:10, Guilherme G. Piccoli wrote:
> On 19/08/2022 19:17, Guilherme G. Piccoli wrote:
>> The altera_edac panic notifier performs some data collection with
>> regards errors detected; such code relies in the regmap layer to
>> perform reads/writes, so the code is abstracted and there is some
>> risk level to execute that, since the panic path runs in atomic
>> context, with interrupts/preemption and secondary CPUs disabled.
>>
>> Users want the information collected in this panic notifier though,
>> so in order to balance the risk/benefit, let's skip the altera panic
>> notifier if kdump is loaded. While at it, remove a useless header
>> and encompass a macro inside the sole ifdef block it is used.
>>
>> Cc: Borislav Petkov <bp@alien8.de>
>> Cc: Petr Mladek <pmladek@suse.com>
>> Cc: Tony Luck <tony.luck@intel.com>
>> Acked-by: Dinh Nguyen <dinguyen@kernel.org>
>> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
>>
>> ---
>>
>> V3:
>> - added the ack tag from Dinh - thanks!
>> - had a good discussion with Boris about that in V2 [0],
>> hopefully we can continue and reach a consensus in this V3.
>> [0] https://lore.kernel.org/lkml/46137c67-25b4-6657-33b7-cffdc7afc0d7@igalia.com/
>>
>> V2:
>> - new patch, based on the discussion in [1].
>> [1] https://lore.kernel.org/lkml/62a63fc2-346f-f375-043a-fa21385279df@igalia.com/
>>
>> [...]
>
> Hi Dinh, Tony, Boris - sorry for the ping.

Hey folks, apologies for the new ping.

Is there anything to improve here maybe?
Reviews / opinions are very appreciated!

Cheers,

Guilherme
On 18/09/2022 11:10, Guilherme G. Piccoli wrote:
> On 19/08/2022 19:17, Guilherme G. Piccoli wrote:
>> The altera_edac panic notifier performs some data collection with
>> regards errors detected; such code relies in the regmap layer to
>> perform reads/writes, so the code is abstracted and there is some
>> risk level to execute that, since the panic path runs in atomic
>> context, with interrupts/preemption and secondary CPUs disabled.
>>
>> Users want the information collected in this panic notifier though,
>> so in order to balance the risk/benefit, let's skip the altera panic
>> notifier if kdump is loaded. While at it, remove a useless header
>> and encompass a macro inside the sole ifdef block it is used.
>>
>> Cc: Borislav Petkov <bp@alien8.de>
>> Cc: Petr Mladek <pmladek@suse.com>
>> Cc: Tony Luck <tony.luck@intel.com>
>> Acked-by: Dinh Nguyen <dinguyen@kernel.org>
>> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
>>
>> ---
>>
>> V3:
>> - added the ack tag from Dinh - thanks!
>> - had a good discussion with Boris about that in V2 [0],
>> hopefully we can continue and reach a consensus in this V3.
>> [0] https://lore.kernel.org/lkml/46137c67-25b4-6657-33b7-cffdc7afc0d7@igalia.com/
>>
>> V2:
>> - new patch, based on the discussion in [1].
>> [1] https://lore.kernel.org/lkml/62a63fc2-346f-f375-043a-fa21385279df@igalia.com/
>>
>> [...]
>
> Hi Dinh, Tony, Boris - sorry for the ping.
>
> Appreciate reviews on this one - Dinh already ACKed the patch but Boris
> raised some points in the past version [0], so any opinions or
> discussions are welcome!

Hi folks, monthly ping heheh
Apologies for the re-pings, please let me know if there is anything
required to move on this patch.

Cheers,

Guilherme

P.S. I've been trimming the huge CC list in the series, done it here as
well.
On Tue, Nov 22, 2022 at 10:33:12AM -0300, Guilherme G. Piccoli wrote:

Leaving in the whole thing for newly added people.

> On 18/09/2022 11:10, Guilherme G. Piccoli wrote:
> > On 19/08/2022 19:17, Guilherme G. Piccoli wrote:
> >> The altera_edac panic notifier performs some data collection with
> >> regards errors detected; such code relies in the regmap layer to
> >> perform reads/writes, so the code is abstracted and there is some
> >> risk level to execute that, since the panic path runs in atomic
> >> context, with interrupts/preemption and secondary CPUs disabled.
> >>
> >> Users want the information collected in this panic notifier though,
> >> so in order to balance the risk/benefit, let's skip the altera panic
> >> notifier if kdump is loaded. While at it, remove a useless header
> >> and encompass a macro inside the sole ifdef block it is used.
> >>
> >> Cc: Borislav Petkov <bp@alien8.de>
> >> Cc: Petr Mladek <pmladek@suse.com>
> >> Cc: Tony Luck <tony.luck@intel.com>
> >> Acked-by: Dinh Nguyen <dinguyen@kernel.org>
> >> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
> >>
> >> ---
> >>
> >> V3:
> >> - added the ack tag from Dinh - thanks!
> >> - had a good discussion with Boris about that in V2 [0],
> >> hopefully we can continue and reach a consensus in this V3.
> >> [0] https://lore.kernel.org/lkml/46137c67-25b4-6657-33b7-cffdc7afc0d7@igalia.com/
> >>
> >> V2:
> >> - new patch, based on the discussion in [1].
> >> [1] https://lore.kernel.org/lkml/62a63fc2-346f-f375-043a-fa21385279df@igalia.com/
> >>
> >> [...]
> >
> > Hi Dinh, Tony, Boris - sorry for the ping.
> >
> > Appreciate reviews on this one - Dinh already ACKed the patch but Boris
> > raised some points in the past version [0], so any opinions or
> > discussions are welcome!
>
>
> Hi folks, monthly ping heheh
> Apologies for the re-pings, please let me know if there is anything
> required to move on this patch.

Looking at this again, I really don't like the sprinkling of

if (kexec_crash_loaded())

in unrelated code. And I still think that the real fix here is to kill
this

edac->panic_notifier

thing. And replace it with simply logging the error from the double bit
error interrupt handle. That DBERR IRQ thing altr_edac_a10_irq_handler().
Because this is what this panic notifier does - dump double-bit errors.

Now, if Dinh doesn't move, I guess we can ask Tony and/or Rabara (he has
sent a patch for this driver recently and Altera belongs to Intel now)
to find someone who can test such a change and we (you could give it a
try first :)) can do that change.

Thx.
On Fri 2022-08-19 19:17:28, Guilherme G. Piccoli wrote:
> The altera_edac panic notifier performs some data collection with
> regards errors detected; such code relies in the regmap layer to
> perform reads/writes, so the code is abstracted and there is some
> risk level to execute that, since the panic path runs in atomic
> context, with interrupts/preemption and secondary CPUs disabled.
>
> Users want the information collected in this panic notifier though,
> so in order to balance the risk/benefit, let's skip the altera panic
> notifier if kdump is loaded. While at it, remove a useless header
> and encompass a macro inside the sole ifdef block it is used.
>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Petr Mladek <pmladek@suse.com>
> Cc: Tony Luck <tony.luck@intel.com>
> Acked-by: Dinh Nguyen <dinguyen@kernel.org>
> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
>
> ---
>
> V3:
> - added the ack tag from Dinh - thanks!
> - had a good discussion with Boris about that in V2 [0],
> hopefully we can continue and reach a consensus in this V3.
> [0] https://lore.kernel.org/lkml/46137c67-25b4-6657-33b7-cffdc7afc0d7@igalia.com/
>
> V2:
> - new patch, based on the discussion in [1].
> [1] https://lore.kernel.org/lkml/62a63fc2-346f-f375-043a-fa21385279df@igalia.com/
>
>
>  drivers/edac/altera_edac.c | 16 ++++++++++++----
>  1 file changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/edac/altera_edac.c b/drivers/edac/altera_edac.c
> index e7e8e624a436..741fe5539154 100644
> --- a/drivers/edac/altera_edac.c
> +++ b/drivers/edac/altera_edac.c
> @@ -16,7 +16,6 @@
>  #include <linux/kernel.h>
>  #include <linux/mfd/altera-sysmgr.h>
>  #include <linux/mfd/syscon.h>
> -#include <linux/notifier.h>
>  #include <linux/of_address.h>
>  #include <linux/of_irq.h>
>  #include <linux/of_platform.h>
> @@ -24,6 +23,7 @@
>  #include <linux/platform_device.h>
>  #include <linux/regmap.h>
>  #include <linux/types.h>
> +#include <linux/kexec.h>
>  #include <linux/uaccess.h>
>
>  #include "altera_edac.h"
> @@ -2063,22 +2063,30 @@ static const struct irq_domain_ops a10_eccmgr_ic_ops = {
>  };
>
>  /************** Stratix 10 EDAC Double Bit Error Handler ************/
> -#define to_a10edac(p, m) container_of(p, struct altr_arria10_edac, m)
> -
>  #ifdef CONFIG_64BIT
>  /* panic routine issues reboot on non-zero panic_timeout */
>  extern int panic_timeout;
>
> +#define to_a10edac(p, m) container_of(p, struct altr_arria10_edac, m)
> +
>  /*
>   * The double bit error is handled through SError which is fatal. This is
>   * called as a panic notifier to printout ECC error info as part of the panic.
> + *
> + * Notice that if kdump is set, we take the risk avoidance approach and
> + * skip the notifier, given that users are expected to have access to a
> + * full vmcore.
>   */
>  static int s10_edac_dberr_handler(struct notifier_block *this,
>  				  unsigned long event, void *ptr)
>  {
> -	struct altr_arria10_edac *edac = to_a10edac(this, panic_notifier);
> +	struct altr_arria10_edac *edac;
>  	int err_addr, dberror;
>
> +	if (kexec_crash_loaded())
> +		return NOTIFY_DONE;

I have read the discussion about v2 [1] and this looks like a bad
approach from my POV.

My understanding is that the information provided by this notifier
could not be found in the crashdump. It means that people really
want to run this before crashdump in principle.

Of course, there is the question how much safe this code is. I mean
if the panic() code path might get blocked here.

I see two possibilities.

The best solution would be if we know that this is "always" safe or if
it can be done a safe way. Then we could keep it as it is or implement
the safe way.

Alternative solution would be to create a kernel parameter that
would enable/disable this particular report when kdump is enabled.
The question would be the default. It would depend on how risky
the code is and how useful the information is.

[1] https://lore.kernel.org/r/20220719195325.402745-11-gpiccoli@igalia.com

> +
> +	edac = to_a10edac(this, panic_notifier);
>  	regmap_read(edac->ecc_mgr_map, S10_SYSMGR_ECC_INTSTAT_DERR_OFST,
>  		    &dberror);
>  	regmap_write(edac->ecc_mgr_map, S10_SYSMGR_UE_VAL_OFST, dberror);

Best Regards,
Petr
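
The second possibility in the message above (a switch that decides whether this report runs while kdump is loaded) can be modelled outside the kernel. What follows is only a user-space sketch under stated assumptions, not driver code: `panic_policy`, `should_run_report`, `model_dberr_notifier` and the `MODEL_NOTIFY_*` constants are hypothetical names, the two booleans stand in for the result of `kexec_crash_loaded()` and for a kernel parameter that does not exist today, and `printf` stands in for the regmap accesses.

```c
#include <stdbool.h>
#include <stdio.h>

/* Return codes modelled on the kernel's NOTIFY_DONE / NOTIFY_OK (assumption: only the two are needed here). */
enum { MODEL_NOTIFY_DONE = 0, MODEL_NOTIFY_OK = 1 };

/* Hypothetical policy state for the sketch. */
struct panic_policy {
	bool kdump_loaded;       /* stands in for kexec_crash_loaded() */
	bool report_with_kdump;  /* stands in for a hypothetical kernel parameter */
};

/* The report runs unless kdump is loaded and the switch is off. */
static bool should_run_report(const struct panic_policy *p)
{
	return !p->kdump_loaded || p->report_with_kdump;
}

/* Model of the notifier: skip early or "dump" the double-bit error status. */
static int model_dberr_notifier(const struct panic_policy *p, unsigned int dberror)
{
	if (!should_run_report(p))
		return MODEL_NOTIFY_DONE;

	printf("model: double-bit error status 0x%08x\n", dberror);
	return MODEL_NOTIFY_OK;
}
```

In this model, leaving `report_with_kdump` false reproduces the behaviour of the V3 patch, while setting it true reproduces the current driver. Which default is appropriate is exactly the open question about how risky the code is and how useful the information is.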
Hi Boris and Petr, first of all thanks for your great analysis and
really sorry for the huge delay in my response.

Below I'm pasting the 2 relevant responses from both Petr and Boris.

On 22/11/2022 12:06, Borislav Petkov wrote:
> On Tue, Nov 22, 2022 at 10:33:12AM -0300, Guilherme G. Piccoli wrote:
>
> Leaving in the whole thing for newly added people.
>
>> On 18/09/2022 11:10, Guilherme G. Piccoli wrote:
>>> On 19/08/2022 19:17, Guilherme G. Piccoli wrote:
>>>> The altera_edac panic notifier performs some data collection with
>>>> regards errors detected; such code relies in the regmap layer to
>>>> perform reads/writes, so the code is abstracted and there is some
>>>> risk level to execute that, since the panic path runs in atomic
>>>> context, with interrupts/preemption and secondary CPUs disabled.
>>>>
>>>> Users want the information collected in this panic notifier though,
>>>> so in order to balance the risk/benefit, let's skip the altera panic
>>>> notifier if kdump is loaded. While at it, remove a useless header
>>>> and encompass a macro inside the sole ifdef block it is used.
>>>>
>>>> Cc: Borislav Petkov <bp@alien8.de>
>>>> Cc: Petr Mladek <pmladek@suse.com>
>>>> Cc: Tony Luck <tony.luck@intel.com>
>>>> Acked-by: Dinh Nguyen <dinguyen@kernel.org>
>>>> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
>>>>
>>>> ---
>>>>
>>>> V3:
>>>> - added the ack tag from Dinh - thanks!
>>>> - had a good discussion with Boris about that in V2 [0],
>>>> hopefully we can continue and reach a consensus in this V3.
>>>> [0] https://lore.kernel.org/lkml/46137c67-25b4-6657-33b7-cffdc7afc0d7@igalia.com/
>>>>
>>>> V2:
>>>> - new patch, based on the discussion in [1].
>>>> [1] https://lore.kernel.org/lkml/62a63fc2-346f-f375-043a-fa21385279df@igalia.com/
>>>>
>>>> [...]
>>>
>>> Hi Dinh, Tony, Boris - sorry for the ping.
>>>
>>> Appreciate reviews on this one - Dinh already ACKed the patch but Boris
>>> raised some points in the past version [0], so any opinions or
>>> discussions are welcome!
>>
>>
>> Hi folks, monthly ping heheh
>> Apologies for the re-pings, please let me know if there is anything
>> required to move on this patch.
>
> Looking at this again, I really don't like the sprinkling of
>
> if (kexec_crash_loaded())
>
> in unrelated code. And I still think that the real fix here is to kill
> this
>
> edac->panic_notifier
>
> thing. And replace it with simply logging the error from the double bit
> error interrupt handle. That DBERR IRQ thing altr_edac_a10_irq_handler().
> Because this is what this panic notifier does - dump double-bit errors.
>
> Now, if Dinh doesn't move, I guess we can ask Tony and/or Rabara (he has
> sent a patch for this driver recently and Altera belongs to Intel now)
> to find someone who can test such a change and we (you could give it a
> try first :)) can do that change.
>
> Thx.
>

On 09/12/2022 13:03, Petr Mladek wrote:
> [...]
>
> I have read the discussion about v2 [1] and this looks like a bad
> approach from my POV.
>
> My understanding is that the information provided by this notifier
> could not be found in the crashdump. It means that people really
> want to run this before crashdump in principle.
>
> Of course, there is the question how much safe this code is. I mean
> if the panic() code path might get blocked here.
>
> I see two possibilities.
>
> The best solution would be if we know that this is "always" safe or if
> it can be done a safe way. Then we could keep it as it is or implement
> the safe way.
>
> Alternative solution would be to create a kernel parameter that
> would enable/disable this particular report when kdump is enabled.
> The question would be the default. It would depend on how risky
> the code is and how useful the information is.
>
> [1] https://lore.kernel.org/r/20220719195325.402745-11-gpiccoli@igalia.com
>

So, for me Petr approach is the more straightforward, though we could
rethink the idea of this notifier being...a notifier, as suggest Boris heh

Anyway, what I plan to do is: I'll re-submit a simple clean-up for this
code (header / ifdef stuff), not functional-changing the code path.

After that, when re-submitting the V2 or the notifiers refactor (which
I'm pending for some good months =O ), I'll deal with this code
properly, factoring the ideas and proposing a meaningful change.

So, let's discard this patch for now.
Thanks again,

Guilherme
diff --git a/drivers/edac/altera_edac.c b/drivers/edac/altera_edac.c
index e7e8e624a436..741fe5539154 100644
--- a/drivers/edac/altera_edac.c
+++ b/drivers/edac/altera_edac.c
@@ -16,7 +16,6 @@
 #include <linux/kernel.h>
 #include <linux/mfd/altera-sysmgr.h>
 #include <linux/mfd/syscon.h>
-#include <linux/notifier.h>
 #include <linux/of_address.h>
 #include <linux/of_irq.h>
 #include <linux/of_platform.h>
@@ -24,6 +23,7 @@
 #include <linux/platform_device.h>
 #include <linux/regmap.h>
 #include <linux/types.h>
+#include <linux/kexec.h>
 #include <linux/uaccess.h>
 
 #include "altera_edac.h"
@@ -2063,22 +2063,30 @@ static const struct irq_domain_ops a10_eccmgr_ic_ops = {
 };
 
 /************** Stratix 10 EDAC Double Bit Error Handler ************/
-#define to_a10edac(p, m) container_of(p, struct altr_arria10_edac, m)
-
 #ifdef CONFIG_64BIT
 /* panic routine issues reboot on non-zero panic_timeout */
 extern int panic_timeout;
 
+#define to_a10edac(p, m) container_of(p, struct altr_arria10_edac, m)
+
 /*
  * The double bit error is handled through SError which is fatal. This is
  * called as a panic notifier to printout ECC error info as part of the panic.
+ *
+ * Notice that if kdump is set, we take the risk avoidance approach and
+ * skip the notifier, given that users are expected to have access to a
+ * full vmcore.
  */
 static int s10_edac_dberr_handler(struct notifier_block *this,
 				  unsigned long event, void *ptr)
 {
-	struct altr_arria10_edac *edac = to_a10edac(this, panic_notifier);
+	struct altr_arria10_edac *edac;
 	int err_addr, dberror;
 
+	if (kexec_crash_loaded())
+		return NOTIFY_DONE;
+
+	edac = to_a10edac(this, panic_notifier);
 	regmap_read(edac->ecc_mgr_map, S10_SYSMGR_ECC_INTSTAT_DERR_OFST,
 		    &dberror);
 	regmap_write(edac->ecc_mgr_map, S10_SYSMGR_UE_VAL_OFST, dberror);