diff mbox series

[V3,08/11] EDAC/altera: Skip the panic notifier if kdump is loaded

Message ID 20220819221731.480795-9-gpiccoli@igalia.com (mailing list archive)
State New, archived
Headers show
Series The panic notifiers refactor - fixes/clean-ups (V3) | expand

Commit Message

Guilherme G. Piccoli Aug. 19, 2022, 10:17 p.m. UTC
The altera_edac panic notifier performs some data collection with
regards errors detected; such code relies in the regmap layer to
perform reads/writes, so the code is abstracted and there is some
risk level to execute that, since the panic path runs in atomic
context, with interrupts/preemption and secondary CPUs disabled.

Users want the information collected in this panic notifier though,
so in order to balance the risk/benefit, let's skip the altera panic
notifier if kdump is loaded. While at it, remove a useless header
and encompass a macro inside the sole ifdef block it is used.

Cc: Borislav Petkov <bp@alien8.de>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Tony Luck <tony.luck@intel.com>
Acked-by: Dinh Nguyen <dinguyen@kernel.org>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>

---

V3:
- added the ack tag from Dinh - thanks!
- had a good discussion with Boris about that in V2 [0],
hopefully we can continue and reach a consensus in this V3.
[0] https://lore.kernel.org/lkml/46137c67-25b4-6657-33b7-cffdc7afc0d7@igalia.com/

V2:
- new patch, based on the discussion in [1].
[1] https://lore.kernel.org/lkml/62a63fc2-346f-f375-043a-fa21385279df@igalia.com/


 drivers/edac/altera_edac.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

Comments

Guilherme G. Piccoli Sept. 18, 2022, 2:10 p.m. UTC | #1
On 19/08/2022 19:17, Guilherme G. Piccoli wrote:
> The altera_edac panic notifier performs some data collection with
> regards errors detected; such code relies in the regmap layer to
> perform reads/writes, so the code is abstracted and there is some
> risk level to execute that, since the panic path runs in atomic
> context, with interrupts/preemption and secondary CPUs disabled.
> 
> Users want the information collected in this panic notifier though,
> so in order to balance the risk/benefit, let's skip the altera panic
> notifier if kdump is loaded. While at it, remove a useless header
> and encompass a macro inside the sole ifdef block it is used.
> 
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Petr Mladek <pmladek@suse.com>
> Cc: Tony Luck <tony.luck@intel.com>
> Acked-by: Dinh Nguyen <dinguyen@kernel.org>
> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
> 
> ---
> 
> V3:
> - added the ack tag from Dinh - thanks!
> - had a good discussion with Boris about that in V2 [0],
> hopefully we can continue and reach a consensus in this V3.
> [0] https://lore.kernel.org/lkml/46137c67-25b4-6657-33b7-cffdc7afc0d7@igalia.com/
> 
> V2:
> - new patch, based on the discussion in [1].
> [1] https://lore.kernel.org/lkml/62a63fc2-346f-f375-043a-fa21385279df@igalia.com/
> 
> [...]

Hi Dinh, Tony, Boris - sorry for the ping.

Appreciate reviews on this one - Dinh already ACKed the patch but Boris
raised some points in the past version [0], so any opinions or
discussions are welcome!

Thanks,


Guilherme


[0]
https://lore.kernel.org/lkml/46137c67-25b4-6657-33b7-cffdc7afc0d7@igalia.com/
Guilherme G. Piccoli Oct. 17, 2022, 2:05 p.m. UTC | #2
On 18/09/2022 11:10, Guilherme G. Piccoli wrote:
> On 19/08/2022 19:17, Guilherme G. Piccoli wrote:
>> The altera_edac panic notifier performs some data collection with
>> regards errors detected; such code relies in the regmap layer to
>> perform reads/writes, so the code is abstracted and there is some
>> risk level to execute that, since the panic path runs in atomic
>> context, with interrupts/preemption and secondary CPUs disabled.
>>
>> Users want the information collected in this panic notifier though,
>> so in order to balance the risk/benefit, let's skip the altera panic
>> notifier if kdump is loaded. While at it, remove a useless header
>> and encompass a macro inside the sole ifdef block it is used.
>>
>> Cc: Borislav Petkov <bp@alien8.de>
>> Cc: Petr Mladek <pmladek@suse.com>
>> Cc: Tony Luck <tony.luck@intel.com>
>> Acked-by: Dinh Nguyen <dinguyen@kernel.org>
>> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
>>
>> ---
>>
>> V3:
>> - added the ack tag from Dinh - thanks!
>> - had a good discussion with Boris about that in V2 [0],
>> hopefully we can continue and reach a consensus in this V3.
>> [0] https://lore.kernel.org/lkml/46137c67-25b4-6657-33b7-cffdc7afc0d7@igalia.com/
>>
>> V2:
>> - new patch, based on the discussion in [1].
>> [1] https://lore.kernel.org/lkml/62a63fc2-346f-f375-043a-fa21385279df@igalia.com/
>>
>> [...]
> 
> Hi Dinh, Tony, Boris - sorry for the ping.

Hey folks, apologies for the new ping.

Is there anything to improve here maybe? Reviews / opinions are very
appreciated!
Cheers,


Guilherme
Guilherme G. Piccoli Nov. 22, 2022, 1:33 p.m. UTC | #3
On 18/09/2022 11:10, Guilherme G. Piccoli wrote:
> On 19/08/2022 19:17, Guilherme G. Piccoli wrote:
>> The altera_edac panic notifier performs some data collection with
>> regards errors detected; such code relies in the regmap layer to
>> perform reads/writes, so the code is abstracted and there is some
>> risk level to execute that, since the panic path runs in atomic
>> context, with interrupts/preemption and secondary CPUs disabled.
>>
>> Users want the information collected in this panic notifier though,
>> so in order to balance the risk/benefit, let's skip the altera panic
>> notifier if kdump is loaded. While at it, remove a useless header
>> and encompass a macro inside the sole ifdef block it is used.
>>
>> Cc: Borislav Petkov <bp@alien8.de>
>> Cc: Petr Mladek <pmladek@suse.com>
>> Cc: Tony Luck <tony.luck@intel.com>
>> Acked-by: Dinh Nguyen <dinguyen@kernel.org>
>> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
>>
>> ---
>>
>> V3:
>> - added the ack tag from Dinh - thanks!
>> - had a good discussion with Boris about that in V2 [0],
>> hopefully we can continue and reach a consensus in this V3.
>> [0] https://lore.kernel.org/lkml/46137c67-25b4-6657-33b7-cffdc7afc0d7@igalia.com/
>>
>> V2:
>> - new patch, based on the discussion in [1].
>> [1] https://lore.kernel.org/lkml/62a63fc2-346f-f375-043a-fa21385279df@igalia.com/
>>
>> [...]
> 
> Hi Dinh, Tony, Boris - sorry for the ping.
> 
> Appreciate reviews on this one - Dinh already ACKed the patch but Boris
> raised some points in the past version [0], so any opinions or
> discussions are welcome!


Hi folks, monthly ping heheh
Apologies for the re-pings, please let me know if there is anything
required to move on this patch.

Cheers,


Guilherme


P.S. I've been trimming the huge CC list in the series, done it here as
well.
Borislav Petkov Nov. 22, 2022, 3:06 p.m. UTC | #4
On Tue, Nov 22, 2022 at 10:33:12AM -0300, Guilherme G. Piccoli wrote:

Leaving in the whole thing for newly added people.

> On 18/09/2022 11:10, Guilherme G. Piccoli wrote:
> > On 19/08/2022 19:17, Guilherme G. Piccoli wrote:
> >> The altera_edac panic notifier performs some data collection with
> >> regards errors detected; such code relies in the regmap layer to
> >> perform reads/writes, so the code is abstracted and there is some
> >> risk level to execute that, since the panic path runs in atomic
> >> context, with interrupts/preemption and secondary CPUs disabled.
> >>
> >> Users want the information collected in this panic notifier though,
> >> so in order to balance the risk/benefit, let's skip the altera panic
> >> notifier if kdump is loaded. While at it, remove a useless header
> >> and encompass a macro inside the sole ifdef block it is used.
> >>
> >> Cc: Borislav Petkov <bp@alien8.de>
> >> Cc: Petr Mladek <pmladek@suse.com>
> >> Cc: Tony Luck <tony.luck@intel.com>
> >> Acked-by: Dinh Nguyen <dinguyen@kernel.org>
> >> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
> >>
> >> ---
> >>
> >> V3:
> >> - added the ack tag from Dinh - thanks!
> >> - had a good discussion with Boris about that in V2 [0],
> >> hopefully we can continue and reach a consensus in this V3.
> >> [0] https://lore.kernel.org/lkml/46137c67-25b4-6657-33b7-cffdc7afc0d7@igalia.com/
> >>
> >> V2:
> >> - new patch, based on the discussion in [1].
> >> [1] https://lore.kernel.org/lkml/62a63fc2-346f-f375-043a-fa21385279df@igalia.com/
> >>
> >> [...]
> > 
> > Hi Dinh, Tony, Boris - sorry for the ping.
> > 
> > Appreciate reviews on this one - Dinh already ACKed the patch but Boris
> > raised some points in the past version [0], so any opinions or
> > discussions are welcome!
> 
> 
> Hi folks, monthly ping heheh
> Apologies for the re-pings, please let me know if there is anything
> required to move on this patch.

Looking at this again, I really don't like the sprinkling of

	if (kexec_crash_loaded())

in unrelated code. And I still think that the real fix here is to kill
this

	edac->panic_notifier

thing. And replace it with simply logging the error from the double bit
error interrupt handle. That DBERR IRQ thing altr_edac_a10_irq_handler().
Because this is what this panic notifier does - dump double-bit errors.

Now, if Dinh doesn't move, I guess we can ask Tony and/or Rabara (he has
sent a patch for this driver recently and Altera belongs to Intel now)
to find someone who can test such a change and we (you could give it a
try first :)) can do that change.

Thx.
Petr Mladek Dec. 9, 2022, 4:03 p.m. UTC | #5
On Fri 2022-08-19 19:17:28, Guilherme G. Piccoli wrote:
> The altera_edac panic notifier performs some data collection with
> regards errors detected; such code relies in the regmap layer to
> perform reads/writes, so the code is abstracted and there is some
> risk level to execute that, since the panic path runs in atomic
> context, with interrupts/preemption and secondary CPUs disabled.
> 
> Users want the information collected in this panic notifier though,
> so in order to balance the risk/benefit, let's skip the altera panic
> notifier if kdump is loaded. While at it, remove a useless header
> and encompass a macro inside the sole ifdef block it is used.
> 
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Petr Mladek <pmladek@suse.com>
> Cc: Tony Luck <tony.luck@intel.com>
> Acked-by: Dinh Nguyen <dinguyen@kernel.org>
> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
> 
> ---
> 
> V3:
> - added the ack tag from Dinh - thanks!
> - had a good discussion with Boris about that in V2 [0],
> hopefully we can continue and reach a consensus in this V3.
> [0] https://lore.kernel.org/lkml/46137c67-25b4-6657-33b7-cffdc7afc0d7@igalia.com/
> 
> V2:
> - new patch, based on the discussion in [1].
> [1] https://lore.kernel.org/lkml/62a63fc2-346f-f375-043a-fa21385279df@igalia.com/
> 
> 
>  drivers/edac/altera_edac.c | 16 ++++++++++++----
>  1 file changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/edac/altera_edac.c b/drivers/edac/altera_edac.c
> index e7e8e624a436..741fe5539154 100644
> --- a/drivers/edac/altera_edac.c
> +++ b/drivers/edac/altera_edac.c
> @@ -16,7 +16,6 @@
>  #include <linux/kernel.h>
>  #include <linux/mfd/altera-sysmgr.h>
>  #include <linux/mfd/syscon.h>
> -#include <linux/notifier.h>
>  #include <linux/of_address.h>
>  #include <linux/of_irq.h>
>  #include <linux/of_platform.h>
> @@ -24,6 +23,7 @@
>  #include <linux/platform_device.h>
>  #include <linux/regmap.h>
>  #include <linux/types.h>
> +#include <linux/kexec.h>
>  #include <linux/uaccess.h>
>  
>  #include "altera_edac.h"
> @@ -2063,22 +2063,30 @@ static const struct irq_domain_ops a10_eccmgr_ic_ops = {
>  };
>  
>  /************** Stratix 10 EDAC Double Bit Error Handler ************/
> -#define to_a10edac(p, m) container_of(p, struct altr_arria10_edac, m)
> -
>  #ifdef CONFIG_64BIT
>  /* panic routine issues reboot on non-zero panic_timeout */
>  extern int panic_timeout;
>  
> +#define to_a10edac(p, m) container_of(p, struct altr_arria10_edac, m)
> +
>  /*
>   * The double bit error is handled through SError which is fatal. This is
>   * called as a panic notifier to printout ECC error info as part of the panic.
> + *
> + * Notice that if kdump is set, we take the risk avoidance approach and
> + * skip the notifier, given that users are expected to have access to a
> + * full vmcore.
>   */
>  static int s10_edac_dberr_handler(struct notifier_block *this,
>  				  unsigned long event, void *ptr)
>  {
> -	struct altr_arria10_edac *edac = to_a10edac(this, panic_notifier);
> +	struct altr_arria10_edac *edac;
>  	int err_addr, dberror;
>  
> +	if (kexec_crash_loaded())
> +		return NOTIFY_DONE;

I have read the discussion about v2 [1] and this looks like a bad
approach from my POV.

My understanding is that the information provided by this notifier
could not be found in the crashdump. It means that people really
want to run this before crashdump in principle.

Of course, there is the question how much safe this code is. I mean
if the panic() code path might get blocked here.

I see two possibilities.

The best solution would be if we know that this is "always" safe or if
it can be done a safe way. Then we could keep it as it is or implement
the safe way.

Alternative solution would be to create a kernel parameter that
would enable/disable this particular report when kdump is enabled.
The question would be the default. It would depend on how risky
the code is and how useful the information is.

[1] https://lore.kernel.org/r/20220719195325.402745-11-gpiccoli@igalia.com

> +	edac = to_a10edac(this, panic_notifier);
>  	regmap_read(edac->ecc_mgr_map, S10_SYSMGR_ECC_INTSTAT_DERR_OFST,
>  		    &dberror);
>  	regmap_write(edac->ecc_mgr_map, S10_SYSMGR_UE_VAL_OFST, dberror);

Best Regards,
Petr
Guilherme G. Piccoli Feb. 10, 2023, 4:01 p.m. UTC | #6
Hi Boris and Petr, first of all thanks for your great analysis and
really sorry for the huge delay in my response.

Below I'm pasting the 2 relevant responses from both Petr and Boris.


On 22/11/2022 12:06, Borislav Petkov wrote:
> On Tue, Nov 22, 2022 at 10:33:12AM -0300, Guilherme G. Piccoli wrote:
> 
> Leaving in the whole thing for newly added people.
> 
>> On 18/09/2022 11:10, Guilherme G. Piccoli wrote:
>>> On 19/08/2022 19:17, Guilherme G. Piccoli wrote:
>>>> The altera_edac panic notifier performs some data collection with
>>>> regards errors detected; such code relies in the regmap layer to
>>>> perform reads/writes, so the code is abstracted and there is some
>>>> risk level to execute that, since the panic path runs in atomic
>>>> context, with interrupts/preemption and secondary CPUs disabled.
>>>>
>>>> Users want the information collected in this panic notifier though,
>>>> so in order to balance the risk/benefit, let's skip the altera panic
>>>> notifier if kdump is loaded. While at it, remove a useless header
>>>> and encompass a macro inside the sole ifdef block it is used.
>>>>
>>>> Cc: Borislav Petkov <bp@alien8.de>
>>>> Cc: Petr Mladek <pmladek@suse.com>
>>>> Cc: Tony Luck <tony.luck@intel.com>
>>>> Acked-by: Dinh Nguyen <dinguyen@kernel.org>
>>>> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
>>>>
>>>> ---
>>>>
>>>> V3:
>>>> - added the ack tag from Dinh - thanks!
>>>> - had a good discussion with Boris about that in V2 [0],
>>>> hopefully we can continue and reach a consensus in this V3.
>>>> [0] https://lore.kernel.org/lkml/46137c67-25b4-6657-33b7-cffdc7afc0d7@igalia.com/
>>>>
>>>> V2:
>>>> - new patch, based on the discussion in [1].
>>>> [1] https://lore.kernel.org/lkml/62a63fc2-346f-f375-043a-fa21385279df@igalia.com/
>>>>
>>>> [...]
>>>
>>> Hi Dinh, Tony, Boris - sorry for the ping.
>>>
>>> Appreciate reviews on this one - Dinh already ACKed the patch but Boris
>>> raised some points in the past version [0], so any opinions or
>>> discussions are welcome!
>>
>>
>> Hi folks, monthly ping heheh
>> Apologies for the re-pings, please let me know if there is anything
>> required to move on this patch.
> 
> Looking at this again, I really don't like the sprinkling of
> 
> 	if (kexec_crash_loaded())
> 
> in unrelated code. And I still think that the real fix here is to kill
> this
> 
> 	edac->panic_notifier
> 
> thing. And replace it with simply logging the error from the double bit
> error interrupt handle. That DBERR IRQ thing altr_edac_a10_irq_handler().
> Because this is what this panic notifier does - dump double-bit errors.
> 
> Now, if Dinh doesn't move, I guess we can ask Tony and/or Rabara (he has
> sent a patch for this driver recently and Altera belongs to Intel now)
> to find someone who can test such a change and we (you could give it a
> try first :)) can do that change.
> 
> Thx.
> 

On 09/12/2022 13:03, Petr Mladek wrote:> [...]>
> I have read the discussion about v2 [1] and this looks like a bad
> approach from my POV.
>
> My understanding is that the information provided by this notifier
> could not be found in the crashdump. It means that people really
> want to run this before crashdump in principle.
>
> Of course, there is the question how much safe this code is. I mean
> if the panic() code path might get blocked here.
>
> I see two possibilities.
>
> The best solution would be if we know that this is "always" safe or if
> it can be done a safe way. Then we could keep it as it is or implement
> the safe way.
>
> Alternative solution would be to create a kernel parameter that
> would enable/disable this particular report when kdump is enabled.
> The question would be the default. It would depend on how risky
> the code is and how useful the information is.
>
> [1] https://lore.kernel.org/r/20220719195325.402745-11-gpiccoli@igalia.com
>


So, for me Petr approach is the more straightforward, though we could
rethink the idea of this notifier being...a notifier, as suggest Boris heh

Anyway, what I plan to do is: I'll re-submit a simple clean-up for this
code (header / ifdef stuff), not functional-changing the code path.

After that, when re-submitting the V2 or the notifiers refactor (which
I'm pending for some good months =O ), I'll deal with this code
properly, factoring the ideas and proposing a meaningful change.

So, let's discard this patch for now.
Thanks again,


Guilherme
diff mbox series

Patch

diff --git a/drivers/edac/altera_edac.c b/drivers/edac/altera_edac.c
index e7e8e624a436..741fe5539154 100644
--- a/drivers/edac/altera_edac.c
+++ b/drivers/edac/altera_edac.c
@@ -16,7 +16,6 @@ 
 #include <linux/kernel.h>
 #include <linux/mfd/altera-sysmgr.h>
 #include <linux/mfd/syscon.h>
-#include <linux/notifier.h>
 #include <linux/of_address.h>
 #include <linux/of_irq.h>
 #include <linux/of_platform.h>
@@ -24,6 +23,7 @@ 
 #include <linux/platform_device.h>
 #include <linux/regmap.h>
 #include <linux/types.h>
+#include <linux/kexec.h>
 #include <linux/uaccess.h>
 
 #include "altera_edac.h"
@@ -2063,22 +2063,30 @@  static const struct irq_domain_ops a10_eccmgr_ic_ops = {
 };
 
 /************** Stratix 10 EDAC Double Bit Error Handler ************/
-#define to_a10edac(p, m) container_of(p, struct altr_arria10_edac, m)
-
 #ifdef CONFIG_64BIT
 /* panic routine issues reboot on non-zero panic_timeout */
 extern int panic_timeout;
 
+#define to_a10edac(p, m) container_of(p, struct altr_arria10_edac, m)
+
 /*
  * The double bit error is handled through SError which is fatal. This is
  * called as a panic notifier to printout ECC error info as part of the panic.
+ *
+ * Notice that if kdump is set, we take the risk avoidance approach and
+ * skip the notifier, given that users are expected to have access to a
+ * full vmcore.
  */
 static int s10_edac_dberr_handler(struct notifier_block *this,
 				  unsigned long event, void *ptr)
 {
-	struct altr_arria10_edac *edac = to_a10edac(this, panic_notifier);
+	struct altr_arria10_edac *edac;
 	int err_addr, dberror;
 
+	if (kexec_crash_loaded())
+		return NOTIFY_DONE;
+
+	edac = to_a10edac(this, panic_notifier);
 	regmap_read(edac->ecc_mgr_map, S10_SYSMGR_ECC_INTSTAT_DERR_OFST,
 		    &dberror);
 	regmap_write(edac->ecc_mgr_map, S10_SYSMGR_UE_VAL_OFST, dberror);