diff mbox

[v2,7/8] bus: brcmstb_gisb: add notifier handling

Message ID 20170328213431.10904-8-opendmb@gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

Doug Berger March 28, 2017, 9:34 p.m. UTC
Check for GISB arbitration errors through a chained notifier
when a process dies or a kernel panic occurs.  This allows a
meaningful diagnostic message to occur along with other
diagnostic information.

Notably writes to a bad GISB address on an ARM64 architecture
kernel cause unrecoverable SError aborts and this allows the
cause of the abort to be seen.

Signed-off-by: Doug Berger <opendmb@gmail.com>
---
 drivers/bus/brcmstb_gisb.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

Comments

Mark Rutland March 29, 2017, 10:13 a.m. UTC | #1
Hi Doug,

On Tue, Mar 28, 2017 at 02:34:30PM -0700, Doug Berger wrote:
> Check for GISB arbitration errors through a chained notifier
> when a process dies or a kernel panic occurs.  This allows a
> meaningful diagnostic message to occur along with other
> diagnostic information.
> 
> Notably writes to a bad GISB address on an ARM64 architecture
> kernel cause unrecoverable SError aborts and this allows the
> cause of the abort to be seen.
> 
> Signed-off-by: Doug Berger <opendmb@gmail.com>

Thanks for changing this.

> ---
>  drivers/bus/brcmstb_gisb.c | 27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/drivers/bus/brcmstb_gisb.c b/drivers/bus/brcmstb_gisb.c
> index 500b6bb5c739..774729002b8c 100644
> --- a/drivers/bus/brcmstb_gisb.c
> +++ b/drivers/bus/brcmstb_gisb.c
> @@ -24,6 +24,8 @@
>  #include <linux/of.h>
>  #include <linux/bitops.h>
>  #include <linux/pm.h>
> +#include <linux/kernel.h>
> +#include <linux/kdebug.h>
>  
>  #ifdef CONFIG_ARM
>  #include <asm/bug.h>
> @@ -290,6 +292,25 @@ static irqreturn_t brcmstb_gisb_tea_handler(int irq, void *dev_id)
>  	return IRQ_HANDLED;
>  }
>  
> +/*
> + * Dump out gisb errors on die or panic.
> + */
> +static int dump_gisb_error(struct notifier_block *self, unsigned long v,
> +			   void *p)
> +{
> +	struct brcmstb_gisb_arb_device *gdev;
> +
> +	/* iterate over each GISB arb registered handlers */
> +	list_for_each_entry(gdev, &brcmstb_gisb_arb_device_list, next)
> +		brcmstb_gisb_arb_decode_addr(gdev, "async abort");
> +
> +	return 0;

I think this should be NOTIFY_OK.

> +}
> +
> +static struct notifier_block gisb_error_notifier = {
> +	.notifier_call = dump_gisb_error,
> +};
> +
>  static DEVICE_ATTR(gisb_arb_timeout, S_IWUSR | S_IRUGO,
>  		gisb_arb_get_timeout, gisb_arb_set_timeout);
>  
> @@ -408,6 +429,12 @@ static int __init brcmstb_gisb_arb_probe(struct platform_device *pdev)
>  	board_be_handler = brcmstb_bus_error_handler;
>  #endif
>  
> +	if (list_is_singular(&brcmstb_gisb_arb_device_list)) {
> +		register_die_notifier(&gisb_error_notifier);
> +		atomic_notifier_chain_register(&panic_notifier_list,
> +					       &gisb_error_notifier);

I don't think this is quite right. A notifier_block can only be
registered to one notifier chain at a time, and this has the potential
to corrupt both chains.

I also think you only need to register the panic notifier. An SError
should always result in a panic.

Thanks,
Mark.
Doug Berger March 29, 2017, 5:39 p.m. UTC | #2
On 03/29/2017 03:13 AM, Mark Rutland wrote:

snip

>> +/*
>> + * Dump out gisb errors on die or panic.
>> + */
>> +static int dump_gisb_error(struct notifier_block *self, unsigned long v,
>> +			   void *p)
>> +{
>> +	struct brcmstb_gisb_arb_device *gdev;
>> +
>> +	/* iterate over each GISB arb registered handlers */
>> +	list_for_each_entry(gdev, &brcmstb_gisb_arb_device_list, next)
>> +		brcmstb_gisb_arb_decode_addr(gdev, "async abort");
>> +
>> +	return 0;
> 
> I think this should be NOTIFY_OK.
> 
I used dump_mem_limit() as a template and didn't catch this (work to
do...).  Upon review I think I would prefer NOTIFY_DONE since this call
is opportunistic (i.e. it is taking the opportunity to check whether
additional diagnostic data is available to display) and has no interest
in affecting the overall handling of the event.

>> +}
>> +
>> +static struct notifier_block gisb_error_notifier = {
>> +	.notifier_call = dump_gisb_error,
>> +};
>> +
>>  static DEVICE_ATTR(gisb_arb_timeout, S_IWUSR | S_IRUGO,
>>  		gisb_arb_get_timeout, gisb_arb_set_timeout);
>>  
>> @@ -408,6 +429,12 @@ static int __init brcmstb_gisb_arb_probe(struct platform_device *pdev)
>>  	board_be_handler = brcmstb_bus_error_handler;
>>  #endif
>>  
>> +	if (list_is_singular(&brcmstb_gisb_arb_device_list)) {
>> +		register_die_notifier(&gisb_error_notifier);
>> +		atomic_notifier_chain_register(&panic_notifier_list,
>> +					       &gisb_error_notifier);
> 
> I don't think this is quite right. A notifier_block can only be
> registered to one notifier chain at a time, and this has the potential
> to corrupt both chains.
> 
A VERY good point thanks for pointing this out.

> I also think you only need to register the panic notifier. An SError
> should always result in a panic.
> 
That was my initial thought as well.  However, testing revealed that the
bad mode Oops actually exits the user space process and doesn't reach
the panic so there was no helpful diagnostic message.  This may be in
line with your comments about insufficient fatality of failures in PATCH
v2 6/8, but it actually is more in line with our desired behavior for
the aborted write.  Setting the notify on die gave us the result we are
looking for, but as noted above I should have created a separate notifier.

I had hoped that the same approach (i.e. die notifier) would remove the
need for PATCH v2 6/8 as well, but I found that the Unhandled fault
error didn't actually die from user mode.

> Thanks,
> Mark.
> 
Thank you for your help with this,
    Doug
Mark Rutland March 29, 2017, 6:17 p.m. UTC | #3
On Wed, Mar 29, 2017 at 10:39:11AM -0700, Doug Berger wrote:
> On 03/29/2017 03:13 AM, Mark Rutland wrote:

> >> +static int dump_gisb_error(struct notifier_block *self, unsigned long v,
> >> +			   void *p)

> >> +	return 0;
> > 
> > I think this should be NOTIFY_OK.
> > 
> I used dump_mem_limit() as a template and didn't catch this (work to
> do...).  Upon review I think I would prefer NOTIFY_DONE since this call
> is opportunistic (i.e. it is taking the opportunity to check whether
> additional diagnostic data is available to display) and has no interest
> in affecting the overall handling of the event.

That's fine by me.

Does the distinction matter here?

Most notifer users treat NOTIFY_OK and NOTIFY_DONE as equivalent, and
notifier_call_chain only terminates when it sees NOTIFY_STOP_MASK.

[...]

> >> +	if (list_is_singular(&brcmstb_gisb_arb_device_list)) {
> >> +		register_die_notifier(&gisb_error_notifier);
> >> +		atomic_notifier_chain_register(&panic_notifier_list,
> >> +					       &gisb_error_notifier);
> > 
> > I don't think this is quite right. A notifier_block can only be
> > registered to one notifier chain at a time, and this has the potential
> > to corrupt both chains.
> > 
> A VERY good point thanks for pointing this out.
> 
> > I also think you only need to register the panic notifier. An SError
> > should always result in a panic.
> > 
> That was my initial thought as well.  However, testing revealed that the
> bad mode Oops actually exits the user space process and doesn't reach
> the panic so there was no helpful diagnostic message.  This may be in
> line with your comments about insufficient fatality of failures in PATCH
> v2 6/8, but it actually is more in line with our desired behavior for
> the aborted write.  Setting the notify on die gave us the result we are
> looking for, but as noted above I should have created a separate notifier.
> 
> I had hoped that the same approach (i.e. die notifier) would remove the
> need for PATCH v2 6/8 as well, but I found that the Unhandled fault
> error didn't actually die from user mode.

In my mind it's a bug that we don't treat those errors more fatally.

I'll try to dig into that.

Thanks,
Mark.
diff mbox

Patch

diff --git a/drivers/bus/brcmstb_gisb.c b/drivers/bus/brcmstb_gisb.c
index 500b6bb5c739..774729002b8c 100644
--- a/drivers/bus/brcmstb_gisb.c
+++ b/drivers/bus/brcmstb_gisb.c
@@ -24,6 +24,8 @@ 
 #include <linux/of.h>
 #include <linux/bitops.h>
 #include <linux/pm.h>
+#include <linux/kernel.h>
+#include <linux/kdebug.h>
 
 #ifdef CONFIG_ARM
 #include <asm/bug.h>
@@ -290,6 +292,25 @@  static irqreturn_t brcmstb_gisb_tea_handler(int irq, void *dev_id)
 	return IRQ_HANDLED;
 }
 
+/*
+ * Dump out gisb errors on die or panic.
+ */
+static int dump_gisb_error(struct notifier_block *self, unsigned long v,
+			   void *p)
+{
+	struct brcmstb_gisb_arb_device *gdev;
+
+	/* iterate over each GISB arb registered handlers */
+	list_for_each_entry(gdev, &brcmstb_gisb_arb_device_list, next)
+		brcmstb_gisb_arb_decode_addr(gdev, "async abort");
+
+	return 0;
+}
+
+static struct notifier_block gisb_error_notifier = {
+	.notifier_call = dump_gisb_error,
+};
+
 static DEVICE_ATTR(gisb_arb_timeout, S_IWUSR | S_IRUGO,
 		gisb_arb_get_timeout, gisb_arb_set_timeout);
 
@@ -408,6 +429,12 @@  static int __init brcmstb_gisb_arb_probe(struct platform_device *pdev)
 	board_be_handler = brcmstb_bus_error_handler;
 #endif
 
+	if (list_is_singular(&brcmstb_gisb_arb_device_list)) {
+		register_die_notifier(&gisb_error_notifier);
+		atomic_notifier_chain_register(&panic_notifier_list,
+					       &gisb_error_notifier);
+	}
+
 	dev_info(&pdev->dev, "registered mem: %p, irqs: %d, %d\n",
 			gdev->base, timeout_irq, tea_irq);