diff mbox

ARM: mm: add imprecise abort non-deadly handler

Message ID 1391797214-17142-2-git-send-email-ben.dooks@codethink.co.uk (mailing list archive)
State New, archived
Headers show

Commit Message

Ben Dooks Feb. 7, 2014, 6:20 p.m. UTC
Given that imprecise aborts may be delivered after the action that
caused them (or even for non-cpu related activities such as bridge
faults from a bus-master) it is possible that the wrong process is
terminated as a result.

It is not know at this time in an SMP system which cores get notified
of an imprecise external abort, we have yet to find the right details
in the architecture reference manuals. This also means that killing
the process is probably the wrong thing to do on reception of these aborts.

Add a handler to take and print imprecise aborts and allow the process
to continue. This should ensure that the abort is shown but not kill
the process that was running on the cpu core at the time.

Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
---
 arch/arm/mm/fault.c      | 15 +++++++++++++++
 arch/arm/mm/fsr-2level.c | 32 ++++++++++++++++----------------
 2 files changed, 31 insertions(+), 16 deletions(-)

Comments

Dave Martin Feb. 10, 2014, 2:37 p.m. UTC | #1
On Fri, Feb 07, 2014 at 06:20:14PM +0000, Ben Dooks wrote:
> Given that imprecise aborts may be delivered after the action that
> caused them (or even for non-cpu related activities such as bridge
> faults from a bus-master) it is possible that the wrong process is
> terminated as a result.
> 
> It is not know at this time in an SMP system which cores get notified
> of an imprecise external abort, we have yet to find the right details
> in the architecture reference manuals. This also means that killing
> the process is probably the wrong thing to do on reception of these aborts.
> 
> Add a handler to take and print imprecise aborts and allow the process
> to continue. This should ensure that the abort is shown but not kill
> the process that was running on the cpu core at the time.

Not treating these as thread-specific faults seems correct, since we
never have a way to map these aborts back to the culprit ... except that
there is a likelihood the culprit is still running when the abort fires.


"Spurious" imprecise aborts pretty much always indicate a hardware error
or a nasty bug somewhere.

Another cause is badly implemented, buggy or malicious userspace software
being given more exotic mmaps that it is qualified to deal with
responsibly.  That's a nasty bug in the distro maintainer / system
administrator / vendor.

So, I think this should be at least KERN_ERROR; maybe KERN_CRIT or above.
We must not encourage people to think that these aborts are somehow
benign.

If we really want people to fix their bugs, it may be worth considering
panic(), or doing this when some threshold is reached.  This may be a
bit harsh though, at least without some threshold.

Cheers
---Dave
Ben Dooks Feb. 10, 2014, 5:25 p.m. UTC | #2
On 10/02/14 14:37, Dave Martin wrote:
> On Fri, Feb 07, 2014 at 06:20:14PM +0000, Ben Dooks wrote:
>> Given that imprecise aborts may be delivered after the action that
>> caused them (or even for non-cpu related activities such as bridge
>> faults from a bus-master) it is possible that the wrong process is
>> terminated as a result.
>>
>> It is not know at this time in an SMP system which cores get notified
>> of an imprecise external abort, we have yet to find the right details
>> in the architecture reference manuals. This also means that killing
>> the process is probably the wrong thing to do on reception of these aborts.
>>
>> Add a handler to take and print imprecise aborts and allow the process
>> to continue. This should ensure that the abort is shown but not kill
>> the process that was running on the cpu core at the time.
>
> Not treating these as thread-specific faults seems correct, since we
> never have a way to map these aborts back to the culprit ... except that
> there is a likelihood the culprit is still running when the abort fires.
>
>
> "Spurious" imprecise aborts pretty much always indicate a hardware error
> or a nasty bug somewhere.

I need to find out where the one we are catching is coming from in our
system.

> Another cause is badly implemented, buggy or malicious userspace software
> being given more exotic mmaps that it is qualified to deal with
> responsibly.  That's a nasty bug in the distro maintainer / system
> administrator / vendor.
>
> So, I think this should be at least KERN_ERROR; maybe KERN_CRIT or above.
> We must not encourage people to think that these aborts are somehow
> benign.

Ok, KERN_ERROR or KERN_CRIT sound reasonable.

> If we really want people to fix their bugs, it may be worth considering
> panic(), or doing this when some threshold is reached.  This may be a
> bit harsh though, at least without some threshold.

I was considering also firing a WARN_ON(abort_count++ > 10) or something
similar.
Dave Martin Feb. 11, 2014, 3:43 p.m. UTC | #3
On Mon, Feb 10, 2014 at 05:25:15PM +0000, Ben Dooks wrote:
> On 10/02/14 14:37, Dave Martin wrote:

[...]

> >"Spurious" imprecise aborts pretty much always indicate a hardware error
> >or a nasty bug somewhere.
> 
> I need to find out where the one we are catching is coming from in our
> system.

Would certainly be good to know...

> >Another cause is badly implemented, buggy or malicious userspace software
> >being given more exotic mmaps that it is qualified to deal with
> >responsibly.  That's a nasty bug in the distro maintainer / system
> >administrator / vendor.
> >
> >So, I think this should be at least KERN_ERROR; maybe KERN_CRIT or above.
> >We must not encourage people to think that these aborts are somehow
> >benign.
> 
> Ok, KERN_ERROR or KERN_CRIT sound reasonable.
> 
> >If we really want people to fix their bugs, it may be worth considering
> >panic(), or doing this when some threshold is reached.  This may be a
> >bit harsh though, at least without some threshold.
> 
> I was considering also firing a WARN_ON(abort_count++ > 10) or something
> similar.

Maybe WARN_ON the first such abort, and ratelimit somehow after?

Being noisy is good -- it's just killing a random, possibly-innocent task
that's not so good.

Cheers
---Dave
diff mbox

Patch

diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index eb8830a..c1fd5ba 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -512,6 +512,21 @@  do_bad(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	return 1;
 }
 
+/*
+ * On an imprecise external abort it is possible that the currently running
+ * process did not cause it (it could be from an external bus bridge or
+ * another device causing a fault on the bus).
+ *
+ * Always return handled as we do not know how to do it and killing the
+ * current process may not prevent future faults.
+ */
+static int
+do_bad_imprecise(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
+{
+	printk(KERN_INFO "Caught imprecise abort (0x%03x) %08lx", fsr, addr);
+	return 0;
+}
+
 struct fsr_info {
 	int	(*fn)(unsigned long addr, unsigned int fsr, struct pt_regs *regs);
 	int	sig;
diff --git a/arch/arm/mm/fsr-2level.c b/arch/arm/mm/fsr-2level.c
index 18ca74c..8f5ef60 100644
--- a/arch/arm/mm/fsr-2level.c
+++ b/arch/arm/mm/fsr-2level.c
@@ -24,22 +24,22 @@  static struct fsr_info fsr_info[] = {
 	 * 10 of the FSR, and may not be recoverable.  These are only
 	 * supported if the CPU abort handler supports bit 10.
 	 */
-	{ do_bad,		SIGBUS,  0,		"unknown 16"			   },
-	{ do_bad,		SIGBUS,  0,		"unknown 17"			   },
-	{ do_bad,		SIGBUS,  0,		"unknown 18"			   },
-	{ do_bad,		SIGBUS,  0,		"unknown 19"			   },
-	{ do_bad,		SIGBUS,  0,		"lock abort"			   }, /* xscale */
-	{ do_bad,		SIGBUS,  0,		"unknown 21"			   },
-	{ do_bad,		SIGBUS,  BUS_OBJERR,	"imprecise external abort"	   }, /* xscale */
-	{ do_bad,		SIGBUS,  0,		"unknown 23"			   },
-	{ do_bad,		SIGBUS,  0,		"dcache parity error"		   }, /* xscale */
-	{ do_bad,		SIGBUS,  0,		"unknown 25"			   },
-	{ do_bad,		SIGBUS,  0,		"unknown 26"			   },
-	{ do_bad,		SIGBUS,  0,		"unknown 27"			   },
-	{ do_bad,		SIGBUS,  0,		"unknown 28"			   },
-	{ do_bad,		SIGBUS,  0,		"unknown 29"			   },
-	{ do_bad,		SIGBUS,  0,		"unknown 30"			   },
-	{ do_bad,		SIGBUS,  0,		"unknown 31"			   },
+	{ do_bad_imprecise,	SIGBUS,  0,		"unknown 16"			   },
+	{ do_bad_imprecise,	SIGBUS,  0,		"unknown 17"			   },
+	{ do_bad_imprecise,	SIGBUS,  0,		"unknown 18"			   },
+	{ do_bad_imprecise,	SIGBUS,  0,		"unknown 19"			   },
+	{ do_bad_imprecise,	SIGBUS,  0,		"lock abort"			   }, /* xscale */
+	{ do_bad_imprecise,	SIGBUS,  0,		"unknown 21"			   },
+	{ do_bad_imprecise,	SIGBUS,  BUS_OBJERR,	"imprecise external abort"	   }, /* xscale */
+	{ do_bad_imprecise,	SIGBUS,  0,		"unknown 23"			   },
+	{ do_bad_imprecise,	SIGBUS,  0,		"dcache parity error"		   }, /* xscale */
+	{ do_bad_imprecise,	SIGBUS,  0,		"unknown 25"			   },
+	{ do_bad_imprecise,	SIGBUS,  0,		"unknown 26"			   },
+	{ do_bad_imprecise,	SIGBUS,  0,		"unknown 27"			   },
+	{ do_bad_imprecise,	SIGBUS,  0,		"unknown 28"			   },
+	{ do_bad_imprecise,	SIGBUS,  0,		"unknown 29"			   },
+	{ do_bad_imprecise,	SIGBUS,  0,		"unknown 30"			   },
+	{ do_bad_imprecise,	SIGBUS,  0,		"unknown 31"			   },
 };
 
 static struct fsr_info ifsr_info[] = {