diff mbox

runtime check for omap-aes bus access permission (was: Re: 3.13-rc3 (commit 7ce93f3) breaks Nokia N900 DT boot)

Message ID 201502191920.41284@pali (mailing list archive)
State New, archived
Headers show

Commit Message

Pali Rohár Feb. 19, 2015, 6:20 p.m. UTC
On Wednesday 11 February 2015 16:22:51 Matthijs van Duin wrote:
> On 11 February 2015 at 13:39, Pali Rohár <pali.rohar@gmail.com> wrote:
> >> Anyhow, since checking the firewalls/APs to see if you have
> >> permission will probably only get you yet another fault if
> >> things are walled off, the robust way of dealing with this
> >> sort of situation is by probing the device with a read
> >> while trapping bus faults. This also handles modules that
> >> are unreachable for other reasons, e.g. being disabled by
> >> eFuse.
> > 
> > It is possible to patch kernel code to mask or ignore that
> > fault? Can you help me with something like that?
> 
> As I mentioned, I'm still learning my way around the kernel,
> so I don't feel very comfortable suggesting a concrete patch
> just yet. I've been browsing arch/arm/mm/ however and my
> impression is that all that would be required is editing
> fault.c by making a copy of do_bad but containing
>     return user_mode(regs) || !fixup_exception(regs);
> and hook it onto the appropriate fault codes.  However, this
> really needs the opinion of someone more familiar with this
> code.
> 
> I do have an observation to make on the issue of fault
> decoding: the list in fsr-2level.c may be "standard ARMv3 and
> ARMv4 aborts" but they are quite wrong for ARMv7 which has:
> 
> [ 0] -
> [ 1] alignment fault
> [ 2] debug event
> [ 3] section access flag fault
> [ 4] instruction cache maintainance fault (reported via data
> abort) [ 5] section translation fault
> [ 6] page access flag fault
> [ 7] page translation fault
> [ 8] bus error on access
> [ 9] section domain fault
> [10] -
> [11] page domain fault
> [12] bus error on section table walk
> [13] section permission fault
> [14] bus error on page table walk
> [15] page permission fault
> [16] (TLB conflict abort)
> [17] -
> [18] -
> [19] -
> [20] (lockdown abort)
> [21] -
> [22] async bus error (reported via data abort)
> [23] -
> [24] async parity/ECC error (reported via data abort)
> [25] parity/ECC error on access
> [26] (coprocessor abort)
> [27] -
> [28] parity/ECC error on section table walk
> [29] -
> [30] parity/ECC error on page table walk
> [31] -
> 
> Some entries are patched up near the bottom of fault.c but
> many bogus messages remain, for example the "on linefetch" vs
> "on non-linefetch" is misleading since no such thing can be
> inferred from the fault status on v7.  Also, the i-cache
> maintenance fault handling looks wrong to me: it should fetch
> the actual fault status from IFSR (even though the address
> still comes from DFSR) and dispatch based on that.
> 
> Async external aborts (async bus error and async parity/ECC
> error) give you basically no info. DFAR will contain garbage
> hence displaying it will confuse rather than enlighten, a
> traceback is pointless since the instruction that caused the
> access is long retired, likewise user_mode() doesn't matter
> since a transition to kernel space may have happened after
> the access that cause the abort. Basically they should be
> treated more as an IRQ than as a fault (note they can also be
> masked just like irqs). In case of a bus error, it may be
> appropriate to just warn about it, or perhaps send a signal
> to the current process, although in the latter case it should
> have some means to distinguish it from a synchronous bus
> error.
> 
> At least on the cortex-a8, a parity/ECC error (whether async
> or not) is to be regarded as absolutely fatal.  Quoth the
> TRM: "No recovery is possible. The abort handler must disable
> the caches, communicate the fail directly with the external
> system, request a reboot."
> 
> Bit 10 no longer indicates an asynchronous (let alone
> imprecise) fault.  Apart from the debug events and async
> aborts (and possibly some implementation-defined aborts), all
> aborts listed are synchronous, and DFAR/IFAR is valid.
> There's no technical obstruction to make these trappable via
> the kernel exception handling mechanism. (Though at least in
> case of parity/ECC errors one shouldn't.)

Anyway, in Nokia Harmattan N9/N950 2.6.32 kernel is this patch:


Maybe it is related?

Comments

Matthijs van Duin Feb. 19, 2015, 8:25 p.m. UTC | #1
On 18 February 2015 at 22:14, Pali Rohár <pali.rohar@gmail.com> wrote:
> Can you help us with above problem? How to catch external abort
> on non-linefetch in kernel driver and prevent kernel panic?

Actually it's a synchronous bus error that you want to catch, which
however is misreported by linux as "external abort on non-linefetch"
(... but a bus error on a linefetch would produce exactly the same
error).  Also, ARM apparently uses the term "external abort" as
umbrella term for aborts triggered outside the MMU, which includes not
just bus errors but also (uncorrectable) parity/ECC errors.

Anyhow, the core question you mean to ask is: can the "exception"
mechanism current already in place to trap MMU faults in e.g.
put_user() easily be extended to allow drivers to trap synchronous bus
errors?  My impression is that this would in fact be quite easy and I
even outlined a suggested patch, but I'm still a kernel newbie so I
may be way off course.

Although its main use would be for auto-probing, it's maybe worth
mentioning I've met at least one peripheral which also reports bus
errors when writing inappropriate/unsupported *values* to a register.
(Of course when using posted writes you won't get an abort anyhow in
that case, it's only reported via interconnect error logs.)


On 19 February 2015 at 19:20, Pali Rohár <pali.rohar@gmail.com> wrote:
> Anyway, in Nokia Harmattan N9/N950 2.6.32 kernel is this patch

In mainline linux the same fix-up is done at runtime rather than
compile time (in exceptions_init() at the bottom of fault.c). Either
way, in my post of the 11th I also mentioned that it looks wrong to
me. I-cache maintenance fault is really a special case in the fault
decoding logic since it means "although you got here via DAbort and
the relevant address is in DFAR, the exception happened on the
instruction side so you need to fetch the fault status from IFSR
instead."
Aaro Koskinen Feb. 19, 2015, 9:10 p.m. UTC | #2
Hi,

On Thu, Feb 19, 2015 at 07:20:41PM +0100, Pali Rohár wrote:
> Anyway, in Nokia Harmattan N9/N950 2.6.32 kernel is this patch:

> +/* Do we need runtime check ? */
> +#if __LINUX_ARM_ARCH__ < 6
>  	{ do_bad,		SIGBUS,	 0,		"external abort on linefetch"	   },
> +#else
> +	{ do_translation_fault,	SIGSEGV, SEGV_MAPERR,	"I-cache maintenance fault"	   },
> +#endif

> Maybe it is related?

That was unrelated. Also, the patch is also in mainline,
see 8c0b742ca7a7d21de0ddc87eda6ef0b282e4de18 (ARM: 6134/1: Handle
instruction cache maintenance fault properly).

A.
diff mbox

Patch

diff --git a/arch/arm/mm/fsr-2level.c b/arch/arm/mm/fsr-2level.c
index 18ca74c..d530d55 100644
--- a/arch/arm/mm/fsr-2level.c
+++ b/arch/arm/mm/fsr-2level.c
@@ -7,7 +7,12 @@  static struct fsr_info fsr_info[] = {
 	{ do_bad,		SIGBUS,	 BUS_ADRALN,	"alignment exception"		   },
 	{ do_bad,		SIGKILL, 0,		"terminal exception"		   },
 	{ do_bad,		SIGBUS,	 BUS_ADRALN,	"alignment exception"		   },
+/* Do we need runtime check ? */
+#if __LINUX_ARM_ARCH__ < 6
 	{ do_bad,		SIGBUS,	 0,		"external abort on linefetch"	   },
+#else
+	{ do_translation_fault,	SIGSEGV, SEGV_MAPERR,	"I-cache maintenance fault"	   },
+#endif
 	{ do_translation_fault,	SIGSEGV, SEGV_MAPERR,	"section translation fault"	   },
 	{ do_bad,		SIGBUS,	 0,		"external abort on linefetch"	   },
 	{ do_page_fault,	SIGSEGV, SEGV_MAPERR,	"page translation fault"	   },