Versatile Express randomly fails to boot

Message ID	20150316000438.GD8656@n2100.arm.linux.org.uk (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org> Date: Mon, 16 Mar 2015 00:04:38 +0000 From: Russell King - ARM Linux <linux@arm.linux.org.uk> To: linux-arm-kernel@lists.infradead.org, Will Deacon <will.deacon@arm.com>, daniel.thompson@linaro.org Subject: Re: Versatile Express randomly fails to boot Message-ID: <20150316000438.GD8656@n2100.arm.linux.org.uk> References: <20150315213330.GB8656@n2100.arm.linux.org.uk> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20150315213330.GB8656@n2100.arm.linux.org.uk> User-Agent: Mutt/1.5.23 (2014-03-12) Precedence: list Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org

Russell King - ARM Linux March 16, 2015, 12:04 a.m. UTC

On Sun, Mar 15, 2015 at 09:33:30PM +0000, Russell King - ARM Linux wrote:
> I'm going to try a few other kernels to try and track down what's going
> on - whether something from arm-soc or my tree is responsible for this
> really weird behaviour.

Okay, this is weird - it seems that it's caused by the FIQ oops
dumping code/FIQ changes which I've carried for many months
unchanged in my tree.

I haven't yet been able to prove which bit of those changes is
responsible yet - with a build time of about 5 minutes, and a test
time (due to the number of iterations required to prove it) around
10 minutes, it takes a while to narrow stuff down - it's taken all
evening to work out which branch is responsible, I'm just narrowing
it down to the commit, which looks like it's a result of something
in the change below.

This is pretty close to Daniel's patches, and we know that the GIC
on Versatile Express has problems with this stuff - I wonder if this
means we can't even probe the GIC to find out whether it's capable of
FIQ delivery via testing whether GICD_ENABLE_GRP1 can be set.

That said, and as I said above, this exact patch has been in my kernel,
and has been built and tested over many months, so I find it hard to
believe that this really _is_ responsible.

For tonights build, I'm going to drop the FIQ stuff from my devel tree,
which means it'll be gone from the build tree until we can, again,
figure out what the heck's going on here.

Russell King - ARM Linux March 16, 2015, 12:42 a.m. UTC | #1

On Mon, Mar 16, 2015 at 12:04:38AM +0000, Russell King - ARM Linux wrote:
> On Sun, Mar 15, 2015 at 09:33:30PM +0000, Russell King - ARM Linux wrote:
> > I'm going to try a few other kernels to try and track down what's going
> > on - whether something from arm-soc or my tree is responsible for this
> > really weird behaviour.
> 
> Okay, this is weird - it seems that it's caused by the FIQ oops
> dumping code/FIQ changes which I've carried for many months
> unchanged in my tree.

More weirdness.  Progressing forwards through my development code
showed that when I merged the patch I mentioned in the previous mail,
things started to fail.

As I also mentioned, I'd drop that branch (two patches, one adding
the IPI backtrace stuff and the second one updating the GIC to allow
it to raise FIQs on suitably equipped platforms.)  I would have
expected that to have worked, but it just failed after four boot
iterations.  So either it's not the FIQ, or it is the FIQ code _and_
also something else.  Or it has something to do with the placement
of functions in the kernel.

I'll try more stuff tomorrow, working from where I presently am
(which is basically last night's code minus the FIQ changes) by
removing other changes to see what brings us back to a working
system.

As I've already said - this is really weird because all of these
changes were also tested against -rc1... those which weren't are:

mm: fold arch_randomize_brk into ARCH_HAS_ELF_RANDOMIZE
mm: split ET_DYN ASLR from mmap ASLR
mm: move randomize_et_dyn into ELF_ET_DYN_BASE
mm: expose arch_mmap_rnd when available
arm: factor out mmap ASLR into mmap_rnd

and a number of clkdev rework patches (to make it use clk_hw
internally.)  Neither of these should be affecting it, but that's
something I will be testing tomorrow.

Russell King - ARM Linux March 16, 2015, 9:35 a.m. UTC | #2

On Mon, Mar 16, 2015 at 12:42:39AM +0000, Russell King - ARM Linux wrote:
> On Mon, Mar 16, 2015 at 12:04:38AM +0000, Russell King - ARM Linux wrote:
> > On Sun, Mar 15, 2015 at 09:33:30PM +0000, Russell King - ARM Linux wrote:
> > > I'm going to try a few other kernels to try and track down what's going
> > > on - whether something from arm-soc or my tree is responsible for this
> > > really weird behaviour.
> > 
> > Okay, this is weird - it seems that it's caused by the FIQ oops
> > dumping code/FIQ changes which I've carried for many months
> > unchanged in my tree.
> 
> More weirdness.  Progressing forwards through my development code
> showed that when I merged the patch I mentioned in the previous mail,
> things started to fail.
> 
> As I also mentioned, I'd drop that branch (two patches, one adding
> the IPI backtrace stuff and the second one updating the GIC to allow
> it to raise FIQs on suitably equipped platforms.)  I would have
> expected that to have worked, but it just failed after four boot
> iterations.  So either it's not the FIQ, or it is the FIQ code _and_
> also something else.  Or it has something to do with the placement
> of functions in the kernel.
> 
> I'll try more stuff tomorrow, working from where I presently am
> (which is basically last night's code minus the FIQ changes) by
> removing other changes to see what brings us back to a working
> system.
> 
> As I've already said - this is really weird because all of these
> changes were also tested against -rc1... those which weren't are:
> 
> mm: fold arch_randomize_brk into ARCH_HAS_ELF_RANDOMIZE
> mm: split ET_DYN ASLR from mmap ASLR
> mm: move randomize_et_dyn into ELF_ET_DYN_BASE
> mm: expose arch_mmap_rnd when available
> arm: factor out mmap ASLR into mmap_rnd
> 
> and a number of clkdev rework patches (to make it use clk_hw
> internally.)  Neither of these should be affecting it, but that's
> something I will be testing tomorrow.

Okay, reverting the ASLR changes and the clkdev changes annoyingly still
results in random failure.

Russell King - ARM Linux March 16, 2015, 1:04 p.m. UTC | #3

On Mon, Mar 16, 2015 at 09:35:53AM +0000, Russell King - ARM Linux wrote:
> On Mon, Mar 16, 2015 at 12:42:39AM +0000, Russell King - ARM Linux wrote:
> > On Mon, Mar 16, 2015 at 12:04:38AM +0000, Russell King - ARM Linux wrote:
> > > On Sun, Mar 15, 2015 at 09:33:30PM +0000, Russell King - ARM Linux wrote:
> > > > I'm going to try a few other kernels to try and track down what's going
> > > > on - whether something from arm-soc or my tree is responsible for this
> > > > really weird behaviour.
> > > 
> > > Okay, this is weird - it seems that it's caused by the FIQ oops
> > > dumping code/FIQ changes which I've carried for many months
> > > unchanged in my tree.
> > 
> > More weirdness.  Progressing forwards through my development code
> > showed that when I merged the patch I mentioned in the previous mail,
> > things started to fail.
> > 
> > As I also mentioned, I'd drop that branch (two patches, one adding
> > the IPI backtrace stuff and the second one updating the GIC to allow
> > it to raise FIQs on suitably equipped platforms.)  I would have
> > expected that to have worked, but it just failed after four boot
> > iterations.  So either it's not the FIQ, or it is the FIQ code _and_
> > also something else.  Or it has something to do with the placement
> > of functions in the kernel.
> > 
> > I'll try more stuff tomorrow, working from where I presently am
> > (which is basically last night's code minus the FIQ changes) by
> > removing other changes to see what brings us back to a working
> > system.
> > 
> > As I've already said - this is really weird because all of these
> > changes were also tested against -rc1... those which weren't are:
> > 
> > mm: fold arch_randomize_brk into ARCH_HAS_ELF_RANDOMIZE
> > mm: split ET_DYN ASLR from mmap ASLR
> > mm: move randomize_et_dyn into ELF_ET_DYN_BASE
> > mm: expose arch_mmap_rnd when available
> > arm: factor out mmap ASLR into mmap_rnd
> > 
> > and a number of clkdev rework patches (to make it use clk_hw
> > internally.)  Neither of these should be affecting it, but that's
> > something I will be testing tomorrow.
> 
> Okay, reverting the ASLR changes and the clkdev changes annoyingly still
> results in random failure.

After ruling out ASLR and clkdev, I started progressively reverting other
stuff in the build tree.  Eventually, I got down to reverting the L2C
change I've been carrying since the L2C cleanups.

With that lot reverted, which is slightly more than the previously known
good test, it booted five times without issue.

So, I thought I'd add that L2C change to the list of bad commits, and try
omitting _just_ the L2C and FIQ changes... and it still fails - on the
first test boot iteration.

I think I'm going to declare that this is all down to some obscure
hardware problem with Versatile Express, which is tickled by the layout
of the kernel against the cache, and take it out of the nightly system
(it's pointless having unstable hardware being tested; random failures
are completely meaningless.)

Sudeep Holla March 16, 2015, 5:47 p.m. UTC | #4

Hi Russell,

On 16/03/15 13:04, Russell King - ARM Linux wrote:
> On Mon, Mar 16, 2015 at 09:35:53AM +0000, Russell King - ARM Linux wrote:
>> On Mon, Mar 16, 2015 at 12:42:39AM +0000, Russell King - ARM Linux wrote:
>>> On Mon, Mar 16, 2015 at 12:04:38AM +0000, Russell King - ARM Linux wrote:
>>>> On Sun, Mar 15, 2015 at 09:33:30PM +0000, Russell King - ARM Linux wrote:
>>>>> I'm going to try a few other kernels to try and track down what's going
>>>>> on - whether something from arm-soc or my tree is responsible for this
>>>>> really weird behaviour.
>>>>
>>>> Okay, this is weird - it seems that it's caused by the FIQ oops
>>>> dumping code/FIQ changes which I've carried for many months
>>>> unchanged in my tree.
>>>
>>> More weirdness.  Progressing forwards through my development code
>>> showed that when I merged the patch I mentioned in the previous mail,
>>> things started to fail.
>>>
>>> As I also mentioned, I'd drop that branch (two patches, one adding
>>> the IPI backtrace stuff and the second one updating the GIC to allow
>>> it to raise FIQs on suitably equipped platforms.)  I would have
>>> expected that to have worked, but it just failed after four boot
>>> iterations.  So either it's not the FIQ, or it is the FIQ code _and_
>>> also something else.  Or it has something to do with the placement
>>> of functions in the kernel.
>>>
>>> I'll try more stuff tomorrow, working from where I presently am
>>> (which is basically last night's code minus the FIQ changes) by
>>> removing other changes to see what brings us back to a working
>>> system.
>>>
>>> As I've already said - this is really weird because all of these
>>> changes were also tested against -rc1... those which weren't are:
>>>
>>> mm: fold arch_randomize_brk into ARCH_HAS_ELF_RANDOMIZE
>>> mm: split ET_DYN ASLR from mmap ASLR
>>> mm: move randomize_et_dyn into ELF_ET_DYN_BASE
>>> mm: expose arch_mmap_rnd when available
>>> arm: factor out mmap ASLR into mmap_rnd
>>>
>>> and a number of clkdev rework patches (to make it use clk_hw
>>> internally.)  Neither of these should be affecting it, but that's
>>> something I will be testing tomorrow.
>>
>> Okay, reverting the ASLR changes and the clkdev changes annoyingly still
>> results in random failure.
>
> After ruling out ASLR and clkdev, I started progressively reverting other
> stuff in the build tree.  Eventually, I got down to reverting the L2C
> change I've been carrying since the L2C cleanups.
>
> With that lot reverted, which is slightly more than the previously known
> good test, it booted five times without issue.
>
> So, I thought I'd add that L2C change to the list of bad commits, and try
> omitting _just_ the L2C and FIQ changes... and it still fails - on the
> first test boot iteration.
>
> I think I'm going to declare that this is all down to some obscure
> hardware problem with Versatile Express, which is tickled by the layout
> of the kernel against the cache, and take it out of the nightly system
> (it's pointless having unstable hardware being tested; random failures
> are completely meaningless.)
>

I was able to see exact behaviour on my VExpress setup with CA9X4 
core-tile. Few observations from my side:

1. This issue can be reproduced even on v3.19
2. As you suspected L2C, I tried disabling L2C and it seems to solve
    the issue
3. Since it's very random and enabling LL_DEBUG made it difficult to
    reproduce the issue, I tried to dump the stack using DS5 debugger
4. The stack is exactly same always both on v4.0-rc* and v3.19 kernel
    and on multiple runs
5. Connecting to h/w debugger, stopping and re-starting the CPUs,
    solves the issue. It's helping CPUs to get out of __radix_tree_lookup
    somehow

Stacktrace
==========
(sorry it's looks different from std. Linux backtrace as this one id 
dump from DS5)

CPU 0
----
#0 __radix_tree_lookup( root = <Value currently has no location>, index 
= 16, nodep = (struct radix_tree_node**) 0x0, slotp = (void***) 0x0 ) at 
radix-tree.c:517
#1 generic_handle_irq( irq = 16 ) at irqdesc.c:349
#2 __handle_domain_irq( domain = (struct irq_domain*) 0xBF004400, hwirq 
= 16, lookup = <Value currently has no location>, regs = <Value 
currently has no location> ) at irqdesc.c:391
#3 __raw_readl( addr = <Value optimised away by compiler> ) at io.h:121
#4 gic_handle_irq( regs = (struct pt_regs*) 0x805F1F40 ) at irq-gic.c:277
#5 [__irq_svc+0x40]


CPU1
----
#0 __radix_tree_lookup( root = <Value currently has no location>, index 
= 16, nodep = (struct radix_tree_node**) 0x0, slotp = (void***) 0x0 ) at 
radix-tree.c:517
#1 __irq_get_desc_lock( irq = <Value currently has no location>, flags = 
(long unsigned int*) 0xBF08BF94, bus = false, check = 3 ) at irqdesc.c:544
#2 enable_percpu_irq( irq = 16, type = 0 ) at manage.c:1583
#3 twd_timer_cpu_notify( self = <Value not available : Undefined value 
in stack frame for register R0>, action = <Value currently has no 
location>, hcpu = <Value not available : Undefined value in stack frame 
for register R2> ) at smp_twd.c:322
#4 notifier_call_chain( nl = <Value currently has no location>, val = 
<Value not available : Undefined value in stack frame for register R1>, 
v = <Value not available : Undefined value in stack frame for register 
R2>, nr_to_call = <Value not available : Undefined value in stack frame 
for register R3>, nr_calls = (int*) 0x0 ) at notifier.c:95
#5 notifier_to_errno( ret = <Value currently has no location> ) at 
notifier.h:179
#6 cpu_notify( val = <Value currently has no location>, v = <Value 
currently has no location> ) at cpu.c:234
#7 secondary_start_kernel() at smp.c:367

CPU2 & CPU3
-----------
Not booted yet, still waiting in bootloader

Russell King - ARM Linux March 16, 2015, 6:16 p.m. UTC | #5

On Mon, Mar 16, 2015 at 05:47:46PM +0000, Sudeep Holla wrote:
> Hi Russell,
> 
> I was able to see exact behaviour on my VExpress setup with CA9X4 core-tile.
> Few observations from my side:
> 
> 1. This issue can be reproduced even on v3.19
> 2. As you suspected L2C, I tried disabling L2C and it seems to solve
>    the issue

My L2C says it's cache ID is 0x410000c3 - which is indeed a L2C-310, but
with an undocumented revision ID of 3, which as far as we can make out,
it's a R1Px where x > 0.

> 3. Since it's very random and enabling LL_DEBUG made it difficult to
>    reproduce the issue, I tried to dump the stack using DS5 debugger
> 4. The stack is exactly same always both on v4.0-rc* and v3.19 kernel
>    and on multiple runs

Hmm, I haven't seen them before I moved to 4.0-rc3 - before then my
nightly boot tests (which run two boots on the platform each night)
always seemed to succeed.

> 5. Connecting to h/w debugger, stopping and re-starting the CPUs,
>    solves the issue. It's helping CPUs to get out of __radix_tree_lookup
>    somehow

Interesting.  Are the traces below from 4.0-rc3 or an older kernel?

> Stacktrace
> ==========
> (sorry it's looks different from std. Linux backtrace as this one id dump
> from DS5)
> 
> CPU 0
> ----
> #0 __radix_tree_lookup( root = <Value currently has no location>, index =
> 16, nodep = (struct radix_tree_node**) 0x0, slotp = (void***) 0x0 ) at
> radix-tree.c:517

Can you dump the disassembly around this location for both CPU0 and CPU1
and the register values please?  I think it would be interesting to see
if they're both stuck on exactly the same address access.

I've currently narrowed down my latest potential culpret to something in
my Cubox-i code... specifically something in my "cubox-i-sdio" or
"imx-drm^" branches.

The cubox-i-sdio branch contains Olof's modifications to MMC to support
resets and regulators associated with wifi cards, which would be built,
but we would not have executed any of the MMC code at the point where
we'd be bringing the secondary CPUs up.  The imx-drm^ changes don't
touch any file which is built into my Versatile Express kernel, so it's
unlikely to affect anything (though, I'm build-boot-testing with imx-drm^
but cubox-i-sdio dropped just to make absolutely sure.)

One thing I've tried is turning off are the Cortex-A9 features - early
BRESP and full line of zeros.  That seems to make no apparent difference,
though it's hard to tell when #if 0'ing out the code, because that changes
the code placement and seems to stop the problem triggering.  I did have
a case where disabling FLZ (via #if 0'ing it out) seemed to solve it with
errata 588369 enabled, but changing the code to clear the FLZ bit instead
(which should have had the same effect) resulted in the problem
re-appearing.

I'm beginning to believe at this point that there /is/ a bug in the L2C on
the test chip, and that we're probably better off changing the Versatile
Express DT files to disable the L2C cache controller... what are your
thoughts on that?

I'm currently doing up to 8 boot tests - if I can do 8 consecutive boot
tests which all succeed, I'm declaring it a pass, otherwise it's a fail.
Generally, I've found that it will fail very early (like the first) but
sometimes up to the 4th.

I guess one thing we need to confirm is whether we have exactly the same
hardware and firmware versions.  Here's my board's early boot messages:

ARM V2M Boot loader v1.1.1
HBI0190 build 2313

ARM V2M Firmware v3.1.2
Build Date: Apr 16 2013

Date: Mon 30 Mar 2009
Time:     16:59:14

Cmd> reboot

Powering up system...
Daughterboard fitted to site 1.

Switching on ATXPSU...
ATX3V3: ON
VIOset: 1.8V
MBtemp: 27 degC

Configuring motherboard (rev D, var A)...
IOFPGA  config: PASSED
MUXFPGA config: PASSED
OSC CLK config: PASSED

Testing SMC devices (FPGA build 8)...
SRAM 32MB test: PASSED
VRAM  8MB test: PASSED
LAN9118   test: PASSED
USB & OTG test: PASSED
KMI1/KMI2 test: PASSED
MMC & SD  test: PASSED
DVI image test: PASSED
AACI AC97 test: PASSED
CF card   test: PASSED
UART port test: PASSED
MAC addrs test: PASSED

Reading Site 1 Board File \SITE1\HBI0191B\board.txt
DB1 JTAG configuration complete.
Setting DB1 OSCCLKS...
DB1.0 DCC 0 SPI configuration complete.

Writing SCC 0x40610000 with 0xBB8A802A
Writing SCC 0x40610001 with 0x00001F09
Writing SCC 0x40610002 with 0x00000000
DB1.0 DCC 0 SCC configuration complete.

DB SMB clock enabled.
Waiting for SITE1 CB_READY...
Testing SMB clock...
Configuring MUXFPGA for MB.
Setting DVI mode for VGA.
Releasing Daughterboard resets.
Switching MCC log to UART1.

Warning: Card Format not recognised, please check card.

ARM Versatile Express Boot Monitor
Version:    V5.2.1
Build Date: Apr  4 2013
Daughterboard Site 1: V2P-CA9 Cortex A9
Daughterboard Site 2: Not Used
Running boot script from flash - BOOTSCRIPT

U-Boot 2013.01.-rc1-00003-g43ee87aabf17-dirty (Jan 07 2014 - 00:00:38)
...

Sudeep Holla March 16, 2015, 7:16 p.m. UTC | #6

On 16/03/15 18:16, Russell King - ARM Linux wrote:
> On Mon, Mar 16, 2015 at 05:47:46PM +0000, Sudeep Holla wrote:
>> Hi Russell,
>>
>> I was able to see exact behaviour on my VExpress setup with CA9X4 core-tile.
>> Few observations from my side:
>>
>> 1. This issue can be reproduced even on v3.19
>> 2. As you suspected L2C, I tried disabling L2C and it seems to solve
>>     the issue
>
> My L2C says it's cache ID is 0x410000c3 - which is indeed a L2C-310, but
> with an undocumented revision ID of 3, which as far as we can make out,
> it's a R1Px where x > 0.
>
>> 3. Since it's very random and enabling LL_DEBUG made it difficult to
>>     reproduce the issue, I tried to dump the stack using DS5 debugger
>> 4. The stack is exactly same always both on v4.0-rc* and v3.19 kernel
>>     and on multiple runs
>
> Hmm, I haven't seen them before I moved to 4.0-rc3 - before then my
> nightly boot tests (which run two boots on the platform each night)
> always seemed to succeed.
>
>> 5. Connecting to h/w debugger, stopping and re-starting the CPUs,
>>     solves the issue. It's helping CPUs to get out of __radix_tree_lookup
>>     somehow
>
> Interesting.  Are the traces below from 4.0-rc3 or an older kernel?
>

This one is with v3.19 but I get exact same trace with v4.0-rc* kernel.

>> Stacktrace
>> ==========
>> (sorry it's looks different from std. Linux backtrace as this one id dump
>> from DS5)
>>
>> CPU 0
>> ----
>> #0 __radix_tree_lookup( root = <Value currently has no location>, index =
>> 16, nodep = (struct radix_tree_node**) 0x0, slotp = (void***) 0x0 ) at
>> radix-tree.c:517
>
> Can you dump the disassembly around this location for both CPU0 and CPU1
> and the register values please?  I think it would be interesting to see
> if they're both stuck on exactly the same address access.
>

(with v4.0-rc4 this time)

CPU#0
=====
#0 __radix_tree_lookup( root = <Value currently has no location>, index 
= 16, nodep = (struct radix_tree_node**) 0x0, slotp = (void***) 0x0 ) at 
radix-tree.c:517
     node = (struct radix_tree_node*) 0xBEC00001
     parent = <Value optimised away by compiler>
     height = 1
     shift = 0
     slot = <Value currently has no location>
#1 generic_handle_irq( irq = 16 ) at irqdesc.c:349
     desc = <Value optimised away by compiler>
#2 __handle_domain_irq( domain = (struct irq_domain*) 0xBF004400, hwirq 
= 16, lookup = <Value currently has no location>, regs = <Value 
currently has no location> ) at irqdesc.c:391
     old_regs = (struct pt_regs*) 0x0
     irq = <Value optimised away by compiler>
     ret = 0
#3 __raw_readl( addr = <Value optimised away by compiler> ) at io.h:121
#4 gic_handle_irq( regs = (struct pt_regs*) 0x805F1F40 ) at irq-gic.c:277
     irqstat = 2147518036
     irqnr = <Value currently has no location>
     gic = <Value optimised away by compiler>
     cpu_base = (void*) 0xC0802100
#5 [__irq_svc+0x40]

S:0x8021F80C : LSL      lr,r4,#3
S:0x8021F810 : SUB      lr,lr,r4,LSL #1
S:0x8021F814 : SUB      lr,lr,#6
S:0x8021F818 : B        {pc}+8 ; 0x8021f820
S:0x8021F81C : MOV      r5,r0
S:0x8021F820 : LSR      r12,r1,lr
S:0x8021F824 : SUB      lr,lr,#6
S:0x8021F828 : AND      r12,r12,#0x3f
S:0x8021F82C : ADD      r12,r12,#6
S:0x8021F830 : LDR      r0,[r5,r12,LSL #2]

Core registers:
R0           0x0000003F
R1           0x00000010
R2           0x00000000
R3           0x00000000
R4           0x00000001
R5           0xBEC00000
R6           0x00000000
R7           0x00000000
R8           0xBF004400
R9           0x805F1F90
R10          0x00000001
R11          0x805EEB08
R12          0xBEC00001
SP           0x805F1EFC
LR           0x00000000
PC           0x8021F820
CPSR         0x80000193

CPU#1
=====
#0 __radix_tree_lookup( root = <Value currently has no location>, index 
= 16, nodep = (struct radix_tree_node**) 0x0, slotp = (void***) 0x0 ) at 
radix-tree.c:517
     node = (struct radix_tree_node*) 0xBEC00001
     parent = <Value optimised away by compiler>
     height = 1
     shift = 0
     slot = <Value currently has no location>
#1 __irq_get_desc_lock( irq = <Value currently has no location>, flags = 
(long unsigned int*) 0xBF08BF94, bus = false, check = 3 ) at irqdesc.c:544
     desc = <Value optimised away by compiler>
#2 enable_percpu_irq( irq = 16, type = 0 ) at manage.c:1583
     cpu = 1
     flags = <Value currently has no location>
     desc = <Value optimised away by compiler>
#3 twd_timer_cpu_notify( self = <Value not available : Undefined value 
in stack frame for register R0>, action = <Value currently has no 
location>, hcpu = <Value not available : Undefined value in stack frame 
for register R2> ) at smp_twd.c:322
#4 notifier_call_chain( nl = <Value currently has no location>, val = 
<Value not available : Undefined value in stack frame for register R1>, 
v = <Value not available : Undefined value in stack frame for register 
R2>, nr_to_call = <Value not available : Undefined value in stack frame 
for register R3>, nr_calls = (int*) 0x0 ) at notifier.c:95
     ret = <Value currently has no location>
     nb = <Value optimised away by compiler>
     next_nb = <Value optimised away by compiler>
#5 notifier_to_errno( ret = <Value currently has no location> ) at 
notifier.h:179
#6 cpu_notify( val = <Value currently has no location>, v = <Value 
currently has no location> ) at cpu.c:234
#7 secondary_start_kernel() at smp.c:367
     mm = <Value optimised away by compiler>
     cpu = 1
#8 [S:0x60008724]

Disassembly:

S:0x8021F80C : LSL      lr,r4,#3
S:0x8021F810 : SUB      lr,lr,r4,LSL #1
S:0x8021F814 : SUB      lr,lr,#6
S:0x8021F818 : B        {pc}+8 ; 0x8021f820
S:0x8021F81C : MOV      r5,r0
S:0x8021F820 : LSR      r12,r1,lr
S:0x8021F824 : SUB      lr,lr,#6
S:0x8021F828 : AND      r12,r12,#0x3f
S:0x8021F82C : ADD      r12,r12,#6
S:0x8021F830 : LDR      r0,[r5,r12,LSL #2]

Core registers:
R0           0x0000003F
R1           0x00000010
R2           0x00000000
R3           0x00000000
R4           0x00000001
R5           0xBEC00000
R6           0xBF08BF94
R7           0x00000000
R8           0x805F92A0
R9           0x00000000
R10          0x00000000
R11          0x00000000
R12          0xBEC00001
SP           0xBF08BF6C
LR           0x00000000
PC           0x8021F820
CPSR         0x800001D3   Nzcvq_ge3ge2ge1ge0_inactive_eAIFtj_SVC

[...]

> I'm beginning to believe at this point that there /is/ a bug in the L2C on
> the test chip, and that we're probably better off changing the Versatile
> Express DT files to disable the L2C cache controller... what are your
> thoughts on that?
>

I was thinking of taking the dump of L2C register settings and comparing
them. But currently I am facing issues booting even v3.18 on my setup,
it seem to fails somewhere else which I need to look at.

> I'm currently doing up to 8 boot tests - if I can do 8 consecutive boot
> tests which all succeed, I'm declaring it a pass, otherwise it's a fail.
> Generally, I've found that it will fail very early (like the first) but
> sometimes up to the 4th.
>
> I guess one thing we need to confirm is whether we have exactly the same
> hardware and firmware versions.  Here's my board's early boot messages:
>

ARM V2M Boot loader v1.1.2
HBI0190 build 2313

ARM V2M Firmware v3.1.2
Build Date: Apr 16 2013

Date: Mon 16 Mar 2015
Time:     18:57:21

Powering up system...
Daughterboard fitted to site 1.

Switching on ATXPSU...
ATX3V3: ON
VIOset: 1.8V
MBtemp: 26 degC

Configuring motherboard (rev D, var A)...
IOFPGA  config: PASSED
MUXFPGA config: PASSED
OSC CLK config: PASSED

Testing SMC devices (FPGA build 8)...
SRAM 32MB test: PASSED
VRAM  8MB test: PASSED
LAN9118   test: PASSED
USB & OTG test: PASSED
KMI1/KMI2 test: PASSED
MMC & SD  test: PASSED
DVI image test: PASSED
AACI AC97 test: PASSED
CF card   test: PASSED
UART port test: PASSED
MAC addrs test: PASSED

Reading Site 1 Board File \SITE1\HBI0191B\board.txt
DB1 JTAG configuration complete.
Setting DB1 OSCCLKS...
DB1.0 DCC 0 SPI configuration complete.

Writing SCC 0x40610000 with 0xBB8A802A
Writing SCC 0x40610001 with 0x00001F09
Writing SCC 0x40610002 with 0x00000000
DB1.0 DCC 0 SCC configuration complete.

DB SMB clock enabled.
Waiting for SITE1 CB_READY...
Testing SMB clock...
Configuring MUXFPGA for MB.
Setting DVI mode for VGA.
Releasing Daughterboard resets.
Switching MCC log to UART1.
%BootMonitor-Warning, Unable to open SYSTEM.DAT


ARM Versatile Express Boot Monitor
Version:    V5.2.1
Build Date: Apr  4 2013
Daughterboard Site 1: V2P-CA9 Cortex A9
Daughterboard Site 2: Not Used
Running boot script from flash - BOOTSCRIPT

Russell King - ARM Linux March 16, 2015, 7:52 p.m. UTC | #7

On Mon, Mar 16, 2015 at 07:16:05PM +0000, Sudeep Holla wrote:
> On 16/03/15 18:16, Russell King - ARM Linux wrote:
> >Can you dump the disassembly around this location for both CPU0 and CPU1
> >and the register values please?  I think it would be interesting to see
> >if they're both stuck on exactly the same address access.
> 
> (with v4.0-rc4 this time)

Thanks.

> CPU#0
> =====
...
> S:0x8021F80C : LSL      lr,r4,#3
> S:0x8021F810 : SUB      lr,lr,r4,LSL #1
> S:0x8021F814 : SUB      lr,lr,#6
> S:0x8021F818 : B        {pc}+8 ; 0x8021f820
> S:0x8021F81C : MOV      r5,r0
> S:0x8021F820 : LSR      r12,r1,lr
> S:0x8021F824 : SUB      lr,lr,#6
> S:0x8021F828 : AND      r12,r12,#0x3f
> S:0x8021F82C : ADD      r12,r12,#6
> S:0x8021F830 : LDR      r0,[r5,r12,LSL #2]
> 
> Core registers:
> R0           0x0000003F
> R1           0x00000010
> R2           0x00000000
> R3           0x00000000
> R4           0x00000001
> R5           0xBEC00000
> R6           0x00000000
> R7           0x00000000
> R8           0xBF004400
> R9           0x805F1F90
> R10          0x00000001
> R11          0x805EEB08
> R12          0xBEC00001
> SP           0x805F1EFC
> LR           0x00000000
> PC           0x8021F820
> CPSR         0x80000193
> 
> CPU#1
> =====
...
> S:0x8021F80C : LSL      lr,r4,#3
> S:0x8021F810 : SUB      lr,lr,r4,LSL #1
> S:0x8021F814 : SUB      lr,lr,#6
> S:0x8021F818 : B        {pc}+8 ; 0x8021f820
> S:0x8021F81C : MOV      r5,r0
> S:0x8021F820 : LSR      r12,r1,lr
> S:0x8021F824 : SUB      lr,lr,#6
> S:0x8021F828 : AND      r12,r12,#0x3f
> S:0x8021F82C : ADD      r12,r12,#6
> S:0x8021F830 : LDR      r0,[r5,r12,LSL #2]
> 
> Core registers:
> R0           0x0000003F
> R1           0x00000010
> R2           0x00000000
> R3           0x00000000
> R4           0x00000001
> R5           0xBEC00000
> R6           0xBF08BF94
> R7           0x00000000
> R8           0x805F92A0
> R9           0x00000000
> R10          0x00000000
> R11          0x00000000
> R12          0xBEC00001
> SP           0xBF08BF6C
> LR           0x00000000
> PC           0x8021F820
> CPSR         0x800001D3   Nzcvq_ge3ge2ge1ge0_inactive_eAIFtj_SVC

And we find that both CPUs have stopped at exactly the same place, which
is an arithmetic instruction.

If I had to guess, I'd say the reason it's stopped there (exactly on a
cache line boundary) is because both CPUs are waiting for an instruction
fetch to complete into its L1 I-cache, and for some reason, the L2
cache is not satisfying the request from either CPU.  The question of
course is... why not.

> >I guess one thing we need to confirm is whether we have exactly the same
> >hardware and firmware versions.  Here's my board's early boot messages:

Looks like we're broadly the same, apart from the boot loader version.
You have 1.1.2, whereas I have 1.1.1.

Co-incidentally, I just looked at the disassembly of my __radix_tree_lookup:

c0199750:       e0050495        mul     r5, r5, r4
c0199754:       e2455006        sub     r5, r5, #6
c0199758:       ea000000        b       c0199760 <__radix_tree_lookup+0x70>
c019975c:       e1a0c000        mov     ip, r0
c0199760:       e1a06531        lsr     r6, r1, r5
c0199764:       e206603f        and     r6, r6, #63     ; 0x3f
c0199768:       e2866006        add     r6, r6, #6
c019976c:       e79c0106        ldr     r0, [ip, r6, lsl #2]

The code is slightly different, but notice that the alignment of the
LSR instruction is the same as yours - at first I wondered whether that's
coincidence or not.  However, taking Olof's MMC changes back out of my
tree (which results in a booting kernel) makes no difference to the
placement of this code.

The start of the read-only data section doesn't change between the working
and non-working kernels, but the location of the spinlock and some scheduler
code does change (along with all the networking code.)

There's changes in the read-only data section, there's also changes to a
set of "descriptor.NNNNN" symbols towards the end of the data section,
which goes on to change the placement of the bss section.

The diff between the System.map is unpostable - it's about 1.3MB. :(

Versatile Express randomly fails to boot

Commit Message

Comments

Patch