Versatile Express randomly fails to boot - Versatile Express to be removed from nightly testing

Message ID 550818A6.9020205@arm.com (mailing list archive)
State New, archived

Commit Message

Sudeep Holla March 17, 2015, 12:05 p.m. UTC
On 16/03/15 19:52, Russell King - ARM Linux wrote:
> On Mon, Mar 16, 2015 at 07:16:05PM +0000, Sudeep Holla wrote:
>> On 16/03/15 18:16, Russell King - ARM Linux wrote:

[...]

> If I had to guess, I'd say the reason it's stopped there (exactly on a
> cache line boundary) is because both CPUs are waiting for an instruction
> fetch to complete into its L1 I-cache, and for some reason, the L2
> cache is not satisfying the request from either CPU.  The question of
> course is... why not.
>

As I mentioned yesterday, I compared the L2C settings between v3.18
and later kernels and found them to be *exactly the same*.

Since you suspected issues around instruction fetching, I tried playing
around with the tag and data RAM latencies. After some experiments, I
found that after changing just the tag RAM read latency to 2 cycles,
the issue we are seeing goes away, at least on my setup. It would be
good to see the behaviour on your setup with the patch below.

The default value which bootmon programs happens to be the worst-case
scenario (8 cycles for all). Will recalls that it was changed to the
minimum value after the graphics guys complained about performance.

We need to check with the h/w guys to get the correct optimal values
for these latencies.

Regards,
Sudeep

--->8

Comments

Russell King - ARM Linux March 17, 2015, 3:36 p.m. UTC | #1
On Tue, Mar 17, 2015 at 12:05:58PM +0000, Sudeep Holla wrote:
> As I mentioned yesterday, I compared the L2C settings between v3.18
> and later kernels and found them to be *exactly the same*.
>
> Since you suspected issues around instruction fetching, I tried playing
> around with the tag and data RAM latencies. After some experiments, I
> found that after changing just the tag RAM read latency to 2 cycles,
> the issue we are seeing goes away, at least on my setup. It would be
> good to see the behaviour on your setup with the patch below.
>
> The default value which bootmon programs happens to be the worst-case
> scenario (8 cycles for all). Will recalls that it was changed to the
> minimum value after the graphics guys complained about performance.
>
> We need to check with the h/w guys to get the correct optimal values
> for these latencies.
> 
> Regards,
> Sudeep
> 
> --->8
> 
> diff --git a/arch/arm/boot/dts/vexpress-v2p-ca9.dts b/arch/arm/boot/dts/vexpress-v2p-ca9.dts
> index 23662b5a5e9d..030c90c1105d 100644
> --- a/arch/arm/boot/dts/vexpress-v2p-ca9.dts
> +++ b/arch/arm/boot/dts/vexpress-v2p-ca9.dts
> @@ -172,7 +172,7 @@
>                 interrupts = <0 43 4>;
>                 cache-level = <2>;
>                 arm,data-latency = <1 1 1>;
> -               arm,tag-latency = <1 1 1>;
> +               arm,tag-latency = <1 2 1>;

I've tried <1 2 1> and <1 8 1> here; I don't see any difference.  My
test build fails on the first boot attempt for each.

I notice you're only changing the write latency here.  Is that correct?
You mention read latency above.
Sudeep Holla March 17, 2015, 3:51 p.m. UTC | #2
On 17/03/15 15:36, Russell King - ARM Linux wrote:
> On Tue, Mar 17, 2015 at 12:05:58PM +0000, Sudeep Holla wrote:
>> As I mentioned yesterday, I compared the L2C settings between v3.18
>> and later kernels and found them to be *exactly the same*.
>>
>> Since you suspected issues around instruction fetching, I tried playing
>> around with the tag and data RAM latencies. After some experiments, I
>> found that after changing just the tag RAM read latency to 2 cycles,
>> the issue we are seeing goes away, at least on my setup. It would be
>> good to see the behaviour on your setup with the patch below.
>>
>> The default value which bootmon programs happens to be the worst-case
>> scenario (8 cycles for all). Will recalls that it was changed to the
>> minimum value after the graphics guys complained about performance.
>>
>> We need to check with the h/w guys to get the correct optimal values
>> for these latencies.
>>
>> Regards,
>> Sudeep
>>
>> --->8
>>
>> diff --git a/arch/arm/boot/dts/vexpress-v2p-ca9.dts b/arch/arm/boot/dts/vexpress-v2p-ca9.dts
>> index 23662b5a5e9d..030c90c1105d 100644
>> --- a/arch/arm/boot/dts/vexpress-v2p-ca9.dts
>> +++ b/arch/arm/boot/dts/vexpress-v2p-ca9.dts
>> @@ -172,7 +172,7 @@
>>                  interrupts = <0 43 4>;
>>                  cache-level = <2>;
>>                  arm,data-latency = <1 1 1>;
>> -               arm,tag-latency = <1 1 1>;
>> +               arm,tag-latency = <1 2 1>;
>
> I've tried <1 2 1> and <1 8 1> here; I don't see any difference.  My
> test build fails on the first boot attempt for each.
>

That's bad. I started with 2 cycles for all (rd/wr/setup) latencies
(data and tag RAM) and narrowed down to this setting with multiple
experiments. I tried booting at least 10 times for each setting.

Since bootmon sets 8 cycles for all the latencies, does it make sense
to try that setting, to check whether the issue you are seeing is
related to L2 latencies at all? Meanwhile I will continue my testing.

> I notice you're only changing the write latency here.  Is that correct?
> You mention read latency above.
>

Sorry, my bad. You are right, it's the write latency; I misread the
L2C binding document.
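
For anyone else tripping over the binding: the cells are parsed in the
order <read write setup>, and the hardware fields store each latency
minus one. Here is a simplified sketch of what arch/arm/mm/cache-l2x0.c
does with the property (the helper name is invented for illustration;
the real logic lives in l2c310_of_parse() and uses a secure write
accessor):

static void __init l2c310_parse_tag_latency(const struct device_node *np,
					    void __iomem *base)
{
	u32 tag[3] = { 0, 0, 0 };	/* <read write setup>, in cycles */

	of_property_read_u32_array(np, "arm,tag-latency", tag,
				   ARRAY_SIZE(tag));
	if (tag[0] && tag[1] && tag[2])
		/* the register fields store cycles-1 */
		writel_relaxed(L310_LATENCY_CTRL_RD(tag[0] - 1) |
			       L310_LATENCY_CTRL_WR(tag[1] - 1) |
			       L310_LATENCY_CTRL_SETUP(tag[2] - 1),
			       base + L310_TAG_LATENCY_CTRL);
}

So <1 1 1> programs the tag latency register to 0x000, the <1 2 1> in
the patch programs 0x100 (only the write field changes), and the
bootmon worst case <8 8 8> programs 0x777.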

Regards,
Sudeep
Russell King - ARM Linux March 17, 2015, 4:17 p.m. UTC | #3
On Tue, Mar 17, 2015 at 03:51:53PM +0000, Sudeep Holla wrote:
> That's bad. I started with 2 cycles for all (rd/wr/setup) latencies
> (data and tag RAM) and narrowed down to this setting with multiple
> experiments. I tried booting at least 10 times for each setting.
>
> Since bootmon sets 8 cycles for all the latencies, does it make sense
> to try that setting, to check whether the issue you are seeing is
> related to L2 latencies at all? Meanwhile I will continue my testing.

For me <2 2 1> works - so read and write latencies of 2, setup of 1.
Russell King - ARM Linux March 30, 2015, 2:03 p.m. UTC | #4
On Tue, Mar 17, 2015 at 04:17:48PM +0000, Russell King - ARM Linux wrote:
> On Tue, Mar 17, 2015 at 03:51:53PM +0000, Sudeep Holla wrote:
> > That's bad. I started with 2 cycles for all (rd/wr/setup) latencies
> > (data and tag RAM) and narrowed down to this setting with multiple
> > experiments. I tried booting at least 10 times for each setting.
> >
> > Since bootmon sets 8 cycles for all the latencies, does it make sense
> > to try that setting, to check whether the issue you are seeing is
> > related to L2 latencies at all? Meanwhile I will continue my testing.
> 
> For me <2 2 1> works - so read and write latencies of 2, setup of 1.

So what's happening?  Is someone going to update the dtb to adjust the
latencies so that we can have a reliable platform?
Sudeep Holla March 30, 2015, 2:48 p.m. UTC | #5
Hi Russell,

On 30/03/15 15:03, Russell King - ARM Linux wrote:
> On Tue, Mar 17, 2015 at 04:17:48PM +0000, Russell King - ARM Linux wrote:
>> On Tue, Mar 17, 2015 at 03:51:53PM +0000, Sudeep Holla wrote:
>>> That's bad. I started with 2 cycles for all (rd/wr/setup) latencies
>>> (data and tag RAM) and narrowed down to this setting with multiple
>>> experiments. I tried booting at least 10 times for each setting.
>>>
>>> Since bootmon sets 8 cycles for all the latencies, does it make sense
>>> to try that setting, to check whether the issue you are seeing is
>>> related to L2 latencies at all? Meanwhile I will continue my testing.
>>
>> For me <2 2 1> works - so read and write latencies of 2, setup of 1.
>
> So what's happening?  Is someone going to update the dtb to adjust the
> latencies so that we can have a reliable platform?
>

Though <2 2 1> works fine most of the time, I tried a continuous
overnight reboot test and it failed. I kept increasing the latencies
and found that even the maximum latency of <8 8 8> could not survive
the continuous overnight reboot test; it fails with the exact same
issue.

So I am not sure we can consider it a fix. However, if we are OK with
*mostly reliable*, then we can push that change.

Regards,
Sudeep
Russell King - ARM Linux March 30, 2015, 3:05 p.m. UTC | #6
On Mon, Mar 30, 2015 at 03:48:08PM +0100, Sudeep Holla wrote:
> Though <2 2 1> works fine most of the time, I tried a continuous
> overnight reboot test and it failed. I kept increasing the latencies
> and found that even the maximum latency of <8 8 8> could not survive
> the continuous overnight reboot test; it fails with the exact same
> issue.
>
> So I am not sure we can consider it a fix. However, if we are OK with
> *mostly reliable*, then we can push that change.

Okay, the issue I have is this.

Versatile Express used to boot reliably in the nightly build tests prior
to DT.  In that mode, we never configured the latency values.

Then the legacy code was removed, and I had to switch over to DT booting,
and shortly after I noticed that the platform was now randomly failing
its nightly boot tests.

Maybe we should revert the commit removing the superior legacy code,
because that seems to be the only thing that was reliable?  Maybe it was
premature to remove it until DT had proven itself?

On the other hand, if the legacy code hadn't been removed, I probably
would never have tested it - but then, from what I hear, this was a
*known* issue prior to the removal of the legacy code.  Given that the
legacy code worked totally fine, it's utterly idiotic to me to have
removed the working legacy code when DT is so unstable.

Whatever way I look at this, this problem _is_ a _regression_, and we
can't sit around and hope it magically vanishes by some means.

I think given what you've said, it suggests that there is something else
going on.  So, what we need to do is to revert the removal of the legacy
code and investigate what the differences are between the apparently
broken DT code and the working legacy code.

I have not _once_ seen this behaviour with the legacy code.
Sudeep Holla March 30, 2015, 3:39 p.m. UTC | #7
On 30/03/15 16:05, Russell King - ARM Linux wrote:
> On Mon, Mar 30, 2015 at 03:48:08PM +0100, Sudeep Holla wrote:
>> Though <2 2 1> works fine most of the time, I tried a continuous
>> overnight reboot test and it failed. I kept increasing the latencies
>> and found that even the maximum latency of <8 8 8> could not survive
>> the continuous overnight reboot test; it fails with the exact same
>> issue.
>>
>> So I am not sure we can consider it a fix. However, if we are OK with
>> *mostly reliable*, then we can push that change.
>
> Okay, the issue I have is this.
>
> Versatile Express used to boot reliably in the nightly build tests prior
> to DT.  In that mode, we never configured the latency values.
>

I have never run in legacy mode as I am relatively new to the vexpress
platform and have used DT from the start. Just to understand better, I
had a look at commit 81cc3f868d30 ("ARM: vexpress: Remove non-DT code")
and I see the function below in arch/arm/mach-vexpress/ct-ca9x4.c, so
I assume we were programming one cycle for all the latencies, just
like DT does.

static void __init ca9x4_l2_init(void)
{
#ifdef CONFIG_CACHE_L2X0
	void __iomem *l2x0_base = ioremap(CT_CA9X4_L2CC, SZ_4K);

	if (l2x0_base) {
		/* set RAM latencies to 1 cycle for this core tile. */
		writel(0, l2x0_base + L310_TAG_LATENCY_CTRL);
		writel(0, l2x0_base + L310_DATA_LATENCY_CTRL);

		l2x0_init(l2x0_base, 0x00400000, 0xfe0fffff);
	} else {
		pr_err("L2C: unable to map L2 cache controller\n");
	}
#endif
}
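
Note that the PL310 latency registers store cycles-1, so writing 0 to
both means 1 cycle everywhere. The DT equivalent of those two writes
is exactly what the dts already has:

	arm,tag-latency = <1 1 1>;
	arm,data-latency = <1 1 1>;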

> Then the legacy code was removed, and I had to switch over to DT booting,
> and shortly after I noticed that the platform was now randomly failing
> its nightly boot tests.
>
> Maybe we should revert the commit removing the superior legacy code,
> because that seems to be the only thing that was reliable?  Maybe it was
> premature to remove it until DT had proven itself?
>
> On the other hand, if the legacy code hadn't been removed, I probably
> would never have tested it - but then, from what I hear, this was a
> *known* issue prior to the removal of the legacy code.  Given that the
> legacy code worked totally fine, it's utterly idiotic to me to have
>> removed the working legacy code when DT is so unstable.
>
> Whatever way I look at this, this problem _is_ a _regression_, and we
> can't sit around and hope it magically vanishes by some means.
>

I agree; the last time I tested, it was fine with v3.18. However, I
have not run the continuous overnight reboot test on that. I will
start by looking at that, just to see if the issue is related to DT
vs legacy boot.

> I think given what you've said, it suggests that there is something else
> going on.  So, what we need to do is to revert the removal of the legacy
> code and investigate what the differences are between the apparently
> broken DT code and the working legacy code.
>

Agreed. I will check whether DT boot was ever stable before and
including v3.18.

> I have not _once_ seen this behaviour with the legacy code.
>

OK

Regards,
Sudeep
Sudeep Holla March 31, 2015, 5:27 p.m. UTC | #8
On 30/03/15 16:39, Sudeep Holla wrote:
>
>
> On 30/03/15 16:05, Russell King - ARM Linux wrote:
>> On Mon, Mar 30, 2015 at 03:48:08PM +0100, Sudeep Holla wrote:
>>> Though <2 2 1> works fine most of the time, I tried a continuous
>>> overnight reboot test and it failed. I kept increasing the latencies
>>> and found that even the maximum latency of <8 8 8> could not survive
>>> the continuous overnight reboot test; it fails with the exact same
>>> issue.
>>>
>>> So I am not sure we can consider it a fix. However, if we are OK with
>>> *mostly reliable*, then we can push that change.
>>
>> Okay, the issue I have is this.
>>
>> Versatile Express used to boot reliably in the nightly build tests prior
>> to DT.  In that mode, we never configured the latency values.
>>
>
> I have never run in legacy mode as I am relatively new to the vexpress
> platform and have used DT from the start. Just to understand better, I
> had a look at commit 81cc3f868d30 ("ARM: vexpress: Remove non-DT code")
> and I see the function below in arch/arm/mach-vexpress/ct-ca9x4.c, so
> I assume we were programming one cycle for all the latencies, just
> like DT does.
>

I was able to boot v3.18 without DT, and I compared the L2C settings
with and without DT; they are identical. Also, v3.18 both with and
without DT survived overnight reboot testing.

>> Then the legacy code was removed, and I had to switch over to DT booting,
>> and shortly after I noticed that the platform was now randomly failing
>> its nightly boot tests.
>>
>> Maybe we should revert the commit removing the superior legacy code,
>> because that seems to be the only thing that was reliable?  Maybe it was
>> premature to remove it until DT had proven itself?
>>

Not sure on that as v3.18 with DT seems to be working fine and passed
overnight reboot testing.

>> On the other hand, if the legacy code hadn't been removed, I probably
>> would never have tested it - but then, from what I hear, this was a
>> *known* issue prior to the removal of the legacy code.  Given that the
>> legacy code worked totally fine, it's utterly idiotic to me to have
>>> removed the working legacy code when DT is so unstable.
>>
>> Whatever way I look at this, this problem _is_ a _regression_, and we
>> can't sit around and hope it magically vanishes by some means.
>>
>
> I agree; the last time I tested, it was fine with v3.18. However, I
> have not run the continuous overnight reboot test on that. I will
> start by looking at that, just to see if the issue is related to DT
> vs legacy boot.
>

Since v3.18 is fine in both boot modes and the problem is reproducible
on v3.19-rc1, I am trying to bisect, but I'm not sure that's feasible
for such a problem. I also found out by accident that even on mainline
with more configs enabled, it's hard to hit the issue.

Regards,
Sudeep
Russell King - ARM Linux April 2, 2015, 2:13 p.m. UTC | #9
On Tue, Mar 31, 2015 at 06:27:30PM +0100, Sudeep Holla wrote:
> Not sure on that as v3.18 with DT seems to be working fine and passed
> overnight reboot testing.

Okay, that suggests there's something post v3.18 which is causing this,
rather than it being a DT vs non-DT thing.

An extra data point which I've just found (by enabling attempts to do
hibernation on various test platforms) is that the Versatile Express
appears to be incapable of taking a CPU offline.

This crashes the entire system with sometimes random results.  Sometimes
it'll appear that a spinlock has been left owned by CPU#1 which is
offline.  Sometimes it'll silently hang.  Sometimes it'll start slowly
dumping kernel messages from the start of the kernel's ring buffer (!),
eg:

PM: freeze of devices complete after 29.342 msecs
PM: late freeze of devices complete after 6.398 msecs
PM: noirq freeze of devices complete after 5.493 msecs
Disabling non-boot CPUs ...
__cpu_disable(1)
__cpu_die(1)
handle_IPI(0)
Booting Linux on physical CPU 0x0

So far, it's not managed to take a CPU successfully offline and know that
it has.  If I disable the calls to cpu_enter_lowpower() and
cpu_leave_lowpower(), then it appears to work.
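
For reference, here's a rough sketch of the path in question,
arch/arm/mach-vexpress/hotplug.c as it looked at the time (abridged
from memory, so details may differ):

void __ref vexpress_cpu_die(unsigned int cpu)
{
	int spurious = 0;

	/* drop out of SMP coherency, with caches cleaned and disabled */
	cpu_enter_lowpower();
	/* sit in WFI waiting to be taken down, counting spurious wakeups */
	platform_do_lowpower(cpu, &spurious);
	/* if we get here, the CPU was never reset: rejoin coherency */
	cpu_leave_lowpower();

	if (spurious)
		pr_warn("CPU%u: %u spurious wakeup calls\n", cpu, spurious);
}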

This leads me to wonder whether flush_cache_louis() works... which led me
in turn to ARM_ERRATA_643719, which is disabled in my builds.  However,
the CA9 tile has an r0p1 CA9, which allegedly suffers from this erratum.

The really interesting thing is that I've never had that errata enabled
for Versatile Express - even going back to 3.14 times (I have a working
3.14 config file which clearly shows that it was disabled.)  So, I'm
wondering if we've relaxed the cache flushing in such a way that we now
expose the ineffectual flush_cache_louis() bug.
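
If I remember the erratum right, on the affected A9s the LoUIS field
of CLIDR reads as zero, so v7_flush_dcache_louis() returns without
flushing anything unless the workaround is built in. Testing that
theory should just be a matter of turning on:

CONFIG_ARM_ERRATA_643719=y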

There aren't that many flush_cache_louis() calls in the kernel.  We do
have this:

commit bca7a5a04933700a8bde4ea5798119607a8b0436
Author: Russell King <rmk+kernel@arm.linux.org.uk>
Date:   Thu Apr 18 18:15:44 2013 +0100

    ARM: cpu hotplug: remove majority of cache flushing from platforms

in conjunction with:

commit 51acdfd1fa38a2bf1003255be9f105c19fbc0176
Author: Russell King <rmk+kernel@arm.linux.org.uk>
Date:   Thu Apr 18 18:05:29 2013 +0100

    ARM: smp: flush L1 cache in cpu_die()

which changed the flush_cache_all() to a flush_cache_louis() in the
hot unplug path.  We also have this:

commit e40678559fdf3f56ce9a349365fbf39e1f63ecc0
Author: Nicolas Pitre <nicolas.pitre@linaro.org>
Date:   Thu Nov 8 19:46:07 2012 +0100

    ARM: 7573/1: idmap: use flush_cache_louis() and flush TLBs only when necessary

which added the flush_cache_louis() for the idmap tables, but prior to
that, I don't see how we were ensuring that the page tables were visible.

I haven't tested going back to a tag latency of 1 1 1 yet.  Can you
confirm whether you have this erratum workaround enabled for your
tests?

Thanks.
Sudeep Holla April 2, 2015, 5:38 p.m. UTC | #10
On 02/04/15 15:13, Russell King - ARM Linux wrote:
> On Tue, Mar 31, 2015 at 06:27:30PM +0100, Sudeep Holla wrote:
>> Not sure on that as v3.18 with DT seems to be working fine and passed
>> overnight reboot testing.
>
> Okay, that suggests there's something post v3.18 which is causing this,
> rather than it being a DT vs non-DT thing.
>

Correct. Just to be 100% sure, I reverted that non-DT removal commit
on both v3.19-rc1 and v4.0-rc6 and was able to reproduce the issue
without DT.

> An extra data point which I've just found (by enabling attempts to do
> hibernation on various test platforms) is that the Versatile Express
> appears to be incapable of taking a CPU offline.
>
> This crashes the entire system with sometimes random results.  Sometimes
> it'll appear that a spinlock has been left owned by CPU#1 which is
> offline.  Sometimes it'll silently hang.  Sometimes it'll start slowly
> dumping kernel messages from the start of the kernel's ring buffer (!),
> eg:
>
> PM: freeze of devices complete after 29.342 msecs
> PM: late freeze of devices complete after 6.398 msecs
> PM: noirq freeze of devices complete after 5.493 msecs
> Disabling non-boot CPUs ...
> __cpu_disable(1)
> __cpu_die(1)
> handle_IPI(0)
> Booting Linux on physical CPU 0x0
>
> So far, it's not managed to take a CPU successfully offline and know that
> it has.  If I disable the calls to cpu_enter_lowpower() and
> cpu_leave_lowpower(), then it appears to work.
>
> This leads me to wonder whether flush_cache_louis() works... which led me
> in turn to ARM_ERRATA_643719, which is disabled in my builds.  However,
> the CA9 tile has an r0p1 CA9, which allegedly suffers from this erratum.
>

Yes, I observed that and tested for this issue with the erratum
workaround enabled. It doesn't make a difference; I still hit the
issue.

[...]
>
> I haven't tested going back to a tag latency of 1 1 1 yet.  Can you
> confirm whether you have this erratum workaround enabled for your
> tests?
>
I have now gone back to the <1 1 1> latency to debug the issue, as
it's easier to reproduce with those latencies.

After failing badly to bisect between v3.18..v3.19-rc1, as it depends
a lot on the config you choose (a lot of changes were introduced since
it's a merge window), I started looking at the code where we hit this
issue, since it's always in __radix_tree_lookup in lib/radix-tree.c
while accessing the slots, to see if that provides any more details.

Regards,
Sudeep
Jon Medhurst (Tixy) June 14, 2016, 3:31 p.m. UTC | #11
Hi Sudeep

Over the past several days I think I've been unknowingly reproducing
many of the steps in this old discussion thread [1] about A9 Versatile
Express boot failures. It's only when I found myself looking at the L2
cache timings that I got a vague recollection and dug out this old
thread again. Was there any resolution to the issue? As far as I can
work out, the A9x4 CoreTile stopped working around Linux 3.18 (the
problem isn't 100% reproducible so it's difficult to tell).

Using "arm,tag-latency = <2 2 1>" as Russell seemed to indicate [2]
fixed things for him, also works for me. So should we update mainline
device-tree with that?
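
That is, something like the following in vexpress-v2p-ca9.dts (the
value Russell reported working: read/write latency of 2, setup of 1):

	arm,tag-latency = <2 2 1>;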

Alternatively, we could assume nobody cares about A9 as presumably Linux
has been unbootable for a year without anyone raising the issue. (The
only reason I'm looking at it is I may be making U-Boot changes for
vexpress and I wanted to test them).

But if we are going to just ignore things, I think it would be good to
delete the A9 dts, or the L2 cache entry, so other people in the future
don't waste days trying to track down the problem.

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2015-March/330860.html
[2] http://lists.infradead.org/pipermail/linux-arm-kernel/2015-May/342005.html
Russell King (Oracle) June 14, 2016, 3:52 p.m. UTC | #12
On Tue, Jun 14, 2016 at 04:31:25PM +0100, Jon Medhurst (Tixy) wrote:
> Using "arm,tag-latency = <2 2 1>" as Russell seemed to indicate [2]
> fixed things for him, also works for me. So should we update mainline
> device-tree with that?

I've proposed that several times, and there seems to be no desire to
do so.  For me, VE CT9x4 no longer boots in my nightly builds, and
my plan at the start of the year was to take it out of both the
nightly builds and the boot tests - no one within ARM seems to have
any interest in the platform.

Having a randomly failing platform due to hardware issues is not
productive to an automated boot test system, so I think we should
(a) remove it from automated testing, and (b) consider deleting
support for it from the kernel tree, as it seems there is little
interest in debugging what's happening.
Sudeep Holla June 14, 2016, 4:31 p.m. UTC | #13
Hi Tixy,

On 14/06/16 16:31, Jon Medhurst (Tixy) wrote:
> Hi Sudeep
>
> Over the past several days I think I've been unknowingly reproducing
> many of the steps in this old discussion thread [1] about A9 Versatile
> Express boot failures. It's only when I found myself looking at the L2
> cache timings that I got a vague recollection and dug out this old
> thread again. Was there any resolution to the issue? As far as I can
> work out, the A9x4 CoreTile stopped working around Linux 3.18 (the
> problem isn't 100% reproducible so it's difficult to tell).
>
> Using "arm,tag-latency = <2 2 1>" as Russell seemed to indicate [2]
> fixed things for him, also works for me. So should we update mainline
> device-tree with that?
>

That's fine by me.

> Alternatively, we could assume nobody cares about A9 as presumably Linux
> has been unbootable for a year without anyone raising the issue. (The
> only reason I'm looking at it is I may be making U-Boot changes for
> vexpress and I wanted to test them).
>

I admit I just do a boot test every release, and I seem to have
flashed the DT, so I failed to notice any issue. That's my fault.

> But if we are going to just ignore things, I think it would be good to
> delete the A9 dts, or the L2 cache entry, so other people in the future
> don't waste days trying to track down the problem.
>

I am fine either way; if people still use it as a reference, we can retain it.
Sudeep Holla June 14, 2016, 4:44 p.m. UTC | #14
Hi Russell,

On 14/06/16 16:52, Russell King - ARM Linux wrote:
> On Tue, Jun 14, 2016 at 04:31:25PM +0100, Jon Medhurst (Tixy) wrote:
>> Using "arm,tag-latency = <2 2 1>" as Russell seemed to indicate [2]
>> fixed things for him, also works for me. So should we update mainline
>> device-tree with that?
>
> I've proposed that several times, and there seems to be no desire to
> do so.

Sorry for missing that. IIRC, we didn't conclude, as <2 2 1> did fail
the continuous overnight reboot test on my setup. As I mentioned
earlier, we can change to this new value if people are able to use it
reliably.

> For me, VE CT9x4 no longer boots in my nightly builds, and
> my plan at the start of the year was to take it out of both the
> nightly builds and the boot tests - no one within ARM seems to have
> any interest in the platform.
>

It's hard to get any kind of attention from hardware guys for such an
old platform.

> Having a randomly failing platform due to hardware issues is not
> productive to an automated boot test system, so I think we should
> (a) remove it from automated testing, and (b) consider deleting
> support for it from the kernel tree, as it seems there is little
> interest in debugging what's happening.
>

Even with the higher latency, if the platform is unusable, I agree we
should remove it. If you think it's usable with the updated latency
(<2 2 1>), then we can update it.
Russell King (Oracle) June 14, 2016, 4:49 p.m. UTC | #15
On Tue, Jun 14, 2016 at 05:44:26PM +0100, Sudeep Holla wrote:
> Even with the higher latency, if the platform is unusable, I agree we
> should remove it. If you think it's usable with the updated latency
> (<2 2 1>), then we can update it.

The kernels I'm booting have that updated latency.  It used to improve
things, but for most of this year, it fails most boot attempts.  Out
of the last 21 boot attempts, all 21 attempts failed with the above
latency value.

  http://www.armlin ux.org.uk/developer/build/index.php?id=2003

(remove the space...)
Jon Medhurst (Tixy) June 15, 2016, 9:27 a.m. UTC | #16
On Tue, 2016-06-14 at 17:49 +0100, Russell King - ARM Linux wrote:
> On Tue, Jun 14, 2016 at 05:44:26PM +0100, Sudeep Holla wrote:
> > Even with the higher latency, if the platform is unusable, I agree we
> > should remove it. If you think it's usable with the updated latency
> > (<2 2 1>), then we can update it.
> 
> The kernels I'm booting have that updated latency.  It used to improve
> things, but for most of this year, it fails most boot attempts.  Out
> of the last 21 boot attempts, all 21 attempts failed with the above
> latency value.

I've done some more testing and it seems the different latency values
don't reliably fix things for me after all.

I had previously tried enabling the A9 errata workarounds that aren't
enabled for vexpress:

ARM_ERRATA_751472  (hacked to remove the wrong(?) 'depends on !ARCH_MULTIPLATFORM')
ARM_ERRATA_754327
ARM_ERRATA_764369

That didn't fix things, but I noticed this morning there are PL310
errata workarounds too that we should have enabled:

PL310_ERRATA_769419
PL310_ERRATA_727915
PL310_ERRATA_588369
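
For completeness, the .config fragment covering both sets would be
something like this (symbol names as in mainline Kconfig):

CONFIG_ARM_ERRATA_751472=y
CONFIG_ARM_ERRATA_754327=y
CONFIG_ARM_ERRATA_764369=y
CONFIG_PL310_ERRATA_588369=y
CONFIG_PL310_ERRATA_727915=y
CONFIG_PL310_ERRATA_769419=y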

I'm going to test different combinations to see if I can gather
evidence that any of these errata are causing the problem we're
hitting. Of course, errata workarounds, like cache timing tweaks, will
affect timings in the system and so could affect the triggering of our
problem without actually being the root cause.
Sudeep Holla June 15, 2016, 9:32 a.m. UTC | #17
On 15/06/16 10:27, Jon Medhurst (Tixy) wrote:
> On Tue, 2016-06-14 at 17:49 +0100, Russell King - ARM Linux wrote:
>> On Tue, Jun 14, 2016 at 05:44:26PM +0100, Sudeep Holla wrote:
>>> Even with the higher latency, if the platform is unusable, I agree we
>>> should remove it. If you think it's usable with the updated latency
>>> (<2 2 1>), then we can update it.
>>
>> The kernels I'm booting have that updated latency.  It used to improve
>> things, but for most of this year, it fails most boot attempts.  Out
>> of the last 21 boot attempts, all 21 attempts failed with the above
>> latency value.
>

[...]

>
> That didn't fix things, but I noticed this morning there are PL310
> errata workarounds too that we should have enabled:
>
> PL310_ERRATA_769419
> PL310_ERRATA_727915
> PL310_ERRATA_588369
>

I had all of the above enabled in my config, which doesn't help either.
Jon Medhurst (Tixy) June 15, 2016, 9:50 a.m. UTC | #18
On Wed, 2016-06-15 at 10:32 +0100, Sudeep Holla wrote:
> > That didn't fix things, but I noticed this morning there are PL310
> > errata workarounds too that we should have enabled:
> >
> > PL310_ERRATA_769419
> > PL310_ERRATA_727915
> > PL310_ERRATA_588369
> >
> 
> I had all of the above enabled in my config, which doesn't help either.

Thanks for letting me know, that's saved me from a tedious morning of
testing. Though did you also have the A9 errata enabled?

BTW, I noticed the PL310 Cache ID reports RTL version 3, which doesn't
appear in the list of versions in cache-l2x0.h and doesn't match the TRM
shipped on the vexpress CD, which is for version 2 (r1p0).
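
The value comes from the L2X0_CACHE_ID register at offset 0x000 of the
PL310; the RTL release is in the low six bits. Roughly how one would
read it, assuming an ioremapped l2x0_base:

	u32 id = readl_relaxed(l2x0_base + L2X0_CACHE_ID);
	pr_info("L2C part 0x%02x, RTL release 0x%02x\n",
		(id >> 6) & 0xf, id & L2X0_CACHE_ID_RTL_MASK);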

Anyway, I'll test various Linux versions with the L2 cache removed from
device-tree to see if that reliably fixes things.
Sudeep Holla June 15, 2016, 9:59 a.m. UTC | #19
On 15/06/16 10:50, Jon Medhurst (Tixy) wrote:
> On Wed, 2016-06-15 at 10:32 +0100, Sudeep Holla wrote:
>>> That didn't fix things, but I noticed this morning there are PL310
>>> errata workarounds too that we should have enabled:
>>>
>>> PL310_ERRATA_769419
>>> PL310_ERRATA_727915
>>> PL310_ERRATA_588369
>>>
>>
>> I had all the above enabled in my config which doesn't help either.
>
> Thanks for letting me know, that's saved me from a tedious morning of
> testing. Though did you also have the A9 errata enabled?
>

Yes, I tried multi_v7_defconfig, which has all the errata enabled for the A9.

> BTW, I noticed the PL310 Cache ID reports RTL version 3, which doesn't
> appear in the list of versions in cache-l2x0.h and doesn't match the TRM
> shipped on the vexpress CD, which is for version 2 (r1p0).
>

Yes, I remember having similar observations in the past. IIRC, it's
one of the earliest versions used on that core-tile (which is
generally the case with early versions of test chips).

> Anyway, I'll test various Linux versions with the L2 cache removed from
> device-tree to see if that reliably fixes things.
>

Thanks, but I would like to see what Russell thinks about this
approach before you spend more time on that.

Patch

diff --git a/arch/arm/boot/dts/vexpress-v2p-ca9.dts b/arch/arm/boot/dts/vexpress-v2p-ca9.dts
index 23662b5a5e9d..030c90c1105d 100644
--- a/arch/arm/boot/dts/vexpress-v2p-ca9.dts
+++ b/arch/arm/boot/dts/vexpress-v2p-ca9.dts
@@ -172,7 +172,7 @@ 
                 interrupts = <0 43 4>;
                 cache-level = <2>;
                 arm,data-latency = <1 1 1>;
-               arm,tag-latency = <1 1 1>;
+               arm,tag-latency = <1 2 1>;
         };

         pmu {