arm: Add Arm Erratum 773769 for Large data RAM latency.

Message ID	1389187991-26446-1-git-send-email-gautam.vivek@samsung.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org> From: Vivek Gautam <gautam.vivek@samsung.com> To: linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-samsung-soc@vger.kernel.org, linux@arm.linux.org.uk, kgene.kim@samsung.com Subject: [PATCH] arm: Add Arm Erratum 773769 for Large data RAM latency. Date: Wed, 08 Jan 2014 19:03:11 +0530 Message-id: <1389187991-26446-1-git-send-email-gautam.vivek@samsung.com> DLP-Filter: Pass Cc: will.deacon@arm.com, sboyd@codeaurora.org, Doug Anderson <dianders@chromium.org>, David Garbett <david.garbett@arm.com>, Vivek Gautam <gautam.vivek@samsung.com>, catalin.marinas@arm.com, gregory.clement@free-electrons.com, Olof Johansson <olofj@chromium.org> Precedence: list MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org

Vivek Gautam Jan. 8, 2014, 1:33 p.m. UTC

The erratum-773769 occurs on Arm Coretex-A15 (rev r2p0),
when L2 Data Ram latency is set to 4 cycles or more; or
when ACP is in use, or with L2 Data RAM slice configured.
Therefore, the effective latency as calculated in Table 7-2 of
Cotex-A15 (rev r2p0) trm should be 3 cycles or less.

On Exynos5250 based systems the effective data ram latency
is 4 cycles, since we have DATA_RAM_SETUP bit enabled (L2CTRL[5]=1b'1)
and DATA_RAM_LATENCY bits set to 0x2 (L2CTLR[2:0]=3b'010) therefore,
the effective L2 data RAM latency becomes 4 cycles.
So erratum '773769' occurs causing a corrupted L2 Cache.

This patch gives a workaround to the mentioned erratum, using below
mentioned algo:
----------------------------------------------------------------
if data RAM setup = 1
  then check if effective latency i.e (latency + setup + 1) > 3
  if 'true'
    then clear data RAM setup
      goto branch 'a'
if data RAM setup = 0
  a: then check if data RAM latency > 0x10
    if true then force data RAM latency = 0x10
----------------------------------------------------------------
so that the effective data RAM latency reduces to 3 cycles or less
and hence prevent hitting the erratum.

NOTE: The Exynos5250 based products have already been shipped, which
      makes it impossible to add the change in bootloader, so we are
      adding the required change in kernel.

Signed-off-by: Vivek Gautam <gautam.vivek@samsung.com>
Cc: Doug Anderson <dianders@chromium.org>
Cc: Olof Johansson <olofj@chromium.org>
Cc: David Garbett <david.garbett@arm.com>
---
 arch/arm/Kconfig             |   15 ++++++++
 arch/arm/mach-exynos/Kconfig |    1 +
 arch/arm/mm/proc-v7.S        |   79 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 95 insertions(+)

Will Deacon Jan. 8, 2014, 2:35 p.m. UTC | #1

On Wed, Jan 08, 2014 at 01:33:11PM +0000, Vivek Gautam wrote:
> The erratum-773769 occurs on Arm Coretex-A15 (rev r2p0),
> when L2 Data Ram latency is set to 4 cycles or more; or
> when ACP is in use, or with L2 Data RAM slice configured.
> Therefore, the effective latency as calculated in Table 7-2 of
> Cotex-A15 (rev r2p0) trm should be 3 cycles or less.
> 
> On Exynos5250 based systems the effective data ram latency
> is 4 cycles, since we have DATA_RAM_SETUP bit enabled (L2CTRL[5]=1b'1)
> and DATA_RAM_LATENCY bits set to 0x2 (L2CTLR[2:0]=3b'010) therefore,
> the effective L2 data RAM latency becomes 4 cycles.
> So erratum '773769' occurs causing a corrupted L2 Cache.
> 
> This patch gives a workaround to the mentioned erratum, using below
> mentioned algo:
> ----------------------------------------------------------------
> if data RAM setup = 1
>   then check if effective latency i.e (latency + setup + 1) > 3
>   if 'true'
>     then clear data RAM setup
>       goto branch 'a'
> if data RAM setup = 0
>   a: then check if data RAM latency > 0x10
>     if true then force data RAM latency = 0x10
> ----------------------------------------------------------------
> so that the effective data RAM latency reduces to 3 cycles or less
> and hence prevent hitting the erratum.
> 
> NOTE: The Exynos5250 based products have already been shipped, which
>       makes it impossible to add the change in bootloader, so we are
>       adding the required change in kernel.

NAK. Whilst I appreciate that you may not be able to fix your bootloader,
this isn't the right change to make in the kernel. Blindly changing memory
latencies is likely to do more harm than good for a multi-platform kernel,
even if it works for exynos 5250. The only alternative I can think of (if you
have to make a mainline kernel change) is to restrict the clock frequencies at
which the CPU is allowed to run, although that obviously requires some
investigation in order to determine how viable it is for your SoC.

Will

Doug Anderson Jan. 8, 2014, 4:21 p.m. UTC | #2

WIll,

Thanks for your comments!

On Wed, Jan 8, 2014 at 6:35 AM, Will Deacon <will.deacon@arm.com> wrote:
> On Wed, Jan 08, 2014 at 01:33:11PM +0000, Vivek Gautam wrote:
>> The erratum-773769 occurs on Arm Coretex-A15 (rev r2p0),
>> when L2 Data Ram latency is set to 4 cycles or more; or
>> when ACP is in use, or with L2 Data RAM slice configured.
>> Therefore, the effective latency as calculated in Table 7-2 of
>> Cotex-A15 (rev r2p0) trm should be 3 cycles or less.
>>
>> On Exynos5250 based systems the effective data ram latency
>> is 4 cycles, since we have DATA_RAM_SETUP bit enabled (L2CTRL[5]=1b'1)
>> and DATA_RAM_LATENCY bits set to 0x2 (L2CTLR[2:0]=3b'010) therefore,
>> the effective L2 data RAM latency becomes 4 cycles.
>> So erratum '773769' occurs causing a corrupted L2 Cache.
>>
>> This patch gives a workaround to the mentioned erratum, using below
>> mentioned algo:
>> ----------------------------------------------------------------
>> if data RAM setup = 1
>>   then check if effective latency i.e (latency + setup + 1) > 3
>>   if 'true'
>>     then clear data RAM setup
>>       goto branch 'a'
>> if data RAM setup = 0
>>   a: then check if data RAM latency > 0x10
>>     if true then force data RAM latency = 0x10
>> ----------------------------------------------------------------
>> so that the effective data RAM latency reduces to 3 cycles or less
>> and hence prevent hitting the erratum.
>>
>> NOTE: The Exynos5250 based products have already been shipped, which
>>       makes it impossible to add the change in bootloader, so we are
>>       adding the required change in kernel.
>
> NAK. Whilst I appreciate that you may not be able to fix your bootloader,
> this isn't the right change to make in the kernel. Blindly changing memory
> latencies is likely to do more harm than good for a multi-platform kernel,
> even if it works for exynos 5250. The only alternative I can think of (if you
> have to make a mainline kernel change) is to restrict the clock frequencies at
> which the CPU is allowed to run, although that obviously requires some
> investigation in order to determine how viable it is for your SoC.

OK, so humor me a little here...

I'll start off by saying that I'm totally OK if mainline doesn't want
this fixed.  If mainline is not interested in running reliably on
exynos5250-based products then there's nothing I can do about it.  It
seems unfortunate, but I'm not going to get into a shouting match
about it.

You're saying that this patch is blindly changing memory latencies.  I
don't think it is (please correct me if I'm wrong!).  This patch:

* Is guarded by a CONFIG option so the code isn't even compiled in if
you don't want EXYNOS5250 support.  I know this doesn't help
multiplatform, but it means that if you really hate the code it's easy
to disable.

* Is guarded by a runtime check of the revision number so that it
doesn't run on unaffected A15 revisions.

* Is guarded by a runtime check so it does nothing at all if the total
latency is <= 3 (AKA if boot code already picked a sane value)

...with the above guards it's pretty safe...

I will agree that there is a _potential_ that this could make things
work worse on an already broken product our there, but I would say
there's a reasonable chance that such a product doesn't exist (but
please correct me if I'm wrong).  Specifically, this patch will cause
problems in two examples that I can think of:

--

Example A: existing A15 <=r2p0 product with 773769-ignorant boot code
that could be fixed, but that needs "setup = 1"

In this case we've got boot code that's like the exynos5250 boot code
that accidentally sets the total latency to >= 4 when it would work
just fine to use a value of 3.  ...except that unlike the exynos5250
this hypthetical SoC needs to leave setup = 1.

...if such a machine were found in the wild (seems unlikely?) then
we'll need to figure out what to do.  If its boot code cannot be
updated and we want to support this product with a similar patch then
we'll need to be more dynamic.

--

Example B: existing A15 <=r2p0 product that just has to live with crashes

In this case we've got a product where we're going to just accept that
it crashes sometimes (since this is a fairly hard crash to trigger)
because it crashes even more when the total latency < 4.  In this case
we don't want to "fix" the errata because that makes things worse.

...if such a machine were find in the wild (it's possible, I guess?)
then that's a really good reason not to take this patch or to find
some way to dynamically enable / disable it.

--

Let me know what you think of the above, or if I'm misunderstanding something...

-Doug

Russell King - ARM Linux Jan. 8, 2014, 7:20 p.m. UTC | #3

On Wed, Jan 08, 2014 at 08:21:21AM -0800, Doug Anderson wrote:
> WIll,
> 
> Thanks for your comments!
> 
> On Wed, Jan 8, 2014 at 6:35 AM, Will Deacon <will.deacon@arm.com> wrote:
> > NAK. Whilst I appreciate that you may not be able to fix your bootloader,
> > this isn't the right change to make in the kernel. Blindly changing memory
> > latencies is likely to do more harm than good for a multi-platform kernel,
> > even if it works for exynos 5250. The only alternative I can think of (if you
> > have to make a mainline kernel change) is to restrict the clock frequencies at
> > which the CPU is allowed to run, although that obviously requires some
> > investigation in order to determine how viable it is for your SoC.

+1

> I'll start off by saying that I'm totally OK if mainline doesn't want
> this fixed.  If mainline is not interested in running reliably on
> exynos5250-based products then there's nothing I can do about it.  It
> seems unfortunate, but I'm not going to get into a shouting match
> about it.

No, we're saying to put the work-around in the boot loader, not the kernel.

> You're saying that this patch is blindly changing memory latencies.  I
> don't think it is (please correct me if I'm wrong!).  This patch:
> 
> * Is guarded by a CONFIG option so the code isn't even compiled in if
> you don't want EXYNOS5250 support.  I know this doesn't help
> multiplatform, but it means that if you really hate the code it's easy
> to disable.

As you identify, this is completely useless in the multiplatform kernel
case, so you can completely remove this from the argument _for_ this
change - it carries no weight what so ever.

> * Is guarded by a runtime check of the revision number so that it
> doesn't run on unaffected A15 revisions.

So what if the A15 reports an affected revision, but the SoC integrator
has fixed the problem - why should they have to put up with the work-
around being applied on their silicon?

> * Is guarded by a runtime check so it does nothing at all if the total
> latency is <= 3 (AKA if boot code already picked a sane value)

Right, so the effect of the above is that with a multiplatform kernel
including Exynos 5 support, all platforms with an A15 r2p0 or earlier
have this work-around applied even if it was actually fixed in the
silicon - which there's no way to know from the software perspective.

We've been through these arguments many times, you're not the first to
raise it, and we've decided upon the policy.  We want as _few_ work-
arounds in the kernel as possible, because applying the work-arounds
is very problematical with the mixture of secure and non-secure booting.

Maybe Will can check, but I believe that the L2 cache control register
is one of those which is not writable from non-secure mode, and thus
will crash the kernel at boot if a write is attempted.

Doug Anderson Jan. 8, 2014, 7:43 p.m. UTC | #4

Hi,

On Wed, Jan 8, 2014 at 11:20 AM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Wed, Jan 08, 2014 at 08:21:21AM -0800, Doug Anderson wrote:
>> WIll,
>>
>> Thanks for your comments!
>>
>> On Wed, Jan 8, 2014 at 6:35 AM, Will Deacon <will.deacon@arm.com> wrote:
>> > NAK. Whilst I appreciate that you may not be able to fix your bootloader,
>> > this isn't the right change to make in the kernel. Blindly changing memory
>> > latencies is likely to do more harm than good for a multi-platform kernel,
>> > even if it works for exynos 5250. The only alternative I can think of (if you
>> > have to make a mainline kernel change) is to restrict the clock frequencies at
>> > which the CPU is allowed to run, although that obviously requires some
>> > investigation in order to determine how viable it is for your SoC.
>
> +1
>
>> I'll start off by saying that I'm totally OK if mainline doesn't want
>> this fixed.  If mainline is not interested in running reliably on
>> exynos5250-based products then there's nothing I can do about it.  It
>> seems unfortunate, but I'm not going to get into a shouting match
>> about it.
>
> No, we're saying to put the work-around in the boot loader, not the kernel.

Unfortunately the resume path of the firmware runs from Read Only
firmware code (yes, it sucks), so it's not totally trivial to fix.  It
would be possible for someone to unscrew their write protect switch
and update their RO firmware, but that doesn't help the average user.

Olof came up with the idea that you could update the RW firmware
(affects initial boot) and then cache away the value and restore it in
the kernel after resume.  That would still require a kernel patch but
perhaps a less objectionable one.  ...of course if writing this
register is a problem in secure mode then maybe that patch would be
NAKed anyway.

>> * Is guarded by a runtime check of the revision number so that it
>> doesn't run on unaffected A15 revisions.
>
> So what if the A15 reports an affected revision, but the SoC integrator
> has fixed the problem - why should they have to put up with the work-
> around being applied on their silicon?

That's one I didn't think about, you're right.  ...but we're really
getting into hypothetical situations here.  Are there any r2p0
products that have such a fix (and thus require a latency of >= 4)?

> We've been through these arguments many times, you're not the first to
> raise it, and we've decided upon the policy.  We want as _few_ work-
> arounds in the kernel as possible, because applying the work-arounds
> is very problematical with the mixture of secure and non-secure booting.

OK, fair enough.  If we want no workaround here then we can drop this patch.

I'd guess that the way forward is:

* Land kernel workaround locally in Chromium tree

* I'll try to land FW change locally in at least one Chromium branch.
If we were to get a new RO build ever (seems unlikely at this point)
it would be fixed for upstream kernels.  If we were to get a new RW
build (seems unlikely, but at least less unlikely) it would be fixed
if someone landed a kernel patch to save/restore this register across
suspend/resume.

* If Joe Upstream wants to run an upstream kernel on some type of
exynos5250 product (Samsung ARM Chromebook, HP Chromebook 11, Nexus 10
are the ones I know of) then he will deal with the small number of
crashes or figure out a solution.

-Doug

Nicolas Pitre Jan. 8, 2014, 8:58 p.m. UTC | #5

On Wed, 8 Jan 2014, Doug Anderson wrote:

> On Wed, Jan 8, 2014 at 11:20 AM, Russell King - ARM Linux
> <linux@arm.linux.org.uk> wrote:
> > We've been through these arguments many times, you're not the first to
> > raise it, and we've decided upon the policy.  We want as _few_ work-
> > arounds in the kernel as possible, because applying the work-arounds
> > is very problematical with the mixture of secure and non-secure booting.
> 
> OK, fair enough.  If we want no workaround here then we can drop this patch.
> 
> 
> I'd guess that the way forward is:
> 
> * Land kernel workaround locally in Chromium tree
> 
> * I'll try to land FW change locally in at least one Chromium branch.
> If we were to get a new RO build ever (seems unlikely at this point)
> it would be fixed for upstream kernels.  If we were to get a new RW
> build (seems unlikely, but at least less unlikely) it would be fixed
> if someone landed a kernel patch to save/restore this register across
> suspend/resume.
> 
> * If Joe Upstream wants to run an upstream kernel on some type of
> exynos5250 product (Samsung ARM Chromebook, HP Chromebook 11, Nexus 10
> are the ones I know of) then he will deal with the small number of
> crashes or figure out a solution.

At some point you have to realize that what you're proposing is not 
viable when taking into account the bigger picture.  And your FW 
architecture is to blame.  If you were running Windows instead of Linux 
I bet you'd have found a way to fix things differently than asking 
Microsoft to add a special quirk in their code.  In the PC world you are 
required to perform a BIOS update instead.

So you really need to provision the ability to update at least a portion 
of the firmware that is not read-only.  That may be a third-stage 
bootloader or the like which purpose is only to install workarounds such 
as this before executing the kernel and which doesn't need to be updated 
with the kernel.  And in theory you should be able to do that on top of 
your current RO firmware.

Errata workarounds are often machine specific and they can be applied 
only in special conditions the kernel might not safely know about until 
it is too late.

Nicolas

Russell King - ARM Linux Jan. 8, 2014, 9:02 p.m. UTC | #6

On Wed, Jan 08, 2014 at 11:43:29AM -0800, Doug Anderson wrote:
> Olof came up with the idea that you could update the RW firmware
> (affects initial boot) and then cache away the value and restore it in
> the kernel after resume.  That would still require a kernel patch but
> perhaps a less objectionable one.  ...of course if writing this
> register is a problem in secure mode then maybe that patch would be
> NAKed anyway.

It's not a problem in secure mode, since secure mode will have access
to the register.  It's the non-secure mode where various registers
either ignore writes, or they trigger an exception that are the problem.

Consider what would happen if an exception were to be triggered when no
exception handlers were installed (eg, because the MMU is not enabled.)

> That's one I didn't think about, you're right.  ...but we're really
> getting into hypothetical situations here.  Are there any r2p0
> products that have such a fix (and thus require a latency of >= 4)?

Really, we can't know, because this kind of information is dependent on
the SoC, and probably such customisations are only knowable via NDAs.

However, I do know of some A9 CPUs where exactly this kind of thing has
happened - where various selected silicon bugs have been fixed but the
revision still reports the same as one with all those bugs.

Nicolas Pitre Jan. 8, 2014, 9:08 p.m. UTC | #7

On Wed, 8 Jan 2014, Doug Anderson wrote:

> Hi,
> 
> On Wed, Jan 8, 2014 at 11:20 AM, Russell King - ARM Linux
> > No, we're saying to put the work-around in the boot loader, not the kernel.
> 
> Unfortunately the resume path of the firmware runs from Read Only 
> firmware code (yes, it sucks), so it's not totally trivial to fix. It 
> would be possible for someone to unscrew their write protect switch 
> and update their RO firmware, but that doesn't help the average user.

[...]

> I'd guess that the way forward is:
> 
> * Land kernel workaround locally in Chromium tree
> 
> * I'll try to land FW change locally in at least one Chromium branch.
> If we were to get a new RO build ever (seems unlikely at this point)
> it would be fixed for upstream kernels.  If we were to get a new RW
> build (seems unlikely, but at least less unlikely) it would be fixed
> if someone landed a kernel patch to save/restore this register across
> suspend/resume.
> 
> * If Joe Upstream wants to run an upstream kernel on some type of
> exynos5250 product (Samsung ARM Chromebook, HP Chromebook 11, Nexus 10
> are the ones I know of) then he will deal with the small number of
> crashes or figure out a solution.

If Joe Upstream wants to run an upstream kernel, doesn't he have to 
unscrew his write protect switch first, at which point the RO firmware 
can be updated as well?


Nicolas

Doug Anderson Jan. 8, 2014, 9:32 p.m. UTC | #8

Nicolas,

On Wed, Jan 8, 2014 at 12:58 PM, Nicolas Pitre <nico@fluxnic.net> wrote:
> On Wed, 8 Jan 2014, Doug Anderson wrote:
>
>> On Wed, Jan 8, 2014 at 11:20 AM, Russell King - ARM Linux
>> <linux@arm.linux.org.uk> wrote:
>> > We've been through these arguments many times, you're not the first to
>> > raise it, and we've decided upon the policy.  We want as _few_ work-
>> > arounds in the kernel as possible, because applying the work-arounds
>> > is very problematical with the mixture of secure and non-secure booting.
>>
>> OK, fair enough.  If we want no workaround here then we can drop this patch.
>>
>>
>> I'd guess that the way forward is:
>>
>> * Land kernel workaround locally in Chromium tree
>>
>> * I'll try to land FW change locally in at least one Chromium branch.
>> If we were to get a new RO build ever (seems unlikely at this point)
>> it would be fixed for upstream kernels.  If we were to get a new RW
>> build (seems unlikely, but at least less unlikely) it would be fixed
>> if someone landed a kernel patch to save/restore this register across
>> suspend/resume.
>>
>> * If Joe Upstream wants to run an upstream kernel on some type of
>> exynos5250 product (Samsung ARM Chromebook, HP Chromebook 11, Nexus 10
>> are the ones I know of) then he will deal with the small number of
>> crashes or figure out a solution.
>
> At some point you have to realize that what you're proposing is not
> viable when taking into account the bigger picture.  And your FW
> architecture is to blame.  If you were running Windows instead of Linux
> I bet you'd have found a way to fix things differently than asking
> Microsoft to add a special quirk in their code.  In the PC world you are
> required to perform a BIOS update instead.

Yup, I can believe that.

...I'm merely trying to report reality.  Firmware updates (even RW
ones) require an extra level of scrutiny / testing and a chunk of
in-house resources.  Those resources get allocated if there is no
other choice, but I can't force a lot of people to spend a lot of time
to approve a RW firmware update (actually approving >= 3 RW firmware
updates since there are at least three different 5250 variants) when
there is a simple and low risk kernel fix for it.  I could try, but
I'll get slapped down / laughed at.

If this were the Windows/x86 world and we had no choice the decision
would go a different way.  ...but we're not in the Windows/x86 world.


> So you really need to provision the ability to update at least a portion
> of the firmware that is not read-only.  That may be a third-stage
> bootloader or the like which purpose is only to install workarounds such
> as this before executing the kernel and which doesn't need to be updated
> with the kernel.  And in theory you should be able to do that on top of
> your current RO firmware.

The fact that we have too much code in RO is a known issue and people
are definitely working on that.  It has already bitten us.  It's
unlikely that someone will try retrofit something to work on existing
devices, though.

-Doug

Doug Anderson Jan. 8, 2014, 9:33 p.m. UTC | #9

Russell,

On Wed, Jan 8, 2014 at 1:02 PM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Wed, Jan 08, 2014 at 11:43:29AM -0800, Doug Anderson wrote:
>> Olof came up with the idea that you could update the RW firmware
>> (affects initial boot) and then cache away the value and restore it in
>> the kernel after resume.  That would still require a kernel patch but
>> perhaps a less objectionable one.  ...of course if writing this
>> register is a problem in secure mode then maybe that patch would be
>> NAKed anyway.
>
> It's not a problem in secure mode, since secure mode will have access
> to the register.  It's the non-secure mode where various registers
> either ignore writes, or they trigger an exception that are the problem.
>
> Consider what would happen if an exception were to be triggered when no
> exception handlers were installed (eg, because the MMU is not enabled.)

OK, good to know.  ...so if someone wanted to code up this patch and
send upstream they would be able to.  :)

-Doug

Doug Anderson Jan. 8, 2014, 9:35 p.m. UTC | #10

Nicolas,

On Wed, Jan 8, 2014 at 1:08 PM, Nicolas Pitre <nico@fluxnic.net> wrote:
> On Wed, 8 Jan 2014, Doug Anderson wrote:
>
>> Hi,
>>
>> On Wed, Jan 8, 2014 at 11:20 AM, Russell King - ARM Linux
>> > No, we're saying to put the work-around in the boot loader, not the kernel.
>>
>> Unfortunately the resume path of the firmware runs from Read Only
>> firmware code (yes, it sucks), so it's not totally trivial to fix. It
>> would be possible for someone to unscrew their write protect switch
>> and update their RO firmware, but that doesn't help the average user.
>
> [...]
>
>> I'd guess that the way forward is:
>>
>> * Land kernel workaround locally in Chromium tree
>>
>> * I'll try to land FW change locally in at least one Chromium branch.
>> If we were to get a new RO build ever (seems unlikely at this point)
>> it would be fixed for upstream kernels.  If we were to get a new RW
>> build (seems unlikely, but at least less unlikely) it would be fixed
>> if someone landed a kernel patch to save/restore this register across
>> suspend/resume.
>>
>> * If Joe Upstream wants to run an upstream kernel on some type of
>> exynos5250 product (Samsung ARM Chromebook, HP Chromebook 11, Nexus 10
>> are the ones I know of) then he will deal with the small number of
>> crashes or figure out a solution.
>
> If Joe Upstream wants to run an upstream kernel, doesn't he have to
> unscrew his write protect switch first, at which point the RO firmware
> can be updated as well?

Actually, no.  You can move your device into dev mode and run any
kernel you want.  You'll get an annoying "you're in dev mode" screen
at every bootup, but otherwise it works just fine.

Going into dev mode requires some special keystrokes at bootup and a
wipe of your hard disk but no screwdrivers.


-Doug

Tomasz Figa Jan. 10, 2014, 5:09 p.m. UTC | #11

Hi,

On 08.01.2014 17:21, Doug Anderson wrote:
> WIll,
>
> Thanks for your comments!
>
> On Wed, Jan 8, 2014 at 6:35 AM, Will Deacon <will.deacon@arm.com> wrote:
>> On Wed, Jan 08, 2014 at 01:33:11PM +0000, Vivek Gautam wrote:
>>> The erratum-773769 occurs on Arm Coretex-A15 (rev r2p0),
>>> when L2 Data Ram latency is set to 4 cycles or more; or
>>> when ACP is in use, or with L2 Data RAM slice configured.
>>> Therefore, the effective latency as calculated in Table 7-2 of
>>> Cotex-A15 (rev r2p0) trm should be 3 cycles or less.
>>>
>>> On Exynos5250 based systems the effective data ram latency
>>> is 4 cycles, since we have DATA_RAM_SETUP bit enabled (L2CTRL[5]=1b'1)
>>> and DATA_RAM_LATENCY bits set to 0x2 (L2CTLR[2:0]=3b'010) therefore,
>>> the effective L2 data RAM latency becomes 4 cycles.
>>> So erratum '773769' occurs causing a corrupted L2 Cache.
>>>
>>> This patch gives a workaround to the mentioned erratum, using below
>>> mentioned algo:
>>> ----------------------------------------------------------------
>>> if data RAM setup = 1
>>>    then check if effective latency i.e (latency + setup + 1) > 3
>>>    if 'true'
>>>      then clear data RAM setup
>>>        goto branch 'a'
>>> if data RAM setup = 0
>>>    a: then check if data RAM latency > 0x10
>>>      if true then force data RAM latency = 0x10
>>> ----------------------------------------------------------------
>>> so that the effective data RAM latency reduces to 3 cycles or less
>>> and hence prevent hitting the erratum.
>>>
>>> NOTE: The Exynos5250 based products have already been shipped, which
>>>        makes it impossible to add the change in bootloader, so we are
>>>        adding the required change in kernel.
>>
>> NAK. Whilst I appreciate that you may not be able to fix your bootloader,
>> this isn't the right change to make in the kernel. Blindly changing memory
>> latencies is likely to do more harm than good for a multi-platform kernel,
>> even if it works for exynos 5250. The only alternative I can think of (if you
>> have to make a mainline kernel change) is to restrict the clock frequencies at
>> which the CPU is allowed to run, although that obviously requires some
>> investigation in order to determine how viable it is for your SoC.
>
> OK, so humor me a little here...
>
> I'll start off by saying that I'm totally OK if mainline doesn't want
> this fixed.  If mainline is not interested in running reliably on
> exynos5250-based products then there's nothing I can do about it.  It
> seems unfortunate, but I'm not going to get into a shouting match
> about it.
>
> You're saying that this patch is blindly changing memory latencies.  I
> don't think it is (please correct me if I'm wrong!).  This patch:
>
> * Is guarded by a CONFIG option so the code isn't even compiled in if
> you don't want EXYNOS5250 support.  I know this doesn't help
> multiplatform, but it means that if you really hate the code it's easy
> to disable.
>
> * Is guarded by a runtime check of the revision number so that it
> doesn't run on unaffected A15 revisions.
>
> * Is guarded by a runtime check so it does nothing at all if the total
> latency is <= 3 (AKA if boot code already picked a sane value)
>
> ...with the above guards it's pretty safe...
>
> I will agree that there is a _potential_ that this could make things
> work worse on an already broken product our there, but I would say
> there's a reasonable chance that such a product doesn't exist (but
> please correct me if I'm wrong).  Specifically, this patch will cause
> problems in two examples that I can think of:
>
> --
>
> Example A: existing A15 <=r2p0 product with 773769-ignorant boot code
> that could be fixed, but that needs "setup = 1"
>
> In this case we've got boot code that's like the exynos5250 boot code
> that accidentally sets the total latency to >= 4 when it would work
> just fine to use a value of 3.  ...except that unlike the exynos5250
> this hypthetical SoC needs to leave setup = 1.
>
> ...if such a machine were found in the wild (seems unlikely?) then
> we'll need to figure out what to do.  If its boot code cannot be
> updated and we want to support this product with a similar patch then
> we'll need to be more dynamic.
>
> --
>
> Example B: existing A15 <=r2p0 product that just has to live with crashes
>
> In this case we've got a product where we're going to just accept that
> it crashes sometimes (since this is a fairly hard crash to trigger)
> because it crashes even more when the total latency < 4.  In this case
> we don't want to "fix" the errata because that makes things worse.
>
> ...if such a machine were find in the wild (it's possible, I guess?)
> then that's a really good reason not to take this patch or to find
> some way to dynamically enable / disable it.
>
> --
>
> Let me know what you think of the above, or if I'm misunderstanding something...

It's hard to disagree with any of the opinions presented here. We want 
to support Exynos5250-based hardware, but at the same time we don't want 
to regress any hardware unaffected by mentioned issue...

Let me propose you a solution that comes to my mind (as discussed with 
ojn on #armlinux):

To work around the erratum we basically need to ensure that two things 
are done:

  1) At boot-up the L2 controller configured by R/O firmware need to
     be reconfigured to sane values.

  2) At resume from sleep mode the same process as in 1) must be done.

As for the first one, it might be handled either by R/W firmware of the 
board or some small loader code glued in front of zImage. Second one can 
be implemented in SoC-specific resume code 
(arch/arm/plat-samsung/s5p-sleep.S), as it's currently done with L2X0 on 
Exynos4.

Of course this is scattering the workaround over multiple code bases, 
but this way the workaround is executed only on affected platforms 
without hurting the unaffected ones. What do you think?

Best regards,
Tomasz

Sander Jan. 29, 2014, 12:26 p.m. UTC | #12

Doug Anderson wrote (ao):
> On Wed, Jan 8, 2014 at 1:08 PM, Nicolas Pitre <nico@fluxnic.net> wrote:
> > On Wed, 8 Jan 2014, Doug Anderson wrote:
> >> * If Joe Upstream wants to run an upstream kernel on some type of
> >> exynos5250 product (Samsung ARM Chromebook, HP Chromebook 11, Nexus 10
> >> are the ones I know of) then he will deal with the small number of
> >> crashes or figure out a solution.
> >
> > If Joe Upstream wants to run an upstream kernel, doesn't he have to
> > unscrew his write protect switch first, at which point the RO firmware
> > can be updated as well?
> 
> Actually, no.  You can move your device into dev mode and run any
> kernel you want.  You'll get an annoying "you're in dev mode" screen
> at every bootup, but otherwise it works just fine.
> 
> Going into dev mode requires some special keystrokes at bootup and a
> wipe of your hard disk but no screwdrivers.

And there is http://www.arndaleboard.org/ where you just put an upstream
kernel on sd and boot.

	Sander

arm: Add Arm Erratum 773769 for Large data RAM latency.

Commit Message

Comments

Patch