diff mbox

arm: Add Arm Erratum 773769 for Large data RAM latency.

Message ID 1389187991-26446-1-git-send-email-gautam.vivek@samsung.com (mailing list archive)
State New, archived
Headers show

Commit Message

Vivek Gautam Jan. 8, 2014, 1:33 p.m. UTC
The erratum-773769 occurs on Arm Coretex-A15 (rev r2p0),
when L2 Data Ram latency is set to 4 cycles or more; or
when ACP is in use, or with L2 Data RAM slice configured.
Therefore, the effective latency as calculated in Table 7-2 of
Cotex-A15 (rev r2p0) trm should be 3 cycles or less.

On Exynos5250 based systems the effective data ram latency
is 4 cycles, since we have DATA_RAM_SETUP bit enabled (L2CTRL[5]=1b'1)
and DATA_RAM_LATENCY bits set to 0x2 (L2CTLR[2:0]=3b'010) therefore,
the effective L2 data RAM latency becomes 4 cycles.
So erratum '773769' occurs causing a corrupted L2 Cache.

This patch gives a workaround to the mentioned erratum, using below
mentioned algo:
----------------------------------------------------------------
if data RAM setup = 1
  then check if effective latency i.e (latency + setup + 1) > 3
  if 'true'
    then clear data RAM setup
      goto branch 'a'
if data RAM setup = 0
  a: then check if data RAM latency > 0x10
    if true then force data RAM latency = 0x10
----------------------------------------------------------------
so that the effective data RAM latency reduces to 3 cycles or less
and hence prevent hitting the erratum.

NOTE: The Exynos5250 based products have already been shipped, which
      makes it impossible to add the change in bootloader, so we are
      adding the required change in kernel.

Signed-off-by: Vivek Gautam <gautam.vivek@samsung.com>
Cc: Doug Anderson <dianders@chromium.org>
Cc: Olof Johansson <olofj@chromium.org>
Cc: David Garbett <david.garbett@arm.com>
---
 arch/arm/Kconfig             |   15 ++++++++
 arch/arm/mach-exynos/Kconfig |    1 +
 arch/arm/mm/proc-v7.S        |   79 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 95 insertions(+)

Comments

Will Deacon Jan. 8, 2014, 2:35 p.m. UTC | #1
On Wed, Jan 08, 2014 at 01:33:11PM +0000, Vivek Gautam wrote:
> The erratum-773769 occurs on Arm Coretex-A15 (rev r2p0),
> when L2 Data Ram latency is set to 4 cycles or more; or
> when ACP is in use, or with L2 Data RAM slice configured.
> Therefore, the effective latency as calculated in Table 7-2 of
> Cotex-A15 (rev r2p0) trm should be 3 cycles or less.
> 
> On Exynos5250 based systems the effective data ram latency
> is 4 cycles, since we have DATA_RAM_SETUP bit enabled (L2CTRL[5]=1b'1)
> and DATA_RAM_LATENCY bits set to 0x2 (L2CTLR[2:0]=3b'010) therefore,
> the effective L2 data RAM latency becomes 4 cycles.
> So erratum '773769' occurs causing a corrupted L2 Cache.
> 
> This patch gives a workaround to the mentioned erratum, using below
> mentioned algo:
> ----------------------------------------------------------------
> if data RAM setup = 1
>   then check if effective latency i.e (latency + setup + 1) > 3
>   if 'true'
>     then clear data RAM setup
>       goto branch 'a'
> if data RAM setup = 0
>   a: then check if data RAM latency > 0x10
>     if true then force data RAM latency = 0x10
> ----------------------------------------------------------------
> so that the effective data RAM latency reduces to 3 cycles or less
> and hence prevent hitting the erratum.
> 
> NOTE: The Exynos5250 based products have already been shipped, which
>       makes it impossible to add the change in bootloader, so we are
>       adding the required change in kernel.

NAK. Whilst I appreciate that you may not be able to fix your bootloader,
this isn't the right change to make in the kernel. Blindly changing memory
latencies is likely to do more harm than good for a multi-platform kernel,
even if it works for exynos 5250. The only alternative I can think of (if you
have to make a mainline kernel change) is to restrict the clock frequencies at
which the CPU is allowed to run, although that obviously requires some
investigation in order to determine how viable it is for your SoC.

Will
Doug Anderson Jan. 8, 2014, 4:21 p.m. UTC | #2
WIll,

Thanks for your comments!

On Wed, Jan 8, 2014 at 6:35 AM, Will Deacon <will.deacon@arm.com> wrote:
> On Wed, Jan 08, 2014 at 01:33:11PM +0000, Vivek Gautam wrote:
>> The erratum-773769 occurs on Arm Coretex-A15 (rev r2p0),
>> when L2 Data Ram latency is set to 4 cycles or more; or
>> when ACP is in use, or with L2 Data RAM slice configured.
>> Therefore, the effective latency as calculated in Table 7-2 of
>> Cotex-A15 (rev r2p0) trm should be 3 cycles or less.
>>
>> On Exynos5250 based systems the effective data ram latency
>> is 4 cycles, since we have DATA_RAM_SETUP bit enabled (L2CTRL[5]=1b'1)
>> and DATA_RAM_LATENCY bits set to 0x2 (L2CTLR[2:0]=3b'010) therefore,
>> the effective L2 data RAM latency becomes 4 cycles.
>> So erratum '773769' occurs causing a corrupted L2 Cache.
>>
>> This patch gives a workaround to the mentioned erratum, using below
>> mentioned algo:
>> ----------------------------------------------------------------
>> if data RAM setup = 1
>>   then check if effective latency i.e (latency + setup + 1) > 3
>>   if 'true'
>>     then clear data RAM setup
>>       goto branch 'a'
>> if data RAM setup = 0
>>   a: then check if data RAM latency > 0x10
>>     if true then force data RAM latency = 0x10
>> ----------------------------------------------------------------
>> so that the effective data RAM latency reduces to 3 cycles or less
>> and hence prevent hitting the erratum.
>>
>> NOTE: The Exynos5250 based products have already been shipped, which
>>       makes it impossible to add the change in bootloader, so we are
>>       adding the required change in kernel.
>
> NAK. Whilst I appreciate that you may not be able to fix your bootloader,
> this isn't the right change to make in the kernel. Blindly changing memory
> latencies is likely to do more harm than good for a multi-platform kernel,
> even if it works for exynos 5250. The only alternative I can think of (if you
> have to make a mainline kernel change) is to restrict the clock frequencies at
> which the CPU is allowed to run, although that obviously requires some
> investigation in order to determine how viable it is for your SoC.

OK, so humor me a little here...

I'll start off by saying that I'm totally OK if mainline doesn't want
this fixed.  If mainline is not interested in running reliably on
exynos5250-based products then there's nothing I can do about it.  It
seems unfortunate, but I'm not going to get into a shouting match
about it.

You're saying that this patch is blindly changing memory latencies.  I
don't think it is (please correct me if I'm wrong!).  This patch:

* Is guarded by a CONFIG option so the code isn't even compiled in if
you don't want EXYNOS5250 support.  I know this doesn't help
multiplatform, but it means that if you really hate the code it's easy
to disable.

* Is guarded by a runtime check of the revision number so that it
doesn't run on unaffected A15 revisions.

* Is guarded by a runtime check so it does nothing at all if the total
latency is <= 3 (AKA if boot code already picked a sane value)

...with the above guards it's pretty safe...

I will agree that there is a _potential_ that this could make things
work worse on an already broken product our there, but I would say
there's a reasonable chance that such a product doesn't exist (but
please correct me if I'm wrong).  Specifically, this patch will cause
problems in two examples that I can think of:

--

Example A: existing A15 <=r2p0 product with 773769-ignorant boot code
that could be fixed, but that needs "setup = 1"

In this case we've got boot code that's like the exynos5250 boot code
that accidentally sets the total latency to >= 4 when it would work
just fine to use a value of 3.  ...except that unlike the exynos5250
this hypthetical SoC needs to leave setup = 1.

...if such a machine were found in the wild (seems unlikely?) then
we'll need to figure out what to do.  If its boot code cannot be
updated and we want to support this product with a similar patch then
we'll need to be more dynamic.

--

Example B: existing A15 <=r2p0 product that just has to live with crashes

In this case we've got a product where we're going to just accept that
it crashes sometimes (since this is a fairly hard crash to trigger)
because it crashes even more when the total latency < 4.  In this case
we don't want to "fix" the errata because that makes things worse.

...if such a machine were find in the wild (it's possible, I guess?)
then that's a really good reason not to take this patch or to find
some way to dynamically enable / disable it.

--

Let me know what you think of the above, or if I'm misunderstanding something...

-Doug
Russell King - ARM Linux Jan. 8, 2014, 7:20 p.m. UTC | #3
On Wed, Jan 08, 2014 at 08:21:21AM -0800, Doug Anderson wrote:
> WIll,
> 
> Thanks for your comments!
> 
> On Wed, Jan 8, 2014 at 6:35 AM, Will Deacon <will.deacon@arm.com> wrote:
> > NAK. Whilst I appreciate that you may not be able to fix your bootloader,
> > this isn't the right change to make in the kernel. Blindly changing memory
> > latencies is likely to do more harm than good for a multi-platform kernel,
> > even if it works for exynos 5250. The only alternative I can think of (if you
> > have to make a mainline kernel change) is to restrict the clock frequencies at
> > which the CPU is allowed to run, although that obviously requires some
> > investigation in order to determine how viable it is for your SoC.

+1

> I'll start off by saying that I'm totally OK if mainline doesn't want
> this fixed.  If mainline is not interested in running reliably on
> exynos5250-based products then there's nothing I can do about it.  It
> seems unfortunate, but I'm not going to get into a shouting match
> about it.

No, we're saying to put the work-around in the boot loader, not the kernel.

> You're saying that this patch is blindly changing memory latencies.  I
> don't think it is (please correct me if I'm wrong!).  This patch:
> 
> * Is guarded by a CONFIG option so the code isn't even compiled in if
> you don't want EXYNOS5250 support.  I know this doesn't help
> multiplatform, but it means that if you really hate the code it's easy
> to disable.

As you identify, this is completely useless in the multiplatform kernel
case, so you can completely remove this from the argument _for_ this
change - it carries no weight what so ever.

> * Is guarded by a runtime check of the revision number so that it
> doesn't run on unaffected A15 revisions.

So what if the A15 reports an affected revision, but the SoC integrator
has fixed the problem - why should they have to put up with the work-
around being applied on their silicon?

> * Is guarded by a runtime check so it does nothing at all if the total
> latency is <= 3 (AKA if boot code already picked a sane value)

Right, so the effect of the above is that with a multiplatform kernel
including Exynos 5 support, all platforms with an A15 r2p0 or earlier
have this work-around applied even if it was actually fixed in the
silicon - which there's no way to know from the software perspective.

We've been through these arguments many times, you're not the first to
raise it, and we've decided upon the policy.  We want as _few_ work-
arounds in the kernel as possible, because applying the work-arounds
is very problematical with the mixture of secure and non-secure booting.

Maybe Will can check, but I believe that the L2 cache control register
is one of those which is not writable from non-secure mode, and thus
will crash the kernel at boot if a write is attempted.
Doug Anderson Jan. 8, 2014, 7:43 p.m. UTC | #4
Hi,

On Wed, Jan 8, 2014 at 11:20 AM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Wed, Jan 08, 2014 at 08:21:21AM -0800, Doug Anderson wrote:
>> WIll,
>>
>> Thanks for your comments!
>>
>> On Wed, Jan 8, 2014 at 6:35 AM, Will Deacon <will.deacon@arm.com> wrote:
>> > NAK. Whilst I appreciate that you may not be able to fix your bootloader,
>> > this isn't the right change to make in the kernel. Blindly changing memory
>> > latencies is likely to do more harm than good for a multi-platform kernel,
>> > even if it works for exynos 5250. The only alternative I can think of (if you
>> > have to make a mainline kernel change) is to restrict the clock frequencies at
>> > which the CPU is allowed to run, although that obviously requires some
>> > investigation in order to determine how viable it is for your SoC.
>
> +1
>
>> I'll start off by saying that I'm totally OK if mainline doesn't want
>> this fixed.  If mainline is not interested in running reliably on
>> exynos5250-based products then there's nothing I can do about it.  It
>> seems unfortunate, but I'm not going to get into a shouting match
>> about it.
>
> No, we're saying to put the work-around in the boot loader, not the kernel.

Unfortunately the resume path of the firmware runs from Read Only
firmware code (yes, it sucks), so it's not totally trivial to fix.  It
would be possible for someone to unscrew their write protect switch
and update their RO firmware, but that doesn't help the average user.

Olof came up with the idea that you could update the RW firmware
(affects initial boot) and then cache away the value and restore it in
the kernel after resume.  That would still require a kernel patch but
perhaps a less objectionable one.  ...of course if writing this
register is a problem in secure mode then maybe that patch would be
NAKed anyway.


>> * Is guarded by a runtime check of the revision number so that it
>> doesn't run on unaffected A15 revisions.
>
> So what if the A15 reports an affected revision, but the SoC integrator
> has fixed the problem - why should they have to put up with the work-
> around being applied on their silicon?

That's one I didn't think about, you're right.  ...but we're really
getting into hypothetical situations here.  Are there any r2p0
products that have such a fix (and thus require a latency of >= 4)?


> We've been through these arguments many times, you're not the first to
> raise it, and we've decided upon the policy.  We want as _few_ work-
> arounds in the kernel as possible, because applying the work-arounds
> is very problematical with the mixture of secure and non-secure booting.

OK, fair enough.  If we want no workaround here then we can drop this patch.


I'd guess that the way forward is:

* Land kernel workaround locally in Chromium tree

* I'll try to land FW change locally in at least one Chromium branch.
If we were to get a new RO build ever (seems unlikely at this point)
it would be fixed for upstream kernels.  If we were to get a new RW
build (seems unlikely, but at least less unlikely) it would be fixed
if someone landed a kernel patch to save/restore this register across
suspend/resume.

* If Joe Upstream wants to run an upstream kernel on some type of
exynos5250 product (Samsung ARM Chromebook, HP Chromebook 11, Nexus 10
are the ones I know of) then he will deal with the small number of
crashes or figure out a solution.


-Doug
Nicolas Pitre Jan. 8, 2014, 8:58 p.m. UTC | #5
On Wed, 8 Jan 2014, Doug Anderson wrote:

> On Wed, Jan 8, 2014 at 11:20 AM, Russell King - ARM Linux
> <linux@arm.linux.org.uk> wrote:
> > We've been through these arguments many times, you're not the first to
> > raise it, and we've decided upon the policy.  We want as _few_ work-
> > arounds in the kernel as possible, because applying the work-arounds
> > is very problematical with the mixture of secure and non-secure booting.
> 
> OK, fair enough.  If we want no workaround here then we can drop this patch.
> 
> 
> I'd guess that the way forward is:
> 
> * Land kernel workaround locally in Chromium tree
> 
> * I'll try to land FW change locally in at least one Chromium branch.
> If we were to get a new RO build ever (seems unlikely at this point)
> it would be fixed for upstream kernels.  If we were to get a new RW
> build (seems unlikely, but at least less unlikely) it would be fixed
> if someone landed a kernel patch to save/restore this register across
> suspend/resume.
> 
> * If Joe Upstream wants to run an upstream kernel on some type of
> exynos5250 product (Samsung ARM Chromebook, HP Chromebook 11, Nexus 10
> are the ones I know of) then he will deal with the small number of
> crashes or figure out a solution.

At some point you have to realize that what you're proposing is not 
viable when taking into account the bigger picture.  And your FW 
architecture is to blame.  If you were running Windows instead of Linux 
I bet you'd have found a way to fix things differently than asking 
Microsoft to add a special quirk in their code.  In the PC world you are 
required to perform a BIOS update instead.

So you really need to provision the ability to update at least a portion 
of the firmware that is not read-only.  That may be a third-stage 
bootloader or the like which purpose is only to install workarounds such 
as this before executing the kernel and which doesn't need to be updated 
with the kernel.  And in theory you should be able to do that on top of 
your current RO firmware.

Errata workarounds are often machine specific and they can be applied 
only in special conditions the kernel might not safely know about until 
it is too late.


Nicolas
Russell King - ARM Linux Jan. 8, 2014, 9:02 p.m. UTC | #6
On Wed, Jan 08, 2014 at 11:43:29AM -0800, Doug Anderson wrote:
> Olof came up with the idea that you could update the RW firmware
> (affects initial boot) and then cache away the value and restore it in
> the kernel after resume.  That would still require a kernel patch but
> perhaps a less objectionable one.  ...of course if writing this
> register is a problem in secure mode then maybe that patch would be
> NAKed anyway.

It's not a problem in secure mode, since secure mode will have access
to the register.  It's the non-secure mode where various registers
either ignore writes, or they trigger an exception that are the problem.

Consider what would happen if an exception were to be triggered when no
exception handlers were installed (eg, because the MMU is not enabled.)

> That's one I didn't think about, you're right.  ...but we're really
> getting into hypothetical situations here.  Are there any r2p0
> products that have such a fix (and thus require a latency of >= 4)?

Really, we can't know, because this kind of information is dependent on
the SoC, and probably such customisations are only knowable via NDAs.

However, I do know of some A9 CPUs where exactly this kind of thing has
happened - where various selected silicon bugs have been fixed but the
revision still reports the same as one with all those bugs.
Nicolas Pitre Jan. 8, 2014, 9:08 p.m. UTC | #7
On Wed, 8 Jan 2014, Doug Anderson wrote:

> Hi,
> 
> On Wed, Jan 8, 2014 at 11:20 AM, Russell King - ARM Linux
> > No, we're saying to put the work-around in the boot loader, not the kernel.
> 
> Unfortunately the resume path of the firmware runs from Read Only 
> firmware code (yes, it sucks), so it's not totally trivial to fix. It 
> would be possible for someone to unscrew their write protect switch 
> and update their RO firmware, but that doesn't help the average user.

[...]

> I'd guess that the way forward is:
> 
> * Land kernel workaround locally in Chromium tree
> 
> * I'll try to land FW change locally in at least one Chromium branch.
> If we were to get a new RO build ever (seems unlikely at this point)
> it would be fixed for upstream kernels.  If we were to get a new RW
> build (seems unlikely, but at least less unlikely) it would be fixed
> if someone landed a kernel patch to save/restore this register across
> suspend/resume.
> 
> * If Joe Upstream wants to run an upstream kernel on some type of
> exynos5250 product (Samsung ARM Chromebook, HP Chromebook 11, Nexus 10
> are the ones I know of) then he will deal with the small number of
> crashes or figure out a solution.

If Joe Upstream wants to run an upstream kernel, doesn't he have to 
unscrew his write protect switch first, at which point the RO firmware 
can be updated as well?


Nicolas
Doug Anderson Jan. 8, 2014, 9:32 p.m. UTC | #8
Nicolas,

On Wed, Jan 8, 2014 at 12:58 PM, Nicolas Pitre <nico@fluxnic.net> wrote:
> On Wed, 8 Jan 2014, Doug Anderson wrote:
>
>> On Wed, Jan 8, 2014 at 11:20 AM, Russell King - ARM Linux
>> <linux@arm.linux.org.uk> wrote:
>> > We've been through these arguments many times, you're not the first to
>> > raise it, and we've decided upon the policy.  We want as _few_ work-
>> > arounds in the kernel as possible, because applying the work-arounds
>> > is very problematical with the mixture of secure and non-secure booting.
>>
>> OK, fair enough.  If we want no workaround here then we can drop this patch.
>>
>>
>> I'd guess that the way forward is:
>>
>> * Land kernel workaround locally in Chromium tree
>>
>> * I'll try to land FW change locally in at least one Chromium branch.
>> If we were to get a new RO build ever (seems unlikely at this point)
>> it would be fixed for upstream kernels.  If we were to get a new RW
>> build (seems unlikely, but at least less unlikely) it would be fixed
>> if someone landed a kernel patch to save/restore this register across
>> suspend/resume.
>>
>> * If Joe Upstream wants to run an upstream kernel on some type of
>> exynos5250 product (Samsung ARM Chromebook, HP Chromebook 11, Nexus 10
>> are the ones I know of) then he will deal with the small number of
>> crashes or figure out a solution.
>
> At some point you have to realize that what you're proposing is not
> viable when taking into account the bigger picture.  And your FW
> architecture is to blame.  If you were running Windows instead of Linux
> I bet you'd have found a way to fix things differently than asking
> Microsoft to add a special quirk in their code.  In the PC world you are
> required to perform a BIOS update instead.

Yup, I can believe that.

...I'm merely trying to report reality.  Firmware updates (even RW
ones) require an extra level of scrutiny / testing and a chunk of
in-house resources.  Those resources get allocated if there is no
other choice, but I can't force a lot of people to spend a lot of time
to approve a RW firmware update (actually approving >= 3 RW firmware
updates since there are at least three different 5250 variants) when
there is a simple and low risk kernel fix for it.  I could try, but
I'll get slapped down / laughed at.

If this were the Windows/x86 world and we had no choice the decision
would go a different way.  ...but we're not in the Windows/x86 world.


> So you really need to provision the ability to update at least a portion
> of the firmware that is not read-only.  That may be a third-stage
> bootloader or the like which purpose is only to install workarounds such
> as this before executing the kernel and which doesn't need to be updated
> with the kernel.  And in theory you should be able to do that on top of
> your current RO firmware.

The fact that we have too much code in RO is a known issue and people
are definitely working on that.  It has already bitten us.  It's
unlikely that someone will try retrofit something to work on existing
devices, though.

-Doug
Doug Anderson Jan. 8, 2014, 9:33 p.m. UTC | #9
Russell,

On Wed, Jan 8, 2014 at 1:02 PM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Wed, Jan 08, 2014 at 11:43:29AM -0800, Doug Anderson wrote:
>> Olof came up with the idea that you could update the RW firmware
>> (affects initial boot) and then cache away the value and restore it in
>> the kernel after resume.  That would still require a kernel patch but
>> perhaps a less objectionable one.  ...of course if writing this
>> register is a problem in secure mode then maybe that patch would be
>> NAKed anyway.
>
> It's not a problem in secure mode, since secure mode will have access
> to the register.  It's the non-secure mode where various registers
> either ignore writes, or they trigger an exception that are the problem.
>
> Consider what would happen if an exception were to be triggered when no
> exception handlers were installed (eg, because the MMU is not enabled.)

OK, good to know.  ...so if someone wanted to code up this patch and
send upstream they would be able to.  :)

-Doug
Doug Anderson Jan. 8, 2014, 9:35 p.m. UTC | #10
Nicolas,

On Wed, Jan 8, 2014 at 1:08 PM, Nicolas Pitre <nico@fluxnic.net> wrote:
> On Wed, 8 Jan 2014, Doug Anderson wrote:
>
>> Hi,
>>
>> On Wed, Jan 8, 2014 at 11:20 AM, Russell King - ARM Linux
>> > No, we're saying to put the work-around in the boot loader, not the kernel.
>>
>> Unfortunately the resume path of the firmware runs from Read Only
>> firmware code (yes, it sucks), so it's not totally trivial to fix. It
>> would be possible for someone to unscrew their write protect switch
>> and update their RO firmware, but that doesn't help the average user.
>
> [...]
>
>> I'd guess that the way forward is:
>>
>> * Land kernel workaround locally in Chromium tree
>>
>> * I'll try to land FW change locally in at least one Chromium branch.
>> If we were to get a new RO build ever (seems unlikely at this point)
>> it would be fixed for upstream kernels.  If we were to get a new RW
>> build (seems unlikely, but at least less unlikely) it would be fixed
>> if someone landed a kernel patch to save/restore this register across
>> suspend/resume.
>>
>> * If Joe Upstream wants to run an upstream kernel on some type of
>> exynos5250 product (Samsung ARM Chromebook, HP Chromebook 11, Nexus 10
>> are the ones I know of) then he will deal with the small number of
>> crashes or figure out a solution.
>
> If Joe Upstream wants to run an upstream kernel, doesn't he have to
> unscrew his write protect switch first, at which point the RO firmware
> can be updated as well?

Actually, no.  You can move your device into dev mode and run any
kernel you want.  You'll get an annoying "you're in dev mode" screen
at every bootup, but otherwise it works just fine.

Going into dev mode requires some special keystrokes at bootup and a
wipe of your hard disk but no screwdrivers.


-Doug
Tomasz Figa Jan. 10, 2014, 5:09 p.m. UTC | #11
Hi,

On 08.01.2014 17:21, Doug Anderson wrote:
> WIll,
>
> Thanks for your comments!
>
> On Wed, Jan 8, 2014 at 6:35 AM, Will Deacon <will.deacon@arm.com> wrote:
>> On Wed, Jan 08, 2014 at 01:33:11PM +0000, Vivek Gautam wrote:
>>> The erratum-773769 occurs on Arm Coretex-A15 (rev r2p0),
>>> when L2 Data Ram latency is set to 4 cycles or more; or
>>> when ACP is in use, or with L2 Data RAM slice configured.
>>> Therefore, the effective latency as calculated in Table 7-2 of
>>> Cotex-A15 (rev r2p0) trm should be 3 cycles or less.
>>>
>>> On Exynos5250 based systems the effective data ram latency
>>> is 4 cycles, since we have DATA_RAM_SETUP bit enabled (L2CTRL[5]=1b'1)
>>> and DATA_RAM_LATENCY bits set to 0x2 (L2CTLR[2:0]=3b'010) therefore,
>>> the effective L2 data RAM latency becomes 4 cycles.
>>> So erratum '773769' occurs causing a corrupted L2 Cache.
>>>
>>> This patch gives a workaround to the mentioned erratum, using below
>>> mentioned algo:
>>> ----------------------------------------------------------------
>>> if data RAM setup = 1
>>>    then check if effective latency i.e (latency + setup + 1) > 3
>>>    if 'true'
>>>      then clear data RAM setup
>>>        goto branch 'a'
>>> if data RAM setup = 0
>>>    a: then check if data RAM latency > 0x10
>>>      if true then force data RAM latency = 0x10
>>> ----------------------------------------------------------------
>>> so that the effective data RAM latency reduces to 3 cycles or less
>>> and hence prevent hitting the erratum.
>>>
>>> NOTE: The Exynos5250 based products have already been shipped, which
>>>        makes it impossible to add the change in bootloader, so we are
>>>        adding the required change in kernel.
>>
>> NAK. Whilst I appreciate that you may not be able to fix your bootloader,
>> this isn't the right change to make in the kernel. Blindly changing memory
>> latencies is likely to do more harm than good for a multi-platform kernel,
>> even if it works for exynos 5250. The only alternative I can think of (if you
>> have to make a mainline kernel change) is to restrict the clock frequencies at
>> which the CPU is allowed to run, although that obviously requires some
>> investigation in order to determine how viable it is for your SoC.
>
> OK, so humor me a little here...
>
> I'll start off by saying that I'm totally OK if mainline doesn't want
> this fixed.  If mainline is not interested in running reliably on
> exynos5250-based products then there's nothing I can do about it.  It
> seems unfortunate, but I'm not going to get into a shouting match
> about it.
>
> You're saying that this patch is blindly changing memory latencies.  I
> don't think it is (please correct me if I'm wrong!).  This patch:
>
> * Is guarded by a CONFIG option so the code isn't even compiled in if
> you don't want EXYNOS5250 support.  I know this doesn't help
> multiplatform, but it means that if you really hate the code it's easy
> to disable.
>
> * Is guarded by a runtime check of the revision number so that it
> doesn't run on unaffected A15 revisions.
>
> * Is guarded by a runtime check so it does nothing at all if the total
> latency is <= 3 (AKA if boot code already picked a sane value)
>
> ...with the above guards it's pretty safe...
>
> I will agree that there is a _potential_ that this could make things
> work worse on an already broken product our there, but I would say
> there's a reasonable chance that such a product doesn't exist (but
> please correct me if I'm wrong).  Specifically, this patch will cause
> problems in two examples that I can think of:
>
> --
>
> Example A: existing A15 <=r2p0 product with 773769-ignorant boot code
> that could be fixed, but that needs "setup = 1"
>
> In this case we've got boot code that's like the exynos5250 boot code
> that accidentally sets the total latency to >= 4 when it would work
> just fine to use a value of 3.  ...except that unlike the exynos5250
> this hypthetical SoC needs to leave setup = 1.
>
> ...if such a machine were found in the wild (seems unlikely?) then
> we'll need to figure out what to do.  If its boot code cannot be
> updated and we want to support this product with a similar patch then
> we'll need to be more dynamic.
>
> --
>
> Example B: existing A15 <=r2p0 product that just has to live with crashes
>
> In this case we've got a product where we're going to just accept that
> it crashes sometimes (since this is a fairly hard crash to trigger)
> because it crashes even more when the total latency < 4.  In this case
> we don't want to "fix" the errata because that makes things worse.
>
> ...if such a machine were find in the wild (it's possible, I guess?)
> then that's a really good reason not to take this patch or to find
> some way to dynamically enable / disable it.
>
> --
>
> Let me know what you think of the above, or if I'm misunderstanding something...

It's hard to disagree with any of the opinions presented here. We want 
to support Exynos5250-based hardware, but at the same time we don't want 
to regress any hardware unaffected by mentioned issue...

Let me propose you a solution that comes to my mind (as discussed with 
ojn on #armlinux):

To work around the erratum we basically need to ensure that two things 
are done:

  1) At boot-up the L2 controller configured by R/O firmware need to
     be reconfigured to sane values.

  2) At resume from sleep mode the same process as in 1) must be done.

As for the first one, it might be handled either by R/W firmware of the 
board or some small loader code glued in front of zImage. Second one can 
be implemented in SoC-specific resume code 
(arch/arm/plat-samsung/s5p-sleep.S), as it's currently done with L2X0 on 
Exynos4.

Of course this is scattering the workaround over multiple code bases, 
but this way the workaround is executed only on affected platforms 
without hurting the unaffected ones. What do you think?

Best regards,
Tomasz
Sander Jan. 29, 2014, 12:26 p.m. UTC | #12
Doug Anderson wrote (ao):
> On Wed, Jan 8, 2014 at 1:08 PM, Nicolas Pitre <nico@fluxnic.net> wrote:
> > On Wed, 8 Jan 2014, Doug Anderson wrote:
> >> * If Joe Upstream wants to run an upstream kernel on some type of
> >> exynos5250 product (Samsung ARM Chromebook, HP Chromebook 11, Nexus 10
> >> are the ones I know of) then he will deal with the small number of
> >> crashes or figure out a solution.
> >
> > If Joe Upstream wants to run an upstream kernel, doesn't he have to
> > unscrew his write protect switch first, at which point the RO firmware
> > can be updated as well?
> 
> Actually, no.  You can move your device into dev mode and run any
> kernel you want.  You'll get an annoying "you're in dev mode" screen
> at every bootup, but otherwise it works just fine.
> 
> Going into dev mode requires some special keystrokes at bootup and a
> wipe of your hard disk but no screwdrivers.

And there is http://www.arndaleboard.org/ where you just put an upstream
kernel on sd and boot.

	Sander
diff mbox

Patch

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index c59fa19..2e6f36c 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1250,6 +1250,21 @@  config ARM_ERRATA_751472
 	  operation is received by a CPU before the ICIALLUIS has completed,
 	  potentially leading to corrupted entries in the cache or TLB.
 
+config ARM_ERRATA_773769
+	bool "ARM errata: Large data RAM latencies can lead to rare data corruption"
+	depends on CPU_V7
+	help
+	  This option enables the workaround for the erratum 773769, which affects
+	  Cortex-A15 (rev r2p0).
+	  In systems with L2 Data RAM latency programmed to 4 or more cycles,
+	  or with ACP in use, or with a L2 Data RAM slice configured, it is
+	  possible that a rare collision between non-cacheable stores and
+	  L1 data cache evictions which can lead to data corruption in L2 cache
+	  or memory.
+	  This workaround is to configure an effective Data RAM latency of 3 or
+	  less. Also note that, if a Data RAM slice is configured in A15 then
+	  there is no workaround.
+
 config PL310_ERRATA_753970
 	bool "PL310 errata: cache sync operation may be faulty"
 	depends on CACHE_PL310
diff --git a/arch/arm/mach-exynos/Kconfig b/arch/arm/mach-exynos/Kconfig
index 4c414af..29f505f 100644
--- a/arch/arm/mach-exynos/Kconfig
+++ b/arch/arm/mach-exynos/Kconfig
@@ -82,6 +82,7 @@  config SOC_EXYNOS5250
 	default y
 	depends on ARCH_EXYNOS5
 	select ARCH_HAS_BANDGAP
+	select ARM_ERRATA_773769
 	select PINCTRL_EXYNOS
 	select PM_GENERIC_DOMAINS if PM
 	select S5P_PM if PM
diff --git a/arch/arm/mm/proc-v7.S b/arch/arm/mm/proc-v7.S
index bd17819..0674c4c 100644
--- a/arch/arm/mm/proc-v7.S
+++ b/arch/arm/mm/proc-v7.S
@@ -141,6 +141,49 @@  ENTRY(cpu_v7_do_resume)
 	mcr	p15, 0, r4, c10, c2, 0	@ write PRRR
 	mcr	p15, 0, r5, c10, c2, 1	@ write NMRR
 #endif	/* CONFIG_MMU */
+
+#ifdef CONFIG_ARM_ERRATA_773769
+	/* get the arm rev id */
+	mrc	p15, 0, r3, c0, c0, 0		@ read main ID register
+	and	r4, r3, #0xff000000		@ ARM?
+	teq	r4, #0x41000000
+	bne	8f
+	and	r5, r3, #0x00f00000		@ variant
+	and	r7, r3, #0x0000000f		@ revision
+	orr	r7, r7, r5, lsr #20-4		@ combine variant and revision
+	ubfx	r3, r3, #4, #12			@ primary part number
+
+	ldr	r4, =0x00000c0f			@ Cortex-A15 primary part number
+	teq	r3, r4
+	bne	8f
+
+	ALT_SMP(cmp r7, #0x21)			@ present prior to r2p1
+	ALT_UP_B(8f)
+	mrclt	p15, 0, r3, c1, c0, 0		@ read system control register
+	andlt	r3, r3, #0x4			@ mask for C bit
+	cmplt	r3, #0x0			@ check if cache is on/off
+	bne	8f				@ Do nothing when cache is on
+
+	mrceq	p15, 1, r5, c9, c0, 2		@ read L2 control register
+	andeq	r3, r5, #(1 << 5)		@ mask for data RAM setup
+	lsreq	r3, r3, #0x5
+	cmpeq	r3, #0x1			@ check if data RAM setup = 1
+	bne	9f
+	and	r4, r5, #0x7			@ mask for data RAM latency
+	add	r4, r4, r3
+	add	r4, r4, #0x1			@ effective latency
+	cmp	r4, #0x3
+	bicgt	r5, r5, #(1 << 5)		@ clear data RAM setup bit
+
+9:	and	r4, r5, #0x7			@ mask for data RAM latency
+	cmp	r4, #0x2			@ check if data RAM latency > 2
+	ble	10f
+	bic	r5, r5, #0x7			@ clear data RAM latency bits
+	orr	r5, r5, #0x2			@ force data RAM latency = 2
+10:	mcr	p15, 1, r5, c9, c0, 2		@ set L2 control register
+8:
+#endif
+
 	mrc	p15, 0, r4, c1, c0, 1	@ Read Auxiliary control register
 	teq	r4, r9			@ Is it already set?
 	mcrne	p15, 0, r9, c1, c0, 1	@ No, so write it
@@ -349,6 +392,42 @@  __v7_setup:
 	mcrle	p15, 0, r10, c1, c0, 1		@ write aux control register
 #endif
 
+#ifdef CONFIG_ARM_ERRATA_773769
+	ALT_SMP(cmp r6, #0x21)			@ present prior to r2p1
+	ALT_UP_B(5f)
+	mrclt	p15, 0, r3, c1, c0, 0		@ read system control register
+	andlt	r3, r3, #0x4			@ mask for C bit
+	cmplt	r3, #0x0			@ check if cache is on/off
+	bne	5f				@ Do nothing when cache is on
+	/*
+	 * if data RAM setup = 1
+	 * 	then check if effective latency i.e (latency + setup + 1) > 3
+	 *		if true then clear data RAM setup
+	 *		goto branch 'a'
+	 * if data RAM setup = 0
+	 *	a: then check if data RAM latency > 0x10
+	 *		if true then force data RAM latency = 0x10
+	 */
+	mrceq	p15, 1, r5, c9, c0, 2		@ read L2 control register
+	andeq	r3, r5, #(1 << 5)		@ mask for data RAM setup
+	lsreq	r3, r3, #0x5
+	cmpeq	r3, #0x1			@ check if data RAM setup = 1
+	bne	6f
+	and	r10, r5, #0x7			@ mask for data RAM latency
+	add	r10, r10, r3
+	add	r10, r10, #0x1			@ effective latency
+	cmp	r10, #0x3
+	bicgt	r5, r5, #(1 << 5)		@ clear data RAM setup bit
+
+6:	and	r10, r5, #0x7			@ mask for data RAM latency
+	cmp	r10, #0x2			@ check if data RAM latency > 2
+	ble	7f
+	bic	r5, r5, #0x7			@ clear data RAM latency bits
+	orr	r5, r5, #0x2			@ force data RAM latency = 2
+7:	mcr	p15, 1, r5, c9, c0, 2		@ set L2 control register
+5:
+#endif
+
 4:	mov	r10, #0
 	mcr	p15, 0, r10, c7, c5, 0		@ I+BTB cache invalidate
 	dsb