Message ID | 1389187991-26446-1-git-send-email-gautam.vivek@samsung.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Wed, Jan 08, 2014 at 01:33:11PM +0000, Vivek Gautam wrote: > The erratum-773769 occurs on Arm Coretex-A15 (rev r2p0), > when L2 Data Ram latency is set to 4 cycles or more; or > when ACP is in use, or with L2 Data RAM slice configured. > Therefore, the effective latency as calculated in Table 7-2 of > Cotex-A15 (rev r2p0) trm should be 3 cycles or less. > > On Exynos5250 based systems the effective data ram latency > is 4 cycles, since we have DATA_RAM_SETUP bit enabled (L2CTRL[5]=1b'1) > and DATA_RAM_LATENCY bits set to 0x2 (L2CTLR[2:0]=3b'010) therefore, > the effective L2 data RAM latency becomes 4 cycles. > So erratum '773769' occurs causing a corrupted L2 Cache. > > This patch gives a workaround to the mentioned erratum, using below > mentioned algo: > ---------------------------------------------------------------- > if data RAM setup = 1 > then check if effective latency i.e (latency + setup + 1) > 3 > if 'true' > then clear data RAM setup > goto branch 'a' > if data RAM setup = 0 > a: then check if data RAM latency > 0x10 > if true then force data RAM latency = 0x10 > ---------------------------------------------------------------- > so that the effective data RAM latency reduces to 3 cycles or less > and hence prevent hitting the erratum. > > NOTE: The Exynos5250 based products have already been shipped, which > makes it impossible to add the change in bootloader, so we are > adding the required change in kernel. NAK. Whilst I appreciate that you may not be able to fix your bootloader, this isn't the right change to make in the kernel. Blindly changing memory latencies is likely to do more harm than good for a multi-platform kernel, even if it works for exynos 5250. The only alternative I can think of (if you have to make a mainline kernel change) is to restrict the clock frequencies at which the CPU is allowed to run, although that obviously requires some investigation in order to determine how viable it is for your SoC. Will
WIll, Thanks for your comments! On Wed, Jan 8, 2014 at 6:35 AM, Will Deacon <will.deacon@arm.com> wrote: > On Wed, Jan 08, 2014 at 01:33:11PM +0000, Vivek Gautam wrote: >> The erratum-773769 occurs on Arm Coretex-A15 (rev r2p0), >> when L2 Data Ram latency is set to 4 cycles or more; or >> when ACP is in use, or with L2 Data RAM slice configured. >> Therefore, the effective latency as calculated in Table 7-2 of >> Cotex-A15 (rev r2p0) trm should be 3 cycles or less. >> >> On Exynos5250 based systems the effective data ram latency >> is 4 cycles, since we have DATA_RAM_SETUP bit enabled (L2CTRL[5]=1b'1) >> and DATA_RAM_LATENCY bits set to 0x2 (L2CTLR[2:0]=3b'010) therefore, >> the effective L2 data RAM latency becomes 4 cycles. >> So erratum '773769' occurs causing a corrupted L2 Cache. >> >> This patch gives a workaround to the mentioned erratum, using below >> mentioned algo: >> ---------------------------------------------------------------- >> if data RAM setup = 1 >> then check if effective latency i.e (latency + setup + 1) > 3 >> if 'true' >> then clear data RAM setup >> goto branch 'a' >> if data RAM setup = 0 >> a: then check if data RAM latency > 0x10 >> if true then force data RAM latency = 0x10 >> ---------------------------------------------------------------- >> so that the effective data RAM latency reduces to 3 cycles or less >> and hence prevent hitting the erratum. >> >> NOTE: The Exynos5250 based products have already been shipped, which >> makes it impossible to add the change in bootloader, so we are >> adding the required change in kernel. > > NAK. Whilst I appreciate that you may not be able to fix your bootloader, > this isn't the right change to make in the kernel. Blindly changing memory > latencies is likely to do more harm than good for a multi-platform kernel, > even if it works for exynos 5250. The only alternative I can think of (if you > have to make a mainline kernel change) is to restrict the clock frequencies at > which the CPU is allowed to run, although that obviously requires some > investigation in order to determine how viable it is for your SoC. OK, so humor me a little here... I'll start off by saying that I'm totally OK if mainline doesn't want this fixed. If mainline is not interested in running reliably on exynos5250-based products then there's nothing I can do about it. It seems unfortunate, but I'm not going to get into a shouting match about it. You're saying that this patch is blindly changing memory latencies. I don't think it is (please correct me if I'm wrong!). This patch: * Is guarded by a CONFIG option so the code isn't even compiled in if you don't want EXYNOS5250 support. I know this doesn't help multiplatform, but it means that if you really hate the code it's easy to disable. * Is guarded by a runtime check of the revision number so that it doesn't run on unaffected A15 revisions. * Is guarded by a runtime check so it does nothing at all if the total latency is <= 3 (AKA if boot code already picked a sane value) ...with the above guards it's pretty safe... I will agree that there is a _potential_ that this could make things work worse on an already broken product our there, but I would say there's a reasonable chance that such a product doesn't exist (but please correct me if I'm wrong). Specifically, this patch will cause problems in two examples that I can think of: -- Example A: existing A15 <=r2p0 product with 773769-ignorant boot code that could be fixed, but that needs "setup = 1" In this case we've got boot code that's like the exynos5250 boot code that accidentally sets the total latency to >= 4 when it would work just fine to use a value of 3. ...except that unlike the exynos5250 this hypthetical SoC needs to leave setup = 1. ...if such a machine were found in the wild (seems unlikely?) then we'll need to figure out what to do. If its boot code cannot be updated and we want to support this product with a similar patch then we'll need to be more dynamic. -- Example B: existing A15 <=r2p0 product that just has to live with crashes In this case we've got a product where we're going to just accept that it crashes sometimes (since this is a fairly hard crash to trigger) because it crashes even more when the total latency < 4. In this case we don't want to "fix" the errata because that makes things worse. ...if such a machine were find in the wild (it's possible, I guess?) then that's a really good reason not to take this patch or to find some way to dynamically enable / disable it. -- Let me know what you think of the above, or if I'm misunderstanding something... -Doug
On Wed, Jan 08, 2014 at 08:21:21AM -0800, Doug Anderson wrote: > WIll, > > Thanks for your comments! > > On Wed, Jan 8, 2014 at 6:35 AM, Will Deacon <will.deacon@arm.com> wrote: > > NAK. Whilst I appreciate that you may not be able to fix your bootloader, > > this isn't the right change to make in the kernel. Blindly changing memory > > latencies is likely to do more harm than good for a multi-platform kernel, > > even if it works for exynos 5250. The only alternative I can think of (if you > > have to make a mainline kernel change) is to restrict the clock frequencies at > > which the CPU is allowed to run, although that obviously requires some > > investigation in order to determine how viable it is for your SoC. +1 > I'll start off by saying that I'm totally OK if mainline doesn't want > this fixed. If mainline is not interested in running reliably on > exynos5250-based products then there's nothing I can do about it. It > seems unfortunate, but I'm not going to get into a shouting match > about it. No, we're saying to put the work-around in the boot loader, not the kernel. > You're saying that this patch is blindly changing memory latencies. I > don't think it is (please correct me if I'm wrong!). This patch: > > * Is guarded by a CONFIG option so the code isn't even compiled in if > you don't want EXYNOS5250 support. I know this doesn't help > multiplatform, but it means that if you really hate the code it's easy > to disable. As you identify, this is completely useless in the multiplatform kernel case, so you can completely remove this from the argument _for_ this change - it carries no weight what so ever. > * Is guarded by a runtime check of the revision number so that it > doesn't run on unaffected A15 revisions. So what if the A15 reports an affected revision, but the SoC integrator has fixed the problem - why should they have to put up with the work- around being applied on their silicon? > * Is guarded by a runtime check so it does nothing at all if the total > latency is <= 3 (AKA if boot code already picked a sane value) Right, so the effect of the above is that with a multiplatform kernel including Exynos 5 support, all platforms with an A15 r2p0 or earlier have this work-around applied even if it was actually fixed in the silicon - which there's no way to know from the software perspective. We've been through these arguments many times, you're not the first to raise it, and we've decided upon the policy. We want as _few_ work- arounds in the kernel as possible, because applying the work-arounds is very problematical with the mixture of secure and non-secure booting. Maybe Will can check, but I believe that the L2 cache control register is one of those which is not writable from non-secure mode, and thus will crash the kernel at boot if a write is attempted.
Hi, On Wed, Jan 8, 2014 at 11:20 AM, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote: > On Wed, Jan 08, 2014 at 08:21:21AM -0800, Doug Anderson wrote: >> WIll, >> >> Thanks for your comments! >> >> On Wed, Jan 8, 2014 at 6:35 AM, Will Deacon <will.deacon@arm.com> wrote: >> > NAK. Whilst I appreciate that you may not be able to fix your bootloader, >> > this isn't the right change to make in the kernel. Blindly changing memory >> > latencies is likely to do more harm than good for a multi-platform kernel, >> > even if it works for exynos 5250. The only alternative I can think of (if you >> > have to make a mainline kernel change) is to restrict the clock frequencies at >> > which the CPU is allowed to run, although that obviously requires some >> > investigation in order to determine how viable it is for your SoC. > > +1 > >> I'll start off by saying that I'm totally OK if mainline doesn't want >> this fixed. If mainline is not interested in running reliably on >> exynos5250-based products then there's nothing I can do about it. It >> seems unfortunate, but I'm not going to get into a shouting match >> about it. > > No, we're saying to put the work-around in the boot loader, not the kernel. Unfortunately the resume path of the firmware runs from Read Only firmware code (yes, it sucks), so it's not totally trivial to fix. It would be possible for someone to unscrew their write protect switch and update their RO firmware, but that doesn't help the average user. Olof came up with the idea that you could update the RW firmware (affects initial boot) and then cache away the value and restore it in the kernel after resume. That would still require a kernel patch but perhaps a less objectionable one. ...of course if writing this register is a problem in secure mode then maybe that patch would be NAKed anyway. >> * Is guarded by a runtime check of the revision number so that it >> doesn't run on unaffected A15 revisions. > > So what if the A15 reports an affected revision, but the SoC integrator > has fixed the problem - why should they have to put up with the work- > around being applied on their silicon? That's one I didn't think about, you're right. ...but we're really getting into hypothetical situations here. Are there any r2p0 products that have such a fix (and thus require a latency of >= 4)? > We've been through these arguments many times, you're not the first to > raise it, and we've decided upon the policy. We want as _few_ work- > arounds in the kernel as possible, because applying the work-arounds > is very problematical with the mixture of secure and non-secure booting. OK, fair enough. If we want no workaround here then we can drop this patch. I'd guess that the way forward is: * Land kernel workaround locally in Chromium tree * I'll try to land FW change locally in at least one Chromium branch. If we were to get a new RO build ever (seems unlikely at this point) it would be fixed for upstream kernels. If we were to get a new RW build (seems unlikely, but at least less unlikely) it would be fixed if someone landed a kernel patch to save/restore this register across suspend/resume. * If Joe Upstream wants to run an upstream kernel on some type of exynos5250 product (Samsung ARM Chromebook, HP Chromebook 11, Nexus 10 are the ones I know of) then he will deal with the small number of crashes or figure out a solution. -Doug
On Wed, 8 Jan 2014, Doug Anderson wrote: > On Wed, Jan 8, 2014 at 11:20 AM, Russell King - ARM Linux > <linux@arm.linux.org.uk> wrote: > > We've been through these arguments many times, you're not the first to > > raise it, and we've decided upon the policy. We want as _few_ work- > > arounds in the kernel as possible, because applying the work-arounds > > is very problematical with the mixture of secure and non-secure booting. > > OK, fair enough. If we want no workaround here then we can drop this patch. > > > I'd guess that the way forward is: > > * Land kernel workaround locally in Chromium tree > > * I'll try to land FW change locally in at least one Chromium branch. > If we were to get a new RO build ever (seems unlikely at this point) > it would be fixed for upstream kernels. If we were to get a new RW > build (seems unlikely, but at least less unlikely) it would be fixed > if someone landed a kernel patch to save/restore this register across > suspend/resume. > > * If Joe Upstream wants to run an upstream kernel on some type of > exynos5250 product (Samsung ARM Chromebook, HP Chromebook 11, Nexus 10 > are the ones I know of) then he will deal with the small number of > crashes or figure out a solution. At some point you have to realize that what you're proposing is not viable when taking into account the bigger picture. And your FW architecture is to blame. If you were running Windows instead of Linux I bet you'd have found a way to fix things differently than asking Microsoft to add a special quirk in their code. In the PC world you are required to perform a BIOS update instead. So you really need to provision the ability to update at least a portion of the firmware that is not read-only. That may be a third-stage bootloader or the like which purpose is only to install workarounds such as this before executing the kernel and which doesn't need to be updated with the kernel. And in theory you should be able to do that on top of your current RO firmware. Errata workarounds are often machine specific and they can be applied only in special conditions the kernel might not safely know about until it is too late. Nicolas
On Wed, Jan 08, 2014 at 11:43:29AM -0800, Doug Anderson wrote: > Olof came up with the idea that you could update the RW firmware > (affects initial boot) and then cache away the value and restore it in > the kernel after resume. That would still require a kernel patch but > perhaps a less objectionable one. ...of course if writing this > register is a problem in secure mode then maybe that patch would be > NAKed anyway. It's not a problem in secure mode, since secure mode will have access to the register. It's the non-secure mode where various registers either ignore writes, or they trigger an exception that are the problem. Consider what would happen if an exception were to be triggered when no exception handlers were installed (eg, because the MMU is not enabled.) > That's one I didn't think about, you're right. ...but we're really > getting into hypothetical situations here. Are there any r2p0 > products that have such a fix (and thus require a latency of >= 4)? Really, we can't know, because this kind of information is dependent on the SoC, and probably such customisations are only knowable via NDAs. However, I do know of some A9 CPUs where exactly this kind of thing has happened - where various selected silicon bugs have been fixed but the revision still reports the same as one with all those bugs.
On Wed, 8 Jan 2014, Doug Anderson wrote: > Hi, > > On Wed, Jan 8, 2014 at 11:20 AM, Russell King - ARM Linux > > No, we're saying to put the work-around in the boot loader, not the kernel. > > Unfortunately the resume path of the firmware runs from Read Only > firmware code (yes, it sucks), so it's not totally trivial to fix. It > would be possible for someone to unscrew their write protect switch > and update their RO firmware, but that doesn't help the average user. [...] > I'd guess that the way forward is: > > * Land kernel workaround locally in Chromium tree > > * I'll try to land FW change locally in at least one Chromium branch. > If we were to get a new RO build ever (seems unlikely at this point) > it would be fixed for upstream kernels. If we were to get a new RW > build (seems unlikely, but at least less unlikely) it would be fixed > if someone landed a kernel patch to save/restore this register across > suspend/resume. > > * If Joe Upstream wants to run an upstream kernel on some type of > exynos5250 product (Samsung ARM Chromebook, HP Chromebook 11, Nexus 10 > are the ones I know of) then he will deal with the small number of > crashes or figure out a solution. If Joe Upstream wants to run an upstream kernel, doesn't he have to unscrew his write protect switch first, at which point the RO firmware can be updated as well? Nicolas
Nicolas, On Wed, Jan 8, 2014 at 12:58 PM, Nicolas Pitre <nico@fluxnic.net> wrote: > On Wed, 8 Jan 2014, Doug Anderson wrote: > >> On Wed, Jan 8, 2014 at 11:20 AM, Russell King - ARM Linux >> <linux@arm.linux.org.uk> wrote: >> > We've been through these arguments many times, you're not the first to >> > raise it, and we've decided upon the policy. We want as _few_ work- >> > arounds in the kernel as possible, because applying the work-arounds >> > is very problematical with the mixture of secure and non-secure booting. >> >> OK, fair enough. If we want no workaround here then we can drop this patch. >> >> >> I'd guess that the way forward is: >> >> * Land kernel workaround locally in Chromium tree >> >> * I'll try to land FW change locally in at least one Chromium branch. >> If we were to get a new RO build ever (seems unlikely at this point) >> it would be fixed for upstream kernels. If we were to get a new RW >> build (seems unlikely, but at least less unlikely) it would be fixed >> if someone landed a kernel patch to save/restore this register across >> suspend/resume. >> >> * If Joe Upstream wants to run an upstream kernel on some type of >> exynos5250 product (Samsung ARM Chromebook, HP Chromebook 11, Nexus 10 >> are the ones I know of) then he will deal with the small number of >> crashes or figure out a solution. > > At some point you have to realize that what you're proposing is not > viable when taking into account the bigger picture. And your FW > architecture is to blame. If you were running Windows instead of Linux > I bet you'd have found a way to fix things differently than asking > Microsoft to add a special quirk in their code. In the PC world you are > required to perform a BIOS update instead. Yup, I can believe that. ...I'm merely trying to report reality. Firmware updates (even RW ones) require an extra level of scrutiny / testing and a chunk of in-house resources. Those resources get allocated if there is no other choice, but I can't force a lot of people to spend a lot of time to approve a RW firmware update (actually approving >= 3 RW firmware updates since there are at least three different 5250 variants) when there is a simple and low risk kernel fix for it. I could try, but I'll get slapped down / laughed at. If this were the Windows/x86 world and we had no choice the decision would go a different way. ...but we're not in the Windows/x86 world. > So you really need to provision the ability to update at least a portion > of the firmware that is not read-only. That may be a third-stage > bootloader or the like which purpose is only to install workarounds such > as this before executing the kernel and which doesn't need to be updated > with the kernel. And in theory you should be able to do that on top of > your current RO firmware. The fact that we have too much code in RO is a known issue and people are definitely working on that. It has already bitten us. It's unlikely that someone will try retrofit something to work on existing devices, though. -Doug
Russell, On Wed, Jan 8, 2014 at 1:02 PM, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote: > On Wed, Jan 08, 2014 at 11:43:29AM -0800, Doug Anderson wrote: >> Olof came up with the idea that you could update the RW firmware >> (affects initial boot) and then cache away the value and restore it in >> the kernel after resume. That would still require a kernel patch but >> perhaps a less objectionable one. ...of course if writing this >> register is a problem in secure mode then maybe that patch would be >> NAKed anyway. > > It's not a problem in secure mode, since secure mode will have access > to the register. It's the non-secure mode where various registers > either ignore writes, or they trigger an exception that are the problem. > > Consider what would happen if an exception were to be triggered when no > exception handlers were installed (eg, because the MMU is not enabled.) OK, good to know. ...so if someone wanted to code up this patch and send upstream they would be able to. :) -Doug
Nicolas, On Wed, Jan 8, 2014 at 1:08 PM, Nicolas Pitre <nico@fluxnic.net> wrote: > On Wed, 8 Jan 2014, Doug Anderson wrote: > >> Hi, >> >> On Wed, Jan 8, 2014 at 11:20 AM, Russell King - ARM Linux >> > No, we're saying to put the work-around in the boot loader, not the kernel. >> >> Unfortunately the resume path of the firmware runs from Read Only >> firmware code (yes, it sucks), so it's not totally trivial to fix. It >> would be possible for someone to unscrew their write protect switch >> and update their RO firmware, but that doesn't help the average user. > > [...] > >> I'd guess that the way forward is: >> >> * Land kernel workaround locally in Chromium tree >> >> * I'll try to land FW change locally in at least one Chromium branch. >> If we were to get a new RO build ever (seems unlikely at this point) >> it would be fixed for upstream kernels. If we were to get a new RW >> build (seems unlikely, but at least less unlikely) it would be fixed >> if someone landed a kernel patch to save/restore this register across >> suspend/resume. >> >> * If Joe Upstream wants to run an upstream kernel on some type of >> exynos5250 product (Samsung ARM Chromebook, HP Chromebook 11, Nexus 10 >> are the ones I know of) then he will deal with the small number of >> crashes or figure out a solution. > > If Joe Upstream wants to run an upstream kernel, doesn't he have to > unscrew his write protect switch first, at which point the RO firmware > can be updated as well? Actually, no. You can move your device into dev mode and run any kernel you want. You'll get an annoying "you're in dev mode" screen at every bootup, but otherwise it works just fine. Going into dev mode requires some special keystrokes at bootup and a wipe of your hard disk but no screwdrivers. -Doug
Hi, On 08.01.2014 17:21, Doug Anderson wrote: > WIll, > > Thanks for your comments! > > On Wed, Jan 8, 2014 at 6:35 AM, Will Deacon <will.deacon@arm.com> wrote: >> On Wed, Jan 08, 2014 at 01:33:11PM +0000, Vivek Gautam wrote: >>> The erratum-773769 occurs on Arm Coretex-A15 (rev r2p0), >>> when L2 Data Ram latency is set to 4 cycles or more; or >>> when ACP is in use, or with L2 Data RAM slice configured. >>> Therefore, the effective latency as calculated in Table 7-2 of >>> Cotex-A15 (rev r2p0) trm should be 3 cycles or less. >>> >>> On Exynos5250 based systems the effective data ram latency >>> is 4 cycles, since we have DATA_RAM_SETUP bit enabled (L2CTRL[5]=1b'1) >>> and DATA_RAM_LATENCY bits set to 0x2 (L2CTLR[2:0]=3b'010) therefore, >>> the effective L2 data RAM latency becomes 4 cycles. >>> So erratum '773769' occurs causing a corrupted L2 Cache. >>> >>> This patch gives a workaround to the mentioned erratum, using below >>> mentioned algo: >>> ---------------------------------------------------------------- >>> if data RAM setup = 1 >>> then check if effective latency i.e (latency + setup + 1) > 3 >>> if 'true' >>> then clear data RAM setup >>> goto branch 'a' >>> if data RAM setup = 0 >>> a: then check if data RAM latency > 0x10 >>> if true then force data RAM latency = 0x10 >>> ---------------------------------------------------------------- >>> so that the effective data RAM latency reduces to 3 cycles or less >>> and hence prevent hitting the erratum. >>> >>> NOTE: The Exynos5250 based products have already been shipped, which >>> makes it impossible to add the change in bootloader, so we are >>> adding the required change in kernel. >> >> NAK. Whilst I appreciate that you may not be able to fix your bootloader, >> this isn't the right change to make in the kernel. Blindly changing memory >> latencies is likely to do more harm than good for a multi-platform kernel, >> even if it works for exynos 5250. The only alternative I can think of (if you >> have to make a mainline kernel change) is to restrict the clock frequencies at >> which the CPU is allowed to run, although that obviously requires some >> investigation in order to determine how viable it is for your SoC. > > OK, so humor me a little here... > > I'll start off by saying that I'm totally OK if mainline doesn't want > this fixed. If mainline is not interested in running reliably on > exynos5250-based products then there's nothing I can do about it. It > seems unfortunate, but I'm not going to get into a shouting match > about it. > > You're saying that this patch is blindly changing memory latencies. I > don't think it is (please correct me if I'm wrong!). This patch: > > * Is guarded by a CONFIG option so the code isn't even compiled in if > you don't want EXYNOS5250 support. I know this doesn't help > multiplatform, but it means that if you really hate the code it's easy > to disable. > > * Is guarded by a runtime check of the revision number so that it > doesn't run on unaffected A15 revisions. > > * Is guarded by a runtime check so it does nothing at all if the total > latency is <= 3 (AKA if boot code already picked a sane value) > > ...with the above guards it's pretty safe... > > I will agree that there is a _potential_ that this could make things > work worse on an already broken product our there, but I would say > there's a reasonable chance that such a product doesn't exist (but > please correct me if I'm wrong). Specifically, this patch will cause > problems in two examples that I can think of: > > -- > > Example A: existing A15 <=r2p0 product with 773769-ignorant boot code > that could be fixed, but that needs "setup = 1" > > In this case we've got boot code that's like the exynos5250 boot code > that accidentally sets the total latency to >= 4 when it would work > just fine to use a value of 3. ...except that unlike the exynos5250 > this hypthetical SoC needs to leave setup = 1. > > ...if such a machine were found in the wild (seems unlikely?) then > we'll need to figure out what to do. If its boot code cannot be > updated and we want to support this product with a similar patch then > we'll need to be more dynamic. > > -- > > Example B: existing A15 <=r2p0 product that just has to live with crashes > > In this case we've got a product where we're going to just accept that > it crashes sometimes (since this is a fairly hard crash to trigger) > because it crashes even more when the total latency < 4. In this case > we don't want to "fix" the errata because that makes things worse. > > ...if such a machine were find in the wild (it's possible, I guess?) > then that's a really good reason not to take this patch or to find > some way to dynamically enable / disable it. > > -- > > Let me know what you think of the above, or if I'm misunderstanding something... It's hard to disagree with any of the opinions presented here. We want to support Exynos5250-based hardware, but at the same time we don't want to regress any hardware unaffected by mentioned issue... Let me propose you a solution that comes to my mind (as discussed with ojn on #armlinux): To work around the erratum we basically need to ensure that two things are done: 1) At boot-up the L2 controller configured by R/O firmware need to be reconfigured to sane values. 2) At resume from sleep mode the same process as in 1) must be done. As for the first one, it might be handled either by R/W firmware of the board or some small loader code glued in front of zImage. Second one can be implemented in SoC-specific resume code (arch/arm/plat-samsung/s5p-sleep.S), as it's currently done with L2X0 on Exynos4. Of course this is scattering the workaround over multiple code bases, but this way the workaround is executed only on affected platforms without hurting the unaffected ones. What do you think? Best regards, Tomasz
Doug Anderson wrote (ao): > On Wed, Jan 8, 2014 at 1:08 PM, Nicolas Pitre <nico@fluxnic.net> wrote: > > On Wed, 8 Jan 2014, Doug Anderson wrote: > >> * If Joe Upstream wants to run an upstream kernel on some type of > >> exynos5250 product (Samsung ARM Chromebook, HP Chromebook 11, Nexus 10 > >> are the ones I know of) then he will deal with the small number of > >> crashes or figure out a solution. > > > > If Joe Upstream wants to run an upstream kernel, doesn't he have to > > unscrew his write protect switch first, at which point the RO firmware > > can be updated as well? > > Actually, no. You can move your device into dev mode and run any > kernel you want. You'll get an annoying "you're in dev mode" screen > at every bootup, but otherwise it works just fine. > > Going into dev mode requires some special keystrokes at bootup and a > wipe of your hard disk but no screwdrivers. And there is http://www.arndaleboard.org/ where you just put an upstream kernel on sd and boot. Sander
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index c59fa19..2e6f36c 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -1250,6 +1250,21 @@ config ARM_ERRATA_751472 operation is received by a CPU before the ICIALLUIS has completed, potentially leading to corrupted entries in the cache or TLB. +config ARM_ERRATA_773769 + bool "ARM errata: Large data RAM latencies can lead to rare data corruption" + depends on CPU_V7 + help + This option enables the workaround for the erratum 773769, which affects + Cortex-A15 (rev r2p0). + In systems with L2 Data RAM latency programmed to 4 or more cycles, + or with ACP in use, or with a L2 Data RAM slice configured, it is + possible that a rare collision between non-cacheable stores and + L1 data cache evictions which can lead to data corruption in L2 cache + or memory. + This workaround is to configure an effective Data RAM latency of 3 or + less. Also note that, if a Data RAM slice is configured in A15 then + there is no workaround. + config PL310_ERRATA_753970 bool "PL310 errata: cache sync operation may be faulty" depends on CACHE_PL310 diff --git a/arch/arm/mach-exynos/Kconfig b/arch/arm/mach-exynos/Kconfig index 4c414af..29f505f 100644 --- a/arch/arm/mach-exynos/Kconfig +++ b/arch/arm/mach-exynos/Kconfig @@ -82,6 +82,7 @@ config SOC_EXYNOS5250 default y depends on ARCH_EXYNOS5 select ARCH_HAS_BANDGAP + select ARM_ERRATA_773769 select PINCTRL_EXYNOS select PM_GENERIC_DOMAINS if PM select S5P_PM if PM diff --git a/arch/arm/mm/proc-v7.S b/arch/arm/mm/proc-v7.S index bd17819..0674c4c 100644 --- a/arch/arm/mm/proc-v7.S +++ b/arch/arm/mm/proc-v7.S @@ -141,6 +141,49 @@ ENTRY(cpu_v7_do_resume) mcr p15, 0, r4, c10, c2, 0 @ write PRRR mcr p15, 0, r5, c10, c2, 1 @ write NMRR #endif /* CONFIG_MMU */ + +#ifdef CONFIG_ARM_ERRATA_773769 + /* get the arm rev id */ + mrc p15, 0, r3, c0, c0, 0 @ read main ID register + and r4, r3, #0xff000000 @ ARM? + teq r4, #0x41000000 + bne 8f + and r5, r3, #0x00f00000 @ variant + and r7, r3, #0x0000000f @ revision + orr r7, r7, r5, lsr #20-4 @ combine variant and revision + ubfx r3, r3, #4, #12 @ primary part number + + ldr r4, =0x00000c0f @ Cortex-A15 primary part number + teq r3, r4 + bne 8f + + ALT_SMP(cmp r7, #0x21) @ present prior to r2p1 + ALT_UP_B(8f) + mrclt p15, 0, r3, c1, c0, 0 @ read system control register + andlt r3, r3, #0x4 @ mask for C bit + cmplt r3, #0x0 @ check if cache is on/off + bne 8f @ Do nothing when cache is on + + mrceq p15, 1, r5, c9, c0, 2 @ read L2 control register + andeq r3, r5, #(1 << 5) @ mask for data RAM setup + lsreq r3, r3, #0x5 + cmpeq r3, #0x1 @ check if data RAM setup = 1 + bne 9f + and r4, r5, #0x7 @ mask for data RAM latency + add r4, r4, r3 + add r4, r4, #0x1 @ effective latency + cmp r4, #0x3 + bicgt r5, r5, #(1 << 5) @ clear data RAM setup bit + +9: and r4, r5, #0x7 @ mask for data RAM latency + cmp r4, #0x2 @ check if data RAM latency > 2 + ble 10f + bic r5, r5, #0x7 @ clear data RAM latency bits + orr r5, r5, #0x2 @ force data RAM latency = 2 +10: mcr p15, 1, r5, c9, c0, 2 @ set L2 control register +8: +#endif + mrc p15, 0, r4, c1, c0, 1 @ Read Auxiliary control register teq r4, r9 @ Is it already set? mcrne p15, 0, r9, c1, c0, 1 @ No, so write it @@ -349,6 +392,42 @@ __v7_setup: mcrle p15, 0, r10, c1, c0, 1 @ write aux control register #endif +#ifdef CONFIG_ARM_ERRATA_773769 + ALT_SMP(cmp r6, #0x21) @ present prior to r2p1 + ALT_UP_B(5f) + mrclt p15, 0, r3, c1, c0, 0 @ read system control register + andlt r3, r3, #0x4 @ mask for C bit + cmplt r3, #0x0 @ check if cache is on/off + bne 5f @ Do nothing when cache is on + /* + * if data RAM setup = 1 + * then check if effective latency i.e (latency + setup + 1) > 3 + * if true then clear data RAM setup + * goto branch 'a' + * if data RAM setup = 0 + * a: then check if data RAM latency > 0x10 + * if true then force data RAM latency = 0x10 + */ + mrceq p15, 1, r5, c9, c0, 2 @ read L2 control register + andeq r3, r5, #(1 << 5) @ mask for data RAM setup + lsreq r3, r3, #0x5 + cmpeq r3, #0x1 @ check if data RAM setup = 1 + bne 6f + and r10, r5, #0x7 @ mask for data RAM latency + add r10, r10, r3 + add r10, r10, #0x1 @ effective latency + cmp r10, #0x3 + bicgt r5, r5, #(1 << 5) @ clear data RAM setup bit + +6: and r10, r5, #0x7 @ mask for data RAM latency + cmp r10, #0x2 @ check if data RAM latency > 2 + ble 7f + bic r5, r5, #0x7 @ clear data RAM latency bits + orr r5, r5, #0x2 @ force data RAM latency = 2 +7: mcr p15, 1, r5, c9, c0, 2 @ set L2 control register +5: +#endif + 4: mov r10, #0 mcr p15, 0, r10, c7, c5, 0 @ I+BTB cache invalidate dsb
The erratum-773769 occurs on Arm Coretex-A15 (rev r2p0), when L2 Data Ram latency is set to 4 cycles or more; or when ACP is in use, or with L2 Data RAM slice configured. Therefore, the effective latency as calculated in Table 7-2 of Cotex-A15 (rev r2p0) trm should be 3 cycles or less. On Exynos5250 based systems the effective data ram latency is 4 cycles, since we have DATA_RAM_SETUP bit enabled (L2CTRL[5]=1b'1) and DATA_RAM_LATENCY bits set to 0x2 (L2CTLR[2:0]=3b'010) therefore, the effective L2 data RAM latency becomes 4 cycles. So erratum '773769' occurs causing a corrupted L2 Cache. This patch gives a workaround to the mentioned erratum, using below mentioned algo: ---------------------------------------------------------------- if data RAM setup = 1 then check if effective latency i.e (latency + setup + 1) > 3 if 'true' then clear data RAM setup goto branch 'a' if data RAM setup = 0 a: then check if data RAM latency > 0x10 if true then force data RAM latency = 0x10 ---------------------------------------------------------------- so that the effective data RAM latency reduces to 3 cycles or less and hence prevent hitting the erratum. NOTE: The Exynos5250 based products have already been shipped, which makes it impossible to add the change in bootloader, so we are adding the required change in kernel. Signed-off-by: Vivek Gautam <gautam.vivek@samsung.com> Cc: Doug Anderson <dianders@chromium.org> Cc: Olof Johansson <olofj@chromium.org> Cc: David Garbett <david.garbett@arm.com> --- arch/arm/Kconfig | 15 ++++++++ arch/arm/mach-exynos/Kconfig | 1 + arch/arm/mm/proc-v7.S | 79 ++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 95 insertions(+)