Message ID | 20150402224947.GX24899@n2100.arm.linux.org.uk (mailing list archive) |
---|---|
State | New, archived |
On Thu, Apr 02, 2015 at 11:49:47PM +0100, Russell King - ARM Linux wrote:
> Several cleanups are in the patch below... I'll separate them out, but
> I'd like to hear comments on them. Basically:
>
> 1. cache-v7.S is built for ARMv7 CPUs, so there's no reason not to
>    use movw and movt when loading large constants, rather than using
>    "ldr rd, =constant".
>
> 2. we can do a much more efficient check for the erratum in
>    v7_flush_dcache_louis than we were doing - rather than putting the
>    workaround code in the fast path, we can re-organise this such that
>    we only try to run the workaround code if the LoU field is zero.
>
> 3. shift the bitfield we want to extract from the CLIDR to the
>    appropriate bit position prior to masking; this reduces the
>    complexity of the code, particularly with the SMP differences in
>    v7_flush_dcache_louis.
>
> 4. pre-shift the Cortex A9 MIDR value to be checked, and shift the
>    actual MIDR to lose the bottom four revision bits.
>
> 5. as the v7_flush_dcache_louis code is now more efficient, I see no
>    reason not to enable this workaround by default - if people really
>    want it to be disabled, they can still choose that option. This is
>    in addition to Versatile Express enabling it. Given that running
>    without this workaround can corrupt memory, I think it's only sane
>    to encourage enabling it, even though it only affects r0pX CPUs.
>
> One obvious issue comes up here though - in the case that the LoU bits
> are validly zero, we merely return from v7_flush_dcache_louis with no
> DSB or ISB. However, v7_flush_dcache_all always ends with a DSB and
> ISB, even if LoC is zero. Is this an intentional difference, or should
> v7_flush_dcache_louis always end with a DSB+ISB?

Cc'ing Lorenzo since he added this code.

The DSB+ISB at the end is usually meant to wait for the completion of
the cache maintenance operation. If there is no cache maintenance
required, you don't need the barriers (nor the DMB at the beginning of
the function), unless the semantics of this function always imply a
barrier. I assume not, since we have a DSB anyway before issuing the WFI
to put the CPU into a sleep state, but I'll wait for Lorenzo to confirm.

As for v7_flush_dcache_all always having the barriers, I guess no-one is
expecting a v7 CPU without caches.
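For context on point 1: "ldr rd, =constant" stores the constant in a
literal pool and fetches it with a PC-relative load, while the ARMv7
movw/movt pair builds it from two 16-bit immediates with no memory
access. A minimal sketch contrasting the two forms (illustrative only,
not taken from the patch):

	ldr	r1, =0x410fc090		@ pre-v7 idiom: PC-relative load from a literal pool
	movw	r1, #0xc090		@ ARMv7 idiom: r1 = 0x0000c090
	movt	r1, #0x410f		@ r1 = 0x410fc090, no load required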
On Fri, Apr 10, 2015 at 02:27:42PM +0100, Catalin Marinas wrote:
> On Thu, Apr 02, 2015 at 11:49:47PM +0100, Russell King - ARM Linux wrote:
> > Several cleanups are in the patch below... I'll separate them out, but
> > I'd like to hear comments on them. Basically:
> >
> > 1. cache-v7.S is built for ARMv7 CPUs, so there's no reason not to
> >    use movw and movt when loading large constants, rather than using
> >    "ldr rd, =constant".
> >
> > 2. we can do a much more efficient check for the erratum in
> >    v7_flush_dcache_louis than we were doing - rather than putting the
> >    workaround code in the fast path, we can re-organise this such that
> >    we only try to run the workaround code if the LoU field is zero.
> >
> > 3. shift the bitfield we want to extract from the CLIDR to the
> >    appropriate bit position prior to masking; this reduces the
> >    complexity of the code, particularly with the SMP differences in
> >    v7_flush_dcache_louis.
> >
> > 4. pre-shift the Cortex A9 MIDR value to be checked, and shift the
> >    actual MIDR to lose the bottom four revision bits.
> >
> > 5. as the v7_flush_dcache_louis code is now more efficient, I see no
> >    reason not to enable this workaround by default - if people really
> >    want it to be disabled, they can still choose that option. This is
> >    in addition to Versatile Express enabling it. Given that running
> >    without this workaround can corrupt memory, I think it's only sane
> >    to encourage enabling it, even though it only affects r0pX CPUs.
> >
> > One obvious issue comes up here though - in the case that the LoU bits
> > are validly zero, we merely return from v7_flush_dcache_louis with no
> > DSB or ISB. However, v7_flush_dcache_all always ends with a DSB and
> > ISB, even if LoC is zero. Is this an intentional difference, or should
> > v7_flush_dcache_louis always end with a DSB+ISB?
>
> Cc'ing Lorenzo since he added this code.
>
> The DSB+ISB at the end is usually meant to wait for the completion of
> the cache maintenance operation. If there is no cache maintenance
> required, you don't need the barriers (nor the DMB at the beginning of
> the function), unless the semantics of this function always imply a
> barrier. I assume not, since we have a DSB anyway before issuing the WFI
> to put the CPU into a sleep state, but I'll wait for Lorenzo to confirm.

I do not think the cache functions always imply barrier semantics; as
you said, the barriers are there to ensure completion of cache
operations, so if there is nothing to flush the barriers should not be
there.

> As for v7_flush_dcache_all always having the barriers, I guess no-one
> is expecting a v7 CPU without caches.

Well, yes, but actually I'd rather remove the barriers when LoC is 0
than add them to the LoUIS API for the case where there is nothing to
flush; I do not think that's a problem in the first place.

Lorenzo
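For reference, the barriers under discussion bracket the set/way
maintenance like this (an illustrative sketch, not code from the
series):

	dmb					@ order prior memory accesses before the maintenance
	mcr	p15, 0, r11, c7, c14, 2		@ DCCISW: clean and invalidate one line by set/way
	dsb					@ wait for the maintenance to complete
	isb					@ resynchronise the instruction stream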
On Fri, Apr 10, 2015 at 03:11:05PM +0100, Lorenzo Pieralisi wrote:
> On Fri, Apr 10, 2015 at 02:27:42PM +0100, Catalin Marinas wrote:
> > As for v7_flush_dcache_all always having the barriers, I guess no-one
> > is expecting a v7 CPU without caches.
>
> Well, yes, but actually I'd rather remove the barriers when LoC is 0
> than add them to the LoUIS API for the case where there is nothing to
> flush; I do not think that's a problem in the first place.

Maybe you should read through the patch set before declaring there to
be no problem.

There may be no problem as the code stands right now, and we can bury
our heads in the sand and continue as though there's no problem
whatsoever, and we'd be right to, but...

We need to have 643719 enabled for all platforms. Right now, the code
in cache-v7 surrounding that workaround is pretty crap - enabling the
option places the workaround in the execution path whether it applies
or not.

What this patch series does is rearrange things to avoid the workaround
where possible, while cleaning up a lot of the crap coding that was done
here in the first place.

In doing this, I spotted that if we _solve_ the above issue by making
both functions behave in the same way, we get more common code paths.

So, while the issue doesn't _cause_ a problem, fixing the disparity
between the two functions is worth it just to be able to have a
flush_levels implementation which behaves the same no matter how it's
called - one which always has a dmb before it, and either always has
the dsb+isb afterwards, or only has the dsb+isb afterwards if it has
done some cache flushing.

Look at the resulting code after applying all the patches, and I'm sure
you'll agree that it's a lot better than what is currently there. To
save you having to look up the files and apply the patches, here are
the old and new versions:

============= old ====================
ENTRY(v7_flush_dcache_louis)
	dmb					@ ensure ordering with previous memory accesses
	mrc	p15, 1, r0, c0, c0, 1		@ read clidr, r0 = clidr
	ALT_SMP(ands	r3, r0, #(7 << 21))	@ extract LoUIS from clidr
	ALT_UP(ands	r3, r0, #(7 << 27))	@ extract LoUU from clidr
#ifdef CONFIG_ARM_ERRATA_643719
	ALT_SMP(mrceq	p15, 0, r2, c0, c0, 0)	@ read main ID register
	ALT_UP(reteq	lr)			@ LoUU is zero, so nothing to do
	ldreq	r1, =0x410fc090			@ ID of ARM Cortex A9 r0p?
	biceq	r2, r2, #0x0000000f		@ clear minor revision number
	teqeq	r2, r1				@ test for errata affected core and if so...
	orreqs	r3, #(1 << 21)			@ fix LoUIS value (and set flags state to 'ne')
#endif
	ALT_SMP(mov	r3, r3, lsr #20)	@ r3 = LoUIS * 2
	ALT_UP(mov	r3, r3, lsr #26)	@ r3 = LoUU * 2
	reteq	lr				@ return if level == 0
	mov	r10, #0				@ r10 (starting level) = 0
	b	flush_levels			@ start flushing cache levels

ENTRY(v7_flush_dcache_all)
	dmb					@ ensure ordering with previous memory accesses
	mrc	p15, 1, r0, c0, c0, 1		@ read clidr
	ands	r3, r0, #0x7000000		@ extract loc from clidr
	mov	r3, r3, lsr #23			@ left align loc bit field
	beq	finished			@ if loc is 0, then no need to clean
	mov	r10, #0				@ start clean at cache level 0
flush_levels:
	add	r2, r10, r10, lsr #1		@ work out 3x current cache level
	mov	r1, r0, lsr r2			@ extract cache type bits from clidr
	...
============= new ====================
ENTRY(v7_flush_dcache_louis)
	mrc	p15, 1, r0, c0, c0, 1		@ read clidr, r0 = clidr
ALT_SMP(mov	r3, r0, lsr #20)		@ move LoUIS into position
ALT_UP(	mov	r3, r0, lsr #26)		@ move LoUU into position
#ifdef CONFIG_ARM_ERRATA_643719
ALT_SMP(ands	r3, r3, #7 << 1)		@ extract LoU*2 field from clidr
ALT_UP(	b	start_flush_levels)
	bne	start_flush_levels		@ LoU != 0, start flushing
	mrc	p15, 0, r2, c0, c0, 0		@ read main ID register
	movw	r1, #:lower16:(0x410fc090 >> 4)	@ ID of ARM Cortex A9 r0p?
	movt	r1, #:upper16:(0x410fc090 >> 4)
	teq	r1, r2, lsr #4			@ test for errata affected core and if so...
	moveq	r3, #1 << 1			@ fix LoUIS value (and set flags state to 'ne')
#endif
	b	start_flush_levels		@ start flushing cache levels

ENTRY(v7_flush_dcache_all)
	mrc	p15, 1, r0, c0, c0, 1		@ read clidr
	mov	r3, r0, lsr #23			@ move LoC into position
start_flush_levels:
	dmb					@ ensure ordering with previous memory accesses
	ands	r3, r3, #7 << 1			@ extract field from clidr
	beq	finished			@ if loc is 0, then no need to clean
	mov	r10, #0				@ start clean at cache level 0
flush_levels:
	add	r2, r10, r10, lsr #1		@ work out 3x current cache level
	mov	r1, r0, lsr r2			@ extract cache type bits from clidr
	...

Which do you think is better, the old version or the new version?
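For reference, the arithmetic behind the new shifts (not part of the
patch): CLIDR holds LoUIS at bits [23:21], LoC at bits [26:24] and LoUU
at bits [29:27], so a right shift by 20, 23 or 26 respectively lands the
wanted field at bit 1, and masking with 7 << 1 leaves "level * 2" in r3,
which is exactly the form the flush loop consumes:

	@ one of the following, depending on the field wanted:
	mov	r3, r0, lsr #20			@ CLIDR[23:21] -> r3[3:1], i.e. LoUIS * 2
	mov	r3, r0, lsr #23			@ CLIDR[26:24] -> r3[3:1], i.e. LoC * 2
	mov	r3, r0, lsr #26			@ CLIDR[29:27] -> r3[3:1], i.e. LoUU * 2
	@ then:
	ands	r3, r3, #7 << 1			@ isolate the field; Z set when the level is 0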
On Fri, Apr 10, 2015 at 03:26:55PM +0100, Russell King - ARM Linux wrote:
> On Fri, Apr 10, 2015 at 03:11:05PM +0100, Lorenzo Pieralisi wrote:
> > On Fri, Apr 10, 2015 at 02:27:42PM +0100, Catalin Marinas wrote:
> > > As for v7_flush_dcache_all always having the barriers, I guess
> > > no-one is expecting a v7 CPU without caches.
> >
> > Well, yes, but actually I'd rather remove the barriers when LoC is 0
> > than add them to the LoUIS API for the case where there is nothing to
> > flush; I do not think that's a problem in the first place.
>
> Maybe you should read through the patch set before declaring there to
> be no problem.

I don't think anyone was disputing the code clean-up, only whether the
cache flushing functions need to have barrier semantics even when there
isn't any level to flush.

Looking through the code, flush_cache_louis() is used in cpu_die(),
where in some circumstances a DSB would be needed (such as when the CPU
is killed from another CPU rather than by executing WFI itself). To
avoid duplicating barriers in cpu_die(), and since in practice LoUIS is
at least 1 (unless you have unified caches at level 1), we could always
require them in v7_flush_dcache_louis.

> ENTRY(v7_flush_dcache_louis)
> 	mrc	p15, 1, r0, c0, c0, 1		@ read clidr, r0 = clidr
> ALT_SMP(mov	r3, r0, lsr #20)		@ move LoUIS into position
> ALT_UP(	mov	r3, r0, lsr #26)		@ move LoUU into position
> #ifdef CONFIG_ARM_ERRATA_643719
> ALT_SMP(ands	r3, r3, #7 << 1)		@ extract LoU*2 field from clidr
> ALT_UP(	b	start_flush_levels)
> 	bne	start_flush_levels		@ LoU != 0, start flushing
> 	mrc	p15, 0, r2, c0, c0, 0		@ read main ID register
> 	movw	r1, #:lower16:(0x410fc090 >> 4)	@ ID of ARM Cortex A9 r0p?
> 	movt	r1, #:upper16:(0x410fc090 >> 4)
> 	teq	r1, r2, lsr #4			@ test for errata affected core and if so...
> 	moveq	r3, #1 << 1			@ fix LoUIS value (and set flags state to 'ne')
> #endif
> 	b	start_flush_levels		@ start flushing cache levels
>
> ENTRY(v7_flush_dcache_all)
> 	mrc	p15, 1, r0, c0, c0, 1		@ read clidr
> 	mov	r3, r0, lsr #23			@ move LoC into position
> start_flush_levels:
> 	dmb					@ ensure ordering with previous memory accesses
> 	ands	r3, r3, #7 << 1			@ extract field from clidr
> 	beq	finished			@ if loc is 0, then no need to clean
> 	mov	r10, #0				@ start clean at cache level 0
> flush_levels:
> 	add	r2, r10, r10, lsr #1		@ work out 3x current cache level
> 	mov	r1, r0, lsr r2			@ extract cache type bits from clidr

Since the DMB is only needed to order prior memory accesses against the
cache maintenance itself, you could move it to just before
'flush_levels' (rather theoretical, as I'm not aware of any ARMv7 CPUs
with a unified level 1 cache).

With that said, you can add my reviewed-by on all patches:

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
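A sketch of the placement Catalin suggests, using the labels from the
new version (an assumption about how it could look, not what the series
currently does):

start_flush_levels:
	ands	r3, r3, #7 << 1			@ extract field from clidr
	beq	finished			@ nothing to flush: return without barriers
	mov	r10, #0				@ start clean at cache level 0
	dmb					@ order prior accesses only when we will flush
flush_levels:
	...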
On Fri, Apr 10, 2015 at 03:26:55PM +0100, Russell King - ARM Linux wrote:
> On Fri, Apr 10, 2015 at 03:11:05PM +0100, Lorenzo Pieralisi wrote:
> > On Fri, Apr 10, 2015 at 02:27:42PM +0100, Catalin Marinas wrote:
> > > As for v7_flush_dcache_all always having the barriers, I guess
> > > no-one is expecting a v7 CPU without caches.
> >
> > Well, yes, but actually I'd rather remove the barriers when LoC is 0
> > than add them to the LoUIS API for the case where there is nothing to
> > flush; I do not think that's a problem in the first place.
>
> Maybe you should read through the patch set before declaring there to
> be no problem.
>
> There may be no problem as the code stands right now, and we can bury
> our heads in the sand and continue as though there's no problem
> whatsoever, and we'd be right to, but...
>
> We need to have 643719 enabled for all platforms. Right now, the code
> in cache-v7 surrounding that workaround is pretty crap - enabling the
> option places the workaround in the execution path whether it applies
> or not.
>
> What this patch series does is rearrange things to avoid the workaround
> where possible, while cleaning up a lot of the crap coding that was
> done here in the first place.
>
> In doing this, I spotted that if we _solve_ the above issue by making
> both functions behave in the same way, we get more common code paths.

I certainly agree it is a good clean-up; I was referring to the barriers
and the barrier semantics.

> So, while the issue doesn't _cause_ a problem, fixing the disparity
> between the two functions is worth it just to be able to have a
> flush_levels implementation which behaves the same no matter how it's
> called - one which always has a dmb before it, and either always has
> the dsb+isb afterwards, or only has the dsb+isb afterwards if it has
> done some cache flushing.

In actual fact the latter is what's required; for practical purposes the
former is acceptable as a way to fix the disparity.

> Look at the resulting code after applying all the patches, and I'm sure
> you'll agree that it's a lot better than what is currently there. To
> save you having to look up the files and apply the patches, here are
> the old and new versions:
>
> ============= old ====================
> ENTRY(v7_flush_dcache_louis)
> 	dmb					@ ensure ordering with previous memory accesses
> 	mrc	p15, 1, r0, c0, c0, 1		@ read clidr, r0 = clidr
> 	ALT_SMP(ands	r3, r0, #(7 << 21))	@ extract LoUIS from clidr
> 	ALT_UP(ands	r3, r0, #(7 << 27))	@ extract LoUU from clidr
> #ifdef CONFIG_ARM_ERRATA_643719
> 	ALT_SMP(mrceq	p15, 0, r2, c0, c0, 0)	@ read main ID register
> 	ALT_UP(reteq	lr)			@ LoUU is zero, so nothing to do
> 	ldreq	r1, =0x410fc090			@ ID of ARM Cortex A9 r0p?
> 	biceq	r2, r2, #0x0000000f		@ clear minor revision number
> 	teqeq	r2, r1				@ test for errata affected core and if so...
> 	orreqs	r3, #(1 << 21)			@ fix LoUIS value (and set flags state to 'ne')
> #endif
> 	ALT_SMP(mov	r3, r3, lsr #20)	@ r3 = LoUIS * 2
> 	ALT_UP(mov	r3, r3, lsr #26)	@ r3 = LoUU * 2
> 	reteq	lr				@ return if level == 0
> 	mov	r10, #0				@ r10 (starting level) = 0
> 	b	flush_levels			@ start flushing cache levels
>
> ENTRY(v7_flush_dcache_all)
> 	dmb					@ ensure ordering with previous memory accesses
> 	mrc	p15, 1, r0, c0, c0, 1		@ read clidr
> 	ands	r3, r0, #0x7000000		@ extract loc from clidr
> 	mov	r3, r3, lsr #23			@ left align loc bit field
> 	beq	finished			@ if loc is 0, then no need to clean
> 	mov	r10, #0				@ start clean at cache level 0
> flush_levels:
> 	add	r2, r10, r10, lsr #1		@ work out 3x current cache level
> 	mov	r1, r0, lsr r2			@ extract cache type bits from clidr
> 	...
> ============= new ====================
> ENTRY(v7_flush_dcache_louis)
> 	mrc	p15, 1, r0, c0, c0, 1		@ read clidr, r0 = clidr
> ALT_SMP(mov	r3, r0, lsr #20)		@ move LoUIS into position
> ALT_UP(	mov	r3, r0, lsr #26)		@ move LoUU into position
> #ifdef CONFIG_ARM_ERRATA_643719
> ALT_SMP(ands	r3, r3, #7 << 1)		@ extract LoU*2 field from clidr
> ALT_UP(	b	start_flush_levels)
> 	bne	start_flush_levels		@ LoU != 0, start flushing
> 	mrc	p15, 0, r2, c0, c0, 0		@ read main ID register
> 	movw	r1, #:lower16:(0x410fc090 >> 4)	@ ID of ARM Cortex A9 r0p?
> 	movt	r1, #:upper16:(0x410fc090 >> 4)
> 	teq	r1, r2, lsr #4			@ test for errata affected core and if so...
> 	moveq	r3, #1 << 1			@ fix LoUIS value (and set flags state to 'ne')
> #endif
> 	b	start_flush_levels		@ start flushing cache levels
>
> ENTRY(v7_flush_dcache_all)
> 	mrc	p15, 1, r0, c0, c0, 1		@ read clidr
> 	mov	r3, r0, lsr #23			@ move LoC into position
> start_flush_levels:
> 	dmb					@ ensure ordering with previous memory accesses
> 	ands	r3, r3, #7 << 1			@ extract field from clidr
> 	beq	finished			@ if loc is 0, then no need to clean
> 	mov	r10, #0				@ start clean at cache level 0
> flush_levels:
> 	add	r2, r10, r10, lsr #1		@ work out 3x current cache level
> 	mov	r1, r0, lsr r2			@ extract cache type bits from clidr
> 	...
>
> Which do you think is better, the old version or the new version?

The old version - joking ;). I was not referring to the clean-up, which
is a good thing on its own; we just have to agree on the barrier
semantics in case the cache flushing level turns out to be 0, see above.

Thanks,
Lorenzo
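The "latter" behaviour Lorenzo prefers - dsb+isb only when something was
actually flushed - could look roughly like this (a sketch under that
assumption; the 'skip' label is hypothetical and not in the series):

start_flush_levels:
	ands	r3, r3, #7 << 1			@ extract field from clidr
	beq	skip				@ level 0: nothing flushed, no barriers needed
	dmb					@ ensure ordering with previous memory accesses
	mov	r10, #0				@ start clean at cache level 0
	...					@ flush loop as before, falling through to:
finished:
	dsb					@ wait for completion of the flushing
	isb					@ resynchronise the instruction stream
skip:
	ret	lr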
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 2eb6de9465bf..c26dfef393cd 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1139,6 +1139,7 @@ config ARM_ERRATA_742231
 config ARM_ERRATA_643719
 	bool "ARM errata: LoUIS bit field in CLIDR register is incorrect"
 	depends on CPU_V7 && SMP
+	default y
 	help
 	  This option enables the workaround for the 643719 Cortex-A9 (prior to
 	  r1p0) erratum. On affected cores the LoUIS bit field of the CLIDR
diff --git a/arch/arm/mm/cache-v7.S b/arch/arm/mm/cache-v7.S
index b966656d2c2d..1010bebe05eb 100644
--- a/arch/arm/mm/cache-v7.S
+++ b/arch/arm/mm/cache-v7.S
@@ -36,10 +36,10 @@ ENTRY(v7_invalidate_l1)
 	mcr	p15, 2, r0, c0, c0, 0
 	mrc	p15, 1, r0, c0, c0, 0

-	ldr	r1, =0x7fff
+	movw	r1, #0x7fff
 	and	r2, r1, r0, lsr #13

-	ldr	r1, =0x3ff
+	movw	r1, #0x3ff
 	and	r3, r1, r0, lsr #3	@ NumWays - 1
 	add	r2, r2, #1		@ NumSets

@@ -90,21 +90,20 @@ ENDPROC(v7_flush_icache_all)
 ENTRY(v7_flush_dcache_louis)
 	dmb					@ ensure ordering with previous memory accesses
 	mrc	p15, 1, r0, c0, c0, 1		@ read clidr, r0 = clidr
-	ALT_SMP(ands	r3, r0, #(7 << 21))	@ extract LoUIS from clidr
-	ALT_UP(ands	r3, r0, #(7 << 27))	@ extract LoUU from clidr
+ALT_SMP(mov	r3, r0, lsr #20)		@ move LoUIS into position
+ALT_UP(	mov	r3, r0, lsr #26)		@ move LoUU into position
+	ands	r3, r3, #7 << 1			@ extract LoU field from clidr
+	bne	start_flush_levels		@ LoU != 0, start flushing
 #ifdef CONFIG_ARM_ERRATA_643719
-	ALT_SMP(mrceq	p15, 0, r2, c0, c0, 0)	@ read main ID register
-	ALT_UP(reteq	lr)			@ LoUU is zero, so nothing to do
-	ldreq	r1, =0x410fc090			@ ID of ARM Cortex A9 r0p?
-	biceq	r2, r2, #0x0000000f		@ clear minor revision number
-	teqeq	r2, r1				@ test for errata affected core and if so...
-	orreqs	r3, #(1 << 21)			@ fix LoUIS value (and set flags state to 'ne')
+ALT_SMP(mrc	p15, 0, r2, c0, c0, 0)		@ read main ID register
+ALT_UP(	ret	lr)				@ LoUU is zero, so nothing to do
+	movw	r1, #:lower16:(0x410fc090 >> 4)	@ ID of ARM Cortex A9 r0p?
+	movt	r1, #:upper16:(0x410fc090 >> 4)
+	teq	r1, r2, lsr #4			@ test for errata affected core and if so...
+	moveq	r3, #1 << 1			@ fix LoUIS value (and set flags state to 'ne')
+	beq	start_flush_levels		@ start flushing cache levels
 #endif
-	ALT_SMP(mov	r3, r3, lsr #20)	@ r3 = LoUIS * 2
-	ALT_UP(mov	r3, r3, lsr #26)	@ r3 = LoUU * 2
-	reteq	lr				@ return if level == 0
-	mov	r10, #0				@ r10 (starting level) = 0
-	b	flush_levels			@ start flushing cache levels
+	ret	lr
 ENDPROC(v7_flush_dcache_louis)

 /*
@@ -119,9 +118,10 @@ ENDPROC(v7_flush_dcache_louis)
 ENTRY(v7_flush_dcache_all)
 	dmb					@ ensure ordering with previous memory accesses
 	mrc	p15, 1, r0, c0, c0, 1		@ read clidr
-	ands	r3, r0, #0x7000000		@ extract loc from clidr
-	mov	r3, r3, lsr #23			@ left align loc bit field
+	mov	r3, r0, lsr #23			@ align LoC
+	ands	r3, r3, #7 << 1			@ extract loc from clidr
 	beq	finished			@ if loc is 0, then no need to clean
+start_flush_levels:
 	mov	r10, #0				@ start clean at cache level 0
 flush_levels:
 	add	r2, r10, r10, lsr #1		@ work out 3x current cache level
@@ -140,10 +140,10 @@ flush_levels:
 #endif
 	and	r2, r1, #7			@ extract the length of the cache lines
 	add	r2, r2, #4			@ add 4 (line length offset)
-	ldr	r4, =0x3ff
+	movw	r4, #0x3ff
 	ands	r4, r4, r1, lsr #3		@ find maximum number on the way size
 	clz	r5, r4				@ find bit position of way size increment
-	ldr	r7, =0x7fff
+	movw	r7, #0x7fff
 	ands	r7, r7, r1, lsr #13		@ extract max number of the index size
loop1:
 	mov	r9, r7				@ create working copy of max index
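For reference, the arithmetic behind the pre-shifted MIDR check in the
hunk above: the Cortex-A9 MIDR is 0x410fc09X, with the minor revision
(pX) in bits [3:0]. Shifting both the reference value and the live MIDR
right by four discards that revision, so a single teq matches any r0pX
part:

	mrc	p15, 0, r2, c0, c0, 0		@ read MIDR
	movw	r1, #:lower16:(0x410fc090 >> 4)	@ r1 = 0x0410fc09, built from
	movt	r1, #:upper16:(0x410fc090 >> 4)	@ two immediates, no literal pool
	teq	r1, r2, lsr #4			@ Z set on any Cortex-A9 r0pX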