Message ID | 1359008879-9015-7-git-send-email-nicolas.pitre@linaro.org (mailing list archive) |
---|---|
State | New, archived |
On Thu, Jan 24, 2013 at 06:27:49AM +0000, Nicolas Pitre wrote:
> From: Dave Martin <dave.martin@linaro.org>
>
> Instead of requiring the first man to be elected in advance (which
> can be suboptimal in some situations), this patch uses a per-
> cluster mutex to co-ordinate selection of the first man.
>
> This should also make it more feasible to reuse this code path for
> asynchronous cluster resume (as in CPUidle scenarios).
>
> We must ensure that the vlock data doesn't share a cacheline with
> anything else, or dirty cache eviction could corrupt it.
>
> Signed-off-by: Dave Martin <dave.martin@linaro.org>
> Signed-off-by: Nicolas Pitre <nicolas.pitre@linaro.org>

[...]

> +
> +        .align  __CACHE_WRITEBACK_ORDER
> +        .type   first_man_locks, #object
> +first_man_locks:
> +        .space  VLOCK_SIZE * BL_MAX_CLUSTERS
> +        .align  __CACHE_WRITEBACK_ORDER
>
>          .type   bL_entry_vectors, #object
>  ENTRY(bL_entry_vectors)

I've just been chatting to Dave about this and __CACHE_WRITEBACK_ORDER
isn't really the correct solution here.

To summarise the problem: although vlocks are only accessed by CPUs with
their caches disabled, the lock structures could reside in the same
cacheline (at some level of cache) as cacheable data being written by
another CPU. This comes about because the vlock code has a cacheable alias
via the kernel linear mapping and means that when the cacheable data is
evicted, it clobbers the vlocks with stale values which are part of the
dirty cacheline.

Now, we also have this problem for DMA mappings, as mentioned here:

http://lists.infradead.org/pipermail/linux-arm-kernel/2012-October/124276.html

It seems to me that we actually want a mechanism for allocating/managing
physically contiguous blocks of memory such that the cacheable alias is
removed from the linear mapping (perhaps we could use PAGE_NONE to avoid
confusing the mm code?).

Will
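(Editorial note: a minimal C sketch of the failure mode Will describes, using made-up names and an assumed 64-byte writeback granule; the real structures live in assembly-level .bss, so this is illustration only.)

    /* Illustration of the hazard only -- none of these names are from the patch. */

    struct fake_vlock {                     /* stand-in for the real vlock layout */
            unsigned char owner;
            unsigned char voted[3];
    };

    /* Imagine the linker packs these two objects into the same 64-byte line: */
    static struct fake_vlock first_man_lock;   /* written with caches OFF */
    static unsigned long busy_counter;         /* written with caches ON  */

    /*
     * Failure sequence Will is pointing at:
     *  1. CPU A (caches on) increments busy_counter through the kernel linear
     *     mapping; the whole line, including first_man_lock's current bytes,
     *     is now dirty in A's cache.
     *  2. CPU B (caches off) takes the vlock by writing first_man_lock
     *     straight to memory.
     *  3. A's dirty line is evicted and written back, overwriting B's update
     *     with the stale lock bytes captured in step 1.
     *
     * Padding the locks out to __CACHE_WRITEBACK_ORDER stops other data from
     * sharing the line; Will's argument is that removing the cacheable alias
     * from the linear mapping would address the problem more fundamentally.
     */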
On Mon, 28 Jan 2013, Will Deacon wrote:
> On Thu, Jan 24, 2013 at 06:27:49AM +0000, Nicolas Pitre wrote:
> > From: Dave Martin <dave.martin@linaro.org>
> >
> > Instead of requiring the first man to be elected in advance (which
> > can be suboptimal in some situations), this patch uses a per-
> > cluster mutex to co-ordinate selection of the first man.
> >
> > This should also make it more feasible to reuse this code path for
> > asynchronous cluster resume (as in CPUidle scenarios).
> >
> > We must ensure that the vlock data doesn't share a cacheline with
> > anything else, or dirty cache eviction could corrupt it.
> >
> > Signed-off-by: Dave Martin <dave.martin@linaro.org>
> > Signed-off-by: Nicolas Pitre <nicolas.pitre@linaro.org>
>
> [...]
>
> > +
> > +        .align  __CACHE_WRITEBACK_ORDER
> > +        .type   first_man_locks, #object
> > +first_man_locks:
> > +        .space  VLOCK_SIZE * BL_MAX_CLUSTERS
> > +        .align  __CACHE_WRITEBACK_ORDER
> >
> >          .type   bL_entry_vectors, #object
> >  ENTRY(bL_entry_vectors)
>
> I've just been chatting to Dave about this and __CACHE_WRITEBACK_ORDER
> isn't really the correct solution here.
>
> To summarise the problem: although vlocks are only accessed by CPUs with
> their caches disabled, the lock structures could reside in the same
> cacheline (at some level of cache) as cacheable data being written by
> another CPU. This comes about because the vlock code has a cacheable alias
> via the kernel linear mapping and means that when the cacheable data is
> evicted, it clobbers the vlocks with stale values which are part of the
> dirty cacheline.
>
> Now, we also have this problem for DMA mappings, as mentioned here:
>
> http://lists.infradead.org/pipermail/linux-arm-kernel/2012-October/124276.html
>
> It seems to me that we actually want a mechanism for allocating/managing
> physically contiguous blocks of memory such that the cacheable alias is
> removed from the linear mapping (perhaps we could use PAGE_NONE to avoid
> confusing the mm code?).

Well, I partly disagree. I don't dispute the need for a mechanism to
allocate physically contiguous blocks of memory in the DMA case or for
other similar users of largish dynamic allocations. But that's not the
case here.

In the vlock case, what we actually need in practice is equivalent to a
_single_ cache line of cache-free memory. Requiring a dynamic allocation
infrastructure tailored for this specific case is going to waste far more
CPU cycles and memory in the end than what this static allocation is
currently doing, even if it were overallocating.

Your suggestion would be needed when we get to the point where dynamic
sizing of the number of clusters is required. But, as I said in response
to your previous comment, let's not fall into the trap of overengineering
this solution for the time being. Better to approach this incrementally
if actual usage does indicate that some more sophistication is needed.
The whole stack is already complex enough as it is, and I'd prefer if
people could get familiar with the simpler version initially.

Nicolas
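(Editorial note: to put rough numbers on the size argument, here is a C sketch using assumed values for VLOCK_SIZE, BL_MAX_CLUSTERS and the writeback granule; their real definitions live elsewhere in the series.)

    /* Assumed values for illustration; the real ones come from vlock.h,
     * asm/bL_entry.h and the cache geometry, not from this sketch. */
    #define VLOCK_SIZE              8       /* per-cluster lock, assumed    */
    #define BL_MAX_CLUSTERS         2       /* assumed                      */
    #define CACHE_WRITEBACK_GRANULE 64      /* 1 << __CACHE_WRITEBACK_ORDER */

    /*
     * C equivalent of the static reservation done in bL_head.S: the lock
     * array itself is VLOCK_SIZE * BL_MAX_CLUSTERS = 16 bytes, and even with
     * alignment padding on either side it costs at most two writeback
     * granules of .bss (roughly 128 bytes), versus pulling in a dynamic
     * allocator plus an uncached mapping for a one-off allocation this small.
     */
    static char first_man_locks[VLOCK_SIZE * BL_MAX_CLUSTERS]
            __attribute__((aligned(CACHE_WRITEBACK_GRANULE)));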
diff --git a/arch/arm/common/Makefile b/arch/arm/common/Makefile
index 8025899a20..aa797237a7 100644
--- a/arch/arm/common/Makefile
+++ b/arch/arm/common/Makefile
@@ -13,4 +13,4 @@ obj-$(CONFIG_SHARP_PARAM)      += sharpsl_param.o
 obj-$(CONFIG_SHARP_SCOOP)       += scoop.o
 obj-$(CONFIG_PCI_HOST_ITE8152)  += it8152.o
 obj-$(CONFIG_ARM_TIMER_SP804)   += timer-sp.o
-obj-$(CONFIG_BIG_LITTLE)        += bL_head.o bL_entry.o
+obj-$(CONFIG_BIG_LITTLE)        += bL_head.o bL_entry.o vlock.o
diff --git a/arch/arm/common/bL_head.S b/arch/arm/common/bL_head.S
index a226cdf4ce..86bcc8a003 100644
--- a/arch/arm/common/bL_head.S
+++ b/arch/arm/common/bL_head.S
@@ -16,6 +16,8 @@
 #include <linux/linkage.h>
 #include <asm/bL_entry.h>
 
+#include "vlock.h"
+
 .if BL_SYNC_CLUSTER_CPUS
 .error "cpus must be the first member of struct bL_cluster_sync_struct"
 .endif
@@ -64,10 +66,11 @@ ENTRY(bL_entry_point)
          * position independent way.
          */
         adr     r5, 3f
-        ldmia   r5, {r6, r7, r8}
+        ldmia   r5, {r6, r7, r8, r11}
         add     r6, r5, r6              @ r6 = bL_entry_vectors
         ldr     r7, [r5, r7]            @ r7 = bL_power_up_setup_phys
         add     r8, r5, r8              @ r8 = bL_sync
+        add     r11, r5, r11            @ r11 = first_man_locks
 
         mov     r0, #BL_SYNC_CLUSTER_SIZE
         mla     r8, r0, r10, r8         @ r8 = bL_sync cluster base
@@ -81,13 +84,22 @@ ENTRY(bL_entry_point)
         @ At this point, the cluster cannot unexpectedly enter the GOING_DOWN
         @ state, because there is at least one active CPU (this CPU).
 
-        @ Note: the following is racy as another CPU might be testing
-        @ the same flag at the same moment.  That'll be fixed later.
+        mov     r0, #VLOCK_SIZE
+        mla     r11, r0, r10, r11       @ r11 = cluster first man lock
+        mov     r0, r11
+        mov     r1, r9                  @ cpu
+        bl      vlock_trylock           @ implies DMB
+
+        cmp     r0, #0                  @ failed to get the lock?
+        bne     cluster_setup_wait      @ wait for cluster setup if so
+
         ldrb    r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
         cmp     r0, #CLUSTER_UP         @ cluster already up?
         bne     cluster_setup           @ if not, set up the cluster
 
-        @ Otherwise, skip setup:
+        @ Otherwise, release the first man lock and skip setup:
+        mov     r0, r11
+        bl      vlock_unlock
         b       cluster_setup_complete
 
 cluster_setup:
@@ -137,6 +149,19 @@ cluster_setup_leave:
         dsb
         sev
 
+        mov     r0, r11
+        bl      vlock_unlock            @ implies DMB
+        b       cluster_setup_complete
+
+        @ In the contended case, non-first men wait here for cluster setup
+        @ to complete:
+cluster_setup_wait:
+        ldrb    r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
+        cmp     r0, #CLUSTER_UP
+        wfene
+        bne     cluster_setup_wait
+        dmb
+
 cluster_setup_complete:
         @ If a platform-specific CPU setup hook is needed, it is
         @ called from here.
@@ -168,11 +193,17 @@ bL_entry_gated:
 3:      .word   bL_entry_vectors - .
         .word   bL_power_up_setup_phys - 3b
         .word   bL_sync - 3b
+        .word   first_man_locks - 3b
 
 ENDPROC(bL_entry_point)
 
         .bss
-        .align  5
+
+        .align  __CACHE_WRITEBACK_ORDER
+        .type   first_man_locks, #object
+first_man_locks:
+        .space  VLOCK_SIZE * BL_MAX_CLUSTERS
+        .align  __CACHE_WRITEBACK_ORDER
 
         .type   bL_entry_vectors, #object
 ENTRY(bL_entry_vectors)
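(Editorial note: for readers not fluent in ARM assembly, the control flow that the bL_head.S hunks above add to bL_entry_point can be rendered in C roughly as below. This is a sketch only: the helper declarations and the CLUSTER_UP value are assumptions standing in for definitions made elsewhere in the series, not code from the patch.)

    /* Rough C rendering of the first-man election above; not code from the series. */

    #define CLUSTER_UP      1   /* assumed value; really defined in asm/bL_entry.h */

    extern int  vlock_trylock(void *lock, unsigned int cpu); /* 0 = lock taken; implies DMB */
    extern void vlock_unlock(void *lock);                    /* implies DMB */
    extern void cluster_power_up_setup(void);  /* stand-in for the cluster_setup: block */
    extern void wfe(void);                     /* stand-ins for the wfe/dmb instructions */
    extern void dmb(void);

    static void bl_first_man_election(void *cluster_lock, unsigned int cpu,
                                      volatile unsigned char *cluster_state)
    {
            if (vlock_trylock(cluster_lock, cpu) == 0) {
                    /* We won the election for this cluster. */
                    if (*cluster_state != CLUSTER_UP)
                            cluster_power_up_setup();   /* ends with sev + vlock_unlock */
                    else
                            vlock_unlock(cluster_lock); /* cluster already up: just drop the lock */
            } else {
                    /* Someone else is the first man: wait for cluster setup to finish. */
                    while (*cluster_state != CLUSTER_UP)
                            wfe();                      /* woken by the first man's sev */
                    dmb();
            }
            /* ...then fall through to cluster_setup_complete in the assembly. */
    }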