Message ID | 1358268498-8086-4-git-send-email-dave.martin@linaro.org (mailing list archive) |
---|---|
State | New, archived |
+ Catalin, RMK Dave, On Tuesday 15 January 2013 10:18 PM, Dave Martin wrote: > For architectural correctness even Strongly-Ordered memory accesses > require barriers in order to guarantee that multiple CPUs have a > coherent view of the ordering of memory accesses. > > Virtually everything done by this early code is done via explicit > memory access only, so DSBs are seldom required. Existing barriers > are demoted to DMB, except where a DSB is needed to synchronise > non-memory signalling (i.e., before a SEV). If a particular > platform performs cache maintenance in its power_up_setup function, > it should force it to complete explicitly including a DSB, instead > of relying on the bL_head framework code to do it. > > Some additional DMBs are added to ensure all the memory ordering > properties required by the race avoidance algorithm. DMBs are also > moved out of loops, and for clarity some are moved so that most > directly follow the memory operation which needs to be > synchronised. > > The setting of a CPU's bL_entry_vectors[] entry is also required to > act as a synchronisation point, so a DMB is added after checking > that entry to ensure that other CPUs do not observe gated > operations leaking across the opening of the gate. > > Signed-off-by: Dave Martin <dave.martin@linaro.org> > --- Sorry to pick on this again but I am not able to understand why the strongly ordered access needs barriers. At least from the ARM point of view, a strongly ordered write will be more of a blocking write, and the further interconnect is also supposed to respect that rule. SO reads/writes are like adding a barrier after every load/store, so adding explicit barriers doesn't make sense. Is this a side effect of some "write early response" kind of optimization at the interconnect level? Will you be able to point to specs or documents which state this requirement? Regards santosh
On Wed, Jan 16, 2013 at 12:20:47PM +0530, Santosh Shilimkar wrote: > + Catalin, RMK > > Dave, > > On Tuesday 15 January 2013 10:18 PM, Dave Martin wrote: > >For architectural correctness even Strongly-Ordered memory accesses > >require barriers in order to guarantee that multiple CPUs have a > >coherent view of the ordering of memory accesses. > > > >Virtually everything done by this early code is done via explicit > >memory access only, so DSBs are seldom required. Existing barriers > >are demoted to DMB, except where a DSB is needed to synchronise > >non-memory signalling (i.e., before a SEV). If a particular > >platform performs cache maintenance in its power_up_setup function, > >it should force it to complete explicitly including a DSB, instead > >of relying on the bL_head framework code to do it. > > > >Some additional DMBs are added to ensure all the memory ordering > >properties required by the race avoidance algorithm. DMBs are also > >moved out of loops, and for clarity some are moved so that most > >directly follow the memory operation which needs to be > >synchronised. > > > >The setting of a CPU's bL_entry_vectors[] entry is also required to > >act as a synchronisation point, so a DMB is added after checking > >that entry to ensure that other CPUs do not observe gated > >operations leaking across the opening of the gate. > > > >Signed-off-by: Dave Martin <dave.martin@linaro.org> > >--- > > Sorry to pick on this again but I am not able to understand why > the strongly ordered access needs barriers. At least from the > ARM point of view, a strongly ordered write will be more of blocking > write and the further interconnect also is suppose to respect that This is what I originally assumed (hence the absence of barriers in the initial patch). > rule. SO read writes are like adding barrier after every load store This assumption turns out to be wrong, unfortunately, although in a uniprocessor scenario it makes no difference. A SO memory access does block the CPU making the access, but explicitly does not block the interconnect. In a typical boot scenario for example, all secondary CPUs are quiescent or powered down, so there's no problem. But we can't make the same assumptions when we're trying to coordinate between multiple active CPUs. > so adding explicit barriers doesn't make sense. Is this a side > effect of some "write early response" kind of optimizations at > interconnect level ? Strongly-Ordered accesses are always non-shareable, so there is no explicit guarantee of coherency between multiple masters. If there is only one master, it makes no difference, but if there are multiple masters, there is no guarantee that they are connected to a slave device (DRAM controller in this case) via a single slave port. The architecture only guarantees global serialisation when there is a single slave device, but provides no way to know whether two accesses from different masters will reach the same slave port. This is in the realms of "implementation defined." Unfortunately, a high-performance component like a DRAM controller is exactly the kind of component which may implement multiple master ports, so you can't guarantee that accesses are serialised in the same order from the perspective of all masters. There may be some pipelining and caching between each master port and the actual memory, for example. This is allowed, because there is no requirement for the DMC to look like a single slave device from the perspective of multiple masters.
A multi-ported slave might provide transparent coherency between master ports, but it is only required to guarantee this when the accesses are shareable (SO is always non-shared), or when explicit barriers are used to force synchronisation between the device's master ports. Of course, a given platform may have a DMC with only one slave port, in which case the barriers should not be needed. But I wanted this code to be generic enough to be reusable -- hence the addition of the barriers. The CPU does not need to wait for a DMB to "complete" in any sense, so this does not necessarily have a meaningful impact on performance. This is my understanding anyway. > Will you be able to point to specs or documents which puts > this requirement ? Unfortunately, this is one of those things which we require not because there is a statement in the ARM ARM to say that we need it -- rather, there is no statement in the ARM ARM to say that we don't. Cheers ---Dave
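A minimal sketch of the DMB-versus-DSB distinction described above, loosely modelled on the bL_head.S changes later in this thread (the register assignment and constant are illustrative assumptions, not the literal patch):

	@ Publish a state change that other CPUs only observe with loads:
	@ a DMB is enough to order the store against the surrounding
	@ accesses, and the CPU does not have to stall until it completes.
	dmb
	mov	r0, #CPU_UP		@ illustrative constant
	strb	r0, [r5]		@ r5 = this CPU's sync slot (assumed)

	@ Wake CPUs that may be waiting in WFE: SEV is not a memory access,
	@ so a DSB is needed to make the store above visible before the
	@ event is sent.
	dsb
	sev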
On Wednesday 16 January 2013 05:19 PM, Dave Martin wrote: > On Wed, Jan 16, 2013 at 12:20:47PM +0530, Santosh Shilimkar wrote: >> + Catalin, RMK >> >> Dave, >> >> On Tuesday 15 January 2013 10:18 PM, Dave Martin wrote: >>> For architectural correctness even Strongly-Ordered memory accesses >>> require barriers in order to guarantee that multiple CPUs have a >>> coherent view of the ordering of memory accesses. >>> >>> Virtually everything done by this early code is done via explicit >>> memory access only, so DSBs are seldom required. Existing barriers >>> are demoted to DMB, except where a DSB is needed to synchronise >>> non-memory signalling (i.e., before a SEV). If a particular >>> platform performs cache maintenance in its power_up_setup function, >>> it should force it to complete explicitly including a DSB, instead >>> of relying on the bL_head framework code to do it. >>> >>> Some additional DMBs are added to ensure all the memory ordering >>> properties required by the race avoidance algorithm. DMBs are also >>> moved out of loops, and for clarity some are moved so that most >>> directly follow the memory operation which needs to be >>> synchronised. >>> >>> The setting of a CPU's bL_entry_vectors[] entry is also required to >>> act as a synchronisation point, so a DMB is added after checking >>> that entry to ensure that other CPUs do not observe gated >>> operations leaking across the opening of the gate. >>> >>> Signed-off-by: Dave Martin <dave.martin@linaro.org> >>> --- >> >> Sorry to pick on this again but I am not able to understand why >> the strongly ordered access needs barriers. At least from the >> ARM point of view, a strongly ordered write will be more of blocking >> write and the further interconnect also is suppose to respect that > > This is what I originally assumed (hence the absence of barriers in > the initial patch). > >> rule. SO read writes are like adding barrier after every load store > > This assumption turns out to be wrong, unfortunately, although in > a uniprocessor scenario is makes no difference. A SO memory access > does block the CPU making the access, but explicitly does not > block the interconnect. > I suspected the interconnect part when you described the barrier need for SO memory region. > In a typical boot scenario for example, all secondary CPUs are > quiescent or powered down, so there's no problem. But we can't make > the same assumptions when we're trying to coordinate between > multiple active CPUs. > >> so adding explicit barriers doesn't make sense. Is this a side >> effect of some "write early response" kind of optimizations at >> interconnect level ? > > Strongly-Ordered accesses are always non-shareable, so there is > no explicit guarantee of coherency between multiple masters. > This is probably where the issue is, then. My understanding is exactly the opposite here, and hence I wasn't worried about the multi-master CPU scenario since the shareable attributes would be taking care of it, considering the same page tables being used in an SMP system.
ARM documentation says - ------------ Shareability and the S bit, with TEX remap The memory type of a region, as indicated in the Memory type column of Table B3-12 on page B3-1350, provides the first level of control of whether the region is shareable: • If the memory type is Strongly-ordered then the region is Shareable ------------------------------------------------------------ > If there is only one master, it makes no difference, but if there > are multiple masters, there is no guarantee that they are conntected > to a slave device (DRAM controller in this case) via a single > slave port. > See above. We are talking about multiple CPUs here and not the DSP or other co-processors. In either case we are discussing code which is getting executed on ARM CPUs, so we can safely limit it to the multi-master ARM CPU case. > The architecture only guarantees global serialisation when there is a > single slave device, but provides no way to know whether two accesses > from different masters will reach the same slave port. This is in the > realms of "implementation defined." > > Unfortunately, a high-performance component like a DRAM controller > is exactly the kind of component which may implement multiple > master ports, so you can't guarantee that accesses are serialised > in the same order from the perspective of all masters. There may > be some pipelining and caching between each master port and the actual > memory, for example. This is allowed, because there is no requirement > for the DMC to look like a single slave device from the perspective > of multiple masters. > > A multi-ported slave might provide transparent coherency between master > ports, but it is only required to guarantee this when the accesses > are shareable (SO is always non-shared), or when explicit barriers > are used to force synchronisation between the device's master ports. > > Of course, a given platform may have a DMC with only one slave > port, in which case the barriers should not be needed. But I wanted > this code to be generic enough to be reusable -- hence the > addition of the barriers. The CPU does not need to wait for a DMB > to "complete" in any sense, so this does not necessarily have a > meaningful impact on performance. > > This is my understanding anyway. > >> Will you be able to point to specs or documents which puts >> this requirement ? > > Unfortunately, this is one of this things which we require not because > there is a statement in the ARM ARM to say that we need it -- rather, > there is no statement in the ARM ARM to say that we don't. > Thanks a lot for the elaborate answer. It helps to understand the rationale at least. Regards Santosh
On Wed, Jan 16, 2013 at 05:41:00PM +0530, Santosh Shilimkar wrote: > On Wednesday 16 January 2013 05:19 PM, Dave Martin wrote: > >On Wed, Jan 16, 2013 at 12:20:47PM +0530, Santosh Shilimkar wrote: > >>+ Catalin, RMK > >> > >>Dave, > >> > >>On Tuesday 15 January 2013 10:18 PM, Dave Martin wrote: > >>>For architectural correctness even Strongly-Ordered memory accesses > >>>require barriers in order to guarantee that multiple CPUs have a > >>>coherent view of the ordering of memory accesses. > >>> > >>>Virtually everything done by this early code is done via explicit > >>>memory access only, so DSBs are seldom required. Existing barriers > >>>are demoted to DMB, except where a DSB is needed to synchronise > >>>non-memory signalling (i.e., before a SEV). If a particular > >>>platform performs cache maintenance in its power_up_setup function, > >>>it should force it to complete explicitly including a DSB, instead > >>>of relying on the bL_head framework code to do it. > >>> > >>>Some additional DMBs are added to ensure all the memory ordering > >>>properties required by the race avoidance algorithm. DMBs are also > >>>moved out of loops, and for clarity some are moved so that most > >>>directly follow the memory operation which needs to be > >>>synchronised. > >>> > >>>The setting of a CPU's bL_entry_vectors[] entry is also required to > >>>act as a synchronisation point, so a DMB is added after checking > >>>that entry to ensure that other CPUs do not observe gated > >>>operations leaking across the opening of the gate. > >>> > >>>Signed-off-by: Dave Martin <dave.martin@linaro.org> > >>>--- > >> > >>Sorry to pick on this again but I am not able to understand why > >>the strongly ordered access needs barriers. At least from the > >>ARM point of view, a strongly ordered write will be more of blocking > >>write and the further interconnect also is suppose to respect that > > > >This is what I originally assumed (hence the absence of barriers in > >the initial patch). > > > >>rule. SO read writes are like adding barrier after every load store > > > >This assumption turns out to be wrong, unfortunately, although in > >a uniprocessor scenario is makes no difference. A SO memory access > >does block the CPU making the access, but explicitly does not > >block the interconnect. > > > I suspected the interconnect part when you described the barrier > need for SO memory region. > > >In a typical boot scenario for example, all secondary CPUs are > >quiescent or powered down, so there's no problem. But we can't make > >the same assumptions when we're trying to coordinate between > >multiple active CPUs. > > > >>so adding explicit barriers doesn't make sense. Is this a side > >>effect of some "write early response" kind of optimizations at > >>interconnect level ? > > > >Strongly-Ordered accesses are always non-shareable, so there is > >no explicit guarantee of coherency between multiple masters. > > > This is where probably issue then. My understanding is exactly > opposite here and hence I wasn't worried about multi-master > CPU scenario since sharable attributes would be taking care of it > considering the same page tables being used in SMP system. 
> > ARM documentation says - > ------------ > Shareability and the S bit, with TEX remap > The memory type of a region, as indicated in the Memory type column > of Table B3-12 on page B3-1350, provides > the first level of control of whether the region is shareable: > • If the memory type is Strongly-ordered then the region is Shareable > ------------------------------------------------------------ Hmmm, it looks like you're right here. My assumption that SO implies non-shareable is wrong. This is backed up by: A3.5.6 Device and Strongly-ordered memory "Address locations marked as Strongly-ordered [...] are always treated as Shareable." I think this is sufficient to ensure that if two CPUs access the same location with SO accesses, each will see an access order to any single location which is consistent with the program order of the accesses on the other CPUs. (This comes from the glossary definition of Coherent.) However, I can't see any general guarantee for accesses to _different_ locations, beyond the guarantees for certain special cases given in A3.8.2 Ordering requirements for memory accesses (address and control dependencies etc.) This may make some of the dmbs unnecessary, but it is not clear whether they are all unnecessary. I'll need to follow up on this and see if we can get an answer. Cheers ---Dave
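To make the single-location versus multi-location distinction concrete, here is a classic two-CPU message-passing sketch (the data/flag locations and register choices are purely illustrative assumptions, not taken from bL_head.S). Even with both locations mapped Strongly-ordered, it is only with the DMBs in place that CPU1 is guaranteed to see the data write once it has seen the flag write:

	@ CPU0: publish the data, then set the flag
	mov	r0, #1
	str	r0, [r2]		@ r2 = address of data (assumed)
	dmb				@ order the data write before the flag write
	str	r0, [r3]		@ r3 = address of flag (assumed)

	@ CPU1: wait for the flag, then read the data
0:	ldr	r0, [r3]
	cmp	r0, #0
	beq	0b
	dmb				@ order the flag read before the data read
	ldr	r1, [r2]		@ with both DMBs, r1 is guaranteed to be 1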
On Wednesday 16 January 2013 06:17 PM, Dave Martin wrote: > On Wed, Jan 16, 2013 at 05:41:00PM +0530, Santosh Shilimkar wrote: >> On Wednesday 16 January 2013 05:19 PM, Dave Martin wrote: >>> On Wed, Jan 16, 2013 at 12:20:47PM +0530, Santosh Shilimkar wrote: >>>> + Catalin, RMK >>>> >>>> Dave, >>>> >>>> On Tuesday 15 January 2013 10:18 PM, Dave Martin wrote: >>>>> For architectural correctness even Strongly-Ordered memory accesses >>>>> require barriers in order to guarantee that multiple CPUs have a >>>>> coherent view of the ordering of memory accesses. >>>>> >>>>> Virtually everything done by this early code is done via explicit >>>>> memory access only, so DSBs are seldom required. Existing barriers >>>>> are demoted to DMB, except where a DSB is needed to synchronise >>>>> non-memory signalling (i.e., before a SEV). If a particular >>>>> platform performs cache maintenance in its power_up_setup function, >>>>> it should force it to complete explicitly including a DSB, instead >>>>> of relying on the bL_head framework code to do it. >>>>> >>>>> Some additional DMBs are added to ensure all the memory ordering >>>>> properties required by the race avoidance algorithm. DMBs are also >>>>> moved out of loops, and for clarity some are moved so that most >>>>> directly follow the memory operation which needs to be >>>>> synchronised. >>>>> >>>>> The setting of a CPU's bL_entry_vectors[] entry is also required to >>>>> act as a synchronisation point, so a DMB is added after checking >>>>> that entry to ensure that other CPUs do not observe gated >>>>> operations leaking across the opening of the gate. >>>>> >>>>> Signed-off-by: Dave Martin <dave.martin@linaro.org> >>>>> --- >>>> >>>> Sorry to pick on this again but I am not able to understand why >>>> the strongly ordered access needs barriers. At least from the >>>> ARM point of view, a strongly ordered write will be more of blocking >>>> write and the further interconnect also is suppose to respect that >>> >>> This is what I originally assumed (hence the absence of barriers in >>> the initial patch). >>> >>>> rule. SO read writes are like adding barrier after every load store >>> >>> This assumption turns out to be wrong, unfortunately, although in >>> a uniprocessor scenario is makes no difference. A SO memory access >>> does block the CPU making the access, but explicitly does not >>> block the interconnect. >>> >> I suspected the interconnect part when you described the barrier >> need for SO memory region. >> >>> In a typical boot scenario for example, all secondary CPUs are >>> quiescent or powered down, so there's no problem. But we can't make >>> the same assumptions when we're trying to coordinate between >>> multiple active CPUs. >>> >>>> so adding explicit barriers doesn't make sense. Is this a side >>>> effect of some "write early response" kind of optimizations at >>>> interconnect level ? >>> >>> Strongly-Ordered accesses are always non-shareable, so there is >>> no explicit guarantee of coherency between multiple masters. >>> >> This is where probably issue then. My understanding is exactly >> opposite here and hence I wasn't worried about multi-master >> CPU scenario since sharable attributes would be taking care of it >> considering the same page tables being used in SMP system. 
>> >> ARM documentation says - >> ------------ >> Shareability and the S bit, with TEX remap >> The memory type of a region, as indicated in the Memory type column >> of Table B3-12 on page B3-1350, provides >> the first level of control of whether the region is shareable: >> • If the memory type is Strongly-ordered then the region is Shareable >> ------------------------------------------------------------ > > Hmmm, it looks like you're right here. My assumption that SO implies > non-shareable is wrong. This is backed up by: > > A3.5.6 Device and Strongly-ordered memory > > "Address locations marked as Strongly-ordered [...] are always treated > as Shareable." > > > I think this is sufficient to ensure that if two CPUs access the same > location with SO accesses, each will see an access order to any single > location which is consistent with the program order of the accesses on > the other CPUs. (This comes from the glossary definition of Coherent.) > > However, I can't see any general guarantee for accesses to _different_ > locations, beyond the guarantees for certain special cases given in > A3.8.2 Ordering requirements for memory accesses (address and control > dependencies etc.) > > This may make some of the dmbs unnecessary, but it is not clear whether > they are all unnecessary. > > > I'll need to follow up on this and see if we can get an answer. > Thanks David. I am looking forward to hearing more on this. Regards, Santosh
Santosh, On Wed, Jan 16, 2013 at 06:50:47AM +0000, Santosh Shilimkar wrote: > On Tuesday 15 January 2013 10:18 PM, Dave Martin wrote: > > For architectural correctness even Strongly-Ordered memory accesses > > require barriers in order to guarantee that multiple CPUs have a > > coherent view of the ordering of memory accesses. > > > > Virtually everything done by this early code is done via explicit > > memory access only, so DSBs are seldom required. Existing barriers > > are demoted to DMB, except where a DSB is needed to synchronise > > non-memory signalling (i.e., before a SEV). If a particular > > platform performs cache maintenance in its power_up_setup function, > > it should force it to complete explicitly including a DSB, instead > > of relying on the bL_head framework code to do it. > > > > Some additional DMBs are added to ensure all the memory ordering > > properties required by the race avoidance algorithm. DMBs are also > > moved out of loops, and for clarity some are moved so that most > > directly follow the memory operation which needs to be > > synchronised. > > > > The setting of a CPU's bL_entry_vectors[] entry is also required to > > act as a synchronisation point, so a DMB is added after checking > > that entry to ensure that other CPUs do not observe gated > > operations leaking across the opening of the gate. > > > > Signed-off-by: Dave Martin <dave.martin@linaro.org> > > Sorry to pick on this again but I am not able to understand why > the strongly ordered access needs barriers. At least from the > ARM point of view, a strongly ordered write will be more of blocking > write and the further interconnect also is suppose to respect that > rule. SO read writes are like adding barrier after every load store > so adding explicit barriers doesn't make sense. Is this a side > effect of some "write early response" kind of optimizations at > interconnect level ? SO or Device memory accesses are *not* like putting a proper barrier between each access, though it may behave in some situations like having a barrier. The ARM ARM (A3.8.3, fig 3.5) defines how accesses must *arrive* at a peripheral or block of memory depending on the memory type and in case of Device or SO we don't need additional barriers because such accesses would *arrive* in order (given the minimum 1KB range restriction). But it does not say anything about *observability* by a different *master*. That's because you can't guarantee that your memory accesses go to the same slave port. For observability by a different master, you need an explicit DMB even though the memory type is Device or SO. While it may work fine in most cases (especially when the accesses by various masters go to the same slave port), you can't be sure what the memory controller or whatever interconnect do. As Dave said, it's more about what the ARM ARM doesn't say rather than what it explicitly states.
On Wed, Jan 16, 2013 at 03:05:34PM +0000, Catalin Marinas wrote: > Santosh, > > On Wed, Jan 16, 2013 at 06:50:47AM +0000, Santosh Shilimkar wrote: > > On Tuesday 15 January 2013 10:18 PM, Dave Martin wrote: > > > For architectural correctness even Strongly-Ordered memory accesses > > > require barriers in order to guarantee that multiple CPUs have a > > > coherent view of the ordering of memory accesses. > > > > > > Virtually everything done by this early code is done via explicit > > > memory access only, so DSBs are seldom required. Existing barriers > > > are demoted to DMB, except where a DSB is needed to synchronise > > > non-memory signalling (i.e., before a SEV). If a particular > > > platform performs cache maintenance in its power_up_setup function, > > > it should force it to complete explicitly including a DSB, instead > > > of relying on the bL_head framework code to do it. > > > > > > Some additional DMBs are added to ensure all the memory ordering > > > properties required by the race avoidance algorithm. DMBs are also > > > moved out of loops, and for clarity some are moved so that most > > > directly follow the memory operation which needs to be > > > synchronised. > > > > > > The setting of a CPU's bL_entry_vectors[] entry is also required to > > > act as a synchronisation point, so a DMB is added after checking > > > that entry to ensure that other CPUs do not observe gated > > > operations leaking across the opening of the gate. > > > > > > Signed-off-by: Dave Martin <dave.martin@linaro.org> > > > > Sorry to pick on this again but I am not able to understand why > > the strongly ordered access needs barriers. At least from the > > ARM point of view, a strongly ordered write will be more of blocking > > write and the further interconnect also is suppose to respect that > > rule. SO read writes are like adding barrier after every load store > > so adding explicit barriers doesn't make sense. Is this a side > > effect of some "write early response" kind of optimizations at > > interconnect level ? > > SO or Device memory accesses are *not* like putting a proper barrier > between each access, though it may behave in some situations like having > a barrier. The ARM ARM (A3.8.3, fig 3.5) defines how accesses must > *arrive* at a peripheral or block of memory depending on the memory type > and in case of Device or SO we don't need additional barriers because > such accesses would *arrive* in order (given the minimum 1KB range > restriction). But it does not say anything about *observability* by a > different *master*. That's because you can't guarantee that your memory > accesses go to the same slave port. > > For observability by a different master, you need an explicit DMB even > though the memory type is Device or SO. While it may work fine in most > cases (especially when the accesses by various masters go to the same > slave port), you can't be sure what the memory controller or whatever > interconnect do. > > As Dave said, it's more about what the ARM ARM doesn't say rather than > what it explicitly states. OK, so I talked to one of the ARM architects and he strongly recommends having the DMBs. The fact that SO accesses are shareable guarantees a coherent view of access order only per memory location (i.e., per address), _not_ per slave. This is what was missing from my original assumption. 
For different memory locations, even within a single block of memory, you don't get that guarantee about what other masters see, except for the precise documented situations where there are address, control, or data dependencies between instructions which imply a particular observation order by other masters. (See A3.8.2, Ordering requirements for memory accesses). For this locking code, we're concerned about ordering between different memory locations in almost all instances, so most or all of the barriers will be needed, for strict correctness. Cheers ---Dave
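As an illustration of one such cross-location ordering, the gate described in the commit message pairs the load of the entry vector with a DMB so that the gated operations cannot be observed before the gate is seen open. This mirrors the bL_entry_gated hunk in the diff below; the comments here are explanatory additions, not part of the patch:

bL_entry_gated:
	ldr	r5, [r6, r4, lsl #2]	@ r5 = CPU entry vector
	dmb				@ the vector acts as a synchronisation
					@ point: do not let later (gated)
					@ accesses be observed before the
					@ vector was seen non-zero
	cmp	r5, #0
	wfeeq				@ sleep until another CPU sets the vector
	beq	bL_entry_gated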
On Wednesday 16 January 2013 08:35 PM, Catalin Marinas wrote: > Santosh, > > On Wed, Jan 16, 2013 at 06:50:47AM +0000, Santosh Shilimkar wrote: >> On Tuesday 15 January 2013 10:18 PM, Dave Martin wrote: >>> For architectural correctness even Strongly-Ordered memory accesses >>> require barriers in order to guarantee that multiple CPUs have a >>> coherent view of the ordering of memory accesses. >>> >>> Virtually everything done by this early code is done via explicit >>> memory access only, so DSBs are seldom required. Existing barriers >>> are demoted to DMB, except where a DSB is needed to synchronise >>> non-memory signalling (i.e., before a SEV). If a particular >>> platform performs cache maintenance in its power_up_setup function, >>> it should force it to complete explicitly including a DSB, instead >>> of relying on the bL_head framework code to do it. >>> >>> Some additional DMBs are added to ensure all the memory ordering >>> properties required by the race avoidance algorithm. DMBs are also >>> moved out of loops, and for clarity some are moved so that most >>> directly follow the memory operation which needs to be >>> synchronised. >>> >>> The setting of a CPU's bL_entry_vectors[] entry is also required to >>> act as a synchronisation point, so a DMB is added after checking >>> that entry to ensure that other CPUs do not observe gated >>> operations leaking across the opening of the gate. >>> >>> Signed-off-by: Dave Martin <dave.martin@linaro.org> >> >> Sorry to pick on this again but I am not able to understand why >> the strongly ordered access needs barriers. At least from the >> ARM point of view, a strongly ordered write will be more of blocking >> write and the further interconnect also is suppose to respect that >> rule. SO read writes are like adding barrier after every load store >> so adding explicit barriers doesn't make sense. Is this a side >> effect of some "write early response" kind of optimizations at >> interconnect level ? > > SO or Device memory accesses are *not* like putting a proper barrier > between each access, though it may behave in some situations like having > a barrier. The ARM ARM (A3.8.3, fig 3.5) defines how accesses must > *arrive* at a peripheral or block of memory depending on the memory type > and in case of Device or SO we don't need additional barriers because > such accesses would *arrive* in order (given the minimum 1KB range > restriction). But it does not say anything about *observability* by a > different *master*. That's because you can't guarantee that your memory > accesses go to the same slave port. > > For observability by a different master, you need an explicit DMB even > though the memory type is Device or SO. While it may work fine in most > cases (especially when the accesses by various masters go to the same > slave port), you can't be sure what the memory controller or whatever > interconnect do. > I agree that the interconnect behavior and the number of slave ports used are implementation specific, and we can't assume anything here. So the safer option is to have those additional barriers so we don't hit the corner cases. > As Dave said, it's more about what the ARM ARM doesn't say rather than > what it explicitly states. > I understand it better now. Thanks for the good discussion. Regards, Santosh
diff --git a/arch/arm/common/bL_head.S b/arch/arm/common/bL_head.S
index fd71ff6..a4a20e5 100644
--- a/arch/arm/common/bL_head.S
+++ b/arch/arm/common/bL_head.S
@@ -87,8 +87,7 @@ ENTRY(bL_entry_point)
 	mov	r5, #BL_SYNC_CPU_SIZE
 	mla	r5, r9, r5, r8		@ r5 = bL_sync cpu address
 	strb	r0, [r5]
-
-	dsb
+	dmb
 
 	@ At this point, the cluster cannot unexpectedly enter the GOING_DOWN
 	@ state, because there is at least one active CPU (this CPU).
@@ -97,7 +96,7 @@
 	mla	r11, r0, r10, r11	@ r11 = cluster first man lock
 	mov	r0, r11
 	mov	r1, r9			@ cpu
-	bl	vlock_trylock
+	bl	vlock_trylock		@ implies DSB
 	cmp	r0, #0			@ failed to get the lock?
 	bne	cluster_setup_wait	@ wait for cluster setup if so
 
@@ -115,11 +114,12 @@ cluster_setup:
 
 	@ Wait for any previously-pending cluster teardown operations to abort
 	@ or complete:
-	dsb
-	ldrb	r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
+	dmb
+0:	ldrb	r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
 	cmp	r0, #CLUSTER_GOING_DOWN
 	wfeeq
-	beq	cluster_setup
+	beq	0b
+	dmb
 
 	@ If the outbound gave up before teardown started, skip cluster setup:
 
@@ -131,8 +131,8 @@ cluster_setup:
 	cmp	r7, #0
 	mov	r0, #1			@ second (cluster) affinity level
 	blxne	r7			@ Call power_up_setup if defined
+	dmb
 
-	dsb
 	mov	r0, #CLUSTER_UP
 	strb	r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
 	dsb
@@ -146,11 +146,11 @@ cluster_setup_leave:
 	@ In the contended case, non-first men wait here for cluster setup
 	@ to complete:
 cluster_setup_wait:
-	dsb
 	ldrb	r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
 	cmp	r0, #CLUSTER_UP
 	wfene
 	bne	cluster_setup_wait
+	dmb
 
 cluster_setup_complete:
 	@ If a platform-specific CPU setup hook is needed, it is
@@ -162,13 +162,14 @@ cluster_setup_complete:
 
 	@ Mark the CPU as up:
 
-	dsb
+	dmb
 	mov	r0, #CPU_UP
 	strb	r0, [r5]
+	dmb
 
 bL_entry_gated:
-	dsb
 	ldr	r5, [r6, r4, lsl #2]	@ r5 = CPU entry vector
+	dmb
 	cmp	r5, #0
 	wfeeq
 	beq	bL_entry_gated
For architectural correctness even Strongly-Ordered memory accesses
require barriers in order to guarantee that multiple CPUs have a
coherent view of the ordering of memory accesses.

Virtually everything done by this early code is done via explicit
memory access only, so DSBs are seldom required. Existing barriers
are demoted to DMB, except where a DSB is needed to synchronise
non-memory signalling (i.e., before a SEV). If a particular
platform performs cache maintenance in its power_up_setup function,
it should force it to complete explicitly including a DSB, instead
of relying on the bL_head framework code to do it.

Some additional DMBs are added to ensure all the memory ordering
properties required by the race avoidance algorithm. DMBs are also
moved out of loops, and for clarity some are moved so that most
directly follow the memory operation which needs to be
synchronised.

The setting of a CPU's bL_entry_vectors[] entry is also required to
act as a synchronisation point, so a DMB is added after checking
that entry to ensure that other CPUs do not observe gated
operations leaking across the opening of the gate.

Signed-off-by: Dave Martin <dave.martin@linaro.org>
---
 arch/arm/common/bL_head.S |   21 +++++++++++----------
 1 files changed, 11 insertions(+), 10 deletions(-)