Message ID | 20160115173912.GU3818@linux.vnet.ibm.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Fri, Jan 15, 2016 at 09:39:12AM -0800, Paul E. McKenney wrote: > Should we start putting litmus tests for the various examples > somewhere, perhaps in a litmus-tests directory within each participating > architecture? I have a pile of powerpc-related litmus tests on my laptop, > but they probably aren't doing all that much good there. Yeah, or a version of them in C that we can 'compile'? > > commit 2cb4e83a1b5c89c8e39b8a64bd89269d05913e41 > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > Date: Fri Jan 15 09:30:42 2016 -0800 > > documentation: Distinguish between local and global transitivity > > The introduction of smp_load_acquire() and smp_store_release() had > the side effect of introducing a weaker notion of transitivity: > The transitivity of full smp_mb() barriers is global, but that > of smp_store_release()/smp_load_acquire() chains is local. This > commit therefore introduces the notion of local transitivity and > gives an example. > > Reported-by: Peter Zijlstra <peterz@infradead.org> > Reported-by: Will Deacon <will.deacon@arm.com> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> I think it fails to mention smp_mb__after_release_acquire(), although I suspect we didn't actually introduce the primitive yet, which raises the point, do we want to?
On Fri, Jan 15, 2016 at 10:29:12PM +0100, Peter Zijlstra wrote: > On Fri, Jan 15, 2016 at 09:39:12AM -0800, Paul E. McKenney wrote: > > Should we start putting litmus tests for the various examples > > somewhere, perhaps in a litmus-tests directory within each participating > > architecture? I have a pile of powerpc-related litmus tests on my laptop, > > but they probably aren't doing all that much good there. > > Yeah, or a version of them in C that we can 'compile'? That would be good as well. I am guessing that architecture-specific litmus tests will also be needed, but you are right that architecture-independent versions are higher priority. > > commit 2cb4e83a1b5c89c8e39b8a64bd89269d05913e41 > > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > > Date: Fri Jan 15 09:30:42 2016 -0800 > > > > documentation: Distinguish between local and global transitivity > > > > The introduction of smp_load_acquire() and smp_store_release() had > > the side effect of introducing a weaker notion of transitivity: > > The transitivity of full smp_mb() barriers is global, but that > > of smp_store_release()/smp_load_acquire() chains is local. This > > commit therefore introduces the notion of local transitivity and > > gives an example. > > > > Reported-by: Peter Zijlstra <peterz@infradead.org> > > Reported-by: Will Deacon <will.deacon@arm.com> > > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > > I think it fails to mention smp_mb__after_release_acquire(), although I > suspect we didn't actually introduce the primitive yet, which raises the > point, do we want to? Well, it is not in v4.4. I believe that we need good use cases before we add it. Thanx, Paul
Hi Paul, On Fri, Jan 15, 2016 at 09:39:12AM -0800, Paul E. McKenney wrote: > On Fri, Jan 15, 2016 at 09:55:54AM +0100, Peter Zijlstra wrote: > > On Thu, Jan 14, 2016 at 01:29:13PM -0800, Paul E. McKenney wrote: > > > So smp_mb() provides transitivity, as do pairs of smp_store_release() > > > and smp_read_acquire(), > > > > But they provide different grades of transitivity, which is where all > > the confusion lays. > > > > smp_mb() is strongly/globally transitive, all CPUs will agree on the order. > > > > Whereas the RCpc release+acquire is weakly so, only the two cpus > > involved in the handover will agree on the order. > > Good point! > > Using grace periods in place of smp_mb() also provides strong/global > transitivity, but also insanely high latencies. ;-) > > The patch below updates Documentation/memory-barriers.txt to define > local vs. global transitivity. The corresponding ppcmem litmus test > is included below as well. > > Should we start putting litmus tests for the various examples > somewhere, perhaps in a litmus-tests directory within each participating > architecture? I have a pile of powerpc-related litmus tests on my laptop, > but they probably aren't doing all that much good there. I too would like to have the litmus tests in the kernel so that we can refer to them from memory-barriers.txt. Ideally they wouldn't be targetted to a particular arch, however. > PPC local-transitive > "" > { > 0:r1=1; 0:r2=u; 0:r3=v; 0:r4=x; 0:r5=y; 0:r6=z; > 1:r1=1; 1:r2=u; 1:r3=v; 1:r4=x; 1:r5=y; 1:r6=z; > 2:r1=1; 2:r2=u; 2:r3=v; 2:r4=x; 2:r5=y; 2:r6=z; > 3:r1=1; 3:r2=u; 3:r3=v; 3:r4=x; 3:r5=y; 3:r6=z; > } > P0 | P1 | P2 | P3 ; > lwz r9,0(r4) | lwz r9,0(r5) | lwz r9,0(r6) | stw r1,0(r3) ; > lwsync | lwsync | lwsync | sync ; > stw r1,0(r2) | lwz r8,0(r3) | stw r1,0(r7) | lwz r9,0(r2) ; > lwsync | lwz r7,0(r2) | | ; > stw r1,0(r5) | lwsync | | ; > | stw r1,0(r6) | | ; > exists > (* (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r8=0 /\ 3:r9=0) *) > (* (0:r9=1 /\ 1:r9=1 /\ 2:r9=1) *) > (* (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r7=0) *) > (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r7=0) i.e. we should rewrite this using READ_ONCE/WRITE_ONCE and smp_mb() etc. > ------------------------------------------------------------------------ > > commit 2cb4e83a1b5c89c8e39b8a64bd89269d05913e41 > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > Date: Fri Jan 15 09:30:42 2016 -0800 > > documentation: Distinguish between local and global transitivity > > The introduction of smp_load_acquire() and smp_store_release() had > the side effect of introducing a weaker notion of transitivity: > The transitivity of full smp_mb() barriers is global, but that > of smp_store_release()/smp_load_acquire() chains is local. This > commit therefore introduces the notion of local transitivity and > gives an example. > > Reported-by: Peter Zijlstra <peterz@infradead.org> > Reported-by: Will Deacon <will.deacon@arm.com> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > > diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt > index c66ba46d8079..d8109ed99342 100644 > --- a/Documentation/memory-barriers.txt > +++ b/Documentation/memory-barriers.txt > @@ -1318,8 +1318,82 @@ or a level of cache, CPU 2 might have early access to CPU 1's writes. > General barriers are therefore required to ensure that all CPUs agree > on the combined order of CPU 1's and CPU 2's accesses. > > -To reiterate, if your code requires transitivity, use general barriers > -throughout. > +General barriers provide "global transitivity", so that all CPUs will > +agree on the order of operations. In contrast, a chain of release-acquire > +pairs provides only "local transitivity", so that only those CPUs on > +the chain are guaranteed to agree on the combined order of the accesses. Thanks for having a go at this. I tried defining something axiomatically, but got stuck pretty quickly. In my scheme, I used "data-directed transitivity" instead of "local transitivity", since the latter seems to be a bit of a misnomer. > +For example, switching to C code in deference to Herman Hollerith: > + > + int u, v, x, y, z; > + > + void cpu0(void) > + { > + r0 = smp_load_acquire(&x); > + WRITE_ONCE(u, 1); > + smp_store_release(&y, 1); > + } > + > + void cpu1(void) > + { > + r1 = smp_load_acquire(&y); > + r4 = READ_ONCE(v); > + r5 = READ_ONCE(u); > + smp_store_release(&z, 1); > + } > + > + void cpu2(void) > + { > + r2 = smp_load_acquire(&z); > + smp_store_release(&x, 1); > + } > + > + void cpu3(void) > + { > + WRITE_ONCE(v, 1); > + smp_mb(); > + r3 = READ_ONCE(u); > + } > + > +Because cpu0(), cpu1(), and cpu2() participate in a local transitive > +chain of smp_store_release()/smp_load_acquire() pairs, the following > +outcome is prohibited: > + > + r0 == 1 && r1 == 1 && r2 == 1 > + > +Furthermore, because of the release-acquire relationship between cpu0() > +and cpu1(), cpu1() must see cpu0()'s writes, so that the following > +outcome is prohibited: > + > + r1 == 1 && r5 == 0 > + > +However, the transitivity of release-acquire is local to the participating > +CPUs and does not apply to cpu3(). Therefore, the following outcome > +is possible: > + > + r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0 I think you should be completely explicit and include r5 == 1 here, too. Also -- where would you add the smp_mb__after_release_acquire to fix (i.e. forbid) this? Immediately after cpu1()'s read of y? > +Although cpu0(), cpu1(), and cpu2() will see their respective reads and > +writes in order, CPUs not involved in the release-acquire chain might > +well disagree on the order. This disagreement stems from the fact that > +the weak memory-barrier instructions used to implement smp_load_acquire() > +and smp_store_release() are not required to order prior stores against > +subsequent loads in all cases. This means that cpu3() can see cpu0()'s > +store to u as happening -after- cpu1()'s load from v, even though > +both cpu0() and cpu1() agree that these two operations occurred in the > +intended order. > + > +However, please keep in mind that smp_load_acquire() is not magic. > +In particular, it simply reads from its argument with ordering. It does > +-not- ensure that any particular value will be read. Therefore, the > +following outcome is possible: > + > + r0 == 0 && r1 == 0 && r2 == 0 && r5 == 0 > + > +Note that this outcome can happen even on a mythical sequentially > +consistent system where nothing is ever reordered. I'm not sure this last bit is strictly necessary. If somebody thinks that acquire/release involve some sort of implicit synchronisation, I think they may have bigger problems with memory-barriers.txt. Will
On Mon, Jan 25, 2016 at 06:02:34PM +0000, Will Deacon wrote: > Hi Paul, > > On Fri, Jan 15, 2016 at 09:39:12AM -0800, Paul E. McKenney wrote: > > On Fri, Jan 15, 2016 at 09:55:54AM +0100, Peter Zijlstra wrote: > > > On Thu, Jan 14, 2016 at 01:29:13PM -0800, Paul E. McKenney wrote: > > > > So smp_mb() provides transitivity, as do pairs of smp_store_release() > > > > and smp_read_acquire(), > > > > > > But they provide different grades of transitivity, which is where all > > > the confusion lays. > > > > > > smp_mb() is strongly/globally transitive, all CPUs will agree on the order. > > > > > > Whereas the RCpc release+acquire is weakly so, only the two cpus > > > involved in the handover will agree on the order. > > > > Good point! > > > > Using grace periods in place of smp_mb() also provides strong/global > > transitivity, but also insanely high latencies. ;-) > > > > The patch below updates Documentation/memory-barriers.txt to define > > local vs. global transitivity. The corresponding ppcmem litmus test > > is included below as well. > > > > Should we start putting litmus tests for the various examples > > somewhere, perhaps in a litmus-tests directory within each participating > > architecture? I have a pile of powerpc-related litmus tests on my laptop, > > but they probably aren't doing all that much good there. > > I too would like to have the litmus tests in the kernel so that we can > refer to them from memory-barriers.txt. Ideally they wouldn't be targetted > to a particular arch, however. Agreed. Working on it... > > PPC local-transitive > > "" > > { > > 0:r1=1; 0:r2=u; 0:r3=v; 0:r4=x; 0:r5=y; 0:r6=z; > > 1:r1=1; 1:r2=u; 1:r3=v; 1:r4=x; 1:r5=y; 1:r6=z; > > 2:r1=1; 2:r2=u; 2:r3=v; 2:r4=x; 2:r5=y; 2:r6=z; > > 3:r1=1; 3:r2=u; 3:r3=v; 3:r4=x; 3:r5=y; 3:r6=z; > > } > > P0 | P1 | P2 | P3 ; > > lwz r9,0(r4) | lwz r9,0(r5) | lwz r9,0(r6) | stw r1,0(r3) ; > > lwsync | lwsync | lwsync | sync ; > > stw r1,0(r2) | lwz r8,0(r3) | stw r1,0(r7) | lwz r9,0(r2) ; > > lwsync | lwz r7,0(r2) | | ; > > stw r1,0(r5) | lwsync | | ; > > | stw r1,0(r6) | | ; > > exists > > (* (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r8=0 /\ 3:r9=0) *) > > (* (0:r9=1 /\ 1:r9=1 /\ 2:r9=1) *) > > (* (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r7=0) *) > > (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r7=0) > > i.e. we should rewrite this using READ_ONCE/WRITE_ONCE and smp_mb() etc. Yep! > > ------------------------------------------------------------------------ > > > > commit 2cb4e83a1b5c89c8e39b8a64bd89269d05913e41 > > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > > Date: Fri Jan 15 09:30:42 2016 -0800 > > > > documentation: Distinguish between local and global transitivity > > > > The introduction of smp_load_acquire() and smp_store_release() had > > the side effect of introducing a weaker notion of transitivity: > > The transitivity of full smp_mb() barriers is global, but that > > of smp_store_release()/smp_load_acquire() chains is local. This > > commit therefore introduces the notion of local transitivity and > > gives an example. > > > > Reported-by: Peter Zijlstra <peterz@infradead.org> > > Reported-by: Will Deacon <will.deacon@arm.com> > > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > > > > diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt > > index c66ba46d8079..d8109ed99342 100644 > > --- a/Documentation/memory-barriers.txt > > +++ b/Documentation/memory-barriers.txt > > @@ -1318,8 +1318,82 @@ or a level of cache, CPU 2 might have early access to CPU 1's writes. > > General barriers are therefore required to ensure that all CPUs agree > > on the combined order of CPU 1's and CPU 2's accesses. > > > > -To reiterate, if your code requires transitivity, use general barriers > > -throughout. > > +General barriers provide "global transitivity", so that all CPUs will > > +agree on the order of operations. In contrast, a chain of release-acquire > > +pairs provides only "local transitivity", so that only those CPUs on > > +the chain are guaranteed to agree on the combined order of the accesses. > > Thanks for having a go at this. I tried defining something axiomatically, > but got stuck pretty quickly. In my scheme, I used "data-directed > transitivity" instead of "local transitivity", since the latter seems to > be a bit of a misnomer. I figured that "local" meant local to the CPUs participating in the release-acquire chain. As opposed to smp_mb() chains where the ordering is "global" as in visible to all CPUs, whether on the chain or not. Does that help? > > +For example, switching to C code in deference to Herman Hollerith: > > + > > + int u, v, x, y, z; > > + > > + void cpu0(void) > > + { > > + r0 = smp_load_acquire(&x); > > + WRITE_ONCE(u, 1); > > + smp_store_release(&y, 1); > > + } > > + > > + void cpu1(void) > > + { > > + r1 = smp_load_acquire(&y); > > + r4 = READ_ONCE(v); > > + r5 = READ_ONCE(u); > > + smp_store_release(&z, 1); > > + } > > + > > + void cpu2(void) > > + { > > + r2 = smp_load_acquire(&z); > > + smp_store_release(&x, 1); > > + } > > + > > + void cpu3(void) > > + { > > + WRITE_ONCE(v, 1); > > + smp_mb(); > > + r3 = READ_ONCE(u); > > + } > > + > > +Because cpu0(), cpu1(), and cpu2() participate in a local transitive > > +chain of smp_store_release()/smp_load_acquire() pairs, the following > > +outcome is prohibited: > > + > > + r0 == 1 && r1 == 1 && r2 == 1 > > + > > +Furthermore, because of the release-acquire relationship between cpu0() > > +and cpu1(), cpu1() must see cpu0()'s writes, so that the following > > +outcome is prohibited: > > + > > + r1 == 1 && r5 == 0 > > + > > +However, the transitivity of release-acquire is local to the participating > > +CPUs and does not apply to cpu3(). Therefore, the following outcome > > +is possible: > > + > > + r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0 > > I think you should be completely explicit and include r5 == 1 here, too. Good point -- I added this as an additional outcome: r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0 && r5 == 1 > Also -- where would you add the smp_mb__after_release_acquire to fix > (i.e. forbid) this? Immediately after cpu1()'s read of y? That sounds plausible, but we would first have to agree on exactly what smp_mb__after_release_acquire() did. ;-) > > +Although cpu0(), cpu1(), and cpu2() will see their respective reads and > > +writes in order, CPUs not involved in the release-acquire chain might > > +well disagree on the order. This disagreement stems from the fact that > > +the weak memory-barrier instructions used to implement smp_load_acquire() > > +and smp_store_release() are not required to order prior stores against > > +subsequent loads in all cases. This means that cpu3() can see cpu0()'s > > +store to u as happening -after- cpu1()'s load from v, even though > > +both cpu0() and cpu1() agree that these two operations occurred in the > > +intended order. > > + > > +However, please keep in mind that smp_load_acquire() is not magic. > > +In particular, it simply reads from its argument with ordering. It does > > +-not- ensure that any particular value will be read. Therefore, the > > +following outcome is possible: > > + > > + r0 == 0 && r1 == 0 && r2 == 0 && r5 == 0 > > + > > +Note that this outcome can happen even on a mythical sequentially > > +consistent system where nothing is ever reordered. > > I'm not sure this last bit is strictly necessary. If somebody thinks that > acquire/release involve some sort of implicit synchronisation, I think > they may have bigger problems with memory-barriers.txt. Agreed. But unless I add text like this occasionally, such people could easily read through much of memory-barriers.txt and think that they did in fact understand it. So I have to occasionally trip an assertion in their brain. Or try to... :-/ Thanx, Paul
On Mon, Jan 25, 2016 at 10:12:11PM -0800, Paul E. McKenney wrote: > On Mon, Jan 25, 2016 at 06:02:34PM +0000, Will Deacon wrote: > > Thanks for having a go at this. I tried defining something axiomatically, > > but got stuck pretty quickly. In my scheme, I used "data-directed > > transitivity" instead of "local transitivity", since the latter seems to > > be a bit of a misnomer. > > I figured that "local" meant local to the CPUs participating in the > release-acquire chain. As opposed to smp_mb() chains where the ordering > is "global" as in visible to all CPUs, whether on the chain or not. > Does that help? That is in fact how I read and understood it.
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index c66ba46d8079..d8109ed99342 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -1318,8 +1318,82 @@ or a level of cache, CPU 2 might have early access to CPU 1's writes. General barriers are therefore required to ensure that all CPUs agree on the combined order of CPU 1's and CPU 2's accesses. -To reiterate, if your code requires transitivity, use general barriers -throughout. +General barriers provide "global transitivity", so that all CPUs will +agree on the order of operations. In contrast, a chain of release-acquire +pairs provides only "local transitivity", so that only those CPUs on +the chain are guaranteed to agree on the combined order of the accesses. +For example, switching to C code in deference to Herman Hollerith: + + int u, v, x, y, z; + + void cpu0(void) + { + r0 = smp_load_acquire(&x); + WRITE_ONCE(u, 1); + smp_store_release(&y, 1); + } + + void cpu1(void) + { + r1 = smp_load_acquire(&y); + r4 = READ_ONCE(v); + r5 = READ_ONCE(u); + smp_store_release(&z, 1); + } + + void cpu2(void) + { + r2 = smp_load_acquire(&z); + smp_store_release(&x, 1); + } + + void cpu3(void) + { + WRITE_ONCE(v, 1); + smp_mb(); + r3 = READ_ONCE(u); + } + +Because cpu0(), cpu1(), and cpu2() participate in a local transitive +chain of smp_store_release()/smp_load_acquire() pairs, the following +outcome is prohibited: + + r0 == 1 && r1 == 1 && r2 == 1 + +Furthermore, because of the release-acquire relationship between cpu0() +and cpu1(), cpu1() must see cpu0()'s writes, so that the following +outcome is prohibited: + + r1 == 1 && r5 == 0 + +However, the transitivity of release-acquire is local to the participating +CPUs and does not apply to cpu3(). Therefore, the following outcome +is possible: + + r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0 + +Although cpu0(), cpu1(), and cpu2() will see their respective reads and +writes in order, CPUs not involved in the release-acquire chain might +well disagree on the order. This disagreement stems from the fact that +the weak memory-barrier instructions used to implement smp_load_acquire() +and smp_store_release() are not required to order prior stores against +subsequent loads in all cases. This means that cpu3() can see cpu0()'s +store to u as happening -after- cpu1()'s load from v, even though +both cpu0() and cpu1() agree that these two operations occurred in the +intended order. + +However, please keep in mind that smp_load_acquire() is not magic. +In particular, it simply reads from its argument with ordering. It does +-not- ensure that any particular value will be read. Therefore, the +following outcome is possible: + + r0 == 0 && r1 == 0 && r2 == 0 && r5 == 0 + +Note that this outcome can happen even on a mythical sequentially +consistent system where nothing is ever reordered. + +To reiterate, if your code requires global transitivity, use general +barriers throughout. ========================