diff mbox

[v3,11/41] mips: reuse asm-generic/barrier.h

Message ID 1452426622-4471-12-git-send-email-mst@redhat.com (mailing list archive)
State New, archived
Headers show

Commit Message

Michael S. Tsirkin Jan. 10, 2016, 2:18 p.m. UTC
On mips dma_rmb, dma_wmb, smp_store_mb, read_barrier_depends,
smp_read_barrier_depends, smp_store_release and smp_load_acquire  match
the asm-generic variants exactly. Drop the local definitions and pull in
asm-generic/barrier.h instead.

This is in preparation to refactoring this code area.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
---
 arch/mips/include/asm/barrier.h | 25 ++-----------------------
 1 file changed, 2 insertions(+), 23 deletions(-)

Comments

Leonid Yegoshin Jan. 12, 2016, 1:14 a.m. UTC | #1
On 01/10/2016 06:18 AM, Michael S. Tsirkin wrote:
> On mips dma_rmb, dma_wmb, smp_store_mb, read_barrier_depends,
> smp_read_barrier_depends, smp_store_release and smp_load_acquire  match
> the asm-generic variants exactly. Drop the local definitions and pull in
> asm-generic/barrier.h instead.
>
This statement doesn't fit MIPS barriers variations. Moreover, there is 
a reason to extend that even more specific, at least for 
smp_store_release and smp_load_acquire, look into

     http://patchwork.linux-mips.org/patch/10506/

- Leonid.
Michael S. Tsirkin Jan. 12, 2016, 8:43 a.m. UTC | #2
On Mon, Jan 11, 2016 at 05:14:14PM -0800, Leonid Yegoshin wrote:
> On 01/10/2016 06:18 AM, Michael S. Tsirkin wrote:
> >On mips dma_rmb, dma_wmb, smp_store_mb, read_barrier_depends,
> >smp_read_barrier_depends, smp_store_release and smp_load_acquire  match
> >the asm-generic variants exactly. Drop the local definitions and pull in
> >asm-generic/barrier.h instead.
> >
> This statement doesn't fit MIPS barriers variations. Moreover, there is a
> reason to extend that even more specific, at least for smp_store_release and
> smp_load_acquire, look into
> 
>     http://patchwork.linux-mips.org/patch/10506/
> 
> - Leonid.

Fine, but it matches what current code is doing.  Since that
MIPS_LIGHTWEIGHT_SYNC patch didn't go into linux-next yet, do
you see a problem reworking it on top of this patchset?
Peter Zijlstra Jan. 12, 2016, 9:27 a.m. UTC | #3
On Mon, Jan 11, 2016 at 05:14:14PM -0800, Leonid Yegoshin wrote:

> This statement doesn't fit MIPS barriers variations. Moreover, there is a
> reason to extend that even more specific, at least for smp_store_release and
> smp_load_acquire, look into
> 
>     http://patchwork.linux-mips.org/patch/10506/

Dude, that's one horrible patch.

1) you do not make such things selectable; either the hardware needs
them or it doesn't. If it does you _must_ use them, however unlikely.

2) the changelog _completely_ fails to explain the sync 0x11 and sync
0x12 semantics nor does it provide a publicly accessible link to
documentation that does.

3) it really should have explained what you did with
smp_llsc_mb/smp_mb__before_llsc() in _detail_.

And I agree that ideally it should be split into parts.

Seriously, this is _NOT_ OK.
Peter Zijlstra Jan. 12, 2016, 9:51 a.m. UTC | #4
On Tue, Jan 12, 2016 at 10:43:36AM +0200, Michael S. Tsirkin wrote:
> On Mon, Jan 11, 2016 at 05:14:14PM -0800, Leonid Yegoshin wrote:
> > On 01/10/2016 06:18 AM, Michael S. Tsirkin wrote:
> > >On mips dma_rmb, dma_wmb, smp_store_mb, read_barrier_depends,
> > >smp_read_barrier_depends, smp_store_release and smp_load_acquire  match
> > >the asm-generic variants exactly. Drop the local definitions and pull in
> > >asm-generic/barrier.h instead.
> > >
> > This statement doesn't fit MIPS barriers variations. Moreover, there is a
> > reason to extend that even more specific, at least for smp_store_release and
> > smp_load_acquire, look into
> > 
> >     http://patchwork.linux-mips.org/patch/10506/
> > 
> > - Leonid.
> 
> Fine, but it matches what current code is doing.  Since that
> MIPS_LIGHTWEIGHT_SYNC patch didn't go into linux-next yet, do
> you see a problem reworking it on top of this patchset?

That patch is a complete doorstop atm. It needs a lot more work before
it can go anywhere. Don't worry about it.
Peter Zijlstra Jan. 12, 2016, 10:25 a.m. UTC | #5
On Tue, Jan 12, 2016 at 10:27:11AM +0100, Peter Zijlstra wrote:
> 2) the changelog _completely_ fails to explain the sync 0x11 and sync
> 0x12 semantics nor does it provide a publicly accessible link to
> documentation that does.

Ralf pointed me at: https://imgtec.com/mips/architectures/mips64/

> 3) it really should have explained what you did with
> smp_llsc_mb/smp_mb__before_llsc() in _detail_.

And reading the MIPS64 v6.04 instruction set manual, I think 0x11/0x12
are _NOT_ transitive and therefore cannot be used to implement the
smp_mb__{before,after} stuff.

That is, in MIPS speak, those SYNC types are Ordering Barriers, not
Completion Barriers. They need not be globally performed.
Peter Zijlstra Jan. 12, 2016, 10:40 a.m. UTC | #6
On Tue, Jan 12, 2016 at 11:25:55AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 12, 2016 at 10:27:11AM +0100, Peter Zijlstra wrote:
> > 2) the changelog _completely_ fails to explain the sync 0x11 and sync
> > 0x12 semantics nor does it provide a publicly accessible link to
> > documentation that does.
> 
> Ralf pointed me at: https://imgtec.com/mips/architectures/mips64/
> 
> > 3) it really should have explained what you did with
> > smp_llsc_mb/smp_mb__before_llsc() in _detail_.
> 
> And reading the MIPS64 v6.04 instruction set manual, I think 0x11/0x12
> are _NOT_ transitive and therefore cannot be used to implement the
> smp_mb__{before,after} stuff.
> 
> That is, in MIPS speak, those SYNC types are Ordering Barriers, not
> Completion Barriers. They need not be globally performed.

Which if true; and I know Will has some questions here; would also mean
that you 'cannot' use the ACQUIRE/RELEASE barriers for your locks as was
recently suggested by David Daney.

That is, currently all architectures -- with exception of PPC -- have
RCsc locks, but using these non-transitive things will get you RCpc
locks.

So yes, MIPS can go RCpc for its locks and share the burden of pain with
PPC, but that needs to be a very concious decision.
Will Deacon Jan. 12, 2016, 11:41 a.m. UTC | #7
On Tue, Jan 12, 2016 at 11:40:12AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 12, 2016 at 11:25:55AM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 12, 2016 at 10:27:11AM +0100, Peter Zijlstra wrote:
> > > 2) the changelog _completely_ fails to explain the sync 0x11 and sync
> > > 0x12 semantics nor does it provide a publicly accessible link to
> > > documentation that does.
> > 
> > Ralf pointed me at: https://imgtec.com/mips/architectures/mips64/
> > 
> > > 3) it really should have explained what you did with
> > > smp_llsc_mb/smp_mb__before_llsc() in _detail_.
> > 
> > And reading the MIPS64 v6.04 instruction set manual, I think 0x11/0x12
> > are _NOT_ transitive and therefore cannot be used to implement the
> > smp_mb__{before,after} stuff.
> > 
> > That is, in MIPS speak, those SYNC types are Ordering Barriers, not
> > Completion Barriers. They need not be globally performed.
> 
> Which if true; and I know Will has some questions here; would also mean
> that you 'cannot' use the ACQUIRE/RELEASE barriers for your locks as was
> recently suggested by David Daney.

The issue I have with the SYNC description in the text above is that it
describes the single CPU (program order) and the dual-CPU (confusingly
named global order) cases, but then doesn't generalise any further. That
means we can't sensibly reason about transitivity properties when a third
agent is involved. For example, the WRC+sync+addr test:


P0:
Wx = 1

P1:
Rx == 1
SYNC
Wy = 1

P2:
Ry == 1
<address dep>
Rx = 0


I can't find anything to forbid that, given the text. The main problem
is having the SYNC on P1 affect the write by P0.

> That is, currently all architectures -- with exception of PPC -- have
> RCsc locks, but using these non-transitive things will get you RCpc
> locks.
> 
> So yes, MIPS can go RCpc for its locks and share the burden of pain with
> PPC, but that needs to be a very concious decision.

I think it's much worse than RCpc, given my interpretation of the wording.

Will
Leonid Yegoshin Jan. 12, 2016, 8:45 p.m. UTC | #8
(I try to answer on multiple mails in one)

First of all, it seems like some generic notes should be given here:

1. Generic MIPS "SYNC" (aka "SYNC 0") instruction is a very heavy in 
some CPUs. On that CPUs it basically kills pipelines in each CPU, can do 
a special memory/IO bus transaction (similar to "fence") and hold a 
system until all R/W is completed. It is like Big Kernel Lock but worse. 
So, the move to SMP_* kind of barriers is needed to improve performance, 
especially on newest CPUs with long pipelines.

2. MIPS Arch document may be misleading because words "ordering" and 
"completion" means different from Linux, the SYNC instruction 
description is written for HW engineers. I wrote that in a separate 
patch of the same patchset - 
http://patchwork.linux-mips.org/patch/10505/ "MIPS: R6: Use lightweight 
SYNC instruction in smp_* memory barriers":

> This instructions were specifically designed to work for smp_*() sort of
> memory barriers in MIPS R2/R3/R5 and R6.
>
> Unfortunately, it's description is very cryptic and is done in HW engineering
> style which prevents use of it by SW.

3. I bother MIPS Arch team long time until I completely understood that 
MIPS SYNC_WMB, SYNC_MB, SYNC_RMB, SYNC_RELEASE and SYNC_ACQUIRE do an 
exactly that is required in Documentation/memory-barriers.txt


In Peter Zijlstra mail:

> 1) you do not make such things selectable; either the hardware needs
> them or it doesn't. If it does you_must_  use them, however unlikely.
It is selectable only for MIPS R2 but not MIPS R6. The reason is - most 
of MIPS R2 CPUs have short pipeline and that SYNC is just waste of CPU 
resource, especially taking into account that "lightweight syncs" are 
converted to a heavy "SYNC 0" in many of that CPUs. However the latest 
MIPS/Imagination CPU have a pipeline long enough to hit a problem - 
absence of SYNC at LL/SC inside atomics, barriers etc.

> And reading the MIPS64 v6.04 instruction set manual, I think 0x11/0x12
> are_NOT_  transitive and therefore cannot be used to implement the
> smp_mb__{before,after} stuff.
>
> That is, in MIPS speak, those SYNC types are Ordering Barriers, not
> Completion Barriers.

Please see above, point 2.

> That is, currently all architectures -- with exception of PPC -- have
> RCsc locks, but using these non-transitive things will get you RCpc
> locks.
>
> So yes, MIPS can go RCpc for its locks and share the burden of pain with
> PPC, but that needs to be a very concious decision.

I don't understand that - I tried hard but I can't find any word like 
"RCsc", "RCpc" in Documents/ directory. Web search goes nowhere, of course.


In Will Deacon mail:

> The issue I have with the SYNC description in the text above is that it
> describes the single CPU (program order) and the dual-CPU (confusingly
> named global order) cases, but then doesn't generalise any further. That
> means we can't sensibly reason about transitivity properties when a third
> agent is involved. For example, the WRC+sync+addr test:
>
>
> P0:
> Wx = 1
>
> P1:
> Rx == 1
> SYNC
> Wy = 1
>
> P2:
> Ry == 1
> <address dep>
> Rx = 0
>
>
> I can't find anything to forbid that, given the text. The main problem
> is having the SYNC on P1 affect the write by P0.

As I understand that test, the visibility of P0: W[x] = 1 is identical 
to P1 and P2 here. If P1 got X before SYNC and write to Y after SYNC 
then instruction source register dependency tracking in P2 prevents a 
speculative load of X before P2 obtains Y from the same place as P0/P1 
and calculate address of X. If some load of X in P2 happens before 
address dependency calculation it's result is discarded.

Yes, you can't find that in MIPS SYNC instruction description, it is 
more likely in CM (Coherence Manager) area. I just pointed our arch team 
member responsible for documents and he will think how to explain that.

- Leonid.
Peter Zijlstra Jan. 12, 2016, 9:40 p.m. UTC | #9
On Tue, Jan 12, 2016 at 12:45:14PM -0800, Leonid Yegoshin wrote:
> (I try to answer on multiple mails in one)
> 
> First of all, it seems like some generic notes should be given here:
> 
> 1. Generic MIPS "SYNC" (aka "SYNC 0") instruction is a very heavy in some
> CPUs. On that CPUs it basically kills pipelines in each CPU, can do a
> special memory/IO bus transaction (similar to "fence") and hold a system
> until all R/W is completed. It is like Big Kernel Lock but worse. So, the
> move to SMP_* kind of barriers is needed to improve performance, especially
> on newest CPUs with long pipelines.

The MIPS SYNC isn't any worse than the PPC SYNC, x86 MFENCE or arm DSB
SY, yes they're heavy, so what.

> 2. MIPS Arch document may be misleading because words "ordering" and
> "completion" means different from Linux, the SYNC instruction description is
> written for HW engineers. I wrote that in a separate patch of the same
> patchset - http://patchwork.linux-mips.org/patch/10505/ "MIPS: R6: Use
> lightweight SYNC instruction in smp_* memory barriers":

Did you actually say anything here?

> >This instructions were specifically designed to work for smp_*() sort of
> >memory barriers in MIPS R2/R3/R5 and R6.
> >
> >Unfortunately, it's description is very cryptic and is done in HW engineering
> >style which prevents use of it by SW.
> 
> 3. I bother MIPS Arch team long time until I completely understood that MIPS
> SYNC_WMB, SYNC_MB, SYNC_RMB, SYNC_RELEASE and SYNC_ACQUIRE do an exactly
> that is required in Documentation/memory-barriers.txt

Ha! and you think that document covers all the really fun details?

In particular we're very much all 'confused' about the various notions
of transitivity and what barriers imply how much of it.

> In Peter Zijlstra mail:
> 
> >1) you do not make such things selectable; either the hardware needs
> >them or it doesn't. If it does you_must_  use them, however unlikely.

> It is selectable only for MIPS R2 but not MIPS R6. The reason is - most of
> MIPS R2 CPUs have short pipeline and that SYNC is just waste of CPU
> resource, especially taking into account that "lightweight syncs" are
> converted to a heavy "SYNC 0" in many of that CPUs. However the latest
> MIPS/Imagination CPU have a pipeline long enough to hit a problem - absence
> of SYNC at LL/SC inside atomics, barriers etc.

What ?! Are you saying that because R2 has short pipelines its unlikely
to hit the reordering issues and we can omit barriers?

> >And reading the MIPS64 v6.04 instruction set manual, I think 0x11/0x12
> >are_NOT_  transitive and therefore cannot be used to implement the
> >smp_mb__{before,after} stuff.
> >
> >That is, in MIPS speak, those SYNC types are Ordering Barriers, not
> >Completion Barriers.
> 
> Please see above, point 2.

That did not in fact enlighten things. Are they transitive/multi-copy
atomic or not?

(and here Will will go into great detail on the differences between the
two and make our collective brains explode :-)

> >That is, currently all architectures -- with exception of PPC -- have
> >RCsc locks, but using these non-transitive things will get you RCpc
> >locks.
> >
> >So yes, MIPS can go RCpc for its locks and share the burden of pain with
> >PPC, but that needs to be a very concious decision.
> 
> I don't understand that - I tried hard but I can't find any word like
> "RCsc", "RCpc" in Documents/ directory. Web search goes nowhere, of course.

From: lkml.kernel.org/r/20150828153921.GF19282@twins.programming.kicks-ass.net

Yes, the difference between RCpc and RCsc is in the meaning of RELEASE +
ACQUIRE. With RCsc that implies a full memory barrier, with RCpc it does
not.

Currently PowerPC is the only arch that (can, and) does RCpc and gives a
weaker RELEASE + ACQUIRE. Only the CPU who did the ACQUIRE is guaranteed
to see the stores of the CPU which did the RELEASE in order.

As it stands, RCU is the only _known_ codebase where this matters, but
we did in fact write code for a fair number of years 'assuming' RELEASE
+ ACQUIRE was a full barrier, so who knows what else is out there.


RCsc - release consistency sequential consistency
RCpc - release consistency processor consistency

https://en.wikipedia.org/wiki/Processor_consistency
Leonid Yegoshin Jan. 13, 2016, 12:21 a.m. UTC | #10
On 01/12/2016 01:40 PM, Peter Zijlstra wrote:
>
>> It is selectable only for MIPS R2 but not MIPS R6. The reason is - most of
>> MIPS R2 CPUs have short pipeline and that SYNC is just waste of CPU
>> resource, especially taking into account that "lightweight syncs" are
>> converted to a heavy "SYNC 0" in many of that CPUs. However the latest
>> MIPS/Imagination CPU have a pipeline long enough to hit a problem - absence
>> of SYNC at LL/SC inside atomics, barriers etc.
> What ?! Are you saying that because R2 has short pipelines its unlikely
> to hit the reordering issues and we can omit barriers?

It was my guess to explain - why barriers was not included originally. 
You can check with Ralf, he knows more about that time MIPS Linux code.

I bother with this more than 2 years and I just try to solve that issue 
- in recent CPUs the load after LL/SC synchronization instruction loop 
can get ahead of SC for sure, it was tested.

>
>>> And reading the MIPS64 v6.04 instruction set manual, I think 0x11/0x12
>>> are_NOT_  transitive and therefore cannot be used to implement the
>>> smp_mb__{before,after} stuff.
>>>
>>> That is, in MIPS speak, those SYNC types are Ordering Barriers, not
>>> Completion Barriers.
>> Please see above, point 2.
> That did not in fact enlighten things. Are they transitive/multi-copy
> atomic or not?

Peter Zijlstra recently wrote: "In particular we're very much all 
'confused' about the various notions of transitivity". I am actually 
confused too and need some examples here.

>
> (and here Will will go into great detail on the differences between the
> two and make our collective brains explode :-)
>
>>> That is, currently all architectures -- with exception of PPC -- have
>>> RCsc locks, but using these non-transitive things will get you RCpc
>>> locks.
>>>
>>> So yes, MIPS can go RCpc for its locks and share the burden of pain with
>>> PPC, but that needs to be a very concious decision.
>> I don't understand that - I tried hard but I can't find any word like
>> "RCsc", "RCpc" in Documents/ directory. Web search goes nowhere, of course.
> From: lkml.kernel.org/r/20150828153921.GF19282@twins.programming.kicks-ass.net
>
> Yes, the difference between RCpc and RCsc is in the meaning of RELEASE +
> ACQUIRE. With RCsc that implies a full memory barrier, with RCpc it does
> not.

MIPS Arch starting from R2 requires that. If some CPU can't, it should 
execute a full "SYNC 0" instead, which is a full memory barrier.

>
> Currently PowerPC is the only arch that (can, and) does RCpc and gives a
> weaker RELEASE + ACQUIRE. Only the CPU who did the ACQUIRE is guaranteed
> to see the stores of the CPU which did the RELEASE in order.

Yes, it was a goal for SYNC_ACQUIRE and SYNC_RELEASE.

Caveats:

     - "Full memory barrier" on MIPS means - full barrier for any device 
in coherent domain. In MIPS Tech/Imagination Tech MIPS-based CPU it is 
"for any device connected to CM or IOCU + directly connected memory".

     - It is not applied to instruction fetch. However, I-Cache flushes 
and SYNCI are consistent with that. There is also hazard barrier 
instructions to clear CPU pipeline to some extent - to help with this 
limitation.

I don't think that these caveats prevent a correct Acquire/Release semantic.

- Leonid.
Will Deacon Jan. 13, 2016, 10:45 a.m. UTC | #11
On Tue, Jan 12, 2016 at 12:45:14PM -0800, Leonid Yegoshin wrote:
> >The issue I have with the SYNC description in the text above is that it
> >describes the single CPU (program order) and the dual-CPU (confusingly
> >named global order) cases, but then doesn't generalise any further. That
> >means we can't sensibly reason about transitivity properties when a third
> >agent is involved. For example, the WRC+sync+addr test:
> >
> >
> >P0:
> >Wx = 1
> >
> >P1:
> >Rx == 1
> >SYNC
> >Wy = 1
> >
> >P2:
> >Ry == 1
> ><address dep>
> >Rx = 0
> >
> >
> >I can't find anything to forbid that, given the text. The main problem
> >is having the SYNC on P1 affect the write by P0.
> 
> As I understand that test, the visibility of P0: W[x] = 1 is identical to P1
> and P2 here. If P1 got X before SYNC and write to Y after SYNC then
> instruction source register dependency tracking in P2 prevents a speculative
> load of X before P2 obtains Y from the same place as P0/P1 and calculate
> address of X. If some load of X in P2 happens before address dependency
> calculation it's result is discarded.

I don't think the address dependency is enough on its own. By that
reasoning, the following variant (WRC+addr+addr) would work too:


P0:
Wx = 1

P1:
Rx == 1
<address dep>
Wy = 1

P2:
Ry == 1
<address dep>
Rx = 0


So are you saying that this is also forbidden?
Imagine that P0 and P1 are two threads that share a store buffer. What
then?

> Yes, you can't find that in MIPS SYNC instruction description, it is more
> likely in CM (Coherence Manager) area. I just pointed our arch team member
> responsible for documents and he will think how to explain that.

I tried grepping the linked documents for "coherence manager" but couldn't
find anything. Is the description you refer to available anywhere?

Will
Leonid Yegoshin Jan. 13, 2016, 7:02 p.m. UTC | #12
On 01/13/2016 02:45 AM, Will Deacon wrote:
> On Tue, Jan 12, 2016 at 12:45:14PM -0800, Leonid Yegoshin wrote:
>>
> I don't think the address dependency is enough on its own. By that
> reasoning, the following variant (WRC+addr+addr) would work too:
>
>
> P0:
> Wx = 1
>
> P1:
> Rx == 1
> <address dep>
> Wy = 1
>
> P2:
> Ry == 1
> <address dep>
> Rx = 0
>
>
> So are you saying that this is also forbidden?
> Imagine that P0 and P1 are two threads that share a store buffer. What
> then?
>

I ask HW team about it but I have a question - has it any relationship 
with replacing MIPS SYNC with lightweight SYNCs (SYNC_WMB etc)? You use 
any barrier or do not use it and I just voice an intention to use a more 
efficient instruction instead of bold hummer (SYNC instruction). If you 
don't use any barrier here then it is a different issue.

May be it has sense to return back to original issue?

- Leonid
Peter Zijlstra Jan. 13, 2016, 8:48 p.m. UTC | #13
On Wed, Jan 13, 2016 at 11:02:35AM -0800, Leonid Yegoshin wrote:

> I ask HW team about it but I have a question - has it any relationship with
> replacing MIPS SYNC with lightweight SYNCs (SYNC_WMB etc)?

Of course. If you cannot explain the semantics of the primitives you
introduce, how can we judge the patch.

This barrier business is hard enough as it is, but magic unexplained
hardware makes it impossible.

Rest assured, you (MIPS) isn't the first (nor likely the last) to go
through all this. We've had these discussions (and to a certain extend
are still having them) for x86, PPC, Alpha, ARM, etc..

Any every time new barriers instructions get introduced we had better
have a full and comprehensive explanation to go along with them.
Leonid Yegoshin Jan. 13, 2016, 8:58 p.m. UTC | #14
On 01/13/2016 12:48 PM, Peter Zijlstra wrote:
> On Wed, Jan 13, 2016 at 11:02:35AM -0800, Leonid Yegoshin wrote:
>
>> I ask HW team about it but I have a question - has it any relationship with
>> replacing MIPS SYNC with lightweight SYNCs (SYNC_WMB etc)?
> Of course. If you cannot explain the semantics of the primitives you
> introduce, how can we judge the patch.
>
>
You missed a point - it is a question about replacement of SYNC with 
lightweight primitives. It is NOT a question about multithread system 
behavior without any SYNC. The answer on a latest Will's question lies 
in different area.

- Leonid.
Leonid Yegoshin Jan. 13, 2016, 10:26 p.m. UTC | #15
On 01/13/2016 02:45 AM, Will Deacon wrote:
>>
> I don't think the address dependency is enough on its own. By that
> reasoning, the following variant (WRC+addr+addr) would work too:
>
>
> P0:
> Wx = 1
>
> P1:
> Rx == 1
> <address dep>
> Wy = 1
>
> P2:
> Ry == 1
> <address dep>
> Rx = 0
>
>
> So are you saying that this is also forbidden?
> Imagine that P0 and P1 are two threads that share a store buffer. What
> then?

OK, I collected answers and it is:

     In MIPS R6 this test passes OK, I mean - P2: Rx = 1 if Ry is read 
as 1. By design.

     However, it is unclear that happens in MIPS R2 1004K.

     Moreover, there are voices against guarantee that it will be in 
future and that voices point me to Documentation/memory-barriers.txt 
section "DATA DEPENDENCY BARRIERS" examples which require SYNC_RMB 
between loading address/index and using that for loading data based on 
that address or index for shared data (look on CPU2 pseudo-code):
> To deal with this, a data dependency barrier or better must be inserted
> between the address load and the data load:
>
>         CPU 1                 CPU 2
>         ===============       ===============
>         { A == 1, B == 2, C = 3, P == &A, Q == &C }
>         B = 4;
>         <write barrier>
>         WRITE_ONCE(P, &B);
>                               Q = READ_ONCE(P);
>                               <data dependency barrier> <----------- 
> SYNC_RMB is here
>                               D = *Q;
...
> Another example of where data dependency barriers might be required is 
> where a
> number is read from memory and then used to calculate the index for an 
> array
> access:
>
>         CPU 1                 CPU 2
>         ===============       ===============
>         { M[0] == 1, M[1] == 2, M[3] = 3, P == 0, Q == 3 }
>         M[1] = 4;
>         <write barrier>
>         WRITE_ONCE(P, 1);
>                               Q = READ_ONCE(P);
>                               <data dependency barrier> <------------ 
> SYNC_RMB is here
>                               D = M[Q];

That voices say that there is a legitimate reason to relax HW here for 
performance if SYNC_RMB is needed anyway to work with this sequence of 
shared data.


And all that is out-of-topic here in my mind. I just want to be sure 
that this patchset still provides a use of a specific lightweight SYNCs 
on MIPS vs bold and heavy generalized "SYNC 0" in any case.

- Leonid.
Michael S. Tsirkin Jan. 14, 2016, 9:24 a.m. UTC | #16
On Wed, Jan 13, 2016 at 02:26:16PM -0800, Leonid Yegoshin wrote:
> And all that is out-of-topic here in my mind. I just want to be sure that
> this patchset still provides a use of a specific lightweight SYNCs on MIPS
> vs bold and heavy generalized "SYNC 0" in any case.
> 
> - Leonid.

Of course it does. All this patchset does is rename smp_mb/rmb/wmb
to __smp_mb()/__smp_rmb()/__smp_wmb()
and then asm-generic does #define smp_mb __smp_mb
or #define smp_mb barrier depending on CONFIG_SMP.

Why is that needed? So we can implement
[PATCH v3 28/41] asm-generic: implement virt_xxx memory barriers
Will Deacon Jan. 14, 2016, 12:04 p.m. UTC | #17
On Wed, Jan 13, 2016 at 12:58:22PM -0800, Leonid Yegoshin wrote:
> On 01/13/2016 12:48 PM, Peter Zijlstra wrote:
> >On Wed, Jan 13, 2016 at 11:02:35AM -0800, Leonid Yegoshin wrote:
> >
> >>I ask HW team about it but I have a question - has it any relationship with
> >>replacing MIPS SYNC with lightweight SYNCs (SYNC_WMB etc)?
> >Of course. If you cannot explain the semantics of the primitives you
> >introduce, how can we judge the patch.
> >
> >
> You missed a point - it is a question about replacement of SYNC with
> lightweight primitives. It is NOT a question about multithread system
> behavior without any SYNC. The answer on a latest Will's question lies in
> different area.

The reason we (Peter and I) care about this isn't because we enjoy being
obstructive. It's because there is a whole load of core (i.e. portable)
kernel code that is written to the *kernel* memory model. For example,
the scheduler, RCU, mutex implementations, perf, drivers, you name it.

Consequently, it's important that the architecture back-ends implement
these portable primitives (e.g. smp_mb()) in a way that satisfies the
kernel memory model so that core code doesn't need to worry about the
underlying architecture for synchronisation purposes. You could turn
around and say "but if MIPS gets it wrong, then that's MIPS's problem",
but actually not having a general understanding of the ordering guarantees
provided by each architecture makes it very difficult for us to extend
the kernel memory model in such a way that it can be implemented
efficiently across the board *and* relied upon by core code.

The virtio patch at the start of the thread doesn't particularly concern
me. It's the other patches you linked to that implement acquire/release
that have me worried.

Will
Will Deacon Jan. 14, 2016, 12:14 p.m. UTC | #18
On Wed, Jan 13, 2016 at 02:26:16PM -0800, Leonid Yegoshin wrote:
> On 01/13/2016 02:45 AM, Will Deacon wrote:
> >>
> >I don't think the address dependency is enough on its own. By that
> >reasoning, the following variant (WRC+addr+addr) would work too:
> >
> >
> >P0:
> >Wx = 1
> >
> >P1:
> >Rx == 1
> ><address dep>
> >Wy = 1
> >
> >P2:
> >Ry == 1
> ><address dep>
> >Rx = 0
> >
> >
> >So are you saying that this is also forbidden?
> >Imagine that P0 and P1 are two threads that share a store buffer. What
> >then?
> 
> OK, I collected answers and it is:
> 
>     In MIPS R6 this test passes OK, I mean - P2: Rx = 1 if Ry is read as 1.
> By design.
> 
>     However, it is unclear that happens in MIPS R2 1004K.

How can it be unclear? If, for example, the outcome is permitted on that
CPU, then your original reasoning for the WRC+sync+addr doesn't apply
there and SYNC is not transitive. That's what I'm trying to get to the
bottom of.

Does the MIPS kernel target a particular CPU at compile time?

>     Moreover, there are voices against guarantee that it will be in future
> and that voices point me to Documentation/memory-barriers.txt section "DATA
> DEPENDENCY BARRIERS" examples which require SYNC_RMB between loading
> address/index and using that for loading data based on that address or index
> for shared data (look on CPU2 pseudo-code):
> >To deal with this, a data dependency barrier or better must be inserted
> >between the address load and the data load:
> >
> >        CPU 1                 CPU 2
> >        ===============       ===============
> >        { A == 1, B == 2, C = 3, P == &A, Q == &C }
> >        B = 4;
> >        <write barrier>
> >        WRITE_ONCE(P, &B);
> >                              Q = READ_ONCE(P);
> >                              <data dependency barrier> <-----------
> >SYNC_RMB is here
> >                              D = *Q;
> ...
> >Another example of where data dependency barriers might be required is
> >where a
> >number is read from memory and then used to calculate the index for an
> >array
> >access:
> >
> >        CPU 1                 CPU 2
> >        ===============       ===============
> >        { M[0] == 1, M[1] == 2, M[3] = 3, P == 0, Q == 3 }
> >        M[1] = 4;
> >        <write barrier>
> >        WRITE_ONCE(P, 1);
> >                              Q = READ_ONCE(P);
> >                              <data dependency barrier> <------------
> >SYNC_RMB is here
> >                              D = M[Q];
> 
> That voices say that there is a legitimate reason to relax HW here for
> performance if SYNC_RMB is needed anyway to work with this sequence of
> shared data.

Are you saying that MIPS needs to implement [smp_]read_barrier_depends?

> And all that is out-of-topic here in my mind. I just want to be sure that
> this patchset still provides a use of a specific lightweight SYNCs on MIPS
> vs bold and heavy generalized "SYNC 0" in any case.

We may be highjacking the thread slightly, but there are much bigger
issues at play here if you want to start using lightweight barriers to
implement relaxed kernel primitives such as smp_load_acquire and
smp_store_release.

Will
Paul E. McKenney Jan. 14, 2016, 4:16 p.m. UTC | #19
On Thu, Jan 14, 2016 at 12:04:45PM +0000, Will Deacon wrote:
> On Wed, Jan 13, 2016 at 12:58:22PM -0800, Leonid Yegoshin wrote:
> > On 01/13/2016 12:48 PM, Peter Zijlstra wrote:
> > >On Wed, Jan 13, 2016 at 11:02:35AM -0800, Leonid Yegoshin wrote:
> > >
> > >>I ask HW team about it but I have a question - has it any relationship with
> > >>replacing MIPS SYNC with lightweight SYNCs (SYNC_WMB etc)?
> > >Of course. If you cannot explain the semantics of the primitives you
> > >introduce, how can we judge the patch.
> > >
> > >
> > You missed a point - it is a question about replacement of SYNC with
> > lightweight primitives. It is NOT a question about multithread system
> > behavior without any SYNC. The answer on a latest Will's question lies in
> > different area.
> 
> The reason we (Peter and I) care about this isn't because we enjoy being
> obstructive. It's because there is a whole load of core (i.e. portable)
> kernel code that is written to the *kernel* memory model. For example,
> the scheduler, RCU, mutex implementations, perf, drivers, you name it.
> 
> Consequently, it's important that the architecture back-ends implement
> these portable primitives (e.g. smp_mb()) in a way that satisfies the
> kernel memory model so that core code doesn't need to worry about the
> underlying architecture for synchronisation purposes. You could turn
> around and say "but if MIPS gets it wrong, then that's MIPS's problem",
> but actually not having a general understanding of the ordering guarantees
> provided by each architecture makes it very difficult for us to extend
> the kernel memory model in such a way that it can be implemented
> efficiently across the board *and* relied upon by core code.

What Will said!

Yes, you can cut corners within MIPS architecture-specific code,
but primitives that are used in the core kernel really do need to
work as expected.

							Thanx, Paul

> The virtio patch at the start of the thread doesn't particularly concern
> me. It's the other patches you linked to that implement acquire/release
> that have me worried.
> 
> Will
>
Leonid Yegoshin Jan. 14, 2016, 7:28 p.m. UTC | #20
On 01/14/2016 04:14 AM, Will Deacon wrote:
> On Wed, Jan 13, 2016 at 02:26:16PM -0800, Leonid Yegoshin wrote:
>
>>      Moreover, there are voices against guarantee that it will be in future
>> and that voices point me to Documentation/memory-barriers.txt section "DATA
>> DEPENDENCY BARRIERS" examples which require SYNC_RMB between loading
>> address/index and using that for loading data based on that address or index
>> for shared data (look on CPU2 pseudo-code):
>>> To deal with this, a data dependency barrier or better must be inserted
>>> between the address load and the data load:
>>>
>>>         CPU 1                 CPU 2
>>>         ===============       ===============
>>>         { A == 1, B == 2, C = 3, P == &A, Q == &C }
>>>         B = 4;
>>>         <write barrier>
>>>         WRITE_ONCE(P, &B);
>>>                               Q = READ_ONCE(P);
>>>                               <data dependency barrier> <-----------
>>> SYNC_RMB is here
>>>                               D = *Q;
>> ...
>>> Another example of where data dependency barriers might be required is
>>> where a
>>> number is read from memory and then used to calculate the index for an
>>> array
>>> access:
>>>
>>>         CPU 1                 CPU 2
>>>         ===============       ===============
>>>         { M[0] == 1, M[1] == 2, M[3] = 3, P == 0, Q == 3 }
>>>         M[1] = 4;
>>>         <write barrier>
>>>         WRITE_ONCE(P, 1);
>>>                               Q = READ_ONCE(P);
>>>                               <data dependency barrier> <------------
>>> SYNC_RMB is here
>>>                               D = M[Q];
>> That voices say that there is a legitimate reason to relax HW here for
>> performance if SYNC_RMB is needed anyway to work with this sequence of
>> shared data.
> Are you saying that MIPS needs to implement [smp_]read_barrier_depends?

It is not me, it is Documentation/memory-barriers.txt from kernel sources.

HW team can't work on voice statements, it should do a work on written 
documents. If that is written (see above the lines which I marked by 
"SYNC_RMB") then anybody should use it and never mind how many 
CPUs/Threads are in play. This examples explicitly requires to insert 
"data dependency barrier" between reading a shared pointer/index and 
using it to fetch a shared data. So, your WRC+addr+addr test is a 
violation of that recommendation.

- Leonid.
Leonid Yegoshin Jan. 14, 2016, 7:42 p.m. UTC | #21
On 01/14/2016 08:16 AM, Paul E. McKenney wrote:
> On Thu, Jan 14, 2016 at 12:04:45PM +0000, Will Deacon wrote:
>> On Wed, Jan 13, 2016 at 12:58:22PM -0800, Leonid Yegoshin wrote:
>>> On 01/13/2016 12:48 PM, Peter Zijlstra wrote:
>>>> On Wed, Jan 13, 2016 at 11:02:35AM -0800, Leonid Yegoshin wrote:
>>>>
>>>>> I ask HW team about it but I have a question - has it any relationship with
>>>>> replacing MIPS SYNC with lightweight SYNCs (SYNC_WMB etc)?
>>>> Of course. If you cannot explain the semantics of the primitives you
>>>> introduce, how can we judge the patch.
>>>>
>>>>
>>> You missed a point - it is a question about replacement of SYNC with
>>> lightweight primitives. It is NOT a question about multithread system
>>> behavior without any SYNC. The answer on a latest Will's question lies in
>>> different area.
> What Will said!
>
> Yes, you can cut corners within MIPS architecture-specific code,
> but primitives that are used in the core kernel really do need to
> work as expected.
>
> 							Thanx, Paul
>
>
Absolutelly! Please use SYNC - right now it is not.

An the only point - please use an appropriate SYNC_* barriers instead of 
heavy bold hammer. That stuff was design explicitly to support the 
requirements of Documentation/memory-barriers.txt

It is easy - just use smp_acquire instead of plain smp_mb 
insmp_load_acquire, at least for MIPS.

- Leonid.
- Leonid.
Leonid Yegoshin Jan. 14, 2016, 8:12 p.m. UTC | #22
On 01/14/2016 04:04 AM, Will Deacon wrote:
> Consequently, it's important that the architecture back-ends implement 
> these portable primitives (e.g. smp_mb()) in a way that satisfies the 
> kernel memory model so that core code doesn't need to worry about the 
> underlying architecture for synchronisation purposes.

It seems you don't listen me. I said multiple times - MIPS 
implementation of SYNC_RMB/SYNC_WMB/SYNC_MB/SYNC_ACQUIRE/SYNC_RELEASE 
instructions matches the description of 
smp_rmb/smp_wmb/smp_mb/sync_acquire/sync_release from 
Documentation/memory-barriers.txt file.

What else do you want from me - RTL or microArch design for that?

- Leonid.
Peter Zijlstra Jan. 14, 2016, 8:15 p.m. UTC | #23
On Thu, Jan 14, 2016 at 11:42:02AM -0800, Leonid Yegoshin wrote:
> An the only point - please use an appropriate SYNC_* barriers instead of
> heavy bold hammer. That stuff was design explicitly to support the
> requirements of Documentation/memory-barriers.txt

That's madness. That document changes from version to version as to what
we _think_ the actual hardware does. It is _NOT_ a specification.

You cannot design hardware from that. Its incomplete and fails to
specify a bunch of things. It not a mathematically sound definition of a
memory model.

Please stop referring to that document for what a particular barrier
_should_ do.  Explain what MIPS does, so we can attempt to integrate
this knowledge with our knowledge of PPC/ARM/Alpha/x86/etc. and improve
upon our understanding of hardware and improve the Linux memory model.
Paul E. McKenney Jan. 14, 2016, 8:34 p.m. UTC | #24
On Thu, Jan 14, 2016 at 11:28:18AM -0800, Leonid Yegoshin wrote:
> On 01/14/2016 04:14 AM, Will Deacon wrote:
> >On Wed, Jan 13, 2016 at 02:26:16PM -0800, Leonid Yegoshin wrote:
> >
> >>     Moreover, there are voices against guarantee that it will be in future
> >>and that voices point me to Documentation/memory-barriers.txt section "DATA
> >>DEPENDENCY BARRIERS" examples which require SYNC_RMB between loading
> >>address/index and using that for loading data based on that address or index
> >>for shared data (look on CPU2 pseudo-code):
> >>>To deal with this, a data dependency barrier or better must be inserted
> >>>between the address load and the data load:
> >>>
> >>>        CPU 1                 CPU 2
> >>>        ===============       ===============
> >>>        { A == 1, B == 2, C = 3, P == &A, Q == &C }
> >>>        B = 4;
> >>>        <write barrier>
> >>>        WRITE_ONCE(P, &B);
> >>>                              Q = READ_ONCE(P);
> >>>                              <data dependency barrier> <-----------
> >>>SYNC_RMB is here
> >>>                              D = *Q;
> >>...
> >>>Another example of where data dependency barriers might be required is
> >>>where a
> >>>number is read from memory and then used to calculate the index for an
> >>>array
> >>>access:
> >>>
> >>>        CPU 1                 CPU 2
> >>>        ===============       ===============
> >>>        { M[0] == 1, M[1] == 2, M[3] = 3, P == 0, Q == 3 }
> >>>        M[1] = 4;
> >>>        <write barrier>
> >>>        WRITE_ONCE(P, 1);
> >>>                              Q = READ_ONCE(P);
> >>>                              <data dependency barrier> <------------
> >>>SYNC_RMB is here
> >>>                              D = M[Q];
> >>That voices say that there is a legitimate reason to relax HW here for
> >>performance if SYNC_RMB is needed anyway to work with this sequence of
> >>shared data.
> >Are you saying that MIPS needs to implement [smp_]read_barrier_depends?
> 
> It is not me, it is Documentation/memory-barriers.txt from kernel sources.
> 
> HW team can't work on voice statements, it should do a work on
> written documents. If that is written (see above the lines which I
> marked by "SYNC_RMB") then anybody should use it and never mind how
> many CPUs/Threads are in play. This examples explicitly requires to
> insert "data dependency barrier" between reading a shared
> pointer/index and using it to fetch a shared data. So, your
> WRC+addr+addr test is a violation of that recommendation.

Perhaps Documentation/memory-barriers.txt needs additional clarification.
It would not be the first time.

If your CPU implicitly maintains ordering based on address and
data dependencies, then you don't need any instructions for
<data dependency barrier>.

The WRC+addr+addr is OK because data dependencies are not required to be
transitive, in other words, they are not required to flow from one CPU to
another without the help of an explicit memory barrier.  Transitivity is
instead supplied by smp_mb() and by smp_store_release()-smp_load_acquire()
chains.  Here is the Linux kernel code for WRC+addr+addr, give or take
(and no, I have no idea why anyone would want to write code like this):

	struct foo {
		struct foo **a;
	};
	struct foo b;
	struct foo c;
	struct foo d;
	struct foo e;
	struct foo f = { &d };
	struct foo g = { &e };
	struct foo *x = &b;

	void cpu0(void)
	{
		WRITE_ONCE(x, &f);
	}

	void cpu1(void)
	{
		struct foo *p;

		p = lockless_dereference(x);
		WRITE_ONCE(p->a, &x);
	}

	void cpu2(void)
	{
		r1 = lockless_dereference(f.a);
		WRITE_ONCE(*r1, &c);
	}

It is legal to end the run with x==&f and r1==&x.  To prevent this outcome,
we do the following:

	struct foo {
		struct foo **a;
	};
	struct foo b;
	struct foo c;
	struct foo d;
	struct foo e;
	struct foo f = { &d };
	struct foo g = { &e };
	struct foo *x = &b;

	void cpu0(void)
	{
		WRITE_ONCE(x, &f);
	}

	void cpu1(void)
	{
		struct foo *p;

		p = lockless_dereference(x);
		smp_store_release(&p->a, &x); /* Additional ordering. */
	}

	void cpu2(void)
	{
		r1 = lockless_dereference(f.a);
		WRITE_ONCE(*r1, &c);
	}

And I still don't know why anyone would need this sort of code.  ;-)

Alternatively, we pull cpu2() into cpu1():

	struct foo {
		struct foo **a;
	};
	struct foo b;
	struct foo c;
	struct foo d;
	struct foo e;
	struct foo f = { &d };
	struct foo g = { &e };
	struct foo *x = &b;

	void cpu0(void)
	{
		WRITE_ONCE(x, &f);
	}

	void cpu1(void)
	{
		struct foo *p;

		p = lockless_dereference(x);
		WRITE_ONCE(p->a, &x);
		r1 = lockless_dereference(f.a);
		WRITE_ONCE(*r1, &c);
	}

The ordering is now enforced by being within a single thread.  In fact,
the second lockless_dereference() can be READ_ONCE().

So, does MIPS maintain ordering within a given CPU based on address and
data dependencies?  If so, you don't need to emit memory-barrier instructions
for read_barrier_depends().

							Thanx, Paul
Paul E. McKenney Jan. 14, 2016, 8:36 p.m. UTC | #25
On Thu, Jan 14, 2016 at 09:15:13PM +0100, Peter Zijlstra wrote:
> On Thu, Jan 14, 2016 at 11:42:02AM -0800, Leonid Yegoshin wrote:
> > An the only point - please use an appropriate SYNC_* barriers instead of
> > heavy bold hammer. That stuff was design explicitly to support the
> > requirements of Documentation/memory-barriers.txt
> 
> That's madness. That document changes from version to version as to what
> we _think_ the actual hardware does. It is _NOT_ a specification.

There is work in progress on a specification, but please don't hold
your breath.  And I am not as optimistic as I might be about any formal
specification keeping up with the Linux kernel or with the hardware that
it supports.  But it seems worth a good try.

> You cannot design hardware from that. Its incomplete and fails to
> specify a bunch of things. It not a mathematically sound definition of a
> memory model.
> 
> Please stop referring to that document for what a particular barrier
> _should_ do.  Explain what MIPS does, so we can attempt to integrate
> this knowledge with our knowledge of PPC/ARM/Alpha/x86/etc. and improve
> upon our understanding of hardware and improve the Linux memory model.

Please!

							Thanx, Paul
Peter Zijlstra Jan. 14, 2016, 8:46 p.m. UTC | #26
On Thu, Jan 14, 2016 at 09:15:13PM +0100, Peter Zijlstra wrote:
> On Thu, Jan 14, 2016 at 11:42:02AM -0800, Leonid Yegoshin wrote:
> > An the only point - please use an appropriate SYNC_* barriers instead of
> > heavy bold hammer. That stuff was design explicitly to support the
> > requirements of Documentation/memory-barriers.txt
> 
> That's madness. That document changes from version to version as to what
> we _think_ the actual hardware does. It is _NOT_ a specification.
> 
> You cannot design hardware from that. Its incomplete and fails to
> specify a bunch of things. It not a mathematically sound definition of a
> memory model.
> 
> Please stop referring to that document for what a particular barrier
> _should_ do.  Explain what MIPS does, so we can attempt to integrate
> this knowledge with our knowledge of PPC/ARM/Alpha/x86/etc. and improve
> upon our understanding of hardware and improve the Linux memory model.

That is, if you'd managed to read that file at the right point in time,
you might have through we'd be OK with requiring a barrier for
control dependencies.

We got rid of that mistake. It was based on a flawed reading of the
Alpha docs. See: 105ff3cbf225 ("atomic: remove all traces of
READ_ONCE_CTRL() and atomic*_read_ctrl()")

Similarly, while the document goes to great length to explain the
read_barrier_depends thing, nobody actually thinks its a brilliant idea
to have. Ideally we'd kill the thing the moment we drop Alpha support.

Again, memory-barriers.txt is _NOT_, I repeat, _NOT_ a hardware spec, it
is not even a recommendation. It are our best effort (but flawed)
scribbles of what we think is makes sense given the huge amount of
actual hardware we have to run on.


As to the ACQUIRE/RELEASE semantics, ARM64 actually has
multi-copy-atomic acquire/release (as does ia64, although in reality it
doesn't actually have acquire/release). PPC otoh does _NOT_ have this,
and is currently the only arch to suffer RCpc locks.

Now for a long long time we assumed our locks were RCsc, and we've
written code assuming UNLOCK x + LOCK y was in fact a full barrier with
transitiviy. Then we figured out PPC didn't actually match that. RCU is
the only piece of code we _know_ relied on that, but there might be more
out there...

So we document, for new code, that UNLOCK+LOCK isn't a MB, while at the
same time we lobby PPC to stick a full barrier in and get rid of this
stuff.

Nobody really likes RCpc locks, esp. given the history we have of
assuming RCsc.

The current document allowing for RCpc is not an endorsement thereof.
Ideally we'd _NOT_ have to worry about that. We can do without these
head-aches.


So again, stop referring to our document as a spec. Also please don't
make MIPS push the limits of weak memory models, we really can do
without the pain.
Leonid Yegoshin Jan. 14, 2016, 8:46 p.m. UTC | #27
On 01/14/2016 12:15 PM, Peter Zijlstra wrote:
> On Thu, Jan 14, 2016 at 11:42:02AM -0800, Leonid Yegoshin wrote:
>> An the only point - please use an appropriate SYNC_* barriers instead of
>> heavy bold hammer. That stuff was design explicitly to support the
>> requirements of Documentation/memory-barriers.txt
> That's madness. That document changes from version to version as to what
> we _think_ the actual hardware does. It is _NOT_ a specification.
>
> You cannot design hardware from that. Its incomplete and fails to
> specify a bunch of things. It not a mathematically sound definition of a
> memory model.
>
> Please stop referring to that document for what a particular barrier
> _should_ do.  Explain what MIPS does, so we can attempt to integrate
> this knowledge with our knowledge of PPC/ARM/Alpha/x86/etc. and improve
> upon our understanding of hardware and improve the Linux memory model.

I am afraid I can't help you here. It is very complicated stuff and a 
model is actually doesn't fit your assumptions about CPUs well without 
some simplifications which are based on what you want to have.

I say that SYNC_ACQUIRE/etc follows what you expect for smp_acquire etc 
(basing on that document). And at least two CPU models were tested with 
my patches (see it in LMO) for that last year and that instructions are 
implemented now in engineering kernel.

If you have something else in mind, you can ask me. But I prefer to do 
not deviate too much from Documentation/memory-barriers.txt, for exam - 
if it asks to have memory barrier somewhere, then I assume the code 
should have it, and please - don't ask me a test which violates the 
current version of document recommendations.

For a moment I don't see a significant changes in this document for MIPS 
Arch at least 1.5 year, and the only significant point is that MIPS CPU 
Arch doesn't have yet smp_read_barrier_depends() and smp_rmb() should be 
used instead.

- Leonid.
Paul E. McKenney Jan. 14, 2016, 8:48 p.m. UTC | #28
On Thu, Jan 14, 2016 at 12:12:53PM -0800, Leonid Yegoshin wrote:
> On 01/14/2016 04:04 AM, Will Deacon wrote:
> >Consequently, it's important that the architecture back-ends
> >implement these portable primitives (e.g. smp_mb()) in a way that
> >satisfies the kernel memory model so that core code doesn't need
> >to worry about the underlying architecture for synchronisation
> >purposes.
> 
> It seems you don't listen me. I said multiple times - MIPS
> implementation of
> SYNC_RMB/SYNC_WMB/SYNC_MB/SYNC_ACQUIRE/SYNC_RELEASE instructions
> matches the description of
> smp_rmb/smp_wmb/smp_mb/sync_acquire/sync_release from
> Documentation/memory-barriers.txt file.
> 
> What else do you want from me - RTL or microArch design for that?

I suspect that it is more likely that we are talking past each other.
This stuff is subtle and although we have better ways of talking about
it than (say) ten years ago, it is subtle.  Two ways of talking about
it are herd and ppcmem.

The overview of ppcmem (AKA armmem and cppmem) is here:
https://www.cl.cam.ac.uk/~pes20/ppcmem/help.html

The intro to herd is here: http://arxiv.org/pdf/1308.6810v5.pdf
It may be downloaded here: http://diy.inria.fr/herd/

As a very rough rule of thumb, herd is faster and easier to use
and ppcmem is more precise.

So SYNC_RMB is intended to implement smp_rmb(), correct?

You could use SYNC_ACQUIRE() to implement read_barrier_depends() and
smp_read_barrier_depends(), but SYNC_RMB probably does not suffice.
The reason for this is that smp_read_barrier_depends() must order the
pointer load against any subsequent read or write through a dereference
of that pointer.  For example:

	p = READ_ONCE(gp);
	smp_rmb();
	r1 = p->a; /* ordered by smp_rmb(). */
	p->b = 42; /* NOT ordered by smp_rmb(), BUG!!! */
	r2 = x; /* ordered by smp_rmb(), but doesn't need to be. */

In contrast:

	p = READ_ONCE(gp);
	smp_read_barrier_depends();
	r1 = p->a; /* ordered by smp_read_barrier_depends(). */
	p->b = 42; /* ordered by smp_read_barrier_depends(). */
	r2 = x; /* not ordered by smp_read_barrier_depends(), which is OK. */

Again, if your hardware maintains local ordering for address
and data dependencies, you can have read_barrier_depends() and
smp_read_barrier_depends() be no-ops like they are for most
architectures.

Does that help?

							Thanx, Paul
Leonid Yegoshin Jan. 14, 2016, 9:01 p.m. UTC | #29
I need some time to understand your test examples. However,

On 01/14/2016 12:34 PM, Paul E. McKenney wrote:
>
>
> The WRC+addr+addr is OK because data dependencies are not required to be
> transitive, in other words, they are not required to flow from one CPU to
> another without the help of an explicit memory barrier.

I don't see any reliable way to fit WRC+addr+addr into "DATA DEPENDENCY 
BARRIERS" section recommendation to have data dependency barrier between 
read of a shared pointer/index and read the shared data based on that 
pointer. If you have this two reads, it doesn't matter the rest of 
scenario, you should put the dependency barrier in code anyway. If you 
don't do it in WRC+addr+addr scenario then after years it can be easily 
changed to different scenario which fits some of scenario in "DATA 
DEPENDENCY BARRIERS" section and fails.

>    Transitivity is

Peter Zijlstra recently wrote: "In particular we're very much all 
'confused' about the various notions of transitivity". I am confused 
too, so - please use some more simple way to explain your words. Sorry, 
but we need a common ground first.

- Leonid.
Leonid Yegoshin Jan. 14, 2016, 9:24 p.m. UTC | #30
On 01/14/2016 12:48 PM, Paul E. McKenney wrote:
>
> So SYNC_RMB is intended to implement smp_rmb(), correct?
Yes.
>
> You could use SYNC_ACQUIRE() to implement read_barrier_depends() and
> smp_read_barrier_depends(), but SYNC_RMB probably does not suffice.

If smp_read_barrier_depends() is used to separate not only two reads but 
read pointer and WRITE basing on that pointer (example below) - yes. I 
just doesn't see any example of this in famous 
Documentation/memory-barriers.txt and had no chance to know what you use 
it in this way too.

> The reason for this is that smp_read_barrier_depends() must order the
> pointer load against any subsequent read or write through a dereference
> of that pointer.

I can't see that requirement anywhere in Documents directory. I mean - 
the words "write through a dereference of that pointer" or similar for 
smp_read_barrier_depends.

>    For example:
>
> 	p = READ_ONCE(gp);
> 	smp_rmb();
> 	r1 = p->a; /* ordered by smp_rmb(). */
> 	p->b = 42; /* NOT ordered by smp_rmb(), BUG!!! */
> 	r2 = x; /* ordered by smp_rmb(), but doesn't need to be. */
>
> In contrast:
>
> 	p = READ_ONCE(gp);
> 	smp_read_barrier_depends();
> 	r1 = p->a; /* ordered by smp_read_barrier_depends(). */
> 	p->b = 42; /* ordered by smp_read_barrier_depends(). */
> 	r2 = x; /* not ordered by smp_read_barrier_depends(), which is OK. */
>
> Again, if your hardware maintains local ordering for address
> and data dependencies, you can have read_barrier_depends() and
> smp_read_barrier_depends() be no-ops like they are for most
> architectures.

It is not so simple, I mean "local ordering for address and data 
dependencies". Local ordering is NOT enough. It happens that current 
MIPS R6 doesn't require in your example smp_read_barrier_depends() but 
in discussion it comes out that it may not. Because without 
smp_read_barrier_depends() your example can be a part of Will's 
WRC+addr+addr and we found some design which easily can bump into this 
test. And that design actually performs "local ordering for address and 
data dependencies" too.

- Leonid.
Paul E. McKenney Jan. 14, 2016, 9:29 p.m. UTC | #31
On Thu, Jan 14, 2016 at 01:01:05PM -0800, Leonid Yegoshin wrote:
> I need some time to understand your test examples. However,

Understood.

> On 01/14/2016 12:34 PM, Paul E. McKenney wrote:
> >
> >
> >The WRC+addr+addr is OK because data dependencies are not required to be
> >transitive, in other words, they are not required to flow from one CPU to
> >another without the help of an explicit memory barrier.
> 
> I don't see any reliable way to fit WRC+addr+addr into "DATA
> DEPENDENCY BARRIERS" section recommendation to have data dependency
> barrier between read of a shared pointer/index and read the shared
> data based on that pointer. If you have this two reads, it doesn't
> matter the rest of scenario, you should put the dependency barrier
> in code anyway. If you don't do it in WRC+addr+addr scenario then
> after years it can be easily changed to different scenario which
> fits some of scenario in "DATA DEPENDENCY BARRIERS" section and
> fails.

The trick is that lockless_dereference() contains an
smp_read_barrier_depends():

#define lockless_dereference(p) \
({ \
	typeof(p) _________p1 = READ_ONCE(p); \
	smp_read_barrier_depends(); /* Dependency order vs. p above. */ \
	(_________p1); \
})

Or am I missing your point?

> >   Transitivity is
> 
> Peter Zijlstra recently wrote: "In particular we're very much all
> 'confused' about the various notions of transitivity". I am confused
> too, so - please use some more simple way to explain your words.
> Sorry, but we need a common ground first.

OK, how about an example?  (Z6.3 in the ppcmem naming scheme.)

	int x, y, z;

	void cpu0(void)
	{
		WRITE_ONCE(x, 1);
		smp_wmb();
		WRITE_ONCE(y, 1);
	}

	void cpu1(void)
	{
		WRITE_ONCE(y, 2);
		smp_wmb();
		WRITE_ONCE(z, 1);
	}

	void cpu2(void)
	{
		r1 = READ_ONCE(z);
		smp_rmb();
		r2 = read_once(x);
	}

If smp_rmb() and smp_wmb() provided transitive ordering, then cpu2()
would see cpu0()'s ordering.  But they do not, so the ordering is
visible at best to the adjacent CPU.  This means that the final value
of y can be 2, while at the same time r1==1 && r2==0.

Now the full barrier, smp_mb(), does provide transitive ordering,
so if the three barriers in the above example are replaced with
smp_mb() the y==2 && r1==1 && r2==0 outcome will be prohibited.

So smp_mb() provides transitivity, as do pairs of smp_store_release()
and smp_read_acquire(), as do RCU grace periods.  The exact interactions
between transitive and non-transitive ordering is a work in progress.
That said, if a series of transitive segments ends in a write, which
connects to a single non-transitive segment starting with a read,
you should be good.  And in fact in the example above, you can replace
the smp_wmb()s with smp_mb() and leave the smp_rmb() and still
prohibit the "cyclic" outcome.

If you want a more formal definition, I must refer you back to the
ppcmem and herd references.

Does that help?

							Thanx, Paul
Paul E. McKenney Jan. 14, 2016, 9:34 p.m. UTC | #32
On Thu, Jan 14, 2016 at 12:46:43PM -0800, Leonid Yegoshin wrote:
> On 01/14/2016 12:15 PM, Peter Zijlstra wrote:
> >On Thu, Jan 14, 2016 at 11:42:02AM -0800, Leonid Yegoshin wrote:
> >>An the only point - please use an appropriate SYNC_* barriers instead of
> >>heavy bold hammer. That stuff was design explicitly to support the
> >>requirements of Documentation/memory-barriers.txt
> >That's madness. That document changes from version to version as to what
> >we _think_ the actual hardware does. It is _NOT_ a specification.
> >
> >You cannot design hardware from that. Its incomplete and fails to
> >specify a bunch of things. It not a mathematically sound definition of a
> >memory model.
> >
> >Please stop referring to that document for what a particular barrier
> >_should_ do.  Explain what MIPS does, so we can attempt to integrate
> >this knowledge with our knowledge of PPC/ARM/Alpha/x86/etc. and improve
> >upon our understanding of hardware and improve the Linux memory model.
> 
> I am afraid I can't help you here. It is very complicated stuff and
> a model is actually doesn't fit your assumptions about CPUs well
> without some simplifications which are based on what you want to
> have.
> 
> I say that SYNC_ACQUIRE/etc follows what you expect for smp_acquire
> etc (basing on that document). And at least two CPU models were
> tested with my patches (see it in LMO) for that last year and that
> instructions are implemented now in engineering kernel.
> 
> If you have something else in mind, you can ask me. But I prefer to
> do not deviate too much from Documentation/memory-barriers.txt, for
> exam - if it asks to have memory barrier somewhere, then I assume
> the code should have it, and please - don't ask me a test which
> violates the current version of document recommendations.
> 
> For a moment I don't see a significant changes in this document for
> MIPS Arch at least 1.5 year, and the only significant point is that
> MIPS CPU Arch doesn't have yet smp_read_barrier_depends() and
> smp_rmb() should be used instead.

Is SYNC_ACQUIRE a memory-barrier instruction that orders prior loads
against later loads and stores?  If so, and if MIPS does not do
ordering based on address and data dependencies, I suggest making
read_barrier_depends() be a SYNC_ACQUIRE rather than SYNC_RMB.

							Thanx, Paul
Leonid Yegoshin Jan. 14, 2016, 9:36 p.m. UTC | #33
On 01/14/2016 01:29 PM, Paul E. McKenney wrote:
>
>> On 01/14/2016 12:34 PM, Paul E. McKenney wrote:
>>>
>>> The WRC+addr+addr is OK because data dependencies are not required to be
>>> transitive, in other words, they are not required to flow from one CPU to
>>> another without the help of an explicit memory barrier.
>> I don't see any reliable way to fit WRC+addr+addr into "DATA
>> DEPENDENCY BARRIERS" section recommendation to have data dependency
>> barrier between read of a shared pointer/index and read the shared
>> data based on that pointer. If you have this two reads, it doesn't
>> matter the rest of scenario, you should put the dependency barrier
>> in code anyway. If you don't do it in WRC+addr+addr scenario then
>> after years it can be easily changed to different scenario which
>> fits some of scenario in "DATA DEPENDENCY BARRIERS" section and
>> fails.
> The trick is that lockless_dereference() contains an
> smp_read_barrier_depends():
>
> #define lockless_dereference(p) \
> ({ \
> 	typeof(p) _________p1 = READ_ONCE(p); \
> 	smp_read_barrier_depends(); /* Dependency order vs. p above. */ \
> 	(_________p1); \
> })
>
> Or am I missing your point?

WRC+addr+addr has no any barrier. lockless_dereference() has a barrier. 
I don't see a common points between this and that in your answer, sorry.

- Leonid.
Leonid Yegoshin Jan. 14, 2016, 9:45 p.m. UTC | #34
On 01/14/2016 01:34 PM, Paul E. McKenney wrote:
> On Thu, Jan 14, 2016 at 12:46:43PM -0800, Leonid Yegoshin wrote:
>> On 01/14/2016 12:15 PM, Peter Zijlstra wrote:
>>> On Thu, Jan 14, 2016 at 11:42:02AM -0800, Leonid Yegoshin wrote:
>>>> An the only point - please use an appropriate SYNC_* barriers instead of
>>>> heavy bold hammer. That stuff was design explicitly to support the
>>>> requirements of Documentation/memory-barriers.txt
>>> That's madness. That document changes from version to version as to what
>>> we _think_ the actual hardware does. It is _NOT_ a specification.
>>>
>>> You cannot design hardware from that. Its incomplete and fails to
>>> specify a bunch of things. It not a mathematically sound definition of a
>>> memory model.
>>>
>>> Please stop referring to that document for what a particular barrier
>>> _should_ do.  Explain what MIPS does, so we can attempt to integrate
>>> this knowledge with our knowledge of PPC/ARM/Alpha/x86/etc. and improve
>>> upon our understanding of hardware and improve the Linux memory model.
>> I am afraid I can't help you here. It is very complicated stuff and
>> a model is actually doesn't fit your assumptions about CPUs well
>> without some simplifications which are based on what you want to
>> have.
>>
>> I say that SYNC_ACQUIRE/etc follows what you expect for smp_acquire
>> etc (basing on that document). And at least two CPU models were
>> tested with my patches (see it in LMO) for that last year and that
>> instructions are implemented now in engineering kernel.
>>
>> If you have something else in mind, you can ask me. But I prefer to
>> do not deviate too much from Documentation/memory-barriers.txt, for
>> exam - if it asks to have memory barrier somewhere, then I assume
>> the code should have it, and please - don't ask me a test which
>> violates the current version of document recommendations.
>>
>> For a moment I don't see a significant changes in this document for
>> MIPS Arch at least 1.5 year, and the only significant point is that
>> MIPS CPU Arch doesn't have yet smp_read_barrier_depends() and
>> smp_rmb() should be used instead.
> Is SYNC_ACQUIRE a memory-barrier instruction that orders prior loads
> against later loads and stores?

Yes, it is in MD00087 (table 6.6 of document Ver 6.04) - 
https://imgtec.com/?do-download=4302

>    If so, and if MIPS does not do
> ordering based on address and data dependencies, I suggest making
> read_barrier_depends() be a SYNC_ACQUIRE rather than SYNC_RMB.

I understood that, after I see the example of using it.
Please consider to add that into Documentation/memory-barriers.txt (it 
is not easy to find that this barrier is used for shared WRITE basing on 
shared pointer), it would be helpful.

- Leonid.
Paul E. McKenney Jan. 14, 2016, 10:24 p.m. UTC | #35
On Thu, Jan 14, 2016 at 01:45:44PM -0800, Leonid Yegoshin wrote:
> On 01/14/2016 01:34 PM, Paul E. McKenney wrote:
> >On Thu, Jan 14, 2016 at 12:46:43PM -0800, Leonid Yegoshin wrote:
> >>On 01/14/2016 12:15 PM, Peter Zijlstra wrote:
> >>>On Thu, Jan 14, 2016 at 11:42:02AM -0800, Leonid Yegoshin wrote:
> >>>>An the only point - please use an appropriate SYNC_* barriers instead of
> >>>>heavy bold hammer. That stuff was design explicitly to support the
> >>>>requirements of Documentation/memory-barriers.txt
> >>>That's madness. That document changes from version to version as to what
> >>>we _think_ the actual hardware does. It is _NOT_ a specification.
> >>>
> >>>You cannot design hardware from that. Its incomplete and fails to
> >>>specify a bunch of things. It not a mathematically sound definition of a
> >>>memory model.
> >>>
> >>>Please stop referring to that document for what a particular barrier
> >>>_should_ do.  Explain what MIPS does, so we can attempt to integrate
> >>>this knowledge with our knowledge of PPC/ARM/Alpha/x86/etc. and improve
> >>>upon our understanding of hardware and improve the Linux memory model.
> >>I am afraid I can't help you here. It is very complicated stuff and
> >>a model is actually doesn't fit your assumptions about CPUs well
> >>without some simplifications which are based on what you want to
> >>have.
> >>
> >>I say that SYNC_ACQUIRE/etc follows what you expect for smp_acquire
> >>etc (basing on that document). And at least two CPU models were
> >>tested with my patches (see it in LMO) for that last year and that
> >>instructions are implemented now in engineering kernel.
> >>
> >>If you have something else in mind, you can ask me. But I prefer to
> >>do not deviate too much from Documentation/memory-barriers.txt, for
> >>exam - if it asks to have memory barrier somewhere, then I assume
> >>the code should have it, and please - don't ask me a test which
> >>violates the current version of document recommendations.
> >>
> >>For a moment I don't see a significant changes in this document for
> >>MIPS Arch at least 1.5 year, and the only significant point is that
> >>MIPS CPU Arch doesn't have yet smp_read_barrier_depends() and
> >>smp_rmb() should be used instead.
> 
> >Is SYNC_ACQUIRE a memory-barrier instruction that orders prior loads
> >against later loads and stores?
> 
> Yes, it is in MD00087 (table 6.6 of document Ver 6.04) -
> https://imgtec.com/?do-download=4302

OK, it does look like it should work.  Of course, if you can rely
on straight address/data dependencies, that would be even better.

> >   If so, and if MIPS does not do
> >ordering based on address and data dependencies, I suggest making
> >read_barrier_depends() be a SYNC_ACQUIRE rather than SYNC_RMB.
> 
> I understood that, after I see the example of using it.
> Please consider to add that into Documentation/memory-barriers.txt
> (it is not easy to find that this barrier is used for shared WRITE
> basing on shared pointer), it would be helpful.

Actually, the Linux kernel doesn't have an acquire barrier, just an
smp_load_acquire().  Or did someone sneak one in while I wasn't looking?  ;-)

							Thanx, Paul
Paul E. McKenney Jan. 14, 2016, 10:55 p.m. UTC | #36
On Thu, Jan 14, 2016 at 01:36:50PM -0800, Leonid Yegoshin wrote:
> On 01/14/2016 01:29 PM, Paul E. McKenney wrote:
> >
> >>On 01/14/2016 12:34 PM, Paul E. McKenney wrote:
> >>>
> >>>The WRC+addr+addr is OK because data dependencies are not required to be
> >>>transitive, in other words, they are not required to flow from one CPU to
> >>>another without the help of an explicit memory barrier.
> >>I don't see any reliable way to fit WRC+addr+addr into "DATA
> >>DEPENDENCY BARRIERS" section recommendation to have data dependency
> >>barrier between read of a shared pointer/index and read the shared
> >>data based on that pointer. If you have this two reads, it doesn't
> >>matter the rest of scenario, you should put the dependency barrier
> >>in code anyway. If you don't do it in WRC+addr+addr scenario then
> >>after years it can be easily changed to different scenario which
> >>fits some of scenario in "DATA DEPENDENCY BARRIERS" section and
> >>fails.
> >The trick is that lockless_dereference() contains an
> >smp_read_barrier_depends():
> >
> >#define lockless_dereference(p) \
> >({ \
> >	typeof(p) _________p1 = READ_ONCE(p); \
> >	smp_read_barrier_depends(); /* Dependency order vs. p above. */ \
> >	(_________p1); \
> >})
> >
> >Or am I missing your point?
> 
> WRC+addr+addr has no any barrier. lockless_dereference() has a
> barrier. I don't see a common points between this and that in your
> answer, sorry.

Me, I am wondering what WRC+addr+addr has to do with anything at all.

<Going back through earlier email>

OK, so it looks like Will was asking not about WRC+addr+addr, but instead
about WRC+sync+addr.  This would drop an smp_mb() into cpu2() in my
earlier example, which needs to provide ordering.

I am guessing that the manual's "Older instructions which must be globally
performed when the SYNC instruction completes" provides the equivalent
of ARM/Power A-cumulativity, which can be thought of as transitivity
backwards in time.  This leads me to believe that your smp_mb() needs
to use SYNC rather than SYNC_MB, as was the subject of earlier spirited
discussion in this thread.

Suppose you have something like this:

	void cpu0(void)
	{
		WRITE_ONCE(a, 1);
		SYNC_MB();
		r0 = READ_ONCE(b);
	}

	void cpu1(void)
	{
		WRITE_ONCE(b, 1);
		SYNC_MB();
		r1 = READ_ONCE(c);
	}

	void cpu2(void)
	{
		WRITE_ONCE(c, 1);
		SYNC_MB();
		r2 = READ_ONCE(d);
	}

	void cpu3(void)
	{
		WRITE_ONCE(d, 1);
		SYNC_MB();
		r3 = READ_ONCE(a);
	}

Does your hardware guarantee that it is not possible for all of r0,
r1, r2, and r3 to be equal to zero at the end of the test, assuming
that a, b, c, and d are all initially zero, and the four functions
above run concurrently?  There are many similar litmus tests for other
combinations of reads and writes, but this is perhaps the nastiest from
a hardware viewpoint.  Does SYNC_MB() provide sufficient ordering for
this sort of situation?

Another (more academic) case is this one, with x and y initially zero:

	void cpu0(void)
	{
		WRITE_ONCE(x, 1);
	}

	void cpu1(void)
	{
		WRITE_ONCE(y, 1);
	}

	void cpu2(void)
	{
		r1 = READ_ONCE(x, 1);
		SYNC_MB();
		r2 = READ_ONCE(y, 1);
	}

	void cpu3(void)
	{
		r3 = READ_ONCE(y, 1);
		SYNC_MB();
		r4 = READ_ONCE(x, 1);
	}

Does SYNC_MB() prohibit r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0?

Now, I don't know of any specific use cases for this pattern, but it
is greatly beloved of some of the old-school concurrency community,
so it is likely to crop up at some point, despite my best efforts.  :-/

							Thanx, Paul
Leonid Yegoshin Jan. 14, 2016, 11:04 p.m. UTC | #37
On 01/14/2016 02:24 PM, Paul E. McKenney wrote:
> Actually, the Linux kernel doesn't have an acquire barrier, just an 
> smp_load_acquire(). Or did someone sneak one in while I wasn't looking?
That was an exactly starting point for this discussion. This patch just 
pulls out from MIPS files smp_load_acquire() and smp_store_release(). 
However, I put into LMO half year ago the patch 
http://patchwork.linux-mips.org/patch/10506/ which replaces a generic 
smp_mb with MIPS specific smp_release/acquire in that functions. This 
patch also fixes use of SYNCs barriers in spin_locks/atomics/bitops for 
Imagination MIPS CPUs too - it is just absent now for any Imagination 
MIPS CPUs!

Michael later pointed me that it can be returned back with his series of 
patches but discussion was already here.

- Leonid.
Leonid Yegoshin Jan. 14, 2016, 11:33 p.m. UTC | #38
On 01/14/2016 02:55 PM, Paul E. McKenney wrote:
> OK, so it looks like Will was asking not about WRC+addr+addr, but instead
> about WRC+sync+addr.
(He actually asked twice about this and that too but skip this)

> I am guessing that the manual's "Older instructions which must be globally
> performed when the SYNC instruction completes" provides the equivalent
> of ARM/Power A-cumulativity, which can be thought of as transitivity
> backwards in time.  This leads me to believe that your smp_mb() needs
> to use SYNC rather than SYNC_MB, as was the subject of earlier spirited
> discussion in this thread.

Don't be fooled here by words "ordered" and "completed" - it is HW 
design items and actually written poorly.
Just assume that SYNC_MB is absolutely the same as SYNC for any CPU and 
coherent device (besides performance). The difference can be in 
non-coherent devices because SYNC actually tries to make a barrier for 
them too. In some SoCs it is just the same because there is no need to 
barrier a non-coherent device (device register access usually strictly 
ordered... if there is no bridge in between).

>
> Suppose you have something like this:
> ...
> Does your hardware guarantee that it is not possible for all of r0,
> r1, r2, and r3 to be equal to zero at the end of the test, assuming
> that a, b, c, and d are all initially zero, and the four functions
> above run concurrently?

It is assumed to be so from Arch point of view. HW bugs are possible, of 
course.

> Another (more academic) case is this one, with x and y initially zero:
>
> ...
> Does SYNC_MB() prohibit r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0?

It is assumed to be so from Arch point of view. HW bugs are possible, of 
course.

Note: I am not sure about ANY past MIPS R2 CPU because that stuff is 
implemented some time but nobody made it in Linux kernel (it was used by 
some vendor for non-Linux system). For that reason my patch for 
lightweight SYNCs has an option - implement it or implement a generic 
SYNC. It is possible that some vendor did it in different way but nobody 
knows or test it. But as a minimum - SYNC must be implemented in 
spinlocks/atomics/bitops, in recent P5600 it is proven that read can 
pass write in atomics.

MIPS R6 is a different story, I verified lightweight SYNCs from the 
beginning and it also should use SYNCs.

- Leonid.
Paul E. McKenney Jan. 15, 2016, 12:47 a.m. UTC | #39
On Thu, Jan 14, 2016 at 03:33:40PM -0800, Leonid Yegoshin wrote:
> On 01/14/2016 02:55 PM, Paul E. McKenney wrote:
> >OK, so it looks like Will was asking not about WRC+addr+addr, but instead
> >about WRC+sync+addr.
> (He actually asked twice about this and that too but skip this)

Fair enough!  ;-)

> >I am guessing that the manual's "Older instructions which must be globally
> >performed when the SYNC instruction completes" provides the equivalent
> >of ARM/Power A-cumulativity, which can be thought of as transitivity
> >backwards in time.  This leads me to believe that your smp_mb() needs
> >to use SYNC rather than SYNC_MB, as was the subject of earlier spirited
> >discussion in this thread.
> 
> Don't be fooled here by words "ordered" and "completed" - it is HW
> design items and actually written poorly.
> Just assume that SYNC_MB is absolutely the same as SYNC for any CPU
> and coherent device (besides performance). The difference can be in
> non-coherent devices because SYNC actually tries to make a barrier
> for them too. In some SoCs it is just the same because there is no
> need to barrier a non-coherent device (device register access
> usually strictly ordered... if there is no bridge in between).

So smp_mb() can be SYNC_MB.  However, mb() needs to be SYNC for MMIO
purposes, correct?

> >Suppose you have something like this:
> >...
> >Does your hardware guarantee that it is not possible for all of r0,
> >r1, r2, and r3 to be equal to zero at the end of the test, assuming
> >that a, b, c, and d are all initially zero, and the four functions
> >above run concurrently?
> 
> It is assumed to be so from Arch point of view. HW bugs are
> possible, of course.

Indeed!

> >Another (more academic) case is this one, with x and y initially zero:
> >
> >...
> >Does SYNC_MB() prohibit r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0?
> 
> It is assumed to be so from Arch point of view. HW bugs are
> possible, of course.

Looks to me like smp_mb() can be SYNC_MB, then.

> Note: I am not sure about ANY past MIPS R2 CPU because that stuff is
> implemented some time but nobody made it in Linux kernel (it was
> used by some vendor for non-Linux system). For that reason my patch
> for lightweight SYNCs has an option - implement it or implement a
> generic SYNC. It is possible that some vendor did it in different
> way but nobody knows or test it. But as a minimum - SYNC must be
> implemented in spinlocks/atomics/bitops, in recent P5600 it is
> proven that read can pass write in atomics.
> 
> MIPS R6 is a different story, I verified lightweight SYNCs from the
> beginning and it also should use SYNCs.

So you need to build a different kernel for some types of MIPS systems?
Or do you do boot-time rewriting, like a number of other arches do?

							Thanx, Paul
Leonid Yegoshin Jan. 15, 2016, 1:07 a.m. UTC | #40
On 01/14/2016 04:47 PM, Paul E. McKenney wrote:
> On Thu, Jan 14, 2016 at 03:33:40PM -0800, Leonid Yegoshin wrote:
>> Don't be fooled here by words "ordered" and "completed" - it is HW
>> design items and actually written poorly.
>> Just assume that SYNC_MB is absolutely the same as SYNC for any CPU
>> and coherent device (besides performance). The difference can be in
>> non-coherent devices because SYNC actually tries to make a barrier
>> for them too. In some SoCs it is just the same because there is no
>> need to barrier a non-coherent device (device register access
>> usually strictly ordered... if there is no bridge in between).
> So smp_mb() can be SYNC_MB.  However, mb() needs to be SYNC for MMIO
> purposes, correct?

Absolutely. For MIPS R2 which is not Octeon.

>> Note: I am not sure about ANY past MIPS R2 CPU because that stuff is
>> implemented some time but nobody made it in Linux kernel (it was
>> used by some vendor for non-Linux system). For that reason my patch
>> for lightweight SYNCs has an option - implement it or implement a
>> generic SYNC. It is possible that some vendor did it in different
>> way but nobody knows or test it. But as a minimum - SYNC must be
>> implemented in spinlocks/atomics/bitops, in recent P5600 it is
>> proven that read can pass write in atomics.
>>
>> MIPS R6 is a different story, I verified lightweight SYNCs from the
>> beginning and it also should use SYNCs.
> So you need to build a different kernel for some types of MIPS systems?
> Or do you do boot-time rewriting, like a number of other arches do?

I don't know. I would like to have responses. Ralf asked Maciej about 
old systems and that came nowhere. Even rewrite - don't know what to do 
with that: no lightweight SYNC or no SYNC at all - yes, it is still 
possible that SYNC on some systems can be too heavy or even harmful, 
nobody tested that.

- Leonid.
Peter Zijlstra Jan. 15, 2016, 8:55 a.m. UTC | #41
On Thu, Jan 14, 2016 at 01:29:13PM -0800, Paul E. McKenney wrote:
> So smp_mb() provides transitivity, as do pairs of smp_store_release()
> and smp_read_acquire(), 

But they provide different grades of transitivity, which is where all
the confusion lays.

smp_mb() is strongly/globally transitive, all CPUs will agree on the order.

Whereas the RCpc release+acquire is weakly so, only the two cpus
involved in the handover will agree on the order.
Peter Zijlstra Jan. 15, 2016, 9:13 a.m. UTC | #42
On Fri, Jan 15, 2016 at 09:55:54AM +0100, Peter Zijlstra wrote:
> On Thu, Jan 14, 2016 at 01:29:13PM -0800, Paul E. McKenney wrote:
> > So smp_mb() provides transitivity, as do pairs of smp_store_release()
> > and smp_read_acquire(), 
> 
> But they provide different grades of transitivity, which is where all
> the confusion lays.
> 
> smp_mb() is strongly/globally transitive, all CPUs will agree on the order.
> 
> Whereas the RCpc release+acquire is weakly so, only the two cpus
> involved in the handover will agree on the order.

And the stuff we're confused about is how best to express the difference
and guarantees of these two forms of transitivity and how exactly they
interact.

And smp_load_acquire()/smp_store_release() are RCpc because TSO archs
and PPC. the atomic*_{acquire,release}() are RCpc because PPC and
LOCK,UNLOCK are similarly RCpc because of PPC.

Now we'd like PPC to stick a SYNC in either LOCK or UNLOCK so at least
the locks are RCsc again, but they resist for performance reasons but
waver because they don't want to be the ones finding all the nasty bugs
because they're the only one.

Now the thing I worry about, and still have not had an answer to is if
weakly ordered MIPS will end up being RCsc or RCpc for their locks if
they get implemented with SYNC_ACQUIRE and SYNC_RELEASE instead of the
current SYNC.
Will Deacon Jan. 15, 2016, 10:24 a.m. UTC | #43
On Thu, Jan 14, 2016 at 02:55:10PM -0800, Paul E. McKenney wrote:
> On Thu, Jan 14, 2016 at 01:36:50PM -0800, Leonid Yegoshin wrote:
> > On 01/14/2016 01:29 PM, Paul E. McKenney wrote:
> > >
> > >>On 01/14/2016 12:34 PM, Paul E. McKenney wrote:
> > >>>
> > >>>The WRC+addr+addr is OK because data dependencies are not required to be
> > >>>transitive, in other words, they are not required to flow from one CPU to
> > >>>another without the help of an explicit memory barrier.
> > >>I don't see any reliable way to fit WRC+addr+addr into "DATA
> > >>DEPENDENCY BARRIERS" section recommendation to have data dependency
> > >>barrier between read of a shared pointer/index and read the shared
> > >>data based on that pointer. If you have this two reads, it doesn't
> > >>matter the rest of scenario, you should put the dependency barrier
> > >>in code anyway. If you don't do it in WRC+addr+addr scenario then
> > >>after years it can be easily changed to different scenario which
> > >>fits some of scenario in "DATA DEPENDENCY BARRIERS" section and
> > >>fails.
> > >The trick is that lockless_dereference() contains an
> > >smp_read_barrier_depends():
> > >
> > >#define lockless_dereference(p) \
> > >({ \
> > >	typeof(p) _________p1 = READ_ONCE(p); \
> > >	smp_read_barrier_depends(); /* Dependency order vs. p above. */ \
> > >	(_________p1); \
> > >})
> > >
> > >Or am I missing your point?
> > 
> > WRC+addr+addr has no any barrier. lockless_dereference() has a
> > barrier. I don't see a common points between this and that in your
> > answer, sorry.
> 
> Me, I am wondering what WRC+addr+addr has to do with anything at all.

See my earlier reply [1] (but also, your WRC Linux example looks more
like a variant on WWC and I couldn't really follow it).

> <Going back through earlier email>
> 
> OK, so it looks like Will was asking not about WRC+addr+addr, but instead
> about WRC+sync+addr.  This would drop an smp_mb() into cpu2() in my
> earlier example, which needs to provide ordering.
> 
> I am guessing that the manual's "Older instructions which must be globally
> performed when the SYNC instruction completes" provides the equivalent
> of ARM/Power A-cumulativity, which can be thought of as transitivity
> backwards in time. 

I couldn't make that leap. In particular, the manual's "Detailed
Description" sections explicitly refer to program-order:

  Every synchronizable specified memory instruction (loads or stores or
  both) that occurs in the instruction stream before the SYNC
  instruction must reach a stage in the load/store datapath after which
  no instruction re-ordering is possible before any synchronizable
  specified memory instruction which occurs after the SYNC instruction
  in the instruction stream reaches the same stage in the load/store
  datapath.

Will

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2016-January/399765.html
Paul E. McKenney Jan. 15, 2016, 5:46 p.m. UTC | #44
On Fri, Jan 15, 2016 at 10:13:48AM +0100, Peter Zijlstra wrote:
> On Fri, Jan 15, 2016 at 09:55:54AM +0100, Peter Zijlstra wrote:
> > On Thu, Jan 14, 2016 at 01:29:13PM -0800, Paul E. McKenney wrote:
> > > So smp_mb() provides transitivity, as do pairs of smp_store_release()
> > > and smp_read_acquire(), 
> > 
> > But they provide different grades of transitivity, which is where all
> > the confusion lays.
> > 
> > smp_mb() is strongly/globally transitive, all CPUs will agree on the order.
> > 
> > Whereas the RCpc release+acquire is weakly so, only the two cpus
> > involved in the handover will agree on the order.
> 
> And the stuff we're confused about is how best to express the difference
> and guarantees of these two forms of transitivity and how exactly they
> interact.

Hoping my memory-barrier.txt patch helps here...

> And smp_load_acquire()/smp_store_release() are RCpc because TSO archs
> and PPC. the atomic*_{acquire,release}() are RCpc because PPC and
> LOCK,UNLOCK are similarly RCpc because of PPC.
> 
> Now we'd like PPC to stick a SYNC in either LOCK or UNLOCK so at least
> the locks are RCsc again, but they resist for performance reasons but
> waver because they don't want to be the ones finding all the nasty bugs
> because they're the only one.

I believe that the relevant proverb said something about starving to
death between two bales of hay...  ;-)

> Now the thing I worry about, and still have not had an answer to is if
> weakly ordered MIPS will end up being RCsc or RCpc for their locks if
> they get implemented with SYNC_ACQUIRE and SYNC_RELEASE instead of the
> current SYNC.

It would be good to have better clarity on this, no two ways about it.

							Thanx, Paul
Paul E. McKenney Jan. 15, 2016, 5:54 p.m. UTC | #45
On Fri, Jan 15, 2016 at 10:24:32AM +0000, Will Deacon wrote:
> On Thu, Jan 14, 2016 at 02:55:10PM -0800, Paul E. McKenney wrote:
> > On Thu, Jan 14, 2016 at 01:36:50PM -0800, Leonid Yegoshin wrote:
> > > On 01/14/2016 01:29 PM, Paul E. McKenney wrote:
> > > >
> > > >>On 01/14/2016 12:34 PM, Paul E. McKenney wrote:
> > > >>>
> > > >>>The WRC+addr+addr is OK because data dependencies are not required to be
> > > >>>transitive, in other words, they are not required to flow from one CPU to
> > > >>>another without the help of an explicit memory barrier.
> > > >>I don't see any reliable way to fit WRC+addr+addr into "DATA
> > > >>DEPENDENCY BARRIERS" section recommendation to have data dependency
> > > >>barrier between read of a shared pointer/index and read the shared
> > > >>data based on that pointer. If you have this two reads, it doesn't
> > > >>matter the rest of scenario, you should put the dependency barrier
> > > >>in code anyway. If you don't do it in WRC+addr+addr scenario then
> > > >>after years it can be easily changed to different scenario which
> > > >>fits some of scenario in "DATA DEPENDENCY BARRIERS" section and
> > > >>fails.
> > > >The trick is that lockless_dereference() contains an
> > > >smp_read_barrier_depends():
> > > >
> > > >#define lockless_dereference(p) \
> > > >({ \
> > > >	typeof(p) _________p1 = READ_ONCE(p); \
> > > >	smp_read_barrier_depends(); /* Dependency order vs. p above. */ \
> > > >	(_________p1); \
> > > >})
> > > >
> > > >Or am I missing your point?
> > > 
> > > WRC+addr+addr has no any barrier. lockless_dereference() has a
> > > barrier. I don't see a common points between this and that in your
> > > answer, sorry.
> > 
> > Me, I am wondering what WRC+addr+addr has to do with anything at all.
> 
> See my earlier reply [1] (but also, your WRC Linux example looks more
> like a variant on WWC and I couldn't really follow it).

I will revisit my WRC Linux example.  And yes, creating litmus tests
that use non-fake dependencies is still a bit of an undertaking.  :-/
I am sure that it will seem more natural with time and experience...

> > <Going back through earlier email>
> > 
> > OK, so it looks like Will was asking not about WRC+addr+addr, but instead
> > about WRC+sync+addr.  This would drop an smp_mb() into cpu2() in my
> > earlier example, which needs to provide ordering.
> > 
> > I am guessing that the manual's "Older instructions which must be globally
> > performed when the SYNC instruction completes" provides the equivalent
> > of ARM/Power A-cumulativity, which can be thought of as transitivity
> > backwards in time. 
> 
> I couldn't make that leap. In particular, the manual's "Detailed
> Description" sections explicitly refer to program-order:
> 
>   Every synchronizable specified memory instruction (loads or stores or
>   both) that occurs in the instruction stream before the SYNC
>   instruction must reach a stage in the load/store datapath after which
>   no instruction re-ordering is possible before any synchronizable
>   specified memory instruction which occurs after the SYNC instruction
>   in the instruction stream reaches the same stage in the load/store
>   datapath.
> 
> Will
> 
> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2016-January/399765.html

All good points.  I think we all agree that the MIPS documentation could
use significant help.  And given that I work for the company that produced
the analogous documentation for PowerPC, that is saying something.  ;-)

We simply can't know if MIPS's memory ordering is sufficient for the
Linux kernel given its current implementation of the ordering primitives
and its current documentation.

I feel a bit better than I did earlier due to Leonid's response to my
earlier litmus-test examples.  But I do recommend some serious stress
testing of MIPS on a good set of litmus tests.  Much nicer finding issues
that way than as random irreproducible strange behavior!

							Thanx, Paul
Paul E. McKenney Jan. 15, 2016, 7:28 p.m. UTC | #46
On Fri, Jan 15, 2016 at 09:54:01AM -0800, Paul E. McKenney wrote:
> On Fri, Jan 15, 2016 at 10:24:32AM +0000, Will Deacon wrote:
> > On Thu, Jan 14, 2016 at 02:55:10PM -0800, Paul E. McKenney wrote:
> > > On Thu, Jan 14, 2016 at 01:36:50PM -0800, Leonid Yegoshin wrote:
> > > > On 01/14/2016 01:29 PM, Paul E. McKenney wrote:
> > > > >
> > > > >>On 01/14/2016 12:34 PM, Paul E. McKenney wrote:
> > > > >>>
> > > > >>>The WRC+addr+addr is OK because data dependencies are not required to be
> > > > >>>transitive, in other words, they are not required to flow from one CPU to
> > > > >>>another without the help of an explicit memory barrier.
> > > > >>I don't see any reliable way to fit WRC+addr+addr into "DATA
> > > > >>DEPENDENCY BARRIERS" section recommendation to have data dependency
> > > > >>barrier between read of a shared pointer/index and read the shared
> > > > >>data based on that pointer. If you have this two reads, it doesn't
> > > > >>matter the rest of scenario, you should put the dependency barrier
> > > > >>in code anyway. If you don't do it in WRC+addr+addr scenario then
> > > > >>after years it can be easily changed to different scenario which
> > > > >>fits some of scenario in "DATA DEPENDENCY BARRIERS" section and
> > > > >>fails.
> > > > >The trick is that lockless_dereference() contains an
> > > > >smp_read_barrier_depends():
> > > > >
> > > > >#define lockless_dereference(p) \
> > > > >({ \
> > > > >	typeof(p) _________p1 = READ_ONCE(p); \
> > > > >	smp_read_barrier_depends(); /* Dependency order vs. p above. */ \
> > > > >	(_________p1); \
> > > > >})
> > > > >
> > > > >Or am I missing your point?
> > > > 
> > > > WRC+addr+addr has no any barrier. lockless_dereference() has a
> > > > barrier. I don't see a common points between this and that in your
> > > > answer, sorry.
> > > 
> > > Me, I am wondering what WRC+addr+addr has to do with anything at all.
> > 
> > See my earlier reply [1] (but also, your WRC Linux example looks more
> > like a variant on WWC and I couldn't really follow it).
> 
> I will revisit my WRC Linux example.  And yes, creating litmus tests
> that use non-fake dependencies is still a bit of an undertaking.  :-/
> I am sure that it will seem more natural with time and experience...

Hmmm...  You are quite right, I did do WWC.  I need to change cpu2()'s
last access from a store to a load to get WRC.  Plus the levels of
indirection definitely didn't match up, did they?

	struct foo {
		struct foo *next;
	};
	struct foo a;
	struct foo b;
	struct foo c = { &a };
	struct foo d = { &b };
	struct foo x = { &c };
	struct foo y = { &d };
	struct foo *r1, *r2, *r3;

	void cpu0(void)
	{
		WRITE_ONCE(x.next, &y);
	}

	void cpu1(void)
	{
		r1 = lockless_dereference(x.next);
		WRITE_ONCE(r1->next, &x);
	}

	void cpu2(void)
	{
		r2 = lockless_dereference(y.next);
		r3 = READ_ONCE(r2->next);
	}

In this case, it is legal to end the run with:

	r1 == &y && r2 == &x && r3 == &c

Please see below for a ppcmem litmus test.

So, did I get it right this time?  ;-)

							Thanx, Paul

PS.  And yes, working through this does help me understand the
     benefits of fake dependencies.  Why do you ask?  ;-)

------------------------------------------------------------------------

PPC WRCnf+addrs
""
{
0:r2=x; 0:r3=y;
1:r2=x; 1:r3=y;
2:r2=x; 2:r3=y;
c=a; d=b; x=c; y=d;
}
 P0           | P1            | P2            ;
 stw r3,0(r2) | lwz r8,0(r2)  | lwz r8,0(r3)  ;
              | stw r2,0(r3)  | lwz r9,0(r8)  ;
exists
(1:r8=y /\ 2:r8=x /\ 2:r9=c)
Peter Zijlstra Jan. 15, 2016, 9:27 p.m. UTC | #47
On Fri, Jan 15, 2016 at 09:46:12AM -0800, Paul E. McKenney wrote:
> On Fri, Jan 15, 2016 at 10:13:48AM +0100, Peter Zijlstra wrote:

> > And the stuff we're confused about is how best to express the difference
> > and guarantees of these two forms of transitivity and how exactly they
> > interact.
> 
> Hoping my memory-barrier.txt patch helps here...

Yes, that seems a good start. But yesterday you raised the 'fun' point
of two globally ordered sequences connected by a single local link.

And I think I'm still confused on LWSYNC (in the smp_wmb case) when one
of the stores looses a conflict, and if that scenario matters. If it
does, we should inspect the same case for other barriers.
Paul E. McKenney Jan. 15, 2016, 9:58 p.m. UTC | #48
On Fri, Jan 15, 2016 at 10:27:14PM +0100, Peter Zijlstra wrote:
> On Fri, Jan 15, 2016 at 09:46:12AM -0800, Paul E. McKenney wrote:
> > On Fri, Jan 15, 2016 at 10:13:48AM +0100, Peter Zijlstra wrote:
> 
> > > And the stuff we're confused about is how best to express the difference
> > > and guarantees of these two forms of transitivity and how exactly they
> > > interact.
> > 
> > Hoping my memory-barrier.txt patch helps here...
> 
> Yes, that seems a good start. But yesterday you raised the 'fun' point
> of two globally ordered sequences connected by a single local link.

The conclusion that I am slowly coming to is that litmus tests should
not be thought of as linear chains, but rather as cycles.  If you think
of it as a cycle, then it doesn't matter where the local link is, just
how many of them and how they are connected.

But I will admit that there are some rather strange litmus tests that
challenge this cycle-centric view, for example, the one shown below.
It turns out that herd and ppcmem disagree on the outcome.  (The Power
architects side with ppcmem.)

> And I think I'm still confused on LWSYNC (in the smp_wmb case) when one
> of the stores looses a conflict, and if that scenario matters. If it
> does, we should inspect the same case for other barriers.

Indeed.  I am still working on how these should be described.  My
current thought is to be quite conservative on what ordering is
actually respected, however, the current task is formalizing how
RCU plays with the rest of the memory model.

							Thanx, Paul

------------------------------------------------------------------------

PPC Overlapping Group-B sets version 4
""
(* When the Group-B sets from two different barriers involve instructions in
   the same thread, within that thread one set must contain the other.

	P0	P1	P2
	Rx=1	Wy=1	Wz=2
	dep.	lwsync	lwsync
	Ry=0	Wz=1	Wx=1
	Rz=1

	assert(!(z=2))

   Forbidden by ppcmem, allowed by herd.
*)
{
0:r1=x; 0:r2=y; 0:r3=z;
1:r1=x; 1:r2=y; 1:r3=z; 1:r4=1;
2:r1=x; 2:r2=y; 2:r3=z; 2:r4=1; 2:r5=2;
}
 P0		| P1		| P2		;
 lwz r6,0(r1)	| stw r4,0(r2)	| stw r5,0(r3)	;
 xor r7,r6,r6	| lwsync	| lwsync	;
 lwzx r7,r7,r2	| stw r4,0(r3)	| stw r4,0(r1)	;
 lwz r8,0(r3)	|		|		;

exists
(z=2 /\ 0:r6=1 /\ 0:r7=0 /\ 0:r8=1)
Herbert Xu Jan. 18, 2016, 8:19 a.m. UTC | #49
Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
>
> You could use SYNC_ACQUIRE() to implement read_barrier_depends() and
> smp_read_barrier_depends(), but SYNC_RMB probably does not suffice.
> The reason for this is that smp_read_barrier_depends() must order the
> pointer load against any subsequent read or write through a dereference
> of that pointer.  For example:
> 
>        p = READ_ONCE(gp);
>        smp_rmb();
>        r1 = p->a; /* ordered by smp_rmb(). */
>        p->b = 42; /* NOT ordered by smp_rmb(), BUG!!! */
>        r2 = x; /* ordered by smp_rmb(), but doesn't need to be. */
> 
> In contrast:
> 
>        p = READ_ONCE(gp);
>        smp_read_barrier_depends();
>        r1 = p->a; /* ordered by smp_read_barrier_depends(). */
>        p->b = 42; /* ordered by smp_read_barrier_depends(). */
>        r2 = x; /* not ordered by smp_read_barrier_depends(), which is OK. */
> 
> Again, if your hardware maintains local ordering for address
> and data dependencies, you can have read_barrier_depends() and
> smp_read_barrier_depends() be no-ops like they are for most
> architectures.
> 
> Does that help?

This is crazy! smp_rmb started out being strictly stronger than
smp_read_barrier_depends, when did this stop being the case?
Paul E. McKenney Jan. 18, 2016, 3:46 p.m. UTC | #50
On Mon, Jan 18, 2016 at 04:19:29PM +0800, Herbert Xu wrote:
> Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> >
> > You could use SYNC_ACQUIRE() to implement read_barrier_depends() and
> > smp_read_barrier_depends(), but SYNC_RMB probably does not suffice.
> > The reason for this is that smp_read_barrier_depends() must order the
> > pointer load against any subsequent read or write through a dereference
> > of that pointer.  For example:
> > 
> >        p = READ_ONCE(gp);
> >        smp_rmb();
> >        r1 = p->a; /* ordered by smp_rmb(). */
> >        p->b = 42; /* NOT ordered by smp_rmb(), BUG!!! */
> >        r2 = x; /* ordered by smp_rmb(), but doesn't need to be. */
> > 
> > In contrast:
> > 
> >        p = READ_ONCE(gp);
> >        smp_read_barrier_depends();
> >        r1 = p->a; /* ordered by smp_read_barrier_depends(). */
> >        p->b = 42; /* ordered by smp_read_barrier_depends(). */
> >        r2 = x; /* not ordered by smp_read_barrier_depends(), which is OK. */
> > 
> > Again, if your hardware maintains local ordering for address
> > and data dependencies, you can have read_barrier_depends() and
> > smp_read_barrier_depends() be no-ops like they are for most
> > architectures.
> > 
> > Does that help?
> 
> This is crazy! smp_rmb started out being strictly stronger than
> smp_read_barrier_depends, when did this stop being the case?

Hello, Herbert!

It is true that most Linux kernel code relies only on the read-read
properties of dependencies, but the read-write properties are useful.
Admittedly relatively rarely, but useful.

The better comparison for smp_read_barrier_depends(), especially in
its rcu_dereference*() form, is smp_load_acquire().

							Thanx, Paul
Will Deacon Jan. 25, 2016, 2:41 p.m. UTC | #51
On Fri, Jan 15, 2016 at 11:28:45AM -0800, Paul E. McKenney wrote:
> On Fri, Jan 15, 2016 at 09:54:01AM -0800, Paul E. McKenney wrote:
> > On Fri, Jan 15, 2016 at 10:24:32AM +0000, Will Deacon wrote:
> > > See my earlier reply [1] (but also, your WRC Linux example looks more
> > > like a variant on WWC and I couldn't really follow it).
> > 
> > I will revisit my WRC Linux example.  And yes, creating litmus tests
> > that use non-fake dependencies is still a bit of an undertaking.  :-/
> > I am sure that it will seem more natural with time and experience...
> 
> Hmmm...  You are quite right, I did do WWC.  I need to change cpu2()'s
> last access from a store to a load to get WRC.  Plus the levels of
> indirection definitely didn't match up, did they?

Nope, it was pretty baffling!

> 	struct foo {
> 		struct foo *next;
> 	};
> 	struct foo a;
> 	struct foo b;
> 	struct foo c = { &a };
> 	struct foo d = { &b };
> 	struct foo x = { &c };
> 	struct foo y = { &d };
> 	struct foo *r1, *r2, *r3;
> 
> 	void cpu0(void)
> 	{
> 		WRITE_ONCE(x.next, &y);
> 	}
> 
> 	void cpu1(void)
> 	{
> 		r1 = lockless_dereference(x.next);
> 		WRITE_ONCE(r1->next, &x);
> 	}
> 
> 	void cpu2(void)
> 	{
> 		r2 = lockless_dereference(y.next);
> 		r3 = READ_ONCE(r2->next);
> 	}
> 
> In this case, it is legal to end the run with:
> 
> 	r1 == &y && r2 == &x && r3 == &c
> 
> Please see below for a ppcmem litmus test.
> 
> So, did I get it right this time?  ;-)

The code above looks correct to me (in that it matches WRC+addrs),
but your litmus test:

> PPC WRCnf+addrs
> ""
> {
> 0:r2=x; 0:r3=y;
> 1:r2=x; 1:r3=y;
> 2:r2=x; 2:r3=y;
> c=a; d=b; x=c; y=d;
> }
>  P0           | P1            | P2            ;
>  stw r3,0(r2) | lwz r8,0(r2)  | lwz r8,0(r3)  ;
>               | stw r2,0(r3)  | lwz r9,0(r8)  ;
> exists
> (1:r8=y /\ 2:r8=x /\ 2:r9=c)

Seems to be missing the address dependency on P1.

Will
Will Deacon Jan. 25, 2016, 4:42 p.m. UTC | #52
On Fri, Jan 15, 2016 at 01:58:53PM -0800, Paul E. McKenney wrote:
> On Fri, Jan 15, 2016 at 10:27:14PM +0100, Peter Zijlstra wrote:
> > On Fri, Jan 15, 2016 at 09:46:12AM -0800, Paul E. McKenney wrote:
> > > On Fri, Jan 15, 2016 at 10:13:48AM +0100, Peter Zijlstra wrote:
> > 
> > > > And the stuff we're confused about is how best to express the difference
> > > > and guarantees of these two forms of transitivity and how exactly they
> > > > interact.
> > > 
> > > Hoping my memory-barrier.txt patch helps here...
> > 
> > Yes, that seems a good start. But yesterday you raised the 'fun' point
> > of two globally ordered sequences connected by a single local link.
> 
> The conclusion that I am slowly coming to is that litmus tests should
> not be thought of as linear chains, but rather as cycles.  If you think
> of it as a cycle, then it doesn't matter where the local link is, just
> how many of them and how they are connected.

Do you have some examples of this? I'm struggling to make it work in my
mind, or are you talking specifically in the context of the kernel
memory model?

> But I will admit that there are some rather strange litmus tests that
> challenge this cycle-centric view, for example, the one shown below.
> It turns out that herd and ppcmem disagree on the outcome.  (The Power
> architects side with ppcmem.)
> 
> > And I think I'm still confused on LWSYNC (in the smp_wmb case) when one
> > of the stores looses a conflict, and if that scenario matters. If it
> > does, we should inspect the same case for other barriers.
> 
> Indeed.  I am still working on how these should be described.  My
> current thought is to be quite conservative on what ordering is
> actually respected, however, the current task is formalizing how
> RCU plays with the rest of the memory model.
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> PPC Overlapping Group-B sets version 4
> ""
> (* When the Group-B sets from two different barriers involve instructions in
>    the same thread, within that thread one set must contain the other.
> 
> 	P0	P1	P2
> 	Rx=1	Wy=1	Wz=2
> 	dep.	lwsync	lwsync
> 	Ry=0	Wz=1	Wx=1
> 	Rz=1
> 
> 	assert(!(z=2))
> 
>    Forbidden by ppcmem, allowed by herd.
> *)
> {
> 0:r1=x; 0:r2=y; 0:r3=z;
> 1:r1=x; 1:r2=y; 1:r3=z; 1:r4=1;
> 2:r1=x; 2:r2=y; 2:r3=z; 2:r4=1; 2:r5=2;
> }
>  P0		| P1		| P2		;
>  lwz r6,0(r1)	| stw r4,0(r2)	| stw r5,0(r3)	;
>  xor r7,r6,r6	| lwsync	| lwsync	;
>  lwzx r7,r7,r2	| stw r4,0(r3)	| stw r4,0(r1)	;
>  lwz r8,0(r3)	|		|		;
> 
> exists
> (z=2 /\ 0:r6=1 /\ 0:r7=0 /\ 0:r8=1)

That really hurts. Assuming that the "assert(!(z=2))" is actually there
to constrain the coherence order of z to be {0->1->2}, then I think that
this test is forbidden on arm using dmb instead of lwsync. That said, I
also don't think the Rz=1 in P0 changes anything.

The double negatives don't help here! (it is forbidden to guarantee that
z is not always 2).

Will
Paul E. McKenney Jan. 26, 2016, 1:06 a.m. UTC | #53
On Mon, Jan 25, 2016 at 02:41:34PM +0000, Will Deacon wrote:
> On Fri, Jan 15, 2016 at 11:28:45AM -0800, Paul E. McKenney wrote:
> > On Fri, Jan 15, 2016 at 09:54:01AM -0800, Paul E. McKenney wrote:
> > > On Fri, Jan 15, 2016 at 10:24:32AM +0000, Will Deacon wrote:
> > > > See my earlier reply [1] (but also, your WRC Linux example looks more
> > > > like a variant on WWC and I couldn't really follow it).
> > > 
> > > I will revisit my WRC Linux example.  And yes, creating litmus tests
> > > that use non-fake dependencies is still a bit of an undertaking.  :-/
> > > I am sure that it will seem more natural with time and experience...
> > 
> > Hmmm...  You are quite right, I did do WWC.  I need to change cpu2()'s
> > last access from a store to a load to get WRC.  Plus the levels of
> > indirection definitely didn't match up, did they?
> 
> Nope, it was pretty baffling!

"It is a service that I provide."  ;-)

> > 	struct foo {
> > 		struct foo *next;
> > 	};
> > 	struct foo a;
> > 	struct foo b;
> > 	struct foo c = { &a };
> > 	struct foo d = { &b };
> > 	struct foo x = { &c };
> > 	struct foo y = { &d };
> > 	struct foo *r1, *r2, *r3;
> > 
> > 	void cpu0(void)
> > 	{
> > 		WRITE_ONCE(x.next, &y);
> > 	}
> > 
> > 	void cpu1(void)
> > 	{
> > 		r1 = lockless_dereference(x.next);
> > 		WRITE_ONCE(r1->next, &x);
> > 	}
> > 
> > 	void cpu2(void)
> > 	{
> > 		r2 = lockless_dereference(y.next);
> > 		r3 = READ_ONCE(r2->next);
> > 	}
> > 
> > In this case, it is legal to end the run with:
> > 
> > 	r1 == &y && r2 == &x && r3 == &c
> > 
> > Please see below for a ppcmem litmus test.
> > 
> > So, did I get it right this time?  ;-)
> 
> The code above looks correct to me (in that it matches WRC+addrs),
> but your litmus test:
> 
> > PPC WRCnf+addrs
> > ""
> > {
> > 0:r2=x; 0:r3=y;
> > 1:r2=x; 1:r3=y;
> > 2:r2=x; 2:r3=y;
> > c=a; d=b; x=c; y=d;
> > }
> >  P0           | P1            | P2            ;
> >  stw r3,0(r2) | lwz r8,0(r2)  | lwz r8,0(r3)  ;
> >               | stw r2,0(r3)  | lwz r9,0(r8)  ;
> > exists
> > (1:r8=y /\ 2:r8=x /\ 2:r9=c)
> 
> Seems to be missing the address dependency on P1.

You are quite correct!  How about the following?

As before, both herd and ppcmem say that the cycle is allowed, as
expected, given non-transitive ordering.  To prohibit the cycle, P1
needs a suitable memory-barrier instruction.

							Thanx, Paul

------------------------------------------------------------------------

PPC WRCnf+addrs
""
{
0:r2=x; 0:r3=y;
1:r2=x; 1:r3=y;
2:r2=x; 2:r3=y;
c=a; d=b; x=c; y=d;
}
 P0           | P1            | P2            ;
 stw r3,0(r2) | lwz r8,0(r2)  | lwz r8,0(r3)  ;
              | stw r2,0(r8)  | lwz r9,0(r8)  ;
exists
(1:r8=y /\ 2:r8=x /\ 2:r9=c)
Paul E. McKenney Jan. 26, 2016, 6:03 a.m. UTC | #54
On Mon, Jan 25, 2016 at 04:42:43PM +0000, Will Deacon wrote:
> On Fri, Jan 15, 2016 at 01:58:53PM -0800, Paul E. McKenney wrote:
> > On Fri, Jan 15, 2016 at 10:27:14PM +0100, Peter Zijlstra wrote:
> > > On Fri, Jan 15, 2016 at 09:46:12AM -0800, Paul E. McKenney wrote:
> > > > On Fri, Jan 15, 2016 at 10:13:48AM +0100, Peter Zijlstra wrote:
> > > 
> > > > > And the stuff we're confused about is how best to express the difference
> > > > > and guarantees of these two forms of transitivity and how exactly they
> > > > > interact.
> > > > 
> > > > Hoping my memory-barrier.txt patch helps here...
> > > 
> > > Yes, that seems a good start. But yesterday you raised the 'fun' point
> > > of two globally ordered sequences connected by a single local link.
> > 
> > The conclusion that I am slowly coming to is that litmus tests should
> > not be thought of as linear chains, but rather as cycles.  If you think
> > of it as a cycle, then it doesn't matter where the local link is, just
> > how many of them and how they are connected.
> 
> Do you have some examples of this? I'm struggling to make it work in my
> mind, or are you talking specifically in the context of the kernel
> memory model?

Now that you mention it, maybe it would be best to keep the transitive
and non-transitive separate for the time being anyway.  Just because it
might be possible to deal with does not necessarily mean that we should
be encouraging it.  ;-)

> > But I will admit that there are some rather strange litmus tests that
> > challenge this cycle-centric view, for example, the one shown below.
> > It turns out that herd and ppcmem disagree on the outcome.  (The Power
> > architects side with ppcmem.)
> > 
> > > And I think I'm still confused on LWSYNC (in the smp_wmb case) when one
> > > of the stores looses a conflict, and if that scenario matters. If it
> > > does, we should inspect the same case for other barriers.
> > 
> > Indeed.  I am still working on how these should be described.  My
> > current thought is to be quite conservative on what ordering is
> > actually respected, however, the current task is formalizing how
> > RCU plays with the rest of the memory model.
> > 
> > 							Thanx, Paul
> > 
> > ------------------------------------------------------------------------
> > 
> > PPC Overlapping Group-B sets version 4
> > ""
> > (* When the Group-B sets from two different barriers involve instructions in
> >    the same thread, within that thread one set must contain the other.
> > 
> > 	P0	P1	P2
> > 	Rx=1	Wy=1	Wz=2
> > 	dep.	lwsync	lwsync
> > 	Ry=0	Wz=1	Wx=1
> > 	Rz=1
> > 
> > 	assert(!(z=2))
> > 
> >    Forbidden by ppcmem, allowed by herd.
> > *)
> > {
> > 0:r1=x; 0:r2=y; 0:r3=z;
> > 1:r1=x; 1:r2=y; 1:r3=z; 1:r4=1;
> > 2:r1=x; 2:r2=y; 2:r3=z; 2:r4=1; 2:r5=2;
> > }
> >  P0		| P1		| P2		;
> >  lwz r6,0(r1)	| stw r4,0(r2)	| stw r5,0(r3)	;
> >  xor r7,r6,r6	| lwsync	| lwsync	;
> >  lwzx r7,r7,r2	| stw r4,0(r3)	| stw r4,0(r1)	;
> >  lwz r8,0(r3)	|		|		;
> > 
> > exists
> > (z=2 /\ 0:r6=1 /\ 0:r7=0 /\ 0:r8=1)
> 
> That really hurts. Assuming that the "assert(!(z=2))" is actually there
> to constrain the coherence order of z to be {0->1->2}, then I think that
> this test is forbidden on arm using dmb instead of lwsync. That said, I
> also don't think the Rz=1 in P0 changes anything.

What about the smp_wmb() variant of dmb that orders only stores?

> The double negatives don't help here! (it is forbidden to guarantee that
> z is not always 2).

Yes, this is a weird one, and I don't know of any use of it.

							Thanx, Paul
Peter Zijlstra Jan. 26, 2016, 10:19 a.m. UTC | #55
On Mon, Jan 25, 2016 at 10:03:22PM -0800, Paul E. McKenney wrote:
> On Mon, Jan 25, 2016 at 04:42:43PM +0000, Will Deacon wrote:
> > On Fri, Jan 15, 2016 at 01:58:53PM -0800, Paul E. McKenney wrote:
> > > On Fri, Jan 15, 2016 at 10:27:14PM +0100, Peter Zijlstra wrote:

> > > > Yes, that seems a good start. But yesterday you raised the 'fun' point
> > > > of two globally ordered sequences connected by a single local link.
> > > 
> > > The conclusion that I am slowly coming to is that litmus tests should
> > > not be thought of as linear chains, but rather as cycles.  If you think
> > > of it as a cycle, then it doesn't matter where the local link is, just
> > > how many of them and how they are connected.
> > 
> > Do you have some examples of this? I'm struggling to make it work in my
> > mind, or are you talking specifically in the context of the kernel
> > memory model?
> 
> Now that you mention it, maybe it would be best to keep the transitive
> and non-transitive separate for the time being anyway.  Just because it
> might be possible to deal with does not necessarily mean that we should
> be encouraging it.  ;-)

So isn't smp_mb__after_unlock_lock() exactly such a scenario? And would
not someone trying to implement RCsc locks using locally transitive
RELEASE/ACQUIRE operations need exactly this stuff?

That is, I am afraid we need to cover the mix of local and global
transitive operations at least in overview.
Will Deacon Jan. 26, 2016, 12:10 p.m. UTC | #56
On Mon, Jan 25, 2016 at 05:06:46PM -0800, Paul E. McKenney wrote:
> On Mon, Jan 25, 2016 at 02:41:34PM +0000, Will Deacon wrote:
> > On Fri, Jan 15, 2016 at 11:28:45AM -0800, Paul E. McKenney wrote:
> > > On Fri, Jan 15, 2016 at 09:54:01AM -0800, Paul E. McKenney wrote:
> > > > On Fri, Jan 15, 2016 at 10:24:32AM +0000, Will Deacon wrote:
> > > > > See my earlier reply [1] (but also, your WRC Linux example looks more
> > > > > like a variant on WWC and I couldn't really follow it).
> > > > 
> > > > I will revisit my WRC Linux example.  And yes, creating litmus tests
> > > > that use non-fake dependencies is still a bit of an undertaking.  :-/
> > > > I am sure that it will seem more natural with time and experience...
> > > 
> > > Hmmm...  You are quite right, I did do WWC.  I need to change cpu2()'s
> > > last access from a store to a load to get WRC.  Plus the levels of
> > > indirection definitely didn't match up, did they?
> > 
> > Nope, it was pretty baffling!
> 
> "It is a service that I provide."  ;-)
> 
> > > 	struct foo {
> > > 		struct foo *next;
> > > 	};
> > > 	struct foo a;
> > > 	struct foo b;
> > > 	struct foo c = { &a };
> > > 	struct foo d = { &b };
> > > 	struct foo x = { &c };
> > > 	struct foo y = { &d };
> > > 	struct foo *r1, *r2, *r3;
> > > 
> > > 	void cpu0(void)
> > > 	{
> > > 		WRITE_ONCE(x.next, &y);
> > > 	}
> > > 
> > > 	void cpu1(void)
> > > 	{
> > > 		r1 = lockless_dereference(x.next);
> > > 		WRITE_ONCE(r1->next, &x);
> > > 	}
> > > 
> > > 	void cpu2(void)
> > > 	{
> > > 		r2 = lockless_dereference(y.next);
> > > 		r3 = READ_ONCE(r2->next);
> > > 	}
> > > 
> > > In this case, it is legal to end the run with:
> > > 
> > > 	r1 == &y && r2 == &x && r3 == &c
> > > 
> > > Please see below for a ppcmem litmus test.
> > > 
> > > So, did I get it right this time?  ;-)
> > 
> > The code above looks correct to me (in that it matches WRC+addrs),
> > but your litmus test:
> > 
> > > PPC WRCnf+addrs
> > > ""
> > > {
> > > 0:r2=x; 0:r3=y;
> > > 1:r2=x; 1:r3=y;
> > > 2:r2=x; 2:r3=y;
> > > c=a; d=b; x=c; y=d;
> > > }
> > >  P0           | P1            | P2            ;
> > >  stw r3,0(r2) | lwz r8,0(r2)  | lwz r8,0(r3)  ;
> > >               | stw r2,0(r3)  | lwz r9,0(r8)  ;
> > > exists
> > > (1:r8=y /\ 2:r8=x /\ 2:r9=c)
> > 
> > Seems to be missing the address dependency on P1.
> 
> You are quite correct!  How about the following?

I think that's it!

> As before, both herd and ppcmem say that the cycle is allowed, as
> expected, given non-transitive ordering.  To prohibit the cycle, P1
> needs a suitable memory-barrier instruction.
> 
> ------------------------------------------------------------------------
> 
> PPC WRCnf+addrs
> ""
> {
> 0:r2=x; 0:r3=y;
> 1:r2=x; 1:r3=y;
> 2:r2=x; 2:r3=y;
> c=a; d=b; x=c; y=d;
> }
>  P0           | P1            | P2            ;
>  stw r3,0(r2) | lwz r8,0(r2)  | lwz r8,0(r3)  ;
>               | stw r2,0(r8)  | lwz r9,0(r8)  ;
> exists
> (1:r8=y /\ 2:r8=x /\ 2:r9=c)

Agreed.

Will
Will Deacon Jan. 26, 2016, 12:16 p.m. UTC | #57
On Mon, Jan 25, 2016 at 10:03:22PM -0800, Paul E. McKenney wrote:
> On Mon, Jan 25, 2016 at 04:42:43PM +0000, Will Deacon wrote:
> > On Fri, Jan 15, 2016 at 01:58:53PM -0800, Paul E. McKenney wrote:
> > > PPC Overlapping Group-B sets version 4
> > > ""
> > > (* When the Group-B sets from two different barriers involve instructions in
> > >    the same thread, within that thread one set must contain the other.
> > > 
> > > 	P0	P1	P2
> > > 	Rx=1	Wy=1	Wz=2
> > > 	dep.	lwsync	lwsync
> > > 	Ry=0	Wz=1	Wx=1
> > > 	Rz=1
> > > 
> > > 	assert(!(z=2))
> > > 
> > >    Forbidden by ppcmem, allowed by herd.
> > > *)
> > > {
> > > 0:r1=x; 0:r2=y; 0:r3=z;
> > > 1:r1=x; 1:r2=y; 1:r3=z; 1:r4=1;
> > > 2:r1=x; 2:r2=y; 2:r3=z; 2:r4=1; 2:r5=2;
> > > }
> > >  P0		| P1		| P2		;
> > >  lwz r6,0(r1)	| stw r4,0(r2)	| stw r5,0(r3)	;
> > >  xor r7,r6,r6	| lwsync	| lwsync	;
> > >  lwzx r7,r7,r2	| stw r4,0(r3)	| stw r4,0(r1)	;
> > >  lwz r8,0(r3)	|		|		;
> > > 
> > > exists
> > > (z=2 /\ 0:r6=1 /\ 0:r7=0 /\ 0:r8=1)
> > 
> > That really hurts. Assuming that the "assert(!(z=2))" is actually there
> > to constrain the coherence order of z to be {0->1->2}, then I think that
> > this test is forbidden on arm using dmb instead of lwsync. That said, I
> > also don't think the Rz=1 in P0 changes anything.
> 
> What about the smp_wmb() variant of dmb that orders only stores?

Tricky, but I think it still works out if the coherence order of z is as
I described above. The line of reasoning is weird though -- I ended up
considering the two cases where P0 reads z before and after it reads x
and what that means for the read of y.

Will
Boqun Feng Jan. 26, 2016, 2:35 p.m. UTC | #58
Hi Will,

On Tue, Jan 26, 2016 at 12:16:09PM +0000, Will Deacon wrote:
> On Mon, Jan 25, 2016 at 10:03:22PM -0800, Paul E. McKenney wrote:
> > On Mon, Jan 25, 2016 at 04:42:43PM +0000, Will Deacon wrote:
> > > On Fri, Jan 15, 2016 at 01:58:53PM -0800, Paul E. McKenney wrote:
> > > > PPC Overlapping Group-B sets version 4
> > > > ""
> > > > (* When the Group-B sets from two different barriers involve instructions in
> > > >    the same thread, within that thread one set must contain the other.
> > > > 
> > > > 	P0	P1	P2
> > > > 	Rx=1	Wy=1	Wz=2
> > > > 	dep.	lwsync	lwsync
> > > > 	Ry=0	Wz=1	Wx=1
> > > > 	Rz=1
> > > > 
> > > > 	assert(!(z=2))
> > > > 
> > > >    Forbidden by ppcmem, allowed by herd.
> > > > *)
> > > > {
> > > > 0:r1=x; 0:r2=y; 0:r3=z;
> > > > 1:r1=x; 1:r2=y; 1:r3=z; 1:r4=1;
> > > > 2:r1=x; 2:r2=y; 2:r3=z; 2:r4=1; 2:r5=2;
> > > > }
> > > >  P0		| P1		| P2		;
> > > >  lwz r6,0(r1)	| stw r4,0(r2)	| stw r5,0(r3)	;
> > > >  xor r7,r6,r6	| lwsync	| lwsync	;
> > > >  lwzx r7,r7,r2	| stw r4,0(r3)	| stw r4,0(r1)	;
> > > >  lwz r8,0(r3)	|		|		;
> > > > 
> > > > exists
> > > > (z=2 /\ 0:r6=1 /\ 0:r7=0 /\ 0:r8=1)
> > > 
> > > That really hurts. Assuming that the "assert(!(z=2))" is actually there
> > > to constrain the coherence order of z to be {0->1->2}, then I think that
> > > this test is forbidden on arm using dmb instead of lwsync. That said, I
> > > also don't think the Rz=1 in P0 changes anything.
> > 
> > What about the smp_wmb() variant of dmb that orders only stores?
> 
> Tricky, but I think it still works out if the coherence order of z is as
> I described above. The line of reasoning is weird though -- I ended up
> considering the two cases where P0 reads z before and after it reads x
                                             ^^^^^^^^^^^^^^^
Because of the fact that two reads on the same processors can't be
executed simultaneously? I feel like this is exactly something herd
missed.

> and what that means for the read of y.
> 

And the reasoning on PPC is similar, so looks like the read of z on P0
is a necessary condition for the exists clause to be forbidden.

Regards,
Boqun

> Will
Boqun Feng Jan. 26, 2016, 4:52 p.m. UTC | #59
Hi Paul,

On Mon, Jan 18, 2016 at 07:46:29AM -0800, Paul E. McKenney wrote:
> On Mon, Jan 18, 2016 at 04:19:29PM +0800, Herbert Xu wrote:
> > Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > >
> > > You could use SYNC_ACQUIRE() to implement read_barrier_depends() and
> > > smp_read_barrier_depends(), but SYNC_RMB probably does not suffice.
> > > The reason for this is that smp_read_barrier_depends() must order the
> > > pointer load against any subsequent read or write through a dereference
> > > of that pointer.  For example:
> > > 
> > >        p = READ_ONCE(gp);
> > >        smp_rmb();
> > >        r1 = p->a; /* ordered by smp_rmb(). */
> > >        p->b = 42; /* NOT ordered by smp_rmb(), BUG!!! */
> > >        r2 = x; /* ordered by smp_rmb(), but doesn't need to be. */
> > > 
> > > In contrast:
> > > 
> > >        p = READ_ONCE(gp);
> > >        smp_read_barrier_depends();
> > >        r1 = p->a; /* ordered by smp_read_barrier_depends(). */
> > >        p->b = 42; /* ordered by smp_read_barrier_depends(). */
> > >        r2 = x; /* not ordered by smp_read_barrier_depends(), which is OK. */
> > > 
> > > Again, if your hardware maintains local ordering for address
> > > and data dependencies, you can have read_barrier_depends() and
> > > smp_read_barrier_depends() be no-ops like they are for most
> > > architectures.
> > > 
> > > Does that help?
> > 
> > This is crazy! smp_rmb started out being strictly stronger than
> > smp_read_barrier_depends, when did this stop being the case?
> 
> Hello, Herbert!
> 
> It is true that most Linux kernel code relies only on the read-read
> properties of dependencies, but the read-write properties are useful.
> Admittedly relatively rarely, but useful.
> 
> The better comparison for smp_read_barrier_depends(), especially in
> its rcu_dereference*() form, is smp_load_acquire().
> 

Confused..

I recall that last time you and Linus came into a conclusion that even
on Alpha, a barrier for read->write with data dependency is unnecessary:

http://article.gmane.org/gmane.linux.kernel/2077661

And in an earlier mail of that thread, Linus made his point that
smp_read_barrier_depends() should only be used to order read->read.

So right now, are we going to extend the semantics of
smp_read_barrier_depends()? Can we just make smp_read_barrier_depends()
still only work for read->read, and assume all the architectures won't
reorder read->write with data dependency, so that the code above having
a smp_rmb() also works?

Regards,
Boqun
Peter Zijlstra Jan. 26, 2016, 5:22 p.m. UTC | #60
On Wed, Jan 27, 2016 at 12:52:07AM +0800, Boqun Feng wrote:
> I recall that last time you and Linus came into a conclusion that even
> on Alpha, a barrier for read->write with data dependency is unnecessary:
> 
> http://article.gmane.org/gmane.linux.kernel/2077661
> 
> And in an earlier mail of that thread, Linus made his point that
> smp_read_barrier_depends() should only be used to order read->read.
> 
> So right now, are we going to extend the semantics of
> smp_read_barrier_depends()? Can we just make smp_read_barrier_depends()
> still only work for read->read, and assume all the architectures won't
> reorder read->write with data dependency, so that the code above having
> a smp_rmb() also works?

That discussions was about control dependencies. So writes that _depend_
on a prior read having an explicit value.

So something like:

	struct foo *x = READ_ONCE(*ptr);
	smp_read_barrier_depends()
	if (x->val == 5)
		x->bar = 5;

In that case, the load of x->val must be complete and its value
determined _before_ the store to x->bar can happen.

This is distinct from:

	struct foo *x = READ_ONCE(*ptr);
	smp_read_barrier_depends();
	x->bar = 5;

And its the second case where smp_read_barrier_depends() read->write
order matters.
Linus Torvalds Jan. 26, 2016, 7:44 p.m. UTC | #61
On Tue, Jan 26, 2016 at 9:22 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> This is distinct from:

That may be distinct, but:

>         struct foo *x = READ_ONCE(*ptr);
>         smp_read_barrier_depends();
>         x->bar = 5;

This case is complete BS. Stop perpetuating it. I already removed a
number of bogus cases of it, and I removed the incorrect documentation
that had this crap.

It's called "smp_READ_barrier_depends()" for a reason.

Alpha is the only one that needs it, and alpha needs it only for
dependent READS.

It's not called smp_read_write_barrier_depends(). It's not called
"smp_mb_depends()". It's a weaker form of "smp_rmb()", nothing else.

So alpha does have an implied dependency chain from a read to a
subsequent dependent write, and does not need any extra barriers.

Alpha does *not* have a dependency chain from a read to a subsequent
read, which is why we need that horrible crappy
smp_read_barrier_depends(). But it's the only reason.

This is the alpha reference manual wrt read-to-write dependency:

  5.6.1.7 Definition of Dependence Constraint

    The depends relation (DP) is defined as follows. Given u and v
issued by processor Pi, where u
    is a read or an instruction fetch and v is a write, u precedes v
in DP order (written u DP v, that
    is, v depends on u) in either of the following situations:

     • u determines the execution of v, the location accessed by v, or
the value written by v.
     • u determines the execution or address or value of another
memory access z that precedes

    v or might precede v (that is, would precede v in some execution
path depending
    on the value read by u) by processor issue constraint (see Section 5.6.1.3).

Note that the dependence barrier honors not only control flow, but
address and data values too.  This is a different syntax than we use,
but 'u' is the READ_ONCE, and 'v' is the write. Any data, address or
conditional dependency between the two implies an ordering.

So no, "smp_read_barrier_depends()" is *ONLY* about two reads, where
the second read is data-dependent on the first. Nothing else.

So if you _ever_ see a "smp_read_barrier_depends()" that isn't about a
barrier between two reads, then that is a bug.

The above code is crap.  It's exactly as much crap as

   a = READ_ONCE(x);
   smp_rmb();
   WRITE_ONCE(b, y);

because a "rmb()" simply doesn't have anything to do with
read-vs-subsequent-write ordering.

                 Linus
Paul E. McKenney Jan. 26, 2016, 7:51 p.m. UTC | #62
On Wed, Jan 27, 2016 at 12:52:07AM +0800, Boqun Feng wrote:
> Hi Paul,
> 
> On Mon, Jan 18, 2016 at 07:46:29AM -0800, Paul E. McKenney wrote:
> > On Mon, Jan 18, 2016 at 04:19:29PM +0800, Herbert Xu wrote:
> > > Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > > >
> > > > You could use SYNC_ACQUIRE() to implement read_barrier_depends() and
> > > > smp_read_barrier_depends(), but SYNC_RMB probably does not suffice.
> > > > The reason for this is that smp_read_barrier_depends() must order the
> > > > pointer load against any subsequent read or write through a dereference
> > > > of that pointer.  For example:
> > > > 
> > > >        p = READ_ONCE(gp);
> > > >        smp_rmb();
> > > >        r1 = p->a; /* ordered by smp_rmb(). */
> > > >        p->b = 42; /* NOT ordered by smp_rmb(), BUG!!! */
> > > >        r2 = x; /* ordered by smp_rmb(), but doesn't need to be. */
> > > > 
> > > > In contrast:
> > > > 
> > > >        p = READ_ONCE(gp);
> > > >        smp_read_barrier_depends();
> > > >        r1 = p->a; /* ordered by smp_read_barrier_depends(). */
> > > >        p->b = 42; /* ordered by smp_read_barrier_depends(). */
> > > >        r2 = x; /* not ordered by smp_read_barrier_depends(), which is OK. */
> > > > 
> > > > Again, if your hardware maintains local ordering for address
> > > > and data dependencies, you can have read_barrier_depends() and
> > > > smp_read_barrier_depends() be no-ops like they are for most
> > > > architectures.
> > > > 
> > > > Does that help?
> > > 
> > > This is crazy! smp_rmb started out being strictly stronger than
> > > smp_read_barrier_depends, when did this stop being the case?
> > 
> > Hello, Herbert!
> > 
> > It is true that most Linux kernel code relies only on the read-read
> > properties of dependencies, but the read-write properties are useful.
> > Admittedly relatively rarely, but useful.
> > 
> > The better comparison for smp_read_barrier_depends(), especially in
> > its rcu_dereference*() form, is smp_load_acquire().
> 
> Confused..
> 
> I recall that last time you and Linus came into a conclusion that even
> on Alpha, a barrier for read->write with data dependency is unnecessary:
> 
> http://article.gmane.org/gmane.linux.kernel/2077661
> 
> And in an earlier mail of that thread, Linus made his point that
> smp_read_barrier_depends() should only be used to order read->read.

Those examples involved read-to-write with conditionals, as in:

	if (READ_ONCE(a))
		WRITE_ONCE(b, 1);

Without the "if", no ordering is guaranteed on weakly ordered CPUs.
(The volatile accesses keep ordering within the compiler for once...

> So right now, are we going to extend the semantics of
> smp_read_barrier_depends()? Can we just make smp_read_barrier_depends()
> still only work for read->read, and assume all the architectures won't
> reorder read->write with data dependency, so that the code above having
> a smp_rmb() also works?

The semantics of smp_read_barrier_depends() has been both read-to-write
and read-to-read for some time now, this patch just catches the
documentation up with reality.

							Thanx, Paul
Paul E. McKenney Jan. 26, 2016, 7:58 p.m. UTC | #63
On Tue, Jan 26, 2016 at 12:16:09PM +0000, Will Deacon wrote:
> On Mon, Jan 25, 2016 at 10:03:22PM -0800, Paul E. McKenney wrote:
> > On Mon, Jan 25, 2016 at 04:42:43PM +0000, Will Deacon wrote:
> > > On Fri, Jan 15, 2016 at 01:58:53PM -0800, Paul E. McKenney wrote:
> > > > PPC Overlapping Group-B sets version 4
> > > > ""
> > > > (* When the Group-B sets from two different barriers involve instructions in
> > > >    the same thread, within that thread one set must contain the other.
> > > > 
> > > > 	P0	P1	P2
> > > > 	Rx=1	Wy=1	Wz=2
> > > > 	dep.	lwsync	lwsync
> > > > 	Ry=0	Wz=1	Wx=1
> > > > 	Rz=1
> > > > 
> > > > 	assert(!(z=2))
> > > > 
> > > >    Forbidden by ppcmem, allowed by herd.
> > > > *)
> > > > {
> > > > 0:r1=x; 0:r2=y; 0:r3=z;
> > > > 1:r1=x; 1:r2=y; 1:r3=z; 1:r4=1;
> > > > 2:r1=x; 2:r2=y; 2:r3=z; 2:r4=1; 2:r5=2;
> > > > }
> > > >  P0		| P1		| P2		;
> > > >  lwz r6,0(r1)	| stw r4,0(r2)	| stw r5,0(r3)	;
> > > >  xor r7,r6,r6	| lwsync	| lwsync	;
> > > >  lwzx r7,r7,r2	| stw r4,0(r3)	| stw r4,0(r1)	;
> > > >  lwz r8,0(r3)	|		|		;
> > > > 
> > > > exists
> > > > (z=2 /\ 0:r6=1 /\ 0:r7=0 /\ 0:r8=1)
> > > 
> > > That really hurts. Assuming that the "assert(!(z=2))" is actually there
> > > to constrain the coherence order of z to be {0->1->2}, then I think that
> > > this test is forbidden on arm using dmb instead of lwsync. That said, I
> > > also don't think the Rz=1 in P0 changes anything.
> > 
> > What about the smp_wmb() variant of dmb that orders only stores?
> 
> Tricky, but I think it still works out if the coherence order of z is as
> I described above. The line of reasoning is weird though -- I ended up
> considering the two cases where P0 reads z before and after it reads x
> and what that means for the read of y.

By "works out" you mean that ARM prohibits the outcome?

BTW, I never have seen a real-world use for this case.  At the moment
it is mostly a cautionary tale about memory-model corner cases and
tools.

							Thanx, Paul
Paul E. McKenney Jan. 26, 2016, 8:10 p.m. UTC | #64
On Tue, Jan 26, 2016 at 11:44:46AM -0800, Linus Torvalds wrote:
> On Tue, Jan 26, 2016 at 9:22 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > This is distinct from:
> 
> That may be distinct, but:
> 
> >         struct foo *x = READ_ONCE(*ptr);
> >         smp_read_barrier_depends();
> >         x->bar = 5;
> 
> This case is complete BS. Stop perpetuating it. I already removed a
> number of bogus cases of it, and I removed the incorrect documentation
> that had this crap.

If I understand your objection correctly, you want the above pattern
expressed either like this:

	struct foo *x = rcu_dereference(*ptr);
	x->bar = 5;

Or like this:

	struct foo *x = lockless_dereference(*ptr);
	x->bar = 5;

Or am I missing your point?

> It's called "smp_READ_barrier_depends()" for a reason.
> 
> Alpha is the only one that needs it, and alpha needs it only for
> dependent READS.
> 
> It's not called smp_read_write_barrier_depends(). It's not called
> "smp_mb_depends()". It's a weaker form of "smp_rmb()", nothing else.
> 
> So alpha does have an implied dependency chain from a read to a
> subsequent dependent write, and does not need any extra barriers.
> 
> Alpha does *not* have a dependency chain from a read to a subsequent
> read, which is why we need that horrible crappy
> smp_read_barrier_depends(). But it's the only reason.
> 
> This is the alpha reference manual wrt read-to-write dependency:
> 
>   5.6.1.7 Definition of Dependence Constraint
> 
>     The depends relation (DP) is defined as follows. Given u and v
> issued by processor Pi, where u
>     is a read or an instruction fetch and v is a write, u precedes v
> in DP order (written u DP v, that
>     is, v depends on u) in either of the following situations:
> 
>      • u determines the execution of v, the location accessed by v, or
> the value written by v.
>      • u determines the execution or address or value of another
> memory access z that precedes
> 
>     v or might precede v (that is, would precede v in some execution
> path depending
>     on the value read by u) by processor issue constraint (see Section 5.6.1.3).
> 
> Note that the dependence barrier honors not only control flow, but
> address and data values too.  This is a different syntax than we use,
> but 'u' is the READ_ONCE, and 'v' is the write. Any data, address or
> conditional dependency between the two implies an ordering.
> 
> So no, "smp_read_barrier_depends()" is *ONLY* about two reads, where
> the second read is data-dependent on the first. Nothing else.
> 
> So if you _ever_ see a "smp_read_barrier_depends()" that isn't about a
> barrier between two reads, then that is a bug.

And the smp_read_barrier_depends() in both rcu_dereference() and
in lockless_dereference() is ordering the read-to-read case and the
underlying hardware is ordering the read-to-write case on weakly ordered
hardware.

Or, again, am I missing your point?

							Thanx, Paul

> The above code is crap.  It's exactly as much crap as
> 
>    a = READ_ONCE(x);
>    smp_rmb();
>    WRITE_ONCE(b, y);
> 
> because a "rmb()" simply doesn't have anything to do with
> read-vs-subsequent-write ordering.
> 
>                  Linus
>
Paul E. McKenney Jan. 26, 2016, 8:13 p.m. UTC | #65
On Tue, Jan 26, 2016 at 11:19:27AM +0100, Peter Zijlstra wrote:
> On Mon, Jan 25, 2016 at 10:03:22PM -0800, Paul E. McKenney wrote:
> > On Mon, Jan 25, 2016 at 04:42:43PM +0000, Will Deacon wrote:
> > > On Fri, Jan 15, 2016 at 01:58:53PM -0800, Paul E. McKenney wrote:
> > > > On Fri, Jan 15, 2016 at 10:27:14PM +0100, Peter Zijlstra wrote:
> 
> > > > > Yes, that seems a good start. But yesterday you raised the 'fun' point
> > > > > of two globally ordered sequences connected by a single local link.
> > > > 
> > > > The conclusion that I am slowly coming to is that litmus tests should
> > > > not be thought of as linear chains, but rather as cycles.  If you think
> > > > of it as a cycle, then it doesn't matter where the local link is, just
> > > > how many of them and how they are connected.
> > > 
> > > Do you have some examples of this? I'm struggling to make it work in my
> > > mind, or are you talking specifically in the context of the kernel
> > > memory model?
> > 
> > Now that you mention it, maybe it would be best to keep the transitive
> > and non-transitive separate for the time being anyway.  Just because it
> > might be possible to deal with does not necessarily mean that we should
> > be encouraging it.  ;-)
> 
> So isn't smp_mb__after_unlock_lock() exactly such a scenario? And would
> not someone trying to implement RCsc locks using locally transitive
> RELEASE/ACQUIRE operations need exactly this stuff?
> 
> That is, I am afraid we need to cover the mix of local and global
> transitive operations at least in overview.

True, but we haven't gotten to locking yet.  That said, I would argue
that smp_mb__after_unlock_lock() upgrades locks to transitive, and
thus would not be an exception to the "no combining transitive and
non-transitive steps in cycles" rule.

							Thanx, Paul
Linus Torvalds Jan. 26, 2016, 10:15 p.m. UTC | #66
On Tue, Jan 26, 2016 at 12:10 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Tue, Jan 26, 2016 at 11:44:46AM -0800, Linus Torvalds wrote:
>>
>> >         struct foo *x = READ_ONCE(*ptr);
>> >         smp_read_barrier_depends();
>> >         x->bar = 5;
>>
>> This case is complete BS. Stop perpetuating it. I already removed a
>> number of bogus cases of it, and I removed the incorrect documentation
>> that had this crap.
>
> If I understand your objection correctly, you want the above pattern
> expressed either like this:
>
>         struct foo *x = rcu_dereference(*ptr);
>         x->bar = 5;
>
> Or like this:
>
>         struct foo *x = lockless_dereference(*ptr);
>         x->bar = 5;
>
> Or am I missing your point?

You are entirely missing the point.

You might as well just write it as

    struct foo x = READ_ONCE(*ptr);
    x->bar = 5;

because that "smp_read_barrier_depends()" does NOTHING wrt the second write.

So what I am saying is simple: anybody who writes that
"smp_read_barrier_depends()" in there is just ttoally and completely
WRONG, and the fact that Peter wrote it out after I removed several
instances of that bloody f*cking idiocy is disturbing.

Don't do it. It's BS. It's wrong. Don't make excuses for it.

             Linus
Linus Torvalds Jan. 26, 2016, 10:33 p.m. UTC | #67
On Tue, Jan 26, 2016 at 2:15 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> You might as well just write it as
>
>     struct foo x = READ_ONCE(*ptr);
>     x->bar = 5;
>
> because that "smp_read_barrier_depends()" does NOTHING wrt the second write.

Just to clarify: on alpha it adds a memory barrier, but that memory
barrier is useless.

On non-alpha, it is a no-op, and obviously does nothing simply because
it generates no code.

So if anybody believes that the "smp_read_barrier_depends()" does
something, they are *wrong*.

And if anybody sends out an email with that smp_read_barrier_depends()
in an example, they are actively just confusing other people, which is
even worse than just being wrong. Which is why I jumped in.

So stop perpetuating the myth that smp_read_barrier_depends() does
something here. It does not. It's a bug, and it has become this "mind
virus" for some people that seem to believe that it does something.

I had to remove this crap once from the kernel already, see commit
105ff3cbf225 ("atomic: remove all traces of READ_ONCE_CTRL() and
atomic*_read_ctrl()").

I don't want to ever see that broken construct again. And I want to
make sure that everybody is educated about how broken it was. I'm
extremely unhappy that it came up again.

If it turns out that some architecture does actually need a barrier
between a read and a dependent write, then that will mean that

 (a) we'll have to make up a _new_ barrier, because
"smp_read_barrier_depends()" is not that barrier. We'll presumably
then have to make that new barrier part of "rcu_derefence()" and
friends.

 (b) we will have found an architecture with even worse memory
ordering semantics than alpha, and we'll have to stop castigating
alpha for being the worst memory ordering ever.

but I sincerely hope that we'll never find that kind of broken architecture.

               Linus
Paul E. McKenney Jan. 26, 2016, 11:29 p.m. UTC | #68
On Tue, Jan 26, 2016 at 02:33:40PM -0800, Linus Torvalds wrote:
> On Tue, Jan 26, 2016 at 2:15 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > You might as well just write it as
> >
> >     struct foo x = READ_ONCE(*ptr);
> >     x->bar = 5;
> >
> > because that "smp_read_barrier_depends()" does NOTHING wrt the second write.
> 
> Just to clarify: on alpha it adds a memory barrier, but that memory
> barrier is useless.

No trailing data-dependent read, so agreed, no smp_read_barrier_depends()
needed.  That said, I believe that we should encourage rcu_dereference*()
or lockless_dereference() instead of READ_ONCE() for documentation
reasons, though.

> On non-alpha, it is a no-op, and obviously does nothing simply because
> it generates no code.
> 
> So if anybody believes that the "smp_read_barrier_depends()" does
> something, they are *wrong*.

The other problem with smp_read_barrier_depends() is that it is often
a pain figuring out which prior load it is supposed to apply to.
Hence my preference for rcu_dereference*() and lockless_dereference().

> And if anybody sends out an email with that smp_read_barrier_depends()
> in an example, they are actively just confusing other people, which is
> even worse than just being wrong. Which is why I jumped in.
> 
> So stop perpetuating the myth that smp_read_barrier_depends() does
> something here. It does not. It's a bug, and it has become this "mind
> virus" for some people that seem to believe that it does something.

It looks like I should add words to memory-barriers.txt de-emphasizing
smp_read_barrier_depends().  I will take a look at that.

> I had to remove this crap once from the kernel already, see commit
> 105ff3cbf225 ("atomic: remove all traces of READ_ONCE_CTRL() and
> atomic*_read_ctrl()").
> 
> I don't want to ever see that broken construct again. And I want to
> make sure that everybody is educated about how broken it was. I'm
> extremely unhappy that it came up again.

Well, if it makes you feel better, that was control dependencies and this
was data dependencies.  So it was not -exactly- the same.  ;-)

(Sorry, couldn't resist...)

> If it turns out that some architecture does actually need a barrier
> between a read and a dependent write, then that will mean that
> 
>  (a) we'll have to make up a _new_ barrier, because
> "smp_read_barrier_depends()" is not that barrier. We'll presumably
> then have to make that new barrier part of "rcu_derefence()" and
> friends.

Agreed.  We can worry about whether or not we replace the current
smp_read_barrier_depends() with that new barrier when and if such
hardware appears.

>  (b) we will have found an architecture with even worse memory
> ordering semantics than alpha, and we'll have to stop castigating
> alpha for being the worst memory ordering ever.

;-) ;-) ;-)

> but I sincerely hope that we'll never find that kind of broken architecture.

Apparently at least some hardware vendors are reading memory-barriers.txt,
so perhaps the odds of that kind of breakage have reduced.

								Thanx, Paul
Paul E. McKenney Jan. 26, 2016, 11:37 p.m. UTC | #69
On Tue, Jan 26, 2016 at 12:10:10PM +0000, Will Deacon wrote:
> On Mon, Jan 25, 2016 at 05:06:46PM -0800, Paul E. McKenney wrote:
> > On Mon, Jan 25, 2016 at 02:41:34PM +0000, Will Deacon wrote:
> > > On Fri, Jan 15, 2016 at 11:28:45AM -0800, Paul E. McKenney wrote:
> > > > On Fri, Jan 15, 2016 at 09:54:01AM -0800, Paul E. McKenney wrote:
> > > > > On Fri, Jan 15, 2016 at 10:24:32AM +0000, Will Deacon wrote:
> > > > > > See my earlier reply [1] (but also, your WRC Linux example looks more
> > > > > > like a variant on WWC and I couldn't really follow it).
> > > > > 
> > > > > I will revisit my WRC Linux example.  And yes, creating litmus tests
> > > > > that use non-fake dependencies is still a bit of an undertaking.  :-/
> > > > > I am sure that it will seem more natural with time and experience...
> > > > 
> > > > Hmmm...  You are quite right, I did do WWC.  I need to change cpu2()'s
> > > > last access from a store to a load to get WRC.  Plus the levels of
> > > > indirection definitely didn't match up, did they?
> > > 
> > > Nope, it was pretty baffling!
> > 
> > "It is a service that I provide."  ;-)
> > 
> > > > 	struct foo {
> > > > 		struct foo *next;
> > > > 	};
> > > > 	struct foo a;
> > > > 	struct foo b;
> > > > 	struct foo c = { &a };
> > > > 	struct foo d = { &b };
> > > > 	struct foo x = { &c };
> > > > 	struct foo y = { &d };
> > > > 	struct foo *r1, *r2, *r3;
> > > > 
> > > > 	void cpu0(void)
> > > > 	{
> > > > 		WRITE_ONCE(x.next, &y);
> > > > 	}
> > > > 
> > > > 	void cpu1(void)
> > > > 	{
> > > > 		r1 = lockless_dereference(x.next);
> > > > 		WRITE_ONCE(r1->next, &x);
> > > > 	}
> > > > 
> > > > 	void cpu2(void)
> > > > 	{
> > > > 		r2 = lockless_dereference(y.next);
> > > > 		r3 = READ_ONCE(r2->next);
> > > > 	}
> > > > 
> > > > In this case, it is legal to end the run with:
> > > > 
> > > > 	r1 == &y && r2 == &x && r3 == &c
> > > > 
> > > > Please see below for a ppcmem litmus test.
> > > > 
> > > > So, did I get it right this time?  ;-)
> > > 
> > > The code above looks correct to me (in that it matches WRC+addrs),
> > > but your litmus test:
> > > 
> > > > PPC WRCnf+addrs
> > > > ""
> > > > {
> > > > 0:r2=x; 0:r3=y;
> > > > 1:r2=x; 1:r3=y;
> > > > 2:r2=x; 2:r3=y;
> > > > c=a; d=b; x=c; y=d;
> > > > }
> > > >  P0           | P1            | P2            ;
> > > >  stw r3,0(r2) | lwz r8,0(r2)  | lwz r8,0(r3)  ;
> > > >               | stw r2,0(r3)  | lwz r9,0(r8)  ;
> > > > exists
> > > > (1:r8=y /\ 2:r8=x /\ 2:r9=c)
> > > 
> > > Seems to be missing the address dependency on P1.
> > 
> > You are quite correct!  How about the following?
> 
> I think that's it!
> 
> > As before, both herd and ppcmem say that the cycle is allowed, as
> > expected, given non-transitive ordering.  To prohibit the cycle, P1
> > needs a suitable memory-barrier instruction.
> > 
> > ------------------------------------------------------------------------
> > 
> > PPC WRCnf+addrs
> > ""
> > {
> > 0:r2=x; 0:r3=y;
> > 1:r2=x; 1:r3=y;
> > 2:r2=x; 2:r3=y;
> > c=a; d=b; x=c; y=d;
> > }
> >  P0           | P1            | P2            ;
> >  stw r3,0(r2) | lwz r8,0(r2)  | lwz r8,0(r3)  ;
> >               | stw r2,0(r8)  | lwz r9,0(r8)  ;
> > exists
> > (1:r8=y /\ 2:r8=x /\ 2:r9=c)
> 
> Agreed.

OK, thank you!  Would you agree that it would be good to replace the
current xor-based fake-dependency litmus tests with tests having real
dependencies?

							Thanx, Paul
Linus Torvalds Jan. 26, 2016, 11:45 p.m. UTC | #70
On Tue, Jan 26, 2016 at 3:29 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> No trailing data-dependent read, so agreed, no smp_read_barrier_depends()
> needed.  That said, I believe that we should encourage rcu_dereference*()
> or lockless_dereference() instead of READ_ONCE() for documentation
> reasons, though.

I agree that that is likely the right thing to do in pretty much all situations.

In theory, there might be performance situations where we'd want to
actively avoid the smp_read_barrier_depends() inherent in those, but
considering that it's only a performance issue on alpha, and we
probably have all of two or three people using Linux on alpha, it's a
pretty theoretical performance worry.

                  Linus
Paul E. McKenney Jan. 27, 2016, 12:57 a.m. UTC | #71
On Tue, Jan 26, 2016 at 03:45:23PM -0800, Linus Torvalds wrote:
> On Tue, Jan 26, 2016 at 3:29 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > No trailing data-dependent read, so agreed, no smp_read_barrier_depends()
> > needed.  That said, I believe that we should encourage rcu_dereference*()
> > or lockless_dereference() instead of READ_ONCE() for documentation
> > reasons, though.
> 
> I agree that that is likely the right thing to do in pretty much all situations.
> 
> In theory, there might be performance situations where we'd want to
> actively avoid the smp_read_barrier_depends() inherent in those, but
> considering that it's only a performance issue on alpha, and we
> probably have all of two or three people using Linux on alpha, it's a
> pretty theoretical performance worry.

Agreed!

							Thanx, Paul
Peter Zijlstra Jan. 27, 2016, 7:51 a.m. UTC | #72
On Tue, Jan 26, 2016 at 02:33:40PM -0800, Linus Torvalds wrote:

> If it turns out that some architecture does actually need a barrier
> between a read and a dependent write, then that will mean that
> 
>  (a) we'll have to make up a _new_ barrier, because
> "smp_read_barrier_depends()" is not that barrier. We'll presumably
> then have to make that new barrier part of "rcu_derefence()" and
> friends.
> 
>  (b) we will have found an architecture with even worse memory
> ordering semantics than alpha, and we'll have to stop castigating
> alpha for being the worst memory ordering ever.
> 
> but I sincerely hope that we'll never find that kind of broken architecture.

So for a moment it looked like MIPS wanted to equal or surpass Alpha in
this respect.

And Paul made the point that smp_read_barrier_depends() really should
be smp_aquire_barrier_depends() in that we rely on both dependent reads
and writes to be ordered against the initial pointer load.

Now, as you've made abundantly clear, Alpha does this, although it needs
the little extra help in the dependent read department.

The 'problem' is that someone seemed to have used our
Documentation/memory-barriers.txt as a specification for what hardware
is permitted and we require. And in that light Paul noted that
read_barrier_depends really should be considered an
acquire_barrier_depends and order both dependent reads and writes
against the (prior) read (if nothing else already does).

Now clearly, any sane architecture doesn't need anything like this, but
again our document doesn't seem to judge. That is, from reading the
document one can get the impression is a perfectly fine thing to do.
Nowhere does our disdain for this thing show.
Peter Zijlstra Jan. 27, 2016, 8:39 a.m. UTC | #73
On Tue, Jan 26, 2016 at 12:13:39PM -0800, Paul E. McKenney wrote:
> On Tue, Jan 26, 2016 at 11:19:27AM +0100, Peter Zijlstra wrote:

> > So isn't smp_mb__after_unlock_lock() exactly such a scenario? And would
> > not someone trying to implement RCsc locks using locally transitive
> > RELEASE/ACQUIRE operations need exactly this stuff?
> > 
> > That is, I am afraid we need to cover the mix of local and global
> > transitive operations at least in overview.
> 
> True, but we haven't gotten to locking yet.

The mythical smp_mb__after_release_acquire() then ;-)

(and yes, I know you're going to say we don't have that)

> That said, I would argue
> that smp_mb__after_unlock_lock() upgrades locks to transitive, and
> thus would not be an exception to the "no combining transitive and
> non-transitive steps in cycles" rule.

But But But ;-) It does that exactly by combining. I suspect this is
(partly) the source of your SC chains with one PC link example.
Will Deacon Jan. 27, 2016, 10:23 a.m. UTC | #74
On Tue, Jan 26, 2016 at 03:37:33PM -0800, Paul E. McKenney wrote:
> On Tue, Jan 26, 2016 at 12:10:10PM +0000, Will Deacon wrote:
> > On Mon, Jan 25, 2016 at 05:06:46PM -0800, Paul E. McKenney wrote:
> > > PPC WRCnf+addrs
> > > ""
> > > {
> > > 0:r2=x; 0:r3=y;
> > > 1:r2=x; 1:r3=y;
> > > 2:r2=x; 2:r3=y;
> > > c=a; d=b; x=c; y=d;
> > > }
> > >  P0           | P1            | P2            ;
> > >  stw r3,0(r2) | lwz r8,0(r2)  | lwz r8,0(r3)  ;
> > >               | stw r2,0(r8)  | lwz r9,0(r8)  ;
> > > exists
> > > (1:r8=y /\ 2:r8=x /\ 2:r9=c)
> > 
> > Agreed.
> 
> OK, thank you!  Would you agree that it would be good to replace the
> current xor-based fake-dependency litmus tests with tests having real
> dependencies?

Yes, because it would look a lot more like real (kernel) code.

Will
Will Deacon Jan. 27, 2016, 10:25 a.m. UTC | #75
On Tue, Jan 26, 2016 at 11:58:20AM -0800, Paul E. McKenney wrote:
> On Tue, Jan 26, 2016 at 12:16:09PM +0000, Will Deacon wrote:
> > On Mon, Jan 25, 2016 at 10:03:22PM -0800, Paul E. McKenney wrote:
> > > On Mon, Jan 25, 2016 at 04:42:43PM +0000, Will Deacon wrote:
> > > > On Fri, Jan 15, 2016 at 01:58:53PM -0800, Paul E. McKenney wrote:
> > > > > PPC Overlapping Group-B sets version 4
> > > > > ""
> > > > > (* When the Group-B sets from two different barriers involve instructions in
> > > > >    the same thread, within that thread one set must contain the other.
> > > > > 
> > > > > 	P0	P1	P2
> > > > > 	Rx=1	Wy=1	Wz=2
> > > > > 	dep.	lwsync	lwsync
> > > > > 	Ry=0	Wz=1	Wx=1
> > > > > 	Rz=1
> > > > > 
> > > > > 	assert(!(z=2))
> > > > > 
> > > > >    Forbidden by ppcmem, allowed by herd.
> > > > > *)
> > > > > {
> > > > > 0:r1=x; 0:r2=y; 0:r3=z;
> > > > > 1:r1=x; 1:r2=y; 1:r3=z; 1:r4=1;
> > > > > 2:r1=x; 2:r2=y; 2:r3=z; 2:r4=1; 2:r5=2;
> > > > > }
> > > > >  P0		| P1		| P2		;
> > > > >  lwz r6,0(r1)	| stw r4,0(r2)	| stw r5,0(r3)	;
> > > > >  xor r7,r6,r6	| lwsync	| lwsync	;
> > > > >  lwzx r7,r7,r2	| stw r4,0(r3)	| stw r4,0(r1)	;
> > > > >  lwz r8,0(r3)	|		|		;
> > > > > 
> > > > > exists
> > > > > (z=2 /\ 0:r6=1 /\ 0:r7=0 /\ 0:r8=1)
> > > > 
> > > > That really hurts. Assuming that the "assert(!(z=2))" is actually there
> > > > to constrain the coherence order of z to be {0->1->2}, then I think that
> > > > this test is forbidden on arm using dmb instead of lwsync. That said, I
> > > > also don't think the Rz=1 in P0 changes anything.
> > > 
> > > What about the smp_wmb() variant of dmb that orders only stores?
> > 
> > Tricky, but I think it still works out if the coherence order of z is as
> > I described above. The line of reasoning is weird though -- I ended up
> > considering the two cases where P0 reads z before and after it reads x
> > and what that means for the read of y.
> 
> By "works out" you mean that ARM prohibits the outcome?

Yes, that's my understanding.

Will
Ralf Baechle Jan. 27, 2016, 10:40 a.m. UTC | #76
On Thu, Jan 14, 2016 at 04:47:53PM -0800, Paul E. McKenney wrote:

> So you need to build a different kernel for some types of MIPS systems?

Yes.  We can't really do without.  Classic MIPS code is not relocatable
without the complexity of PIC code as used by ELF DSOs - and their
performanc penalty.  Plus we have a number of architecture revisions
ovr the decades, big and little endian, 32 and 64 bit as the major
stumbling stones.  There however are groups of similar systems that
can share kernel binaries.

> Or do you do boot-time rewriting, like a number of other arches do?

We don't rewrite the code (as in the .text of the vmlinux binary) but we
do runtime code generation for a few highly performance sensitive area
of the kernel code such as copy_page() or TLB exception handlers.  This
allows more flexibility than just inserting templates into the kernel
code.  Downside - it means we have some of the complexity of as and ld
in the kernel.

  Ralf
Maciej W. Rozycki Jan. 27, 2016, 11:26 a.m. UTC | #77
On Fri, 15 Jan 2016, Leonid Yegoshin wrote:

> > So you need to build a different kernel for some types of MIPS systems?
> > Or do you do boot-time rewriting, like a number of other arches do?
> 
> I don't know. I would like to have responses. Ralf asked Maciej about old
> systems and that came nowhere. Even rewrite - don't know what to do with that:
> no lightweight SYNC or no SYNC at all - yes, it is still possible that SYNC on
> some systems can be too heavy or even harmful, nobody tested that.

 I don't recall being asked; mind that I might not get to messages I have 
not been cc-ed in a timely manner and I may miss some altogether.  With 
the amount of mailing list traffic that passes by me my scanner may fail 
to trigger.  Sorry if this causes anybody trouble, but such is life.

 Coincidentally, I have just posted some notes on SYNC in a different 
thread, see <http://lkml.iu.edu/hypermail/linux/kernel/1601.3/03080.html>.  
There's a reference to an older message of mine there too.  I hope this 
answers your questions.

  Maciej
Maciej W. Rozycki Jan. 27, 2016, 12:09 p.m. UTC | #78
On Wed, 27 Jan 2016, Ralf Baechle wrote:

> > So you need to build a different kernel for some types of MIPS systems?
> 
> Yes.  We can't really do without.  Classic MIPS code is not relocatable
> without the complexity of PIC code as used by ELF DSOs - and their
> performanc penalty.  Plus we have a number of architecture revisions
> ovr the decades, big and little endian, 32 and 64 bit as the major
> stumbling stones.  There however are groups of similar systems that
> can share kernel binaries.

 Matt (cc-ed) has recently posted patches to add support for a relocatable 
kernel, implemented without the usual overhead of PIC code.  It works by 
retaining relocations in a fully-linked binary and then simply replaying 
the work the static linker does when assigning addresses, as the image 
loaded is copied to its intended destination at an early bootstrap stage.  
See: 
<http://www.linux-mips.org/cgi-bin/mesg.cgi?a=linux-mips&i=1449137297-30464-1-git-send-email-matt.redfearn%40imgtec.com> 
for details.

 I think this framework can be reused by carefully choosing instructions 
used in early bootstrap code, up to the relocation stage, so that it is 
runnable anywhere (not the same as PIC!) like early ld.so initialisation 
and then loading the whole attached image starting from an address where 
RAM does exist on target hardware.

 Endianness is a different matter, obviously we can't build a single image 
for both, although for distributions' sake an approach similar to one used 
with bi-endian firmware (for hardware which has an easy way to switch the 
endianness, e.g. a physical jumper or a configuration bit stored in flash 
memory; not to be confused with the reverse user endianness mode) might be 
feasible, by glueing two kernel images together and then selecting the 
right one early in bootstrap, perhaps again reusing Matt's framework.  
I'm not sure if this is worth the effort though, I suspect the usage level 
of this feature would be minimal.

 All in all I think making a generic MIPS kernel just might be feasible, 
but with the diversity of options available the effort required would be 
enormous.  NetBSD for example I believe supports building a kernel that 
correctly runs on both R3000 (MIPS I, 32-bit) and R4000 (MIPS III, 64-bit) 
DEC hardware (as did DEC Ultrix, the vendor OS for these systems).  These 
processors are different enough from each other that you cannot use the 
same code for cache, memory and exception management in an OS kernel -- 
backward compatibility is only provided for user software.  That proves 
the concept, however in a very limited way only, not even covering SMP, 
and their R4000 kernel does not support 64-bit userland I believe.  They 
still have completely separate ports for other MIPS hardware, such as for 
Broadcom SiByte SB-1 (MIPS64r1) processors.

  Maciej
Paul E. McKenney Jan. 27, 2016, 11:32 p.m. UTC | #79
On Wed, Jan 27, 2016 at 10:25:46AM +0000, Will Deacon wrote:
> On Tue, Jan 26, 2016 at 11:58:20AM -0800, Paul E. McKenney wrote:
> > On Tue, Jan 26, 2016 at 12:16:09PM +0000, Will Deacon wrote:
> > > On Mon, Jan 25, 2016 at 10:03:22PM -0800, Paul E. McKenney wrote:
> > > > On Mon, Jan 25, 2016 at 04:42:43PM +0000, Will Deacon wrote:
> > > > > On Fri, Jan 15, 2016 at 01:58:53PM -0800, Paul E. McKenney wrote:
> > > > > > PPC Overlapping Group-B sets version 4
> > > > > > ""
> > > > > > (* When the Group-B sets from two different barriers involve instructions in
> > > > > >    the same thread, within that thread one set must contain the other.
> > > > > > 
> > > > > > 	P0	P1	P2
> > > > > > 	Rx=1	Wy=1	Wz=2
> > > > > > 	dep.	lwsync	lwsync
> > > > > > 	Ry=0	Wz=1	Wx=1
> > > > > > 	Rz=1
> > > > > > 
> > > > > > 	assert(!(z=2))
> > > > > > 
> > > > > >    Forbidden by ppcmem, allowed by herd.
> > > > > > *)
> > > > > > {
> > > > > > 0:r1=x; 0:r2=y; 0:r3=z;
> > > > > > 1:r1=x; 1:r2=y; 1:r3=z; 1:r4=1;
> > > > > > 2:r1=x; 2:r2=y; 2:r3=z; 2:r4=1; 2:r5=2;
> > > > > > }
> > > > > >  P0		| P1		| P2		;
> > > > > >  lwz r6,0(r1)	| stw r4,0(r2)	| stw r5,0(r3)	;
> > > > > >  xor r7,r6,r6	| lwsync	| lwsync	;
> > > > > >  lwzx r7,r7,r2	| stw r4,0(r3)	| stw r4,0(r1)	;
> > > > > >  lwz r8,0(r3)	|		|		;
> > > > > > 
> > > > > > exists
> > > > > > (z=2 /\ 0:r6=1 /\ 0:r7=0 /\ 0:r8=1)
> > > > > 
> > > > > That really hurts. Assuming that the "assert(!(z=2))" is actually there
> > > > > to constrain the coherence order of z to be {0->1->2}, then I think that
> > > > > this test is forbidden on arm using dmb instead of lwsync. That said, I
> > > > > also don't think the Rz=1 in P0 changes anything.
> > > > 
> > > > What about the smp_wmb() variant of dmb that orders only stores?
> > > 
> > > Tricky, but I think it still works out if the coherence order of z is as
> > > I described above. The line of reasoning is weird though -- I ended up
> > > considering the two cases where P0 reads z before and after it reads x
> > > and what that means for the read of y.
> > 
> > By "works out" you mean that ARM prohibits the outcome?
> 
> Yes, that's my understanding.

Very good, we have agreement between the two architectures, then.  ;-)

							Thanx, Paul
Leonid Yegoshin Jan. 28, 2016, 12:58 a.m. UTC | #80
On 01/27/2016 03:26 AM, Maciej W. Rozycki wrote:
> On Fri, 15 Jan 2016, Leonid Yegoshin wrote:
>
>>> So you need to build a different kernel for some types of MIPS systems?
>>> Or do you do boot-time rewriting, like a number of other arches do?
>> I don't know. I would like to have responses. Ralf asked Maciej about old
>> systems and that came nowhere. Even rewrite - don't know what to do with that:
>> no lightweight SYNC or no SYNC at all - yes, it is still possible that SYNC on
>> some systems can be too heavy or even harmful, nobody tested that.
>   I don't recall being asked; mind that I might not get to messages I have
> not been cc-ed in a timely manner and I may miss some altogether.  With
> the amount of mailing list traffic that passes by me my scanner may fail
> to trigger.  Sorry if this causes anybody trouble, but such is life.
>
>   Coincidentally, I have just posted some notes on SYNC in a different
> thread, see <http://lkml.iu.edu/hypermail/linux/kernel/1601.3/03080.html>.
> There's a reference to an older message of mine there too.  I hope this
> answers your questions.
>
>    Maciej
In http://patchwork.linux-mips.org/patch/10505/the very last mesg 
exchange is:

Maciej,

do you have an R4000 / R4600 / R5000 / R7000 / SiByte system at hand to
test this?
...
   Ralf

Maciej W. Rozycki- June 5, 2015, 9:18 p.m.

On Fri, 5 Jun 2015, Ralf Baechle wrote:

> do you have an R4000 / R4600 / R5000 / R7000 / SiByte system at hand to
> test this?

  I should be able to check R4400 (that is virtually the same as R4000)
next week or so.  As to SiByte -- not before next month I'm afraid.  I
don't have access to any of the other processors you named.  You may
want to find a better person if you want to accept this change soon.

   Maciej

... and that stops forever...

- Leonid.
Maciej W. Rozycki Jan. 29, 2016, 1:38 p.m. UTC | #81
On Thu, 28 Jan 2016, Leonid Yegoshin wrote:

> In http://patchwork.linux-mips.org/patch/10505/ the very last mesg exchange
> is:
[...]
> ... and that stops forever...

 Thanks for the reminder -- last June was very hectic, I travelled a lot 
and I lost the discussion from my radar.  Apologies for that.  I replied 
in that thread now with my results.  I hope this helps.

  Maciej
diff mbox

Patch

diff --git a/arch/mips/include/asm/barrier.h b/arch/mips/include/asm/barrier.h
index 752e0b8..3eac4b9 100644
--- a/arch/mips/include/asm/barrier.h
+++ b/arch/mips/include/asm/barrier.h
@@ -10,9 +10,6 @@ 
 
 #include <asm/addrspace.h>
 
-#define read_barrier_depends()		do { } while(0)
-#define smp_read_barrier_depends()	do { } while(0)
-
 #ifdef CONFIG_CPU_HAS_SYNC
 #define __sync()				\
 	__asm__ __volatile__(			\
@@ -87,8 +84,6 @@ 
 
 #define wmb()		fast_wmb()
 #define rmb()		fast_rmb()
-#define dma_wmb()	fast_wmb()
-#define dma_rmb()	fast_rmb()
 
 #if defined(CONFIG_WEAK_ORDERING) && defined(CONFIG_SMP)
 # ifdef CONFIG_CPU_CAVIUM_OCTEON
@@ -112,9 +107,6 @@ 
 #define __WEAK_LLSC_MB		"		\n"
 #endif
 
-#define smp_store_mb(var, value) \
-	do { WRITE_ONCE(var, value); smp_mb(); } while (0)
-
 #define smp_llsc_mb()	__asm__ __volatile__(__WEAK_LLSC_MB : : :"memory")
 
 #ifdef CONFIG_CPU_CAVIUM_OCTEON
@@ -129,22 +121,9 @@ 
 #define nudge_writes() mb()
 #endif
 
-#define smp_store_release(p, v)						\
-do {									\
-	compiletime_assert_atomic_type(*p);				\
-	smp_mb();							\
-	WRITE_ONCE(*p, v);						\
-} while (0)
-
-#define smp_load_acquire(p)						\
-({									\
-	typeof(*p) ___p1 = READ_ONCE(*p);				\
-	compiletime_assert_atomic_type(*p);				\
-	smp_mb();							\
-	___p1;								\
-})
-
 #define smp_mb__before_atomic()	smp_mb__before_llsc()
 #define smp_mb__after_atomic()	smp_llsc_mb()
 
+#include <asm-generic/barrier.h>
+
 #endif /* __ASM_BARRIER_H */