Message ID | 20200630173734.14057-5-will@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Allow architectures to override __READ_ONCE() | expand |
On Tue, Jun 30, 2020 at 06:37:20PM +0100, Will Deacon wrote: > Rather then relying on the core code to use smp_read_barrier_depends() > as part of the READ_ONCE() definition, instead override __READ_ONCE() > in the Alpha code so that it is treated the same way as > smp_load_acquire(). > > Acked-by: Paul E. McKenney <paulmck@kernel.org> > Signed-off-by: Will Deacon <will@kernel.org> > --- > arch/alpha/include/asm/barrier.h | 61 ++++---------------------------- > arch/alpha/include/asm/rwonce.h | 19 ++++++++++ > 2 files changed, 26 insertions(+), 54 deletions(-) > create mode 100644 arch/alpha/include/asm/rwonce.h > > diff --git a/arch/alpha/include/asm/barrier.h b/arch/alpha/include/asm/barrier.h > index 92ec486a4f9e..2ecd068d91d1 100644 > --- a/arch/alpha/include/asm/barrier.h > +++ b/arch/alpha/include/asm/barrier.h > @@ -2,64 +2,17 @@ > #ifndef __BARRIER_H > #define __BARRIER_H > > -#include <asm/compiler.h> > - > #define mb() __asm__ __volatile__("mb": : :"memory") > #define rmb() __asm__ __volatile__("mb": : :"memory") > #define wmb() __asm__ __volatile__("wmb": : :"memory") > -#define read_barrier_depends() __asm__ __volatile__("mb": : :"memory") > +#define __smp_load_acquire(p) \ > +({ \ > + __unqual_scalar_typeof(*p) ___p1 = \ > + (*(volatile typeof(___p1) *)(p)); \ > + compiletime_assert_atomic_type(*p); \ > + ___p1; \ > +}) Sorry if I'm being thick, but doesn't this need a barrier after the volatile access to provide the acquire semantic? IIUC prior to this commit alpha would have used the asm-generic __smp_load_acquire, i.e. | #ifndef __smp_load_acquire | #define __smp_load_acquire(p) \ | ({ \ | __unqual_scalar_typeof(*p) ___p1 = READ_ONCE(*p); \ | compiletime_assert_atomic_type(*p); \ | __smp_mb(); \ | (typeof(*p))___p1; \ | }) | #endif ... where the __smp_mb() would be alpha's mb() from earlier in the patch context, i.e. | #define mb() __asm__ __volatile__("mb": : :"memory") ... so don't we need similar before returning ___p1 above in __smp_load_acquire() (and also matching the old read_barrier_depends())? [...] > +#include <asm/barrier.h> > + > +/* > + * Alpha is apparently daft enough to reorder address-dependent loads > + * on some CPU implementations. Knock some common sense into it with > + * a memory barrier in READ_ONCE(). > + */ > +#define __READ_ONCE(x) __smp_load_acquire(&(x)) As above, I don't see a memory barrier implied here, so this doesn't look quite right. Thanks, Mark.
On Thu, Jul 02, 2020 at 10:32:39AM +0100, Mark Rutland wrote: > On Tue, Jun 30, 2020 at 06:37:20PM +0100, Will Deacon wrote: > > -#define read_barrier_depends() __asm__ __volatile__("mb": : :"memory") > > +#define __smp_load_acquire(p) \ > > +({ \ > > + __unqual_scalar_typeof(*p) ___p1 = \ > > + (*(volatile typeof(___p1) *)(p)); \ > > + compiletime_assert_atomic_type(*p); \ > > + ___p1; \ > > +}) > > Sorry if I'm being thick, but doesn't this need a barrier after the > volatile access to provide the acquire semantic? > > IIUC prior to this commit alpha would have used the asm-generic > __smp_load_acquire, i.e. > > | #ifndef __smp_load_acquire > | #define __smp_load_acquire(p) \ > | ({ \ > | __unqual_scalar_typeof(*p) ___p1 = READ_ONCE(*p); \ > | compiletime_assert_atomic_type(*p); \ > | __smp_mb(); \ > | (typeof(*p))___p1; \ > | }) > | #endif > > ... where the __smp_mb() would be alpha's mb() from earlier in the patch > context, i.e. > > | #define mb() __asm__ __volatile__("mb": : :"memory") > > ... so don't we need similar before returning ___p1 above in > __smp_load_acquire() (and also matching the old read_barrier_depends())? > > [...] > > > +#include <asm/barrier.h> > > + > > +/* > > + * Alpha is apparently daft enough to reorder address-dependent loads > > + * on some CPU implementations. Knock some common sense into it with > > + * a memory barrier in READ_ONCE(). > > + */ > > +#define __READ_ONCE(x) __smp_load_acquire(&(x)) > > As above, I don't see a memory barrier implied here, so this doesn't > look quite right. You're right, and Peter spotted the same thing off-list. I've reworked locally so that the mb() ends up in __READ_ONCE() and __smp_load_acquire() calles __READ_ONCE() instead of the other way round (which made more sense before the rework in the merge window). Will
On Thu, Jul 2, 2020 at 11:48 AM Will Deacon <will@kernel.org> wrote: > On Thu, Jul 02, 2020 at 10:32:39AM +0100, Mark Rutland wrote: > > On Tue, Jun 30, 2020 at 06:37:20PM +0100, Will Deacon wrote: > > > -#define read_barrier_depends() __asm__ __volatile__("mb": : :"memory") > > > +#define __smp_load_acquire(p) \ > > > +({ \ > > > + __unqual_scalar_typeof(*p) ___p1 = \ > > > + (*(volatile typeof(___p1) *)(p)); \ > > > + compiletime_assert_atomic_type(*p); \ > > > + ___p1; \ > > > +}) > > > > Sorry if I'm being thick, but doesn't this need a barrier after the > > volatile access to provide the acquire semantic? > > > > IIUC prior to this commit alpha would have used the asm-generic > > __smp_load_acquire, i.e. > > > > | #ifndef __smp_load_acquire > > | #define __smp_load_acquire(p) \ > > | ({ \ > > | __unqual_scalar_typeof(*p) ___p1 = READ_ONCE(*p); \ > > | compiletime_assert_atomic_type(*p); \ > > | __smp_mb(); \ > > | (typeof(*p))___p1; \ > > | }) > > | #endif I also have a question that I didn't dare ask when the same code came up before (I guess it's also what's in the kernel today): With the cast to 'typeof(*p)' at the end, doesn't that mean we lose the effect of __unqual_scalar_typeof() again, so any "volatile" pointer passed into __READ_ONCE_SCALAR() or __smp_load_acquire() still leads to a volatile load of the original variable, plus another volatile access to ___p1 after spilling it to the stack as a non-volatile variable? I hope I'm missing something obvious here. Arnd
On Thu, Jul 02, 2020 at 12:08:41PM +0200, Arnd Bergmann wrote: > On Thu, Jul 2, 2020 at 11:48 AM Will Deacon <will@kernel.org> wrote: > > On Thu, Jul 02, 2020 at 10:32:39AM +0100, Mark Rutland wrote: > > > On Tue, Jun 30, 2020 at 06:37:20PM +0100, Will Deacon wrote: > > > > -#define read_barrier_depends() __asm__ __volatile__("mb": : :"memory") > > > > +#define __smp_load_acquire(p) \ > > > > +({ \ > > > > + __unqual_scalar_typeof(*p) ___p1 = \ > > > > + (*(volatile typeof(___p1) *)(p)); \ > > > > + compiletime_assert_atomic_type(*p); \ > > > > + ___p1; \ > > > > +}) > > > > > > Sorry if I'm being thick, but doesn't this need a barrier after the > > > volatile access to provide the acquire semantic? > > > > > > IIUC prior to this commit alpha would have used the asm-generic > > > __smp_load_acquire, i.e. > > > > > > | #ifndef __smp_load_acquire > > > | #define __smp_load_acquire(p) \ > > > | ({ \ > > > | __unqual_scalar_typeof(*p) ___p1 = READ_ONCE(*p); \ > > > | compiletime_assert_atomic_type(*p); \ > > > | __smp_mb(); \ > > > | (typeof(*p))___p1; \ > > > | }) > > > | #endif > > I also have a question that I didn't dare ask when the same > code came up before (I guess it's also what's in the kernel today): > > With the cast to 'typeof(*p)' at the end, doesn't that mean we > lose the effect of __unqual_scalar_typeof() again, so any "volatile" > pointer passed into __READ_ONCE_SCALAR() or > __smp_load_acquire() still leads to a volatile load of the original > variable, plus another volatile access to ___p1 after > spilling it to the stack as a non-volatile variable? Not sure I follow you here, but I can confirm that what you're worried about doesn't happen for the usual case of a pointer-to-volatile scalar. For example, ignoring dependency ordering: unsigned long foo(volatile unsigned long *p) { return smp_load_acquire(p) + 1; } Ends up looking like: unsigned long ___p1 = *(const volatile unsigned long *)p; smp_mb(); (volatile unsigned long)___p1; My understanding is that casting a non-pointer type to volatile doesn't do anything, so we're good. On the other hand, you can still cause the stack reload if you use volatile pointers to volatile: volatile unsigned long *bar(volatile unsigned long * volatile *ptr) { return READ_ONCE(*ptr); } but this is pretty weird code, I think. Will
On Thu, Jul 2, 2020 at 1:18 PM Will Deacon <will@kernel.org> wrote: > On Thu, Jul 02, 2020 at 12:08:41PM +0200, Arnd Bergmann wrote: > > On Thu, Jul 2, 2020 at 11:48 AM Will Deacon <will@kernel.org> wrote: > > > On Thu, Jul 02, 2020 at 10:32:39AM +0100, Mark Rutland wrote: > Not sure I follow you here, but I can confirm that what you're worried > about doesn't happen for the usual case of a pointer-to-volatile scalar. > > For example, ignoring dependency ordering: > > unsigned long foo(volatile unsigned long *p) > { > return smp_load_acquire(p) + 1; > } > > Ends up looking like: > > unsigned long ___p1 = *(const volatile unsigned long *)p; > smp_mb(); > (volatile unsigned long)___p1; > > My understanding is that casting a non-pointer type to volatile doesn't > do anything, so we're good. Right, I mixed up the correct (typeof(*p))___p; with the incorrect *typeof(p)&___p; which would dereference a volatile pointer and cause the problem. The code is all fine then. Arnd
On Tue, Jun 30, 2020 at 1:38 PM Will Deacon <will@kernel.org> wrote: > > Rather then relying on the core code to use smp_read_barrier_depends() > as part of the READ_ONCE() definition, instead override __READ_ONCE() > in the Alpha code so that it is treated the same way as > smp_load_acquire(). > > Acked-by: Paul E. McKenney <paulmck@kernel.org> > Signed-off-by: Will Deacon <will@kernel.org> > --- > arch/alpha/include/asm/barrier.h | 61 ++++---------------------------- > arch/alpha/include/asm/rwonce.h | 19 ++++++++++ > 2 files changed, 26 insertions(+), 54 deletions(-) > create mode 100644 arch/alpha/include/asm/rwonce.h > > diff --git a/arch/alpha/include/asm/barrier.h b/arch/alpha/include/asm/barrier.h > index 92ec486a4f9e..2ecd068d91d1 100644 > --- a/arch/alpha/include/asm/barrier.h > +++ b/arch/alpha/include/asm/barrier.h > @@ -2,64 +2,17 @@ > #ifndef __BARRIER_H > #define __BARRIER_H > > -#include <asm/compiler.h> > - > #define mb() __asm__ __volatile__("mb": : :"memory") > #define rmb() __asm__ __volatile__("mb": : :"memory") > #define wmb() __asm__ __volatile__("wmb": : :"memory") > > -/** > - * read_barrier_depends - Flush all pending reads that subsequents reads > - * depend on. > - * > - * No data-dependent reads from memory-like regions are ever reordered > - * over this barrier. All reads preceding this primitive are guaranteed > - * to access memory (but not necessarily other CPUs' caches) before any > - * reads following this primitive that depend on the data return by > - * any of the preceding reads. This primitive is much lighter weight than > - * rmb() on most CPUs, and is never heavier weight than is > - * rmb(). > - * > - * These ordering constraints are respected by both the local CPU > - * and the compiler. > - * > - * Ordering is not guaranteed by anything other than these primitives, > - * not even by data dependencies. See the documentation for > - * memory_barrier() for examples and URLs to more information. > - * > - * For example, the following code would force ordering (the initial > - * value of "a" is zero, "b" is one, and "p" is "&a"): > - * > - * <programlisting> > - * CPU 0 CPU 1 > - * > - * b = 2; > - * memory_barrier(); > - * p = &b; q = p; > - * read_barrier_depends(); > - * d = *q; > - * </programlisting> > - * > - * because the read of "*q" depends on the read of "p" and these > - * two reads are separated by a read_barrier_depends(). However, > - * the following code, with the same initial values for "a" and "b": > - * Would it be Ok to keep this example in the kernel sources? I think it serves as good documentation and highlights the issue in the Alpha architecture well. > - * <programlisting> > - * CPU 0 CPU 1 > - * > - * a = 2; > - * memory_barrier(); > - * b = 3; y = b; > - * read_barrier_depends(); > - * x = a; > - * </programlisting> > - * > - * does not enforce ordering, since there is no data dependency between > - * the read of "a" and the read of "b". Therefore, on some CPUs, such > - * as Alpha, "y" could be set to 3 and "x" to 0. Use rmb() > - * in cases like this where there are no data dependencies. > - */ > -#define read_barrier_depends() __asm__ __volatile__("mb": : :"memory") > +#define __smp_load_acquire(p) \ > +({ \ > + __unqual_scalar_typeof(*p) ___p1 = \ > + (*(volatile typeof(___p1) *)(p)); \ > + compiletime_assert_atomic_type(*p); \ > + ___p1; \ > +}) I had the same question as Mark about the need for a memory barrier here, otherwise alpha will again break right? Looking forward to the future fix you mentioned. BTW, do you know any architecture where speculative execution of address-dependent loads can cause similar misorderings? That would be pretty insane though. In Alpha's case it is not speculation but rather the split local cache design as the docs mention. The reason I ask is it is pretty amusing that control-dependent loads do have such misordering issues due to speculative branch execution and I wondered what other games the CPUs are playing. FWIW I ran into [1] which talks about analogy between memory dependence and control dependence. [1] https://en.wikipedia.org/wiki/Memory_dependence_prediction - Joel > > #ifdef CONFIG_SMP > #define __ASM_SMP_MB "\tmb\n" > diff --git a/arch/alpha/include/asm/rwonce.h b/arch/alpha/include/asm/rwonce.h > new file mode 100644 > index 000000000000..83a92e49a615 > --- /dev/null > +++ b/arch/alpha/include/asm/rwonce.h > @@ -0,0 +1,19 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +/* > + * Copyright (C) 2019 Google LLC. > + */ > +#ifndef __ASM_RWONCE_H > +#define __ASM_RWONCE_H > + > +#include <asm/barrier.h> > + > +/* > + * Alpha is apparently daft enough to reorder address-dependent loads > + * on some CPU implementations. Knock some common sense into it with > + * a memory barrier in READ_ONCE(). > + */ > +#define __READ_ONCE(x) __smp_load_acquire(&(x)) > + > +#include <asm-generic/rwonce.h> > + > +#endif /* __ASM_RWONCE_H */ > -- > 2.27.0.212.ge8ba1cc988-goog > > -- > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com. >
Hi Joel, On Thu, Jul 02, 2020 at 10:43:55AM -0400, Joel Fernandes wrote: > On Tue, Jun 30, 2020 at 1:38 PM Will Deacon <will@kernel.org> wrote: > > diff --git a/arch/alpha/include/asm/barrier.h b/arch/alpha/include/asm/barrier.h > > index 92ec486a4f9e..2ecd068d91d1 100644 > > --- a/arch/alpha/include/asm/barrier.h > > +++ b/arch/alpha/include/asm/barrier.h > > - * For example, the following code would force ordering (the initial > > - * value of "a" is zero, "b" is one, and "p" is "&a"): > > - * > > - * <programlisting> > > - * CPU 0 CPU 1 > > - * > > - * b = 2; > > - * memory_barrier(); > > - * p = &b; q = p; > > - * read_barrier_depends(); > > - * d = *q; > > - * </programlisting> > > - * > > - * because the read of "*q" depends on the read of "p" and these > > - * two reads are separated by a read_barrier_depends(). However, > > - * the following code, with the same initial values for "a" and "b": > > - * > > Would it be Ok to keep this example in the kernel sources? I think it > serves as good documentation and highlights the issue in the Alpha > architecture well. I'd _really_ like to remove it, as I think it only serves to confuse people on a topic that is confusing enough already. Paul's perfbook [1] already has plenty of information about this, so I don't think we need to repeat that here. I could add a citation, perhaps? > > - * <programlisting> > > - * CPU 0 CPU 1 > > - * > > - * a = 2; > > - * memory_barrier(); > > - * b = 3; y = b; > > - * read_barrier_depends(); > > - * x = a; > > - * </programlisting> > > - * > > - * does not enforce ordering, since there is no data dependency between > > - * the read of "a" and the read of "b". Therefore, on some CPUs, such > > - * as Alpha, "y" could be set to 3 and "x" to 0. Use rmb() > > - * in cases like this where there are no data dependencies. > > - */ > > -#define read_barrier_depends() __asm__ __volatile__("mb": : :"memory") > > +#define __smp_load_acquire(p) \ > > +({ \ > > + __unqual_scalar_typeof(*p) ___p1 = \ > > + (*(volatile typeof(___p1) *)(p)); \ > > + compiletime_assert_atomic_type(*p); \ > > + ___p1; \ > > +}) > > I had the same question as Mark about the need for a memory barrier > here, otherwise alpha will again break right? Looking forward to the > future fix you mentioned. Yeah, sorry about that. It went missing somehow during the rebase, it seems. > BTW, do you know any architecture where speculative execution of > address-dependent loads can cause similar misorderings? That would be > pretty insane though. In Alpha's case it is not speculation but rather > the split local cache design as the docs mention. The reason I ask > is it is pretty amusing that control-dependent loads do have such > misordering issues due to speculative branch execution and I wondered > what other games the CPUs are playing. FWIW I ran into [1] which talks > about analogy between memory dependence and control dependence. I think you're asking about value prediction, and the implications it would have on address-dependent loads where the address can itself be predicted. I'm not aware of an CPUs where that is observable architecturally. arm64 has some load instructions that do not honour address dependencies, but I believe that's mainly to enable alternative cache designs for things like non-temporal and large vector loads. Will [1] https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html
On Thu, Jul 2, 2020 at 10:55 AM Will Deacon <will@kernel.org> wrote: > On Thu, Jul 02, 2020 at 10:43:55AM -0400, Joel Fernandes wrote: > > On Tue, Jun 30, 2020 at 1:38 PM Will Deacon <will@kernel.org> wrote: > > > diff --git a/arch/alpha/include/asm/barrier.h b/arch/alpha/include/asm/barrier.h > > > index 92ec486a4f9e..2ecd068d91d1 100644 > > > --- a/arch/alpha/include/asm/barrier.h > > > +++ b/arch/alpha/include/asm/barrier.h > > > - * For example, the following code would force ordering (the initial > > > - * value of "a" is zero, "b" is one, and "p" is "&a"): > > > - * > > > - * <programlisting> > > > - * CPU 0 CPU 1 > > > - * > > > - * b = 2; > > > - * memory_barrier(); > > > - * p = &b; q = p; > > > - * read_barrier_depends(); > > > - * d = *q; > > > - * </programlisting> > > > - * > > > - * because the read of "*q" depends on the read of "p" and these > > > - * two reads are separated by a read_barrier_depends(). However, > > > - * the following code, with the same initial values for "a" and "b": > > > - * > > > > Would it be Ok to keep this example in the kernel sources? I think it > > serves as good documentation and highlights the issue in the Alpha > > architecture well. > > I'd _really_ like to remove it, as I think it only serves to confuse people > on a topic that is confusing enough already. Paul's perfbook [1] already has > plenty of information about this, so I don't think we need to repeat that > here. I could add a citation, perhaps? True, and also found that LKMM docs and the memory-barriers.txt talks about it, so removing it here sounds good to me. Maybe a reference here to either documentation should be Ok. > > BTW, do you know any architecture where speculative execution of > > address-dependent loads can cause similar misorderings? That would be > > pretty insane though. In Alpha's case it is not speculation but rather > > the split local cache design as the docs mention. The reason I ask > > is it is pretty amusing that control-dependent loads do have such > > misordering issues due to speculative branch execution and I wondered > > what other games the CPUs are playing. FWIW I ran into [1] which talks > > about analogy between memory dependence and control dependence. > > I think you're asking about value prediction, and the implications it would > have on address-dependent loads where the address can itself be predicted. Yes. > I'm not aware of an CPUs where that is observable architecturally. I see. > arm64 has some load instructions that do not honour address dependencies, > but I believe that's mainly to enable alternative cache designs for things > like non-temporal and large vector loads. Good to know this, thanks. - Joel
diff --git a/arch/alpha/include/asm/barrier.h b/arch/alpha/include/asm/barrier.h index 92ec486a4f9e..2ecd068d91d1 100644 --- a/arch/alpha/include/asm/barrier.h +++ b/arch/alpha/include/asm/barrier.h @@ -2,64 +2,17 @@ #ifndef __BARRIER_H #define __BARRIER_H -#include <asm/compiler.h> - #define mb() __asm__ __volatile__("mb": : :"memory") #define rmb() __asm__ __volatile__("mb": : :"memory") #define wmb() __asm__ __volatile__("wmb": : :"memory") -/** - * read_barrier_depends - Flush all pending reads that subsequents reads - * depend on. - * - * No data-dependent reads from memory-like regions are ever reordered - * over this barrier. All reads preceding this primitive are guaranteed - * to access memory (but not necessarily other CPUs' caches) before any - * reads following this primitive that depend on the data return by - * any of the preceding reads. This primitive is much lighter weight than - * rmb() on most CPUs, and is never heavier weight than is - * rmb(). - * - * These ordering constraints are respected by both the local CPU - * and the compiler. - * - * Ordering is not guaranteed by anything other than these primitives, - * not even by data dependencies. See the documentation for - * memory_barrier() for examples and URLs to more information. - * - * For example, the following code would force ordering (the initial - * value of "a" is zero, "b" is one, and "p" is "&a"): - * - * <programlisting> - * CPU 0 CPU 1 - * - * b = 2; - * memory_barrier(); - * p = &b; q = p; - * read_barrier_depends(); - * d = *q; - * </programlisting> - * - * because the read of "*q" depends on the read of "p" and these - * two reads are separated by a read_barrier_depends(). However, - * the following code, with the same initial values for "a" and "b": - * - * <programlisting> - * CPU 0 CPU 1 - * - * a = 2; - * memory_barrier(); - * b = 3; y = b; - * read_barrier_depends(); - * x = a; - * </programlisting> - * - * does not enforce ordering, since there is no data dependency between - * the read of "a" and the read of "b". Therefore, on some CPUs, such - * as Alpha, "y" could be set to 3 and "x" to 0. Use rmb() - * in cases like this where there are no data dependencies. - */ -#define read_barrier_depends() __asm__ __volatile__("mb": : :"memory") +#define __smp_load_acquire(p) \ +({ \ + __unqual_scalar_typeof(*p) ___p1 = \ + (*(volatile typeof(___p1) *)(p)); \ + compiletime_assert_atomic_type(*p); \ + ___p1; \ +}) #ifdef CONFIG_SMP #define __ASM_SMP_MB "\tmb\n" diff --git a/arch/alpha/include/asm/rwonce.h b/arch/alpha/include/asm/rwonce.h new file mode 100644 index 000000000000..83a92e49a615 --- /dev/null +++ b/arch/alpha/include/asm/rwonce.h @@ -0,0 +1,19 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (C) 2019 Google LLC. + */ +#ifndef __ASM_RWONCE_H +#define __ASM_RWONCE_H + +#include <asm/barrier.h> + +/* + * Alpha is apparently daft enough to reorder address-dependent loads + * on some CPU implementations. Knock some common sense into it with + * a memory barrier in READ_ONCE(). + */ +#define __READ_ONCE(x) __smp_load_acquire(&(x)) + +#include <asm-generic/rwonce.h> + +#endif /* __ASM_RWONCE_H */