diff mbox series

[18/18] arm64: lto: Strengthen READ_ONCE() to acquire when CLANG_LTO=y

Message ID 20200630173734.14057-19-will@kernel.org (mailing list archive)
State New, archived
Headers show
Series Allow architectures to override __READ_ONCE() | expand

Commit Message

Will Deacon June 30, 2020, 5:37 p.m. UTC
When building with LTO, there is an increased risk of the compiler
converting an address dependency headed by a READ_ONCE() invocation
into a control dependency and consequently allowing for harmful
reordering by the CPU.

Ensure that such transformations are harmless by overriding the generic
READ_ONCE() definition with one that provides acquire semantics when
building with LTO.

Signed-off-by: Will Deacon <will@kernel.org>
---
 arch/arm64/include/asm/rwonce.h   | 63 +++++++++++++++++++++++++++++++
 arch/arm64/kernel/vdso/Makefile   |  2 +-
 arch/arm64/kernel/vdso32/Makefile |  2 +-
 3 files changed, 65 insertions(+), 2 deletions(-)
 create mode 100644 arch/arm64/include/asm/rwonce.h

Comments

Arnd Bergmann June 30, 2020, 7:25 p.m. UTC | #1
On Tue, Jun 30, 2020 at 7:39 PM Will Deacon <will@kernel.org> wrote:
> +#define __READ_ONCE(x)                                                 \
> +({                                                                     \
> +       int atomic = 1;                                                 \
> +       union { __unqual_scalar_typeof(x) __val; char __c[1]; } __u;    \
> +       typeof(&(x)) __x = &(x);                                        \
> +       switch (sizeof(x)) {                                            \
...
> +       atomic ? (typeof(x))__u.__val : (*(volatile typeof(x) *)__x);   \
> +})

This expands (x) nine times (five in __unqual_scala_typeof()), which can
lead to significant code bloat after preprocessing if something passes a
compound expression into READ_ONCE().
The compiler works it out eventually, but we've seen an actual slowdown
in compile speed from this recently, especially on clang.

I think if you move the

        typeof(&(x)) __x = &(x);

line first, all other instances can use typeof(*__x) instead of typeof(x)
and avoid this problem. Once we make gcc-4.9 the minimum version,
this could be further improved to

       __auto_type __x = &(x);

       Arnd
Marco Elver June 30, 2020, 7:47 p.m. UTC | #2
On Tue, 30 Jun 2020 at 19:39, Will Deacon <will@kernel.org> wrote:
>
> When building with LTO, there is an increased risk of the compiler
> converting an address dependency headed by a READ_ONCE() invocation
> into a control dependency and consequently allowing for harmful
> reordering by the CPU.
>
> Ensure that such transformations are harmless by overriding the generic
> READ_ONCE() definition with one that provides acquire semantics when
> building with LTO.
>
> Signed-off-by: Will Deacon <will@kernel.org>
> ---
>  arch/arm64/include/asm/rwonce.h   | 63 +++++++++++++++++++++++++++++++
>  arch/arm64/kernel/vdso/Makefile   |  2 +-
>  arch/arm64/kernel/vdso32/Makefile |  2 +-
>  3 files changed, 65 insertions(+), 2 deletions(-)
>  create mode 100644 arch/arm64/include/asm/rwonce.h

This seems reasonable, given we can't realistically tell the compiler
about dependent loads. What (if any), is the performance impact? I
guess this also heavily depends on the actual silicon.

I do wonder, though, if there is some way to make the compiler do
something better for us. Clearly, implementing real
memory_order_consume hasn't worked out until today. But maybe the
compiler could promote dependent loads to acquires if it recognizes it
lost dependencies during optimizations. Just thinking out loud, it
probably still has some weird corner case that will break. ;-)

The other thing is that I'd be cautious blaming LTO, as I tried to
summarize here:
https://lore.kernel.org/kernel-hardening/20200630191931.GA884155@elver.google.com/

The main thing is that, yes, this might be something to be worried
about, but if we are worried about it, we need to be worried about it
in *all* builds (LTO or not). My guess is that's not acceptable. Would
it be better to just guard the promotion of READ_ONCE() to acquire
behind a config option like CONFIG_ACQUIRE_READ_DEPENDENCIES, and then
make LTO select that (or maybe leave it optional?). In future, for
very aggressive non-LTO compilers even, one may then also select that
if there is substantiated worry things do actually break.

Thanks,
-- Marco
Peter Zijlstra June 30, 2020, 8:20 p.m. UTC | #3
On Tue, Jun 30, 2020 at 09:47:30PM +0200, Marco Elver wrote:
> I do wonder, though, if there is some way to make the compiler do
> something better for us. Clearly, implementing real
> memory_order_consume hasn't worked out until today. But maybe the
> compiler could promote dependent loads to acquires if it recognizes it
> lost dependencies during optimizations. Just thinking out loud, it
> probably still has some weird corner case that will break. ;-)

I'd be very hesitant to let the compiler upgrade the ordering for us,
specifically because we're not using C11 crud and are using a lot of
inline asm.
Sami Tolvanen June 30, 2020, 10:57 p.m. UTC | #4
On Tue, Jun 30, 2020 at 12:47 PM Marco Elver <elver@google.com> wrote:
>
> On Tue, 30 Jun 2020 at 19:39, Will Deacon <will@kernel.org> wrote:
> >
> > When building with LTO, there is an increased risk of the compiler
> > converting an address dependency headed by a READ_ONCE() invocation
> > into a control dependency and consequently allowing for harmful
> > reordering by the CPU.
> >
> > Ensure that such transformations are harmless by overriding the generic
> > READ_ONCE() definition with one that provides acquire semantics when
> > building with LTO.
> >
> > Signed-off-by: Will Deacon <will@kernel.org>
> > ---
> >  arch/arm64/include/asm/rwonce.h   | 63 +++++++++++++++++++++++++++++++
> >  arch/arm64/kernel/vdso/Makefile   |  2 +-
> >  arch/arm64/kernel/vdso32/Makefile |  2 +-
> >  3 files changed, 65 insertions(+), 2 deletions(-)
> >  create mode 100644 arch/arm64/include/asm/rwonce.h
>
> This seems reasonable, given we can't realistically tell the compiler
> about dependent loads. What (if any), is the performance impact? I
> guess this also heavily depends on the actual silicon.
>
> I do wonder, though, if there is some way to make the compiler do
> something better for us. Clearly, implementing real
> memory_order_consume hasn't worked out until today. But maybe the
> compiler could promote dependent loads to acquires if it recognizes it
> lost dependencies during optimizations. Just thinking out loud, it
> probably still has some weird corner case that will break. ;-)
>
> The other thing is that I'd be cautious blaming LTO, as I tried to
> summarize here:
> https://lore.kernel.org/kernel-hardening/20200630191931.GA884155@elver.google.com/
>
> The main thing is that, yes, this might be something to be worried
> about, but if we are worried about it, we need to be worried about it
> in *all* builds (LTO or not). My guess is that's not acceptable. Would
> it be better to just guard the promotion of READ_ONCE() to acquire
> behind a config option like CONFIG_ACQUIRE_READ_DEPENDENCIES, and then
> make LTO select that (or maybe leave it optional?). In future, for
> very aggressive non-LTO compilers even, one may then also select that
> if there is substantiated worry things do actually break.

I agree, a separate config option would be better here.

Also Will, the LTO patches use CONFIG_LTO_CLANG instead of CLANG_LTO.

Sami
Will Deacon July 1, 2020, 10:19 a.m. UTC | #5
On Tue, Jun 30, 2020 at 09:25:03PM +0200, Arnd Bergmann wrote:
> On Tue, Jun 30, 2020 at 7:39 PM Will Deacon <will@kernel.org> wrote:
> > +#define __READ_ONCE(x)                                                 \
> > +({                                                                     \
> > +       int atomic = 1;                                                 \
> > +       union { __unqual_scalar_typeof(x) __val; char __c[1]; } __u;    \
> > +       typeof(&(x)) __x = &(x);                                        \
> > +       switch (sizeof(x)) {                                            \
> ...
> > +       atomic ? (typeof(x))__u.__val : (*(volatile typeof(x) *)__x);   \
> > +})
> 
> This expands (x) nine times (five in __unqual_scala_typeof()), which can
> lead to significant code bloat after preprocessing if something passes a
> compound expression into READ_ONCE().
> The compiler works it out eventually, but we've seen an actual slowdown
> in compile speed from this recently, especially on clang.
> 
> I think if you move the
> 
>         typeof(&(x)) __x = &(x);
> 
> line first, all other instances can use typeof(*__x) instead of typeof(x)
> and avoid this problem.

Cheers, I was only thinking about side-effects when I wrote this, but
bloating built time is very unpopular, so I'll go with your suggestion.

> Once we make gcc-4.9 the minimum version,
> this could be further improved to
> 
>        __auto_type __x = &(x);

Is anybody working on moving to 4.9? I've seen the mails from Linus
championing it, but I thought there was a RHEL in support that people
might care about?

Will
Will Deacon July 1, 2020, 10:24 a.m. UTC | #6
On Tue, Jun 30, 2020 at 09:47:30PM +0200, Marco Elver wrote:
> On Tue, 30 Jun 2020 at 19:39, Will Deacon <will@kernel.org> wrote:
> >
> > When building with LTO, there is an increased risk of the compiler
> > converting an address dependency headed by a READ_ONCE() invocation
> > into a control dependency and consequently allowing for harmful
> > reordering by the CPU.
> >
> > Ensure that such transformations are harmless by overriding the generic
> > READ_ONCE() definition with one that provides acquire semantics when
> > building with LTO.
> >
> > Signed-off-by: Will Deacon <will@kernel.org>
> > ---
> >  arch/arm64/include/asm/rwonce.h   | 63 +++++++++++++++++++++++++++++++
> >  arch/arm64/kernel/vdso/Makefile   |  2 +-
> >  arch/arm64/kernel/vdso32/Makefile |  2 +-
> >  3 files changed, 65 insertions(+), 2 deletions(-)
> >  create mode 100644 arch/arm64/include/asm/rwonce.h
> 
> This seems reasonable, given we can't realistically tell the compiler
> about dependent loads. What (if any), is the performance impact? I
> guess this also heavily depends on the actual silicon.

Right, it depends both on the CPU micro-architecture and also the workload.
When we ran some basic tests, the overhead wasn't greater than the benefit
seen by enabling LTO, so it seems like a reasonable trade-off (given that
LTO is a dependency for CFI, so it's not just about performance).

Will
Will Deacon July 1, 2020, 10:25 a.m. UTC | #7
On Tue, Jun 30, 2020 at 03:57:54PM -0700, Sami Tolvanen wrote:
> On Tue, Jun 30, 2020 at 12:47 PM Marco Elver <elver@google.com> wrote:
> >
> > On Tue, 30 Jun 2020 at 19:39, Will Deacon <will@kernel.org> wrote:
> > >
> > > When building with LTO, there is an increased risk of the compiler
> > > converting an address dependency headed by a READ_ONCE() invocation
> > > into a control dependency and consequently allowing for harmful
> > > reordering by the CPU.
> > >
> > > Ensure that such transformations are harmless by overriding the generic
> > > READ_ONCE() definition with one that provides acquire semantics when
> > > building with LTO.
> > >
> > > Signed-off-by: Will Deacon <will@kernel.org>
> > > ---
> > >  arch/arm64/include/asm/rwonce.h   | 63 +++++++++++++++++++++++++++++++
> > >  arch/arm64/kernel/vdso/Makefile   |  2 +-
> > >  arch/arm64/kernel/vdso32/Makefile |  2 +-
> > >  3 files changed, 65 insertions(+), 2 deletions(-)
> > >  create mode 100644 arch/arm64/include/asm/rwonce.h
> >
> > This seems reasonable, given we can't realistically tell the compiler
> > about dependent loads. What (if any), is the performance impact? I
> > guess this also heavily depends on the actual silicon.
> >
> > I do wonder, though, if there is some way to make the compiler do
> > something better for us. Clearly, implementing real
> > memory_order_consume hasn't worked out until today. But maybe the
> > compiler could promote dependent loads to acquires if it recognizes it
> > lost dependencies during optimizations. Just thinking out loud, it
> > probably still has some weird corner case that will break. ;-)
> >
> > The other thing is that I'd be cautious blaming LTO, as I tried to
> > summarize here:
> > https://lore.kernel.org/kernel-hardening/20200630191931.GA884155@elver.google.com/
> >
> > The main thing is that, yes, this might be something to be worried
> > about, but if we are worried about it, we need to be worried about it
> > in *all* builds (LTO or not). My guess is that's not acceptable. Would
> > it be better to just guard the promotion of READ_ONCE() to acquire
> > behind a config option like CONFIG_ACQUIRE_READ_DEPENDENCIES, and then
> > make LTO select that (or maybe leave it optional?). In future, for
> > very aggressive non-LTO compilers even, one may then also select that
> > if there is substantiated worry things do actually break.
> 
> I agree, a separate config option would be better here.
> 
> Also Will, the LTO patches use CONFIG_LTO_CLANG instead of CLANG_LTO.

D'oh, sorry. I'll fix that (I had that #ifdef commented out for my testing).

Will
Arnd Bergmann July 1, 2020, 10:59 a.m. UTC | #8
On Wed, Jul 1, 2020 at 12:19 PM Will Deacon <will@kernel.org> wrote:
> On Tue, Jun 30, 2020 at 09:25:03PM +0200, Arnd Bergmann wrote:
> > On Tue, Jun 30, 2020 at 7:39 PM Will Deacon <will@kernel.org> wrote:
> > Once we make gcc-4.9 the minimum version,
> > this could be further improved to
> >
> >        __auto_type __x = &(x);
>
> Is anybody working on moving to 4.9? I've seen the mails from Linus
> championing it, but I thought there was a RHEL in support that people
> might care about?

I don't think there was a serious discussion about it so far, and
we only just moved to gcc-4.8.

I think moving to gnu11 (gcc-4.9 or clang) instead of gnu99 has other
benefits as well, so we may well want to do it anyway when something
else comes up.

For __auto_type(), we could do it like

#if (clang or gcc-4.9+)
#define auto_typeof(x) __auto_type
#else
#define auto_typeof(x) typeof(x)
#endif

which could be used in a lot of macros.

     Arnd
Dave Martin July 1, 2020, 5:07 p.m. UTC | #9
On Tue, Jun 30, 2020 at 06:37:34PM +0100, Will Deacon wrote:
> When building with LTO, there is an increased risk of the compiler
> converting an address dependency headed by a READ_ONCE() invocation
> into a control dependency and consequently allowing for harmful
> reordering by the CPU.
> 
> Ensure that such transformations are harmless by overriding the generic
> READ_ONCE() definition with one that provides acquire semantics when
> building with LTO.
> 
> Signed-off-by: Will Deacon <will@kernel.org>
> ---
>  arch/arm64/include/asm/rwonce.h   | 63 +++++++++++++++++++++++++++++++
>  arch/arm64/kernel/vdso/Makefile   |  2 +-
>  arch/arm64/kernel/vdso32/Makefile |  2 +-
>  3 files changed, 65 insertions(+), 2 deletions(-)
>  create mode 100644 arch/arm64/include/asm/rwonce.h
> 
> diff --git a/arch/arm64/include/asm/rwonce.h b/arch/arm64/include/asm/rwonce.h
> new file mode 100644
> index 000000000000..515e360b01a1
> --- /dev/null
> +++ b/arch/arm64/include/asm/rwonce.h
> @@ -0,0 +1,63 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (C) 2020 Google LLC.
> + */
> +#ifndef __ASM_RWONCE_H
> +#define __ASM_RWONCE_H
> +
> +#ifdef CONFIG_CLANG_LTO

Don't we have a generic option for LTO that's not specific to Clang.

Also, can you illustrate code that can only be unsafe with Clang LTO?

[...]

Cheers
---Dave
Will Deacon July 2, 2020, 7:23 a.m. UTC | #10
On Wed, Jul 01, 2020 at 06:07:25PM +0100, Dave P Martin wrote:
> On Tue, Jun 30, 2020 at 06:37:34PM +0100, Will Deacon wrote:
> > When building with LTO, there is an increased risk of the compiler
> > converting an address dependency headed by a READ_ONCE() invocation
> > into a control dependency and consequently allowing for harmful
> > reordering by the CPU.
> > 
> > Ensure that such transformations are harmless by overriding the generic
> > READ_ONCE() definition with one that provides acquire semantics when
> > building with LTO.
> > 
> > Signed-off-by: Will Deacon <will@kernel.org>
> > ---
> >  arch/arm64/include/asm/rwonce.h   | 63 +++++++++++++++++++++++++++++++
> >  arch/arm64/kernel/vdso/Makefile   |  2 +-
> >  arch/arm64/kernel/vdso32/Makefile |  2 +-
> >  3 files changed, 65 insertions(+), 2 deletions(-)
> >  create mode 100644 arch/arm64/include/asm/rwonce.h
> > 
> > diff --git a/arch/arm64/include/asm/rwonce.h b/arch/arm64/include/asm/rwonce.h
> > new file mode 100644
> > index 000000000000..515e360b01a1
> > --- /dev/null
> > +++ b/arch/arm64/include/asm/rwonce.h
> > @@ -0,0 +1,63 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * Copyright (C) 2020 Google LLC.
> > + */
> > +#ifndef __ASM_RWONCE_H
> > +#define __ASM_RWONCE_H
> > +
> > +#ifdef CONFIG_CLANG_LTO
> 
> Don't we have a generic option for LTO that's not specific to Clang.

/me looks at the LTO series some more

Oh yeah, there's CONFIG_LTO which is selected by CONFIG_LTO_CLANG, which is
the non-typoed version of the above. I can switch this to CONFIG_LTO.

> Also, can you illustrate code that can only be unsafe with Clang LTO?

I don't have a concrete example, but it's an ongoing concern over on the LTO
thread [1], so I cooked this to show one way we could deal with it. The main
concern is that the whole-program optimisations enabled by LTO may allow the
compiler to enumerate possible values for a pointer at link time and replace
an address dependency between two loads with a control dependency instead,
defeating the dependency ordering within the CPU.

We likely won't realise if/when this goes wrong, other than impossible to
debug, subtle breakage that crops up seemingly randomly. Ideally, we'd be
able to detect this sort of thing happening at build time, and perhaps
even prevent it with compiler options or annotations, but none of that is
close to being available and I'm keen to progress the LTO patches in the
meantime because they are a requirement for CFI.

Will

[1] https://lore.kernel.org/r/20200624203200.78870-1-samitolvanen@google.com
Dave Martin July 6, 2020, 4 p.m. UTC | #11
On Thu, Jul 02, 2020 at 08:23:02AM +0100, Will Deacon wrote:
> On Wed, Jul 01, 2020 at 06:07:25PM +0100, Dave P Martin wrote:
> > On Tue, Jun 30, 2020 at 06:37:34PM +0100, Will Deacon wrote:
> > > When building with LTO, there is an increased risk of the compiler
> > > converting an address dependency headed by a READ_ONCE() invocation
> > > into a control dependency and consequently allowing for harmful
> > > reordering by the CPU.
> > > 
> > > Ensure that such transformations are harmless by overriding the generic
> > > READ_ONCE() definition with one that provides acquire semantics when
> > > building with LTO.
> > > 
> > > Signed-off-by: Will Deacon <will@kernel.org>
> > > ---
> > >  arch/arm64/include/asm/rwonce.h   | 63 +++++++++++++++++++++++++++++++
> > >  arch/arm64/kernel/vdso/Makefile   |  2 +-
> > >  arch/arm64/kernel/vdso32/Makefile |  2 +-
> > >  3 files changed, 65 insertions(+), 2 deletions(-)
> > >  create mode 100644 arch/arm64/include/asm/rwonce.h
> > > 
> > > diff --git a/arch/arm64/include/asm/rwonce.h b/arch/arm64/include/asm/rwonce.h
> > > new file mode 100644
> > > index 000000000000..515e360b01a1
> > > --- /dev/null
> > > +++ b/arch/arm64/include/asm/rwonce.h
> > > @@ -0,0 +1,63 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +/*
> > > + * Copyright (C) 2020 Google LLC.
> > > + */
> > > +#ifndef __ASM_RWONCE_H
> > > +#define __ASM_RWONCE_H
> > > +
> > > +#ifdef CONFIG_CLANG_LTO
> > 
> > Don't we have a generic option for LTO that's not specific to Clang.
> 
> /me looks at the LTO series some more
> 
> Oh yeah, there's CONFIG_LTO which is selected by CONFIG_LTO_CLANG, which is
> the non-typoed version of the above. I can switch this to CONFIG_LTO.
> 
> > Also, can you illustrate code that can only be unsafe with Clang LTO?
> 
> I don't have a concrete example, but it's an ongoing concern over on the LTO
> thread [1], so I cooked this to show one way we could deal with it. The main
> concern is that the whole-program optimisations enabled by LTO may allow the
> compiler to enumerate possible values for a pointer at link time and replace
> an address dependency between two loads with a control dependency instead,
> defeating the dependency ordering within the CPU.

Why can't that happen without LTO?

> We likely won't realise if/when this goes wrong, other than impossible to
> debug, subtle breakage that crops up seemingly randomly. Ideally, we'd be
> able to detect this sort of thing happening at build time, and perhaps
> even prevent it with compiler options or annotations, but none of that is
> close to being available and I'm keen to progress the LTO patches in the
> meantime because they are a requirement for CFI.

My concern was not so much why LTO makes things dangerous, as why !LTO
makes things safe...

Cheers
---Dave
Dave Martin July 6, 2020, 4:08 p.m. UTC | #12
On Tue, Jun 30, 2020 at 06:37:34PM +0100, Will Deacon wrote:
> When building with LTO, there is an increased risk of the compiler
> converting an address dependency headed by a READ_ONCE() invocation
> into a control dependency and consequently allowing for harmful
> reordering by the CPU.
> 
> Ensure that such transformations are harmless by overriding the generic
> READ_ONCE() definition with one that provides acquire semantics when
> building with LTO.
> 
> Signed-off-by: Will Deacon <will@kernel.org>
> ---
>  arch/arm64/include/asm/rwonce.h   | 63 +++++++++++++++++++++++++++++++
>  arch/arm64/kernel/vdso/Makefile   |  2 +-
>  arch/arm64/kernel/vdso32/Makefile |  2 +-
>  3 files changed, 65 insertions(+), 2 deletions(-)
>  create mode 100644 arch/arm64/include/asm/rwonce.h
> 
> diff --git a/arch/arm64/include/asm/rwonce.h b/arch/arm64/include/asm/rwonce.h
> new file mode 100644
> index 000000000000..515e360b01a1
> --- /dev/null
> +++ b/arch/arm64/include/asm/rwonce.h
> @@ -0,0 +1,63 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (C) 2020 Google LLC.
> + */
> +#ifndef __ASM_RWONCE_H
> +#define __ASM_RWONCE_H
> +
> +#ifdef CONFIG_CLANG_LTO
> +
> +#include <linux/compiler_types.h>
> +#include <asm/alternative-macros.h>
> +
> +#ifndef BUILD_VDSO
> +
> +#ifdef CONFIG_AS_HAS_LDAPR
> +#define __LOAD_RCPC(sfx, regs...)					\
> +	ALTERNATIVE(							\
> +		"ldar"	#sfx "\t" #regs,				\

^ Should this be here?  It seems that READ_ONCE() will actually read
twice... even if that doesn't actually conflict with the required
semantics of READ_ONCE(), it looks odd.

Making a direct link between LTO and the memory model also seems highly
spurious (as discussed in the other subthread) so can we have a comment
explaining the reasoning?

> +		".arch_extension rcpc\n"				\
> +		"ldapr"	#sfx "\t" #regs,				\
> +	ARM64_HAS_LDAPR)
> +#else
> +#define __LOAD_RCPC(sfx, regs...)	"ldar" #sfx "\t" #regs
> +#endif /* CONFIG_AS_HAS_LDAPR */

[...]

Cheers
---Dave
Paul E. McKenney July 6, 2020, 4:34 p.m. UTC | #13
On Mon, Jul 06, 2020 at 05:00:23PM +0100, Dave Martin wrote:
> On Thu, Jul 02, 2020 at 08:23:02AM +0100, Will Deacon wrote:
> > On Wed, Jul 01, 2020 at 06:07:25PM +0100, Dave P Martin wrote:
> > > On Tue, Jun 30, 2020 at 06:37:34PM +0100, Will Deacon wrote:
> > > > When building with LTO, there is an increased risk of the compiler
> > > > converting an address dependency headed by a READ_ONCE() invocation
> > > > into a control dependency and consequently allowing for harmful
> > > > reordering by the CPU.
> > > > 
> > > > Ensure that such transformations are harmless by overriding the generic
> > > > READ_ONCE() definition with one that provides acquire semantics when
> > > > building with LTO.
> > > > 
> > > > Signed-off-by: Will Deacon <will@kernel.org>
> > > > ---
> > > >  arch/arm64/include/asm/rwonce.h   | 63 +++++++++++++++++++++++++++++++
> > > >  arch/arm64/kernel/vdso/Makefile   |  2 +-
> > > >  arch/arm64/kernel/vdso32/Makefile |  2 +-
> > > >  3 files changed, 65 insertions(+), 2 deletions(-)
> > > >  create mode 100644 arch/arm64/include/asm/rwonce.h
> > > > 
> > > > diff --git a/arch/arm64/include/asm/rwonce.h b/arch/arm64/include/asm/rwonce.h
> > > > new file mode 100644
> > > > index 000000000000..515e360b01a1
> > > > --- /dev/null
> > > > +++ b/arch/arm64/include/asm/rwonce.h
> > > > @@ -0,0 +1,63 @@
> > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > +/*
> > > > + * Copyright (C) 2020 Google LLC.
> > > > + */
> > > > +#ifndef __ASM_RWONCE_H
> > > > +#define __ASM_RWONCE_H
> > > > +
> > > > +#ifdef CONFIG_CLANG_LTO
> > > 
> > > Don't we have a generic option for LTO that's not specific to Clang.
> > 
> > /me looks at the LTO series some more
> > 
> > Oh yeah, there's CONFIG_LTO which is selected by CONFIG_LTO_CLANG, which is
> > the non-typoed version of the above. I can switch this to CONFIG_LTO.
> > 
> > > Also, can you illustrate code that can only be unsafe with Clang LTO?
> > 
> > I don't have a concrete example, but it's an ongoing concern over on the LTO
> > thread [1], so I cooked this to show one way we could deal with it. The main
> > concern is that the whole-program optimisations enabled by LTO may allow the
> > compiler to enumerate possible values for a pointer at link time and replace
> > an address dependency between two loads with a control dependency instead,
> > defeating the dependency ordering within the CPU.
> 
> Why can't that happen without LTO?

Because without LTO, the compiler cannot see all the pointers all at
the same time due to their being in different translation units.

But yes, if the compiler could see all the pointer values and further
-know- that it was seeing all the pointer values, these optimizations
could happen even without LTO.  But it is quite easy to make sure that
the compiler thinks that there are additional pointer values that it
does not know about.

> > We likely won't realise if/when this goes wrong, other than impossible to
> > debug, subtle breakage that crops up seemingly randomly. Ideally, we'd be
> > able to detect this sort of thing happening at build time, and perhaps
> > even prevent it with compiler options or annotations, but none of that is
> > close to being available and I'm keen to progress the LTO patches in the
> > meantime because they are a requirement for CFI.
> 
> My concern was not so much why LTO makes things dangerous, as why !LTO
> makes things safe...

Because ignorant compilers are safe compilers!  ;-)

							Thanx, Paul
Dave Martin July 6, 2020, 5:05 p.m. UTC | #14
On Mon, Jul 06, 2020 at 09:34:55AM -0700, Paul E. McKenney wrote:
> On Mon, Jul 06, 2020 at 05:00:23PM +0100, Dave Martin wrote:
> > On Thu, Jul 02, 2020 at 08:23:02AM +0100, Will Deacon wrote:
> > > On Wed, Jul 01, 2020 at 06:07:25PM +0100, Dave P Martin wrote:
> > > > On Tue, Jun 30, 2020 at 06:37:34PM +0100, Will Deacon wrote:
> > > > > When building with LTO, there is an increased risk of the compiler
> > > > > converting an address dependency headed by a READ_ONCE() invocation
> > > > > into a control dependency and consequently allowing for harmful
> > > > > reordering by the CPU.
> > > > > 
> > > > > Ensure that such transformations are harmless by overriding the generic
> > > > > READ_ONCE() definition with one that provides acquire semantics when
> > > > > building with LTO.
> > > > > 
> > > > > Signed-off-by: Will Deacon <will@kernel.org>
> > > > > ---
> > > > >  arch/arm64/include/asm/rwonce.h   | 63 +++++++++++++++++++++++++++++++
> > > > >  arch/arm64/kernel/vdso/Makefile   |  2 +-
> > > > >  arch/arm64/kernel/vdso32/Makefile |  2 +-
> > > > >  3 files changed, 65 insertions(+), 2 deletions(-)
> > > > >  create mode 100644 arch/arm64/include/asm/rwonce.h
> > > > > 
> > > > > diff --git a/arch/arm64/include/asm/rwonce.h b/arch/arm64/include/asm/rwonce.h
> > > > > new file mode 100644
> > > > > index 000000000000..515e360b01a1
> > > > > --- /dev/null
> > > > > +++ b/arch/arm64/include/asm/rwonce.h
> > > > > @@ -0,0 +1,63 @@
> > > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > > +/*
> > > > > + * Copyright (C) 2020 Google LLC.
> > > > > + */
> > > > > +#ifndef __ASM_RWONCE_H
> > > > > +#define __ASM_RWONCE_H
> > > > > +
> > > > > +#ifdef CONFIG_CLANG_LTO
> > > > 
> > > > Don't we have a generic option for LTO that's not specific to Clang.
> > > 
> > > /me looks at the LTO series some more
> > > 
> > > Oh yeah, there's CONFIG_LTO which is selected by CONFIG_LTO_CLANG, which is
> > > the non-typoed version of the above. I can switch this to CONFIG_LTO.
> > > 
> > > > Also, can you illustrate code that can only be unsafe with Clang LTO?
> > > 
> > > I don't have a concrete example, but it's an ongoing concern over on the LTO
> > > thread [1], so I cooked this to show one way we could deal with it. The main
> > > concern is that the whole-program optimisations enabled by LTO may allow the
> > > compiler to enumerate possible values for a pointer at link time and replace
> > > an address dependency between two loads with a control dependency instead,
> > > defeating the dependency ordering within the CPU.
> > 
> > Why can't that happen without LTO?
> 
> Because without LTO, the compiler cannot see all the pointers all at
> the same time due to their being in different translation units.
> 
> But yes, if the compiler could see all the pointer values and further
> -know- that it was seeing all the pointer values, these optimizations
> could happen even without LTO.  But it is quite easy to make sure that
> the compiler thinks that there are additional pointer values that it
> does not know about.

Yes of course, but even without LTO the compiler can still apply this
optimisation to everything visible in the translation unit, and that can
drift as people refactor code over time.

Convincing the compiler there are other possible values doesn't help.
Even in

int foo(int *p)
{
	asm ("" : "+r" (p));
	return *p;
}

Can't the compiler still generate something like this:

	switch (p) {
	case &foo:
		return foo;

	case &bar:
		return bar;

	default:
		return *p;
	}

...in which case we still have the same lost ordering guarantee that
we were trying to enforce.

If foo and bar already happen to be in registers and profiling shows
that &foo and &bar are the most likely value of p then this might be
a reasonable optimisation in some situations, irrespective of LTO.

The underlying problem here seems to be that the necessary ordering
rule is not part of what passes for the C memory model prior to C11.
If we want to control the data flow, don't we have to wrap the entire
dereference in a macro?

> > > We likely won't realise if/when this goes wrong, other than impossible to
> > > debug, subtle breakage that crops up seemingly randomly. Ideally, we'd be
> > > able to detect this sort of thing happening at build time, and perhaps
> > > even prevent it with compiler options or annotations, but none of that is
> > > close to being available and I'm keen to progress the LTO patches in the
> > > meantime because they are a requirement for CFI.
> > 
> > My concern was not so much why LTO makes things dangerous, as why !LTO
> > makes things safe...
> 
> Because ignorant compilers are safe compilers!  ;-)

AFAICT ignorance is no gurantee of ordering in general -- the compiler
is free to speculatively invent knowledge any place that the language
spec allows it to.  !LTO doesn't stop this happening.

Hopefully some of the knowledge I invented in my reply is valid...

Cheers
---Dave
Paul E. McKenney July 6, 2020, 5:36 p.m. UTC | #15
On Mon, Jul 06, 2020 at 06:05:57PM +0100, Dave Martin wrote:
> On Mon, Jul 06, 2020 at 09:34:55AM -0700, Paul E. McKenney wrote:
> > On Mon, Jul 06, 2020 at 05:00:23PM +0100, Dave Martin wrote:
> > > On Thu, Jul 02, 2020 at 08:23:02AM +0100, Will Deacon wrote:
> > > > On Wed, Jul 01, 2020 at 06:07:25PM +0100, Dave P Martin wrote:
> > > > > On Tue, Jun 30, 2020 at 06:37:34PM +0100, Will Deacon wrote:
> > > > > > When building with LTO, there is an increased risk of the compiler
> > > > > > converting an address dependency headed by a READ_ONCE() invocation
> > > > > > into a control dependency and consequently allowing for harmful
> > > > > > reordering by the CPU.
> > > > > > 
> > > > > > Ensure that such transformations are harmless by overriding the generic
> > > > > > READ_ONCE() definition with one that provides acquire semantics when
> > > > > > building with LTO.
> > > > > > 
> > > > > > Signed-off-by: Will Deacon <will@kernel.org>
> > > > > > ---
> > > > > >  arch/arm64/include/asm/rwonce.h   | 63 +++++++++++++++++++++++++++++++
> > > > > >  arch/arm64/kernel/vdso/Makefile   |  2 +-
> > > > > >  arch/arm64/kernel/vdso32/Makefile |  2 +-
> > > > > >  3 files changed, 65 insertions(+), 2 deletions(-)
> > > > > >  create mode 100644 arch/arm64/include/asm/rwonce.h
> > > > > > 
> > > > > > diff --git a/arch/arm64/include/asm/rwonce.h b/arch/arm64/include/asm/rwonce.h
> > > > > > new file mode 100644
> > > > > > index 000000000000..515e360b01a1
> > > > > > --- /dev/null
> > > > > > +++ b/arch/arm64/include/asm/rwonce.h
> > > > > > @@ -0,0 +1,63 @@
> > > > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > > > +/*
> > > > > > + * Copyright (C) 2020 Google LLC.
> > > > > > + */
> > > > > > +#ifndef __ASM_RWONCE_H
> > > > > > +#define __ASM_RWONCE_H
> > > > > > +
> > > > > > +#ifdef CONFIG_CLANG_LTO
> > > > > 
> > > > > Don't we have a generic option for LTO that's not specific to Clang.
> > > > 
> > > > /me looks at the LTO series some more
> > > > 
> > > > Oh yeah, there's CONFIG_LTO which is selected by CONFIG_LTO_CLANG, which is
> > > > the non-typoed version of the above. I can switch this to CONFIG_LTO.
> > > > 
> > > > > Also, can you illustrate code that can only be unsafe with Clang LTO?
> > > > 
> > > > I don't have a concrete example, but it's an ongoing concern over on the LTO
> > > > thread [1], so I cooked this to show one way we could deal with it. The main
> > > > concern is that the whole-program optimisations enabled by LTO may allow the
> > > > compiler to enumerate possible values for a pointer at link time and replace
> > > > an address dependency between two loads with a control dependency instead,
> > > > defeating the dependency ordering within the CPU.
> > > 
> > > Why can't that happen without LTO?
> > 
> > Because without LTO, the compiler cannot see all the pointers all at
> > the same time due to their being in different translation units.
> > 
> > But yes, if the compiler could see all the pointer values and further
> > -know- that it was seeing all the pointer values, these optimizations
> > could happen even without LTO.  But it is quite easy to make sure that
> > the compiler thinks that there are additional pointer values that it
> > does not know about.
> 
> Yes of course, but even without LTO the compiler can still apply this
> optimisation to everything visible in the translation unit, and that can
> drift as people refactor code over time.
> 
> Convincing the compiler there are other possible values doesn't help.
> Even in
> 
> int foo(int *p)
> {
> 	asm ("" : "+r" (p));
> 	return *p;
> }
> 
> Can't the compiler still generate something like this:
> 
> 	switch (p) {
> 	case &foo:
> 		return foo;
> 
> 	case &bar:
> 		return bar;
> 
> 	default:
> 		return *p;
> 	}
> 
> ...in which case we still have the same lost ordering guarantee that
> we were trying to enforce.
> 
> If foo and bar already happen to be in registers and profiling shows
> that &foo and &bar are the most likely value of p then this might be
> a reasonable optimisation in some situations, irrespective of LTO.

Agreed, the additional information from profile-driven optimization
can be just as damaging as that from LTO.

> The underlying problem here seems to be that the necessary ordering
> rule is not part of what passes for the C memory model prior to C11.
> If we want to control the data flow, don't we have to wrap the entire
> dereference in a macro?

Yes, exactly.  Because we are relying on things that are not guaranteed
by the C memory model, we need to pay attention to the implementations.
As I have said elsewhere, the price of control dependencies is eternal
vigilance.

And this also applies, to a lesser extent, to address and data
dependencies, which are also not well supported by the C standard.

There is one important case in which the C memory model -does- support
control dependencies, and that is when the dependent write is a normal
C-language write that is not involved in a data race.  In that case,
if the compiler broke the control dependency, it might have introduced
a data race, which it is forbidden to do.  However, this rule can also
be broken when the compiler knows too much, as it might be able to prove
that breaking the dependency won't introduce a data race.  In that case,
according to the standard, it is free to break the dependency.

> > > > We likely won't realise if/when this goes wrong, other than impossible to
> > > > debug, subtle breakage that crops up seemingly randomly. Ideally, we'd be
> > > > able to detect this sort of thing happening at build time, and perhaps
> > > > even prevent it with compiler options or annotations, but none of that is
> > > > close to being available and I'm keen to progress the LTO patches in the
> > > > meantime because they are a requirement for CFI.
> > > 
> > > My concern was not so much why LTO makes things dangerous, as why !LTO
> > > makes things safe...
> > 
> > Because ignorant compilers are safe compilers!  ;-)
> 
> AFAICT ignorance is no gurantee of ordering in general -- the compiler
> is free to speculatively invent knowledge any place that the language
> spec allows it to.  !LTO doesn't stop this happening.

Agreed, according to the standard, the compiler has great freedom.

We have two choices: (1) Restrict ourselves to live within the confines of
the standard or (2) Pay continued close attention to the implementation.
We have made different choices at different times, but for many ordering
situations we have gone with door #2.

Me, I have been working to get the standard to better support our
use case.  This is at best slow going.  But don't take my word for it,
ask Will.

> Hopefully some of the knowledge I invented in my reply is valid...

It is.  It is just that there are multiple valid strategies, and the
Linux kernel is currently taking a mixed-strategy approach.

							Thanx, Paul
Will Deacon July 6, 2020, 6:35 p.m. UTC | #16
On Mon, Jul 06, 2020 at 05:08:20PM +0100, Dave Martin wrote:
> On Tue, Jun 30, 2020 at 06:37:34PM +0100, Will Deacon wrote:
> > diff --git a/arch/arm64/include/asm/rwonce.h b/arch/arm64/include/asm/rwonce.h
> > new file mode 100644
> > index 000000000000..515e360b01a1
> > --- /dev/null
> > +++ b/arch/arm64/include/asm/rwonce.h
> > @@ -0,0 +1,63 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * Copyright (C) 2020 Google LLC.
> > + */
> > +#ifndef __ASM_RWONCE_H
> > +#define __ASM_RWONCE_H
> > +
> > +#ifdef CONFIG_CLANG_LTO
> > +
> > +#include <linux/compiler_types.h>
> > +#include <asm/alternative-macros.h>
> > +
> > +#ifndef BUILD_VDSO
> > +
> > +#ifdef CONFIG_AS_HAS_LDAPR
> > +#define __LOAD_RCPC(sfx, regs...)					\
> > +	ALTERNATIVE(							\
> > +		"ldar"	#sfx "\t" #regs,				\
> 
> ^ Should this be here?  It seems that READ_ONCE() will actually read
> twice... even if that doesn't actually conflict with the required
> semantics of READ_ONCE(), it looks odd.

It's patched at runtime, so it's either LDAR or LDAPR.

> Making a direct link between LTO and the memory model also seems highly
> spurious (as discussed in the other subthread) so can we have a comment
> explaining the reasoning?

Sure, although like I say, this is more about helping to progress that
conversation.

Will
Will Deacon July 6, 2020, 6:35 p.m. UTC | #17
On Mon, Jul 06, 2020 at 05:00:23PM +0100, Dave Martin wrote:
> On Thu, Jul 02, 2020 at 08:23:02AM +0100, Will Deacon wrote:
> > On Wed, Jul 01, 2020 at 06:07:25PM +0100, Dave P Martin wrote:
> > > Also, can you illustrate code that can only be unsafe with Clang LTO?
> > 
> > I don't have a concrete example, but it's an ongoing concern over on the LTO
> > thread [1], so I cooked this to show one way we could deal with it. The main
> > concern is that the whole-program optimisations enabled by LTO may allow the
> > compiler to enumerate possible values for a pointer at link time and replace
> > an address dependency between two loads with a control dependency instead,
> > defeating the dependency ordering within the CPU.
> 
> Why can't that happen without LTO?

It could, but I'd argue that it's considerably less likely because there
is less information available to the compiler to perform these sorts of
optimisations. It also doesn't appear to be happening in practice.

The current state of affairs is that, if/when we catch the compiler
performing harmful optimistations, we look for a way to disable them.
However, there are good reasons to enable LTO, so this is one way to
do that without having to worry about the potential impact on dependency
ordering.

Will
Marco Elver July 6, 2020, 7:23 p.m. UTC | #18
On Mon, 6 Jul 2020 at 20:35, Will Deacon <will@kernel.org> wrote:
> On Mon, Jul 06, 2020 at 05:00:23PM +0100, Dave Martin wrote:
> > On Thu, Jul 02, 2020 at 08:23:02AM +0100, Will Deacon wrote:
> > > On Wed, Jul 01, 2020 at 06:07:25PM +0100, Dave P Martin wrote:
> > > > Also, can you illustrate code that can only be unsafe with Clang LTO?
> > >
> > > I don't have a concrete example, but it's an ongoing concern over on the LTO
> > > thread [1], so I cooked this to show one way we could deal with it. The main
> > > concern is that the whole-program optimisations enabled by LTO may allow the
> > > compiler to enumerate possible values for a pointer at link time and replace
> > > an address dependency between two loads with a control dependency instead,
> > > defeating the dependency ordering within the CPU.
> >
> > Why can't that happen without LTO?
>
> It could, but I'd argue that it's considerably less likely because there
> is less information available to the compiler to perform these sorts of
> optimisations. It also doesn't appear to be happening in practice.
>
> The current state of affairs is that, if/when we catch the compiler
> performing harmful optimistations, we look for a way to disable them.
> However, there are good reasons to enable LTO, so this is one way to
> do that without having to worry about the potential impact on dependency
> ordering.

If it's of any help, I'll see if we can implement that warning in LLVM
if data dependencies somehow disappear (although I don't have any
cycles to pursue right now myself). Until then, short of manual
inspection or encountering a bug in the wild, there is no proof any of
this happens or doesn't happen.

Also, as some anecdotal evidence it's extremely unlikely, even with
LTO: looking at the passes that LLVM runs, there are a number of
passes that seem to want to eliminate basic blocks, thereby getting
rid of branches. Intuitively, it makes sense, because branches are
expensive on most architectures (for GPU targets, I think it tries
even harder to get rid of branches). If we extend our reasoning and
assumptions of LTO's aggressiveness in that direction, we might
actually end up with fewer branches. That might be beneficial for the
data dependencies we worry about (but not so much for control
dependencies we want to keep). Still, no point in speculating (no pun
intended) until we have hard data what actually happens. :-)

Thanks,
-- Marco
Paul E. McKenney July 6, 2020, 7:42 p.m. UTC | #19
On Mon, Jul 06, 2020 at 09:23:26PM +0200, Marco Elver wrote:
> On Mon, 6 Jul 2020 at 20:35, Will Deacon <will@kernel.org> wrote:
> > On Mon, Jul 06, 2020 at 05:00:23PM +0100, Dave Martin wrote:
> > > On Thu, Jul 02, 2020 at 08:23:02AM +0100, Will Deacon wrote:
> > > > On Wed, Jul 01, 2020 at 06:07:25PM +0100, Dave P Martin wrote:
> > > > > Also, can you illustrate code that can only be unsafe with Clang LTO?
> > > >
> > > > I don't have a concrete example, but it's an ongoing concern over on the LTO
> > > > thread [1], so I cooked this to show one way we could deal with it. The main
> > > > concern is that the whole-program optimisations enabled by LTO may allow the
> > > > compiler to enumerate possible values for a pointer at link time and replace
> > > > an address dependency between two loads with a control dependency instead,
> > > > defeating the dependency ordering within the CPU.
> > >
> > > Why can't that happen without LTO?
> >
> > It could, but I'd argue that it's considerably less likely because there
> > is less information available to the compiler to perform these sorts of
> > optimisations. It also doesn't appear to be happening in practice.
> >
> > The current state of affairs is that, if/when we catch the compiler
> > performing harmful optimistations, we look for a way to disable them.
> > However, there are good reasons to enable LTO, so this is one way to
> > do that without having to worry about the potential impact on dependency
> > ordering.
> 
> If it's of any help, I'll see if we can implement that warning in LLVM
> if data dependencies somehow disappear (although I don't have any
> cycles to pursue right now myself). Until then, short of manual
> inspection or encountering a bug in the wild, there is no proof any of
> this happens or doesn't happen.
> 
> Also, as some anecdotal evidence it's extremely unlikely, even with
> LTO: looking at the passes that LLVM runs, there are a number of
> passes that seem to want to eliminate basic blocks, thereby getting
> rid of branches. Intuitively, it makes sense, because branches are
> expensive on most architectures (for GPU targets, I think it tries
> even harder to get rid of branches). If we extend our reasoning and
> assumptions of LTO's aggressiveness in that direction, we might
> actually end up with fewer branches. That might be beneficial for the
> data dependencies we worry about (but not so much for control
> dependencies we want to keep). Still, no point in speculating (no pun
> intended) until we have hard data what actually happens. :-)

Anything along these lines would be very welcome!!!

							Thanx, Paul
Dave Martin July 7, 2020, 10:10 a.m. UTC | #20
On Mon, Jul 06, 2020 at 07:35:11PM +0100, Will Deacon wrote:
> On Mon, Jul 06, 2020 at 05:08:20PM +0100, Dave Martin wrote:
> > On Tue, Jun 30, 2020 at 06:37:34PM +0100, Will Deacon wrote:
> > > diff --git a/arch/arm64/include/asm/rwonce.h b/arch/arm64/include/asm/rwonce.h
> > > new file mode 100644
> > > index 000000000000..515e360b01a1
> > > --- /dev/null
> > > +++ b/arch/arm64/include/asm/rwonce.h
> > > @@ -0,0 +1,63 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +/*
> > > + * Copyright (C) 2020 Google LLC.
> > > + */
> > > +#ifndef __ASM_RWONCE_H
> > > +#define __ASM_RWONCE_H
> > > +
> > > +#ifdef CONFIG_CLANG_LTO
> > > +
> > > +#include <linux/compiler_types.h>
> > > +#include <asm/alternative-macros.h>
> > > +
> > > +#ifndef BUILD_VDSO
> > > +
> > > +#ifdef CONFIG_AS_HAS_LDAPR
> > > +#define __LOAD_RCPC(sfx, regs...)					\
> > > +	ALTERNATIVE(							\
> > > +		"ldar"	#sfx "\t" #regs,				\
> > 
> > ^ Should this be here?  It seems that READ_ONCE() will actually read
> > twice... even if that doesn't actually conflict with the required
> > semantics of READ_ONCE(), it looks odd.
> 
> It's patched at runtime, so it's either LDAR or LDAPR.

Agh ignore me, I somehow failed to sport the ALTERNATIVE().

For my understanding -- my background here is a bit shaky -- the LDAPR
gives us load-to-load order even if there is just a control dependency?

If so (possibly dumb question): why can't we just turn this on
unconditionally?  Is there a significant performance impact?

I'm still confused (or ignorant) though.  If both loads are READ_ONCE()
then switching to LDAPR presumably helps, but otherwise, once the
compiler has reduced the address dependency to a control dependency
can't it then go one step further and reverse the order of the loads?
LDAPR wouldn't rescue us from that.

Or does the "memory" clobber in READ_ONCE() fix that for all important
cases?  I can't see this mattering for local variables (where it
definitely won't work), but I wonder whether static variables might not
count as "memory" in some situations.

Discounting ridiculous things like static register variables, I think
the only way for a static variable not to count as memory would be if
there are no writes to it that are reachable from any translation unit
entry point (possibly after dead code removal).  If so, maybe that's
enough.

> > Making a direct link between LTO and the memory model also seems highly
> > spurious (as discussed in the other subthread) so can we have a comment
> > explaining the reasoning?
> 
> Sure, although like I say, this is more about helping to progress that
> conversation.

That's fair enough, but when there is a consensus it would be good to
see it documented in the code _especially_ if we know that the fix won't
address all instances of the problem and in any case works partly by
accident.  That doesn't mean it's not a good practical compromise, but
it could be very confusing to unpick later on.

Cheers
---Dave
Dave Martin July 7, 2020, 10:29 a.m. UTC | #21
On Mon, Jul 06, 2020 at 10:36:28AM -0700, Paul E. McKenney wrote:
> On Mon, Jul 06, 2020 at 06:05:57PM +0100, Dave Martin wrote:
> > On Mon, Jul 06, 2020 at 09:34:55AM -0700, Paul E. McKenney wrote:
> > > On Mon, Jul 06, 2020 at 05:00:23PM +0100, Dave Martin wrote:
> > > > On Thu, Jul 02, 2020 at 08:23:02AM +0100, Will Deacon wrote:
> > > > > On Wed, Jul 01, 2020 at 06:07:25PM +0100, Dave P Martin wrote:
> > > > > > On Tue, Jun 30, 2020 at 06:37:34PM +0100, Will Deacon wrote:
> > > > > > > When building with LTO, there is an increased risk of the compiler
> > > > > > > converting an address dependency headed by a READ_ONCE() invocation
> > > > > > > into a control dependency and consequently allowing for harmful
> > > > > > > reordering by the CPU.
> > > > > > > 
> > > > > > > Ensure that such transformations are harmless by overriding the generic
> > > > > > > READ_ONCE() definition with one that provides acquire semantics when
> > > > > > > building with LTO.
> > > > > > > 
> > > > > > > Signed-off-by: Will Deacon <will@kernel.org>
> > > > > > > ---
> > > > > > >  arch/arm64/include/asm/rwonce.h   | 63 +++++++++++++++++++++++++++++++
> > > > > > >  arch/arm64/kernel/vdso/Makefile   |  2 +-
> > > > > > >  arch/arm64/kernel/vdso32/Makefile |  2 +-
> > > > > > >  3 files changed, 65 insertions(+), 2 deletions(-)
> > > > > > >  create mode 100644 arch/arm64/include/asm/rwonce.h
> > > > > > > 
> > > > > > > diff --git a/arch/arm64/include/asm/rwonce.h b/arch/arm64/include/asm/rwonce.h
> > > > > > > new file mode 100644
> > > > > > > index 000000000000..515e360b01a1
> > > > > > > --- /dev/null
> > > > > > > +++ b/arch/arm64/include/asm/rwonce.h
> > > > > > > @@ -0,0 +1,63 @@
> > > > > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > > > > +/*
> > > > > > > + * Copyright (C) 2020 Google LLC.
> > > > > > > + */
> > > > > > > +#ifndef __ASM_RWONCE_H
> > > > > > > +#define __ASM_RWONCE_H
> > > > > > > +
> > > > > > > +#ifdef CONFIG_CLANG_LTO
> > > > > > 
> > > > > > Don't we have a generic option for LTO that's not specific to Clang.
> > > > > 
> > > > > /me looks at the LTO series some more
> > > > > 
> > > > > Oh yeah, there's CONFIG_LTO which is selected by CONFIG_LTO_CLANG, which is
> > > > > the non-typoed version of the above. I can switch this to CONFIG_LTO.
> > > > > 
> > > > > > Also, can you illustrate code that can only be unsafe with Clang LTO?
> > > > > 
> > > > > I don't have a concrete example, but it's an ongoing concern over on the LTO
> > > > > thread [1], so I cooked this to show one way we could deal with it. The main
> > > > > concern is that the whole-program optimisations enabled by LTO may allow the
> > > > > compiler to enumerate possible values for a pointer at link time and replace
> > > > > an address dependency between two loads with a control dependency instead,
> > > > > defeating the dependency ordering within the CPU.
> > > > 
> > > > Why can't that happen without LTO?
> > > 
> > > Because without LTO, the compiler cannot see all the pointers all at
> > > the same time due to their being in different translation units.
> > > 
> > > But yes, if the compiler could see all the pointer values and further
> > > -know- that it was seeing all the pointer values, these optimizations
> > > could happen even without LTO.  But it is quite easy to make sure that
> > > the compiler thinks that there are additional pointer values that it
> > > does not know about.
> > 
> > Yes of course, but even without LTO the compiler can still apply this
> > optimisation to everything visible in the translation unit, and that can
> > drift as people refactor code over time.
> > 
> > Convincing the compiler there are other possible values doesn't help.
> > Even in
> > 
> > int foo(int *p)
> > {
> > 	asm ("" : "+r" (p));
> > 	return *p;
> > }
> > 
> > Can't the compiler still generate something like this:
> > 
> > 	switch (p) {
> > 	case &foo:
> > 		return foo;
> > 
> > 	case &bar:
> > 		return bar;
> > 
> > 	default:
> > 		return *p;
> > 	}
> > 
> > ...in which case we still have the same lost ordering guarantee that
> > we were trying to enforce.
> > 
> > If foo and bar already happen to be in registers and profiling shows
> > that &foo and &bar are the most likely value of p then this might be
> > a reasonable optimisation in some situations, irrespective of LTO.
> 
> Agreed, the additional information from profile-driven optimization
> can be just as damaging as that from LTO.
> 
> > The underlying problem here seems to be that the necessary ordering
> > rule is not part of what passes for the C memory model prior to C11.
> > If we want to control the data flow, don't we have to wrap the entire
> > dereference in a macro?
> 
> Yes, exactly.  Because we are relying on things that are not guaranteed
> by the C memory model, we need to pay attention to the implementations.
> As I have said elsewhere, the price of control dependencies is eternal
> vigilance.
> 
> And this also applies, to a lesser extent, to address and data
> dependencies, which are also not well supported by the C standard.
> 
> There is one important case in which the C memory model -does- support
> control dependencies, and that is when the dependent write is a normal
> C-language write that is not involved in a data race.  In that case,
> if the compiler broke the control dependency, it might have introduced
> a data race, which it is forbidden to do.  However, this rule can also
> be broken when the compiler knows too much, as it might be able to prove
> that breaking the dependency won't introduce a data race.  In that case,
> according to the standard, it is free to break the dependency.

Which only matters because the C abstract machine may not match reality.

LTO has no bearing on the abstract machine though.

If specific compiler options etc. can be added to inhibit the
problematic optimisations, that would be ideal.  I guess that can't
happen overnight though.

> > > > > We likely won't realise if/when this goes wrong, other than impossible to
> > > > > debug, subtle breakage that crops up seemingly randomly. Ideally, we'd be
> > > > > able to detect this sort of thing happening at build time, and perhaps
> > > > > even prevent it with compiler options or annotations, but none of that is
> > > > > close to being available and I'm keen to progress the LTO patches in the
> > > > > meantime because they are a requirement for CFI.
> > > > 
> > > > My concern was not so much why LTO makes things dangerous, as why !LTO
> > > > makes things safe...
> > > 
> > > Because ignorant compilers are safe compilers!  ;-)
> > 
> > AFAICT ignorance is no gurantee of ordering in general -- the compiler
> > is free to speculatively invent knowledge any place that the language
> > spec allows it to.  !LTO doesn't stop this happening.
> 
> Agreed, according to the standard, the compiler has great freedom.
> 
> We have two choices: (1) Restrict ourselves to live within the confines of
> the standard or (2) Pay continued close attention to the implementation.
> We have made different choices at different times, but for many ordering
> situations we have gone with door #2.
> 
> Me, I have been working to get the standard to better support our
> use case.  This is at best slow going.  But don't take my word for it,
> ask Will.

I can believe it.  They want to enable optimisations rather than prevent
them...

> > Hopefully some of the knowledge I invented in my reply is valid...
> 
> It is.  It is just that there are multiple valid strategies, and the
> Linux kernel is currently taking a mixed-strategy approach.

Ack.  The hope that there is a correct way to fix everything dies
hard ;)

Life was cosier before I started trying to reason about language specs.

Cheers
---Dave
Paul E. McKenney July 7, 2020, 10:51 p.m. UTC | #22
On Tue, Jul 07, 2020 at 11:29:15AM +0100, Dave Martin wrote:
> On Mon, Jul 06, 2020 at 10:36:28AM -0700, Paul E. McKenney wrote:
> > On Mon, Jul 06, 2020 at 06:05:57PM +0100, Dave Martin wrote:

[ . . . ]

> > > The underlying problem here seems to be that the necessary ordering
> > > rule is not part of what passes for the C memory model prior to C11.
> > > If we want to control the data flow, don't we have to wrap the entire
> > > dereference in a macro?
> > 
> > Yes, exactly.  Because we are relying on things that are not guaranteed
> > by the C memory model, we need to pay attention to the implementations.
> > As I have said elsewhere, the price of control dependencies is eternal
> > vigilance.
> > 
> > And this also applies, to a lesser extent, to address and data
> > dependencies, which are also not well supported by the C standard.
> > 
> > There is one important case in which the C memory model -does- support
> > control dependencies, and that is when the dependent write is a normal
> > C-language write that is not involved in a data race.  In that case,
> > if the compiler broke the control dependency, it might have introduced
> > a data race, which it is forbidden to do.  However, this rule can also
> > be broken when the compiler knows too much, as it might be able to prove
> > that breaking the dependency won't introduce a data race.  In that case,
> > according to the standard, it is free to break the dependency.
> 
> Which only matters because the C abstract machine may not match reality.
> 
> LTO has no bearing on the abstract machine though.
> 
> If specific compiler options etc. can be added to inhibit the
> problematic optimisations, that would be ideal.  I guess that can't
> happen overnight though.

Sadly, I must agree.

> > > > > > We likely won't realise if/when this goes wrong, other than impossible to
> > > > > > debug, subtle breakage that crops up seemingly randomly. Ideally, we'd be
> > > > > > able to detect this sort of thing happening at build time, and perhaps
> > > > > > even prevent it with compiler options or annotations, but none of that is
> > > > > > close to being available and I'm keen to progress the LTO patches in the
> > > > > > meantime because they are a requirement for CFI.
> > > > > 
> > > > > My concern was not so much why LTO makes things dangerous, as why !LTO
> > > > > makes things safe...
> > > > 
> > > > Because ignorant compilers are safe compilers!  ;-)
> > > 
> > > AFAICT ignorance is no gurantee of ordering in general -- the compiler
> > > is free to speculatively invent knowledge any place that the language
> > > spec allows it to.  !LTO doesn't stop this happening.
> > 
> > Agreed, according to the standard, the compiler has great freedom.
> > 
> > We have two choices: (1) Restrict ourselves to live within the confines of
> > the standard or (2) Pay continued close attention to the implementation.
> > We have made different choices at different times, but for many ordering
> > situations we have gone with door #2.
> > 
> > Me, I have been working to get the standard to better support our
> > use case.  This is at best slow going.  But don't take my word for it,
> > ask Will.
> 
> I can believe it.  They want to enable optimisations rather than prevent
> them...

Right in one!  ;-)

> > > Hopefully some of the knowledge I invented in my reply is valid...
> > 
> > It is.  It is just that there are multiple valid strategies, and the
> > Linux kernel is currently taking a mixed-strategy approach.
> 
> Ack.  The hope that there is a correct way to fix everything dies
> hard ;)

Either that, or one slowly degrades ones definition of "correct".  :-/

> Life was cosier before I started trying to reason about language specs.

Same here!

							Thanx, Paul
Nick Desaulniers July 7, 2020, 11:01 p.m. UTC | #23
I'm trying to put together a Micro Conference for Linux Plumbers
conference focused on "make LLVM slightly less shitty."  Do you all
plan on attending the conference? Would it be worthwhile to hold a
session focused on discussing this (LTO and memory models) be
worthwhile?


On Tue, Jul 7, 2020 at 3:51 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Tue, Jul 07, 2020 at 11:29:15AM +0100, Dave Martin wrote:
> > On Mon, Jul 06, 2020 at 10:36:28AM -0700, Paul E. McKenney wrote:
> > > On Mon, Jul 06, 2020 at 06:05:57PM +0100, Dave Martin wrote:
>
> [ . . . ]
>
> > > > The underlying problem here seems to be that the necessary ordering
> > > > rule is not part of what passes for the C memory model prior to C11.
> > > > If we want to control the data flow, don't we have to wrap the entire
> > > > dereference in a macro?
> > >
> > > Yes, exactly.  Because we are relying on things that are not guaranteed
> > > by the C memory model, we need to pay attention to the implementations.
> > > As I have said elsewhere, the price of control dependencies is eternal
> > > vigilance.
> > >
> > > And this also applies, to a lesser extent, to address and data
> > > dependencies, which are also not well supported by the C standard.
> > >
> > > There is one important case in which the C memory model -does- support
> > > control dependencies, and that is when the dependent write is a normal
> > > C-language write that is not involved in a data race.  In that case,
> > > if the compiler broke the control dependency, it might have introduced
> > > a data race, which it is forbidden to do.  However, this rule can also
> > > be broken when the compiler knows too much, as it might be able to prove
> > > that breaking the dependency won't introduce a data race.  In that case,
> > > according to the standard, it is free to break the dependency.
> >
> > Which only matters because the C abstract machine may not match reality.
> >
> > LTO has no bearing on the abstract machine though.
> >
> > If specific compiler options etc. can be added to inhibit the
> > problematic optimisations, that would be ideal.  I guess that can't
> > happen overnight though.
>
> Sadly, I must agree.
>
> > > > > > > We likely won't realise if/when this goes wrong, other than impossible to
> > > > > > > debug, subtle breakage that crops up seemingly randomly. Ideally, we'd be
> > > > > > > able to detect this sort of thing happening at build time, and perhaps
> > > > > > > even prevent it with compiler options or annotations, but none of that is
> > > > > > > close to being available and I'm keen to progress the LTO patches in the
> > > > > > > meantime because they are a requirement for CFI.
> > > > > >
> > > > > > My concern was not so much why LTO makes things dangerous, as why !LTO
> > > > > > makes things safe...
> > > > >
> > > > > Because ignorant compilers are safe compilers!  ;-)
> > > >
> > > > AFAICT ignorance is no gurantee of ordering in general -- the compiler
> > > > is free to speculatively invent knowledge any place that the language
> > > > spec allows it to.  !LTO doesn't stop this happening.
> > >
> > > Agreed, according to the standard, the compiler has great freedom.
> > >
> > > We have two choices: (1) Restrict ourselves to live within the confines of
> > > the standard or (2) Pay continued close attention to the implementation.
> > > We have made different choices at different times, but for many ordering
> > > situations we have gone with door #2.
> > >
> > > Me, I have been working to get the standard to better support our
> > > use case.  This is at best slow going.  But don't take my word for it,
> > > ask Will.
> >
> > I can believe it.  They want to enable optimisations rather than prevent
> > them...
>
> Right in one!  ;-)
>
> > > > Hopefully some of the knowledge I invented in my reply is valid...
> > >
> > > It is.  It is just that there are multiple valid strategies, and the
> > > Linux kernel is currently taking a mixed-strategy approach.
> >
> > Ack.  The hope that there is a correct way to fix everything dies
> > hard ;)
>
> Either that, or one slowly degrades ones definition of "correct".  :-/
>
> > Life was cosier before I started trying to reason about language specs.
>
> Same here!
>
>                                                         Thanx, Paul
Marco Elver July 8, 2020, 7:15 a.m. UTC | #24
On Wed, 8 Jul 2020 at 01:01, Nick Desaulniers <ndesaulniers@google.com> wrote:
>
> I'm trying to put together a Micro Conference for Linux Plumbers
> conference focused on "make LLVM slightly less shitty."  Do you all
> plan on attending the conference? Would it be worthwhile to hold a
> session focused on discussing this (LTO and memory models) be
> worthwhile?

I would welcome sessions on LLVM, and would try to attend. Apart from
general improvements to the LLVM ecosystem, we should also emphasize
the benefits LLVM provides and how we can enable them (one reason we
want LTO is to get CFI).

Regarding LTO and memory models, I'm not sure. Given the current state
of things, such a discussion needs to be carefully framed to not go in
circles, because we're trying to figure out things at the intersection
of architecture, what the compiler does, the C standard, and the
kernel wants. And because some of these boxes are difficult to change
(standard, arch, compiler) or difficult to precisely define behaviour
(compiler), we might end up going in circles. From what I see there
are efforts to fix the situation at the root (standard), and we might
have means to get the compiler to tell us what it's doing. But these
happen extremely slowly.

So, if we do this, we need to be careful to not end up re-discussing
what we discussed here, but rather try and make it a continuation that
hopefully leads to some constructive output.

Thanks,
-- Marco
Peter Zijlstra July 8, 2020, 9:16 a.m. UTC | #25
On Tue, Jul 07, 2020 at 04:01:28PM -0700, Nick Desaulniers wrote:
> I'm trying to put together a Micro Conference for Linux Plumbers
> conference focused on "make LLVM slightly less shitty."  Do you all
> plan on attending the conference? Would it be worthwhile to hold a
> session focused on discussing this (LTO and memory models) be
> worthwhile?

I'd love to have a session about compilers and memory ordering with both
GCC and CLANG in attendance.

We need a solution for dependent-loads and control-dependencies for both
toolchains.
Paul E. McKenney July 8, 2020, 6:20 p.m. UTC | #26
On Wed, Jul 08, 2020 at 11:16:20AM +0200, Peter Zijlstra wrote:
> On Tue, Jul 07, 2020 at 04:01:28PM -0700, Nick Desaulniers wrote:
> > I'm trying to put together a Micro Conference for Linux Plumbers
> > conference focused on "make LLVM slightly less shitty."  Do you all
> > plan on attending the conference? Would it be worthwhile to hold a
> > session focused on discussing this (LTO and memory models) be
> > worthwhile?
> 
> I'd love to have a session about compilers and memory ordering with both
> GCC and CLANG in attendance.
> 
> We need a solution for dependent-loads and control-dependencies for both
> toolchains.

What Peter said!  ;-)

							Thanx, Paul
diff mbox series

Patch

diff --git a/arch/arm64/include/asm/rwonce.h b/arch/arm64/include/asm/rwonce.h
new file mode 100644
index 000000000000..515e360b01a1
--- /dev/null
+++ b/arch/arm64/include/asm/rwonce.h
@@ -0,0 +1,63 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2020 Google LLC.
+ */
+#ifndef __ASM_RWONCE_H
+#define __ASM_RWONCE_H
+
+#ifdef CONFIG_CLANG_LTO
+
+#include <linux/compiler_types.h>
+#include <asm/alternative-macros.h>
+
+#ifndef BUILD_VDSO
+
+#ifdef CONFIG_AS_HAS_LDAPR
+#define __LOAD_RCPC(sfx, regs...)					\
+	ALTERNATIVE(							\
+		"ldar"	#sfx "\t" #regs,				\
+		".arch_extension rcpc\n"				\
+		"ldapr"	#sfx "\t" #regs,				\
+	ARM64_HAS_LDAPR)
+#else
+#define __LOAD_RCPC(sfx, regs...)	"ldar" #sfx "\t" #regs
+#endif /* CONFIG_AS_HAS_LDAPR */
+
+#define __READ_ONCE(x)							\
+({									\
+	int atomic = 1;							\
+	union { __unqual_scalar_typeof(x) __val; char __c[1]; } __u;	\
+	typeof(&(x)) __x = &(x);					\
+	switch (sizeof(x)) {						\
+	case 1:								\
+		asm volatile(__LOAD_RCPC(b, %w0, %1)			\
+			: "=r" (*(__u8 *)__u.__c)			\
+			: "Q" (*__x) : "memory");			\
+		break;							\
+	case 2:								\
+		asm volatile(__LOAD_RCPC(h, %w0, %1)			\
+			: "=r" (*(__u16 *)__u.__c)			\
+			: "Q" (*__x) : "memory");			\
+		break;							\
+	case 4:								\
+		asm volatile(__LOAD_RCPC(, %w0, %1)			\
+			: "=r" (*(__u32 *)__u.__c)			\
+			: "Q" (*__x) : "memory");			\
+		break;							\
+	case 8:								\
+		asm volatile(__LOAD_RCPC(, %0, %1)			\
+			: "=r" (*(__u64 *)__u.__c)			\
+			: "Q" (*__x) : "memory");			\
+		break;							\
+	default:							\
+		atomic = 0;						\
+	}								\
+	atomic ? (typeof(x))__u.__val : (*(volatile typeof(x) *)__x);	\
+})
+
+#endif	/* !BUILD_VDSO */
+#endif	/* CONFIG_CLANG_LTO */
+
+#include <asm-generic/rwonce.h>
+
+#endif	/* __ASM_RWONCE_H */
diff --git a/arch/arm64/kernel/vdso/Makefile b/arch/arm64/kernel/vdso/Makefile
index 45d5cfe46429..60df97f2e7de 100644
--- a/arch/arm64/kernel/vdso/Makefile
+++ b/arch/arm64/kernel/vdso/Makefile
@@ -28,7 +28,7 @@  ldflags-y := -shared -nostdlib -soname=linux-vdso.so.1 --hash-style=sysv	\
 	     $(btildflags-y) -T
 
 ccflags-y := -fno-common -fno-builtin -fno-stack-protector -ffixed-x18
-ccflags-y += -DDISABLE_BRANCH_PROFILING
+ccflags-y += -DDISABLE_BRANCH_PROFILING -DBUILD_VDSO
 
 CFLAGS_REMOVE_vgettimeofday.o = $(CC_FLAGS_FTRACE) -Os $(CC_FLAGS_SCS) $(GCC_PLUGINS_CFLAGS)
 KBUILD_CFLAGS			+= $(DISABLE_LTO)
diff --git a/arch/arm64/kernel/vdso32/Makefile b/arch/arm64/kernel/vdso32/Makefile
index d88148bef6b0..4fdf3754a058 100644
--- a/arch/arm64/kernel/vdso32/Makefile
+++ b/arch/arm64/kernel/vdso32/Makefile
@@ -43,7 +43,7 @@  cc32-as-instr = $(call try-run,\
 # As a result we set our own flags here.
 
 # KBUILD_CPPFLAGS and NOSTDINC_FLAGS from top-level Makefile
-VDSO_CPPFLAGS := -D__KERNEL__ -nostdinc -isystem $(shell $(CC_COMPAT) -print-file-name=include)
+VDSO_CPPFLAGS := -DBUILD_VDSO -D__KERNEL__ -nostdinc -isystem $(shell $(CC_COMPAT) -print-file-name=include)
 VDSO_CPPFLAGS += $(LINUXINCLUDE)
 
 # Common C and assembly flags