Message ID | 20200624203200.78870-1-samitolvanen@google.com (mailing list archive)
Series     | add support for Clang LTO
On Wed, Jun 24, 2020 at 01:31:38PM -0700, Sami Tolvanen wrote:
> This patch series adds support for building x86_64 and arm64 kernels
> with Clang's Link Time Optimization (LTO).
>
> In addition to performance, the primary motivation for LTO is to allow
> Clang's Control-Flow Integrity (CFI) to be used in the kernel. Google's
> Pixel devices have shipped with LTO+CFI kernels since 2018.
>
> Most of the patches are build system changes for handling LLVM bitcode,
> which Clang produces with LTO instead of ELF object files, postponing
> ELF processing until a later stage, and ensuring initcall ordering.
>
> Note that the first objtool patch in the series is already in linux-next,
> but as it's needed with LTO, I'm including it also here to make testing
> easier.

I'm very sad that yet again, memory ordering isn't addressed. LTO vastly
increases the range of the optimizer to wreck things.
On Wed, Jun 24, 2020 at 11:15:40PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 24, 2020 at 01:31:38PM -0700, Sami Tolvanen wrote:
> > This patch series adds support for building x86_64 and arm64 kernels
> > with Clang's Link Time Optimization (LTO).
> >
> > In addition to performance, the primary motivation for LTO is to allow
> > Clang's Control-Flow Integrity (CFI) to be used in the kernel. Google's
> > Pixel devices have shipped with LTO+CFI kernels since 2018.
> >
> > Most of the patches are build system changes for handling LLVM bitcode,
> > which Clang produces with LTO instead of ELF object files, postponing
> > ELF processing until a later stage, and ensuring initcall ordering.
> >
> > Note that the first objtool patch in the series is already in linux-next,
> > but as it's needed with LTO, I'm including it also here to make testing
> > easier.
>
> I'm very sad that yet again, memory ordering isn't addressed. LTO vastly
> increases the range of the optimizer to wreck things.

I believe Will has some thoughts about this, and patches, but I'll let
him talk about it.

Sami
On Wed, Jun 24, 2020 at 2:15 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Jun 24, 2020 at 01:31:38PM -0700, Sami Tolvanen wrote:
> > This patch series adds support for building x86_64 and arm64 kernels
> > with Clang's Link Time Optimization (LTO).
> >
> > In addition to performance, the primary motivation for LTO is to allow
> > Clang's Control-Flow Integrity (CFI) to be used in the kernel. Google's
> > Pixel devices have shipped with LTO+CFI kernels since 2018.
> >
> > Most of the patches are build system changes for handling LLVM bitcode,
> > which Clang produces with LTO instead of ELF object files, postponing
> > ELF processing until a later stage, and ensuring initcall ordering.
> >
> > Note that the first objtool patch in the series is already in linux-next,
> > but as it's needed with LTO, I'm including it also here to make testing
> > easier.
>
> I'm very sad that yet again, memory ordering isn't addressed. LTO vastly
> increases the range of the optimizer to wreck things.

Hi Peter, could you expand on the issue for the folks on the thread?
I'm happy to try to hack something up in LLVM if we check that X does
or does not happen; maybe we can even come up with some concrete test
cases that can be added to LLVM's codebase?
On Wed, Jun 24, 2020 at 02:31:36PM -0700, Nick Desaulniers wrote:
> On Wed, Jun 24, 2020 at 2:15 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Wed, Jun 24, 2020 at 01:31:38PM -0700, Sami Tolvanen wrote:
> > > This patch series adds support for building x86_64 and arm64 kernels
> > > with Clang's Link Time Optimization (LTO).
> > >
> > > In addition to performance, the primary motivation for LTO is to allow
> > > Clang's Control-Flow Integrity (CFI) to be used in the kernel. Google's
> > > Pixel devices have shipped with LTO+CFI kernels since 2018.
> > >
> > > Most of the patches are build system changes for handling LLVM bitcode,
> > > which Clang produces with LTO instead of ELF object files, postponing
> > > ELF processing until a later stage, and ensuring initcall ordering.
> > >
> > > Note that the first objtool patch in the series is already in linux-next,
> > > but as it's needed with LTO, I'm including it also here to make testing
> > > easier.
> >
> > I'm very sad that yet again, memory ordering isn't addressed. LTO vastly
> > increases the range of the optimizer to wreck things.
>
> Hi Peter, could you expand on the issue for the folks on the thread?
> I'm happy to try to hack something up in LLVM if we check that X does
> or does not happen; maybe we can even come up with some concrete test
> cases that can be added to LLVM's codebase?

I'm sure Will will respond, but the basic issue is the trainwreck C11
made of dependent loads.

Anyway, here's a link to the last time this came up:

  https://lore.kernel.org/linux-arm-kernel/20171116174830.GX3624@linux.vnet.ibm.com/
On Thu, Jun 25, 2020 at 10:03:13AM +0200, Peter Zijlstra wrote:
> On Wed, Jun 24, 2020 at 02:31:36PM -0700, Nick Desaulniers wrote:
> > On Wed, Jun 24, 2020 at 2:15 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Wed, Jun 24, 2020 at 01:31:38PM -0700, Sami Tolvanen wrote:
> > > > This patch series adds support for building x86_64 and arm64 kernels
> > > > with Clang's Link Time Optimization (LTO).
> > > >
> > > > In addition to performance, the primary motivation for LTO is to allow
> > > > Clang's Control-Flow Integrity (CFI) to be used in the kernel. Google's
> > > > Pixel devices have shipped with LTO+CFI kernels since 2018.
> > > >
> > > > Most of the patches are build system changes for handling LLVM bitcode,
> > > > which Clang produces with LTO instead of ELF object files, postponing
> > > > ELF processing until a later stage, and ensuring initcall ordering.
> > > >
> > > > Note that the first objtool patch in the series is already in linux-next,
> > > > but as it's needed with LTO, I'm including it also here to make testing
> > > > easier.
> > >
> > > I'm very sad that yet again, memory ordering isn't addressed. LTO vastly
> > > increases the range of the optimizer to wreck things.
> >
> > Hi Peter, could you expand on the issue for the folks on the thread?
> > I'm happy to try to hack something up in LLVM if we check that X does
> > or does not happen; maybe we can even come up with some concrete test
> > cases that can be added to LLVM's codebase?
>
> I'm sure Will will respond, but the basic issue is the trainwreck C11
> made of dependent loads.
>
> Anyway, here's a link to the last time this came up:
>
>   https://lore.kernel.org/linux-arm-kernel/20171116174830.GX3624@linux.vnet.ibm.com/

Another good read:

  https://lore.kernel.org/lkml/20150520005510.GA23559@linux.vnet.ibm.com/

and having (partially) re-read that, I now worry intensely about things
like latch_tree_find(), cyc2ns_read_begin, __ktime_get_fast_ns().

It looks like kernel/time/sched_clock.c uses raw_read_seqcount() which
deviates from the above patterns by, for some reason, using a primitive
that includes an extra smp_rmb().

And this is just the few things I could remember off the top of my head,
who knows what else is out there.
On Wed, Jun 24, 2020 at 02:30:14PM -0700, Sami Tolvanen wrote:
> On Wed, Jun 24, 2020 at 11:15:40PM +0200, Peter Zijlstra wrote:
> > On Wed, Jun 24, 2020 at 01:31:38PM -0700, Sami Tolvanen wrote:
> > > This patch series adds support for building x86_64 and arm64 kernels
> > > with Clang's Link Time Optimization (LTO).
> > >
> > > In addition to performance, the primary motivation for LTO is to allow
> > > Clang's Control-Flow Integrity (CFI) to be used in the kernel. Google's
> > > Pixel devices have shipped with LTO+CFI kernels since 2018.
> > >
> > > Most of the patches are build system changes for handling LLVM bitcode,
> > > which Clang produces with LTO instead of ELF object files, postponing
> > > ELF processing until a later stage, and ensuring initcall ordering.
> > >
> > > Note that the first objtool patch in the series is already in linux-next,
> > > but as it's needed with LTO, I'm including it also here to make testing
> > > easier.
> >
> > I'm very sad that yet again, memory ordering isn't addressed. LTO vastly
> > increases the range of the optimizer to wreck things.
>
> I believe Will has some thoughts about this, and patches, but I'll let
> him talk about it.

Thanks for reminding me! I will get patches out ASAP (I've been avoiding
the rebase from hell, but this is the motivation I need).

Will
On Thu, Jun 25, 2020 at 10:24:33AM +0200, Peter Zijlstra wrote:
> On Thu, Jun 25, 2020 at 10:03:13AM +0200, Peter Zijlstra wrote:
> > I'm sure Will will respond, but the basic issue is the trainwreck C11
> > made of dependent loads.
> >
> > Anyway, here's a link to the last time this came up:
> >
> > https://lore.kernel.org/linux-arm-kernel/20171116174830.GX3624@linux.vnet.ibm.com/
>
> Another good read:
>
> https://lore.kernel.org/lkml/20150520005510.GA23559@linux.vnet.ibm.com/
>
> and having (partially) re-read that, I now worry intensely about things
> like latch_tree_find(), cyc2ns_read_begin, __ktime_get_fast_ns().
>
> It looks like kernel/time/sched_clock.c uses raw_read_seqcount() which
> deviates from the above patterns by, for some reason, using a primitive
> that includes an extra smp_rmb().
>
> And this is just the few things I could remember off the top of my head,
> who knows what else is out there.

As an example, let us consider __ktime_get_fast_ns(), the critical bit is:

        seq = raw_read_seqcount_latch(&tkf->seq);
        tkr = tkf->base + (seq & 0x01);
        now = tkr->base;

And we hard rely on that being a dependent load, so:

        LOAD    seq, (tkf->seq)
        LOAD    tkr, tkf->base
        AND     seq, 1
        MUL     seq, sizeof(tk_read_base)
        ADD     tkr, seq
        LOAD    now, (tkr->base)

Such that we obtain 'now' as a direct dependency on 'seq'. This ensures
the loads are ordered.

A compiler can wreck this by translating it into something like:

        LOAD    seq, (tkf->seq)
        LOAD    tkr, tkf->base
        AND     seq, 1
        CMP     seq, 0
        JE      1f
        ADD     tkr, sizeof(tk_read_base)
1:      LOAD    now, (tkr->base)

Because now the machine can speculate and load 'now' before 'seq',
breaking the ordering.
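For comparison, here is a minimal sketch of the defensive alternative
mentioned above, the raw_read_seqcount() style that kernel/time/sched_clock.c
uses. Variable names follow Peter's example; this is illustrative, not the
kernel's actual implementation:

        seq = READ_ONCE(tkf->seq);
        smp_rmb();                      /* explicitly order the seq load
                                         * before the data loads below */
        tkr = tkf->base + (seq & 0x01); /* compiler may now do as it likes */
        now = tkr->base;

The cost is an extra barrier instruction on weakly ordered architectures,
which is exactly what the dependent-load pattern is designed to avoid.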
On Thu, Jun 25, 2020 at 5:32 AM 'Sami Tolvanen' via Clang Built Linux
<clang-built-linux@googlegroups.com> wrote:
>
> This patch series adds support for building x86_64 and arm64 kernels
> with Clang's Link Time Optimization (LTO).
>
> In addition to performance, the primary motivation for LTO is to allow
> Clang's Control-Flow Integrity (CFI) to be used in the kernel. Google's
> Pixel devices have shipped with LTO+CFI kernels since 2018.
>
> Most of the patches are build system changes for handling LLVM bitcode,
> which Clang produces with LTO instead of ELF object files, postponing
> ELF processing until a later stage, and ensuring initcall ordering.
>
> Note that the first objtool patch in the series is already in linux-next,
> but as it's needed with LTO, I'm including it also here to make testing
> easier.

I put this series on a testing branch, and 0-day bot started reporting
some issues. (But 0-day bot is quieter than I expected. Perhaps 0-day
bot does not turn on LLVM=1?)

I also got an error for ARCH=arm64 allyesconfig + CONFIG_LTO_CLANG=y:

$ make ARCH=arm64 LLVM=1 LLVM_IAS=1 CROSS_COMPILE=~/tools/aarch64-linaro-7.5/bin/aarch64-linux-gnu- -j24

...
  GEN     .version
  CHK     include/generated/compile.h
  UPD     include/generated/compile.h
  CC      init/version.o
  AR      init/built-in.a
  GEN     .tmp_initcalls.lds
  GEN     .tmp_symversions.lds
  LTO     vmlinux.o
  MODPOST vmlinux.symvers
  MODINFO modules.builtin.modinfo
  GEN     modules.builtin
  LD      .tmp_vmlinux.kallsyms1
ld.lld: error: undefined symbol: __compiletime_assert_905
>>> referenced by irqbypass.c
>>>               vmlinux.o:(jeq_imm)
make: *** [Makefile:1161: vmlinux] Error 1

> Sami Tolvanen (22):
>   objtool: use sh_info to find the base for .rela sections
>   kbuild: add support for Clang LTO
>   kbuild: lto: fix module versioning
>   kbuild: lto: fix recordmcount
>   kbuild: lto: postpone objtool
>   kbuild: lto: limit inlining
>   kbuild: lto: merge module sections
>   kbuild: lto: remove duplicate dependencies from .mod files
>   init: lto: ensure initcall ordering
>   init: lto: fix PREL32 relocations
>   pci: lto: fix PREL32 relocations
>   modpost: lto: strip .lto from module names
>   scripts/mod: disable LTO for empty.c
>   efi/libstub: disable LTO
>   drivers/misc/lkdtm: disable LTO for rodata.o
>   arm64: export CC_USING_PATCHABLE_FUNCTION_ENTRY
>   arm64: vdso: disable LTO
>   arm64: allow LTO_CLANG and THINLTO to be selected
>   x86, vdso: disable LTO only for vDSO
>   x86, ftrace: disable recordmcount for ftrace_make_nop
>   x86, relocs: Ignore L4_PAGE_OFFSET relocations
>   x86, build: allow LTO_CLANG and THINLTO to be selected
>
>  .gitignore                            |   1 +
>  Makefile                              |  27 ++-
>  arch/Kconfig                          |  65 +++++++
>  arch/arm64/Kconfig                    |   2 +
>  arch/arm64/Makefile                   |   1 +
>  arch/arm64/kernel/vdso/Makefile       |   4 +-
>  arch/x86/Kconfig                      |   2 +
>  arch/x86/Makefile                     |   5 +
>  arch/x86/entry/vdso/Makefile          |   5 +-
>  arch/x86/kernel/ftrace.c              |   1 +
>  arch/x86/tools/relocs.c               |   1 +
>  drivers/firmware/efi/libstub/Makefile |   2 +
>  drivers/misc/lkdtm/Makefile           |   1 +
>  include/asm-generic/vmlinux.lds.h     |  12 +-
>  include/linux/compiler-clang.h        |   4 +
>  include/linux/compiler.h              |   2 +-
>  include/linux/compiler_types.h        |   4 +
>  include/linux/init.h                  |  78 +++++++-
>  include/linux/pci.h                   |  15 +-
>  kernel/trace/ftrace.c                 |   1 +
>  lib/Kconfig.debug                     |   2 +-
>  scripts/Makefile.build                |  55 +++++-
>  scripts/Makefile.lib                  |   6 +-
>  scripts/Makefile.modfinal             |  40 +++-
>  scripts/Makefile.modpost              |  26 ++-
>  scripts/generate_initcall_order.pl    | 270 ++++++++++++++++++++++++++
>  scripts/link-vmlinux.sh               | 100 +++++++++-
>  scripts/mod/Makefile                  |   1 +
>  scripts/mod/modpost.c                 |  16 +-
>  scripts/mod/modpost.h                 |   9 +
>  scripts/mod/sumversion.c              |   6 +-
>  scripts/module-lto.lds                |  26 +++
>  scripts/recordmcount.c                |   3 +-
>  tools/objtool/elf.c                   |   2 +-
>  34 files changed, 737 insertions(+), 58 deletions(-)
>  create mode 100755 scripts/generate_initcall_order.pl
>  create mode 100644 scripts/module-lto.lds
>
> base-commit: 26e122e97a3d0390ebec389347f64f3730fdf48f
> --
> 2.27.0.212.ge8ba1cc988-goog

--
Best Regards
Masahiro Yamada
Hi Masahiro,

On Mon, Jun 29, 2020 at 01:56:19AM +0900, Masahiro Yamada wrote:
> On Thu, Jun 25, 2020 at 5:32 AM 'Sami Tolvanen' via Clang Built Linux
> <clang-built-linux@googlegroups.com> wrote:
> >
> > This patch series adds support for building x86_64 and arm64 kernels
> > with Clang's Link Time Optimization (LTO).
> >
> > In addition to performance, the primary motivation for LTO is to allow
> > Clang's Control-Flow Integrity (CFI) to be used in the kernel. Google's
> > Pixel devices have shipped with LTO+CFI kernels since 2018.
> >
> > Most of the patches are build system changes for handling LLVM bitcode,
> > which Clang produces with LTO instead of ELF object files, postponing
> > ELF processing until a later stage, and ensuring initcall ordering.
> >
> > Note that the first objtool patch in the series is already in linux-next,
> > but as it's needed with LTO, I'm including it also here to make testing
> > easier.
>
> I put this series on a testing branch,
> and 0-day bot started reporting some issues.

Yes, I'll fix those issues in v2.

> (but 0-day bot is quieter than I expected.
> Perhaps, 0-day bot does not turn on LLVM=1 ?)

In order for it to test an LTO build, it would need to enable LTO_CLANG
explicitly though, in addition to LLVM=1.

> I also got an error for
> ARCH=arm64 allyesconfig + CONFIG_LTO_CLANG=y
>
> $ make ARCH=arm64 LLVM=1 LLVM_IAS=1
> CROSS_COMPILE=~/tools/aarch64-linaro-7.5/bin/aarch64-linux-gnu- -j24
>
> ...
>
>   GEN     .version
>   CHK     include/generated/compile.h
>   UPD     include/generated/compile.h
>   CC      init/version.o
>   AR      init/built-in.a
>   GEN     .tmp_initcalls.lds
>   GEN     .tmp_symversions.lds
>   LTO     vmlinux.o
>   MODPOST vmlinux.symvers
>   MODINFO modules.builtin.modinfo
>   GEN     modules.builtin
>   LD      .tmp_vmlinux.kallsyms1
> ld.lld: error: undefined symbol: __compiletime_assert_905
> >>> referenced by irqbypass.c
> >>>               vmlinux.o:(jeq_imm)
> make: *** [Makefile:1161: vmlinux] Error 1

I can reproduce this with ToT LLVM and it's BUILD_BUG_ON_MSG(..., "value
too large for the field") in drivers/net/ethernet/netronome/nfp/bpf/jit.c.
Specifically, the FIELD_FIT / __BF_FIELD_CHECK macro in ur_load_imm_any.

This compiles just fine with an earlier LLVM revision, so it could be a
relatively recent regression. I'll take a look. Thanks for catching this!

Sami
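For readers unfamiliar with this class of link error, here is a simplified
sketch of the mechanism behind the kernel's compiletime_assert(), on which
BUILD_BUG_ON_MSG() and __BF_FIELD_CHECK() are built. The real macro
generates a uniquely numbered declaration per use (hence
__compiletime_assert_905); the NNN below is a placeholder:

        /* Simplified sketch; the kernel's version differs in detail. */
        #define compiletime_assert(condition, msg)                      \
                do {                                                    \
                        extern void __compiletime_assert_NNN(void);     \
                        if (!(condition))                               \
                                __compiletime_assert_NNN();             \
                } while (0)

If the optimizer can prove the condition true at compile time, the call is
dead code and vanishes. If constant propagation fails or regresses, the
call to the never-defined function survives into the object file and the
link fails with exactly the undefined-symbol error shown above.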
I was asked for input on this, and after a few days digging through some
history, thought I'd comment. Hope you don't mind.

On Thu, Jun 25, 2020 at 10:57AM +0200, Peter Zijlstra wrote:
> On Thu, Jun 25, 2020 at 10:24:33AM +0200, Peter Zijlstra wrote:
> > On Thu, Jun 25, 2020 at 10:03:13AM +0200, Peter Zijlstra wrote:
> > > I'm sure Will will respond, but the basic issue is the trainwreck C11
> > > made of dependent loads.
> > >
> > > Anyway, here's a link to the last time this came up:
> > >
> > > https://lore.kernel.org/linux-arm-kernel/20171116174830.GX3624@linux.vnet.ibm.com/
> >
> > Another good read:
> >
> > https://lore.kernel.org/lkml/20150520005510.GA23559@linux.vnet.ibm.com/

[...]

> Because now the machine can speculate and load 'now' before 'seq',
> breaking the ordering.

First of all, I agree with the concerns, but not because of LTO.

To set the stage better, and summarize the fundamental problem again:
we're in the unfortunate situation that no compiler today has a way to
_efficiently_ deal with C11's memory_order_consume
[https://lwn.net/Articles/588300/]. If we did, we could just use that
and be done with it. But, sadly, that doesn't seem possible right now --
compilers just say consume==acquire. Will suggests doing the same in the
kernel: https://lkml.kernel.org/r/20200630173734.14057-19-will@kernel.org

What we're most worried about right now is the existence of compiler
transformations that could break data dependencies by e.g. turning them
into control dependencies.

If this is a real worry, I don't think LTO is the magical feature that
will uncover those optimizations. If these compiler transformations are
real, they also exist in a normal build! And if we are worried about
them, we need to stop relying on dependent load ordering across the
board; or switch to -O0 for everything. Clearly, we don't want either.

Why do we think LTO is special? With LTO, Clang just emits LLVM bitcode
instead of ELF objects, and during the linker stage intermodular
optimizations across translation unit boundaries are done that might not
be possible otherwise [https://llvm.org/docs/LinkTimeOptimization.html].
From the memory model side of things, if we could fully convey our
intent to the compiler (the imaginary consume), there would be no
problem, because all optimization stages from bitcode generation to the
final machine code generation after LTO know about the intended
semantics. (Also, keep in mind that LTO is _not_ doing post-link
optimization of machine code binaries!)

But as far as we can tell, there is no evidence of the dreaded "data
dependency to control dependency" conversion with LTO that isn't there
in non-LTO builds, if it's even there at all. Has the data to control
dependency conversion been encountered in the wild? If not, is the
resulting reaction an overreaction? If so, we need to be careful blaming
LTO for something that it isn't even guilty of.

So, we are probably better off untangling LTO from the story:

1. LTO or no LTO does not matter. The LTO series should not get tangled
   up with memory model issues.

2. The memory model question and problems need to be answered and
   addressed separately.

Thoughts?

Thanks,
-- Marco
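For reference, a minimal sketch of the C11 construct Marco refers to;
'struct foo' and 'gp' are illustrative names, not anything from the kernel:

        #include <stdatomic.h>

        struct foo { int val; };
        extern _Atomic(struct foo *) gp;

        int reader(void)
        {
                /* Asks only for dependency (consume) ordering... */
                struct foo *p = atomic_load_explicit(&gp, memory_order_consume);
                /* ...but current compilers silently promote consume to
                 * acquire, so this costs the same as memory_order_acquire
                 * on arm64/POWER instead of being free via the address
                 * dependency. */
                return p->val;
        }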
On Tue, Jun 30, 2020 at 09:19:31PM +0200, Marco Elver wrote:
> I was asked for input on this, and after a few days digging through some
> history, thought I'd comment. Hope you don't mind.

Not at all, being the one that asked :-)

> First of all, I agree with the concerns, but not because of LTO.
>
> To set the stage better, and summarize the fundamental problem again:
> we're in the unfortunate situation that no compiler today has a way to
> _efficiently_ deal with C11's memory_order_consume
> [https://lwn.net/Articles/588300/]. If we did, we could just use that
> and be done with it. But, sadly, that doesn't seem possible right now --
> compilers just say consume==acquire.

I'm not convinced C11 memory_order_consume would actually work for us,
even if it would work. That is, given:

  https://lore.kernel.org/lkml/20150520005510.GA23559@linux.vnet.ibm.com/

only pointers can have consume, but like I pointed out, we have code
that relies on dependent loads from integers.

> Will suggests doing the same in the
> kernel: https://lkml.kernel.org/r/20200630173734.14057-19-will@kernel.org

PowerPC would need a similar thing, it too will not preserve causality
for control dependencies.

> What we're most worried about right now is the existence of compiler
> transformations that could break data dependencies by e.g. turning them
> into control dependencies.

Correct.

> If this is a real worry, I don't think LTO is the magical feature that
> will uncover those optimizations. If these compiler transformations are
> real, they also exist in a normal build!

Agreed, _however_ with the caveat that LTO could make them more common.

After all, with whole program analysis, the compiler might be able to
more easily determine that our pointer @ptr is only ever assigned the
values of &A, &B or &C, while without that visibility it would not be
able to determine this.

Once it knows @ptr has a limited number of determined values, the
conversion into control dependencies becomes much more likely.

> And if we are worried about them, we need to stop relying on dependent
> load ordering across the board; or switch to -O0 for everything.
> Clearly, we don't want either.

Agreed.

> Why do we think LTO is special?

As argued above, whole-program analysis would make it more likely. But I
agree the fundamental problem exists independent from LTO.

> But as far as we can tell, there is no evidence of the dreaded "data
> dependency to control dependency" conversion with LTO that isn't there
> in non-LTO builds, if it's even there at all. Has the data to control
> dependency conversion been encountered in the wild? If not, is the
> resulting reaction an overreaction? If so, we need to be careful blaming
> LTO for something that it isn't even guilty of.

It is mostly paranoia; in a large part driven by the fact that even if
such a conversion were to be done, it could go a very long time without
actually causing problems, and longer still for such problems to be
traced back to such an 'optimization'.

That is, the collective hurt from debugging too many ordering issues.

> So, we are probably better off untangling LTO from the story:
>
> 1. LTO or no LTO does not matter. The LTO series should not get tangled
>    up with memory model issues.
>
> 2. The memory model question and problems need to be answered and
>    addressed separately.
>
> Thoughts?
How hard would it be to create something that analyzes a build and
looks for all 'dependent load -> control dependency' transformations
headed by a volatile (and/or from asm) load and issues a warning for
them?

This would give us an indication of how valuable this transformation is
for the kernel. I'm hoping/expecting it's vanishingly rare, but what do
I know.
On Tue, Jun 30, 2020 at 10:12:43PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 30, 2020 at 09:19:31PM +0200, Marco Elver wrote:
> > I was asked for input on this, and after a few days digging through some
> > history, thought I'd comment. Hope you don't mind.
>
> Not at all, being the one that asked :-)
>
> > First of all, I agree with the concerns, but not because of LTO.
> >
> > To set the stage better, and summarize the fundamental problem again:
> > we're in the unfortunate situation that no compiler today has a way to
> > _efficiently_ deal with C11's memory_order_consume
> > [https://lwn.net/Articles/588300/]. If we did, we could just use that
> > and be done with it. But, sadly, that doesn't seem possible right now --
> > compilers just say consume==acquire.
>
> I'm not convinced C11 memory_order_consume would actually work for us,
> even if it would work. That is, given:
>
>   https://lore.kernel.org/lkml/20150520005510.GA23559@linux.vnet.ibm.com/
>
> only pointers can have consume, but like I pointed out, we have code
> that relies on dependent loads from integers.

I agree that C11 memory_order_consume is not normally what we want,
given that it is universally promoted to memory_order_acquire.

However, dependent loads from integers are, if anything, more difficult
to defend from the compiler than are control dependencies. This applies
doubly to integers that are used to index two-element arrays, in which
case you are just asking the compiler to destroy your dependent loads
by converting them into control dependencies.

> > Will suggests doing the same in the
> > kernel: https://lkml.kernel.org/r/20200630173734.14057-19-will@kernel.org
>
> PowerPC would need a similar thing, it too will not preserve causality
> for control dependencies.
>
> > What we're most worried about right now is the existence of compiler
> > transformations that could break data dependencies by e.g. turning them
> > into control dependencies.
>
> Correct.
>
> > If this is a real worry, I don't think LTO is the magical feature that
> > will uncover those optimizations. If these compiler transformations are
> > real, they also exist in a normal build!
>
> Agreed, _however_ with the caveat that LTO could make them more common.
>
> After all, with whole program analysis, the compiler might be able to
> more easily determine that our pointer @ptr is only ever assigned the
> values of &A, &B or &C, while without that visibility it would not be
> able to determine this.
>
> Once it knows @ptr has a limited number of determined values, the
> conversion into control dependencies becomes much more likely.

Which would of course break dependent loads.

> > And if we are worried about them, we need to stop relying on dependent
> > load ordering across the board; or switch to -O0 for everything.
> > Clearly, we don't want either.
>
> Agreed.
>
> > Why do we think LTO is special?
>
> As argued above, whole-program analysis would make it more likely. But I
> agree the fundamental problem exists independent from LTO.
>
> > But as far as we can tell, there is no evidence of the dreaded "data
> > dependency to control dependency" conversion with LTO that isn't there
> > in non-LTO builds, if it's even there at all. Has the data to control
> > dependency conversion been encountered in the wild? If not, is the
> > resulting reaction an overreaction? If so, we need to be careful blaming
> > LTO for something that it isn't even guilty of.
>
> It is mostly paranoia; in a large part driven by the fact that even if
> such a conversion were to be done, it could go a very long time without
> actually causing problems, and longer still for such problems to be
> traced back to such an 'optimization'.
>
> That is, the collective hurt from debugging too many ordering issues.
>
> > So, we are probably better off untangling LTO from the story:
> >
> > 1. LTO or no LTO does not matter. The LTO series should not get tangled
> >    up with memory model issues.
> >
> > 2. The memory model question and problems need to be answered and
> >    addressed separately.
> >
> > Thoughts?
>
> How hard would it be to create something that analyzes a build and
> looks for all 'dependent load -> control dependency' transformations
> headed by a volatile (and/or from asm) load and issues a warning for
> them?
>
> This would give us an indication of how valuable this transformation is
> for the kernel. I'm hoping/expecting it's vanishingly rare, but what do
> I know.

This could be quite useful!

							Thanx, Paul
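A source-level sketch of the two-element-array hazard Paul describes above;
all names are illustrative:

        /* Intended: 'd' carries an address dependency on 'idx'. */
        idx = READ_ONCE(gidx);
        d = &arr[idx & 1];
        v = d->val;

        /* With only two possible targets, the compiler may instead emit
         * the moral equivalent of the following, replacing the data
         * dependency with a branch (a control dependency), which does
         * not order one load against a later load: */
        idx = READ_ONCE(gidx);
        d = (idx & 1) ? &arr[1] : &arr[0];
        v = d->val;     /* the CPU may speculate this load past the branch */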
On Tue, Jun 30, 2020 at 01:30:16PM -0700, Paul E. McKenney wrote:
> On Tue, Jun 30, 2020 at 10:12:43PM +0200, Peter Zijlstra wrote:
> > I'm not convinced C11 memory_order_consume would actually work for us,
> > even if it would work. That is, given:
> >
> >   https://lore.kernel.org/lkml/20150520005510.GA23559@linux.vnet.ibm.com/
> >
> > only pointers can have consume, but like I pointed out, we have code
> > that relies on dependent loads from integers.
>
> I agree that C11 memory_order_consume is not normally what we want,
> given that it is universally promoted to memory_order_acquire.
>
> However, dependent loads from integers are, if anything, more difficult
> to defend from the compiler than are control dependencies. This applies
> doubly to integers that are used to index two-element arrays, in which
> case you are just asking the compiler to destroy your dependent loads
> by converting them into control dependencies.

Yes, I'm aware. However, as you might know, I'm firmly in the 'C is a
glorified assembler' camp (as I expect most actual OS people are, out of
necessity if nothing else) and if I wanted a control dependency I
would've bloody well written one.

I think an optimizing compiler is awesome, but only in so far as that
optimization is actually helpful -- and yes, I just stepped into a giant
twilight zone there. That is, any optimization that has _any_
controversy should be controllable (like -fno-strict-overflow
-fno-strict-aliasing) and I'd very much like the same here.

In a larger context, I still think that eliminating speculative stores
is both necessary and sufficient to avoid out-of-thin-air. So I'd also
love to get some control on that.
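The speculative-store problem Peter refers to, in its classic form; this
is the textbook illustration (for plain, non-volatile accesses), with
illustrative variable names:

        if (a)
                b = 42;

        /* An over-eager compiler could transform the above into: */
        tmp = b;
        b = 42;         /* speculative store, visible to other CPUs */
        if (!a)
                b = tmp;        /* "undone", but too late: another thread
                                 * may already have observed b == 42,
                                 * a value out of thin air */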
On Tue, 30 Jun 2020 at 22:30, Paul E. McKenney <paulmck@kernel.org> wrote:
> On Tue, Jun 30, 2020 at 10:12:43PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 30, 2020 at 09:19:31PM +0200, Marco Elver wrote:
> > > First of all, I agree with the concerns, but not because of LTO.
> > >
> > > To set the stage better, and summarize the fundamental problem again:
> > > we're in the unfortunate situation that no compiler today has a way to
> > > _efficiently_ deal with C11's memory_order_consume
> > > [https://lwn.net/Articles/588300/]. If we did, we could just use that
> > > and be done with it. But, sadly, that doesn't seem possible right now --
> > > compilers just say consume==acquire.
> >
> > I'm not convinced C11 memory_order_consume would actually work for us,
> > even if it would work. That is, given:
> >
> >   https://lore.kernel.org/lkml/20150520005510.GA23559@linux.vnet.ibm.com/
> >
> > only pointers can have consume, but like I pointed out, we have code
> > that relies on dependent loads from integers.
>
> I agree that C11 memory_order_consume is not normally what we want,
> given that it is universally promoted to memory_order_acquire.
>
> However, dependent loads from integers are, if anything, more difficult
> to defend from the compiler than are control dependencies. This applies
> doubly to integers that are used to index two-element arrays, in which
> case you are just asking the compiler to destroy your dependent loads
> by converting them into control dependencies.
>
> > > Will suggests doing the same in the
> > > kernel: https://lkml.kernel.org/r/20200630173734.14057-19-will@kernel.org
> >
> > PowerPC would need a similar thing, it too will not preserve causality
> > for control dependencies.
> >
> > > What we're most worried about right now is the existence of compiler
> > > transformations that could break data dependencies by e.g. turning them
> > > into control dependencies.
> >
> > Correct.
> >
> > > If this is a real worry, I don't think LTO is the magical feature that
> > > will uncover those optimizations. If these compiler transformations are
> > > real, they also exist in a normal build!
> >
> > Agreed, _however_ with the caveat that LTO could make them more common.
> >
> > After all, with whole program analysis, the compiler might be able to
> > more easily determine that our pointer @ptr is only ever assigned the
> > values of &A, &B or &C, while without that visibility it would not be
> > able to determine this.
> >
> > Once it knows @ptr has a limited number of determined values, the
> > conversion into control dependencies becomes much more likely.
>
> Which would of course break dependent loads.
>
> > > And if we are worried about them, we need to stop relying on dependent
> > > load ordering across the board; or switch to -O0 for everything.
> > > Clearly, we don't want either.
> >
> > Agreed.
> >
> > > Why do we think LTO is special?
> >
> > As argued above, whole-program analysis would make it more likely. But I
> > agree the fundamental problem exists independent from LTO.
> >
> > > But as far as we can tell, there is no evidence of the dreaded "data
> > > dependency to control dependency" conversion with LTO that isn't there
> > > in non-LTO builds, if it's even there at all. Has the data to control
> > > dependency conversion been encountered in the wild? If not, is the
> > > resulting reaction an overreaction? If so, we need to be careful blaming
> > > LTO for something that it isn't even guilty of.
> >
> > It is mostly paranoia; in a large part driven by the fact that even if
> > such a conversion were to be done, it could go a very long time without
> > actually causing problems, and longer still for such problems to be
> > traced back to such an 'optimization'.
> >
> > That is, the collective hurt from debugging too many ordering issues.
> >
> > > So, we are probably better off untangling LTO from the story:
> > >
> > > 1. LTO or no LTO does not matter. The LTO series should not get tangled
> > >    up with memory model issues.
> > >
> > > 2. The memory model question and problems need to be answered and
> > >    addressed separately.
> > >
> > > Thoughts?
> >
> > How hard would it be to create something that analyzes a build and
> > looks for all 'dependent load -> control dependency' transformations
> > headed by a volatile (and/or from asm) load and issues a warning for
> > them?

I was thinking about this, but in the context of the "auto-promote to
acquire" which you didn't like. Issuing a warning should certainly be
simpler.

I think there is no one place where we know these transformations
happen, but rather, need to analyze the IR before transformations,
take note of all the dependent loads headed by volatile+asm, and then
run an analysis after optimizations checking the dependencies are
still there.

> > This would give us an indication of how valuable this transformation is
> > for the kernel. I'm hoping/expecting it's vanishingly rare, but what do
> > I know.
>
> This could be quite useful!

We might then even be able to say, "if you get this warning, turn on
CONFIG_ACQUIRE_READ_DEPENDENCIES" (or however the option will be
named). Or some other tricks, like automatically recompile the TU
where this happens with the option. But again, this is not something
that should specifically block LTO, because if we have this, we'll
need to turn it on for everything.

Thanks,
-- Marco
On Wed, Jul 01, 2020 at 11:41:17AM +0200, Marco Elver wrote:
> On Tue, 30 Jun 2020 at 22:30, Paul E. McKenney <paulmck@kernel.org> wrote:
> > On Tue, Jun 30, 2020 at 10:12:43PM +0200, Peter Zijlstra wrote:
> > > On Tue, Jun 30, 2020 at 09:19:31PM +0200, Marco Elver wrote:
> > > > So, we are probably better off untangling LTO from the story:
> > > >
> > > > 1. LTO or no LTO does not matter. The LTO series should not get tangled
> > > >    up with memory model issues.
> > > >
> > > > 2. The memory model question and problems need to be answered and
> > > >    addressed separately.
> > > >
> > > > Thoughts?
> > >
> > > How hard would it be to create something that analyzes a build and
> > > looks for all 'dependent load -> control dependency' transformations
> > > headed by a volatile (and/or from asm) load and issues a warning for
> > > them?
>
> I was thinking about this, but in the context of the "auto-promote to
> acquire" which you didn't like. Issuing a warning should certainly be
> simpler.
>
> I think there is no one place where we know these transformations
> happen, but rather, need to analyze the IR before transformations,
> take note of all the dependent loads headed by volatile+asm, and then
> run an analysis after optimizations checking the dependencies are
> still there.
>
> > > This would give us an indication of how valuable this transformation is
> > > for the kernel. I'm hoping/expecting it's vanishingly rare, but what do
> > > I know.
> >
> > This could be quite useful!
>
> We might then even be able to say, "if you get this warning, turn on
> CONFIG_ACQUIRE_READ_DEPENDENCIES" (or however the option will be
> named). Or some other tricks, like automatically recompile the TU
> where this happens with the option. But again, this is not something
> that should specifically block LTO, because if we have this, we'll
> need to turn it on for everything.

I'm not especially keen on solving this with additional config options --
all it does is further fragment the number of kernels we have to care
about, and distributions really won't be in a position to know whether
this should be enabled or not. I would prefer that the build fails, and
we figure out which compiler switch we need to stop the harmful
optimisation taking place. As Peter says, it _should_ be a rare thing to
see (empirically, the kernel seems to be getting away with it so far).

The problem, as I see it, is that the C language doesn't provide us with
a way to express dependency ordering and so we're at the mercy of the
compiler when we roll our own implementation. Paul continues to fight
the good fight at committee meetings to improve the situation, but in
the meantime we'd benefit from two things:

  1. A way to disable any compiler optimisations that break our
     dependency ordering in spite of READ_ONCE()

  2. A way to detect at build time if these harmful optimisations are
     taking place

Finally, while I agree that this problem isn't limited to LTO, my fear
is that LTO provides enough information for address dependencies headed
by a READ_ONCE() to be converted to control dependencies when some
common values of the pointer can be determined by the compiler. If we
can rule this sort of thing out, then great, but in the absence of (2)
I think throwing in an acquire is a sensible safety measure. Doesn't CFI
rely on LTO to do something similar for indirect branch targets, or have
I got that totally mixed up?

Will
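A sketch of the safety measure Will describes, stated in its crudest form
(the actual arm64 patches are more refined, using the RCpc LDAPR
instruction where available rather than a full acquire fence; see the
patch link earlier in the thread):

        /* Sketch only: promote dependency ordering to acquire ordering,
         * so that even a compiler-broken dependency still orders the
         * READ_ONCE() before subsequent dependent accesses. */
        #define READ_ONCE(x)    smp_load_acquire(&(x))

The trade-off is paying for acquire semantics on every READ_ONCE(), even
the vast majority of sites that never relied on dependency ordering.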
On Wed, Jul 01, 2020 at 11:41:17AM +0200, Marco Elver wrote:
> On Tue, 30 Jun 2020 at 22:30, Paul E. McKenney <paulmck@kernel.org> wrote:
> > On Tue, Jun 30, 2020 at 10:12:43PM +0200, Peter Zijlstra wrote:
> > > On Tue, Jun 30, 2020 at 09:19:31PM +0200, Marco Elver wrote:
> > > > Thoughts?
> > >
> > > How hard would it be to create something that analyzes a build and
> > > looks for all 'dependent load -> control dependency' transformations
> > > headed by a volatile (and/or from asm) load and issues a warning for
> > > them?
>
> I was thinking about this, but in the context of the "auto-promote to
> acquire" which you didn't like. Issuing a warning should certainly be
> simpler.
>
> I think there is no one place where we know these transformations
> happen, but rather, need to analyze the IR before transformations,
> take note of all the dependent loads headed by volatile+asm, and then
> run an analysis after optimizations checking the dependencies are
> still there.

Urgh, that sounds nasty. The thing is, as I've hinted at in my other
reply, I would really like a compiler switch to disable this
optimization entirely -- knowing how relevant the transformation is, is
simply a first step towards that.

In order to control the transformation, you have to actually know where
in the optimization passes it happens.

Also, if (big if in my book) we find the optimization is actually
beneficial, we can invert the warning when using the switch and warn
about lost optimization possibilities and manually re-write the code to
use control deps.

> > > This would give us an indication of how valuable this transformation is
> > > for the kernel. I'm hoping/expecting it's vanishingly rare, but what do
> > > I know.
> >
> > This could be quite useful!
>
> We might then even be able to say, "if you get this warning, turn on
> CONFIG_ACQUIRE_READ_DEPENDENCIES" (or however the option will be
> named).

I was going to suggest: if this happens, employ -fno-wreck-dependencies
:-)
On Wed, Jul 01, 2020 at 01:40:27PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 01, 2020 at 11:41:17AM +0200, Marco Elver wrote:
> > On Tue, 30 Jun 2020 at 22:30, Paul E. McKenney <paulmck@kernel.org> wrote:
> > > On Tue, Jun 30, 2020 at 10:12:43PM +0200, Peter Zijlstra wrote:
> > > > On Tue, Jun 30, 2020 at 09:19:31PM +0200, Marco Elver wrote:
> > > > > Thoughts?
> > > >
> > > > How hard would it be to create something that analyzes a build and
> > > > looks for all 'dependent load -> control dependency' transformations
> > > > headed by a volatile (and/or from asm) load and issues a warning for
> > > > them?
> >
> > I was thinking about this, but in the context of the "auto-promote to
> > acquire" which you didn't like. Issuing a warning should certainly be
> > simpler.
> >
> > I think there is no one place where we know these transformations
> > happen, but rather, need to analyze the IR before transformations,
> > take note of all the dependent loads headed by volatile+asm, and then
> > run an analysis after optimizations checking the dependencies are
> > still there.
>
> Urgh, that sounds nasty. The thing is, as I've hinted at in my other
> reply, I would really like a compiler switch to disable this
> optimization entirely -- knowing how relevant the transformation is, is
> simply a first step towards that.
>
> In order to control the transformation, you have to actually know where
> in the optimization passes it happens.
>
> Also, if (big if in my book) we find the optimization is actually
> beneficial, we can invert the warning when using the switch and warn
> about lost optimization possibilities and manually re-write the code to
> use control deps.

There are lots of optimization passes and any of them might decide to
destroy dependencies. :-(

> > > > This would give us an indication of how valuable this transformation is
> > > > for the kernel. I'm hoping/expecting it's vanishingly rare, but what do
> > > > I know.
> > >
> > > This could be quite useful!
> >
> > We might then even be able to say, "if you get this warning, turn on
> > CONFIG_ACQUIRE_READ_DEPENDENCIES" (or however the option will be
> > named).
>
> I was going to suggest: if this happens, employ -fno-wreck-dependencies
> :-)

The current state in the C++ committee is that marking variables
carrying dependencies is the way forward. This is of course not what
the Linux kernel community does, but it should not be hard to have a
-fall-variables-dependent or some such that causes all variables to be
treated as if they were marked. Though I was hoping for only pointers.
Are they -sure- that they -absolutely- need to carry dependencies
through integers???

Anyway, the next step is to provide this functionality in one of the
major compilers. Akshat Garg started this in GCC as a GSoC project by
duplicating "volatile" functionality with a _Dependent_ptr keyword.
Next steps would include removing "volatile" functionality not required
for dependencies. Here is a random posting, which if I remember
correctly raised some doubts as to whether "volatile" was really carried
through everywhere that it needs to be for things like LTO:

  https://gcc.gnu.org/legacy-ml/gcc/2019-07/msg00139.html

What happened to this effort? Akshat graduated and got an unrelated
job, you know, the usual. ;-)

							Thanx, Paul
From: Peter Zijlstra
> Sent: 01 July 2020 10:11
> On Tue, Jun 30, 2020 at 01:30:16PM -0700, Paul E. McKenney wrote:
> > On Tue, Jun 30, 2020 at 10:12:43PM +0200, Peter Zijlstra wrote:
> > >
> > > I'm not convinced C11 memory_order_consume would actually work for us,
> > > even if it would work. That is, given:
> > >
> > >   https://lore.kernel.org/lkml/20150520005510.GA23559@linux.vnet.ibm.com/
> > >
> > > only pointers can have consume, but like I pointed out, we have code
> > > that relies on dependent loads from integers.
> >
> > I agree that C11 memory_order_consume is not normally what we want,
> > given that it is universally promoted to memory_order_acquire.
> >
> > However, dependent loads from integers are, if anything, more difficult
> > to defend from the compiler than are control dependencies. This applies
> > doubly to integers that are used to index two-element arrays, in which
> > case you are just asking the compiler to destroy your dependent loads
> > by converting them into control dependencies.
>
> Yes, I'm aware. However, as you might know, I'm firmly in the 'C is a
> glorified assembler' camp (as I expect most actual OS people are, out of
> necessity if nothing else) and if I wanted a control dependency I
> would've bloody well written one.

I write in C because doing register tracking is hard :-)
I've got an hdlc implementation in C that is carefully adjusted
so that the worst case path is bounded.
I probably know every one of the 1000 instructions in it.

Would an asm statement that uses the same 'register' for input and
output but doesn't actually do anything help?
It won't generate any code, but the compiler ought to assume that
it might change the value - so can't do optimisations that track
the value across the call.

> I think an optimizing compiler is awesome, but only in so far as that
> optimization is actually helpful -- and yes, I just stepped into a giant
> twilight zone there. That is, any optimization that has _any_
> controversy should be controllable (like -fno-strict-overflow
> -fno-strict-aliasing) and I'd very much like the same here.

I'm fed up of gcc generating the code that uses SIMD instructions
for the 'tail' loop at the end of a function that is already doing
SIMD operations for the main part of the loop.
And compilers that convert a byte copy loop to 'rep movsb'.
If I'm copying 3 or 4 bytes I don't want a 40 clock overhead.

	David
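The no-op asm statement David describes already exists in the kernel as
OPTIMIZER_HIDE_VAR() in include/linux/compiler.h; its definition is
roughly the following:

        /* The compiler must assume the (empty) asm may change 'var', so
         * whatever it knew about var's value is discarded afterwards.
         * This prevents value-tracking from turning a data dependency
         * through 'var' into a branch. No instructions are emitted. */
        #define OPTIMIZER_HIDE_VAR(var)                                 \
                __asm__ ("" : "=r" (var) : "0" (var))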
On Wed, Jul 01, 2020 at 07:06:54AM -0700, Paul E. McKenney wrote:
> The current state in the C++ committee is that marking variables
> carrying dependencies is the way forward. This is of course not what
> the Linux kernel community does, but it should not be hard to have a
> -fall-variables-dependent or some such that causes all variables to be
> treated as if they were marked. Though I was hoping for only pointers.
> Are they -sure- that they -absolutely- need to carry dependencies
> through integers???

What's 'need'? :-)

I'm thinking __ktime_get_fast_ns() is better off with a dependent load
than it is with an extra smp_rmb().

Yes we can stick an smp_rmb() in there, but I don't like it. Like I
wrote earlier, if I wanted a control dependency, I'd have written one.
On Wed, Jul 01, 2020 at 05:05:12PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 01, 2020 at 07:06:54AM -0700, Paul E. McKenney wrote:
> > The current state in the C++ committee is that marking variables
> > carrying dependencies is the way forward. This is of course not what
> > the Linux kernel community does, but it should not be hard to have a
> > -fall-variables-dependent or some such that causes all variables to be
> > treated as if they were marked. Though I was hoping for only pointers.
> > Are they -sure- that they -absolutely- need to carry dependencies
> > through integers???
>
> What's 'need'? :-)

Turning off all dependency-killing optimizations on all pointers is
likely a non-event. Turning off all dependency-killing optimizations on
all integers is not the road to happiness.

So whatever "need" might be, it would need to be rather earthshaking. ;-)
It is probably not -that- hard to convert to pointers, even if they are
indexing multiple arrays.

> I'm thinking __ktime_get_fast_ns() is better off with a dependent load
> than it is with an extra smp_rmb().
>
> Yes we can stick an smp_rmb() in there, but I don't like it. Like I
> wrote earlier, if I wanted a control dependency, I'd have written one.

No argument here. But it looks like we are going to have to tell the
compiler.

							Thanx, Paul
On Wed, Jul 01, 2020 at 02:20:13PM +0000, David Laight wrote:
> From: Peter Zijlstra
> > Sent: 01 July 2020 10:11
> > On Tue, Jun 30, 2020 at 01:30:16PM -0700, Paul E. McKenney wrote:
> > > On Tue, Jun 30, 2020 at 10:12:43PM +0200, Peter Zijlstra wrote:
> > > >
> > > > I'm not convinced C11 memory_order_consume would actually work for us,
> > > > even if it would work. That is, given:
> > > >
> > > >   https://lore.kernel.org/lkml/20150520005510.GA23559@linux.vnet.ibm.com/
> > > >
> > > > only pointers can have consume, but like I pointed out, we have code
> > > > that relies on dependent loads from integers.
> > >
> > > I agree that C11 memory_order_consume is not normally what we want,
> > > given that it is universally promoted to memory_order_acquire.
> > >
> > > However, dependent loads from integers are, if anything, more difficult
> > > to defend from the compiler than are control dependencies. This applies
> > > doubly to integers that are used to index two-element arrays, in which
> > > case you are just asking the compiler to destroy your dependent loads
> > > by converting them into control dependencies.
> >
> > Yes, I'm aware. However, as you might know, I'm firmly in the 'C is a
> > glorified assembler' camp (as I expect most actual OS people are, out of
> > necessity if nothing else) and if I wanted a control dependency I
> > would've bloody well written one.
>
> I write in C because doing register tracking is hard :-)
> I've got an hdlc implementation in C that is carefully adjusted
> so that the worst case path is bounded.
> I probably know every one of the 1000 instructions in it.
>
> Would an asm statement that uses the same 'register' for input and
> output but doesn't actually do anything help?
> It won't generate any code, but the compiler ought to assume that
> it might change the value - so can't do optimisations that track
> the value across the call.

It might replace the volatile load, but there are optimizations that
apply to the downstream code as well.

Or are you suggesting periodically pushing the dependent variable
through this asm? That might work, but it would be easier and
more maintainable to just mark the variable.

> > I think an optimizing compiler is awesome, but only in so far as that
> > optimization is actually helpful -- and yes, I just stepped into a giant
> > twilight zone there. That is, any optimization that has _any_
> > controversy should be controllable (like -fno-strict-overflow
> > -fno-strict-aliasing) and I'd very much like the same here.
>
> I'm fed up of gcc generating the code that uses SIMD instructions
> for the 'tail' loop at the end of a function that is already doing
> SIMD operations for the main part of the loop.
> And compilers that convert a byte copy loop to 'rep movsb'.
> If I'm copying 3 or 4 bytes I don't want a 40 clock overhead.

Agreed, compilers can often be all too "helpful". :-(

							Thanx, Paul
On Wed, Jul 01, 2020 at 09:03:38AM -0700, Paul E. McKenney wrote:
> But it looks like we are going to have to tell the compiler.
What does the current proposal look like? I can certainly annotate the
seqcount latch users, but who knows what other code is out there....
From: Paul E. McKenney
> Sent: 01 July 2020 17:06
...
> > Would an asm statement that uses the same 'register' for input and
> > output but doesn't actually do anything help?
> > It won't generate any code, but the compiler ought to assume that
> > it might change the value - so can't do optimisations that track
> > the value across the call.
>
> It might replace the volatile load, but there are optimizations that
> apply to the downstream code as well.
>
> Or are you suggesting periodically pushing the dependent variable
> through this asm? That might work, but it would be easier and
> more maintainable to just mark the variable.

Marking the variable requires compiler support.
Although what 'volatile register int foo;' means might be interesting.

So I was thinking that in the case mentioned earlier you do:

        ptr += LAUNDER(offset & 1);

to ensure the compiler didn't convert to:

        if (offset & 1)
                ptr++;

(Which is probably a pessimisation - the reverse is likely better.)

	David
On Thu, Jul 02, 2020 at 10:20:40AM +0200, Peter Zijlstra wrote:
> On Wed, Jul 01, 2020 at 09:03:38AM -0700, Paul E. McKenney wrote:
> > But it looks like we are going to have to tell the compiler.
>
> What does the current proposal look like? I can certainly annotate the
> seqcount latch users, but who knows what other code is out there....

For pointers, yes, within the Linux kernel it is hopeless, thus the
thought of a -fall-dependent-ptr or some such that makes the compiler
pretend that each and every pointer is marked with the _Dependent_ptr
qualifier.

New non-Linux-kernel code might want to use this qualifier explicitly,
perhaps something like the following:

        _Dependent_ptr struct foo *p;  // Or maybe after the "*"?

        rcu_read_lock();
        p = rcu_dereference(gp);
        // And so on...

If a function is to take a dependent pointer as a function argument,
then the corresponding parameter needs the _Dependent_ptr marking.
Ditto for return values.

The proposal did not cover integers due to concerns about the number of
optimization passes that would need to be reviewed to make that work.
Nevertheless, using a marked integer would be safer than using an
unmarked one, and if the review can be carried out, why not? Maybe
something like this:

        _Dependent_ptr int idx;

        rcu_read_lock();
        idx = READ_ONCE(gidx);
        d = rcuarray[idx];
        rcu_read_unlock();
        do_something_with(d);

So use of this qualifier is quite reasonable.

The prototype for GCC is here: https://github.com/AKG001/gcc/

							Thanx, Paul
On Thu, Jul 02, 2020 at 09:37:26AM +0000, David Laight wrote:
> From: Paul E. McKenney
> > Sent: 01 July 2020 17:06
> ...
> > > Would an asm statement that uses the same 'register' for input and
> > > output but doesn't actually do anything help?
> > > It won't generate any code, but the compiler ought to assume that
> > > it might change the value - so can't do optimisations that track
> > > the value across the call.
> >
> > It might replace the volatile load, but there are optimizations that
> > apply to the downstream code as well.
> >
> > Or are you suggesting periodically pushing the dependent variable
> > through this asm? That might work, but it would be easier and
> > more maintainable to just mark the variable.
>
> Marking the variable requires compiler support.
> Although what 'volatile register int foo;' means might be interesting.
>
> So I was thinking that in the case mentioned earlier you do:
>
>         ptr += LAUNDER(offset & 1);
>
> to ensure the compiler didn't convert to:
>
>         if (offset & 1)
>                 ptr++;
>
> (Which is probably a pessimisation - the reverse is likely better.)

Indeed, Akshat's prototype follows the "volatile" qualifier in many ways.

  https://github.com/AKG001/gcc/

							Thanx, Paul
On Thu, Jul 02, 2020 at 10:59:48AM -0700, Paul E. McKenney wrote:
> On Thu, Jul 02, 2020 at 10:20:40AM +0200, Peter Zijlstra wrote:
> > On Wed, Jul 01, 2020 at 09:03:38AM -0700, Paul E. McKenney wrote:
> > > But it looks like we are going to have to tell the compiler.
> >
> > What does the current proposal look like? I can certainly annotate the
> > seqcount latch users, but who knows what other code is out there....
>
> For pointers, yes, within the Linux kernel it is hopeless, thus the
> thought of a -fall-dependent-ptr or some such that makes the compiler
> pretend that each and every pointer is marked with the _Dependent_ptr
> qualifier.
>
> New non-Linux-kernel code might want to use this qualifier explicitly,
> perhaps something like the following:
>
>         _Dependent_ptr struct foo *p;  // Or maybe after the "*"?

After, as you've written it, it's a pointer to a '_Dependent struct
foo'.

>         rcu_read_lock();
>         p = rcu_dereference(gp);
>         // And so on...
>
> If a function is to take a dependent pointer as a function argument,
> then the corresponding parameter needs the _Dependent_ptr marking.
> Ditto for return values.
>
> The proposal did not cover integers due to concerns about the number of
> optimization passes that would need to be reviewed to make that work.
> Nevertheless, using a marked integer would be safer than using an
> unmarked one, and if the review can be carried out, why not? Maybe
> something like this:
>
>         _Dependent_ptr int idx;
>
>         rcu_read_lock();
>         idx = READ_ONCE(gidx);
>         d = rcuarray[idx];
>         rcu_read_unlock();
>         do_something_with(d);
>
> So use of this qualifier is quite reasonable.

The above usage might warrant a rename of the qualifier though, since
clearly there isn't anything ptr around.

> The prototype for GCC is here: https://github.com/AKG001/gcc/

Thanks! Those test cases are somewhat over qualified though:

        static volatile _Atomic (TYPE) * _Dependent_ptr a;      \

Also, if C goes and specifies load dependencies, in any form, is then
not the corollary that they need to specify control dependencies? How
else can they exclude the transformation?

And of course, once we're there, can we get explicit support for control
dependencies too? :-) :-)
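For completeness, the control-dependency idiom Peter would like the
standard to recognize, as the kernel already uses it (see
Documentation/memory-barriers.txt); 'flag' and 'data' are illustrative:

        /* On mainstream CPUs the conditional branch orders the load of
         * 'flag' before the store to 'data': a load->store control
         * dependency. No barrier instruction is needed, but the compiler
         * must not remove the branch or convert it back into straight-line
         * code, which is precisely what the standard cannot yet express. */
        if (READ_ONCE(flag))
                WRITE_ONCE(data, 1);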
On Fri, Jul 03, 2020 at 03:13:30PM +0200, Peter Zijlstra wrote:
> > The prototype for GCC is here: https://github.com/AKG001/gcc/
>
> Thanks! Those test cases are somewhat over qualified though:
>
> 	static volatile _Atomic (TYPE) * _Dependent_ptr a; \

One question though; since it's a qualifier, and we've recently spent a
whole lot of effort to strip qualifiers in say READ_ONCE(), how does,
and how do we want, this qualifier to behave?

C++ has very convenient means of manipulating qualifiers, so it's not
much of a problem there, but for C it is, as we've found, really quite
cumbersome. Even with _Generic() we can't manipulate individual
qualifiers afaict.
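For readers wondering what that qualifier-stripping looks like today:
roughly the following, simplified from include/linux/compiler_types.h.
_Generic() can only map whole types to whole types, so every scalar
type must be enumerated by hand, and all qualifiers on the selected
type are dropped together; there is no way to peel off just one of
them.

	#define __scalar_type_to_expr_cases(type)		\
		unsigned type:	(unsigned type)0,		\
		signed type:	(signed type)0

	#define __unqual_scalar_typeof(x) typeof(		\
		_Generic((x),					\
			 char:	(char)0,			\
			 __scalar_type_to_expr_cases(char),	\
			 __scalar_type_to_expr_cases(short),	\
			 __scalar_type_to_expr_cases(int),	\
			 __scalar_type_to_expr_cases(long),	\
			 __scalar_type_to_expr_cases(long long),\
			 default: (x)))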
On Fri, Jul 03, 2020 at 03:13:30PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 02, 2020 at 10:59:48AM -0700, Paul E. McKenney wrote:
> > On Thu, Jul 02, 2020 at 10:20:40AM +0200, Peter Zijlstra wrote:
> > > On Wed, Jul 01, 2020 at 09:03:38AM -0700, Paul E. McKenney wrote:
> > >
> > > > > But it looks like we are going to have to tell the compiler.
> > >
> > > What does the current proposal look like? I can certainly annotate the
> > > seqcount latch users, but who knows what other code is out there....
> >
> > For pointers, yes, within the Linux kernel it is hopeless, thus the
> > thought of a -fall-dependent-ptr or some such that makes the compiler
> > pretend that each and every pointer is marked with the _Dependent_ptr
> > qualifier.
> >
> > New non-Linux-kernel code might want to use this qualifier explicitly,
> > perhaps something like the following:
> >
> > 	_Dependent_ptr struct foo *p; // Or maybe after the "*"?
>
> After, as you've written it, it's a pointer to a '_Dependent struct
> foo'.

Yeah, I have to look that up every time. :-/ Thank you for checking!

> > 	rcu_read_lock();
> > 	p = rcu_dereference(gp);
> > 	// And so on...
> >
> > If a function is to take a dependent pointer as a function argument,
> > then the corresponding parameter needs the _Dependent_ptr marking.
> > Ditto for return values.
> >
> > The proposal did not cover integers due to concerns about the number of
> > optimization passes that would need to be reviewed to make that work.
> > Nevertheless, using a marked integer would be safer than using an
> > unmarked one, and if the review can be carried out, why not? Maybe
> > something like this:
> >
> > 	_Dependent_ptr int idx;
> >
> > 	rcu_read_lock();
> > 	idx = READ_ONCE(gidx);
> > 	d = rcuarray[idx];
> > 	rcu_read_unlock();
> > 	do_something_with(d);
> >
> > So use of this qualifier is quite reasonable.
>
> The above usage might warrant a rename of the qualifier though, since
> clearly there isn't anything ptr around.

Given the large number of additional optimizations that need to be
suppressed in the non-pointer case, any discouragement based on the
"_ptr" at the end of the name is all to the good. And if that line of
reasoning is unconvincing, please look at the program at the end of
this email, which compiles without errors with -Wall and gives the
expected output. ;-)

> > The prototype for GCC is here: https://github.com/AKG001/gcc/
>
> Thanks! Those test cases are somewhat over qualified though:
>
> 	static volatile _Atomic (TYPE) * _Dependent_ptr a; \

Especially given that in C, _Atomic operations are implicitly volatile.
But this is likely a holdover from Akshat's implementation strategy,
which was to pattern _Dependent_ptr after the volatile keyword.

> Also, if C goes and specifies load dependencies, in any form, is then
> not the corollary that they need to specify control dependencies? How
> else can they exclude the transformation?

By requiring that any temporaries generated from variables that are
marked _Dependent_ptr also be marked _Dependent_ptr. This is of course
one divergence of _Dependent_ptr from the volatile keyword.

> And of course, once we're there, can we get explicit support for
> control dependencies too? :-) :-)

Keep talking like this and I am going to make sure that you attend a
standards committee meeting. If need be, by arranging for you to be
physically dragged there. ;-)

More seriously, for control dependencies, the variable that would need
to be marked would be the program counter, which might require some
additional syntax.
							Thanx, Paul

------------------------------------------------------------------------

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int foo(int *p, int i)
{
	return i[p];
}

int arr[] = { 0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, };

int main(int argc, char *argv[])
{
	int i = atoi(argv[1]);

	printf("%d[arr] = %d\n", i, foo(arr, i));
	return 0;
}
On Fri, Jul 03, 2020 at 03:25:23PM +0200, Peter Zijlstra wrote:
> On Fri, Jul 03, 2020 at 03:13:30PM +0200, Peter Zijlstra wrote:
> > > The prototype for GCC is here: https://github.com/AKG001/gcc/
> >
> > Thanks! Those test cases are somewhat over qualified though:
> >
> > 	static volatile _Atomic (TYPE) * _Dependent_ptr a; \
>
> One question though; since it's a qualifier, and we've recently spent a
> whole lot of effort to strip qualifiers in say READ_ONCE(), how does,
> and how do we want, this qualifier to behave?

Dereferencing a _Dependent_ptr pointer gives you something that is not
_Dependent_ptr, unless the declaration was like this:

	_Dependent_ptr _Atomic (TYPE) * _Dependent_ptr a;

And if I recall correctly, the current state is that assigning a
_Dependent_ptr variable to a non-_Dependent_ptr variable strips this
marking (though the thought was to be able to ask for a warning).

So, yes, it would be nice to be able to explicitly strip the
_Dependent_ptr, perhaps via the kill_dependency() macro, which is
already in the C standard.

> C++ has very convenient means of manipulating qualifiers, so it's not
> much of a problem there, but for C it is, as we've found, really quite
> cumbersome. Even with _Generic() we can't manipulate individual
> qualifiers afaict.

Fair point, and in C++ this is a templated class, at least in the same
sense that std::atomic<> is a templated class. But in this case, would
kill_dependency() do what you want?

							Thanx, Paul
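For anyone wanting to experiment now: kill_dependency() already exists
for C11 atomics in <stdatomic.h>, and its job is exactly this explicit
stripping. A minimal sketch, with the struct and helper invented for
illustration:

	#include <stdatomic.h>

	struct foo { int val; };
	extern _Atomic(struct foo *) gp;
	void do_something(struct foo *p);

	struct foo *get(void)
	{
		struct foo *p = atomic_load_explicit(&gp, memory_order_consume);

		do_something(p);           /* carries a dependency from the load */
		return kill_dependency(p); /* same value, dependency chain ends here */
	}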
On Fri, Jul 03, 2020 at 07:42:28AM -0700, Paul E. McKenney wrote:
> On Fri, Jul 03, 2020 at 03:13:30PM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 02, 2020 at 10:59:48AM -0700, Paul E. McKenney wrote:
> > > On Thu, Jul 02, 2020 at 10:20:40AM +0200, Peter Zijlstra wrote:
> > > > On Wed, Jul 01, 2020 at 09:03:38AM -0700, Paul E. McKenney wrote:

[ . . . ]

> > Also, if C goes and specifies load dependencies, in any form, is then
> > not the corollary that they need to specify control dependencies? How
> > else can they exclude the transformation?
>
> By requiring that any temporaries generated from variables that are
> marked _Dependent_ptr also be marked _Dependent_ptr. This is of course
> one divergence of _Dependent_ptr from the volatile keyword.
>
> > And of course, once we're there, can we get explicit support for control
> > dependencies too? :-) :-)
>
> Keep talking like this and I am going to make sure that you attend a
> standards committee meeting. If need be, by arranging for you to be
> physically dragged there. ;-)
>
> More seriously, for control dependencies, the variable that would need
> to be marked would be the program counter, which might require some
> additional syntax.

And perhaps more constructively, we do need to prioritize address and
data dependencies over control dependencies. For one thing, there are a
lot more address/data dependencies in existing code than there are
control dependencies, and (sadly, perhaps more importantly) there are a
lot more people who are convinced that address/data dependencies are
important.

For another (admittedly more theoretical) thing, the OOTA scenarios
stemming from control dependencies are a lot less annoying than those
from address/data dependencies.

And address/data dependencies are, as far as I know, invulnerable to
things like conditional-move instructions that can cause problems for
control dependencies.

Nevertheless, yes, control dependencies also need attention.

							Thanx, Paul
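A minimal sketch of why conditional moves threaten one kind of
dependency but not the other (fragment for illustration only, using the
usual kernel accessors):

	/* Address dependency: the loaded pointer feeds the address of
	 * the next access, so the hardware must order the two loads;
	 * this is what rcu_dereference() relies on, and no
	 * conditional-move can make the address computation vanish: */
	p = READ_ONCE(gp);
	d = p->data;

	/* Control dependency: the loaded value only decides whether
	 * the store happens, so the ordering hangs entirely on the
	 * conditional branch. If compiler or hardware replaces the
	 * branch with a conditional-select feeding the stored value,
	 * that ordering is gone: */
	if (READ_ONCE(flag))
		WRITE_ONCE(b, 1);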
On Mon, Jul 06, 2020 at 09:26:33AM -0700, Paul E. McKenney wrote: > And perhaps more constructively, we do need to prioritize address and data > dependencies over control dependencies. For one thing, there are a lot > more address/data dependencies in existing code than there are control > dependencies, and (sadly, perhaps more importantly) there are a lot more > people who are convinced that address/data dependencies are important. If they do not consider their Linux OS running correctly :-) > For another (admittedly more theoretical) thing, the OOTA scenarios > stemming from control dependencies are a lot less annoying than those > from address/data dependencies. > > And address/data dependencies are as far as I know vulnerable to things > like conditional-move instructions that can cause problems for control > dependencies. > > Nevertheless, yes, control dependencies also need attention. Today I added one more \o/
On Mon, Jul 06, 2020 at 08:29:26PM +0200, Peter Zijlstra wrote: > On Mon, Jul 06, 2020 at 09:26:33AM -0700, Paul E. McKenney wrote: > > > And perhaps more constructively, we do need to prioritize address and data > > dependencies over control dependencies. For one thing, there are a lot > > more address/data dependencies in existing code than there are control > > dependencies, and (sadly, perhaps more importantly) there are a lot more > > people who are convinced that address/data dependencies are important. > > If they do not consider their Linux OS running correctly :-) Many of them really do not care at all. In fact, some would consider Linux failing to run as an added bonus. > > For another (admittedly more theoretical) thing, the OOTA scenarios > > stemming from control dependencies are a lot less annoying than those > > from address/data dependencies. > > > > And address/data dependencies are as far as I know vulnerable to things > > like conditional-move instructions that can cause problems for control > > dependencies. > > > > Nevertheless, yes, control dependencies also need attention. > > Today I added one more \o/ Just make sure you continually check to make sure that compilers don't break it, along with the others you have added. ;-) Thanx, Paul
On Mon, Jul 06, 2020 at 11:39:33AM -0700, Paul E. McKenney wrote:
> On Mon, Jul 06, 2020 at 08:29:26PM +0200, Peter Zijlstra wrote:
> > On Mon, Jul 06, 2020 at 09:26:33AM -0700, Paul E. McKenney wrote:
> > If they do not consider their Linux OS running correctly :-)
>
> Many of them really do not care at all. In fact, some would consider
> Linux failing to run as an added bonus.

This I think is why we have compiler people in the thread that care a
lot more.

> > > Nevertheless, yes, control dependencies also need attention.
> >
> > Today I added one more \o/
>
> Just make sure you continually check to make sure that compilers
> don't break it, along with the others you have added. ;-)

There's:

kernel/locking/mcs_spinlock.h:	smp_cond_load_acquire(l, VAL); \
kernel/sched/core.c:	smp_cond_load_acquire(&p->on_cpu, !VAL);
kernel/smp.c:	smp_cond_load_acquire(&csd->node.u_flags, !(VAL & CSD_FLAG_LOCK));

arch/x86/kernel/alternative.c:	atomic_cond_read_acquire(&desc.refs, !VAL);
kernel/locking/qrwlock.c:	atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
kernel/locking/qrwlock.c:	atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
kernel/locking/qrwlock.c:	atomic_cond_read_acquire(&lock->cnts, VAL == _QW_WAITING);
kernel/locking/qspinlock.c:	atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_MASK));
kernel/locking/qspinlock.c:	val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));

include/linux/refcount.h:	smp_acquire__after_ctrl_dep();
ipc/mqueue.c:	smp_acquire__after_ctrl_dep();
ipc/msg.c:	smp_acquire__after_ctrl_dep();
ipc/sem.c:	smp_acquire__after_ctrl_dep();
kernel/locking/rwsem.c:	smp_acquire__after_ctrl_dep();
kernel/sched/core.c:	smp_acquire__after_ctrl_dep();

kernel/events/ring_buffer.c:__perf_output_begin()

And I'm fairly sure I'm forgetting some... One could argue there's too
many of them to check already.

Both GCC and CLANG had better think about it.
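For readers not steeped in these primitives: smp_cond_load_acquire()
spins until a condition on the loaded value (spelled VAL) becomes true,
then provides acquire ordering, while smp_acquire__after_ctrl_dep()
strengthens a preceding control dependency (which by itself orders only
later stores) into acquire ordering. Roughly the generic fallback,
simplified from include/asm-generic/barrier.h:

	#define smp_cond_load_acquire(ptr, cond_expr)		\
	({							\
		typeof(*(ptr)) VAL;				\
		for (;;) {					\
			VAL = READ_ONCE(*(ptr));		\
			if (cond_expr)				\
				break;				\
			cpu_relax();				\
		}						\
		smp_acquire__after_ctrl_dep();			\
		VAL;						\
	})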
On Mon, Jul 06, 2020 at 09:40:12PM +0200, Peter Zijlstra wrote: > On Mon, Jul 06, 2020 at 11:39:33AM -0700, Paul E. McKenney wrote: > > On Mon, Jul 06, 2020 at 08:29:26PM +0200, Peter Zijlstra wrote: > > > On Mon, Jul 06, 2020 at 09:26:33AM -0700, Paul E. McKenney wrote: > > > > If they do not consider their Linux OS running correctly :-) > > > > Many of them really do not care at all. In fact, some would consider > > Linux failing to run as an added bonus. > > This I think is why we have compiler people in the thread that care a > lot more. Here is hoping! ;-) > > > > Nevertheless, yes, control dependencies also need attention. > > > > > > Today I added one more \o/ > > > > Just make sure you continually check to make sure that compilers > > don't break it, along with the others you have added. ;-) > > There's: > > kernel/locking/mcs_spinlock.h: smp_cond_load_acquire(l, VAL); \ > kernel/sched/core.c: smp_cond_load_acquire(&p->on_cpu, !VAL); > kernel/smp.c: smp_cond_load_acquire(&csd->node.u_flags, !(VAL & CSD_FLAG_LOCK)); > > arch/x86/kernel/alternative.c: atomic_cond_read_acquire(&desc.refs, !VAL); > kernel/locking/qrwlock.c: atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED)); > kernel/locking/qrwlock.c: atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED)); > kernel/locking/qrwlock.c: atomic_cond_read_acquire(&lock->cnts, VAL == _QW_WAITING); > kernel/locking/qspinlock.c: atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_MASK)); > kernel/locking/qspinlock.c: val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK)); > > include/linux/refcount.h: smp_acquire__after_ctrl_dep(); > ipc/mqueue.c: smp_acquire__after_ctrl_dep(); > ipc/msg.c: smp_acquire__after_ctrl_dep(); > ipc/sem.c: smp_acquire__after_ctrl_dep(); > kernel/locking/rwsem.c: smp_acquire__after_ctrl_dep(); > kernel/sched/core.c: smp_acquire__after_ctrl_dep(); > > kernel/events/ring_buffer.c:__perf_output_begin() > > And I'm fairly sure I'm forgetting some... One could argue there's too > many of them to check already. > > Both GCC and CLANG had better think about it. That would be good! I won't list the number of address/data dependencies given that there are well over a thousand of them. Thanx, Paul
On Mon, Jun 29, 2020 at 04:20:59PM -0700, Sami Tolvanen wrote:
> Hi Masahiro,
>
> On Mon, Jun 29, 2020 at 01:56:19AM +0900, Masahiro Yamada wrote:
> > I also got an error for
> > ARCH=arm64 allyesconfig + CONFIG_LTO_CLANG=y
> >
> > $ make ARCH=arm64 LLVM=1 LLVM_IAS=1
> > CROSS_COMPILE=~/tools/aarch64-linaro-7.5/bin/aarch64-linux-gnu-
> > -j24
> >
> > ...
> >
> >   GEN     .version
> >   CHK     include/generated/compile.h
> >   UPD     include/generated/compile.h
> >   CC      init/version.o
> >   AR      init/built-in.a
> >   GEN     .tmp_initcalls.lds
> >   GEN     .tmp_symversions.lds
> >   LTO     vmlinux.o
> >   MODPOST vmlinux.symvers
> >   MODINFO modules.builtin.modinfo
> >   GEN     modules.builtin
> >   LD      .tmp_vmlinux.kallsyms1
> > ld.lld: error: undefined symbol: __compiletime_assert_905
> > >>> referenced by irqbypass.c
> > >>>               vmlinux.o:(jeq_imm)
> > make: *** [Makefile:1161: vmlinux] Error 1
>
> I can reproduce this with ToT LLVM and it's BUILD_BUG_ON_MSG(..., "value
> too large for the field") in drivers/net/ethernet/netronome/nfp/bpf/jit.c.
> Specifically, the FIELD_FIT / __BF_FIELD_CHECK macro in ur_load_imm_any.
>
> This compiles just fine with an earlier LLVM revision, so it could be a
> relatively recent regression. I'll take a look. Thanks for catching this!

After spending some time debugging this with Nick, it looks like the
error is caused by a recent optimization change in LLVM, which together
with the inlining of ur_load_imm_any into jeq_imm, changes a runtime
check in FIELD_FIT that would always fail, to a compile-time check that
breaks the build. In jeq_imm, we have:

	/* struct bpf_insn: _s32 imm */
	u64 imm = insn->imm; /* sign extend */
	...
	if (imm >> 32) { /* non-zero only if insn->imm is negative */
		/* inlined from ur_load_imm_any */
		u32 __imm = imm >> 32; /* therefore, always 0xffffffff */

		/*
		 * __imm has a value known at compile-time, which means
		 * __builtin_constant_p(__imm) is true and we end up with
		 * essentially this in __BF_FIELD_CHECK:
		 */
		if (__builtin_constant_p(__imm) && __imm <= 255)

The compile-time check comes from the following BUILD_BUG_ON_MSG:

	#define __BF_FIELD_CHECK(_mask, _reg, _val, _pfx)		\
		...
		BUILD_BUG_ON_MSG(__builtin_constant_p(_val) ?		\
				 ~((_mask) >> __bf_shf(_mask)) & (_val) : 0, \
				 _pfx "value too large for the field"); \

While we could stop the compiler from performing this optimization by
telling it to never inline ur_load_imm_any, we feel like a better fix
might be to replace FIELD_FIT(UR_REG_IMM_MAX, imm) with a simple
imm <= UR_REG_IMM_MAX check that won't trip a compile-time assertion
even when the condition is known to fail.

Jiong, Jakub, do you see any issues here?

Sami
On Tue, Jul 07, 2020 at 08:51:07AM -0700, Sami Tolvanen wrote:
> After spending some time debugging this with Nick, it looks like the
> error is caused by a recent optimization change in LLVM, which together
> with the inlining of ur_load_imm_any into jeq_imm, changes a runtime
> check in FIELD_FIT that would always fail, to a compile-time check that
> breaks the build. In jeq_imm, we have:
>
> 	/* struct bpf_insn: _s32 imm */
> 	u64 imm = insn->imm; /* sign extend */
> 	...
> 	if (imm >> 32) { /* non-zero only if insn->imm is negative */
> 		/* inlined from ur_load_imm_any */
> 		u32 __imm = imm >> 32; /* therefore, always 0xffffffff */
>
> 		/*
> 		 * __imm has a value known at compile-time, which means
> 		 * __builtin_constant_p(__imm) is true and we end up with
> 		 * essentially this in __BF_FIELD_CHECK:
> 		 */
> 		if (__builtin_constant_p(__imm) && __imm <= 255)

Should be __imm > 255, of course, which means the compiler will generate
a call to __compiletime_assert.

> Jiong, Jakub, do you see any issues here?

(Jiong's email bounced, so removing from the recipient list.)

Sami
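The interesting part is that no integer literal appears anywhere in the
driver source; inlining plus value-range propagation is what lets
__builtin_constant_p() fold to true. A standalone sketch of the same
effect, with names invented here (whether a given compiler actually
performs the fold depends on version and optimization level):

	#include <stdint.h>

	static inline int too_large_for_8_bits(uint32_t v)
	{
		/* After inlining, v can be a compile-time constant
		 * here even though the caller passed a "variable". */
		return __builtin_constant_p(v) && v > 255;
	}

	int check(int32_t imm32)
	{
		uint64_t imm = (int64_t)imm32; /* sign extend */

		if (imm >> 32) /* non-zero only if imm32 is negative */
			/* On this path imm >> 32 is provably 0xffffffff,
			 * so a sufficiently clever compiler folds
			 * __builtin_constant_p(v) to 1. */
			return too_large_for_8_bits(imm >> 32);
		return 0;
	}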
On Tue, 7 Jul 2020 09:05:28 -0700 Sami Tolvanen wrote:
> On Tue, Jul 07, 2020 at 08:51:07AM -0700, Sami Tolvanen wrote:
> > After spending some time debugging this with Nick, it looks like the
> > error is caused by a recent optimization change in LLVM, which together
> > with the inlining of ur_load_imm_any into jeq_imm, changes a runtime
> > check in FIELD_FIT that would always fail, to a compile-time check that
> > breaks the build. In jeq_imm, we have:
> >
> > 	/* struct bpf_insn: _s32 imm */
> > 	u64 imm = insn->imm; /* sign extend */
> > 	...
> > 	if (imm >> 32) { /* non-zero only if insn->imm is negative */
> > 		/* inlined from ur_load_imm_any */
> > 		u32 __imm = imm >> 32; /* therefore, always 0xffffffff */
> >
> > 		/*
> > 		 * __imm has a value known at compile-time, which means
> > 		 * __builtin_constant_p(__imm) is true and we end up with
> > 		 * essentially this in __BF_FIELD_CHECK:
> > 		 */
> > 		if (__builtin_constant_p(__imm) && __imm <= 255)
>
> Should be __imm > 255, of course, which means the compiler will generate
> a call to __compiletime_assert.

I think FIELD_FIT() should not pass the value into __BF_FIELD_CHECK().

So:

diff --git a/include/linux/bitfield.h b/include/linux/bitfield.h
index 48ea093ff04c..4e035aca6f7e 100644
--- a/include/linux/bitfield.h
+++ b/include/linux/bitfield.h
@@ -77,7 +77,7 @@
  */
 #define FIELD_FIT(_mask, _val)						\
 	({								\
-		__BF_FIELD_CHECK(_mask, 0ULL, _val, "FIELD_FIT: ");	\
+		__BF_FIELD_CHECK(_mask, 0ULL, 0ULL, "FIELD_FIT: ");	\
 		!((((typeof(_mask))_val) << __bf_shf(_mask)) & ~(_mask)); \
 	})

It's perfectly legal to pass a constant which does not fit, in which
case FIELD_FIT() should just return false not break the build.

Right?
On Tue, Jul 7, 2020 at 9:56 AM Jakub Kicinski <kuba@kernel.org> wrote: > > > On Tue, Jul 07, 2020 at 08:51:07AM -0700, Sami Tolvanen wrote: > > > After spending some time debugging this with Nick, it looks like the > > > error is caused by a recent optimization change in LLVM, which together > > > with the inlining of ur_load_imm_any into jeq_imm, changes a runtime > > > check in FIELD_FIT that would always fail, to a compile-time check that > > > breaks the build. In jeq_imm, we have: > > > > > > /* struct bpf_insn: _s32 imm */ > > > u64 imm = insn->imm; /* sign extend */ > > > ... > > > if (imm >> 32) { /* non-zero only if insn->imm is negative */ > > > /* inlined from ur_load_imm_any */ > > > u32 __imm = imm >> 32; /* therefore, always 0xffffffff */ > > > > > > /* > > > * __imm has a value known at compile-time, which means > > > * __builtin_constant_p(__imm) is true and we end up with > > > * essentially this in __BF_FIELD_CHECK: > > > */ > > > if (__builtin_constant_p(__imm) && __imm > 255) > > I think FIELD_FIT() should not pass the value into __BF_FIELD_CHECK(). > > So: > > diff --git a/include/linux/bitfield.h b/include/linux/bitfield.h > index 48ea093ff04c..4e035aca6f7e 100644 > --- a/include/linux/bitfield.h > +++ b/include/linux/bitfield.h > @@ -77,7 +77,7 @@ > */ > #define FIELD_FIT(_mask, _val) \ > ({ \ > - __BF_FIELD_CHECK(_mask, 0ULL, _val, "FIELD_FIT: "); \ > + __BF_FIELD_CHECK(_mask, 0ULL, 0ULL, "FIELD_FIT: "); \ > !((((typeof(_mask))_val) << __bf_shf(_mask)) & ~(_mask)); \ > }) > > It's perfectly legal to pass a constant which does not fit, in which > case FIELD_FIT() should just return false not break the build. > > Right? I see the value of the __builtin_constant_p check; this is just a very interesting case where rather than an integer literal appearing in the source, the compiler is able to deduce that the parameter can only have one value in one case, and allows __builtin_constant_p to evaluate to true for it. I had definitely asked Sami about the comment above FIELD_FIT: """ 76 * Return: true if @_val can fit inside @_mask, false if @_val is too big. """ in which FIELD_FIT doesn't return false if @_val is too big and a compile time constant. (Rather it breaks the build). Of the 14 expansion sites of FIELD_FIT I see in mainline, it doesn't look like any integral literals are passed, so maybe the compile time checks of _val are of little value for FIELD_FIT. So I think your suggested diff is the most concise fix.
On Tue, 7 Jul 2020 10:17:25 -0700 Nick Desaulniers wrote: > On Tue, Jul 7, 2020 at 9:56 AM Jakub Kicinski <kuba@kernel.org> wrote: > > > > > On Tue, Jul 07, 2020 at 08:51:07AM -0700, Sami Tolvanen wrote: > > > > After spending some time debugging this with Nick, it looks like the > > > > error is caused by a recent optimization change in LLVM, which together > > > > with the inlining of ur_load_imm_any into jeq_imm, changes a runtime > > > > check in FIELD_FIT that would always fail, to a compile-time check that > > > > breaks the build. In jeq_imm, we have: > > > > > > > > /* struct bpf_insn: _s32 imm */ > > > > u64 imm = insn->imm; /* sign extend */ > > > > ... > > > > if (imm >> 32) { /* non-zero only if insn->imm is negative */ > > > > /* inlined from ur_load_imm_any */ > > > > u32 __imm = imm >> 32; /* therefore, always 0xffffffff */ > > > > > > > > /* > > > > * __imm has a value known at compile-time, which means > > > > * __builtin_constant_p(__imm) is true and we end up with > > > > * essentially this in __BF_FIELD_CHECK: > > > > */ > > > > if (__builtin_constant_p(__imm) && __imm > 255) > > > > I think FIELD_FIT() should not pass the value into __BF_FIELD_CHECK(). > > > > So: > > > > diff --git a/include/linux/bitfield.h b/include/linux/bitfield.h > > index 48ea093ff04c..4e035aca6f7e 100644 > > --- a/include/linux/bitfield.h > > +++ b/include/linux/bitfield.h > > @@ -77,7 +77,7 @@ > > */ > > #define FIELD_FIT(_mask, _val) \ > > ({ \ > > - __BF_FIELD_CHECK(_mask, 0ULL, _val, "FIELD_FIT: "); \ > > + __BF_FIELD_CHECK(_mask, 0ULL, 0ULL, "FIELD_FIT: "); \ > > !((((typeof(_mask))_val) << __bf_shf(_mask)) & ~(_mask)); \ > > }) > > > > It's perfectly legal to pass a constant which does not fit, in which > > case FIELD_FIT() should just return false not break the build. > > > > Right? > > I see the value of the __builtin_constant_p check; this is just a very > interesting case where rather than an integer literal appearing in the > source, the compiler is able to deduce that the parameter can only > have one value in one case, and allows __builtin_constant_p to > evaluate to true for it. > > I had definitely asked Sami about the comment above FIELD_FIT: > """ > 76 * Return: true if @_val can fit inside @_mask, false if @_val is > too big. > """ > in which FIELD_FIT doesn't return false if @_val is too big and a > compile time constant. (Rather it breaks the build). > > Of the 14 expansion sites of FIELD_FIT I see in mainline, it doesn't > look like any integral literals are passed, so maybe the compile time > checks of _val are of little value for FIELD_FIT. Also I just double checked and all FIELD_FIT() uses check the return value. > So I think your suggested diff is the most concise fix. Feel free to submit that officially as a patch if it fixes the build for you, here's my sign-off: Signed-off-by: Jakub Kicinski <kuba@kernel.org>
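To round off this subthread, the intended semantics after Jakub's
change, with an invented mask for illustration:

	#include <linux/bits.h>
	#include <linux/bitfield.h>

	#define MY_FIELD GENMASK(7, 0) /* hypothetical 8-bit field */

	/* 0x1ff needs nine bits: with the fix this no longer trips the
	 * compile-time assert; FIELD_FIT() simply reports the misfit: */
	if (!FIELD_FIT(MY_FIELD, 0x1ff))
		return -EINVAL; /* taken at runtime */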
Dear Sami,

On 24.06.20 at 22:31, Sami Tolvanen wrote:
> This patch series adds support for building x86_64 and arm64 kernels
> with Clang's Link Time Optimization (LTO).
>
> In addition to performance, the primary motivation for LTO is to allow
> Clang's Control-Flow Integrity (CFI) to be used in the kernel. Google's
> Pixel devices have shipped with LTO+CFI kernels since 2018.
>
> Most of the patches are build system changes for handling LLVM bitcode,
> which Clang produces with LTO instead of ELF object files, postponing
> ELF processing until a later stage, and ensuring initcall ordering.
>
> Note that first objtool patch in the series is already in linux-next,
> but as it's needed with LTO, I'm including it also here to make testing
> easier.

[…]

Thank you very much for sending these changes.

Do you have a branch, where your current work can be pulled from? Your
branch on GitHub [1] seems 15 months old.

Out of curiosity, I applied the changes, allowed the selection for i386
(x86), and with Clang 1:11~++20200701093119+ffee8040534-1~exp1 from
Debian experimental, it failed with `Invalid absolute R_386_32
relocation: KERNEL_PAGES`:

> make -f ./scripts/Makefile.build obj=arch/x86/boot arch/x86/boot/bzImage
> make -f ./scripts/Makefile.build obj=arch/x86/boot/compressed arch/x86/boot/compressed/vmlinux
> llvm-nm vmlinux | sed -n -e 's/^\([0-9a-fA-F]*\) [ABCDGRSTVW] \(_text\|__bss_start\|_end\)$/#define VO_\2 _AC(0x\1,UL)/p' > arch/x86/boot/compressed/../voffset.h
> clang -Wp,-MMD,arch/x86/boot/compressed/.misc.o.d -nostdinc -isystem /usr/lib/llvm-11/lib/clang/11.0.0/include -I./arch/x86/include -I./arch/x86/include/generated -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/kconfig.h -include ./include/linux/compiler_types.h -D__KERNEL__ -Qunused-arguments -m32 -O2 -fno-strict-aliasing -fPIE -DDISABLE_BRANCH_PROFILING -march=i386 -mno-mmx -mno-sse -ffreestanding -fno-stack-protector -Wno-address-of-packed-member -Wno-gnu -Wno-pointer-sign -fmacro-prefix-map=./= -fno-asynchronous-unwind-tables -DKBUILD_MODFILE='"arch/x86/boot/compressed/misc"' -DKBUILD_BASENAME='"misc"' -DKBUILD_MODNAME='"misc"' -D__KBUILD_MODNAME=misc -c -o arch/x86/boot/compressed/misc.o arch/x86/boot/compressed/misc.c
> llvm-objcopy -R .comment -S vmlinux arch/x86/boot/compressed/vmlinux.bin
> arch/x86/tools/relocs vmlinux > arch/x86/boot/compressed/vmlinux.relocs;arch/x86/tools/relocs --abs-relocs vmlinux
> Invalid absolute R_386_32 relocation: KERNEL_PAGES
> make[2]: *** [arch/x86/boot/compressed/Makefile:134: arch/x86/boot/compressed/vmlinux.relocs] Error 1
> make[2]: *** Deleting file 'arch/x86/boot/compressed/vmlinux.relocs'
> make[1]: *** [arch/x86/boot/Makefile:115: arch/x86/boot/compressed/vmlinux] Error 2
> make: *** [arch/x86/Makefile:268: bzImage] Error 2

Kind regards,

Paul


[1]: https://github.com/samitolvanen/linux/tree/clang-lto
On Sat, Jul 11, 2020 at 6:32 PM Paul Menzel <pmenzel@molgen.mpg.de> wrote: > > Dear Sami, > > > Am 24.06.20 um 22:31 schrieb Sami Tolvanen: > > This patch series adds support for building x86_64 and arm64 kernels > > with Clang's Link Time Optimization (LTO). > > > > In addition to performance, the primary motivation for LTO is to allow > > Clang's Control-Flow Integrity (CFI) to be used in the kernel. Google's > > Pixel devices have shipped with LTO+CFI kernels since 2018. > > > > Most of the patches are build system changes for handling LLVM bitcode, > > which Clang produces with LTO instead of ELF object files, postponing > > ELF processing until a later stage, and ensuring initcall ordering. > > > > Note that first objtool patch in the series is already in linux-next, > > but as it's needed with LTO, I'm including it also here to make testing > > easier. > > […] > > Thank you very much for sending these changes. > > Do you have a branch, where your current work can be pulled from? Your > branch on GitHub [1] seems 15 months old. > Agreed it's easier to git-pull. I have seen [1] - not sure if this is the latest version. Alternatively, you can check patchwork LKML by searching for $submitter. ( You can open patch 01/22 and download the whole patch-series by following the link "series", see [3]. ) - Sedat - [1] https://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild.git/log/?h=lto [2] https://lore.kernel.org/patchwork/project/lkml/list/?series=&submitter=19676 [3] https://lore.kernel.org/patchwork/series/450026/mbox/
On Sun, Jul 12, 2020 at 10:59:17AM +0200, Sedat Dilek wrote: > On Sat, Jul 11, 2020 at 6:32 PM Paul Menzel <pmenzel@molgen.mpg.de> wrote: > > > > Dear Sami, > > > > > > Am 24.06.20 um 22:31 schrieb Sami Tolvanen: > > > This patch series adds support for building x86_64 and arm64 kernels > > > with Clang's Link Time Optimization (LTO). > > > > > > In addition to performance, the primary motivation for LTO is to allow > > > Clang's Control-Flow Integrity (CFI) to be used in the kernel. Google's > > > Pixel devices have shipped with LTO+CFI kernels since 2018. > > > > > > Most of the patches are build system changes for handling LLVM bitcode, > > > which Clang produces with LTO instead of ELF object files, postponing > > > ELF processing until a later stage, and ensuring initcall ordering. > > > > > > Note that first objtool patch in the series is already in linux-next, > > > but as it's needed with LTO, I'm including it also here to make testing > > > easier. > > > > […] > > > > Thank you very much for sending these changes. > > > > Do you have a branch, where your current work can be pulled from? Your > > branch on GitHub [1] seems 15 months old. > > > > Agreed it's easier to git-pull. > I have seen [1] - not sure if this is the latest version. > Alternatively, you can check patchwork LKML by searching for $submitter. > ( You can open patch 01/22 and download the whole patch-series by > following the link "series", see [3]. ) > > - Sedat - > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild.git/log/?h=lto > [2] https://lore.kernel.org/patchwork/project/lkml/list/?series=&submitter=19676 > [3] https://lore.kernel.org/patchwork/series/450026/mbox/ > Sami tagged this series on his GitHub: https://github.com/samitolvanen/linux/releases/tag/lto-v1 git pull https://github.com/samitolvanen/linux lto-v1 Otherwise, he is updating the clang-cfi branch that includes both the LTO and CFI patchsets. You can pull that and just turn on CONFIG_LTO_CLANG. Lastly, for the future, I would recommend grabbing b4 to easily apply patches (specifically full series) from lore.kernel.org. https://git.kernel.org/pub/scm/utils/b4/b4.git/ https://git.kernel.org/pub/scm/utils/b4/b4.git/tree/README.rst You could grab this series and apply it easily by either downloading the mbox file and following the instructions it gives for applying the mbox file: $ b4 am 20200624203200.78870-1-samitolvanen@google.com or I prefer piping so that I don't have to clean up later: $ b4 am -o - 20200624203200.78870-1-samitolvanen@google.com | git am Cheers, Nathan
On Sat, Jul 11, 2020 at 9:32 AM Paul Menzel <pmenzel@molgen.mpg.de> wrote: > Thank you very much for sending these changes. > > Do you have a branch, where your current work can be pulled from? Your > branch on GitHub [1] seems 15 months old. The clang-lto branch is rebased regularly on top of Linus' tree. GitHub just looks at the commit date of the last commit in the tree, which isn't all that informative. > Out of curiosity, I applied the changes, allowed the selection for i386 > (x86), and with Clang 1:11~++20200701093119+ffee8040534-1~exp1 from > Debian experimental, it failed with `Invalid absolute R_386_32 > relocation: KERNEL_PAGES`: I haven't looked at getting this to work on i386, which is why we only select ARCH_SUPPORTS_LTO for x86_64. I would expect there to be a few issues to address. > > arch/x86/tools/relocs vmlinux > arch/x86/boot/compressed/vmlinux.relocs;arch/x86/tools/relocs --abs-relocs vmlinux > > Invalid absolute R_386_32 relocation: KERNEL_PAGES KERNEL_PAGES looks like a constant, so it's probably safe to ignore the absolute relocation in tools/relocs.c. Sami
On Sun, Jul 12, 2020 at 8:40 PM Nathan Chancellor <natechancellor@gmail.com> wrote: > > On Sun, Jul 12, 2020 at 10:59:17AM +0200, Sedat Dilek wrote: > > On Sat, Jul 11, 2020 at 6:32 PM Paul Menzel <pmenzel@molgen.mpg.de> wrote: > > > > > > Dear Sami, > > > > > > > > > Am 24.06.20 um 22:31 schrieb Sami Tolvanen: > > > > This patch series adds support for building x86_64 and arm64 kernels > > > > with Clang's Link Time Optimization (LTO). > > > > > > > > In addition to performance, the primary motivation for LTO is to allow > > > > Clang's Control-Flow Integrity (CFI) to be used in the kernel. Google's > > > > Pixel devices have shipped with LTO+CFI kernels since 2018. > > > > > > > > Most of the patches are build system changes for handling LLVM bitcode, > > > > which Clang produces with LTO instead of ELF object files, postponing > > > > ELF processing until a later stage, and ensuring initcall ordering. > > > > > > > > Note that first objtool patch in the series is already in linux-next, > > > > but as it's needed with LTO, I'm including it also here to make testing > > > > easier. > > > > > > […] > > > > > > Thank you very much for sending these changes. > > > > > > Do you have a branch, where your current work can be pulled from? Your > > > branch on GitHub [1] seems 15 months old. > > > > > > > Agreed it's easier to git-pull. > > I have seen [1] - not sure if this is the latest version. > > Alternatively, you can check patchwork LKML by searching for $submitter. > > ( You can open patch 01/22 and download the whole patch-series by > > following the link "series", see [3]. ) > > > > - Sedat - > > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild.git/log/?h=lto > > [2] https://lore.kernel.org/patchwork/project/lkml/list/?series=&submitter=19676 > > [3] https://lore.kernel.org/patchwork/series/450026/mbox/ > > > > Sami tagged this series on his GitHub: > > https://github.com/samitolvanen/linux/releases/tag/lto-v1 > > git pull https://github.com/samitolvanen/linux lto-v1 > > Otherwise, he is updating the clang-cfi branch that includes both the > LTO and CFI patchsets. You can pull that and just turn on > CONFIG_LTO_CLANG. > > Lastly, for the future, I would recommend grabbing b4 to easily apply > patches (specifically full series) from lore.kernel.org. > > https://git.kernel.org/pub/scm/utils/b4/b4.git/ > https://git.kernel.org/pub/scm/utils/b4/b4.git/tree/README.rst > > You could grab this series and apply it easily by either downloading the > mbox file and following the instructions it gives for applying the mbox > file: > > $ b4 am 20200624203200.78870-1-samitolvanen@google.com > > or I prefer piping so that I don't have to clean up later: > > $ b4 am -o - 20200624203200.78870-1-samitolvanen@google.com | git am > It is always a pleasure to read your replies and enrich my know-how beyond Linux-kernel hacking :-). Thanks for the tip with "b4" tool. Might add this to our ClangBuiltLinux wiki "Command line tips and tricks"? - Sedat - [1] https://github.com/ClangBuiltLinux/linux/wiki/Command-line-tips-and-tricks
Dear Sami,

On 13.07.20 at 01:34, Sami Tolvanen wrote:
> On Sat, Jul 11, 2020 at 9:32 AM Paul Menzel <pmenzel@molgen.mpg.de> wrote:
>> Thank you very much for sending these changes.
>>
>> Do you have a branch, where your current work can be pulled from? Your
>> branch on GitHub [1] seems 15 months old.
>
> The clang-lto branch is rebased regularly on top of Linus' tree.
> GitHub just looks at the commit date of the last commit in the tree,
> which isn't all that informative.

Thank you for clearing this up, and sorry for not checking myself.

>> Out of curiosity, I applied the changes, allowed the selection for i386
>> (x86), and with Clang 1:11~++20200701093119+ffee8040534-1~exp1 from
>> Debian experimental, it failed with `Invalid absolute R_386_32
>> relocation: KERNEL_PAGES`:
>
> I haven't looked at getting this to work on i386, which is why we only
> select ARCH_SUPPORTS_LTO for x86_64. I would expect there to be a few
> issues to address.
>
>>> arch/x86/tools/relocs vmlinux > arch/x86/boot/compressed/vmlinux.relocs;arch/x86/tools/relocs --abs-relocs vmlinux
>>> Invalid absolute R_386_32 relocation: KERNEL_PAGES
>
> KERNEL_PAGES looks like a constant, so it's probably safe to ignore
> the absolute relocation in tools/relocs.c.

Thank you for pointing me in the right direction. I am happy to report
that with the diff below (no idea to what list to add the string), Linux
5.8-rc5 with the LLVM/Clang/LTO patches on top builds and boots on the
ASRock E350M1.

```
diff --git a/arch/x86/tools/relocs.c b/arch/x86/tools/relocs.c
index 8f3bf34840cef..e91af127ed3c0 100644
--- a/arch/x86/tools/relocs.c
+++ b/arch/x86/tools/relocs.c
@@ -79,6 +79,7 @@ static const char * const sym_regex_kernel[S_NSYMTYPES] = {
 	"__end_rodata_hpage_align|"
 #endif
 	"__vvar_page|"
+	"KERNEL_PAGES|"
 	"_end)$"
 };
```

Kind regards,

Paul
On Tue, Jul 14, 2020 at 2:16 PM Paul Menzel <pmenzel@molgen.mpg.de> wrote: > > Dear Sami, > > > Am 13.07.20 um 01:34 schrieb Sami Tolvanen: > > On Sat, Jul 11, 2020 at 9:32 AM Paul Menzel <pmenzel@molgen.mpg.de> wrote: > >> Thank you very much for sending these changes. > >> > >> Do you have a branch, where your current work can be pulled from? Your > >> branch on GitHub [1] seems 15 months old. > > > > The clang-lto branch is rebased regularly on top of Linus' tree. > > GitHub just looks at the commit date of the last commit in the tree, > > which isn't all that informative. > > Thank you for clearing this up, and sorry for not checking myself. > > >> Out of curiosity, I applied the changes, allowed the selection for i386 > >> (x86), and with Clang 1:11~++20200701093119+ffee8040534-1~exp1 from > >> Debian experimental, it failed with `Invalid absolute R_386_32 > >> relocation: KERNEL_PAGES`: > > > > I haven't looked at getting this to work on i386, which is why we only > > select ARCH_SUPPORTS_LTO for x86_64. I would expect there to be a few > > issues to address. > > > >>> arch/x86/tools/relocs vmlinux > arch/x86/boot/compressed/vmlinux.relocs;arch/x86/tools/relocs --abs-relocs vmlinux > >>> Invalid absolute R_386_32 relocation: KERNEL_PAGES > > > > KERNEL_PAGES looks like a constant, so it's probably safe to ignore > > the absolute relocation in tools/relocs.c. > > Thank you for pointing me to the right direction. I am happy to report, > that with the diff below (no idea to what list to add the string), Linux > 5.8-rc5 with the LLVM/Clang/LTO patches on top, builds and boots on the > ASRock E350M1. > > ``` > diff --git a/arch/x86/tools/relocs.c b/arch/x86/tools/relocs.c > index 8f3bf34840cef..e91af127ed3c0 100644 > --- a/arch/x86/tools/relocs.c > +++ b/arch/x86/tools/relocs.c > @@ -79,6 +79,7 @@ static const char * const > sym_regex_kernel[S_NSYMTYPES] = { > "__end_rodata_hpage_align|" > #endif > "__vvar_page|" > + "KERNEL_PAGES|" > "_end)$" > }; > ``` > What llvm-toolchain and version did you use? Can you post your linux-config? Thanks. - Sedat -
On Tue, Jul 14, 2020 at 2:44 AM Sedat Dilek <sedat.dilek@gmail.com> wrote: > > On Sun, Jul 12, 2020 at 8:40 PM Nathan Chancellor > <natechancellor@gmail.com> wrote: > > > > Lastly, for the future, I would recommend grabbing b4 to easily apply > > patches (specifically full series) from lore.kernel.org. > > > > https://git.kernel.org/pub/scm/utils/b4/b4.git/ > > https://git.kernel.org/pub/scm/utils/b4/b4.git/tree/README.rst > > > > You could grab this series and apply it easily by either downloading the > > mbox file and following the instructions it gives for applying the mbox > > file: > > > > $ b4 am 20200624203200.78870-1-samitolvanen@google.com > > > > or I prefer piping so that I don't have to clean up later: > > > > $ b4 am -o - 20200624203200.78870-1-samitolvanen@google.com | git am > > > > It is always a pleasure to read your replies and enrich my know-how > beyond Linux-kernel hacking :-). > > Thanks for the tip with "b4" tool. > Might add this to our ClangBuiltLinux wiki "Command line tips and tricks"? > > - Sedat - > > [1] https://github.com/ClangBuiltLinux/linux/wiki/Command-line-tips-and-tricks Good idea, done.