Message ID | ZOkGCSNr0VN2VIJJ@p100 (mailing list archive) |
---|---|
State | Accepted, archived |
Headers | show |
Series | lib/clz_ctz.c: Fix __clzdi2() and __ctzdi2() for 32-bit kernels | expand |
On Fri, 25 Aug 2023 at 12:50, Helge Deller <deller@gmx.de> wrote: > > This patch fixes the in-kernel functions __clzdi2() and __ctzdi2() [..] Applied, Linus
[ Unrelated to this patch, except it made me look, adding clang build people to cc ] On Fri, 25 Aug 2023 at 13:25, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Fri, 25 Aug 2023 at 12:50, Helge Deller <deller@gmx.de> wrote: > > > > This patch fixes the in-kernel functions __clzdi2() and __ctzdi2() [..] > > Applied, Bah. Still applied, but actually building this (on 64-bit, so kind of pointless) I note that clang completely messes up this function on x86. Clang turns this: return __ffs64(val); into this horror: pushq %rax movq %rdi, (%rsp) #APP rep bsfq (%rsp), %rax #NO_APP popq %rcx which is just incredibly broken on so many levels. It *should* be a single instruction, like gcc does: rep; bsf %rdi,%rax # tmp87, word but clang decides that it really wants to put the argument on the stack, and apparently also wants to do that nonsensical stack alignment thing to make things even worse. We use this: static __always_inline unsigned long variable__ffs(unsigned long word) { asm("rep; bsf %1,%0" : "=r" (word) : "rm" (word)); return word; } for the definition, and it looks like clang royally just screws up here. Yes, "m" is _allowed_ in that input set, but it damn well shouldn't be used for something that is already in a register, since "r" is also allowed, and is the first choice. I think it's this clang bug: https://github.com/llvm/llvm-project/issues/20571 https://github.com/llvm/llvm-project/issues/30873 https://github.com/llvm/llvm-project/issues/34837 and it doesn't matter for *this* case (since I don't think this library function is ever used on x86), but it looks like a generic clang issue. Linus
On Fri, Aug 25, 2023 at 1:43 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > [ Unrelated to this patch, except it made me look, adding clang build > people to cc ] > > On Fri, 25 Aug 2023 at 13:25, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > On Fri, 25 Aug 2023 at 12:50, Helge Deller <deller@gmx.de> wrote: > > > > > > This patch fixes the in-kernel functions __clzdi2() and __ctzdi2() [..] > > > > Applied, > > Bah. Still applied, but actually building this (on 64-bit, so kind of > pointless) I note that clang completely messes up this function on > x86. > > Clang turns this: > > return __ffs64(val); > > into this horror: > > pushq %rax > movq %rdi, (%rsp) > #APP > rep > bsfq (%rsp), %rax > #NO_APP > popq %rcx > > which is just incredibly broken on so many levels. It *should* be a > single instruction, like gcc does: > > rep; bsf %rdi,%rax # tmp87, word > > but clang decides that it really wants to put the argument on the > stack, and apparently also wants to do that nonsensical stack > alignment thing to make things even worse. > > We use this: > > static __always_inline unsigned long variable__ffs(unsigned long word) > { > asm("rep; bsf %1,%0" > : "=r" (word) > : "rm" (word)); > return word; > } > > for the definition, and it looks like clang royally just screws up > here. Yes, "m" is _allowed_ in that input set, but it damn well > shouldn't be used for something that is already in a register, since > "r" is also allowed, and is the first choice. > > I think it's this clang bug: > > https://github.com/llvm/llvm-project/issues/20571 ^ yep, my comments at the end of that thread are the last time we've had a chance to look into this. Boy, it's been 9 months since the last discussion of it. I'm sorry for that. The TL;DR of that thread is that when both "r" and "m" constraints are present, LLVM is conservative and always chooses "m" because at that point it's not able to express to later passes that "m" is still a valid fallback if "r" was chosen. Obviously "r" is preferable to "m" and we should fix that. Seeing who wants to roll up their sleeves and volunteer to understand LLVM's register allocation code is like asking who wants to be the first to jump into a black hole and see what happens. I'm having a hard enough time understanding the stack spilling code to better understand what precisely exists in what stack slots in order to make progress on some of our -Wframe-larger-than= warnings, but I need to suck it up and do better. This came up previously in our discussion about __builtin_ia32_readeflags_*. https://lore.kernel.org/all/20211215211847.206208-1-morbo@google.com/ > https://github.com/llvm/llvm-project/issues/30873 > https://github.com/llvm/llvm-project/issues/34837 > > and it doesn't matter for *this* case (since I don't think this > library function is ever used on x86), but it looks like a generic > clang issue. > > Linus
On Fri, Aug 25, 2023 at 2:01 PM Nick Desaulniers <ndesaulniers@google.com> wrote: > > On Fri, Aug 25, 2023 at 1:43 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > [ Unrelated to this patch, except it made me look, adding clang build > > people to cc ] > > > > On Fri, 25 Aug 2023 at 13:25, Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > > > > > > On Fri, 25 Aug 2023 at 12:50, Helge Deller <deller@gmx.de> wrote: > > > > > > > > This patch fixes the in-kernel functions __clzdi2() and __ctzdi2() [..] > > > > > > Applied, > > > > Bah. Still applied, but actually building this (on 64-bit, so kind of > > pointless) I note that clang completely messes up this function on > > x86. > > > > Clang turns this: > > > > return __ffs64(val); > > > > into this horror: > > > > pushq %rax > > movq %rdi, (%rsp) > > #APP > > rep > > bsfq (%rsp), %rax > > #NO_APP > > popq %rcx > > > > which is just incredibly broken on so many levels. It *should* be a > > single instruction, like gcc does: > > > > rep; bsf %rdi,%rax # tmp87, word > > > > but clang decides that it really wants to put the argument on the > > stack, and apparently also wants to do that nonsensical stack > > alignment thing to make things even worse. > > > > We use this: > > > > static __always_inline unsigned long variable__ffs(unsigned long word) > > { > > asm("rep; bsf %1,%0" > > : "=r" (word) > > : "rm" (word)); > > return word; > > } > > > > for the definition, and it looks like clang royally just screws up > > here. Yes, "m" is _allowed_ in that input set, but it damn well > > shouldn't be used for something that is already in a register, since > > "r" is also allowed, and is the first choice. > > > > I think it's this clang bug: > > > > https://github.com/llvm/llvm-project/issues/20571 > > ^ yep, my comments at the end of that thread are the last time we've > had a chance to look into this. Boy, it's been 9 months since the > last discussion of it. I'm sorry for that. > > The TL;DR of that thread is that when both "r" and "m" constraints are > present, LLVM is conservative and always chooses "m" because at that > point it's not able to express to later passes that "m" is still a > valid fallback if "r" was chosen. > > Obviously "r" is preferable to "m" and we should fix that. Seeing who > wants to roll up their sleeves and volunteer to understand LLVM's > register allocation code is like asking who wants to be the first to > jump into a black hole and see what happens. Yum! Human spaghetti! :-) I want to look into this myself. I'm a bit focussed on other things at the moment, but this is definitely on my list of "DO WANT"s. -bw > I'm having a hard enough > time understanding the stack spilling code to better understand what > precisely exists in what stack slots in order to make progress on some > of our -Wframe-larger-than= warnings, but I need to suck it up and do > better. > > This came up previously in our discussion about __builtin_ia32_readeflags_*. > https://lore.kernel.org/all/20211215211847.206208-1-morbo@google.com/ > > > https://github.com/llvm/llvm-project/issues/30873 > > https://github.com/llvm/llvm-project/issues/34837 > > > > and it doesn't matter for *this* case (since I don't think this > > library function is ever used on x86), but it looks like a generic > > clang issue. > > > > Linus > > > > -- > Thanks, > ~Nick Desaulniers
On Fri, Aug 25, 2023 at 3:33 PM Bill Wendling <morbo@google.com> wrote: > > On Fri, Aug 25, 2023 at 2:01 PM Nick Desaulniers > <ndesaulniers@google.com> wrote: > > > > On Fri, Aug 25, 2023 at 1:43 PM Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > > > > > > [ Unrelated to this patch, except it made me look, adding clang build > > > people to cc ] > > > > > > On Fri, 25 Aug 2023 at 13:25, Linus Torvalds > > > <torvalds@linux-foundation.org> wrote: > > > > > > > > On Fri, 25 Aug 2023 at 12:50, Helge Deller <deller@gmx.de> wrote: > > > > > > > > > > This patch fixes the in-kernel functions __clzdi2() and __ctzdi2() [..] > > > > > > > > Applied, > > > > > > Bah. Still applied, but actually building this (on 64-bit, so kind of > > > pointless) I note that clang completely messes up this function on > > > x86. > > > > > > Clang turns this: > > > > > > return __ffs64(val); > > > > > > into this horror: > > > > > > pushq %rax > > > movq %rdi, (%rsp) > > > #APP > > > rep > > > bsfq (%rsp), %rax > > > #NO_APP > > > popq %rcx > > > > > > which is just incredibly broken on so many levels. It *should* be a > > > single instruction, like gcc does: > > > > > > rep; bsf %rdi,%rax # tmp87, word > > > > > > but clang decides that it really wants to put the argument on the > > > stack, and apparently also wants to do that nonsensical stack > > > alignment thing to make things even worse. > > > > > > We use this: > > > > > > static __always_inline unsigned long variable__ffs(unsigned long word) > > > { > > > asm("rep; bsf %1,%0" > > > : "=r" (word) > > > : "rm" (word)); > > > return word; > > > } > > > > > > for the definition, and it looks like clang royally just screws up > > > here. Yes, "m" is _allowed_ in that input set, but it damn well > > > shouldn't be used for something that is already in a register, since > > > "r" is also allowed, and is the first choice. > > > > > > I think it's this clang bug: > > > > > > https://github.com/llvm/llvm-project/issues/20571 > > > > ^ yep, my comments at the end of that thread are the last time we've > > had a chance to look into this. Boy, it's been 9 months since the > > last discussion of it. I'm sorry for that. > > > > The TL;DR of that thread is that when both "r" and "m" constraints are > > present, LLVM is conservative and always chooses "m" because at that > > point it's not able to express to later passes that "m" is still a > > valid fallback if "r" was chosen. > > > > Obviously "r" is preferable to "m" and we should fix that. Seeing who > > wants to roll up their sleeves and volunteer to understand LLVM's > > register allocation code is like asking who wants to be the first to > > jump into a black hole and see what happens. > > Yum! Human spaghetti! :-) > > I want to look into this myself. I'm a bit focussed on other things at > the moment, but this is definitely on my list of "DO WANT"s. > Another idea is that there are __builtin_* functions for a lot of functions that are currently in inline asm---__builtin_ctz{,l,ll,s] and _builtin_ffs{,l,ll}. The major issue with the `__builtin_ia32_readeflags_*` was its inability to take unrelated MSRs into account during code motion. That may not be the same worry here? -bw > -bw > > > I'm having a hard enough > > time understanding the stack spilling code to better understand what > > precisely exists in what stack slots in order to make progress on some > > of our -Wframe-larger-than= warnings, but I need to suck it up and do > > better. > > > > This came up previously in our discussion about __builtin_ia32_readeflags_*. > > https://lore.kernel.org/all/20211215211847.206208-1-morbo@google.com/ > > > > > https://github.com/llvm/llvm-project/issues/30873 > > > https://github.com/llvm/llvm-project/issues/34837 > > > > > > and it doesn't matter for *this* case (since I don't think this > > > library function is ever used on x86), but it looks like a generic > > > clang issue. > > > > > > Linus > > > > > > > > -- > > Thanks, > > ~Nick Desaulniers
On Fri, 25 Aug 2023 at 15:57, Bill Wendling <morbo@google.com> wrote: > > > Another idea is that there are __builtin_* functions for a lot of > functions that are currently in inline asm No. We've been through this before. The builtins are almost entirely untested, and often undocumented and buggy. > The major issue with the > `__builtin_ia32_readeflags_*` was its inability to take unrelated MSRs > into account during code motion. That may not be the same worry here? No, the problem with __builtin_ia32_readeflags_*() was that it was literally completely buggy and generated entirely broken code: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104971 but that's really more of a symptom than anything else. It's a symptom of the fact that unlike inline asm's, those builtins are often undocumented in what compiler version they appeared, and are of very questionable quality. They often don't have many users, and the test suites are non-existent. For example, we *do* use __builtin_ffs() on x86 for constant values, because the compiler does the right thing. But for non-constant ones, the inline asm actually generates better code: gcc generatea some disgusting mess with a 'bsf' followed by a 'cmov' for the zero case, when we know better. See for example https://godbolt.org/z/jKKf48Wsf I don't understand why compiler people prefer a builtin that is an untested special case that assumes that the compiler knows what is going on (and often doesn't), over a generic escape facility that is supported and needed anyway (inline asm). In other words: the statement "builtins generate better code" is simply PROVABLY NOT TRUE. Builtins have often generated *worse* code than using inline asms, to the point where "worse" is actively buggy crap. At least inline asms are reliable. That's a *big* deal. Linus
On Fri, Aug 25, 2023 at 4:35 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Fri, 25 Aug 2023 at 15:57, Bill Wendling <morbo@google.com> wrote: > > > > > Another idea is that there are __builtin_* functions for a lot of > > functions that are currently in inline asm > > No. We've been through this before. The builtins are almost entirely > untested, and often undocumented and buggy. > > > The major issue with the > > `__builtin_ia32_readeflags_*` was its inability to take unrelated MSRs > > into account during code motion. That may not be the same worry here? > > No, the problem with __builtin_ia32_readeflags_*() was that it was > literally completely buggy and generated entirely broken code: > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104971 > > but that's really more of a symptom than anything else. > > It's a symptom of the fact that unlike inline asm's, those builtins > are often undocumented in what compiler version they appeared, and are > of very questionable quality. They often don't have many users, and > the test suites are non-existent. > > For example, we *do* use __builtin_ffs() on x86 for constant values, > because the compiler does the right thing. > > But for non-constant ones, the inline asm actually generates better > code: gcc generatea some disgusting mess with a 'bsf' followed by a > 'cmov' for the zero case, when we know better. > > See for example > > https://godbolt.org/z/jKKf48Wsf > Ew...gross. > I don't understand why compiler people prefer a builtin that is an > untested special case that assumes that the compiler knows what is > going on (and often doesn't), over a generic escape facility that is > supported and needed anyway (inline asm). > > In other words: the statement "builtins generate better code" is > simply PROVABLY NOT TRUE. > > Builtins have often generated *worse* code than using inline asms, to > the point where "worse" is actively buggy crap. > > At least inline asms are reliable. That's a *big* deal. > There are a couple of reasons why compiler writers (at least I) prefer builtins to inline asm. Inline asm takes control away from the compiler, which makes it harder for the compiler to perform normal optimizations. It's more portable. As for better code, it won't generate it in all situations, as you pointed out, but can typically generate as good code. Inline asm has its own issues---portability, difficult to use constraints (the EFLAGS asms went back and forth on its constraints over the years), and, from what I remember, GNU's inline asm is closely tied to its register allocator, making it hard to support in other compilers. Plus, there's not guarantee that an inline asm won't be moved. (It's been discussed before, and I also believe that the current Linux inline asm "barriers" should prevent this. I just vastly prefer guaranteed behavior over "it works now". I've migrated many systems to new compilers (and even compiler versions) and have seen "well, it worked for the previous compiler" types of bugs that were actual bugs in the programmer's code.) Okay, none of what I said is going to convince you to use builtins, and I'm not suggesting we do. It's just a few of the reasons why we tend to prefer them to other methods. Just take this as an FYI. :-) -bw
On Fri, Aug 25, 2023 at 4:35 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > But for non-constant ones, the inline asm actually generates better > code: gcc generatea some disgusting mess with a 'bsf' followed by a > 'cmov' for the zero case, when we know better. > > See for example > > https://godbolt.org/z/jKKf48Wsf > > I don't understand why compiler people prefer a builtin that is an > untested special case that assumes that the compiler knows what is > going on (and often doesn't), over a generic escape facility that is > supported and needed anyway (inline asm). > > In other words: the statement "builtins generate better code" is > simply PROVABLY NOT TRUE. > > Builtins have often generated *worse* code than using inline asms, to > the point where "worse" is actively buggy crap. > > At least inline asms are reliable. That's a *big* deal. So 2 concerns where "I'll do it in inline asm" can pessimize codegen: 1. You alluded to this, but what happens when one of these functions is called with a constant? (Not just a literal value, but a value that can be proven constant at compile time via optimizations as well?) arch/x86/include/asm/bitops.h got this right for ffs(), but it did not for fls()! (I think that could be `if (__builtin_constant_p(x)) return x ? 32 - __builtin_clz(x) : 0;` but check my math; oh, good job arch/powerpc/include/asm/bitops.h). 2. by providing the definition of a symbol typically provided by libc (and then not building with -ffreestanding) pessimizes libcall optimization. example: https://godbolt.org/z/crrTKEf6G ffs() gets this right again by using a macro, and __always_inline can work around this somewhat (so fls() if off the hook here). But any attempt using `static inline` would be pessimized for constants.
On Fri, 25 Aug 2023 at 17:52, Nick Desaulniers <ndesaulniers@google.com> wrote: > > So 2 concerns where "I'll do it in inline asm" can pessimize codegen: > 1. You alluded to this, but what happens when one of these functions > is called with a constant? This is why our headers have a lot of __builtin_constant_p()'s in them.. In this particular case, see the x86 asm/bitops.h code: #define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : variable_ffs(x)) but this is actually quite a common pattern, and it's often not about something like __builtin_ffs() at all. See all the other __builtin_constant_p()'s that we have in that same file because we basically just use different code sequences for constants. And that file isn't even unusual. We use it quite a lot when we care about code generation for some particular case. > 2. by providing the definition of a symbol typically provided by libc > (and then not building with -ffreestanding) pessimizes libcall > optimization. .. and this is partly why we often avoid libgcc things, and do certain things by hand. The classic rule is "Don't do 64-bit divisions using the C '/' operator". So in the kernel you have to use do_div() and friends, because the library versions are often disgusting and might not know that 64/32 is much much cheaper and is what you want. And quite often we simply use other names - but then we also do *not* build with -freestanding, because -freestanding has at least traditionally meant that the compiler won't optimize the simple and obvious cases (typically things like "memcpy with a constant size"). So we mix and match and pick the best option. The kernel really doesn't care about architecture portability, because honestly, something like "ffs()" is entirely *trivial* to get right, compared to the real differences between architectures (eg VM and IO differences etc). Linus
On 2023-08-25, Linus Torvalds wrote: >On Fri, 25 Aug 2023 at 17:52, Nick Desaulniers <ndesaulniers@google.com> wrote: >> >> So 2 concerns where "I'll do it in inline asm" can pessimize codegen: >> 1. You alluded to this, but what happens when one of these functions >> is called with a constant? > >This is why our headers have a lot of __builtin_constant_p()'s in them.. > >In this particular case, see the x86 asm/bitops.h code: > > #define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : >variable_ffs(x)) For the curious (like me), __builtin_ffs https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#index-_005f_005fbuiltin_005fffs says Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined. The hangling of 0 seems the cause that __builtin_ffs codegen is not as well as inline asm. Clang implemented the builtin in 2008 and took the same constraint (penalty). GCC compiles __builtin_ctzl(x) to xorl %eax, %eax; tzcntq %rdi, %rax on most Intel processors (AMD -march= values are unaffected). The extra xor is due to a false dependency issue https://gcc.gnu.org/git/?p=gcc.git&a=commit;h=73543b2286027da1de561902440b53f775a03a86 Inline asm wins here as well since we know the argument 0 is undefined. In May 2023, https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=cc6eb8b51f9568ae0caf46b80e2a0aff050030ce "Disable avoid_false_dep_for_bmi for atom and icelake(and later) core processors." removed the extra xor for icelake (and later) core processors. >but this is actually quite a common pattern, and it's often not about >something like __builtin_ffs() at all. > >See all the other __builtin_constant_p()'s that we have in that same >file because we basically just use different code sequences for >constants. > >And that file isn't even unusual. We use it quite a lot when we care >about code generation for some particular case. > >> 2. by providing the definition of a symbol typically provided by libc >> (and then not building with -ffreestanding) pessimizes libcall >> optimization. > >.. and this is partly why we often avoid libgcc things, and do certain >things by hand. > >The classic rule is "Don't do 64-bit divisions using the C '/' operator". > >So in the kernel you have to use do_div() and friends, because the >library versions are often disgusting and might not know that 64/32 is >much much cheaper and is what you want. > >And quite often we simply use other names - but then we also do *not* >build with -freestanding, because -freestanding has at least >traditionally meant that the compiler won't optimize the simple and >obvious cases (typically things like "memcpy with a constant size"). > >So we mix and match and pick the best option. > >The kernel really doesn't care about architecture portability, because >honestly, something like "ffs()" is entirely *trivial* to get right, >compared to the real differences between architectures (eg VM and IO >differences etc). > > Linus >
Hi Linus, On Sat, Aug 26, 2023 at 3:08 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Fri, 25 Aug 2023 at 17:52, Nick Desaulniers <ndesaulniers@google.com> wrote: > And quite often we simply use other names - but then we also do *not* > build with -freestanding, because -freestanding has at least > traditionally meant that the compiler won't optimize the simple and > obvious cases (typically things like "memcpy with a constant size"). Several architectures (incl. x86, but excl. amd64) do build the kernel with -freestanding. IIRC, the issue was that without that, gcc was "optimizing" calls to standard functions (implemented as inline optimized assembler functions) by replacing them with calls to other standard functions (also implemented as inline optimized assembler functions). As the latter became external calls, this defeated the optimized assembler implementation, and could even cause build failures if no external implementation was available. Gr{oetje,eeting}s, Geert
From: Linus Torvalds > Sent: 25 August 2023 21:43 .... > Clang turns this: > > return __ffs64(val); > > into this horror: > > pushq %rax > movq %rdi, (%rsp) > #APP > rep > bsfq (%rsp), %rax > #NO_APP > popq %rcx > > which is just incredibly broken on so many levels. It *should* be a > single instruction, like gcc does: > > rep; bsf %rdi,%rax # tmp87, word > > but clang decides that it really wants to put the argument on the > stack, and apparently also wants to do that nonsensical stack > alignment thing to make things even worse. > > We use this: > > static __always_inline unsigned long variable__ffs(unsigned long word) > { > asm("rep; bsf %1,%0" > : "=r" (word) > : "rm" (word)); > return word; > } > > for the definition, and it looks like clang royally just screws up > here. Yes, "m" is _allowed_ in that input set, but it damn well > shouldn't be used for something that is already in a register, since > "r" is also allowed, and is the first choice. Why don't we just remove the "m" option? Pretty much the only time it will be worse is it the value is in memory and loading it into a register causes a spill to stack. While it is possible to generate code where that happens it is pretty unlikely. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Mon, 28 Aug 2023 at 00:33, Geert Uytterhoeven <geert@linux-m68k.org> wrote: > > Several architectures (incl. x86, but excl. amd64) do build the kernel with > -freestanding. > > IIRC, the issue was that without that, gcc was "optimizing" calls > to standard functions (implemented as inline optimized assembler > functions) by replacing them with calls to other standard functions > (also implemented as inline optimized assembler functions). So using -ffreestanding is definitely the right thing to do for a kernel in theory. It's very much supposed to tell the compiler to not assume a standard libc, and without that gcc will do various transformations that make sense when you "know" what libc does, but may not make sense in the limited library model of a kernel. So without it, gcc will do things like converting a 'printf()' call without any conversion characters to a much cheaper 'puts()' etc. Now, we often avoid that issue entirely by having our own function names (ie printk()), but we do tend to use the *really* core C library names. Anyway, it turns out that some of the things you miss out on with -ffreestanding are kind of important. In particular, at least gcc will stop some 'memcpy()' optimizations too, which ends up being pretty horrendous. So while -ffreestanding would be the right thing to do in theory, in practice it's actually pretty horrible. It's a big hammer that affects a lot of things, and while many of them make sense for a kernel, some of them are really bad. Which is why x86-64 no longer uses it. I would actually suggest other architectures take a look if they care at all about code generation. In particular, look at the x86-64 version of 'string.h' in arch/x86/include/asm/string_64.h and note the difference with the 32-bit one. The 32-bit one is the "this is how we used to do it" that nobody cared enough to change. The 64-bit one is much simpler and actually generates better code simply because gcc recognizes memcpy() and friends, and will then inline it when small etc. The *downside* is that now you have to trust the compiler to do the right thing. And that will depend on compiler version etc. There's a reason why 32-bit x86 does everything by hand: when your compiler history starts at gcc-1.40, things are simply *very* different from when you now rely on gcc-5.1 and newer... Put another way: gcc has changed, and what used to make sense probably doesn't make sense any more. Linus
On Mon, 28 Aug 2023 at 03:53, David Laight <David.Laight@aculab.com> wrote: > > From: Linus Torvalds > > > > We use this: > > > > static __always_inline unsigned long variable__ffs(unsigned long word) > > { > > asm("rep; bsf %1,%0" > > : "=r" (word) > > : "rm" (word)); > > return word; > > } > > > > for the definition, and it looks like clang royally just screws up > > here. Yes, "m" is _allowed_ in that input set, but it damn well > > shouldn't be used for something that is already in a register, since > > "r" is also allowed, and is the first choice. > > Why don't we just remove the "m" option? For this particular case, it would probably be the right thing to do. It's sad, though, because gcc handles this correctly, and always has. And in this particular case, it probably matters not at all. In many other cases where we have 'rm', we may actually be in the situation that having 'rm' (or other cases like "g" that also allows immediates) helps because register pressure can be a thing. It's mostly a thing on 32-bit x86 where you have a lot fewer registers, and there we've literally run into situations where we have had internal compiler errors because of complex inline asm statements running out of registers. With a simple "one input, one output" case, that just isn't an issue, so to work around a clang misfeature we could do it - if somebody finds a case where it actually matters (as opposed to "damn, when looking at the generted code for a function that we never actually use on x86, I noticed that code generation is horrendous"). Linus
On Fri, Aug 25, 2023 at 6:08 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Fri, 25 Aug 2023 at 17:52, Nick Desaulniers <ndesaulniers@google.com> wrote: > > > > So 2 concerns where "I'll do it in inline asm" can pessimize codegen: > > 1. You alluded to this, but what happens when one of these functions > > is called with a constant? > > This is why our headers have a lot of __builtin_constant_p()'s in them.. > > In this particular case, see the x86 asm/bitops.h code: > > #define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : > variable_ffs(x)) > > but this is actually quite a common pattern, and it's often not about > something like __builtin_ffs() at all. I was a reviewer on commit fdb6649ab7c1 ("x86/asm/bitops: Use __builtin_ctzl() to evaluate constant expressions"); I'm familiar with the pattern. > > See all the other __builtin_constant_p()'s that we have in that same > file because we basically just use different code sequences for > constants. > > And that file isn't even unusual. We use it quite a lot when we care > about code generation for some particular case. More so my point was x86 bitops is missing commit 2fcff790dcb4 ("powerpc: Use builtin functions for fls()/__fls()/fls64()") treatment. I've sent https://lore.kernel.org/llvm/20230828-x86_fls-v1-1-e6a31b9f79c3@google.com/. > > > 2. by providing the definition of a symbol typically provided by libc > > (and then not building with -ffreestanding) pessimizes libcall > > optimization. > > .. and this is partly why we often avoid libgcc things, and do certain > things by hand. (Sorry if the following rant is prior knowledge, it's mostly for reference for others cc'ed who might not know this) Careful, `-ffreestanding` and libgcc are two orthogonal things (at least in my mind). -ffreestanding is to libc as --rtlib= is to the compiler runtime (which is distinct from the libc) `-ffreestanding` is more about "does the runtime environment somehow provide libc symbols." libgcc (or llvm's equivalent "compiler-rt") is not responsible for providing symbols from libc. `--rtlib=` controls what compiler runtime will be used, but in my experience, today's compilers don't make codegen decisions on that value. These are mostly runtime helpers for "idk how to do <complicated math thing, such as double word division>" or "maybe you didn't want that inline." What's brittle about making codegen decisions with regards to these flags though is that these dependencies grow over time, and yet it's not possible today (AFAIK) to specify what's the minimum target to support. For instance, IIRC glibc recently added support for one of the kernel's string.h routines, maybe strlcpy or something. https://sourceware.org/git/?p=glibc.git;a=commit;h=454a20c8756c9c1d55419153255fc7692b3d2199 When is it safe for the compiler to start transforming calls to other functions into calls to strlcpy? (Guess: year 2033, because:) What happens when dynamically linking against older versions of glib that do not provide that symbol? > > The classic rule is "Don't do 64-bit divisions using the C '/' operator". > > So in the kernel you have to use do_div() and friends, because the > library versions are often disgusting and might not know that 64/32 is > much much cheaper and is what you want. And thus the same problem exists for the kernel wrt --rtlib that I alluded to above for strlcpy. By providing a partial implementation of a compiler runtime (--rtlib=), the compiler will frequently emit libcalls to symbols for which the kernel hasn't provided. You can avoid open coded double word division in the kernel all you want but: 1. again you're probably pessimizing division by constant remainder by using div64_u64. GCC is *really* good at replacing these when the divisor is constant; IME better than clang. 2. even avoiding open coded division, the compiler can still insert division; loop-elision can replace loops outright if the trip count is adjusted by a determinable value. (see 3220022038b9). By providing a partial compiler runtime, and then using every -mno- flag in the book, you tie the compiler's hands wrt what it can emit vs libcall. There's not even a way to express to today's compiler that "we have a compiler runtime, it's just straight up missing things." Personally, I think a clang-tidy check for open coded division is perhaps a better way to enforce the kernel's posture than providing half a compiler runtime then doing gymnastics in the code to work around the emission of libcalls to __divdi3() or__aeabi_uldivmod() (mostly on 32b platforms). A linkage failure is nice, but only occurs on 32b targets and it doesn't disambiguate between the case of developer open coded division vs compiler inserted division. > > And quite often we simply use other names - but then we also do *not* > build with -freestanding, because -freestanding has at least > traditionally meant that the compiler won't optimize the simple and > obvious cases (typically things like "memcpy with a constant size"). Personal opinion: we very much do NOT want to use -ffreestanding for those libcall optimizations. I discussed this recently with ARCH=loongarch folks: commit 3f301dc292eb ("LoongArch: Replace -ffreestanding with finer-grained -fno-builtin's") It is my intention to remove -ffreestanding from ARCH=i386. https://github.com/ClangBuiltLinux/linux/issues/1583 I had to first fix a bug in LLVM though https://reviews.llvm.org/D125285 So rather than remove it outright, we might need to retain it for builds with older releases of clang for now. Though as you allude to down thread, perhaps some things that were the case in linux 1.0 / gcc 1.40 no longer hold. Which is why adding such flags to kernel makefiles really ought to be accompanied by a comment in sources linking to an issue tracker report, so that we might clean these up one day. > > So we mix and match and pick the best option. Gross, and like *could you not?* I suspect it's more so the case of a developer not realising it's perhaps a compiler bug, or not reporting such bug, and trying flags they're heard of once until something links. Any use of -ffreestanding for any arch had better have a comment to a compiler vendor's bug tracker laying out why that's necessary for that arch and not others. Many kernel developers are allergic to filing formal compiler bugs in places where compiler vendors are looking, IME. > > The kernel really doesn't care about architecture portability, because > honestly, something like "ffs()" is entirely *trivial* to get right, > compared to the real differences between architectures (eg VM and IO > differences etc). > > Linus
On Mon, Aug 28, 2023 at 12:33 AM Geert Uytterhoeven <geert@linux-m68k.org> wrote: > > Hi Linus, > > On Sat, Aug 26, 2023 at 3:08 AM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > On Fri, 25 Aug 2023 at 17:52, Nick Desaulniers <ndesaulniers@google.com> wrote: > > And quite often we simply use other names - but then we also do *not* > > build with -freestanding, because -freestanding has at least > > traditionally meant that the compiler won't optimize the simple and > > obvious cases (typically things like "memcpy with a constant size"). > > Several architectures (incl. x86, but excl. amd64) do build the kernel with > -freestanding. > > IIRC, the issue was that without that, gcc was "optimizing" calls > to standard functions (implemented as inline optimized assembler > functions) by replacing them with calls to other standard functions > (also implemented as inline optimized assembler functions). As the > latter became external calls, this defeated the optimized assembler > implementation, and could even cause build failures if no external > implementation was available. That's what the -fno-builtin-* flags are for, IMO, though those also have toolchain portability issues IME.
On Mon, Aug 28, 2023 at 3:53 AM David Laight <David.Laight@aculab.com> wrote: > > From: Linus Torvalds > > Sent: 25 August 2023 21:43 > .... > > Clang turns this: > > > > return __ffs64(val); > > > > into this horror: > > > > pushq %rax > > movq %rdi, (%rsp) > > #APP > > rep > > bsfq (%rsp), %rax > > #NO_APP > > popq %rcx > > > > which is just incredibly broken on so many levels. It *should* be a > > single instruction, like gcc does: > > > > rep; bsf %rdi,%rax # tmp87, word > > > > but clang decides that it really wants to put the argument on the > > stack, and apparently also wants to do that nonsensical stack > > alignment thing to make things even worse. > > > > We use this: > > > > static __always_inline unsigned long variable__ffs(unsigned long word) > > { > > asm("rep; bsf %1,%0" > > : "=r" (word) > > : "rm" (word)); > > return word; > > } > > > > for the definition, and it looks like clang royally just screws up > > here. Yes, "m" is _allowed_ in that input set, but it damn well > > shouldn't be used for something that is already in a register, since > > "r" is also allowed, and is the first choice. > > Why don't we just remove the "m" option? > > Pretty much the only time it will be worse is it the value > is in memory and loading it into a register causes a spill > to stack. > > While it is possible to generate code where that happens it > is pretty unlikely. As Linus expressed below, register exhaustion could occur. Besides, this is a bug in clang that we acknowledge, and should fix. I have the general idea where things are going wrong, I just don't yet have the muscle memory (or time) to dive into the register allocator. > > David > > - > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK > Registration No: 1397386 (Wales)
On Mon, Aug 28, 2023 at 9:25 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Mon, 28 Aug 2023 at 00:33, Geert Uytterhoeven <geert@linux-m68k.org> wrote: > > > > Several architectures (incl. x86, but excl. amd64) do build the kernel with > > -freestanding. > > > > IIRC, the issue was that without that, gcc was "optimizing" calls > > to standard functions (implemented as inline optimized assembler > > functions) by replacing them with calls to other standard functions > > (also implemented as inline optimized assembler functions). > > So using -ffreestanding is definitely the right thing to do for a > kernel in theory. It's very much supposed to tell the compiler to not -ffreestanding is probably a good suggestion for any embedded platform. But given the size of the kernel, and similarities of symbols and their semantics expected by the compiler and provided by the kernel, I think -ffreestanding should not be set at this point for the Linux kernel. > assume a standard libc, and without that gcc will do various > transformations that make sense when you "know" what libc does, but > may not make sense in the limited library model of a kernel. > > So without it, gcc will do things like converting a 'printf()' call > without any conversion characters to a much cheaper 'puts()' etc. Now, > we often avoid that issue entirely by having our own function names > (ie printk()), but we do tend to use the *really* core C library > names. > > Anyway, it turns out that some of the things you miss out on with > -ffreestanding are kind of important. In particular, at least gcc will > stop some 'memcpy()' optimizations too, which ends up being pretty > horrendous. > > So while -ffreestanding would be the right thing to do in theory, in > practice it's actually pretty horrible. It's a big hammer that affects > a lot of things, and while many of them make sense for a kernel, some > of them are really bad. Which is why x86-64 no longer uses it. I agree. > > I would actually suggest other architectures take a look if they care > at all about code generation. In particular, look at the x86-64 > version of 'string.h' in > > arch/x86/include/asm/string_64.h > > and note the difference with the 32-bit one. The 32-bit one is the > "this is how we used to do it" that nobody cared enough to change. The > 64-bit one is much simpler and actually generates better code simply > because gcc recognizes memcpy() and friends, and will then inline it > when small etc. > > The *downside* is that now you have to trust the compiler to do the > right thing. And that will depend on compiler version etc. There's a > reason why 32-bit x86 does everything by hand: when your compiler > history starts at gcc-1.40, things are simply *very* different from > when you now rely on gcc-5.1 and newer... > > Put another way: gcc has changed, and what used to make sense probably > doesn't make sense any more. Yep, I think it's time to review the use of -ffreestanding in the linux kernel.
On Mon, Aug 28, 2023 at 9:30 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Mon, 28 Aug 2023 at 03:53, David Laight <David.Laight@aculab.com> wrote: > > > > From: Linus Torvalds > > > > > > We use this: > > > > > > static __always_inline unsigned long variable__ffs(unsigned long word) > > > { > > > asm("rep; bsf %1,%0" > > > : "=r" (word) > > > : "rm" (word)); > > > return word; > > > } > > > > > > for the definition, and it looks like clang royally just screws up > > > here. Yes, "m" is _allowed_ in that input set, but it damn well > > > shouldn't be used for something that is already in a register, since > > > "r" is also allowed, and is the first choice. > > > > Why don't we just remove the "m" option? > > For this particular case, it would probably be the right thing to do. > It's sad, though, because gcc handles this correctly, and always has. > > And in this particular case, it probably matters not at all. > > In many other cases where we have 'rm', we may actually be in the > situation that having 'rm' (or other cases like "g" that also allows > immediates) helps because register pressure can be a thing. > > It's mostly a thing on 32-bit x86 where you have a lot fewer > registers, and there we've literally run into situations where we have > had internal compiler errors because of complex inline asm statements > running out of registers. > > With a simple "one input, one output" case, that just isn't an issue, > so to work around a clang misfeature we could do it - if somebody > finds a case where it actually matters (as opposed to "damn, when > looking at the generted code for a function that we never actually use > on x86, I noticed that code generation is horrendous"). > > Linus Yes; it's a compiler bug, and we will fix it. Then the fix will be an incentive for folks that care to move to a newer toolchain.
diff --git a/lib/clz_ctz.c b/lib/clz_ctz.c index 0d3a686b5ba2..fb8c0c5c2bd2 100644 --- a/lib/clz_ctz.c +++ b/lib/clz_ctz.c @@ -28,36 +28,16 @@ int __weak __clzsi2(int val) } EXPORT_SYMBOL(__clzsi2); -int __weak __clzdi2(long val); -int __weak __ctzdi2(long val); -#if BITS_PER_LONG == 32 - -int __weak __clzdi2(long val) +int __weak __clzdi2(u64 val); +int __weak __clzdi2(u64 val) { - return 32 - fls((int)val); + return 64 - fls64(val); } EXPORT_SYMBOL(__clzdi2); -int __weak __ctzdi2(long val) +int __weak __ctzdi2(u64 val); +int __weak __ctzdi2(u64 val) { - return __ffs((u32)val); + return __ffs64(val); } EXPORT_SYMBOL(__ctzdi2); - -#elif BITS_PER_LONG == 64 - -int __weak __clzdi2(long val) -{ - return 64 - fls64((u64)val); -} -EXPORT_SYMBOL(__clzdi2); - -int __weak __ctzdi2(long val) -{ - return __ffs64((u64)val); -} -EXPORT_SYMBOL(__ctzdi2); - -#else -#error BITS_PER_LONG not 32 or 64 -#endif
The gcc compiler translates on some architectures the 64-bit __builtin_clzll() function to a call to the libgcc function __clzdi2(), which should take a 64-bit parameter on 32- and 64-bit platforms. But in the current kernel code, the built-in __clzdi2() function is defined to operate (wrongly) on 32-bit parameters if BITS_PER_LONG == 32, thus the return values on 32-bit kernels are in the range from [0..31] instead of the expected [0..63] range. This patch fixes the in-kernel functions __clzdi2() and __ctzdi2() to take a 64-bit parameter on 32-bit kernels as well, thus it makes the functions identical for 32- and 64-bit kernels. This bug went unnoticed since kernel 3.11 for over 10 years, and here are some possible reasons for that: a) Some architectures have assembly instructions to count the bits and which are used instead of calling __clzdi2(), e.g. on x86 the bsr instruction and on ppc cntlz is used. On such architectures the wrong __clzdi2() implementation isn't used and as such the bug has no effect and won't be noticed. b) Some architectures link to libgcc.a, and the in-kernel weak functions get replaced by the correct 64-bit variants from libgcc.a. c) __builtin_clzll() and __clzdi2() doesn't seem to be used in many places in the kernel, and most likely only in uncritical functions, e.g. when printing hex values via seq_put_hex_ll(). The wrong return value will still print the correct number, but just in a wrong formatting (e.g. with too many leading zeroes). d) 32-bit kernels aren't used that much any longer, so they are less tested. A trivial testcase to verify if the currently running 32-bit kernel is affected by the bug is to look at the output of /proc/self/maps: Here the kernel uses a correct implementation of __clzdi2(): root@debian:~# cat /proc/self/maps 00010000-00019000 r-xp 00000000 08:05 787324 /usr/bin/cat 00019000-0001a000 rwxp 00009000 08:05 787324 /usr/bin/cat 0001a000-0003b000 rwxp 00000000 00:00 0 [heap] f7551000-f770d000 r-xp 00000000 08:05 794765 /usr/lib/hppa-linux-gnu/libc.so.6 ... and this kernel uses the broken implementation of __clzdi2(): root@debian:~# cat /proc/self/maps 0000000010000-0000000019000 r-xp 00000000 000000008:000000005 787324 /usr/bin/cat 0000000019000-000000001a000 rwxp 000000009000 000000008:000000005 787324 /usr/bin/cat 000000001a000-000000003b000 rwxp 00000000 00:00 0 [heap] 00000000f73d1000-00000000f758d000 r-xp 00000000 000000008:000000005 794765 /usr/lib/hppa-linux-gnu/libc.so.6 ... Signed-off-by: Helge Deller <deller@gmx.de> Fixes: 4df87bb7b6a22 ("lib: add weak clz/ctz functions") Cc: Chanho Min <chanho.min@lge.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: <stable@vger.kernel.org> # v3.11+