diff mbox series

lib/clz_ctz.c: Fix __clzdi2() and __ctzdi2() for 32-bit kernels

Message ID ZOkGCSNr0VN2VIJJ@p100 (mailing list archive)
State Accepted, archived
Headers show
Series lib/clz_ctz.c: Fix __clzdi2() and __ctzdi2() for 32-bit kernels | expand

Commit Message

Helge Deller Aug. 25, 2023, 7:50 p.m. UTC
The gcc compiler translates on some architectures the 64-bit
__builtin_clzll() function to a call to the libgcc function
__clzdi2(), which should take a 64-bit parameter on 32- and 64-bit
platforms.

But in the current kernel code, the built-in __clzdi2() function is
defined to operate (wrongly) on 32-bit parameters if BITS_PER_LONG == 32,
thus the return values on 32-bit kernels are in the range from
[0..31] instead of the expected [0..63] range.

This patch fixes the in-kernel functions __clzdi2() and __ctzdi2() to
take a 64-bit parameter on 32-bit kernels as well, thus it makes the
functions identical for 32- and 64-bit kernels.

This bug went unnoticed since kernel 3.11 for over 10 years, and
here are some possible reasons for that:

a) Some architectures have assembly instructions to count the bits
   and which are used instead of calling __clzdi2(), e.g. on x86 the
   bsr instruction and on ppc cntlz is used. On such architectures
   the wrong __clzdi2() implementation isn't used and as such the
   bug has no effect and won't be noticed.

b) Some architectures link to libgcc.a, and the in-kernel weak
   functions get replaced by the correct 64-bit variants from libgcc.a.

c) __builtin_clzll() and __clzdi2() doesn't seem to be used in many
   places in the kernel, and most likely only in uncritical functions,
   e.g. when printing hex values via seq_put_hex_ll(). The wrong return
   value will still print the correct number, but just in a wrong formatting
   (e.g. with too many leading zeroes).

d) 32-bit kernels aren't used that much any longer, so they are less
   tested.

A trivial testcase to verify if the currently running 32-bit kernel
is affected by the bug is to look at the output of /proc/self/maps:

Here the kernel uses a correct implementation of __clzdi2():
root@debian:~# cat /proc/self/maps
00010000-00019000 r-xp 00000000 08:05 787324     /usr/bin/cat
00019000-0001a000 rwxp 00009000 08:05 787324     /usr/bin/cat
0001a000-0003b000 rwxp 00000000 00:00 0          [heap]
f7551000-f770d000 r-xp 00000000 08:05 794765     /usr/lib/hppa-linux-gnu/libc.so.6
...

and this kernel uses the broken implementation of __clzdi2():
root@debian:~# cat /proc/self/maps
0000000010000-0000000019000 r-xp 00000000 000000008:000000005 787324  /usr/bin/cat
0000000019000-000000001a000 rwxp 000000009000 000000008:000000005 787324  /usr/bin/cat
000000001a000-000000003b000 rwxp 00000000 00:00 0  [heap]
00000000f73d1000-00000000f758d000 r-xp 00000000 000000008:000000005 794765  /usr/lib/hppa-linux-gnu/libc.so.6
...

Signed-off-by: Helge Deller <deller@gmx.de>
Fixes: 4df87bb7b6a22 ("lib: add weak clz/ctz functions")
Cc: Chanho Min <chanho.min@lge.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: <stable@vger.kernel.org> # v3.11+

Comments

Linus Torvalds Aug. 25, 2023, 8:25 p.m. UTC | #1
On Fri, 25 Aug 2023 at 12:50, Helge Deller <deller@gmx.de> wrote:
>
> This patch fixes the in-kernel functions __clzdi2() and __ctzdi2() [..]

Applied,

               Linus
Linus Torvalds Aug. 25, 2023, 8:43 p.m. UTC | #2
[ Unrelated to this patch, except it made me look, adding clang build
people to cc ]

On Fri, 25 Aug 2023 at 13:25, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, 25 Aug 2023 at 12:50, Helge Deller <deller@gmx.de> wrote:
> >
> > This patch fixes the in-kernel functions __clzdi2() and __ctzdi2() [..]
>
> Applied,

Bah. Still applied,  but actually building this (on 64-bit, so kind of
pointless) I note that clang completely messes up this function on
x86.

Clang turns this:

        return __ffs64(val);

into this horror:

        pushq   %rax
        movq    %rdi, (%rsp)
        #APP
        rep
        bsfq    (%rsp), %rax
        #NO_APP
        popq    %rcx

which is just incredibly broken on so many levels. It *should* be a
single instruction, like gcc does:

        rep; bsf %rdi,%rax      # tmp87, word

but clang decides that it really wants to put the argument on the
stack, and apparently also wants to do that nonsensical stack
alignment thing to make things even worse.

We use this:

  static __always_inline unsigned long variable__ffs(unsigned long word)
  {
        asm("rep; bsf %1,%0"
                : "=r" (word)
                : "rm" (word));
        return word;
  }

for the definition, and it looks like clang royally just screws up
here. Yes, "m" is _allowed_ in that input set, but it damn well
shouldn't be used for something that is already in a register, since
"r" is also allowed, and is the first choice.

I think it's this clang bug:

    https://github.com/llvm/llvm-project/issues/20571
    https://github.com/llvm/llvm-project/issues/30873
    https://github.com/llvm/llvm-project/issues/34837

and it doesn't matter for *this* case (since I don't think this
library function is ever used on x86), but it looks like a generic
clang issue.

                 Linus
Nick Desaulniers Aug. 25, 2023, 9:01 p.m. UTC | #3
On Fri, Aug 25, 2023 at 1:43 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> [ Unrelated to this patch, except it made me look, adding clang build
> people to cc ]
>
> On Fri, 25 Aug 2023 at 13:25, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Fri, 25 Aug 2023 at 12:50, Helge Deller <deller@gmx.de> wrote:
> > >
> > > This patch fixes the in-kernel functions __clzdi2() and __ctzdi2() [..]
> >
> > Applied,
>
> Bah. Still applied,  but actually building this (on 64-bit, so kind of
> pointless) I note that clang completely messes up this function on
> x86.
>
> Clang turns this:
>
>         return __ffs64(val);
>
> into this horror:
>
>         pushq   %rax
>         movq    %rdi, (%rsp)
>         #APP
>         rep
>         bsfq    (%rsp), %rax
>         #NO_APP
>         popq    %rcx
>
> which is just incredibly broken on so many levels. It *should* be a
> single instruction, like gcc does:
>
>         rep; bsf %rdi,%rax      # tmp87, word
>
> but clang decides that it really wants to put the argument on the
> stack, and apparently also wants to do that nonsensical stack
> alignment thing to make things even worse.
>
> We use this:
>
>   static __always_inline unsigned long variable__ffs(unsigned long word)
>   {
>         asm("rep; bsf %1,%0"
>                 : "=r" (word)
>                 : "rm" (word));
>         return word;
>   }
>
> for the definition, and it looks like clang royally just screws up
> here. Yes, "m" is _allowed_ in that input set, but it damn well
> shouldn't be used for something that is already in a register, since
> "r" is also allowed, and is the first choice.
>
> I think it's this clang bug:
>
>     https://github.com/llvm/llvm-project/issues/20571

^ yep, my comments at the end of that thread are the last time we've
had a chance to look into this.  Boy, it's been 9 months since the
last discussion of it.  I'm sorry for that.

The TL;DR of that thread is that when both "r" and "m" constraints are
present, LLVM is conservative and always chooses "m" because at that
point it's not able to express to later passes that "m" is still a
valid fallback if "r" was chosen.

Obviously "r" is preferable to "m" and we should fix that.  Seeing who
wants to roll up their sleeves and volunteer to understand LLVM's
register allocation code is like asking who wants to be the first to
jump into a black hole and see what happens.  I'm having a hard enough
time understanding the stack spilling code to better understand what
precisely exists in what stack slots in order to make progress on some
of our -Wframe-larger-than= warnings, but I need to suck it up and do
better.

This came up previously in our discussion about __builtin_ia32_readeflags_*.
https://lore.kernel.org/all/20211215211847.206208-1-morbo@google.com/

>     https://github.com/llvm/llvm-project/issues/30873
>     https://github.com/llvm/llvm-project/issues/34837
>
> and it doesn't matter for *this* case (since I don't think this
> library function is ever used on x86), but it looks like a generic
> clang issue.
>
>                  Linus
Bill Wendling Aug. 25, 2023, 10:33 p.m. UTC | #4
On Fri, Aug 25, 2023 at 2:01 PM Nick Desaulniers
<ndesaulniers@google.com> wrote:
>
> On Fri, Aug 25, 2023 at 1:43 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > [ Unrelated to this patch, except it made me look, adding clang build
> > people to cc ]
> >
> > On Fri, 25 Aug 2023 at 13:25, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > On Fri, 25 Aug 2023 at 12:50, Helge Deller <deller@gmx.de> wrote:
> > > >
> > > > This patch fixes the in-kernel functions __clzdi2() and __ctzdi2() [..]
> > >
> > > Applied,
> >
> > Bah. Still applied,  but actually building this (on 64-bit, so kind of
> > pointless) I note that clang completely messes up this function on
> > x86.
> >
> > Clang turns this:
> >
> >         return __ffs64(val);
> >
> > into this horror:
> >
> >         pushq   %rax
> >         movq    %rdi, (%rsp)
> >         #APP
> >         rep
> >         bsfq    (%rsp), %rax
> >         #NO_APP
> >         popq    %rcx
> >
> > which is just incredibly broken on so many levels. It *should* be a
> > single instruction, like gcc does:
> >
> >         rep; bsf %rdi,%rax      # tmp87, word
> >
> > but clang decides that it really wants to put the argument on the
> > stack, and apparently also wants to do that nonsensical stack
> > alignment thing to make things even worse.
> >
> > We use this:
> >
> >   static __always_inline unsigned long variable__ffs(unsigned long word)
> >   {
> >         asm("rep; bsf %1,%0"
> >                 : "=r" (word)
> >                 : "rm" (word));
> >         return word;
> >   }
> >
> > for the definition, and it looks like clang royally just screws up
> > here. Yes, "m" is _allowed_ in that input set, but it damn well
> > shouldn't be used for something that is already in a register, since
> > "r" is also allowed, and is the first choice.
> >
> > I think it's this clang bug:
> >
> >     https://github.com/llvm/llvm-project/issues/20571
>
> ^ yep, my comments at the end of that thread are the last time we've
> had a chance to look into this.  Boy, it's been 9 months since the
> last discussion of it.  I'm sorry for that.
>
> The TL;DR of that thread is that when both "r" and "m" constraints are
> present, LLVM is conservative and always chooses "m" because at that
> point it's not able to express to later passes that "m" is still a
> valid fallback if "r" was chosen.
>
> Obviously "r" is preferable to "m" and we should fix that.  Seeing who
> wants to roll up their sleeves and volunteer to understand LLVM's
> register allocation code is like asking who wants to be the first to
> jump into a black hole and see what happens.

Yum! Human spaghetti! :-)

I want to look into this myself. I'm a bit focussed on other things at
the moment, but this is definitely on my list of "DO WANT"s.

-bw

>  I'm having a hard enough
> time understanding the stack spilling code to better understand what
> precisely exists in what stack slots in order to make progress on some
> of our -Wframe-larger-than= warnings, but I need to suck it up and do
> better.
>
> This came up previously in our discussion about __builtin_ia32_readeflags_*.
> https://lore.kernel.org/all/20211215211847.206208-1-morbo@google.com/
>
> >     https://github.com/llvm/llvm-project/issues/30873
> >     https://github.com/llvm/llvm-project/issues/34837
> >
> > and it doesn't matter for *this* case (since I don't think this
> > library function is ever used on x86), but it looks like a generic
> > clang issue.
> >
> >                  Linus
>
>
>
> --
> Thanks,
> ~Nick Desaulniers
Bill Wendling Aug. 25, 2023, 10:57 p.m. UTC | #5
On Fri, Aug 25, 2023 at 3:33 PM Bill Wendling <morbo@google.com> wrote:
>
> On Fri, Aug 25, 2023 at 2:01 PM Nick Desaulniers
> <ndesaulniers@google.com> wrote:
> >
> > On Fri, Aug 25, 2023 at 1:43 PM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > [ Unrelated to this patch, except it made me look, adding clang build
> > > people to cc ]
> > >
> > > On Fri, 25 Aug 2023 at 13:25, Linus Torvalds
> > > <torvalds@linux-foundation.org> wrote:
> > > >
> > > > On Fri, 25 Aug 2023 at 12:50, Helge Deller <deller@gmx.de> wrote:
> > > > >
> > > > > This patch fixes the in-kernel functions __clzdi2() and __ctzdi2() [..]
> > > >
> > > > Applied,
> > >
> > > Bah. Still applied,  but actually building this (on 64-bit, so kind of
> > > pointless) I note that clang completely messes up this function on
> > > x86.
> > >
> > > Clang turns this:
> > >
> > >         return __ffs64(val);
> > >
> > > into this horror:
> > >
> > >         pushq   %rax
> > >         movq    %rdi, (%rsp)
> > >         #APP
> > >         rep
> > >         bsfq    (%rsp), %rax
> > >         #NO_APP
> > >         popq    %rcx
> > >
> > > which is just incredibly broken on so many levels. It *should* be a
> > > single instruction, like gcc does:
> > >
> > >         rep; bsf %rdi,%rax      # tmp87, word
> > >
> > > but clang decides that it really wants to put the argument on the
> > > stack, and apparently also wants to do that nonsensical stack
> > > alignment thing to make things even worse.
> > >
> > > We use this:
> > >
> > >   static __always_inline unsigned long variable__ffs(unsigned long word)
> > >   {
> > >         asm("rep; bsf %1,%0"
> > >                 : "=r" (word)
> > >                 : "rm" (word));
> > >         return word;
> > >   }
> > >
> > > for the definition, and it looks like clang royally just screws up
> > > here. Yes, "m" is _allowed_ in that input set, but it damn well
> > > shouldn't be used for something that is already in a register, since
> > > "r" is also allowed, and is the first choice.
> > >
> > > I think it's this clang bug:
> > >
> > >     https://github.com/llvm/llvm-project/issues/20571
> >
> > ^ yep, my comments at the end of that thread are the last time we've
> > had a chance to look into this.  Boy, it's been 9 months since the
> > last discussion of it.  I'm sorry for that.
> >
> > The TL;DR of that thread is that when both "r" and "m" constraints are
> > present, LLVM is conservative and always chooses "m" because at that
> > point it's not able to express to later passes that "m" is still a
> > valid fallback if "r" was chosen.
> >
> > Obviously "r" is preferable to "m" and we should fix that.  Seeing who
> > wants to roll up their sleeves and volunteer to understand LLVM's
> > register allocation code is like asking who wants to be the first to
> > jump into a black hole and see what happens.
>
> Yum! Human spaghetti! :-)
>
> I want to look into this myself. I'm a bit focussed on other things at
> the moment, but this is definitely on my list of "DO WANT"s.
>
Another idea is that there are __builtin_* functions for a lot of
functions that are currently in inline asm---__builtin_ctz{,l,ll,s]
and _builtin_ffs{,l,ll}. The major issue with the
`__builtin_ia32_readeflags_*` was its inability to take unrelated MSRs
into account during code motion. That may not be the same worry here?

-bw


> -bw
>
> >  I'm having a hard enough
> > time understanding the stack spilling code to better understand what
> > precisely exists in what stack slots in order to make progress on some
> > of our -Wframe-larger-than= warnings, but I need to suck it up and do
> > better.
> >
> > This came up previously in our discussion about __builtin_ia32_readeflags_*.
> > https://lore.kernel.org/all/20211215211847.206208-1-morbo@google.com/
> >
> > >     https://github.com/llvm/llvm-project/issues/30873
> > >     https://github.com/llvm/llvm-project/issues/34837
> > >
> > > and it doesn't matter for *this* case (since I don't think this
> > > library function is ever used on x86), but it looks like a generic
> > > clang issue.
> > >
> > >                  Linus
> >
> >
> >
> > --
> > Thanks,
> > ~Nick Desaulniers
Linus Torvalds Aug. 25, 2023, 11:34 p.m. UTC | #6
On Fri, 25 Aug 2023 at 15:57, Bill Wendling <morbo@google.com> wrote:
> >
> Another idea is that there are __builtin_* functions for a lot of
> functions that are currently in inline asm

No. We've been through this before. The builtins are almost entirely
untested, and often undocumented and buggy.

> The major issue with the
> `__builtin_ia32_readeflags_*` was its inability to take unrelated MSRs
> into account during code motion. That may not be the same worry here?

No, the problem with __builtin_ia32_readeflags_*() was that it was
literally completely buggy and generated entirely broken code:

    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104971

but that's really more of a symptom than anything else.

It's a symptom of the fact that unlike inline asm's, those builtins
are often undocumented in what compiler version they appeared, and are
of very questionable quality. They often don't have many users, and
the test suites are non-existent.

For example, we *do* use __builtin_ffs() on x86 for constant values,
because the compiler does the right thing.

But for non-constant ones, the inline asm actually generates better
code: gcc generatea some disgusting mess with a 'bsf' followed by a
'cmov' for the zero case, when we know better.

See for example

   https://godbolt.org/z/jKKf48Wsf

I don't understand why compiler people prefer a builtin that is an
untested special case that assumes that the compiler knows what is
going on (and often doesn't), over a generic escape facility that is
supported and needed anyway (inline asm).

In other words: the statement "builtins generate better code" is
simply PROVABLY NOT TRUE.

Builtins have often generated *worse* code than using inline asms, to
the point where "worse" is actively buggy crap.

At least inline asms are reliable. That's a *big* deal.

               Linus
Bill Wendling Aug. 26, 2023, 12:08 a.m. UTC | #7
On Fri, Aug 25, 2023 at 4:35 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, 25 Aug 2023 at 15:57, Bill Wendling <morbo@google.com> wrote:
> > >
> > Another idea is that there are __builtin_* functions for a lot of
> > functions that are currently in inline asm
>
> No. We've been through this before. The builtins are almost entirely
> untested, and often undocumented and buggy.
>
> > The major issue with the
> > `__builtin_ia32_readeflags_*` was its inability to take unrelated MSRs
> > into account during code motion. That may not be the same worry here?
>
> No, the problem with __builtin_ia32_readeflags_*() was that it was
> literally completely buggy and generated entirely broken code:
>
>     https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104971
>
> but that's really more of a symptom than anything else.
>
> It's a symptom of the fact that unlike inline asm's, those builtins
> are often undocumented in what compiler version they appeared, and are
> of very questionable quality. They often don't have many users, and
> the test suites are non-existent.
>
> For example, we *do* use __builtin_ffs() on x86 for constant values,
> because the compiler does the right thing.
>
> But for non-constant ones, the inline asm actually generates better
> code: gcc generatea some disgusting mess with a 'bsf' followed by a
> 'cmov' for the zero case, when we know better.
>
> See for example
>
>    https://godbolt.org/z/jKKf48Wsf
>
Ew...gross.

> I don't understand why compiler people prefer a builtin that is an
> untested special case that assumes that the compiler knows what is
> going on (and often doesn't), over a generic escape facility that is
> supported and needed anyway (inline asm).
>
> In other words: the statement "builtins generate better code" is
> simply PROVABLY NOT TRUE.
>
> Builtins have often generated *worse* code than using inline asms, to
> the point where "worse" is actively buggy crap.
>
> At least inline asms are reliable. That's a *big* deal.
>
There are a couple of reasons why compiler writers (at least I) prefer
builtins to inline asm. Inline asm takes control away from the
compiler, which makes it harder for the compiler to perform normal
optimizations. It's more portable. As for better code, it won't
generate it in all situations, as you pointed out, but can typically
generate as good code.

Inline asm has its own issues---portability, difficult to use
constraints (the EFLAGS asms went back and forth on its constraints
over the years), and, from what I remember, GNU's inline asm is
closely tied to its register allocator, making it hard to support in
other compilers. Plus, there's not guarantee that an inline asm won't
be moved. (It's been discussed before, and I also believe that the
current Linux inline asm "barriers" should prevent this. I just vastly
prefer guaranteed behavior over "it works now". I've migrated many
systems to new compilers (and even compiler versions) and have seen
"well, it worked for the previous compiler" types of bugs that were
actual bugs in the programmer's code.)

Okay, none of what I said is going to convince you to use builtins,
and I'm not suggesting we do. It's just a few of the reasons why we
tend to prefer them to other methods. Just take this as an FYI. :-)

-bw
Nick Desaulniers Aug. 26, 2023, 12:52 a.m. UTC | #8
On Fri, Aug 25, 2023 at 4:35 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But for non-constant ones, the inline asm actually generates better
> code: gcc generatea some disgusting mess with a 'bsf' followed by a
> 'cmov' for the zero case, when we know better.
>
> See for example
>
>    https://godbolt.org/z/jKKf48Wsf
>
> I don't understand why compiler people prefer a builtin that is an
> untested special case that assumes that the compiler knows what is
> going on (and often doesn't), over a generic escape facility that is
> supported and needed anyway (inline asm).
>
> In other words: the statement "builtins generate better code" is
> simply PROVABLY NOT TRUE.
>
> Builtins have often generated *worse* code than using inline asms, to
> the point where "worse" is actively buggy crap.
>
> At least inline asms are reliable. That's a *big* deal.

So 2 concerns where "I'll do it in inline asm" can pessimize codegen:
1. You alluded to this, but what happens when one of these functions
is called with a constant? (Not just a literal value, but a value that
can be proven constant at compile time via optimizations as well?)
arch/x86/include/asm/bitops.h got this right for ffs(), but it did not
for fls()! (I think that could be `if (__builtin_constant_p(x)) return
x ? 32 - __builtin_clz(x) : 0;` but check my math; oh, good job
arch/powerpc/include/asm/bitops.h).
2. by providing the definition of a symbol typically provided by libc
(and then not building with -ffreestanding) pessimizes libcall
optimization.
example: https://godbolt.org/z/crrTKEf6G
ffs() gets this right again by using a macro, and __always_inline can
work around this somewhat (so fls() if off the hook here). But any
attempt using `static inline` would be pessimized for constants.
Linus Torvalds Aug. 26, 2023, 1:07 a.m. UTC | #9
On Fri, 25 Aug 2023 at 17:52, Nick Desaulniers <ndesaulniers@google.com> wrote:
>
> So 2 concerns where "I'll do it in inline asm" can pessimize codegen:
> 1. You alluded to this, but what happens when one of these functions
> is called with a constant?

This is why our headers have a lot of __builtin_constant_p()'s in them..

In this particular case, see the x86 asm/bitops.h code:

    #define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) :
variable_ffs(x))

but this is actually quite a common pattern, and it's often not about
something like __builtin_ffs() at all.

See all the other __builtin_constant_p()'s that we have in that same
file because we basically just use different code sequences for
constants.

And that file isn't even unusual. We use it quite a lot when we care
about code generation for some particular case.

> 2. by providing the definition of a symbol typically provided by libc
> (and then not building with -ffreestanding) pessimizes libcall
> optimization.

.. and this is partly why we often avoid libgcc things, and do certain
things by hand.

The classic rule is "Don't do 64-bit divisions using the C '/' operator".

So in the kernel you have to use do_div() and friends, because the
library versions are often disgusting and might not know that 64/32 is
much much cheaper and is what you want.

And quite often we simply use other names - but then we also do *not*
build with -freestanding, because -freestanding has at least
traditionally meant that the compiler won't optimize the simple and
obvious cases (typically things like "memcpy with a constant size").

So we mix and match and pick the best option.

The kernel really doesn't care about architecture portability, because
honestly, something like "ffs()" is entirely *trivial* to get right,
compared to the real differences between architectures (eg VM and IO
differences etc).

             Linus
Fangrui Song Aug. 26, 2023, 3:17 a.m. UTC | #10
On 2023-08-25, Linus Torvalds wrote:
>On Fri, 25 Aug 2023 at 17:52, Nick Desaulniers <ndesaulniers@google.com> wrote:
>>
>> So 2 concerns where "I'll do it in inline asm" can pessimize codegen:
>> 1. You alluded to this, but what happens when one of these functions
>> is called with a constant?
>
>This is why our headers have a lot of __builtin_constant_p()'s in them..
>
>In this particular case, see the x86 asm/bitops.h code:
>
>    #define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) :
>variable_ffs(x))

For the curious (like me),

__builtin_ffs
https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#index-_005f_005fbuiltin_005fffs says

     Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined.

The hangling of 0 seems the cause that __builtin_ffs codegen is not as
well as inline asm.  Clang implemented the builtin in 2008 and took the
same constraint (penalty).



GCC compiles __builtin_ctzl(x) to xorl    %eax, %eax; tzcntq  %rdi, %rax
on most Intel processors (AMD -march= values are unaffected). The extra
xor is due to a false dependency issue
https://gcc.gnu.org/git/?p=gcc.git&a=commit;h=73543b2286027da1de561902440b53f775a03a86

Inline asm wins here as well since we know the argument 0 is undefined.

In May 2023, https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=cc6eb8b51f9568ae0caf46b80e2a0aff050030ce
"Disable avoid_false_dep_for_bmi for atom and icelake(and later) core processors."
removed the extra xor for icelake (and later) core processors.

>but this is actually quite a common pattern, and it's often not about
>something like __builtin_ffs() at all.
>
>See all the other __builtin_constant_p()'s that we have in that same
>file because we basically just use different code sequences for
>constants.
>
>And that file isn't even unusual. We use it quite a lot when we care
>about code generation for some particular case.
>
>> 2. by providing the definition of a symbol typically provided by libc
>> (and then not building with -ffreestanding) pessimizes libcall
>> optimization.
>
>.. and this is partly why we often avoid libgcc things, and do certain
>things by hand.
>
>The classic rule is "Don't do 64-bit divisions using the C '/' operator".
>
>So in the kernel you have to use do_div() and friends, because the
>library versions are often disgusting and might not know that 64/32 is
>much much cheaper and is what you want.
>
>And quite often we simply use other names - but then we also do *not*
>build with -freestanding, because -freestanding has at least
>traditionally meant that the compiler won't optimize the simple and
>obvious cases (typically things like "memcpy with a constant size").
>
>So we mix and match and pick the best option.
>
>The kernel really doesn't care about architecture portability, because
>honestly, something like "ffs()" is entirely *trivial* to get right,
>compared to the real differences between architectures (eg VM and IO
>differences etc).
>
>             Linus
>
Geert Uytterhoeven Aug. 28, 2023, 7:33 a.m. UTC | #11
Hi Linus,

On Sat, Aug 26, 2023 at 3:08 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, 25 Aug 2023 at 17:52, Nick Desaulniers <ndesaulniers@google.com> wrote:
> And quite often we simply use other names - but then we also do *not*
> build with -freestanding, because -freestanding has at least
> traditionally meant that the compiler won't optimize the simple and
> obvious cases (typically things like "memcpy with a constant size").

Several architectures (incl. x86, but excl. amd64) do build the kernel with
-freestanding.

IIRC, the issue was that without that, gcc was "optimizing" calls
to standard functions (implemented as inline optimized assembler
functions) by replacing them with calls to other standard functions
(also implemented as inline optimized assembler functions).  As the
latter became external calls, this defeated the optimized assembler
implementation, and could even cause build failures if no external
implementation was available.

Gr{oetje,eeting}s,

                        Geert
David Laight Aug. 28, 2023, 10:53 a.m. UTC | #12
From: Linus Torvalds
> Sent: 25 August 2023 21:43
....
> Clang turns this:
> 
>         return __ffs64(val);
> 
> into this horror:
> 
>         pushq   %rax
>         movq    %rdi, (%rsp)
>         #APP
>         rep
>         bsfq    (%rsp), %rax
>         #NO_APP
>         popq    %rcx
> 
> which is just incredibly broken on so many levels. It *should* be a
> single instruction, like gcc does:
> 
>         rep; bsf %rdi,%rax      # tmp87, word
> 
> but clang decides that it really wants to put the argument on the
> stack, and apparently also wants to do that nonsensical stack
> alignment thing to make things even worse.
> 
> We use this:
> 
>   static __always_inline unsigned long variable__ffs(unsigned long word)
>   {
>         asm("rep; bsf %1,%0"
>                 : "=r" (word)
>                 : "rm" (word));
>         return word;
>   }
> 
> for the definition, and it looks like clang royally just screws up
> here. Yes, "m" is _allowed_ in that input set, but it damn well
> shouldn't be used for something that is already in a register, since
> "r" is also allowed, and is the first choice.

Why don't we just remove the "m" option?

Pretty much the only time it will be worse is it the value
is in memory and loading it into a register causes a spill
to stack.

While it is possible to generate code where that happens it
is pretty unlikely.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Linus Torvalds Aug. 28, 2023, 4:24 p.m. UTC | #13
On Mon, 28 Aug 2023 at 00:33, Geert Uytterhoeven <geert@linux-m68k.org> wrote:
>
> Several architectures (incl. x86, but excl. amd64) do build the kernel with
> -freestanding.
>
> IIRC, the issue was that without that, gcc was "optimizing" calls
> to standard functions (implemented as inline optimized assembler
> functions) by replacing them with calls to other standard functions
> (also implemented as inline optimized assembler functions).

So using -ffreestanding is definitely the right thing to do for a
kernel in theory. It's very much supposed to tell the compiler to not
assume a standard libc, and without that gcc will do various
transformations that make sense when you "know" what libc does, but
may not make sense in the limited library model of a kernel.

So without it, gcc will do things like converting a 'printf()' call
without any conversion characters to a much cheaper 'puts()' etc. Now,
we often avoid that issue entirely by having our own function names
(ie printk()), but we do tend to use the *really* core C library
names.

Anyway, it turns out that some of the things you miss out on with
-ffreestanding are kind of important. In particular, at least gcc will
stop some 'memcpy()' optimizations too, which ends up being pretty
horrendous.

So while -ffreestanding would be the right thing to do in theory, in
practice it's actually pretty horrible. It's a big hammer that affects
a lot of things, and while many of them make sense for a kernel, some
of them are really bad. Which is why x86-64 no longer uses it.

I would actually suggest other architectures take a look if they care
at all about code generation. In particular, look at the x86-64
version of 'string.h' in

  arch/x86/include/asm/string_64.h

and note the difference with the 32-bit one. The 32-bit one is the
"this is how we used to do it" that nobody cared enough to change. The
64-bit one is much simpler and actually generates better code simply
because gcc recognizes memcpy() and friends, and will then inline it
when small etc.

The *downside* is that now you have to trust the compiler to do the
right thing. And that will depend on compiler version etc. There's a
reason why 32-bit x86 does everything by hand: when your compiler
history starts at gcc-1.40, things are simply *very* different from
when you now rely on gcc-5.1 and newer...

Put another way: gcc has changed, and what used to make sense probably
doesn't make sense any more.

                 Linus
Linus Torvalds Aug. 28, 2023, 4:30 p.m. UTC | #14
On Mon, 28 Aug 2023 at 03:53, David Laight <David.Laight@aculab.com> wrote:
>
> From: Linus Torvalds
> >
> > We use this:
> >
> >   static __always_inline unsigned long variable__ffs(unsigned long word)
> >   {
> >         asm("rep; bsf %1,%0"
> >                 : "=r" (word)
> >                 : "rm" (word));
> >         return word;
> >   }
> >
> > for the definition, and it looks like clang royally just screws up
> > here. Yes, "m" is _allowed_ in that input set, but it damn well
> > shouldn't be used for something that is already in a register, since
> > "r" is also allowed, and is the first choice.
>
> Why don't we just remove the "m" option?

For this particular case, it would probably be the right thing to do.
It's sad, though, because gcc handles this correctly, and always has.

And in this particular case, it probably matters not at all.

In many other cases where we have 'rm', we may actually be in the
situation that having 'rm' (or other cases like "g" that also allows
immediates) helps because register pressure can be a thing.

It's mostly a thing on 32-bit x86 where you have a lot fewer
registers, and there we've literally run into situations where we have
had internal compiler errors because of complex inline asm statements
running out of registers.

With a simple "one input, one output" case, that just isn't an issue,
so to work around a clang misfeature we could do it - if somebody
finds a case where it actually matters (as opposed to "damn, when
looking at the generted code for a function that we never actually use
on x86, I noticed that code generation is horrendous").

             Linus
Nick Desaulniers Aug. 28, 2023, 8:08 p.m. UTC | #15
On Fri, Aug 25, 2023 at 6:08 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, 25 Aug 2023 at 17:52, Nick Desaulniers <ndesaulniers@google.com> wrote:
> >
> > So 2 concerns where "I'll do it in inline asm" can pessimize codegen:
> > 1. You alluded to this, but what happens when one of these functions
> > is called with a constant?
>
> This is why our headers have a lot of __builtin_constant_p()'s in them..
>
> In this particular case, see the x86 asm/bitops.h code:
>
>     #define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) :
> variable_ffs(x))
>
> but this is actually quite a common pattern, and it's often not about
> something like __builtin_ffs() at all.

I was a reviewer on commit fdb6649ab7c1 ("x86/asm/bitops: Use
__builtin_ctzl() to evaluate constant expressions"); I'm familiar with
the pattern.

>
> See all the other __builtin_constant_p()'s that we have in that same
> file because we basically just use different code sequences for
> constants.
>
> And that file isn't even unusual. We use it quite a lot when we care
> about code generation for some particular case.

More so my point was x86 bitops is missing
commit 2fcff790dcb4 ("powerpc: Use builtin functions for fls()/__fls()/fls64()")
treatment.

I've sent https://lore.kernel.org/llvm/20230828-x86_fls-v1-1-e6a31b9f79c3@google.com/.

>
> > 2. by providing the definition of a symbol typically provided by libc
> > (and then not building with -ffreestanding) pessimizes libcall
> > optimization.
>
> .. and this is partly why we often avoid libgcc things, and do certain
> things by hand.

(Sorry if the following rant is prior knowledge, it's mostly for
reference for others cc'ed who might not know this)

Careful, `-ffreestanding` and libgcc are two orthogonal things (at
least in my mind).

-ffreestanding is to libc as --rtlib= is to the compiler runtime
(which is distinct from the libc)

`-ffreestanding` is more about "does the runtime environment somehow
provide libc symbols."

libgcc (or llvm's equivalent "compiler-rt") is not responsible for
providing symbols from libc.  `--rtlib=` controls what compiler
runtime will be used, but in my experience, today's compilers don't
make codegen decisions on that value.  These are mostly runtime
helpers for "idk how to do <complicated math thing, such as double
word division>" or "maybe you didn't want that inline."

What's brittle about making codegen decisions with regards to these
flags though is that these dependencies grow over time, and yet it's
not possible today (AFAIK) to specify what's the minimum target to
support.

For instance, IIRC glibc recently added support for one of the
kernel's string.h routines, maybe strlcpy or something.
https://sourceware.org/git/?p=glibc.git;a=commit;h=454a20c8756c9c1d55419153255fc7692b3d2199

When is it safe for the compiler to start transforming calls to other
functions into calls to strlcpy?  (Guess: year 2033, because:) What
happens when dynamically linking against older versions of glib that
do not provide that symbol?

>
> The classic rule is "Don't do 64-bit divisions using the C '/' operator".
>
> So in the kernel you have to use do_div() and friends, because the
> library versions are often disgusting and might not know that 64/32 is
> much much cheaper and is what you want.

And thus the same problem exists for the kernel wrt --rtlib that I
alluded to above for strlcpy.  By providing a partial implementation
of a compiler runtime (--rtlib=), the compiler will frequently emit
libcalls to symbols for which the kernel hasn't provided.  You can
avoid open coded double word division in the kernel all you want but:
1. again you're probably pessimizing division by constant remainder by
using div64_u64.  GCC is *really* good at replacing these when the
divisor is constant; IME better than clang.
2. even avoiding open coded division, the compiler can still insert
division; loop-elision can replace loops outright if the trip count is
adjusted by a determinable value. (see 3220022038b9).

By providing a partial compiler runtime, and then using every -mno-
flag in the book, you tie the compiler's hands wrt what it can emit vs
libcall.  There's not even a way to express to today's compiler that
"we have a compiler runtime, it's just straight up missing things."

Personally, I think a clang-tidy check for open coded division is
perhaps a better way to enforce the kernel's posture than providing
half a compiler runtime then doing gymnastics in the code to work
around the emission of libcalls to __divdi3() or__aeabi_uldivmod()
(mostly on 32b platforms).  A linkage failure is nice, but only occurs
on 32b targets and it doesn't disambiguate between the case of
developer open coded division vs compiler inserted division.

>
> And quite often we simply use other names - but then we also do *not*
> build with -freestanding, because -freestanding has at least
> traditionally meant that the compiler won't optimize the simple and
> obvious cases (typically things like "memcpy with a constant size").

Personal opinion: we very much do NOT want to use -ffreestanding for
those libcall optimizations.

I discussed this recently with ARCH=loongarch folks:
commit 3f301dc292eb ("LoongArch: Replace -ffreestanding with
finer-grained -fno-builtin's")

It is my intention to remove -ffreestanding from ARCH=i386.
https://github.com/ClangBuiltLinux/linux/issues/1583

I had to first fix a bug in LLVM though
https://reviews.llvm.org/D125285
So rather than remove it outright, we might need to retain it for
builds with older releases of clang for now.

Though as you allude to down thread, perhaps some things that were the
case in linux 1.0 / gcc 1.40 no longer hold.  Which is why adding such
flags to kernel makefiles really ought to be accompanied by a comment
in sources linking to an issue tracker report, so that we might clean
these up one day.

>
> So we mix and match and pick the best option.

Gross, and like *could you not?*  I suspect it's more so the case of a
developer not realising it's perhaps a compiler bug, or not reporting
such bug, and trying flags they're heard of once until something
links.

Any use of -ffreestanding for any arch had better have a comment to a
compiler vendor's bug tracker laying out why that's necessary for that
arch and not others.

Many kernel developers are allergic to filing formal compiler bugs in
places where compiler vendors are looking, IME.

>
> The kernel really doesn't care about architecture portability, because
> honestly, something like "ffs()" is entirely *trivial* to get right,
> compared to the real differences between architectures (eg VM and IO
> differences etc).
>
>              Linus
Nick Desaulniers Aug. 28, 2023, 8:09 p.m. UTC | #16
On Mon, Aug 28, 2023 at 12:33 AM Geert Uytterhoeven
<geert@linux-m68k.org> wrote:
>
> Hi Linus,
>
> On Sat, Aug 26, 2023 at 3:08 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Fri, 25 Aug 2023 at 17:52, Nick Desaulniers <ndesaulniers@google.com> wrote:
> > And quite often we simply use other names - but then we also do *not*
> > build with -freestanding, because -freestanding has at least
> > traditionally meant that the compiler won't optimize the simple and
> > obvious cases (typically things like "memcpy with a constant size").
>
> Several architectures (incl. x86, but excl. amd64) do build the kernel with
> -freestanding.
>
> IIRC, the issue was that without that, gcc was "optimizing" calls
> to standard functions (implemented as inline optimized assembler
> functions) by replacing them with calls to other standard functions
> (also implemented as inline optimized assembler functions).  As the
> latter became external calls, this defeated the optimized assembler
> implementation, and could even cause build failures if no external
> implementation was available.

That's what the -fno-builtin-* flags are for, IMO, though those also
have toolchain portability issues IME.
Nick Desaulniers Aug. 28, 2023, 8:10 p.m. UTC | #17
On Mon, Aug 28, 2023 at 3:53 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Linus Torvalds
> > Sent: 25 August 2023 21:43
> ....
> > Clang turns this:
> >
> >         return __ffs64(val);
> >
> > into this horror:
> >
> >         pushq   %rax
> >         movq    %rdi, (%rsp)
> >         #APP
> >         rep
> >         bsfq    (%rsp), %rax
> >         #NO_APP
> >         popq    %rcx
> >
> > which is just incredibly broken on so many levels. It *should* be a
> > single instruction, like gcc does:
> >
> >         rep; bsf %rdi,%rax      # tmp87, word
> >
> > but clang decides that it really wants to put the argument on the
> > stack, and apparently also wants to do that nonsensical stack
> > alignment thing to make things even worse.
> >
> > We use this:
> >
> >   static __always_inline unsigned long variable__ffs(unsigned long word)
> >   {
> >         asm("rep; bsf %1,%0"
> >                 : "=r" (word)
> >                 : "rm" (word));
> >         return word;
> >   }
> >
> > for the definition, and it looks like clang royally just screws up
> > here. Yes, "m" is _allowed_ in that input set, but it damn well
> > shouldn't be used for something that is already in a register, since
> > "r" is also allowed, and is the first choice.
>
> Why don't we just remove the "m" option?
>
> Pretty much the only time it will be worse is it the value
> is in memory and loading it into a register causes a spill
> to stack.
>
> While it is possible to generate code where that happens it
> is pretty unlikely.

As Linus expressed below, register exhaustion could occur.  Besides,
this is a bug in clang that we acknowledge, and should fix.  I have
the general idea where things are going wrong, I just don't yet have
the muscle memory (or time) to dive into the register allocator.

>
>         David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
Nick Desaulniers Aug. 28, 2023, 8:13 p.m. UTC | #18
On Mon, Aug 28, 2023 at 9:25 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, 28 Aug 2023 at 00:33, Geert Uytterhoeven <geert@linux-m68k.org> wrote:
> >
> > Several architectures (incl. x86, but excl. amd64) do build the kernel with
> > -freestanding.
> >
> > IIRC, the issue was that without that, gcc was "optimizing" calls
> > to standard functions (implemented as inline optimized assembler
> > functions) by replacing them with calls to other standard functions
> > (also implemented as inline optimized assembler functions).
>
> So using -ffreestanding is definitely the right thing to do for a
> kernel in theory. It's very much supposed to tell the compiler to not

-ffreestanding is probably a good suggestion for any embedded
platform.  But given the size of the kernel, and similarities of
symbols and their semantics expected by the compiler and provided by
the kernel, I think -ffreestanding should not be set at this point for
the Linux kernel.

> assume a standard libc, and without that gcc will do various
> transformations that make sense when you "know" what libc does, but
> may not make sense in the limited library model of a kernel.
>
> So without it, gcc will do things like converting a 'printf()' call
> without any conversion characters to a much cheaper 'puts()' etc. Now,
> we often avoid that issue entirely by having our own function names
> (ie printk()), but we do tend to use the *really* core C library
> names.
>
> Anyway, it turns out that some of the things you miss out on with
> -ffreestanding are kind of important. In particular, at least gcc will
> stop some 'memcpy()' optimizations too, which ends up being pretty
> horrendous.
>
> So while -ffreestanding would be the right thing to do in theory, in
> practice it's actually pretty horrible. It's a big hammer that affects
> a lot of things, and while many of them make sense for a kernel, some
> of them are really bad. Which is why x86-64 no longer uses it.

I agree.

>
> I would actually suggest other architectures take a look if they care
> at all about code generation. In particular, look at the x86-64
> version of 'string.h' in
>
>   arch/x86/include/asm/string_64.h
>
> and note the difference with the 32-bit one. The 32-bit one is the
> "this is how we used to do it" that nobody cared enough to change. The
> 64-bit one is much simpler and actually generates better code simply
> because gcc recognizes memcpy() and friends, and will then inline it
> when small etc.
>
> The *downside* is that now you have to trust the compiler to do the
> right thing. And that will depend on compiler version etc. There's a
> reason why 32-bit x86 does everything by hand: when your compiler
> history starts at gcc-1.40, things are simply *very* different from
> when you now rely on gcc-5.1 and newer...
>
> Put another way: gcc has changed, and what used to make sense probably
> doesn't make sense any more.

Yep, I think it's time to review the use of -ffreestanding in the linux kernel.
Nick Desaulniers Aug. 28, 2023, 8:14 p.m. UTC | #19
On Mon, Aug 28, 2023 at 9:30 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, 28 Aug 2023 at 03:53, David Laight <David.Laight@aculab.com> wrote:
> >
> > From: Linus Torvalds
> > >
> > > We use this:
> > >
> > >   static __always_inline unsigned long variable__ffs(unsigned long word)
> > >   {
> > >         asm("rep; bsf %1,%0"
> > >                 : "=r" (word)
> > >                 : "rm" (word));
> > >         return word;
> > >   }
> > >
> > > for the definition, and it looks like clang royally just screws up
> > > here. Yes, "m" is _allowed_ in that input set, but it damn well
> > > shouldn't be used for something that is already in a register, since
> > > "r" is also allowed, and is the first choice.
> >
> > Why don't we just remove the "m" option?
>
> For this particular case, it would probably be the right thing to do.
> It's sad, though, because gcc handles this correctly, and always has.
>
> And in this particular case, it probably matters not at all.
>
> In many other cases where we have 'rm', we may actually be in the
> situation that having 'rm' (or other cases like "g" that also allows
> immediates) helps because register pressure can be a thing.
>
> It's mostly a thing on 32-bit x86 where you have a lot fewer
> registers, and there we've literally run into situations where we have
> had internal compiler errors because of complex inline asm statements
> running out of registers.
>
> With a simple "one input, one output" case, that just isn't an issue,
> so to work around a clang misfeature we could do it - if somebody
> finds a case where it actually matters (as opposed to "damn, when
> looking at the generted code for a function that we never actually use
> on x86, I noticed that code generation is horrendous").
>
>              Linus

Yes; it's a compiler bug, and we will fix it. Then the fix will be an
incentive for folks that care to move to a newer toolchain.
diff mbox series

Patch

diff --git a/lib/clz_ctz.c b/lib/clz_ctz.c
index 0d3a686b5ba2..fb8c0c5c2bd2 100644
--- a/lib/clz_ctz.c
+++ b/lib/clz_ctz.c
@@ -28,36 +28,16 @@  int __weak __clzsi2(int val)
 }
 EXPORT_SYMBOL(__clzsi2);
 
-int __weak __clzdi2(long val);
-int __weak __ctzdi2(long val);
-#if BITS_PER_LONG == 32
-
-int __weak __clzdi2(long val)
+int __weak __clzdi2(u64 val);
+int __weak __clzdi2(u64 val)
 {
-	return 32 - fls((int)val);
+	return 64 - fls64(val);
 }
 EXPORT_SYMBOL(__clzdi2);
 
-int __weak __ctzdi2(long val)
+int __weak __ctzdi2(u64 val);
+int __weak __ctzdi2(u64 val)
 {
-	return __ffs((u32)val);
+	return __ffs64(val);
 }
 EXPORT_SYMBOL(__ctzdi2);
-
-#elif BITS_PER_LONG == 64
-
-int __weak __clzdi2(long val)
-{
-	return 64 - fls64((u64)val);
-}
-EXPORT_SYMBOL(__clzdi2);
-
-int __weak __ctzdi2(long val)
-{
-	return __ffs64((u64)val);
-}
-EXPORT_SYMBOL(__ctzdi2);
-
-#else
-#error BITS_PER_LONG not 32 or 64
-#endif