Message ID | 20210929225850.3889950-1-ndesaulniers@google.com (mailing list archive)
---|---
State | New, archived
Series | modpost: add allow list for llvm IPSCCP
On Wed, Sep 29, 2021 at 3:59 PM Nick Desaulniers <ndesaulniers@google.com> wrote:
>
> +static const struct secref_exception secref_allowlist[] = {
> +	{ .fromsym = "__first_node", .tosym = "numa_nodes_parsed" },
> +	{ .fromsym = "__next_node", .tosym = "numa_nodes_parsed" },
> +	{ .fromsym = "__nodes_weight", .tosym = "numa_nodes_parsed" },
> +	{ .fromsym = "early_get_smp_config", .tosym = "x86_init" },
> +	{ .fromsym = "test_bit", .tosym = "numa_nodes_parsed" },
> +};

This list is basically made-up and random.

Why did those functions not get inlined? Wouldn't it be better to make
them always-inline?

Or, like in at least the early_get_smp_config() case, just make it be
marked __init, so that if it doesn't get inlined it gets the right
section?

It seems silly to add random source mappings to a checking program.

It was bad for the gcc constprop hack, but at least there it was a
clear case of "this inlining failed". This ad-hoc list has cases of
things that are clearly wrong in general ("test_bit()" must not use
initdata), and cases of "ok, the function just doesn't have the right
section marker".

(All of get_smp_config/early_get_smp_config/find_smp_config should be
__init, since they most definitely cannot work after __init time - but
why a compiler doesn't just inline them when they are one single
indirect call, I don't really get)

              Linus
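For reference, Linus's __init suggestion amounts to roughly the
following sketch (wrapper bodies paraphrased from
arch/x86/include/asm/mpspec.h; untested, and not the eventual fix):

/* Sketch, untested: assumed shape of the mpspec.h wrappers. With
 * __init, an uninlined copy is emitted into .init.text, so its
 * reference to the __init-data x86_init hook is no longer a
 * section mismatch. */
static inline void __init early_get_smp_config(void)
{
	x86_init.mpparse.get_smp_config(1);	/* one indirect call */
}

static inline void __init get_smp_config(void)
{
	x86_init.mpparse.get_smp_config(0);
}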
On Wed, Sep 29, 2021 at 4:28 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Wed, Sep 29, 2021 at 3:59 PM Nick Desaulniers
> <ndesaulniers@google.com> wrote:
> >
> > +static const struct secref_exception secref_allowlist[] = {
> > +	{ .fromsym = "__first_node", .tosym = "numa_nodes_parsed" },
> > +	{ .fromsym = "__next_node", .tosym = "numa_nodes_parsed" },
> > +	{ .fromsym = "__nodes_weight", .tosym = "numa_nodes_parsed" },
> > +	{ .fromsym = "early_get_smp_config", .tosym = "x86_init" },
> > +	{ .fromsym = "test_bit", .tosym = "numa_nodes_parsed" },
> > +};

Thanks for your feedback. This has been a long-standing issue with no
clear path forward; I was looking forward to your input.

> This list is basically made-up and random.

Definitely brittle. And it contains checks that are specific to
basically one set of configs for one arch. It sucks to pay that cost
for unaffected architectures.

> Why did those functions not get inlined?

$ make LLVM=1 -j72 allmodconfig
$ make LLVM=1 -j72 arch/x86/mm/amdtopology.o KCFLAGS=-Rpass-missed=inline
...
arch/x86/mm/amdtopology.c:110:7: remark: 'test_bit' not inlined into
'amd_numa_init' because too costly to inline (cost=115, threshold=45)
[-Rpass-missed=inline]
        if (node_isset(nodeid, numa_nodes_parsed)) {
            ^
arch/x86/mm/amdtopology.c:157:7: remark: '__nodes_weight' not inlined
into 'amd_numa_init' because too costly to inline (cost=60,
threshold=45) [-Rpass-missed=inline]
        if (!nodes_weight(numa_nodes_parsed))
            ^
arch/x86/mm/amdtopology.c:171:2: remark: 'early_get_smp_config' not
inlined into 'amd_numa_init' because too costly to inline (cost=85,
threshold=45) [-Rpass-missed=inline]
        early_get_smp_config();
        ^
arch/x86/mm/amdtopology.c:178:2: remark: '__first_node' not inlined
into 'amd_numa_init' because too costly to inline (cost=70,
threshold=45) [-Rpass-missed=inline]
        for_each_node_mask(i, numa_nodes_parsed)
        ^
arch/x86/mm/amdtopology.c:178:2: remark: '__next_node' not inlined
into 'amd_numa_init' because too costly to inline (cost=95,
threshold=45) [-Rpass-missed=inline]

I.e. for allmodconfig, the sanitizers add so much instrumentation to
the callees that they become too large to be considered profitable to
inline by the cost model. Note that LLVM's inliner works bottom up,
not top down.

Though for the defconfig case...somehow the cost is more than with the
sanitizers...

arch/x86/mm/amdtopology.c:157:7: remark: '__nodes_weight' not inlined
into 'amd_numa_init' because too costly to inline (cost=930,
threshold=45) [-Rpass-missed=inline]
        if (!nodes_weight(numa_nodes_parsed))
            ^

Looking at the output of `make LLVM=1 -j72
arch/x86/mm/amdtopology.ll`, @__nodes_weight is just some inline asm
(.altinstructions). I wonder if I need to teach the cost model about
`asm inline`...

For the allmodconfig build it looks like `__nodes_weight` calls
`__bitmap_weight` and the code coverage runtime hooks.

> Wouldn't it be better to make
> them always-inline?

Perhaps; see what that might look like:
https://github.com/ClangBuiltLinux/linux/issues/1302#issuecomment-807260475
Does that look better?

> Or, like in at least the early_get_smp_config() case, just make it be
> marked __init, so that if it doesn't get inlined it gets the right
> section?

In the case of early_get_smp_config(), that's what Boris suggested:
https://lore.kernel.org/lkml/20210225114533.GA380@zn.tnic/

> It seems silly to add random source mappings to a checking program.
> It was bad for the gcc constprop hack, but at least there it was a

Part of me feels like modpost not warning on those is permitting a
"memory leak," in so far as code that's only called from .init callers
is never reclaimed. Or leaving behind gadgets...

> clear case of "this inlining failed". This ad-hoc list has cases of
> things that are clearly wrong in general ("test_bit()" must not use
> initdata), and cases of "ok, the function just doesn't have the right
> section marker".

Sorry, what do you mean "test_bit() must not use initdata"? Because it
can lead to problems like this? Or...?

include/linux/nodemask.h has a comment that I'd bet predates that
modpost "Pattern 5" gcc constprop hack:
https://github.com/ClangBuiltLinux/linux/blob/83d09ad4b950651a95d37697f1493c00d888d0db/include/linux/nodemask.h#L119-L125

> (All of get_smp_config/early_get_smp_config/find_smp_config should be
> __init, since they most definitely cannot work after __init time - but
> why a compiler doesn't just inline them when they are one single
> indirect call, I don't really get)
>
>               Linus
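The issue comment linked above boils down to a change of roughly this
shape (sketch against include/linux/nodemask.h; helper signature
paraphrased, untested):

#define nodes_weight(nodemask) __nodes_weight(&(nodemask), MAX_NUMNODES)

/* Sketch: __always_inline overrides the cost model, so sanitizer
 * instrumentation inside bitmap_weight() can no longer leave this
 * tiny wrapper out of line in .text referencing .init.data. */
static __always_inline int __nodes_weight(const nodemask_t *srcp,
					  unsigned int nbits)
{
	return bitmap_weight(srcp->bits, nbits);
}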
On 30/09/2021 02.18, Nick Desaulniers wrote:
> On Wed, Sep 29, 2021 at 4:28 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
> Though for the defconfig case...somehow the cost is more than with the
> sanitizers...
>
> arch/x86/mm/amdtopology.c:157:7: remark: '__nodes_weight' not inlined
> into 'amd_numa_init' because too costly to inline (cost=930,
> threshold=45) [-Rpass-missed=inline]
>         if (!nodes_weight(numa_nodes_parsed))
>             ^
>
> Looking at the output of `make LLVM=1 -j72
> arch/x86/mm/amdtopology.ll`, @__nodes_weight is just some inline asm
> (.altinstructions). I wonder if I need to teach the cost model about
> `asm inline`...

Remind me, does clang understand 'asm inline("foo")'?

Regardless, it seems that the

  asm (ALTERNATIVE("call __sw_hweight32", ...
  asm (ALTERNATIVE("call __sw_hweight64", ...

in arch/x86/include/asm/arch_hweight.h could/should be made
asm_inline, at least for gcc's sake.

Somewhat related: I really think we should remove __cold from the
definition of __init: It hurts boot time (on a simple board with quite
reproducible boot timing I measured 1-3% some time ago), and it is
likely at least partially responsible for the never-ending tsunami of
functions-that-obviously-should-have-been-inlined(TM) that were not,
because the caller is being optimized for size. Whatever small cost in
extra .text there is gets reclaimed after init - and those who are
concerned about the size of the kernel image itself probably build
with CONFIG_OPTIMIZE_FOR_SIZE=y, and I see no change in such an image
whether __init includes __cold or not.

Rasmus
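Concretely, Rasmus's asm_inline suggestion would look roughly like
this (sketch, untested; the function body is paraphrased from
arch/x86/include/asm/arch_hweight.h, and the exact constraints may
differ):

/* Sketch: asm_inline tells the compiler to cost the asm as a single
 * instruction, instead of counting the bytes of the ALTERNATIVE()
 * string, so callers like __nodes_weight() look cheap to inline. */
static __always_inline unsigned int __arch_hweight32(unsigned int w)
{
	unsigned int res;

	asm_inline (ALTERNATIVE("call __sw_hweight32",
				"popcntl %1, %0", X86_FEATURE_POPCNT)
		    : "=" REG_OUT (res)
		    : REG_IN (w));

	return res;
}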
On Thu, Sep 30, 2021 at 9:19 AM Nick Desaulniers <ndesaulniers@google.com> wrote:
>
> On Wed, Sep 29, 2021 at 4:28 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
[...]
> > Or, like in at least the early_get_smp_config() case, just make it be
> > marked __init, so that if it doesn't get inlined it gets the right
> > section?
> In the case of early_get_smp_config(), that's what Boris suggested:
> https://lore.kernel.org/lkml/20210225114533.GA380@zn.tnic/

__init works, particularly for early_get_smp_config().

For static inline helpers that are called from both __init and
non-__init functions, maybe __ref will work. In my understanding, the
.ref.text section is not freed, but modpost bypasses the section
mismatch checks for it.

I am not sure which is a better approach for the generic cases:
__always_inline, __ref, or something else?

I am not a big fan of this patch, at least... (The reason was already
stated by Linus.)
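As an illustration, the __ref route Masahiro describes could look
roughly like this (hypothetical helper name, untested sketch):

/* Hypothetical sketch: a helper reached from both __init and
 * non-__init code. __ref places it in .ref.text, which modpost skips
 * when checking references into .init.data - at the cost that the
 * code is never freed, so the annotation needs a comment justifying
 * why the access cannot actually happen after init. */
static int __ref count_parsed_numa_nodes(void)
{
	/* numa_nodes_parsed lives in .init.data; callers must only
	 * reach this before free_initmem(). */
	return nodes_weight(numa_nodes_parsed);
}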
On Wed, Sep 29, 2021 at 5:19 PM Nick Desaulniers <ndesaulniers@google.com> wrote:
> ...
> arch/x86/mm/amdtopology.c:110:7: remark: 'test_bit' not inlined into
> 'amd_numa_init' because too costly to inline (cost=115, threshold=45)
> [-Rpass-missed=inline]
>         if (node_isset(nodeid, numa_nodes_parsed)) {

Yeah, I think that we should just do the __always_inline thing.

I'd rather have the stupid debug code overhead in the caller - which
may end up knowing what the pointer actually is, so that the debug
code goes away - than have "test_bit()" uninlined because there's so
much crazy debug code in it.

I also happen to believe that we have too much crazy "instrumentation" crap.

Why is that test_bit() word read so magical that it merits an
"instrument_atomic_read()"?

But I absolutely detest how KCSAN and some other tooling seems to get
a free pass on doing stupid things, just because they generated bad
warnings, so then they can freely generate these much more fundamental
problems, because the result is a f*cking mess.

> Though for the defconfig case...somehow the cost is more than with the
> sanitizers...

Maybe the solution is that if you have some of the crazy sanitizers,
we just say "the end result is not worth even checking". And stop
checking all the section mismatches, and all the stack size things.

Because it looks like this is more of a real issue:

> arch/x86/mm/amdtopology.c:157:7: remark: '__nodes_weight' not inlined
> into 'amd_numa_init' because too costly to inline (cost=930,
> threshold=45) [-Rpass-missed=inline]
>         if (!nodes_weight(numa_nodes_parsed))
>             ^

Hmm. That's just a "bitmap_weight()", and that function in turn is
__always_inline.

And the *reason* it is __always_inline is that it really wants to act
as a macro, and look at the second argument and do special things if
it is a small constant value.

And it looks like clang messes things up by simply not doing enough
simplification before inlining decisions, so it all looks very
complicated to clang, even though when you actually generate code, you
have one (of two) very simple code sequences.

> > Wouldn't it be better to make
> > them always-inline?
>
> Perhaps; see what that might look like:
> https://github.com/ClangBuiltLinux/linux/issues/1302#issuecomment-807260475
> Does that look better?

I suspect that in this case, because of clang deficiencies,
__always_inline actually is the right thing to do, at least on
__nodes_weight.

Looking at your comment lower down

  https://github.com/ClangBuiltLinux/linux/issues/1302#issuecomment-807757878

I really think this is a clang bug, and that you need to do certain
simplifications both before _and_ after inlining.

Before, because of the inlining cost decisions, particularly wrt
constant arguments.

After, because successful inlining changes things completely.

Marking __nodes_weight() __always_inline just works around clang being
broken in this regard.

It is _possible_ that it might help to make bitmap_weight() a macro
instead of an inline function, but it's a kind of sad state of affairs
if that is required.

And it might well fail - if you don't do the constant propagation
before making inlining decisions, you'll _still_ end up thinking that
bitmap_weight() is very costly because you don't do that
__builtin_constant_p() lowering.

And then you end up using the (much more expensive) generic function
instead of the cheap "look, for single words this is trivial" thing.
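For reference, the definition being discussed looks roughly like this
(paraphrased from include/linux/bitmap.h):

/* Acts like a macro: when nbits is a small compile-time constant the
 * whole call folds down to one hweight_long(); otherwise it falls
 * back to the out-of-line generic implementation. */
static __always_inline int bitmap_weight(const unsigned long *src,
					 unsigned int nbits)
{
	if (small_const_nbits(nbits))
		return hweight_long(*src & BITMAP_LAST_WORD_MASK(nbits));
	return __bitmap_weight(src, nbits);
}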
> Part of me feels like modpost not warning on those is permitting a
> "memory leak," in so far as code that's only called from .init callers
> is never reclaimed. Or leaving behind gadgets...

I think we can just treat modpost as a "good heuristic". If it catches
all the normal cases, it's fine - but it must not have false positives.

That's basically true of all warnings. False positive warnings make a
warning worthless. That's just *basic*.

So the gcc thing is a "ok, we know compilers mess this up if they do
partial inlining with constant propagation, so we will suppress what
is quite likely a false positive for that case".

That clang patch, in comparison? That's just a hack enumerating random
cases. There is no logic to it, and there is absolutely zero
maintainability. It will cause us to forever just add other random
cases to the list, making the whole tooling entirely pointless.

See the difference?

Maybe clang should just adopt the gcc naming convention, so that we
can just use the gcc heuristic.

> > clear case of "this inlining failed". This ad-hoc list has cases of
> > things that are clearly wrong in general ("test_bit()" must not use
> > initdata), and cases of "ok, the function just doesn't have the right
> > section marker".
>
> Sorry, what do you mean "test_bit() must not use initdata"? Because it
> can lead to problems like this? Or...?

No, I mean that it is completely unacceptable to add some crazy rule
like "you can access this init-data from any context, as long as you
use test_bit to do so".

That's basically what your rule does. And it's a FUNDAMENTALLY invalid
rule. It's simply not true. The rule is invalid, it's just that clang
has made such a mess of it that in one particular case it happens to
be true.

The gcc "rule" is much more reasonable: it's *not* saying "it's ok to
access this init-data from test_bit". The gcc rule says "we know gcc
messes up our heuristics when out-of-lining with constprop, so we just
won't warn, because false positives are bad, bad, bad".

One rule is fundamentally garbage and wrong. The other rule is a
generic "we know this situation cannot be tested for".

Very different.

                 Linus
On Thu, Sep 30, 2021 at 11:54 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> Hmm. That's just a "bitmap_weight()", and that function in turn is
> __always_inline.
>
> And the *reason* it is __always_inline is that it really wants to act
> as a macro, and look at the second argument and do special things if
> it is a small constant value.

Looking around, it's not the only one. A lot of the bitmap functions
do that, but it looks like we're missing a few __always_inline cases.

I wonder if we should have a macro to generate those "do X or Y
depending on small_const_nbits()" functions - and have it generate
__always_inline functions.

Of course, some of those functions have more complex "check at build
time" cases, like that bitmap_clear/set() thing that has a special
case for when it just turns into "memset()".

We have a lot of these kinds of situations where we have a "generic"
function that specializes itself based on arguments. And yes, they are
often recursive, so that you need more than one level of inlining to
actually determine what the arguments are.

I don't know if we might have some way to mark these (and detect the
cases where they don't get inlined and we lose the very basic
optimizations).

It's kind of similar to the _Generic() thing that does different
things based on static types, it's just that it does it based on
argument ranges.

              Linus
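One possible shape for such a generator macro - a hypothetical,
untested sketch; the macro name and argument list are made up for
illustration:

/* Sketch: emit the "small constant nbits vs. generic helper" dispatch
 * as an __always_inline function, so the specialization cannot be
 * lost to a cost-model decision. */
#define DEFINE_SMALL_CONST_NBITS_OP(name, small_expr, generic_fn)	\
static __always_inline int name(const unsigned long *src,		\
				unsigned int nbits)			\
{									\
	if (small_const_nbits(nbits))					\
		return (small_expr);					\
	return generic_fn(src, nbits);					\
}

/* Hypothetical usage:
 * DEFINE_SMALL_CONST_NBITS_OP(bitmap_weight,
 *	hweight_long(*src & BITMAP_LAST_WORD_MASK(nbits)),
 *	__bitmap_weight);
 */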
On Wed, Sep 29, 2021 at 05:18:49PM -0700, Nick Desaulniers wrote:
> On Wed, Sep 29, 2021 at 4:28 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Or, like in at least the early_get_smp_config() case, just make it be
> > marked __init, so that if it doesn't get inlined it gets the right
> > section?
>
> In the case of early_get_smp_config(), that's what Boris suggested:
> https://lore.kernel.org/lkml/20210225114533.GA380@zn.tnic/

Sorry, I misremembered the above thread. That's what *Arnd* had suggested.
On Thu, Sep 30, 2021 at 11:54 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Wed, Sep 29, 2021 at 5:19 PM Nick Desaulniers
> <ndesaulniers@google.com> wrote:
[...]
> Looking at your comment lower down
>
>   https://github.com/ClangBuiltLinux/linux/issues/1302#issuecomment-807757878
>
> I really think this is a clang bug, and that you need to do certain
> simplifications both before _and_ after inlining.
>
> Before, because of the inlining cost decisions, particularly wrt
> constant arguments.
>
> After, because successful inlining changes things completely.

I made that comment because our LowerConstantIntrinsics pass was
simplifying our internal representation of __builtin_constant_p and
__builtin_object_size in the same pass, after inlining.

For the case of (defconfig) __nodes_weight not being inlined into
amd_numa_init, it's because hweight_long is a huge block of code all
predicated on __builtin_constant_p (BCP). Because we evaluate BCP
AFTER inlining - which is very much necessary for the semantics of BCP
- we don't eliminate that entire block before costing the callee.
Playing with the cost model though... we basically have the pattern:

void caller (void) { callee(42); }
void callee (int x) {
    if (__builtin_constant_p(x)) {
        // M "instructions"
    } else {
        // N "instructions"
    }
}

The current cost model says that the cost of inlining callee into
caller is ~ M + N + C. If we knew that BCP would fold away based on
inlining, and which way, we might be able to consider the cost just
M + C or N + C.

In the case of "(defconfig) __nodes_weight not being inlined into
amd_numa_init", M >> N, and yet after inlining BCP evaluates to false
(so the estimated cost was M + N but in actuality was closer to N).
(See how big __const_hweight64 is before BCP evaluation; in this case
callee is the analog of hweight64.)

I guess if the argument to BCP is a parameter of callee (brittle),
then we perhaps should be able to figure out whether BCP would
evaluate to true or false, then properly select M or N. Ok, let me see
if I can build that into the cost model... and see if that actually
helps any of these cases...

Though I do wonder what happens when there's more than one level
here... i.e.

void caller (void) { mid(42); }
void mid (int x) { callee(x); }
void callee (int x) { if (__builtin_constant_p(x)) /* M */; else /* N */; }

or

void caller0 (void) { mid(42); }  // BCP would be true
void caller1 (int x) { mid(x); }  // BCP would be false
void mid (int x) { callee(x); }   // looking just at the call site, BCP would be false
void callee (int x) { if (__builtin_constant_p(x)) /* M */; else /* N */; }

(Sorry for the brain dump; this is more for me than for you. Good chat!)
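For reference, the hweight pattern Nick describes is roughly this
(paraphrased from include/asm-generic/bitops/const_hweight.h and
include/linux/bitops.h); the constant branch is the analog of M, the
lone call the analog of N:

/* Paraphrased: the "M" branch is __const_hweight64()'s large
 * constant-folding expression tree; the "N" branch is a single call.
 * Neither side can be dropped until __builtin_constant_p() is folded,
 * which happens only after inlining. */
#define hweight64(w) \
	(__builtin_constant_p(w) ? __const_hweight64(w) : __arch_hweight64(w))

static inline unsigned long hweight_long(unsigned long w)
{
	return sizeof(w) == 4 ? hweight32(w) : hweight64(w);
}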
diff --git a/scripts/mod/modpost.c b/scripts/mod/modpost.c
index cb8ab7d91d30..c3d0395315ef 100644
--- a/scripts/mod/modpost.c
+++ b/scripts/mod/modpost.c
@@ -865,6 +865,29 @@ static int match(const char *sym, const char * const pat[])
 	return 0;
 }
 
+struct secref_exception {
+	const char * const fromsym, * const tosym;
+};
+
+static const struct secref_exception secref_allowlist[] = {
+	{ .fromsym = "__first_node", .tosym = "numa_nodes_parsed" },
+	{ .fromsym = "__next_node", .tosym = "numa_nodes_parsed" },
+	{ .fromsym = "__nodes_weight", .tosym = "numa_nodes_parsed" },
+	{ .fromsym = "early_get_smp_config", .tosym = "x86_init" },
+	{ .fromsym = "test_bit", .tosym = "numa_nodes_parsed" },
+};
+
+static int match_allowlist(const char *fromsym, const char *tosym)
+{
+	int i = 0, e = ARRAY_SIZE(secref_allowlist);
+
+	for (; i != e; ++i)
+		if (!strcmp(secref_allowlist[i].fromsym, fromsym) &&
+		    !strcmp(secref_allowlist[i].tosym, tosym))
+			return 1;
+	return 0;
+}
+
 /* sections that we do not want to do full section mismatch check on */
 static const char *const section_white_list[] =
 {
@@ -1204,6 +1227,8 @@ static const struct sectioncheck *section_mismatch(
  *   tosec    = init section
  *   fromsec  = text section
  *   refsymname = *.constprop.*
+ *   LLVM will do similar constant propagation, but it will not rename the
+ *   transformed callee.
  *
  * Pattern 6:
  *   Hide section mismatch warnings for ELF local symbols.  The goal
@@ -1216,7 +1241,8 @@ static const struct sectioncheck *section_mismatch(
  **/
 static int secref_whitelist(const struct sectioncheck *mismatch,
 			    const char *fromsec, const char *fromsym,
-			    const char *tosec, const char *tosym)
+			    const char *tosec, const char *tosym,
+			    _Bool isclang)
 {
 	/* Check for pattern 1 */
 	if (match(tosec, init_data_sections) &&
@@ -1247,9 +1273,10 @@ static int secref_whitelist(const struct sectioncheck *mismatch,
 
 	/* Check for pattern 5 */
 	if (match(fromsec, text_sections) &&
-	    match(tosec, init_sections) &&
-	    match(fromsym, optim_symbols))
-		return 0;
+	    match(tosec, init_sections))
+		if (match(fromsym, optim_symbols) ||
+		    (isclang && match_allowlist(fromsym, tosym)))
+			return 0;
 
 	/* Check for pattern 6 */
 	if (strstarts(fromsym, ".L"))
@@ -1573,6 +1600,21 @@ static void report_sec_mismatch(const char *modname,
 	fprintf(stderr, "\n");
 }
 
+static _Bool is_clang(struct elf_info *elf)
+{
+	Elf_Sym *sym;
+
+	for (sym = elf->symtab_start; sym < elf->symtab_stop; sym++) {
+		if (is_shndx_special(sym->st_shndx))
+			continue;
+		if (strcmp(sec_name(elf, get_secindex(elf, sym)), ".comment") != 0)
+			continue;
+		return strstr(sym_get_data(elf, sym), "clang") != NULL;
+	}
+
+	return false;
+}
+
 static void default_mismatch_handler(const char *modname, struct elf_info *elf,
 				     const struct sectioncheck* const mismatch,
 				     Elf_Rela *r, Elf_Sym *sym, const char *fromsec)
@@ -1582,6 +1624,7 @@ static void default_mismatch_handler(const char *modname, struct elf_info *elf,
 	Elf_Sym *from;
 	const char *tosym;
 	const char *fromsym;
+	_Bool isclang;
 
 	from = find_elf_symbol2(elf, r->r_offset, fromsec);
 	fromsym = sym_name(elf, from);
@@ -1592,10 +1635,11 @@ static void default_mismatch_handler(const char *modname, struct elf_info *elf,
 	tosec = sec_name(elf, get_secindex(elf, sym));
 	to = find_elf_symbol(elf, r->r_addend, sym);
 	tosym = sym_name(elf, to);
+	isclang = is_clang(elf);
 
 	/* check whitelist - we may ignore it */
-	if (secref_whitelist(mismatch,
-			     fromsec, fromsym, tosec, tosym)) {
+	if (secref_whitelist(mismatch, fromsec, fromsym, tosec, tosym,
+			     isclang)) {
 		report_sec_mismatch(modname, mismatch,
 				    fromsec, r->r_offset, fromsym,
 				    is_function(from), tosec, tosym,
diff --git a/scripts/mod/modpost.h b/scripts/mod/modpost.h
index 0c47ff95c0e2..d8afc912fd92 100644
--- a/scripts/mod/modpost.h
+++ b/scripts/mod/modpost.h
@@ -214,3 +214,5 @@ void modpost_log(enum loglevel loglevel, const char *fmt, ...);
 #define warn(fmt, args...)	modpost_log(LOG_WARN, fmt, ##args)
 #define error(fmt, args...)	modpost_log(LOG_ERROR, fmt, ##args)
 #define fatal(fmt, args...)	modpost_log(LOG_FATAL, fmt, ##args)
+
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
Modpost validation of calls from .text to .init sections relies on
GCC renaming constant-propagation-specialized functions to have
.constprop.* suffixes. See the comment on Pattern 5 in
scripts/mod/modpost.c:

  GCC may optimize static inlines when fed constant arg(s) resulting
  in functions like cpumask_empty() -- generating an associated symbol
  cpumask_empty.constprop.3 that appears in the audit. If the const
  that is passed in comes from __init, like say nmi_ipi_mask, we get a
  meaningless section warning.

LLVM does similar optimizations (inter-procedural sparse conditional
constant propagation; IPSCCP), but doesn't rename the specialized
functions, so we still observe modpost warnings.

Add checks to modpost that test whether the .comment section contains
the string "clang" (i.e. was the object file built with clang?), and
if so additionally check an allow list.

Fixes the following modpost warnings observed on clang-13+:

allmodconfig:

WARNING: modpost: vmlinux.o(.text+0x*): Section mismatch in reference
from the function
  test_bit() to the variable .init.data:numa_nodes_parsed
  __first_node() to the variable .init.data:numa_nodes_parsed
  __next_node() to the variable .init.data:numa_nodes_parsed
  __nodes_weight() to the variable .init.data:numa_nodes_parsed
  early_get_smp_config() to the variable .init.data:x86_init

defconfig:

  __nodes_weight() to the variable .init.data:numa_nodes_parsed

Link: https://github.com/ClangBuiltLinux/linux/issues/1302
Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
---
 scripts/mod/modpost.c | 56 ++++++++++++++++++++++++++++++++++++++-----
 scripts/mod/modpost.h |  2 ++
 2 files changed, 52 insertions(+), 6 deletions(-)