[GIT,PULL] bitmap changes for v6.2-rc1

Message ID Y5uprmSmSfYechX2@yury-laptop (mailing list archive)
State Not Applicable
Delegated to: Netdev Maintainers

Pull-request

git@github.com:/norov/linux.git tags/bitmap-6.2-rc1

Checks

Context Check Description
netdev/tree_selection success Pull request for net
netdev/apply fail Pull to net failed

Message

Yury Norov Dec. 15, 2022, 11:11 p.m. UTC
The following changes since commit 76dcd734eca23168cb008912c0f69ff408905235:

  Linux 6.1-rc8 (2022-12-04 14:48:12 -0800)

are available in the Git repository at:

  git@github.com:/norov/linux.git tags/bitmap-6.2-rc1

for you to fetch changes up to 2386459394d2a46964829a00c48a08a23ead94ed:

  lib/cpumask: update comment for cpumask_local_spread() (2022-12-15 14:44:43 -0800)

----------------------------------------------------------------
Hi Linus,

Please pull bitmap patches for v6.2. They have spent more than a week
in -next without any issues. The branch consists of:

- optimize small_const path for find_next_bit() and friends (me);

  Introduces small_const_nbits_off() and uses it in find_next_bit()-like
  functions to allow static optimization when all of the bits to search
  lie within a single word, even if it is not the first word.

- cpumask: improve on cpumask_local_spread() locality (me):

  The series makes cpumask_local_spread() pick the Nth CPU based on NUMA
  hop distances, which is better than the current behavior of picking
  CPUs from a given node, and from a random node if that node doesn't
  have N CPUs.

- sched, net: NUMA-aware CPU spreading interface (Valentin Schneider,
  Tariq Toukan):

  Iterators for NUMA-aware CPU traversal: a better alternative to
  cpumask_local_spread() when all CPUs need to be enumerated.

Thanks,
Yury

----------------------------------------------------------------
Tariq Toukan (1):
      net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints

Valentin Schneider (2):
      sched/topology: Introduce sched_numa_hop_mask()
      sched/topology: Introduce for_each_numa_hop_mask()

Yury Norov (9):
      bitmap: switch from inline to __always_inline
      bitmap: improve small_const case for find_next() functions
      bitmap: add tests for find_next_bit()
      lib/find: introduce find_nth_and_andnot_bit
      cpumask: introduce cpumask_nth_and_andnot
      sched: add sched_numa_find_nth_cpu()
      cpumask: improve on cpumask_local_spread() locality
      lib/cpumask: reorganize cpumask_local_spread() logic
      lib/cpumask: update comment for cpumask_local_spread()

 drivers/net/ethernet/mellanox/mlx5/core/eq.c |  18 ++-
 include/asm-generic/bitsperlong.h            |  12 ++
 include/linux/bitmap.h                       |  46 ++++----
 include/linux/cpumask.h                      | 164 +++++++++++++++------------
 include/linux/find.h                         | 118 +++++++++++--------
 include/linux/nodemask.h                     |  86 +++++++-------
 include/linux/topology.h                     |  33 ++++++
 kernel/sched/topology.c                      |  90 +++++++++++++++
 lib/cpumask.c                                |  52 +++++----
 lib/find_bit.c                               |   9 ++
 lib/test_bitmap.c                            |  23 +++-
 11 files changed, 438 insertions(+), 213 deletions(-)

Comments

Linus Torvalds Dec. 23, 2022, 6:44 p.m. UTC | #1
On Thu, Dec 15, 2022 at 3:11 PM Yury Norov <yury.norov@gmail.com> wrote:
>
> Please pull bitmap patches for v6.2. They have spent more than a week
> in -next without any issues. The branch consists of:

So I've been holding off on this because these bitmap pulls have
always scared me, and I wanted to have the time to actually look
through them in detail before pulling.

I'm back home, over the travel chaos, and while I have other pulls
pending, they seem to be benign fixes so I started looking at this.

And when looking at it, I did indeed find what I think is a
fundamental arithmetic bug.

That small_const_nbits_off() is simply buggy.

Try this:

        small_const_nbits_off(64,-1);

and see it return true.

The thing is, doing divides in C is something you need to be very
careful about, because they do *not* behave nicely with signed values.
They do the "mathematically intuitive" thing of truncating towards
zero, but when you are talking about bit ranges like this, that is
*NOT* what you want.

The comment is fairly good, but it just doesn't match the code.

If you want to check that 'nbits-1' and 'off' are in the same word,
you simply should not use division at all. Sure, it would work, if you
verify that they are both non-negative, but the fact is, even then
it's just wrong, wrong, wrong. You want a shift operation or a masking
operation.

Honestly, in this case, I think the logical thing to do is "check that
the upper bits are the same". The way you do that is probably
something like

   !((off) ^ ((nbits)-1) & ~(BITS_PER_LONG-1))

which doesn't have the fundamental bug that the divide version has.

So anyway, I won't be pulling this. I appreciate that you are trying
to micro-optimize the bitop functions, but this is not the first time
your pull requests have made me worried and caused me to not pull.
This is such basic functionality that I really need these pull
requests to be less worrisome.

In other words, not only don't I want to see these kinds of bugs -
please make sure that there isn't anything that looks even *remotely*
scary. The micro-optimizations had better be *so* obvious and so well
thought through that I not only don't find bugs in them, I don't even
get nervous about finding bugs.

Side note: that whole small_const_nbits_off() thing could be possibly
improved in other ways. It's actually unnecessarily strict (if it was
correct, that is): we don't really require 'off' to be constant, what
we *do* want is

 - "nbits > off" needs to be a compile-time true constant

 - that "off is in the same word as nbits-1" also needs to be a
compile-time true constant.

and the compiler could determine both of those to be the case even if
'off' itself isn't a constant, if there is code around it that
verifies it.

For example, if you have code like this:

        off = a & 7;
        n = find_next_bit(bitmap, 64, off);

then the compiler could still see that it's that "last word only"
case, even though 'off' isn't actually a constant value.

So you could try using a helper macro like

   #define COMPILE_TIME_TRUE(x) (__builtin_constant_p(x) && (x))

to handle those kinds of things. But again - the code needs to be
OBVIOUSLY CORRECT. Make me feel those warm and fuzzies about it
because it's so obviously safe and correct.

That said, I see your test-cases for your micro-optimizations, but I'd
also like to see references to where it actually improves real code.

I know for a fact that 'small_const_nbits()' actually triggers in real
life. I don't see that your 'small_const_nbits_off()' would trigger in
real life in any situation that isn't already covered by
'small_const_nbits()'.

So convince me not only that the optimizations are obviously correct,
but also that they actually matter.

                Linus
Linus Torvalds Dec. 23, 2022, 7:07 p.m. UTC | #2
On Fri, Dec 23, 2022 at 10:44 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Honestly, in this case, I think the logical thing to do is "check that
> the upper bits are the same". The way you do that is probably
> something like
>
>    !((off) ^ ((nbits)-1) & ~(BITS_PER_LONG-1))

Note that while the above is probably correct (but you always need to
double-check my emailed "something like this" code - I literally write
it in the MUA, and I make mistakes too), I'd never want to see that as
part of one big complex macro.

In fact, I think I am missing a set of parentheses, because '&' has a
higher precedence than '^', so the above is actually buggy.

So I'd much rather see something like this

  #define COMPILE_TIME_TRUE(x) (__builtin_constant_p(x) && (x))

  #define bits_in_same_word(x,y) \
        (!(((x)^(y))&~(BITS_PER_LONG-1)))

  #define bitmap_off_in_last_word(nbits,off) \
        bits_in_same_word((nbits)-1,off)

  #define small_const_nbits_off(nbits, off) \
        (__builtin_constant_p(nbits) && (nbits) > 0 && \
         COMPILE_TIME_TRUE(bitmap_off_in_last_word(nbits,off)))

where each step does one thing and one thing only, and you don't have
one complicated thing that is hard to read.

And again, don't take my word blindly for the above.  I *think* the
above may be correct, but there's a "think" and a "may" there.

Plus I'd still like to hear about where the above would actually
matter and make a code generation difference in real life (compared to
just the simple "optimize the single-word bitmap" case).

                Linus
Yury Norov Jan. 10, 2023, 7:24 a.m. UTC | #3
On Fri, Dec 23, 2022 at 10:44:07AM -0800, Linus Torvalds wrote:
> On Thu, Dec 15, 2022 at 3:11 PM Yury Norov <yury.norov@gmail.com> wrote:
> >
> > Please pull bitmap patches for v6.2. They have spent more than a week
> > in -next without any issues. The branch consists of:
> 
> So I've been holding off on this because these bitmap pulls have
> always scared me, and I wanted to have the time to actually look
> through them in detail before pulling.
> 
> I'm back home, over the travel chaos, and while I have other pulls
> pending, they seem to be benign fixes so I started looking at this.
> 
> And when looking at it, I did indeed find what I think is a
> fundamental arithmetic bug.
> 
> That small_const_nbits_off() is simply buggy.
> 
> Try this:
> 
>         small_const_nbits_off(64,-1);
> 
> and see it return true.

Hi Linus,

Sorry for the delayed reply.

small_const_nbits{_off} is used only for bitmap and find_bit functions
where both offset and size are unsigned types. -1 there will turn into
UINT_MAX or ULONG_MAX, and small_const_nbits_off(64, -1) returns false.

The bitops.h functions (set_bit et al.) also use an unsigned type for nr.
Negative offsets are simply not used in bit operations, at the most basic level.

Notice that the '(nbits) > 0' part in small_const_nbits() is there to
exclude 0, not negative numbers.

So, support for negative offset/size looks irrelevant for all existing
users of small_const(). I doubt there'll be new non-bitmap users for
the macro anytime soon.

small_const_nbits() and proposed small_const_nbits_off() are in fact
very bitmap-related macros. 'Small' in this context refers to a
single-word bitmap. 0 is definitely a small and constant number, but
small_const_nbits(0) will return false - only because inline versions
of bitmap functions don't handle 0 correctly.

I think small_const_nbits() is confusing because it sounds too generic
and is located in a very generic place. There's a historical reason for
that.

Originally, the macro was hosted in include/linux/bitmap.h and at that
time find.h was in include/asm-generic/bitops/. In commit 586eaebea5988
(lib: extend the scope of small_const_nbits() macro) I moved the macro
to include/asm-generic/bitsperlong.h to optimize find_bit functions too.

After that, working on other parts I found that having bitmap.h and
find.h in different include paths is a permanent headache due to things
like circular dependencies, and moved find.h to include/linux, where it
should be. And even made it an internal header for bitmap.h. But didn't
move small_const_nbits(). Looks like I have to move it to somewhere in
include/linux/bitops.h.

[...]
 
> So convince me not only that the optimizations are obviously correct,
> but also that they actually matter.

There are no existing users for small_const_nbits_off(). I've been
reworking bitmap_find_free_region(), added a pile of tests, and
found that some quite trivial cases are not inlined, for example
find_next_bit(addr, 128, 124).

Let's ignore this patch until we have real users.


Regarding the rest of the series: can you please take a look? It includes
an optimization for CPU allocation. With quite a simple rework of
cpumask_local_spread(), we gain a measurable and significant
improvement for many subsystems on NUMA machines.

Tariq measured impact of NUMA-based locality on networking in his
environment, and from his commit message:

    Performance tests:
    
    TCP multi-stream, using 16 iperf3 instances pinned to 16 cores (with aRFS on).
    Active cores: 64,65,72,73,80,81,88,89,96,97,104,105,112,113,120,121
    
    +-------------------------+-----------+------------------+------------------+
    |                         | BW (Gbps) | TX side CPU util | RX side CPU util |
    +-------------------------+-----------+------------------+------------------+
    | Baseline                | 52.3      | 6.4 %            | 17.9 %           |
    +-------------------------+-----------+------------------+------------------+
    | Applied on TX side only | 52.6      | 5.2 %            | 18.5 %           |
    +-------------------------+-----------+------------------+------------------+
    | Applied on RX side only | 94.9      | 11.9 %           | 27.2 %           |
    +-------------------------+-----------+------------------+------------------+
    | Applied on both sides   | 95.1      | 8.4 %            | 27.3 %           |
    +-------------------------+-----------+------------------+------------------+
    
    The RX-side bottleneck is released, reaching line rate (~1.8x speedup).
    ~30% less CPU util on TX.

We'd really like to have this work in next kernel release.

Thanks,
Yury