[0/9,RFC] Make wake_up_{bit,var} less fragile

Message ID	20240819053605.11706-1-neilb@suse.de (mailing list archive)
Headers	show Received: from smtp-out2.suse.de (smtp-out2.suse.de [195.135.223.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id DBFE414A4F9; Mon, 19 Aug 2024 05:36:37 +0000 (UTC) From: NeilBrown <neilb@suse.de> To: Ingo Molnar <mingo@redhat.com>, Peter Zijlstra <peterz@infradead.org>, Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org Subject: [PATCH 0/9 RFC] Make wake_up_{bit,var} less fragile Date: Mon, 19 Aug 2024 15:20:34 +1000 Message-ID: <20240819053605.11706-1-neilb@suse.de> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	Make wake_up_{bit,var} less fragile \| expand [0/9,RFC] Make wake_up_{bit,var} less fragile [1/9] i915: remove wake_up on I915_RESET_MODESET. [2/9] Introduce atomic_dec_and_wake_up_var(). [3/9] XFS: use wait_var_event() when waiting of i_pincount. [4/9] Use wait_var_event() instead of I_DIO_WAKEUP [5/9] Block: switch bd_prepare_to_claim to use ___wait_var_event() [6/9] block/pktdvd: switch congestion waiting to ___wait_var_event() [7/9] Improve and expand wake_up_bit() interface. [8/9] Improve and extend wake_up_var() interface. [9/9] Use clear_and_wake_up_bit() where appropriate.

NeilBrown Aug. 19, 2024, 5:20 a.m. UTC

I wasn't really sure who to send this too, and get_maintainer.pl
suggested 132 addresses which seemed excessive.  So I've focussed on
'sched' maintainers.  I'll probably submit individual patches to
relevant maintainers/lists if I get positive feedback at this level.

This series was motivated by 

   Commit ed0172af5d6f ("SUNRPC: Fix a race to wake a sync task")

which adds smp_mb__after_atomic().  I thought "any API that requires that
sort of thing needs to be fixed".

The main patches here are 7 and 8 which revise wake_up_bit and
wake_up_var respectively.  They result in 3 interfaces:
  wake_up_{bit,var}           includes smp_mb__after_atomic()
  wake_up_{bit,var}_relaxed() doesn't have a barrier
  wake_up_{bit,var}_mb()      includes smb_mb().

I think this set of interfaces should be easier to use correctly.  They
are also now documented more clearly.

The preceeding patches clean up various places where the exiting
interfaces weren't used optimally.  The final patch uses
clear_and_wake_up_bit() more widely because it seems like a good idea.

I have three questions:

1/ is my understanding of the needed barriers correct.
 i.e:
   smp_mb__after_atomic() needed after a clear_bit() to atomic_set() 
     or similar, or a change inside a locked region
   smb_mb() needed after any non-locked update
   nothing needed after test_and_clear_bit() or atomic_dec_and_test()
     or similar.
   (I realised while working on this that my previous understanding
    of the barrier requires was wrong, so maybe it still is).

2/ How should we handle the "flag day" where a barrier is added to
   wake_up_bit() and wake_up_var().  Some options are:
   a/ have a big patch for the flag-day as this series does
   b/ add the barrier in a new wake_up_atomic_{bit,var} and deprecate
      wake_up_{bit,var}
   c/ don't worry about the fact that there will be an extra barrier for
      a while - just make the change to wake_up_xxx() first, then submit
      individual patches to remove barriers as needed.

3/ Who else should I ask to remove this at this high level?

Thanks,
NeilBrown


 [PATCH 1/9] i915: remove wake_up on I915_RESET_MODESET.
 [PATCH 2/9] Introduce atomic_dec_and_wake_up_var().
 [PATCH 3/9] XFS: use wait_var_event() when waiting of i_pincount.
 [PATCH 4/9] Use wait_var_event() instead of I_DIO_WAKEUP
 [PATCH 5/9] Block: switch bd_prepare_to_claim to use
 [PATCH 6/9] block/pktdvd: switch congestion waiting to
 [PATCH 7/9] Improve and expand wake_up_bit() interface.
 [PATCH 8/9] Improve and extend wake_up_var() interface.
 [PATCH 9/9] Use clear_and_wake_up_bit() where appropriate.

Linus Torvalds Aug. 19, 2024, 6:13 a.m. UTC | #1

On Sun, 18 Aug 2024 at 22:36, NeilBrown <neilb@suse.de> wrote:
>
> The main patches here are 7 and 8 which revise wake_up_bit and
> wake_up_var respectively.  They result in 3 interfaces:
>   wake_up_{bit,var}           includes smp_mb__after_atomic()

I actually think this is even worse than the current model, in that
now it subtle only works after atomic ops, and it's not obvious from
the name.

At least the current model, correct code looks like

      do_some_atomic_op
      smp_mb__after_atomic()
      wake_up_{bit,var}

and the smp_mb__after_atomic() makes sense and pairs with the atomic.
So the current one may be complex, but at the same time it's also
explicit. Your changed interface is still complex, but now it's even
less obvious what is actually going on.

With your suggested interface, a plain "wake_up_{bit,var}" only works
after atomic ops, and other ops have to magically know that they
should use the _mb() version or whatever. And somebody who doesn't
understand that subtlety, and copies the code (but changes the op from
an atomic one to something else) now introduces code that looks fine,
but is really subtly wrong.

The reason for the barrier is for the serialization with the
waitqueue_active() check. Honestly, if you worry about correctness
here, I think you should leave the existing wake_up_{bit,var}() alone,
and concentrate on having helpers that do the whole "set and wake up".

IOW, I do not think you should change existing semantics, but *this*
kind of pattern:

>  [PATCH 2/9] Introduce atomic_dec_and_wake_up_var().
>  [PATCH 9/9] Use clear_and_wake_up_bit() where appropriate.

sounds like a good idea.

IOW, once you have a whole "atomic_dec_and_wake_up()" (skip the "_var"
- it's implied by the fact that it's an atomic_dec), *then* that
function makes for a simple-to-use model, and now the "atomic_dec(),
the smp_mb__after_atomic(), and the wake_up_var()" are all together.

For all the same reasons, it makes total sense to have
"clear_bit_and_wake()" etc.

But exposing those "three different memory barrier scenarios" as three
different helpers is the *opposite* of helpful. It keeps the current
complexity, and makes it worse by making the barrier rules even more
opaque, imho.

               Linus

Christian Brauner Aug. 19, 2024, 8:16 a.m. UTC | #2

On Mon, Aug 19, 2024 at 03:20:34PM GMT, NeilBrown wrote:
> I wasn't really sure who to send this too, and get_maintainer.pl
> suggested 132 addresses which seemed excessive.  So I've focussed on
> 'sched' maintainers.  I'll probably submit individual patches to
> relevant maintainers/lists if I get positive feedback at this level.
> 
> This series was motivated by 
> 
>    Commit ed0172af5d6f ("SUNRPC: Fix a race to wake a sync task")
> 
> which adds smp_mb__after_atomic().  I thought "any API that requires that
> sort of thing needs to be fixed".
> 
> The main patches here are 7 and 8 which revise wake_up_bit and
> wake_up_var respectively.  They result in 3 interfaces:
>   wake_up_{bit,var}           includes smp_mb__after_atomic()
>   wake_up_{bit,var}_relaxed() doesn't have a barrier
>   wake_up_{bit,var}_mb()      includes smb_mb().

It's great that this api is brought up because it gives me a reason to
ask a stupid question I've had for a while.

I want to change the i_state member in struct inode from an unsigned
long to a u32 because really we're wasting 4 bytes on 64 bit that we're
never going to use given how little I_* flags we actually have and I
dislike that we use that vacuous type in a bunch of our structures for
that reason.

(Together with another 4 byte shrinkage we would get a whopping 8 bytes
back.)

The problem is that we currently use wait_on_bit() and wake_up_bit() in
various places on i_state and all of these functions require an unsigned
long (probably because some architectures only handle atomic ops on
unsigned long).

I have an experimental patch converting all of that from wait_on_bit()
to wait_var_event() but I was hesitant about it because iiuc the
semantics don't nicely translate into each other. Specifically, if some
code uses wait_on_bit(SOME_BIT) and someone calls
wake_up_bit(SOME_OTHER_BIT) then iiuc only SOME_OTHER_BIT waiters will
be woken. IOW, SOME_BIT waiters are unaffected.

But if this is switched to wait_var_event() and wake_up_var() and there
are two waiters e.g.,

W1: wait_var_event(inode->i_state, !(inode->i_state & SOME_BIT))
W2: wait_var_event(inode->i_state, !(inode->i_state & SOME_OTHER_BIT))

then a waker like

inode->i_state &= ~SOME_OTHER_BIT;
// insert appropriate barrier
wake_up_var(inode->i_state)

will cause both W1 and W2 to be woken but W1 is put back to sleep? Is
there a nicer way to do this? The only thing I want is a (pony) 32bit
bit-wait-like mechanism.

Linus Torvalds Aug. 19, 2024, 5:45 p.m. UTC | #3

On Mon, 19 Aug 2024 at 01:16, Christian Brauner <brauner@kernel.org> wrote:
>
> The problem is that we currently use wait_on_bit() and wake_up_bit() in
> various places on i_state and all of these functions require an unsigned
> long (probably because some architectures only handle atomic ops on
> unsigned long).

It's actually mostly because of endianness, not atomicity.

The whole "flags in one single unsigned long" is a very traditional
Linux pattern. Even originally, when "unsigned", "unsigned int", and
"unsigned long" were all the same 32-bit thing, the pattern for the
kernel was "unsigned long" for the native architecture accesses.

It may not be a *great* pattern, and arguably it should always have
been "flags in a single unsigned int" (which was the same thing back
in the days). But hey, hindsight is 20:20.

[ And I say "arguably", not "obviously".

  The kernel basically takes the approach that "unsigned long" is the
size of GP registers, and so "array of unsigned long" is in some
respect fundamentally more efficient than "array of unsigned int",
because you can very naturally do operations in bigger chunks.

  So bitops working on some more architecture-neutral size like just
bytes or "u32" or "unsigned int" would have advantages, but "unsigned
long" in many ways is also a real advantage, and can have serious
alignment advantages for the simple cases ]

And it turns out byte order matters. Because you absolutely want to be
able to mix things like simple initializers, ie

     unsigned long flags = 1ul << BITPOS;
     ...
     clear_bit(BITPOS, &flags);

then the clear_bit() really _fundamentally_ works only on arrays of
"unsigned long", because on big-endian machines the bit position
really depends on the chunk size.

And yes, big-endianness is a disease (and yes, bit ordering is
literally one of the reasons), and it's happily mostly gone, but sadly
"mostly" is not "entirely".

So we are more or less stuck with "bit operations fundamentally work
on arrays of unsigned long", and no, you *cannot* use them on smaller
types because of the horror that is big-endianness.

Could we do "u32 bitops"? Yeah, but because of all the endianness
problems, it really would end up having to be a whole new set of
interfaces.

We do, btw, have a special case: we support the notion of
"little-endian bitmaps".  When you actually have data structures that
are binary objects across architectures, you need to have a sane
*portable* bitmap representation, and the only sane model is the
little-endian one that basically is the same across any type size (ie
"bit 0" is the same physical bit in a byte array as it is in a word
array and in a unsigned long array).

So you have things like filesystems use this model with test_bit_le()
and friends. But note that that does add extra overhead on BE
machines. And while we probably *should* have just said that the
normal bitops are always little-endian, we didn't, because by the time
BE machines were an issue, we already had tons of those simple
initializers (that would now need to use some helper macro to do the
bit swizzling on big-endian HW).

So I suspect we're kind of stuck with it. If you use the bit
operations - including the wait/wake_bit stuff - you *have* to use
"unsigned long".

Note that the "var_wait" versions don't actually care about the size
of the variable. They never look at the value in memory, so they
basically just treat the address of the variable as a cookie for
waiting. So you can use the "var" versions with absolutely anything.

[ Side note: the wake_up_bit() interface is broken garbage. It uses
"void *word" for the word. That's very dangerous because it allows
type mis-use without warnings. I didn't notice until I checked.
Thankfully wait_on_bit() itself has the right signature, so hopefully
nobody ever does that ]

            Linus

NeilBrown Aug. 19, 2024, 8:52 p.m. UTC | #4

On Mon, 19 Aug 2024, Christian Brauner wrote:
> On Mon, Aug 19, 2024 at 03:20:34PM GMT, NeilBrown wrote:
> > I wasn't really sure who to send this too, and get_maintainer.pl
> > suggested 132 addresses which seemed excessive.  So I've focussed on
> > 'sched' maintainers.  I'll probably submit individual patches to
> > relevant maintainers/lists if I get positive feedback at this level.
> > 
> > This series was motivated by 
> > 
> >    Commit ed0172af5d6f ("SUNRPC: Fix a race to wake a sync task")
> > 
> > which adds smp_mb__after_atomic().  I thought "any API that requires that
> > sort of thing needs to be fixed".
> > 
> > The main patches here are 7 and 8 which revise wake_up_bit and
> > wake_up_var respectively.  They result in 3 interfaces:
> >   wake_up_{bit,var}           includes smp_mb__after_atomic()
> >   wake_up_{bit,var}_relaxed() doesn't have a barrier
> >   wake_up_{bit,var}_mb()      includes smb_mb().
> 
> It's great that this api is brought up because it gives me a reason to
> ask a stupid question I've had for a while.
> 
> I want to change the i_state member in struct inode from an unsigned
> long to a u32 because really we're wasting 4 bytes on 64 bit that we're
> never going to use given how little I_* flags we actually have and I
> dislike that we use that vacuous type in a bunch of our structures for
> that reason.
> 
> (Together with another 4 byte shrinkage we would get a whopping 8 bytes
> back.)
> 
> The problem is that we currently use wait_on_bit() and wake_up_bit() in
> various places on i_state and all of these functions require an unsigned
> long (probably because some architectures only handle atomic ops on
> unsigned long).

i_state contains two bits that are used for wake_up - I_NEW and I_SYNC,
one virtual bit that is used for wake_up - I_DIO_WAKEUP - and 15 others.
You could fit those in a short and two bools which gives you three
different addresses to pass to wake_up_var().
Doing that would make it a little difficult to test for 
   I_NEW | I_FREEING | I_WILL_FREE
but you could probably make an inline that the compile with optimise
effectively.

Alternately you could union the "u32" with an array for 4 char to give
you 4 addresses for wakeup.

Both of these would be effective, but a bit hackish.  If you want
something clean we would need a new interface.  Maybe
wake_var_bit()/wait_var_bit_event().
It could store the bit in wait_bit_key.bit_nr as "-2-n" or similar.  Or
better still, add another field: enum { WAIT_BIT, WAIT_VAR, WAIT_VAR_BIT } wait_type
to wait_bit_key.

I would probably go with the union approach and re-order the bits so that
the ones you want to use for wake_up are less than sizeof(u32).

NeilBrown

Linus Torvalds Aug. 19, 2024, 9:12 p.m. UTC | #5

On Mon, 19 Aug 2024 at 13:52, NeilBrown <neilb@suse.de> wrote:
>
> You could fit those in a short and two bools which gives you three
> different addresses to pass to wake_up_var().

You don't actually have to even do that.

The address passed to 'wake_up_var()' doesn't actually have to *match*
anything. It's used purely as a cookie.

So you can literally do something like

   #define inode_state(X,inode) ((X)+(char *)&(inode)->i_state)

and then just use inode_state(0/1/2,inode) for waiting/waking the
different bits (and the numbers 0/1/2 do not have to bear any relation
to the bit numbers, although you may obviously do that).

              Linus

NeilBrown Aug. 20, 2024, 9:47 p.m. UTC | #6

On Mon, 19 Aug 2024, Linus Torvalds wrote:
> On Sun, 18 Aug 2024 at 22:36, NeilBrown <neilb@suse.de> wrote:
> >
> > The main patches here are 7 and 8 which revise wake_up_bit and
> > wake_up_var respectively.  They result in 3 interfaces:
> >   wake_up_{bit,var}           includes smp_mb__after_atomic()
> 
> I actually think this is even worse than the current model, in that
> now it subtle only works after atomic ops, and it's not obvious from
> the name.
> 
> At least the current model, correct code looks like
> 
>       do_some_atomic_op
>       smp_mb__after_atomic()
>       wake_up_{bit,var}
> 
> and the smp_mb__after_atomic() makes sense and pairs with the atomic.
> So the current one may be complex, but at the same time it's also
> explicit. Your changed interface is still complex, but now it's even
> less obvious what is actually going on.
> 
> With your suggested interface, a plain "wake_up_{bit,var}" only works
> after atomic ops, and other ops have to magically know that they
> should use the _mb() version or whatever. And somebody who doesn't
> understand that subtlety, and copies the code (but changes the op from
> an atomic one to something else) now introduces code that looks fine,
> but is really subtly wrong.
> 
> The reason for the barrier is for the serialization with the
> waitqueue_active() check. Honestly, if you worry about correctness
> here, I think you should leave the existing wake_up_{bit,var}() alone,
> and concentrate on having helpers that do the whole "set and wake up".
> 
> IOW, I do not think you should change existing semantics, but *this*
> kind of pattern:
> 
> >  [PATCH 2/9] Introduce atomic_dec_and_wake_up_var().
> >  [PATCH 9/9] Use clear_and_wake_up_bit() where appropriate.
> 
> sounds like a good idea.
> 
> IOW, once you have a whole "atomic_dec_and_wake_up()" (skip the "_var"
> - it's implied by the fact that it's an atomic_dec), *then* that
> function makes for a simple-to-use model, and now the "atomic_dec(),
> the smp_mb__after_atomic(), and the wake_up_var()" are all together.
> 
> For all the same reasons, it makes total sense to have
> "clear_bit_and_wake()" etc.

I can definitely get behind the idea has having a few more helpers and
using them more widely.  But unless we get rid of wake_up_bit(), people
will still use and some will use it wrongly.

What would you think of changing wake_up_bit/var to always have a full
memory barrier, and adding wake_up_bit/var_relaxed() for use when a
different barrier, or not barrier, is wanted?

Thanks,
NeilBrown


> 
> But exposing those "three different memory barrier scenarios" as three
> different helpers is the *opposite* of helpful. It keeps the current
> complexity, and makes it worse by making the barrier rules even more
> opaque, imho.
> 
>                Linus
>

Linus Torvalds Aug. 20, 2024, 9:58 p.m. UTC | #7

On Tue, 20 Aug 2024 at 14:47, NeilBrown <neilb@suse.de> wrote:
>
> I can definitely get behind the idea has having a few more helpers and
> using them more widely.  But unless we get rid of wake_up_bit(), people
> will still use and some will use it wrongly.

I do not believe this is a valid argument.

"We have interfaces that somebody can use wrongly" is a fact of life,
not an argument.

The whole "wake_up_bit()" is a very special thing, and dammit, if
people don't know the rules, then they shouldn't be using it.

Anybody using that interface *ALREADY* has to have some model of
atomicity for the actual bit they are changing. And yes, they can get
that wrong too.

The only way to actually make it a simple interface is to do the bit
operation and the wakeup together. Which is why I think that
interfaces like clear_bit_and_wake() or set_bit_and_wake() are fine,
because at that point you actually have a valid rule for the whole
operation.

But wake_up_bit() on its own ALREADY depends on the user doing the
right thing for the bit itself. Putting a memory barrier in it will
only *HIDE* incompetence, it won't be fixing it.

So no. Don't add interfaces that hide the problem.

                  Linus

NeilBrown Aug. 20, 2024, 10:15 p.m. UTC | #8

On Wed, 21 Aug 2024, Linus Torvalds wrote:
> On Tue, 20 Aug 2024 at 14:47, NeilBrown <neilb@suse.de> wrote:
> >
> > I can definitely get behind the idea has having a few more helpers and
> > using them more widely.  But unless we get rid of wake_up_bit(), people
> > will still use and some will use it wrongly.
> 
> I do not believe this is a valid argument.
> 
> "We have interfaces that somebody can use wrongly" is a fact of life,
> not an argument.

The argument is more like "we have interfaces that are often used
wrongly and the resulting bugs are hard to find through testing because
they don't affect the more popular architectures".

> 
> The whole "wake_up_bit()" is a very special thing, and dammit, if
> people don't know the rules, then they shouldn't be using it.
> 
> Anybody using that interface *ALREADY* has to have some model of
> atomicity for the actual bit they are changing. And yes, they can get
> that wrong too.
> 
> The only way to actually make it a simple interface is to do the bit
> operation and the wakeup together. Which is why I think that
> interfaces like clear_bit_and_wake() or set_bit_and_wake() are fine,
> because at that point you actually have a valid rule for the whole
> operation.
> 
> But wake_up_bit() on its own ALREADY depends on the user doing the
> right thing for the bit itself. Putting a memory barrier in it will
> only *HIDE* incompetence, it won't be fixing it.
> 
> So no. Don't add interfaces that hide the problem.

Ok, thanks.  I'll focus my efforts on helper functions.

NeilBrown


> 
>                   Linus
>

Linus Torvalds Aug. 20, 2024, 10:24 p.m. UTC | #9

On Tue, 20 Aug 2024 at 15:16, NeilBrown <neilb@suse.de> wrote:
>
> The argument is more like "we have interfaces that are often used
> wrongly and the resulting bugs are hard to find through testing because
> they don't affect the more popular architectures".

Right, but let's make the fix be that we actually then make those
places use better interfaces that don't _have_ any memory ordering
issues.

THAT is my argument. In the "combined" interface, the problem simply
goes away entirely, rather than being hidden by adding possibly
totally pointless barriers.

                 Linus

[0/9,RFC] Make wake_up_{bit,var} less fragile

Message

Comments