Message ID | 20240819053605.11706-1-neilb@suse.de (mailing list archive) |
---|---|
Headers | show |
Series | Make wake_up_{bit,var} less fragile | expand |
On Sun, 18 Aug 2024 at 22:36, NeilBrown <neilb@suse.de> wrote: > > The main patches here are 7 and 8 which revise wake_up_bit and > wake_up_var respectively. They result in 3 interfaces: > wake_up_{bit,var} includes smp_mb__after_atomic() I actually think this is even worse than the current model, in that now it subtle only works after atomic ops, and it's not obvious from the name. At least the current model, correct code looks like do_some_atomic_op smp_mb__after_atomic() wake_up_{bit,var} and the smp_mb__after_atomic() makes sense and pairs with the atomic. So the current one may be complex, but at the same time it's also explicit. Your changed interface is still complex, but now it's even less obvious what is actually going on. With your suggested interface, a plain "wake_up_{bit,var}" only works after atomic ops, and other ops have to magically know that they should use the _mb() version or whatever. And somebody who doesn't understand that subtlety, and copies the code (but changes the op from an atomic one to something else) now introduces code that looks fine, but is really subtly wrong. The reason for the barrier is for the serialization with the waitqueue_active() check. Honestly, if you worry about correctness here, I think you should leave the existing wake_up_{bit,var}() alone, and concentrate on having helpers that do the whole "set and wake up". IOW, I do not think you should change existing semantics, but *this* kind of pattern: > [PATCH 2/9] Introduce atomic_dec_and_wake_up_var(). > [PATCH 9/9] Use clear_and_wake_up_bit() where appropriate. sounds like a good idea. IOW, once you have a whole "atomic_dec_and_wake_up()" (skip the "_var" - it's implied by the fact that it's an atomic_dec), *then* that function makes for a simple-to-use model, and now the "atomic_dec(), the smp_mb__after_atomic(), and the wake_up_var()" are all together. For all the same reasons, it makes total sense to have "clear_bit_and_wake()" etc. But exposing those "three different memory barrier scenarios" as three different helpers is the *opposite* of helpful. It keeps the current complexity, and makes it worse by making the barrier rules even more opaque, imho. Linus
On Mon, Aug 19, 2024 at 03:20:34PM GMT, NeilBrown wrote: > I wasn't really sure who to send this too, and get_maintainer.pl > suggested 132 addresses which seemed excessive. So I've focussed on > 'sched' maintainers. I'll probably submit individual patches to > relevant maintainers/lists if I get positive feedback at this level. > > This series was motivated by > > Commit ed0172af5d6f ("SUNRPC: Fix a race to wake a sync task") > > which adds smp_mb__after_atomic(). I thought "any API that requires that > sort of thing needs to be fixed". > > The main patches here are 7 and 8 which revise wake_up_bit and > wake_up_var respectively. They result in 3 interfaces: > wake_up_{bit,var} includes smp_mb__after_atomic() > wake_up_{bit,var}_relaxed() doesn't have a barrier > wake_up_{bit,var}_mb() includes smb_mb(). It's great that this api is brought up because it gives me a reason to ask a stupid question I've had for a while. I want to change the i_state member in struct inode from an unsigned long to a u32 because really we're wasting 4 bytes on 64 bit that we're never going to use given how little I_* flags we actually have and I dislike that we use that vacuous type in a bunch of our structures for that reason. (Together with another 4 byte shrinkage we would get a whopping 8 bytes back.) The problem is that we currently use wait_on_bit() and wake_up_bit() in various places on i_state and all of these functions require an unsigned long (probably because some architectures only handle atomic ops on unsigned long). I have an experimental patch converting all of that from wait_on_bit() to wait_var_event() but I was hesitant about it because iiuc the semantics don't nicely translate into each other. Specifically, if some code uses wait_on_bit(SOME_BIT) and someone calls wake_up_bit(SOME_OTHER_BIT) then iiuc only SOME_OTHER_BIT waiters will be woken. IOW, SOME_BIT waiters are unaffected. But if this is switched to wait_var_event() and wake_up_var() and there are two waiters e.g., W1: wait_var_event(inode->i_state, !(inode->i_state & SOME_BIT)) W2: wait_var_event(inode->i_state, !(inode->i_state & SOME_OTHER_BIT)) then a waker like inode->i_state &= ~SOME_OTHER_BIT; // insert appropriate barrier wake_up_var(inode->i_state) will cause both W1 and W2 to be woken but W1 is put back to sleep? Is there a nicer way to do this? The only thing I want is a (pony) 32bit bit-wait-like mechanism.
On Mon, 19 Aug 2024 at 01:16, Christian Brauner <brauner@kernel.org> wrote: > > The problem is that we currently use wait_on_bit() and wake_up_bit() in > various places on i_state and all of these functions require an unsigned > long (probably because some architectures only handle atomic ops on > unsigned long). It's actually mostly because of endianness, not atomicity. The whole "flags in one single unsigned long" is a very traditional Linux pattern. Even originally, when "unsigned", "unsigned int", and "unsigned long" were all the same 32-bit thing, the pattern for the kernel was "unsigned long" for the native architecture accesses. It may not be a *great* pattern, and arguably it should always have been "flags in a single unsigned int" (which was the same thing back in the days). But hey, hindsight is 20:20. [ And I say "arguably", not "obviously". The kernel basically takes the approach that "unsigned long" is the size of GP registers, and so "array of unsigned long" is in some respect fundamentally more efficient than "array of unsigned int", because you can very naturally do operations in bigger chunks. So bitops working on some more architecture-neutral size like just bytes or "u32" or "unsigned int" would have advantages, but "unsigned long" in many ways is also a real advantage, and can have serious alignment advantages for the simple cases ] And it turns out byte order matters. Because you absolutely want to be able to mix things like simple initializers, ie unsigned long flags = 1ul << BITPOS; ... clear_bit(BITPOS, &flags); then the clear_bit() really _fundamentally_ works only on arrays of "unsigned long", because on big-endian machines the bit position really depends on the chunk size. And yes, big-endianness is a disease (and yes, bit ordering is literally one of the reasons), and it's happily mostly gone, but sadly "mostly" is not "entirely". So we are more or less stuck with "bit operations fundamentally work on arrays of unsigned long", and no, you *cannot* use them on smaller types because of the horror that is big-endianness. Could we do "u32 bitops"? Yeah, but because of all the endianness problems, it really would end up having to be a whole new set of interfaces. We do, btw, have a special case: we support the notion of "little-endian bitmaps". When you actually have data structures that are binary objects across architectures, you need to have a sane *portable* bitmap representation, and the only sane model is the little-endian one that basically is the same across any type size (ie "bit 0" is the same physical bit in a byte array as it is in a word array and in a unsigned long array). So you have things like filesystems use this model with test_bit_le() and friends. But note that that does add extra overhead on BE machines. And while we probably *should* have just said that the normal bitops are always little-endian, we didn't, because by the time BE machines were an issue, we already had tons of those simple initializers (that would now need to use some helper macro to do the bit swizzling on big-endian HW). So I suspect we're kind of stuck with it. If you use the bit operations - including the wait/wake_bit stuff - you *have* to use "unsigned long". Note that the "var_wait" versions don't actually care about the size of the variable. They never look at the value in memory, so they basically just treat the address of the variable as a cookie for waiting. So you can use the "var" versions with absolutely anything. [ Side note: the wake_up_bit() interface is broken garbage. It uses "void *word" for the word. That's very dangerous because it allows type mis-use without warnings. I didn't notice until I checked. Thankfully wait_on_bit() itself has the right signature, so hopefully nobody ever does that ] Linus
On Mon, 19 Aug 2024, Christian Brauner wrote: > On Mon, Aug 19, 2024 at 03:20:34PM GMT, NeilBrown wrote: > > I wasn't really sure who to send this too, and get_maintainer.pl > > suggested 132 addresses which seemed excessive. So I've focussed on > > 'sched' maintainers. I'll probably submit individual patches to > > relevant maintainers/lists if I get positive feedback at this level. > > > > This series was motivated by > > > > Commit ed0172af5d6f ("SUNRPC: Fix a race to wake a sync task") > > > > which adds smp_mb__after_atomic(). I thought "any API that requires that > > sort of thing needs to be fixed". > > > > The main patches here are 7 and 8 which revise wake_up_bit and > > wake_up_var respectively. They result in 3 interfaces: > > wake_up_{bit,var} includes smp_mb__after_atomic() > > wake_up_{bit,var}_relaxed() doesn't have a barrier > > wake_up_{bit,var}_mb() includes smb_mb(). > > It's great that this api is brought up because it gives me a reason to > ask a stupid question I've had for a while. > > I want to change the i_state member in struct inode from an unsigned > long to a u32 because really we're wasting 4 bytes on 64 bit that we're > never going to use given how little I_* flags we actually have and I > dislike that we use that vacuous type in a bunch of our structures for > that reason. > > (Together with another 4 byte shrinkage we would get a whopping 8 bytes > back.) > > The problem is that we currently use wait_on_bit() and wake_up_bit() in > various places on i_state and all of these functions require an unsigned > long (probably because some architectures only handle atomic ops on > unsigned long). i_state contains two bits that are used for wake_up - I_NEW and I_SYNC, one virtual bit that is used for wake_up - I_DIO_WAKEUP - and 15 others. You could fit those in a short and two bools which gives you three different addresses to pass to wake_up_var(). Doing that would make it a little difficult to test for I_NEW | I_FREEING | I_WILL_FREE but you could probably make an inline that the compile with optimise effectively. Alternately you could union the "u32" with an array for 4 char to give you 4 addresses for wakeup. Both of these would be effective, but a bit hackish. If you want something clean we would need a new interface. Maybe wake_var_bit()/wait_var_bit_event(). It could store the bit in wait_bit_key.bit_nr as "-2-n" or similar. Or better still, add another field: enum { WAIT_BIT, WAIT_VAR, WAIT_VAR_BIT } wait_type to wait_bit_key. I would probably go with the union approach and re-order the bits so that the ones you want to use for wake_up are less than sizeof(u32). NeilBrown
On Mon, 19 Aug 2024 at 13:52, NeilBrown <neilb@suse.de> wrote: > > You could fit those in a short and two bools which gives you three > different addresses to pass to wake_up_var(). You don't actually have to even do that. The address passed to 'wake_up_var()' doesn't actually have to *match* anything. It's used purely as a cookie. So you can literally do something like #define inode_state(X,inode) ((X)+(char *)&(inode)->i_state) and then just use inode_state(0/1/2,inode) for waiting/waking the different bits (and the numbers 0/1/2 do not have to bear any relation to the bit numbers, although you may obviously do that). Linus
On Mon, 19 Aug 2024, Linus Torvalds wrote: > On Sun, 18 Aug 2024 at 22:36, NeilBrown <neilb@suse.de> wrote: > > > > The main patches here are 7 and 8 which revise wake_up_bit and > > wake_up_var respectively. They result in 3 interfaces: > > wake_up_{bit,var} includes smp_mb__after_atomic() > > I actually think this is even worse than the current model, in that > now it subtle only works after atomic ops, and it's not obvious from > the name. > > At least the current model, correct code looks like > > do_some_atomic_op > smp_mb__after_atomic() > wake_up_{bit,var} > > and the smp_mb__after_atomic() makes sense and pairs with the atomic. > So the current one may be complex, but at the same time it's also > explicit. Your changed interface is still complex, but now it's even > less obvious what is actually going on. > > With your suggested interface, a plain "wake_up_{bit,var}" only works > after atomic ops, and other ops have to magically know that they > should use the _mb() version or whatever. And somebody who doesn't > understand that subtlety, and copies the code (but changes the op from > an atomic one to something else) now introduces code that looks fine, > but is really subtly wrong. > > The reason for the barrier is for the serialization with the > waitqueue_active() check. Honestly, if you worry about correctness > here, I think you should leave the existing wake_up_{bit,var}() alone, > and concentrate on having helpers that do the whole "set and wake up". > > IOW, I do not think you should change existing semantics, but *this* > kind of pattern: > > > [PATCH 2/9] Introduce atomic_dec_and_wake_up_var(). > > [PATCH 9/9] Use clear_and_wake_up_bit() where appropriate. > > sounds like a good idea. > > IOW, once you have a whole "atomic_dec_and_wake_up()" (skip the "_var" > - it's implied by the fact that it's an atomic_dec), *then* that > function makes for a simple-to-use model, and now the "atomic_dec(), > the smp_mb__after_atomic(), and the wake_up_var()" are all together. > > For all the same reasons, it makes total sense to have > "clear_bit_and_wake()" etc. I can definitely get behind the idea has having a few more helpers and using them more widely. But unless we get rid of wake_up_bit(), people will still use and some will use it wrongly. What would you think of changing wake_up_bit/var to always have a full memory barrier, and adding wake_up_bit/var_relaxed() for use when a different barrier, or not barrier, is wanted? Thanks, NeilBrown > > But exposing those "three different memory barrier scenarios" as three > different helpers is the *opposite* of helpful. It keeps the current > complexity, and makes it worse by making the barrier rules even more > opaque, imho. > > Linus >
On Tue, 20 Aug 2024 at 14:47, NeilBrown <neilb@suse.de> wrote: > > I can definitely get behind the idea has having a few more helpers and > using them more widely. But unless we get rid of wake_up_bit(), people > will still use and some will use it wrongly. I do not believe this is a valid argument. "We have interfaces that somebody can use wrongly" is a fact of life, not an argument. The whole "wake_up_bit()" is a very special thing, and dammit, if people don't know the rules, then they shouldn't be using it. Anybody using that interface *ALREADY* has to have some model of atomicity for the actual bit they are changing. And yes, they can get that wrong too. The only way to actually make it a simple interface is to do the bit operation and the wakeup together. Which is why I think that interfaces like clear_bit_and_wake() or set_bit_and_wake() are fine, because at that point you actually have a valid rule for the whole operation. But wake_up_bit() on its own ALREADY depends on the user doing the right thing for the bit itself. Putting a memory barrier in it will only *HIDE* incompetence, it won't be fixing it. So no. Don't add interfaces that hide the problem. Linus
On Wed, 21 Aug 2024, Linus Torvalds wrote: > On Tue, 20 Aug 2024 at 14:47, NeilBrown <neilb@suse.de> wrote: > > > > I can definitely get behind the idea has having a few more helpers and > > using them more widely. But unless we get rid of wake_up_bit(), people > > will still use and some will use it wrongly. > > I do not believe this is a valid argument. > > "We have interfaces that somebody can use wrongly" is a fact of life, > not an argument. The argument is more like "we have interfaces that are often used wrongly and the resulting bugs are hard to find through testing because they don't affect the more popular architectures". > > The whole "wake_up_bit()" is a very special thing, and dammit, if > people don't know the rules, then they shouldn't be using it. > > Anybody using that interface *ALREADY* has to have some model of > atomicity for the actual bit they are changing. And yes, they can get > that wrong too. > > The only way to actually make it a simple interface is to do the bit > operation and the wakeup together. Which is why I think that > interfaces like clear_bit_and_wake() or set_bit_and_wake() are fine, > because at that point you actually have a valid rule for the whole > operation. > > But wake_up_bit() on its own ALREADY depends on the user doing the > right thing for the bit itself. Putting a memory barrier in it will > only *HIDE* incompetence, it won't be fixing it. > > So no. Don't add interfaces that hide the problem. Ok, thanks. I'll focus my efforts on helper functions. NeilBrown > > Linus >
On Tue, 20 Aug 2024 at 15:16, NeilBrown <neilb@suse.de> wrote: > > The argument is more like "we have interfaces that are often used > wrongly and the resulting bugs are hard to find through testing because > they don't affect the more popular architectures". Right, but let's make the fix be that we actually then make those places use better interfaces that don't _have_ any memory ordering issues. THAT is my argument. In the "combined" interface, the problem simply goes away entirely, rather than being hidden by adding possibly totally pointless barriers. Linus