| Message ID | 20240821-work-i_state-v2-1-67244769f102@kernel.org (mailing list archive) |
|---|---|
| State | New |
| Series | inode: turn i_state into u32 |
On Thu, 22 Aug 2024, Christian Brauner wrote:
> The i_state member is an unsigned long so that it can be used with the
> wait bit infrastructure which expects unsigned long. This wastes 4 bytes
> which we're unlikely to ever use. Switch to using the var event wait
> mechanism using the address of the bit. Thanks to Linus for the address
> idea.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  fs/inode.c         | 10 ++++++++++
>  include/linux/fs.h | 16 ++++++++++++++++
>  2 files changed, 26 insertions(+)
>
[...]
>
> +static inline void inode_wake_up_bit(struct inode *inode, u32 bit)
> +{
> +	/* Ensure @bit will be seen cleared/set when caller is woken up. */

The above comment is wrong. I think I once thought it was correct too
but now I know better (I hope). A better comment might be

	/* Insert memory barrier as recommended by wake_up_var() */

but even that is unnecessary as we don't need the memory barrier.

A careful reading of memory-barriers.rst shows that *when the process is
actually woken* there are sufficient barriers in wake_up_process() and
prepare_to_wait_event() and the scheduler and (particularly)
set_current_state() so that a value set before the wake_up is seen after
the schedule().

So this barrier isn't about the bit. This barrier is about the
wait_queue. In particular it is about the waitqueue_active() call at the
start of wake_up_var(). If that test wasn't there and if instead
wake_up_var() conditionally called __wake_up(), then there would be no
need for any barrier. A comment near wake_up_bit() makes it clear that
the barrier is only needed when the spinlock is avoided.

On a weakly ordered arch, this test can be effected *before* the write
of the bit. If the waiter adds itself to the wait queue and then tests
the bit before the bit is set, but after the waitqueue_active() test is
put in effect, then the wake_up will never be sent.

But .... this is all academic for this code because you don't need a
barrier at all. The wake_up happens in a spin_locked region, and the
wait is entirely inside the same spin_lock, except for the schedule. A
later patch has:

	spin_unlock(); schedule(); spin_lock();

So there is no room for a race. If the bit is cleared before the
wait_var_event() equivalent, then no wakeup is needed. When the lock is
dropped after the bit is cleared the unlock will have all the barriers
needed for the bit to be visible.

The only other place that the bit can be cleared is concurrent with the
above schedule() while the spinlock isn't held by the waiter. In that
case it is again clear that no barrier is needed - or that the
spin_unlock/lock provide all the needed barriers.

So a better comment would be

	/* no barrier needs as both waker and waiter are in spin-locked regions */

Thanks,
NeilBrown

> +	smp_mb();
> +	wake_up_var(inode_state_wait_address(inode, bit));
> +}
[...]
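To make the shape concrete, the situation described above looks roughly
like this (an illustrative sketch only; the I_SYNC/__I_SYNC names and
the exact wait loop are stand-ins for what the later patches in the
series actually do):

	/* waker: clears the bit and wakes, all under inode->i_lock */
	spin_lock(&inode->i_lock);
	inode->i_state &= ~I_SYNC;
	wake_up_var(inode_state_wait_address(inode, __I_SYNC));
	spin_unlock(&inode->i_lock);	/* unlock publishes the cleared bit */

	/* waiter: tests the bit under the same lock and only drops it
	 * around schedule(), so it either sees the bit already clear or
	 * is on the waitqueue before the waker's waitqueue_active() test */
	spin_unlock(&inode->i_lock);
	schedule();
	spin_lock(&inode->i_lock);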
On Wed, Aug 21, 2024 at 05:47:31PM +0200, Christian Brauner wrote:
> The i_state member is an unsigned long so that it can be used with the
> wait bit infrastructure which expects unsigned long. This wastes 4 bytes
> which we're unlikely to ever use. Switch to using the var event wait
> mechanism using the address of the bit. Thanks to Linus for the address
> idea.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
[...]
>
> +static inline void inode_wake_up_bit(struct inode *inode, u32 bit)
> +{
> +	/* Ensure @bit will be seen cleared/set when caller is woken up. */
> +	smp_mb();
> +	wake_up_var(inode_state_wait_address(inode, bit));
> +}

Why is this memory barrier necessary? wake_up_var() takes the wait queue
head spinlock, as does prepare_to_wait() before the waiter condition
check and the subsequent finish_wait() when the wait is done.

Hence by the time functions like inode_wait_for_writeback() return to
the caller, there's been a full unlock->lock cycle on the wait queue
head lock between the bit being set and the waiter waking and checking
it. i.e. the use of the generic waitqueue infrastructure implies a full
memory barrier between the bit being set, the wakeup being issued and
the waiter checking the condition bit.

Hence I'm not sure what race condition this memory barrier is
preventing - is this just a historical artifact that predates the
current wait/wake implementation?

-Dave.
On Thu, 22 Aug 2024 at 06:12, NeilBrown <neilb@suse.de> wrote:
>
> So this barrier isn't about the bit. This barrier is about the
> wait_queue. In particular it is about the waitqueue_active() call

Correct. The waitqueue_active() call is basically very racy, but it's
also often a huge deal for performance.

In *most* cases, wakeups happen with nobody waiting for them at all, and
taking the waitqueue spinlock just to see that there are no waiters is
hugely expensive and easily causes lots of pointless cache ping-pong.

So the waitqueue_active() is there to basically say "if there are no
waiters, don't bother doing anything".

But *because* it is done - by definition - without holding the waitqueue
lock, it means that a new waiter might have just checked for the
condition, and is about to add itself to the wait queue.

Now, waiters will then re-check *after* they have added themselves to
the waitqueue too (that's what all the wait_for_event etc stuff does),
and the waiter side will have barriers thanks to the spinlocks on the
waitqueue.

But the waker still needs to make sure it's ordered wrt "I changed the
condition that the waiter is waiting for" and the lockless
waitqueue_active() test.

If waiters and wakers are mutually serialized by some external lock,
there are no races. But commonly, the waker will just do an atomic op,
or maybe it holds a lock (as a writer) that the waiting side does not
hold, and the waiting side just does a "READ_ONCE()" or a
"smp_load_acquire()" or whatever.

So the waker needs to make sure that its state setting is serialized
with the waiter. If it uses our atomics, then typically a
smp_mb__after_atomic() is appropriate (which often ends up being a
no-op). But commonly you end up needing a smp_mb(), which serializes
whatever action the waker did with the waker then doing that optimistic
lockless waitqueue_active().

>	/* no barrier needs as both waker and waiter are in spin-locked regions */

If all the inode state waiters do indeed hold the lock, then that is fine.

           Linus
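In code, the lockless-waker discipline described here looks roughly like
the following sketch (illustrative only; x->done is a made-up field, not
anything from the patch):

	/* Waker: change the condition, then order that change against
	 * the lockless waitqueue_active() test inside wake_up_var().
	 */
	WRITE_ONCE(x->done, 1);
	smp_mb();		/* pairs with the waiter-side barriers */
	wake_up_var(&x->done);

	/* Waiter: re-checks the condition after adding itself to the
	 * waitqueue, so it either sees the store or is found by the
	 * wakeup.
	 */
	wait_var_event(&x->done, READ_ONCE(x->done));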
On Thu, 22 Aug 2024 at 06:28, Dave Chinner <david@fromorbit.com> wrote:
>
> wake_up_var() takes the wait queue head spinlock

That's the part you missed. wake_up_var() does no such thing. It calls
__wake_up_bit(), which does this:

	if (waitqueue_active(wq_head))
		__wake_up(wq_head, TASK_NORMAL, 1, &key);

and that waitqueue_active() is lockless. That's the thing that wants the
barrier (or something else that guarantees that the waiters actually
*see* the new state after they've added themselves to the wait-list, and
before we then look at the wait-list being empty).

See the comment in <linux/wait.h> above the waitqueue_active().

And yes, this is annoying and subtle and has caused bugs. But taking
waitqueue locks just to notice nobody is waiting is very expensive
indeed, and much more expensive than a memory barrier when the common
case is typically often "nobody is waiting".

           Linus
On Thu, Aug 22, 2024 at 08:12:15AM +1000, NeilBrown wrote:
> On Thu, 22 Aug 2024, Christian Brauner wrote:
> > The i_state member is an unsigned long so that it can be used with the
> > wait bit infrastructure which expects unsigned long. This wastes 4 bytes
> > which we're unlikely to ever use. Switch to using the var event wait
> > mechanism using the address of the bit. Thanks to Linus for the address
> > idea.
[...]
> > +static inline void inode_wake_up_bit(struct inode *inode, u32 bit)
> > +{
> > +	/* Ensure @bit will be seen cleared/set when caller is woken up. */
>
> The above comment is wrong. I think I once thought it was correct too
> but now I know better (I hope). A better comment might be
>
>	/* Insert memory barrier as recommended by wake_up_var() */
>
> but even that is unnecessary as we don't need the memory barrier.
>
> A careful reading of memory-barriers.rst shows that *when the process is
> actually woken* there are sufficient barriers in wake_up_process() and
> prepare_to_wait_event() and the scheduler and (particularly)
> set_current_state() so that a value set before the wake_up is seen after
> the schedule().
>
> So this barrier isn't about the bit. This barrier is about the
> wait_queue. In particular it is about the waitqueue_active() call at the
> start of wake_up_var(). If that test wasn't there and if instead
> wake_up_var() conditionally called __wake_up(), then there would be no
> need for any barrier. A comment near wake_up_bit() makes it clear that
> the barrier is only needed when the spinlock is avoided.

Oh, I missed that *landmine*. So almost none of the current uses of
wake_up_var() have an explicit memory barrier before them, and a lot of
them are not done under spin locks.

I suspect that the implicit store ordering of the wake queue check is
mostly only handled by chance - an atomic_dec_and_test() or
smp_store_release() call is made before wake_up_var() or the wait/wakeup
is synchronised by an external lock.

> On a weakly ordered arch, this test can be effected *before* the write
> of the bit. If the waiter adds itself to the wait queue and then tests
> the bit before the bit is set, but after the waitqueue_active() test is
> put in effect, then the wake_up will never be sent.
>
> But .... this is all academic for this code because you don't need a
> barrier at all. The wake_up happens in a spin_locked region, and the
> wait is entirely inside the same spin_lock, except for the schedule. A
> later patch has:
>
>	spin_unlock(); schedule(); spin_lock();
>
> So there is no room for a race. If the bit is cleared before the
> wait_var_event() equivalent, then no wakeup is needed. When the lock is
> dropped after the bit is cleared the unlock will have all the barriers
> needed for the bit to be visible.

Right, that's exactly the problem with infrastructure that externalises
memory ordering dependencies. Most of the time they just work because of
external factors, but sometimes they don't and we get random memory
barriers being cargo culted around the place because that seems to fix
weird problems....

> The only other place that the bit can be cleared is concurrent with the
> above schedule() while the spinlock isn't held by the waiter. In that
> case it is again clear that no barrier is needed - or that the
> spin_unlock/lock provide all the needed barriers.
>
> So a better comment would be
>
>	/* no barrier needs as both waker and waiter are in spin-locked regions */
	      ^^^^^^^ ordering

Even better would be to fix wake_up_var() to not have an implicit
ordering requirement. Add __wake_up_var() as the "I know what I'm doing"
API, and have wake_up_var() always issue the memory barrier like so:

	__wake_up_var(var)
	{
		__wake_up_bit(....);
	}

	wake_up_var(var)
	{
		smp_mb();
		__wake_up_var(var);
	}

Then people who don't intimately understand ordering (i.e. just want to
use a conditional wait variable) or just don't want to think about
complex memory ordering issues for otherwise obvious and simple code can
simply use the safe variant.

Those that need to optimise for speed or know they have solid ordering
or external serialisation can use __wake_up_var() and leave a comment as
to why that is safe. i.e.:

	/*
	 * No wait/wakeup ordering required as both waker and waiters
	 * are serialised by the inode->i_lock.
	 */
	__wake_up_var(....);

-Dave.
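Fleshed out, the proposed split might look like this (a hypothetical
sketch only - mainline's wake_up_var() did not have this split at the
time; the __wake_up_bit() call mirrors the existing wake_up_var() body
in kernel/sched/wait_bit.c):

	/* "I know what I'm doing" variant: no implied ordering,
	 * the caller must sort out barriers itself.
	 */
	static inline void __wake_up_var(void *var)
	{
		__wake_up_bit(__var_waitqueue(var), var, -1);
	}

	/* Safe variant: order the caller's store against the lockless
	 * waitqueue_active() test inside __wake_up_bit().
	 */
	static inline void wake_up_var(void *var)
	{
		smp_mb();
		__wake_up_var(var);
	}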
On Thu, Aug 22, 2024 at 08:12:15AM GMT, NeilBrown wrote:
> On Thu, 22 Aug 2024, Christian Brauner wrote:
> > The i_state member is an unsigned long so that it can be used with the
> > wait bit infrastructure which expects unsigned long. This wastes 4 bytes
> > which we're unlikely to ever use. Switch to using the var event wait
> > mechanism using the address of the bit. Thanks to Linus for the address
> > idea.
[...]
> So this barrier isn't about the bit. This barrier is about the
> wait_queue. In particular it is about the waitqueue_active() call at the
> start of wake_up_var(). If that test wasn't there and if instead
> wake_up_var() conditionally called __wake_up(), then there would be no

Did you mean "unconditionally called"?

> need for any barrier. A comment near wake_up_bit() makes it clear that
> the barrier is only needed when the spinlock is avoided.
>
> On a weakly ordered arch, this test can be effected *before* the write
> of the bit. If the waiter adds itself to the wait queue and then tests
> the bit before the bit is set, but after the waitqueue_active() test is
> put in effect, then the wake_up will never be sent.
>
> But .... this is all academic for this code because you don't need a
> barrier at all. The wake_up happens in a spin_locked region, and the
> wait is entirely inside the same spin_lock, except for the schedule. A
> later patch has:
>
>	spin_unlock(); schedule(); spin_lock();
>
> So there is no room for a race. If the bit is cleared before the
> wait_var_event() equivalent, then no wakeup is needed. When the lock is
> dropped after the bit is cleared the unlock will have all the barriers
> needed for the bit to be visible.
>
> The only other place that the bit can be cleared is concurrent with the
> above schedule() while the spinlock isn't held by the waiter. In that
> case it is again clear that no barrier is needed - or that the
> spin_unlock/lock provide all the needed barriers.
>
> So a better comment would be
>
>	/* no barrier needs as both waker and waiter are in spin-locked regions */

Thanks for the analysis. I was under the impression that wait_on_inode()
was called in contexts where no barrier is guaranteed and the bit isn't
checked with spin_lock() held.
On Thu, 22 Aug 2024 at 16:27, Christian Brauner <brauner@kernel.org> wrote:
> > >
> > >	/* no barrier needs as both waker and waiter are in spin-locked regions */
>
> Thanks for the analysis. I was under the impression that wait_on_inode()
> was called in contexts where no barrier is guaranteed and the bit isn't
> checked with spin_lock() held.

Yes, it does look like the barrier is needed, because the waiter does
not hold the lock indeed.

           Linus
On Thu, 22 Aug 2024, Dave Chinner wrote:
>
> Even better would be to fix wake_up_var() to not have an implicit
> ordering requirement. Add __wake_up_var() as the "I know what I'm
> doing" API, and have wake_up_var() always issue the memory barrier
> like so:
>
>	__wake_up_var(var)
>	{
>		__wake_up_bit(....);
>	}
>
>	wake_up_var(var)
>	{
>		smp_mb();
>		__wake_up_var(var);
>	}

This is very close to what I proposed
  https://lore.kernel.org/all/20240819053605.11706-9-neilb@suse.de/
but Linus doesn't like that approach.

I would prefer a collection of interfaces which combine the update to
the variable with the wake_up. So

  atomic_dec_and_wake()
  store_release_and_wake()

though there are a few that don't fit into a common pattern. Sometimes a
minor code re-write can make them fit. Other times I'm not so sure.

I'm glad I'm not the only one who thought that adding a barrier to
wake_up_var() was a good idea though - thanks.

NeilBrown

[...]
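Neither of the combined interfaces named above existed in mainline at
this point; a sketch of what they might look like, following the
ordering rules established earlier in the thread (both helper names and
semantics are assumptions based on this email):

	/* Hypothetical: drop a reference and wake waiters when it hits
	 * zero.  atomic_dec_and_test() already implies a full barrier,
	 * so the store is ordered against the waitqueue_active() test.
	 */
	static inline bool atomic_dec_and_wake(atomic_t *v)
	{
		if (!atomic_dec_and_test(v))
			return false;
		wake_up_var(v);
		return true;
	}

	/* Hypothetical: publish a value with release semantics for
	 * lock-free readers, then a full barrier for the wakeup path.
	 */
	#define store_release_and_wake(ptr, val)			\
	do {								\
		smp_store_release((ptr), (val));			\
		smp_mb();	/* order store vs waitqueue_active() */	\
		wake_up_var(ptr);					\
	} while (0)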
On Thu, 22 Aug 2024, Christian Brauner wrote:
> >
> > So this barrier isn't about the bit. This barrier is about the
> > wait_queue. In particular it is about the waitqueue_active() call at the
> > start of wake_up_var(). If that test wasn't there and if instead
> > wake_up_var() conditionally called __wake_up(), then there would be no
>
> Did you mean "unconditionally called"?

Yes :-)

> > need for any barrier. A comment near wake_up_bit() makes it clear that
> > the barrier is only needed when the spinlock is avoided.
[...]
> > So a better comment would be
> >
> >	/* no barrier needs as both waker and waiter are in spin-locked regions */
>
> Thanks for the analysis. I was under the impression that wait_on_inode()
> was called in contexts where no barrier is guaranteed and the bit isn't
> checked with spin_lock() held.

Yes it is. I was looking at the I_SYNC waiter and mistakenly thought it
was a model for the others. A wake_up for I_SYNC being cleared doesn't
need the barrier.

Maybe inode_wake_up_bit() should not have a barrier and the various
callers should add whichever barrier is appropriate. That is the model
that Linus prefers for wake_up_bit() and for consistency it should apply
to inode_wake_up_bit() too.

Thanks,
NeilBrown
On Fri, 23 Aug 2024 at 08:14, NeilBrown <neilb@suse.de> wrote:
>
> Maybe inode_wake_up_bit() should not have a barrier and the various
> callers should add whichever barrier is appropriate. That is the model
> that Linus prefers for wake_up_bit() and for consistency it should apply
> to inode_wake_up_bit() too.

I think that for the wake_up_bit() cases in particular, it's fairly easy
to add whatever "clear_and_wakeup()" helpers. After all, there's only so
much you can do to one bit, and I think merging the "act" with the
"wakeup" into one routine is not only going to help fix the memory
ordering problem, I think it's going to result in more readable code
even aside from any memory ordering issues.

The wake_up_var() infrastructure that the inode code uses is a bit more
involved. Not only can the variable be anything at all (so the
operations you can do on it are obviously largely unbounded), but the
inode hack in particular then uses one thing for the actual variable,
and another thing for the address that is used to match up waits and
wakeups.

So I suspect the inode code will have to do its thing explicitly with
the low-level helpers and deal with the memory ordering issues on its
own, adding the smp_mb() manually.

           Linus
On Fri, 23 Aug 2024 at 10:52, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The wake_up_var() infrastructure that the inode code uses is a bit
> more involved. Not only can the variable be anything at all (so the
> operations you can do on it are obviously largely unbounded), but the
> inode hack in particular then uses one thing for the actual variable,
> and another thing for the address that is used to match up waits and
> wakeups.

.. btw, that doesn't mean we can't have helpers for the common cases.

They might have to be macros (so that they just work regardless of the
type), but having a

	set_var_and_wake(var, value);

macro that just expands to something like (completely untested "maybe
this works" macro):

	#define set_var_and_wake(var,value) do {	\
		__auto_type __set_ptr = &(var);		\
		*(__set_ptr) = (value);			\
		smp_mb();				\
		wake_up_var(__set_ptr);			\
	} while (0)

doesn't sound too bad for at least some common patterns.

Looking around, we do seem to have a pattern of

	smp_store_release() -> wake_up_var()

instead of a memory barrier. I don't think that actually works.

The smp_store_release() means that *earlier* accesses will be bounded
by the store operation, but *later* accesses - including very much the
"look if the wait queue is empty" check - are totally unordered by it,
and can be done before the store by the CPU.

But I haven't thought deeply about it, that was just my gut reaction
when seeing the pattern. It superficially makes sense, but I think it's
entirely wrong (it's a "smp_load_acquire()" that would order with later
accesses, but there is no "store with acquire semantics" operation).

           Linus
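For a sense of how the macro above would read at a call site, here is a
hypothetical user (the done flag and both functions are made up for
illustration):

	static int done;

	static void waker_side(void)
	{
		/* expands to: store 1, smp_mb(), wake_up_var(&done) */
		set_var_and_wake(done, 1);
	}

	static void waiter_side(void)
	{
		/* re-checks "done" after queueing itself, so the
		 * smp_mb() in the macro closes the waitqueue_active()
		 * race window */
		wait_var_event(&done, READ_ONCE(done));
	}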
On Fri, 23 Aug 2024 at 11:05, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Looking around, we do seem to have a pattern of
>
>	smp_store_release() -> wake_up_var()
>
> instead of a memory barrier. I don't think that actually works.

Hmm. It might not work for the wakeup race, but it might be a good idea
to do the store_release anyway, just for the actual user (i.e. then the
*use* of the variable may start with a "smp_load_acquire()", and the
release->acquire semantics means that everything that was done before
the release is visible after the acquire).

Of course, the smp_mb() will force that ordering too, but for a true
concurrent user that doesn't actually need waking up, it might see the
newly stored var value before the smp_mb() happens on the generating
side.

End result: those code paths may want *both* the smp_store_release()
and the smp_mb(), because they are ordering different accesses.

Just goes to show that this whole thing is more complicated than just
"wait for a value".

           Linus
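A sketch of the resulting "both barriers" pattern (obj->state and DONE
are illustrative names, not existing kernel identifiers):

	/* waker: publish state for lock-free readers *and* wake sleepers */
	smp_store_release(&obj->state, DONE);	/* pairs with smp_load_acquire()
						 * in readers that never sleep */
	smp_mb();				/* orders the store against the
						 * waitqueue_active() check */
	wake_up_var(&obj->state);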
On Fri, 23 Aug 2024, Linus Torvalds wrote:
> On Fri, 23 Aug 2024 at 08:14, NeilBrown <neilb@suse.de> wrote:
> >
> > Maybe inode_wake_up_bit() should not have a barrier and the various
> > callers should add whichever barrier is appropriate. That is the model
> > that Linus prefers for wake_up_bit() and for consistency it should apply
> > to inode_wake_up_bit() too.
>
> I think that for the wake_up_bit() cases in particular, it's fairly
> easy to add whatever "clear_and_wakeup()" helpers. After all, there's
> only so much you can do to one bit, and I think merging the "act" with
> the "wakeup" into one routine is not only going to help fix the memory
> ordering problem, I think it's going to result in more readable code
> even aside from any memory ordering issues.

The patterns that don't easily fit a helper include:

- skipping the wake_up when there is no hurry.
  nfs_page_clear_headlock() does this if PG_CONTENDED1 isn't set.
  intel_recv_event() does something very similar if
  STATE_FIRMWARE_LOADED isn't set. I imagine changing these to

	if (need a wakeup)
		clear_bit_and_wakeup(...)
	else
		clear_bit()

  rpc_make_runnable() has three cases and maybe each would need a
  separate clear_bit in order to have a clear_bit adjacent to the
  wake_up(). That might actually be a good thing, I haven't really
  tried yet.

- key_reject_and_link() separates the test_and_clear from the
  wake_up_bit() for reasons that aren't completely obvious to me.
  __key_instantiate_and_link() does the same. Maybe they just move the
  wake_up outside the mutex....

> The wake_up_var() infrastructure that the inode code uses is a bit
> more involved. Not only can the variable be anything at all (so the
> operations you can do on it are obviously largely unbounded), but the
> inode hack in particular then uses one thing for the actual variable,
> and another thing for the address that is used to match up waits and
> wakeups.
>
> So I suspect the inode code will have to do its thing explicitly with
> the low-level helpers and deal with the memory ordering issues on its
> own, adding the smp_mb() manually.

The problem here is that "wake_up_var()" doesn't *look* like a low-level
helper. Maybe we could replace the few remaining instances with
__wake_up_var() once the easy cases are fixed??

Thanks,
NeilBrown
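For the plain bit case, the kernel already carries one combined helper
of the kind being discussed, clear_and_wake_up_bit() in
<linux/wait_bit.h>, whose shape is roughly this (paraphrased; check the
header for the exact current form):

	static inline void clear_and_wake_up_bit(int bit, void *word)
	{
		clear_bit_unlock(bit, word);	/* release: orders prior stores */
		smp_mb__after_atomic();		/* order vs waitqueue_active() */
		wake_up_bit(word, bit);
	}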
diff --git a/fs/inode.c b/fs/inode.c
index 154f8689457f..f2a2f6351ec3 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -472,6 +472,16 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
 		inode->i_state |= I_REFERENCED;
 }
 
+struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
+					    struct inode *inode, u32 bit)
+{
+	void *bit_address;
+
+	bit_address = inode_state_wait_address(inode, bit);
+	init_wait_var_entry(wqe, bit_address, 0);
+	return __var_waitqueue(bit_address);
+}
+
 /*
  * Add inode to LRU if needed (inode is unused and clean).
  *
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 23e7d46b818a..a5b036714d74 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -744,6 +744,22 @@ struct inode {
 	void			*i_private; /* fs or device private pointer */
 } __randomize_layout;
 
+/*
+ * Get bit address from inode->i_state to use with wait_var_event()
+ * infrastructure.
+ */
+#define inode_state_wait_address(inode, bit) ((char *)&(inode)->i_state + (bit))
+
+struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
+					    struct inode *inode, u32 bit);
+
+static inline void inode_wake_up_bit(struct inode *inode, u32 bit)
+{
+	/* Ensure @bit will be seen cleared/set when caller is woken up. */
+	smp_mb();
+	wake_up_var(inode_state_wait_address(inode, bit));
+}
+
 struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode);
 
 static inline unsigned int i_blocksize(const struct inode *node)
The i_state member is an unsigned long so that it can be used with the
wait bit infrastructure which expects unsigned long. This wastes 4 bytes
which we're unlikely to ever use. Switch to using the var event wait
mechanism using the address of the bit. Thanks to Linus for the address
idea.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/inode.c         | 10 ++++++++++
 include/linux/fs.h | 16 ++++++++++++++++
 2 files changed, 26 insertions(+)
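For context, a waiter built on the patch's inode_bit_waitqueue() would
look roughly like this (a sketch based on the spin_unlock(); schedule();
spin_lock() pattern the thread attributes to a later patch in the
series, not text taken from that patch; the function name is made up):

	/* Sketch: wait for I_NEW to clear, holding i_lock around
	 * everything except the schedule() itself.
	 */
	static void wait_on_new_inode(struct inode *inode)
	{
		struct wait_bit_queue_entry wqe;
		struct wait_queue_head *wq_head;

		wq_head = inode_bit_waitqueue(&wqe, inode, __I_NEW);
		spin_lock(&inode->i_lock);
		while (inode->i_state & I_NEW) {
			prepare_to_wait_event(wq_head, &wqe.wq_entry,
					      TASK_UNINTERRUPTIBLE);
			if (inode->i_state & I_NEW) {
				spin_unlock(&inode->i_lock);
				schedule();
				spin_lock(&inode->i_lock);
			}
		}
		finish_wait(wq_head, &wqe.wq_entry);
		spin_unlock(&inode->i_lock);
	}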