
[RFC,v2,1/6] fs: add i_state helpers

Message ID 20240821-work-i_state-v2-1-67244769f102@kernel.org (mailing list archive)
State New
Series inode: turn i_state into u32

Commit Message

Christian Brauner Aug. 21, 2024, 3:47 p.m. UTC
The i_state member is an unsigned long so that it can be used with the
wait bit infrastructure which expects unsigned long. This wastes 4 bytes
which we're unlikely to ever use. Switch to using the var event wait
mechanism using the address of the bit. Thanks to Linus for the address
idea.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/inode.c         | 10 ++++++++++
 include/linux/fs.h | 16 ++++++++++++++++
 2 files changed, 26 insertions(+)

Comments

NeilBrown Aug. 21, 2024, 10:12 p.m. UTC | #1
On Thu, 22 Aug 2024, Christian Brauner wrote:
> The i_state member is an unsigned long so that it can be used with the
> wait bit infrastructure which expects unsigned long. This wastes 4 bytes
> which we're unlikely to ever use. Switch to using the var event wait
> mechanism using the address of the bit. Thanks to Linus for the address
> idea.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  fs/inode.c         | 10 ++++++++++
>  include/linux/fs.h | 16 ++++++++++++++++
>  2 files changed, 26 insertions(+)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 154f8689457f..f2a2f6351ec3 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -472,6 +472,16 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
>  		inode->i_state |= I_REFERENCED;
>  }
>  
> +struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
> +					    struct inode *inode, u32 bit)
> +{
> +        void *bit_address;
> +
> +        bit_address = inode_state_wait_address(inode, bit);
> +        init_wait_var_entry(wqe, bit_address, 0);
> +        return __var_waitqueue(bit_address);
> +}
> +
>  /*
>   * Add inode to LRU if needed (inode is unused and clean).
>   *
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 23e7d46b818a..a5b036714d74 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -744,6 +744,22 @@ struct inode {
>  	void			*i_private; /* fs or device private pointer */
>  } __randomize_layout;
>  
> +/*
> + * Get bit address from inode->i_state to use with wait_var_event()
> + * infrastructure.
> + */
> +#define inode_state_wait_address(inode, bit) ((char *)&(inode)->i_state + (bit))
> +
> +struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
> +					    struct inode *inode, u32 bit);
> +
> +static inline void inode_wake_up_bit(struct inode *inode, u32 bit)
> +{
> +	/* Ensure @bit will be seen cleared/set when caller is woken up. */

The above comment is wrong.  I think I once thought it was correct too
but now I know better (I hope).
A better comment might be
       /* Insert memory barrier as recommended by wake_up_var() */
but even that is unnecessary as we don't need the memory barrier.

A careful reading of memory-barriers.rst shows that *when the process is
actually woken* there are sufficient barriers in wake_up_process() and
prepare_to_wait_event() and the scheduler and (particularly)
set_current_state() so that a value set before the wake_up is seen after
the schedule().

So this barrier isn't about the bit.  This barrier is about the
wait_queue.  In particular it is about waitqueue_active() call at the
start of wake_up_var().  If that test wasn't there and if instead
wake_up_var() conditionally called __wake_up(), then there would be no
need for any barrier.  A comment near wake_up_bit() makes it clear that
the barrier is only needed when the spinlock is avoided.

On a weakly ordered arch, this test can be effected *before* the write
of the bit.  If the waiter adds itself to the wait queue and then tests
the bit before the bit is set, but after the waitqueue_active() test is
put in effect, then the wake_up will never be sent.
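
For illustration, a sketch of the waker side of that generic pattern
(hypothetical flag word, not code from this series); the smp_mb() is
what keeps the waitqueue_active() check inside wake_up_var() from being
performed before the store is visible:

static void set_flag_and_wake(unsigned long *flag)
{
	WRITE_ONCE(*flag, 1);
	smp_mb();		/* store visible before waitqueue_active() */
	wake_up_var(flag);	/* starts with a lockless waitqueue_active() */
}

Without that smp_mb(), a weakly ordered CPU can hoist the
waitqueue_active() load above the store, see an empty queue, skip
__wake_up(), and leave the waiter asleep.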

But ....  this is all academic of this code because you don't need a
barrier at all.  The wake_up happens in a spin_locked region, and the
wait is entirely inside the same spin_lock, except for the schedule.  A
later patch has:
     spin_unlock(); schedule(); spin_lock();

So there is no room for a race.  If the bit is cleared before the
wait_var_event() equivalent, then no wakeup is needed.  When the lock is
dropped after the bit is cleared the unlock will have all the barriers
needed for the bit to be visible.

The only other place that the bit can be cleared is concurrent with the
above schedule() while the spinlock isn't held by the waiter.  In that
case it is again clear that no barrier is needed - or that the
spin_unlock/lock provide all the needed barriers.

So a better comment would be

   /* no barrier needs as both waker and waiter are in spin-locked regions */

Thanks,
NeilBrown


> +	smp_mb();
> +	wake_up_var(inode_state_wait_address(inode, bit));
> +}
> +
>  struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode);
>  
>  static inline unsigned int i_blocksize(const struct inode *node)
> 
> -- 
> 2.43.0
> 
>
Dave Chinner Aug. 21, 2024, 10:28 p.m. UTC | #2
On Wed, Aug 21, 2024 at 05:47:31PM +0200, Christian Brauner wrote:
> The i_state member is an unsigned long so that it can be used with the
> wait bit infrastructure which expects unsigned long. This wastes 4 bytes
> which we're unlikely to ever use. Switch to using the var event wait
> mechanism using the address of the bit. Thanks to Linus for the address
> idea.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  fs/inode.c         | 10 ++++++++++
>  include/linux/fs.h | 16 ++++++++++++++++
>  2 files changed, 26 insertions(+)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 154f8689457f..f2a2f6351ec3 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -472,6 +472,16 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
>  		inode->i_state |= I_REFERENCED;
>  }
>  
> +struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
> +					    struct inode *inode, u32 bit)
> +{
> +        void *bit_address;
> +
> +        bit_address = inode_state_wait_address(inode, bit);
> +        init_wait_var_entry(wqe, bit_address, 0);
> +        return __var_waitqueue(bit_address);
> +}
> +
>  /*
>   * Add inode to LRU if needed (inode is unused and clean).
>   *
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 23e7d46b818a..a5b036714d74 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -744,6 +744,22 @@ struct inode {
>  	void			*i_private; /* fs or device private pointer */
>  } __randomize_layout;
>  
> +/*
> + * Get bit address from inode->i_state to use with wait_var_event()
> + * infrastructure.
> + */
> +#define inode_state_wait_address(inode, bit) ((char *)&(inode)->i_state + (bit))
> +
> +struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
> +					    struct inode *inode, u32 bit);
> +
> +static inline void inode_wake_up_bit(struct inode *inode, u32 bit)
> +{
> +	/* Ensure @bit will be seen cleared/set when caller is woken up. */
> +	smp_mb();
> +	wake_up_var(inode_state_wait_address(inode, bit));
> +}

Why is this memory barrier necessary?

wakeup_var() takes the wait queue head spinlock, as does
prepare_to_wait() before the waiter condition check and the
subsequent finish_wait() when the wait is done.

Hence by the time the functions like inode_wait_for_writeback()
return to the caller, there's been a full unlock->lock cycle on the
wait queue head lock between the bit being set and the waiter waking
and checking it.

ie. the use of the generic waitqueue infrastructure implies a full
memory barrier between the bit being set, the wakeup being issued
and the waiter checking the condition bit.

Hence I'm not sure what race condition this memory barrier is
preventing - is this just a historical artifact that predates the
current wait/wake implementation?

-Dave.
Linus Torvalds Aug. 21, 2024, 10:47 p.m. UTC | #3
On Thu, 22 Aug 2024 at 06:12, NeilBrown <neilb@suse.de> wrote:
>
> So this barrier isn't about the bit.  This barrier is about the
> wait_queue.  In particular it is about waitqueue_active() call

Correct. The waitqueue_active() call is basically very racy, but it's
also often a huge deal for performance.

In *most* cases, wakeups happen with nobody waiting for them at all,
and taking the waitqueue spinlock just to see that there are no
waiters is hugely expensive and easily causes lots of pointless cache
ping-pong.

So the waitqueue_active() is there to basically say "if there are no
waiters, don't bother doing anything".

But *because* it is done - by definition - without holding the
waitqueue lock, it means that a new waiter might have just checked for
the condition, and is about to add itself to the wait queue.

Now, waiters will then re-check *after* they have added themselves to
the waitqueue too (that's what all the wait_for_event etc stuff does),
and the waiter side will have barriers thanks to the spinlocks on the
waitqueue.

But the waker still needs to make sure it's ordered wrt "I changed the
condition that the waiter is waiting for" and the lockless
waitqueue_active() test.

If waiters and wakers are mutually serialized by some external lock,
there are no races.

But commonly, the waker will just do an atomic op, or maybe it holds a
lock (as a writer) that the waiting side does not hold, and the
waiting side just does a "READ_ONCE()" or a "smp_load_acquire()" or
whatever.

So the waker needs to make sure that its state setting is serialized
with the waiter. If it uses our atomics, then typically a
smp_mb__after_atomic() is appropriate (which often ends up being a
no-op).

But commonly you end up needing a smp_mb(), which serializes whatever
action the waker did with the waker then doing that optimistic
lockless waitqueue_active().
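
As a sketch, the atomic-op variant of that waker pattern (hypothetical
helper, not from this series):

static void clear_flag_and_wake(unsigned long *flags, int bit)
{
	clear_bit(bit, flags);		/* non-value-returning atomic: unordered */
	smp_mb__after_atomic();		/* order the clear vs waitqueue_active() */
	wake_up_var(flags);
}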

>    /* no barrier needs as both waker and waiter are in spin-locked regions */

If all the inode state waiters do indeed hold the lock, then that is fine.

               Linus
Linus Torvalds Aug. 21, 2024, 10:53 p.m. UTC | #4
On Thu, 22 Aug 2024 at 06:28, Dave Chinner <david@fromorbit.com> wrote:
>
> wakeup_var() takes the wait queue head spinlock

That's the part you missed.

wake_up_var() does no such thing. It calls __wake_up_bit(), which does this:

        if (waitqueue_active(wq_head))
                __wake_up(wq_head, TASK_NORMAL, 1, &key);

and that waitqueue_active() is lockless.

That's the thing that wants the barrier (or something else that
guarantees that the waiters actually *see* the new state after they've
added themselves to the wait-list, and before we then look at the
wait-list being empty).

See the comment in <linux/wait.h> above the waitqueue_active().

And yes, this is annoying and subtle and has caused bugs. But taking
waitqueue locks just to notice nobody is waiting is very expensive
indeed, and much more expensive than a memory barrier when the common
case is typically often "nobody is waiting".

               Linus
Dave Chinner Aug. 21, 2024, 11:34 p.m. UTC | #5
On Thu, Aug 22, 2024 at 08:12:15AM +1000, NeilBrown wrote:
> On Thu, 22 Aug 2024, Christian Brauner wrote:
> > The i_state member is an unsigned long so that it can be used with the
> > wait bit infrastructure which expects unsigned long. This wastes 4 bytes
> > which we're unlikely to ever use. Switch to using the var event wait
> > mechanism using the address of the bit. Thanks to Linus for the address
> > idea.
> > 
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > ---
> >  fs/inode.c         | 10 ++++++++++
> >  include/linux/fs.h | 16 ++++++++++++++++
> >  2 files changed, 26 insertions(+)
> > 
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 154f8689457f..f2a2f6351ec3 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -472,6 +472,16 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
> >  		inode->i_state |= I_REFERENCED;
> >  }
> >  
> > +struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
> > +					    struct inode *inode, u32 bit)
> > +{
> > +        void *bit_address;
> > +
> > +        bit_address = inode_state_wait_address(inode, bit);
> > +        init_wait_var_entry(wqe, bit_address, 0);
> > +        return __var_waitqueue(bit_address);
> > +}
> > +
> >  /*
> >   * Add inode to LRU if needed (inode is unused and clean).
> >   *
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 23e7d46b818a..a5b036714d74 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -744,6 +744,22 @@ struct inode {
> >  	void			*i_private; /* fs or device private pointer */
> >  } __randomize_layout;
> >  
> > +/*
> > + * Get bit address from inode->i_state to use with wait_var_event()
> > + * infrastructure.
> > + */
> > +#define inode_state_wait_address(inode, bit) ((char *)&(inode)->i_state + (bit))
> > +
> > +struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
> > +					    struct inode *inode, u32 bit);
> > +
> > +static inline void inode_wake_up_bit(struct inode *inode, u32 bit)
> > +{
> > +	/* Ensure @bit will be seen cleared/set when caller is woken up. */
> 
> The above comment is wrong.  I think I once thought it was correct too
> but now I know better (I hope).
> A better comment might be
>        /* Insert memory barrier as recommended by wake_up_var() */
> but even that is unnecessary as we don't need the memory barrier.
> 
> A careful reading of memory-barriers.rst shows that *when the process is
> actually woken* there are sufficient barriers in wake_up_process() and
> prepare_to_wait_event() and the scheduler and (particularly)
> set_current_state() so that a value set before the wake_up is seen after
> the schedule().
> 
> So this barrier isn't about the bit.  This barrier is about the
> wait_queue.  In particular it is about waitqueue_active() call at the
> start of wake_up_var().  If that test wasn't there and if instead
> wake_up_var() conditionally called __wake_up(), then there would be no
> need for any barrier.  A comment near wake_up_bit() makes it clear that
> the barrier is only needed when the spinlock is avoided.

Oh, I missed that *landmine*.

So almost none of the current uses of wake_up_var() have an explicit
memory barrier before them, and a lot of them are not done under
spin locks. I suspect that the implicit store ordering of the wake
queue check is mostly only handled by chance - an
atomic_dec_and_test() or smp_store_release() call is made before
wake_up_var() or the wait/wakeup is synchronised by an external
lock.

> On a weakly ordered arch, this test can be effected *before* the write
> of the bit.  If the waiter adds itself to the wait queue and then tests
> the bit before the bit is set, but after the waitqueue_active() test is
> put in effect, then the wake_up will never be sent.
> 
> But ....  this is all academic of this code because you don't need a
> barrier at all.  The wake_up happens in a spin_locked region, and the
> wait is entirely inside the same spin_lock, except for the schedule.  A
> later patch has:
>      spin_unlock(); schedule(); spin_lock();
> 
> So there is no room for a race.  If the bit is cleared before the
> wait_var_event() equivalent, then no wakeup is needed.  When the lock is
> dropped after the bit is cleared the unlock will have all the barriers
> needed for the bit to be visible.

Right, that's exactly the problem with infrastructure that
externalises memory ordering dependencies. Most of the time they
just work because of external factors, but sometimes they don't and
we get random memory barriers being cargo culted around the place
because that seems to fix weird problems....

> The only other place that the bit can be cleared is concurrent with the
> above schedule() while the spinlock isn't held by the waiter.  In that
> case it is again clear that no barrier is needed - or that the
> spin_unlock/lock provide all the needed barriers.
> 
> So a better comment would be
> 
>    /* no barrier needs as both waker and waiter are in spin-locked regions */
           ^^^^^^^
	   ordering

Even better would be to fix wake_up_var() to not have an implicit
ordering requirement. Add __wake_up_var() as the "I know what I'm
doing" API, and have wake_up_var() always issue the memory barrier
like so:

__wake_up_var(var)
{
	__wake_up_bit(....);
}

wake_up_var(var)
{
	smp_mb();
	__wake_up_var(var);
}

Then people who don't intimately understand ordering (i.e. just want
to use a conditional wait variable) or just don't want to
think about complex memory ordering issues for otherwise obvious and
simple code can simply use the safe variant.

Those that need to optimise for speed or know they have solid
ordering or external serialisation can use __wake_up_var() and leave
a comment as to why that is safe. i.e.:

	/*
	 * No wait/wakeup ordering required as both waker and waiters
	 * are serialised by the inode->i_lock.
	 */
	__wake_up_var(....);

-Dave.
Christian Brauner Aug. 22, 2024, 8:27 a.m. UTC | #6
On Thu, Aug 22, 2024 at 08:12:15AM GMT, NeilBrown wrote:
> On Thu, 22 Aug 2024, Christian Brauner wrote:
> > The i_state member is an unsigned long so that it can be used with the
> > wait bit infrastructure which expects unsigned long. This wastes 4 bytes
> > which we're unlikely to ever use. Switch to using the var event wait
> > mechanism using the address of the bit. Thanks to Linus for the address
> > idea.
> > 
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > ---
> >  fs/inode.c         | 10 ++++++++++
> >  include/linux/fs.h | 16 ++++++++++++++++
> >  2 files changed, 26 insertions(+)
> > 
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 154f8689457f..f2a2f6351ec3 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -472,6 +472,16 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
> >  		inode->i_state |= I_REFERENCED;
> >  }
> >  
> > +struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
> > +					    struct inode *inode, u32 bit)
> > +{
> > +        void *bit_address;
> > +
> > +        bit_address = inode_state_wait_address(inode, bit);
> > +        init_wait_var_entry(wqe, bit_address, 0);
> > +        return __var_waitqueue(bit_address);
> > +}
> > +
> >  /*
> >   * Add inode to LRU if needed (inode is unused and clean).
> >   *
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 23e7d46b818a..a5b036714d74 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -744,6 +744,22 @@ struct inode {
> >  	void			*i_private; /* fs or device private pointer */
> >  } __randomize_layout;
> >  
> > +/*
> > + * Get bit address from inode->i_state to use with wait_var_event()
> > + * infrastructure.
> > + */
> > +#define inode_state_wait_address(inode, bit) ((char *)&(inode)->i_state + (bit))
> > +
> > +struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
> > +					    struct inode *inode, u32 bit);
> > +
> > +static inline void inode_wake_up_bit(struct inode *inode, u32 bit)
> > +{
> > +	/* Ensure @bit will be seen cleared/set when caller is woken up. */
> 
> The above comment is wrong.  I think I once thought it was correct too
> but now I know better (I hope).
> A better comment might be
>        /* Insert memory barrier as recommended by wake_up_var() */
> but even that is unnecessary as we don't need the memory barrier.
> 
> A careful reading of memory-barriers.rst shows that *when the process is
> actually woken* there are sufficient barriers in wake_up_process() and
> prepare_to_wait_event() and the scheduler and (particularly)
> set_current_state() so that a value set before the wake_up is seen after
> the schedule().
> 
> So this barrier isn't about the bit.  This barrier is about the
> wait_queue.  In particular it is about waitqueue_active() call at the
> start of wake_up_var().  If that test wasn't there and if instead
> wake_up_var() conditionally called __wake_up(), then there would be no

Did you mean "unconditionally called"?

> need for any barrier.  A comment near wake_up_bit() makes it clear that
> the barrier is only needed when the spinlock is avoided.
> 
> On a weakly ordered arch, this test can be effected *before* the write
> of the bit.  If the waiter adds itself to the wait queue and then tests
> the bit before the bit is set, but after the waitqueue_active() test is
> put in effect, then the wake_up will never be sent.
> 
> But ....  this is all academic of this code because you don't need a
> barrier at all.  The wake_up happens in a spin_locked region, and the
> wait is entirely inside the same spin_lock, except for the schedule.  A
> later patch has:
>      spin_unlock(); schedule(); spin_lock();
> 
> So there is no room for a race.  If the bit is cleared before the
> wait_var_event() equivalent, then no wakeup is needed.  When the lock is
> dropped after the bit is cleared the unlock will have all the barriers
> needed for the bit to be visible.
> 
> The only other place that the bit can be cleared is concurrent with the
> above schedule() while the spinlock isn't held by the waiter.  In that
> case it is again clear that no barrier is needed - or that the
> spin_unlock/lock provide all the needed barriers.
> 
> So a better comment would be
> 
>    /* no barrier needs as both waker and waiter are in spin-locked regions */

Thanks for the analysis. I was under the impression that wait_on_inode()
was called in contexts where no barrier is guaranteed and the bit isn't
checked with spin_lock() held.
Linus Torvalds Aug. 22, 2024, 8:37 a.m. UTC | #7
On Thu, 22 Aug 2024 at 16:27, Christian Brauner <brauner@kernel.org> wrote:
>
> >
> >    /* no barrier needs as both waker and waiter are in spin-locked regions */
>
> Thanks for the analysis. I was under the impression that wait_on_inode()
> was called in contexts where no barrier is guaranteed and the bit isn't
> checked with spin_lock() held.

Yes, it does look like the barrier is needed, because the waiter does
not hold the lock indeed.

                Linus
NeilBrown Aug. 23, 2024, 12:08 a.m. UTC | #8
On Thu, 22 Aug 2024, Dave Chinner wrote:
> 
> Even better would be to fix wake_up_var() to not have an implicit
> ordering requirement. Add __wake_up_var() as the "I know what I'm
> doing" API, and have wake_up_var() always issue the memory barrier
> like so:
> 
> __wake_up_var(var)
> {
> 	__wake_up_bit(....);
> }
> 
> wake_up_var(var)
> {
> 	smp_mb();
> 	__wake_up_var(var);
> }

This is very close to what I proposed
 https://lore.kernel.org/all/20240819053605.11706-9-neilb@suse.de/
but Linus doesn't like that approach.
I would prefer a collection of interfaces which combine the update to
the variable with the wake_up.
So 
   atomic_dec_and_wake()
   store_release_and_wake()

though there are a few that don't fit into a common pattern.
Sometimes a minor code re-write can make them fit.  Other times I'm not
so sure.
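
One possible shape for such a combined helper, as a sketch only (this
is not an existing API): because the value-returning atomic is fully
ordered, no extra barrier is needed before the wakeup:

static inline bool atomic_dec_and_wake(atomic_t *v)
{
	if (!atomic_dec_and_test(v))	/* fully ordered RMW */
		return false;
	wake_up_var(v);
	return true;
}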

I'm glad I'm not the only one who thought that adding a barrier to
wake_up_var() was a good idea though - thanks.

NeilBrown


> 
> Then people who don't intimately understand ordering (i.e. just want
> to use a conditional wait variable) or just don't want to
> think about complex memory ordering issues for otherwise obvious and
> simple code can simply use the safe variant.
> 
> Those that need to optimise for speed or know they have solid
> ordering or external serialisation can use __wake_up_var() and leave
> a comment as to why that is safe. i.e.:
> 
> 	/*
> 	 * No wait/wakeup ordering required as both waker and waiters
> 	 * are serialised by the inode->i_lock.
> 	 */
> 	__wake_up_var(....);
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
NeilBrown Aug. 23, 2024, 12:14 a.m. UTC | #9
On Thu, 22 Aug 2024, Christian Brauner wrote:
> > 
> > So this barrier isn't about the bit.  This barrier is about the
> > wait_queue.  In particular it is about waitqueue_active() call at the
> > start of wake_up_var().  If that test wasn't there and if instead
> > wake_up_var() conditionally called __wake_up(), then there would be no
> 
> Did you mean "unconditionally called"?

Yes :-)


> 
> > need for any barrier.  A comment near wake_up_bit() makes it clear that
> > the barrier is only needed when the spinlock is avoided.
> > 
> > On a weakly ordered arch, this test can be effected *before* the write
> > of the bit.  If the waiter adds itself to the wait queue and then tests
> > the bit before the bit is set, but after the waitqueue_active() test is
> > put in effect, then the wake_up will never be sent.
> > 
> > But ....  this is all academic of this code because you don't need a
> > barrier at all.  The wake_up happens in a spin_locked region, and the
> > wait is entirely inside the same spin_lock, except for the schedule.  A
> > later patch has:
> >      spin_unlock(); schedule(); spin_lock();
> > 
> > So there is no room for a race.  If the bit is cleared before the
> > wait_var_event() equivalent, then no wakeup is needed.  When the lock is
> > dropped after the bit is cleared the unlock will have all the barriers
> > needed for the bit to be visible.
> > 
> > The only other place that the bit can be cleared is concurrent with the
> > above schedule() while the spinlock isn't held by the waiter.  In that
> > case it is again clear that no barrier is needed - or that the
> > spin_unlock/lock provide all the needed barriers.
> > 
> > So a better comment would be
> > 
> >    /* no barrier needs as both waker and waiter are in spin-locked regions */
> 
> Thanks for the analysis. I was under the impression that wait_on_inode()
> was called in contexts where no barrier is guaranteed and the bit isn't
> checked with spin_lock() held.
> 

Yes it is.  I was looking at the I_SYNC waiter and mistakenly thought it
was a model for the others.
A wake_up for I_SYNC being cleared doesn't need the barrier.

Maybe inode_wake_up_bit() should not have a barrier and the various
callers should add whichever barrier is appropriate.  That is the model
that Linus prefers for wake_up_bit() and for consistency it should apply
to inode_wake_up_bit() too.

Thanks,
NeilBrown
Linus Torvalds Aug. 23, 2024, 2:52 a.m. UTC | #10
On Fri, 23 Aug 2024 at 08:14, NeilBrown <neilb@suse.de> wrote:
>
> Maybe inode_wake_up_bit() should not have a barrier and the various
> callers should add whichever barrier is appropriate.  That is the model
> that Linus prefers for wake_up_bit() and for consistency it should apply
> to inode_wake_up_bit() too.

I think that for the wake_up_bit() cases in particular, it's fairly
easy to add whatever "clear_and_wakeup()" helpers. After all, there's
only so much you can do to one bit, and I think merging the "act" with
the "wakeup" into one routine is not only going to help fix the memory
ordering problem, I think it's going to result in more readable code
even aside from any memory ordering issues.

The wake_up_var() infrastructure that the inode code uses is a bit
more involved. Not only can the variable be anything at all (so the
operations you can do on it are obviously largely unbounded), but the
inode hack in particular then uses one thing for the actual variable,
and another thing for the address that is used to match up waits and
wakeups.

So I suspect the inode code will have to do its thing explicitly with
the low-level helpers and deal with the memory ordering issues on its
own, adding the smp_mb() manually.

                  Linus
Linus Torvalds Aug. 23, 2024, 3:05 a.m. UTC | #11
On Fri, 23 Aug 2024 at 10:52, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The wake_up_var() infrastructure that the inode code uses is a bit
> more involved. Not only can the variable be anything at all (so the
> operations you can do on it are obviously largely unbounded), but the
> inode hack in particular then uses one thing for the actual variable,
> and another thing for the address that is used to match up waits and
> wakeups.

.. btw, that doesn't mean we can't have helpers for the common cases.
They might have to be macros (so that they just work regardless of the
type), but having a

    set_var_and_wake(var, value);

macro that just expands to something like (completely untested "maybe
this works" macro):

  #define set_var_and_wake(var,value) do {        \
        __auto_type __set_ptr = &(var);          \
        *(__set_ptr) = (value);                 \
        smp_mb();                               \
        wake_up_var(__set_ptr);                 \
  } while (0)

doesn't sound too bad for at least some common patterns.

Looking around, we do seem to have a pattern of

   smp_store_release() -> wake_up_var()

instead of a memory barrier. I don't think that actually works. The
smp_store_release() means that *earlier* accesses will be bounded by
the store operation, but *later* accesses - including very much the
"look if the wait queue is empty" check - are totally unordered by it,
and can be done before the store by the CPU.

But I haven't thought deeply about it, that was just my gut reaction
when seeing the pattern. It superficially makes sense, but I think
it's entirely wrong (it's a "smp_load_acquire()" that would order with
later accesses, but there is no "store with acquire semantics"
operation).

              Linus
Linus Torvalds Aug. 23, 2024, 3:44 a.m. UTC | #12
On Fri, 23 Aug 2024 at 11:05, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Looking around, we do seem to have a pattern of
>
>    smp_store_release() -> wake_up_var()
>
> instead of a memory barrier. I don't think that actually works.

Hmm. It might not work for the wakeup race, but it might be a good
idea to do the store_release anyway, just for the actual user (ie then
the *use* of the variable may start with a "smp_load_acquire()", and
the release->acquire semantics means that everything that was done
before the release is visible after the acquire).

Of course, the smp_mb() will force that ordering too, but for a true
concurrent user that doesn't actually need waking up, it might see the
newly stored var value before the smp_mb() happens on the generating
side.

End result: those code paths may want *both* the smp_store_release()
and the smp_mb(), because they are ordering different accesses.
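
As a sketch of that combination (hypothetical variable, not code from
this series): the release store orders everything done before the value
is published against a reader's smp_load_acquire(), while the smp_mb()
orders the store against the lockless waitqueue_active() check inside
wake_up_var():

static void publish_and_wake(unsigned long *var, unsigned long val)
{
	smp_store_release(var, val);	/* pairs with smp_load_acquire() in readers */
	smp_mb();			/* store visible before waitqueue_active() */
	wake_up_var(var);
}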

Just goes to show that this whole thing is more complicated than just
"wait for a value".

                Linus
NeilBrown Aug. 23, 2024, 5:01 a.m. UTC | #13
On Fri, 23 Aug 2024, Linus Torvalds wrote:
> On Fri, 23 Aug 2024 at 08:14, NeilBrown <neilb@suse.de> wrote:
> >
> > Maybe inode_wake_up_bit() should not have a barrier and the various
> > callers should add whichever barrier is appropriate.  That is the model
> > that Linus prefers for wake_up_bit() and for consistency it should apply
> > to inode_wake_up_bit() too.
> 
> I think that for the wake_up_bit() cases in particular, it's fairly
> easy to add whatever "clear_and_wakeup()" helpers. After all, there's
> only so much you can do to one bit, and I think merging the "act" with
> the "wakeup" into one routine is not only going to help fix the memory
> ordering problem, I think it's going to result in more readable code
> even aside from any memory ordering issues.

The patterns that don't easily fit a helper include:

  - skipping the wake_up when there is no hurry.
    nfs_page_clear_headlock() does this if PG_CONTENDED1
    isn't set.  intel_recv_event() does something very similar
    if STATE_FIRMWARE_LOADED isn't set.
    I imagine changing these to
         if (need a wakeup)
             clear_bit_and_wakeup(...)
         else
             clear_bit()
    rpc_make_runnable() has three cases and maybe each would need
    a separate clear_bit in order to have a clear_bit adjacent to
    the wake_up().  That might actually be a good thing, I haven't
    really tried yet.

  - key_reject_and_link() separates the test_and_clear from the
    wake_up_bit() for reasons that aren't completely obvious to me.
    __key_instantiate_and_link() does the same.
    Maybe they just move the wake_up outside the mutex....

> 
> The wake_up_var() infrastructure that the inode code uses is a bit
> more involved. Not only can the variable be anything at all (so the
> operations you can do on it are obviously largely unbounded), but the
> inode hack in particular then uses one thing for the actual variable,
> and another thing for the address that is used to match up waits and
> wakeups.
> 
> So I suspect the inode code will have to do its thing explicitly with
> the low-level helpers and deal with the memory ordering issues on its
> own, adding the smp_mb() manually.

The problem here is that "wake_up_var()" doesn't *look* like a
low-level helper.  Maybe we could replace the few remaining instances
with __wake_up_var() once the easy cases are fixed??

Thanks,
NeilBrown

Patch

diff --git a/fs/inode.c b/fs/inode.c
index 154f8689457f..f2a2f6351ec3 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -472,6 +472,16 @@  static void __inode_add_lru(struct inode *inode, bool rotate)
 		inode->i_state |= I_REFERENCED;
 }
 
+struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
+					    struct inode *inode, u32 bit)
+{
+        void *bit_address;
+
+        bit_address = inode_state_wait_address(inode, bit);
+        init_wait_var_entry(wqe, bit_address, 0);
+        return __var_waitqueue(bit_address);
+}
+
 /*
  * Add inode to LRU if needed (inode is unused and clean).
  *
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 23e7d46b818a..a5b036714d74 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -744,6 +744,22 @@  struct inode {
 	void			*i_private; /* fs or device private pointer */
 } __randomize_layout;
 
+/*
+ * Get bit address from inode->i_state to use with wait_var_event()
+ * infrastructure.
+ */
+#define inode_state_wait_address(inode, bit) ((char *)&(inode)->i_state + (bit))
+
+struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
+					    struct inode *inode, u32 bit);
+
+static inline void inode_wake_up_bit(struct inode *inode, u32 bit)
+{
+	/* Ensure @bit will be seen cleared/set when caller is woken up. */
+	smp_mb();
+	wake_up_var(inode_state_wait_address(inode, bit));
+}
+
 struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode);
 
 static inline unsigned int i_blocksize(const struct inode *node)
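
For reference, a rough sketch (not part of this patch) of how a waiter
might use inode_bit_waitqueue() under inode->i_lock, following the
spin_unlock(); schedule(); spin_lock() pattern discussed in the thread;
the helper name and shape are illustrative only:

static void inode_wait_for_state_bit(struct inode *inode, u32 bit)
{
	struct wait_bit_queue_entry wqe;
	struct wait_queue_head *wq_head;

	/* Entered and exited with inode->i_lock held. */
	wq_head = inode_bit_waitqueue(&wqe, inode, bit);
	for (;;) {
		prepare_to_wait_event(wq_head, &wqe.wq_entry,
				      TASK_UNINTERRUPTIBLE);
		/* The bit is re-checked under inode->i_lock. */
		if (!(inode->i_state & (1U << bit)))
			break;
		spin_unlock(&inode->i_lock);
		schedule();
		spin_lock(&inode->i_lock);
	}
	finish_wait(wq_head, &wqe.wq_entry);
}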