Message ID: 4c1c7cd404173b9888464a2d7e2a430cc7e18737.1720792415.git.bcodding@redhat.com
State: New
Series: SUNRPC: Fix a race to wake a sync task
On Fri, 2024-07-12 at 09:55 -0400, Benjamin Coddington wrote:
> We've observed NFS clients with sync tasks sleeping in __rpc_execute
> waiting on RPC_TASK_QUEUED that have not responded to a wake-up from
> rpc_make_runnable(). I suspect this problem usually goes unnoticed,
> because on a busy client the task will eventually be re-awoken by another
> task completion or xprt event. However, if the state manager is draining
> the slot table, a sync task missing a wake-up can result in a hung client.
>
> We've been able to prove that the waker in rpc_make_runnable() successfully
> calls wake_up_bit() (ie- there's no race to tk_runstate), but the
> wake_up_bit() call fails to wake the waiter. I suspect the waker is
> missing the load of the bit's wait_queue_head, so waitqueue_active() is
> false. There are some very helpful comments about this problem above
> wake_up_bit(), prepare_to_wait(), and waitqueue_active().
>
> Fix this by inserting smp_mb() before the wake_up_bit(), which pairs with
> prepare_to_wait() calling set_current_state().
>
> Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
> ---
>  net/sunrpc/sched.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
> index 6debf4fd42d4..34b31be75497 100644
> --- a/net/sunrpc/sched.c
> +++ b/net/sunrpc/sched.c
> @@ -369,8 +369,11 @@ static void rpc_make_runnable(struct workqueue_struct *wq,
>  	if (RPC_IS_ASYNC(task)) {
>  		INIT_WORK(&task->u.tk_work, rpc_async_schedule);
>  		queue_work(wq, &task->u.tk_work);
> -	} else
> +	} else {
> +		/* paired with set_current_state() in prepare_to_wait */
> +		smp_mb();
>  		wake_up_bit(&task->tk_runstate, RPC_TASK_QUEUED);
> +	}
>  }
>
>  /*

Nice catch!

Reviewed-by: Jeff Layton <jlayton@kernel.org>
On Fri, 2024-07-12 at 09:55 -0400, Benjamin Coddington wrote:
[ ... ]
> --- a/net/sunrpc/sched.c
> +++ b/net/sunrpc/sched.c
> @@ -369,8 +369,11 @@ static void rpc_make_runnable(struct workqueue_struct *wq,
>  	if (RPC_IS_ASYNC(task)) {
>  		INIT_WORK(&task->u.tk_work, rpc_async_schedule);
>  		queue_work(wq, &task->u.tk_work);
> -	} else
> +	} else {
> +		/* paired with set_current_state() in prepare_to_wait */
> +		smp_mb();

Hmm... Why isn't it sufficient to use smp_mb__after_atomic() here?
That's what clear_and_wake_up_bit() uses in this case.

>  		wake_up_bit(&task->tk_runstate, RPC_TASK_QUEUED);
> +	}
>  }
On 16 Jul 2024, at 9:23, Trond Myklebust wrote:

> On Fri, 2024-07-12 at 09:55 -0400, Benjamin Coddington wrote:
[ ... ]
>> +		/* paired with set_current_state() in prepare_to_wait */
>> +		smp_mb();
>
> Hmm... Why isn't it sufficient to use smp_mb__after_atomic() here?
> That's what clear_and_wake_up_bit() uses in this case.

Great question, one that I can't answer with confidence (which is why I
solicited advice under the RFC posting).

I ended up following the guidance in the comment above wake_up_bit(), since
we clear RPC_TASK_RUNNING non-atomically within the queue's spinlock.  It
states that we'd probably need smp_mb().

However - this race isn't to tk_runstate, it's actually to
waitqueue_active() being true/false.  So - I believe unless we're checking
the bit in the waiter's wait_bit_action function, we really only want to
look at the race in terms of making sure the waiter's setting of the
wait_queue_head is visible after the waker test_and_set()s RPC_TASK_RUNNING.

Ben
On Tue, 2024-07-16 at 09:44 -0400, Benjamin Coddington wrote:
> On 16 Jul 2024, at 9:23, Trond Myklebust wrote:
>> Hmm... Why isn't it sufficient to use smp_mb__after_atomic() here?
>> That's what clear_and_wake_up_bit() uses in this case.
>
[ ... ]
> However - this race isn't to tk_runstate, it's actually to
> waitqueue_active() being true/false.  So - I believe unless we're
> checking the bit in the waiter's wait_bit_action function, we really
> only want to look at the race in terms of making sure the waiter's
> setting of the wait_queue_head is visible after the waker
> test_and_set()s RPC_TASK_RUNNING.

My point is that if we were to call

	clear_and_wake_up_bit(RPC_TASK_QUEUED, &task->tk_runstate);

then that should be sufficient to resolve the race with

	status = out_of_line_wait_on_bit(&task->tk_runstate,
			RPC_TASK_QUEUED, rpc_wait_bit_killable,
			TASK_KILLABLE|TASK_FREEZABLE);

So really, all we should need to do is to add in the
smp_mb__after_atomic(). This is consistent with the guidance in
Documentation/atomic_t.txt and Documentation/atomic_bitops.txt.
On 16 Jul 2024, at 10:37, Trond Myklebust wrote:
> My point is that if we were to call
>
> 	clear_and_wake_up_bit(RPC_TASK_QUEUED, &task->tk_runstate);
>
> then that should be sufficient to resolve the race with
>
> 	status = out_of_line_wait_on_bit(&task->tk_runstate,
> 			RPC_TASK_QUEUED, rpc_wait_bit_killable,
> 			TASK_KILLABLE|TASK_FREEZABLE);
>
> So really, all we should need to do is to add in the
> smp_mb__after_atomic(). This is consistent with the guidance in
> Documentation/atomic_t.txt and Documentation/atomic_bitops.txt.

Point taken, but isn't clear_bit_unlock() (via clear_and_wake_up_bit())
providing a release barrier that clear_bit() (via rpc_clear_running(),
spin_unlock()) is not providing?

I don't think what we're doing is the same, because we conditionally gate
the waker on test_and_set_bit(RPC_TASK_RUNNING) (which provides the atomic
barrier), but order doesn't matter yet since we can still be racing on the
other CPU to set up the wait_bit_queue -- because /that/ is gated on
RPC_TASK_QUEUED.

So I think we need to have a barrier that pairs with set_current_state().
If we /were/ to use clear_and_wake_up_bit(RPC_TASK_QUEUED,
&task->tk_runstate) in rpc_make_runnable(), I think that would also fix
the problem.

I can run the same 50-hour test with smp_mb__after_atomic() here on the
same ppc64le setup where the problem manifests and provide results here.
It will take a couple of days.

Ben
On 16 Jul 2024, at 12:06, Benjamin Coddington wrote:
> On 16 Jul 2024, at 10:37, Trond Myklebust wrote:
>> So really, all we should need to do is to add in the
>> smp_mb__after_atomic(). This is consistent with the guidance in
>> Documentation/atomic_t.txt and Documentation/atomic_bitops.txt.
>
> Point taken, but isn't clear_bit_unlock() (via clear_and_wake_up_bit())
> providing a release barrier that clear_bit() (via rpc_clear_running(),
> spin_unlock()) is not providing?

Another few minutes of looking at it.. and now I think you're right.. the
smp_mb__after_atomic() is for ordering after rpc_clear_queued(), which I
was overlooking in rpc_make_runnable().

I will still kick off my testing, and send along the results.

Thanks for the look.

Ben
On Tue, 2024-07-16 at 12:06 -0400, Benjamin Coddington wrote:
> On 16 Jul 2024, at 10:37, Trond Myklebust wrote:
[ ... ]
> Point taken, but isn't clear_bit_unlock() (via clear_and_wake_up_bit())
> providing a release barrier that clear_bit() (via rpc_clear_running(),
> spin_unlock()) is not providing?

So let's just replace rpc_clear_queued() with a call to
clear_bit_unlock(). That still ends up being less expensive than the full
memory barrier, doesn't it?
On 16 Jul 2024, at 12:32, Trond Myklebust wrote:
> So let's just replace rpc_clear_queued() with a call to
> clear_bit_unlock(). That still ends up being less expensive than the
> full memory barrier, doesn't it?

Yes.  But unless you really want to go this direction, I'll keep testing
your first suggestion to use smp_mb__after_atomic().  That should do the
same thing but also allow us to skip the memory barrier for async tasks.

Ben
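[Editor's note: putting the thread's conclusion together, the
smp_mb__after_atomic() variant Ben is testing would presumably look like
the sketch below. This is an assumption reconstructed from the discussion,
not a posted patch: the surrounding need_wakeup/rpc_clear_queued() context
is inferred from the messages above, and the comment reflects Ben's
realization that the barrier orders after rpc_clear_queued()'s clear_bit().]

	static void rpc_make_runnable(struct workqueue_struct *wq,
				      struct rpc_task *task)
	{
		bool need_wakeup = !rpc_test_and_set_running(task);

		rpc_clear_queued(task);	/* clear_bit(): no implied barrier */
		if (!need_wakeup)
			return;
		if (RPC_IS_ASYNC(task)) {
			INIT_WORK(&task->u.tk_work, rpc_async_schedule);
			queue_work(wq, &task->u.tk_work);
		} else {
			/*
			 * Order the clear_bit() above before the
			 * waitqueue_active() check inside wake_up_bit(),
			 * pairing with set_current_state() in
			 * prepare_to_wait(), so a sync waiter already on
			 * the wait queue cannot be missed.
			 */
			smp_mb__after_atomic();
			wake_up_bit(&task->tk_runstate, RPC_TASK_QUEUED);
		}
	}

As noted in the reply above, this keeps the barrier off the async path,
which queue_work() already handles.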
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 6debf4fd42d4..34b31be75497 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -369,8 +369,11 @@ static void rpc_make_runnable(struct workqueue_struct *wq,
 	if (RPC_IS_ASYNC(task)) {
 		INIT_WORK(&task->u.tk_work, rpc_async_schedule);
 		queue_work(wq, &task->u.tk_work);
-	} else
+	} else {
+		/* paired with set_current_state() in prepare_to_wait */
+		smp_mb();
 		wake_up_bit(&task->tk_runstate, RPC_TASK_QUEUED);
+	}
 }

 /*
We've observed NFS clients with sync tasks sleeping in __rpc_execute
waiting on RPC_TASK_QUEUED that have not responded to a wake-up from
rpc_make_runnable(). I suspect this problem usually goes unnoticed, because
on a busy client the task will eventually be re-awoken by another task
completion or xprt event. However, if the state manager is draining the
slot table, a sync task missing a wake-up can result in a hung client.

We've been able to prove that the waker in rpc_make_runnable() successfully
calls wake_up_bit() (ie- there's no race to tk_runstate), but the
wake_up_bit() call fails to wake the waiter. I suspect the waker is missing
the load of the bit's wait_queue_head, so waitqueue_active() is false.
There are some very helpful comments about this problem above
wake_up_bit(), prepare_to_wait(), and waitqueue_active().

Fix this by inserting smp_mb() before the wake_up_bit(), which pairs with
prepare_to_wait() calling set_current_state().

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
---
 net/sunrpc/sched.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)