Message ID: 4c1c7cd404173b9888464a2d7e2a430cc7e18737.1720792415.git.bcodding@redhat.com
State: New
Series: SUNRPC: Fix a race to wake a sync task
On Fri, 2024-07-12 at 09:55 -0400, Benjamin Coddington wrote:
> We've observed NFS clients with sync tasks sleeping in __rpc_execute
> waiting on RPC_TASK_QUEUED that have not responded to a wake-up from
> rpc_make_runnable(). I suspect this problem usually goes unnoticed,
> because on a busy client the task will eventually be re-awoken by another
> task completion or xprt event. However, if the state manager is draining
> the slot table, a sync task missing a wake-up can result in a hung client.
>
> We've been able to prove that the waker in rpc_make_runnable() successfully
> calls wake_up_bit() (ie- there's no race to tk_runstate), but the
> wake_up_bit() call fails to wake the waiter. I suspect the waker is
> missing the load of the bit's wait_queue_head, so waitqueue_active() is
> false. There are some very helpful comments about this problem above
> wake_up_bit(), prepare_to_wait(), and waitqueue_active().
>
> Fix this by inserting smp_mb() before the wake_up_bit(), which pairs with
> prepare_to_wait() calling set_current_state().
>
> Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
> ---
>  net/sunrpc/sched.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
> index 6debf4fd42d4..34b31be75497 100644
> --- a/net/sunrpc/sched.c
> +++ b/net/sunrpc/sched.c
> @@ -369,8 +369,11 @@ static void rpc_make_runnable(struct workqueue_struct *wq,
>  	if (RPC_IS_ASYNC(task)) {
>  		INIT_WORK(&task->u.tk_work, rpc_async_schedule);
>  		queue_work(wq, &task->u.tk_work);
> -	} else
> +	} else {
> +		/* paired with set_current_state() in prepare_to_wait */
> +		smp_mb();
>  		wake_up_bit(&task->tk_runstate, RPC_TASK_QUEUED);
> +	}
>  }
>
>  /*

Nice catch!

Reviewed-by: Jeff Layton <jlayton@kernel.org>
On Fri, 2024-07-12 at 09:55 -0400, Benjamin Coddington wrote:
[ ... ]
> --- a/net/sunrpc/sched.c
> +++ b/net/sunrpc/sched.c
> @@ -369,8 +369,11 @@ static void rpc_make_runnable(struct workqueue_struct *wq,
>  	if (RPC_IS_ASYNC(task)) {
>  		INIT_WORK(&task->u.tk_work, rpc_async_schedule);
>  		queue_work(wq, &task->u.tk_work);
> -	} else
> +	} else {
> +		/* paired with set_current_state() in prepare_to_wait */
> +		smp_mb();

Hmm... Why isn't it sufficient to use smp_mb__after_atomic() here?
That's what clear_and_wake_up_bit() uses in this case.

>  		wake_up_bit(&task->tk_runstate, RPC_TASK_QUEUED);
> +	}
>  }
On 16 Jul 2024, at 9:23, Trond Myklebust wrote:

> On Fri, 2024-07-12 at 09:55 -0400, Benjamin Coddington wrote:
[ ... ]
>> +		/* paired with set_current_state() in prepare_to_wait */
>> +		smp_mb();
>
> Hmm... Why isn't it sufficient to use smp_mb__after_atomic() here?
> That's what clear_and_wake_up_bit() uses in this case.

Great question, one that I can't answer with confidence (which is why I
solicited advice under the RFC posting).

I ended up following the guidance in the comment above wake_up_bit(), since
we clear RPC_TASK_RUNNING non-atomically within the queue's spinlock.  It
states that we'd probably need smp_mb().

However - this race isn't to tk_runstate, it's actually to
waitqueue_active() being true/false.  So - I believe unless we're checking
the bit in the waiter's wait_bit_action function, we really only want to
look at the race in terms of making sure the waiter's setting of the
wait_queue_head is visible after the waker test_and_set()s RPC_TASK_RUNNING.

Ben
On Tue, 2024-07-16 at 09:44 -0400, Benjamin Coddington wrote:
> On 16 Jul 2024, at 9:23, Trond Myklebust wrote:
>> Hmm... Why isn't it sufficient to use smp_mb__after_atomic() here?
>> That's what clear_and_wake_up_bit() uses in this case.
>
[ ... ]
> However - this race isn't to tk_runstate, it's actually to
> waitqueue_active() being true/false.  So - I believe unless we're
> checking the bit in the waiter's wait_bit_action function, we really
> only want to look at the race in terms of making sure the waiter's
> setting of the wait_queue_head is visible after the waker
> test_and_set()s RPC_TASK_RUNNING.

My point is that if we were to call

	clear_and_wake_up_bit(RPC_TASK_QUEUED, &task->tk_runstate);

then that should be sufficient to resolve the race with

	status = out_of_line_wait_on_bit(&task->tk_runstate,
			RPC_TASK_QUEUED, rpc_wait_bit_killable,
			TASK_KILLABLE|TASK_FREEZABLE);

So really, all we should need to do is to add in the
smp_mb__after_atomic(). This is consistent with the guidance in
Documentation/atomic_t.txt and Documentation/atomic_bitops.txt.
On 16 Jul 2024, at 10:37, Trond Myklebust wrote:
> My point is that if we were to call
>
> 	clear_and_wake_up_bit(RPC_TASK_QUEUED, &task->tk_runstate);
>
> then that should be sufficient to resolve the race with
>
> 	status = out_of_line_wait_on_bit(&task->tk_runstate,
> 			RPC_TASK_QUEUED, rpc_wait_bit_killable,
> 			TASK_KILLABLE|TASK_FREEZABLE);
>
> So really, all we should need to do is to add in the
> smp_mb__after_atomic(). This is consistent with the guidance in
> Documentation/atomic_t.txt and Documentation/atomic_bitops.txt.

Point taken, but isn't clear_bit_unlock() (via clear_and_wake_up_bit())
providing a release barrier that clear_bit() (via rpc_clear_running(),
spin_unlock()) is not providing?

I don't think what we're doing is the same, because we conditionally gate
the waker on test_and_set_bit(RPC_TASK_RUNNING) (which provides the atomic
barrier), but order doesn't matter yet since we can still be racing on the
other CPU to set up the wait_bit_queue -- because /that/ is gated on
RPC_TASK_QUEUED.

So I think we need to have a barrier that pairs with set_current_state().
If we /were/ to use clear_and_wake_up_bit(RPC_TASK_QUEUED,
&task->tk_runstate) in rpc_make_runnable(), I think that would also fix
the problem.

I can run the same 50-hour test with smp_mb__after_atomic() here on the
same ppc64le setup where the problem manifests and provide results here.
It will take a couple of days.

Ben
On 16 Jul 2024, at 12:06, Benjamin Coddington wrote:
> On 16 Jul 2024, at 10:37, Trond Myklebust wrote:
>> So really, all we should need to do is to add in the
>> smp_mb__after_atomic(). This is consistent with the guidance in
>> Documentation/atomic_t.txt and Documentation/atomic_bitops.txt.
>
> Point taken, but isn't clear_bit_unlock() (via clear_and_wake_up_bit())
> providing a release barrier that clear_bit() (via rpc_clear_running(),
> spin_unlock()) is not providing?

Another few minutes of looking at it.. and now I think you're right.. the
smp_mb__after_atomic() is for ordering after rpc_clear_queued(), which I
was overlooking in rpc_make_runnable().

I will still kick off my testing, and send along the results.

Thanks for the look.

Ben
On Tue, 2024-07-16 at 12:06 -0400, Benjamin Coddington wrote:
> On 16 Jul 2024, at 10:37, Trond Myklebust wrote:
[ ... ]
> Point taken, but isn't clear_bit_unlock() (via clear_and_wake_up_bit())
> providing a release barrier that clear_bit() (via rpc_clear_running(),
> spin_unlock()) is not providing?

So let's just replace rpc_clear_queued() with a call to
clear_bit_unlock(). That still ends up being less expensive than the full
memory barrier, doesn't it?
On 16 Jul 2024, at 12:32, Trond Myklebust wrote:
> So let's just replace rpc_clear_queued() with a call to
> clear_bit_unlock(). That still ends up being less expensive than the
> full memory barrier, doesn't it?

Yes.  But unless you really want to go this direction, I'll keep testing
your first suggestion to use smp_mb__after_atomic().  That should do the
same thing but also allow us to skip the memory barrier for async tasks.

Ben
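[Editor's note: putting the thread's conclusion together, the
smp_mb__after_atomic() variant Ben is testing would presumably look like
the sketch below. This is an assumption reconstructed from the discussion,
not a posted patch: the surrounding need_wakeup/rpc_clear_queued() context
is inferred from the messages above, and the comment reflects Ben's
realization that the barrier orders after rpc_clear_queued()'s clear_bit().]

	static void rpc_make_runnable(struct workqueue_struct *wq,
				      struct rpc_task *task)
	{
		bool need_wakeup = !rpc_test_and_set_running(task);

		rpc_clear_queued(task);	/* clear_bit(): no implied barrier */
		if (!need_wakeup)
			return;
		if (RPC_IS_ASYNC(task)) {
			INIT_WORK(&task->u.tk_work, rpc_async_schedule);
			queue_work(wq, &task->u.tk_work);
		} else {
			/*
			 * Order the clear_bit() above before the
			 * waitqueue_active() check inside wake_up_bit(),
			 * pairing with set_current_state() in
			 * prepare_to_wait(), so a sync waiter already on
			 * the wait queue cannot be missed.
			 */
			smp_mb__after_atomic();
			wake_up_bit(&task->tk_runstate, RPC_TASK_QUEUED);
		}
	}

As noted in the reply above, this keeps the barrier off the async path,
which queue_work() already handles.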
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 6debf4fd42d4..34b31be75497 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -369,8 +369,11 @@ static void rpc_make_runnable(struct workqueue_struct *wq,
 	if (RPC_IS_ASYNC(task)) {
 		INIT_WORK(&task->u.tk_work, rpc_async_schedule);
 		queue_work(wq, &task->u.tk_work);
-	} else
+	} else {
+		/* paired with set_current_state() in prepare_to_wait */
+		smp_mb();
 		wake_up_bit(&task->tk_runstate, RPC_TASK_QUEUED);
+	}
 }

 /*
We've observed NFS clients with sync tasks sleeping in __rpc_execute
waiting on RPC_TASK_QUEUED that have not responded to a wake-up from
rpc_make_runnable(). I suspect this problem usually goes unnoticed, because
on a busy client the task will eventually be re-awoken by another task
completion or xprt event. However, if the state manager is draining the
slot table, a sync task missing a wake-up can result in a hung client.

We've been able to prove that the waker in rpc_make_runnable() successfully
calls wake_up_bit() (ie- there's no race to tk_runstate), but the
wake_up_bit() call fails to wake the waiter. I suspect the waker is missing
the load of the bit's wait_queue_head, so waitqueue_active() is false.
There are some very helpful comments about this problem above
wake_up_bit(), prepare_to_wait(), and waitqueue_active().

Fix this by inserting smp_mb() before the wake_up_bit(), which pairs with
prepare_to_wait() calling set_current_state().

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
---
 net/sunrpc/sched.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)