
[3/9] rcu/tree: Reduce wake up for synchronize_rcu() common case

Message ID 20240604222355.2370768-3-paulmck@kernel.org (mailing list archive)
State: New, archived
Series: Miscellaneous fixes for v6.11

Commit Message

Paul E. McKenney June 4, 2024, 10:23 p.m. UTC
From: "Joel Fernandes (Google)" <joel@joelfernandes.org>

In the synchronize_rcu() common case, we will have fewer than
SR_MAX_USERS_WAKE_FROM_GP users per GP. Waking up the kworker just to
free the last injected wait head is pointless because, at that point,
all of the users have already been awakened.

Introduce a new counter to track this and prevent the wakeup in the
common case.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tree.c | 35 ++++++++++++++++++++++++++++++-----
 kernel/rcu/tree.h |  1 +
 2 files changed, 31 insertions(+), 5 deletions(-)
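
For reference, the core of what the patch adds, condensed from the hunks in the diff at the bottom of this page (a simplified sketch, not the literal patch; all identifiers are taken from the diff):

/* In rcu_sr_normal_gp_cleanup(): skip the kworker wakeup in the common case. */
if (wait_tail->next && wait_tail->next->next == NULL &&
    rcu_sr_is_wait_head(wait_tail->next) &&
    !atomic_read_acquire(&rcu_state.srs_cleanups_pending)) {
	/* Nothing left but a wait head: free it here instead of waking the kworker. */
	rcu_sr_put_wait_head(wait_tail->next);
	wait_tail->next = NULL;
}

smp_store_release(&rcu_state.srs_done_tail, wait_tail);

/* Queue the cleanup kworker only if there is still work for it to do. */
if (wait_tail->next) {
	atomic_inc(&rcu_state.srs_cleanups_pending);
	if (!queue_work(sync_wq, &rcu_state.srs_cleanup_work))
		atomic_dec(&rcu_state.srs_cleanups_pending);
}

/* At the end of rcu_sr_normal_gp_cleanup_work(), pairing with the acquire above: */
atomic_dec_return_release(&rcu_state.srs_cleanups_pending);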

Comments

Frederic Weisbecker June 5, 2024, 4:35 p.m. UTC | #1
Le Tue, Jun 04, 2024 at 03:23:49PM -0700, Paul E. McKenney a écrit :
> From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> 
> In the synchronize_rcu() common case, we will have less than
> SR_MAX_USERS_WAKE_FROM_GP number of users per GP. Waking up the kworker
> is pointless just to free the last injected wait head since at that point,
> all the users have already been awakened.
> 
> Introduce a new counter to track this and prevent the wakeup in the
> common case.
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> ---
>  kernel/rcu/tree.c | 35 ++++++++++++++++++++++++++++++-----
>  kernel/rcu/tree.h |  1 +
>  2 files changed, 31 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 6ba36d9c09bde..2fe08e6186b4d 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -96,6 +96,7 @@ static struct rcu_state rcu_state = {
>  	.ofl_lock = __ARCH_SPIN_LOCK_UNLOCKED,
>  	.srs_cleanup_work = __WORK_INITIALIZER(rcu_state.srs_cleanup_work,
>  		rcu_sr_normal_gp_cleanup_work),
> +	.srs_cleanups_pending = ATOMIC_INIT(0),
>  };
>  
>  /* Dump rcu_node combining tree at boot to verify correct setup. */
> @@ -1633,8 +1634,11 @@ static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
>  	 * the done tail list manipulations are protected here.
>  	 */
>  	done = smp_load_acquire(&rcu_state.srs_done_tail);
> -	if (!done)
> +	if (!done) {
> +		/* See comments below. */
> +		atomic_dec_return_release(&rcu_state.srs_cleanups_pending);

This condition is not supposed to happen. If the work is scheduled,
there has to be a wait_queue in rcu_state.srs_done_tail. And decrementing
may make things worse.

So this should be:

if (WARN_ON_ONCE(!done))
   return;

Thanks.
Paul E. McKenney June 5, 2024, 6:42 p.m. UTC | #2
On Wed, Jun 05, 2024 at 06:35:08PM +0200, Frederic Weisbecker wrote:
> Le Tue, Jun 04, 2024 at 03:23:49PM -0700, Paul E. McKenney a écrit :
> > From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> > 
> > In the synchronize_rcu() common case, we will have less than
> > SR_MAX_USERS_WAKE_FROM_GP number of users per GP. Waking up the kworker
> > is pointless just to free the last injected wait head since at that point,
> > all the users have already been awakened.
> > 
> > Introduce a new counter to track this and prevent the wakeup in the
> > common case.
> > 
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > ---
> >  kernel/rcu/tree.c | 35 ++++++++++++++++++++++++++++++-----
> >  kernel/rcu/tree.h |  1 +
> >  2 files changed, 31 insertions(+), 5 deletions(-)
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 6ba36d9c09bde..2fe08e6186b4d 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -96,6 +96,7 @@ static struct rcu_state rcu_state = {
> >  	.ofl_lock = __ARCH_SPIN_LOCK_UNLOCKED,
> >  	.srs_cleanup_work = __WORK_INITIALIZER(rcu_state.srs_cleanup_work,
> >  		rcu_sr_normal_gp_cleanup_work),
> > +	.srs_cleanups_pending = ATOMIC_INIT(0),
> >  };
> >  
> >  /* Dump rcu_node combining tree at boot to verify correct setup. */
> > @@ -1633,8 +1634,11 @@ static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> >  	 * the done tail list manipulations are protected here.
> >  	 */
> >  	done = smp_load_acquire(&rcu_state.srs_done_tail);
> > -	if (!done)
> > +	if (!done) {
> > +		/* See comments below. */
> > +		atomic_dec_return_release(&rcu_state.srs_cleanups_pending);
> 
> This condition is not supposed to happen. If the work is scheduled,
> there has to be a wait_queue in rcu_state.srs_done_tail. And decrementing
> may make things worse.
> 
> So this should be:
> 
> if (WARN_ON_ONCE(!done))
>    return;

Or just this:

	WARN_ON_ONCE(!smp_load_acquire(&rcu_state.srs_done_tail));

Uladzislau, thoughts?  Is there some corner case where we really can
see that smp_load_acquire() hand back a NULL pointer, perhaps during boot?

							Thanx, Paul
Neeraj upadhyay June 6, 2024, 5:58 a.m. UTC | #3
On Wed, Jun 5, 2024 at 10:05 PM Frederic Weisbecker <frederic@kernel.org> wrote:
>
> Le Tue, Jun 04, 2024 at 03:23:49PM -0700, Paul E. McKenney a écrit :
> > From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> >
> > In the synchronize_rcu() common case, we will have less than
> > SR_MAX_USERS_WAKE_FROM_GP number of users per GP. Waking up the kworker
> > is pointless just to free the last injected wait head since at that point,
> > all the users have already been awakened.
> >
> > Introduce a new counter to track this and prevent the wakeup in the
> > common case.
> >
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > ---
> >  kernel/rcu/tree.c | 35 ++++++++++++++++++++++++++++++-----
> >  kernel/rcu/tree.h |  1 +
> >  2 files changed, 31 insertions(+), 5 deletions(-)
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 6ba36d9c09bde..2fe08e6186b4d 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -96,6 +96,7 @@ static struct rcu_state rcu_state = {
> >       .ofl_lock = __ARCH_SPIN_LOCK_UNLOCKED,
> >       .srs_cleanup_work = __WORK_INITIALIZER(rcu_state.srs_cleanup_work,
> >               rcu_sr_normal_gp_cleanup_work),
> > +     .srs_cleanups_pending = ATOMIC_INIT(0),
> >  };
> >
> >  /* Dump rcu_node combining tree at boot to verify correct setup. */
> > @@ -1633,8 +1634,11 @@ static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> >        * the done tail list manipulations are protected here.
> >        */
> >       done = smp_load_acquire(&rcu_state.srs_done_tail);
> > -     if (!done)
> > +     if (!done) {
> > +             /* See comments below. */
> > +             atomic_dec_return_release(&rcu_state.srs_cleanups_pending);
>
> This condition is not supposed to happen. If the work is scheduled,
> there has to be a wait_queue in rcu_state.srs_done_tail. And decrementing
> may make things worse.
>

I also don't see a scenario where this can happen. However, if we are
returning from here, then given that we increment
rcu_state.srs_cleanups_pending for every queued work, I think it is
safer to decrement in this case as well, since that counter tracks only
the work queuing and execution counts.

    atomic_inc(&rcu_state.srs_cleanups_pending);
    if (!queue_work(sync_wq, &rcu_state.srs_cleanup_work))
        atomic_dec(&rcu_state.srs_cleanups_pending);




Thanks
Neeraj

> So this should be:
>
> if (WARN_ON_ONCE(!done))
>    return;
>
> Thanks.
>
Paul E. McKenney June 6, 2024, 6:12 p.m. UTC | #4
On Thu, Jun 06, 2024 at 11:28:07AM +0530, Neeraj upadhyay wrote:
> On Wed, Jun 5, 2024 at 10:05 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> >
> > Le Tue, Jun 04, 2024 at 03:23:49PM -0700, Paul E. McKenney a écrit :
> > > From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> > >
> > > In the synchronize_rcu() common case, we will have less than
> > > SR_MAX_USERS_WAKE_FROM_GP number of users per GP. Waking up the kworker
> > > is pointless just to free the last injected wait head since at that point,
> > > all the users have already been awakened.
> > >
> > > Introduce a new counter to track this and prevent the wakeup in the
> > > common case.
> > >
> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > > ---
> > >  kernel/rcu/tree.c | 35 ++++++++++++++++++++++++++++++-----
> > >  kernel/rcu/tree.h |  1 +
> > >  2 files changed, 31 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 6ba36d9c09bde..2fe08e6186b4d 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -96,6 +96,7 @@ static struct rcu_state rcu_state = {
> > >       .ofl_lock = __ARCH_SPIN_LOCK_UNLOCKED,
> > >       .srs_cleanup_work = __WORK_INITIALIZER(rcu_state.srs_cleanup_work,
> > >               rcu_sr_normal_gp_cleanup_work),
> > > +     .srs_cleanups_pending = ATOMIC_INIT(0),
> > >  };
> > >
> > >  /* Dump rcu_node combining tree at boot to verify correct setup. */
> > > @@ -1633,8 +1634,11 @@ static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> > >        * the done tail list manipulations are protected here.
> > >        */
> > >       done = smp_load_acquire(&rcu_state.srs_done_tail);
> > > -     if (!done)
> > > +     if (!done) {
> > > +             /* See comments below. */
> > > +             atomic_dec_return_release(&rcu_state.srs_cleanups_pending);
> >
> > This condition is not supposed to happen. If the work is scheduled,
> > there has to be a wait_queue in rcu_state.srs_done_tail. And decrementing
> > may make things worse.
> >
> 
> I also don't see a scenario where this can happen. However, if we are
> returning from here, given that for every queued work we do an
> increment of rcu_state.srs_cleanups_pending, I think it's safer to
> decrement in this
> case, as that counter tracks only the work queuing and execution counts.
> 
>     atomic_inc(&rcu_state.srs_cleanups_pending);
>     if (!queue_work(sync_wq, &rcu_state.srs_cleanup_work))
>         atomic_dec(&rcu_state.srs_cleanups_pending);

Linus Torvalds's general rule is that if you cannot imagine how a bug
can happen, don't attempt to clean up after it.  His rationale (which
is *almost* always a good one) is that not knowing how the bug happens
means that attempts to clean up will usually just make matters worse.
And all too often, the clean-up code makes debugging more difficult.

One example exception to this rule is when debug-objects detects a
duplicate call_rcu().  In that case, we ignore that second call_rcu().
But the reason is that experience has shown that the usual cause really
is someone doing a duplicate call_rcu(), and also that ignoring the
second call_rcu() makes debugging easier.

So what is it that Frederic and I are missing here?

							Thanx, Paul

> Thanks
> Neeraj
> 
> > So this should be:
> >
> > if (WARN_ON_ONCE(!done))
> >    return;
> >
> > Thanks.
> >
>
Neeraj upadhyay June 7, 2024, 1:51 a.m. UTC | #5
On Thu, Jun 6, 2024 at 11:42 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Thu, Jun 06, 2024 at 11:28:07AM +0530, Neeraj upadhyay wrote:
> > On Wed, Jun 5, 2024 at 10:05 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> > >
> > > Le Tue, Jun 04, 2024 at 03:23:49PM -0700, Paul E. McKenney a écrit :
> > > > From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> > > >
> > > > In the synchronize_rcu() common case, we will have less than
> > > > SR_MAX_USERS_WAKE_FROM_GP number of users per GP. Waking up the kworker
> > > > is pointless just to free the last injected wait head since at that point,
> > > > all the users have already been awakened.
> > > >
> > > > Introduce a new counter to track this and prevent the wakeup in the
> > > > common case.
> > > >
> > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > > > ---
> > > >  kernel/rcu/tree.c | 35 ++++++++++++++++++++++++++++++-----
> > > >  kernel/rcu/tree.h |  1 +
> > > >  2 files changed, 31 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index 6ba36d9c09bde..2fe08e6186b4d 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -96,6 +96,7 @@ static struct rcu_state rcu_state = {
> > > >       .ofl_lock = __ARCH_SPIN_LOCK_UNLOCKED,
> > > >       .srs_cleanup_work = __WORK_INITIALIZER(rcu_state.srs_cleanup_work,
> > > >               rcu_sr_normal_gp_cleanup_work),
> > > > +     .srs_cleanups_pending = ATOMIC_INIT(0),
> > > >  };
> > > >
> > > >  /* Dump rcu_node combining tree at boot to verify correct setup. */
> > > > @@ -1633,8 +1634,11 @@ static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> > > >        * the done tail list manipulations are protected here.
> > > >        */
> > > >       done = smp_load_acquire(&rcu_state.srs_done_tail);
> > > > -     if (!done)
> > > > +     if (!done) {
> > > > +             /* See comments below. */
> > > > +             atomic_dec_return_release(&rcu_state.srs_cleanups_pending);
> > >
> > > This condition is not supposed to happen. If the work is scheduled,
> > > there has to be a wait_queue in rcu_state.srs_done_tail. And decrementing
> > > may make things worse.
> > >
> >
> > I also don't see a scenario where this can happen. However, if we are
> > returning from here, given that for every queued work we do an
> > increment of rcu_state.srs_cleanups_pending, I think it's safer to
> > decrement in this
> > case, as that counter tracks only the work queuing and execution counts.
> >
> >     atomic_inc(&rcu_state.srs_cleanups_pending);
> >     if (!queue_work(sync_wq, &rcu_state.srs_cleanup_work))
> >         atomic_dec(&rcu_state.srs_cleanups_pending);
>
> Linus Torvald's general rule is that if you cannot imagine how a bug
> can happen, don't attempt to clean up after it.  His rationale (which
> is *almost* always a good one) is that not knowing how the bug happens
> means that attempts to clean up will usually just make matters worse.
> And all too often, the clean-up code makes debugging more difficult.
>

Ok. Thanks for sharing this info!

> One example exception to this rule is when debug-objects detects a
> duplicate call_rcu().  In that case, we ignore that second call_rcu().
> But the reason is that experience has shown that the usual cause really
> is someone doing a duplicate call_rcu(), and also that ignoring the
> second call_rcu() makes debugging easier.
>
> So what is it that Frederic and I are missing here?
>

Maybe nothing. As the kworker context does not modify srs_done_tail, and
invalid values of srs_done_tail can only be caused by the GP kthread's
manipulations of srs_done_tail, my thought here was that we could keep
the pending rcu_sr_normal_gp_cleanup_work count consistent with the
number of queue_work() calls and kworker executions, even when we see
unexpected srs_done_tail values like these. However, as you described,
the general rule is to not attempt any clean up in such scenarios.

Thanks
Neeraj



>                                                         Thanx, Paul
>
> > Thanks
> > Neeraj
> >
> > > So this should be:
> > >
> > > if (WARN_ON_ONCE(!done))
> > >    return;
> > >
> > > Thanks.
> > >
> >
Paul E. McKenney June 10, 2024, 3:12 p.m. UTC | #6
On Fri, Jun 07, 2024 at 07:21:55AM +0530, Neeraj upadhyay wrote:
> On Thu, Jun 6, 2024 at 11:42 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Thu, Jun 06, 2024 at 11:28:07AM +0530, Neeraj upadhyay wrote:
> > > On Wed, Jun 5, 2024 at 10:05 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> > > >
> > > > Le Tue, Jun 04, 2024 at 03:23:49PM -0700, Paul E. McKenney a écrit :
> > > > > From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> > > > >
> > > > > In the synchronize_rcu() common case, we will have less than
> > > > > SR_MAX_USERS_WAKE_FROM_GP number of users per GP. Waking up the kworker
> > > > > is pointless just to free the last injected wait head since at that point,
> > > > > all the users have already been awakened.
> > > > >
> > > > > Introduce a new counter to track this and prevent the wakeup in the
> > > > > common case.
> > > > >
> > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > > Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > > > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > > > > ---
> > > > >  kernel/rcu/tree.c | 35 ++++++++++++++++++++++++++++++-----
> > > > >  kernel/rcu/tree.h |  1 +
> > > > >  2 files changed, 31 insertions(+), 5 deletions(-)
> > > > >
> > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > index 6ba36d9c09bde..2fe08e6186b4d 100644
> > > > > --- a/kernel/rcu/tree.c
> > > > > +++ b/kernel/rcu/tree.c
> > > > > @@ -96,6 +96,7 @@ static struct rcu_state rcu_state = {
> > > > >       .ofl_lock = __ARCH_SPIN_LOCK_UNLOCKED,
> > > > >       .srs_cleanup_work = __WORK_INITIALIZER(rcu_state.srs_cleanup_work,
> > > > >               rcu_sr_normal_gp_cleanup_work),
> > > > > +     .srs_cleanups_pending = ATOMIC_INIT(0),
> > > > >  };
> > > > >
> > > > >  /* Dump rcu_node combining tree at boot to verify correct setup. */
> > > > > @@ -1633,8 +1634,11 @@ static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> > > > >        * the done tail list manipulations are protected here.
> > > > >        */
> > > > >       done = smp_load_acquire(&rcu_state.srs_done_tail);
> > > > > -     if (!done)
> > > > > +     if (!done) {
> > > > > +             /* See comments below. */
> > > > > +             atomic_dec_return_release(&rcu_state.srs_cleanups_pending);
> > > >
> > > > This condition is not supposed to happen. If the work is scheduled,
> > > > there has to be a wait_queue in rcu_state.srs_done_tail. And decrementing
> > > > may make things worse.
> > > >
> > >
> > > I also don't see a scenario where this can happen. However, if we are
> > > returning from here, given that for every queued work we do an
> > > increment of rcu_state.srs_cleanups_pending, I think it's safer to
> > > decrement in this
> > > case, as that counter tracks only the work queuing and execution counts.
> > >
> > >     atomic_inc(&rcu_state.srs_cleanups_pending);
> > >     if (!queue_work(sync_wq, &rcu_state.srs_cleanup_work))
> > >         atomic_dec(&rcu_state.srs_cleanups_pending);
> >
> > Linus Torvald's general rule is that if you cannot imagine how a bug
> > can happen, don't attempt to clean up after it.  His rationale (which
> > is *almost* always a good one) is that not knowing how the bug happens
> > means that attempts to clean up will usually just make matters worse.
> > And all too often, the clean-up code makes debugging more difficult.
> >
> 
> Ok. Thanks for sharing this info!
> 
> > One example exception to this rule is when debug-objects detects a
> > duplicate call_rcu().  In that case, we ignore that second call_rcu().
> > But the reason is that experience has shown that the usual cause really
> > is someone doing a duplicate call_rcu(), and also that ignoring the
> > second call_rcu() makes debugging easier.
> >
> > So what is it that Frederic and I are missing here?
> 
> Maybe nothing. As kworker context does not modify srs_done_tail and
> invalid values
> of srs_done_tail  can only be caused by the GP kthread manipulations
> of  srs_done_tail , my thought here was, we can keep the pending
> rcu_sr_normal_gp_cleanup_work count consistent with the number of
> queue_work() and kworker executions, even when we see unexpected
> srs_done_tail values like these. However, as you described the general rule
> is to not attempt any clean up for such scenarios.

So "if (WARN_ON_ONCE(!done) return;"?

Or is there a better way?

							Thanx, Paul
Neeraj upadhyay June 11, 2024, 1:46 p.m. UTC | #7
On Mon, Jun 10, 2024 at 8:42 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Fri, Jun 07, 2024 at 07:21:55AM +0530, Neeraj upadhyay wrote:
> > On Thu, Jun 6, 2024 at 11:42 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > On Thu, Jun 06, 2024 at 11:28:07AM +0530, Neeraj upadhyay wrote:
> > > > On Wed, Jun 5, 2024 at 10:05 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> > > > >
> > > > > Le Tue, Jun 04, 2024 at 03:23:49PM -0700, Paul E. McKenney a écrit :
> > > > > > From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> > > > > >
> > > > > > In the synchronize_rcu() common case, we will have less than
> > > > > > SR_MAX_USERS_WAKE_FROM_GP number of users per GP. Waking up the kworker
> > > > > > is pointless just to free the last injected wait head since at that point,
> > > > > > all the users have already been awakened.
> > > > > >
> > > > > > Introduce a new counter to track this and prevent the wakeup in the
> > > > > > common case.
> > > > > >
> > > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > > > Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > > > > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > > > > > ---
> > > > > >  kernel/rcu/tree.c | 35 ++++++++++++++++++++++++++++++-----
> > > > > >  kernel/rcu/tree.h |  1 +
> > > > > >  2 files changed, 31 insertions(+), 5 deletions(-)
> > > > > >
> > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > > index 6ba36d9c09bde..2fe08e6186b4d 100644
> > > > > > --- a/kernel/rcu/tree.c
> > > > > > +++ b/kernel/rcu/tree.c
> > > > > > @@ -96,6 +96,7 @@ static struct rcu_state rcu_state = {
> > > > > >       .ofl_lock = __ARCH_SPIN_LOCK_UNLOCKED,
> > > > > >       .srs_cleanup_work = __WORK_INITIALIZER(rcu_state.srs_cleanup_work,
> > > > > >               rcu_sr_normal_gp_cleanup_work),
> > > > > > +     .srs_cleanups_pending = ATOMIC_INIT(0),
> > > > > >  };
> > > > > >
> > > > > >  /* Dump rcu_node combining tree at boot to verify correct setup. */
> > > > > > @@ -1633,8 +1634,11 @@ static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> > > > > >        * the done tail list manipulations are protected here.
> > > > > >        */
> > > > > >       done = smp_load_acquire(&rcu_state.srs_done_tail);
> > > > > > -     if (!done)
> > > > > > +     if (!done) {
> > > > > > +             /* See comments below. */
> > > > > > +             atomic_dec_return_release(&rcu_state.srs_cleanups_pending);
> > > > >
> > > > > This condition is not supposed to happen. If the work is scheduled,
> > > > > there has to be a wait_queue in rcu_state.srs_done_tail. And decrementing
> > > > > may make things worse.
> > > > >
> > > >
> > > > I also don't see a scenario where this can happen. However, if we are
> > > > returning from here, given that for every queued work we do an
> > > > increment of rcu_state.srs_cleanups_pending, I think it's safer to
> > > > decrement in this
> > > > case, as that counter tracks only the work queuing and execution counts.
> > > >
> > > >     atomic_inc(&rcu_state.srs_cleanups_pending);
> > > >     if (!queue_work(sync_wq, &rcu_state.srs_cleanup_work))
> > > >         atomic_dec(&rcu_state.srs_cleanups_pending);
> > >
> > > Linus Torvald's general rule is that if you cannot imagine how a bug
> > > can happen, don't attempt to clean up after it.  His rationale (which
> > > is *almost* always a good one) is that not knowing how the bug happens
> > > means that attempts to clean up will usually just make matters worse.
> > > And all too often, the clean-up code makes debugging more difficult.
> > >
> >
> > Ok. Thanks for sharing this info!
> >
> > > One example exception to this rule is when debug-objects detects a
> > > duplicate call_rcu().  In that case, we ignore that second call_rcu().
> > > But the reason is that experience has shown that the usual cause really
> > > is someone doing a duplicate call_rcu(), and also that ignoring the
> > > second call_rcu() makes debugging easier.
> > >
> > > So what is it that Frederic and I are missing here?
> >
> > Maybe nothing. As kworker context does not modify srs_done_tail and
> > invalid values
> > of srs_done_tail  can only be caused by the GP kthread manipulations
> > of  srs_done_tail , my thought here was, we can keep the pending
> > rcu_sr_normal_gp_cleanup_work count consistent with the number of
> > queue_work() and kworker executions, even when we see unexpected
> > srs_done_tail values like these. However, as you described the general rule
> > is to not attempt any clean up for such scenarios.
>
> So "if (WARN_ON_ONCE(!done) return;"?
>

Looks good. Nit: one more closing parenthesis is needed: "if (WARN_ON_ONCE(!done)) return;"

> Or is there a better way?
>

I think the above check suffices.


Thanks
Neeraj

>                                                         Thanx, Paul
Paul E. McKenney June 11, 2024, 4:17 p.m. UTC | #8
On Tue, Jun 11, 2024 at 07:16:37PM +0530, Neeraj upadhyay wrote:
> On Mon, Jun 10, 2024 at 8:42 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Fri, Jun 07, 2024 at 07:21:55AM +0530, Neeraj upadhyay wrote:
> > > On Thu, Jun 6, 2024 at 11:42 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > >
> > > > On Thu, Jun 06, 2024 at 11:28:07AM +0530, Neeraj upadhyay wrote:
> > > > > On Wed, Jun 5, 2024 at 10:05 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> > > > > >
> > > > > > Le Tue, Jun 04, 2024 at 03:23:49PM -0700, Paul E. McKenney a écrit :
> > > > > > > From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> > > > > > >
> > > > > > > In the synchronize_rcu() common case, we will have less than
> > > > > > > SR_MAX_USERS_WAKE_FROM_GP number of users per GP. Waking up the kworker
> > > > > > > is pointless just to free the last injected wait head since at that point,
> > > > > > > all the users have already been awakened.
> > > > > > >
> > > > > > > Introduce a new counter to track this and prevent the wakeup in the
> > > > > > > common case.
> > > > > > >
> > > > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > > > > Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > > > > > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > > > > > > ---
> > > > > > >  kernel/rcu/tree.c | 35 ++++++++++++++++++++++++++++++-----
> > > > > > >  kernel/rcu/tree.h |  1 +
> > > > > > >  2 files changed, 31 insertions(+), 5 deletions(-)
> > > > > > >
> > > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > > > index 6ba36d9c09bde..2fe08e6186b4d 100644
> > > > > > > --- a/kernel/rcu/tree.c
> > > > > > > +++ b/kernel/rcu/tree.c
> > > > > > > @@ -96,6 +96,7 @@ static struct rcu_state rcu_state = {
> > > > > > >       .ofl_lock = __ARCH_SPIN_LOCK_UNLOCKED,
> > > > > > >       .srs_cleanup_work = __WORK_INITIALIZER(rcu_state.srs_cleanup_work,
> > > > > > >               rcu_sr_normal_gp_cleanup_work),
> > > > > > > +     .srs_cleanups_pending = ATOMIC_INIT(0),
> > > > > > >  };
> > > > > > >
> > > > > > >  /* Dump rcu_node combining tree at boot to verify correct setup. */
> > > > > > > @@ -1633,8 +1634,11 @@ static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> > > > > > >        * the done tail list manipulations are protected here.
> > > > > > >        */
> > > > > > >       done = smp_load_acquire(&rcu_state.srs_done_tail);
> > > > > > > -     if (!done)
> > > > > > > +     if (!done) {
> > > > > > > +             /* See comments below. */
> > > > > > > +             atomic_dec_return_release(&rcu_state.srs_cleanups_pending);
> > > > > >
> > > > > > This condition is not supposed to happen. If the work is scheduled,
> > > > > > there has to be a wait_queue in rcu_state.srs_done_tail. And decrementing
> > > > > > may make things worse.
> > > > > >
> > > > >
> > > > > I also don't see a scenario where this can happen. However, if we are
> > > > > returning from here, given that for every queued work we do an
> > > > > increment of rcu_state.srs_cleanups_pending, I think it's safer to
> > > > > decrement in this
> > > > > case, as that counter tracks only the work queuing and execution counts.
> > > > >
> > > > >     atomic_inc(&rcu_state.srs_cleanups_pending);
> > > > >     if (!queue_work(sync_wq, &rcu_state.srs_cleanup_work))
> > > > >         atomic_dec(&rcu_state.srs_cleanups_pending);
> > > >
> > > > Linus Torvald's general rule is that if you cannot imagine how a bug
> > > > can happen, don't attempt to clean up after it.  His rationale (which
> > > > is *almost* always a good one) is that not knowing how the bug happens
> > > > means that attempts to clean up will usually just make matters worse.
> > > > And all too often, the clean-up code makes debugging more difficult.
> > > >
> > >
> > > Ok. Thanks for sharing this info!
> > >
> > > > One example exception to this rule is when debug-objects detects a
> > > > duplicate call_rcu().  In that case, we ignore that second call_rcu().
> > > > But the reason is that experience has shown that the usual cause really
> > > > is someone doing a duplicate call_rcu(), and also that ignoring the
> > > > second call_rcu() makes debugging easier.
> > > >
> > > > So what is it that Frederic and I are missing here?
> > >
> > > Maybe nothing. As kworker context does not modify srs_done_tail and
> > > invalid values
> > > of srs_done_tail  can only be caused by the GP kthread manipulations
> > > of  srs_done_tail , my thought here was, we can keep the pending
> > > rcu_sr_normal_gp_cleanup_work count consistent with the number of
> > > queue_work() and kworker executions, even when we see unexpected
> > > srs_done_tail values like these. However, as you described the general rule
> > > is to not attempt any clean up for such scenarios.
> >
> > So "if (WARN_ON_ONCE(!done) return;"?
> >
> 
> Looks good, nit: one more closing parenthesis "if (WARN_ON_ONCE(!done)) return;"
> 
> > Or is there a better way?
> >
> 
> I think, above check suffices .

Very well, I will make this change on the next rebase.

								Thanx, Paul
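
For reference, the follow-up agreed on above would make the head of rcu_sr_normal_gp_cleanup_work() read roughly as follows (a sketch of the planned change, not code posted in this thread):

	done = smp_load_acquire(&rcu_state.srs_done_tail);
	if (WARN_ON_ONCE(!done))
		return;

The srs_cleanups_pending counter is deliberately left alone on this path: since the branch is not expected to be reachable, a one-time warning is preferred over attempting to repair state after an unexplained bug.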

Patch

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 6ba36d9c09bde..2fe08e6186b4d 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -96,6 +96,7 @@  static struct rcu_state rcu_state = {
 	.ofl_lock = __ARCH_SPIN_LOCK_UNLOCKED,
 	.srs_cleanup_work = __WORK_INITIALIZER(rcu_state.srs_cleanup_work,
 		rcu_sr_normal_gp_cleanup_work),
+	.srs_cleanups_pending = ATOMIC_INIT(0),
 };
 
 /* Dump rcu_node combining tree at boot to verify correct setup. */
@@ -1633,8 +1634,11 @@  static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
 	 * the done tail list manipulations are protected here.
 	 */
 	done = smp_load_acquire(&rcu_state.srs_done_tail);
-	if (!done)
+	if (!done) {
+		/* See comments below. */
+		atomic_dec_return_release(&rcu_state.srs_cleanups_pending);
 		return;
+	}
 
 	head = done->next;
 	done->next = NULL;
@@ -1656,6 +1660,9 @@  static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
 
 		rcu_sr_put_wait_head(rcu);
 	}
+
+	/* Order list manipulations with atomic access. */
+	atomic_dec_return_release(&rcu_state.srs_cleanups_pending);
 }
 
 /*
@@ -1663,7 +1670,7 @@  static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
  */
 static void rcu_sr_normal_gp_cleanup(void)
 {
-	struct llist_node *wait_tail, *next, *rcu;
+	struct llist_node *wait_tail, *next = NULL, *rcu = NULL;
 	int done = 0;
 
 	wait_tail = rcu_state.srs_wait_tail;
@@ -1697,16 +1704,34 @@  static void rcu_sr_normal_gp_cleanup(void)
 			break;
 	}
 
-	// concurrent sr_normal_gp_cleanup work might observe this update.
-	smp_store_release(&rcu_state.srs_done_tail, wait_tail);
+	/*
+	 * Fast path, no more users to process except putting the second last
+	 * wait head if no inflight-workers. If there are in-flight workers,
+	 * they will remove the last wait head.
+	 *
+	 * Note that the ACQUIRE orders atomic access with list manipulation.
+	 */
+	if (wait_tail->next && wait_tail->next->next == NULL &&
+	    rcu_sr_is_wait_head(wait_tail->next) &&
+	    !atomic_read_acquire(&rcu_state.srs_cleanups_pending)) {
+		rcu_sr_put_wait_head(wait_tail->next);
+		wait_tail->next = NULL;
+	}
+
+	/* Concurrent sr_normal_gp_cleanup work might observe this update. */
 	ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_done_tail);
+	smp_store_release(&rcu_state.srs_done_tail, wait_tail);
 
 	/*
 	 * We schedule a work in order to perform a final processing
 	 * of outstanding users(if still left) and releasing wait-heads
 	 * added by rcu_sr_normal_gp_init() call.
 	 */
-	queue_work(sync_wq, &rcu_state.srs_cleanup_work);
+	if (wait_tail->next) {
+		atomic_inc(&rcu_state.srs_cleanups_pending);
+		if (!queue_work(sync_wq, &rcu_state.srs_cleanup_work))
+			atomic_dec(&rcu_state.srs_cleanups_pending);
+	}
 }
 
 /*
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index bae7925c497fe..affcb92a358c3 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -420,6 +420,7 @@  struct rcu_state {
 	struct llist_node *srs_done_tail; /* ready for GP users. */
 	struct sr_wait_node srs_wait_nodes[SR_NORMAL_GP_WAIT_HEAD_MAX];
 	struct work_struct srs_cleanup_work;
+	atomic_t srs_cleanups_pending; /* srs inflight worker cleanups. */
 };
 
 /* Values for rcu_state structure's gp_flags field. */
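
To make the memory-ordering argument in the fast path easier to follow, here is a self-contained user-space model of the counter protocol (C11 atomics; the names gp_cleanup(), cleanup_work(), and last_wait_head_only are invented for this sketch, and the work runs synchronously rather than on a workqueue, so this is an illustration of the protocol only, not kernel code):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_int srs_cleanups_pending;

/* Stand-in for the deferred rcu_sr_normal_gp_cleanup_work(). */
static void cleanup_work(void)
{
	/* ... process remaining users and free wait heads ... */

	/* Order the list manipulations above with the counter update. */
	atomic_fetch_sub_explicit(&srs_cleanups_pending, 1,
				  memory_order_release);
}

/* Stand-in for the tail of rcu_sr_normal_gp_cleanup(). */
static void gp_cleanup(bool last_wait_head_only)
{
	/*
	 * Fast path: if only the last wait head is left and no cleanup
	 * work is in flight, free it here and skip the wakeup.  The
	 * acquire load pairs with the release in cleanup_work().
	 */
	if (last_wait_head_only &&
	    atomic_load_explicit(&srs_cleanups_pending,
				 memory_order_acquire) == 0) {
		printf("fast path: kworker wakeup avoided\n");
		return;
	}

	/* Slow path: account for the work before handing it off. */
	atomic_fetch_add_explicit(&srs_cleanups_pending, 1,
				  memory_order_relaxed);
	cleanup_work();		/* queue_work() in the kernel. */
}

int main(void)
{
	gp_cleanup(true);	/* Common case: wakeup avoided. */
	gp_cleanup(false);	/* Users left over: worker runs and decrements. */
	return 0;
}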