Message ID | 20250227131613.52683-3-urezki@gmail.com (mailing list archive)
---|---
State | New
Series | [v4,1/3] rcutorture: Allow a negative value for nfakewriters
Hi Ulad,

I put these three patches into next (and misc.2025.02.27a) for some
testing; hopefully it all goes well and they can make it into v6.15.

A few tags changed below:

On Thu, Feb 27, 2025 at 02:16:13PM +0100, Uladzislau Rezki (Sony) wrote:
> Switch to using the get_state_synchronize_rcu_full() and
> poll_state_synchronize_rcu_full() pair to debug a normal
> synchronize_rcu() call.
>
> Using the non-full APIs to identify whether a grace period has
> passed might lead to a false-positive kernel splat.
>
> It can happen because get_state_synchronize_rcu() compresses
> both normal and expedited states into one single unsigned long
> value, so poll_state_synchronize_rcu() can miss GP completion
> when synchronize_rcu() and synchronize_rcu_expedited() run
> concurrently.
>
> To address this, switch to the poll_state_synchronize_rcu_full() and
> get_state_synchronize_rcu_full() APIs, which use separate variables
> for expedited and normal states.
>
> Link: https://lore.kernel.org/lkml/Z5ikQeVmVdsWQrdD@pc636/T/

I switched this into "Closes:" per checkpatch.

> Fixes: 988f569ae041 ("rcu: Reduce synchronize_rcu() latency")
> Reported-by: cheung wall <zzqq0103.hey@gmail.com>
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

You seem to have forgotten to add Paul's Reviewed-by, so I added it in
rcu/next. Would you or Paul double-check that the Reviewed-by should be
here?

Regards,
Boqun

> ---
>  include/linux/rcupdate_wait.h | 3 +++
>  kernel/rcu/tree.c             | 8 +++-----
>  2 files changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/rcupdate_wait.h b/include/linux/rcupdate_wait.h
> index f9bed3d3f78d..4c92d4291cce 100644
> --- a/include/linux/rcupdate_wait.h
> +++ b/include/linux/rcupdate_wait.h
> @@ -16,6 +16,9 @@
>  struct rcu_synchronize {
>  	struct rcu_head head;
>  	struct completion completion;
> +
> +	/* This is for debugging. */
> +	struct rcu_gp_oldstate oldstate;
>  };
>  void wakeme_after_rcu(struct rcu_head *head);
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 8625f616c65a..48384fa2eaeb 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1632,12 +1632,10 @@ static void rcu_sr_normal_complete(struct llist_node *node)
>  {
>  	struct rcu_synchronize *rs = container_of(
>  		(struct rcu_head *) node, struct rcu_synchronize, head);
> -	unsigned long oldstate = (unsigned long) rs->head.func;
>
>  	WARN_ONCE(IS_ENABLED(CONFIG_PROVE_RCU) &&
> -		  !poll_state_synchronize_rcu(oldstate),
> -		  "A full grace period is not passed yet: %lu",
> -		  rcu_seq_diff(get_state_synchronize_rcu(), oldstate));
> +		  !poll_state_synchronize_rcu_full(&rs->oldstate),
> +		  "A full grace period is not passed yet!\n");
>
>  	/* Finally. */
>  	complete(&rs->completion);
> @@ -3247,7 +3245,7 @@ static void synchronize_rcu_normal(void)
>  	 * snapshot before adding a request.
>  	 */
>  	if (IS_ENABLED(CONFIG_PROVE_RCU))
> -		rs.head.func = (void *) get_state_synchronize_rcu();
> +		get_state_synchronize_rcu_full(&rs.oldstate);
>
>  	rcu_sr_normal_add_req(&rs);
>
> --
> 2.39.5
On Thu, Feb 27, 2025 at 09:12:39AM -0800, Boqun Feng wrote:
> Hi Ulad,
>
> I put these three patches into next (and misc.2025.02.27a) for some
> testing; hopefully it all goes well and they can make it into v6.15.
>
> A few tags changed below:
>
[...]
>
> You seem to have forgotten to add Paul's Reviewed-by, so I added it in
> rcu/next. Would you or Paul double-check that the Reviewed-by should be
> here?

I am good with keeping my Reviewed-by tags.

							Thanx, Paul

> Regards,
> Boqun
[...]
On Thu, Feb 27, 2025 at 09:26:40AM -0800, Paul E. McKenney wrote:
> On Thu, Feb 27, 2025 at 09:12:39AM -0800, Boqun Feng wrote:
> > Hi Ulad,
> >
[...]
> > You seem to have forgotten to add Paul's Reviewed-by, so I added it in
> > rcu/next. Would you or Paul double-check that the Reviewed-by should be
> > here?
>
> I am good with keeping my Reviewed-by tags.
>
Thank you!

Regards,
Boqun

> 							Thanx, Paul
[...]
Hello, Boqun!

> Hi Ulad,
>
> I put these three patches into next (and misc.2025.02.27a) for some
> testing; hopefully it all goes well and they can make it into v6.15.
>
> A few tags changed below:
>
[...]
>
> I switched this into "Closes:" per checkpatch.
>
Thanks!

> > Fixes: 988f569ae041 ("rcu: Reduce synchronize_rcu() latency")
> > Reported-by: cheung wall <zzqq0103.hey@gmail.com>
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
>
> You seem to have forgotten to add Paul's Reviewed-by, so I added it in
> rcu/next. Would you or Paul double-check that the Reviewed-by should be
> here?
>
Probably. But I do not remember getting a review from Paul. I might be
wrong and have missed it :) Anyway, thank you for adding it.

--
Uladzislau Rezki
On Thu, Feb 27, 2025 at 09:26:40AM -0800, Paul E. McKenney wrote:
> On Thu, Feb 27, 2025 at 09:12:39AM -0800, Boqun Feng wrote:
[...]
> > You seem to have forgotten to add Paul's Reviewed-by, so I added it in
> > rcu/next. Would you or Paul double-check that the Reviewed-by should be
> > here?
>
> I am good with keeping my Reviewed-by tags.
>
Thanks Paul!

--
Uladzislau Rezki
On Thu, Feb 27, 2025 at 06:44:15PM +0100, Uladzislau Rezki wrote:
> On Thu, Feb 27, 2025 at 09:26:40AM -0800, Paul E. McKenney wrote:
[...]
> > I am good with keeping my Reviewed-by tags.
> >
> Thanks Paul!

Except that I got this from overnight testing of rcu/dev on the shared
RCU tree:

WARNING: CPU: 5 PID: 14 at kernel/rcu/tree.c:1636 rcu_sr_normal_complete+0x5c/0x80

I see this only on TREE05. Which should not be too surprising, given
that this is the scenario that tests it. It happened within five minutes
on all 14 of the TREE05 runs.

This commit, just to avoid any ambiguity:

7cb48b64a563 ("MAINTAINERS: Update Joel's email address")

							Thanx, Paul
On Fri, Feb 28, 2025 at 07:41:40AM -0800, Paul E. McKenney wrote:
> On Thu, Feb 27, 2025 at 06:44:15PM +0100, Uladzislau Rezki wrote:
[...]
> Except that I got this from overnight testing of rcu/dev on the shared
> RCU tree:
>
> WARNING: CPU: 5 PID: 14 at kernel/rcu/tree.c:1636 rcu_sr_normal_complete+0x5c/0x80
>
> I see this only on TREE05. Which should not be too surprising, given
> that this is the scenario that tests it. It happened within five minutes
> on all 14 of the TREE05 runs.
>
Hm.. This is not fun. I tested this on my system and I did not manage to
trigger it, whereas you do. Something is wrong.

--
Uladzislau Rezki
On Fri, Feb 28, 2025 at 05:36:47PM +0100, Uladzislau Rezki wrote:
> On Fri, Feb 28, 2025 at 07:41:40AM -0800, Paul E. McKenney wrote:
[...]
> > I see this only on TREE05. Which should not be too surprising, given
> > that this is the scenario that tests it. It happened within five minutes
> > on all 14 of the TREE05 runs.
> >
> Hm.. This is not fun. I tested this on my system and I did not manage to
> trigger it, whereas you do. Something is wrong.
>
We have the below code to start a new GP if we detect that processing is
starved:

<snip>
	/*
	 * The "start_new_poll" is set to true, only when this GP is not able
	 * to handle anything and there are outstanding users. It happens when
	 * the rcu_sr_normal_gp_init() function was not able to insert a dummy
	 * separator to the llist, because there were no dummy-nodes left.
	 *
	 * The number of dummy-nodes is fixed, and it could be that we have run
	 * out of them; if so, we start a new poll request to repeat the try.
	 * It is rare and it means that a system is doing a slow processing of
	 * callbacks.
	 */
	if (start_new_poll)
		(void) start_poll_synchronize_rcu();
<snip>

We do not use a _full() version, since we need to inform the
rcu-gp-kthread to initiate a new GP.

Any thoughts?

--
Uladzislau Rezki
On Fri, Feb 28, 2025 at 05:36:47PM +0100, Uladzislau Rezki wrote:
> On Fri, Feb 28, 2025 at 07:41:40AM -0800, Paul E. McKenney wrote:
[...]
> > I see this only on TREE05. Which should not be too surprising, given
> > that this is the scenario that tests it. It happened within five minutes
> > on all 14 of the TREE05 runs.
> >
> Hm.. This is not fun. I tested this on my system and I did not manage to
> trigger it, whereas you do. Something is wrong.

If you have a debug patch, I would be happy to give it a go.

							Thanx, Paul
On Fri, Feb 28, 2025 at 10:21:57AM -0800, Paul E. McKenney wrote:
> On Fri, Feb 28, 2025 at 05:36:47PM +0100, Uladzislau Rezki wrote:
[...]
> > Hm.. This is not fun. I tested this on my system and I did not manage to
> > trigger it, whereas you do. Something is wrong.
>
> If you have a debug patch, I would be happy to give it a go.
>
I can trigger it. But first, some background. I tested those patches for
many hours on the stable kernel, which is 6.13. On that kernel I was not
able to trigger it. Running rcutorture on our shared "dev" tree, which I
did now, triggers this right away.

--
Uladzislau Rezki
On Fri, Feb 28, 2025 at 06:08:19PM +0100, Uladzislau Rezki wrote:
> On Fri, Feb 28, 2025 at 05:36:47PM +0100, Uladzislau Rezki wrote:
[...]
> We have the below code to start a new GP if we detect that processing is
> starved:
>
> <snip>
> 	/*
> 	 * The "start_new_poll" is set to true, only when this GP is not able
> 	 * to handle anything and there are outstanding users. It happens when
> 	 * the rcu_sr_normal_gp_init() function was not able to insert a dummy
> 	 * separator to the llist, because there were no dummy-nodes left.
> 	 *
> 	 * The number of dummy-nodes is fixed, and it could be that we have run
> 	 * out of them; if so, we start a new poll request to repeat the try.
> 	 * It is rare and it means that a system is doing a slow processing of
> 	 * callbacks.
> 	 */
> 	if (start_new_poll)
> 		(void) start_poll_synchronize_rcu();
> <snip>
>
> We do not use a _full() version, since we need to inform the
> rcu-gp-kthread to initiate a new GP.
>
> Any thoughts?

My kneejerk not-to-be-trusted take is that it does not matter which type
of grace period gets started so long as a grace period does get started.

Presumably you have done the get_state_synchronize_rcu_full() before
this point?

							Thanx, Paul
On Fri, Feb 28, 2025 at 10:25:58AM -0800, Paul E. McKenney wrote:
> On Fri, Feb 28, 2025 at 06:08:19PM +0100, Uladzislau Rezki wrote:
[...]
> > We do not use a _full() version, since we need to inform the
> > rcu-gp-kthread to initiate a new GP.
> >
> > Any thoughts?
>
> My kneejerk not-to-be-trusted take is that it does not matter which type
> of grace period gets started so long as a grace period does get started.
> Presumably you have done the get_state_synchronize_rcu_full() before
> this point?
>
Yes, of course. That code is located in the gp-kthread.
get_state_synchronize_rcu_full() is done before.

--
Uladzislau Rezki
On Fri, Feb 28, 2025 at 07:24:41PM +0100, Uladzislau Rezki wrote: > On Fri, Feb 28, 2025 at 10:21:57AM -0800, Paul E. McKenney wrote: > > On Fri, Feb 28, 2025 at 05:36:47PM +0100, Uladzislau Rezki wrote: > > > On Fri, Feb 28, 2025 at 07:41:40AM -0800, Paul E. McKenney wrote: > > > > On Thu, Feb 27, 2025 at 06:44:15PM +0100, Uladzislau Rezki wrote: > > > > > On Thu, Feb 27, 2025 at 09:26:40AM -0800, Paul E. McKenney wrote: > > > > > > On Thu, Feb 27, 2025 at 09:12:39AM -0800, Boqun Feng wrote: > > > > > > > Hi Ulad, > > > > > > > > > > > > > > I put these three patches into next (and misc.2025.02.27a) for some > > > > > > > testing, hopefully it all goes well and they can make it v6.15. > > > > > > > > > > > > > > A few tag changed below: > > > > > > > > > > > > > > On Thu, Feb 27, 2025 at 02:16:13PM +0100, Uladzislau Rezki (Sony) wrote: > > > > > > > > Switch for using of get_state_synchronize_rcu_full() and > > > > > > > > poll_state_synchronize_rcu_full() pair to debug a normal > > > > > > > > synchronize_rcu() call. > > > > > > > > > > > > > > > > Just using "not" full APIs to identify if a grace period is > > > > > > > > passed or not might lead to a false-positive kernel splat. > > > > > > > > > > > > > > > > It can happen, because get_state_synchronize_rcu() compresses > > > > > > > > both normal and expedited states into one single unsigned long > > > > > > > > value, so a poll_state_synchronize_rcu() can miss GP-completion > > > > > > > > when synchronize_rcu()/synchronize_rcu_expedited() concurrently > > > > > > > > run. > > > > > > > > > > > > > > > > To address this, switch to poll_state_synchronize_rcu_full() and > > > > > > > > get_state_synchronize_rcu_full() APIs, which use separate variables > > > > > > > > for expedited and normal states. > > > > > > > > > > > > > > > > Link: https://lore.kernel.org/lkml/Z5ikQeVmVdsWQrdD@pc636/T/ > > > > > > > > > > > > > > I switch this into "Closes:" per checkpatch. 
> > > > > > > > > > > > > > > Fixes: 988f569ae041 ("rcu: Reduce synchronize_rcu() latency") > > > > > > > > Reported-by: cheung wall <zzqq0103.hey@gmail.com> > > > > > > > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> > > > > > > > > > > > > > > You seem to forget add Paul's Reviewed-by, so I add it in rcu/next. > > > > > > > Would you or Paul double-check the Reviewed-by should be here? > > > > > > > > > > > > I am good with keeping my Reviewed-by tags. > > > > > > > > > > > Thanks Paul! > > > > > > > > Except that I got this from overnight testing of rcu/dev on the shared > > > > RCU tree: > > > > > > > > WARNING: CPU: 5 PID: 14 at kernel/rcu/tree.c:1636 rcu_sr_normal_complete+0x5c/0x80 > > > > > > > > I see this only on TREE05. Which should not be too surprising, given > > > > that this is the scenario that tests it. It happened within five minutes > > > > on all 14 of the TREE05 runs. > > > > > > > Hm.. This is not fun. I tested this on my system and i did not manage to > > > trigger this whereas you do. Something is wrong. > > > > If you have a debug patch, I would be happy to give it a go. > > > I can trigger it. But. > > Some background. I tested those patches during many hours on the stable > kernel which is 6.13. On that kernel i was not able to trigger it. Running > the rcutorture on the our shared "dev" tree, which i did now, triggers this > right away. Bisection? (Hey, you knew that was coming!) Thanx, Paul
Hello, Paul!

> > > > > > > > > > Except that I got this from overnight testing of rcu/dev on the shared > > > > > RCU tree: > > > > > > > > > > WARNING: CPU: 5 PID: 14 at kernel/rcu/tree.c:1636 rcu_sr_normal_complete+0x5c/0x80 > > > > > > > > > > I see this only on TREE05. Which should not be too surprising, given > > > > > that this is the scenario that tests it. It happened within five minutes > > > > > on all 14 of the TREE05 runs. > > > > > > > > > Hm.. This is not fun. I tested this on my system and i did not manage to > > > > trigger this whereas you do. Something is wrong. > > > > > > If you have a debug patch, I would be happy to give it a go. > > > > > I can trigger it. But. > > > > Some background. I tested those patches during many hours on the stable > > kernel which is 6.13. On that kernel i was not able to trigger it. Running > > the rcutorture on the our shared "dev" tree, which i did now, triggers this > > right away. > > Bisection? (Hey, you knew that was coming!)

Looks like this:

rcu: Fix get_state_synchronize_rcu_full() GP-start detection

After reverting it in the dev tree, rcutorture passes TREE05, 16 instances.

--
Uladzislau Rezki
On Fri, Feb 28, 2025 at 08:12:51PM +0100, Uladzislau Rezki wrote: > Hello, Paul! > > > > > > > > > > > > > Except that I got this from overnight testing of rcu/dev on the shared > > > > > > RCU tree: > > > > > > > > > > > > WARNING: CPU: 5 PID: 14 at kernel/rcu/tree.c:1636 rcu_sr_normal_complete+0x5c/0x80 > > > > > > > > > > > > I see this only on TREE05. Which should not be too surprising, given > > > > > > that this is the scenario that tests it. It happened within five minutes > > > > > > on all 14 of the TREE05 runs. > > > > > > > > > > > Hm.. This is not fun. I tested this on my system and i did not manage to > > > > > trigger this whereas you do. Something is wrong. > > > > > > > > If you have a debug patch, I would be happy to give it a go. > > > > > > > I can trigger it. But. > > > > > > Some background. I tested those patches during many hours on the stable > > > kernel which is 6.13. On that kernel i was not able to trigger it. Running > > > the rcutorture on the our shared "dev" tree, which i did now, triggers this > > > right away. > > > > Bisection? (Hey, you knew that was coming!) > > > Looks like this: rcu: Fix get_state_synchronize_rcu_full() GP-start detection > > After revert in the dev, rcutorture passes TREE05, 16 instances.

Huh. We sure don't get to revert that one...

Do we have a problem with the ordering in rcu_gp_init() between the calls to rcu_seq_start() and portions of rcu_sr_normal_gp_init()? For example, do we need to capture the relevant portion of the list before the call to rcu_seq_start(), and do the grace-period-start work afterwards?

My kneejerk (and possibly completely wrong) guess is that rcu_gp_init() calls rcu_gp_start(), then there is a call to synchronize_rcu() whose cookie says wait for the end of the next grace period, then we capture the lists including this one that needs to wait longer. Then when we look at the end of the grace period, boom!
This would be a real bug due to some CPU coming online between the time of the call to rcu_gp_start() and synchronize_rcu(). Or is there some other way that this can happen? Thanx, Paul
On Fri, Feb 28, 2025 at 11:59:55AM -0800, Paul E. McKenney wrote: > On Fri, Feb 28, 2025 at 08:12:51PM +0100, Uladzislau Rezki wrote: > > Hello, Paul! > > > > > > > > > > > > > > > > Except that I got this from overnight testing of rcu/dev on the shared > > > > > > > RCU tree: > > > > > > > > > > > > > > WARNING: CPU: 5 PID: 14 at kernel/rcu/tree.c:1636 rcu_sr_normal_complete+0x5c/0x80 > > > > > > > > > > > > > > I see this only on TREE05. Which should not be too surprising, given > > > > > > > that this is the scenario that tests it. It happened within five minutes > > > > > > > on all 14 of the TREE05 runs. > > > > > > > > > > > > > Hm.. This is not fun. I tested this on my system and i did not manage to > > > > > > trigger this whereas you do. Something is wrong. > > > > > > > > > > If you have a debug patch, I would be happy to give it a go. > > > > > > > > > I can trigger it. But. > > > > > > > > Some background. I tested those patches during many hours on the stable > > > > kernel which is 6.13. On that kernel i was not able to trigger it. Running > > > > the rcutorture on the our shared "dev" tree, which i did now, triggers this > > > > right away. > > > > > > Bisection? (Hey, you knew that was coming!) > > > > > Looks like this: rcu: Fix get_state_synchronize_rcu_full() GP-start detection > > > > After revert in the dev, rcutorture passes TREE05, 16 instances. > > Huh. We sure don't get to revert that one... > > Do we have a problem with the ordering in rcu_gp_init() between the calls > to rcu_seq_start() and portions of rcu_sr_normal_gp_init()? For example, > do we need to capture the relevant portion of the list before the call > to rcu_seq_start(), and do the grace-period-start work afterwards? I tried moving the call to rcu_sr_normal_gp_init() before the call to rcu_seq_start() and got no failures in a one-hour run of 200*TREE05. Which does not necessarily mean that this is the correct fix, but I figured that it might at least provide food for thought. 
Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 48384fa2eaeb8..d3efeff7740e7 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1819,10 +1819,10 @@ static noinline_for_stack bool rcu_gp_init(void)
 
 	/* Advance to a new grace period and initialize state. */
 	record_gp_stall_check_time();
+	start_new_poll = rcu_sr_normal_gp_init();
 	/* Record GP times before starting GP, hence rcu_seq_start(). */
 	rcu_seq_start(&rcu_state.gp_seq);
 	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
-	start_new_poll = rcu_sr_normal_gp_init();
 	trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
 	rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
 	raw_spin_unlock_irq_rcu_node(rnp);
On Fri, Feb 28, 2025 at 05:08:49PM -0800, Paul E. McKenney wrote: > On Fri, Feb 28, 2025 at 11:59:55AM -0800, Paul E. McKenney wrote: > > On Fri, Feb 28, 2025 at 08:12:51PM +0100, Uladzislau Rezki wrote: > > > Hello, Paul! > > > > > > > > > > > > > > > > > > > Except that I got this from overnight testing of rcu/dev on the shared > > > > > > > > RCU tree: > > > > > > > > > > > > > > > > WARNING: CPU: 5 PID: 14 at kernel/rcu/tree.c:1636 rcu_sr_normal_complete+0x5c/0x80 > > > > > > > > > > > > > > > > I see this only on TREE05. Which should not be too surprising, given > > > > > > > > that this is the scenario that tests it. It happened within five minutes > > > > > > > > on all 14 of the TREE05 runs. > > > > > > > > > > > > > > > Hm.. This is not fun. I tested this on my system and i did not manage to > > > > > > > trigger this whereas you do. Something is wrong. > > > > > > > > > > > > If you have a debug patch, I would be happy to give it a go. > > > > > > > > > > > I can trigger it. But. > > > > > > > > > > Some background. I tested those patches during many hours on the stable > > > > > kernel which is 6.13. On that kernel i was not able to trigger it. Running > > > > > the rcutorture on the our shared "dev" tree, which i did now, triggers this > > > > > right away. > > > > > > > > Bisection? (Hey, you knew that was coming!) > > > > > > > Looks like this: rcu: Fix get_state_synchronize_rcu_full() GP-start detection > > > > > > After revert in the dev, rcutorture passes TREE05, 16 instances. > > > > Huh. We sure don't get to revert that one... > > > > Do we have a problem with the ordering in rcu_gp_init() between the calls > > to rcu_seq_start() and portions of rcu_sr_normal_gp_init()? For example, > > do we need to capture the relevant portion of the list before the call > > to rcu_seq_start(), and do the grace-period-start work afterwards? 
>
> I tried moving the call to rcu_sr_normal_gp_init() before the call to
> rcu_seq_start() and got no failures in a one-hour run of 200*TREE05.
> Which does not necessarily mean that this is the correct fix, but I
> figured that it might at least provide food for thought.
>
> 	Thanx, Paul
>
> ------------------------------------------------------------------------
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 48384fa2eaeb8..d3efeff7740e7 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1819,10 +1819,10 @@ static noinline_for_stack bool rcu_gp_init(void)
>
>  	/* Advance to a new grace period and initialize state. */
>  	record_gp_stall_check_time();
> +	start_new_poll = rcu_sr_normal_gp_init();
>  	/* Record GP times before starting GP, hence rcu_seq_start(). */
>  	rcu_seq_start(&rcu_state.gp_seq);
>  	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> -	start_new_poll = rcu_sr_normal_gp_init();
>  	trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
>  	rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
>  	raw_spin_unlock_irq_rcu_node(rnp);

I have been running this for 24 hours already, TREE05 * 16 scenario, and I do not see any warnings yet. There is a race, indeed: gp_seq is moved forward, whereas clients can still come in until rcu_sr_normal_gp_init() places a dummy wait-head for this GP.

Thank you for testing, Paul, and for looking into this :)

--
Uladzislau Rezki
On Sun, Mar 02, 2025 at 11:19:44AM +0100, Uladzislau Rezki wrote: > On Fri, Feb 28, 2025 at 05:08:49PM -0800, Paul E. McKenney wrote: > > On Fri, Feb 28, 2025 at 11:59:55AM -0800, Paul E. McKenney wrote: > > > On Fri, Feb 28, 2025 at 08:12:51PM +0100, Uladzislau Rezki wrote: > > > > Hello, Paul! > > > > > > > > > > > > > > > > > > > > > > Except that I got this from overnight testing of rcu/dev on the shared > > > > > > > > > RCU tree: > > > > > > > > > > > > > > > > > > WARNING: CPU: 5 PID: 14 at kernel/rcu/tree.c:1636 rcu_sr_normal_complete+0x5c/0x80 > > > > > > > > > > > > > > > > > > I see this only on TREE05. Which should not be too surprising, given > > > > > > > > > that this is the scenario that tests it. It happened within five minutes > > > > > > > > > on all 14 of the TREE05 runs. > > > > > > > > > > > > > > > > > Hm.. This is not fun. I tested this on my system and i did not manage to > > > > > > > > trigger this whereas you do. Something is wrong. > > > > > > > > > > > > > > If you have a debug patch, I would be happy to give it a go. > > > > > > > > > > > > > I can trigger it. But. > > > > > > > > > > > > Some background. I tested those patches during many hours on the stable > > > > > > kernel which is 6.13. On that kernel i was not able to trigger it. Running > > > > > > the rcutorture on the our shared "dev" tree, which i did now, triggers this > > > > > > right away. > > > > > > > > > > Bisection? (Hey, you knew that was coming!) > > > > > > > > > Looks like this: rcu: Fix get_state_synchronize_rcu_full() GP-start detection > > > > > > > > After revert in the dev, rcutorture passes TREE05, 16 instances. > > > > > > Huh. We sure don't get to revert that one... > > > > > > Do we have a problem with the ordering in rcu_gp_init() between the calls > > > to rcu_seq_start() and portions of rcu_sr_normal_gp_init()? 
For example, > > > do we need to capture the relevant portion of the list before the call > > > to rcu_seq_start(), and do the grace-period-start work afterwards? > > > > I tried moving the call to rcu_sr_normal_gp_init() before the call to > > rcu_seq_start() and got no failures in a one-hour run of 200*TREE05. > > Which does not necessarily mean that this is the correct fix, but I > > figured that it might at least provide food for thought. > > > > Thanx, Paul > > > > ------------------------------------------------------------------------ > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > > index 48384fa2eaeb8..d3efeff7740e7 100644 > > --- a/kernel/rcu/tree.c > > +++ b/kernel/rcu/tree.c > > @@ -1819,10 +1819,10 @@ static noinline_for_stack bool rcu_gp_init(void) > > > > /* Advance to a new grace period and initialize state. */ > > record_gp_stall_check_time(); > > + start_new_poll = rcu_sr_normal_gp_init(); > > /* Record GP times before starting GP, hence rcu_seq_start(). */ > > rcu_seq_start(&rcu_state.gp_seq); > > ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq); > > - start_new_poll = rcu_sr_normal_gp_init(); > > trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start")); > > rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap); > > raw_spin_unlock_irq_rcu_node(rnp); > > > Running this 24 hours already. TREE05 * 16 scenario. I do not see any > warnings yet. There is a race, indeed. The gp_seq is moved forward, > wheres clients can still come until rcu_sr_normal_gp_init() places a > dummy-wait-head for this GP. > > Thank you for testing Paul and looking to this :) Very good! This is a bug in this commit of mine: 012f47f0f806 ("rcu: Fix get_state_synchronize_rcu_full() GP-start detection") Boqun, could you please fold this into that commit with something like this added to the commit log just before the paragraph starting with "Although this fixes 91a967fd6934"? 
However, simply changing the get_state_synchronize_rcu_full() function to use rcu_state.gp_seq instead of the root rcu_node structure's ->gp_seq field results in a theoretical bug in kernels booted with rcutree.rcu_normal_wake_from_gp=1 due to the following sequence of events:

o	The rcu_gp_init() function invokes rcu_seq_start() to officially start a new grace period.

o	A new RCU reader begins, referencing X from some RCU-protected list. The new grace period is not obligated to wait for this reader.

o	An updater removes X, then calls synchronize_rcu(), which queues a wait element.

o	The grace period ends, awakening the updater, which frees X while the reader is still referencing it.

The reason that this is theoretical is that although the grace period has officially started, none of the CPUs are officially aware of this, and thus will have to assume that the RCU reader pre-dated the start of the grace period.

Except for kernels built with CONFIG_PROVE_RCU=y, which use the polled grace-period APIs, which can and do complain bitterly when this sequence of events occurs. Not only that, there might be some future RCU grace-period mechanism that pulls this sequence of events from theory into practice. This commit therefore also pulls the call to rcu_sr_normal_gp_init() to precede that to rcu_seq_start().

I will let you guys decide whether the call to rcu_sr_normal_gp_init() needs a comment, and, if so, what that comment should say. ;-)

	Thanx, Paul
On Sun, Mar 02, 2025 at 09:39:44AM -0800, Paul E. McKenney wrote: > On Sun, Mar 02, 2025 at 11:19:44AM +0100, Uladzislau Rezki wrote: > > On Fri, Feb 28, 2025 at 05:08:49PM -0800, Paul E. McKenney wrote: > > > On Fri, Feb 28, 2025 at 11:59:55AM -0800, Paul E. McKenney wrote: > > > > On Fri, Feb 28, 2025 at 08:12:51PM +0100, Uladzislau Rezki wrote: > > > > > Hello, Paul! > > > > > > > > > > > > > > > > > > > > > > > > > Except that I got this from overnight testing of rcu/dev on the shared > > > > > > > > > > RCU tree: > > > > > > > > > > > > > > > > > > > > WARNING: CPU: 5 PID: 14 at kernel/rcu/tree.c:1636 rcu_sr_normal_complete+0x5c/0x80 > > > > > > > > > > > > > > > > > > > > I see this only on TREE05. Which should not be too surprising, given > > > > > > > > > > that this is the scenario that tests it. It happened within five minutes > > > > > > > > > > on all 14 of the TREE05 runs. > > > > > > > > > > > > > > > > > > > Hm.. This is not fun. I tested this on my system and i did not manage to > > > > > > > > > trigger this whereas you do. Something is wrong. > > > > > > > > > > > > > > > > If you have a debug patch, I would be happy to give it a go. > > > > > > > > > > > > > > > I can trigger it. But. > > > > > > > > > > > > > > Some background. I tested those patches during many hours on the stable > > > > > > > kernel which is 6.13. On that kernel i was not able to trigger it. Running > > > > > > > the rcutorture on the our shared "dev" tree, which i did now, triggers this > > > > > > > right away. > > > > > > > > > > > > Bisection? (Hey, you knew that was coming!) > > > > > > > > > > > Looks like this: rcu: Fix get_state_synchronize_rcu_full() GP-start detection > > > > > > > > > > After revert in the dev, rcutorture passes TREE05, 16 instances. > > > > > > > > Huh. We sure don't get to revert that one... > > > > > > > > Do we have a problem with the ordering in rcu_gp_init() between the calls > > > > to rcu_seq_start() and portions of rcu_sr_normal_gp_init()? 
For example, > > > > do we need to capture the relevant portion of the list before the call > > > > to rcu_seq_start(), and do the grace-period-start work afterwards? > > > > > > I tried moving the call to rcu_sr_normal_gp_init() before the call to > > > rcu_seq_start() and got no failures in a one-hour run of 200*TREE05. > > > Which does not necessarily mean that this is the correct fix, but I > > > figured that it might at least provide food for thought. > > > > > > Thanx, Paul > > > > > > ------------------------------------------------------------------------ > > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > > > index 48384fa2eaeb8..d3efeff7740e7 100644 > > > --- a/kernel/rcu/tree.c > > > +++ b/kernel/rcu/tree.c > > > @@ -1819,10 +1819,10 @@ static noinline_for_stack bool rcu_gp_init(void) > > > > > > /* Advance to a new grace period and initialize state. */ > > > record_gp_stall_check_time(); > > > + start_new_poll = rcu_sr_normal_gp_init(); > > > /* Record GP times before starting GP, hence rcu_seq_start(). */ > > > rcu_seq_start(&rcu_state.gp_seq); > > > ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq); > > > - start_new_poll = rcu_sr_normal_gp_init(); > > > trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start")); > > > rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap); > > > raw_spin_unlock_irq_rcu_node(rnp); > > > > > Running this 24 hours already. TREE05 * 16 scenario. I do not see any > > warnings yet. There is a race, indeed. The gp_seq is moved forward, > > wheres clients can still come until rcu_sr_normal_gp_init() places a > > dummy-wait-head for this GP. > > > > Thank you for testing Paul and looking to this :) > > Very good! This is a bug in this commit of mine: > > 012f47f0f806 ("rcu: Fix get_state_synchronize_rcu_full() GP-start detection") > > Boqun, could you please fold this into that commit with something like > this added to the commit log just before the paragraph starting with > "Although this fixes 91a967fd6934"? 
> > However, simply changing get_state_synchronize_rcu_full() function > to use rcu_state.gp_seq instead of the root rcu_node structure's > ->gp_seq field results in a theoretical bug in kernels booted > with rcutree.rcu_normal_wake_from_gp=1 due to the following > sequence of events: > > o The rcu_gp_init() function invokes rcu_seq_start() > to officially start a new grace period. > > o A new RCU reader begins, referencing X from some > RCU-protected list. The new grace period is not > obligated to wait for this reader. > > o An updater removes X, then calls synchronize_rcu(), > which queues a wait element. > > o The grace period ends, awakening the updater, which > frees X while the reader is still referencing it. > > The reason that this is theoretical is that although the > grace period has officially started, none of the CPUs are > officially aware of this, and thus will have to assume that > the RCU reader pre-dated the start of the grace period. > > Except for kernels built with CONFIG_PROVE_RCU=y, which use the > polled grace-period APIs, which can and do complain bitterly when > this sequence of events occurs. Not only that, there might be > some future RCU grace-period mechanism that pulls this sequence > of events from theory into practice. This commit therefore > also pulls the call to rcu_sr_normal_gp_init() to precede that > to rcu_seq_start(). > > I will let you guys decide whether the call to rcu_sr_normal_gp_init() > needs a comment, and, if so, what that comment should say. ;-) > Please see the updated patch below (next and rcu/dev branches are updated too). For the comment, I think we can add something like /* * A new wait segment must be started before gp_seq advanced, so * that previous gp waiters won't observe the new gp_seq. 
 */

but I will let Ulad decide ;-)

Regards,
Boqun

------------------------>8

Subject: [PATCH] rcu: Fix get_state_synchronize_rcu_full() GP-start detection

The get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full() functions use the root rcu_node structure's ->gp_seq field to detect the beginnings and ends of grace periods, respectively. This choice is necessary for the poll_state_synchronize_rcu_full() function because (give or take counter wrap), the following sequence is guaranteed not to trigger:

	get_state_synchronize_rcu_full(&rgos);
	synchronize_rcu();
	WARN_ON_ONCE(!poll_state_synchronize_rcu_full(&rgos));

The RCU callbacks that awaken synchronize_rcu() instances are guaranteed not to be invoked before the root rcu_node structure's ->gp_seq field is updated to indicate the end of the grace period. However, these callbacks might start being invoked immediately thereafter, in particular, before rcu_state.gp_seq has been updated. Therefore, poll_state_synchronize_rcu_full() must refer to the root rcu_node structure's ->gp_seq field. Because this field is updated under this structure's ->lock, any code following a call to poll_state_synchronize_rcu_full() will be fully ordered after the full grace-period computation, as is required by RCU's memory-ordering semantics.

By symmetry, the get_state_synchronize_rcu_full() function should also use this same root rcu_node structure's ->gp_seq field. But it turns out that symmetry is profoundly (though extremely infrequently) destructive in this case. To see this, consider the following sequence of events:

1.	CPU 0 starts a new grace period, and updates rcu_state.gp_seq accordingly.

2.	As its first step of grace-period initialization, CPU 0 examines the current CPU hotplug state and decides that it need not wait for CPU 1, which is currently offline.

3.	CPU 1 comes online, and updates its state. But this does not affect the current grace period, but rather the one after that. After all, CPU 1 was offline when the current grace period started, so all pre-existing RCU readers on CPU 1 must have completed or been preempted before it last went offline. The current grace period therefore has nothing it needs to wait for on CPU 1.

4.	CPU 1 switches to an rcutorture kthread which is running rcutorture's rcu_torture_reader() function, which starts a new RCU reader.

5.	CPU 2 is running rcutorture's rcu_torture_writer() function and collects a new polled grace-period "cookie" using get_state_synchronize_rcu_full(). Because the newly started grace period has not completed initialization, the root rcu_node structure's ->gp_seq field has not yet been updated to indicate that this new grace period has already started. This cookie is therefore set up for the end of the current grace period (rather than the end of the following grace period).

6.	CPU 0 finishes grace-period initialization.

7.	If CPU 1's rcutorture reader is preempted, it will be added to the ->blkd_tasks list, but because CPU 1's ->qsmask bit is not set in CPU 1's leaf rcu_node structure, the ->gp_tasks pointer will not be updated. Thus, this grace period will not wait on it. Which is only fair, given that the CPU did not come online until after the grace period officially started.

8.	CPUs 0 and 2 then detect the new grace period and then report a quiescent state to the RCU core.

9.	Because CPU 1 was offline at the start of the current grace period, CPUs 0 and 2 are the only CPUs that this grace period needs to wait on. So the grace period ends and post-grace-period cleanup starts. In particular, the root rcu_node structure's ->gp_seq field is updated to indicate that this grace period has now ended.

10.	CPU 2 continues running rcu_torture_writer() and sees that, from the viewpoint of the root rcu_node structure consulted by the poll_state_synchronize_rcu_full() function, the grace period has ended. It therefore updates state accordingly.

11.	CPU 1 is still running the same RCU reader, which notices this update and thus complains about the too-short grace period.

The fix is for the get_state_synchronize_rcu_full() function to use rcu_state.gp_seq instead of the root rcu_node structure's ->gp_seq field. With this change in place, if step 5's cookie indicates that the grace period has not yet started, then any prior code executed by CPU 2 must have happened before CPU 1 came online. This will in turn prevent CPU 1's code in steps 3 and 11 from spanning CPU 2's grace-period wait, thus preventing CPU 1 from being subjected to a too-short grace period. This commit therefore makes this change.

Note that there is no change to the poll_state_synchronize_rcu_full() function, which, as noted above, must continue to use the root rcu_node structure's ->gp_seq field. This is of course an asymmetry between these two functions, but it is an asymmetry that is absolutely required for correct operation. It is a common human tendency to greatly value symmetry, and sometimes symmetry is a wonderful thing. Other times, symmetry results in poor performance. But in this case, symmetry is just plain wrong.

Nevertheless, the asymmetry does require an additional adjustment. It is possible for get_state_synchronize_rcu_full() to see a given grace period as having started, but for an immediately following poll_state_synchronize_rcu_full() to see it as having not yet started. Given the current rcu_seq_done_exact() implementation, this will result in a false-positive indication that the grace period is done from poll_state_synchronize_rcu_full(). This is dealt with by making rcu_seq_done_exact() reach back three grace periods rather than just two of them.
However, simply changing the get_state_synchronize_rcu_full() function to use rcu_state.gp_seq instead of the root rcu_node structure's ->gp_seq field results in a theoretical bug in kernels booted with rcutree.rcu_normal_wake_from_gp=1 due to the following sequence of events:

o	The rcu_gp_init() function invokes rcu_seq_start() to officially start a new grace period.

o	A new RCU reader begins, referencing X from some RCU-protected list. The new grace period is not obligated to wait for this reader.

o	An updater removes X, then calls synchronize_rcu(), which queues a wait element.

o	The grace period ends, awakening the updater, which frees X while the reader is still referencing it.

The reason that this is theoretical is that although the grace period has officially started, none of the CPUs are officially aware of this, and thus will have to assume that the RCU reader pre-dated the start of the grace period.

Except for kernels built with CONFIG_PROVE_RCU=y, which use the polled grace-period APIs, which can and do complain bitterly when this sequence of events occurs. Not only that, there might be some future RCU grace-period mechanism that pulls this sequence of events from theory into practice. This commit therefore also pulls the call to rcu_sr_normal_gp_init() to precede that to rcu_seq_start().

Although this fixes 91a967fd6934 ("rcu: Add full-sized polling for get_completed*() and poll_state*()"), it is not clear that it is worth backporting this commit. First, it took me many weeks to convince rcutorture to reproduce this more frequently than once per year. Second, this cannot be reproduced at all without frequent CPU-hotplug operations, as in waiting all of 50 milliseconds from the end of the previous operation until starting the next one. Third, the TREE03.boot settings cause multi-millisecond delays during RCU grace-period initialization, which greatly increase the probability of the above sequence of events. (Don't do this in production workloads!)
Fourth, the TREE03 rcutorture scenario was modified to use four-CPU guest OSes, to have a single-rcu_node combining tree, no testing of RCU priority boosting, and no random preemption, and these modifications were necessary to reproduce this issue in a reasonable timeframe. Fifth, extremely heavy use of get_state_synchronize_rcu_full() and/or poll_state_synchronize_rcu_full() is required to reproduce this, and as of v6.12, only kfree_rcu() uses it, and even then not particularly heavily.

[boqun: Apply the fix [1]]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Tested-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Link: https://lore.kernel.org/rcu/d90bd6d9-d15c-4b9b-8a69-95336e74e8f4@paulmck-laptop/ [1]
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 kernel/rcu/rcu.h  |  2 +-
 kernel/rcu/tree.c | 11 +++++++----
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index feb3ac1dc5d5..f87c9d6d36fc 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -162,7 +162,7 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s)
 {
 	unsigned long cur_s = READ_ONCE(*sp);
 
-	return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1));
+	return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1));
 }
 
 /*
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index e4c0ce600b2b..f80cfdc3ee3e 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1801,10 +1801,10 @@ static noinline_for_stack bool rcu_gp_init(void)
 
 	/* Advance to a new grace period and initialize state. */
 	record_gp_stall_check_time();
+	start_new_poll = rcu_sr_normal_gp_init();
 	/* Record GP times before starting GP, hence rcu_seq_start(). */
 	rcu_seq_start(&rcu_state.gp_seq);
 	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
-	start_new_poll = rcu_sr_normal_gp_init();
 	trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
 	rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
 	raw_spin_unlock_irq_rcu_node(rnp);
@@ -3357,14 +3357,17 @@ EXPORT_SYMBOL_GPL(get_state_synchronize_rcu);
  */
 void get_state_synchronize_rcu_full(struct rcu_gp_oldstate *rgosp)
 {
-	struct rcu_node *rnp = rcu_get_root();
-
 	/*
 	 * Any prior manipulation of RCU-protected data must happen
 	 * before the loads from ->gp_seq and ->expedited_sequence.
 	 */
 	smp_mb(); /* ^^^ */
-	rgosp->rgos_norm = rcu_seq_snap(&rnp->gp_seq);
+
+	// Yes, rcu_state.gp_seq, not rnp_root->gp_seq, the latter's use
+	// in poll_state_synchronize_rcu_full() notwithstanding.  Use of
+	// the latter here would result in too-short grace periods due to
+	// interactions with newly onlined CPUs.
+	rgosp->rgos_norm = rcu_seq_snap(&rcu_state.gp_seq);
 	rgosp->rgos_exp = rcu_seq_snap(&rcu_state.expedited_sequence);
 }
 EXPORT_SYMBOL_GPL(get_state_synchronize_rcu_full);
On Sun, Mar 02, 2025 at 10:46:29AM -0800, Boqun Feng wrote: > On Sun, Mar 02, 2025 at 09:39:44AM -0800, Paul E. McKenney wrote: > > On Sun, Mar 02, 2025 at 11:19:44AM +0100, Uladzislau Rezki wrote: > > > On Fri, Feb 28, 2025 at 05:08:49PM -0800, Paul E. McKenney wrote: > > > > On Fri, Feb 28, 2025 at 11:59:55AM -0800, Paul E. McKenney wrote: > > > > > On Fri, Feb 28, 2025 at 08:12:51PM +0100, Uladzislau Rezki wrote: > > > > > > Hello, Paul! > > > > > > > > > > > > > > > > > > > > > > > > > > > > Except that I got this from overnight testing of rcu/dev on the shared > > > > > > > > > > > RCU tree: > > > > > > > > > > > > > > > > > > > > > > WARNING: CPU: 5 PID: 14 at kernel/rcu/tree.c:1636 rcu_sr_normal_complete+0x5c/0x80 > > > > > > > > > > > > > > > > > > > > > > I see this only on TREE05. Which should not be too surprising, given > > > > > > > > > > > that this is the scenario that tests it. It happened within five minutes > > > > > > > > > > > on all 14 of the TREE05 runs. > > > > > > > > > > > > > > > > > > > > > Hm.. This is not fun. I tested this on my system and i did not manage to > > > > > > > > > > trigger this whereas you do. Something is wrong. > > > > > > > > > > > > > > > > > > If you have a debug patch, I would be happy to give it a go. > > > > > > > > > > > > > > > > > I can trigger it. But. > > > > > > > > > > > > > > > > Some background. I tested those patches during many hours on the stable > > > > > > > > kernel which is 6.13. On that kernel i was not able to trigger it. Running > > > > > > > > the rcutorture on the our shared "dev" tree, which i did now, triggers this > > > > > > > > right away. > > > > > > > > > > > > > > Bisection? (Hey, you knew that was coming!) > > > > > > > > > > > > > Looks like this: rcu: Fix get_state_synchronize_rcu_full() GP-start detection > > > > > > > > > > > > After revert in the dev, rcutorture passes TREE05, 16 instances. > > > > > > > > > > Huh. We sure don't get to revert that one... 
> > > > > > > > > > Do we have a problem with the ordering in rcu_gp_init() between the calls > > > > > to rcu_seq_start() and portions of rcu_sr_normal_gp_init()? For example, > > > > > do we need to capture the relevant portion of the list before the call > > > > > to rcu_seq_start(), and do the grace-period-start work afterwards? > > > > > > > > I tried moving the call to rcu_sr_normal_gp_init() before the call to > > > > rcu_seq_start() and got no failures in a one-hour run of 200*TREE05. > > > > Which does not necessarily mean that this is the correct fix, but I > > > > figured that it might at least provide food for thought. > > > > > > > > Thanx, Paul > > > > > > > > ------------------------------------------------------------------------ > > > > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > > > > index 48384fa2eaeb8..d3efeff7740e7 100644 > > > > --- a/kernel/rcu/tree.c > > > > +++ b/kernel/rcu/tree.c > > > > @@ -1819,10 +1819,10 @@ static noinline_for_stack bool rcu_gp_init(void) > > > > > > > > /* Advance to a new grace period and initialize state. */ > > > > record_gp_stall_check_time(); > > > > + start_new_poll = rcu_sr_normal_gp_init(); > > > > /* Record GP times before starting GP, hence rcu_seq_start(). */ > > > > rcu_seq_start(&rcu_state.gp_seq); > > > > ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq); > > > > - start_new_poll = rcu_sr_normal_gp_init(); > > > > trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start")); > > > > rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap); > > > > raw_spin_unlock_irq_rcu_node(rnp); > > > > > > > Running this 24 hours already. TREE05 * 16 scenario. I do not see any > > > warnings yet. There is a race, indeed. The gp_seq is moved forward, > > > wheres clients can still come until rcu_sr_normal_gp_init() places a > > > dummy-wait-head for this GP. > > > > > > Thank you for testing Paul and looking to this :) > > > > Very good! 
This is a bug in this commit of mine: > > > > 012f47f0f806 ("rcu: Fix get_state_synchronize_rcu_full() GP-start detection") > > > > Boqun, could you please fold this into that commit with something like > > this added to the commit log just before the paragraph starting with > > "Although this fixes 91a967fd6934"? > > > > However, simply changing get_state_synchronize_rcu_full() function > > to use rcu_state.gp_seq instead of the root rcu_node structure's > > ->gp_seq field results in a theoretical bug in kernels booted > > with rcutree.rcu_normal_wake_from_gp=1 due to the following > > sequence of events: > > > > o The rcu_gp_init() function invokes rcu_seq_start() > > to officially start a new grace period. > > > > o A new RCU reader begins, referencing X from some > > RCU-protected list. The new grace period is not > > obligated to wait for this reader. > > > > o An updater removes X, then calls synchronize_rcu(), > > which queues a wait element. > > > > o The grace period ends, awakening the updater, which > > frees X while the reader is still referencing it. > > > > The reason that this is theoretical is that although the > > grace period has officially started, none of the CPUs are > > officially aware of this, and thus will have to assume that > > the RCU reader pre-dated the start of the grace period. > > > > Except for kernels built with CONFIG_PROVE_RCU=y, which use the > > polled grace-period APIs, which can and do complain bitterly when > > this sequence of events occurs. Not only that, there might be > > some future RCU grace-period mechanism that pulls this sequence > > of events from theory into practice. This commit therefore > > also pulls the call to rcu_sr_normal_gp_init() to precede that > > to rcu_seq_start(). > > > > I will let you guys decide whether the call to rcu_sr_normal_gp_init() > > needs a comment, and, if so, what that comment should say. ;-) > > > > Please see the updated patch below (next and rcu/dev branches are > updated too). 
Works for me! > For the comment, I think we can add something like > > /* > * A new wait segment must be started before gp_seq advanced, so > * that previous gp waiters won't observe the new gp_seq. > */ > > but I will let Ulad to decide ;-) Over to you, Uladzislau! ;-) Thanx, Paul > Regards, > Boqun > > ------------------------>8 > Subject: [PATCH] rcu: Fix get_state_synchronize_rcu_full() GP-start detection > > The get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full() > functions use the root rcu_node structure's ->gp_seq field to detect > the beginnings and ends of grace periods, respectively. This choice is > necessary for the poll_state_synchronize_rcu_full() function because > (give or take counter wrap), the following sequence is guaranteed not > to trigger: > > get_state_synchronize_rcu_full(&rgos); > synchronize_rcu(); > WARN_ON_ONCE(!poll_state_synchronize_rcu_full(&rgos)); > > The RCU callbacks that awaken synchronize_rcu() instances are > guaranteed not to be invoked before the root rcu_node structure's > ->gp_seq field is updated to indicate the end of the grace period. > However, these callbacks might start being invoked immediately > thereafter, in particular, before rcu_state.gp_seq has been updated. > Therefore, poll_state_synchronize_rcu_full() must refer to the > root rcu_node structure's ->gp_seq field. Because this field is > updated under this structure's ->lock, any code following a call to > poll_state_synchronize_rcu_full() will be fully ordered after the > full grace-period computation, as is required by RCU's memory-ordering > semantics. > > By symmetry, the get_state_synchronize_rcu_full() function should also > use this same root rcu_node structure's ->gp_seq field. But it turns out > that symmetry is profoundly (though extremely infrequently) destructive > in this case. To see this, consider the following sequence of events: > > 1. CPU 0 starts a new grace period, and updates rcu_state.gp_seq > accordingly. > > 2. 
As its first step of grace-period initialization, CPU 0 examines > the current CPU hotplug state and decides that it need not wait > for CPU 1, which is currently offline. > > 3. CPU 1 comes online, and updates its state. But this does not > affect the current grace period, but rather the one after that. > After all, CPU 1 was offline when the current grace period > started, so all pre-existing RCU readers on CPU 1 must have > completed or been preempted before it last went offline. > The current grace period therefore has nothing it needs to wait > for on CPU 1. > > 4. CPU 1 switches to an rcutorture kthread which is running > rcutorture's rcu_torture_reader() function, which starts a new > RCU reader. > > 5. CPU 2 is running rcutorture's rcu_torture_writer() function > and collects a new polled grace-period "cookie" using > get_state_synchronize_rcu_full(). Because the newly started > grace period has not completed initialization, the root rcu_node > structure's ->gp_seq field has not yet been updated to indicate > that this new grace period has already started. > > This cookie is therefore set up for the end of the current grace > period (rather than the end of the following grace period). > > 6. CPU 0 finishes grace-period initialization. > > 7. If CPU 1’s rcutorture reader is preempted, it will be added to > the ->blkd_tasks list, but because CPU 1’s ->qsmask bit is not > set in CPU 1's leaf rcu_node structure, the ->gp_tasks pointer > will not be updated. Thus, this grace period will not wait on > it. Which is only fair, given that the CPU did not come online > until after the grace period officially started. > > 8. CPUs 0 and 2 then detect the new grace period and then report > a quiescent state to the RCU core. > > 9. Because CPU 1 was offline at the start of the current grace > period, CPUs 0 and 2 are the only CPUs that this grace period > needs to wait on. So the grace period ends and post-grace-period > cleanup starts. 
In particular, the root rcu_node structure's > ->gp_seq field is updated to indicate that this grace period > has now ended. > > 10. CPU 2 continues running rcu_torture_writer() and sees that, > from the viewpoint of the root rcu_node structure consulted by > the poll_state_synchronize_rcu_full() function, the grace period > has ended. It therefore updates state accordingly. > > 11. CPU 1 is still running the same RCU reader, which notices this > update and thus complains about the too-short grace period. > > The fix is for the get_state_synchronize_rcu_full() function to use > rcu_state.gp_seq instead of the root rcu_node structure's ->gp_seq field. > With this change in place, if step 5's cookie indicates that the grace > period has not yet started, then any prior code executed by CPU 2 must > have happened before CPU 1 came online. This will in turn prevent CPU > 1's code in steps 3 and 11 from spanning CPU 2's grace-period wait, > thus preventing CPU 1 from being subjected to a too-short grace period. > > This commit therefore makes this change. Note that there is no change to > the poll_state_synchronize_rcu_full() function, which as noted above, > must continue to use the root rcu_node structure's ->gp_seq field. > This is of course an asymmetry between these two functions, but is an > asymmetry that is absolutely required for correct operation. It is a > common human tendency to greatly value symmetry, and sometimes symmetry > is a wonderful thing. Other times, symmetry results in poor performance. > But in this case, symmetry is just plain wrong. > > Nevertheless, the asymmetry does require an additional adjustment. > It is possible for get_state_synchronize_rcu_full() to see a given > grace period as having started, but for an immediately following > poll_state_synchronize_rcu_full() to see it as having not yet started. 
> Given the current rcu_seq_done_exact() implementation, this will > result in a false-positive indication that the grace period is done > from poll_state_synchronize_rcu_full(). This is dealt with by making > rcu_seq_done_exact() reach back three grace periods rather than just > two of them. > > However, simply changing get_state_synchronize_rcu_full() function to > use rcu_state.gp_seq instead of the root rcu_node structure's ->gp_seq > field results in a theoretical bug in kernels booted with > rcutree.rcu_normal_wake_from_gp=1 due to the following sequence of > events: > > o The rcu_gp_init() function invokes rcu_seq_start() to officially > start a new grace period. > > o A new RCU reader begins, referencing X from some RCU-protected > list. The new grace period is not obligated to wait for this > reader. > > o An updater removes X, then calls synchronize_rcu(), which queues > a wait element. > > o The grace period ends, awakening the updater, which frees X > while the reader is still referencing it. > > The reason that this is theoretical is that although the grace period > has officially started, none of the CPUs are officially aware of this, > and thus will have to assume that the RCU reader pre-dated the start of > the grace period. > > Except for kernels built with CONFIG_PROVE_RCU=y, which use the polled > grace-period APIs, which can and do complain bitterly when this sequence > of events occurs. Not only that, there might be some future RCU > grace-period mechanism that pulls this sequence of events from theory > into practice. This commit therefore also pulls the call to > rcu_sr_normal_gp_init() to precede that to rcu_seq_start(). > > Although this fixes 91a967fd6934 ("rcu: Add full-sized polling for > get_completed*() and poll_state*()"), it is not clear that it is > worth backporting this commit. First, it took me many weeks to > convince rcutorture to reproduce this more frequently than once > per year. 
Second, this cannot be reproduced at all without frequent > CPU-hotplug operations, as in waiting all of 50 milliseconds from the > end of the previous operation until starting the next one. Third, > the TREE03.boot settings cause multi-millisecond delays during RCU > grace-period initialization, which greatly increase the probability of > the above sequence of events. (Don't do this in production workloads!) > Fourth, the TREE03 rcutorture scenario was modified to use four-CPU > guest OSes, to have a single-rcu_node combining tree, no testing of RCU > priority boosting, and no random preemption, and these modifications > were necessary to reproduce this issue in a reasonable timeframe. > Fifth, extremely heavy use of get_state_synchronize_rcu_full() and/or > poll_state_synchronize_rcu_full() is required to reproduce this, and as of > v6.12, only kfree_rcu() uses it, and even then not particularly heavily. > > [boqun: Apply the fix [1]] > > Signed-off-by: Paul E. McKenney <paulmck@kernel.org> > Reviewed-by: Frederic Weisbecker <frederic@kernel.org> > Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> > Tested-by: Uladzislau Rezki (Sony) <urezki@gmail.com> > Link: https://lore.kernel.org/rcu/d90bd6d9-d15c-4b9b-8a69-95336e74e8f4@paulmck-laptop/ [1] > Signed-off-by: Boqun Feng <boqun.feng@gmail.com> > --- > kernel/rcu/rcu.h | 2 +- > kernel/rcu/tree.c | 11 +++++++---- > 2 files changed, 8 insertions(+), 5 deletions(-) > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > index feb3ac1dc5d5..f87c9d6d36fc 100644 > --- a/kernel/rcu/rcu.h > +++ b/kernel/rcu/rcu.h > @@ -162,7 +162,7 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s) > { > unsigned long cur_s = READ_ONCE(*sp); > > - return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1)); > + return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1)); > } > > /* > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > index 
e4c0ce600b2b..f80cfdc3ee3e 100644 > --- a/kernel/rcu/tree.c > +++ b/kernel/rcu/tree.c > @@ -1801,10 +1801,10 @@ static noinline_for_stack bool rcu_gp_init(void) > > /* Advance to a new grace period and initialize state. */ > record_gp_stall_check_time(); > + start_new_poll = rcu_sr_normal_gp_init(); > /* Record GP times before starting GP, hence rcu_seq_start(). */ > rcu_seq_start(&rcu_state.gp_seq); > ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq); > - start_new_poll = rcu_sr_normal_gp_init(); > trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start")); > rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap); > raw_spin_unlock_irq_rcu_node(rnp); > @@ -3357,14 +3357,17 @@ EXPORT_SYMBOL_GPL(get_state_synchronize_rcu); > */ > void get_state_synchronize_rcu_full(struct rcu_gp_oldstate *rgosp) > { > - struct rcu_node *rnp = rcu_get_root(); > - > /* > * Any prior manipulation of RCU-protected data must happen > * before the loads from ->gp_seq and ->expedited_sequence. > */ > smp_mb(); /* ^^^ */ > - rgosp->rgos_norm = rcu_seq_snap(&rnp->gp_seq); > + > + // Yes, rcu_state.gp_seq, not rnp_root->gp_seq, the latter's use > + // in poll_state_synchronize_rcu_full() notwithstanding. Use of > + // the latter here would result in too-short grace periods due to > + // interactions with newly onlined CPUs. > + rgosp->rgos_norm = rcu_seq_snap(&rcu_state.gp_seq); > rgosp->rgos_exp = rcu_seq_snap(&rcu_state.expedited_sequence); > } > EXPORT_SYMBOL_GPL(get_state_synchronize_rcu_full); > -- > 2.45.1 >
On Fri, Feb 28, 2025 at 05:08:49PM -0800, Paul E. McKenney wrote: > On Fri, Feb 28, 2025 at 11:59:55AM -0800, Paul E. McKenney wrote: > > On Fri, Feb 28, 2025 at 08:12:51PM +0100, Uladzislau Rezki wrote: > > > Hello, Paul! > > > > > > > > > > > > > > > > > > > Except that I got this from overnight testing of rcu/dev on the shared > > > > > > > > RCU tree: > > > > > > > > > > > > > > > > WARNING: CPU: 5 PID: 14 at kernel/rcu/tree.c:1636 rcu_sr_normal_complete+0x5c/0x80 > > > > > > > > > > > > > > > > I see this only on TREE05. Which should not be too surprising, given > > > > > > > > that this is the scenario that tests it. It happened within five minutes > > > > > > > > on all 14 of the TREE05 runs. > > > > > > > > > > > > > > > Hm.. This is not fun. I tested this on my system and i did not manage to > > > > > > > trigger this whereas you do. Something is wrong. > > > > > > > > > > > > If you have a debug patch, I would be happy to give it a go. > > > > > > > > > > > I can trigger it. But. > > > > > > > > > > Some background. I tested those patches during many hours on the stable > > > > > kernel which is 6.13. On that kernel i was not able to trigger it. Running > > > > > the rcutorture on the our shared "dev" tree, which i did now, triggers this > > > > > right away. > > > > > > > > Bisection? (Hey, you knew that was coming!) > > > > > > > Looks like this: rcu: Fix get_state_synchronize_rcu_full() GP-start detection > > > > > > After revert in the dev, rcutorture passes TREE05, 16 instances. > > > > Huh. We sure don't get to revert that one... > > > > Do we have a problem with the ordering in rcu_gp_init() between the calls > > to rcu_seq_start() and portions of rcu_sr_normal_gp_init()? For example, > > do we need to capture the relevant portion of the list before the call > > to rcu_seq_start(), and do the grace-period-start work afterwards? 
> > I tried moving the call to rcu_sr_normal_gp_init() before the call to > rcu_seq_start() and got no failures in a one-hour run of 200*TREE05. > Which does not necessarily mean that this is the correct fix, but I > figured that it might at least provide food for thought. > > Thanx, Paul > > ------------------------------------------------------------------------ > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > index 48384fa2eaeb8..d3efeff7740e7 100644 > --- a/kernel/rcu/tree.c > +++ b/kernel/rcu/tree.c > @@ -1819,10 +1819,10 @@ static noinline_for_stack bool rcu_gp_init(void) > > /* Advance to a new grace period and initialize state. */ > record_gp_stall_check_time(); > + start_new_poll = rcu_sr_normal_gp_init(); > /* Record GP times before starting GP, hence rcu_seq_start(). */ > rcu_seq_start(&rcu_state.gp_seq); > ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq); > - start_new_poll = rcu_sr_normal_gp_init(); > trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start")); Oh... so the bug is this? Good catch... 
CPU 0                                  CPU 1

rcu_gp_init()
  rcu_seq_start(rcu_state.gp_seq)
                                       synchronize_rcu_normal()
                                         rs.head.func = (void *)
                                             get_state_synchronize_rcu();
                                         // save rcu_state.gp_seq
                                         rcu_sr_normal_add_req() ->
                                           llist_add(rcu_state.srs_next)
                                         (void) start_poll_synchronize_rcu();

  sr_normal_gp_init()
    llist_add(wait_head, &rcu_state.srs_next);
    // pick up the injected WH
    rcu_state.srs_wait_tail = wait_head;

rcu_gp_cleanup()
  rcu_seq_end(&rcu_state.gp_seq);
  sr_normal_complete()
    WARN_ONCE(IS_ENABLED(CONFIG_PROVE_RCU) &&
              !poll_state_synchronize_rcu(oldstate),

Whereas reordering sr_normal_gp_init() prevents this:

rcu_gp_init()

  sr_normal_gp_init()
    // WH has not been injected,
    // so nothing to wait on

  rcu_seq_start(rcu_state.gp_seq)
                                       synchronize_rcu_normal()
                                         rs.head.func = (void *)
                                             get_state_synchronize_rcu();
                                         // save rcu_state.gp_seq
                                         rcu_sr_normal_add_req() ->
                                           llist_add(rcu_state.srs_next)
                                         (void) start_poll_synchronize_rcu();

rcu_gp_cleanup()
  rcu_seq_end(&rcu_state.gp_seq);
  // sr_normal_complete() won't do
  // anything, so no warning

Did I get that right?

I think this is a real bug AFAICS, hoping all the memory barriers are in
place to make sure the code reordering also correctly orders the accesses.
I'll double-check that.

I also feel it is 'theoretical', because as long as rcu_gp_init() and
rcu_gp_cleanup() are properly ordered WRT pre-existing readers, then
synchronize_rcu_normal() still waits for pre-existing readers even though
it is a bit confused about the value of the cookies.

For the fix,

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>

(If possible, include a Link: to my (this) post so that the sequence of
events is further clarified.)

thanks,

- Joel
On Sun, Mar 02, 2025 at 07:15:07PM -0500, Joel Fernandes wrote: > On Fri, Feb 28, 2025 at 05:08:49PM -0800, Paul E. McKenney wrote: > > On Fri, Feb 28, 2025 at 11:59:55AM -0800, Paul E. McKenney wrote: > > > On Fri, Feb 28, 2025 at 08:12:51PM +0100, Uladzislau Rezki wrote: > > > > Hello, Paul! > > > > > > > > > > > > > > > > > > > > > > Except that I got this from overnight testing of rcu/dev on the shared > > > > > > > > > RCU tree: > > > > > > > > > > > > > > > > > > WARNING: CPU: 5 PID: 14 at kernel/rcu/tree.c:1636 rcu_sr_normal_complete+0x5c/0x80 > > > > > > > > > > > > > > > > > > I see this only on TREE05. Which should not be too surprising, given > > > > > > > > > that this is the scenario that tests it. It happened within five minutes > > > > > > > > > on all 14 of the TREE05 runs. > > > > > > > > > > > > > > > > > Hm.. This is not fun. I tested this on my system and i did not manage to > > > > > > > > trigger this whereas you do. Something is wrong. > > > > > > > > > > > > > > If you have a debug patch, I would be happy to give it a go. > > > > > > > > > > > > > I can trigger it. But. > > > > > > > > > > > > Some background. I tested those patches during many hours on the stable > > > > > > kernel which is 6.13. On that kernel i was not able to trigger it. Running > > > > > > the rcutorture on the our shared "dev" tree, which i did now, triggers this > > > > > > right away. > > > > > > > > > > Bisection? (Hey, you knew that was coming!) > > > > > > > > > Looks like this: rcu: Fix get_state_synchronize_rcu_full() GP-start detection > > > > > > > > After revert in the dev, rcutorture passes TREE05, 16 instances. > > > > > > Huh. We sure don't get to revert that one... > > > > > > Do we have a problem with the ordering in rcu_gp_init() between the calls > > > to rcu_seq_start() and portions of rcu_sr_normal_gp_init()? 
For example, > > > do we need to capture the relevant portion of the list before the call > > > to rcu_seq_start(), and do the grace-period-start work afterwards? > > > > I tried moving the call to rcu_sr_normal_gp_init() before the call to > > rcu_seq_start() and got no failures in a one-hour run of 200*TREE05. > > Which does not necessarily mean that this is the correct fix, but I > > figured that it might at least provide food for thought. > > > > Thanx, Paul > > > > ------------------------------------------------------------------------ > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > > index 48384fa2eaeb8..d3efeff7740e7 100644 > > --- a/kernel/rcu/tree.c > > +++ b/kernel/rcu/tree.c > > @@ -1819,10 +1819,10 @@ static noinline_for_stack bool rcu_gp_init(void) > > > > /* Advance to a new grace period and initialize state. */ > > record_gp_stall_check_time(); > > + start_new_poll = rcu_sr_normal_gp_init(); > > /* Record GP times before starting GP, hence rcu_seq_start(). */ > > rcu_seq_start(&rcu_state.gp_seq); > > ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq); > > - start_new_poll = rcu_sr_normal_gp_init(); > > trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start")); > > Oh... so the bug is this? Good catch... 
> 
> CPU 0                                  CPU 1
> 
> rcu_gp_init()
>   rcu_seq_start(rcu_state.gp_seq)
>                                        sychronize_rcu_normal()
>                                          rs.head.func = (void *)
>                                              get_state_synchronize_rcu();
>                                          // save rcu_state.gp_seq
>                                          rcu_sr_normal_add_req() ->
>                                            llist_add(rcu_state.srs_next)
>                                          (void) start_poll_synchronize_rcu();
> 
>   sr_normal_gp_init()
>     llist_add(wait_head, &rcu_state.srs_next);
>     // pick up the injected WH
>     rcu_state.srs_wait_tail = wait_head;
> 
> rcu_gp_cleanup()
>   rcu_seq_end(&rcu_state.gp_seq);
>   sr_normal_complete()
>     WARN_ONCE(IS_ENABLED(CONFIG_PROVE_RCU) &&
>               !poll_state_synchronize_rcu(oldstate),
> 
> Where as reordering sr_normal_gp_init() prevents this:
> 
> rcu_gp_init()
> 
>   sr_normal_gp_init()
>     // WH has not been injected
>     // so nothing to wait on
> 
>   rcu_seq_start(rcu_state.gp_seq)
>                                        sychronize_rcu_normal()
>                                          rs.head.func = (void *)
>                                              get_state_synchronize_rcu();
>                                          // save rcu_state.gp_seq
>                                          rcu_sr_normal_add_req() ->
>                                            llist_add(rcu_state.srs_next)
>                                          (void) start_poll_synchronize_rcu();
> 
> rcu_gp_cleanup()
>   rcu_seq_end(&rcu_state.gp_seq);
>   // sr_normal_complete()
>   // wont do anything so
>   // no warning
> 
> Did I get that right?
> 
> I think this is a real bug AFAICS, hoping all the memory barriers are in
> place to make sure the code reordering also correctly orders the accesses.
> I'll double check that.
> 
> I also feel its 'theoretical', because as long as rcu_gp_init() and
> rcu_gp_cleanup() are properly ordered WRT pre-existing readers, then
> synchronize_rcu_normal() still waits for pre-existing readers even though its
> a bit confused about the value of the cookies.
> 
> For the fix,
> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>

Oops, this should be:

Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>

;-)

thanks,

- Joel

> 
> (If possible, include a Link: to my (this) post so that the sequence of
> events is further clarified.)
> 
> thanks,
> 
> - Joel
> 
On Sun, Mar 02, 2025 at 12:36:51PM -0800, Paul E. McKenney wrote: > On Sun, Mar 02, 2025 at 10:46:29AM -0800, Boqun Feng wrote: > > On Sun, Mar 02, 2025 at 09:39:44AM -0800, Paul E. McKenney wrote: > > > On Sun, Mar 02, 2025 at 11:19:44AM +0100, Uladzislau Rezki wrote: > > > > On Fri, Feb 28, 2025 at 05:08:49PM -0800, Paul E. McKenney wrote: > > > > > On Fri, Feb 28, 2025 at 11:59:55AM -0800, Paul E. McKenney wrote: > > > > > > On Fri, Feb 28, 2025 at 08:12:51PM +0100, Uladzislau Rezki wrote: > > > > > > > Hello, Paul! > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Except that I got this from overnight testing of rcu/dev on the shared > > > > > > > > > > > > RCU tree: > > > > > > > > > > > > > > > > > > > > > > > > WARNING: CPU: 5 PID: 14 at kernel/rcu/tree.c:1636 rcu_sr_normal_complete+0x5c/0x80 > > > > > > > > > > > > > > > > > > > > > > > > I see this only on TREE05. Which should not be too surprising, given > > > > > > > > > > > > that this is the scenario that tests it. It happened within five minutes > > > > > > > > > > > > on all 14 of the TREE05 runs. > > > > > > > > > > > > > > > > > > > > > > > Hm.. This is not fun. I tested this on my system and i did not manage to > > > > > > > > > > > trigger this whereas you do. Something is wrong. > > > > > > > > > > > > > > > > > > > > If you have a debug patch, I would be happy to give it a go. > > > > > > > > > > > > > > > > > > > I can trigger it. But. > > > > > > > > > > > > > > > > > > Some background. I tested those patches during many hours on the stable > > > > > > > > > kernel which is 6.13. On that kernel i was not able to trigger it. Running > > > > > > > > > the rcutorture on the our shared "dev" tree, which i did now, triggers this > > > > > > > > > right away. > > > > > > > > > > > > > > > > Bisection? (Hey, you knew that was coming!) 
> > > > > > > > > > > > > > > Looks like this: rcu: Fix get_state_synchronize_rcu_full() GP-start detection > > > > > > > > > > > > > > After revert in the dev, rcutorture passes TREE05, 16 instances. > > > > > > > > > > > > Huh. We sure don't get to revert that one... > > > > > > > > > > > > Do we have a problem with the ordering in rcu_gp_init() between the calls > > > > > > to rcu_seq_start() and portions of rcu_sr_normal_gp_init()? For example, > > > > > > do we need to capture the relevant portion of the list before the call > > > > > > to rcu_seq_start(), and do the grace-period-start work afterwards? > > > > > > > > > > I tried moving the call to rcu_sr_normal_gp_init() before the call to > > > > > rcu_seq_start() and got no failures in a one-hour run of 200*TREE05. > > > > > Which does not necessarily mean that this is the correct fix, but I > > > > > figured that it might at least provide food for thought. > > > > > > > > > > Thanx, Paul > > > > > > > > > > ------------------------------------------------------------------------ > > > > > > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > > > > > index 48384fa2eaeb8..d3efeff7740e7 100644 > > > > > --- a/kernel/rcu/tree.c > > > > > +++ b/kernel/rcu/tree.c > > > > > @@ -1819,10 +1819,10 @@ static noinline_for_stack bool rcu_gp_init(void) > > > > > > > > > > /* Advance to a new grace period and initialize state. */ > > > > > record_gp_stall_check_time(); > > > > > + start_new_poll = rcu_sr_normal_gp_init(); > > > > > /* Record GP times before starting GP, hence rcu_seq_start(). */ > > > > > rcu_seq_start(&rcu_state.gp_seq); > > > > > ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq); > > > > > - start_new_poll = rcu_sr_normal_gp_init(); > > > > > trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start")); > > > > > rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap); > > > > > raw_spin_unlock_irq_rcu_node(rnp); > > > > > > > > > Running this 24 hours already. TREE05 * 16 scenario. 
I do not see any > > > > warnings yet. There is a race, indeed. The gp_seq is moved forward, > > > > wheres clients can still come until rcu_sr_normal_gp_init() places a > > > > dummy-wait-head for this GP. > > > > > > > > Thank you for testing Paul and looking to this :) > > > > > > Very good! This is a bug in this commit of mine: > > > > > > 012f47f0f806 ("rcu: Fix get_state_synchronize_rcu_full() GP-start detection") > > > > > > Boqun, could you please fold this into that commit with something like > > > this added to the commit log just before the paragraph starting with > > > "Although this fixes 91a967fd6934"? > > > > > > However, simply changing get_state_synchronize_rcu_full() function > > > to use rcu_state.gp_seq instead of the root rcu_node structure's > > > ->gp_seq field results in a theoretical bug in kernels booted > > > with rcutree.rcu_normal_wake_from_gp=1 due to the following > > > sequence of events: > > > > > > o The rcu_gp_init() function invokes rcu_seq_start() > > > to officially start a new grace period. > > > > > > o A new RCU reader begins, referencing X from some > > > RCU-protected list. The new grace period is not > > > obligated to wait for this reader. > > > > > > o An updater removes X, then calls synchronize_rcu(), > > > which queues a wait element. > > > > > > o The grace period ends, awakening the updater, which > > > frees X while the reader is still referencing it. > > > > > > The reason that this is theoretical is that although the > > > grace period has officially started, none of the CPUs are > > > officially aware of this, and thus will have to assume that > > > the RCU reader pre-dated the start of the grace period. > > > > > > Except for kernels built with CONFIG_PROVE_RCU=y, which use the > > > polled grace-period APIs, which can and do complain bitterly when > > > this sequence of events occurs. 
Not only that, there might be > > > some future RCU grace-period mechanism that pulls this sequence > > > of events from theory into practice. This commit therefore > > > also pulls the call to rcu_sr_normal_gp_init() to precede that > > > to rcu_seq_start(). > > > > > > I will let you guys decide whether the call to rcu_sr_normal_gp_init() > > > needs a comment, and, if so, what that comment should say. ;-) > > > > > > > Please see the updated patch below (next and rcu/dev branches are > > updated too). > > Works for me! > > > For the comment, I think we can add something like > > > > /* > > * A new wait segment must be started before gp_seq is advanced, so > > * that previous GP waiters won't observe the new gp_seq. > > */ > > > > but I will let Ulad decide ;-) > > Over to you, Uladzislau! ;-) > Works for me! Sorry for the late answer. I had a fever, so my reply was delayed. -- Uladzislau Rezki
On Sun, Mar 02, 2025 at 07:17:10PM -0500, Joel Fernandes wrote: > On Sun, Mar 02, 2025 at 07:15:07PM -0500, Joel Fernandes wrote: > > On Fri, Feb 28, 2025 at 05:08:49PM -0800, Paul E. McKenney wrote: > > > On Fri, Feb 28, 2025 at 11:59:55AM -0800, Paul E. McKenney wrote: > > > > On Fri, Feb 28, 2025 at 08:12:51PM +0100, Uladzislau Rezki wrote: > > > > > Hello, Paul! > > > > > > > > > > > > > > > > > > > > > > > > > Except that I got this from overnight testing of rcu/dev on the shared > > > > > > > > > > RCU tree: > > > > > > > > > > > > > > > > > > > > WARNING: CPU: 5 PID: 14 at kernel/rcu/tree.c:1636 rcu_sr_normal_complete+0x5c/0x80 > > > > > > > > > > > > > > > > > > > > I see this only on TREE05. Which should not be too surprising, given > > > > > > > > > > that this is the scenario that tests it. It happened within five minutes > > > > > > > > > > on all 14 of the TREE05 runs. > > > > > > > > > > > Hm.. This is not fun. I tested this on my system and I did not manage to > > > > > > > > > trigger this whereas you do. Something is wrong. > > > > > > > > If you have a debug patch, I would be happy to give it a go. > > > > > > > > > I can trigger it. But. > > > > > > > Some background. I tested those patches for many hours on the stable > > > > > > > kernel, which is 6.13. On that kernel I was not able to trigger it. Running > > > > > > > the rcutorture on our shared "dev" tree, which I did now, triggers this > > > > > > > right away. > > > > > > Bisection? (Hey, you knew that was coming!) > > > > > > Looks like this: rcu: Fix get_state_synchronize_rcu_full() GP-start detection > > > > > After revert in the dev, rcutorture passes TREE05, 16 instances. > > > > Huh. We sure don't get to revert that one... > > > > Do we have a problem with the ordering in rcu_gp_init() between the calls > > > > to rcu_seq_start() and portions of rcu_sr_normal_gp_init()? 
For example, > > > > do we need to capture the relevant portion of the list before the call > > > > to rcu_seq_start(), and do the grace-period-start work afterwards? > > > > > > I tried moving the call to rcu_sr_normal_gp_init() before the call to > > > rcu_seq_start() and got no failures in a one-hour run of 200*TREE05. > > > Which does not necessarily mean that this is the correct fix, but I > > > figured that it might at least provide food for thought. > > > > > > Thanx, Paul > > > > > > ------------------------------------------------------------------------ > > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > > > index 48384fa2eaeb8..d3efeff7740e7 100644 > > > --- a/kernel/rcu/tree.c > > > +++ b/kernel/rcu/tree.c > > > @@ -1819,10 +1819,10 @@ static noinline_for_stack bool rcu_gp_init(void) > > > > > > /* Advance to a new grace period and initialize state. */ > > > record_gp_stall_check_time(); > > > + start_new_poll = rcu_sr_normal_gp_init(); > > > /* Record GP times before starting GP, hence rcu_seq_start(). */ > > > rcu_seq_start(&rcu_state.gp_seq); > > > ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq); > > > - start_new_poll = rcu_sr_normal_gp_init(); > > > trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start")); > > > > Oh... so the bug is this? Good catch... 
> > > > > > CPU 0 CPU 1 > > > > rcu_gp_init() > > rcu_seq_start(rcu_state.gp_seq) > > synchronize_rcu_normal() > > rs.head.func > > = (void *) get_state_synchronize_rcu(); > > // save rcu_state.gp_seq > > rcu_sr_normal_add_req() -> > > llist_add(rcu_state.srs_next) > > (void) start_poll_synchronize_rcu(); > > > > > > sr_normal_gp_init() > > llist_add(wait_head, &rcu_state.srs_next); > > // pick up the > > // injected WH > > rcu_state.srs_wait_tail = wait_head; > > > > rcu_gp_cleanup() > > rcu_seq_end(&rcu_state.gp_seq); > > sr_normal_complete() > > WARN_ONCE(IS_ENABLED(CONFIG_PROVE_RCU) && > > !poll_state_synchronize_rcu(oldstate), > > > > Whereas reordering sr_normal_gp_init() prevents this: > > > > rcu_gp_init() > > > > sr_normal_gp_init() > > // WH has not > > // been injected > > // so nothing to > > // wait on > > > > rcu_seq_start(rcu_state.gp_seq) > > synchronize_rcu_normal() > > rs.head.func > > = (void *) get_state_synchronize_rcu(); > > // save rcu_state.gp_seq > > rcu_sr_normal_add_req() -> > > llist_add(rcu_state.srs_next) > > (void) start_poll_synchronize_rcu(); > > > > rcu_gp_cleanup() > > rcu_seq_end(&rcu_state.gp_seq); > > // sr_normal_complete() > > // won't do anything so > > // no warning > > > > Did I get that right? > > > > I think this is a real bug AFAICS, hoping all the memory barriers are in > > place to make sure the code reordering also correctly orders the accesses. > > I'll double check that. > > > > I also feel it's 'theoretical', because as long as rcu_gp_init() and > > rcu_gp_cleanup() are properly ordered WRT pre-existing readers, then > > synchronize_rcu_normal() still waits for pre-existing readers even though it's > > a bit confused about the value of the cookies. 
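The interleaving Joel sketches can be checked against the rcu_seq cookie arithmetic on its own. Below is a minimal, single-threaded Python model: the helper functions mirror the rcu_seq_start()/rcu_seq_end()/rcu_seq_snap()/rcu_seq_done() arithmetic from kernel/rcu/rcu.h, while the bare gp_seq variable and the step ordering are a simplified stand-in for the GP kthread and the late synchronize_rcu() caller, not the actual kernel code:

```python
# Low two bits of the sequence counter encode "grace period in progress",
# as in kernel/rcu/rcu.h.
RCU_SEQ_STATE_MASK = 0x3

def rcu_seq_start(s):
    return s + 1                            # mark a GP as in progress

def rcu_seq_end(s):
    return (s | RCU_SEQ_STATE_MASK) + 1     # round up to the next idle value

def rcu_seq_snap(s):
    # Cookie that only a GP starting *after* this snapshot can satisfy.
    return (s + 2 * RCU_SEQ_STATE_MASK + 1) & ~RCU_SEQ_STATE_MASK

def rcu_seq_done(s, cookie):
    return s >= cookie

# Buggy order: rcu_seq_start() runs before rcu_sr_normal_gp_init().
gp_seq = 0
gp_seq = rcu_seq_start(gp_seq)          # GP officially starts: gp_seq == 1
cookie = rcu_seq_snap(gp_seq)           # late synchronize_rcu() caller: cookie == 8
# rcu_sr_normal_gp_init() now captures that caller into the *current*
# GP's wait list, but that GP ends at...
gp_seq = rcu_seq_end(gp_seq)            # gp_seq == 4
assert not rcu_seq_done(gp_seq, cookie) # ...so the debug poll fails: the splat

# Fixed order: waiters are captured before rcu_seq_start(), so the same
# caller is only picked up by the next GP, whose end satisfies the cookie.
gp_seq = rcu_seq_start(gp_seq)          # gp_seq == 5
gp_seq = rcu_seq_end(gp_seq)            # gp_seq == 8
assert rcu_seq_done(gp_seq, cookie)
```

In the buggy ordering the waiter's completion is driven by the wait list while its cookie still points one grace period into the future, which is exactly the WARN_ONCE() condition in rcu_sr_normal_complete().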
> > > > For the fix, > > Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > Oops, this should be: > > Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> I see the original patch "rcu: Fix get_state_synchronize_rcu_full() GP-start detection" is not yet on -next. Once we are convinced about the fix, do we want to squash the fix into this patch and have Boqun take it? Yet another idea is to apply it for 6.15-rc cycle if more time is needed. Alternatively, we could squash it and I queue it for 6.16 instead of 6.15. And I'm guessing Vlad's series is also for 6.16. thanks, - Joel
On Mon, Mar 03, 2025 at 12:00:40PM -0500, Joel Fernandes wrote: [...] > > I see the original patch "rcu: Fix get_state_synchronize_rcu_full() GP-start > detection" is not yet on -next. Once we are convinced about the fix, do we > want to squash the fix into this patch and have Boqun take it? > Which "-next" are you talking about? The original patch and the fix is already in next-20250303 of linux-next: https://web.git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/log/?h=next-20250303&qt=range&q=153fc45000e0058435ec0609258fb16e7ea257d2 Regards, Boqun > Yet another idea is to apply it for 6.15-rc cycle if more time is needed. > > Alternatively, we could squash it and I queue it for 6.16 instead of 6.15. > > And I'm guessing Vlad's series is also for 6.16. > > thanks, > > - Joel >
On 3/3/2025 12:07 PM, Boqun Feng wrote: > On Mon, Mar 03, 2025 at 12:00:40PM -0500, Joel Fernandes wrote: > [...] >> >> I see the original patch "rcu: Fix get_state_synchronize_rcu_full() GP-start >> detection" is not yet on -next. Once we are convinced about the fix, do we >> want to squash the fix into this patch and have Boqun take it? >> > > Which "-next" are you talking about? The original patch and the fix is > already in next-20250303 of linux-next: > > https://web.git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/log/?h=next-20250303&qt=range&q=153fc45000e0058435ec0609258fb16e7ea257d2 I see it now during manual inspection, but I'm confused why git cherry tells me otherwise. I tried the following command and it shows the patch in question in the first line of output. Basically the question that the command asks is "What is in Paul's dev branch that is not in RCU tree's -next branch". This question is asked for the obvious raisins. So I am obviously missing something in the command. Thoughts? (rcugit is the RCU tree, and paul/dev is Paul's dev branch) git cherry --abbrev -v rcugit/next paul/dev | grep "^+" | cut -d' ' -f2,3- 012f47f0f806 rcu: Fix get_state_synchronize_rcu_full() GP-start detection 2ada0addbdb6 tools/memory-model: Add atomic_and()/or()/xor() and add_negative e176ebffc3f4 tools/memory-model: Add atomic_andnot() with its variants de6f99723392 tools/memory-model: Legitimize current use of tags in LKMM macros 723177d71224 tools/memory-model: Define applicable tags on operation in tools/... 29279349a566 tools/memory-model: Define effect of Mb tags on RMWs in tools/... 
d80a8e016433 MAINTAINERS: Update Joel's email address fafa18068359 tools/memory-model: Switch to softcoded herd7 tags dcc5197839f2 tools/memory-model: Distinguish between syntactic and semantic tags fa9e35a0772a tools/memory-model/README: Fix typo a2bfbf847c96 tools/memory-model: glossary.txt: Fix indents 3839dbb05869 rcutorture: Make srcu_lockdep.sh check kernel Kconfig b5aa1c489085 rcutorture: Make srcu_lockdep.sh check reader-conflict handling 9a5720bad9ed rcu: Remove swake_up_one_online() bandaid 04159042a62b Revert "rcu/nocb: Fix rcuog wake-up from offline softirq" fdc37fed1c81 rcutorture: Split out beginning and end from rcu_torture_one_read() 3c6b1925361e rcutorture: Make torture.sh --do-rt use CONFIG_PREEMPT_RT fadc715785cc rcutorture: Add tests for SRCU up/down reader primitives 90a8f490324c rcutorture: Pull rcu_torture_updown() loop body into new function 5fbaa5179f6a rcutorture: Comment invocations of tick_dep_set_task() 461810471faa rcutorture: Complain if an ->up_read() is delayed more than 10 seconds 35e1a180319d rcutorture: Check for ->up_read() without matching ->down_read() 0676ba797dfa EXP srcu: Enable Tiny SRCU On all CONFIG_SMP=n kernels c8fff13fd2fd EXP rcutorture: Add SRCU-V scenario for preemptible Tiny SRCU 910a5f9ebf5f EXP rcutorture: Limit callback flooding for Tiny SRCU in preemptible kernels 8979a891a365 EXP hrtimers: Force migrate away hrtimers queued after CPUHP_AP_HRTIMERS_DYING
On 3/3/2025 12:30 PM, Joel Fernandes wrote: > > > On 3/3/2025 12:07 PM, Boqun Feng wrote: >> On Mon, Mar 03, 2025 at 12:00:40PM -0500, Joel Fernandes wrote: >> [...] >>> >>> I see the original patch "rcu: Fix get_state_synchronize_rcu_full() GP-start >>> detection" is not yet on -next. Once we are convinced about the fix, do we >>> want to squash the fix into this patch and have Boqun take it? >>> >> >> Which "-next" are you talking about? The original patch and the fix is >> already in next-20250303 of linux-next: >> >> https://web.git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/log/?h=next-20250303&qt=range&q=153fc45000e0058435ec0609258fb16e7ea257d2 > I see it now during manual inspection, but I'm confused why git cherry tells me > otherwise. > > I tried the following command and it shows the patch in question in the first > line of output. Basically the question that the command asks is "What is in > Paul's dev branch that is not in RCU tree's -next branch". This question is > asked for the obvious raisins. > So I am obviously missing something in the command. Thoughts? > > (rcugit is the RCU tree, and paul/dev is Paul's dev branch) > > git cherry --abbrev -v rcugit/next paul/dev | grep "^+" | cut -d' ' -f2,3- > > 012f47f0f806 rcu: Fix get_state_synchronize_rcu_full() GP-start detection Ah, it's because Paul's dev branch has an older version of the patch I think (without the fix being discussed here) :-D So that explains it... thanks, - Joel
On Mon, Mar 03, 2025 at 12:30:51PM -0500, Joel Fernandes wrote: > > > On 3/3/2025 12:07 PM, Boqun Feng wrote: > > On Mon, Mar 03, 2025 at 12:00:40PM -0500, Joel Fernandes wrote: > > [...] > >> > >> I see the original patch "rcu: Fix get_state_synchronize_rcu_full() GP-start > >> detection" is not yet on -next. Once we are convinced about the fix, do we > >> want to squash the fix into this patch and have Boqun take it? > >> > > > > Which "-next" are you talking about? The original patch and the fix is > > already in next-20250303 of linux-next: > > > > https://web.git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/log/?h=next-20250303&qt=range&q=153fc45000e0058435ec0609258fb16e7ea257d2 > I see it now during manual inspection, but I'm confused why git cherry tells me > otherwise. The gitk command for the win! ;-) Or, in non-GUI environments, tig. > I tried the following command and it shows the patch in question in the first > line of output. Basically the question that the command asks is "What is in > Paul's dev branch that is not in RCU tree's -next branch". This question is > asked for the obvious raisins. > So I am obviously missing something in the command. Thoughts? > > (rcugit is the RCU tree, and paul/dev is Paul's dev branch) > > git cherry --abbrev -v rcugit/next paul/dev | grep "^+" | cut -d' ' -f2,3- I must defer to others on this one. I must confess that I have not yet found a good use case for "git cherry". Thanx, Paul > 012f47f0f806 rcu: Fix get_state_synchronize_rcu_full() GP-start detection > 2ada0addbdb6 tools/memory-model: Add atomic_and()/or()/xor() and add_negative > e176ebffc3f4 tools/memory-model: Add atomic_andnot() with its variants > de6f99723392 tools/memory-model: Legitimize current use of tags in LKMM macros > 723177d71224 tools/memory-model: Define applicable tags on operation in tools/... > 29279349a566 tools/memory-model: Define effect of Mb tags on RMWs in tools/... 
> d80a8e016433 MAINTAINERS: Update Joel's email address > fafa18068359 tools/memory-model: Switch to softcoded herd7 tags > dcc5197839f2 tools/memory-model: Distinguish between syntactic and semantic tags > fa9e35a0772a tools/memory-model/README: Fix typo > a2bfbf847c96 tools/memory-model: glossary.txt: Fix indents > 3839dbb05869 rcutorture: Make srcu_lockdep.sh check kernel Kconfig > b5aa1c489085 rcutorture: Make srcu_lockdep.sh check reader-conflict handling > 9a5720bad9ed rcu: Remove swake_up_one_online() bandaid > 04159042a62b Revert "rcu/nocb: Fix rcuog wake-up from offline softirq" > fdc37fed1c81 rcutorture: Split out beginning and end from rcu_torture_one_read() > 3c6b1925361e rcutorture: Make torture.sh --do-rt use CONFIG_PREEMPT_RT > fadc715785cc rcutorture: Add tests for SRCU up/down reader primitives > 90a8f490324c rcutorture: Pull rcu_torture_updown() loop body into new function > 5fbaa5179f6a rcutorture: Comment invocations of tick_dep_set_task() > 461810471faa rcutorture: Complain if an ->up_read() is delayed more than 10 seconds > 35e1a180319d rcutorture: Check for ->up_read() without matching ->down_read() > 0676ba797dfa EXP srcu: Enable Tiny SRCU On all CONFIG_SMP=n kernels > c8fff13fd2fd EXP rcutorture: Add SRCU-V scenario for preemptible Tiny SRCU > 910a5f9ebf5f EXP rcutorture: Limit callback flooding for Tiny SRCU in > preemptible kernels > 8979a891a365 EXP hrtimers: Force migrate away hrtimers queued after > CPUHP_AP_HRTIMERS_DYING > >
On 3/3/2025 1:55 PM, Paul E. McKenney wrote: >> I tried the following command and it shows the patch in question in the first >> line of output. Basically the question that the command asks is "What is in >> Paul's dev branch that is not in RCU tree's -next branch". This question is >> asked for the obvious raisins. >> So I am obviously missing something in the command. Thoughts? >> >> (rcugit is the RCU tree, and paul/dev is Paul's dev branch) >> >> git cherry --abbrev -v rcugit/next paul/dev | grep "^+" | cut -d' ' -f2,3- > I must defer to others on this one. I must confess that I have not yet > found a good use case for "git cherry". In this case git cherry was seemingly correct because it detected a difference between the same patch in your dev branch versus the next branch [1]. It seems the next branch has the fix with the reordering of the statements. So it thought that your branch had a patch that wasn't there in next, where in reality the one in next was "newer". [1] https://web.git.kernel.org/pub/scm/linux/kernel/git/rcu/linux.git/commit/?h=misc.2025.03.02a&id=153fc45000e0058435ec0609258fb16e7ea257d2 thanks, - Joel
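The behavior Joel describes is easy to reproduce in a throwaway repository. Every name below (the `next` branch, `fix.c`, the commit subject) is a hypothetical stand-in, not taken from the actual trees: git cherry matches commits by patch-id rather than by subject, so a branch carrying an amended copy of a commit still reports the original as not upstream:

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email demo@example.com
git config user.name demo
git commit -q --allow-empty -m base
git branch next                                  # stand-in for rcugit/next
echo "old version" > fix.c                       # stand-in for paul/dev's copy
git add fix.c
git commit -q -m "rcu: Fix GP-start detection"
git checkout -q next
echo "new version, follow-up folded in" > fix.c  # "newer" copy of the same patch
git add fix.c
git commit -q -m "rcu: Fix GP-start detection"
git checkout -q -                                # back to the dev stand-in
# Identical subjects, different patch-ids: git cherry flags the commit
# as "+" (not upstream) even though next carries a newer version of it.
git cherry next
```

gitk, tig, or `git log --cherry-mark` make the same patch-id comparison visible interactively.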
diff --git a/include/linux/rcupdate_wait.h b/include/linux/rcupdate_wait.h index f9bed3d3f78d..4c92d4291cce 100644 --- a/include/linux/rcupdate_wait.h +++ b/include/linux/rcupdate_wait.h @@ -16,6 +16,9 @@ struct rcu_synchronize { struct rcu_head head; struct completion completion; + + /* This is for debugging. */ + struct rcu_gp_oldstate oldstate; }; void wakeme_after_rcu(struct rcu_head *head); diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 8625f616c65a..48384fa2eaeb 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1632,12 +1632,10 @@ static void rcu_sr_normal_complete(struct llist_node *node) { struct rcu_synchronize *rs = container_of( (struct rcu_head *) node, struct rcu_synchronize, head); - unsigned long oldstate = (unsigned long) rs->head.func; WARN_ONCE(IS_ENABLED(CONFIG_PROVE_RCU) && - !poll_state_synchronize_rcu(oldstate), - "A full grace period is not passed yet: %lu", - rcu_seq_diff(get_state_synchronize_rcu(), oldstate)); + !poll_state_synchronize_rcu_full(&rs->oldstate), + "A full grace period is not passed yet!\n"); /* Finally. */ complete(&rs->completion); @@ -3247,7 +3245,7 @@ static void synchronize_rcu_normal(void) * snapshot before adding a request. */ if (IS_ENABLED(CONFIG_PROVE_RCU)) - rs.head.func = (void *) get_state_synchronize_rcu(); + get_state_synchronize_rcu_full(&rs.oldstate); rcu_sr_normal_add_req(&rs);
Switch to using the get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full() pair to debug a normal synchronize_rcu() call. Just using the non-full APIs to identify whether a grace period has passed or not might lead to a false-positive kernel splat. It can happen because get_state_synchronize_rcu() compresses both normal and expedited states into one single unsigned long value, so poll_state_synchronize_rcu() can miss GP-completion when synchronize_rcu() and synchronize_rcu_expedited() run concurrently. To address this, switch to the poll_state_synchronize_rcu_full() and get_state_synchronize_rcu_full() APIs, which use separate variables for expedited and normal states. Link: https://lore.kernel.org/lkml/Z5ikQeVmVdsWQrdD@pc636/T/ Fixes: 988f569ae041 ("rcu: Reduce synchronize_rcu() latency") Reported-by: cheung wall <zzqq0103.hey@gmail.com> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> --- include/linux/rcupdate_wait.h | 3 +++ kernel/rcu/tree.c | 8 +++----- 2 files changed, 6 insertions(+), 5 deletions(-)
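As a loose illustration of the compression the commit message describes (the arithmetic mirrors the rcu_seq helpers in kernel/rcu/rcu.h, but the interleaving is a simplified hypothetical, not the exact kernel sequence): a cookie snapped from a single combined counter while some grace period is in flight demands a completion that finishing that one grace period cannot deliver, whereas a cookie taken against a dedicated normal-GP counter, as the _full APIs keep, is satisfied by a single normal grace period regardless of expedited activity:

```python
RCU_SEQ_STATE_MASK = 0x3   # low bits: "GP in progress", as in kernel/rcu/rcu.h

def seq_start(s):
    return s + 1

def seq_end(s):
    return (s | RCU_SEQ_STATE_MASK) + 1

def seq_snap(s):
    return (s + 2 * RCU_SEQ_STATE_MASK + 1) & ~RCU_SEQ_STATE_MASK

def seq_done(s, cookie):
    return s >= cookie

# One combined counter (non-full API): a GP is already in flight when the
# normal caller snapshots its cookie, so the snap points past what that
# in-flight GP's completion can provide.
combined = seq_start(0)                 # some GP in progress: 1
cookie = seq_snap(combined)             # normal caller's cookie: 8
combined = seq_end(combined)            # in-flight GP completes: 4
assert not seq_done(combined, cookie)   # debug poll still says "not done"

# Separate counters (the _full APIs): the normal counter is idle at
# snapshot time, and one full normal GP satisfies the cookie no matter
# what the expedited counter is doing.
norm = 0
cookie_norm = seq_snap(norm)            # 4
norm = seq_end(seq_start(norm))         # one full normal GP: 4
assert seq_done(norm, cookie_norm)
```

This is only the arithmetic half of the story; the real _full APIs also carry ordering guarantees that a toy model cannot capture.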