[02/11] rcu: Kill rnp->ofl_seq and use only rcu_state.ofl_lock for exclusion

Message ID 20211209150938.3518-3-dwmw2@infradead.org
State New, archived
Series Parallel CPU bringup for x86_64

Commit Message

David Woodhouse Dec. 9, 2021, 3:09 p.m. UTC
From: David Woodhouse <dwmw@amazon.co.uk>

If we allow architectures to bring APs online in parallel, then we end
up requiring rcu_cpu_starting() to be reentrant. But currently, the
manipulation of rnp->ofl_seq is not thread-safe.

However, rnp->ofl_seq is also largely pointless anyway, since both
rcu_cpu_starting() and rcu_report_dead() hold rcu_state.ofl_lock for
almost the whole time that rnp->ofl_seq is set to an odd number to
indicate that an operation is in progress.

So drop rnp->ofl_seq completely, and use only rcu_state.ofl_lock.

This has a couple of minor complexities: lockdep will complain when we
take rcu_state.ofl_lock, and currently accepts the 'excuse' of having
an odd value in rnp->ofl_seq. So switch it to an arch_spinlock_t to
avoid that false-positive complaint. Since we're killing rnp->ofl_seq,
that 'excuse' has to change too, so make the check use
arch_spin_is_locked(&rcu_state.ofl_lock).

There's no arch_spin_lock_irqsave(), so we have to manually save and
restore local interrupts around the locking.

At Paul's request, make rcu_gp_init() not just wait for but actively
*exclude* any CPU online/offline activity, which was largely true
already by virtue of its holding rcu_state.ofl_lock.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 kernel/rcu/tree.c | 64 +++++++++++++++++++++++------------------------
 kernel/rcu/tree.h |  4 +--
 2 files changed, 32 insertions(+), 36 deletions(-)
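
In short, the locking pattern the patch adopts (a simplified sketch
distilled from the diff below, not a compilable excerpt) is a manual
interrupt save/restore around the arch_spinlock_t, since there is no
arch_spin_lock_irqsave():

	unsigned long flags;

	local_irq_save(flags);			/* no arch_spin_lock_irqsave() exists */
	arch_spin_lock(&rcu_state.ofl_lock);	/* arch lock, invisible to lockdep */
	raw_spin_lock_rcu_node(rnp);		/* leaf rcu_node lock, irqs already off */

	/* ... update rnp->qsmaskinitnext and friends ... */

	raw_spin_unlock_rcu_node(rnp);
	arch_spin_unlock(&rcu_state.ofl_lock);
	local_irq_restore(flags);

The same lock doubles as the lockdep 'excuse': rcu_lockdep_current_cpu_online()
now tests arch_spin_is_locked(&rcu_state.ofl_lock) instead of an odd
rnp->ofl_seq value.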

Comments

Paul E. McKenney Dec. 9, 2021, 5:18 p.m. UTC | #1
On Thu, Dec 09, 2021 at 03:09:29PM +0000, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> If we allow architectures to bring APs online in parallel, then we end
> up requiring rcu_cpu_starting() to be reentrant. But currently, the
> manipulation of rnp->ofl_seq is not thread-safe.
> 
> However, rnp->ofl_seq is also largely pointless anyway, since both
> rcu_cpu_starting() and rcu_report_dead() hold rcu_state.ofl_lock for
> almost the whole time that rnp->ofl_seq is set to an odd number to
> indicate that an operation is in progress.
> 
> So drop rnp->ofl_seq completely, and use only rcu_state.ofl_lock.
> 
> This has a couple of minor complexities: lockdep will complain when we
> take rcu_state.ofl_lock, and currently accepts the 'excuse' of having
> an odd value in rnp->ofl_seq. So switch it to an arch_spinlock_t to
> avoid that false-positive complaint. Since we're killing rnp->ofl_seq,
> that 'excuse' has to change too, so make the check use
> arch_spin_is_locked(&rcu_state.ofl_lock).
> 
> There's no arch_spin_lock_irqsave(), so we have to manually save and
> restore local interrupts around the locking.
> 
> At Paul's request, make rcu_gp_init() not just wait for but actively
> *exclude* any CPU online/offline activity, which was largely true
> already by virtue of its holding rcu_state.ofl_lock.

Looks good!

Could you please also make the first clause read something like this?

	"At Paul's request based on Neeraj's analysis, ..."

I am going to pull this into -rcu for more intensive testing of this
code for non-concurrent CPU-online operations (making the above change
to the commit log).  At some point, rcutorture needs to learn how to
do concurrent CPU-online operations, but it would be good to bake the
RCU-specific patches for a bit beforehand.

Depending on timing, you might wish to send this patch with the
rest of this group, so:

Acked-by: Paul E. McKenney <paulmck@kernel.org>

If testing goes well and if you don't get it there first, I expect
to push this during the v5.18 merge window.

							Thanx, Paul

Neeraj Upadhyay Dec. 9, 2021, 6:31 p.m. UTC | #2
Hi,

On 12/9/2021 8:39 PM, David Woodhouse wrote:
> @@ -4233,7 +4236,7 @@ int rcutree_offline_cpu(unsigned int cpu)
>    */
>   void rcu_cpu_starting(unsigned int cpu)
>   {
> -	unsigned long flags;
> +	unsigned long flags, seq_flags;
>   	unsigned long mask;
>   	struct rcu_data *rdp;
>   	struct rcu_node *rnp;
> @@ -4246,11 +4249,11 @@ void rcu_cpu_starting(unsigned int cpu)
>   
>   	rnp = rdp->mynode;
>   	mask = rdp->grpmask;
> -	WRITE_ONCE(rnp->ofl_seq, rnp->ofl_seq + 1);
> -	WARN_ON_ONCE(!(rnp->ofl_seq & 0x1));
> +	local_irq_save(seq_flags);
> +	arch_spin_lock(&rcu_state.ofl_lock);
>   	rcu_dynticks_eqs_online();
>   	smp_mb(); // Pair with rcu_gp_cleanup()'s ->ofl_seq barrier().

Can we drop this smp_mb(), as arch_spin_lock(&rcu_state.ofl_lock)
provides the ordering now?

> -	raw_spin_lock_irqsave_rcu_node(rnp, flags);
> +	raw_spin_lock_rcu_node(rnp);
>   	WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext | mask);
>   	newcpu = !(rnp->expmaskinitnext & mask);
>   	rnp->expmaskinitnext |= mask;
> @@ -4269,9 +4272,8 @@ void rcu_cpu_starting(unsigned int cpu)
>   	} else {
>   		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);

'flags' is uninitialized now?



Thanks
Neeraj

David Woodhouse Dec. 9, 2021, 6:43 p.m. UTC | #3
On Fri, 2021-12-10 at 00:01 +0530, Neeraj Upadhyay wrote:
> > @@ -4246,11 +4249,11 @@ void rcu_cpu_starting(unsigned int cpu)
> >    
> >        rnp = rdp->mynode;
> >        mask = rdp->grpmask;
> > -     WRITE_ONCE(rnp->ofl_seq, rnp->ofl_seq + 1);
> > -     WARN_ON_ONCE(!(rnp->ofl_seq & 0x1));
> > +     local_irq_save(seq_flags);
> > +     arch_spin_lock(&rcu_state.ofl_lock);
> >        rcu_dynticks_eqs_online();
> >        smp_mb(); // Pair with rcu_gp_cleanup()'s ->ofl_seq barrier().
> 
> Can we drop this smp_mb(),as arch_spin_lock(&rcu_state.ofl_lock) 
> provides the ordering now?

Yes, thanks.

> > -     raw_spin_lock_irqsave_rcu_node(rnp, flags);
> > +     raw_spin_lock_rcu_node(rnp);
> >        WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext | mask);
> >        newcpu = !(rnp->expmaskinitnext & mask);
> >        rnp->expmaskinitnext |= mask;
> > @@ -4269,9 +4272,8 @@ void rcu_cpu_starting(unsigned int cpu)
> >        } else {
> >                raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> 
> 'flags' is uninitialized now?

Ah yes, that suffered from the fact that saving the flags is pointless
because we know we already disabled interrupts... but
rcu_report_qs_rnp() *really* wants to be given some flags to restore.

Will fix that too; thanks again.
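
Roughly, addressing both points (a hypothetical sketch of the direction,
not the actual repost): the smp_mb() simply goes away since
arch_spin_lock() now provides the ordering, and keeping the irqsave form
of the node lock leaves 'flags' initialized for rcu_report_qs_rnp() even
though interrupts are already disabled:

	local_irq_save(seq_flags);
	arch_spin_lock(&rcu_state.ofl_lock);
	rcu_dynticks_eqs_online();
	/* smp_mb() dropped: arch_spin_lock() already orders against rcu_gp_init(). */
	raw_spin_lock_irqsave_rcu_node(rnp, flags);	/* irqs already off; 'flags' now valid */
	WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext | mask);
	/* ... */
	if (rnp->qsmask & mask) {	/* RCU waiting on this incoming CPU? */
		/* rcu_report_qs_rnp() drops rnp->lock and restores 'flags'. */
		rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
	} else {
		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
	}
	arch_spin_unlock(&rcu_state.ofl_lock);
	local_irq_restore(seq_flags);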
David Woodhouse Dec. 9, 2021, 6:52 p.m. UTC | #4
On Thu, 2021-12-09 at 09:18 -0800, Paul E. McKenney wrote:
> Looks good!
> 
> Could you please also make the first clause read something like this?
> 
> 	"At Paul's request based on Neeraj's analysis, ..."
> 
> I am going to pull this into -rcu for more intensive testing of this
> code for non-concurrent CPU-online operations (making the above change
> to the commit log).  At some point, rcutorture needs to learn how to
> do concurrent CPU-online operations, but it would be good to bake the
> RCU-specific patches for a bit beforehand.
> 
> Depending on timing, you might wish to send this patch with the
> rest of this group, so:
> 
> Acked-by: Paul E. McKenney <paulmck@kernel.org>
> 
> If testing goes well and if you don't get it there first, I expect
> to push this during the v5.18 merge window.


Thanks.

I think the best option is probably to let all the cleanups and fixes
get merged separately through whatever tree is most appropriate, and
then we can just merge that final patch to actually *enable* parallel
boot for x86 last of all.

I'll address Neeraj's feedback and repost this one.

Patch

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index ef8d36f580fc..a1bb0b1229ed 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -91,7 +91,7 @@  static struct rcu_state rcu_state = {
 	.abbr = RCU_ABBR,
 	.exp_mutex = __MUTEX_INITIALIZER(rcu_state.exp_mutex),
 	.exp_wake_mutex = __MUTEX_INITIALIZER(rcu_state.exp_wake_mutex),
-	.ofl_lock = __RAW_SPIN_LOCK_UNLOCKED(rcu_state.ofl_lock),
+	.ofl_lock = __ARCH_SPIN_LOCK_UNLOCKED,
 };
 
 /* Dump rcu_node combining tree at boot to verify correct setup. */
@@ -1168,7 +1168,15 @@  bool rcu_lockdep_current_cpu_online(void)
 	preempt_disable_notrace();
 	rdp = this_cpu_ptr(&rcu_data);
 	rnp = rdp->mynode;
-	if (rdp->grpmask & rcu_rnp_online_cpus(rnp) || READ_ONCE(rnp->ofl_seq) & 0x1)
+	/*
+	 * Strictly, we care here about the case where the current CPU is
+	 * in rcu_cpu_starting() and thus has an excuse for rdp->grpmask
+	 * not being up to date. So arch_spin_is_locked() might have a
+	 * false positive if it's held by some *other* CPU, but that's
+	 * OK because that just means a false *negative* on the warning.
+	 */
+	if (rdp->grpmask & rcu_rnp_online_cpus(rnp) ||
+	    arch_spin_is_locked(&rcu_state.ofl_lock))
 		ret = true;
 	preempt_enable_notrace();
 	return ret;
@@ -1731,7 +1739,6 @@  static void rcu_strict_gp_boundary(void *unused)
  */
 static noinline_for_stack bool rcu_gp_init(void)
 {
-	unsigned long firstseq;
 	unsigned long flags;
 	unsigned long oldmask;
 	unsigned long mask;
@@ -1774,22 +1781,17 @@  static noinline_for_stack bool rcu_gp_init(void)
 	 * of RCU's Requirements documentation.
 	 */
 	WRITE_ONCE(rcu_state.gp_state, RCU_GP_ONOFF);
+	/* Exclude CPU hotplug operations. */
 	rcu_for_each_leaf_node(rnp) {
-		// Wait for CPU-hotplug operations that might have
-		// started before this grace period did.
-		smp_mb(); // Pair with barriers used when updating ->ofl_seq to odd values.
-		firstseq = READ_ONCE(rnp->ofl_seq);
-		if (firstseq & 0x1)
-			while (firstseq == READ_ONCE(rnp->ofl_seq))
-				schedule_timeout_idle(1);  // Can't wake unless RCU is watching.
-		smp_mb(); // Pair with barriers used when updating ->ofl_seq to even values.
-		raw_spin_lock(&rcu_state.ofl_lock);
-		raw_spin_lock_irq_rcu_node(rnp);
+		local_irq_save(flags);
+		arch_spin_lock(&rcu_state.ofl_lock);
+		raw_spin_lock_rcu_node(rnp);
 		if (rnp->qsmaskinit == rnp->qsmaskinitnext &&
 		    !rnp->wait_blkd_tasks) {
 			/* Nothing to do on this leaf rcu_node structure. */
-			raw_spin_unlock_irq_rcu_node(rnp);
-			raw_spin_unlock(&rcu_state.ofl_lock);
+			raw_spin_unlock_rcu_node(rnp);
+			arch_spin_unlock(&rcu_state.ofl_lock);
+			local_irq_restore(flags);
 			continue;
 		}
 
@@ -1824,8 +1826,9 @@  static noinline_for_stack bool rcu_gp_init(void)
 				rcu_cleanup_dead_rnp(rnp);
 		}
 
-		raw_spin_unlock_irq_rcu_node(rnp);
-		raw_spin_unlock(&rcu_state.ofl_lock);
+		raw_spin_unlock_rcu_node(rnp);
+		arch_spin_unlock(&rcu_state.ofl_lock);
+		local_irq_restore(flags);
 	}
 	rcu_gp_slow(gp_preinit_delay); /* Races with CPU hotplug. */
 
@@ -4233,7 +4236,7 @@  int rcutree_offline_cpu(unsigned int cpu)
  */
 void rcu_cpu_starting(unsigned int cpu)
 {
-	unsigned long flags;
+	unsigned long flags, seq_flags;
 	unsigned long mask;
 	struct rcu_data *rdp;
 	struct rcu_node *rnp;
@@ -4246,11 +4249,11 @@  void rcu_cpu_starting(unsigned int cpu)
 
 	rnp = rdp->mynode;
 	mask = rdp->grpmask;
-	WRITE_ONCE(rnp->ofl_seq, rnp->ofl_seq + 1);
-	WARN_ON_ONCE(!(rnp->ofl_seq & 0x1));
+	local_irq_save(seq_flags);
+	arch_spin_lock(&rcu_state.ofl_lock);
 	rcu_dynticks_eqs_online();
 	smp_mb(); // Pair with rcu_gp_cleanup()'s ->ofl_seq barrier().
-	raw_spin_lock_irqsave_rcu_node(rnp, flags);
+	raw_spin_lock_rcu_node(rnp);
 	WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext | mask);
 	newcpu = !(rnp->expmaskinitnext & mask);
 	rnp->expmaskinitnext |= mask;
@@ -4269,9 +4272,8 @@  void rcu_cpu_starting(unsigned int cpu)
 	} else {
 		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 	}
-	smp_mb(); // Pair with rcu_gp_cleanup()'s ->ofl_seq barrier().
-	WRITE_ONCE(rnp->ofl_seq, rnp->ofl_seq + 1);
-	WARN_ON_ONCE(rnp->ofl_seq & 0x1);
+	arch_spin_unlock(&rcu_state.ofl_lock);
+	local_irq_restore(seq_flags);
 	smp_mb(); /* Ensure RCU read-side usage follows above initialization. */
 }
 
@@ -4285,7 +4287,7 @@  void rcu_cpu_starting(unsigned int cpu)
  */
 void rcu_report_dead(unsigned int cpu)
 {
-	unsigned long flags;
+	unsigned long flags, seq_flags;
 	unsigned long mask;
 	struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
 	struct rcu_node *rnp = rdp->mynode;  /* Outgoing CPU's rdp & rnp. */
@@ -4299,10 +4301,8 @@  void rcu_report_dead(unsigned int cpu)
 
 	/* Remove outgoing CPU from mask in the leaf rcu_node structure. */
 	mask = rdp->grpmask;
-	WRITE_ONCE(rnp->ofl_seq, rnp->ofl_seq + 1);
-	WARN_ON_ONCE(!(rnp->ofl_seq & 0x1));
-	smp_mb(); // Pair with rcu_gp_cleanup()'s ->ofl_seq barrier().
-	raw_spin_lock(&rcu_state.ofl_lock);
+	local_irq_save(seq_flags);
+	arch_spin_lock(&rcu_state.ofl_lock);
 	raw_spin_lock_irqsave_rcu_node(rnp, flags); /* Enforce GP memory-order guarantee. */
 	rdp->rcu_ofl_gp_seq = READ_ONCE(rcu_state.gp_seq);
 	rdp->rcu_ofl_gp_flags = READ_ONCE(rcu_state.gp_flags);
@@ -4313,10 +4313,8 @@  void rcu_report_dead(unsigned int cpu)
 	}
 	WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext & ~mask);
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
-	raw_spin_unlock(&rcu_state.ofl_lock);
-	smp_mb(); // Pair with rcu_gp_cleanup()'s ->ofl_seq barrier().
-	WRITE_ONCE(rnp->ofl_seq, rnp->ofl_seq + 1);
-	WARN_ON_ONCE(rnp->ofl_seq & 0x1);
+	arch_spin_unlock(&rcu_state.ofl_lock);
+	local_irq_restore(seq_flags);
 
 	rdp->cpu_started = false;
 }
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 305cf6aeb408..aff4cc9303fb 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -56,8 +56,6 @@  struct rcu_node {
 				/*  Initialized from ->qsmaskinitnext at the */
 				/*  beginning of each grace period. */
 	unsigned long qsmaskinitnext;
-	unsigned long ofl_seq;	/* CPU-hotplug operation sequence count. */
-				/* Online CPUs for next grace period. */
 	unsigned long expmask;	/* CPUs or groups that need to check in */
 				/*  to allow the current expedited GP */
 				/*  to complete. */
@@ -358,7 +356,7 @@  struct rcu_state {
 	const char *name;			/* Name of structure. */
 	char abbr;				/* Abbreviated name. */
 
-	raw_spinlock_t ofl_lock ____cacheline_internodealigned_in_smp;
+	arch_spinlock_t ofl_lock ____cacheline_internodealigned_in_smp;
 						/* Synchronize offline with */
 						/*  GP pre-initialization. */
 };