
[v16,3/8] cpu/hotplug: Add dynamic parallel bringup states before CPUHP_BRINGUP_CPU

Message ID 20230321194008.785922-4-usama.arif@bytedance.com (mailing list archive)
State Superseded
Series Parallel CPU bringup for x86_64

Commit Message

Usama Arif March 21, 2023, 7:40 p.m. UTC
From: David Woodhouse <dwmw@amazon.co.uk>

There is often significant latency in the early stages of CPU bringup,
and time is wasted by waking each CPU (e.g. with INIT/SIPI/SIPI on x86)
and then waiting for it to respond before moving on to the next.

Allow a platform to register a set of pre-bringup CPUHP states to which
each CPU can be stepped in parallel, thus absorbing some of that latency.

There is a subtlety here: even with an empty CPUHP_BP_PARALLEL_DYN step,
this means that *all* CPUs are brought through the prepare states and to
CPUHP_BP_PREPARE_DYN before any of them are taken to CPUHP_BRINGUP_CPU
and then are allowed to run for themselves to CPUHP_ONLINE.

So any combination of prepare/start calls which depend on A-B ordering
for each CPU in turn would explode horribly. One example is the X2APIC
code, which used to allocate a cluster mask 'just in case' in the prep
stage and store it in a global variable, then potentially consume that
preallocated structure from the AP and set the global pointer to NULL
so that it is reallocated in CPUHP_X2APIC_PREPARE for the next CPU.
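
For illustration, a minimal sketch of that problematic pattern, with
simplified and hypothetical names (this is not the literal x2apic code):

	/* Global scratch buffer, allocated "just in case" by the BP */
	static struct cpumask *prealloc_mask;
	static DEFINE_PER_CPU(struct cpumask *, example_cluster_mask);

	/* BP side: prepare callback, runs once per to-be-onlined CPU */
	static int example_prepare_cpu(unsigned int cpu)
	{
		if (!prealloc_mask)
			prealloc_mask = kzalloc(cpumask_size(), GFP_KERNEL);
		return prealloc_mask ? 0 : -ENOMEM;
	}

	/* AP side: runs on the new CPU during its early startup */
	static void example_ap_setup(void)
	{
		/*
		 * Consume the preallocated buffer and clear the global so
		 * the *next* CPU's prepare call allocates a fresh one.
		 * Correct only if prepare and AP startup strictly alternate
		 * per CPU; running all prepare calls first leaves N-1 APs
		 * without a buffer.
		 */
		this_cpu_write(example_cluster_mask, prealloc_mask);
		prealloc_mask = NULL;
	}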

Any platform enabling the CPUHP_BP_PARALLEL_DYN steps must be reviewed
and tested to ensure that such issues do not exist, and the existing
behaviour of bringing CPUs to CPUHP_BP_PREPARE_DYN and then immediately
to CPUHP_BRINGUP_CPU and CPUHP_ONLINE only one at a time does not change
unless such a state is registered.
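
For reference, a hedged sketch of how an architecture might opt in by
registering one such state. Only the CPUHP_BP_PARALLEL_DYN constant and
the cpuhp_setup_state_nocalls() API come from this series and the
existing hotplug code; the callback name and body are hypothetical, and
where the registration happens is arch-specific (it must run before
bringup_nonboot_cpus()):

	/* Hypothetical: kick one AP without waiting for it to respond */
	static int arch_kick_ap(unsigned int cpu)
	{
		/* e.g. send INIT/SIPI/SIPI to 'cpu' and return immediately */
		return 0;
	}

	static int __init arch_register_parallel_bringup(void)
	{
		/* Allocates a state in the CPUHP_BP_PARALLEL_DYN range */
		return cpuhp_setup_state_nocalls(CPUHP_BP_PARALLEL_DYN,
						 "arch/cpu:kick",
						 arch_kick_ap, NULL);
	}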

Note that the new parallel stages do *not* yet bring each AP to the
CPUHP_BRINGUP_CPU state at the same time, only to the new states which
exist before it. The final loop in bringup_nonboot_cpus() is untouched,
bringing each AP in turn from the final PARALLEL_DYN state (or all the
way from CPUHP_OFFLINE) to CPUHP_BRINGUP_CPU and then waiting for that
AP to do its own processing and reach CPUHP_ONLINE before releasing the
next.

Parallelising that part by bringing them all to CPUHP_BRINGUP_CPU
and then waiting for them all is an exercise for the future.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Kim Phillips <kim.phillips@amd.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
---
 include/linux/cpuhotplug.h |  2 ++
 kernel/cpu.c               | 49 ++++++++++++++++++++++++++++++++++----
 2 files changed, 47 insertions(+), 4 deletions(-)

Comments

Thomas Gleixner March 23, 2023, 10:36 p.m. UTC | #1
On Tue, Mar 21 2023 at 19:40, Usama Arif wrote:
>  void bringup_nonboot_cpus(unsigned int setup_max_cpus)
>  {
> +	unsigned int n = setup_max_cpus - num_online_cpus();
>  	unsigned int cpu;
>  
> +	/*
> +	 * An architecture may have registered parallel pre-bringup states to
> +	 * which each CPU may be brought in parallel. For each such state,
> +	 * bring N CPUs to it in turn before the final round of bringing them
> +	 * online.
> +	 */
> +	if (n > 0) {
> +		enum cpuhp_state st = CPUHP_BP_PARALLEL_DYN;
> +
> +		while (st <= CPUHP_BP_PARALLEL_DYN_END && cpuhp_hp_states[st].name) {


There is no point in special casing this. All architectures can invoke
the CPUHP_BP_* states before CPUHP_BRINGUP_CPU for each to be brought up
CPU first. So this can be made unconditional and common exercised code.

Aside of that this dynamic state range is pretty pointless. There is
really nothing dynamic here and there is no real good reason to have
four dynamic parallel states just because.

The only interesting thing after CPUHP_BP_PREPARE_DYN_END and before
CPUHP_BRINGUP_CPU is a state which kicks the AP into life, i.e. we can
just hardcode that as CPUHP_BP_PARALLEL_STARTUP.

Thanks,

        tglx
---
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -133,6 +133,20 @@ enum cpuhp_state {
 	CPUHP_MIPS_SOC_PREPARE,
 	CPUHP_BP_PREPARE_DYN,
 	CPUHP_BP_PREPARE_DYN_END		= CPUHP_BP_PREPARE_DYN + 20,
+	/*
+	 * This is an optional state if the architecture supports parallel
+	 * startup. It's used to send the startup IPI so that the APs can
+	 * run in parallel through the low level startup code instead of
+	 * sending the IPIs one by one in CPUHP_BRINGUP_CPU. This avoids
+	 * waiting for the AP to react and shortens the serialized bringup.
+	 */
+	CPUHP_BP_PARALLEL_STARTUP,
+
+	/*
+	 * Fully per AP serialized bringup from here on. If the
+	 * architecture does not register the CPUHP_BP_PARALLEL_STARTUP
+	 * state, this step sends the startup IPI first.
+	 */
 	CPUHP_BRINGUP_CPU,
 
 	/*
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1504,13 +1504,49 @@ int bringup_hibernate_cpu(unsigned int s
 
 void bringup_nonboot_cpus(unsigned int setup_max_cpus)
 {
-	unsigned int cpu;
+	unsigned int cpu, n = 1;
 
+	/*
+	 * All CPUHP_BP* states are invoked on the control CPU (BP). There
+	 * is no requirement that these states need to be invoked
+	 * sequentially for a particular CPU. They can be invoked state by
+	 * state for each to be onlined CPU.
+	 *
+	 * On architectures which have setup the CPUHP_BP_PARALLEL_STARTUP
+	 * state, this sends the startup IPI to each of the to be onlined
+	 * APs. This avoids waiting for each AP to respond to the startup
+	 * IPI in CPUHP_BRINGUP_CPU. The APs proceed through the low level
+	 * bringup code and then wait for the control CPU to release them
+	 * one by one for the final onlining procedure in the loop below.
+	 *
+	 * For architectures which do not support parallel bringup the
+	 * CPUHP_BP_PARALLEL_STARTUP state is a NOOP and the actual startup
+	 * happens in the CPUHP_BRINGUP_CPU state which is fully serialized
+	 * per CPU in the loop below.
+	 */
+	for_each_present_cpu(cpu) {
+		if (n++ >= setup_max_cpus)
+			break;
+		cpu_up(cpu, CPUHP_BP_PARALLEL_STARTUP);
+	}
+
+	/* Do the per CPU serialized bringup to ONLINE state */
 	for_each_present_cpu(cpu) {
 		if (num_online_cpus() >= setup_max_cpus)
 			break;
-		if (!cpu_online(cpu))
-			cpu_up(cpu, CPUHP_ONLINE);
+
+		if (!cpu_online(cpu)) {
+			struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
+			int ret = cpu_up(cpu, CPUHP_ONLINE);
+
+			/*
+			 * Due to the above preparation loop a failed online attempt
+			 * has only rolled back to CPUHP_BP_PARALLEL_STARTUP. Do the
+			 * remaining cleanups.
+			 */
+			if (ret && can_rollback_cpu(st))
+				WARN_ON(cpuhp_invoke_callback_range(false, cpu, st, CPUHP_OFFLINE));
+		}
 	}
 }
David Woodhouse March 23, 2023, 10:49 p.m. UTC | #2
On Thu, 2023-03-23 at 23:36 +0100, Thomas Gleixner wrote:
> On Tue, Mar 21 2023 at 19:40, Usama Arif wrote:
> >  void bringup_nonboot_cpus(unsigned int setup_max_cpus)
> >  {
> > +       unsigned int n = setup_max_cpus - num_online_cpus();
> >         unsigned int cpu;
> >  
> > +       /*
> > +        * An architecture may have registered parallel pre-bringup states to
> > +        * which each CPU may be brought in parallel. For each such state,
> > +        * bring N CPUs to it in turn before the final round of bringing them
> > +        * online.
> > +        */
> > +       if (n > 0) {
> > +               enum cpuhp_state st = CPUHP_BP_PARALLEL_DYN;
> > +
> > +               while (st <= CPUHP_BP_PARALLEL_DYN_END && cpuhp_hp_states[st].name) {
> 
> 
> There is no point in special casing this. All architectures can invoke
> the CPUHP_BP_* states before CPUHP_BRINGUP_CPU for each to be brought up
> CPU first. So this can be made unconditional and common exercised code.
> 

There were three paragraphs in the commit message explaining why I
didn't want to do that. It didn't work for x86 before I started, and I
haven't reviewed *every* other architecture to ensure that it will work
there. It was opt-in for a reason. :)

> Aside of that this dynamic state range is pretty pointless. There is
> really nothing dynamic here and there is no real good reason to have
> four dynamic parallel states just because.

The original patch set did use more than one state; the plan to do
microcode updates in parallel does involve doing at least one more, I
believe.
 
https://lore.kernel.org/all/eb6717dfc4ceb99803c0396f950db7c3231c75ef.camel@infradead.org/

> The only interesting thing after CPUHP_BP_PREPARE_DYN_END and before
> CPUHP_BRINGUP_CPU is a state which kicks the AP into life, i.e. we can
> just hardcode that as CPUHP_BP_PARALLEL_STARTUP.
> 
> Thanks,
> 
>         tglx
> ---
> --- a/include/linux/cpuhotplug.h
> +++ b/include/linux/cpuhotplug.h
> @@ -133,6 +133,20 @@ enum cpuhp_state {
>         CPUHP_MIPS_SOC_PREPARE,
>         CPUHP_BP_PREPARE_DYN,
>         CPUHP_BP_PREPARE_DYN_END                = CPUHP_BP_PREPARE_DYN + 20,
> +       /*
> +        * This is an optional state if the architecture supports parallel
> +        * startup. It's used to send the startup IPI so that the APs can
> +        * run in parallel through the low level startup code instead of
> +        * sending the IPIs one by one in CPUHP_BRINGUP_CPU. This avoids
> +        * waiting for the AP to react and shortens the serialized bringup.
> +        */
> +       CPUHP_BP_PARALLEL_STARTUP,
> +
> +       /*
> +        * Fully per AP serialized bringup from here on. If the
> +        * architecture does not register the CPUHP_BP_PARALLEL_STARTUP
> +        * state, this step sends the startup IPI first.
> +        */

Not sure I'd conceded that yet; the APs do their *own* bringup from
here, and that really ought to be able to run in parallel.

>         CPUHP_BRINGUP_CPU,
>  
>         /*
Thomas Gleixner March 23, 2023, 11:05 p.m. UTC | #3
On Thu, Mar 23 2023 at 23:36, Thomas Gleixner wrote:
> On Tue, Mar 21 2023 at 19:40, Usama Arif wrote:
>>  void bringup_nonboot_cpus(unsigned int setup_max_cpus)
>>  {
>> +	unsigned int n = setup_max_cpus - num_online_cpus();
>>  	unsigned int cpu;
>>  
>> +	/*
>> +	 * An architecture may have registered parallel pre-bringup states to
>> +	 * which each CPU may be brought in parallel. For each such state,
>> +	 * bring N CPUs to it in turn before the final round of bringing them
>> +	 * online.
>> +	 */
>> +	if (n > 0) {
>> +		enum cpuhp_state st = CPUHP_BP_PARALLEL_DYN;
>> +
>> +		while (st <= CPUHP_BP_PARALLEL_DYN_END && cpuhp_hp_states[st].name) {
>
>
> There is no point in special casing this. All architectures can invoke
> the CPUHP_BP_* states before CPUHP_BRINGUP_CPU for each to be brought up
> CPU first. So this can be made unconditional and common exercised
> code.

Bah. There is. We discussed that before. Architectures need to opt in to
make sure that there are no implicit dependencies on the full
serialization.

Still the rest can be simplified as below.

Thanks,

        tglx
---
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -133,6 +133,20 @@ enum cpuhp_state {
 	CPUHP_MIPS_SOC_PREPARE,
 	CPUHP_BP_PREPARE_DYN,
 	CPUHP_BP_PREPARE_DYN_END		= CPUHP_BP_PREPARE_DYN + 20,
+	/*
+	 * This is an optional state if the architecture supports parallel
+	 * startup. It's used to send the startup IPI so that the APs can
+	 * run in parallel through the low level startup code instead of
+	 * sending the IPIs one by one in CPUHP_BRINGUP_CPU. This avoids
+	 * waiting for the AP to react and shortens the serialized bringup.
+	 */
+	CPUHP_BP_PARALLEL_STARTUP,
+
+	/*
+	 * Fully per AP serialized bringup from here on. If the
+	 * architecture does not register the CPUHP_BP_PARALLEL_STARTUP
+	 * state, this step sends the startup IPI first.
+	 */
 	CPUHP_BRINGUP_CPU,
 
 	/*
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1504,13 +1504,45 @@ int bringup_hibernate_cpu(unsigned int s
 
 void bringup_nonboot_cpus(unsigned int setup_max_cpus)
 {
-	unsigned int cpu;
+	unsigned int cpu, n = 1;
 
+	/*
+	 * On architectures which have setup the CPUHP_BP_PARALLEL_STARTUP
+	 * state, this invokes all BP prepare states and the parallel
+	 * startup state sends the startup IPI to each of the to be onlined
+	 * APs. This avoids waiting for each AP to respond to the startup
+	 * IPI in CPUHP_BRINGUP_CPU. The APs proceed through the low level
+	 * bringup code and then wait for the control CPU to release them
+	 * one by one for the final onlining procedure in the loop below.
+	 *
+	 * For architectures which do not support parallel bringup all
+	 * states are fully serialized in the loop below.
+	 */
+	if (!cpuhp_step_empty(true, CPUHP_BP_PARALLEL_STARTUP)) {
+		for_each_present_cpu(cpu) {
+			if (n++ >= setup_max_cpus)
+				break;
+			cpu_up(cpu, CPUHP_BP_PARALLEL_STARTUP);
+		}
+	}
+
+	/* Do the per CPU serialized bringup to ONLINE state */
 	for_each_present_cpu(cpu) {
 		if (num_online_cpus() >= setup_max_cpus)
 			break;
-		if (!cpu_online(cpu))
-			cpu_up(cpu, CPUHP_ONLINE);
+
+		if (!cpu_online(cpu)) {
+			struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
+			int ret = cpu_up(cpu, CPUHP_ONLINE);
+
+			/*
+			 * Due to the above preparation loop a failed online attempt
+			 * might have only rolled back to CPUHP_BP_PARALLEL_STARTUP. Do the
+			 * remaining cleanups. NOOP for the non parallel case.
+			 */
+			if (ret && can_rollback_cpu(st))
+				WARN_ON(cpuhp_invoke_callback_range(false, cpu, st, CPUHP_OFFLINE));
+		}
 	}
 }
David Woodhouse March 23, 2023, 11:12 p.m. UTC | #4
On Fri, 2023-03-24 at 00:05 +0100, Thomas Gleixner wrote:
> Still the rest can be simplified as below.

...

> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -1504,13 +1504,45 @@ int bringup_hibernate_cpu(unsigned int s
>  
>  void bringup_nonboot_cpus(unsigned int setup_max_cpus)
>  {
> -       unsigned int cpu;
> +       unsigned int cpu, n = 1;
>  
> +       /*
> +        * On architectures which have setup the CPUHP_BP_PARALLEL_STARTUP
> +        * state, this invokes all BP prepare states and the parallel
> +        * startup state sends the startup IPI to each of the to be onlined
> +        * APs. This avoids waiting for each AP to respond to the startup
> +        * IPI in CPUHP_BRINGUP_CPU. The APs proceed through the low level
> +        * bringup code and then wait for the control CPU to release them
> +        * one by one for the final onlining procedure in the loop below.
> +        *
> +        * For architectures which do not support parallel bringup all
> +        * states are fully serialized in the loop below.
> +        */
> +       if (!cpuhp_step_empty(true, CPUHP_BP_PARALLEL_STARTUP)) {

I'll take using cpuhp_step_empty().

> +                       for_each_present_cpu(cpu) {
> +                               if (n++ >= setup_max_cpus)
> +                                       break;
> +                               cpu_up(cpu, CPUHP_BP_PARALLEL_STARTUP);
> +                       }
> +       }
> +
> +       /* Do the per CPU serialized bringup to ONLINE state */
>         for_each_present_cpu(cpu) {
>                 if (num_online_cpus() >= setup_max_cpus)
>                         break;
> -               if (!cpu_online(cpu))
> -                       cpu_up(cpu, CPUHP_ONLINE);
> +
> +               if (!cpu_online(cpu)) {
> +                       struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
> +                       int ret = cpu_up(cpu, CPUHP_ONLINE);
> +
> +                       /*
> +                        * Due to the above preparation loop a failed online attempt
> +                        * might have only rolled back to CPUHP_BP_PARALLEL_STARTUP. Do the
> +                        * remaining cleanups. NOOP for the non parallel case.
> +                        */
> +                       if (ret && can_rollback_cpu(st))
> +                               WARN_ON(cpuhp_invoke_callback_range(false, cpu, st, CPUHP_OFFLINE));
> +               }

And I'll take doing this bit unconditionally (it's basically a no-op if
they already got rolled all the way back to CPUHP_OFFLINE, right?).

But the additional complexity of having multiple steps is fairly
minimal, and I'm already planning to *use* another one even in x86, as
discussed.
Thomas Gleixner March 23, 2023, 11:48 p.m. UTC | #5
On Thu, Mar 23 2023 at 22:49, David Woodhouse wrote:
> On Thu, 2023-03-23 at 23:36 +0100, Thomas Gleixner wrote:
>> There is no point in special casing this. All architectures can invoke
>> the CPUHP_BP_* states before CPUHP_BRINGUP_CPU for each to be brought up
>> CPU first. So this can be made unconditional and common exercised code.
>> 
>
> There were three paragraphs in the commit message explaining why I
> didn't want to do that. It didn't work for x86 before I started, and I
> haven't reviewed *every* other architecture to ensure that it will work
> there. It was opt-in for a reason. :)

I noticed myself before reading your reply :)

>> Aside of that this dynamic state range is pretty pointless. There is
>> really nothing dynamic here and there is no real good reason to have
>> four dynamic parallel states just because.
>
> The original patch set did use more than one state; the plan to do
> microcode updates in parallel does involve doing at least one more, I
> believe.

I don't think so. The microcode muck can completely serialize itself
and does not need control CPU assistance if done right. If we go there
we have to fix that mess and not just proliferate it with more duct tape.

>> +       /*
>> +        * Fully per AP serialized bringup from here on. If the
>> +        * architecture does not register the CPUHP_BP_PARALLEL_STARTUP
>> +        * state, this step sends the startup IPI first.
>> +        */
>
> Not sure I'd conceded that yet; the APs do their *own* bringup from
> here, and that really ought to be able to run in parallel.

Somewhere in the distant future, once we've

  1) sorted the mandatory synchronization points, e.g. TSC sync in the
     early bootup phase.

  2) audited _all_ AP state callbacks to ensure they are able to cope
     with parallel bringup.

     That's a long road as there are tons of assumptions about the
     implicit CPU hotplug serialization in those callbacks.

     Just because it did not explode in your face yet does not mean this
     is safe.

     I just looked at 10 randomly picked AP online callbacks and found 5
     of them being not ready :)

Thanks,

        tglx
David Woodhouse March 24, 2023, 12:06 a.m. UTC | #6
On Fri, 2023-03-24 at 00:48 +0100, Thomas Gleixner wrote:
> On Thu, Mar 23 2023 at 22:49, David Woodhouse wrote:
> > On Thu, 2023-03-23 at 23:36 +0100, Thomas Gleixner wrote:
> > > There is no point in special casing this. All architectures can invoke
> > > the CPUHP_BP_* states before CPUHP_BRINGUP_CPU for each to be brought up
> > > CPU first. So this can be made unconditional and common exercised code.
> > > 
> > 
> > There were three paragraphs in the commit message explaining why I
> > didn't want to do that. It didn't work for x86 before I started, and I
> > haven't reviewed *every* other architecture to ensure that it will work
> > there. It was opt-in for a reason. :)
> 
> I noticed myself before reading your reply :)
> 
> > > Aside of that this dynamic state range is pretty pointless. There is
> > > really nothing dynamic here and there is no real good reason to have
> > > four dynamic parallel states just because.
> > 
> > The original patch set did use more than one state; the plan to do
> > microcode updates in parallel does involve doing at least one more, I
> > believe.
> 
> I don't think so. The microcode muck can completely serialize itself
> and does not need control CPU assistance if done right. If we go there
> we have to fix that mess and not just proliferate it with more duct tape.
> 
> > > +       /*
> > > +        * Fully per AP serialized bringup from here on. If the
> > > +        * architecture does not register the CPUHP_BP_PARALLEL_STARTUP
> > > +        * state, this step sends the startup IPI first.
> > > +        */
> > 
> > Not sure I'd conceded that yet; the APs do their *own* bringup from
> > here, and that really ought to be able to run in parallel.
> 
> Somewhere in the distance future once we've
> 
>   1) sorted the mandatory synchronization points, e.g. TSC sync in the
>      early bootup phase.

That's why we have four of them... :)

>   2) audited _all_ AP state callbacks that they are able to cope with
>      parallel bringup.
> 
>      That's a long road as there are tons of assumptions about the
>      implicit CPU hotplug serialization in those callbacks.
> 
>      Just because it did not explode in your face yet does not mean this
>      is safe.
> 
>      I just looked at 10 randomly picked AP online callbacks and found 5
>      of them being not ready :)

Oh, it's totally hosed, absolutely. I don't think even my most
ambitious hacks had even tried it yet. But I want to, so I wasn't about
to add a comment saying the opposite.

But it's fine; removing the comment is the *least* of the work to be
done in making that bit actually work in parallel :)
Thomas Gleixner March 24, 2023, 1:16 a.m. UTC | #7
On Thu, Mar 23 2023 at 23:12, David Woodhouse wrote:
> On Fri, 2023-03-24 at 00:05 +0100, Thomas Gleixner wrote:
>> +                       if (ret && can_rollback_cpu(st))
>> +                               WARN_ON(cpuhp_invoke_callback_range(false, cpu, st, CPUHP_OFFLINE));
>> +               }
>
> And I'll take doing this bit unconditionally (it's basically a no-op if
> they already got rolled all the way back to CPUHP_OFFLINE, right?).
>
> But the additional complexity of having multiple steps is fairly
> minimal, and I'm already planning to *use* another one even in x86, as
> discussed.

It's not about the "complexity". That's a general design question and
I'm not agreeing with your approach of putting AP specifics into the BP
state space.

The BP-only phase ends at the point where the AP is woken up via
INIT/SIPI/whatever. Period.

And no, we are not making this special just because it's the easiest way
to get it done. I have _zero_ interest in this kind of hackery which
just slaps stuff into the code where it's convenient without even
thinking about proper separation.

We went to great lengths to separate things clearly and it takes more than
"oh let's reserve a few special states" to keep this separation
intact. That's a matter of correctness, maintainability and taste.

That microcode thing on X86 has absolutely no business in the pre
bringup DYN states. It's an AP problem and it can be solved completely
without BP interaction.

And before you start drooling over further great parallelism, can you
please take a step back and come up with a sensible design for the
actual real-world requirements?

The point is that after an AP comes out of lala land and starts
executing there are mandatory synchronization points which need to be
handled by design. The number of synchronization points is architecture
and platform specific and might be 0, but that's a detail.

So the proper thing is to split CPUHP_BRINGUP_CPU, which is the bridging
state between AP and BP today, into a set of synchronization points
between BP and AP.

But that's non-trivial because if you look at bringup_cpu() then you'll
notice that this state has an implicit protection against interrupt
allocation/free and quite some architectures rely on this for their
early bringup code and possibly their STARTING state callbacks.

That aside. Let's just look at x86 as of today from the BP side:

    1) Wakeup AP
    2) Wait until there is sign of life
    3) Let AP proceed
    5) Wait until AP is done with init
    6) TSC synchronization
    7) Wait until it is online

and on the AP side:

    1) Do low level init
    2) Report life
    3) Wait until BP allows to proceed
    4) Do init
    5) Report done
    6) TSC synchronization
    7) Report online
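
Sketched conceptually for a single AP, with made-up flag names (the real
x86 code uses cpumasks and completions for these steps):

	static atomic_t ap_alive, ap_released, ap_init_done;

	static void bp_side(void)
	{
		/* 1) Wakeup AP: send INIT/SIPI/SIPI (elided) */
		while (!atomic_read(&ap_alive))		/* 2) Wait for sign of life */
			cpu_relax();
		atomic_set(&ap_released, 1);		/* 3) Let AP proceed */
		while (!atomic_read(&ap_init_done))	/* 5) Wait until AP is done with init */
			cpu_relax();
		/* 6) TSC synchronization and 7) wait for online elided */
	}

	static void ap_side(void)
	{
		/* 1) Low level init has already run at this point */
		atomic_set(&ap_alive, 1);		/* 2) Report life */
		while (!atomic_read(&ap_released))	/* 3) Wait until BP allows to proceed */
			cpu_relax();
		/* 4) Do init ... */
		atomic_set(&ap_init_done, 1);		/* 5) Report done */
		/* 6) TSC synchronization and 7) report online elided */
	}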

So surely you could claim that up to #6 (TSC sync) nothing needs to
synchronize.

But that's simply not true because topology information is implicitly
serialized by CPU hotplug and your attempt to serialize that in patch
7/8 is not cutting it at all because AP #4 (STARTING) callbacks rely
implicitly on the immutability of the topology information and so do
some of the later (threaded) AP callbacks too.

As I said before 5 out of 10 callbacks I looked at are not ready for
this.

Just because it did not explode in your face yet, does not make any of
your wet dreams more correct.

Again: I'm not interested in this kind of "works for me" nonsense at
all. I wasted enough time already to make CPU hotplug reliable and
understandable. I have no intention to waste more time on it just
because.

Thanks,

        tglx
David Woodhouse March 24, 2023, 9:31 a.m. UTC | #8
On Fri, 2023-03-24 at 02:16 +0100, Thomas Gleixner wrote:
> On Thu, Mar 23 2023 at 23:12, David Woodhouse wrote:
> > On Fri, 2023-03-24 at 00:05 +0100, Thomas Gleixner wrote:
> > > +                       if (ret && can_rollback_cpu(st))
> > > +                               WARN_ON(cpuhp_invoke_callback_range(false, cpu, st, CPUHP_OFFLINE));
> > > +               }
> > 
> > And I'll take doing this bit unconditionally (it's basically a no-op if
> > they already got rolled all the way back to CPUHP_OFFLINE, right?).
> > 
> > But the additional complexity of having multiple steps is fairly
> > minimal, and I'm already planning to *use* another one even in x86, as
> > discussed.
> 
> It's not about the "complexity". That's a general design question and
> I'm not agreeing with your approach of putting AP specifics into the BP
> state space.
> 
> The BP only phase ends at the point where the AP is woken up via
> SIPI/INIT/whatever. Period.
> 
> And no, we are not making this special just because it's the easiest way
> to get it done. I have _zero_ interest in this kind of hackery which
> just slaps stuff into the code where its conveniant without even
> thinking about proper separations
> 
> We went a great length to separate things clearly and it takes more than
> "oh let's reserve a few special states" to keep this separation
> intact. That's a matter of correctness, maintainability and taste.
> 
> That microcode thing on X86 has absolutely no business in the pre
> bringup DYN states. It's an AP problem and it can be solved completely
> without BP interaction.

As long as the BP doesn't bring any more siblings online during the
update, sure.

> And before you start drooling over further great parallelism, can you
> please take a step back and come up with a sensible design for the
> actual real world requirments?
> 
> The point is that after an AP comes out of lala land and starts
> executing there are mandatory synchronization points which need to be
> handled by design. The number of synchronization points is architecture
> and platform specific and might be 0, but thats a detail.
> 
> So the proper thing is to split CPUHP_BRINGUP_CPU, which is the bridging
> state between AP and BP today, into a set of synchronization points
> between BP and AP.

I feel we're talking past each other a little bit. Because I'd have
said that's *exactly* what this patch set is doing.

The existing x86 CPUHP_BRINGUP_CPU step in native_cpu_up() has a bunch
of back-and-forth between BP and AP, which you've enumerated below and
which I cleaned up and commented in the 'Split up native_cpu_up into
separate phases and document them' patch.

This patch set literally introduces the new PARALLEL_DYN states to be
"a set of synchronization points between BP and AP", and then makes the
x86 code utilise that for the *first* of its back-and-forth exchanges
between BP and AP.

The patch to do the second stage and truly make it a 'set' on x86 is
waiting in the wings, precisely because it *happens* not to catch fire
yet, but hasn't been properly audited yet.

But it seemed to me that you were saying you want us to limit
architectures to just *one* additional step (CPUHP_BP_PARALLEL_STARTUP)
instead of the 'set' that I'd added with the PARALLEL_DYN states? Did I
misunderstand?

> But that's non-trivial because if you look at bringup_cpu() then you'll
> notice that this state has an implicit protection against interrupt
> allocation/free and quite some architectures rely on this for their
> early bringup code and possibly their STARTING state callbacks.

I literally pointed that out in 2021 (in one of the messages I
reference below).

For x86 I don't believe that's going to be an issue yet, if at all. I
think it matters for the lapic_online() call which happens near the end
of start_secondary(); it's almost the last thing it does before going
into the generic AP startup thread. It's *also* preceded by a comment
that at least *suggests* it ought to be fine anyway, although we should
never entirely trust such comments.

        /*
         * Lock vector_lock, set CPU online and bring the vector
         * allocator online. Online must be set with vector_lock held
         * to prevent a concurrent irq setup/teardown from seeing a
         * half valid vector space.
         */
        lock_vector_lock();
        /* Sync point with do_wait_cpu_online() */
        set_cpu_online(smp_processor_id(), true);
        lapic_online();

We're a long way from parallelizing *that* one and needing to check the
veracity of the comment though.

> That aside. Let's just look at x86 as of today from the BP side:
> 
>     1) Wakeup AP
>     2) Wait until there is sign of life
>     3) Let AP proceed
>     5) Wait until AP is done with init
>     6) TSC synchronization
>     7) Wait until it is online
> 
> and on the AP side:
> 
>     1) Do low level init
>     2) Report life
>     3) Wait until BP allows to proceed
>     4) Do init
>     5) Report done
>     6) TSC synchronization
>     7) Report online
> 
> So surely you could claim that up to #6 (TSC sync) nothing needs to
> synchronize.

I don't think we could claim that at all. There are a whole bunch of
implicit dependencies on the fact that this happens in series, and at
any given point in the sequence we might know that *either* the BP is
free to act, *or* precisely one of the APs is free. For example:

> But that's simply not true because topology information is implicitly
> serialized by CPU hotplug and your attempt to serialize that in patch
> 7/8 is not cutting it at all because AP #4 (STARTING) callbacks rely
> implicitly on the immutability of the topology information

Right, and that's just fine. That patch explicitly calls out that it is
preparation for *future* parallelism. Which is why it's *last* in the
series (well, apart from the SEV-ES bit, which I wanted to be even
*more* last-in-the-series than that).

If it was needed for the "phase 1" parallelism of INIT/SIPI/SIPI that
gets enabled in patch #6, then it would have come before that in the
series.

It is necessary — but not sufficient, as you correctly point out — for
the *next* stages of parallelism. 

>  and so do some of the later (threaded) AP callbacks too.
> 
> As I said before 5 out of 10 callbacks I looked at are not ready for
> this.
> 
> Just because it did not explode in your face yet, does not make any of
> your wet dreams more correct.

Now you're looking even further into the future. We're not trying to
make the AP startup threads run in parallel yet.

> Again: I'm not interested in this kind of "works for me" nonsense at
> all.

You keep saying that. It's starting to become offensive.

Yes, the first RFC posting of this approach was quite explicitly billed
as 'proof-of-concept hack' and 'don't look too hard because it's only
really for figuring out the timing'. I make no apology for that:
https://lore.kernel.org/lkml/8ac72d7b287ed1058b2dec3301578238aff0abdd.camel@infradead.org/

And if you look back at the first actual series that was posted, it's
also very clear about which bit is actually ready because all the
testing and thinking has been done, and which isn't:
https://lore.kernel.org/lkml/478c56af540eaa47b8f452e51cb0c085f26db738.camel@infradead.org/

So please, stop giving me this '"works for me" nonsense' nonsense.

Yes, there are some patches on top of this lot that aren't ready yet;
we've fixed up *some* of the things that actually broke, and only done
a cursory pass of actually inspecting it to make sure it's safe. And
some of your observations above are very relevant and helpful to when
we come to do *that* part of the thinking.

But we aren't asking you to merge those parts yet. We have deliberately
cut it back to the initial stage because that's where the biggest win
is, and there's *enough* thinking to be done just for this part.

We will review your comments and be happy to engage in further robust
discussion about the future patches when we get round to that (if
indeed we conclude that there's a significant win to be had from them).

In the meantime, if there are specific things which are wrong with the
parallelism introduced by *this* patch series that you're trying to
point out to us, then I apologise; I've missed them.
Thomas Gleixner March 24, 2023, 9:46 a.m. UTC | #9
On Fri, Mar 24 2023 at 02:16, Thomas Gleixner wrote:
> On Thu, Mar 23 2023 at 23:12, David Woodhouse wrote:
> So surely you could claim that up to #6 (TSC sync) nothing needs to
> synchronize.
>
> But that's simply not true because topology information is implicitly
> serialized by CPU hotplug and your attempt to serialize that in patch
> 7/8 is not cutting it at all because AP #4 (STARTING) callbacks rely
> implicitly on the immutability of the topology information and so do
> some of the later (threaded) AP callbacks too.

It's even worse. Any topology relevant change _must_ be serialized by
holding cpu_hotplug_lock write locked. So this needs fundamentally more
thought than just slapping a few misnamed states into the picture and
pretending that everything else just works.

Thanks,

        tglx
David Woodhouse March 24, 2023, 10 a.m. UTC | #10
On Fri, 2023-03-24 at 10:46 +0100, Thomas Gleixner wrote:
> 
> It's even worse. Any topology relevant change _must_ be serialized by
> holding cpu_hotplug_lock write locked. So this needs fundamentally more
> thought

Yes. Yes, it does. Which is why it isn't being done in parallel in this
patch series. The topology update happens from smp_callin() which
happens from do_wait_cpu_initialized(), which is still in the
CPUHP_BRINGUP_CPU stage for now, being done one at a time.

When we come to look at the next stage, doing do_wait_cpu_initialized()
in parallel for multiple APs too, we will be happy to hear specifics of
why this "must" be serialized by the cpu_hotplug_lock and no other. It
will be great to check that we haven't missed anything.

But we aren't there yet.
Thomas Gleixner March 24, 2023, 1:57 p.m. UTC | #11
On Fri, Mar 24 2023 at 09:31, David Woodhouse wrote:
> On Fri, 2023-03-24 at 02:16 +0100, Thomas Gleixner wrote:
>> So the proper thing is to split CPUHP_BRINGUP_CPU, which is the bridging
>> state between AP and BP today, into a set of synchronization points
>> between BP and AP.
>
> I feel we're talking past each other a little bit. Because I'd have
> said that's *exactly* what this patch set is doing.
>
> The existing x86 CPUHP_BRINGUP_CPU step in native_cpu_up() has a bunch
> of back-and-forth between BP and AP, which you've enumerated below and
> which I cleaned up and commented in the 'Split up native_cpu_up into
> separate phases and document them' patch.
>
> This patch set literally introduces the new PARALLEL_DYN states to be
> "a set of synchronization points between BP and AP", and then makes the
> x86 code utilise that for the *first* of its back-and-forth exchanges
> between BP and AP.

It provides a dynamic space which is absolutely unspecified and just
opens the door for tinkering.

This first step does not even contain a synchronization point. All it
does is to kick the AP into gear. Not more, not less. Naming this
obscurely as PARALLEL_DYN is a tasteless bandaid.

If you go and look at all __cpu_up() implementations, then you'll notice
that these are very similar. All of them do

   1) Kick AP
   2) Synchronize at least once between BP and AP

There is nothing x86 specific about this. So instead of hiding this
behind a misnamed dynamic space, the obviously correct solution is to
make this an explicit mechanism in the core code and convert _all_
architectures over to this scheme. That's in the first place completely
independent of parallel bringup.

It has a value on its own as it consolidates the zoo of synchronization
mechanisms (cpumasks, completions, random variables) into one shared
mechanism in the core code.
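
Purely as an illustration, one possible shape of such a shared mechanism
is a per-CPU synchronization word which the AP advances and the BP polls;
all names below are hypothetical, and this is not necessarily how this
series or any eventual implementation does it:

	enum ap_sync_state {
		AP_SYNC_DEAD,
		AP_SYNC_ALIVE,		/* AP reached the kernel entry code */
		AP_SYNC_RELEASED,	/* BP allowed the AP to proceed */
		AP_SYNC_ONLINE,
	};

	static DEFINE_PER_CPU(atomic_t, ap_sync);

	/* AP side: report progress */
	static void cpuhp_ap_report(enum ap_sync_state state)
	{
		atomic_set_release(this_cpu_ptr(&ap_sync), state);
	}

	/* BP side: wait for a remote AP to reach at least the given state */
	static void cpuhp_bp_wait_for(unsigned int cpu, enum ap_sync_state state)
	{
		while (atomic_read_acquire(per_cpu_ptr(&ap_sync, cpu)) < state)
			cpu_relax();
	}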

That's very much different from what your patch is doing. And there is a
very good reason aside of consolidation to do so:

This prepares to handle the parallelism requirements in the core code
instead of letting each architecture come up with its own set of
magic. Which in turn is a prerequisite for allowing the STARTING
callbacks or later the threaded AP states to execute in parallel.

Why? Simply because of this:

  BP			AP              state
  kick()                                BRINGUP_CPU
                        startup()                      
  sync()                sync() 
                        starting()      advances to AP_ONLINE
  sync()                sync()
  TSC_sync()            TSC_sync()
  wait_for_online()     set_online()
                        cpu_startup_entry() AP_ONLINE_IDLE
  wait_for_completion() complete()

This works correctly today because bringup_cpu() does not modify state
and expects the state to be advanced by the AP once the completion is
done.

So you _cannot_ just throw some magic dynamic states before BRINGUP_CPU
and then expect that the state machine is consistent when the AP is
allowed to run the starting callbacks in parallel.

The sync point after the starting callbacks simply cannot be in that
dynamic state space. It has to be _after_ the AP starting states.

That needs a fundamental change in how the state management at this
point works, and it needs to be done upfront. Aside from the general
serialization aspects, this needs some deep thought about whether the
BP control can stay single-threaded or whether it's required to spawn
a control thread per AP.

The kick CPU into life state is completely independent of the above and
can be moved just before BRINGUP_CPU without violating anything. But
that's where the easy to solve part stops hard.

You might find my wording offensive, but I perceive your "let's use a
few dynamic states and see what sticks" approach offensive too.

The analysis of all this is not rocket science and could have been done
long ago by yourself.

Thanks,

        tglx
Thomas Gleixner March 24, 2023, 2:07 p.m. UTC | #12
On Fri, Mar 24 2023 at 09:31, David Woodhouse wrote:
> On Fri, 2023-03-24 at 02:16 +0100, Thomas Gleixner wrote:
>> But that's non-trivial because if you look at bringup_cpu() then you'll
>> notice that this state has an implicit protection against interrupt
>> allocation/free and quite some architectures rely on this for their
>> early bringup code and possibly their STARTING state callbacks.
>
> I literally pointed that out in 2021 (in one of the messages I
> reference below).
>
> For x86 I don't believe that's going to be an issue yet, if at all. I
> think it matters for the lapic_online() call which happens near the end
> of start_secondary(); it's almost the last thing it does before going
> into the generic AP startup thread. It's *also* preceded by a comment
> that at least *suggests* it ought to be fine anyway, although we should
> never entirely trust such comments.
>
>         /*
>          * Lock vector_lock, set CPU online and bring the vector
>          * allocator online. Online must be set with vector_lock held
>          * to prevent a concurrent irq setup/teardown from seeing a
>          * half valid vector space.

That setup/teardown wording is misleading as that's already covered by
sparse_irq_lock.

This is purely x86 specific. Setting the CPU online and onlining the CPU
in the matrix allocator has to be atomic vs. concurrent allocations from
the matrix allocator, which can happen via request/free_irq() and
affinity changes independent of interrupt setup/teardown (e.g. MSI enable).

Thanks,

        tglx
David Woodhouse March 24, 2023, 3:33 p.m. UTC | #13
On Fri, 2023-03-24 at 14:57 +0100, Thomas Gleixner wrote:
> 
> Why? Simply because of this:
> 
>   BP                    AP              state
>   kick()                                BRINGUP_CPU
>                         startup()                      
>   sync()                sync() 
>                         starting()      advances to AP_ONLINE
>   sync()                sync()
>   TSC_sync()            TSC_sync()
>   wait_for_online()     set_online()
>                         cpu_startup_entry() AP_ONLINE_IDLE
>   wait_for_completion() complete()
> 
> This works correctly today because bringup_cpu() does not modify state
> and expects the state to be advanced by the AP once the completion is
> done.
> 
> So you _cannot_ just throw some magic dynamic states before BRINGUP_CPU
> and then expect that the state machine is consistent when the AP is
> allowed to run the starting callbacks in parallel.

Aha! I see.

Yes, when the AP calls notify_cpu_starting(), which x86 does from
smp_callin(), the AP takes *itself* forward through the states from
there.

That happens when the BP gets to do_wait_cpu_initialized(). So yes, the
actual code in the existing series of patches is entirely safe, but
you're right that we do only want that *one* additional state for
parallelising the "kick AP"  before CPUHP_BRINGUP_CPU. The rest need to
come afterwards and be handled differently.
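
For reference, the self-advance mentioned above looks roughly like this,
condensed from notify_cpu_starting() in kernel/cpu.c (bookkeeping and
some details elided):

	void notify_cpu_starting(unsigned int cpu)
	{
		struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
		enum cpuhp_state target = min((int)st->target, CPUHP_AP_ONLINE);

		rcu_cpu_starting(cpu);	/* Enables RCU usage on this CPU */
		while (st->state < target) {
			st->state++;
			/* STARTING callbacks must not fail */
			WARN_ON_ONCE(cpuhp_invoke_callback(cpu, st->state,
							   true, NULL, NULL));
		}
	}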
David Woodhouse March 27, 2023, 6:48 p.m. UTC | #14
On Fri, 2023-03-24 at 00:05 +0100, Thomas Gleixner wrote:
> 
> > There is no point in special casing this. All architectures can invoke
> > the CPUHP_BP_* states before CPUHP_BRINGUP_CPU for each to be brought up
> > CPU first. So this can be made unconditional and common exercised
> > code.
> 
> Bah. There is. We discussed that before. Architectures need to opt in to
> make sure that there are no implicit dependencies on the full
> serialization.
> 
> Still the rest can be simplified as below.

Ack.

Full series at https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/parallel-6.2-v17

From: David Woodhouse <dwmw@amazon.co.uk>
Subject: [PATCH 3/8] cpu/hotplug: Add CPUHP_BP_PARALLEL_STARTUP state before
 CPUHP_BRINGUP_CPU

There is often significant latency in the early stages of CPU bringup,
and time is wasted by waking each CPU (e.g. with INIT/SIPI/SIPI on x86)
and then waiting for it to make its way through hardware powerup and
through firmware before finally reaching the kernel entry point and
moving on through its startup.

Allow a platform to register a pre-bringup CPUHP state to which each
CPU can be stepped in parallel, thus absorbing some of that latency.

There is a subtlety here: even with an empty CPUHP_BP_PARALLEL_STARTUP
step, this means that *all* CPUs are brought through the prepare states
all the way to CPUHP_BP_PARALLEL_STARTUP before any of them are taken
to CPUHP_BRINGUP_CPU and then are allowed to run for themselves to
CPUHP_ONLINE.

So any combination of prepare/start calls which depend on A-B ordering
for each CPU in turn would explode horribly. As an example, the X2APIC
code prior to commit cefad862f238 ("x86/apic/x2apic: Allow CPU
cluster_mask to be populated in parallel") would allocate a new cluster
mask "just in case" and store it in a global variable in the prep stage,
then the AP would potentially consume that preallocated structure and set
the global pointer to NULL to be reallocated in CPUHP_X2APIC_PREPARE for
the next CPU. Which doesn't work at all if the prepare step is run for
all the CPUs first.

Any platform enabling the CPUHP_BP_PARALLEL_STARTUP step must be
reviewed and tested to ensure that such issues do not exist, and the
existing behaviour of bringing each AP through to CPUHP_BP_PREPARE_DYN
and then immediately to CPUHP_BRINGUP_CPU and CPUHP_ONLINE, only one at
a time, does not change unless such a state is registered.

Note that this does *not* yet bring each AP to the CPUHP_BRINGUP_CPU
state at the same time, only to the new CPUHP_BP_PARALLEL_STARTUP state.
The final loop in bringup_nonboot_cpus() remains the same, bringing each
AP in turn from the CPUHP_BP_PARALLEL_STARTUP (or all the way from
CPUHP_OFFLINE) to CPUHP_BRINGUP_CPU and then waiting for that AP to do
its own processing and reach CPUHP_ONLINE before releasing the next.

Parallelising that part by bringing them all to CPUHP_BRINGUP_CPU and
then waiting for them all to run to CPUHP_ONLINE at the same time is a
more complicated exercise for the future.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Kim Phillips <kim.phillips@amd.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com> [arm64]
---
 include/linux/cpuhotplug.h | 22 ++++++++++++++++++++++
 kernel/cpu.c               | 38 +++++++++++++++++++++++++++++++++++---
 2 files changed, 57 insertions(+), 3 deletions(-)

diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index c6fab004104a..84efd33ed3a3 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -133,6 +133,28 @@ enum cpuhp_state {
 	CPUHP_MIPS_SOC_PREPARE,
 	CPUHP_BP_PREPARE_DYN,
 	CPUHP_BP_PREPARE_DYN_END		= CPUHP_BP_PREPARE_DYN + 20,
+	/*
+	 * This is an optional state if the architecture supports parallel
+	 * startup. It's used to start bringing the CPU online (e.g. send
+	 * the startup IPI) so that the APs can run in parallel through
+	 * the low level startup code instead of waking them one by one in
+	 * CPUHP_BRINGUP_CPU. This avoids waiting for the AP to react and
+	 * shortens the serialized phase of the bringup.
+	 *
+	 * If the architecture registers this state, all APs will be taken
+	 * to it (and thus through all prior states) before any is taken
+	 * to the subsequent CPUHP_BRINGUP_CPU state.
+	 */
+	CPUHP_BP_PARALLEL_STARTUP,
+
+	/*
+	 * This step brings the AP online and takes it to the point where it
+	 * manages its own state from here on. For the time being, the rest
+	 * of the AP bringup is fully serialized despite running on the AP.
+	 * If the architecture doesn't use the CPUHP_BP_PARALLEL_STARTUP
+	 * state, this step also does all the work of bringing the CPU
+	 * online.
+	 */
 	CPUHP_BRINGUP_CPU,
 
 	/*
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 43e0a77f21e8..6be5b60db51b 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1504,13 +1504,45 @@ int bringup_hibernate_cpu(unsigned int sleep_cpu)
 
 void bringup_nonboot_cpus(unsigned int setup_max_cpus)
 {
-	unsigned int cpu;
+	unsigned int cpu, n = num_online_cpus();
 
+	/*
+	 * On architectures which have setup the CPUHP_BP_PARALLEL_STARTUP
+	 * state, this invokes all BP prepare states and the parallel
+	 * startup state sends the startup IPI to each of the to be onlined
+	 * APs. This avoids waiting for each AP to respond to the startup
+	 * IPI in CPUHP_BRINGUP_CPU. The APs proceed through the low level
+	 * bringup code and then wait for the control CPU to release them
+	 * one by one for the final onlining procedure in the loop below.
+	 *
+	 * For architectures which do not support parallel bringup all
+	 * states are fully serialized in the loop below.
+	 */
+	if (!cpuhp_step_empty(true, CPUHP_BP_PARALLEL_STARTUP)) {
+		for_each_present_cpu(cpu) {
+			if (n++ >= setup_max_cpus)
+				break;
+			cpu_up(cpu, CPUHP_BP_PARALLEL_STARTUP);
+		}
+	}
+
+	/* Do the per CPU serialized bringup to ONLINE state */
 	for_each_present_cpu(cpu) {
 		if (num_online_cpus() >= setup_max_cpus)
 			break;
-		if (!cpu_online(cpu))
-			cpu_up(cpu, CPUHP_ONLINE);
+
+		if (!cpu_online(cpu)) {
+			struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
+			int ret = cpu_up(cpu, CPUHP_ONLINE);
+
+			/*
+			 * Due to the above preparation loop a failed online attempt
+			 * might have only rolled back to CPUHP_BP_PARALLEL_STARTUP. Do the
+			 * remaining cleanups. NOOP for the non parallel case.
+			 */
+			if (ret && can_rollback_cpu(st))
+				WARN_ON(cpuhp_invoke_callback_range(false, cpu, st, CPUHP_OFFLINE));
+		}
 	}
 }

Patch

diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index c6fab004104a..ef3cf69a3d5b 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -133,6 +133,8 @@  enum cpuhp_state {
 	CPUHP_MIPS_SOC_PREPARE,
 	CPUHP_BP_PREPARE_DYN,
 	CPUHP_BP_PREPARE_DYN_END		= CPUHP_BP_PREPARE_DYN + 20,
+	CPUHP_BP_PARALLEL_DYN,
+	CPUHP_BP_PARALLEL_DYN_END		= CPUHP_BP_PARALLEL_DYN + 4,
 	CPUHP_BRINGUP_CPU,
 
 	/*
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 43e0a77f21e8..cf3c1c6f0710 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1504,13 +1504,49 @@  int bringup_hibernate_cpu(unsigned int sleep_cpu)
 
 void bringup_nonboot_cpus(unsigned int setup_max_cpus)
 {
+	unsigned int n = setup_max_cpus - num_online_cpus();
 	unsigned int cpu;
 
+	/*
+	 * An architecture may have registered parallel pre-bringup states to
+	 * which each CPU may be brought in parallel. For each such state,
+	 * bring N CPUs to it in turn before the final round of bringing them
+	 * online.
+	 */
+	if (n > 0) {
+		enum cpuhp_state st = CPUHP_BP_PARALLEL_DYN;
+
+		while (st <= CPUHP_BP_PARALLEL_DYN_END && cpuhp_hp_states[st].name) {
+			int i = n;
+
+			for_each_present_cpu(cpu) {
+				cpu_up(cpu, st);
+				if (!--i)
+					break;
+			}
+			st++;
+		}
+	}
+
 	for_each_present_cpu(cpu) {
 		if (num_online_cpus() >= setup_max_cpus)
 			break;
-		if (!cpu_online(cpu))
-			cpu_up(cpu, CPUHP_ONLINE);
+		if (!cpu_online(cpu)) {
+			int ret = cpu_up(cpu, CPUHP_ONLINE);
+
+			/*
+			 * For the parallel bringup case, roll all the way back
+			 * to CPUHP_OFFLINE on failure; don't leave them in the
+			 * parallel stages. This happens in the nosmt case for
+			 * non-primary threads.
+			 */
+			if (ret && cpuhp_hp_states[CPUHP_BP_PARALLEL_DYN].name) {
+				struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
+				if (can_rollback_cpu(st))
+					WARN_ON(cpuhp_invoke_callback_range(false, cpu, st,
+									    CPUHP_OFFLINE));
+			}
+		}
 	}
 }
 
@@ -1882,6 +1918,10 @@  static int cpuhp_reserve_state(enum cpuhp_state state)
 		step = cpuhp_hp_states + CPUHP_BP_PREPARE_DYN;
 		end = CPUHP_BP_PREPARE_DYN_END;
 		break;
+	case CPUHP_BP_PARALLEL_DYN:
+		step = cpuhp_hp_states + CPUHP_BP_PARALLEL_DYN;
+		end = CPUHP_BP_PARALLEL_DYN_END;
+		break;
 	default:
 		return -EINVAL;
 	}
@@ -1906,14 +1946,15 @@  static int cpuhp_store_callbacks(enum cpuhp_state state, const char *name,
 	/*
 	 * If name is NULL, then the state gets removed.
 	 *
-	 * CPUHP_AP_ONLINE_DYN and CPUHP_BP_PREPARE_DYN are handed out on
+	 * CPUHP_AP_ONLINE_DYN and CPUHP_BP_P*_DYN are handed out on
 	 * the first allocation from these dynamic ranges, so the removal
 	 * would trigger a new allocation and clear the wrong (already
 	 * empty) state, leaving the callbacks of the to be cleared state
 	 * dangling, which causes wreckage on the next hotplug operation.
 	 */
 	if (name && (state == CPUHP_AP_ONLINE_DYN ||
-		     state == CPUHP_BP_PREPARE_DYN)) {
+		     state == CPUHP_BP_PREPARE_DYN ||
+		     state == CPUHP_BP_PARALLEL_DYN)) {
 		ret = cpuhp_reserve_state(state);
 		if (ret < 0)
 			return ret;