Message ID | 20230321194008.785922-4-usama.arif@bytedance.com (mailing list archive) |
---|---|
State | New, archived |
Series | Parallel CPU bringup for x86_64 |
On Tue, Mar 21 2023 at 19:40, Usama Arif wrote: > void bringup_nonboot_cpus(unsigned int setup_max_cpus) > { > + unsigned int n = setup_max_cpus - num_online_cpus(); > unsigned int cpu; > > + /* > + * An architecture may have registered parallel pre-bringup states to > + * which each CPU may be brought in parallel. For each such state, > + * bring N CPUs to it in turn before the final round of bringing them > + * online. > + */ > + if (n > 0) { > + enum cpuhp_state st = CPUHP_BP_PARALLEL_DYN; > + > + while (st <= CPUHP_BP_PARALLEL_DYN_END && cpuhp_hp_states[st].name) { There is no point in special casing this. All architectures can invoke the CPUHP_BP_* states before CPUHP_BRINGUP_CPU for each to be brought up CPU first. So this can be made unconditional and common exercised code. Aside of that this dynamic state range is pretty pointless. There is really nothing dynamic here and there is no real good reason to have four dynamic parallel states just because. The only interesting thing after CPUHP_BP_PREPARE_DYN_END and before CPUHP_BP_BRINGUP is a state which kicks the AP into life, i.e. we can just hardcode that as CPUHP_BP_PARALLEL_STARTUP. Thanks, tglx --- --- a/include/linux/cpuhotplug.h +++ b/include/linux/cpuhotplug.h @@ -133,6 +133,20 @@ enum cpuhp_state { CPUHP_MIPS_SOC_PREPARE, CPUHP_BP_PREPARE_DYN, CPUHP_BP_PREPARE_DYN_END = CPUHP_BP_PREPARE_DYN + 20, + /* + * This is an optional state if the architecture supports parallel + * startup. It's used to send the startup IPI so that the APs can + * run in parallel through the low level startup code instead of + * sending the IPIs one by one in CPUHP_BRINGUP_CPU. This avoids + * waiting for the AP to react and shortens the serialized bringup. + */ + CPUHP_BP_PARALLEL_STARTUP, + + /* + * Fully per AP serialized bringup from here on. If the + * architecture does no register the CPUHP_BP_PARALLEL_STARTUP + * state, this step sends the startup IPI first. + */ CPUHP_BRINGUP_CPU, /* --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -1504,13 +1504,49 @@ int bringup_hibernate_cpu(unsigned int s void bringup_nonboot_cpus(unsigned int setup_max_cpus) { - unsigned int cpu; + unsigned int cpu, n = 1; + /* + * All CHUHP_BP* states are invoked on the control CPU (BP). There + * is no requirement that these states need to be invoked + * sequentially for a particular CPU. They can be invoked state by + * state for each to be onlined CPU. + * + * On architectures which have setup the CPUHP_BP_PARALLEL_STARTUP + * state, this sends the startup IPI to each of the to be onlined + * APs. This avoids waiting for each AP to respond to the startup + * IPI in CPUHP_BRINGUP_CPU. The APs proceed through the low level + * bringup code and then wait for the control CPU to release them + * one by one for the final onlining procedure in the loop below. + * + * For architectures which do not support parallel bringup the + * CPUHP_BP_PARALLEL_STARTUP state is a NOOP and the actual startup + * happens in the CPUHP_BRINGUP_CPU state which is fully serialized + * per CPU in the loop below. 
+ */ + for_each_present_cpu(cpu) { + if (n++ >= setup_max_cpus) + break; + cpu_up(cpu, CPUHP_BP_PARALLEL_STARTUP); + } + + /* Do the per CPU serialized bringup to ONLINE state */ for_each_present_cpu(cpu) { if (num_online_cpus() >= setup_max_cpus) break; - if (!cpu_online(cpu)) - cpu_up(cpu, CPUHP_ONLINE); + + if (!cpu_online(cpu)) { + struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu); + int ret = cpu_up(cpu, CPUHP_ONLINE); + + /* + * Due to the above preparation loop a failed online attempt + * has only rolled back to CPUHP_BP_PARALLEL_STARTUP. Do the + * remaining cleanups. + */ + if (ret && can_rollback_cpu(st)) + WARN_ON(cpuhp_invoke_callback_range(false, cpu, st, CPUHP_OFFLINE)); + } } }
On Thu, 2023-03-23 at 23:36 +0100, Thomas Gleixner wrote: > On Tue, Mar 21 2023 at 19:40, Usama Arif wrote: > > void bringup_nonboot_cpus(unsigned int setup_max_cpus) > > { > > + unsigned int n = setup_max_cpus - num_online_cpus(); > > unsigned int cpu; > > > > + /* > > + * An architecture may have registered parallel pre-bringup states to > > + * which each CPU may be brought in parallel. For each such state, > > + * bring N CPUs to it in turn before the final round of bringing them > > + * online. > > + */ > > + if (n > 0) { > > + enum cpuhp_state st = CPUHP_BP_PARALLEL_DYN; > > + > > + while (st <= CPUHP_BP_PARALLEL_DYN_END && cpuhp_hp_states[st].name) { > > > There is no point in special casing this. All architectures can invoke > the CPUHP_BP_* states before CPUHP_BRINGUP_CPU for each to be brought up > CPU first. So this can be made unconditional and common exercised code. > There were three paragraphs in the commit message explaining why I didn't want to do that. It didn't work for x86 before I started, and I haven't reviewed *every* other architecture to ensure that it will work there. It was opt-in for a reason. :) > Aside of that this dynamic state range is pretty pointless. There is > really nothing dynamic here and there is no real good reason to have > four dynamic parallel states just because. The original patch set did use more than one state; the plan to do microcode updates in parallel does involve doing at least one more, I believe. https://lore.kernel.org/all/eb6717dfc4ceb99803c0396f950db7c3231c75ef.camel@infradead.org/ > The only interesting thing after CPUHP_BP_PREPARE_DYN_END and before > CPUHP_BP_BRINGUP is a state which kicks the AP into life, i.e. we can > just hardcode that as CPUHP_BP_PARALLEL_STARTUP. > > Thanks, > > tglx > --- > --- a/include/linux/cpuhotplug.h > +++ b/include/linux/cpuhotplug.h > @@ -133,6 +133,20 @@ enum cpuhp_state { > CPUHP_MIPS_SOC_PREPARE, > CPUHP_BP_PREPARE_DYN, > CPUHP_BP_PREPARE_DYN_END = CPUHP_BP_PREPARE_DYN + 20, > + /* > + * This is an optional state if the architecture supports parallel > + * startup. It's used to send the startup IPI so that the APs can > + * run in parallel through the low level startup code instead of > + * sending the IPIs one by one in CPUHP_BRINGUP_CPU. This avoids > + * waiting for the AP to react and shortens the serialized bringup. > + */ > + CPUHP_BP_PARALLEL_STARTUP, > + > + /* > + * Fully per AP serialized bringup from here on. If the > + * architecture does no register the CPUHP_BP_PARALLEL_STARTUP > + * state, this step sends the startup IPI first. > + */ Not sure I'd conceded that yet; the APs do their *own* bringup from here, and that really ought to be able to run in parallel. > CPUHP_BRINGUP_CPU, > > /*
On Thu, Mar 23 2023 at 23:36, Thomas Gleixner wrote: > On Tue, Mar 21 2023 at 19:40, Usama Arif wrote: >> void bringup_nonboot_cpus(unsigned int setup_max_cpus) >> { >> + unsigned int n = setup_max_cpus - num_online_cpus(); >> unsigned int cpu; >> >> + /* >> + * An architecture may have registered parallel pre-bringup states to >> + * which each CPU may be brought in parallel. For each such state, >> + * bring N CPUs to it in turn before the final round of bringing them >> + * online. >> + */ >> + if (n > 0) { >> + enum cpuhp_state st = CPUHP_BP_PARALLEL_DYN; >> + >> + while (st <= CPUHP_BP_PARALLEL_DYN_END && cpuhp_hp_states[st].name) { > > > There is no point in special casing this. All architectures can invoke > the CPUHP_BP_* states before CPUHP_BRINGUP_CPU for each to be brought up > CPU first. So this can be made unconditional and common exercised > code. Bah. There is. We discussed that before. Architectures need to opt in to make sure that there are no implicit dependencies on the full serialization. Still the rest can be simplified as below. Thanks, tglx --- --- a/include/linux/cpuhotplug.h +++ b/include/linux/cpuhotplug.h @@ -133,6 +133,20 @@ enum cpuhp_state { CPUHP_MIPS_SOC_PREPARE, CPUHP_BP_PREPARE_DYN, CPUHP_BP_PREPARE_DYN_END = CPUHP_BP_PREPARE_DYN + 20, + /* + * This is an optional state if the architecture supports parallel + * startup. It's used to send the startup IPI so that the APs can + * run in parallel through the low level startup code instead of + * sending the IPIs one by one in CPUHP_BRINGUP_CPU. This avoids + * waiting for the AP to react and shortens the serialized bringup. + */ + CPUHP_BP_PARALLEL_STARTUP, + + /* + * Fully per AP serialized bringup from here on. If the + * architecture does no register the CPUHP_BP_PARALLEL_STARTUP + * state, this step sends the startup IPI first. + */ CPUHP_BRINGUP_CPU, /* --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -1504,13 +1504,45 @@ int bringup_hibernate_cpu(unsigned int s void bringup_nonboot_cpus(unsigned int setup_max_cpus) { - unsigned int cpu; + unsigned int cpu, n = 1; + /* + * On architectures which have setup the CPUHP_BP_PARALLEL_STARTUP + * state, this invokes all BP prepare states and the parallel + * startup state sends the startup IPI to each of the to be onlined + * APs. This avoids waiting for each AP to respond to the startup + * IPI in CPUHP_BRINGUP_CPU. The APs proceed through the low level + * bringup code and then wait for the control CPU to release them + * one by one for the final onlining procedure in the loop below. + * + * For architectures which do not support parallel bringup all + * states are fully serialized in the loop below. + */ + if (!cpuhp_step_empty(true, CPUHP_BP_PARALLEL_STARTUP) { + for_each_present_cpu(cpu) { + if (n++ >= setup_max_cpus) + break; + cpu_up(cpu, CPUHP_BP_PARALLEL_STARTUP); + } + } + + /* Do the per CPU serialized bringup to ONLINE state */ for_each_present_cpu(cpu) { if (num_online_cpus() >= setup_max_cpus) break; - if (!cpu_online(cpu)) - cpu_up(cpu, CPUHP_ONLINE); + + if (!cpu_online(cpu)) { + struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu); + int ret = cpu_up(cpu, CPUHP_ONLINE); + + /* + * Due to the above preparation loop a failed online attempt + * might have only rolled back to CPUHP_BP_PARALLEL_STARTUP. Do the + * remaining cleanups. NOOP for the non parallel case. + */ + if (ret && can_rollback_cpu(st)) + WARN_ON(cpuhp_invoke_callback_range(false, cpu, st, CPUHP_OFFLINE)); + } } }
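For readers following the thread: the "architectures need to opt in" part boils down to an architecture installing a startup callback at the proposed CPUHP_BP_PARALLEL_STARTUP state, which is what makes the cpuhp_step_empty() check above evaluate to false. The sketch below is illustrative only; arch_kick_ap_parallel(), arch_send_startup_ipi() and the state name string are hypothetical and are not taken from the actual x86 patches.

#include <linux/cpuhotplug.h>
#include <linux/init.h>

/*
 * Hypothetical arch hook: send the startup IPI (INIT/SIPI or the platform
 * equivalent) and return without waiting for the AP to finish its low
 * level startup.
 */
static int arch_kick_ap_parallel(unsigned int cpu)
{
	return arch_send_startup_ipi(cpu);	/* hypothetical helper */
}

static int __init arch_register_parallel_startup(void)
{
	/*
	 * Registering a non-empty step here is the opt-in: once it exists,
	 * bringup_nonboot_cpus() runs the parallel pre-bringup loop before
	 * the serialized CPUHP_BRINGUP_CPU phase.
	 */
	return cpuhp_setup_state_nocalls(CPUHP_BP_PARALLEL_STARTUP,
					 "arch/cpu:parallel-kick",
					 arch_kick_ap_parallel, NULL);
}
early_initcall(arch_register_parallel_startup);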
On Fri, 2023-03-24 at 00:05 +0100, Thomas Gleixner wrote: > Still the rest can be simplified as below. ... > --- a/kernel/cpu.c > +++ b/kernel/cpu.c > @@ -1504,13 +1504,45 @@ int bringup_hibernate_cpu(unsigned int s > > void bringup_nonboot_cpus(unsigned int setup_max_cpus) > { > - unsigned int cpu; > + unsigned int cpu, n = 1; > > + /* > + * On architectures which have setup the CPUHP_BP_PARALLEL_STARTUP > + * state, this invokes all BP prepare states and the parallel > + * startup state sends the startup IPI to each of the to be onlined > + * APs. This avoids waiting for each AP to respond to the startup > + * IPI in CPUHP_BRINGUP_CPU. The APs proceed through the low level > + * bringup code and then wait for the control CPU to release them > + * one by one for the final onlining procedure in the loop below. > + * > + * For architectures which do not support parallel bringup all > + * states are fully serialized in the loop below. > + */ > + if (!cpuhp_step_empty(true, CPUHP_BP_PARALLEL_STARTUP) { I'll take using cpuhp_step_empty(). > + for_each_present_cpu(cpu) { > + if (n++ >= setup_max_cpus) > + break; > + cpu_up(cpu, CPUHP_BP_PARALLEL_STARTUP); > + } > + } > + > + /* Do the per CPU serialized bringup to ONLINE state */ > for_each_present_cpu(cpu) { > if (num_online_cpus() >= setup_max_cpus) > break; > - if (!cpu_online(cpu)) > - cpu_up(cpu, CPUHP_ONLINE); > + > + if (!cpu_online(cpu)) { > + struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu); > + int ret = cpu_up(cpu, CPUHP_ONLINE); > + > + /* > + * Due to the above preparation loop a failed online attempt > + * might have only rolled back to CPUHP_BP_PARALLEL_STARTUP. Do the > + * remaining cleanups. NOOP for the non parallel case. > + */ > + if (ret && can_rollback_cpu(st)) > + WARN_ON(cpuhp_invoke_callback_range(false, cpu, st, CPUHP_OFFLINE)); > + } And I'll take doing this bit unconditionally (it's basically a no-op if they already got rolled all the way back to CPUHP_OFFLINE, right?). But the additional complexity of having multiple steps is fairly minimal, and I'm already planning to *use* another one even in x86, as discussed.
On Thu, Mar 23 2023 at 22:49, David Woodhouse wrote: > On Thu, 2023-03-23 at 23:36 +0100, Thomas Gleixner wrote: >> There is no point in special casing this. All architectures can invoke >> the CPUHP_BP_* states before CPUHP_BRINGUP_CPU for each to be brought up >> CPU first. So this can be made unconditional and common exercised code. >> > > There were three paragraphs in the commit message explaining why I > didn't want to do that. It didn't work for x86 before I started, and I > haven't reviewed *every* other architecture to ensure that it will work > there. It was opt-in for a reason. :) I noticed myself before reading your reply :) >> Aside of that this dynamic state range is pretty pointless. There is >> really nothing dynamic here and there is no real good reason to have >> four dynamic parallel states just because. > > The original patch set did use more than one state; the plan to do > microcode updates in parallel does involve doing at least one more, I > believe. I don't think so. The micro code muck can completely serialize itself and does not need control CPU assistance if done right. If we go there we have to fix that mess and not just proliferatng it with moar duct tape. >> + /* >> + * Fully per AP serialized bringup from here on. If the >> + * architecture does no register the CPUHP_BP_PARALLEL_STARTUP >> + * state, this step sends the startup IPI first. >> + */ > > Not sure I'd conceded that yet; the APs do their *own* bringup from > here, and that really ought to be able to run in parallel. Somewhere in the distance future once we've 1) sorted the mandatory synchronization points, e.g. TSC sync in the early bootup phase. 2) audited _all_ AP state callbacks that they are able to cope with parallel bringup. That's a long road as there are tons of assumptions about the implicit CPU hotplug serialization in those callbacks. Just because it did not explode in your face yet does not mean this is safe. I just looked at 10 randomly picked AP online callbacks and found 5 of them being not ready :) Thanks, tglx
On Fri, 2023-03-24 at 00:48 +0100, Thomas Gleixner wrote: > On Thu, Mar 23 2023 at 22:49, David Woodhouse wrote: > > On Thu, 2023-03-23 at 23:36 +0100, Thomas Gleixner wrote: > > > There is no point in special casing this. All architectures can invoke > > > the CPUHP_BP_* states before CPUHP_BRINGUP_CPU for each to be brought up > > > CPU first. So this can be made unconditional and common exercised code. > > > > > > > There were three paragraphs in the commit message explaining why I > > didn't want to do that. It didn't work for x86 before I started, and I > > haven't reviewed *every* other architecture to ensure that it will work > > there. It was opt-in for a reason. :) > > I noticed myself before reading your reply :) > > > > Aside of that this dynamic state range is pretty pointless. There is > > > really nothing dynamic here and there is no real good reason to have > > > four dynamic parallel states just because. > > > > The original patch set did use more than one state; the plan to do > > microcode updates in parallel does involve doing at least one more, I > > believe. > > I don't think so. The micro code muck can completely serialize itself > and does not need control CPU assistance if done right. If we go there > we have to fix that mess and not just proliferatng it with moar duct tape. > > > > + /* > > > + * Fully per AP serialized bringup from here on. If the > > > + * architecture does no register the CPUHP_BP_PARALLEL_STARTUP > > > + * state, this step sends the startup IPI first. > > > + */ > > > > Not sure I'd conceded that yet; the APs do their *own* bringup from > > here, and that really ought to be able to run in parallel. > > Somewhere in the distance future once we've > > 1) sorted the mandatory synchronization points, e.g. TSC sync in the > early bootup phase. That's why we have four of them... :) > 2) audited _all_ AP state callbacks that they are able to cope with > parallel bringup. > > That's a long road as there are tons of assumptions about the > implicit CPU hotplug serialization in those callbacks. > > Just because it did not explode in your face yet does not mean this > is safe. > > I just looked at 10 randomly picked AP online callbacks and found 5 > of them being not ready :) Oh, it's totally hosed, absolutely. I don't think even my most ambitious hacks had even tried it yet. But I want to, so I wasn't about to add a comment saying the opposite. But it's fine; removing the comment is the *least* of the work to be done in making that bit actually work in parallel :)
On Thu, Mar 23 2023 at 23:12, David Woodhouse wrote:
> On Fri, 2023-03-24 at 00:05 +0100, Thomas Gleixner wrote:
>> +			if (ret && can_rollback_cpu(st))
>> +				WARN_ON(cpuhp_invoke_callback_range(false, cpu, st, CPUHP_OFFLINE));
>> +		}
>
> And I'll take doing this bit unconditionally (it's basically a no-op if
> they already got rolled all the way back to CPUHP_OFFLINE, right?).
>
> But the additional complexity of having multiple steps is fairly
> minimal, and I'm already planning to *use* another one even in x86, as
> discussed.

It's not about the "complexity". That's a general design question and I'm
not agreeing with your approach of putting AP specifics into the BP state
space.

The BP only phase ends at the point where the AP is woken up via
SIPI/INIT/whatever. Period.

And no, we are not making this special just because it's the easiest way
to get it done. I have _zero_ interest in this kind of hackery which just
slaps stuff into the code where it's convenient without even thinking
about proper separations.

We went to great lengths to separate things clearly and it takes more than
"oh let's reserve a few special states" to keep this separation intact.
That's a matter of correctness, maintainability and taste.

That microcode thing on X86 has absolutely no business in the pre bringup
DYN states. It's an AP problem and it can be solved completely without BP
interaction.

And before you start drooling over further great parallelism, can you
please take a step back and come up with a sensible design for the actual
real world requirements?

The point is that after an AP comes out of lala land and starts executing
there are mandatory synchronization points which need to be handled by
design. The number of synchronization points is architecture and platform
specific and might be 0, but that's a detail.

So the proper thing is to split CPUHP_BRINGUP_CPU, which is the bridging
state between AP and BP today, into a set of synchronization points
between BP and AP.

But that's non-trivial because if you look at bringup_cpu() then you'll
notice that this state has an implicit protection against interrupt
allocation/free and quite some architectures rely on this for their early
bringup code and possibly their STARTING state callbacks.

That aside. Let's just look at x86 as of today from the BP side:

  1) Wakeup AP
  2) Wait until there is sign of life
  3) Let AP proceed
  5) Wait until AP is done with init
  6) TSC synchronization
  7) Wait until it is online

and on the AP side:

  1) Do low level init
  2) Report life
  3) Wait until BP allows to proceed
  4) Do init
  5) Report done
  6) TSC synchronization
  7) Report online

So surely you could claim that up to #6 (TSC sync) nothing needs to
synchronize.

But that's simply not true because topology information is implicitly
serialized by CPU hotplug and your attempt to serialize that in patch 7/8
is not cutting it at all because AP #4 (STARTING) callbacks rely
implicitly on the immutability of the topology information and so do some
of the later (threaded) AP callbacks too.

As I said before 5 out of 10 callbacks I looked at are not ready for this.

Just because it did not explode in your face yet, does not make any of
your wet dreams more correct.

Again: I'm not interested in this kind of "works for me" nonsense at all.
I wasted enough time already to make CPU hotplug reliable and
understandable. I have no intention to waste more time on it just because.

Thanks,

        tglx
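To make the synchronization points enumerated above concrete, here is a minimal sketch of a single BP<->AP handshake ("report life" / "let AP proceed") built on a per-CPU atomic state. All of the names here (enum ap_sync, cpuhp_ap_state and both functions) are invented for illustration; they are not the mechanism the kernel actually used at the time, which was the set of architecture-specific cpumasks and completions discussed later in the thread.

#include <linux/atomic.h>
#include <linux/percpu.h>
#include <linux/processor.h>

/* Illustrative handshake states; not taken from the kernel. */
enum ap_sync { AP_SYNC_DEAD, AP_SYNC_ALIVE, AP_SYNC_RELEASED };

static DEFINE_PER_CPU(atomic_t, cpuhp_ap_state);

/* AP side: report a sign of life, then spin until the BP releases it. */
static void ap_report_alive_and_wait(unsigned int cpu)
{
	atomic_t *st = per_cpu_ptr(&cpuhp_ap_state, cpu);

	atomic_set_release(st, AP_SYNC_ALIVE);
	while (atomic_read_acquire(st) != AP_SYNC_RELEASED)
		cpu_relax();
}

/* BP side: wait for the sign of life, then let the AP proceed. */
static void bp_wait_for_alive_and_release(unsigned int cpu)
{
	atomic_t *st = per_cpu_ptr(&cpuhp_ap_state, cpu);

	while (atomic_read_acquire(st) != AP_SYNC_ALIVE)
		cpu_relax();
	atomic_set_release(st, AP_SYNC_RELEASED);
}

A shared core-code mechanism along these lines is what the later "consolidate the zoo of synchronization mechanisms" argument in this thread is asking for.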
On Fri, 2023-03-24 at 02:16 +0100, Thomas Gleixner wrote: > On Thu, Mar 23 2023 at 23:12, David Woodhouse wrote: > > On Fri, 2023-03-24 at 00:05 +0100, Thomas Gleixner wrote: > > > + if (ret && can_rollback_cpu(st)) > > > + WARN_ON(cpuhp_invoke_callback_range(false, cpu, st, CPUHP_OFFLINE)); > > > + } > > > > And I'll take doing this bit unconditionally (it's basically a no-op if > > they already got rolled all the way back to CPUHP_OFFLINE, right?). > > > > But the additional complexity of having multiple steps is fairly > > minimal, and I'm already planning to *use* another one even in x86, as > > discussed. > > It's not about the "complexity". That's a general design question and > I'm not agreeing with your approach of putting AP specifics into the BP > state space. > > The BP only phase ends at the point where the AP is woken up via > SIPI/INIT/whatever. Period. > > And no, we are not making this special just because it's the easiest way > to get it done. I have _zero_ interest in this kind of hackery which > just slaps stuff into the code where its conveniant without even > thinking about proper separations > > We went a great length to separate things clearly and it takes more than > "oh let's reserve a few special states" to keep this separation > intact. That's a matter of correctness, maintainability and taste. > > That microcode thing on X86 has absolutely no business in the pre > bringup DYN states. It's an AP problem and it can be solved completely > without BP interaction. As long as the BP doesn't bring any more siblings online during the update, sure. > And before you start drooling over further great parallelism, can you > please take a step back and come up with a sensible design for the > actual real world requirments? > > The point is that after an AP comes out of lala land and starts > executing there are mandatory synchronization points which need to be > handled by design. The number of synchronization points is architecture > and platform specific and might be 0, but thats a detail. > > So the proper thing is to split CPUHP_BRINGUP_CPU, which is the bridging > state between AP and BP today, into a set of synchronization points > between BP and AP. I feel we're talking past each other a little bit. Because I'd have said that's *exactly* what this patch set is doing. The existing x86 CPUHP_BRINGUP_CPU step in native_cpu_up() has a bunch of back-and-forth between BP and AP, which you've enumerated below and which I cleaned up and commented in the 'Split up native_cpu_up into separate phases and document them' patch. This patch set literally introduces the new PARALLEL_DYN states to be "a set of synchronization points between BP and AP", and then makes the x86 code utilise that for the *first* of its back-and-forth exchanges between BP and AP. The patch to do the second stage and truly make it a 'set' on x86 is waiting in the wings, precisely because it *happens* not to catch fire yet, but hasn't been properly audited yet. But it seemed to me that you were saying you want us to limit architectures to just *one* additional step (CPUHP_BP_PARALLEL_STARTUP) instead of the 'set' that I'd added with the PARALLEL_DYN states? Did I misunderstand? > But that's non-trivial because if you look at bringup_cpu() then you'll > notice that this state has an implicit protection against interrupt > allocation/free and quite some architectures rely on this for their > early bringup code and possibly their STARTING state callbacks. 
I literally pointed that out in 2021 (in one of the messages I reference below). For x86 I don't believe that's going to be an issue yet, if at all. I think it matters for the lapic_online() call which happens near the end of start_secondary(); it's almost the last thing it does before going into the generic AP startup thread. It's *also* preceded by a comment that at least *suggests* it ought to be fine anyway, although we should never entirely trust such comments. /* * Lock vector_lock, set CPU online and bring the vector * allocator online. Online must be set with vector_lock held * to prevent a concurrent irq setup/teardown from seeing a * half valid vector space. */ lock_vector_lock(); /* Sync point with do_wait_cpu_online() */ set_cpu_online(smp_processor_id(), true); lapic_online(); We're a long way from parallelizing *that* one and needing to check the veracity of the comment though. > That aside. Let's just look at x86 as of today from the BP side: > > 1) Wakeup AP > 2) Wait until there is sign of life > 3) Let AP proceed > 5) Wait until AP is done with init > 6) TSC synchronization > 7) Wait until it is online > > and on the AP side: > > 1) Do low level init > 2) Report life > 3) Wait until BP allows to proceed > 4) Do init > 5) Report done > 6) TSC synchronization > 7) Report online > > So surely you could claim that up to #6 (TSC sync) nothing needs to > synchronize. I don't think we could claim that at all. There are a whole bunch of implicit dependencies on the fact that this happens in series, and at any given point in the sequence we might know that *either* the BP is free to act, *or* precisely one of the APs is free. For example: > But that's simply not true because topology information is implicitely > serialized by CPU hotplug and your attempt to serialize that in patch > 7/8 is not cutting it at all because AP #4 (STARTING) callbacks rely > implicitely on the immutability of the topology information Right, and that's just fine. That patch explicitly calls out that it is preparation for *future* parallelism. Which is why it's *last* in the series (well, apart from the SEV-ES bit, which I wanted to be even *more* last-in-the-series than that). If it was needed for the "phase 1" parallelism of INIT/SIPI/SIPI that gets enabled in patch #6, then it would have come before that in the series. It is necessary — but not sufficient, as you correctly point out — for the *next* stages of parallelism. > and so do some of the later (threaded) AP callbacks too. > > As I said before 5 out of 10 callbacks I looked at are not ready for > this. > > Just because it did not explode in your face yet, does not make any of > your wet dreams more correct. Now you're looking even further into the future. We're not trying to make the AP startup threads run in parallel yet. > Again: I'm not interested in this kind of "works for me" nonsense at > all. You keep saying that. It's starting to become offensive. Yes, the first RFC posting of this approach was quite explicitly billed as 'proof-of-concept hack' and 'don't look too hard because it's only really for figuring out the timing'. 
I make no apology for that: https://lore.kernel.org/lkml/8ac72d7b287ed1058b2dec3301578238aff0abdd.camel@infradead.org/ And if you look back at the first actual series that was posted, it's also very clear about which bit is actually ready because all the testing and thinking has been done, and which isn't: https://lore.kernel.org/lkml/478c56af540eaa47b8f452e51cb0c085f26db738.camel@infradead.org/ So please, stop giving me this "\"works for me\" nonsense" nonsense. Yes, there are some patches on top of this lot that aren't ready yet; we've fixed up *some* of the things that actually broke, and only done a cursory pass of actually inspecting it to make sure it's safe. And some of your observations above are very relevant and helpful to when we come to do *that* part of the thinking. But we aren't asking you to merge those parts yet. We have deliberately cut it back to the initial stage because that's where the biggest win is, and there's *enough* thinking to be done just for this part. We will review your comments and be happy to engage in further robust discussion about the future patches when we get round to that (if indeed we conclude that there's a significant win to be had from them). In the meantime, if there are specific things which are wrong with the parallelism introduced by *this* patch series that you're trying to point out to us, then I apologise; I've missed them.
On Fri, Mar 24 2023 at 02:16, Thomas Gleixner wrote:
> On Thu, Mar 23 2023 at 23:12, David Woodhouse wrote:
> So surely you could claim that up to #6 (TSC sync) nothing needs to
> synchronize.
>
> But that's simply not true because topology information is implicitly
> serialized by CPU hotplug and your attempt to serialize that in patch
> 7/8 is not cutting it at all because AP #4 (STARTING) callbacks rely
> implicitly on the immutability of the topology information and so do
> some of the later (threaded) AP callbacks too.

It's even worse. Any topology relevant change _must_ be serialized by
holding cpu_hotplug_lock write locked. So this needs fundamentally more
thought than just slapping a few misnomed states into the picture and
pretending that everything else just works.

Thanks,

        tglx
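In code terms, "cpu_hotplug_lock write locked" means any section mutating topology-relevant state would have to run between cpus_write_lock() and cpus_write_unlock(), the same write side of the hotplug lock that the hotplug core holds across an onlining operation. A minimal sketch follows; the update helper is a hypothetical placeholder, not a real kernel function:

#include <linux/cpu.h>

static void update_topology_serialized(unsigned int cpu)
{
	cpus_write_lock();
	/* Hypothetical placeholder for the actual topology update. */
	arch_update_topology_for_cpu(cpu);
	cpus_write_unlock();
}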
On Fri, 2023-03-24 at 10:46 +0100, Thomas Gleixner wrote:
>
> It's even worse. Any topology relevant change _must_ be serialized by
> holding cpu_hotplug_lock write locked. So this needs fundamentally more
> thought

Yes. Yes, it does. Which is why it isn't being done in parallel in this
patch series.

The topology update happens from smp_callin(), which happens from
do_wait_cpu_initialized(), which is still in the CPUHP_BRINGUP_CPU stage
for now, being done one at a time.

When we come to look at the next stage, doing do_wait_cpu_initialized()
in parallel for multiple APs too, we will be happy to hear specifics of
why this "must" be serialized by the cpu_hotplug_lock and no other. It
will be great to check that we haven't missed anything.

But we aren't there yet.
On Fri, Mar 24 2023 at 09:31, David Woodhouse wrote:
> On Fri, 2023-03-24 at 02:16 +0100, Thomas Gleixner wrote:
>> So the proper thing is to split CPUHP_BRINGUP_CPU, which is the bridging
>> state between AP and BP today, into a set of synchronization points
>> between BP and AP.
>
> I feel we're talking past each other a little bit. Because I'd have
> said that's *exactly* what this patch set is doing.
>
> The existing x86 CPUHP_BRINGUP_CPU step in native_cpu_up() has a bunch
> of back-and-forth between BP and AP, which you've enumerated below and
> which I cleaned up and commented in the 'Split up native_cpu_up into
> separate phases and document them' patch.
>
> This patch set literally introduces the new PARALLEL_DYN states to be
> "a set of synchronization points between BP and AP", and then makes the
> x86 code utilise that for the *first* of its back-and-forth exchanges
> between BP and AP.

It provides a dynamic space which is absolutely unspecified and just
opens the door for tinkering. This first step does not even contain a
synchronization point. All it does is to kick the AP into gear. Not more,
not less. Naming this obscurely as PARALLEL_DYN is a tasteless bandaid.

If you go and look at all __cpu_up() implementations, then you'll notice
that these are very similar. All of them do:

  1) Kick AP
  2) Synchronize at least once between BP and AP

There is nothing x86 specific about this. So instead of hiding this
behind a misnomed dynamic space, the obviously correct solution is to
make this an explicit mechanism in the core code and convert _all_
architectures over to this scheme.

That's in the first place completely independent of parallel bringup. It
has a value on its own as it consolidates the zoo of synchronization
mechanisms (cpumasks, completions, random variables) into one shared
mechanism in the core code. That's very much different from what your
patch is doing.

And there is a very good reason aside of consolidation to do so: This
prepares to handle the parallelism requirements in the core code instead
of letting each architecture come up with its own set of magic. Which in
turn is a prerequisite for allowing the STARTING callbacks or later the
threaded AP states to execute in parallel.

Why? Simply because of this:

    BP                        AP                      state
    kick()                                            BRINGUP_CPU
                              startup()
    sync()                    sync()
                              starting()              advances to AP_ONLINE
    sync()                    sync()
    TSC_sync()                TSC_sync()
    wait_for_online()         set_online()
                              cpu_startup_entry()     AP_ONLINE_IDLE
    wait_for_completion()     complete()

This works correctly today because bringup_cpu() does not modify state
and expects the state to be advanced by the AP once the completion is
done.

So you _cannot_ just throw some magic dynamic states before BRINGUP_CPU
and then expect that the state machine is consistent when the AP is
allowed to run the starting callbacks in parallel. The sync point after
the starting callbacks simply cannot be in that dynamic state space. It
has to be _after_ the AP starting states.

That needs a fundamental change in the way the state management at this
point works and this needs to be done upfront. Aside of the general
serialization aspects this needs some deep thought whether the BP control
can stay single threaded or if it's required to spawn a control thread
per AP.

The kick CPU into life state is completely independent of the above and
can be moved just before BRINGUP_CPU without violating anything. But
that's where the easy to solve part stops hard.
You might find my wording offensive, but I perceive your "let's use a few dynamic states and see what sticks" approach offensive too. The analysis of all this is not rocket science and could have been done long ago by yourself. Thanks, tglx
On Fri, Mar 24 2023 at 09:31, David Woodhouse wrote: > On Fri, 2023-03-24 at 02:16 +0100, Thomas Gleixner wrote: >> But that's non-trivial because if you look at bringup_cpu() then you'll >> notice that this state has an implicit protection against interrupt >> allocation/free and quite some architectures rely on this for their >> early bringup code and possibly their STARTING state callbacks. > > I literally pointed that out in 2021 (in one of the messages I > reference below). > > For x86 I don't believe that's going to be an issue yet, if at all. I > think it matters for the lapic_online() call which happens near the end > of start_secondary(); it's almost the last thing it does before going > into the generic AP startup thread. It's *also* preceded by a comment > that at least *suggests* it ought to be fine anyway, although we should > never entirely trust such comments. > > /* > * Lock vector_lock, set CPU online and bring the vector > * allocator online. Online must be set with vector_lock held > * to prevent a concurrent irq setup/teardown from seeing a > * half valid vector space. That setup/teardown wording is misleading as that's already covered by sparse_irq_lock. This is purely x86 specific. Setting the CPU online and onlining the CPU in the matrix allocator has to be atomic vs. concurrent allocations from the matrix allocator, which can happen via request/free_irq() and affinity changes independent of interrupt setup/teardown (e.g. MSI enable). Thanks, tglx
On Fri, 2023-03-24 at 14:57 +0100, Thomas Gleixner wrote:
>
> Why? Simply because of this:
>
>     BP                        AP                      state
>     kick()                                            BRINGUP_CPU
>                               startup()
>     sync()                    sync()
>                               starting()              advances to AP_ONLINE
>     sync()                    sync()
>     TSC_sync()                TSC_sync()
>     wait_for_online()         set_online()
>                               cpu_startup_entry()     AP_ONLINE_IDLE
>     wait_for_completion()     complete()
>
> This works correctly today because bringup_cpu() does not modify state
> and expects the state to be advanced by the AP once the completion is
> done.
>
> So you _cannot_ just throw some magic dynamic states before BRINGUP_CPU
> and then expect that the state machine is consistent when the AP is
> allowed to run the starting callbacks in parallel.

Aha! I see. Yes, when the AP calls notify_cpu_starting(), which x86 does
from smp_callin(), the AP takes *itself* forward through the states from
there. That happens when the BP gets to do_wait_cpu_initialized().

So yes, the actual code in the existing series of patches is entirely
safe, but you're right that we do only want that *one* additional state
for parallelising the "kick AP" before CPUHP_BRINGUP_CPU. The rest need
to come afterwards and be handled differently.
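The mechanism being referred to, the AP advancing its own hotplug state, is notify_cpu_starting() in kernel/cpu.c. Abridged and paraphrased below (error handling and the booted-once bookkeeping omitted), it walks the per-CPU state up to CPUHP_AP_ONLINE, running each STARTING callback on the AP itself:

/* Paraphrased from kernel/cpu.c around v6.2; not a verbatim copy. */
void notify_cpu_starting(unsigned int cpu)
{
	struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
	enum cpuhp_state target = min((int)st->target, CPUHP_AP_ONLINE);

	rcu_cpu_starting(cpu);		/* Enables RCU usage on this CPU. */
	while (st->state < target) {
		st->state++;
		/* Runs the STARTING callbacks on the AP, one state at a time. */
		cpuhp_invoke_callback(cpu, st->state, true, NULL, NULL);
	}
}

This is why, by the time the control CPU gets past CPUHP_BRINGUP_CPU for a given CPU, that AP has already taken itself through the STARTING states.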
On Fri, 2023-03-24 at 00:05 +0100, Thomas Gleixner wrote: > > > There is no point in special casing this. All architectures can invoke > > the CPUHP_BP_* states before CPUHP_BRINGUP_CPU for each to be brought up > > CPU first. So this can be made unconditional and common exercised > > code. > > Bah. There is. We discussed that before. Architectures need to opt in to > make sure that there are no implicit dependencies on the full > serialization. > > Still the rest can be simplified as below. Ack. Full series at https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/parallel-6.2-v17 From: David Woodhouse <dwmw@amazon.co.uk> Subject: [PATCH 3/8] cpu/hotplug: Add CPUHP_BP_PARALLEL_STARTUP state before CPUHP_BRINGUP_CPU There is often significant latency in the early stages of CPU bringup, and time is wasted by waking each CPU (e.g. with SIPI/INIT/INIT on x86) and then waiting for it to make its way through hardware powerup and through firmware before finally reaching the kernel entry point and moving on through its startup. Allow a platform to register a pre-bringup CPUHP state to which each CPU can be stepped in parallel, thus absorbing some of that latency. There is a subtlety here: even with an empty CPUHP_BP_PARALLEL_STARTUP step, this means that *all* CPUs are brought through the prepare states all the way to CPUHP_BP_PARALLEL_STARTUP before any of them are taken to CPUHP_BRINGUP_CPU and then are allowed to run for themselves to CPUHP_ONLINE. So any combination of prepare/start calls which depend on A-B ordering for each CPU in turn would explore horribly. As an example, the X2APIC code prior to commit cefad862f238 ("x86/apic/x2apic: Allow CPU cluster_mask to be populated in parallel") would allocate a new cluster mask "just in case" and store it in a global variable in the prep stage, then the AP would potentially consume that preallocated structure and set the global pointer to NULL to be reallocated in CPUHP_X2APIC_PREPARE for the next CPU. Which doesn't work at all if the prepare step is run for all the CPUs first. Any platform enabling the CPUHP_BP_PARALLEL_STARTUP step must be reviewed and tested to ensure that such issues do not exist, and the existing behaviour of each AP through to CPUHP_BP_PREPARE_DYN and then immediately to CPUHP_BRINGUP_CPU and CPUHP_ONLINE only one at a time does not change unless such a state is registered. Note that this does *not* yet bring each AP to the CPUHP_BRINGUP_CPU state at the same time, only to the new CPUHP_BP_PARALLEL_STARTUP state. The final loop in bringup_nonboot_cpus() remains the same, bringing each AP in turn from the CPUHP_BP_PARALLEL_STARTUP (or all the way from CPUHP_OFFLINE) to CPUHP_BRINGUP_CPU and then waiting for that AP to do its own processing and reach CPUHP_ONLINE before releasing the next. Parallelising that part by bringing them all to CPUHP_BRINGUP_CPU and then waiting for them all to run to CPUHP_ONLINE at the same time is a more complicated exercise for the future. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Usama Arif <usama.arif@bytedance.com> Tested-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Kim Phillips <kim.phillips@amd.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Guilherme G. 
Piccoli <gpiccoli@igalia.com> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> [arm64] --- include/linux/cpuhotplug.h | 22 ++++++++++++++++++++++ kernel/cpu.c | 38 +++++++++++++++++++++++++++++++++++--- 2 files changed, 57 insertions(+), 3 deletions(-) diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h index c6fab004104a..84efd33ed3a3 100644 --- a/include/linux/cpuhotplug.h +++ b/include/linux/cpuhotplug.h @@ -133,6 +133,28 @@ enum cpuhp_state { CPUHP_MIPS_SOC_PREPARE, CPUHP_BP_PREPARE_DYN, CPUHP_BP_PREPARE_DYN_END = CPUHP_BP_PREPARE_DYN + 20, + /* + * This is an optional state if the architecture supports parallel + * startup. It's used to start bringing the CPU online (e.g. send + * the startup IPI) so that the APs can run in parallel through + * the low level startup code instead of waking them one by one in + * CPUHP_BRINGUP_CPU. This avoids waiting for the AP to react and + * shortens the serialized phase of the bringup. + * + * If the architecture registers this state, all APs will be taken + * to it (and thus through all prior states) before any is taken + * to the subsequent CPUHP_BRINGUP_CPU state. + */ + CPUHP_BP_PARALLEL_STARTUP, + + /* + * This step brings the AP online and takes it to the point where it + * manages its own state from here on. For the time being, the rest + * of the AP bringup is fully serialized despite running on the AP. + * If the architecture doesn't use the CPUHP_BP_PARALLEL_STARTUP + * state, this step also does all the work of bringing the CPU + * online. + */ CPUHP_BRINGUP_CPU, /* diff --git a/kernel/cpu.c b/kernel/cpu.c index 43e0a77f21e8..6be5b60db51b 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -1504,13 +1504,45 @@ int bringup_hibernate_cpu(unsigned int sleep_cpu) void bringup_nonboot_cpus(unsigned int setup_max_cpus) { - unsigned int cpu; + unsigned int cpu, n = num_online_cpus(); + /* + * On architectures which have setup the CPUHP_BP_PARALLEL_STARTUP + * state, this invokes all BP prepare states and the parallel + * startup state sends the startup IPI to each of the to be onlined + * APs. This avoids waiting for each AP to respond to the startup + * IPI in CPUHP_BRINGUP_CPU. The APs proceed through the low level + * bringup code and then wait for the control CPU to release them + * one by one for the final onlining procedure in the loop below. + * + * For architectures which do not support parallel bringup all + * states are fully serialized in the loop below. + */ + if (!cpuhp_step_empty(true, CPUHP_BP_PARALLEL_STARTUP)) { + for_each_present_cpu(cpu) { + if (n++ >= setup_max_cpus) + break; + cpu_up(cpu, CPUHP_BP_PARALLEL_STARTUP); + } + } + + /* Do the per CPU serialized bringup to ONLINE state */ for_each_present_cpu(cpu) { if (num_online_cpus() >= setup_max_cpus) break; - if (!cpu_online(cpu)) - cpu_up(cpu, CPUHP_ONLINE); + + if (!cpu_online(cpu)) { + struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu); + int ret = cpu_up(cpu, CPUHP_ONLINE); + + /* + * Due to the above preparation loop a failed online attempt + * might have only rolled back to CPUHP_BP_PARALLEL_STARTUP. Do the + * remaining cleanups. NOOP for the non parallel case. + */ + if (ret && can_rollback_cpu(st)) + WARN_ON(cpuhp_invoke_callback_range(false, cpu, st, CPUHP_OFFLINE)); + } } }
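For context on where the modified function runs: bringup_nonboot_cpus() is invoked once during boot from smp_init() in kernel/smp.c, roughly as follows (abridged; node/CPU count reporting omitted):

/* Abridged sketch of kernel/smp.c:smp_init(); not a verbatim copy. */
void __init smp_init(void)
{
	idle_threads_init();
	cpuhp_threads_init();

	pr_info("Bringing up secondary CPUs ...\n");

	bringup_nonboot_cpus(setup_max_cpus);

	/* Any cleanup work */
	smp_cpus_done(setup_max_cpus);
}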
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h index c6fab004104a..ef3cf69a3d5b 100644 --- a/include/linux/cpuhotplug.h +++ b/include/linux/cpuhotplug.h @@ -133,6 +133,8 @@ enum cpuhp_state { CPUHP_MIPS_SOC_PREPARE, CPUHP_BP_PREPARE_DYN, CPUHP_BP_PREPARE_DYN_END = CPUHP_BP_PREPARE_DYN + 20, + CPUHP_BP_PARALLEL_DYN, + CPUHP_BP_PARALLEL_DYN_END = CPUHP_BP_PARALLEL_DYN + 4, CPUHP_BRINGUP_CPU, /* diff --git a/kernel/cpu.c b/kernel/cpu.c index 43e0a77f21e8..cf3c1c6f0710 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -1504,13 +1504,49 @@ int bringup_hibernate_cpu(unsigned int sleep_cpu) void bringup_nonboot_cpus(unsigned int setup_max_cpus) { + unsigned int n = setup_max_cpus - num_online_cpus(); unsigned int cpu; + /* + * An architecture may have registered parallel pre-bringup states to + * which each CPU may be brought in parallel. For each such state, + * bring N CPUs to it in turn before the final round of bringing them + * online. + */ + if (n > 0) { + enum cpuhp_state st = CPUHP_BP_PARALLEL_DYN; + + while (st <= CPUHP_BP_PARALLEL_DYN_END && cpuhp_hp_states[st].name) { + int i = n; + + for_each_present_cpu(cpu) { + cpu_up(cpu, st); + if (!--i) + break; + } + st++; + } + } + for_each_present_cpu(cpu) { if (num_online_cpus() >= setup_max_cpus) break; - if (!cpu_online(cpu)) - cpu_up(cpu, CPUHP_ONLINE); + if (!cpu_online(cpu)) { + int ret = cpu_up(cpu, CPUHP_ONLINE); + + /* + * For the parallel bringup case, roll all the way back + * to CPUHP_OFFLINE on failure; don't leave them in the + * parallel stages. This happens in the nosmt case for + * non-primary threads. + */ + if (ret && cpuhp_hp_states[CPUHP_BP_PARALLEL_DYN].name) { + struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu); + if (can_rollback_cpu(st)) + WARN_ON(cpuhp_invoke_callback_range(false, cpu, st, + CPUHP_OFFLINE)); + } + } } } @@ -1882,6 +1918,10 @@ static int cpuhp_reserve_state(enum cpuhp_state state) step = cpuhp_hp_states + CPUHP_BP_PREPARE_DYN; end = CPUHP_BP_PREPARE_DYN_END; break; + case CPUHP_BP_PARALLEL_DYN: + step = cpuhp_hp_states + CPUHP_BP_PARALLEL_DYN; + end = CPUHP_BP_PARALLEL_DYN_END; + break; default: return -EINVAL; } @@ -1906,14 +1946,15 @@ static int cpuhp_store_callbacks(enum cpuhp_state state, const char *name, /* * If name is NULL, then the state gets removed. * - * CPUHP_AP_ONLINE_DYN and CPUHP_BP_PREPARE_DYN are handed out on + * CPUHP_AP_ONLINE_DYN and CPUHP_BP_P*_DYN are handed out on * the first allocation from these dynamic ranges, so the removal * would trigger a new allocation and clear the wrong (already * empty) state, leaving the callbacks of the to be cleared state * dangling, which causes wreckage on the next hotplug operation. */ if (name && (state == CPUHP_AP_ONLINE_DYN || - state == CPUHP_BP_PREPARE_DYN)) { + state == CPUHP_BP_PREPARE_DYN || + state == CPUHP_BP_PARALLEL_DYN)) { ret = cpuhp_reserve_state(state); if (ret < 0) return ret;