Message ID | 77c9199eabf3a30ebcf89356b2dd35abd611a3a9.1699982111.git.krystian.hebel@3mdeb.com
---|---
State | New, archived
Series | x86: parallelize AP bring-up during boot
On 14.11.2023 18:50, Krystian Hebel wrote:
> Multiple delays are required when sending IPIs and waiting for
> responses. During boot, 4 such IPIs were sent per each AP. With this
> change, only one set of broadcast IPIs is sent. This reduces boot time,
> especially for platforms with large number of cores.

Yet APs do their startup work in parallel only for a brief period of time,
if I'm not mistaken. Otherwise I can't see why you'd still have cpu_up()
in __start_xen().

> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -1963,6 +1963,8 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>          cpu_data[i].stack_base = cpu_alloc_stack(i);
>      }
>
> +    smp_send_init_sipi_sipi_allbutself();
> +
>      for_each_present_cpu ( i )
>      {
>          if ( (park_offline_cpus || num_online_cpus() < max_cpus) &&

So what about constraints on the number of CPUs to use? In such a case
you shouldn't send the IPI to all of them, at least if they're not meant
to be parked.

> --- a/xen/arch/x86/smpboot.c
> +++ b/xen/arch/x86/smpboot.c
> @@ -425,7 +425,7 @@ void start_secondary(unsigned int cpu)
>
>  static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
>  {
> -    unsigned long send_status = 0, accept_status = 0;
> +    unsigned long send_status = 0, accept_status = 0, sh = 0;

sh doesn't need to be 64 bits wide, does it?

>      int maxlvt, timeout, i;
>
>      /*
> @@ -445,6 +445,12 @@ static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
>      if ( tboot_in_measured_env() && !tboot_wake_ap(phys_apicid, start_eip) )
>          return 0;
>
> +    /*
> +     * Use destination shorthand for broadcasting IPIs during boot.
> +     */

Nit (style): This is a single line comment.

> +    if ( phys_apicid == BAD_APICID )
> +        sh = APIC_DEST_ALLBUT;

I think at the latest for this the function parameter wants changing to
unsigned int (in another prereq patch).

> @@ -573,21 +578,31 @@ static int do_boot_cpu(int apicid, int cpu)
>       */
>      mtrr_save_state();
>
> -    start_eip = bootsym_phys(trampoline_realmode_entry);
> +    /* Check if AP is already up. */
> +    if ( cpu_data[cpu].cpu_state != CPU_STATE_INIT )
> +    {
> +        /* This grunge runs the startup process for the targeted processor. */
> +        unsigned long start_eip;
> +        start_eip = bootsym_phys(trampoline_realmode_entry);
>
> -    /* start_eip needs be page aligned, and below the 1M boundary. */
> -    if ( start_eip & ~0xff000 )
> -        panic("AP trampoline %#lx not suitably positioned\n", start_eip);
> +        /* start_eip needs be page aligned, and below the 1M boundary. */
> +        if ( start_eip & ~0xff000 )
> +            panic("AP trampoline %#lx not suitably positioned\n", start_eip);

Isn't this redundant now with the panic() in
smp_send_init_sipi_sipi_allbutself(), at least as long as that runs
unconditionally?

> -    /* So we see what's up */
> -    if ( opt_cpu_info )
> -        printk("Booting processor %d/%d eip %lx\n",
> -               cpu, apicid, start_eip);
> +        /* So we see what's up */
> +        if ( opt_cpu_info )
> +            printk("AP trampoline at %lx\n", start_eip);

Why this change in log message? It makes messages for individual CPUs
indistinguishable. And like above it's redundant with what
smp_send_init_sipi_sipi_allbutself() logs.

> -    /* This grunge runs the startup process for the targeted processor. */
> +        /* mark "stuck" area as not stuck */
> +        bootsym(trampoline_cpu_started) = 0;
> +        smp_mb();
>
> -    /* Starting actual IPI sequence... */
> -    boot_error = wakeup_secondary_cpu(apicid, start_eip);
> +        /* Starting actual IPI sequence... */
> +        boot_error = wakeup_secondary_cpu(apicid, start_eip);
> +    }
> +
> +    if ( opt_cpu_info )
> +        printk("Booting processor %d/%d\n", cpu, apicid);

Oh, here's the other half. Yet for above it still doesn't make sense
to issue the same message for all CPUs.

> @@ -646,10 +661,6 @@ static int do_boot_cpu(int apicid, int cpu)
>          rc = -EIO;
>      }
>
> -    /* mark "stuck" area as not stuck */
> -    bootsym(trampoline_cpu_started) = 0;
> -    smp_mb();

While you move this up, it's not clear to me how you would now identify
individual stuck CPUs. I would have expected that this is another global
that needs converting up front, to be per-CPU.

> @@ -1155,6 +1166,23 @@ static struct notifier_block cpu_smpboot_nfb = {
>      .notifier_call = cpu_smpboot_callback
>  };
>
> +void smp_send_init_sipi_sipi_allbutself(void)

__init?

> +{
> +    unsigned long start_eip;
> +    start_eip = bootsym_phys(trampoline_realmode_entry);

This can be the initializer of the variable, which would then save me
from complaining about the missing blank line between declaration and
statement(s). (Actually, as I notice only now - same for code you move
around in do_boot_cpu().)

Jan
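
[Editorial note: as an illustration of the constraint Jan raises above, the
check Krystian later agrees to add at the call site in __start_xen() could
take roughly the following shape. This is a hypothetical sketch, not code
from any posted revision; the exact condition to use is an assumption.]

    /*
     * Hypothetical sketch (not from the posted series): only use the
     * broadcast INIT-SIPI-SIPI when every present CPU will either be
     * brought online or parked; otherwise leave APs to be woken
     * individually by the loop below.
     */
    if ( park_offline_cpus || num_present_cpus() <= max_cpus )
        smp_send_init_sipi_sipi_allbutself();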
On 8.02.2024 13:37, Jan Beulich wrote:
> On 14.11.2023 18:50, Krystian Hebel wrote:
>> Multiple delays are required when sending IPIs and waiting for
>> responses. During boot, 4 such IPIs were sent per each AP. With this
>> change, only one set of broadcast IPIs is sent. This reduces boot time,
>> especially for platforms with large number of cores.
> Yet APs do their startup work in parallel only for a brief period of time,
> if I'm not mistaken. Otherwise I can't see why you'd still have cpu_up()
> in __start_xen().

cpu_up() is left because multiple notifiers aren't easy to convert to work
in parallel. In terms of lines of code it looks like a brief period, but all
the delays along the way were taking much more time than the actual work.
As the gain was already more than what I hoped for, I decided against
spending too much time trying to fix the notifiers' code for minimal profit.

>
>> --- a/xen/arch/x86/setup.c
>> +++ b/xen/arch/x86/setup.c
>> @@ -1963,6 +1963,8 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>>          cpu_data[i].stack_base = cpu_alloc_stack(i);
>>      }
>>
>> +    smp_send_init_sipi_sipi_allbutself();
>> +
>>      for_each_present_cpu ( i )
>>      {
>>          if ( (park_offline_cpus || num_online_cpus() < max_cpus) &&
> So what about constraints on the number of CPUs to use? In such a case
> you shouldn't send the IPI to all of them, at least if they're not meant
> to be parked.

Fair point, such a check can be easily added before broadcasting and the
rest of the code should already be able to handle this.

>
>> --- a/xen/arch/x86/smpboot.c
>> +++ b/xen/arch/x86/smpboot.c
>> @@ -425,7 +425,7 @@ void start_secondary(unsigned int cpu)
>>
>>  static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
>>  {
>> -    unsigned long send_status = 0, accept_status = 0;
>> +    unsigned long send_status = 0, accept_status = 0, sh = 0;
> sh doesn't need to be 64 bits wide, does it?

No, will change.

>
>>      int maxlvt, timeout, i;
>>
>>      /*
>> @@ -445,6 +445,12 @@ static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
>>      if ( tboot_in_measured_env() && !tboot_wake_ap(phys_apicid, start_eip) )
>>          return 0;
>>
>> +    /*
>> +     * Use destination shorthand for broadcasting IPIs during boot.
>> +     */
> Nit (style): This is a single line comment.

Ack

>
>> +    if ( phys_apicid == BAD_APICID )
>> +        sh = APIC_DEST_ALLBUT;
> I think at the latest for this the function parameter wants changing to
> unsigned int (in another prereq patch).

What do you mean, phys_apicid in wakeup_secondary_cpu()? It is passed
as signed int since __cpu_up(); should I change all of those to unsigned?

>
>> @@ -573,21 +578,31 @@ static int do_boot_cpu(int apicid, int cpu)
>>       */
>>      mtrr_save_state();
>>
>> -    start_eip = bootsym_phys(trampoline_realmode_entry);
>> +    /* Check if AP is already up. */
>> +    if ( cpu_data[cpu].cpu_state != CPU_STATE_INIT )
>> +    {
>> +        /* This grunge runs the startup process for the targeted processor. */
>> +        unsigned long start_eip;
>> +        start_eip = bootsym_phys(trampoline_realmode_entry);
>>
>> -    /* start_eip needs be page aligned, and below the 1M boundary. */
>> -    if ( start_eip & ~0xff000 )
>> -        panic("AP trampoline %#lx not suitably positioned\n", start_eip);
>> +        /* start_eip needs be page aligned, and below the 1M boundary. */
>> +        if ( start_eip & ~0xff000 )
>> +            panic("AP trampoline %#lx not suitably positioned\n", start_eip);
> Isn't this redundant now with the panic() in
> smp_send_init_sipi_sipi_allbutself(), at least as long as that runs
> unconditionally?

It won't be running unconditionally, but it also wouldn't be redundant in
case of hot-plugging.

>
>> -    /* So we see what's up */
>> -    if ( opt_cpu_info )
>> -        printk("Booting processor %d/%d eip %lx\n",
>> -               cpu, apicid, start_eip);
>> +        /* So we see what's up */
>> +        if ( opt_cpu_info )
>> +            printk("AP trampoline at %lx\n", start_eip);
> Why this change in log message? It makes messages for individual CPUs
> indistinguishable. And like above it's redundant with what
> smp_send_init_sipi_sipi_allbutself() logs.
>
>> -    /* This grunge runs the startup process for the targeted processor. */
>> +        /* mark "stuck" area as not stuck */
>> +        bootsym(trampoline_cpu_started) = 0;
>> +        smp_mb();
>>
>> -    /* Starting actual IPI sequence... */
>> -    boot_error = wakeup_secondary_cpu(apicid, start_eip);
>> +        /* Starting actual IPI sequence... */
>> +        boot_error = wakeup_secondary_cpu(apicid, start_eip);
>> +    }
>> +
>> +    if ( opt_cpu_info )
>> +        printk("Booting processor %d/%d\n", cpu, apicid);
> Oh, here's the other half. Yet for above it still doesn't make sense
> to issue the same message for all CPUs.

I'll undo it. It was important at one point for debugging, but I agree that
it doesn't make sense now.

>
>> @@ -646,10 +661,6 @@ static int do_boot_cpu(int apicid, int cpu)
>>          rc = -EIO;
>>      }
>>
>> -    /* mark "stuck" area as not stuck */
>> -    bootsym(trampoline_cpu_started) = 0;
>> -    smp_mb();
> While you move this up, it's not clear to me how you would now
> identify individual stuck CPUs. I would have expected that this is
> another global that needs converting up front, to be per-CPU.

In the existing code this is set very early, in 16b code, and the variable
is located within the first page of the trampoline. With this change it is
impossible to identify individual stuck CPUs.

I was considering removing this variable altogether. Another option would
be to make this an array with NR_CPUS elements, but this may get too big.
It would be possible to fill it relatively early, after the CPU ID is
obtained, before paging is enabled, but after loading IDT, GDT and jumping
to protected mode. Those are things that can break due to an error in the
code, but it may be better than not having that info at all.

It could also be set from 64b code; that way simple per-CPU data would
work. It is a bit late, but this would probably be easiest and cleanest
to write.

>
>> @@ -1155,6 +1166,23 @@ static struct notifier_block cpu_smpboot_nfb = {
>>      .notifier_call = cpu_smpboot_callback
>>  };
>>
>> +void smp_send_init_sipi_sipi_allbutself(void)
> __init?

Ack

>
>> +{
>> +    unsigned long start_eip;
>> +    start_eip = bootsym_phys(trampoline_realmode_entry);
> This can be the initializer of the variable, which would then save
> me from complaining about the missing blank line between declaration
> and statement(s). (Actually, as I notice only now - same for code you
> move around in do_boot_cpu().)

Will do. It may still be split due to line length, but at least that will
follow code style.

>
> Jan

Best regards,
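
[Editorial note: the last option Krystian mentions (setting a marker from
64-bit code) could look roughly like the fragment below. This is a minimal
sketch; the variable name, message text and exact placement are assumptions,
not code from the series.]

    /*
     * Hypothetical per-CPU replacement for the global
     * trampoline_cpu_started, usable once the AP runs 64-bit code.
     */
    static DEFINE_PER_CPU(bool, ap_reached_64bit);

    /* Early in start_secondary(), i.e. already in the 64-bit entry path: */
        this_cpu(ap_reached_64bit) = true;

    /* In do_boot_cpu(), when a CPU failed to come online in time: */
        if ( !per_cpu(ap_reached_64bit, cpu) )
            printk("CPU%d appears stuck before reaching 64-bit code\n", cpu);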
On 12.03.2024 18:13, Krystian Hebel wrote:
>
> On 8.02.2024 13:37, Jan Beulich wrote:
>> On 14.11.2023 18:50, Krystian Hebel wrote:
>>> Multiple delays are required when sending IPIs and waiting for
>>> responses. During boot, 4 such IPIs were sent per each AP. With this
>>> change, only one set of broadcast IPIs is sent. This reduces boot time,
>>> especially for platforms with large number of cores.
>> Yet APs do their startup work in parallel only for a brief period of time,
>> if I'm not mistaken. Otherwise I can't see why you'd still have cpu_up()
>> in __start_xen().
> cpu_up() is left because multiple notifiers aren't easy to convert to work
> in parallel. In terms of lines of code it looks like a brief period, but all
> the delays along the way were taking much more time than the actual work.
> As the gain was already more than what I hoped for, I decided against
> spending too much time trying to fix the notifiers' code for minimal profit.

Which is all fine. Just that by the title of this patch and the cover letter
I expected more. Adding "partly" or some such in both places may help.

>>> --- a/xen/arch/x86/smpboot.c
>>> +++ b/xen/arch/x86/smpboot.c
>>> @@ -425,7 +425,7 @@ void start_secondary(unsigned int cpu)
>>>
>>>  static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
>>>  {
>>> -    unsigned long send_status = 0, accept_status = 0;
>>> +    unsigned long send_status = 0, accept_status = 0, sh = 0;
>> sh doesn't need to be 64 bits wide, does it?
> No, will change.
>>
>>>      int maxlvt, timeout, i;
>>>
>>>      /*
>>> @@ -445,6 +445,12 @@ static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
>>>      if ( tboot_in_measured_env() && !tboot_wake_ap(phys_apicid, start_eip) )
>>>          return 0;
>>>
>>> +    /*
>>> +     * Use destination shorthand for broadcasting IPIs during boot.
>>> +     */
>> Nit (style): This is a single line comment.
> Ack
>>
>>> +    if ( phys_apicid == BAD_APICID )
>>> +        sh = APIC_DEST_ALLBUT;
>> I think at the latest for this the function parameter wants changing to
>> unsigned int (in another prereq patch).
> What do you mean, phys_apicid in wakeup_secondary_cpu()? It is passed
> as signed int since __cpu_up(); should I change all of those to unsigned?

That would be best, yes. BAD_APICID, after all, is an unsigned constant
(no matter that its definition involves a unary minus operator).

Jan
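
[Editorial note: the prerequisite change discussed here would amount to
signature adjustments along the following lines. This is a sketch; the
exact set of callers that need touching is an assumption.]

    /*
     * Sketch of the discussed prereq: pass the APIC ID as unsigned so the
     * comparison against BAD_APICID (an unsigned constant) is well-typed.
     */
    static int wakeup_secondary_cpu(unsigned int phys_apicid,
                                    unsigned long start_eip);
    static int do_boot_cpu(unsigned int apicid, int cpu);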
diff --git a/xen/arch/x86/include/asm/smp.h b/xen/arch/x86/include/asm/smp.h
index 98739028a6ed..6ca0158a368d 100644
--- a/xen/arch/x86/include/asm/smp.h
+++ b/xen/arch/x86/include/asm/smp.h
@@ -31,6 +31,7 @@ DECLARE_PER_CPU(cpumask_var_t, send_ipi_cpumask);
 extern bool park_offline_cpus;
 
 void smp_send_nmi_allbutself(void);
+void smp_send_init_sipi_sipi_allbutself(void);
 
 void send_IPI_mask(const cpumask_t *, int vector);
 void send_IPI_self(int vector);
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index b2c0679725ea..42a9067b81eb 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1963,6 +1963,8 @@ void __init noreturn __start_xen(unsigned long mbi_p)
         cpu_data[i].stack_base = cpu_alloc_stack(i);
     }
 
+    smp_send_init_sipi_sipi_allbutself();
+
     for_each_present_cpu ( i )
     {
         if ( (park_offline_cpus || num_online_cpus() < max_cpus) &&
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index cbea2d45f70d..e9a7f78a5a6f 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -425,7 +425,7 @@ void start_secondary(unsigned int cpu)
 
 static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
 {
-    unsigned long send_status = 0, accept_status = 0;
+    unsigned long send_status = 0, accept_status = 0, sh = 0;
     int maxlvt, timeout, i;
 
     /*
@@ -445,6 +445,12 @@ static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
     if ( tboot_in_measured_env() && !tboot_wake_ap(phys_apicid, start_eip) )
         return 0;
 
+    /*
+     * Use destination shorthand for broadcasting IPIs during boot.
+     */
+    if ( phys_apicid == BAD_APICID )
+        sh = APIC_DEST_ALLBUT;
+
     /*
      * Be paranoid about clearing APIC errors.
      */
@@ -458,7 +464,7 @@ static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
     /*
      * Turn INIT on target chip via IPI
      */
-    apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT,
+    apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT | sh,
                    phys_apicid);
 
     if ( !x2apic_enabled )
@@ -475,7 +481,7 @@ static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
 
     Dprintk("Deasserting INIT.\n");
 
-    apic_icr_write(APIC_INT_LEVELTRIG | APIC_DM_INIT, phys_apicid);
+    apic_icr_write(APIC_INT_LEVELTRIG | APIC_DM_INIT | sh, phys_apicid);
 
     Dprintk("Waiting for send to finish...\n");
     timeout = 0;
@@ -512,7 +518,7 @@ static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
          * STARTUP IPI
          * Boot on the stack
          */
-        apic_icr_write(APIC_DM_STARTUP | (start_eip >> 12), phys_apicid);
+        apic_icr_write(APIC_DM_STARTUP | (start_eip >> 12) | sh, phys_apicid);
 
         if ( !x2apic_enabled )
         {
@@ -565,7 +571,6 @@ int alloc_cpu_id(void)
 static int do_boot_cpu(int apicid, int cpu)
 {
     int timeout, boot_error = 0, rc = 0;
-    unsigned long start_eip;
 
     /*
      * Save current MTRR state in case it was changed since early boot
@@ -573,21 +578,31 @@ static int do_boot_cpu(int apicid, int cpu)
      */
     mtrr_save_state();
 
-    start_eip = bootsym_phys(trampoline_realmode_entry);
+    /* Check if AP is already up. */
+    if ( cpu_data[cpu].cpu_state != CPU_STATE_INIT )
+    {
+        /* This grunge runs the startup process for the targeted processor. */
+        unsigned long start_eip;
+        start_eip = bootsym_phys(trampoline_realmode_entry);
 
-    /* start_eip needs be page aligned, and below the 1M boundary. */
-    if ( start_eip & ~0xff000 )
-        panic("AP trampoline %#lx not suitably positioned\n", start_eip);
+        /* start_eip needs be page aligned, and below the 1M boundary. */
+        if ( start_eip & ~0xff000 )
+            panic("AP trampoline %#lx not suitably positioned\n", start_eip);
 
-    /* So we see what's up */
-    if ( opt_cpu_info )
-        printk("Booting processor %d/%d eip %lx\n",
-               cpu, apicid, start_eip);
+        /* So we see what's up */
+        if ( opt_cpu_info )
+            printk("AP trampoline at %lx\n", start_eip);
 
-    /* This grunge runs the startup process for the targeted processor. */
+        /* mark "stuck" area as not stuck */
+        bootsym(trampoline_cpu_started) = 0;
+        smp_mb();
 
-    /* Starting actual IPI sequence... */
-    boot_error = wakeup_secondary_cpu(apicid, start_eip);
+        /* Starting actual IPI sequence... */
+        boot_error = wakeup_secondary_cpu(apicid, start_eip);
+    }
+
+    if ( opt_cpu_info )
+        printk("Booting processor %d/%d\n", cpu, apicid);
 
     if ( !boot_error )
     {
@@ -646,10 +661,6 @@ static int do_boot_cpu(int apicid, int cpu)
         rc = -EIO;
     }
 
-    /* mark "stuck" area as not stuck */
-    bootsym(trampoline_cpu_started) = 0;
-    smp_mb();
-
     return rc;
 }
 
@@ -1155,6 +1166,23 @@ static struct notifier_block cpu_smpboot_nfb = {
     .notifier_call = cpu_smpboot_callback
 };
 
+void smp_send_init_sipi_sipi_allbutself(void)
+{
+    unsigned long start_eip;
+    start_eip = bootsym_phys(trampoline_realmode_entry);
+
+    /* start_eip needs be page aligned, and below the 1M boundary. */
+    if ( start_eip & ~0xff000 )
+        panic("AP trampoline %#lx not suitably positioned\n", start_eip);
+
+    /* So we see what's up */
+    if ( opt_cpu_info )
+        printk("Booting APs in parallel, eip %lx\n", start_eip);
+
+    /* Starting actual broadcast IPI sequence... */
+    wakeup_secondary_cpu(BAD_APICID, start_eip);
+}
+
 void __init smp_prepare_cpus(void)
 {
     register_cpu_notifier(&cpu_smpboot_nfb);
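
[Editorial note: for readers following the thread, folding in the review
feedback agreed above (narrower type for sh, single-line comment, __init
annotation, initializer instead of a separate assignment) might end up
looking roughly like the sketch below. This is not an actual v2 posting.]

    /* Sketch only, incorporating the review feedback above. */
    static int wakeup_secondary_cpu(unsigned int phys_apicid,
                                    unsigned long start_eip)
    {
        unsigned long send_status = 0, accept_status = 0;
        unsigned int sh = 0;
        int maxlvt, timeout, i;

        /* ... */

        /* Use destination shorthand for broadcasting IPIs during boot. */
        if ( phys_apicid == BAD_APICID )
            sh = APIC_DEST_ALLBUT;

        /* ... rest unchanged ... */
    }

    void __init smp_send_init_sipi_sipi_allbutself(void)
    {
        unsigned long start_eip = bootsym_phys(trampoline_realmode_entry);

        /* start_eip needs be page aligned, and below the 1M boundary. */
        if ( start_eip & ~0xff000 )
            panic("AP trampoline %#lx not suitably positioned\n", start_eip);

        if ( opt_cpu_info )
            printk("Booting APs in parallel, eip %lx\n", start_eip);

        wakeup_secondary_cpu(BAD_APICID, start_eip);
    }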
Multiple delays are required when sending IPIs and waiting for
responses. During boot, 4 such IPIs were sent per each AP. With this
change, only one set of broadcast IPIs is sent. This reduces boot time,
especially for platforms with large number of cores.

Single CPU initialization is still possible, it is used for hotplug.
During wakeup from S3 APs are started one by one. It should be possible
to enable parallel execution there as well, but I don't have a way of
properly testing it as of now.

Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
---
 xen/arch/x86/include/asm/smp.h |  1 +
 xen/arch/x86/setup.c           |  2 +
 xen/arch/x86/smpboot.c         | 68 ++++++++++++++++++++++++----------
 3 files changed, 51 insertions(+), 20 deletions(-)