Message ID | 87b0e650f28038c2fb64c5eb607c8fdaa7b4db07.1699982111.git.krystian.hebel@3mdeb.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | x86: parallelize AP bring-up during boot | expand |
On 14.11.2023 18:50, Krystian Hebel wrote: > If multiple CPUs called machine_restart() before actual restart took > place, but after boot CPU declared itself not online, Can you help me please in identifying where this operation is? I can see two places where a CPU is removed from cpu_online_map, yet neither __stop_this_cpu() nor __cpu_disable() ought to be coming into play here. In fact I didn't think CPU0 was ever marked not-online. Except perhaps if we came through machine_crash_shutdown() -> nmi_shootdown_cpus(), but I'm sure you would have mentioned such a further dependency. Jan
On 8.02.2024 12:30, Jan Beulich wrote: > On 14.11.2023 18:50, Krystian Hebel wrote: >> If multiple CPUs called machine_restart() before actual restart took >> place, but after boot CPU declared itself not online, > Can you help me please in identifying where this operation is? I can see > two places where a CPU is removed from cpu_online_map, yet neither > __stop_this_cpu() nor __cpu_disable() ought to be coming into play here. > In fact I didn't think CPU0 was ever marked not-online. Except perhaps > if we came through machine_crash_shutdown() -> nmi_shootdown_cpus(), but > I'm sure you would have mentioned such a further dependency. > > Jan BUG_ON() in cpu_notifier_call_chain() (I've been playing with some of the notifiers and one of them eventually failed) resulted in panic() around the same time AP did in pm_idle() due to inconsistent settings between BSP and AP for MWAIT/MONITOR support after TXT dynamic launch. There is 5s delay between smp_send_stop() and actual reboot, during that time AP spammed the output so the original reason for panic() was visible only after unreasonable amount of scrolling. Adding TXT support is the reason why I even started making AP bring-up parallel. Problem with MWAIT doesn't happen in current code or changes in this patchset, but handling of such error is related to SMP so I've included it. Best regards,
On 12.03.2024 17:05, Krystian Hebel wrote: > > On 8.02.2024 12:30, Jan Beulich wrote: >> On 14.11.2023 18:50, Krystian Hebel wrote: >>> If multiple CPUs called machine_restart() before actual restart took >>> place, but after boot CPU declared itself not online, >> Can you help me please in identifying where this operation is? I can see >> two places where a CPU is removed from cpu_online_map, yet neither >> __stop_this_cpu() nor __cpu_disable() ought to be coming into play here. >> In fact I didn't think CPU0 was ever marked not-online. Except perhaps >> if we came through machine_crash_shutdown() -> nmi_shootdown_cpus(), but >> I'm sure you would have mentioned such a further dependency. >> > BUG_ON() in cpu_notifier_call_chain() (I've been playing with some of > the notifiers and one of them eventually failed) resulted in panic() > around the same time AP did in pm_idle() due to inconsistent settings > between BSP and AP for MWAIT/MONITOR support after TXT dynamic > launch. There is 5s delay between smp_send_stop() and actual reboot, > during that time AP spammed the output so the original reason for > panic() was visible only after unreasonable amount of scrolling. > > Adding TXT support is the reason why I even started making AP bring-up > parallel. Problem with MWAIT doesn't happen in current code or changes > in this patchset, but handling of such error is related to SMP so I've > included it. If you mean to address a latent problem, then you want to say so and you want to make sure you include enough detail on the (future) conditions under which the problem may happen. Otherwise anything you say wants to match present code / behavior. Jan
diff --git a/xen/arch/x86/shutdown.c b/xen/arch/x86/shutdown.c index 7619544d14da..32c70505ed77 100644 --- a/xen/arch/x86/shutdown.c +++ b/xen/arch/x86/shutdown.c @@ -577,9 +577,23 @@ void machine_restart(unsigned int delay_millisecs) /* Ensure we are the boot CPU. */ if ( get_apic_id() != boot_cpu_physical_apicid ) { - /* Send IPI to the boot CPU (logical cpu 0). */ - on_selected_cpus(cpumask_of(0), __machine_restart, - &delay_millisecs, 0); + /* + * Send IPI to the boot CPU (logical cpu 0). + * + * If multiple CPUs called machine_restart() before actual restart + * took place, but after boot CPU declared itself not online, ASSERT + * in on_selected_cpus() will fail. Few calls later we would end up + * here again, with another frame on call stack for new exception. + * To protect against running out of stack, check if boot CPU is + * online. + * + * Note this is not an atomic operation, so it is possible for + * on_selected_cpus() to be called once after boot CPU is offline + * before we hit halt() below. + */ + if ( cpu_online(0) ) + on_selected_cpus(cpumask_of(0), __machine_restart, + &delay_millisecs, 0); for ( ; ; ) halt(); }
If multiple CPUs called machine_restart() before actual restart took place, but after boot CPU declared itself not online, ASSERT in on_selected_cpus() will fail. Few calls later execution would end up in machine_restart() again, with another frame on call stack for new exception. To protect against running out of stack, code checks if boot CPU is still online before calling on_selected_cpus(). Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com> --- xen/arch/x86/shutdown.c | 20 +++++++++++++++++--- 1 file changed, 17 insertions(+), 3 deletions(-)