Message ID | 20170321073452.GA17298@linaro.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Tue, 2017-03-21 at 16:34 +0900, AKASHI Takahiro wrote: > Yes, it is intentional. I removed 'offline' code in my v14 (2016/3/4). > As you assumed, I'd expect 'online' status of all CPUs to be kept > unchanged in the core dump. I wonder if it would be better to take a *copy* of it and put it back after we're done taking the CPUs down? As things stand, we now have *three* different methods of taking down all the CPUs... and *none* of them allow a platform to override it with an NMI-based or STONITH-based method, which seems like something of an oversight. > If you can agree, I would like to modify this disputed warning code to: > > + BUG_ON(!in_kexec_crash && (stuck_cpus || (num_online_cpus() > 1))); > + WARN(in_kexec_crash && (stuck_cpus || smp_crash_stop_failed()), > + "Some CPUs may be stale, kdump will be unreliable.\n"); That works; thanks. FWIW I'm currently blaming my platform's firmware for my sporadic crash-on-CPU#1 failures. If your testing includes crashes on non-boot CPUs (perhaps using the sysrq hack I posted) and it reliably passes for you, then let's ignore that for now.
===8<=== diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h index cea009f2657d..55f08c5acfad 100644 --- a/arch/arm64/include/asm/smp.h +++ b/arch/arm64/include/asm/smp.h @@ -149,6 +149,7 @@ static inline void cpu_panic_kernel(void) bool cpus_are_stuck_in_kernel(void); extern void smp_send_crash_stop(void); +extern bool smp_crash_stop_failed(void); #endif /* ifndef __ASSEMBLY__ */ diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c index 68b96ea13b4c..29e1cf8cca95 100644 --- a/arch/arm64/kernel/machine_kexec.c +++ b/arch/arm64/kernel/machine_kexec.c @@ -146,12 +146,15 @@ void machine_kexec(struct kimage *kimage) { phys_addr_t reboot_code_buffer_phys; void *reboot_code_buffer; + bool in_kexec_crash = (kimage == kexec_crash_image); + bool stuck_cpus = cpus_are_stuck_in_kernel(); /* * New cpus may have become stuck_in_kernel after we loaded the image. */ - BUG_ON((cpus_are_stuck_in_kernel() || (num_online_cpus() > 1)) && - !WARN_ON(kimage == kexec_crash_image)); + BUG_ON(!in_kexec_crash && (stuck_cpus || (num_online_cpus() > 1))); + WARN(in_kexec_crash && (stuck_cpus || smp_crash_stop_failed()), + "Some CPUs may be stale, kdump will be unreliable.\n"); reboot_code_buffer_phys = page_to_phys(kimage->control_code_page); reboot_code_buffer = phys_to_virt(reboot_code_buffer_phys); diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c index a7e2921143c4..8016914591d2 100644 --- a/arch/arm64/kernel/smp.c +++ b/arch/arm64/kernel/smp.c @@ -833,7 +833,7 @@ static void ipi_cpu_stop(unsigned int cpu) } #ifdef CONFIG_KEXEC_CORE -static atomic_t waiting_for_crash_ipi; +static atomic_t waiting_for_crash_ipi = ATOMIC_INIT(0); #endif static void ipi_cpu_crash_stop(unsigned int cpu, struct pt_regs *regs) @@ -990,7 +990,12 @@ void smp_send_crash_stop(void) if (atomic_read(&waiting_for_crash_ipi) > 0) pr_warning("SMP: failed to stop secondary CPUs %*pbl\n", - cpumask_pr_args(cpu_online_mask)); + cpumask_pr_args(&mask)); +} + +bool smp_crash_stop_failed(void) +{ + return (atomic_read(&waiting_for_crash_ipi) > 0); } #endif ===>8===