Message ID | 20211012072428.2569-2-dongli.zhang@oracle.com (mailing list archive) |
---|---|
State | New, archived |
Series | Fix the Xen HVM kdump/kexec boot panic issue |
On 12.10.21 09:24, Dongli Zhang wrote:
> The sched_clock() can be used very early since upstream
> commit 857baa87b642 ("sched/clock: Enable sched clock early"). In addition,
> with upstream commit 38669ba205d1 ("x86/xen/time: Output xen sched_clock
> time from 0"), kdump kernel in Xen HVM guest may panic at very early stage
> when accessing &__this_cpu_read(xen_vcpu)->time as in below:
>
> setup_arch()
>  -> init_hypervisor_platform()
>      -> x86_init.hyper.init_platform = xen_hvm_guest_init()
>          -> xen_hvm_init_time_ops()
>              -> xen_clocksource_read()
>                  -> src = &__this_cpu_read(xen_vcpu)->time;
>
> This is because Xen HVM supports at most MAX_VIRT_CPUS=32 'vcpu_info'
> embedded inside 'shared_info' during early stage until xen_vcpu_setup() is
> used to allocate/relocate 'vcpu_info' for boot cpu at arbitrary address.
>
> However, when Xen HVM guest panic on vcpu >= 32, since
> xen_vcpu_info_reset(0) would set per_cpu(xen_vcpu, cpu) = NULL when
> vcpu >= 32, xen_clocksource_read() on vcpu >= 32 would panic.
>
> This patch delays xen_hvm_init_time_ops() to later in
> xen_hvm_smp_prepare_boot_cpu() after the 'vcpu_info' for boot vcpu is
> registered when the boot vcpu is >= 32.
>
> This issue can be reproduced on purpose via below command at the guest
> side when kdump/kexec is enabled:
>
> "taskset -c 33 echo c > /proc/sysrq-trigger"
>
> Cc: Joe Jin <joe.jin@oracle.com>
> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
> ---
>  arch/x86/xen/enlighten_hvm.c | 20 +++++++++++++++++++-
>  arch/x86/xen/smp_hvm.c       |  3 +++
>  2 files changed, 22 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
> index e68ea5f4ad1c..152279416d9a 100644
> --- a/arch/x86/xen/enlighten_hvm.c
> +++ b/arch/x86/xen/enlighten_hvm.c
> @@ -216,7 +216,25 @@ static void __init xen_hvm_guest_init(void)
>  	WARN_ON(xen_cpuhp_setup(xen_cpu_up_prepare_hvm, xen_cpu_dead_hvm));
>  	xen_unplug_emulated_devices();
>  	x86_init.irqs.intr_init = xen_init_IRQ;
> -	xen_hvm_init_time_ops();
> +
> +	/*
> +	 * Only MAX_VIRT_CPUS 'vcpu_info' are embedded inside 'shared_info'
> +	 * and the VM would use them until xen_vcpu_setup() is used to
> +	 * allocate/relocate them at arbitrary address.
> +	 *
> +	 * However, when Xen HVM guest panic on vcpu >= MAX_VIRT_CPUS,
> +	 * per_cpu(xen_vcpu, cpu) is still NULL at this stage. To access
> +	 * per_cpu(xen_vcpu, cpu) via xen_clocksource_read() would panic.
> +	 *
> +	 * Therefore we delay xen_hvm_init_time_ops() to
> +	 * xen_hvm_smp_prepare_boot_cpu() when boot vcpu is >= MAX_VIRT_CPUS.
> +	 */
> +	if (xen_vcpu_nr(0) >= MAX_VIRT_CPUS)
> +		pr_info("Delay xen_hvm_init_time_ops() as kernel is running on vcpu=%d\n",
> +			xen_vcpu_nr(0));
> +	else
> +		xen_hvm_init_time_ops();
> +
>  	xen_hvm_init_mmu_ops();
>
>  #ifdef CONFIG_KEXEC_CORE
> diff --git a/arch/x86/xen/smp_hvm.c b/arch/x86/xen/smp_hvm.c
> index 6ff3c887e0b9..60cd4fafd188 100644
> --- a/arch/x86/xen/smp_hvm.c
> +++ b/arch/x86/xen/smp_hvm.c
> @@ -19,6 +19,9 @@ static void __init xen_hvm_smp_prepare_boot_cpu(void)
>  	 */
>  	xen_vcpu_setup(0);
>
> +	if (xen_vcpu_nr(0) >= MAX_VIRT_CPUS)
> +		xen_hvm_init_time_ops();
> +

Please add a comment referencing the related code in
xen_hvm_guest_init().

Juergen
On 10/12/21 3:24 AM, Dongli Zhang wrote:
> The sched_clock() can be used very early since upstream
> commit 857baa87b642 ("sched/clock: Enable sched clock early"). In addition,
> with upstream commit 38669ba205d1 ("x86/xen/time: Output xen sched_clock
> time from 0"), kdump kernel in Xen HVM guest may panic at very early stage
> when accessing &__this_cpu_read(xen_vcpu)->time as in below:

Please drop "upstream". It's always upstream here.

> +
> +	/*
> +	 * Only MAX_VIRT_CPUS 'vcpu_info' are embedded inside 'shared_info'
> +	 * and the VM would use them until xen_vcpu_setup() is used to
> +	 * allocate/relocate them at arbitrary address.
> +	 *
> +	 * However, when Xen HVM guest panic on vcpu >= MAX_VIRT_CPUS,
> +	 * per_cpu(xen_vcpu, cpu) is still NULL at this stage. To access
> +	 * per_cpu(xen_vcpu, cpu) via xen_clocksource_read() would panic.
> +	 *
> +	 * Therefore we delay xen_hvm_init_time_ops() to
> +	 * xen_hvm_smp_prepare_boot_cpu() when boot vcpu is >= MAX_VIRT_CPUS.
> +	 */
> +	if (xen_vcpu_nr(0) >= MAX_VIRT_CPUS)

What about always deferring this when panicking? Would that work?

Deciding whether to defer based on cpu number feels a bit awkward.

-boris

> +		pr_info("Delay xen_hvm_init_time_ops() as kernel is running on vcpu=%d\n",
> +			xen_vcpu_nr(0));
> +	else
> +		xen_hvm_init_time_ops();
> +
>  	xen_hvm_init_mmu_ops();
>
Hi Boris,

On 10/12/21 10:17 AM, Boris Ostrovsky wrote:
>
> On 10/12/21 3:24 AM, Dongli Zhang wrote:
>> The sched_clock() can be used very early since upstream
>> commit 857baa87b642 ("sched/clock: Enable sched clock early"). In addition,
>> with upstream commit 38669ba205d1 ("x86/xen/time: Output xen sched_clock
>> time from 0"), kdump kernel in Xen HVM guest may panic at very early stage
>> when accessing &__this_cpu_read(xen_vcpu)->time as in below:
>
> Please drop "upstream". It's always upstream here.
>
>> +
>> +	/*
>> +	 * Only MAX_VIRT_CPUS 'vcpu_info' are embedded inside 'shared_info'
>> +	 * and the VM would use them until xen_vcpu_setup() is used to
>> +	 * allocate/relocate them at arbitrary address.
>> +	 *
>> +	 * However, when Xen HVM guest panic on vcpu >= MAX_VIRT_CPUS,
>> +	 * per_cpu(xen_vcpu, cpu) is still NULL at this stage. To access
>> +	 * per_cpu(xen_vcpu, cpu) via xen_clocksource_read() would panic.
>> +	 *
>> +	 * Therefore we delay xen_hvm_init_time_ops() to
>> +	 * xen_hvm_smp_prepare_boot_cpu() when boot vcpu is >= MAX_VIRT_CPUS.
>> +	 */
>> +	if (xen_vcpu_nr(0) >= MAX_VIRT_CPUS)
>
> What about always deferring this when panicking? Would that work?
>
> Deciding whether to defer based on cpu number feels a bit awkward.
>
> -boris
>

I did some tests and I do not think this works well. I prefer to delay the
initialization only for VCPU >= 32.

This is the syslog if we always delay xen_hvm_init_time_ops(), regardless of
whether VCPU >= 32.

[    0.032372] Booting paravirtualized kernel on Xen HVM
[    0.032376] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[    0.037683] setup_percpu: NR_CPUS:64 nr_cpumask_bits:64 nr_cpu_ids:64 nr_node_ids:2
[    0.041876] percpu: Embedded 49 pages/cpu s162968 r8192 d29544 u262144

--> The clock goes backwards here, from 0.041876 to 0.000010.

[    0.000010] Built 2 zonelists, mobility grouping on.  Total pages: 2015744
[    0.000012] Policy zone: Normal
[    0.000014] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.15.0-rc6xen+ root=UUID=2a5975ab-a059-4697-9aee-7a53ddfeea21 ro text console=ttyS0,115200n8 console=tty1 crashkernel=512M-:192M

This is because the initial pv_sched_clock is native_sched_clock(), and it
switches to xen_sched_clock() in xen_hvm_init_time_ops(). Is it fine to always
have the clock go backwards for a non-kdump kernel?

To avoid the backwards jump, we could register a dummy clocksource which
always returns 0 before xen_hvm_init_time_ops(). I do not think this is
reasonable.

Thank you very much!

Dongli Zhang
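
The backwards jump described in the reply above can be reproduced in a small userspace model. This is only a sketch: every model_* name is hypothetical, the raw counter is passed in as a plain argument, and the offset handling is an assumption based on the subject of commit 38669ba205d1 ("Output xen sched_clock time from 0"), not a copy of the kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the pv_sched_clock hook: maps a raw ns counter
 * to the timestamp that printk would show. */
typedef uint64_t (*model_sched_clock_fn)(uint64_t raw_ns);

/* Assumption: the Xen clock subtracts the raw value seen at init time,
 * so that its output starts from 0. */
static uint64_t model_xen_offset_ns;

/* Models native_sched_clock(): reports the raw counter unchanged. */
static uint64_t model_native_sched_clock(uint64_t raw_ns)
{
	return raw_ns;
}

/* Models xen_sched_clock(): time elapsed since the time ops were set up. */
static uint64_t model_xen_sched_clock(uint64_t raw_ns)
{
	return raw_ns - model_xen_offset_ns;
}

/* The early default, as stated in the reply above. */
static model_sched_clock_fn pv_sched_clock = model_native_sched_clock;

/* Models the switch performed by xen_hvm_init_time_ops(). */
static void model_switch_to_xen_clock(uint64_t now_raw_ns)
{
	model_xen_offset_ns = now_raw_ns;
	pv_sched_clock = model_xen_sched_clock;
}
```

If the switch happens after timestamps have already been printed, the next timestamp restarts near zero, which matches the 0.041876 -> 0.000010 step in the log. Doing the switch before anything reads the clock, as the non-deferred path does, avoids the visible step.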
diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
index e68ea5f4ad1c..152279416d9a 100644
--- a/arch/x86/xen/enlighten_hvm.c
+++ b/arch/x86/xen/enlighten_hvm.c
@@ -216,7 +216,25 @@ static void __init xen_hvm_guest_init(void)
 	WARN_ON(xen_cpuhp_setup(xen_cpu_up_prepare_hvm, xen_cpu_dead_hvm));
 	xen_unplug_emulated_devices();
 	x86_init.irqs.intr_init = xen_init_IRQ;
-	xen_hvm_init_time_ops();
+
+	/*
+	 * Only MAX_VIRT_CPUS 'vcpu_info' are embedded inside 'shared_info'
+	 * and the VM would use them until xen_vcpu_setup() is used to
+	 * allocate/relocate them at arbitrary address.
+	 *
+	 * However, when Xen HVM guest panic on vcpu >= MAX_VIRT_CPUS,
+	 * per_cpu(xen_vcpu, cpu) is still NULL at this stage. To access
+	 * per_cpu(xen_vcpu, cpu) via xen_clocksource_read() would panic.
+	 *
+	 * Therefore we delay xen_hvm_init_time_ops() to
+	 * xen_hvm_smp_prepare_boot_cpu() when boot vcpu is >= MAX_VIRT_CPUS.
+	 */
+	if (xen_vcpu_nr(0) >= MAX_VIRT_CPUS)
+		pr_info("Delay xen_hvm_init_time_ops() as kernel is running on vcpu=%d\n",
+			xen_vcpu_nr(0));
+	else
+		xen_hvm_init_time_ops();
+
 	xen_hvm_init_mmu_ops();
 
 #ifdef CONFIG_KEXEC_CORE
diff --git a/arch/x86/xen/smp_hvm.c b/arch/x86/xen/smp_hvm.c
index 6ff3c887e0b9..60cd4fafd188 100644
--- a/arch/x86/xen/smp_hvm.c
+++ b/arch/x86/xen/smp_hvm.c
@@ -19,6 +19,9 @@ static void __init xen_hvm_smp_prepare_boot_cpu(void)
 	 */
 	xen_vcpu_setup(0);
 
+	if (xen_vcpu_nr(0) >= MAX_VIRT_CPUS)
+		xen_hvm_init_time_ops();
+
 	/*
 	 * The alternative logic (which patches the unlock/lock) runs before
 	 * the smp bootup up code is activated. Hence we need to set this up
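
The ordering that the diff sets up across the two files can be checked with a userspace sketch. This is an illustrative model rather than kernel code: struct boot_model and all model_* functions are hypothetical, and "vcpu_info_ready" stands for per_cpu(xen_vcpu, 0) being non-NULL.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_VIRT_CPUS 32	/* value quoted in the commit message */

/* Hypothetical bookkeeping for one simulated boot. */
struct boot_model {
	int boot_vcpu;		/* stands in for xen_vcpu_nr(0) */
	bool vcpu_info_ready;	/* per_cpu(xen_vcpu, 0) != NULL */
	int time_ops_inits;	/* how often the time ops were set up */
	bool unsafe_init;	/* time ops set up while vcpu_info was NULL */
};

/* Models xen_hvm_init_time_ops(): reading the clock needs vcpu_info. */
static void model_time_ops_init(struct boot_model *m)
{
	if (!m->vcpu_info_ready)
		m->unsafe_init = true;	/* the real kernel would oops here */
	m->time_ops_inits++;
}

/* Models the patched xen_hvm_guest_init(): defer for high boot vcpus. */
static void model_guest_init(struct boot_model *m)
{
	/* Only vcpus below MAX_VIRT_CPUS have an embedded vcpu_info. */
	m->vcpu_info_ready = m->boot_vcpu < MAX_VIRT_CPUS;
	if (m->boot_vcpu >= MAX_VIRT_CPUS)
		return;		/* deferred to model_smp_prepare_boot_cpu() */
	model_time_ops_init(m);
}

/* Models the patched xen_hvm_smp_prepare_boot_cpu(). */
static void model_smp_prepare_boot_cpu(struct boot_model *m)
{
	m->vcpu_info_ready = true;	/* effect of xen_vcpu_setup(0) */
	if (m->boot_vcpu >= MAX_VIRT_CPUS)
		model_time_ops_init(m);	/* pairs with model_guest_init() */
}

/* Runs both stages in boot order and returns the final state. */
static struct boot_model model_boot(int boot_vcpu)
{
	struct boot_model m = { .boot_vcpu = boot_vcpu };

	model_guest_init(&m);
	model_smp_prepare_boot_cpu(&m);
	return m;
}
```

Calling model_boot(0) and model_boot(33) should both leave time_ops_inits at 1 with unsafe_init false. Because the two conditions must stay in sync for that to hold, a cross-referencing comment at the second call site, as requested in the review, is helpful.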
The sched_clock() can be used very early since upstream
commit 857baa87b642 ("sched/clock: Enable sched clock early"). In addition,
with upstream commit 38669ba205d1 ("x86/xen/time: Output xen sched_clock
time from 0"), kdump kernel in Xen HVM guest may panic at very early stage
when accessing &__this_cpu_read(xen_vcpu)->time as in below:

setup_arch()
 -> init_hypervisor_platform()
     -> x86_init.hyper.init_platform = xen_hvm_guest_init()
         -> xen_hvm_init_time_ops()
             -> xen_clocksource_read()
                 -> src = &__this_cpu_read(xen_vcpu)->time;

This is because Xen HVM supports at most MAX_VIRT_CPUS=32 'vcpu_info'
embedded inside 'shared_info' during early stage until xen_vcpu_setup() is
used to allocate/relocate 'vcpu_info' for boot cpu at arbitrary address.

However, when Xen HVM guest panic on vcpu >= 32, since
xen_vcpu_info_reset(0) would set per_cpu(xen_vcpu, cpu) = NULL when
vcpu >= 32, xen_clocksource_read() on vcpu >= 32 would panic.

This patch delays xen_hvm_init_time_ops() to later in
xen_hvm_smp_prepare_boot_cpu() after the 'vcpu_info' for boot vcpu is
registered when the boot vcpu is >= 32.

This issue can be reproduced on purpose via below command at the guest
side when kdump/kexec is enabled:

"taskset -c 33 echo c > /proc/sysrq-trigger"

Cc: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 arch/x86/xen/enlighten_hvm.c | 20 +++++++++++++++++++-
 arch/x86/xen/smp_hvm.c       |  3 +++
 2 files changed, 22 insertions(+), 1 deletion(-)
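
The NULL-pointer condition from the commit message can be modelled in userspace as follows. This is a simplified sketch under stated assumptions: the struct layouts are reduced to one field, MODEL_NR_CPUS and all model_* names are hypothetical, and only the "embedded slot below MAX_VIRT_CPUS, NULL otherwise" rule is taken from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_VIRT_CPUS 32	/* value quoted in the commit message */
#define MODEL_NR_CPUS 64	/* hypothetical guest size for this model */

/* Reduced stand-ins for 'vcpu_info' and 'shared_info'. */
struct model_vcpu_info {
	uint64_t system_time;
};

struct model_shared_info {
	struct model_vcpu_info vcpu_info[MAX_VIRT_CPUS];
};

static struct model_shared_info model_shared;

/* Stand-in for the per-cpu 'xen_vcpu' pointer, indexed by Linux cpu. */
static struct model_vcpu_info *model_xen_vcpu[MODEL_NR_CPUS];

/*
 * Models xen_vcpu_info_reset(): before xen_vcpu_setup() runs, only vcpu
 * ids below MAX_VIRT_CPUS have a slot embedded in shared_info.
 */
static void model_vcpu_info_reset(int cpu, int vcpu_id)
{
	if (vcpu_id < MAX_VIRT_CPUS)
		model_xen_vcpu[cpu] = &model_shared.vcpu_info[vcpu_id];
	else
		model_xen_vcpu[cpu] = NULL;
}

/* Returns 1 when a clocksource read on 'cpu' has a vcpu_info to read. */
static int model_clocksource_readable(int cpu)
{
	return model_xen_vcpu[cpu] != NULL;
}
```

In the kdump case the crash kernel boots as Linux cpu 0 on the vcpu that panicked, so the reproducer above corresponds to model_vcpu_info_reset(0, 33), after which model_clocksource_readable(0) is 0.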