mbox series

[v3,0/1] xen: fix HVM kexec kernel panic

Message ID 20220224215049.2845-1-dongli.zhang@oracle.com (mailing list archive)
Headers show
Series xen: fix HVM kexec kernel panic | expand

Message

Dongli Zhang Feb. 24, 2022, 9:50 p.m. UTC
This is the v3 of the patch to fix xen kexec kernel panic issue when the
kexec is triggered on VCPU >= 32.

PANIC: early exception 0x0e IP 10:ffffffffa96679b6 error 0 cr2 0x20
[    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 5.17.0-rc4xen-00054-gf71077a4d84b-dirty #1
[    0.000000] Hardware name: Xen HVM domU, BIOS 4.4.4OVM 12/15/2020
[    0.000000] RIP: 0010:pvclock_clocksource_read+0x6/0xb0
... ...
[    0.000000] RSP: 0000:ffffffffaae03e10 EFLAGS: 00010082 ORIG_RAX: 0000000000000000
[    0.000000] RAX: 0000000000000000 RBX: 0000000000010000 RCX: 0000000000000002
[    0.000000] RDX: 0000000000000003 RSI: ffffffffaac37515 RDI: 0000000000000020
[    0.000000] RBP: 0000000000011000 R08: 0000000000000000 R09: 0000000000000001
[    0.000000] R10: ffffffffaae03df8 R11: ffffffffaae03c68 R12: 0000000040000004
[    0.000000] R13: ffffffffaae03e50 R14: 0000000000000000 R15: 0000000000000000
[    0.000000] FS:  0000000000000000(0000) GS:ffffffffab588000(0000) knlGS:0000000000000000
[    0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.000000] CR2: 0000000000000020 CR3: 00000000ea410000 CR4: 00000000000406a0
[    0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.000000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    0.000000] Call Trace:
[    0.000000]  <TASK>
[    0.000000]  ? xen_clocksource_read+0x24/0x40
[    0.000000]  ? xen_init_time_common+0x5/0x49
[    0.000000]  ? xen_hvm_init_time_ops+0x23/0x45
[    0.000000]  ? xen_hvm_guest_init+0x221/0x25c
[    0.000000]  ? 0xffffffffa9600000
[    0.000000]  ? setup_arch+0x440/0xbd6
[    0.000000]  ? start_kernel+0x6c/0x695
[    0.000000]  ? secondary_startup_64_no_verify+0xd5/0xdb
[    0.000000]  </TASK>


Changed since v1:
  - Add commit message to explain why xen_hvm_init_time_ops() is delayed
    for any vcpus. (Suggested by Boris Ostrovsky)
  - Add a comment in xen_hvm_smp_prepare_boot_cpu() referencing the related
    code in xen_hvm_guest_init(). (suggested by Juergen Gross)
Changed since v2:
  - Delay for all VCPUs. (Suggested by Boris Ostrovsky)
  - Add commit message that why PVM is not supported by this patch
  - Test if kexec/kdump works with mainline xen (HVM and PVM)


I have delayed the xen_hvm_init_time_ops() for all VCPUs. Unfortunately,
now I am able to reproduce the clock backward as shown below on some old
versions of xen. I am not able to reproduce on most recent mainline xen.

[    0.359687] pcpu-alloc: [0] 16 17 18 19 20 21 22 23 [0] 24 25 26 27 28 29 30 31
[    0.359694] pcpu-alloc: [0] 32 33 34 35 36 37 38 39 [0] 40 41 42 43 44 45 46 47
[    0.359701] pcpu-alloc: [0] 48 49 50 51 52 53 54 55 [0] 56 57 58 59 60 61 62 63

... clock backward after the clocksource is switched from native to xen...

[    0.000004] Fallback order for Node 0: 0
[    0.002967] Built 1 zonelists, mobility grouping on.  Total pages: 3527744
[    0.007129] Policy zone: Normal
[    0.008937] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.17.0-rc4xen-00054-gf71077a4d84b-dirty root=UUID=2a5975ab-a059-4697-9aee-7a53ddfeea21 ro text console=ttyS0,115200n8 console=tty1 earlyprintk=tty    S0,115200n8 loglevel=7 no_timer_check reboot=s32 splash crashkernel=512M-:192M vt.handoff=1
[    0.023880] Unknown kernel command line parameters "text splash BOOT_IMAGE=/boot/vmlinuz-5.17.0-rc4xen-00054-gf71077a4d84b-dirty", will be passed to user space.
[    0.032647] printk: log_buf_len individual max cpu contribution: 4096 bytes
[    0.036828] printk: log_buf_len total cpu_extra contributions: 258048 bytes
[    0.041049] printk: log_buf_len min size: 262144 bytes
[    0.044481] printk: log_buf_len: 524288 bytes


Since now I am able to reproduce the clock backward on old xen version,
please let me know if I should re-use the v2 of this patch, as it has been
running well in our env well for very long time.

https://lore.kernel.org/all/20211028012543.8776-1-dongli.zhang@oracle.com/


BTW, I have tested that 'soft_reset' does not work with mainline xen, even
when I directly trigger kexec with below commands.

# kexec -l /boot/vmlinuz-5.17.0-rc4xen-00054-gf71077a4d84b-dirty \
        --initrd=/boot/initrd.img-5.17.0-rc4xen-00054-gf71077a4d84b-dirty \
        --reuse-cmdline
# kexec -e


Thank you very much!

Dongli Zhang

Comments

Boris Ostrovsky Feb. 25, 2022, 10:39 p.m. UTC | #1
On 2/24/22 4:50 PM, Dongli Zhang wrote:
> This is the v3 of the patch to fix xen kexec kernel panic issue when the
> kexec is triggered on VCPU >= 32.
>
> PANIC: early exception 0x0e IP 10:ffffffffa96679b6 error 0 cr2 0x20
> [    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 5.17.0-rc4xen-00054-gf71077a4d84b-dirty #1
> [    0.000000] Hardware name: Xen HVM domU, BIOS 4.4.4OVM 12/15/2020
> [    0.000000] RIP: 0010:pvclock_clocksource_read+0x6/0xb0
> ... ...
> [    0.000000] RSP: 0000:ffffffffaae03e10 EFLAGS: 00010082 ORIG_RAX: 0000000000000000
> [    0.000000] RAX: 0000000000000000 RBX: 0000000000010000 RCX: 0000000000000002
> [    0.000000] RDX: 0000000000000003 RSI: ffffffffaac37515 RDI: 0000000000000020
> [    0.000000] RBP: 0000000000011000 R08: 0000000000000000 R09: 0000000000000001
> [    0.000000] R10: ffffffffaae03df8 R11: ffffffffaae03c68 R12: 0000000040000004
> [    0.000000] R13: ffffffffaae03e50 R14: 0000000000000000 R15: 0000000000000000
> [    0.000000] FS:  0000000000000000(0000) GS:ffffffffab588000(0000) knlGS:0000000000000000
> [    0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    0.000000] CR2: 0000000000000020 CR3: 00000000ea410000 CR4: 00000000000406a0
> [    0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [    0.000000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [    0.000000] Call Trace:
> [    0.000000]  <TASK>
> [    0.000000]  ? xen_clocksource_read+0x24/0x40


This is done to set xen_sched_clock_offset which I think will not be used for a while, until sched_clock is called (and the other two uses are for suspend/resume)


Can we simply defer 'xen_sched_clock_offset = xen_clocksource_read();' until after all vcpu areas are properly set? Or are there other uses of xen_clocksource_read() before ?


-boris


> [    0.000000]  ? xen_init_time_common+0x5/0x49
> [    0.000000]  ? xen_hvm_init_time_ops+0x23/0x45
> [    0.000000]  ? xen_hvm_guest_init+0x221/0x25c
> [    0.000000]  ? 0xffffffffa9600000
> [    0.000000]  ? setup_arch+0x440/0xbd6
> [    0.000000]  ? start_kernel+0x6c/0x695
> [    0.000000]  ? secondary_startup_64_no_verify+0xd5/0xdb
> [    0.000000]  </TASK>
>
>
> Changed since v1:
>    - Add commit message to explain why xen_hvm_init_time_ops() is delayed
>      for any vcpus. (Suggested by Boris Ostrovsky)
>    - Add a comment in xen_hvm_smp_prepare_boot_cpu() referencing the related
>      code in xen_hvm_guest_init(). (suggested by Juergen Gross)
> Changed since v2:
>    - Delay for all VCPUs. (Suggested by Boris Ostrovsky)
>    - Add commit message that why PVM is not supported by this patch
>    - Test if kexec/kdump works with mainline xen (HVM and PVM)
>
>
> I have delayed the xen_hvm_init_time_ops() for all VCPUs. Unfortunately,
> now I am able to reproduce the clock backward as shown below on some old
> versions of xen. I am not able to reproduce on most recent mainline xen.
>
> [    0.359687] pcpu-alloc: [0] 16 17 18 19 20 21 22 23 [0] 24 25 26 27 28 29 30 31
> [    0.359694] pcpu-alloc: [0] 32 33 34 35 36 37 38 39 [0] 40 41 42 43 44 45 46 47
> [    0.359701] pcpu-alloc: [0] 48 49 50 51 52 53 54 55 [0] 56 57 58 59 60 61 62 63
>
> ... clock backward after the clocksource is switched from native to xen...
>
> [    0.000004] Fallback order for Node 0: 0
> [    0.002967] Built 1 zonelists, mobility grouping on.  Total pages: 3527744
> [    0.007129] Policy zone: Normal
> [    0.008937] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.17.0-rc4xen-00054-gf71077a4d84b-dirty root=UUID=2a5975ab-a059-4697-9aee-7a53ddfeea21 ro text console=ttyS0,115200n8 console=tty1 earlyprintk=tty    S0,115200n8 loglevel=7 no_timer_check reboot=s32 splash crashkernel=512M-:192M vt.handoff=1
> [    0.023880] Unknown kernel command line parameters "text splash BOOT_IMAGE=/boot/vmlinuz-5.17.0-rc4xen-00054-gf71077a4d84b-dirty", will be passed to user space.
> [    0.032647] printk: log_buf_len individual max cpu contribution: 4096 bytes
> [    0.036828] printk: log_buf_len total cpu_extra contributions: 258048 bytes
> [    0.041049] printk: log_buf_len min size: 262144 bytes
> [    0.044481] printk: log_buf_len: 524288 bytes
>
>
> Since now I am able to reproduce the clock backward on old xen version,
> please let me know if I should re-use the v2 of this patch, as it has been
> running well in our env well for very long time.
>
> https://lore.kernel.org/all/20211028012543.8776-1-dongli.zhang@oracle.com/
>
>
> BTW, I have tested that 'soft_reset' does not work with mainline xen, even
> when I directly trigger kexec with below commands.
>
> # kexec -l /boot/vmlinuz-5.17.0-rc4xen-00054-gf71077a4d84b-dirty \
>          --initrd=/boot/initrd.img-5.17.0-rc4xen-00054-gf71077a4d84b-dirty \
>          --reuse-cmdline
> # kexec -e
>
>
> Thank you very much!
>
> Dongli Zhang
>
>
Dongli Zhang Feb. 26, 2022, 1:17 a.m. UTC | #2
Hi Boris,

On 2/25/22 2:39 PM, Boris Ostrovsky wrote:
> 
> On 2/24/22 4:50 PM, Dongli Zhang wrote:
>> This is the v3 of the patch to fix xen kexec kernel panic issue when the
>> kexec is triggered on VCPU >= 32.
>>
>> PANIC: early exception 0x0e IP 10:ffffffffa96679b6 error 0 cr2 0x20
>> [    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted
>> 5.17.0-rc4xen-00054-gf71077a4d84b-dirty #1
>> [    0.000000] Hardware name: Xen HVM domU, BIOS 4.4.4OVM 12/15/2020
>> [    0.000000] RIP: 0010:pvclock_clocksource_read+0x6/0xb0
>> ... ...
>> [    0.000000] RSP: 0000:ffffffffaae03e10 EFLAGS: 00010082 ORIG_RAX:
>> 0000000000000000
>> [    0.000000] RAX: 0000000000000000 RBX: 0000000000010000 RCX: 0000000000000002
>> [    0.000000] RDX: 0000000000000003 RSI: ffffffffaac37515 RDI: 0000000000000020
>> [    0.000000] RBP: 0000000000011000 R08: 0000000000000000 R09: 0000000000000001
>> [    0.000000] R10: ffffffffaae03df8 R11: ffffffffaae03c68 R12: 0000000040000004
>> [    0.000000] R13: ffffffffaae03e50 R14: 0000000000000000 R15: 0000000000000000
>> [    0.000000] FS:  0000000000000000(0000) GS:ffffffffab588000(0000)
>> knlGS:0000000000000000
>> [    0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [    0.000000] CR2: 0000000000000020 CR3: 00000000ea410000 CR4: 00000000000406a0
>> [    0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [    0.000000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [    0.000000] Call Trace:
>> [    0.000000]  <TASK>
>> [    0.000000]  ? xen_clocksource_read+0x24/0x40
> 
> 
> This is done to set xen_sched_clock_offset which I think will not be used for a
> while, until sched_clock is called (and the other two uses are for suspend/resume)
> 
> 
> Can we simply defer 'xen_sched_clock_offset = xen_clocksource_read();' until
> after all vcpu areas are properly set? Or are there other uses of
> xen_clocksource_read() before ?
> 

I have tested that below patch will panic kdump kernel.

diff --git a/arch/x86/xen/smp_hvm.c b/arch/x86/xen/smp_hvm.c
index 6ff3c887e0b9..6a0c99941ae1 100644
--- a/arch/x86/xen/smp_hvm.c
+++ b/arch/x86/xen/smp_hvm.c
@@ -19,6 +19,8 @@ static void __init xen_hvm_smp_prepare_boot_cpu(void)
         */
        xen_vcpu_setup(0);

+       xen_init_sched_clock_offset();
+
        /*
         * The alternative logic (which patches the unlock/lock) runs before
         * the smp bootup up code is activated. Hence we need to set this up
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index d9c945ee1100..8a2eafa0c215 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -520,9 +520,14 @@ static void __init xen_time_init(void)
                pvclock_gtod_register_notifier(&xen_pvclock_gtod_notifier);
 }

-static void __init xen_init_time_common(void)
+void xen_init_sched_clock_offset(void)
 {
        xen_sched_clock_offset = xen_clocksource_read();
+}
+
+static void __init xen_init_time_common(void)
+{
+       xen_sched_clock_offset = 0;
        static_call_update(pv_steal_clock, xen_steal_clock);
        paravirt_set_sched_clock(xen_sched_clock);

diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index fd0fec6e92f4..9f7656214dfb 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -69,6 +69,7 @@ void xen_teardown_timer(int cpu);
 void xen_setup_cpu_clockevents(void);
 void xen_save_time_memory_area(void);
 void xen_restore_time_memory_area(void);
+void xen_init_sched_clock_offset(void);
 void xen_init_time_ops(void);
 void xen_hvm_init_time_ops(void);



Unfortunately, I am not able to obtain the panic callstack from kdump time this
time. I have only below.

PANIC: early exception 0x0e IP 10:ffffffffa6c679b6 error 0 cr2 0x20
PANIC: early exception 0x0e IP 10:ffffffffa6c679b6 error 0 cr2 0x20



The sched_clock() can be used very early since commit 857baa87b642
("sched/clock: Enable sched clock early"). Any printk should use sched_clock()
to obtain the timestamp.

vprintk_store()
-> local_clock()
   -> sched_clock()
      -> paravirt_sched_clock()
         -> xen_sched_clock()
            -> xen_clocksource_read()


AFAIR, we started to encounter the issue since commit 857baa87b642
("sched/clock: Enable sched clock early"). kdump used to work well before that
commit.

Thank you very much!

Dongli Zhang
Boris Ostrovsky Feb. 28, 2022, 8:45 p.m. UTC | #3
On 2/25/22 8:17 PM, Dongli Zhang wrote:
> Hi Boris,
> 
> On 2/25/22 2:39 PM, Boris Ostrovsky wrote:
>>
>> On 2/24/22 4:50 PM, Dongli Zhang wrote:
>>> This is the v3 of the patch to fix xen kexec kernel panic issue when the
>>> kexec is triggered on VCPU >= 32.
>>>
>>> PANIC: early exception 0x0e IP 10:ffffffffa96679b6 error 0 cr2 0x20
>>> [    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted
>>> 5.17.0-rc4xen-00054-gf71077a4d84b-dirty #1
>>> [    0.000000] Hardware name: Xen HVM domU, BIOS 4.4.4OVM 12/15/2020
>>> [    0.000000] RIP: 0010:pvclock_clocksource_read+0x6/0xb0
>>> ... ...
>>> [    0.000000] RSP: 0000:ffffffffaae03e10 EFLAGS: 00010082 ORIG_RAX:
>>> 0000000000000000
>>> [    0.000000] RAX: 0000000000000000 RBX: 0000000000010000 RCX: 0000000000000002
>>> [    0.000000] RDX: 0000000000000003 RSI: ffffffffaac37515 RDI: 0000000000000020
>>> [    0.000000] RBP: 0000000000011000 R08: 0000000000000000 R09: 0000000000000001
>>> [    0.000000] R10: ffffffffaae03df8 R11: ffffffffaae03c68 R12: 0000000040000004
>>> [    0.000000] R13: ffffffffaae03e50 R14: 0000000000000000 R15: 0000000000000000
>>> [    0.000000] FS:  0000000000000000(0000) GS:ffffffffab588000(0000)
>>> knlGS:0000000000000000
>>> [    0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [    0.000000] CR2: 0000000000000020 CR3: 00000000ea410000 CR4: 00000000000406a0
>>> [    0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [    0.000000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [    0.000000] Call Trace:
>>> [    0.000000]  <TASK>
>>> [    0.000000]  ? xen_clocksource_read+0x24/0x40
>>
>>
>> This is done to set xen_sched_clock_offset which I think will not be used for a
>> while, until sched_clock is called (and the other two uses are for suspend/resume)
>>
>>
>> Can we simply defer 'xen_sched_clock_offset = xen_clocksource_read();' until
>> after all vcpu areas are properly set? Or are there other uses of
>> xen_clocksource_read() before ?
>>
> 
> I have tested that below patch will panic kdump kernel.
> 



Oh well, so much for that then. Yes, sched_clock() is at least called from printk path.


I guess we will have to go with v2 then, we don't want to start seeing time going back, even if only with older hypervisors. The only thing I might ask is that you roll the logic inside xen_hvm_init_time_ops(). Something like


xen_hvm_init_time_ops()
{
	/*
	 * Wait until per_cpu(xen_vcpu, 0) is initialized which may happen
	 * later (e.g. when kdump kernel runs on >=MAX_VIRT_CPUS vcpu)
	 */
	if (__this_cpu_read(xen_vcpu_nr(0)) == NULL)
		return;

	...

-boris
Dongli Zhang March 1, 2022, 1:18 a.m. UTC | #4
Hi Boris,

On 2/28/22 12:45 PM, Boris Ostrovsky wrote:
> 
> 
> On 2/25/22 8:17 PM, Dongli Zhang wrote:
>> Hi Boris,
>>
>> On 2/25/22 2:39 PM, Boris Ostrovsky wrote:
>>>
>>> On 2/24/22 4:50 PM, Dongli Zhang wrote:
>>>> This is the v3 of the patch to fix xen kexec kernel panic issue when the
>>>> kexec is triggered on VCPU >= 32.
>>>>
>>>> PANIC: early exception 0x0e IP 10:ffffffffa96679b6 error 0 cr2 0x20
>>>> [    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted
>>>> 5.17.0-rc4xen-00054-gf71077a4d84b-dirty #1
>>>> [    0.000000] Hardware name: Xen HVM domU, BIOS 4.4.4OVM 12/15/2020
>>>> [    0.000000] RIP: 0010:pvclock_clocksource_read+0x6/0xb0
>>>> ... ...
>>>> [    0.000000] RSP: 0000:ffffffffaae03e10 EFLAGS: 00010082 ORIG_RAX:
>>>> 0000000000000000
>>>> [    0.000000] RAX: 0000000000000000 RBX: 0000000000010000 RCX:
>>>> 0000000000000002
>>>> [    0.000000] RDX: 0000000000000003 RSI: ffffffffaac37515 RDI:
>>>> 0000000000000020
>>>> [    0.000000] RBP: 0000000000011000 R08: 0000000000000000 R09:
>>>> 0000000000000001
>>>> [    0.000000] R10: ffffffffaae03df8 R11: ffffffffaae03c68 R12:
>>>> 0000000040000004
>>>> [    0.000000] R13: ffffffffaae03e50 R14: 0000000000000000 R15:
>>>> 0000000000000000
>>>> [    0.000000] FS:  0000000000000000(0000) GS:ffffffffab588000(0000)
>>>> knlGS:0000000000000000
>>>> [    0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [    0.000000] CR2: 0000000000000020 CR3: 00000000ea410000 CR4:
>>>> 00000000000406a0
>>>> [    0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>>> 0000000000000000
>>>> [    0.000000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>>>> 0000000000000400
>>>> [    0.000000] Call Trace:
>>>> [    0.000000]  <TASK>
>>>> [    0.000000]  ? xen_clocksource_read+0x24/0x40
>>>
>>>
>>> This is done to set xen_sched_clock_offset which I think will not be used for a
>>> while, until sched_clock is called (and the other two uses are for
>>> suspend/resume)
>>>
>>>
>>> Can we simply defer 'xen_sched_clock_offset = xen_clocksource_read();' until
>>> after all vcpu areas are properly set? Or are there other uses of
>>> xen_clocksource_read() before ?
>>>
>>
>> I have tested that below patch will panic kdump kernel.
>>
> 
> 
> 
> Oh well, so much for that then. Yes, sched_clock() is at least called from
> printk path.
> 
> 
> I guess we will have to go with v2 then, we don't want to start seeing time
> going back, even if only with older hypervisors. The only thing I might ask is
> that you roll the logic inside xen_hvm_init_time_ops(). Something like
> 
> 
> xen_hvm_init_time_ops()
> {
>     /*
>      * Wait until per_cpu(xen_vcpu, 0) is initialized which may happen
>      * later (e.g. when kdump kernel runs on >=MAX_VIRT_CPUS vcpu)
>      */
>     if (__this_cpu_read(xen_vcpu_nr(0)) == NULL)
>         return;
> 

I think you meant __this_cpu_read(xen_vcpu).

I will call xen_hvm_init_time_ops() at both places, and move the logic into
xen_hvm_init_time_ops().

Thank you very much!

Dongli Zhang
Dongli Zhang March 1, 2022, 4:56 a.m. UTC | #5
Hi Boris,

On 2/28/22 5:18 PM, Dongli Zhang wrote:
> Hi Boris,
> 
> On 2/28/22 12:45 PM, Boris Ostrovsky wrote:
>>
>>
>> On 2/25/22 8:17 PM, Dongli Zhang wrote:
>>> Hi Boris,
>>>
>>> On 2/25/22 2:39 PM, Boris Ostrovsky wrote:
>>>>
>>>> On 2/24/22 4:50 PM, Dongli Zhang wrote:
>>>>> This is the v3 of the patch to fix xen kexec kernel panic issue when the
>>>>> kexec is triggered on VCPU >= 32.
>>>>>
>>>>> PANIC: early exception 0x0e IP 10:ffffffffa96679b6 error 0 cr2 0x20
>>>>> [    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted
>>>>> 5.17.0-rc4xen-00054-gf71077a4d84b-dirty #1
>>>>> [    0.000000] Hardware name: Xen HVM domU, BIOS 4.4.4OVM 12/15/2020
>>>>> [    0.000000] RIP: 0010:pvclock_clocksource_read+0x6/0xb0
>>>>> ... ...
>>>>> [    0.000000] RSP: 0000:ffffffffaae03e10 EFLAGS: 00010082 ORIG_RAX:
>>>>> 0000000000000000
>>>>> [    0.000000] RAX: 0000000000000000 RBX: 0000000000010000 RCX:
>>>>> 0000000000000002
>>>>> [    0.000000] RDX: 0000000000000003 RSI: ffffffffaac37515 RDI:
>>>>> 0000000000000020
>>>>> [    0.000000] RBP: 0000000000011000 R08: 0000000000000000 R09:
>>>>> 0000000000000001
>>>>> [    0.000000] R10: ffffffffaae03df8 R11: ffffffffaae03c68 R12:
>>>>> 0000000040000004
>>>>> [    0.000000] R13: ffffffffaae03e50 R14: 0000000000000000 R15:
>>>>> 0000000000000000
>>>>> [    0.000000] FS:  0000000000000000(0000) GS:ffffffffab588000(0000)
>>>>> knlGS:0000000000000000
>>>>> [    0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [    0.000000] CR2: 0000000000000020 CR3: 00000000ea410000 CR4:
>>>>> 00000000000406a0
>>>>> [    0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>>>> 0000000000000000
>>>>> [    0.000000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>>>>> 0000000000000400
>>>>> [    0.000000] Call Trace:
>>>>> [    0.000000]  <TASK>
>>>>> [    0.000000]  ? xen_clocksource_read+0x24/0x40
>>>>
>>>>
>>>> This is done to set xen_sched_clock_offset which I think will not be used for a
>>>> while, until sched_clock is called (and the other two uses are for
>>>> suspend/resume)
>>>>
>>>>
>>>> Can we simply defer 'xen_sched_clock_offset = xen_clocksource_read();' until
>>>> after all vcpu areas are properly set? Or are there other uses of
>>>> xen_clocksource_read() before ?
>>>>
>>>
>>> I have tested that below patch will panic kdump kernel.
>>>
>>
>>
>>
>> Oh well, so much for that then. Yes, sched_clock() is at least called from
>> printk path.
>>
>>
>> I guess we will have to go with v2 then, we don't want to start seeing time
>> going back, even if only with older hypervisors. The only thing I might ask is
>> that you roll the logic inside xen_hvm_init_time_ops(). Something like
>>
>>
>> xen_hvm_init_time_ops()
>> {
>>     /*
>>      * Wait until per_cpu(xen_vcpu, 0) is initialized which may happen
>>      * later (e.g. when kdump kernel runs on >=MAX_VIRT_CPUS vcpu)
>>      */
>>     if (__this_cpu_read(xen_vcpu_nr(0)) == NULL)
>>         return;
>>
> 
> I think you meant __this_cpu_read(xen_vcpu).
> 
> I will call xen_hvm_init_time_ops() at both places, and move the logic into
> xen_hvm_init_time_ops().
> 
> Thank you very much!
> 
> Dongli Zhang
> 


How about we do not move the logic into xen_hvm_init_time_ops()?

Suppose we move the logic into xen_hvm_init_time_ops() line 573, the line line
570 might be printed twice.


559 void __init xen_hvm_init_time_ops(void)
560 {
561         /*
562          * vector callback is needed otherwise we cannot receive interrupts
563          * on cpu > 0 and at this point we don't know how many cpus are
564          * available.
565          */
566         if (!xen_have_vector_callback)
567                 return;
568
569         if (!xen_feature(XENFEAT_hvm_safe_pvclock)) {
570                 pr_info("Xen doesn't support pvclock on HVM, disable pv timer");
571                 return;
572         }
573
574         xen_init_time_common();
575
576         x86_init.timers.setup_percpu_clockev = xen_time_init;
577         x86_cpuinit.setup_percpu_clockev = xen_hvm_setup_cpu_clockevents;
578
579         x86_platform.set_wallclock = xen_set_wallclock;
580 }

I feel the code looks better if we keep the logic at caller side. Would you mind
letting me know your feedback?

Thank you very much!

Dongli Zhang
Boris Ostrovsky March 1, 2022, 5:21 p.m. UTC | #6
On 2/28/22 11:56 PM, Dongli Zhang wrote:
> Hi Boris,
>
> On 2/28/22 5:18 PM, Dongli Zhang wrote:
>> Hi Boris,
>>
>> On 2/28/22 12:45 PM, Boris Ostrovsky wrote:
>>>
>>> On 2/25/22 8:17 PM, Dongli Zhang wrote:
>>>> Hi Boris,
>>>>
>>>> On 2/25/22 2:39 PM, Boris Ostrovsky wrote:
>>>>> On 2/24/22 4:50 PM, Dongli Zhang wrote:
>>>>>> This is the v3 of the patch to fix xen kexec kernel panic issue when the
>>>>>> kexec is triggered on VCPU >= 32.
>>>>>>
>>>>>> PANIC: early exception 0x0e IP 10:ffffffffa96679b6 error 0 cr2 0x20
>>>>>> [    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted
>>>>>> 5.17.0-rc4xen-00054-gf71077a4d84b-dirty #1
>>>>>> [    0.000000] Hardware name: Xen HVM domU, BIOS 4.4.4OVM 12/15/2020
>>>>>> [    0.000000] RIP: 0010:pvclock_clocksource_read+0x6/0xb0
>>>>>> ... ...
>>>>>> [    0.000000] RSP: 0000:ffffffffaae03e10 EFLAGS: 00010082 ORIG_RAX:
>>>>>> 0000000000000000
>>>>>> [    0.000000] RAX: 0000000000000000 RBX: 0000000000010000 RCX:
>>>>>> 0000000000000002
>>>>>> [    0.000000] RDX: 0000000000000003 RSI: ffffffffaac37515 RDI:
>>>>>> 0000000000000020
>>>>>> [    0.000000] RBP: 0000000000011000 R08: 0000000000000000 R09:
>>>>>> 0000000000000001
>>>>>> [    0.000000] R10: ffffffffaae03df8 R11: ffffffffaae03c68 R12:
>>>>>> 0000000040000004
>>>>>> [    0.000000] R13: ffffffffaae03e50 R14: 0000000000000000 R15:
>>>>>> 0000000000000000
>>>>>> [    0.000000] FS:  0000000000000000(0000) GS:ffffffffab588000(0000)
>>>>>> knlGS:0000000000000000
>>>>>> [    0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>> [    0.000000] CR2: 0000000000000020 CR3: 00000000ea410000 CR4:
>>>>>> 00000000000406a0
>>>>>> [    0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>>>>> 0000000000000000
>>>>>> [    0.000000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>>>>>> 0000000000000400
>>>>>> [    0.000000] Call Trace:
>>>>>> [    0.000000]  <TASK>
>>>>>> [    0.000000]  ? xen_clocksource_read+0x24/0x40
>>>>>
>>>>> This is done to set xen_sched_clock_offset which I think will not be used for a
>>>>> while, until sched_clock is called (and the other two uses are for
>>>>> suspend/resume)
>>>>>
>>>>>
>>>>> Can we simply defer 'xen_sched_clock_offset = xen_clocksource_read();' until
>>>>> after all vcpu areas are properly set? Or are there other uses of
>>>>> xen_clocksource_read() before ?
>>>>>
>>>> I have tested that below patch will panic kdump kernel.
>>>>
>>>
>>>
>>> Oh well, so much for that then. Yes, sched_clock() is at least called from
>>> printk path.
>>>
>>>
>>> I guess we will have to go with v2 then, we don't want to start seeing time
>>> going back, even if only with older hypervisors. The only thing I might ask is
>>> that you roll the logic inside xen_hvm_init_time_ops(). Something like
>>>
>>>
>>> xen_hvm_init_time_ops()
>>> {
>>>      /*
>>>       * Wait until per_cpu(xen_vcpu, 0) is initialized which may happen
>>>       * later (e.g. when kdump kernel runs on >=MAX_VIRT_CPUS vcpu)
>>>       */
>>>      if (__this_cpu_read(xen_vcpu_nr(0)) == NULL)
>>>          return;
>>>
>> I think you meant __this_cpu_read(xen_vcpu).
>>
>> I will call xen_hvm_init_time_ops() at both places, and move the logic into
>> xen_hvm_init_time_ops().
>>
>> Thank you very much!
>>
>> Dongli Zhang
>>
>
> How about we do not move the logic into xen_hvm_init_time_ops()?
>
> Suppose we move the logic into xen_hvm_init_time_ops() line 573, the line line
> 570 might be printed twice.


You would need to make sure the routine is executed only once so something like a local static variable would be needed.


>
>
> 559 void __init xen_hvm_init_time_ops(void)
> 560 {
> 561         /*
> 562          * vector callback is needed otherwise we cannot receive interrupts
> 563          * on cpu > 0 and at this point we don't know how many cpus are
> 564          * available.
> 565          */
> 566         if (!xen_have_vector_callback)
> 567                 return;
> 568
> 569         if (!xen_feature(XENFEAT_hvm_safe_pvclock)) {
> 570                 pr_info("Xen doesn't support pvclock on HVM, disable pv timer");
> 571                 return;
> 572         }
> 573
> 574         xen_init_time_common();
> 575
> 576         x86_init.timers.setup_percpu_clockev = xen_time_init;
> 577         x86_cpuinit.setup_percpu_clockev = xen_hvm_setup_cpu_clockevents;
> 578
> 579         x86_platform.set_wallclock = xen_set_wallclock;
> 580 }
>
> I feel the code looks better if we keep the logic at caller side. Would you mind
> letting me know your feedback?


My preference is to keep logic concentrated in one place whenever possible.


-boris