Message ID | 20210315165800.5948-2-madvenka@linux.microsoft.com
---|---
State | New, archived
Series | arm64: Implement reliable stack trace
On Mon, Mar 15, 2021 at 11:57:53AM -0500, madvenka@linux.microsoft.com wrote:

> In summary, task pt_regs->stackframe is where a successful stack trace ends.

> 	.if	\el == 0
> -	mov	x29, xzr
> +	stp	xzr, xzr, [sp, #S_STACKFRAME]
> 	.else
> 	stp	x29, x22, [sp, #S_STACKFRAME]
> -	add	x29, sp, #S_STACKFRAME
> 	.endif
> +	add	x29, sp, #S_STACKFRAME

For both user and kernel threads this patch (at least by itself) results
in an additional record being reported in stack traces with a NULL
function pointer since it keeps the existing record where it is and adds
this new fixed record below it. This is addressed for the kernel later
in the series, by "arm64: Terminate the stack trace at TASK_FRAME and
EL0_FRAME", but will still be visible to other unwinders such as
debuggers. I'm not sure that this *matters* but it might and should at
least be called out more explicitly.

If we are going to add the extra record there would probably be less
potential for confusion if we pointed it at some sensibly named dummy
function so anything or anyone that does see it on the stack doesn't get
confused by a NULL.
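To illustrate the extra NULL entry being discussed: an arm64 frame-pointer
unwinder follows a chain of two-word {fp, lr} records, and with the patch as
posted the fixed record in pt_regs has both words zeroed. The sketch below is
illustrative only, not the kernel's actual unwinder; the struct and function
names are made up for the example.

#include <stdint.h>
#include <stdio.h>

/* An arm64 frame record: two adjacent 64-bit words, {previous fp, return pc}. */
struct frame_record {
	uint64_t fp;	/* location of the caller's frame record */
	uint64_t pc;	/* return address into the caller */
};

/*
 * Walk the chain of frame records starting at 'fp' and print each return
 * address.  With the patch as posted, the fixed record written into
 * pt_regs->stackframe has both words zeroed, so a naive walker like this
 * reports one trailing 0x0 entry before it stops on the NULL fp.
 */
static void walk_stack(uint64_t fp)
{
	while (fp) {
		const struct frame_record *rec =
			(const struct frame_record *)(uintptr_t)fp;

		printf("  pc: 0x%016llx\n", (unsigned long long)rec->pc);
		fp = rec->fp;
	}
}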
On 3/18/21 10:09 AM, Mark Brown wrote:
> On Mon, Mar 15, 2021 at 11:57:53AM -0500, madvenka@linux.microsoft.com wrote:
>
>> In summary, task pt_regs->stackframe is where a successful stack trace ends.
>
>> 	.if	\el == 0
>> -	mov	x29, xzr
>> +	stp	xzr, xzr, [sp, #S_STACKFRAME]
>> 	.else
>> 	stp	x29, x22, [sp, #S_STACKFRAME]
>> -	add	x29, sp, #S_STACKFRAME
>> 	.endif
>> +	add	x29, sp, #S_STACKFRAME
>
> For both user and kernel threads this patch (at least by itself) results
> in an additional record being reported in stack traces with a NULL
> function pointer since it keeps the existing record where it is and adds
> this new fixed record below it. This is addressed for the kernel later
> in the series, by "arm64: Terminate the stack trace at TASK_FRAME and
> EL0_FRAME", but will still be visible to other unwinders such as
> debuggers. I'm not sure that this *matters* but it might and should at
> least be called out more explicitly.
>
> If we are going to add the extra record there would probably be less
> potential for confusion if we pointed it at some sensibly named dummy
> function so anything or anyone that does see it on the stack doesn't get
> confused by a NULL.
>

I agree. I will think about this some more. If no other solution presents
itself, I will add the dummy function.

Madhavan
On Thu, Mar 18, 2021 at 03:26:13PM -0500, Madhavan T. Venkataraman wrote:
> On 3/18/21 10:09 AM, Mark Brown wrote:

> > If we are going to add the extra record there would probably be less
> > potential for confusion if we pointed it at some sensibly named dummy
> > function so anything or anyone that does see it on the stack doesn't get
> > confused by a NULL.

> I agree. I will think about this some more. If no other solution presents
> itself, I will add the dummy function.

After discussing this with Mark Rutland offlist he convinced me that so
long as we ensure the kernel doesn't print the NULL record we're
probably OK here, the effort setting the function pointer up correctly
in all circumstances (especially when we're not in the normal memory
map) is probably not worth it for the limited impact it's likely to have
to see the NULL pointer (probably mainly a person working with some
external debugger). It should be noted in the changelog though, and/or
merged in with the relevant change to the unwinder.
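The approach described here, making sure the kernel's own unwinder simply
never reports the zeroed record, is roughly what the later patch in the
series does. A minimal sketch of that idea follows; the struct, the field
names, and the final_fp bookkeeping are illustrative assumptions, not the
kernel's actual code.

#include <stdbool.h>
#include <stdint.h>

struct unwind_state {
	uint64_t fp;		/* current frame pointer */
	uint64_t pc;		/* current return address */
	uint64_t final_fp;	/* address of task_pt_regs(task)->stackframe */
};

/*
 * Advance to the previous frame; return false when the walk should stop.
 * Because the walk stops as soon as fp reaches the fixed record in
 * pt_regs, the zeroed {fp, lr} pair is never read and no NULL entry is
 * reported by the kernel's own stack dumps.
 */
static bool unwind_next(struct unwind_state *state)
{
	const uint64_t *record;

	if (!state->fp || state->fp == state->final_fp)
		return false;		/* reached the termination record */

	record = (const uint64_t *)(uintptr_t)state->fp;
	state->fp = record[0];
	state->pc = record[1];
	return true;
}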
On 3/19/21 7:30 AM, Mark Brown wrote:
> On Thu, Mar 18, 2021 at 03:26:13PM -0500, Madhavan T. Venkataraman wrote:
>> On 3/18/21 10:09 AM, Mark Brown wrote:
>
>>> If we are going to add the extra record there would probably be less
>>> potential for confusion if we pointed it at some sensibly named dummy
>>> function so anything or anyone that does see it on the stack doesn't get
>>> confused by a NULL.
>
>> I agree. I will think about this some more. If no other solution presents
>> itself, I will add the dummy function.
>
> After discussing this with Mark Rutland offlist he convinced me that so
> long as we ensure the kernel doesn't print the NULL record we're
> probably OK here, the effort setting the function pointer up correctly
> in all circumstances (especially when we're not in the normal memory
> map) is probably not worth it for the limited impact it's likely to have
> to see the NULL pointer (probably mainly a person working with some
> external debugger). It should be noted in the changelog though, and/or
> merged in with the relevant change to the unwinder.
>

OK. I will add a comment as well as note it in the changelog.

Thanks to both of you.

Madhavan
On 3/19/21 9:29 AM, Madhavan T. Venkataraman wrote:
>
> On 3/19/21 7:30 AM, Mark Brown wrote:
>> On Thu, Mar 18, 2021 at 03:26:13PM -0500, Madhavan T. Venkataraman wrote:
>>> On 3/18/21 10:09 AM, Mark Brown wrote:
>>
>>>> If we are going to add the extra record there would probably be less
>>>> potential for confusion if we pointed it at some sensibly named dummy
>>>> function so anything or anyone that does see it on the stack doesn't get
>>>> confused by a NULL.
>>
>>> I agree. I will think about this some more. If no other solution presents
>>> itself, I will add the dummy function.
>>
>> After discussing this with Mark Rutland offlist he convinced me that so
>> long as we ensure the kernel doesn't print the NULL record we're
>> probably OK here, the effort setting the function pointer up correctly
>> in all circumstances (especially when we're not in the normal memory
>> map) is probably not worth it for the limited impact it's likely to have
>> to see the NULL pointer (probably mainly a person working with some
>> external debugger). It should be noted in the changelog though, and/or
>> merged in with the relevant change to the unwinder.
>>
>
> OK. I will add a comment as well as note it in the changelog.
>
> Thanks to both of you.
>
> Madhavan
>

I thought about this some more. I think I have a simple solution. I will
prepare a patch and send it out. If you and Mark Rutland approve, I will
include it in the next version of this RFC.

Madhavan
On 3/19/21 1:19 PM, Madhavan T. Venkataraman wrote:
>
> On 3/19/21 9:29 AM, Madhavan T. Venkataraman wrote:
>>
>> On 3/19/21 7:30 AM, Mark Brown wrote:
>>> On Thu, Mar 18, 2021 at 03:26:13PM -0500, Madhavan T. Venkataraman wrote:
>>>> On 3/18/21 10:09 AM, Mark Brown wrote:
>>>
>>>>> If we are going to add the extra record there would probably be less
>>>>> potential for confusion if we pointed it at some sensibly named dummy
>>>>> function so anything or anyone that does see it on the stack doesn't get
>>>>> confused by a NULL.
>>>
>>>> I agree. I will think about this some more. If no other solution presents
>>>> itself, I will add the dummy function.
>>>
>>> After discussing this with Mark Rutland offlist he convinced me that so
>>> long as we ensure the kernel doesn't print the NULL record we're
>>> probably OK here, the effort setting the function pointer up correctly
>>> in all circumstances (especially when we're not in the normal memory
>>> map) is probably not worth it for the limited impact it's likely to have
>>> to see the NULL pointer (probably mainly a person working with some
>>> external debugger). It should be noted in the changelog though, and/or
>>> merged in with the relevant change to the unwinder.
>>>
>>
>> OK. I will add a comment as well as note it in the changelog.
>>
>> Thanks to both of you.
>>
>> Madhavan
>>
>
> I thought about this some more. I think I have a simple solution. I will
> prepare a patch and send it out. If you and Mark Rutland approve, I will
> include it in the next version of this RFC.
>
> Madhavan
>

I solved this by using existing functions logically instead of inventing a
dummy function. I initialize pt_regs->stackframe[1] to an existing function
so that the stack trace will not show a 0x0 entry as well as the kernel and
gdb will show identical stack traces.

For all task stack traces including the idle tasks, the stack trace will
end at copy_thread() as copy_thread() is the function that initializes the
pt_regs and the first stack frame for a task.

For EL0 exceptions, the stack trace will end with vectors() as vectors
entries call the EL0 handlers.

Here are sample stack traces (I only show the ending of each trace):

Idle task on primary CPU
========================

	...
	[    0.022557]  start_kernel+0x5b8/0x5f4
	[    0.022570]  __primary_switched+0xa8/0xb8
	[    0.022578]  copy_thread+0x0/0x188

Idle task on secondary CPU
==========================

	...
	[    0.023397]  secondary_start_kernel+0x188/0x1e0
	[    0.023406]  __secondary_switched+0x40/0x88
	[    0.023415]  copy_thread+0x0/0x188

All other kernel threads
========================

	...
	[   13.501062]  ret_from_fork+0x10/0x18
	[   13.507998]  copy_thread+0x0/0x188

User threads (EL0 exception)
============================

write(2) system call example:

	...
	[  521.686148]  vfs_write+0xc8/0x2c0
	[  521.686156]  ksys_write+0x74/0x108
	[  521.686161]  __arm64_sys_write+0x24/0x30
	[  521.686166]  el0_svc_common.constprop.0+0x70/0x1a8
	[  521.686175]  do_el0_svc+0x2c/0x98
	[  521.686180]  el0_svc+0x2c/0x70
	[  521.686188]  el0_sync_handler+0xb0/0xb8
	[  521.686193]  el0_sync+0x17c/0x180
	[  521.686198]  vectors+0x0/0x7d8

Here are the code changes:

========================================================================
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index a31a0a713c85..514307e80b79 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -261,16 +261,19 @@ alternative_else_nop_endif
 	stp	lr, x21, [sp, #S_LR]
 
 	/*
-	 * For exceptions from EL0, terminate the callchain here.
+	 * For exceptions from EL0, terminate the callchain here at
+	 * task_pt_regs(current)->stackframe.
+	 *
 	 * For exceptions from EL1, create a synthetic frame record so the
 	 * interrupted code shows up in the backtrace.
 	 */
 	.if \el == 0
-	mov	x29, xzr
+	ldr	x17, =vectors
+	stp	xzr, x17, [sp, #S_STACKFRAME]
 	.else
 	stp	x29, x22, [sp, #S_STACKFRAME]
-	add	x29, sp, #S_STACKFRAME
 	.endif
+	add	x29, sp, #S_STACKFRAME
 
 #ifdef CONFIG_ARM64_SW_TTBR0_PAN
 alternative_if_not ARM64_HAS_PAN
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 66b0e0b66e31..699e0dd313a1 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -393,6 +393,29 @@ SYM_FUNC_START_LOCAL(__create_page_tables)
 	ret	x28
 SYM_FUNC_END(__create_page_tables)
 
+/*
+ * The boot task becomes the idle task for the primary CPU. The
+ * CPU startup task on each secondary CPU becomes the idle task
+ * for the secondary CPU.
+ *
+ * The idle task does not require pt_regs. But create a dummy
+ * pt_regs so that task_pt_regs(idle_task)->stackframe can be
+ * set up to be the last frame on the idle task stack just like
+ * all the other kernel tasks. This helps the unwinder to
+ * terminate the stack trace at a well-known stack offset.
+ *
+ * Also, set up the last return PC to be copy_thread() just
+ * like all the other kernel tasks so that the stack trace of
+ * all kernel tasks ends with the same function.
+ */
+	.macro setup_last_frame
+	sub	sp, sp, #PT_REGS_SIZE
+	ldr	x17, =copy_thread
+	stp	xzr, x17, [sp, #S_STACKFRAME]
+	add	x29, sp, #S_STACKFRAME
+	adr	x30, #0
+	.endm
+
 /*
  * The following fragment of code is executed with the MMU enabled.
  *
@@ -447,8 +470,7 @@ SYM_FUNC_START_LOCAL(__primary_switched)
 #endif
 	bl	switch_to_vhe		// Prefer VHE if possible
 	add	sp, sp, #16
-	mov	x29, #0
-	mov	x30, #0
+	setup_last_frame
 	b	start_kernel
 SYM_FUNC_END(__primary_switched)
 
@@ -606,8 +628,7 @@ SYM_FUNC_START_LOCAL(__secondary_switched)
 	cbz	x2, __secondary_too_slow
 	msr	sp_el0, x2
 	scs_load x2, x3
-	mov	x29, #0
-	mov	x30, #0
+	setup_last_frame
 
 #ifdef CONFIG_ARM64_PTR_AUTH
 	ptrauth_keys_init_cpu x2, x3, x4, x5
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 325c83b1a24d..9050699ff67c 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -437,6 +437,12 @@ int copy_thread(unsigned long clone_flags, unsigned long stack_start,
 	}
 	p->thread.cpu_context.pc = (unsigned long)ret_from_fork;
 	p->thread.cpu_context.sp = (unsigned long)childregs;
+	/*
+	 * For the benefit of the unwinder, set up childregs->stackframe
+	 * as the last frame for the new task.
+	 */
+	p->thread.cpu_context.fp = (unsigned long)childregs->stackframe;
+	childregs->stackframe[1] = (unsigned long)copy_thread;
 
 	ptrace_hw_copy_thread(p);
======================================================================

If you approve, the above will become RFC Patch v3 1/8 in the next version.

Let me know.

Also, I could introduce an extra frame in the EL1 exception stack trace that
includes vectors so the stack trace would look like this (timer interrupt
example):

	call_timer_fn
	run_timer_softirq
	__do_softirq
	irq_exit
	__handle_domain_irq
	gic_handle_irq
	el1_irq
	vectors

This way, if the unwinder finds vectors, it knows that it is an exception
frame.

Madhavan
On Fri, Mar 19, 2021 at 05:03:09PM -0500, Madhavan T. Venkataraman wrote:
> I solved this by using existing functions logically instead of inventing a
> dummy function. I initialize pt_regs->stackframe[1] to an existing function
> so that the stack trace will not show a 0x0 entry as well as the kernel and
> gdb will show identical stack traces.
>
> For all task stack traces including the idle tasks, the stack trace will
> end at copy_thread() as copy_thread() is the function that initializes the
> pt_regs and the first stack frame for a task.

I don't think this is a good idea, as it will mean that copy_thread()
will appear to be live in every thread, and therefore will not be
patchable.

There are other things people need to be aware of when using an external
debugger (e.g. during EL0<->ELx transitions there are periods when X29
and X30 contain the EL0 values, and cannot be used to unwind), so I
don't think there's a strong need to make this look prettier to an
external debugger.

> For EL0 exceptions, the stack trace will end with vectors() as vectors
> entries call the EL0 handlers.
>
> Here are sample stack traces (I only show the ending of each trace):
>
> Idle task on primary CPU
> ========================
>
> ...
> [    0.022557]  start_kernel+0x5b8/0x5f4
> [    0.022570]  __primary_switched+0xa8/0xb8
> [    0.022578]  copy_thread+0x0/0x188
>
> Idle task on secondary CPU
> ==========================
>
> ...
> [    0.023397]  secondary_start_kernel+0x188/0x1e0
> [    0.023406]  __secondary_switched+0x40/0x88
> [    0.023415]  copy_thread+0x0/0x188
>
> All other kernel threads
> ========================
>
> ...
> [   13.501062]  ret_from_fork+0x10/0x18
> [   13.507998]  copy_thread+0x0/0x188
>
> User threads (EL0 exception)
> ============================
>
> write(2) system call example:
>
> ...
> [  521.686148]  vfs_write+0xc8/0x2c0
> [  521.686156]  ksys_write+0x74/0x108
> [  521.686161]  __arm64_sys_write+0x24/0x30
> [  521.686166]  el0_svc_common.constprop.0+0x70/0x1a8
> [  521.686175]  do_el0_svc+0x2c/0x98
> [  521.686180]  el0_svc+0x2c/0x70
> [  521.686188]  el0_sync_handler+0xb0/0xb8
> [  521.686193]  el0_sync+0x17c/0x180
> [  521.686198]  vectors+0x0/0x7d8

[...]

> If you approve, the above will become RFC Patch v3 1/8 in the next version.

As above, I don't think we should repurpose an existing function here,
and my preference is to use 0x0.

> Let me know.
>
> Also, I could introduce an extra frame in the EL1 exception stack trace that
> includes vectors so the stack trace would look like this (timer interrupt
> example):
>
> call_timer_fn
> run_timer_softirq
> __do_softirq
> irq_exit
> __handle_domain_irq
> gic_handle_irq
> el1_irq
> vectors
>
> This way, if the unwinder finds vectors, it knows that it is an exception
> frame.

I can see this might make it simpler to detect exception boundaries, but
I suspect that we need other information anyway, so this doesn't become
all that helpful. For EL0<->EL1 exception boundaries we want to
successfully terminate a robust stacktrace whereas for EL1<->EL1
exception boundaries we want to fail a robust stacktrace.

I reckon we have to figure that out from the el1_* and el0_* entry
points (which I am working to reduce/simplify as part of the entry
assembly conversion to C). With that we can terminate unwind at the
el0_* parts, and reject unwinding across any other bit of .entry.text.

Thanks,
Mark.
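The distinction drawn here -- terminate the trace successfully at an EL0
entry point, fail a reliable trace on any other exception boundary -- can be
pictured with a small sketch. Everything below is an illustrative assumption:
entry_text_range, is_el0_entry and unwind_is_reliable are placeholder names,
not the functions the eventual series uses.

#include <stdbool.h>
#include <stdint.h>

/*
 * Boundaries of the entry code.  In the kernel these would come from the
 * linker-provided .entry.text section symbols; here they are plain
 * parameters so the sketch stands alone.
 */
struct entry_text_range {
	uint64_t start;
	uint64_t end;
};

static bool in_entry_text(const struct entry_text_range *r, uint64_t pc)
{
	return pc >= r->start && pc < r->end;
}

/*
 * A return address inside .entry.text means the trace is crossing an
 * exception boundary.  The idea sketched above: an el0_* entry point is a
 * legitimate end of the call chain (terminate successfully), while any
 * other entry-code address implies an EL1<->EL1 exception and the trace
 * should be rejected as unreliable.  is_el0_entry() stands in for however
 * the final series identifies the el0_* entry points.
 */
static bool unwind_is_reliable(const struct entry_text_range *r, uint64_t pc,
			       bool (*is_el0_entry)(uint64_t pc))
{
	if (!in_entry_text(r, pc))
		return true;
	return is_el0_entry(pc);
}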
On 3/23/21 5:24 AM, Mark Rutland wrote:
> On Fri, Mar 19, 2021 at 05:03:09PM -0500, Madhavan T. Venkataraman wrote:
>> I solved this by using existing functions logically instead of inventing a
>> dummy function. I initialize pt_regs->stackframe[1] to an existing function
>> so that the stack trace will not show a 0x0 entry as well as the kernel and
>> gdb will show identical stack traces.
>>
>> For all task stack traces including the idle tasks, the stack trace will
>> end at copy_thread() as copy_thread() is the function that initializes the
>> pt_regs and the first stack frame for a task.
>
> I don't think this is a good idea, as it will mean that copy_thread()
> will appear to be live in every thread, and therefore will not be
> patchable.
>
> There are other things people need to be aware of when using an external
> debugger (e.g. during EL0<->ELx transitions there are periods when X29
> and X30 contain the EL0 values, and cannot be used to unwind), so I
> don't think there's a strong need to make this look prettier to an
> external debugger.
>

OK.

>> For EL0 exceptions, the stack trace will end with vectors() as vectors
>> entries call the EL0 handlers.
>>
>> Here are sample stack traces (I only show the ending of each trace):
>>
>> Idle task on primary CPU
>> ========================
>>
>> ...
>> [    0.022557]  start_kernel+0x5b8/0x5f4
>> [    0.022570]  __primary_switched+0xa8/0xb8
>> [    0.022578]  copy_thread+0x0/0x188
>>
>> Idle task on secondary CPU
>> ==========================
>>
>> ...
>> [    0.023397]  secondary_start_kernel+0x188/0x1e0
>> [    0.023406]  __secondary_switched+0x40/0x88
>> [    0.023415]  copy_thread+0x0/0x188
>>
>> All other kernel threads
>> ========================
>>
>> ...
>> [   13.501062]  ret_from_fork+0x10/0x18
>> [   13.507998]  copy_thread+0x0/0x188
>>
>> User threads (EL0 exception)
>> ============================
>>
>> write(2) system call example:
>>
>> ...
>> [  521.686148]  vfs_write+0xc8/0x2c0
>> [  521.686156]  ksys_write+0x74/0x108
>> [  521.686161]  __arm64_sys_write+0x24/0x30
>> [  521.686166]  el0_svc_common.constprop.0+0x70/0x1a8
>> [  521.686175]  do_el0_svc+0x2c/0x98
>> [  521.686180]  el0_svc+0x2c/0x70
>> [  521.686188]  el0_sync_handler+0xb0/0xb8
>> [  521.686193]  el0_sync+0x17c/0x180
>> [  521.686198]  vectors+0x0/0x7d8
>
> [...]
>
>> If you approve, the above will become RFC Patch v3 1/8 in the next version.
>
> As above, I don't think we should repurpose an existing function here,
> and my preference is to use 0x0.
>

OK.

>> Let me know.
>>
>> Also, I could introduce an extra frame in the EL1 exception stack trace that
>> includes vectors so the stack trace would look like this (timer interrupt
>> example):
>>
>> call_timer_fn
>> run_timer_softirq
>> __do_softirq
>> irq_exit
>> __handle_domain_irq
>> gic_handle_irq
>> el1_irq
>> vectors
>>
>> This way, if the unwinder finds vectors, it knows that it is an exception
>> frame.
>
> I can see this might make it simpler to detect exception boundaries, but
> I suspect that we need other information anyway, so this doesn't become
> all that helpful. For EL0<->EL1 exception boundaries we want to
> successfully terminate a robust stacktrace whereas for EL1<->EL1
> exception boundaries we want to fail a robust stacktrace.
>
> I reckon we have to figure that out from the el1_* and el0_* entry
> points (which I am working to reduce/simplify as part of the entry
> assembly conversion to C). With that we can terminate unwind at the
> el0_* parts, and reject unwinding across any other bit of .entry.text.
>

OK. That is fine.

Thanks.

Madhavan
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index a31a0a713c85..e2dc2e998934 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -261,16 +261,18 @@ alternative_else_nop_endif
 	stp	lr, x21, [sp, #S_LR]
 
 	/*
-	 * For exceptions from EL0, terminate the callchain here.
+	 * For exceptions from EL0, terminate the callchain here at
+	 * task_pt_regs(current)->stackframe.
+	 *
 	 * For exceptions from EL1, create a synthetic frame record so the
 	 * interrupted code shows up in the backtrace.
 	 */
 	.if \el == 0
-	mov	x29, xzr
+	stp	xzr, xzr, [sp, #S_STACKFRAME]
 	.else
 	stp	x29, x22, [sp, #S_STACKFRAME]
-	add	x29, sp, #S_STACKFRAME
 	.endif
+	add	x29, sp, #S_STACKFRAME
 
 #ifdef CONFIG_ARM64_SW_TTBR0_PAN
 alternative_if_not ARM64_HAS_PAN
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 66b0e0b66e31..2769b20934d4 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -393,6 +393,28 @@ SYM_FUNC_START_LOCAL(__create_page_tables)
 	ret	x28
 SYM_FUNC_END(__create_page_tables)
 
+/*
+ * The boot task becomes the idle task for the primary CPU. The
+ * CPU startup task on each secondary CPU becomes the idle task
+ * for the secondary CPU.
+ *
+ * The idle task does not require pt_regs. But create a dummy
+ * pt_regs so that task_pt_regs(idle_task)->stackframe can be
+ * set up to be the last frame on the idle task stack just like
+ * all the other kernel tasks. This helps the unwinder to
+ * terminate the stack trace at a well-known stack offset.
+ *
+ * Also, set up the last return PC to be ret_from_fork() just
+ * like all the other kernel tasks so that the stack trace of
+ * all kernel tasks ends with the same function.
+ */
+	.macro setup_last_frame
+	sub	sp, sp, #PT_REGS_SIZE
+	stp	xzr, xzr, [sp, #S_STACKFRAME]
+	add	x29, sp, #S_STACKFRAME
+	ldr	x30, =ret_from_fork
+	.endm
+
 /*
  * The following fragment of code is executed with the MMU enabled.
  *
@@ -447,8 +469,7 @@ SYM_FUNC_START_LOCAL(__primary_switched)
 #endif
 	bl	switch_to_vhe		// Prefer VHE if possible
 	add	sp, sp, #16
-	mov	x29, #0
-	mov	x30, #0
+	setup_last_frame
 	b	start_kernel
 SYM_FUNC_END(__primary_switched)
 
@@ -606,8 +627,7 @@ SYM_FUNC_START_LOCAL(__secondary_switched)
 	cbz	x2, __secondary_too_slow
 	msr	sp_el0, x2
 	scs_load x2, x3
-	mov	x29, #0
-	mov	x30, #0
+	setup_last_frame
 
 #ifdef CONFIG_ARM64_PTR_AUTH
 	ptrauth_keys_init_cpu x2, x3, x4, x5
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 325c83b1a24d..7ffa689e8b60 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -437,6 +437,11 @@ int copy_thread(unsigned long clone_flags, unsigned long stack_start,
 	}
 	p->thread.cpu_context.pc = (unsigned long)ret_from_fork;
 	p->thread.cpu_context.sp = (unsigned long)childregs;
+	/*
+	 * For the benefit of the unwinder, set up childregs->stackframe
+	 * as the last frame for the new task.
+	 */
+	p->thread.cpu_context.fp = (unsigned long)childregs->stackframe;
 
 	ptrace_hw_copy_thread(p);
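For readers unfamiliar with the layout the patch relies on: the pt_regs saved
on exception entry (and the dummy pt_regs created for kernel and idle tasks)
live at the top of the task's kernel stack, so the stackframe field the patch
initialises sits at the same offset from the stack top in every task. The
picture below is a simplified, illustrative sketch, not the kernel's exact
struct pt_regs, constants, or task_pt_regs() definition.

#include <stdint.h>

/* Simplified pt_regs: only the pieces relevant to the termination record. */
struct pt_regs_sketch {
	uint64_t regs[31];
	uint64_t sp;
	uint64_t pc;
	uint64_t pstate;
	/* ... other saved state ... */
	uint64_t stackframe[2];	/* the {fp, lr} record the unwinder stops at */
};

#define THREAD_SIZE_SKETCH	(16 * 1024)		/* illustrative stack size */
#define PT_REGS_SIZE_SKETCH	sizeof(struct pt_regs_sketch)

/*
 * pt_regs sit at the top of the task's kernel stack, so the termination
 * record is always at the same, well-known offset from the stack top --
 * which is what lets the unwinder recognise where a trace should end.
 */
static inline struct pt_regs_sketch *task_pt_regs_sketch(void *stack_base)
{
	return (struct pt_regs_sketch *)
		((char *)stack_base + THREAD_SIZE_SKETCH - PT_REGS_SIZE_SKETCH);
}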