Message ID | 20220511060521.465744-1-sumit.garg@linaro.org (mailing list archive) |
---|---|
Headers | show |
Series | arm64: Fix pending single-step debugging issues | expand |
Hi, On Tue, May 10, 2022 at 11:05 PM Sumit Garg <sumit.garg@linaro.org> wrote: > > This patch-set reworks pending fixes from Wei's series [1] to make > single-step debugging via kgdb/kdb on arm64 work as expected. There was > a prior discussion on ML [2] regarding if we should keep the interrupts > enabled during single-stepping. So patch #1 follows suggestion from Will > [3] to not disable interrupts during single stepping but rather skip > single stepping within interrupt handler. > > [1] https://lore.kernel.org/all/20200509214159.19680-1-liwei391@huawei.com/ > [2] https://lore.kernel.org/all/CAD=FV=Voyfq3Qz0T3RY+aYWYJ0utdH=P_AweB=13rcV8GDBeyQ@mail.gmail.com/ > [3] https://lore.kernel.org/all/20200626095551.GA9312@willie-the-truck/ > > Changes in v3: > - Reword commit descriptions as per Daniel's suggestions. > > Changes in v2: > - Replace patch #1 to rather follow Will's suggestion. > > Sumit Garg (2): > arm64: entry: Skip single stepping into interrupt handlers > arm64: kgdb: Set PSTATE.SS to 1 to re-enable single-step > > arch/arm64/include/asm/debug-monitors.h | 1 + > arch/arm64/kernel/debug-monitors.c | 5 +++++ > arch/arm64/kernel/entry-common.c | 18 +++++++++++++++++- > arch/arm64/kernel/kgdb.c | 2 ++ > 4 files changed, 25 insertions(+), 1 deletion(-) Sorry it took so long for me to respond. I kept dreaming that I'd find the time to really dig deep into this to understand it fully and I'm finally giving up on it. I'm going to hope that Will and/or Catalin knows this area of the code well and can give it a good review. If not then I'll strive harder to make the time... In any case, I poked around with this a bunch and it definitely improved the stepping behavior a whole lot. I still got one case where gdb hit an assertion while I was stepping, but I could believe that was a problem with gdb? I couldn't reproduce it. Thus I can at least give: Tested-by: Douglas Anderson <dianders@chromium.org> I'll also note that I _think_ I remember that with Wei's series that the gdb function "call" started working. I tried that here and it didn't seem so happy. To keep things simple, I created a dummy function in my kernel that looked like: void doug_test(void) { pr_info("testing, 1 2 3\n"); } I broke into the debugger by echoing "g" to /proc/sysrq-trigger and then tried "call doug_test()". I guess my printout actually printed but it wasn't so happy after that. Seems like it somehow ended up returning to a bogus address after the call which then caused a crash. testing, 1 2 3 BUG: sleeping function called from invalid context at arch/arm64/mm/fault.c:593 in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 3393, name: bash preempt_count: 0, expected: 0 RCU nest depth: 1, expected: 0 CPU: 6 PID: 3393 Comm: bash Not tainted 5.19.0-rc4+ #3 dbec0bdb8582e447bccdcf2e70d7fe04477b1aac Hardware name: Google Herobrine (rev1+) (DT) Call trace: dump_backtrace+0xf0/0x110 show_stack+0x24/0x70 dump_stack_lvl+0x64/0x7c dump_stack+0x18/0x38 __might_resched+0x144/0x154 __might_sleep+0x54/0x84 do_page_fault+0x1d4/0x42c do_mem_abort+0x4c/0xb0 el1_abort+0x3c/0x5c el1h_64_sync_handler+0x4c/0xc4 el1h_64_sync+0x64/0x68 0xffffffc008000000 __handle_sysrq+0x15c/0x184 write_sysrq_trigger+0x94/0x128 proc_reg_write+0xbc/0xec vfs_write+0xf0/0x2c8 ksys_write+0x84/0xf0 __arm64_sys_write+0x28/0x34 invoke_syscall+0x4c/0x120 el0_svc_common+0x94/0xfc do_el0_svc+0x38/0xc0 el0_svc+0x2c/0x7c el0t_64_sync_handler+0x48/0x114 el0t_64_sync+0x18c/0x190 Unable to handle kernel execute from non-executable memory at virtual address ffffffc008000000 Mem abort info: ESR = 0x000000008600000f EC = 0x21: IABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x0f: level 3 permission fault swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000082863000 [ffffffc008000000] pgd=100000027ffff003, p4d=100000027ffff003, pud=100000027ffff003, pmd=100000027fffe003, pte=00680001001c3703 Internal error: Oops: 8600000f [#1] PREEMPT SMP I'm not sure if that's a sign that something is missing with your patch or not. -Doug
On Fri, Jul 01, 2022 at 03:14:16PM -0700, Doug Anderson wrote: > Hi, > > On Tue, May 10, 2022 at 11:05 PM Sumit Garg <sumit.garg@linaro.org> wrote: > > > > This patch-set reworks pending fixes from Wei's series [1] to make > > single-step debugging via kgdb/kdb on arm64 work as expected. There was > > a prior discussion on ML [2] regarding if we should keep the interrupts > > enabled during single-stepping. So patch #1 follows suggestion from Will > > [3] to not disable interrupts during single stepping but rather skip > > single stepping within interrupt handler. > > > > [1] https://lore.kernel.org/all/20200509214159.19680-1-liwei391@huawei.com/ > > [2] https://lore.kernel.org/all/CAD=FV=Voyfq3Qz0T3RY+aYWYJ0utdH=P_AweB=13rcV8GDBeyQ@mail.gmail.com/ > > [3] https://lore.kernel.org/all/20200626095551.GA9312@willie-the-truck/ > > > > Changes in v3: > > - Reword commit descriptions as per Daniel's suggestions. > > > > Changes in v2: > > - Replace patch #1 to rather follow Will's suggestion. > > > > Sumit Garg (2): > > arm64: entry: Skip single stepping into interrupt handlers > > arm64: kgdb: Set PSTATE.SS to 1 to re-enable single-step > > > > arch/arm64/include/asm/debug-monitors.h | 1 + > > arch/arm64/kernel/debug-monitors.c | 5 +++++ > > arch/arm64/kernel/entry-common.c | 18 +++++++++++++++++- > > arch/arm64/kernel/kgdb.c | 2 ++ > > 4 files changed, 25 insertions(+), 1 deletion(-) > > Sorry it took so long for me to respond. I kept dreaming that I'd find > the time to really dig deep into this to understand it fully and I'm > finally giving up on it. I'm going to hope that Will and/or Catalin > knows this area of the code well and can give it a good review. If not > then I'll strive harder to make the time... So the good news is that I spent a couple of days on this last week. The bad news is that I'm nowhere done and about to disappear on holiday for a week! But anyway, I might be able to give this a review when I'm back. Failing that, I wonder if enough of us have a little bit of time each that we could hack on an agreed design together which covers the debug exception behaviour beyond kgdb? Will
Hi, On Fri, Jul 8, 2022 at 9:31 AM Will Deacon <will@kernel.org> wrote: > > On Fri, Jul 01, 2022 at 03:14:16PM -0700, Doug Anderson wrote: > > Hi, > > > > On Tue, May 10, 2022 at 11:05 PM Sumit Garg <sumit.garg@linaro.org> wrote: > > > > > > This patch-set reworks pending fixes from Wei's series [1] to make > > > single-step debugging via kgdb/kdb on arm64 work as expected. There was > > > a prior discussion on ML [2] regarding if we should keep the interrupts > > > enabled during single-stepping. So patch #1 follows suggestion from Will > > > [3] to not disable interrupts during single stepping but rather skip > > > single stepping within interrupt handler. > > > > > > [1] https://lore.kernel.org/all/20200509214159.19680-1-liwei391@huawei.com/ > > > [2] https://lore.kernel.org/all/CAD=FV=Voyfq3Qz0T3RY+aYWYJ0utdH=P_AweB=13rcV8GDBeyQ@mail.gmail.com/ > > > [3] https://lore.kernel.org/all/20200626095551.GA9312@willie-the-truck/ > > > > > > Changes in v3: > > > - Reword commit descriptions as per Daniel's suggestions. > > > > > > Changes in v2: > > > - Replace patch #1 to rather follow Will's suggestion. > > > > > > Sumit Garg (2): > > > arm64: entry: Skip single stepping into interrupt handlers > > > arm64: kgdb: Set PSTATE.SS to 1 to re-enable single-step > > > > > > arch/arm64/include/asm/debug-monitors.h | 1 + > > > arch/arm64/kernel/debug-monitors.c | 5 +++++ > > > arch/arm64/kernel/entry-common.c | 18 +++++++++++++++++- > > > arch/arm64/kernel/kgdb.c | 2 ++ > > > 4 files changed, 25 insertions(+), 1 deletion(-) > > > > Sorry it took so long for me to respond. I kept dreaming that I'd find > > the time to really dig deep into this to understand it fully and I'm > > finally giving up on it. I'm going to hope that Will and/or Catalin > > knows this area of the code well and can give it a good review. If not > > then I'll strive harder to make the time... > > So the good news is that I spent a couple of days on this last week. Excellent, thanks! > The bad news is that I'm nowhere done and about to disappear on holiday > for a week! No worries. It's been broken for so many years and isn't the most urgent thing, but it's also one of those things that I do eventually want to get fixed so I just want to make sure it doesn't get put off indefinitely... > But anyway, I might be able to give this a review when I'm back. Failing > that, I wonder if enough of us have a little bit of time each that we > could hack on an agreed design together which covers the debug exception > behaviour beyond kgdb? If it will unblock this then I can figure out how to make time for it. Definitely my biggest utility would be in testing, but I'm also happy to knock around ideas. ...and as per above if the way to move forward is to block off a day or two to learn more about all the status bits and how they interact in the kernel then I can do that too. Barring unforeseen circumstances I'll also be in Dublin in September if that's somehow useful. ;-) -Doug
Hi Doug, On Sat, 2 Jul 2022 at 03:44, Doug Anderson <dianders@chromium.org> wrote: > > Hi, > > On Tue, May 10, 2022 at 11:05 PM Sumit Garg <sumit.garg@linaro.org> wrote: > > > > This patch-set reworks pending fixes from Wei's series [1] to make > > single-step debugging via kgdb/kdb on arm64 work as expected. There was > > a prior discussion on ML [2] regarding if we should keep the interrupts > > enabled during single-stepping. So patch #1 follows suggestion from Will > > [3] to not disable interrupts during single stepping but rather skip > > single stepping within interrupt handler. > > > > [1] https://lore.kernel.org/all/20200509214159.19680-1-liwei391@huawei.com/ > > [2] https://lore.kernel.org/all/CAD=FV=Voyfq3Qz0T3RY+aYWYJ0utdH=P_AweB=13rcV8GDBeyQ@mail.gmail.com/ > > [3] https://lore.kernel.org/all/20200626095551.GA9312@willie-the-truck/ > > > > Changes in v3: > > - Reword commit descriptions as per Daniel's suggestions. > > > > Changes in v2: > > - Replace patch #1 to rather follow Will's suggestion. > > > > Sumit Garg (2): > > arm64: entry: Skip single stepping into interrupt handlers > > arm64: kgdb: Set PSTATE.SS to 1 to re-enable single-step > > > > arch/arm64/include/asm/debug-monitors.h | 1 + > > arch/arm64/kernel/debug-monitors.c | 5 +++++ > > arch/arm64/kernel/entry-common.c | 18 +++++++++++++++++- > > arch/arm64/kernel/kgdb.c | 2 ++ > > 4 files changed, 25 insertions(+), 1 deletion(-) > > Sorry it took so long for me to respond. I kept dreaming that I'd find > the time to really dig deep into this to understand it fully and I'm > finally giving up on it. No worries and apologies on my part as well as I had to find some time to reproduce the issue that you have reported below. > I'm going to hope that Will and/or Catalin > knows this area of the code well and can give it a good review. If not > then I'll strive harder to make the time... > > In any case, I poked around with this a bunch and it definitely > improved the stepping behavior a whole lot. I still got one case where > gdb hit an assertion while I was stepping, but I could believe that > was a problem with gdb? I couldn't reproduce it. Thus I can at least > give: > > Tested-by: Douglas Anderson <dianders@chromium.org> > Thanks for the testing. > I'll also note that I _think_ I remember that with Wei's series that > the gdb function "call" started working. I tried that here and it > didn't seem so happy. To keep things simple, I created a dummy > function in my kernel that looked like: > > void doug_test(void) > { > pr_info("testing, 1 2 3\n"); > } > > I broke into the debugger by echoing "g" to /proc/sysrq-trigger and > then tried "call doug_test()". I guess my printout actually printed > but it wasn't so happy after that. Seems like it somehow ended up > returning to a bogus address after the call which then caused a crash. > I am able to reproduce this issue on my setup as well. But it doesn't seem to be a regression caused by this patch-set over Wei's series. As I could reproduce this issue with v1 [1] patch-set as well which was just a forward port of pending patches from Wei's series to the latest upstream. Maybe it's a different regression caused by other changes? BTW, do you remember the kernel version you tested with Wei's series applied? [1] https://lore.kernel.org/linux-arm-kernel/20220411093819.1012583-1-sumit.garg@linaro.org/T/ -Sumit > testing, 1 2 3 > BUG: sleeping function called from invalid context at > arch/arm64/mm/fault.c:593 > in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 3393, name: bash > preempt_count: 0, expected: 0 > RCU nest depth: 1, expected: 0 > CPU: 6 PID: 3393 Comm: bash Not tainted 5.19.0-rc4+ #3 > dbec0bdb8582e447bccdcf2e70d7fe04477b1aac > Hardware name: Google Herobrine (rev1+) (DT) > Call trace: > dump_backtrace+0xf0/0x110 > show_stack+0x24/0x70 > dump_stack_lvl+0x64/0x7c > dump_stack+0x18/0x38 > __might_resched+0x144/0x154 > __might_sleep+0x54/0x84 > do_page_fault+0x1d4/0x42c > do_mem_abort+0x4c/0xb0 > el1_abort+0x3c/0x5c > el1h_64_sync_handler+0x4c/0xc4 > el1h_64_sync+0x64/0x68 > 0xffffffc008000000 > __handle_sysrq+0x15c/0x184 > write_sysrq_trigger+0x94/0x128 > proc_reg_write+0xbc/0xec > vfs_write+0xf0/0x2c8 > ksys_write+0x84/0xf0 > __arm64_sys_write+0x28/0x34 > invoke_syscall+0x4c/0x120 > el0_svc_common+0x94/0xfc > do_el0_svc+0x38/0xc0 > el0_svc+0x2c/0x7c > el0t_64_sync_handler+0x48/0x114 > el0t_64_sync+0x18c/0x190 > Unable to handle kernel execute from non-executable memory at > virtual address ffffffc008000000 > Mem abort info: > ESR = 0x000000008600000f > EC = 0x21: IABT (current EL), IL = 32 bits > SET = 0, FnV = 0 > EA = 0, S1PTW = 0 > FSC = 0x0f: level 3 permission fault > swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000082863000 > [ffffffc008000000] pgd=100000027ffff003, p4d=100000027ffff003, > pud=100000027ffff003, pmd=100000027fffe003, pte=00680001001c3703 > Internal error: Oops: 8600000f [#1] PREEMPT SMP > > I'm not sure if that's a sign that something is missing with your patch or not. > > -Doug
Hi Will, On Fri, 8 Jul 2022 at 22:01, Will Deacon <will@kernel.org> wrote: > > On Fri, Jul 01, 2022 at 03:14:16PM -0700, Doug Anderson wrote: > > Hi, > > > > On Tue, May 10, 2022 at 11:05 PM Sumit Garg <sumit.garg@linaro.org> wrote: > > > > > > This patch-set reworks pending fixes from Wei's series [1] to make > > > single-step debugging via kgdb/kdb on arm64 work as expected. There was > > > a prior discussion on ML [2] regarding if we should keep the interrupts > > > enabled during single-stepping. So patch #1 follows suggestion from Will > > > [3] to not disable interrupts during single stepping but rather skip > > > single stepping within interrupt handler. > > > > > > [1] https://lore.kernel.org/all/20200509214159.19680-1-liwei391@huawei.com/ > > > [2] https://lore.kernel.org/all/CAD=FV=Voyfq3Qz0T3RY+aYWYJ0utdH=P_AweB=13rcV8GDBeyQ@mail.gmail.com/ > > > [3] https://lore.kernel.org/all/20200626095551.GA9312@willie-the-truck/ > > > > > > Changes in v3: > > > - Reword commit descriptions as per Daniel's suggestions. > > > > > > Changes in v2: > > > - Replace patch #1 to rather follow Will's suggestion. > > > > > > Sumit Garg (2): > > > arm64: entry: Skip single stepping into interrupt handlers > > > arm64: kgdb: Set PSTATE.SS to 1 to re-enable single-step > > > > > > arch/arm64/include/asm/debug-monitors.h | 1 + > > > arch/arm64/kernel/debug-monitors.c | 5 +++++ > > > arch/arm64/kernel/entry-common.c | 18 +++++++++++++++++- > > > arch/arm64/kernel/kgdb.c | 2 ++ > > > 4 files changed, 25 insertions(+), 1 deletion(-) > > > > Sorry it took so long for me to respond. I kept dreaming that I'd find > > the time to really dig deep into this to understand it fully and I'm > > finally giving up on it. I'm going to hope that Will and/or Catalin > > knows this area of the code well and can give it a good review. If not > > then I'll strive harder to make the time... > > So the good news is that I spent a couple of days on this last week. > Thanks for spending time to review this. > The bad news is that I'm nowhere done and about to disappear on holiday > for a week! > > But anyway, I might be able to give this a review when I'm back. Failing > that, I wonder if enough of us have a little bit of time each that we > could hack on an agreed design together which covers the debug exception > behaviour beyond kgdb? Sure I will be happy to contribute to improving overall debug exception behaviour. I look forward to any further comments/suggestions. -Sumit > > Will
Hi, On Mon, Jul 11, 2022 at 5:44 AM Sumit Garg <sumit.garg@linaro.org> wrote: > > > I'll also note that I _think_ I remember that with Wei's series that > > the gdb function "call" started working. I tried that here and it > > didn't seem so happy. To keep things simple, I created a dummy > > function in my kernel that looked like: > > > > void doug_test(void) > > { > > pr_info("testing, 1 2 3\n"); > > } > > > > I broke into the debugger by echoing "g" to /proc/sysrq-trigger and > > then tried "call doug_test()". I guess my printout actually printed > > but it wasn't so happy after that. Seems like it somehow ended up > > returning to a bogus address after the call which then caused a crash. > > > > I am able to reproduce this issue on my setup as well. But it doesn't > seem to be a regression caused by this patch-set over Wei's series. As > I could reproduce this issue with v1 [1] patch-set as well which was > just a forward port of pending patches from Wei's series to the latest > upstream. > > Maybe it's a different regression caused by other changes? BTW, do you > remember the kernel version you tested with Wei's series applied? Sorry, I don't remember! :( I can't even be 100% sure that I'm remembering correctly that I tested it back in the day, so it's possible that it simply never worked... -Doug
On Mon, 11 Jul 2022 at 19:17, Doug Anderson <dianders@chromium.org> wrote: > > Hi, > > On Mon, Jul 11, 2022 at 5:44 AM Sumit Garg <sumit.garg@linaro.org> wrote: > > > > > I'll also note that I _think_ I remember that with Wei's series that > > > the gdb function "call" started working. I tried that here and it > > > didn't seem so happy. To keep things simple, I created a dummy > > > function in my kernel that looked like: > > > > > > void doug_test(void) > > > { > > > pr_info("testing, 1 2 3\n"); > > > } > > > > > > I broke into the debugger by echoing "g" to /proc/sysrq-trigger and > > > then tried "call doug_test()". I guess my printout actually printed > > > but it wasn't so happy after that. Seems like it somehow ended up > > > returning to a bogus address after the call which then caused a crash. > > > > > > > I am able to reproduce this issue on my setup as well. But it doesn't > > seem to be a regression caused by this patch-set over Wei's series. As > > I could reproduce this issue with v1 [1] patch-set as well which was > > just a forward port of pending patches from Wei's series to the latest > > upstream. > > > > Maybe it's a different regression caused by other changes? BTW, do you > > remember the kernel version you tested with Wei's series applied? > > Sorry, I don't remember! :( I can't even be 100% sure that I'm > remembering correctly that I tested it back in the day, so it's > possible that it simply never worked... Okay, no worries. Let me see if I can come up with a separate fix for this. -Sumit > > -Doug
On Mon, 11 Jul 2022 at 19:21, Sumit Garg <sumit.garg@linaro.org> wrote: > > On Mon, 11 Jul 2022 at 19:17, Doug Anderson <dianders@chromium.org> wrote: > > > > Hi, > > > > On Mon, Jul 11, 2022 at 5:44 AM Sumit Garg <sumit.garg@linaro.org> wrote: > > > > > > > I'll also note that I _think_ I remember that with Wei's series that > > > > the gdb function "call" started working. I tried that here and it > > > > didn't seem so happy. To keep things simple, I created a dummy > > > > function in my kernel that looked like: > > > > > > > > void doug_test(void) > > > > { > > > > pr_info("testing, 1 2 3\n"); > > > > } > > > > > > > > I broke into the debugger by echoing "g" to /proc/sysrq-trigger and > > > > then tried "call doug_test()". I guess my printout actually printed > > > > but it wasn't so happy after that. Seems like it somehow ended up > > > > returning to a bogus address after the call which then caused a crash. > > > > > > > > > > I am able to reproduce this issue on my setup as well. But it doesn't > > > seem to be a regression caused by this patch-set over Wei's series. As > > > I could reproduce this issue with v1 [1] patch-set as well which was > > > just a forward port of pending patches from Wei's series to the latest > > > upstream. > > > > > > Maybe it's a different regression caused by other changes? BTW, do you > > > remember the kernel version you tested with Wei's series applied? > > > > Sorry, I don't remember! :( I can't even be 100% sure that I'm > > remembering correctly that I tested it back in the day, so it's > > possible that it simply never worked... > > Okay, no worries. Let me see if I can come up with a separate fix for this. > After digging deep into GDB call function operations for aarch64, it is certain that function calls simply never worked due to below reasons: 1. On aarch64, GDB call function inserts a breakpoint at the entrypoint of kernel (which is ffffffc008000000 from your dump) as return address from function called. And since it refers to the "_text" symbol which is marked non-executable as the actual text section starts with the "_stext" symbol. I did a following hack that makes it executable: diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index 626ec32873c6..e39ad1a5f5d6 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -700,7 +700,7 @@ static bool arm64_early_this_cpu_has_bti(void) static void __init map_kernel(pgd_t *pgdp) { static struct vm_struct vmlinux_text, vmlinux_rodata, vmlinux_inittext, - vmlinux_initdata, vmlinux_data; + vmlinux_initdata, vmlinux_data, vmlinux_htext; /* * External debuggers may need to write directly to the text @@ -721,6 +721,8 @@ static void __init map_kernel(pgd_t *pgdp) * Only rodata will be remapped with different permissions later on, * all other segments are allowed to use contiguous mappings. */ + map_kernel_segment(pgdp, _text, _stext, text_prot, &vmlinux_htext, 0, + VM_NO_GUARD); map_kernel_segment(pgdp, _stext, _etext, text_prot, &vmlinux_text, 0, VM_NO_GUARD); map_kernel_segment(pgdp, __start_rodata, __inittext_begin, PAGE_KERNEL, 2. For the GDB function "call" to work, GDB inserts a dummy stack frame. But in case of kernel on aarch64, the stack used is common among the exception handler and the kernel threads. So it's not trivial to insert a dummy stack frame and requires rework of exception entry code as it pushes pt_regs onto the stack. -Sumit > > > > > -Doug