Message ID: 20191201015304.cRPsmKUTM%akpm@linux-foundation.org (mailing list archive)
State: New, archived
Series: [001/158] scripts/spelling.txt: add more spellings to spelling.txt
On Sat, Nov 30, 2019 at 05:53:04PM -0800, akpm@linux-foundation.org wrote:
> From: Steven Price <steven.price@arm.com>
> Subject: mm: add generic ptdump
>
> Add a generic version of page table dumping that architectures can opt-in
> to

That generic ptdump stuff is probably causing a splat on 32-bit:

https://lkml.kernel.org/r/20191125144946.GA6628@duo.ucw.cz

.config is attached in that thread too and triggers pretty reliably in a vm
but I haven't poked at it further.

Thx.
On Sun, Dec 1, 2019 at 1:09 AM Borislav Petkov <bp@alien8.de> wrote:
>
> That generic ptdump stuff is probably causing a splat on 32-bit:
>
> https://lkml.kernel.org/r/20191125144946.GA6628@duo.ucw.cz

Hmm. I'm not sure about code generation, but for me that config gives me

  60:	55                   	push   %ebp
  61:	89 e5                	mov    %esp,%ebp
  63:	57                   	push   %edi
  64:	8b 4d 08             	mov    0x8(%ebp),%ecx
  67:	56                   	push   %esi
  68:	53                   	push   %ebx
  69:	8b 30                	mov    (%eax),%esi
  6b:	8b 59 10             	mov    0x10(%ecx),%ebx

so that "ptdump_pte_entry+9" is the "mov (%eax),%esi".

And that is "READ_ONCE(*pte)".

So the pte pointer itself is broken. Which sounds really odd.

Hmm. I've applied the whole series to a local branch, but I'm not merging it
into my master branch yet. Can somebody figure out how the page walking
could get that broken?

Linus
On Sun, Dec 01, 2019 at 06:45:23AM -0800, Linus Torvalds wrote:
> On Sun, Dec 1, 2019 at 1:09 AM Borislav Petkov <bp@alien8.de> wrote:
> >
> > That generic ptdump stuff is probably causing a splat on 32-bit:
> >
> > https://lkml.kernel.org/r/20191125144946.GA6628@duo.ucw.cz
>
> Hmm. I'm not sure about code generation, but for me that config gives me

Note that I typed "probably" above because I'm not 100% sure it is those
patches that would cause it. I mean, I saw EIP pointing to ptdump_pte_entry
and was able to repro on linux-next with the .config in a vm.

But then your master or tip/master wouldn't trigger, so I shelved that as it
is merge window and other 32-bit shit was broken, which needed more
attention.

So lemme first confirm it really is caused by those patches.
On Sun, Dec 01, 2019 at 04:10:11PM +0100, Borislav Petkov wrote:
> So lemme first confirm it really is caused by those patches.
Yeah, those patches are causing it. Tried your current master - it is OK
- and then applied Andrew's patches I was CCed on, ontop, and I got in a
VM:
VFS: Mounted root (ext4 filesystem) readonly on device 8:2.
devtmpfs: mounted
Freeing unused kernel image (initmem) memory: 664K
Write protecting kernel text and read-only data: 18164k
NX-protecting the kernel data: 7416k
BUG: kernel NULL pointer dereference, address: 00000014
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
*pdpt = 0000000000000000 *pde = f000ff53f000ff53
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.4.0+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014
EIP: __lock_acquire.isra.0+0x2e8/0x4e0
Code: e8 bd a1 2f 00 85 c0 74 11 8b 1d 08 8f 26 c5 85 db 0f 84 05 1a 00 00 8d 76 00 31 db 8d 65 f4 89 d8 5b 5e 5f 5d c3 8d 74 26 00 <8b> 44 90 04 85 c0 0f 85 4c fd ff ff e9 33 fd ff ff 8d b4 26 00 00
EAX: 00000010 EBX: 00000010 ECX: 00000001 EDX: 00000000
ESI: f1070040 EDI: f1070040 EBP: f1073e04 ESP: f1073de0
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010097
CR0: 80050033 CR2: 00000014 CR3: 05348000 CR4: 001406b0
Call Trace:
lock_acquire+0x42/0x60
? __walk_page_range+0x4d9/0x590
_raw_spin_lock+0x22/0x40
? __walk_page_range+0x4d9/0x590
__walk_page_range+0x4d9/0x590
walk_page_range_novma+0x57/0xa0
ptdump_walk_pgd+0x38/0x70
ptdump_walk_pgd_level_core+0x66/0x90
? ptdump_walk_pgd_level_core+0x90/0x90
ptdump_walk_pgd_level_checkwx+0x16/0x19
mark_rodata_ro+0x95/0x9a
? rest_init+0xfb/0xfb
kernel_init+0x25/0xe5
ret_from_fork+0x2e/0x38
Modules linked in:
CR2: 0000000000000014
---[ end trace 8b67ede738f0029a ]---
EIP: __lock_acquire.isra.0+0x2e8/0x4e0
Code: e8 bd a1 2f 00 85 c0 74 11 8b 1d 08 8f 26 c5 85 db 0f 84 05 1a 00 00 8d 76 00 31 db 8d 65 f4 89 d8 5b 5e 5f 5d c3 8d 74 26 00 <8b> 44 90 04 85 c0 0f 85 4c fd ff ff e9 33 fd ff ff 8d b4 26 00 00
EAX: 00000010 EBX: 00000010 ECX: 00000001 EDX: 00000000
ESI: f1070040 EDI: f1070040 EBP: f1073e04 ESP: f1073de0
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010097
CR0: 80050033 CR2: 00000014 CR3: 05348000 CR4: 001406b0
note: swapper/0[1] exited with preempt_count 1
Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 ]---
On Sun, Dec 01, 2019 at 04:21:19PM +0100, Borislav Petkov wrote:
> On Sun, Dec 01, 2019 at 04:10:11PM +0100, Borislav Petkov wrote:
> > So lemme first confirm it really is caused by those patches.
>
> Yeah, those patches are causing it. Tried your current master - it is OK
> - and then applied Andrew's patches I was CCed on, ontop, and I got in a
> VM:
>
> [...]
>
> Call Trace:
>  lock_acquire+0x42/0x60
>  ? __walk_page_range+0x4d9/0x590
>  _raw_spin_lock+0x22/0x40
>  ? __walk_page_range+0x4d9/0x590
>  __walk_page_range+0x4d9/0x590

Ok, some more staring.
That offset is:

# mm/pagewalk.c:31: pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
	sall	$5, %eax	#, tmp235
	addl	-64(%ebp), %eax	# %sfp, tmp236
	call	page_address	#
	addl	%eax, %esi	# tmp306, __pte
# ./include/linux/spinlock.h:338: raw_spin_lock(&lock->rlock);
	movl	-76(%ebp), %eax	# %sfp,
	call	_raw_spin_lock	#
	movl	%edi, %edx	# start, start
	movl	%ebx, -64(%ebp)	# __boundary, %sfp
	movl	-80(%ebp), %edi	# %sfp, ops
	movl	%esi, -40(%ebp)	# __pte, %sfp

i.e., pte_offset_map_lock() and I *think* that ptl thing is NULL.

The Code section decodes to:

Code: e8 bd a1 2f 00 85 c0 74 11 8b 1d 08 8f 26 c5 85 db 0f 84 05 1a 00 00 8d 76 00 31 db 8d 65 f4 89 d8 5b 5e 5f 5d c3 8d 74 26 00 <8b> 44 90 04 85 c0 0f 85 4c fd ff ff e9 33 fd ff ff 8d b4 26 00 00
All code
========
   0:	e8 bd a1 2f 00       	callq  0x2fa1c2
   5:	85 c0                	test   %eax,%eax
   7:	74 11                	je     0x1a
   9:	8b 1d 08 8f 26 c5    	mov    -0x3ad970f8(%rip),%ebx	# 0xffffffffc5268f17
   f:	85 db                	test   %ebx,%ebx
  11:	0f 84 05 1a 00 00    	je     0x1a1c
  17:	8d 76 00             	lea    0x0(%rsi),%esi
  1a:	31 db                	xor    %ebx,%ebx
  1c:	8d 65 f4             	lea    -0xc(%rbp),%esp
  1f:	89 d8                	mov    %ebx,%eax
  21:	5b                   	pop    %rbx
  22:	5e                   	pop    %rsi
  23:	5f                   	pop    %rdi
  24:	5d                   	pop    %rbp
  25:	c3                   	retq
  26:	8d 74 26 00          	lea    0x0(%rsi,%riz,1),%esi
  2a:*	8b 44 90 04          	mov    0x4(%rax,%rdx,4),%eax	<-- trapping instruction
  2e:	85 c0                	test   %eax,%eax
  30:	0f 85 4c fd ff ff    	jne    0xfffffffffffffd82
  36:	e9 33 fd ff ff       	jmpq   0xfffffffffffffd6e
  3b:	8d                   	.byte 0x8d
  3c:	b4 26

which is this corresponding piece in __lock_acquire():

	call	debug_locks_off	#
# kernel/locking/lockdep.c:3775: if (!debug_locks_off())
	testl	%eax, %eax	# tmp325
	je	.L562	#,
# kernel/locking/lockdep.c:3777: if (debug_locks_silent)
	movl	debug_locks_silent, %ebx	# debug_locks_silent, <retval>
# kernel/locking/lockdep.c:3777: if (debug_locks_silent)
	testl	%ebx, %ebx	# <retval>
	je	.L642	#,
	.p2align 4,,10
	.p2align 3
.L562:
# kernel/locking/lockdep.c:3826: return 0;
	xorl	%ebx, %ebx	# <retval>
.L557:
# kernel/locking/lockdep.c:3982: }
	leal	-12(%ebp), %esp	#,
	movl	%ebx, %eax	# <retval>,
	popl	%ebx	#
	popl	%esi	#
	popl	%edi	#
	popl	%ebp	#
	ret
	.p2align 4,,10
	.p2align 3
.L649:
# kernel/locking/lockdep.c:3832: class = lock->class_cache[subclass];
	movl	4(%eax,%edx,4), %eax	# lock_7(D)->class_cache, class
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(the LEA above is NOP padding) and %eax and %edx are both NULL. i.e., that
thing:

	if (subclass < NR_LOCKDEP_CACHING_CLASSES)
		class = lock->class_cache[subclass];
		        ^^^^^^^^^^^^^^^

AFAICT, of course.
On Sun, Dec 01, 2019 at 03:45:54PM +0000, Borislav Petkov wrote:
> On Sun, Dec 01, 2019 at 04:21:19PM +0100, Borislav Petkov wrote:
> > On Sun, Dec 01, 2019 at 04:10:11PM +0100, Borislav Petkov wrote:
> > > So lemme first confirm it really is caused by those patches.
> >
> > Yeah, those patches are causing it. Tried your current master - it is OK
> > - and then applied Andrew's patches I was CCed on, ontop, and I got in a
> > VM:
[...]

Thanks for looking into this. I've been able to reproduce it locally with
that config and I can see what's going wrong here.
walk_pte_range() is being called with end=0xffffffff, but the comparison in
the function is:

	if (addr == end)
		break;

So addr never actually equals end, it skips from 0xfffff000 to 0x0. This
means the function continues walking straight off the end and dereferencing
'random' ptes.

As a quick hack I modified the condition to:

	if (addr == end || !addr)
		break;

and I can then boot the VM. Clearly that's not the correct solution - I'll
go away and have a think about the cleanest way of handling this case and
also do some more testing before I resubmit for 5.6.

Sorry for the trouble and thanks again for investigating.

Steve
On Mon, Dec 02, 2019 at 09:09:24AM +0000, Steven Price wrote:
> Sorry for the trouble and thanks again for investigating.
You're very welcome! 8-)
Holler if you need the new version tested a bit.
Thx.
On 01.12.19 02:53, akpm@linux-foundation.org wrote:
> From: Steven Price <steven.price@arm.com>
> Subject: mm: add generic ptdump
>
> Add a generic version of page table dumping that architectures can opt-in
> to
>
> [steven.price@arm.com: v15]
> Link: http://lkml.kernel.org/r/20191101140942.51554-20-steven.price@arm.com
> [cai@lca.pw: fix a -Wold-style-declaration warning]
> Link: http://lkml.kernel.org/r/1572895385-29194-1-git-send-email-cai@lca.pw
> Link: http://lkml.kernel.org/r/20191028135910.33253-20-steven.price@arm.com
> Signed-off-by: Steven Price <steven.price@arm.com>
> Signed-off-by: Qian Cai <cai@lca.pw>
> Cc: Albert Ou <aou@eecs.berkeley.edu>
> Cc: Alexander Potapenko <glider@google.com>
> Cc: Alexandre Ghiti <alex@ghiti.fr>
> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Christian Borntraeger <borntraeger@de.ibm.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: Dmitry Vyukov <dvyukov@google.com>
> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: James Hogan <jhogan@kernel.org>
> Cc: James Morse <james.morse@arm.com>
> Cc: "Liang, Kan" <kan.liang@linux.intel.com>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Palmer Dabbelt <palmer@sifive.com>
> Cc: Paul Burton <paul.burton@mips.com>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Paul Walmsley <paul.walmsley@sifive.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ralf Baechle <ralf@linux-mips.org>
> Cc: Russell King <linux@armlinux.org.uk>
> Cc: Shiraz Hashim <shashim@codeaurora.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Vasily Gorbik <gor@linux.ibm.com>
> Cc: Vineet Gupta <vgupta@synopsys.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Zong Li <zong.li@sifive.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>
>  include/linux/ptdump.h |   21 +++++
>  mm/Kconfig.debug       |   21 +++++
>  mm/Makefile            |    1
>  mm/ptdump.c            |  151 +++++++++++++++++++++++++++++++++++++++
>  4 files changed, 194 insertions(+)
>
> --- /dev/null
> +++ a/include/linux/ptdump.h
> @@ -0,0 +1,21 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef _LINUX_PTDUMP_H
> +#define _LINUX_PTDUMP_H
> +
> +#include <linux/mm_types.h>
> +
> +struct ptdump_range {
> +	unsigned long start;
> +	unsigned long end;
> +};
> +
> +struct ptdump_state {
> +	void (*note_page)(struct ptdump_state *st, unsigned long addr,
> +			  int level, unsigned long val);
> +	const struct ptdump_range *range;
> +};
> +
> +void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm);
> +
> +#endif /* _LINUX_PTDUMP_H */
> --- a/mm/Kconfig.debug~mm-add-generic-ptdump
> +++ a/mm/Kconfig.debug
> @@ -117,3 +117,24 @@ config DEBUG_RODATA_TEST
>  	depends on STRICT_KERNEL_RWX
>  	---help---
>  	  This option enables a testcase for the setting rodata read-only.
> +
> +config GENERIC_PTDUMP
> +	bool
> +
> +config PTDUMP_CORE
> +	bool
> +
> +config PTDUMP_DEBUGFS
> +	bool "Export kernel pagetable layout to userspace via debugfs"
> +	depends on DEBUG_KERNEL
> +	depends on DEBUG_FS
> +	depends on GENERIC_PTDUMP
> +	select PTDUMP_CORE
> +	help
> +	  Say Y here if you want to show the kernel pagetable layout in a
> +	  debugfs file. This information is only useful for kernel developers
> +	  who are working in architecture specific areas of the kernel.
> +	  It is probably not a good idea to enable this feature in a production
> +	  kernel.
> +
> +	  If in doubt, say N.
> --- a/mm/Makefile~mm-add-generic-ptdump
> +++ a/mm/Makefile
> @@ -98,6 +98,7 @@ obj-$(CONFIG_CMA) += cma.o
>  obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
>  obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
>  obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
> +obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
>  obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
>  obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
>  obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
> --- /dev/null
> +++ a/mm/ptdump.c
> @@ -0,0 +1,151 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/pagewalk.h>
> +#include <linux/ptdump.h>
> +#include <linux/kasan.h>
> +
> +static int ptdump_pgd_entry(pgd_t *pgd, unsigned long addr,
> +			    unsigned long next, struct mm_walk *walk)
> +{
> +	struct ptdump_state *st = walk->private;
> +	pgd_t val = READ_ONCE(*pgd);
> +
> +	if (pgd_leaf(val))
> +		st->note_page(st, addr, 1, pgd_val(val));
> +
> +	return 0;
> +}
> +
> +static int ptdump_p4d_entry(p4d_t *p4d, unsigned long addr,
> +			    unsigned long next, struct mm_walk *walk)
> +{
> +	struct ptdump_state *st = walk->private;
> +	p4d_t val = READ_ONCE(*p4d);
> +
> +	if (p4d_leaf(val))
> +		st->note_page(st, addr, 2, p4d_val(val));
> +
> +	return 0;
> +}
> +
> +static int ptdump_pud_entry(pud_t *pud, unsigned long addr,
> +			    unsigned long next, struct mm_walk *walk)
> +{
> +	struct ptdump_state *st = walk->private;
> +	pud_t val = READ_ONCE(*pud);
> +
> +	if (pud_leaf(val))
> +		st->note_page(st, addr, 3, pud_val(val));
> +
> +	return 0;
> +}
> +
> +static int ptdump_pmd_entry(pmd_t *pmd, unsigned long addr,
> +			    unsigned long next, struct mm_walk *walk)
> +{
> +	struct ptdump_state *st = walk->private;
> +	pmd_t val = READ_ONCE(*pmd);
> +
> +	if (pmd_leaf(val))
> +		st->note_page(st, addr, 4, pmd_val(val));
> +
> +	return 0;
> +}
> +
> +static int ptdump_pte_entry(pte_t *pte, unsigned long addr,
> +			    unsigned long next, struct mm_walk *walk)
> +{
> +	struct ptdump_state *st = walk->private;
> +
> +	st->note_page(st, addr, 5, pte_val(READ_ONCE(*pte)));
> +
> +	return 0;
> +}
> +
> +#ifdef CONFIG_KASAN
> +/*
> + * This is an optimization for KASAN=y case. Since all kasan page tables
> + * eventually point to the kasan_early_shadow_page we could call note_page()
> + * right away without walking through lower level page tables. This saves
> + * us dozens of seconds (minutes for 5-level config) while checking for
> + * W+X mapping or reading kernel_page_tables debugfs file.
> + */
> +static inline int note_kasan_page_table(struct mm_walk *walk,
> +					unsigned long addr)
> +{
> +	struct ptdump_state *st = walk->private;
> +
> +	st->note_page(st, addr, 5, pte_val(kasan_early_shadow_pte[0]));
> +	return 1;
> +}
> +
> +static int ptdump_test_p4d(unsigned long addr, unsigned long next,
> +			   p4d_t *p4d, struct mm_walk *walk)
> +{
> +#if CONFIG_PGTABLE_LEVELS > 4
> +	if (p4d == lm_alias(kasan_early_shadow_p4d))
> +		return note_kasan_page_table(walk, addr);
> +#endif
> +	return 0;
> +}
> +
> +static int ptdump_test_pud(unsigned long addr, unsigned long next,
> +			   pud_t *pud, struct mm_walk *walk)
> +{
> +#if CONFIG_PGTABLE_LEVELS > 3
> +	if (pud == lm_alias(kasan_early_shadow_pud))
> +		return note_kasan_page_table(walk, addr);
> +#endif
> +	return 0;
> +}
> +
> +static int ptdump_test_pmd(unsigned long addr, unsigned long next,
> +			   pmd_t *pmd, struct mm_walk *walk)
> +{
> +#if CONFIG_PGTABLE_LEVELS > 2
> +	if (pmd == lm_alias(kasan_early_shadow_pmd))
> +		return note_kasan_page_table(walk, addr);
> +#endif
> +	return 0;
> +}
> +#endif /* CONFIG_KASAN */
> +
> +static int ptdump_hole(unsigned long addr, unsigned long next,
> +		       int depth, struct mm_walk *walk)
> +{
> +	struct ptdump_state *st = walk->private;
> +
> +	st->note_page(st, addr, depth + 1, 0);
> +
> +	return 0;
> +}
> +
> +static const struct mm_walk_ops ptdump_ops = {
> +	.pgd_entry	= ptdump_pgd_entry,
> +	.p4d_entry	= ptdump_p4d_entry,
> +	.pud_entry	= ptdump_pud_entry,
> +	.pmd_entry	= ptdump_pmd_entry,
> +	.pte_entry	= ptdump_pte_entry,
> +#ifdef CONFIG_KASAN
> +	.test_p4d	= ptdump_test_p4d,
> +	.test_pud	= ptdump_test_pud,
> +	.test_pmd	= ptdump_test_pmd,
> +#endif
> +	.pte_hole	= ptdump_hole,
> +};
> +
> +void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm)
> +{
> +	const struct ptdump_range *range = st->range;
> +
> +	down_read(&mm->mmap_sem);
> +	while (range->start != range->end) {
> +		walk_page_range_novma(mm, range->start, range->end,
> +				      &ptdump_ops, st);
> +		range++;
> +	}
> +	up_read(&mm->mmap_sem);
> +
> +	/* Flush out the last page */
> +	st->note_page(st, 0, 0, 0);
> +}
> _

On linux-next, booting a simple QEMU x86-64 guest (since I updated from
pre-v5.4 base), I get:

[ 1.231285] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 1.231897] #PF: supervisor read access in kernel mode
[ 1.232354] #PF: error_code(0x0000) - not-present page
[ 1.232803] PGD 0 P4D 0
[ 1.233033] Oops: 0000 [#1] SMP NOPTI
[ 1.233359] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 5.4.0-next-20191203+ #29
[ 1.233998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4
[ 1.235015] RIP: 0010:__lock_acquire+0x778/0x1940
[ 1.235428] Code: 00 45 31 ff 48 8b 44 24 48 65 48 33 04 25 28 00 00 00 0f 85 fd 0d 00 00 48 83 c4 50 44 89 f8 5b7
[ 1.237051] RSP: 0018:ffffbc6100637c48 EFLAGS: 00010002
[ 1.237512] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
[ 1.238147] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018
[ 1.238765] RBP: ffff92dd7db54d80 R08: 0000000000000001 R09: 0000000000000000
[ 1.239395] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 1.240012] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
[ 1.240626] FS: 0000000000000000(0000) GS:ffff92dd7dd00000(0000) knlGS:0000000000000000
[ 1.241316] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1.241808] CR2: 0000000000000018 CR3: 00000000a8610000 CR4: 00000000000006e0
[ 1.242407] Call Trace:
[ 1.242626]  ? check_usage_backwards+0x99/0x140
[ 1.243023]  ? stack_trace_save+0x4b/0x70
[ 1.243385]  lock_acquire+0xa2/0x1b0
[ 1.243707]  ? __walk_page_range+0x6e5/0xa00
[ 1.244104]  _raw_spin_lock+0x2c/0x40
[ 1.244431]  ? __walk_page_range+0x6e5/0xa00
[ 1.244817]  __walk_page_range+0x6e5/0xa00
[ 1.245184]  walk_page_range_novma+0x69/0xb0
[ 1.245562]  ptdump_walk_pgd+0x46/0x80
[ 1.245904]  ptdump_walk_pgd_level_core+0xb7/0xe0
[ 1.246318]  ? ptdump_walk_pgd_level_core+0xe0/0xe0
[ 1.246748]  ? rest_init+0x23a/0x23a
[ 1.247076]  ? rest_init+0x23a/0x23a
[ 1.247392]  kernel_init+0x2c/0x106
[ 1.247700]  ret_from_fork+0x27/0x50
[ 1.248025] Modules linked in:
[ 1.248298] CR2: 0000000000000018
[ 1.248594] ---[ end trace d9ad45dca0b4f3a3 ]---
[ 1.249020] RIP: 0010:__lock_acquire+0x778/0x1940
[ 1.249432] Code: 00 45 31 ff 48 8b 44 24 48 65 48 33 04 25 28 00 00 00 0f 85 fd 0d 00 00 48 83 c4 50 44 89 f8 5b7
[ 1.251059] RSP: 0018:ffffbc6100637c48 EFLAGS: 00010002
[ 1.251514] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
[ 1.252153] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018
[ 1.252773] RBP: ffff92dd7db54d80 R08: 0000000000000001 R09: 0000000000000000
[ 1.253396] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 1.254026] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
[ 1.254648] FS: 0000000000000000(0000) GS:ffff92dd7dd00000(0000) knlGS:0000000000000000
[ 1.255360] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1.255867] CR2: 0000000000000018 CR3: 00000000a8610000 CR4: 00000000000006e0
[ 1.256491] BUG: sleeping function called from invalid context at include/linux/percpu-rwsem.h:38
[ 1.257268] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 1, name: swapper/0
[ 1.257952] INFO: lockdep is turned off.
[ 1.258299] irq event stamp: 1570043
[ 1.258617] hardirqs last enabled at (1570043): [<ffffffff9716dd2c>] console_unlock+0x45c/0x5c0
[ 1.259386] hardirqs last disabled at (1570042): [<ffffffff9716d964>] console_unlock+0x94/0x5c0
[ 1.260153] softirqs last enabled at (1570040): [<ffffffff97e0035d>] __do_softirq+0x35d/0x45d
[ 1.260898] softirqs last disabled at (1570033): [<ffffffff970efe54>] irq_exit+0xf4/0x100
[ 1.261615] CPU: 3 PID: 1 Comm: swapper/0 Tainted: G D 5.4.0-next-20191203+ #29
[ 1.262370] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4
[ 1.263372] Call Trace:
[ 1.263595]  dump_stack+0x8f/0xd0
[ 1.263895]  ___might_sleep.cold+0xb3/0xc3
[ 1.264246]  exit_signals+0x30/0x2d0
[ 1.264552]  do_exit+0xb4/0xc40
[ 1.264832]  rewind_stack_do_exit+0x17/0x20
[ 1.265198] note: swapper/0[1] exited with preempt_count 1
[ 1.265700] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
[ 1.266443] Kernel Offset: 0x16000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbff)
[ 1.267394] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 ]---

Related to this?
On 03.12.19 11:47, David Hildenbrand wrote:
> On 01.12.19 02:53, akpm@linux-foundation.org wrote:
>> From: Steven Price <steven.price@arm.com>
>> Subject: mm: add generic ptdump
>>
>> Add a generic version of page table dumping that architectures can opt-in
>> to
[...]
> [ 1.256491] BUG: sleeping function called from invalid context at include/linux/percpu-rwsem.h:38
> [ 1.257268] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 1, name: swapper/0
> [ 1.257952] INFO: lockdep is turned off.
> [ 1.258299] irq event stamp: 1570043 > [ 1.258617] hardirqs last enabled at (1570043): [<ffffffff9716dd2c>] console_unlock+0x45c/0x5c0 > [ 1.259386] hardirqs last disabled at (1570042): [<ffffffff9716d964>] console_unlock+0x94/0x5c0 > [ 1.260153] softirqs last enabled at (1570040): [<ffffffff97e0035d>] __do_softirq+0x35d/0x45d > [ 1.260898] softirqs last disabled at (1570033): [<ffffffff970efe54>] irq_exit+0xf4/0x100 > [ 1.261615] CPU: 3 PID: 1 Comm: swapper/0 Tainted: G D 5.4.0-next-20191203+ #29 > [ 1.262370] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4 > [ 1.263372] Call Trace: > [ 1.263595] dump_stack+0x8f/0xd0 > [ 1.263895] ___might_sleep.cold+0xb3/0xc3 > [ 1.264246] exit_signals+0x30/0x2d0 > [ 1.264552] do_exit+0xb4/0xc40 > [ 1.264832] rewind_stack_do_exit+0x17/0x20 > [ 1.265198] note: swapper/0[1] exited with preempt_count 1 > [ 1.265700] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 > [ 1.266443] Kernel Offset: 0x16000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbff) > [ 1.267394] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 ]--- > > Related to this? > I just made sure that I am actually on the latest linux-next. I do have commit d3634da666853cdff2258a49dd3ce3607c0fd6c5 Author: Steven Price <steven.price@arm.com> Date: Tue Nov 19 11:47:24 2019 +1100 mm-pagewalk-allow-walking-without-vma-fix fix boot crash Reported-by: Qian Cai <cai@lca.pw> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Problem persists. I do have a bunch of debug options enabled in my config and can share if required.
--- /dev/null
+++ a/include/linux/ptdump.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _LINUX_PTDUMP_H
+#define _LINUX_PTDUMP_H
+
+#include <linux/mm_types.h>
+
+struct ptdump_range {
+	unsigned long start;
+	unsigned long end;
+};
+
+struct ptdump_state {
+	void (*note_page)(struct ptdump_state *st, unsigned long addr,
+			  int level, unsigned long val);
+	const struct ptdump_range *range;
+};
+
+void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm);
+
+#endif /* _LINUX_PTDUMP_H */
--- a/mm/Kconfig.debug~mm-add-generic-ptdump
+++ a/mm/Kconfig.debug
@@ -117,3 +117,24 @@ config DEBUG_RODATA_TEST
 	depends on STRICT_KERNEL_RWX
 	---help---
 	  This option enables a testcase for the setting rodata read-only.
+
+config GENERIC_PTDUMP
+	bool
+
+config PTDUMP_CORE
+	bool
+
+config PTDUMP_DEBUGFS
+	bool "Export kernel pagetable layout to userspace via debugfs"
+	depends on DEBUG_KERNEL
+	depends on DEBUG_FS
+	depends on GENERIC_PTDUMP
+	select PTDUMP_CORE
+	help
+	  Say Y here if you want to show the kernel pagetable layout in a
+	  debugfs file. This information is only useful for kernel developers
+	  who are working in architecture specific areas of the kernel.
+	  It is probably not a good idea to enable this feature in a production
+	  kernel.
+
+	  If in doubt, say N.
--- a/mm/Makefile~mm-add-generic-ptdump
+++ a/mm/Makefile
@@ -98,6 +98,7 @@ obj-$(CONFIG_CMA)	+= cma.o
 obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
 obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
 obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
+obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
 obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
 obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
 obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
--- /dev/null
+++ a/mm/ptdump.c
@@ -0,0 +1,151 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/pagewalk.h>
+#include <linux/ptdump.h>
+#include <linux/kasan.h>
+
+static int ptdump_pgd_entry(pgd_t *pgd, unsigned long addr,
+			    unsigned long next, struct mm_walk *walk)
+{
+	struct ptdump_state *st = walk->private;
+	pgd_t val = READ_ONCE(*pgd);
+
+	if (pgd_leaf(val))
+		st->note_page(st, addr, 1, pgd_val(val));
+
+	return 0;
+}
+
+static int ptdump_p4d_entry(p4d_t *p4d, unsigned long addr,
+			    unsigned long next, struct mm_walk *walk)
+{
+	struct ptdump_state *st = walk->private;
+	p4d_t val = READ_ONCE(*p4d);
+
+	if (p4d_leaf(val))
+		st->note_page(st, addr, 2, p4d_val(val));
+
+	return 0;
+}
+
+static int ptdump_pud_entry(pud_t *pud, unsigned long addr,
+			    unsigned long next, struct mm_walk *walk)
+{
+	struct ptdump_state *st = walk->private;
+	pud_t val = READ_ONCE(*pud);
+
+	if (pud_leaf(val))
+		st->note_page(st, addr, 3, pud_val(val));
+
+	return 0;
+}
+
+static int ptdump_pmd_entry(pmd_t *pmd, unsigned long addr,
+			    unsigned long next, struct mm_walk *walk)
+{
+	struct ptdump_state *st = walk->private;
+	pmd_t val = READ_ONCE(*pmd);
+
+	if (pmd_leaf(val))
+		st->note_page(st, addr, 4, pmd_val(val));
+
+	return 0;
+}
+
+static int ptdump_pte_entry(pte_t *pte, unsigned long addr,
+			    unsigned long next, struct mm_walk *walk)
+{
+	struct ptdump_state *st = walk->private;
+
+	st->note_page(st, addr, 5, pte_val(READ_ONCE(*pte)));
+
+	return 0;
+}
+
+#ifdef CONFIG_KASAN
+/*
+ * This is an optimization for KASAN=y case. Since all kasan page tables
+ * eventually point to the kasan_early_shadow_page we could call note_page()
+ * right away without walking through lower level page tables. This saves
+ * us dozens of seconds (minutes for 5-level config) while checking for
+ * W+X mapping or reading kernel_page_tables debugfs file.
+ */
+static inline int note_kasan_page_table(struct mm_walk *walk,
+					unsigned long addr)
+{
+	struct ptdump_state *st = walk->private;
+
+	st->note_page(st, addr, 5, pte_val(kasan_early_shadow_pte[0]));
+	return 1;
+}
+
+static int ptdump_test_p4d(unsigned long addr, unsigned long next,
+			   p4d_t *p4d, struct mm_walk *walk)
+{
+#if CONFIG_PGTABLE_LEVELS > 4
+	if (p4d == lm_alias(kasan_early_shadow_p4d))
+		return note_kasan_page_table(walk, addr);
+#endif
+	return 0;
+}
+
+static int ptdump_test_pud(unsigned long addr, unsigned long next,
+			   pud_t *pud, struct mm_walk *walk)
+{
+#if CONFIG_PGTABLE_LEVELS > 3
+	if (pud == lm_alias(kasan_early_shadow_pud))
+		return note_kasan_page_table(walk, addr);
+#endif
+	return 0;
+}
+
+static int ptdump_test_pmd(unsigned long addr, unsigned long next,
+			   pmd_t *pmd, struct mm_walk *walk)
+{
+#if CONFIG_PGTABLE_LEVELS > 2
+	if (pmd == lm_alias(kasan_early_shadow_pmd))
+		return note_kasan_page_table(walk, addr);
+#endif
+	return 0;
+}
+#endif /* CONFIG_KASAN */
+
+static int ptdump_hole(unsigned long addr, unsigned long next,
+		       int depth, struct mm_walk *walk)
+{
+	struct ptdump_state *st = walk->private;
+
+	st->note_page(st, addr, depth + 1, 0);
+
+	return 0;
+}
+
+static const struct mm_walk_ops ptdump_ops = {
+	.pgd_entry	= ptdump_pgd_entry,
+	.p4d_entry	= ptdump_p4d_entry,
+	.pud_entry	= ptdump_pud_entry,
+	.pmd_entry	= ptdump_pmd_entry,
+	.pte_entry	= ptdump_pte_entry,
+#ifdef CONFIG_KASAN
+	.test_p4d	= ptdump_test_p4d,
+	.test_pud	= ptdump_test_pud,
+	.test_pmd	= ptdump_test_pmd,
+#endif
+	.pte_hole	= ptdump_hole,
+};
+
+void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm)
+{
+	const struct ptdump_range *range = st->range;
+
+	down_read(&mm->mmap_sem);
+	while (range->start != range->end) {
+		walk_page_range_novma(mm, range->start, range->end,
+				      &ptdump_ops, st);
+		range++;
+	}
+	up_read(&mm->mmap_sem);
+
+	/* Flush out the last page */
+	st->note_page(st, 0, 0, 0);
+}