
[v8,36/43] arm64: mm: Add support for folding PUDs at runtime

Message ID: 20240214122845.2033971-81-ardb+git@google.com
State: New, archived
Series: arm64: Add support for LPA2 and WXN at stage 1

Commit Message

Ard Biesheuvel Feb. 14, 2024, 12:29 p.m. UTC
From: Ard Biesheuvel <ardb@kernel.org>

In order to support LPA2 on 16k pages in a way that permits non-LPA2
systems to run the same kernel image, we have to be able to fall back to
at most 48 bits of virtual addressing.

Falling back to 48 bits would result in a level 0 with only 2 entries,
which is suboptimal in terms of TLB utilization. So instead, let's fall
back to 47 bits in that case. This means we need to be able to fold PUDs
dynamically, similar to how we fold P4Ds for 48 bit virtual addressing
on LPA2 with 4k pages.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/include/asm/pgalloc.h | 12 ++-
 arch/arm64/include/asm/pgtable.h | 87 +++++++++++++++++---
 arch/arm64/include/asm/tlb.h     |  3 +
 arch/arm64/kernel/cpufeature.c   |  2 +
 arch/arm64/mm/mmu.c              |  2 +-
 arch/arm64/mm/pgd.c              |  2 +
 6 files changed, 95 insertions(+), 13 deletions(-)
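
Concretely, the folding boils down to a runtime check: when the fourth level
is not enabled, the "PUD" entries live in the same table page as the P4D
entries and the page table walker simply re-indexes within that page. The
sketch below is pieced together from the hunks quoted later in this thread
(pud_index(), p4d_to_folded_pud(), pgtable_l4_enabled(), pud_offset_phys());
the simplified pud_offset() is an assumption about how those helpers fit
together, not the literal patch contents.

#define pud_index(addr)		(((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))

/* When the PUD level is folded, the PUD entries share a page with the P4D
 * entries, so re-derive the slot within that page from pud_index(). */
static inline pud_t *p4d_to_folded_pud(p4d_t *p4dp, unsigned long addr)
{
	return (pud_t *)PTR_ALIGN_DOWN(p4dp, PAGE_SIZE) + pud_index(addr);
}

/* Simplified walker step (sketch only) */
static inline pud_t *pud_offset(p4d_t *p4dp, unsigned long addr)
{
	if (!pgtable_l4_enabled())
		return p4d_to_folded_pud(p4dp, addr);	/* folded: no extra level */

	return (pud_t *)__va(p4d_page_paddr(READ_ONCE(*p4dp))) +
	       pud_index(addr);				/* real fourth level */
}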

Comments

Ryan Roberts Feb. 29, 2024, 2:17 p.m. UTC | #1
Hi Ard,

On 14/02/2024 12:29, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <ardb@kernel.org>
> 
> In order to support LPA2 on 16k pages in a way that permits non-LPA2
> systems to run the same kernel image, we have to be able to fall back to
> at most 48 bits of virtual addressing.
> 
> Falling back to 48 bits would result in a level 0 with only 2 entries,
> which is suboptimal in terms of TLB utilization. So instead, let's fall
> back to 47 bits in that case. This means we need to be able to fold PUDs
> dynamically, similar to how we fold P4Ds for 48 bit virtual addressing
> on LPA2 with 4k pages.

I'm seeing a panic during boot in today's linux-next (20240229) and bisect seems pretty confident that this commit is the offender. That said, it's the merge commit that shows up as the problem commit:

26843fe8fa72 Merge branch 'for-next/core' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

but when testing arm64's for-next/core branch on its own, the problem doesn't exist. So I rebased the branch onto linux-next and bisected again. That time, it fingers this patch. So I guess there is some interaction between this and other changes in linux-next?


Note I'm running defconfig (so 4K base pages) plus:

# Squashfs for snaps, xfs for large file folios.
./scripts/config --enable CONFIG_SQUASHFS_LZ4
./scripts/config --enable CONFIG_SQUASHFS_LZO
./scripts/config --enable CONFIG_SQUASHFS_XZ
./scripts/config --enable CONFIG_SQUASHFS_ZSTD
./scripts/config --enable CONFIG_XFS_FS

# Useful trace features (on for Ubuntu configs).
./scripts/config --enable CONFIG_FTRACE
./scripts/config --enable CONFIG_FUNCTION_TRACER
./scripts/config --enable CONFIG_KPROBES
./scripts/config --enable CONFIG_HIST_TRIGGERS
./scripts/config --enable CONFIG_FTRACE_SYSCALLS

# For general mm debug.
./scripts/config --enable CONFIG_DEBUG_VM
./scripts/config --enable CONFIG_DEBUG_VM_MAPLE_TREE
./scripts/config --enable CONFIG_DEBUG_VM_RB
./scripts/config --enable CONFIG_DEBUG_VM_PGFLAGS
./scripts/config --enable CONFIG_DEBUG_VM_PGTABLE
./scripts/config --enable CONFIG_PAGE_TABLE_CHECK

# For mm selftests.
./scripts/config --enable CONFIG_USERFAULTFD
./scripts/config --enable CONFIG_TEST_VMALLOC
./scripts/config --enable CONFIG_GUP_TEST

# Ram block device for testing swap changes.
./scripts/config --enable CONFIG_BLK_DEV_RAM


I'm booting a VM on Apple M2 with 12G RAM assigned, split evenly across 2 emulated NUMA nodes, and with a bunch of hugetlb pages of all sizes reserved, if that matters.


And I see this panic during boot (I guess due to the DEBUG_VM Kconfigs):

[    0.161062] debug_vm_pgtable: [debug_vm_pgtable         ]: Validating architecture page table helpers
[    0.161416] BUG: Bad page state in process swapper/0  pfn:18a65d
[    0.161634] page does not match folio
[    0.161753] page: refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x18a65d
[    0.162046] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
[    0.162332] Mem abort info:
[    0.162427]   ESR = 0x0000000096000004
[    0.162559]   EC = 0x25: DABT (current EL), IL = 32 bits
[    0.162723]   SET = 0, FnV = 0
[    0.162827]   EA = 0, S1PTW = 0
[    0.162933]   FSC = 0x04: level 0 translation fault
[    0.163089] Data abort info:
[    0.163189]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[    0.163370]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[    0.163539]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[    0.163719] [0000000000000008] user address but active_mm is swapper
[    0.163934] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[    0.164143] Modules linked in:
[    0.164251] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 6.8.0-rc6-00966-gde701dc1f7f8 #25
[    0.164516] Hardware name: linux,dummy-virt (DT)
[    0.164704] pstate: 21400005 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[    0.165052] pc : get_pfnblock_flags_mask+0x3c/0x68
[    0.165281] lr : __dump_page+0x1a0/0x408
[    0.165504] sp : ffff80008007b8f0
[    0.165715] x29: ffff80008007b8f0 x28: 0000000000ffffc0 x27: 0000000000000000
[    0.166047] x26: ffff80008007b950 x25: 0000000000000000 x24: 00000000fffffdff
[    0.166358] x23: ffffba8a417ba000 x22: 000000000018a65d x21: ffffba8a41601bf8
[    0.166701] x20: ffff80008007b950 x19: ffff80008007b950 x18: 0000000000000006
[    0.167036] x17: 78303a7865646e69 x16: 2030303030303030 x15: 0720072007200720
[    0.167365] x14: 0720072007200720 x13: 0720072007200720 x12: 0720072007200720
[    0.167693] x11: 0720072007200720 x10: ffffba8a4269c038 x9 : ffffba8a3fb0d0b8
[    0.168017] x8 : 00000000ffffefff x7 : ffffba8a4269c038 x6 : 80000000fffff000
[    0.168346] x5 : 000003fffff81de4 x4 : 0001fffffc0ef230 x3 : 0000000000000000
[    0.168699] x2 : 0000000000000007 x1 : fffffe0779181ee5 x0 : 00000000001fffff
[    0.169041] Call trace:
[    0.169164]  get_pfnblock_flags_mask+0x3c/0x68
[    0.169413]  dump_page+0x2c/0x70
[    0.169565]  bad_page+0x84/0x130
[    0.169734]  free_page_is_bad_report+0xa0/0xb8
[    0.169958]  free_unref_page_prepare+0x350/0x428
[    0.170132]  free_unref_page+0x50/0x1f0
[    0.170278]  __free_pages+0x11c/0x160
[    0.170417]  free_pages.part.0+0x6c/0x88
[    0.170576]  free_pages+0x1c/0x38
[    0.170703]  destroy_args+0x1c8/0x330
[    0.170890]  debug_vm_pgtable+0xae8/0x10f8
[    0.171059]  do_one_initcall+0x60/0x2c0
[    0.171222]  kernel_init_freeable+0x1ec/0x3d8
[    0.171406]  kernel_init+0x28/0x1f0
[    0.171557]  ret_from_fork+0x10/0x20
[    0.171712] Code: d37b1884 f100007f 8b040064 9a831083 (f9400460) 
[    0.171963] ---[ end trace 0000000000000000 ]---
[    0.172156] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[    0.172383] SMP: stopping secondary CPUs
[    0.172649] Kernel Offset: 0x3a89bf800000 from 0xffff800080000000
[    0.173923] PHYS_OFFSET: 0xfffff76180000000
[    0.174585] CPU features: 0x0,00000000,2004454a,13867723
[    0.175707] Memory Limit: none
[    0.176261] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---


bisection log (after rebasing arm64 stuff onto linux-next):

git bisect start
# bad: [446381d9ff3498f7c406109fac88d10bf855d0bd] arm64: Update setup_arch() comment on interrupt masking
git bisect bad 446381d9ff3498f7c406109fac88d10bf855d0bd
# good: [7f43e0f76e4710b2882c551519eff50e502115c5] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rmk/linux.git
git bisect good 7f43e0f76e4710b2882c551519eff50e502115c5
# good: [36da1bf4c61bf1e4322b9b04d6eb1aba2a515b73] arm64: mm: omit redundant remap of kernel image
git bisect good 36da1bf4c61bf1e4322b9b04d6eb1aba2a515b73
# bad: [38f5662b4788b308f3be3cdd15e6c0149a627937] mm: add arch hook to validate mmap() prot flags
git bisect bad 38f5662b4788b308f3be3cdd15e6c0149a627937
# good: [653a0b074c33c48913c78e72a000ff935ff208c2] arm64: mm: add LPA2 and 5 level paging support to G-to-nG conversion
git bisect good 653a0b074c33c48913c78e72a000ff935ff208c2
# bad: [ebc9452776ee8d978908eb2f7424838b0bff6285] arm64: ptdump: Disregard unaddressable VA space
git bisect bad ebc9452776ee8d978908eb2f7424838b0bff6285
# good: [1d8cd0e6257930b0df58ce51bca44e232dcce49c] arm64: mm: Add 5 level paging support to fixmap and swapper handling
git bisect good 1d8cd0e6257930b0df58ce51bca44e232dcce49c
# bad: [de701dc1f7f88e85aca48e4c76c66f03ac5fc55b] arm64: mm: Add support for folding PUDs at runtime
git bisect bad de701dc1f7f88e85aca48e4c76c66f03ac5fc55b
# good: [3561c4b14b23f03f109e954b5d89839bb8b73798] arm64: kasan: Reduce minimum shadow alignment and enable 5 level paging
git bisect good 3561c4b14b23f03f109e954b5d89839bb8b73798
# first bad commit: [de701dc1f7f88e85aca48e4c76c66f03ac5fc55b] arm64: mm: Add support for folding PUDs at runtime


I haven't looked in detail at your patch, but hoped you might get to the root cause quicker than me?

Thanks,
Ryan
Nathan Chancellor Feb. 29, 2024, 11:01 p.m. UTC | #2
On Thu, Feb 29, 2024 at 02:17:52PM +0000, Ryan Roberts wrote:
> Hi Ard,
> 
> On 14/02/2024 12:29, Ard Biesheuvel wrote:
> > From: Ard Biesheuvel <ardb@kernel.org>
> > 
> > In order to support LPA2 on 16k pages in a way that permits non-LPA2
> > systems to run the same kernel image, we have to be able to fall back to
> > at most 48 bits of virtual addressing.
> > 
> > Falling back to 48 bits would result in a level 0 with only 2 entries,
> > which is suboptimal in terms of TLB utilization. So instead, let's fall
> > back to 47 bits in that case. This means we need to be able to fold PUDs
> > dynamically, similar to how we fold P4Ds for 48 bit virtual addressing
> > on LPA2 with 4k pages.
> 
> I'm seeing a panic during boot in today's linux-next (20240229) and bisect seems pretty confident that this commit is the offender. That said, its the merge commit that shows up as the problem commit:
> 
> 26843fe8fa72 Merge branch 'for-next/core' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
> 
> but when testing the arm64's for-next/core, the problem doesn't exist. So I rebased the branch into linux-next and bisected again. That time, it fingers this patch. So I guess there is some interaction between this and other changes in next?
<...>
> [    0.161062] debug_vm_pgtable: [debug_vm_pgtable         ]: Validating architecture page table helpers
> [    0.161416] BUG: Bad page state in process swapper/0  pfn:18a65d
> [    0.161634] page does not match folio
> [    0.161753] page: refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x18a65d
> [    0.162046] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
> [    0.162332] Mem abort info:
> [    0.162427]   ESR = 0x0000000096000004
> [    0.162559]   EC = 0x25: DABT (current EL), IL = 32 bits
> [    0.162723]   SET = 0, FnV = 0
> [    0.162827]   EA = 0, S1PTW = 0
> [    0.162933]   FSC = 0x04: level 0 translation fault
> [    0.163089] Data abort info:
> [    0.163189]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
> [    0.163370]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> [    0.163539]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> [    0.163719] [0000000000000008] user address but active_mm is swapper
> [    0.163934] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
> [    0.164143] Modules linked in:
> [    0.164251] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 6.8.0-rc6-00966-gde701dc1f7f8 #25
> [    0.164516] Hardware name: linux,dummy-virt (DT)
> [    0.164704] pstate: 21400005 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
> [    0.165052] pc : get_pfnblock_flags_mask+0x3c/0x68
> [    0.165281] lr : __dump_page+0x1a0/0x408
> [    0.165504] sp : ffff80008007b8f0
> [    0.165715] x29: ffff80008007b8f0 x28: 0000000000ffffc0 x27: 0000000000000000
> [    0.166047] x26: ffff80008007b950 x25: 0000000000000000 x24: 00000000fffffdff
> [    0.166358] x23: ffffba8a417ba000 x22: 000000000018a65d x21: ffffba8a41601bf8
> [    0.166701] x20: ffff80008007b950 x19: ffff80008007b950 x18: 0000000000000006
> [    0.167036] x17: 78303a7865646e69 x16: 2030303030303030 x15: 0720072007200720
> [    0.167365] x14: 0720072007200720 x13: 0720072007200720 x12: 0720072007200720
> [    0.167693] x11: 0720072007200720 x10: ffffba8a4269c038 x9 : ffffba8a3fb0d0b8
> [    0.168017] x8 : 00000000ffffefff x7 : ffffba8a4269c038 x6 : 80000000fffff000
> [    0.168346] x5 : 000003fffff81de4 x4 : 0001fffffc0ef230 x3 : 0000000000000000
> [    0.168699] x2 : 0000000000000007 x1 : fffffe0779181ee5 x0 : 00000000001fffff
> [    0.169041] Call trace:
> [    0.169164]  get_pfnblock_flags_mask+0x3c/0x68
> [    0.169413]  dump_page+0x2c/0x70
> [    0.169565]  bad_page+0x84/0x130
> [    0.169734]  free_page_is_bad_report+0xa0/0xb8
> [    0.169958]  free_unref_page_prepare+0x350/0x428
> [    0.170132]  free_unref_page+0x50/0x1f0
> [    0.170278]  __free_pages+0x11c/0x160
> [    0.170417]  free_pages.part.0+0x6c/0x88
> [    0.170576]  free_pages+0x1c/0x38
> [    0.170703]  destroy_args+0x1c8/0x330
> [    0.170890]  debug_vm_pgtable+0xae8/0x10f8
> [    0.171059]  do_one_initcall+0x60/0x2c0
> [    0.171222]  kernel_init_freeable+0x1ec/0x3d8
> [    0.171406]  kernel_init+0x28/0x1f0
> [    0.171557]  ret_from_fork+0x10/0x20
> [    0.171712] Code: d37b1884 f100007f 8b040064 9a831083 (f9400460) 
> [    0.171963] ---[ end trace 0000000000000000 ]---
> [    0.172156] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
> [    0.172383] SMP: stopping secondary CPUs
> [    0.172649] Kernel Offset: 0x3a89bf800000 from 0xffff800080000000
> [    0.173923] PHYS_OFFSET: 0xfffff76180000000
> [    0.174585] CPU features: 0x0,00000000,2004454a,13867723
> [    0.175707] Memory Limit: none
> [    0.176261] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---

I did a second bisection by merging https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/stage1-lpa2
on top of the merges before for-next/core and eventually landed on:

d67cd9f23139ddfd7e0ef1e18474c16445188433 is the first bad commit
commit d67cd9f23139ddfd7e0ef1e18474c16445188433
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Tue Feb 27 19:23:31 2024 +0000

    mm: add __dump_folio()

    Turn __dump_page() into a wrapper around __dump_folio().  Snapshot the
    page & folio into a stack variable so we don't hit BUG_ON() if an
    allocation is freed under us and what was a folio pointer becomes a
    pointer to a tail page.

    Link: https://lkml.kernel.org/r/20240227192337.757313-5-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

 mm/debug.c | 120 +++++++++++++++++++++++++++++++++----------------------------
 1 file changed, 66 insertions(+), 54 deletions(-)

# bad: [7f43e0f76e4710b2882c551519eff50e502115c5] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rmk/linux.git
# good: [805d849d7c3cc1f38efefd48b2480d62b7b5dcb7] Merge tag 'acpi-6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
git bisect start '7f43e0f76e4710b2882c551519eff50e502115c5' '805d849d7c3cc1f38efefd48b2480d62b7b5dcb7'
# bad: [7e6ae2db7f319bf9613ec6db8fa3c9bc1de1b346] mm: add swappiness= arg to memory.reclaim
git bisect bad 7e6ae2db7f319bf9613ec6db8fa3c9bc1de1b346
# good: [c6ec76a2ebc5829e5826b218d2e1475ec11b333e] mm: add pte_batch_hint() to reduce scanning in folio_pte_batch()
git bisect good c6ec76a2ebc5829e5826b218d2e1475ec11b333e
# good: [a02829f011b64e6c102929ed55da52e38391e970] writeback: fix done_index when hitting the wbc->nr_to_write
git bisect good a02829f011b64e6c102929ed55da52e38391e970
# good: [de435b3b914686116f86494b8cb53224d7e24cc5] arm64/mm: improve comment in contpte_ptep_get_lockless()
git bisect good de435b3b914686116f86494b8cb53224d7e24cc5
# good: [c143365caad5c3ad45662c393b9114c7cc694473] mm: handle large folios in free_unref_folios()
git bisect good c143365caad5c3ad45662c393b9114c7cc694473
# skip: [ab6445067cfbaf4ac94e969f7e8e785049314099] mm: add alloc_contig_migrate_range allocation statistics
git bisect skip ab6445067cfbaf4ac94e969f7e8e785049314099
# good: [447bf726277614396adcd4beedaf77ef74a748fa] modules: wait do_free_init correctly
git bisect good 447bf726277614396adcd4beedaf77ef74a748fa
# good: [cf2ac0c3998ffcbea680aeea2dee04d450654534] mm: remove PageWaiters, PageSetWaiters and PageClearWaiters
git bisect good cf2ac0c3998ffcbea680aeea2dee04d450654534
# bad: [c48de1718df9dcafb08aefbc6a0edf46e2f94e66] mm: constify more page/folio tests
git bisect bad c48de1718df9dcafb08aefbc6a0edf46e2f94e66
# bad: [48e4e7b8eea5fc80faad81515d429bce041f352d] mm: make dump_page() take a const argument
git bisect bad 48e4e7b8eea5fc80faad81515d429bce041f352d
# bad: [d67cd9f23139ddfd7e0ef1e18474c16445188433] mm: add __dump_folio()
git bisect bad d67cd9f23139ddfd7e0ef1e18474c16445188433
# good: [e9844b2b6cf103f4f3a42119d62758eb26c5c233] mm: remove PageYoung and PageIdle definitions
git bisect good e9844b2b6cf103f4f3a42119d62758eb26c5c233
# first bad commit: [d67cd9f23139ddfd7e0ef1e18474c16445188433] mm: add __dump_folio()

Cheers,
Nathan
Ryan Roberts March 1, 2024, 8:54 a.m. UTC | #3
+ Matthew


On 29/02/2024 23:01, Nathan Chancellor wrote:
> On Thu, Feb 29, 2024 at 02:17:52PM +0000, Ryan Roberts wrote:
>> Hi Ard,
>>
>> On 14/02/2024 12:29, Ard Biesheuvel wrote:
>>> From: Ard Biesheuvel <ardb@kernel.org>
>>>
>>> In order to support LPA2 on 16k pages in a way that permits non-LPA2
>>> systems to run the same kernel image, we have to be able to fall back to
>>> at most 48 bits of virtual addressing.
>>>
>>> Falling back to 48 bits would result in a level 0 with only 2 entries,
>>> which is suboptimal in terms of TLB utilization. So instead, let's fall
>>> back to 47 bits in that case. This means we need to be able to fold PUDs
>>> dynamically, similar to how we fold P4Ds for 48 bit virtual addressing
>>> on LPA2 with 4k pages.
>>
>> I'm seeing a panic during boot in today's linux-next (20240229) and bisect seems pretty confident that this commit is the offender. That said, its the merge commit that shows up as the problem commit:
>>
>> 26843fe8fa72 Merge branch 'for-next/core' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
>>
>> but when testing the arm64's for-next/core, the problem doesn't exist. So I rebased the branch into linux-next and bisected again. That time, it fingers this patch. So I guess there is some interaction between this and other changes in next?
> <...>
>> <...>
> 
> I did a second bisection by merging https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/stage1-lpa2
> on top of the merges before for-next/core and eventually landed on:
> 
> d67cd9f23139ddfd7e0ef1e18474c16445188433 is the first bad commit
> commit d67cd9f23139ddfd7e0ef1e18474c16445188433
> Author: Matthew Wilcox (Oracle) <willy@infradead.org>
> Date:   Tue Feb 27 19:23:31 2024 +0000
> 
>     mm: add __dump_folio()
> 
>     Turn __dump_page() into a wrapper around __dump_folio().  Snapshot the
>     page & folio into a stack variable so we don't hit BUG_ON() if an
>     allocation is freed under us and what was a folio pointer becomes a
>     pointer to a tail page.
> 
>     Link: https://lkml.kernel.org/r/20240227192337.757313-5-willy@infradead.org
>     Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

So is that suggesting that Ard's patch is doing something that the old
__dump_page() was ok with but the new version doesn't like? I don't think so,
because the bad page detection has already happened before we get to __dump_page().

So I'm not really sure how this patch is involved? I'm hoping that either Ard or
Matthew may be able to take a look and advise.

> <...>
Ard Biesheuvel March 1, 2024, 9:10 a.m. UTC | #4
On Fri, 1 Mar 2024 at 09:54, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> + Matthew
>
>
> On 29/02/2024 23:01, Nathan Chancellor wrote:
> > On Thu, Feb 29, 2024 at 02:17:52PM +0000, Ryan Roberts wrote:
> >> Hi Ard,
> >>
> >> On 14/02/2024 12:29, Ard Biesheuvel wrote:
> >>> From: Ard Biesheuvel <ardb@kernel.org>
> >>>
> >>> In order to support LPA2 on 16k pages in a way that permits non-LPA2
> >>> systems to run the same kernel image, we have to be able to fall back to
> >>> at most 48 bits of virtual addressing.
> >>>
> >>> Falling back to 48 bits would result in a level 0 with only 2 entries,
> >>> which is suboptimal in terms of TLB utilization. So instead, let's fall
> >>> back to 47 bits in that case. This means we need to be able to fold PUDs
> >>> dynamically, similar to how we fold P4Ds for 48 bit virtual addressing
> >>> on LPA2 with 4k pages.
> >>
> >> I'm seeing a panic during boot in today's linux-next (20240229) and bisect seems pretty confident that this commit is the offender. That said, its the merge commit that shows up as the problem commit:
> >>
> >> 26843fe8fa72 Merge branch 'for-next/core' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
> >>
> >> but when testing the arm64's for-next/core, the problem doesn't exist. So I rebased the branch into linux-next and bisected again. That time, it fingers this patch. So I guess there is some interaction between this and other changes in next?
> > <...>
> >> <...>
> >
> > I did a second bisection by merging https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/stage1-lpa2
> > on top of the merges before for-next/core and eventually landed on:
> >
> > d67cd9f23139ddfd7e0ef1e18474c16445188433 is the first bad commit
> > commit d67cd9f23139ddfd7e0ef1e18474c16445188433
> > Author: Matthew Wilcox (Oracle) <willy@infradead.org>
> > Date:   Tue Feb 27 19:23:31 2024 +0000
> >
> >     mm: add __dump_folio()
> >
> >     Turn __dump_page() into a wrapper around __dump_folio().  Snapshot the
> >     page & folio into a stack variable so we don't hit BUG_ON() if an
> >     allocation is freed under us and what was a folio pointer becomes a
> >     pointer to a tail page.
> >
> >     Link: https://lkml.kernel.org/r/20240227192337.757313-5-willy@infradead.org
> >     Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> >     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>
> So is that suggesting that Ard's patch is doing something that the old
> __dump_page() was ok with but the new version doesn't like? I don't think so,
> because the bad page detection has already happened before we get to __dump_page().
>

Yes, there are clearly two different issues at play here. The NULL
dereference might be an issue in the __dump_page() patch, but going
down that code path in the first place seems like it might be a
problem with mine.

The mapcount of -512 looks interesting as well.

> So I'm not really sure how this patch is involved? I'm hoping that either Ard or
> Matthew may be able to take a look and advise.
>

I'll try and make sense of this today. Thanks for the report and the
bisecting work.
Ard Biesheuvel March 1, 2024, 9:37 a.m. UTC | #5
On Fri, 1 Mar 2024 at 10:10, Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Fri, 1 Mar 2024 at 09:54, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >
> > + Matthew
> >
> >
> > On 29/02/2024 23:01, Nathan Chancellor wrote:
> > > On Thu, Feb 29, 2024 at 02:17:52PM +0000, Ryan Roberts wrote:
> > >> Hi Ard,
> > >>
> > >> On 14/02/2024 12:29, Ard Biesheuvel wrote:
> > >>> From: Ard Biesheuvel <ardb@kernel.org>
> > >>>
> > >>> In order to support LPA2 on 16k pages in a way that permits non-LPA2
> > >>> systems to run the same kernel image, we have to be able to fall back to
> > >>> at most 48 bits of virtual addressing.
> > >>>
> > >>> Falling back to 48 bits would result in a level 0 with only 2 entries,
> > >>> which is suboptimal in terms of TLB utilization. So instead, let's fall
> > >>> back to 47 bits in that case. This means we need to be able to fold PUDs
> > >>> dynamically, similar to how we fold P4Ds for 48 bit virtual addressing
> > >>> on LPA2 with 4k pages.
> > >>
> > >> I'm seeing a panic during boot in today's linux-next (20240229) and bisect seems pretty confident that this commit is the offender. That said, its the merge commit that shows up as the problem commit:
> > >>
> > >> 26843fe8fa72 Merge branch 'for-next/core' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
> > >>
> > >> but when testing the arm64's for-next/core, the problem doesn't exist. So I rebased the branch into linux-next and bisected again. That time, it fingers this patch. So I guess there is some interaction between this and other changes in next?
> > > <...>
> > >> <...>
> > >
> > > I did a second bisection by merging https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/stage1-lpa2
> > > on top of the merges before for-next/core and eventually landed on:
> > >
> > > d67cd9f23139ddfd7e0ef1e18474c16445188433 is the first bad commit
> > > commit d67cd9f23139ddfd7e0ef1e18474c16445188433
> > > Author: Matthew Wilcox (Oracle) <willy@infradead.org>
> > > Date:   Tue Feb 27 19:23:31 2024 +0000
> > >
> > >     mm: add __dump_folio()
> > >
> > >     Turn __dump_page() into a wrapper around __dump_folio().  Snapshot the
> > >     page & folio into a stack variable so we don't hit BUG_ON() if an
> > >     allocation is freed under us and what was a folio pointer becomes a
> > >     pointer to a tail page.
> > >
> > >     Link: https://lkml.kernel.org/r/20240227192337.757313-5-willy@infradead.org
> > >     Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > >     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> >
> > So is that suggesting that Ard's patch is doing something that the old
> > __dump_page() was ok with but the new version doesn't like? I don't think so,
> > because the bad page detection has already happened before we get to __dump_page().
> >
>
> Yes, there are clearly two different issues at play here. The NULL
> dereference might be an issue in the __dump_page() patch, but going
> down that code path in the first place seems like it might be a
> problem with mine.
>
> The mapcount of -512 looks interesting as well.
>
> > So I'm not really sure how this patch is involved? I'm hoping that either Ard or
> > Matthew may be able to take a look and advise.
> >
>

The crash does not reproduce for me, but the warning can be fixed by

diff --git a/arch/arm64/include/asm/pgalloc.h b/arch/arm64/include/asm/pgalloc.h
index aeba2cf15a25..78f30b782889 100644
--- a/arch/arm64/include/asm/pgalloc.h
+++ b/arch/arm64/include/asm/pgalloc.h
@@ -61,7 +61,7 @@ static inline void pud_free(struct mm_struct *mm, pud_t *pud)
        if (!pgtable_l4_enabled())
                return;
        BUG_ON((unsigned long)pud & (PAGE_SIZE-1));
-       free_page((unsigned long)pud);
+       __pud_free(mm, pud);
 }
 #else
 static inline void __p4d_populate(p4d_t *p4dp, phys_addr_t pudp, p4dval_t prot)

I'll send this out as a patch shortly.
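
For context on why that one-liner matters: with this series the dynamically
allocated PUD table comes from the generic pgalloc path, which marks the page
as a page-table page, so freeing it with a bare free_page() skips the matching
teardown. That would explain the "Bad page state" report above (a mapcount of
-512 is consistent with the PG_table page type still being set when the page
hits the free path). For reference, a paraphrased sketch of the generic helper
that pud_free() defers to after the fix, based on include/asm-generic/pgalloc.h
of that era (treat the exact body as an assumption):

static inline void __pud_free(struct mm_struct *mm, pud_t *pud)
{
	struct ptdesc *ptdesc = virt_to_ptdesc(pud);

	BUG_ON((unsigned long)pud & (PAGE_SIZE - 1));
	/* undo the page-table ctor state that a bare free_page() leaves set */
	pagetable_pud_dtor(ptdesc);
	pagetable_free(ptdesc);
}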
Ryan Roberts March 1, 2024, 9:47 a.m. UTC | #6
On 01/03/2024 09:37, Ard Biesheuvel wrote:
> On Fri, 1 Mar 2024 at 10:10, Ard Biesheuvel <ardb@kernel.org> wrote:
>>
>> On Fri, 1 Mar 2024 at 09:54, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>
>>> + Matthew
>>>
>>>
>>> On 29/02/2024 23:01, Nathan Chancellor wrote:
>>>> On Thu, Feb 29, 2024 at 02:17:52PM +0000, Ryan Roberts wrote:
>>>>> Hi Ard,
>>>>>
>>>>> On 14/02/2024 12:29, Ard Biesheuvel wrote:
>>>>>> From: Ard Biesheuvel <ardb@kernel.org>
>>>>>>
>>>>>> In order to support LPA2 on 16k pages in a way that permits non-LPA2
>>>>>> systems to run the same kernel image, we have to be able to fall back to
>>>>>> at most 48 bits of virtual addressing.
>>>>>>
>>>>>> Falling back to 48 bits would result in a level 0 with only 2 entries,
>>>>>> which is suboptimal in terms of TLB utilization. So instead, let's fall
>>>>>> back to 47 bits in that case. This means we need to be able to fold PUDs
>>>>>> dynamically, similar to how we fold P4Ds for 48 bit virtual addressing
>>>>>> on LPA2 with 4k pages.
>>>>>
>>>>> I'm seeing a panic during boot in today's linux-next (20240229) and bisect seems pretty confident that this commit is the offender. That said, its the merge commit that shows up as the problem commit:
>>>>>
>>>>> 26843fe8fa72 Merge branch 'for-next/core' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
>>>>>
>>>>> but when testing the arm64's for-next/core, the problem doesn't exist. So I rebased the branch into linux-next and bisected again. That time, it fingers this patch. So I guess there is some interaction between this and other changes in next?
>>>> <...>
>>>>> <...>
>>>>
>>>> I did a second bisection by merging https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/stage1-lpa2
>>>> on top of the merges before for-next/core and eventually landed on:
>>>>
>>>> d67cd9f23139ddfd7e0ef1e18474c16445188433 is the first bad commit
>>>> commit d67cd9f23139ddfd7e0ef1e18474c16445188433
>>>> Author: Matthew Wilcox (Oracle) <willy@infradead.org>
>>>> Date:   Tue Feb 27 19:23:31 2024 +0000
>>>>
>>>>     mm: add __dump_folio()
>>>>
>>>>     Turn __dump_page() into a wrapper around __dump_folio().  Snapshot the
>>>>     page & folio into a stack variable so we don't hit BUG_ON() if an
>>>>     allocation is freed under us and what was a folio pointer becomes a
>>>>     pointer to a tail page.
>>>>
>>>>     Link: https://lkml.kernel.org/r/20240227192337.757313-5-willy@infradead.org
>>>>     Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>>>>     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>>>
>>> So is that suggesting that Ard's patch is doing something that the old
>>> __dump_page() was ok with but the new version doesn't like? I don't think so,
>>> because the bad page detection has already happened before we get to __dump_page().
>>>
>>
>> Yes, there are clearly two different issues at play here. The NULL
>> dereference might be an issue in the __dump_page() patch, but going
>> down that code path in the first place seems like it might be a
>> problem with mine.
>>
>> The mapcount of -512 looks interesting as well.
>>
>>> So I'm not really sure how this patch is involved? I'm hoping that either Ard or
>>> Matthew may be able to take a look and advise.
>>>
>>
> 
> The crash does not reproduce for me, but the warning can be fixed by
> 
> diff --git a/arch/arm64/include/asm/pgalloc.h b/arch/arm64/include/asm/pgalloc.h
> index aeba2cf15a25..78f30b782889 100644
> --- a/arch/arm64/include/asm/pgalloc.h
> +++ b/arch/arm64/include/asm/pgalloc.h
> @@ -61,7 +61,7 @@ static inline void pud_free(struct mm_struct *mm, pud_t *pud)
>         if (!pgtable_l4_enabled())
>                 return;
>         BUG_ON((unsigned long)pud & (PAGE_SIZE-1));
> -       free_page((unsigned long)pud);
> +       __pud_free(mm, pud);
>  }
>  #else
>  static inline void __p4d_populate(p4d_t *p4dp, phys_addr_t pudp, p4dval_t prot)
> 
> I'll send this out as a patch shortly.

Great, thanks! I'm pretty sure I've found the bug in Matthew's patch: it copies
the struct page to the stack to avoid a potential race, but some later macros
hide a page_to_pfn(). Since the copied page's address is on the stack, I reckon
that gives a bogus pfn. I'm just confirming and writing it up, and will send a
fix against the original patch shortly.
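
As an illustration of why the stack copy breaks things: with
CONFIG_SPARSEMEM_VMEMMAP, __page_to_pfn(page) is defined as
(unsigned long)((page) - vmemmap), i.e. the pfn is pure pointer arithmetic on
the struct page's address. A minimal sketch (the helper name is made up for
illustration):

static void pfn_from_snapshot_demo(struct page *page)
{
	struct page snapshot = *page;	/* stack copy, as __dump_folio() takes */

	/* 'page' points into the vmemmap array, so this pfn is meaningful... */
	unsigned long good = page_to_pfn(page);
	/* ...but &snapshot is a stack address, so subtracting vmemmap from it
	 * produces a nonsense pfn. */
	unsigned long bogus = page_to_pfn(&snapshot);

	pr_info("pfn %lx vs bogus pfn %lx\n", good, bogus);
}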
Ryan Roberts March 1, 2024, 10:22 a.m. UTC | #7
On 01/03/2024 09:47, Ryan Roberts wrote:
> On 01/03/2024 09:37, Ard Biesheuvel wrote:
>> On Fri, 1 Mar 2024 at 10:10, Ard Biesheuvel <ardb@kernel.org> wrote:
>>>
>>> On Fri, 1 Mar 2024 at 09:54, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> + Matthew
>>>>
>>>>
>>>> On 29/02/2024 23:01, Nathan Chancellor wrote:
>>>>> On Thu, Feb 29, 2024 at 02:17:52PM +0000, Ryan Roberts wrote:
>>>>>> Hi Ard,
>>>>>>
>>>>>> On 14/02/2024 12:29, Ard Biesheuvel wrote:
>>>>>>> From: Ard Biesheuvel <ardb@kernel.org>
>>>>>>>
>>>>>>> In order to support LPA2 on 16k pages in a way that permits non-LPA2
>>>>>>> systems to run the same kernel image, we have to be able to fall back to
>>>>>>> at most 48 bits of virtual addressing.
>>>>>>>
>>>>>>> Falling back to 48 bits would result in a level 0 with only 2 entries,
>>>>>>> which is suboptimal in terms of TLB utilization. So instead, let's fall
>>>>>>> back to 47 bits in that case. This means we need to be able to fold PUDs
>>>>>>> dynamically, similar to how we fold P4Ds for 48 bit virtual addressing
>>>>>>> on LPA2 with 4k pages.
>>>>>>
>>>>>> I'm seeing a panic during boot in today's linux-next (20240229) and bisect seems pretty confident that this commit is the offender. That said, its the merge commit that shows up as the problem commit:
>>>>>>
>>>>>> 26843fe8fa72 Merge branch 'for-next/core' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
>>>>>>
>>>>>> but when testing the arm64's for-next/core, the problem doesn't exist. So I rebased the branch into linux-next and bisected again. That time, it fingers this patch. So I guess there is some interaction between this and other changes in next?
>>>>> <...>
>>>>>> <...>
>>>>>
>>>>> I did a second bisection by merging https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/stage1-lpa2
>>>>> on top of the merges before for-next/core and eventually landed on:
>>>>>
>>>>> d67cd9f23139ddfd7e0ef1e18474c16445188433 is the first bad commit
>>>>> commit d67cd9f23139ddfd7e0ef1e18474c16445188433
>>>>> Author: Matthew Wilcox (Oracle) <willy@infradead.org>
>>>>> Date:   Tue Feb 27 19:23:31 2024 +0000
>>>>>
>>>>>     mm: add __dump_folio()
>>>>>
>>>>>     Turn __dump_page() into a wrapper around __dump_folio().  Snapshot the
>>>>>     page & folio into a stack variable so we don't hit BUG_ON() if an
>>>>>     allocation is freed under us and what was a folio pointer becomes a
>>>>>     pointer to a tail page.
>>>>>
>>>>>     Link: https://lkml.kernel.org/r/20240227192337.757313-5-willy@infradead.org
>>>>>     Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>>>>>     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>>>>
>>>> So is that suggesting that Ard's patch is doing something that the old
>>>> __dump_page() was ok with but the new version doesn't like? I don't think so,
>>>> because the bad page detection has already happened before we get to __dump_page().
>>>>
>>>
>>> Yes, there are clearly two different issues at play here. The NULL
>>> dereference might be an issue in the __dump_page() patch, but going
>>> down that code path in the first place seems like it might be a
>>> problem with mine.
>>>
>>> The mapcount of -512 looks interesting as well.
>>>
>>>> So I'm not really sure how this patch is involved? I'm hoping that either Ard or
>>>> Matthew may be able to take a look and advise.
>>>>
>>>
>>
>> The crash does not reproduce for me, but the warning can be fixed by
>>
>> diff --git a/arch/arm64/include/asm/pgalloc.h b/arch/arm64/include/asm/pgalloc.h
>> index aeba2cf15a25..78f30b782889 100644
>> --- a/arch/arm64/include/asm/pgalloc.h
>> +++ b/arch/arm64/include/asm/pgalloc.h
>> @@ -61,7 +61,7 @@ static inline void pud_free(struct mm_struct *mm, pud_t *pud)
>>         if (!pgtable_l4_enabled())
>>                 return;
>>         BUG_ON((unsigned long)pud & (PAGE_SIZE-1));
>> -       free_page((unsigned long)pud);
>> +       __pud_free(mm, pud);
>>  }
>>  #else
>>  static inline void __p4d_populate(p4d_t *p4dp, phys_addr_t pudp, p4dval_t prot)
>>
>> I'll send this out as a patch shortly.
> 
> Great thanks! I'm pretty sure I've found the bug in Matthew's patch - it is
> copying the struct page to the stack to avoid a potential race, but later some
> macros are hiding a page_to_pfn(). Since the page's address is on the stack, I
> reckon that's giving a bogus pfn. Just confirming and writing it up, and I'll
> send a fix against the original patch shortly.
> 

OK confirmed. When I fix Matthew's patch, the panic gets converted to a warning,
and when I add the fix for your patch, there is no warning at all.

See write up for Matthew's bug here:
https://lore.kernel.org/linux-mm/6de0d026-cd8d-4152-97ca-d33d2a4e2e84@arm.com/
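
In a nutshell, the failure mode is "derive an index from the address of the
stack copy instead of the original". A contrived standalone sketch, with
made-up names (this is deliberately not the real mm/debug.c code), in case
it helps:

#include <stdint.h>
#include <stdio.h>

struct page { unsigned long flags; };
static struct page mem_map[16];

/* stand-in for page_to_pfn(): the index is derived from the pointer */
#define page_to_idx(p) \
	(((uintptr_t)(p) - (uintptr_t)mem_map) / sizeof(struct page))

int main(void)
{
	struct page *page = &mem_map[5];
	struct page snapshot = *page;	/* the copy lives on the stack */

	/* prints 5 */
	printf("index from original: %lu\n", (unsigned long)page_to_idx(page));
	/* prints junk: the snapshot's stack address is unrelated to mem_map */
	printf("index from snapshot: %lu\n", (unsigned long)page_to_idx(&snapshot));
	return 0;
}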

Thanks,
Ryan
Ryan Roberts Sept. 30, 2024, 2:36 p.m. UTC | #8
Hi Ard,

On 14/02/2024 12:29, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <ardb@kernel.org>
> 
> In order to support LPA2 on 16k pages in a way that permits non-LPA2
> systems to run the same kernel image, we have to be able to fall back to
> at most 48 bits of virtual addressing.
> 
> Falling back to 48 bits would result in a level 0 with only 2 entries,
> which is suboptimal in terms of TLB utilization. So instead, let's fall
> back to 47 bits in that case. This means we need to be able to fold PUDs
> dynamically, similar to how we fold P4Ds for 48 bit virtual addressing
> on LPA2 with 4k pages.
> 
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

[...]

>  
> +#define pud_index(addr)		(((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
> +
> +static inline pud_t *p4d_to_folded_pud(p4d_t *p4dp, unsigned long addr)
> +{
> +	return (pud_t *)PTR_ALIGN_DOWN(p4dp, PAGE_SIZE) + pud_index(addr);
> +}
> +

I wonder if you could explain what this function (and its equivalents at other
levels) is doing? Why isn't it just returning p4dp cast to a (pud_t *)?

I'm working on a prototype for boot-time page size selection. For this, I'm
compile-time enabling all levels, then run-time folding the ones I don't need,
based on the selected page size and VA size.

I'm trying to reuse your run-time folding code, but I have a case where this
function is broken as written. Replacing with "return (pud_t *)p4dp;" resolves
the problem: if VA_BITS=48 and pagesize=64K, the pgd has 64 entries. p4dp is
pointing to the correct entry in the pgd already, but this code aligns back to
the start of the page, then adds pud_index(), which is wrong because
PTRS_PER_PUD != PTRS_PER_PGDIR. (In my case, these 2 macros are actually
boot-time selected values rather than compile-time constants).

I think your code is probably correct and working around PTRS_PER_PXD being
compile-time constants for the non-folded case, but I can't quite convince myself.

Thanks,
Ryan


>  static inline pud_t *p4d_pgtable(p4d_t p4d)
>  {
>  	return (pud_t *)__va(p4d_page_paddr(p4d));
>  }
>  
> -/* Find an entry in the first-level page table. */
> -#define pud_offset_phys(dir, addr)	(p4d_page_paddr(READ_ONCE(*(dir))) + pud_index(addr) * sizeof(pud_t))
> +static inline phys_addr_t pud_offset_phys(p4d_t *p4dp, unsigned long addr)
> +{
> +	BUG_ON(!pgtable_l4_enabled());
>  
> -#define pud_set_fixmap(addr)		((pud_t *)set_fixmap_offset(FIX_PUD, addr))
> -#define pud_set_fixmap_offset(p4d, addr)	pud_set_fixmap(pud_offset_phys(p4d, addr))
> -#define pud_clear_fixmap()		clear_fixmap(FIX_PUD)
> +	return p4d_page_paddr(READ_ONCE(*p4dp)) + pud_index(addr) * sizeof(pud_t);
> +}
>  
> -#define p4d_page(p4d)		pfn_to_page(__phys_to_pfn(__p4d_to_phys(p4d)))
> +static inline
> +pud_t *pud_offset_lockless(p4d_t *p4dp, p4d_t p4d, unsigned long addr)
> +{
> +	if (!pgtable_l4_enabled())
> +		return p4d_to_folded_pud(p4dp, addr);
> +	return (pud_t *)__va(p4d_page_paddr(p4d)) + pud_index(addr);
> +}
> +#define pud_offset_lockless pud_offset_lockless
> +
> +static inline pud_t *pud_offset(p4d_t *p4dp, unsigned long addr)
> +{
> +	return pud_offset_lockless(p4dp, READ_ONCE(*p4dp), addr);
> +}
> +#define pud_offset	pud_offset
> +
> +static inline pud_t *pud_set_fixmap(unsigned long addr)
> +{
> +	if (!pgtable_l4_enabled())
> +		return NULL;
> +	return (pud_t *)set_fixmap_offset(FIX_PUD, addr);
> +}
> +
> +static inline pud_t *pud_set_fixmap_offset(p4d_t *p4dp, unsigned long addr)
> +{
> +	if (!pgtable_l4_enabled())
> +		return p4d_to_folded_pud(p4dp, addr);
> +	return pud_set_fixmap(pud_offset_phys(p4dp, addr));
> +}
> +
> +static inline void pud_clear_fixmap(void)
> +{
> +	if (pgtable_l4_enabled())
> +		clear_fixmap(FIX_PUD);
> +}
>  
>  /* use ONLY for statically allocated translation tables */
> -#define pud_offset_kimg(dir,addr)	((pud_t *)__phys_to_kimg(pud_offset_phys((dir), (addr))))
> +static inline pud_t *pud_offset_kimg(p4d_t *p4dp, u64 addr)
> +{
> +	if (!pgtable_l4_enabled())
> +		return p4d_to_folded_pud(p4dp, addr);
> +	return (pud_t *)__phys_to_kimg(pud_offset_phys(p4dp, addr));
> +}
> +
> +#define p4d_page(p4d)		pfn_to_page(__phys_to_pfn(__p4d_to_phys(p4d)))
>  
>  #else
>  
> +static inline bool pgtable_l4_enabled(void) { return false; }
> +
>  #define p4d_page_paddr(p4d)	({ BUILD_BUG(); 0;})
>  
>  /* Match pud_offset folding in <asm/generic/pgtable-nopud.h> */
> diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
> index 0150deb332af..a947c6e784ed 100644
> --- a/arch/arm64/include/asm/tlb.h
> +++ b/arch/arm64/include/asm/tlb.h
> @@ -103,6 +103,9 @@ static inline void __pud_free_tlb(struct mmu_gather *tlb, pud_t *pudp,
>  {
>  	struct ptdesc *ptdesc = virt_to_ptdesc(pudp);
>  
> +	if (!pgtable_l4_enabled())
> +		return;
> +
>  	pagetable_pud_dtor(ptdesc);
>  	tlb_remove_ptdesc(tlb, ptdesc);
>  }
> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
> index bc5e4e569864..94f035f6c421 100644
> --- a/arch/arm64/kernel/cpufeature.c
> +++ b/arch/arm64/kernel/cpufeature.c
> @@ -1767,6 +1767,8 @@ static int __init __kpti_install_ng_mappings(void *__unused)
>  
>  	if (levels == 5 && !pgtable_l5_enabled())
>  		levels = 4;
> +	else if (levels == 4 && !pgtable_l4_enabled())
> +		levels = 3;
>  
>  	remap_fn = (void *)__pa_symbol(idmap_kpti_install_ng_mappings);
>  
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 8e5b3a7c5afd..b131ed31a6c8 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1065,7 +1065,7 @@ static void free_empty_pud_table(p4d_t *p4dp, unsigned long addr,
>  		free_empty_pmd_table(pudp, addr, next, floor, ceiling);
>  	} while (addr = next, addr < end);
>  
> -	if (CONFIG_PGTABLE_LEVELS <= 3)
> +	if (!pgtable_l4_enabled())
>  		return;
>  
>  	if (!pgtable_range_aligned(start, end, floor, ceiling, P4D_MASK))
> diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
> index 3c4f8a279d2b..0c501cabc238 100644
> --- a/arch/arm64/mm/pgd.c
> +++ b/arch/arm64/mm/pgd.c
> @@ -21,6 +21,8 @@ static bool pgdir_is_page_size(void)
>  {
>  	if (PGD_SIZE == PAGE_SIZE)
>  		return true;
> +	if (CONFIG_PGTABLE_LEVELS == 4)
> +		return !pgtable_l4_enabled();
>  	if (CONFIG_PGTABLE_LEVELS == 5)
>  		return !pgtable_l5_enabled();
>  	return false;
Ard Biesheuvel Sept. 30, 2024, 2:53 p.m. UTC | #9
On Mon, 30 Sept 2024 at 16:36, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Hi Ard,
>
> On 14/02/2024 12:29, Ard Biesheuvel wrote:
> > From: Ard Biesheuvel <ardb@kernel.org>
> >
> > In order to support LPA2 on 16k pages in a way that permits non-LPA2
> > systems to run the same kernel image, we have to be able to fall back to
> > at most 48 bits of virtual addressing.
> >
> > Falling back to 48 bits would result in a level 0 with only 2 entries,
> > which is suboptimal in terms of TLB utilization. So instead, let's fall
> > back to 47 bits in that case. This means we need to be able to fold PUDs
> > dynamically, similar to how we fold P4Ds for 48 bit virtual addressing
> > on LPA2 with 4k pages.
> >
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
>
> [...]
>
> >
> > +#define pud_index(addr)              (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
> > +
> > +static inline pud_t *p4d_to_folded_pud(p4d_t *p4dp, unsigned long addr)
> > +{
> > +     return (pud_t *)PTR_ALIGN_DOWN(p4dp, PAGE_SIZE) + pud_index(addr);
> > +}
> > +
>
> I wonder if you could explain what this function (and its equivalents at other
> levels) is doing? Why isn't it just returning p4dp cast to a (pud_t *)?
>

Because the p4dp index is derived from a different part of the VA,
it points into the right page but at the wrong entry.
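
Concretely, for the 16k/LPA2 fallback case (compile-time VA_BITS=52, runtime
vabits_actual=47 with the PUD level folded), the index arithmetic can be
sketched in user space like this - the constants are what I'd expect the
compile-time macros to evaluate to for that configuration, so treat it as an
illustration rather than a reference:

#include <stdio.h>

#define PAGE_SHIFT	14
#define PGDIR_SHIFT	47				/* level 0, VA_BITS=52 */
#define PTRS_PER_PGD	(1UL << (52 - PGDIR_SHIFT))	/* 32   */
#define PUD_SHIFT	36				/* level 1 */
#define PTRS_PER_PUD	(1UL << (PAGE_SHIFT - 3))	/* 2048 */

int main(void)
{
	unsigned long addr = 0x7000aa000000UL;	/* some 47-bit user VA */

	/* p4dp is really the pgd entry, indexed with the level 0 geometry */
	unsigned long p4d_slot = (addr >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1);
	/* but the descriptor we need lives at the level 1 index */
	unsigned long pud_slot = (addr >> PUD_SHIFT) & (PTRS_PER_PUD - 1);

	printf("p4d slot %lu, wanted pud slot %lu\n", p4d_slot, pud_slot);
	return 0;
}

Here that prints slot 0 vs slot 1792: with the PUD folded the pgd is a full
16k page, so both slots live in the same page but at different offsets, which
is why p4d_to_folded_pud() realigns to the page start and re-indexes with
pud_index().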

> I'm working on a prototype for boot-time page size selection. For this, I'm
> compile-time enabling all levels, then run-time folding the ones I don't need,
> based on the selected page size and VA size.
>

Nice!

> I'm trying to reuse your run-time folding code, but I have a case where this
> function is broken as written. Replacing with "return (pud_t *)p4dp;" resolves
> the problem; If VA_BITS=48 and pagesize=64K, the pgd has 64 entries. p4dp is
> pointing to the correct entry in the pgd already, but this code aligns back to
> the start of the page, then adds pud_index(), which is wrong because
> PTRS_PER_PUD != PTRS_PER_PGDIR. (In my case, these 2 macros are actually
> boot-time selected values rather than compile-time constants).
>
> I think your code is probably correct and working around PTRS_PER_PXD being
> compile-time constants for the non-folded case, but I can't quite convince myself.
>

Is your p4dp pointing to the correct descriptor because you changed
the runtime behavior of p4d_index() perhaps?
Ryan Roberts Sept. 30, 2024, 3:12 p.m. UTC | #10
On 30/09/2024 15:53, Ard Biesheuvel wrote:
> On Mon, 30 Sept 2024 at 16:36, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> Hi Ard,
>>
>> On 14/02/2024 12:29, Ard Biesheuvel wrote:
>>> From: Ard Biesheuvel <ardb@kernel.org>
>>>
>>> In order to support LPA2 on 16k pages in a way that permits non-LPA2
>>> systems to run the same kernel image, we have to be able to fall back to
>>> at most 48 bits of virtual addressing.
>>>
>>> Falling back to 48 bits would result in a level 0 with only 2 entries,
>>> which is suboptimal in terms of TLB utilization. So instead, let's fall
>>> back to 47 bits in that case. This means we need to be able to fold PUDs
>>> dynamically, similar to how we fold P4Ds for 48 bit virtual addressing
>>> on LPA2 with 4k pages.
>>>
>>> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
>>
>> [...]
>>
>>>
>>> +#define pud_index(addr)              (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
>>> +
>>> +static inline pud_t *p4d_to_folded_pud(p4d_t *p4dp, unsigned long addr)
>>> +{
>>> +     return (pud_t *)PTR_ALIGN_DOWN(p4dp, PAGE_SIZE) + pud_index(addr);
>>> +}
>>> +
>>
>> I wonder if you could explain what this function (and its equivalents at other
>> levels) is doing? Why isn't it just returning p4dp cast to a (pud_t *)?
>>
> 
> Because the p4dp index is derived from a different part of the VA, so
> it points into the right page but at the wrong entry.

OK, yeah I think that's sunk in. TBH, the folding stuff melts my brain. Thanks
for the quick response. So the code is definitely correct, and it's needed
because the PxD_SHIFT and PTRS_PER_PxD are "wrong" for the real geometry.

> 
>> I'm working on a prototype for boot-time page size selection. For this, I'm
>> compile-time enabling all levels, then run-time folding the ones I don't need,
>> based on the selected page size and VA size.
>>
> 
> Nice!

Certainly in principle. Hoping to get an RFC out during October.

> 
>> I'm trying to reuse your run-time folding code, but I have a case where this
>> function is broken as written. Replacing with "return (pud_t *)p4dp;" resolves
>> the problem; If VA_BITS=48 and pagesize=64K, the pgd has 64 entries. p4dp is
>> pointing to the correct entry in the pgd already, but this code aligns back to
>> the start of the page, then adds pud_index(), which is wrong because
>> PTRS_PER_PUD != PTRS_PER_PGDIR. (In my case, these 2 macros are actually
>> boot-time selected values rather than compile-time constants).
>>
>> I think your code is probably correct and working around PTRS_PER_PXD being
>> compile-time constants for the non-folded case, but I can't quite convince myself.
>>
> 
> Is your p4dp pointing to the correct descriptor because you changed
> the runtime behavior of p4d_index() perhaps?

Yes; in my prototype, P4D_SHIFT and PTRS_PER_P4D are set at runtime and end up
the same as they would have been if pgtable-nop4d.h was included for
compile-time folding.
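
(For reference, the generic compile-time folding in
<asm-generic/pgtable-nop4d.h> amounts to roughly the following - quoting from
memory, so double-check against the actual header:

#define P4D_SHIFT	PGDIR_SHIFT
#define PTRS_PER_P4D	1

static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
{
	/* the p4d level aliases the pgd entry itself */
	return (p4d_t *)pgd;
}

so with my runtime values matching that, p4dp already points at the correct
pgd entry and only needs the cast.)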
Ard Biesheuvel Oct. 1, 2024, 6:23 a.m. UTC | #11
On Mon, 30 Sept 2024 at 17:12, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 30/09/2024 15:53, Ard Biesheuvel wrote:
> > On Mon, 30 Sept 2024 at 16:36, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> Hi Ard,
> >>
> >> On 14/02/2024 12:29, Ard Biesheuvel wrote:
> >>> From: Ard Biesheuvel <ardb@kernel.org>
> >>>
> >>> In order to support LPA2 on 16k pages in a way that permits non-LPA2
> >>> systems to run the same kernel image, we have to be able to fall back to
> >>> at most 48 bits of virtual addressing.
> >>>
> >>> Falling back to 48 bits would result in a level 0 with only 2 entries,
> >>> which is suboptimal in terms of TLB utilization. So instead, let's fall
> >>> back to 47 bits in that case. This means we need to be able to fold PUDs
> >>> dynamically, similar to how we fold P4Ds for 48 bit virtual addressing
> >>> on LPA2 with 4k pages.
> >>>
> >>> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> >>
> >> [...]
> >>
> >>>
> >>> +#define pud_index(addr)              (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
> >>> +
> >>> +static inline pud_t *p4d_to_folded_pud(p4d_t *p4dp, unsigned long addr)
> >>> +{
> >>> +     return (pud_t *)PTR_ALIGN_DOWN(p4dp, PAGE_SIZE) + pud_index(addr);
> >>> +}
> >>> +
> >>
> >> I wonder if you could explain what this function (and its equivalents at other
> >> levels) is doing? Why isn't it just returning p4dp cast to a (pud_t *)?
> >>
> >
> > Because the p4dp index is derived from a different part of the VA, so
> > it points into the right page but at the wrong entry.
>
> OK, yeah I think that's sunk in. TBH, the folding stuff melts my brain. Thanks
> for the quick response. So the code is definitely correct, and it's needed
> because the PxD_SHIFT and PTRS_PER_PxD are "wrong" for the real geometry.
>

Indeed. I never considered putting this folding behavior in
p?d_index() though - perhaps that is a better place for it to begin
with.

> >
> >> I'm working on a prototype for boot-time page size selection. For this, I'm
> >> compile-time enabling all levels, then run-time folding the ones I don't need,
> >> based on the selected page size and VA size.
> >>
> >
> > Nice!
>
> Certainly in principle. Hoping to get an RFC out during October.
>

While you're at it, the Android folks will probably give you a medal
if you can manage 16k pages in user space, with 4k for the kernel and
for compat tasks.

But seriously, I'd be happy to compare notes about this - one thing I
have been meaning to do is to reduce the number of configurations we
support, by always using 52 bits in the kernel, and allowing some kind
of runtime folding for user space to reduce the depth where it
matters.

> >
> >> I'm trying to reuse your run-time folding code, but I have a case where this
> >> function is broken as written. Replacing with "return (pud_t *)p4dp;" resolves
> >> the problem; If VA_BITS=48 and pagesize=64K, the pgd has 64 entries. p4dp is
> >> pointing to the correct entry in the pgd already, but this code aligns back to
> >> the start of the page, then adds pud_index(), which is wrong because
> >> PTRS_PER_PUD != PTRS_PER_PGDIR. (In my case, these 2 macros are actually
> >> boot-time selected values rather than compile-time constants).
> >>
> >> I think your code is probably correct and working around PTRS_PER_PXD being
> >> compile-time constants for the non-folded case, but I can't quite convince myself.
> >>
> >
> > Is your p4dp pointing to the correct descriptor because you changed
> > the runtime behavior of p4d_index() perhaps?
>
> Yes; in my prototype, P4D_SHIFT and PTRS_PER_P4D are set at runtime and end up
> the same as they would have been if pgtable-nop4d.h was included for
> compile-time folding.
>

Yeah that makes sense. Good luck! :-)
Ryan Roberts Oct. 2, 2024, 9:08 a.m. UTC | #12
On 01/10/2024 07:23, Ard Biesheuvel wrote:
> On Mon, 30 Sept 2024 at 17:12, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 30/09/2024 15:53, Ard Biesheuvel wrote:
>>> On Mon, 30 Sept 2024 at 16:36, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> Hi Ard,
>>>>
>>>> On 14/02/2024 12:29, Ard Biesheuvel wrote:
>>>>> From: Ard Biesheuvel <ardb@kernel.org>
>>>>>
>>>>> In order to support LPA2 on 16k pages in a way that permits non-LPA2
>>>>> systems to run the same kernel image, we have to be able to fall back to
>>>>> at most 48 bits of virtual addressing.
>>>>>
>>>>> Falling back to 48 bits would result in a level 0 with only 2 entries,
>>>>> which is suboptimal in terms of TLB utilization. So instead, let's fall
>>>>> back to 47 bits in that case. This means we need to be able to fold PUDs
>>>>> dynamically, similar to how we fold P4Ds for 48 bit virtual addressing
>>>>> on LPA2 with 4k pages.
>>>>>
>>>>> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
>>>>
>>>> [...]
>>>>
>>>>>
>>>>> +#define pud_index(addr)              (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
>>>>> +
>>>>> +static inline pud_t *p4d_to_folded_pud(p4d_t *p4dp, unsigned long addr)
>>>>> +{
>>>>> +     return (pud_t *)PTR_ALIGN_DOWN(p4dp, PAGE_SIZE) + pud_index(addr);
>>>>> +}
>>>>> +
>>>>
>>>> I wonder if you could explain what this function (and its equivalents at other
>>>> levels) is doing? Why isn't it just returning p4dp cast to a (pud_t *)?
>>>>
>>>
>>> Because the p4dp index is derived from a different part of the VA, so
>>> it points into the right page but at the wrong entry.
>>
>> OK, yeah I think that's sunk in. TBH, the folding stuff melts my brain. Thanks
>> for the quick response. So the code is definitely correct, and it's needed
>> because the PxD_SHIFT and PTRS_PER_PxD are "wrong" for the real geometry.
>>
> 
> Indeed. I never considered putting this folding behavior in
> p?d_index() though - perhaps that is a better place for it to begin
> with.

Yes perhaps. For now, I've left your code as is for the compile-time variant,
and for the boot-time variant, p4d_to_folded_pud() just returns the cast pointer.

> 
>>>
>>>> I'm working on a prototype for boot-time page size selection. For this, I'm
>>>> compile-time enabling all levels, then run-time folding the ones I don't need,
>>>> based on the selected page size and VA size.
>>>>
>>>
>>> Nice!
>>
>> Certainly in principle. Hoping to get an RFC out during October.
>>
> 
> While you're at it, the Android folks will probably give you a medal
> if you can manage 16k pages in user space, with 4k for the kernel and
> for compat tasks.

Ha - that's the ambition I started with. I have a design that I believe solved
all the issues except one: how to present procfs information about a process to
a different process with a larger page size. I felt there were likely going to
be other ABI confusion edge cases like that lurking.

Eventually (with Rutland's help :) ) I conceded it was just a huge amount of
work. And after talking with the Android guys, decided to park it. Perhaps
something for the future, if there are other valid use cases. Boot time page
size selection has value on its own, but is orthogonal to per-process page size.

> 
> But seriously, I'd be happy to compare notes about this - one thing I
> have been meaning to do is to reduce the number of configurations we
> support, by always using 52 bits in the kernel, and allowing some kind
> of runtime folding for user space to reduce the depth where it
> matters.

OK that's interesting. I can see some cross-over there. I think I'm on the home
straight for an RFC. So how about I get that posted, then we can have a chat?

> 
>>>
>>>> I'm trying to reuse your run-time folding code, but I have a case where this
>>>> function is broken as written. Replacing with "return (pud_t *)p4dp;" resolves
>>>> the problem; If VA_BITS=48 and pagesize=64K, the pgd has 64 entries. p4dp is
>>>> pointing to the correct entry in the pgd already, but this code aligns back to
>>>> the start of the page, then adds pud_index(), which is wrong because
>>>> PTRS_PER_PUD != PTRS_PER_PGDIR. (In my case, these 2 macros are actually
>>>> boot-time selected values rather than compile-time constants).
>>>>
>>>> I think your code is probably correct and working around PTRS_PER_PXD being
>>>> compile-time constants for the non-folded case, but I can't quite convince myself.
>>>>
>>>
>>> Is your p4dp pointing to the correct descriptor because you changed
>>> the runtime behavior of p4d_index() perhaps?
>>
>> Yes; in my prototype, P4D_SHIFT and PTRS_PER_P4D are set at runtime and end up
>> the same as they would have been if pgtable-nop4d.h was included for
>> compile-time folding.
>>
> 
> Yeah that makes sense. Good luck! :-)
Ard Biesheuvel Oct. 12, 2024, 9:47 a.m. UTC | #13
On Wed, 2 Oct 2024 at 11:08, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 01/10/2024 07:23, Ard Biesheuvel wrote:
...
> > While you're at it, the Android folks will probably give you a medal
> > if you can manage 16k pages in user space, with 4k for the kernel and
> > for compat tasks.
>
> Ha - that's the ambition I started with. I have a design that I believe solved
> all the issues except one; how to present procfs information about a process to
> a different process with a larger page size. I felt there were likely going to
> be other ABI confusion edge case like that lurking.
>
> Eventually (with Rutland's help :) ) I conceded it was just a huge amount of
> work. And after talking with the Android guys, decided to park it. Perhaps
> something for the future, if there are other valid use cases. Boot time page
> size selection has value on its own, but is orthogonal to per-process page size.
>
> >
> > But seriously, I'd be happy to compare notes about this - one thing I
> > have been meaning to do is to reduce the number of configurations we
> > support, by always using 52 bits in the kernel, and allowing some kind
> > of runtime folding for user space to reduce the depth where it
> > matters.
>
> OK that's interesting. I can see some cross-over there. I think I'm on the home
> straight for an RFC. So how about I get that posted then we can have a chat?
>

Sounds good.

Patch

diff --git a/arch/arm64/include/asm/pgalloc.h b/arch/arm64/include/asm/pgalloc.h
index cae8c648f462..aeba2cf15a25 100644
--- a/arch/arm64/include/asm/pgalloc.h
+++ b/arch/arm64/include/asm/pgalloc.h
@@ -14,6 +14,7 @@ 
 #include <asm/tlbflush.h>
 
 #define __HAVE_ARCH_PGD_FREE
+#define __HAVE_ARCH_PUD_FREE
 #include <asm-generic/pgalloc.h>
 
 #define PGD_SIZE	(PTRS_PER_PGD * sizeof(pgd_t))
@@ -43,7 +44,8 @@  static inline void __pud_populate(pud_t *pudp, phys_addr_t pmdp, pudval_t prot)
 
 static inline void __p4d_populate(p4d_t *p4dp, phys_addr_t pudp, p4dval_t prot)
 {
-	set_p4d(p4dp, __p4d(__phys_to_p4d_val(pudp) | prot));
+	if (pgtable_l4_enabled())
+		set_p4d(p4dp, __p4d(__phys_to_p4d_val(pudp) | prot));
 }
 
 static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4dp, pud_t *pudp)
@@ -53,6 +55,14 @@  static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4dp, pud_t *pudp)
 	p4dval |= (mm == &init_mm) ? P4D_TABLE_UXN : P4D_TABLE_PXN;
 	__p4d_populate(p4dp, __pa(pudp), p4dval);
 }
+
+static inline void pud_free(struct mm_struct *mm, pud_t *pud)
+{
+	if (!pgtable_l4_enabled())
+		return;
+	BUG_ON((unsigned long)pud & (PAGE_SIZE-1));
+	free_page((unsigned long)pud);
+}
 #else
 static inline void __p4d_populate(p4d_t *p4dp, phys_addr_t pudp, p4dval_t prot)
 {
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 3d7fb3cde83d..b3c716fa8121 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -759,12 +759,27 @@  static inline pmd_t *pud_pgtable(pud_t pud)
 
 #if CONFIG_PGTABLE_LEVELS > 3
 
+static __always_inline bool pgtable_l4_enabled(void)
+{
+	if (CONFIG_PGTABLE_LEVELS > 4 || !IS_ENABLED(CONFIG_ARM64_LPA2))
+		return true;
+	if (!alternative_has_cap_likely(ARM64_ALWAYS_BOOT))
+		return vabits_actual == VA_BITS;
+	return alternative_has_cap_unlikely(ARM64_HAS_VA52);
+}
+
+static inline bool mm_pud_folded(const struct mm_struct *mm)
+{
+	return !pgtable_l4_enabled();
+}
+#define mm_pud_folded  mm_pud_folded
+
 #define pud_ERROR(e)	\
 	pr_err("%s:%d: bad pud %016llx.\n", __FILE__, __LINE__, pud_val(e))
 
-#define p4d_none(p4d)		(!p4d_val(p4d))
-#define p4d_bad(p4d)		(!(p4d_val(p4d) & 2))
-#define p4d_present(p4d)	(p4d_val(p4d))
+#define p4d_none(p4d)		(pgtable_l4_enabled() && !p4d_val(p4d))
+#define p4d_bad(p4d)		(pgtable_l4_enabled() && !(p4d_val(p4d) & 2))
+#define p4d_present(p4d)	(!p4d_none(p4d))
 
 static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
 {
@@ -780,7 +795,8 @@  static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
 
 static inline void p4d_clear(p4d_t *p4dp)
 {
-	set_p4d(p4dp, __p4d(0));
+	if (pgtable_l4_enabled())
+		set_p4d(p4dp, __p4d(0));
 }
 
 static inline phys_addr_t p4d_page_paddr(p4d_t p4d)
@@ -788,25 +804,74 @@  static inline phys_addr_t p4d_page_paddr(p4d_t p4d)
 	return __p4d_to_phys(p4d);
 }
 
+#define pud_index(addr)		(((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
+
+static inline pud_t *p4d_to_folded_pud(p4d_t *p4dp, unsigned long addr)
+{
+	return (pud_t *)PTR_ALIGN_DOWN(p4dp, PAGE_SIZE) + pud_index(addr);
+}
+
 static inline pud_t *p4d_pgtable(p4d_t p4d)
 {
 	return (pud_t *)__va(p4d_page_paddr(p4d));
 }
 
-/* Find an entry in the first-level page table. */
-#define pud_offset_phys(dir, addr)	(p4d_page_paddr(READ_ONCE(*(dir))) + pud_index(addr) * sizeof(pud_t))
+static inline phys_addr_t pud_offset_phys(p4d_t *p4dp, unsigned long addr)
+{
+	BUG_ON(!pgtable_l4_enabled());
 
-#define pud_set_fixmap(addr)		((pud_t *)set_fixmap_offset(FIX_PUD, addr))
-#define pud_set_fixmap_offset(p4d, addr)	pud_set_fixmap(pud_offset_phys(p4d, addr))
-#define pud_clear_fixmap()		clear_fixmap(FIX_PUD)
+	return p4d_page_paddr(READ_ONCE(*p4dp)) + pud_index(addr) * sizeof(pud_t);
+}
 
-#define p4d_page(p4d)		pfn_to_page(__phys_to_pfn(__p4d_to_phys(p4d)))
+static inline
+pud_t *pud_offset_lockless(p4d_t *p4dp, p4d_t p4d, unsigned long addr)
+{
+	if (!pgtable_l4_enabled())
+		return p4d_to_folded_pud(p4dp, addr);
+	return (pud_t *)__va(p4d_page_paddr(p4d)) + pud_index(addr);
+}
+#define pud_offset_lockless pud_offset_lockless
+
+static inline pud_t *pud_offset(p4d_t *p4dp, unsigned long addr)
+{
+	return pud_offset_lockless(p4dp, READ_ONCE(*p4dp), addr);
+}
+#define pud_offset	pud_offset
+
+static inline pud_t *pud_set_fixmap(unsigned long addr)
+{
+	if (!pgtable_l4_enabled())
+		return NULL;
+	return (pud_t *)set_fixmap_offset(FIX_PUD, addr);
+}
+
+static inline pud_t *pud_set_fixmap_offset(p4d_t *p4dp, unsigned long addr)
+{
+	if (!pgtable_l4_enabled())
+		return p4d_to_folded_pud(p4dp, addr);
+	return pud_set_fixmap(pud_offset_phys(p4dp, addr));
+}
+
+static inline void pud_clear_fixmap(void)
+{
+	if (pgtable_l4_enabled())
+		clear_fixmap(FIX_PUD);
+}
 
 /* use ONLY for statically allocated translation tables */
-#define pud_offset_kimg(dir,addr)	((pud_t *)__phys_to_kimg(pud_offset_phys((dir), (addr))))
+static inline pud_t *pud_offset_kimg(p4d_t *p4dp, u64 addr)
+{
+	if (!pgtable_l4_enabled())
+		return p4d_to_folded_pud(p4dp, addr);
+	return (pud_t *)__phys_to_kimg(pud_offset_phys(p4dp, addr));
+}
+
+#define p4d_page(p4d)		pfn_to_page(__phys_to_pfn(__p4d_to_phys(p4d)))
 
 #else
 
+static inline bool pgtable_l4_enabled(void) { return false; }
+
 #define p4d_page_paddr(p4d)	({ BUILD_BUG(); 0;})
 
 /* Match pud_offset folding in <asm/generic/pgtable-nopud.h> */
diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
index 0150deb332af..a947c6e784ed 100644
--- a/arch/arm64/include/asm/tlb.h
+++ b/arch/arm64/include/asm/tlb.h
@@ -103,6 +103,9 @@  static inline void __pud_free_tlb(struct mmu_gather *tlb, pud_t *pudp,
 {
 	struct ptdesc *ptdesc = virt_to_ptdesc(pudp);
 
+	if (!pgtable_l4_enabled())
+		return;
+
 	pagetable_pud_dtor(ptdesc);
 	tlb_remove_ptdesc(tlb, ptdesc);
 }
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index bc5e4e569864..94f035f6c421 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -1767,6 +1767,8 @@  static int __init __kpti_install_ng_mappings(void *__unused)
 
 	if (levels == 5 && !pgtable_l5_enabled())
 		levels = 4;
+	else if (levels == 4 && !pgtable_l4_enabled())
+		levels = 3;
 
 	remap_fn = (void *)__pa_symbol(idmap_kpti_install_ng_mappings);
 
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 8e5b3a7c5afd..b131ed31a6c8 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1065,7 +1065,7 @@  static void free_empty_pud_table(p4d_t *p4dp, unsigned long addr,
 		free_empty_pmd_table(pudp, addr, next, floor, ceiling);
 	} while (addr = next, addr < end);
 
-	if (CONFIG_PGTABLE_LEVELS <= 3)
+	if (!pgtable_l4_enabled())
 		return;
 
 	if (!pgtable_range_aligned(start, end, floor, ceiling, P4D_MASK))
diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
index 3c4f8a279d2b..0c501cabc238 100644
--- a/arch/arm64/mm/pgd.c
+++ b/arch/arm64/mm/pgd.c
@@ -21,6 +21,8 @@  static bool pgdir_is_page_size(void)
 {
 	if (PGD_SIZE == PAGE_SIZE)
 		return true;
+	if (CONFIG_PGTABLE_LEVELS == 4)
+		return !pgtable_l4_enabled();
 	if (CONFIG_PGTABLE_LEVELS == 5)
 		return !pgtable_l5_enabled();
 	return false;