From patchwork Mon Mar 17 14:16:49 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kevin Brodsky X-Patchwork-Id: 14019500 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 123BEC282EC for ; Mon, 17 Mar 2025 14:23:42 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender:List-Subscribe:List-Help :List-Post:List-Archive:List-Unsubscribe:List-Id:Content-Transfer-Encoding: MIME-Version:Message-ID:Date:Subject:Cc:To:From:Reply-To:Content-Type: Content-ID:Content-Description:Resent-Date:Resent-From:Resent-Sender: Resent-To:Resent-Cc:Resent-Message-ID:In-Reply-To:References:List-Owner; bh=GQCGHjxoviKrg5MlZcwQWV/9UYQWX2agQfORO3jzRQU=; b=4zEpD282V9YtPlMHy22DxiWwDA 6UONfc90dxKi776ewUfIT7zbJlQR5D3DbsM+4kLE2AIyag3qGANntzSLmBQ2OtcsIFL4I2D8e+5Pg QMSo+X+NDyXXhLOzjs1L92/BDZXStzNQUp3DGUdcaDiM0m439qpS6tjvYCnMcXkGlKOp6kztkrM2L d9CWU0hQoVPYSbS3NI2+RThC3KW1fu3AnMzCDdByunL7PtXJeMN3t7aLo2dRASKirv0Uv6f9Mg84e Dy6ae2xFDQiwezl0taZz5J8r6Qj1mw+sFgPVggETnqNuk8jUgrceHUwn+CrXA/ECL7yB01zvwDNRL FtG1Q6tg==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.98 #2 (Red Hat Linux)) id 1tuBNI-00000002vHN-3HeL; Mon, 17 Mar 2025 14:23:28 +0000 Received: from foss.arm.com ([217.140.110.172]) by bombadil.infradead.org with esmtp (Exim 4.98 #2 (Red Hat Linux)) id 1tuBLW-00000002uI6-3f0J; Mon, 17 Mar 2025 14:21:40 +0000 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 89EB513D5; Mon, 17 Mar 2025 07:21:39 -0700 (PDT) Received: from e123572-lin.arm.com (e123572-lin.cambridge.arm.com [10.1.194.54]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 5FD433F63F; Mon, 17 Mar 2025 07:21:26 -0700 (PDT) From: Kevin Brodsky To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, Kevin Brodsky , Albert Ou , Andreas Larsson , Andrew Morton , Catalin Marinas , Dave Hansen , "David S. Miller" , Geert Uytterhoeven , Linus Walleij , Madhavan Srinivasan , Mark Rutland , Matthew Wilcox , Michael Ellerman , "Mike Rapoport (IBM)" , Palmer Dabbelt , Paul Walmsley , Peter Zijlstra , Qi Zheng , Ryan Roberts , Will Deacon , Yang Shi , linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-csky@vger.kernel.org, linux-m68k@lists.linux-m68k.org, linux-openrisc@vger.kernel.org, linux-riscv@lists.infradead.org, linux-s390@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, sparclinux@vger.kernel.org Subject: [PATCH 00/11] Always call constructor for kernel page tables Date: Mon, 17 Mar 2025 14:16:49 +0000 Message-ID: <20250317141700.3701581-1-kevin.brodsky@arm.com> X-Mailer: git-send-email 2.47.0 MIME-Version: 1.0 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20250317_072139_009759_89CD7135 X-CRM114-Status: GOOD ( 24.52 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org There has been much confusion around exactly when page table constructors/destructors (pagetable_*_[cd]tor) are supposed to be called. They were initially introduced for user PTEs only (to support split page table locks), then at the PMD level for the same purpose. Accounting was added later on, starting at the PTE level and then moving to higher levels (PMD, PUD). Finally, with my earlier series "Account page tables at all levels" [1], the ctor/dtor is run for all levels, all the way to PGD. I thought this was the end of the story, and it hopefully is for user pgtables, but I was wrong for what concerns kernel pgtables. The current situation there makes very little sense: * At the PTE level, the ctor/dtor is not called (at least in the generic implementation). Specific helpers are used for kernel pgtables at this level (pte_{alloc,free}_kernel()) and those have never called the ctor/dtor, most likely because they were initially irrelevant in the kernel case. * At all other levels, the ctor/dtor is normally called. This is potentially wasteful at the PMD level (more on that later). This series aims to ensure that the ctor/dtor is always called for kernel pgtables, as it already is for user pgtables. Besides consistency, the main motivation is to guarantee that ctor/dtor hooks are systematically called; this makes it possible to insert hooks to protect page tables [2], for instance. There is however an extra challenge: split locks are not used for kernel pgtables, and it would therefore be wasteful to initialise them (ptlock_init()). It is worth clarifying exactly when split locks are used. They clearly are for user pgtables, but as illustrated in commit 61444cde9170 ("ARM: 8591/1: mm: use fully constructed struct pages for EFI pgd allocations"), they also are for special page tables like efi_mm. The one case where split locks are definitely unused is pgtables owned by init_mm; this is consistent with the behaviour of apply_to_pte_range(). The approach chosen in this series is therefore to pass the mm associated to the pgtables being constructed to pagetable_{pte,pmd}_ctor() (patch 1), and skip ptlock_init() if mm == &init_mm (patch 2 and 6). This makes it possible to call the PTE ctor/dtor from pte_{alloc,free}_kernel() without unintended consequences (patch 2). As a result the accounting functions are now called at all levels for kernel pgtables, and split locks are never initialised. In configurations where ptlocks are dynamically allocated (32-bit, PREEMPT_RT, etc.) and ARCH_ENABLE_SPLIT_PMD_PTLOCK is selected, this series results in the removal of a kmem_cache allocation for every kernel PMD. Additionally, for certain architectures that do not use such as s390, the same optimisation occurs at the PTE level. --- Things get more complicated when it comes to special pgtable allocators (patch 7-11). All architectures need such allocators to create initial kernel pgtables; we are not concerned with those as the ctor cannot be called so early in the boot sequence. However, those allocators may also be used later in the boot sequence or during normal operations. There are two main use-cases: 1. Mapping EFI memory: efi_mm (arm, arm64, riscv) 2. arch_add_memory(): init_mm The ctor is already explicitly run (at the PTE/PMD level) in the first case, as required for pgtables that are not associated with init_mm. However the same allocators may also be used for the second use-case (or others), and this is where it gets messy. Patch 1 calls the ctor with NULL as mm in those situations, as the actual mm isn't available. Practically this means that ptlocks will be unconditionally initialised. This is fine on arm - create_mapping_late() is only used for the EFI mapping. On arm64, __create_pgd_mapping() is also used by arch_add_memory(); patch 7/8/10 ensure that ctors are called at all levels with the appropriate mm. The situation is similar on riscv, but propagating the mm down to the ctor would require significant refactoring. Since they are already called unconditionally, this series leaves riscv no worse off - patch 9 adds comments to clarify the situation. From a cursory look at other architectures implementing arch_add_memory(), s390 and x86 may also need a similar treatment to add constructor calls. This is to be taken care of in a future version or as a follow-up. --- The complications in those special pgtable allocators beg the question: does it really make sense to treat efi_mm and init_mm differently in e.g. apply_to_pte_range()? Maybe what we really need is a way to tell if an mm corresponds to user memory or not, and never use split locks for non-user mm's. Feedback and suggestions welcome! - Kevin [1] https://lore.kernel.org/linux-mm/20250103184415.2744423-1-kevin.brodsky@arm.com/ [2] https://lore.kernel.org/linux-hardening/20250203101839.1223008-1-kevin.brodsky@arm.com/ --- Cc: Albert Ou Cc: Andreas Larsson Cc: Andrew Morton Cc: Catalin Marinas Cc: Dave Hansen Cc: "David S. Miller" Cc: Geert Uytterhoeven Cc: Linus Walleij Cc: Madhavan Srinivasan Cc: Mark Rutland Cc: Matthew Wilcox Cc: Michael Ellerman Cc: "Mike Rapoport (IBM)" Cc: Palmer Dabbelt Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Qi Zheng Cc: Ryan Roberts Cc: Will Deacon Cc: Yang Shi Cc: linux-arch@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-csky@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-m68k@lists.linux-m68k.org Cc: linux-openrisc@vger.kernel.org Cc: linux-riscv@lists.infradead.org Cc: linux-s390@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: sparclinux@vger.kernel.org --- Kevin Brodsky (11): mm: Pass mm down to pagetable_{pte,pmd}_ctor mm: Call ctor/dtor for kernel PTEs m68k: mm: Call ctor/dtor for kernel PTEs powerpc: mm: Call ctor/dtor for kernel PTEs sparc64: mm: Call ctor/dtor for kernel PTEs mm: Skip ptlock_init() for kernel PMDs arm64: mm: Use enum to identify pgtable level instead of *_SHIFT arm64: mm: Always call PTE/PMD ctor in __create_pgd_mapping() riscv: mm: Clarify ctor mm argument in alloc_{pte,pmd}_late arm64: mm: Call PUD/P4D ctor in __create_pgd_mapping() riscv: mm: Call PUD/P4D ctor in special kernel pgtable alloc arch/arm/mm/mmu.c | 2 +- arch/arm64/mm/mmu.c | 91 ++++++++++++++---------- arch/csky/include/asm/pgalloc.h | 2 +- arch/loongarch/include/asm/pgalloc.h | 2 +- arch/m68k/include/asm/mcf_pgalloc.h | 8 ++- arch/m68k/include/asm/motorola_pgalloc.h | 10 +-- arch/m68k/mm/motorola.c | 6 +- arch/microblaze/mm/pgtable.c | 2 +- arch/mips/include/asm/pgalloc.h | 2 +- arch/openrisc/mm/ioremap.c | 2 +- arch/parisc/include/asm/pgalloc.h | 2 +- arch/powerpc/mm/book3s64/pgtable.c | 2 +- arch/powerpc/mm/pgtable-frag.c | 30 ++++---- arch/riscv/mm/init.c | 26 ++++--- arch/s390/include/asm/pgalloc.h | 2 +- arch/s390/mm/pgalloc.c | 2 +- arch/sparc/mm/init_64.c | 29 ++++---- arch/sparc/mm/srmmu.c | 2 +- arch/x86/mm/pgtable.c | 2 +- include/asm-generic/pgalloc.h | 11 ++- include/linux/mm.h | 10 +-- 21 files changed, 137 insertions(+), 108 deletions(-) base-commit: 4701f33a10702d5fc577c32434eb62adde0a1ae1