From patchwork Mon Dec 18 10:50:59 2023
X-Patchwork-Submitter: Ryan Roberts
X-Patchwork-Id: 13496579
From: Ryan Roberts <ryan.roberts@arm.com>
To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier, Oliver Upton,
 James Morse, Suzuki K Poulose, Zenghui Yu, Andrey Ryabinin,
 Alexander Potapenko, Andrey Konovalov, Dmitry Vyukov, Vincenzo Frascino,
 Andrew Morton, Anshuman Khandual, Matthew Wilcox, Yu Zhao, Mark Rutland,
 David Hildenbrand, Kefeng Wang, John Hubbard, Zi Yan, Barry Song
 <21cnbao@gmail.com>, Alistair Popple, Yang Shi
Cc: Ryan Roberts, linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
Subject: [PATCH v4 15/16] arm64/mm: Implement new helpers to optimize fork()
Date: Mon, 18 Dec 2023 10:50:59 +0000
Message-Id: <20231218105100.172635-16-ryan.roberts@arm.com>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20231218105100.172635-1-ryan.roberts@arm.com>
References: <20231218105100.172635-1-ryan.roberts@arm.com>

With the core-mm changes in place to batch-copy ptes during fork, we can
take advantage of this in arm64 to greatly reduce the number of TLBIs we
have to issue, and recover the fork performance lost when adding support
for transparent contiguous ptes.

This optimization covers 2 cases:

1) The memory being CoWed is contpte-sized (or bigger) folios. We set
   wrprotect in the parent and set the ptes in the child for a whole
   contpte block in one hit. This means we can operate on the whole block
   and don't need to unfold/fold.

2) The memory being CoWed is all order-0 folios. No folding or unfolding
   occurs here, but the added cost of checking whether we need to fold on
   every pte adds up. Given we are forking, we are just copying the ptes
   already in the parent, so we should be maintaining the single/contpte
   state into the child anyway, and any check for folding will always be
   false. Therefore, we can elide the fold check in set_ptes_full() and
   ptep_set_wrprotects() when full=1.

The optimization to wrprotect a whole contpte block without unfolding is
possible thanks to the tightening of the Arm ARM with respect to the
definition and behaviour when 'Misprogramming the Contiguous bit'. See
section D21194 at https://developer.arm.com/documentation/102105/latest/
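For illustration, the sketch below shows roughly how a core-mm caller might
drive these helpers when batch-copying ptes at fork time. This is not the
actual core-mm change from earlier in the series: the function name, its
parameters and the loop structure are hypothetical, and details such as
access/dirty and soft-dirty handling are omitted. Only pte_batch_remaining(),
ptep_set_wrprotects() and set_ptes_full() are taken from this patch, with the
signatures shown in the diff below; the sketch also assumes the fork path
passes full=1, which is the case the commit message describes.

/*
 * Hypothetical sketch only (not the real core-mm code): write-protect the
 * parent's ptes and install read-only copies in the child, one
 * architecture-sized batch at a time.
 */
static void sketch_copy_cow_ptes(struct mm_struct *src_mm,
                                 struct mm_struct *dst_mm,
                                 pte_t *src_ptep, pte_t *dst_ptep,
                                 unsigned long addr, unsigned long end)
{
        unsigned int nr;
        pte_t pte;

        while (addr < end) {
                pte = ptep_get(src_ptep);

                /*
                 * Ask the arch how many ptes starting at addr can be treated
                 * as one batch. On arm64 this is the remainder of the current
                 * contpte block (clamped to end), or 1 for a non-contiguous
                 * pte.
                 */
                nr = pte_batch_remaining(pte, addr, end);

                /* Write-protect the whole batch in the parent... */
                ptep_set_wrprotects(src_mm, addr, src_ptep, nr, 1);

                /* ...and install read-only ptes for the same pages in the child. */
                set_ptes_full(dst_mm, addr, dst_ptep, pte_wrprotect(pte), nr, 1);

                addr += (unsigned long)nr * PAGE_SIZE;
                src_ptep += nr;
                dst_ptep += nr;
        }
}

With full=1, the nr == 1 paths in ptep_set_wrprotects() and set_ptes_full()
skip the contpte_try_fold() check, which is where the order-0 win described
above comes from; whole-block batches are handled by contpte_set_wrprotects()
and contpte_set_ptes() without unfolding.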
The following microbenchmark results demonstrate the recovered (and overall
improved) fork performance for large pte-mapped folios once this patch is
applied. Fork is called in a tight loop in a process with 1G of populated
memory, and the time taken for the function to execute is measured; 100
iterations per run, 8 runs performed on both Apple M2 (VM) and Ampere Altra
(bare metal). Tests were performed for the case where the 1G of memory is
composed of pte-mapped order-9 folios. Negative is faster, positive is
slower, relative to the baseline upon which the series is based:

| fork          | Apple M2 VM       | Ampere Altra      |
| order-9       |-------------------|-------------------|
| (pte-map)     |  mean   |  stdev  |  mean   |  stdev  |
|---------------|---------|---------|---------|---------|
| baseline      |    0.0% |    1.2% |    0.0% |    0.1% |
| before-change |  541.5% |    2.8% | 3654.4% |    0.0% |
| after-change  |  -25.4% |    1.9% |   -6.7% |    0.1% |

Tested-by: John Hubbard
Signed-off-by: Ryan Roberts
---
 arch/arm64/include/asm/pgtable.h | 97 ++++++++++++++++++++++++++------
 arch/arm64/mm/contpte.c          | 47 ++++++++++++++++
 2 files changed, 128 insertions(+), 16 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index e64120452301..d4805f73b9db 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -966,16 +966,12 @@ static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-/*
- * __ptep_set_wrprotect - mark read-only while trasferring potential hardware
- * dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
- */
-static inline void __ptep_set_wrprotect(struct mm_struct *mm,
-                                        unsigned long address, pte_t *ptep)
+static inline void ___ptep_set_wrprotect(struct mm_struct *mm,
+                                        unsigned long address, pte_t *ptep,
+                                        pte_t pte)
 {
-        pte_t old_pte, pte;
+        pte_t old_pte;
 
-        pte = __ptep_get(ptep);
         do {
                 old_pte = pte;
                 pte = pte_wrprotect(pte);
@@ -984,6 +980,26 @@ static inline void __ptep_set_wrprotect(struct mm_struct *mm,
         } while (pte_val(pte) != pte_val(old_pte));
 }
 
+/*
+ * __ptep_set_wrprotect - mark read-only while trasferring potential hardware
+ * dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
+ */
+static inline void __ptep_set_wrprotect(struct mm_struct *mm,
+                                        unsigned long address, pte_t *ptep)
+{
+        ___ptep_set_wrprotect(mm, address, ptep, __ptep_get(ptep));
+}
+
+static inline void __ptep_set_wrprotects(struct mm_struct *mm,
+                                        unsigned long address, pte_t *ptep,
+                                        unsigned int nr, int full)
+{
+        unsigned int i;
+
+        for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
+                __ptep_set_wrprotect(mm, address, ptep);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define __HAVE_ARCH_PMDP_SET_WRPROTECT
 static inline void pmdp_set_wrprotect(struct mm_struct *mm,
@@ -1139,6 +1155,8 @@ extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
                                 unsigned long addr, pte_t *ptep);
 extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
                                 unsigned long addr, pte_t *ptep);
+extern void contpte_set_wrprotects(struct mm_struct *mm, unsigned long addr,
+                                pte_t *ptep, unsigned int nr, int full);
 extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
                                 unsigned long addr, pte_t *ptep,
                                 pte_t entry, int dirty);
@@ -1170,6 +1188,17 @@ static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
                 __contpte_try_unfold(mm, addr, ptep, pte);
 }
 
+#define pte_batch_remaining pte_batch_remaining
+static inline unsigned int pte_batch_remaining(pte_t pte, unsigned long addr,
+                                        unsigned long end)
+{
+        if (!pte_valid_cont(pte))
+                return 1;
+
+        return min(CONT_PTES - ((addr >> PAGE_SHIFT) & (CONT_PTES - 1)),
+                   (end - addr) >> PAGE_SHIFT);
+}
+
 /*
  * The below functions constitute the public API that arm64 presents to the
  * core-mm to manipulate PTE entries within their page tables (or at least this
@@ -1219,20 +1248,30 @@ static inline void set_pte(pte_t *ptep, pte_t pte)
         __set_pte(ptep, pte_mknoncont(pte));
 }
 
-#define set_ptes set_ptes
-static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
-                                pte_t *ptep, pte_t pte, unsigned int nr)
+#define set_ptes_full set_ptes_full
+static inline void set_ptes_full(struct mm_struct *mm, unsigned long addr,
+                                pte_t *ptep, pte_t pte, unsigned int nr,
+                                int full)
 {
         pte = pte_mknoncont(pte);
 
         if (nr == 1) {
-                contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+                if (!full)
+                        contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
                 __set_ptes(mm, addr, ptep, pte, 1);
-                contpte_try_fold(mm, addr, ptep, pte);
+                if (!full)
+                        contpte_try_fold(mm, addr, ptep, pte);
         } else
                 contpte_set_ptes(mm, addr, ptep, pte, nr);
 }
 
+#define set_ptes set_ptes
+static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
+                                pte_t *ptep, pte_t pte, unsigned int nr)
+{
+        set_ptes_full(mm, addr, ptep, pte, nr, false);
+}
+
 static inline void pte_clear(struct mm_struct *mm,
                                 unsigned long addr, pte_t *ptep)
 {
@@ -1272,13 +1311,38 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
         return contpte_ptep_clear_flush_young(vma, addr, ptep);
 }
 
+#define ptep_set_wrprotects ptep_set_wrprotects
+static inline void ptep_set_wrprotects(struct mm_struct *mm,
+                                unsigned long addr, pte_t *ptep,
+                                unsigned int nr, int full)
+{
+        if (nr == 1) {
+                /*
+                 * Optimization: ptep_set_wrprotects() can only be called for
+                 * present ptes so we only need to check contig bit as condition
+                 * for unfold, and we can remove the contig bit from the pte we
+                 * read to avoid re-reading. This speeds up fork() which is very
+                 * sensitive for order-0 folios. Should be equivalent to
+                 * contpte_try_unfold() for this case.
+                 */
+                pte_t orig_pte = __ptep_get(ptep);
+
+                if (unlikely(pte_cont(orig_pte))) {
+                        __contpte_try_unfold(mm, addr, ptep, orig_pte);
+                        orig_pte = pte_mknoncont(orig_pte);
+                }
+                ___ptep_set_wrprotect(mm, addr, ptep, orig_pte);
+                if (!full)
+                        contpte_try_fold(mm, addr, ptep, __ptep_get(ptep));
+        } else
+                contpte_set_wrprotects(mm, addr, ptep, nr, full);
+}
+
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 static inline void ptep_set_wrprotect(struct mm_struct *mm,
                                 unsigned long addr, pte_t *ptep)
 {
-        contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
-        __ptep_set_wrprotect(mm, addr, ptep);
-        contpte_try_fold(mm, addr, ptep, __ptep_get(ptep));
+        ptep_set_wrprotects(mm, addr, ptep, 1, false);
 }
 
 #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
@@ -1310,6 +1374,7 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
 #define ptep_clear_flush_young                  __ptep_clear_flush_young
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 #define ptep_set_wrprotect                      __ptep_set_wrprotect
+#define ptep_set_wrprotects                     __ptep_set_wrprotects
 #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
 #define ptep_set_access_flags                   __ptep_set_access_flags
 
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index 69c36749dd98..72e672024785 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -339,6 +339,53 @@ int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
 }
 EXPORT_SYMBOL(contpte_ptep_clear_flush_young);
 
+void contpte_set_wrprotects(struct mm_struct *mm, unsigned long addr,
+                        pte_t *ptep, unsigned int nr, int full)
+{
+        unsigned long next;
+        unsigned long end;
+
+        if (!mm_is_user(mm))
+                return __ptep_set_wrprotects(mm, addr, ptep, nr, full);
+
+        end = addr + (nr << PAGE_SHIFT);
+
+        do {
+                next = pte_cont_addr_end(addr, end);
+                nr = (next - addr) >> PAGE_SHIFT;
+
+                /*
+                 * If wrprotecting an entire contig range, we can avoid
+                 * unfolding. Just set wrprotect and wait for the later
+                 * mmu_gather flush to invalidate the tlb. Until the flush, the
+                 * page may or may not be wrprotected. After the flush, it is
+                 * guaranteed wrprotected. If it's a partial range though, we
+                 * must unfold, because we can't have a case where CONT_PTE is
+                 * set but wrprotect applies to a subset of the PTEs; this would
+                 * cause it to continue to be unpredictable after the flush.
+                 */
+                if (nr != CONT_PTES)
+                        contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+
+                __ptep_set_wrprotects(mm, addr, ptep, nr, full);
+
+                addr = next;
+                ptep += nr;
+
+                /*
+                 * If applying to a partial contig range, the change could have
+                 * made the range foldable. Use the last pte in the range we
+                 * just set for comparison, since contpte_try_fold() only
+                 * triggers when acting on the last pte in the contig range.
+                 */
+                if (nr != CONT_PTES)
+                        contpte_try_fold(mm, addr - PAGE_SIZE, ptep - 1,
+                                         __ptep_get(ptep - 1));
+
+        } while (addr != end);
+}
+EXPORT_SYMBOL(contpte_set_wrprotects);
+
 int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
                                 unsigned long addr, pte_t *ptep,
                                 pte_t entry, int dirty)
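For illustration only (this is not part of the patch): the batching
arithmetic in pte_batch_remaining() yields a short first batch up to the next
contpte boundary when a range starts partway through a block, whole-block
batches in the middle, and a short tail. The small standalone program below
mirrors the min() expression; it assumes a 4K granule (CONT_PTES == 16), and
the example address and constants are made up for this illustration.

#include <stdio.h>

#define PAGE_SHIFT      12
#define PAGE_SIZE       (1UL << PAGE_SHIFT)
#define CONT_PTES       16      /* 4K granule: 16 contiguous ptes per block */

/* Mirror of the arm64 helper's arithmetic for a contiguous-mapped pte. */
static unsigned int batch_remaining(unsigned long addr, unsigned long end)
{
        unsigned long to_boundary = CONT_PTES - ((addr >> PAGE_SHIFT) & (CONT_PTES - 1));
        unsigned long to_end = (end - addr) >> PAGE_SHIFT;

        return to_boundary < to_end ? to_boundary : to_end;
}

int main(void)
{
        /* 100 pages starting 3 pages into a 16-page contpte block. */
        unsigned long addr = 0x40003000UL;
        unsigned long end = addr + 100 * PAGE_SIZE;

        while (addr < end) {
                unsigned int nr = batch_remaining(addr, end);

                printf("addr 0x%lx: batch of %u ptes\n", addr, nr);
                addr += (unsigned long)nr * PAGE_SIZE;
        }
        /* Prints batches of 13, 16, 16, 16, 16, 16, 7. */
        return 0;
}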