From patchwork Thu Mar 27 16:06:58 2025
X-Patchwork-Submitter: Ryan Roberts
X-Patchwork-Id: 14031267
From: Ryan Roberts
To: Andrew Morton, "Matthew Wilcox (Oracle)", David Hildenbrand,
    Dave Chinner, Catalin Marinas, Will Deacon
Cc: Ryan Roberts, linux-arm-kernel@lists.infradead.org,
    linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    linux-mm@kvack.org
Subject: [PATCH v3] mm/filemap: Allow arch to request folio size for exec memory
Date: Thu, 27 Mar 2025 16:06:58 +0000
Message-ID: <20250327160700.1147155-1-ryan.roberts@arm.com>

Change the readahead config so that, if readahead is requested for an
executable mapping, a synchronous read of an arch-specified size is done
in a naturally aligned manner into a folio of the same size (assuming an
fs with large folio support).

On arm64, if memory is physically contiguous and naturally aligned to the
"contpte" size, we can use contpte mappings, which improves utilization
of the TLB. When paired with the "multi-size THP" feature, this works
well to reduce dTLB pressure. However, iTLB pressure is still high due to
executable mappings having a low likelihood of being in the required
folio size and mapping alignment, even when the filesystem supports
readahead into large folios (e.g. XFS).

The reason for the low likelihood is that the current readahead algorithm
starts with an order-2 folio and increases the folio order by 2 every
time the readahead mark is hit. But most executable memory tends to be
accessed randomly, so the readahead mark is rarely hit and most
executable folios remain order-2. To make things worse, readahead reduces
the folio order to 0 at the readahead window boundaries if required for
alignment to those boundaries.

So let's special-case the read(ahead) logic for executable mappings. The
trade-off is performance improvement (due to more efficient storage of
the translations in the iTLB) vs potential read amplification (due to
reading too much data around the fault which won't be used), and the
latter is independent of base page size. I've chosen a 64K folio size for
arm64, which benefits both the 4K and 16K base page size configs and
shouldn't lead to any read amplification in practice, since the old
read-around path was (usually) reading blocks of 128K. I don't anticipate
any write amplification because text is always RO.
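
To make the "naturally aligned" part concrete, here is a minimal
userspace sketch of the index rounding done by the new VM_EXEC path (the
fault offset 0x13000 is just an example value; this assumes 4K base pages
and the arm64 preference of order 4, i.e. 64K, matching the mm/filemap.c
hunk below):

#include <stdio.h>

int main(void)
{
	/* arch_exec_folio_order() == ilog2(SZ_64K >> PAGE_SHIFT) == 4 on arm64/4K */
	unsigned long order = 4;
	unsigned long page_shift = 12;

	/* page index of a hypothetical fault at file offset 0x13000 */
	unsigned long index = 0x13000UL >> page_shift;

	/* same rounding as: ractl._index &= ~((1UL << order) - 1) */
	index &= ~((1UL << order) - 1);

	/* prints "index 0x10, bytes 0x10000": one aligned 64K folio is read */
	printf("index 0x%lx, bytes 0x%lx\n", index, (1UL << order) << page_shift);
	return 0;
}
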
Note that the text region of an ELF file could be populated into the page
cache for reasons other than taking a fault in a mmapped area. The most
common case is the loader read()ing the header, which can be shared with
the beginning of text. So some text will still remain in small folios,
but this simple, best-effort change provides good performance
improvements as is.

Benchmarking
============

The below shows nginx and redis benchmarks on an Ampere Altra arm64
system.

First, confirmation that this patch causes more text to be contained in
64K folios:

| File-backed folios     |   system boot   |      nginx      |      redis      |
| by size as percentage  |-----------------|-----------------|-----------------|
| of all mapped text mem | before | after  | before | after  | before | after  |
|========================|========|========|========|========|========|========|
| base-page-4kB          |    26% |     9% |    27% |     6% |    21% |     5% |
| thp-aligned-8kB        |     4% |     2% |     3% |     0% |     4% |     1% |
| thp-aligned-16kB       |    57% |    21% |    57% |     6% |    54% |    10% |
| thp-aligned-32kB       |     4% |     1% |     4% |     1% |     3% |     1% |
| thp-aligned-64kB       |     7% |    65% |     8% |    85% |     9% |    72% |
| thp-aligned-2048kB     |     0% |     0% |     0% |     0% |     7% |     8% |
| thp-unaligned-16kB     |     1% |     1% |     1% |     1% |     1% |     1% |
| thp-unaligned-32kB     |     0% |     0% |     0% |     0% |     0% |     0% |
| thp-unaligned-64kB     |     0% |     0% |     0% |     1% |     0% |     1% |
| thp-partial            |     1% |     1% |     0% |     0% |     1% |     1% |
|------------------------|--------|--------|--------|--------|--------|--------|
| cont-aligned-64kB      |     7% |    65% |     8% |    85% |    16% |    80% |

The above shows that for both workloads (each isolated with cgroups), as
well as for the general system state after boot, the amount of text
backed by 4K and 16K folios reduces and the amount backed by 64K folios
increases significantly. And the amount of text that is contpte-mapped
significantly increases (see last row).

And this is reflected in performance improvement:

| Benchmark                                     | Improvement          |
+===============================================+======================+
| pts/nginx (200 connections)                   | 8.96%                |
| pts/nginx (1000 connections)                  | 6.80%                |
+-----------------------------------------------+----------------------+
| pts/redis (LPOP, 50 connections)              | 5.07%                |
| pts/redis (LPUSH, 50 connections)             | 3.68%                |
+-----------------------------------------------+----------------------+

Signed-off-by: Ryan Roberts
---

Hi All,

This is a follow-up from LSF/MM, where we discussed this and there was
consensus to take this most simple approach. I know Dave Chinner had
reservations when I originally posted it last year, but I think he was
coming around in the discussion at [3].

This applies on top of yesterday's mm-unstable (87f556baedc9).

Changes since v2 [2]
====================
 - Rename arch_wants_exec_folio_order() to arch_exec_folio_order() (per Andrew)
 - Fixed some typos (per Andrew)

Changes since v1 [1]
====================
 - Remove "void" from arch_wants_exec_folio_order() macro args list

[1] https://lore.kernel.org/linux-mm/20240111154106.3692206-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/all/20240215154059.2863126-1-ryan.roberts@arm.com/
[3] https://lore.kernel.org/linux-mm/ce3b5402-79b8-415b-9c51-f712bb2b953b@arm.com/

Thanks,
Ryan

 arch/arm64/include/asm/pgtable.h | 14 ++++++++++++++
 include/linux/pgtable.h          | 12 ++++++++++++
 mm/filemap.c                     | 19 +++++++++++++++++++
 3 files changed, 45 insertions(+)

--
2.43.0

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 15211f74b035..5f75e2ddef02 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1514,6 +1514,20 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
  */
 #define arch_wants_old_prefaulted_pte	cpu_has_hw_af
 
+/*
+ * Request exec memory is read into pagecache in at least 64K folios. The
+ * trade-off here is performance improvement due to storing translations more
+ * efficiently in the iTLB vs the potential for read amplification due to
+ * reading data from disk that won't be used (although this is not a real
+ * concern as readahead is almost always 128K by default so we are actually
+ * potentially reducing the read bandwidth). The latter is independent of base
+ * page size, so we set a page-size independent block size of 64K. This size
+ * can be contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB
+ * entry), and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base
+ * pages are in use.
+ */
+#define arch_exec_folio_order()	ilog2(SZ_64K >> PAGE_SHIFT)
+
 static inline bool pud_sect_supported(void)
 {
 	return PAGE_SIZE == SZ_4K;
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 787c632ee2c9..944ff80e8f4f 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -456,6 +456,18 @@ static inline bool arch_has_hw_pte_young(void)
 }
 #endif
 
+#ifndef arch_exec_folio_order
+/*
+ * Returns preferred minimum folio order for executable file-backed memory.
+ * Must be in range [0, PMD_ORDER]. Negative value implies that the HW has no
+ * preference and mm will not special-case executable memory in the pagecache.
+ */
+static inline int arch_exec_folio_order(void)
+{
+	return -1;
+}
+#endif
+
 #ifndef arch_check_zapped_pte
 static inline void arch_check_zapped_pte(struct vm_area_struct *vma,
 					  pte_t pte)
diff --git a/mm/filemap.c b/mm/filemap.c
index cc69f174f76b..22ff25a60598 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3223,6 +3223,25 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	}
 #endif
 
+	/*
+	 * Allow arch to request a preferred minimum folio order for executable
+	 * memory. This can often be beneficial to performance if (e.g.) arm64
+	 * can contpte-map the folio. Executable memory rarely benefits from
+	 * readahead anyway, due to its random access nature.
+	 */
+	if (vm_flags & VM_EXEC) {
+		int order = arch_exec_folio_order();
+
+		if (order >= 0) {
+			fpin = maybe_unlock_mmap_for_io(vmf, fpin);
+			ra->size = 1UL << order;
+			ra->async_size = 0;
+			ractl._index &= ~((1UL << order) - 1);
+			page_cache_ra_order(&ractl, ra, order);
+			return fpin;
+		}
+	}
+
 	/* If we don't want any read-ahead, don't bother */
 	if (vm_flags & VM_RAND_READ)
 		return fpin;
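
For illustration only (not part of this patch): another architecture
could opt in simply by providing the hook in its asm/pgtable.h. A
hypothetical arch whose TLB coalesces naturally aligned 32K blocks might
use something like:

/*
 * Hypothetical example for an arch preferring 32K exec folios. The order
 * must fall in [0, PMD_ORDER]; leaving the generic default (negative)
 * keeps exec readahead behaviour unchanged.
 */
#define arch_exec_folio_order()	ilog2(SZ_32K >> PAGE_SHIFT)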