From: Ryan Roberts
To: Andrew Morton, "Matthew Wilcox (Oracle)", Yu Zhao, "Yin, Fengwei"
Cc: Ryan Roberts, linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org
Subject: [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory
Date: Fri, 14 Apr 2023 14:02:46 +0100
Message-Id: <20230414130303.2345383-1-ryan.roberts@arm.com>

Hi All,

This is a second RFC and my first proper attempt at implementing variable order,
large folios for anonymous memory. The first RFC [1] was a partial
implementation and a plea for help in debugging an issue I was hitting; thanks
to Yin Fengwei and Matthew Wilcox for their advice in solving that!

The objective of variable order anonymous folios is to improve performance by
allocating larger chunks of memory during anonymous page faults:

 - Since SW (the kernel) is dealing with larger chunks of memory than base
   pages, there are efficiency savings to be had; fewer page faults, batched PTE
   and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
   overhead. This should benefit all architectures.
 - Since we are now mapping physically contiguous chunks of memory, we can take
   advantage of HW TLB compression techniques. A reduction in TLB pressure
   speeds up kernel and user space.
   arm64 systems have 2 mechanisms to coalesce TLB entries; "the contiguous
   bit" (architectural) and HPA (uarch) - see [2].

This patch set deals with the SW side of things only but sets us up nicely for
taking advantage of the HW improvements in the near future.

I'm not yet benchmarking a wide variety of use cases, but those that I have
looked at are positive; I see kernel compilation time improved by up to 10%,
which I expect to improve further once I add in the arm64 "contiguous bit".
Memory consumption is somewhere between 1% less and 2% more, depending on how
it's measured. More on perf and memory below.

The patches are based on v6.3-rc6 + patches 1-31 of [3] (which needed one minor
conflict resolution). I have a tree at [4].

[1] https://lore.kernel.org/linux-mm/20230317105802.2634004-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-mm/d347c5b0-0c0f-ae50-9613-2cf962d8676e@arm.com/
[3] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
[4] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anon_folio-lkml-rfc2


Approach
========

There are 4 fault paths that have been modified:
 - write fault on unallocated address: do_anonymous_page()
 - write fault on zero page: wp_page_copy()
 - write fault on non-exclusive CoW page: wp_page_copy()
 - write fault on exclusive CoW page: do_wp_page()/wp_page_reuse()

In the first 2 cases, we will determine the preferred order folio to allocate,
limited by a max order (currently order-4; see below), VMA and PMD bounds, and
state of neighboring PTEs. In the 3rd case, we aim to allocate the same order
folio as the source, subject to constraints that may arise if the source has
been mremapped or partially munmapped. And in the 4th case, we reuse as much of
the folio as we can, subject to the same mremap/munmap constraints.

If allocation of our preferred folio order fails, we gracefully fall back to
lower orders all the way to 0.
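
As a rough illustration of the order selection and fallback described above,
here is a minimal userspace model in Python. It is not code from the series.
The names preferred_order() and alloc_with_fallback() are hypothetical, 4KB
base pages are assumed, and the model assumes folios are naturally aligned and
only checks VMA bounds (a naturally aligned order-4 block cannot cross a PMD
boundary with 4KB pages). The state of neighboring PTEs is ignored entirely.

```python
PAGE_SHIFT = 12      # assumption: 4KB base pages
MAX_ANON_ORDER = 4   # the max order the text says is currently hardcoded


def preferred_order(fault_addr, vma_start, vma_end, max_order=MAX_ANON_ORDER):
    """Return the largest order whose naturally aligned block around
    fault_addr lies entirely inside [vma_start, vma_end).

    Hypothetical helper: natural alignment is an assumption of this model,
    and neighboring PTE state is not considered.
    """
    if not vma_start <= fault_addr < vma_end:
        raise ValueError("fault address outside VMA")
    for order in range(max_order, 0, -1):
        size = 1 << (PAGE_SHIFT + order)
        start = fault_addr & ~(size - 1)   # round down to block alignment
        if start >= vma_start and start + size <= vma_end:
            return order
    return 0  # a single base page always fits in a page-aligned VMA


def alloc_with_fallback(order, try_alloc):
    """Try the preferred order first, then every lower order down to 0.

    try_alloc(order) is a caller-supplied stand-in for the allocator; it
    returns a folio object or None on failure.
    """
    for candidate in range(order, -1, -1):
        folio = try_alloc(candidate)
        if folio is not None:
            return candidate, folio
    return None  # even order-0 failed
```

Under these assumptions, a fault at 0x12345 in a VMA spanning 0x11000-0x14000
gets order 1, because every larger aligned block would start below the VMA.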

Note that none of this affects the behavior of traditional PMD-sized THP. If we
take a fault in an MADV_HUGEPAGE region, we still get PMD-sized mappings.


Open Questions
==============

How to Move Forwards
--------------------

While the series is a small-ish code change, it represents a big shift in the
way things are done. So I'd appreciate any help in scaling up performance
testing, review and general advice on how best to guide a change like this into
the kernel.

Folio Allocation Order Policy
-----------------------------

The current code is hardcoded to use a maximum order of 4. This was chosen for a
couple of reasons:
 - From the SW performance perspective, I see a knee around here where
   increasing it doesn't lead to much more performance gain.
 - Intuitively I assume that higher orders become increasingly difficult to
   allocate.
 - From the HW performance perspective, arm64's HPA works on order-2 blocks and
   "the contiguous bit" works on order-4 for 4KB base pages (although it's
   order-7 for 16KB and order-5 for 64KB), so there is no HW benefit to going
   any higher.

I suggest that ultimately setting the max order should be left to the
architecture. arm64 would take advantage of this and set it to the order
required for the contiguous bit for the configured base page size.

However, I also have a (mild) concern about increased memory consumption. If an
app has a pathological fault pattern (e.g. sparsely touches memory every 64KB)
we would end up allocating 16x as much memory as we used to. One potential
approach I see here is to track fault addresses per-VMA, and increase a per-VMA
max allocation order for consecutive faults that extend a contiguous range, and
decrement when discontiguous. Alternatively/additionally, we could use the VMA
size as an indicator. I'd be interested in your thoughts/opinions.
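
The contiguous-bit sizes and the 16x figure above follow from simple
arithmetic, sketched below in Python. This is only a back-of-the-envelope
model: sparse_touch_amplification() is a hypothetical helper that assumes
every fault receives a full, naturally aligned folio of the given order (no
fallback, no adaptive policy) and that the access pattern touches one byte per
stride across a large region.

```python
KB = 1024
MB = 1024 * KB

# Base page size -> folio order covered by the arm64 contiguous bit,
# using the values quoted in the text above.
CONT_BIT_ORDER = {4 * KB: 4, 16 * KB: 7, 64 * KB: 5}


def folio_bytes(page_size, order):
    """Size in bytes of an order-N folio for the given base page size."""
    return page_size << order


def sparse_touch_amplification(page_size, order, stride):
    """Ratio of memory allocated with order-N folios versus base pages when
    one byte is touched every `stride` bytes.

    Per touch, a block of size B costs min(B, stride) bytes: once the stride
    is at least B every touch lands in its own block, otherwise every block
    in the region ends up populated.
    """
    folio = folio_bytes(page_size, order)
    return min(folio, stride) / min(page_size, stride)
```

With 4KB base pages and order-4 folios, a 64KB stride gives an amplification
of 16, a 16KB stride gives 4, and a dense 4KB stride gives 1 (no extra memory).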

Deferred Split Queue Lock Contention
------------------------------------

The results below show that we are spending a much greater proportion of time in
the kernel when doing a kernel compile using 160 CPUs vs 8 CPUs.

I think this is (at least partially) related to contention on the deferred
split queue lock. This is a per-memcg spinlock, which means a single spinlock
shared among all 160 CPUs. I've solved part of the problem with the last patch
in the series (which cuts down the need to take the lock), but at folio free
time (free_transhuge_page()), the lock is still taken and I think this could be
a problem. Now that most anonymous pages are large folios, this lock is taken a
lot more.

I think we could probably avoid taking the lock unless !list_empty(), but I
haven't convinced myself it's definitely safe, so haven't applied it yet.


Roadmap
=======

Beyond scaling up perf testing, I'm planning to enable use of the "contiguous
bit" on arm64 to validate predictions about HW speedups.

I also think there are some opportunities with madvise to split folios to non-0
orders, which might improve performance in some cases. madvise is also mistaking
exclusive large folios for non-exclusive ones at the moment (due to the "small
pages" mapcount scheme), so that needs to be fixed so that MADV_FREE correctly
frees the folio.


Results
=======

Performance
-----------

Test: Kernel Compilation, on Ampere Altra (160 CPU machine), with 8 jobs and
with 160 jobs. First run discarded, next 3 runs averaged. Git repo cleaned
before each run.

make defconfig && time make -jN Image

First with -j8:

|           | baseline time  | anonfolio time | percent change |
|           | to compile (s) | to compile (s) | SMALLER=better |
|-----------|---------------:|---------------:|---------------:|
| real-time |          373.0 |          342.8 |          -8.1% |
| user-time |         2333.9 |         2275.3 |          -2.5% |
| sys-time  |          510.7 |          340.9 |         -33.3% |

Above shows an 8.1% improvement in real time execution, and a 33.3% saving in
kernel execution.

The next 2 tables show a breakdown of the cycles spent in the kernel for the 8
job config:

|                      | baseline | anonfolio | percent change |
|                      | (cycles) | (cycles)  | SMALLER=better |
|----------------------|---------:|----------:|---------------:|
| data abort           |     683B |      316B |         -53.8% |
| instruction abort    |      93B |       76B |         -18.4% |
| syscall              |     887B |      767B |         -13.6% |

|                      | baseline | anonfolio | percent change |
|                      | (cycles) | (cycles)  | SMALLER=better |
|----------------------|---------:|----------:|---------------:|
| arm64_sys_openat     |     194B |      188B |          -3.3% |
| arm64_sys_exit_group |     192B |      124B |         -35.7% |
| arm64_sys_read       |     124B |      108B |         -12.7% |
| arm64_sys_execve     |      75B |       67B |         -11.0% |
| arm64_sys_mmap       |      51B |       50B |          -3.0% |
| arm64_sys_mprotect   |      15B |       13B |         -12.0% |
| arm64_sys_write      |      43B |       42B |          -2.9% |
| arm64_sys_munmap     |      15B |       12B |         -17.0% |
| arm64_sys_newfstatat |      46B |       41B |          -9.7% |
| arm64_sys_clone      |      26B |       24B |         -10.0% |

And now with -j160:

|           | baseline time  | anonfolio time | percent change |
|           | to compile (s) | to compile (s) | SMALLER=better |
|-----------|---------------:|---------------:|---------------:|
| real-time |           53.7 |           48.2 |         -10.2% |
| user-time |         2705.8 |         2842.1 |           5.0% |
| sys-time  |         1370.4 |         1064.3 |         -22.3% |

Above shows a 10.2% improvement in real time execution. But ~3x more time is
spent in the kernel than for the -j8 config. I think this is related to the lock
contention issue I highlighted above, but haven't bottomed it out yet. It's also
not yet clear to me why user-time increases by 5%.

I've also run all the will-it-scale microbenchmarks for a single task, using the
process mode. Results for multiple runs on the same kernel are noisy - I see ~5%
fluctuation. So I'm just calling out tests whose results improve by more than 5%
or regress by more than 5%. Results are average of 3 runs.

Only 2 tests regressed:

| benchmark            | baseline | anonfolio | percent change |
|                      | ops/s    | ops/s     | BIGGER=better  |
|----------------------|---------:|----------:|---------------:|
| context_switch1.csv  |   328744 |    351150 |           6.8% |
| malloc1.csv          |    96214 |     50890 |         -47.1% |
| mmap1.csv            |   410253 |    375746 |          -8.4% |
| page_fault1.csv      |   624061 |   3185678 |         410.5% |
| page_fault2.csv      |   416483 |    557448 |          33.8% |
| page_fault3.csv      |   724566 |   1152726 |          59.1% |
| read1.csv            |  1806908 |   1905752 |           5.5% |
| read2.csv            |   587722 |   1942062 |         230.4% |
| tlb_flush1.csv       |   143910 |    152097 |           5.7% |
| tlb_flush2.csv       |   266763 |    322320 |          20.8% |

I believe malloc1 is an unrealistic test, since it does malloc/free for a 128M
object in a loop and never touches the allocated memory. I think the malloc
implementation is maintaining a header just before the allocated object, which
causes a single page fault. Previously that page fault allocated 1 page. Now it
is allocating 16 pages. This cost would be repaid if the test code wrote to the
allocated object. Alternatively the folio allocation order policy described
above would also solve this.

It is not clear to me why mmap1 has slowed down. This remains a todo.

Memory
------

I measured memory consumption while doing a kernel compile with 8 jobs on a
system limited to 4GB RAM. I polled /proc/meminfo every 0.5 seconds during the
workload, then calculated "memory used" high and low watermarks using both
MemFree and MemAvailable. If there is a better way of measuring system memory
consumption, please let me know!
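
For readers who want to reproduce this kind of measurement, the polling
described above could look roughly like the Python sketch below. This is not
the script used for the results in this mail. The function names are
hypothetical, and treating "4GB" as 4096 MB is an assumption of the sketch.
/proc/meminfo reports most fields in kB, which is what the parser relies on.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: value} dict.
    Values are the integers as printed (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def mem_used_mb(meminfo, field, total_mb=4096):
    """mem-used = total - field, in MB. total_mb=4096 is an assumption
    standing in for the "4GB" system limit."""
    return total_mb - meminfo[field] / 1024


def watermarks(samples):
    """Return (low, high) watermarks over a list of mem-used samples."""
    return min(samples), max(samples)


def poll_watermarks(duration_s, field, interval_s=0.5, path="/proc/meminfo"):
    """Sample `field` every interval_s seconds for about duration_s seconds
    and return the (low, high) mem-used watermarks in MB."""
    samples = []
    deadline = time.monotonic() + duration_s
    while True:
        with open(path) as f:
            samples.append(mem_used_mb(parse_meminfo(f.read()), field))
        if time.monotonic() >= deadline:
            break
        time.sleep(interval_s)
    return watermarks(samples)
```

On a Linux system, poll_watermarks(60, "MemFree") and
poll_watermarks(60, "MemAvailable") would give the two variants of the
measurement while a workload runs alongside.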

mem-used = 4GB - /proc/meminfo:MemFree

|                      | baseline | anonfolio | percent change |
|                      | (MB)     | (MB)      | SMALLER=better |
|----------------------|---------:|----------:|---------------:|
| mem-used-low         |      825 |       842 |           2.1% |
| mem-used-high        |     2697 |      2672 |          -0.9% |

mem-used = 4GB - /proc/meminfo:MemAvailable

|                      | baseline | anonfolio | percent change |
|                      | (MB)     | (MB)      | SMALLER=better |
|----------------------|---------:|----------:|---------------:|
| mem-used-low         |      518 |       530 |           2.3% |
| mem-used-high        |     1522 |      1537 |           1.0% |

For the high watermark, the methods disagree; we are either saving 1% or using
1% more. For the low watermark, both methods agree that we are using about 2%
more. I plan to investigate whether the proposed folio allocation order policy
can reduce this to zero.

Thanks for making it this far!
Ryan


Ryan Roberts (17):
  mm: Expose clear_huge_page() unconditionally
  mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio()
  mm: Introduce try_vma_alloc_movable_folio()
  mm: Implement folio_add_new_anon_rmap_range()
  mm: Routines to determine max anon folio allocation order
  mm: Allocate large folios for anonymous memory
  mm: Allow deferred splitting of arbitrary large anon folios
  mm: Implement folio_move_anon_rmap_range()
  mm: Update wp_page_reuse() to operate on range of pages
  mm: Reuse large folios for anonymous memory
  mm: Split __wp_page_copy_user() into 2 variants
  mm: ptep_clear_flush_range_notify() macro for batch operation
  mm: Implement folio_remove_rmap_range()
  mm: Copy large folios for anonymous memory
  mm: Convert zero page to large folios on write
  mm: mmap: Align unhinted maps to highest anon folio order
  mm: Batch-zap large anonymous folio PTE mappings

 arch/alpha/include/asm/page.h   |   5 +-
 arch/arm64/include/asm/page.h   |   3 +-
 arch/arm64/mm/fault.c           |   7 +-
 arch/ia64/include/asm/page.h    |   5 +-
 arch/m68k/include/asm/page_no.h |   7 +-
 arch/s390/include/asm/page.h    |   5 +-
 arch/x86/include/asm/page.h     |   5 +-
 include/linux/highmem.h         |  23 +-
 include/linux/mm.h              |   8 +-
 include/linux/mmu_notifier.h    |  31 ++
 include/linux/rmap.h            |   6 +
 mm/memory.c                     | 877 ++++++++++++++++++++++++++++----
 mm/mmap.c                       |   4 +-
 mm/rmap.c                       | 147 +++++-
 14 files changed, 1000 insertions(+), 133 deletions(-)

---
2.25.1