From: Ryan Roberts
To: Andrew Morton, Matthew Wilcox, "Kirill A. Shutemov", Yin Fengwei,
    David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
    Anshuman Khandual, Yang Shi
Cc: Ryan Roberts, linux-arm-kernel@lists.infradead.org,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH v2 0/5] variable-order, large folios for anonymous memory
Date: Mon, 3 Jul 2023 14:53:25 +0100
Message-Id: <20230703135330.1865927-1-ryan.roberts@arm.com>

Hi All,

This is v2 of a series to implement variable-order, large folios for
anonymous memory. The objective is to improve performance by allocating
larger chunks of memory during anonymous page faults. See [1] for
background.

I've significantly reworked and simplified the patch set based on comments
from Yu Zhao (thanks for all your feedback!). I've also renamed the feature
to FLEXIBLE_THP, on Yu's advice.

The last patch is for arm64 to explicitly override the default
arch_wants_pte_order() and is intended as an example. If this series is
accepted, I suggest taking the first 4 patches through the mm tree; the
arm64 change could be handled through the arm64 tree separately. Neither
has any build dependency on the other.

The one area where I haven't followed Yu's advice is in determining the
size of folio to use. It was suggested that I have a single preferred large
order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or
there being existing overlapping populated PTEs, etc.) then fall back
immediately to order-0. It turned out that this approach caused a
performance regression in the Speedometer benchmark. With my v1 patch,
there were significant quantities of memory which could not be placed in
the 64K bucket and were instead being allocated from the 32K and 16K
buckets. With the proposed simplification, that memory ended up using the
4K bucket, so page faults increased by 2.75x compared to the v1 patch
(although due to the 64K bucket, this number is still a bit lower than the
baseline). So instead, I continue to calculate a folio order that is
somewhere between the preferred order and 0. (See below for more details.)

The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes()
series [2], which is a hard dependency. I have a branch at [3].
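
To make the order selection described above concrete, here is a minimal
userspace sketch of the idea: start at the preferred order and step down
until the naturally aligned block around the faulting address fits inside
the VMA and overlaps no already-populated PTE. This is an illustration
only, not the code from the patches. struct model_vma, order_fits() and
pick_anon_folio_order() are hypothetical names invented for this sketch,
4K base pages are assumed, and the VMA start is assumed to be page aligned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12				/* assumption: 4K base pages */
#define PAGE_SIZE  ((uintptr_t)1 << PAGE_SHIFT)

/*
 * Hypothetical stand-in for a VMA: the range [start, end) plus one flag
 * per page saying whether a PTE is already populated there.
 */
struct model_vma {
	uintptr_t start;
	uintptr_t end;
	const bool *populated;	/* index: (addr - start) >> PAGE_SHIFT */
};

/*
 * Return true if the naturally aligned block of 2^order pages that
 * contains addr lies entirely within the VMA and covers no populated PTE.
 */
static bool order_fits(const struct model_vma *vma, uintptr_t addr, int order)
{
	uintptr_t size = PAGE_SIZE << order;
	uintptr_t base = addr & ~(size - 1);

	if (base < vma->start || base + size > vma->end)
		return false;

	size_t first = (base - vma->start) >> PAGE_SHIFT;

	for (size_t i = 0; i < ((size_t)1 << order); i++)
		if (vma->populated[first + i])
			return false;

	return true;
}

/*
 * Walk down from preferred_order and return the first order that fits.
 * Order 0 (a single page) is the final fallback.
 */
static int pick_anon_folio_order(const struct model_vma *vma, uintptr_t addr,
				 int preferred_order)
{
	assert(preferred_order >= 0);
	assert(addr >= vma->start && addr < vma->end);

	for (int order = preferred_order; order > 0; order--)
		if (order_fits(vma, addr, order))
			return order;

	return 0;
}
```

In this model, the "simple-order" variant discussed above corresponds to
testing only preferred_order and returning 0 as soon as that one check
fails, which is why memory that could have used a 32K or 16K folio ends up
in 4K pages instead.
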

Changes since v1 [1]
--------------------

  - removed changes to arch-dependent vma_alloc_zeroed_movable_folio()
  - replaced with arch-independent alloc_anon_folio()
      - follows THP allocation approach
  - no longer retry with intermediate orders if allocation fails
      - fall back directly to order-0
  - remove folio_add_new_anon_rmap_range() patch
      - instead add its new functionality to folio_add_new_anon_rmap()
  - remove batch-zap pte mappings optimization patch
      - remove enabler folio_remove_rmap_range() patch too
      - these offer real perf improvement so will submit separately
  - simplify Kconfig
      - single FLEXIBLE_THP option, which is independent of arch
      - depends on TRANSPARENT_HUGEPAGE
      - when enabled, default to max anon folio size of 64K unless arch
        explicitly overrides
  - simplify changes to do_anonymous_page():
      - no more retry loop


Performance
-----------

The results below show 3 benchmarks: kernel compilation with 8 jobs, kernel
compilation with 80 jobs, and Speedometer 2.0 (a JavaScript benchmark
running in Chromium). All cases are running on Ampere Altra with 1 NUMA
node enabled, Ubuntu 22.04 and XFS filesystem. Each benchmark is repeated
15 times over 5 reboots and averaged. All numbers are percentage changes
relative to baseline-4k.

'anonfolio-lkml-v1' is the v1 patchset at [1]. 'anonfolio-lkml-v2' is this
v2 patchset. 'anonfolio-lkml-v2-simple-order' is anonfolio-lkml-v2 but with
the order selection simplification that Yu Zhao suggested. I include it
here to justify why I did not follow that advice.


Kernel compilation with 8 jobs:

| kernel                         | real-time | kern-time | user-time |
|:-------------------------------|----------:|----------:|----------:|
| baseline-4k                    |      0.0% |      0.0% |      0.0% |
| anonfolio-lkml-v1              |     -5.3% |    -42.9% |     -0.6% |
| anonfolio-lkml-v2-simple-order |     -4.4% |    -36.5% |     -0.4% |
| anonfolio-lkml-v2              |     -4.8% |    -38.6% |     -0.6% |

We can see that the simple-order approach is responsible for a real-time
regression of 0.4% relative to anonfolio-lkml-v2 (-4.4% vs -4.8%).


Kernel compilation with 80 jobs:

| kernel                         | real-time | kern-time | user-time |
|:-------------------------------|----------:|----------:|----------:|
| baseline-4k                    |      0.0% |      0.0% |      0.0% |
| anonfolio-lkml-v1              |     -4.6% |    -45.7% |      1.4% |
| anonfolio-lkml-v2-simple-order |     -4.7% |    -40.2% |     -0.1% |
| anonfolio-lkml-v2              |     -5.0% |    -42.6% |     -0.3% |

simple-order costs 0.3% of real-time here (-4.7% vs -5.0%). v2 actually
outperforms v1 on real-time because it fixes the v1 regression on
user-time.


Speedometer 2.0:

| kernel                         | runs_per_min |
|:-------------------------------|-------------:|
| baseline-4k                    |         0.0% |
| anonfolio-lkml-v1              |         0.7% |
| anonfolio-lkml-v2-simple-order |        -0.9% |
| anonfolio-lkml-v2              |         0.5% |

simple-order regresses performance by 0.9% vs the baseline, for a total
negative swing of 1.6% vs v1. This is fixed by keeping the more complex
order selection mechanism from v1.


The remaining (kernel time) performance gap between v1 and v2 for the above
benchmarks is due to the removal of the "batch zap" patch in v2. Adding
that back in recovers the performance. I intend to submit it as a separate
series once this series is accepted.


[1] https://lore.kernel.org/linux-mm/20230626171430.3167004-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
[3] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anonfolio-lkml_v2

Thanks,
Ryan


Ryan Roberts (5):
  mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()
  mm: Allow deferred splitting of arbitrary large anon folios
  mm: Default implementation of arch_wants_pte_order()
  mm: FLEXIBLE_THP for improved performance
  arm64: mm: Override arch_wants_pte_order()

 arch/arm64/Kconfig               |  12 +++
 arch/arm64/include/asm/pgtable.h |   4 +
 arch/arm64/mm/mmu.c              |   8 ++
 include/linux/pgtable.h          |  13 +++
 mm/Kconfig                       |  10 ++
 mm/memory.c                      | 168 ++++++++++++++++++++++++++++---
 mm/rmap.c                        |  28 ++++--
 7 files changed, 222 insertions(+), 21 deletions(-)

---
2.25.1