From patchwork Mon Feb 17 14:07:52 2025
X-Patchwork-Submitter: Ryan Roberts
X-Patchwork-Id: 13977905
From: Ryan Roberts
To: Catalin Marinas, Will Deacon, Pasha Tatashin, Andrew Morton,
    Uladzislau Rezki, Christoph Hellwig, David Hildenbrand,
    "Matthew Wilcox (Oracle)", Mark Rutland, Anshuman Khandual,
    Alexandre Ghiti, Kevin Brodsky
Cc: Ryan Roberts, linux-arm-kernel@lists.infradead.org,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH v2 00/14] Perf improvements for hugetlb and vmalloc on arm64
Date: Mon, 17 Feb 2025 14:07:52 +0000
Message-ID: <20250217140809.1702789-1-ryan.roberts@arm.com>

Hi All,

This series contains some perf improvements for hugetlb and vmalloc on
arm64, and is split out from v1 of a wider series at [1]. Although some of
these patches are core-mm, advice from Andrew was to go via the arm64
tree. Hopefully I can get some ACKs from mm folks.

The two key performance improvements are:

 1) Enabling the use of contpte-mapped blocks in the vmalloc space when
    appropriate (which reduces TLB pressure). There were already hooks
    for this (used by powerpc) but they required some tidying and
    extending for arm64.

 2) Batching up barriers when modifying the vmalloc address space, for up
    to a 30% reduction in time taken in vmalloc(). A rough sketch of this
    idea is included just before the results below.

vmalloc() performance was measured using the test_vmalloc.ko module,
tested on Apple M2 and Ampere Altra. Each test had its loop count set to
500000 and the whole test was repeated 10 times. (perf results are
against v1).
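To give a flavour of the second improvement before the numbers, here is a
rough sketch of the barrier-batching idea. This is an illustration only,
not code from any patch in the series: set_kernel_ptes_batched() is a
made-up name, and pte_advance_pfn() simply stands in for however the next
pte value is built up.

static inline void set_kernel_ptes_batched(pte_t *ptep, pte_t pte,
                                           unsigned int nr)
{
        unsigned int i;

        /* Write every PTE with a plain store; no per-entry barrier. */
        for (i = 0; i < nr; i++) {
                WRITE_ONCE(ptep[i], pte);
                pte = pte_advance_pfn(pte, 1);
        }

        /*
         * Issue one dsb/isb pair covering the whole batch, rather than
         * one pair after every individual PTE write.
         */
        dsb(ishst);
        isb();
}

The real implementation folds this into set_ptes_anysz() and the
arch_update_kernel_mappings_[begin|end]() hooks, and also avoids the
barriers entirely for invalid and userspace entries, but the sketch is the
essence of where the saving comes from.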
legend:
  - p: nr_pages (pages to allocate)
  - h: use_huge (vmalloc() vs vmalloc_huge())
  - (I): statistically significant improvement (95% CI does not overlap)
  - (R): statistically significant regression (95% CI does not overlap)
  - measurements are times; smaller is better

+--------------------------------------------------+-------------+-------------+
| Benchmark                                        |             |             |
| Result Class                                     | Apple M2    | Ampere Altra|
+==================================================+=============+=============+
| micromm/vmalloc                                  |             |             |
| fix_align_alloc_test: p:1, h:0 (usec)            | (I) -12.93% | (I) -7.89%  |
| fix_size_alloc_test: p:1, h:0 (usec)             | (R) 4.00%   | 1.40%       |
| fix_size_alloc_test: p:1, h:1 (usec)             | (R) 5.28%   | 1.46%       |
| fix_size_alloc_test: p:2, h:0 (usec)             | (I) -3.04%  | -1.11%      |
| fix_size_alloc_test: p:2, h:1 (usec)             | -3.24%      | -2.86%      |
| fix_size_alloc_test: p:4, h:0 (usec)             | (I) -11.77% | (I) -4.48%  |
| fix_size_alloc_test: p:4, h:1 (usec)             | (I) -9.19%  | (I) -4.45%  |
| fix_size_alloc_test: p:8, h:0 (usec)             | (I) -19.79% | (I) -11.63% |
| fix_size_alloc_test: p:8, h:1 (usec)             | (I) -19.40% | (I) -11.11% |
| fix_size_alloc_test: p:16, h:0 (usec)            | (I) -24.89% | (I) -15.26% |
| fix_size_alloc_test: p:16, h:1 (usec)            | (I) -11.61% | (R) 6.00%   |
| fix_size_alloc_test: p:32, h:0 (usec)            | (I) -26.54% | (I) -18.80% |
| fix_size_alloc_test: p:32, h:1 (usec)            | (I) -15.42% | (R) 5.82%   |
| fix_size_alloc_test: p:64, h:0 (usec)            | (I) -30.25% | (I) -20.80% |
| fix_size_alloc_test: p:64, h:1 (usec)            | (I) -16.98% | (R) 6.54%   |
| fix_size_alloc_test: p:128, h:0 (usec)           | (I) -32.56% | (I) -21.79% |
| fix_size_alloc_test: p:128, h:1 (usec)           | (I) -18.39% | (R) 5.91%   |
| fix_size_alloc_test: p:256, h:0 (usec)           | (I) -33.33% | (I) -22.22% |
| fix_size_alloc_test: p:256, h:1 (usec)           | (I) -18.82% | (R) 5.79%   |
| fix_size_alloc_test: p:512, h:0 (usec)           | (I) -33.27% | (I) -22.23% |
| fix_size_alloc_test: p:512, h:1 (usec)           | 0.86%       | -0.71%      |
| full_fit_alloc_test: p:1, h:0 (usec)             | 2.49%       | -0.62%      |
| kvfree_rcu_1_arg_vmalloc_test: p:1, h:0 (usec)   | 1.79%       | -1.25%      |
| kvfree_rcu_2_arg_vmalloc_test: p:1, h:0 (usec)   | -0.32%      | 0.61%       |
| long_busy_list_alloc_test: p:1, h:0 (usec)       | (I) -31.06% | (I) -19.62% |
| pcpu_alloc_test: p:1, h:0 (usec)                 | 0.06%       | 0.47%       |
| random_size_align_alloc_test: p:1, h:0 (usec)    | (I) -14.94% | (I) -8.68%  |
| random_size_alloc_test: p:1, h:0 (usec)          | (I) -30.22% | (I) -19.59% |
| vm_map_ram_test: p:1, h:0 (usec)                 | 2.65%       | (R) 7.22%   |
+--------------------------------------------------+-------------+-------------+

So there are some nice improvements but also some regressions to explain:

First, fix_size_alloc_test with h:1 and p:16,32,64,128,256 regresses by
~6% on Altra. The regression is actually introduced by enabling
contpte-mapped 64K blocks in these tests, and it is reduced (from about
8%, if memory serves) by the barrier batching. I don't have a definite
conclusion on the root cause, but I've ruled out differences in the
mapping paths in vmalloc. I believe it is most likely down to the
allocation path: 64K blocks are not cached per-cpu so we have to go all
the way to the buddy. I'm not sure why this doesn't show up on M2 though.
Regardless, I'm going to assert that it's better to choose a 16x reduction
in TLB pressure over a 6% increase in vmalloc() allocation call duration.

Next, we have a ~4% regression on M2 when vmalloc'ing a single page (h is
irrelevant because a single page is too small for contpte).
I assume this is because there is some minor overhead in the barrier
deferral mechanism and we don't get to amortize it over multiple pages
here. But I would assume vmalloc'ing a single page is uncommon, since it
doesn't buy you anything over kmalloc?

Changes since v1 [1]
====================

  - Split out the fixes into their own series
  - Added Rbs from Anshuman - Thanks!
  - Added patch to clean up the methods by which huge_pte size is
    determined
  - Added "#ifndef __PAGETABLE_PMD_FOLDED" around PUD_SIZE in
    flush_hugetlb_tlb_range()
  - Renamed ___set_ptes() -> set_ptes_anysz()
  - Renamed ___ptep_get_and_clear() -> ptep_get_and_clear_anysz()
  - Fixed typos in commit logs
  - Refactored pXd_valid_not_user() for better reuse
  - Removed TIF_KMAP_UPDATE_PENDING after concluding that a single flag
    is sufficient
  - Concluded the extra isb() in __switch_to() is not required
  - Only call arch_update_kernel_mappings_[begin|end]() for kernel
    mappings

Thanks to Anshuman for the review!

Applies on top of my fixes series at [2], which applies on top of
v6.14-rc3. All mm selftests run and pass.

[1] https://lore.kernel.org/all/20250205151003.88959-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/all/20250217140419.1702389-1-ryan.roberts@arm.com/

Thanks,
Ryan

Ryan Roberts (14):
  arm64: hugetlb: Cleanup huge_pte size discovery mechanisms
  arm64: hugetlb: Refine tlb maintenance scope
  mm/page_table_check: Batch-check pmds/puds just like ptes
  arm64/mm: Refactor __set_ptes() and __ptep_get_and_clear()
  arm64: hugetlb: Use set_ptes_anysz() and ptep_get_and_clear_anysz()
  arm64/mm: Hoist barriers out of set_ptes_anysz() loop
  arm64/mm: Avoid barriers for invalid or userspace mappings
  mm/vmalloc: Warn on improper use of vunmap_range()
  mm/vmalloc: Gracefully unmap huge ptes
  arm64/mm: Support huge pte-mapped pages in vmap
  mm/vmalloc: Batch arch_sync_kernel_mappings() more efficiently
  mm: Generalize arch_sync_kernel_mappings()
  mm: Only call arch_update_kernel_mappings_[begin|end]() for kernel
    mappings
  arm64/mm: Batch barriers when updating kernel mappings

 arch/arm64/include/asm/hugetlb.h     |  29 ++--
 arch/arm64/include/asm/pgtable.h     | 207 +++++++++++++++++++--------
 arch/arm64/include/asm/thread_info.h |   1 +
 arch/arm64/include/asm/vmalloc.h     |  46 ++++++
 arch/arm64/kernel/process.c          |   9 +-
 arch/arm64/mm/hugetlbpage.c          |  72 ++++------
 include/linux/page_table_check.h     |  30 ++--
 include/linux/pgtable.h              |  24 +---
 include/linux/pgtable_modmask.h      |  32 +++++
 include/linux/vmalloc.h              |  55 +++++++
 mm/memory.c                          |   7 +-
 mm/page_table_check.c                |  34 +++--
 mm/vmalloc.c                         |  93 +++++++-----
 13 files changed, 434 insertions(+), 205 deletions(-)
 create mode 100644 include/linux/pgtable_modmask.h

--
2.43.0