From patchwork Wed Feb 26 03:00:45 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Rik van Riel
X-Patchwork-Id: 13991453
From: Rik van Riel <riel@surriel.com>
To: x86@kernel.org
Cc: linux-kernel@vger.kernel.org, bp@alien8.de, peterz@infradead.org,
    dave.hansen@linux.intel.com, zhengqi.arch@bytedance.com,
    nadav.amit@gmail.com, thomas.lendacky@amd.com, kernel-team@meta.com,
    linux-mm@kvack.org, akpm@linux-foundation.org, jackmanb@google.com,
    jannh@google.com, mhklinux@outlook.com, andrew.cooper3@citrix.com,
    Manali.Shukla@amd.com, mingo@kernel.org,
    Rik van Riel <riel@surriel.com>
Subject: [PATCH v14 10/13] x86/mm: enable broadcast TLB invalidation for
 multi-threaded processes
Date: Tue, 25 Feb 2025 22:00:45 -0500
Message-ID: <20250226030129.530345-11-riel@surriel.com>
X-Mailer: git-send-email 2.47.1
In-Reply-To: <20250226030129.530345-1-riel@surriel.com>
References: <20250226030129.530345-1-riel@surriel.com>

Use broadcast TLB invalidation, using the INVLPGB instruction.

There is not enough room in the 12-bit ASID address space to hand out
broadcast ASIDs to every process. Only hand out broadcast ASIDs to
processes when they are observed to be simultaneously running on 4 or
more CPUs.

This also allows single-threaded processes to continue using the
cheaper, local TLB invalidation instructions like INVLPG.

Due to the structure of flush_tlb_mm_range(), the INVLPGB flushing is
done in a generically named broadcast_tlb_flush() function, which can
later also be used for Intel RAR.

Combined with the removal of unnecessary lru_add_drain calls (see
https://lkml.org/lkml/2024/12/19/1388) this results in a nice
performance boost for the will-it-scale tlb_flush2_threads test on an
AMD Milan system with 36 cores:

- vanilla kernel:           527k loops/second
- lru_add_drain removal:    731k loops/second
- only INVLPGB:             527k loops/second
- lru_add_drain + INVLPGB: 1157k loops/second

Profiling with only the INVLPGB changes showed that while TLB
invalidation went down from 40% of the total CPU time to only around
4% of CPU time, the contention simply moved to the LRU lock. Fixing
both at the same time about doubles the number of iterations per
second in this case.

Comparing will-it-scale tlb_flush2_threads with several different
numbers of threads on a 72 CPU AMD Milan shows similar results.
The number represents the total number of loops per second across
all the threads:

threads		tip		invlpgb

1		315k		304k
2		423k		424k
4		644k		1032k
8		652k		1267k
16		737k		1368k
32		759k		1199k
64		636k		1094k
72		609k		993k

1 and 2 thread performance is similar with and without invlpgb,
because invlpgb is only used on processes using 4 or more CPUs
simultaneously. The number is the median across 5 runs.

Some numbers closer to real-world performance can be found at
Phoronix, thanks to Michael:

https://www.phoronix.com/news/AMD-INVLPGB-Linux-Benefits

Signed-off-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Nadav Amit <nadav.amit@gmail.com>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
Tested-by: Brendan Jackman <jackmanb@google.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
---
 arch/x86/include/asm/tlbflush.h |   9 +++
 arch/x86/mm/tlb.c               | 107 +++++++++++++++++++++++++++++++-
 2 files changed, 115 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 37b735dcf025..811dd70eb6b8 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -272,6 +272,11 @@ static inline void mm_assign_global_asid(struct mm_struct *mm, u16 asid)
 	smp_store_release(&mm->context.global_asid, asid);
 }
 
+static inline void clear_asid_transition(struct mm_struct *mm)
+{
+	WRITE_ONCE(mm->context.asid_transition, false);
+}
+
 static inline bool in_asid_transition(struct mm_struct *mm)
 {
 	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
@@ -289,6 +294,10 @@ static inline void mm_assign_global_asid(struct mm_struct *mm, u16 asid)
 {
 }
 
+static inline void clear_asid_transition(struct mm_struct *mm)
+{
+}
+
 static inline bool in_asid_transition(struct mm_struct *mm)
 {
 	return false;
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index b7d461db1b08..cd109bdf0dd9 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -422,6 +422,108 @@ static bool needs_global_asid_reload(struct mm_struct *next, u16 prev_asid)
 	return false;
 }
 
+/*
+ * x86 has 4k ASIDs (2k when compiled with KPTI), but the largest
+ * x86 systems have over 8k CPUs. Because of this potential ASID
+ * shortage, global ASIDs are handed out to processes that have
+ * frequent TLB flushes and are active on 4 or more CPUs simultaneously.
+ */
+static void consider_global_asid(struct mm_struct *mm)
+{
+	if (!static_cpu_has(X86_FEATURE_INVLPGB))
+		return;
+
+	/* Check every once in a while. */
+	if ((current->pid & 0x1f) != (jiffies & 0x1f))
+		return;
+
+	if (!READ_ONCE(global_asid_available))
+		return;
+
+	/*
+	 * Assign a global ASID if the process is active on
+	 * 4 or more CPUs simultaneously.
+	 */
+	if (mm_active_cpus_exceeds(mm, 3))
+		use_global_asid(mm);
+}
+
+static void finish_asid_transition(struct flush_tlb_info *info)
+{
+	struct mm_struct *mm = info->mm;
+	int bc_asid = mm_global_asid(mm);
+	int cpu;
+
+	if (!in_asid_transition(mm))
+		return;
+
+	for_each_cpu(cpu, mm_cpumask(mm)) {
+		/*
+		 * The remote CPU is context switching. Wait for that to
+		 * finish, to catch the unlikely case of it switching to
+		 * the target mm with an out of date ASID.
+		 */
+		while (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) == LOADED_MM_SWITCHING)
+			cpu_relax();
+
+		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) != mm)
+			continue;
+
+		/*
+		 * If at least one CPU is not using the global ASID yet,
+		 * send a TLB flush IPI. The IPI should cause stragglers
+		 * to transition soon.
+		 *
+		 * This can race with the CPU switching to another task;
+		 * that results in a (harmless) extra IPI.
+		 */
+		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm_asid, cpu)) != bc_asid) {
+			flush_tlb_multi(mm_cpumask(info->mm), info);
+			return;
+		}
+	}
+
+	/* All the CPUs running this process are using the global ASID. */
+	clear_asid_transition(mm);
+}
+
+static void broadcast_tlb_flush(struct flush_tlb_info *info)
+{
+	bool pmd = info->stride_shift == PMD_SHIFT;
+	unsigned long asid = mm_global_asid(info->mm);
+	unsigned long addr = info->start;
+
+	/*
+	 * TLB flushes with INVLPGB are kicked off asynchronously.
+	 * The inc_mm_tlb_gen() guarantees page table updates are done
+	 * before these TLB flushes happen.
+	 */
+	if (info->end == TLB_FLUSH_ALL) {
+		invlpgb_flush_single_pcid_nosync(kern_pcid(asid));
+		/* Do any CPUs supporting INVLPGB need PTI? */
+		if (static_cpu_has(X86_FEATURE_PTI))
+			invlpgb_flush_single_pcid_nosync(user_pcid(asid));
+	} else do {
+		unsigned long nr = 1;
+
+		if (info->stride_shift <= PMD_SHIFT) {
+			nr = (info->end - addr) >> info->stride_shift;
+			nr = clamp_val(nr, 1, invlpgb_count_max);
+		}
+
+		invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, nr, pmd);
+		if (static_cpu_has(X86_FEATURE_PTI))
+			invlpgb_flush_user_nr_nosync(user_pcid(asid), addr, nr, pmd);
+
+		addr += nr << info->stride_shift;
+	} while (addr < info->end);
+
+	finish_asid_transition(info);
+
+	/* Wait for the INVLPGBs kicked off above to finish. */
+	__tlbsync();
+}
+
 /*
  * Given an ASID, flush the corresponding user ASID. We can delay this
  * until the next time we switch to it.
@@ -1252,9 +1354,12 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	 * a local TLB flush is needed. Optimize this use-case by calling
 	 * flush_tlb_func_local() directly in this case.
 	 */
-	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
+	if (mm_global_asid(mm)) {
+		broadcast_tlb_flush(info);
+	} else if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
 		info->trim_cpumask = should_trim_cpumask(mm);
 		flush_tlb_multi(mm_cpumask(mm), info);
+		consider_global_asid(mm);
 	} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
 		lockdep_assert_irqs_enabled();
 		local_irq_disable();
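
Note for readers without the rest of the series: mm_active_cpus_exceeds(),
called from consider_global_asid() above, is introduced in an earlier patch
of this series and is not shown here. For context, a simplified sketch of
what such a threshold check can look like follows. This is illustrative
only, not the series' exact code; it assumes the cpu_tlbstate /
cpu_tlbstate_shared per-CPU bookkeeping that arch/x86/mm/tlb.c already
maintains.

/*
 * Sketch only: count the CPUs that are actively (non-lazily) running
 * this mm, bailing out as soon as more than @threshold are found.
 */
static bool mm_active_cpus_exceeds(struct mm_struct *mm, int threshold)
{
	int count = 0;
	int cpu;

	/* A cheap check eliminates most single-threaded programs. */
	if (cpumask_weight(mm_cpumask(mm)) <= threshold)
		return false;

	/* The slower check skips CPUs that merely have the mm cached. */
	for_each_cpu(cpu, mm_cpumask(mm)) {
		/* Skip CPUs that are not really running this process. */
		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) != mm)
			continue;

		/* Lazy-TLB CPUs do not count as active users of the mm. */
		if (per_cpu(cpu_tlbstate_shared.is_lazy, cpu))
			continue;

		if (++count > threshold)
			return true;
	}
	return false;
}

With this shape, mm_active_cpus_exceeds(mm, 3) returns true only once at
least four CPUs are simultaneously and non-lazily running the process,
which is the promotion condition the commit message describes.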