From patchwork Sun Feb 23 19:49:00 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Rik van Riel
X-Patchwork-Id: 13987231
From: Rik van Riel
To: x86@kernel.org
Cc: linux-kernel@vger.kernel.org, bp@alien8.de, peterz@infradead.org,
 dave.hansen@linux.intel.com, zhengqi.arch@bytedance.com,
 nadav.amit@gmail.com, thomas.lendacky@amd.com, kernel-team@meta.com,
 linux-mm@kvack.org, akpm@linux-foundation.org, jackmanb@google.com,
 jannh@google.com, mhklinux@outlook.com, andrew.cooper3@citrix.com,
 Manali.Shukla@amd.com, mingo@kernel.org, Rik van Riel
Subject: [PATCH v13 10/14] x86/mm: enable broadcast TLB invalidation for
 multi-threaded processes
Date: Sun, 23 Feb 2025 14:49:00 -0500
Message-ID: <20250223194943.3518952-11-riel@surriel.com>
X-Mailer: git-send-email 2.47.1
In-Reply-To: <20250223194943.3518952-1-riel@surriel.com>
References: <20250223194943.3518952-1-riel@surriel.com>
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Use broadcast TLB invalidation, using the INVLPGB instruction.

There is not enough room in the 12-bit ASID address space to hand out
broadcast ASIDs to every process.

Only hand out broadcast ASIDs to processes when they are observed to be
simultaneously running on 4 or more CPUs.

This also allows single-threaded processes to continue using the cheaper,
local TLB invalidation instructions like INVLPG.

Combined with the removal of unnecessary lru_add_drain calls
(see https://lkml.org/lkml/2024/12/19/1388) this results in a nice
performance boost for the will-it-scale tlb_flush2_threads test on an
AMD Milan system with 36 cores:

- vanilla kernel:           527k loops/second
- lru_add_drain removal:    731k loops/second
- only INVLPGB:             527k loops/second
- lru_add_drain + INVLPGB: 1157k loops/second

Profiling with only the INVLPGB changes showed that while TLB
invalidation went down from 40% of the total CPU time to only around 4%
of CPU time, the contention simply moved to the LRU lock.

Fixing both at the same time about doubles the number of iterations per
second in this case.

Comparing will-it-scale tlb_flush2_threads with several different
numbers of threads on a 72 CPU AMD Milan shows similar results. The
number represents the total number of loops per second across all the
threads:

threads    tip    invlpgb

   1       315k    304k
   2       423k    424k
   4       644k   1032k
   8       652k   1267k
  16       737k   1368k
  32       759k   1199k
  64       636k   1094k
  72       609k    993k

1 and 2 thread performance is similar with and without invlpgb, because
invlpgb is only used on processes using 4 or more CPUs simultaneously.

The number is the median across 5 runs.
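
The "4 or more CPUs" policy described above can be sketched in plain
userspace C. This is only an illustration under stated assumptions: the
64-bit mask type and the names active_cpus_exceeds(), sample_now() and
wants_global_asid() are hypothetical stand-ins, not the helpers used by
this series (mm_active_cpus_exceeds() is defined elsewhere in the series
and is not shown in this patch).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a CPU mask: one bit per CPU, 64 CPUs max. */
typedef uint64_t cpumask64_t;

/*
 * Return true as soon as more than `threshold` bits are set, instead
 * of counting every bit in the mask.
 */
static bool active_cpus_exceeds(cpumask64_t mask, int threshold)
{
	int count = 0;

	while (mask) {
		mask &= mask - 1;	/* clear the lowest set bit */
		if (++count > threshold)
			return true;
	}
	return false;
}

/*
 * Rate limit modelled on the (pid & 0x1f) != (jiffies & 0x1f) test:
 * for a given pid, the check passes on one tick value in every 32.
 */
static bool sample_now(unsigned int pid, unsigned long ticks)
{
	return (pid & 0x1f) == (ticks & 0x1f);
}

/* Combine both conditions: sampled this tick, and on 4 or more CPUs. */
static bool wants_global_asid(cpumask64_t mask, unsigned int pid,
			      unsigned long ticks)
{
	return sample_now(pid, ticks) && active_cpus_exceeds(mask, 3);
}
```

With a mask of 0xF (four CPUs) and pid 33, wants_global_asid() is true
only on ticks whose low five bits equal 1. A mask of 0x7 (three CPUs)
never qualifies, which is consistent with the 1 and 2 thread rows above
showing no change.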

Some numbers closer to real world performance can be found at Phoronix,
thanks to Michael:

https://www.phoronix.com/news/AMD-INVLPGB-Linux-Benefits

Signed-off-by: Rik van Riel
Reviewed-by: Nadav Amit
Tested-by: Manali Shukla
Tested-by: Brendan Jackman
Tested-by: Michael Kelley
---
 arch/x86/mm/tlb.c | 107 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 106 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index d8a04e398615..01a5edb51ebe 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -420,6 +420,108 @@ static bool needs_global_asid_reload(struct mm_struct *next, u16 prev_asid)
 	return false;
 }
 
+/*
+ * x86 has 4k ASIDs (2k when compiled with KPTI), but the largest
+ * x86 systems have over 8k CPUs. Because of this potential ASID
+ * shortage, global ASIDs are handed out to processes that have
+ * frequent TLB flushes and are active on 4 or more CPUs simultaneously.
+ */
+static void consider_global_asid(struct mm_struct *mm)
+{
+	if (!static_cpu_has(X86_FEATURE_INVLPGB))
+		return;
+
+	/* Check every once in a while. */
+	if ((current->pid & 0x1f) != (jiffies & 0x1f))
+		return;
+
+	if (!READ_ONCE(global_asid_available))
+		return;
+
+	/*
+	 * Assign a global ASID if the process is active on
+	 * 4 or more CPUs simultaneously.
+	 */
+	if (mm_active_cpus_exceeds(mm, 3))
+		use_global_asid(mm);
+}
+
+static void finish_asid_transition(struct flush_tlb_info *info)
+{
+	struct mm_struct *mm = info->mm;
+	int bc_asid = mm_global_asid(mm);
+	int cpu;
+
+	if (!READ_ONCE(mm->context.asid_transition))
+		return;
+
+	for_each_cpu(cpu, mm_cpumask(mm)) {
+		/*
+		 * The remote CPU is context switching. Wait for that to
+		 * finish, to catch the unlikely case of it switching to
+		 * the target mm with an out of date ASID.
+		 */
+		while (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) == LOADED_MM_SWITCHING)
+			cpu_relax();
+
+		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) != mm)
+			continue;
+
+		/*
+		 * If at least one CPU is not using the global ASID yet,
+		 * send a TLB flush IPI. The IPI should cause stragglers
+		 * to transition soon.
+		 *
+		 * This can race with the CPU switching to another task;
+		 * that results in a (harmless) extra IPI.
+		 */
+		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm_asid, cpu)) != bc_asid) {
+			flush_tlb_multi(mm_cpumask(info->mm), info);
+			return;
+		}
+	}
+
+	/* All the CPUs running this process are using the global ASID. */
+	WRITE_ONCE(mm->context.asid_transition, false);
+}
+
+static void broadcast_tlb_flush(struct flush_tlb_info *info)
+{
+	bool pmd = info->stride_shift == PMD_SHIFT;
+	unsigned long asid = info->mm->context.global_asid;
+	unsigned long addr = info->start;
+
+	/*
+	 * TLB flushes with INVLPGB are kicked off asynchronously.
+	 * The inc_mm_tlb_gen() guarantees page table updates are done
+	 * before these TLB flushes happen.
+	 */
+	if (info->end == TLB_FLUSH_ALL) {
+		invlpgb_flush_single_pcid_nosync(kern_pcid(asid));
+		/* Do any CPUs supporting INVLPGB need PTI? */
+		if (static_cpu_has(X86_FEATURE_PTI))
+			invlpgb_flush_single_pcid_nosync(user_pcid(asid));
+	} else do {
+		unsigned long nr = 1;
+
+		if (info->stride_shift <= PMD_SHIFT) {
+			nr = (info->end - addr) >> info->stride_shift;
+			nr = clamp_val(nr, 1, invlpgb_count_max);
+		}
+
+		invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, nr, pmd);
+		if (static_cpu_has(X86_FEATURE_PTI))
+			invlpgb_flush_user_nr_nosync(user_pcid(asid), addr, nr, pmd);
+
+		addr += nr << info->stride_shift;
+	} while (addr < info->end);
+
+	finish_asid_transition(info);
+
+	/* Wait for the INVLPGBs kicked off above to finish. */
+	__tlbsync();
+}
+
 /*
  * Given an ASID, flush the corresponding user ASID. We can delay this
  * until the next time we switch to it.
@@ -1250,9 +1352,12 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	 * a local TLB flush is needed. Optimize this use-case by calling
 	 * flush_tlb_func_local() directly in this case.
 	 */
-	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
+	if (mm_global_asid(mm)) {
+		broadcast_tlb_flush(info);
+	} else if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
 		info->trim_cpumask = should_trim_cpumask(mm);
 		flush_tlb_multi(mm_cpumask(mm), info);
+		consider_global_asid(mm);
 	} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
 		lockdep_assert_irqs_enabled();
 		local_irq_disable();
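
The ranged branch of broadcast_tlb_flush() walks [start, end) in chunks
of at most invlpgb_count_max strides. The userspace sketch below models
only that arithmetic, to make it easier to see how many INVLPGB
operations a given range turns into. The names chunk_strides() and
count_flush_ops(), the SKETCH_PMD_SHIFT value of 21, and the count_max
value of 8 used in the example are assumptions for illustration; the
real limit is reported by the CPU and no instructions are issued here.

```c
#define SKETCH_PMD_SHIFT 21	/* assumed: 2 MiB PMD mappings */

/*
 * Number of strides covered by the next operation starting at addr,
 * clamped to [1, count_max] like the clamp_val() call in the patch.
 */
static unsigned long chunk_strides(unsigned long addr, unsigned long end,
				   unsigned int stride_shift,
				   unsigned long count_max)
{
	unsigned long nr = 1;

	if (stride_shift <= SKETCH_PMD_SHIFT) {
		nr = (end - addr) >> stride_shift;
		if (nr < 1)
			nr = 1;
		if (nr > count_max)
			nr = count_max;
	}
	return nr;
}

/*
 * Count how many chunked operations are needed to cover [start, end).
 * Like the do/while loop in the patch, at least one is always counted.
 */
static unsigned long count_flush_ops(unsigned long start, unsigned long end,
				     unsigned int stride_shift,
				     unsigned long count_max)
{
	unsigned long addr = start;
	unsigned long ops = 0;

	do {
		addr += chunk_strides(addr, end, stride_shift, count_max)
			<< stride_shift;
		ops++;
	} while (addr < end);

	return ops;
}
```

For example, with 4 KiB strides and an assumed count_max of 8, a range
of 20 pages starting at 0x1000 is covered by three operations of 8, 8
and 4 pages. With PTI enabled the patch issues each operation twice,
once for the kernel PCID and once for the user PCID.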