From patchwork Fri Feb 3 07:18:36 2023
X-Patchwork-Submitter: Nicholas Piggin
X-Patchwork-Id: 13127092
From: Nicholas Piggin
To: Andrew Morton
Cc: Nicholas Piggin, Linus Torvalds, Nadav Amit, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Catalin Marinas, Will Deacon,
	Rik van Riel, linux-arch@vger.kernel.org, linux-mm@kvack.org,
	linuxppc-dev@lists.ozlabs.org
Subject: [PATCH v7 4/5] lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme
Date: Fri, 3 Feb 2023 17:18:36 +1000
Message-Id: <20230203071837.1136453-5-npiggin@gmail.com>
X-Mailer: git-send-email 2.37.2
In-Reply-To: <20230203071837.1136453-1-npiggin@gmail.com>
References: <20230203071837.1136453-1-npiggin@gmail.com>
MIME-Version: 1.0

On big systems, the mm refcount can become highly contended when doing
a lot of context switching with threaded applications. The user<->idle
switch is one of the important cases. Abandoning lazy tlb entirely
slows this switching down quite a bit in the common uncontended case,
so that is not viable.

Implement a scheme where lazy tlb mm references do not contribute to
the refcount; instead they are explicitly removed when the refcount
reaches zero. The final mmdrop() sends IPIs to all CPUs in the
mm_cpumask, and they switch away from this mm to init_mm if it was
being used as the lazy tlb mm. Enabling the shoot lazies option
therefore requires that the arch ensures mm_cpumask contains all CPUs
that could possibly be using mm. A DEBUG_VM option IPIs every CPU in
the system after this to ensure there are no references remaining
before the mm is freed.

The cost of the shootdown IPIs could be an issue, but it has not been
observed to be a serious problem with this scheme, because short-lived
processes tend not to migrate between CPUs much, so they do not get
much chance to leave lazy tlb mm references on remote CPUs. There are
a number of options for reducing the IPIs if necessary, described in
code comments.

The near-worst case can be benchmarked with will-it-scale:

  context_switch1_threads -t $(($(nproc) / 2))

This creates nproc threads (nproc / 2 switching pairs), all sharing
the same mm, spread over all CPUs so that each CPU does
thread->idle->thread switching.

[ Rik came up with basically the same idea a few years ago, so credit
  to him for that. ]

Link: https://lore.kernel.org/linux-mm/20230118080011.2258375-1-npiggin@gmail.com/
Link: https://lore.kernel.org/all/20180728215357.3249-11-riel@surriel.com/
Acked-by: Linus Torvalds
Signed-off-by: Nicholas Piggin
---
 arch/Kconfig      | 15 +++++++++++
 kernel/fork.c     | 65 +++++++++++++++++++++++++++++++++++++++++++++++
 lib/Kconfig.debug | 10 ++++++++
 3 files changed, 90 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index 11e8915c0652..0d2021aed57e 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -481,6 +481,21 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 # converted already).
 config MMU_LAZY_TLB_REFCOUNT
 	def_bool y
+	depends on !MMU_LAZY_TLB_SHOOTDOWN
+
+# This option allows MMU_LAZY_TLB_REFCOUNT=n. It ensures no CPUs are using an
+# mm as a lazy tlb beyond its last reference count, by shooting down these
+# users before the mm is deallocated. __mmdrop() first IPIs all CPUs that may
+# be using the mm as a lazy tlb, so that they may switch themselves to using
+# init_mm for their active mm. mm_cpumask(mm) is used to determine which CPUs
+# may be using mm as a lazy tlb mm.
+#
+# To implement this, an arch *must*:
+# - At the time of the final mmdrop of the mm, ensure mm_cpumask(mm) contains
+#   at least all possible CPUs in which the mm is lazy.
+# - It must meet the requirements for MMU_LAZY_TLB_REFCOUNT=n (see above).
+config MMU_LAZY_TLB_SHOOTDOWN
+	bool
 
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
 	bool
diff --git a/kernel/fork.c b/kernel/fork.c
index 9f7fe3541897..e7d81db7e885 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -780,6 +780,67 @@ static void check_mm(struct mm_struct *mm)
 #define allocate_mm()	(kmem_cache_alloc(mm_cachep, GFP_KERNEL))
 #define free_mm(mm)	(kmem_cache_free(mm_cachep, (mm)))
 
+static void do_check_lazy_tlb(void *arg)
+{
+	struct mm_struct *mm = arg;
+
+	WARN_ON_ONCE(current->active_mm == mm);
+}
+
+static void do_shoot_lazy_tlb(void *arg)
+{
+	struct mm_struct *mm = arg;
+
+	if (current->active_mm == mm) {
+		WARN_ON_ONCE(current->mm);
+		current->active_mm = &init_mm;
+		switch_mm(mm, &init_mm, current);
+	}
+}
+
+static void cleanup_lazy_tlbs(struct mm_struct *mm)
+{
+	if (!IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) {
+		/*
+		 * In this case, lazy tlb mms are refcounted and would not reach
+		 * __mmdrop until all CPUs have switched away and mmdrop()ed.
+		 */
+		return;
+	}
+
+	/*
+	 * Lazy mm shootdown does not refcount "lazy tlb mm" usage, rather it
+	 * requires lazy mm users to switch to another mm when the refcount
+	 * drops to zero, before the mm is freed. This requires IPIs here to
+	 * switch kernel threads to init_mm.
+	 *
+	 * archs that use IPIs to flush TLBs can piggy-back that lazy tlb mm
+	 * switch with the final userspace teardown TLB flush which leaves the
+	 * mm lazy on this CPU but no others, reducing the need for additional
+	 * IPIs here. There are cases where a final IPI is still required here,
+	 * such as the final mmdrop being performed on a different CPU than the
+	 * one exiting, or kernel threads using the mm when userspace exits.
+	 *
+	 * IPI overheads have not been found to be expensive, but they could be
+	 * reduced in a number of possible ways, for example (roughly
+	 * increasing order of complexity):
+	 * - The last lazy reference created by exit_mm() could instead switch
+	 *   to init_mm, however it's probable this will run on the same CPU
+	 *   immediately afterwards, so this may not reduce IPIs much.
+	 * - A batch of mms requiring IPIs could be gathered and freed at once.
+	 * - CPUs store active_mm where it can be remotely checked without a
+	 *   lock, to filter out false-positives in the cpumask.
+	 * - After mm_users or mm_count reaches zero, switching away from the
+	 *   mm could clear mm_cpumask to reduce some IPIs, perhaps together
+	 *   with some batching or delaying of the final IPIs.
+	 * - A delayed freeing and RCU-like quiescing sequence based on mm
+	 *   switching to avoid IPIs completely.
+	 */
+	on_each_cpu_mask(mm_cpumask(mm), do_shoot_lazy_tlb, (void *)mm, 1);
+	if (IS_ENABLED(CONFIG_DEBUG_VM_SHOOT_LAZIES))
+		on_each_cpu(do_check_lazy_tlb, (void *)mm, 1);
+}
+
 /*
  * Called when the last reference to the mm
  * is dropped: either by a lazy thread or by
@@ -791,6 +852,10 @@ void __mmdrop(struct mm_struct *mm)
 
 	BUG_ON(mm == &init_mm);
 	WARN_ON_ONCE(mm == current->mm);
+
+	/* Ensure no CPUs are using this as their lazy tlb mm */
+	cleanup_lazy_tlbs(mm);
+
 	WARN_ON_ONCE(mm == current->active_mm);
 	mm_free_pgd(mm);
 	destroy_context(mm);
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 61a9425a311f..1a5849f9f414 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -852,6 +852,16 @@ config DEBUG_VM
 
 	  If unsure, say N.
 
+config DEBUG_VM_SHOOT_LAZIES
+	bool "Debug MMU_LAZY_TLB_SHOOTDOWN implementation"
+	depends on DEBUG_VM
+	depends on MMU_LAZY_TLB_SHOOTDOWN
+	help
+	  Enable additional IPIs that ensure lazy tlb mm references are removed
+	  before the mm is freed.
+
+	  If unsure, say N.
+
 config DEBUG_VM_MAPLE_TREE
 	bool "Debug VM maple trees"
 	depends on DEBUG_VM
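
For completeness: an architecture opts into this scheme purely from its
Kconfig, once it can meet the mm_cpumask requirement documented in
arch/Kconfig above; there is no new API for the arch to call. A rough
sketch of what such an opt-in looks like, loosely modelled on the powerpc
conversion elsewhere in this series (the arch symbol and the guarding
condition are illustrative, not taken from this patch):

  config PPC
  	...
  	# Only safe because the architecture guarantees that, at final
  	# __mmdrop() time, mm_cpumask(mm) contains at least every CPU
  	# that may still be using mm as its lazy tlb mm.
  	select MMU_LAZY_TLB_SHOOTDOWN if PPC_RADIX_MMU
  	...

With the option selected, DEBUG_VM_SHOOT_LAZIES can additionally be
enabled to IPI every CPU and warn if any lazy tlb reference survives
cleanup_lazy_tlbs().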