From patchwork Wed Jan 18 08:00:09 2023
X-Patchwork-Submitter: Nicholas Piggin
X-Patchwork-Id: 13105808
From: Nicholas Piggin <npiggin@gmail.com>
To: Andrew Morton
Cc: Nicholas Piggin, Andy Lutomirski, Linus Torvalds, linux-arch, linux-mm,
	linuxppc-dev@lists.ozlabs.org
Subject: [PATCH v6 3/5] lazy tlb: shoot lazies, non-refcounting lazy tlb mm
 reference handling scheme
Date: Wed, 18 Jan 2023 18:00:09 +1000
Message-Id: <20230118080011.2258375-4-npiggin@gmail.com>
X-Mailer: git-send-email 2.37.2
In-Reply-To: <20230118080011.2258375-1-npiggin@gmail.com>
References: <20230118080011.2258375-1-npiggin@gmail.com>
MIME-Version: 1.0
On big systems, the mm refcount can become highly contended when doing a lot
of context switching with threaded applications (particularly switching
between the idle thread and an application thread).

Abandoning lazy tlb slows switching down quite a bit in the important
user->idle->user cases, so instead implement a non-refcounted scheme that
causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down any
remaining lazy ones.

Shootdown IPI cost could be an issue, but it has not been observed to be a
serious problem with this scheme, because short-lived processes tend not to
migrate CPUs much, therefore they don't get much chance to leave lazy tlb mm
references on remote CPUs.
There are a lot of options to reduce them if necessary.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/Kconfig  | 15 ++++++++++++
 kernel/fork.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 80 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index b07d36f08fea..f7da34e4bc62 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -481,6 +481,21 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 # already).
 config MMU_LAZY_TLB_REFCOUNT
 	def_bool y
+	depends on !MMU_LAZY_TLB_SHOOTDOWN
+
+# This option allows MMU_LAZY_TLB_REFCOUNT=n. It ensures no CPUs are using an
+# mm as a lazy tlb beyond its last reference count, by shooting down these
+# users before the mm is deallocated. __mmdrop() first IPIs all CPUs that may
+# be using the mm as a lazy tlb, so that they may switch themselves to using
+# init_mm for their active mm. mm_cpumask(mm) is used to determine which CPUs
+# may be using mm as a lazy tlb mm.
+#
+# To implement this, an arch *must*:
+# - At the time of the final mmdrop of the mm, ensure mm_cpumask(mm) contains
+#   at least all possible CPUs in which the mm is lazy.
+# - It must meet the requirements for MMU_LAZY_TLB_REFCOUNT=n (see above).
+config MMU_LAZY_TLB_SHOOTDOWN
+	bool
+
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
 	bool

diff --git a/kernel/fork.c b/kernel/fork.c
index 9f7fe3541897..263660e78c2a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -780,6 +780,67 @@ static void check_mm(struct mm_struct *mm)
 #define allocate_mm()	(kmem_cache_alloc(mm_cachep, GFP_KERNEL))
 #define free_mm(mm)	(kmem_cache_free(mm_cachep, (mm)))
 
+static void do_check_lazy_tlb(void *arg)
+{
+	struct mm_struct *mm = arg;
+
+	WARN_ON_ONCE(current->active_mm == mm);
+}
+
+static void do_shoot_lazy_tlb(void *arg)
+{
+	struct mm_struct *mm = arg;
+
+	if (current->active_mm == mm) {
+		WARN_ON_ONCE(current->mm);
+		current->active_mm = &init_mm;
+		switch_mm(mm, &init_mm, current);
+	}
+}
+
+static void cleanup_lazy_tlbs(struct mm_struct *mm)
+{
+	if (!IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) {
+		/*
+		 * In this case, lazy tlb mms are refcounted and would not
+		 * reach __mmdrop until all CPUs have switched away and
+		 * mmdrop()ed.
+		 */
+		return;
+	}
+
+	/*
+	 * Lazy TLB shootdown does not refcount "lazy tlb mm" usage, rather it
+	 * requires lazy mm users to switch to another mm when the refcount
+	 * drops to zero, before the mm is freed. This requires IPIs here to
+	 * switch kernel threads to init_mm.
+	 *
+	 * archs that use IPIs to flush TLBs can piggy-back that lazy tlb mm
+	 * switch with the final userspace teardown TLB flush which leaves the
+	 * mm lazy on this CPU but no others, reducing the need for additional
+	 * IPIs here. There are cases where a final IPI is still required here,
+	 * such as the final mmdrop being performed on a different CPU than the
+	 * one exiting, or kernel threads using the mm when userspace exits.
+	 *
+	 * IPI overheads have not been found to be expensive, but they could
+	 * be reduced in a number of possible ways, for example (in roughly
+	 * increasing order of complexity):
+	 * - The last lazy reference created by exit_mm() could instead switch
+	 *   to init_mm, however it's probable this will run on the same CPU
+	 *   immediately afterwards, so this may not reduce IPIs much.
+	 * - A batch of mms requiring IPIs could be gathered and freed at once.
+	 * - CPUs store active_mm where it can be remotely checked without a
+	 *   lock, to filter out false-positives in the cpumask.
+	 * - After mm_users or mm_count reaches zero, switching away from the
+	 *   mm could clear mm_cpumask to reduce some IPIs, perhaps together
+	 *   with some batching or delaying of the final IPIs.
+	 * - A delayed freeing and RCU-like quiescing sequence based on mm
+	 *   switching to avoid IPIs completely.
+	 */
+	on_each_cpu_mask(mm_cpumask(mm), do_shoot_lazy_tlb, (void *)mm, 1);
+	if (IS_ENABLED(CONFIG_DEBUG_VM))
+		on_each_cpu(do_check_lazy_tlb, (void *)mm, 1);
+}
+
 /*
  * Called when the last reference to the mm
  * is dropped: either by a lazy thread or by
@@ -791,6 +852,10 @@ void __mmdrop(struct mm_struct *mm)
 	BUG_ON(mm == &init_mm);
 	WARN_ON_ONCE(mm == current->mm);
+
+	/* Ensure no CPUs are using this as their lazy tlb mm */
+	cleanup_lazy_tlbs(mm);
+
 	WARN_ON_ONCE(mm == current->active_mm);
 	mm_free_pgd(mm);
 	destroy_context(mm);
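[For context: MMU_LAZY_TLB_SHOOTDOWN is a bool with no prompt, so an
architecture opts in from its own Kconfig with a select once it meets the
requirements in the arch/Kconfig comment above. A sketch of what such a
selection might look like, assuming an arch that only guarantees the
mm_cpumask property on one MMU type; the exact option names and condition
here are illustrative, not taken from this patch:]

```
config EXAMPLE_ARCH
	def_bool y
	select MMU_LAZY_TLB_SHOOTDOWN	if EXAMPLE_MMU_TYPE
```

A conditional select like this keeps refcounting (MMU_LAZY_TLB_REFCOUNT=y)
on configurations where the arch cannot guarantee mm_cpumask(mm) covers all
possible lazy users.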