From patchwork Thu Jun 13 08:38:08 2024
X-Patchwork-Submitter: Qi Zheng
X-Patchwork-Id: 13696346
From: Qi Zheng
To: david@redhat.com, hughd@google.com, willy@infradead.org, mgorman@suse.de, muchun.song@linux.dev, akpm@linux-foundation.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Qi Zheng
Subject: [RFC PATCH 1/3] mm: pgtable: move pte_free_defer() out of CONFIG_TRANSPARENT_HUGEPAGE
Date: Thu, 13 Jun 2024 16:38:08 +0800
Message-Id: <7864fd8186075ae12fd227f13f4191f3d1bc6764.1718267194.git.zhengqi.arch@bytedance.com>

In order to reuse pte_free_defer() in the subsequent work of freeing empty
user PTE pages, move it out of the CONFIG_TRANSPARENT_HUGEPAGE ifdef range.
No functional change intended.

Signed-off-by: Qi Zheng
---
 arch/powerpc/mm/pgtable-frag.c | 2 --
 arch/s390/mm/pgalloc.c         | 2 --
 arch/sparc/mm/init_64.c        | 2 +-
 mm/pgtable-generic.c           | 2 +-
 4 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
index 8c31802f97e8..46d8f4bec85e 100644
--- a/arch/powerpc/mm/pgtable-frag.c
+++ b/arch/powerpc/mm/pgtable-frag.c
@@ -133,7 +133,6 @@ void pte_fragment_free(unsigned long *table, int kernel)
 	}
 }
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
 {
 	struct page *page;
@@ -142,4 +141,3 @@ void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
 	SetPageActive(page);
 	pte_fragment_free((unsigned long *)pgtable, 0);
 }
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index abb629d7e131..6415379bd3fd 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -204,7 +204,6 @@ void __tlb_remove_table(void *table)
 	pagetable_pte_dtor_free(ptdesc);
 }
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static void pte_free_now(struct rcu_head *head)
 {
 	struct ptdesc *ptdesc = container_of(head, struct ptdesc, pt_rcu_head);
@@ -223,7 +222,6 @@ void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
 	 */
 	WARN_ON_ONCE(mm_has_pgste(mm));
 }
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 /*
  * Base infrastructure required to generate basic asces, region, segment,
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 53d7cb5bbffe..20aaf123c9fc 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2939,7 +2939,6 @@ void pgtable_free(void *table, bool is_page)
 		kmem_cache_free(pgtable_cache, table);
 }
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static void pte_free_now(struct rcu_head *head)
 {
 	struct page *page;
@@ -2956,6 +2955,7 @@ void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
 	call_rcu(&page->rcu_head, pte_free_now);
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
 			  pmd_t *pmd)
 {
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index a78a4adf711a..197937495a0a 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -233,6 +233,7 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
 	return pmd;
 }
 #endif
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 /* arch define pte_free_defer in asm/pgalloc.h for its own implementation */
 #ifndef pte_free_defer
@@ -252,7 +253,6 @@ void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
 	call_rcu(&page->rcu_head, pte_free_now);
 }
 #endif /* pte_free_defer */
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #if defined(CONFIG_GUP_GET_PXX_LOW_HIGH) && \
 	(defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RCU))
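
With pte_free_defer() now available to !CONFIG_TRANSPARENT_HUGEPAGE builds,
code outside the THP paths can RCU-defer the freeing of a detached PTE page.
A minimal sketch of the intended call pattern, assuming the pmd entry has
already been cleared (the helper name here is hypothetical; patch 3 below
uses the same sequence):

	/*
	 * Hypothetical helper, not part of this series: hand back a PTE
	 * page whose pmd entry has already been cleared. The page must
	 * not be freed immediately, because lockless walkers (GUP-fast,
	 * RCU-side pte_offset_map()) may still be reading it, so the
	 * free is deferred past an RCU grace period.
	 */
	static void free_detached_pte_page(struct mm_struct *mm, pmd_t pmdval)
	{
		mm_dec_nr_ptes(mm);				/* page table accounting */
		pte_free_defer(mm, pmd_pgtable(pmdval));	/* freed after grace period */
	}
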
From patchwork Thu Jun 13 08:38:09 2024
X-Patchwork-Submitter: Qi Zheng
X-Patchwork-Id: 13696347
From: Qi Zheng
To: david@redhat.com, hughd@google.com, willy@infradead.org, mgorman@suse.de, muchun.song@linux.dev, akpm@linux-foundation.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Qi Zheng
Subject: [RFC PATCH 2/3] mm: pgtable: make pte_offset_map_nolock() return pmdval
Date: Thu, 13 Jun 2024 16:38:09 +0800

Make pte_offset_map_nolock() return the pmd value so that callers can recheck
*pmd once the lock is taken. This is a preparation for freeing empty PTE
pages; no functional changes are expected.

Signed-off-by: Qi Zheng
---
 Documentation/mm/split_page_table_lock.rst |  3 ++-
 arch/arm/mm/fault-armv.c                   |  2 +-
 arch/powerpc/mm/pgtable.c                  |  2 +-
 include/linux/mm.h                         |  4 ++--
 mm/filemap.c                               |  2 +-
 mm/khugepaged.c                            |  4 ++--
 mm/memory.c                                |  4 ++--
 mm/mremap.c                                |  2 +-
 mm/page_vma_mapped.c                       |  2 +-
 mm/pgtable-generic.c                       | 21 ++++++++++++---------
 mm/userfaultfd.c                           |  4 ++--
 mm/vmscan.c                                |  2 +-
 12 files changed, 28 insertions(+), 24 deletions(-)

diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
index e4f6972eb6c0..e6a47d57531c 100644
--- a/Documentation/mm/split_page_table_lock.rst
+++ b/Documentation/mm/split_page_table_lock.rst
@@ -18,7 +18,8 @@ There are helpers to lock/unlock a table and other accessor functions:
 	pointer to its PTE table lock, or returns NULL if no PTE table;
  - pte_offset_map_nolock()
 	maps PTE, returns pointer to PTE with pointer to its PTE table
-	lock (not taken), or returns NULL if no PTE table;
+	lock (not taken) and the value of its pmd entry, or returns NULL
+	if no PTE table;
  - pte_offset_map()
 	maps PTE, returns pointer to PTE, or returns NULL if no PTE table;
  - pte_unmap()
diff --git a/arch/arm/mm/fault-armv.c b/arch/arm/mm/fault-armv.c
index 2286c2ea60ec..3e4ed99b9330 100644
--- a/arch/arm/mm/fault-armv.c
+++ b/arch/arm/mm/fault-armv.c
@@ -117,7 +117,7 @@ static int adjust_pte(struct vm_area_struct *vma, unsigned long address,
 	 * must use the nested version. This also means we need to
 	 * open-code the spin-locking.
 	 */
-	pte = pte_offset_map_nolock(vma->vm_mm, pmd, address, &ptl);
+	pte = pte_offset_map_nolock(vma->vm_mm, pmd, NULL, address, &ptl);
 	if (!pte)
 		return 0;
 
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 9e7ba9c3851f..ab0250f1b226 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -350,7 +350,7 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
 	 */
 	if (pmd_none(*pmd))
 		return;
-	pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
+	pte = pte_offset_map_nolock(mm, pmd, NULL, addr, &ptl);
 	BUG_ON(!pte);
 	assert_spin_locked(ptl);
 	pte_unmap(pte);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 106bb0310352..d5550c3dc550 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2969,8 +2969,8 @@ static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
 	return pte;
 }
 
-pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
-			     unsigned long addr, spinlock_t **ptlp);
+pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdvalp,
+			     unsigned long addr, spinlock_t **ptlp);
 
 #define pte_unmap_unlock(pte, ptl)	do {		\
 	spin_unlock(ptl);				\
diff --git a/mm/filemap.c b/mm/filemap.c
index 37061aafd191..7eb2e3599966 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3231,7 +3231,7 @@ static vm_fault_t filemap_fault_recheck_pte_none(struct vm_fault *vmf)
 	if (!(vmf->flags & FAULT_FLAG_ORIG_PTE_VALID))
 		return 0;
 
-	ptep = pte_offset_map_nolock(vma->vm_mm, vmf->pmd, vmf->address,
+	ptep = pte_offset_map_nolock(vma->vm_mm, vmf->pmd, NULL, vmf->address,
 				     &vmf->ptl);
 	if (unlikely(!ptep))
 		return VM_FAULT_NOPAGE;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 774a97e6e2da..2a8703ee876c 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -992,7 +992,7 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
 		};
 
 		if (!pte++) {
-			pte = pte_offset_map_nolock(mm, pmd, address, &ptl);
+			pte = pte_offset_map_nolock(mm, pmd, NULL, address, &ptl);
 			if (!pte) {
 				mmap_read_unlock(mm);
 				result = SCAN_PMD_NULL;
@@ -1581,7 +1581,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	if (userfaultfd_armed(vma) && !(vma->vm_flags & VM_SHARED))
 		pml = pmd_lock(mm, pmd);
 
-	start_pte = pte_offset_map_nolock(mm, pmd, haddr, &ptl);
+	start_pte = pte_offset_map_nolock(mm, pmd, NULL, haddr, &ptl);
 	if (!start_pte)		/* mmap_lock + page lock should prevent this */
 		goto abort;
 	if (!pml)
diff --git a/mm/memory.c b/mm/memory.c
index 1bd2ffb76ec2..694c0989a1d8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1108,7 +1108,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		ret = -ENOMEM;
 		goto out;
 	}
-	src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
+	src_pte = pte_offset_map_nolock(src_mm, src_pmd, NULL, addr, &src_ptl);
 	if (!src_pte) {
 		pte_unmap_unlock(dst_pte, dst_ptl);
 		/* ret == 0 */
@@ -5486,7 +5486,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		 * it into a huge pmd: just retry later if so.
 		 */
 		vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
-						 vmf->address, &vmf->ptl);
+						 NULL, vmf->address, &vmf->ptl);
 		if (unlikely(!vmf->pte))
 			return 0;
 		vmf->orig_pte = ptep_get_lockless(vmf->pte);
diff --git a/mm/mremap.c b/mm/mremap.c
index e7ae140fc640..f672d0218a6f 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -175,7 +175,7 @@ static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 		err = -EAGAIN;
 		goto out;
 	}
-	new_pte = pte_offset_map_nolock(mm, new_pmd, new_addr, &new_ptl);
+	new_pte = pte_offset_map_nolock(mm, new_pmd, NULL, new_addr, &new_ptl);
 	if (!new_pte) {
 		pte_unmap_unlock(old_pte, old_ptl);
 		err = -EAGAIN;
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index ae5cc42aa208..507701b7bcc1 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -33,7 +33,7 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
 		 * Though, in most cases, page lock already protects this.
 		 */
 		pvmw->pte = pte_offset_map_nolock(pvmw->vma->vm_mm, pvmw->pmd,
-						  pvmw->address, ptlp);
+						  NULL, pvmw->address, ptlp);
 		if (!pvmw->pte)
 			return false;
 
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 197937495a0a..b8b28715cb4f 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -305,7 +305,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
 	return NULL;
 }
 
-pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
+pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdvalp,
 			     unsigned long addr, spinlock_t **ptlp)
 {
 	pmd_t pmdval;
@@ -314,6 +314,8 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
 	pte = __pte_offset_map(pmd, addr, &pmdval);
 	if (likely(pte))
 		*ptlp = pte_lockptr(mm, &pmdval);
+	if (pmdvalp)
+		*pmdvalp = pmdval;
 	return pte;
 }
 
@@ -347,14 +349,15 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
  * and disconnected table. Until pte_unmap(pte) unmaps and rcu_read_unlock()s
  * afterwards.
  *
- * pte_offset_map_nolock(mm, pmd, addr, ptlp), above, is like pte_offset_map();
- * but when successful, it also outputs a pointer to the spinlock in ptlp - as
- * pte_offset_map_lock() does, but in this case without locking it. This helps
- * the caller to avoid a later pte_lockptr(mm, *pmd), which might by that time
- * act on a changed *pmd: pte_offset_map_nolock() provides the correct spinlock
- * pointer for the page table that it returns. In principle, the caller should
- * recheck *pmd once the lock is taken; in practice, no callsite needs that -
- * either the mmap_lock for write, or pte_same() check on contents, is enough.
+ * pte_offset_map_nolock(mm, pmd, pmdvalp, addr, ptlp), above, is like
+ * pte_offset_map(); but when successful, it also outputs a pointer to the
+ * spinlock in ptlp - as pte_offset_map_lock() does, but in this case without
+ * locking it. This helps the caller to avoid a later pte_lockptr(mm, *pmd),
+ * which might by that time act on a changed *pmd: pte_offset_map_nolock()
+ * provides the correct spinlock pointer for the page table that it returns.
+ * In principle, the caller should recheck *pmd once the lock is taken; but in
+ * most cases, either the mmap_lock for write, or a pte_same() check on
+ * contents, is enough.
  *
  * Note that free_pgtables(), used after unmapping detached vmas, or when
  * exiting the whole mm, does not take page table lock before freeing a page
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 5e7f2801698a..9c77271d499c 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1143,7 +1143,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 				src_addr, src_addr + PAGE_SIZE);
 	mmu_notifier_invalidate_range_start(&range);
 retry:
-	dst_pte = pte_offset_map_nolock(mm, dst_pmd, dst_addr, &dst_ptl);
+	dst_pte = pte_offset_map_nolock(mm, dst_pmd, NULL, dst_addr, &dst_ptl);
 
 	/* Retry if a huge pmd materialized from under us */
 	if (unlikely(!dst_pte)) {
@@ -1151,7 +1151,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 		goto out;
 	}
 
-	src_pte = pte_offset_map_nolock(mm, src_pmd, src_addr, &src_ptl);
+	src_pte = pte_offset_map_nolock(mm, src_pmd, NULL, src_addr, &src_ptl);
 
 	/*
 	 * We held the mmap_lock for reading so MADV_DONTNEED
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c0429fd6c573..56727caa907b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3374,7 +3374,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 	DEFINE_MAX_SEQ(walk->lruvec);
 	int old_gen, new_gen = lru_gen_from_seq(max_seq);
 
-	pte = pte_offset_map_nolock(args->mm, pmd, start & PMD_MASK, &ptl);
+	pte = pte_offset_map_nolock(args->mm, pmd, NULL, start & PMD_MASK, &ptl);
 	if (!pte)
 		return false;
 	if (!spin_trylock(ptl)) {
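
Patch 3 relies on this new out-parameter. A caller that takes the PTE lock
without holding mmap_lock for write can capture the pmd value at map time and
verify it after locking. A minimal sketch of that pattern (the function is
hypothetical, condensed from freept_pmd_entry() in patch 3):

	/*
	 * Sketch only, not part of the series: take the PTE lock, then
	 * verify that the pmd entry still points at the same PTE page
	 * before trusting the mapped PTEs.
	 */
	static bool lock_pte_page_checked(struct mm_struct *mm, pmd_t *pmd,
					  unsigned long addr)
	{
		spinlock_t *ptl;
		pmd_t pmdval;
		pte_t *pte;

		pte = pte_offset_map_nolock(mm, pmd, &pmdval, addr, &ptl);
		if (!pte)
			return false;		/* no PTE page here */
		spin_lock(ptl);
		if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
			/* PTE page was freed or replaced after we mapped it */
			pte_unmap_unlock(pte, ptl);
			return false;
		}
		/* ... safe to operate on the PTE page here ... */
		pte_unmap_unlock(pte, ptl);
		return true;
	}
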
From patchwork Thu Jun 13 08:38:10 2024
X-Patchwork-Submitter: Qi Zheng
X-Patchwork-Id: 13696348
From: Qi Zheng
To: david@redhat.com, hughd@google.com, willy@infradead.org, mgorman@suse.de, muchun.song@linux.dev, akpm@linux-foundation.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Qi Zheng
Subject: [RFC PATCH 3/3] mm: free empty user PTE pages

In pursuit of high performance, applications mostly use high-performance
user-mode memory allocators such as jemalloc or tcmalloc. These allocators
release physical memory with madvise(MADV_DONTNEED or MADV_FREE), but neither
MADV_DONTNEED nor MADV_FREE releases page table memory, which can lead to
huge page table memory usage. The following is a memory usage snapshot of one
process, which actually happened on our server:

	VIRT:  55t
	RES:   590g
	VmPTE: 110g

In this case, most of the page table entries are empty. For a PTE page in
which all entries are empty, we can actually free it back to the system for
others to use.

Similar to numa_balancing, this commit adds a task_work to scan the address
space of a user process when it returns to user space. If a suitable empty
PTE page is found, it is freed.
The following test loop shows the effect of the optimization:

	mmap 50G
	while (1) {
		for (; i < 1024 * 25; i++) {
			touch 2M memory
			madvise MADV_DONTNEED 2M
		}
	}

As we can see, the VmPTE usage is reduced:

		 before		after
	VIRT	 50.0 GB	50.0 GB
	RES	  3.1 MB	 3.1 MB
	VmPTE	102640 kB	  756 kB (even less)
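
As a concrete reference, the pseudocode above could be realized roughly as
follows; the 50G mapping and the 2M touch/MADV_DONTNEED step come from the
description, while the mmap flags and the touch method are assumptions:

	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>

	#define STEP	(2UL << 20)		/* 2M per iteration */
	#define NSTEPS	(1024UL * 25)		/* 25600 * 2M = 50G */

	int main(void)
	{
		char *buf = mmap(NULL, STEP * NSTEPS, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (buf == MAP_FAILED)
			return EXIT_FAILURE;

		for (;;) {
			for (unsigned long i = 0; i < NSTEPS; i++) {
				memset(buf + i * STEP, 1, STEP);	/* touch 2M */
				/* frees pages but, without this series, not PTE pages */
				madvise(buf + i * STEP, STEP, MADV_DONTNEED);
			}
		}
	}
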
Signed-off-by: Qi Zheng
---
 include/linux/mm_types.h |   4 +
 include/linux/pgtable.h  |  14 +++
 include/linux/sched.h    |   1 +
 kernel/sched/core.c      |   1 +
 kernel/sched/fair.c      |   2 +
 mm/Makefile              |   2 +-
 mm/freept.c              | 180 +++++++++++++++++++++++++++++++++++++++
 mm/khugepaged.c          |  18 +++-
 8 files changed, 220 insertions(+), 2 deletions(-)
 create mode 100644 mm/freept.c

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index ef09c4eef6d3..bbc697fa4a83 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -839,6 +839,10 @@ struct mm_struct {
 #endif
 #ifdef CONFIG_MMU
 		atomic_long_t pgtables_bytes;	/* size of all page tables */
+		/* Next mm_pgtable scan (in jiffies) */
+		unsigned long mm_pgtable_next_scan;
+		/* Restart point for scanning and freeing empty user PTE pages */
+		unsigned long mm_pgtable_scan_offset;
 #endif
 		int map_count;			/* number of VMAs */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index fbff20070ca3..4d1cfaa92422 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1589,6 +1589,20 @@ static inline unsigned long my_zero_pfn(unsigned long addr)
 }
 #endif /* CONFIG_MMU */
 
+#ifdef CONFIG_MMU
+#define MM_PGTABLE_SCAN_DELAY	100	/* 100ms */
+#define MM_PGTABLE_SCAN_SIZE	256	/* 256MB */
+void init_mm_pgtable_work(struct task_struct *p);
+void task_tick_mm_pgtable(struct task_struct *curr);
+#else
+static inline void init_mm_pgtable_work(struct task_struct *p)
+{
+}
+static inline void task_tick_mm_pgtable(struct task_struct *curr)
+{
+}
+#endif
+
 #ifdef CONFIG_MMU
 
 #ifndef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 73c874e051f7..5c0f3d96d608 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1485,6 +1485,7 @@ struct task_struct {
 #ifdef CONFIG_MMU
 	struct task_struct		*oom_reaper_list;
 	struct timer_list		oom_reaper_timer;
+	struct callback_head		pgtable_work;
 #endif
 #ifdef CONFIG_VMAP_STACK
 	struct vm_struct		*stack_vm_area;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c663075c86fb..d5f6df6f5c32 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4359,6 +4359,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->migration_pending = NULL;
 #endif
 	init_sched_mm_cid(p);
+	init_mm_pgtable_work(p);
 }
 
 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41b58387023d..bbc7cbf22eaa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12696,6 +12696,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);
 
+	task_tick_mm_pgtable(curr);
+
 	update_misfit_status(curr, rq);
 	check_update_overutilized_status(task_rq(curr));
diff --git a/mm/Makefile b/mm/Makefile
index 8fb85acda1b1..af1a324aa65e 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -54,7 +54,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
 			   mm_init.o percpu.o slab_common.o \
 			   compaction.o show_mem.o shmem_quota.o\
 			   interval_tree.o list_lru.o workingset.o \
-			   debug.o gup.o mmap_lock.o $(mmu-y)
+			   debug.o gup.o mmap_lock.o freept.o $(mmu-y)
 
 # Give 'page_alloc' its own module-parameter namespace
 page-alloc-y := page_alloc.o
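
The scheduler hooks above are deliberately thin: task_tick_fair() runs in
interrupt context with the rq lock held, so it must not take mmap_lock or
sleep; it only queues a task_work callback that runs when the task returns to
user space. A minimal sketch of this tick-to-task_work deferral pattern
(names mirror the patch; treat it as illustrative, not the exact
implementation, which follows in mm/freept.c below):

	#include <linux/sched.h>
	#include <linux/task_work.h>

	/* Runs in process context on return to user space: may sleep,
	 * may take mmap_lock, may walk page tables. */
	static void my_deferred_scan(struct callback_head *work)
	{
		work->next = work;	/* mark "not queued" so the tick can re-arm */
		/* ... heavyweight work, e.g. scanning for empty PTE pages ... */
	}

	/* Runs from the scheduler tick: only decides whether to queue work. */
	static void my_tick_hook(struct task_struct *curr)
	{
		struct callback_head *work = &curr->pgtable_work;

		if (work->next != work)		/* already queued */
			return;
		init_task_work(work, my_deferred_scan);
		task_work_add(curr, work, TWA_RESUME);
	}
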
diff --git a/mm/freept.c b/mm/freept.c
new file mode 100644
index 000000000000..ed1ea5535e03
--- /dev/null
+++ b/mm/freept.c
@@ -0,0 +1,180 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/mm.h>
+#include <linux/hugetlb.h>
+#include <linux/pagewalk.h>
+#include <linux/task_work.h>
+#include <linux/mmu_notifier.h>
+
+void task_tick_mm_pgtable(struct task_struct *curr)
+{
+	struct callback_head *work = &curr->pgtable_work;
+	unsigned long now = jiffies;
+
+	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) ||
+	    work->next != work)
+		return;
+
+	if (time_before(now, READ_ONCE(curr->mm->mm_pgtable_next_scan)))
+		return;
+
+	task_work_add(curr, work, TWA_RESUME);
+}
+
+/*
+ * Locking:
+ * - already held the mmap read lock to traverse the vma tree and pgtable
+ * - use pmd lock for clearing pmd entry
+ * - use pte lock for checking empty PTE page, and release it after clearing
+ *   pmd entry, then we can capture the changed pmd in pte_offset_map_lock()
+ *   etc after holding this pte lock. Thanks to this, we don't need to hold
+ *   the rmap-related locks.
+ * - users of pte_offset_map_lock() etc all expect the PTE page to be stable
+ *   by using rcu lock, so use pte_free_defer() to free PTE pages.
+ */
+static int freept_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long next,
+			    struct mm_walk *walk)
+{
+	struct mmu_notifier_range range;
+	struct mm_struct *mm = walk->mm;
+	pte_t *start_pte, *pte;
+	pmd_t pmdval;
+	spinlock_t *pml = NULL, *ptl;
+	unsigned long haddr = addr;
+	int i;
+
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
+				haddr, haddr + PMD_SIZE);
+	mmu_notifier_invalidate_range_start(&range);
+
+	start_pte = pte_offset_map_nolock(mm, pmd, &pmdval, haddr, &ptl);
+	if (!start_pte)
+		goto out;
+
+	pml = pmd_lock(mm, pmd);
+	if (ptl != pml)
+		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+
+	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd))))
+		goto out_ptl;
+
+	/* Check if it is empty PTE page */
+	for (i = 0, addr = haddr, pte = start_pte;
+	     i < PTRS_PER_PTE; i++, addr += PAGE_SIZE, pte++) {
+		if (!pte_none(ptep_get(pte)))
+			goto out_ptl;
+	}
+	pte_unmap(start_pte);
+
+	pmd_clear(pmd);
+	flush_tlb_range(walk->vma, haddr, haddr + PMD_SIZE);
+	pmdp_get_lockless_sync();
+	if (ptl != pml)
+		spin_unlock(ptl);
+	spin_unlock(pml);
+
+	mmu_notifier_invalidate_range_end(&range);
+
+	mm_dec_nr_ptes(mm);
+	pte_free_defer(mm, pmd_pgtable(pmdval));
+
+	return 0;
+
+out_ptl:
+	pte_unmap_unlock(start_pte, ptl);
+	if (pml != ptl)
+		spin_unlock(pml);
+out:
+	mmu_notifier_invalidate_range_end(&range);
+
+	return 0;
+}
+
+static const struct mm_walk_ops mm_pgtable_walk_ops = {
+	.pmd_entry	= freept_pmd_entry,
+	.walk_lock	= PGWALK_RDLOCK,
+};
+
+static void task_mm_pgtable_work(struct callback_head *work)
+{
+	unsigned long now = jiffies, old_scan, next_scan;
+	struct task_struct *p = current;
+	struct mm_struct *mm = p->mm;
+	struct vm_area_struct *vma;
+	unsigned long start, end;
+	struct vma_iterator vmi;
+
+	work->next = work;	/* Prevent double-add */
+	if (p->flags & PF_EXITING)
+		return;
+
+	if (!mm->mm_pgtable_next_scan) {
+		mm->mm_pgtable_next_scan = now + msecs_to_jiffies(MM_PGTABLE_SCAN_DELAY);
+		return;
+	}
+
+	old_scan = mm->mm_pgtable_next_scan;
+	if (time_before(now, old_scan))
+		return;
+
+	next_scan = now + msecs_to_jiffies(MM_PGTABLE_SCAN_DELAY);
+	if (!try_cmpxchg(&mm->mm_pgtable_next_scan, &old_scan, next_scan))
+		return;
+
+	if (!mmap_read_trylock(mm))
+		return;
+
+	start = mm->mm_pgtable_scan_offset;
+	vma_iter_init(&vmi, mm, start);
+	vma = vma_next(&vmi);
+	if (!vma) {
+		mm->mm_pgtable_scan_offset = 0;
+		start = 0;
+		vma_iter_set(&vmi, start);
+		vma = vma_next(&vmi);
+	}
+
+	do {
+		/* Skip hugetlb case */
+		if (is_vm_hugetlb_page(vma))
+			continue;
+
+		/* Leave this to the THP path to handle */
+		if (vma->vm_flags & VM_HUGEPAGE)
+			continue;
+
+		/* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
+		if (userfaultfd_wp(vma))
+			continue;
+
+		/* Only consider PTE pages that do not cross vmas */
+		start = ALIGN(vma->vm_start, PMD_SIZE);
+		end = ALIGN_DOWN(vma->vm_end, PMD_SIZE);
+		if (end - start < PMD_SIZE)
+			continue;
+
+		walk_page_range_vma(vma, start, end, &mm_pgtable_walk_ops, NULL);
+
+		if (end - mm->mm_pgtable_scan_offset >= (MM_PGTABLE_SCAN_SIZE << 20))
+			goto out;
+
+		cond_resched();
+	} for_each_vma(vmi, vma);
+
+out:
+	mm->mm_pgtable_scan_offset = vma ? end : 0;
+	mmap_read_unlock(mm);
+}
+
+void init_mm_pgtable_work(struct task_struct *p)
+{
+	struct mm_struct *mm = p->mm;
+	int mm_users = 0;
+
+	if (mm) {
+		mm_users = atomic_read(&mm->mm_users);
+		if (mm_users == 1)
+			mm->mm_pgtable_next_scan = jiffies + msecs_to_jiffies(MM_PGTABLE_SCAN_DELAY);
+	}
+	p->pgtable_work.next = &p->pgtable_work;	/* Protect against double add */
+	init_task_work(&p->pgtable_work, task_mm_pgtable_work);
+}
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2a8703ee876c..a2b96f4ba737 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1581,7 +1581,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	if (userfaultfd_armed(vma) && !(vma->vm_flags & VM_SHARED))
 		pml = pmd_lock(mm, pmd);
 
-	start_pte = pte_offset_map_nolock(mm, pmd, NULL, haddr, &ptl);
+	start_pte = pte_offset_map_nolock(mm, pmd, &pgt_pmd, haddr, &ptl);
 	if (!start_pte)		/* mmap_lock + page lock should prevent this */
 		goto abort;
 	if (!pml)
@@ -1589,6 +1589,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	else if (ptl != pml)
 		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
 
+	/* pmd entry may be changed by others */
+	if (unlikely(!pml && !pmd_same(pgt_pmd, pmdp_get_lockless(pmd))))
+		goto abort;
+
 	/* step 2: clear page table and adjust rmap */
 	for (i = 0, addr = haddr, pte = start_pte;
 	     i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
@@ -1636,6 +1640,11 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		pml = pmd_lock(mm, pmd);
 		if (ptl != pml)
 			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+
+		if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) {
+			spin_unlock(ptl);
+			goto unlock;
+		}
 	}
 	pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
 	pmdp_get_lockless_sync();
@@ -1663,6 +1672,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	}
 	if (start_pte)
 		pte_unmap_unlock(start_pte, ptl);
+unlock:
 	if (pml && pml != ptl)
 		spin_unlock(pml);
 	if (notified)
@@ -1722,6 +1732,12 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		mmu_notifier_invalidate_range_start(&range);
 
 		pml = pmd_lock(mm, pmd);
+		/* check if the pmd is still valid */
+		if (check_pmd_still_valid(mm, addr, pmd) != SCAN_SUCCEED) {
+			spin_unlock(pml);
+			mmu_notifier_invalidate_range_end(&range);
+			continue;
+		}
 		ptl = pte_lockptr(mm, pmd);
 		if (ptl != pml)
 			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);