From patchwork Thu Aug 19 03:18:50 2021
X-Patchwork-Submitter: Qi Zheng
X-Patchwork-Id: 12446047
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, Qi Zheng
Subject: [PATCH v2 1/9] mm: introduce pmd_install() helper
Date: Thu, 19 Aug 2021 11:18:50 +0800
Message-Id: <20210819031858.98043-2-zhengqi.arch@bytedance.com>
In-Reply-To: <20210819031858.98043-1-zhengqi.arch@bytedance.com>
References: <20210819031858.98043-1-zhengqi.arch@bytedance.com>

Currently we have the same few lines repeated three times in the code.
Deduplicate them by introducing a new pmd_install() helper.
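For illustration only (not part of this patch): after the change, each of the
three call sites boils down to the pattern sketched here, where pmd_install()
consumes the preallocated PTE page on success and leaves it in *pte for the
caller to free if another thread populated the pmd first.

    /* Sketch of the deduplicated call pattern (names as in the diff below). */
    if (pmd_none(*vmf->pmd))
            pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);
    /* vmf->prealloc_pte is NULL here if the page table was installed;
     * otherwise the caller frees the unused page (see __pte_alloc()). */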
Signed-off-by: Qi Zheng Reviewed-by: David Hildenbrand --- include/linux/mm.h | 1 + mm/filemap.c | 11 ++--------- mm/memory.c | 34 ++++++++++++++++------------------ 3 files changed, 19 insertions(+), 27 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index ce8fc0fd6d6e..57e48217bd71 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2471,6 +2471,7 @@ static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud) return ptl; } +extern void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte); extern void __init pagecache_init(void); extern void __init free_area_init_memoryless_node(int nid); extern void free_initmem(void); diff --git a/mm/filemap.c b/mm/filemap.c index 53913fced7ae..9f773059c6dc 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -3210,15 +3210,8 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page) } } - if (pmd_none(*vmf->pmd)) { - vmf->ptl = pmd_lock(mm, vmf->pmd); - if (likely(pmd_none(*vmf->pmd))) { - mm_inc_nr_ptes(mm); - pmd_populate(mm, vmf->pmd, vmf->prealloc_pte); - vmf->prealloc_pte = NULL; - } - spin_unlock(vmf->ptl); - } + if (pmd_none(*vmf->pmd)) + pmd_install(mm, vmf->pmd, &vmf->prealloc_pte); /* See comment in handle_pte_fault() */ if (pmd_devmap_trans_unstable(vmf->pmd)) { diff --git a/mm/memory.c b/mm/memory.c index 39e7a1495c3c..ef7b1762e996 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -433,9 +433,20 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma, } } +void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte) +{ + spinlock_t *ptl = pmd_lock(mm, pmd); + + if (likely(pmd_none(*pmd))) { /* Has another populated it ? */ + mm_inc_nr_ptes(mm); + pmd_populate(mm, pmd, *pte); + *pte = NULL; + } + spin_unlock(ptl); +} + int __pte_alloc(struct mm_struct *mm, pmd_t *pmd) { - spinlock_t *ptl; pgtable_t new = pte_alloc_one(mm); if (!new) return -ENOMEM; @@ -455,13 +466,7 @@ int __pte_alloc(struct mm_struct *mm, pmd_t *pmd) */ smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */ - ptl = pmd_lock(mm, pmd); - if (likely(pmd_none(*pmd))) { /* Has another populated it ? 
*/ - mm_inc_nr_ptes(mm); - pmd_populate(mm, pmd, new); - new = NULL; - } - spin_unlock(ptl); + pmd_install(mm, pmd, &new); if (new) pte_free(mm, new); return 0; @@ -4027,17 +4032,10 @@ vm_fault_t finish_fault(struct vm_fault *vmf) return ret; } - if (vmf->prealloc_pte) { - vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); - if (likely(pmd_none(*vmf->pmd))) { - mm_inc_nr_ptes(vma->vm_mm); - pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte); - vmf->prealloc_pte = NULL; - } - spin_unlock(vmf->ptl); - } else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) { + if (vmf->prealloc_pte) + pmd_install(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte); + else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) return VM_FAULT_OOM; - } } /* See comment in handle_pte_fault() */
From patchwork Thu Aug 19 03:18:51 2021
X-Patchwork-Submitter: Qi Zheng
X-Patchwork-Id: 12446049
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, Qi Zheng
Subject: [PATCH v2 2/9] mm: remove redundant smp_wmb()
Date: Thu, 19 Aug 2021 11:18:51 +0800
Message-Id: <20210819031858.98043-3-zhengqi.arch@bytedance.com>
In-Reply-To: <20210819031858.98043-1-zhengqi.arch@bytedance.com>
References: <20210819031858.98043-1-zhengqi.arch@bytedance.com>

The smp_wmb() in __pte_alloc() is used to ensure that all pte setup is visible
before the pte is made visible to other CPUs by being put into page tables. We
only need this when the pte is actually populated, so move it to pmd_install().
__pte_alloc_kernel(), __p4d_alloc(), __pud_alloc() and __pmd_alloc() are
similar cases.

We can also defer the smp_wmb() to the place where the pmd entry is really
populated by the preallocated pte. There are two kinds of users of the
preallocated pte: one is filemap & finish_fault(), the other is THP. The former
does not need another smp_wmb() because it has already been done by
pmd_install(). Fortunately, the latter does not need one either, because there
is already an smp_wmb() before populating the new pte when THP uses a
preallocated pte to split a huge pmd.
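As a rough illustration of the ordering argument above (sketch only, not part
of the patch; pmdval and ptep are illustrative locals), the write side and the
lockless walk side pair up like this:

    /* Publisher (pmd_install()): make the PTE page initialization visible
     * before the pmd entry that points to it is made visible. */
    smp_wmb();
    pmd_populate(mm, pmd, new);

    /* Lockless walker: the pte load is data-dependent on the pmd load,
     * so most CPUs already order the two; alpha relies on the smp_rmb()
     * in its page table accessors. */
    pmdval = READ_ONCE(*pmd);
    ptep = pte_offset_map(&pmdval, addr);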
Signed-off-by: Qi Zheng Reviewed-by: Muchun Song --- mm/memory.c | 47 ++++++++++++++++++++--------------------------- mm/sparse-vmemmap.c | 2 +- 2 files changed, 21 insertions(+), 28 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index ef7b1762e996..9c7534187454 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -439,6 +439,20 @@ void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte) if (likely(pmd_none(*pmd))) { /* Has another populated it ? */ mm_inc_nr_ptes(mm); + /* + * Ensure all pte setup (eg. pte page lock and page clearing) are + * visible before the pte is made visible to other CPUs by being + * put into page tables. + * + * The other side of the story is the pointer chasing in the page + * table walking code (when walking the page table without locking; + * ie. most of the time). Fortunately, these data accesses consist + * of a chain of data-dependent loads, meaning most CPUs (alpha + * being the notable exception) will already guarantee loads are + * seen in-order. See the alpha page table accessors for the + * smp_rmb() barriers in page table walking code. + */ + smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */ pmd_populate(mm, pmd, *pte); *pte = NULL; } @@ -451,21 +465,6 @@ int __pte_alloc(struct mm_struct *mm, pmd_t *pmd) if (!new) return -ENOMEM; - /* - * Ensure all pte setup (eg. pte page lock and page clearing) are - * visible before the pte is made visible to other CPUs by being - * put into page tables. - * - * The other side of the story is the pointer chasing in the page - * table walking code (when walking the page table without locking; - * ie. most of the time). Fortunately, these data accesses consist - * of a chain of data-dependent loads, meaning most CPUs (alpha - * being the notable exception) will already guarantee loads are - * seen in-order. See the alpha page table accessors for the - * smp_rmb() barriers in page table walking code. - */ - smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */ - pmd_install(mm, pmd, &new); if (new) pte_free(mm, new); @@ -478,10 +477,9 @@ int __pte_alloc_kernel(pmd_t *pmd) if (!new) return -ENOMEM; - smp_wmb(); /* See comment in __pte_alloc */ - spin_lock(&init_mm.page_table_lock); if (likely(pmd_none(*pmd))) { /* Has another populated it ? 
*/ + smp_wmb(); /* See comment in pmd_install() */ pmd_populate_kernel(&init_mm, pmd, new); new = NULL; } @@ -3857,7 +3855,6 @@ static vm_fault_t __do_fault(struct vm_fault *vmf) vmf->prealloc_pte = pte_alloc_one(vma->vm_mm); if (!vmf->prealloc_pte) return VM_FAULT_OOM; - smp_wmb(); /* See comment in __pte_alloc() */ } ret = vma->vm_ops->fault(vmf); @@ -3919,7 +3916,6 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) vmf->prealloc_pte = pte_alloc_one(vma->vm_mm); if (!vmf->prealloc_pte) return VM_FAULT_OOM; - smp_wmb(); /* See comment in __pte_alloc() */ } vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); @@ -4144,7 +4140,6 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf) vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm); if (!vmf->prealloc_pte) return VM_FAULT_OOM; - smp_wmb(); /* See comment in __pte_alloc() */ } return vmf->vma->vm_ops->map_pages(vmf, start_pgoff, end_pgoff); @@ -4819,13 +4814,13 @@ int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address) if (!new) return -ENOMEM; - smp_wmb(); /* See comment in __pte_alloc */ - spin_lock(&mm->page_table_lock); if (pgd_present(*pgd)) /* Another has populated it */ p4d_free(mm, new); - else + else { + smp_wmb(); /* See comment in pmd_install() */ pgd_populate(mm, pgd, new); + } spin_unlock(&mm->page_table_lock); return 0; } @@ -4842,11 +4837,10 @@ int __pud_alloc(struct mm_struct *mm, p4d_t *p4d, unsigned long address) if (!new) return -ENOMEM; - smp_wmb(); /* See comment in __pte_alloc */ - spin_lock(&mm->page_table_lock); if (!p4d_present(*p4d)) { mm_inc_nr_puds(mm); + smp_wmb(); /* See comment in pmd_install() */ p4d_populate(mm, p4d, new); } else /* Another has populated it */ pud_free(mm, new); @@ -4867,11 +4861,10 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address) if (!new) return -ENOMEM; - smp_wmb(); /* See comment in __pte_alloc */ - ptl = pud_lock(mm, pud); if (!pud_present(*pud)) { mm_inc_nr_pmds(mm); + smp_wmb(); /* See comment in pmd_install() */ pud_populate(mm, pud, new); } else /* Another has populated it */ pmd_free(mm, new); diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c index bdce883f9286..db6df27c852a 100644 --- a/mm/sparse-vmemmap.c +++ b/mm/sparse-vmemmap.c @@ -76,7 +76,7 @@ static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start, set_pte_at(&init_mm, addr, pte, entry); } - /* Make pte visible before pmd. See comment in __pte_alloc(). */ + /* Make pte visible before pmd. See comment in pmd_install(). 
*/ smp_wmb(); pmd_populate_kernel(&init_mm, pmd, pgtable);
From patchwork Thu Aug 19 03:18:52 2021
X-Patchwork-Submitter: Qi Zheng
X-Patchwork-Id: 12446051
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, Qi Zheng
Subject: [PATCH v2 3/9] mm: rework the parameter of lock_page_or_retry()
Date: Thu, 19 Aug 2021 11:18:52 +0800
Message-Id: <20210819031858.98043-4-zhengqi.arch@bytedance.com>
In-Reply-To: <20210819031858.98043-1-zhengqi.arch@bytedance.com>
References: <20210819031858.98043-1-zhengqi.arch@bytedance.com>

We need the vmf in lock_page_or_retry() in a subsequent patch, so pass it in
directly.

Signed-off-by: Qi Zheng --- include/linux/pagemap.h | 8 +++----- mm/filemap.c | 6 ++++-- mm/memory.c | 4 ++-- 3 files changed, 9 insertions(+), 9 deletions(-) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index c36292181a21..dbb4dd6bb608 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -710,8 +710,7 @@ static inline bool wake_page_match(struct wait_page_queue *wait_page, void __folio_lock(struct folio *folio); int __folio_lock_killable(struct folio *folio); -bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm, - unsigned int flags); +bool __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf); void unlock_page(struct page *page); void folio_unlock(struct folio *folio); @@ -773,14 +772,13 @@ static inline int lock_page_killable(struct page *page) * Return value and mmap_lock implications depend on flags; see * __folio_lock_or_retry(). 
*/ -static inline bool lock_page_or_retry(struct page *page, struct mm_struct *mm, - unsigned int flags) +static inline bool lock_page_or_retry(struct page *page, struct vm_fault *vmf) { struct folio *folio; might_sleep(); folio = page_folio(page); - return folio_trylock(folio) || __folio_lock_or_retry(folio, mm, flags); + return folio_trylock(folio) || __folio_lock_or_retry(folio, vmf); } /* diff --git a/mm/filemap.c b/mm/filemap.c index 9f773059c6dc..eeac0e119cf5 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1688,9 +1688,11 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait) * If neither ALLOW_RETRY nor KILLABLE are set, will always return 1 * with the folio locked and the mmap_lock unperturbed. */ -bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm, - unsigned int flags) +bool __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf) { + unsigned int flags = vmf->flags; + struct mm_struct *mm = vmf->vma->vm_mm; + if (fault_flag_allow_retry_first(flags)) { /* * CAUTION! In this case, mmap_lock is not released diff --git a/mm/memory.c b/mm/memory.c index 9c7534187454..d2aaa85b840c 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3451,7 +3451,7 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf) struct vm_area_struct *vma = vmf->vma; struct mmu_notifier_range range; - if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) + if (!lock_page_or_retry(page, vmf)) return VM_FAULT_RETRY; mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma, vma->vm_mm, vmf->address & PAGE_MASK, @@ -3584,7 +3584,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) goto out_release; } - locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags); + locked = lock_page_or_retry(page, vmf); delayacct_clear_flag(current, DELAYACCT_PF_SWAPIN); if (!locked) { From patchwork Thu Aug 19 03:18:53 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qi Zheng X-Patchwork-Id: 12446053 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.6 required=3.0 tests=BAYES_00,DKIM_INVALID, DKIM_SIGNED,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0336CC4338F for ; Thu, 19 Aug 2021 03:20:15 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 64CC560240 for ; Thu, 19 Aug 2021 03:20:14 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 64CC560240 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=bytedance.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org Received: by kanga.kvack.org (Postfix) id 1475A8D0006; Wed, 18 Aug 2021 23:20:14 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 0F84F8D0002; Wed, 18 Aug 2021 23:20:14 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id F28B18D0006; Wed, 18 Aug 2021 23:20:13 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0123.hostedemail.com [216.40.44.123]) by kanga.kvack.org (Postfix) with ESMTP id D6DBF8D0002 for ; Wed, 18 Aug 2021 
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, Qi Zheng
Subject: [PATCH v2 4/9] mm: move pte_alloc{,_map,_map_lock}() to a separate file
Date: Thu, 19 Aug 2021 11:18:53 +0800
Message-Id: <20210819031858.98043-5-zhengqi.arch@bytedance.com>
In-Reply-To: <20210819031858.98043-1-zhengqi.arch@bytedance.com>
References: <20210819031858.98043-1-zhengqi.arch@bytedance.com>
spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Subsequent patches will modify pte_alloc{,_map,_map_lock}(), which are the allocate function related to user PTE page table pages, so move those to a separate file in advance. This patch contains no functional changes, only some preparatory work. Signed-off-by: Qi Zheng --- arch/arm/mm/pgd.c | 1 + arch/arm64/mm/hugetlbpage.c | 1 + arch/ia64/mm/hugetlbpage.c | 1 + arch/parisc/mm/hugetlbpage.c | 1 + arch/powerpc/mm/hugetlbpage.c | 1 + arch/s390/mm/gmap.c | 1 + arch/s390/mm/pgtable.c | 1 + arch/sh/mm/hugetlbpage.c | 1 + arch/sparc/mm/hugetlbpage.c | 1 + include/linux/mm.h | 11 ---------- include/linux/pte_ref.h | 29 ++++++++++++++++++++++++++ mm/Makefile | 3 ++- mm/internal.h | 1 + mm/memory.c | 38 ---------------------------------- mm/pte_ref.c | 48 +++++++++++++++++++++++++++++++++++++++++++ 15 files changed, 89 insertions(+), 50 deletions(-) create mode 100644 include/linux/pte_ref.h create mode 100644 mm/pte_ref.c diff --git a/arch/arm/mm/pgd.c b/arch/arm/mm/pgd.c index f8e9bc58a84f..dcac1124e8bd 100644 --- a/arch/arm/mm/pgd.c +++ b/arch/arm/mm/pgd.c @@ -8,6 +8,7 @@ #include #include #include +#include #include #include diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c index 23505fc35324..6c8177647053 100644 --- a/arch/arm64/mm/hugetlbpage.c +++ b/arch/arm64/mm/hugetlbpage.c @@ -14,6 +14,7 @@ #include #include #include +#include #include #include #include diff --git a/arch/ia64/mm/hugetlbpage.c b/arch/ia64/mm/hugetlbpage.c index f993cb36c062..d343639db1ad 100644 --- a/arch/ia64/mm/hugetlbpage.c +++ b/arch/ia64/mm/hugetlbpage.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #include #include diff --git a/arch/parisc/mm/hugetlbpage.c b/arch/parisc/mm/hugetlbpage.c index d1d3990b83f6..4f6044d6b4bd 100644 --- a/arch/parisc/mm/hugetlbpage.c +++ b/arch/parisc/mm/hugetlbpage.c @@ -13,6 +13,7 @@ #include #include #include +#include #include #include diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index 9a75ba078e1b..1afb8f552bcf 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -19,6 +19,7 @@ #include #include #include +#include #include #include #include diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c index 9bb2c7512cd5..c1826d74773a 100644 --- a/arch/s390/mm/gmap.c +++ b/arch/s390/mm/gmap.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c index eec3a9d7176e..594a6d6888e3 100644 --- a/arch/s390/mm/pgtable.c +++ b/arch/s390/mm/pgtable.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include diff --git a/arch/sh/mm/hugetlbpage.c b/arch/sh/mm/hugetlbpage.c index 999ab5916e69..6a593ab248b7 100644 --- a/arch/sh/mm/hugetlbpage.c +++ b/arch/sh/mm/hugetlbpage.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c index 0f49fada2093..4b9119825264 100644 --- a/arch/sparc/mm/hugetlbpage.c +++ b/arch/sparc/mm/hugetlbpage.c @@ -11,6 +11,7 @@ #include #include #include +#include #include #include diff --git a/include/linux/mm.h b/include/linux/mm.h index 57e48217bd71..369d4283de49 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2245,7 +2245,6 @@ static inline void mm_inc_nr_ptes(struct mm_struct *mm) {} static inline void mm_dec_nr_ptes(struct mm_struct *mm) {} #endif -int 
__pte_alloc(struct mm_struct *mm, pmd_t *pmd); int __pte_alloc_kernel(pmd_t *pmd); #if defined(CONFIG_MMU) @@ -2371,15 +2370,6 @@ static inline void pgtable_pte_page_dtor(struct page *page) pte_unmap(pte); \ } while (0) -#define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd)) - -#define pte_alloc_map(mm, pmd, address) \ - (pte_alloc(mm, pmd) ? NULL : pte_offset_map(pmd, address)) - -#define pte_alloc_map_lock(mm, pmd, address, ptlp) \ - (pte_alloc(mm, pmd) ? \ - NULL : pte_offset_map_lock(mm, pmd, address, ptlp)) - #define pte_alloc_kernel(pmd, address) \ ((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))? \ NULL: pte_offset_kernel(pmd, address)) @@ -2471,7 +2461,6 @@ static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud) return ptl; } -extern void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte); extern void __init pagecache_init(void); extern void __init free_area_init_memoryless_node(int nid); extern void free_initmem(void); diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h new file mode 100644 index 000000000000..60b752dd7846 --- /dev/null +++ b/include/linux/pte_ref.h @@ -0,0 +1,29 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Free user PTE page table pages + * + * Copyright (c) 2021, ByteDance. All rights reserved. + * + * Author: Qi Zheng + */ +#ifndef _LINUX_PTE_REF_H +#define _LINUX_PTE_REF_H + +#include +#include +#include + +void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte); +int __pte_alloc(struct mm_struct *mm, pmd_t *pmd); + +#define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd)) + +#define pte_alloc_map(mm, pmd, address) \ + (pte_alloc(mm, pmd) ? NULL : pte_offset_map(pmd, address)) + +#define pte_alloc_map_lock(mm, pmd, address, ptlp) \ + (pte_alloc(mm, pmd) ? \ + NULL : pte_offset_map_lock(mm, pmd, address, ptlp)) + +#endif + diff --git a/mm/Makefile b/mm/Makefile index 970604ea97dd..5cdbfaa2a7b5 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -38,7 +38,8 @@ mmu-y := nommu.o mmu-$(CONFIG_MMU) := highmem.o memory.o mincore.o \ mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \ msync.o page_vma_mapped.o pagewalk.o \ - pgtable-generic.o rmap.o vmalloc.o ioremap.o + pgtable-generic.o rmap.o vmalloc.o ioremap.o \ + pte_ref.o ifdef CONFIG_CROSS_MEMORY_ATTACH diff --git a/mm/internal.h b/mm/internal.h index b1001ebeb286..c1e23fa563a7 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -11,6 +11,7 @@ #include #include #include +#include /* * The set of flags that only affect watermark checking and reclaim diff --git a/mm/memory.c b/mm/memory.c index d2aaa85b840c..265b841cc7f9 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -433,44 +433,6 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma, } } -void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte) -{ - spinlock_t *ptl = pmd_lock(mm, pmd); - - if (likely(pmd_none(*pmd))) { /* Has another populated it ? */ - mm_inc_nr_ptes(mm); - /* - * Ensure all pte setup (eg. pte page lock and page clearing) are - * visible before the pte is made visible to other CPUs by being - * put into page tables. - * - * The other side of the story is the pointer chasing in the page - * table walking code (when walking the page table without locking; - * ie. most of the time). Fortunately, these data accesses consist - * of a chain of data-dependent loads, meaning most CPUs (alpha - * being the notable exception) will already guarantee loads are - * seen in-order. 
See the alpha page table accessors for the - * smp_rmb() barriers in page table walking code. - */ - smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */ - pmd_populate(mm, pmd, *pte); - *pte = NULL; - } - spin_unlock(ptl); -} - -int __pte_alloc(struct mm_struct *mm, pmd_t *pmd) -{ - pgtable_t new = pte_alloc_one(mm); - if (!new) - return -ENOMEM; - - pmd_install(mm, pmd, &new); - if (new) - pte_free(mm, new); - return 0; -} - int __pte_alloc_kernel(pmd_t *pmd) { pte_t *new = pte_alloc_one_kernel(&init_mm); diff --git a/mm/pte_ref.c b/mm/pte_ref.c new file mode 100644 index 000000000000..07a73b5521cc --- /dev/null +++ b/mm/pte_ref.c @@ -0,0 +1,48 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Free user PTE page table pages + * + * Copyright (c) 2021, ByteDance. All rights reserved. + * + * Author: Qi Zheng + */ + +#include + +void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte) +{ + spinlock_t *ptl = pmd_lock(mm, pmd); + + if (likely(pmd_none(*pmd))) { /* Has another populated it ? */ + mm_inc_nr_ptes(mm); + /* + * Ensure all pte setup (eg. pte page lock and page clearing) are + * visible before the pte is made visible to other CPUs by being + * put into page tables. + * + * The other side of the story is the pointer chasing in the page + * table walking code (when walking the page table without locking; + * ie. most of the time). Fortunately, these data accesses consist + * of a chain of data-dependent loads, meaning most CPUs (alpha + * being the notable exception) will already guarantee loads are + * seen in-order. See the alpha page table accessors for the + * smp_rmb() barriers in page table walking code. + */ + smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */ + pmd_populate(mm, pmd, *pte); + *pte = NULL; + } + spin_unlock(ptl); +} + +int __pte_alloc(struct mm_struct *mm, pmd_t *pmd) +{ + pgtable_t new = pte_alloc_one(mm); + if (!new) + return -ENOMEM; + + pmd_install(mm, pmd, &new); + if (new) + pte_free(mm, new); + return 0; +} From patchwork Thu Aug 19 03:18:54 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qi Zheng X-Patchwork-Id: 12446055 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.6 required=3.0 tests=BAYES_00,DKIM_INVALID, DKIM_SIGNED,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6DB74C4338F for ; Thu, 19 Aug 2021 03:20:19 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 0E0D861107 for ; Thu, 19 Aug 2021 03:20:19 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 0E0D861107 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=bytedance.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org Received: by kanga.kvack.org (Postfix) id A95E38D0007; Wed, 18 Aug 2021 23:20:18 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id A466A8D0002; Wed, 18 Aug 2021 23:20:18 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 8C1C08D0007; Wed, 18 Aug 2021 23:20:18 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: 
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, Qi Zheng
Subject: [PATCH v2 5/9] mm: pte_refcount infrastructure
Date: Thu, 19 Aug 2021 11:18:54 +0800
Message-Id: <20210819031858.98043-6-zhengqi.arch@bytedance.com>
In-Reply-To: <20210819031858.98043-1-zhengqi.arch@bytedance.com>
References: <20210819031858.98043-1-zhengqi.arch@bytedance.com>
84hoq13h33rbugsg9pgdfsrbg5yn59ew X-Rspamd-Queue-Id: B7B0EB00C76E X-Rspamd-Server: rspam01 X-HE-Tag: 1629343217-648279 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: In this patch, we add two fields to the struct page of PTE page table, the pte_refcount is used to track how many users of PTE page table, the pmd is used to record the pmd entry which is populated by the PTE. And we introduce some APIs that will be used in subsequent patches: The following are some APIs related to ->pte_refcount: pte_get{,_many}() pte_put{,_many,_vmf}() pte_try_get() And we add pte_alloc_get{,_map,_map_lock}(), the semantics of these function is to hold a pte_refcount count when it returns 0, and its caller is responsible for decrease the pte_refcount by calling pte_put(), like the following pattern: --> pte_alloc_get() do something about pte pte_put() This patch contains no functional changes, just creates some infrastructure, and we will use these in the next patch. Signed-off-by: Qi Zheng --- include/linux/mm_types.h | 8 +- include/linux/pgtable.h | 3 +- include/linux/pte_ref.h | 208 ++++++++++++++++++++++++++++++++++++++++++++++- mm/Kconfig | 4 + mm/memory.c | 8 +- mm/pte_ref.c | 59 ++++++++++++-- mm/sparse-vmemmap.c | 2 +- 7 files changed, 279 insertions(+), 13 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index b9aa9aa99924..85c5cc4b95b3 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -161,11 +161,17 @@ struct page { }; struct { /* Page table pages */ unsigned long _pt_pad_1; /* compound_head */ - pgtable_t pmd_huge_pte; /* protected by page->ptl */ + union { + pgtable_t pmd_huge_pte; /* protected by page->ptl */ + pmd_t *pmd; /* PTE page only */ + }; unsigned long _pt_pad_2; /* mapping */ union { struct mm_struct *pt_mm; /* x86 pgds only */ atomic_t pt_frag_refcount; /* powerpc */ +#ifdef CONFIG_FREE_USER_PTE + atomic_t pte_refcount; /* PTE page only */ +#endif }; #if USE_SPLIT_PTE_PTLOCKS #if ALLOC_SPLIT_PTLOCKS diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index e24d2c992b11..c799475bc5c9 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -331,7 +331,6 @@ static inline pte_t ptep_get_lockless(pte_t *ptep) } #endif /* CONFIG_GUP_GET_PTE_LOW_HIGH */ -#ifdef CONFIG_TRANSPARENT_HUGEPAGE #ifndef __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm, unsigned long address, @@ -342,6 +341,8 @@ static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm, return pmd; } #endif /* __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR */ + +#ifdef CONFIG_TRANSPARENT_HUGEPAGE #ifndef __HAVE_ARCH_PUDP_HUGE_GET_AND_CLEAR static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm, unsigned long address, diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h index 60b752dd7846..47aaeac7507e 100644 --- a/include/linux/pte_ref.h +++ b/include/linux/pte_ref.h @@ -15,6 +15,8 @@ void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte); int __pte_alloc(struct mm_struct *mm, pmd_t *pmd); +void pmd_install_get(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte); +int __pte_alloc_get(struct mm_struct *mm, pmd_t *pmd); #define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd)) @@ -25,5 +27,209 @@ int __pte_alloc(struct mm_struct *mm, pmd_t *pmd); (pte_alloc(mm, pmd) ? 
\ NULL : pte_offset_map_lock(mm, pmd, address, ptlp)) -#endif +#ifdef CONFIG_FREE_USER_PTE + +void free_pte_table(struct mm_struct *mm, pmd_t *pmdp, unsigned long addr); + +static inline void pte_ref_init(pgtable_t pte, pmd_t *pmd, int count) +{ + pte->pmd = pmd; + atomic_set(&pte->pte_refcount, count); +} + +static inline pmd_t *pte_to_pmd(pte_t *pte) +{ + return virt_to_page(pte)->pmd; +} + +static inline void pte_migrate_pmd(pmd_t old_pmd, pmd_t *new_pmd) +{ + pmd_pgtable(old_pmd)->pmd = new_pmd; +} + +static inline void pte_get_many(pmd_t *pmdp, unsigned int nr) +{ + pgtable_t pte = pmd_pgtable(*pmdp); + + VM_BUG_ON(!PageTable(pte)); + atomic_add(nr, &pte->pte_refcount); +} + +/* + * pte_get - Increment ->pte_refcount for the PTE page table. + * @pmdp: a pointer to the pmd entry corresponding to the PTE page table. + * + * Similar to the mechanism of page refcount, the user of PTE page table + * should hold a refcount to it before accessing. + */ +static inline void pte_get(pmd_t *pmdp) +{ + pte_get_many(pmdp, 1); +} + +/* + * pte_get_unless_zero - Increment ->pte_refcount for the PTE page table + * unless it is zero. + * @pmdp: a pointer to the pmd entry corresponding to the PTE page table. + */ +static inline bool pte_get_unless_zero(pmd_t *pmdp) +{ + pgtable_t pte = pmd_pgtable(*pmdp); + VM_BUG_ON(!PageTable(pte)); + return atomic_inc_not_zero(&pte->pte_refcount); +} + +/* + * pte_try_get - Try to increment ->pte_refcount for the PTE page table. + * @mm: the mm_struct of the target address space. + * @pmdp: a pointer to the pmd entry corresponding to the PTE page table. + * + * Return true if the increment succeeded. Otherwise return false. + * + * Before Operating the PTE page table, we need to hold a ->pte_refcount + * to protect against the concurrent release of the PTE page table. + */ +static inline bool pte_try_get(struct mm_struct *mm, pmd_t *pmdp) +{ + bool retval = true; + spinlock_t *ptl; + + ptl = pmd_lock(mm, pmdp); + if (!pmd_present(*pmdp) || !pte_get_unless_zero(pmdp)) + retval = false; + spin_unlock(ptl); + + return retval; +} + +static inline void pte_put_many(struct mm_struct *mm, pmd_t *pmdp, + unsigned long addr, unsigned int nr) +{ + pgtable_t pte = pmd_pgtable(*pmdp); + + VM_BUG_ON(!PageTable(pte)); + if (atomic_sub_and_test(nr, &pte->pte_refcount)) + free_pte_table(mm, pmdp, addr & PMD_MASK); +} + +/* + * pte_put - Decrement ->pte_refcount for the PTE page table. + * @mm: the mm_struct of the target address space. + * @pmdp: a pointer to the pmd entry corresponding to the PTE page table. + * @addr: the start address of the tlb range to be flushed. + * + * The PTE page table page will be freed when the last refcount is dropped. + */ +static inline void pte_put(struct mm_struct *mm, pmd_t *pmdp, unsigned long addr) +{ + pte_put_many(mm, pmdp, addr, 1); +} + +/* + * pte_put_vmf - Decrement ->pte_refcount for the PTE page table. + * @vmf: fault information + * + * While we access ->pte_refcount of a PTE page table, any of the following + * ensures the pmd entry corresponding to the PTE page table stability: + * - mmap_lock + * - anon_lock + * - i_mmap_lock + * - parallel threads are excluded by other means which + * can make ->pmd stable(e.g. gup case) + * But the mmap_lock maybe unlocked in advance in some cases in + * handle_pte_fault(), so we should ensure the pte_put() is performed + * in the critical section of the mmap_lock. 
+ */ +static inline void pte_put_vmf(struct vm_fault *vmf) +{ + if (!(vmf->flags & FAULT_FLAG_PTE_GET)) + return; + vmf->flags &= ~FAULT_FLAG_PTE_GET; + + pte_put(vmf->vma->vm_mm, vmf->pmd, vmf->address); +} + +/* + * pte_alloc_get - allocate a PTE page table if the pmd entry is none, and + * hold a ->pte_refcount when returning + * @mm: the mm_struct of the target address space. + * @pmdp: a pointer to the pmd entry corresponding to the PTE page table. + */ +static inline int pte_alloc_get(struct mm_struct *mm, pmd_t *pmdp) +{ + spinlock_t *ptl; + + ptl = pmd_lock(mm, pmdp); + if (pmd_none(*pmdp) || !pte_get_unless_zero(pmdp)) { + spin_unlock(ptl); + return __pte_alloc_get(mm, pmdp); + } + spin_unlock(ptl); + return 0; +} + +#define pte_alloc_get_map(mm, pmd, address) \ + (pte_alloc_get(mm, pmd) ? NULL : pte_offset_map(pmd, address)) + +#define pte_alloc_get_map_lock(mm, pmd, address, ptlp) \ + (pte_alloc_get(mm, pmd) ? \ + NULL : pte_offset_map_lock(mm, pmd, address, ptlp)) +#else +static inline void pte_ref_init(pgtable_t pte, pmd_t *pmd, int count) +{ +} + +static inline pmd_t *pte_to_pmd(pte_t *pte) +{ + return NULL; +} + +static inline void pte_migrate_pmd(pmd_t old_pmd, pmd_t *new_pmd) +{ +} + +static inline void pte_get_many(pmd_t *pmdp, unsigned int nr) +{ +} + +static inline void pte_get(pmd_t *pmdp) +{ +} + +static inline bool pte_get_unless_zero(pmd_t *pmdp) +{ + return true; +} + +static inline bool pte_try_get(struct mm_struct *mm, pmd_t *pmdp) +{ + return true; +} + +static inline void pte_put_many(struct mm_struct *mm, pmd_t *pmdp, + unsigned long addr, unsigned int value) +{ +} + +static inline void pte_put(struct mm_struct *mm, pmd_t *pmdp, unsigned long addr) +{ +} + +static inline void pte_put_vmf(struct vm_fault *vmf) +{ +} + +static inline int pte_alloc_get(struct mm_struct *mm, pmd_t *pmdp) +{ + return pte_alloc(mm, pmdp); +} + +#define pte_alloc_get_map(mm, pmd, address) \ + pte_alloc_map(mm, pmd, address) + +#define pte_alloc_get_map_lock(mm, pmd, address, ptlp) \ + pte_alloc_map_lock(mm, pmd, address, ptlp) +#endif /* CONFIG_FREE_USER_PTE */ + +#endif diff --git a/mm/Kconfig b/mm/Kconfig index 50ca602edeb6..fb16ecebcb80 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -894,6 +894,10 @@ config IO_MAPPING config SECRETMEM def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED +config FREE_USER_PTE + def_bool y + depends on X86_64 && !TRANSPARENT_HUGEPAGE + source "mm/damon/Kconfig" endmenu diff --git a/mm/memory.c b/mm/memory.c index 265b841cc7f9..d1efb868e682 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -441,7 +441,7 @@ int __pte_alloc_kernel(pmd_t *pmd) spin_lock(&init_mm.page_table_lock); if (likely(pmd_none(*pmd))) { /* Has another populated it ? 
*/ - smp_wmb(); /* See comment in pmd_install() */ + smp_wmb(); /* See comment in pmd_install_get() */ pmd_populate_kernel(&init_mm, pmd, new); new = NULL; } @@ -4780,7 +4780,7 @@ int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address) if (pgd_present(*pgd)) /* Another has populated it */ p4d_free(mm, new); else { - smp_wmb(); /* See comment in pmd_install() */ + smp_wmb(); /* See comment in pmd_install_get() */ pgd_populate(mm, pgd, new); } spin_unlock(&mm->page_table_lock); @@ -4802,7 +4802,7 @@ int __pud_alloc(struct mm_struct *mm, p4d_t *p4d, unsigned long address) spin_lock(&mm->page_table_lock); if (!p4d_present(*p4d)) { mm_inc_nr_puds(mm); - smp_wmb(); /* See comment in pmd_install() */ + smp_wmb(); /* See comment in pmd_install_get() */ p4d_populate(mm, p4d, new); } else /* Another has populated it */ pud_free(mm, new); @@ -4826,7 +4826,7 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address) ptl = pud_lock(mm, pud); if (!pud_present(*pud)) { mm_inc_nr_pmds(mm); - smp_wmb(); /* See comment in pmd_install() */ + smp_wmb(); /* See comment in pmd_install_get() */ pud_populate(mm, pud, new); } else /* Another has populated it */ pmd_free(mm, new); diff --git a/mm/pte_ref.c b/mm/pte_ref.c index 07a73b5521cc..630704ae26db 100644 --- a/mm/pte_ref.c +++ b/mm/pte_ref.c @@ -8,12 +8,47 @@ */ #include +#include +#include -void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte) +#ifdef CONFIG_DEBUG_VM +static void pte_free_debug(pmd_t pmd) +{ + pte_t *ptep = (pte_t *)pmd_page_vaddr(pmd); + int i = 0; + + for (i = 0; i < PTRS_PER_PTE; i++) + BUG_ON(!pte_none(*ptep++)); +} +#else +static inline void pte_free_debug(pmd_t pmd) +{ +} +#endif + +void free_pte_table(struct mm_struct *mm, pmd_t *pmdp, unsigned long addr) +{ + struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0); + spinlock_t *ptl; + pmd_t pmd; + + ptl = pmd_lock(mm, pmdp); + pmd = pmdp_huge_get_and_clear(mm, addr, pmdp); + spin_unlock(ptl); + + pte_free_debug(pmd); + flush_tlb_range(&vma, addr, addr + PMD_SIZE); + mm_dec_nr_ptes(mm); + pte_free(mm, pmd_pgtable(pmd)); +} + +void pmd_install_get(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte) { - spinlock_t *ptl = pmd_lock(mm, pmd); + spinlock_t *ptl; - if (likely(pmd_none(*pmd))) { /* Has another populated it ? */ +retry: + ptl = pmd_lock(mm, pmd); + if (likely(pmd_none(*pmd))) { mm_inc_nr_ptes(mm); /* * Ensure all pte setup (eg. 
pte page lock and page clearing) are @@ -30,19 +65,33 @@ void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte) */ smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */ pmd_populate(mm, pmd, *pte); + pte_ref_init(*pte, pmd, 1); *pte = NULL; + } else if (!pte_get_unless_zero(pmd)) { + spin_unlock(ptl); + goto retry; } spin_unlock(ptl); } -int __pte_alloc(struct mm_struct *mm, pmd_t *pmd) +void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte) +{ + pmd_install_get(mm, pmd, pte); +} + +int __pte_alloc_get(struct mm_struct *mm, pmd_t *pmd) { pgtable_t new = pte_alloc_one(mm); if (!new) return -ENOMEM; - pmd_install(mm, pmd, &new); + pmd_install_get(mm, pmd, &new); if (new) pte_free(mm, new); return 0; } + +int __pte_alloc(struct mm_struct *mm, pmd_t *pmd) +{ + return __pte_alloc_get(mm, pmd); +} diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c index db6df27c852a..818c82ee3c6e 100644 --- a/mm/sparse-vmemmap.c +++ b/mm/sparse-vmemmap.c @@ -76,7 +76,7 @@ static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start, set_pte_at(&init_mm, addr, pte, entry); } - /* Make pte visible before pmd. See comment in pmd_install(). */ + /* Make pte visible before pmd. See comment in pmd_install_get(). */ smp_wmb(); pmd_populate_kernel(&init_mm, pmd, pgtable);

From patchwork Thu Aug 19 03:18:55 2021
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, Qi Zheng
Subject: [PATCH v2 6/9] mm: free user PTE page table pages
Date: Thu, 19 Aug 2021 11:18:55 +0800
Message-Id: <20210819031858.98043-7-zhengqi.arch@bytedance.com>
In-Reply-To: <20210819031858.98043-1-zhengqi.arch@bytedance.com>
References: <20210819031858.98043-1-zhengqi.arch@bytedance.com>

Some malloc libraries (e.g. jemalloc or tcmalloc) allocate a large amount of virtual address space with mmap() and do not unmap it; instead, they use madvise(MADV_DONTNEED) to free the physical memory when they want to release it. But madvise() does not free the page tables, so a process that touches an enormous virtual address space can end up with a huge number of page tables.
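For illustration (this example is not part of the patch), the pattern described above can be reproduced with a few lines of userspace C. The mapping size and fill value are arbitrary; the point is only that MADV_DONTNEED drops the data pages while the PTE page tables that were faulted in stay allocated until the whole mapping is unmapped.

    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t len = 1UL << 30;         /* 1 GiB of VA, kept mapped for reuse */
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED)
                    return 1;

            /* Touch every page: allocates data pages plus ~512 PTE tables on x86_64. */
            memset(p, 1, len);

            /* Frees the data pages, but the now-empty PTE page tables remain. */
            madvise(p, len, MADV_DONTNEED);

            /* VmPTE in /proc/self/status still accounts those page tables here. */
            return 0;
    }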
The following figures are a memory usage snapshot of one process which actually happened on our server:

    VIRT:  55t
    RES:   590g
    VmPTE: 110g

As we can see, the PTE page tables alone take 110g, while the RES is 590g. In theory, the process only needs about 1.2g of PTE page tables to map that physical memory. The reason the PTE page tables occupy so much memory is that madvise(MADV_DONTNEED) only empties the PTEs and frees the physical memory, but does not free the PTE page table pages. So we can free those empty PTE page tables to save memory. In the above case, we can save about 108g of memory (best case), and the larger the difference between VIRT and RES, the more memory we save.

In this patch series, we add a pte_refcount field to the struct page of a page table to track how many users the PTE page table has. Similar to the page refcount mechanism, a user of a PTE page table should hold a refcount on it before accessing it. The PTE page table page is freed when the last refcount is dropped.

While we access ->pte_refcount of a PTE page table, any of the following keeps the pmd entry corresponding to the PTE page table stable:

 - mmap_lock
 - anon_lock
 - i_mmap_lock
 - parallel threads are excluded by other means which can make ->pmd stable (e.g. the gup case)

This patch does not support THP yet; that support is added in the next patch.

Signed-off-by: Qi Zheng
---
fs/proc/task_mmu.c | 18 ++++- fs/userfaultfd.c | 2 + include/linux/mm.h | 2 + include/linux/pte_ref.h | 1 - include/linux/rmap.h | 4 +- kernel/events/uprobes.c | 3 + mm/filemap.c | 43 ++++++----- mm/gup.c | 7 ++ mm/hmm.c | 4 + mm/internal.h | 1 + mm/ksm.c | 4 + mm/madvise.c | 17 ++++- mm/memcontrol.c | 8 +- mm/memory.c | 194 +++++++++++++++++++++++++++++++++++++++--------- mm/mempolicy.c | 4 +- mm/migrate.c | 15 +++- mm/mincore.c | 5 +- mm/mlock.c | 1 + mm/mprotect.c | 10 ++- mm/mremap.c | 12 ++- mm/page_vma_mapped.c | 4 + mm/pagewalk.c | 16 +++- mm/pte_ref.c | 5 -- mm/rmap.c | 13 ++++ mm/swapfile.c | 4 +- mm/userfaultfd.c | 14 +++- 26 files changed, 326 insertions(+), 85 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index eb97468dfe4c..85ee730ff6ae 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -574,6 +574,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, struct vm_area_struct *vma = walk->vma; pte_t *pte; spinlock_t *ptl; + unsigned long start = addr; ptl = pmd_trans_huge_lock(pmd, vma); if (ptl) { @@ -582,7 +583,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, goto out; } - if (pmd_trans_unstable(pmd)) + if (pmd_trans_unstable(pmd) || !pte_try_get(vma->vm_mm, pmd)) goto out; /* * The mmap_lock held all the way back in m_start() is what @@ -593,6 +594,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, for (; addr != end; pte++, addr += PAGE_SIZE) smaps_pte_entry(pte, addr, walk); pte_unmap_unlock(pte - 1, ptl); + pte_put(vma->vm_mm, pmd, start); out: cond_resched(); return 0; @@ -1121,6 +1123,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr, pte_t *pte, ptent; spinlock_t *ptl; struct page *page; + unsigned long start = addr; ptl = pmd_trans_huge_lock(pmd, vma); if (ptl) { @@ -1143,7 +1146,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr, return 0; } - if (pmd_trans_unstable(pmd)) + if (pmd_trans_unstable(pmd) || !pte_try_get(vma->vm_mm, pmd)) return 0; pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); @@ -1168,6 +1171,7 @@ static int clear_refs_pte_range(pmd_t *pmd,
unsigned long addr, ClearPageReferenced(page); } pte_unmap_unlock(pte - 1, ptl); + pte_put(vma->vm_mm, pmd, start); cond_resched(); return 0; } @@ -1407,6 +1411,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end, spinlock_t *ptl; pte_t *pte, *orig_pte; int err = 0; + unsigned long start = addr; #ifdef CONFIG_TRANSPARENT_HUGEPAGE ptl = pmd_trans_huge_lock(pmdp, vma); @@ -1475,6 +1480,9 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end, return 0; #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ + if (!pte_try_get(walk->mm, pmdp)) + return 0; + /* * We can assume that @vma always points to a valid one and @end never * goes beyond vma->vm_end. @@ -1489,6 +1497,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end, break; } pte_unmap_unlock(orig_pte, ptl); + pte_put(walk->mm, pmdp, start); cond_resched(); @@ -1795,6 +1804,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr, spinlock_t *ptl; pte_t *orig_pte; pte_t *pte; + unsigned long start = addr; #ifdef CONFIG_TRANSPARENT_HUGEPAGE ptl = pmd_trans_huge_lock(pmd, vma); @@ -1812,6 +1822,9 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr, if (pmd_trans_unstable(pmd)) return 0; #endif + if (!pte_try_get(walk->mm, pmd)) + return 0; + orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); do { struct page *page = can_gather_numa_stats(*pte, vma, addr); @@ -1821,6 +1834,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr, } while (pte++, addr += PAGE_SIZE, addr != end); pte_unmap_unlock(orig_pte, ptl); + pte_put(walk->mm, pmd, start); cond_resched(); return 0; } diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 003f0d31743e..baa1d98a6f87 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -28,6 +28,7 @@ #include #include #include +#include int sysctl_unprivileged_userfaultfd __read_mostly; @@ -509,6 +510,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason) must_wait = userfaultfd_huge_must_wait(ctx, vmf->vma, vmf->address, vmf->flags, reason); + pte_put_vmf(vmf); mmap_read_unlock(mm); if (likely(must_wait && !READ_ONCE(ctx->released))) { diff --git a/include/linux/mm.h b/include/linux/mm.h index 369d4283de49..5659a483072b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -437,6 +437,7 @@ extern pgprot_t protection_map[16]; * @FAULT_FLAG_REMOTE: The fault is not for current task/mm. * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch. * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals. + * @FAULT_FLAG_PTE_GET: This means the pte ->pte_refcount has been got. 
* * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify * whether we would allow page faults to retry by specifying these two @@ -468,6 +469,7 @@ enum fault_flag { FAULT_FLAG_REMOTE = 1 << 7, FAULT_FLAG_INSTRUCTION = 1 << 8, FAULT_FLAG_INTERRUPTIBLE = 1 << 9, + FAULT_FLAG_PTE_GET = 1 << 10, }; /* diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h index 47aaeac7507e..c2389d03bb59 100644 --- a/include/linux/pte_ref.h +++ b/include/linux/pte_ref.h @@ -13,7 +13,6 @@ #include #include -void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte); int __pte_alloc(struct mm_struct *mm, pmd_t *pmd); void pmd_install_get(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte); int __pte_alloc_get(struct mm_struct *mm, pmd_t *pmd); diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 221c3c6438a7..3a7ec085d939 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -11,7 +11,7 @@ #include #include #include - +#include #include /* @@ -222,6 +222,8 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw) pte_unmap(pvmw->pte); if (pvmw->ptl) spin_unlock(pvmw->ptl); + if (pvmw->pte && !PageHuge(pvmw->page)) + pte_put(pvmw->vma->vm_mm, pvmw->pmd, pvmw->address); } bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw); diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 6357c3580d07..5762f3abfd55 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -206,6 +206,9 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr, try_to_free_swap(old_page); page_vma_mapped_walk_done(&pvmw); + if (!new_page) + pte_put(mm, pte_to_pmd(pvmw.pte), addr); + if ((vma->vm_flags & VM_LOCKED) && !PageCompound(old_page)) munlock_vma_page(old_page); put_page(old_page); diff --git a/mm/filemap.c b/mm/filemap.c index eeac0e119cf5..7f2eb64e5c76 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1701,6 +1701,7 @@ bool __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf) if (flags & FAULT_FLAG_RETRY_NOWAIT) return false; + pte_put_vmf(vmf); mmap_read_unlock(mm); if (flags & FAULT_FLAG_KILLABLE) folio_wait_locked_killable(folio); @@ -1713,6 +1714,7 @@ bool __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf) ret = __folio_lock_killable(folio); if (ret) { + pte_put_vmf(vmf); mmap_read_unlock(mm); return false; } @@ -3197,32 +3199,30 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page) struct mm_struct *mm = vmf->vma->vm_mm; /* Huge page is mapped? No need to proceed. */ - if (pmd_trans_huge(*vmf->pmd)) { - unlock_page(page); - put_page(page); - return true; - } + if (pmd_trans_huge(*vmf->pmd)) + goto out; if (pmd_none(*vmf->pmd) && PageTransHuge(page)) { - vm_fault_t ret = do_set_pmd(vmf, page); - if (!ret) { - /* The page is mapped successfully, reference consumed. */ - unlock_page(page); - return true; - } + vm_fault_t ret = do_set_pmd(vmf, page); + if (!ret) { + /* The page is mapped successfully, reference consumed. 
*/ + unlock_page(page); + return true; + } } - if (pmd_none(*vmf->pmd)) - pmd_install(mm, vmf->pmd, &vmf->prealloc_pte); + if (IS_ENABLED(CONFIG_FREE_USER_PTE) || pmd_none(*vmf->pmd)) + pmd_install_get(mm, vmf->pmd, &vmf->prealloc_pte); /* See comment in handle_pte_fault() */ - if (pmd_devmap_trans_unstable(vmf->pmd)) { - unlock_page(page); - put_page(page); - return true; - } + if (pmd_devmap_trans_unstable(vmf->pmd)) + goto out; return false; +out: + unlock_page(page); + put_page(page); + return true; } static struct page *next_uptodate_page(struct page *page, @@ -3292,10 +3292,12 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf, struct address_space *mapping = file->f_mapping; pgoff_t last_pgoff = start_pgoff; unsigned long addr; + unsigned long start; XA_STATE(xas, &mapping->i_pages, start_pgoff); struct page *head, *page; unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss); vm_fault_t ret = 0; + unsigned int nr_get = 0; rcu_read_lock(); head = first_map_page(mapping, &xas, end_pgoff); @@ -3307,7 +3309,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf, goto out; } - addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT); + start = addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT); vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl); do { page = find_subpage(head, xas.xa_index); @@ -3329,6 +3331,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf, ret = VM_FAULT_NOPAGE; do_set_pte(vmf, page, addr); + nr_get++; /* no need to invalidate: a not-present page won't be cached */ update_mmu_cache(vma, addr, vmf->pte); unlock_page(head); @@ -3338,6 +3341,8 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf, put_page(head); } while ((head = next_map_page(mapping, &xas, end_pgoff)) != NULL); pte_unmap_unlock(vmf->pte, vmf->ptl); + pte_get_many(vmf->pmd, nr_get); + pte_put(vma->vm_mm, vmf->pmd, start); out: rcu_read_unlock(); WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss); diff --git a/mm/gup.c b/mm/gup.c index 2630ed1bb4f4..30757f3b176c 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -500,6 +500,9 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, if (unlikely(pmd_bad(*pmd))) return no_page_table(vma, flags); + if (!pte_try_get(mm, pmd)) + return no_page_table(vma, flags); + ptep = pte_offset_map_lock(mm, pmd, address, &ptl); pte = *ptep; if (!pte_present(pte)) { @@ -517,6 +520,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, if (!is_migration_entry(entry)) goto no_page; pte_unmap_unlock(ptep, ptl); + pte_put(mm, pmd, address); migration_entry_wait(mm, pmd, address); goto retry; } @@ -524,6 +528,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, goto no_page; if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags)) { pte_unmap_unlock(ptep, ptl); + pte_put(mm, pmd, address); return NULL; } @@ -612,9 +617,11 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, } out: pte_unmap_unlock(ptep, ptl); + pte_put(mm, pmd, address); return page; no_page: pte_unmap_unlock(ptep, ptl); + pte_put(mm, pmd, address); if (!pte_none(pte)) return NULL; return no_page_table(vma, flags); diff --git a/mm/hmm.c b/mm/hmm.c index fad6be2bf072..29bb379510cc 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -380,6 +380,9 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR); } + if (!pte_try_get(walk->mm, pmdp)) + goto again; + ptep = pte_offset_map(pmdp, addr); for (; addr < end; addr += PAGE_SIZE, ptep++, hmm_pfns++) { int r; @@ -391,6 +394,7 @@ static int 
hmm_vma_walk_pmd(pmd_t *pmdp, } } pte_unmap(ptep - 1); + pte_put(walk->mm, pmdp, start); return 0; } diff --git a/mm/internal.h b/mm/internal.h index c1e23fa563a7..5cebb4ee1792 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -466,6 +466,7 @@ static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf, if (fault_flag_allow_retry_first(flags) && !(flags & FAULT_FLAG_RETRY_NOWAIT)) { fpin = get_file(vmf->vma->vm_file); + pte_put_vmf(vmf); mmap_read_unlock(vmf->vma->vm_mm); } return fpin; diff --git a/mm/ksm.c b/mm/ksm.c index 08092e6f0b73..d0d72dd1eaf0 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -1138,6 +1138,9 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, if (!pmd) goto out; + if (!pte_try_get(mm, pmd)) + goto out; + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm, addr, addr + PAGE_SIZE); mmu_notifier_invalidate_range_start(&range); @@ -1187,6 +1190,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, err = 0; out_mn: mmu_notifier_invalidate_range_end(&range); + pte_put(mm, pmd, addr); out: return err; } diff --git a/mm/madvise.c b/mm/madvise.c index 0734db8d53a7..1494da73281c 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -191,7 +191,8 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start, struct vm_area_struct *vma = walk->private; unsigned long index; - if (pmd_none_or_trans_huge_or_clear_bad(pmd)) + if (pmd_none_or_trans_huge_or_clear_bad(pmd) || + !pte_try_get(vma->vm_mm, pmd)) return 0; for (index = start; index != end; index += PAGE_SIZE) { @@ -215,6 +216,7 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start, if (page) put_page(page); } + pte_put(vma->vm_mm, pmd, start); return 0; } @@ -318,6 +320,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, spinlock_t *ptl; struct page *page = NULL; LIST_HEAD(page_list); + unsigned long start = addr; if (fatal_signal_pending(current)) return -EINTR; @@ -392,6 +395,9 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, if (pmd_trans_unstable(pmd)) return 0; #endif + if (!pte_try_get(vma->vm_mm, pmd)) + return 0; + tlb_change_page_size(tlb, PAGE_SIZE); orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); flush_tlb_batched_pending(mm); @@ -471,6 +477,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, arch_leave_lazy_mmu_mode(); pte_unmap_unlock(orig_pte, ptl); + pte_put(vma->vm_mm, pmd, start); if (pageout) reclaim_pages(&page_list); cond_resched(); @@ -580,14 +587,17 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, struct page *page; int nr_swap = 0; unsigned long next; + unsigned int nr_put = 0; + unsigned long start = addr; next = pmd_addr_end(addr, end); if (pmd_trans_huge(*pmd)) if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next)) goto next; - if (pmd_trans_unstable(pmd)) + if (pmd_trans_unstable(pmd) || !pte_try_get(mm, pmd)) return 0; + nr_put++; tlb_change_page_size(tlb, PAGE_SIZE); orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl); @@ -612,6 +622,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, nr_swap--; free_swap_and_cache(entry); pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); + nr_put++; continue; } @@ -696,6 +707,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, } arch_leave_lazy_mmu_mode(); pte_unmap_unlock(orig_pte, ptl); + if (nr_put) + pte_put_many(mm, pmd, start, nr_put); cond_resched(); next: return 0; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 3e7c205a1852..f7d203ce14af 100644 --- 
a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5795,6 +5795,7 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd, struct vm_area_struct *vma = walk->vma; pte_t *pte; spinlock_t *ptl; + unsigned long start = addr; ptl = pmd_trans_huge_lock(pmd, vma); if (ptl) { @@ -5809,13 +5810,14 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd, return 0; } - if (pmd_trans_unstable(pmd)) + if (pmd_trans_unstable(pmd) || !pte_try_get(vma->vm_mm, pmd)) return 0; pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); for (; addr != end; pte++, addr += PAGE_SIZE) if (get_mctgt_type(vma, addr, *pte, NULL)) mc.precharge++; /* increment precharge temporarily */ pte_unmap_unlock(pte - 1, ptl); + pte_put(vma->vm_mm, pmd, start); cond_resched(); return 0; @@ -5995,6 +5997,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd, enum mc_target_type target_type; union mc_target target; struct page *page; + unsigned long start = addr; ptl = pmd_trans_huge_lock(pmd, vma); if (ptl) { @@ -6030,6 +6033,8 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd, if (pmd_trans_unstable(pmd)) return 0; retry: + if (!pte_try_get(vma->vm_mm, pmd)) + return 0; pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); for (; addr != end; addr += PAGE_SIZE) { pte_t ptent = *(pte++); @@ -6080,6 +6085,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd, } } pte_unmap_unlock(pte - 1, ptl); + pte_put(vma->vm_mm, pmd, start); cond_resched(); if (addr != end) { diff --git a/mm/memory.c b/mm/memory.c index d1efb868e682..8fcef8b67971 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -219,6 +219,17 @@ static void check_sync_rss_stat(struct task_struct *task) #endif /* SPLIT_RSS_COUNTING */ +#ifdef CONFIG_FREE_USER_PTE +static inline void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd, + unsigned long addr) +{ + /* + * We should never reach here since the PTE page tables are + * dynamically freed. + */ + BUG(); +} +#else /* * Note: this doesn't free the actual pages themselves. That * has been handled earlier when unmapping all the memory regions. 
@@ -231,6 +242,7 @@ static void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd, pte_free_tlb(tlb, token, addr); mm_dec_nr_ptes(tlb->mm); } +#endif /* CONFIG_FREE_USER_PTE */ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud, unsigned long addr, unsigned long end, @@ -822,6 +834,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, if (!userfaultfd_wp(dst_vma)) pte = pte_swp_clear_uffd_wp(pte); set_pte_at(dst_mm, addr, dst_pte, pte); + pte_get(pte_to_pmd(dst_pte)); return 0; } @@ -890,6 +903,7 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma /* Uffd-wp needs to be delivered to dest pte as well */ pte = pte_wrprotect(pte_mkuffd_wp(pte)); set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte); + pte_get(pte_to_pmd(dst_pte)); return 0; } @@ -942,6 +956,7 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, pte = pte_clear_uffd_wp(pte); set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte); + pte_get(pte_to_pmd(dst_pte)); return 0; } @@ -983,7 +998,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, progress = 0; init_rss_vec(rss); - dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl); + dst_pte = pte_alloc_get_map_lock(dst_mm, dst_pmd, addr, &dst_ptl); if (!dst_pte) { ret = -ENOMEM; goto out; @@ -1071,8 +1086,10 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, goto out; } else if (ret == -EAGAIN) { prealloc = page_copy_prealloc(src_mm, src_vma, addr); - if (!prealloc) - return -ENOMEM; + if (!prealloc) { + ret = -ENOMEM; + goto out; + } } else if (ret) { VM_WARN_ON_ONCE(1); } @@ -1080,11 +1097,14 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, /* We've captured and resolved the error. Reset, try again. 
*/ ret = 0; - if (addr != end) + if (addr != end) { + pte_put(dst_mm, dst_pmd, addr); goto again; + } out: if (unlikely(prealloc)) put_page(prealloc); + pte_put(dst_mm, dst_pmd, addr); return ret; } @@ -1103,9 +1123,13 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, return -ENOMEM; src_pmd = pmd_offset(src_pud, addr); do { + pmd_t pmdval; + next = pmd_addr_end(addr, end); - if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd) - || pmd_devmap(*src_pmd)) { +retry: + pmdval = READ_ONCE(*src_pmd); + if (is_swap_pmd(pmdval) || pmd_trans_huge(pmdval) + || pmd_devmap(pmdval)) { int err; VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, src_vma); err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd, @@ -1118,9 +1142,14 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, } if (pmd_none_or_clear_bad(src_pmd)) continue; + if (!pte_try_get(src_mm, src_pmd)) + goto retry; if (copy_pte_range(dst_vma, src_vma, dst_pmd, src_pmd, - addr, next)) + addr, next)) { + pte_put(src_mm, src_pmd, addr); return -ENOMEM; + } + pte_put(src_mm, src_pmd, addr); } while (dst_pmd++, src_pmd++, addr = next, addr != end); return 0; } @@ -1278,6 +1307,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, pte_t *start_pte; pte_t *pte; swp_entry_t entry; + unsigned int nr_put = 0; + unsigned long start = addr; tlb_change_page_size(tlb, PAGE_SIZE); again: @@ -1310,6 +1341,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, } ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); + nr_put++; tlb_remove_tlb_entry(tlb, pte, addr); if (unlikely(!page)) continue; @@ -1352,6 +1384,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, } pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); + nr_put++; rss[mm_counter(page)]--; if (is_device_private_entry(entry)) @@ -1376,6 +1409,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, if (unlikely(!free_swap_and_cache(entry))) print_bad_pte(vma, addr, ptent, NULL); pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); + nr_put++; } while (pte++, addr += PAGE_SIZE, addr != end); add_mm_rss_vec(mm, rss); @@ -1402,6 +1436,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, goto again; } + if (nr_put) + pte_put_many(mm, pmd, start, nr_put); + return addr; } @@ -1441,9 +1478,11 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb, * because MADV_DONTNEED holds the mmap_lock in read * mode. 
*/ - if (pmd_none_or_trans_huge_or_clear_bad(pmd)) + if (pmd_none_or_trans_huge_or_clear_bad(pmd) || + !pte_try_get(tlb->mm, pmd)) goto next; next = zap_pte_range(tlb, vma, pmd, addr, next, details); + pte_put(tlb->mm, pmd, addr); next: cond_resched(); } while (pmd++, addr = next, addr != end); @@ -1698,7 +1737,7 @@ pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr, if (!pmd) return NULL; - return pte_alloc_map_lock(mm, pmd, addr, ptl); + return pte_alloc_get_map_lock(mm, pmd, addr, ptl); } static int validate_page_before_insert(struct page *page) @@ -1719,6 +1758,7 @@ static int insert_page_into_pte_locked(struct mm_struct *mm, pte_t *pte, inc_mm_counter_fast(mm, mm_counter_file(page)); page_add_file_rmap(page, false); set_pte_at(mm, addr, pte, mk_pte(page, prot)); + pte_get(pte_to_pmd(pte)); return 0; } @@ -1746,6 +1786,7 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr, goto out; retval = insert_page_into_pte_locked(mm, pte, addr, page, prot); pte_unmap_unlock(pte, ptl); + pte_put(mm, pte_to_pmd(pte), addr); out: return retval; } @@ -1789,7 +1830,7 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr, /* Allocate the PTE if necessary; takes PMD lock once only. */ ret = -ENOMEM; - if (pte_alloc(mm, pmd)) + if (pte_alloc_get(mm, pmd)) goto out; while (pages_to_write_in_pmd) { @@ -1816,6 +1857,7 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr, if (remaining_pages_total) goto more; ret = 0; + pte_put(mm, pmd, addr); out: *num = remaining_pages_total; return ret; @@ -2039,10 +2081,12 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr, } set_pte_at(mm, addr, pte, entry); + pte_get(pte_to_pmd(pte)); update_mmu_cache(vma, addr, pte); /* XXX: why not for insert_page? */ out_unlock: pte_unmap_unlock(pte, ptl); + pte_put(mm, pte_to_pmd(pte), addr); return VM_FAULT_NOPAGE; } @@ -2246,8 +2290,10 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd, pte_t *pte, *mapped_pte; spinlock_t *ptl; int err = 0; + unsigned int nr_get = 0; + unsigned long start_addr = addr; - mapped_pte = pte = pte_alloc_map_lock(mm, pmd, addr, &ptl); + mapped_pte = pte = pte_alloc_get_map_lock(mm, pmd, addr, &ptl); if (!pte) return -ENOMEM; arch_enter_lazy_mmu_mode(); @@ -2258,10 +2304,13 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd, break; } set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot))); + nr_get++; pfn++; } while (pte++, addr += PAGE_SIZE, addr != end); + pte_get_many(pmd, nr_get); arch_leave_lazy_mmu_mode(); pte_unmap_unlock(mapped_pte, ptl); + pte_put(mm, pmd, start_addr); return err; } @@ -2474,13 +2523,17 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd, pte_t *pte, *mapped_pte; int err = 0; spinlock_t *ptl; + unsigned int nr_put = 0; + unsigned int nr_get = 0; + unsigned long start = addr; if (create) { mapped_pte = pte = (mm == &init_mm) ? pte_alloc_kernel_track(pmd, addr, mask) : - pte_alloc_map_lock(mm, pmd, addr, &ptl); + pte_alloc_get_map_lock(mm, pmd, addr, &ptl); if (!pte) return -ENOMEM; + nr_put++; } else { mapped_pte = pte = (mm == &init_mm) ? 
pte_offset_kernel(pmd, addr) : @@ -2495,17 +2548,30 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd, do { if (create || !pte_none(*pte)) { err = fn(pte++, addr, data); - if (err) - break; + if (IS_ENABLED(CONFIG_FREE_USER_PTE) && + mm != &init_mm && !pte_none(*(pte-1))) + nr_get++; + } else if (!pte_none(*pte)) { + err = fn(pte++, addr, data); + if (IS_ENABLED(CONFIG_FREE_USER_PTE) && + mm != &init_mm && pte_none(*(pte-1))) + nr_put++; } + if (err) + break; } while (addr += PAGE_SIZE, addr != end); } *mask |= PGTBL_PTE_MODIFIED; arch_leave_lazy_mmu_mode(); - if (mm != &init_mm) + if (mm != &init_mm) { pte_unmap_unlock(mapped_pte, ptl); + pte_get_many(pmd, nr_get); + if (nr_put) + pte_put_many(mm, pmd, start, nr_put); + } + return err; } @@ -2529,6 +2595,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud, } do { next = pmd_addr_end(addr, end); +retry: if (pmd_none(*pmd) && !create) continue; if (WARN_ON_ONCE(pmd_leaf(*pmd))) @@ -2538,8 +2605,12 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud, continue; pmd_clear_bad(pmd); } + if (!create && !pte_try_get(mm, pmd)) + goto retry; err = apply_to_pte_range(mm, pmd, addr, next, fn, data, create, mask); + if (!create) + pte_put(mm, pmd, addr); if (err) break; } while (pmd++, addr = next, addr != end); @@ -3689,7 +3760,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) return VM_FAULT_SIGBUS; /* - * Use pte_alloc() instead of pte_alloc_map(). We can't run + * Use pte_alloc_get() instead of pte_alloc_map(). We can't run * pte_offset_map() on pmds where a huge pmd might be created * from a different thread. * @@ -3698,7 +3769,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) * * Here we only have mmap_read_lock(mm). */ - if (pte_alloc(vma->vm_mm, vmf->pmd)) + if (pte_alloc_get(vma->vm_mm, vmf->pmd)) return VM_FAULT_OOM; /* See comment in handle_pte_fault() */ @@ -3722,7 +3793,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) /* Deliver the page fault to userland, check inside PT lock */ if (userfaultfd_missing(vma)) { pte_unmap_unlock(vmf->pte, vmf->ptl); - return handle_userfault(vmf, VM_UFFD_MISSING); + ret = handle_userfault(vmf, VM_UFFD_MISSING); + goto put; } goto setpte; } @@ -3765,7 +3837,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) if (userfaultfd_missing(vma)) { pte_unmap_unlock(vmf->pte, vmf->ptl); put_page(page); - return handle_userfault(vmf, VM_UFFD_MISSING); + ret = handle_userfault(vmf, VM_UFFD_MISSING); + goto put; } inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES); @@ -3773,19 +3846,23 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) lru_cache_add_inactive_or_unevictable(page, vma); setpte: set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); + pte_get(vmf->pmd); /* No need to invalidate - it was non-present before */ update_mmu_cache(vma, vmf->address, vmf->pte); unlock: pte_unmap_unlock(vmf->pte, vmf->ptl); - return ret; + goto put; release: put_page(page); goto unlock; oom_free_page: put_page(page); oom: - return VM_FAULT_OOM; + ret = VM_FAULT_OOM; +put: + pte_put(vma->vm_mm, vmf->pmd, vmf->address); + return ret; } /* @@ -3813,7 +3890,7 @@ static vm_fault_t __do_fault(struct vm_fault *vmf) * unlock_page(B) * # flush A, B to clear the writeback */ - if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) { + if (!vmf->prealloc_pte) { vmf->prealloc_pte = pte_alloc_one(vma->vm_mm); if (!vmf->prealloc_pte) return VM_FAULT_OOM; @@ -3983,6 +4060,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf) return ret; } +retry: if 
(pmd_none(*vmf->pmd)) { if (PageTransCompound(page)) { ret = do_set_pmd(vmf, page); @@ -3991,9 +4069,11 @@ vm_fault_t finish_fault(struct vm_fault *vmf) } if (vmf->prealloc_pte) - pmd_install(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte); - else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) + pmd_install_get(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte); + else if (unlikely(pte_alloc_get(vma->vm_mm, vmf->pmd))) return VM_FAULT_OOM; + } else if (!pte_try_get(vma->vm_mm, vmf->pmd)) { + goto retry; } /* See comment in handle_pte_fault() */ @@ -4004,13 +4084,16 @@ vm_fault_t finish_fault(struct vm_fault *vmf) vmf->address, &vmf->ptl); ret = 0; /* Re-check under ptl */ - if (likely(pte_none(*vmf->pte))) + if (likely(pte_none(*vmf->pte))) { do_set_pte(vmf, page, vmf->address); - else + pte_get(vmf->pmd); + } else { ret = VM_FAULT_NOPAGE; + } update_mmu_tlb(vma, vmf->address, vmf->pte); pte_unmap_unlock(vmf->pte, vmf->ptl); + pte_put(vma->vm_mm, vmf->pmd, vmf->address); return ret; } @@ -4232,9 +4315,15 @@ static vm_fault_t do_fault(struct vm_fault *vmf) * If we find a migration pmd entry or a none pmd entry, which * should never happen, return SIGBUS */ - if (unlikely(!pmd_present(*vmf->pmd))) + if (unlikely(!pmd_present(*vmf->pmd))) { ret = VM_FAULT_SIGBUS; - else { + goto out; + } else { + if (!pte_try_get(vma->vm_mm, vmf->pmd)) { + ret = VM_FAULT_SIGBUS; + goto out; + } + vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address, @@ -4252,6 +4341,7 @@ static vm_fault_t do_fault(struct vm_fault *vmf) ret = VM_FAULT_NOPAGE; pte_unmap_unlock(vmf->pte, vmf->ptl); + pte_put(vma->vm_mm, vmf->pmd, vmf->address); } } else if (!(vmf->flags & FAULT_FLAG_WRITE)) ret = do_read_fault(vmf); @@ -4265,6 +4355,7 @@ static vm_fault_t do_fault(struct vm_fault *vmf) pte_free(vm_mm, vmf->prealloc_pte); vmf->prealloc_pte = NULL; } +out: return ret; } @@ -4460,10 +4551,12 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud) static vm_fault_t handle_pte_fault(struct vm_fault *vmf) { pte_t entry; + vm_fault_t ret; - if (unlikely(pmd_none(*vmf->pmd))) { +retry: + if (unlikely(pmd_none(READ_ONCE(*vmf->pmd)))) { /* - * Leave __pte_alloc() until later: because vm_ops->fault may + * Leave __pte_alloc_get() until later: because vm_ops->fault may * want to allocate huge page, and if we expose page table * for an instant, it will be difficult to retract from * concurrent faults and from rmap lookups. 
@@ -4484,6 +4577,13 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) */ if (pmd_devmap_trans_unstable(vmf->pmd)) return 0; + + if (!pte_try_get(vmf->vma->vm_mm, vmf->pmd)) + goto retry; + + if (IS_ENABLED(CONFIG_FREE_USER_PTE)) + vmf->flags |= FAULT_FLAG_PTE_GET; + /* * A regular pmd is established and it can't morph into a huge * pmd from under us anymore at this point because we hold the @@ -4505,6 +4605,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) if (pte_none(vmf->orig_pte)) { pte_unmap(vmf->pte); vmf->pte = NULL; + pte_put_vmf(vmf); } } @@ -4515,11 +4616,15 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) return do_fault(vmf); } - if (!pte_present(vmf->orig_pte)) - return do_swap_page(vmf); + if (!pte_present(vmf->orig_pte)) { + ret = do_swap_page(vmf); + goto put; + } - if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma)) - return do_numa_page(vmf); + if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma)) { + ret = do_numa_page(vmf); + goto put; + } vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd); spin_lock(vmf->ptl); @@ -4529,8 +4634,10 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) goto unlock; } if (vmf->flags & FAULT_FLAG_WRITE) { - if (!pte_write(entry)) - return do_wp_page(vmf); + if (!pte_write(entry)) { + ret = do_wp_page(vmf); + goto put; + } entry = pte_mkdirty(entry); } entry = pte_mkyoung(entry); @@ -4552,7 +4659,10 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) } unlock: pte_unmap_unlock(vmf->pte, vmf->ptl); - return 0; + ret = 0; +put: + pte_put_vmf(vmf); + return ret; } /* @@ -4889,13 +4999,22 @@ int follow_invalidate_pte(struct mm_struct *mm, unsigned long address, (address & PAGE_MASK) + PAGE_SIZE); mmu_notifier_invalidate_range_start(range); } + if (!pte_try_get(mm, pmd)) + goto out; ptep = pte_offset_map_lock(mm, pmd, address, ptlp); if (!pte_present(*ptep)) goto unlock; + /* + * when we reach here, it means that the ->pte_refcount is at least + * one and the contents of the PTE page table are stable until @ptlp is + * released, so we can put pte safely. 
+ */ + pte_put(mm, pmd, address); *ptepp = ptep; return 0; unlock: pte_unmap_unlock(ptep, *ptlp); + pte_put(mm, pmd, address); if (range) mmu_notifier_invalidate_range_end(range); out: @@ -5022,6 +5141,7 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr, return -EINVAL; pte = *ptep; pte_unmap_unlock(ptep, ptl); + pte_put(vma->vm_mm, pte_to_pmd(ptep), addr); prot = pgprot_val(pte_pgprot(pte)); phys_addr = (resource_size_t)pte_pfn(pte) << PAGE_SHIFT; diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 9d3afa015fac..acef7df762a9 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -508,6 +508,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr, bool has_unmovable = false; pte_t *pte, *mapped_pte; spinlock_t *ptl; + unsigned long start = addr; ptl = pmd_trans_huge_lock(pmd, vma); if (ptl) { @@ -517,7 +518,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr, } /* THP was split, fall through to pte walk */ - if (pmd_trans_unstable(pmd)) + if (pmd_trans_unstable(pmd) || !pte_try_get(walk->mm, pmd)) return 0; mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); @@ -553,6 +554,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr, break; } pte_unmap_unlock(mapped_pte, ptl); + pte_put(walk->mm, pmd, start); cond_resched(); if (has_unmovable) diff --git a/mm/migrate.c b/mm/migrate.c index 7a03a61bbcd8..a3bcef1430c9 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2298,6 +2298,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, if (unlikely(pmd_bad(*pmdp))) return migrate_vma_collect_skip(start, end, walk); + if (!pte_try_get(mm, pmdp)) + goto again; ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl); arch_enter_lazy_mmu_mode(); @@ -2419,6 +2421,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, } arch_leave_lazy_mmu_mode(); pte_unmap_unlock(ptep - 1, ptl); + pte_put(mm, pmdp, start); /* Only flush the TLB if we actually modified any entries */ if (unmapped) @@ -2826,7 +2829,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate, goto abort; /* - * Use pte_alloc() instead of pte_alloc_map(). We can't run + * Use pte_alloc_get() instead of pte_alloc_map(). We can't run * pte_offset_map() on pmds where a huge pmd might be created * from a different thread. * @@ -2835,7 +2838,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate, * * Here we only have mmap_read_lock(mm). 
*/ - if (pte_alloc(mm, pmdp)) + if (pte_alloc_get(mm, pmdp)) goto abort; /* See the comment in pte_alloc_one_map() */ @@ -2843,9 +2846,9 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate, goto abort; if (unlikely(anon_vma_prepare(vma))) - goto abort; + goto put; if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL)) - goto abort; + goto put; /* * The memory barrier inside __SetPageUptodate makes sure that @@ -2914,15 +2917,19 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate, } else { /* No need to invalidate - it was non-present before */ set_pte_at(mm, addr, ptep, entry); + pte_get(pmdp); update_mmu_cache(vma, addr, ptep); } pte_unmap_unlock(ptep, ptl); + pte_put(mm, pmdp, addr); *src = MIGRATE_PFN_MIGRATE; return; unlock_abort: pte_unmap_unlock(ptep, ptl); +put: + pte_put(mm, pmdp, addr); abort: *src &= ~MIGRATE_PFN_MIGRATE; } diff --git a/mm/mincore.c b/mm/mincore.c index 9122676b54d6..0401b526adee 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -18,6 +18,7 @@ #include #include #include +#include #include @@ -104,6 +105,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, pte_t *ptep; unsigned char *vec = walk->private; int nr = (end - addr) >> PAGE_SHIFT; + unsigned long start = addr; ptl = pmd_trans_huge_lock(pmd, vma); if (ptl) { @@ -112,7 +114,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, goto out; } - if (pmd_trans_unstable(pmd)) { + if (pmd_trans_unstable(pmd) || !pte_try_get(walk->mm, pmd)) { __mincore_unmapped_range(addr, end, vma, vec); goto out; } @@ -148,6 +150,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, vec++; } pte_unmap_unlock(ptep - 1, ptl); + pte_put(walk->mm, pmd, start); out: walk->private += nr; cond_resched(); diff --git a/mm/mlock.c b/mm/mlock.c index e263d62ae2d0..a4ef20ba9627 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -398,6 +398,7 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec, break; } pte_unmap_unlock(pte, ptl); + pte_put(vma->vm_mm, pte_to_pmd(pte), start); return start; } diff --git a/mm/mprotect.c b/mm/mprotect.c index 883e2cc85cad..53b412423ee8 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -234,9 +234,12 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pmd = pmd_offset(pud, addr); do { unsigned long this_pages; + pmd_t pmdval; next = pmd_addr_end(addr, end); +retry: + pmdval = READ_ONCE(*pmd); /* * Automatic NUMA balancing walks the tables with mmap_lock * held for read. It's possible a parallel update to occur @@ -245,7 +248,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, * Hence, it's necessary to atomically read the PMD value * for all the checks. 
*/ - if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) && + if (!is_swap_pmd(pmdval) && !pmd_devmap(pmdval) && pmd_none_or_clear_bad_unless_trans_huge(pmd)) goto next; @@ -257,7 +260,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, mmu_notifier_invalidate_range_start(&range); } - if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) { + if (is_swap_pmd(pmdval) || pmd_trans_huge(pmdval) || pmd_devmap(pmdval)) { if (next - addr != HPAGE_PMD_SIZE) { __split_huge_pmd(vma, pmd, addr, false, NULL); } else { @@ -276,8 +279,11 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, } /* fall through, the trans huge pmd just split */ } + if (!pte_try_get(vma->vm_mm, pmd)) + goto retry; this_pages = change_pte_range(vma, pmd, addr, next, newprot, cp_flags); + pte_put(vma->vm_mm, pmd, addr); pages += this_pages; next: cond_resched(); diff --git a/mm/mremap.c b/mm/mremap.c index c0b6c41b7b78..461dcb2de18a 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -141,6 +141,9 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd, spinlock_t *old_ptl, *new_ptl; bool force_flush = false; unsigned long len = old_end - old_addr; + unsigned long old_start = old_addr; + unsigned int nr_put = 0; + unsigned int nr_get = 0; /* * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma @@ -181,6 +184,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd, continue; pte = ptep_get_and_clear(mm, old_addr, old_pte); + nr_put++; /* * If we are remapping a valid PTE, make sure * to flush TLB before we drop the PTL for the @@ -197,7 +201,9 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd, pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr); pte = move_soft_dirty_pte(pte); set_pte_at(mm, new_addr, new_pte, pte); + nr_get++; } + pte_get_many(new_pmd, nr_get); arch_leave_lazy_mmu_mode(); if (force_flush) @@ -206,6 +212,8 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd, spin_unlock(new_ptl); pte_unmap(new_pte - 1); pte_unmap_unlock(old_pte - 1, old_ptl); + if (nr_put) + pte_put_many(mm, old_pmd, old_start, nr_put); if (need_rmap_locks) drop_rmap_locks(vma); } @@ -271,6 +279,7 @@ static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr, VM_BUG_ON(!pmd_none(*new_pmd)); pmd_populate(mm, new_pmd, pmd_pgtable(pmd)); + pte_migrate_pmd(pmd, new_pmd); flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE); if (new_ptl != old_ptl) spin_unlock(new_ptl); @@ -548,10 +557,11 @@ unsigned long move_page_tables(struct vm_area_struct *vma, continue; } - if (pte_alloc(new_vma->vm_mm, new_pmd)) + if (pte_alloc_get(new_vma->vm_mm, new_pmd)) break; move_ptes(vma, old_pmd, old_addr, old_addr + extent, new_vma, new_pmd, new_addr, need_rmap_locks); + pte_put(new_vma->vm_mm, new_pmd, new_addr); } mmu_notifier_invalidate_range_end(&range); diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index f7b331081791..eb84fa5825c0 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -211,6 +211,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) } pvmw->pmd = pmd_offset(pud, pvmw->address); +retry: /* * Make sure the pmd value isn't cached in a register by the * compiler and used as a stale value after we've observed a @@ -258,6 +259,8 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) step_forward(pvmw, PMD_SIZE); continue; } + if (!pte_try_get(pvmw->vma->vm_mm, pvmw->pmd)) + goto retry; if (!map_pte(pvmw)) goto next_pte; this_pte: @@ -275,6 +278,7 @@ bool 
page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) pvmw->ptl = NULL; } pte_unmap(pvmw->pte); + pte_put(pvmw->vma->vm_mm, pvmw->pmd, pvmw->address); pvmw->pte = NULL; goto restart; } diff --git a/mm/pagewalk.c b/mm/pagewalk.c index 9b3db11a4d1d..da1324021429 100644 --- a/mm/pagewalk.c +++ b/mm/pagewalk.c @@ -3,6 +3,7 @@ #include #include #include +#include /* * We want to know the real level where a entry is located ignoring any @@ -110,6 +111,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, do { again: next = pmd_addr_end(addr, end); +retry: if (pmd_none(*pmd) || (!walk->vma && !walk->no_vma)) { if (ops->pte_hole) err = ops->pte_hole(addr, next, depth, walk); @@ -147,10 +149,18 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, goto again; } - if (is_hugepd(__hugepd(pmd_val(*pmd)))) + if (is_hugepd(__hugepd(pmd_val(*pmd)))) { err = walk_hugepd_range((hugepd_t *)pmd, addr, next, walk, PMD_SHIFT); - else - err = walk_pte_range(pmd, addr, next, walk); + } else { + if (!walk->no_vma) { + if (!pte_try_get(walk->mm, pmd)) + goto retry; + err = walk_pte_range(pmd, addr, next, walk); + pte_put(walk->mm, pmd, addr); + } else { + err = walk_pte_range(pmd, addr, next, walk); + } + } if (err) break; } while (pmd++, addr = next, addr != end); diff --git a/mm/pte_ref.c b/mm/pte_ref.c index 630704ae26db..dfbf817b7367 100644 --- a/mm/pte_ref.c +++ b/mm/pte_ref.c @@ -74,11 +74,6 @@ void pmd_install_get(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte) spin_unlock(ptl); } -void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte) -{ - pmd_install_get(mm, pmd, pte); -} - int __pte_alloc_get(struct mm_struct *mm, pmd_t *pmd) { pgtable_t new = pte_alloc_one(mm); diff --git a/mm/rmap.c b/mm/rmap.c index 09c41e1f44d8..eac31f3bca05 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1402,6 +1402,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, bool ret = true; struct mmu_notifier_range range; enum ttu_flags flags = (enum ttu_flags)(long)arg; + unsigned int nr_put = 0; /* * When racing against e.g. 
zap_pte_range() on another cpu, @@ -1549,6 +1550,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, /* We have to invalidate as we cleared the pte */ mmu_notifier_invalidate_range(mm, address, address + PAGE_SIZE); + nr_put++; } else if (PageAnon(page)) { swp_entry_t entry = { .val = page_private(subpage) }; pte_t swp_pte; @@ -1562,6 +1564,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, /* We have to invalidate as we cleared the pte */ mmu_notifier_invalidate_range(mm, address, address + PAGE_SIZE); + nr_put++; page_vma_mapped_walk_done(&pvmw); break; } @@ -1573,6 +1576,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, mmu_notifier_invalidate_range(mm, address, address + PAGE_SIZE); dec_mm_counter(mm, MM_ANONPAGES); + nr_put++; goto discard; } @@ -1628,6 +1632,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, * See Documentation/vm/mmu_notifier.rst */ dec_mm_counter(mm, mm_counter_file(page)); + nr_put++; } discard: /* @@ -1641,6 +1646,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, put_page(page); } + if (nr_put) + pte_put_many(mm, pvmw.pmd, address, nr_put); + mmu_notifier_invalidate_range_end(&range); return ret; @@ -1702,6 +1710,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, bool ret = true; struct mmu_notifier_range range; enum ttu_flags flags = (enum ttu_flags)(long)arg; + unsigned int nr_put = 0; /* * When racing against e.g. zap_pte_range() on another cpu, @@ -1865,6 +1874,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, /* We have to invalidate as we cleared the pte */ mmu_notifier_invalidate_range(mm, address, address + PAGE_SIZE); + nr_put++; } else { swp_entry_t entry; pte_t swp_pte; @@ -1911,6 +1921,9 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, put_page(page); } + if (nr_put) + pte_put_many(mm, pvmw.pmd, address, nr_put); + mmu_notifier_invalidate_range_end(&range); return ret; diff --git a/mm/swapfile.c b/mm/swapfile.c index e3dcaeecc50f..10ebfc94208a 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -2021,10 +2021,12 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud, do { cond_resched(); next = pmd_addr_end(addr, end); - if (pmd_none_or_trans_huge_or_clear_bad(pmd)) + if (pmd_none_or_trans_huge_or_clear_bad(pmd) || + !pte_try_get(vma->vm_mm, pmd)) continue; ret = unuse_pte_range(vma, pmd, addr, next, type, frontswap, fs_pages_to_unuse); + pte_put(vma->vm_mm, pmd, addr); if (ret) return ret; } while (pmd++, addr = next, addr != end); diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 36e5f6ab976f..7661929b27bf 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -111,6 +111,7 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, lru_cache_add_inactive_or_unevictable(page, dst_vma); set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); + pte_get(dst_pmd); /* No need to invalidate - it was non-present before */ update_mmu_cache(dst_vma, dst_addr, dst_pte); @@ -205,6 +206,7 @@ static int mfill_zeropage_pte(struct mm_struct *dst_mm, if (!pte_none(*dst_pte)) goto out_unlock; set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); + pte_get(dst_pmd); /* No need to invalidate - it was non-present before */ update_mmu_cache(dst_vma, dst_addr, dst_pte); ret = 0; @@ -589,12 +591,15 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, break; } if 
(unlikely(pmd_none(dst_pmdval)) && - unlikely(__pte_alloc(dst_mm, dst_pmd))) { + unlikely(__pte_alloc_get(dst_mm, dst_pmd))) { err = -ENOMEM; break; - } - /* If an huge pmd materialized from under us fail */ - if (unlikely(pmd_trans_huge(*dst_pmd))) { + } else if (unlikely(pmd_trans_huge(*dst_pmd)) || + !pte_try_get(dst_mm, dst_pmd)) { + /* + * If an huge pmd materialized from under us fail or the + * pte populated in the dst_pmd was freed. + */ err = -EFAULT; break; } @@ -604,6 +609,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, err = mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr, src_addr, &page, mcopy_mode, wp_copy); + pte_put(dst_mm, dst_pmd, dst_addr); cond_resched(); if (unlikely(err == -ENOENT)) {

From patchwork Thu Aug 19 03:18:56 2021
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, Qi Zheng
Subject: [PATCH v2 7/9] mm: add THP support for pte_ref
Date: Thu, 19 Aug 2021 11:18:56 +0800
Message-Id: <20210819031858.98043-8-zhengqi.arch@bytedance.com>
In-Reply-To: <20210819031858.98043-1-zhengqi.arch@bytedance.com>
References: <20210819031858.98043-1-zhengqi.arch@bytedance.com>

The situation becomes more complex with THP enabled. In handle_pte_fault(), the pmd entry can be populated with either a PTE page table or a THP page, so we can no longer always take ->pte_refcount successfully. Add the following helpers to handle this case:

	pte_alloc_try_get()
	pmd_install_try_get()

When a THP is split into a PTE page table, we initialize ->pte_refcount to HPAGE_PMD_NR, so that the PTE page table can be freed normally through pte_put().
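[Editor's note: to make the new convention concrete, here is a minimal caller-side sketch, not part of the patch; example_anon_fault() is a hypothetical name, and the flow mirrors the do_anonymous_page() and finish_fault() hunks below.]

	/*
	 * Sketch only: 1 means the PTE page table exists and a ->pte_refcount
	 * reference is now held; 0 means a huge pmd was concurrently installed
	 * (nothing to do at PTE level); a negative value means the PTE page
	 * table could not be allocated.
	 */
	static vm_fault_t example_anon_fault(struct vm_fault *vmf)
	{
		struct mm_struct *mm = vmf->vma->vm_mm;
		int ret = pte_alloc_try_get(mm, vmf->pmd);

		if (ret < 0)
			return VM_FAULT_OOM;	/* allocation failure */
		if (ret == 0)
			return 0;		/* huge pmd materialized under us */

		/* ... map the page while the reference is held ... */

		pte_put(mm, vmf->pmd, vmf->address);	/* drop the reference */
		return 0;
	}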
Signed-off-by: Qi Zheng --- fs/proc/task_mmu.c | 14 +++------- include/linux/pte_ref.h | 46 +++++++++++++++++++++++++++++-- mm/Kconfig | 2 +- mm/filemap.c | 11 ++++---- mm/khugepaged.c | 9 ++++++ mm/madvise.c | 9 ++---- mm/memcontrol.c | 7 ++--- mm/memory.c | 43 +++++++++++++++-------------- mm/mempolicy.c | 2 +- mm/migrate.c | 7 +---- mm/mincore.c | 2 +- mm/pgtable-generic.c | 2 ++ mm/pte_ref.c | 73 +++++++++++++++++++++++++++++++++++++------------ mm/sparse-vmemmap.c | 2 +- mm/swapfile.c | 3 +- mm/userfaultfd.c | 10 ++++--- 16 files changed, 159 insertions(+), 83 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 85ee730ff6ae..b31915696210 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -583,7 +583,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, goto out; } - if (pmd_trans_unstable(pmd) || !pte_try_get(vma->vm_mm, pmd)) + if (pmd_trans_unstable_or_pte_try_get(vma->vm_mm, pmd)) goto out; /* * The mmap_lock held all the way back in m_start() is what @@ -1146,7 +1146,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr, return 0; } - if (pmd_trans_unstable(pmd) || !pte_try_get(vma->vm_mm, pmd)) + if (pmd_trans_unstable_or_pte_try_get(vma->vm_mm, pmd)) return 0; pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); @@ -1475,12 +1475,9 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end, spin_unlock(ptl); return err; } - - if (pmd_trans_unstable(pmdp)) - return 0; #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ - if (!pte_try_get(walk->mm, pmdp)) + if (pmd_trans_unstable_or_pte_try_get(walk->mm, pmdp)) return 0; /* @@ -1818,11 +1815,8 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr, spin_unlock(ptl); return 0; } - - if (pmd_trans_unstable(pmd)) - return 0; #endif - if (!pte_try_get(walk->mm, pmd)) + if (pmd_trans_unstable_or_pte_try_get(walk->mm, pmd)) return 0; orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h index c2389d03bb59..04cca9427270 100644 --- a/include/linux/pte_ref.h +++ b/include/linux/pte_ref.h @@ -14,8 +14,9 @@ #include int __pte_alloc(struct mm_struct *mm, pmd_t *pmd); -void pmd_install_get(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte); +bool pmd_install_try_get(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte); int __pte_alloc_get(struct mm_struct *mm, pmd_t *pmd); +int __pte_alloc_try_get(struct mm_struct *mm, pmd_t *pmd); #define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd)) @@ -95,7 +96,8 @@ static inline bool pte_try_get(struct mm_struct *mm, pmd_t *pmdp) spinlock_t *ptl; ptl = pmd_lock(mm, pmdp); - if (!pmd_present(*pmdp) || !pte_get_unless_zero(pmdp)) + if (pmd_leaf(*pmdp) || !pmd_present(*pmdp) || + !pte_get_unless_zero(pmdp)) retval = false; spin_unlock(ptl); @@ -174,6 +176,30 @@ static inline int pte_alloc_get(struct mm_struct *mm, pmd_t *pmdp) #define pte_alloc_get_map_lock(mm, pmd, address, ptlp) \ (pte_alloc_get(mm, pmd) ? \ NULL : pte_offset_map_lock(mm, pmd, address, ptlp)) + +/* + * pte_alloc_try_get - allocate a PTE page table if the pmd entry is none, and + * try to hold a ->pte_refcount + * @mm: the mm_struct of the target address space. + * @pmdp: a pointer to the pmd entry corresponding to the PTE page table. + * + * Return 1 if the ->pte_refcount is successfully held, return 0 if the pmd + * entry is populated with a THP page by another thread, Otherwise return + * -ENOMEM. 
+ */ +static inline int pte_alloc_try_get(struct mm_struct *mm, pmd_t *pmdp) +{ + if (!pte_try_get(mm, pmdp)) + return __pte_alloc_try_get(mm, pmdp); + return 1; +} + +static inline bool pmd_trans_unstable_or_pte_try_get(struct mm_struct *mm, pmd_t *pmdp) +{ + if (!pte_try_get(mm, pmdp)) + return true; + return false; +} #else static inline void pte_ref_init(pgtable_t pte, pmd_t *pmd, int count) { @@ -229,6 +255,22 @@ static inline int pte_alloc_get(struct mm_struct *mm, pmd_t *pmdp) #define pte_alloc_get_map_lock(mm, pmd, address, ptlp) \ pte_alloc_map_lock(mm, pmd, address, ptlp) + +static inline int pte_alloc_try_get(struct mm_struct *mm, pmd_t *pmdp) +{ + if (unlikely(pmd_none(*pmdp))) + return __pte_alloc_try_get(mm, pmdp); + if (unlikely(pmd_devmap_trans_unstable(pmdp))) + return 0; + return 1; +} + +static inline bool pmd_trans_unstable_or_pte_try_get(struct mm_struct *mm, pmd_t *pmdp) +{ + if (pmd_trans_unstable(pmdp)) + return true; + return false; +} #endif /* CONFIG_FREE_USER_PTE */ #endif diff --git a/mm/Kconfig b/mm/Kconfig index fb16ecebcb80..18aea3d0ce49 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -896,7 +896,7 @@ config SECRETMEM config FREE_USER_PTE def_bool y - depends on X86_64 && !TRANSPARENT_HUGEPAGE + depends on X86_64 source "mm/damon/Kconfig" diff --git a/mm/filemap.c b/mm/filemap.c index 7f2eb64e5c76..8bcab6fc5d17 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -3211,12 +3211,13 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page) } } - if (IS_ENABLED(CONFIG_FREE_USER_PTE) || pmd_none(*vmf->pmd)) - pmd_install_get(mm, vmf->pmd, &vmf->prealloc_pte); - - /* See comment in handle_pte_fault() */ - if (pmd_devmap_trans_unstable(vmf->pmd)) + if (IS_ENABLED(CONFIG_FREE_USER_PTE) || pmd_none(*vmf->pmd)) { + if (!pmd_install_try_get(mm, vmf->pmd, &vmf->prealloc_pte)) + goto out; + } else if (pmd_devmap_trans_unstable(vmf->pmd)) { + /* See comment in handle_pte_fault() */ goto out; + } return false; out: diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 6b9c98ddcd09..95d90c896580 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1240,6 +1240,10 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, } memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load)); + if (!pte_try_get(mm, pmd)) { + result = SCAN_PMD_NULL; + goto out; + } pte = pte_offset_map_lock(mm, pmd, address, &ptl); for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++, _address += PAGE_SIZE) { @@ -1361,6 +1365,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, } out_unmap: pte_unmap_unlock(pte, ptl); + pte_put(mm, pmd, address); if (ret) { node = khugepaged_find_target_node(); /* collapse_huge_page will return with the mmap_lock released */ @@ -1463,6 +1468,8 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr) if (!pmd) goto drop_hpage; + if (!pte_try_get(mm, pmd)) + goto drop_hpage; start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl); /* step 1: check all mapped PTEs are to the right huge page */ @@ -1501,6 +1508,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr) } pte_unmap_unlock(start_pte, ptl); + pte_put(mm, pmd, haddr); /* step 3: set proper refcount and mm_counters. 
*/ if (count) { @@ -1522,6 +1530,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr) abort: pte_unmap_unlock(start_pte, ptl); + pte_put(mm, pmd, haddr); goto drop_hpage; } diff --git a/mm/madvise.c b/mm/madvise.c index 1494da73281c..1befb4e64f2b 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -191,8 +191,7 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start, struct vm_area_struct *vma = walk->private; unsigned long index; - if (pmd_none_or_trans_huge_or_clear_bad(pmd) || - !pte_try_get(vma->vm_mm, pmd)) + if (pmd_trans_unstable_or_pte_try_get(vma->vm_mm, pmd)) return 0; for (index = start; index != end; index += PAGE_SIZE) { @@ -392,10 +391,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, } regular_page: - if (pmd_trans_unstable(pmd)) - return 0; #endif - if (!pte_try_get(vma->vm_mm, pmd)) + if (pmd_trans_unstable_or_pte_try_get(vma->vm_mm, pmd)) return 0; tlb_change_page_size(tlb, PAGE_SIZE); @@ -595,7 +592,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next)) goto next; - if (pmd_trans_unstable(pmd) || !pte_try_get(mm, pmd)) + if (pmd_trans_unstable_or_pte_try_get(mm, pmd)) return 0; nr_put++; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index f7d203ce14af..56c580d37e94 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5810,7 +5810,7 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd, return 0; } - if (pmd_trans_unstable(pmd) || !pte_try_get(vma->vm_mm, pmd)) + if (pmd_trans_unstable_or_pte_try_get(vma->vm_mm, pmd)) return 0; pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); for (; addr != end; pte++, addr += PAGE_SIZE) @@ -6029,11 +6029,8 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd, spin_unlock(ptl); return 0; } - - if (pmd_trans_unstable(pmd)) - return 0; retry: - if (!pte_try_get(vma->vm_mm, pmd)) + if (pmd_trans_unstable_or_pte_try_get(vma->vm_mm, pmd)) return 0; pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); for (; addr != end; addr += PAGE_SIZE) { diff --git a/mm/memory.c b/mm/memory.c index 8fcef8b67971..99dde124755b 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -453,7 +453,7 @@ int __pte_alloc_kernel(pmd_t *pmd) spin_lock(&init_mm.page_table_lock); if (likely(pmd_none(*pmd))) { /* Has another populated it ? */ - smp_wmb(); /* See comment in pmd_install_get() */ + smp_wmb(); /* See comment in __pmd_install() */ pmd_populate_kernel(&init_mm, pmd, new); new = NULL; } @@ -1478,8 +1478,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb, * because MADV_DONTNEED holds the mmap_lock in read * mode. */ - if (pmd_none_or_trans_huge_or_clear_bad(pmd) || - !pte_try_get(tlb->mm, pmd)) + if (pmd_trans_unstable_or_pte_try_get(tlb->mm, pmd)) goto next; next = zap_pte_range(tlb, vma, pmd, addr, next, details); pte_put(tlb->mm, pmd, addr); @@ -3760,7 +3759,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) return VM_FAULT_SIGBUS; /* - * Use pte_alloc_get() instead of pte_alloc_map(). We can't run + * Use pte_alloc_try_get() instead of pte_alloc_get_map(). We can't run * pte_offset_map() on pmds where a huge pmd might be created * from a different thread. * @@ -3769,12 +3768,10 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) * * Here we only have mmap_read_lock(mm). 
*/ - if (pte_alloc_get(vma->vm_mm, vmf->pmd)) - return VM_FAULT_OOM; - - /* See comment in handle_pte_fault() */ - if (unlikely(pmd_trans_unstable(vmf->pmd))) - return 0; + ret = pte_alloc_try_get(vma->vm_mm, vmf->pmd); + if (ret <= 0) + return ret < 0 ? VM_FAULT_OOM : 0; + ret = 0; /* Use the zero-page for reads */ if (!(vmf->flags & FAULT_FLAG_WRITE) && @@ -4068,18 +4065,22 @@ vm_fault_t finish_fault(struct vm_fault *vmf) return ret; } - if (vmf->prealloc_pte) - pmd_install_get(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte); - else if (unlikely(pte_alloc_get(vma->vm_mm, vmf->pmd))) - return VM_FAULT_OOM; + if (vmf->prealloc_pte) { + if (!pmd_install_try_get(vma->vm_mm, vmf->pmd, + &vmf->prealloc_pte)) + return 0; + } else { + ret = pte_alloc_try_get(vma->vm_mm, vmf->pmd); + if (ret <= 0) + return ret < 0 ? VM_FAULT_OOM : 0; + } + } else if (pmd_devmap_trans_unstable(vmf->pmd)) { + /* See comment in handle_pte_fault() */ + return 0; } else if (!pte_try_get(vma->vm_mm, vmf->pmd)) { goto retry; } - /* See comment in handle_pte_fault() */ - if (pmd_devmap_trans_unstable(vmf->pmd)) - return 0; - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); ret = 0; @@ -4890,7 +4891,7 @@ int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address) if (pgd_present(*pgd)) /* Another has populated it */ p4d_free(mm, new); else { - smp_wmb(); /* See comment in pmd_install_get() */ + smp_wmb(); /* See comment in __pmd_install() */ pgd_populate(mm, pgd, new); } spin_unlock(&mm->page_table_lock); @@ -4912,7 +4913,7 @@ int __pud_alloc(struct mm_struct *mm, p4d_t *p4d, unsigned long address) spin_lock(&mm->page_table_lock); if (!p4d_present(*p4d)) { mm_inc_nr_puds(mm); - smp_wmb(); /* See comment in pmd_install_get() */ + smp_wmb(); /* See comment in __pmd_install() */ p4d_populate(mm, p4d, new); } else /* Another has populated it */ pud_free(mm, new); @@ -4936,7 +4937,7 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address) ptl = pud_lock(mm, pud); if (!pud_present(*pud)) { mm_inc_nr_pmds(mm); - smp_wmb(); /* See comment in pmd_install_get() */ + smp_wmb(); /* See comment in __pmd_install() */ pud_populate(mm, pud, new); } else /* Another has populated it */ pmd_free(mm, new); diff --git a/mm/mempolicy.c b/mm/mempolicy.c index acef7df762a9..9d0493f80a75 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -518,7 +518,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr, } /* THP was split, fall through to pte walk */ - if (pmd_trans_unstable(pmd) || !pte_try_get(walk->mm, pmd)) + if (pmd_trans_unstable_or_pte_try_get(walk->mm, pmd)) return 0; mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); diff --git a/mm/migrate.c b/mm/migrate.c index a3bcef1430c9..af5b8900551b 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2829,7 +2829,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate, goto abort; /* - * Use pte_alloc_get() instead of pte_alloc_map(). We can't run + * Use pte_alloc_try_get() instead of pte_alloc_get_map(). We can't run * pte_offset_map() on pmds where a huge pmd might be created * from a different thread. 
* @@ -2841,12 +2841,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate, if (pte_alloc_get(mm, pmdp)) goto abort; - /* See the comment in pte_alloc_one_map() */ - if (unlikely(pmd_trans_unstable(pmdp))) - goto abort; - if (unlikely(anon_vma_prepare(vma))) - goto put; if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL)) goto put; diff --git a/mm/mincore.c b/mm/mincore.c index 0401b526adee..a72ec90dd54f 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -114,7 +114,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, goto out; } - if (pmd_trans_unstable(pmd) || !pte_try_get(walk->mm, pmd)) { + if (pmd_trans_unstable_or_pte_try_get(walk->mm, pmd)) { __mincore_unmapped_range(addr, end, vma, vec); goto out; } diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index 4e640baf9794..88464c54619f 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -10,6 +10,7 @@ #include #include #include +#include #include /* @@ -186,6 +187,7 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp) struct page, lru); if (pmd_huge_pte(mm, pmdp)) list_del(&pgtable->lru); + pte_ref_init(pgtable, pmdp, HPAGE_PMD_NR); return pgtable; } #endif diff --git a/mm/pte_ref.c b/mm/pte_ref.c index dfbf817b7367..dff32909c7c4 100644 --- a/mm/pte_ref.c +++ b/mm/pte_ref.c @@ -42,6 +42,28 @@ void free_pte_table(struct mm_struct *mm, pmd_t *pmdp, unsigned long addr) pte_free(mm, pmd_pgtable(pmd)); } +static inline void __pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte) +{ + mm_inc_nr_ptes(mm); + /* + * Ensure all pte setup (eg. pte page lock and page clearing) are + * visible before the pte is made visible to other CPUs by being + * put into page tables. + * + * The other side of the story is the pointer chasing in the page + * table walking code (when walking the page table without locking; + * ie. most of the time). Fortunately, these data accesses consist + * of a chain of data-dependent loads, meaning most CPUs (alpha + * being the notable exception) will already guarantee loads are + * seen in-order. See the alpha page table accessors for the + * smp_rmb() barriers in page table walking code. + */ + smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */ + pmd_populate(mm, pmd, *pte); + pte_ref_init(*pte, pmd, 1); + *pte = NULL; +} + void pmd_install_get(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte) { spinlock_t *ptl; @@ -49,24 +71,7 @@ void pmd_install_get(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte) retry: ptl = pmd_lock(mm, pmd); if (likely(pmd_none(*pmd))) { - mm_inc_nr_ptes(mm); - /* - * Ensure all pte setup (eg. pte page lock and page clearing) are - * visible before the pte is made visible to other CPUs by being - * put into page tables. - * - * The other side of the story is the pointer chasing in the page - * table walking code (when walking the page table without locking; - * ie. most of the time). Fortunately, these data accesses consist - * of a chain of data-dependent loads, meaning most CPUs (alpha - * being the notable exception) will already guarantee loads are - * seen in-order. See the alpha page table accessors for the - * smp_rmb() barriers in page table walking code. 
- */ - smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */ - pmd_populate(mm, pmd, *pte); - pte_ref_init(*pte, pmd, 1); - *pte = NULL; + __pmd_install(mm, pmd, pte); } else if (!pte_get_unless_zero(pmd)) { spin_unlock(ptl); goto retry; @@ -74,6 +79,25 @@ void pmd_install_get(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte) spin_unlock(ptl); } +bool pmd_install_try_get(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte) +{ + spinlock_t *ptl; + bool retval = true; + +retry: + ptl = pmd_lock(mm, pmd); + if (likely(pmd_none(*pmd))) { + __pmd_install(mm, pmd, pte); + } else if (pmd_leaf(*pmd) || !pmd_present(*pmd)) { + retval = false; + } else if (!pte_get_unless_zero(pmd)) { + spin_unlock(ptl); + goto retry; + } + spin_unlock(ptl); + return retval; +} + int __pte_alloc_get(struct mm_struct *mm, pmd_t *pmd) { pgtable_t new = pte_alloc_one(mm); @@ -86,6 +110,19 @@ int __pte_alloc_get(struct mm_struct *mm, pmd_t *pmd) return 0; } +int __pte_alloc_try_get(struct mm_struct *mm, pmd_t *pmd) +{ + int retval; + pgtable_t new = pte_alloc_one(mm); + if (!new) + return -ENOMEM; + + retval = pmd_install_try_get(mm, pmd, &new); + if (new) + pte_free(mm, new); + return retval; +} + int __pte_alloc(struct mm_struct *mm, pmd_t *pmd) { return __pte_alloc_get(mm, pmd); diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c index 818c82ee3c6e..0ce288b979f1 100644 --- a/mm/sparse-vmemmap.c +++ b/mm/sparse-vmemmap.c @@ -76,7 +76,7 @@ static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start, set_pte_at(&init_mm, addr, pte, entry); } - /* Make pte visible before pmd. See comment in pmd_install_get(). */ + /* Make pte visible before pmd. See comment in __pmd_install(). */ smp_wmb(); pmd_populate_kernel(&init_mm, pmd, pgtable); diff --git a/mm/swapfile.c b/mm/swapfile.c index 10ebfc94208a..6db8381e1e19 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -2021,8 +2021,7 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud, do { cond_resched(); next = pmd_addr_end(addr, end); - if (pmd_none_or_trans_huge_or_clear_bad(pmd) || - !pte_try_get(vma->vm_mm, pmd)) + if (pmd_trans_unstable_or_pte_try_get(vma->vm_mm, pmd)) continue; ret = unuse_pte_range(vma, pmd, addr, next, type, frontswap, fs_pages_to_unuse); diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 7661929b27bf..cb76c6167541 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -572,6 +572,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, while (src_addr < src_start + len) { pmd_t dst_pmdval; + int ret = 1; BUG_ON(dst_addr >= dst_start + len); @@ -590,12 +591,13 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, err = -EEXIST; break; } - if (unlikely(pmd_none(dst_pmdval)) && - unlikely(__pte_alloc_get(dst_mm, dst_pmd))) { + if ((IS_ENABLED(CONFIG_FREE_USER_PTE) && + unlikely((ret = pte_alloc_try_get(dst_mm, dst_pmd)) < 0)) || + (unlikely(pmd_none(dst_pmdval)) && + unlikely((ret = __pte_alloc_try_get(dst_mm, dst_pmd)) < 0))) { err = -ENOMEM; break; - } else if (unlikely(pmd_trans_huge(*dst_pmd)) || - !pte_try_get(dst_mm, dst_pmd)) { + } else if (!ret || unlikely(pmd_trans_huge(*dst_pmd))) { /* * If an huge pmd materialized from under us fail or the * pte populated in the dst_pmd was freed. 
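[Editor's note: a side remark on the pgtable_trans_huge_withdraw() hunk in this patch. Initializing ->pte_refcount to HPAGE_PMD_NR means the page table deposited behind a THP starts with one reference per PTE slot, so the ordinary per-entry pte_put() calls after the split can bring the count to zero and free the table. A rough sketch of that accounting follows; example_put_split_table() is a hypothetical helper, the pte_put() interface is the one introduced earlier in this series.]

	/*
	 * Sketch only: after a THP split the PTE page table behind @pmdp
	 * carries HPAGE_PMD_NR references.  Dropping one reference per PTE
	 * slot is what eventually lets free_pte_table() run.
	 */
	static void example_put_split_table(struct mm_struct *mm, pmd_t *pmdp,
					    unsigned long haddr)
	{
		unsigned long addr;

		for (addr = haddr; addr != haddr + HPAGE_PMD_SIZE; addr += PAGE_SIZE)
			pte_put(mm, pmdp, addr);	/* the last put frees the table */
	}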
From patchwork Thu Aug 19 03:18:57 2021
X-Patchwork-Submitter: Qi Zheng
X-Patchwork-Id: 12446075
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, Qi Zheng
Subject: [PATCH v2 8/9] mm: free PTE page table by using rcu mechanism
Date: Thu, 19 Aug 2021 11:18:57 +0800
Message-Id: <20210819031858.98043-9-zhengqi.arch@bytedance.com>
In-Reply-To: <20210819031858.98043-1-zhengqi.arch@bytedance.com>
References: <20210819031858.98043-1-zhengqi.arch@bytedance.com>

With rcu_read_lock(), the release of the PTE page table can be deferred until the RCU grace period has elapsed. So we no longer need to hold the pmd lock when we do pte_try_get()/pte_alloc_get(), which improves performance and simplifies the code logic.
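[Editor's note: for orientation, a small sketch of the resulting lock-free pattern; example_walk_one() is a hypothetical name, and the pte_try_get()/pte_put() calls are the interfaces touched by this patch. The reader takes the reference with atomic_inc_not_zero() under rcu_read_lock(), and the freeing side defers the actual page release with call_rcu(), so no pmd lock is needed.]

	/*
	 * Sketch only: because free_pte_table() now releases the page via
	 * call_rcu(), a racing reader either gets a reference or sees a zero
	 * ->pte_refcount; the page itself stays valid until the grace period.
	 */
	static int example_walk_one(struct mm_struct *mm, pmd_t *pmdp,
				    unsigned long addr)
	{
		spinlock_t *ptl;
		pte_t *pte;

		if (!pte_try_get(pmdp))		/* rcu_read_lock() + inc_not_zero */
			return -EAGAIN;

		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
		/* ... the PTE page table cannot be freed while the ref is held ... */
		pte_unmap_unlock(pte, ptl);

		pte_put(mm, pmdp, addr);	/* the last put schedules the RCU free */
		return 0;
	}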
Signed-off-by: Qi Zheng --- fs/proc/task_mmu.c | 8 ++++---- include/linux/pte_ref.h | 34 ++++++++++++++++++---------------- mm/gup.c | 2 +- mm/hmm.c | 2 +- mm/khugepaged.c | 4 ++-- mm/ksm.c | 2 +- mm/madvise.c | 6 +++--- mm/memcontrol.c | 4 ++-- mm/memory.c | 14 +++++++------- mm/mempolicy.c | 2 +- mm/migrate.c | 2 +- mm/mincore.c | 2 +- mm/mprotect.c | 2 +- mm/page_vma_mapped.c | 2 +- mm/pagewalk.c | 2 +- mm/pte_ref.c | 10 +++++++++- mm/swapfile.c | 2 +- 17 files changed, 55 insertions(+), 45 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index b31915696210..f44caef03f22 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -583,7 +583,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, goto out; } - if (pmd_trans_unstable_or_pte_try_get(vma->vm_mm, pmd)) + if (pmd_trans_unstable_or_pte_try_get(pmd)) goto out; /* * The mmap_lock held all the way back in m_start() is what @@ -1146,7 +1146,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr, return 0; } - if (pmd_trans_unstable_or_pte_try_get(vma->vm_mm, pmd)) + if (pmd_trans_unstable_or_pte_try_get(pmd)) return 0; pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); @@ -1477,7 +1477,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end, } #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ - if (pmd_trans_unstable_or_pte_try_get(walk->mm, pmdp)) + if (pmd_trans_unstable_or_pte_try_get(pmdp)) return 0; /* @@ -1816,7 +1816,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr, return 0; } #endif - if (pmd_trans_unstable_or_pte_try_get(walk->mm, pmd)) + if (pmd_trans_unstable_or_pte_try_get(pmd)) return 0; orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h index 04cca9427270..259e5aec048d 100644 --- a/include/linux/pte_ref.h +++ b/include/linux/pte_ref.h @@ -90,16 +90,17 @@ static inline bool pte_get_unless_zero(pmd_t *pmdp) * Before Operating the PTE page table, we need to hold a ->pte_refcount * to protect against the concurrent release of the PTE page table. 
*/ -static inline bool pte_try_get(struct mm_struct *mm, pmd_t *pmdp) +static inline bool pte_try_get(pmd_t *pmdp) { bool retval = true; - spinlock_t *ptl; + pmd_t pmdval; - ptl = pmd_lock(mm, pmdp); - if (pmd_leaf(*pmdp) || !pmd_present(*pmdp) || - !pte_get_unless_zero(pmdp)) + rcu_read_lock(); + pmdval = READ_ONCE(*pmdp); + if (pmd_leaf(pmdval) || !pmd_present(pmdval) || + !pte_get_unless_zero(&pmdval)) retval = false; - spin_unlock(ptl); + rcu_read_unlock(); return retval; } @@ -159,14 +160,15 @@ static inline void pte_put_vmf(struct vm_fault *vmf) */ static inline int pte_alloc_get(struct mm_struct *mm, pmd_t *pmdp) { - spinlock_t *ptl; + pmd_t pmdval; - ptl = pmd_lock(mm, pmdp); - if (pmd_none(*pmdp) || !pte_get_unless_zero(pmdp)) { - spin_unlock(ptl); + rcu_read_lock(); + pmdval = READ_ONCE(*pmdp); + if (pmd_none(pmdval) || !pte_get_unless_zero(&pmdval)) { + rcu_read_unlock(); return __pte_alloc_get(mm, pmdp); } - spin_unlock(ptl); + rcu_read_unlock(); return 0; } @@ -189,14 +191,14 @@ static inline int pte_alloc_get(struct mm_struct *mm, pmd_t *pmdp) */ static inline int pte_alloc_try_get(struct mm_struct *mm, pmd_t *pmdp) { - if (!pte_try_get(mm, pmdp)) + if (!pte_try_get(pmdp)) return __pte_alloc_try_get(mm, pmdp); return 1; } -static inline bool pmd_trans_unstable_or_pte_try_get(struct mm_struct *mm, pmd_t *pmdp) +static inline bool pmd_trans_unstable_or_pte_try_get(pmd_t *pmdp) { - if (!pte_try_get(mm, pmdp)) + if (!pte_try_get(pmdp)) return true; return false; } @@ -227,7 +229,7 @@ static inline bool pte_get_unless_zero(pmd_t *pmdp) return true; } -static inline bool pte_try_get(struct mm_struct *mm, pmd_t *pmdp) +static inline bool pte_try_get(pmd_t *pmdp) { return true; } @@ -265,7 +267,7 @@ static inline int pte_alloc_try_get(struct mm_struct *mm, pmd_t *pmdp) return 1; } -static inline bool pmd_trans_unstable_or_pte_try_get(struct mm_struct *mm, pmd_t *pmdp) +static inline bool pmd_trans_unstable_or_pte_try_get(pmd_t *pmdp) { if (pmd_trans_unstable(pmdp)) return true; diff --git a/mm/gup.c b/mm/gup.c index 30757f3b176c..c987ac45d939 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -500,7 +500,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, if (unlikely(pmd_bad(*pmd))) return no_page_table(vma, flags); - if (!pte_try_get(mm, pmd)) + if (!pte_try_get(pmd)) return no_page_table(vma, flags); ptep = pte_offset_map_lock(mm, pmd, address, &ptl); diff --git a/mm/hmm.c b/mm/hmm.c index 29bb379510cc..d0e767c5fbb6 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -380,7 +380,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR); } - if (!pte_try_get(walk->mm, pmdp)) + if (!pte_try_get(pmdp)) goto again; ptep = pte_offset_map(pmdp, addr); diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 95d90c896580..f33db38eaafc 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1240,7 +1240,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, } memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load)); - if (!pte_try_get(mm, pmd)) { + if (!pte_try_get(pmd)) { result = SCAN_PMD_NULL; goto out; } @@ -1468,7 +1468,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr) if (!pmd) goto drop_hpage; - if (!pte_try_get(mm, pmd)) + if (!pte_try_get(pmd)) goto drop_hpage; start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl); diff --git a/mm/ksm.c b/mm/ksm.c index d0d72dd1eaf0..4a15418f1252 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -1138,7 +1138,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, if 
(!pmd) goto out; - if (!pte_try_get(mm, pmd)) + if (!pte_try_get(pmd)) goto out; mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm, addr, diff --git a/mm/madvise.c b/mm/madvise.c index 1befb4e64f2b..254811f41850 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -191,7 +191,7 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start, struct vm_area_struct *vma = walk->private; unsigned long index; - if (pmd_trans_unstable_or_pte_try_get(vma->vm_mm, pmd)) + if (pmd_trans_unstable_or_pte_try_get(pmd)) return 0; for (index = start; index != end; index += PAGE_SIZE) { @@ -392,7 +392,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, regular_page: #endif - if (pmd_trans_unstable_or_pte_try_get(vma->vm_mm, pmd)) + if (pmd_trans_unstable_or_pte_try_get(pmd)) return 0; tlb_change_page_size(tlb, PAGE_SIZE); @@ -592,7 +592,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next)) goto next; - if (pmd_trans_unstable_or_pte_try_get(mm, pmd)) + if (pmd_trans_unstable_or_pte_try_get(pmd)) return 0; nr_put++; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 56c580d37e94..956920f96191 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5810,7 +5810,7 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd, return 0; } - if (pmd_trans_unstable_or_pte_try_get(vma->vm_mm, pmd)) + if (pmd_trans_unstable_or_pte_try_get(pmd)) return 0; pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); for (; addr != end; pte++, addr += PAGE_SIZE) @@ -6030,7 +6030,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd, return 0; } retry: - if (pmd_trans_unstable_or_pte_try_get(vma->vm_mm, pmd)) + if (pmd_trans_unstable_or_pte_try_get(pmd)) return 0; pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); for (; addr != end; addr += PAGE_SIZE) { diff --git a/mm/memory.c b/mm/memory.c index 99dde124755b..6a7fe29d593b 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1142,7 +1142,7 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, } if (pmd_none_or_clear_bad(src_pmd)) continue; - if (!pte_try_get(src_mm, src_pmd)) + if (!pte_try_get(src_pmd)) goto retry; if (copy_pte_range(dst_vma, src_vma, dst_pmd, src_pmd, addr, next)) { @@ -1478,7 +1478,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb, * because MADV_DONTNEED holds the mmap_lock in read * mode. 
*/ - if (pmd_trans_unstable_or_pte_try_get(tlb->mm, pmd)) + if (pmd_trans_unstable_or_pte_try_get(pmd)) goto next; next = zap_pte_range(tlb, vma, pmd, addr, next, details); pte_put(tlb->mm, pmd, addr); @@ -2604,7 +2604,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud, continue; pmd_clear_bad(pmd); } - if (!create && !pte_try_get(mm, pmd)) + if (!create && !pte_try_get(pmd)) goto retry; err = apply_to_pte_range(mm, pmd, addr, next, fn, data, create, mask); @@ -4077,7 +4077,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf) } else if (pmd_devmap_trans_unstable(vmf->pmd)) { /* See comment in handle_pte_fault() */ return 0; - } else if (!pte_try_get(vma->vm_mm, vmf->pmd)) { + } else if (!pte_try_get(vmf->pmd)) { goto retry; } @@ -4320,7 +4320,7 @@ static vm_fault_t do_fault(struct vm_fault *vmf) ret = VM_FAULT_SIGBUS; goto out; } else { - if (!pte_try_get(vma->vm_mm, vmf->pmd)) { + if (!pte_try_get(vmf->pmd)) { ret = VM_FAULT_SIGBUS; goto out; } @@ -4579,7 +4579,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) if (pmd_devmap_trans_unstable(vmf->pmd)) return 0; - if (!pte_try_get(vmf->vma->vm_mm, vmf->pmd)) + if (!pte_try_get(vmf->pmd)) goto retry; if (IS_ENABLED(CONFIG_FREE_USER_PTE)) @@ -5000,7 +5000,7 @@ int follow_invalidate_pte(struct mm_struct *mm, unsigned long address, (address & PAGE_MASK) + PAGE_SIZE); mmu_notifier_invalidate_range_start(range); } - if (!pte_try_get(mm, pmd)) + if (!pte_try_get(pmd)) goto out; ptep = pte_offset_map_lock(mm, pmd, address, ptlp); if (!pte_present(*ptep)) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 9d0493f80a75..9a6d1c845a93 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -518,7 +518,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr, } /* THP was split, fall through to pte walk */ - if (pmd_trans_unstable_or_pte_try_get(walk->mm, pmd)) + if (pmd_trans_unstable_or_pte_try_get(pmd)) return 0; mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); diff --git a/mm/migrate.c b/mm/migrate.c index af5b8900551b..eb0da7fb7033 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2298,7 +2298,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, if (unlikely(pmd_bad(*pmdp))) return migrate_vma_collect_skip(start, end, walk); - if (!pte_try_get(mm, pmdp)) + if (!pte_try_get(pmdp)) goto again; ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl); arch_enter_lazy_mmu_mode(); diff --git a/mm/mincore.c b/mm/mincore.c index a72ec90dd54f..e47fe60c6e04 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -114,7 +114,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, goto out; } - if (pmd_trans_unstable_or_pte_try_get(walk->mm, pmd)) { + if (pmd_trans_unstable_or_pte_try_get(pmd)) { __mincore_unmapped_range(addr, end, vma, vec); goto out; } diff --git a/mm/mprotect.c b/mm/mprotect.c index 53b412423ee8..4673604c709e 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -279,7 +279,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, } /* fall through, the trans huge pmd just split */ } - if (!pte_try_get(vma->vm_mm, pmd)) + if (!pte_try_get(pmd)) goto retry; this_pages = change_pte_range(vma, pmd, addr, next, newprot, cp_flags); diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index eb84fa5825c0..c49bbff7aa60 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -259,7 +259,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) step_forward(pvmw, PMD_SIZE); continue; } - if (!pte_try_get(pvmw->vma->vm_mm, pvmw->pmd)) + if 
(!pte_try_get(pvmw->pmd)) goto retry; if (!map_pte(pvmw)) goto next_pte; diff --git a/mm/pagewalk.c b/mm/pagewalk.c index da1324021429..97cd4e726a2b 100644 --- a/mm/pagewalk.c +++ b/mm/pagewalk.c @@ -153,7 +153,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, err = walk_hugepd_range((hugepd_t *)pmd, addr, next, walk, PMD_SHIFT); } else { if (!walk->no_vma) { - if (!pte_try_get(walk->mm, pmd)) + if (!pte_try_get(pmd)) goto retry; err = walk_pte_range(pmd, addr, next, walk); pte_put(walk->mm, pmd, addr); diff --git a/mm/pte_ref.c b/mm/pte_ref.c index dff32909c7c4..ea40b1777056 100644 --- a/mm/pte_ref.c +++ b/mm/pte_ref.c @@ -26,6 +26,14 @@ static inline void pte_free_debug(pmd_t pmd) } #endif +static void pte_free_rcu(struct rcu_head *rcu) +{ + struct page *page = container_of(rcu, struct page, rcu_head); + + pgtable_pte_page_dtor(page); + __free_page(page); +} + void free_pte_table(struct mm_struct *mm, pmd_t *pmdp, unsigned long addr) { struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0); @@ -39,7 +47,7 @@ void free_pte_table(struct mm_struct *mm, pmd_t *pmdp, unsigned long addr) pte_free_debug(pmd); flush_tlb_range(&vma, addr, addr + PMD_SIZE); mm_dec_nr_ptes(mm); - pte_free(mm, pmd_pgtable(pmd)); + call_rcu(&pmd_pgtable(pmd)->rcu_head, pte_free_rcu); } static inline void __pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte) diff --git a/mm/swapfile.c b/mm/swapfile.c index 6db8381e1e19..47e95aceedd5 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -2021,7 +2021,7 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud, do { cond_resched(); next = pmd_addr_end(addr, end); - if (pmd_trans_unstable_or_pte_try_get(vma->vm_mm, pmd)) + if (pmd_trans_unstable_or_pte_try_get(pmd)) continue; ret = unuse_pte_range(vma, pmd, addr, next, type, frontswap, fs_pages_to_unuse);
From patchwork Thu Aug 19 03:18:58 2021
X-Patchwork-Submitter: Qi Zheng
X-Patchwork-Id: 12446077
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, Qi Zheng
Subject: [PATCH v2 9/9] mm: use mmu_gather to free PTE page table
Date: Thu, 19 Aug 2021 11:18:58 +0800
Message-Id: <20210819031858.98043-10-zhengqi.arch@bytedance.com>
In-Reply-To: <20210819031858.98043-1-zhengqi.arch@bytedance.com>
References: <20210819031858.98043-1-zhengqi.arch@bytedance.com>
In unmap_region() and other paths, we can reuse @tlb to free the PTE page table, which can reduce the number of TLB flushes.

Signed-off-by: Qi Zheng --- arch/x86/Kconfig | 2 +- include/linux/pte_ref.h | 32 +++++++++++++++++++++++++++----- mm/madvise.c | 4 ++-- mm/memory.c | 4 ++-- mm/mmu_gather.c | 40 +++++++++++++++++----------------------- mm/pte_ref.c | 12 +++++++++--- 6 files changed, 58 insertions(+), 36 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 45962aaf2b2c..fc7453826160 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -232,7 +232,7 @@ config X86 select HAVE_PCI select HAVE_PERF_REGS select HAVE_PERF_USER_STACK_DUMP - select MMU_GATHER_RCU_TABLE_FREE if PARAVIRT + select MMU_GATHER_RCU_TABLE_FREE if PARAVIRT || FREE_USER_PTE select HAVE_POSIX_CPU_TIMERS_TASK_WORK select HAVE_REGS_AND_STACK_ACCESS_API select HAVE_RELIABLE_STACKTRACE if X86_64 && (UNWINDER_FRAME_POINTER || UNWINDER_ORC) && STACK_VALIDATION diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h index 259e5aec048d..4b7ea1fe447f 100644 --- a/include/linux/pte_ref.h +++ b/include/linux/pte_ref.h @@ -29,7 +29,8 @@ int __pte_alloc_try_get(struct mm_struct *mm, pmd_t *pmd); #ifdef CONFIG_FREE_USER_PTE -void free_pte_table(struct mm_struct *mm, pmd_t *pmdp, unsigned long addr); +void free_pte_table(struct mmu_gather *tlb, struct mm_struct *mm, pmd_t *pmdp, + unsigned long addr); static inline void pte_ref_init(pgtable_t pte, pmd_t *pmd, int count) { @@ -76,7 +77,6 @@ static inline bool pte_get_unless_zero(pmd_t *pmdp) { pgtable_t pte = pmd_pgtable(*pmdp); - VM_BUG_ON(!PageTable(pte)); return atomic_inc_not_zero(&pte->pte_refcount); } @@ -105,14 +105,26 @@ static inline bool pte_try_get(pmd_t *pmdp) return retval; } -static inline void pte_put_many(struct mm_struct *mm, pmd_t *pmdp, - unsigned long addr, unsigned int nr) +static inline void pte_put_many_tlb(struct mmu_gather *tlb, struct mm_struct *mm, + pmd_t *pmdp, unsigned long addr, unsigned int nr) { pgtable_t pte = pmd_pgtable(*pmdp); VM_BUG_ON(!PageTable(pte)); if (atomic_sub_and_test(nr, &pte->pte_refcount)) - free_pte_table(mm, pmdp, addr & PMD_MASK); + free_pte_table(tlb, mm, pmdp, addr & PMD_MASK); +} + +static inline void pte_put_tlb(struct mmu_gather *tlb, struct mm_struct *mm, + pmd_t *pmdp, unsigned long addr) +{ + pte_put_many_tlb(tlb, mm, pmdp, addr, 1); +} + +static inline void pte_put_many(struct mm_struct *mm, pmd_t *pmdp, + unsigned long addr, unsigned int nr) +{ + pte_put_many_tlb(NULL, mm, pmdp, addr, nr); } /* @@ -234,6 +246,16 @@ static inline bool pte_try_get(pmd_t *pmdp) return true; } +static inline void pte_put_many_tlb(struct mmu_gather *tlb, struct mm_struct *mm, + pmd_t *pmdp, unsigned long addr, unsigned int nr) +{ +} + +static inline void pte_put_tlb(struct mmu_gather *tlb, struct mm_struct *mm, + pmd_t *pmdp, unsigned long addr) +{ +} + static inline void pte_put_many(struct mm_struct *mm, pmd_t *pmdp, unsigned long addr, unsigned int value) { diff --git a/mm/madvise.c b/mm/madvise.c index 254811f41850..6616c0567dcb 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -474,7 +474,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, arch_leave_lazy_mmu_mode(); pte_unmap_unlock(orig_pte, ptl); - pte_put(vma->vm_mm, pmd, start); + pte_put_tlb(tlb, vma->vm_mm, pmd, start); if (pageout) reclaim_pages(&page_list); cond_resched(); @@ -705,7 +705,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
long addr, arch_leave_lazy_mmu_mode(); pte_unmap_unlock(orig_pte, ptl); if (nr_put) - pte_put_many(mm, pmd, start, nr_put); + pte_put_many_tlb(tlb, mm, pmd, start, nr_put); cond_resched(); next: return 0; diff --git a/mm/memory.c b/mm/memory.c index 6a7fe29d593b..28639a75ce02 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1437,7 +1437,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, } if (nr_put) - pte_put_many(mm, pmd, start, nr_put); + pte_put_many_tlb(tlb, mm, pmd, start, nr_put); return addr; } @@ -1481,7 +1481,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb, if (pmd_trans_unstable_or_pte_try_get(pmd)) goto next; next = zap_pte_range(tlb, vma, pmd, addr, next, details); - pte_put(tlb->mm, pmd, addr); + pte_put_tlb(tlb, tlb->mm, pmd, addr); next: cond_resched(); } while (pmd++, addr = next, addr != end); diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c index 1b9837419bf9..1bd9fa889421 100644 --- a/mm/mmu_gather.c +++ b/mm/mmu_gather.c @@ -134,42 +134,42 @@ static void __tlb_remove_table_free(struct mmu_table_batch *batch) * */ -static void tlb_remove_table_smp_sync(void *arg) +static void tlb_remove_table_rcu(struct rcu_head *head) { - /* Simply deliver the interrupt */ + __tlb_remove_table_free(container_of(head, struct mmu_table_batch, rcu)); } -static void tlb_remove_table_sync_one(void) +static void tlb_remove_table_free(struct mmu_table_batch *batch) { - /* - * This isn't an RCU grace period and hence the page-tables cannot be - * assumed to be actually RCU-freed. - * - * It is however sufficient for software page-table walkers that rely on - * IRQ disabling. - */ - smp_call_function(tlb_remove_table_smp_sync, NULL, 1); + call_rcu(&batch->rcu, tlb_remove_table_rcu); } -static void tlb_remove_table_rcu(struct rcu_head *head) +static void tlb_remove_table_one_rcu(struct rcu_head *head) { - __tlb_remove_table_free(container_of(head, struct mmu_table_batch, rcu)); + struct page *page = container_of(head, struct page, rcu_head); + + __tlb_remove_table(page); } -static void tlb_remove_table_free(struct mmu_table_batch *batch) +static void tlb_remove_table_one(void *table) { - call_rcu(&batch->rcu, tlb_remove_table_rcu); + pgtable_t page = (pgtable_t)table; + + call_rcu(&page->rcu_head, tlb_remove_table_one_rcu); } #else /* !CONFIG_MMU_GATHER_RCU_TABLE_FREE */ -static void tlb_remove_table_sync_one(void) { } - static void tlb_remove_table_free(struct mmu_table_batch *batch) { __tlb_remove_table_free(batch); } +static void tlb_remove_table_one(void *table) +{ + __tlb_remove_table(table); +} + #endif /* CONFIG_MMU_GATHER_RCU_TABLE_FREE */ /* @@ -187,12 +187,6 @@ static inline void tlb_table_invalidate(struct mmu_gather *tlb) } } -static void tlb_remove_table_one(void *table) -{ - tlb_remove_table_sync_one(); - __tlb_remove_table(table); -} - static void tlb_table_flush(struct mmu_gather *tlb) { struct mmu_table_batch **batch = &tlb->batch; diff --git a/mm/pte_ref.c b/mm/pte_ref.c index ea40b1777056..676923f3c7c8 100644 --- a/mm/pte_ref.c +++ b/mm/pte_ref.c @@ -10,6 +10,7 @@ #include #include #include +#include #ifdef CONFIG_DEBUG_VM static void pte_free_debug(pmd_t pmd) @@ -34,7 +35,8 @@ static void pte_free_rcu(struct rcu_head *rcu) __free_page(page); } -void free_pte_table(struct mm_struct *mm, pmd_t *pmdp, unsigned long addr) +void free_pte_table(struct mmu_gather *tlb, struct mm_struct *mm, + pmd_t *pmdp, unsigned long addr) { struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0); spinlock_t *ptl; @@ -45,9 +47,13 @@ void free_pte_table(struct mm_struct *mm, 
pmd_t *pmdp, unsigned long addr) spin_unlock(ptl); pte_free_debug(pmd); - flush_tlb_range(&vma, addr, addr + PMD_SIZE); + if (!tlb) { + flush_tlb_range(&vma, addr, addr + PMD_SIZE); + call_rcu(&pmd_pgtable(pmd)->rcu_head, pte_free_rcu); + } else { + pte_free_tlb(tlb, pmd_pgtable(pmd), addr); + } mm_dec_nr_ptes(mm); - call_rcu(&pmd_pgtable(pmd)->rcu_head, pte_free_rcu); } static inline void __pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)