From patchwork Thu Jun 13 08:38:07 2024
X-Patchwork-Submitter: Qi Zheng
X-Patchwork-Id: 13696345
From: Qi Zheng
To: david@redhat.com, hughd@google.com, willy@infradead.org, mgorman@suse.de, muchun.song@linux.dev, akpm@linux-foundation.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Qi Zheng
Subject: [RFC PATCH 0/3] asynchronously scan and free empty user PTE pages
Date: Thu, 13 Jun 2024 16:38:07 +0800

Hi all,

This series aims to asynchronously scan and free empty user PTE pages.

1. Background
=============

We often find huge user PTE memory usage on our servers, such as the
following:

        VIRT:  55t
        RES:   590g
        VmPTE: 110g

The root cause is that these processes use some high-performance memory
allocators (such as jemalloc, tcmalloc, etc).
These memory allocators use madvise(MADV_DONTNEED or MADV_FREE) to release
physical memory, but neither MADV_DONTNEED nor MADV_FREE will release page
table memory, which may cause huge page table memory usage.

This issue was discussed at LSFMM 2022 (led by David Hildenbrand):

    topic link: https://lore.kernel.org/linux-mm/7b908208-02f8-6fde-4dfc-13d5e00310a6@redhat.com/
    youtube link: https://www.youtube.com/watch?v=naO_BRhcU68

In the past, I have tried to introduce a refcount for PTE pages to solve this
problem, but these methods [1][2][3] introduced too much complexity.

[1]. https://lore.kernel.org/lkml/20211110105428.32458-1-zhengqi.arch@bytedance.com/
[2]. https://lore.kernel.org/lkml/20220429133552.33768-1-zhengqi.arch@bytedance.com/
[3]. https://lore.kernel.org/lkml/20220825101037.96517-1-zhengqi.arch@bytedance.com/

2. Infrastructure
=================

Later, in order to free retracted page tables, Hugh Dickins added a lot of
PTE-related infrastructure [4][5][6]:

    - allow pte_offset_map_lock() etc to fail
    - allow PTE pages to be removed without mmap or rmap locks
      (see collapse_pte_mapped_thp() and retract_page_tables())
    - allow PTE pages to be freed by RCU (via pte_free_defer())
    - etc

These are all beneficial to freeing empty PTE pages.

[4]. https://lore.kernel.org/all/a4963be9-7aa6-350-66d0-2ba843e1af44@google.com/
[5]. https://lore.kernel.org/all/c1c9a74a-bc5b-15ea-e5d2-8ec34bc921d@google.com/
[6]. https://lore.kernel.org/all/7cd843a9-aa80-14f-5eb2-33427363c20@google.com/

3. Implementation
=================

For empty user PTE pages, we don't actually need to free them immediately,
nor do we need to free all of them.

Therefore, in this patchset, we register a task_work for the user tasks to
asynchronously scan and free empty PTE pages when they return to user space.
(The scanning time interval and address space size can be adjusted.)
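
To make the "interval plus bounded size" idea concrete, here is a small
userspace toy model. It is only a sketch of the rate-limited, batched scan
described above and is not the code in mm/freept.c: the structures, field
names and the function toy_scan_pass() are all hypothetical, time is an
abstract tick counter, and none of the real locking (mmap read lock, pmd
lock, pte lock, RCU freeing) is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Entries per PTE page, assuming x86-64 with 4 KiB pages. */
#define PTRS_PER_PTE 512

/* Toy stand-in for one PTE page: only its population count matters. */
struct toy_pte_page {
	bool present;      /* false once the page has been "freed" */
	unsigned int used; /* number of non-empty entries, 0..PTRS_PER_PTE */
};

/* Hypothetical per-mm scan state; names are illustrative only. */
struct toy_scan_state {
	uint64_t next_scan; /* earliest tick at which the next pass may run */
	size_t cursor;      /* PTE page index where the next pass resumes */
	uint64_t delay;     /* minimum ticks between two passes */
	size_t batch;       /* maximum PTE pages visited per pass */
};

/*
 * Run one scan pass at time 'now'. The pass is skipped if the delay has
 * not elapsed yet; otherwise it visits at most 'batch' PTE pages starting
 * at the saved cursor and "frees" those that are completely empty.
 * Returns the number of PTE pages freed by this pass.
 */
size_t toy_scan_pass(struct toy_scan_state *st, struct toy_pte_page *pages,
		     size_t npages, uint64_t now)
{
	size_t freed = 0;

	assert(st != NULL && pages != NULL);

	if (npages == 0 || now < st->next_scan)
		return 0;
	st->next_scan = now + st->delay;

	for (size_t i = 0; i < st->batch && i < npages; i++) {
		struct toy_pte_page *p = &pages[st->cursor];

		assert(p->used <= PTRS_PER_PTE);
		if (p->present && p->used == 0) {
			p->present = false; /* empty PTE page: release it */
			freed++;
		}
		st->cursor = (st->cursor + 1) % npages; /* resume here next time */
	}
	return freed;
}
```

In this model a caller would invoke toy_scan_pass() each time the task
"returns to user space"; most calls return immediately because the delay has
not expired, and each pass that does run only touches a bounded window, so
the cost per return to user space stays small.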

When scanning, we can filter out some unsuitable vmas:

    - VM_HUGETLB vma
    - VM_UFFD_WP vma
    - etc

PTE pages that span multiple vmas can also be skipped.

For locking:

    - use the mmap read lock to traverse the vma tree and pgtable
    - use pmd lock for clearing pmd entry
    - use pte lock for checking empty PTE page, and release it after clearing
      pmd entry, then we can capture the changed pmd in pte_offset_map_lock()
      etc after holding this pte lock. Thanks to this, we don't need to hold
      the rmap-related locks.
    - users of pte_offset_map_lock() etc all expect the PTE page to be stable
      by using rcu lock, so use pte_free_defer() to free PTE pages.

For the path that will also free PTE pages in THP, we need to recheck whether
the content of pmd entry is valid after holding pmd lock or pte lock.

4. TODO
=======

Some applications may be concerned about the overhead of scanning and
rebuilding page tables, so the following features are considered for
implementation in the future:

    - add per-process switch (via prctl)
    - add a madvise option (like THP)
    - add MM_PGTABLE_SCAN_DELAY/MM_PGTABLE_SCAN_SIZE control (via procfs file)

Perhaps we can add the refcount to PTE pages in the future as well, which
would help improve the scanning speed.

This series is based on next-20240612.

Comments and suggestions are welcome!

Thanks,
Qi

Qi Zheng (3):
  mm: pgtable: move pte_free_defer() out of CONFIG_TRANSPARENT_HUGEPAGE
  mm: pgtable: make pte_offset_map_nolock() return pmdval
  mm: free empty user PTE pages

 Documentation/mm/split_page_table_lock.rst |   3 +-
 arch/arm/mm/fault-armv.c                   |   2 +-
 arch/powerpc/mm/pgtable-frag.c             |   2 -
 arch/powerpc/mm/pgtable.c                  |   2 +-
 arch/s390/mm/pgalloc.c                     |   2 -
 arch/sparc/mm/init_64.c                    |   2 +-
 include/linux/mm.h                         |   4 +-
 include/linux/mm_types.h                   |   4 +
 include/linux/pgtable.h                    |  14 ++
 include/linux/sched.h                      |   1 +
 kernel/sched/core.c                        |   1 +
 kernel/sched/fair.c                        |   2 +
 mm/Makefile                                |   2 +-
 mm/filemap.c                               |   2 +-
 mm/freept.c                                | 180 +++++++++++++++++++++
 mm/khugepaged.c                            |  20 ++-
 mm/memory.c                                |   4 +-
 mm/mremap.c                                |   2 +-
 mm/page_vma_mapped.c                       |   2 +-
 mm/pgtable-generic.c                       |  23 +--
 mm/userfaultfd.c                           |   4 +-
 mm/vmscan.c                                |   2 +-
 22 files changed, 249 insertions(+), 31 deletions(-)
 create mode 100644 mm/freept.c