From patchwork Mon Aug 5 12:55:04 2024
From: Qi Zheng <zhengqi.arch@bytedance.com>
To: david@redhat.com, hughd@google.com, willy@infradead.org, mgorman@suse.de,
    muchun.song@linux.dev, vbabka@kernel.org, akpm@linux-foundation.org,
    zokeefe@google.com, rientjes@google.com
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    Qi Zheng <zhengqi.arch@bytedance.com>
Subject: [RFC PATCH v2 0/7] synchronously scan and reclaim empty user PTE pages
Date: Mon, 5 Aug 2024 20:55:04 +0800
Changes in RFC v2:
 - fix compilation errors in [RFC PATCH 5/7] and [RFC PATCH 7/7] reported
   by the kernel test robot
 - use pte_offset_map_nolock() + pmd_same() instead of
   check_pmd_still_valid() in retract_page_tables() (in [RFC PATCH 4/7])
 - rebase onto next-20240805

Hi all,

Previously, we tried to use a completely asynchronous method to reclaim
empty user PTE pages [1]. After discussing with David Hildenbrand, we
decided to implement synchronous reclamation in the madvise(MADV_DONTNEED)
case as the first step.

So this series aims to synchronously scan and reclaim empty user PTE pages
in zap_page_range_single() (which madvise(MADV_DONTNEED) and others
invoke).

In zap_page_range_single(), mmu_gather is used to perform batched TLB
flushing and page freeing. Therefore, if we want to free the empty PTE
page in this path, the most natural way is to add it to mmu_gather as
well. Two problems need to be solved here:

1. Currently, if CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, mmu_gather
   frees page table pages by semi RCU:

   - batch table freeing: asynchronous free by RCU
   - single table freeing: IPI + synchronous free

   But this is not enough to free empty PTE pages in paths other than the
   munmap and exit_mmap paths, because the IPI cannot be synchronized with
   rcu_read_lock() in pte_offset_map{_lock}(). So single tables should
   also be freed by RCU, like batch table freeing.

2. When we use mmu_gather to batch-flush the TLB and free PTE pages, the
   TLB is not flushed before the pmd lock is unlocked. This may result in
   the following two situations:

   1) Userland can trigger a page fault and fill a huge page, which will
      cause a small-size TLB entry and a huge TLB entry to coexist for
      the same address range.

   2) Userland can also trigger a page fault and fill a PTE page, which
      will cause two small-size TLB entries to coexist, but the PTE pages
      they map are different.

For case 1), according to Intel's TLB Application note (317080), some x86
CPUs do not allow it:

```
If software modifies the paging structures so that the page size used for
a 4-KByte range of linear addresses changes, the TLBs may subsequently
contain both ordinary and large-page translations for the address
range.12 A reference to a linear address in the address range may use
either translation. Which of the two translations is used may vary from
one execution to another and the choice may be implementation-specific.

Software wishing to prevent this uncertainty should not write to a
paging-structure entry in a way that would change, for any linear
address, both the page size and either the page frame or attributes. It
can instead use the following algorithm: first mark the relevant
paging-structure entry (e.g., PDE) not present; then invalidate any
translations for the affected linear addresses (see Section 5.2); and
then modify the relevant paging-structure entry to mark it present and
establish translation(s) for the new page size.
```

We can learn more from the comments above pmdp_invalidate() in
__split_huge_pmd_locked().

For case 2), we can see from the comments above ptep_clear_flush() in
wp_page_copy() that this situation is not allowed either.
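To make the required ordering concrete, here is a minimal sketch of the
"mark not present, flush, then repopulate" sequence that the Intel note
prescribes, as it would apply before installing a huge page over a
formerly 4K-mapped range. This is illustrative only, not code from this
series; the function name is hypothetical, it assumes the pmd lock is
held, and it uses the x86 flush_tlb_mm_range() helper:

```
/*
 * Illustrative sketch only (not from this series): the safe ordering
 * when replacing a PTE table with a huge-page mapping, per the Intel
 * note quoted above. Assumes the pmd lock is held by the caller.
 */
static void install_huge_pmd_safely(struct mm_struct *mm, unsigned long addr,
				    pmd_t *pmd, pmd_t huge_entry)
{
	/* 1. Mark the paging-structure entry not present. */
	pmd_clear(pmd);

	/*
	 * 2. Invalidate any translations for the affected linear
	 *    addresses, so no stale 4K entries can coexist with the
	 *    huge-page translation installed below.
	 */
	flush_tlb_mm_range(mm, addr, addr + PMD_SIZE, PAGE_SHIFT, false);

	/* 3. Only now establish the new huge-page translation. */
	set_pmd_at(mm, addr, pmd, huge_entry);
}
```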
Even without this patch series, madvise(MADV_DONTNEED) can already cause
the case 2) situation:

	CPU 0					CPU 1

	madvise(MADV_DONTNEED)
	--> clear pte entry
	    pte_unmap_unlock
						touch and tlb miss
						--> set pte entry
	    mmu_gather flush tlb

But strangely, I didn't see any relevant fix code; maybe I missed
something, or is this guaranteed by userland?

Anyway, this series defines the following two functions, to be
implemented by the architecture. If the architecture does not allow the
above two situations, it should define these two functions to flush the
TLB before set_pmd_at() (an illustrative sketch of possible generic
fallbacks is appended at the end of this mail):

 - arch_flush_tlb_before_set_huge_page
 - arch_flush_tlb_before_set_pte_page

As a first step, we support this feature on x86_64 and select the newly
introduced CONFIG_ARCH_SUPPORTS_PT_RECLAIM.

In order to reduce overhead, we only handle the cases with a high
probability of generating empty PTE pages; other cases are filtered out,
such as:

 - hugetlb vma (unsuitable)
 - userfaultfd_wp vma (may reinstall the pte entry)
 - writable private file mapping case (COW-ed anon page is not zapped)
 - etc.

For the userfaultfd_wp and writable private file mapping cases (and the
MADV_FREE case, of course), we consider scanning and freeing empty PTE
pages asynchronously in the future.

This series is based on next-20240805.

Comments and suggestions are welcome!

Thanks,
Qi

[1]. https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/

Qi Zheng (7):
  mm: pgtable: make pte_offset_map_nolock() return pmdval
  mm: introduce CONFIG_PT_RECLAIM
  mm: pass address information to pmd_install()
  mm: pgtable: try to reclaim empty PTE pages in zap_page_range_single()
  x86: mm: free page table pages by RCU instead of semi RCU
  x86: mm: define arch_flush_tlb_before_set_huge_page
  x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64

 Documentation/mm/split_page_table_lock.rst |   3 +-
 arch/arm/mm/fault-armv.c                   |   2 +-
 arch/powerpc/mm/pgtable.c                  |   2 +-
 arch/x86/Kconfig                           |   1 +
 arch/x86/include/asm/pgtable.h             |   6 +
 arch/x86/include/asm/tlb.h                 |  19 +++
 arch/x86/kernel/paravirt.c                 |   7 ++
 arch/x86/mm/pgtable.c                      |  23 +++-
 include/linux/hugetlb.h                    |   2 +-
 include/linux/mm.h                         |  13 +-
 include/linux/pgtable.h                    |  14 +++
 mm/Kconfig                                 |  14 +++
 mm/Makefile                                |   1 +
 mm/debug_vm_pgtable.c                      |   2 +-
 mm/filemap.c                               |   4 +-
 mm/gup.c                                   |   2 +-
 mm/huge_memory.c                           |   3 +
 mm/internal.h                              |  17 ++-
 mm/khugepaged.c                            |  32 +++--
 mm/memory.c                                |  21 ++--
 mm/migrate_device.c                        |   2 +-
 mm/mmu_gather.c                            |   9 +-
 mm/mprotect.c                              |   8 +-
 mm/mremap.c                                |   4 +-
 mm/page_vma_mapped.c                       |   2 +-
 mm/pgtable-generic.c                       |  21 ++--
 mm/pt_reclaim.c                            | 131 +++++++++++++++++++++
 mm/userfaultfd.c                           |  10 +-
 mm/vmscan.c                                |   2 +-
 29 files changed, 321 insertions(+), 56 deletions(-)
 create mode 100644 mm/pt_reclaim.c
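For illustration, generic fallbacks for the two hooks mentioned above
might look roughly like the following. This is a sketch of an assumed
shape, not the actual include/linux/pgtable.h hunk from this series, and
the exact signatures are an assumption; the idea is that architectures
that tolerate the transient TLB coexistence keep the no-ops, while x86
overrides them to flush before set_pmd_at():

```
/*
 * Sketch only (assumed shape, not the actual hunk from this series):
 * generic no-op fallbacks. Architectures that cannot tolerate stale and
 * fresh translations coexisting override these to flush the TLB for the
 * affected range before the new entry is installed with set_pmd_at().
 */
#ifndef arch_flush_tlb_before_set_huge_page
static inline void arch_flush_tlb_before_set_huge_page(struct mm_struct *mm,
						       unsigned long addr)
{
}
#endif

#ifndef arch_flush_tlb_before_set_pte_page
static inline void arch_flush_tlb_before_set_pte_page(struct mm_struct *mm,
						      unsigned long addr)
{
}
#endif
```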