From patchwork Fri Sep 29 11:44:16 2023
X-Patchwork-Submitter: Ryan Roberts
X-Patchwork-Id: 13404129
From: Ryan Roberts <ryan.roberts@arm.com>
To: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand, Yu Zhao,
    Catalin Marinas, Anshuman Khandual, Yang Shi, "Huang, Ying", Zi Yan,
    Luis Chamberlain, Itaru Kitayama,
Shutemov" , John Hubbard , David Rientjes , Vlastimil Babka , Hugh Dickins Cc: Ryan Roberts , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org Subject: [PATCH v6 5/9] mm: thp: Extend THP to allocate anonymous large folios Date: Fri, 29 Sep 2023 12:44:16 +0100 Message-Id: <20230929114421.3761121-6-ryan.roberts@arm.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20230929114421.3761121-1-ryan.roberts@arm.com> References: <20230929114421.3761121-1-ryan.roberts@arm.com> MIME-Version: 1.0 X-Rspamd-Server: rspam09 X-Rspamd-Queue-Id: 208FC4002C X-Stat-Signature: d3tjznfmu1969fjoucw48ca571fk6ufk X-Rspam-User: X-HE-Tag: 1695987887-779258 X-HE-Meta: U2FsdGVkX1942R8L4/ggJJe6QqWmrMI2OrMQsbi7gq9pHT7wUxOOzz8v/qz9Hs/EhX8dZCctthBsO5gW5BAw2tMYJ/5e2VlULprnl68f/LJPysOe0kCGes6UlU8MbRVcVpDEvMMESnSZtHFVWfYq0lzZVHZrXPSbpcsGq7QbopIcrB9WXYP8x1dyCnpamED6g3sFxI5njer6AZFSX61jT4YWciEL6gWF4N4VB5iyR7gwUQi2RSVt1cWNSNO//v0f9UvQkP43Oj4JCRhQnM0ttaAqVeUUsxY8y5Mf3xQbBSdx28qjBgQX1aF9JvddwzT/wYlfoLjni1ywN3NVd3MgTloX7yP1KF2TgK+CFn26trn1OgVbn3BKQzkm+gxJk9bqdkcN/izONkc7TIPK2ucB7ch0UXzCjGo6d7JiT1+Hienw1wtESLHJzHBR+YUtRzzaWsRnGVwIuxYh8TZRQv4ziDOst8WsgtZMop0Kk4C6F1y8ONlx+M0bPzfAQwDZbKZHz9GLqRI9UK2xuE4JXinN4vWnITSNAdqayuZEd2hXfMw6qZZ4wZiaLSTOHQ153weJ9R9jQAcxexDaz6kpkAwaO+pENJM+MeDUj9L0F9ARg8E9225tkkBGKlQEu8b2T0TQGScBDYfTgIDd2Dxw01BBxxL4dU8YE3PF/rymfJcKxdbYos/tqcH8CLXqeu64nkDBJ0GHx0kdkFrbp/vmV6e7VucNM0J4EBnyWKUMYHzpQhNPLguc6eby7WuTxZSQo3XU/SYLku58+Fg5NA1JH+qBYOV3hm/Y5qAH3LAS+byDpxbxK/YPfZIm3A8xhjC5iMWJotxv6Hnh3qnEoogOLi8glkeKM7Wq2WHtRzLZytKx9Rhwa4qPS+dNEkAHQzqTE9ARnDCcvSnoAXYg3LK4kRjfS5Iofj/ygD22rh5P3vhAXOyk7Au9I6dyJiYIYVbgOIQzRfuYCxRI8U7jCM43Sq9 Tsf8dDeA MHP5reG7T+kpN5Rw57UmNGUa5HJDtfJi8WurcVO5ZZjeLT3W4Btvgetzy+svccOeHAoOSqtxfw4w1rWp6R1K/HBNiMGYK5Klc6TA/dYhkH51ix9YwL03kwNFOkmqmQ+sOpgDx8i/gmlVptRxQr3uGusrKP+UoZsM8tUuNXABEbmIYuOKz3iimgKmwFMzteWQJkGzIIq7wtKQMfRqabpJ9AeqZoK271VAPBQLEuCwFY3aGKq4= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Introduce the logic to allow THP to be configured (through the new anon_orders interface we just added) to allocate large folios to back anonymous memory, which are smaller than PMD-size (for example order-2, order-3, order-4, etc). These THPs continue to be PTE-mapped, but in many cases can still provide similar benefits to traditional PMD-sized THP: Page faults are significantly reduced (by a factor of e.g. 4, 8, 16, etc. depending on the configured order), but latency spikes are much less prominent because the size of each page isn't as huge as the PMD-sized variant and there is less memory to clear in each page fault. The number of per-page operations (e.g. ref counting, rmap management, lru list management) are also significantly reduced since those ops now become per-folio. Some architectures also employ TLB compression mechanisms to squeeze more entries in when a set of PTEs are virtually and physically contiguous and approporiately aligned. In this case, TLB misses will occur less often. The new behaviour is disabled by default because the anon_orders defaults to only enabling PMD-order, but can be enabled at runtime by writing to anon_orders (see documentation in previous commit). The long term aim is to default anon_orders to include suitable lower orders, but there are some risks around internal fragmentation that need to be better understood first. 
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 Documentation/admin-guide/mm/transhuge.rst |   9 +-
 include/linux/huge_mm.h                    |   6 +-
 mm/memory.c                                | 108 +++++++++++++++++++--
 3 files changed, 111 insertions(+), 12 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 9f954e73a4ca..732c3b2f4ba8 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -353,7 +353,9 @@ anonymous transparent huge pages, it is necessary to read
 ``/proc/PID/smaps`` and count the AnonHugePages and AnonHugePteMap
 fields for each mapping. Note that in both cases, AnonHugePages refers
 only to PMD-mapped THPs. AnonHugePteMap refers to THPs that are mapped
-using PTEs.
+using PTEs. This includes all THPs whose order is smaller than
+PMD-order, as well as any PMD-order THPs that happen to be PTE-mapped
+for other reasons.
 
 The number of file transparent huge pages mapped to userspace is available
 by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
@@ -367,6 +369,11 @@ frequently will incur overhead.
 There are a number of counters in ``/proc/vmstat`` that may be used to
 monitor how successfully the system is providing huge pages for use.
 
+.. note::
+   Currently the below counters only record events relating to
+   PMD-order THPs. Events relating to smaller order THPs are not
+   included.
+
 thp_fault_alloc
 	is incremented every time a huge page is successfully
 	allocated to handle a page fault.
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2e7c338229a6..c4860476a1f5 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr;
 #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
+static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
+{
+	int i;
+
+	if (nr_pages == 1)
+		return vmf_pte_changed(vmf);
+
+	for (i = 0; i < nr_pages; i++) {
+		if (!pte_none(ptep_get_lockless(vmf->pte + i)))
+			return true;
+	}
+
+	return false;
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static struct folio *alloc_anon_folio(struct vm_fault *vmf)
+{
+	gfp_t gfp;
+	pte_t *pte;
+	unsigned long addr;
+	struct folio *folio;
+	struct vm_area_struct *vma = vmf->vma;
+	unsigned int orders;
+	int order;
+
+	/*
+	 * If uffd is active for the vma we need per-page fault fidelity to
+	 * maintain the uffd semantics.
+	 */
+	if (userfaultfd_armed(vma))
+		goto fallback;
+
+	/*
+	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
+	 * for this vma. Then filter out the orders that can't be allocated over
+	 * the faulting address and still be fully contained in the vma.
+	 */
+	orders = hugepage_vma_check(vma, vma->vm_flags, false, true, true,
+				    BIT(PMD_ORDER) - 1);
+	orders = transhuge_vma_suitable(vma, vmf->address, orders);
+
+	if (!orders)
+		goto fallback;
+
+	pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
+	if (!pte)
+		return ERR_PTR(-EAGAIN);
+
+	order = first_order(orders);
+	while (orders) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+		vmf->pte = pte + pte_index(addr);
+		if (!vmf_pte_range_changed(vmf, 1 << order))
+			break;
+		order = next_order(&orders, order);
+	}
+
+	vmf->pte = NULL;
+	pte_unmap(pte);
+
+	gfp = vma_thp_gfp_mask(vma);
+
+	while (orders) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+		folio = vma_alloc_folio(gfp, order, vma, addr, true);
+		if (folio) {
+			clear_huge_page(&folio->page, addr, 1 << order);
+			return folio;
+		}
+		order = next_order(&orders, order);
+	}
+
+fallback:
+	return vma_alloc_zeroed_movable_folio(vma, vmf->address);
+}
+#else
+#define alloc_anon_folio(vmf) \
+		vma_alloc_zeroed_movable_folio((vmf)->vma, (vmf)->address)
+#endif
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -4066,6 +4147,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
  */
 static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 {
+	int i;
+	int nr_pages = 1;
+	unsigned long addr = vmf->address;
 	bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
 	struct vm_area_struct *vma = vmf->vma;
 	struct folio *folio;
@@ -4110,10 +4194,15 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	/* Allocate our own private page. */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
-	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
+	folio = alloc_anon_folio(vmf);
+	if (IS_ERR(folio))
+		return 0;
 	if (!folio)
 		goto oom;
 
+	nr_pages = folio_nr_pages(folio);
+	addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
+
 	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
 		goto oom_free_page;
 	folio_throttle_swaprate(folio, GFP_KERNEL);
@@ -4130,12 +4219,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	if (vma->vm_flags & VM_WRITE)
 		entry = pte_mkwrite(pte_mkdirty(entry), vma);
 
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
-			&vmf->ptl);
+	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
 	if (!vmf->pte)
 		goto release;
-	if (vmf_pte_changed(vmf)) {
-		update_mmu_tlb(vma, vmf->address, vmf->pte);
+	if (vmf_pte_range_changed(vmf, nr_pages)) {
+		for (i = 0; i < nr_pages; i++)
+			update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
 		goto release;
 	}
 
@@ -4150,16 +4239,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 		return handle_userfault(vmf, VM_UFFD_MISSING);
 	}
 
-	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
-	folio_add_new_anon_rmap(folio, vma, vmf->address);
+	folio_ref_add(folio, nr_pages - 1);
+	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
+	folio_add_new_anon_rmap(folio, vma, addr);
 	folio_add_lru_vma(folio, vma);
 setpte:
 	if (uffd_wp)
 		entry = pte_mkuffd_wp(entry);
-	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+	set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages);
 
 	/* No need to invalidate - it was non-present before */
-	update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
+	update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages);
 unlock:
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
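As a usage illustration (not part of the patch): the AnonHugePteMap field
documented above can be summed across /proc/<pid>/smaps to see how much of a
process's anonymous memory ends up backed by PTE-mapped THPs, including the
smaller-than-PMD orders allocated here. The sketch below assumes the field
is reported in kB like the other smaps counters.

/* Userspace sketch: total the AnonHugePteMap entries in /proc/<pid>/smaps.
 * The field name comes from the documentation hunk above; the "N kB" line
 * format is an assumption based on the other smaps counters. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	char path[64], line[256];
	unsigned long kb, total_kb = 0;
	FILE *f;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}

	snprintf(path, sizeof(path), "/proc/%s/smaps", argv[1]);
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}

	/* Each mapping contributes one AnonHugePteMap line; sum them all. */
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "AnonHugePteMap: %lu kB", &kb) == 1)
			total_kb += kb;
	}
	fclose(f);

	printf("AnonHugePteMap total: %lu kB\n", total_kb);
	return 0;
}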