From patchwork Thu Dec 7 16:12:05 2023
X-Patchwork-Submitter: Ryan Roberts
X-Patchwork-Id: 13483619
From: Ryan Roberts
To: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand, Yu Zhao,
 Catalin Marinas, Anshuman Khandual, Yang Shi, "Huang, Ying", Zi Yan,
 Luis Chamberlain, Itaru Kitayama, "Kirill A. Shutemov",
 John Hubbard, David Rientjes, Vlastimil Babka, Hugh Dickins, Kefeng Wang,
 Barry Song <21cnbao@gmail.com>, Alistair Popple
Cc: Ryan Roberts, linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org,
 linux-kernel@vger.kernel.org
Subject: [PATCH v9 04/10] mm: thp: Support allocation of anonymous multi-size THP
Date: Thu, 7 Dec 2023 16:12:05 +0000
Message-Id: <20231207161211.2374093-5-ryan.roberts@arm.com>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20231207161211.2374093-1-ryan.roberts@arm.com>
References: <20231207161211.2374093-1-ryan.roberts@arm.com>

Introduce the logic to allow THP to be configured (through the new sysfs
interface we just added) to allocate large folios to back anonymous memory,
which are larger than the base page size but smaller than PMD-size. We call
this new THP extension "multi-size THP" (mTHP).

mTHP continues to be PTE-mapped, but in many cases can still provide similar
benefits to traditional PMD-sized THP: Page faults are significantly reduced
(by a factor of e.g. 4, 8, 16, etc. depending on the configured order), but
latency spikes are much less prominent because the size of each page isn't as
huge as the PMD-sized variant and there is less memory to clear in each page
fault. The number of per-page operations (e.g. ref counting, rmap management,
lru list management) is also significantly reduced since those ops now become
per-folio.

Some architectures also employ TLB compression mechanisms to squeeze more
entries in when a set of PTEs are virtually and physically contiguous and
appropriately aligned. In this case, TLB misses will occur less often.

The new behaviour is disabled by default, but can be enabled at runtime by
writing to /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled (see
documentation in previous commit).
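As an illustration only (not part of the patch): on a kernel with this series
applied, enabling a single mTHP size from userspace might look like the sketch
below. It assumes a 4K base page size, a "hugepage-64kb" directory name and
that "always" is an accepted value; the exact directory names and values are
defined by the sysfs interface added in the previous commit.

/*
 * Hypothetical userspace sketch: enable 64K mTHP for anonymous memory by
 * writing to the per-size sysfs knob described above.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *path =
		"/sys/kernel/mm/transparent_hugepage/hugepage-64kb/enabled";
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* "always" asks the kernel to use this size for anon faults when it can. */
	if (write(fd, "always", strlen("always")) < 0) {
		perror("write");
		close(fd);
		return 1;
	}
	close(fd);
	return 0;
}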
The long term aim is to change the default to include suitable lower orders,
but there are some risks around internal fragmentation that need to be better
understood first.

Tested-by: Kefeng Wang
Tested-by: John Hubbard
Signed-off-by: Ryan Roberts
Acked-by: David Hildenbrand
---
 include/linux/huge_mm.h |   6 ++-
 mm/memory.c             | 111 ++++++++++++++++++++++++++++++++++++----
 2 files changed, 106 insertions(+), 11 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 609c153bae57..fa7a38a30fc6 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr;
 #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
[...]
diff --git a/mm/memory.c b/mm/memory.c
[...]
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static struct folio *alloc_anon_folio(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	unsigned long orders;
+	struct folio *folio;
+	unsigned long addr;
+	pte_t *pte;
+	gfp_t gfp;
+	int order;
+
+	/*
+	 * If uffd is active for the vma we need per-page fault fidelity to
+	 * maintain the uffd semantics.
+	 */
+	if (unlikely(userfaultfd_armed(vma)))
+		goto fallback;
+
+	/*
+	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
+	 * for this vma. Then filter out the orders that can't be allocated over
+	 * the faulting address and still be fully contained in the vma.
+	 */
+	orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
+					  BIT(PMD_ORDER) - 1);
+	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
+
+	if (!orders)
+		goto fallback;
+
+	pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
+	if (!pte)
+		return ERR_PTR(-EAGAIN);
+
+	/*
+	 * Find the highest order where the aligned range is completely
+	 * pte_none(). Note that all remaining orders will be completely
+	 * pte_none().
+	 */
+	order = highest_order(orders);
+	while (orders) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+		if (pte_range_none(pte + pte_index(addr), 1 << order))
+			break;
+		order = next_order(&orders, order);
+	}
+
+	pte_unmap(pte);
+
+	/* Try allocating the highest of the remaining orders. */
+	gfp = vma_thp_gfp_mask(vma);
+	while (orders) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+		folio = vma_alloc_folio(gfp, order, vma, addr, true);
+		if (folio) {
+			clear_huge_page(&folio->page, vmf->address, 1 << order);
+			return folio;
+		}
+		order = next_order(&orders, order);
+	}
+
+fallback:
+	return vma_alloc_zeroed_movable_folio(vma, vmf->address);
+}
+#else
+#define alloc_anon_folio(vmf) \
+		vma_alloc_zeroed_movable_folio((vmf)->vma, (vmf)->address)
+#endif
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -4134,9 +4215,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 {
 	bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
 	struct vm_area_struct *vma = vmf->vma;
+	unsigned long addr = vmf->address;
 	struct folio *folio;
 	vm_fault_t ret = 0;
+	int nr_pages = 1;
 	pte_t entry;
+	int i;
 
 	/* File mapping without ->vm_ops ? */
 	if (vma->vm_flags & VM_SHARED)
@@ -4176,10 +4260,15 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	/* Allocate our own private page. */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
-	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
+	folio = alloc_anon_folio(vmf);
+	if (IS_ERR(folio))
+		return 0;
 	if (!folio)
 		goto oom;
 
+	nr_pages = folio_nr_pages(folio);
+	addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
+
 	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
 		goto oom_free_page;
 	folio_throttle_swaprate(folio, GFP_KERNEL);
@@ -4196,12 +4285,15 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	if (vma->vm_flags & VM_WRITE)
 		entry = pte_mkwrite(pte_mkdirty(entry), vma);
 
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
-			&vmf->ptl);
+	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
 	if (!vmf->pte)
 		goto release;
-	if (vmf_pte_changed(vmf)) {
-		update_mmu_tlb(vma, vmf->address, vmf->pte);
+	if (nr_pages == 1 && vmf_pte_changed(vmf)) {
+		update_mmu_tlb(vma, addr, vmf->pte);
+		goto release;
+	} else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
+		for (i = 0; i < nr_pages; i++)
+			update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
 		goto release;
 	}
 
@@ -4216,16 +4308,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 		return handle_userfault(vmf, VM_UFFD_MISSING);
 	}
 
-	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
-	folio_add_new_anon_rmap(folio, vma, vmf->address);
+	folio_ref_add(folio, nr_pages - 1);
+	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
+	folio_add_new_anon_rmap(folio, vma, addr);
 	folio_add_lru_vma(folio, vma);
 setpte:
 	if (uffd_wp)
 		entry = pte_mkuffd_wp(entry);
-	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+	set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages);
 
 	/* No need to invalidate - it was non-present before */
-	update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
+	update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages);
 unlock:
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
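For illustration only (not part of the patch): the order-selection policy that
alloc_anon_folio() implements above can be sketched in plain userspace C.
pick_order(), range_none() and pte_is_none[] are hypothetical stand-ins for
the real page-table walk, the VMA filtering done by thp_vma_allowable_orders()
and thp_vma_suitable_orders() is reduced to a bitmask of enabled orders, and
order-0 stands for the vma_alloc_zeroed_movable_folio() fallback.

/*
 * Illustration only: mimic the order-selection loop in alloc_anon_folio()
 * with a toy per-PMD "pte_none" bitmap and a 4K base page assumption.
 */
#include <stdbool.h>
#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define PMD_ORDER	9			/* 2M PMD with 4K base pages */
#define PTES_PER_PMD	(1 << PMD_ORDER)

static bool pte_is_none[PTES_PER_PMD];	/* toy page-table state for one PMD */

static bool range_none(unsigned long idx, unsigned long nr)
{
	for (unsigned long i = 0; i < nr; i++)
		if (!pte_is_none[idx + i])
			return false;
	return true;
}

/* Pick the largest enabled order whose aligned range around the fault is empty. */
static int pick_order(unsigned long enabled_orders, unsigned long fault_addr)
{
	for (int order = PMD_ORDER - 1; order > 0; order--) {
		unsigned long addr, idx;

		if (!(enabled_orders & (1UL << order)))
			continue;
		addr = fault_addr & ~((PAGE_SIZE << order) - 1);	/* ALIGN_DOWN */
		idx = (addr >> PAGE_SHIFT) % PTES_PER_PMD;		/* pte_index-like */
		if (range_none(idx, 1UL << order))
			return order;
	}
	return 0;	/* order-0: the single-page fallback path */
}

int main(void)
{
	unsigned long fault_addr = 0x70000a000UL;	/* arbitrary example */

	for (int i = 0; i < PTES_PER_PMD; i++)
		pte_is_none[i] = true;			/* whole PMD range unmapped */

	/* e.g. only 64K (order-4) mTHP enabled via sysfs */
	int order = pick_order(1UL << 4, fault_addr);

	printf("fault at %#lx -> allocate order-%d folio (%lu KiB)\n",
	       fault_addr, order, (PAGE_SIZE << order) / 1024);
	return 0;
}

With the whole PMD range unmapped and only order-4 enabled, the sketch picks a
64 KiB folio aligned below the fault address, mirroring how the patch scans
from the highest enabled order downwards and falls back to a single page when
no aligned range is free.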