From patchwork Wed Nov 15 13:27:28 2023
X-Patchwork-Submitter: Ryan Roberts <ryan.roberts@arm.com>
X-Patchwork-Id: 13456678
From: Ryan Roberts <ryan.roberts@arm.com>
To: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand,
    Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, "Huang, Ying",
    Zi Yan, Luis Chamberlain, Itaru Kitayama, "Kirill A. Shutemov",
    John Hubbard, David Rientjes, Vlastimil Babka, Hugh Dickins, Kefeng Wang
Shutemov" , John Hubbard , David Rientjes , Vlastimil Babka , Hugh Dickins , Kefeng Wang Cc: Ryan Roberts , linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org Subject: [PATCH v7 04/10] mm: thp: Support allocation of anonymous small-sized THP Date: Wed, 15 Nov 2023 13:27:28 +0000 Message-Id: <20231115132734.931023-5-ryan.roberts@arm.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20231115132734.931023-1-ryan.roberts@arm.com> References: <20231115132734.931023-1-ryan.roberts@arm.com> MIME-Version: 1.0 X-Rspam-User: X-Rspamd-Server: rspam06 X-Rspamd-Queue-Id: D1D0640023 X-Stat-Signature: yh1u7zm13e666c6h4n8h1nii6p9gcn63 X-HE-Tag: 1700054890-123132 X-HE-Meta: U2FsdGVkX18spv0kt5I3cPrYiJ3w/TbpBGE/YBzR37z/3k/IV3MDfBtkKN6EB0ojIgrgh4q+x4xPPiXD6QrHK6H8x4MIpmiLjhUSmC0efrzIAmNnVZ+TWK9g9zw+9vQKsIXL/cNcbOCm2qvJ4oztmZVzZ3MrdM+6Qvb1YPqXGaCxwcnfMBXbXlKGo/tzlQCLve5Fwrb8HfXmbJ6PT6el6vXyuM9deKNQII5iJg6XOtlSNyUesxBqg7EBsu3XhCVJqVhT46XCe+XCy5vYPv2WM2Mm8TLrF8xq/oYL8S716RLXFAxTyWWAEbALEBBLkbqwAifm4RKXDfbIo9J+KWaHRxr+HpjTMkqDnL1Mi8cFiBhUqqjoEinPlb3ee2GKyCvI2Z/uMhmwnhMXsLIJMl2Zip7weQ8n6GtxFmbP/LBMtEbI13gPyP24IjSTyQBRGy6bEPIoBH5tDFZSA7mTkTfW9tts0+XrxPyY/GXXTaISVxOxzR37iR2xX7Eo9W0cwhsIgxmDasZ9xKNWLFELnpziVjNC/Sj39XIi+uBW8ZbRBvUMxbG+S6HW/Vxu3PCHT+EuOA/bKzfUAK66cRKtd8th8t7VCVVxABB5D1NKd58b1xGQLyFoXP5XMN0EqpzkWrpBcTwM/Eb+8EeMomowVzLLBtwQBq4zltfB1BOphOCx/pxQdKlPvXDiskqTvXB8QbauwckTAknwJNpn6tsh3tvI3YF6ZWh9ON3XDNheiHuUVfUAJ74zZORZS1FH7tdicJ6J7U6jSKdNhUqsOJe2DoMcBwQh9gXFyJQokThLBymgZhsRdnq+z6LMSGvMzujSqgrYDrKNS+aIXbThuDjEHWEmxFpjstq5fewF9r2lfY6BPua299Ozyhak5KKe5gCPQ2VVOEJ3lLRqTBkbt8b/rWyAHUBhhkEIJY5gKjcRn5P23WVvv14wsaSIEwGjuR0c61YFpigIhAIFTFAx01sLTwk SxFLJKho ono9/pIyQqU8bRCS/Ehrv+B46aiW9o8YIe4nqy3L4nTZU9xPkdxh0uovBRWJCZ64K7s6iwVTcPYoeHqcqt6hg0RtApV9U892QscMWTPjNVKcENw9a7cAaOuGofAOG7MguRtjjldIgJvfBPsw5OFYonAWwsmZSfrUlZoxlhFj1fXKeZ+eJ97spk/lh6boWC8gu+YAggbRxe+OTQsLGVFAi6TugbXbj1TUk+9GNuQirq8Q7cXA= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: Introduce the logic to allow THP to be configured (through the new sysfs interface we just added) to allocate large folios to back anonymous memory, which are smaller than PMD-size. We call this new THP type "small-sized THP". These small-sized THPs continue to be PTE-mapped, but in many cases can still provide similar benefits to traditional PMD-sized THP: Page faults are significantly reduced (by a factor of e.g. 4, 8, 16, etc. depending on the configured order), but latency spikes are much less prominent because the size of each page isn't as huge as the PMD-sized variant and there is less memory to clear in each page fault. The number of per-page operations (e.g. ref counting, rmap management, lru list management) are also significantly reduced since those ops now become per-folio. Some architectures also employ TLB compression mechanisms to squeeze more entries in when a set of PTEs are virtually and physically contiguous and approporiately aligned. In this case, TLB misses will occur less often. The new behaviour is disabled by default, but can be enabled at runtime by writing to /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled (see documentation in previous commit). The long term aim is to change the default to include suitable lower orders, but there are some risks around internal fragmentation that need to be better understood first. 
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/huge_mm.h |   6 ++-
 mm/memory.c             | 106 ++++++++++++++++++++++++++++++++++++----
 2 files changed, 101 insertions(+), 11 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7d6f7d96b039..edc302351971 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr;
 #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)

diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static struct folio *alloc_anon_folio(struct vm_fault *vmf)
+{
+	gfp_t gfp;
+	pte_t *pte;
+	unsigned long addr;
+	struct folio *folio;
+	struct vm_area_struct *vma = vmf->vma;
+	unsigned long orders;
+	int order;
+
+	/*
+	 * If uffd is active for the vma we need per-page fault fidelity to
+	 * maintain the uffd semantics.
+	 */
+	if (userfaultfd_armed(vma))
+		goto fallback;
+
+	/*
+	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
+	 * for this vma. Then filter out the orders that can't be allocated over
+	 * the faulting address and still be fully contained in the vma.
+	 */
+	orders = hugepage_vma_check(vma, vma->vm_flags, false, true, true,
+				    BIT(PMD_ORDER) - 1);
+	orders = transhuge_vma_suitable(vma, vmf->address, orders);
+
+	if (!orders)
+		goto fallback;
+
+	pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
+	if (!pte)
+		return ERR_PTR(-EAGAIN);
+
+	order = first_order(orders);
+	while (orders) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+		vmf->pte = pte + pte_index(addr);
+		if (pte_range_none(vmf->pte, 1 << order))
+			break;
+		order = next_order(&orders, order);
+	}
+
+	vmf->pte = NULL;
+	pte_unmap(pte);
+
+	gfp = vma_thp_gfp_mask(vma);
+
+	while (orders) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+		folio = vma_alloc_folio(gfp, order, vma, addr, true);
+		if (folio) {
+			clear_huge_page(&folio->page, addr, 1 << order);
+			return folio;
+		}
+		order = next_order(&orders, order);
+	}
+
+fallback:
+	return vma_alloc_zeroed_movable_folio(vma, vmf->address);
+}
+#else
+#define alloc_anon_folio(vmf) \
+		vma_alloc_zeroed_movable_folio((vmf)->vma, (vmf)->address)
+#endif
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -4129,6 +4207,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
  */
 static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 {
+	int i;
+	int nr_pages = 1;
+	unsigned long addr = vmf->address;
 	bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
 	struct vm_area_struct *vma = vmf->vma;
 	struct folio *folio;
@@ -4173,10 +4254,15 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	/* Allocate our own private page. */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
-	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
+	folio = alloc_anon_folio(vmf);
+	if (IS_ERR(folio))
+		return 0;
 	if (!folio)
 		goto oom;
 
+	nr_pages = folio_nr_pages(folio);
+	addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
+
 	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
 		goto oom_free_page;
 	folio_throttle_swaprate(folio, GFP_KERNEL);
@@ -4193,12 +4279,13 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	if (vma->vm_flags & VM_WRITE)
 		entry = pte_mkwrite(pte_mkdirty(entry), vma);
 
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
-			&vmf->ptl);
+	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
 	if (!vmf->pte)
 		goto release;
-	if (vmf_pte_changed(vmf)) {
-		update_mmu_tlb(vma, vmf->address, vmf->pte);
+	if ((nr_pages == 1 && vmf_pte_changed(vmf)) ||
+	    (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages))) {
+		for (i = 0; i < nr_pages; i++)
+			update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
 		goto release;
 	}
 
@@ -4213,16 +4300,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 		return handle_userfault(vmf, VM_UFFD_MISSING);
 	}
 
-	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
-	folio_add_new_anon_rmap(folio, vma, vmf->address);
+	folio_ref_add(folio, nr_pages - 1);
+	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
+	folio_add_new_anon_rmap(folio, vma, addr);
 	folio_add_lru_vma(folio, vma);
 setpte:
 	if (uffd_wp)
 		entry = pte_mkuffd_wp(entry);
-	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+	set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages);
 
 	/* No need to invalidate - it was non-present before */
-	update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
+	update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages);
 unlock:
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
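For reference, alloc_anon_folio() above relies on a few small helpers whose
definitions are not reproduced in full here. The sketch below is inferred
from the call sites (first_order(), next_order() and pte_range_none() as
named in the hunks); the definitions in the actual series may differ in
detail.

/* Sketch only, inferred from the call sites above. */

/* Highest enabled order in the orders bitmask, or -1 if it is empty. */
static inline int first_order(unsigned long orders)
{
	return fls_long(orders) - 1;
}

/* Drop @prev from the bitmask and return the next highest remaining order. */
static inline int next_order(unsigned long *orders, int prev)
{
	*orders &= ~BIT(prev);
	return first_order(*orders);
}

/* True if all @nr_pages PTEs starting at @pte are currently pte_none(). */
static bool pte_range_none(pte_t *pte, int nr_pages)
{
	int i;

	for (i = 0; i < nr_pages; i++) {
		if (!pte_none(ptep_get_lockless(pte + i)))
			return false;
	}

	return true;
}

With helpers of this shape, the first loop in alloc_anon_folio() walks from
the largest enabled order downwards until it finds a naturally aligned block
of PTEs that is still entirely empty, and the second loop retries the folio
allocation at successively smaller orders if the larger allocations fail.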