[RFC,21/26] hugetlb: add hugetlb_collapse

Message ID	20220624173656.2033256-22-jthoughton@google.com (mailing list archive)
State	New
Headers
Return-Path: <owner-linux-mm@kvack.org>
Date: Fri, 24 Jun 2022 17:36:51 +0000
In-Reply-To: <20220624173656.2033256-1-jthoughton@google.com>
Message-Id: <20220624173656.2033256-22-jthoughton@google.com>
Mime-Version: 1.0
References: <20220624173656.2033256-1-jthoughton@google.com>
Subject: [RFC PATCH 21/26] hugetlb: add hugetlb_collapse
From: James Houghton <jthoughton@google.com>
To: Mike Kravetz <mike.kravetz@oracle.com>, Muchun Song <songmuchun@bytedance.com>, Peter Xu <peterx@redhat.com>
Cc: David Hildenbrand <david@redhat.com>, David Rientjes <rientjes@google.com>, Axel Rasmussen <axelrasmussen@google.com>, Mina Almasry <almasrymina@google.com>, Jue Wang <juew@google.com>, Manish Mishra <manish.mishra@nutanix.com>, "Dr . David Alan Gilbert" <dgilbert@redhat.com>, linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton <jthoughton@google.com>
Content-Type: text/plain; charset="UTF-8"
Sender: owner-linux-mm@kvack.org
Precedence: bulk
Series	hugetlb: Introduce HugeTLB high-granularity mapping
[RFC,00/26] hugetlb: Introduce HugeTLB high-granularity mapping
[RFC,01/26] hugetlb: make hstate accessor functions const
[RFC,02/26] hugetlb: sort hstates in hugetlb_init_hstates
[RFC,03/26] hugetlb: add make_huge_pte_with_shift
[RFC,04/26] hugetlb: make huge_pte_lockptr take an explicit shift argument.
[RFC,05/26] hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
[RFC,06/26] mm: make free_p?d_range functions public
[RFC,07/26] hugetlb: add hugetlb_pte to track HugeTLB page table entries
[RFC,08/26] hugetlb: add hugetlb_free_range to free PT structures
[RFC,09/26] hugetlb: add hugetlb_hgm_enabled
[RFC,10/26] hugetlb: add for_each_hgm_shift
[RFC,11/26] hugetlb: add hugetlb_walk_to to do PT walks
[RFC,12/26] hugetlb: add HugeTLB splitting functionality
[RFC,13/26] hugetlb: add huge_pte_alloc_high_granularity
[RFC,14/26] hugetlb: add HGM support for hugetlb_fault and hugetlb_no_page
[RFC,15/26] hugetlb: make unmapping compatible with high-granularity mappings
[RFC,16/26] hugetlb: make hugetlb_change_protection compatible with HGM
[RFC,17/26] hugetlb: update follow_hugetlb_page to support HGM
[RFC,18/26] hugetlb: use struct hugetlb_pte for walk_hugetlb_range
[RFC,19/26] hugetlb: add HGM support for copy_hugetlb_page_range
[RFC,20/26] hugetlb: add support for high-granularity UFFDIO_CONTINUE
[RFC,21/26] hugetlb: add hugetlb_collapse
[RFC,22/26] madvise: add uapi for HugeTLB HGM collapse: MADV_COLLAPSE
[RFC,23/26] userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM
[RFC,24/26] arm64/hugetlb: add support for high-granularity mappings
[RFC,25/26] selftests: add HugeTLB HGM to userfaultfd selftest
[RFC,26/26] selftests: add HugeTLB HGM to KVM demand paging selftest

Commit Message

James Houghton June 24, 2022, 5:36 p.m. UTC

This implements MADV_COLLAPSE for HugeTLB pages. It is a necessary
extension to the UFFDIO_CONTINUE changes: when userspace finishes
mapping an entire hugepage with UFFDIO_CONTINUE, the kernel has no
mechanism to automatically collapse the page table so that the whole
hugepage is mapped normally. Instead, userspace must tell the kernel
that it would like the hugepages collapsed, which it does with
MADV_COLLAPSE.

If userspace has mapped only part of a hugepage with UFFDIO_CONTINUE,
hugetlb_collapse maps the requested range as if all of it had already
been UFFDIO_CONTINUE'd.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/linux/hugetlb.h |  7 ++++
 mm/hugetlb.c            | 88 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 95 insertions(+)
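
As a rough illustration of the userspace flow the commit message describes,
the sketch below populates one hugepage in 4K pieces and then issues a single
collapse. The callback types, populate_then_collapse(), struct call_log and
the two counting helpers are hypothetical names made up for this example.
Real code would presumably issue ioctl(uffd, UFFDIO_CONTINUE, ...) in place
of the first callback and madvise(addr, len, MADV_COLLAPSE) in place of the
second; neither call is exercised here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for UFFDIO_CONTINUE and MADV_COLLAPSE. */
typedef int (*continue_fn)(unsigned long addr, unsigned long len, void *ctx);
typedef int (*collapse_fn)(unsigned long addr, unsigned long len, void *ctx);

/* Bookkeeping for the fake callbacks below (illustration only). */
struct call_log {
	unsigned long continues;
	unsigned long collapses;
};

static int count_continue(unsigned long addr, unsigned long len, void *ctx)
{
	(void)addr;
	(void)len;
	((struct call_log *)ctx)->continues++;
	return 0;
}

static int count_collapse(unsigned long addr, unsigned long len, void *ctx)
{
	(void)addr;
	(void)len;
	((struct call_log *)ctx)->collapses++;
	return 0;
}

/*
 * Map one hugepage chunk by chunk, then ask for it to be collapsed.
 * Returns 0 on success, -1 on a bad chunk size, or the first non-zero
 * value returned by a callback.
 */
static int populate_then_collapse(unsigned long hpage_addr,
				  unsigned long hpage_sz,
				  unsigned long chunk_sz,
				  continue_fn do_continue,
				  collapse_fn do_collapse, void *ctx)
{
	unsigned long off;
	int ret;

	if (chunk_sz == 0 || hpage_sz % chunk_sz != 0)
		return -1;

	/* One "continue" per chunk: the kernel installs small mappings. */
	for (off = 0; off < hpage_sz; off += chunk_sz) {
		ret = do_continue(hpage_addr + off, chunk_sz, ctx);
		if (ret)
			return ret;
	}

	/* The kernel will not collapse on its own; request it explicitly. */
	return do_collapse(hpage_addr, hpage_sz, ctx);
}
```

With a 2M hugepage and 4K chunks, this makes 512 "continue" calls followed by
exactly one "collapse" call.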

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index c207b1ac6195..438057dc3b75 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -1197,6 +1197,8 @@  int huge_pte_alloc_high_granularity(struct hugetlb_pte *hpte,
 				    unsigned int desired_sz,
 				    enum split_mode mode,
 				    bool write_locked);
+int hugetlb_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
+		     unsigned long start, unsigned long end);
 #else
 static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
 {
@@ -1221,6 +1223,11 @@  static inline int huge_pte_alloc_high_granularity(struct hugetlb_pte *hpte,
 {
 	return -EINVAL;
 }
+static inline int hugetlb_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
+				   unsigned long start, unsigned long end)
+{
+	return -EINVAL;
+}
 #endif
 
 static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 09fa57599233..70bb3a1342d9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7280,6 +7280,94 @@  int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
 	return -EINVAL;
 }
 
+/*
+ * Collapse the address range from @start to @end to be mapped optimally.
+ *
+ * This is only valid for shared mappings. The main use case for this function
+ * is following UFFDIO_CONTINUE. If a user UFFDIO_CONTINUEs an entire hugepage
+ * by calling UFFDIO_CONTINUE once for each 4K region, the kernel doesn't know
+ * to collapse the mapping after the final UFFDIO_CONTINUE. Instead, we leave
+ * it up to userspace to tell us to do so, via MADV_COLLAPSE.
+ *
+ * Any holes in the mapping will be filled. If there is no page in the
+ * pagecache for a region we're collapsing, the PTEs will be cleared.
+ */
+int hugetlb_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
+		     unsigned long start, unsigned long end)
+{
+	struct hstate *h = hstate_vma(vma);
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	struct mmu_notifier_range range;
+	struct mmu_gather tlb;
+	struct hstate *tmp_h;
+	unsigned int shift;
+	unsigned long curr = start;
+	int ret = 0;
+	struct page *hpage, *subpage;
+	pgoff_t idx;
+	bool writable = vma->vm_flags & VM_WRITE;
+	bool shared = vma->vm_flags & VM_SHARED;
+	pte_t entry;
+
+	/*
+	 * This is only supported for shared VMAs, because we need to look up
+	 * the page to use for any PTEs we end up creating.
+	 */
+	if (!shared)
+		return -EINVAL;
+
+	i_mmap_assert_write_locked(mapping);
+
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm,
+				start, end);
+	mmu_notifier_invalidate_range_start(&range);
+	tlb_gather_mmu(&tlb, mm);
+
+	while (curr < end) {
+		for_each_hgm_shift(h, tmp_h, shift) {
+			unsigned long sz = 1UL << shift;
+			struct hugetlb_pte hpte;
+
+			if (!IS_ALIGNED(curr, sz) || curr + sz > end)
+				continue;
+
+			hugetlb_pte_init(&hpte);
+			ret = hugetlb_walk_to(mm, &hpte, curr, sz,
+					      /*stop_at_none=*/false);
+			if (ret)
+				goto out;
+			if (hugetlb_pte_size(&hpte) >= sz)
+				goto hpte_finished;
+
+			idx = vma_hugecache_offset(h, vma, curr);
+			hpage = find_lock_page(mapping, idx);
+			hugetlb_free_range(&tlb, &hpte, curr,
+					   curr + hugetlb_pte_size(&hpte));
+			if (!hpage) {
+				hugetlb_pte_clear(mm, &hpte, curr);
+				goto hpte_finished;
+			}
+
+			subpage = hugetlb_find_subpage(h, hpage, curr);
+			entry = make_huge_pte_with_shift(vma, subpage,
+							 writable, shift);
+			set_huge_pte_at(mm, curr, hpte.ptep, entry);
+			unlock_page(hpage);
+hpte_finished:
+			curr += hugetlb_pte_size(&hpte);
+			goto next;
+		}
+		ret = -EINVAL;
+		goto out;
+next:
+		continue;
+	}
+out:
+	tlb_finish_mmu(&tlb);
+	mmu_notifier_invalidate_range_end(&range);
+	return ret;
+}
+
 /*
  * Given a particular address, split the HugeTLB PTE that currently maps it
  * so that, for the given address, the PTE that maps it is `desired_shift`.
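
The main loop of hugetlb_collapse() picks, at each address, the largest
mapping size that is aligned at that address and still fits before @end. The
userspace model below sketches only that selection logic. It assumes that
for_each_hgm_shift (added earlier in the series, not shown here) walks sizes
from largest to smallest, and it hardcodes x86-64 style shifts of 30, 21 and
12, which is an assumption for illustration. plan_collapse() and hgm_shifts
are hypothetical names. Unlike the kernel code, the model always advances by
the chosen size and ignores existing PTEs that are already larger.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed size order, largest first: 1G, 2M, 4K (illustration only). */
static const unsigned int hgm_shifts[] = { 30, 21, 12 };

/*
 * Record the mapping size chosen for each step from start to end.
 * Returns the number of steps, or -1 if no size fits at some address
 * (mirroring the -EINVAL path) or if sizes_out is too small.
 */
static long plan_collapse(unsigned long start, unsigned long end,
			  unsigned long *sizes_out, size_t max_out)
{
	const size_t nr_shifts = sizeof(hgm_shifts) / sizeof(hgm_shifts[0]);
	unsigned long curr = start;
	size_t n = 0;

	while (curr < end) {
		int found = 0;
		size_t i;

		for (i = 0; i < nr_shifts; i++) {
			unsigned long sz = 1UL << hgm_shifts[i];

			/* Skip sizes that are misaligned or overrun end. */
			if ((curr & (sz - 1)) || curr + sz > end)
				continue;
			if (n == max_out)
				return -1;
			sizes_out[n++] = sz;
			curr += sz;
			found = 1;
			break;
		}
		if (!found)
			return -1;
	}
	return (long)n;
}
```

For the range 0x40000000 to 0x40202000 this chooses one 2M step followed by
two 4K steps; a start address that is not 4K-aligned fails immediately.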
