
[RFC,12/26] hugetlb: add HugeTLB splitting functionality

Message ID 20220624173656.2033256-13-jthoughton@google.com (mailing list archive)
State New
Series hugetlb: Introduce HugeTLB high-granularity mapping

Commit Message

James Houghton June 24, 2022, 5:36 p.m. UTC
The new function, hugetlb_split_to_shift, will optimally split the page
table to map a particular address at a particular granularity.

This is useful for punching a hole in the mapping and for mapping small
sections of a HugeTLB page (via UFFDIO_CONTINUE, for example).
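
For illustration, a caller could look roughly like this (sketch only;
split_for_single_page is made up for this example, and only
hugetlb_split_to_shift, hugetlb_pte_size, and struct hugetlb_pte come from
this series):

	/*
	 * Hypothetical sketch: make sure the PTE covering @addr maps exactly
	 * one base page before a UFFDIO_CONTINUE-style install. Assumes @hpte
	 * was populated by an earlier page table walk and that the caller
	 * holds the lock that hugetlb_split_to_shift asserts.
	 */
	static int split_for_single_page(struct mm_struct *mm,
					 struct vm_area_struct *vma,
					 struct hugetlb_pte *hpte,
					 unsigned long addr)
	{
		if (hugetlb_pte_size(hpte) == PAGE_SIZE)
			return 0;	/* already mapped at 4K granularity */
		return hugetlb_split_to_shift(mm, vma, hpte, addr, PAGE_SHIFT);
	}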

Signed-off-by: James Houghton <jthoughton@google.com>
---
 mm/hugetlb.c | 122 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 122 insertions(+)

Comments

Manish June 27, 2022, 1:50 p.m. UTC | #1
On 24/06/22 11:06 pm, James Houghton wrote:
> The new function, hugetlb_split_to_shift, will optimally split the page
> table to map a particular address at a particular granularity.
>
> This is useful for punching a hole in the mapping and for mapping small
> sections of a HugeTLB page (via UFFDIO_CONTINUE, for example).
>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>   mm/hugetlb.c | 122 +++++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 122 insertions(+)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3ec2a921ee6f..eaffe7b4f67c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -102,6 +102,18 @@ struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
>   /* Forward declaration */
>   static int hugetlb_acct_memory(struct hstate *h, long delta);
>   
> +/*
> + * Find the subpage that corresponds to `addr` in `hpage`.
> + */
> +static struct page *hugetlb_find_subpage(struct hstate *h, struct page *hpage,
> +				 unsigned long addr)
> +{
> +	size_t idx = (addr & ~huge_page_mask(h))/PAGE_SIZE;
> +
> +	BUG_ON(idx >= pages_per_huge_page(h));
> +	return &hpage[idx];
> +}
> +
>   static inline bool subpool_is_free(struct hugepage_subpool *spool)
>   {
>   	if (spool->count)
> @@ -7044,6 +7056,116 @@ static unsigned int __shift_for_hstate(struct hstate *h)
>   	for ((tmp_h) = hstate; (shift) = __shift_for_hstate(tmp_h), \
>   			       (tmp_h) <= &hstates[hugetlb_max_hstate]; \
>   			       (tmp_h)++)
> +
> +/*
> + * Given a particular address, split the HugeTLB PTE that currently maps it
> + * so that, for the given address, the PTE that maps it has shift
> + * `desired_shift`. This function will always split the HugeTLB PTE
> + * optimally.
> + *
> + * For example, given a HugeTLB 1G page that is mapped from VA 0 to 1G,
> + * calling this function with addr=0 and desired_shift=PAGE_SHIFT will result
> + * in these changes to the page table:
> + * 1. The PUD will be split into 2M PMDs.
> + * 2. The first PMD will be split again into 4K PTEs.
> + */
> +static int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma,
> +			   const struct hugetlb_pte *hpte,
> +			   unsigned long addr, unsigned long desired_shift)
> +{
> +	unsigned long start, end, curr;
> +	unsigned long desired_sz = 1UL << desired_shift;
> +	struct hstate *h = hstate_vma(vma);
> +	int ret;
> +	struct hugetlb_pte new_hpte;
> +	struct mmu_notifier_range range;
> +	struct page *hpage = NULL;
> +	struct page *subpage;
> +	pte_t old_entry;
> +	struct mmu_gather tlb;
> +
> +	BUG_ON(!hpte->ptep);
> +	BUG_ON(hugetlb_pte_size(hpte) == desired_sz);
Can it be BUG_ON(hugetlb_pte_size(hpte) <= desired_sz)?
> +
> +	start = addr & hugetlb_pte_mask(hpte);
> +	end = start + hugetlb_pte_size(hpte);
> +
> +	i_mmap_assert_write_locked(vma->vm_file->f_mapping);

As this just changes mappings, is holding f_mapping required? I mean, is there
any plan or way to use some per-process-level sub-lock in the future?

> +
> +	BUG_ON(!hpte->ptep);
> +	/* This function only works if we are looking at a leaf-level PTE. */
> +	BUG_ON(!hugetlb_pte_none(hpte) && !hugetlb_pte_present_leaf(hpte));
> +
> +	/*
> +	 * Clear the PTE so that we will allocate the PT structures when
> +	 * walking the page table.
> +	 */
> +	old_entry = huge_ptep_get_and_clear(mm, start, hpte->ptep);
> +
> +	if (!huge_pte_none(old_entry))
> +		hpage = pte_page(old_entry);
> +
> +	BUG_ON(!IS_ALIGNED(start, desired_sz));
> +	BUG_ON(!IS_ALIGNED(end, desired_sz));
> +
> +	for (curr = start; curr < end;) {
> +		struct hstate *tmp_h;
> +		unsigned int shift;
> +
> +		for_each_hgm_shift(h, tmp_h, shift) {
> +			unsigned long sz = 1UL << shift;
> +
> +			if (!IS_ALIGNED(curr, sz) || curr + sz > end)
> +				continue;
> +			/*
> +			 * If we are including `addr`, we need to make sure we
> +			 * split down to the correct size. Go to a smaller
> +			 * size if we are not.
> +			 */
> +			if (curr <= addr && curr + sz > addr &&
> +					shift > desired_shift)
> +				continue;
> +
> +			/*
> +			 * Continue the page table walk to the level we want,
> +			 * allocate PT structures as we go.
> +			 */

As I understand it, this for_each_hgm_shift loop is just there to find the right
shift size, so the code below this line could be moved out of the loop. No strong
feeling, but it looks more proper and may make the code easier to understand.

> +			hugetlb_pte_copy(&new_hpte, hpte);
> +			ret = hugetlb_walk_to(mm, &new_hpte, curr, sz,
> +					      /*stop_at_none=*/false);
> +			if (ret)
> +				goto err;
> +			BUG_ON(hugetlb_pte_size(&new_hpte) != sz);
> +			if (hpage) {
> +				pte_t new_entry;
> +
> +				subpage = hugetlb_find_subpage(h, hpage, curr);
> +				new_entry = make_huge_pte_with_shift(vma, subpage,
> +								     huge_pte_write(old_entry),
> +								     shift);
> +				set_huge_pte_at(mm, curr, new_hpte.ptep, new_entry);
> +			}
> +			curr += sz;
> +			goto next;
> +		}
> +		/* We couldn't find a size that worked. */
> +		BUG();
> +next:
> +		continue;
> +	}
> +
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
> +				start, end);
> +	mmu_notifier_invalidate_range_start(&range);

Sorry, I did not understand: where will the TLB flush be taken care of in the
success case? I see that set_huge_pte_at does not do it internally by itself.

> +	return 0;
> +err:
> +	tlb_gather_mmu(&tlb, mm);
> +	/* Free any newly allocated page table entries. */
> +	hugetlb_free_range(&tlb, hpte, start, curr);
> +	/* Restore the old entry. */
> +	set_huge_pte_at(mm, start, hpte->ptep, old_entry);
> +	tlb_finish_mmu(&tlb);
> +	return ret;
> +}
>   #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
>   
>   /*
Manish June 29, 2022, 2:33 p.m. UTC | #2
On 24/06/22 11:06 pm, James Houghton wrote:
> The new function, hugetlb_split_to_shift, will optimally split the page
> table to map a particular address at a particular granularity.
>
> This is useful for punching a hole in the mapping and for mapping small
> sections of a HugeTLB page (via UFFDIO_CONTINUE, for example).
>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>   mm/hugetlb.c | 122 +++++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 122 insertions(+)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3ec2a921ee6f..eaffe7b4f67c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -102,6 +102,18 @@ struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
>   /* Forward declaration */
>   static int hugetlb_acct_memory(struct hstate *h, long delta);
>   
> +/*
> + * Find the subpage that corresponds to `addr` in `hpage`.
> + */
> +static struct page *hugetlb_find_subpage(struct hstate *h, struct page *hpage,
> +				 unsigned long addr)
> +{
> +	size_t idx = (addr & ~huge_page_mask(h))/PAGE_SIZE;
> +
> +	BUG_ON(idx >= pages_per_huge_page(h));
> +	return &hpage[idx];
> +}
> +
>   static inline bool subpool_is_free(struct hugepage_subpool *spool)
>   {
>   	if (spool->count)
> @@ -7044,6 +7056,116 @@ static unsigned int __shift_for_hstate(struct hstate *h)
>   	for ((tmp_h) = hstate; (shift) = __shift_for_hstate(tmp_h), \
>   			       (tmp_h) <= &hstates[hugetlb_max_hstate]; \
>   			       (tmp_h)++)
> +
> +/*
> + * Given a particular address, split the HugeTLB PTE that currently maps it
> + * so that, for the given address, the PTE that maps it has shift
> + * `desired_shift`. This function will always split the HugeTLB PTE
> + * optimally.
> + *
> + * For example, given a HugeTLB 1G page that is mapped from VA 0 to 1G,
> + * calling this function with addr=0 and desired_shift=PAGE_SHIFT will result
> + * in these changes to the page table:
> + * 1. The PUD will be split into 2M PMDs.
> + * 2. The first PMD will be split again into 4K PTEs.
> + */
> +static int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma,
> +			   const struct hugetlb_pte *hpte,
> +			   unsigned long addr, unsigned long desired_shift)
> +{
> +	unsigned long start, end, curr;
> +	unsigned long desired_sz = 1UL << desired_shift;
> +	struct hstate *h = hstate_vma(vma);
> +	int ret;
> +	struct hugetlb_pte new_hpte;
> +	struct mmu_notifier_range range;
> +	struct page *hpage = NULL;
> +	struct page *subpage;
> +	pte_t old_entry;
> +	struct mmu_gather tlb;
> +
> +	BUG_ON(!hpte->ptep);
> +	BUG_ON(hugetlb_pte_size(hpte) == desired_sz);
> +
> +	start = addr & hugetlb_pte_mask(hpte);
> +	end = start + hugetlb_pte_size(hpte);
> +
> +	i_mmap_assert_write_locked(vma->vm_file->f_mapping);
> +
> +	BUG_ON(!hpte->ptep);
> +	/* This function only works if we are looking at a leaf-level PTE. */
> +	BUG_ON(!hugetlb_pte_none(hpte) && !hugetlb_pte_present_leaf(hpte));
> +
> +	/*
> +	 * Clear the PTE so that we will allocate the PT structures when
> +	 * walking the page table.
> +	 */
> +	old_entry = huge_ptep_get_and_clear(mm, start, hpte->ptep);

Sorry, I missed this last time: what if an HGM mapping is already present here and
the current hpte is at a higher level? Where will we clear and free the child
page-table pages? I see that this does not happen in huge_ptep_get_and_clear.

> +
> +	if (!huge_pte_none(old_entry))
> +		hpage = pte_page(old_entry);
> +
> +	BUG_ON(!IS_ALIGNED(start, desired_sz));
> +	BUG_ON(!IS_ALIGNED(end, desired_sz));
> +
> +	for (curr = start; curr < end;) {
> +		struct hstate *tmp_h;
> +		unsigned int shift;
> +
> +		for_each_hgm_shift(h, tmp_h, shift) {
> +			unsigned long sz = 1UL << shift;
> +
> +			if (!IS_ALIGNED(curr, sz) || curr + sz > end)
> +				continue;
> +			/*
> +			 * If we are including `addr`, we need to make sure we
> +			 * split down to the correct size. Go to a smaller
> +			 * size if we are not.
> +			 */
> +			if (curr <= addr && curr + sz > addr &&
> +					shift > desired_shift)
> +				continue;
> +
> +			/*
> +			 * Continue the page table walk to the level we want,
> +			 * allocate PT structures as we go.
> +			 */
> +			hugetlb_pte_copy(&new_hpte, hpte);
> +			ret = hugetlb_walk_to(mm, &new_hpte, curr, sz,
> +					      /*stop_at_none=*/false);
> +			if (ret)
> +				goto err;
> +			BUG_ON(hugetlb_pte_size(&new_hpte) != sz);
> +			if (hpage) {
> +				pte_t new_entry;
> +
> +				subpage = hugetlb_find_subpage(h, hpage, curr);
> +				new_entry = make_huge_pte_with_shift(vma, subpage,
> +								     huge_pte_write(old_entry),
> +								     shift);
> +				set_huge_pte_at(mm, curr, new_hpte.ptep, new_entry);
> +			}
> +			curr += sz;
> +			goto next;
> +		}
> +		/* We couldn't find a size that worked. */
> +		BUG();
> +next:
> +		continue;
> +	}
> +
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
> +				start, end);
> +	mmu_notifier_invalidate_range_start(&range);
> +	return 0;
> +err:
> +	tlb_gather_mmu(&tlb, mm);
> +	/* Free any newly allocated page table entries. */
> +	hugetlb_free_range(&tlb, hpte, start, curr);
> +	/* Restore the old entry. */
> +	set_huge_pte_at(mm, start, hpte->ptep, old_entry);
> +	tlb_finish_mmu(&tlb);
> +	return ret;
> +}
>   #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
>   
>   /*
James Houghton June 29, 2022, 4:10 p.m. UTC | #3
On Mon, Jun 27, 2022 at 6:51 AM manish.mishra <manish.mishra@nutanix.com> wrote:
>
>
> On 24/06/22 11:06 pm, James Houghton wrote:
> > The new function, hugetlb_split_to_shift, will optimally split the page
> > table to map a particular address at a particular granularity.
> >
> > This is useful for punching a hole in the mapping and for mapping small
> > sections of a HugeTLB page (via UFFDIO_CONTINUE, for example).
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >   mm/hugetlb.c | 122 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >   1 file changed, 122 insertions(+)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 3ec2a921ee6f..eaffe7b4f67c 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -102,6 +102,18 @@ struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
> >   /* Forward declaration */
> >   static int hugetlb_acct_memory(struct hstate *h, long delta);
> >
> > +/*
> > + * Find the subpage that corresponds to `addr` in `hpage`.
> > + */
> > +static struct page *hugetlb_find_subpage(struct hstate *h, struct page *hpage,
> > +                              unsigned long addr)
> > +{
> > +     size_t idx = (addr & ~huge_page_mask(h))/PAGE_SIZE;
> > +
> > +     BUG_ON(idx >= pages_per_huge_page(h));
> > +     return &hpage[idx];
> > +}
> > +
> >   static inline bool subpool_is_free(struct hugepage_subpool *spool)
> >   {
> >       if (spool->count)
> > @@ -7044,6 +7056,116 @@ static unsigned int __shift_for_hstate(struct hstate *h)
> >       for ((tmp_h) = hstate; (shift) = __shift_for_hstate(tmp_h), \
> >                              (tmp_h) <= &hstates[hugetlb_max_hstate]; \
> >                              (tmp_h)++)
> > +
> > +/*
> > + * Given a particular address, split the HugeTLB PTE that currently maps it
> > + * so that, for the given address, the PTE that maps it has shift
> > + * `desired_shift`. This function will always split the HugeTLB PTE
> > + * optimally.
> > + *
> > + * For example, given a HugeTLB 1G page that is mapped from VA 0 to 1G,
> > + * calling this function with addr=0 and desired_shift=PAGE_SHIFT will result
> > + * in these changes to the page table:
> > + * 1. The PUD will be split into 2M PMDs.
> > + * 2. The first PMD will be split again into 4K PTEs.
> > + */
> > +static int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma,
> > +                        const struct hugetlb_pte *hpte,
> > +                        unsigned long addr, unsigned long desired_shift)
> > +{
> > +     unsigned long start, end, curr;
> > +     unsigned long desired_sz = 1UL << desired_shift;
> > +     struct hstate *h = hstate_vma(vma);
> > +     int ret;
> > +     struct hugetlb_pte new_hpte;
> > +     struct mmu_notifier_range range;
> > +     struct page *hpage = NULL;
> > +     struct page *subpage;
> > +     pte_t old_entry;
> > +     struct mmu_gather tlb;
> > +
> > +     BUG_ON(!hpte->ptep);
> > +     BUG_ON(hugetlb_pte_size(hpte) == desired_sz);
> Can it be BUG_ON(hugetlb_pte_size(hpte) <= desired_sz)?

Sure -- I think that's better.

> > +
> > +     start = addr & hugetlb_pte_mask(hpte);
> > +     end = start + hugetlb_pte_size(hpte);
> > +
> > +     i_mmap_assert_write_locked(vma->vm_file->f_mapping);
>
> As this just changes mappings, is holding f_mapping required? I mean, is there
> any plan or way to use some per-process-level sub-lock in the future?

We don't need to hold a per-mapping lock here, you're right; a per-VMA
lock will do just fine. I'll replace this with a per-VMA lock for the
next version.

>
> > +
> > +     BUG_ON(!hpte->ptep);
> > +     /* This function only works if we are looking at a leaf-level PTE. */
> > +     BUG_ON(!hugetlb_pte_none(hpte) && !hugetlb_pte_present_leaf(hpte));
> > +
> > +     /*
> > +      * Clear the PTE so that we will allocate the PT structures when
> > +      * walking the page table.
> > +      */
> > +     old_entry = huge_ptep_get_and_clear(mm, start, hpte->ptep);
> > +
> > +     if (!huge_pte_none(old_entry))
> > +             hpage = pte_page(old_entry);
> > +
> > +     BUG_ON(!IS_ALIGNED(start, desired_sz));
> > +     BUG_ON(!IS_ALIGNED(end, desired_sz));
> > +
> > +     for (curr = start; curr < end;) {
> > +             struct hstate *tmp_h;
> > +             unsigned int shift;
> > +
> > +             for_each_hgm_shift(h, tmp_h, shift) {
> > +                     unsigned long sz = 1UL << shift;
> > +
> > +                     if (!IS_ALIGNED(curr, sz) || curr + sz > end)
> > +                             continue;
> > +                     /*
> > +                      * If we are including `addr`, we need to make sure we
> > +                      * split down to the correct size. Go to a smaller
> > +                      * size if we are not.
> > +                      */
> > +                     if (curr <= addr && curr + sz > addr &&
> > +                                     shift > desired_shift)
> > +                             continue;
> > +
> > +                     /*
> > +                      * Continue the page table walk to the level we want,
> > +                      * allocate PT structures as we go.
> > +                      */
>
> As I understand it, this for_each_hgm_shift loop is just there to find the right
> shift size, so the code below this line could be moved out of the loop. No strong
> feeling, but it looks more proper and may make the code easier to understand.

Agreed. I'll clean this up.
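
Something roughly like this (untested sketch; `sz` is hoisted out of the inner
loop so it can drive the outer increment, and the goto/label pair goes away):

	unsigned long sz;

	for (curr = start; curr < end; curr += sz) {
		struct hstate *tmp_h;
		unsigned int shift;
		bool found = false;

		/* First just pick the mapping size for this iteration. */
		for_each_hgm_shift(h, tmp_h, shift) {
			sz = 1UL << shift;
			if (!IS_ALIGNED(curr, sz) || curr + sz > end)
				continue;
			if (curr <= addr && curr + sz > addr &&
					shift > desired_shift)
				continue;
			found = true;
			break;
		}
		/* We couldn't find a size that worked. */
		BUG_ON(!found);

		/* Then walk/allocate and install the new PTE, as before. */
		hugetlb_pte_copy(&new_hpte, hpte);
		ret = hugetlb_walk_to(mm, &new_hpte, curr, sz,
				      /*stop_at_none=*/false);
		if (ret)
			goto err;
		BUG_ON(hugetlb_pte_size(&new_hpte) != sz);
		if (hpage) {
			pte_t new_entry;

			subpage = hugetlb_find_subpage(h, hpage, curr);
			new_entry = make_huge_pte_with_shift(vma, subpage,
							     huge_pte_write(old_entry),
							     shift);
			set_huge_pte_at(mm, curr, new_hpte.ptep, new_entry);
		}
	}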

>
> > +                     hugetlb_pte_copy(&new_hpte, hpte);
> > +                     ret = hugetlb_walk_to(mm, &new_hpte, curr, sz,
> > +                                           /*stop_at_none=*/false);
> > +                     if (ret)
> > +                             goto err;
> > +                     BUG_ON(hugetlb_pte_size(&new_hpte) != sz);
> > +                     if (hpage) {
> > +                             pte_t new_entry;
> > +
> > +                             subpage = hugetlb_find_subpage(h, hpage, curr);
> > +                             new_entry = make_huge_pte_with_shift(vma, subpage,
> > +                                                                  huge_pte_write(old_entry),
> > +                                                                  shift);
> > +                             set_huge_pte_at(mm, curr, new_hpte.ptep, new_entry);
> > +                     }
> > +                     curr += sz;
> > +                     goto next;
> > +             }
> > +             /* We couldn't find a size that worked. */
> > +             BUG();
> > +next:
> > +             continue;
> > +     }
> > +
> > +     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
> > +                             start, end);
> > +     mmu_notifier_invalidate_range_start(&range);
>
> Sorry, I did not understand: where will the TLB flush be taken care of in the
> success case? I see that set_huge_pte_at does not do it internally by itself.

A TLB flush isn't necessary in the success case -- pages that were
mapped before will continue to be mapped the same way, so the TLB
entries will still be valid. If we're splitting a none P*D, then
there's nothing to flush. If we're splitting a present P*D, then the
flush will come if/when we clear any of the page table entries below
the P*D.

>
> > +     return 0;
> > +err:
> > +     tlb_gather_mmu(&tlb, mm);
> > +     /* Free any newly allocated page table entries. */
> > +     hugetlb_free_range(&tlb, hpte, start, curr);
> > +     /* Restore the old entry. */
> > +     set_huge_pte_at(mm, start, hpte->ptep, old_entry);
> > +     tlb_finish_mmu(&tlb);
> > +     return ret;
> > +}
> >   #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
> >
> >   /*
James Houghton June 29, 2022, 4:20 p.m. UTC | #4
On Wed, Jun 29, 2022 at 7:33 AM manish.mishra <manish.mishra@nutanix.com> wrote:
>
>
> On 24/06/22 11:06 pm, James Houghton wrote:
> > The new function, hugetlb_split_to_shift, will optimally split the page
> > table to map a particular address at a particular granularity.
> >
> > This is useful for punching a hole in the mapping and for mapping small
> > sections of a HugeTLB page (via UFFDIO_CONTINUE, for example).
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >   mm/hugetlb.c | 122 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >   1 file changed, 122 insertions(+)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 3ec2a921ee6f..eaffe7b4f67c 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -102,6 +102,18 @@ struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
> >   /* Forward declaration */
> >   static int hugetlb_acct_memory(struct hstate *h, long delta);
> >
> > +/*
> > + * Find the subpage that corresponds to `addr` in `hpage`.
> > + */
> > +static struct page *hugetlb_find_subpage(struct hstate *h, struct page *hpage,
> > +                              unsigned long addr)
> > +{
> > +     size_t idx = (addr & ~huge_page_mask(h))/PAGE_SIZE;
> > +
> > +     BUG_ON(idx >= pages_per_huge_page(h));
> > +     return &hpage[idx];
> > +}
> > +
> >   static inline bool subpool_is_free(struct hugepage_subpool *spool)
> >   {
> >       if (spool->count)
> > @@ -7044,6 +7056,116 @@ static unsigned int __shift_for_hstate(struct hstate *h)
> >       for ((tmp_h) = hstate; (shift) = __shift_for_hstate(tmp_h), \
> >                              (tmp_h) <= &hstates[hugetlb_max_hstate]; \
> >                              (tmp_h)++)
> > +
> > +/*
> > + * Given a particular address, split the HugeTLB PTE that currently maps it
> > + * so that, for the given address, the PTE that maps it has shift
> > + * `desired_shift`. This function will always split the HugeTLB PTE
> > + * optimally.
> > + *
> > + * For example, given a HugeTLB 1G page that is mapped from VA 0 to 1G,
> > + * calling this function with addr=0 and desired_shift=PAGE_SHIFT will result
> > + * in these changes to the page table:
> > + * 1. The PUD will be split into 2M PMDs.
> > + * 2. The first PMD will be split again into 4K PTEs.
> > + */
> > +static int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma,
> > +                        const struct hugetlb_pte *hpte,
> > +                        unsigned long addr, unsigned long desired_shift)
> > +{
> > +     unsigned long start, end, curr;
> > +     unsigned long desired_sz = 1UL << desired_shift;
> > +     struct hstate *h = hstate_vma(vma);
> > +     int ret;
> > +     struct hugetlb_pte new_hpte;
> > +     struct mmu_notifier_range range;
> > +     struct page *hpage = NULL;
> > +     struct page *subpage;
> > +     pte_t old_entry;
> > +     struct mmu_gather tlb;
> > +
> > +     BUG_ON(!hpte->ptep);
> > +     BUG_ON(hugetlb_pte_size(hpte) == desired_sz);
> > +
> > +     start = addr & hugetlb_pte_mask(hpte);
> > +     end = start + hugetlb_pte_size(hpte);
> > +
> > +     i_mmap_assert_write_locked(vma->vm_file->f_mapping);
> > +
> > +     BUG_ON(!hpte->ptep);
> > +     /* This function only works if we are looking at a leaf-level PTE. */
> > +     BUG_ON(!hugetlb_pte_none(hpte) && !hugetlb_pte_present_leaf(hpte));
> > +
> > +     /*
> > +      * Clear the PTE so that we will allocate the PT structures when
> > +      * walking the page table.
> > +      */
> > +     old_entry = huge_ptep_get_and_clear(mm, start, hpte->ptep);
>
> Sorry, I missed this last time: what if an HGM mapping is already present here and
> the current hpte is at a higher level? Where will we clear and free the child
> page-table pages? I see that this does not happen in huge_ptep_get_and_clear.

This shouldn't happen because earlier we have
BUG_ON(!hugetlb_pte_none(hpte) && !hugetlb_pte_present_leaf(hpte));

i.e., hpte must either be none or present and leaf-level.

>
> > +
> > +     if (!huge_pte_none(old_entry))
> > +             hpage = pte_page(old_entry);
> > +
> > +     BUG_ON(!IS_ALIGNED(start, desired_sz));
> > +     BUG_ON(!IS_ALIGNED(end, desired_sz));
> > +
> > +     for (curr = start; curr < end;) {
> > +             struct hstate *tmp_h;
> > +             unsigned int shift;
> > +
> > +             for_each_hgm_shift(h, tmp_h, shift) {
> > +                     unsigned long sz = 1UL << shift;
> > +
> > +                     if (!IS_ALIGNED(curr, sz) || curr + sz > end)
> > +                             continue;
> > +                     /*
> > +                      * If we are including `addr`, we need to make sure we
> > +                      * split down to the correct size. Go to a smaller
> > +                      * size if we are not.
> > +                      */
> > +                     if (curr <= addr && curr + sz > addr &&
> > +                                     shift > desired_shift)
> > +                             continue;
> > +
> > +                     /*
> > +                      * Continue the page table walk to the level we want,
> > +                      * allocate PT structures as we go.
> > +                      */
> > +                     hugetlb_pte_copy(&new_hpte, hpte);
> > +                     ret = hugetlb_walk_to(mm, &new_hpte, curr, sz,
> > +                                           /*stop_at_none=*/false);
> > +                     if (ret)
> > +                             goto err;
> > +                     BUG_ON(hugetlb_pte_size(&new_hpte) != sz);
> > +                     if (hpage) {
> > +                             pte_t new_entry;
> > +
> > +                             subpage = hugetlb_find_subpage(h, hpage, curr);
> > +                             new_entry = make_huge_pte_with_shift(vma, subpage,
> > +                                                                  huge_pte_write(old_entry),
> > +                                                                  shift);
> > +                             set_huge_pte_at(mm, curr, new_hpte.ptep, new_entry);
> > +                     }
> > +                     curr += sz;
> > +                     goto next;
> > +             }
> > +             /* We couldn't find a size that worked. */
> > +             BUG();
> > +next:
> > +             continue;
> > +     }
> > +
> > +     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
> > +                             start, end);
> > +     mmu_notifier_invalidate_range_start(&range);
> > +     return 0;
> > +err:
> > +     tlb_gather_mmu(&tlb, mm);
> > +     /* Free any newly allocated page table entries. */
> > +     hugetlb_free_range(&tlb, hpte, start, curr);
> > +     /* Restore the old entry. */
> > +     set_huge_pte_at(mm, start, hpte->ptep, old_entry);
> > +     tlb_finish_mmu(&tlb);
> > +     return ret;
> > +}
> >   #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
> >
> >   /*

Patch

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3ec2a921ee6f..eaffe7b4f67c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -102,6 +102,18 @@  struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
 /* Forward declaration */
 static int hugetlb_acct_memory(struct hstate *h, long delta);
 
+/*
+ * Find the subpage that corresponds to `addr` in `hpage`.
+ */
+static struct page *hugetlb_find_subpage(struct hstate *h, struct page *hpage,
+				 unsigned long addr)
+{
+	size_t idx = (addr & ~huge_page_mask(h))/PAGE_SIZE;
+
+	BUG_ON(idx >= pages_per_huge_page(h));
+	return &hpage[idx];
+}
+
 static inline bool subpool_is_free(struct hugepage_subpool *spool)
 {
 	if (spool->count)
@@ -7044,6 +7056,116 @@  static unsigned int __shift_for_hstate(struct hstate *h)
 	for ((tmp_h) = hstate; (shift) = __shift_for_hstate(tmp_h), \
 			       (tmp_h) <= &hstates[hugetlb_max_hstate]; \
 			       (tmp_h)++)
+
+/*
+ * Given a particular address, split the HugeTLB PTE that currently maps it
+ * so that, for the given address, the PTE that maps it has shift
+ * `desired_shift`. This function will always split the HugeTLB PTE
+ * optimally.
+ *
+ * For example, given a HugeTLB 1G page that is mapped from VA 0 to 1G,
+ * calling this function with addr=0 and desired_shift=PAGE_SHIFT will result
+ * in these changes to the page table:
+ * 1. The PUD will be split into 2M PMDs.
+ * 2. The first PMD will be split again into 4K PTEs.
+ */
+static int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma,
+			   const struct hugetlb_pte *hpte,
+			   unsigned long addr, unsigned long desired_shift)
+{
+	unsigned long start, end, curr;
+	unsigned long desired_sz = 1UL << desired_shift;
+	struct hstate *h = hstate_vma(vma);
+	int ret;
+	struct hugetlb_pte new_hpte;
+	struct mmu_notifier_range range;
+	struct page *hpage = NULL;
+	struct page *subpage;
+	pte_t old_entry;
+	struct mmu_gather tlb;
+
+	BUG_ON(!hpte->ptep);
+	BUG_ON(hugetlb_pte_size(hpte) == desired_sz);
+
+	start = addr & hugetlb_pte_mask(hpte);
+	end = start + hugetlb_pte_size(hpte);
+
+	i_mmap_assert_write_locked(vma->vm_file->f_mapping);
+
+	BUG_ON(!hpte->ptep);
+	/* This function only works if we are looking at a leaf-level PTE. */
+	BUG_ON(!hugetlb_pte_none(hpte) && !hugetlb_pte_present_leaf(hpte));
+
+	/*
+	 * Clear the PTE so that we will allocate the PT structures when
+	 * walking the page table.
+	 */
+	old_entry = huge_ptep_get_and_clear(mm, start, hpte->ptep);
+
+	if (!huge_pte_none(old_entry))
+		hpage = pte_page(old_entry);
+
+	BUG_ON(!IS_ALIGNED(start, desired_sz));
+	BUG_ON(!IS_ALIGNED(end, desired_sz));
+
+	for (curr = start; curr < end;) {
+		struct hstate *tmp_h;
+		unsigned int shift;
+
+		for_each_hgm_shift(h, tmp_h, shift) {
+			unsigned long sz = 1UL << shift;
+
+			if (!IS_ALIGNED(curr, sz) || curr + sz > end)
+				continue;
+			/*
+			 * If we are including `addr`, we need to make sure we
+			 * split down to the correct size. Go to a smaller
+			 * size if we are not.
+			 */
+			if (curr <= addr && curr + sz > addr &&
+					shift > desired_shift)
+				continue;
+
+			/*
+			 * Continue the page table walk to the level we want,
+			 * allocate PT structures as we go.
+			 */
+			hugetlb_pte_copy(&new_hpte, hpte);
+			ret = hugetlb_walk_to(mm, &new_hpte, curr, sz,
+					      /*stop_at_none=*/false);
+			if (ret)
+				goto err;
+			BUG_ON(hugetlb_pte_size(&new_hpte) != sz);
+			if (hpage) {
+				pte_t new_entry;
+
+				subpage = hugetlb_find_subpage(h, hpage, curr);
+				new_entry = make_huge_pte_with_shift(vma, subpage,
+								     huge_pte_write(old_entry),
+								     shift);
+				set_huge_pte_at(mm, curr, new_hpte.ptep, new_entry);
+			}
+			curr += sz;
+			goto next;
+		}
+		/* We couldn't find a size that worked. */
+		BUG();
+next:
+		continue;
+	}
+
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
+				start, end);
+	mmu_notifier_invalidate_range_start(&range);
+	return 0;
+err:
+	tlb_gather_mmu(&tlb, mm);
+	/* Free any newly allocated page table entries. */
+	hugetlb_free_range(&tlb, hpte, start, curr);
+	/* Restore the old entry. */
+	set_huge_pte_at(mm, start, hpte->ptep, old_entry);
+	tlb_finish_mmu(&tlb);
+	return ret;
+}
 #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
 
 /*