
[v4,2/5] mm: LARGE_ANON_FOLIO for improved performance

Message ID 20230726095146.2826796-3-ryan.roberts@arm.com
State New
Series variable-order, large folios for anonymous memory

Commit Message

Ryan Roberts July 26, 2023, 9:51 a.m. UTC
Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
allocated in large folios of a determined order. All pages of the large
folio are pte-mapped during the same page fault, significantly reducing
the number of page faults. The number of per-page operations (e.g. ref
counting, rmap management, lru list management) is also significantly
reduced since those ops now become per-folio.

The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
which defaults to disabled for now; the long-term aim is for this to
default to enabled, but there are some risks around internal
fragmentation that need to be better understood first.

When enabled, the folio order is determined as follows: for a vma,
process or system that has explicitly disabled THP, we continue to
allocate order-0. THP is most likely disabled to avoid any possible
internal fragmentation, so we honour that request.

Otherwise, the return value of arch_wants_pte_order() is used. For vmas
that have not explicitly opted in to use transparent hugepages (e.g.
where thp=madvise and the vma does not have MADV_HUGEPAGE),
arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever is
bigger). This allows for a performance boost without requiring any
explicit opt-in from the workload while limiting internal
fragmentation.

If the preferred order can't be used (e.g. because the folio would
breach the bounds of the vma, or because ptes in the region are already
mapped) then we fall back to a suitable lower order; first
PAGE_ALLOC_COSTLY_ORDER, then order-0.
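
As a worked example, assume 4K base pages and an arch that returns
order-4 (64K) from arch_wants_pte_order(): a vma hinted with
MADV_HUGEPAGE is allocated order-4 folios, and an unhinted vma also
gets order-4 since that does not exceed the 64K cap; had the arch
returned less than PAGE_ALLOC_COSTLY_ORDER (3), the preferred order
would be raised to order-3.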

arch_wants_pte_order() can be overridden by the architecture if desired.
Some architectures (e.g. arm64) can coalesce TLB entries if a contiguous
set of ptes map physically contiguous, naturally aligned memory, so this
mechanism allows the architecture to optimize as required.
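
For instance, an arm64 override might look like the following (an
illustrative sketch only, not part of this patch; it assumes arm64's
CONT_PTE_SHIFT and that contpte-sized folios are the preference):

#define arch_wants_pte_order arch_wants_pte_order
static inline int arch_wants_pte_order(void)
{
	/* contpte-sized folios, e.g. order-4 (64K) with 4K base pages */
	return CONT_PTE_SHIFT - PAGE_SHIFT;
}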

Here we add the default implementation of arch_wants_pte_order(), used
when the architecture does not define it, which returns -1, implying
that the HW has no preference. In this case, mm will choose its own
default order.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/pgtable.h |  13 ++++
 mm/Kconfig              |  10 +++
 mm/memory.c             | 166 ++++++++++++++++++++++++++++++++++++----
 3 files changed, 172 insertions(+), 17 deletions(-)

Comments

Yu Zhao July 26, 2023, 4:41 p.m. UTC | #1
On Wed, Jul 26, 2023 at 3:52 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> [...]
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 5063b482e34f..2a1d83775837 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -313,6 +313,19 @@ static inline bool arch_has_hw_pte_young(void)
>  }
>  #endif
>
> +#ifndef arch_wants_pte_order
> +/*
> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
> + * to be at least order-2. Negative value implies that the HW has no preference
> + * and mm will choose its own default order.
> + */
> +static inline int arch_wants_pte_order(void)
> +{
> +       return -1;
> +}
> +#endif
> +
>  #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>                                        unsigned long address,
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 09130434e30d..fa61ea160447 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1238,4 +1238,14 @@ config LOCK_MM_AND_FIND_VMA
>
>  source "mm/damon/Kconfig"
>
> +config LARGE_ANON_FOLIO
> +       bool "Allocate large folios for anonymous memory"
> +       depends on TRANSPARENT_HUGEPAGE
> +       default n
> +       help
> +         Use large (bigger than order-0) folios to back anonymous memory where
> +         possible, even for pte-mapped memory. This reduces the number of page
> +         faults, as well as other per-page overheads to improve performance for
> +         many workloads.
> +
>  endmenu
> diff --git a/mm/memory.c b/mm/memory.c
> index 01f39e8144ef..64c3f242c49a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4050,6 +4050,127 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         return ret;
>  }
>
> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
> +{
> +       int i;
> +
> +       if (nr_pages == 1)
> +               return vmf_pte_changed(vmf);
> +
> +       for (i = 0; i < nr_pages; i++) {
> +               if (!pte_none(ptep_get_lockless(vmf->pte + i)))
> +                       return true;
> +       }
> +
> +       return false;
> +}
> +
> +#ifdef CONFIG_LARGE_ANON_FOLIO
> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
> +               (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
> +
> +static int anon_folio_order(struct vm_area_struct *vma)
> +{
> +       int order;
> +
> +       /*
> +        * If THP is explicitly disabled for either the vma, the process or the
> +        * system, then this is very likely intended to limit internal
> +        * fragmentation; in this case, don't attempt to allocate a large
> +        * anonymous folio.
> +        *
> +        * Else, if the vma is eligible for thp, allocate a large folio of the
> +        * size preferred by the arch. Or if the arch requested a very small
> +        * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
> +        * which still meets the arch's requirements but means we still take
> +        * advantage of SW optimizations (e.g. fewer page faults).
> +        *
> +        * Finally if thp is enabled but the vma isn't eligible, take the
> +        * arch-preferred size and limit it to ANON_FOLIO_MAX_ORDER_UNHINTED.
> +        * This ensures workloads that have not explicitly opted-in take benefit
> +        * while capping the potential for internal fragmentation.
> +        */

What empirical evidence is SZ_64K based on?
What workloads would benefit from it?
How much would they benefit from it?
Would they benefit more or less from different values?
How much internal fragmentation would it cause?
What cost function was used to arrive at the conclusion that its
benefits outweigh its costs?

> +       if ((vma->vm_flags & VM_NOHUGEPAGE) ||
> +           test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
> +           !hugepage_flags_enabled())
> +               order = 0;
> +       else {
> +               order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
> +
> +               if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
> +                       order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
> +       }
> +
> +       return order;
> +}
> +
> +static int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)

static struct folio *alloc_anon_folio(struct vm_fault *vmf)

and use ERR_PTR() and its friends.
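
e.g. the caller side would then become something like this (sketch):

	folio = alloc_anon_folio(vmf);
	if (IS_ERR(folio))	/* e.g. ERR_PTR(-EAGAIN) */
		return 0;
	if (!folio)
		goto oom;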

> +{
> +       int i;
> +       gfp_t gfp;
> +       pte_t *pte;
> +       unsigned long addr;
> +       struct vm_area_struct *vma = vmf->vma;
> +       int prefer = anon_folio_order(vma);
> +       int orders[] = {
> +               prefer,
> +               prefer > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
> +               0,
> +       };
> +
> +       *folio = NULL;
> +
> +       if (vmf_orig_pte_uffd_wp(vmf))
> +               goto fallback;
> +
> +       for (i = 0; orders[i]; i++) {
> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
> +               if (addr >= vma->vm_start &&
> +                   addr + (PAGE_SIZE << orders[i]) <= vma->vm_end)
> +                       break;
> +       }
> +
> +       if (!orders[i])
> +               goto fallback;
> +
> +       pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
> +       if (!pte)
> +               return -EAGAIN;
> +
> +       for (; orders[i]; i++) {
> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
> +               vmf->pte = pte + pte_index(addr);
> +               if (!vmf_pte_range_changed(vmf, 1 << orders[i]))
> +                       break;
> +       }
> +
> +       vmf->pte = NULL;
> +       pte_unmap(pte);
> +
> +       gfp = vma_thp_gfp_mask(vma);
> +
> +       for (; orders[i]; i++) {
> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
> +               *folio = vma_alloc_folio(gfp, orders[i], vma, addr, true);
> +               if (*folio) {
> +                       clear_huge_page(&(*folio)->page, addr, 1 << orders[i]);
> +                       return 0;
> +               }
> +       }
> +
> +fallback:
> +       *folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
> +       return *folio ? 0 : -ENOMEM;
> +}
> +#else
> +static inline int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
> +{
> +       *folio = vma_alloc_zeroed_movable_folio(vmf->vma, vmf->address);
> +       return *folio ? 0 : -ENOMEM;
> +}
> +#endif
> +
>  /*
>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
>   * but allow concurrent faults), and pte mapped but not yet locked.
> @@ -4057,6 +4178,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   */
>  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  {
> +       int i = 0;
> +       int nr_pages = 1;
> +       unsigned long addr = vmf->address;
>         bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
>         struct vm_area_struct *vma = vmf->vma;
>         struct folio *folio;
> @@ -4101,10 +4225,15 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>         /* Allocate our own private page. */
>         if (unlikely(anon_vma_prepare(vma)))
>                 goto oom;
> -       folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
> +       ret = alloc_anon_folio(vmf, &folio);
> +       if (unlikely(ret == -EAGAIN))
> +               return 0;
>         if (!folio)
>                 goto oom;
>
> +       nr_pages = folio_nr_pages(folio);
> +       addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
> +
>         if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
>                 goto oom_free_page;
>         folio_throttle_swaprate(folio, GFP_KERNEL);
> @@ -4116,17 +4245,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>          */
>         __folio_mark_uptodate(folio);
>
> -       entry = mk_pte(&folio->page, vma->vm_page_prot);
> -       entry = pte_sw_mkyoung(entry);
> -       if (vma->vm_flags & VM_WRITE)
> -               entry = pte_mkwrite(pte_mkdirty(entry));
> -
> -       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> -                       &vmf->ptl);
> +       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
>         if (!vmf->pte)
>                 goto release;
> -       if (vmf_pte_changed(vmf)) {
> -               update_mmu_tlb(vma, vmf->address, vmf->pte);
> +       if (vmf_pte_range_changed(vmf, nr_pages)) {
> +               for (i = 0; i < nr_pages; i++)
> +                       update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
>                 goto release;
>         }
>
> @@ -4141,16 +4265,24 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>                 return handle_userfault(vmf, VM_UFFD_MISSING);
>         }
>
> -       inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> -       folio_add_new_anon_rmap(folio, vma, vmf->address);
> +       folio_ref_add(folio, nr_pages - 1);
> +       add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> +       folio_add_new_anon_rmap(folio, vma, addr);
>         folio_add_lru_vma(folio, vma);
> +
> +       for (i = 0; i < nr_pages; i++) {
> +               entry = mk_pte(folio_page(folio, i), vma->vm_page_prot);
> +               entry = pte_sw_mkyoung(entry);
> +               if (vma->vm_flags & VM_WRITE)
> +                       entry = pte_mkwrite(pte_mkdirty(entry));
>  setpte:
> -       if (uffd_wp)
> -               entry = pte_mkuffd_wp(entry);
> -       set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> +               if (uffd_wp)
> +                       entry = pte_mkuffd_wp(entry);
> +               set_pte_at(vma->vm_mm, addr + PAGE_SIZE * i, vmf->pte + i, entry);
>
> -       /* No need to invalidate - it was non-present before */
> -       update_mmu_cache(vma, vmf->address, vmf->pte);
> +               /* No need to invalidate - it was non-present before */
> +               update_mmu_cache(vma, addr + PAGE_SIZE * i, vmf->pte + i);
> +       }
>  unlock:
>         if (vmf->pte)
>                 pte_unmap_unlock(vmf->pte, vmf->ptl);

The rest looks good to me.
Yu Zhao July 27, 2023, 4:31 a.m. UTC | #2
On Wed, Jul 26, 2023 at 10:41 AM Yu Zhao <yuzhao@google.com> wrote:
>
> On Wed, Jul 26, 2023 at 3:52 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >
> > [...]
> > +       if ((vma->vm_flags & VM_NOHUGEPAGE) ||
> > +           test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
> > +           !hugepage_flags_enabled())
> > +               order = 0;
> > +       else {
> > +               order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
> > +
> > +               if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
> > +                       order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
> > +       }

I'm a bit surprised to see the above: why can we overload existing
ABIs? I don't think we can. Assuming we could, you would have to
update Documentation/admin-guide/mm/transhuge.rst in the same
patchset, and the man page for madvise() in a separate patch.

Most importantly, existing userspace programs that don't work well
with THPs won't be able to use (try) large folios either -- this is a
big no-no.
Ryan Roberts July 28, 2023, 10:13 a.m. UTC | #3
On 27/07/2023 05:31, Yu Zhao wrote:
> On Wed, Jul 26, 2023 at 10:41 AM Yu Zhao <yuzhao@google.com> wrote:
>>
>> On Wed, Jul 26, 2023 at 3:52 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>
>>> [...]
>>> +#ifdef CONFIG_LARGE_ANON_FOLIO
>>> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
>>> +               (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
>>> +
>>> +static int anon_folio_order(struct vm_area_struct *vma)
>>> +{
>>> +       int order;
>>> +
>>> +       /*
>>> +        * If THP is explicitly disabled for either the vma, the process or the
>>> +        * system, then this is very likely intended to limit internal
>>> +        * fragmentation; in this case, don't attempt to allocate a large
>>> +        * anonymous folio.
>>> +        *
>>> +        * Else, if the vma is eligible for thp, allocate a large folio of the
>>> +        * size preferred by the arch. Or if the arch requested a very small
>>> +        * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
>>> +        * which still meets the arch's requirements but means we still take
>>> +        * advantage of SW optimizations (e.g. fewer page faults).
>>> +        *
>>> +        * Finally if thp is enabled but the vma isn't eligible, take the
>>> +        * arch-preferred size and limit it to ANON_FOLIO_MAX_ORDER_UNHINTED.
>>> +        * This ensures workloads that have not explicitly opted-in take benefit
>>> +        * while capping the potential for internal fragmentation.
>>> +        */
>>
>> What empirical evidence is SZ_64K based on?
>> What workloads would benefit from it?
>> How much would they benefit from it?
>> Would they benefit more or less from different values?
>> How much internal fragmentation would it cause?
>> What cost function was used to arrive at the conclusion that its
>> benefits outweigh its costs?

Sorry this has taken a little while to reply to; I've been re-running my perf
tests with the modern patches to reconfirm old data.

In terms of empirical evidence, I've run the kernel compilation benchmark (yes I
know it's a narrow use case, but I figure some data is better than no data), for
all values of ANON_FOLIO_MAX_ORDER_UNHINTED {4k, 16k, 32k, 64k, 128k, 256k}.

I've run each test 15 times across 5 system reboots on Ampere Altra (arm64),
with the kernel configured for 4K base pages - I could rerun for other base page
sizes if we want to go further down this route.

I've captured run time and peak memory usage, and taken the mean. The stdev for
the peak memory usage is big-ish, but I'm confident this still captures the
central tendency well:

| MAX_ORDER_UNHINTED |   real-time |   kern-time |   user-time | peak memory |
|:-------------------|------------:|------------:|------------:|:------------|
| 4k                 |        0.0% |        0.0% |        0.0% |        0.0% |
| 16k                |       -3.6% |      -26.5% |       -0.5% |       -0.1% |
| 32k                |       -4.8% |      -37.4% |       -0.6% |       -0.1% |
| 64k                |       -5.7% |      -42.0% |       -0.6% |       -1.1% |
| 128k               |       -5.6% |      -42.1% |       -0.7% |        1.4% |
| 256k               |       -4.9% |      -41.9% |       -0.4% |        1.9% |

64K looks like the clear sweet spot to me.

I know you have argued for using a page order in the past, rather than a size in
bytes. But my argument is that user space is mostly doing mmaps based on sizes
independent of the base page size (an assumption!) and a system's memory is
obviously a fixed quantity that doesn't change with base page size.
So it feels more natural to limit internal fragmentation based on an absolute
size rather than a quantity of pages. Kirill has also suggested using absolute
sizes in the past [1].
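
For concreteness, with the cap derived from SZ_64K: a 4K-base-page
kernel limits unhinted allocations to order-4 (16 pages), a 16K kernel
to order-2, and a 64K kernel to order-0; the absolute cap stays 64K in
each case.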

It's also worth mentioning that the file-backed memory "fault_around" mechanism
chooses 64K.

If this approach really looks unacceptable, I have a couple of other ideas. But
I personally favour the approach that is already in the patch.

1) Add a large/small flag to arch_wants_pte_order(). arm64, at least, actually
has 2 mechanisms, HPA and contpte. Currently arm64 is always returning the
contpte order, but with a flag, it could return contpte order for large, and HPA
order for small. (I know we previously passed the vma and we didn't like that,
and this is pretty similar). I still think the SW (core-mm) needs a way to
sensibly limit internal fragmentation though, so personally I still think having
an upper limit in this case is useful.

2) More radical: move to a per-vma auto-tuning solution, which looks at the
fault pattern and maintains an allocation order in the VMA, which is modified
based on the fault pattern, e.g. when we get faults that occur immediately adjacent
to the allocated range, we increase; when we get faults not connected to
previously allocated pages we decrease. I think it's an interesting thing to
look at, but certainly prefer that it's not part of an MVP implementation.
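
To sketch roughly what I have in mind for 2) (entirely hypothetical;
vma->anon_order doesn't exist and the clamp is illustrative):

	/* grow or shrink a per-vma allocation order based on fault pattern */
	static void anon_order_tune(struct vm_area_struct *vma, bool adjacent)
	{
		if (adjacent)
			vma->anon_order = min(vma->anon_order + 1,
					      ANON_FOLIO_MAX_ORDER_UNHINTED);
		else if (vma->anon_order > 0)
			vma->anon_order--;
	}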

[1]
https://lore.kernel.org/linux-mm/20230414140948.7pcaz6niyr2tpa7s@box.shutemov.name/


>>
>>> +       if ((vma->vm_flags & VM_NOHUGEPAGE) ||
>>> +           test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
>>> +           !hugepage_flags_enabled())
>>> +               order = 0;
>>> +       else {
>>> +               order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
>>> +
>>> +               if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>>> +                       order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
>>> +       }
> 
> I'm a bit surprised to see the above: why can we overload existing
> ABIs? I don't think we can. 

I think this is all covered by the conversation with David against v2; see
[2] and the replies that follow. The argument is that VM_NOHUGEPAGE (and
friends) is really a request from user space to optimize for the least
memory wastage possible and avoid populating ptes that have not been
expressly requested.

[2]
https://lore.kernel.org/linux-mm/524bacd2-4a47-2b8b-6685-c46e31a01631@redhat.com/

> Assuming we could, you would have to
> update Documentation/admin-guide/mm/transhuge.rst in the same
> patchset, and the man page for madvise() in a separate patch.

Yes, that's a fair point. Although transhuge.rst doesn't even mention
MADV_NOHUGEPAGE today.

> 
> Most importantly, existing userspace programs that don't work well
> with THPs won't be able to use (try) large folios either -- this is a
> big no-no.

I think we need some comments from David here. As mentioned, I've added this
tie-in based on his (strong) recommendation.

> 
> 
> 
>>> +
>>> +       return order;
>>> +}
>>> +
>>> +static int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
>>
>> static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>>
>> and use ERR_PTR() and its friends.

Yes, agreed. I'll change this for the next version.

>> [...]
>>
>> The rest looks good to me.

Thanks, as always, for the detailed review and feedback!

Thanks,
Ryan
Yu Zhao Aug. 1, 2023, 6:18 a.m. UTC | #4
On Wed, Jul 26, 2023 at 3:52 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> [...]
> +static int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
> +{
> +       int i;
> +       gfp_t gfp;
> +       pte_t *pte;
> +       unsigned long addr;
> +       struct vm_area_struct *vma = vmf->vma;
> +       int prefer = anon_folio_order(vma);
> +       int orders[] = {
> +               prefer,
> +               prefer > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
> +               0,
> +       };
> +
> +       *folio = NULL;
> +
> +       if (vmf_orig_pte_uffd_wp(vmf))
> +               goto fallback;

I think we need to s/vmf_orig_pte_uffd_wp/userfaultfd_armed/ here;
otherwise UFFD would miss VM_UFFD_MISSING/MINOR.
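
i.e. something like (sketch):

	if (userfaultfd_armed(vma))
		goto fallback;

so that MISSING and MINOR registrations fall back to order-0 as well.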
Yu Zhao Aug. 1, 2023, 6:36 a.m. UTC | #5
On Fri, Jul 28, 2023 at 4:13 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 27/07/2023 05:31, Yu Zhao wrote:
> > On Wed, Jul 26, 2023 at 10:41 AM Yu Zhao <yuzhao@google.com> wrote:
> >>
> >> On Wed, Jul 26, 2023 at 3:52 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>
> >>> [...]
> >>> +#ifdef CONFIG_LARGE_ANON_FOLIO
> >>> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
> >>> +               (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
> >>> +
> >>> +static int anon_folio_order(struct vm_area_struct *vma)
> >>> +{
> >>> +       int order;
> >>> +
> >>> +       /*
> >>> +        * If THP is explicitly disabled for either the vma, the process or the
> >>> +        * system, then this is very likely intended to limit internal
> >>> +        * fragmentation; in this case, don't attempt to allocate a large
> >>> +        * anonymous folio.
> >>> +        *
> >>> +        * Else, if the vma is eligible for thp, allocate a large folio of the
> >>> +        * size preferred by the arch. Or if the arch requested a very small
> >>> +        * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
> >>> +        * which still meets the arch's requirements but means we still take
> >>> +        * advantage of SW optimizations (e.g. fewer page faults).
> >>> +        *
> >>> +        * Finally if thp is enabled but the vma isn't eligible, take the
> >>> +        * arch-preferred size and limit it to ANON_FOLIO_MAX_ORDER_UNHINTED.
> >>> +        * This ensures workloads that have not explicitly opted-in take benefit
> >>> +        * while capping the potential for internal fragmentation.
> >>> +        */
> >>
> >> What empirical evidence is SZ_64K based on?
> >> What workloads would benefit from it?
> >> How much would they benefit from it?
> >> Would they benefit more or less from different values?
> >> How much internal fragmentation would it cause?
> >> What cost function was used to arrive at the conclusion that its
> >> benefits outweigh its costs?
>
> Sorry this has taken a little while to reply to; I've been re-running my perf
> tests with the modern patches to reconfirm old data.

Thanks for the data!

> In terms of empirical evidence, I've run the kernel compilation benchmark (yes I
> know it's a narrow use case, but I figure some data is better than no data), for
> all values of ANON_FOLIO_MAX_ORDER_UNHINTED {4k, 16k, 32k, 64k, 128k, 256k}.
>
> I've run each test 15 times across 5 system reboots on Ampere Altra (arm64),

What about x86 and ppc? Do we expect they might perform similarly wrt
different page sizes?

> with the kernel configured for 4K base pages - I could rerun for other base page
> sizes if we want to go further down this route.
>
> I've captured run time and peak memory usage, and taken the mean. The stdev for
> the peak memory usage is big-ish, but I'm confident this still captures the
> central tendency well:
>
> | MAX_ORDER_UNHINTED |   real-time |   kern-time |   user-time | peak memory |
> |:-------------------|------------:|------------:|------------:|:------------|
> | 4k                 |        0.0% |        0.0% |        0.0% |        0.0% |
> | 16k                |       -3.6% |      -26.5% |       -0.5% |       -0.1% |
> | 32k                |       -4.8% |      -37.4% |       -0.6% |       -0.1% |
> | 64k                |       -5.7% |      -42.0% |       -0.6% |       -1.1% |
> | 128k               |       -5.6% |      -42.1% |       -0.7% |        1.4% |
> | 256k               |       -4.9% |      -41.9% |       -0.4% |        1.9% |
>
> 64K looks like the clear sweet spot to me.

Were the tests done under memory pressure? I agree 64KB might be a
reasonable value, but I don't think we can or need to make a
conclusion at this point: there are still pending questions from my
list.

Just to double check: we only need ANON_FOLIO_MAX_ORDER_UNHINTED
because of hugepage_vma_check(), is that correct?

> I know you have argued for using a page order in the past, rather than a size in
> bytes. But my argument is that user space is mostly doing mmaps based on sizes
> independent of the base page size (an assumption!) and a system's memory is
> obviously a fixed quantity that doesn't it doesn't change with base page size.
> So it feels more natural to limit internal fragmentation based on an absolute
> size rather than a quantity of pages. Kyril have also suggested using absolute
> sizes in the past [1].
>
> It's also worth mentioning that the file-backed memory "fault_around" mechanism
> chooses 64K.

This example actually works against your argument:
1. There have been multiple reports that fault around hurt
performance and had to be disabled for some workloads over the years
-- ANON_FOLIO_MAX_ORDER_UNHINTED is likely to cause regressions too.
2. Not only can fault around be disabled, its default value can be
changed too -- this series can't do either.
3. Most importantly, fault around does not do high-order allocations
-- this series does, and high-order allocations can be very difficult
under memory pressure.

> If this approach really looks unacceptable, I have a couple of other ideas. But
> I personally favour the approach that is already in the patch.

I understand. If the answer to my question above is yes, then let's
take a step back and figure out whether overloading existing ABIs is
acceptable or not. Does this sound good to you?

> 1) Add a large/small flag to arch_wants_pte_order(). arm64, at least, actually
> has 2 mechanisms, HPA and contpte. Currently arm64 is always returning the
> contpte order, but with a flag, it could return contpte order for large, and HPA
> order for small. (I know we previously passed the vma and we didn't like that,
> and this is pretty similar). I still think the SW (core-mm) needs a way to
> sensibly limit internal fragmentation though, so personally I still think having
> an upper limit in this case is useful.
>
> 2) More radical: move to a per-vma auto-tuning solution, which looks at the
> fault pattern and maintains an allocation order in the VMA, which is modified
> based on fault pattern. e.g. When we get faults that occur immediately adjacent
> to the allocated range, we increase; when we get faults not connected to
> previously allocated pages we decrease. I think it's an interesting thing to
> look at, but certainly prefer that it's not part of an MVP implementation.
>
> [1]
> https://lore.kernel.org/linux-mm/20230414140948.7pcaz6niyr2tpa7s@box.shutemov.name/
>
>
> >>
> >>> +       if ((vma->vm_flags & VM_NOHUGEPAGE) ||
> >>> +           test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
> >>> +           !hugepage_flags_enabled())
> >>> +               order = 0;
> >>> +       else {
> >>> +               order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
> >>> +
> >>> +               if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
> >>> +                       order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
> >>> +       }
> >
> > I'm a bit surprised to see the above: why can we overload existing
> > ABIs? I don't think we can.
>
> I think this is all covered by the conversation with David against v2; see [2]
> and the following replies. Argument is that VM_NOHUGEPAGE (and friends) is really a
> request from user space to optimize for the least memory wastage possible and
> avoid populating ptes that have not been expressly requested.
>
> [2]
> https://lore.kernel.org/linux-mm/524bacd2-4a47-2b8b-6685-c46e31a01631@redhat.com/

Thanks for the info.

I think there might be a misunderstanding here.

David, can you please clarify whether you suggested we overload
(change the semantics) of existing ABIs?

This sounds like a big red flag to me. If that's really what you
suggest, can you shed some light on why this is acceptable to existing
userspace at all?

Thanks.
Yin Fengwei Aug. 1, 2023, 11:30 p.m. UTC | #6
On 8/1/23 14:36, Yu Zhao wrote:
> On Fri, Jul 28, 2023 at 4:13 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 27/07/2023 05:31, Yu Zhao wrote:
>>> On Wed, Jul 26, 2023 at 10:41 AM Yu Zhao <yuzhao@google.com> wrote:
>>>>
>>>> On Wed, Jul 26, 2023 at 3:52 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>
...
>> In terms of empirical evidence, I've run the kernel compilation benchmark (yes I
>> know it's a narrow use case, but I figure some data is better than no data), for
>> all values of ANON_FOLIO_MAX_ORDER_UNHINTED {4k, 16k, 32k, 64k, 128k, 256k}.
>>
>> I've run each test 15 times across 5 system reboots on Ampere Altra (arm64),
> 
> What about x86 and ppc? Do we expect they might perform similarly wrt
> different page sizes?
I will run the same test on an Intel x86 platform.

Regards
Yin, Fengwei

Ryan Roberts Aug. 2, 2023, 8:02 a.m. UTC | #7
On 01/08/2023 07:36, Yu Zhao wrote:
> On Fri, Jul 28, 2023 at 4:13 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 27/07/2023 05:31, Yu Zhao wrote:
>>> On Wed, Jul 26, 2023 at 10:41 AM Yu Zhao <yuzhao@google.com> wrote:
>>>>
>>>> On Wed, Jul 26, 2023 at 3:52 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>
...
>>>>> +#ifdef CONFIG_LARGE_ANON_FOLIO
>>>>> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
>>>>> +               (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
>>>>> +
>>>>> +static int anon_folio_order(struct vm_area_struct *vma)
>>>>> +{
>>>>> +       int order;
>>>>> +
>>>>> +       /*
>>>>> +        * If THP is explicitly disabled for either the vma, the process or the
>>>>> +        * system, then this is very likely intended to limit internal
>>>>> +        * fragmentation; in this case, don't attempt to allocate a large
>>>>> +        * anonymous folio.
>>>>> +        *
>>>>> +        * Else, if the vma is eligible for thp, allocate a large folio of the
>>>>> +        * size preferred by the arch. Or if the arch requested a very small
>>>>> +        * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
>>>>> +        * which still meets the arch's requirements but means we still take
>>>>> +        * advantage of SW optimizations (e.g. fewer page faults).
>>>>> +        *
>>>>> +        * Finally if thp is enabled but the vma isn't eligible, take the
>>>>> +        * arch-preferred size and limit it to ANON_FOLIO_MAX_ORDER_UNHINTED.
>>>>> +        * This ensures workloads that have not explicitly opted-in take benefit
>>>>> +        * while capping the potential for internal fragmentation.
>>>>> +        */
>>>>
>>>> What empirical evidence is SZ_64K based on?
>>>> What workloads would benefit from it?
>>>> How much would they benefit from it?
>>>> Would they benefit more or less from different values?
>>>> How much internal fragmentation would it cause?
>>>> What cost function was used to arrive at the conclusion that its
>>>> benefits outweigh its costs?
>>
>> Sorry this has taken a little while to reply to; I've been re-running my perf
>> tests with the modern patches to reconfirm old data.
> 
> Thanks for the data!
> 
>> In terms of empirical evidence, I've run the kernel compilation benchmark (yes I
>> know it's a narrow use case, but I figure some data is better than no data), for
>> all values of ANON_FOLIO_MAX_ORDER_UNHINTED {4k, 16k, 32k, 64k, 128k, 256k}.
>>
>> I've run each test 15 times across 5 system reboots on Ampere Altra (arm64),
> 
> What about x86 and ppc? Do we expect they might perform similarly wrt
> different page sizes?

It's my assumption that they should behave similarly, but I haven't actually
tested it. Thanks to Yin Fengwei for the kind offer to run these tests on x86!

Yin: I have a test tool that will automate running this and gather up the
results. Not sure if this is useful to you? I can share if you want? I also
slightly modified the code to add a boot param to set the
ANON_FOLIO_MAX_ORDER_UNHINTED value, and the test tool automatically injected
the values.
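
Something like the below, for reference (a from-memory sketch, the param name
is illustrative):

	static unsigned long anon_folio_max_unhinted = SZ_64K;

	static int __init parse_anon_folio_max_unhinted(char *s)
	{
		anon_folio_max_unhinted = memparse(s, NULL);
		return 0;
	}
	early_param("anon_folio_max_unhinted", parse_anon_folio_max_unhinted);

	/* anon_folio_order() then computes the unhinted limit as
	 * ilog2(max(anon_folio_max_unhinted, PAGE_SIZE)) - PAGE_SHIFT
	 * instead of using the compile-time constant. */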

> 
>> with the kernel configured for 4K base pages - I could rerun for other base page
>> sizes if we want to go further down this route.
>>
>> I've captured run time and peak memory usage, and taken the mean. The stdev for
>> the peak memory usage is big-ish, but I'm confident this still captures the
>> central tendency well:
>>
>> | MAX_ORDER_UNHINTED |   real-time |   kern-time |   user-time | peak memory |
>> |:-------------------|------------:|------------:|------------:|:------------|
>> | 4k                 |        0.0% |        0.0% |        0.0% |        0.0% |
>> | 16k                |       -3.6% |      -26.5% |       -0.5% |       -0.1% |
>> | 32k                |       -4.8% |      -37.4% |       -0.6% |       -0.1% |
>> | 64k                |       -5.7% |      -42.0% |       -0.6% |       -1.1% |
>> | 128k               |       -5.6% |      -42.1% |       -0.7% |        1.4% |
>> | 256k               |       -4.9% |      -41.9% |       -0.4% |        1.9% |
>>
>> 64K looks like the clear sweet spot to me.
> 
> Were the tests done under memory pressure? 

No. I have the capability to run these tests in a memcg with limited max memory
to force swap, though. I planned to do some sweeps increasing memory pressure,
all for ANON_FOLIO_MAX_ORDER_UNHINTED=64k. Doing this for all values above will
take too much time, but I could do them at a single value of max memory if that
helps? I'm not sure how I would choose that single value though? (probably do
the sweep for 64k then choose a sensible point on that graph?).

> I agree 64KB might be a
> reasonable value, but I don't think we can or need to make a
> conclusion at this point: there are still pending questions from my
> list.

You mean there are still pending questions from your list above, or you have
others that you haven't asked yet? If the former, I've answered all of the above
to the best of my ability. My view is that this value is always going to be
tuned empirically so its difficult to answer with absolutes. What can I do to
improve confidence? If the latter, then please let me know your other questions.

> 
> Just to double check: we only need ANON_FOLIO_MAX_ORDER_UNHINTED
> because of hugepage_vma_check(), is that correct?

tldr; yes, correct.

My problem is that for arm64 16k and 64k base page configs, the contpte size is
2M. It's my view that this is too big to allocate when it has not been
explicitly asked for. And I think experience with THP (which is 2M for 4K
systems today) demonstrates that. But I would still like to benefit from reduced
SW overhead where possible (i.e. reduced page faults), and I would still like to
use the contpte 2M mappings when the user has signalled that they can tolerate
the potential internal fragmentation (MADV_HUGEPAGE).

In practical terms, ANON_FOLIO_MAX_ORDER_UNHINTED does not affect arm64/4K at
all (because the contpte size is 64K) and it does not impact the other 4K base
page arches, which don't currently implement arch_wants_pte_order(), meaning they
get the default PAGE_ALLOC_COSTLY_ORDER=3=32k.
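
Working the arithmetic through for the three arm64 configs (contpte is 64K on
4K pages and 2M on 16K/64K pages):

  4K pages:  arch order 4; unhinted limit = ilog2(64K) - 12 = 4 -> 64K either way
  16K pages: arch order 7; unhinted limit = ilog2(64K) - 14 = 2 -> 64K unhinted, 2M hinted
  64K pages: arch order 5; unhinted limit = ilog2(64K) - 16 = 0 -> 64K unhinted, 2M hinted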

> 
>> I know you have argued for using a page order in the past, rather than a size in
>> bytes. But my argument is that user space is mostly doing mmaps based on sizes
>> independent of the base page size (an assumption!) and a system's memory is
>> obviously a fixed quantity that doesn't change with base page size.
>> So it feels more natural to limit internal fragmentation based on an absolute
>> size rather than a quantity of pages. Kirill has also suggested using absolute
>> sizes in the past [1].
>>
>> It's also worth mentioning that the file-backed memory "fault_around" mechanism
>> chooses 64K.
> 
> This example actually works against your argument:
> 1. There have been multiple reports that fault around hurt
> performance and had to be disabled for some workloads over the years
> -- ANON_FOLIO_MAX_ORDER_UNHINTED is likely to cause regressions too.

I don't see how ANON_FOLIO_MAX_ORDER_UNHINTED can cause regressions; it's adding
a limit making the behaviour of Large Anon Folios more similar to the old
behaviour than it otherwise would be. Without it, we will be allocating 2M
folios in some cases which would be much more likely to cause regression in
unprepared apps IMHO.

> 2. Not only can fault around be disabled, its default value can be
> changed too -- this series can't do either.

I had a mechanism for that in the previous version, but discussion concluded
that we should avoid adding the control for now and add it only if/when we have
identified a workload that will benefit.

> 3. Most importantly, fault around does not do high-order allocations
> -- this series does, and high-order allocations can be very difficult
> under memory pressure.

But ANON_FOLIO_MAX_ORDER_UNHINTED *reduces* the order from what it otherwise
would be. So I don't see how it's making things worse?

> 
>> If this approach really looks unacceptable, I have a couple of other ideas. But
>> I personally favour the approach that is already in the patch.
> 
> I understand. If the answer to my question above is yes, then let's
> take a step back and figure out whether overloading existing ABIs is
> acceptable or not. Does this sound good to you?

Yes, good idea. Hopefully my explanation above (and all the previous threads)
gives you a good idea for the problem as I see it, and how I think hooking the
THP hints is helpful to the solution. If I've understood David's previous
remarks correctly, then this also aligns with David's opinions (David, could you
confirm/deny this please?). Could you explain your position that hooking these
ABIs is a bad approach?

> 
>> 1) Add a large/small flag to arch_wants_pte_order(). arm64, at least, actually
>> has 2 mechanisms, HPA and contpte. Currently arm64 is always returning the
>> contpte order, but with a flag, it could return contpte order for large, and HPA
>> order for small. (I know we previously passed the vma and we didn't like that,
>> and this is pretty similar). I still think the SW (core-mm) needs a way to
>> sensibly limit internal fragmentation though, so personally I still think having
>> an upper limit in this case is useful.
>>
>> 2) More radical: move to a per-vma auto-tuning solution, which looks at the
>> fault pattern and maintains an allocation order in the VMA, which is modified
>> based on fault pattern. e.g. When we get faults that occur immediately adjacent
>> to the allocated range, we increase; when we get faults not connected to
>> previously allocated pages we decrease. I think it's an interesting thing to
>> look at, but certainly prefer that it's not part of an MVP implementation.
>>
>> [1]
>> https://lore.kernel.org/linux-mm/20230414140948.7pcaz6niyr2tpa7s@box.shutemov.name/
>>
>>
>>>>
>>>>> +       if ((vma->vm_flags & VM_NOHUGEPAGE) ||
>>>>> +           test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
>>>>> +           !hugepage_flags_enabled())
>>>>> +               order = 0;
>>>>> +       else {
>>>>> +               order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
>>>>> +
>>>>> +               if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>>>>> +                       order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
>>>>> +       }
>>>
>>> I'm a bit surprised to see the above: why can we overload existing
>>> ABIs? I don't think we can.
>>
>> I think this is all covered by the conversation with David against v2; see [2]
>> and proceeding replies. Argument is that VM_NOHUGEPAGE (and friends) is really a
>> request from user space to optimize for the least memory wastage possible and
>> avoid populating ptes that have not been expressly requested.
>>
>> [2]
>> https://lore.kernel.org/linux-mm/524bacd2-4a47-2b8b-6685-c46e31a01631@redhat.com/
> 
> Thanks for the info.
> 
> I think there might be a misunderstanding here.
> 
> David, can you please clarify whether you suggested we overland
> (change the semantics) of existing ABIs?
> 
> This sounds like a big red flag to me. If that's really what you
> suggest, can you shed some light on why this is acceptable to existing
> userspace at all?
> 
> Thanks.
Ryan Roberts Aug. 2, 2023, 9:04 a.m. UTC | #8
On 02/08/2023 09:02, Ryan Roberts wrote:
...

>>>
>>> I've captured run time and peak memory usage, and taken the mean. The stdev for
>>> the peak memory usage is big-ish, but I'm confident this still captures the
>>> central tendency well:
>>>
>>> | MAX_ORDER_UNHINTED |   real-time |   kern-time |   user-time | peak memory |
>>> |:-------------------|------------:|------------:|------------:|:------------|
>>> | 4k                 |        0.0% |        0.0% |        0.0% |        0.0% |
>>> | 16k                |       -3.6% |      -26.5% |       -0.5% |       -0.1% |
>>> | 32k                |       -4.8% |      -37.4% |       -0.6% |       -0.1% |
>>> | 64k                |       -5.7% |      -42.0% |       -0.6% |       -1.1% |
>>> | 128k               |       -5.6% |      -42.1% |       -0.7% |        1.4% |
>>> | 256k               |       -4.9% |      -41.9% |       -0.4% |        1.9% |
>>>
>>> 64K looks like the clear sweet spot to me.

I'm sorry about this; I've concluded that these tests are flawed. While I'm
correctly setting the MAX_ORDER_UNHINTED value in each case, this is run against
a 4K base page kernel, which means that its arch_wants_pte_order() return value
is order-4. So for MAX_ORDER_UNHINTED = {64k, 128k, 256k}, the actual order used
is order-4 (=64K):

	order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);

	if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
		order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);

So while I think we can conclude that the performance improves from 4k -> 64k,
and the peak memory is about the same, we can't conclude that 64k is definitely
where performance gains peak or that peak memory increases after this.
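
(Concretely: with 4K base pages, ANON_FOLIO_MAX_ORDER_UNHINTED evaluates to
ilog2(64K) - 12 = 4 for the 64k setting, 5 for 128k and 6 for 256k, while
arch_wants_pte_order() returns 4; so the min() clamps all three of those runs
to order-4.)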

The error bars on the memory consumption are fairly big.

I'll rework the tests so that I'm actually measuring what I was intending to
measure and repost in due course.
Ryan Roberts Aug. 2, 2023, 9:33 a.m. UTC | #9
On 01/08/2023 07:18, Yu Zhao wrote:
> On Wed, Jul 26, 2023 at 3:52 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
...
>> +static int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
>> +{
>> +       int i;
>> +       gfp_t gfp;
>> +       pte_t *pte;
>> +       unsigned long addr;
>> +       struct vm_area_struct *vma = vmf->vma;
>> +       int prefer = anon_folio_order(vma);
>> +       int orders[] = {
>> +               prefer,
>> +               prefer > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
>> +               0,
>> +       };
>> +
>> +       *folio = NULL;
>> +
>> +       if (vmf_orig_pte_uffd_wp(vmf))
>> +               goto fallback;
> 
> I think we need to s/vmf_orig_pte_uffd_wp/userfaultfd_armed/ here;
> otherwise UFFD would miss VM_UFFD_MISSING/MINOR.

I don't think this is the case. As far as I can see, VM_UFFD_MINOR only applies
to shmem and hugetlb. VM_UFFD_MISSING is checked under the PTL and if set on the
VMA, then it is handled without mapping the folio that was just allocated:

	/* Deliver the page fault to userland, check inside PT lock */
	if (userfaultfd_missing(vma)) {
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		folio_put(folio);
		return handle_userfault(vmf, VM_UFFD_MISSING);
	}

So we are racing to allocate a large folio; if the vma later turns out to have
MISSING handling registered, we drop the folio and handle it, else we map the
large folio.

The vmf_orig_pte_uffd_wp() *is* required because we need to individually check
each PTE for the uffd_wp bit and fix it up.
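
For reference, this is the fixup I mean, as it appears on the order-0 path in
do_anonymous_page() (sketch from the current code; the single new pte just
inherits the marker):

	entry = mk_pte(&folio->page, vma->vm_page_prot);
	if (vmf_orig_pte_uffd_wp(vmf))
		entry = pte_mkuffd_wp(entry);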

So I think the code is correct, but perhaps it is safer/simpler to always avoid
allocating a large folio if the vma is registered for uffd in the way you
suggest? I don't know enough about uffd to form a strong opinion either way.
Yin Fengwei Aug. 2, 2023, 1:51 p.m. UTC | #10
On 8/2/2023 4:02 PM, Ryan Roberts wrote:
> On 01/08/2023 07:36, Yu Zhao wrote:
>> On Fri, Jul 28, 2023 at 4:13 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>
>>> On 27/07/2023 05:31, Yu Zhao wrote:
>>>> On Wed, Jul 26, 2023 at 10:41 AM Yu Zhao <yuzhao@google.com> wrote:
>>>>>
>>>>> On Wed, Jul 26, 2023 at 3:52 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>
...
>>> In terms of empirical evidence, I've run the kernel compilation benchmark (yes I
>>> know it's a narrow use case, but I figure some data is better than no data), for
>>> all values of ANON_FOLIO_MAX_ORDER_UNHINTED {4k, 16k, 32k, 64k, 128k, 256k}.
>>>
>>> I've run each test 15 times across 5 system reboots on Ampere Altra (arm64),
>>
>> What about x86 and ppc? Do we expect they might perform similarly wrt
>> different page sizes?
> 
> It's my assumption that they should behave similarly, but I haven't actually
> tested it. Thanks to Yin Fengwei for the kind offer to run these tests on x86!
> 
> Yin: I have a test tool that will automate running this and gather up the
> results. Not sure if this is useful to you? I can share if you want? I also
> slightly modified the code to add a boot param to set the
> ANON_FOLIO_MAX_ORDER_UNHINTED value, and the test tool automatically injected
> the values.
Not necessary. I started to run the test. I suppose I could share the test result
tomorrow.


Regards
Yin, Fengwei

> 
>>
>>> with the kernel configured for 4K base pages - I could rerun for other base page
>>> sizes if we want to go further down this route.
>>>
>>> I've captured run time and peak memory usage, and taken the mean. The stdev for
>>> the peak memory usage is big-ish, but I'm confident this still captures the
>>> central tendency well:
>>>
>>> | MAX_ORDER_UNHINTED |   real-time |   kern-time |   user-time | peak memory |
>>> |:-------------------|------------:|------------:|------------:|:------------|
>>> | 4k                 |        0.0% |        0.0% |        0.0% |        0.0% |
>>> | 16k                |       -3.6% |      -26.5% |       -0.5% |       -0.1% |
>>> | 32k                |       -4.8% |      -37.4% |       -0.6% |       -0.1% |
>>> | 64k                |       -5.7% |      -42.0% |       -0.6% |       -1.1% |
>>> | 128k               |       -5.6% |      -42.1% |       -0.7% |        1.4% |
>>> | 256k               |       -4.9% |      -41.9% |       -0.4% |        1.9% |
>>>
>>> 64K looks like the clear sweet spot to me.
>>
>> Were the tests done under memory pressure? 
> 
> No. I do have the capability to run these tests in a memcg with limited max memory
> to force swap, though. I planned to do some sweeps increasing memory pressure,
> all for ANON_FOLIO_MAX_ORDER_UNHINTED=64k. Doing this for all values above will
> take too much time, but I could do them at a single value of max memory if that
> helps? I'm not sure how I would choose that single value though? (probably do
> the sweep for 64k then choose a sensible point on that graph?).
> 
> 
>> I agree 64KB might be a
>> reasonable value, but I don't think we can or need to make a
>> conclusion at this point: there are still pending questions from my
>> list.
> 
> You mean there are still pending questions from your list above, or you have
> others that you haven't asked yet? If the former, I've answered all of the above
> to the best of my ability. My view is that this value is always going to be
> tuned empirically, so it's difficult to answer with absolutes. What can I do to
> improve confidence? If the latter, then please let me know your other questions.
> 
>>
>> Just to double check: we only need ANON_FOLIO_MAX_ORDER_UNHINTED
>> because of hugepage_vma_check(), is that correct?
> 
> tldr; yes, correct.
> 
> My problem is that for arm64 16k and 64k base page configs, the contpte size is
> 2M. It's my view that this is too big to allocate when it has not been
> explicitly asked for. And I think experience with THP (which is 2M for 4K
> systems today) demonstrates that. But I would still like to benefit from reduced
> SW overhead where possible (i.e. reduced page faults), and I would still like to
> use the contpte 2M mappings when the user has signalled that they can tolerate
> the potential internal fragmentation (MADV_HUGEPAGE).
> 
> In practical terms, ANON_FOLIO_MAX_ORDER_UNHINTED does not affect arm64/4K at
> all (because the contpte size is 64K) and it does not impact the other 4K base
> page arches, which don't currently implement arch_wants_pte_order(), meaning they
> get the default PAGE_ALLOC_COSTLY_ORDER=3=32k.
> 
>>
>>> I know you have argued for using a page order in the past, rather than a size in
>>> bytes. But my argument is that user space is mostly doing mmaps based on sizes
>>> independent of the base page size (an assumption!) and a system's memory is
>>> obviously a fixed quantity that doesn't change with base page size.
>>> So it feels more natural to limit internal fragmentation based on an absolute
>>> size rather than a quantity of pages. Kirill has also suggested using absolute
>>> sizes in the past [1].
>>>
>>> It's also worth mentioning that the file-backed memory "fault_around" mechanism
>>> chooses 64K.
>>
>> This example actually is against your argument:
>> 1. There have been multiple reports that fault around hurt
>> performance and had to be disabled for some workloads over the years
>> -- ANON_FOLIO_MAX_ORDER_UNHINTED is likely to cause regressions too.
> 
> I don't see how ANON_FOLIO_MAX_ORDER_UNHINTED can cause regressions; it's adding
> a limit making the behaviour of Large Anon Folios more similar to the old
> behaviour than it otherwise would be. Without it, we will be allocating 2M
> folios in some cases which would be much more likely to cause regression in
> unprepared apps IMHO.
> 
>> 2. Not only can fault around be disabled, its default value can be
>> changed too -- this series can't do either.
> 
> I had a mechanism for that in the previous version, but discussion concluded
> that we should avoid adding the control for now and add it only if/when we have
> identified a workload that will benefit.
> 
>> 3. Most importantly, fault around does not do high-order allocations
>> -- this series does, and high-order allocations can be very difficult
>> under memory pressure.
> 
> But ANON_FOLIO_MAX_ORDER_UNHINTED *reduces* the order from what it otherwise
> would be. So I don't see how it's making things worse?
> 
>>
>>> If this approach really looks unacceptable, I have a couple of other ideas. But
>>> I personally favour the approach that is already in the patch.
>>
>> I understand. If the answer to my question above is yes, then let's
>> take a step back and figure out whether overloading existing ABIs is
>> acceptable or not. Does this sound good to you?
> 
> Yes, good idea. Hopefully my explanation above (and all the previous threads)
> gives you a good idea for the problem as I see it, and how I think hooking the
> THP hints is helpful to the solution. If I've understood David's previous
> remarks correctly, then this also aligns with David's opinions (David, could you
> confirm/deny this please?). Could you explain your position that hooking these
> ABIs is a bad approach?
> 
>>
>>> 1) Add a large/small flag to arch_wants_pte_order(). arm64, at least, actually
>>> has 2 mechanisms, HPA and contpte. Currently arm64 is always returning the
>>> contpte order, but with a flag, it could return contpte order for large, and HPA
>>> order for small. (I know we previously passed the vma and we didn't like that,
>>> and this is pretty similar). I still think the SW (core-mm) needs a way to
>>> sensibly limit internal fragmentation though, so personally I still think having
>>> an upper limit in this case is useful.
>>>
>>> 2) More radical: move to a per-vma auto-tuning solution, which looks at the
>>> fault pattern and maintains an allocation order in the VMA, which is modified
>>> based on fault pattern. e.g. When we get faults that occur immediately adjacent
>>> to the allocated range, we increase; when we get faults not connected to
>>> previously allocated pages we decrease. I think it's an interesting thing to
>>> look at, but certainly prefer that it's not part of an MVP implementation.
>>>
>>> [1]
>>> https://lore.kernel.org/linux-mm/20230414140948.7pcaz6niyr2tpa7s@box.shutemov.name/
>>>
>>>
>>>>>
>>>>>> +       if ((vma->vm_flags & VM_NOHUGEPAGE) ||
>>>>>> +           test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
>>>>>> +           !hugepage_flags_enabled())
>>>>>> +               order = 0;
>>>>>> +       else {
>>>>>> +               order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
>>>>>> +
>>>>>> +               if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>>>>>> +                       order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
>>>>>> +       }
>>>>
>>>> I'm a bit surprised to see the above: why can we overload existing
>>>> ABIs? I don't think we can.
>>>
>>> I think this is all covered by the conversation with David against v2; see [2]
>>> and subsequent replies. The argument is that VM_NOHUGEPAGE (and friends) is really a
>>> request from user space to optimize for the least memory wastage possible and
>>> avoid populating ptes that have not been expressly requested.
>>>
>>> [2]
>>> https://lore.kernel.org/linux-mm/524bacd2-4a47-2b8b-6685-c46e31a01631@redhat.com/
>>
>> Thanks for the info.
>>
>> I think there might be a misunderstanding here.
>>
>> David, can you please clarify whether you suggested we overload
>> (change the semantics of) existing ABIs?
>>
>> This sounds like a big red flag to me. If that's really what you
>> suggest, can you shed some light on why this is acceptable to existing
>> userspace at all?
>>
>> Thanks.
>
Yu Zhao Aug. 2, 2023, 9:05 p.m. UTC | #11
On Wed, Aug 2, 2023 at 3:33 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 01/08/2023 07:18, Yu Zhao wrote:
> > On Wed, Jul 26, 2023 at 3:52 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> ...
> >> +static int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
> >> +{
> >> +       int i;
> >> +       gfp_t gfp;
> >> +       pte_t *pte;
> >> +       unsigned long addr;
> >> +       struct vm_area_struct *vma = vmf->vma;
> >> +       int prefer = anon_folio_order(vma);
> >> +       int orders[] = {
> >> +               prefer,
> >> +               prefer > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
> >> +               0,
> >> +       };
> >> +
> >> +       *folio = NULL;
> >> +
> >> +       if (vmf_orig_pte_uffd_wp(vmf))
> >> +               goto fallback;
> >
> > I think we need to s/vmf_orig_pte_uffd_wp/userfaultfd_armed/ here;
> > otherwise UFFD would miss VM_UFFD_MISSING/MINOR.
>
> I don't think this is the case. As far as I can see, VM_UFFD_MINOR only applies
> to shmem and hugetlb.

Correct, but we don't have a helper to check against (VM_UFFD_WP |
VM_UFFD_MISSING). Reusing userfaultfd_armed() seems cleaner to me, and
even future-proof.

> VM_UFFD_MISSING is checked under the PTL and if set on the
> VMA, then it is handled without mapping the folio that was just allocated:
>
>         /* Deliver the page fault to userland, check inside PT lock */
>         if (userfaultfd_missing(vma)) {
>                 pte_unmap_unlock(vmf->pte, vmf->ptl);
>                 folio_put(folio);
>                 return handle_userfault(vmf, VM_UFFD_MISSING);
>         }
>
> So we are racing to allocate a large folio; if the vma later turns out to have
> MISSING handling registered, we drop the folio and handle it, else we map the
> large folio.

Yes, then we have allocated a large folio (with great effort if under
heavy memory pressure) for nothing.

> The vmf_orig_pte_uffd_wp() *is* required because we need to individually check
> each PTE for the uffd_wp bit and fix it up.

This is not correct: we cannot see a WP PTE before we see
VM_UFFD_WP. So checking VM_UFFD_WP is perfectly safe.

The reason we might want to check individual PTEs is because WP can be
done to a subrange of a VMA that has VM_UFFD_WP, which I don't think
is the common case or worth considering here. But if you want to keep
it, that's fine with me. Without some comments, the next person might
find these two checks confusing though, if you plan to add both.
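
Something like the sketch below is the distinction being discussed. This is
only an illustration (the helper name is made up; userfaultfd_armed() and
vmf_orig_pte_uffd_wp() are the existing helpers referred to in this thread):

static bool skip_large_folio_for_uffd(struct vm_fault *vmf)
{
	/* VMA-level: any uffd registration at all (WP, MISSING or MINOR). */
	if (userfaultfd_armed(vmf->vma))
		return true;

	/* PTE-level: only if the faulting pte carries the uffd-wp marker. */
	return vmf_orig_pte_uffd_wp(vmf);
}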

> So I think the code is correct, but perhaps it is safer/simpler to always avoid
> allocating a large folio if the vma is registered for uffd in the way you
> suggest? I don't know enough about uffd to form a strong opinion either way.

Yes, it's not about correctness. Just a second thought about not
allocating large folios unnecessarily when possible.
Yin Fengwei Aug. 3, 2023, 8:05 a.m. UTC | #12
On 7/28/23 18:13, Ryan Roberts wrote:
> On 27/07/2023 05:31, Yu Zhao wrote:
>> On Wed, Jul 26, 2023 at 10:41 AM Yu Zhao <yuzhao@google.com> wrote:
>>>
>>> On Wed, Jul 26, 2023 at 3:52 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> ...
>>>> +#ifdef CONFIG_LARGE_ANON_FOLIO
>>>> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
>>>> +               (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
>>>> +
>>>> +static int anon_folio_order(struct vm_area_struct *vma)
>>>> +{
>>>> +       int order;
>>>> +
>>>> +       /*
>>>> +        * If THP is explicitly disabled for either the vma, the process or the
>>>> +        * system, then this is very likely intended to limit internal
>>>> +        * fragmentation; in this case, don't attempt to allocate a large
>>>> +        * anonymous folio.
>>>> +        *
>>>> +        * Else, if the vma is eligible for thp, allocate a large folio of the
>>>> +        * size preferred by the arch. Or if the arch requested a very small
>>>> +        * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
>>>> +        * which still meets the arch's requirements but means we still take
>>>> +        * advantage of SW optimizations (e.g. fewer page faults).
>>>> +        *
>>>> +        * Finally if thp is enabled but the vma isn't eligible, take the
>>>> +        * arch-preferred size and limit it to ANON_FOLIO_MAX_ORDER_UNHINTED.
>>>> +        * This ensures workloads that have not explicitly opted-in take benefit
>>>> +        * while capping the potential for internal fragmentation.
>>>> +        */
>>>
>>> What empirical evidence is SZ_64K based on?
>>> What workloads would benefit from it?
>>> How much would they benefit from it?
>>> Would they benefit more or less from different values?
>>> How much internal fragmentation would it cause?
>>> What cost function was used to arrive at the conclusion that its
>>> benefits outweigh its costs?
> 
> Sorry this has taken a little while to reply to; I've been re-running my perf
> tests with the modern patches to reconfirm old data.
> 
> In terms of empirical evidence, I've run the kernel compilation benchmark (yes I
> know it's a narrow use case, but I figure some data is better than no data), for
> all values of ANON_FOLIO_MAX_ORDER_UNHINTED {4k, 16k, 32k, 64k, 128k, 256k}.
> 
> I've run each test 15 times across 5 system reboots on Ampere Altra (arm64),
> with the kernel configured for 4K base pages - I could rerun for other base page
> sizes if we want to go further down this route.
> 
> I've captured run time and peak memory usage, and taken the mean. The stdev for
> the peak memory usage is big-ish, but I'm confident this still captures the
> central tendency well:
> 
> | MAX_ORDER_UNHINTED |   real-time |   kern-time |   user-time | peak memory |
> |:-------------------|------------:|------------:|------------:|:------------|
> | 4k                 |        0.0% |        0.0% |        0.0% |        0.0% |
> | 16k                |       -3.6% |      -26.5% |       -0.5% |       -0.1% |
> | 32k                |       -4.8% |      -37.4% |       -0.6% |       -0.1% |
> | 64k                |       -5.7% |      -42.0% |       -0.6% |       -1.1% |
> | 128k               |       -5.6% |      -42.1% |       -0.7% |        1.4% |
> | 256k               |       -4.9% |      -41.9% |       -0.4% |        1.9% |

Here is my test result:

		real		user		sys
hink-4k:	 0%		0%		0%
hink-16K:	-3%		0.1%		-18.3%
hink-32K:	-4%		0.2%		-27.2%
hink-64K:	-4%		0.5%		-31.0%
hink-128K:	-4%		0.9%		-33.7%
hink-256K:	-5%		1%		-34.6%


I used the command:
/usr/bin/time -f "\t%E real,\t%U user,\t%S sys" make -skj96 allmodconfig all
to build the kernel and collect the real time/user time/kernel time.
/sys/kernel/mm/transparent_hugepage/enabled is "madvise".
Let me know if you have any questions about the test.

I also found one strange behavior with this version. It's related to why
I need to set /sys/kernel/mm/transparent_hugepage/enabled to "madvise".
If it's "never", the large folio is disabled as well.
If it's "always", THP will be active before large folio, so the system is
in a mixed mode; that's not suitable for this test.

So if it's "never", large folio is disabled. But why "madvise" enables large
folio unconditionly? Suppose it's only enabled for the VMA range which user
madvise large folio (or THP)?

Specifically for the hink setting, my understanding is that we can't choose it only
by this testing. Other workloads may have different behavior with a different
hink setting.


Regards
Yin, Fengwei

> 
> 64K looks like the clear sweet spot to me.
> 
> ...
> 
> 
>>>
>>>> +       if ((vma->vm_flags & VM_NOHUGEPAGE) ||
>>>> +           test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
>>>> +           !hugepage_flags_enabled())
>>>> +               order = 0;
>>>> +       else {
>>>> +               order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
>>>> +
>>>> +               if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>>>> +                       order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
>>>> +       }
>>
>> I'm a bit surprised to see the above: why can we overload existing
>> ABIs? I don't think we can. 
> 
> I think this is all covered by the conversation with David against v2; see [2]
> and subsequent replies. The argument is that VM_NOHUGEPAGE (and friends) is really a
> request from user space to optimize for the least memory wastage possible and
> avoid populating ptes that have not been expressly requested.
> 
> [2]
> https://lore.kernel.org/linux-mm/524bacd2-4a47-2b8b-6685-c46e31a01631@redhat.com/
> 
>> Assuming we could, you would have to
>> update Documentation/admin-guide/mm/transhuge.rst in the same
>> patchset, and the man page for madvise() in a separate patch.
> 
> Yes, that's a fair point. Although transhuge.rst doesn't even mention
> MADV_NOHUGEPAGE today.
> 
>>
>> Most importantly, existing userspace programs that don't work well
>> with THPs won't be able to use (try) large folios either -- this is a
>> big no no.
> 
> I think we need some comments from David here. As mentioned, I've added this
> tie-in based on his (strong) recommendation.
> 
>>
>>
>>
>>>> +
>>>> +       return order;
>>>> +}
>>>> +
>>>> +static int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
>>>
>>> static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>>>
>>> and use ERR_PTR() and its friends.
> 
> Yes, agreed. I'll change this for the next version.
> 
>>>
>>>> +{
>>>> +       int i;
>>>> +       gfp_t gfp;
>>>> +       pte_t *pte;
>>>> +       unsigned long addr;
>>>> +       struct vm_area_struct *vma = vmf->vma;
>>>> +       int prefer = anon_folio_order(vma);
>>>> +       int orders[] = {
>>>> +               prefer,
>>>> +               prefer > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
>>>> +               0,
>>>> +       };
>>>> +
>>>> +       *folio = NULL;
>>>> +
>>>> +       if (vmf_orig_pte_uffd_wp(vmf))
>>>> +               goto fallback;
>>>> +
>>>> +       for (i = 0; orders[i]; i++) {
>>>> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
>>>> +               if (addr >= vma->vm_start &&
>>>> +                   addr + (PAGE_SIZE << orders[i]) <= vma->vm_end)
>>>> +                       break;
>>>> +       }
>>>> +
>>>> +       if (!orders[i])
>>>> +               goto fallback;
>>>> +
>>>> +       pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>>>> +       if (!pte)
>>>> +               return -EAGAIN;
>>>> +
>>>> +       for (; orders[i]; i++) {
>>>> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
>>>> +               vmf->pte = pte + pte_index(addr);
>>>> +               if (!vmf_pte_range_changed(vmf, 1 << orders[i]))
>>>> +                       break;
>>>> +       }
>>>> +
>>>> +       vmf->pte = NULL;
>>>> +       pte_unmap(pte);
>>>> +
>>>> +       gfp = vma_thp_gfp_mask(vma);
>>>> +
>>>> +       for (; orders[i]; i++) {
>>>> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
>>>> +               *folio = vma_alloc_folio(gfp, orders[i], vma, addr, true);
>>>> +               if (*folio) {
>>>> +                       clear_huge_page(&(*folio)->page, addr, 1 << orders[i]);
>>>> +                       return 0;
>>>> +               }
>>>> +       }
>>>> +
>>>> +fallback:
>>>> +       *folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
>>>> +       return *folio ? 0 : -ENOMEM;
>>>> +}
>>>> +#else
>>>> +static inline int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
>>>> +{
>>>> +       *folio = vma_alloc_zeroed_movable_folio(vmf->vma, vmf->address);
>>>> +       return *folio ? 0 : -ENOMEM;
>>>> +}
>>>> +#endif
>>>> +
>>>>  /*
>>>>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
>>>>   * but allow concurrent faults), and pte mapped but not yet locked.
>>>> @@ -4057,6 +4178,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>   */
>>>>  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>>>  {
>>>> +       int i = 0;
>>>> +       int nr_pages = 1;
>>>> +       unsigned long addr = vmf->address;
>>>>         bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
>>>>         struct vm_area_struct *vma = vmf->vma;
>>>>         struct folio *folio;
>>>> @@ -4101,10 +4225,15 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>>>         /* Allocate our own private page. */
>>>>         if (unlikely(anon_vma_prepare(vma)))
>>>>                 goto oom;
>>>> -       folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
>>>> +       ret = alloc_anon_folio(vmf, &folio);
>>>> +       if (unlikely(ret == -EAGAIN))
>>>> +               return 0;
>>>>         if (!folio)
>>>>                 goto oom;
>>>>
>>>> +       nr_pages = folio_nr_pages(folio);
>>>> +       addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
>>>> +
>>>>         if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
>>>>                 goto oom_free_page;
>>>>         folio_throttle_swaprate(folio, GFP_KERNEL);
>>>> @@ -4116,17 +4245,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>>>          */
>>>>         __folio_mark_uptodate(folio);
>>>>
>>>> -       entry = mk_pte(&folio->page, vma->vm_page_prot);
>>>> -       entry = pte_sw_mkyoung(entry);
>>>> -       if (vma->vm_flags & VM_WRITE)
>>>> -               entry = pte_mkwrite(pte_mkdirty(entry));
>>>> -
>>>> -       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>>>> -                       &vmf->ptl);
>>>> +       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
>>>>         if (!vmf->pte)
>>>>                 goto release;
>>>> -       if (vmf_pte_changed(vmf)) {
>>>> -               update_mmu_tlb(vma, vmf->address, vmf->pte);
>>>> +       if (vmf_pte_range_changed(vmf, nr_pages)) {
>>>> +               for (i = 0; i < nr_pages; i++)
>>>> +                       update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
>>>>                 goto release;
>>>>         }
>>>>
>>>> @@ -4141,16 +4265,24 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>>>                 return handle_userfault(vmf, VM_UFFD_MISSING);
>>>>         }
>>>>
>>>> -       inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>>>> -       folio_add_new_anon_rmap(folio, vma, vmf->address);
>>>> +       folio_ref_add(folio, nr_pages - 1);
>>>> +       add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
>>>> +       folio_add_new_anon_rmap(folio, vma, addr);
>>>>         folio_add_lru_vma(folio, vma);
>>>> +
>>>> +       for (i = 0; i < nr_pages; i++) {
>>>> +               entry = mk_pte(folio_page(folio, i), vma->vm_page_prot);
>>>> +               entry = pte_sw_mkyoung(entry);
>>>> +               if (vma->vm_flags & VM_WRITE)
>>>> +                       entry = pte_mkwrite(pte_mkdirty(entry));
>>>>  setpte:
>>>> -       if (uffd_wp)
>>>> -               entry = pte_mkuffd_wp(entry);
>>>> -       set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>>>> +               if (uffd_wp)
>>>> +                       entry = pte_mkuffd_wp(entry);
>>>> +               set_pte_at(vma->vm_mm, addr + PAGE_SIZE * i, vmf->pte + i, entry);
>>>>
>>>> -       /* No need to invalidate - it was non-present before */
>>>> -       update_mmu_cache(vma, vmf->address, vmf->pte);
>>>> +               /* No need to invalidate - it was non-present before */
>>>> +               update_mmu_cache(vma, addr + PAGE_SIZE * i, vmf->pte + i);
>>>> +       }
>>>>  unlock:
>>>>         if (vmf->pte)
>>>>                 pte_unmap_unlock(vmf->pte, vmf->ptl);
>>>
>>> The rest looks good to me.
> 
> Thanks, as always, for the detailed review and feedback!
> 
> Thanks,
> Ryan
> 
> 
>
Ryan Roberts Aug. 3, 2023, 8:21 a.m. UTC | #13
On 03/08/2023 09:05, Yin Fengwei wrote:

...

>> ...
> 
> Here is my test result:
> 
> 		real		user		sys
> hink-4k:	 0%		0%		0%
> hink-16K:	-3%		0.1%		-18.3%
> hink-32K:	-4%		0.2%		-27.2%
> hink-64K:	-4%		0.5%		-31.0%
> hink-128K:	-4%		0.9%		-33.7%
> hink-256K:	-5%		1%		-34.6%
> 
> 
> I used the command:
> /usr/bin/time -f "\t%E real,\t%U user,\t%S sys" make -skj96 allmodconfig all
> to build the kernel and collect the real time/user time/kernel time.
> /sys/kernel/mm/transparent_hugepage/enabled is "madvise".
> Let me know if you have any questions about the test.

Thanks for doing this! I have a couple of questions:

 - how many times did you run each test?

 - how did you configure the large page size? (I sent an email out yesterday
   saying that I was doing it wrong from my tests, so the 128k and 256k results
   for my test set are not valid.

 - what does "hink" mean??

> 
> I also found one strange behavior with this version. It's related to why
> I need to set /sys/kernel/mm/transparent_hugepage/enabled to "madvise".
> If it's "never", the large folio is disabled as well.
> If it's "always", THP will be active before large folio, so the system is
> in a mixed mode; that's not suitable for this test.

We had a discussion around this in the THP meeting yesterday. I'm going to write
this up properly so we can have a proper systematic discussion. The tentative
conclusion is that MADV_NOHUGEPAGE must continue to mean "do not fault in more
than is absolutely necessary". I would assume we need to extend that thinking to
the process-wide and system-wide knobs (as is done in the patch), but we didn't
explicitly say so in the meeting.

My intention is that if you have requested THP and your vma is big enough for
PMD-size then you get that, else you fall back to large anon folios. And if you
have neither opted in nor out, then you get large anon folios.

We talked about the idea of adding a new knob that lets you set the max order,
but that needs a lot more thought.

Anyway, as I said, I'll write it up so we can all systematically discuss.
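
To make that concrete, the decision order I have in mind is roughly the sketch
below (illustrative only; the names are made up and none of this is settled):

enum fault_size { ORDER_0, LARGE_ANON_FOLIO, PMD_SIZE_THP };

static enum fault_size pick_fault_size(int thp_disabled, int thp_hinted,
				       int vma_fits_pmd)
{
	if (thp_disabled)		/* MADV_NOHUGEPAGE etc: fault in the minimum */
		return ORDER_0;
	if (thp_hinted && vma_fits_pmd)	/* explicit opt-in and room for a PMD */
		return PMD_SIZE_THP;
	return LARGE_ANON_FOLIO;	/* neither opted in nor out */
}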

Yin Fengwei Aug. 3, 2023, 8:37 a.m. UTC | #14
On 8/3/23 16:21, Ryan Roberts wrote:
> On 03/08/2023 09:05, Yin Fengwei wrote:
> 
> ...
> 
>> ...
> 
> Thanks for doing this! I have a couple of questions:
> 
>  - how many times did you run each test?
     Three times for each ANON_FOLIO_MAX_ORDER_UNHINTED. The stddev is quite
     small, less than 1%.
> 
>  - how did you configure the large page size? (I sent an email out yesterday
>    saying that I was doing it wrong from my tests, so the 128k and 256k results
>    for my test set are not valid.
     I changed the ANON_FOLIO_MAX_ORDER_UNHINTED definition manually every time.

> 
>  - what does "hink" mean??
     Sorry for the typo. It should be ANON_FOLIO_MAX_ORDER_UNHINTED.

> 
>>
>> I also found one strange behavior with this version. It's related to why
>> I need to set /sys/kernel/mm/transparent_hugepage/enabled to "madvise".
>> If it's "never", the large folio is disabled as well.
>> If it's "always", THP will be active before large folio, so the system is
>> in a mixed mode; that's not suitable for this test.
> 
> We had a discussion around this in the THP meeting yesterday. I'm going to write
> this up properly so we can have a proper systematic discussion. The tentative
> conclusion is that MADV_NOHUGEPAGE must continue to mean "do not fault in more
> than is absolutely necessary". I would assume we need to extend that thinking to
> the process-wide and system-wide knobs (as is done in the patch), but we didn't
> explicitly say so in the meeting.
There are cases where THP is not appreciated because of the latency or memory
consumption. For these cases, large folio may fill the gap with less latency and
memory consumption.


So if disabling THP means large folio can't be used, we lose the chance to
benefit those cases with large folios.


Regards
Yin, Fengwei

Ryan Roberts Aug. 3, 2023, 9:32 a.m. UTC | #15
On 03/08/2023 09:37, Yin Fengwei wrote:
> 
> 
> On 8/3/23 16:21, Ryan Roberts wrote:
>> On 03/08/2023 09:05, Yin Fengwei wrote:
>>
>> ...
>>
>>> ...
>>
>> Thanks for doing this! I have a couple of questions:
>>
>>  - how many times did you run each test?
>      Three times for each ANON_FOLIO_MAX_ORDER_UNHINTED. The stddev is quite
>      small, less than 1%.

And out of interest, were you running on bare metal or in a VM? And did you reboot
between each run?

>>
>>  - how did you configure the large page size? (I sent an email out yesterday
>>    saying that I was doing it wrong from my tests, so the 128k and 256k results
>>    for my test set are not valid.
>      I changed the ANON_FOLIO_MAX_ORDER_UNHINTED definition manually every time.

In that case, I think your results are broken in a similar way to mine. This
code means that order will never be higher than 3 (32K) on x86:

+		order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
+
+		if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
+			order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);

On x86, arch_wants_pte_order() is not implemented and the default returns -1, so
you end up with:

	order = min(PAGE_ALLOC_COSTLY_ORDER, ANON_FOLIO_MAX_ORDER_UNHINTED)

So your 4k, 16k and 32k results should be valid, but 64k, 128k and 256k results
are actually using 32k, I think? Which is odd because you are getting more
stddev than the < 1% you quoted above? So perhaps this is down to rebooting
(kaslr, or something...?)

(on arm64, arch_wants_pte_order() returns 4, so my 64k result is also valid).
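
To make the clamping concrete, here is a tiny standalone sketch of the logic
above (my own illustration with assumed values, not code from the patch;
PAGE_SHIFT=12 assumes 4K base pages):

#include <stdio.h>

#define PAGE_SHIFT		12	/* 4K base pages */
#define PAGE_ALLOC_COSTLY_ORDER	3
/* ilog2(max(SZ_64K, PAGE_SIZE)) - PAGE_SHIFT = 16 - 12 = 4 */
#define MAX_ORDER_UNHINTED	(16 - PAGE_SHIFT)

static int effective_order(int arch_wants_pte_order, int hinted)
{
	int order = arch_wants_pte_order > PAGE_ALLOC_COSTLY_ORDER ?
			arch_wants_pte_order : PAGE_ALLOC_COSTLY_ORDER;

	if (!hinted && order > MAX_ORDER_UNHINTED)
		order = MAX_ORDER_UNHINTED;
	return order;
}

int main(void)
{
	/* x86 today: no arch_wants_pte_order(), default -1 -> order 3 (32K) */
	printf("x86 unhinted:   order %d\n", effective_order(-1, 0));
	/* arm64/4K: arch_wants_pte_order() returns 4 -> order 4 (64K) */
	printf("arm64 unhinted: order %d\n", effective_order(4, 0));
	return 0;
}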

As a quick hack to work around this, would you be able to change the code to this:

+		if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
+			order = ANON_FOLIO_MAX_ORDER_UNHINTED;

> 
>>
>>  - what does "hink" mean??
>      Sorry for the typo. It should be ANON_FOLIO_MAX_ORDER_UNHINTED.
> 
>>
>>>
>>> I also found one strange behavior with this version. It's related to why
>>> I need to set /sys/kernel/mm/transparent_hugepage/enabled to "madvise".
>>> If it's "never", the large folio is disabled as well.
>>> If it's "always", THP will be active before large folio, so the system is
>>> in a mixed mode; that's not suitable for this test.
>>
>> We had a discussion around this in the THP meeting yesterday. I'm going to write
>> this up properly so we can have a proper systematic discussion. The tentative
>> conclusion is that MADV_NOHUGEPAGE must continue to mean "do not fault in more
>> than is absolutely necessary". I would assume we need to extend that thinking to
>> the process-wide and system-wide knobs (as is done in the patch), but we didn't
>> explicitly say so in the meeting.
> There are cases where THP is not appreciated because of the latency or memory
> consumption. For these cases, large folio may fill the gap with less latency and
> memory consumption.
> 
> 
> So if disabling THP means large folio can't be used, we lose the chance to
> benefit those cases with large folios.

Yes, I appreciate that. But there are also real use cases that expect
MADV_NOHUGEPAGE to mean "do not fault more than is absolutely necessary", and those
use cases break if that's not obeyed (e.g. live migration w/ qemu). So I think
we need to be conservative to start. The apps that are explicitly forbidding
THP today should be updated in the long run to opt in to large anon folios
using some as-yet undefined control.

Yin Fengwei Aug. 3, 2023, 9:58 a.m. UTC | #16
On 8/3/23 17:32, Ryan Roberts wrote:
> On 03/08/2023 09:37, Yin Fengwei wrote:
>>
>>
>> On 8/3/23 16:21, Ryan Roberts wrote:
>>> On 03/08/2023 09:05, Yin Fengwei wrote:
>>>
>>> ...
>>>
>>>> ...
>>>
>>> Thanks for doing this! I have a couple of questions:
>>>
>>>  - how many times did you run each test?
>>      Three times for each ANON_FOLIO_MAX_ORDER_UNHINTED. The stddev is quite
>>      small, less than 1%.
> 
> And out of interest, were you running on bare metal or in a VM? And did you reboot
> between each run?
I ran the test on a bare metal env. I didn't reboot for every run, but had to
reboot for each different ANON_FOLIO_MAX_ORDER_UNHINTED size. I do
   echo 3 > /proc/sys/vm/drop_caches
for every run after "make mrproper", even after a fresh boot.


> 
>>>
>>>  - how did you configure the large page size? (I sent an email out yesterday
>>>    saying that I was doing it wrong from my tests, so the 128k and 256k results
>>>    for my test set are not valid.
>>      I changed the ANON_FOLIO_MAX_ORDER_UNHINTED definition manually every time.
> 
> In that case, I think your results are broken in a similar way to mine. This
> code means that order will never be higher than 3 (32K) on x86:
> 
> +		order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
> +
> +		if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
> +			order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
> 
> On x86, arch_wants_pte_order() is not implemented and the default returns -1, so
> you end up with:
I added arch_wants_pte_order() for x86 and gave it a very large number. So the
order is decided by ANON_FOLIO_MAX_ORDER_UNHINTED. I suppose my data is valid.
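
For reference, the local override looks roughly like the below. This is only an
illustrative test hack (the value 8 stands in for "very large"; per the patch's
comment, the order must stay below PMD_SHIFT - PAGE_SHIFT):

/* x86 test hack, following the #ifndef arch_wants_pte_order pattern: */
#define arch_wants_pte_order arch_wants_pte_order
static inline int arch_wants_pte_order(void)
{
	return 8;	/* 1M with 4K pages; above every UNHINTED value tested */
}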

> 
> 	order = min(PAGE_ALLOC_COSTLY_ORDER, ANON_FOLIO_MAX_ORDER_UNHINTED)
> 
> So your 4k, 16k and 32k results should be valid, but 64k, 128k and 256k results
> are actually using 32k, I think? Which is odd because you are getting more
> stddev than the < 1% you quoted above? So perhaps this is down to rebooting
> (kaslr, or something...?)
> 
> (on arm64, arch_wants_pte_order() returns 4, so my 64k result is also valid).
> 
> As a quick hack to work around this, would you be able to change the code to this:
> 
> +		if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
> +			order = ANON_FOLIO_MAX_ORDER_UNHINTED;
> 
>>
>>>
>>>  - what does "hink" mean??
>>      Sorry for the typo. It should be ANON_FOLIO_MAX_ORDER_UNHINTED.
>>
>>>
>>>>
>>>> I also found one strange behavior with this version. It's related to why
>>>> I need to set /sys/kernel/mm/transparent_hugepage/enabled to "madvise".
>>>> If it's "never", the large folio is disabled as well.
>>>> If it's "always", THP will be active before large folio, so the system is
>>>> in a mixed mode; that's not suitable for this test.
>>>
>>> We had a discussion around this in the THP meeting yesterday. I'm going to write
>>> this up properly so we can have a proper systematic discussion. The tentative
>>> conclusion is that MADV_NOHUGEPAGE must continue to mean "do not fault in more
>>> than is absolutely necessary". I would assume we need to extend that thinking to
>>> the process-wide and system-wide knobs (as is done in the patch), but we didn't
>>> explicitly say so in the meeting.
>> There are cases where THP is not appreciated because of the latency or memory
>> consumption. For these cases, large folio may fill the gap with less latency and
>> memory consumption.
>>
>>
>> So if disabling THP means large folio can't be used, we lose the chance to
>> benefit those cases with large folios.
> 
> Yes, I appreciate that. But there are also real use cases that expect
> MADV_NOHUGEPAGE to mean "do not fault more than is absolutely necessary", and those
> use cases break if that's not obeyed (e.g. live migration w/ qemu). So I think
> we need to be conservative to start. The apps that are explicitly forbidding
> THP today should be updated in the long run to opt in to large anon folios
> using some as-yet undefined control.
Fair enough.


Regards
Yin, Fengwei

> 
>>
>>
>> Regards
>> Yin, Fengwei
>>
>>>
>>> My intention is that if you have requested THP and your vma is big enough for
>>> PMD-size then you get that, else you fall back to large anon folios. And if you
>>> have neither opted in nor out, then you get large anon folios.
>>>
>>> We talked about the idea of adding a new knob that lets you set the max order,
>>> but that needs a lot more thought.
>>>
>>> Anyway, as I said, I'll write it up so we can all systematically discuss.
>>>
>>>>
>>>> So if it's "never", large folio is disabled. But why does "madvise" enable large
>>>> folio unconditionally? Should it only be enabled for the VMA range which the user
>>>> madvised (for large folio or THP)?
>>>>
>>>> Specific to the hink setting, my understanding is that we can't choose it only
>>>> by this testing. Other workloads may have different behavior with a different
>>>> hink setting.
>>>>
>>>>
>>>> Regards
>>>> Yin, Fengwei
>>>>
>>>
>
Ryan Roberts Aug. 3, 2023, 10:24 a.m. UTC | #17
On 02/08/2023 22:05, Yu Zhao wrote:
> On Wed, Aug 2, 2023 at 3:33 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 01/08/2023 07:18, Yu Zhao wrote:
>>> On Wed, Jul 26, 2023 at 3:52 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
>>>> allocated in large folios of a determined order. All pages of the large
>>>> folio are pte-mapped during the same page fault, significantly reducing
>>>> the number of page faults. The number of per-page operations (e.g. ref
>>>> counting, rmap management, lru list management) are also significantly
>>>> reduced since those ops now become per-folio.
>>>>
>>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
>>>> which defaults to disabled for now; the long term aim is for this to
>>>> default to enabled, but there are some risks around internal
>>>> fragmentation that need to be better understood first.
>>>>
>>>> When enabled, the folio order is determined as such: For a vma, process
>>>> or system that has explicitly disabled THP, we continue to allocate
>>>> order-0. THP is most likely disabled to avoid any possible internal
>>>> fragmentation so we honour that request.
>>>>
>>>> Otherwise, the return value of arch_wants_pte_order() is used. For vmas
>>>> that have not explicitly opted-in to use transparent hugepages (e.g.
>>>> where thp=madvise and the vma does not have MADV_HUGEPAGE), then
>>>> arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever is
>>>> bigger). This allows for a performance boost without requiring any
>>>> explicit opt-in from the workload while limiting internal
>>>> fragmentation.
>>>>
>>>> If the preferred order can't be used (e.g. because the folio would
>>>> breach the bounds of the vma, or because ptes in the region are already
>>>> mapped) then we fall back to a suitable lower order; first
>>>> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>>>>
>>>> arch_wants_pte_order() can be overridden by the architecture if desired.
>>>> Some architectures (e.g. arm64) can coalesce TLB entries if a contiguous
>>>> set of ptes map physically contiguous, naturally aligned memory, so this
>>>> mechanism allows the architecture to optimize as required.
>>>>
>>>> Here we add the default implementation of arch_wants_pte_order(), used
>>>> when the architecture does not define it, which returns -1, implying
>>>> that the HW has no preference. In this case, mm will choose its own
>>>> default order.
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> ---
>>>>  include/linux/pgtable.h |  13 ++++
>>>>  mm/Kconfig              |  10 +++
>>>>  mm/memory.c             | 166 ++++++++++++++++++++++++++++++++++++----
>>>>  3 files changed, 172 insertions(+), 17 deletions(-)
>>>>
>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>> index 5063b482e34f..2a1d83775837 100644
>>>> --- a/include/linux/pgtable.h
>>>> +++ b/include/linux/pgtable.h
>>>> @@ -313,6 +313,19 @@ static inline bool arch_has_hw_pte_young(void)
>>>>  }
>>>>  #endif
>>>>
>>>> +#ifndef arch_wants_pte_order
>>>> +/*
>>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
>>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
>>>> + * to be at least order-2. Negative value implies that the HW has no preference
>>>> + * and mm will choose its own default order.
>>>> + */
>>>> +static inline int arch_wants_pte_order(void)
>>>> +{
>>>> +       return -1;
>>>> +}
>>>> +#endif
>>>> +
>>>>  #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
>>>>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>>>                                        unsigned long address,
>>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>>> index 09130434e30d..fa61ea160447 100644
>>>> --- a/mm/Kconfig
>>>> +++ b/mm/Kconfig
>>>> @@ -1238,4 +1238,14 @@ config LOCK_MM_AND_FIND_VMA
>>>>
>>>>  source "mm/damon/Kconfig"
>>>>
>>>> +config LARGE_ANON_FOLIO
>>>> +       bool "Allocate large folios for anonymous memory"
>>>> +       depends on TRANSPARENT_HUGEPAGE
>>>> +       default n
>>>> +       help
>>>> +         Use large (bigger than order-0) folios to back anonymous memory where
>>>> +         possible, even for pte-mapped memory. This reduces the number of page
>>>> +         faults, as well as other per-page overheads to improve performance for
>>>> +         many workloads.
>>>> +
>>>>  endmenu
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 01f39e8144ef..64c3f242c49a 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -4050,6 +4050,127 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>         return ret;
>>>>  }
>>>>
>>>> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
>>>> +{
>>>> +       int i;
>>>> +
>>>> +       if (nr_pages == 1)
>>>> +               return vmf_pte_changed(vmf);
>>>> +
>>>> +       for (i = 0; i < nr_pages; i++) {
>>>> +               if (!pte_none(ptep_get_lockless(vmf->pte + i)))
>>>> +                       return true;
>>>> +       }
>>>> +
>>>> +       return false;
>>>> +}
>>>> +
>>>> +#ifdef CONFIG_LARGE_ANON_FOLIO
>>>> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
>>>> +               (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
>>>> +
>>>> +static int anon_folio_order(struct vm_area_struct *vma)
>>>> +{
>>>> +       int order;
>>>> +
>>>> +       /*
>>>> +        * If THP is explicitly disabled for either the vma, the process or the
>>>> +        * system, then this is very likely intended to limit internal
>>>> +        * fragmentation; in this case, don't attempt to allocate a large
>>>> +        * anonymous folio.
>>>> +        *
>>>> +        * Else, if the vma is eligible for thp, allocate a large folio of the
>>>> +        * size preferred by the arch. Or if the arch requested a very small
>>>> +        * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
>>>> +        * which still meets the arch's requirements but means we still take
>>>> +        * advantage of SW optimizations (e.g. fewer page faults).
>>>> +        *
>>>> +        * Finally if thp is enabled but the vma isn't eligible, take the
>>>> +        * arch-preferred size and limit it to ANON_FOLIO_MAX_ORDER_UNHINTED.
>>>> +        * This ensures workloads that have not explicitly opted-in take benefit
>>>> +        * while capping the potential for internal fragmentation.
>>>> +        */
>>>> +
>>>> +       if ((vma->vm_flags & VM_NOHUGEPAGE) ||
>>>> +           test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
>>>> +           !hugepage_flags_enabled())
>>>> +               order = 0;
>>>> +       else {
>>>> +               order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
>>>> +
>>>> +               if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>>>> +                       order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
>>>> +       }
>>>> +
>>>> +       return order;
>>>> +}
>>>> +
>>>> +static int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
>>>> +{
>>>> +       int i;
>>>> +       gfp_t gfp;
>>>> +       pte_t *pte;
>>>> +       unsigned long addr;
>>>> +       struct vm_area_struct *vma = vmf->vma;
>>>> +       int prefer = anon_folio_order(vma);
>>>> +       int orders[] = {
>>>> +               prefer,
>>>> +               prefer > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
>>>> +               0,
>>>> +       };
>>>> +
>>>> +       *folio = NULL;
>>>> +
>>>> +       if (vmf_orig_pte_uffd_wp(vmf))
>>>> +               goto fallback;
>>>
>>> I think we need to s/vmf_orig_pte_uffd_wp/userfaultfd_armed/ here;
>>> otherwise UFFD would miss VM_UFFD_MISSING/MINOR.
>>
>> I don't think this is the case. As far as I can see, VM_UFFD_MINOR only applies
>> to shmem and hugetlb.
> 
> Correct, but we don't have a helper to check against (VM_UFFD_WP |
> VM_UFFD_MISSING). Reusing userfaultfd_armed() seems cleaner to me, and
> even future-proof.
> 
>> VM_UFFD_MISSING is checked under the PTL and if set on the
>> VMA, then it is handled without mapping the folio that was just allocated:
>>
>>         /* Deliver the page fault to userland, check inside PT lock */
>>         if (userfaultfd_missing(vma)) {
>>                 pte_unmap_unlock(vmf->pte, vmf->ptl);
>>                 folio_put(folio);
>>                 return handle_userfault(vmf, VM_UFFD_MISSING);
>>         }
>>
>> So we are racing to allocate a large folio; if the vma later turns out to have
>> MISSING handling registered, we drop the folio and handle it, else we map the
>> large folio.
> 
> Yes, then we have allocated a large folio (with great effort if under
> heavy memory pressure) for nothing.
> 
>> The vmf_orig_pte_uffd_wp() *is* required because we need to individually check
>> each PTE for the uffd_wp bit and fix it up.
> 
> This is not correct: we cannot see a WP PTE before you see
> VM_UFFD_WP. So checking VM_UFFD_WP is perfectly safe.

I think you misunderstood me; I was trying to say that assuming we don't check
userfaultfd_armed(), we still need the vmf_orig_pte_uffd_wp() check, because
we need to ensure that the uffd-wp marker gets preserved for that specific pte,
and we can only do that if we are operating on a single pte.

> 
> The reason we might want to check individual PTEs is because WP can be
> done to a subrange of a VMA that has VM_UFFD_WP, which I don't think
> is the common case and worth considering here. But if you want to keep
> it, that's fine with me. Without some comments, the next person might
> find these two checks confusing though, if you plan to add both.

I'm not proposing we need both checks.

> 
>> So I think the code is correct, but perhaps it is safer/simpler to always avoid
>> allocating a large folio if the vma is registered for uffd in the way you
>> suggest? I don't know enough about uffd to form a strong opinion either way.
> 
> Yes, it's not about correctness. Just a second thought about not
> allocating large folios unnecessarily when possible.

OK, I misunderstood you; I thought your original point was about correctness.

Anyway, you have convinced me that we should
s/vmf_orig_pte_uffd_wp/userfaultfd_armed/ on the grounds that trying hard to
allocate a high order folio is almost always going to be a waste of effort. I'll
change this in the next version.
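
(Untested sketch of the intended delta, for clarity -- the v5 code may differ:)

	-	if (vmf_orig_pte_uffd_wp(vmf))
	+	if (userfaultfd_armed(vma))
	 		goto fallback;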

Thanks,
Ryan
Ryan Roberts Aug. 3, 2023, 10:27 a.m. UTC | #18
On 03/08/2023 10:58, Yin Fengwei wrote:
> 
> 
> On 8/3/23 17:32, Ryan Roberts wrote:
>> On 03/08/2023 09:37, Yin Fengwei wrote:
>>>
>>>
>>> On 8/3/23 16:21, Ryan Roberts wrote:
>>>> On 03/08/2023 09:05, Yin Fengwei wrote:
>>>>
>>>> ...
>>>>
>>>>>> I've captured run time and peak memory usage, and taken the mean. The stdev for
>>>>>> the peak memory usage is big-ish, but I'm confident this still captures the
>>>>>> central tendency well:
>>>>>>
>>>>>> | MAX_ORDER_UNHINTED |   real-time |   kern-time |   user-time | peak memory |
>>>>>> |:-------------------|------------:|------------:|------------:|:------------|
>>>>>> | 4k                 |        0.0% |        0.0% |        0.0% |        0.0% |
>>>>>> | 16k                |       -3.6% |      -26.5% |       -0.5% |       -0.1% |
>>>>>> | 32k                |       -4.8% |      -37.4% |       -0.6% |       -0.1% |
>>>>>> | 64k                |       -5.7% |      -42.0% |       -0.6% |       -1.1% |
>>>>>> | 128k               |       -5.6% |      -42.1% |       -0.7% |        1.4% |
>>>>>> | 256k               |       -4.9% |      -41.9% |       -0.4% |        1.9% |
>>>>>
>>>>> Here is my test result:
>>>>>
>>>>> 		real		user		sys
>>>>> hink-4k:	 0%		0%		0%
>>>>> hink-16K:	-3%		0.1%		-18.3%
>>>>> hink-32K:	-4%		0.2%		-27.2%
>>>>> hink-64K:	-4%		0.5%		-31.0%
>>>>> hink-128K:	-4%		0.9%		-33.7%
>>>>> hink-256K:	-5%		1%		-34.6%
>>>>>
>>>>>
>>>>> I used this command:
>>>>> /usr/bin/time -f "\t%E real,\t%U user,\t%S sys" make -skj96 allmodconfig all
>>>>> to build kernel and collect the real time/user time/kernel time.
>>>>> /sys/kernel/mm/transparent_hugepage/enabled is "madvise".
>>>>> Let me know if you have any question about the test.
>>>>
>>>> Thanks for doing this! I have a couple of questions:
>>>>
>>>>  - how many times did you run each test?
>>>      Three times for each ANON_FOLIO_MAX_ORDER_UNHINTED. The stddev is quite
>>>      small, less than 1%.
>>
>> And out of interest, were you running on bare metal or in VM? And did you reboot
>> between each run?
> I ran the test on a bare metal env. I didn't reboot for every run, but had to reboot
> for each different ANON_FOLIO_MAX_ORDER_UNHINTED size. I do
>    echo 3 > /proc/sys/vm/drop_caches
> for every run after "make mrproper", even after a fresh boot.
> 
> 
>>
>>>>
>>>>  - how did you configure the large page size? (I sent an email out yesterday
>>>>    saying that I was doing it wrong from my tests, so the 128k and 256k results
>>>>    for my test set are not valid.)
>>>      I changed the ANON_FOLIO_MAX_ORDER_UNHINTED definition manually every time.
>>
>> In that case, I think your results are broken in a similar way to mine. This
>> code means that order will never be higher than 3 (32K) on x86:
>>
>> +		order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
>> +
>> +		if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>> +			order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
>>
>> On x86, arch_wants_pte_order() is not implemented and the default returns -1, so
>> you end up with:
> I added arch_wants_pte_order() for x86 and gave it a very large number. So the
> order is decided by ANON_FOLIO_MAX_ORDER_UNHINTED. I suppose my data is valid.

Ahh great! OK, sorry for the noise.

Given part of the rationale for the experiment was to plot perf against memory
usage, did you collect any memory numbers?

> 
>>
>> 	order = min(PAGE_ALLOC_COSTLY_ORDER, ANON_FOLIO_MAX_ORDER_UNHINTED)
>>
>> So your 4k, 16k and 32k results should be valid, but 64k, 128k and 256k results
>> are actually using 32k, I think? Which is odd because you are getting more
>> stddev than the < 1% you quoted above? So perhaps this is down to rebooting
>> (kaslr, or something...?)
>>
>> (on arm64, arch_wants_pte_order() returns 4, so my 64k result is also valid).
>>
>> As a quick hack to work around this, would you be able to change the code to this:
>>
>> +		if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>> +			order = ANON_FOLIO_MAX_ORDER_UNHINTED;
>>
>>>
>>>>
>>>>  - what does "hink" mean??
>>>      Sorry for the typo. It should be ANON_FOLIO_MAX_ORDER_UNHINTED.
>>>
>>>>
>>>>>
>>>>> I also found one strange behavior with this version. It's related to why
>>>>> I need to set the /sys/kernel/mm/transparent_hugepage/enabled to "madvise".
>>>>> If it's "never", the large folio is disabled as well.
>>>>> If it's "always", THP will be active before large folio, so the system is
>>>>> in a mixed mode; it's not suitable for this test.
>>>>
>>>> We had a discussion around this in the THP meeting yesterday. I'm going to write
>>>> this up properly so we can have a proper systematic discussion. The tentative
>>>> conclusion is that MADV_NOHUGEPAGE must continue to mean "do not fault in more
>>>> than is absolutely necessary". I would assume we need to extend that thinking to
>>>> the process-wide and system-wide knobs (as is done in the patch), but we didn't
>>>> explicitly say so in the meeting.
>>> There are cases where THP is not appreciated because of the latency or memory
>>> consumption. For those cases, large folio may fill the gap, with less latency and
>>> memory consumption.
>>>
>>>
>>> So if disabling THP means large folio can't be used, we lose the chance to
>>> benefit those cases with large folio.
>>
>> Yes, I appreciate that. But there are also real use cases that expect
>> MADV_NOHUGEPAGE means "do not fault more than is absolutely necessary" and the
>> use cases break if that's not obeyed (e.g. live migration w/ qemu). So I think
>> we need to be conservative to start. These apps that are explicitly forbidding
>> THP today, should be updated in the long run to opt-in to large anon folios
>> using some as-yet undefined control.
> Fair enough.
> 
> 
> Regards
> Yin, Fengwei
> 
>>
>>>
>>>
>>> Regards
>>> Yin, Fengwei
>>>
>>>>
>>>> My intention is that if you have requested THP and your vma is big enough for
>>>> PMD-size then you get that, else you fall back to large anon folios. And if you
>>>> have neither opted in nor out, then you get large anon folios.
>>>>
>>>> We talked about the idea of adding a new knob that lets you set the max order,
>>>> but that needs a lot more thought.
>>>>
>>>> Anyway, as I said, I'll write it up so we can all systematically discuss.
>>>>
>>>>>
>>>>> So if it's "never", large folio is disabled. But why does "madvise" enable large
>>>>> folio unconditionally? Should it only be enabled for the VMA range which the user
>>>>> madvised (for large folio or THP)?
>>>>>
>>>>> Specific to the hink setting, my understanding is that we can't choose it only
>>>>> by this testing. Other workloads may have different behavior with a different
>>>>> hink setting.
>>>>>
>>>>>
>>>>> Regards
>>>>> Yin, Fengwei
>>>>>
>>>>
>>
Yin Fengwei Aug. 3, 2023, 10:54 a.m. UTC | #19
On 8/3/23 18:27, Ryan Roberts wrote:
> On 03/08/2023 10:58, Yin Fengwei wrote:
>>
>>
>> On 8/3/23 17:32, Ryan Roberts wrote:
>>> On 03/08/2023 09:37, Yin Fengwei wrote:
>>>>
>>>>
>>>> On 8/3/23 16:21, Ryan Roberts wrote:
>>>>> On 03/08/2023 09:05, Yin Fengwei wrote:
>>>>>
>>>>> ...
>>>>>
>>>>>>> I've captured run time and peak memory usage, and taken the mean. The stdev for
>>>>>>> the peak memory usage is big-ish, but I'm confident this still captures the
>>>>>>> central tendency well:
>>>>>>>
>>>>>>> | MAX_ORDER_UNHINTED |   real-time |   kern-time |   user-time | peak memory |
>>>>>>> |:-------------------|------------:|------------:|------------:|:------------|
>>>>>>> | 4k                 |        0.0% |        0.0% |        0.0% |        0.0% |
>>>>>>> | 16k                |       -3.6% |      -26.5% |       -0.5% |       -0.1% |
>>>>>>> | 32k                |       -4.8% |      -37.4% |       -0.6% |       -0.1% |
>>>>>>> | 64k                |       -5.7% |      -42.0% |       -0.6% |       -1.1% |
>>>>>>> | 128k               |       -5.6% |      -42.1% |       -0.7% |        1.4% |
>>>>>>> | 256k               |       -4.9% |      -41.9% |       -0.4% |        1.9% |
>>>>>>
>>>>>> Here is my test result:
>>>>>>
>>>>>> 		real		user		sys
>>>>>> hink-4k:	 0%		0%		0%
>>>>>> hink-16K:	-3%		0.1%		-18.3%
>>>>>> hink-32K:	-4%		0.2%		-27.2%
>>>>>> hink-64K:	-4%		0.5%		-31.0%
>>>>>> hink-128K:	-4%		0.9%		-33.7%
>>>>>> hink-256K:	-5%		1%		-34.6%
>>>>>>
>>>>>>
>>>>>> I used this command:
>>>>>> /usr/bin/time -f "\t%E real,\t%U user,\t%S sys" make -skj96 allmodconfig all
>>>>>> to build kernel and collect the real time/user time/kernel time.
>>>>>> /sys/kernel/mm/transparent_hugepage/enabled is "madvise".
>>>>>> Let me know if you have any question about the test.
>>>>>
>>>>> Thanks for doing this! I have a couple of questions:
>>>>>
>>>>>  - how many times did you run each test?
>>>>      Three times for each ANON_FOLIO_MAX_ORDER_UNHINTED. The stddev is quite
>>>>      small, less than 1%.
>>>
>>> And out of interest, were you running on bare metal or in VM? And did you reboot
>>> between each run?
>> I ran the test on a bare metal env. I didn't reboot for every run, but had to reboot
>> for each different ANON_FOLIO_MAX_ORDER_UNHINTED size. I do
>>    echo 3 > /proc/sys/vm/drop_caches
>> for every run after "make mrproper", even after a fresh boot.
>>
>>
>>>
>>>>>
>>>>>  - how did you configure the large page size? (I sent an email out yesterday
>>>>>    saying that I was doing it wrong from my tests, so the 128k and 256k results
>>>>>    for my test set are not valid.)
>>>>      I changed the ANON_FOLIO_MAX_ORDER_UNHINTED definition manually every time.
>>>
>>> In that case, I think your results are broken in a similar way to mine. This
>>> code means that order will never be higher than 3 (32K) on x86:
>>>
>>> +		order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
>>> +
>>> +		if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>>> +			order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
>>>
>>> On x86, arch_wants_pte_order() is not implemented and the default returns -1, so
>>> you end up with:
>> I added arch_wants_pte_order() for x86 and gave it a very large number. So the
>> order is decided by ANON_FOLIO_MAX_ORDER_UNHINTED. I suppose my data is valid.
> 
> Ahh great! ok sorry for the noise.
> 
> Given part of the rationale for the experiment was to plot perf against memory
> usage, did you collect any memory numbers?
No. I didn't collect the memory consumption.

Regards
Yin, Fengwei

> 
>>
>>>
>>> 	order = min(PAGE_ALLOC_COSTLY_ORDER, ANON_FOLIO_MAX_ORDER_UNHINTED)
>>>
>>> So your 4k, 16k and 32k results should be valid, but 64k, 128k and 256k results
>>> are actually using 32k, I think? Which is odd because you are getting more
>>> stddev than the < 1% you quoted above? So perhaps this is down to rebooting
>>> (kaslr, or something...?)
>>>
>>> (on arm64, arch_wants_pte_order() returns 4, so my 64k result is also valid).
>>>
>>> As a quick hack to work around this, would you be able to change the code to this:
>>>
>>> +		if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>>> +			order = ANON_FOLIO_MAX_ORDER_UNHINTED;
>>>
>>>>
>>>>>
>>>>>  - what does "hink" mean??
>>>>      Sorry for the typo. It should be ANON_FOLIO_MAX_ORDER_UNHINTED.
>>>>
>>>>>
>>>>>>
>>>>>> I also found one strange behavior with this version. It's related to why
>>>>>> I need to set the /sys/kernel/mm/transparent_hugepage/enabled to "madvise".
>>>>>> If it's "never", the large folio is disabled as well.
>>>>>> If it's "always", THP will be active before large folio, so the system is
>>>>>> in a mixed mode; it's not suitable for this test.
>>>>>
>>>>> We had a discussion around this in the THP meeting yesterday. I'm going to write
>>>>> this up properly so we can have a proper systematic discussion. The tentative
>>>>> conclusion is that MADV_NOHUGEPAGE must continue to mean "do not fault in more
>>>>> than is absolutely necessary". I would assume we need to extend that thinking to
>>>>> the process-wide and system-wide knobs (as is done in the patch), but we didn't
>>>>> explicitly say so in the meeting.
>>>> There are cases where THP is not appreciated because of the latency or memory
>>>> consumption. For those cases, large folio may fill the gap, with less latency and
>>>> memory consumption.
>>>>
>>>>
>>>> So if disabling THP means large folio can't be used, we lose the chance to
>>>> benefit those cases with large folio.
>>>
>>> Yes, I appreciate that. But there are also real use cases that expect
>>> MADV_NOHUGEPAGE means "do not fault more than is absolutely necessary" and the
>>> use cases break if that's not obeyed (e.g. live migration w/ qemu). So I think
>>> we need to be conservative to start. These apps that are explicitly forbidding
>>> THP today, should be updated in the long run to opt-in to large anon folios
>>> using some as-yet undefined control.
>> Fair enough.
>>
>>
>> Regards
>> Yin, Fengwei
>>
>>>
>>>>
>>>>
>>>> Regards
>>>> Yin, Fengwei
>>>>
>>>>>
>>>>> My intention is that if you have requested THP and your vma is big enough for
>>>>> PMD-size then you get that, else you fall back to large anon folios. And if you
>>>>> have neither opted in nor out, then you get large anon folios.
>>>>>
>>>>> We talked about the idea of adding a new knob that lets you set the max order,
>>>>> but that needs a lot more thought.
>>>>>
>>>>> Anyway, as I said, I'll write it up so we can all systematically discuss.
>>>>>
>>>>>>
>>>>>> So if it's "never", large folio is disabled. But why does "madvise" enable large
>>>>>> folio unconditionally? Should it only be enabled for the VMA range which the user
>>>>>> madvised (for large folio or THP)?
>>>>>>
>>>>>> Specific to the hink setting, my understanding is that we can't choose it only
>>>>>> by this testing. Other workloads may have different behavior with a different
>>>>>> hink setting.
>>>>>>
>>>>>>
>>>>>> Regards
>>>>>> Yin, Fengwei
>>>>>>
>>>>>
>>>
>
Ryan Roberts Aug. 3, 2023, 12:43 p.m. UTC | #20
+ Kirill

On 26/07/2023 10:51, Ryan Roberts wrote:
> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
> allocated in large folios of a determined order. All pages of the large
> folio are pte-mapped during the same page fault, significantly reducing
> the number of page faults. The number of per-page operations (e.g. ref
> counting, rmap management, lru list management) are also significantly
> reduced since those ops now become per-folio.
> 
> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
> which defaults to disabled for now; the long term aim is for this to
> default to enabled, but there are some risks around internal
> fragmentation that need to be better understood first.
> 
> When enabled, the folio order is determined as such: For a vma, process
> or system that has explicitly disabled THP, we continue to allocate
> order-0. THP is most likely disabled to avoid any possible internal
> fragmentation so we honour that request.
> 
> Otherwise, the return value of arch_wants_pte_order() is used. For vmas
> that have not explicitly opted-in to use transparent hugepages (e.g.
> where thp=madvise and the vma does not have MADV_HUGEPAGE), then
> arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever is
> bigger). This allows for a performance boost without requiring any
> explicit opt-in from the workload while limiting internal
> fragmentation.
> 
> If the preferred order can't be used (e.g. because the folio would
> breach the bounds of the vma, or because ptes in the region are already
> mapped) then we fall back to a suitable lower order; first
> PAGE_ALLOC_COSTLY_ORDER, then order-0.
> 

...

> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
> +		(ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
> +
> +static int anon_folio_order(struct vm_area_struct *vma)
> +{
> +	int order;
> +
> +	/*
> +	 * If THP is explicitly disabled for either the vma, the process or the
> +	 * system, then this is very likely intended to limit internal
> +	 * fragmentation; in this case, don't attempt to allocate a large
> +	 * anonymous folio.
> +	 *
> +	 * Else, if the vma is eligible for thp, allocate a large folio of the
> +	 * size preferred by the arch. Or if the arch requested a very small
> +	 * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
> +	 * which still meets the arch's requirements but means we still take
> +	 * advantage of SW optimizations (e.g. fewer page faults).
> +	 *
> +	 * Finally if thp is enabled but the vma isn't eligible, take the
> +	 * arch-preferred size and limit it to ANON_FOLIO_MAX_ORDER_UNHINTED.
> +	 * This ensures workloads that have not explicitly opted-in take benefit
> +	 * while capping the potential for internal fragmentation.
> +	 */
> +
> +	if ((vma->vm_flags & VM_NOHUGEPAGE) ||
> +	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
> +	    !hugepage_flags_enabled())
> +		order = 0;
> +	else {
> +		order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
> +
> +		if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
> +			order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
> +	}
> +
> +	return order;
> +}


Hi All,

I'm writing up the conclusions that we arrived at during discussion in the THP
meeting yesterday, regarding linkage with existing THP ABIs. It would be great if
I can get explicit "agree" or disagree + rationale from at least David, Yu and
Kirill.

In summary; I think we are converging on the approach that is already coded, but
I'd like confirmation.



The THP situation today
-----------------------

 - At system level: THP can be set to "never", "madvise" or "always"
 - At process level: THP can be "never" or "defer to system setting"
 - At VMA level: no-hint, MADV_HUGEPAGE, MADV_NOHUGEPAGE

That gives us this table to describe how a page fault is handled, according to
process state (columns) and vma flags (rows):

                | never     | madvise   | always
----------------|-----------|-----------|-----------
no hint         | S         | S         | THP>S
MADV_HUGEPAGE   | S         | THP>S     | THP>S
MADV_NOHUGEPAGE | S         | S         | S

Legend:
S	allocate single page (PTE-mapped)
LAF	allocate large anon folio (PTE-mapped)
THP	allocate THP-sized folio (PMD-mapped)
>	fallback (usually because vma size/alignment insufficient for folio)



Principles for Large Anon Folios (LAF)
--------------------------------------

David tells us there are use cases today (e.g. qemu live migration) which use
MADV_NOHUGEPAGE to mean "don't fill any PTEs that are not explicitly faulted"
and these use cases will break (i.e. functionally incorrect) if this request is
not honoured.

So LAF must at least honour MADV_NOHUGEPAGE to prevent breaking existing use
cases. And once we do this, then I think the least confusing thing is for it to
also honor the "never" system/process state; so if either the system, process or
vma has explicitly opted-out of THP, then LAF should also be bypassed.

Similarly, any case that would previously cause the allocation of PMD-sized THP
must continue to be honoured, else we risk performance regression.

That leaves the "madvise/no-hint" case, and all THP fallback paths due to the
VMA not being correctly aligned or sized to hold a PMD-sized mapping. In these
cases, we will attempt to use LAF first, and fall back to single page if the vma
size/alignment doesn't permit it.

                | never     | madvise   | always
----------------|-----------|-----------|-----------
no hint         | S         | LAF>S     | THP>LAF>S
MADV_HUGEPAGE   | S         | THP>LAF>S | THP>LAF>S
MADV_NOHUGEPAGE | S         | S         | S

I think this (perhaps conservative) approach will be the least surprising to
users, and it is the policy that is already implemented in this patch.



Downsides of this policy
------------------------

As Yu and Yin have pointed out, there are some workloads which do not perform
well with THP, due to large fault latency or memory wastage, etc., but which
_may_ still benefit from LAF. By taking the conservative approach, we exclude
these workloads from benefiting automatically.

But given they have explicitly opted out of THP, it doesn't seem unreasonable
that those workloads should be explicitly modified to opt-in to LAF. The
question is what should a control for this look like? And do we need to
implement the control for an MVP implementation of LAF? For the latter question,
I would suggest this can come later - it's a tool to further optimize, but its
absence does not regress today's performance.

What should a control look like?

One suggestion was to expose a "maximum order" tunable, which would limit the
size of THP that could be allocated. Setting it to 1M would cause traditional
THP to be bypassed (assuming for now PMD-sized THP is 2M) but would permit LAF.
But Kirill suggested that this type of control might turn out to be restrictive
in the long run.
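
(For illustration only: if we wanted a quick way to experiment with such a cap,
a debugfs knob could be as small as the sketch below. The knob name and hook-up
are invented here; nothing like this is in the patch.)

	/* sketch: needs <linux/debugfs.h>; "laf_max_order" is a made-up name */
	static u32 laf_max_order = ANON_FOLIO_MAX_ORDER_UNHINTED;

	static int __init laf_debugfs_init(void)
	{
		debugfs_create_u32("laf_max_order", 0644, NULL, &laf_max_order);
		return 0;
	}
	late_initcall(laf_debugfs_init);

	/* ...and in anon_folio_order(), after the existing logic: */
	order = min_t(int, order, READ_ONCE(laf_max_order));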

Another suggestion was to provide a more abstracted hint to the kernel, which
the kernel could then derive a policy from, and that policy would be easier to
change over time.



Large Anon Folio Size
---------------------

Once we have decided to use LAF (vs THP vs S), we need to decide how big the
folio should be. If/when we get a control as described above, that will
obviously place an upper bound on the size. HW may also have a preferred size
due to tricks it can do in the TLB (arch_wants_pte_order() in this patch) but
you may still want to allocate a bigger folio than the HW wants (since bigger
folios will reduce page faults) or you may want to allocate a smaller folio than
the HW wants (due to concerns about latency or memory wastage).

I've had a stab at addressing this in the patch too, using the same decision as
for THP (ignoring the vma size/alignment requirement) to decide if we use the HW
preferred order or if we cap it (currently set at 64K).

Thoughts, comments?

Thanks,
Ryan
Kirill A . Shutemov Aug. 3, 2023, 2:21 p.m. UTC | #21
On Thu, Aug 03, 2023 at 01:43:31PM +0100, Ryan Roberts wrote:
> + Kirill
> 
> On 26/07/2023 10:51, Ryan Roberts wrote:
> > Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
> > allocated in large folios of a determined order. All pages of the large
> > folio are pte-mapped during the same page fault, significantly reducing
> > the number of page faults. The number of per-page operations (e.g. ref
> > counting, rmap management, lru list management) are also significantly
> > reduced since those ops now become per-folio.
> > 
> > The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
> > which defaults to disabled for now; the long term aim is for this to
> > default to enabled, but there are some risks around internal
> > fragmentation that need to be better understood first.
> > 
> > When enabled, the folio order is determined as such: For a vma, process
> > or system that has explicitly disabled THP, we continue to allocate
> > order-0. THP is most likely disabled to avoid any possible internal
> > fragmentation so we honour that request.
> > 
> > Otherwise, the return value of arch_wants_pte_order() is used. For vmas
> > that have not explicitly opted-in to use transparent hugepages (e.g.
> > where thp=madvise and the vma does not have MADV_HUGEPAGE), then
> > arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever is
> > bigger). This allows for a performance boost without requiring any
> > explicit opt-in from the workload while limiting internal
> > fragmentation.
> > 
> > If the preferred order can't be used (e.g. because the folio would
> > breach the bounds of the vma, or because ptes in the region are already
> > mapped) then we fall back to a suitable lower order; first
> > PAGE_ALLOC_COSTLY_ORDER, then order-0.
> > 
> 
> ...
> 
> > +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
> > +		(ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
> > +
> > +static int anon_folio_order(struct vm_area_struct *vma)
> > +{
> > +	int order;
> > +
> > +	/*
> > +	 * If THP is explicitly disabled for either the vma, the process or the
> > +	 * system, then this is very likely intended to limit internal
> > +	 * fragmentation; in this case, don't attempt to allocate a large
> > +	 * anonymous folio.
> > +	 *
> > +	 * Else, if the vma is eligible for thp, allocate a large folio of the
> > +	 * size preferred by the arch. Or if the arch requested a very small
> > +	 * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
> > +	 * which still meets the arch's requirements but means we still take
> > +	 * advantage of SW optimizations (e.g. fewer page faults).
> > +	 *
> > +	 * Finally if thp is enabled but the vma isn't eligible, take the
> > +	 * arch-preferred size and limit it to ANON_FOLIO_MAX_ORDER_UNHINTED.
> > +	 * This ensures workloads that have not explicitly opted-in take benefit
> > +	 * while capping the potential for internal fragmentation.
> > +	 */
> > +
> > +	if ((vma->vm_flags & VM_NOHUGEPAGE) ||
> > +	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
> > +	    !hugepage_flags_enabled())
> > +		order = 0;
> > +	else {
> > +		order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
> > +
> > +		if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
> > +			order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
> > +	}
> > +
> > +	return order;
> > +}
> 
> 
> Hi All,
> 
> I'm writing up the conclusions that we arrived at during discussion in the THP
> meeting yesterday, regarding linkage with existing THP ABIs. It would be great if
> I can get explicit "agree" or disagree + rationale from at least David, Yu and
> Kirill.
> 
> In summary; I think we are converging on the approach that is already coded, but
> I'd like confirmation.
> 
> 
> 
> The THP situation today
> -----------------------
> 
>  - At system level: THP can be set to "never", "madvise" or "always"
>  - At process level: THP can be "never" or "defer to system setting"
>  - At VMA level: no-hint, MADV_HUGEPAGE, MADV_NOHUGEPAGE
> 
> That gives us this table to describe how a page fault is handled, according to
> process state (columns) and vma flags (rows):
> 
>                 | never     | madvise   | always
> ----------------|-----------|-----------|-----------
> no hint         | S         | S         | THP>S
> MADV_HUGEPAGE   | S         | THP>S     | THP>S
> MADV_NOHUGEPAGE | S         | S         | S
> 
> Legend:
> S	allocate single page (PTE-mapped)
> LAF	allocate large anon folio (PTE-mapped)
> THP	allocate THP-sized folio (PMD-mapped)
> >	fallback (usually because vma size/alignment insufficient for folio)
> 
> 
> 
> Principles for Large Anon Folios (LAF)
> --------------------------------------
> 
> David tells us there are use cases today (e.g. qemu live migration) which use
> MADV_NOHUGEPAGE to mean "don't fill any PTEs that are not explicitly faulted"
> and these use cases will break (i.e. functionally incorrect) if this request is
> not honoured.
> 
> So LAF must at least honour MADV_NOHUGEPAGE to prevent breaking existing use
> cases. And once we do this, then I think the least confusing thing is for it to
> also honor the "never" system/process state; so if either the system, process or
> vma has explicitly opted-out of THP, then LAF should also be bypassed.
> 
> Similarly, any case that would previously cause the allocation of PMD-sized THP
> must continue to be honoured, else we risk performance regression.
> 
> That leaves the "madvise/no-hint" case, and all THP fallback paths due to the
> VMA not being correctly aligned or sized to hold a PMD-sized mapping. In these
> cases, we will attempt to use LAF first, and fall back to single page if the vma
> size/alignment doesn't permit it.
> 
>                 | never     | madvise   | always
> ----------------|-----------|-----------|-----------
> no hint         | S         | LAF>S     | THP>LAF>S
> MADV_HUGEPAGE   | S         | THP>LAF>S | THP>LAF>S
> MADV_NOHUGEPAGE | S         | S         | S
> 
> I think this (perhaps conservative) approach will be the least surprising to
> users, and it is the policy that is already implemented in this patch.

This looks very reasonable.

The only questionable field is no-hint/madvise. I can argue for both LAF>S
and S here. I think LAF>S is fine as long as we are not too aggressive
with allocation order.

I think we need to work on eliminating reasons for users to set 'never'.
If something behaves better with 'never', the kernel has failed the user.

> Downsides of this policy
> ------------------------
> 
> As Yu and Yin have pointed out, there are some workloads which do not perform
> well with THP, due to large fault latency or memory wastage, etc., but which
> _may_ still benefit from LAF. By taking the conservative approach, we exclude
> these workloads from benefiting automatically.

Hm. I don't buy it. Why is THP with order-9 too much, but order-8 LAF
fine?

If allocation latency is a problem, it has to be fixed. Maybe by
introducing an API to the page allocator where the user can request a range of
acceptable orders and the page allocator returns the largest readily available,
possibly starting background compaction.
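
(Sketching the shape such an API might take -- purely illustrative; the name
and signature are invented and no such interface exists today:)

	/*
	 * Try the acceptable orders from largest to smallest without entering
	 * reclaim (GFP_TRANSHUGE_LIGHT), then make a normal blocking order-0
	 * allocation. A real version might also kick background compaction
	 * for the preferred order. Order-1 is skipped since large anon folios
	 * must be at least order-2.
	 */
	static struct folio *alloc_anon_folio_range(struct vm_area_struct *vma,
			unsigned long addr, int max_order)
	{
		int order;
		struct folio *folio;

		for (order = max_order; order > 1; order--) {
			folio = vma_alloc_folio(GFP_TRANSHUGE_LIGHT, order,
						vma, addr, true);
			if (folio)
				return folio;
		}
		return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, addr, false);
	}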

> But given they have explicitly opted out of THP, it doesn't seem unreasonable
> that those workloads should be explicitly modified to opt-in to LAF.

No, we should address the reason why THP is off. I think there
shouldn't be a hard wall between THP and LAF, but a smooth gradient.

> The
> question is what should a control for this look like? And do we need to
> implement the control for an MVP implementation of LAF? For the latter question,
> I would suggest this can come later - it's a tool to further optimize, but its
> absence does not regress today's performance.
> 
> What should a control look like?

I would start with zero-API. Let's see if we can live with it.

If something is required for debug or benchmarking, we can add it to
debugfs.

> One suggestion was to expose a "maximum order" tunable, which would limit the
> size of THP that could be allocated. Setting it to 1M would cause traditional
> THP to be bypassed (assuming for now PMD-sized THP is 2M) but would permit LAF.
> But Kirill suggested that this type of control might turn out to be restrictive
> in the long run.
> 
> Another suggestion was to provide a more abstracted hint to the kernel, which
> the kernel could then derive a policy from, and that policy would be easier to
> change over time.
> 
> 
> 
> Large Anon Folio Size
> ---------------------
> 
> Once we have decided to use LAF (vs THP vs S), we need to decide how big the
> folio should be. If/when we get a control as described above, that will
> obviously place an upper bound on the size. HW may also have a preferred size
> due to tricks it can do in the TLB (arch_wants_pte_order() in this patch) but
> you may still want to allocate a bigger folio than the HW wants (since bigger
> folios will reduce page faults) or you may want to allocate a smaller folio than
> the HW wants (due to concerns about latency or memory wastage).
> 
> I've had a stab at addressing this in the patch too, using the same decision as
> for THP (ignoring the vma size/alignment requirement) to decide if we use the HW
> preferred order or if we cap it (currently set at 64K).
> 
> Thoughts, comments?
> 
> Thanks,
> Ryan
> 
> 
> 
> 
>
Yu Zhao Aug. 3, 2023, 11:50 p.m. UTC | #22
On Thu, Aug 3, 2023 at 6:43 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> + Kirill
>
> On 26/07/2023 10:51, Ryan Roberts wrote:
> > Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
> > allocated in large folios of a determined order. All pages of the large
> > folio are pte-mapped during the same page fault, significantly reducing
> > the number of page faults. The number of per-page operations (e.g. ref
> > counting, rmap management, lru list management) are also significantly
> > reduced since those ops now become per-folio.
> >
> > The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
> > which defaults to disabled for now; the long term aim is for this to
> > default to enabled, but there are some risks around internal
> > fragmentation that need to be better understood first.
> >
> > When enabled, the folio order is determined as such: For a vma, process
> > or system that has explicitly disabled THP, we continue to allocate
> > order-0. THP is most likely disabled to avoid any possible internal
> > fragmentation so we honour that request.
> >
> > Otherwise, the return value of arch_wants_pte_order() is used. For vmas
> > that have not explicitly opted-in to use transparent hugepages (e.g.
> > where thp=madvise and the vma does not have MADV_HUGEPAGE), then
> > arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever is
> > bigger). This allows for a performance boost without requiring any
> > explicit opt-in from the workload while limiting internal
> > fragmentation.
> >
> > If the preferred order can't be used (e.g. because the folio would
> > breach the bounds of the vma, or because ptes in the region are already
> > mapped) then we fall back to a suitable lower order; first
> > PAGE_ALLOC_COSTLY_ORDER, then order-0.
> >
>
> ...
>
> > +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
> > +             (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
> > +
> > +static int anon_folio_order(struct vm_area_struct *vma)
> > +{
> > +     int order;
> > +
> > +     /*
> > +      * If THP is explicitly disabled for either the vma, the process or the
> > +      * system, then this is very likely intended to limit internal
> > +      * fragmentation; in this case, don't attempt to allocate a large
> > +      * anonymous folio.
> > +      *
> > +      * Else, if the vma is eligible for thp, allocate a large folio of the
> > +      * size preferred by the arch. Or if the arch requested a very small
> > +      * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
> > +      * which still meets the arch's requirements but means we still take
> > +      * advantage of SW optimizations (e.g. fewer page faults).
> > +      *
> > +      * Finally if thp is enabled but the vma isn't eligible, take the
> > +      * arch-preferred size and limit it to ANON_FOLIO_MAX_ORDER_UNHINTED.
> > +      * This ensures workloads that have not explicitly opted-in take benefit
> > +      * while capping the potential for internal fragmentation.
> > +      */
> > +
> > +     if ((vma->vm_flags & VM_NOHUGEPAGE) ||
> > +         test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
> > +         !hugepage_flags_enabled())
> > +             order = 0;
> > +     else {
> > +             order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
> > +
> > +             if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
> > +                     order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
> > +     }
> > +
> > +     return order;
> > +}
>
>
> Hi All,
>
> I'm writing up the conclusions that we arrived at during discussion in the THP
> meeting yesterday, regarding linkage with existing THP ABIs. It would be great if
> I can get explicit "agree" or disagree + rationale from at least David, Yu and
> Kirill.
>
> In summary; I think we are converging on the approach that is already coded, but
> I'd like confirmation.
>
>
>
> The THP situation today
> -----------------------
>
>  - At system level: THP can be set to "never", "madvise" or "always"
>  - At process level: THP can be "never" or "defer to system setting"
>  - At VMA level: no-hint, MADV_HUGEPAGE, MADV_NOHUGEPAGE
>
> That gives us this table to describe how a page fault is handled, according to
> process state (columns) and vma flags (rows):
>
>                 | never     | madvise   | always
> ----------------|-----------|-----------|-----------
> no hint         | S         | S         | THP>S
> MADV_HUGEPAGE   | S         | THP>S     | THP>S
> MADV_NOHUGEPAGE | S         | S         | S
>
> Legend:
> S       allocate single page (PTE-mapped)
> LAF     allocate large anon folio (PTE-mapped)
> THP     allocate THP-sized folio (PMD-mapped)
> >       fallback (usually because vma size/alignment insufficient for folio)
>
>
>
> Principles for Large Anon Folios (LAF)
> --------------------------------------
>
> David tells us there are use cases today (e.g. qemu live migration) which use
> MADV_NOHUGEPAGE to mean "don't fill any PTEs that are not explicitly faulted"
> and these use cases will break (i.e. functionally incorrect) if this request is
> not honoured.

I don't remember David saying this. I think he was referring to UFFD,
not MADV_NOHUGEPAGE, when discussing what we need to absolutely
respect.
Yu Zhao Aug. 4, 2023, 12:19 a.m. UTC | #23
On Thu, Aug 3, 2023 at 8:27 AM Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On Thu, Aug 03, 2023 at 01:43:31PM +0100, Ryan Roberts wrote:
> > + Kirill
> >
> > On 26/07/2023 10:51, Ryan Roberts wrote:
> > > Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
> > > allocated in large folios of a determined order. All pages of the large
> > > folio are pte-mapped during the same page fault, significantly reducing
> > > the number of page faults. The number of per-page operations (e.g. ref
> > > counting, rmap management, lru list management) are also significantly
> > > reduced since those ops now become per-folio.
> > >
> > > The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
> > > which defaults to disabled for now; the long term aim is for this to
> > > default to enabled, but there are some risks around internal
> > > fragmentation that need to be better understood first.
> > >
> > > When enabled, the folio order is determined as such: For a vma, process
> > > or system that has explicitly disabled THP, we continue to allocate
> > > order-0. THP is most likely disabled to avoid any possible internal
> > > fragmentation so we honour that request.
> > >
> > > Otherwise, the return value of arch_wants_pte_order() is used. For vmas
> > > that have not explicitly opted-in to use transparent hugepages (e.g.
> > > where thp=madvise and the vma does not have MADV_HUGEPAGE), then
> > > arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever is
> > > bigger). This allows for a performance boost without requiring any
> > > explicit opt-in from the workload while limiting internal
> > > fragmentation.
> > >
> > > If the preferred order can't be used (e.g. because the folio would
> > > breach the bounds of the vma, or because ptes in the region are already
> > > mapped) then we fall back to a suitable lower order; first
> > > PAGE_ALLOC_COSTLY_ORDER, then order-0.
> > >
> >
> > ...
> >
> > > +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
> > > +           (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
> > > +
> > > +static int anon_folio_order(struct vm_area_struct *vma)
> > > +{
> > > +   int order;
> > > +
> > > +   /*
> > > +    * If THP is explicitly disabled for either the vma, the process or the
> > > +    * system, then this is very likely intended to limit internal
> > > +    * fragmentation; in this case, don't attempt to allocate a large
> > > +    * anonymous folio.
> > > +    *
> > > +    * Else, if the vma is eligible for thp, allocate a large folio of the
> > > +    * size preferred by the arch. Or if the arch requested a very small
> > > +    * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
> > > +    * which still meets the arch's requirements but means we still take
> > > +    * advantage of SW optimizations (e.g. fewer page faults).
> > > +    *
> > > +    * Finally if thp is enabled but the vma isn't eligible, take the
> > > +    * arch-preferred size and limit it to ANON_FOLIO_MAX_ORDER_UNHINTED.
> > > +    * This ensures workloads that have not explicitly opted-in take benefit
> > > +    * while capping the potential for internal fragmentation.
> > > +    */
> > > +
> > > +   if ((vma->vm_flags & VM_NOHUGEPAGE) ||
> > > +       test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
> > > +       !hugepage_flags_enabled())
> > > +           order = 0;
> > > +   else {
> > > +           order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
> > > +
> > > +           if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
> > > +                   order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
> > > +   }
> > > +
> > > +   return order;
> > > +}
> >
> >
> > Hi All,
> >
> > I'm writing up the conclusions that we arrived at during discussion in the THP
> > meeting yesterday, regarding linkage with existing THP ABIs. It would be great if
> > I can get explicit "agree" or disagree + rationale from at least David, Yu and
> > Kirill.
> >
> > In summary; I think we are converging on the approach that is already coded, but
> > I'd like confirmation.
> >
> >
> >
> > The THP situation today
> > -----------------------
> >
> >  - At system level: THP can be set to "never", "madvise" or "always"
> >  - At process level: THP can be "never" or "defer to system setting"
> >  - At VMA level: no-hint, MADV_HUGEPAGE, MADV_NOHUGEPAGE
> >
> > That gives us this table to describe how a page fault is handled, according to
> > process state (columns) and vma flags (rows):
> >
> >                 | never     | madvise   | always
> > ----------------|-----------|-----------|-----------
> > no hint         | S         | S         | THP>S
> > MADV_HUGEPAGE   | S         | THP>S     | THP>S
> > MADV_NOHUGEPAGE | S         | S         | S
> >
> > Legend:
> > S     allocate single page (PTE-mapped)
> > LAF   allocate large anon folio (PTE-mapped)
> > THP   allocate THP-sized folio (PMD-mapped)
> > >     fallback (usually because vma size/alignment insufficient for folio)
> >
> >
> >
> > Principles for Large Anon Folios (LAF)
> > --------------------------------------
> >
> > David tells us there are use cases today (e.g. qemu live migration) which use
> > MADV_NOHUGEPAGE to mean "don't fill any PTEs that are not explicitly faulted"
> > and these use cases will break (i.e. functionally incorrect) if this request is
> > not honoured.
> >
> > So LAF must at least honour MADV_NOHUGEPAGE to prevent breaking existing use
> > cases. And once we do this, then I think the least confusing thing is for it to
> > also honor the "never" system/process state; so if either the system, process or
> > vma has explicitly opted-out of THP, then LAF should also be bypassed.
> >
> > Similarly, any case that would previously cause the allocation of PMD-sized THP
> > must continue to be honoured, else we risk performance regression.
> >
> > That leaves the "madvise/no-hint" case, and all THP fallback paths due to the
> > VMA not being correctly aligned or sized to hold a PMD-sized mapping. In these
> > cases, we will attempt to use LAF first, and fall back to single page if the vma
> > size/alignment doesn't permit it.
> >
> >                 | never     | madvise   | always
> > ----------------|-----------|-----------|-----------
> > no hint         | S         | LAF>S     | THP>LAF>S
> > MADV_HUGEPAGE   | S         | THP>LAF>S | THP>LAF>S
> > MADV_NOHUGEPAGE | S         | S         | S
> >
> > I think this (perhaps conservative) approach will be the least surprising to
> > users, and it is the policy that is already implemented in this patch.
>
> This looks very reasonable.
>
> The only questionable field is no-hint/madvise. I can argue for both LAF>S
> and S here. I think LAF>S is fine as long as we are not too aggressive
> with allocation order.
>
> I think we need to work on eliminating reasons for users to set 'never'.
> If something behaves better with 'never', the kernel has failed the user.
>
> > Downsides of this policy
> > ------------------------
> >
> > As Yu and Yin have pointed out, there are some workloads which do not perform
> > well with THP, due to large fault latency or memory wastage, etc., but which
> > _may_ still benefit from LAF. By taking the conservative approach, we exclude
> > these workloads from benefiting automatically.
>
> Hm. I don't buy it. Why is THP with order-9 too much, but order-8 LAF
> fine?

No, it's not. And no one said order-8 LAF is fine :) The starting
order for LAF that we have been discussing is at most 64KB (vs 2MB
THP). For my taste, it's still too large. I'd go with 32KB/16KB.
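
(In terms of the macro in this patch, that would just mean s/SZ_64K/SZ_16K/,
i.e.:)

	#define ANON_FOLIO_MAX_ORDER_UNHINTED \
			(ilog2(max_t(unsigned long, SZ_16K, PAGE_SIZE)) - PAGE_SHIFT)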

However, the same argument can be used to argue against the policy
Ryan listed above: why is order-10 LAF ok for madvise but not order-11
(which becomes "always")?

I'm strongly against this policy for two practical reasons I learned
from tuning THPs in our data centers:
1. By doing the above, we are blurring the lines between those values
and making real-world performance tuning extremely hard if not
impractical.
2. As I previously pointed out: if we mix LAFs with THPs, we actually
risk causing performance regressions because giving smaller VMAs LAFs
can deprive large VMAs of THPs.
Yu Zhao Aug. 4, 2023, 12:28 a.m. UTC | #24
On Thu, Aug 3, 2023 at 2:07 AM Yin Fengwei <fengwei.yin@intel.com> wrote:
>
> On 7/28/23 18:13, Ryan Roberts wrote:
> > On 27/07/2023 05:31, Yu Zhao wrote:
> >> On Wed, Jul 26, 2023 at 10:41 AM Yu Zhao <yuzhao@google.com> wrote:
> >>>
> >>> On Wed, Jul 26, 2023 at 3:52 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>
> >>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
> >>>> allocated in large folios of a determined order. All pages of the large
> >>>> folio are pte-mapped during the same page fault, significantly reducing
> >>>> the number of page faults. The number of per-page operations (e.g. ref
> >>>> counting, rmap management, lru list management) are also significantly
> >>>> reduced since those ops now become per-folio.
> >>>>
> >>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
> >>>> which defaults to disabled for now; The long term aim is for this to
> >>>> defaut to enabled, but there are some risks around internal
> >>>> fragmentation that need to be better understood first.
> >>>>
> >>>> When enabled, the folio order is determined as such: For a vma, process
> >>>> or system that has explicitly disabled THP, we continue to allocate
> >>>> order-0. THP is most likely disabled to avoid any possible internal
> >>>> fragmentation so we honour that request.
> >>>>
> >>>> Otherwise, the return value of arch_wants_pte_order() is used. For vmas
> >>>> that have not explicitly opted-in to use transparent hugepages (e.g.
> >>>> where thp=madvise and the vma does not have MADV_HUGEPAGE), then
> >>>> arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever is
> >>>> bigger). This allows for a performance boost without requiring any
> >>>> explicit opt-in from the workload while limitting internal
> >>>> fragmentation.
> >>>>
> >>>> If the preferred order can't be used (e.g. because the folio would
> >>>> breach the bounds of the vma, or because ptes in the region are already
> >>>> mapped) then we fall back to a suitable lower order; first
> >>>> PAGE_ALLOC_COSTLY_ORDER, then order-0.
> >>>>
> >>>> arch_wants_pte_order() can be overridden by the architecture if desired.
> >>>> Some architectures (e.g. arm64) can coalsece TLB entries if a contiguous
> >>>> set of ptes map physically contigious, naturally aligned memory, so this
> >>>> mechanism allows the architecture to optimize as required.
> >>>>
> >>>> Here we add the default implementation of arch_wants_pte_order(), used
> >>>> when the architecture does not define it, which returns -1, implying
> >>>> that the HW has no preference. In this case, mm will choose it's own
> >>>> default order.
> >>>>
> >>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >>>> ---
> >>>>  include/linux/pgtable.h |  13 ++++
> >>>>  mm/Kconfig              |  10 +++
> >>>>  mm/memory.c             | 166 ++++++++++++++++++++++++++++++++++++----
> >>>>  3 files changed, 172 insertions(+), 17 deletions(-)
> >>>>
> >>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >>>> index 5063b482e34f..2a1d83775837 100644
> >>>> --- a/include/linux/pgtable.h
> >>>> +++ b/include/linux/pgtable.h
> >>>> @@ -313,6 +313,19 @@ static inline bool arch_has_hw_pte_young(void)
> >>>>  }
> >>>>  #endif
> >>>>
> >>>> +#ifndef arch_wants_pte_order
> >>>> +/*
> >>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
> >>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
> >>>> + * to be at least order-2. Negative value implies that the HW has no preference
> >>>> + * and mm will choose it's own default order.
> >>>> + */
> >>>> +static inline int arch_wants_pte_order(void)
> >>>> +{
> >>>> +       return -1;
> >>>> +}
> >>>> +#endif
> >>>> +
> >>>>  #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
> >>>>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
> >>>>                                        unsigned long address,
> >>>> diff --git a/mm/Kconfig b/mm/Kconfig
> >>>> index 09130434e30d..fa61ea160447 100644
> >>>> --- a/mm/Kconfig
> >>>> +++ b/mm/Kconfig
> >>>> @@ -1238,4 +1238,14 @@ config LOCK_MM_AND_FIND_VMA
> >>>>
> >>>>  source "mm/damon/Kconfig"
> >>>>
> >>>> +config LARGE_ANON_FOLIO
> >>>> +       bool "Allocate large folios for anonymous memory"
> >>>> +       depends on TRANSPARENT_HUGEPAGE
> >>>> +       default n
> >>>> +       help
> >>>> +         Use large (bigger than order-0) folios to back anonymous memory where
> >>>> +         possible, even for pte-mapped memory. This reduces the number of page
> >>>> +         faults, as well as other per-page overheads to improve performance for
> >>>> +         many workloads.
> >>>> +
> >>>>  endmenu
> >>>> diff --git a/mm/memory.c b/mm/memory.c
> >>>> index 01f39e8144ef..64c3f242c49a 100644
> >>>> --- a/mm/memory.c
> >>>> +++ b/mm/memory.c
> >>>> @@ -4050,6 +4050,127 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>>         return ret;
> >>>>  }
> >>>>
> >>>> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
> >>>> +{
> >>>> +       int i;
> >>>> +
> >>>> +       if (nr_pages == 1)
> >>>> +               return vmf_pte_changed(vmf);
> >>>> +
> >>>> +       for (i = 0; i < nr_pages; i++) {
> >>>> +               if (!pte_none(ptep_get_lockless(vmf->pte + i)))
> >>>> +                       return true;
> >>>> +       }
> >>>> +
> >>>> +       return false;
> >>>> +}
> >>>> +
> >>>> +#ifdef CONFIG_LARGE_ANON_FOLIO
> >>>> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
> >>>> +               (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
> >>>> +
> >>>> +static int anon_folio_order(struct vm_area_struct *vma)
> >>>> +{
> >>>> +       int order;
> >>>> +
> >>>> +       /*
> >>>> +        * If THP is explicitly disabled for either the vma, the process or the
> >>>> +        * system, then this is very likely intended to limit internal
> >>>> +        * fragmentation; in this case, don't attempt to allocate a large
> >>>> +        * anonymous folio.
> >>>> +        *
> >>>> +        * Else, if the vma is eligible for thp, allocate a large folio of the
> >>>> +        * size preferred by the arch. Or if the arch requested a very small
> >>>> +        * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
> >>>> +        * which still meets the arch's requirements but means we still take
> >>>> +        * advantage of SW optimizations (e.g. fewer page faults).
> >>>> +        *
> >>>> +        * Finally if thp is enabled but the vma isn't eligible, take the
> >>>> +        * arch-preferred size and limit it to ANON_FOLIO_MAX_ORDER_UNHINTED.
> >>>> +        * This ensures workloads that have not explicitly opted-in take benefit
> >>>> +        * while capping the potential for internal fragmentation.
> >>>> +        */
> >>>
> >>> What empirical evidence is SZ_64K based on?
> >>> What workloads would benefit from it?
> >>> How much would they benefit from it?
> >>> Would they benefit more or less from different values?
> >>> How much internal fragmentation would it cause?
> >>> What cost function was used to arrive at the conclusion that its
> >>> benefits outweigh its costs?
> >
> > Sorry this has taken a little while to reply to; I've been re-running my perf
> > tests with the modern patches to reconfirm old data.
> >
> > In terms of empirical evidence, I've run the kernel compilation benchmark (yes I
> > know it's a narrow use case, but I figure some data is better than no data), for
> > all values of ANON_FOLIO_MAX_ORDER_UNHINTED {4k, 16k, 32k, 64k, 128k, 256k}.
> >
> > I've run each test 15 times across 5 system reboots on Ampere Altra (arm64),
> > with the kernel configured for 4K base pages - I could rerun for other base page
> > sizes if we want to go further down this route.
> >
> > I've captured run time and peak memory usage, and taken the mean. The stdev for
> > the peak memory usage is big-ish, but I'm confident this still captures the
> > central tendency well:
> >
> > | MAX_ORDER_UNHINTED |   real-time |   kern-time |   user-time | peak memory |
> > |:-------------------|------------:|------------:|------------:|:------------|
> > | 4k                 |        0.0% |        0.0% |        0.0% |        0.0% |
> > | 16k                |       -3.6% |      -26.5% |       -0.5% |       -0.1% |
> > | 32k                |       -4.8% |      -37.4% |       -0.6% |       -0.1% |
> > | 64k                |       -5.7% |      -42.0% |       -0.6% |       -1.1% |
> > | 128k               |       -5.6% |      -42.1% |       -0.7% |        1.4% |
> > | 256k               |       -4.9% |      -41.9% |       -0.4% |        1.9% |
>
> Here is my test result:
>
>                 real            user            sys
> hink-4k:         0%             0%              0%
> hink-16K:       -3%             0.1%            -18.3%
> hink-32K:       -4%             0.2%            -27.2%
> hink-64K:       -4%             0.5%            -31.0%
> hink-128K:      -4%             0.9%            -33.7%
> hink-256K:      -5%             1%              -34.6%
>
>
> I used this command:
> /usr/bin/time -f "\t%E real,\t%U user,\t%S sys" make -skj96 allmodconfig all
> to build the kernel and collect the real/user/kernel times.
> /sys/kernel/mm/transparent_hugepage/enabled is "madvise".
> Let me know if you have any questions about the test.
>
> I also found one strange behavior with this version. It's related to why
> I need to set /sys/kernel/mm/transparent_hugepage/enabled to "madvise".
> If it's "never", the large folio is disabled as well.
> If it's "always", THP takes effect before the large folio, so the system is
> in a mixed mode that isn't suitable for this test.
>
> So if it's "never", large folio is disabled. But why does "madvise" enable
> large folio unconditionally? Shouldn't it only be enabled for the VMA ranges
> which the user has madvised for large folio (or THP)?

Indeed. It's a very peculiar behavior, as I called out in another email.
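
For context on the behaviour Yin observes: in the patch's
anon_folio_order() (quoted in full in the next message), only thp=never
fails the hugepage_flags_enabled() check; with thp=madvise, a VMA lacking
MADV_HUGEPAGE merely fails the later hugepage_vma_check() and gets its
order capped rather than zeroed. The relevant lines, with comments added:

	if ((vma->vm_flags & VM_NOHUGEPAGE) ||
	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
	    !hugepage_flags_enabled())	/* false only for thp=never */
		order = 0;		/* "never" disables LAF entirely */
	else {
		order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);

		/* thp=madvise without MADV_HUGEPAGE fails this check, so
		 * LAF stays enabled, just capped at 64K, not disabled. */
		if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
			order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
	}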
Zi Yan Aug. 4, 2023, 2:16 a.m. UTC | #25
On 3 Aug 2023, at 20:19, Yu Zhao wrote:

> On Thu, Aug 3, 2023 at 8:27 AM Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
>>
>> On Thu, Aug 03, 2023 at 01:43:31PM +0100, Ryan Roberts wrote:
>>> + Kirill
>>>
>>> On 26/07/2023 10:51, Ryan Roberts wrote:
>>>> [...]
>>>>
>>>
>>> ...
>>>
>>>> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
>>>> +           (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
>>>> +
>>>> +static int anon_folio_order(struct vm_area_struct *vma)
>>>> +{
>>>> +   int order;
>>>> +
>>>> +   /*
>>>> +    * If THP is explicitly disabled for either the vma, the process or the
>>>> +    * system, then this is very likely intended to limit internal
>>>> +    * fragmentation; in this case, don't attempt to allocate a large
>>>> +    * anonymous folio.
>>>> +    *
>>>> +    * Else, if the vma is eligible for thp, allocate a large folio of the
>>>> +    * size preferred by the arch. Or if the arch requested a very small
>>>> +    * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
>>>> +    * which still meets the arch's requirements but means we still take
>>>> +    * advantage of SW optimizations (e.g. fewer page faults).
>>>> +    *
>>>> +    * Finally if thp is enabled but the vma isn't eligible, take the
>>>> +    * arch-preferred size and limit it to ANON_FOLIO_MAX_ORDER_UNHINTED.
>>>> +    * This ensures workloads that have not explicitly opted-in take benefit
>>>> +    * while capping the potential for internal fragmentation.
>>>> +    */
>>>> +
>>>> +   if ((vma->vm_flags & VM_NOHUGEPAGE) ||
>>>> +       test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
>>>> +       !hugepage_flags_enabled())
>>>> +           order = 0;
>>>> +   else {
>>>> +           order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
>>>> +
>>>> +           if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>>>> +                   order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
>>>> +   }
>>>> +
>>>> +   return order;
>>>> +}
>>>
>>>
>>> Hi All,
>>>
>>> I'm writing up the conclusions that we arrived at during discussion in the THP
>>> meeting yesterday, regarding linkage with existing THP ABIs. It would be great if
>>> I can get explicit "agree" or disagree + rationale from at least David, Yu and
>>> Kirill.
>>>
>>> In summary, I think we are converging on the approach that is already coded, but
>>> I'd like confirmation.
>>>
>>>
>>>
>>> The THP situation today
>>> -----------------------
>>>
>>>  - At system level: THP can be set to "never", "madvise" or "always"
>>>  - At process level: THP can be "never" or "defer to system setting"
>>>  - At VMA level: no-hint, MADV_HUGEPAGE, MADV_NOHUGEPAGE
>>>
>>> That gives us this table to describe how a page fault is handled, according to
>>> process state (columns) and vma flags (rows):
>>>
>>>                 | never     | madvise   | always
>>> ----------------|-----------|-----------|-----------
>>> no hint         | S         | S         | THP>S
>>> MADV_HUGEPAGE   | S         | THP>S     | THP>S
>>> MADV_NOHUGEPAGE | S         | S         | S
>>>
>>> Legend:
>>> S     allocate single page (PTE-mapped)
>>> LAF   allocate large anon folio (PTE-mapped)
>>> THP   allocate THP-sized folio (PMD-mapped)
>>> >     fallback (usually because vma size/alignment insufficient for folio)
>>>
>>>
>>>
>>> Principles for Large Anon Folios (LAF)
>>> --------------------------------------
>>>
>>> David tells us there are use cases today (e.g. qemu live migration) which use
>>> MADV_NOHUGEPAGE to mean "don't fill any PTEs that are not explicitly faulted"
>>> and these use cases will break (i.e. functionally incorrect) if this request is
>>> not honoured.
>>>
>>> So LAF must at least honour MADV_NOHUGEPAGE to prevent breaking existing use
>>> cases. And once we do this, then I think the least confusing thing is for it to
>>> also honor the "never" system/process state; so if either the system, process or
>>> vma has explicitly opted-out of THP, then LAF should also be bypassed.
>>>
>>> Similarly, any case that would previously cause the allocation of PMD-sized THP
>>> must continue to be honoured, else we risk performance regression.
>>>
>>> That leaves the "madvise/no-hint" case, and all THP fallback paths due to the
>>> VMA not being correctly aligned or sized to hold a PMD-sized mapping. In these
>>> cases, we will attempt to use LAF first, and fallback to single page if the vma
>>> size/alignment doesn't permit it.
>>>
>>>                 | never     | madvise   | always
>>> ----------------|-----------|-----------|-----------
>>> no hint         | S         | LAF>S     | THP>LAF>S
>>> MADV_HUGEPAGE   | S         | THP>LAF>S | THP>LAF>S
>>> MADV_NOHUGEPAGE | S         | S         | S
>>>
>>> I think this (perhaps conservative) approach will be the least surprising to
>>> users. And is the policy that is already implemented in this patch.
>>
>> This looks very reasonable.
>>
>> The only questionable field is no-hint/madvise. I can argue for both LAF>S
>> and S here. I think LAF>S is fine as long as we are not too aggressive
>> with allocation order.
>>
>> I think we need to work on eliminating reasons for users to set 'never'.
>> If something behaves better with 'never', the kernel has failed the user.
>>
>>> Downsides of this policy
>>> ------------------------
>>>
>>> As Yu and Yin have pointed out, there are some workloads which do not perform
>>> well with THP, due to large fault latency or memory wastage, etc. But which
>>> _may_ still benefit from LAF. By taking the conservative approach, we exclude
>>> these workloads from benefiting automatically.
>>
>> Hm. I don't buy it. Why THP with order-9 is too much, but order-8 LAF is
>> fine?
>
> No, it's not. And no one said order-8 LAF is fine :) The starting
> order for LAF that we have been discussing is at most 64KB (vs 2MB
> THP). For my taste, it's still too large. I'd go with 32KB/16KB.

I guess it is because ARM64 supports contig PTEs at 64KB, so a 64KB
large anon folio on ARM64 would get an extra perf boost once the patch
that sets the contig PTE bits lands as well.

On x86_64, 32KB might be better on AMD CPUs that support PTE clustering,
which uses a single TLB entry for 8 contiguous 4KB pages and is done at
the microarchitecture level without additional software changes.

>
> However, the same argument can be used to argue against the policy
> Ryan listed above: why order-10 LAF is ok for madvise but not order-11
> (which becomes "always")?
>
> I'm strongly against this policy for two practical reasons I learned
> from tuning THPs in our data centers:

Do you mind writing down your policy? That would help us see and discuss
the difference.

> 1. By doing the above, we are blurring the lines between those values
> and making real-world performance tuning extremely hard if not
> impractical.
> 2. As I previously pointed out: if we mix LAFs with THPs, we actually
> risk causing performance regressions because giving smaller VMAs LAFs
> can deprive large VMAs of THPs.

I think these two reasons stem from the fact that we do not yet have a
reasonable LAF+THP allocation and management policy, and we do not fully
understand the pros and cons of using LAF and of mixing LAF with THP. It
would be safe to separate LAF and THP. By doing so,

1. for workloads that do not benefit from THP, we can turn on LAF alone
to see if there is a performance boost, and further understand whether
LAF hurts, has no impact on, or improves the performance of these
workloads.

2. for workloads that benefit from THP, we can also turn on LAF
separately to understand the performance impact of LAF (hurt, no change,
or improve).

Ultimately, after we understand the performance impact of LAF, THP, and
the mix of them, and come up with a reasonable kernel policy, a unified
knob would make sense. But we are not there yet.


--
Best Regards,
Yan, Zi
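
To put concrete numbers on the sizes in this exchange, a minimal
userspace sketch of the order arithmetic, assuming 4KB base pages (folio
order is log2 of the number of base pages):

#include <stdio.h>

int main(void)
{
	const unsigned long page = 4096;

	/* 64KB contig-PTE unit (arm64): 16 pages -> order 4 */
	printf("64KB: order %d\n", __builtin_ctzl((64UL << 10) / page));
	/* 32KB AMD PTE-clustering unit: 8 pages -> order 3 */
	printf("32KB: order %d\n", __builtin_ctzl((32UL << 10) / page));
	/* 2MB PMD-sized THP: 512 pages -> order 9 */
	printf("2MB:  order %d\n", __builtin_ctzl((2UL << 20) / page));
	return 0;
}

The same arithmetic gives ANON_FOLIO_MAX_ORDER_UNHINTED =
ilog2(64K) - PAGE_SHIFT = 16 - 12 = 4 on a 4KB-page kernel.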
Yu Zhao Aug. 4, 2023, 3:35 a.m. UTC | #26
On Thu, Aug 3, 2023 at 8:16 PM Zi Yan <ziy@nvidia.com> wrote:
>
> On 3 Aug 2023, at 20:19, Yu Zhao wrote:
>
> > [...]
>
> I think these two reasons stem from the fact that we do not yet have a
> reasonable LAF+THP allocation and management policy, and we do not fully
> understand the pros and cons of using LAF and of mixing LAF with THP. It
> would be safe to separate LAF and THP. By doing so,
>
> 1. for workloads that do not benefit from THP, we can turn on LAF alone
> to see if there is a performance boost, and further understand whether
> LAF hurts, has no impact on, or improves the performance of these
> workloads.
>
> 2. for workloads that benefit from THP, we can also turn on LAF
> separately to understand the performance impact of LAF (hurt, no change,
> or improve).

This is basically what I've been suggesting. We should have a separate
knob, not overload the existing ones. And this separate knob should be
able to take a list of fallback orders. After we have a wider
deployment, we might gain a better understanding of the "cost
function". Then we can try to build some in-kernel heuristics that
automatically decide the best orders to fall back to. If/when we get
there, we can simply extend the knob by adding a new "magic word",
e.g., "auto".

> Ultimately, after we understand the performance impact of LAF, THP, and
> the mix of them, and come up with a reasonable kernel policy, a unified
> knob would make sense. But we are not there yet.

Exactly.
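
Purely as an illustration of the kind of interface Yu sketches here: the
order list, its format, and the helper below are invented for this sketch
and are not part of the posted series.

	/* Hypothetical: orders parsed from a (non-existent) knob such as
	 * "9 4 0", tried largest-first until an allocation succeeds. */
	static const int fallback_orders[] = { 9, 4, 0 };	/* 2MB, 64KB, 4KB */

	static struct folio *alloc_with_fallback(struct vm_area_struct *vma,
						 unsigned long addr)
	{
		int i;

		for (i = 0; i < ARRAY_SIZE(fallback_orders); i++) {
			struct folio *folio;

			folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE,
						fallback_orders[i], vma,
						addr, true);
			if (folio)
				return folio;
		}
		return NULL;
	}

An "auto" magic word, as described above, would then just have the kernel
pick the order list itself instead of taking it from userspace.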
Ryan Roberts Aug. 4, 2023, 8:27 a.m. UTC | #27
On 04/08/2023 00:50, Yu Zhao wrote:
> On Thu, Aug 3, 2023 at 6:43 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> + Kirill
>>
>> On 26/07/2023 10:51, Ryan Roberts wrote:
>> [...]
>>
>> Principles for Large Anon Folios (LAF)
>> --------------------------------------
>>
>> David tells us there are use cases today (e.g. qemu live migration) which use
>> MADV_NOHUGEPAGE to mean "don't fill any PTEs that are not explicitly faulted"
>> and these use cases will break (i.e. functionally incorrect) if this request is
>> not honoured.
> 
> I don't remember David saying this. I think he was referring to UFFD,
> not MADV_NOHUGEPAGE, when discussing what we need to absolutely
> respect.

My understanding was that MADV_NOHUGEPAGE was being applied to regions *before*
UFFD was being registered, and the app relied on MADV_NOHUGEPAGE to not back any
unfaulted pages. It's not completely clear to me how not honouring
MADV_NOHUGEPAGE would break things though. David?
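
For reference, the ordering Ryan describes looks roughly like this from
userspace. A minimal sketch of the qemu-style sequence, with error
handling omitted and no claim that it matches qemu's actual code:

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Step 1 happens before step 2, so the app relies on MADV_NOHUGEPAGE
 * alone to keep not-yet-faulted PTEs empty. */
static int guard_region(void *addr, size_t len)
{
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode = UFFDIO_REGISTER_MODE_MISSING,
	};
	int uffd;

	madvise(addr, len, MADV_NOHUGEPAGE);		/* step 1 */

	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	ioctl(uffd, UFFDIO_API, &api);
	ioctl(uffd, UFFDIO_REGISTER, &reg);		/* step 2 */
	return uffd;
}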
Ryan Roberts Aug. 4, 2023, 9:06 a.m. UTC | #28
On 04/08/2023 01:19, Yu Zhao wrote:
> On Thu, Aug 3, 2023 at 8:27 AM Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
>>
>> On Thu, Aug 03, 2023 at 01:43:31PM +0100, Ryan Roberts wrote:
>>> + Kirill
>>>
>>> On 26/07/2023 10:51, Ryan Roberts wrote:
>>>> [...]
>>>
>>> Downsides of this policy
>>> ------------------------
>>>
>>> As Yu and Yin have pointed out, there are some workloads which do not perform
>>> well with THP, due to large fault latency or memory wastage, etc. But which
>>> _may_ still benefit from LAF. By taking the conservative approach, we exclude
>>> these workloads from benefiting automatically.
>>
>> Hm. I don't buy it. Why THP with order-9 is too much, but order-8 LAF is
>> fine?
> 
> No, it's not. And no one said order-8 LAF is fine :) The starting
> order for LAF that we have been discussing is at most 64KB (vs 2MB
> THP). For my taste, it's still too large. I'd go with 32KB/16KB.

It's currently influenced by the arch. If the arch doesn't have an opinion then
it's currently 32K in the code. The 64K size is my aspiration for arm64 if/when I
land the contpte mapping work.

> 
> However, the same argument can be used to argue against the policy
> Ryan listed above: why order-10 LAF is ok for madvise but not order-11
> (which becomes "always")?

Sorry I don't understand what you are saying here. Where has order-10 LAF come from?

> 
> I'm strongly against this policy 

Ugh, I thought we came to an agreement (or at least "disagree and commit") on
the THP call. Obviously I was wrong.

David is telling us that we will break user space if we don't consider
MADV_NOHUGEPAGE to mean "never allocate memory to unfaulted addresses". So at
least this much must be cast in stone, no? Could you lay out any policy
proposal you have as an alternative that still follows this requirement?

> for two practical reasons I learned
> from tuning THPs in our data centers:
> 1. By doing the above, we are blurring the lines between those values
> and making real-world performance tuning extremely hard if not
> impractical.
> 2. As I previously pointed out: if we mix LAFs with THPs, we actually
> risk causing performance regressions because giving smaller VMAs LAFs
> can deprive large VMAs of THPs.
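
For anyone following along, the 32K default Ryan mentions falls out of
the patch arithmetic: PAGE_ALLOC_COSTLY_ORDER is 3 in mainline, so with
the default arch_wants_pte_order() of -1 (a sketch, assuming 4KB base
pages):

#include <stdio.h>

#define PAGE_ALLOC_COSTLY_ORDER 3	/* value in mainline mmzone.h */

int main(void)
{
	int arch = -1;	/* default arch_wants_pte_order(): no preference */
	int order = arch > PAGE_ALLOC_COSTLY_ORDER ?
			arch : PAGE_ALLOC_COSTLY_ORDER;

	/* max(-1, 3) = 3 -> 2^3 * 4KB = 32KB */
	printf("default LAF order: %d (%lu KB)\n",
	       order, (1UL << order) * 4096 / 1024);
	return 0;
}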
Yu Zhao Aug. 4, 2023, 6:53 p.m. UTC | #29
On Fri, Aug 4, 2023 at 3:06 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 04/08/2023 01:19, Yu Zhao wrote:
> > [...]
> >>
> >>> Downsides of this policy
> >>> ------------------------
> >>>
> >>> As Yu and Yin have pointed out, there are some workloads which do not perform
> >>> well with THP, due to large fault latency or memory wastage, etc. But which
> >>> _may_ still benefit from LAF. By taking the conservative approach, we exclude
> >>> these workloads from benefiting automatically.
> >>
> >> Hm. I don't buy it. Why THP with order-9 is too much, but order-8 LAF is
> >> fine?
> >
> > No, it's not. And no one said order-8 LAF is fine :) The starting
> > order for LAF that we have been discussing is at most 64KB (vs 2MB
> > THP). For my taste, it's still too large. I'd go with 32KB/16KB.
>
> It's currently influenced by the arch. If the arch doesn't have an opinion then
> it's currently 32K in the code. The 64K size is my aspiration for arm64 if/when I
> land the contpte mapping work.

Just to double check: this discussion covers the long term/permanent
solution/roadmap, correct? That's what Kirill and I were arguing
about. Otherwise, the order-8/9 concern above is totally irrelevant,
since we don't have them in this series.

For the short term (this series), what you described above looks good
to me: we may regress but will not break any existing use cases, and
we are behind a Kconfig option.

> > However, the same argument can be used to argue against the policy
> > Ryan listed above: why order-10 LAF is ok for madvise but not order-11
> > (which becomes "always")?
>
> Sorry I don't understand what you are saying here. Where has order-10 LAF come from?

I pushed that rhetoric a bit further: order-11 is the THP size (32MB)
with 16KB base page size on ARM. Confusing, isn't it? And there is
another complaint from Fengwei here [1].

[1] https://lore.kernel.org/linux-mm/CAOUHufasZ6w32sHO+Lq33+tGy3+GiO0_dd6mNYwfS_5gqhzYbw@mail.gmail.com/
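
For reference, the arithmetic behind that order-11/32MB figure, assuming
8-byte page table entries:

	16KB page table / 8 bytes per PTE = 2048 = 2^11 PTEs per table
	PMD-sized THP = 2^11 * 16KB = 32MB
	(with 4KB base pages: 2^9 * 4KB = 2MB)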

> > I'm strongly against this policy

Again, just to be clear: I'm strongly against this policy being
exposed to userspace in any way and becoming a long-term/permanent thing
we have to maintain/change in the future, since I'm assuming that's
the context.

> Ugh, I thought we came to an agreement (or at least "disagree and commit") on
> the THP call. Obviously I was wrong.

My impression is we only agreed on one thing: at the current stage, we
should respect things we absolutely have to. We didn't agree on what
"never" means ("never 2MB" or "never >4KB"), and we didn't touch on
how "always" should behave at all.

> David is telling us that we will break user space if we don't consider
> MADV_NOHUGEPAGE to mean "never allocate memory to unfaulted addresses". So at
> least this much must be cast in stone, no? Could you lay out any policy
> proposal you have as an alternative that still follows this requirement?

If MADV_NOHUGEPAGE falls into the category of things we have to
absolutely respect, then we will. But I don't think it does, because
the UFFD check we have in this series already guarantees the KVM use
case. I can explain how it works in detail if it's still not clear to
you: long story short, the UFFD check precedes the MADV_NOHUGEPAGE
check in alloc_anon_folio().

Here is what I recommend for the medium and long terms:
https://lore.kernel.org/linux-mm/CAOUHufYm6Lkm4tLRbyKOc3-NYU-8d6ZDMNDWHo=e=E16oasN8A@mail.gmail.com/

For the short term, hard-coding two orders (hw/sw preferred), putting
them behind a Kconfig and not exposing this info to the userspace are
good enough for me.
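
A minimal sketch of the check ordering Yu refers to; condensed, not the
literal v4 code. userfaultfd_armed() is the existing kernel helper,
anon_folio_order() is from this series:

	static struct folio *alloc_anon_folio(struct vm_fault *vmf)
	{
		struct vm_area_struct *vma = vmf->vma;
		int order = 0;

		/* The UFFD check runs first: an armed VMA always gets
		 * order-0 before MADV_NOHUGEPAGE is even consulted, which
		 * is what guarantees the KVM live-migration use case. */
		if (!userfaultfd_armed(vma))
			order = anon_folio_order(vma);	/* honours NOHUGEPAGE */

		/* ... then allocate and map a folio of that order ... */
	}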
David Hildenbrand Aug. 4, 2023, 8:23 p.m. UTC | #30
On 04.08.23 10:27, Ryan Roberts wrote:
> On 04/08/2023 00:50, Yu Zhao wrote:
>> On Thu, Aug 3, 2023 at 6:43 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>
>>> + Kirill
>>>
>>> On 26/07/2023 10:51, Ryan Roberts wrote:
>>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
>>>> allocated in large folios of a determined order. All pages of the large
>>>> folio are pte-mapped during the same page fault, significantly reducing
>>>> the number of page faults. The number of per-page operations (e.g. ref
>>>> counting, rmap management lru list management) are also significantly
>>>> reduced since those ops now become per-folio.
>>>>
>>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
>>>> which defaults to disabled for now; The long term aim is for this to
>>>> defaut to enabled, but there are some risks around internal
>>>> fragmentation that need to be better understood first.
>>>>
>>>> When enabled, the folio order is determined as such: For a vma, process
>>>> or system that has explicitly disabled THP, we continue to allocate
>>>> order-0. THP is most likely disabled to avoid any possible internal
>>>> fragmentation so we honour that request.
>>>>
>>>> Otherwise, the return value of arch_wants_pte_order() is used. For vmas
>>>> that have not explicitly opted-in to use transparent hugepages (e.g.
>>>> where thp=madvise and the vma does not have MADV_HUGEPAGE), then
>>>> arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever is
>>>> bigger). This allows for a performance boost without requiring any
>>>> explicit opt-in from the workload while limitting internal
>>>> fragmentation.
>>>>
>>>> If the preferred order can't be used (e.g. because the folio would
>>>> breach the bounds of the vma, or because ptes in the region are already
>>>> mapped) then we fall back to a suitable lower order; first
>>>> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>>>>
>>>
>>> ...
>>>
>>>> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
>>>> +             (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
>>>> +
>>>> +static int anon_folio_order(struct vm_area_struct *vma)
>>>> +{
>>>> +     int order;
>>>> +
>>>> +     /*
>>>> +      * If THP is explicitly disabled for either the vma, the process or the
>>>> +      * system, then this is very likely intended to limit internal
>>>> +      * fragmentation; in this case, don't attempt to allocate a large
>>>> +      * anonymous folio.
>>>> +      *
>>>> +      * Else, if the vma is eligible for thp, allocate a large folio of the
>>>> +      * size preferred by the arch. Or if the arch requested a very small
>>>> +      * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
>>>> +      * which still meets the arch's requirements but means we still take
>>>> +      * advantage of SW optimizations (e.g. fewer page faults).
>>>> +      *
>>>> +      * Finally if thp is enabled but the vma isn't eligible, take the
>>>> +      * arch-preferred size and limit it to ANON_FOLIO_MAX_ORDER_UNHINTED.
>>>> +      * This ensures workloads that have not explicitly opted in still benefit
>>>> +      * while capping the potential for internal fragmentation.
>>>> +      */
>>>> +
>>>> +     if ((vma->vm_flags & VM_NOHUGEPAGE) ||
>>>> +         test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
>>>> +         !hugepage_flags_enabled())
>>>> +             order = 0;
>>>> +     else {
>>>> +             order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
>>>> +
>>>> +             if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>>>> +                     order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
>>>> +     }
>>>> +
>>>> +     return order;
>>>> +}
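
[Worked example, reading the function above: with thp=madvise and no
per-vma hint, hugepage_flags_enabled() is true, so order starts at
max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
hugepage_vma_check() then fails without MADV_HUGEPAGE, clamping it to
ANON_FOLIO_MAX_ORDER_UNHINTED. With MADV_NOHUGEPAGE, MMF_DISABLE_THP
or thp=never, the function returns order-0.]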
>>>
>>>
>>> Hi All,
>>>
>>> I'm writing up the conclusions that we arrived at during discussion in the THP
>>> meeting yesterday, regarding linkage with existing THP ABIs. It would be great if
>>> I can get an explicit "agree" or "disagree" + rationale from at least David, Yu and
>>> Kirill.
>>>
>>> In summary; I think we are converging on the approach that is already coded, but
>>> I'd like confirmation.
>>>
>>>
>>>
>>> The THP situation today
>>> -----------------------
>>>
>>>   - At system level: THP can be set to "never", "madvise" or "always"
>>>   - At process level: THP can be "never" or "defer to system setting"
>>>   - At VMA level: no-hint, MADV_HUGEPAGE, MADV_NOHUGEPAGE
>>>
>>> That gives us this table to describe how a page fault is handled, according to
>>> process state (columns) and vma flags (rows):
>>>
>>>                  | never     | madvise   | always
>>> ----------------|-----------|-----------|-----------
>>> no hint         | S         | S         | THP>S
>>> MADV_HUGEPAGE   | S         | THP>S     | THP>S
>>> MADV_NOHUGEPAGE | S         | S         | S
>>>
>>> Legend:
>>> S       allocate single page (PTE-mapped)
>>> LAF     allocate large anon folio (PTE-mapped)
>>> THP     allocate THP-sized folio (PMD-mapped)
>>> >       fallback (usually because vma size/alignment insufficient for folio)
>>>
>>>
>>>
>>> Principles for Large Anon Folios (LAF)
>>> --------------------------------------
>>>
>>> David tells us there are use cases today (e.g. qemu live migration) which use
>>> MADV_NOHUGEPAGE to mean "don't fill any PTEs that are not explicitly faulted"
>>> and these use cases will break (i.e. functionally incorrect) if this request is
>>> not honoured.
>>
>> I don't remember David saying this. I think he was referring to UFFD,
>> not MADV_NOHUGEPAGE, when discussing what we need to absolutely
>> respect.
> 
> My understanding was that MADV_NOHUGEPAGE was being applied to regions *before*
> UFFD was being registered, and the app relied on MADV_NOHUGEPAGE to not back any
> unfaulted pages. It's not completely clear to me how not honouring
> MADV_NOHUGEPAGE would break things though. David?

Sorry, I'm still lagging behind on some threads.

Imagine the following for VM postcopy live migration:

(1) Set MADV_NOHUGEPAGE on guest memory and discard all memory (e.g.,
     MADV_DONTNEED), to start with a clean slate.
(2) Migrate some pages during precopy from the source and store them
     into guest memory on the destination. Some of the memory locations
     will have pages populated.
(3) At some point, decide to enable postcopy: enable userfaultfd on
     guest memory.
(4) Discard *selected* pages again that have been dirtied in the
     meantime on the source. These are pages that have been migrated
     previously.
(5) Start running the VM on the destination.
(6) Anything that's not populated will trigger userfaultfd missing
     faults. Then, you can request them from the source and place them.

If you populate more than required during (2), you can end up not
getting userfaultfd faults during (4) and corrupt your guest state. It
works only if during (2) you migrated all guest memory, or if during
(4) you zap everything that still needs migration.
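
As a rough userspace sketch of that ordering (illustrative only; error
handling omitted, function name hypothetical):

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static int arm_postcopy(char *guest_mem, size_t len)
{
	/* (1) no hugepages on guest memory, then start from a clean slate */
	madvise(guest_mem, len, MADV_NOHUGEPAGE);
	madvise(guest_mem, len, MADV_DONTNEED);

	/* (2) precopy runs here and populates some of the range */

	/* (3) enable postcopy: arm userfaultfd on all guest memory */
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = { .api = UFFD_API };
	ioctl(uffd, UFFDIO_API, &api);

	struct uffdio_register reg = {
		.range = { .start = (unsigned long)guest_mem, .len = len },
		.mode  = UFFDIO_REGISTER_MODE_MISSING,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	/*
	 * (4) zap only the re-dirtied pages here; any page the kernel
	 * populated beyond what precopy explicitly wrote would never
	 * raise a missing fault in (6).
	 */
	return uffd;
}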

According to the man page:

   MADV_NOHUGEPAGE (since Linux 2.6.38): Ensures that memory in the
   address range specified by addr and length will not be backed by
   transparent hugepages.

To me, that includes any other page size that is different from the
base page size (getpagesize()) and, therefore, implies the traditional
system behavior.

Even if we end up calling these "transparent huge pages of different 
size" differently and eventually handle them slightly differently.

But I can see why people want to look for reasons why "never" should
not mean "never" once we come up with a new shiny name for
"transparent huge pages of a different size".

Not that it makes anything clearer or easier if we call 2 MiB pages on
x86 THP and 1 MiB pages TLP (Transparent Large Pages?), while 1 MiB
pages on s390x are THP ... for sure we can come up with new names for
new sizes and cause more confusion.

Most probably we want to clarify in the docs what a transparent huge 
page is and what these toggles do.
Yu Zhao Aug. 4, 2023, 9 p.m. UTC | #31
On Fri, Aug 4, 2023 at 2:23 PM David Hildenbrand <david@redhat.com> wrote:
>
> [...]
>
> Sorry, I'm still lagging behind on some threads.
>
> Imagine the following for VM postcopy live migration:
>
> (1) Set MADV_NOHUGEPAGE on guest memory and discard all memory (e.g.,
>      MADV_DONTNEED), to start with a clean slate.
> (2) Migrates some pages during precopy from the source and stores them
>      into guest memory on the destination. Some of the memory locations
>      will have pages populated.
> (3) At some point, decide to enable postcopy: enable userfaultfd on
>      guest memory.
> (4) Discard *selected* pages again that have been dirtied in the
>      meantime on the source. These are pages that have been migrated
>      previously.
> (5) Start running the VM on the destination.
> (6) Anything that's not populated will trigger userfaultfd missing
>      faults. Then, you can request them from the source and place them.
>
> If you populate more than required during (2), you can end up not
> getting userfaultfd faults during (4) and corrupt your guest state. It
> works only if during (2) you migrated all guest memory, or if during
> (4) you zap everything that still needs migration.

I see what you mean now. Thanks.

Yes, in this case we have to interpret MADV_NOHUGEPAGE as nothing >4KB.
David Hildenbrand Aug. 4, 2023, 9:13 p.m. UTC | #32
On 04.08.23 23:00, Yu Zhao wrote:
> [...]
>
> I see what you mean now. Thanks.
> 
> Yes, in this case we have to interpret MADV_NOHUGEPAGE as nothing >4KB.

Note that it's still even unclear to me why we want to *not* call these 
things THP. It would certainly make everything less confusing if we call 
them THP, but with additional attributes.

I think that is one of the first things we should figure out because it 
also indirectly tells us what all these toggles mean and how/if we 
should redefine them (and if they even apply).

Currently THP == PMD size

In 2016, Hugh already envisioned PUD/PGD THP (see 49920d28781d ("mm: 
make transparent hugepage size public")) when he explicitly exposed 
"hpage_pmd_size". Not "hpage_size".

For hugetlb on arm64 we already support various sizes that are < PMD
size and do *not* call them differently. It's a huge(tlb) page.
Sometimes we refer to them as cont-PTE hugetlb pages.


So, nowadays we do have "PMD-sized THP", someday we might have 
"PUD-sized THP". Can't we come up with a name to describe sub-PMD THP?

Is it really of value if we invent a new term for them? Yes, I was not 
enjoying "Flexible THP".


Once we figure that out, we should figure out whether MADV_HUGEPAGE
means "only PMD-sized THP" or something else.

Also, we can then figure out whether MADV_NOHUGEPAGE means "only
PMD-sized THP" or something else.


The simplest approach to me would be "they imply any THP, and once we 
need more tunables we might add some", similar to what Kirill also raised.


Again, it's all unclear to me at this point and I'm happy to hear 
opinions, because I really don't know.
Yu Zhao Aug. 4, 2023, 9:26 p.m. UTC | #33
On Fri, Aug 4, 2023 at 3:13 PM David Hildenbrand <david@redhat.com> wrote:
>
> [...]
>
> Note that it's still even unclear to me why we want to *not* call these
> things THP. It would certainly make everything less confusing if we call
> them THP, but with additional attributes.
>
> I think that is one of the first things we should figure out because it
> also indirectly tells us what all these toggles mean and how/if we
> should redefine them (and if they even apply).
>
> Currently THP == PMD size
>
> In 2016, Hugh already envisioned PUD/PGD THP (see 49920d28781d ("mm:
> make transparent hugepage size public")) when he explicitly exposed
> "hpage_pmd_size". Not "hpage_size".
>
> For hugetlb on arm64 we already support various sizes that are < PMD
> size and do *not* call them differently. It's a huge(tlb) page.
> Sometimes we refer to them as cont-PTE hugetlb pages.
>
>
> So, nowadays we do have "PMD-sized THP", someday we might have
> "PUD-sized THP". Can't we come up with a name to describe sub-PMD THP?
>
> Is it really of value if we invent a new term for them? Yes, I was not
> enjoying "Flexible THP".
>
>
> Once we figure that out, we should figure out whether MADV_HUGEPAGE
> means "only PMD-sized THP" or something else.
>
> Also, we can then figure out whether MADV_NOHUGEPAGE means "only
> PMD-sized THP" or something else.
>
>
> The simplest approach to me would be "they imply any THP, and once we
> need more tunables we might add some", similar to what Kirill also raised.
>
>
> Again, it's all unclear to me at this point and I'm happy to hear
> opinions, because I really don't know.

I agree these points require more discussion. But I don't think we
need to conclude them now, unless they cause correctness issues like
ignoring MADV_NOHUGEPAGE would. My concern is that if we decide to go
with "they imply any THP" and *expose this to userspace now*, we might
regret later.

Also that "Flexible THP" Kconfig is just a placeholder, from my POV.
It should be removed after we nail down the runtime ABI, which again
IMO, isn't now.
David Hildenbrand Aug. 4, 2023, 9:30 p.m. UTC | #34
On 04.08.23 23:26, Yu Zhao wrote:
> [...]
>
> I agree these points require more discussion. But I don't think we
> need to conclude them now, unless they cause correctness issues like
> ignoring MADV_NOHUGEPAGE would. My concern is that if we decide to go
> with "they imply any THP" and *expose this to userspace now*, we might
> regret later.

If we don't think they are THP, probably MADV_NOHUGEPAGE should not 
apply and we should be ready to find other ways to deal with the mess we 
eventually create. If we want to go down that path, sure.

If they are THP, to me there is not really a question of whether
MADV_NOHUGEPAGE applies to them. Unless we want to build a confusing
piece of software ;)
Zi Yan Aug. 4, 2023, 9:58 p.m. UTC | #35
On 4 Aug 2023, at 17:30, David Hildenbrand wrote:

> [...]
>
> If we don't think they are THP, probably MADV_NOHUGEPAGE should not apply and we should be ready to find other ways to deal with the mess we eventually create. If we want to go down that path, sure.
>
> If they are THP, to me there is not really a question of whether MADV_NOHUGEPAGE applies to them. Unless we want to build a confusing piece of software ;)

I think it is good to call them THP, since they are transparent huge (>order-0) pages.
But the concern is that before we have a reasonable management policy for order>0 &&
order<9 THPs, mixing them with existing order-9 THP might give users unexpected
performance outcomes. Unless we are sure they will always be a performance improvement,
we might repeat the old THP story, namely users disabling THP by default
to avoid unexpected performance hiccups. That is the reason Yu wants to separate
LAF from THP at the moment.

Maybe call it THP (experimental) for now and merge it into THP when we have a stable
policy. For knobs, we might add "any-order" to the existing "never" and "madvise",
plus another interface to specify the max hinted order (enforcing <9) for "any-order".
Later, we can allow users to specify any max hinted order, including 9. Just an
idea.
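
(To illustrate with purely hypothetical names -- no such interface
exists today: an "any-order" value for the existing enabled toggle,
plus a new file, say anon_folio_max_order, set to e.g. 4, would permit
folios up to order-4 while order-9 stays behind the current
"always"/"madvise" semantics.)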


--
Best Regards,
Yan, Zi
Yin Fengwei Aug. 5, 2023, 2:50 a.m. UTC | #36
On 8/5/2023 5:58 AM, Zi Yan wrote:
> [...]
>>>>>>>>> not honoured.
>>>>>>>>
>>>>>>>> I don't remember David saying this. I think he was referring to UFFD,
>>>>>>>> not MADV_NOHUGEPAGE, when discussing what we need to absolutely
>>>>>>>> respect.
>>>>>>>
>>>>>>> My understanding was that MADV_NOHUGEPAGE was being applied to regions *before*
>>>>>>> UFFD was being registered, and the app relied on MADV_NOHUGEPAGE to not back any
>>>>>>> unfaulted pages. It's not completely clear to me how not honouring
>>>>>>> MADV_NOHUGEPAGE would break things though. David?
>>>>>>
>>>>>> Sorry, I'm still lagging behind on some threads.
>>>>>>
>>>>>> Imagine the following for VM postcopy live migration:
>>>>>>
>>>>>> (1) Set MADV_NOHUGEPAGE on guest memory and discard all memory (e.g.,
>>>>>>        MADV_DONTNEED), to start with a clean slate.
>>>>>> (2) Migrates some pages during precopy from the source and stores them
>>>>>>        into guest memory on the destination. Some of the memory locations
>>>>>>        will have pages populated.
>>>>>> (3) At some point, decide to enable postcopy: enable userfaultfd on
>>>>>>        guest memory.
>>>>>> (4) Discard *selected* pages again that have been dirtied in the
>>>>>>        meantime on the source. These are pages that have been migrated
>>>>>>        previously.
>>>>>> (5) Start running the VM on the destination.
>>>>>> (6) Anything that's not populated will trigger userfaultfd missing
>>>>>>        faults. Then, you can request them from the source and place them.
>>>>>>
>>>>>> Assume you would populate more than required during 2), you can end up
>>>>>> not getting userfaultfd faults during 4) and corrupt your guest state.
>>>>>> It works if during (2) you migrated all guest memory, or if during 4)
>>>>>> you zap everything that still needs migration.
>>>>>
>>>>> I see what you mean now. Thanks.
>>>>>
>>>>> Yes, in this case we have to interpret MADV_NOHUGEPAGE as nothing >4KB.
>>>>
>>>> Note that it's still even unclear to me why we want to *not* call these
>>>> things THP. It would certainly make everything less confusing if we call
>>>> them THP, but with additional attributes.
>>>>
>>>> I think that is one of the first things we should figure out because it
>>>> also indirectly tells us what all these toggles mean and how/if we
>>>> should redefine them (and if they even apply).
>>>>
>>>> Currently THP == PMD size
>>>>
>>>> In 2016, Hugh already envisioned PUD/PGD THP (see 49920d28781d ("mm:
>>>> make transparent hugepage size public")) when he explicitly exposed
>>>> "hpage_pmd_size". Not "hpage_size".
>>>>
>>>> For hugetlb on arm64 we already support various sizes that are < PMD
>>>> size and *not* call them differently. It's a huge(tlb) page. Sometimes
>>>> we refer to them as cont-PTE hugetlb pages.
>>>>
>>>>
>>>> So, nowadays we do have "PMD-sized THP", someday we might have
>>>> "PUD-sized THP". Can't we come up with a name to describe sub-PMD THP?
>>>>
>>>> Is it really of value if we invent a new term for them? Yes, I was not
>>>> enjoying "Flexible THP".
>>>>
>>>>
>>>> Once we figured that out, we should figure out if MADV_HUGEPAGE meant
>>>> "only PMD-sized THP" or anything else?
>>>>
>>>> Also, we can then figure out if MADV_NOHUGEPAGE meant "only PMD-sized
>>>> THP" or anything else?
>>>>
>>>>
>>>> The simplest approach to me would be "they imply any THP, and once we
>>>> need more tunables we might add some", similar to what Kirill also raised.
>>>>
>>>>
>>>> Again, it's all unclear to me at this point and I'm happy to hear
>>>> opinions, because I really don't know.
>>>
>>> I agree these points require more discussion. But I don't think we
>>> need to conclude them now, unless they cause correctness issues like
>>> ignoring MADV_NOHUGEPAGE would. My concern is that if we decide to go
>>> with "they imply any THP" and *expose this to userspace now*, we might
>>> regret later.
>>
>> If we don't think they are THP, probably MADV_NOHUGEPAGE should not apply and we should be ready to find other ways to deal with the mess we eventually create. If we want to go down that path, sure.
>>
>> If they are THP, to me there is not really a question if MADV_NOHUGEPAGE applies to them or not. Unless we want to build a confusing piece of software ;)
> 
> I think it is good to call them THP, since they are transparent huge (>order-0) pages.
> But the concern is that before we have a reasonable management policy for order>0 &&
> order<9 THPs, mixing them with existing order-9 THP might give user unexpected
> performance outcome. Unless we are sure they will always be a performance
> improvement, we might repeat the old THP path, with users beginning to disable
> THP by default to avoid unexpected performance hiccups. That is the reason Yu
> wants to separate LAF from THP at the moment.
> 
> Maybe call it THP (experimental) for now and merge it into THP when we have a
> stable policy. For knobs, we might add "any-order" to the existing "never" and
> "madvise", plus another interface to specify the max hinted order (enforcing <9)
> for "any-order". Later, we can allow users to specify any max hinted order,
> including 9. Just an idea.
I suspect that all the config knobs (enable/disable mixing mode, define "any-order"
or "specific-order") will exist long term, because there are always new workloads
that need to be tuned against these configs.


Regards
Yin, Fengwei

> 
> 
> --
> Best Regards,
> Yan, Zi
Yu Zhao Aug. 7, 2023, 5:24 a.m. UTC | #37
On Wed, Jul 26, 2023 at 3:52 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
> allocated in large folios of a determined order. All pages of the large
> folio are pte-mapped during the same page fault, significantly reducing
> the number of page faults. The number of per-page operations (e.g. ref
> counting, rmap management lru list management) are also significantly
> reduced since those ops now become per-folio.
>
> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
> which defaults to disabled for now; The long term aim is for this to
> defaut to enabled, but there are some risks around internal
> fragmentation that need to be better understood first.
>
> When enabled, the folio order is determined as such: For a vma, process
> or system that has explicitly disabled THP, we continue to allocate
> order-0. THP is most likely disabled to avoid any possible internal
> fragmentation so we honour that request.
>
> Otherwise, the return value of arch_wants_pte_order() is used. For vmas
> that have not explicitly opted-in to use transparent hugepages (e.g.
> where thp=madvise and the vma does not have MADV_HUGEPAGE), then
> arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever is
> bigger). This allows for a performance boost without requiring any
> explicit opt-in from the workload while limitting internal
> fragmentation.
>
> If the preferred order can't be used (e.g. because the folio would
> breach the bounds of the vma, or because ptes in the region are already
> mapped) then we fall back to a suitable lower order; first
> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>
> arch_wants_pte_order() can be overridden by the architecture if desired.
> Some architectures (e.g. arm64) can coalsece TLB entries if a contiguous
> set of ptes map physically contigious, naturally aligned memory, so this
> mechanism allows the architecture to optimize as required.
>
> Here we add the default implementation of arch_wants_pte_order(), used
> when the architecture does not define it, which returns -1, implying
> that the HW has no preference. In this case, mm will choose it's own
> default order.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  include/linux/pgtable.h |  13 ++++
>  mm/Kconfig              |  10 +++
>  mm/memory.c             | 166 ++++++++++++++++++++++++++++++++++++----
>  3 files changed, 172 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 5063b482e34f..2a1d83775837 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -313,6 +313,19 @@ static inline bool arch_has_hw_pte_young(void)
>  }
>  #endif
>
> +#ifndef arch_wants_pte_order
> +/*
> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
> + * to be at least order-2. Negative value implies that the HW has no preference
> + * and mm will choose it's own default order.
> + */
> +static inline int arch_wants_pte_order(void)
> +{
> +       return -1;
> +}
> +#endif
> +
>  #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>                                        unsigned long address,
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 09130434e30d..fa61ea160447 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1238,4 +1238,14 @@ config LOCK_MM_AND_FIND_VMA
>
>  source "mm/damon/Kconfig"
>
> +config LARGE_ANON_FOLIO
> +       bool "Allocate large folios for anonymous memory"
> +       depends on TRANSPARENT_HUGEPAGE
> +       default n
> +       help
> +         Use large (bigger than order-0) folios to back anonymous memory where
> +         possible, even for pte-mapped memory. This reduces the number of page
> +         faults, as well as other per-page overheads to improve performance for
> +         many workloads.
> +
>  endmenu
> diff --git a/mm/memory.c b/mm/memory.c
> index 01f39e8144ef..64c3f242c49a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4050,6 +4050,127 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         return ret;
>  }
>
> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
> +{
> +       int i;
> +
> +       if (nr_pages == 1)
> +               return vmf_pte_changed(vmf);
> +
> +       for (i = 0; i < nr_pages; i++) {
> +               if (!pte_none(ptep_get_lockless(vmf->pte + i)))
> +                       return true;
> +       }
> +
> +       return false;
> +}
> +
> +#ifdef CONFIG_LARGE_ANON_FOLIO
> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
> +               (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
> +
> +static int anon_folio_order(struct vm_area_struct *vma)
> +{
> +       int order;
> +
> +       /*
> +        * If THP is explicitly disabled for either the vma, the process or the
> +        * system, then this is very likely intended to limit internal
> +        * fragmentation; in this case, don't attempt to allocate a large
> +        * anonymous folio.
> +        *
> +        * Else, if the vma is eligible for thp, allocate a large folio of the
> +        * size preferred by the arch. Or if the arch requested a very small
> +        * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
> +        * which still meets the arch's requirements but means we still take
> +        * advantage of SW optimizations (e.g. fewer page faults).
> +        *
> +        * Finally if thp is enabled but the vma isn't eligible, take the
> +        * arch-preferred size and limit it to ANON_FOLIO_MAX_ORDER_UNHINTED.
> +        * This ensures workloads that have not explicitly opted-in take benefit
> +        * while capping the potential for internal fragmentation.
> +        */
> +
> +       if ((vma->vm_flags & VM_NOHUGEPAGE) ||
> +           test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
> +           !hugepage_flags_enabled())
> +               order = 0;
> +       else {
> +               order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
> +
> +               if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
> +                       order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
> +       }
> +
> +       return order;
> +}
> +
> +static int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
> +{
> +       int i;
> +       gfp_t gfp;
> +       pte_t *pte;
> +       unsigned long addr;
> +       struct vm_area_struct *vma = vmf->vma;
> +       int prefer = anon_folio_order(vma);
> +       int orders[] = {
> +               prefer,
> +               prefer > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
> +               0,
> +       };
> +
> +       *folio = NULL;
> +
> +       if (vmf_orig_pte_uffd_wp(vmf))
> +               goto fallback;

Per the discussion, we need to check hugepage_vma_check() for
correctness of VM live migration. I'd just check it here and fall back
to order-0 if that helper returns false.
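
A minimal sketch of that suggestion (illustrative only, not the posted code;
the exact arguments to hugepage_vma_check() would still need deciding):

static int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
{
	struct vm_area_struct *vma = vmf->vma;
	/* ... order selection as in the posted patch ... */

	*folio = NULL;

	/*
	 * Fall back to a single order-0 page whenever the vma must not be
	 * backed by a large folio (e.g. MADV_NOHUGEPAGE set for VM live
	 * migration), in addition to the existing uffd-wp check.
	 */
	if (vmf_orig_pte_uffd_wp(vmf) ||
	    !hugepage_vma_check(vma, vma->vm_flags, false, true, true))
		goto fallback;

	/* ... allocation and fallback path as in the posted patch ... */
}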
Ryan Roberts Aug. 7, 2023, 5:45 p.m. UTC | #38
On 05/08/2023 03:50, Yin, Fengwei wrote:
> 
> 
> On 8/5/2023 5:58 AM, Zi Yan wrote:
>> On 4 Aug 2023, at 17:30, David Hildenbrand wrote:
>>
>>> On 04.08.23 23:26, Yu Zhao wrote:
>>>> On Fri, Aug 4, 2023 at 3:13 PM David Hildenbrand <david@redhat.com> wrote:
>>>>>
>>>>> On 04.08.23 23:00, Yu Zhao wrote:
>>>>>> On Fri, Aug 4, 2023 at 2:23 PM David Hildenbrand <david@redhat.com> wrote:
>>>>>>>
>>>>>>> On 04.08.23 10:27, Ryan Roberts wrote:
>>>>>>>> On 04/08/2023 00:50, Yu Zhao wrote:
>>>>>>>>> On Thu, Aug 3, 2023 at 6:43 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>
>>>>>>>>>> + Kirill
>>>>>>>>>>
>>>>>>>>>> On 26/07/2023 10:51, Ryan Roberts wrote:
>>>>>>>>>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
>>>>>>>>>>> allocated in large folios of a determined order. All pages of the large
>>>>>>>>>>> folio are pte-mapped during the same page fault, significantly reducing
>>>>>>>>>>> the number of page faults. The number of per-page operations (e.g. ref
>>>>>>>>>>> counting, rmap management lru list management) are also significantly
>>>>>>>>>>> reduced since those ops now become per-folio.
>>>>>>>>>>>
>>>>>>>>>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
>>>>>>>>>>> which defaults to disabled for now; The long term aim is for this to
>>>>>>>>>>> defaut to enabled, but there are some risks around internal
>>>>>>>>>>> fragmentation that need to be better understood first.
>>>>>>>>>>>
>>>>>>>>>>> When enabled, the folio order is determined as such: For a vma, process
>>>>>>>>>>> or system that has explicitly disabled THP, we continue to allocate
>>>>>>>>>>> order-0. THP is most likely disabled to avoid any possible internal
>>>>>>>>>>> fragmentation so we honour that request.
>>>>>>>>>>>
>>>>>>>>>>> Otherwise, the return value of arch_wants_pte_order() is used. For vmas
>>>>>>>>>>> that have not explicitly opted-in to use transparent hugepages (e.g.
>>>>>>>>>>> where thp=madvise and the vma does not have MADV_HUGEPAGE), then
>>>>>>>>>>> arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever is
>>>>>>>>>>> bigger). This allows for a performance boost without requiring any
>>>>>>>>>>> explicit opt-in from the workload while limitting internal
>>>>>>>>>>> fragmentation.
>>>>>>>>>>>
>>>>>>>>>>> If the preferred order can't be used (e.g. because the folio would
>>>>>>>>>>> breach the bounds of the vma, or because ptes in the region are already
>>>>>>>>>>> mapped) then we fall back to a suitable lower order; first
>>>>>>>>>>> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ...
>>>>>>>>>>
>>>>>>>>>>> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
>>>>>>>>>>> +             (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
>>>>>>>>>>> +
>>>>>>>>>>> +static int anon_folio_order(struct vm_area_struct *vma)
>>>>>>>>>>> +{
>>>>>>>>>>> +     int order;
>>>>>>>>>>> +
>>>>>>>>>>> +     /*
>>>>>>>>>>> +      * If THP is explicitly disabled for either the vma, the process or the
>>>>>>>>>>> +      * system, then this is very likely intended to limit internal
>>>>>>>>>>> +      * fragmentation; in this case, don't attempt to allocate a large
>>>>>>>>>>> +      * anonymous folio.
>>>>>>>>>>> +      *
>>>>>>>>>>> +      * Else, if the vma is eligible for thp, allocate a large folio of the
>>>>>>>>>>> +      * size preferred by the arch. Or if the arch requested a very small
>>>>>>>>>>> +      * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
>>>>>>>>>>> +      * which still meets the arch's requirements but means we still take
>>>>>>>>>>> +      * advantage of SW optimizations (e.g. fewer page faults).
>>>>>>>>>>> +      *
>>>>>>>>>>> +      * Finally if thp is enabled but the vma isn't eligible, take the
>>>>>>>>>>> +      * arch-preferred size and limit it to ANON_FOLIO_MAX_ORDER_UNHINTED.
>>>>>>>>>>> +      * This ensures workloads that have not explicitly opted-in take benefit
>>>>>>>>>>> +      * while capping the potential for internal fragmentation.
>>>>>>>>>>> +      */
>>>>>>>>>>> +
>>>>>>>>>>> +     if ((vma->vm_flags & VM_NOHUGEPAGE) ||
>>>>>>>>>>> +         test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
>>>>>>>>>>> +         !hugepage_flags_enabled())
>>>>>>>>>>> +             order = 0;
>>>>>>>>>>> +     else {
>>>>>>>>>>> +             order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
>>>>>>>>>>> +
>>>>>>>>>>> +             if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>>>>>>>>>>> +                     order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
>>>>>>>>>>> +     }
>>>>>>>>>>> +
>>>>>>>>>>> +     return order;
>>>>>>>>>>> +}
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi All,
>>>>>>>>>>
>>>>>>>>>> I'm writing up the conclusions that we arrived at during discussion in the THP
>>>>>>>>>> meeting yesterday, regarding linkage with exiting THP ABIs. It would be great if
>>>>>>>>>> I can get explicit "agree" or disagree + rationale from at least David, Yu and
>>>>>>>>>> Kirill.
>>>>>>>>>>
>>>>>>>>>> In summary; I think we are converging on the approach that is already coded, but
>>>>>>>>>> I'd like confirmation.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The THP situation today
>>>>>>>>>> -----------------------
>>>>>>>>>>
>>>>>>>>>>     - At system level: THP can be set to "never", "madvise" or "always"
>>>>>>>>>>     - At process level: THP can be "never" or "defer to system setting"
>>>>>>>>>>     - At VMA level: no-hint, MADV_HUGEPAGE, MADV_NOHUGEPAGE
>>>>>>>>>>
>>>>>>>>>> That gives us this table to describe how a page fault is handled, according to
>>>>>>>>>> process state (columns) and vma flags (rows):
>>>>>>>>>>
>>>>>>>>>>                    | never     | madvise   | always
>>>>>>>>>> ----------------|-----------|-----------|-----------
>>>>>>>>>> no hint         | S         | S         | THP>S
>>>>>>>>>> MADV_HUGEPAGE   | S         | THP>S     | THP>S
>>>>>>>>>> MADV_NOHUGEPAGE | S         | S         | S
>>>>>>>>>>
>>>>>>>>>> Legend:
>>>>>>>>>> S       allocate single page (PTE-mapped)
>>>>>>>>>> LAF     allocate lage anon folio (PTE-mapped)
>>>>>>>>>> THP     allocate THP-sized folio (PMD-mapped)
>>>>>>>>>>>          fallback (usually because vma size/alignment insufficient for folio)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Principles for Large Anon Folios (LAF)
>>>>>>>>>> --------------------------------------
>>>>>>>>>>
>>>>>>>>>> David tells us there are use cases today (e.g. qemu live migration) which use
>>>>>>>>>> MADV_NOHUGEPAGE to mean "don't fill any PTEs that are not explicitly faulted"
>>>>>>>>>> and these use cases will break (i.e. functionally incorrect) if this request is
>>>>>>>>>> not honoured.
>>>>>>>>>
>>>>>>>>> I don't remember David saying this. I think he was referring to UFFD,
>>>>>>>>> not MADV_NOHUGEPAGE, when discussing what we need to absolutely
>>>>>>>>> respect.
>>>>>>>>
>>>>>>>> My understanding was that MADV_NOHUGEPAGE was being applied to regions *before*
>>>>>>>> UFFD was being registered, and the app relied on MADV_NOHUGEPAGE to not back any
>>>>>>>> unfaulted pages. It's not completely clear to me how not honouring
>>>>>>>> MADV_NOHUGEPAGE would break things though. David?
>>>>>>>
>>>>>>> Sorry, I'm still lagging behind on some threads.
>>>>>>>
>>>>>>> Imagine the following for VM postcopy live migration:
>>>>>>>
>>>>>>> (1) Set MADV_NOHUGEPAGE on guest memory and discard all memory (e.g.,
>>>>>>>        MADV_DONTNEED), to start with a clean slate.
>>>>>>> (2) Migrates some pages during precopy from the source and stores them
>>>>>>>        into guest memory on the destination. Some of the memory locations
>>>>>>>        will have pages populated.
>>>>>>> (3) At some point, decide to enable postcopy: enable userfaultfd on
>>>>>>>        guest memory.
>>>>>>> (4) Discard *selected* pages again that have been dirtied in the
>>>>>>>        meantime on the source. These are pages that have been migrated
>>>>>>>        previously.
>>>>>>> (5) Start running the VM on the destination.
>>>>>>> (6) Anything that's not populated will trigger userfaultfd missing
>>>>>>>        faults. Then, you can request them from the source and place them.
>>>>>>>
>>>>>>> Assume you would populate more than required during 2), you can end up
>>>>>>> not getting userfaultfd faults during 4) and corrupt your guest state.
>>>>>>> It works if during (2) you migrated all guest memory, or if during 4)
>>>>>>> you zap everything that still needs migration.
>>>>>>
>>>>>> I see what you mean now. Thanks.
>>>>>>
>>>>>> Yes, in this case we have to interpret MADV_NOHUGEPAGE as nothing >4KB.

I'm glad we have agreement on this.

In some threads Yu has been talking about this series in the short term versus
the long-term roadmap; so to be clear, I interpret this as meaning we must
consider that MADV_NOHUGEPAGE means nothing bigger than order-0, both in the
context of this series and for the long term - that's behavior that user space
depends upon.

I think we should also apply the same logic to system/process THP mode =
"never", even if the vma does not have MADV_NOHUGEPAGE. If the user has
explicitly set "never" on the system or process, that means "nothing bigger than
order-0". Shout if you disagree.

>>>>>
>>>>> Note that it's still even unclear to me why we want to *not* call these
>>>>> things THP. It would certainly make everything less confusing if we call
>>>>> them THP, but with additional attributes.

I think I've stated in the past that I don't have a strong opinion on what we
call them. But I do think you make a convincing argument for calling them after
THP. Regardless, I'd rather agree on a name up front, before this initial series
goes in - it's always better to be consistent across all the commit messages and
comments to make things more grepable.

The only concrete objection I remember hearing to a name with "THP" in the title
was that there are stats (meminfo, vmstats, etc) that count THPs and this
becomes confusing if those counters now only mean a subset of THPs. But that
feels like a small issue in the scheme of things.

>>>>>
>>>>> I think that is one of the first things we should figure out because it
>>>>> also indirectly tells us what all these toggles mean and how/if we
>>>>> should redefine them (and if they even apply).
>>>>>
>>>>> Currently THP == PMD size
>>>>>
>>>>> In 2016, Hugh already envisioned PUD/PGD THP (see 49920d28781d ("mm:
>>>>> make transparent hugepage size public")) when he explicitly exposed
>>>>> "hpage_pmd_size". Not "hpage_size".
>>>>>
>>>>> For hugetlb on arm64 we already support various sizes that are < PMD
>>>>> size and *not* call them differently. It's a huge(tlb) page. Sometimes
>>>>> we refer to them as cont-PTE hugetlb pages.
>>>>>
>>>>>
>>>>> So, nowadays we do have "PMD-sized THP", someday we might have
>>>>> "PUD-sized THP". Can't we come up with a name to describe sub-PMD THP?

I think one subtle difference is that these sub-PMD THPs likely won't always
have a single size.

>>>>>
>>>>> Is it really of value if we invent a new term for them? Yes, I was not
>>>>> enjoying "Flexible THP".

How about "variable-order THP"? Or "SW THP" vs "HW THP"?

>>>>>
>>>>>
>>>>> Once we figured that out, we should figure out if MADV_HUGEPAGE meant
>>>>> "only PMD-sized THP" or anything else?
>>>>>
>>>>> Also, we can then figure out if MADV_NOHUGEPAGE meant "only PMD-sized
>>>>> THP" or anything else?

Based on the existing user space expectation that MADV_NOHUGEPAGE means "nothing
bigger than order-0" I'm not sure how we could ever decide MADV_NOHUGEPAGE means
anything different? This feels set in stone to me.

>>>>>
>>>>>
>>>>> The simplest approach to me would be "they imply any THP, and once we
>>>>> need more tunables we might add some", similar to what Kirill also raised.

Agreed.

>>>>>
>>>>>
>>>>> Again, it's all unclear to me at this point and I'm happy to hear
>>>>> opinions, because I really don't know.
>>>>
>>>> I agree these points require more discussion. But I don't think we
>>>> need to conclude them now, unless they cause correctness issues like
>>>> ignoring MADV_NOHUGEPAGE would. My concern is that if we decide to go
>>>> with "they imply any THP" and *expose this to userspace now*, we might
>>>> regret later.
>>>
>>> If we don't think they are THP, probably MADV_NOHUGEPAGE should not apply and we should be ready to find other ways to deal with the mess we eventually create. If we want to go down that path, sure.
>>>
>>> If they are THP, to me there is not really a question if MADV_NOHUGEPAGE applies to them or not. Unless we want to build a confusing piece of software ;)
>>
>> I think it is good to call them THP, since they are transparent huge (>order-0) pages.
>> But the concern is that before we have a reasonable management policy for order>0 &&
>> order<9 THPs, mixing them with existing order-9 THP might give user unexpected
>> performance outcome. Unless we are sure they will always be a performance
>> improvement, we might repeat the old THP path, with users beginning to disable
>> THP by default to avoid unexpected performance hiccups. That is the reason Yu
>> wants to separate LAF from THP at the moment.

(for the purposes of this, LAF="sub-PMD THP" and THP="PMD-size THP"; we treat
them both as forms of THP)...

How about this for a strawman:

When introducing LAF we can either use an opt-in or an opt-out model. The opt-in
model would require a new ABI from day 1 (something I think there is consensus
that we do not want to do) and would prevent apps from automatically getting the
benefit. So I don't like that model.

If going with the opt-out model, we already have an opt-out mechanism
(thp="never" and MADV_NOHUGEPAGE) that we can piggyback. But that mechanism
doesn't give us all the control we would like for benchmarking/characterizing
the interactions between LAF/THP for different workloads. Ideally we need a way
to enable THP while keeping LAF disabled and enable LAF while keeping THP disabled.

Can we do this with debugfs? I think controls in there can come and go without
too much concern about back-compat?

Perhaps 2 controls:

laf_enable=0|1
  enable/disable LAF independently of THP
  default=1

laf_max_order=N
  applies to both LAF and THP
  when laf_max_order < PMD-order, THP acts like thp="never"
  puts a ceiling on folio order allocated by LAF
  default=PMD-order

This gives:


laf_enable=1, laf_max_order=PMD-order (LAF+THP):

                | never     | madvise   | always
----------------|-----------|-----------|-----------
no hint         | S         | LAF>S     | THP>LAF>S
MADV_HUGEPAGE   | S         | THP>LAF>S | THP>LAF>S
MADV_NOHUGEPAGE | S         | S         | S


laf_enable=0, laf_max_order=PMD-order (THP only):

                | never     | madvise   | always
----------------|-----------|-----------|-----------
no hint         | S         | S         | THP>S
MADV_HUGEPAGE   | S         | THP>S     | THP>S
MADV_NOHUGEPAGE | S         | S         | S


laf_enable=1, laf_max_order=(PMD-order - 1) (LAF only):

                | never     | madvise   | always
----------------|-----------|-----------|-----------
no hint         | S         | LAF>S     | LAF>S
MADV_HUGEPAGE   | S         | LAF>S     | LAF>S
MADV_NOHUGEPAGE | S         | S         | S


This would allow us to get something into the kernel that lets people more
broadly characterize different workloads under THP, LAF and THP+LAF, which would
give us a better understanding of if/how we want to design ABIs for the long
term.
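
A minimal sketch of those two controls (the names, debugfs location and wiring
below are hypothetical):

#include <linux/debugfs.h>
#include <linux/huge_mm.h>

static bool laf_enable = true;	/* default=1 */
static u32 laf_max_order;	/* default=PMD-order, set at init */

static int __init laf_debugfs_init(void)
{
	struct dentry *dir = debugfs_create_dir("large_anon_folio", NULL);

	laf_max_order = HPAGE_PMD_ORDER;
	debugfs_create_bool("laf_enable", 0644, dir, &laf_enable);
	debugfs_create_u32("laf_max_order", 0644, dir, &laf_max_order);
	return 0;
}
late_initcall(laf_debugfs_init);

The fault path would then consult laf_enable/laf_max_order when choosing the
folio order, which is what produces the three tables above.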


>>
>> Maybe call it THP (experimental) for now and merge it into THP when we have a
>> stable policy. For knobs, we might add "any-order" to the existing "never" and
>> "madvise", plus another interface to specify the max hinted order (enforcing <9)
>> for "any-order". Later, we can allow users to specify any max hinted order,
>> including 9. Just an idea.
> I suspect that all the config knobs (enable/disable mixing mode, define "any-order"
> or "specific-order") will exist long term, because there are always new workloads
> that need to be tuned against these configs.
> 
> 
> Regards
> Yin, Fengwei
> 
>>
>>
>> --
>> Best Regards,
>> Yan, Zi
Zi Yan Aug. 7, 2023, 6:10 p.m. UTC | #39
On 7 Aug 2023, at 13:45, Ryan Roberts wrote:

> On 05/08/2023 03:50, Yin, Fengwei wrote:
>>
>>
>> On 8/5/2023 5:58 AM, Zi Yan wrote:
>>> On 4 Aug 2023, at 17:30, David Hildenbrand wrote:
>>>
>>>> On 04.08.23 23:26, Yu Zhao wrote:
>>>>> On Fri, Aug 4, 2023 at 3:13 PM David Hildenbrand <david@redhat.com> wrote:
>>>>>>
>>>>>> On 04.08.23 23:00, Yu Zhao wrote:
>>>>>>> On Fri, Aug 4, 2023 at 2:23 PM David Hildenbrand <david@redhat.com> wrote:
>>>>>>>>
>>>>>>>> On 04.08.23 10:27, Ryan Roberts wrote:
>>>>>>>>> On 04/08/2023 00:50, Yu Zhao wrote:
>>>>>>>>>> On Thu, Aug 3, 2023 at 6:43 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> + Kirill
>>>>>>>>>>>
>>>>>>>>>>> On 26/07/2023 10:51, Ryan Roberts wrote:
>>>>>>>>>>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
>>>>>>>>>>>> allocated in large folios of a determined order. All pages of the large
>>>>>>>>>>>> folio are pte-mapped during the same page fault, significantly reducing
>>>>>>>>>>>> the number of page faults. The number of per-page operations (e.g. ref
>>>>>>>>>>>> counting, rmap management lru list management) are also significantly
>>>>>>>>>>>> reduced since those ops now become per-folio.
>>>>>>>>>>>>
>>>>>>>>>>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
>>>>>>>>>>>> which defaults to disabled for now; The long term aim is for this to
>>>>>>>>>>>> defaut to enabled, but there are some risks around internal
>>>>>>>>>>>> fragmentation that need to be better understood first.
>>>>>>>>>>>>
>>>>>>>>>>>> When enabled, the folio order is determined as such: For a vma, process
>>>>>>>>>>>> or system that has explicitly disabled THP, we continue to allocate
>>>>>>>>>>>> order-0. THP is most likely disabled to avoid any possible internal
>>>>>>>>>>>> fragmentation so we honour that request.
>>>>>>>>>>>>
>>>>>>>>>>>> Otherwise, the return value of arch_wants_pte_order() is used. For vmas
>>>>>>>>>>>> that have not explicitly opted-in to use transparent hugepages (e.g.
>>>>>>>>>>>> where thp=madvise and the vma does not have MADV_HUGEPAGE), then
>>>>>>>>>>>> arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever is
>>>>>>>>>>>> bigger). This allows for a performance boost without requiring any
>>>>>>>>>>>> explicit opt-in from the workload while limitting internal
>>>>>>>>>>>> fragmentation.
>>>>>>>>>>>>
>>>>>>>>>>>> If the preferred order can't be used (e.g. because the folio would
>>>>>>>>>>>> breach the bounds of the vma, or because ptes in the region are already
>>>>>>>>>>>> mapped) then we fall back to a suitable lower order; first
>>>>>>>>>>>> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ...
>>>>>>>>>>>
>>>>>>>>>>>> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
>>>>>>>>>>>> +             (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
>>>>>>>>>>>> +
>>>>>>>>>>>> +static int anon_folio_order(struct vm_area_struct *vma)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +     int order;
>>>>>>>>>>>> +
>>>>>>>>>>>> +     /*
>>>>>>>>>>>> +      * If THP is explicitly disabled for either the vma, the process or the
>>>>>>>>>>>> +      * system, then this is very likely intended to limit internal
>>>>>>>>>>>> +      * fragmentation; in this case, don't attempt to allocate a large
>>>>>>>>>>>> +      * anonymous folio.
>>>>>>>>>>>> +      *
>>>>>>>>>>>> +      * Else, if the vma is eligible for thp, allocate a large folio of the
>>>>>>>>>>>> +      * size preferred by the arch. Or if the arch requested a very small
>>>>>>>>>>>> +      * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
>>>>>>>>>>>> +      * which still meets the arch's requirements but means we still take
>>>>>>>>>>>> +      * advantage of SW optimizations (e.g. fewer page faults).
>>>>>>>>>>>> +      *
>>>>>>>>>>>> +      * Finally if thp is enabled but the vma isn't eligible, take the
>>>>>>>>>>>> +      * arch-preferred size and limit it to ANON_FOLIO_MAX_ORDER_UNHINTED.
>>>>>>>>>>>> +      * This ensures workloads that have not explicitly opted-in take benefit
>>>>>>>>>>>> +      * while capping the potential for internal fragmentation.
>>>>>>>>>>>> +      */
>>>>>>>>>>>> +
>>>>>>>>>>>> +     if ((vma->vm_flags & VM_NOHUGEPAGE) ||
>>>>>>>>>>>> +         test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
>>>>>>>>>>>> +         !hugepage_flags_enabled())
>>>>>>>>>>>> +             order = 0;
>>>>>>>>>>>> +     else {
>>>>>>>>>>>> +             order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
>>>>>>>>>>>> +
>>>>>>>>>>>> +             if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>>>>>>>>>>>> +                     order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
>>>>>>>>>>>> +     }
>>>>>>>>>>>> +
>>>>>>>>>>>> +     return order;
>>>>>>>>>>>> +}
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi All,
>>>>>>>>>>>
>>>>>>>>>>> I'm writing up the conclusions that we arrived at during discussion in the THP
>>>>>>>>>>> meeting yesterday, regarding linkage with exiting THP ABIs. It would be great if
>>>>>>>>>>> I can get explicit "agree" or disagree + rationale from at least David, Yu and
>>>>>>>>>>> Kirill.
>>>>>>>>>>>
>>>>>>>>>>> In summary; I think we are converging on the approach that is already coded, but
>>>>>>>>>>> I'd like confirmation.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> The THP situation today
>>>>>>>>>>> -----------------------
>>>>>>>>>>>
>>>>>>>>>>>     - At system level: THP can be set to "never", "madvise" or "always"
>>>>>>>>>>>     - At process level: THP can be "never" or "defer to system setting"
>>>>>>>>>>>     - At VMA level: no-hint, MADV_HUGEPAGE, MADV_NOHUGEPAGE
>>>>>>>>>>>
>>>>>>>>>>> That gives us this table to describe how a page fault is handled, according to
>>>>>>>>>>> process state (columns) and vma flags (rows):
>>>>>>>>>>>
>>>>>>>>>>>                    | never     | madvise   | always
>>>>>>>>>>> ----------------|-----------|-----------|-----------
>>>>>>>>>>> no hint         | S         | S         | THP>S
>>>>>>>>>>> MADV_HUGEPAGE   | S         | THP>S     | THP>S
>>>>>>>>>>> MADV_NOHUGEPAGE | S         | S         | S
>>>>>>>>>>>
>>>>>>>>>>> Legend:
>>>>>>>>>>> S       allocate single page (PTE-mapped)
>>>>>>>>>>> LAF     allocate lage anon folio (PTE-mapped)
>>>>>>>>>>> THP     allocate THP-sized folio (PMD-mapped)
>>>>>>>>>>>>          fallback (usually because vma size/alignment insufficient for folio)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Principles for Large Anon Folios (LAF)
>>>>>>>>>>> --------------------------------------
>>>>>>>>>>>
>>>>>>>>>>> David tells us there are use cases today (e.g. qemu live migration) which use
>>>>>>>>>>> MADV_NOHUGEPAGE to mean "don't fill any PTEs that are not explicitly faulted"
>>>>>>>>>>> and these use cases will break (i.e. functionally incorrect) if this request is
>>>>>>>>>>> not honoured.
>>>>>>>>>>
>>>>>>>>>> I don't remember David saying this. I think he was referring to UFFD,
>>>>>>>>>> not MADV_NOHUGEPAGE, when discussing what we need to absolutely
>>>>>>>>>> respect.
>>>>>>>>>
>>>>>>>>> My understanding was that MADV_NOHUGEPAGE was being applied to regions *before*
>>>>>>>>> UFFD was being registered, and the app relied on MADV_NOHUGEPAGE to not back any
>>>>>>>>> unfaulted pages. It's not completely clear to me how not honouring
>>>>>>>>> MADV_NOHUGEPAGE would break things though. David?
>>>>>>>>
>>>>>>>> Sorry, I'm still lagging behind on some threads.
>>>>>>>>
>>>>>>>> Imagine the following for VM postcopy live migration:
>>>>>>>>
>>>>>>>> (1) Set MADV_NOHUGEPAGE on guest memory and discard all memory (e.g.,
>>>>>>>>        MADV_DONTNEED), to start with a clean slate.
>>>>>>>> (2) Migrates some pages during precopy from the source and stores them
>>>>>>>>        into guest memory on the destination. Some of the memory locations
>>>>>>>>        will have pages populated.
>>>>>>>> (3) At some point, decide to enable postcopy: enable userfaultfd on
>>>>>>>>        guest memory.
>>>>>>>> (4) Discard *selected* pages again that have been dirtied in the
>>>>>>>>        meantime on the source. These are pages that have been migrated
>>>>>>>>        previously.
>>>>>>>> (5) Start running the VM on the destination.
>>>>>>>> (6) Anything that's not populated will trigger userfaultfd missing
>>>>>>>>        faults. Then, you can request them from the source and place them.
>>>>>>>>
>>>>>>>> Assume you would populate more than required during 2), you can end up
>>>>>>>> not getting userfaultfd faults during 4) and corrupt your guest state.
>>>>>>>> It works if during (2) you migrated all guest memory, or if during 4)
>>>>>>>> you zap everything that still needs migration.
>>>>>>>
>>>>>>> I see what you mean now. Thanks.
>>>>>>>
>>>>>>> Yes, in this case we have to interpret MADV_NOHUGEPAGE as nothing >4KB.
>
> I'm glad we have agreement on this.
>
> In some threads Yu has been talking about this series in the short term versus
> the long-term roadmap; so to be clear, I interpret this as meaning we must
> consider that MADV_NOHUGEPAGE means nothing bigger than order-0, both in the
> context of this series and for the long term - that's behavior that user space
> depends upon.
>
> I think we should also apply the same logic to system/process THP mode =
> "never", even if the vma does not have MADV_NOHUGEPAGE. If the user has
> explicitly set "never" on the system or process, that means "nothing bigger than
> order-0". Shout if you disagree.
>
>>>>>>
>>>>>> Note that it's still even unclear to me why we want to *not* call these
>>>>>> things THP. It would certainly make everything less confusing if we call
>>>>>> them THP, but with additional attributes.
>
> I think I've stated in the past that I don't have a strong opinion on what we
> call them. But I do think you make a convincing argument for calling them after
> THP. Regardless, I'd rather agree on a name up front, before this initial series
> goes in - it's always better to be consistent across all the commit messages and
> comments to make things more grepable.
>
> The only concrete objection I remember hearing to a name with "THP" in the title
> was that there are stats (meminfo, vmstats, etc) that count THPs and this
> becomes confusing if those counters now only mean a subset of THPs. But that
> feels like a small issue in the scheme of things.
>
>>>>>>
>>>>>> I think that is one of the first things we should figure out because it
>>>>>> also indirectly tells us what all these toggles mean and how/if we
>>>>>> should redefine them (and if they even apply).
>>>>>>
>>>>>> Currently THP == PMD size
>>>>>>
>>>>>> In 2016, Hugh already envisioned PUD/PGD THP (see 49920d28781d ("mm:
>>>>>> make transparent hugepage size public")) when he explicitly exposed
>>>>>> "hpage_pmd_size". Not "hpage_size".
>>>>>>
>>>>>> For hugetlb on arm64 we already support various sizes that are < PMD
>>>>>> size and *not* call them differently. It's a huge(tlb) page. Sometimes
>>>>>> we refer to them as cont-PTE hugetlb pages.
>>>>>>
>>>>>>
>>>>>> So, nowadays we do have "PMD-sized THP", someday we might have
>>>>>> "PUD-sized THP". Can't we come up with a name to describe sub-PMD THP?
>
> I think one subtle difference is that these sub-PMD THPs likely won't always
> have a single size.
>
>>>>>>
>>>>>> Is it really of value if we invent a new term for them? Yes, I was not
>>>>>> enjoying "Flexible THP".
>
> How about "variable-order THP"? Or "SW THP" vs "HW THP"?

variable-order THP sounds good to me.

One question I have: although Ryan is only working on sub-PMD THPs, do we want
to plan for sub-PUD THPs now? Are sub-PUD THPs also variable-order THPs? Should
we leave TODOs and comments like "variable-order THPs can be bigger than PMD and
smaller than PUD in the future"? Maybe sub-PUD THPs are still too far off to
consider for now. Just thinking out loud.


>
>>>>>>
>>>>>>
>>>>>> Once we figured that out, we should figure out if MADV_HUGEPAGE meant
>>>>>> "only PMD-sized THP" or anything else?
>>>>>>
>>>>>> Also, we can then figure out if MADV_NOHUGEPAGE meant "only PMD-sized
>>>>>> THP" or anything else?
>
> Based on the existing user space expectation that MADV_NOHUGEPAGE means "nothing
> bigger than order-0" I'm not sure how we could ever decide MADV_NOHUGEPAGE means
> anything different? This feels set in stone to me.
>
>>>>>>
>>>>>>
>>>>>> The simplest approach to me would be "they imply any THP, and once we
>>>>>> need more tunables we might add some", similar to what Kirill also raised.
>
> Agreed.
>
>>>>>>
>>>>>>
>>>>>> Again, it's all unclear to me at this point and I'm happy to hear
>>>>>> opinions, because I really don't know.
>>>>>
>>>>> I agree these points require more discussion. But I don't think we
>>>>> need to conclude them now, unless they cause correctness issues like
>>>>> ignoring MADV_NOHUGEPAGE would. My concern is that if we decide to go
>>>>> with "they imply any THP" and *expose this to userspace now*, we might
>>>>> regret later.
>>>>
>>>> If we don't think they are THP, probably MADV_NOHUGEPAGE should not apply and we should be ready to find other ways to deal with the mess we eventually create. If we want to go down that path, sure.
>>>>
>>>> If they are THP, to me there is not really a question if MADV_NOHUGEPAGE applies to them or not. Unless we want to build a confusing piece of software ;)
>>>
>>> I think it is good to call them THP, since they are transparent huge (>order-0) pages.
>>> But the concern is that before we have a reasonable management policy for order>0 &&
>>> order<9 THPs, mixing them with existing order-9 THP might give user unexpected
>>> performance outcome. Unless we are sure they will always be a performance
>>> improvement, we might repeat the old THP path, with users beginning to disable
>>> THP by default to avoid unexpected performance hiccups. That is the reason Yu
>>> wants to separate LAF from THP at the moment.
>
> (for the purposes of this, LAF="sub-PMD THP" and THP="PMD-size THP"; we treat
> them both as forms of THP)...
>
> How about this for a strawman:
>
> When introducing LAF we can either use an opt-in or an opt-out model. The opt-in
> model would require a new ABI from day 1 (something I think there is consensus
> that we do not want to do) and would prevent apps from automatically getting the
> benefit. So I don't like that model.
>
> If going with the opt-out model, we already have an opt-out mechanism
> (thp="never" and MADV_NOHUGEPAGE) that we can piggyback. But that mechanism
> doesn't give us all the control we would like for benchmarking/characterizing
> the interactions between LAF/THP for different workloads. Ideally we need a way
> to enable THP while keeping LAF disabled and enable LAF while keeping THP disabled.
>
> Can we do this with debugfs? I think controls in there can come and go without
> too much concern about back-compat?

Is debugfs always available on all distros? On a system without debugfs, the
user would lose control of LAF. IMHO, the two knobs below could live in
/sys/kernel/mm/transparent_hugepage/ and be kept in sync with "enabled" once we
think LAF is well studied and coexists well with existing PMD THPs; namely, when
"enabled" is set to "always", "madvise" or "never", "laf_enabled" is set to the
same value.
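
A rough sketch of that alternative as a sysfs attribute (hypothetical; it would
also need hooking into the existing hugepage_attr group in mm/huge_memory.c,
and reuses the laf_enable variable from the strawman above):

static ssize_t laf_enabled_show(struct kobject *kobj,
				struct kobj_attribute *attr, char *buf)
{
	return sysfs_emit(buf, "%d\n", READ_ONCE(laf_enable));
}

static ssize_t laf_enabled_store(struct kobject *kobj,
				 struct kobj_attribute *attr,
				 const char *buf, size_t count)
{
	bool val;
	int err = kstrtobool(buf, &val);

	if (err)
		return err;
	WRITE_ONCE(laf_enable, val);
	return count;
}

static struct kobj_attribute laf_enabled_attr =
	__ATTR(laf_enabled, 0644, laf_enabled_show, laf_enabled_store);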

>
> Perhaps 2 controls:
>
> laf_enable=0|1
>   enable/disable LAF independently of THP
>   default=1
>
> laf_max_order=N
>   applies to both LAF and THP
>   when laf_max_order < PMD-order, THP acts like thp="never"
>   puts a ceiling on folio order allocated by LAF
>   default=PMD-order

I think it is better to keep it independent of PMD THP. Just ensure
laf_max_order can never be bigger than PMD-order. Later, when we understand the
performance impact of mixing LAF with PMD THP, we can lift this limit to allow
laf_max_order to be any possible page order.
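
For instance (hypothetical, reusing the knob names from the strawman above):

/* keep the raw knob but clamp at the point of use for now */
static int effective_laf_max_order(void)
{
	return min_t(int, READ_ONCE(laf_max_order), HPAGE_PMD_ORDER);
}

Lifting the limit later is then just a matter of relaxing this clamp.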

>
> This gives:
>
>
> laf_enable=1, laf_max_order=PMD-order (LAF+THP):
>
>                 | never     | madvise   | always
> ----------------|-----------|-----------|-----------
> no hint         | S         | LAF>S     | THP>LAF>S
> MADV_HUGEPAGE   | S         | THP>LAF>S | THP>LAF>S
> MADV_NOHUGEPAGE | S         | S         | S
>
>
> laf_enable=0, laf_max_order=PMD-order (THP only):
>
>                 | never     | madvise   | always
> ----------------|-----------|-----------|-----------
> no hint         | S         | S         | THP>S
> MADV_HUGEPAGE   | S         | THP>S     | THP>S
> MADV_NOHUGEPAGE | S         | S         | S
>
>
> laf_enable=1, laf_max_order=(PMD-order - 1) (LAF only):
>
>                 | never     | madvise   | always
> ----------------|-----------|-----------|-----------
> no hint         | S         | LAF>S     | LAF>S
> MADV_HUGEPAGE   | S         | LAF>S     | LAF>S
> MADV_NOHUGEPAGE | S         | S         | S
>
>
> This would allow us to get something into the kernel that would allow people to
> more broadly characterize different workloads under THP, LAF, THP+LAF, which
> would give us a better understanding of if/how we want to design ABIs for the
> long term.
>
>
>>>
>>> Maybe call it THP (experimental) for now and merge it into THP when we have a
>>> stable policy. For knobs, we might add "any-order" to the existing "never" and
>>> "madvise", plus another interface to specify the max hinted order (enforcing <9)
>>> for "any-order". Later, we can allow users to specify any max hinted order,
>>> including 9. Just an idea.
>> I suspect that all the config knobs (enable/disable mixing mode, define "any-order"
>> or "specific-order") will exist long term, because there are always new workloads
>> that need to be tuned against these configs.
>>
>>
>> Regards
>> Yin, Fengwei
>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Yan, Zi


--
Best Regards,
Yan, Zi
Ryan Roberts Aug. 7, 2023, 7 p.m. UTC | #40
On 04/08/2023 19:53, Yu Zhao wrote:
> On Fri, Aug 4, 2023 at 3:06 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 04/08/2023 01:19, Yu Zhao wrote:
>>> On Thu, Aug 3, 2023 at 8:27 AM Kirill A. Shutemov
>>> <kirill.shutemov@linux.intel.com> wrote:
>>>>
>>>> On Thu, Aug 03, 2023 at 01:43:31PM +0100, Ryan Roberts wrote:
>>>>> + Kirill
>>>>>
>>>>> On 26/07/2023 10:51, Ryan Roberts wrote:
>>>>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
>>>>>> allocated in large folios of a determined order. All pages of the large
>>>>>> folio are pte-mapped during the same page fault, significantly reducing
>>>>>> the number of page faults. The number of per-page operations (e.g. ref
>>>>>> counting, rmap management lru list management) are also significantly
>>>>>> reduced since those ops now become per-folio.
>>>>>>
>>>>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
>>>>>> which defaults to disabled for now; The long term aim is for this to
>>>>>> defaut to enabled, but there are some risks around internal
>>>>>> fragmentation that need to be better understood first.
>>>>>>
>>>>>> When enabled, the folio order is determined as such: For a vma, process
>>>>>> or system that has explicitly disabled THP, we continue to allocate
>>>>>> order-0. THP is most likely disabled to avoid any possible internal
>>>>>> fragmentation so we honour that request.
>>>>>>
>>>>>> Otherwise, the return value of arch_wants_pte_order() is used. For vmas
>>>>>> that have not explicitly opted-in to use transparent hugepages (e.g.
>>>>>> where thp=madvise and the vma does not have MADV_HUGEPAGE), then
>>>>>> arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever is
>>>>>> bigger). This allows for a performance boost without requiring any
>>>>>> explicit opt-in from the workload while limitting internal
>>>>>> fragmentation.
>>>>>>
>>>>>> If the preferred order can't be used (e.g. because the folio would
>>>>>> breach the bounds of the vma, or because ptes in the region are already
>>>>>> mapped) then we fall back to a suitable lower order; first
>>>>>> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>>>>>>
>>>>>
>>>>> ...
>>>>>
>>>>>> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
>>>>>> +           (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
>>>>>> +
>>>>>> +static int anon_folio_order(struct vm_area_struct *vma)
>>>>>> +{
>>>>>> +   int order;
>>>>>> +
>>>>>> +   /*
>>>>>> +    * If THP is explicitly disabled for either the vma, the process or the
>>>>>> +    * system, then this is very likely intended to limit internal
>>>>>> +    * fragmentation; in this case, don't attempt to allocate a large
>>>>>> +    * anonymous folio.
>>>>>> +    *
>>>>>> +    * Else, if the vma is eligible for thp, allocate a large folio of the
>>>>>> +    * size preferred by the arch. Or if the arch requested a very small
>>>>>> +    * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
>>>>>> +    * which still meets the arch's requirements but means we still take
>>>>>> +    * advantage of SW optimizations (e.g. fewer page faults).
>>>>>> +    *
>>>>>> +    * Finally if thp is enabled but the vma isn't eligible, take the
>>>>>> +    * arch-preferred size and limit it to ANON_FOLIO_MAX_ORDER_UNHINTED.
>>>>>> +    * This ensures workloads that have not explicitly opted-in take benefit
>>>>>> +    * while capping the potential for internal fragmentation.
>>>>>> +    */
>>>>>> +
>>>>>> +   if ((vma->vm_flags & VM_NOHUGEPAGE) ||
>>>>>> +       test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
>>>>>> +       !hugepage_flags_enabled())
>>>>>> +           order = 0;
>>>>>> +   else {
>>>>>> +           order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
>>>>>> +
>>>>>> +           if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>>>>>> +                   order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
>>>>>> +   }
>>>>>> +
>>>>>> +   return order;
>>>>>> +}
>>>>>
>>>>>
>>>>> Hi All,
>>>>>
>>>>> I'm writing up the conclusions that we arrived at during discussion in the THP
>>>>> meeting yesterday, regarding linkage with exiting THP ABIs. It would be great if
>>>>> I can get explicit "agree" or disagree + rationale from at least David, Yu and
>>>>> Kirill.
>>>>>
>>>>> In summary; I think we are converging on the approach that is already coded, but
>>>>> I'd like confirmation.
>>>>>
>>>>>
>>>>>
>>>>> The THP situation today
>>>>> -----------------------
>>>>>
>>>>>  - At system level: THP can be set to "never", "madvise" or "always"
>>>>>  - At process level: THP can be "never" or "defer to system setting"
>>>>>  - At VMA level: no-hint, MADV_HUGEPAGE, MADV_NOHUGEPAGE
>>>>>
>>>>> That gives us this table to describe how a page fault is handled, according to
>>>>> process state (columns) and vma flags (rows):
>>>>>
>>>>>                 | never     | madvise   | always
>>>>> ----------------|-----------|-----------|-----------
>>>>> no hint         | S         | S         | THP>S
>>>>> MADV_HUGEPAGE   | S         | THP>S     | THP>S
>>>>> MADV_NOHUGEPAGE | S         | S         | S
>>>>>
>>>>> Legend:
>>>>> S     allocate single page (PTE-mapped)
>>>>> LAF   allocate lage anon folio (PTE-mapped)
>>>>> THP   allocate THP-sized folio (PMD-mapped)
>>>>>>     fallback (usually because vma size/alignment insufficient for folio)
>>>>>
>>>>>
>>>>>
>>>>> Principles for Large Anon Folios (LAF)
>>>>> --------------------------------------
>>>>>
>>>>> David tells us there are use cases today (e.g. qemu live migration) which use
>>>>> MADV_NOHUGEPAGE to mean "don't fill any PTEs that are not explicitly faulted"
>>>>> and these use cases will break (i.e. functionally incorrect) if this request is
>>>>> not honoured.
>>>>>
>>>>> So LAF must at least honour MADV_NOHUGEPAGE to prevent breaking existing use
>>>>> cases. And once we do this, then I think the least confusing thing is for it to
>>>>> also honor the "never" system/process state; so if either the system, process or
>>>>> vma has explicitly opted-out of THP, then LAF should also be bypassed.
>>>>>
>>>>> Similarly, any case that would previously cause the allocation of PMD-sized THP
>>>>> must continue to be honoured, else we risk performance regression.
>>>>>
>>>>> That leaves the "madvise/no-hint" case, and all THP fallback paths due to the
>>>>> VMA not being correctly aligned or sized to hold a PMD-sized mapping. In these
>>>>> cases, we will attempt to use LAF first, and fallback to single page if the vma
>>>>> size/alignment doesn't permit it.
>>>>>
>>>>>                 | never     | madvise   | always
>>>>> ----------------|-----------|-----------|-----------
>>>>> no hint         | S         | LAF>S     | THP>LAF>S
>>>>> MADV_HUGEPAGE   | S         | THP>LAF>S | THP>LAF>S
>>>>> MADV_NOHUGEPAGE | S         | S         | S
>>>>>
>>>>> I think this (perhaps conservative) approach will be the least surprising to
>>>>> users. And it is the policy that is already implemented in this patch.
>>>>
>>>> This looks very reasonable.
>>>>
>>>> The only questionable field is no-hint/madvise. I can argue for both LAF>S
>>>> and S here. I think LAF>S is fine as long as we are not too aggressive
>>>> with allocation order.
>>>>
>>>> I think we need to work on eliminating reasons for users to set 'never'.
>>>> If something behaves better with 'never', the kernel has failed the user.
>>>>
>>>>> Downsides of this policy
>>>>> ------------------------
>>>>>
>>>>> As Yu and Yin have pointed out, there are some workloads which do not perform
>>>>> well with THP, due to large fault latency or memory wastage, etc. But which
>>>>> _may_ still benefit from LAF. By taking the conservative approach, we exclude
>>>>> these workloads from benefiting automatically.
>>>>
>>>> Hm. I don't buy it. Why THP with order-9 is too much, but order-8 LAF is
>>>> fine?
>>>
>>> No, it's not. And no one said order-8 LAF is fine :) The starting
>>> order for LAF that we have been discussing is at most 64KB (vs 2MB
>>> THP). For my taste, it's still too large. I'd go with 32KB/16KB.
>>
>> It's currently influenced by the arch. If the arch doesn't have an opinion then
>> it's currently 32K in the code. The 64K size is my aspiration for arm64 if/when I
>> land the contpte mapping work.
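
For illustration, the arm64 override I have in mind would look something
like this hypothetical sketch (not part of this series), using the contpte
geometry, i.e. order-4/64K with 4K base pages:

#define arch_wants_pte_order arch_wants_pte_order
static inline int arch_wants_pte_order(void)
{
        /* Prefer the block size that contpte can coalesce in the TLB. */
        return CONT_PTE_SHIFT - PAGE_SHIFT;
}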
> 
> Just to double check: this discussion covers the long term/permanent
> solution/roadmap, correct? That's what Kirill and I were arguing
> about. Otherwise, the order-8/9 concern above is totally irrelevant,
> since we don't have them in this series.
> 
> For the short term (this series), what you described above looks good
> to me: we may regress but will not break any existing use cases, and
> we are behind a Kconfig option.

OK that's good to hear.

> 
>>> However, the same argument can be used to argue against the policy
>>> Ryan listed above: why order-10 LAF is ok for madvise but not order-11
>>> (which becomes "always")?
>>
>> Sorry I don't understand what you are saying here. Where has order-10 LAF come from?
> 
> I pushed that rhetoric a bit further: order-11 is the THP size (32MB)
> with 16KB base page size on ARM. Confusing, isn't it? And there is
> another complaint from Fengwei here [1].
> 
> [1] https://lore.kernel.org/linux-mm/CAOUHufasZ6w32sHO+Lq33+tGy3+GiO0_dd6mNYwfS_5gqhzYbw@mail.gmail.com/
> 
>>> I'm strongly against this policy
> 
> Again, just to be clear: I'm strongly against this policy being
> exposed to userspace in any way and becoming a long-term/permanent thing
> we have to maintain/change in the future, since I'm assuming that's
> the context.

I'm still confused. The policy I described (and which I thought we were
discussing) does not expose any new tunables to user space. And you said above
that what I described "looks good to me". So is "this policy" which you are
strongly against referring to the policy I wrote down or something else?

> 
>> Ugh, I thought we came to an agreement (or at least "disagree and commit") on
>> the THP call. Obviously I was wrong.
> 
> My impression is we only agreed on one thing: at the current stage, we
> should respect things we absolutely have to. We didn't agree on what
> "never" means ("never 2MB" or "never >4KB"), and we didn't touch on
> how "always" should behave at all.

I _think_ we have now agreed some of this in other threads, but please re-raise
in the context of the other email thread I just sent out - it's probably cleaner
to continue discussion there.

> 
>> David is telling us that we will break user space if we don't consider
>> MADV_NOHUGEPAGE to mean "never allocate memory to unfaulted addresses". So at
>> least this much must be cast in stone, no? Could you lay out any policy
>> proposal you have as an alternative that still follows this requirement?
> 
> If MADV_NOHUGEPAGE falls into the category of things we have to
> absolutely respect, then we will. But I don't think it does, because
> the UFFD check we have in this series already guarantees the KVM use
> case. I can explain how it works in detail if it's still not clear to
> you: long story short, the UFFD check precedes the MADV_NOHUGEPAGE
> check in alloc_anon_folio().

I think we have now concluded that it's not this simple; MADV_NOHUGEPAGE is
applied to the region and pages are faulted in before UFFD is registered, so
checking for UFFD is not sufficient.
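
To make that ordering concrete, here's a minimal userspace sketch of the
flow (hypothetical; error handling omitted; the kernel must not populate
anything beyond the explicitly faulted page in step 2):

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static void setup_postcopy(char *guest_mem, size_t len)
{
        struct uffdio_api api = { .api = UFFD_API };
        struct uffdio_register reg = {
                .range = { .start = (unsigned long)guest_mem, .len = len },
                .mode = UFFDIO_REGISTER_MODE_MISSING,
        };
        int uffd;

        /* 1: opt the whole region out of anything bigger than order-0 */
        madvise(guest_mem, len, MADV_NOHUGEPAGE);

        /* 2: precopy faults in *selected* pages only; a large folio here
         * would silently populate neighbouring ptes too */
        guest_mem[0] = 1;

        /* 3: only now is the region registered with uffd, so any page
         * already populated above will never raise a missing fault */
        uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
        ioctl(uffd, UFFDIO_API, &api);
        ioctl(uffd, UFFDIO_REGISTER, &reg);
}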

> 
> Here is what I recommend for the medium and long terms:
> https://lore.kernel.org/linux-mm/CAOUHufYm6Lkm4tLRbyKOc3-NYU-8d6ZDMNDWHo=e=E16oasN8A@mail.gmail.com/
> 
> For the short term, hard-coding two orders (hw/sw preferred), putting
> them behind a Kconfig and not exposing this info to the userspace are
> good enough for me.

I think that's pretty much what I have now, so perhaps I'm laboring the point a
bit too much here. Let's just get the prerequisites ticked off then get this
patch set merged??
Ryan Roberts Aug. 7, 2023, 7:07 p.m. UTC | #41
On 07/08/2023 06:24, Yu Zhao wrote:
> On Wed, Jul 26, 2023 at 3:52 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> [...]
>>
>> +#ifdef CONFIG_LARGE_ANON_FOLIO
>> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
>> +               (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
>> +
>> +static int anon_folio_order(struct vm_area_struct *vma)
>> +{
>> +       int order;
>> +
>> +       /*
>> +        * If THP is explicitly disabled for either the vma, the process or the
>> +        * system, then this is very likely intended to limit internal
>> +        * fragmentation; in this case, don't attempt to allocate a large
>> +        * anonymous folio.
>> +        *
>> +        * Else, if the vma is eligible for thp, allocate a large folio of the
>> +        * size preferred by the arch. Or if the arch requested a very small
>> +        * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
>> +        * which still meets the arch's requirements but means we still take
>> +        * advantage of SW optimizations (e.g. fewer page faults).
>> +        *
>> +        * Finally if thp is enabled but the vma isn't eligible, take the
>> +        * arch-preferred size and limit it to ANON_FOLIO_MAX_ORDER_UNHINTED.
>> +        * This ensures workloads that have not explicitly opted-in take benefit
>> +        * while capping the potential for internal fragmentation.
>> +        */
>> +
>> +       if ((vma->vm_flags & VM_NOHUGEPAGE) ||
>> +           test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
>> +           !hugepage_flags_enabled())
>> +               order = 0;
>> +       else {
>> +               order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
>> +
>> +               if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>> +                       order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
>> +       }
>> +
>> +       return order;
>> +}
>> +
>> +static int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
>> +{
>> +       int i;
>> +       gfp_t gfp;
>> +       pte_t *pte;
>> +       unsigned long addr;
>> +       struct vm_area_struct *vma = vmf->vma;
>> +       int prefer = anon_folio_order(vma);
>> +       int orders[] = {
>> +               prefer,
>> +               prefer > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
>> +               0,
>> +       };
>> +
>> +       *folio = NULL;
>> +
>> +       if (vmf_orig_pte_uffd_wp(vmf))
>> +               goto fallback;
> 
> Per the discussion, we need to check hugepage_vma_check() for
> correctness of VM LM. I'd just check it here and fall back to order 0
> if that helper returns false.

I'm not sure whether you haven't noticed the logic in anon_folio_order()
above, or whether you are making this suggestion because you disagree with the
subtle difference in my logic?

My logic is deliberately not calling hugepage_vma_check() because that would
return false for the thp=madvise,mmap=unhinted case, whereas the policy I'm
implementing wants to apply LAF in that case.


My intended policy:

                | never     | madvise   | always
----------------|-----------|-----------|-----------
no hint         | S         | LAF>S     | THP>LAF>S
MADV_HUGEPAGE   | S         | THP>LAF>S | THP>LAF>S
MADV_NOHUGEPAGE | S         | S         | S


What your suggestion would give:

                | never     | madvise   | always
----------------|-----------|-----------|-----------
no hint         | S         | S         | THP>LAF>S
MADV_HUGEPAGE   | S         | THP>LAF>S | THP>LAF>S
MADV_NOHUGEPAGE | S         | S         | S
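
In code terms, the difference is roughly this (sketch):

        /* Your suggestion, as I understand it: a hard gate */
        if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
                order = 0;

        /* This patch: the same check only caps the order, so the
         * thp=madvise/unhinted case still gets a capped LAF */
        if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
                order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);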


Thanks,
Ryan
Yu Zhao Aug. 7, 2023, 11:21 p.m. UTC | #42
On Mon, Aug 7, 2023 at 1:07 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 07/08/2023 06:24, Yu Zhao wrote:
> > On Wed, Jul 26, 2023 at 3:52 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> [...]
> >>
> >> +       *folio = NULL;
> >> +
> >> +       if (vmf_orig_pte_uffd_wp(vmf))
> >> +               goto fallback;
> >
> > Per the discussion, we need to check hugepage_vma_check() for
> > correctness of VM LM. I'd just check it here and fall back to order 0
> > if that helper returns false.
>
> I'm not sure whether you haven't noticed the logic in anon_folio_order()
> above, or whether you are making this suggestion because you disagree with the
> subtle difference in my logic?

The latter, or more generally the policy you described earlier.

> My logic is deliberately not calling hugepage_vma_check() because that would
> return false for the thp=madvise,mmap=unhinted case, whereas the policy I'm
> implementing wants to apply LAF in that case.
>
>
> My intended policy:
>
>                 | never     | madvise   | always
> ----------------|-----------|-----------|-----------
> no hint         | S         | LAF>S     | THP>LAF>S
> MADV_HUGEPAGE   | S         | THP>LAF>S | THP>LAF>S
> MADV_NOHUGEPAGE | S         | S         | S
>
>
> What your suggestion would give:
>
>                 | never     | madvise   | always
> ----------------|-----------|-----------|-----------
> no hint         | S         | S         | THP>LAF>S
> MADV_HUGEPAGE   | S         | THP>LAF>S | THP>LAF>S
> MADV_NOHUGEPAGE | S         | S         | S

This is not what I'm suggesting.

Let me reiterate [1]:
  My impression is we only agreed on one thing: at the current stage, we
  should respect things we absolutely have to. We didn't agree on what
  "never" means ("never 2MB" or "never >4KB"), and we didn't touch on
  how "always" should behave at all.

And [2]:
  (Thanks to David, now I agree that) we have to interpret MADV_NOHUGEPAGE
  as nothing >4KB.

My final take [3]:
  I agree these points require more discussion. But I don't think we
  need to conclude them now, unless they cause correctness issues like
  ignoring MADV_NOHUGEPAGE would.

But I should have been clear about the parameters to
hugepage_vma_check(): enforce_sysfs=false.
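
In code terms, that is something like this sketch, reusing the call site
from this patch (smaps=false, in_pf=true, enforce_sysfs=false):

        /*
         * Hard disables (MADV_NOHUGEPAGE, prctl, fw) still apply, but
         * the sysfs setting no longer forces order-0 on its own.
         */
        if (!hugepage_vma_check(vma, vma->vm_flags, false, true, false))
                goto fallback;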

[1] https://lore.kernel.org/linux-mm/CAOUHufYQTcOdKU=1mPq-fdLV7a66sHx1=EJpPpMVogciCNKO9A@mail.gmail.com/
[2] https://lore.kernel.org/linux-mm/CAOUHufafd4GNna2GKdSyQdW6CLVh0gxhNgeOc6t+ZOphwgw7tw@mail.gmail.com/
[3] https://lore.kernel.org/linux-mm/CAOUHufYQTcOdKU=1mPq-fdLV7a66sHx1=EJpPpMVogciCNKO9A@mail.gmail.com/
Ryan Roberts Aug. 8, 2023, 9:37 a.m. UTC | #43
On 08/08/2023 00:21, Yu Zhao wrote:
> On Mon, Aug 7, 2023 at 1:07 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 07/08/2023 06:24, Yu Zhao wrote:
>>> On Wed, Jul 26, 2023 at 3:52 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> [...]
>>>>
>>>> +       *folio = NULL;
>>>> +
>>>> +       if (vmf_orig_pte_uffd_wp(vmf))
>>>> +               goto fallback;
>>>
>>> Per the discussion, we need to check hugepage_vma_check() for
>>> correctness of VM LM. I'd just check it here and fall back to order 0
>>> if that helper returns false.
>>
>> I'm not sure whether you haven't noticed the logic in anon_folio_order()
>> above, or whether you are making this suggestion because you disagree with the
>> subtle difference in my logic?
> 
> The latter, or more generally the policy you described earlier.
> 
>> My logic is deliberately not calling hugepage_vma_check() because that would
>> return false for the thp=madvise,mmap=unhinted case, whereas the policy I'm
>> implementing wants to apply LAF in that case.
>>
>>
>> My intended policy:
>>
>>                 | never     | madvise   | always
>> ----------------|-----------|-----------|-----------
>> no hint         | S         | LAF>S     | THP>LAF>S
>> MADV_HUGEPAGE   | S         | THP>LAF>S | THP>LAF>S
>> MADV_NOHUGEPAGE | S         | S         | S
>>
>>
>> What your suggestion would give:
>>
>>                 | never     | madvise   | always
>> ----------------|-----------|-----------|-----------
>> no hint         | S         | S         | THP>LAF>S
>> MADV_HUGEPAGE   | S         | THP>LAF>S | THP>LAF>S
>> MADV_NOHUGEPAGE | S         | S         | S
> 
> This is not what I'm suggesting.
> 
> Let me reiterate [1]:
>   My impression is we only agreed on one thing: at the current stage, we
>   should respect things we absolutely have to. We didn't agree on what
>   "never" means ("never 2MB" or "never >4KB"), and we didn't touch on
>   how "always" should behave at all.
> 
> And [2]:
>   (Thanks to David, now I agree that) we have to interpret MADV_NOHUGEPAGE
>   as nothing >4KB.
> 
> My final take [3]:
>   I agree these points require more discussion. But I don't think we
>   need to conclude them now, unless they cause correctness issues like
>   ignoring MADV_NOHUGEPAGE would.

Thanks, I've read all of these comments previously, and appreciate the time you
have put into the feedback. I'm not sure I fully agree with your point that we
don't need to conclude on a policy now; I certainly don't think we need the
whole thing in place on day 1, but I do think that whatever we put in should
strive to be a strict subset of where we think we are going. For example, if we
put something in with one policy (i.e. "never" only means "never 2MB") then find
a problem and have to change that to be more conservative, are we risking perf
regressions for any LAF users that started using it on day 1?

> 
> But I should have been clear about the parameters to
> hugepage_vma_check(): enforce_sysfs=false.

So hugepage_vma_check(..., smaps=false, in_pf=true, enforce_sysfs=false) would
give us:

                | prctl/fw  | sysfs     | sysfs     | sysfs
                | disable   | never     | madvise   | always
----------------|-----------|-----------|-----------|-----------
no hint         | S         | LAF>S     | LAF>S     | THP>LAF>S
MADV_HUGEPAGE   | S         | LAF>S     | THP>LAF>S | THP>LAF>S
MADV_NOHUGEPAGE | S         | S         | S         | S

Where "prctl/fw disable" trumps the sysfs setting.

I can certainly see the benefit of this approach; it gives us a way to enable
LAF while disabling THP (thp=never). It doesn't give us a way to enable THP
without enabling LAF though (unless you recompile with LAF disabled). Does
anyone see a problem with this?
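
In other words, anon_folio_order() would reduce to something like this
untested sketch (the ANON_FOLIO_MAX_ORDER_UNHINTED cap for vmas without
MADV_HUGEPAGE would need an extra sysfs-aware check on top):

static int anon_folio_order(struct vm_area_struct *vma)
{
        /*
         * Returns false for MADV_NOHUGEPAGE, prctl/fw disable and
         * ineligible vma types, but no longer for the sysfs setting,
         * since enforce_sysfs=false.
         */
        if (!hugepage_vma_check(vma, vma->vm_flags, false, true, false))
                return 0;

        return max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
}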



> 
> [1] https://lore.kernel.org/linux-mm/CAOUHufYQTcOdKU=1mPq-fdLV7a66sHx1=EJpPpMVogciCNKO9A@mail.gmail.com/
> [2] https://lore.kernel.org/linux-mm/CAOUHufafd4GNna2GKdSyQdW6CLVh0gxhNgeOc6t+ZOphwgw7tw@mail.gmail.com/
> [3] https://lore.kernel.org/linux-mm/CAOUHufYQTcOdKU=1mPq-fdLV7a66sHx1=EJpPpMVogciCNKO9A@mail.gmail.com/
Ryan Roberts Aug. 8, 2023, 9:58 a.m. UTC | #44
On 07/08/2023 19:10, Zi Yan wrote:
> On 7 Aug 2023, at 13:45, Ryan Roberts wrote:
> 
>> On 05/08/2023 03:50, Yin, Fengwei wrote:
>>>
>>>
>>> On 8/5/2023 5:58 AM, Zi Yan wrote:
>>>> On 4 Aug 2023, at 17:30, David Hildenbrand wrote:
>>>>
>>>>> On 04.08.23 23:26, Yu Zhao wrote:
>>>>>> On Fri, Aug 4, 2023 at 3:13 PM David Hildenbrand <david@redhat.com> wrote:
>>>>>>>
>>>>>>> On 04.08.23 23:00, Yu Zhao wrote:
>>>>>>>> On Fri, Aug 4, 2023 at 2:23 PM David Hildenbrand <david@redhat.com> wrote:
>>>>>>>>>
>>>>>>>>> On 04.08.23 10:27, Ryan Roberts wrote:
>>>>>>>>>> On 04/08/2023 00:50, Yu Zhao wrote:
>>>>>>>>>>> On Thu, Aug 3, 2023 at 6:43 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> + Kirill
>>>>>>>>>>>>
>>>>>>>>>>>> On 26/07/2023 10:51, Ryan Roberts wrote:
>>>>>>>>>>>>> [...]
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>
>>>>>>>>>>>> I'm writing up the conclusions that we arrived at during discussion in the THP
>>>>>>>>>>>> meeting yesterday, regarding linkage with existing THP ABIs. It would be great if
>>>>>>>>>>>> I can get explicit "agree" or disagree + rationale from at least David, Yu and
>>>>>>>>>>>> Kirill.
>>>>>>>>>>>>
>>>>>>>>>>>> In summary; I think we are converging on the approach that is already coded, but
>>>>>>>>>>>> I'd like confirmation.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> [...]
>>>>>>>>>>>>
>>>>>>>>>>>> Principles for Large Anon Folios (LAF)
>>>>>>>>>>>> --------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> David tells us there are use cases today (e.g. qemu live migration) which use
>>>>>>>>>>>> MADV_NOHUGEPAGE to mean "don't fill any PTEs that are not explicitly faulted"
>>>>>>>>>>>> and these use cases will break (i.e. functionally incorrect) if this request is
>>>>>>>>>>>> not honoured.
>>>>>>>>>>>
>>>>>>>>>>> I don't remember David saying this. I think he was referring to UFFD,
>>>>>>>>>>> not MADV_NOHUGEPAGE, when discussing what we need to absolutely
>>>>>>>>>>> respect.
>>>>>>>>>>
>>>>>>>>>> My understanding was that MADV_NOHUGEPAGE was being applied to regions *before*
>>>>>>>>>> UFFD was being registered, and the app relied on MADV_NOHUGEPAGE to not back any
>>>>>>>>>> unfaulted pages. It's not completely clear to me how not honouring
>>>>>>>>>> MADV_NOHUGEPAGE would break things though. David?
>>>>>>>>>
>>>>>>>>> Sorry, I'm still lagging behind on some threads.
>>>>>>>>>
>>>>>>>>> Imagine the following for VM postcopy live migration:
>>>>>>>>>
>>>>>>>>> (1) Set MADV_NOHUGEPAGE on guest memory and discard all memory (e.g.,
>>>>>>>>>        MADV_DONTNEED), to start with a clean slate.
>>>>>>>>> (2) Migrates some pages during precopy from the source and stores them
>>>>>>>>>        into guest memory on the destination. Some of the memory locations
>>>>>>>>>        will have pages populated.
>>>>>>>>> (3) At some point, decide to enable postcopy: enable userfaultfd on
>>>>>>>>>        guest memory.
>>>>>>>>> (4) Discard *selected* pages again that have been dirtied in the
>>>>>>>>>        meantime on the source. These are pages that have been migrated
>>>>>>>>>        previously.
>>>>>>>>> (5) Start running the VM on the destination.
>>>>>>>>> (6) Anything that's not populated will trigger userfaultfd missing
>>>>>>>>>        faults. Then, you can request them from the source and place them.
>>>>>>>>>
>>>>>>>>> Assume you would populate more than required during 2), you can end up
>>>>>>>>> not getting userfaultfd faults during 4) and corrupt your guest state.
>>>>>>>>> It works if during (2) you migrated all guest memory, or if during 4)
>>>>>>>>>        you zap everything that still needs migration.
>>>>>>>>
>>>>>>>> I see what you mean now. Thanks.
>>>>>>>>
>>>>>>>> Yes, in this case we have to interpret MADV_NOHUGEPAGE as nothing >4KB.
>>
>> I'm glad we have agreement on this.
>>
>> In some threads Yu has been talking about this series in the short term, vs long
>> term roadmap; so to be clear, I interpret this as meaning we must consider that
>> MADV_NOHUGEPAGE means nothing bigger than order-0 both in the context of this
>> series and for the long term - that's behavior that user space depends upon.
>>
>> I think we should also apply the same logic to system/process THP mode =
>> "never", even if the vma does not have MADV_NOHUGEPAGE. If the user has
>> explicitly set "never" on the system or process, that means "nothing bigger than
>> order-0". Shout if you disagree.
>>
>>>>>>>
>>>>>>> Note that it's still even unclear to me why we want to *not* call these
>>>>>>> things THP. It would certainly make everything less confusing if we call
>>>>>>> them THP, but with additional attributes.
>>
>> I think I've stated in the past that I don't have a strong opinion on what we
>> call them. But I do think you make a convincing argument for calling them after
>> THP. Regardless, I'd rather agree on a name up front, before this initial series
>> goes in - it's always better to be consistent across all the commit messages and
>> comments to make things more grepable.
>>
>> The only concrete objection I remember hearing to a name with "THP" in the title
>> was that there are stats (meminfo, vmstats, etc) that count THPs and this
>> becomes confusing if those counters now only mean a subset of THPs. But that
>> feels like a small issue in the scheme of things.
>>
>>>>>>>
>>>>>>> I think that is one of the first things we should figure out because it
>>>>>>> also indirectly tells us what all these toggles mean and how/if we
>>>>>>> should redefine them (and if they even apply).
>>>>>>>
>>>>>>> Currently THP == PMD size
>>>>>>>
>>>>>>> In 2016, Hugh already envisioned PUD/PGD THP (see 49920d28781d ("mm:
>>>>>>> make transparent hugepage size public")) when he explicitly exposed
>>>>>>> "hpage_pmd_size". Not "hpage_size".
>>>>>>>
>>>>>>> For hugetlb on arm64 we already support various sizes that are < PMD
>>>>>>> size and *not* call them differently. It's a huge(tlb) page. Sometimes
>>>>>>> we refer to them as cont-PTE hugetlb pages.
>>>>>>>
>>>>>>>
>>>>>>> So, nowadays we do have "PMD-sized THP", someday we might have
>>>>>>> "PUD-sized THP". Can't we come up with a name to describe sub-PMD THP?
>>
>> I think one subtle difference is that these sub-PMD THPs likely won't always
>> have a single size.
>>
>>>>>>>
>>>>>>> Is it really of value if we invent a new term for them? Yes, I was not
>>>>>>> enjoying "Flexible THP".
>>
>> How about "variable-order THP"? Or "SW THP" vs "HW THP"?
> 
> variable-order THP sounds good to me.
> 
> One question I have is that although Ryan is only working on sub-PMD THPs,
> do we want to plan for sub-PUD THPs now? Like are sub-PUD THPs variable-order
> THPs? And leave TODOs and comments like "variable-order THPs can be bigger than
> PMD and smaller than PUD in the future"? Maybe sub-PUD THPs are still too far
> to consider for now. Just thinking out loud.

I'm not personally planning to do any work here. Such a thing would need similar
but separate implementation IMHO, since you would be working at the PMD level
not the PTE level.

Then there is the question of what the benefit would be. My working assumption
would be that you will not be getting any further HW benefits until it's big
enough for a PUD. And the SW costs of allocating such a large contig block would
very likely outweigh the benefits of a few less page faults; you're surely
better off allocating multiple PMD-sized THPs?

> 
> 
>>
>>>>>>>
>>>>>>>
>>>>>>> Once we figured that out, we should figure out if MADV_HUGEPAGE meant
>>>>>>> "only PMD-sized THP" or anything else?
>>>>>>>
>>>>>>> Also, we can then figure out if MADV_NOHUGEPAGE meant "only PMD-sized
>>>>>>> THP" or anything else?
>>
>> Based on the existing user space expectation that MADV_NOHUGEPAGE means "nothing
>> bigger than order-0" I'm not sure how we could ever decide MADV_NOHUGEPAGE means
>> anything different? This feels set in stone to me.
>>
>>>>>>>
>>>>>>>
>>>>>>> The simplest approach to me would be "they imply any THP, and once we
>>>>>>> need more tunables we might add some", similar to what Kirill also raised.
>>
>> Agreed.
>>
>>>>>>>
>>>>>>>
>>>>>>> Again, it's all unclear to me at this point and I'm happy to hear
>>>>>>> opinions, because I really don't know.
>>>>>>
>>>>>> I agree these points require more discussion. But I don't think we
>>>>>> need to conclude them now, unless they cause correctness issues like
>>>>>> ignoring MADV_NOHUGEPAGE would. My concern is that if we decide to go
>>>>>> with "they imply any THP" and *expose this to userspace now*, we might
>>>>>> regret later.
>>>>>
>>>>> If we don't think they are THP, probably MADV_NOHUGEPAGE should not apply and we should be ready to find other ways to deal with the mess we eventually create. If we want to go down that path, sure.
>>>>>
>>>>> If they are THP, to me there is not really a question if MADV_NOHUGEPAGE applies to them or not. Unless we want to build a confusing piece of software ;)
>>>>
>>>> I think it is good to call them THP, since they are transparent huge (>order-0) pages.
>>>> But the concern is that before we have a reasonable management policy for order>0 &&
>>>> order<9 THPs, mixing them with existing order-9 THP might give users unexpected
>>>> performance outcomes. Unless we are sure they will always be a performance improvement,
>>>> we might repeat the old THP path, namely users begin to disable THP by default
>>>> to avoid unexpected performance hiccup. That is the reason Yu wants to separate
>>>> LAF from THP at the moment.
>>
>> (for the purposes of this; LAF="sub-PMD THP", THP="PMD-size THP", we treat them
>> both as forms of THP)...
>>
>> How about this for a strawman:
>>
>> When introducing LAF we can either use an opt-in or an opt-out model. The opt-in
>> model would require a new ABI from day 1 (something I think there is consensus
>> that we do not want to do) and would prevent apps from automatically getting
>> benefit. So I don't like that model.
>>
>> If going with the opt-out model, we already have an opt-out mechanism
>> (thp="never" and MADV_NOHUGEPAGE) that we can piggyback. But that mechanism
>> doesn't give us all the control we would like for benchmarking/characterizing
>> the interactions between LAF/THP for different workloads. Ideally we need a way
>> to enable THP while keeping LAF disabled and enable LAF while keeping THP disabled.
>>
>> Can we do this with debugfs? I think controls in there can come and go without
>> too much concern about back-compat?
> 
> Is debugfs always available on all distros? For systems without debugfs, users are
> going to lose control of LAF. IMHO, the two knobs below can live in
> /sys/kernel/mm/transparent_hugepage/ and could be in sync with "enabled" once
> we think LAF is well studied and goes along well with existing PMD THPs,
> namely when setting "always", "madvise", or "never" to "enabled", "laf_enabled"
> is set to the same value.

I really don't want to add any sysfs knobs until we properly understand what we
need/want. I couldn't tell you what the availability of debugfs is like across
all distros though; certainly it's available on Ubuntu and userdebug builds of
Android.

Yu suggested another policy [1], which would allow us to disable THP while
keeping LAF enabled (but not the other way around) without having to add any
knobs, so perhaps that's the way to go. It does assume it's safe to use LAF when
thp=never though. (Copying here for completeness):

                | prctl/fw  | sysfs     | sysfs     | sysfs
                | disable   | never     | madvise   | always
----------------|-----------|-----------|-----------|-----------
no hint         | S         | LAF>S     | LAF>S     | THP>LAF>S
MADV_HUGEPAGE   | S         | LAF>S     | THP>LAF>S | THP>LAF>S
MADV_NOHUGEPAGE | S         | S         | S         | S

Where "prctl/fw disable" trumps the sysfs setting.


[1] https://lore.kernel.org/linux-mm/20469f02-d62d-d925-3536-d6a1f1099fda@arm.com/

> 
>>
>> Perhaps 2 controls:
>>
>> laf_enable=0|1
>>   enable/disable LAF independently of THP
>>   default=1
>>
>> laf_max_order=N
>>   applies to both LAF and THP
>>   when max_order < PMD-order, THP acts like thp="never"
>>   puts a ceiling on folio order allocated by LAF
>>   default=PMD-order
> 
> I think it is better to keep it independent of PMD THP. Just make sure
> laf_max_order can never be bigger than PMD-order. Later, when we understand
> the performance impact of mixing LAF with PMD THP, we can lift this limit
> to allow laf_max_order to be any possible page order.
> 
>>
>> This gives:
>>
>>
>> laf_enable=1, laf_max_order=PMD-order (LAF+THP):
>>
>>                 | never     | madvise   | always
>> ----------------|-----------|-----------|-----------
>> no hint         | S         | LAF>S     | THP>LAF>S
>> MADV_HUGEPAGE   | S         | THP>LAF>S | THP>LAF>S
>> MADV_NOHUGEPAGE | S         | S         | S
>>
>>
>> laf_enable=0, laf_max_order=PMD-order (THP only):
>>
>>                 | never     | madvise   | always
>> ----------------|-----------|-----------|-----------
>> no hint         | S         | S         | THP>S
>> MADV_HUGEPAGE   | S         | THP>S     | THP>S
>> MADV_NOHUGEPAGE | S         | S         | S
>>
>>
>> laf_enable=1, laf_max_order=(PMD-order - 1) (LAF only):
>>
>>                 | never     | madvise   | always
>> ----------------|-----------|-----------|-----------
>> no hint         | S         | LAF>S     | LAF>S
>> MADV_HUGEPAGE   | S         | LAF>S     | LAF>S
>> MADV_NOHUGEPAGE | S         | S         | S
>>
>>
>> This would allow us to get something into the kernel that would allow people to
>> more broadly characterize different workloads under THP, LAF, THP+LAF, which
>> would give us a better understanding of if/how we want to design ABIs for the
>> long term.
>>
>>
>>>>
>>>> Maybe call it THP (experimental) for now and merge it to THP when we have a stable
>>>> policy. For knobs, we might add "any-order" to the existing "never", "madvise"
>>>> and another interface to specify max hinted order (enforcing <9) for "any-order".
>>>> Later, we can allow users to specify any max hinted order, including 9. Just an
>>>> idea.
>>> I suspect that all the config knobs (enable/disable mixing mode, define "any-order"
>>> or "specific-order") will be exist long term. Because there are always new workloads
>>> need be tuned against these configs.
>>>
>>>
>>> Regards
>>> Yin, Fengwei
>>>
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Yan, Zi
> 
> 
> --
> Best Regards,
> Yan, Zi
Yu Zhao Aug. 8, 2023, 5:57 p.m. UTC | #45
On Tue, Aug 8, 2023 at 3:37 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 08/08/2023 00:21, Yu Zhao wrote:
> > On Mon, Aug 7, 2023 at 1:07 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 07/08/2023 06:24, Yu Zhao wrote:
> >>> On Wed, Jul 26, 2023 at 3:52 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>
> >>>> [...]
> >>>>
> >>>> +#ifdef CONFIG_LARGE_ANON_FOLIO
> >>>> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
> >>>> +               (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
> >>>> +
> >>>> +static int anon_folio_order(struct vm_area_struct *vma)
> >>>> +{
> >>>> +       int order;
> >>>> +
> >>>> +       /*
> >>>> +        * If THP is explicitly disabled for either the vma, the process or the
> >>>> +        * system, then this is very likely intended to limit internal
> >>>> +        * fragmentation; in this case, don't attempt to allocate a large
> >>>> +        * anonymous folio.
> >>>> +        *
> >>>> +        * Else, if the vma is eligible for thp, allocate a large folio of the
> >>>> +        * size preferred by the arch. Or if the arch requested a very small
> >>>> +        * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
> >>>> +        * which still meets the arch's requirements but means we still take
> >>>> +        * advantage of SW optimizations (e.g. fewer page faults).
> >>>> +        *
> >>>> +        * Finally if thp is enabled but the vma isn't eligible, take the
> >>>> +        * arch-preferred size and limit it to ANON_FOLIO_MAX_ORDER_UNHINTED.
> >>>> +        * This ensures workloads that have not explicitly opted-in take benefit
> >>>> +        * while capping the potential for internal fragmentation.
> >>>> +        */
> >>>> +
> >>>> +       if ((vma->vm_flags & VM_NOHUGEPAGE) ||
> >>>> +           test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
> >>>> +           !hugepage_flags_enabled())
> >>>> +               order = 0;
> >>>> +       else {
> >>>> +               order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
> >>>> +
> >>>> +               if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
> >>>> +                       order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
> >>>> +       }
> >>>> +
> >>>> +       return order;
> >>>> +}
> >>>> +
> >>>> +static int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
> >>>> +{
> >>>> +       int i;
> >>>> +       gfp_t gfp;
> >>>> +       pte_t *pte;
> >>>> +       unsigned long addr;
> >>>> +       struct vm_area_struct *vma = vmf->vma;
> >>>> +       int prefer = anon_folio_order(vma);
> >>>> +       int orders[] = {
> >>>> +               prefer,
> >>>> +               prefer > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
> >>>> +               0,
> >>>> +       };
> >>>> +
> >>>> +       *folio = NULL;
> >>>> +
> >>>> +       if (vmf_orig_pte_uffd_wp(vmf))
> >>>> +               goto fallback;
> >>>
> >>> Per the discussion, we need to check hugepage_vma_check() for
> >>> correctness of VM live migration (LM). I'd just check it here and fall
> >>> back to order 0 if that helper returns false.
> >>
> >> I'm not sure whether you haven't noticed the logic in anon_folio_order()
> >> above, or whether you are making this suggestion because you disagree with
> >> the subtle difference in my logic.
> >
> > The latter, or more generally the policy you described earlier.
> >
> >> My logic is deliberately not calling hugepage_vma_check() because that would
> >> return false for the thp=madvise,mmap=unhinted case, whereas the policy I'm
> >> implementing wants to apply LAF in that case.
> >>
> >>
> >> My intended policy:
> >>
> >>                 | never     | madvise   | always
> >> ----------------|-----------|-----------|-----------
> >> no hint         | S         | LAF>S     | THP>LAF>S
> >> MADV_HUGEPAGE   | S         | THP>LAF>S | THP>LAF>S
> >> MADV_NOHUGEPAGE | S         | S         | S
> >>
> >>
> >> What your suggestion would give:
> >>
> >>                 | never     | madvise   | always
> >> ----------------|-----------|-----------|-----------
> >> no hint         | S         | S         | THP>LAF>S
> >> MADV_HUGEPAGE   | S         | THP>LAF>S | THP>LAF>S
> >> MADV_NOHUGEPAGE | S         | S         | S
> >
> > This is not what I'm suggesting.
> >
> > Let me reiterate [1]:
> >   My impression is we only agreed on one thing: at the current stage, we
> >   should respect things we absolutely have to. We didn't agree on what
> >   "never" means ("never 2MB" or "never >4KB"), and we didn't touch on
> >   how "always" should behave at all.
> >
> > And [2]:
> >   (Thanks to David, now I agree that) we have to interpret MADV_NOHUGEPAGE
> >   as nothing >4KB.
> >
> > My final take [3]:
> >   I agree these points require more discussion. But I don't think we
> >   need to conclude them now, unless they cause correctness issues like
> >   ignoring MADV_NOHUGEPAGE would.
>
> Thanks, I've read all of these comments previously, and appreciate the time you
> have put into the feedback. I'm not sure I fully agree with your point that we
> don't need to conclude on a policy now; I certainly don't think we need the
> whole thing in place on day 1, but I do think that whatever we put in should
> strive to be a strict subset of where we think we are going. For example, if we
> put something in with one policy (i.e. "never" only means "never 2MB") then find
> a problem and have to change that to be more conservative, are we risking perf
> regressions for any LAF users that started using it on day 1?

It's not that I don't want to -- I just don't think we have enough
information before we have a wider deployment [1] and gain a better
understanding of real-world scenarios.

Of course we could force a conclusion, a mostly opinion-based one. But
it would still involve prolonged discussions and delay this series, or
rush into decisions we might regret later.

[1] Our fleets (servers, laptops and phones) support large-scale
experiments and I plan to run them on both client and server devices.

> > But I should have been clear about the parameters to
> > hugepage_vma_check(): enforce_sysfs=false.
>
> So hugepage_vma_check(..., smaps=false, in_pf=true, enforce_sysfs=false) would
> give us:
>
>                 | prctl/fw  | sysfs     | sysfs     | sysfs
>                 | disable   | never     | madvise   | always
> ----------------|-----------|-----------|-----------|-----------
> no hint         | S         | LAF>S     | LAF>S     | THP>LAF>S
> MADV_HUGEPAGE   | S         | LAF>S     | THP>LAF>S | THP>LAF>S
> MADV_NOHUGEPAGE | S         | S         | S         | S
>
> Where "prctl/fw disable" trumps the sysfs setting.
>
> I can certainly see the benefit of this approach; it gives us a way to enable
> LAF while disabling THP (thp=never). It doesn't give us a way to enable THP
> without enabling LAF though (unless you recompile with LAF disabled). Does
> anyone see a problem with this?

I do myself :)

This is just something temporary to get this series landed. We are
hiding behind a Kconfig, not making any ABI changes, and not exposing
this policy to userspace (i.e., not updating Documentation/, man
pages, etc.)

Meanwhile, we can keep discussing all the open questions in parallel.
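
[Editor's note: a minimal sketch of the policy table above, using the
five-argument hugepage_vma_check(vma, vm_flags, smaps, in_pf, enforce_sysfs)
as it is called elsewhere in this thread. This illustrates the idea under
discussion; it is not the code that was eventually posted.]

	static int anon_folio_order(struct vm_area_struct *vma)
	{
		/*
		 * Honour the hard requirements (prctl/fw disable,
		 * MADV_NOHUGEPAGE) but skip the sysfs never/madvise check
		 * (enforce_sysfs=false), so LAF still applies under
		 * thp=never and for unhinted vmas under thp=madvise.
		 */
		if (!hugepage_vma_check(vma, vma->vm_flags, false, true, false))
			return 0;

		/* Otherwise take the arch preference, as in the patch below. */
		return max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
	}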
Yu Zhao Aug. 8, 2023, 6:12 p.m. UTC | #46
On Tue, Aug 8, 2023 at 11:57 AM Yu Zhao <yuzhao@google.com> wrote:
> [...]
>
> Meanwhile, we can keep discussing all the open questions in parallel.

And the stat ABI changes should be discussed before or at the same
time. If we came up with a policy but there was *zero* observability
of how well that policy works...
Ryan Roberts Aug. 9, 2023, 4:08 p.m. UTC | #47
[...]

>>>> [...]
>>>
>>> Thanks, I've read all of these comments previously, and appreciate the time you
>>> have put into the feedback. I'm not sure I fully agree with your point that we
>>> don't need to conclude on a policy now; I certainly don't think we need the
>>> whole thing in place on day 1, but I do think that whatever we put in should
>>> strive to be a strict subset of where we think we are going. For example, if we
>>> put something in with one policy (i.e. "never" only means "never 2MB") then find
>>> a problem and have to change that to be more conservative, are we risking perf
>>> regressions for any LAF users that started using it on day 1?
>>
>> It's not that I don't want to -- I just don't think we have enough
>> information before we have a wider deployment [1] and gain a better
>> understanding of real-world scenarios.
>>
>> Of course we could force a conclusion, a mostly opinion-based one. But
>> it would still involve prolonged discussions and delay this series, or
>> rush into decisions we might regret later.
>>
>> [1] Our fleets (servers, laptops and phones) support large-scale
>> experiments and I plan to run them on both client and server devices.

This all sounds great and I'm looking forward to seeing results! But I guess I
had been assuming that this sort of testing would be preferable to do before we
merge; that allows us to get confidence in the approach and reduces the changes
of having to change it later. I guess you have policies that prevent you from
testing this series at the scale you want until it is merged?

I'm not convinced this testing will help us answer the "what does never mean?"
question; if nothing breaks in your testing, it doesn't mean there aren't
systems out there that would break - it's hard to prove a negative. I think its
mostly embedded systems that use thp=never to reduce memory footprint to the
absolute minimum?


>>
>>>> But I should have been clear about the parameters to
>>>> hugepage_vma_check(): enforce_sysfs=false.
>>>
>>> So hugepage_vma_check(..., smaps=false, in_pf=true, enforce_sysfs=false) would
>>> give us:
>>>
>>>                 | prctl/fw  | sysfs     | sysfs     | sysfs
>>>                 | disable   | never     | madvise   | always
>>> ----------------|-----------|-----------|-----------|-----------
>>> no hint         | S         | LAF>S     | LAF>S     | THP>LAF>S
>>> MADV_HUGEPAGE   | S         | LAF>S     | THP>LAF>S | THP>LAF>S
>>> MADV_NOHUGEPAGE | S         | S         | S         | S
>>>
>>> Where "prctl/fw disable" trumps the sysfs setting.
>>>
>>> I can certainly see the benefit of this approach; it gives us a way to enable
>>> LAF while disabling THP (thp=never). It doesn't give us a way to enable THP
>>> without enabling LAF though (unless you recompile with LAF disabled). Does
>>> anyone see a problem with this?
>>
>> I do myself :)
>>
>> This is just something temporary to get this series landed. We are
>> hiding behind a Kconfig, not making any ABI changes, and not exposing
>> this policy to userspace (i.e., not updating Documentation/, man
>> pages, etc.)
>>
>> Meanwhile, we can keep discussing all the open questions in parallel.

You're right - I don't want to slow down the testing, so I'm going to post a v5
tomorrow with the policy in the table above. We're still waiting for the
prerequisites to land before we can kick off testing in anger though.

> 
> And the stat ABI changes should be discussed before or at the same
> time. If we came up with a policy but there was *zero* observability
> of how well that policy works...

Yep agreed. I have a series at [1] which I hoped would kickstart that discussion.

[1] https://lore.kernel.org/linux-mm/20230613160950.3554675-1-ryan.roberts@arm.com/

Thanks,
Ryan
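
[Editor's note: since the feature is gated purely by Kconfig, experimenting
with this series only requires enabling the new symbol; the existing THP
sysfs knob then selects the runtime policy discussed above. A sketch,
assuming the series is applied:]

	# .config fragment
	CONFIG_TRANSPARENT_HUGEPAGE=y
	CONFIG_LARGE_ANON_FOLIO=y

	# choose the THP policy at runtime, e.g.:
	echo madvise > /sys/kernel/mm/transparent_hugepage/enabled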
diff mbox series

Patch

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5063b482e34f..2a1d83775837 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -313,6 +313,19 @@  static inline bool arch_has_hw_pte_young(void)
 }
 #endif
 
+#ifndef arch_wants_pte_order
+/*
+ * Returns preferred folio order for pte-mapped memory. Must be in range [0,
+ * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
+ * to be at least order-2. Negative value implies that the HW has no preference
+ * and mm will choose its own default order.
+ */
+static inline int arch_wants_pte_order(void)
+{
+	return -1;
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 				       unsigned long address,
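
[Editor's note: for illustration, how an architecture might override this
hook. The real arm64 override is in a separate patch of this series; the
CONT_PTE_SHIFT-based body below is an assumption, not quoted from it.]

	/* e.g. in arch/arm64/include/asm/pgtable.h */
	#define arch_wants_pte_order arch_wants_pte_order
	static inline int arch_wants_pte_order(void)
	{
		/*
		 * Ask for one contiguous-PTE block per folio so the HW can
		 * coalesce the TLB entries (order-4, i.e. 64K, with 4K pages).
		 */
		return CONT_PTE_SHIFT - PAGE_SHIFT;
	}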
diff --git a/mm/Kconfig b/mm/Kconfig
index 09130434e30d..fa61ea160447 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1238,4 +1238,14 @@  config LOCK_MM_AND_FIND_VMA
 
 source "mm/damon/Kconfig"
 
+config LARGE_ANON_FOLIO
+	bool "Allocate large folios for anonymous memory"
+	depends on TRANSPARENT_HUGEPAGE
+	default n
+	help
+	  Use large (bigger than order-0) folios to back anonymous memory where
+	  possible, even for pte-mapped memory. This reduces the number of page
+	  faults, as well as other per-page overheads to improve performance for
+	  many workloads.
+
 endmenu
diff --git a/mm/memory.c b/mm/memory.c
index 01f39e8144ef..64c3f242c49a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4050,6 +4050,127 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 	return ret;
 }
 
+static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
+{
+	int i;
+
+	if (nr_pages == 1)
+		return vmf_pte_changed(vmf);
+
+	for (i = 0; i < nr_pages; i++) {
+		if (!pte_none(ptep_get_lockless(vmf->pte + i)))
+			return true;
+	}
+
+	return false;
+}
+
+#ifdef CONFIG_LARGE_ANON_FOLIO
+#define ANON_FOLIO_MAX_ORDER_UNHINTED \
+		(ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
+
+static int anon_folio_order(struct vm_area_struct *vma)
+{
+	int order;
+
+	/*
+	 * If THP is explicitly disabled for either the vma, the process or the
+	 * system, then this is very likely intended to limit internal
+	 * fragmentation; in this case, don't attempt to allocate a large
+	 * anonymous folio.
+	 *
+	 * Else, if the vma is eligible for thp, allocate a large folio of the
+	 * size preferred by the arch. Or if the arch requested a very small
+	 * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
+	 * which still meets the arch's requirements but means we still take
+	 * advantage of SW optimizations (e.g. fewer page faults).
+	 *
+	 * Finally if thp is enabled but the vma isn't eligible, take the
+	 * arch-preferred size and limit it to ANON_FOLIO_MAX_ORDER_UNHINTED.
+	 * This ensures workloads that have not explicitly opted-in take benefit
+	 * while capping the potential for internal fragmentation.
+	 */
+
+	if ((vma->vm_flags & VM_NOHUGEPAGE) ||
+	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
+	    !hugepage_flags_enabled())
+		order = 0;
+	else {
+		order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
+
+		if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
+			order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
+	}
+
+	return order;
+}
+
+static int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
+{
+	int i;
+	gfp_t gfp;
+	pte_t *pte;
+	unsigned long addr;
+	struct vm_area_struct *vma = vmf->vma;
+	int prefer = anon_folio_order(vma);
+	int orders[] = {
+		prefer,
+		prefer > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
+		0,
+	};
+
+	*folio = NULL;
+
+	if (vmf_orig_pte_uffd_wp(vmf))
+		goto fallback;
+
+	for (i = 0; orders[i]; i++) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
+		if (addr >= vma->vm_start &&
+		    addr + (PAGE_SIZE << orders[i]) <= vma->vm_end)
+			break;
+	}
+
+	if (!orders[i])
+		goto fallback;
+
+	pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
+	if (!pte)
+		return -EAGAIN;
+
+	for (; orders[i]; i++) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
+		vmf->pte = pte + pte_index(addr);
+		if (!vmf_pte_range_changed(vmf, 1 << orders[i]))
+			break;
+	}
+
+	vmf->pte = NULL;
+	pte_unmap(pte);
+
+	gfp = vma_thp_gfp_mask(vma);
+
+	for (; orders[i]; i++) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
+		*folio = vma_alloc_folio(gfp, orders[i], vma, addr, true);
+		if (*folio) {
+			clear_huge_page(&(*folio)->page, addr, 1 << orders[i]);
+			return 0;
+		}
+	}
+
+fallback:
+	*folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
+	return *folio ? 0 : -ENOMEM;
+}
+#else
+static inline int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
+{
+	*folio = vma_alloc_zeroed_movable_folio(vmf->vma, vmf->address);
+	return *folio ? 0 : -ENOMEM;
+}
+#endif
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -4057,6 +4178,9 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
  */
 static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 {
+	int i = 0;
+	int nr_pages = 1;
+	unsigned long addr = vmf->address;
 	bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
 	struct vm_area_struct *vma = vmf->vma;
 	struct folio *folio;
@@ -4101,10 +4225,15 @@  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	/* Allocate our own private page. */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
-	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
+	ret = alloc_anon_folio(vmf, &folio);
+	if (unlikely(ret == -EAGAIN))
+		return 0;
 	if (!folio)
 		goto oom;
 
+	nr_pages = folio_nr_pages(folio);
+	addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
+
 	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
 		goto oom_free_page;
 	folio_throttle_swaprate(folio, GFP_KERNEL);
@@ -4116,17 +4245,12 @@  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	 */
 	__folio_mark_uptodate(folio);
 
-	entry = mk_pte(&folio->page, vma->vm_page_prot);
-	entry = pte_sw_mkyoung(entry);
-	if (vma->vm_flags & VM_WRITE)
-		entry = pte_mkwrite(pte_mkdirty(entry));
-
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
-			&vmf->ptl);
+	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
 	if (!vmf->pte)
 		goto release;
-	if (vmf_pte_changed(vmf)) {
-		update_mmu_tlb(vma, vmf->address, vmf->pte);
+	if (vmf_pte_range_changed(vmf, nr_pages)) {
+		for (i = 0; i < nr_pages; i++)
+			update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
 		goto release;
 	}
 
@@ -4141,16 +4265,24 @@  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 		return handle_userfault(vmf, VM_UFFD_MISSING);
 	}
 
-	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
-	folio_add_new_anon_rmap(folio, vma, vmf->address);
+	folio_ref_add(folio, nr_pages - 1);
+	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
+	folio_add_new_anon_rmap(folio, vma, addr);
 	folio_add_lru_vma(folio, vma);
+
+	for (i = 0; i < nr_pages; i++) {
+		entry = mk_pte(folio_page(folio, i), vma->vm_page_prot);
+		entry = pte_sw_mkyoung(entry);
+		if (vma->vm_flags & VM_WRITE)
+			entry = pte_mkwrite(pte_mkdirty(entry));
 setpte:
-	if (uffd_wp)
-		entry = pte_mkuffd_wp(entry);
-	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+		if (uffd_wp)
+			entry = pte_mkuffd_wp(entry);
+		set_pte_at(vma->vm_mm, addr + PAGE_SIZE * i, vmf->pte + i, entry);
 
-	/* No need to invalidate - it was non-present before */
-	update_mmu_cache(vma, vmf->address, vmf->pte);
+		/* No need to invalidate - it was non-present before */
+		update_mmu_cache(vma, addr + PAGE_SIZE * i, vmf->pte + i);
+	}
 unlock:
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);