[v1,10/10] mm: Allocate large folios for anonymous memory

Message ID	20230626171430.3167004-11-ryan.roberts@arm.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org> From: Ryan Roberts <ryan.roberts@arm.com> To: Andrew Morton <akpm@linux-foundation.org>, "Matthew Wilcox (Oracle)" <willy@infradead.org>, "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>, Yin Fengwei <fengwei.yin@intel.com>, David Hildenbrand <david@redhat.com>, Yu Zhao <yuzhao@google.com>, Catalin Marinas <catalin.marinas@arm.com>, Will Deacon <will@kernel.org>, Geert Uytterhoeven <geert@linux-m68k.org>, Christian Borntraeger <borntraeger@linux.ibm.com>, Sven Schnelle <svens@linux.ibm.com>, Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>, "H. Peter Anvin" <hpa@zytor.com> Cc: Ryan Roberts <ryan.roberts@arm.com>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-alpha@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-ia64@vger.kernel.org, linux-m68k@lists.linux-m68k.org, linux-s390@vger.kernel.org Subject: [PATCH v1 10/10] mm: Allocate large folios for anonymous memory Date: Mon, 26 Jun 2023 18:14:30 +0100 Message-Id: <20230626171430.3167004-11-ryan.roberts@arm.com> In-Reply-To: <20230626171430.3167004-1-ryan.roberts@arm.com> References: <20230626171430.3167004-1-ryan.roberts@arm.com> MIME-Version: 1.0 Precedence: list Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org
Series	variable-order, large folios for anonymous memory \| expand [v1,00/10] variable-order, large folios for anonymous memory [v1,01/10] mm: Expose clear_huge_page() unconditionally [v1,02/10] mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio() [v1,03/10] mm: Introduce try_vma_alloc_movable_folio() [v1,04/10] mm: Implement folio_add_new_anon_rmap_range() [v1,05/10] mm: Implement folio_remove_rmap_range() [v1,06/10] mm: Allow deferred splitting of arbitrary large anon folios [v1,07/10] mm: Batch-zap large anonymous folio PTE mappings [v1,08/10] mm: Kconfig hooks to determine max anon folio allocation order [v1,09/10] arm64: mm: Declare support for large anonymous folios [v1,10/10] mm: Allocate large folios for anonymous memory

Ryan Roberts June 26, 2023, 5:14 p.m. UTC

With all of the enabler patches in place, modify the anonymous memory
write allocation path so that it opportunistically attempts to allocate
a large folio up to `max_anon_folio_order()` size (This value is
ultimately configured by the architecture). This reduces the number of
page faults, reduces the size of (e.g. LRU) lists, and generally
improves performance by batching what were per-page operations into
per-(large)-folio operations.

If CONFIG_LARGE_ANON_FOLIO is not enabled (the default) then
`max_anon_folio_order()` always returns 0, meaning we get the existing
allocation behaviour.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/memory.c | 159 +++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 144 insertions(+), 15 deletions(-)

Yu Zhao June 27, 2023, 3:01 a.m. UTC | #1

On Mon, Jun 26, 2023 at 11:15 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> With all of the enabler patches in place, modify the anonymous memory
> write allocation path so that it opportunistically attempts to allocate
> a large folio up to `max_anon_folio_order()` size (This value is
> ultimately configured by the architecture). This reduces the number of
> page faults, reduces the size of (e.g. LRU) lists, and generally
> improves performance by batching what were per-page operations into
> per-(large)-folio operations.
>
> If CONFIG_LARGE_ANON_FOLIO is not enabled (the default) then
> `max_anon_folio_order()` always returns 0, meaning we get the existing
> allocation behaviour.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  mm/memory.c | 159 +++++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 144 insertions(+), 15 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index a8f7e2b28d7a..d23c44cc5092 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3161,6 +3161,90 @@ static inline int max_anon_folio_order(struct vm_area_struct *vma)
>                 return CONFIG_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX;
>  }
>
> +/*
> + * Returns index of first pte that is not none, or nr if all are none.
> + */
> +static inline int check_ptes_none(pte_t *pte, int nr)
> +{
> +       int i;
> +
> +       for (i = 0; i < nr; i++) {
> +               if (!pte_none(ptep_get(pte++)))
> +                       return i;
> +       }
> +
> +       return nr;
> +}
> +
> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)

As suggested previously in 03/10, we can leave this for later.

Ryan Roberts June 27, 2023, 9:57 a.m. UTC | #2

On 27/06/2023 04:01, Yu Zhao wrote:
> On Mon, Jun 26, 2023 at 11:15 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> With all of the enabler patches in place, modify the anonymous memory
>> write allocation path so that it opportunistically attempts to allocate
>> a large folio up to `max_anon_folio_order()` size (This value is
>> ultimately configured by the architecture). This reduces the number of
>> page faults, reduces the size of (e.g. LRU) lists, and generally
>> improves performance by batching what were per-page operations into
>> per-(large)-folio operations.
>>
>> If CONFIG_LARGE_ANON_FOLIO is not enabled (the default) then
>> `max_anon_folio_order()` always returns 0, meaning we get the existing
>> allocation behaviour.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  mm/memory.c | 159 +++++++++++++++++++++++++++++++++++++++++++++++-----
>>  1 file changed, 144 insertions(+), 15 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index a8f7e2b28d7a..d23c44cc5092 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3161,6 +3161,90 @@ static inline int max_anon_folio_order(struct vm_area_struct *vma)
>>                 return CONFIG_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX;
>>  }
>>
>> +/*
>> + * Returns index of first pte that is not none, or nr if all are none.
>> + */
>> +static inline int check_ptes_none(pte_t *pte, int nr)
>> +{
>> +       int i;
>> +
>> +       for (i = 0; i < nr; i++) {
>> +               if (!pte_none(ptep_get(pte++)))
>> +                       return i;
>> +       }
>> +
>> +       return nr;
>> +}
>> +
>> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
> 
> As suggested previously in 03/10, we can leave this for later.

I disagree. This is the logic that prevents us from accidentally replacing
already set PTEs, or wandering out of the VMA bounds etc. How would you catch
all those corener cases without this?

Yu Zhao June 27, 2023, 6:33 p.m. UTC | #3

On Tue, Jun 27, 2023 at 3:57 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 27/06/2023 04:01, Yu Zhao wrote:
> > On Mon, Jun 26, 2023 at 11:15 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> With all of the enabler patches in place, modify the anonymous memory
> >> write allocation path so that it opportunistically attempts to allocate
> >> a large folio up to `max_anon_folio_order()` size (This value is
> >> ultimately configured by the architecture). This reduces the number of
> >> page faults, reduces the size of (e.g. LRU) lists, and generally
> >> improves performance by batching what were per-page operations into
> >> per-(large)-folio operations.
> >>
> >> If CONFIG_LARGE_ANON_FOLIO is not enabled (the default) then
> >> `max_anon_folio_order()` always returns 0, meaning we get the existing
> >> allocation behaviour.
> >>
> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >> ---
> >>  mm/memory.c | 159 +++++++++++++++++++++++++++++++++++++++++++++++-----
> >>  1 file changed, 144 insertions(+), 15 deletions(-)
> >>
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index a8f7e2b28d7a..d23c44cc5092 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -3161,6 +3161,90 @@ static inline int max_anon_folio_order(struct vm_area_struct *vma)
> >>                 return CONFIG_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX;
> >>  }
> >>
> >> +/*
> >> + * Returns index of first pte that is not none, or nr if all are none.
> >> + */
> >> +static inline int check_ptes_none(pte_t *pte, int nr)
> >> +{
> >> +       int i;
> >> +
> >> +       for (i = 0; i < nr; i++) {
> >> +               if (!pte_none(ptep_get(pte++)))
> >> +                       return i;
> >> +       }
> >> +
> >> +       return nr;
> >> +}
> >> +
> >> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
> >
> > As suggested previously in 03/10, we can leave this for later.
>
> I disagree. This is the logic that prevents us from accidentally replacing
> already set PTEs, or wandering out of the VMA bounds etc. How would you catch
> all those corener cases without this?

Again, sorry for not being clear previously: we definitely need to
handle alignments & overlapps. But the fallback, i.e., "for (; order >
1; order--) {" in calc_anon_folio_order_alloc() is not necessary.

For now, we just need something like

  bool is_order_suitable() {
    // check whether it fits properly
  }

Later on, we could add

  alloc_anon_folio_best_effort()
  {
    for a list of fallback orders
      is_order_suitable()
  }

Yang Shi June 29, 2023, 2:13 a.m. UTC | #4

On Mon, Jun 26, 2023 at 10:15 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> With all of the enabler patches in place, modify the anonymous memory
> write allocation path so that it opportunistically attempts to allocate
> a large folio up to `max_anon_folio_order()` size (This value is
> ultimately configured by the architecture). This reduces the number of
> page faults, reduces the size of (e.g. LRU) lists, and generally
> improves performance by batching what were per-page operations into
> per-(large)-folio operations.
>
> If CONFIG_LARGE_ANON_FOLIO is not enabled (the default) then
> `max_anon_folio_order()` always returns 0, meaning we get the existing
> allocation behaviour.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  mm/memory.c | 159 +++++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 144 insertions(+), 15 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index a8f7e2b28d7a..d23c44cc5092 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3161,6 +3161,90 @@ static inline int max_anon_folio_order(struct vm_area_struct *vma)
>                 return CONFIG_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX;
>  }
>
> +/*
> + * Returns index of first pte that is not none, or nr if all are none.
> + */
> +static inline int check_ptes_none(pte_t *pte, int nr)
> +{
> +       int i;
> +
> +       for (i = 0; i < nr; i++) {
> +               if (!pte_none(ptep_get(pte++)))
> +                       return i;
> +       }
> +
> +       return nr;
> +}
> +
> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
> +{
> +       /*
> +        * The aim here is to determine what size of folio we should allocate
> +        * for this fault. Factors include:
> +        * - Order must not be higher than `order` upon entry
> +        * - Folio must be naturally aligned within VA space
> +        * - Folio must not breach boundaries of vma
> +        * - Folio must be fully contained inside one pmd entry
> +        * - Folio must not overlap any non-none ptes
> +        *
> +        * Additionally, we do not allow order-1 since this breaks assumptions
> +        * elsewhere in the mm; THP pages must be at least order-2 (since they
> +        * store state up to the 3rd struct page subpage), and these pages must
> +        * be THP in order to correctly use pre-existing THP infrastructure such
> +        * as folio_split().
> +        *
> +        * As a consequence of relying on the THP infrastructure, if the system
> +        * does not support THP, we always fallback to order-0.
> +        *
> +        * Note that the caller may or may not choose to lock the pte. If
> +        * unlocked, the calculation should be considered an estimate that will
> +        * need to be validated under the lock.
> +        */
> +
> +       struct vm_area_struct *vma = vmf->vma;
> +       int nr;
> +       unsigned long addr;
> +       pte_t *pte;
> +       pte_t *first_set = NULL;
> +       int ret;
> +
> +       if (has_transparent_hugepage()) {
> +               order = min(order, PMD_SHIFT - PAGE_SHIFT);
> +
> +               for (; order > 1; order--) {
> +                       nr = 1 << order;
> +                       addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT);
> +                       pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);
> +
> +                       /* Check vma bounds. */
> +                       if (addr < vma->vm_start ||
> +                           addr + (nr << PAGE_SHIFT) > vma->vm_end)
> +                               continue;
> +
> +                       /* Ptes covered by order already known to be none. */
> +                       if (pte + nr <= first_set)
> +                               break;
> +
> +                       /* Already found set pte in range covered by order. */
> +                       if (pte <= first_set)
> +                               continue;
> +
> +                       /* Need to check if all the ptes are none. */
> +                       ret = check_ptes_none(pte, nr);
> +                       if (ret == nr)
> +                               break;
> +
> +                       first_set = pte + ret;
> +               }
> +
> +               if (order == 1)
> +                       order = 0;
> +       } else
> +               order = 0;
> +
> +       return order;
> +}
> +
>  /*
>   * Handle write page faults for pages that can be reused in the current vma
>   *
> @@ -4201,6 +4285,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>         struct folio *folio;
>         vm_fault_t ret = 0;
>         pte_t entry;
> +       unsigned long addr;
> +       int order = uffd_wp ? 0 : max_anon_folio_order(vma);
> +       int pgcount = BIT(order);
>
>         /* File mapping without ->vm_ops ? */
>         if (vma->vm_flags & VM_SHARED)
> @@ -4242,24 +4329,44 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>                         pte_unmap_unlock(vmf->pte, vmf->ptl);
>                         return handle_userfault(vmf, VM_UFFD_MISSING);
>                 }
> -               goto setpte;
> +               if (uffd_wp)
> +                       entry = pte_mkuffd_wp(entry);
> +               set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> +
> +               /* No need to invalidate - it was non-present before */
> +               update_mmu_cache(vma, vmf->address, vmf->pte);
> +               goto unlock;
>         }
>
> -       /* Allocate our own private page. */
> +retry:
> +       /*
> +        * Estimate the folio order to allocate. We are not under the ptl here
> +        * so this estiamte needs to be re-checked later once we have the lock.
> +        */
> +       vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
> +       order = calc_anon_folio_order_alloc(vmf, order);
> +       pte_unmap(vmf->pte);
> +
> +       /* Allocate our own private folio. */
>         if (unlikely(anon_vma_prepare(vma)))
>                 goto oom;
> -       folio = vma_alloc_zeroed_movable_folio(vma, vmf->address, 0, 0);
> +       folio = try_vma_alloc_movable_folio(vma, vmf->address, order, true);
>         if (!folio)
>                 goto oom;
>
> +       /* We may have been granted less than we asked for. */
> +       order = folio_order(folio);
> +       pgcount = BIT(order);
> +       addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT);
> +
>         if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
>                 goto oom_free_page;
>         folio_throttle_swaprate(folio, GFP_KERNEL);
>
>         /*
>          * The memory barrier inside __folio_mark_uptodate makes sure that
> -        * preceding stores to the page contents become visible before
> -        * the set_pte_at() write.
> +        * preceding stores to the folio contents become visible before
> +        * the set_ptes() write.
>          */
>         __folio_mark_uptodate(folio);
>
> @@ -4268,11 +4375,31 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>         if (vma->vm_flags & VM_WRITE)
>                 entry = pte_mkwrite(pte_mkdirty(entry));
>
> -       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> -                       &vmf->ptl);
> -       if (vmf_pte_changed(vmf)) {
> -               update_mmu_tlb(vma, vmf->address, vmf->pte);
> -               goto release;
> +       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
> +
> +       /*
> +        * Ensure our estimate above is still correct; we could have raced with
> +        * another thread to service a fault in the region.
> +        */
> +       if (order == 0) {
> +               if (vmf_pte_changed(vmf)) {
> +                       update_mmu_tlb(vma, vmf->address, vmf->pte);
> +                       goto release;
> +               }
> +       } else if (check_ptes_none(vmf->pte, pgcount) != pgcount) {
> +               pte_t *pte = vmf->pte + ((vmf->address - addr) >> PAGE_SHIFT);
> +
> +               /* If faulting pte was allocated by another, exit early. */
> +               if (!pte_none(ptep_get(pte))) {
> +                       update_mmu_tlb(vma, vmf->address, pte);
> +                       goto release;
> +               }
> +
> +               /* Else try again, with a lower order. */
> +               pte_unmap_unlock(vmf->pte, vmf->ptl);
> +               folio_put(folio);
> +               order--;
> +               goto retry;

I'm not sure whether this extra fallback logic is worth it or not. Do
you have any benchmark data or is it just an arbitrary design choice?
If it is just an arbitrary design choice, I'd like to go with the
simplest way by just exiting page fault handler, just like the
order-0, IMHO.

>         }
>
>         ret = check_stable_address_space(vma->vm_mm);
> @@ -4286,16 +4413,18 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>                 return handle_userfault(vmf, VM_UFFD_MISSING);
>         }
>
> -       inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> -       folio_add_new_anon_rmap(folio, vma, vmf->address);
> +       folio_ref_add(folio, pgcount - 1);
> +
> +       add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount);
> +       folio_add_new_anon_rmap_range(folio, &folio->page, pgcount, vma, addr);
>         folio_add_lru_vma(folio, vma);
> -setpte:
> +
>         if (uffd_wp)
>                 entry = pte_mkuffd_wp(entry);
> -       set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> +       set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);
>
>         /* No need to invalidate - it was non-present before */
> -       update_mmu_cache(vma, vmf->address, vmf->pte);
> +       update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
>  unlock:
>         pte_unmap_unlock(vmf->pte, vmf->ptl);
>         return ret;
> --
> 2.25.1
>
>

Ryan Roberts June 29, 2023, 11:30 a.m. UTC | #5

On 29/06/2023 03:13, Yang Shi wrote:
> On Mon, Jun 26, 2023 at 10:15 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> With all of the enabler patches in place, modify the anonymous memory
>> write allocation path so that it opportunistically attempts to allocate
>> a large folio up to `max_anon_folio_order()` size (This value is
>> ultimately configured by the architecture). This reduces the number of
>> page faults, reduces the size of (e.g. LRU) lists, and generally
>> improves performance by batching what were per-page operations into
>> per-(large)-folio operations.
>>
>> If CONFIG_LARGE_ANON_FOLIO is not enabled (the default) then
>> `max_anon_folio_order()` always returns 0, meaning we get the existing
>> allocation behaviour.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  mm/memory.c | 159 +++++++++++++++++++++++++++++++++++++++++++++++-----
>>  1 file changed, 144 insertions(+), 15 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index a8f7e2b28d7a..d23c44cc5092 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3161,6 +3161,90 @@ static inline int max_anon_folio_order(struct vm_area_struct *vma)
>>                 return CONFIG_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX;
>>  }
>>
>> +/*
>> + * Returns index of first pte that is not none, or nr if all are none.
>> + */
>> +static inline int check_ptes_none(pte_t *pte, int nr)
>> +{
>> +       int i;
>> +
>> +       for (i = 0; i < nr; i++) {
>> +               if (!pte_none(ptep_get(pte++)))
>> +                       return i;
>> +       }
>> +
>> +       return nr;
>> +}
>> +
>> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
>> +{
>> +       /*
>> +        * The aim here is to determine what size of folio we should allocate
>> +        * for this fault. Factors include:
>> +        * - Order must not be higher than `order` upon entry
>> +        * - Folio must be naturally aligned within VA space
>> +        * - Folio must not breach boundaries of vma
>> +        * - Folio must be fully contained inside one pmd entry
>> +        * - Folio must not overlap any non-none ptes
>> +        *
>> +        * Additionally, we do not allow order-1 since this breaks assumptions
>> +        * elsewhere in the mm; THP pages must be at least order-2 (since they
>> +        * store state up to the 3rd struct page subpage), and these pages must
>> +        * be THP in order to correctly use pre-existing THP infrastructure such
>> +        * as folio_split().
>> +        *
>> +        * As a consequence of relying on the THP infrastructure, if the system
>> +        * does not support THP, we always fallback to order-0.
>> +        *
>> +        * Note that the caller may or may not choose to lock the pte. If
>> +        * unlocked, the calculation should be considered an estimate that will
>> +        * need to be validated under the lock.
>> +        */
>> +
>> +       struct vm_area_struct *vma = vmf->vma;
>> +       int nr;
>> +       unsigned long addr;
>> +       pte_t *pte;
>> +       pte_t *first_set = NULL;
>> +       int ret;
>> +
>> +       if (has_transparent_hugepage()) {
>> +               order = min(order, PMD_SHIFT - PAGE_SHIFT);
>> +
>> +               for (; order > 1; order--) {
>> +                       nr = 1 << order;
>> +                       addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT);
>> +                       pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);
>> +
>> +                       /* Check vma bounds. */
>> +                       if (addr < vma->vm_start ||
>> +                           addr + (nr << PAGE_SHIFT) > vma->vm_end)
>> +                               continue;
>> +
>> +                       /* Ptes covered by order already known to be none. */
>> +                       if (pte + nr <= first_set)
>> +                               break;
>> +
>> +                       /* Already found set pte in range covered by order. */
>> +                       if (pte <= first_set)
>> +                               continue;
>> +
>> +                       /* Need to check if all the ptes are none. */
>> +                       ret = check_ptes_none(pte, nr);
>> +                       if (ret == nr)
>> +                               break;
>> +
>> +                       first_set = pte + ret;
>> +               }
>> +
>> +               if (order == 1)
>> +                       order = 0;
>> +       } else
>> +               order = 0;
>> +
>> +       return order;
>> +}
>> +
>>  /*
>>   * Handle write page faults for pages that can be reused in the current vma
>>   *
>> @@ -4201,6 +4285,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>         struct folio *folio;
>>         vm_fault_t ret = 0;
>>         pte_t entry;
>> +       unsigned long addr;
>> +       int order = uffd_wp ? 0 : max_anon_folio_order(vma);
>> +       int pgcount = BIT(order);
>>
>>         /* File mapping without ->vm_ops ? */
>>         if (vma->vm_flags & VM_SHARED)
>> @@ -4242,24 +4329,44 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>                         pte_unmap_unlock(vmf->pte, vmf->ptl);
>>                         return handle_userfault(vmf, VM_UFFD_MISSING);
>>                 }
>> -               goto setpte;
>> +               if (uffd_wp)
>> +                       entry = pte_mkuffd_wp(entry);
>> +               set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>> +
>> +               /* No need to invalidate - it was non-present before */
>> +               update_mmu_cache(vma, vmf->address, vmf->pte);
>> +               goto unlock;
>>         }
>>
>> -       /* Allocate our own private page. */
>> +retry:
>> +       /*
>> +        * Estimate the folio order to allocate. We are not under the ptl here
>> +        * so this estiamte needs to be re-checked later once we have the lock.
>> +        */
>> +       vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
>> +       order = calc_anon_folio_order_alloc(vmf, order);
>> +       pte_unmap(vmf->pte);
>> +
>> +       /* Allocate our own private folio. */
>>         if (unlikely(anon_vma_prepare(vma)))
>>                 goto oom;
>> -       folio = vma_alloc_zeroed_movable_folio(vma, vmf->address, 0, 0);
>> +       folio = try_vma_alloc_movable_folio(vma, vmf->address, order, true);
>>         if (!folio)
>>                 goto oom;
>>
>> +       /* We may have been granted less than we asked for. */
>> +       order = folio_order(folio);
>> +       pgcount = BIT(order);
>> +       addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT);
>> +
>>         if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
>>                 goto oom_free_page;
>>         folio_throttle_swaprate(folio, GFP_KERNEL);
>>
>>         /*
>>          * The memory barrier inside __folio_mark_uptodate makes sure that
>> -        * preceding stores to the page contents become visible before
>> -        * the set_pte_at() write.
>> +        * preceding stores to the folio contents become visible before
>> +        * the set_ptes() write.
>>          */
>>         __folio_mark_uptodate(folio);
>>
>> @@ -4268,11 +4375,31 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>         if (vma->vm_flags & VM_WRITE)
>>                 entry = pte_mkwrite(pte_mkdirty(entry));
>>
>> -       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>> -                       &vmf->ptl);
>> -       if (vmf_pte_changed(vmf)) {
>> -               update_mmu_tlb(vma, vmf->address, vmf->pte);
>> -               goto release;
>> +       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
>> +
>> +       /*
>> +        * Ensure our estimate above is still correct; we could have raced with
>> +        * another thread to service a fault in the region.
>> +        */
>> +       if (order == 0) {
>> +               if (vmf_pte_changed(vmf)) {
>> +                       update_mmu_tlb(vma, vmf->address, vmf->pte);
>> +                       goto release;
>> +               }
>> +       } else if (check_ptes_none(vmf->pte, pgcount) != pgcount) {
>> +               pte_t *pte = vmf->pte + ((vmf->address - addr) >> PAGE_SHIFT);
>> +
>> +               /* If faulting pte was allocated by another, exit early. */
>> +               if (!pte_none(ptep_get(pte))) {
>> +                       update_mmu_tlb(vma, vmf->address, pte);
>> +                       goto release;
>> +               }
>> +
>> +               /* Else try again, with a lower order. */
>> +               pte_unmap_unlock(vmf->pte, vmf->ptl);
>> +               folio_put(folio);
>> +               order--;
>> +               goto retry;
> 
> I'm not sure whether this extra fallback logic is worth it or not. Do
> you have any benchmark data or is it just an arbitrary design choice?
> If it is just an arbitrary design choice, I'd like to go with the
> simplest way by just exiting page fault handler, just like the
> order-0, IMHO.

Yes, its an arbitrary design choice. Based on Yu Zhao's feedback, I'm already
reworking this so that we only try the preferred order and order-0, so no longer
iterating through intermediate orders.

I think what you are suggesting is that if attempting to allocate the preferred
order and we find there was a race meaning that the folio now is overlapping
populated ptes (but the faulting pte is still empty), just exit and rely on the
page fault being re-triggered, rather than immediately falling back to order-0?

The reason I didn't do that was I wasn't sure if the return path might have
assumptions that the faulting pte is now valid if no error was returned? I guess
another option is to return VM_FAULT_RETRY but then it seemed cleaner to do the
retry directly here. What do you suggest?

Thanks,
Ryan



> 
>>         }
>>
>>         ret = check_stable_address_space(vma->vm_mm);
>> @@ -4286,16 +4413,18 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>                 return handle_userfault(vmf, VM_UFFD_MISSING);
>>         }
>>
>> -       inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>> -       folio_add_new_anon_rmap(folio, vma, vmf->address);
>> +       folio_ref_add(folio, pgcount - 1);
>> +
>> +       add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount);
>> +       folio_add_new_anon_rmap_range(folio, &folio->page, pgcount, vma, addr);
>>         folio_add_lru_vma(folio, vma);
>> -setpte:
>> +
>>         if (uffd_wp)
>>                 entry = pte_mkuffd_wp(entry);
>> -       set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>> +       set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);
>>
>>         /* No need to invalidate - it was non-present before */
>> -       update_mmu_cache(vma, vmf->address, vmf->pte);
>> +       update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
>>  unlock:
>>         pte_unmap_unlock(vmf->pte, vmf->ptl);
>>         return ret;
>> --
>> 2.25.1
>>
>>

Yang Shi June 29, 2023, 5:05 p.m. UTC | #6

On Thu, Jun 29, 2023 at 4:30 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 29/06/2023 03:13, Yang Shi wrote:
> > On Mon, Jun 26, 2023 at 10:15 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> With all of the enabler patches in place, modify the anonymous memory
> >> write allocation path so that it opportunistically attempts to allocate
> >> a large folio up to `max_anon_folio_order()` size (This value is
> >> ultimately configured by the architecture). This reduces the number of
> >> page faults, reduces the size of (e.g. LRU) lists, and generally
> >> improves performance by batching what were per-page operations into
> >> per-(large)-folio operations.
> >>
> >> If CONFIG_LARGE_ANON_FOLIO is not enabled (the default) then
> >> `max_anon_folio_order()` always returns 0, meaning we get the existing
> >> allocation behaviour.
> >>
> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >> ---
> >>  mm/memory.c | 159 +++++++++++++++++++++++++++++++++++++++++++++++-----
> >>  1 file changed, 144 insertions(+), 15 deletions(-)
> >>
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index a8f7e2b28d7a..d23c44cc5092 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -3161,6 +3161,90 @@ static inline int max_anon_folio_order(struct vm_area_struct *vma)
> >>                 return CONFIG_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX;
> >>  }
> >>
> >> +/*
> >> + * Returns index of first pte that is not none, or nr if all are none.
> >> + */
> >> +static inline int check_ptes_none(pte_t *pte, int nr)
> >> +{
> >> +       int i;
> >> +
> >> +       for (i = 0; i < nr; i++) {
> >> +               if (!pte_none(ptep_get(pte++)))
> >> +                       return i;
> >> +       }
> >> +
> >> +       return nr;
> >> +}
> >> +
> >> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
> >> +{
> >> +       /*
> >> +        * The aim here is to determine what size of folio we should allocate
> >> +        * for this fault. Factors include:
> >> +        * - Order must not be higher than `order` upon entry
> >> +        * - Folio must be naturally aligned within VA space
> >> +        * - Folio must not breach boundaries of vma
> >> +        * - Folio must be fully contained inside one pmd entry
> >> +        * - Folio must not overlap any non-none ptes
> >> +        *
> >> +        * Additionally, we do not allow order-1 since this breaks assumptions
> >> +        * elsewhere in the mm; THP pages must be at least order-2 (since they
> >> +        * store state up to the 3rd struct page subpage), and these pages must
> >> +        * be THP in order to correctly use pre-existing THP infrastructure such
> >> +        * as folio_split().
> >> +        *
> >> +        * As a consequence of relying on the THP infrastructure, if the system
> >> +        * does not support THP, we always fallback to order-0.
> >> +        *
> >> +        * Note that the caller may or may not choose to lock the pte. If
> >> +        * unlocked, the calculation should be considered an estimate that will
> >> +        * need to be validated under the lock.
> >> +        */
> >> +
> >> +       struct vm_area_struct *vma = vmf->vma;
> >> +       int nr;
> >> +       unsigned long addr;
> >> +       pte_t *pte;
> >> +       pte_t *first_set = NULL;
> >> +       int ret;
> >> +
> >> +       if (has_transparent_hugepage()) {
> >> +               order = min(order, PMD_SHIFT - PAGE_SHIFT);
> >> +
> >> +               for (; order > 1; order--) {
> >> +                       nr = 1 << order;
> >> +                       addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT);
> >> +                       pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);
> >> +
> >> +                       /* Check vma bounds. */
> >> +                       if (addr < vma->vm_start ||
> >> +                           addr + (nr << PAGE_SHIFT) > vma->vm_end)
> >> +                               continue;
> >> +
> >> +                       /* Ptes covered by order already known to be none. */
> >> +                       if (pte + nr <= first_set)
> >> +                               break;
> >> +
> >> +                       /* Already found set pte in range covered by order. */
> >> +                       if (pte <= first_set)
> >> +                               continue;
> >> +
> >> +                       /* Need to check if all the ptes are none. */
> >> +                       ret = check_ptes_none(pte, nr);
> >> +                       if (ret == nr)
> >> +                               break;
> >> +
> >> +                       first_set = pte + ret;
> >> +               }
> >> +
> >> +               if (order == 1)
> >> +                       order = 0;
> >> +       } else
> >> +               order = 0;
> >> +
> >> +       return order;
> >> +}
> >> +
> >>  /*
> >>   * Handle write page faults for pages that can be reused in the current vma
> >>   *
> >> @@ -4201,6 +4285,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> >>         struct folio *folio;
> >>         vm_fault_t ret = 0;
> >>         pte_t entry;
> >> +       unsigned long addr;
> >> +       int order = uffd_wp ? 0 : max_anon_folio_order(vma);
> >> +       int pgcount = BIT(order);
> >>
> >>         /* File mapping without ->vm_ops ? */
> >>         if (vma->vm_flags & VM_SHARED)
> >> @@ -4242,24 +4329,44 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> >>                         pte_unmap_unlock(vmf->pte, vmf->ptl);
> >>                         return handle_userfault(vmf, VM_UFFD_MISSING);
> >>                 }
> >> -               goto setpte;
> >> +               if (uffd_wp)
> >> +                       entry = pte_mkuffd_wp(entry);
> >> +               set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> >> +
> >> +               /* No need to invalidate - it was non-present before */
> >> +               update_mmu_cache(vma, vmf->address, vmf->pte);
> >> +               goto unlock;
> >>         }
> >>
> >> -       /* Allocate our own private page. */
> >> +retry:
> >> +       /*
> >> +        * Estimate the folio order to allocate. We are not under the ptl here
> >> +        * so this estiamte needs to be re-checked later once we have the lock.
> >> +        */
> >> +       vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
> >> +       order = calc_anon_folio_order_alloc(vmf, order);
> >> +       pte_unmap(vmf->pte);
> >> +
> >> +       /* Allocate our own private folio. */
> >>         if (unlikely(anon_vma_prepare(vma)))
> >>                 goto oom;
> >> -       folio = vma_alloc_zeroed_movable_folio(vma, vmf->address, 0, 0);
> >> +       folio = try_vma_alloc_movable_folio(vma, vmf->address, order, true);
> >>         if (!folio)
> >>                 goto oom;
> >>
> >> +       /* We may have been granted less than we asked for. */
> >> +       order = folio_order(folio);
> >> +       pgcount = BIT(order);
> >> +       addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT);
> >> +
> >>         if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
> >>                 goto oom_free_page;
> >>         folio_throttle_swaprate(folio, GFP_KERNEL);
> >>
> >>         /*
> >>          * The memory barrier inside __folio_mark_uptodate makes sure that
> >> -        * preceding stores to the page contents become visible before
> >> -        * the set_pte_at() write.
> >> +        * preceding stores to the folio contents become visible before
> >> +        * the set_ptes() write.
> >>          */
> >>         __folio_mark_uptodate(folio);
> >>
> >> @@ -4268,11 +4375,31 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> >>         if (vma->vm_flags & VM_WRITE)
> >>                 entry = pte_mkwrite(pte_mkdirty(entry));
> >>
> >> -       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> >> -                       &vmf->ptl);
> >> -       if (vmf_pte_changed(vmf)) {
> >> -               update_mmu_tlb(vma, vmf->address, vmf->pte);
> >> -               goto release;
> >> +       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
> >> +
> >> +       /*
> >> +        * Ensure our estimate above is still correct; we could have raced with
> >> +        * another thread to service a fault in the region.
> >> +        */
> >> +       if (order == 0) {
> >> +               if (vmf_pte_changed(vmf)) {
> >> +                       update_mmu_tlb(vma, vmf->address, vmf->pte);
> >> +                       goto release;
> >> +               }
> >> +       } else if (check_ptes_none(vmf->pte, pgcount) != pgcount) {
> >> +               pte_t *pte = vmf->pte + ((vmf->address - addr) >> PAGE_SHIFT);
> >> +
> >> +               /* If faulting pte was allocated by another, exit early. */
> >> +               if (!pte_none(ptep_get(pte))) {
> >> +                       update_mmu_tlb(vma, vmf->address, pte);
> >> +                       goto release;
> >> +               }
> >> +
> >> +               /* Else try again, with a lower order. */
> >> +               pte_unmap_unlock(vmf->pte, vmf->ptl);
> >> +               folio_put(folio);
> >> +               order--;
> >> +               goto retry;
> >
> > I'm not sure whether this extra fallback logic is worth it or not. Do
> > you have any benchmark data or is it just an arbitrary design choice?
> > If it is just an arbitrary design choice, I'd like to go with the
> > simplest way by just exiting page fault handler, just like the
> > order-0, IMHO.
>
> Yes, its an arbitrary design choice. Based on Yu Zhao's feedback, I'm already
> reworking this so that we only try the preferred order and order-0, so no longer
> iterating through intermediate orders.
>
> I think what you are suggesting is that if attempting to allocate the preferred
> order and we find there was a race meaning that the folio now is overlapping
> populated ptes (but the faulting pte is still empty), just exit and rely on the
> page fault being re-triggered, rather than immediately falling back to order-0?

The faulting PTE might be filled too. Yes, just exit and rely on the
CPU re-trigger page fault.

>
> The reason I didn't do that was I wasn't sure if the return path might have
> assumptions that the faulting pte is now valid if no error was returned? I guess
> another option is to return VM_FAULT_RETRY but then it seemed cleaner to do the
> retry directly here. What do you suggest?

IIRC as long as the page fault handler doesn't return any error, it is
safe to rely on CPU re-trigger page fault if PTE is not installed.

VM_FAULT_RETRY means the page fault handler released mmap_lock (or
per-VMA lock with per-VMA lock enabled) due to waiting for page lock.
TBH I really don't want to make that semantic more complicated and
overloaded. And I don't see any fundamental difference between
vmf_pte_changed() for order-0 folio and overlapping PTEs installed for
large folio. So I'd like to follow the same behavior.

>
> Thanks,
> Ryan
>
>
>
> >
> >>         }
> >>
> >>         ret = check_stable_address_space(vma->vm_mm);
> >> @@ -4286,16 +4413,18 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> >>                 return handle_userfault(vmf, VM_UFFD_MISSING);
> >>         }
> >>
> >> -       inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> >> -       folio_add_new_anon_rmap(folio, vma, vmf->address);
> >> +       folio_ref_add(folio, pgcount - 1);
> >> +
> >> +       add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount);
> >> +       folio_add_new_anon_rmap_range(folio, &folio->page, pgcount, vma, addr);
> >>         folio_add_lru_vma(folio, vma);
> >> -setpte:
> >> +
> >>         if (uffd_wp)
> >>                 entry = pte_mkuffd_wp(entry);
> >> -       set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> >> +       set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);
> >>
> >>         /* No need to invalidate - it was non-present before */
> >> -       update_mmu_cache(vma, vmf->address, vmf->pte);
> >> +       update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
> >>  unlock:
> >>         pte_unmap_unlock(vmf->pte, vmf->ptl);
> >>         return ret;
> >> --
> >> 2.25.1
> >>
> >>
>

[v1,10/10] mm: Allocate large folios for anonymous memory

Commit Message

Comments

Patch