[v2,RFC,0/7] mm: thp: use generic THP migration for NUMA hinting fault

Message ID 20210413212416.3273-1-shy828301@gmail.com

Message

Yang Shi April 13, 2021, 9:24 p.m. UTC
Changelog:
v1 --> v2:
    * Adopted the suggestion from Gerald Schaefer to skip huge PMD for S390
      for now.
    * Used PageTransHuge to distinguish base page or THP instead of a new
      parameter for migrate_misplaced_page() per Huang Ying.
    * Restored PMD lazily to avoid unnecessary TLB shootdown per Huang Ying.
    * Skipped shared THP.
    * Updated counters correctly.
    * Rebased to linux-next (next-20210412).

When THP NUMA fault support was added, THP migration was not supported yet, so
an ad hoc THP migration was implemented in the NUMA fault handling.  Since v4.14
THP migration has been supported, so it doesn't make much sense to keep another
THP migration implementation rather than using the generic migration code.  It
is definitely a maintenance burden to keep two THP migration implementations
for different code paths, and it is more error prone.  Using the generic THP
migration implementation allows us to remove the duplicate code and some hacks
needed by the old ad hoc implementation.
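
To illustrate the core of the change: after patch #3 the huge PMD fault path
calls the same helper as the PTE path.  A simplified sketch of the tail of the
new do_huge_pmd_numa_page() (condensed from the diff in patch #3):

	spin_unlock(vmf->ptl);

	/* Generic migration path, shared with the base page NUMA fault */
	migrated = migrate_misplaced_page(page, vma, target_nid);
	if (migrated) {
		flags |= TNF_MIGRATED;
		page_nid = target_nid;
	} else {
		flags |= TNF_MIGRATE_FAIL;
		/* Retake ptl and lazily restore the old PMD (out_map) */
	}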

A quick grep shows that x86_64, PowerPC (book3s), ARM64 and S390 support both
THP and NUMA balancing.  All of them support THP migration except for S390.
Zi Yan tried to add THP migration support for S390 before, but it was not
accepted due to the design of the S390 PMD.  For the discussion, please see:
https://lkml.org/lkml/2018/4/27/953.

Per the discussion with Gerald Schaefer in v1, it is acceptable to skip huge
PMD for S390 for now.
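
The skip itself is small; conceptually it is just a guard on the existing
thp_migration_supported() helper (which checks
CONFIG_ARCH_ENABLE_THP_MIGRATION), roughly like the sketch below.  The exact
hook point is in patch #7:

	/*
	 * Don't make the huge PMD PROT_NONE if we can't migrate the THP
	 * anyway (e.g. on S390), otherwise we eat the fault for nothing.
	 */
	if (!thp_migration_supported())
		return;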

I saw there were some hacks about gup in the git history, but I couldn't figure
out whether they have all been removed, since I just found FOLL_NUMA code in
the current gup implementation and it seems useful.
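
For reference, the FOLL_NUMA handling I found looks roughly like this
(paraphrased from mm/gup.c; quoted here only to illustrate why it still seems
useful, it is not touched by this series):

	/* In __get_user_pages(): honour NUMA hinting faults unless forced */
	if (!(gup_flags & FOLL_FORCE))
		gup_flags |= FOLL_NUMA;

	/*
	 * In follow_page_pte(): treat prot_none PTEs as not present so the
	 * caller faults and takes the NUMA hinting path
	 */
	if ((flags & FOLL_NUMA) && pte_protnone(pte))
		goto no_page;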

I'm trying to keep the behavior as consistent as possible between before and
after.  But there is still some minor disparity.  For example, file THP won't
get migrated at all in the old implementation due to the anon_vma check, but
the new implementation doesn't need to acquire the anon_vma lock anymore, so
file THP might get migrated.  I'm not sure whether this behavior needs to be
kept.

Patches #1 ~ #2 are preparation patches.
Patch #3 is the real meat.
Patches #4 ~ #6 keep the counters and behaviors consistent with before.
Patch #7 skips changing the huge PMD to prot_none if THP migration is not supported.

Yang Shi (7):
      mm: memory: add orig_pmd to struct vm_fault
      mm: memory: make numa_migrate_prep() non-static
      mm: thp: refactor NUMA fault handling
      mm: migrate: account THP NUMA migration counters correctly
      mm: migrate: don't split THP for misplaced NUMA page
      mm: migrate: check mapcount for THP instead of ref count
      mm: thp: skip make PMD PROT_NONE if THP migration is not supported

 include/linux/huge_mm.h |   9 ++---
 include/linux/migrate.h |  23 -----------
 include/linux/mm.h      |   3 ++
 mm/huge_memory.c        | 156 +++++++++++++++++++++++++-----------------------------------------------
 mm/internal.h           |  21 ++--------
 mm/memory.c             |  31 +++++++--------
 mm/migrate.c            | 204 +++++++++++++++++++++--------------------------------------------------------------------------
 7 files changed, 123 insertions(+), 324 deletions(-)

Comments

Huang, Ying April 14, 2021, 2:43 a.m. UTC | #1
Yang Shi <shy828301@gmail.com> writes:

> When THP NUMA fault support was added, THP migration was not supported yet,
> so an ad hoc THP migration was implemented in the NUMA fault handling.  Since
> v4.14 THP migration has been supported, so it doesn't make much sense to keep
> another THP migration implementation rather than using the generic migration
> code.
>
> This patch reworks the NUMA fault handling to use the generic migration
> implementation to migrate misplaced pages.  There is no functional change.
>
> After the refactor the flow of NUMA fault handling looks just like its
> PTE counterpart:
>   Acquire ptl
>   Prepare for migration (elevate page refcount)
>   Release ptl
>   Isolate page from lru and elevate page refcount
>   Migrate the misplaced THP
>
> If migration fails, just restore the old normal PMD.
>
> In the old code the anon_vma lock was needed to serialize THP migration
> against THP split, but the THP code has been reworked a lot since then, and
> it seems the anon_vma lock is not required anymore to avoid the race.
>
> Elevating the page refcount while holding the ptl should prevent THP
> split.
>
> Use migrate_misplaced_page() for both base page and THP NUMA hinting
> faults and remove all the dead and duplicate code.
>
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  include/linux/migrate.h |  23 ------
>  mm/huge_memory.c        | 143 ++++++++++----------------------
>  mm/internal.h           |  18 ----
>  mm/migrate.c            | 177 ++++++++--------------------------------
>  4 files changed, 77 insertions(+), 284 deletions(-)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 4bb4e519e3f5..163d6f2b03d1 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -95,14 +95,9 @@ static inline void __ClearPageMovable(struct page *page)
>  #endif
>  
>  #ifdef CONFIG_NUMA_BALANCING
> -extern bool pmd_trans_migrating(pmd_t pmd);
>  extern int migrate_misplaced_page(struct page *page,
>  				  struct vm_area_struct *vma, int node);
>  #else
> -static inline bool pmd_trans_migrating(pmd_t pmd)
> -{
> -	return false;
> -}
>  static inline int migrate_misplaced_page(struct page *page,
>  					 struct vm_area_struct *vma, int node)
>  {
> @@ -110,24 +105,6 @@ static inline int migrate_misplaced_page(struct page *page,
>  }
>  #endif /* CONFIG_NUMA_BALANCING */
>  
> -#if defined(CONFIG_NUMA_BALANCING) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> -extern int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> -			struct vm_area_struct *vma,
> -			pmd_t *pmd, pmd_t entry,
> -			unsigned long address,
> -			struct page *page, int node);
> -#else
> -static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> -			struct vm_area_struct *vma,
> -			pmd_t *pmd, pmd_t entry,
> -			unsigned long address,
> -			struct page *page, int node)
> -{
> -	return -EAGAIN;
> -}
> -#endif /* CONFIG_NUMA_BALANCING && CONFIG_TRANSPARENT_HUGEPAGE*/
> -
> -
>  #ifdef CONFIG_MIGRATION
>  
>  /*
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 35cac4aeaf68..94981907fd4c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1418,93 +1418,21 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>  	pmd_t pmd = vmf->orig_pmd;
> -	struct anon_vma *anon_vma = NULL;
> +	pmd_t oldpmd;

nit: the usage of oldpmd and pmd in the function appears inconsistent.  How
about making oldpmd == vmf->orig_pmd always, while making pmd the changed
one?

Best Regards,
Huang, Ying

>  	struct page *page;
>  	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> -	int page_nid = NUMA_NO_NODE, this_nid = numa_node_id();
> +	int page_nid = NUMA_NO_NODE;
>  	int target_nid, last_cpupid = -1;
> -	bool page_locked;
>  	bool migrated = false;
> -	bool was_writable;
> +	bool was_writable = pmd_savedwrite(pmd);
>  	int flags = 0;
>  
>  	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> -	if (unlikely(!pmd_same(pmd, *vmf->pmd)))
> -		goto out_unlock;
> -
> -	/*
> -	 * If there are potential migrations, wait for completion and retry
> -	 * without disrupting NUMA hinting information. Do not relock and
> -	 * check_same as the page may no longer be mapped.
> -	 */
> -	if (unlikely(pmd_trans_migrating(*vmf->pmd))) {
> -		page = pmd_page(*vmf->pmd);
> -		if (!get_page_unless_zero(page))
> -			goto out_unlock;
> +	if (unlikely(!pmd_same(pmd, *vmf->pmd))) {
>  		spin_unlock(vmf->ptl);
> -		put_and_wait_on_page_locked(page, TASK_UNINTERRUPTIBLE);
>  		goto out;
>  	}
>  
> -	page = pmd_page(pmd);
> -	BUG_ON(is_huge_zero_page(page));
> -	page_nid = page_to_nid(page);
> -	last_cpupid = page_cpupid_last(page);
> -	count_vm_numa_event(NUMA_HINT_FAULTS);
> -	if (page_nid == this_nid) {
> -		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
> -		flags |= TNF_FAULT_LOCAL;
> -	}
> -
> -	/* See similar comment in do_numa_page for explanation */
> -	if (!pmd_savedwrite(pmd))
> -		flags |= TNF_NO_GROUP;
> -
> -	/*
> -	 * Acquire the page lock to serialise THP migrations but avoid dropping
> -	 * page_table_lock if at all possible
> -	 */
> -	page_locked = trylock_page(page);
> -	target_nid = mpol_misplaced(page, vma, haddr);
> -	/* Migration could have started since the pmd_trans_migrating check */
> -	if (!page_locked) {
> -		page_nid = NUMA_NO_NODE;
> -		if (!get_page_unless_zero(page))
> -			goto out_unlock;
> -		spin_unlock(vmf->ptl);
> -		put_and_wait_on_page_locked(page, TASK_UNINTERRUPTIBLE);
> -		goto out;
> -	} else if (target_nid == NUMA_NO_NODE) {
> -		/* There are no parallel migrations and page is in the right
> -		 * node. Clear the numa hinting info in this pmd.
> -		 */
> -		goto clear_pmdnuma;
> -	}
> -
> -	/*
> -	 * Page is misplaced. Page lock serialises migrations. Acquire anon_vma
> -	 * to serialises splits
> -	 */
> -	get_page(page);
> -	spin_unlock(vmf->ptl);
> -	anon_vma = page_lock_anon_vma_read(page);
> -
> -	/* Confirm the PMD did not change while page_table_lock was released */
> -	spin_lock(vmf->ptl);
> -	if (unlikely(!pmd_same(pmd, *vmf->pmd))) {
> -		unlock_page(page);
> -		put_page(page);
> -		page_nid = NUMA_NO_NODE;
> -		goto out_unlock;
> -	}
> -
> -	/* Bail if we fail to protect against THP splits for any reason */
> -	if (unlikely(!anon_vma)) {
> -		put_page(page);
> -		page_nid = NUMA_NO_NODE;
> -		goto clear_pmdnuma;
> -	}
> -
>  	/*
>  	 * Since we took the NUMA fault, we must have observed the !accessible
>  	 * bit. Make sure all other CPUs agree with that, to avoid them
> @@ -1531,43 +1459,60 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>  					      haddr + HPAGE_PMD_SIZE);
>  	}
>  
> -	/*
> -	 * Migrate the THP to the requested node, returns with page unlocked
> -	 * and access rights restored.
> -	 */
> +	oldpmd = pmd_modify(pmd, vma->vm_page_prot);
> +	page = vm_normal_page_pmd(vma, haddr, oldpmd);
> +	if (!page) {
> +		spin_unlock(vmf->ptl);
> +		goto out_map;
> +	}
> +
> +	/* See similar comment in do_numa_page for explanation */
> +	if (!was_writable)
> +		flags |= TNF_NO_GROUP;
> +
> +	page_nid = page_to_nid(page);
> +	last_cpupid = page_cpupid_last(page);
> +	target_nid = numa_migrate_prep(page, vma, haddr, page_nid,
> +				       &flags);
> +
> +	if (target_nid == NUMA_NO_NODE) {
> +		put_page(page);
> +		goto out_map;
> +	}
> +
>  	spin_unlock(vmf->ptl);
>  
> -	migrated = migrate_misplaced_transhuge_page(vma->vm_mm, vma,
> -				vmf->pmd, pmd, vmf->address, page, target_nid);
> +	migrated = migrate_misplaced_page(page, vma, target_nid);
>  	if (migrated) {
>  		flags |= TNF_MIGRATED;
>  		page_nid = target_nid;
> -	} else
> +	} else {
>  		flags |= TNF_MIGRATE_FAIL;
> +		vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> +		if (unlikely(!pmd_same(pmd, *vmf->pmd))) {
> +			spin_unlock(vmf->ptl);
> +			goto out;
> +		}
> +		goto out_map;
> +	}
>  
> -	goto out;
> -clear_pmdnuma:
> -	BUG_ON(!PageLocked(page));
> -	was_writable = pmd_savedwrite(pmd);
> +out:
> +	if (page_nid != NUMA_NO_NODE)
> +		task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR,
> +				flags);
> +
> +	return 0;
> +
> +out_map:
> +	/* Restore the PMD */
>  	pmd = pmd_modify(pmd, vma->vm_page_prot);
>  	pmd = pmd_mkyoung(pmd);
>  	if (was_writable)
>  		pmd = pmd_mkwrite(pmd);
>  	set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd);
>  	update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
> -	unlock_page(page);
> -out_unlock:
>  	spin_unlock(vmf->ptl);
> -
> -out:
> -	if (anon_vma)
> -		page_unlock_anon_vma_read(anon_vma);
> -
> -	if (page_nid != NUMA_NO_NODE)
> -		task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR,
> -				flags);
> -
> -	return 0;
> +	goto out;
>  }
>  

[snip]
Yang Shi April 14, 2021, 5:15 p.m. UTC | #2
On Tue, Apr 13, 2021 at 7:44 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yang Shi <shy828301@gmail.com> writes:
>
> > When THP NUMA fault support was added, THP migration was not supported
> > yet, so an ad hoc THP migration was implemented in the NUMA fault handling.
> > Since v4.14 THP migration has been supported, so it doesn't make much sense
> > to keep another THP migration implementation rather than using the generic
> > migration code.
> >
> > This patch reworks the NUMA fault handling to use the generic migration
> > implementation to migrate misplaced pages.  There is no functional change.
> >
> > After the refactor the flow of NUMA fault handling looks just like its
> > PTE counterpart:
> >   Acquire ptl
> >   Prepare for migration (elevate page refcount)
> >   Release ptl
> >   Isolate page from lru and elevate page refcount
> >   Migrate the misplaced THP
> >
> > If migration fails, just restore the old normal PMD.
> >
> > In the old code the anon_vma lock was needed to serialize THP migration
> > against THP split, but the THP code has been reworked a lot since then,
> > and it seems the anon_vma lock is not required anymore to avoid the race.
> >
> > Elevating the page refcount while holding the ptl should prevent THP
> > split.
> >
> > Use migrate_misplaced_page() for both base page and THP NUMA hinting
> > faults and remove all the dead and duplicate code.
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  include/linux/migrate.h |  23 ------
> >  mm/huge_memory.c        | 143 ++++++++++----------------------
> >  mm/internal.h           |  18 ----
> >  mm/migrate.c            | 177 ++++++++--------------------------------
> >  4 files changed, 77 insertions(+), 284 deletions(-)
> >
> > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > index 4bb4e519e3f5..163d6f2b03d1 100644
> > --- a/include/linux/migrate.h
> > +++ b/include/linux/migrate.h
> > @@ -95,14 +95,9 @@ static inline void __ClearPageMovable(struct page *page)
> >  #endif
> >
> >  #ifdef CONFIG_NUMA_BALANCING
> > -extern bool pmd_trans_migrating(pmd_t pmd);
> >  extern int migrate_misplaced_page(struct page *page,
> >                                 struct vm_area_struct *vma, int node);
> >  #else
> > -static inline bool pmd_trans_migrating(pmd_t pmd)
> > -{
> > -     return false;
> > -}
> >  static inline int migrate_misplaced_page(struct page *page,
> >                                        struct vm_area_struct *vma, int node)
> >  {
> > @@ -110,24 +105,6 @@ static inline int migrate_misplaced_page(struct page *page,
> >  }
> >  #endif /* CONFIG_NUMA_BALANCING */
> >
> > -#if defined(CONFIG_NUMA_BALANCING) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> > -extern int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> > -                     struct vm_area_struct *vma,
> > -                     pmd_t *pmd, pmd_t entry,
> > -                     unsigned long address,
> > -                     struct page *page, int node);
> > -#else
> > -static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> > -                     struct vm_area_struct *vma,
> > -                     pmd_t *pmd, pmd_t entry,
> > -                     unsigned long address,
> > -                     struct page *page, int node)
> > -{
> > -     return -EAGAIN;
> > -}
> > -#endif /* CONFIG_NUMA_BALANCING && CONFIG_TRANSPARENT_HUGEPAGE*/
> > -
> > -
> >  #ifdef CONFIG_MIGRATION
> >
> >  /*
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 35cac4aeaf68..94981907fd4c 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1418,93 +1418,21 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
> >  {
> >       struct vm_area_struct *vma = vmf->vma;
> >       pmd_t pmd = vmf->orig_pmd;
> > -     struct anon_vma *anon_vma = NULL;
> > +     pmd_t oldpmd;
>
> nit: the usage of oldpmd and pmd in the function appears inconsistent.  How
> about making oldpmd == vmf->orig_pmd always, while making pmd the changed
> one?

Thanks for the suggestion. Yes, that seems neater. Will fix it in the
next version.

>
> Best Regards,
> Huang, Ying
>
> >       struct page *page;
> >       unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> > -     int page_nid = NUMA_NO_NODE, this_nid = numa_node_id();
> > +     int page_nid = NUMA_NO_NODE;
> >       int target_nid, last_cpupid = -1;
> > -     bool page_locked;
> >       bool migrated = false;
> > -     bool was_writable;
> > +     bool was_writable = pmd_savedwrite(pmd);
> >       int flags = 0;
> >
> >       vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> > -     if (unlikely(!pmd_same(pmd, *vmf->pmd)))
> > -             goto out_unlock;
> > -
> > -     /*
> > -      * If there are potential migrations, wait for completion and retry
> > -      * without disrupting NUMA hinting information. Do not relock and
> > -      * check_same as the page may no longer be mapped.
> > -      */
> > -     if (unlikely(pmd_trans_migrating(*vmf->pmd))) {
> > -             page = pmd_page(*vmf->pmd);
> > -             if (!get_page_unless_zero(page))
> > -                     goto out_unlock;
> > +     if (unlikely(!pmd_same(pmd, *vmf->pmd))) {
> >               spin_unlock(vmf->ptl);
> > -             put_and_wait_on_page_locked(page, TASK_UNINTERRUPTIBLE);
> >               goto out;
> >       }
> >
> > -     page = pmd_page(pmd);
> > -     BUG_ON(is_huge_zero_page(page));
> > -     page_nid = page_to_nid(page);
> > -     last_cpupid = page_cpupid_last(page);
> > -     count_vm_numa_event(NUMA_HINT_FAULTS);
> > -     if (page_nid == this_nid) {
> > -             count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
> > -             flags |= TNF_FAULT_LOCAL;
> > -     }
> > -
> > -     /* See similar comment in do_numa_page for explanation */
> > -     if (!pmd_savedwrite(pmd))
> > -             flags |= TNF_NO_GROUP;
> > -
> > -     /*
> > -      * Acquire the page lock to serialise THP migrations but avoid dropping
> > -      * page_table_lock if at all possible
> > -      */
> > -     page_locked = trylock_page(page);
> > -     target_nid = mpol_misplaced(page, vma, haddr);
> > -     /* Migration could have started since the pmd_trans_migrating check */
> > -     if (!page_locked) {
> > -             page_nid = NUMA_NO_NODE;
> > -             if (!get_page_unless_zero(page))
> > -                     goto out_unlock;
> > -             spin_unlock(vmf->ptl);
> > -             put_and_wait_on_page_locked(page, TASK_UNINTERRUPTIBLE);
> > -             goto out;
> > -     } else if (target_nid == NUMA_NO_NODE) {
> > -             /* There are no parallel migrations and page is in the right
> > -              * node. Clear the numa hinting info in this pmd.
> > -              */
> > -             goto clear_pmdnuma;
> > -     }
> > -
> > -     /*
> > -      * Page is misplaced. Page lock serialises migrations. Acquire anon_vma
> > -      * to serialises splits
> > -      */
> > -     get_page(page);
> > -     spin_unlock(vmf->ptl);
> > -     anon_vma = page_lock_anon_vma_read(page);
> > -
> > -     /* Confirm the PMD did not change while page_table_lock was released */
> > -     spin_lock(vmf->ptl);
> > -     if (unlikely(!pmd_same(pmd, *vmf->pmd))) {
> > -             unlock_page(page);
> > -             put_page(page);
> > -             page_nid = NUMA_NO_NODE;
> > -             goto out_unlock;
> > -     }
> > -
> > -     /* Bail if we fail to protect against THP splits for any reason */
> > -     if (unlikely(!anon_vma)) {
> > -             put_page(page);
> > -             page_nid = NUMA_NO_NODE;
> > -             goto clear_pmdnuma;
> > -     }
> > -
> >       /*
> >        * Since we took the NUMA fault, we must have observed the !accessible
> >        * bit. Make sure all other CPUs agree with that, to avoid them
> > @@ -1531,43 +1459,60 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
> >                                             haddr + HPAGE_PMD_SIZE);
> >       }
> >
> > -     /*
> > -      * Migrate the THP to the requested node, returns with page unlocked
> > -      * and access rights restored.
> > -      */
> > +     oldpmd = pmd_modify(pmd, vma->vm_page_prot);
> > +     page = vm_normal_page_pmd(vma, haddr, oldpmd);
> > +     if (!page) {
> > +             spin_unlock(vmf->ptl);
> > +             goto out_map;
> > +     }
> > +
> > +     /* See similar comment in do_numa_page for explanation */
> > +     if (!was_writable)
> > +             flags |= TNF_NO_GROUP;
> > +
> > +     page_nid = page_to_nid(page);
> > +     last_cpupid = page_cpupid_last(page);
> > +     target_nid = numa_migrate_prep(page, vma, haddr, page_nid,
> > +                                    &flags);
> > +
> > +     if (target_nid == NUMA_NO_NODE) {
> > +             put_page(page);
> > +             goto out_map;
> > +     }
> > +
> >       spin_unlock(vmf->ptl);
> >
> > -     migrated = migrate_misplaced_transhuge_page(vma->vm_mm, vma,
> > -                             vmf->pmd, pmd, vmf->address, page, target_nid);
> > +     migrated = migrate_misplaced_page(page, vma, target_nid);
> >       if (migrated) {
> >               flags |= TNF_MIGRATED;
> >               page_nid = target_nid;
> > -     } else
> > +     } else {
> >               flags |= TNF_MIGRATE_FAIL;
> > +             vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> > +             if (unlikely(!pmd_same(pmd, *vmf->pmd))) {
> > +                     spin_unlock(vmf->ptl);
> > +                     goto out;
> > +             }
> > +             goto out_map;
> > +     }
> >
> > -     goto out;
> > -clear_pmdnuma:
> > -     BUG_ON(!PageLocked(page));
> > -     was_writable = pmd_savedwrite(pmd);
> > +out:
> > +     if (page_nid != NUMA_NO_NODE)
> > +             task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR,
> > +                             flags);
> > +
> > +     return 0;
> > +
> > +out_map:
> > +     /* Restore the PMD */
> >       pmd = pmd_modify(pmd, vma->vm_page_prot);
> >       pmd = pmd_mkyoung(pmd);
> >       if (was_writable)
> >               pmd = pmd_mkwrite(pmd);
> >       set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd);
> >       update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
> > -     unlock_page(page);
> > -out_unlock:
> >       spin_unlock(vmf->ptl);
> > -
> > -out:
> > -     if (anon_vma)
> > -             page_unlock_anon_vma_read(anon_vma);
> > -
> > -     if (page_nid != NUMA_NO_NODE)
> > -             task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR,
> > -                             flags);
> > -
> > -     return 0;
> > +     goto out;
> >  }
> >
>
> [snip]
Yang Shi May 3, 2021, 9:58 p.m. UTC | #3
Gently ping.

I also did some tests to measure the latency of do_huge_pmd_numa_page.
The test VM has 80 vcpus and 64G memory. The test creates 2 processes
that together consume 128G of memory, which incurs memory pressure and
causes THP splits. It also creates 80 processes to hog the CPUs, and the
memory consumer processes are bound to different nodes periodically in
order to increase NUMA faults.

The test script below was used:

echo 3 > /proc/sys/vm/drop_caches

# Run stress-ng for 24 hours
./stress-ng/stress-ng --vm 2 --vm-bytes 64G --timeout 24h &
PID=$!

./stress-ng/stress-ng --cpu $NR_CPUS --timeout 24h &

# Wait for vm stressors forked
sleep 5

PID_1=`pgrep -P $PID | awk 'NR == 1'`
PID_2=`pgrep -P $PID | awk 'NR == 2'`

JOB1=`pgrep -P $PID_1`
JOB2=`pgrep -P $PID_2`

# Bind load jobs to different nodes periodically to force generate
# cross node memory access
while [ -d "/proc/$PID" ]
do
        taskset -apc 8 $JOB1
        taskset -apc 8 $JOB2
        sleep 300
        taskset -apc 58 $JOB1
        taskset -apc 58 $JOB2
        sleep 300
done

With the above test, the latency histogram of do_huge_pmd_numa_page is
shown below. Since the number of do_huge_pmd_numa_page calls varies
drastically between runs (presumably due to the scheduler), I converted
the raw numbers to percentages.
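
The @us[...] histograms below are in bpftrace's hist() output format; the
latency was collected with something like the following probe script (a
sketch, since the exact script was not posted):

	kprobe:do_huge_pmd_numa_page { @start[tid] = nsecs; }

	kretprobe:do_huge_pmd_numa_page /@start[tid]/ {
		@us[comm] = hist((nsecs - @start[tid]) / 1000);
		delete(@start[tid]);
	}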

                patched      base
@us[stress-ng]:
[0]               3.57%      0.16%
[1]              55.68%     18.36%
[2, 4)           10.46%     40.44%
[4, 8)            7.26%     17.82%
[8, 16)          21.12%     13.41%
[16, 32)          1.06%      4.27%
[32, 64)          0.56%      4.07%
[64, 128)         0.16%      0.35%
[128, 256)       < 0.1%     < 0.1%
[256, 512)       < 0.1%     < 0.1%
[512, 1K)        < 0.1%     < 0.1%
[1K, 2K)         < 0.1%     < 0.1%
[2K, 4K)         < 0.1%     < 0.1%
[4K, 8K)         < 0.1%     < 0.1%
[8K, 16K)        < 0.1%     < 0.1%
[16K, 32K)       < 0.1%     < 0.1%
[32K, 64K)       < 0.1%     < 0.1%
Per the result, the patched kernel is even slightly better than the base
kernel. I think this is because there is less lock contention against THP
split than in the base kernel, due to the refactor.


To exclude the effect of THP splits, I also tested without memory
pressure. No obvious regression was spotted. Below is the test result
*without* memory pressure.
                patched      base
@us[stress-ng]:
[0]               7.97%     18.4%
[1]              69.63%     58.24%
[2, 4)            4.18%      2.63%
[4, 8)            0.22%      0.17%
[8, 16)           1.03%      0.92%
[16, 32)          0.14%     < 0.1%
[32, 64)         < 0.1%     < 0.1%
[64, 128)        < 0.1%     < 0.1%
[128, 256)       < 0.1%     < 0.1%
[256, 512)        0.45%      1.19%
[512, 1K)        15.45%     17.27%
[1K, 2K)         < 0.1%     < 0.1%
[2K, 4K)         < 0.1%     < 0.1%
[4K, 8K)         < 0.1%     < 0.1%
[8K, 16K)         0.86%      0.88%
[16K, 32K)       < 0.1%      0.15%
[32K, 64K)       < 0.1%     < 0.1%
[64K, 128K)      < 0.1%     < 0.1%
[128K, 256K)     < 0.1%     < 0.1%


On Tue, Apr 13, 2021 at 2:24 PM Yang Shi <shy828301@gmail.com> wrote:
>
>
> Changelog:
> v1 --> v2:
>     * Adopted the suggestion from Gerald Schaefer to skip huge PMD for S390
>       for now.
>     * Used PageTransHuge to distinguish base page or THP instead of a new
>       parameter for migrate_misplaced_page() per Huang Ying.
>     * Restored PMD lazily to avoid unnecessary TLB shootdown per Huang Ying.
>     * Skipped shared THP.
>     * Updated counters correctly.
>     * Rebased to linux-next (next-20210412).
>
> When THP NUMA fault support was added, THP migration was not supported yet,
> so an ad hoc THP migration was implemented in the NUMA fault handling.  Since
> v4.14 THP migration has been supported, so it doesn't make much sense to keep
> another THP migration implementation rather than using the generic migration
> code.  It is definitely a maintenance burden to keep two THP migration
> implementations for different code paths, and it is more error prone.  Using
> the generic THP migration implementation allows us to remove the duplicate
> code and some hacks needed by the old ad hoc implementation.
>
> A quick grep shows that x86_64, PowerPC (book3s), ARM64 and S390 support both
> THP and NUMA balancing.  All of them support THP migration except for S390.
> Zi Yan tried to add THP migration support for S390 before, but it was not
> accepted due to the design of the S390 PMD.  For the discussion, please see:
> https://lkml.org/lkml/2018/4/27/953.
>
> Per the discussion with Gerald Schaefer in v1, it is acceptable to skip huge
> PMD for S390 for now.
>
> I saw there were some hacks about gup in the git history, but I couldn't
> figure out whether they have all been removed, since I just found FOLL_NUMA
> code in the current gup implementation and it seems useful.
>
> I'm trying to keep the behavior as consistent as possible between before and
> after.  But there is still some minor disparity.  For example, file THP won't
> get migrated at all in the old implementation due to the anon_vma check, but
> the new implementation doesn't need to acquire the anon_vma lock anymore, so
> file THP might get migrated.  I'm not sure whether this behavior needs to be
> kept.
>
> Patches #1 ~ #2 are preparation patches.
> Patch #3 is the real meat.
> Patches #4 ~ #6 keep the counters and behaviors consistent with before.
> Patch #7 skips changing the huge PMD to prot_none if THP migration is not supported.
>
> Yang Shi (7):
>       mm: memory: add orig_pmd to struct vm_fault
>       mm: memory: make numa_migrate_prep() non-static
>       mm: thp: refactor NUMA fault handling
>       mm: migrate: account THP NUMA migration counters correctly
>       mm: migrate: don't split THP for misplaced NUMA page
>       mm: migrate: check mapcount for THP instead of ref count
>       mm: thp: skip make PMD PROT_NONE if THP migration is not supported
>
>  include/linux/huge_mm.h |   9 ++---
>  include/linux/migrate.h |  23 -----------
>  include/linux/mm.h      |   3 ++
>  mm/huge_memory.c        | 156 +++++++++++++++++++++++++-----------------------------------------------
>  mm/internal.h           |  21 ++--------
>  mm/memory.c             |  31 +++++++--------
>  mm/migrate.c            | 204 +++++++++++++++++++++--------------------------------------------------------------------------
>  7 files changed, 123 insertions(+), 324 deletions(-)
>
Mel Gorman May 17, 2021, 3:27 p.m. UTC | #4
On Tue, Apr 13, 2021 at 02:24:12PM -0700, Yang Shi wrote:
> When THP NUMA fault support was added, THP migration was not supported yet,
> so an ad hoc THP migration was implemented in the NUMA fault handling.  Since
> v4.14 THP migration has been supported, so it doesn't make much sense to keep
> another THP migration implementation rather than using the generic migration
> code.
> 
> This patch reworks the NUMA fault handling to use the generic migration
> implementation to migrate misplaced pages.  There is no functional change.
> 
> After the refactor the flow of NUMA fault handling looks just like its
> PTE counterpart:
>   Acquire ptl
>   Prepare for migration (elevate page refcount)
>   Release ptl
>   Isolate page from lru and elevate page refcount
>   Migrate the misplaced THP
> 
> If migration fails, just restore the old normal PMD.
> 
> In the old code the anon_vma lock was needed to serialize THP migration
> against THP split, but the THP code has been reworked a lot since then, and
> it seems the anon_vma lock is not required anymore to avoid the race.
> 
> Elevating the page refcount while holding the ptl should prevent THP
> split.
> 
> Use migrate_misplaced_page() for both base page and THP NUMA hinting
> faults and remove all the dead and duplicate code.
> 
> Signed-off-by: Yang Shi <shy828301@gmail.com>

I did not spot any big problems and FWIW, the series overall passed a
series of tests that exercise NUMA balancing migrations so...

Acked-by: Mel Gorman <mgorman@suse.de>
Yang Shi May 17, 2021, 7:41 p.m. UTC | #5
On Mon, May 17, 2021 at 8:27 AM Mel Gorman <mgorman@suse.de> wrote:
>
> On Tue, Apr 13, 2021 at 02:24:12PM -0700, Yang Shi wrote:
> > When THP NUMA fault support was added, THP migration was not supported
> > yet, so an ad hoc THP migration was implemented in the NUMA fault handling.
> > Since v4.14 THP migration has been supported, so it doesn't make much sense
> > to keep another THP migration implementation rather than using the generic
> > migration code.
> >
> > This patch reworks the NUMA fault handling to use the generic migration
> > implementation to migrate misplaced pages.  There is no functional change.
> >
> > After the refactor the flow of NUMA fault handling looks just like its
> > PTE counterpart:
> >   Acquire ptl
> >   Prepare for migration (elevate page refcount)
> >   Release ptl
> >   Isolate page from lru and elevate page refcount
> >   Migrate the misplaced THP
> >
> > If migration fails, just restore the old normal PMD.
> >
> > In the old code the anon_vma lock was needed to serialize THP migration
> > against THP split, but the THP code has been reworked a lot since then,
> > and it seems the anon_vma lock is not required anymore to avoid the race.
> >
> > Elevating the page refcount while holding the ptl should prevent THP
> > split.
> >
> > Use migrate_misplaced_page() for both base page and THP NUMA hinting
> > faults and remove all the dead and duplicate code.
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
>
> I did not spot any big problems and FWIW, the series overall passed a
> series of tests that exercise NUMA balancing migrations so...
>
> Acked-by: Mel Gorman <mgorman@suse.de>

Thanks a lot for testing and reviewing the series.

>
> --
> Mel Gorman
> SUSE Labs