
[2/2] Revert "mm, thp: restore node-local hugepage allocations"

Message ID 20190503223146.2312-3-aarcange@redhat.com (mailing list archive)
State New, archived
Series reapply: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

Commit Message

Andrea Arcangeli May 3, 2019, 10:31 p.m. UTC
This reverts commit 2f0799a0ffc033bf3cc82d5032acc3ec633464c2.

commit 2f0799a0ffc033bf3cc82d5032acc3ec633464c2 was rightfully applied
to avoid the risk of a severe regression that was reported by the
kernel test robot at the end of the merge window. We now understand
the regression was a false positive, caused by a significant increase
in fairness during a swap thrashing benchmark. So it's safe to
re-apply the fix and continue improving the code from there. The
benchmark that reported the regression is very useful, but it provides
a meaningful result only when there is no significant alteration in
fairness during the workload. The removal of __GFP_THISNODE increased
fairness.

__GFP_THISNODE cannot be used in the generic page fault path for new
memory allocations under the MPOL_DEFAULT mempolicy, or the allocation
behavior significantly deviates from what the MPOL_DEFAULT semantics
are supposed to be for THP and 4k allocations alike.

Setting THP defrag to "always" or using MADV_HUGEPAGE (with THP defrag
set to "madvise") has never been meant to provide an implicit MPOL_BIND
on the "current" node the task is running on; doing so causes swap
storms and provides a much more aggressive behavior than even
zone_reclaim_mode = 3.

Any workload that could have benefited from __GFP_THISNODE now has to
enable zone_reclaim_mode = 1|2|3. __GFP_THISNODE implicitly provided
the zone_reclaim_mode behavior, but it only did so if THP was enabled:
if THP was disabled, there would have been no chance to get any 4k
page from the current node if the current node was full of pagecache,
which further shows how this __GFP_THISNODE was misplaced in
MADV_HUGEPAGE. MADV_HUGEPAGE has never been intended to provide any
zone_reclaim_mode semantics; in fact the two are orthogonal:
zone_reclaim_mode = 1|2|3 must work exactly the same with
MADV_HUGEPAGE set or not.

The performance characteristics of memory depend on the hardware
details. The numbers below were obtained on the Naples/EPYC
architecture and the N/A projections extend them to show what we
should aim for in the future as a good THP NUMA locality default. The
benchmark used exercises random memory seeks (note: the cost of the
page faults is not part of the measurement).

D0 THP | D0 4k | D1 THP | D1 4k | D2 THP | D2 4k | D3 THP | D3 4k | ...
0%     | +43%  | +45%   | +106% | +131%  | +224% | N/A    | N/A

D0 means distance zero (i.e. local memory), D1 means distance
one (i.e. intra-socket memory), D2 means distance two (i.e.
inter-socket memory), etc...

For the guest physical memory allocated by qemu and for a guest mode
kernel, the performance characteristics of RAM are more complex and an
ideal default could be:

D0 THP | D1 THP | D0 4k | D2 THP | D1 4k | D3 THP | D2 4k | D3 4k | ...
0%     | +58%   | +101% | N/A    | +222% | N/A    | N/A   | N/A

NOTE: the N/A entries are projections and haven't been measured yet;
the measurement in this case was done on a 1950x with only two NUMA
nodes. The THP case here means THP was used both in the host and in
the guest.
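
For reference, a minimal sketch of this kind of random access benchmark
(not the exact source used for the numbers above; the buffer size and
iteration count are arbitrary, and NUMA placement is controlled
externally with numactl) could look like:

	/* random-seek sketch: time random writes into an anon mapping */
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <time.h>
	#include <sys/mman.h>

	#define SIZE	(1UL << 30)	/* 1G buffer, arbitrary */
	#define SEEKS	(64UL << 20)	/* number of random writes */

	int main(void)
	{
		struct timespec t0, t1;
		unsigned long i;
		char *buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (buf == MAP_FAILED)
			return 1;
		madvise(buf, SIZE, MADV_HUGEPAGE); /* or MADV_NOHUGEPAGE */
		memset(buf, 0, SIZE);	/* fault everything in, not timed */

		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (i = 0; i < SEEKS; i++)
			buf[random() % SIZE]++;
		clock_gettime(CLOCK_MONOTONIC, &t1);

		printf("random writes %ld usec\n",
		       (t1.tv_sec - t0.tv_sec) * 1000000L +
		       (t1.tv_nsec - t0.tv_nsec) / 1000);
		return 0;
	}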

After applying this commit the THP NUMA locality order that we'll get
out of MADV_HUGEPAGE is this:

D0 THP | D1 THP | D2 THP | D3 THP | ... | D0 4k | D1 4k | D2 4k | D3 4k | ...

Before this commit it was:

D0 THP | D0 4k | D1 4k | D2 4k | D3 4k | ...

Even if we ignore the breakage of large workloads that can't fit in a
single node, caused by the __GFP_THISNODE implicit "current node"
mbind, the THP NUMA locality order provided by __GFP_THISNODE was
still not the one we should aim for in the long term (i.e. the first
one at the top).

After this commit is applied, we can introduce a new multi-order
allocator API and replace the two alloc_pages_vma() calls in the page
fault path with a single multi-order call:

	unsigned int order = (1 << HPAGE_PMD_ORDER) | (1 << 0);
	page = alloc_pages_multi_order(..., &order);
	if (!page)
		goto out;
	if (!(order & (1 << 0))) {
		VM_WARN_ON(order != 1 << HPAGE_PMD_ORDER);
		/* THP fault */
	} else {
		VM_WARN_ON(order != 1 << 0);
		/* 4k fallback */
	}

The page allocator logic has to be altered so that when it fails on
any zone with order 9, it tries again with order 0 on the same zone
before falling back to the next zone in the zonelist.
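
Expressed as a rough sketch (the iterator and helper names below are
made up for illustration; only the "order 9, then order 0 on the same
zone" retry is the point):

	/* illustrative only, not actual page allocator code */
	for_each_candidate_zone(zone, zonelist) {
		page = try_this_zone(zone, HPAGE_PMD_ORDER, gfp_mask);
		if (page) {
			*order = 1 << HPAGE_PMD_ORDER;	/* THP succeeded */
			return page;
		}
		/* retry the same zone with order 0 before moving on */
		page = try_this_zone(zone, 0, gfp_mask);
		if (page) {
			*order = 1 << 0;		/* 4k fallback */
			return page;
		}
	}
	return NULL;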

After that we need to do more measurements and evaluate whether adding
an opt-in feature for guest mode is worth it, to swap "DN 4k | DN+1 THP"
with "DN+1 THP | DN 4k" at every NUMA distance crossing.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/mempolicy.h |  2 ++
 mm/huge_memory.c          | 42 ++++++++++++++++++++++++---------------
 mm/mempolicy.c            |  2 +-
 3 files changed, 29 insertions(+), 17 deletions(-)

Comments

Michal Hocko May 4, 2019, 12:11 p.m. UTC | #1
On Fri 03-05-19 18:31:46, Andrea Arcangeli wrote:
> This reverts commit 2f0799a0ffc033bf3cc82d5032acc3ec633464c2.
> 
> commit 2f0799a0ffc033bf3cc82d5032acc3ec633464c2 was rightfully applied
> to avoid the risk of a severe regression that was reported by the
> kernel test robot at the end of the merge window. Now we understood
> the regression was a false positive and was caused by a significant
> increase in fairness during a swap trashing benchmark. So it's safe to
> re-apply the fix and continue improving the code from there. The
> benchmark that reported the regression is very useful, but it provides
> a meaningful result only when there is no significant alteration in
> fairness during the workload. The removal of __GFP_THISNODE increased
> fairness.
> 
> __GFP_THISNODE cannot be used in the generic page faults path for new
> memory allocations under the MPOL_DEFAULT mempolicy, or the allocation
> behavior significantly deviates from what the MPOL_DEFAULT semantics
> are supposed to be for THP and 4k allocations alike.
> 
> Setting THP defrag to "always" or using MADV_HUGEPAGE (with THP defrag
> set to "madvise") has never meant to provide an implicit MPOL_BIND on
> the "current" node the task is running on, causing swap storms and
> providing a much more aggressive behavior than even zone_reclaim_node
> = 3.
> 
> Any workload who could have benefited from __GFP_THISNODE has now to
> enable zone_reclaim_mode=1||2||3. __GFP_THISNODE implicitly provided
> the zone_reclaim_mode behavior, but it only did so if THP was enabled:
> if THP was disabled, there would have been no chance to get any 4k
> page from the current node if the current node was full of pagecache,
> which further shows how this __GFP_THISNODE was misplaced in
> MADV_HUGEPAGE. MADV_HUGEPAGE has never been intended to provide any
> zone_reclaim_mode semantics, in fact the two are orthogonal,
> zone_reclaim_mode = 1|2|3 must work exactly the same with
> MADV_HUGEPAGE set or not.
> 
> The performance characteristic of memory depends on the hardware
> details. The numbers below are obtained on Naples/EPYC architecture
> and the N/A projection extends them to show what we should aim for in
> the future as a good THP NUMA locality default. The benchmark used
> exercises random memory seeks (note: the cost of the page faults is
> not part of the measurement).
> 
> D0 THP | D0 4k | D1 THP | D1 4k | D2 THP | D2 4k | D3 THP | D3 4k | ...
> 0%     | +43%  | +45%   | +106% | +131%  | +224% | N/A    | N/A
> 
> D0 means distance zero (i.e. local memory), D1 means distance
> one (i.e. intra socket memory), D2 means distance two (i.e. inter
> socket memory), etc...
> 
> For the guest physical memory allocated by qemu and for guest mode kernel
> the performance characteristic of RAM is more complex and an ideal
> default could be:
> 
> D0 THP | D1 THP | D0 4k | D2 THP | D1 4k | D3 THP | D2 4k | D3 4k | ...
> 0%     | +58%   | +101% | N/A    | +222% | N/A    | N/A   | N/A
> 
> NOTE: the N/A are projections and haven't been measured yet, the
> measurement in this case is done on a 1950x with only two NUMA nodes.
> The THP case here means THP was used both in the host and in the
> guest.
> 
> After applying this commit the THP NUMA locality order that we'll get
> out of MADV_HUGEPAGE is this:
> 
> D0 THP | D1 THP | D2 THP | D3 THP | ... | D0 4k | D1 4k | D2 4k | D3 4k | ...
> 
> Before this commit it was:
> 
> D0 THP | D0 4k | D1 4k | D2 4k | D3 4k | ...
> 
> Even if we ignore the breakage of large workloads that can't fit in a
> single node that the __GFP_THISNODE implicit "current node" mbind
> caused, the THP NUMA locality order provided by __GFP_THISNODE was
> still not the one we shall aim for in the long term (i.e. the first
> one at the top).
> 
> After this commit is applied, we can introduce a new allocator multi
> order API and to replace those two alloc_pages_vmas calls in the page
> fault path, with a single multi order call:
> 
> 	unsigned int order = (1 << HPAGE_PMD_ORDER) | (1 << 0);
> 	page = alloc_pages_multi_order(..., &order);
> 	if (!page)
> 		goto out;
> 	if (!(order & (1 << 0))) {
> 		VM_WARN_ON(order != 1 << HPAGE_PMD_ORDER);
> 		/* THP fault */
> 	} else {
> 		VM_WARN_ON(order != 1 << 0);
> 		/* 4k fallback */
> 	}
> 
> The page allocator logic has to be altered so that when it fails on
> any zone with order 9, it has to try again with a order 0 before
> falling back to the next zone in the zonelist.
> 
> After that we need to do more measurements and evaluate if adding an
> opt-in feature for guest mode is worth it, to swap "DN 4k | DN+1 THP"
> with "DN+1 THP | DN 4k" at every NUMA distance crossing.

I do agree with your reasoning. Future plans should be discussed
carefully because any implicit NUMA placing might turn out to be simply
wrong with future HW. I still believe we need a sort of _enforce_
intra-socket placement NUMA policy API. Something resembling node
reclaim mode for particular mappings. MPOL_CLOSE_NODE or similar sounds
like a way to go. But a new API really begs for real world use cases
and I still hope to get a reproducer for the problem David is running
into.
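
To illustrate the idea only (MPOL_CLOSE_NODE below is the hypothetical
mode, nothing like it exists today):

	/* hypothetical: keep this mapping close to the faulting CPU's
	 * node, reclaiming locally rather than falling back remotely */
	mbind(addr, length, MPOL_CLOSE_NODE, NULL, 0, 0);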
 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/mempolicy.h |  2 ++
>  mm/huge_memory.c          | 42 ++++++++++++++++++++++++---------------
>  mm/mempolicy.c            |  2 +-
>  3 files changed, 29 insertions(+), 17 deletions(-)
> 
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index 5228c62af416..bac395f1d00a 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -139,6 +139,8 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
>  struct mempolicy *get_task_policy(struct task_struct *p);
>  struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>  		unsigned long addr);
> +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> +						unsigned long addr);
>  bool vma_policy_mof(struct vm_area_struct *vma);
>  
>  extern void numa_default_policy(void);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 7efe68ba052a..784fd63800a2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -644,27 +644,37 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>  static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
>  {
>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> -	const gfp_t gfp_mask = GFP_TRANSHUGE_LIGHT | __GFP_THISNODE;
> +	gfp_t this_node = 0;
>  
> -	/* Always do synchronous compaction */
> -	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
> -		return GFP_TRANSHUGE | __GFP_THISNODE |
> -		       (vma_madvised ? 0 : __GFP_NORETRY);
> +#ifdef CONFIG_NUMA
> +	struct mempolicy *pol;
> +	/*
> +	 * __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not
> +	 * specified, to express a general desire to stay on the current
> +	 * node for optimistic allocation attempts. If the defrag mode
> +	 * and/or madvise hint requires the direct reclaim then we prefer
> +	 * to fallback to other node rather than node reclaim because that
> +	 * can lead to excessive reclaim even though there is free memory
> +	 * on other nodes. We expect that NUMA preferences are specified
> +	 * by memory policies.
> +	 */
> +	pol = get_vma_policy(vma, addr);
> +	if (pol->mode != MPOL_BIND)
> +		this_node = __GFP_THISNODE;
> +	mpol_cond_put(pol);
> +#endif
>  
> -	/* Kick kcompactd and fail quickly */
> +	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
> +		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
> -		return gfp_mask | __GFP_KSWAPD_RECLAIM;
> -
> -	/* Synchronous compaction if madvised, otherwise kick kcompactd */
> +		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
> -		return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> -						  __GFP_KSWAPD_RECLAIM);
> -
> -	/* Only do synchronous compaction if madvised */
> +		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> +							     __GFP_KSWAPD_RECLAIM | this_node);
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
> -		return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM : 0);
> -
> -	return gfp_mask;
> +		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> +							     this_node);
> +	return GFP_TRANSHUGE_LIGHT | this_node;
>  }
>  
>  /* Caller must hold page table lock. */
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 74e44000ad61..341e3d56d0a6 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1688,7 +1688,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>   * freeing by another task.  It is the caller's responsibility to free the
>   * extra reference for shared policies.
>   */
> -static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>  						unsigned long addr)
>  {
>  	struct mempolicy *pol = __get_vma_policy(vma, addr);
Mel Gorman May 9, 2019, 8:38 a.m. UTC | #2
On Fri, May 03, 2019 at 06:31:46PM -0400, Andrea Arcangeli wrote:
> This reverts commit 2f0799a0ffc033bf3cc82d5032acc3ec633464c2.
> 
> commit 2f0799a0ffc033bf3cc82d5032acc3ec633464c2 was rightfully applied
> to avoid the risk of a severe regression that was reported by the
> kernel test robot at the end of the merge window. Now we understood
> the regression was a false positive and was caused by a significant
> increase in fairness during a swap trashing benchmark. So it's safe to
> re-apply the fix and continue improving the code from there. The
> benchmark that reported the regression is very useful, but it provides
> a meaningful result only when there is no significant alteration in
> fairness during the workload. The removal of __GFP_THISNODE increased
> fairness.
> 
> __GFP_THISNODE cannot be used in the generic page faults path for new
> memory allocations under the MPOL_DEFAULT mempolicy, or the allocation
> behavior significantly deviates from what the MPOL_DEFAULT semantics
> are supposed to be for THP and 4k allocations alike.
> 
> Setting THP defrag to "always" or using MADV_HUGEPAGE (with THP defrag
> set to "madvise") has never meant to provide an implicit MPOL_BIND on
> the "current" node the task is running on, causing swap storms and
> providing a much more aggressive behavior than even zone_reclaim_node
> = 3.
> 
> Any workload who could have benefited from __GFP_THISNODE has now to
> enable zone_reclaim_mode=1||2||3. __GFP_THISNODE implicitly provided
> the zone_reclaim_mode behavior, but it only did so if THP was enabled:
> if THP was disabled, there would have been no chance to get any 4k
> page from the current node if the current node was full of pagecache,
> which further shows how this __GFP_THISNODE was misplaced in
> MADV_HUGEPAGE. MADV_HUGEPAGE has never been intended to provide any
> zone_reclaim_mode semantics, in fact the two are orthogonal,
> zone_reclaim_mode = 1|2|3 must work exactly the same with
> MADV_HUGEPAGE set or not.
> 
> The performance characteristic of memory depends on the hardware
> details. The numbers below are obtained on Naples/EPYC architecture
> and the N/A projection extends them to show what we should aim for in
> the future as a good THP NUMA locality default. The benchmark used
> exercises random memory seeks (note: the cost of the page faults is
> not part of the measurement).
> 
> D0 THP | D0 4k | D1 THP | D1 4k | D2 THP | D2 4k | D3 THP | D3 4k | ...
> 0%     | +43%  | +45%   | +106% | +131%  | +224% | N/A    | N/A
> 
> D0 means distance zero (i.e. local memory), D1 means distance
> one (i.e. intra socket memory), D2 means distance two (i.e. inter
> socket memory), etc...
> 
> For the guest physical memory allocated by qemu and for guest mode kernel
> the performance characteristic of RAM is more complex and an ideal
> default could be:
> 
> D0 THP | D1 THP | D0 4k | D2 THP | D1 4k | D3 THP | D2 4k | D3 4k | ...
> 0%     | +58%   | +101% | N/A    | +222% | N/A    | N/A   | N/A
> 
> NOTE: the N/A are projections and haven't been measured yet, the
> measurement in this case is done on a 1950x with only two NUMA nodes.
> The THP case here means THP was used both in the host and in the
> guest.
> 
> After applying this commit the THP NUMA locality order that we'll get
> out of MADV_HUGEPAGE is this:
> 
> D0 THP | D1 THP | D2 THP | D3 THP | ... | D0 4k | D1 4k | D2 4k | D3 4k | ...
> 
> Before this commit it was:
> 
> D0 THP | D0 4k | D1 4k | D2 4k | D3 4k | ...
> 
> Even if we ignore the breakage of large workloads that can't fit in a
> single node that the __GFP_THISNODE implicit "current node" mbind
> caused, the THP NUMA locality order provided by __GFP_THISNODE was
> still not the one we shall aim for in the long term (i.e. the first
> one at the top).
> 
> After this commit is applied, we can introduce a new allocator multi
> order API and to replace those two alloc_pages_vmas calls in the page
> fault path, with a single multi order call:
> 
> 	unsigned int order = (1 << HPAGE_PMD_ORDER) | (1 << 0);
> 	page = alloc_pages_multi_order(..., &order);
> 	if (!page)
> 		goto out;
> 	if (!(order & (1 << 0))) {
> 		VM_WARN_ON(order != 1 << HPAGE_PMD_ORDER);
> 		/* THP fault */
> 	} else {
> 		VM_WARN_ON(order != 1 << 0);
> 		/* 4k fallback */
> 	}
> 
> The page allocator logic has to be altered so that when it fails on
> any zone with order 9, it has to try again with a order 0 before
> falling back to the next zone in the zonelist.
> 
> After that we need to do more measurements and evaluate if adding an
> opt-in feature for guest mode is worth it, to swap "DN 4k | DN+1 THP"
> with "DN+1 THP | DN 4k" at every NUMA distance crossing.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Mel Gorman <mgorman@suse.de>
David Rientjes May 15, 2019, 8:26 p.m. UTC | #3
On Fri, 3 May 2019, Andrea Arcangeli wrote:

> This reverts commit 2f0799a0ffc033bf3cc82d5032acc3ec633464c2.
> 
> commit 2f0799a0ffc033bf3cc82d5032acc3ec633464c2 was rightfully applied
> to avoid the risk of a severe regression that was reported by the
> kernel test robot at the end of the merge window. Now we understood
> the regression was a false positive and was caused by a significant
> increase in fairness during a swap trashing benchmark. So it's safe to
> re-apply the fix and continue improving the code from there. The
> benchmark that reported the regression is very useful, but it provides
> a meaningful result only when there is no significant alteration in
> fairness during the workload. The removal of __GFP_THISNODE increased
> fairness.
> 

Hi Andrea,

There was exhausting discussion subsequent to this that caused Linus to 
have to revert the offending commit late in an rc series, which is not 
described here.  This was after the offending commit, which this commit 
now reintroduces, was described as causing user-facing access latency 
regressions and nacked.  The same objection is obviously going to be made 
here and I'd really prefer if this could be worked out without yet another 
merge into -mm, push to Linus, and revert by Linus.  There are solutions 
to this issue that do not cause performance regressions for anybody, 
rather than reintroducing them for a class of users that use the 
overloaded MADV_HUGEPAGE for the purposes it has provided them over the 
past three years.

> __GFP_THISNODE cannot be used in the generic page faults path for new
> memory allocations under the MPOL_DEFAULT mempolicy, or the allocation
> behavior significantly deviates from what the MPOL_DEFAULT semantics
> are supposed to be for THP and 4k allocations alike.
> 

This isn't an argument in support of this patch; there is a difference 
between (1) pages of the native page size being faulted first locally and 
falling back remotely, and (2) hugepages being faulted first locally and 
falling back to native pages locally, because the latter has better access 
latency on most platforms for workloads that do not span multiple nodes.  
Note that the page allocator is unaware whether the workload spans 
multiple nodes so it cannot make this distinction today, and that's what 
I'd prefer to focus on rather than changing an overall policy for 
everybody.

> Setting THP defrag to "always" or using MADV_HUGEPAGE (with THP defrag
> set to "madvise") has never meant to provide an implicit MPOL_BIND on
> the "current" node the task is running on, causing swap storms and
> providing a much more aggressive behavior than even zone_reclaim_node
> = 3.
> 

It may not have been meant to provide this, but IBM changed this 
three years ago because of performance regressions, and others have since 
started to use MADV_HUGEPAGE with that policy in mind; it is the reality 
of what the madvise advice has provided.  What was meant to be the 
semantics of MADV_HUGEPAGE three years ago is irrelevant today if it 
introduces performance regressions for users who have used the advice 
mode during the past three years.

> Any workload who could have benefited from __GFP_THISNODE has now to
> enable zone_reclaim_mode=1||2||3. __GFP_THISNODE implicitly provided
> the zone_reclaim_mode behavior, but it only did so if THP was enabled:
> if THP was disabled, there would have been no chance to get any 4k
> page from the current node if the current node was full of pagecache,
> which further shows how this __GFP_THISNODE was misplaced in
> MADV_HUGEPAGE. MADV_HUGEPAGE has never been intended to provide any
> zone_reclaim_mode semantics, in fact the two are orthogonal,
> zone_reclaim_mode = 1|2|3 must work exactly the same with
> MADV_HUGEPAGE set or not.
> 
> The performance characteristic of memory depends on the hardware
> details. The numbers below are obtained on Naples/EPYC architecture
> and the N/A projection extends them to show what we should aim for in
> the future as a good THP NUMA locality default. The benchmark used
> exercises random memory seeks (note: the cost of the page faults is
> not part of the measurement).
> 
> D0 THP | D0 4k | D1 THP | D1 4k | D2 THP | D2 4k | D3 THP | D3 4k | ...
> 0%     | +43%  | +45%   | +106% | +131%  | +224% | N/A    | N/A
> 

The performance measurements that we have on Naples show a more 
significant change between D0 4k and D1 THP: it certainly is not 2% worse 
access latency to a remote hugepage compared to local native pages.

> D0 means distance zero (i.e. local memory), D1 means distance
> one (i.e. intra socket memory), D2 means distance two (i.e. inter
> socket memory), etc...
> 
> For the guest physical memory allocated by qemu and for guest mode kernel
> the performance characteristic of RAM is more complex and an ideal
> default could be:
> 
> D0 THP | D1 THP | D0 4k | D2 THP | D1 4k | D3 THP | D2 4k | D3 4k | ...
> 0%     | +58%   | +101% | N/A    | +222% | N/A    | N/A   | N/A
> 
> NOTE: the N/A are projections and haven't been measured yet, the
> measurement in this case is done on a 1950x with only two NUMA nodes.
> The THP case here means THP was used both in the host and in the
> guest.
> 

Yes, this is clearly understood and was never objected to when this first 
came up in the thread where __GFP_THISNODE was removed or when Linus 
reverted the patch.

The issue being discussed here is a result of MADV_HUGEPAGE being 
overloaded: it cannot mean to control (1) how much compaction/reclaim is 
done for page allocation, (2) the NUMA locality of those hugepages, and 
(3) the eligibility of the memory to be collapsed into hugepages by 
khugepaged all at the same time.

I suggested then that we actually define (2) concretely, specifically for 
the use case that you mention.  Changing the behavior MADV_HUGEPAGE has 
had for the past three years, however, and introducing performance 
regressions for those users is not an option regardless of the intent it 
had when developed.

I suggested two options: (1) a __MPOL_F_HUGE flag to set a mempolicy for 
specific memory ranges so that you can define thp specific mempolicies 
(Vlastimil considered this to be a lot of work, with which I agreed) or 
(2) a prctl() mode to specify that a workload will span multiple sockets 
and benefits from remote hugepage allocation over local native pages (or 
because it is faulting memory remotely that it will access locally at some 
point in the future depending on cpu binding).  Any prctl() mode can be 
inherited across fork so it can be used for the qemu case that you suggest 
and is a very simple change to make compared with (1).
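
As a sketch of (2), with a made-up name since no such prctl exists today:

	/* hypothetical: declare that this workload spans sockets and
	 * prefers remote hugepages over local native pages; like any
	 * prctl() setting it would be inherited across fork */
	prctl(PR_SET_THP_REMOTE, 1, 0, 0, 0);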

Please consider methods to accomplish this goal that will not cause 
existing users of MADV_HUGEPAGE to incur 13.9% access latency regressions 
and have no way to work around it other than MPOL_BIND, which will 
introduce undeserved and unnecessary oom kills because we can't specify 
native page vs hugepage mempolicies independently.

I'm confident that everybody on this cc list is well aware of both sides 
of this discussion and I hope that we can work together to address it to 
achieve the goals of both.

Thanks.
Mel Gorman May 20, 2019, 3:36 p.m. UTC | #4
On Wed, May 15, 2019 at 01:26:26PM -0700, David Rientjes wrote:
> On Fri, 3 May 2019, Andrea Arcangeli wrote:
> 
> > This reverts commit 2f0799a0ffc033bf3cc82d5032acc3ec633464c2.
> > 
> > commit 2f0799a0ffc033bf3cc82d5032acc3ec633464c2 was rightfully applied
> > to avoid the risk of a severe regression that was reported by the
> > kernel test robot at the end of the merge window. Now we understood
> > the regression was a false positive and was caused by a significant
> > increase in fairness during a swap trashing benchmark. So it's safe to
> > re-apply the fix and continue improving the code from there. The
> > benchmark that reported the regression is very useful, but it provides
> > a meaningful result only when there is no significant alteration in
> > fairness during the workload. The removal of __GFP_THISNODE increased
> > fairness.
> > 
> 
> Hi Andrea,
> 
> There was exhausting discussion subsequent to this that caused Linus to 
> have to revert the offending commit late in an rc series that is not 
> described here. 

Yes, at the crux of that matter was which of the regressions introduced
was more important -- the one causing swap storms which Andrea is trying
to address or a latency issue due to assumptions of locality when
MADV_HUGEPAGE is used.

More people are affected by swap storms and distributions are carrying
out-of-tree patches to address it. Furthermore, multiple people unrelated
to each other can trivially reproduce the problem with test cases and
experience the problem with real workloads. Only you have a realistic
workload sensitive to the latency issue and we've asked repeatedly for
a test case (most recently Michal Hocko on May 4th) which is still not
available.

Currently the revert has to be carried out of tree for at least some
distributions which means any mainline users will have very different
experiences to distro users. Normally this is true anyway, but this is
an extreme example. The most important point is that the latency
problems are relatively difficult for a normal user to detect but a swap
storm is extremely easy to detect. We should be defaulting to the
behaviour that causes the least overall harm.

> This was after the offending commit, which this commit 
> now reintroduces, was described as causing user facing access latency 
> regressions and nacked.  The same objection is obviously going to be made 
> here and I'd really prefer if this could be worked out without yet another 
> merge into -mm, push to Linus, and revert by Linus.  There are solutions 
> to this issue that does not cause anybody to have performance regressions 
> rather than reintroducing them for a class of users that use the 
> overloaded MADV_HUGEPAGE for the purposes it has provided them over the 
> past three years.
> 

There are no solutions implemented although there are solutions possible --
however, forcing zone_reclaim_mode-like behaviour when zone_reclaim_mode
is disabled makes the path forward hazardous.

> > __GFP_THISNODE cannot be used in the generic page faults path for new
> > memory allocations under the MPOL_DEFAULT mempolicy, or the allocation
> > behavior significantly deviates from what the MPOL_DEFAULT semantics
> > are supposed to be for THP and 4k allocations alike.
> > 
> 
> This isn't an argument in support of this patch, there is a difference 
> between (1) pages of the native page size being faulted first locally
> falling back remotely and (2) hugepages being faulted first locally and 
> falling back to native pages locally because it has better access latency 
> on most platforms for workloads that do not span multiple nodes.  Note 
> that the page allocator is unaware whether the workload spans multiple 
> nodes so it cannot make this distinction today, and that's what I'd prefer 
> to focus on rather than changing an overall policy for everybody.
> 

Overall, I think it would be ok to have behaviour whereby local THP is
allocated if it can be done cheaply, followed by local base pages,
followed by the remote options. However, __GFP_THISNODE removes the
possibility of allowing remote fallback and instead causes a swap storm,
and swap storms are trivial to generate on a NUMA machine running a
mainline kernel today.

If a workload really prefers local huge pages even if that incurs
reclaim, it should be a deliberate choice.

> > Setting THP defrag to "always" or using MADV_HUGEPAGE (with THP defrag
> > set to "madvise") has never meant to provide an implicit MPOL_BIND on
> > the "current" node the task is running on, causing swap storms and
> > providing a much more aggressive behavior than even zone_reclaim_node
> > = 3.
> > 
> 
> It may not have been meant to provide this, but when IBM changed this 
> three years ago because of performance regressions and others have started 
> to use MADV_HUGEPAGE with that policy in mind, it is the reality of what 
> the madvise advice has provided.  What was meant to be semantics of 
> MADV_HUGEPAGE three years ago is irrelevant today if it introduces 
> performance regressions for users who have used the advice mode during 
> that past three years.
> 

Incurring swap storms when there is plenty of free memory available is
terrible. If the working set size is larger than a node, the swap storm
may even persist indefinitely.

> > Any workload who could have benefited from __GFP_THISNODE has now to
> > enable zone_reclaim_mode=1||2||3. __GFP_THISNODE implicitly provided
> > the zone_reclaim_mode behavior, but it only did so if THP was enabled:
> > if THP was disabled, there would have been no chance to get any 4k
> > page from the current node if the current node was full of pagecache,
> > which further shows how this __GFP_THISNODE was misplaced in
> > MADV_HUGEPAGE. MADV_HUGEPAGE has never been intended to provide any
> > zone_reclaim_mode semantics, in fact the two are orthogonal,
> > zone_reclaim_mode = 1|2|3 must work exactly the same with
> > MADV_HUGEPAGE set or not.
> > 
> > The performance characteristic of memory depends on the hardware
> > details. The numbers below are obtained on Naples/EPYC architecture
> > and the N/A projection extends them to show what we should aim for in
> > the future as a good THP NUMA locality default. The benchmark used
> > exercises random memory seeks (note: the cost of the page faults is
> > not part of the measurement).
> > 
> > D0 THP | D0 4k | D1 THP | D1 4k | D2 THP | D2 4k | D3 THP | D3 4k | ...
> > 0%     | +43%  | +45%   | +106% | +131%  | +224% | N/A    | N/A
> > 
> 
> The performance measurements that we have on Naples shows a more 
> significant change between D0 4k and D1 THP: it certainly is not 2% worse 
> access latency to a remote hugepage compared to local native pages.
> 

Please share the workload in question so there is at least some chance
of developing a series that allocates locally (regardless of page size)
as much as possible without incurring swap storms.

> > D0 means distance zero (i.e. local memory), D1 means distance
> > one (i.e. intra socket memory), D2 means distance two (i.e. inter
> > socket memory), etc...
> > 
> > For the guest physical memory allocated by qemu and for guest mode kernel
> > the performance characteristic of RAM is more complex and an ideal
> > default could be:
> > 
> > D0 THP | D1 THP | D0 4k | D2 THP | D1 4k | D3 THP | D2 4k | D3 4k | ...
> > 0%     | +58%   | +101% | N/A    | +222% | N/A    | N/A   | N/A
> > 
> > NOTE: the N/A are projections and haven't been measured yet, the
> > measurement in this case is done on a 1950x with only two NUMA nodes.
> > The THP case here means THP was used both in the host and in the
> > guest.
> > 
> 
> Yes, this is clearly understood and was never objected to when this first 
> came up in the thread where __GFP_THISNODE was removed or when Linus 
> reverted the patch.
> 
> The issue being discussed here is a result of MADV_HUGEPAGE being 
> overloaded: it cannot mean to control (1) how much compaction/reclaim is 
> done for page allocation, (2) the NUMA locality of those hugepages, and 
> (3) the eligibility of the memory to be collapsed into hugepages by 
> khugepaged all at the same time.
> 
> I suggested then that we actually define (2) concretely specifically for 
> the usecase that you mention.  Changing the behavior of MADV_HUGEPAGE for 
> the past three years, however, and introducing performance regressions for 
> those users is not an option regardless of the intent that it had when 
> developed.
> 

The current behaviour is potential swap storms that are difficult to
avoid because the behaviour is hard-wired into the kernel internals.
Only disabling THP, either on a task or global basis, avoids it when
encountered.

> I suggested two options: (1) __MPOL_F_HUGE flag to set a mempolicy for 
> specific memory ranges so that you can define thp specific mempolicies 
> (Vlastimil considered this to be a lot of work, which I agreed) or (2) a 
> prctl() mode to specify that a workload will span multiple sockets and 
> benefits from remote hugepage allocation over local native pages (or 
> because it is faulting memory remotely that it will access locally at some 
> point in the future depending on cpu binding).  Any prctl() mode can be 
> inherited across fork so it can be used for the qemu case that you suggest 
> and is a very simple change to make compared with (1).
> 

The prctl should be the other way around -- use prctl if THP trumps all
else even if that means reclaiming a large amount of memory or swapping.

> Please consider methods to accomplish this goal that will not cause 
> existing users of MADV_HUGEPAGE to incur 13.9% access latency regressions 
> and have no way to workaround without MPOL_BIND that will introduce 
> undeserved and unnecessary oom kills because we can't specify native page 
> vs hugepage mempolicies independently.
> 
> I'm confident that everybody on this cc list is well aware of both sides 
> of this discussion and I hope that we can work together to address it to 
> achieve the goals of both.
> 

Please start by sharing the workload in question. However, I'm still
willing to bet that a swap storm causes significantly more harm for more
users than a latency penalty for a workload that is extremely latency
sensitive. The latency of a swap out/in is far more than a 13.9% access
latency.
David Rientjes May 20, 2019, 5:54 p.m. UTC | #5
On Mon, 20 May 2019, Mel Gorman wrote:

> > There was exhausting discussion subsequent to this that caused Linus to 
> > have to revert the offending commit late in an rc series that is not 
> > described here. 
> 
> Yes, at the crux of that matter was which regression introduced was more
> important -- the one causing swap storms which Andrea is trying to address
> or a latency issue due to assumptions of locality when MADV_HUGEPAGE
> is used.
> 
> More people are affected by swap storms and distributions are carrying
> out-of-tree patches to address it. Furthermore, multiple people unrelated
> to each other can trivially reproduce the problem with test cases and
> experience the problem with real workloads. Only you has a realistic
> workload sensitive to the latency issue and we've asked repeatedly for
> a test case (most recently Michal Hocko on May 4th) which is still not
> available.
> 

Hi Mel,

Any workload that does MADV_HUGEPAGE will be impacted if remote hugepage 
access latency is greater than local native page access latency and is 
using the long-standing behavior of the past three years.  The test case 
would be rather straightforward: induce node-local fragmentation (easiest 
to do by injecting a kernel module), do MADV_HUGEPAGE over a large range, 
fault, and measure random access latency.  This is readily observable and 
can be done synthetically to measure the random access latency of local 
native pages vs remote hugepages.  Andrea provided this test case in the 
original thread.  My results from right now:

# numactl -m 0 -C 0 ./numa-thp-bench
random writes MADV_HUGEPAGE 17492771 usec
random writes MADV_NOHUGEPAGE 21344846 usec
random writes MADV_NOHUGEPAGE 21399545 usec
random writes MADV_HUGEPAGE 17481949 usec
# numactl -m 0 -C 64 ./numa-thp-bench
random writes MADV_HUGEPAGE 26858061 usec
random writes MADV_NOHUGEPAGE 31067825 usec
random writes MADV_NOHUGEPAGE 31334770 usec
random writes MADV_HUGEPAGE 26785942 usec

That's 25.8% greater random access latency when going across socket vs 
accessing local native pages.

> > This isn't an argument in support of this patch, there is a difference 
> > between (1) pages of the native page size being faulted first locally
> > falling back remotely and (2) hugepages being faulted first locally and 
> > falling back to native pages locally because it has better access latency 
> > on most platforms for workloads that do not span multiple nodes.  Note 
> > that the page allocator is unaware whether the workload spans multiple 
> > nodes so it cannot make this distinction today, and that's what I'd prefer 
> > to focus on rather than changing an overall policy for everybody.
> > 
> 
> Overall, I think it would be ok to have behaviour whereby local THP is
> allocated if cheaply, followed by base pages local followed by the remote
> options. However, __GFP_THISNODE removes the possibility of allowing
> remote fallback and instead causing a swap storm and swap storms are
> trivial to generate on NUMA machine running a mainline kernel today.
> 

Yes, this is hopefully what we can focus on and I hope we can make forward 
progress with (1) extending mempolicies to allow specifying hugepage 
specific policies, (2) the prctl(), (3) improving the feedback loop 
between compaction and direct reclaim, and/or (4) resolving the 
overloaded and conflicting meanings of 
/sys/kernel/mm/transparent_hugepage/{enabled,defrag} and 
MADV_HUGEPAGE/MADV_NOHUGEPAGE.

The issue here is the overloaded nature of what MADV_HUGEPAGE means and 
what the system-wide thp settings mean.  It cannot possibly provide sane 
behavior of all possible workloads given only two settings.  MADV_HUGEPAGE 
itself has *four* meanings: (1) determine hugepage eligibility when not 
default, (2) try to do synchronous compaction/reclaim at fault, (3) 
determine eligibility of khugepaged, (4) control defrag settings based on 
the system-wide setting.  The patch here is adding a fifth: (5) prefer 
remote allocation when local memory is fragmented.  None of this is 
sustainable.

Note that this patch is also preferring remote hugepage allocation *over* 
local hugepages before trying memory compaction locally, depending on the 
setting of vm.zone_reclaim_mode, so it is infringing on the long-standing 
behavior of (2) as well.

In situations such as these, it is not surprising that there are issues 
reported with any combination of flags or settings, and that the patches 
which get proposed are very workload dependent.  My suggestion has been to 
move in a direction where this can be resolved such that userspace has a 
clean and stable API and we can allow remote hugepage allocation for 
workloads that specifically opt in, but not incur 25.8% greater access 
latency for using the behavior of the past 3+ years.

Another point that has consistently been raised on LKML is the inability 
to disable MADV_HUGEPAGE once set: i.e. if you set it for faulting your 
workload, you are required to do MADV_NOHUGEPAGE to clear it and then are 
explicitly asking that this memory is not backed by hugepages.
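
For example, with plain madvise() calls:

	madvise(buf, len, MADV_HUGEPAGE);	/* opt in for faulting */
	/* ... there is no advice value to return the range to the
	 * system default THP policy ... */
	madvise(buf, len, MADV_NOHUGEPAGE);	/* the only way to clear it,
						 * but this now excludes the
						 * range from THP entirely */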

> > It may not have been meant to provide this, but when IBM changed this 
> > three years ago because of performance regressions and others have started 
> > to use MADV_HUGEPAGE with that policy in mind, it is the reality of what 
> > the madvise advice has provided.  What was meant to be semantics of 
> > MADV_HUGEPAGE three years ago is irrelevant today if it introduces 
> > performance regressions for users who have used the advice mode during 
> > that past three years.
> > 
> 
> Incurring swap storms when there is plenty of free memory available is
> terrible. If the working set size is larger than a node, the swap storm
> may even persist indefinitely.
> 

Let's fix it.

> > Yes, this is clearly understood and was never objected to when this first 
> > came up in the thread where __GFP_THISNODE was removed or when Linus 
> > reverted the patch.
> > 
> > The issue being discussed here is a result of MADV_HUGEPAGE being 
> > overloaded: it cannot mean to control (1) how much compaction/reclaim is 
> > done for page allocation, (2) the NUMA locality of those hugepages, and 
> > (3) the eligibility of the memory to be collapsed into hugepages by 
> > khugepaged all at the same time.
> > 
> > I suggested then that we actually define (2) concretely specifically for 
> > the usecase that you mention.  Changing the behavior of MADV_HUGEPAGE for 
> > the past three years, however, and introducing performance regressions for 
> > those users is not an option regardless of the intent that it had when 
> > developed.
> > 
> 
> The current behaviour is potential swap storms that are difficult to
> avoid because the behaviour is hard-wired into the kernel internals.
> Only disabling THP, either on a task or global basis, avoids it when
> encountered.
> 

We are going in circles, *yes* there is a problem for potential swap 
storms today because of the poor interaction between memory compaction and 
directed reclaim but this is a result of a poor API that does not allow 
userspace to specify that its workload really will span multiple sockets 
so faulting remotely is the best course of action.  The fix is not to 
cause regressions for others who have implemented a userspace stack that 
is based on the past 3+ years of long standing behavior or for specialized 
workloads where it is known that it spans multiple sockets so we want some 
kind of different behavior.  We need to provide a clear and stable API to 
define these terms for the page allocator that is independent of any 
global setting of thp enabled, defrag, zone_reclaim_mode, etc.  It's 
workload dependent.

Is it deserving of an MADV_REMOTE_HUGEPAGE, a prctl(), hugepage extended 
mempolicies, or something else?  I'm positive that the problem is 
understood and we could reach some form of consensus before an 
implementation is invested into if we actually discuss ways to fix the 
underlying issue.

Thanks.
Andrew Morton May 24, 2019, 12:57 a.m. UTC | #6
On Mon, 20 May 2019 10:54:16 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:

> We are going in circles, *yes* there is a problem for potential swap 
> storms today because of the poor interaction between memory compaction and 
> directed reclaim but this is a result of a poor API that does not allow 
> userspace to specify that its workload really will span multiple sockets 
> so faulting remotely is the best course of action.  The fix is not to 
> cause regressions for others who have implemented a userspace stack that 
> is based on the past 3+ years of long standing behavior or for specialized 
> workloads where it is known that it spans multiple sockets so we want some 
> kind of different behavior.  We need to provide a clear and stable API to 
> define these terms for the page allocator that is independent of any 
> global setting of thp enabled, defrag, zone_reclaim_mode, etc.  It's 
> workload dependent.

um, who is going to do this work?

Implementing a new API doesn't help existing userspace which is hurting
from the problem which this patch addresses.

It does appear to me that this patch does more good than harm for the
totality of kernel users, so I'm inclined to push it through and to try
to talk Linus out of reverting it again.
Mel Gorman May 24, 2019, 10:07 a.m. UTC | #7
On Mon, May 20, 2019 at 10:54:16AM -0700, David Rientjes wrote:
> On Mon, 20 May 2019, Mel Gorman wrote:
> 
> > > There was exhausting discussion subsequent to this that caused Linus to 
> > > have to revert the offending commit late in an rc series that is not 
> > > described here. 
> > 
> > Yes, at the crux of that matter was which regression introduced was more
> > important -- the one causing swap storms which Andrea is trying to address
> > or a latency issue due to assumptions of locality when MADV_HUGEPAGE
> > is used.
> > 
> > More people are affected by swap storms and distributions are carrying
> > out-of-tree patches to address it. Furthermore, multiple people unrelated
> > to each other can trivially reproduce the problem with test cases and
> > experience the problem with real workloads. Only you has a realistic
> > workload sensitive to the latency issue and we've asked repeatedly for
> > a test case (most recently Michal Hocko on May 4th) which is still not
> > available.
> > 
> 
> Hi Mel,
> 
> Any workload that does MADV_HUGEPAGE will be impacted if remote hugepage 
> access latency is greater than local native page access latency and is 
> using the long-standing behavior of the past three years. 

And prior to that, THP usage could cause massive latencies due to reclaim
and compaction; that behaviour was adjusted over time to cause the least
harm. We've had changes in behaviour for THP and madvise before -- largely
due to cases where THP allocation caused large stalls that users found
surprising. These stalls generated quite a substantial number of bugs in
the field.

As before, what is important is causing the least harm to the most
people when corner cases are hit.

> The test case 
> would be rather straight forward: induce node local fragmentation (easiest 
> to do by injecting a kernel module), do MADV_HUGEPAGE over a large range, 
> fault, and measure random access latency.  This is readily observable and 
> can be done synthetically to measure the random access latency of local 
> native pages vs remote hugepages.  Andrea provided this testcase in the 
> original thread.  My results from right now:
> 
> # numactl -m 0 -C 0 ./numa-thp-bench
> random writes MADV_HUGEPAGE 17492771 usec
> random writes MADV_NOHUGEPAGE 21344846 usec
> random writes MADV_NOHUGEPAGE 21399545 usec
> random writes MADV_HUGEPAGE 17481949 usec
> # numactl -m 0 -C 64 ./numa-thp-bench
> random writes MADV_HUGEPAGE 26858061 usec
> random writes MADV_NOHUGEPAGE 31067825 usec
> random writes MADV_NOHUGEPAGE 31334770 usec
> random writes MADV_HUGEPAGE 26785942 usec
> 

Ok, let's consider two scenarios.

The first one is very basic -- using a large buffer that is larger than
a memory node size. The demonstration program is simple

--8<-- mmap-demo.c --8<--
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LOOPS 3
#ifndef MADV_FLAGS
#define MADV_FLAGS 0
#endif

int main(int argc, char **argv)
{
	char *buf;
	int i;
	size_t length;

	if (argc != 2) {
		printf("Specify buffer size in bytes\n");
		exit(EXIT_FAILURE);
	}

	length = atol(argv[1]) & ~4095;
	buf = mmap(NULL, length, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
	if (buf == MAP_FAILED) {
		perror("mmap failed");
		exit(EXIT_FAILURE);
	}

	if (MADV_FLAGS)
		madvise(buf, length, MADV_FLAGS);

	printf("Address %p Length %lu MB\n", buf, length / 1048576);
	for (i = 0; i < LOOPS; i++) {
		memset(buf, i, length);
		printf(".");
		fflush(NULL);
	}
	printf("\n");
}
--8<-- mmap-demo.c --8<--

All it's doing is writing a large anonymous array. Let's see how it
behaves

# Set buffer size to 80% of memory -- machine has 2 nodes that are
# equal size so this will spill over
$ BUFSIZE=$((`free -b | grep Mem: | awk '{print $2}'`*8/10))

# Scenario 1: default setting, no madvise. Using CPUs from only one
# node as otherwise numa balancing or cpu balancing will migrate
# the task based on locality. Not particularly unusual when packing
# virtual machines in a box
$ gcc -O2 mmap-demo.c -o mmap-demo && numactl --cpunodebind 0 /usr/bin/time ./mmap-demo $BUFSIZE
Address 0x7fdc5b890000 Length 51236 MB
...
25.48user 30.19system 0:55.68elapsed 99%CPU (0avgtext+0avgdata 52467180maxresident)k
0inputs+0outputs (0major+15388156minor)pagefaults 0swaps

Total time is 55.68 seconds to execute, lots of minor faults for the
allocations (some may be NUMA balancing). vmstat for the time it was
running was as follows

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 65001004     32 279528    0    1     1     1    1    1  0  0 100  0  0
 1  0      0 48025796     32 279516    0    0     0     1  281   75  0  2 98  0  0
 1  0      0 30927108     32 279444    0    0     0     0  285   54  0  2 98  0  0
 1  0      0 22250504     32 279284    0    0     0     0  277   44  0  2 98  0  0
 1  0      0 13665116     32 279272    0    0     0     0  288   67  0  2 98  0  0
 1  0      0 12432096     32 279196    0    0     0     0  276   46  2  0 98  0  0
 1  0      0 12429580     32 279452    0    0     0   598  297   96  1  1 98  0  0
 1  0      0 12429604     32 279432    0    0     0     3  278   50  1  1 98  0  0
 1  0      0 12429856     32 279432    0    0     0     0  289   68  1  1 98  0  0
 1  0      0 12429864     32 279420    0    0     0     0  275   43  2  0 98  0  0
 1  0      0 12429936     32 279420    0    0     0     0  298   61  1  1 98  0  0
 1  0      0 12429944     32 279416    0    0     0     0  275   42  1  1 98  0  0

That's fairly straightforward. Memory gets used, no particularly
unusual activity when the buffer is allocated and updated. Now, let's
use MADV_HUGEPAGE

$ gcc -DMADV_FLAGS=MADV_HUGEPAGE -O2 mmap-demo.c -o mmap-demo && numactl --cpunodebind 0 /usr/bin/time ./mmap-demo $BUFSIZE
Address 0x7fe8b947d000 Length 51236 MB
...
25.46user 33.12system 1:06.80elapsed 87%CPU (0avgtext+0avgdata 52467172maxresident)k
1932184inputs+0outputs (30197major+15103633minor)pagefaults 0swaps

Just 10 seconds more due to being a simple case with few loops, but look
at the major faults: they are non-zero even though there was plenty of
memory. Let's look at vmstat

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 64997812     32 279380    0    1     1     1    1    1  0  0 100  0  0
 1  0      0 47230624     32 279392    0    0     0     1  286   74  0  2 98  0  0
 0  1 324104 32915756     32 233752    0 64786     3 64790  350  330  0  1 98  0  0
 1  0 1048572 31166652     32 223076   32 144950    32 144950  485 2117  0  1 99  1  0
 1  0 1048572 23839632     32 223076    0    0     0     0  277 5777  0  2 98  0  0
 1  0 1048572 16598116     32 223064    0    0     0     0  281 5714  0  2 98  0  0
 0  1 502444 12947660     32 223064 107547    0 107549     0 3840 16245  0  1 99  0  0
 1  0    944 12515736     32 224368 85670    0 85737   629 3219 11098  1  0 99  0  0
 1  0    944 12514224     32 224368    0    0     0     0  275   42  2  1 98  0  0
 1  0    944 12514228     32 224364    0    0     0     0  280   52  1  1 98  0  0
 1  0    944 12514228     32 224364    0    0     0     0  275   44  1  1 98  0  0
 1  0    944 12513712     32 224364    0    0     0     0  291   69  2  0 98  0  0
 1  0    944 12513964     32 224364    0    0     0     0  274   43  1  1 98  0  0
 1  1 216228 12643952     32 224132    0 43008     0 43008  747  130  1  1 98  0  0
 1  0   1188 65081364     32 224464   57  819    62   819  296  111  0  1 99  0  0

That is showing large amounts of swaps out and in. This demonstration
case could be made much worse but it's illustrative of what has been
observed -- __GFP_THISNODE is harmful to MADV_HUGEPAGE.

Now, contrast this with your example

o Induce node local fragmentation using a kernel module that must be
  developed. No description on whether this should be equivalent to
  anonymous memory, file-backed or pinned like it was slab objects.
  No statement on whether the module memory should be able to migrate
  like what compaction does.
o Measure random access latency -- requires specific application
  knowledge or detailed perf analysis
o Applicable to applications that are extremely latency sensitive only

My example can be demonstrated by a 1st year computer programmer with
minimal effort. It is also visible to anyone creating a KVM instance
that is larger than a NUMA node if the virtual machine is using enough
of its memory.

Your example requires implementation of a kernel module, with much
guesswork as to what is a realistic means of fragmenting memory, and then
implementing an application that is latency sensitive.

The bottom line is that far more people with much less experience can
detect a swap storm and know it's bad. Furthermore, if MADV_HUGEPAGE is
used by something like KVM, there isn't a workaround except for
disabling THP for the application which for KVM is a big penalty. Your
scenario of having a latency access penalty is harder to detect and
depends on the state of the system at the time the application executes.

Contrast that with the workarounds for your situation where the system
is fragmented. There are multiple choices

1. Enable zone reclaim mode for the initialisation phase
2. Memory bind the application to the target node
3. "Flush" memory before the application starts with something like
   numactl --membind=0 memhog -r10 $HIGH_PERCENTAGE_OF_NODE_0

It's clumsy but it's workable in the short term.

> > > This isn't an argument in support of this patch, there is a difference 
> > > between (1) pages of the native page size being faulted first locally
> > > falling back remotely and (2) hugepages being faulted first locally and 
> > > falling back to native pages locally because it has better access latency 
> > > on most platforms for workloads that do not span multiple nodes.  Note 
> > > that the page allocator is unaware whether the workload spans multiple 
> > > nodes so it cannot make this distinction today, and that's what I'd prefer 
> > > to focus on rather than changing an overall policy for everybody.
> > > 
> > 
> > Overall, I think it would be ok to have behaviour whereby local THP is
> > allocated if cheaply, followed by base pages local followed by the remote
> > options. However, __GFP_THISNODE removes the possibility of allowing
> > remote fallback and instead causing a swap storm and swap storms are
> > trivial to generate on NUMA machine running a mainline kernel today.
> > 
> 
> Yes, this is hopefully what we can focus on and I hope we can make forward 
> progress with (1) extending mempolicies to allow specifying hugepage 
> specific policies, (2) the prctl(), (3) improving the feedback loop 
> between compaction and direct reclaim, and/or (4) resolving the overloaded 
> and conflicting meanings of 
> /sys/kernel/mm/transparent_hugepage/{enabled,defrag} and 
> MADV_HUGEPAGE/MADV_NOHUGEPAGE.
> 

3 should be partially addressed by the latest compaction series; it's
unclear how far it goes for your case because it cannot be trivially
reproduced outside of your test environment. I never got any report back
on how it affected your workload but for the trivial cases, it helped
(modulo bugs that had to be fixed for corner cases on zone boundary
handling).

1, 2 and 4 are undefined at this point because it's unclear what sort of
policies would suit your given scenario and whether you would even be
willing to rebuild the applications. For everybody else it's a simple
"do not use __GFP_THISNODE for MADV_HUGEPAGE". In the last few months,
there has also been no evidence of what policies would suit you, nor any
associated patches.

What you want is zone_reclaim_mode for huge pages but, for whatever
reason, you are unwilling to enable zone_reclaim_mode. However, it would
make some sense to extend it. The current definition is:

This is value ORed together of
1       = Zone reclaim on
2       = Zone reclaim writes dirty pages out
4       = Zone reclaim swaps pages

An optional extra would be

8	= Zone reclaim on for THP applications for MADV_HUGEPAGE
	  mappings to require both THP where possible and local
	  memory

The default would be off. Your systems would need to OR in the 8 value.

Would that be generally acceptable? It would give sensible default
behaviour for everyone and the option for those users that know for a
fact their application fits in a NUMA node *and* is latency sensitive to
remote accesses.
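
Purely as an illustration of where such a bit would be consumed (the
helper and the flag name below are made up, nothing like this exists
today; node_reclaim_mode is the variable backing the sysctl), the THP
fault path could do something along the lines of:

	/* hypothetical new bit for vm.zone_reclaim_mode */
	#define RECLAIM_THP_MADVISE	8

	static inline gfp_t thp_local_gfp(bool vma_madvised)
	{
		/*
		 * Pin THP faults to the local node only when the admin
		 * opted in via zone_reclaim_mode and the mapping is
		 * MADV_HUGEPAGE; everybody else falls back to remote
		 * nodes instead of reclaiming/swapping locally.
		 */
		if (vma_madvised && (node_reclaim_mode & RECLAIM_THP_MADVISE))
			return __GFP_THISNODE;
		return 0;
	}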

> The issue here is the overloaded nature of what MADV_HUGEPAGE means and 
> what the system-wide thp settings mean. 

MADV_HUGEPAGE is documented to mean "Enable Transparent Huge Pages (THP)
for pages in the range specified by addr and length". It says nothing
about locality. Locality decisions are set by policies, not madvise.
MPOL_BIND would be the obvious choice for strict locality but that is
not necessarily always the best decision. It is unclear if a policy like
MPOL_CPU_LOCAL for both base and THP allocations would actually help
you because the semantics could be defined in multiple ways. Critically,
there is little information on what level of effort the kernel should
make to give local memory.

> It cannot possibly provide sane 
> behavior of all possible workloads given only two settings.  MADV_HUGEPAGE 
> itself has *four* meanings: (1) determine hugepage eligibility when not 
> default, (2) try to do synchronous compaction/reclaim at fault, (3) 
> determine eligibility of khugepaged, (4) control defrag settings based on 
> system-wide setting.  The patch here is adding a fifth: (5) prefer remote 
> allocation when local memory is fragmented.  None of this is sustainable.
> 

Given that locality and reclaim behaviour for *all* pages was specified
by zone_reclaim_mode, it could be extended to cover special casing of THP.

> Note that this patch is also preferring remote hugepage allocation *over* 
> local hugepages before trying memory compaction locally depending on the 
> setting of vm.zone_reclaim_mode so it is infringing on the long-standing 
> behavior of (2) as well.
> 

Again, extending zone_reclaim_mode would act as a band-aid until the
various policies can be defined and agreed upon. Once that API is set,
it will be with us for a while and right now, we have swap storms.

> In situations such as these, it is not surprising that there are issues 
> reported with any combination of flags or settings and patches get 
> proposed that are very workload dependent.  My suggestion has been to move 
> in a direction where this can be resolved such that userspace has a clean 
> and stable API and we can allow remote hugepage allocation for workloads 
> that specifically opt-in, but not incur 25.8% greater access latency for 
> using the behavior of the past 3+ years.
> 

You suggest moving in some direction but have offered very little in
terms of reproducing your problematic scenario or defining exactly what
those policies should mean. For most people, if they want memory to be
local, they use MPOL_BIND and call it a day. It's not clear what policy
you would define that gets translated into behaviour you find acceptable.

> Another point that has consistently been raised on LKML is the inability 
> to disable MADV_HUGEPAGE once set: i.e. if you set it for faulting your 
> workload, you are required to do MADV_NOHUGEPAGE to clear it and then are 
> explicitly asking that this memory is not backed by hugepages.
> 

I missed that one but it does not sound like an impossible problem to
define another MADV flag for it.

> > > It may not have been meant to provide this, but when IBM changed this 
> > > three years ago because of performance regressions and others have started 
> > > to use MADV_HUGEPAGE with that policy in mind, it is the reality of what 
> > > the madvise advice has provided.  What was meant to be semantics of 
> > > MADV_HUGEPAGE three years ago is irrelevant today if it introduces 
> > > performance regressions for users who have used the advice mode during 
> > > that past three years.
> > > 
> > 
> > Incurring swap storms when there is plenty of free memory available is
> > terrible. If the working set size is larger than a node, the swap storm
> > may even persist indefinitely.
> > 
> 
> Let's fix it.
> 

That's what Andrea's patch does -- fixes trivial swap storms. It
does require you use existing memory policies *or* temporarily
enable zone_reclaim_mode during initialisation to get the locality
you want. Alternatively, we extend zone_reclaim_mode to apply to
THP+MADV_HUGEPAGE. You could also carry a revert, seeing as the kernel is
used for an internal workload, whereas some of us have to support all
classes of users running general workloads where causing the least harm
is important for supportability.

> > The current behaviour is potential swap storms that are difficult to
> > avoid because the behaviour is hard-wired into the kernel internals.
> > Only disabling THP, either on a task or global basis, avoids it when
> > encountered.
> > 
> 
> We are going in circles, *yes* there is a problem for potential swap 
> storms today because of the poor interaction between memory compaction and 
> directed reclaim but this is a result of a poor API that does not allow 
> userspace to specify that its workload really will span multiple sockets 
> so faulting remotely is the best course of action. 

Yes, we're going in circles. I find it amazing that you think leaving
users with trivial to reproduce swap storms is acceptable until some
unreproducible workload can be fixed with some undefined set of
unimplemented memory policies.

What we have right now is a concrete problem with a fix that is being
nacked with a counter-proposal of "someone should implement the test case
for me and then define/implement policies that suit that specific test case".
Andrea Arcangeli May 24, 2019, 8:29 p.m. UTC | #8
Hello everyone,

On Thu, May 23, 2019 at 05:57:37PM -0700, Andrew Morton wrote:
> On Mon, 20 May 2019 10:54:16 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:
> 
> > We are going in circles, *yes* there is a problem for potential swap 
> > storms today because of the poor interaction between memory compaction and 
> > directed reclaim but this is a result of a poor API that does not allow 
> > userspace to specify that its workload really will span multiple sockets 
> > so faulting remotely is the best course of action.  The fix is not to 
> > cause regressions for others who have implemented a userspace stack that 
> > is based on the past 3+ years of long standing behavior or for specialized 
> > workloads where it is known that it spans multiple sockets so we want some 
> > kind of different behavior.  We need to provide a clear and stable API to 
> > define these terms for the page allocator that is independent of any 
> > global setting of thp enabled, defrag, zone_reclaim_mode, etc.  It's 
> > workload dependent.
> 
> um, who is going to do this work?

That's a good question. It's not going to be a simple patch to
backport to -stable: it'll be intrusive and it will affect
mm/page_alloc.c significantly, so it will reject heavily when
backported. I wouldn't consider it -stable material, at least in the
short term; it will require some testing.

This is why applying a simple fix that avoids the swap storms (and the
swap-less pathological THP regression for vfio device assignment GUP
pinning) is preferable before adding an alloc_pages_multi_order (or
equivalent), so that it'll be the allocator that decides when exactly
to fall back from 2M to 4k depending on the NUMA distance and
memory availability during the zonelist walk. The basic idea is to
call alloc_pages just once (not first for 2M and then for 4k) and let
alloc_pages decide which page "order" to return.
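
No such interface exists yet; purely to illustrate the shape of the
idea, a hypothetical signature could look like:

	/*
	 * Hypothetical sketch only: walk the zonelist once and pick the
	 * best (order, node) combination -- e.g. prefer a remote 2M page
	 * over swapping the local node, and a local 4k page over a very
	 * distant 2M one -- based on NUMA distance and watermarks.
	 * *picked_order returns the order actually allocated.
	 */
	struct page *alloc_pages_multi_order(gfp_t gfp_mask, int preferred_nid,
					     const unsigned int *orders,
					     int nr_orders,
					     unsigned int *picked_order);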

> Implementing a new API doesn't help existing userspace which is hurting
> from the problem which this patch addresses.

Yes, we can't change all apps that may not fit in a single NUMA
node. Currently it's unsafe to turn on "transparent_hugepage/defrag =
always" or the bad behavior can then materialize also outside of
MADV_HUGEPAGE. Those apps that use MADV_HUGEPAGE on their long lived
allocations (i.e. guest physical memory), like qemu, are affected even
with the default "defrag = madvise". Those apps have been using
MADV_HUGEPAGE for more than 3 years and they are widely used and open
source of course.

> It does appear to me that this patch does more good than harm for the
> totality of kernel users, so I'm inclined to push it through and to try
> to talk Linus out of reverting it again.  

That sounds great. It's also what 3 enterprise distributions had to do
already.

As Mel described in detail, remote THP can't be slower than the swap
I/O (even if we swapped on an nvdimm it wouldn't change this).

As Michael suggested, a dynamic "numa_node_id()" mbind could be pursued
orthogonally, to retain the current upstream behavior for small apps
that fit in the node, do extremely long lived static allocations, and
don't care if they cause a swap storm during startup. All we argue
about is the default "defrag = always" and MADV_HUGEPAGE behavior.

The current behavior of "defrag = always" and MADV_HUGEPAGE is in fact
far more aggressive than zone_reclaim_mode, which is also not enabled
by default for similar reasons (although enabling zone_reclaim_mode by
default would carry much less risk of pathological regressions for
large workloads that can't fit in a single node). Enabling
zone_reclaim_mode would eventually fall back to remote nodes
gracefully. By contrast, the fallback to remote nodes with
__GFP_THISNODE can only happen after the 2M allocation has failed, and
the problem is that the 2M allocation doesn't fail because the
compaction+reclaim interleaving keeps succeeding by swapping out more
and more memory. That would be perfectly correct behavior for
compaction+reclaim interleaving if only the whole system were out of
memory in all nodes (and it isn't).

The false positive result from the automated testing (where overall
swapping performance decreased because fairness increased) wasn't
anybody's fault, so the revert at the end of the merge window was a
safe approach. So we can try again to fix it now.

Thanks!
Andrea
David Rientjes May 29, 2019, 2:06 a.m. UTC | #9
On Fri, 24 May 2019, Andrea Arcangeli wrote:

> > > We are going in circles, *yes* there is a problem for potential swap 
> > > storms today because of the poor interaction between memory compaction and 
> > > directed reclaim but this is a result of a poor API that does not allow 
> > > userspace to specify that its workload really will span multiple sockets 
> > > so faulting remotely is the best course of action.  The fix is not to 
> > > cause regressions for others who have implemented a userspace stack that 
> > > is based on the past 3+ years of long standing behavior or for specialized 
> > > workloads where it is known that it spans multiple sockets so we want some 
> > > kind of different behavior.  We need to provide a clear and stable API to 
> > > define these terms for the page allocator that is independent of any 
> > > global setting of thp enabled, defrag, zone_reclaim_mode, etc.  It's 
> > > workload dependent.
> > 
> > um, who is going to do this work?
> 
> That's a good question. It's going to be a not simple patch to
> backport to -stable: it'll be intrusive and it will affect
> mm/page_alloc.c significantly so it'll reject heavy. I wouldn't
> consider it -stable material at least in the short term, it will
> require some testing.
> 

Hi Andrea,

I'm not sure what patch you're referring to, unfortunately.  The above 
comment was referring to APIs that are made available to userspace to 
define when to fault locally vs remotely and what the preference should be 
for any form of compaction or reclaim to achieve that.  Today we have 
global enabling options, global defrag settings, enabling prctls, and 
madvise options.  The point it makes is that whether a specific workload 
fits into a single socket is workload dependent and thus we are left with 
prctls and madvise options.  The prctl either enables thp or it doesn't, 
so it is not interesting here; the madvise is overloaded in four different 
ways (enabling, stalling at fault, collapsibility, defrag) so it's not 
surprising that continuing to overload it for existing users will cause 
undesired results.  It makes an argument that we need a clear and stable 
means of defining the behavior, rather than changing the 4+ year behavior 
and giving those who regress no workaround.

> This is why applying a simple fix that avoids the swap storms (and the
> swap-less pathological THP regression for vfio device assignment GUP
> pinning) is preferable before adding an alloc_pages_multi_order (or
> equivalent) so that it'll be the allocator that will decide when
> exactly to fallback from 2M to 4k depending on the NUMA distance and
> memory availability during the zonelist walk. The basic idea is to
> call alloc_pages just once (not first for 2M and then for 4k) and
> alloc_pages will decide which page "order" to return.
> 

The commit description doesn't mention the swap storms that you're trying 
to fix; it's probably better to describe that again and why it is not 
beneficial to swap unless an entire pageblock can become free or memory 
compaction has indicated that additional memory freeing would allow 
migration to make an entire pageblock free.  I understand that's an 
invasive code change, but merging this patch changes the 4+ year behavior 
that started here:

commit 077fcf116c8c2bd7ee9487b645aa3b50368db7e1
Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Date:   Wed Feb 11 15:27:12 2015 -0800

    mm/thp: allocate transparent hugepages on local node

And that commit's description describes quite well the regression that we 
encounter if we remove __GFP_THISNODE here.  That's because the access 
latency regression is much more substantial than what was reported for 
Naples in your changelog.

In the interest of making forward progress, can we agree that swapping 
from the local node *never* makes sense unless we can show that an entire 
pageblock can become free or that it enables memory compaction to migrate 
memory that can make an entire pageblock free?  Are you reporting swap 
storms for the local node when one of these is true?

> > Implementing a new API doesn't help existing userspace which is hurting
> > from the problem which this patch addresses.
> 
> Yes, we can't change all apps that may not fit in a single NUMA
> node. Currently it's unsafe to turn "transparent_hugepages/defrag =
> always" or the bad behavior can then materialize also outside of
> MADV_HUGEPAGE. Those apps that use MADV_HUGEPAGE on their long lived
> allocations (i.e. guest physical memory) like qemu are affected even
> with the default "defrag = madvise". Those apps are using
> MADV_HUGEPAGE for more than 3 years and they are widely used and open
> source of course.
> 

I continue to reiterate that the 4+ year long standing behavior of 
MADV_HUGEPAGE is overloaded; you are anticipating a specific behavior for 
workloads that do not fit in a single NUMA node whereas other users 
developed in the past four years are anticipating a different behavior.  
I'm trying to propose solutions that can not cause regressions for any 
user, such as the prctl() example that is inherited across fork, and can 
be used to define the behavior.  This could be a very trivial extension to 
prctl(PR_SET_THP_DISABLE) or it could be more elaborate as an addition.  
This would be set by any thread that forks qemu and can define that the 
workload prefers remote hugepages because it spans more than one node.  
Certainly we should agree that the majority of Linux workloads do not span 
more than one socket.  However, it *may* be possible to define this as a 
global thp setting since most machines that run large guests are only 
running large guests so the default machine-level policy can reflect that.
David Rientjes May 29, 2019, 9:24 p.m. UTC | #10
On Thu, 23 May 2019, Andrew Morton wrote:

> > We are going in circles, *yes* there is a problem for potential swap 
> > storms today because of the poor interaction between memory compaction and 
> > directed reclaim but this is a result of a poor API that does not allow 
> > userspace to specify that its workload really will span multiple sockets 
> > so faulting remotely is the best course of action.  The fix is not to 
> > cause regressions for others who have implemented a userspace stack that 
> > is based on the past 3+ years of long standing behavior or for specialized 
> > workloads where it is known that it spans multiple sockets so we want some 
> > kind of different behavior.  We need to provide a clear and stable API to 
> > define these terms for the page allocator that is independent of any 
> > global setting of thp enabled, defrag, zone_reclaim_mode, etc.  It's 
> > workload dependent.
> 
> um, who is going to do this work?
> 
> Implementing a new API doesn't help existing userspace which is hurting
> from the problem which this patch addresses.
> 

The problem which this patch addresses has apparently gone unreported for 
4+ years since

commit 077fcf116c8c2bd7ee9487b645aa3b50368db7e1
Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Date:   Wed Feb 11 15:27:12 2015 -0800

    mm/thp: allocate transparent hugepages on local node

My goal is to reach a solution that does not cause anybody to incur 
performance penalties as a result of it.  It's surprising that such a 
severe swap storm issue that went unnoticed for four years is something 
that can't reach an amicable solution that doesn't cause other users to 
regress.

> It does appear to me that this patch does more good than harm for the
> totality of kernel users, so I'm inclined to push it through and to try
> to talk Linus out of reverting it again.  
> 

(1) The totality of kernel users are not running workloads that span 
multiple sockets, it's the minority, (2) it's changing 4+ year behavior 
based on NUMA locality of hugepage allocations and provides no workarounds 
for users who incur regressions as a result, and (3) does not solve the 
underlying issue if remote memory is also fragmented or low on memory: it 
actually makes the problem worse.

The easiest solution would be to define the MADV_HUGEPAGE behavior 
explicitly in sysfs: local or remote.  Default to local as the behavior 
from the past four years and allow users to specify remote if their 
workloads will span multiple sockets.  This is somewhat coarse but no more 
than the thp defrag setting in sysfs today that defines defrag behavior 
for everybody on the system.
Michal Hocko May 31, 2019, 9:22 a.m. UTC | #11
On Wed 29-05-19 14:24:33, David Rientjes wrote:
> On Thu, 23 May 2019, Andrew Morton wrote:
> 
> > > We are going in circles, *yes* there is a problem for potential swap 
> > > storms today because of the poor interaction between memory compaction and 
> > > directed reclaim but this is a result of a poor API that does not allow 
> > > userspace to specify that its workload really will span multiple sockets 
> > > so faulting remotely is the best course of action.  The fix is not to 
> > > cause regressions for others who have implemented a userspace stack that 
> > > is based on the past 3+ years of long standing behavior or for specialized 
> > > workloads where it is known that it spans multiple sockets so we want some 
> > > kind of different behavior.  We need to provide a clear and stable API to 
> > > define these terms for the page allocator that is independent of any 
> > > global setting of thp enabled, defrag, zone_reclaim_mode, etc.  It's 
> > > workload dependent.
> > 
> > um, who is going to do this work?
> > 
> > Implementing a new API doesn't help existing userspace which is hurting
> > from the problem which this patch addresses.
> > 
> 
> The problem which this patch addresses has apparently gone unreported for 
> 4+ years since

Can we finally stop considering the time and focus on what is the
most reasonable behavior in the general case, please? Preserving mistakes
based on the argument that we have had them for many years is just not
productive. It is very well possible that workloads that suffer from
this simply run on older distribution kernels which are moving towards
newer kernels very slowly.

> commit 077fcf116c8c2bd7ee9487b645aa3b50368db7e1
> Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> Date:   Wed Feb 11 15:27:12 2015 -0800
> 
>     mm/thp: allocate transparent hugepages on local node

Let me quote the commit message at full length
"
    This make sure that we try to allocate hugepages from local node if
    allowed by mempolicy.  If we can't, we fallback to small page allocation
    based on mempolicy.  This is based on the observation that allocating
    pages on local node is more beneficial than allocating hugepages on remote
    node.

    With this patch applied we may find transparent huge page allocation
    failures if the current node doesn't have enough freee hugepages.  Before
    this patch such failures result in us retrying the allocation on other
    nodes in the numa node mask.
"

I do not see any numbers backing those claims or any mention of a
workload that would benefit from the change. Besides that, we have seen
that THP on a remote (but close) node might perform better, per
Andrea's numbers. So those claims do not apply in general.

This is a general problem when making decisions on heuristics which are
not clear cut. AFAICS there have been pretty good arguments given that
_real_ workloads suffer from this change while a demonstration of a _real_
workload that benefits is still missing.

> My goal is to reach a solution that does not cause anybody to incur 
> performance penalties as a result of it.

That is certainly appreciated and I can offer my help there as well. But
I believe we should start with a code base that cannot generate a
swapping storm with trivial code as demonstrated by Mel. A general idea
on how to improve the situation has already been outlined for the default
case and a new memory policy has been mentioned as well, but we need
something to start with and neither of the two is compatible with the
__GFP_THISNODE behavior.

[...]

> The easiest solution would be to define the MADV_HUGEPAGE behavior 
> explicitly in sysfs: local or remote.  Defaut to local as the behavior 
> from the past four years and allow users to specify remote if their 
> workloads will span multiple sockets.  This is somewhat coarse but no more 
> than the thp defrag setting in sysfs today that defines defrag behavior 
> for everybody on the system.

This just makes the THP tuning even muddier. Really, can we start with
code that doesn't blow up trivially and build on top? In other words,
start by covering the less specialized usecase and then help more
specialized usecases get what they need.
David Rientjes May 31, 2019, 9:53 p.m. UTC | #12
On Fri, 31 May 2019, Michal Hocko wrote:

> > The problem which this patch addresses has apparently gone unreported for 
> > 4+ years since
> 
> Can we finaly stop considering the time and focus on the what is the
> most reasonable behavior in general case please? Conserving mistakes
> based on an argument that we have them for many years is just not
> productive. It is very well possible that workloads that suffer from
> this simply run on older distribution kernels which are moving towards
> newer kernels very slowly.
> 

That's fine, but we also must be mindful of users who have used 
MADV_HUGEPAGE over the past four years based on its hard-coded behavior 
that would now regress as a result.

> > commit 077fcf116c8c2bd7ee9487b645aa3b50368db7e1
> > Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> > Date:   Wed Feb 11 15:27:12 2015 -0800
> > 
> >     mm/thp: allocate transparent hugepages on local node
> 
> Let me quote the commit message to the full lenght
> "
>     This make sure that we try to allocate hugepages from local node if
>     allowed by mempolicy.  If we can't, we fallback to small page allocation
>     based on mempolicy.  This is based on the observation that allocating
>     pages on local node is more beneficial than allocating hugepages on remote
>     node.
> 
>     With this patch applied we may find transparent huge page allocation
>     failures if the current node doesn't have enough freee hugepages.  Before
>     this patch such failures result in us retrying the allocation on other
>     nodes in the numa node mask.
> "
> 
> I do not see any single numbers backing those claims or any mention of a
> workload that would benefit from the change. Besides that, we have seen
> that THP on a remote (but close) node might be performing better per
> Andrea's numbers. So those claims do not apply in general.
> 

I confirm that on every platform I have tested, the access latency to 
local pages of the native page size has been less than to hugepages on any 
remote node.  I think it's generally accepted that NUMA-ness is more 
important than huge-ness in terms of access latency and this is not the 
reason why the revert is being proposed.  Certainly if the argument is to 
be made that the default behavior should be what is in the best interest 
of Linux users in totality, preferring remote hugepages over local pages 
of the native page size would not be anywhere close.  I agree with 
Aneesh's commit message 100%.

> > My goal is to reach a solution that does not cause anybody to incur 
> > performance penalties as a result of it.
> 
> That is certainly appreciated and I can offer my help there as well. But
> I believe we should start with a code base that cannot generate a
> swapping storm by a trivial code as demonstrated by Mel. A general idea
> on how to approve the situation has been already outlined for a default
> case and a new memory policy has been mentioned as well but we need
> something to start with and neither of the two is compatible with the
> __GFP_THISNODE behavior.
> 

Thus far, I haven't seen anybody engage in discussion on how to address 
the issue other than proposed reverts that readily acknowledge they cause 
other users to regress.  If all nodes are fragmented, the swap storms that 
are currently reported for the local node would be made worse by the 
revert -- if remote hugepages cannot be faulted quickly then it has only 
compounded the problem.

The hugepage aware mempolicy idea is one way that could describe what 
should be done for these allocations; we could also perhaps consider 
heuristics that consider the memory pressure of the local node: just as 
I've never seen a platform where remote hugepages have better access 
latency than local pages, I've never seen a platform where remote 
hugepages aren't a win over remote pages.  This, however, is more delicate 
on 4 socket and 8 socket platforms, but I think a general rule is that a 
hugepage, if readily allocatable, is better than a page on the same node.  
(I've seen cross socket latency for hugepages match the latency for pages, 
so not always a win: better to leave the hugepage available remotely for 
something running on that node.)  If the local node has a workingset that 
reaches its capacity, then it makes sense to fault a remote hugepage 
instead because otherwise we are thrashing the local node.

That's not a compaction problem, though, it's a reclaim problem.  If 
compaction fails and it's because we can't allocate target pages, it's 
under memory pressure and it's uncertain if reclaim will help: it may fail 
after expensive swap, the reclaimed pages could be grabbed by somebody 
else and we loop, or the compaction freeing scanner can't find it.  Worst 
case is we thrash the local node in a swap storm.  So the argument I've 
made when the removal of __GFP_THISNODE was first proposed is that local 
hugepage allocation should be the preference including direct compaction 
for users of MADV_HUGEPAGE (or thp enabled=always) but reclaim should 
never be done locally.  I'd very much like to engage with anybody who 
would be willing to discuss fixes that work for everybody rather than only 
propose reverts and leave others to deal with new performance regressions.
Michal Hocko June 5, 2019, 9:32 a.m. UTC | #13
On Fri 31-05-19 14:53:35, David Rientjes wrote:
> On Fri, 31 May 2019, Michal Hocko wrote:
> 
> > > The problem which this patch addresses has apparently gone unreported for 
> > > 4+ years since
> > 
> > Can we finaly stop considering the time and focus on the what is the
> > most reasonable behavior in general case please? Conserving mistakes
> > based on an argument that we have them for many years is just not
> > productive. It is very well possible that workloads that suffer from
> > this simply run on older distribution kernels which are moving towards
> > newer kernels very slowly.
> > 
> 
> That's fine, but we also must be mindful of users who have used 
> MADV_HUGEPAGE over the past four years based on its hard-coded behavior 
> that would now regress as a result.

Absolutely, I am all for helping those usecases. First of all we need to
understand what those usecases are though. So far we have only seen very
vague claims about artificial worst case examples when a remote access
dominates the overall cost but that doesn't seem to be the case in real
life in my experience (e.g. numa balancing will correct things or the
over aggressive node reclaim tends to cause problems elsewhere etc.).

That being said I am pretty sure that a new memory policy as proposed
previously, one that would allow for node reclaim behavior, is the way for
those very specific workloads that absolutely benefit from local
access. There are however real life usecases that benefit from THP even
on remote nodes as explained by Andrea (most notably kvm) and the only
way those can express their needs is the madvise flag. Not to mention
that the default node reclaim behavior might cause excessive reclaim
as demonstrated by Mel and Andrea, and that is certainly not desirable in
itself.

[...]
> > > My goal is to reach a solution that does not cause anybody to incur 
> > > performance penalties as a result of it.
> > 
> > That is certainly appreciated and I can offer my help there as well. But
> > I believe we should start with a code base that cannot generate a
> > swapping storm by a trivial code as demonstrated by Mel. A general idea
> > on how to approve the situation has been already outlined for a default
> > case and a new memory policy has been mentioned as well but we need
> > something to start with and neither of the two is compatible with the
> > __GFP_THISNODE behavior.
> > 
> 
> Thus far, I haven't seen anybody engage in discussion on how to address 
> the issue other than proposed reverts that readily acknowledge they cause 
> other users to regress.  If all nodes are fragmented, the swap storms that 
> are currently reported for the local node would be made worse by the 
> revert -- if remote hugepages cannot be faulted quickly then it's only 
> compounded the problem.

Andrea has outlined the strategy to follow IIRC. There has also been a
general agreement that we shouldn't be over eager to fall back to remote
nodes if the base page size allocation could be satisfied from a local node.
David Rientjes June 6, 2019, 10:12 p.m. UTC | #14
On Wed, 5 Jun 2019, Michal Hocko wrote:

> > That's fine, but we also must be mindful of users who have used 
> > MADV_HUGEPAGE over the past four years based on its hard-coded behavior 
> > that would now regress as a result.
> 
> Absolutely, I am all for helping those usecases. First of all we need to
> understand what those usecases are though. So far we have only seen very
> vague claims about artificial worst case examples when a remote access
> dominates the overall cost but that doesn't seem to be the case in real
> life in my experience (e.g. numa balancing will correct things or the
> over aggressive node reclaim tends to cause problems elsewhere etc.).
> 

The usecase is a remap of a binary's text segment to transparent hugepages 
by doing mmap() -> madvise(MADV_HUGEPAGE) -> mremap(), and the case of 
interest is when this happens on a locally fragmented node.  This happens 
at startup when we aren't concerned about allocation latency: we want to 
compact.  We are concerned with access latency thereafter as long as the 
process is running.
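
For reference, the sequence is roughly the following (a heavily
simplified sketch: real implementations run this from code outside the
text segment and handle alignment and errors):

	#define _GNU_SOURCE
	#include <sys/mman.h>
	#include <string.h>

	/* Rebuild the text segment [text, text + len) on THP-backed memory */
	static void remap_text_to_thp(void *text, size_t len)
	{
		void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		madvise(buf, len, MADV_HUGEPAGE);	/* compact and fault hugepages now */
		memcpy(buf, text, len);			/* populate the new copy */

		/* Move the new mapping over the old text in one step */
		mremap(buf, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, text);
	}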

MADV_HUGEPAGE has worked great for this and we have a large userspace 
stack built upon that because it's been the long-standing behavior.  This 
gets back to the point of MADV_HUGEPAGE being overloaded for four 
different purposes.  I argue that processes that fit within a single node 
are in the majority.

> > Thus far, I haven't seen anybody engage in discussion on how to address 
> > the issue other than proposed reverts that readily acknowledge they cause 
> > other users to regress.  If all nodes are fragmented, the swap storms that 
> > are currently reported for the local node would be made worse by the 
> > revert -- if remote hugepages cannot be faulted quickly then it's only 
> > compounded the problem.
> 
> Andrea has outline the strategy to go IIRC. There also has been a
> general agreement that we shouldn't be over eager to fall back to remote
> nodes if the base page size allocation could be satisfied from a local node.

Sorry, I haven't seen patches for this; I can certainly test them if 
there's a link.  If we have the ability to tune how eager the page 
allocator is to fall back and have the option to say "never" as part of 
that eagerness, it may work.

The idea that I had was snipped from this, however, and it would be nice 
to get some feedback on it: I've suggested that direct reclaim for the 
purposes of hugepage allocation on the local node is never worthwhile 
unless and until memory compaction can both capture that page to use (not 
rely on the freeing scanner to find it) and that migration of a number of 
pages would eventually result in the ability to free a pageblock.

I'm hoping that we can all agree to that because otherwise it leads us 
down a bad road if reclaim is doing pointless work (freeing scanner can't 
find it or it gets allocated again before it can find it) or compaction 
can't make progress as a result of it (even though we can migrate, it 
still won't free a pageblock).

In the interim, I think we should suppress direct reclaim entirely for 
thp allocations, regardless of enabled=always or MADV_HUGEPAGE because it 
cannot be proven that the reclaim work is beneficial and I believe it 
results in the swap storms that are being reported.

Any disagreements so far?

Furthermore, if we can agree to that, memory compaction when allocating a 
transparent hugepage fails for different reasons, one of which is because 
we fail watermark checks because we lack migration targets.  This is 
normally what leads to direct reclaim.  Compaction is *supposed* to return 
COMPACT_SKIPPED for this but that's overloaded as well: it happens when we 
fail extfrag_threshold checks and when gfp flags don't allow it.  The 
former matters for thp.

So my proposed change would be:
 - give the page allocator a consistent indicator that compaction failed
   because we are low on memory (make COMPACT_SKIPPED really mean this),
 - if we get this in the page allocator and we are allocating thp, fail,
   reclaim is unlikely to help here and is much more likely to be
   disruptive
     - we could retry compaction if we haven't scanned all memory and
       were contended,
 - if the hugepage allocation fails, have thp check watermarks for order-0 
   pages without any padding,
 - if watermarks succeed, fail the thp allocation: we can't allocate
   because of fragmentation and it's better to return node local memory,
 - if watermarks fail, a follow up allocation of the pte will likely also
   fail, so thp retries the allocation with a cleared  __GFP_THISNODE.

This doesn't sound very invasive and I'll code it up if it will be tested.
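
A rough sketch of how the page allocator side of this could look
(hypothetical, and it assumes COMPACT_SKIPPED has been narrowed to mean
"failed order-0 watermarks" as proposed above):

	/* in __alloc_pages_slowpath(), after the first compaction attempt */
	if (costly_order && (gfp_mask & __GFP_NORETRY)) {
		/*
		 * Compaction lacked order-0 migration targets: local
		 * reclaim is unlikely to help and risks a swap storm, so
		 * fail the THP allocation here and let the THP fault path
		 * decide whether to retry without __GFP_THISNODE.
		 */
		if (compact_result == COMPACT_SKIPPED &&
		    (gfp_mask & __GFP_THISNODE))
			goto nopage;
	}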
Michal Hocko June 7, 2019, 8:32 a.m. UTC | #15
On Thu 06-06-19 15:12:40, David Rientjes wrote:
> On Wed, 5 Jun 2019, Michal Hocko wrote:
> 
> > > That's fine, but we also must be mindful of users who have used 
> > > MADV_HUGEPAGE over the past four years based on its hard-coded behavior 
> > > that would now regress as a result.
> > 
> > Absolutely, I am all for helping those usecases. First of all we need to
> > understand what those usecases are though. So far we have only seen very
> > vague claims about artificial worst case examples when a remote access
> > dominates the overall cost but that doesn't seem to be the case in real
> > life in my experience (e.g. numa balancing will correct things or the
> > over aggressive node reclaim tends to cause problems elsewhere etc.).
> > 
> 
> The usecase is a remap of a binary's text segment to transparent hugepages 
> by doing mmap() -> madvise(MADV_HUGEPAGE) -> mremap() and when this 
> happens on a locally fragmented node.  This happens at startup when we 
> aren't concerned about allocation latency: we want to compact.  We are 
> concerned with access latency thereafter as long as the process is 
> running.

You have indicated this previously but no call for a stand-alone
reproducer has been successful. It is really hard to optimize for such a
specialized workload without anything to play with. Btw. this is exactly
a case where I would expect numa balancing to converge to the optimal
placement. And if numa balancing is not an option then an explicit
mempolicy (e.g. the one suggested here) would be a good fit.

[...]

I will defer the compaction related stuff to Vlastimil and Mel who are
much more familiar with the current code.

> So my proposed change would be:
>  - give the page allocator a consistent indicator that compaction failed
>    because we are low on memory (make COMPACT_SKIPPED really mean this),
>  - if we get this in the page allocator and we are allocating thp, fail,
>    reclaim is unlikely to help here and is much more likely to be
>    disruptive
>      - we could retry compaction if we haven't scanned all memory and
>        were contended,
>  - if the hugepage allocation fails, have thp check watermarks for order-0 
>    pages without any padding,
>  - if watermarks succeed, fail the thp allocation: we can't allocate
>    because of fragmentation and it's better to return node local memory,

Doesn't this lead to the same low THP success rate we have seen with one
of the previous patches though?

Let me remind you of the previous semantic I was proposing
http://lkml.kernel.org/r/20181206091405.GD1286@dhcp22.suse.cz and that
didn't get shot down. Linus had some follow up ideas on how exactly
the fallback order should look and that is fine. We should just
measure differences between a local node cheap base page vs. a remote THP
on _real_ workloads. Any microbenchmark which just measures latency is
inherently misleading.

And really, the fundamental problem here is that MADV_HUGEPAGE has gained 
a NUMA semantic without due scrutiny, leading to a broken interface
with side effects that are simply making the interface unusable for a
large part of the usecases that the madvise was originally designed for.
Until we find an agreement on this point we will be looping in a dead-end
discussion, I am afraid.
David Rientjes June 13, 2019, 8:17 p.m. UTC | #16
On Fri, 7 Jun 2019, Michal Hocko wrote:

> > So my proposed change would be:
> >  - give the page allocator a consistent indicator that compaction failed
> >    because we are low on memory (make COMPACT_SKIPPED really mean this),
> >  - if we get this in the page allocator and we are allocating thp, fail,
> >    reclaim is unlikely to help here and is much more likely to be
> >    disruptive
> >      - we could retry compaction if we haven't scanned all memory and
> >        were contended,
> >  - if the hugepage allocation fails, have thp check watermarks for order-0 
> >    pages without any padding,
> >  - if watermarks succeed, fail the thp allocation: we can't allocate
> >    because of fragmentation and it's better to return node local memory,
> 
> Doesn't this lead to the same THP low success rate we have seen with one
> of the previous patches though?
> 

From my recollection, the only other patch that was tested involved 
__GFP_NORETRY and avoiding reclaim entirely for thp allocations when 
checking for high-order allocations.

This in the page allocator:

                /*
                 * Checks for costly allocations with __GFP_NORETRY, which
                 * includes THP page fault allocations
                 */
                if (costly_order && (gfp_mask & __GFP_NORETRY)) {
			...
			if (compact_result == COMPACT_DEFERRED)
				goto nopage;

Yet there is no handling for COMPACT_SKIPPED (or what my plan above 
defines COMPACT_SKIPPED to be).  I don't think anything was tried that 
tests why compaction failed, i.e. whether it was because the two scanners 
met, because hugepage-order memory was found available, because the zone 
lock was contended or we hit need_resched(), because we're failing even 
order-0 watermarks, etc.  I don't think the above plan has been tried; if 
someone has tried it, please let me know.

I haven't seen any objection to disabling reclaim entirely when order-0 
watermarks are failing in compaction.  We simply can't guarantee that it 
is useful work with the current implementation of compaction.  There are 
several reasons that I've enumerated why compaction can still fail even 
after successful reclaim.

The point is that removing __GFP_THISNODE is not a fix for this if the 
remote memory is fragmented as well: it assumes that hugepages are 
available remotely when they aren't available locally; otherwise we see 
swap storms both locally and remotely.  Relying on that is not in the best 
interest of any user of transparent hugepages.

> Let me remind you of the previous semantic I was proposing
> http://lkml.kernel.org/r/20181206091405.GD1286@dhcp22.suse.cz and that
> didn't get shot down. Linus had some follow up ideas on how exactly
> the fallback order should look like and that is fine. We should just
> measure differences between local node cheep base page vs. remote THP on
> _real_ workloads. Any microbenchmark which just measures a latency is
> inherently misleading.
> 

I think both seek to add the possibility of allocating hugepages remotely 
in certain circumstances and that can be influenced by MADV_HUGEPAGE.  I 
don't think we need to try hugepage specific mempolicies unless it is 
shown to be absolutely necessary, although a usecase for this could be 
made separately from this discussion.

There's a benefit to faulting remote hugepages over remote pages for 
everybody involved.  My argument is that we can determine the need for 
that based on failed order-0 watermark checks in compaction: if the node 
would require reclaim to even fault a page, it is likely better over the 
long term to fault a remote hugepage.

I think this can be made to work and is not even difficult to do.  If 
anybody has any objection please let me know.
David Rientjes June 21, 2019, 9:16 p.m. UTC | #17
On Thu, 6 Jun 2019, David Rientjes wrote:

> The idea that I had was snipped from this, however, and it would be nice 
> to get some feedback on it: I've suggested that direct reclaim for the 
> purposes of hugepage allocation on the local node is never worthwhile 
> unless and until memory compaction can both capture that page to use (not 
> rely on the freeing scanner to find it) and that migration of a number of 
> pages would eventually result in the ability to free a pageblock.
> 
> I'm hoping that we can all agree to that because otherwise it leads us 
> down a bad road if reclaim is doing pointless work (freeing scanner can't 
> find it or it gets allocated again before it can find it) or compaction 
> can't make progress as a result of it (even though we can migrate, it 
> still won't free a pageblock).
> 
> In the interim, I think we should suppress direct reclaim entirely for 
> thp allocations, regardless of enabled=always or MADV_HUGEPAGE because it 
> cannot be proven that the reclaim work is beneficial and I believe it 
> results in the swap storms that are being reported.
> 
> Any disagreements so far?
> 
> Furthermore, if we can agree to that, memory compaction when allocating a 
> transparent hugepage fails for different reasons, one of which is because 
> we fail watermark checks because we lack migration targets.  This is 
> normally what leads to direct reclaim.  Compaction is *supposed* to return 
> COMPACT_SKIPPED for this but that's overloaded as well: it happens when we 
> fail extfrag_threshold checks and wheng gfp flags doesn't allow it.  The 
> former matters for thp.
> 
> So my proposed change would be:
>  - give the page allocator a consistent indicator that compaction failed
>    because we are low on memory (make COMPACT_SKIPPED really mean this),
>  - if we get this in the page allocator and we are allocating thp, fail,
>    reclaim is unlikely to help here and is much more likely to be
>    disruptive
>      - we could retry compaction if we haven't scanned all memory and
>        were contended,
>  - if the hugepage allocation fails, have thp check watermarks for order-0 
>    pages without any padding,
>  - if watermarks succeed, fail the thp allocation: we can't allocate
>    because of fragmentation and it's better to return node local memory,
>  - if watermarks fail, a follow up allocation of the pte will likely also
>    fail, so thp retries the allocation with a cleared  __GFP_THISNODE.
> 
> This doesn't sound very invasive and I'll code it up if it will be tested.
> 

Following up on this since there has been no activity in a week, I am 
happy to prototype this.  Andrea, would you be able to test a patch once 
it is ready for you to try?
Michal Hocko July 30, 2019, 1:11 p.m. UTC | #18
On Thu 23-05-19 17:57:37, Andrew Morton wrote:
[...]
> It does appear to me that this patch does more good than harm for the
> totality of kernel users, so I'm inclined to push it through and to try
> to talk Linus out of reverting it again.  

What is the status here? I do not see the patch upstream nor in the
mmotm tree.
Andrew Morton July 30, 2019, 6:05 p.m. UTC | #19
On Tue, 30 Jul 2019 15:11:27 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> On Thu 23-05-19 17:57:37, Andrew Morton wrote:
> [...]
> > It does appear to me that this patch does more good than harm for the
> > totality of kernel users, so I'm inclined to push it through and to try
> > to talk Linus out of reverting it again.  
> 
> What is the status here?

It doesn't seem that the mooted alternatives will be happening any time
soon.

I would like a version of this patch which has a changelog which fully
describes our reasons for reapplying the reverted revert.  At this
time, *someone* is going to have to carry a private patch.  It's best
that the version which suits the larger number of users be the one
which we carry in mainline.

I'll add a cc:stable to this revert.
Michal Hocko July 31, 2019, 8:17 a.m. UTC | #20
On Tue 30-07-19 11:05:44, Andrew Morton wrote:
> On Tue, 30 Jul 2019 15:11:27 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > On Thu 23-05-19 17:57:37, Andrew Morton wrote:
> > [...]
> > > It does appear to me that this patch does more good than harm for the
> > > totality of kernel users, so I'm inclined to push it through and to try
> > > to talk Linus out of reverting it again.  
> > 
> > What is the status here?
> 
> I doesn't seem that the mooted alternatives will be happening any time
> soon,
> 
> I would like a version of this patch which has a changelog which fully
> describes our reasons for reapplying the reverted revert.

http://lkml.kernel.org/r/20190503223146.2312-3-aarcange@redhat.com went
into great detail IMHO about the previous decision as well as the adverse
effect on swapout. Do you have any suggestions on what is missing there?

Patch

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 5228c62af416..bac395f1d00a 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -139,6 +139,8 @@  struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
 struct mempolicy *get_task_policy(struct task_struct *p);
 struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
 		unsigned long addr);
+struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+						unsigned long addr);
 bool vma_policy_mof(struct vm_area_struct *vma);
 
 extern void numa_default_policy(void);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7efe68ba052a..784fd63800a2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -644,27 +644,37 @@  static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
 {
 	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
-	const gfp_t gfp_mask = GFP_TRANSHUGE_LIGHT | __GFP_THISNODE;
+	gfp_t this_node = 0;
 
-	/* Always do synchronous compaction */
-	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE | __GFP_THISNODE |
-		       (vma_madvised ? 0 : __GFP_NORETRY);
+#ifdef CONFIG_NUMA
+	struct mempolicy *pol;
+	/*
+	 * __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not
+	 * specified, to express a general desire to stay on the current
+	 * node for optimistic allocation attempts. If the defrag mode
+	 * and/or madvise hint requires the direct reclaim then we prefer
+	 * to fallback to other node rather than node reclaim because that
+	 * can lead to excessive reclaim even though there is free memory
+	 * on other nodes. We expect that NUMA preferences are specified
+	 * by memory policies.
+	 */
+	pol = get_vma_policy(vma, addr);
+	if (pol->mode != MPOL_BIND)
+		this_node = __GFP_THISNODE;
+	mpol_cond_put(pol);
+#endif
 
-	/* Kick kcompactd and fail quickly */
+	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
+		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
-		return gfp_mask | __GFP_KSWAPD_RECLAIM;
-
-	/* Synchronous compaction if madvised, otherwise kick kcompactd */
+		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
-		return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-						  __GFP_KSWAPD_RECLAIM);
-
-	/* Only do synchronous compaction if madvised */
+		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
+							     __GFP_KSWAPD_RECLAIM | this_node);
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
-		return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM : 0);
-
-	return gfp_mask;
+		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
+							     this_node);
+	return GFP_TRANSHUGE_LIGHT | this_node;
 }
 
 /* Caller must hold page table lock. */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 74e44000ad61..341e3d56d0a6 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1688,7 +1688,7 @@  struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
  * freeing by another task.  It is the caller's responsibility to free the
  * extra reference for shared policies.
  */
-static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 						unsigned long addr)
 {
 	struct mempolicy *pol = __get_vma_policy(vma, addr);