Message ID | 20190503223146.2312-3-aarcange@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Series | reapply: relax __GFP_THISNODE for MADV_HUGEPAGE mappings |
On Fri 03-05-19 18:31:46, Andrea Arcangeli wrote: > This reverts commit 2f0799a0ffc033bf3cc82d5032acc3ec633464c2. > > commit 2f0799a0ffc033bf3cc82d5032acc3ec633464c2 was rightfully applied > to avoid the risk of a severe regression that was reported by the > kernel test robot at the end of the merge window. Now we understood > the regression was a false positive and was caused by a significant > increase in fairness during a swap trashing benchmark. So it's safe to > re-apply the fix and continue improving the code from there. The > benchmark that reported the regression is very useful, but it provides > a meaningful result only when there is no significant alteration in > fairness during the workload. The removal of __GFP_THISNODE increased > fairness. > > __GFP_THISNODE cannot be used in the generic page faults path for new > memory allocations under the MPOL_DEFAULT mempolicy, or the allocation > behavior significantly deviates from what the MPOL_DEFAULT semantics > are supposed to be for THP and 4k allocations alike. > > Setting THP defrag to "always" or using MADV_HUGEPAGE (with THP defrag > set to "madvise") has never meant to provide an implicit MPOL_BIND on > the "current" node the task is running on, causing swap storms and > providing a much more aggressive behavior than even zone_reclaim_node > = 3. > > Any workload who could have benefited from __GFP_THISNODE has now to > enable zone_reclaim_mode=1||2||3. __GFP_THISNODE implicitly provided > the zone_reclaim_mode behavior, but it only did so if THP was enabled: > if THP was disabled, there would have been no chance to get any 4k > page from the current node if the current node was full of pagecache, > which further shows how this __GFP_THISNODE was misplaced in > MADV_HUGEPAGE. MADV_HUGEPAGE has never been intended to provide any > zone_reclaim_mode semantics, in fact the two are orthogonal, > zone_reclaim_mode = 1|2|3 must work exactly the same with > MADV_HUGEPAGE set or not. > > The performance characteristic of memory depends on the hardware > details. The numbers below are obtained on Naples/EPYC architecture > and the N/A projection extends them to show what we should aim for in > the future as a good THP NUMA locality default. The benchmark used > exercises random memory seeks (note: the cost of the page faults is > not part of the measurement). > > D0 THP | D0 4k | D1 THP | D1 4k | D2 THP | D2 4k | D3 THP | D3 4k | ... > 0% | +43% | +45% | +106% | +131% | +224% | N/A | N/A > > D0 means distance zero (i.e. local memory), D1 means distance > one (i.e. intra socket memory), D2 means distance two (i.e. inter > socket memory), etc... > > For the guest physical memory allocated by qemu and for guest mode kernel > the performance characteristic of RAM is more complex and an ideal > default could be: > > D0 THP | D1 THP | D0 4k | D2 THP | D1 4k | D3 THP | D2 4k | D3 4k | ... > 0% | +58% | +101% | N/A | +222% | N/A | N/A | N/A > > NOTE: the N/A are projections and haven't been measured yet, the > measurement in this case is done on a 1950x with only two NUMA nodes. > The THP case here means THP was used both in the host and in the > guest. > > After applying this commit the THP NUMA locality order that we'll get > out of MADV_HUGEPAGE is this: > > D0 THP | D1 THP | D2 THP | D3 THP | ... | D0 4k | D1 4k | D2 4k | D3 4k | ... > > Before this commit it was: > > D0 THP | D0 4k | D1 4k | D2 4k | D3 4k | ... 
> > Even if we ignore the breakage of large workloads that can't fit in a > single node that the __GFP_THISNODE implicit "current node" mbind > caused, the THP NUMA locality order provided by __GFP_THISNODE was > still not the one we shall aim for in the long term (i.e. the first > one at the top). > > After this commit is applied, we can introduce a new allocator multi > order API and to replace those two alloc_pages_vmas calls in the page > fault path, with a single multi order call: > > unsigned int order = (1 << HPAGE_PMD_ORDER) | (1 << 0); > page = alloc_pages_multi_order(..., &order); > if (!page) > goto out; > if (!(order & (1 << 0))) { > VM_WARN_ON(order != 1 << HPAGE_PMD_ORDER); > /* THP fault */ > } else { > VM_WARN_ON(order != 1 << 0); > /* 4k fallback */ > } > > The page allocator logic has to be altered so that when it fails on > any zone with order 9, it has to try again with a order 0 before > falling back to the next zone in the zonelist. > > After that we need to do more measurements and evaluate if adding an > opt-in feature for guest mode is worth it, to swap "DN 4k | DN+1 THP" > with "DN+1 THP | DN 4k" at every NUMA distance crossing. I do agree with your reasoning. Future plans should be discussed carefully because any implicit NUMA placement might turn out to be simply wrong with future HW. I still believe we need a sort of _enforced_ intra-socket placement numa policy API. Something resembling node reclaim mode for particular mappings. MPOL_CLOSE_NODE or similar sounds like a way to go. But a new API really begs for real world usecases and I still hope to get a reproducer for the problem David is running into. > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> > --- > include/linux/mempolicy.h | 2 ++ > mm/huge_memory.c | 42 ++++++++++++++++++++++++--------------- > mm/mempolicy.c | 2 +- > 3 files changed, 29 insertions(+), 17 deletions(-) > > diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h > index 5228c62af416..bac395f1d00a 100644 > --- a/include/linux/mempolicy.h > +++ b/include/linux/mempolicy.h > @@ -139,6 +139,8 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp, > struct mempolicy *get_task_policy(struct task_struct *p); > struct mempolicy *__get_vma_policy(struct vm_area_struct *vma, > unsigned long addr); > +struct mempolicy *get_vma_policy(struct vm_area_struct *vma, > + unsigned long addr); > bool vma_policy_mof(struct vm_area_struct *vma); > > extern void numa_default_policy(void); > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 7efe68ba052a..784fd63800a2 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -644,27 +644,37 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf, > static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr) > { > const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE); > - const gfp_t gfp_mask = GFP_TRANSHUGE_LIGHT | __GFP_THISNODE; > + gfp_t this_node = 0; > > - /* Always do synchronous compaction */ > - if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags)) > - return GFP_TRANSHUGE | __GFP_THISNODE | > - (vma_madvised ? 0 : __GFP_NORETRY); > +#ifdef CONFIG_NUMA > + struct mempolicy *pol; > + /* > + * __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not > + * specified, to express a general desire to stay on the current > + * node for optimistic allocation attempts.
If the defrag mode > + * and/or madvise hint requires the direct reclaim then we prefer > + * to fallback to other node rather than node reclaim because that > + * can lead to excessive reclaim even though there is free memory > + * on other nodes. We expect that NUMA preferences are specified > + * by memory policies. > + */ > + pol = get_vma_policy(vma, addr); > + if (pol->mode != MPOL_BIND) > + this_node = __GFP_THISNODE; > + mpol_cond_put(pol); > +#endif > > - /* Kick kcompactd and fail quickly */ > + if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags)) > + return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY); > if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags)) > - return gfp_mask | __GFP_KSWAPD_RECLAIM; > - > - /* Synchronous compaction if madvised, otherwise kick kcompactd */ > + return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node; > if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags)) > - return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM : > - __GFP_KSWAPD_RECLAIM); > - > - /* Only do synchronous compaction if madvised */ > + return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM : > + __GFP_KSWAPD_RECLAIM | this_node); > if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags)) > - return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM : 0); > - > - return gfp_mask; > + return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM : > + this_node); > + return GFP_TRANSHUGE_LIGHT | this_node; > } > > /* Caller must hold page table lock. */ > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index 74e44000ad61..341e3d56d0a6 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -1688,7 +1688,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma, > * freeing by another task. It is the caller's responsibility to free the > * extra reference for shared policies. > */ > -static struct mempolicy *get_vma_policy(struct vm_area_struct *vma, > +struct mempolicy *get_vma_policy(struct vm_area_struct *vma, > unsigned long addr) > { > struct mempolicy *pol = __get_vma_policy(vma, addr);
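For readability, this is roughly what alloc_hugepage_direct_gfpmask() looks like once the hunks above are applied. It is reconstructed directly from the diff (with the long in-code comment shortened); the defrag labels in the comments are the corresponding /sys/kernel/mm/transparent_hugepage/defrag settings.

static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma,
                                                  unsigned long addr)
{
        const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
        gfp_t this_node = 0;

#ifdef CONFIG_NUMA
        struct mempolicy *pol;

        /*
         * Only ask for the local node opportunistically (no direct
         * reclaim), and only when no explicit MPOL_BIND policy applies.
         */
        pol = get_vma_policy(vma, addr);
        if (pol->mode != MPOL_BIND)
                this_node = __GFP_THISNODE;
        mpol_cond_put(pol);
#endif

        /* defrag = "always" */
        if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
                return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
        /* defrag = "defer" */
        if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
                return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
        /* defrag = "defer+madvise" */
        if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
                return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
                                              __GFP_KSWAPD_RECLAIM | this_node);
        /* defrag = "madvise" */
        if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
                return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
                                              this_node);
        /* defrag = "never" */
        return GFP_TRANSHUGE_LIGHT | this_node;
}

The net effect is that __GFP_THISNODE is only added on the paths that do not imply direct reclaim, and never when an explicit MPOL_BIND policy already constrains the allocation.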
On Fri, May 03, 2019 at 06:31:46PM -0400, Andrea Arcangeli wrote: > This reverts commit 2f0799a0ffc033bf3cc82d5032acc3ec633464c2. > > commit 2f0799a0ffc033bf3cc82d5032acc3ec633464c2 was rightfully applied > to avoid the risk of a severe regression that was reported by the > kernel test robot at the end of the merge window. Now we understood > the regression was a false positive and was caused by a significant > increase in fairness during a swap trashing benchmark. So it's safe to > re-apply the fix and continue improving the code from there. The > benchmark that reported the regression is very useful, but it provides > a meaningful result only when there is no significant alteration in > fairness during the workload. The removal of __GFP_THISNODE increased > fairness. > > __GFP_THISNODE cannot be used in the generic page faults path for new > memory allocations under the MPOL_DEFAULT mempolicy, or the allocation > behavior significantly deviates from what the MPOL_DEFAULT semantics > are supposed to be for THP and 4k allocations alike. > > Setting THP defrag to "always" or using MADV_HUGEPAGE (with THP defrag > set to "madvise") has never meant to provide an implicit MPOL_BIND on > the "current" node the task is running on, causing swap storms and > providing a much more aggressive behavior than even zone_reclaim_node > = 3. > > Any workload who could have benefited from __GFP_THISNODE has now to > enable zone_reclaim_mode=1||2||3. __GFP_THISNODE implicitly provided > the zone_reclaim_mode behavior, but it only did so if THP was enabled: > if THP was disabled, there would have been no chance to get any 4k > page from the current node if the current node was full of pagecache, > which further shows how this __GFP_THISNODE was misplaced in > MADV_HUGEPAGE. MADV_HUGEPAGE has never been intended to provide any > zone_reclaim_mode semantics, in fact the two are orthogonal, > zone_reclaim_mode = 1|2|3 must work exactly the same with > MADV_HUGEPAGE set or not. > > The performance characteristic of memory depends on the hardware > details. The numbers below are obtained on Naples/EPYC architecture > and the N/A projection extends them to show what we should aim for in > the future as a good THP NUMA locality default. The benchmark used > exercises random memory seeks (note: the cost of the page faults is > not part of the measurement). > > D0 THP | D0 4k | D1 THP | D1 4k | D2 THP | D2 4k | D3 THP | D3 4k | ... > 0% | +43% | +45% | +106% | +131% | +224% | N/A | N/A > > D0 means distance zero (i.e. local memory), D1 means distance > one (i.e. intra socket memory), D2 means distance two (i.e. inter > socket memory), etc... > > For the guest physical memory allocated by qemu and for guest mode kernel > the performance characteristic of RAM is more complex and an ideal > default could be: > > D0 THP | D1 THP | D0 4k | D2 THP | D1 4k | D3 THP | D2 4k | D3 4k | ... > 0% | +58% | +101% | N/A | +222% | N/A | N/A | N/A > > NOTE: the N/A are projections and haven't been measured yet, the > measurement in this case is done on a 1950x with only two NUMA nodes. > The THP case here means THP was used both in the host and in the > guest. > > After applying this commit the THP NUMA locality order that we'll get > out of MADV_HUGEPAGE is this: > > D0 THP | D1 THP | D2 THP | D3 THP | ... | D0 4k | D1 4k | D2 4k | D3 4k | ... > > Before this commit it was: > > D0 THP | D0 4k | D1 4k | D2 4k | D3 4k | ... 
> > Even if we ignore the breakage of large workloads that can't fit in a > single node that the __GFP_THISNODE implicit "current node" mbind > caused, the THP NUMA locality order provided by __GFP_THISNODE was > still not the one we shall aim for in the long term (i.e. the first > one at the top). > > After this commit is applied, we can introduce a new allocator multi > order API and to replace those two alloc_pages_vmas calls in the page > fault path, with a single multi order call: > > unsigned int order = (1 << HPAGE_PMD_ORDER) | (1 << 0); > page = alloc_pages_multi_order(..., &order); > if (!page) > goto out; > if (!(order & (1 << 0))) { > VM_WARN_ON(order != 1 << HPAGE_PMD_ORDER); > /* THP fault */ > } else { > VM_WARN_ON(order != 1 << 0); > /* 4k fallback */ > } > > The page allocator logic has to be altered so that when it fails on > any zone with order 9, it has to try again with a order 0 before > falling back to the next zone in the zonelist. > > After that we need to do more measurements and evaluate if adding an > opt-in feature for guest mode is worth it, to swap "DN 4k | DN+1 THP" > with "DN+1 THP | DN 4k" at every NUMA distance crossing. > > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de>
On Fri, 3 May 2019, Andrea Arcangeli wrote: > This reverts commit 2f0799a0ffc033bf3cc82d5032acc3ec633464c2. > > commit 2f0799a0ffc033bf3cc82d5032acc3ec633464c2 was rightfully applied > to avoid the risk of a severe regression that was reported by the > kernel test robot at the end of the merge window. Now we understood > the regression was a false positive and was caused by a significant > increase in fairness during a swap trashing benchmark. So it's safe to > re-apply the fix and continue improving the code from there. The > benchmark that reported the regression is very useful, but it provides > a meaningful result only when there is no significant alteration in > fairness during the workload. The removal of __GFP_THISNODE increased > fairness. > Hi Andrea, There was exhausting discussion subsequent to this that caused Linus to have to revert the offending commit late in an rc series that is not described here. This was after the offending commit, which this commit now reintroduces, was described as causing user facing access latency regressions and nacked. The same objection is obviously going to be made here and I'd really prefer if this could be worked out without yet another merge into -mm, push to Linus, and revert by Linus. There are solutions to this issue that does not cause anybody to have performance regressions rather than reintroducing them for a class of users that use the overloaded MADV_HUGEPAGE for the purposes it has provided them over the past three years. > __GFP_THISNODE cannot be used in the generic page faults path for new > memory allocations under the MPOL_DEFAULT mempolicy, or the allocation > behavior significantly deviates from what the MPOL_DEFAULT semantics > are supposed to be for THP and 4k allocations alike. > This isn't an argument in support of this patch, there is a difference between (1) pages of the native page size being faulted first locally falling back remotely and (2) hugepages being faulted first locally and falling back to native pages locally because it has better access latency on most platforms for workloads that do not span multiple nodes. Note that the page allocator is unaware whether the workload spans multiple nodes so it cannot make this distinction today, and that's what I'd prefer to focus on rather than changing an overall policy for everybody. > Setting THP defrag to "always" or using MADV_HUGEPAGE (with THP defrag > set to "madvise") has never meant to provide an implicit MPOL_BIND on > the "current" node the task is running on, causing swap storms and > providing a much more aggressive behavior than even zone_reclaim_node > = 3. > It may not have been meant to provide this, but when IBM changed this three years ago because of performance regressions and others have started to use MADV_HUGEPAGE with that policy in mind, it is the reality of what the madvise advice has provided. What was meant to be semantics of MADV_HUGEPAGE three years ago is irrelevant today if it introduces performance regressions for users who have used the advice mode during that past three years. > Any workload who could have benefited from __GFP_THISNODE has now to > enable zone_reclaim_mode=1||2||3. __GFP_THISNODE implicitly provided > the zone_reclaim_mode behavior, but it only did so if THP was enabled: > if THP was disabled, there would have been no chance to get any 4k > page from the current node if the current node was full of pagecache, > which further shows how this __GFP_THISNODE was misplaced in > MADV_HUGEPAGE. 
MADV_HUGEPAGE has never been intended to provide any > zone_reclaim_mode semantics, in fact the two are orthogonal, > zone_reclaim_mode = 1|2|3 must work exactly the same with > MADV_HUGEPAGE set or not. > > The performance characteristic of memory depends on the hardware > details. The numbers below are obtained on Naples/EPYC architecture > and the N/A projection extends them to show what we should aim for in > the future as a good THP NUMA locality default. The benchmark used > exercises random memory seeks (note: the cost of the page faults is > not part of the measurement). > > D0 THP | D0 4k | D1 THP | D1 4k | D2 THP | D2 4k | D3 THP | D3 4k | ... > 0% | +43% | +45% | +106% | +131% | +224% | N/A | N/A > The performance measurements that we have on Naples shows a more significant change between D0 4k and D1 THP: it certainly is not 2% worse access latency to a remote hugepage compared to local native pages. > D0 means distance zero (i.e. local memory), D1 means distance > one (i.e. intra socket memory), D2 means distance two (i.e. inter > socket memory), etc... > > For the guest physical memory allocated by qemu and for guest mode kernel > the performance characteristic of RAM is more complex and an ideal > default could be: > > D0 THP | D1 THP | D0 4k | D2 THP | D1 4k | D3 THP | D2 4k | D3 4k | ... > 0% | +58% | +101% | N/A | +222% | N/A | N/A | N/A > > NOTE: the N/A are projections and haven't been measured yet, the > measurement in this case is done on a 1950x with only two NUMA nodes. > The THP case here means THP was used both in the host and in the > guest. > Yes, this is clearly understood and was never objected to when this first came up in the thread where __GFP_THISNODE was removed or when Linus reverted the patch. The issue being discussed here is a result of MADV_HUGEPAGE being overloaded: it cannot mean to control (1) how much compaction/reclaim is done for page allocation, (2) the NUMA locality of those hugepages, and (3) the eligibility of the memory to be collapsed into hugepages by khugepaged all at the same time. I suggested then that we actually define (2) concretely specifically for the usecase that you mention. Changing the behavior of MADV_HUGEPAGE for the past three years, however, and introducing performance regressions for those users is not an option regardless of the intent that it had when developed. I suggested two options: (1) __MPOL_F_HUGE flag to set a mempolicy for specific memory ranges so that you can define thp specific mempolicies (Vlastimil considered this to be a lot of work, which I agreed) or (2) a prctl() mode to specify that a workload will span multiple sockets and benefits from remote hugepage allocation over local native pages (or because it is faulting memory remotely that it will access locally at some point in the future depending on cpu binding). Any prctl() mode can be inherited across fork so it can be used for the qemu case that you suggest and is a very simple change to make compared with (1). Please consider methods to accomplish this goal that will not cause existing users of MADV_HUGEPAGE to incur 13.9% access latency regressions and have no way to workaround without MPOL_BIND that will introduce undeserved and unnecessary oom kills because we can't specify native page vs hugepage mempolicies independently. I'm confident that everybody on this cc list is well aware of both sides of this discussion and I hope that we can work together to address it to achieve the goals of both. Thanks.
On Wed, May 15, 2019 at 01:26:26PM -0700, David Rientjes wrote: > On Fri, 3 May 2019, Andrea Arcangeli wrote: > > > This reverts commit 2f0799a0ffc033bf3cc82d5032acc3ec633464c2. > > > > commit 2f0799a0ffc033bf3cc82d5032acc3ec633464c2 was rightfully applied > > to avoid the risk of a severe regression that was reported by the > > kernel test robot at the end of the merge window. Now we understood > > the regression was a false positive and was caused by a significant > > increase in fairness during a swap trashing benchmark. So it's safe to > > re-apply the fix and continue improving the code from there. The > > benchmark that reported the regression is very useful, but it provides > > a meaningful result only when there is no significant alteration in > > fairness during the workload. The removal of __GFP_THISNODE increased > > fairness. > > > > Hi Andrea, > > There was exhausting discussion subsequent to this that caused Linus to > have to revert the offending commit late in an rc series that is not > described here. Yes, at the crux of that matter was which regression introduced was more important -- the one causing swap storms which Andrea is trying to address or a latency issue due to assumptions of locality when MADV_HUGEPAGE is used. More people are affected by swap storms and distributions are carrying out-of-tree patches to address it. Furthermore, multiple people unrelated to each other can trivially reproduce the problem with test cases and experience the problem with real workloads. Only you has a realistic workload sensitive to the latency issue and we've asked repeatedly for a test case (most recently Michal Hocko on May 4th) which is still not available. Currently the revert has to be carried out of tree for at least some distributions which means any mainline users will have very different experiences to distro users. Normally this is true anyway, but this is an extreme example. The most important point is that the latency problems are relatively difficult for a normal user to detect but a swap storm is extremely easy to detect. We should be defaulting to the behaviour that causes the least overall harm. > This was after the offending commit, which this commit > now reintroduces, was described as causing user facing access latency > regressions and nacked. The same objection is obviously going to be made > here and I'd really prefer if this could be worked out without yet another > merge into -mm, push to Linus, and revert by Linus. There are solutions > to this issue that does not cause anybody to have performance regressions > rather than reintroducing them for a class of users that use the > overloaded MADV_HUGEPAGE for the purposes it has provided them over the > past three years. > There are no solutions implemented although there are solutions possible -- however, forcing zone_reclaim_mode like behaviour when zone_reclaim_mode is disabled makes the path forward hazardous. > > __GFP_THISNODE cannot be used in the generic page faults path for new > > memory allocations under the MPOL_DEFAULT mempolicy, or the allocation > > behavior significantly deviates from what the MPOL_DEFAULT semantics > > are supposed to be for THP and 4k allocations alike. 
> > > > This isn't an argument in support of this patch, there is a difference > between (1) pages of the native page size being faulted first locally > falling back remotely and (2) hugepages being faulted first locally and > falling back to native pages locally because it has better access latency > on most platforms for workloads that do not span multiple nodes. Note > that the page allocator is unaware whether the workload spans multiple > nodes so it cannot make this distinction today, and that's what I'd prefer > to focus on rather than changing an overall policy for everybody. > Overall, I think it would be ok to have behaviour whereby local THP is allocated if cheaply, followed by base pages local followed by the remote options. However, __GFP_THISNODE removes the possibility of allowing remote fallback and instead causing a swap storm and swap storms are trivial to generate on NUMA machine running a mainline kernel today. If a workload really prefers local huge pages even if that incurs reclaim, it should be a deliberate choice. > > Setting THP defrag to "always" or using MADV_HUGEPAGE (with THP defrag > > set to "madvise") has never meant to provide an implicit MPOL_BIND on > > the "current" node the task is running on, causing swap storms and > > providing a much more aggressive behavior than even zone_reclaim_node > > = 3. > > > > It may not have been meant to provide this, but when IBM changed this > three years ago because of performance regressions and others have started > to use MADV_HUGEPAGE with that policy in mind, it is the reality of what > the madvise advice has provided. What was meant to be semantics of > MADV_HUGEPAGE three years ago is irrelevant today if it introduces > performance regressions for users who have used the advice mode during > that past three years. > Incurring swap storms when there is plenty of free memory available is terrible. If the working set size is larger than a node, the swap storm may even persist indefinitely. > > Any workload who could have benefited from __GFP_THISNODE has now to > > enable zone_reclaim_mode=1||2||3. __GFP_THISNODE implicitly provided > > the zone_reclaim_mode behavior, but it only did so if THP was enabled: > > if THP was disabled, there would have been no chance to get any 4k > > page from the current node if the current node was full of pagecache, > > which further shows how this __GFP_THISNODE was misplaced in > > MADV_HUGEPAGE. MADV_HUGEPAGE has never been intended to provide any > > zone_reclaim_mode semantics, in fact the two are orthogonal, > > zone_reclaim_mode = 1|2|3 must work exactly the same with > > MADV_HUGEPAGE set or not. > > > > The performance characteristic of memory depends on the hardware > > details. The numbers below are obtained on Naples/EPYC architecture > > and the N/A projection extends them to show what we should aim for in > > the future as a good THP NUMA locality default. The benchmark used > > exercises random memory seeks (note: the cost of the page faults is > > not part of the measurement). > > > > D0 THP | D0 4k | D1 THP | D1 4k | D2 THP | D2 4k | D3 THP | D3 4k | ... > > 0% | +43% | +45% | +106% | +131% | +224% | N/A | N/A > > > > The performance measurements that we have on Naples shows a more > significant change between D0 4k and D1 THP: it certainly is not 2% worse > access latency to a remote hugepage compared to local native pages. 
> Please share the workload in question so there is at least some chance of developing a series that allocates locally (regardless of page size) as much as possible without incurring swap storms. > > D0 means distance zero (i.e. local memory), D1 means distance > > one (i.e. intra socket memory), D2 means distance two (i.e. inter > > socket memory), etc... > > > > For the guest physical memory allocated by qemu and for guest mode kernel > > the performance characteristic of RAM is more complex and an ideal > > default could be: > > > > D0 THP | D1 THP | D0 4k | D2 THP | D1 4k | D3 THP | D2 4k | D3 4k | ... > > 0% | +58% | +101% | N/A | +222% | N/A | N/A | N/A > > > > NOTE: the N/A are projections and haven't been measured yet, the > > measurement in this case is done on a 1950x with only two NUMA nodes. > > The THP case here means THP was used both in the host and in the > > guest. > > > > Yes, this is clearly understood and was never objected to when this first > came up in the thread where __GFP_THISNODE was removed or when Linus > reverted the patch. > > The issue being discussed here is a result of MADV_HUGEPAGE being > overloaded: it cannot mean to control (1) how much compaction/reclaim is > done for page allocation, (2) the NUMA locality of those hugepages, and > (3) the eligibility of the memory to be collapsed into hugepages by > khugepaged all at the same time. > > I suggested then that we actually define (2) concretely specifically for > the usecase that you mention. Changing the behavior of MADV_HUGEPAGE for > the past three years, however, and introducing performance regressions for > those users is not an option regardless of the intent that it had when > developed. > The current behaviour is potential swap storms that are difficult to avoid because the behaviour is hard-wired into the kernel internals. Only disabling THP, either on a task or global basis, avoids it when encountered. > I suggested two options: (1) __MPOL_F_HUGE flag to set a mempolicy for > specific memory ranges so that you can define thp specific mempolicies > (Vlastimil considered this to be a lot of work, which I agreed) or (2) a > prctl() mode to specify that a workload will span multiple sockets and > benefits from remote hugepage allocation over local native pages (or > because it is faulting memory remotely that it will access locally at some > point in the future depending on cpu binding). Any prctl() mode can be > inherited across fork so it can be used for the qemu case that you suggest > and is a very simple change to make compared with (1). > The prctl should be the other way around -- use prctl if THP trumps all else even if that means reclaiming a large amount of memory or swapping. > Please consider methods to accomplish this goal that will not cause > existing users of MADV_HUGEPAGE to incur 13.9% access latency regressions > and have no way to workaround without MPOL_BIND that will introduce > undeserved and unnecessary oom kills because we can't specify native page > vs hugepage mempolicies independently. > > I'm confident that everybody on this cc list is well aware of both sides > of this discussion and I hope that we can work together to address it to > achieve the goals of both. > Please start by sharing the workload in question. However, I'm still willing to bet that a swap storm causes significantly more harm for more users than a latency penalty for a workload that is extremely latency sensitive. The latency of a swap out/in is far more than a 13.9% access latency.
On Mon, 20 May 2019, Mel Gorman wrote: > > There was exhausting discussion subsequent to this that caused Linus to > > have to revert the offending commit late in an rc series that is not > > described here. > > Yes, at the crux of that matter was which regression introduced was more > important -- the one causing swap storms which Andrea is trying to address > or a latency issue due to assumptions of locality when MADV_HUGEPAGE > is used. > > More people are affected by swap storms and distributions are carrying > out-of-tree patches to address it. Furthermore, multiple people unrelated > to each other can trivially reproduce the problem with test cases and > experience the problem with real workloads. Only you has a realistic > workload sensitive to the latency issue and we've asked repeatedly for > a test case (most recently Michal Hocko on May 4th) which is still not > available. > Hi Mel, Any workload that does MADV_HUGEPAGE will be impacted if remote hugepage access latency is greater than local native page access latency and is using the long-standing behavior of the past three years. The test case would be rather straight forward: induce node local fragmentation (easiest to do by injecting a kernel module), do MADV_HUGEPAGE over a large range, fault, and measure random access latency. This is readily observable and can be done synthetically to measure the random access latency of local native pages vs remote hugepages. Andrea provided this testcase in the original thread. My results from right now: # numactl -m 0 -C 0 ./numa-thp-bench random writes MADV_HUGEPAGE 17492771 usec random writes MADV_NOHUGEPAGE 21344846 usec random writes MADV_NOHUGEPAGE 21399545 usec random writes MADV_HUGEPAGE 17481949 usec # numactl -m 0 -C 64 ./numa-thp-bench random writes MADV_HUGEPAGE 26858061 usec random writes MADV_NOHUGEPAGE 31067825 usec random writes MADV_NOHUGEPAGE 31334770 usec random writes MADV_HUGEPAGE 26785942 usec That's 25.8% greater random access latency when going across socket vs accessing local native pages. > > This isn't an argument in support of this patch, there is a difference > > between (1) pages of the native page size being faulted first locally > > falling back remotely and (2) hugepages being faulted first locally and > > falling back to native pages locally because it has better access latency > > on most platforms for workloads that do not span multiple nodes. Note > > that the page allocator is unaware whether the workload spans multiple > > nodes so it cannot make this distinction today, and that's what I'd prefer > > to focus on rather than changing an overall policy for everybody. > > > > Overall, I think it would be ok to have behaviour whereby local THP is > allocated if cheaply, followed by base pages local followed by the remote > options. However, __GFP_THISNODE removes the possibility of allowing > remote fallback and instead causing a swap storm and swap storms are > trivial to generate on NUMA machine running a mainline kernel today. > Yes, this is hopefully what we can focus on and I hope we can make forward progress with (1) extending mempolicies to allow specifying hugepage specific policies, (2) the prctl(), (3) improving the feedback loop between compaction and direct reclaim, and/or (4) resolving the overloaded the conflicting meanings of /sys/kernel/mm/transparent_hugepage/{enabled,defrag} and MADV_HUGEPAGE/MADV_NOHUGEPAGE. The issue here is the overloaded nature of what MADV_HUGEPAGE means and what the system-wide thp settings mean. 
It cannot possibly provide sane behavior of all possible workloads given only two settings. MADV_HUGEPAGE itself has *four* meanings: (1) determine hugepage eligiblity when not default, (2) try to do sychronous compaction/reclaim at fault, (3) determine eligiblity of khugepaged, (4) control defrag settings based on system-wide setting. The patch here is adding a fifth: (5) prefer remote allocation when local memory is fragmented. None of this is sustainable. Note that this patch is also preferring remote hugepage allocation *over* local hugepages before trying memory compaction locally depending on the setting of vm.zone_reclaim_mode so it is infringing on the long-standing behavior of (2) as well. In situations such as these, it is not surprising that there are issues reported with any combination of flags or settings and patches get proposed to are very workload dependent. My suggestion has been to move in a direction where this can be resolved such that userspace has a clean and stable API and we can allow remote hugepage allocation for workloads that specifically opt-in, but not incur 25.8% greater access latency for using the behavior of the past 3+ years. Another point that has consistently been raised on LKML is the inability to disable MADV_HUGEPAGE once set: i.e. if you set it for faulting your workload, you are required to do MADV_NOHUGEPAGE to clear it and then are explicitly asking that this memory is not backed by hugepages. > > It may not have been meant to provide this, but when IBM changed this > > three years ago because of performance regressions and others have started > > to use MADV_HUGEPAGE with that policy in mind, it is the reality of what > > the madvise advice has provided. What was meant to be semantics of > > MADV_HUGEPAGE three years ago is irrelevant today if it introduces > > performance regressions for users who have used the advice mode during > > that past three years. > > > > Incurring swap storms when there is plenty of free memory available is > terrible. If the working set size is larger than a node, the swap storm > may even persist indefinitely. > Let's fix it. > > Yes, this is clearly understood and was never objected to when this first > > came up in the thread where __GFP_THISNODE was removed or when Linus > > reverted the patch. > > > > The issue being discussed here is a result of MADV_HUGEPAGE being > > overloaded: it cannot mean to control (1) how much compaction/reclaim is > > done for page allocation, (2) the NUMA locality of those hugepages, and > > (3) the eligibility of the memory to be collapsed into hugepages by > > khugepaged all at the same time. > > > > I suggested then that we actually define (2) concretely specifically for > > the usecase that you mention. Changing the behavior of MADV_HUGEPAGE for > > the past three years, however, and introducing performance regressions for > > those users is not an option regardless of the intent that it had when > > developed. > > > > The current behaviour is potential swap storms that are difficult to > avoid because the behaviour is hard-wired into the kernel internals. > Only disabling THP, either on a task or global basis, avoids it when > encountered. > We are going in circles, *yes* there is a problem for potential swap storms today because of the poor interaction between memory compaction and directed reclaim but this is a result of a poor API that does not allow userspace to specify that its workload really will span multiple sockets so faulting remotely is the best course of action. 
The fix is not to cause regressions for others who have implemented a userspace stack that is based on the past 3+ years of long standing behavior or for specialized workloads where it is known that it spans multiple sockets so we want some kind of different behavior. We need to provide a clear and stable API to define these terms for the page allocator that is independent of any global setting of thp enabled, defrag, zone_reclaim_mode, etc. It's workload dependent. Is it deserving of an MADV_REMOTE_HUGEPAGE, a prctl(), hugepage extended mempolicies, or something else? I'm positive that the problem is understood and we could reach some form of consensus before an implementation is invested into if we actually discuss ways to fix the underlying issue. Thanks.
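To make the point about MADV_HUGEPAGE being impossible to merely un-set concrete, here is a minimal userspace sketch; the madvise() flags are the standard ones, while the mapping size is an arbitrary assumption for the example:

#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>

#define LEN (512UL << 20)       /* arbitrary 512MB range for the example */

int main(void)
{
        char *buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED) {
                perror("mmap");
                return EXIT_FAILURE;
        }

        /* Opt the range in: the overloaded meanings above all come bundled. */
        if (madvise(buf, LEN, MADV_HUGEPAGE))
                perror("madvise(MADV_HUGEPAGE)");

        /*
         * There is no "back to the default" advice. The only way to clear
         * MADV_HUGEPAGE is MADV_NOHUGEPAGE, which actively forbids THP for
         * the range instead of restoring the system-wide default.
         */
        if (madvise(buf, LEN, MADV_NOHUGEPAGE))
                perror("madvise(MADV_NOHUGEPAGE)");

        munmap(buf, LEN);
        return 0;
}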
On Mon, 20 May 2019 10:54:16 -0700 (PDT) David Rientjes <rientjes@google.com> wrote: > We are going in circles, *yes* there is a problem for potential swap > storms today because of the poor interaction between memory compaction and > directed reclaim but this is a result of a poor API that does not allow > userspace to specify that its workload really will span multiple sockets > so faulting remotely is the best course of action. The fix is not to > cause regressions for others who have implemented a userspace stack that > is based on the past 3+ years of long standing behavior or for specialized > workloads where it is known that it spans multiple sockets so we want some > kind of different behavior. We need to provide a clear and stable API to > define these terms for the page allocator that is independent of any > global setting of thp enabled, defrag, zone_reclaim_mode, etc. It's > workload dependent. um, who is going to do this work? Implementing a new API doesn't help existing userspace which is hurting from the problem which this patch addresses. It does appear to me that this patch does more good than harm for the totality of kernel users, so I'm inclined to push it through and to try to talk Linus out of reverting it again.
On Mon, May 20, 2019 at 10:54:16AM -0700, David Rientjes wrote: > On Mon, 20 May 2019, Mel Gorman wrote: > > > > There was exhausting discussion subsequent to this that caused Linus to > > > have to revert the offending commit late in an rc series that is not > > > described here. > > > > Yes, at the crux of that matter was which regression introduced was more > > important -- the one causing swap storms which Andrea is trying to address > > or a latency issue due to assumptions of locality when MADV_HUGEPAGE > > is used. > > > > More people are affected by swap storms and distributions are carrying > > out-of-tree patches to address it. Furthermore, multiple people unrelated > > to each other can trivially reproduce the problem with test cases and > > experience the problem with real workloads. Only you has a realistic > > workload sensitive to the latency issue and we've asked repeatedly for > > a test case (most recently Michal Hocko on May 4th) which is still not > > available. > > > > Hi Mel, > > Any workload that does MADV_HUGEPAGE will be impacted if remote hugepage > access latency is greater than local native page access latency and is > using the long-standing behavior of the past three years. And prior to that, THP usage could cause massive latencies due to reclaim and compaction that was adjusted over time to cause the least harm. We've had changes in behaviour for THP and madvise before -- largely due to cases where THP allocation caused large stalls that users found surprising. These stalls generated quite a substantial number of bugs in the field. As before, what is important is causing the least harm to the most people when corner cases are hit. > The test case > would be rather straight forward: induce node local fragmentation (easiest > to do by injecting a kernel module), do MADV_HUGEPAGE over a large range, > fault, and measure random access latency. This is readily observable and > can be done synthetically to measure the random access latency of local > native pages vs remote hugepages. Andrea provided this testcase in the > original thread. My results from right now: > > # numactl -m 0 -C 0 ./numa-thp-bench > random writes MADV_HUGEPAGE 17492771 usec > random writes MADV_NOHUGEPAGE 21344846 usec > random writes MADV_NOHUGEPAGE 21399545 usec > random writes MADV_HUGEPAGE 17481949 usec > # numactl -m 0 -C 64 ./numa-thp-bench > random writes MADV_HUGEPAGE 26858061 usec > random writes MADV_NOHUGEPAGE 31067825 usec > random writes MADV_NOHUGEPAGE 31334770 usec > random writes MADV_HUGEPAGE 26785942 usec >

Ok, let's consider two scenarios. The first one is very basic -- using a large buffer that is larger than a memory node size. The demonstration program is simple

--8<-- mmap-demo.c --8<--
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LOOPS 3

#ifndef MADV_FLAGS
#define MADV_FLAGS 0
#endif

int main(int argc, char **argv)
{
        char *buf;
        int i;
        size_t length;

        if (argc != 2) {
                printf("Specify buffer size in bytes\n");
                exit(EXIT_FAILURE);
        }
        length = atol(argv[1]) & ~4095;

        buf = mmap(NULL, length, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
        if (buf == MAP_FAILED) {
                perror("mmap failed");
                exit(EXIT_FAILURE);
        }

        if (MADV_FLAGS)
                madvise(buf, length, MADV_FLAGS);

        printf("Address %p Length %lu MB\n", buf, length / 1048576);

        for (i = 0; i < LOOPS; i++) {
                memset(buf, i, length);
                printf(".");
                fflush(NULL);
        }
        printf("\n");
}
--8<-- mmap-demo.c --8<--

All it's doing is writing a large anonymous array.
Lets see how it behaves # Set buffer size to 80% of memory -- machine has 2 nodes that are # equal size so this will spill over $ BUFSIZE=$((`free -b | grep Mem: | awk '{print $2}'`*8/10)) # Scenario 1: default setting, no madvise. Using CPUs from only one # node as otherwise numa balancing or cpu balancing will migrate # the task based on locality. Not particularly unusual when packing # virtual machines in a box $ gcc -O2 mmap-demo.c -o mmap-demo && numactl --cpunodebind 0 /usr/bin/time ./mmap-demo $BUFSIZE Address 0x7fdc5b890000 Length 51236 MB ... 25.48user 30.19system 0:55.68elapsed 99%CPU (0avgtext+0avgdata 52467180maxresident)k 0inputs+0outputs (0major+15388156minor)pagefaults 0swaps Total time is 55.68 seconds to execute, lots of minor faults for the allocations (some may be NUMA balancing). vmstat for the time it was running was as follows procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 0 0 0 65001004 32 279528 0 1 1 1 1 1 0 0 100 0 0 1 0 0 48025796 32 279516 0 0 0 1 281 75 0 2 98 0 0 1 0 0 30927108 32 279444 0 0 0 0 285 54 0 2 98 0 0 1 0 0 22250504 32 279284 0 0 0 0 277 44 0 2 98 0 0 1 0 0 13665116 32 279272 0 0 0 0 288 67 0 2 98 0 0 1 0 0 12432096 32 279196 0 0 0 0 276 46 2 0 98 0 0 1 0 0 12429580 32 279452 0 0 0 598 297 96 1 1 98 0 0 1 0 0 12429604 32 279432 0 0 0 3 278 50 1 1 98 0 0 1 0 0 12429856 32 279432 0 0 0 0 289 68 1 1 98 0 0 1 0 0 12429864 32 279420 0 0 0 0 275 43 2 0 98 0 0 1 0 0 12429936 32 279420 0 0 0 0 298 61 1 1 98 0 0 1 0 0 12429944 32 279416 0 0 0 0 275 42 1 1 98 0 0 That's fairly straight-forward. Memory gets used, no particularly unusual activity when the buffer is allocated and updated. Now, lets use MADV_HUGEPAGE $ gcc -DMADV_FLAGS=MADV_HUGEPAGE -O2 mmap-demo.c -o mmap-demo && numactl --cpunodebind 0 /usr/bin/time ./mmap-demo $BUFSIZE Address 0x7fe8b947d000 Length 51236 MB ... 25.46user 33.12system 1:06.80elapsed 87%CPU (0avgtext+0avgdata 52467172maxresident)k 1932184inputs+0outputs (30197major+15103633minor)pagefaults 0swaps Just 10 seconds more due to being a simple case with few loops but look at the major faults, there are non-zero even though there was plenty of memory. Lets look at vmstat procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 0 0 0 64997812 32 279380 0 1 1 1 1 1 0 0 100 0 0 1 0 0 47230624 32 279392 0 0 0 1 286 74 0 2 98 0 0 0 1 324104 32915756 32 233752 0 64786 3 64790 350 330 0 1 98 0 0 1 0 1048572 31166652 32 223076 32 144950 32 144950 485 2117 0 1 99 1 0 1 0 1048572 23839632 32 223076 0 0 0 0 277 5777 0 2 98 0 0 1 0 1048572 16598116 32 223064 0 0 0 0 281 5714 0 2 98 0 0 0 1 502444 12947660 32 223064 107547 0 107549 0 3840 16245 0 1 99 0 0 1 0 944 12515736 32 224368 85670 0 85737 629 3219 11098 1 0 99 0 0 1 0 944 12514224 32 224368 0 0 0 0 275 42 2 1 98 0 0 1 0 944 12514228 32 224364 0 0 0 0 280 52 1 1 98 0 0 1 0 944 12514228 32 224364 0 0 0 0 275 44 1 1 98 0 0 1 0 944 12513712 32 224364 0 0 0 0 291 69 2 0 98 0 0 1 0 944 12513964 32 224364 0 0 0 0 274 43 1 1 98 0 0 1 1 216228 12643952 32 224132 0 43008 0 43008 747 130 1 1 98 0 0 1 0 1188 65081364 32 224464 57 819 62 819 296 111 0 1 99 0 0 That is showing large amounts of swaps out and in. This demonstration case could be made much worse but it's illustrative of what has been observed -- __GFP_THISNODE is harmful to MADV_HUGEPAGE. 
Now, contrast this with your example o Induce node local fragmentation using a kernel module that must be developed. No description on whether this should be equivalent to anonymous memory, file-backed or pinned like it was slab objects. No statment on whether the module memory should be able to migrate like what compaction does. o Measure random access latency -- requires specific application knowledge or detailed perf analysis o Applicable to applications that are extremely latency sensitive only My example can be demonstrated by a 1st year computer programmer with minimal effort. It is also visible to anyone creating a KVM instance that is larger than a NUMA node if the virtual machine is using enough of its memory. Your example requires implementation of a kernel module with much guesswork as to what is a realistic means and then implement an application that is latency sensitive. The bottom line is that far more people with much less experience can detect a swap storm and know its bad. Furthermore, if MADV_HUGEPAGE is used by something like KVM, there isn't a workaround except for disabling THP for the application which for KVM is a big penalty. Your scenario of having a latency access penalty is harder to detect and depends on the state of the system at the time the application executes. Contrast that with the workarounds for your situation where the system is fragmented. There are multiple choices 1. Enable zone reclaim mode for the initialisation phase 2. Memory bind the application to the target node 3. "Flush" memory before the application starts with with something like numactl --membind=0 memhog -r10 $HIGH_PERCENTAGE_OF_NODE_0 It's clumsy but it's workable in the short term. > > > This isn't an argument in support of this patch, there is a difference > > > between (1) pages of the native page size being faulted first locally > > > falling back remotely and (2) hugepages being faulted first locally and > > > falling back to native pages locally because it has better access latency > > > on most platforms for workloads that do not span multiple nodes. Note > > > that the page allocator is unaware whether the workload spans multiple > > > nodes so it cannot make this distinction today, and that's what I'd prefer > > > to focus on rather than changing an overall policy for everybody. > > > > > > > Overall, I think it would be ok to have behaviour whereby local THP is > > allocated if cheaply, followed by base pages local followed by the remote > > options. However, __GFP_THISNODE removes the possibility of allowing > > remote fallback and instead causing a swap storm and swap storms are > > trivial to generate on NUMA machine running a mainline kernel today. > > > > Yes, this is hopefully what we can focus on and I hope we can make forward > progress with (1) extending mempolicies to allow specifying hugepage > specific policies, (2) the prctl(), (3) improving the feedback loop > between compaction and direct reclaim, and/or (4) resolving the overloaded > the conflicting meanings of > /sys/kernel/mm/transparent_hugepage/{enabled,defrag} and > MADV_HUGEPAGE/MADV_NOHUGEPAGE. > 3 should be partially done with the latest compaction series, it's unclear how far it goes for your case because it cannot be trivially reproduced outside of your test environment. I never got any report back on how it affected your workload but for the trivial cases, it helped (modulo bugs that had to be fixed for corner cases on zone boundary handling). 
1, 2 and 4 are undefined at this point because it's unclear what sort of policies would suit your given scenario and whether you would be even willing to rebuild the applications. For everybody else it's a simple "do not use __GFP_THISNODE for MADV_HUGEPAGE". In the last few months, there also has been no evidence of what policies would suit you or associated patches. What you want is zone_reclaim_mode for huge pages but for whatever reason, are unwilling to enable zone_reclaim_mode. However, it would make some sense to extend it. The current definition is This is value ORed together of 1 = Zone reclaim on 2 = Zone reclaim writes dirty pages out 4 = Zone reclaim swaps pages An optional extra would be 8 = Zone reclaim on for THP applications for MADV_HUGEPAGE mappings to require both THP where possible and local memory The default would be off. Your systems would need to or the 8 value Would that be generally acceptable? It would give sensible default behaviour for everyone and the option for those users that know for a fact their application fits in a NUMA node *and* is latency sensitive to remote accesses. > The issue here is the overloaded nature of what MADV_HUGEPAGE means and > what the system-wide thp settings mean. MADV_HUGEPAGE is documented to mean "Enable Transparent Huge Pages (THP) for pages in the range specified by addr and length". It says nothing about locality. Locality decisions are set by policies, not madvise. MPOL_BIND would be the obvious choice for strict locality but that is not always necessary the best decision. It is unclear if a policy like MPOL_CPU_LOCAL for both base and THP allocations would actually help you because the semantics could be defined in multiple ways. Critically, there is little information on what level of effort the kernel should do to give local memory. > It cannot possibly provide sane > behavior of all possible workloads given only two settings. MADV_HUGEPAGE > itself has *four* meanings: (1) determine hugepage eligiblity when not > default, (2) try to do sychronous compaction/reclaim at fault, (3) > determine eligiblity of khugepaged, (4) control defrag settings based on > system-wide setting. The patch here is adding a fifth: (5) prefer remote > allocation when local memory is fragmented. None of this is sustainable. > Given that locality and reclaim behaviour for *all* pages was specified by zone_reclaim_mode, it could be extended to cover special casing of THP. > Note that this patch is also preferring remote hugepage allocation *over* > local hugepages before trying memory compaction locally depending on the > setting of vm.zone_reclaim_mode so it is infringing on the long-standing > behavior of (2) as well. > Again, extending zone_reclaim_mode would act as a band-aid until the various policies can be defined and agreed upon. Once that API is set, it will be with us for a while and right now, we have swap storms. > In situations such as these, it is not surprising that there are issues > reported with any combination of flags or settings and patches get > proposed to are very workload dependent. My suggestion has been to move > in a direction where this can be resolved such that userspace has a clean > and stable API and we can allow remote hugepage allocation for workloads > that specifically opt-in, but not incur 25.8% greater access latency for > using the behavior of the past 3+ years. 
You suggest moving in some direction but have offered very little in terms of reproducing your problematic scenario or defining exactly what those policies should mean. For most people if they want memory to be local, they use MPOL_BIND and call it a day. It's not clear what policy you would define that gets translated into behaviour you find acceptable. > Another point that has consistently been raised on LKML is the inability > to disable MADV_HUGEPAGE once set: i.e. if you set it for faulting your > workload, you are required to do MADV_NOHUGEPAGE to clear it and then are > explicitly asking that this memory is not backed by hugepages. I missed that one but it does not sound like an impossible problem to define another MADV flag for it. > > > It may not have been meant to provide this, but when IBM changed this > > > three years ago because of performance regressions and others have started > > > to use MADV_HUGEPAGE with that policy in mind, it is the reality of what > > > the madvise advice has provided. What was meant to be semantics of > > > MADV_HUGEPAGE three years ago is irrelevant today if it introduces > > > performance regressions for users who have used the advice mode during > > > that past three years. > > > > > > > Incurring swap storms when there is plenty of free memory available is > > terrible. If the working set size is larger than a node, the swap storm > > may even persist indefinitely. > > > > Let's fix it. That's what Andrea's patch does -- fixes trivial swap storms. It does require you use existing memory policies *or* temporarily enable zone_reclaim_mode during initialisation to get the locality you want. Alternatively, we extend zone_reclaim_mode to apply to THP+MADV_HUGEPAGE. You could also carry a revert seeing as the kernel is used for an internal workload whereas some of us have to support all classes of users running general workloads where causing the least harm is important for supportability. > > The current behaviour is potential swap storms that are difficult to > > avoid because the behaviour is hard-wired into the kernel internals. > > Only disabling THP, either on a task or global basis, avoids it when > > encountered. > > > > We are going in circles, *yes* there is a problem for potential swap > storms today because of the poor interaction between memory compaction and > directed reclaim but this is a result of a poor API that does not allow > userspace to specify that its workload really will span multiple sockets > so faulting remotely is the best course of action. Yes, we're going in circles. I find it amazing that you think leaving users with trivial-to-reproduce swap storms is acceptable until some unreproducible workload can be fixed with some undefined set of unimplemented memory policies. What we have right now is a concrete problem with a fix that is being nacked with a counter-proposal being "someone should implement the test case for me then define/implement policies that suit that specific test case".
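The short-term workarounds listed above can be sketched in code. This is only an illustration of the recipe, not something posted in the thread: node 0 as the local node, the 1GB size, and the silent sysctl handling are assumptions; mbind() comes from libnuma's numaif.h (link with -lnuma) and writing the sysctl requires root.

/*
 * Sketch of the interim workarounds: bind the range to the local node with
 * an explicit policy and, if wanted, enable zone_reclaim_mode only around
 * initialisation.
 */
#include <numaif.h>
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>

static void set_zone_reclaim(const char *val)
{
        FILE *f = fopen("/proc/sys/vm/zone_reclaim_mode", "w");

        if (!f)
                return;         /* not root or sysctl missing: skip quietly */
        fputs(val, f);
        fclose(f);
}

int main(void)
{
        size_t len = 1UL << 30;                 /* assumed to fit in one node */
        unsigned long nodemask = 1UL << 0;      /* node 0 only */
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED) {
                perror("mmap");
                return EXIT_FAILURE;
        }

        /* Explicit locality, so it no longer relies on MADV_HUGEPAGE. */
        if (mbind(buf, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0))
                perror("mbind(MPOL_BIND)");

        madvise(buf, len, MADV_HUGEPAGE);

        set_zone_reclaim("1");
        /* ... fault in the working set here ... */
        set_zone_reclaim("0");

        munmap(buf, len);
        return 0;
}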
Hello everyone, On Thu, May 23, 2019 at 05:57:37PM -0700, Andrew Morton wrote: > On Mon, 20 May 2019 10:54:16 -0700 (PDT) David Rientjes <rientjes@google.com> wrote: > > > We are going in circles, *yes* there is a problem for potential swap > > storms today because of the poor interaction between memory compaction and > > directed reclaim but this is a result of a poor API that does not allow > > userspace to specify that its workload really will span multiple sockets > > so faulting remotely is the best course of action. The fix is not to > > cause regressions for others who have implemented a userspace stack that > > is based on the past 3+ years of long standing behavior or for specialized > > workloads where it is known that it spans multiple sockets so we want some > > kind of different behavior. We need to provide a clear and stable API to > > define these terms for the page allocator that is independent of any > > global setting of thp enabled, defrag, zone_reclaim_mode, etc. It's > > workload dependent. > > um, who is going to do this work? That's a good question. It's going to be a not simple patch to backport to -stable: it'll be intrusive and it will affect mm/page_alloc.c significantly so it'll reject heavy. I wouldn't consider it -stable material at least in the short term, it will require some testing. This is why applying a simple fix that avoids the swap storms (and the swap-less pathological THP regression for vfio device assignment GUP pinning) is preferable before adding an alloc_pages_multi_order (or equivalent) so that it'll be the allocator that will decide when exactly to fallback from 2M to 4k depending on the NUMA distance and memory availability during the zonelist walk. The basic idea is to call alloc_pages just once (not first for 2M and then for 4k) and alloc_pages will decide which page "order" to return. > Implementing a new API doesn't help existing userspace which is hurting > from the problem which this patch addresses. Yes, we can't change all apps that may not fit in a single NUMA node. Currently it's unsafe to turn "transparent_hugepages/defrag = always" or the bad behavior can then materialize also outside of MADV_HUGEPAGE. Those apps that use MADV_HUGEPAGE on their long lived allocations (i.e. guest physical memory) like qemu are affected even with the default "defrag = madvise". Those apps are using MADV_HUGEPAGE for more than 3 years and they are widely used and open source of course. > It does appear to me that this patch does more good than harm for the > totality of kernel users, so I'm inclined to push it through and to try > to talk Linus out of reverting it again. That sounds great. It's also what 3 enterprise distributions had to do already. As Mel described in detail, remote THP can't be slower than the swap I/O (even if we'd swap on a nvdimm it wouldn't change this). As Michael suggested a dynamic "numa_node_id()" mbind could be pursued orthogonally to still be able to retain the current upstream behavior for small apps that can fit in the node and do extremely long lived static allocations and that don't care if they cause a swap storm during startup. All we argue about is the default "defrag = always" and MADV_HUGEPAGE behavior. 
The current behavior of "defrag = always" and MADV_HUGEPAGE is in fact far more aggressive than zone_reclaim_mode, which is also not enabled by default for similar reasons (although enabling zone_reclaim_mode by default would carry much less risk of pathological regressions for large workloads that can't fit in a single node). Enabling zone_reclaim_mode would eventually fall back to remote nodes gracefully.

By contrast, the fallback to remote nodes with __GFP_THISNODE can only happen after the 2M allocation has failed, and the problem is that 2M allocations don't fail, because the compaction+reclaim interleaving keeps succeeding by swapping out more and more memory. That would be exactly the right behavior for the compaction+reclaim interleaving if the whole system really were out of memory in all nodes (and it isn't).

The false positive result from the automated testing (where overall swapping performance decreased because fairness increased) wasn't anybody's fault, and so the revert at the end of the merge window was a safe approach. So we can try again to fix it now.

Thanks!
Andrea
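As a rough illustration of the alloc_pages_multi_order idea sketched above -- a single allocator call that walks the zonelist and decides per zone whether a 2M or a 4k page should be returned -- here is a toy userspace model. The function name, the zone structure and the decision logic are assumptions made only for this example; they are not the proposed kernel implementation.

/*
 * Toy model of a "multi order" zonelist walk: for each zone in locality
 * order, try a 2M page first, then 4k in the same zone, before moving on
 * to the next (more remote) zone.  All names here are invented.
 */
#include <stdbool.h>
#include <stdio.h>

struct zone { const char *name; bool free_2m; bool free_4k; };

/* Returns the order allocated (9 for 2M, 0 for 4k) or -1 on failure. */
static int alloc_multi_order(const struct zone *zl, int nr, const char **from)
{
        for (int i = 0; i < nr; i++) {
                if (zl[i].free_2m) {
                        *from = zl[i].name;
                        return 9;
                }
                if (zl[i].free_4k) {
                        *from = zl[i].name;
                        return 0;
                }
        }
        return -1;
}

int main(void)
{
        const struct zone zonelist[] = {
                { "node0 (local)",  false, true },      /* fragmented: only 4k free */
                { "node1 (remote)", true,  true },      /* has a free 2M page */
        };
        const char *from = "none";
        int order = alloc_multi_order(zonelist, 2, &from);

        /* Prints order 0 from node0: local 4k preferred over a remote THP. */
        printf("got order %d from %s\n", order, from);
        return 0;
}

Under this fallback order a local 4k page is preferred to a remote THP, i.e. the "D0 THP | D0 4k | D1 THP | D1 4k | ..." ordering that the changelog describes as the long-term aim, rather than what the revert alone implements.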
On Fri, 24 May 2019, Andrea Arcangeli wrote: > > > We are going in circles, *yes* there is a problem for potential swap > > > storms today because of the poor interaction between memory compaction and > > > directed reclaim but this is a result of a poor API that does not allow > > > userspace to specify that its workload really will span multiple sockets > > > so faulting remotely is the best course of action. The fix is not to > > > cause regressions for others who have implemented a userspace stack that > > > is based on the past 3+ years of long standing behavior or for specialized > > > workloads where it is known that it spans multiple sockets so we want some > > > kind of different behavior. We need to provide a clear and stable API to > > > define these terms for the page allocator that is independent of any > > > global setting of thp enabled, defrag, zone_reclaim_mode, etc. It's > > > workload dependent. > > > > um, who is going to do this work? > > That's a good question. It's going to be a not simple patch to > backport to -stable: it'll be intrusive and it will affect > mm/page_alloc.c significantly so it'll reject heavy. I wouldn't > consider it -stable material at least in the short term, it will > require some testing. > Hi Andrea, I'm not sure what patch you're referring to, unfortunately. The above comment was referring to APIs that are made available to userspace to define when to fault locally vs remotely and what the preference should be for any form of compaction or reclaim to achieve that. Today we have global enabling options, global defrag settings, enabling prctls, and madvise options. The point it makes is that whether a specific workload fits into a single socket is workload dependant and thus we are left with prctls and madvise options. The prctl either enables thp or it doesn't, it is not interesting here; the madvise is overloaded in four different ways (enabling, stalling at fault, collapsability, defrag) so it's not surprising that continuing to overload it for existing users will cause undesired results. It makes an argument that we need a clear and stable means of defining the behavior, not changing the 4+ year behavior and giving those who regress no workaround. > This is why applying a simple fix that avoids the swap storms (and the > swap-less pathological THP regression for vfio device assignment GUP > pinning) is preferable before adding an alloc_pages_multi_order (or > equivalent) so that it'll be the allocator that will decide when > exactly to fallback from 2M to 4k depending on the NUMA distance and > memory availability during the zonelist walk. The basic idea is to > call alloc_pages just once (not first for 2M and then for 4k) and > alloc_pages will decide which page "order" to return. > The commit description doesn't mention the swap storms that you're trying to fix, it's probably better to describe that again and why it is not beneficial to swap unless an entire pageblock can become free or memory compaction has indicated that additional memory freeing would allow migration to make an entire pageblock free. I understand that's a invasive code change, but merging this patch changes the 4+ year behavior that started here: commit 077fcf116c8c2bd7ee9487b645aa3b50368db7e1 Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Date: Wed Feb 11 15:27:12 2015 -0800 mm/thp: allocate transparent hugepages on local node And that commit's description describes quite well the regression that we encounter if we remove __GFP_THISNODE here. 
That's because the access latency regression is much more substantial than what was reported for Naples in your changelog. In the interest of making forward progress, can we agree that swapping from the local node *never* makes sense unless we can show that an entire pageblock can become free or that it enables memory compaction to migrate memory that can make an entire pageblock free? Are you reporting swap storms for the local node when one of these is true? > > Implementing a new API doesn't help existing userspace which is hurting > > from the problem which this patch addresses. > > Yes, we can't change all apps that may not fit in a single NUMA > node. Currently it's unsafe to turn "transparent_hugepages/defrag = > always" or the bad behavior can then materialize also outside of > MADV_HUGEPAGE. Those apps that use MADV_HUGEPAGE on their long lived > allocations (i.e. guest physical memory) like qemu are affected even > with the default "defrag = madvise". Those apps are using > MADV_HUGEPAGE for more than 3 years and they are widely used and open > source of course. > I continue to reiterate that the 4+ year long standing behavior of MADV_HUGEPAGE is overloaded; you are anticipating a specific behavior for workloads that do not fit in a single NUMA node whereas other users developed in the past four years are anticipating a different behavior. I'm trying to propose solutions that can not cause regressions for any user, such as the prctl() example that is inherited across fork, and can be used to define the behavior. This could be a very trivial extension to prctl(PR_SET_THP_DISABLE) or it could be more elaborate as an addition. This would be set by any thread that forks qemu and can define that the workload prefers remote hugepages because it spans more than one node. Certainly we should agree that the majority of Linux workloads do not span more than one socket. However, it *may* be possible to define this as a global thp setting since most machines that run large guests are only running large guests so the default machine-level policy can reflect that.
On Thu, 23 May 2019, Andrew Morton wrote:

> > We are going in circles, *yes* there is a problem for potential swap > > storms today because of the poor interaction between memory compaction and > > directed reclaim but this is a result of a poor API that does not allow > > userspace to specify that its workload really will span multiple sockets > > so faulting remotely is the best course of action. The fix is not to > > cause regressions for others who have implemented a userspace stack that > > is based on the past 3+ years of long standing behavior or for specialized > > workloads where it is known that it spans multiple sockets so we want some > > kind of different behavior. We need to provide a clear and stable API to > > define these terms for the page allocator that is independent of any > > global setting of thp enabled, defrag, zone_reclaim_mode, etc. It's > > workload dependent. > > um, who is going to do this work? > > Implementing a new API doesn't help existing userspace which is hurting > from the problem which this patch addresses. >

The problem which this patch addresses has apparently gone unreported for 4+ years since

commit 077fcf116c8c2bd7ee9487b645aa3b50368db7e1
Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Date:   Wed Feb 11 15:27:12 2015 -0800

    mm/thp: allocate transparent hugepages on local node

My goal is to reach a solution that does not cause anybody to incur performance penalties as a result of it. It's surprising that such a severe swap storm issue that went unnoticed for four years is something that can't reach an amicable solution that doesn't cause other users to regress.

> It does appear to me that this patch does more good than harm for the > totality of kernel users, so I'm inclined to push it through and to try > to talk Linus out of reverting it again. >

(1) The totality of kernel users are not running workloads that span multiple sockets, it's the minority, (2) it's changing 4+ year behavior based on NUMA locality of hugepage allocations and provides no workarounds for users who incur regressions as a result, and (3) it does not solve the underlying issue if remote memory is also fragmented or low on memory: it actually makes the problem worse.

The easiest solution would be to define the MADV_HUGEPAGE behavior explicitly in sysfs: local or remote. Default to local as the behavior from the past four years, and allow users to specify remote if their workloads will span multiple sockets. This is somewhat coarse, but no more so than the thp defrag setting in sysfs today that defines defrag behavior for everybody on the system.
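Purely to make the sysfs suggestion above concrete: no such file exists today, but if it did, opting a machine into remote THP placement could look like the sketch below. The path is invented for illustration and is not a real kernel interface.

/*
 * Hypothetical example only: the sysfs file below does not exist in any
 * kernel; it illustrates the "local or remote" knob proposed above.
 */
#include <stdio.h>

int main(void)
{
        const char *path =
                "/sys/kernel/mm/transparent_hugepage/madv_hugepage_locality";
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);   /* expected today: the knob is only a proposal */
                return 1;
        }
        fputs("remote\n", f);   /* "local" would be the proposed default */
        fclose(f);
        return 0;
}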
On Wed 29-05-19 14:24:33, David Rientjes wrote:

> On Thu, 23 May 2019, Andrew Morton wrote: > > > > We are going in circles, *yes* there is a problem for potential swap > > > storms today because of the poor interaction between memory compaction and > > > directed reclaim but this is a result of a poor API that does not allow > > > userspace to specify that its workload really will span multiple sockets > > > so faulting remotely is the best course of action. The fix is not to > > > cause regressions for others who have implemented a userspace stack that > > > is based on the past 3+ years of long standing behavior or for specialized > > > workloads where it is known that it spans multiple sockets so we want some > > > kind of different behavior. We need to provide a clear and stable API to > > > define these terms for the page allocator that is independent of any > > > global setting of thp enabled, defrag, zone_reclaim_mode, etc. It's > > > workload dependent. > > > > um, who is going to do this work? > > > > Implementing a new API doesn't help existing userspace which is hurting > > from the problem which this patch addresses. > > > > The problem which this patch addresses has apparently gone unreported for > 4+ years since

Can we finally stop considering the time and focus on what the most reasonable behavior in the general case is, please? Conserving mistakes based on the argument that we have had them for many years is just not productive. It is very well possible that workloads that suffer from this simply run on older distribution kernels which are moving towards newer kernels very slowly.

> commit 077fcf116c8c2bd7ee9487b645aa3b50368db7e1 > Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> > Date: Wed Feb 11 15:27:12 2015 -0800 > > mm/thp: allocate transparent hugepages on local node

Let me quote the commit message at full length:

"
This make sure that we try to allocate hugepages from local node if allowed by mempolicy. If we can't, we fallback to small page allocation based on mempolicy. This is based on the observation that allocating pages on local node is more beneficial than allocating hugepages on remote node.

With this patch applied we may find transparent huge page allocation failures if the current node doesn't have enough freee hugepages. Before this patch such failures result in us retrying the allocation on other nodes in the numa node mask.
"

I do not see any single numbers backing those claims or any mention of a workload that would benefit from the change. Besides that, we have seen that THP on a remote (but close) node might be performing better per Andrea's numbers. So those claims do not apply in general. This is a general problem when making decisions on heuristics which are not a clear cut. AFAICS there have been pretty good arguments given that _real_ workloads suffer from this change, while a demonstration of a _real_ workload that is benefiting is still missing.

> My goal is to reach a solution that does not cause anybody to incur > performance penalties as a result of it.

That is certainly appreciated and I can offer my help there as well. But I believe we should start with a code base that cannot generate a swapping storm by trivial code as demonstrated by Mel. A general idea on how to improve the situation has already been outlined for the default case, and a new memory policy has been mentioned as well, but we need something to start with and neither of the two is compatible with the __GFP_THISNODE behavior.

[...]
> The easiest solution would be to define the MADV_HUGEPAGE behavior > explicitly in sysfs: local or remote. Defaut to local as the behavior > from the past four years and allow users to specify remote if their > workloads will span multiple sockets. This is somewhat coarse but no more > than the thp defrag setting in sysfs today that defines defrag behavior > for everybody on the system.

This just makes the THP tuning even muddier. Really, can we start with code that doesn't blow up trivially and build on top of that? In other words, start by covering the less specialized usecase and then help the more specialized usecases get what they need.
On Fri, 31 May 2019, Michal Hocko wrote: > > The problem which this patch addresses has apparently gone unreported for > > 4+ years since > > Can we finaly stop considering the time and focus on the what is the > most reasonable behavior in general case please? Conserving mistakes > based on an argument that we have them for many years is just not > productive. It is very well possible that workloads that suffer from > this simply run on older distribution kernels which are moving towards > newer kernels very slowly. > That's fine, but we also must be mindful of users who have used MADV_HUGEPAGE over the past four years based on its hard-coded behavior that would now regress as a result. > > commit 077fcf116c8c2bd7ee9487b645aa3b50368db7e1 > > Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> > > Date: Wed Feb 11 15:27:12 2015 -0800 > > > > mm/thp: allocate transparent hugepages on local node > > Let me quote the commit message to the full lenght > " > This make sure that we try to allocate hugepages from local node if > allowed by mempolicy. If we can't, we fallback to small page allocation > based on mempolicy. This is based on the observation that allocating > pages on local node is more beneficial than allocating hugepages on remote > node. > > With this patch applied we may find transparent huge page allocation > failures if the current node doesn't have enough freee hugepages. Before > this patch such failures result in us retrying the allocation on other > nodes in the numa node mask. > " > > I do not see any single numbers backing those claims or any mention of a > workload that would benefit from the change. Besides that, we have seen > that THP on a remote (but close) node might be performing better per > Andrea's numbers. So those claims do not apply in general. > I confirm that on every platform I have tested that the access latency to local pages of the native page size has been less than hugepages on any remote node. I think it's generally accepted that NUMA-ness is more important than huge-ness in terms of access latency and this is not the reason why the revert is being proposed. Certainly if the argument is to be made that the default behavior should be what is in the best interest of Linux users in totality, preferring remote hugepages over local pages of the native page size would not be anywhere close. I agree with Aneesh's commit message 100%. > > My goal is to reach a solution that does not cause anybody to incur > > performance penalties as a result of it. > > That is certainly appreciated and I can offer my help there as well. But > I believe we should start with a code base that cannot generate a > swapping storm by a trivial code as demonstrated by Mel. A general idea > on how to approve the situation has been already outlined for a default > case and a new memory policy has been mentioned as well but we need > something to start with and neither of the two is compatible with the > __GFP_THISNODE behavior. > Thus far, I haven't seen anybody engage in discussion on how to address the issue other than proposed reverts that readily acknowledge they cause other users to regress. If all nodes are fragmented, the swap storms that are currently reported for the local node would be made worse by the revert -- if remote hugepages cannot be faulted quickly then it's only compounded the problem. 
The hugepage aware mempolicy idea is one way that could describe what should be done for these allocations, we could also perhaps consider heuristics that consider the memory pressure of the local node: just as I've never seen a platform where remote hugepages have better access latency than local pages, I've never seen a platform where remote hugepages aren't a win over remote pages. This, however, is more delicate on 4 socket and 8 socket platforms but I think a general rule that a hugepage is better, if readily allocatable, than a page on the same node. (I've seen cross socket latency for hugepages match the latency for pages, so not always a win: better to leave the hugepage available remotely for something running on that node.) If the local node has a workingset that reaches its capacity, then it makes sense to fault a remote hugepage instead because otherwise we are thrashing the local node. That's not a compaction problem, though, it's a reclaim problem. If compaction fails and it's because we can't allocate target pages, it's under memory pressure and it's uncertain if reclaim will help: it may fail after expensive swap, the reclaimed pages could be grabbed by somebody else and we loop, or the compaction freeing scanner can't find it. Worst case is we thrash the local node in a swap storm. So the argument I've made when the removal of __GFP_THISNODE was first proposed is that local hugepage allocation should be the preference including direct compaction for users of MADV_HUGEPAGE (or thp enabled=always) but reclaim should never be done locally. I'd very much like to engage with anybody who would be willing to discuss fixes that work for everybody rather than only propose reverts and leave others to deal with new performance regressions.
On Fri 31-05-19 14:53:35, David Rientjes wrote:

> On Fri, 31 May 2019, Michal Hocko wrote: > > > > The problem which this patch addresses has apparently gone unreported for > > > 4+ years since > > > > Can we finaly stop considering the time and focus on the what is the > > most reasonable behavior in general case please? Conserving mistakes > > based on an argument that we have them for many years is just not > > productive. It is very well possible that workloads that suffer from > > this simply run on older distribution kernels which are moving towards > > newer kernels very slowly. > > > > That's fine, but we also must be mindful of users who have used > MADV_HUGEPAGE over the past four years based on its hard-coded behavior > that would now regress as a result.

Absolutely, I am all for helping those usecases. First of all we need to understand what those usecases are, though. So far we have only seen very vague claims about artificial worst case examples when a remote access dominates the overall cost, but that doesn't seem to be the case in real life in my experience (e.g. numa balancing will correct things, or the over-aggressive node reclaim tends to cause problems elsewhere etc.). That being said, I am pretty sure that a new memory policy as proposed previously, one that would allow for a node reclaim behavior, is a way forward for those very specific workloads that absolutely benefit from local access. There are however real life usecases that benefit from THP even on remote nodes, as explained by Andrea (most notably kvm), and the only way those can express their needs is the madvise flag. Not to mention that the default node reclaim behavior might cause excessive reclaim, as demonstrated by Mel and Andrea, and that is certainly not desirable in itself.

[...]

> > > My goal is to reach a solution that does not cause anybody to incur > > > performance penalties as a result of it. > > > > That is certainly appreciated and I can offer my help there as well. But > > I believe we should start with a code base that cannot generate a > > swapping storm by a trivial code as demonstrated by Mel. A general idea > > on how to approve the situation has been already outlined for a default > > case and a new memory policy has been mentioned as well but we need > > something to start with and neither of the two is compatible with the > > __GFP_THISNODE behavior. > > > > Thus far, I haven't seen anybody engage in discussion on how to address > the issue other than proposed reverts that readily acknowledge they cause > other users to regress. If all nodes are fragmented, the swap storms that > are currently reported for the local node would be made worse by the > revert -- if remote hugepages cannot be faulted quickly then it's only > compounded the problem.

Andrea has outlined the strategy to follow, IIRC. There has also been general agreement that we shouldn't be over-eager to fall back to remote nodes if the base page size allocation could be satisfied from a local node.
On Wed, 5 Jun 2019, Michal Hocko wrote: > > That's fine, but we also must be mindful of users who have used > > MADV_HUGEPAGE over the past four years based on its hard-coded behavior > > that would now regress as a result. > > Absolutely, I am all for helping those usecases. First of all we need to > understand what those usecases are though. So far we have only seen very > vague claims about artificial worst case examples when a remote access > dominates the overall cost but that doesn't seem to be the case in real > life in my experience (e.g. numa balancing will correct things or the > over aggressive node reclaim tends to cause problems elsewhere etc.). > The usecase is a remap of a binary's text segment to transparent hugepages by doing mmap() -> madvise(MADV_HUGEPAGE) -> mremap() and when this happens on a locally fragmented node. This happens at startup when we aren't concerned about allocation latency: we want to compact. We are concerned with access latency thereafter as long as the process is running. MADV_HUGEPAGE has worked great for this and we have a large userspace stack built upon that because it's been the long-standing behavior. This gets back to the point of MADV_HUGEPAGE being overloaded for four different purposes. I argue that processes that fit within a single node are in the majority. > > Thus far, I haven't seen anybody engage in discussion on how to address > > the issue other than proposed reverts that readily acknowledge they cause > > other users to regress. If all nodes are fragmented, the swap storms that > > are currently reported for the local node would be made worse by the > > revert -- if remote hugepages cannot be faulted quickly then it's only > > compounded the problem. > > Andrea has outline the strategy to go IIRC. There also has been a > general agreement that we shouldn't be over eager to fall back to remote > nodes if the base page size allocation could be satisfied from a local node. Sorry, I haven't seen patches for this, I can certainly test them if there's a link. If we have the ability to tune how eager the page allocator is to fallback and have the option to say "never" as part of that eagerness, it may work. The idea that I had was snipped from this, however, and it would be nice to get some feedback on it: I've suggested that direct reclaim for the purposes of hugepage allocation on the local node is never worthwhile unless and until memory compaction can both capture that page to use (not rely on the freeing scanner to find it) and that migration of a number of pages would eventually result in the ability to free a pageblock. I'm hoping that we can all agree to that because otherwise it leads us down a bad road if reclaim is doing pointless work (freeing scanner can't find it or it gets allocated again before it can find it) or compaction can't make progress as a result of it (even though we can migrate, it still won't free a pageblock). In the interim, I think we should suppress direct reclaim entirely for thp allocations, regardless of enabled=always or MADV_HUGEPAGE because it cannot be proven that the reclaim work is beneficial and I believe it results in the swap storms that are being reported. Any disagreements so far? Furthermore, if we can agree to that, memory compaction when allocating a transparent hugepage fails for different reasons, one of which is because we fail watermark checks because we lack migration targets. This is normally what leads to direct reclaim. 
Compaction is *supposed* to return COMPACT_SKIPPED for this, but that's overloaded as well: it happens when we fail extfrag_threshold checks and when the gfp flags don't allow it. The former matters for thp.

So my proposed change would be:

 - give the page allocator a consistent indicator that compaction failed because we are low on memory (make COMPACT_SKIPPED really mean this),
 - if we get this in the page allocator and we are allocating thp, fail; reclaim is unlikely to help here and is much more likely to be disruptive
    - we could retry compaction if we haven't scanned all memory and were contended,
 - if the hugepage allocation fails, have thp check watermarks for order-0 pages without any padding,
 - if watermarks succeed, fail the thp allocation: we can't allocate because of fragmentation and it's better to return node local memory,
 - if watermarks fail, a follow-up allocation of the pte will likely also fail, so thp retries the allocation with a cleared __GFP_THISNODE.

This doesn't sound very invasive and I'll code it up if it will be tested.
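To make the shape of that proposal easier to follow, here is a small, self-contained model of the decision tree. Every name below is invented for illustration; none of this is the kernel's real compaction or allocator interface, and the clarified COMPACT_SKIPPED meaning is exactly what the proposal would have to introduce.

/*
 * Toy model of the proposal above.  All names are invented for
 * illustration; none of this is the kernel's real interface.
 */
#include <stdbool.h>
#include <stdio.h>

enum compact_outcome {
        COMPACT_OK,             /* compaction produced a hugepage */
        COMPACT_LOW_ON_MEMORY,  /* the clarified meaning of COMPACT_SKIPPED */
        COMPACT_FRAGMENTED,     /* enough order-0 memory, but still fragmented */
};

enum thp_fault_action {
        USE_LOCAL_HUGEPAGE,     /* fault the local 2M page we compacted */
        FALLBACK_LOCAL_4K,      /* fail THP, return node-local base pages */
        RETRY_WITHOUT_THISNODE, /* allow a remote hugepage instead */
};

static enum thp_fault_action thp_fault_policy(enum compact_outcome c,
                                              bool order0_watermarks_ok)
{
        if (c == COMPACT_OK)
                return USE_LOCAL_HUGEPAGE;
        /*
         * Never direct-reclaim for THP.  When compaction fails, look at bare
         * order-0 watermarks instead of reclaiming.
         */
        if (order0_watermarks_ok)
                return FALLBACK_LOCAL_4K;       /* fragmentation, not pressure */
        return RETRY_WITHOUT_THISNODE;          /* a local 4k fault would fail too */
}

int main(void)
{
        /* Local node fragmented but not under pressure: stay local with 4k. */
        printf("%d\n", thp_fault_policy(COMPACT_FRAGMENTED, true));
        /* Local node genuinely low on memory: drop __GFP_THISNODE, go remote. */
        printf("%d\n", thp_fault_policy(COMPACT_LOW_ON_MEMORY, false));
        return 0;
}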
On Thu 06-06-19 15:12:40, David Rientjes wrote: > On Wed, 5 Jun 2019, Michal Hocko wrote: > > > > That's fine, but we also must be mindful of users who have used > > > MADV_HUGEPAGE over the past four years based on its hard-coded behavior > > > that would now regress as a result. > > > > Absolutely, I am all for helping those usecases. First of all we need to > > understand what those usecases are though. So far we have only seen very > > vague claims about artificial worst case examples when a remote access > > dominates the overall cost but that doesn't seem to be the case in real > > life in my experience (e.g. numa balancing will correct things or the > > over aggressive node reclaim tends to cause problems elsewhere etc.). > > > > The usecase is a remap of a binary's text segment to transparent hugepages > by doing mmap() -> madvise(MADV_HUGEPAGE) -> mremap() and when this > happens on a locally fragmented node. This happens at startup when we > aren't concerned about allocation latency: we want to compact. We are > concerned with access latency thereafter as long as the process is > running. You have indicated this previously but no call for a stand alone reproducer was successful. It is really hard to optimize for such a specialized workload without anything to play with. Btw. this is exactly a case where I would expect numa balancing to converge to the optimal placement. And if numabalancing is not an option than an explicit mempolicy (e.g. the one suggested here) would be a good fit. [...] I will defer the compaction related stuff to Vlastimil and Mel who are much more familiar with the current code. > So my proposed change would be: > - give the page allocator a consistent indicator that compaction failed > because we are low on memory (make COMPACT_SKIPPED really mean this), > - if we get this in the page allocator and we are allocating thp, fail, > reclaim is unlikely to help here and is much more likely to be > disruptive > - we could retry compaction if we haven't scanned all memory and > were contended, > - if the hugepage allocation fails, have thp check watermarks for order-0 > pages without any padding, > - if watermarks succeed, fail the thp allocation: we can't allocate > because of fragmentation and it's better to return node local memory, Doesn't this lead to the same THP low success rate we have seen with one of the previous patches though? Let me remind you of the previous semantic I was proposing http://lkml.kernel.org/r/20181206091405.GD1286@dhcp22.suse.cz and that didn't get shot down. Linus had some follow up ideas on how exactly the fallback order should look like and that is fine. We should just measure differences between local node cheep base page vs. remote THP on _real_ workloads. Any microbenchmark which just measures a latency is inherently misleading. And really, fundamental problem here is that MADV_HUGEPAGE has gained a NUMA semantic without a due scrutiny leading to a broken interface with side effects that are simply making the interface unusable for a large part of usecases that the madvise was originaly designed for. Until we find an agreement on this point we will be looping in a dead end discussion, I am afraid.
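For reference, the startup sequence quoted at the top of this message amounts to roughly the following minimal sketch. The copy of the binary's text and the final mremap() back over the original mapping are elided, and 2M alignment handling is glossed over; only the madvise(MADV_HUGEPAGE) step that raises the locality question is shown.

/*
 * Minimal sketch of the madvise(MADV_HUGEPAGE) step of the remap usecase.
 * The real workload copies the binary's text into this region and then
 * mremap()s it into place; that part is omitted here.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define HPAGE_SIZE      (2UL << 20)

int main(void)
{
        size_t len = 8 * HPAGE_SIZE;
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        /* Ask for THP on this range; under the old __GFP_THISNODE behaviour
         * the subsequent faults were also restricted to the local node. */
        if (madvise(buf, len, MADV_HUGEPAGE))
                perror("madvise(MADV_HUGEPAGE)");

        memset(buf, 0, len);    /* fault the range in, hopefully as 2M pages */
        munmap(buf, len);
        return 0;
}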
On Fri, 7 Jun 2019, Michal Hocko wrote: > > So my proposed change would be: > > - give the page allocator a consistent indicator that compaction failed > > because we are low on memory (make COMPACT_SKIPPED really mean this), > > - if we get this in the page allocator and we are allocating thp, fail, > > reclaim is unlikely to help here and is much more likely to be > > disruptive > > - we could retry compaction if we haven't scanned all memory and > > were contended, > > - if the hugepage allocation fails, have thp check watermarks for order-0 > > pages without any padding, > > - if watermarks succeed, fail the thp allocation: we can't allocate > > because of fragmentation and it's better to return node local memory, > > Doesn't this lead to the same THP low success rate we have seen with one > of the previous patches though? > From my recollection, the only other patch that was tested involved __GFP_NORETRY and avoiding reclaim entirely for thp allocations when checking for high-order allocations. This in the page allocator: /* * Checks for costly allocations with __GFP_NORETRY, which * includes THP page fault allocations */ if (costly_order && (gfp_mask & __GFP_NORETRY)) { ... if (compact_result == COMPACT_DEFERRED) goto nopage; Yet there is no handling for COMPACT_SKIPPED (or what my plan above defines COMPACT_SKIPPED to be). I don't think anything was tried that tests why compaction failed, i.e. was it because the two scanners met, because hugepage-order memory was found available, because the zone lock was contended or we hit need_resched(), we're failing even order-0 watermarks, etc. I don't think the above plan has been tried, if someone has tried it, please let me know. I haven't seen any objection to disabling reclaim entirely when order-0 watermarks are failing in compaction. We simply can't guarantee that it is useful work with the current implementation of compaction. There are several reasons that I've enumerated why compaction can still fail even after successful reclaim. The point is that removing __GFP_THISNODE is not a fix for this if the remote memory is fragmented as well: it assumes that hugepages are available remotely when they aren't available locally otherwise we seem swap storms both locally and remotely. Relying on that is not in the best interest of any user of transparent hugepages. > Let me remind you of the previous semantic I was proposing > http://lkml.kernel.org/r/20181206091405.GD1286@dhcp22.suse.cz and that > didn't get shot down. Linus had some follow up ideas on how exactly > the fallback order should look like and that is fine. We should just > measure differences between local node cheep base page vs. remote THP on > _real_ workloads. Any microbenchmark which just measures a latency is > inherently misleading. > I think both seek to add the possibility of allocating hugepages remotely in certain circumstances and that can be influenced by MADV_HUGEPAGE. I don't think we need to try hugepage specific mempolicies unless it is shown to be absolutely necessary although a usecase for this could be made separate to this discussion. There's a benefit to faulting remote hugepages over remote pages for everybody involved. My argument is that we can determine the need for that based on failed order-0 watermark checks in compaction: if the node would require reclaim to even fault a page, it is likely better over the long term to fault a remote hugepage. I think this can be made to work and is not even difficult to do. 
If anybody has any objection please let me know.
On Thu, 6 Jun 2019, David Rientjes wrote: > The idea that I had was snipped from this, however, and it would be nice > to get some feedback on it: I've suggested that direct reclaim for the > purposes of hugepage allocation on the local node is never worthwhile > unless and until memory compaction can both capture that page to use (not > rely on the freeing scanner to find it) and that migration of a number of > pages would eventually result in the ability to free a pageblock. > > I'm hoping that we can all agree to that because otherwise it leads us > down a bad road if reclaim is doing pointless work (freeing scanner can't > find it or it gets allocated again before it can find it) or compaction > can't make progress as a result of it (even though we can migrate, it > still won't free a pageblock). > > In the interim, I think we should suppress direct reclaim entirely for > thp allocations, regardless of enabled=always or MADV_HUGEPAGE because it > cannot be proven that the reclaim work is beneficial and I believe it > results in the swap storms that are being reported. > > Any disagreements so far? > > Furthermore, if we can agree to that, memory compaction when allocating a > transparent hugepage fails for different reasons, one of which is because > we fail watermark checks because we lack migration targets. This is > normally what leads to direct reclaim. Compaction is *supposed* to return > COMPACT_SKIPPED for this but that's overloaded as well: it happens when we > fail extfrag_threshold checks and wheng gfp flags doesn't allow it. The > former matters for thp. > > So my proposed change would be: > - give the page allocator a consistent indicator that compaction failed > because we are low on memory (make COMPACT_SKIPPED really mean this), > - if we get this in the page allocator and we are allocating thp, fail, > reclaim is unlikely to help here and is much more likely to be > disruptive > - we could retry compaction if we haven't scanned all memory and > were contended, > - if the hugepage allocation fails, have thp check watermarks for order-0 > pages without any padding, > - if watermarks succeed, fail the thp allocation: we can't allocate > because of fragmentation and it's better to return node local memory, > - if watermarks fail, a follow up allocation of the pte will likely also > fail, so thp retries the allocation with a cleared __GFP_THISNODE. > > This doesn't sound very invasive and I'll code it up if it will be tested. > Following up on this since there has been no activity in a week, I am happy to prototype this. Andrea, would you be able to test a patch once it is ready for you to try?
On Thu 23-05-19 17:57:37, Andrew Morton wrote: [...] > It does appear to me that this patch does more good than harm for the > totality of kernel users, so I'm inclined to push it through and to try > to talk Linus out of reverting it again. What is the status here? I do not see the patch upstream nor in the mmotm tree.
On Tue, 30 Jul 2019 15:11:27 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> On Thu 23-05-19 17:57:37, Andrew Morton wrote: > [...] > > It does appear to me that this patch does more good than harm for the > > totality of kernel users, so I'm inclined to push it through and to try > > to talk Linus out of reverting it again. > > What is the status here?

It doesn't seem that the mooted alternatives will be happening any time soon. I would like a version of this patch which has a changelog that fully describes our reasons for reapplying the reverted revert.

At this time, *someone* is going to have to carry a private patch. It's best that the version which suits the larger number of users be the one which we carry in mainline.

I'll add a cc:stable to this revert.
On Tue 30-07-19 11:05:44, Andrew Morton wrote: > On Tue, 30 Jul 2019 15:11:27 +0200 Michal Hocko <mhocko@kernel.org> wrote: > > > On Thu 23-05-19 17:57:37, Andrew Morton wrote: > > [...] > > > It does appear to me that this patch does more good than harm for the > > > totality of kernel users, so I'm inclined to push it through and to try > > > to talk Linus out of reverting it again. > > > > What is the status here? > > I doesn't seem that the mooted alternatives will be happening any time > soon, > > I would like a version of this patch which has a changelog which fully > describes our reasons for reapplying the reverted revert. http://lkml.kernel.org/r/20190503223146.2312-3-aarcange@redhat.com went in great details IMHO about the previous decision as well as adverse effect on swapout. Do you have any suggestions on what is missing there?
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index 5228c62af416..bac395f1d00a 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -139,6 +139,8 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp, struct mempolicy *get_task_policy(struct task_struct *p); struct mempolicy *__get_vma_policy(struct vm_area_struct *vma, unsigned long addr); +struct mempolicy *get_vma_policy(struct vm_area_struct *vma, + unsigned long addr); bool vma_policy_mof(struct vm_area_struct *vma); extern void numa_default_policy(void); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 7efe68ba052a..784fd63800a2 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -644,27 +644,37 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf, static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr) { const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE); - const gfp_t gfp_mask = GFP_TRANSHUGE_LIGHT | __GFP_THISNODE; + gfp_t this_node = 0; - /* Always do synchronous compaction */ - if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags)) - return GFP_TRANSHUGE | __GFP_THISNODE | - (vma_madvised ? 0 : __GFP_NORETRY); +#ifdef CONFIG_NUMA + struct mempolicy *pol; + /* + * __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not + * specified, to express a general desire to stay on the current + * node for optimistic allocation attempts. If the defrag mode + * and/or madvise hint requires the direct reclaim then we prefer + * to fallback to other node rather than node reclaim because that + * can lead to excessive reclaim even though there is free memory + * on other nodes. We expect that NUMA preferences are specified + * by memory policies. + */ + pol = get_vma_policy(vma, addr); + if (pol->mode != MPOL_BIND) + this_node = __GFP_THISNODE; + mpol_cond_put(pol); +#endif - /* Kick kcompactd and fail quickly */ + if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags)) + return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY); if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags)) - return gfp_mask | __GFP_KSWAPD_RECLAIM; - - /* Synchronous compaction if madvised, otherwise kick kcompactd */ + return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node; if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags)) - return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM : - __GFP_KSWAPD_RECLAIM); - - /* Only do synchronous compaction if madvised */ + return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM : + __GFP_KSWAPD_RECLAIM | this_node); if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags)) - return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM : 0); - - return gfp_mask; + return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM : + this_node); + return GFP_TRANSHUGE_LIGHT | this_node; } /* Caller must hold page table lock. */ diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 74e44000ad61..341e3d56d0a6 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1688,7 +1688,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma, * freeing by another task. It is the caller's responsibility to free the * extra reference for shared policies. 
*/ -static struct mempolicy *get_vma_policy(struct vm_area_struct *vma, +struct mempolicy *get_vma_policy(struct vm_area_struct *vma, unsigned long addr) { struct mempolicy *pol = __get_vma_policy(vma, addr);
This reverts commit 2f0799a0ffc033bf3cc82d5032acc3ec633464c2. commit 2f0799a0ffc033bf3cc82d5032acc3ec633464c2 was rightfully applied to avoid the risk of a severe regression that was reported by the kernel test robot at the end of the merge window. Now we understood the regression was a false positive and was caused by a significant increase in fairness during a swap trashing benchmark. So it's safe to re-apply the fix and continue improving the code from there. The benchmark that reported the regression is very useful, but it provides a meaningful result only when there is no significant alteration in fairness during the workload. The removal of __GFP_THISNODE increased fairness. __GFP_THISNODE cannot be used in the generic page faults path for new memory allocations under the MPOL_DEFAULT mempolicy, or the allocation behavior significantly deviates from what the MPOL_DEFAULT semantics are supposed to be for THP and 4k allocations alike. Setting THP defrag to "always" or using MADV_HUGEPAGE (with THP defrag set to "madvise") has never meant to provide an implicit MPOL_BIND on the "current" node the task is running on, causing swap storms and providing a much more aggressive behavior than even zone_reclaim_node = 3. Any workload who could have benefited from __GFP_THISNODE has now to enable zone_reclaim_mode=1||2||3. __GFP_THISNODE implicitly provided the zone_reclaim_mode behavior, but it only did so if THP was enabled: if THP was disabled, there would have been no chance to get any 4k page from the current node if the current node was full of pagecache, which further shows how this __GFP_THISNODE was misplaced in MADV_HUGEPAGE. MADV_HUGEPAGE has never been intended to provide any zone_reclaim_mode semantics, in fact the two are orthogonal, zone_reclaim_mode = 1|2|3 must work exactly the same with MADV_HUGEPAGE set or not. The performance characteristic of memory depends on the hardware details. The numbers below are obtained on Naples/EPYC architecture and the N/A projection extends them to show what we should aim for in the future as a good THP NUMA locality default. The benchmark used exercises random memory seeks (note: the cost of the page faults is not part of the measurement). D0 THP | D0 4k | D1 THP | D1 4k | D2 THP | D2 4k | D3 THP | D3 4k | ... 0% | +43% | +45% | +106% | +131% | +224% | N/A | N/A D0 means distance zero (i.e. local memory), D1 means distance one (i.e. intra socket memory), D2 means distance two (i.e. inter socket memory), etc... For the guest physical memory allocated by qemu and for guest mode kernel the performance characteristic of RAM is more complex and an ideal default could be: D0 THP | D1 THP | D0 4k | D2 THP | D1 4k | D3 THP | D2 4k | D3 4k | ... 0% | +58% | +101% | N/A | +222% | N/A | N/A | N/A NOTE: the N/A are projections and haven't been measured yet, the measurement in this case is done on a 1950x with only two NUMA nodes. The THP case here means THP was used both in the host and in the guest. After applying this commit the THP NUMA locality order that we'll get out of MADV_HUGEPAGE is this: D0 THP | D1 THP | D2 THP | D3 THP | ... | D0 4k | D1 4k | D2 4k | D3 4k | ... Before this commit it was: D0 THP | D0 4k | D1 4k | D2 4k | D3 4k | ... Even if we ignore the breakage of large workloads that can't fit in a single node that the __GFP_THISNODE implicit "current node" mbind caused, the THP NUMA locality order provided by __GFP_THISNODE was still not the one we shall aim for in the long term (i.e. the first one at the top). 
After this commit is applied, we can introduce a new allocator multi order API and replace those two alloc_pages_vma calls in the page fault path with a single multi order call:

        unsigned int order = (1 << HPAGE_PMD_ORDER) | (1 << 0);
        page = alloc_pages_multi_order(..., &order);
        if (!page)
                goto out;
        if (!(order & (1 << 0))) {
                VM_WARN_ON(order != 1 << HPAGE_PMD_ORDER);
                /* THP fault */
        } else {
                VM_WARN_ON(order != 1 << 0);
                /* 4k fallback */
        }

The page allocator logic has to be altered so that when it fails on any zone with order 9, it has to try again with an order 0 before falling back to the next zone in the zonelist.

After that we need to do more measurements and evaluate if adding an opt-in feature for guest mode is worth it, to swap "DN 4k | DN+1 THP" with "DN+1 THP | DN 4k" at every NUMA distance crossing.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/mempolicy.h |  2 ++
 mm/huge_memory.c          | 42 ++++++++++++++++++++++++---------------
 mm/mempolicy.c            |  2 +-
 3 files changed, 29 insertions(+), 17 deletions(-)