
[for-5.3,0/4] revert immediate fallback to remote hugepages

Message ID alpine.DEB.2.21.1909041252230.94813@chino.kir.corp.google.com

Message

David Rientjes Sept. 4, 2019, 7:54 p.m. UTC
Two commits:

commit a8282608c88e08b1782141026eab61204c1e533f
Author: Andrea Arcangeli <aarcange@redhat.com>
Date:   Tue Aug 13 15:37:53 2019 -0700

    Revert "mm, thp: restore node-local hugepage allocations"

commit 92717d429b38e4f9f934eed7e605cc42858f1839
Author: Andrea Arcangeli <aarcange@redhat.com>
Date:   Tue Aug 13 15:37:50 2019 -0700

    Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask""

made their way into 5.3-rc5.

We (mostly Linus, Andrea, and myself) have been discussing offlist how to
implement a sane default allocation strategy for hugepages on NUMA
platforms.

With these reverts in place, the page allocator will happily allocate a
remote hugepage immediately rather than try to make a local hugepage
available.  This incurs a substantial performance degradation when
memory compaction would have otherwise made a local hugepage available.

This series reverts those reverts and attempts to propose a more sane
default allocation strategy specifically for hugepages.  Andrea
acknowledges this is likely to fix the swap storms that he originally
reported that resulted in the patches that removed __GFP_THISNODE from
hugepage allocations.

The immediate goal is to return 5.3 to the behavior the kernel has
implemented over the past several years so that remote hugepages are
not immediately allocated when local hugepages could have been made
available because the increased access latency is untenable.

The next goal is to introduce a sane default allocation strategy for
hugepages allocations in general regardless of the configuration of the
system so that we prevent thrashing of local memory when compaction is
unlikely to succeed and can prefer remote hugepages over remote native
pages when the local node is low on memory.

Merging these reverts late in the rc cycle to change behavior that has
existed for years and is known (and acknowledged) to create performance
degradations when local hugepages could have been made available serves
no purpose other than to make the development of a sane default policy
much more difficult under time pressure and to accelerate decisions that
will affect how userspace is written (and how it has regressed) that
otherwise require carefully crafted and detailed implementations.

Thus, this patch series returns 5.3 to the long-standing allocation
strategy that Linux has had for years and proposes to follow-up changes
that can be discussed that Andrea acknowledges will avoid the swap storms
that initially triggered this discussion in the first place.

Comments

Linus Torvalds Sept. 4, 2019, 8:43 p.m. UTC | #1
On Wed, Sep 4, 2019 at 12:54 PM David Rientjes <rientjes@google.com> wrote:
>
> This series reverts those reverts and attempts to propose a more sane
> default allocation strategy specifically for hugepages.  Andrea
> acknowledges this is likely to fix the swap storms that he originally
> reported that resulted in the patches that removed __GFP_THISNODE from
> hugepage allocations.

There's no way we can try this for 5.3 even if it looks ok. This is
"let's try this during the 5.4 merge window" material, and see how it
works.

But I'd love affected people to test this all on their loads and post
numbers, so that we have actual numbers for this series when we do try
to merge it.

            Linus
Andrea Arcangeli Sept. 4, 2019, 8:55 p.m. UTC | #2
On Wed, Sep 04, 2019 at 12:54:15PM -0700, David Rientjes wrote:
> Two commits:
> 
> commit a8282608c88e08b1782141026eab61204c1e533f
> Author: Andrea Arcangeli <aarcange@redhat.com>
> Date:   Tue Aug 13 15:37:53 2019 -0700
> 
>     Revert "mm, thp: restore node-local hugepage allocations"
> 
> commit 92717d429b38e4f9f934eed7e605cc42858f1839
> Author: Andrea Arcangeli <aarcange@redhat.com>
> Date:   Tue Aug 13 15:37:50 2019 -0700
> 
>     Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask""
> 
> made their way into 5.3-rc5
> 
> We (mostly Linus, Andrea, and myself) have been discussing offlist how to
> implement a sane default allocation strategy for hugepages on NUMA
> platforms.
> 
> With these reverts in place, the page allocator will happily allocate a
> remote hugepage immediately rather than try to make a local hugepage
> available.  This incurs a substantial performance degradation when
> memory compaction would have otherwise made a local hugepage available.
> 
> This series reverts those reverts and attempts to propose a more sane
> default allocation strategy specifically for hugepages.  Andrea
> acknowledges this is likely to fix the swap storms that he originally
> reported that resulted in the patches that removed __GFP_THISNODE from
> hugepage allocations.

I sent a single comment about this patch in the private email thread
you started, and I quote my answer below for full disclosure:

============
> This is an admittedly hacky solution that shouldn't cause anybody to 
> regress based on NUMA and the semantics of MADV_HUGEPAGE for the past 
> 4 1/2 years for users whose workload does fit within a socket.

How can you live with the below if you can't live with 5.3-rc6? Here
you allocate remote THP if the local THP allocation fails.

>  			page = __alloc_pages_node(hpage_node,
>  						gfp | __GFP_THISNODE, order);
> +
> +			/*
> +			 * If hugepage allocations are configured to always
> +			 * synchronous compact or the vma has been madvised
> +			 * to prefer hugepage backing, retry allowing remote
> +			 * memory as well.
> +			 */
> +			if (!page && (gfp & __GFP_DIRECT_RECLAIM))
> +				page = __alloc_pages_node(hpage_node,
> +						gfp | __GFP_NORETRY, order);
> +

You're still going to get THP allocated remotely _before_ you have a
chance to allocate local 4k pages this way. __GFP_NORETRY won't make any
difference when there's THP immediately available in the remote nodes.

My suggestion is to stop touching and tweaking the gfp_mask second
parameter, and I suggest you to tweak the third parameter instead. If
you set the order=(1<<0)|(1<<9) (instead of current order = 9), only
then we'll move in the right direction and we'll get something that
can work better than 5.3-rc6 no matter what which workload you throw
at it.
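
Purely to illustrate that idea (the interface below is hypothetical; the
name __alloc_pages_node_orders() does not exist upstream and is made up
for this sketch): the caller would hand the allocator a mask of
acceptable orders and let the allocator decide between a local THP and a
local 4k page in a single call.

/*
 * Hypothetical sketch only: an allocator entry point that accepts a mask
 * of orders instead of a single order.  Nothing like this exists today.
 */
static struct page *alloc_thp_or_base(int nid, gfp_t gfp)
{
	unsigned long orders = (1UL << 0) | (1UL << HPAGE_PMD_ORDER);

	return __alloc_pages_node_orders(nid, gfp | __GFP_THISNODE, orders);
}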

> +		 if (order >= pageblock_order && (gfp_mask & __GFP_IO)) {
> +			/*
> +			 * If allocating entire pageblock(s) and compaction
> +			 * failed because all zones are below low watermarks
> +			 * or is prohibited because it recently failed at this
> +			 * order, fail immediately.
> +			 *
> +			 * Reclaim is
> +			 *  - potentially very expensive because zones are far
> +			 *    below their low watermarks or this is part of very
> +			 *    bursty high order allocations,
> +			 *  - not guaranteed to help because isolate_freepages()
> +			 *    may not iterate over freed pages as part of its
> +			 *    linear scan, and
> +			 *  - unlikely to make entire pageblocks free on its
> +			 *    own.
> +			 */
> +			if (compact_result == COMPACT_SKIPPED ||
> +			    compact_result == COMPACT_DEFERRED)
> +				goto nopage;
> +		}
> +

Overall this patch would solve the swap storms and it's similar to
__GFP_COMPACTONLY, but it also allows allocating THP in the remote
nodes if remote THP are immediately available in the buddy (despite
there possibly also being local 4k memory immediately available in the
buddy). So it's better than __GFP_COMPACTONLY alone, but it still
cripples compaction a bit and it won't really work a whole lot
differently than 5.3-rc6 in terms of prioritizing local 4k over remote
THP.

So it's not clear why you can't live with 5.3-rc6 if you can live with
the above, which provides you no more guarantees than 5.3-rc6 of getting
local 4k before remote THP.

Thanks,
Andrea
==============

> The immediate goal is to return 5.3 to the behavior the kernel has
> implemented over the past several years so that remote hugepages are
> not immediately allocated when local hugepages could have been made
> available because the increased access latency is untenable.

Remote hugepages are immediately allocated before local 4k pages with
that patch you sent.

> +				page = __alloc_pages_node(hpage_node,
> +						gfp | __GFP_NORETRY, order);

This doesn't prevent allocating remote THP if they are immediately
available.

> Merging these reverts late in the rc cycle to change behavior that has
> existed for years and is known (and acknowledged) to create performance
> degradations when local hugepages could have been made available serves
> no purpose other than to make the development of a sane default policy
> much more difficult under time pressure and to accelerate decisions that
> will affect how userspace is written (and how it has regressed) that
> otherwise require carefully crafted and detailed implementations.

Your change is calling alloc_pages first with __GFP_THISNODE in a
__GFP_COMPACTONLY way, and then it falls back immediately to allocate
2M on all nodes, including the local node for a second time, before
allocating local 4k (with a magical __GFP_NORETRY which won't make a
difference anyway if the 2m pages are immediately available).

> Thus, this patch series returns 5.3 to the long-standing allocation
> strategy that Linux has had for years and proposes to follow-up changes
> that can be discussed that Andrea acknowledges will avoid the swap storms
> that initially triggered this discussion in the first place.

I said one good thing about this patch series, that it fixes the swap
storms. But upstream 5.3 fixes the swap storms too and what you sent
is not nearly equivalent to the mempolicy that Michal was willing
to provide you and that we thought you needed to get bigger guarantees
of getting only local 2m or local 4k pages.

In fact we could weaken the aggressiveness of the proposed mempolicy
by an order of magnitude if you can live with the above and the above
performs well for you.

If you could post an open source reproducer of your proprietary binary
that regressed on 5.3, we could help find a solution. Instead it's not
clear how you can live with this patch series but not with 5.3, and
also why you insist on not making this new allocation policy an
opt-in mempolicy.

Furthermore no generic benchmark has been run on this series to be
sure it won't regress performance for common workloads. So it's
certainly more risky than the current 5.3 status, which matches what's
running in production on most enterprise distros.

I thought you clarified that the page fault latency was not the
primary reason you had __GFP_THISNODE there. If instead you've still
got a latency issue in the 2M page fault and that was the real reason
for __GFP_THISNODE, why don't you lower the latency of compaction in
remote nodes without having to call alloc_pages 3 times? I mean I
proposed to call alloc_pages just once instead of the current twice;
with your patch we'd be calling alloc_pages three times, with a
repetition attempt even on the 2m page size on the local node.

Obviously this "hacky solution" is better than 5.3-rc4 and all
previous kernels because it at least doesn't create swap storms. What's
not clear is how this is going to work better than upstream for you.
What I think will happen is that this will work similarly to
__GFP_COMPACTONLY and it'll weaken the THP utilization ratio for
MADV_HUGEPAGE users, and it's not tested to be sure it won't perform
worse in other conditions.

Thanks,
Andrea
David Rientjes Sept. 5, 2019, 8:54 p.m. UTC | #3
On Wed, 4 Sep 2019, Linus Torvalds wrote:

> > This series reverts those reverts and attempts to propose a more sane
> > default allocation strategy specifically for hugepages.  Andrea
> > acknowledges this is likely to fix the swap storms that he originally
> > reported that resulted in the patches that removed __GFP_THISNODE from
> > hugepage allocations.
> 
> There's no way we can try this for 5.3 even if it looks ok. This is
> "let's try this during the 5.4 merge window" material, and see how it
> works.
> 
> But I'd love affected people to test this all on their loads and post
> numbers, so that we have actual numbers for this series when we do try
> to merge it.
> 

I'm certainly not proposing the last two patches in the series marked as 
RFC to be merged.  I'm proposing that the first two patches in the series, 
reverts of the reverts that went into 5.3-rc5, be merged for 5.3 so that 
we return to the same behavior that we have had for years and the semantics 
that MADV_HUGEPAGE has provided that entire libraries and userspaces have 
been based on.

It is very clear that there is a path forward here to address the *bug* 
that Andrea is reporting: it has become conflated with NUMA allocation 
policies which is not at all the issue.  Note that if 5.3 is released with 
these patches, it requires a very specialized usecase to benefit: 
workloads that are larger than one socket and that *require* remote memory 
to not be low on memory or fragmented.  If remote memory is as low on memory 
or fragmented as local memory (like in a datacenter), the reverts that 
went into 5.3 will double the impact of the very bug being reported 
because now it's causing swap storms for remote memory as well.  I don't 
anticipate we'll get numbers for that since it's not a configuration they 
run in.

The bug here is reclaim in the page allocator that does not benefit memory 
compaction because we are failing per-zone watermarks already.  The last 
two patches in this series avoid that, which is a sane default page 
allocation policy, and then allow fallback to remote memory only when we 
can't easily allocate locally.

We *need* the ability to allocate hugepages locally if compaction can 
work, anything else kills performance.  5.3-rc7 won't try that, it will 
simply fallback to remote memory.  We need to try compaction but we do not 
want to reclaim if failing watermark checks.

I hope that I'm not being unrealistically optimistic that we can make 
progress on providing a sane default allocation policy using those last 
two patches as a starter for 5.4, but I'm strongly suggesting that you 
take the first two patches to return us to the policy that has existed for 
years and not allow MADV_HUGEPAGE to be used for immediate remote 
allocation when local is possible.
David Rientjes Sept. 5, 2019, 9:06 p.m. UTC | #4
On Wed, 4 Sep 2019, Andrea Arcangeli wrote:

> > This is an admittedly hacky solution that shouldn't cause anybody to 
> > regress based on NUMA and the semantics of MADV_HUGEPAGE for the past 
> > 4 1/2 years for users whose workload does fit within a socket.
> 
> How can you live with the below if you can't live with 5.3-rc6? Here
> you allocate remote THP if the local THP allocation fails.
> 
> >  			page = __alloc_pages_node(hpage_node,
> >  						gfp | __GFP_THISNODE, order);
> > +
> > +			/*
> > +			 * If hugepage allocations are configured to always
> > +			 * synchronous compact or the vma has been madvised
> > +			 * to prefer hugepage backing, retry allowing remote
> > +			 * memory as well.
> > +			 */
> > +			if (!page && (gfp & __GFP_DIRECT_RECLAIM))
> > +				page = __alloc_pages_node(hpage_node,
> > +						gfp | __GFP_NORETRY, order);
> > +
> 
> You're still going to get THP allocate remote _before_ you have a
> chance to allocate 4k local this way. __GFP_NORETRY won't make any
> difference when there's THP immediately available in the remote nodes.
> 

This is incorrect: the fallback allocation here is only if the initial 
allocation with __GFP_THISNODE fails.  In that case, we have already tried 
to compact memory to make a local hugepage available without incurring 
excessive swap, based on the RFC patch that appears as patch 3 in this 
series.  I very much believe your usecase would benefit from this as well 
(or at least not cause others to regress).  We *want* remote thp if they 
are immediately available but only after we have tried to allocate locally 
from the initial allocation and allowed memory compaction to fail first.

Likely there can be discussion around the fourth patch of this series to 
get exactly the right policy.  We can construct it as necessary for 
hugetlbfs to not have any change in behavior, that's simple.  We could 
also check per-zone watermarks in mm/huge_memory.c to determine if the local 
node is low on memory and, if so, allow remote allocation.  In that case 
it's certainly better to allocate remotely when we'd be reclaiming locally 
even for fallback native pages.
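
As a rough sketch of that check (untested, using the 5.3-era helpers,
simplified to only look at ZONE_NORMAL, and with a made-up function name):

/*
 * Untested sketch: consider the local node "low" for a hugepage-sized
 * request when it cannot pass the low watermark for HPAGE_PMD_ORDER,
 * in which case remote allocation would be allowed.
 */
static bool local_node_low_for_thp(int nid)
{
	struct zone *zone = &NODE_DATA(nid)->node_zones[ZONE_NORMAL];

	return !zone_watermark_ok(zone, HPAGE_PMD_ORDER,
				  low_wmark_pages(zone),
				  zone_idx(zone), 0);
}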

> I said one good thing about this patch series, that it fixes the swap
> storms. But upstream 5.3 fixes the swap storms too and what you sent
> is not nearly equivalent to the mempolicy that Michal was willing
> to provide you and that we thought you needed to get bigger guarantees
> of getting only local 2m or local 4k pages.
> 

I haven't seen such a patch series, is there a link?
David Rientjes Sept. 7, 2019, 7:51 p.m. UTC | #5
Is there any objection from anybody to applying the first two patches, the 
reverts of the reverts that went into 5.3-rc5, for 5.3 and pursuing 
discussion and development using the last two patches in this series as a 
starting point for a sane allocation policy that just works by default for 
everybody?

Andrea acknowledges the swap storm that he reported would be fixed with 
the last two patches in this series, and there has been discussion on how 
they can be extended while ensuring they do not impact allocations 
outside the scope of the discussion here (hugetlb).


On Thu, 5 Sep 2019, David Rientjes wrote:

> On Wed, 4 Sep 2019, Linus Torvalds wrote:
> 
> > > This series reverts those reverts and attempts to propose a more sane
> > > default allocation strategy specifically for hugepages.  Andrea
> > > acknowledges this is likely to fix the swap storms that he originally
> > > reported that resulted in the patches that removed __GFP_THISNODE from
> > > hugepage allocations.
> > 
> > There's no way we can try this for 5.3 even if it looks ok. This is
> > "let's try this during the 5.4 merge window" material, and see how it
> > works.
> > 
> > But I'd love affected people to test this all on their loads and post
> > numbers, so that we have actual numbers for this series when we do try
> > to merge it.
> > 
> 
> I'm certainly not proposing the last two patches in the series marked as 
> RFC to be merged.  I'm proposing the first two patches in the series, 
> reverts of the reverts that went into 5.3-rc5, are merged for 5.3 so that 
> we return to the same behavior that we have had for years and semantics 
> that MADV_HUGEPAGE has provided that entire libraries and userspaces have 
> been based on.
> 
> It is very clear that there is a path forward here to address the *bug* 
> that Andrea is reporting: it has become conflated with NUMA allocation 
> policies which is not at all the issue.  Note that if 5.3 is released with 
> these patches that it requires a very specialized usecase to benefit from: 
> workloads that are larger than one socket and *requires* remote memory not 
> being low on memory or fragmented.  If remote memory is as low on memory 
> or fragmented as local memory (like in a datacenter), the reverts that 
> went into 5.3 will double the impact of the very bug being reported 
> because now it's causing swap storms for remote memory as well.  I don't 
> anticipate we'll get numbers for that since it's not a configuration they 
> run in.
> 
> The bug here is reclaim in the page allocator that does not benefit memory 
> compaction because we are failing per-zone watermarks already.  The last 
> two patches in this series avoid that, which is a sane default page 
> allocation policy, and then allow fallback to remote memory only when we 
> can't easily allocate locally.
> 
> We *need* the ability to allocate hugepages locally if compaction can 
> work, anything else kills performance.  5.3-rc7 won't try that, it will 
> simply fallback to remote memory.  We need to try compaction but we do not 
> want to reclaim if failing watermark checks.
> 
> I hope that I'm not being unrealistically optimistic that we can make 
> progress on providing a sane default allocation policy using those last 
> two patches as a starter for 5.4, but I'm strongly suggesting that you 
> take the first two patches to return us to the policy that has existed for 
> years and not allow MADV_HUGEPAGE to be used for immediate remote 
> allocation when local is possible.
>
Linus Torvalds Sept. 7, 2019, 7:55 p.m. UTC | #6
On Sat, Sep 7, 2019 at 12:51 PM David Rientjes <rientjes@google.com> wrote:
>
> Andrea acknowledges the swap storm that he reported would be fixed with
> the last two patches in this series

The problem is that even you aren't arguing that those patches should
go into 5.3.

So those fixes aren't going in, so "the swap storms would be fixed"
argument isn't actually an argument at all as far as 5.3 is concerned.

End result: we'd have the qemu-kvm instance performance problem in 5.3
that apparently causes distros to apply those patches that you want to
revert anyway.

So reverting would just make distros not use 5.3 in that form.

So I don't think we can revert those again without applying the two
patches in the series. So it would be a 5.4 merge window operation.

               Linus
David Rientjes Sept. 8, 2019, 1:50 a.m. UTC | #7
On Sat, 7 Sep 2019, Linus Torvalds wrote:

> > Andrea acknowledges the swap storm that he reported would be fixed with
> > the last two patches in this series
> 
> The problem is that even you aren't arguing that those patches should
> go into 5.3.
> 

For three reasons: (a) we lack a test result from Andrea, (b) there's 
on-going discussion, particularly based on Vlastimil's feedback, and 
(c) the patches will be refreshed incorporating that feedback as well as 
Mike's suggestion to exempt __GFP_RETRY_MAYFAIL for hugetlb.

> So those fixes aren't going in, so "the swap storms would be fixed"
> argument isn't actually an argument at all as far as 5.3 is concerned.
> 

It indicates that progress has been made to address the actual bug without 
introducing long-lived access latency regressions for others, particularly 
those who use MADV_HUGEPAGE.  In the worst case, some systems running 
5.3-rc4 and 5.3-rc5 have the same amount of memory backed by hugepages but 
on 5.3-rc5 the vast majority of it is allocated remotely.  This incurs a 
significant performance regression regardless of platform; the only thing 
needed to induce this is a fragmented local node that would otherwise be 
compacted on 5.3-rc4 rather than quickly allocating remotely on 5.3-rc5.

> End result: we'd have the qemu-kvm instance performance problem in 5.3
> that apparently causes distros to apply those patches that you want to
> revert anyway.
> 
> So reverting would just make distros not use 5.3 in that form.
> 

I'm arguing to revert 5.3 back to the behavior that we have had for years 
and actually fix the bug that everybody else seems to be ignoring and then 
*backport* those fixes to 5.3 stable and every other stable tree that can 
use them.  Introducing a new mempolicy for NUMA locality into 5.3.0 that 
will subsequently be changed in future 5.3 stable kernels and differs from 
all kernels from the past few years is not in anybody's best interest if 
the actual problem can be fixed.  It requires more feedback than a 
one-line "the swap storms would be fixed with this."  That collaboration 
takes time and isn't something that should be rushed into 5.3-rc5.

Yes, we can fix NUMA locality of hugepages when a workload like qemu is 
larger than a single socket; the vast majority of workloads in the 
datacenter are smaller than a socket and *cannot* tolerate the performance 
penalty that 5.3-rc5 introduces when local memory is fragmented.  If you 

In other words, 5.3-rc5 is only fixing a highly specialized usecase where 
remote allocation is acceptable because the workload is larger than a 
socket *and* remote memory is not low on memory or fragmented.  If you 
consider the opposite of that, workloads smaller than a socket or local 
compaction actually works, this has introduced a measurable regression for 
everybody else.

I'm not sure why we are ignoring a painfully obvious bug in the page 
allocator because of a poor feedback loop between itself and memory 
compaction and rather papering over it by falling back to remote memory 
when NUMA actually does matter.  If you release 5.3 without the first two 
patches in this series, I wouldn't expect any additional feedback or test 
results to fix this bug considering all we have gotten so far is "this 
would fix the swap storms" and no collaboration to fix the issue for 
everybody rather than only caring about their own workloads.  At least my 
patches acknowledge and try to fix the issue the other is encountering.
Vlastimil Babka Sept. 8, 2019, 12:47 p.m. UTC | #8
On 9/8/19 3:50 AM, David Rientjes wrote:
> On Sat, 7 Sep 2019, Linus Torvalds wrote:
> 
>>> Andrea acknowledges the swap storm that he reported would be fixed with
>>> the last two patches in this series
>>
>> The problem is that even you aren't arguing that those patches should
>> go into 5.3.
>>
> 
> For three reasons: (a) we lack a test result from Andrea,

That's an argument against the rfc patches 3+4, no? But not for including
the reverts of reverts of reverts (patches 1+2).

> (b) there's 
> on-going discussion, particularly based on Vlastimil's feedback, and 

I doubt this will be finished and tested with reasonable confidence even
for the 5.4 merge window.

> (c) the patches will be refreshed incorporating that feedback as well as 
> Mike's suggestion to exempt __GFP_RETRY_MAYFAIL for hugetlb.

There might be other unexpected consequences (even if hugetlb wasn't
such an issue as I suspected, in the end).

>> So those fixes aren't going in, so "the swap storms would be fixed"
>> argument isn't actually an argument at all as far as 5.3 is concerned.
>>
> 
> It indicates that progress has been made to address the actual bug without 
> introducing long-lived access latency regressions for others, particularly 
> those who use MADV_HUGEPAGE.  In the worst case, some systems running 
> 5.3-rc4 and 5.3-rc5 have the same amount of memory backed by hugepages but 
> on 5.3-rc5 the vast majority of it is allocated remotely.  This incurs a

It's been said before, but such sensitive code generally relies on
mempolicies or node reclaim mode, not THP __GFP_THISNODE implementation
details. Or if you know there's enough free memory and it just needs to be
compacted, you could do it once via sysfs before starting up your workload.

> significant performance regression regardless of platform; the only thing 
> needed to induce this is a fragmented local node that would otherwise be 
> compacted in 5.3-rc4 rather than quickly allocate remote on 5.3-rc5.
> 
>> End result: we'd have the qemu-kvm instance performance problem in 5.3
>> that apparently causes distros to apply those patches that you want to
>> revert anyway.
>>
>> So reverting would just make distros not use 5.3 in that form.
>>
> 
> I'm arguing to revert 5.3 back to the behavior that we have had for years 
> and actually fix the bug that everybody else seems to be ignoring and then 
> *backport* those fixes to 5.3 stable and every other stable tree that can 
> use them.  Introducing a new mempolicy for NUMA locality into 5.3.0 that

I think it's rather removing the problematic implicit mempolicy of
__GFP_THISNODE.

> will subsequently be changed in future 5.3 stable kernels and differs from 
> all kernels from the past few years is not in anybody's best interest if 
> the actual problem can be fixed.  It requires more feedback than a 
> one-line "the swap storms would be fixed with this."  That collaboration 
> takes time and isn't something that should be rushed into 5.3-rc5.
> 
> Yes, we can fix NUMA locality of hugepages when a workload like qemu is 
> larger than a single socket; the vast majority of workloads in the 
> datacenter are small than a socket and *cannot* incur the performance 
> penalty if local memory is fragmented that 5.3-rc5 introduces.
> 
> In other words, 5.3-rc5 is only fixing a highly specialized usecase where 
> remote allocation is acceptable because the workload is larger than a 
> socket *and* remote memory is not low on memory or fragmented.  If you

Clearly we disagree here about which is the highly specialized usecase that
might get slower remote memory access, and which is the more common workload
that will suffer from swap storms. No point arguing it further, but
several distros made the choice by carrying Andrea's patches already.

> consider the opposite of that, workloads smaller than a socket or local 
> compaction actually works, this has introduced a measurable regression for 
> everybody else.
> 
> I'm not sure why we are ignoring a painfully obvious bug in the page 
> allocator because of a poor feedback loop between itself and memory 
> compaction and rather papering over it by falling back to remote memory 
> when NUMA actually does matter.  If you release 5.3 without the first two 
> patches in this series, I wouldn't expect any additional feedback or test 
> results to fix this bug considering all we have gotten so far is "this 
> would fix this swap storms" and not collaborating to fix the issue for 
> everybody rather than only caring about their own workloads.  At least my 
> patches acknowledge and try to fix the issue the other is encountering.

I might have missed something, but you were asked for a reproducer of
your use case so others can develop patches with it in mind? Mel did
provide a simple example that shows the swap storms very easily.
David Rientjes Sept. 8, 2019, 8:45 p.m. UTC | #9
On Sun, 8 Sep 2019, Vlastimil Babka wrote:

> > On Sat, 7 Sep 2019, Linus Torvalds wrote:
> > 
> >>> Andrea acknowledges the swap storm that he reported would be fixed with
> >>> the last two patches in this series
> >>
> >> The problem is that even you aren't arguing that those patches should
> >> go into 5.3.
> >>
> > 
> > For three reasons: (a) we lack a test result from Andrea,
> 
> That's argument against the rfc patches 3+4s, no? But not for including
> the reverts of reverts of reverts (patches 1+2).
> 

Yes, thanks: I would strongly prefer not to propose rfc patches 3-4 
without a testing result from Andrea and collaboration to fix the 
underlying issue.  My suggestion to Linus is to merge patches 1-2 so we 
don't have additional semantics for MADV_HUGEPAGE or thp enabled=always 
configs based on kernel version, especially since they are already 
conflated.

> > (b) there's 
> > on-going discussion, particularly based on Vlastimil's feedback, and 
> 
> I doubt this will be finished and tested with reasonable confidence even
> for the 5.4 merge window.
> 

Depends, but I probably suspect the same.  If the reverts to 5.3 are not 
applied, then I'm not at all confident that forward progress on this issue 
will be made: my suggestion about changes to the page allocator when the 
patches were initially proposed went unanswered, as did the ping on 
those suggestions, and now we have a simplistic "this will fix the swap 
storms" but no active involvement from Andrea to improve this; he likely 
is quite content with lumping NUMA policy onto an already overloaded madvise 
mode.

 [ NOTE! The rest of this email and my responses are about how to address
   the default page allocation behavior which we can continue to discuss
   but I'd prefer it separated from the discussion of reverts for 5.3
   which needs to be done to not conflate madvise modes with mempolicies
   for a subset of kernel versions. ]

> > It indicates that progress has been made to address the actual bug without 
> > introducing long-lived access latency regressions for others, particularly 
> > those who use MADV_HUGEPAGE.  In the worst case, some systems running 
> > 5.3-rc4 and 5.3-rc5 have the same amount of memory backed by hugepages but 
> > on 5.3-rc5 the vast majority of it is allocated remotely.  This incurs a
> 
> It's been said before, but such sensitive code generally relies on
> mempolicies or node reclaim mode, not THP __GFP_THISNODE implementation
> details. Or if you know there's enough free memory and just needs to be
> compacted, you could do it once via sysfs before starting up your workload.
> 

This entire discussion is based on the long standing and default behavior 
of page allocation for transparent hugepages.  Your suggestions are not 
possible for two reasons: (1) I cannot enforce a mempolicy of MPOL_BIND 
because this doesn't allow fallback at all and would oom kill if the local 
node is oom, and (2) node reclaim mode is a system-wide setting so all 
workloads are affected for every page allocation, not only users of 
MADV_HUGEPAGE who specifically opt-in to expensive allocation.

We could make the argument that Andrea's qemu usecase could simply use 
MPOL_PREFERRED for memory that should be faulted remotely which would 
provide more control and would work for all versions of Linux regardless 
of MADV_HUGEPAGE or not; that's a much simpler workaround than 
conflating MADV_HUGEPAGE with NUMA locality, asking users who are adversely 
affected by 5.3 to create new mempolicies to work around something that 
has always worked fine, or asking users to tune page allocator policies 
with sysctls.
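
For reference, that MPOL_PREFERRED workaround is roughly the following in
userspace (a sketch; the helper name and node choice are illustrative, and
it needs libnuma's numaif.h):

#include <numaif.h>

/*
 * Sketch: prefer faulting this range on a given node while still allowing
 * fallback to other nodes, independent of MADV_HUGEPAGE.
 */
static int prefer_node(void *addr, size_t len, int node)
{
	unsigned long nodemask = 1UL << node;

	return mbind(addr, len, MPOL_PREFERRED, &nodemask,
		     sizeof(nodemask) * 8, 0);
}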

> > I'm arguing to revert 5.3 back to the behavior that we have had for years 
> > and actually fix the bug that everybody else seems to be ignoring and then 
> > *backport* those fixes to 5.3 stable and every other stable tree that can 
> > use them.  Introducing a new mempolicy for NUMA locality into 5.3.0 that
> 
> I think it's rather removing the problematic implicit mempolicy of
> __GFP_THISNODE.
> 

I'm referring to a solution that is backwards compatible for existing 
users, which 5.3 certainly is not.

> I might have missed something, but you were asked for a reproducer of
> your use case so others can develop patches with it in mind? Mel did
> provide a simple example that shows the swap storms very easily.
> 

Are you asking for a synthetic kernel module that you can inject to induce 
fragmentation on a local node where memory compaction would be possible 
and then a userspace program that uses MADV_HUGEPAGE and fits within that 
node?  The regression I'm reporting is for workloads that fit within a 
socket; it requires local fragmentation to show a regression.

For the qemu case, it's quite easy to fill a local node and require 
additional hugepage allocations with MADV_HUGEPAGE in a test case, but 
without synthetically inducing fragmentation I cannot provide a testcase 
that will show the performance regression, because memory is quickly 
faulted remotely rather than compacted locally.
Michal Hocko Sept. 9, 2019, 8:37 a.m. UTC | #10
On Sun 08-09-19 13:45:13, David Rientjes wrote:
> If the reverts to 5.3 are not 
> applied, then I'm not at all confident that forward progress on this issue 
> will be made:

David, could you stop this finally? I think there is a good consensus
that the current behavior (even after the reverts) does not go all the way
to where we want to get. There have been different ways forward
suggested to not fall back to remote nodes too easily, not to mention
a specialized memory policy to explicitly request the behavior you
presumably need (and as a bonus it wouldn't be THP specific, which is
even better).

You seem to be deadlocked in "we've been doing something for 4 years
so we must preserve that behavior". All that based on a single and
odd workload which you are hand waving about without anything for
the rest of the community to reproduce. Please try to get out of the
argumentation loop. We are more likely to make forward progress.

5.3 managed to fix the worst case behavior; now let's talk about more
clever tuning. You cannot expect such tuning to be overnight work.
This area is full of subtle side effects and few-liners might have
hard-to-predict consequences.
Michal Hocko Sept. 9, 2019, 7:30 p.m. UTC | #11
On Thu 05-09-19 14:06:28, David Rientjes wrote:
> On Wed, 4 Sep 2019, Andrea Arcangeli wrote:
> 
> > > This is an admittedly hacky solution that shouldn't cause anybody to 
> > > regress based on NUMA and the semantics of MADV_HUGEPAGE for the past 
> > > 4 1/2 years for users whose workload does fit within a socket.
> > 
> > How can you live with the below if you can't live with 5.3-rc6? Here
> > you allocate remote THP if the local THP allocation fails.
> > 
> > >  			page = __alloc_pages_node(hpage_node,
> > >  						gfp | __GFP_THISNODE, order);
> > > +
> > > +			/*
> > > +			 * If hugepage allocations are configured to always
> > > +			 * synchronous compact or the vma has been madvised
> > > +			 * to prefer hugepage backing, retry allowing remote
> > > +			 * memory as well.
> > > +			 */
> > > +			if (!page && (gfp & __GFP_DIRECT_RECLAIM))
> > > +				page = __alloc_pages_node(hpage_node,
> > > +						gfp | __GFP_NORETRY, order);
> > > +
> > 
> > You're still going to get THP allocate remote _before_ you have a
> > chance to allocate 4k local this way. __GFP_NORETRY won't make any
> > difference when there's THP immediately available in the remote nodes.
> > 
> 
> This is incorrect: the fallback allocation here is only if the initial 
> allocation with __GFP_THISNODE fails.  In that case, we were able to 
> compact memory to make a local hugepage available without incurring 
> excessive swap based on the RFC patch that appears as patch 3 in this 
> series.

That patch is quite obscure and specific to pageblock_order+ sizes and
for some reason requires __GFP_IO without any explanation of why. The
problem is not THP specific, right? Any other high-order allocation has
the same problem AFAICS. So it is just a hack and that's why it is hard
to reason about.

I believe it would be best to start by explaining why we do not see
the same problem with order-0 requests. We do not enter the slow path
and thus the memory reclaim if there is any other node that passes the
watermark as well, right? So essentially we are relying on kswapd to keep
nodes balanced so that allocation requests can be satisfied from a local
node. We do have kcompactd to do background compaction. Why do we want
to rely on the direct compaction instead? What is the fundamental
difference?

Your changelog goes on at length about some problems in the compaction but
I really do not see the underlying problem description. We cannot do any
sensible fix/heuristic without capturing that IMHO. Either there is
some fundamental difference between direct and background compaction
and doing the former one is necessary and we should be doing that by
default for all higher order requests that are sleepable (aka
__GFP_DIRECT_RECLAIM) or there is something to fix for the background
compaction to be more pro-active.
 
> > I said one good thing about this patch series, that it fixes the swap
> > storms. But upstream 5.3 fixes the swap storms too and what you sent
> > is not nearly equivalent to the mempolicy that Michal was willing
> > to provide you and that we thought you needed to get bigger guarantees
> > of getting only local 2m or local 4k pages.
> > 
> 
> I haven't seen such a patch series, is there a link?

not yet unfortunatelly. So far I haven't heard that you are even
interested in that policy. You have never commented on that IIRC.
Michal Hocko Sept. 25, 2019, 7:08 a.m. UTC | #12
Let me revive this thread as there was no follow up.

On Mon 09-09-19 21:30:20, Michal Hocko wrote:
[...]
> I believe it would be the best to start by explaining why we do not see
> the same problem with order-0 requests. We do not enter the slow path
> and thus the memory reclaim if there is any other node to pass through
> watermark as well right? So essentially we are relying on kswapd to keep
> nodes balanced so that allocation request can be satisfied from a local
> node. We do have kcompactd to do background compaction. Why do we want
> to rely on the direct compaction instead? What is the fundamental
> difference?

I am especially interested in this part. The more I think about this
the more I am convinced that the underlying problem really is in the
premature fallback in the fast path. Does the almost-patch below help your
workload? It effectively reduces the fast path for higher order
allocations to the local/requested node. The justification is that the
watermark check might be too strict for those requests as it is primarily
order-0 oriented. Low watermark target simply has no meaning for the
higher order requests AFAIU. The min-low gap is giving kswapd a chance
to balance and be more local node friendly while we do not have anything
like that in compaction.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ff5484fdbdf9..09036cf55fca 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4685,7 +4685,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 {
 	struct page *page;
 	unsigned int alloc_flags = ALLOC_WMARK_LOW;
-	gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
+	gfp_t fastpath_mask, alloc_mask; /* The gfp_t that was actually used for allocation */
 	struct alloc_context ac = { };
 
 	/*
@@ -4698,7 +4698,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 	}
 
 	gfp_mask &= gfp_allowed_mask;
-	alloc_mask = gfp_mask;
+	fastpath_mask = alloc_mask = gfp_mask;
 	if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
 		return NULL;
 
@@ -4710,8 +4710,17 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 	 */
 	alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, gfp_mask);
 
-	/* First allocation attempt */
-	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
+	/*
+	 * First allocation attempt. If we have a high order allocation then do not fall
+	 * back to a remote node just based on the watermark check on the requested node
+	 * because compaction might easily free up a requested order and then it would be
+	 * better to simply go to the slow path.
+	 * TODO: kcompactd should help here but nobody has woken it up unless we hit the
+	 * slow path so we might need some tuning there as well.
+	 */
+	if (order && (gfp_mask & __GFP_DIRECT_RECLAIM))
+		fastpath_mask |= __GFP_THISNODE;
+	page = get_page_from_freelist(fastpath_mask, order, alloc_flags, &ac);
 	if (likely(page))
 		goto out;
David Rientjes Sept. 26, 2019, 7:03 p.m. UTC | #13
On Wed, 25 Sep 2019, Michal Hocko wrote:

> I am especially interested about this part. The more I think about this
> the more I am convinced that the underlying problem really is in the
> premature fallback in the fast path.

I appreciate you taking the time to continue to look at this but I'm 
confused about the underlying problem you're referring to: we had no 
underlying problem until 5.3 was released so we need to carry patches that 
will revert this behavior (we simply can't tolerate double digit memory 
access latency regressions lol).

If you're referring to post-5.3 behavior, this appears to override 
alloc_hugepage_direct_gfpmask() but directly in the page allocator.

static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
{
...
        /*
         * __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not
         * specified, to express a general desire to stay on the current

Your patch is setting __GFP_THISNODE for __GFP_DIRECT_RECLAIM: this 
allocation will fail in the fastpath for both my case (fragmented local 
node) and Andrea's case (out of memory local node).  The first 
get_page_from_freelist() will then succeed in the slowpath for both cases; 
compaction is not tried for either.

In my case, that results in a perpetual remote access latency that we 
can't tolerate.  If Andrea's remote nodes are fragmented or low on memory, 
his case encounters swap storms over both the local node and remote nodes.

So I'm not really sure what is solved by your patch?

We're not on 5.3, but I can try it and collect data on exactly how poorly 
it performs on fragmented *hosts* (not single nodes, but really the whole 
system) because swap storms were never fixed here, it was only papered 
over.

> Does the almost-patch below help your
> workload? It effectively reduces the fast path for higher order
> allocations to the local/requested node. The justification is that
> watermark check might be too strict for those requests as it is primarily
> order-0 oriented. Low watermark target simply has no meaning for the
> higher order requests AFAIU. The min-low gap is giving kswapd a chance
> to balance and be more local node friendly while we do not have anything
> like that in compaction.
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ff5484fdbdf9..09036cf55fca 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4685,7 +4685,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
>  {
>  	struct page *page;
>  	unsigned int alloc_flags = ALLOC_WMARK_LOW;
> -	gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
> +	gfp_t fastpath_mask, alloc_mask; /* The gfp_t that was actually used for allocation */
>  	struct alloc_context ac = { };
>  
>  	/*
> @@ -4698,7 +4698,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
>  	}
>  
>  	gfp_mask &= gfp_allowed_mask;
> -	alloc_mask = gfp_mask;
> +	fastpath_mask = alloc_mask = gfp_mask;
>  	if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
>  		return NULL;
>  
> @@ -4710,8 +4710,17 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
>  	 */
>  	alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, gfp_mask);
>  
> -	/* First allocation attempt */
> -	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
> +	/*
> +	 * First allocation attempt. If we have a high order allocation then do not fall
> +	 * back to a remote node just based on the watermark check on the requested node
> +	 * because compaction might easily free up a requested order and then it would be
> +	 * better to simply go to the slow path.
> +	 * TODO: kcompactd should help here but nobody has woken it up unless we hit the
> +	 * slow path so we might need some tuning there as well.
> +	 */
> +	if (order && (gfp_mask & __GFP_DIRECT_RECLAIM))
> +		fastpath_mask |= __GFP_THISNODE;
> +	page = get_page_from_freelist(fastpath_mask, order, alloc_flags, &ac);
>  	if (likely(page))
>  		goto out;
Michal Hocko Sept. 27, 2019, 7:48 a.m. UTC | #14
On Thu 26-09-19 12:03:37, David Rientjes wrote:
[...]
> Your patch is setting __GFP_THISNODE for __GFP_DIRECT_RECLAIM: this 
> allocation will fail in the fastpath for both my case (fragmented local 
> node) and Andrea's case (out of memory local node).  The first 
> get_page_from_freelist() will then succeed in the slowpath for both cases; 
> compaction is not tried for either.
> 
> In my case, that results in a perpetual remote access latency that we 
> can't tolerate.  If Andrea's remote nodes are fragmented or low on memory, 
> his case encounters swap storms over both the local node and remote nodes.
> 
> So I'm not really sure what is solved by your patch?

There are two aspects the patch is targeting. The first is that the
fast path is targeting a higher watermark (WMARK_LOW) so it might
fall back to a remote node more easily, and the fast path doesn't wake up
kcompactd so there is no pro-active compaction going on to help future
allocations.

You are right that a fragmented or at-min-watermark node would fall back
to a remote node even with this patch. I wanted to see how much
kcompactd can change the overall picture. If this is not sufficient then
maybe we need to drop the first optimistic attempt as well and simply go
right into the light compaction. Something like this on top:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ff5484fdbdf9..61284e7f01ee 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4434,7 +4434,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * The adjusted alloc_flags might result in immediate success, so try
 	 * that first
 	 */
-	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
+	if (!order)
+		page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
 	if (page)
 		goto got_pg;
 
The whole point of handling this in the page allocator directly is to
have a unified solution rather than have each specific caller invent
its own way to achieve higher locality.
Linus Torvalds Sept. 28, 2019, 8:59 p.m. UTC | #15
On Fri, Sep 27, 2019 at 12:48 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> -       page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
> +       if (!order)
> +               page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
>         if (page)
>                 goto got_pg;
>
> The whole point of handling this in the page allocator directly is to
> have a unified solution rather than have each specific caller invent
> its own way to achieve higher locality.

The above just looks hacky.

Why would order-0 be special?

It really is hugepages that are special - small orders don't generally
need a lot of compaction. And this secondary get_page_from_freelist()
is not primarily about compaction, it's about relaxing some of the
other constraints, and the actual alloc_flags have changed from the
initial ones.

I really think that you're missing the big picture. We want node
locality, and we want big pages, but those "we want" simply shouldn't
be as black-and-white as they sometimes are today. In fact, the whole
black-and-white thing is completely crazy for anything but some
test-load.

I will just do the reverts and apply David's two patches on top. We
now have the time to actually test the behavior, and we're not just
before a release.

The thing is, David has numbers, and the patches make _sense_. There's
a description of what they do, there are comments, but the *code*
makes sense too. Much more sense than the above kind of insane hack.
David's patches literally do two things:

 - [3/4] just admit that the allocation failed when you're trying to
allocate a huge-page and compaction wasn't successful

 - [4/4] when that huge-page allocation failed, retry on another node
when appropriate

That's _literally_ what David's two patches do. The above is purely
the English translation of the patches.

And I claim that the English translation also ends up being
_sensible_. I go look at those two statements, and my reaction "yeah,
that makes sense".

So there were numbers, there was "this is sensible", and there were no
indications that those sensible choices would actually be problematic.
Nobody has actually argued against the patches making sense. Nobody
has even argued that the patches would be wrong. The _only_ argument
against them were literally "what if this changes something subtle",
together with patches like the above that do _not_ make sense either
on a big picture level or even locally on a small level.

The reason I didn't apply those patches for 5.3 was that they came in
very late, and there were horrendous numbers for the 5.2 behavior that
caused those two big reverts. But the patches made sense back then,
the timing for them just didn't.

I was hoping this would just get done in the mm tree, but clearly it
isn't, and I'll just do my own mm branch and merge the patches that
way.

                Linus
Michal Hocko Sept. 30, 2019, 11:28 a.m. UTC | #16
On Sat 28-09-19 13:59:26, Linus Torvalds wrote:
> On Fri, Sep 27, 2019 at 12:48 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > -       page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
> > +       if (!order)
> > +               page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
> >         if (page)
> >                 goto got_pg;
> >
> > The whole point of handling this in the page allocator directly is to
> > have a unified solutions rather than have each specific caller invent
> > its own way to achieve higher locality.
> 
> The above just looks hacky.

It is, and it was meant to help move forward while debugging rather than be
a final solution.
 
> Why would order-0 be special?

Ideally it wouldn't be but the current implementation makes it special.
Why? Because the whole concept of low wmark fast path attempt is based
on kswapd balancing for a high watermark providing some space. Kcompactd
doesn't have any notion like that. And I believe that a large part of
the problem really is there. If I am wrong here then I would appreciate
to be corrected.

If __GFP_THISNODE allows for a better THP utilization on a local node
then the problem points at kcompactd not being pro-active enough. And
that is what the first diff was aiming at.

I also claim that this is not a THP specific problem. You are right
that lower orders are less likely to hit the problem because the memory
is usually not fragmented that heavily, but fundamentally the over-eager
fallback in the fast path is still there. And that is the reason for me
to push back against __GFP_THISNODE && fallback allocation open-coded
outside of the allocator. The allocator knows the context can compact
so why should we require the caller to be doing that?

Do not get me wrong, but we have quite a long history of fine-tuning
for THP by adding kludges here and there and they usually turn out to
break something else. I really want to get to understand the underlying
problem and base a solution on it rather than "__GFP_THISNODE can cause
overreclaim so pick a break-out condition and hope for the best".
Michal Hocko Oct. 1, 2019, 5:43 a.m. UTC | #17
On Mon 30-09-19 13:28:17, Michal Hocko wrote:
[...]
> Do not get me wrong, but we have quite a long history of fine-tuning
> for THP by adding kludges here and there and they usually turn out to
> break something else. I really want to get to understand the underlying
> problem and base a solution on it rather than "__GFP_THISNODE can cause
> overreclaim so pick up a break out condition and hope for the best".

I didn't really get to any testing earlier but I was really suspecting
that hugetlb would be the first one affected because it uses
__GFP_RETRY_MAYFAIL - aka it really wants to succeed as much as possible
because this is a direct admin request to preallocate a specific number
of huge pages. Patch 3 doesn't really make any difference for those
requests.

Here is a very simple test I've done. Single NUMA node with 1G of
memory. Populate the memory with a clean page cache. That is both easy
to compact and reclaim and then try to allocate 512M worth of hugetlb
pages.
root@test1:~# cat hugetlb_test.sh
#!/bin/sh

set -x
echo 0 > /proc/sys/vm/nr_hugepages
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory
dd if=/mnt/data/file-1G of=/dev/null bs=$((4<<10))
TS=$(date +%s)
cp /proc/vmstat vmstat.$TS.before
echo 256 > /proc/sys/vm/nr_hugepages
cat /proc/sys/vm/nr_hugepages
cp /proc/vmstat vmstat.$TS.after

The results for 2 consecutive runs on clean 5.3
root@test1:~# sh hugetlb_test.sh
+ echo 0
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 21.0694 s, 51.0 MB/s
+ date +%s
+ TS=1569905284
+ cp /proc/vmstat vmstat.1569905284.before
+ echo 256
+ cat /proc/sys/vm/nr_hugepages
256
+ cp /proc/vmstat vmstat.1569905284.after
root@test1:~# sh hugetlb_test.sh
+ echo 0
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 21.7548 s, 49.4 MB/s
+ date +%s
+ TS=1569905311
+ cp /proc/vmstat vmstat.1569905311.before
+ echo 256
+ cat /proc/sys/vm/nr_hugepages
256
+ cp /proc/vmstat vmstat.1569905311.after

so we get all the requested huge pages to the pool.

Now with first 3 patches of this series applied (the last one doesn't
make any difference for hugetlb allocations).

root@test1:~# sh hugetlb_test.sh
+ echo 0
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 20.1815 s, 53.2 MB/s
+ date +%s
+ TS=1569905516
+ cp /proc/vmstat vmstat.1569905516.before
+ echo 256
+ cat /proc/sys/vm/nr_hugepages
11
+ cp /proc/vmstat vmstat.1569905516.after
root@test1:~# sh hugetlb_test.sh
+ echo 0
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 21.9485 s, 48.9 MB/s
+ date +%s
+ TS=1569905541
+ cp /proc/vmstat vmstat.1569905541.before
+ echo 256
+ cat /proc/sys/vm/nr_hugepages
12
+ cp /proc/vmstat vmstat.1569905541.after

so we do not get more than 12 huge pages, which is really poor. Although
hugetlb pages tend to be allocated early after boot, they are still
an explicit admin request, and a success rate of less than 5% is really
bad. If anything, __GFP_RETRY_MAYFAIL needs to be reflected in the
code.

I can provide vmstat files if anybody is interested.

Then I have tried another test, this time for THP. It is essentially
the same thing: populate the page cache to simulate the quite common
case of memory being used for the cache, and then populate a 512M
anonymous area with MADV_HUGEPAGE set on it:
$ cat thp_test.sh
#!/bin/sh

set -x
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory
dd if=/mnt/data/file-1G of=/dev/null bs=$((4<<10))
TS=$(date +%s)
cp /proc/vmstat vmstat.$TS.before
./mem_eater nowait 500M
cp /proc/vmstat vmstat.$TS.after

mem_eater is essentially a dummy app that mmaps an area, madvises it,
and touches it page by page; a rough sketch follows.
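
(For reference, a minimal sketch of such a dummy app, based purely on
the description above. The real mem_eater also prints its smaps entry
and parses the "nowait"/size arguments; the details below are
assumptions, not the actual source.)

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* hard-coded 500M to mirror the "nowait 500M" invocation above */
	size_t size = 500UL << 20;
	long page = sysconf(_SC_PAGESIZE);
	char *area;

	area = mmap(NULL, size, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (area == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* ask for THP backing; the THP test relies on MADV_HUGEPAGE */
	madvise(area, size, MADV_HUGEPAGE);

	/* touch every base page so the whole area gets populated */
	for (size_t off = 0; off < size; off += page)
		area[off] = 1;

	return 0;
}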

Again clean 5.3 kernel

root@test1:~# sh thp_test.sh 
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 20.8575 s, 51.5 MB/s
+ date +%s
+ TS=1569906274
+ cp /proc/vmstat vmstat.1569906274.before
+ ./mem_eater nowait 500M
7f55e8282000-7f5607682000 rw-p 00000000 00:00 0 
Size:             512000 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:              512000 kB
Pss:              512000 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:    512000 kB
Referenced:       260616 kB
Anonymous:        512000 kB
LazyFree:              0 kB
AnonHugePages:    509952 kB
ShmemPmdMapped:        0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:            1
+ cp /proc/vmstat vmstat.1569906274.after

root@test1:~# sh thp_test.sh
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 21.8648 s, 49.1 MB/s
+ date +%s
+ TS=1569906333
+ cp /proc/vmstat vmstat.1569906333.before
+ ./mem_eater nowait 500M
7f26625cd000-7f26819cd000 rw-p 00000000 00:00 0
Size:             512000 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:              512000 kB
Pss:              512000 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:    512000 kB
Referenced:       259892 kB
Anonymous:        512000 kB
LazyFree:              0 kB
AnonHugePages:    509952 kB
ShmemPmdMapped:        0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:            1
+ cp /proc/vmstat vmstat.1569906333.after

We are getting quite consistent 99% THP utilization.

grep "pgsteal_direct\|pgsteal_kswapd\|allocstall_movable\|compact_stall" vmstat.1569906333.{before,after}
vmstat.1569906333.before:allocstall_movable 29
vmstat.1569906333.before:pgsteal_kswapd 206760
vmstat.1569906333.before:pgsteal_direct 30162
vmstat.1569906333.before:compact_stall 29
vmstat.1569906333.after:allocstall_movable 65
vmstat.1569906333.after:pgsteal_kswapd 298688
vmstat.1569906333.after:pgsteal_direct 67645
vmstat.1569906333.after:compact_stall 66

We hit direct compaction 37 times and reclaimed 146M directly plus 359M
via kswapd, i.e. 505M in total, which is not bad for a 512M request.
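
(For reference, these figures are plain deltas of the two vmstat
snapshots, with the pgsteal_* counters converted from 4 kB base pages
to MB. A throwaway helper along the following lines reproduces the
numbers; the file names and counter selection are just an example:)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static long counter(const char *path, const char *key)
{
	char name[64];
	long val;
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		exit(1);
	}
	while (fscanf(f, "%63s %ld", name, &val) == 2) {
		if (!strcmp(name, key)) {
			fclose(f);
			return val;
		}
	}
	fclose(f);
	return 0;
}

int main(int argc, char **argv)
{
	const char *keys[] = { "allocstall_movable", "pgsteal_kswapd",
			       "pgsteal_direct", "compact_stall" };

	if (argc != 3) {
		fprintf(stderr, "usage: %s <vmstat.before> <vmstat.after>\n",
			argv[0]);
		return 1;
	}
	for (unsigned int i = 0; i < sizeof(keys) / sizeof(keys[0]); i++) {
		long d = counter(argv[2], keys[i]) - counter(argv[1], keys[i]);

		if (!strncmp(keys[i], "pgsteal", 7))
			/* pgsteal_* counters are in base pages (4 kB) */
			printf("%-20s %8ld pages (~%ld MB)\n",
			       keys[i], d, d * 4 / 1024);
		else
			printf("%-20s %8ld\n", keys[i], d);
	}
	return 0;
}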

5.3 + 3 patches (This is a non-NUMA machine so the only difference the
4th patch would make is timing because it just retries 2 times on the
same node).
root@test1:~# sh thp_test.sh
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 21.0732 s, 51.0 MB/s
+ date +%s
+ TS=1569906542
+ cp /proc/vmstat vmstat.1569906542.before
+ ./mem_eater nowait 500M
7f2799e08000-7f27b9208000 rw-p 00000000 00:00 0
Size:             512000 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:              512000 kB
Pss:              512000 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:    512000 kB
Referenced:       294944 kB
Anonymous:        512000 kB
LazyFree:              0 kB
AnonHugePages:    477184 kB
ShmemPmdMapped:        0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:            1
+ cp /proc/vmstat vmstat.1569906542.after
root@test1:~# sh thp_test.sh
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 21.9239 s, 49.0 MB/s
+ date +%s
+ TS=1569906569
+ cp /proc/vmstat vmstat.1569906569.before
+ ./mem_eater nowait 500M
7fa29a234000-7fa2b9634000 rw-p 00000000 00:00 0
Size:             512000 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:              512000 kB
Pss:              512000 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:    512000 kB
Referenced:       253480 kB
Anonymous:        512000 kB
LazyFree:              0 kB
AnonHugePages:    460800 kB
ShmemPmdMapped:        0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:            1
+ cp /proc/vmstat vmstat.1569906569.after

The drop in utilization is not as severe here, but it shows that the
results are not very stable either.

grep "pgsteal_direct\|pgsteal_kswapd\|allocstall_movable\|compact_stall" vmstat.1569906569.{before,after}
vmstat.1569906569.before:allocstall_movable 52
vmstat.1569906569.before:pgsteal_kswapd 182617
vmstat.1569906569.before:pgsteal_direct 54281
vmstat.1569906569.before:compact_stall 85
vmstat.1569906569.after:allocstall_movable 64
vmstat.1569906569.after:pgsteal_kswapd 296840
vmstat.1569906569.after:pgsteal_direct 66778
vmstat.1569906569.after:compact_stall 191

We have hit direct compaction 106 times and reclaimed 48M directly and
446M via kswapd, 494M in total. That is slightly less than with clean
5.3, but I would consider it within noise.

I didn't really get to analyze the numbers very deeply, but from a very
preliminary look it seems that the bailout based on the watermark check
is causing volatility, because it depends on the kswapd activity in the
background. Please note that this is pretty much the ideal case, where
the reclaimable memory is essentially free to drop. If kswapd starts
fighting to get memory reclaimed then the THP utilization is likely to
drop as well. On the other hand, it is fair to say that it is really
hard to tell what compaction would do under those conditions.

I also didn't really get to test any NUMA aspect of the change yet. I
still do hope that David can share something I can play with
because I do not want to create something completely artificial.
Michal Hocko Oct. 1, 2019, 8:37 a.m. UTC | #18
On Tue 01-10-19 07:43:43, Michal Hocko wrote:
[...]
> I also didn't really get to test any NUMA aspect of the change yet. I
> still do hope that David can share something I can play with
> because I do not want to create something completely artificial.

I have split out my kvm machine into two nodes to get at least some
idea how these patches behave
$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 2
node 0 size: 475 MB
node 0 free: 432 MB
node 1 cpus: 1 3
node 1 size: 503 MB
node 1 free: 458 MB

Again, I fill the memory with simple page cache and then mmap less
than the full capacity of node 1 with a preferred NUMA policy:
root@test1:~# cat thp_test.sh
#!/bin/sh

set -x
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory
dd if=/mnt/data/file-1G of=/dev/null bs=$((4<<10))
TS=$(date +%s)
cp /proc/vmstat vmstat.$TS.before
numactl --preferred 1 ./mem_eater nowait 400M
cp /proc/vmstat vmstat.$TS.after

First run with 5.3 and without THP
$ echo never > /sys/kernel/mm/transparent_hugepage/enabled
root@test1:~# sh thp_test.sh 
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 21.938 s, 48.9 MB/s
+ date +%s
+ TS=1569914851
+ cp /proc/vmstat vmstat.1569914851.before
+ numactl --preferred 1 ./mem_eater nowait 400M
7f4bdefec000-7f4bf7fec000 rw-p 00000000 00:00 0 
Size:             409600 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:              409600 kB
Pss:              409600 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:    409600 kB
Referenced:       344460 kB
Anonymous:        409600 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:            0
7f4bdefec000 prefer:1 anon=102400 dirty=102400 active=86115 N0=41963 N1=60437 kernelpagesize_kB=4
+ cp /proc/vmstat vmstat.1569914851.after

Few more runs:
7fd0f248b000 prefer:1 anon=102400 dirty=102400 active=86909 N0=40079 N1=62321 kernelpagesize_kB=4
7f2a69fc3000 prefer:1 anon=102400 dirty=102400 active=85244 N0=44455 N1=57945 kernelpagesize_kB=4

So we get around 56-60% pages to the preferred node

Now let's enable THPs
root@test1:~# echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
root@test1:~# sh thp_test.sh
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 21.8665 s, 49.1 MB/s
+ date +%s
+ TS=1569915082
+ cp /proc/vmstat vmstat.1569915082.before
+ numactl --preferred 1 ./mem_eater nowait 400M
7f05c6dee000-7f05dfdee000 rw-p 00000000 00:00 0
Size:             409600 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:              409600 kB
Pss:              409600 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:    409600 kB
Referenced:       210872 kB
Anonymous:        409600 kB
LazyFree:              0 kB
AnonHugePages:    407552 kB
ShmemPmdMapped:        0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:            1
7f05c6dee000 prefer:1 anon=102400 dirty=102400 active=52718 N0=50688 N1=51712 kernelpagesize_kB=4
+ cp /proc/vmstat vmstat.1569915082.after

Few more runs
AnonHugePages:    407552 kB
7effca1b9000 prefer:1 anon=102400 dirty=102400 active=65977 N0=53760 N1=48640 kernelpagesize_kB=4

AnonHugePages:    407552 kB
7f474bfc4000 prefer:1 anon=102400 dirty=102400 active=52676 N0=8704 N1=93696 kernelpagesize_kB=4

The utilization is again almost 100% and the preferred node usage
varied a lot between 47-91%. The last one looks like the result we would
like to achieve, right?

Now with 5.3 + all 4 patches this time:
root@test1:~# sh thp_test.sh
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 21.5368 s, 49.9 MB/s
+ date +%s
+ TS=1569915403
+ cp /proc/vmstat vmstat.1569915403.before
+ numactl --preferred 1 ./mem_eater nowait 400M
7f8114ab4000-7f812dab4000 rw-p 00000000 00:00 0
Size:             409600 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:              409600 kB
Pss:              409600 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:    409600 kB
Referenced:       207568 kB
Anonymous:        409600 kB
LazyFree:              0 kB
AnonHugePages:    401408 kB
ShmemPmdMapped:        0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:            1
7f8114ab4000 prefer:1 anon=102400 dirty=102400 active=51892 N0=3072 N1=99328 kernelpagesize_kB=4
+ cp /proc/vmstat vmstat.1569915403.after

Few more runs:
AnonHugePages:    376832 kB
7f37a1404000 prefer:1 anon=102400 dirty=102400 active=55204 N0=23153 N1=79247 kernelpagesize_kB=4

AnonHugePages:    372736 kB
7f4abe4af000 prefer:1 anon=102400 dirty=102400 active=52399 N0=23646 N1=78754 kernelpagesize_kB=4

The THP utilization varies again and the locality is higher on average,
76+%, which is even higher than the base page case. I was really
wondering what the reason for that is, so I've added a simple debugging
printk:
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 8caab1f81a52..565f667f6dcb 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2140,9 +2140,11 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 			 * to prefer hugepage backing, retry allowing remote
 			 * memory as well.
 			 */
-			if (!page && (gfp & __GFP_DIRECT_RECLAIM))
+			if (!page && (gfp & __GFP_DIRECT_RECLAIM)) {
 				page = __alloc_pages_node(hpage_node,
 						gfp | __GFP_NORETRY, order);
+				printk("XXX: %s\n", !page ? "fail" : hpage_node == page_to_nid(page)?"match":"fallback");
+			}
 
 			goto out;
 		}

Cases like
AnonHugePages:    407552 kB
7f3ab2ebf000 prefer:1 anon=102400 dirty=102400 active=77503 N0=43520 N1=58880 kernelpagesize_kB=4
     85 XXX: fallback
a very successful one, on the other hand:
AnonHugePages:    280576 kB
7feffd9c2000 prefer:1 anon=102400 dirty=102400 active=52563 N0=3131 N1=99269 kernelpagesize_kB=4
     62 XXX: fail
      3 XXX: fallback

Note that 62 failing THPs amount to 31744 base pages (62 x 512), which
I suspect also explains the much higher locality (we simply allocate
small pages from the preferred node).

This is just a very trivial test and I still have to grasp the outcome.
My current feeling is that the whole thing is very timing specific and
the higher local utilization depends on somebody else doing a reclaim
on our behalf with patches applied.

5.3 without any patches behaves more stably, although there is a
slightly higher off-node THP success rate. It also seems that the same
overall THP success rate can be achieved from the local node, so
performing compaction would make more sense for those allocations.

With the simple hack on top of 5.3
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9c9194959271..414d0eaa6287 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4685,7 +4685,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 {
 	struct page *page;
 	unsigned int alloc_flags = ALLOC_WMARK_LOW;
-	gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
+	gfp_t alloc_mask, first_alloc_mask; /* The gfp_t that was actually used for allocation */
 	struct alloc_context ac = { };
 
 	/*
@@ -4711,7 +4711,10 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 	alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, gfp_mask);
 
 	/* First allocation attempt */
-	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
+	first_alloc_mask = alloc_mask;
+	if (order && (alloc_mask & __GFP_DIRECT_RECLAIM))
+		first_alloc_mask |= __GFP_THISNODE;
+	page = get_page_from_freelist(first_alloc_mask, order, alloc_flags, &ac);
 	if (likely(page))
 		goto out;
 
I am getting
AnonHugePages:    407552 kB
7f88a67ca000 prefer:1 anon=102400 dirty=102400 active=60362 N0=39424 N1=62976 kernelpagesize_kB=4

AnonHugePages:    407552 kB
7f18f0de5000 prefer:1 anon=102400 dirty=102400 active=51685 N0=34720 N1=67680 kernelpagesize_kB=4

AnonHugePages:    407552 kB
7f443f89e000 prefer:1 anon=102400 dirty=102400 active=52382 N0=62976 N1=39424 kernelpagesize_kB=4

While the first two rounds looked very promising, with 60%+ locality
and consistent THP utilization, the locality then went crazy, so there
is more to look into. The fast path still seems to make some difference
though.
Vlastimil Babka Oct. 1, 2019, 1:50 p.m. UTC | #19
On 10/1/19 7:43 AM, Michal Hocko wrote:
> so we do not get more than 12 huge pages, which is really poor. Although
> hugetlb pages tend to be allocated early after boot, they are still
> an explicit admin request, and a success rate of less than 5% is really
> bad. If anything, __GFP_RETRY_MAYFAIL needs to be reflected in the
> code.

Yeah it's roughly what I expected, thanks for the testing. How about this
patch on top?

---8<---
From 3ae67ab2274626c276ff2dd58794215a8461f045 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@suse.cz>
Date: Tue, 1 Oct 2019 14:20:58 +0200
Subject: [RFC] mm, thp: tweak reclaim/compaction effort of local-only and
 all-node allocations

THP page faults now attempt a __GFP_THISNODE allocation first, which should
only compact existing free memory, followed by another attempt that can
allocate from any node using reclaim/compaction effort specified by global
defrag setting and madvise.

This patch makes the following changes to the scheme:

- before the patch, the first allocation relies on a check for pageblock order
  and __GFP_IO. This however affects also the second attempt, and also hugetlb
  allocations and other allocations of whole pageblock. Instead of that, reuse
  the existing check for costly order __GFP_NORETRY allocations, and make sure
  the first THP attempt uses __GFP_NORETRY. As a side-effect, all costly order
  __GFP_NORETRY allocations will bail out if compaction needs reclaim, while
  previously they only bailed out when compaction was deferred due to previous
  failures. This should be still acceptable within the __GFP_NORETRY semantics.

- before the patch, the second allocation attempt (on all nodes) was passing
  __GFP_NORETRY. This is redundant as the check for pageblock order (discussed
  above) was stronger. It's also contrary to madvise(MADV_HUGEPAGE) which means
  some effort to allocate THP is requested. After this patch, the second
  attempt doesn't pass __GFP_THISNODE nor __GFP_NORETRY.

To sum up, THP page faults now try the following attempt:

1. local node only THP allocation with no reclaim, just compaction.
2. THP allocation from any node with effort determined by global defrag setting
   and VMA madvise
3. fallback to base pages on any node

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/mempolicy.c  | 16 +++++++++-------
 mm/page_alloc.c | 23 +++++------------------
 2 files changed, 14 insertions(+), 25 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4ae967bcf954..2c48146f3ee2 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2129,18 +2129,20 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		nmask = policy_nodemask(gfp, pol);
 		if (!nmask || node_isset(hpage_node, *nmask)) {
 			mpol_cond_put(pol);
+			/*
+			 * First, try to allocate THP only on local node, but
+			 * don't reclaim unnecessarily, just compact.
+			 */
 			page = __alloc_pages_node(hpage_node,
-						gfp | __GFP_THISNODE, order);
+				gfp | __GFP_THISNODE | __GFP_NORETRY, order);
 
 			/*
-			 * If hugepage allocations are configured to always
-			 * synchronous compact or the vma has been madvised
-			 * to prefer hugepage backing, retry allowing remote
-			 * memory as well.
+			 * If that fails, allow both compaction and reclaim,
+			 * but on all nodes.
 			 */
-			if (!page && (gfp & __GFP_DIRECT_RECLAIM))
+			if (!page)
 				page = __alloc_pages_node(hpage_node,
-						gfp | __GFP_NORETRY, order);
+								gfp, order);
 
 			goto out;
 		}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 15c2050c629b..da9075d4cdf6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4467,7 +4467,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		if (page)
 			goto got_pg;
 
-		 if (order >= pageblock_order && (gfp_mask & __GFP_IO)) {
+		/*
+		 * Checks for costly allocations with __GFP_NORETRY, which
+		 * includes some THP page fault allocations
+		 */
+		if (costly_order && (gfp_mask & __GFP_NORETRY)) {
 			/*
 			 * If allocating entire pageblock(s) and compaction
 			 * failed because all zones are below low watermarks
@@ -4487,23 +4491,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 			if (compact_result == COMPACT_SKIPPED ||
 			    compact_result == COMPACT_DEFERRED)
 				goto nopage;
-		}
-
-		/*
-		 * Checks for costly allocations with __GFP_NORETRY, which
-		 * includes THP page fault allocations
-		 */
-		if (costly_order && (gfp_mask & __GFP_NORETRY)) {
-			/*
-			 * If compaction is deferred for high-order allocations,
-			 * it is because sync compaction recently failed. If
-			 * this is the case and the caller requested a THP
-			 * allocation, we do not want to heavily disrupt the
-			 * system, so we fail the allocation instead of entering
-			 * direct reclaim.
-			 */
-			if (compact_result == COMPACT_DEFERRED)
-				goto nopage;
 
 			/*
 			 * Looks like reclaim/compaction is worth trying, but
David Rientjes Oct. 1, 2019, 8:31 p.m. UTC | #20
On Tue, 1 Oct 2019, Vlastimil Babka wrote:

> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 4ae967bcf954..2c48146f3ee2 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2129,18 +2129,20 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  		nmask = policy_nodemask(gfp, pol);
>  		if (!nmask || node_isset(hpage_node, *nmask)) {
>  			mpol_cond_put(pol);
> +			/*
> +			 * First, try to allocate THP only on local node, but
> +			 * don't reclaim unnecessarily, just compact.
> +			 */
>  			page = __alloc_pages_node(hpage_node,
> -						gfp | __GFP_THISNODE, order);
> +				gfp | __GFP_THISNODE | __GFP_NORETRY, order);

The page allocator has heuristics to determine when compaction should be 
retried, reclaim should be retried, and the allocation itself should retry 
for high-order allocations (see PAGE_ALLOC_COSTLY_ORDER).  
PAGE_ALLOC_COSTLY_ORDER exists solely to avoid poor allocator behavior 
when reclaim itself is unlikely -- or disruptive enough -- in making that 
amount of contiguous memory available.

Rather than papering over the poor feedback loop between compaction and 
reclaim that exists in the page allocator, it's probably better to improve 
that and determine when an allocation should fail or it's worthwhile to 
retry.  That's a separate topic from NUMA locality of thp.

In other words, we should likely address how compaction and reclaim is 
done for all high order-allocations in the page allocator itself rather 
than only here for hugepage allocations and relying on specific gfp bits 
to be set.  Ask: if the allocation here should not retry regardless of why 
compaction failed, why should any high-order allocation (user or kernel) 
retry if compaction failed and at what order we should just fail?  If 
hugetlb wants to stress this to the fullest extent possible, it already 
appropriately uses __GFP_RETRY_MAYFAIL.

The code here is saying we care more about NUMA locality than hugepages 
simply because that's where the access latency is better and is specific 
to hugepages; allocation behavior for high-order pages needs to live in 
the page allocator.

>  
>  			/*
> -			 * If hugepage allocations are configured to always
> -			 * synchronous compact or the vma has been madvised
> -			 * to prefer hugepage backing, retry allowing remote
> -			 * memory as well.
> +			 * If that fails, allow both compaction and reclaim,
> +			 * but on all nodes.
>  			 */
> -			if (!page && (gfp & __GFP_DIRECT_RECLAIM))
> +			if (!page)
>  				page = __alloc_pages_node(hpage_node,
> -						gfp | __GFP_NORETRY, order);
> +								gfp, order);
>  
>  			goto out;
>  		}

We certainly don't want this for non-MADV_HUGEPAGE regions when thp 
enabled bit is not set to "always".  We'd much rather fallback to local 
native pages because of its improved access latency.  This is allowing all 
hugepage allocations to be remote even without MADV_HUGEPAGE which is not 
even what Andrea needs: qemu uses MADV_HUGEPAGE.
Vlastimil Babka Oct. 1, 2019, 9:54 p.m. UTC | #21
On 10/1/19 10:31 PM, David Rientjes wrote:
> On Tue, 1 Oct 2019, Vlastimil Babka wrote:
> 
>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>> index 4ae967bcf954..2c48146f3ee2 100644
>> --- a/mm/mempolicy.c
>> +++ b/mm/mempolicy.c
>> @@ -2129,18 +2129,20 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>>  		nmask = policy_nodemask(gfp, pol);
>>  		if (!nmask || node_isset(hpage_node, *nmask)) {
>>  			mpol_cond_put(pol);
>> +			/*
>> +			 * First, try to allocate THP only on local node, but
>> +			 * don't reclaim unnecessarily, just compact.
>> +			 */
>>  			page = __alloc_pages_node(hpage_node,
>> -						gfp | __GFP_THISNODE, order);
>> +				gfp | __GFP_THISNODE | __GFP_NORETRY, order);
> 
> The page allocator has heuristics to determine when compaction should be 
> retried, reclaim should be retried, and the allocation itself should retry 
> for high-order allocations (see PAGE_ALLOC_COSTLY_ORDER).

Yes, and flags like __GFP_NORETRY and __GFP_RETRY_MAYFAIL affect these
heuristics according to the caller's willingness to wait vs availability
of some kind of fallback.
  
> PAGE_ALLOC_COSTLY_ORDER exists solely to avoid poor allocator behavior 
> when reclaim itself is unlikely -- or disruptive enough -- in making that 
> amount of contiguous memory available.

The __GFP_NORETRY check for costly orders is one of the implementation
details of that poor allocator behavior avoidance.

> Rather than papering over the poor feedback loop between compaction and 
> reclaim that exists in the page allocator, it's probably better to improve 
> that and determine when an allocation should fail or it's worthwhile to 
> retry.  That's a separate topic from NUMA locality of thp.

I don't disagree. But we are doing that 'papering over' already and I simply
believe it can be done better, until we can drop it. Right now at least the
hugetlb allocations are regressing, as Michal showed.

> In other words, we should likely address how compaction and reclaim is 
> done for all high order-allocations in the page allocator itself rather 
> than only here for hugepage allocations and relying on specific gfp bits 
> to be set.

Well, with your patches, decisions are made based on pageblock order
(also affecting hugetlbfs) and a specific __GFP_IO bit. I don't see how
using __GFP_NORETRY and costly order is worse - both are concepts
already used to guide reclaim/compaction decisions. And it won't affect
hugetlbfs. With some THP defrag modes, __GFP_NORETRY is already being
used anyway.

> Ask: if the allocation here should not retry regardless of why 
> compaction failed, why should any high-order allocation (user or kernel) 
> retry if compaction failed and at what order we should just fail?

If that's referring to the quoted code above, the allocation here should not
retry because it would lead to reclaiming just the local node, effectively
a node reclaim mode, leading to the problems Andrea and Mel reported.
That's because __GFP_THISNODE makes it rather specific, so we shouldn't
conclude anything for the other kinds of allocations, as you're asking.

> If 
> hugetlb wants to stress this to the fullest extent possible, it already 
> appropriately uses __GFP_RETRY_MAYFAIL.

Which doesn't work anymore right now, and should again after this patch.

> The code here is saying we care more about NUMA locality than hugepages

That's why there's __GFP_THISNODE, yes.
 
> simply because that's where the access latency is better and is specific 
> to hugepages; allocation behavior for high-order pages needs to live in 
> the page allocator.

I'd rather add __GFP_NORETRY to the __GFP_THISNODE attempt here to
influence the allocation behavior than reintroduce any __GFP_THISNODE
checks in the page allocator. There are not that many __GFP_THISNODE
users, but there are plenty of __GFP_NORETRY users, so it seems better
to adjust behavior based on the latter flag. IIRC, in the recent past
you argued for putting __GFP_NORETRY back into all THP allocations,
including madvised ones, so why is it a problem now?

>>  
>>  			/*
>> -			 * If hugepage allocations are configured to always
>> -			 * synchronous compact or the vma has been madvised
>> -			 * to prefer hugepage backing, retry allowing remote
>> -			 * memory as well.
>> +			 * If that fails, allow both compaction and reclaim,
>> +			 * but on all nodes.
>>  			 */
>> -			if (!page && (gfp & __GFP_DIRECT_RECLAIM))
>> +			if (!page)
>>  				page = __alloc_pages_node(hpage_node,
>> -						gfp | __GFP_NORETRY, order);

Also this __GFP_NORETRY was added by your patch, so I really don't see
why you think it's wrong in the first allocation attempt.

>> +								gfp, order);
>>  
>>  			goto out;
>>  		}
> 
> We certainly don't want this for non-MADV_HUGEPAGE regions when thp 
> enabled bit is not set to "always".  We'd much rather fallback to local 
> native pages because of its improved access latency.  This is allowing all 
> hugepage allocations to be remote even without MADV_HUGEPAGE which is not 
> even what Andrea needs: qemu uses MADV_HUGEPAGE.
 
OTOH it's better to use a remote THP than a remote base page in case
the local node is below the watermark. But that would probably require
larger surgery on the allocator.

But what you're saying is to keep the gfp & __GFP_DIRECT_RECLAIM
condition. I still suggest dropping the __GFP_NORETRY as that goes
against MADV_HUGEPAGE.
Michal Hocko Oct. 2, 2019, 10:34 a.m. UTC | #22
On Tue 01-10-19 23:54:14, Vlastimil Babka wrote:
> On 10/1/19 10:31 PM, David Rientjes wrote:
[...]
> > If 
> > hugetlb wants to stress this to the fullest extent possible, it already 
> > appropriately uses __GFP_RETRY_MAYFAIL.
> 
> Which doesn't work anymore right now, and should again after this patch.

I didn't get to fully digest the patch Vlastimil is proposing. (Ab)using
__GFP_NORETRY is quite subtle but it is already in place with some
explanation and a reference to THPs. So while I am not really happy it
is at least something you can reason about.

b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction
may not succeed"), on the other hand, has added a much wider change
which has clearly broken hugetlb and any __GFP_RETRY_MAYFAIL user of
pageblock_order-sized allocations. And that is much worse, and something
I was pointing at during the review; those concerns were never really
addressed before merging.

In any case this is something to be fixed ASAP. Do you have any better
proposal? I do not assume you would be proposing yet another revert.
David Rientjes Oct. 2, 2019, 10:32 p.m. UTC | #23
On Wed, 2 Oct 2019, Michal Hocko wrote:

> > > If 
> > > hugetlb wants to stress this to the fullest extent possible, it already 
> > > appropriately uses __GFP_RETRY_MAYFAIL.
> > 
> > Which doesn't work anymore right now, and should again after this patch.
> 
> I didn't get to fully digest the patch Vlastimil is proposing. (Ab)using
> __GFP_NORETRY is quite subtle but it is already in place with some
> explanation and a reference to THPs. So while I am not really happy it
> is at least something you can reason about.
> 

It's a no-op:

        /* Do not loop if specifically requested */
        if (gfp_mask & __GFP_NORETRY)
                goto nopage;

        /*
         * Do not retry costly high order allocations unless they are
         * __GFP_RETRY_MAYFAIL
         */
        if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL))
                goto nopage;

So I'm not sure we should spend too much time discussing a hunk of a patch 
that doesn't do anything.

> b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction
> may not succeed"), on the other hand, has added a much wider change
> which has clearly broken hugetlb and any __GFP_RETRY_MAYFAIL user of
> pageblock_order-sized allocations. And that is much worse, and something
> I was pointing at during the review; those concerns were never really
> addressed before merging.
> 
> In any case this is something to be fixed ASAP. Do you have any better
> proposal? I do not assume you would be proposing yet another revert.

I thought Mike Kravetz said[*] that hugetlb was not negatively affected by 
this?  We could certainly disregard this logic for __GFP_RETRY_MAYFAIL if 
anybody is relying on excessive reclaim ("swap storms") that does not 
allow compaction to make forward progress for some reason.

 [*] https://marc.info/?l=linux-kernel&m=156771690024533

If not, for the purposes of this conversation we can disregard 
__GFP_NORETRY per the above because thp does not use __GFP_RETRY_MAYFAIL.
Vlastimil Babka Oct. 3, 2019, 8 a.m. UTC | #24
On 10/3/19 12:32 AM, David Rientjes wrote:
> On Wed, 2 Oct 2019, Michal Hocko wrote:
> 
>>>> If 
>>>> hugetlb wants to stress this to the fullest extent possible, it already 
>>>> appropriately uses __GFP_RETRY_MAYFAIL.
>>>
>>> Which doesn't work anymore right now, and should again after this patch.
>>
>> I didn't get to fully digest the patch Vlastimil is proposing. (Ab)using
>> __GFP_NORETRY is quite subtle but it is already in place with some
>> explanation and a reference to THPs. So while I am not really happy it
>> is at least something you can reason about.
>>
> 
> It's a no-op:
> 
>         /* Do not loop if specifically requested */
>         if (gfp_mask & __GFP_NORETRY)
>                 goto nopage;
> 
>         /*
>          * Do not retry costly high order allocations unless they are
>          * __GFP_RETRY_MAYFAIL
>          */
>         if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL))
>                 goto nopage;
> 
> So I'm not sure we should spend too much time discussing a hunk of a patch 
> that doesn't do anything.

I believe Michal was talking about my (ab)use of __GFP_NORETRY, where it
controls the earlier 'goto nopage' condition. Yes, with your patches alone,
the addition of __GFP_NORETRY in the second attempt is a no-op, although
then I don't see the point of confusing people reading the code with it.
Michal Hocko Oct. 4, 2019, 12:18 p.m. UTC | #25
On Thu 03-10-19 10:00:08, Vlastimil Babka wrote:
> On 10/3/19 12:32 AM, David Rientjes wrote:
> > On Wed, 2 Oct 2019, Michal Hocko wrote:
> > 
> >>>> If 
> >>>> hugetlb wants to stress this to the fullest extent possible, it already 
> >>>> appropriately uses __GFP_RETRY_MAYFAIL.
> >>>
> >>> Which doesn't work anymore right now, and should again after this patch.
> >>
> >> I didn't get to fully digest the patch Vlastimil is proposing. (Ab)using
> >> __GFP_NORETRY is quite subtle but it is already in place with some
> >> explanation and a reference to THPs. So while I am not really happy it
> >> is at least something you can reason about.
> >>
> > 
> > It's a no-op:
> > 
> >         /* Do not loop if specifically requested */
> >         if (gfp_mask & __GFP_NORETRY)
> >                 goto nopage;
> > 
> >         /*
> >          * Do not retry costly high order allocations unless they are
> >          * __GFP_RETRY_MAYFAIL
> >          */
> >         if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL))
> >                 goto nopage;
> > 
> > So I'm not sure we should spend too much time discussing a hunk of a patch 
> > that doesn't do anything.
> 
> I believe Michal was talking about my (ab)use of __GFP_NORETRY, where it
> controls the earlier 'goto nopage' condition.

That is correct. From a maintainability point of view it would be better
to have only a single bailout from an optimistic compaction attempt. If
we go with [1] then we have two different criteria to bail out on, and
that is really messy and error-prone. While special-casing
__GFP_RETRY_MAYFAIL as suggested in [1] fixes up the immediate
regression in the simplest way, this all really begs for a proper
analysis and a _real_ fix. Can we finally move in that direction,
please?

I would really love to conduct further testing but I haven't really
heard anything back on the results presented so far. I have no idea
whether they even remotely resemble anything David needs for his
claimed regression.

[1] http://lkml.kernel.org/r/alpine.DEB.2.21.1910021556270.187014@chino.kir.corp.google.com
Michal Hocko Oct. 18, 2019, 2:15 p.m. UTC | #26
It's been some time since I posted these results. The hugetlb issue
got resolved, but I would still like to hear back about these findings,
because they suggest that the current bail-out strategy doesn't seem
to produce very good results. Essentially it doesn't really help THP
locality (on moderately filled up nodes) and it introduces a strong
dependency on kswapd, which is not a source of high-order pages.
Also the overall THP success rate is decreased on a pretty standard "RAM
is used for page cache" workload.

That makes me think that the only possible workload that might really
benefit from this heuristic is a THP demanding one on a heavily
fragmented node with a lot of free memory while other nodes are not
fragmented and have quite a lot of free memory. If that is the case, is
this something to optimize for?

I am keeping all the results for the reference in a condensed form

On Tue 01-10-19 10:37:43, Michal Hocko wrote:
> I have split out my kvm machine into two nodes to get at least some
> idea how these patches behave
> $ numactl -H
> available: 2 nodes (0-1)
> node 0 cpus: 0 2
> node 0 size: 475 MB
> node 0 free: 432 MB
> node 1 cpus: 1 3
> node 1 size: 503 MB
> node 1 free: 458 MB
> 
> First run with 5.3 and without THP
> $ echo never > /sys/kernel/mm/transparent_hugepage/enabled
> root@test1:~# sh thp_test.sh 
> 7f4bdefec000 prefer:1 anon=102400 dirty=102400 active=86115 N0=41963 N1=60437 kernelpagesize_kB=4
> 7fd0f248b000 prefer:1 anon=102400 dirty=102400 active=86909 N0=40079 N1=62321 kernelpagesize_kB=4
> 7f2a69fc3000 prefer:1 anon=102400 dirty=102400 active=85244 N0=44455 N1=57945 kernelpagesize_kB=4
> 
> So we get around 56-60% pages to the preferred node
> 
> Now let's enable THPs
> AnonHugePages:    407552 kB
> 7f05c6dee000 prefer:1 anon=102400 dirty=102400 active=52718 N0=50688 N1=51712 kernelpagesize_kB=4
> Few more runs
> AnonHugePages:    407552 kB
> 7effca1b9000 prefer:1 anon=102400 dirty=102400 active=65977 N0=53760 N1=48640 kernelpagesize_kB=4
> AnonHugePages:    407552 kB
> 7f474bfc4000 prefer:1 anon=102400 dirty=102400 active=52676 N0=8704 N1=93696 kernelpagesize_kB=4
> 
> The utilization is again almost 100% and the preferred node usage
> varied a lot between 47-91%.
> 
> Now with 5.3 + all 4 patches this time:
> AnonHugePages:    401408 kB
> 7f8114ab4000 prefer:1 anon=102400 dirty=102400 active=51892 N0=3072 N1=99328 kernelpagesize_kB=4
> AnonHugePages:    376832 kB
> 7f37a1404000 prefer:1 anon=102400 dirty=102400 active=55204 N0=23153 N1=79247 kernelpagesize_kB=4
> AnonHugePages:    372736 kB
> 7f4abe4af000 prefer:1 anon=102400 dirty=102400 active=52399 N0=23646 N1=78754 kernelpagesize_kB=4
> 
> The THP utilization varies again and the locality is higher on average,
> 76+%, which is even higher than the base page case. I was really
> wondering what the reason for that is, so I've added a simple debugging
> printk:

> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 8caab1f81a52..565f667f6dcb 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2140,9 +2140,11 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  			 * to prefer hugepage backing, retry allowing remote
>  			 * memory as well.
>  			 */
> -			if (!page && (gfp & __GFP_DIRECT_RECLAIM))
> +			if (!page && (gfp & __GFP_DIRECT_RECLAIM)) {
>  				page = __alloc_pages_node(hpage_node,
>  						gfp | __GFP_NORETRY, order);
> +				printk("XXX: %s\n", !page ? "fail" : hpage_node == page_to_nid(page)?"match":"fallback");
> +			}
>  
>  			goto out;
>  		}
> 
> Cases like
> AnonHugePages:    407552 kB
> 7f3ab2ebf000 prefer:1 anon=102400 dirty=102400 active=77503 N0=43520 N1=58880 kernelpagesize_kB=4
>      85 XXX: fallback
> a very successful one, on the other hand:
> AnonHugePages:    280576 kB
> 7feffd9c2000 prefer:1 anon=102400 dirty=102400 active=52563 N0=3131 N1=99269 kernelpagesize_kB=4
>      62 XXX: fail
>       3 XXX: fallback
> 
> Note that 62 failing THPs amount to 31744 base pages (62 x 512), which
> I suspect also explains the much higher locality (we simply allocate
> small pages from the preferred node).
> 
> This is just a very trivial test and I still have to grasp the outcome.
> My current feeling is that the whole thing is very timing specific and
> the higher local utilization depends on somebody else doing a reclaim
> on our behalf with patches applied.
> 
> 5.3 without any patches behaves more stably, although there is a
> slightly higher off-node THP success rate. It also seems that the same
> overall THP success rate can be achieved from the local node, so
> performing compaction would make more sense for those allocations.
> 
> With the simple hack on top of 5.3
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9c9194959271..414d0eaa6287 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4685,7 +4685,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
>  {
>  	struct page *page;
>  	unsigned int alloc_flags = ALLOC_WMARK_LOW;
> -	gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
> +	gfp_t alloc_mask, first_alloc_mask; /* The gfp_t that was actually used for allocation */
>  	struct alloc_context ac = { };
>  
>  	/*
> @@ -4711,7 +4711,10 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
>  	alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, gfp_mask);
>  
>  	/* First allocation attempt */
> -	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
> +	first_alloc_mask = alloc_mask;
> +	if (order && (alloc_mask & __GFP_DIRECT_RECLAIM))
> +		first_alloc_mask |= __GFP_THISNODE;
> +	page = get_page_from_freelist(first_alloc_mask, order, alloc_flags, &ac);
>  	if (likely(page))
>  		goto out;
>  
> I am getting
> AnonHugePages:    407552 kB
> 7f88a67ca000 prefer:1 anon=102400 dirty=102400 active=60362 N0=39424 N1=62976 kernelpagesize_kB=4
> 
> AnonHugePages:    407552 kB
> 7f18f0de5000 prefer:1 anon=102400 dirty=102400 active=51685 N0=34720 N1=67680 kernelpagesize_kB=4
> 
> AnonHugePages:    407552 kB
> 7f443f89e000 prefer:1 anon=102400 dirty=102400 active=52382 N0=62976 N1=39424 kernelpagesize_kB=4
> 
> While the first two rounds looked very promising, with 60%+ locality
> and consistent THP utilization, the locality then went crazy, so there
> is more to look into. The fast path still seems to make some difference
> though.
Vlastimil Babka Oct. 23, 2019, 11:03 a.m. UTC | #27
On 10/18/19 4:15 PM, Michal Hocko wrote:
> It's been some time since I posted these results. The hugetlb issue
> got resolved, but I would still like to hear back about these findings,
> because they suggest that the current bail-out strategy doesn't seem
> to produce very good results. Essentially it doesn't really help THP
> locality (on moderately filled up nodes) and it introduces a strong
> dependency on kswapd, which is not a source of high-order pages.
> Also the overall THP success rate is decreased on a pretty standard "RAM
> is used for page cache" workload.
> 
> That makes me think that the only possible workload that might really
> benefit from this heuristic is a THP demanding one on a heavily
> fragmented node with a lot of free memory while other nodes are not
> fragmented and have quite a lot of free memory. If that is the case, is
> this something to optimize for?
> 
> I am keeping all the results for the reference in a condensed form
> 
> On Tue 01-10-19 10:37:43, Michal Hocko wrote:
>> I have split out my kvm machine into two nodes to get at least some
>> idea how these patches behave
>> $ numactl -H
>> available: 2 nodes (0-1)
>> node 0 cpus: 0 2
>> node 0 size: 475 MB
>> node 0 free: 432 MB
>> node 1 cpus: 1 3
>> node 1 size: 503 MB
>> node 1 free: 458 MB
>>
>> First run with 5.3 and without THP
>> $ echo never > /sys/kernel/mm/transparent_hugepage/enabled
>> root@test1:~# sh thp_test.sh 
>> 7f4bdefec000 prefer:1 anon=102400 dirty=102400 active=86115 N0=41963 N1=60437 kernelpagesize_kB=4
>> 7fd0f248b000 prefer:1 anon=102400 dirty=102400 active=86909 N0=40079 N1=62321 kernelpagesize_kB=4
>> 7f2a69fc3000 prefer:1 anon=102400 dirty=102400 active=85244 N0=44455 N1=57945 kernelpagesize_kB=4
>>
>> So we get around 56-60% pages to the preferred node
>>
>> Now let's enable THPs
>> AnonHugePages:    407552 kB
>> 7f05c6dee000 prefer:1 anon=102400 dirty=102400 active=52718 N0=50688 N1=51712 kernelpagesize_kB=4
>> Few more runs
>> AnonHugePages:    407552 kB
>> 7effca1b9000 prefer:1 anon=102400 dirty=102400 active=65977 N0=53760 N1=48640 kernelpagesize_kB=4
>> AnonHugePages:    407552 kB
>> 7f474bfc4000 prefer:1 anon=102400 dirty=102400 active=52676 N0=8704 N1=93696 kernelpagesize_kB=4
>>
>> The utilization is again almost 100% and the preferred node usage
>> varied a lot between 47-91%.
>>
>> Now with 5.3 + all 4 patches this time:
>> AnonHugePages:    401408 kB
>> 7f8114ab4000 prefer:1 anon=102400 dirty=102400 active=51892 N0=3072 N1=99328 kernelpagesize_kB=4
>> AnonHugePages:    376832 kB
>> 7f37a1404000 prefer:1 anon=102400 dirty=102400 active=55204 N0=23153 N1=79247 kernelpagesize_kB=4
>> AnonHugePages:    372736 kB
>> 7f4abe4af000 prefer:1 anon=102400 dirty=102400 active=52399 N0=23646 N1=78754 kernelpagesize_kB=4
>>
>> The THP utilization varies again and the locality is higher on average,
>> 76+%, which is even higher than the base page case. I was really

I tried to reproduce your setup locally, and got this for the THP case
on 5.4-rc4:

AnonHugePages:    395264 kB
7fdc4a2c0000 prefer:1 anon=102400 dirty=102400 N0=48852 N1=53548 kernelpagesize_kB=4
AnonHugePages:    401408 kB
7f27167e2000 prefer:1 anon=102400 dirty=102400 N0=40095 N1=62305 kernelpagesize_kB=4
AnonHugePages:    378880 kB
7ff693ff9000 prefer:1 anon=102400 dirty=102400 N0=58061 N1=44339 kernelpagesize_kB=4

Somewhat better THP utilization and worse node locality than you.

Then I applied a rebased patch that I proposed before (see below):

AnonHugePages:    407552 kB
7f33fa83a000 prefer:1 anon=102400 dirty=102400 N0=28672 N1=73728 kernelpagesize_kB=4
AnonHugePages:    407552 kB
7faac0aa9000 prefer:1 anon=102400 dirty=102400 N0=48869 N1=53531 kernelpagesize_kB=4
AnonHugePages:    407552 kB
7f9f32c57000 prefer:1 anon=102400 dirty=102400 N0=49664 N1=52736 kernelpagesize_kB=4

The THP utilization is now back at 100%, as with 5.3 (modulo
mis-alignment of the mem_eater area). This is expected, as the second
try, which is not limited to __GFP_THISNODE, is also not limited by the
newly introduced (in 5.4) heuristic that checks COMPACT_SKIPPED.
Locality seems similar; I can't draw any conclusions with such
variation and so few tries. Could you try confirming that as well?
Thanks. But I agree the test is limited and probably depends on timing
wrt kswapd making progress.

----8<----
From 8bd960e4e8e7e99fe13baf0d00b61910b3ae8d23 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@suse.cz>
Date: Tue, 1 Oct 2019 14:20:58 +0200
Subject: [PATCH] mm, thp: tweak reclaim/compaction effort of local-only and
 all-node allocations

THP page faults now attempt a __GFP_THISNODE allocation first, which should
only compact existing free memory, followed by another attempt that can
allocate from any node using reclaim/compaction effort specified by global
defrag setting and madvise.

This patch makes the following changes to the scheme:

- before the patch, the first allocation relies on a check for pageblock order
  and __GFP_IO to prevent excessive reclaim. This however affects also the
  second attempt, which is not limited to single node. Instead of that, reuse
  the existing check for costly order __GFP_NORETRY allocations, and make sure
  the first THP attempt uses __GFP_NORETRY. As a side-effect, all costly order
  __GFP_NORETRY allocations will bail out if compaction needs reclaim, while
  previously they only bailed out when compaction was deferred due to previous
  failures. This should be still acceptable within the __GFP_NORETRY semantics.

- before the patch, the second allocation attempt (on all nodes) was passing
  __GFP_NORETRY. This is redundant as the check for pageblock order (discussed
  above) was stronger. It's also contrary to madvise(MADV_HUGEPAGE) which means
  some effort to allocate THP is requested. After this patch, the second
  attempt doesn't pass __GFP_THISNODE nor __GFP_NORETRY.

To sum up, THP page faults now try the following attempt:

1. local node only THP allocation with no reclaim, just compaction.
2. THP allocation from any node with effort determined by global defrag setting
   and VMA madvise
3. fallback to base pages on any node

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/mempolicy.c  | 16 +++++++++-------
 mm/page_alloc.c | 24 +++++-------------------
 2 files changed, 14 insertions(+), 26 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4ae967bcf954..2c48146f3ee2 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2129,18 +2129,20 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		nmask = policy_nodemask(gfp, pol);
 		if (!nmask || node_isset(hpage_node, *nmask)) {
 			mpol_cond_put(pol);
+			/*
+			 * First, try to allocate THP only on local node, but
+			 * don't reclaim unnecessarily, just compact.
+			 */
 			page = __alloc_pages_node(hpage_node,
-						gfp | __GFP_THISNODE, order);
+				gfp | __GFP_THISNODE | __GFP_NORETRY, order);
 
 			/*
-			 * If hugepage allocations are configured to always
-			 * synchronous compact or the vma has been madvised
-			 * to prefer hugepage backing, retry allowing remote
-			 * memory as well.
+			 * If that fails, allow both compaction and reclaim,
+			 * but on all nodes.
 			 */
-			if (!page && (gfp & __GFP_DIRECT_RECLAIM))
+			if (!page)
 				page = __alloc_pages_node(hpage_node,
-						gfp | __GFP_NORETRY, order);
+								gfp, order);
 
 			goto out;
 		}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ecc3dbad606b..36d7d852f7b1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4473,8 +4473,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		if (page)
 			goto got_pg;
 
-		 if (order >= pageblock_order && (gfp_mask & __GFP_IO) &&
-		     !(gfp_mask & __GFP_RETRY_MAYFAIL)) {
+		/*
+		 * Checks for costly allocations with __GFP_NORETRY, which
+		 * includes some THP page fault allocations
+		 */
+		if (costly_order && (gfp_mask & __GFP_NORETRY)) {
 			/*
 			 * If allocating entire pageblock(s) and compaction
 			 * failed because all zones are below low watermarks
@@ -4495,23 +4498,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 			if (compact_result == COMPACT_SKIPPED ||
 			    compact_result == COMPACT_DEFERRED)
 				goto nopage;
-		}
-
-		/*
-		 * Checks for costly allocations with __GFP_NORETRY, which
-		 * includes THP page fault allocations
-		 */
-		if (costly_order && (gfp_mask & __GFP_NORETRY)) {
-			/*
-			 * If compaction is deferred for high-order allocations,
-			 * it is because sync compaction recently failed. If
-			 * this is the case and the caller requested a THP
-			 * allocation, we do not want to heavily disrupt the
-			 * system, so we fail the allocation instead of entering
-			 * direct reclaim.
-			 */
-			if (compact_result == COMPACT_DEFERRED)
-				goto nopage;
 
 			/*
 			 * Looks like reclaim/compaction is worth trying, but
David Rientjes Oct. 24, 2019, 6:59 p.m. UTC | #28
On Wed, 23 Oct 2019, Vlastimil Babka wrote:

> From 8bd960e4e8e7e99fe13baf0d00b61910b3ae8d23 Mon Sep 17 00:00:00 2001
> From: Vlastimil Babka <vbabka@suse.cz>
> Date: Tue, 1 Oct 2019 14:20:58 +0200
> Subject: [PATCH] mm, thp: tweak reclaim/compaction effort of local-only and
>  all-node allocations
> 
> THP page faults now attempt a __GFP_THISNODE allocation first, which should
> only compact existing free memory, followed by another attempt that can
> allocate from any node using reclaim/compaction effort specified by global
> defrag setting and madvise.
> 
> This patch makes the following changes to the scheme:
> 
> - before the patch, the first allocation relies on a check for pageblock order
>   and __GFP_IO to prevent excessive reclaim. This however affects also the
>   second attempt, which is not limited to single node. Instead of that, reuse
>   the existing check for costly order __GFP_NORETRY allocations, and make sure
>   the first THP attempt uses __GFP_NORETRY. As a side-effect, all costly order
>   __GFP_NORETRY allocations will bail out if compaction needs reclaim, while
>   previously they only bailed out when compaction was deferred due to previous
>   failures. This should be still acceptable within the __GFP_NORETRY semantics.
> 
> - before the patch, the second allocation attempt (on all nodes) was passing
>   __GFP_NORETRY. This is redundant as the check for pageblock order (discussed
>   above) was stronger. It's also contrary to madvise(MADV_HUGEPAGE) which means
>   some effort to allocate THP is requested. After this patch, the second
>   attempt doesn't pass __GFP_THISNODE nor __GFP_NORETRY.
> 
> To sum up, THP page faults now try the following attempt:
> 
> 1. local node only THP allocation with no reclaim, just compaction.
> 2. THP allocation from any node with effort determined by global defrag setting
>    and VMA madvise
> 3. fallback to base pages on any node
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/mempolicy.c  | 16 +++++++++-------
>  mm/page_alloc.c | 24 +++++-------------------
>  2 files changed, 14 insertions(+), 26 deletions(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 4ae967bcf954..2c48146f3ee2 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2129,18 +2129,20 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  		nmask = policy_nodemask(gfp, pol);
>  		if (!nmask || node_isset(hpage_node, *nmask)) {
>  			mpol_cond_put(pol);
> +			/*
> +			 * First, try to allocate THP only on local node, but
> +			 * don't reclaim unnecessarily, just compact.
> +			 */
>  			page = __alloc_pages_node(hpage_node,
> -						gfp | __GFP_THISNODE, order);
> +				gfp | __GFP_THISNODE | __GFP_NORETRY, order);
>  
>  			/*
> -			 * If hugepage allocations are configured to always
> -			 * synchronous compact or the vma has been madvised
> -			 * to prefer hugepage backing, retry allowing remote
> -			 * memory as well.
> +			 * If that fails, allow both compaction and reclaim,
> +			 * but on all nodes.
>  			 */
> -			if (!page && (gfp & __GFP_DIRECT_RECLAIM))
> +			if (!page)
>  				page = __alloc_pages_node(hpage_node,
> -						gfp | __GFP_NORETRY, order);
> +								gfp, order);
>  
>  			goto out;
>  		}

Hi Vlastimil,

For the default case where thp enabled is not set to "always" and the VMA 
is not madvised for MADV_HUGEPAGE, how does this prefer to return node 
local pages rather than remote hugepages?  The idea is to optimize for 
access latency when the vma has not been explicitly madvised.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ecc3dbad606b..36d7d852f7b1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4473,8 +4473,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		if (page)
>  			goto got_pg;
>  
> -		 if (order >= pageblock_order && (gfp_mask & __GFP_IO) &&
> -		     !(gfp_mask & __GFP_RETRY_MAYFAIL)) {
> +		/*
> +		 * Checks for costly allocations with __GFP_NORETRY, which
> +		 * includes some THP page fault allocations
> +		 */
> +		if (costly_order && (gfp_mask & __GFP_NORETRY)) {
>  			/*
>  			 * If allocating entire pageblock(s) and compaction
>  			 * failed because all zones are below low watermarks
> @@ -4495,23 +4498,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  			if (compact_result == COMPACT_SKIPPED ||
>  			    compact_result == COMPACT_DEFERRED)
>  				goto nopage;
> -		}
> -
> -		/*
> -		 * Checks for costly allocations with __GFP_NORETRY, which
> -		 * includes THP page fault allocations
> -		 */
> -		if (costly_order && (gfp_mask & __GFP_NORETRY)) {
> -			/*
> -			 * If compaction is deferred for high-order allocations,
> -			 * it is because sync compaction recently failed. If
> -			 * this is the case and the caller requested a THP
> -			 * allocation, we do not want to heavily disrupt the
> -			 * system, so we fail the allocation instead of entering
> -			 * direct reclaim.
> -			 */
> -			if (compact_result == COMPACT_DEFERRED)
> -				goto nopage;
>  
>  			/*
>  			 * Looks like reclaim/compaction is worth trying, but
> -- 
> 2.23.0
> 
>
Vlastimil Babka Oct. 29, 2019, 2:14 p.m. UTC | #29
On 10/24/19 8:59 PM, David Rientjes wrote:
>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>> index 4ae967bcf954..2c48146f3ee2 100644
>> --- a/mm/mempolicy.c
>> +++ b/mm/mempolicy.c
>> @@ -2129,18 +2129,20 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>>  		nmask = policy_nodemask(gfp, pol);
>>  		if (!nmask || node_isset(hpage_node, *nmask)) {
>>  			mpol_cond_put(pol);
>> +			/*
>> +			 * First, try to allocate THP only on local node, but
>> +			 * don't reclaim unnecessarily, just compact.
>> +			 */
>>  			page = __alloc_pages_node(hpage_node,
>> -						gfp | __GFP_THISNODE, order);
>> +				gfp | __GFP_THISNODE | __GFP_NORETRY, order);
>>  
>>  			/*
>> -			 * If hugepage allocations are configured to always
>> -			 * synchronous compact or the vma has been madvised
>> -			 * to prefer hugepage backing, retry allowing remote
>> -			 * memory as well.
>> +			 * If that fails, allow both compaction and reclaim,
>> +			 * but on all nodes.
>>  			 */
>> -			if (!page && (gfp & __GFP_DIRECT_RECLAIM))
>> +			if (!page)
>>  				page = __alloc_pages_node(hpage_node,
>> -						gfp | __GFP_NORETRY, order);
>> +								gfp, order);
>>  
>>  			goto out;
>>  		}
> Hi Vlastimil,
> 
> For the default case where thp enabled is not set to "always" and the VMA

I assume you meant "defrag" instead of "enabled".
 
> is not madvised for MADV_HUGEPAGE, how does this prefer to return node 
> local pages rather than remote hugepages?  The idea is to optimize for 
> access latency when the vma has not been explicitly madvised.
 
Right, you mentioned this before IIRC, and I forgot. How about this?
We could also be smarter and consolidate to a single attempt if there's
actually just a single NUMA node, but that can be optimized later.

----8<----
From 4c3a2217d0ee5ead00b1443010d07c664b6ac645 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@suse.cz>
Date: Tue, 1 Oct 2019 14:20:58 +0200
Subject: [PATCH] mm, thp: tweak reclaim/compaction effort of local-only and
 all-node allocations

THP page faults now attempt a __GFP_THISNODE allocation first, which should
only compact existing free memory, followed by another attempt that can
allocate from any node using reclaim/compaction effort specified by global
defrag setting and madvise.

This patch makes the following changes to the scheme:

- before the patch, the first allocation relies on a check for pageblock order
  and __GFP_IO to prevent excessive reclaim. This however affects also the
  second attempt, which is not limited to single node. Instead of that, reuse
  the existing check for costly order __GFP_NORETRY allocations, and make sure
  the first THP attempt uses __GFP_NORETRY. As a side-effect, all costly order
  __GFP_NORETRY allocations will bail out if compaction needs reclaim, while
  previously they only bailed out when compaction was deferred due to previous
  failures. This should be still acceptable within the __GFP_NORETRY semantics.

- before the patch, the second allocation attempt (on all nodes) was passing
  __GFP_NORETRY. This is redundant as the check for pageblock order (discussed
  above) was stronger. It's also contrary to madvise(MADV_HUGEPAGE) which means
  some effort to allocate THP is requested. After this patch, the second
  attempt doesn't pass __GFP_THISNODE nor __GFP_NORETRY.

To sum up, THP page faults now try the following attempts:

1. local node only THP allocation with no reclaim, just compaction.
2. for madvised VMA's or when synchronous compaction is enabled always - THP
   allocation from any node with effort determined by global defrag setting
   and VMA madvise
3. fallback to base pages on any node

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/mempolicy.c  | 10 +++++++---
 mm/page_alloc.c | 24 +++++-------------------
 2 files changed, 12 insertions(+), 22 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4ae967bcf954..ed6fbc5b1e20 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2129,18 +2129,22 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		nmask = policy_nodemask(gfp, pol);
 		if (!nmask || node_isset(hpage_node, *nmask)) {
 			mpol_cond_put(pol);
+			/*
+			 * First, try to allocate THP only on local node, but
+			 * don't reclaim unnecessarily, just compact.
+			 */
 			page = __alloc_pages_node(hpage_node,
-						gfp | __GFP_THISNODE, order);
+				gfp | __GFP_THISNODE | __GFP_NORETRY, order);
 
 			/*
 			 * If hugepage allocations are configured to always
 			 * synchronous compact or the vma has been madvised
 			 * to prefer hugepage backing, retry allowing remote
-			 * memory as well.
+			 * memory with both reclaim and compact as well.
 			 */
 			if (!page && (gfp & __GFP_DIRECT_RECLAIM))
 				page = __alloc_pages_node(hpage_node,
-						gfp | __GFP_NORETRY, order);
+								gfp, order);
 
 			goto out;
 		}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ecc3dbad606b..36d7d852f7b1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4473,8 +4473,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		if (page)
 			goto got_pg;
 
-		 if (order >= pageblock_order && (gfp_mask & __GFP_IO) &&
-		     !(gfp_mask & __GFP_RETRY_MAYFAIL)) {
+		/*
+		 * Checks for costly allocations with __GFP_NORETRY, which
+		 * includes some THP page fault allocations
+		 */
+		if (costly_order && (gfp_mask & __GFP_NORETRY)) {
 			/*
 			 * If allocating entire pageblock(s) and compaction
 			 * failed because all zones are below low watermarks
@@ -4495,23 +4498,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 			if (compact_result == COMPACT_SKIPPED ||
 			    compact_result == COMPACT_DEFERRED)
 				goto nopage;
-		}
-
-		/*
-		 * Checks for costly allocations with __GFP_NORETRY, which
-		 * includes THP page fault allocations
-		 */
-		if (costly_order && (gfp_mask & __GFP_NORETRY)) {
-			/*
-			 * If compaction is deferred for high-order allocations,
-			 * it is because sync compaction recently failed. If
-			 * this is the case and the caller requested a THP
-			 * allocation, we do not want to heavily disrupt the
-			 * system, so we fail the allocation instead of entering
-			 * direct reclaim.
-			 */
-			if (compact_result == COMPACT_DEFERRED)
-				goto nopage;
 
 			/*
 			 * Looks like reclaim/compaction is worth trying, but
Michal Hocko Oct. 29, 2019, 3:15 p.m. UTC | #30
On Tue 29-10-19 15:14:59, Vlastimil Babka wrote:
> From 4c3a2217d0ee5ead00b1443010d07c664b6ac645 Mon Sep 17 00:00:00 2001
> From: Vlastimil Babka <vbabka@suse.cz>
> Date: Tue, 1 Oct 2019 14:20:58 +0200
> Subject: [PATCH] mm, thp: tweak reclaim/compaction effort of local-only and
>  all-node allocations
> 
> THP page faults now attempt a __GFP_THISNODE allocation first, which should
> only compact existing free memory, followed by another attempt that can
> allocate from any node using reclaim/compaction effort specified by global
> defrag setting and madvise.
> 
> This patch makes the following changes to the scheme:
> 
> - before the patch, the first allocation relies on a check for pageblock order
>   and __GFP_IO to prevent excessive reclaim. This however affects also the
>   second attempt, which is not limited to single node. Instead of that, reuse
>   the existing check for costly order __GFP_NORETRY allocations, and make sure
>   the first THP attempt uses __GFP_NORETRY. As a side-effect, all costly order
>   __GFP_NORETRY allocations will bail out if compaction needs reclaim, while
>   previously they only bailed out when compaction was deferred due to previous
>   failures. This should be still acceptable within the __GFP_NORETRY semantics.
> 
> - before the patch, the second allocation attempt (on all nodes) was passing
>   __GFP_NORETRY. This is redundant as the check for pageblock order (discussed
>   above) was stronger. It's also contrary to madvise(MADV_HUGEPAGE) which means
>   some effort to allocate THP is requested. After this patch, the second
>   attempt doesn't pass __GFP_THISNODE nor __GFP_NORETRY.
> 
> To sum up, THP page faults now try the following attempts:
> 
> 1. local node only THP allocation with no reclaim, just compaction.
> 2. for madvised VMA's or when synchronous compaction is enabled always - THP
>    allocation from any node with effort determined by global defrag setting
>    and VMA madvise
> 3. fallback to base pages on any node
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

I've given this a try and here are the results of my previous testcase
(memory full of page cache).
root@test1:~# sh thp_test.sh 
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 18.5235 s, 58.0 MB/s
+ date +%s
+ TS=1572361069
+ cp /proc/vmstat vmstat.1572361069.before
+ numactl --preferred 1 ./mem_eater nowait 400M
7f959336a000-7f95ac36a000 rw-p 00000000 00:00 0 
Size:             409600 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:              409600 kB
Pss:              409600 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:    409600 kB
Referenced:       208296 kB
Anonymous:        409600 kB
LazyFree:              0 kB
AnonHugePages:    407552 kB
ShmemPmdMapped:        0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:            1
7f959336a000 prefer:1 anon=102400 dirty=102400 active=52074 N0=32768 N1=69632 kernelpagesize_kB=4
+ cp /proc/vmstat vmstat.1572361069.after
AnonHugePages:    407552 kB
7f590e2b3000 prefer:1 anon=102400 dirty=102400 active=61619 N0=45235 N1=57165 kernelpagesize_kB=4
AnonHugePages:    407552 kB
7f8667e9a000 prefer:1 anon=102400 dirty=102400 active=52378 N0=6144 N1=96256 kernelpagesize_kB=4

So the THP utilization is back to 100% (modulo alignment) and the node
locality is 55%-94% - with an obviously high variance depending on the
kswapd activity, in line with the kswapd based feedback backoff. All that
with a high THP utilization, which sounds like an improvement. The code
is definitely less hackish so I would go with it for the time being. I
still believe that we have a more fundamental problem to deal with
though. All these open coded fallback allocations just point out that
our allocator doesn't cooperate with kcompactd properly and that
should be fixed longterm.

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/mempolicy.c  | 10 +++++++---
>  mm/page_alloc.c | 24 +++++-------------------
>  2 files changed, 12 insertions(+), 22 deletions(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 4ae967bcf954..ed6fbc5b1e20 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2129,18 +2129,22 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  		nmask = policy_nodemask(gfp, pol);
>  		if (!nmask || node_isset(hpage_node, *nmask)) {
>  			mpol_cond_put(pol);
> +			/*
> +			 * First, try to allocate THP only on local node, but
> +			 * don't reclaim unnecessarily, just compact.
> +			 */
>  			page = __alloc_pages_node(hpage_node,
> -						gfp | __GFP_THISNODE, order);
> +				gfp | __GFP_THISNODE | __GFP_NORETRY, order);
>  
>  			/*
>  			 * If hugepage allocations are configured to always
>  			 * synchronous compact or the vma has been madvised
>  			 * to prefer hugepage backing, retry allowing remote
> -			 * memory as well.
> +			 * memory with both reclaim and compact as well.
>  			 */
>  			if (!page && (gfp & __GFP_DIRECT_RECLAIM))
>  				page = __alloc_pages_node(hpage_node,
> -						gfp | __GFP_NORETRY, order);
> +								gfp, order);
>  
>  			goto out;
>  		}
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ecc3dbad606b..36d7d852f7b1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4473,8 +4473,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		if (page)
>  			goto got_pg;
>  
> -		 if (order >= pageblock_order && (gfp_mask & __GFP_IO) &&
> -		     !(gfp_mask & __GFP_RETRY_MAYFAIL)) {
> +		/*
> +		 * Checks for costly allocations with __GFP_NORETRY, which
> +		 * includes some THP page fault allocations
> +		 */
> +		if (costly_order && (gfp_mask & __GFP_NORETRY)) {
>  			/*
>  			 * If allocating entire pageblock(s) and compaction
>  			 * failed because all zones are below low watermarks
> @@ -4495,23 +4498,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  			if (compact_result == COMPACT_SKIPPED ||
>  			    compact_result == COMPACT_DEFERRED)
>  				goto nopage;
> -		}
> -
> -		/*
> -		 * Checks for costly allocations with __GFP_NORETRY, which
> -		 * includes THP page fault allocations
> -		 */
> -		if (costly_order && (gfp_mask & __GFP_NORETRY)) {
> -			/*
> -			 * If compaction is deferred for high-order allocations,
> -			 * it is because sync compaction recently failed. If
> -			 * this is the case and the caller requested a THP
> -			 * allocation, we do not want to heavily disrupt the
> -			 * system, so we fail the allocation instead of entering
> -			 * direct reclaim.
> -			 */
> -			if (compact_result == COMPACT_DEFERRED)
> -				goto nopage;
>  
>  			/*
>  			 * Looks like reclaim/compaction is worth trying, but
> -- 
> 2.23.0
Andrew Morton Oct. 29, 2019, 9:33 p.m. UTC | #31
On Tue, 29 Oct 2019 16:15:49 +0100 Michal Hocko <mhocko@kernel.org> wrote:

> > 
> > 1. local node only THP allocation with no reclaim, just compaction.
> > 2. for madvised VMA's or when synchronous compaction is enabled always - THP
> >    allocation from any node with effort determined by global defrag setting
> >    and VMA madvise
> > 3. fallback to base pages on any node
> > 
> > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> 
> I've given this a try and here are the results of my previous testcase
> (memory full of page cache).

Thanks, I'll queue this for some more testing.  At some point we should
decide on a suitable set of Fixes: tags and a backporting strategy, if any?
Vlastimil Babka Oct. 29, 2019, 9:45 p.m. UTC | #32
On 10/29/19 10:33 PM, Andrew Morton wrote:
> On Tue, 29 Oct 2019 16:15:49 +0100 Michal Hocko <mhocko@kernel.org> wrote:
> 
>>>
>>> 1. local node only THP allocation with no reclaim, just compaction.
>>> 2. for madvised VMA's or when synchronous compaction is enabled always - THP
>>>    allocation from any node with effort determined by global defrag setting
>>>    and VMA madvise
>>> 3. fallback to base pages on any node
>>>
>>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>>
>> I've given this a try and here are the results of my previous testcase
>> (memory full of page cache).
> 
> Thanks, I'll queue this for some more testing.  At some point we should
> decide on a suitable set of Fixes: tags and a backporting strategy, if any?

I guess the tag below, since that's what 3f36d8669457 had and this patch is
a similar kind of refinement.

Fixes: b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when
compaction may not succeed")

If accepted, it would be nice if it made it to 5.4, or at least 5.4.y (it
will be an LTSS AFAIK). I'm not sure if we should backport all 5.4 changes
to 5.3.y as that will be EOL rather soon after 5.4 final? I guess that's
for David. And for going to older LTSS's we would at least need
longer-term confidence and tests on real workloads - Andrea?
David Rientjes Oct. 29, 2019, 11:25 p.m. UTC | #33
On Tue, 29 Oct 2019, Andrew Morton wrote:

> On Tue, 29 Oct 2019 16:15:49 +0100 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > > 
> > > 1. local node only THP allocation with no reclaim, just compaction.
> > > 2. for madvised VMA's or when synchronous compaction is enabled always - THP
> > >    allocation from any node with effort determined by global defrag setting
> > >    and VMA madvise
> > > 3. fallback to base pages on any node
> > > 
> > > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> > 
> > I've given this a try and here are the results of my previous testcase
> > (memory full of page cache).
> 
> Thanks, I'll queue this for some more testing.  At some point we should
> decide on a suitable set of Fixes: tags and a backporting strategy, if any?
> 

I'd strongly suggest that Andrea test this patch out on his workload on 
hosts where all nodes are low on memory because based on my understanding 
of his reported issue this would result in swap storms reemerging but 
worse this time because they wouldn't be constrained only locally.  (This 
patch causes us to no longer circumvent excessive reclaim when using 
MADV_HUGEPAGE.)
Michal Hocko Nov. 5, 2019, 1:02 p.m. UTC | #34
On Tue 29-10-19 16:25:17, David Rientjes wrote:
> On Tue, 29 Oct 2019, Andrew Morton wrote:
> 
> > On Tue, 29 Oct 2019 16:15:49 +0100 Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > > > 
> > > > 1. local node only THP allocation with no reclaim, just compaction.
> > > > 2. for madvised VMA's or when synchronous compaction is enabled always - THP
> > > >    allocation from any node with effort determined by global defrag setting
> > > >    and VMA madvise
> > > > 3. fallback to base pages on any node
> > > > 
> > > > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> > > 
> > > I've given this a try and here are the results of my previous testcase
> > > (memory full of page cache).
> > 
> > Thanks, I'll queue this for some more testing.  At some point we should
> > decide on a suitable set of Fixes: tags and a backporting strategy, if any?
> > 
> 
> I'd strongly suggest that Andrea test this patch out on his workload on 
> hosts where all nodes are low on memory because based on my understanding 
> of his reported issue this would result in swap storms reemerging but 
> worse this time because they wouldn't be constrained only locally.  (This 
> patch causes us to no longer circumvent excessive reclaim when using 
> MADV_HUGEPAGE.)

Could you be more specific on why this would be the case? My testing
doesn't show any such signs and I am effectively testing a low-memory
situation. The amount of reclaimed memory matches the amount of
requested memory.
David Rientjes Nov. 6, 2019, 1:01 a.m. UTC | #35
On Tue, 5 Nov 2019, Michal Hocko wrote:

> > > Thanks, I'll queue this for some more testing.  At some point we should
> > > decide on a suitable set of Fixes: tags and a backporting strategy, if any?
> > > 
> > 
> > I'd strongly suggest that Andrea test this patch out on his workload on 
> > hosts where all nodes are low on memory because based on my understanding 
> > of his reported issue this would result in swap storms reemerging but 
> > worse this time because they wouldn't be constrained only locally.  (This 
> > patch causes us to no longer circumvent excessive reclaim when using 
> > MADV_HUGEPAGE.)
> 
> Could you be more specific on why this would be the case? My testing is
> doesn't show any such signs and I am effectivelly testing memory low
> situation. The amount of reclaimed memory matches the amount of
> requested memory.
> 

The follow-up allocation in alloc_pages_vma() would no longer use 
__GFP_NORETRY, and as a result of this patch there is no longer any special 
handling in the page allocator to avoid swap storms.  I don't see any 
indication that this allocation would behave any differently than the code 
that Andrea experienced swap storms with, only now worse if remote memory 
is in the same state as local memory is when he's using __GFP_THISNODE.
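
Concretely, with the patch applied the THP fault path in alloc_pages_vma()
is effectively the following (condensed from the diff quoted earlier in
this thread, not the verbatim source):

/* 1) optimistic local attempt: compaction only, no reclaim retries */
page = __alloc_pages_node(hpage_node,
			  gfp | __GFP_THISNODE | __GFP_NORETRY, order);
if (!page && (gfp & __GFP_DIRECT_RECLAIM))
	/*
	 * 2) fallback: gfp comes straight from
	 * alloc_hugepage_direct_gfpmask(), with neither __GFP_THISNODE
	 * nor __GFP_NORETRY, so reclaim and compaction may be retried on
	 * any node as hard as the defrag/madvise policy allows.
	 */
	page = __alloc_pages_node(hpage_node, gfp, order);

It is that second call which bypasses the precautions discussed above.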
Michal Hocko Nov. 6, 2019, 7:35 a.m. UTC | #36
On Tue 05-11-19 17:01:00, David Rientjes wrote:
> On Tue, 5 Nov 2019, Michal Hocko wrote:
> 
> > > > Thanks, I'll queue this for some more testing.  At some point we should
> > > > decide on a suitable set of Fixes: tags and a backporting strategy, if any?
> > > > 
> > > 
> > > I'd strongly suggest that Andrea test this patch out on his workload on 
> > > hosts where all nodes are low on memory because based on my understanding 
> > > of his reported issue this would result in swap storms reemerging but 
> > > worse this time because they wouldn't be constrained only locally.  (This 
> > > patch causes us to no longer circumvent excessive reclaim when using 
> > > MADV_HUGEPAGE.)
> > 
> > Could you be more specific on why this would be the case? My testing is
> > doesn't show any such signs and I am effectivelly testing memory low
> > situation. The amount of reclaimed memory matches the amount of
> > requested memory.
> > 
> 
> The follow-up allocation in alloc_pages_vma() would no longer use 
> __GFP_NORETRY and there is no special handling to avoid swap storms in the 
> page allocator anymore as a result of this patch.

Yes, there is no __GFP_NORETRY in the fallback path because how hard to
retry is controlled by alloc_hugepage_direct_gfpmask, depending on the
defrag mode and the madvise mode.
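
To make that concrete, the mapping is roughly the following (a simplified
sketch of alloc_hugepage_direct_gfpmask() written from memory, not the
verbatim kernel source; the exact flags differ slightly between versions):

static gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
{
	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);

	/* defrag=always: direct compaction, retry harder only if madvised */
	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
		     &transparent_hugepage_flags))
		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);

	/* defrag=defer: only kick kcompactd, never direct reclaim */
	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
		     &transparent_hugepage_flags))
		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;

	/* defrag=defer+madvise: direct reclaim only for madvised VMAs,
	 * kcompactd for everything else */
	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG,
		     &transparent_hugepage_flags))
		return GFP_TRANSHUGE_LIGHT | (vma_madvised ?
			__GFP_DIRECT_RECLAIM : __GFP_KSWAPD_RECLAIM);

	/* defrag=madvise (typically the default): direct reclaim only for
	 * madvised VMAs, nothing extra otherwise */
	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
		     &transparent_hugepage_flags))
		return GFP_TRANSHUGE_LIGHT |
		       (vma_madvised ? __GFP_DIRECT_RECLAIM : 0);

	/* defrag=never */
	return GFP_TRANSHUGE_LIGHT;
}

So the fallback attempt only retries as hard as the administrator and the
application have asked for via those two knobs.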

> I don't see any 
> indication that this allocation would behave any different than the code 
> that Andrea experienced swap storms with, but now worse if remote memory 
> is in the same state local memory is when he's using __GFP_THISNODE.

The primary reason for the extensive swapping was exactly the __GFP_THISNODE
in conjunction with an unbounded direct reclaim AFAIR.

The whole point of Vlastimil's patch is to have an optimistic local
node allocation first and the full gfp context one in the fallback path.
If our full gfp context doesn't really work well then we can revisit
that of course, but that should happen at the alloc_hugepage_direct_gfpmask
level.
David Rientjes Nov. 6, 2019, 9:32 p.m. UTC | #37
On Wed, 6 Nov 2019, Michal Hocko wrote:

> > I don't see any 
> > indication that this allocation would behave any different than the code 
> > that Andrea experienced swap storms with, but now worse if remote memory 
> > is in the same state local memory is when he's using __GFP_THISNODE.
> 
> The primary reason for the extensive swapping was exactly the __GFP_THISNODE
> in conjunction with an unbounded direct reclaim AFAIR.
> 
> The whole point of the Vlastimil's patch is to have an optimistic local
> node allocation first and the full gfp context one in the fallback path.
> If our full gfp context doesn't really work well then we can revisit
> that of course but that should happen at alloc_hugepage_direct_gfpmask
> level.

Since the patch reverts the precaution put into the page allocator to not 
attempt reclaim if the allocation order is significantly large and the 
return value from compaction indicates it is unlikely to succeed on its 
own, I believe Vlastimil's patch will cause the same regression that 
Andrea saw if the whole host is low on memory and/or significantly 
fragmented.  So the suggestion was that he test this change to make sure 
we aren't introducing a regression for his workload.
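
For reference, the precaution being reverted is the page_alloc.c hunk shown
earlier; condensed (not verbatim), the slowpath bailout changes from

/*
 * Before: bail out for any huge-order allocation when compaction was
 * skipped or deferred, regardless of __GFP_NORETRY.
 */
if (order >= pageblock_order && (gfp_mask & __GFP_IO) &&
    !(gfp_mask & __GFP_RETRY_MAYFAIL)) {
	if (compact_result == COMPACT_SKIPPED ||
	    compact_result == COMPACT_DEFERRED)
		goto nopage;
}

to

/*
 * After: the bailout only applies to costly __GFP_NORETRY allocations,
 * and the second THP attempt no longer passes __GFP_NORETRY.  (The old
 * DEFERRED-only bailout for __GFP_NORETRY is folded into this check.)
 */
if (costly_order && (gfp_mask & __GFP_NORETRY)) {
	if (compact_result == COMPACT_SKIPPED ||
	    compact_result == COMPACT_DEFERRED)
		goto nopage;
}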
Mel Gorman Nov. 13, 2019, 11:20 a.m. UTC | #38
On Wed, Nov 06, 2019 at 01:32:37PM -0800, David Rientjes wrote:
> On Wed, 6 Nov 2019, Michal Hocko wrote:
> 
> > > I don't see any 
> > > indication that this allocation would behave any different than the code 
> > > that Andrea experienced swap storms with, but now worse if remote memory 
> > > is in the same state local memory is when he's using __GFP_THISNODE.
> > 
> > The primary reason for the extensive swapping was exactly the __GFP_THISNODE
> > in conjunction with an unbounded direct reclaim AFAIR.
> > 
> > The whole point of the Vlastimil's patch is to have an optimistic local
> > node allocation first and the full gfp context one in the fallback path.
> > If our full gfp context doesn't really work well then we can revisit
> > that of course but that should happen at alloc_hugepage_direct_gfpmask
> > level.
> 
> Since the patch reverts the precaution put into the page allocator to not 
> attempt reclaim if the allocation order is significantly large and the 
> return value from compaction specifies it is unlikely to succed on its 
> own, I believe Vlastimil's patch will cause the same regression that 
> Andrea saw is the whole host is low on memory and/or significantly 
> fragmented.  So the suggestion was that he test this change to make sure 
> we aren't introducing a regression for his workload.

TLDR: I do not have evidence that Vlastimil's patch causes more swapping
	but more information is needed from Andrea on exactly how he's
	testing this. It's not clear to me what was originally tested
	and whether memory just had to be full or whether it had to be
	fragmented. If fragmented, then we have to agree on what an
	appropriate mechanism is for fragmenting memory. Hypothetical
	kernel modules that don't exist do not count.

I put together a testcase whereby a virtual machine is deployed, started
and then time how long it takes to run memhog on 80% of the guests
physical memory. I varied how large the virtual machine is and ran it on
a 2-socket machine so that the smaller tests would be single node and
the larger tests would span both nodes. Before each startup, a large
file is read to fill the memory with pagecache.

kvmstart
                              5.2.0                  5.3.0              5.4.0-rc6              5.4.0-rc6
                            vanilla                vanilla                vanilla          thptweak-v1r1
Amean     5         3.43 (   0.00%)        3.44 (  -0.29%)        3.44 (  -0.19%)        3.39 (   1.07%)
Amean     14       11.18 (   0.00%)       14.59 ( -30.53%)       14.85 ( -32.80%)       10.30 (   7.87%)
Amean     23       20.89 (   0.00%)       18.45 (  11.70%)       24.92 ( -19.29%)       24.88 ( -19.12%)
Amean     32       51.69 (   0.00%)       30.07 (  41.82%)       30.93 (  40.16%)       47.97 (   7.20%)
Amean     41       51.44 (   0.00%)       29.99 *  41.71%*       60.75 ( -18.08%)       77.44 * -50.54%*
Amean     50       81.85 (   0.00%)       60.37 (  26.25%)       98.09 ( -19.84%)      125.59 * -53.43%*

Units are in seconds to run memhog. "Amean 5" is for a 5G virtual
machine. The target machine is 2-socket with 64G of RAM, so at 32 you'd
expect the virtual machine to be larger than a NUMA node. The test is set
to cut off at a virtual machine 80% the size of physical memory.

I used 5.2 as a baseline as 5.3 is where THP allocations changed behaviour
which was reverted later and then tweaked.

The results show that 5.4.0-rc6 in general is slower to fault all the
memory of the virtual machine even before you'd expect the machine to be
larger than a NUMA node. The patch works better for smaller machines and
worse for larger machines.

                                          5.2.0          5.3.0      5.4.0-rc6      5.4.0-rc6
                                        vanilla        vanilla        vanilla  thptweak-v1r1
Ops Swap Ins                         1442665.00     1400209.00     1289549.00     1762479.00
Ops Swap Outs                        2291255.00     1689554.00     2222256.00     2885280.00
Ops Kswapd efficiency %                   98.74          91.06          75.73          49.75
Ops Kswapd velocity                     7875.25        7973.52        8035.11        9129.89
Ops Direct efficiency %                   99.54          98.22          94.47          87.80
Ops Direct velocity                     5042.27        4982.09        7238.28        9251.50
Ops Percentage direct scans               39.03          38.46          47.39          50.33
Ops Page writes by reclaim           2291779.00     1689555.00     2222771.00     2885334.00
Ops Page writes file                     524.00           1.00         515.00          54.00
Ops Page writes anon                 2291255.00     1689554.00     2222256.00     2885280.00
Ops Page reclaim immediate              2548.00          76.00         367.00         710.00
Ops Sector Reads                   320255172.00   309217632.00   264732588.00   269864268.00
Ops Sector Writes                   63264744.00    60776604.00    62860244.00    65572608.00
Ops Page rescued immediate                 0.00           0.00           0.00           0.00
Ops Slabs scanned                     595876.00      246334.00     1018425.00     1390506.00
Ops Direct inode steals                   49.00           5.00           0.00           8.00
Ops Kswapd inode steals             24118244.00    28126116.00    12866766.00    12943742.00
Ops Kswapd skipped wait                    0.00           0.00           0.00           0.00
Ops THP fault alloc                   164266.00      204790.00      188055.00      190899.00
Ops THP fault fallback                 49345.00        8614.00       25454.00       22650.00
Ops THP collapse alloc                   132.00         139.00         116.00         121.00
Ops THP collapse fail                      4.00           0.00           1.00           2.00
Ops THP split                          15789.00        5642.00       18119.00       49169.00
Ops THP split failed                       0.00           0.00           0.00           0.00
Ops Compaction stalls                 139794.00       52054.00      214361.00      226514.00
Ops Compaction success                 19004.00       17786.00       39430.00       35922.00
Ops Compaction failures               120790.00       34268.00      174931.00      190592.00
Ops Compaction efficiency                 13.59          34.17          18.39          15.86
Ops Page migrate success            11427696.00    12758554.00    22010443.00    21668849.00
Ops Page migrate failure            11559690.00    10403623.00    13514889.00    13212313.00
Ops Compaction pages isolated       23217363.00    20560760.00    46945743.00    47574686.00
Ops Compaction migrate scanned      19925995.00    16224351.00    49565044.00    58595534.00
Ops Compaction free scanned         68708150.00    47235800.00   134685089.00   153518780.00
Ops Compact scan efficiency               29.00          34.35          36.80          38.17
Ops Compaction cost                    12604.40       13893.25       24274.97       23991.33
Ops Kcompactd wake                       100.00         172.00         100.00         258.00
Ops Kcompactd migrate scanned          33135.00       55797.00      113344.00      948026.00
Ops Kcompactd free scanned            335005.00      310265.00      174353.00      686434.00
Ops NUMA alloc hit                  98173280.00    77063265.00    82393545.00    70699591.00
Ops NUMA alloc miss                 22834593.00    20306614.00    12296731.00    24085033.00
Ops NUMA interleave hit                    0.00           0.00           0.00           0.00
Ops NUMA alloc local                98163641.00    77059289.00    82384101.00    70697814.00
Ops NUMA base-page range updates    53170070.00    41093095.00    62202747.00    71769257.00
Ops NUMA PTE updates                15041942.00     2726887.00    12972411.00    21418665.00
Ops NUMA PMD updates                   74469.00       74934.00       96153.00       98341.00
Ops NUMA hint faults                11492208.00     2337770.00     9146157.00    16587622.00
Ops NUMA hint local faults %         6465386.00     1080902.00     7087393.00    12336942.00
Ops NUMA hint local percent               56.26          46.24          77.49          74.37
Ops NUMA pages migrated               560888.00     3004903.00      336032.00      195839.00
Ops AutoNUMA cost                      57843.89       12033.59       46172.59       83444.22

There is a lot here. Both 5.4.0-rc6 and the tweak had more memory placed
locally (NUMA hint local percent). *All* kernels swapped pages in and out,
with 5.4.0-rc6 being as bad as 5.2.0-vanilla and the patch making this
slightly but not considerably worse.

5.3.0 was generally more successful at allocating huge pages (THP fault
fallback) with 5.4.0-rc6 being much worse at allocating huge pages and
the tweak not making much of a difference.

So the patch is a mix of good and bad in this case. However, the test
case has notable limitations and more information would be needed from
Andrea on exactly how he was evaluating KVM start times.

1. memhog is single threaded, which means that one node is likely to be
   filled first, with spillover to the other node while the first node
   reclaims (see the sketch after this list). This means that swapping is
   somewhat inevitable on NUMA machines with this load. We could run
   multiple instances but that would have very variable results.

2. Reading a single large file forces reclaim but it definitely is not
   fragmenting memory. However, we never all agreed on how a machine
   should be fragmented to be useful for this sort of test. Mentioning
   hypothetical kernel modules that don't exist is not good enough.
   Creating massive amounts of small files would fragment memory to some
   extent but not aggressively. Doing that while memhog is running
   would help but the results would be very variable.
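
To illustrate point 1, the toucher boils down to something like this (a
minimal sketch for illustration only, not the actual memhog source):

#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	size_t size = (argc > 1) ? strtoull(argv[1], NULL, 0) : (1UL << 30);
	size_t page = sysconf(_SC_PAGESIZE);
	char *buf = malloc(size);

	if (!buf)
		return 1;
	/*
	 * A single thread faulting pages in sequentially fills the
	 * preferred node first, so reclaim starts there before anything
	 * spills over to the other node.
	 */
	for (size_t off = 0; off < size; off += page)
		buf[off] = 1;
	return 0;
}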

This particular test is hard to reproduce even though it's in mmtests as
configs/config-workload-kvmstart-memhog-frag-singlefile because the test
relies on KVM helper scripts to deploy, start and stop the virtual
machine. These scripts exist outside of mmtests and belong to a set of
tools I use to schedule, execute and report on tests across a range of
machines.

I also ran a THP faulting benchmark with fio running in the background
in a configuration that tends to fragment memory (mmtests config
workload-thpfioscale-madvhugepage) with ext4 as a backing filesystem for
fio.


                                        5.2.0                  5.3.0              5.4.0-rc6              5.4.0-rc6
                                      vanilla                vanilla                vanilla          thptweak-v1r1
Min       fault-base-5      958.00 (   0.00%)     1980.00 (-106.68%)     1083.00 ( -13.05%)     1406.00 ( -46.76%)
Min       fault-huge-5      297.00 (   0.00%)      309.00 (  -4.04%)      431.00 ( -45.12%)      324.00 (  -9.09%)
Min       fault-both-5      297.00 (   0.00%)      309.00 (  -4.04%)      431.00 ( -45.12%)      324.00 (  -9.09%)
Amean     fault-base-5     2517.81 (   0.00%)     8886.23 *-252.94%*     4851.98 * -92.71%*     8223.64 *-226.62%*
Amean     fault-huge-5     2781.67 (   0.00%)    10397.46 *-273.78%*     3596.91 * -29.31%*     7139.44 *-156.66%*
Amean     fault-both-5     2662.24 (   0.00%)     9916.03 *-272.47%*     3990.46 * -49.89%*     7367.48 *-176.74%*
Stddev    fault-base-5     2713.85 (   0.00%)    24331.73 (-796.58%)     5530.37 (-103.78%)    27048.99 (-896.70%)
Stddev    fault-huge-5     3740.46 (   0.00%)    39529.80 (-956.82%)     5428.68 ( -45.13%)    23418.76 (-526.09%)
Stddev    fault-both-5     3317.75 (   0.00%)    35408.44 (-967.24%)     5491.27 ( -65.51%)    24229.15 (-630.29%)
CoeffVar  fault-base-5      107.79 (   0.00%)      273.81 (-154.03%)      113.98 (  -5.75%)      328.92 (-205.16%)
CoeffVar  fault-huge-5      134.47 (   0.00%)      380.19 (-182.73%)      150.93 ( -12.24%)      328.02 (-143.94%)
CoeffVar  fault-both-5      124.62 (   0.00%)      357.08 (-186.53%)      137.61 ( -10.42%)      328.87 (-163.89%)
Max       fault-base-5    88873.00 (   0.00%)   386539.00 (-334.93%)   115638.00 ( -30.12%)   486930.00 (-447.89%)
Max       fault-huge-5    63735.00 (   0.00%)   602544.00 (-845.39%)   139082.00 (-118.22%)   426777.00 (-569.61%)
Max       fault-both-5    88873.00 (   0.00%)   602544.00 (-577.98%)   139082.00 ( -56.50%)   486930.00 (-447.89%)
BAmean-50 fault-base-5     1192.71 (   0.00%)     3101.71 (-160.06%)     2170.91 ( -82.02%)     2692.97 (-125.79%)
BAmean-50 fault-huge-5      756.99 (   0.00%)      972.96 ( -28.53%)     1112.99 ( -47.03%)     1168.70 ( -54.39%)
BAmean-50 fault-both-5      953.65 (   0.00%)     1455.39 ( -52.61%)     1319.56 ( -38.37%)     1375.04 ( -44.19%)
BAmean-95 fault-base-5     2109.87 (   0.00%)     4941.87 (-134.23%)     4056.10 ( -92.24%)     4672.90 (-121.48%)
BAmean-95 fault-huge-5     2158.64 (   0.00%)     3867.03 ( -79.14%)     2766.41 ( -28.16%)     3395.88 ( -57.32%)
BAmean-95 fault-both-5     2127.00 (   0.00%)     4183.64 ( -96.69%)     3169.07 ( -48.99%)     3666.57 ( -72.38%)
BAmean-99 fault-base-5     2349.85 (   0.00%)     6811.21 (-189.86%)     4512.17 ( -92.02%)     5738.18 (-144.19%)
BAmean-99 fault-huge-5     2538.90 (   0.00%)     7109.96 (-180.04%)     3225.96 ( -27.06%)     5194.97 (-104.61%)
BAmean-99 fault-both-5     2443.68 (   0.00%)     6952.77 (-184.52%)     3626.66 ( -48.41%)     5300.01 (-116.89%)

thpfioscale Percentage Faults Huge
                                   5.2.0                  5.3.0              5.4.0-rc6              5.4.0-rc6
                                 vanilla                vanilla                vanilla          thptweak-v1r1
Percentage huge-5       54.74 (   0.00%)       68.14 (  24.49%)       68.64 (  25.40%)       78.97 (  44.27%)

In this case, 5.4.0-rc6 has lower latency when allocating pages whether
THP is used or a fallback to base pages. The patch has latency somewhere
between 5.2 and 5.4-rc6 indicating that more effort is being made.
However, as expected the allocation success rate of the patch is higher
than all the others. This indicates the higher latency is due to the
additional work done to allocate the page.

Finally, I ran a benchmark that faults memory in such a way as to expect
high compaction activity.

usemem
                                  5.2.0                  5.3.0              5.4.0-rc6              5.4.0-rc6
                                vanilla                vanilla                vanilla          thptweak-v1r1
Amean     elsp-1       42.09 (   0.00%)       26.59 *  36.84%*       36.75 (  12.70%)       40.23 (   4.42%)
Amean     elsp-3       32.96 (   0.00%)        7.48 *  77.29%*        8.73 *  73.50%*       11.49 *  65.15%*
Amean     elsp-4        5.69 (   0.00%)        5.85 *  -2.81%*        5.68 (   0.18%)        5.68 (   0.18%)

                                          5.2.0          5.3.0      5.4.0-rc6      5.4.0-rc6
                                        vanilla        vanilla        vanilla  thptweak-v1r1
Ops Swap Ins                         2937652.00           0.00      695423.00      224792.00
Ops Swap Outs                        3647757.00           0.00      797035.00      262654.00

This is showing that 5.3.0 does not swap at all and is the fastest;
5.4.0-rc6 introduced swapping and the patch reduced it somewhat.

So, the patch behaves more or less as expected and I'm not seeing
clear evidence that the patch makes swapping more likely as feared by
David. However, the kvm testcase is very limited and needs more information
on exactly how Andrea was testing it to induce swap storms. I can easily
make guesses but chances are that I'll guess wrong.
David Rientjes Nov. 25, 2019, 12:10 a.m. UTC | #39
On Wed, 13 Nov 2019, Mel Gorman wrote:

> > > The whole point of the Vlastimil's patch is to have an optimistic local
> > > node allocation first and the full gfp context one in the fallback path.
> > > If our full gfp context doesn't really work well then we can revisit
> > > that of course but that should happen at alloc_hugepage_direct_gfpmask
> > > level.
> > 
> > Since the patch reverts the precaution put into the page allocator to not 
> > attempt reclaim if the allocation order is significantly large and the 
> > return value from compaction specifies it is unlikely to succed on its 
> > own, I believe Vlastimil's patch will cause the same regression that 
> > Andrea saw is the whole host is low on memory and/or significantly 
> > fragmented.  So the suggestion was that he test this change to make sure 
> > we aren't introducing a regression for his workload.
> 
> TLDR: I do not have evidence that Vlastimil's patch causes more swapping
> 	but more information is needed from Andrea on exactly how he's
> 	testing this. It's not clear to me what was originally tested
> 	and whether memory just had to be full or whether it had to be
> 	fragmented. If fragmented, then we have to agree on what an
> 	appropriate mechanism is for fragmenting memory. Hypothetical
> 	kernel modules that don't exist do not count.
> 
> I put together a testcase whereby a virtual machine is deployed, started
> and then time how long it takes to run memhog on 80% of the guests
> physical memory. I varied how large the virtual machine is and ran it on
> a 2-socket machine so that the smaller tests would be single node and
> the larger tests would span both nodes. Before each startup, a large
> file is read to fill the memory with pagecache.
> 

First, thanks very much for the follow-up and considerable amount of time 
testing and benchmarking this.

I, like you, do not have a reliable test case that will reproduce the 
issue that Andrea initially reported over a year ago.  I believe in the 
discussion that repeatedly referred to swap storms that, with the 
__GFP_THISNODE policy, we were not thrashing because the local node was 
low on memory due to page cache.  How memory is filled with page cache 
will naturally affect how it can be reclaimed when compaction fails 
locally; I don't know if that is an accurate representation of the initial 
problem.  I also don't recall details about the swapfile or exactly where 
we were contending while trying to fault local hugepages.

My concern, and it's only a concern at this point and not a regression 
report because we don't have a result from Andrea, is that the result of 
this patch is that the second allocation in alloc_pages_vma() enables the 
exact same allocation policy that Andrea reported was a problem earlier if 
__GFP_DIRECT_RECLAIM is set, which it will be as a result of qemu's use of 
MADV_HUGEPAGE.

That reclaim and compaction is now done over the entire system and not 
isolated only to the local node, so there are two plausible outcomes: (1) 
the remote node is not fragmented and we can easily fault a remote 
hugepage, or (2) we thrash and cause swap storms remotely as well as 
locally.

(1) is the outcome that Andrea is seeking based on the initial reverts: 
that much we know.  So my concern is that if the *system* is fragmented 
that we have now introduced a much more significant swap storm that will 
result in a much more serious regression.

So my question would be: if we know the previous behavior that allowed 
excessive swap and recalling into compaction was deemed harmful for the 
local node, why do we now believe it cannot be harmful if done for all 
system memory?  The patch subverts the precaution put into place in the 
page allocator to specifically not do this excessive reclaim and recall 
into compaction dance and I believe restores the previous bad behavior if 
remote memory is similarly fragmented.  (What prevents this??)

Andrea was able to test this on several kernel versions with a fragmented 
local node so I *assume* it would not be difficult to measure the extent 
to which this patch can become harmful if all memory is fragmented.  I'm 
hoping that we can quantify that potentially negative impact before 
opening users up to the possibility.  As you said, it behaves better on 
some systems and workloads and worse on others and we both agree more 
information is needed.

I think asking Andrea to test and quantify the change with a fragmented 
system would help us to make a more informed decision and not add a 
potential regression to 5.5 or whichever kernel this would be merged in.
Michal Hocko Nov. 25, 2019, 11:47 a.m. UTC | #40
On Sun 24-11-19 16:10:53, David Rientjes wrote:
[...]
> So my question would be: if we know the previous behavior that allowed 
> excessive swap and recalling into compaction was deemed harmful for the 
> local node, why do we now believe it cannot be harmful if done for all 
> system memory?

I have to say that I got lost in your explanation. I have already
pointed this out in a previous email you didn't reply to. But the main
difference to previous __GFP_THISNODE behavior is that it is used along
with __GFP_NORETRY and that reduces the overall effort of the reclaim
AFAIU. If that is not the case then please be _explicit_ why.

Having test results from Andrea would be really appreciated of course
but he seems to be too busy to do that (or maybe not interested
anymore). I do not see any real reason to hold this patch back based on
hand waving though. So either we have some good reasoning to argue
against the patch, or good testing results, or we should go ahead.
As things stand right now, the THP success rate went down after your last
changes for _very simple_ workloads. This needs addressing, which I hope
we do agree on.
David Rientjes Nov. 25, 2019, 8:38 p.m. UTC | #41
On Mon, 25 Nov 2019, Michal Hocko wrote:

> > So my question would be: if we know the previous behavior that allowed 
> > excessive swap and recalling into compaction was deemed harmful for the 
> > local node, why do we now believe it cannot be harmful if done for all 
> > system memory?
> 
> I have to say that I got lost in your explanation. I have already
> pointed this out in a previous email you didn't reply to. But the main
> difference to previous __GFP_THISNODE behavior is that it is used along
> with __GFP_NORETRY and that reduces the overall effort of the reclaim
> AFAIU. If that is not the case then please be _explicit_ why.
> 

I'm referring to the second allocation in alloc_pages_vma() after the 
patch:

 			/*
 			 * If hugepage allocations are configured to always
 			 * synchronous compact or the vma has been madvised
 			 * to prefer hugepage backing, retry allowing remote
-			 * memory as well.
+			 * memory with both reclaim and compact as well.
 			 */
 			if (!page && (gfp & __GFP_DIRECT_RECLAIM))
 				page = __alloc_pages_node(hpage_node,
- 						gfp | __GFP_NORETRY, order);
+							gfp, order);

So we now have neither __GFP_NORETRY nor __GFP_THISNODE, so this bypasses 
all the precautionary logic in the page allocator that avoids excessive 
swap: it is free to continue looping, swapping, and thrashing, trying to 
allocate hugepages if all memory is fragmented.

Qemu uses MADV_HUGEPAGE so this allocation *will* be attempted for 
Andrea's workload.  The swap storms were reported for the same allocation 
but with __GFP_THISNODE, so in the past they only occurred with local 
fragmentation and low-on-memory conditions on the local node.  This is now 
opened up for all nodes.
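
For context, the way a VMM like qemu opts guest RAM into THP amounts to
roughly this (an illustrative sketch, not qemu's actual code):

#include <stddef.h>
#include <sys/mman.h>

static void *alloc_guest_ram(size_t size)
{
	void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (mem != MAP_FAILED)
		/*
		 * With the usual defrag=madvise setting, this is what makes
		 * THP faults in the range carry __GFP_DIRECT_RECLAIM and
		 * therefore take the fallback attempt discussed above.
		 */
		madvise(mem, size, MADV_HUGEPAGE);
	return mem;
}

Every guest THP fault is therefore eligible for that fallback attempt.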

So the question is: what prevents the exact same issue from happening 
again for Andrea's usecase if all memory on the system is fragmented?  I'm 
assuming that if this were tested under such conditions that the swap 
storms would be much worse.
Vlastimil Babka Nov. 25, 2019, 9:34 p.m. UTC | #42
On 11/25/19 9:38 PM, David Rientjes wrote:
> On Mon, 25 Nov 2019, Michal Hocko wrote:
> 
>>> So my question would be: if we know the previous behavior that allowed 
>>> excessive swap and recalling into compaction was deemed harmful for the 
>>> local node, why do we now believe it cannot be harmful if done for all 
>>> system memory?
>>
>> I have to say that I got lost in your explanation. I have already
>> pointed this out in a previous email you didn't reply to. But the main
>> difference to previous __GFP_THISNODE behavior is that it is used along
>> with __GFP_NORETRY and that reduces the overall effort of the reclaim
>> AFAIU. If that is not the case then please be _explicit_ why.
>>
> 
> I'm referring to the second allocation in alloc_pages_vma() after the 
> patch:
> 
>  			/*
>  			 * If hugepage allocations are configured to always
>  			 * synchronous compact or the vma has been madvised
>  			 * to prefer hugepage backing, retry allowing remote
> -			 * memory as well.
> +			 * memory with both reclaim and compact as well.
>  			 */
>  			if (!page && (gfp & __GFP_DIRECT_RECLAIM))
>  				page = __alloc_pages_node(hpage_node,
> - 						gfp | __GFP_NORETRY, order);
> +							gfp, order);
> 
> So we now do not have __GFP_NORETRY nor __GFP_THISNODE so this bypasses 
> all the precautionary logic in the page allocator that avoids excessive 
> swap: it is free to continue looping, swapping, and thrashing, trying to 
> allocate hugepages if all memory is fragmented.

Well, it's not completely free to do all that; there's other
reclaim/compaction logic that terminates the attempts. If it's
insufficient for some workloads, I would like to know, with hard data. I
think the past observations of thrashing with __GFP_THISNODE do not
prove the reclaim/compaction logic is wrong, see below.

> Qemu uses MADV_HUGEPAGE so this allocation *will* be attempted for 
> Andrea's workload.  The swap storms were reported for the same allocation 
> but with __GFP_THISNODE so it only occurred for local fragmentation and 
> low-on-memory conditions for the local node in the past.  This is now 
> opened up for all nodes.

I think it's also possible to explain the thrashing with __GFP_THISNODE
without the premise of local fragmentation and suboptimal
reclaim/compact decisions. THP __GFP_THISNODE allocations might simply
put too much pressure on the local node with a workload that might be
sized for the whole system. It's the same problem as with the node
reclaim mode. Even if there's no bad fragmentation and THP allocations
succeed without much trouble, they fill the node and cause reclaim (both
kswapd and direct). Fallback order-0 allocations are not restricted, so
they will use all nodes. The new THP retry without __GFP_THISNODE is
also not restricted, so it's very much possible it will not cause this
thrashing with a properly sized workload for the whole system.

> So the question is: what prevents the exact same issue from happening 
> again for Andrea's usecase if all memory on the system is fragmented?  I'm 
> assuming that if this were tested under such conditions that the swap 
> storms would be much worse.

We will only know if Andrea tests it. But if his workloads were fine
with just the initial __GFP_THISNODE reverts, they should still be fine
after this patch?