
[0/2] fix for "pathological THP behavior"

Message ID 20180820032204.9591-1-aarcange@redhat.com (mailing list archive)

Message

Andrea Arcangeli Aug. 20, 2018, 3:22 a.m. UTC
Hello,

we detected a regression compared to older kernels, only happening
with defrag=always or by using MADV_HUGEPAGE (and QEMU uses it).

I haven't bisected, but I suppose this started with commit
5265047ac30191ea24b16503165000c225f54feb combined with the previous
commits that introduced the logic to not invoke reclaim for THP
allocations in the remote nodes.

Once I looked into it, the problem was pretty obvious, and there are
two possible simple fixes. One is to not invoke reclaim and to stick
to compaction in the local node only (still the __GFP_THISNODE model).

This approach keeps the logic the same and prioritizes NUMA
locality over THP generation.

Then I'll send an alternative that drops the __GFP_THISNODE logic
if __GFP_DIRECT_RECLAIM is set. That however changes the behavior for
MADV_HUGEPAGE and prioritizes THP generation over NUMA locality.

A possible incremental improvement for this __GFP_COMPACT_ONLY
solution would be to remove __GFP_THISNODE (and in turn
__GFP_COMPACT_ONLY), after checking the watermarks, if there's no
free PAGE_SIZEd memory in the local node. However, checking the
watermarks in mempolicy.c is not ideal, so it would be a messier
change, and it'd still need to use __GFP_COMPACT_ONLY as implemented
here for when there's still PAGE_SIZEd free memory in the local node.
That further improvement wouldn't be necessary if there's agreement
to prioritize THP generation over NUMA locality (the alternative
solution I'll send in a separate post).

Andrea Arcangeli (2):
  mm: thp: consolidate policy_nodemask call
  mm: thp: fix transparent_hugepage/defrag = madvise || always

 include/linux/gfp.h | 18 ++++++++++++++++++
 mm/mempolicy.c      | 16 +++++++++++++---
 mm/page_alloc.c     |  4 ++++
 3 files changed, 35 insertions(+), 3 deletions(-)

Comments

Kirill A. Shutemov Aug. 20, 2018, 11:58 a.m. UTC | #1
On Sun, Aug 19, 2018 at 11:22:02PM -0400, Andrea Arcangeli wrote:
> Hello,
> 
> we detected a regression compared to older kernels, only happening
> with defrag=always or by using MADV_HUGEPAGE (and QEMU uses it).
> 
> I haven't bisected but I suppose this started since commit
> 5265047ac30191ea24b16503165000c225f54feb combined with previous
> commits that introduced the logic to not try to invoke reclaim for THP
> allocations in the remote nodes.
> 
> Once I looked into it the problem was pretty obvious and there are two
> possible simple fixes, one is not to invoke reclaim and stick to
> compaction in the local node only (still __GFP_THISNODE model).
> 
> This approach keeps the logic the same and prioritizes for NUMA
> locality over THP generation.
> 
> Then I'll send an alternative that drops the __GFP_THISNODE logic
> if __GFP_DIRECT_RECLAIM is set. That however changes the behavior for
> MADV_HUGEPAGE and prioritizes THP generation over NUMA locality.
> 
> A possible incremental improvement for this __GFP_COMPACT_ONLY
> solution would be to remove __GFP_THISNODE (and in turn
> __GFP_COMPACT_ONLY) after checking the watermarks if there's no free
> PAGE_SIZEd memory in the local node. However checking the watermarks
> in mempolicy.c is not ideal so it would be a more messy change and
> it'd still need to use __GFP_COMPACT_ONLY as implemented here for when
> there's still PAGE_SIZEd free memory in the local node. That further
> improvement wouldn't be necessary if there's agreement to prioritize
> THP generation over NUMA locality (the alternative solution I'll send
> in a separate post).

I personally prefer to prioritize NUMA locality over THP
(__GFP_COMPACT_ONLY variant), but I don't know page-alloc/compaction
well enough to Ack it.
Andrea Arcangeli Aug. 20, 2018, 3:19 p.m. UTC | #2
Hi Kirill,

On Mon, Aug 20, 2018 at 02:58:18PM +0300, Kirill A. Shutemov wrote:
> I personally prefer to prioritize NUMA locality over THP
> (__GFP_COMPACT_ONLY variant), but I don't know page-alloc/compaction good
> enough to Ack it.

If we go in this direction, it'd be nice, after fixing the
showstopper bug, to proceed with an orthogonal optimization: check
the watermarks, and if they show there are no PAGE_SIZEd pages
available in the local node, remove both __GFP_THISNODE and
__GFP_COMPACT_ONLY.

If, on the other hand, there's still PAGE_SIZEd free memory in the
local node (which is not possible to compact for whatever reason), we
should stick to __GFP_THISNODE | __GFP_COMPACT_ONLY.

It's orthogonal because the above addition would make sense also in
the current (buggy) code.

The main implementation issue is that watermark checking doesn't
belong in mempolicy.c, but that's currently the place where
__GFP_THISNODE and __GFP_COMPACT_ONLY would be cleared.

The case where the local node gets completely full, with not even 4k
pages available, should be totally common: if you keep allocating and
you allocate more than the size of a NUMA node, you eventually fill
the local node with THP, then consume all its 4k pages, and then you
get into the case where the current code is totally unable to
allocate THP from the other nodes. That would be totally possible to
fix by removing __GFP_THISNODE | __GFP_COMPACT_ONLY after the
PAGE_SIZE watermark check.

I'm mentioning this optimization in this context, even if it's
orthogonal, because the alternative patch that prioritizes THP over
NUMA locality for MADV_HUGEPAGE and defrag=always would solve that
too with a one-liner, and there would be no need of watermark
checking and flipping gfp bits whatsoever. Once the local node is
full, THPs keep being provided as expected.

Thanks,
Andrea
Yang Shi Aug. 20, 2018, 7:06 p.m. UTC | #3
On Mon, Aug 20, 2018 at 4:58 AM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
> On Sun, Aug 19, 2018 at 11:22:02PM -0400, Andrea Arcangeli wrote:
>> Hello,
>>
>> we detected a regression compared to older kernels, only happening
>> with defrag=always or by using MADV_HUGEPAGE (and QEMU uses it).
>>
>> I haven't bisected but I suppose this started since commit
>> 5265047ac30191ea24b16503165000c225f54feb combined with previous
>> commits that introduced the logic to not try to invoke reclaim for THP
>> allocations in the remote nodes.
>>
>> Once I looked into it the problem was pretty obvious and there are two
>> possible simple fixes, one is not to invoke reclaim and stick to
>> compaction in the local node only (still __GFP_THISNODE model).
>>
>> This approach keeps the logic the same and prioritizes for NUMA
>> locality over THP generation.
>>
>> Then I'll send an alternative that drops the __GFP_THISNODE logic
>> if __GFP_DIRECT_RECLAIM is set. That however changes the behavior for
>> MADV_HUGEPAGE and prioritizes THP generation over NUMA locality.
>>
>> A possible incremental improvement for this __GFP_COMPACT_ONLY
>> solution would be to remove __GFP_THISNODE (and in turn
>> __GFP_COMPACT_ONLY) after checking the watermarks if there's no free
>> PAGE_SIZEd memory in the local node. However checking the watermarks
>> in mempolicy.c is not ideal so it would be a more messy change and
>> it'd still need to use __GFP_COMPACT_ONLY as implemented here for when
>> there's still PAGE_SIZEd free memory in the local node. That further
>> improvement wouldn't be necessary if there's agreement to prioritize
>> THP generation over NUMA locality (the alternative solution I'll send
>> in a separate post).
>
> I personally prefer to prioritize NUMA locality over THP
> (__GFP_COMPACT_ONLY variant), but I don't know page-alloc/compaction good
> enough to Ack it.


Could approach #1 break the zone_reclaim_mode setting? Or might it
behave as though zone_reclaim_mode is set even when the knob is
cleared?

Thanks,
Yang

>
> --
>  Kirill A. Shutemov
>
Andrea Arcangeli Aug. 20, 2018, 11:24 p.m. UTC | #4
Hello,

On Mon, Aug 20, 2018 at 12:06:11PM -0700, Yang Shi wrote:
> May the approach #1 break the setting of zone_reclaim_mode? Or it may
> behave like zone_reclaim_mode is set even though the knob is cleared?

Yes, the current MADV_HUGEPAGE THP default behavior is similar to
zone/node_reclaim_mode, and approach #1 won't change that.

The problem is that it behaved like the hardest kind of
zone/node_reclaim_mode: it wouldn't even avoid unmap/writeback,
which zone/node_reclaim_mode can at least stop.

Approach #1 simply reduces the aggressiveness level from the hardest
kind of zone/node_reclaim_mode to something lighter than any reclaim
would be (i.e. no reclaim and only compaction, which of course only
makes sense for order > 0 allocations).

If the THP allocation fails, the PAGE_SIZE allocation fallback kicks
in: it'll spread to all nodes and invoke reclaim if needed. If it
invokes reclaim, it'll behave according to node_reclaim_mode if set.
There's no change to that part.

When MADV_HUGEPAGE wasn't used or defrag wasn't set to "always", the
current code didn't even invoke compaction, but the whole point of
MADV_HUGEPAGE is to try to provide THP from the very first page
fault, so it's ok to pay the cost of compaction there because
userland told us those are long-lived, performance-sensitive
allocations.

What MADV_HUGEPAGE can't do is trigger a heavy swapout of the memory
in the local node, while there may be plenty of free memory in all
the other nodes (even THP pages) and in the local node itself in
PAGE_SIZE fragments.

Thanks,
Andrea
Vlastimil Babka Aug. 21, 2018, 3:30 p.m. UTC | #5
On 8/20/18 5:19 PM, Andrea Arcangeli wrote:
> Hi Kirill,
> 
> On Mon, Aug 20, 2018 at 02:58:18PM +0300, Kirill A. Shutemov wrote:
>> I personally prefer to prioritize NUMA locality over THP
>> (__GFP_COMPACT_ONLY variant), but I don't know page-alloc/compaction good
>> enough to Ack it.
> 
> If we go in this direction it'd be nice after fixing the showstopper
> bug, if we could then proceed with an orthogonal optimization by
> checking the watermarks and if the watermarks shows there are no
> PAGE_SIZEd pages available in the local node we should remove both
> __GFP_THISNODE and __GFP_COMPACT_ONLY.
> 
> If as opposed there's still PAGE_SIZEd free memory in the local node
> (not possible to compact for whatever reason), we should stick to
> __GFP_THISNODE | __GFP_COMPACT_ONLY.

If it's "not possible to compact" then the expected outcome of this is
to fail?

> It's orthogonal because the above addition would make sense also in
> the current (buggy) code.
> 
> The main implementation issue is that the watermark checking is not
> well done in mempolicy.c but the place to clear __GFP_THISNODE and
> __GFP_COMPACT_ONLY currently is there.

You could do that without calling watermark checking explicitly, but
it's rather complicated:

1. try allocating with __GFP_THISNODE and ~__GFP_DIRECT_RECLAIM
2. if that fails, try PAGE_SIZE with same flags
3. if that fails, try THP size without __GFP_THISNODE
4. PAGE_SIZE without __GFP_THISNODE

Yeah, that's not possible in the current alloc_pages_vma(), which
should return the requested order. But the advantage is that it's not
prone to races between the watermark check and the actual allocation
attempt.

> The case that the local node gets completely full and has not even 4k
> pages available should be totally common, because if you keep
> allocating and you allocate more than the size of a NUMA node
> eventually you will fill the local node with THP then consume all 4k
> pages and then you get into the case where the current code is totally
> unable to allocate THP from the other nodes and it would be totally
> possible to fix with the removal of __GFP_THISNODE |
> __GFP_COMPACT_ONLY, after the PAGE_SIZE watermark check.
> 
> I'm mentioning this optimization in this context, even if it's
> orthogonal, because the alternative patch that prioritizes THP over
> NUMA locality for MADV_HUGEPAGE and defer=always would solve that too
> with a one liner and there would be no need of watermark checking and
> flipping gfp bits whatsoever. Once the local node is full, THPs keeps
> being provided as expected.

Frankly, I would rather go with this option and assume that if someone
explicitly wants THP's, they don't care about NUMA locality that much.
(Note: I hate __GFP_THISNODE, it's an endless source of issues.)
Trying to be clever about "is there still PAGE_SIZEd free memory in the
local node" is imperfect anyway. If there isn't, is it because there's
clean page cache that we can easily reclaim (so it would be worth
staying local) or is it really exhausted? Watermark check won't tell...

Vlastimil

> Thanks,
> Andrea
>
David Rientjes Aug. 21, 2018, 5:26 p.m. UTC | #6
On Tue, 21 Aug 2018, Vlastimil Babka wrote:

> Frankly, I would rather go with this option and assume that if someone
> explicitly wants THP's, they don't care about NUMA locality that much.
> (Note: I hate __GFP_THISNODE, it's an endless source of issues.)
> Trying to be clever about "is there still PAGE_SIZEd free memory in the
> local node" is imperfect anyway. If there isn't, is it because there's
> clean page cache that we can easily reclaim (so it would be worth
> staying local) or is it really exhausted? Watermark check won't tell...
> 

MADV_HUGEPAGE (or defrag == "always") would now become a combination of 
"try to compact locally" and "allocate remotely if necessary" without the 
ability to avoid the latter absent a mempolicy that affects all memory 
allocations.  I think the complete solution would be a MPOL_F_HUGEPAGE 
flag that defines mempolicies for hugepage allocations.  In my experience 
thp falling back to remote nodes for intra-socket latency is a win, but 
inter-socket or two-hop inter-socket latency is a no-go.
Andrea Arcangeli Aug. 21, 2018, 10:05 p.m. UTC | #7
On Tue, Aug 21, 2018 at 05:30:11PM +0200, Vlastimil Babka wrote:
> If it's "not possible to compact" then the expected outcome of this is
> to fail?

It'll just call __alloc_pages_direct_compact once and fail if that
fails.

> You could do that without calling watermark checking explicitly, but
> it's rather complicated:
> 
> 1. try alocating with __GFP_THISNODE and ~GFP_DIRECT_RECLAIM
> 2. if that fails, try PAGE_SIZE with same flags
> 3. if that fails, try THP size without __GFP_THISNODE
> 4. PAGE_SIZE without __GFP_THISNODE

It's not complicated, it's slow: why call into the allocator 4 times
just to skip one watermark check?

> Yeah, not possible in current alloc_pages_vma() which should return the
> requested order. But the advantage is that it's not prone to races
> between watermark checking and actual attempt.

Watermark checking is always racy; zone_watermark_fast doesn't take
any lock before invoking rmqueue.

The race in this case is the least of the issues because the check
doesn't need to be perfect: if once in a while a THP is allocated
from a remote node even though there was a bit more PAGE_SIZEd memory
free than expected in the local node, it wouldn't be an issue. If
it's wrong the other way around, it'll just behave non-optimally
(like today) once in a while.

> Frankly, I would rather go with this option and assume that if someone
> explicitly wants THP's, they don't care about NUMA locality that much.

I'm fine either way, both will work: large NUMA systems will prefer
the __GFP_COMPACT_ONLY option 1), small NUMA systems will likely
prefer option 2), and making this configurable at runtime with a
different default is possible later too, but then I'm not sure it's
worth it.

The main benefit of 1) really is to cause the least possible
interference to NUMA balancing (and the further optimization possible
with the watermark check would not cause interference either).

> (Note: I hate __GFP_THISNODE, it's an endless source of issues.)
> Trying to be clever about "is there still PAGE_SIZEd free memory in the
> local node" is imperfect anyway. If there isn't, is it because there's
> clean page cache that we can easily reclaim (so it would be worth
> staying local) or is it really exhausted? Watermark check won't tell...

I'm not sure it's worth reclaiming cache to stay local; it totally
depends on the app that's running, and whether reclaiming cache to
stay local pays off is even harder to tell. This is why, especially
on small NUMA systems, option 2), which removes __GFP_THISNODE, is
likely to perform best.

However, the tradeoff about whether to reclaim clean cache, combined
with whether to do the watermark check on the PAGE_SIZEd free memory,
is all about the MADV_HUGEPAGE case only.

The default case (not even calling compaction in the page fault) would
definitely benefit from the (imperfect racy) heuristic "is there still
PAGE_SIZEd free memory in the local node" and that remains true even
if we go with option 2) to solve the bug (option 2 only changes the
behavior of MADV_HUGEPAGE, not the default).

The imperfect watermark check is a net gain for the default case
without __GFP_DIRECT_RECLAIM set, because currently the code has zero
chance to get THP even if it's already free and available in other
nodes; it only tries to get them from the local node. At the cost of
only a fast watermark check, we'd get THP from all nodes whenever
already available (or made available by kswapd). It costs one
cacheline just for the pcp test, but then we're in the page allocator
anyway, so all cachelines involving watermarks may be activated
regardless.

Thanks,
Andrea
Andrea Arcangeli Aug. 21, 2018, 10:18 p.m. UTC | #8
On Tue, Aug 21, 2018 at 10:26:54AM -0700, David Rientjes wrote:
> MADV_HUGEPAGE (or defrag == "always") would now become a combination of 
> "try to compact locally" and "allocate remotely if necesary" without the 
> ability to avoid the latter absent a mempolicy that affects all memory 

I don't follow why compaction should run only on the local node in
that case (i.e. with __GFP_THISNODE removed when __GFP_DIRECT_RECLAIM
is set).

The zonelist will simply span all nodes so compaction & reclaim should
both run on all for MADV_HUGEPAGE with option 2).

The only mess in the allocator right now is that compaction runs per
zone while reclaim runs per node, but that's another issue and won't
hurt for this case.

> allocations.  I think the complete solution would be a MPOL_F_HUGEPAGE 
> flag that defines mempolicies for hugepage allocations.  In my experience 
> thp falling back to remote nodes for intrasocket latency is a win but 
> intersocket or two-hop intersocket latency is a no go.

Yes, that's my expectation too.

So what you suggest is to add a new hard binding that allows altering
the default behavior for THP; that sure sounds fine.

We still have to pick the actual default and decide whether a single
default is ok, whether it should be tunable, or even whether the
default should change depending on the NUMA topology.

I suspect it's a bit overkill to have different defaults depending on
the NUMA topology. There have been defaults for obscure things like
numa_zonelist_order that changed behavior depending on the number of
nodes, and they happened to hurt on some systems. I ended up tuning
them to the current default (until the runtime tuning was removed).

It's a bit hard to pick the best default based on arbitrary things
like the number of NUMA nodes or their distance, especially when what
is better also depends on the actual application.

I think both options are sane behaviors with some pros and cons:
option 2) is simpler and will likely perform better on smaller
systems, while option 1) is less risky on larger systems.

In any case, the watermark optimization of setting __GFP_THISNODE
only if there's plenty of PAGE_SIZEd memory in the local node remains
a valid optimization for later, for the default "defrag" value (i.e.
no MADV_HUGEPAGE) which doesn't set __GFP_DIRECT_RECLAIM. If there's
no RAM free in the local node, we can totally try to pick the THP
from the other nodes; not doing so only has the benefit of saving the
watermark check itself.

Thanks,
Andrea
Michal Hocko Aug. 22, 2018, 9:24 a.m. UTC | #9
On Tue 21-08-18 17:30:11, Vlastimil Babka wrote:
[...]
> Frankly, I would rather go with this option and assume that if someone
> explicitly wants THP's, they don't care about NUMA locality that much.

Yes. And if they do care enough to configure for heavier fault-in
treatment, then they can also handle NUMA policy themselves.

> (Note: I hate __GFP_THISNODE, it's an endless source of issues.)

Absolutely. It used to be a slab thing, but then it found an
interesting abuse elsewhere, like here for THP. I am pretty sure the
intention was not to stick to a specific node, but rather to all
local nodes within the reclaim distance (or some other unit defining
that nodes are sufficiently close).

> Trying to be clever about "is there still PAGE_SIZEd free memory in the
> local node" is imperfect anyway. If there isn't, is it because there's
> clean page cache that we can easily reclaim (so it would be worth
> staying local) or is it really exhausted? Watermark check won't tell...

Exactly.
Andrea Arcangeli Aug. 22, 2018, 3:56 p.m. UTC | #10
On Wed, Aug 22, 2018 at 11:24:40AM +0200, Michal Hocko wrote:
> abuse elsewhere. Like here for the THP. I am pretty sure that the
> intention was not to stick to a specific node but rather all local nodes
> within the reclaim distance (or other unit to define that nodes are
> sufficiently close).

If it meant more than one node but still not all of them, the same
problem would then happen after the app used all the RAM from those
"sufficiently close" nodes. So overall it would make zero difference;
it would just kick the bug down the road a bit.

Thanks,
Andrea