Message ID: 20180820032204.9591-1-aarcange@redhat.com (mailing list archive)
Series: fix for "pathological THP behavior"
On Sun, Aug 19, 2018 at 11:22:02PM -0400, Andrea Arcangeli wrote:
> Hello,
>
> we detected a regression compared to older kernels, only happening
> with defrag=always or by using MADV_HUGEPAGE (and QEMU uses it).
>
> I haven't bisected but I suppose this started since commit
> 5265047ac30191ea24b16503165000c225f54feb combined with previous
> commits that introduced the logic to not try to invoke reclaim for THP
> allocations in the remote nodes.
>
> Once I looked into it the problem was pretty obvious and there are two
> possible simple fixes, one is not to invoke reclaim and stick to
> compaction in the local node only (still __GFP_THISNODE model).
>
> This approach keeps the logic the same and prioritizes NUMA
> locality over THP generation.
>
> Then I'll send an alternative that drops the __GFP_THISNODE logic
> if __GFP_DIRECT_RECLAIM is set. That however changes the behavior for
> MADV_HUGEPAGE and prioritizes THP generation over NUMA locality.
>
> A possible incremental improvement for this __GFP_COMPACT_ONLY
> solution would be to remove __GFP_THISNODE (and in turn
> __GFP_COMPACT_ONLY) after checking the watermarks if there's no free
> PAGE_SIZEd memory in the local node. However checking the watermarks
> in mempolicy.c is not ideal so it would be a more messy change and
> it'd still need to use __GFP_COMPACT_ONLY as implemented here for when
> there's no PAGE_SIZEd free memory in the local node. That further
> improvement wouldn't be necessary if there's agreement to prioritize
> THP generation over NUMA locality (the alternative solution I'll send
> in a separate post).

I personally prefer to prioritize NUMA locality over THP
(__GFP_COMPACT_ONLY variant), but I don't know page-alloc/compaction well
enough to Ack it.
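As a minimal stand-in sketch of the two proposed fixes discussed above (not
kernel code: the flag bits and the GFP_TRANSHUGE definition are placeholders,
and __GFP_COMPACT_ONLY is the flag proposed by this series, not an existing
kernel flag):

/*
 * Illustrative stand-in model of the two proposed fixes, not kernel code.
 * Flag bit values are arbitrary placeholders.
 */
#include <stdio.h>

#define __GFP_THISNODE       (1u << 0)  /* only allocate from the local node */
#define __GFP_DIRECT_RECLAIM (1u << 1)  /* may enter direct reclaim */
#define __GFP_COMPACT_ONLY   (1u << 2)  /* proposed: compact but never reclaim */
#define GFP_TRANSHUGE        (__GFP_DIRECT_RECLAIM) /* simplified THP mask */

/* Option 1: keep __GFP_THISNODE, but only compact locally, never reclaim. */
static unsigned int thp_mask_option1(void)
{
	return GFP_TRANSHUGE | __GFP_THISNODE | __GFP_COMPACT_ONLY;
}

/* Option 2: once direct reclaim is allowed, stop insisting on the local node. */
static unsigned int thp_mask_option2(unsigned int gfp)
{
	if (gfp & __GFP_DIRECT_RECLAIM)
		gfp &= ~__GFP_THISNODE;
	return gfp;
}

int main(void)
{
	printf("option1: %#x\n", thp_mask_option1());
	printf("option2: %#x\n", thp_mask_option2(GFP_TRANSHUGE | __GFP_THISNODE));
	return 0;
}

Option 1 keeps NUMA locality as the priority (local compaction only), option 2
gives up locality as soon as the caller has opted into heavier fault-time work.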
Hi Kirill,

On Mon, Aug 20, 2018 at 02:58:18PM +0300, Kirill A. Shutemov wrote:
> I personally prefer to prioritize NUMA locality over THP
> (__GFP_COMPACT_ONLY variant), but I don't know page-alloc/compaction well
> enough to Ack it.

If we go in this direction, it'd be nice, after fixing the showstopper
bug, if we could then proceed with an orthogonal optimization: check the
watermarks, and if they show there are no PAGE_SIZEd pages available in
the local node, remove both __GFP_THISNODE and __GFP_COMPACT_ONLY.

If instead there's still PAGE_SIZEd free memory in the local node (not
possible to compact for whatever reason), we should stick to
__GFP_THISNODE | __GFP_COMPACT_ONLY.

It's orthogonal because the above addition would make sense also in
the current (buggy) code.

The main implementation issue is that the watermark checking is not
easily done in mempolicy.c, but that is currently the place to clear
__GFP_THISNODE and __GFP_COMPACT_ONLY.

The case where the local node gets completely full and has not even 4k
pages available should be totally common: if you keep allocating and you
allocate more than the size of a NUMA node, eventually you will fill the
local node with THP, then consume all 4k pages, and then you get into
the case where the current code is totally unable to allocate THP from
the other nodes. That would be totally possible to fix with the removal
of __GFP_THISNODE | __GFP_COMPACT_ONLY after the PAGE_SIZE watermark
check.

I'm mentioning this optimization in this context, even if it's
orthogonal, because the alternative patch that prioritizes THP over NUMA
locality for MADV_HUGEPAGE and defrag=always would solve that too with a
one liner, and there would be no need of watermark checking and flipping
gfp bits whatsoever. Once the local node is full, THPs keep being
provided as expected.

Thanks,
Andrea
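A rough sketch of the orthogonal optimization described above, with stand-in
flag values; node_has_free_base_pages() is hypothetical, in the kernel it
would be a lockless watermark check against the local node's zones:

/*
 * Stand-in sketch: clear __GFP_THISNODE and __GFP_COMPACT_ONLY only when the
 * local node has no free PAGE_SIZEd memory left. Not kernel code.
 */
#include <stdbool.h>
#include <stdio.h>

#define __GFP_THISNODE     (1u << 0)
#define __GFP_COMPACT_ONLY (1u << 2)

/* Hypothetical helper: would wrap a racy zone watermark check in the kernel. */
static bool node_has_free_base_pages(int nid)
{
	(void)nid;
	return false;	/* pretend the local node is exhausted */
}

static unsigned int adjust_thp_gfp(unsigned int gfp, int local_nid)
{
	if (!node_has_free_base_pages(local_nid))
		gfp &= ~(__GFP_THISNODE | __GFP_COMPACT_ONLY);
	return gfp;
}

int main(void)
{
	unsigned int gfp = __GFP_THISNODE | __GFP_COMPACT_ONLY;
	printf("adjusted gfp: %#x\n", adjust_thp_gfp(gfp, 0));
	return 0;
}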
On Mon, Aug 20, 2018 at 4:58 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> On Sun, Aug 19, 2018 at 11:22:02PM -0400, Andrea Arcangeli wrote:
>> Hello,
>>
>> we detected a regression compared to older kernels, only happening
>> with defrag=always or by using MADV_HUGEPAGE (and QEMU uses it).
>>
>> I haven't bisected but I suppose this started since commit
>> 5265047ac30191ea24b16503165000c225f54feb combined with previous
>> commits that introduced the logic to not try to invoke reclaim for THP
>> allocations in the remote nodes.
>>
>> Once I looked into it the problem was pretty obvious and there are two
>> possible simple fixes, one is not to invoke reclaim and stick to
>> compaction in the local node only (still __GFP_THISNODE model).
>>
>> This approach keeps the logic the same and prioritizes NUMA
>> locality over THP generation.
>>
>> Then I'll send an alternative that drops the __GFP_THISNODE logic
>> if __GFP_DIRECT_RECLAIM is set. That however changes the behavior for
>> MADV_HUGEPAGE and prioritizes THP generation over NUMA locality.
>>
>> A possible incremental improvement for this __GFP_COMPACT_ONLY
>> solution would be to remove __GFP_THISNODE (and in turn
>> __GFP_COMPACT_ONLY) after checking the watermarks if there's no free
>> PAGE_SIZEd memory in the local node. However checking the watermarks
>> in mempolicy.c is not ideal so it would be a more messy change and
>> it'd still need to use __GFP_COMPACT_ONLY as implemented here for when
>> there's no PAGE_SIZEd free memory in the local node. That further
>> improvement wouldn't be necessary if there's agreement to prioritize
>> THP generation over NUMA locality (the alternative solution I'll send
>> in a separate post).
>
> I personally prefer to prioritize NUMA locality over THP
> (__GFP_COMPACT_ONLY variant), but I don't know page-alloc/compaction well
> enough to Ack it.

Could approach #1 break the zone_reclaim_mode setting? Or might it behave
as if zone_reclaim_mode is set even though the knob is cleared?

Thanks,
Yang

>
> --
> Kirill A. Shutemov
>
Hello,

On Mon, Aug 20, 2018 at 12:06:11PM -0700, Yang Shi wrote:
> Could approach #1 break the zone_reclaim_mode setting? Or might it behave
> as if zone_reclaim_mode is set even though the knob is cleared?

The current MADV_HUGEPAGE THP default behavior is similar to
zone/node_reclaim_mode, yes, and approach #1 won't change that.

The problem is that it behaved like the hardest kind of
zone/node_reclaim_mode: it wouldn't even try to stop unmap/writeback,
while zone/node_reclaim_mode at least can stop that. Approach #1 simply
reduces the aggressiveness level from the hardest kind of
zone/node_reclaim_mode to something lighter than any reclaim would be
(i.e. no reclaim and only compaction, which of course only makes sense
for order > 0 allocations).

If THP fails, the PAGE_SIZE allocation fallback kicks in: it'll spread
to all nodes and it will invoke reclaim if needed. If it invokes
reclaim, it'll behave according to node_reclaim_mode if set. There's no
change to that part.

When MADV_HUGEPAGE wasn't used or defrag wasn't set to "always", the
current code didn't even invoke compaction, but the whole point of
MADV_HUGEPAGE is to try to provide THP from the very first page fault,
so it's ok to pay the cost of compaction there because userland told us
those are long lived performance sensitive allocations. What
MADV_HUGEPAGE can't do is trigger a heavy swapout of the memory in the
local node, even though there may be plenty of free memory in all other
nodes (even THP pages) and in the local node in PAGE_SIZE fragments.

Thanks,
Andrea
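A simplified stand-in model of how the THP fault-time gfp mask could be shaped
under approach #1 (this is not the actual kernel helper; names and flag bits
are placeholders, and it assumes __GFP_COMPACT_ONLY turns fault-time "reclaim"
into compaction only):

/*
 * Stand-in model: only MADV_HUGEPAGE or defrag=always pay the fault-time
 * compaction cost, and even then the local node is never reclaimed.
 */
#include <stdbool.h>
#include <stdio.h>

#define __GFP_THISNODE       (1u << 0)
#define __GFP_DIRECT_RECLAIM (1u << 1)
#define __GFP_COMPACT_ONLY   (1u << 2)
#define GFP_TRANSHUGE_LIGHT  0u		/* no compaction, no reclaim */

static unsigned int thp_fault_gfp(bool madv_hugepage, bool defrag_always)
{
	unsigned int gfp = GFP_TRANSHUGE_LIGHT | __GFP_THISNODE;

	/* Userland asked for THP at first fault: compact locally, don't reclaim. */
	if (madv_hugepage || defrag_always)
		gfp |= __GFP_DIRECT_RECLAIM | __GFP_COMPACT_ONLY;

	return gfp;
}

int main(void)
{
	printf("default:       %#x\n", thp_fault_gfp(false, false));
	printf("MADV_HUGEPAGE: %#x\n", thp_fault_gfp(true, false));
	return 0;
}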
On 8/20/18 5:19 PM, Andrea Arcangeli wrote:
> Hi Kirill,
>
> On Mon, Aug 20, 2018 at 02:58:18PM +0300, Kirill A. Shutemov wrote:
>> I personally prefer to prioritize NUMA locality over THP
>> (__GFP_COMPACT_ONLY variant), but I don't know page-alloc/compaction well
>> enough to Ack it.
>
> If we go in this direction, it'd be nice, after fixing the showstopper
> bug, if we could then proceed with an orthogonal optimization: check the
> watermarks, and if they show there are no PAGE_SIZEd pages available in
> the local node, remove both __GFP_THISNODE and __GFP_COMPACT_ONLY.
>
> If instead there's still PAGE_SIZEd free memory in the local node (not
> possible to compact for whatever reason), we should stick to
> __GFP_THISNODE | __GFP_COMPACT_ONLY.

If it's "not possible to compact", then isn't the expected outcome of
this to fail?

> It's orthogonal because the above addition would make sense also in
> the current (buggy) code.
>
> The main implementation issue is that the watermark checking is not
> easily done in mempolicy.c, but that is currently the place to clear
> __GFP_THISNODE and __GFP_COMPACT_ONLY.

You could do that without calling watermark checking explicitly, but
it's rather complicated:

1. try allocating THP size with __GFP_THISNODE and ~__GFP_DIRECT_RECLAIM
2. if that fails, try PAGE_SIZE with the same flags
3. if that fails, try THP size without __GFP_THISNODE
4. if that fails, PAGE_SIZE without __GFP_THISNODE

Yeah, it's not possible in the current alloc_pages_vma(), which has to
return the requested order. But the advantage is that it's not prone to
races between watermark checking and the actual allocation attempt.

> The case where the local node gets completely full and has not even 4k
> pages available should be totally common: if you keep allocating and you
> allocate more than the size of a NUMA node, eventually you will fill the
> local node with THP, then consume all 4k pages, and then you get into
> the case where the current code is totally unable to allocate THP from
> the other nodes. That would be totally possible to fix with the removal
> of __GFP_THISNODE | __GFP_COMPACT_ONLY after the PAGE_SIZE watermark
> check.
>
> I'm mentioning this optimization in this context, even if it's
> orthogonal, because the alternative patch that prioritizes THP over NUMA
> locality for MADV_HUGEPAGE and defrag=always would solve that too with a
> one liner, and there would be no need of watermark checking and flipping
> gfp bits whatsoever. Once the local node is full, THPs keep being
> provided as expected.

Frankly, I would rather go with this option and assume that if someone
explicitly wants THPs, they don't care about NUMA locality that much.
(Note: I hate __GFP_THISNODE, it's an endless source of issues.)
Trying to be clever about "is there still PAGE_SIZEd free memory in the
local node" is imperfect anyway. If there isn't, is it because there's
clean page cache that we can easily reclaim (so it would be worth
staying local) or is it really exhausted? Watermark check won't tell...

Vlastimil

> Thanks,
> Andrea
>
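As a rough illustration of the four-step fallback described above (a stand-in
model with a fake allocator; the real code would live around alloc_pages_vma(),
which today must return a page of the order the caller asked for):

/*
 * Stand-in model of the four-step fallback. alloc_attempt() is a fake
 * allocator used only to make the control flow concrete.
 */
#include <stdbool.h>
#include <stdio.h>

#define HPAGE_PMD_ORDER 9
#define __GFP_THISNODE       (1u << 0)
#define __GFP_DIRECT_RECLAIM (1u << 1)

/* Pretend allocator: succeeds only for order-0 off-node, for demonstration. */
static bool alloc_attempt(unsigned int order, unsigned int gfp)
{
	return order == 0 && !(gfp & __GFP_THISNODE);
}

static int fault_alloc(unsigned int gfp, unsigned int *got_order)
{
	unsigned int local = (gfp | __GFP_THISNODE) & ~__GFP_DIRECT_RECLAIM;

	if (alloc_attempt(HPAGE_PMD_ORDER, local)) {	/* 1. THP, local, no reclaim */
		*got_order = HPAGE_PMD_ORDER;
		return 0;
	}
	if (alloc_attempt(0, local)) {			/* 2. base page, same flags */
		*got_order = 0;
		return 0;
	}
	if (alloc_attempt(HPAGE_PMD_ORDER, gfp & ~__GFP_THISNODE)) {	/* 3. THP, any node */
		*got_order = HPAGE_PMD_ORDER;
		return 0;
	}
	if (alloc_attempt(0, gfp & ~__GFP_THISNODE)) {	/* 4. base page, any node */
		*got_order = 0;
		return 0;
	}
	return -1;
}

int main(void)
{
	unsigned int order;

	if (!fault_alloc(__GFP_DIRECT_RECLAIM, &order))
		printf("allocated order %u\n", order);
	return 0;
}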
On Tue, 21 Aug 2018, Vlastimil Babka wrote:

> Frankly, I would rather go with this option and assume that if someone
> explicitly wants THPs, they don't care about NUMA locality that much.
> (Note: I hate __GFP_THISNODE, it's an endless source of issues.)
> Trying to be clever about "is there still PAGE_SIZEd free memory in the
> local node" is imperfect anyway. If there isn't, is it because there's
> clean page cache that we can easily reclaim (so it would be worth
> staying local) or is it really exhausted? Watermark check won't tell...

MADV_HUGEPAGE (or defrag == "always") would now become a combination of
"try to compact locally" and "allocate remotely if necessary", without
the ability to avoid the latter absent a mempolicy that affects all
memory allocations. I think the complete solution would be an
MPOL_F_HUGEPAGE flag that defines mempolicies for hugepage allocations.
In my experience, THP falling back to remote nodes for intrasocket
latency is a win, but intersocket or two-hop intersocket latency is a
no go.
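A purely hypothetical illustration of the interface David suggests:
MPOL_F_HUGEPAGE does not exist in any kernel, and the flag value below is made
up; the sketch only reuses the real set_mempolicy(2) call from libnuma (link
with -lnuma) and is expected to fail with EINVAL on current kernels:

/* Hypothetical: bind only hugepage allocations to the local socket's nodes. */
#include <numaif.h>
#include <stdio.h>

#ifndef MPOL_F_HUGEPAGE
#define MPOL_F_HUGEPAGE (1 << 3)	/* made-up flag value, not a real API */
#endif

int main(void)
{
	/* Restrict only hugepage allocations to nodes 0 and 1 (same socket). */
	unsigned long nodemask = (1UL << 0) | (1UL << 1);

	if (set_mempolicy(MPOL_BIND | MPOL_F_HUGEPAGE, &nodemask,
			  sizeof(nodemask) * 8))
		perror("set_mempolicy");	/* expected to fail today */

	return 0;
}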
On Tue, Aug 21, 2018 at 05:30:11PM +0200, Vlastimil Babka wrote:
> If it's "not possible to compact", then isn't the expected outcome of
> this to fail?

It'll just call __alloc_pages_direct_compact once and fail if that
fails.

> You could do that without calling watermark checking explicitly, but
> it's rather complicated:
>
> 1. try allocating THP size with __GFP_THISNODE and ~__GFP_DIRECT_RECLAIM
> 2. if that fails, try PAGE_SIZE with the same flags
> 3. if that fails, try THP size without __GFP_THISNODE
> 4. if that fails, PAGE_SIZE without __GFP_THISNODE

It's not complicated, it's slow: why call into the allocator 4 times
just to skip 1 watermark check?

> Yeah, it's not possible in the current alloc_pages_vma(), which has to
> return the requested order. But the advantage is that it's not prone to
> races between watermark checking and the actual allocation attempt.

Watermark checking is always racy, zone_watermark_fast doesn't take any
lock before invoking rmqueue. The raciness in this case is the least of
the issues because it doesn't need to be perfect: if once in a while a
THP is allocated from a remote node despite there being a bit more
PAGE_SIZEd memory free than expected in the local node, it wouldn't be
an issue. If it's wrong the other way around, it'll just behave
non-optimally (like today) once in a while.

> Frankly, I would rather go with this option and assume that if someone
> explicitly wants THPs, they don't care about NUMA locality that much.

I'm fine either way, both will work: large NUMA will prefer the
__GFP_COMPACT_ONLY option 1), small NUMA will likely prefer option 2),
and making this configurable at runtime with a different default is
possible too later, but then I'm not sure it's worth it. The main
benefit of 1) really is to cause the least possible interference to NUMA
balancing (and the further optimization possible with the watermark
check would not cause interference either).

> (Note: I hate __GFP_THISNODE, it's an endless source of issues.)
> Trying to be clever about "is there still PAGE_SIZEd free memory in the
> local node" is imperfect anyway. If there isn't, is it because there's
> clean page cache that we can easily reclaim (so it would be worth
> staying local) or is it really exhausted? Watermark check won't tell...

I'm not sure if it's worth reclaiming cache to stay local; it totally
depends on the app running, and whether reclaiming cache to stay local
is worth it is even harder to tell. This is why, especially on small
NUMA, option 2) that removes __GFP_THISNODE is likely to perform best.

However the tradeoff about reclaiming clean cache or not, combined with
whether to do the watermark check on the PAGE_SIZEd free memory, is all
about the MADV_HUGEPAGE case only. The default case (not even calling
compaction in the page fault) would definitely benefit from the
(imperfect, racy) heuristic "is there still PAGE_SIZEd free memory in
the local node", and that remains true even if we go with option 2) to
solve the bug (option 2 only changes the behavior of MADV_HUGEPAGE, not
the default).

The imperfect watermark check is a net gain for the default case without
__GFP_DIRECT_RECLAIM set, because currently the code has zero chance to
get THP even if it's already free and available in other nodes; it only
tries to get it from the local node. It costs only a watermark fast
check to get THP from all nodes if already available (or if made
available from kswapd). It costs 1 cacheline just for the pcp test, but
then we're in the page allocator anyway so all cachelines involving
watermarks may be activated regardless.

Thanks,
Andrea
On Tue, Aug 21, 2018 at 10:26:54AM -0700, David Rientjes wrote:
> MADV_HUGEPAGE (or defrag == "always") would now become a combination of
> "try to compact locally" and "allocate remotely if necessary", without
> the ability to avoid the latter absent a mempolicy that affects all
> memory

I don't follow why compaction should run only on the local node in such
a case (i.e. __GFP_THISNODE removed when __GFP_DIRECT_RECLAIM is set).
The zonelist will simply span all nodes, so compaction & reclaim should
both run on all of them for MADV_HUGEPAGE with option 2). The only mess
in the allocator right now is that compaction runs per zone and reclaim
runs per node, but that's another issue and won't hurt for this case.

> allocations. I think the complete solution would be an MPOL_F_HUGEPAGE
> flag that defines mempolicies for hugepage allocations. In my
> experience, THP falling back to remote nodes for intrasocket latency is
> a win, but intersocket or two-hop intersocket latency is a no go.

Yes, that's my expectation too. So what you suggest is to add a new hard
binding that allows altering the default behavior for THP; that sure
sounds fine. We still have to pick the actual default and decide if a
single default is ok, or if it should be tunable, or even change the
default depending on the NUMA topology.

I suspect it's a bit overkill to have different defaults depending on
NUMA topology. There have been defaults for obscure things like
numa_zonelist_order that changed behavior depending on the number of
nodes, and they happened to hurt on some systems. I ended up tuning them
to the current default (until the runtime tuning was removed). It's a
bit hard to pick the best default just based on arbitrary things like
the number of NUMA nodes or distance, especially when what is better
also depends on the actual application.

I think both options are sane behaviors with some pros and cons: option
2) is simpler and will likely perform better on smaller systems, option
1) is less risky on larger systems.

In any case the watermark optimization, setting __GFP_THISNODE only if
there's plenty of PAGE_SIZEd memory in the local node, remains a valid
optimization for later for the default "defrag" value (i.e. no
MADV_HUGEPAGE) not setting __GFP_DIRECT_RECLAIM. If there's no RAM free
in the local node we can totally try to pick the THP from the other
nodes, and not doing so only has the benefit of saving the watermark
check itself.

Thanks,
Andrea
On Tue 21-08-18 17:30:11, Vlastimil Babka wrote:
[...]
> Frankly, I would rather go with this option and assume that if someone
> explicitly wants THPs, they don't care about NUMA locality that much.

Yes. And if they do care enough to configure for heavier fault-in
treatment, then they can also handle NUMA policy themselves.

> (Note: I hate __GFP_THISNODE, it's an endless source of issues.)

Absolutely. It used to be a slab thing but then it found an interesting
abuse elsewhere, like here for THP. I am pretty sure that the intention
was not to stick to a specific node but rather to all local nodes within
the reclaim distance (or some other unit to define that nodes are
sufficiently close).

> Trying to be clever about "is there still PAGE_SIZEd free memory in the
> local node" is imperfect anyway. If there isn't, is it because there's
> clean page cache that we can easily reclaim (so it would be worth
> staying local) or is it really exhausted? Watermark check won't tell...

Exactly.
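A toy sketch of the "sufficiently close" notion mentioned above, modeled on
the kernel's node_distance()/RECLAIM_DISTANCE idea; the distance table below
is made up and this is not kernel code:

/* Toy model: which nodes count as "within the reclaim distance" of node 0. */
#include <stdio.h>

#define MAX_NUMNODES     4
#define RECLAIM_DISTANCE 30

/* Fake SLIT-style distance table: nodes 0/1 on one socket, 2/3 on another. */
static const int distance[MAX_NUMNODES][MAX_NUMNODES] = {
	{ 10, 20, 40, 40 },
	{ 20, 10, 40, 40 },
	{ 40, 40, 10, 20 },
	{ 40, 40, 20, 10 },
};

int main(void)
{
	int local = 0, nid;

	for (nid = 0; nid < MAX_NUMNODES; nid++)
		if (distance[local][nid] <= RECLAIM_DISTANCE)
			printf("node %d is close enough to node %d\n", nid, local);

	return 0;
}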
On Wed, Aug 22, 2018 at 11:24:40AM +0200, Michal Hocko wrote:
> abuse elsewhere, like here for THP. I am pretty sure that the intention
> was not to stick to a specific node but rather to all local nodes within
> the reclaim distance (or some other unit to define that nodes are
> sufficiently close).

If it meant more than one node but still not all, the same problem would
then happen after the app used all the RAM from those "sufficiently
close" nodes. So overall it would make zero difference, because it would
just kick the bug down the road a bit.

Thanks,
Andrea